Barabba
Posts: 47
Joined: Wed Aug 03, 2016 3:49 pm

When SD get frozen, experts needed

Fri May 03, 2019 6:58 pm

Hi mates, thank you for reading this and suggesting solutions.
I've several questions about it, I'll try to make shorter as possible.

I've a RB3, it was running only Node Red with very low CPU consumption 2-5%, with 58-59° CPU temperature and SD Samsung 16GB. Few RAM usage, and log2RAM to limit the SD writes, Raspbian Node.js and NR recently updated.
Every about week I was suffering a system "hang", a partial one, kernel was working, I could ping the RB IP, I can see NR keep working and manage the I2C bus (LCD), but I can't more open SSH sessions, Telnet, anything that looks "TCP related". As I don't have monitor/mouse I couldn't debug it, and as I use log2RAM I couldn't read the latest logs after power off and on it.
But yesterday it happened a miracle, the SSH session opened before on my laptop kept working while partial "hang" was going on , so I could debug something, and I noticed that any command was more accessible, for example:
pi@raspberrypi:~ $ ls -l
-bash: ls: command not found

I cannot browse the SD conent, only the ram drive with logs, and without usr/bin folder all the unix commands were out of usage, only echo (as I know) was working, I tried to google around what to do in this situations but very few results.

Question 1) which other commands are available? Can be real that Unix doesn't have any command for debug available in the RAM? everything is HDD-related? No HDD - No commands-bye-bye? Wow.. at this point it would be lovely to reserve a small amount of ram as a drive for some bash storage, like su, sudo, mnt, chmod.
2) Is it possible that without /usr/bin any new SSH, or Telenet session, or HTTP (I've a small server) can be opened? Should I believe that all these problems were related only because the hang of SD card? I haven't sniffed the traffic, to check a TCP service responding at SYN and ACK, but it may be believable that this services (such SSH) pretend to load something from SD before opening a new session even if they are already running and active in ram, strange anyway this dependent of SD, I had am SSH session active but I couldn't start a new one!

I wanted to kick the hardware watchdog to produce a reboot by writing and echo . > /dev/watchdog, but I couldn't do it wouthout root, and I get the root (no su command available):
3) Any way to obtain root in this situation? (without SD/HDD)
4) Some days before I set with chmod 666 on /dev/watchdog, so I could kick the watchdog from users too, but after a reboot this setting was gone and I haven't adviced it. Should I consider not permanent the change of permissions into a device? Should I script that change on every boot?
5) I've tried also The magic way (echo 1 > /proc/sys/kernel/sysrq, echo b > /proc/sysrq-trigger) , but root is needed too. Any other way to reboot? Any way to setup the machine next time to reboot by the magic way also under user?

So the only solution I had, as always, was to power off and on again the RB, it worked again without problems, and without errors on logs. It seems the SD got frozen inside maybe to avoid unsafe written inside/data lost, maybe it's controller inside noticed a high level of temperature and it blocked it from read and write.
Here are some logs I was able to dump from the log2ram device in ram, before power off/on:
May 1 17:36:48 raspberrypi kernel: [100821.820646] mmc0: timeout waiting for hardware interrupt.
May 1 17:36:48 raspberrypi kernel: [100821.820662] [77b2f15f] FCM> a6bd9dd4 0
May 1 17:36:48 raspberrypi kernel: [100821.820669] [77b2f160] CMD d 0
May 1 17:36:48 raspberrypi kernel: [100821.820677] [77b2f160] REQ> a6bd9dd4 0
May 1 17:36:48 raspberrypi kernel: [100821.820685] [77b2f173] TSK< a6bd9dd4 0

.... many other lines like above

May 1 17:36:48 raspberrypi kernel: [100821.822556] mmc0: cmd op 25 arg 0x467640 flags 0xb5 - resp 00000900 00000000 00000000 00000000, err 0
May 1 17:36:48 raspberrypi kernel: [100821.822564] mmc0: data blocks 8 blksz 200 - err 0
May 1 17:36:48 raspberrypi kernel: [100821.822574] mmc0:>stop op 12 arg 0x0 flags 0x49d - resp 00000000 00000000 00000000 00000000, err 0
May 1 17:36:48 raspberrypi kernel: [100821.822579] mmc0: =========== REGISTER DUMP ===========
May 1 17:36:48 raspberrypi kernel: [100821.822584] mmc0: SDCMD 0x0000880c
May 1 17:36:48 raspberrypi kernel: [100821.822589] mmc0: SDARG 0x00000000
May 1 17:36:48 raspberrypi kernel: [100821.822595] mmc0: SDTOUT 0x00bebc20
May 1 17:36:48 raspberrypi kernel: [100821.822600] mmc0: SDCDIV 0x00000008
May 1 17:36:48 raspberrypi kernel: [100821.822605] mmc0: SDRSP0 0x00000c00
May 1 17:36:48 raspberrypi kernel: [100821.822610] mmc0: SDRSP1 0x00000c0e
May 1 17:36:48 raspberrypi kernel: [100821.822616] mmc0: SDRSP2 0x7fffffff
May 1 17:36:48 raspberrypi kernel: [100821.822621] mmc0: SDRSP3 0x0002400c
May 1 17:36:48 raspberrypi kernel: [100821.822626] mmc0: SDHSTS 0x00000000
May 1 17:36:48 raspberrypi kernel: [100821.822631] mmc0: SDVDD 0x00000001
May 1 17:36:48 raspberrypi kernel: [100821.822636] mmc0: SDEDM 0x00010801
May 1 17:36:48 raspberrypi kernel: [100821.822641] mmc0: SDHCFG 0x0000040e
May 1 17:36:48 raspberrypi kernel: [100821.822646] mmc0: SDHBCT 0x00000006
May 1 17:36:48 raspberrypi kernel: [100821.822651] mmc0: SDHBLC 0x00000001
May 1 17:36:48 raspberrypi kernel: [100821.822656] mmc0: ===========================================
May 1 17:36:48 raspberrypi kernel: [100821.924884] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100821.925197] mmcblk0: error -84 sending status command, retrying
May 1 17:36:48 raspberrypi kernel: [100822.026250] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.026320] mmcblk0: error -84 sending status command, retrying
May 1 17:36:48 raspberrypi kernel: [100822.127388] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.128357] mmcblk0: error -84 sending status command, aborting
May 1 17:36:48 raspberrypi kernel: [100822.229535] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.331426] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.433022] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.534736] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.535239] Aborting journal on device mmcblk0p2-8.
May 1 17:36:48 raspberrypi kernel: [100822.535380] mmc0: card 0001 removed
May 1 17:36:48 raspberrypi kernel: [100822.535479] JBD2: Error -5 detected when updating journal superblock for mmcblk0p2-8.
May 1 17:36:49 raspberrypi kernel: [100822.700158] blk_partition_remap: fail for partition 2
May 1 17:36:49 raspberrypi kernel: [100822.700179] EXT4-fs warning (device mmcblk0p2): dx_probe:745: inode #12: lblock 0: comm systemd-udevd: error -5 reading directory block
May 1 17:36:49 raspberrypi kernel: [100822.700210] blk_partition_remap: fail for partition 2
May 1 17:36:49 raspberrypi kernel: [100822.700219] EXT4-fs error (device mmcblk0p2): ext4_find_entry:1438: inode #128514: comm systemd-udevd: reading directory lblock 0
May 1 17:36:49 raspberrypi kernel: [100822.700254] blk_partition_remap: fail for partition 2
May 1 17:36:49 raspberrypi kernel: [100822.700260] Buffer I/O error on dev mmcblk0p2, logical block 0, lost sync page write
May 1 17:36:49 raspberrypi kernel: [100822.700278] blk_partition_remap: fail for partition 2
May 1 17:36:49 raspberrypi kernel: [100822.700284] EXT4-fs warning (device mmcblk0p2): dx_probe:745: inode #12: lblock 0: comm systemd-udevd: error -5 reading directory block

On another log file:
May 1 17:36:49 raspberrypi systemd[1]: Unmounting /boot...
May 1 17:36:49 raspberrypi systemd[5080]: boot.mount: Failed at step EXEC spawning /bin/umount: Input/output error
May 1 17:36:49 raspberrypi systemd[1]: boot.mount: Mount process exited, code=exited status=203
May 1 17:36:49 raspberrypi systemd[1]: Failed unmounting /boot.
May 1 17:36:49 raspberrypi systemd[1]: boot.mount: Unit is bound to inactive unit dev-disk-by\x2dpartuuid-9bddb831\x2d01.device. Stopping, too.

I suppose this is the same situation that happened in past as I wrote, every 5 to 7 days, same sympthoms also this time, now just for the fisrt time I could debug and see the logs.
To prevent this "unknow" hang and extend the days of working, in the last days I tried reboot regulary the RB by sudo reboot, but didn't help, I think with sudo reboot the SD keep working like before, not advicing a RB reset, maybe accumulating the factors that will drive it to freeze after that time.
Chances are, only a power cycle off and on may turn the SD available again, for this reasons chances are that even a Watchdog reset could not create the circumstances to "revive" the SD, of course if it depends about SD! And not from some other HW device in the RB or process that ay start after those 5-7 days that hang the SD. Unfortunately I couldn't test the effect of a watchdog reset it because of lack of root, and this information anyway is crucial for me, now I changed SD (Sandisk ultra 32GB) and cloned it from the previous one, and I opened more holes to let air cool the CPU (now I have 52°), but in future I can suffer again, so I would ask you:

6) Is a watchdog reset produce a kind of reset on SD too? Maybe the power on SD slots interrupts for a while? Maybe there is provided a combinations of signals on the SD pins to produce an internal reset?
If not, I guess I can only realize an external hardware watchdog with Arduino, that will cut the power for a while if the RP will not kick it after a minute, for example.

I can try to cool more the RB somehow, or buy a high temperature SD, I've found 16GB Transcend TS16GUSDHC10V High Endurance, can you recomend it? Maybe better to use an external USB dongle?

Thank you a lot for your precious answers!

epoch1970
Posts: 5131
Joined: Thu May 05, 2016 9:33 am
Location: Paris, France

Re: When SD get frozen, experts needed

Fri May 03, 2019 7:36 pm

Spectacular.

The watchdog resets the whole platform, as you should have read from responses to your prior questions on that subject.

The SD card does not care for the CPU temperature, rather for ambient temperature. I would say ~50°C CPU with light load correlates to 30-40°C in the case, I think any SD will still be fine around 60°C.

I suggest you break the problem in smaller chunks and proceed step by step.
Start from a clean and standard system, install only a few packages, do not activate watchdog resets, do not log to RAM.
If the system crashes, read the log files to find the issue and fix it.
Run again, if all looks fine add one more feature and test again.

Once every feature is loaded and the system runs perfectly for a long time,
consider if you really need watchdog and "log2ram" or whatever.

A system rigged like that can no longer be improved, because there isn't any trace left to say what really happened. So it has to be perfect first.
"S'il n'y a pas de solution, c'est qu'il n'y a pas de problème." Les Shadoks, J. Rouxel

Barabba
Posts: 47
Joined: Wed Aug 03, 2016 3:49 pm

Re: When SD get frozen, experts needed

Mon May 06, 2019 1:12 pm

Thank you for your kind answer!

Well, I'm not sure that Watchdog will reset the SD too, can you confirm that? Is a watchdog hard reset take off power to SD? Or is the system send a reset command to SD? I'm pretty sure that with sudo reboot the SD isn't initialized, or at least will not "reset" it enough to assure more days of stability.

Yes SD is not connected with CPU temperature, but having a lastic case, and leaving the SD in the top-up position where the air inside is hotter (I can't change the position due the GPIO flat cable), it will heat it too.. so is somehow related to it. Now I opened some air holes and CPU is lower to 52, maybe better also for the *new* Sandisk ultra SD. 4 days alsmost apssed and the system is stable.

Unfortunately I can't do step by step check, especially because this checks takes at least 7 days, maybe more. The system is up and it's needed to stay up and working fully, it's not my home and I can't ask the client to "wait more 50 days because I've to do step by step", unfortunately. Here it's needed to guess basing on kind reports of users who had experiences of frozen and hangs, thank you a lot for all that will like to contribute here.
In my opinion the filesystem is ok, I?ve no log warnings, and as I wrote I was able to miracle catch the SD lock, and I think it explains everything, the questions I would love someone expert can answer me are:

1) when main filesystem is unavailable which comand in RAM I can execute? Only echo?
2) when main filesystem is unavailable is it expectable that new SSH sessions (or any other TCP session) can be upened, even if there is an active SSH already? (so I suppose the service is workng in ram, without need of HDD).
3) Any way to obtain root in this situation? (without SD/HDD)
4) writing a chmod to a device, will the permission be lost on next reboot?
5) Any other way to reboot from user mode? (by now I know magic way and kick the /dev/watchdog)
6) does watchdog hard reset, resets/reinitialize also the SD from a lock state?

In the first post I explained all better,
thank you a lot for kind answers!!

Return to “Troubleshooting”