When SD get frozen, experts needed
Posted: Fri May 03, 2019 6:58 pm
Hi mates, thank you for reading this and suggesting solutions.
I've several questions about it, I'll try to make shorter as possible.
I've a RB3, it was running only Node Red with very low CPU consumption 2-5%, with 58-59° CPU temperature and SD Samsung 16GB. Few RAM usage, and log2RAM to limit the SD writes, Raspbian Node.js and NR recently updated.
Every about week I was suffering a system "hang", a partial one, kernel was working, I could ping the RB IP, I can see NR keep working and manage the I2C bus (LCD), but I can't more open SSH sessions, Telnet, anything that looks "TCP related". As I don't have monitor/mouse I couldn't debug it, and as I use log2RAM I couldn't read the latest logs after power off and on it.
But yesterday it happened a miracle, the SSH session opened before on my laptop kept working while partial "hang" was going on , so I could debug something, and I noticed that any command was more accessible, for example:
pi@raspberrypi:~ $ ls -l
-bash: ls: command not found
I cannot browse the SD conent, only the ram drive with logs, and without usr/bin folder all the unix commands were out of usage, only echo (as I know) was working, I tried to google around what to do in this situations but very few results.
Question 1) which other commands are available? Can be real that Unix doesn't have any command for debug available in the RAM? everything is HDD-related? No HDD - No commands-bye-bye? Wow.. at this point it would be lovely to reserve a small amount of ram as a drive for some bash storage, like su, sudo, mnt, chmod.
2) Is it possible that without /usr/bin any new SSH, or Telenet session, or HTTP (I've a small server) can be opened? Should I believe that all these problems were related only because the hang of SD card? I haven't sniffed the traffic, to check a TCP service responding at SYN and ACK, but it may be believable that this services (such SSH) pretend to load something from SD before opening a new session even if they are already running and active in ram, strange anyway this dependent of SD, I had am SSH session active but I couldn't start a new one!
I wanted to kick the hardware watchdog to produce a reboot by writing and echo . > /dev/watchdog, but I couldn't do it wouthout root, and I get the root (no su command available):
3) Any way to obtain root in this situation? (without SD/HDD)
4) Some days before I set with chmod 666 on /dev/watchdog, so I could kick the watchdog from users too, but after a reboot this setting was gone and I haven't adviced it. Should I consider not permanent the change of permissions into a device? Should I script that change on every boot?
5) I've tried also The magic way (echo 1 > /proc/sys/kernel/sysrq, echo b > /proc/sysrq-trigger) , but root is needed too. Any other way to reboot? Any way to setup the machine next time to reboot by the magic way also under user?
So the only solution I had, as always, was to power off and on again the RB, it worked again without problems, and without errors on logs. It seems the SD got frozen inside maybe to avoid unsafe written inside/data lost, maybe it's controller inside noticed a high level of temperature and it blocked it from read and write.
Here are some logs I was able to dump from the log2ram device in ram, before power off/on:
May 1 17:36:48 raspberrypi kernel: [100821.820646] mmc0: timeout waiting for hardware interrupt.
May 1 17:36:48 raspberrypi kernel: [100821.820662] [77b2f15f] FCM> a6bd9dd4 0
May 1 17:36:48 raspberrypi kernel: [100821.820669] [77b2f160] CMD d 0
May 1 17:36:48 raspberrypi kernel: [100821.820677] [77b2f160] REQ> a6bd9dd4 0
May 1 17:36:48 raspberrypi kernel: [100821.820685] [77b2f173] TSK< a6bd9dd4 0
.... many other lines like above
May 1 17:36:48 raspberrypi kernel: [100821.822556] mmc0: cmd op 25 arg 0x467640 flags 0xb5 - resp 00000900 00000000 00000000 00000000, err 0
May 1 17:36:48 raspberrypi kernel: [100821.822564] mmc0: data blocks 8 blksz 200 - err 0
May 1 17:36:48 raspberrypi kernel: [100821.822574] mmc0:>stop op 12 arg 0x0 flags 0x49d - resp 00000000 00000000 00000000 00000000, err 0
May 1 17:36:48 raspberrypi kernel: [100821.822579] mmc0: =========== REGISTER DUMP ===========
May 1 17:36:48 raspberrypi kernel: [100821.822584] mmc0: SDCMD 0x0000880c
May 1 17:36:48 raspberrypi kernel: [100821.822589] mmc0: SDARG 0x00000000
May 1 17:36:48 raspberrypi kernel: [100821.822595] mmc0: SDTOUT 0x00bebc20
May 1 17:36:48 raspberrypi kernel: [100821.822600] mmc0: SDCDIV 0x00000008
May 1 17:36:48 raspberrypi kernel: [100821.822605] mmc0: SDRSP0 0x00000c00
May 1 17:36:48 raspberrypi kernel: [100821.822610] mmc0: SDRSP1 0x00000c0e
May 1 17:36:48 raspberrypi kernel: [100821.822616] mmc0: SDRSP2 0x7fffffff
May 1 17:36:48 raspberrypi kernel: [100821.822621] mmc0: SDRSP3 0x0002400c
May 1 17:36:48 raspberrypi kernel: [100821.822626] mmc0: SDHSTS 0x00000000
May 1 17:36:48 raspberrypi kernel: [100821.822631] mmc0: SDVDD 0x00000001
May 1 17:36:48 raspberrypi kernel: [100821.822636] mmc0: SDEDM 0x00010801
May 1 17:36:48 raspberrypi kernel: [100821.822641] mmc0: SDHCFG 0x0000040e
May 1 17:36:48 raspberrypi kernel: [100821.822646] mmc0: SDHBCT 0x00000006
May 1 17:36:48 raspberrypi kernel: [100821.822651] mmc0: SDHBLC 0x00000001
May 1 17:36:48 raspberrypi kernel: [100821.822656] mmc0: ===========================================
May 1 17:36:48 raspberrypi kernel: [100821.924884] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100821.925197] mmcblk0: error -84 sending status command, retrying
May 1 17:36:48 raspberrypi kernel: [100822.026250] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.026320] mmcblk0: error -84 sending status command, retrying
May 1 17:36:48 raspberrypi kernel: [100822.127388] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.128357] mmcblk0: error -84 sending status command, aborting
May 1 17:36:48 raspberrypi kernel: [100822.229535] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.331426] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.433022] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.534736] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.535239] Aborting journal on device mmcblk0p2-8.
May 1 17:36:48 raspberrypi kernel: [100822.535380] mmc0: card 0001 removed
May 1 17:36:48 raspberrypi kernel: [100822.535479] JBD2: Error -5 detected when updating journal superblock for mmcblk0p2-8.
May 1 17:36:49 raspberrypi kernel: [100822.700158] blk_partition_remap: fail for partition 2
May 1 17:36:49 raspberrypi kernel: [100822.700179] EXT4-fs warning (device mmcblk0p2): dx_probe:745: inode #12: lblock 0: comm systemd-udevd: error -5 reading directory block
May 1 17:36:49 raspberrypi kernel: [100822.700210] blk_partition_remap: fail for partition 2
May 1 17:36:49 raspberrypi kernel: [100822.700219] EXT4-fs error (device mmcblk0p2): ext4_find_entry:1438: inode #128514: comm systemd-udevd: reading directory lblock 0
May 1 17:36:49 raspberrypi kernel: [100822.700254] blk_partition_remap: fail for partition 2
May 1 17:36:49 raspberrypi kernel: [100822.700260] Buffer I/O error on dev mmcblk0p2, logical block 0, lost sync page write
May 1 17:36:49 raspberrypi kernel: [100822.700278] blk_partition_remap: fail for partition 2
May 1 17:36:49 raspberrypi kernel: [100822.700284] EXT4-fs warning (device mmcblk0p2): dx_probe:745: inode #12: lblock 0: comm systemd-udevd: error -5 reading directory block
On another log file:
May 1 17:36:49 raspberrypi systemd[1]: Unmounting /boot...
May 1 17:36:49 raspberrypi systemd[5080]: boot.mount: Failed at step EXEC spawning /bin/umount: Input/output error
May 1 17:36:49 raspberrypi systemd[1]: boot.mount: Mount process exited, code=exited status=203
May 1 17:36:49 raspberrypi systemd[1]: Failed unmounting /boot.
May 1 17:36:49 raspberrypi systemd[1]: boot.mount: Unit is bound to inactive unit dev-disk-by\x2dpartuuid-9bddb831\x2d01.device. Stopping, too.
I suppose this is the same situation that happened in past as I wrote, every 5 to 7 days, same sympthoms also this time, now just for the fisrt time I could debug and see the logs.
To prevent this "unknow" hang and extend the days of working, in the last days I tried reboot regulary the RB by sudo reboot, but didn't help, I think with sudo reboot the SD keep working like before, not advicing a RB reset, maybe accumulating the factors that will drive it to freeze after that time.
Chances are, only a power cycle off and on may turn the SD available again, for this reasons chances are that even a Watchdog reset could not create the circumstances to "revive" the SD, of course if it depends about SD! And not from some other HW device in the RB or process that ay start after those 5-7 days that hang the SD. Unfortunately I couldn't test the effect of a watchdog reset it because of lack of root, and this information anyway is crucial for me, now I changed SD (Sandisk ultra 32GB) and cloned it from the previous one, and I opened more holes to let air cool the CPU (now I have 52°), but in future I can suffer again, so I would ask you:
6) Is a watchdog reset produce a kind of reset on SD too? Maybe the power on SD slots interrupts for a while? Maybe there is provided a combinations of signals on the SD pins to produce an internal reset?
If not, I guess I can only realize an external hardware watchdog with Arduino, that will cut the power for a while if the RP will not kick it after a minute, for example.
I can try to cool more the RB somehow, or buy a high temperature SD, I've found 16GB Transcend TS16GUSDHC10V High Endurance, can you recomend it? Maybe better to use an external USB dongle?
Thank you a lot for your precious answers!
I've several questions about it, I'll try to make shorter as possible.
I've a RB3, it was running only Node Red with very low CPU consumption 2-5%, with 58-59° CPU temperature and SD Samsung 16GB. Few RAM usage, and log2RAM to limit the SD writes, Raspbian Node.js and NR recently updated.
Every about week I was suffering a system "hang", a partial one, kernel was working, I could ping the RB IP, I can see NR keep working and manage the I2C bus (LCD), but I can't more open SSH sessions, Telnet, anything that looks "TCP related". As I don't have monitor/mouse I couldn't debug it, and as I use log2RAM I couldn't read the latest logs after power off and on it.
But yesterday it happened a miracle, the SSH session opened before on my laptop kept working while partial "hang" was going on , so I could debug something, and I noticed that any command was more accessible, for example:
pi@raspberrypi:~ $ ls -l
-bash: ls: command not found
I cannot browse the SD conent, only the ram drive with logs, and without usr/bin folder all the unix commands were out of usage, only echo (as I know) was working, I tried to google around what to do in this situations but very few results.
Question 1) which other commands are available? Can be real that Unix doesn't have any command for debug available in the RAM? everything is HDD-related? No HDD - No commands-bye-bye? Wow.. at this point it would be lovely to reserve a small amount of ram as a drive for some bash storage, like su, sudo, mnt, chmod.
2) Is it possible that without /usr/bin any new SSH, or Telenet session, or HTTP (I've a small server) can be opened? Should I believe that all these problems were related only because the hang of SD card? I haven't sniffed the traffic, to check a TCP service responding at SYN and ACK, but it may be believable that this services (such SSH) pretend to load something from SD before opening a new session even if they are already running and active in ram, strange anyway this dependent of SD, I had am SSH session active but I couldn't start a new one!
I wanted to kick the hardware watchdog to produce a reboot by writing and echo . > /dev/watchdog, but I couldn't do it wouthout root, and I get the root (no su command available):
3) Any way to obtain root in this situation? (without SD/HDD)
4) Some days before I set with chmod 666 on /dev/watchdog, so I could kick the watchdog from users too, but after a reboot this setting was gone and I haven't adviced it. Should I consider not permanent the change of permissions into a device? Should I script that change on every boot?
5) I've tried also The magic way (echo 1 > /proc/sys/kernel/sysrq, echo b > /proc/sysrq-trigger) , but root is needed too. Any other way to reboot? Any way to setup the machine next time to reboot by the magic way also under user?
So the only solution I had, as always, was to power off and on again the RB, it worked again without problems, and without errors on logs. It seems the SD got frozen inside maybe to avoid unsafe written inside/data lost, maybe it's controller inside noticed a high level of temperature and it blocked it from read and write.
Here are some logs I was able to dump from the log2ram device in ram, before power off/on:
May 1 17:36:48 raspberrypi kernel: [100821.820646] mmc0: timeout waiting for hardware interrupt.
May 1 17:36:48 raspberrypi kernel: [100821.820662] [77b2f15f] FCM> a6bd9dd4 0
May 1 17:36:48 raspberrypi kernel: [100821.820669] [77b2f160] CMD d 0
May 1 17:36:48 raspberrypi kernel: [100821.820677] [77b2f160] REQ> a6bd9dd4 0
May 1 17:36:48 raspberrypi kernel: [100821.820685] [77b2f173] TSK< a6bd9dd4 0
.... many other lines like above
May 1 17:36:48 raspberrypi kernel: [100821.822556] mmc0: cmd op 25 arg 0x467640 flags 0xb5 - resp 00000900 00000000 00000000 00000000, err 0
May 1 17:36:48 raspberrypi kernel: [100821.822564] mmc0: data blocks 8 blksz 200 - err 0
May 1 17:36:48 raspberrypi kernel: [100821.822574] mmc0:>stop op 12 arg 0x0 flags 0x49d - resp 00000000 00000000 00000000 00000000, err 0
May 1 17:36:48 raspberrypi kernel: [100821.822579] mmc0: =========== REGISTER DUMP ===========
May 1 17:36:48 raspberrypi kernel: [100821.822584] mmc0: SDCMD 0x0000880c
May 1 17:36:48 raspberrypi kernel: [100821.822589] mmc0: SDARG 0x00000000
May 1 17:36:48 raspberrypi kernel: [100821.822595] mmc0: SDTOUT 0x00bebc20
May 1 17:36:48 raspberrypi kernel: [100821.822600] mmc0: SDCDIV 0x00000008
May 1 17:36:48 raspberrypi kernel: [100821.822605] mmc0: SDRSP0 0x00000c00
May 1 17:36:48 raspberrypi kernel: [100821.822610] mmc0: SDRSP1 0x00000c0e
May 1 17:36:48 raspberrypi kernel: [100821.822616] mmc0: SDRSP2 0x7fffffff
May 1 17:36:48 raspberrypi kernel: [100821.822621] mmc0: SDRSP3 0x0002400c
May 1 17:36:48 raspberrypi kernel: [100821.822626] mmc0: SDHSTS 0x00000000
May 1 17:36:48 raspberrypi kernel: [100821.822631] mmc0: SDVDD 0x00000001
May 1 17:36:48 raspberrypi kernel: [100821.822636] mmc0: SDEDM 0x00010801
May 1 17:36:48 raspberrypi kernel: [100821.822641] mmc0: SDHCFG 0x0000040e
May 1 17:36:48 raspberrypi kernel: [100821.822646] mmc0: SDHBCT 0x00000006
May 1 17:36:48 raspberrypi kernel: [100821.822651] mmc0: SDHBLC 0x00000001
May 1 17:36:48 raspberrypi kernel: [100821.822656] mmc0: ===========================================
May 1 17:36:48 raspberrypi kernel: [100821.924884] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100821.925197] mmcblk0: error -84 sending status command, retrying
May 1 17:36:48 raspberrypi kernel: [100822.026250] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.026320] mmcblk0: error -84 sending status command, retrying
May 1 17:36:48 raspberrypi kernel: [100822.127388] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.128357] mmcblk0: error -84 sending status command, aborting
May 1 17:36:48 raspberrypi kernel: [100822.229535] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.331426] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.433022] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.534736] mmc0: previous command never completed.
May 1 17:36:48 raspberrypi kernel: [100822.535239] Aborting journal on device mmcblk0p2-8.
May 1 17:36:48 raspberrypi kernel: [100822.535380] mmc0: card 0001 removed
May 1 17:36:48 raspberrypi kernel: [100822.535479] JBD2: Error -5 detected when updating journal superblock for mmcblk0p2-8.
May 1 17:36:49 raspberrypi kernel: [100822.700158] blk_partition_remap: fail for partition 2
May 1 17:36:49 raspberrypi kernel: [100822.700179] EXT4-fs warning (device mmcblk0p2): dx_probe:745: inode #12: lblock 0: comm systemd-udevd: error -5 reading directory block
May 1 17:36:49 raspberrypi kernel: [100822.700210] blk_partition_remap: fail for partition 2
May 1 17:36:49 raspberrypi kernel: [100822.700219] EXT4-fs error (device mmcblk0p2): ext4_find_entry:1438: inode #128514: comm systemd-udevd: reading directory lblock 0
May 1 17:36:49 raspberrypi kernel: [100822.700254] blk_partition_remap: fail for partition 2
May 1 17:36:49 raspberrypi kernel: [100822.700260] Buffer I/O error on dev mmcblk0p2, logical block 0, lost sync page write
May 1 17:36:49 raspberrypi kernel: [100822.700278] blk_partition_remap: fail for partition 2
May 1 17:36:49 raspberrypi kernel: [100822.700284] EXT4-fs warning (device mmcblk0p2): dx_probe:745: inode #12: lblock 0: comm systemd-udevd: error -5 reading directory block
On another log file:
May 1 17:36:49 raspberrypi systemd[1]: Unmounting /boot...
May 1 17:36:49 raspberrypi systemd[5080]: boot.mount: Failed at step EXEC spawning /bin/umount: Input/output error
May 1 17:36:49 raspberrypi systemd[1]: boot.mount: Mount process exited, code=exited status=203
May 1 17:36:49 raspberrypi systemd[1]: Failed unmounting /boot.
May 1 17:36:49 raspberrypi systemd[1]: boot.mount: Unit is bound to inactive unit dev-disk-by\x2dpartuuid-9bddb831\x2d01.device. Stopping, too.
I suppose this is the same situation that happened in past as I wrote, every 5 to 7 days, same sympthoms also this time, now just for the fisrt time I could debug and see the logs.
To prevent this "unknow" hang and extend the days of working, in the last days I tried reboot regulary the RB by sudo reboot, but didn't help, I think with sudo reboot the SD keep working like before, not advicing a RB reset, maybe accumulating the factors that will drive it to freeze after that time.
Chances are, only a power cycle off and on may turn the SD available again, for this reasons chances are that even a Watchdog reset could not create the circumstances to "revive" the SD, of course if it depends about SD! And not from some other HW device in the RB or process that ay start after those 5-7 days that hang the SD. Unfortunately I couldn't test the effect of a watchdog reset it because of lack of root, and this information anyway is crucial for me, now I changed SD (Sandisk ultra 32GB) and cloned it from the previous one, and I opened more holes to let air cool the CPU (now I have 52°), but in future I can suffer again, so I would ask you:
6) Is a watchdog reset produce a kind of reset on SD too? Maybe the power on SD slots interrupts for a while? Maybe there is provided a combinations of signals on the SD pins to produce an internal reset?
If not, I guess I can only realize an external hardware watchdog with Arduino, that will cut the power for a while if the RP will not kick it after a minute, for example.
I can try to cool more the RB somehow, or buy a high temperature SD, I've found 16GB Transcend TS16GUSDHC10V High Endurance, can you recomend it? Maybe better to use an external USB dongle?
Thank you a lot for your precious answers!