sethbollinger
Posts: 11
Joined: Sun Jan 12, 2014 9:23 pm

SD Card Corruption

Sun Jan 12, 2014 9:37 pm

Hello All,

I have a 20-device testbed where I'm seeing a ~25% failure rate in SD cards within 4 hours of testing. The cards seem to be failing at a very low level, as they're sometimes not even mountable on a Linux workstation (it seems the card crashes, causing a cyclical USB reset in my card reader). Sometimes the only recovery option is to write data to the entire card using a different OS (OS X in this case).

Is there someone working this problem that would benefit from any data I could gather? I'm willing to expend some time and energy gathering test data if it would help. This is pretty frustrating for us...

I'm seeing this problem using the latest firmware running on the latest Raspbian, and also on a custom Yocto build (current firmware, 3.12 kernel).

Thanks,

Seth

andrum99
Posts: 707
Joined: Fri Jul 20, 2012 2:41 pm

Re: SD Card Corruption

Mon Jan 13, 2014 12:02 am

The possible causes of this problem are:

duff power supply
duff power supply
duff power supply
dodgy SD card off of eBay
somebody messing about with it
old buggy firmware on the Pi

in that order. Try one of these:

https://www.modmypi.com/raspberry-pi-ac ... wer-supply

Can you also describe your set up? There may be something you have overlooked that is causing the corruption.

sethbollinger
Posts: 11
Joined: Sun Jan 12, 2014 9:23 pm

Re: SD Card Corruption

Mon Jan 13, 2014 2:01 am

I appreciate your reply. Are you one of the engineers working on the problem? There was a timing problem with the clocks that was fixed in the firmware a while back. We're having what seems like the _exact_ same problem, so it would be nice to get some input from someone who worked that problem.

1. We've tried 5 different power supplies.

2. We've tried 3 different types of SD cards.

3. Nobody is "messing" with these devices, they're running on an isolated test bed.

4. The firmware is current, as stated in the post...

We have 20 devices on a power controller that we're power cycling via script about every minute. As stated in the previous post, they will fail in what appears to be a low level way (SD card FTL) after ~4 hours. I've seen a bunch of nulls appended to files in /etc, which I assume happens before the low level failure. I'm not writing to those files _ever_, so I expect them to be coherent.

Here are the errors we're seeing:
[  673.585070] mmcblk0: error -110 transferring data, sector 7585447, nr 1, cmd response 0x900, card status 0x200b00
[  673.595431] end_request: I/O error, dev mmcblk0, sector 7585447
[  677.899355] mmcblk0: error -110 transferring data, sector 7585440, nr 8, cmd response 0x900, card status 0x200b00
[  677.909715] mmcblk0: retrying using single block read
[  682.212153] mmcblk0: error -110 transferring data, sector 7585440, nr 8, cmd response 0x900, card status 0x200b00
[  682.222539] end_request: I/O error, dev mmcblk0, sector 7585440
[  686.525864] mmcblk0: error -110 transferring data, sector 7585441, nr 7, cmd response 0x900, card status 0x200b00
[  686.536225] end_request: I/O error, dev mmcblk0, sector 7585441
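
On Linux, error -110 is ETIMEDOUT, meaning the card did not answer the transfer in time. A minimal sketch for summarising logs like the one above, assuming the kernel messages are available on stdin (for example from dmesg); the function name failing_sectors is hypothetical and only the end_request lines are counted:

```shell
#!/bin/sh
# Hypothetical helper: list each failing sector once, followed by its hit count.
# Reads kernel log text on stdin and ignores everything except I/O error lines.
failing_sectors() {
    # Keep only the sector number from "end_request: I/O error, dev mmcblk0, sector N".
    sed -n 's|.*end_request: I/O error, dev mmcblk0, sector \([0-9][0-9]*\).*|\1|p' |
        sort -n | uniq -c | awk '{ print $2, $1 }'
}

# Usage (not run here): dmesg | failing_sectors
```

Sectors that show up repeatedly across boots would point at one damaged region, while errors scattered over the whole card would look more like a card-wide or electrical problem.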

I've run this test on non-RPi devices running JFFS2 on raw flash and have _never_ seen this type of corruption. It seems like a hardware/firmware problem to me.

klricks
Posts: 6499
Joined: Sat Jan 12, 2013 3:01 am
Location: Grants Pass, OR, USA

Re: SD Card Corruption

Mon Jan 13, 2014 3:38 am

Are you saying that you set up 20 RPi and are abruptly shutting power off and on once per minute?
Are the RPi's idle at command prompt or booted to GUI?
Unless specified otherwise my response is based on the latest and fully updated Raspbian Buster w/ Desktop OS.

jdb
Raspberry Pi Engineer & Forum Moderator
Posts: 2030
Joined: Thu Jul 11, 2013 2:37 pm

Re: SD Card Corruption

Mon Jan 13, 2014 9:28 am

sethbollinger wrote:
We have 20 devices on a power controller that we're power cycling via script about every minute. As stated in the previous post, they will fail in what appears to be a low level way (SD card FTL) after ~4 hours. I've seen a bunch of nulls appended to files in /etc, which I assume happens before the low level failure. I'm not writing to those files _ever_, so I expect them to be coherent.

I've run this test on non-RPi devices running JFFS2 on raw flash and have _never_ seen this type of corruption. It seems like a hardware/firmware problem to me.
The hardware problem is that you're switching the power to the pi every minute.

Raw flash is a substantially different animal. JFFS2 is also substantially different as it's log-structured and aware of flash block sizes. If you power cycle during a jffs2 write, then the write is corrupt - but because of the log structure when the filesystem is next mounted the bad block is simply discarded and the filesystem restored to an earlier state. Metadata is also cached in RAM so it's very unlikely that the filesystem metadata will be corrupt as that too is written in log chunks.

SD cards implement a flash translation layer. We use ext4 as the root filesystem for what is essentially a desktop operating system.

At any time (especially just 60 seconds after boot) there could be reads or writes going on as a result of userspace activity. During these commands, and for a short while after as determined by the FTL's internal state machines, multiple flash blocks will be open for read/write. If power is switched off in the middle of access, then arbitrary data will be corrupted.
Rockets are loud.
https://astro-pi.org

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 23071
Joined: Sat Jul 30, 2011 7:41 pm

Re: SD Card Corruption

Mon Jan 13, 2014 9:32 am

Interesting question: if you did the same with a PC, would the HD corrupt in a similar timescale?

Anyway, I'll flag this thread to the guy who worked on the original SD corruption issue.

EDIT: I think he already knows judging from previous message.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed. Here's an example...
"My grief counseller just died, luckily, he was so good, I didn't care."

sethbollinger
Posts: 11
Joined: Sun Jan 12, 2014 9:23 pm

Re: SD Card Corruption

Mon Jan 13, 2014 11:43 am

We see this problem using f2fs as well. This is a log based journaling FS (as you probably know).

This makes me pretty sad as there's no way that we can make certain our customers don't power cycle the device. So you're saying that pretty much every one of the millions of devices out there will become corrupt if you power cycle them at the wrong time? Sounds like a good reason to not use SD cards as the main storage...

I will set up a test using another piece of hardware running from an SD card.

Thanks!

RaTTuS
Posts: 10377
Joined: Tue Nov 29, 2011 11:12 am
Location: North West UK

Re: SD Card Corruption

Mon Jan 13, 2014 11:56 am

sethbollinger wrote:We see this problem using f2fs as well. This is a log based journaling FS (as you probably know).

This makes me pretty sad as there's no way that we can make certain our customers don't power cycle the device. So you're saying that pretty much every one of the millions of devices out there will become corrupt if you power cycle them at the wrong time? Sounds like a good reason to not use SD cards as the main storage...

I will set up a test using another piece of hardware running from an SD card.

Thanks!
Yes there is: use a small LiPo battery with a circuit that senses when the power fails and shuts the Pi down gracefully. If you are power cycling every 60 secs it is going to kill something.
How To ask Questions :- http://www.catb.org/esr/faqs/smart-questions.html
WARNING - some parts of this post may be erroneous YMMV

1QC43qbL5FySu2Pi51vGqKqxy3UiJgukSX
Covfefe

wimble
Posts: 34
Joined: Tue Feb 05, 2013 9:52 am

Re: SD Card Corruption

Mon Jan 13, 2014 12:02 pm

I've had messages similar to those that sethbollinger reports.

Dec 17 10:52:25 raspberrypi kernel: [480802.908015] mmcblk0: error -110 transferring data, sector 3211435, nr 5, cmd response 0x900, card status 0x200b00
Dec 17 10:52:25 raspberrypi kernel: [480802.908049] end_request: I/O error, dev mmcblk0, sector 3211435
I seem to have resolved the issue, at least in my case: I was powering the Pi from a USB port on my NAS, and had a USB keyboard plugged in (nothing fancy, just a cheap 7 quid keyboard). Just re-imaging the SD card didn't help: the problem just came back. Switching to a RS electronics mains adapter and unplugging the keyboard *and* re-imaging the card hasn't shown any re-occurrence in the last week.

Whilst I'm willing to believe the duff power supply that andrum99 mentions may have caused the original corruption of my card, the things he lists there aren't complete: at one point in attempting to solve the problem, I reimaged the card (the same one that had been reporting problems, and the same one that is now working), and the Pi simply wouldn't boot at all. I got the 4 flashes from the LED, so it couldn't find loader.bin. Clearly that wasn't the result of the power supply or a dodgy card, or somebody messing about. I reimaged *again*, and it worked fine. So I'm happy :)

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 23071
Joined: Sat Jul 30, 2011 7:41 pm

Re: SD Card Corruption

Mon Jan 13, 2014 12:04 pm

sethbollinger wrote:This makes me pretty sad as there's no way that we can make certain our customers don't power cycle the device. So you're saying that pretty much every one of the millions of devices out there will become corrupt if you power cycle them at the wrong time? Sounds like a good reason to not use SD cards as the main storage...
In certain applications, SD card would be inappropriate simply because of the way it works. But for the main use case for the Raspi, it works fine. It's cheap, easily changed from one card to another, and for most people, SD card corruption will never happen, simply because they are not turning their devices on and off all the time. It's a percentage likelihood thing. Most people simply do not turn their devices on/off frequently enough to ever break them.

One way round the problem would be to use different storage attached to the USB, and make the SD card read only. You could try tests with a USB memory stick taking the brunt of the read/write load.
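
A minimal sketch of checking that a mount really is read-only, assuming the standard /proc/mounts layout (device, mount point, type, options). The helper name is_read_only is hypothetical, and the second argument exists only so the check can be pointed at a sample file:

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if MOUNTPOINT is mounted read-only,
# 1 if it is mounted read-write, 2 if it is not mounted at all.
is_read_only() {
    mountpoint=$1
    mounts=${2:-/proc/mounts}
    # The last matching line is the mount currently in effect.
    awk -v mp="$mountpoint" '
        $2 == mp { opts = $4; found = 1 }
        END {
            if (!found) exit 2
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
            exit 1
        }' "$mounts"
}

# Usage (not run here):
#   sudo mount -o remount,ro /        # remount the root filesystem read-only
#   is_read_only / && echo "root is read-only"
```

Anything that still needs to write (logs, application data) would then have to live on the USB storage or in a tmpfs.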

I'm trying to think of an alternative to SD cards for the Raspi use case - not got anything yet!

gsh
Raspberry Pi Engineer & Forum Moderator
Posts: 1421
Joined: Sat Sep 10, 2011 11:43 am

Re: SD Card Corruption

Mon Jan 13, 2014 1:00 pm

I'm not sure that this is actually a power problem with the Pi... I think it's the SD card that is being trashed.

If you power off when it's not writing to the card then you'll be OK... So just doing 'sync' and then powering off is usually good enough (unless you've got background processes writing to the card).

Otherwise that's the deal! If you want to stop this from happening, you should be able to create a separate filesystem partition for the data you are writing. It's then possible that a power cut will trash that partition, but the root partition should survive (you'll have to detect the trashed card and recover it by writing to it again).

Gordon
--
Gordon Hollingworth PhD
Raspberry Pi - Director of Software Engineering

g7ruh
Posts: 67
Joined: Mon Apr 23, 2012 9:49 am
Location: Blackfield UK

Re: SD Card Corruption

Mon Jan 13, 2014 1:38 pm

I got a few new Samsung cards recently, 8GB and 16GB class 10 (even though I ordered C4 cards <sigh>). I decided to go with the C10 cards. I had a lot of these sorts of errors for a while, which got me thinking.... I am now loading up the latest release of Raspbian, updating the firmware to the current version and then writing ZEROS to the whole card.

#!/bin/bash
#
# write zeros to all free sectors then remove the files.
#
# clean the apt cache first
#
# run from the disk you want to zeroise
#
sudo apt-get clean
dd if=/dev/zero of=zero.small.file bs=1024 count=102400
dd if=/dev/zero of=zero.file bs=1024
sync ; sleep 60 ; sync
rm zero.small.file
rm zero.file

Since I have been doing this I have not had any problem with "corruption" or bad writes. I monitor the syslog for "Ext4 error" to help detect issues during testing.
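
A minimal sketch of that kind of syslog check, assuming the log lives at /var/log/syslog (pass another path if not). The helper name count_ext4_errors is hypothetical; the pattern is case-insensitive so it matches both "Ext4 error" and the kernel's own "EXT4-fs error" wording:

```shell
#!/bin/sh
# Hypothetical helper: print how many lines of a log file report an ext4 error.
count_ext4_errors() {
    log=${1:-/var/log/syslog}
    # grep -c exits non-zero when the count is 0, so mask that status;
    # the printed count is what callers should look at.
    grep -ci 'ext4[^ ]* error' "$log" || :
}

# Usage (not run here): [ "$(count_ext4_errors)" -gt 0 ] && echo "ext4 errors logged"
```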

I have a few new cards on order and I will repeat the recipe above with them.

I am beginning to wonder if the problems are caused by areas of the card never having been written before, suddenly making the O/S 'glitch'. In the good old days of huge (low capacity) disk drives, in order to prepare a new disk for use, every sector had to be written (formatted), this also checked the sector in case it was bad.

I make the above observation in the light of a lot of "corruptions" or bad writes in other cards too, so I did a more controlled test as above. The PSU is not the issue (checked that a long time ago). Voltage is not an issue, likewise checked. Voltage drop across the thermal fuse is not an issue either (removed and replaced with a glass fuse holder and fuse).

There is one other issue which could be a cause: the SD card holder itself and poor or degrading contacts. Any form of signal loss during a write could act just like sudden power loss. I flexed one Pi in its case (to adjust position of the camera) and I got similar errors and a corrupted card.

I have some 35 years of experience fixing disk drive technology of various vintages, and I am applying some lateral thinking from that to the (non-rotating) SD card.

Just a thought in case it triggers other thoughts

Roger

Heater
Posts: 12964
Joined: Tue Jul 17, 2012 3:02 pm

Re: SD Card Corruption

Mon Jan 13, 2014 2:36 pm

g7ruh

Writing all zeros to an SD card does not do anything beneficial for you. In fact it may be detrimental.

Basically by doing that you are telling the card that all sectors (blocks?) are now full of data. Which happens to be all zeros. The SD card's controller now cannot use those blocks to perform any wear levelling it may want to do. The SD card has no way to know that those sectors full of zeros are of no importance to you after you have put a file system image on there. It cannot tell the difference between a file system and any other random data you may write to its blocks.

That is unless you use TRIM, via the file system or manually (fstrim command), which would then tell the SD card which blocks you now consider "free".

As far as I know TRIM does not work on the SD/Raspbian combination. Or at least the fstrim command told me the operation was not supported.

gsh
Raspberry Pi Engineer & Forum Moderator
Posts: 1421
Joined: Sat Sep 10, 2011 11:43 am

Re: SD Card Corruption

Mon Jan 13, 2014 2:54 pm

There is no point running fstrim on an SD card. The only difference it makes is that writing takes a little less power, since the card doesn't need to erase as it is writing!

SD cards purposely hold back 10% of the disk for wear levelling (dependent upon the individual manufacturer etc.), but in general the card will take one of the erased blocks and write to it, and then circulate unwritten blocks with low wear (to make sure levelling occurs across all blocks, even those that are never written to!)

Gordon

g7ruh
Posts: 67
Joined: Mon Apr 23, 2012 9:49 am
Location: Blackfield UK

Re: SD Card Corruption

Mon Jan 13, 2014 3:03 pm

Heater wrote: Basically by doing that you are telling the card that all sectors (blocks?) are now full of data. Which happens to be all zeros. The SD card's controller now cannot use those blocks to perform any wear levelling it may want to do.
Heater,
thanks, maybe that is what happens, the writing everywhere slows down the SD card operations and if there were issues before, now there is just a little bit longer to do what it has to, without generating errors.
Roger

Heater
Posts: 12964
Joined: Tue Jul 17, 2012 3:02 pm

Re: SD Card Corruption

Mon Jan 13, 2014 3:24 pm

Gordon,

Perhaps you are the person I need to spell this out for once and for all.

Let's ignore that 10% held back for wear levelling for a moment.

a) Is it so that an SD card will perform wear levelling by using unused space on your SD card?

If so, that means that for a card with a lot of free space the controller has a lot of space to perform wear levelling in and the card's lifetime is therefore longer. But if your card is full there is no space to perform wear levelling in and we can expect the card's lifetime to be much shorter.

That means that starting off by writing all zeros to a card basically tells it that all blocks are in use. Wear levelling opportunities are removed and card lifetime can be expected to be shorter.

That 10% gives a bit more wiggle room but not much.

b) Given that you have filled a card up with all zeros, thus removing most wear levelling space, I would have thought TRIM was exactly what you needed to get back to a clean, all blocks free, state. Are you saying this is wrong?

That's kind of academic as TRIM is not supported on any of the cards I have anyway.

gsh
Raspberry Pi Engineer & Forum Moderator
Posts: 1421
Joined: Sat Sep 10, 2011 11:43 am

Re: SD Card Corruption

Mon Jan 13, 2014 4:41 pm

TRIM is not supported because it has no meaning for SD card...

The way it has been explained to me (by actual engineers at Samsung and Kingston), the SD card FTL doesn't tell you about the 10% of the card that is always erased and waiting to be written to. When you start writing to the card, it takes one of the currently erased blocks and starts writing wherever you asked it to. When you finish writing, it closes that block by copying data across from the current block and then changes the mapping so the new block is used instead.

After some time the wear levelling identifies blocks that are particularly unused and swaps those into the 10% by copying their contents and swapping over. Basically that's how it works... It doesn't need to know about empty blocks (although it would in theory reduce overall wear if it did).

Finally, in the end this is all SD card dependent and a function of the FTL (which they like to keep up their sleeves!)

For SSD this is all different because they don't have a magic 10% hidden away so it is important to identify the blocks that are known to be empty and of course it makes sense because the OS knows which blocks are used and which are not!

Gordon

Heater
Posts: 12964
Joined: Tue Jul 17, 2012 3:02 pm

Re: SD Card Corruption

Mon Jan 13, 2014 4:48 pm

Gordon.

Thank you.

Seems the missing detail is that wear levelling can work on an SD card by swapping blocks around even if they are used. Even if the card is pretty much full.

The good news is that after a shaky start where I ended up with a pile of dud cards things have been very stable recently. No idea if that's due to my new Transcend Class 10 9GB cards, or firmware/OS changes or just dumb luck.

When I say "dud", some of those cards were having blocks become write protected. At least according to whatever formatting tool I had. Certainly after writing all zeros to them with dd I could still find my old data in some of the blocks. I would expect random power downs to trash the file system but this was a surprise to me.

sethbollinger
Posts: 11
Joined: Sun Jan 12, 2014 9:23 pm

Re: SD Card Corruption

Tue Jan 14, 2014 12:13 am

James, Gordon,

Thanks for replying. I appreciate you taking the time to answer!

Thanks,

Seth

jojopi
Posts: 3078
Joined: Tue Oct 11, 2011 8:38 pm

Re: SD Card Corruption

Tue Jan 14, 2014 12:45 am

gsh wrote:For SSD this is all different because they don't have a magic 10% hidden away so it is important to identify the blocks that are known to be empty and of course it makes sense because the OS knows which blocks are used and which are not!
I think SSDs are similar. If they have 256GiB of NAND, they present a user-visible size of say 240GB decimal, which is a difference of 12.7%.
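
The 12.7% figure can be checked with a small sketch; the function name spare_percent is hypothetical, and it assumes the raw size is given in binary GiB and the visible size in decimal GB, as in the example above:

```shell
#!/bin/sh
# Spare area as a percentage of raw NAND.
# $1 = raw NAND size in GiB (binary), $2 = user-visible size in GB (decimal).
spare_percent() {
    awk -v raw_gib="$1" -v user_gb="$2" 'BEGIN {
        raw  = raw_gib * 1024 * 1024 * 1024    # bytes of NAND on the device
        user = user_gb * 1000 * 1000 * 1000    # bytes presented to the host
        printf "%.1f\n", (raw - user) / raw * 100
    }'
}

spare_percent 256 240   # prints 12.7
```

Note that the percentage is taken relative to the raw NAND; relative to the visible 240GB the same spare area would come out larger.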

sethbollinger
Posts: 11
Joined: Sun Jan 12, 2014 9:23 pm

Re: SD Card Corruption

Fri Jan 17, 2014 2:15 am

FYI, we ran a simple test on 20 devices with shutdown -r in a startup crontab. This resulted in 15 devices failing out of 20 in less than a week of testing. This test was run using the latest Raspbian image.

Seth

gsh
Raspberry Pi Engineer & Forum Moderator
Posts: 1421
Joined: Sat Sep 10, 2011 11:43 am

Re: SD Card Corruption

Fri Jan 17, 2014 6:22 am

How did they fail? What errors did you see?

Gordon

gsh
Raspberry Pi Engineer & Forum Moderator
Posts: 1421
Joined: Sat Sep 10, 2011 11:43 am

Re: SD Card Corruption

Fri Jan 17, 2014 6:48 am

So this time you weren't powering off the devices at all?

Gordon

sethbollinger
Posts: 11
Joined: Sun Jan 12, 2014 9:23 pm

Re: SD Card Corruption

Fri Jan 17, 2014 11:34 am

I'm sorry, I don't have the errors at this time. I'll have my associate dig them up and I'll try to get them to you later this morning, hopefully around 9 central time.

Correct, we're not power cycling at all. We wanted to verify no corruption in "normal" runtime.

Seth

Richard-TX
Posts: 1549
Joined: Tue May 28, 2013 3:24 pm
Location: North Texas

Re: SD Card Corruption

Fri Jan 17, 2014 3:19 pm

I have a few questions regarding those failures.

What board revision number?
cat /proc/cpuinfo
http://elinux.org/RPi_HardwareHistory

What release/version of the OS?
cat /proc/version or image name.
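
A minimal sketch for pulling just the revision code out of that output when collecting it from many boards, assuming the usual "Revision : xxxx" line in /proc/cpuinfo on the Pi. The helper name board_revision is hypothetical, and the optional argument is only there so it can be run against a saved copy:

```shell
#!/bin/sh
# Hypothetical helper: print the Revision field from a cpuinfo-style file.
board_revision() {
    cpuinfo=${1:-/proc/cpuinfo}
    # Lines look like "Revision<TAB>: 000e"; split on the colon and following spaces.
    awk -F': *' '/^Revision/ { print $2 }' "$cpuinfo"
}

# Usage (not run here): echo "$(hostname): $(board_revision)"
```

The code can then be looked up in the hardware history table linked above.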
Richard
Doing Unix since 1985.
The 9-25-2013 image of Wheezy can be found at:
http://downloads.raspberrypi.org/raspbian/images/raspbian-2013-09-27/2013-09-25-wheezy-raspbian.zip
