mr_peta
Posts: 4
Joined: Sun Jul 07, 2019 1:06 pm

RPi4/Buster data corruption on external SSD

Sun Jul 07, 2019 2:31 pm

Hello,

I'm facing a strange problem on RPi4: a partition I create on my external SSD gets corrupted after a reboot on RPi4.

Update:
It is the SSD that I was using with RPi2 previously. Attached are steps that reproduce the problem on RPi4. One partition is enough to reproduce the problem. Following the same steps on RPi2 doesn't show any problems.

Steps to reproduce (short version):
  1. Delete any existing partitions
  2. Create one primary partition 4GB long
  3. Format the partition with ext4
  4. Change the label to "data"
  5. Verify that the label is set and the ext4 filesystem is consistent
  6. Reboot, and after:
    • Label is missing
    • fsck finds the partition heavily corrupted
Has anybody encountered a similar problem, or does anyone have an idea what is going on?
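
For anyone who wants to retry this, steps 1 to 5 could be scripted roughly as below. This is only a sketch: the post does not say which tool created the partition, so the parted call is an assumption chosen to match the partition listing in the long version (start sector 2048, end sector 8390655), and DEV, DRY_RUN, run and reproduce_steps are names made up for this example. With DRY_RUN=1 (the default here) the commands are only printed; with DRY_RUN=0 the partition table of the target disk is overwritten.

```shell
#!/bin/sh
# Sketch of steps 1-5. DESTRUCTIVE when DRY_RUN=0: the target disk is wiped.
# DEV and DRY_RUN are hypothetical knobs for this example, not from the post.
DEV="${DEV:-/dev/sda}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

reproduce_steps() {
    # Steps 1-2: fresh DOS label with one 4 GiB primary partition
    # (assumed tool: parted; sector range taken from the fdisk listing).
    run parted -s "$DEV" mklabel msdos mkpart primary ext4 2048s 8390655s
    # Step 3: format with ext4.
    run mkfs.ext4 "${DEV}1"
    # Step 4: set the label.
    run e2label "${DEV}1" data
    # Step 5: verify label and filesystem consistency.
    run blkid "${DEV}1"
    run fsck -n "${DEV}1"
}
```

Calling reproduce_steps with the defaults only prints the five commands for /dev/sda. Run it as root with DRY_RUN=0 only on a disk whose contents can be lost, then reboot and repeat the blkid and fsck checks.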

Detailed description and device information follow.


Steps to reproduce (long version):

1-2. Partition table after creating a single partition

pi@raspberrypi:~ $ sudo fdisk -l /dev/sda
Disk /dev/sda: 111.8 GiB, 120034123776 bytes, 234441648 sectors
Disk model: SNA-DC/U        
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x479a57dd

Device     Boot Start     End Sectors Size Id Type
/dev/sda1        2048 8390655 8388608   4G 83 Linux

3. ext4 format

pi@raspberrypi:~ $ sudo mkfs.ext4 /dev/sda1
mke2fs 1.44.5 (15-Dec-2018)
Creating filesystem with 1048576 4k blocks and 262144 inodes
Filesystem UUID: 14689001-5d62-45df-a7c5-36e4c9a7d465
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

4. Change the label to "data"

pi@raspberrypi:~ $ sudo e2label /dev/sda1 data

5. Verify that the label is set and the ext4 filesystem is consistent

pi@raspberrypi:~ $ sudo blkid /dev/sda1
/dev/sda1: LABEL="data" UUID="14689001-5d62-45df-a7c5-36e4c9a7d465" TYPE="ext4" PARTUUID="479a57dd-01"

pi@raspberrypi:~ $ sudo fsck -n /dev/sda1
fsck from util-linux 2.33.1
e2fsck 1.44.5 (15-Dec-2018)
data: clean, 11/262144 files, 36942/1048576 blocks

6. Reboot

After reboot

The label is missing:

pi@raspberrypi:~ $ sudo blkid /dev/sda1
/dev/sda1: PARTUUID="479a57dd-01"

And the filesystem is corrupted:

pi@raspberrypi:~ $ sudo fsck -n /dev/sda1
fsck from util-linux 2.33.1
e2fsck 1.44.5 (15-Dec-2018)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
/dev/sda1 was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  +(32768--33280) +(98304--98816) +(163840--164352) +(229376--229888) +(294912--295424) -(557056--557060) -557062 -557068
...
(many lines removed)
...
Fix? no

Free inodes count wrong for group #16 (8192, counted=4532).
Fix? no

Free inodes count wrong for group #17 (8192, counted=4347).
Fix? no
...
(many lines removed)
...
Inode bitmap differences: Group 1 inode bitmap does not match checksum.
IGNORED.
Group 2 inode bitmap does not match checksum.
IGNORED.
Group 3 inode bitmap does not match checksum.
IGNORED.
...
(many lines removed)
...
/dev/sda1: ********** WARNING: Filesystem still has errors **********

/dev/sda1: 11/262144 files (0.0% non-contiguous), 36942/1048576 blocks

A few notes:
  • I'm using RPi4 with 4GB RAM.
  • I'm using the official USB-C power supply.
  • I'm using the Raspbian buster that came with the SD card, without any modifications.
  • It makes no difference whether the SSD enclosure is connected to a USB 2 or a USB 3 port.
  • As I wrote earlier, repeating the same on RPi2 works without any problems.
SSD Info:
  • Enclosure: USB2 Kingston SNA-DC/U
  • Disk: OCZ-TRION100
RPi4 Info:
  • Raspbian 10 (buster)
  • kernel 4.19.46-v7l+
RPi2 Info:
  • Raspbian 9 (stretch)
  • kernel 4.14.70-v7+

I would be grateful for any insights...

Many thanks,
Petr
Last edited by mr_peta on Tue Jul 09, 2019 11:01 am, edited 1 time in total.

mr_peta
Posts: 4
Joined: Sun Jul 07, 2019 1:06 pm

Re: RPi4/Buster data corruption on external SSD

Mon Jul 08, 2019 5:53 pm

I just managed to repeat the SSD test with an RPi 3 (not the B+), and the SSD works perfectly. I used the same SD card with Raspbian buster that I used with the 4B previously.

Therefore, the problem really seems to be specific to RPi4.

ShiftPlusOne
Raspberry Pi Engineer & Forum Moderator
Posts: 6043
Joined: Fri Jul 29, 2011 5:36 pm
Location: The unfashionable end of the western spiral arm of the Galaxy

Re: RPi4/Buster data corruption on external SSD

Mon Jul 08, 2019 5:59 pm

Just a +1 that I've noticed this as well, but didn't check whether it was an issue with the drive or the pi. I haven't seen it since switching to btrfs.

mr_peta
Posts: 4
Joined: Sun Jul 07, 2019 1:06 pm

Re: RPi4/Buster data corruption on external SSD

Tue Jul 09, 2019 10:53 am

I made some progress on the problem:

First, the problem can be reproduced in an easier way:
  • Just format the ext4 partition
  • Unplug the USB2 enclosure
  • Plug the USB2 enclosure
  • Run fsck
With the enclosure/SSD I have, this breaks the filesystem on a different Linux host (not an RPi) as well.

However, if the drive write cache is flushed just before unplugging the USB2 enclosure, the filesystem is no longer corrupted!

Command used to flush the drive write cache:

hdparm -F /dev/sda

Using the "sync" command is not enough.

Flushing just before a reboot solves the original problem with partition corruption as well.
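
The workaround can be wrapped in a small helper to call before unplugging or rebooting. This is a minimal sketch, assuming it runs as root; flush_drive and the HDPARM override variable are names invented for this example, and the only drive command used is the hdparm -F call shown above.

```shell
#!/bin/sh
# Sketch: flush the drive's own write cache before an unplug or reboot.
# HDPARM is a hypothetical override so the command can be stubbed out.
HDPARM="${HDPARM:-hdparm}"

flush_drive() {
    dev="$1"
    # sync writes the kernel's dirty pages out to the device...
    sync
    # ...and hdparm -F asks the drive to flush its on-drive write cache.
    "$HDPARM" -F "$dev"
}
```

For example, `flush_drive /dev/sda && reboot` would flush first and reboot only if the flush command succeeded.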

Therefore, the reported problem doesn't look like a problem of the RPi4 itself. What is still a bit puzzling is why the reboot problem is visible only on the RPi4. Is it too fast? :)

Another question is whether the kernel or a module should implicitly take care of flushing drive write caches... Or is something broken somewhere?

jdb
Raspberry Pi Engineer & Forum Moderator
Posts: 2123
Joined: Thu Jul 11, 2013 2:37 pm

Re: RPi4/Buster data corruption on external SSD

Tue Jul 09, 2019 11:46 am

Oof, that's nasty. What does hdparm -W report? Do you get the same corruption if you disable the drive write cache with hdparm -W0?
Rockets are loud.
https://astro-pi.org

mr_peta
Posts: 4
Joined: Sun Jul 07, 2019 1:06 pm

Re: RPi4/Buster data corruption on external SSD

Tue Jul 09, 2019 3:52 pm

Indeed, the write cache is enabled:

$ sudo hdparm -W /dev/sda

/dev/sda:
 write-caching = 1 (on)

Turning it off with hdparm -W0 prevents the corruption from happening when the enclosure is unplugged and replugged or the OS is rebooted.
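
The check and the workaround can be combined so that the cache is only switched off when it is reported as on. A minimal sketch, assuming it runs as root and that hdparm -W prints a "write-caching = N" line as in the output above; write_cache_state, disable_write_cache_if_on and the HDPARM override variable are names invented for this example.

```shell
#!/bin/sh
# HDPARM is a hypothetical override so the command can be stubbed out.
HDPARM="${HDPARM:-hdparm}"

# Print "on", "off" or "unknown" based on the "write-caching = N (...)" line.
write_cache_state() {
    "$HDPARM" -W "$1" 2>/dev/null | awk '
        /write-caching/ { state = ($3 == "1") ? "on" : "off" }
        END { print (state == "" ? "unknown" : state) }'
}

# Disable the drive write cache only if it is currently reported as on.
disable_write_cache_if_on() {
    if [ "$(write_cache_state "$1")" = "on" ]; then
        "$HDPARM" -W0 "$1"
    fi
}
```

For example, `disable_write_cache_if_on /dev/sda` could be run after each time the enclosure is plugged in.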

Something else I noticed in syslog after plugging in the enclosure:

Jul  9 12:05:39 raspberrypi kernel: [ 2228.392413] scsi 0:0:0:0: Direct-Access     Kingston SNA-DC/U         1.08 PQ: 0 ANSI: 4
Jul  9 12:05:39 raspberrypi kernel: [ 2228.393273] sd 0:0:0:0: Attached scsi generic sg0 type 0
Jul  9 12:05:39 raspberrypi kernel: [ 2228.393942] sd 0:0:0:0: [sda] 234441648 512-byte logical blocks: (120 GB/112 GiB)
Jul  9 12:05:39 raspberrypi kernel: [ 2228.394495] sd 0:0:0:0: [sda] Write Protect is off
Jul  9 12:05:39 raspberrypi kernel: [ 2228.394510] sd 0:0:0:0: [sda] Mode Sense: 23 00 00 00
Jul  9 12:05:39 raspberrypi kernel: [ 2228.394944] sd 0:0:0:0: [sda] No Caching mode page found            <-- HERE
Jul  9 12:05:39 raspberrypi kernel: [ 2228.394958] sd 0:0:0:0: [sda] Assuming drive cache: write through   <-- HERE
Jul  9 12:05:39 raspberrypi kernel: [ 2228.397114]  sda: sda1
Jul  9 12:05:39 raspberrypi kernel: [ 2228.399804] sd 0:0:0:0: [sda] Attached SCSI disk

The kernel didn't detect the caching mode (the drive didn't advertise it properly, or something else went wrong) and is assuming write-through. However, if my understanding is correct, what the drive is actually using is write-back.

I wonder whether it would be better if the kernel assumed "write-back" as the default in such cases... Probably a question for the kernel developers?
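
For what it is worth, the kernel's assumption can be inspected and overridden per disk through the cache_type attribute under /sys/class/scsi_disk. A minimal sketch, assuming the disk is 0:0:0:0 as in the log above and that the script runs as root; SYSFS_SCSI_DISK, show_cache_type and assume_write_back are names invented for this example, and whether this particular enclosure then honours the resulting cache flush commands has not been verified here.

```shell
#!/bin/sh
# SYSFS_SCSI_DISK is a hypothetical override so the functions can be tried
# against a scratch directory instead of the real sysfs tree.
SYSFS_SCSI_DISK="${SYSFS_SCSI_DISK:-/sys/class/scsi_disk}"

# Show what the kernel currently assumes for a disk given as H:C:T:L.
show_cache_type() {
    cat "$SYSFS_SCSI_DISK/$1/cache_type"
}

# Tell the kernel to treat the disk as write-back. The "temporary" prefix
# changes only the kernel's assumption; no mode change is sent to the drive.
assume_write_back() {
    printf 'temporary write back\n' > "$SYSFS_SCSI_DISK/$1/cache_type"
}
```

For example, `show_cache_type 0:0:0:0` should print "write through" for the log above, and `assume_write_back 0:0:0:0` would switch the assumption until the device is detached or the system is rebooted.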
