pjft
Posts: 8
Joined: Fri Jan 20, 2017 11:02 pm

mount.exfat out of memory during rsync

Thu Apr 23, 2020 2:19 pm

Hi all,

Apologies for the topic. I've searched a fair bit by now and I can't find much so I thought I'd ask here.

I am trying to run rsync between 2 local 4TB USB drives on a RPi 3B+ - an EXT4 one and a EXFAT one.

The rsync process had been running for a few days and at some point it failed with the following messages:

Code: Select all

rsync: write failed on "<destination path>": Transport endpoint is not connected (107)
rsync: stat "<destination path temp file>" failed: Transport endpoint is not connected (107)
rsync: rename "<destination path temp file>" -> "<destination path>": Transport endpoint is not connected (107)
rsync error: error in file IO (code 11) at receiver.c(374) [receiver=3.1.3]

After running fsck, rebooting and remounting, the rsync process failed again immediately at that same file (which was the first to re-start the sync process on, meaning that the previous ones went well).

After digging through logs and whatnot I ended up checking dmesg and this was there (messages from unmounting to mounting again, to then failing):

Code: Select all

[ 2967.675957] usb 1-1.1.2.4: USB disconnect, device number 8
[ 2984.351687] usb 1-1.1.2.4: new high-speed USB device number 9 using dwc_otg
[ 2984.484027] usb 1-1.1.2.4: New USB device found, idVendor=1058, idProduct=25a1, bcdDevice=10.14
[ 2984.484044] usb 1-1.1.2.4: New USB device strings: Mfr=2, Product=3, SerialNumber=1
[ 2984.484054] usb 1-1.1.2.4: Product: Elements 25A1
[ 2984.484063] usb 1-1.1.2.4: Manufacturer: Western Digital
[ 2984.484073] usb 1-1.1.2.4: SerialNumber: 575836314435385041355A50
[ 2984.484948] usb-storage 1-1.1.2.4:1.0: USB Mass Storage device detected
[ 2984.485527] scsi host2: usb-storage 1-1.1.2.4:1.0
[ 2985.532857] scsi 2:0:0:0: Direct-Access     WD       Elements 25A1    1014 PQ: 0 ANSI: 6
[ 2985.534172] sd 2:0:0:0: Attached scsi generic sg2 type 0
[ 2985.535745] sd 2:0:0:0: [sdc] Spinning up disk...
[ 2986.571702] .......ready
[ 2992.813156] sd 2:0:0:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
[ 2992.813551] sd 2:0:0:0: [sdc] 7813969920 512-byte logical blocks: (4.00 TB/3.64 TiB)
[ 2992.813563] sd 2:0:0:0: [sdc] 4096-byte physical blocks
[ 2992.814100] sd 2:0:0:0: [sdc] Write Protect is off
[ 2992.814112] sd 2:0:0:0: [sdc] Mode Sense: 47 00 10 08
[ 2992.814621] sd 2:0:0:0: [sdc] No Caching mode page found
[ 2992.814637] sd 2:0:0:0: [sdc] Assuming drive cache: write through
[ 2992.924994]  sdc: sdc1
[ 2992.928509] sd 2:0:0:0: [sdc] Attached SCSI disk
[ 5248.659439] rsync invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[ 5248.659447] rsync cpuset=/ mems_allowed=0
[ 5248.659464] CPU: 2 PID: 4266 Comm: rsync Tainted: G         C        4.19.75-v7+ #1270
[ 5248.659467] Hardware name: BCM2835
[ 5248.659489] [<80111fcc>] (unwind_backtrace) from [<8010d544>] (show_stack+0x20/0x24)
[ 5248.659499] [<8010d544>] (show_stack) from [<80819bc0>] (dump_stack+0xd4/0x118)
[ 5248.659511] [<80819bc0>] (dump_stack) from [<8023cfe0>] (dump_header+0x80/0x250)
[ 5248.659522] [<8023cfe0>] (dump_header) from [<8023c358>] (oom_kill_process+0x358/0x3a8)
[ 5248.659532] [<8023c358>] (oom_kill_process) from [<8023cc88>] (out_of_memory+0x134/0x36c)
[ 5248.659541] [<8023cc88>] (out_of_memory) from [<802430e0>] (__alloc_pages_nodemask+0x1024/0x1178)
[ 5248.659549] [<802430e0>] (__alloc_pages_nodemask) from [<802384dc>] (filemap_fault+0x5d0/0x704)
[ 5248.659559] [<802384dc>] (filemap_fault) from [<8036fd0c>] (ext4_filemap_fault+0x38/0x4c)
[ 5248.659570] [<8036fd0c>] (ext4_filemap_fault) from [<802710e8>] (__do_fault+0x5c/0x194)
[ 5248.659581] [<802710e8>] (__do_fault) from [<802759a4>] (handle_mm_fault+0x568/0xd68)
[ 5248.659590] [<802759a4>] (handle_mm_fault) from [<80837810>] (do_page_fault+0x330/0x3a4)
[ 5248.659600] [<80837810>] (do_page_fault) from [<801167ec>] (do_PrefetchAbort+0x54/0xd8)
[ 5248.659609] [<801167ec>] (do_PrefetchAbort) from [<80101f24>] (ret_from_exception+0x0/0x1c)
[ 5248.659613] Exception stack(0x81781fb0 to 0x81781ff8)
[ 5248.659618] 1fa0:                                     0000004b 7edc9094 00000ff8 00000000
[ 5248.659625] 1fc0: 00000000 01d73ae0 7edc9094 000927c0 00000000 00000002 00082bd0 00000001
[ 5248.659631] 1fe0: 000000ff 7edc8f48 00000000 00029960 a0000010 ffffffff
[ 5248.659634] Mem-Info:
[ 5248.659647] active_anon:108555 inactive_anon:109407 isolated_anon:0
                active_file:549 inactive_file:598 isolated_file:14
                unevictable:454 dirty:0 writeback:0 unstable:0
                slab_reclaimable:7651 slab_unreclaimable:12616
                mapped:3281 shmem:6313 pagetables:2011 bounce:0
                free:4139 free_pcp:111 free_cma:0
[ 5248.659656] Node 0 active_anon:434220kB inactive_anon:437628kB active_file:2196kB inactive_file:2392kB unevictable:1816kB isolated(anon):0kB isolated(file):56kB mapped:13124kB dirty:0kB writeback:0kB shmem:25252kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 5248.659668] Normal free:16556kB min:16384kB low:20480kB high:24576kB active_anon:434220kB inactive_anon:437628kB active_file:2400kB inactive_file:2440kB unevictable:1816kB writepending:0kB present:1021952kB managed:999036kB mlocked:1816kB kernel_stack:1792kB pagetables:8044kB bounce:0kB free_pcp:444kB local_pcp:444kB free_cma:0kB
[ 5248.659670] lowmem_reserve[]: 0 0
[ 5248.659679] Normal: 151*4kB (UMH) 43*8kB (UMEH) 885*16kB (UMH) 49*32kB (UEH) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16676kB
[ 5248.659724] 9036 total pagecache pages
[ 5248.659728] 1171 pages in swap cache
[ 5248.659732] Swap cache stats: add 34034, delete 32863, find 372967/377461
[ 5248.659734] Free swap  = 0kB
[ 5248.659737] Total swap = 102396kB
[ 5248.659740] 255488 pages RAM
[ 5248.659743] 0 pages HighMem/MovableOnly
[ 5248.659745] 5729 pages reserved
[ 5248.659748] 2048 pages cma reserved
[ 5248.659751] Tasks state (memory values in pages):
[ 5248.659754] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 5248.659777] [    136]     0   136     4868      927   108544       19             0 systemd-journal
[ 5248.659784] [    152]     0   152     1139       44    12288        0             0 blkmapd
[ 5248.659791] [    166]     0   166     4552      428    20480       44         -1000 systemd-udevd
[ 5248.659797] [    172]   101   172     3608      158    20480       11             0 systemd-network
[ 5248.659804] [    297]   102   297     4781      516    28672       34             0 systemd-resolve
[ 5248.659811] [    299]     0   299      699      297    10240        1             0 rpc.idmapd
[ 5248.659818] [    632]     0   632     3252      269    20480       31             0 systemd-logind
[ 5248.659824] [    635]     0   635      923      254    14336        0             0 alsactl
[ 5248.659831] [    647]   108   647     1474      351    16384        3             0 avahi-daemon
[ 5248.659838] [    657]   104   657     1672      431    16384        2          -900 dbus-daemon
[ 5248.659844] [    663]   108   663     1443       63    12288        1             0 avahi-daemon
[ 5248.659851] [    666] 65534   666     1080      350    10240        0             0 thd
[ 5248.659857] [    667]     0   667     6450      373    20480       39             0 rsyslogd
[ 5248.659864] [    673]     0   673     2677      419    20480        0             0 wpa_supplicant
[ 5248.659870] [    785]     0   785      708      335    12288        8             0 dhcpcd
[ 5248.659877] [    788]     0   788    55082      448    75776     1010             0 php-fpm7.3
[ 5248.659884] [    795]     0   795     8900      346    45056     2627             0 salt-minion
[ 5248.659890] [    798] 65534   798     2105      166    14336       24             0 openvpn
[ 5248.659897] [    804]   110   804     2901      351    10240        5             0 chronyd
[ 5248.659904] [    805] 65534   805     5355     1108    28672      807             0 python3
[ 5248.659910] [    807]     0   807     7799      541    40960       69             0 nmbd
[ 5248.659917] [    810]   110   810      928      321    12288        1             0 chronyd
[ 5248.659923] [    813]     0   813    17132     1886    43008       82             0 fail2ban-server
[ 5248.659930] [    823]     0   823     2673      275    18432       97         -1000 sshd
[ 5248.659937] [    830]   105   830     1644      415    16384        8             0 rpcbind
[ 5248.659943] [    859]     0   859     7908     1506    43008     4976             0 rpc.mountd
[ 5248.659950] [    864]   115   864    11730      415    32768      174             0 redis-server
[ 5248.659959] [    905]   114   905   183212    45744   311296     3621             0 mysqld
[ 5248.659965] [    912]     0   912    12548       33    32768      256             0 nginx
[ 5248.659972] [    913]    33   913    12588      340    32768      225             0 nginx
[ 5248.659978] [    914]    33   914    12614      334    32768      224             0 nginx
[ 5248.659985] [    915]    33   915    12683      378    32768      219             0 nginx
[ 5248.659991] [    916]    33   916    12588      276    32768      230             0 nginx
[ 5248.659998] [    931]     0   931    11889      822    53248       15             0 smbd
[ 5248.660004] [    937]     0   937    22573      950    65536     1053             0 omv-engined
[ 5248.660011] [    942]    33   942    55075      259    67584     1031             0 php-fpm7.3
[ 5248.660018] [    943]    33   943    55075      259    67584     1031             0 php-fpm7.3
[ 5248.660024] [    973]     0   973    11160      686    51200       16             0 smbd-notifyd
[ 5248.660031] [    974]     0   974    11158      680    49152       16             0 cleanupd
[ 5248.660038] [    993]     0   993      535       32     8192        0             0 hciattach
[ 5248.660044] [    998]     0   998     2384       87    16384        9             0 bluetoothd
[ 5248.660051] [   1101]     0  1101    16414     6225    65536      954             0 salt-minion
[ 5248.660058] [   1107]     0  1107    10005      263    24576      105             0 master
[ 5248.660065] [   1108]   109  1108    10025      413    24576        0             0 pickup
[ 5248.660071] [   1109]   109  1109    10066      350    24576        1             0 qmgr
[ 5248.660078] [   1119]   109  1119    10092      336    24576        5             0 tlsmgr
[ 5248.660084] [   1133]     0  1133    12158     1964    47104     1857             0 salt-minion
[ 5248.660092] [   1154]  1000  1154    11074      407    32768        2             0 transmission-da
[ 5248.660098] [   1171]     0  1171     1791      371    14336        9             0 cron
[ 5248.660104] [   1190]     0  1190     6914      266    18432        1             0 rngd
[ 5248.660110] [   1197]     0  1197     1077      257    12288        1             0 agetty
[ 5248.660117] [   1212]     0  1212     3855      480    16384        2             0 monit
[ 5248.660123] [   1239]     0  1239    55403      699    79872      887             0 apache2
[ 5248.660130] [   1240]    33  1240    61520     2432   118784      775             0 apache2
[ 5248.660137] [   1242]    33  1242    73947    15295   169984      777             0 apache2
[ 5248.660143] [   1243]    33  1243    73453    14797   169984      802             0 apache2
[ 5248.660150] [   1490]     0  1490      484      349    12288        0         -1000 watchdog
[ 5248.660156] [   1549]    33  1549    77924    18617   178176      772             0 apache2
[ 5248.660163] [   1710]     0  1710     3051      513    20480       35             0 sshd
[ 5248.660169] [   1716]  1000  1716     3676      437    20480       28             0 systemd
[ 5248.660175] [   1719]  1000  1719     8666      296    28672      413             0 (sd-pam)
[ 5248.660182] [   1734]  1000  1734     3051      433    20480       22             0 sshd
[ 5248.660188] [   1737]  1000  1737     1967      620    18432        0             0 bash
[ 5248.660195] [   2143]    33  2143    73385    14674   165888      787             0 apache2
[ 5248.660202] [   2511]    33  2511    61536     2540   118784      772             0 apache2
[ 5248.660208] [   2512]    33  2512    62356     3600   122880      771             0 apache2
[ 5248.660215] [   2636]    33  2636    61508     2766   118784      772             0 apache2
[ 5248.660221] [   2781]    33  2781    60968     2493   116736      773             0 apache2
[ 5248.660228] [   2902]    33  2902    57050     1990    98304      796             0 apache2
[ 5248.660237] [   4189]     0  4189    68937    68370   286720        0             0 mount.exfat
[ 5248.660244] [   4261]     0  4261     2291      416    18432        0             0 sudo
[ 5248.660250] [   4266]     0  4266    13376     3071    63488        0             0 rsync
[ 5248.660256] [   4267]     0  4267     2861     1659    20480        0             0 rsync
[ 5248.660263] [   4268]     0  4268    13868     2792    63488        0             0 rsync
[ 5248.660270] [   4948]     0  4948     1518      552    12288        0             0 smartd
[ 5248.660280] [   5721]   999  5721    55163     1256    79872      733             0 php-fpm7.3
[ 5248.660287] [   6293]   999  6293    55163     1279    79872      729             0 php-fpm7.3
[ 5248.660294] Out of memory: Kill process 4189 (mount.exfat) score 248 or sacrifice child
[ 5248.660309] Killed process 4189 (mount.exfat) total-vm:275748kB, anon-rss:273480kB, file-rss:0kB, shmem-rss:0kB
[ 5248.750080] oom_reaper: reaped process 4189 (mount.exfat), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

After logging in to a separate SSH session, and running htop, I do see the process /sbin/mount.exfat grow in memory usage from a few kb to almost 300MB and then being killed.

Searching about rsync and memory issues, I found somewhat conflicting and unreliable evidence towards using the option --no-inc-recursive as an option to improve memory performance. While it wasn't clear that that would help, it did help make it move past the previously repeatable crash and progress for several more hours. Alas, after 16h it crashed again, and I'm kind of at a loss.

There are recommendations around splitting the rsync in smaller folders, but other than workarounds I'd really try to look into what could help address the underlying issue rather than the symptoms - otherwise I fear that in a few months we'll be back at this.

Some more info:

OS version:
Linux 4.19.75-v7+ #1270 SMP Tue Sep 24 18:45:11 BST 2019 armv7l GNU/Linux

Original Rsync command:

Code: Select all

sudo rsync --delete --partial --ignore-errors --no-perms --no-owner --no-group -rvzhtiP --exclude nextcloud/ /srv/dev-disk-by-label-HDDCOPY/  /mnt/backup-exfat/

Current Rsync command (added modify-window as well because FAT and daylight savings were marking everything as needed to be synced):

Code: Select all

sudo rsync --delete --partial --ignore-errors --no-perms --no-owner --no-group -rvzhtiP --exclude nextcloud/ --no-inc-recursive --modify-window=3601 /srv/dev-disk-by-label-HDDCOPY/  /mnt/backup-exfat/
Number of files being synced: around 520k

Mount runtime options (from /proc/mounts):

Code: Select all

/dev/sdc1 /mnt/backup-exfat fuseblk rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other,blksize=4096 0 0

I read about a exfat-nofuse project that seems to potentially offer a non-FUSE driver for EXFAT, but at least until further evidence I'd rather stick to better supported alternatives.

I ran smartctl on the drive, and both smartctl tests and fsck suggest no errors (the drive is only a few months old, and not often used). The previous backup I had performed had been in a RPi 4 with 4GB of RAM, which might have made the problem not surface, but now I'm using an older RPi 3B+ to that effect.

I have yet again unmounted it, mounted it and started the sync process again - some more files seem to be transferred - but the mount.exfat process is taking up 294MB of memory right now, and the free stats are:

Code: Select all

              total        used        free      shared  buff/cache   available
Mem:         999036      812976       26588       48132      159472       72912
Swap:        102396      102396           0

so I expect that sooner or later it'll die on me again.

I'm also somewhat mindful of the possibility of these repeated crashes damaging or corrupting the actual drive.

Small note (and don't read this the wrong way, but just to try to focus the thread): there are reasons for the EXFAT one to be EXFAT and not another filesystem, mostly around sharing that drive with several other OSes at home, so unless it is deemed critical I'd rather stray from the common suggestion of "why not format it in EXTx".

Also, same goes for using the RPi 3B+ for this, but happy to discuss further. I suppose I could connect the drive on my laptop and run the rsync using the local network, but... that just feels... sub-optimal and inefficient, not to mention that it would take even longer as we're now capping the transfer speed to the Pi's Ethernet transfer rate.

Any pointers, requests for more detailed information or suggestions are more than welcome.

Thanks in advance for your time, and stay healthy and safe!

EDIT: added OS version.

Return to “Troubleshooting”