Apologies for the topic. I've searched a fair bit by now and can't find much, so I thought I'd ask here.
I am trying to run rsync between two local 4 TB USB drives on an RPi 3B+ - one EXT4 and one exFAT.
The rsync process had been running for a few days and at some point it failed with the following messages:
Code:
rsync: write failed on "<destination path>": Transport endpoint is not connected (107)
rsync: stat "<destination path temp file>" failed: Transport endpoint is not connected (107)
rsync: rename "<destination path temp file>" -> "<destination path>": Transport endpoint is not connected (107)
rsync error: error in file IO (code 11) at receiver.c(374) [receiver=3.1.3]
After running fsck, rebooting and remounting, the rsync process failed again immediately at that same file (which was the first file in the restarted sync, meaning the previous ones had gone through fine).
After digging through logs and whatnot I ended up checking dmesg and this was there (messages from unmounting to mounting again, to then failing):
Code:
[ 2967.675957] usb 1-1.1.2.4: USB disconnect, device number 8
[ 2984.351687] usb 1-1.1.2.4: new high-speed USB device number 9 using dwc_otg
[ 2984.484027] usb 1-1.1.2.4: New USB device found, idVendor=1058, idProduct=25a1, bcdDevice=10.14
[ 2984.484044] usb 1-1.1.2.4: New USB device strings: Mfr=2, Product=3, SerialNumber=1
[ 2984.484054] usb 1-1.1.2.4: Product: Elements 25A1
[ 2984.484063] usb 1-1.1.2.4: Manufacturer: Western Digital
[ 2984.484073] usb 1-1.1.2.4: SerialNumber: 575836314435385041355A50
[ 2984.484948] usb-storage 1-1.1.2.4:1.0: USB Mass Storage device detected
[ 2984.485527] scsi host2: usb-storage 1-1.1.2.4:1.0
[ 2985.532857] scsi 2:0:0:0: Direct-Access WD Elements 25A1 1014 PQ: 0 ANSI: 6
[ 2985.534172] sd 2:0:0:0: Attached scsi generic sg2 type 0
[ 2985.535745] sd 2:0:0:0: [sdc] Spinning up disk...
[ 2986.571702] .......ready
[ 2992.813156] sd 2:0:0:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
[ 2992.813551] sd 2:0:0:0: [sdc] 7813969920 512-byte logical blocks: (4.00 TB/3.64 TiB)
[ 2992.813563] sd 2:0:0:0: [sdc] 4096-byte physical blocks
[ 2992.814100] sd 2:0:0:0: [sdc] Write Protect is off
[ 2992.814112] sd 2:0:0:0: [sdc] Mode Sense: 47 00 10 08
[ 2992.814621] sd 2:0:0:0: [sdc] No Caching mode page found
[ 2992.814637] sd 2:0:0:0: [sdc] Assuming drive cache: write through
[ 2992.924994] sdc: sdc1
[ 2992.928509] sd 2:0:0:0: [sdc] Attached SCSI disk
[ 5248.659439] rsync invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[ 5248.659447] rsync cpuset=/ mems_allowed=0
[ 5248.659464] CPU: 2 PID: 4266 Comm: rsync Tainted: G C 4.19.75-v7+ #1270
[ 5248.659467] Hardware name: BCM2835
[ 5248.659489] [<80111fcc>] (unwind_backtrace) from [<8010d544>] (show_stack+0x20/0x24)
[ 5248.659499] [<8010d544>] (show_stack) from [<80819bc0>] (dump_stack+0xd4/0x118)
[ 5248.659511] [<80819bc0>] (dump_stack) from [<8023cfe0>] (dump_header+0x80/0x250)
[ 5248.659522] [<8023cfe0>] (dump_header) from [<8023c358>] (oom_kill_process+0x358/0x3a8)
[ 5248.659532] [<8023c358>] (oom_kill_process) from [<8023cc88>] (out_of_memory+0x134/0x36c)
[ 5248.659541] [<8023cc88>] (out_of_memory) from [<802430e0>] (__alloc_pages_nodemask+0x1024/0x1178)
[ 5248.659549] [<802430e0>] (__alloc_pages_nodemask) from [<802384dc>] (filemap_fault+0x5d0/0x704)
[ 5248.659559] [<802384dc>] (filemap_fault) from [<8036fd0c>] (ext4_filemap_fault+0x38/0x4c)
[ 5248.659570] [<8036fd0c>] (ext4_filemap_fault) from [<802710e8>] (__do_fault+0x5c/0x194)
[ 5248.659581] [<802710e8>] (__do_fault) from [<802759a4>] (handle_mm_fault+0x568/0xd68)
[ 5248.659590] [<802759a4>] (handle_mm_fault) from [<80837810>] (do_page_fault+0x330/0x3a4)
[ 5248.659600] [<80837810>] (do_page_fault) from [<801167ec>] (do_PrefetchAbort+0x54/0xd8)
[ 5248.659609] [<801167ec>] (do_PrefetchAbort) from [<80101f24>] (ret_from_exception+0x0/0x1c)
[ 5248.659613] Exception stack(0x81781fb0 to 0x81781ff8)
[ 5248.659618] 1fa0: 0000004b 7edc9094 00000ff8 00000000
[ 5248.659625] 1fc0: 00000000 01d73ae0 7edc9094 000927c0 00000000 00000002 00082bd0 00000001
[ 5248.659631] 1fe0: 000000ff 7edc8f48 00000000 00029960 a0000010 ffffffff
[ 5248.659634] Mem-Info:
[ 5248.659647] active_anon:108555 inactive_anon:109407 isolated_anon:0
active_file:549 inactive_file:598 isolated_file:14
unevictable:454 dirty:0 writeback:0 unstable:0
slab_reclaimable:7651 slab_unreclaimable:12616
mapped:3281 shmem:6313 pagetables:2011 bounce:0
free:4139 free_pcp:111 free_cma:0
[ 5248.659656] Node 0 active_anon:434220kB inactive_anon:437628kB active_file:2196kB inactive_file:2392kB unevictable:1816kB isolated(anon):0kB isolated(file):56kB mapped:13124kB dirty:0kB writeback:0kB shmem:25252kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 5248.659668] Normal free:16556kB min:16384kB low:20480kB high:24576kB active_anon:434220kB inactive_anon:437628kB active_file:2400kB inactive_file:2440kB unevictable:1816kB writepending:0kB present:1021952kB managed:999036kB mlocked:1816kB kernel_stack:1792kB pagetables:8044kB bounce:0kB free_pcp:444kB local_pcp:444kB free_cma:0kB
[ 5248.659670] lowmem_reserve[]: 0 0
[ 5248.659679] Normal: 151*4kB (UMH) 43*8kB (UMEH) 885*16kB (UMH) 49*32kB (UEH) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16676kB
[ 5248.659724] 9036 total pagecache pages
[ 5248.659728] 1171 pages in swap cache
[ 5248.659732] Swap cache stats: add 34034, delete 32863, find 372967/377461
[ 5248.659734] Free swap = 0kB
[ 5248.659737] Total swap = 102396kB
[ 5248.659740] 255488 pages RAM
[ 5248.659743] 0 pages HighMem/MovableOnly
[ 5248.659745] 5729 pages reserved
[ 5248.659748] 2048 pages cma reserved
[ 5248.659751] Tasks state (memory values in pages):
[ 5248.659754] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 5248.659777] [ 136] 0 136 4868 927 108544 19 0 systemd-journal
[ 5248.659784] [ 152] 0 152 1139 44 12288 0 0 blkmapd
[ 5248.659791] [ 166] 0 166 4552 428 20480 44 -1000 systemd-udevd
[ 5248.659797] [ 172] 101 172 3608 158 20480 11 0 systemd-network
[ 5248.659804] [ 297] 102 297 4781 516 28672 34 0 systemd-resolve
[ 5248.659811] [ 299] 0 299 699 297 10240 1 0 rpc.idmapd
[ 5248.659818] [ 632] 0 632 3252 269 20480 31 0 systemd-logind
[ 5248.659824] [ 635] 0 635 923 254 14336 0 0 alsactl
[ 5248.659831] [ 647] 108 647 1474 351 16384 3 0 avahi-daemon
[ 5248.659838] [ 657] 104 657 1672 431 16384 2 -900 dbus-daemon
[ 5248.659844] [ 663] 108 663 1443 63 12288 1 0 avahi-daemon
[ 5248.659851] [ 666] 65534 666 1080 350 10240 0 0 thd
[ 5248.659857] [ 667] 0 667 6450 373 20480 39 0 rsyslogd
[ 5248.659864] [ 673] 0 673 2677 419 20480 0 0 wpa_supplicant
[ 5248.659870] [ 785] 0 785 708 335 12288 8 0 dhcpcd
[ 5248.659877] [ 788] 0 788 55082 448 75776 1010 0 php-fpm7.3
[ 5248.659884] [ 795] 0 795 8900 346 45056 2627 0 salt-minion
[ 5248.659890] [ 798] 65534 798 2105 166 14336 24 0 openvpn
[ 5248.659897] [ 804] 110 804 2901 351 10240 5 0 chronyd
[ 5248.659904] [ 805] 65534 805 5355 1108 28672 807 0 python3
[ 5248.659910] [ 807] 0 807 7799 541 40960 69 0 nmbd
[ 5248.659917] [ 810] 110 810 928 321 12288 1 0 chronyd
[ 5248.659923] [ 813] 0 813 17132 1886 43008 82 0 fail2ban-server
[ 5248.659930] [ 823] 0 823 2673 275 18432 97 -1000 sshd
[ 5248.659937] [ 830] 105 830 1644 415 16384 8 0 rpcbind
[ 5248.659943] [ 859] 0 859 7908 1506 43008 4976 0 rpc.mountd
[ 5248.659950] [ 864] 115 864 11730 415 32768 174 0 redis-server
[ 5248.659959] [ 905] 114 905 183212 45744 311296 3621 0 mysqld
[ 5248.659965] [ 912] 0 912 12548 33 32768 256 0 nginx
[ 5248.659972] [ 913] 33 913 12588 340 32768 225 0 nginx
[ 5248.659978] [ 914] 33 914 12614 334 32768 224 0 nginx
[ 5248.659985] [ 915] 33 915 12683 378 32768 219 0 nginx
[ 5248.659991] [ 916] 33 916 12588 276 32768 230 0 nginx
[ 5248.659998] [ 931] 0 931 11889 822 53248 15 0 smbd
[ 5248.660004] [ 937] 0 937 22573 950 65536 1053 0 omv-engined
[ 5248.660011] [ 942] 33 942 55075 259 67584 1031 0 php-fpm7.3
[ 5248.660018] [ 943] 33 943 55075 259 67584 1031 0 php-fpm7.3
[ 5248.660024] [ 973] 0 973 11160 686 51200 16 0 smbd-notifyd
[ 5248.660031] [ 974] 0 974 11158 680 49152 16 0 cleanupd
[ 5248.660038] [ 993] 0 993 535 32 8192 0 0 hciattach
[ 5248.660044] [ 998] 0 998 2384 87 16384 9 0 bluetoothd
[ 5248.660051] [ 1101] 0 1101 16414 6225 65536 954 0 salt-minion
[ 5248.660058] [ 1107] 0 1107 10005 263 24576 105 0 master
[ 5248.660065] [ 1108] 109 1108 10025 413 24576 0 0 pickup
[ 5248.660071] [ 1109] 109 1109 10066 350 24576 1 0 qmgr
[ 5248.660078] [ 1119] 109 1119 10092 336 24576 5 0 tlsmgr
[ 5248.660084] [ 1133] 0 1133 12158 1964 47104 1857 0 salt-minion
[ 5248.660092] [ 1154] 1000 1154 11074 407 32768 2 0 transmission-da
[ 5248.660098] [ 1171] 0 1171 1791 371 14336 9 0 cron
[ 5248.660104] [ 1190] 0 1190 6914 266 18432 1 0 rngd
[ 5248.660110] [ 1197] 0 1197 1077 257 12288 1 0 agetty
[ 5248.660117] [ 1212] 0 1212 3855 480 16384 2 0 monit
[ 5248.660123] [ 1239] 0 1239 55403 699 79872 887 0 apache2
[ 5248.660130] [ 1240] 33 1240 61520 2432 118784 775 0 apache2
[ 5248.660137] [ 1242] 33 1242 73947 15295 169984 777 0 apache2
[ 5248.660143] [ 1243] 33 1243 73453 14797 169984 802 0 apache2
[ 5248.660150] [ 1490] 0 1490 484 349 12288 0 -1000 watchdog
[ 5248.660156] [ 1549] 33 1549 77924 18617 178176 772 0 apache2
[ 5248.660163] [ 1710] 0 1710 3051 513 20480 35 0 sshd
[ 5248.660169] [ 1716] 1000 1716 3676 437 20480 28 0 systemd
[ 5248.660175] [ 1719] 1000 1719 8666 296 28672 413 0 (sd-pam)
[ 5248.660182] [ 1734] 1000 1734 3051 433 20480 22 0 sshd
[ 5248.660188] [ 1737] 1000 1737 1967 620 18432 0 0 bash
[ 5248.660195] [ 2143] 33 2143 73385 14674 165888 787 0 apache2
[ 5248.660202] [ 2511] 33 2511 61536 2540 118784 772 0 apache2
[ 5248.660208] [ 2512] 33 2512 62356 3600 122880 771 0 apache2
[ 5248.660215] [ 2636] 33 2636 61508 2766 118784 772 0 apache2
[ 5248.660221] [ 2781] 33 2781 60968 2493 116736 773 0 apache2
[ 5248.660228] [ 2902] 33 2902 57050 1990 98304 796 0 apache2
[ 5248.660237] [ 4189] 0 4189 68937 68370 286720 0 0 mount.exfat
[ 5248.660244] [ 4261] 0 4261 2291 416 18432 0 0 sudo
[ 5248.660250] [ 4266] 0 4266 13376 3071 63488 0 0 rsync
[ 5248.660256] [ 4267] 0 4267 2861 1659 20480 0 0 rsync
[ 5248.660263] [ 4268] 0 4268 13868 2792 63488 0 0 rsync
[ 5248.660270] [ 4948] 0 4948 1518 552 12288 0 0 smartd
[ 5248.660280] [ 5721] 999 5721 55163 1256 79872 733 0 php-fpm7.3
[ 5248.660287] [ 6293] 999 6293 55163 1279 79872 729 0 php-fpm7.3
[ 5248.660294] Out of memory: Kill process 4189 (mount.exfat) score 248 or sacrifice child
[ 5248.660309] Killed process 4189 (mount.exfat) total-vm:275748kB, anon-rss:273480kB, file-rss:0kB, shmem-rss:0kB
[ 5248.750080] oom_reaper: reaped process 4189 (mount.exfat), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
After logging in to a separate SSH session and running htop, I can see the /sbin/mount.exfat process grow in memory usage from a few KB to almost 300 MB before it is killed.
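As a rough way to capture that growth outside htop, a loop like this can log the resident set size until the process dies (the process name is taken from the OOM report above; the 10-second interval is an arbitrary choice):

```shell
# Sample the VmRSS of the running mount.exfat every 10 seconds.
# pgrep -o picks the oldest matching PID; adjust if several are running.
pid=$(pgrep -o mount.exfat)
while kill -0 "$pid" 2>/dev/null; do
    awk -v t="$(date +%T)" '/^VmRSS/ {print t, $2, $3}' "/proc/$pid/status"
    sleep 10
done
```

Redirecting that to a file should at least show whether the growth is steady or correlates with particular files.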
Searching for rsync memory issues, I found somewhat conflicting and anecdotal evidence that the --no-inc-recursive option improves memory behaviour. While it wasn't clear it would help, it did let the sync move past the previously repeatable crash and progress for several more hours. Alas, after 16h it crashed again, and I'm kind of at a loss.
There are recommendations to split the rsync into smaller folders, but rather than workarounds I'd really like to address the underlying issue instead of the symptoms - otherwise I fear that in a few months we'll be back at this.
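For what it's worth, the folder-splitting workaround would look roughly like this (paths and options copied from my command above; this is a sketch I haven't validated end-to-end):

```shell
# Run one rsync per top-level directory so each invocation builds a
# smaller file list and holds less state at once.
SRC=/srv/dev-disk-by-label-HDDCOPY
DST=/mnt/backup-exfat
for d in "$SRC"/*/; do
    name=$(basename "$d")
    [ "$name" = "nextcloud" ] && continue   # keep the original exclude
    sudo rsync --delete --partial --ignore-errors --no-perms --no-owner \
        --no-group -rvzhtiP --modify-window=3601 "$d" "$DST/$name/"
done
```

One caveat: --delete then only applies within each top-level directory, so directories removed at the top level of the source would never be cleaned up on the destination.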
Some more info:
OS version:
Linux 4.19.75-v7+ #1270 SMP Tue Sep 24 18:45:11 BST 2019 armv7l GNU/Linux
Original Rsync command:
Code:
sudo rsync --delete --partial --ignore-errors --no-perms --no-owner --no-group -rvzhtiP --exclude nextcloud/ /srv/dev-disk-by-label-HDDCOPY/ /mnt/backup-exfat/
Current rsync command (I added --modify-window because the coarse FAT timestamps plus daylight saving time were marking everything as needing to be synced):
Code:
sudo rsync --delete --partial --ignore-errors --no-perms --no-owner --no-group -rvzhtiP --exclude nextcloud/ --no-inc-recursive --modify-window=3601 /srv/dev-disk-by-label-HDDCOPY/ /mnt/backup-exfat/
Mount runtime options (from /proc/mounts):
Code:
/dev/sdc1 /mnt/backup-exfat fuseblk rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other,blksize=4096 0 0
I read about an exfat-nofuse project that seems to offer a non-FUSE kernel driver for exFAT, but at least until further evidence I'd rather stick to better-supported alternatives.
I ran smartctl on the drive, and both the smartctl self-tests and fsck report no errors (the drive is only a few months old and not heavily used). The previous backup had been performed on an RPi 4 with 4 GB of RAM, which might be why the problem didn't surface then; now I'm using an older RPi 3B+ for this.
I have yet again unmounted the drive, mounted it, and restarted the sync - some more files are being transferred - but the mount.exfat process is already using 294 MB of memory, and the free stats are:
Code:
total used free shared buff/cache available
Mem: 999036 812976 26588 48132 159472 72912
Swap: 102396 102396 0
so I expect that sooner or later it'll die on me again.
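One thing I'm considering in the meantime (Raspbian-specific, and the 1024 MB value is just a guess) is enlarging the default 100 MB swap file, since "Free swap = 0kB" in the OOM report suggests it's exhausted well before the kill:

```shell
# Raise the dphys-swapfile size (Raspbian's swap manager) and rebuild
# the swap file. CONF_SWAPSIZE is in MB; 1024 is an assumed value.
sudo sed -i 's/^CONF_SWAPSIZE=.*/CONF_SWAPSIZE=1024/' /etc/dphys-swapfile
sudo dphys-swapfile swapoff
sudo dphys-swapfile setup
sudo dphys-swapfile swapon
```

Of course, swap on an SD card is slow and wears the card, and this only delays the OOM kill rather than fixing whatever makes mount.exfat grow.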
I'm also somewhat mindful of the possibility of these repeated crashes damaging or corrupting the actual drive.
Small note (and don't read this the wrong way - just trying to keep the thread focused): there are reasons for the exFAT drive to be exFAT and not another filesystem, mostly around sharing it with several other OSes at home, so unless it's deemed critical I'd rather steer clear of the common suggestion of "why not format it as EXTx".
The same goes for using the RPi 3B+ for this, but I'm happy to discuss it further. I suppose I could connect the drive to my laptop and run the rsync over the local network, but that feels sub-optimal and inefficient, not to mention it would take even longer since we'd then be capped at the Pi's Ethernet speed.
Any pointers, requests for more detailed information or suggestions are more than welcome.
Thanks in advance for your time, and stay healthy and safe!
EDIT: added OS version.