* btrfs with huge numbers of hardlinks is extremely slow
From: Andrey Melnikov @ 2021-11-25 21:56 UTC
To: linux-btrfs, Andrey Melnikov

Every night a new backup is stored on this fs with 'rsync
--link-dest=$yestoday/ $today/' - 1402700 hardlinks and 23000
directories are created, and 50-100 normal files are transferred.
Now the FS contains 351 copies of the backup data with 486086495 hardlinks,
and ANY operation on this FS takes significant time. For example,
simply counting hardlinks with
"time find . -type f -links +1 | wc -l" takes:
real    28567m33.611s
user    31m33.395s
sys     506m28.576s

That is 19 days 20 hours 10 mins with constant reads from storage at 2-4 MB/s.

- Is btrfs not suitable for this workload?
- Would using reflinks help speed up FS operations?
- Is metadata that has been read not cached at all?

What was btrfs reading from the disks for 19 days???

Hardware: Dell R300 with 2 WD Purple 1TB disks on LSISAS1068E RAID 1
(without cache).
Linux nlxc 5.14-51412-generic #0~lch11 SMP Wed Oct 13 15:57:07 UTC
2021 x86_64 GNU/Linux
btrfs-progs v5.14.1

# btrfs fi show
Label: none  uuid: a840a2ca-bf05-4074-8895-60d993cb5bdd
        Total devices 1 FS bytes used 474.26GiB
        devid    1 size 931.00GiB used 502.23GiB path /dev/sdb1

# btrfs fi df /srv
Data, single: total=367.19GiB, used=343.92GiB
System, single: total=32.00MiB, used=128.00KiB
Metadata, single: total=135.00GiB, used=130.34GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

# btrfs fi us /srv
Overall:
    Device size:                 931.00GiB
    Device allocated:            502.23GiB
    Device unallocated:          428.77GiB
    Device missing:                  0.00B
    Used:                        474.26GiB
    Free (estimated):            452.04GiB      (min: 452.04GiB)
    Free (statfs, df):           452.04GiB
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,single: Size:367.19GiB, Used:343.92GiB (93.66%)
   /dev/sdb1     367.19GiB

Metadata,single: Size:135.00GiB, Used:130.34GiB (96.55%)
   /dev/sdb1     135.00GiB

System,single: Size:32.00MiB, Used:128.00KiB (0.39%)
   /dev/sdb1      32.00MiB

Unallocated:
   /dev/sdb1     428.77GiB
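For scale, the timing above can be converted into an average traversal rate. The following is a minimal sketch, assuming the 'real' time is truncated to whole seconds and that find visited every one of the 486086495 links exactly once; the function name is only for illustration.

```shell
#!/bin/sh
# Convert a wall-clock time as reported by time(1) into an average rate.
# Illustrative helper: fractional seconds are dropped and the division
# is integer division.
links_per_second() {
    links=$1      # total number of links visited
    minutes=$2    # whole minutes of 'real' time
    seconds=$3    # remaining whole seconds of 'real' time
    echo $(( links / (minutes * 60 + seconds) ))
}

# 486086495 links in 28567m33s
links_per_second 486086495 28567 33
```

With the numbers from this report the result is 283 links per second.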
* Re: btrfs with huge numbers of hardlinks is extremely slow
From: Zygo Blaxell @ 2021-11-26 5:15 UTC
To: Andrey Melnikov; +Cc: linux-btrfs

On Fri, Nov 26, 2021 at 12:56:25AM +0300, Andrey Melnikov wrote:
> Every night a new backup is stored on this fs with 'rsync
> --link-dest=$yestoday/ $today/' - 1402700 hardlinks and 23000
> directories are created, and 50-100 normal files are transferred.
> Now the FS contains 351 copies of the backup data with 486086495 hardlinks,
> and ANY operation on this FS takes significant time. For example,
> simply counting hardlinks with
> "time find . -type f -links +1 | wc -l" takes:
> real    28567m33.611s
> user    31m33.395s
> sys     506m28.576s
>
> That is 19 days 20 hours 10 mins with constant reads from storage at 2-4 MB/s.

That works out to reading the entire drive 4x in 20 days, or all of the
metadata 30x. Certainly hardlinks will not result in optimal object
placement, and you probably don't have enough RAM to cache the entire
metadata tree, and you're using WD Purple drives in a fileserver for
some reason, so those numbers seem plausible.

> - Is btrfs not suitable for this workload?

There are definitely better ways to do this on btrfs, e.g.

	btrfs sub snap $yesterday $today
	rsync ... (no link-dest) ... $today

This will avoid duplicating the entire file tree every time. It will also
store historical file attributes correctly, which --link-dest sometimes
does not.

You might also consider doing it differently:

	rsync ... (no link-dest) ... working-dir/. &&
	btrfs sub snap -r working-dir $today

so that your $today directory doesn't exist until it is complete with
no rsync errors. 'working-dir' will have to be a subvol, but you only
have to create it once and you can keep reusing it afterwards.

> - Would using reflinks help speed up FS operations?

Snapshots are lazy reflink copies, so they'll do a little better than
reflinks. You'll only modify the metadata for the 50-100 files that you
transfer each day, instead of completely rewriting all of the metadata
in the filesystem every day with hardlinks.

Hardlinks put the inodes further and further away from their directory
nodes each day, and add some extra searching overhead within directories
as well. You'll need more and more RAM to cache the same amount of
each filesystem tree, because they're all in the same metadata tree.
With snapshots they'll end up in separate metadata trees.

> - Is metadata that has been read not cached at all?

If you have less than about 640GB of RAM (4x the size of your metadata)
then you're going to be rereading metadata pages at some point. Because
you're using hardlinks, the metadata pages from different days are all
mashed together, and 'find' will flood the cache chasing references to
them.

Other recommendations:

- Use the right drive model for your workload. WD Purple drives are for
continuous video streaming; they are not for seeky millions-of-tiny-files
rsync workloads. Almost any other model will outperform them, and
better drives for this workload (e.g. CMR WD Red models) are cheaper.
Your WD Purple drives are getting 283 links/s. Compare that with some
other drive models:

	  1665 links/s:  WD Green (2x1TB + 1x2TB btrfs raid1)
	  6850 links/s:  Sandisk Extreme MicroSD (1x256GB btrfs single/dup)
	 12511 links/s:  WD Red (2x1TB btrfs raid1)
	 13371 links/s:  WD Red SSD + Seagate Ironwolf SSD (6x1TB btrfs raid1)
	 14872 links/s:  WD Black (1x1TB btrfs single/dup, 8 years old)
	 25498 links/s:  WD Gold + Seagate Exos (3x16TB btrfs raid1)
	 27341 links/s:  Toshiba NVME (1x2TB btrfs single/dup)
	311284 links/s:  Sabrent Rocket 4 NVME (2x1TB btrfs raid1)
	                 (1344748222 links, 111 snapshots)

Some of these numbers are lower than they should be, because I ran
'find' commands on some machines that were busy doing other work.
The point is that even if some of these numbers are too low, all of
these numbers are higher than what we can expect from a WD Purple.

- Use btrfs raid1 instead of hardware RAID1, i.e. expose each disk
separately through the RAID interface to btrfs. This will enable btrfs
to correct errors and isolate faults if one of your drives goes bad.
You can also use iostat to see if one of the drives is running much
slower than the other, which might be an early indication of failure
(and it might be the only indication of failure you get, if your drive's
firmware doesn't support SCTERC and hides failures).
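The second variant described above (rsync into a reusable working subvolume, then take a read-only snapshot only when rsync succeeds) could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested backup tool: the function names, the argument layout, the DRY_RUN switch and the choice of 'rsync -a --delete' are all assumptions that do not come from the thread. Only 'btrfs subvolume create', 'btrfs subvolume snapshot -r' and the absence of --link-dest follow the advice given.

```shell
#!/bin/sh
# Sketch: nightly backup into a working subvolume plus read-only snapshot.
# All names and paths here are hypothetical.
# Set DRY_RUN=1 to print the commands instead of executing them.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

nightly_backup() {
    src=$1            # rsync source (hypothetical, e.g. host:/data/)
    backup_root=$2    # directory on the btrfs filesystem
    today=$3          # snapshot name (e.g. the output of: date +%F)

    work="$backup_root/working-dir"

    # 'working-dir' has to be a subvolume; it is created once and reused.
    if [ ! -d "$work" ]; then
        run btrfs subvolume create "$work" || return 1
    fi

    # No --link-dest: unchanged files are left alone in the working
    # subvolume. The '-a --delete' options are an assumption.
    run rsync -a --delete "$src" "$work/" || return 1

    # Reached only if rsync succeeded, so the $today snapshot never
    # contains a partial backup.
    run btrfs subvolume snapshot -r "$work" "$backup_root/$today"
}

# Example (prints the commands only):
#   DRY_RUN=1; nightly_backup host:/data/ /srv/backups "$(date +%F)"
```

Running it with DRY_RUN=1 first makes it possible to check the generated commands before anything touches the filesystem.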
* Re: btrfs with huge numbers of hardlinks is extremely slow
From: Forza @ 2021-11-26 8:23 UTC
To: Zygo Blaxell, Andrey Melnikov; +Cc: linux-btrfs

---- From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> -- Sent: 2021-11-26 - 06:15 ----

> On Fri, Nov 26, 2021 at 12:56:25AM +0300, Andrey Melnikov wrote:
>> Every night a new backup is stored on this fs with 'rsync
>> --link-dest=$yestoday/ $today/' - 1402700 hardlinks and 23000
>> directories are created, and 50-100 normal files are transferred.
>> Now the FS contains 351 copies of the backup data with 486086495 hardlinks,
>> and ANY operation on this FS takes significant time. For example,
>> simply counting hardlinks with
>> "time find . -type f -links +1 | wc -l" takes:
>> real    28567m33.611s
>> user    31m33.395s
>> sys     506m28.576s
>>
>> That is 19 days 20 hours 10 mins with constant reads from storage at 2-4 MB/s.
>
> That works out to reading the entire drive 4x in 20 days, or all of the
> metadata 30x. Certainly hardlinks will not result in optimal object
> placement, and you probably don't have enough RAM to cache the entire
> metadata tree, and you're using WD Purple drives in a fileserver for
> some reason, so those numbers seem plausible.

Defragmenting the subvolume and extent tree could help reduce the number
and distance of seeks to metadata, which should improve the performance
of find.

# btrfs fi defrag /path/to/subvol

If you cannot change the drive model, breaking up your HW RAID and creating
a btrfs raid1 should also improve performance, as well as fault tolerance.

Apart from this, I second Zygo's suggestions below.

>
>> - Is btrfs not suitable for this workload?
>
> There are definitely better ways to do this on btrfs, e.g.
>
> 	btrfs sub snap $yesterday $today
> 	rsync ... (no link-dest) ... $today
>
> This will avoid duplicating the entire file tree every time.