linux-btrfs.vger.kernel.org archive mirror
* btrfs with huge numbers of hardlinks is extremely slow
From: Andrey Melnikov @ 2021-11-25 21:56 UTC (permalink / raw)
  To: linux-btrfs, Andrey Melnikov

Every night a new backup is stored on this FS with 'rsync
--link-dest=$yestoday/ $today/' - 1402700 hardlinks and 23000
directories are created, and 50-100 regular files are transferred.
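
Concretely, the nightly job is roughly the following (the source path
and the rsync options other than --link-dest are simplified here):

	#!/bin/sh
	# nightly hardlink-rotation backup (simplified sketch)
	today=$(date +%F)
	yestoday=$(date -d yesterday +%F)
	rsync -a --delete \
	    --link-dest=/srv/backups/$yestoday/ \
	    remote:/data/ /srv/backups/$today/
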
The FS now contains 351 copies of the backup data with 486086495
hardlinks, and ANY operation on this FS takes significant time.
For example, simply counting the hardlinks with
"time find . -type f -links +1 | wc -l" takes:
real    28567m33.611s
user    31m33.395s
sys     506m28.576s

That is 19 days, 20 hours, 10 minutes, with constant reads from
storage at 2-4 MB/s.

- Is BTRFS not suitable for this workload?
- Would using reflinks speed up FS operations?
- Is the metadata that was read not cached at all? What was BTRFS
reading from the disks for 19 days?

Hardware: Dell R300 with two 1TB WD Purple disks in RAID 1 on an
LSISAS1068E (without cache).
Linux nlxc 5.14-51412-generic #0~lch11 SMP Wed Oct 13 15:57:07 UTC
2021 x86_64 GNU/Linux
btrfs-progs v5.14.1

# btrfs fi show
Label: none  uuid: a840a2ca-bf05-4074-8895-60d993cb5bdd
        Total devices 1 FS bytes used 474.26GiB
        devid    1 size 931.00GiB used 502.23GiB path /dev/sdb1

# btrfs fi df /srv
Data, single: total=367.19GiB, used=343.92GiB
System, single: total=32.00MiB, used=128.00KiB
Metadata, single: total=135.00GiB, used=130.34GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

# btrfs fi us /srv
Overall:
    Device size:                 931.00GiB
    Device allocated:            502.23GiB
    Device unallocated:          428.77GiB
    Device missing:                  0.00B
    Used:                        474.26GiB
    Free (estimated):            452.04GiB      (min: 452.04GiB)
    Free (statfs, df):           452.04GiB
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,single: Size:367.19GiB, Used:343.92GiB (93.66%)
   /dev/sdb1     367.19GiB

Metadata,single: Size:135.00GiB, Used:130.34GiB (96.55%)
   /dev/sdb1     135.00GiB

System,single: Size:32.00MiB, Used:128.00KiB (0.39%)
   /dev/sdb1      32.00MiB

Unallocated:
   /dev/sdb1     428.77GiB


* Re: btrfs with huge numbers of hardlinks is extremely slow
From: Zygo Blaxell @ 2021-11-26  5:15 UTC (permalink / raw)
  To: Andrey Melnikov; +Cc: linux-btrfs

On Fri, Nov 26, 2021 at 12:56:25AM +0300, Andrey Melnikov wrote:
> Every night a new backup is stored on this FS with 'rsync
> --link-dest=$yestoday/ $today/' - 1402700 hardlinks and 23000
> directories are created, and 50-100 regular files are transferred.
> The FS now contains 351 copies of the backup data with 486086495
> hardlinks, and ANY operation on this FS takes significant time.
> For example, simply counting the hardlinks with
> "time find . -type f -links +1 | wc -l" takes:
> real    28567m33.611s
> user    31m33.395s
> sys     506m28.576s
> 
> That is 19 days, 20 hours, 10 minutes, with constant reads from
> storage at 2-4 MB/s.

That works out to reading the entire drive 4x in 20 days, or all of the
metadata 30x.  Certainly hardlinks will not result in optimal object
placement, and you probably don't have enough RAM to cache the entire
metadata tree, and you're using WD Purple drives in a fileserver for
some reason, so those numbers seem plausible.

> - Is BTRFS not suitable for this workload?

There are definitely better ways to do this on btrfs, e.g.

	btrfs sub snap $yesterday $today
	rsync ... (no link-dest) ... $today

This will avoid duplicating the entire file tree every time.  It will also
store historical file attributes correctly, which --link-dest sometimes
does not.

You might also consider doing it differently:

	rsync ... (no link-dest) ... working-dir/. &&
	btrfs sub snap -r working-dir $today

so that your $today directory doesn't exist until it is complete with
no rsync errors.  'working-dir' will have to be a subvol, but you only
have to create it once and you can keep reusing it afterwards.
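
Putting those together, a nightly job could look roughly like this
(the subvolume paths and the rsync options are only an example):

	#!/bin/sh
	# sketch of a snapshot-based nightly backup; adjust paths and
	# rsync options to your setup
	set -e
	today=$(date +%F)
	work=/srv/backups/working-dir   # a subvol, created once beforehand
	rsync -a --delete remote:/data/ "$work"/.
	btrfs sub snap -r "$work" /srv/backups/"$today"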

> - Would using reflinks speed up FS operations?

Snapshots are lazy reflink copies, so they'll do a little better than
reflinks.  You'll only modify the metadata for the 50-100 files that you
transfer each day, instead of completely rewriting all of the metadata
in the filesystem every day with hardlinks.

Hardlinks put the inodes further and further away from their directory
nodes each day, and add some extra searching overhead within directories
as well.  You'll need more and more RAM to cache the same amount of
each filesystem tree, because they're all in the same metadata tree.
With snapshots they'll end up in separate metadata trees.

> - Is the metadata that was read not cached at all?

If you have less than about 640GB of RAM (4x the size of your metadata)
then you're going to be rereading metadata pages at some point.  Because
you're using hardlinks, the metadata pages from different days are all
mashed together, and 'find' will flood the cache chasing references to
them.
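
A quick sanity check is to compare installed RAM against the metadata
usage the filesystem reports, e.g.:

	free -h
	btrfs fi df /srv | grep '^Metadata'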

Other recommendations:

- Use the right drive model for your workload.  WD Purple drives are for
continuous video streaming; they are not for seeky millions-of-tiny-files
rsync workloads.  Almost any other model will outperform them, and
better drives for this workload (e.g. CMR WD Red models) are cheaper.
Your WD Purple drives are getting 283 links/s.  Compare that with some
other drive models:

	1665 links/s:  WD Green (2x1TB + 1x2TB btrfs raid1)

	6850 links/s:  Sandisk Extreme MicroSD (1x256GB btrfs single/dup)

	12511 links/s:  WD Red (2x1TB btrfs raid1)

	13371 links/s:  WD Red SSD + Seagate Ironwolf SSD (6x1TB btrfs raid1)

	14872 links/s:  WD Black (1x1TB btrfs single/dup, 8 years old)

	25498 links/s:  WD Gold + Seagate Exos (3x16TB btrfs raid1)

	27341 links/s:  Toshiba NVME (1x2TB btrfs single/dup)

	311284 links/s:  Sabrent Rocket 4 NVME (2x1TB btrfs raid1)
	(1344748222 links, 111 snapshots)

Some of these numbers are lower than they should be, because I ran
'find' commands on some machines that were busy doing other work.
The point is that even if some of these numbers are too low, all of
these numbers are higher than what we can expect from a WD Purple.
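
These figures come from 'find' runs like yours: the number of
hardlinked files found divided by the wall-clock time, roughly:

	# rough links/s measurement (bash)
	start=$(date +%s)
	links=$(find . -type f -links +1 | wc -l)
	elapsed=$(( $(date +%s) - start ))
	echo "$links links in $elapsed s, $(( links / elapsed )) links/s"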

- Use btrfs raid1 instead of hardware RAID1, i.e. expose each disk
separately through the RAID interface to btrfs.  This will enable btrfs
to correct errors and isolate faults if one of your drives goes bad.
You can also use iostat to see if one of the drives is running much
slower than the other, which might be an early indication of failure
(and it might be the only indication of failure you get, if your drive's
firmware doesn't support SCTERC and hides failures).
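
For example, starting from two bare disks (the device names below are
placeholders):

	# one-time setup with btrfs raid1 across both disks
	mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb
	# or convert an existing single-device filesystem after adding a disk:
	btrfs device add /dev/sda /srv
	btrfs balance start -dconvert=raid1 -mconvert=raid1 /srv
	# watch for one drive running much slower than the other:
	iostat -dxm 5 sda sdb
	# check whether the drive firmware supports SCTERC:
	smartctl -l scterc /dev/sda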

> What was BTRFS reading from the disks for 19 days?
> 
> Hardware: Dell R300 with two 1TB WD Purple disks in RAID 1 on an
> LSISAS1068E (without cache).
> Linux nlxc 5.14-51412-generic #0~lch11 SMP Wed Oct 13 15:57:07 UTC
> 2021 x86_64 GNU/Linux
> btrfs-progs v5.14.1
> 
> # btrfs fi show
> Label: none  uuid: a840a2ca-bf05-4074-8895-60d993cb5bdd
>         Total devices 1 FS bytes used 474.26GiB
>         devid    1 size 931.00GiB used 502.23GiB path /dev/sdb1
> 
> # btrfs fi df /srv
> Data, single: total=367.19GiB, used=343.92GiB
> System, single: total=32.00MiB, used=128.00KiB
> Metadata, single: total=135.00GiB, used=130.34GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> # btrfs fi us /srv
> Overall:
>     Device size:                 931.00GiB
>     Device allocated:            502.23GiB
>     Device unallocated:          428.77GiB
>     Device missing:                  0.00B
>     Used:                        474.26GiB
>     Free (estimated):            452.04GiB      (min: 452.04GiB)
>     Free (statfs, df):           452.04GiB
>     Data ratio:                       1.00
>     Metadata ratio:                   1.00
>     Global reserve:              512.00MiB      (used: 0.00B)
>     Multiple profiles:                  no
> 
> Data,single: Size:367.19GiB, Used:343.92GiB (93.66%)
>    /dev/sdb1     367.19GiB
> 
> Metadata,single: Size:135.00GiB, Used:130.34GiB (96.55%)
>    /dev/sdb1     135.00GiB
> 
> System,single: Size:32.00MiB, Used:128.00KiB (0.39%)
>    /dev/sdb1      32.00MiB
> 
> Unallocated:
>    /dev/sdb1     428.77GiB


* Re: btrfs with huge numbers of hardlinks is extremely slow
From: Forza @ 2021-11-26  8:23 UTC (permalink / raw)
  To: Zygo Blaxell, Andrey Melnikov; +Cc: linux-btrfs



---- From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> -- Sent: 2021-11-26 - 06:15 ----

> On Fri, Nov 26, 2021 at 12:56:25AM +0300, Andrey Melnikov wrote:
>> Every night a new backup is stored on this FS with 'rsync
>> --link-dest=$yestoday/ $today/' - 1402700 hardlinks and 23000
>> directories are created, and 50-100 regular files are transferred.
>> The FS now contains 351 copies of the backup data with 486086495
>> hardlinks, and ANY operation on this FS takes significant time.
>> For example, simply counting the hardlinks with
>> "time find . -type f -links +1 | wc -l" takes:
>> real    28567m33.611s
>> user    31m33.395s
>> sys     506m28.576s
>> 
>> That is 19 days, 20 hours, 10 minutes, with constant reads from
>> storage at 2-4 MB/s.
> 
> That works out to reading the entire drive 4x in 20 days, or all of the
> metadata 30x.  Certainly hardlinks will not result in optimal object
> placement, and you probably don't have enough RAM to cache the entire
> metadata tree, and you're using WD Purple drives in a fileserver for
> some reason, so those numbers seem plausible.

Defragmenting the subvolume and extent trees could help reduce the
number and distance of seeks to metadata, which should improve the
performance of find.

# btrfs fi defrag /path/to/subvol 
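
If the file data is fragmented too, a recursive pass over the data can
be added on top of that (the target extent size here is only an
example):

# btrfs fi defrag -r -t 32M /path/to/subvol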


If you cannot change the drive model, breaking up your HW RAID and
creating a btrfs raid1 should also improve performance, as well as
fault tolerance.

Apart from this, I second Zygo's suggestions below. 




> 
>> - Is BTRFS not suitable for this workload?
> 
> There are definitely better ways to do this on btrfs, e.g.
> 
> 	btrfs sub snap $yesterday $today
> 	rsync ... (no link-dest) ... $today
> 
> This will avoid duplicating the entire file tree every time.  It will also
> store historical file attributes correctly, which --link-dest sometimes
> does not.
> 
> You might also consider doing it differently:
> 
> 	rsync ... (no link-dest) ... working-dir/. &&
> 	btrfs sub snap -r working-dir $today
> 
> so that your $today directory doesn't exist until it is complete with
> no rsync errors.  'working-dir' will have to be a subvol, but you only
> have to create it once and you can keep reusing it afterwards.
> 
>> - Would using reflinks speed up FS operations?
> 
> Snapshots are lazy reflink copies, so they'll do a little better than
> reflinks.  You'll only modify the metadata for the 50-100 files that you
> transfer each day, instead of completely rewriting all of the metadata
> in the filesystem every day with hardlinks.
> 
> Hardlinks put the inodes further and further away from their directory
> nodes each day, and add some extra searching overhead within directories
> as well.  You'll need more and more RAM to cache the same amount of
> each filesystem tree, because they're all in the same metadata tree.
> With snapshots they'll end up in separate metadata trees.
> 
>> - Is the metadata that was read not cached at all?
> 
> If you have less than about 640GB of RAM (4x the size of your metadata)
> then you're going to be rereading metadata pages at some point.  Because
> you're using hardlinks, the metadata pages from different days are all
> mashed together, and 'find' will flood the cache chasing references to
> them.
> 
> Other recommendations:
> 
> - Use the right drive model for your workload.  WD Purple drives are for
> continuous video streaming; they are not for seeky millions-of-tiny-files
> rsync workloads.  Almost any other model will outperform them, and
> better drives for this workload (e.g. CMR WD Red models) are cheaper.
> Your WD Purple drives are getting 283 links/s.  Compare that with some
> other drive models:
> 
> 	1665 links/s:  WD Green (2x1TB + 1x2TB btrfs raid1)
> 
> 	6850 links/s:  Sandisk Extreme MicroSD (1x256GB btrfs single/dup)
> 
> 	12511 links/s:  WD Red (2x1TB btrfs raid1)
> 
> 	13371 links/s:  WD Red SSD + Seagate Ironwolf SSD (6x1TB btrfs raid1)
> 
> 	14872 links/s:  WD Black (1x1TB btrfs single/dup, 8 years old)
> 
> 	25498 links/s:  WD Gold + Seagate Exos (3x16TB btrfs raid1)
> 
> 	27341 links/s:  Toshiba NVME (1x2TB btrfs single/dup)
> 
> 	311284 links/s:  Sabrent Rocket 4 NVME (2x1TB btrfs raid1)
> 	(1344748222 links, 111 snapshots)
> 
> Some of these numbers are lower than they should be, because I ran
> 'find' commands on some machines that were busy doing other work.
> The point is that even if some of these numbers are too low, all of
> these numbers are higher than what we can expect from a WD Purple.
> 
> - Use btrfs raid1 instead of hardware RAID1, i.e. expose each disk
> separately through the RAID interface to btrfs.  This will enable btrfs
> to correct errors and isolate faults if one of your drives goes bad.
> You can also use iostat to see if one of the drives is running much
> slower than the other, which might be an early indication of failure
> (and it might be the only indication of failure you get, if your drive's
> firmware doesn't support SCTERC and hides failures).
> 
>> What was BTRFS reading from the disks for 19 days?
>> 
>> Hardware: Dell R300 with two 1TB WD Purple disks in RAID 1 on an
>> LSISAS1068E (without cache).
>> Linux nlxc 5.14-51412-generic #0~lch11 SMP Wed Oct 13 15:57:07 UTC
>> 2021 x86_64 GNU/Linux
>> btrfs-progs v5.14.1
>> 
>> # btrfs fi show
>> Label: none  uuid: a840a2ca-bf05-4074-8895-60d993cb5bdd
>>         Total devices 1 FS bytes used 474.26GiB
>>         devid    1 size 931.00GiB used 502.23GiB path /dev/sdb1
>> 
>> # btrfs fi df /srv
>> Data, single: total=367.19GiB, used=343.92GiB
>> System, single: total=32.00MiB, used=128.00KiB
>> Metadata, single: total=135.00GiB, used=130.34GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>> 
>> # btrfs fi us /srv
>> Overall:
>>     Device size:                 931.00GiB
>>     Device allocated:            502.23GiB
>>     Device unallocated:          428.77GiB
>>     Device missing:                  0.00B
>>     Used:                        474.26GiB
>>     Free (estimated):            452.04GiB      (min: 452.04GiB)
>>     Free (statfs, df):           452.04GiB
>>     Data ratio:                       1.00
>>     Metadata ratio:                   1.00
>>     Global reserve:              512.00MiB      (used: 0.00B)
>>     Multiple profiles:                  no
>> 
>> Data,single: Size:367.19GiB, Used:343.92GiB (93.66%)
>>    /dev/sdb1     367.19GiB
>> 
>> Metadata,single: Size:135.00GiB, Used:130.34GiB (96.55%)
>>    /dev/sdb1     135.00GiB
>> 
>> System,single: Size:32.00MiB, Used:128.00KiB (0.39%)
>>    /dev/sdb1      32.00MiB
>> 
>> Unallocated:
>>    /dev/sdb1     428.77GiB



