* BTRFS File Delete Speed Scales With File Size?
@ 2020-06-09 15:31 Ellis H. Wilson III
  2020-06-09 17:53 ` Adam Borowski
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Ellis H. Wilson III @ 2020-06-09 15:31 UTC (permalink / raw)
  To: Btrfs BTRFS

Hi folks,

We have a few engineers looking through BTRFS code presently for answers 
to this, but I was interested to get input from the experts in parallel 
to hopefully understand this issue quickly.

We find that removes of large amounts of data can take a significant 
amount of time in BTRFS on HDDs -- in fact it appears to scale linearly 
with the size of the file.  I'd like to better understand the mechanics 
underpinning that behavior.

See the attached graph for a quick experiment that demonstrates this 
behavior.  In this experiment I use 40 threads to perform deletions of 
previously written data in parallel.  10,000 files in every case and I 
scale files by powers of two from 16MB to 16GB.  Thus, the raw amount of 
data deleted also expands by 2x every step.  Frankly I expected deletion 
of a file to be predominantly a metadata operation and not scale with 
the size of the file, but perhaps I'm misunderstanding that.  While the 
overall speed of deletion is relatively fast (hovering between 30GB/s 
and 50GB/s) compared with raw ingest of data to the disk array we're 
using (in our case ~1.5GB/s) it can still take a very long time to 
delete data from the drives and removes hang completely until that data 
is deleted, unlike in some other filesystems.  They also compete 
aggressively with foreground I/O due to the intense seeking on the HDDs.
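
For reference, the benchmark is roughly the following sketch (the path,
file naming, and the final sync here are placeholders rather than our
actual harness):

---- >8 ----
#!/usr/bin/python3

import os
import time
from concurrent.futures import ThreadPoolExecutor

# Placeholder path/naming -- the real harness differs.
files = ["/mnt/test/file{}".format(i) for i in range(10000)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=40) as pool:
    # Each worker blocks in the unlink() system call until btrfs returns.
    list(pool.map(os.unlink, files))
os.sync()  # don't stop the clock until the deletes are committed
print("removed {} files in {:.1f}s".format(len(files),
                                            time.monotonic() - start))
---- >8 ----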

This is with an older version of BTRFS (4.12.14-lp150.12.73-default) so 
if algorithms have changed substantially such that deletion rate is no 
longer tied to file size in newer versions please just say so and I'll 
be glad to take a look at that change and try with a newer version.

FWIW, we are using the v2 free space cache.  If any other information is 
relevant to this just let me know and I'll be glad to share.

Thank you for any time people can spare to help us understand this better!

ellis

[-- Attachment #2: parallel_delete_speed.png --]
[-- Type: image/png, Size: 25660 bytes --]


* Re: BTRFS File Delete Speed Scales With File Size?
  2020-06-09 15:31 BTRFS File Delete Speed Scales With File Size? Ellis H. Wilson III
@ 2020-06-09 17:53 ` Adam Borowski
  2020-06-09 19:23   ` Martin Raiber
  2020-06-09 23:09 ` Hans van Kranenburg
  2020-06-10  2:45 ` Zygo Blaxell
  2 siblings, 1 reply; 13+ messages in thread
From: Adam Borowski @ 2020-06-09 17:53 UTC (permalink / raw)
  To: Ellis H. Wilson III; +Cc: Btrfs BTRFS

On Tue, Jun 09, 2020 at 11:31:41AM -0400, Ellis H. Wilson III wrote:
> We have a few engineers looking through BTRFS code presently for answers to
> this, but I was interested to get input from the experts in parallel to
> hopefully understand this issue quickly.
> 
> We find that removes of large amounts of data can take a significant amount
> of time in BTRFS on HDDs -- in fact it appears to scale linearly with the
> size of the file.  I'd like to better understand the mechanics underpinning
> that behavior.
> 
> See the attached graph for a quick experiment that demonstrates this
> behavior.  In this experiment I use 40 threads to perform deletions of
> previously written data in parallel.  10,000 files in every case and I scale
> files by powers of two from 16MB to 16GB.  Thus, the raw amount of data
> deleted also expands by 2x every step.  Frankly I expected deletion of a
> file to be predominantly a metadata operation and not scale with the size of
> the file, but perhaps I'm misunderstanding that.

The size of metadata is, after a small constant bit, proportional to the
number of extents, which in turn depends on file size.  With compression
off, extents may be as big as 1GB (which would make their number
negligible), but that's clearly not happening in your case.

There are tools which can show extent layout. I'd recommend python3-btrfs,
which includes /usr/share/doc/python3-btrfs/examples/show_file.py that
prints everything available about the list of extents.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
⢿⡄⠘⠷⠚⠋⠀                                       -- <willmore> on #linux-sunxi
⠈⠳⣄⠀⠀⠀⠀


* Re: BTRFS File Delete Speed Scales With File Size?
  2020-06-09 17:53 ` Adam Borowski
@ 2020-06-09 19:23   ` Martin Raiber
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Raiber @ 2020-06-09 19:23 UTC (permalink / raw)
  To: Adam Borowski, Ellis H. Wilson III; +Cc: Btrfs BTRFS

On 09.06.2020 19:53 Adam Borowski wrote:
> On Tue, Jun 09, 2020 at 11:31:41AM -0400, Ellis H. Wilson III wrote:
>> We have a few engineers looking through BTRFS code presently for answers to
>> this, but I was interested to get input from the experts in parallel to
>> hopefully understand this issue quickly.
>>
>> We find that removes of large amounts of data can take a significant amount
>> of time in BTRFS on HDDs -- in fact it appears to scale linearly with the
>> size of the file.  I'd like to better understand the mechanics underpinning
>> that behavior.
>>
>> See the attached graph for a quick experiment that demonstrates this
>> behavior.  In this experiment I use 40 threads to perform deletions of
>> previously written data in parallel.  10,000 files in every case and I scale
>> files by powers of two from 16MB to 16GB.  Thus, the raw amount of data
>> deleted also expands by 2x every step.  Frankly I expected deletion of a
>> file to be predominantly a metadata operation and not scale with the size of
>> the file, but perhaps I'm misunderstanding that.
> The size of metadata is, after a small constant bit, proportional to the
> number of extents, which in turn depends on file size.  With compression
> off, extents may be as big as 1GB (which would make their number
> negligible), but that's clearly not happening in your case.
>
> There are tools which can show extent layout. I'd recommend python3-btrfs,
> which includes /usr/share/doc/python3-btrfs/examples/show_file.py that
> prints everything available about the list of extents.

Also there is a 4 byte CRC checksum per 4K block that needs to be 
removed. Mount with "nodatasum" and run your tests to confirm this is 
the cause.



* Re: BTRFS File Delete Speed Scales With File Size?
  2020-06-09 15:31 BTRFS File Delete Speed Scales With File Size? Ellis H. Wilson III
  2020-06-09 17:53 ` Adam Borowski
@ 2020-06-09 23:09 ` Hans van Kranenburg
  2020-06-11  1:56   ` Ellis H. Wilson III
       [not found]   ` <d5379505-7dd1-d5bc-59e7-207aaa82acf6@panasas.com>
  2020-06-10  2:45 ` Zygo Blaxell
  2 siblings, 2 replies; 13+ messages in thread
From: Hans van Kranenburg @ 2020-06-09 23:09 UTC (permalink / raw)
  To: Ellis H. Wilson III, Btrfs BTRFS

Hi!

On 6/9/20 5:31 PM, Ellis H. Wilson III wrote:
> Hi folks,
> 
> We have a few engineers looking through BTRFS code presently for answers 
> to this, but I was interested to get input from the experts in parallel 
> to hopefully understand this issue quickly.
> 
> We find that removes of large amounts of data can take a significant 
> amount of time in BTRFS on HDDs -- in fact it appears to scale linearly 
> with the size of the file.  I'd like to better understand the mechanics 
> underpinning that behavior.

Like Adam already mentioned, the amount and size of the individual
extent metadata items that need to be removed is one variable in the big
equation.

The code in the show_file.py example in python-btrfs is a bit outdated,
and it prints info about all extents that are referenced and a bit more.

Alternatively, we can look at only the file extent items and calculate
their number and their total and average size (on disk):

---- >8 ----
#!/usr/bin/python3

import btrfs
import os
import sys

with btrfs.FileSystem(sys.argv[1]) as fs:
    inum = os.fstat(fs.fd).st_ino
    min_key = btrfs.ctree.Key(inum, btrfs.ctree.EXTENT_DATA_KEY, 0)
    max_key = btrfs.ctree.Key(inum, btrfs.ctree.EXTENT_DATA_KEY, -1)
    amount = 0
    total_size = 0
    for header, data in btrfs.ioctl.search_v2(fs.fd, 0, min_key, max_key):
        item = btrfs.ctree.FileExtentItem(header, data)
        if item.type != btrfs.ctree.FILE_EXTENT_REG:
            continue
        amount += 1
        total_size += item.disk_num_bytes
    print("Referencing data from {} regular extents with total size {} "
          "average size {}".format(
              amount, btrfs.utils.pretty_size(total_size),
              btrfs.utils.pretty_size(int(total_size/amount))))
---- >8 ----

If I e.g. point this at /bin/bash (compressed), I get:

Referencing data from 9 regular extents with total size 600.00KiB
average size 66.67KiB

The above code assumes that the real data extents are only referenced
once (no dedupe within the same file etc), otherwise you'll have to
filter them (and keep more stuff in memory). And, it ignores inlined
small extents for simplicity. Anyway, you can hack away on it to e.g.
make it look up more interesting things.

https://python-btrfs.readthedocs.io/en/latest/btrfs.html#btrfs.ctree.FileExtentItem

> See the attached graph for a quick experiment that demonstrates this 
> behavior.  In this experiment I use 40 threads to perform deletions of 
> previously written data in parallel.  10,000 files in every case and I 
> scale files by powers of two from 16MB to 16GB.  Thus, the raw amount of 
> data deleted also expands by 2x every step.  Frankly I expected deletion 
> of a file to be predominantly a metadata operation and not scale with 
> the size of the file, but perhaps I'm misunderstanding that.  While the 
> overall speed of deletion is relatively fast (hovering between 30GB/s 
> and 50GB/s) compared with raw ingest of data to the disk array we're 
> using (in our case ~1.5GB/s) it can still take a very long time to 
> delete data from the drives and removes hang completely until that data 
> is deleted, unlike in some other filesystems.  They also compete 
> aggressively with foreground I/O due to the intense seeking on the HDDs.

What is interesting in this case is to see what kind of IO pattern the
deletes are causing (so, obviously, test without adding data). Like, how
much throughput for read and write, and how many iops read and write per
second (so that we can easily calculate average iop size).
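
If you don't have a favorite tool for that at hand, a quick and dirty
sampler over /proc/diskstats is enough. A sketch (the device name is an
assumption, point it at your md device or one of the member disks):

---- >8 ----
#!/usr/bin/python3

import sys
import time

dev = sys.argv[1]  # e.g. 'md0' or 'sda'

def counters(dev):
    # /proc/diskstats fields after the device name: reads completed,
    # reads merged, sectors read, ms reading, writes completed,
    # writes merged, sectors written, ms writing, ...
    with open('/proc/diskstats') as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return (int(fields[3]), int(fields[5]),
                        int(fields[7]), int(fields[9]))
    raise SystemExit('device {} not found'.format(dev))

prev = counters(dev)
while True:
    time.sleep(1)
    cur = counters(dev)
    r_iops, r_sect, w_iops, w_sect = (c - p for c, p in zip(cur, prev))
    prev = cur
    print('read: {:6d} iops {:8.1f} MiB/s   write: {:6d} iops {:8.1f} MiB/s'
          .format(r_iops, r_sect * 512 / 1048576,
                  w_iops, w_sect * 512 / 1048576))
---- >8 ----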

If most time is spent doing random read IO, then hope for a bright
future in which you can store btrfs metadata on solid state and the data
itself on cheaper spinning rust.

If most time is spent doing writes, then what you're seeing is probably
what I call rumination, which is caused by making changes in metadata,
which leads to writing modified parts of metadata in free space again.

The extent tree (which has info about the actual data extents on disk
that the file_extent_item ones seen above point to) is one of those,
and there's only one of that kind, which is even tracking its own space
allocation, which can lead to the effect of a cat or dog chasing its
own tail. There have definitely been performance improvements in that
area; I don't know exactly where, but when I moved from a 4.9 to a 4.19
kernel, things improved a bit.

There is a very long story at
https://www.spinics.net/lists/linux-btrfs/msg70624.html about these
kinds of things.

There are unfortunately no easily accessible tools yet that can
give live insight into what exactly is happening when it's writing
metadata like crazy.

> This is with an older version of BTRFS (4.12.14-lp150.12.73-default) so 
> if algorithms have changed substantially such that deletion rate is no 
> longer tied to file size in newer versions please just say so and I'll 
> be glad to take a look at that change and try with a newer version.

That seems to be a suse kernel. I have no idea what kind of btrfs is in
there.

> FWIW, we are using the v2 free space cache.

Just to be sure, there are edge cases in which the filesystem says it's
using space cache v2, but actually isn't.

Does /sys/fs/btrfs/<UUID>/features/free_space_tree exist?

Or, of course a fun little program to just do a bogus search in it,
which explodes if it's not there:

---- >8 ----
#!/usr/bin/python3

import btrfs

with btrfs.FileSystem('/') as fs:
    try:
        list(fs.free_space_extents(0, 0))
        print("Yay!")
    except FileNotFoundError:
        print("No FST :(")
---- >8 ----

Space cache v1 will cause filesystem stalls combined with a burst of
writes on every 'transaction commit', and space cache v2 will cause more
random reads and writes, but they don't block the whole thing.

> If any other information is 
> relevant to this just let me know and I'll be glad to share.

Suggestions for things to try, and see what difference it makes:

* Organize incoming data in btrfs subvolumes in a way that enables you
to remove the subvol instead of rm'ing the files. If there are a lot of
files and stuff, this can be more efficient, since btrfs knows about
what parts to 'cut away' with a chainsaw when removing, instead of
telling it to do everything file by file in small steps. It also keeps
the size of the individual subvolume metadata trees down, reducing
rumination done by the cow. If you don't have them, your default fs tree
with all the file info is, relatively speaking, as large as the extent tree.

* Remove stuff in the same order as it got added. Remember, rm * removes
files in alphabetical order, not in the order in which metadata was
added to disk (the inode number sequence). It might cause less jumping
around in metadata. Using subvolumes instead is still better, because in
that case the whole issue does not exist.

* (not recommended) If your mount options do not show 'ssd' in them and
your kernel does not have this patch:
https://www.spinics.net/lists/linux-btrfs/msg64203.html then you can try
the mount -o ssd_spread,nossd (read the story in the link above).
Anyway, you're better off moving to something recent instead of trying
all these obsolete workarounds...

If you have a kernel >= 4.14 then you're better off mounting with -o ssd
for rotational disks because the metadata writes are put more together
in bigger parts of free space, leading to less rumination. There is
still a lot of research and improvements to be done in this area (again,
see long post referenced above).

If you have kernel < 4.14, then definitely do not mount with -o ssd, but
even mount ssds with -o nossd explicitly. (yes, that's also a long story)...

Fun! :) /o\

> Thank you for any time people can spare to help us understand this better!

Don't hesitate to ask more questions,
Have fun,
Hans


* Re: BTRFS File Delete Speed Scales With File Size?
  2020-06-09 15:31 BTRFS File Delete Speed Scales With File Size? Ellis H. Wilson III
  2020-06-09 17:53 ` Adam Borowski
  2020-06-09 23:09 ` Hans van Kranenburg
@ 2020-06-10  2:45 ` Zygo Blaxell
  2 siblings, 0 replies; 13+ messages in thread
From: Zygo Blaxell @ 2020-06-10  2:45 UTC (permalink / raw)
  To: Ellis H. Wilson III; +Cc: Btrfs BTRFS

On Tue, Jun 09, 2020 at 11:31:41AM -0400, Ellis H. Wilson III wrote:
> Hi folks,
> 
> We have a few engineers looking through BTRFS code presently for answers to
> this, but I was interested to get input from the experts in parallel to
> hopefully understand this issue quickly.
> 
> We find that removes of large amounts of data can take a significant amount
> of time in BTRFS on HDDs -- in fact it appears to scale linearly with the
> size of the file.  I'd like to better understand the mechanics underpinning
> that behavior.
> 
> See the attached graph for a quick experiment that demonstrates this
> behavior.  In this experiment I use 40 threads to perform deletions of
> previously written data in parallel.  

40 threads doing delete is probably overkill--at the end of the day it all
goes through the same disk(s), and only one or two deleting threads per
disk will be enough to saturate the IO channel (or at least hit btrfs's
current concurrency limits).

> 10,000 files in every case and I scale
> files by powers of two from 16MB to 16GB.  Thus, the raw amount of data
> deleted also expands by 2x every step.  Frankly I expected deletion of a
> file to be predominantly a metadata operation and not scale with the size of
> the file, but perhaps I'm misunderstanding that.  

Metadata size is directly proportional to file size.  An extent takes
up about 100 bytes of metadata and can represent 4K to 128MB of data
(4K to 128KB for compressed data).  That's comparable to most modern
filesystems.

crc32c csums add 0.1% of the file data size in metadata (4 byte csum for
each 4K data block).  All the other csum algos are larger than crc32c and
use proportionally more metadata for each data block.  That's probably
where the majority of the linear scaling effect comes from--the rest is
buried in statistical noise.
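
To put rough numbers on that (back of the envelope only: ~100 bytes per
extent item, 4 bytes of crc32c per 4K block, and assuming the best case
of 128MB extents):

---- >8 ----
#!/usr/bin/python3

EXTENT_ITEM = 100          # bytes of metadata per extent, approximate
EXTENT_SIZE = 128 * 2**20  # best case, uncompressed
BLOCK = 4096
GB = 2**30

for size in (16 * 2**20, 1 * GB, 16 * GB, 64 * GB):
    extents = -(-size // EXTENT_SIZE)      # ceiling division
    csum_bytes = (size // BLOCK) * 4
    print('{:7.3f} GB file: {:4d} extents, ~{:6d} bytes of extent items, '
          '~{:5.1f} MB of csum items'.format(
              size / GB, extents, extents * EXTENT_ITEM,
              csum_bytes / 2**20))
---- >8 ----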

You can remove the csum overhead by disabling datasums; however, you lose
self-healing and corruption detection if you do that.

There are also overheads related to sharing of extents.  Whenever a
shared extent is modified (e.g. a file is deleted while a snapshot
containing the file exists) an update to the backrefs is required.
Other filesystems don't have backrefs at all (so they don't support
online shrinking resize), or they only use backrefs when extents are
shared (e.g. XFS rmaps).

> While the overall speed of
> deletion is relatively fast (hovering between 30GB/s and 50GB/s) compared
> with raw ingest of data to the disk array we're using (in our case ~1.5GB/s)
> it can still take a very long time to delete data from the drives and
> removes hang completely until that data is deleted, unlike in some other
> filesystems.  They also compete aggressively with foreground I/O due to the
> intense seeking on the HDDs.

You may also be hitting delayed extent ref throttling.  btrfs doesn't
add or remove references to extents right away--it queues them up in
memory so that it doesn't touch the same metadata page multiple times
in a transaction.  When the batch reaches a certain size (100 to 10000
extents, depending on disk performance), writes are blocked to allow for
the disks to catch up.

Without this throttling, transactions can keep growing indefinitely.  In
the worst case, a crash can roll back the filesystem to a much earlier
state--hours or days earlier, if you have sufficient disk space to keep
appending to the ongoing transaction, and the right mix of aggressive
writer processes to keep the IO channels saturated.  This regressed in
5.0 and hasn't really been fixed properly yet.

> This is with an older version of BTRFS (4.12.14-lp150.12.73-default) so if
> algorithms have changed substantially such that deletion rate is no longer
> tied to file size in newer versions please just say so and I'll be glad to
> take a look at that change and try with a newer version.

That would require significant on-disk format changes.

It is not possible to quickly (i.e. in O(1) or even O(log(n)) time) drop
all the csums or extents that belong to a file in btrfs.  Extents and
csums are owned by the extent tree, files just contain references to them.
When a file is deleted, each individual metadata item contained in the
file has to be removed from its page, which means reading and possibly
also writing the entire page at some point during the transaction.

This isn't different from other filesystems: most modern filesystems,
including ext4, xfs, and btrfs, are O(n*log(n)) in data size for deletes.
The constant term is much larger in btrfs due to the csums.

Ideally, deletes should proceed in the same order the data was written.
This reads metadata pages sequentially from disk and provides the best
benefits of RAM caching.  This is another reason to limit the number of
deleting threads.
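
If the workload itself can't guarantee that, sorting removals by inode
number gets close to creation order. A sketch (single directory, plain
files only, no error handling):

---- >8 ----
#!/usr/bin/python3

import os
import sys

# Remove regular files in inode-number order, which roughly matches the
# order their metadata was added, instead of the alphabetical order that
# 'rm *' gives you.
path = sys.argv[1]
entries = [e for e in os.scandir(path) if e.is_file(follow_symlinks=False)]
for entry in sorted(entries, key=lambda e: e.inode()):
    os.unlink(entry.path)
---- >8 ----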

> FWIW, we are using the v2 free space cache.  If any other information is
> relevant to this just let me know and I'll be glad to share.
> 
> Thank you for any time people can spare to help us understand this better!
> 
> ellis




* Re: BTRFS File Delete Speed Scales With File Size?
  2020-06-09 23:09 ` Hans van Kranenburg
@ 2020-06-11  1:56   ` Ellis H. Wilson III
  2020-06-11  3:14     ` Zygo Blaxell
       [not found]   ` <d5379505-7dd1-d5bc-59e7-207aaa82acf6@panasas.com>
  1 sibling, 1 reply; 13+ messages in thread
From: Ellis H. Wilson III @ 2020-06-11  1:56 UTC (permalink / raw)
  To: Hans van Kranenburg, Btrfs BTRFS

On 6/9/20 7:09 PM, Hans van Kranenburg wrote:
> * (not recommended) If your mount options do not show 'ssd' in them and
> your kernel does not have this patch:
> https://www.spinics.net/lists/linux-btrfs/msg64203.html  then you can try
> the mount -o ssd_spread,nossd (read the story in the link above).
> Anyway, you're better off moving to something recent instead of trying
> all these obsolete workarounds...

I forgot to respond to this in my last email.  The version of BTRFS 
running in my openSuSE 15.0 kernel does indeed have that patch.

Nevertheless, I tried with just ssd_spread for kicks, and it showed no 
major improvement, and had the same growth pathology as past runs had:

0f70583cd12cac:/pandata/0 # time (for i in {1..10}; do time (rm test$i; 
sync); done)
real    0m0.636s
real    0m0.649s
real    0m0.417s
real    0m0.562s
real    0m0.381s
real    0m0.949s
real    0m2.014s
real    0m2.129s
real    0m2.074s
real    0m5.114s

Total:
real    0m14.939s

Even more curiously, when I repeat this experiment using ext4 (lazy init 
was disabled) on the exact same disks, I see a nearly identical slowdown 
pathology:

real    0m0.122s
real    0m0.048s
real    0m0.075s
real    0m0.076s
real    0m0.100s
real    0m0.499s
real    0m1.658s
real    0m1.709s
real    0m1.716s
real    0m6.599s

Total:
real    0m12.614s

Very wonky.  Maybe this has something to do with the mdraid we use 
underneath both, or maybe it's something architectural I'm not 
immediately grasping that impacts all extent-based filesystems.  Will 
report back when I have blktraces.

Best,

ellis


* Re: BTRFS File Delete Speed Scales With File Size?
  2020-06-11  1:56   ` Ellis H. Wilson III
@ 2020-06-11  3:14     ` Zygo Blaxell
  0 siblings, 0 replies; 13+ messages in thread
From: Zygo Blaxell @ 2020-06-11  3:14 UTC (permalink / raw)
  To: Ellis H. Wilson III; +Cc: Hans van Kranenburg, Btrfs BTRFS

On Wed, Jun 10, 2020 at 09:56:11PM -0400, Ellis H. Wilson III wrote:
> On 6/9/20 7:09 PM, Hans van Kranenburg wrote:
> > * (not recommended) If your mount options do not show 'ssd' in them and
> > your kernel does not have this patch:
> > https://www.spinics.net/lists/linux-btrfs/msg64203.html  then you can try
> > the mount -o ssd_spread,nossd (read the story in the link above).
> > Anyway, you're better off moving to something recent instead of trying
> > all these obsolete workarounds...
> 
> I forgot to respond to this in my last email.  The version of BTRFS running
> in my openSuSE 15.0 kernel does indeed have that patch.
> 
> Nevertheless, I tried with just ssd_spread for kicks, and it showed no major
> improvement, and had the same growth pathology as past runs had:
> 
> 0f70583cd12cac:/pandata/0 # time (for i in {1..10}; do time (rm test$i;
> sync); done)
> real    0m0.636s
> real    0m0.649s
> real    0m0.417s
> real    0m0.562s
> real    0m0.381s
> real    0m0.949s
> real    0m2.014s
> real    0m2.129s
> real    0m2.074s
> real    0m5.114s
> 
> Total:
> real    0m14.939s
> 
> Even more curiously, when I repeat this experiment using ext4 (lazy init was
> disabled) on the exact same disks, I see a nearly identical slowdown
> pathology:
> 
> real    0m0.122s
> real    0m0.048s
> real    0m0.075s
> real    0m0.076s
> real    0m0.100s
> real    0m0.499s
> real    0m1.658s
> real    0m1.709s
> real    0m1.716s
> real    0m6.599s
> 
> Total:
> real    0m12.614s
> 
> Very wonky.  Maybe this has something to do with the mdraid we use
> underneath both, or maybe it's something architectural I'm not immediately
> grasping that impacts all extent-based filesystems.  Will report back when I
> have blktraces.

Not just extent-based.  ext2 and ext3 had O(n) deletes too, and they
used block lists.

> Best,
> 
> ellis


* Re: BTRFS File Delete Speed Scales With File Size?
       [not found]   ` <d5379505-7dd1-d5bc-59e7-207aaa82acf6@panasas.com>
@ 2020-06-11 13:54     ` Ellis H. Wilson III
  2020-06-15 22:00       ` Ellis H. Wilson III
  0 siblings, 1 reply; 13+ messages in thread
From: Ellis H. Wilson III @ 2020-06-11 13:54 UTC (permalink / raw)
  To: Btrfs BTRFS

It seems my first email from last night didn't go through.  Resending, 
but without the .txt attachment (all of it is described below in enough 
detail that I don't think it's needed).

On 6/10/20 9:23 PM, Ellis H. Wilson III wrote:
> Responses to each person below (sorry I thought it useful to collapse a 
> few emails into a single response due to the similar nature of the 
> comments):
> 
> 
> Adam:
> 
> Request for show_file.py output (extent_map.txt) for a basic 64GiB file 
> is attached.  It appears the segments are roughly 128MB in size, so 
> there are (and it reports) 512 of them, and they are all roughly 
> sequential on disk.  Obviously if I delete things in parallel (as our 
> end-users are extremely liable to do, us being a parallel file system 
> and all) one might expect seek time to dominate and some 10ms per 
> segment (~5s) to occur if you have to hit every segment before you 
> return to the caller.  Serial deletion testing shows an average time of 
> 1.66s across 10 data points, with decent variability, but the overall 
> average is at the upper-end of what we saw for parallel deletes 
> (30-40GB/s).
> 
> Is extent size adjustable (I couldn't find this trivially via search)? 
> Our drives are raided (raid0 via mdadm) together into ~72TB or (soon) 
> ~96TB BTRFS filesystems, and we manage small files on separate SSD 
> pools, so optimizing for very large data is highly viable if the options 
> exist to do so and we expect the seek from extent to extent is the 
> dominating time for deletion.  Side-effect information is welcome, or 
> you can point me to go read a specific architectural document and I'll 
> happily oblige.
> 
> 
> Martin:
> 
> Remounted with nodatasum:
> [478141.959269] BTRFS info (device md0): setting nodatasum
> 
> Created another 10x64GB files.  Average deletion time for serial, 
> in-order deletion is 1.12s per file. This is faster than before, but 
> still slower than expected.  Oddly, I notice that for the first 5 
> deletions, the speed is /much/ faster: between 0.04s and 0.07s, which is 
> in the far more reasonable ~TB deleted per second range.  However, the 
> next 4 are around 1.7s per delete, and the 10th one takes 5s for some 
> reason.  Again, these were deleted in the order they were created and 
> one at a time rather than in parallel like my previous experiment.
> 
> 
> Hans:
> 
> Your script reports what I'd gathered from Adam's suggestion to use your 
> other script:
> 
> Referencing data from 512 regular extents with total size 64.00GiB 
> average size 128.00MiB
> 
> No dedupe or compression enabled on our BTRFS filesystems at present.
> 
> Regarding the I/O pattern, I'd not taken block traces, but we notice 
> lots of low-bandwidth but high-activity reading (suggesting high 
> seeking) followed by shorter phases of the same kind of writes.
> 
> Regarding BTRFS metadata on SSDs -- I've mentioned it before on this 
> list but that would be an incredible feature add for us, and for more 
> than just deletion speed.  Do you know if that is being developed actively?
> 
> I did the same test (10x64GiB files, created in-order/serially, deleted 
> in-order/serially) on our SATA SSD that manages small files for us.  The 
> result was surprising:
> 
> time (for i in {1..10}; do time (rm test$i; sync); done)
> #snipped just the reals out:
> real    0m0.309s
> real    0m0.286s
> real    0m0.272s
> real    0m0.298s
> real    0m0.286s
> real    0m0.870s
> real    0m1.861s
> real    0m1.877s
> real    0m1.905s
> real    0m5.002s
> 
> Total:
> real    0m12.978s
> 
> So this shouldn't be a seek-dominated issue (though SATA, this is an 
> enterprise-grade SSD), and I see the same pathology I saw when I tried 
> the nodatasum option for Martin.  Fast deletions up-front (though slower 
> than on the HDDs, which is weird), then a series of medium-slow 
> deletions, and then one long and slow deletion at the end.  I definitely 
> need to do a blocktrace on this.
> 
> /sys/fs/btrfs/<UUID>/features/free_space_tree *does* exist for the UUID 
> of the md0 HDD array and the single SATA SSD referred to above.
> 
> "* Organize incoming data in btrfs subvolumes in a way that enables you
> to remove the subvol instead of rm'ing the files."
> 
> A similar approach was explored already.  We find that if you remove the 
> subvolume, BTRFS does a better job of throttling itself in the kernel 
> while doing deletions and the disks are generally not as busy, but 
> subsequent reflinks across volumes stall horribly.  Frequently in the 
> tens of seconds range.  So our idea of "create a series of recycle bins 
> and delete them when they reach some threshold" didn't quite work 
> because moving new stuff being deleted in the foreground to one of the 
> other recycle bins would come to a screeching halt when another recycle 
> bin was being deleted.
> 
> "* Remove stuff in the same order as it got added."
> 
> We use temporal ordering of data into buckets for our parallel file 
> system (PFS) components to make things easier on the dirents; 
> nevertheless, we're at the mercy of the user's workload ultimately. 
> Nobody is removing anything like rm * as our PFS components are randomly 
> assigned IDs.
> 
> We realize the need to move to something newer, however, we've noticed 
> serious creation-rate degradation in recent revisions (that's for a 
> different thread) and don't presently have all of the upgrade hooks 
> in-place to do full OS upgrade yet anyhow.
> 
> Thanks again for everyone's comments, and if you have any additional 
> suggestions I'm all ears.
> 
> Best,  I'll respond when I get block graphs back.
> 
> ellis

* Re: BTRFS File Delete Speed Scales With File Size?
  2020-06-11 13:54     ` Ellis H. Wilson III
@ 2020-06-15 22:00       ` Ellis H. Wilson III
  2020-06-16  3:56         ` Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: Ellis H. Wilson III @ 2020-06-15 22:00 UTC (permalink / raw)
  To: Btrfs BTRFS

Some follow-ups:

1. We've discovered that background dirty sync will cause the deletion 
slow-downs I witnessed before.  If we sync and drop all vm caches before 
starting, the deletion time for a 64GiB file averages around 2.5s.

2. We find that the time to delete and fsync the directory on 4.12 is 
equivalent to deleting and then syncing the entire filesystem.

3. We tested on 5.7.1-1-default, the vanilla release available in 
tumbleweed openSuSE.  We find that removes are greatly improved 
(20x64GiB files removed, serially, in the order they were created, an 
fsync of the directory holding them done after each remove, and all 
caches dropped before beginning):

  real    0m0.112s
  real    0m0.024s
  real    0m0.038s
  real    0m0.039s
  real    0m0.017s
  real    0m0.032s
  real    0m0.073s
  real    0m0.019s
  real    0m0.041s
  real    0m0.029s
  real    0m0.034s
  real    0m0.030s
  real    0m0.023s
  real    0m0.012s
  real    0m0.020s
  real    0m0.035s
  real    0m0.013s
  real    0m0.066s
  real    0m0.014s
  real    0m0.007s

We further note background busy-ness from the btrfs transaction thread many 
seconds later, suggesting that the new approach to unlink better mirrors 
how BTRFS manages subvolume deletion.

4. We also note that deletion speed goes back to the slow speeds 
reported before if you add a full sync after every remove rather than a 
targeted fsync.  This is relatively unsurprising given the different 
expectations, though an interesting finding given that merely fsync'ing 
didn't help in the case of 4.12.

Any comments on when major change(s) came in that would impact these 
behaviors would be greatly appreciated.

Thanks,

ellis



* Re: BTRFS File Delete Speed Scales With File Size?
  2020-06-15 22:00       ` Ellis H. Wilson III
@ 2020-06-16  3:56         ` Zygo Blaxell
  2020-06-16 13:25           ` Ellis H. Wilson III
  0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-06-16  3:56 UTC (permalink / raw)
  To: Ellis H. Wilson III; +Cc: Btrfs BTRFS

On Mon, Jun 15, 2020 at 06:00:47PM -0400, Ellis H. Wilson III wrote:
> Some follow-ups:
> 
> 1. We've discovered that background dirty sync will cause the deletion
> slow-downs I witnessed before.  If we sync and drop all vm caches before
> starting, the deletion time for a 64GiB file averages around 2.5s.
> 
> 2. We find that the time to delete and fsync the directory on 4.12 is
> equivalent to deleting and then syncing the entire filesystem.

I'm not quite sure how fsync() is relevant here--fsync() doesn't do
anything about delayed refs, unless fsync() accidentally triggers a full
sync (which I wouldn't expect to happen when fsyncing a simple unlink).

Are you measuring fsync() and unlink() separately, or the total time
running both?  I would expect the unlink() to be slow on kernels up to
4.20, and fsync() to be almost a no-op on kernels after 4.0.
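
In case it helps, a sketch that times the two calls separately (the path
is a placeholder):

---- >8 ----
#!/usr/bin/python3

import os
import time

path = '/mnt/test/file1'   # placeholder
dirfd = os.open(os.path.dirname(path), os.O_DIRECTORY)

t0 = time.monotonic()
os.unlink(path)            # time spent in the unlink() system call
t1 = time.monotonic()
os.fsync(dirfd)            # time spent fsync'ing the parent directory
t2 = time.monotonic()
os.close(dirfd)

print('unlink: {:.3f}s  fsync(dir): {:.3f}s'.format(t1 - t0, t2 - t1))
---- >8 ----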

> 3. We tested on 5.7.1-1-default, the vanilla release available in tumbleweed
> openSuSE.  We find that removes are greatly improved (20x64GiB files
> removed, serially, in the order they were created, an fsync of the directory
> holding them done after each remove, and all caches dropped before
> beginning):
> 
>  real    0m0.112s
>  real    0m0.024s
>  real    0m0.038s
>  real    0m0.039s
>  real    0m0.017s
>  real    0m0.032s
>  real    0m0.073s
>  real    0m0.019s
>  real    0m0.041s
>  real    0m0.029s
>  real    0m0.034s
>  real    0m0.030s
>  real    0m0.023s
>  real    0m0.012s
>  real    0m0.020s
>  real    0m0.035s
>  real    0m0.013s
>  real    0m0.066s
>  real    0m0.014s
>  real    0m0.007s
> 
> We further note background busyness from the btrfs transaction thread many
> seconds later, suggesting that the new approach to unlink better mirrors how
> BTRFS manages subvolume deletion.

Without delayed ref throttling, ordinary unlinks get millions of their
reference updates pushed into the background, where they do indeed behave
more like subvol deletes (which are in the background for their entire
lives, and consist entirely of reference updates).

Removing the throttling trades kernel memory for speed--at least, as long
as the kernel can keep allocating more memory.  When memory runs out,
or the filesystem is synced or umounted (actions which force a commit),
all pending ref updates get flushed synchronously, and delays can be
significant.

Fast SSDs might be able to process 10k delayed refs per second, while
a few GB of host RAM can hold millions of delayed refs.  The math
multiplies up to hours of IO time pretty quickly on hosts with more RAM
and slower spinning disks.
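
Back-of-envelope, with purely illustrative numbers for the per-ref
memory footprint and per-device ref throughput (none of these are
measured values):

---- >8 ----
#!/usr/bin/python3
# Back-of-envelope for the backlog described above; every number here is
# an illustrative assumption, not a measurement.

ram_bytes = 4 * 2**30        # a few GB of kernel memory holding delayed refs
bytes_per_ref = 300          # assumed in-memory footprint per delayed ref
refs = ram_bytes // bytes_per_ref

ssd_refs_per_sec = 10_000    # "fast SSD" figure from above
hdd_refs_per_sec = 200       # seek-bound spinning disk, ~1 random IO per ref

print("queued refs: ~%.0f million" % (refs / 1e6))
print("SSD drain:   ~%.0f minutes" % (refs / ssd_refs_per_sec / 60))
print("HDD drain:   ~%.0f hours" % (refs / hdd_refs_per_sec / 3600))
---- >8 ----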

> 4. We also note that deletion speed goes back to the slow speeds reported
> before if you add a full sync after every remove rather than a targeted
> fsync.  This is relatively unsurprising given the different expectations,
> though an interesting finding given that merely fsync'ing didn't help in the
> case of 4.12.

unlink() reads the extent references contained in a file and creates
delayed refs (an in-memory log of reference adds and deletes) to
remove the extent references in a kernel thread (sometimes a different,
background thread, though neither is guaranteed).

4.12 throttles unlink() to avoid creating huge backlogs of ref updates
in kernel memory.  During the throttling, previously delayed refs
(including ref updates caused by other threads) are processed by the
kernel thread servicing the unlink() system call.

5.7 doesn't throttle unlink(), so the unlink() system call simply queues
delayed ref updates as quickly as the kernel can allocate memory.
Once kernel memory runs out, unlink() will start processing delayed
refs during the unlink() system call, blocking the caller of unlink().
If memory doesn't run out, then the processing happens during transaction
commit in some other thread (which may be btrfs-transaction, or some
other unlucky user thread writing to the filesystem that triggers delayed
ref processing).

In both kernels there will be bursts of fast processing as unlink()
borrows memory, with occasional long delays while unlink() (or some other
random system call) pays off memory debt.  4.12 limited this borrowing
to thousands of refs and most of the payment to the unlink() caller;
in 5.7, there are no limits, and the debt to be paid by a random user
thread can easily be millions of refs, each of which may require a page
of IO to complete.

The total IO cost from the start of the first unlink() call to the end
of the last block freed and committed on disk should be about the same
on 4.12 and 5.7, though there are many other small improvements between
4.12 and 5.7 that might make 5.7 slightly faster than 4.12.  Any other
measurement methodology will not observe some part of the total IO cost.

Nothing about the fundamental work of unlink() has changed--the filesystem
must read and write all the same metadata blocks in more or less the same
order to delete a file.  The difference is which kernel thread does the
work, and how much kernel memory is occupied by the filesystem while
that work is done.

> Any comments on when major change(s) came in that would impact these
> behaviors would be greatly appreciated.

Between 4.20 and 5.0 the delayed ref throttling went away, which might
account for a sudden shift of latency to a different part of the kernel.

The impact is that there is now a larger window for the filesystem to
roll back to earlier versions of the data after a crash.  Without the
throttling, unlink can just keep appending more and more ref deletes to
the current commit faster than the disks can push out updated metadata.
As a result, there is a bad case where the filesystem spends all of
its IO time trying to catch up to metadata updates, but never successfully
completes a commit on disk.

A crash rolls back to the previous completed commit and replays the
fsync log.  A crash on a 5.0 to 5.7 kernel potentially erases hours of
work, or forces the filesystem to replay hours of fsync() log when the
filesystem is mounted again.  This requires specific conditions, but it
sounds like your use case might create those conditions.

Some details:

	https://lore.kernel.org/linux-btrfs/20200209004307.GG13306@hungrycats.org/

5.7 has some recent delayed ref processing improvements, but AFAIK
the real fixes (which include putting the throttling back) are still
in development.

> Thanks,
> 
> ellis
> 
> 
> On 6/11/20 9:54 AM, Ellis H. Wilson III wrote:
> > It seems my first email from last night didn't go through.  Resending,
> > but without the .txt attachment (all of it is described below in enough
> > detail that I don't think it's needed).
> > 
> > On 6/10/20 9:23 PM, Ellis H. Wilson III wrote:
> > > Responses to each person below (sorry I thought it useful to
> > > collapse a few emails into a single response due to the similar
> > > nature of the comments):
> > > 
> > > 
> > > Adam:
> > > 
> > > Request for show_file.py output (extent_map.txt) for a basic 64GiB
> > > file is attached.  It appears the segments are roughly 128MB in
> > > size, so there are (and it reports) 512 of them, and they are all
> > > roughly sequential on disk.  Obviously if I delete things in
> > > parallel (as our end-users are extremely liable to do, us being a
> > > parallel file system and all) one might expect seek time to dominate
> > > and some 10ms per segment (~5s) to occur if you have to hit every
> > > segment before you return to the caller.  Serial deletion testing
> > > shows an average time of 1.66s across 10 data points, with decent
> > > variability, but the overall average is at the upper-end of what we
> > > saw for parallel deletes (30-40GB/s).
> > > 
> > > Is extent size adjustable (I couldn't find this trivially via
> > > search)? Our drives are raided (raid0 via mdadm) together into ~72TB
> > > or (soon) ~96TB BTRFS filesystems, and we manage small files on
> > > separate SSD pools, so optimizing for very large data is highly
> > > viable if the options exist to do so and we expect the seek from
> > > extent to extent is the dominating time for deletion.  Side-effect
> > > information is welcome, or you can point me to go read a specific
> > > architectural document and I'll happily oblige.
> > > 
> > > 
> > > Martin:
> > > 
> > > Remounted with nodatasum:
> > > [478141.959269] BTRFS info (device md0): setting nodatasum
> > > 
> > > Created another 10x64GB files.  Average deletion speed for serial,
> > > in-order deletion is 1.12s per file. This is faster than before, but
> > > still slower than expected.  Oddly, I notice that for the first 5
> > > deletions, the speed is /much/ faster: between 0.04s and 0.07s,
> > > which is in the far more reasonable ~TB deleted per second range. 
> > > However, the next 4 are around 1.7s per delete, and the 10th one
> > > takes 5s for some reason.  Again, these were deleted in the order
> > > they were created and one at a time rather than in parallel like my
> > > previous experiment.
> > > 
> > > 
> > > Hans:
> > > 
> > > Your script reports what I'd gathered from Adam's suggestion to use
> > > your other script:
> > > 
> > > Referencing data from 512 regular extents with total size 64.00GiB
> > > average size 128.00MiB
> > > 
> > > No dedupe or compression enabled on our BTRFS filesystems at present.
> > > 
> > > Regarding the I/O pattern, I'd not taken block traces, but we notice
> > > lots of low-bandwidth but high-activity reading (suggesting high
> > > seeking) followed by shorter phases of the same kind of writes.
> > > 
> > > Regarding BTRFS metadata on SSDs -- I've mentioned it before on this
> > > list but that would be an incredible feature add for us, and for
> > > more than just deletion speed.  Do you know if that is being
> > > developed actively?
> > > 
> > > I did the same test (10x64GiB files, created in-order/serially,
> > > deleted in-order/serially) on our SATA SSD that manages small files
> > > for us.  The result was surprising:
> > > 
> > > time (for i in {1..10}; do time (rm test$i; sync); done)
> > > #snipped just the reals out:
> > > real    0m0.309s
> > > real    0m0.286s
> > > real    0m0.272s
> > > real    0m0.298s
> > > real    0m0.286s
> > > real    0m0.870s
> > > real    0m1.861s
> > > real    0m1.877s
> > > real    0m1.905s
> > > real    0m5.002s
> > > 
> > > Total:
> > > real    0m12.978s
> > > 
> > > So this shouldn't be a seek-dominated issue (though SATA, this is an
> > > enterprise-grade SSD), and I see the same pathology I saw when I
> > > tried the nodatasum option for Martin.  Fast deletions up-front
> > > (though slower than on the HDDs, which is weird), then a series of
> > > medium-slow deletions, and then one long and slow deletion at the
> > > end.  I definitely need to do a blocktrace on this.
> > > 
> > > /sys/fs/btrfs/<UUID>/features/free_space_tree *does* exist for the
> > > UUID of the md0 HDD array and the single SATA SSD referred to above.
> > > 
> > > "* Organize incoming data in btrfs subvolumes in a way that enables you
> > > to remove the subvol instead of rm'ing the files."
> > > 
> > > A similar approach was explored already.  We find that if you remove
> > > the subvolume, BTRFS does a better job of throttling itself in the
> > > kernel while doing deletions and the disks are generally not as
> > > busy, but subsequent reflinks across volumes stall horribly. 
> > > Frequently in the tens of seconds range.  So our idea of "create a
> > > series of recycle bins and delete them when they reach some
> > > threshold" didn't quite work because moving new stuff being deleted
> > > in the foreground to one of the other recycle bins would come to a
> > > screeching halt when another recycle bin was being deleted.
> > > 
> > > "* Remove stuff in the same order as it got added."
> > > 
> > > We use temporal ordering of data into buckets for our parallel file
> > > system (PFS) components to make things easier on the dirents;
> > > nevertheless, we're ultimately at the mercy of the user's workload.
> > > Nobody is removing anything like rm * as our PFS components are
> > > randomly assigned IDs.
> > > 
> > > We realize the need to move to something newer; however, we've
> > > noticed serious creation-rate degradation in recent revisions
> > > (that's for a different thread) and don't presently have all of the
> > > upgrade hooks in place to do a full OS upgrade yet anyhow.
> > > 
> > > Thanks again for everyone's comments, and if you have any additional
> > > suggestions I'm all ears.
> > > 
> > > Best.  I'll respond when I get block graphs back.
> > > 
> > > ellis
> > > 
> > > 
> > > 
> > > On 6/9/20 7:09 PM, Hans van Kranenburg wrote:
> > > > Hi!
> > > > 
> > > > On 6/9/20 5:31 PM, Ellis H. Wilson III wrote:
> > > > > Hi folks,
> > > > > 
> > > > > We have a few engineers looking through BTRFS code presently
> > > > > for answers
> > > > > to this, but I was interested to get input from the experts in parallel
> > > > > to hopefully understand this issue quickly.
> > > > > 
> > > > > We find that removes of large amounts of data can take a significant
> > > > > amount of time in BTRFS on HDDs -- in fact it appears to scale linearly
> > > > > with the size of the file.  I'd like to better understand the mechanics
> > > > > underpinning that behavior.
> > > > 
> > > > Like Adam already mentioned, the amount and size of the individual
> > > > extent metadata items that need to be removed is one variable in the big
> > > > equation.
> > > > 
> > > > The code in show_file.py example in python-btrfs is a bit outdated, and
> > > > it prints info about all extents that are referenced and a bit more.
> > > > 
> > > > Alternatively, we can only look at the file extent items and calculate
> > > > the amount and average and total size (on disk):
> > > > 
> > > > ---- >8 ----
> > > > #!/usr/bin/python3
> > > > 
> > > > import btrfs
> > > > import os
> > > > import sys
> > > > 
> > > > with btrfs.FileSystem(sys.argv[1]) as fs:
> > > >      inum = os.fstat(fs.fd).st_ino
> > > >      min_key = btrfs.ctree.Key(inum, btrfs.ctree.EXTENT_DATA_KEY, 0)
> > > >      max_key = btrfs.ctree.Key(inum, btrfs.ctree.EXTENT_DATA_KEY, -1)
> > > >      amount = 0
> > > >      total_size = 0
> > > >      for header, data in btrfs.ioctl.search_v2(fs.fd, 0,
> > > > min_key, max_key):
> > > >          item = btrfs.ctree.FileExtentItem(header, data)
> > > >          if item.type != btrfs.ctree.FILE_EXTENT_REG:
> > > >              continue
> > > >          amount += 1
> > > >          total_size += item.disk_num_bytes
> > > >      print("Referencing data from {} regular extents with total
> > > > size {} "
> > > >            "average size {}".format(
> > > >                amount, btrfs.utils.pretty_size(total_size),
> > > >                btrfs.utils.pretty_size(int(total_size/amount))))
> > > > ---- >8 ----
> > > > 
> > > > If I e.g. point this at /bin/bash (compressed), I get:
> > > > 
> > > > Referencing data from 9 regular extents with total size 600.00KiB
> > > > average size 66.67KiB
> > > > 
> > > > The above code assumes that the real data extents are only referenced
> > > > once (no dedupe within the same file etc), otherwise you'll have to
> > > > filter them (and keep more stuff in memory). And, it ignores inlined
> > > > small extents for simplicity. Anyway, you can hack away on it to e.g.
> > > > make it look up more interesting things.
> > > > 
> > > > https://python-btrfs.readthedocs.io/en/latest/btrfs.html#btrfs.ctree.FileExtentItem
> > > > 
> > > > 
> > > > > See the attached graph for a quick experiment that demonstrates this
> > > > > behavior.  In this experiment I use 40 threads to perform deletions of
> > > > > previous written data in parallel.  10,000 files in every case and I
> > > > > scale files by powers of two from 16MB to 16GB.  Thus, the
> > > > > raw amount of
> > > > > data deleted also expands by 2x every step.  Frankly I
> > > > > expected deletion
> > > > > of a file to be predominantly a metadata operation and not scale with
> > > > > the size of the file, but perhaps I'm misunderstanding that.  While the
> > > > > overall speed of deletion is relatively fast (hovering between 30GB/s
> > > > > and 50GB/s) compared with raw ingest of data to the disk array we're
> > > > > using (in our case ~1.5GB/s) it can still take a very long time to
> > > > > delete data from the drives and removes hang completely until that data
> > > > > is deleted, unlike in some other filesystems.  They also compete
> > > > > aggressively with foreground I/O due to the intense seeking
> > > > > on the HDDs.
> > > > 
> > > > What is interesting in this case is to see what kind of IO pattern the
> > > > deletes are causing (so, obviously, test without adding data). Like, how
> > > > much throughput for read and write, and how many iops read and write per
> > > > second (so that we can easily calculate average iop size).
> > > > 
> > > > If most time is spent doing random read IO, then hope for a bright
> > > > future in which you can store btrfs metadata on solid state and the data
> > > > itself on cheaper spinning rust.
> > > > 
> > > > If most time is spent doing writes, then what you're seeing is probably
> > > > what I call rumination, which is caused by making changes in metadata,
> > > > which leads to writing modified parts of metadata in free space again.
> > > > 
> > > > The extent tree (which has info about the actual data extents on disk
> > > > that the file_extent_item ones seen above point to) is one of those,
> > > > and there's only one of that kind, which is even tracking its own space
> > > > allocation, which can lead to the effect of a cat or dog chasing its
> > > > own tail. There have definitely been performance improvements in that
> > > > area, I don't know exactly where, but when I moved from 4.9 to 4.19
> > > > kernel, things improved a bit.
> > > > 
> > > > There is a very long story at
> > > > https://www.spinics.net/lists/linux-btrfs/msg70624.html about these
> > > > kinds of things.
> > > > 
> > > > There are unfortunately no easy accessible tools present yet that can
> > > > give live insight in what exactly is happening when it's writing
> > > > metadata like crazy.
> > > > 
> > > > > This is with an older version of BTRFS (4.12.14-lp150.12.73-default) so
> > > > > if algorithms have changed substantially such that deletion rate is no
> > > > > longer tied to file size in newer versions please just say so and I'll
> > > > > be glad to take a look at that change and try with a newer version.
> > > > 
> > > > That seems to be a suse kernel. I have no idea what kind of btrfs is in
> > > > there.
> > > > 
> > > > > FWIW, we are using the v2 free space cache.
> > > > 
> > > > Just to be sure, there are edge cases in which the filesystem says it's
> > > > using space cache v2, but actually isn't.
> > > > 
> > > > Does /sys/fs/btrfs/<UUID>/features/free_space_tree exist?
> > > > 
> > > > Or, of course a fun little program to just do a bogus search in it,
> > > > which explodes if it's not there:
> > > > 
> > > > ---- >8 ----
> > > > #!/usr/bin/python3
> > > > 
> > > > import btrfs
> > > > 
> > > > with btrfs.FileSystem('/') as fs:
> > > >      try:
> > > >          list(fs.free_space_extents(0, 0))
> > > >          print("Yay!")
> > > >      except FileNotFoundError:
> > > >          print("No FST :(")
> > > > ---- >8 ----
> > > > 
> > > > Space cache v1 will cause filesystem stalls combined with a burst of
> > > > writes on every 'transaction commit', and space cache v2 will cause more
> > > > random reads and writes, but they don't block the whole thing.
> > > > 
> > > > > If any other information is
> > > > > relevant to this just let me know and I'll be glad to share.
> > > > 
> > > > Suggestions for things to try, and see what difference it makes:
> > > > 
> > > > * Organize incoming data in btrfs subvolumes in a way that enables you
> > > > to remove the subvol instead of rm'ing the files. If there are a lot of
> > > > files and stuff, this can be more efficient, since btrfs knows about
> > > > what parts to 'cut away' with a chainsaw when removing, instead of
> > > > telling it to do everything file by file in small steps. It also keeps
> > > > the size of the individual subvolume metadata trees down, reducing
> > > > rumination done by the cow. If you don't have them, your default fs tree
> > > > with all file info is, relatively speaking, as large as the extent tree.
> > > > 
> > > > * Remove stuff in the same order as it got added. Remember, rm * removes
> > > > files in alphabetical order, not in the order in which metadata was
> > > > added to disk (the inode number sequence). It might cause less jumping
> > > > around in metadata. Using subvolumes instead is still better, because in
> > > > that case the whole issue does not exist.
> > > > 
> > > > * (not recommended) If your mount options do not show 'ssd' in them and
> > > > your kernel does not have this patch:
> > > > https://www.spinics.net/lists/linux-btrfs/msg64203.html then you can try
> > > > the mount -o ssd_spread,nossd (read the story in the link above).
> > > > Anyway, you're better off moving to something recent instead of trying
> > > > all these obsolete workarounds...
> > > > 
> > > > If you have a kernel >= 4.14 then you're better off mounting with -o ssd
> > > > for rotational disks because the metadata writes are put more together
> > > > in bigger parts of free space, leading to less rumination. There is
> > > > still a lot of research and improvements to be done in this area (again,
> > > > see long post referenced above).
> > > > 
> > > > If you have kernel < 4.14, then definitely do not mount with -o ssd, but
> > > > even mount ssds with -o nossd explicitly. (yes, that's also a
> > > > long story)...
> > > > 
> > > > Fun! :) /o\
> > > > 
> > > > > Thank you for any time people can spare to help us
> > > > > understand this better!
> > > > 
> > > > Don't hesitate to ask more questions,
> > > > Have fun,
> > > > Hans
> > > > 
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BTRFS File Delete Speed Scales With File Size?
  2020-06-16  3:56         ` Zygo Blaxell
@ 2020-06-16 13:25           ` Ellis H. Wilson III
  2020-06-16 13:55             ` Ellis H. Wilson III
  2020-06-18  3:15             ` Zygo Blaxell
  0 siblings, 2 replies; 13+ messages in thread
From: Ellis H. Wilson III @ 2020-06-16 13:25 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Btrfs BTRFS

First and foremost Zygo, apologies for not having responded to your 
prior two emails.  I was alerted by a colleague that we'd gotten a 
response, and found that very unfortunately our MS filter had marked 
your emails as spam.  I've unmarked them so hopefully this doesn't 
happen in the future.

A few comments on prior emails, and then responses in-line:

40 threads is simply the max due to the nature of our OSD architecture. 
We can make good use of them in the case of heavy PFS metadata updates 
(which do not go to BTRFS), but yes, they can overwhelm spinning rust 
under very heavy deletion workloads.  We're looking into task queuing to 
throttle this.

We are less concerned about the ability to process deletes in O(1) or 
O(log(n)) time, and more concerned about not holding up foreground I/O 
while a number of them are processed (ideally in the background).


On 6/15/20 11:56 PM, Zygo Blaxell wrote:
>> 2. We find that speed to delete and fsync the directory on 4.12 is
>> equivalent to delete and then sync the entire filesystem.
> 
> I'm not quite sure how fsync() is relevant here--fsync() doesn't do
> anything about delayed refs, unless fsync() accidentally triggers a full
> sync (which I wouldn't expect to happen when fsyncing a simple unlink).

fsync is how we guarantee that either a write to or creation of a new 
file has occurred, or that a delete has been acknowledged by the 
underlying FS and will be rolled forward on power-fail.  We recognize 
there may be non-trivial work involved in rolling that forward on 
reboot, especially on spinners.  It was of interest because previously I 
was (wrongly, at least relative to our use of BTRFS in our PFS) timing 
both the remove and the full 'sync' together, rather than just the 
remove and the fsync of the housing directory.  I agree in 4.12 the 
dominant time is the remove, so switching didn't matter, but it does 
matter in 5.7, for reasons you elucidate elsewhere.

> 4.12 throttles unlink() to avoid creating huge backlogs of ref updates
> in kernel memory.  During the throttling, previously delayed refs
> (including ref updates caused by other threads) are processed by the
> kernel thread servicing the unlink() system call.
> 
> 5.7 doesn't throttle unlink(), so the unlink() system call simply queues
> delayed ref updates as quickly as the kernel can allocate memory.
> Once kernel memory runs out, unlink() will start processing delayed
> refs during the unlink() system call, blocking the caller of unlink().
> If memory doesn't run out, then the processing happens during transaction
> commit in some other thread (which may be btrfs-transaction, or some
> other unlucky user thread writing to the filesystem that triggers delayed
> ref processing).

The above is exactly the explanation I've been looking for.  Thank you!

> In both kernels there will be bursts of fast processing as unlink()
> borrows memory, with occasional long delays while unlink() (or some other
> random system call) pays off memory debt.  4.12 limited this borrowing
> to thousands of refs and most of the payment to the unlink() caller;
> in 5.7, there are no limits, and the debt to be paid by a random user
> thread can easily be millions of refs, each of which may require a page
> of IO to complete.

Are there any user-tunable settings for this in 4.12?  We would be 
extremely interested in bumping the outstanding refs in that version if 
doing so was as simple as a sysctl, hidden mount option, or something 
similar.

>> Any comments on when major change(s) came in that would impact these
>> behaviors would be greatly appreciated.
> 
> Between 4.20 and 5.0 the delayed ref throttling went away, which might
> account for a sudden shift of latency to a different part of the kernel.
> 
> The impact is that there is now a larger window for the filesystem to
> roll back to earlier versions of the data after a crash.  Without the
> throttling, unlink can just keep appending more and more ref deletes to
> the current commit faster than the disks can push out updated metadata.
> As a result, there is a bad case where the filesystem spends all of
> its IO time trying to catch up to metadata updates, but never successfully
> completes a commit on disk.
> 
> A crash rolls back to the previous completed commit and replays the
> fsync log.  A crash on a 5.0 to 5.7 kernel potentially erases hours of
> work, or forces the filesystem to replay hours of fsync() log when the
> filesystem is mounted again.  This requires specific conditions, but it
> sounds like your use case might create those conditions.

Yes, we might under the right end-user scenarios and just the right 
timing of a power-fail, but we have appropriate states for the PFS to 
cope with this, which can include a very lengthy bring-up period that 
already includes mounting BTRFS.  Erasing acknowledged fsync'd work 
would be a breach of POSIX and concerning, but taking time on mount to 
replay things like this is both expected and something we do at a 
higher level in our PFS, so we get it.

> Some details:
> 
> 	https://lore.kernel.org/linux-btrfs/20200209004307.GG13306@hungrycats.org/
> 
> 5.7 has some recent delayed ref processing improvements, but AFAIK
> the real fixes (which include putting the throttling back) are still
> in development.

All extremely helpful info Zygo.  Thank you again, and do let me know if 
there are any tunables we can play with in 4.12 to better mimic the 
behavior we're seeing in 5.7.  We'd be indebted to find out there were.

Best,

ellis

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BTRFS File Delete Speed Scales With File Size?
  2020-06-16 13:25           ` Ellis H. Wilson III
@ 2020-06-16 13:55             ` Ellis H. Wilson III
  2020-06-18  3:15             ` Zygo Blaxell
  1 sibling, 0 replies; 13+ messages in thread
From: Ellis H. Wilson III @ 2020-06-16 13:55 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Btrfs BTRFS

On 6/16/20 9:25 AM, Ellis H. Wilson III wrote:
>> In both kernels there will be bursts of fast processing as unlink()
>> borrows memory, with occasional long delays while unlink() (or some other
>> random system call) pays off memory debt.  4.12 limited this borrowing
>> to thousands of refs and most of the payment to the unlink() caller;
>> in 5.7, there are no limits, and the debt to be paid by a random user
>> thread can easily be millions of refs, each of which may require a page
>> of IO to complete.
> 
> Are there any user-tunable settings for this in 4.12?  We would be 
> extremely interested in bumping the outstanding refs in that version if 
> doing so was as simple as a sysctl, hidden mount option, or something 
> similar.

It appears that in btrfs_should_throttle_delayed_refs() in 4.12 the limit 
is hard-coded: throttling kicks in once the pending delayed refs are 
estimated to take more than roughly half a second to a second to complete:

int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans,
                                       struct btrfs_fs_info *fs_info)
{
        u64 num_entries =
                atomic_read(&trans->transaction->delayed_refs.num_entries);
        u64 avg_runtime;
        u64 val;

        smp_mb();
        avg_runtime = fs_info->avg_delayed_ref_runtime;
        val = num_entries * avg_runtime;
        if (val >= NSEC_PER_SEC)
                return 1;
        if (val >= NSEC_PER_SEC / 2)
                return 2;

Please let me know if I'm interpreting this wrongly and there is some 
sysctl/mount/fs tunable I can play with.
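
For my own sanity, here is a rough sketch of what that implies for the 
ref count at which throttling starts, using assumed (not measured) 
avg_delayed_ref_runtime values:

---- >8 ----
#!/usr/bin/python3
# Rough reading of the 4.12 throttle above: throttling starts once
# num_entries * avg_delayed_ref_runtime crosses ~0.5-1s of estimated work.
# The per-ref runtimes below are assumptions, not values from our system.

NSEC_PER_SEC = 10**9

for avg_runtime_ns in (50_000, 200_000, 1_000_000):    # 50us, 200us, 1ms
    hard = NSEC_PER_SEC // avg_runtime_ns               # val >= NSEC_PER_SEC
    soft = (NSEC_PER_SEC // 2) // avg_runtime_ns        # val >= NSEC_PER_SEC / 2
    print("avg %4dus/ref -> throttle around %6d refs (earlier at %6d)"
          % (avg_runtime_ns // 1000, hard, soft))
---- >8 ----

If that reading is right, the 4.12 limit lands in the low thousands of 
refs for spinning-disk-like per-ref costs, which matches the behavior 
you described.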

Thanks,

ellis

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BTRFS File Delete Speed Scales With File Size?
  2020-06-16 13:25           ` Ellis H. Wilson III
  2020-06-16 13:55             ` Ellis H. Wilson III
@ 2020-06-18  3:15             ` Zygo Blaxell
  1 sibling, 0 replies; 13+ messages in thread
From: Zygo Blaxell @ 2020-06-18  3:15 UTC (permalink / raw)
  To: Ellis H. Wilson III; +Cc: Btrfs BTRFS

On Tue, Jun 16, 2020 at 09:25:16AM -0400, Ellis H. Wilson III wrote:
> First and foremost Zygo, apologies for not having responded to your prior
> two emails.  I was alerted by a colleague that we'd gotten a response, and
> found that very unfortunately our MS filter had marked your emails as spam.
> I've unmarked them so hopefully this doesn't happen in the future.
> 
> A few comments on prior emails, and then responses in-line:
> 
> 40 threads is simply the max due to the nature of our OSD architecture. We
> can make good use of them in the case of heavy PFS metadata updates (which
> do not go to BTRFS), but yes, they can overwhelm spinning rust under very
> heavy deletion workloads.  We're looking into task queuing to throttle this.
> 
> We are less concerned about the ability to process deletes in O(1) or
> O(log(n)) time, and more concerned about not holding up foreground I/O while
> a number of them are processed (ideally in the background).

So to clarify:  you want O(1) in the foreground thread (the one that calls
unlink()) and O(log(n)) in the background thread.  Also, you want to be
able to designate some threads as foreground and others as background.



> On 6/15/20 11:56 PM, Zygo Blaxell wrote:
> > > 2. We find that speed to delete and fsync the directory on 4.12 is
> > > equivalent to delete and then sync the entire filesystem.
> > 
> > I'm not quite sure how fsync() is relevant here--fsync() doesn't do
> > anything about delayed refs, unless fsync() accidentally triggers a full
> > sync (which I wouldn't expect to happen when fsyncing a simple unlink).
> 
> fsync is how we are guaranteeing that either a write or create has occurred
> to a new file, or a delete has been acknowledged by the underlying FS and
> will be rolled forward on power-fail.  We recognize there may be non-trivial
> work involved in rolling that forward on reboot, especially on spinners.  It
> was of interest because previously I was (wrongly, at least relative to our
> use of BTRFS in our PFS) timing both the remove and the full 'sync'
> together, rather than just the remove and the fsync of the housing
> directory.  I agree in 4.12 the dominant time is the remove, so switching
> didn't matter, but it does matter in 5.7, for reasons you elucidate
> elsewhere.
> 
> > 4.12 throttles unlink() to avoid creating huge backlogs of ref updates
> > in kernel memory.  During the throttling, previously delayed refs
> > (including ref updates caused by other threads) are processed by the
> > kernel thread servicing the unlink() system call.
> > 
> > 5.7 doesn't throttle unlink(), so the unlink() system call simply queues
> > delayed ref updates as quickly as the kernel can allocate memory.
> > Once kernel memory runs out, unlink() will start processing delayed
> > refs during the unlink() system call, blocking the caller of unlink().
> > If memory doesn't run out, then the processing happens during transaction
> > commit in some other thread (which may be btrfs-transaction, or some
> > other unlucky user thread writing to the filesystem that triggers delayed
> > ref processing).
> 
> The above is exactly the explanation I've been looking for.  Thank you!
> 
> > In both kernels there will be bursts of fast processing as unlink()
> > borrows memory, with occasional long delays while unlink() (or some other
> > random system call) pays off memory debt.  4.12 limited this borrowing
> > to thousands of refs and most of the payment to the unlink() caller;
> > in 5.7, there are no limits, and the debt to be paid by a random user
> > thread can easily be millions of refs, each of which may require a page
> > of IO to complete.
> 
> Are there any user-tunable settings for this in 4.12?  We would be extremely
> interested in bumping the outstanding refs in that version if doing so was
> as simple as a sysctl, hidden mount option, or something similar.

You found the code already.  The specific delayed ref limit is computed
with hardcoded parameters in the kernel, but trivial to change there.

On the other hand, changing it doesn't have very much effect:  you
still don't get to pick which threads are foreground or background.
If the limit is low (as in 4.12) then usually the thread that ends
up stuck with the IO cost is the one that's incurring the IO cost
(i.e. it's the thread doing unlink(), which can be a background worker
thread in your application).  If the limit is high (as in 5.7) then the
thread that gets stuck with the IO cost is much more random (it will be
whichever thread is running when one of several trigger conditions occurs)
and you also get the other bad side-effects of having long update queues.

This is where a lot of write latency in btrfs comes from, and why it's
so hard to avoid priority inversions on btrfs.  Whenever you modify the
filesystem, delayed ref updates go into a single queue where they are
merged with other updates already queued.  There's no priority management
there (AIUI the delayed ref update IO bypasses normal ionice and blkio
cgroup priority settings and doesn't implement any priority algorithm
of its own), so no notion of a "foreground" or "background" thread is
supported--all threads are equal.  What happens instead is that any of
the threads which are trying to perform a write operation on btrfs may
bear the full cost of completing delayed ref updates previously queued
by all threads.

If you have a very predictable bursty workload (e.g. you can quantify
exactly how many extent references will be changed per minute across the
filesystem) then you can increase the hardcoded throttle parameters so
that delayed refs are not throttled below that number, and keep enough
free RAM in the system so the kernel does not run out.  As long as the
limit is not exceeded, delayed ref flushing won't occur until scheduled
transaction commits, and the updates will truly occur in the background
(i.e. only in the btrfs-transaction thread).  As soon as the limit is
exceeded, some thread (you don't get to choose which one) will block,
and block longer if the limit is higher.

If you don't have a predictable workload, then it's better to do the
unlinks in a low-priority (possibly even self-throttling) thread with
4.12-style throttling in the unlink() syscall.  For things like build
servers that have to wipe out large working trees, I move the root of
the tree aside to a deletion queue directory, and a background worker
thread does 'rm -fr' on whatever appears in that directory--slowly.
Other threads still get blocked at random, but only a few seconds at a
time.
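
Roughly, that pattern looks like the sketch below; the queue directory
path and the sleep interval are assumptions you would tune, and the
queue has to live on the same filesystem so the rename stays a cheap
metadata move rather than a copy:

---- >8 ----
#!/usr/bin/python3
# Sketch of the "move aside, then remove slowly" pattern.  DELETE_QUEUE and
# the sleep interval are assumptions to tune; the queue must live on the
# same filesystem as the trees so the rename stays a cheap metadata move.

import os
import time
import uuid

DELETE_QUEUE = "/data/.delete-queue"      # hypothetical path

def queue_for_deletion(tree):
    # Foreground cost is a single rename; no extent refs are touched yet.
    os.rename(tree, os.path.join(DELETE_QUEUE, uuid.uuid4().hex))

def deletion_worker(pause=0.01):
    # Low-priority loop: unlink files one at a time with a pause in between,
    # so delayed refs drain in small batches instead of one huge burst.
    while True:
        for entry in os.listdir(DELETE_QUEUE):
            victim = os.path.join(DELETE_QUEUE, entry)
            for root, dirs, files in os.walk(victim, topdown=False):
                for name in files:
                    os.unlink(os.path.join(root, name))
                    time.sleep(pause)
                for name in dirs:
                    os.rmdir(os.path.join(root, name))
            os.rmdir(victim)
        time.sleep(1)
---- >8 ----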

If we are considering btrfs modifications, an on-disk delayed ref log
might help, at least for the more expensive ref deletes (the ones that
must fetch and modify pages in extent and csum trees).  This would
be processed in the background the same way snapshot deletes are.
Putting the queued ref deletes on disk removes their CoW updates from the
commit critical path, so that any thread that gets stuck with delayed
ref flushing only has to do the cheaper add-ref operations, instead
of the full delayed ref processing.  This idea might not be sane, I'm
thinking aloud here.  Also requires an on-disk format change, so it's
minimum a year away from becoming reality just for that.

A simpler modification is to make unlink() convert the inode into an
orphan inode, then the unlink() syscall returns, and btrfs-cleaner removes
the extents and orphan inode later (with throttling).  That might annoy
users with small filesystems who do big 'rm -rf' followed by big tarball
extracts--a sync would be needed in between in order to avoid running out
of disk space.  That used to be a chronic problem in older versions of
btrfs, and AIUI throttling unlink() calls at the source was the solution.

It might also be useful to implement some kind of priority queue in the
delayed refs code; however, we want transactions to commit at some point,
and the current design requires all the delayed refs to be done before
the commit can happen.  There would need to be some kind of transaction
pipelining to allow high-priority threads to get ahead of low-priority
ones, and even then, all threads have equal priority when there is no
more memory available to defer low-priority work.

> > > Any comments on when major change(s) came in that would impact these
> > > behaviors would be greatly appreciated.
> > 
> > Between 4.20 and 5.0 the delayed ref throttling went away, which might
> > account for a sudden shift of latency to a different part of the kernel.
> > 
> > The impact is that there is now a larger window for the filesystem to
> > roll back to earlier versions of the data after a crash.  Without the
> > throttling, unlink can just keep appending more and more ref deletes to
> > the current commit faster than the disks can push out updated metadata.
> > As a result, there is a bad case where the filesystem spends all of
> > its IO time trying to catch up to metadata updates, but never successfully
> > completes a commit on disk.
> > 
> > A crash rolls back to the previous completed commit and replays the
> > fsync log.  A crash on a 5.0 to 5.7 kernel potentially erases hours of
> > work, or forces the filesystem to replay hours of fsync() log when the
> > filesystem is mounted again.  This requires specific conditions, but it
> > sounds like your use case might create those conditions.
> 
> Yes, we might under the right end-user scenarios and just the right timed
> power-fail, but we have appropriate states for the PFS to cope with this
> that can include a very lengthy bring-up period, of which mounting BTRFS is
> already included.  Erasing acknowledged fsync'd work would be a breach of
> POSIX and concerning, but taking time on mount to replay things like this is
> both expected and something we do at a higher level in our PFS so we get it.

Power-fail timing isn't particularly hard.  If data is coming
in continuously faster than the disks can write it, and commits are
taking an average of 3 hours each, you have only 8 commits per day,
so a randomly timed power failure wipes out an average of 90 minutes of
non-fsynced filesystem changes.

The behavior is an extreme interpretation of POSIX (and Linux conventions
around fsync):  everything that is flushed with fsync() stays flushed,
everything that is not flushed with fsync() reverts to its state at the
previous commit.  It's even useful to force this behavior when you want
to test programs to verify they're calling fsync() everywhere they should.

> > Some details:
> > 
> > 	https://lore.kernel.org/linux-btrfs/20200209004307.GG13306@hungrycats.org/
> > 
> > 5.7 has some recent delayed ref processing improvements, but AFAIK
> > the real fixes (which include putting the throttling back) are still
> > in development.
> 
> All extremely helpful info Zygo.  Thank you again, and do let me know if
> there are any tunables we can play with in 4.12 to better mimic the behavior
> we're seeing in 5.7.  We'd be indebted to find out there were.
> 
> Best,
> 
> ellis

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-06-18  3:15 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-09 15:31 BTRFS File Delete Speed Scales With File Size? Ellis H. Wilson III
2020-06-09 17:53 ` Adam Borowski
2020-06-09 19:23   ` Martin Raiber
2020-06-09 23:09 ` Hans van Kranenburg
2020-06-11  1:56   ` Ellis H. Wilson III
2020-06-11  3:14     ` Zygo Blaxell
     [not found]   ` <d5379505-7dd1-d5bc-59e7-207aaa82acf6@panasas.com>
2020-06-11 13:54     ` Ellis H. Wilson III
2020-06-15 22:00       ` Ellis H. Wilson III
2020-06-16  3:56         ` Zygo Blaxell
2020-06-16 13:25           ` Ellis H. Wilson III
2020-06-16 13:55             ` Ellis H. Wilson III
2020-06-18  3:15             ` Zygo Blaxell
2020-06-10  2:45 ` Zygo Blaxell
