linux-fsdevel.vger.kernel.org archive mirror
* O(n^2) deletion performance
@ 2018-01-02  0:21 Niklas Hambüchen
  2018-01-02  1:20 ` Niklas Hambüchen
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Niklas Hambüchen @ 2018-01-02  0:21 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Paul Eggert, Jim Meyering, Pádraig Brady

Hello filesystem hackers,

Over in coreutils issue https://bugs.gnu.org/29921 we have found that
unlink()ing `n` many files seems to take in general O(n²) wall time.

So far others and I (in CC) have benchmarked `rm -r` and other methods
of unlink()ing multiple files, and we've found approximately quadratic
performance across:

* ext4, XFS and zfs-on-linux on spinning disks
* ext4 on SSDs
* the other combinations were not tested yet

Please find the exact numbers at https://bugs.gnu.org/29921.

So far we've observed apparently linear deletion time on:

* XFS on NVMe

It would be great to hear any kind of insights you may have regarding
this issue, implementation details that might be relevant here, or even
some quick benchmarks on your systems to confirm or refute our findings.
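
For reference, the measurement boils down to something like the
following -- a minimal sketch of the methodology, with names chosen
here only for illustration; the exact scripts and numbers are in the
bug report:

  n=1000000                              # number of files
  mkdir testdir
  (cd testdir && seq $n | xargs touch)   # create n empty files in one directory
  sync
  time rm -r testdir                     # this wall time is what grows super-linearly for us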

Unless instructed otherwise, I would also go ahead and file bugs for the
respective filesystem types, as quadratic deletion time makes for a bad
user experience and lots of sysadmin tears.

Happy new year,
Niklas

* Re: O(n^2) deletion performance
  2018-01-02  0:21 O(n^2) deletion performance Niklas Hambüchen
@ 2018-01-02  1:20 ` Niklas Hambüchen
  2018-01-02  1:59 ` Theodore Ts'o
  2018-01-02  4:54 ` Dave Chinner
  2 siblings, 0 replies; 11+ messages in thread
From: Niklas Hambüchen @ 2018-01-02  1:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Paul Eggert, Jim Meyering, Pádraig Brady

Fixing an omission:

I meant "unlink()ing `n` many files [in the same directory]".

* Re: O(n^2) deletion performance
  2018-01-02  0:21 O(n^2) deletion performance Niklas Hambüchen
  2018-01-02  1:20 ` Niklas Hambüchen
@ 2018-01-02  1:59 ` Theodore Ts'o
  2018-01-02  2:49   ` Andreas Dilger
  2018-01-02  4:54 ` Dave Chinner
  2 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2018-01-02  1:59 UTC (permalink / raw)
  To: Niklas Hambüchen
  Cc: linux-fsdevel, Paul Eggert, Jim Meyering, Pádraig Brady

On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
> Hello filesystem hackers,
> 
> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
> unlink()ing `n` many files seems to take in general O(n²) wall time.
> 
> So far others and I (in CC) have benchmarked `rm -r` and other methods
> of unlink()ing multiple files, and we've found approximately quadratic
> performance across...

So a couple of observations.  You're benchmarking the creation and
deletion of files that don't have any blocks associated with them, so
you are measuring the directory and inode deallocation operations.
There are no block (de-)allocation operations going on.

Secondly, you aren't dropping the cache between the creation
phase and the timed deletion phase of the test.  This means that the
results are heavily going to be impacted by the amount of memory
available on the system.  This doesn't matter all that much since you
aren't actually writing any data blocks, and so you're only measuring
metadata operations which are going to be journalled.
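
(For anyone who wants cold-cache numbers anyway, something along these
lines between the create phase and the timed delete should do it --
just a sketch, needs root:)

  sync                                # write out dirty metadata first
  echo 3 > /proc/sys/vm/drop_caches   # drop the page cache plus dentries and inodes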

Sorting by inode number helps in the cold cache case, especially if
you have data blocks that you need to delete.  I also suspect that the
original tests of the sort-the-inodes commit were run with substantially
smaller numbers of files, especially since these were non-zero-sized files
being deleted.

Sorting by inode number helps optimize deleting files in creation order
for ext3/4 and that helps optimize updates to the inode table and to
the allocation bitmaps.  However, the number of I/Os and hence the
elapsed time is also going to be a function of the directory block
updates.
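
(For experiments, the inode-sorted delete that coreutils does
internally via fts can be approximated from the shell, roughly as
follows -- a sketch assuming GNU find and filenames without
whitespace; "bigdir" is a placeholder:)

  find bigdir -maxdepth 1 -type f -printf '%i %p\n' \
      | sort -n | cut -d' ' -f2- | xargs rm --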

In general you're not going to be able to delete files in optimal
directory tree order *and* simultaneously in optimal inode table order.
So fundamentally there's not a whole lot that can be done here as
directories get super-sized.

For ext4, you can improve things somewhat by using a larger journal
size, if the workload is extremely metadata-heavy (which it is in this
very artificial case of deleting a directory hierarchy with 16 million+
zero-length inodes).  We now use a large default journal size with
newer versions of e2fsprogs, but if your file system was created a
while ago, it probably is maxed out with a 128 meg journal.  Using a
one or two gig journal (e.g., "mke2fs -t ext4 -J size=2048 /dev/sdXX")
will help in this case.  It's unlikely to help that much for more
realistic workloads, though.
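
(On an existing file system the journal can also be recreated at a
larger size without reformatting -- roughly as below, on an unmounted
device, assuming a reasonably recent e2fsprogs:)

  dumpe2fs -h /dev/sdXX | grep -i journal   # check the current journal size
  tune2fs -O ^has_journal /dev/sdXX         # remove the old journal (unmounted fs)
  tune2fs -j -J size=2048 /dev/sdXX         # recreate it at 2048 MB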

> Unless instructed otherwise, I would also go ahead and file bugs for the
> respective filesystem types, as quadratic deletion time makes for a bad
> user experience and lots of sysadmin tears.

Unfortunately there's not a whole lot I can do from the ext4 end.  And
it's not clear to me that optimizing for this extremely artificial
workload makes a lot of sense even if we could do something.  (I
suppose we could create an rm -rf system call which would work so long
as you had sufficient amounts of non-swappable kernel memory available
--- but in the end it's going to be impossible to make something which
is purely linear for arbitrarily large directories.)

							- Ted

P.S.  I suspect that if you had sufficiently large amounts of
super-expensive NVMe storage, you'd see super-linear times even for
XFS on NVMe....  The fast storage simply puts off the inevitable.

* Re: O(n^2) deletion performance
  2018-01-02  1:59 ` Theodore Ts'o
@ 2018-01-02  2:49   ` Andreas Dilger
  2018-01-02  4:27     ` Jim Meyering
  2018-01-02  4:33     ` Theodore Ts'o
  0 siblings, 2 replies; 11+ messages in thread
From: Andreas Dilger @ 2018-01-02  2:49 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Niklas Hambüchen, Linux FS-devel Mailing List, Paul Eggert,
	Jim Meyering, Pádraig Brady

On Jan 1, 2018, at 6:59 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
>> Hello filesystem hackers,
>> 
>> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
>> unlink()ing `n` many files seems to take in general O(n²) wall time.

Looking at the performance results in these tests (in particular the
first results that showed system time separate from user time), the
system time appears to be roughly linear with the number of files that
are deleted.  It is the user time that is increasing quadratically, so
that isn't the fault of the filesystem/kernel but rather "rm" itself
as far as I can see.

>> So far others and I (in CC) have benchmarked `rm -r` and other methods
>> of unlink()ing multiple files, and we've found approximately quadratic
>> performance across...
> 
> So a couple of observations.  You're benchmarking the creation and
> deletion of files that don't have any blocks associated with them, so
> you are measuring the directory and inode deallocation operations.
> There are no block (de-)allocation operations going on.
> 
> Secondly, you aren't dropping the cache between the creation
> phase and the timed deletion phase of the test.  This means that the
> results are heavily going to be impacted by the amount of memory
> available on the system.  This doesn't matter all that much since you
> aren't actually writing any data blocks, and so you're only measuring
> metadata operations which are going to be journalled.
> 
> Sorting by inode number helps in the cold cache case, especially if
> you have data blocks that you need to delete.  I also suspect that the
> original tests of the sort-the-inodes commit were run with substantially
> smaller numbers of files, especially since these were non-zero-sized files
> being deleted.
> 
> Sorting by inode number helps optimize deleting files in creation order
> for ext3/4 and that helps optimize updates to the inode table and to
> the allocation bitmaps.  However, the number of I/Os and hence the
> elapsed time is also going to be a function of the directory block
> updates.
> 

> In general you're not going to be able to delete files in optimal
> directory tree order *and* simultaneously in optimal inode table order.
> So fundamentally there's not a whole lot that can be done here as
> directories get super-sized.

At one time we discussed changing inode number allocation to be
piecewise linear for inodes within the same directory.  As a directory
grows larger, it could grab a chunk of free inodes in the inode table,
then map the new filename hash to the range of free inodes, and use
that when selecting the new inode number.  As that chunk is filled up,
a new, larger chunk of free inodes would be selected and new filenames
would be mapped into the new chunk.

If this was done, then the hash-order directory traversal for deletion
or stat would have approximately O(log(n)) inode table blocks to load
for a given hash range rather than having to get each inode from a
random inode table block.

> For ext4, you can improve things somewhat by using a larger journal
> size, if the workload is extremely metadata-heavy (which it is in this
> very artificial case of deleting a directory hierarchy with 16 million+
> zero-length inodes).  We now use a large default journal size with
> newer versions of e2fsprogs, but if your file system was created a
> while ago, it probably is maxed out with a 128 meg journal.  Using a
> one or two gig journal (e.g., "mke2fs -t ext4 -J size=2048 /dev/sdXX")
> will help in this case.  It's unlikely to help that much for more
> realistic workloads, though.
> 
>> Unless instructed otherwise, I would also go ahead and file bugs for the
>> respective filesystem types, as quadratic deletion time makes for a bad
>> user experience and lots of sysadmin tears.
> 
> Unfortunately there's not a whole lot I can do from the ext4 end.  And
> it's not clear to me that optimizing for this extremely artificial
> workload makes a lot of sense even if we could do something.  (I
> suppose we could create an rm -rf system call which would work so long
> as you had sufficient amounts of non-swappable kernel memory available
> --- but in the end it's going to be impossible to make something which
> is purely linear for arbitrarily large directories.)

Cheers, Andreas

* Re: O(n^2) deletion performance
  2018-01-02  2:49   ` Andreas Dilger
@ 2018-01-02  4:27     ` Jim Meyering
  2018-01-02  6:22       ` Theodore Ts'o
  2018-01-02  4:33     ` Theodore Ts'o
  1 sibling, 1 reply; 11+ messages in thread
From: Jim Meyering @ 2018-01-02  4:27 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Ts'o, Niklas Hambüchen,
	Linux FS-devel Mailing List, Paul Eggert, Pádraig Brady

On Mon, Jan 1, 2018 at 6:49 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Jan 1, 2018, at 6:59 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>>
>> On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
>>> Hello filesystem hackers,
>>>
>>> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
>>> unlink()ing `n` many files seems to take in general O(n²) wall time.
>
> Looking at the performance results in these tests (in particular the
> first results that showed system time separate from user time), the
> system time appears to be roughly linear with the number of files that
> are deleted.  It is the user time that is increasing quadratically, so
> that isn't the fault of the filesystem/kernel but rather "rm" itself
> as far as I can see.

Thank you both for the quick replies.

Here are those numbers again:

 nfiles     real   user     sys

 100000    0.51s  0.07s   0.43s
 200000    2.46s  0.15s   0.89s
 400000   10.78s  0.26s   2.21s
 800000   44.72s  0.58s   6.03s
1600000  180.37s  1.06s  10.70s

Since the user and sys durations for the 1.6M case are so low (nowhere
near 2-3x those of the 800K case), I wonder if that test hit the inode
limit for that file system. Niklas, what does df -i report for the
file system on which you ran that test?

Here are some numbers from a system with 32 GB of RAM, an ext4 SSD and
starting with 8.8M free inodes. They show user and system time very
close to linear, with a little uptick on the last iteration. "real"
seems to be growing too fast, even here.

$ i=100000; while :; do
    mkdir x; (cd x && seq $i | xargs touch)
    env time -f "%e %U %S $i" env rm -rf x
    case $i in 32*) break;; esac; i=$[i*2]
  done
real  user sys   N
0.48 0.05 0.41 100000
1.28 0.10 0.86 200000
3.79 0.21 1.83 400000
10.51 0.43 4.04 800000
28.00 0.88 8.97 1600000
62.54 1.86 19.30 3200000

...
> At one time we discussed changing inode number allocation to be
> piecewise linear for inodes within the same directory.  As a directory
> grows larger, it could grab a chunk of free inodes in the inode table,
> then map the new filename hash to the range of free inodes, and use
> that when selecting the new inode number.  As that chunk is filled up,
> a new, larger chunk of free inodes would be selected and new filenames
> would be mapped into the new chunk.
>
> If this was done, then the hash-order directory traversal for deletion
> or stat would have approximately O(log(n)) inode table blocks to load
> for a given hash range rather than having to get each inode from a
> random inode table block.

Ouch. It sounds like this is a known-O(n) process and the above would
have reduced it to O(log(n)). Can you provide a reference? I am
curious to know how it played out.

Our goal (with fts and coreutils) has been to make it harder for an
accident or maliciousness (with a few million entries in a directory)
to hinder file system traversals. Of course, it's not just rm: any
FS-traversal tool is affected: cp, chmod, chgrp, du, find, tar, etc.
Sure, quotas can help, but even self-inflicted accidents happen on
single-user systems with no quotas.

Idly wondered if the default inode limits could save ext4 users? Perhaps not.
In this 850GB file system, I see it has 48M inodes (caveat, I may have
changed the default when I created it -- don't recall):

  Filesystem     Type Inodes IUsed IFree IUse% Mounted on
  /dev/sda3      ext4    54M  6.0M   48M   12% /x

Putting even half of those in one directory is quick and easy, yet it
would cause disproportionate pain. And that was a small partition.

* Re: O(n^2) deletion performance
  2018-01-02  2:49   ` Andreas Dilger
  2018-01-02  4:27     ` Jim Meyering
@ 2018-01-02  4:33     ` Theodore Ts'o
  1 sibling, 0 replies; 11+ messages in thread
From: Theodore Ts'o @ 2018-01-02  4:33 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Niklas Hambüchen, Linux FS-devel Mailing List, Paul Eggert,
	Jim Meyering, Pádraig Brady

On Mon, Jan 01, 2018 at 07:49:55PM -0700, Andreas Dilger wrote:
> 
> At one time we discussed changing inode number allocation to be
> piecewise linear for inodes within the same directory.  As a directory
> grows larger, it could grab a chunk of free inodes in the inode table,
> then map the new filename hash to the range of free inodes, and use
> that when selecting the new inode number.  As that chunk is filled up,
> a new, larger chunk of free inodes would be selected and new filenames
> would be mapped into the new chunk.

Well, it's not so simple.  Remember that there are only 16 inodes per
4k inode table block.  And only 32,768 inodes per block group.  In the
workloads discussed in the coreutils bug, there are 1 million to 32
million files being created in a single directory.  At that point,
unless we start doing truly ridiculous readaheads in the inode table,
the disk is going to be randomly seeking no matter what you do.  Also,
if we try to stay strictly within the inodes in the block group,
assuming that the average file size is larger than 4k --- generally a
safe bet --- this will tend to separate the data blocks from the
inodes, which will increase latency when reading files.
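
(Those constants are easy to check on a given file system -- a quick
sketch, /dev/sdXX is a placeholder:)

  dumpe2fs -h /dev/sdXX | grep -E 'Block size|Inode size|Inodes per group'
  # e.g. 4096-byte blocks / 256-byte inodes = 16 inodes per inode table block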

And most of the time, optimizing for reading files makes sense (e.g.,
think /usr/include), because that tends to happen more often than rm
-rf workloads.

The bottom line is that this gets tricky, especially if the directory
is dynamically growing and shrinking.  Now you might start deleting
from the first chunk, and start allocating from the second chunk, or
maybe the third chunk.  In the end, it's relatively easy to
optimize for specific workloads --- but even if this is a good idea
--- "rm -rf" of zero-length files is not the first workload I would be
hyper-optimizing for.

Cheers,

					- Ted

* Re: O(n^2) deletion performance
  2018-01-02  0:21 O(n^2) deletion performance Niklas Hambüchen
  2018-01-02  1:20 ` Niklas Hambüchen
  2018-01-02  1:59 ` Theodore Ts'o
@ 2018-01-02  4:54 ` Dave Chinner
  2 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2018-01-02  4:54 UTC (permalink / raw)
  To: Niklas Hambüchen
  Cc: linux-fsdevel, Paul Eggert, Jim Meyering, Pádraig Brady

On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
> Hello filesystem hackers,
> 
> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
> unlink()ing `n` many files seems to take in general O(n²) wall time.
> 
> So far others and I (in CC) have benchmarked `rm -r` and other methods
> of unlink()ing multiple files, and we've found approximately quadratic
> performance across:
> 
> * ext4, XFS and zfs-on-linux on spinning disks
> * ext4 on SSDs
> * the other combinations were not tested yet
> 
> Please find the exact numbers at https://bugs.gnu.org/29921.

Run iowatcher on the tests and have a look at the IO patterns before
and after the modifications to rm. Work out where the seek delays
are actually coming from and find the common characteristics that
all the reports have.
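
(Roughly along these lines -- a sketch only, device and paths are
placeholders; iowatcher turns the same blktrace data into seek graphs:)

  blktrace -d /dev/sdX -o rmtrace &   # capture block-layer events while rm runs
  rm -rf /mnt/test/bigdir
  kill -INT $!; wait                  # stop the trace
  blkparse -i rmtrace | less          # raw events; graph the trace to see the seek pattern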

> It would be great to hear any kind of insights you may have regarding
> this issue, implementation details that might be relevant here, or even
> some quick benchmarks on your systems to confirm or refute our findings.

So I just ran a quick test here on XFS: create files with fsmark,
unmount, mount, remove them, compare wall times.

Nfiles		create time		remove time
  10k		0m0.518s		0m0.658s
  20k		0m1.039s		0m1.342s
 100k		0m4.540s		0m4.646s
 200k		0m8.567s		0m10.180s
   1M		1m23.059s		0m51.582s

Looks pretty linear in terms of rm time. This is with rm v8.26,
and the rm command is "rm -rf /mnt/scratch/0" which contains a
single directory with all the files in it.
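
(For anyone who wants to repeat it, the cycle above is roughly the
following -- a sketch, device and paths are placeholders, and fs_mark
option spellings may vary between versions:)

  fs_mark -k -S 0 -s 0 -n 1000000 -L 1 -d /mnt/scratch/0   # 1M zero-length files, kept on disk
  umount /mnt/scratch && mount /dev/vdb /mnt/scratch       # remount so the remove starts cache-cold
  time rm -rf /mnt/scratch/0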

The typical IO pattern during rm is getdents @ 40calls/s and 20,000
inode lookup calls/s. i.e. inode lookups outnumber directory reads
by 500:1. XFS does inode reads in clusters, so that 20,000 inodes/s
after a sequential create is being serviced by about 650 read IO/s.
Not very much, really.

This is run on XFS on raid0 of 2 NVMe drives, sparse 500TB file
passed to a VM as virtio,cache=none,aio=native drive formatted with
XFS. (i.e. XFS on XFS).

So, run it on spinning drives. Same VM, just a SATA XFS image file
on a sata drive on the host used instead:

Nfiles		create time		remove time
  10k		0m0.520s		0m0.643s
  20k		0m1.198s		0m1.351s
 100k		0m5.306s		0m6.768s
 200k		0m11.083s		0m13.763s
   1M		1m39.876s		1m14.765s

So, pretty much identical, still linear. The difference in
performance between the two was the journal size - the smaller
filesystem wrote at 50MB/s continually to the journal, the large
NVMe filesystem hardly touched it. Remade small filesystem with
larger log, IO patterns and performance numbers are identical to the
large filesystem.
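
(i.e. remade with something like the following -- device is a
placeholder; the XFS internal log size is set at mkfs time:)

  mkfs.xfs -f -l size=1g /dev/vdb   # remake the filesystem with a 1 GiB internal log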

So I locked out 15.5GB of ram, so the test runs short of memory and
causes writeback to occur during creations.  Barely changed the
performance on either spinning or NVMe drives.

Just upgraded to coreutils 8.28. Performance is basically identical
to 8.26.

Oh, I just remembered that I've also got some old samsung 840 EVO
SSDs attached to that VM! I ran the test again, got the same
results. And, while I'm at it, I ran it on the emulated pmem I have
configured, and it got the same results.

IOWs, I've got XFS on 4.14 producing near identical results for
spinning disks, old SATA SSDs, brand new NVMe SSDs and pmem.  There
simply isn't an O(n^2) problem in the userspace or XFS code before
or after the changes pointed out in the bug linked above.

Keep in mind that running a random script like this on a random
filesystem in a random initial state is about the most-sub-optimal
environment you can make for a regression test.  It won't give
reliable results and you can't even compare results from the same
machine and filesystem because initial conditions are always
different from run to run.

> Unless instructed otherwise, I would also go ahead and file bugs for the
> respective filesystem types, as quadratic deletion time makes for a bad
> user experience and lots of sysadmin tears.

I think you're being overly melodramatic. There isn't an obvious
problem, just a lot of handwaving. Do some IO analysis that show
where the delays are coming from, then we'll have something to work
from....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: O(n^2) deletion performance
  2018-01-02  4:27     ` Jim Meyering
@ 2018-01-02  6:22       ` Theodore Ts'o
  2018-01-04  4:16         ` Jim Meyering
  0 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2018-01-02  6:22 UTC (permalink / raw)
  To: Jim Meyering
  Cc: Andreas Dilger, Niklas Hambüchen,
	Linux FS-devel Mailing List, Paul Eggert, Pádraig Brady

On Mon, Jan 01, 2018 at 08:27:48PM -0800, Jim Meyering wrote:
> Our goal (with fts and coreutils) has been to make it harder for an
> accident or maliciousness (with a few million entries in a directory)
> to hinder file system traversals. Of course, it's not just rm: any
> FS-traversal tool is affected: cp, chmod, chgrp, du, find, tar, etc.
> Sure, quotas can help, but even self-inflicted accidents happen on
> single-user systems with no quotas.
> 
> Idly wondered if the default inode limits could save ext4 users? Perhaps not.
> In this 850GB file system, I see it has 48M inodes (caveat, I may have
> changed the default when I created it -- don't recall):

Well, it's a bit of a blunt hammer, but you *can* set a mount option
"mount -t ext4 -o max_dir_size_kb=512" which will not allow the
directory to grow larger than 512k (or pick your favorite limit).
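
(For example, on a live system or persistently -- the mount point and
device here are placeholders:)

  mount -o remount,max_dir_size_kb=512 /home
  # or persistently via /etc/fstab:
  # /dev/sdXX  /home  ext4  defaults,max_dir_size_kb=512  0  2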

To me that's like putting a speed limit governor on a car because you
don't trust the driver not to exceed some arbitrary (sometimes imposed
by politicians who think they know better) limit.  Personally, I would
be super annoyed if my car were outfitted with something which didn't
let me go faster than 90 km/h or 56 mph.  But, apparently many
commercial vehicles, particularly in the UK, do have such things.

But if you don't trust your users, the feature is there for a
sysadmin/BOFH to impose on said users; but personally, it's a relatively
rare thing, and when trading off between the pain caused by hitting an
artificially imposed limit, versus the patience needed to recover from
accidentally dumping 16 million files into a directory --- I prefer
the latter.  I can wait a few minutes....

						- Ted

* Re: O(n^2) deletion performance
  2018-01-02  6:22       ` Theodore Ts'o
@ 2018-01-04  4:16         ` Jim Meyering
  2018-01-04  7:16           ` Theodore Ts'o
  2018-01-04 11:42           ` Dave Chinner
  0 siblings, 2 replies; 11+ messages in thread
From: Jim Meyering @ 2018-01-04  4:16 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Andreas Dilger, Niklas Hambüchen,
	Linux FS-devel Mailing List, Paul Eggert, Pádraig Brady

On Mon, Jan 1, 2018 at 10:22 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Jan 01, 2018 at 08:27:48PM -0800, Jim Meyering wrote:
>> Our goal (with fts and coreutils) has been to make it harder for an
>> accident or maliciousness (with a few million entries in a directory)
>> to hinder file system traversals. Of course, it's not just rm: any
>> FS-traversal tool is affected: cp, chmod, chgrp, du, find, tar, etc.
>> Sure, quotas can help, but even self-inflicted accidents happen on
>> single-user systems with no quotas.
>>
>> Idly wondered if the default inode limits could save ext4 users? Perhaps not.
>> In this 850GB file system, I see it has 48M inodes (caveat, I may have
>> changed the default when I created it -- don't recall):
>
> Well, it's a bit of a blunt hammer, but you *can* set a mount option
> "mount -t ext4 -o max_dir_size_kb=512" which will not allow the
> directory to grow larger than 512k (or pick your favorite limit).

Thanks, but no thanks :-)

Still wondering how this happened... deliberate optimization for
something else, probably.
And wishing I'd written a relative (not absolute) test for it in 2008,
so I would have noticed sooner.
In 2008 when I wrote this coreutils extN performance test:

  https://git.savannah.gnu.org/cgit/coreutils.git/tree/tests/rm/ext3-perf.sh

there was no O(N^2) or even "just" O(N^1.5) component when using the
then-just-improved rm. Many of us plotted the curves.

Any idea when ext4's unlink became more expensive?

> versus the patience needed to recover from
> accidentally dumping 16 million files into a directory --- I prefer
> the latter.  I can wait a few minutes....

I've just run a test on the spinning-disk file system mentioned above,
and it took 75 minutes to delete 12.8M entries. That's rather nasty.

On the bright side, Kevin Vigor was kind enough to run tests showing
that on some large, fast NVMe devices, everything looks linear:
https://docs.google.com/spreadsheets/d/1bPi8MTvSP4xzzuARPOd5fxFujhBoU2Dxandr-Vh1T9c/edit#gid=0

* Re: O(n^2) deletion performance
  2018-01-04  4:16         ` Jim Meyering
@ 2018-01-04  7:16           ` Theodore Ts'o
  2018-01-04 11:42           ` Dave Chinner
  1 sibling, 0 replies; 11+ messages in thread
From: Theodore Ts'o @ 2018-01-04  7:16 UTC (permalink / raw)
  To: Jim Meyering
  Cc: Andreas Dilger, Niklas Hambüchen,
	Linux FS-devel Mailing List, Paul Eggert, Pádraig Brady

On Wed, Jan 03, 2018 at 08:16:58PM -0800, Jim Meyering wrote:
> 
> Still wondering how this happened... deliberate optimization for
> something else, probably.
> And wishing I'd written a relative (not absolute) test for it in 2008,
> so I would have noticed sooner.
> In 2008 when I wrote this coreutils extN performance test:
> 
>   https://git.savannah.gnu.org/cgit/coreutils.git/tree/tests/rm/ext3-perf.sh
> 
> there was no O(N^2) or even "just" O(N^1.5) component when using the
> then-just-improved rm. Many of us plotted the curves.

How high (how many files) did you plot the curves back in 2008?  Do
you remember?

> Any idea when ext4's unlink became more expensive?

Ext4's unlink hasn't really changed.  What *has* changed over time is
the block and inode allocation algorithms.  2008 was before we
optimized the allocation algorithms to make read/write operations on
large files more efficient, and to try to avoid free space
fragmentation as the file system aged.

Given that on large / fast NVMe device everything looks linear, it's
pretty clear that what you're seeing is essentially seek time on
spinning rust, and probably SSD GC overhead on flash devices.

> I've just run a test on the spinning-disk file system mentioned above,
> and it took 75 minutes to delete 12.8M entries. That's rather nasty.

Yes, but how long did it take to *create* the 12.8M entries?  My guess
is that it was roughly comparable?

							- Ted

* Re: O(n^2) deletion performance
  2018-01-04  4:16         ` Jim Meyering
  2018-01-04  7:16           ` Theodore Ts'o
@ 2018-01-04 11:42           ` Dave Chinner
  1 sibling, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2018-01-04 11:42 UTC (permalink / raw)
  To: Jim Meyering
  Cc: Theodore Ts'o, Andreas Dilger, Niklas Hambüchen,
	Linux FS-devel Mailing List, Paul Eggert, Pádraig Brady

On Wed, Jan 03, 2018 at 08:16:58PM -0800, Jim Meyering wrote:
> On Mon, Jan 1, 2018 at 10:22 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> > versus the patience needed to recover from
> > accidentally dumping 16 million files into a directory --- I prefer
> > the latter.  I can wait a few minutes....
> 
> I've just run a test on the spinning-disk file system mentioned above,
> and it took 75 minutes to delete 12.8M entries. That's rather nasty.

An unrealistic performance expectation is not a code bug. It's a
knowledge problem and a learning opportunity. :)

So let's put some numbers to this operation. If this is a cache-cold
rm, then the number of read IOs required on a perfectly laid out /XFS/
filesystem would be roughly:
	read IOs = num getdents IO + num inode lookup IO
		 = 12.8M / 32 + 12.8M / 500
		 = 400,000 + 25,600
		 = 425,600.

Now, let's assume a low seek time for each IO on a sata drive - about
3ms per IO. That means the read IO delay is:

	read IO delay = 425,600 * 0.003s
		      = 1277s
		      = 21m17s

Hence with a perfect layout we've got at least 21m of blocking IO
time alone in this workload.  If the directory and inodes are
scattered, and we get an average seek time (7ms), then we're looking
at 50 minutes of blocking read IO time in the workload.

Then we've got to add the time for each operation to be executed on
the CPU. If I take the numbers from a 10M file unlink I did yesterday
in a VM with device image files hosted on XFS on NVMe drives:

File count: 10000000
Create: 918.46 47.14 866.24
Unlink: 1545.99 12.14 1410.97

There's 1410s of system CPU time to unlink 10 million inodes. Add another
25% for the larger inode count in your test, and we've got almost 30
minutes of CPU time in this workload. If we add that to our 50
minutes of IO blocking time, we've got ~80 minutes total runtime.

IOWs, 75 minutes on a spinning sata drive to delete 12.8M directory
entries is entirely reasonable for any filesystem because the
workload is single threaded, blocks on each read IO, requires
hundreds of thousands of read IOs to complete and the IO time is
largely dependent on the physical layout of the directory and inode
structure.
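
(The whole estimate fits in a few lines for anyone who wants to plug
in their own numbers -- a back-of-envelope sketch using the
assumptions above, nothing more:)

  awk 'BEGIN {
      n = 12800000                   # directory entries
      ios = n/32 + n/500             # getdents IOs + inode cluster read IOs
      io_time  = ios * 0.007         # ~7ms average seek per read IO
      cpu_time = n * (1410 / 10e6)   # ~1410s of system CPU per 10M unlinks (VM test above)
      printf "%d read IOs, ~%.0f min IO wait + ~%.0f min CPU = ~%.0f min\n",
             ios, io_time/60, cpu_time/60, (io_time + cpu_time)/60
  }'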

That's the reason I asked for iowatcher to be run to analyse these
bug reports - it will tell us /exactly/ how much time is being spent
waiting for IO, what the seek patterns are, and that will tell us
how badly we all suck at laying out large directories.

> On the bright side, Kevin Vigor was kind enough to run tests showing
> that on some large, fast NVMe devices, everything looks linear:
> https://docs.google.com/spreadsheets/d/1bPi8MTvSP4xzzuARPOd5fxFujhBoU2Dxandr-Vh1T9c/edit#gid=0

Of course. That's because the read IO latency time is deterministic,
not dependent on the physical layout (i.e. constant "seek" time) and
it's much, much faster than a spinning disk - probably around 100
microseconds. In that case, the IO wait time drops to:

	read IO delay = 425,600 * 0.0001s
		      = 42.6s

That's not unreasonable given the above numbers from my test VM
showing roughly 2 minutes of wait time in 25 minutes for 10 million
inodes being removed.

IOWs, the performance on fast SSDs and NVMe drives is determined
almost entirely by software algorithms and the per-operation CPU
cost rather than the physical layout of the structure on disk and
the time needed to retreive it into memory.  That's why you see
variable and slow performance on slow spinning drives, but it gets
more and more linear as the devices get faster and less sensitive to
physical layout of the filesystem metadata....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
