From: Dave Chinner <david@fromorbit.com>
To: "Niklas Hambüchen" <niklas@nh2.me>
Cc: linux-fsdevel@vger.kernel.org, "Paul Eggert" <eggert@cs.ucla.edu>,
"Jim Meyering" <jim@meyering.net>,
"Pádraig Brady" <P@draigbrady.com>
Subject: Re: O(n^2) deletion performance
Date: Tue, 2 Jan 2018 15:54:33 +1100 [thread overview]
Message-ID: <20180102045433.GA30682@dastard> (raw)
In-Reply-To: <5ca3808d-4eea-afec-75a6-2cc41f44b868@nh2.me>
On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
> Hello filesystem hackers,
>
> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
> unlink()ing `n` many files seems to take in general O(n^2) wall time.
>
> So far, others (in CC) and I have benchmarked `rm -r` and other methods
> of unlink()ing multiple files, and we've found approximately quadratic
> performance across:
>
> * ext4, XFS and zfs-on-linux on spinning disks
> * ext4 on SSDs
> * the other combinations were not tested yet
>
> Please find the exact numbers at https://bugs.gnu.org/29921.
Run iowatcher on the tests and have a look at the IO patterns before
and after the modifications to rm. Work out where the seek delays
are actually coming from and find the common characteristics that
all the reports share.
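As a starting point, something like the following should capture and graph the IO pattern of an rm run. This is only a sketch: the device and directory names are placeholders, and the -d/-t/-o/-p flags are from my reading of iowatcher's usage, so verify against the man page locally.

```shell
# Sketch only: trace the block device while rm runs, then render the
# IO pattern as an SVG graph. Requires iowatcher (from the blktrace
# suite) and root for tracing. Device and path are placeholders.
DEV=/dev/sdb            # hypothetical block device under test
TARGET=/mnt/scratch/0   # hypothetical directory full of test files

iowatcher -d "$DEV" -t rm-trace -o rm-io.svg -p "rm -rf $TARGET"
```

Comparing the seek patterns in the resulting graphs before and after the coreutils change is what would actually localise the delays.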
> It would be great to hear any kind of insights you may have regarding
> this issue, implementation details that might be relevant here, or even
> some quick benchmarks on your systems to confirm or refute our findings.
So I just ran a quick test here on XFS: create files with fsmark,
unmount, mount, remove them, and compare wall times.
Nfiles    create time    remove time
   10k       0m0.518s       0m0.658s
   20k       0m1.039s       0m1.342s
  100k       0m4.540s       0m4.646s
  200k       0m8.567s      0m10.180s
    1M      1m23.059s      0m51.582s
Looks pretty linear in terms of rm time. This is with rm v8.26,
and the rm command is "rm -rf /mnt/scratch/0" which contains a
single directory with all the files in it.
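The shape of that test can be sketched in a few lines of shell. This is not Dave's actual harness (that used fsmark, with an unmount/mount between create and remove to drop the inode cache; this sketch keeps the cache hot, so it only shows gross scaling behaviour, not cold-cache IO):

```shell
# Create N files in a single directory, then time removing them, for a
# few sizes. If remove time roughly doubles when N doubles, it's linear.
set -e
base=$(mktemp -d)
for n in 1000 2000 4000; do
    dir="$base/$n"
    mkdir "$dir"
    t0=$(date +%s%N)
    i=0
    while [ "$i" -lt "$n" ]; do : > "$dir/f$i"; i=$((i + 1)); done
    t1=$(date +%s%N)
    rm -rf "$dir"
    t2=$(date +%s%N)
    echo "$n files: create $(( (t1 - t0) / 1000000 ))ms remove $(( (t2 - t1) / 1000000 ))ms"
done
rmdir "$base"
```

For a faithful reproduction of the numbers above, the create step would be fsmark and the filesystem would be unmounted and remounted before the rm.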
The typical IO pattern during rm is getdents at ~40 calls/s and ~20,000
inode lookup calls/s, i.e. inode lookups outnumber directory reads
by 500:1. XFS does inode reads in clusters, so that 20,000 inodes/s
after a sequential create is being serviced by about 650 read IO/s.
Not very much, really.
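The cluster arithmetic checks out if one assumes a typical XFS inode cluster of 32 inodes (256-byte inodes in an 8 KiB cluster buffer; that figure is my assumption, not stated in the message):

```shell
# Back-of-envelope check: 20,000 inode reads/s, serviced in clusters of
# 32 inodes per read IO (assumed typical XFS geometry), gives:
inodes_per_sec=20000
inodes_per_cluster=32
read_ios=$((inodes_per_sec / inodes_per_cluster))
echo "$read_ios read IO/s"   # 625, i.e. "about 650 read IO/s"
```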
This is run on XFS on raid0 of 2 NVMe drives, sparse 500TB file
passed to a VM as virtio,cache=none,aio=native drive formatted with
XFS. (i.e. XFS on XFS).
So, run it on spinning drives. Same VM, just a SATA XFS image file
on a sata drive on the host used instead:
Nfiles    create time    remove time
   10k       0m0.520s       0m0.643s
   20k       0m1.198s       0m1.351s
  100k       0m5.306s       0m6.768s
  200k      0m11.083s      0m13.763s
    1M      1m39.876s      1m14.765s
So, pretty much identical, still linear. The difference in
performance between the two was the journal size: the smaller
filesystem wrote continually to the journal at 50MB/s, while the
large NVMe filesystem hardly touched it. After remaking the small
filesystem with a larger log, the IO patterns and performance numbers
were identical to the large filesystem.
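For reference, remaking an XFS filesystem with a larger internal log is a one-liner; the device path here is a placeholder and the 512 MiB size is purely illustrative, not the value used in the test above:

```shell
# Recreate the filesystem with a 512 MiB internal log (DESTROYS all
# data on the device -- placeholder device path, illustrative size).
mkfs.xfs -f -l size=512m /dev/sdb1
# After mounting, xfs_info reports the resulting log geometry.
xfs_info /mnt/scratch
```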
So I locked out 15.5GB of RAM, so the test runs short of memory and
causes writeback to occur during the creates. That barely changed the
performance on either the spinning or the NVMe drives.
Just upgraded to coreutils 8.28. Performance is basically identical
to 8.26.
Oh, I just remembered that I've also got some old samsung 840 EVO
SSDs attached to that VM! I ran the test again, got the same
results. And, while I'm at it, I ran it on the emulated pmem I have
configured, and it got the same results.
IOWs, I've got XFS on 4.14 producing near-identical results for
spinning disks, old SATA SSDs, brand new NVMe SSDs and pmem. There
simply isn't an O(n^2) problem in the userspace or XFS code, before
or after the changes pointed out in the bug linked above.
Keep in mind that running a random script like this on a random
filesystem in a random initial state is about the most sub-optimal
environment you can create for a regression test. It won't give
reliable results, and you can't even compare results from the same
machine and filesystem because the initial conditions differ from
run to run.
> Unless instructed otherwise, I would also go ahead and file bugs for the
> respective filesystem types, as quadratic deletion time makes for a bad
> user experience and lots of sysadmin tears.
I think you're being overly melodramatic. There isn't an obvious
problem here, just a lot of hand-waving. Do some IO analysis that
shows where the delays are coming from; then we'll have something
to work from....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
Thread overview: 11+ messages
2018-01-02 0:21 O(n^2) deletion performance Niklas Hambüchen
2018-01-02 1:20 ` Niklas Hambüchen
2018-01-02 1:59 ` Theodore Ts'o
2018-01-02 2:49 ` Andreas Dilger
2018-01-02 4:27 ` Jim Meyering
2018-01-02 6:22 ` Theodore Ts'o
2018-01-04 4:16 ` Jim Meyering
2018-01-04 7:16 ` Theodore Ts'o
2018-01-04 11:42 ` Dave Chinner
2018-01-02 4:33 ` Theodore Ts'o
2018-01-02 4:54 ` Dave Chinner [this message]