From: Dave Chinner <david@fromorbit.com>
To: "Niklas Hambüchen" <niklas@nh2.me>
Cc: linux-fsdevel@vger.kernel.org, "Paul Eggert" <eggert@cs.ucla.edu>,
	"Jim Meyering" <jim@meyering.net>,
	"Pádraig Brady" <P@draigbrady.com>
Subject: Re: O(n^2) deletion performance
Date: Tue, 2 Jan 2018 15:54:33 +1100	[thread overview]
Message-ID: <20180102045433.GA30682@dastard> (raw)
In-Reply-To: <5ca3808d-4eea-afec-75a6-2cc41f44b868@nh2.me>

On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
> Hello filesystem hackers,
> 
> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
> unlink()ing `n` many files seems to take in general O(n²) wall time.
> 
> So far I and others (in CC) have benchmarked `rm -r` and other methods
> of unlink()ing multiple files and we've found approximately quadratic
> performance across:
> 
> * ext4, XFS and zfs-on-linux on spinning disks
> * ext4 on SSDs
> * the other combinations were not tested yet
> 
> Please find the exact numbers at https://bugs.gnu.org/29921.

Run iowatcher on the tests and have a look at the IO patterns before
and after the modifications to rm. Work out where the seek delays
are actually coming from and find the common characteristics that
all the reports have.
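
A rough sketch of what that capture could look like (the device name,
paths and exact iowatcher flags here are assumptions, check
blktrace(8) and iowatcher(1) for your setup):

  # trace the block device while the rm runs
  blktrace -d /dev/sdb -o rm-trace &
  rm -rf /mnt/scratch/0
  kill $!; wait
  # inspect the raw IO stream, or graph it
  blkparse -i rm-trace | less
  iowatcher -t rm-trace -o rm-io.svg

The interesting thing is whether the IO offsets jump around more
(i.e. seek) as the directory grows, not the raw throughput.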

> It would be great to hear any kind of insights you may have regarding
> this issue, implementation details that might be relevant here, or even
> some quick benchmarks on your systems to confirm or refute our findings.

So I just ran a quick test here on XFS: create files with fsmark,
unmount, mount, remove them, compare wall times.

Nfiles		create time		remove time
  10k		0m0.518s		0m0.658s
  20k		0m1.039s		0m1.342s
 100k		0m4.540s		0m4.646s
 200k		0m8.567s		0m10.180s
   1M		1m23.059s		0m51.582s

Looks pretty linear in terms of rm time. This is with rm v8.26,
and the rm command is "rm -rf /mnt/scratch/0", where /mnt/scratch/0
is a single directory containing all the files.
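
For reference, the test boils down to roughly this (the fs_mark
flags, file count, device and mount point here are illustrative,
not necessarily the exact invocation used):

  # create 100k zero-length files in one directory and keep them
  fs_mark -k -S0 -n 100000 -s 0 -d /mnt/scratch/0
  # cycle the mount so the remove starts with cold inode/dentry caches
  umount /mnt/scratch
  mount /dev/vdb /mnt/scratch
  # measure wall time for the remove
  time rm -rf /mnt/scratch/0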

The typical IO pattern during rm is getdents @ 40 calls/s and 20,000
inode lookup calls/s, i.e. inode lookups outnumber directory reads
by 500:1. XFS does inode reads in clusters, so that 20,000 inodes/s
after a sequential create is being serviced by about 650 read IO/s.
Not very much, really.
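
If anyone wants to see the userspace side of that mix on their own
setup, one quick, hedged way (exact syscall names vary a little by
arch and libc version) is:

  # count directory reads vs per-file operations issued by rm
  strace -c -e trace=getdents64,newfstatat,unlinkat rm -rf /mnt/scratch/0

The kernel-side inode cluster reads don't show up there, of course -
that's what the block trace above is for.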

This is run on XFS on a RAID0 of 2 NVMe drives, with a sparse 500TB
file passed to a VM as a virtio,cache=none,aio=native drive formatted
with XFS (i.e. XFS on XFS).
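
In qemu terms, the guest's scratch disk is configured roughly like
this (the image path is made up, the cache/aio options are the
point):

  -drive file=/xfs/vm-scratch-500t.img,format=raw,if=virtio,cache=none,aio=native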

So, run it on spinning drives. Same VM, just an XFS image file on a
SATA drive on the host used instead:

Nfiles		create time		remove time
  10k		0m0.520s		0m0.643s
  20k		0m1.198s		0m1.351s
 100k		0m5.306s		0m6.768s
 200k		0m11.083s		0m13.763s
   1M		1m39.876s		1m14.765s

So, pretty much identical, still linear. The difference in
performance between the two was the journal size - the smaller
filesystem wrote continually to the journal at 50MB/s, while the
large NVMe filesystem hardly touched it. Remaking the small
filesystem with a larger log made the IO patterns and performance
numbers identical to the large filesystem.

So I locked out 15.5GB of RAM, so that the test ran short of memory
and caused writeback to occur during the creates. That barely changed
the performance on either the spinning or NVMe drives.
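
(For anyone reproducing the low-memory case: one way to take that
much memory out of play - just one option, and assuming no swap is
configured - is to fill a tmpfs, since swapless tmpfs pages can't be
reclaimed until the tmpfs is unmounted:

  mount -t tmpfs -o size=15872m tmpfs /mnt/memlock
  dd if=/dev/zero of=/mnt/memlock/fill bs=1M count=15872
)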

Just upgraded to coreutils 8.28. Performance is basically identical
to 8.26.

Oh, I just remembered that I've also got some old samsung 840 EVO
SSDs attached to that VM! I ran the test again, got the same
results. And, while I'm at it, I ran it on the emulated pmem I have
configured, and it got the same results.

IOWs, I've got XFS on 4.14 producing near-identical results for
spinning disks, old SATA SSDs, brand new NVMe SSDs and pmem. There
simply isn't an O(n^2) problem in the userspace or XFS code before
or after the changes pointed out in the bug linked above.

Keep in mind that running a random script like this on a random
filesystem in a random initial state is about the most sub-optimal
environment you can make for a regression test.  It won't give
reliable results and you can't even compare results from the same
machine and filesystem because initial conditions are always
different from run to run.

> Unless instructed otherwise, I would also go ahead and file bugs for the
> respective filesystem types, as quadratic deletion time makes for a bad
> user experience and lots of sysadmin tears.

I think you're being overly melodramatic. There isn't an obvious
problem, just a lot of handwaving. Do some IO analysis that shows
where the delays are coming from, then we'll have something to work
from....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
