linux-fsdevel.vger.kernel.org archive mirror
From: Theodore Ts'o <tytso@mit.edu>
To: "Niklas Hambüchen" <niklas@nh2.me>
Cc: linux-fsdevel@vger.kernel.org, "Paul Eggert" <eggert@cs.ucla.edu>,
	"Jim Meyering" <jim@meyering.net>,
	"Pádraig Brady" <P@draigbrady.com>
Subject: Re: O(n^2) deletion performance
Date: Mon, 1 Jan 2018 20:59:32 -0500	[thread overview]
Message-ID: <20180102015932.GF2532@thunk.org> (raw)
In-Reply-To: <5ca3808d-4eea-afec-75a6-2cc41f44b868@nh2.me>

On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
> Hello filesystem hackers,
> 
> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
> unlink()ing `n` many files seems to take in general O(n²) wall time.
> 
> So far me and others (in CC) have benchmarked `rm -r` and other methods
> of unlink()ing multiple files and we've found approximately quadratic
> performance across...

So a couple of observations.  You're benchmarking the creation and
deletion of files that don't have any blocks associated with them, so
you are measuring the directory and inode deallocation operations.
There are no block (de-)allocation operations going on.
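For illustration, here is a minimal sketch (not from the original thread; the helper name and Python are my own choices) of the kind of benchmark being discussed: creating n zero-length files and timing only the unlink phase, so that no block (de-)allocation is involved.

```python
import os
import tempfile
import time

def bench_unlink(n):
    """Create n empty files in a scratch directory, then time
    unlinking them all.  Zero-length files mean no data blocks are
    ever allocated, so only directory and inode-table metadata
    operations are being measured."""
    d = tempfile.mkdtemp()
    for i in range(n):
        # open/close with no writes: the inode gets no blocks
        open(os.path.join(d, f"f{i:08d}"), "w").close()

    t0 = time.monotonic()
    for name in os.listdir(d):
        os.unlink(os.path.join(d, name))
    elapsed = time.monotonic() - t0

    os.rmdir(d)
    return elapsed
```

Plotting `bench_unlink(n)` against n for growing n is what reveals the super-linear growth being reported, keeping in mind the caching caveat below.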

Secondly, you aren't dropping the cache between the creation
phase and the timed deletion phase of the test.  This means that the
results are going to be heavily impacted by the amount of memory
available on the system.  This doesn't matter all that much since you
aren't actually writing any data blocks, and so you're only measuring
metadata operations, which are going to be journalled.
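To separate the two phases properly, the clean caches have to be dropped between them.  A sketch of that step (my own illustration; the `drop_caches` helpers are hypothetical, and actually writing to the sysctl requires root):

```python
import os

# Bitmask accepted by /proc/sys/vm/drop_caches:
#   1 = free the page cache
#   2 = free reclaimable slab objects (dentries and inodes)
#   3 = both
def drop_caches_value(pagecache=True, slab=True):
    """Compute the value to write to /proc/sys/vm/drop_caches."""
    return (1 if pagecache else 0) | (2 if slab else 0)

def drop_caches(mode=3):
    """Flush dirty data, then drop the clean caches.  Requires root.
    drop_caches only frees *clean* pages, so sync first."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write(f"{mode}\n")
```

Calling `drop_caches(3)` between the creation and deletion phases forces the deletion phase to re-read directory blocks and inode tables from disk, which is the cold-cache case discussed next.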

Sorting by inode number helps in the cold cache case, especially if
you have data blocks that you need to delete.  I also suspect that the
original tests of the sort-the-inodes commit were with substantially
smaller numbers of files, especially since these were non-zero-sized
files being deleted.

Sorting by inode number helps optimize deleting files in creation order
for ext3/4, and that helps optimize updates to the inode table and to
the allocation bitmaps.  However, the number of I/Os, and hence the
elapsed time, is also going to be a function of the directory block
updates.

In general you're not going to be able to delete files in optimal
directory tree order *and* simultaneously in optimal inode table order.
So fundamentally there's not a whole lot that can be done here as
directories get super-sized.
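The sort-by-inode strategy mentioned above can be sketched as follows (my own illustration in Python, not coreutils' actual C implementation): stat the entries, sort by st_ino, then unlink in that order.

```python
import os
import tempfile

def unlink_in_inode_order(d):
    """Unlink every entry in directory d in ascending inode order.

    This mirrors the idea behind coreutils' sort-the-inodes change:
    processing inodes in creation order improves locality of the
    inode-table and bitmap updates.  Per the point above, though,
    this order generally cannot also be the optimal *directory*
    block order, so some set of updates stays scattered."""
    # DirEntry.inode() comes from the readdir data, so no extra
    # stat() calls are needed just to get the inode numbers.
    entries = [(e.inode(), e.name) for e in os.scandir(d)]
    entries.sort()  # ascending inode number
    for _ino, name in entries:
        os.unlink(os.path.join(d, name))
```

On a freshly populated directory the inode order and the readdir order may happen to coincide; the difference shows up on aged directories and in the cold-cache case.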

For ext4, you can improve things somewhat by using a larger journal
size, if the workload is extremely metadata-heavy (which it is in this
very artificial case of deleting a directory hierarchy with 16 million+
zero-length inodes).  We now use a larger default journal size with
newer versions of e2fsprogs, but if your file system was created a
while ago, it is probably maxed out with a 128 meg journal.  Using a
one or two gig journal (e.g., "mke2fs -t ext4 -J size=2048 /dev/sdXX")
will help in this case.  It's unlikely to help that much for more
realistic workloads, though.

> Unless instructed otherwise, I would also go ahead and file bugs for the
> respective filesystem types, as quadratic deletion time makes for a bad
> user experience and lots of sysadmin tears.

Unfortunately there's not a whole lot I can do from the ext4 end.  And
it's not clear to me that optimizing for this extremely artificial
workload makes a lot of sense even if we could do something.  (I
suppose we could create an rm -rf system call which would work so long
as you had sufficient amounts of non-swappable kernel memory available
--- but in the end it's going to be impossible to make something which
is purely linear for arbitrarily large directories.)

							- Ted

P.S.  I suspect that if you had sufficiently large amounts of
super-expensive NVMe storage, you'd see super-linear times even for
XFS on NVMe....  The fast storage simply puts off the inevitable.

Thread overview: 11+ messages
2018-01-02  0:21 O(n^2) deletion performance Niklas Hambüchen
2018-01-02  1:20 ` Niklas Hambüchen
2018-01-02  1:59 ` Theodore Ts'o [this message]
2018-01-02  2:49   ` Andreas Dilger
2018-01-02  4:27     ` Jim Meyering
2018-01-02  6:22       ` Theodore Ts'o
2018-01-04  4:16         ` Jim Meyering
2018-01-04  7:16           ` Theodore Ts'o
2018-01-04 11:42           ` Dave Chinner
2018-01-02  4:33     ` Theodore Ts'o
2018-01-02  4:54 ` Dave Chinner
