From: Chris Mason <clm@fb.com>
To: Andreas Dilger <adilger@dilger.ca>
Cc: Josh Triplett <josh@joshtriplett.org>,
	David Howells <dhowells@redhat.com>,
	Theodore Ts'o <tytso@mit.edu>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Ext4 Developers List <linux-ext4@vger.kernel.org>,
	xfs <linux-xfs@vger.kernel.org>,
	linux-btrfs <linux-btrfs@vger.kernel.org>,
	"linux-cachefs@redhat.com" <linux-cachefs@redhat.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	NeilBrown <neilb@suse.com>
Subject: Re: How capacious and well-indexed are ext4, xfs and btrfs directories?
Date: Wed, 26 May 2021 00:24:56 +0000
Message-ID: <5D04989A-E253-47B5-B50A-E96419F0E151@fb.com>
In-Reply-To: <B70B57ED-6F11-45CC-B99F-86BBDE36ACA4@dilger.ca>


> On May 25, 2021, at 5:13 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> 
> On May 22, 2021, at 11:51 PM, Josh Triplett <josh@joshtriplett.org> wrote:
>> 
>> On Thu, May 20, 2021 at 11:13:28PM -0600, Andreas Dilger wrote:
>>> On May 17, 2021, at 9:06 AM, David Howells <dhowells@redhat.com> wrote:
>>>> With filesystems like ext4, xfs and btrfs, what are the limits on directory
>>>> capacity, and how well are they indexed?
>>>> 
>>>> The reason I ask is that inside cachefiles, I insert fanout directories
>>>> inside index directories to divide up the space, because ext2 had limits
>>>> on directory size and did linear searches (IIRC).
>>>> 
>>>> For some applications, I need to be able to cache over 1M entries (render
>>>> farm) and even a kernel tree has over 100k.
>>>> 
>>>> What I'd like to do is remove the fanout directories, so that for each logical
>>>> "volume"[*] I have a single directory with all the files in it.  But that
>>>> means sticking a massive number of entries into a single directory and hoping
>>>> it (a) isn't too slow and (b) doesn't hit the capacity limit.
>>> 
>>> Ext4 can comfortably handle ~12M entries in a single directory, if the
>>> filenames are not too long (e.g. 32 bytes or so).  With the "large_dir"
>>> feature (since 4.13, but not enabled by default) a single directory can
>>> hold around 4B entries, basically all the inodes of a filesystem.
>> 
>> ext4 definitely seems to be able to handle it. I've seen bottlenecks in
>> other parts of the storage stack, though.
>> 
>> With a normal NVMe drive, a dm-crypt volume containing ext4, and discard
>> enabled (on both ext4 and dm-crypt), I've seen rm -r of a directory with
>> a few million entries (each pointing to a ~4-8k file) take the better
>> part of an hour, almost all of it system time in iowait. Also makes any
>> other concurrent disk writes hang, even a simple "touch x". Turning off
>> discard speeds it up by several orders of magnitude.
>> 
>> (I don't know if this is a known issue or not, so here are the details
>> just in case it isn't. Also, if this is already fixed in a newer kernel,
>> my apologies for the outdated report.)
> 
> Definitely "-o discard" is known to have a measurable performance impact,
> simply because it ends up sending a lot more requests to the block device,
> and those requests can be slow/block the queue, depending on underlying
> storage behavior.
> 
> There was a patch pushed recently that targets "-o discard" performance:
> https://patchwork.ozlabs.org/project/linux-ext4/list/?series=244091
> that needs a bit more work, but it may be worthwhile to test whether it
> improves your workload, and that would help put some weight behind landing it.
> 

This is pretty far off topic from the original message, but we’ve had a long list of discard problems in production:

* Synchronous discards stall under heavy delete loads, especially on lower-end drives.  Even drives that service the discards entirely in RAM on the host (Fusion-io’s best feature, IMHO) had trouble.  I’m sure some really high-end flash is really high end, but that hasn’t been a driving criterion for us in the fleet.

* XFS async discards decouple the commit latency from the discard latency, which is great.  But the backlog of discards wasn’t really limited, so mass deletion events ended up generating stalls for reads and writes that were competing with the discards.  We last benchmarked this with v5.2, so it might be different now, but unfortunately it wasn’t usable for us.

* fstrim-from-cron limits the stalls to 2am, which is peak traffic somewhere in the world, so it isn’t ideal.  On some drives it’s fine, on others it’s a 10-minute lunch break.
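
For reference, fstrim(8) is a thin wrapper around the FITRIM ioctl, so per-filesystem trims can be driven from whatever maintenance scheduling makes sense rather than a blanket cron entry.  A minimal sketch, with a placeholder mount point and illustrative parameters:

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FITRIM, struct fstrim_range */

int main(void)
{
        struct fstrim_range range;
        int fd = open("/mnt/data", O_RDONLY);   /* placeholder mount point */

        if (fd < 0) {
                perror("open");
                return 1;
        }

        memset(&range, 0, sizeof(range));
        range.start  = 0;
        range.len    = ULLONG_MAX;      /* trim the whole filesystem */
        range.minlen = 0;               /* let the fs pick its minimum extent */

        if (ioctl(fd, FITRIM, &range) < 0) {    /* needs CAP_SYS_ADMIN */
                perror("FITRIM");
                close(fd);
                return 1;
        }
        printf("trimmed %llu bytes\n", (unsigned long long)range.len);
        close(fd);
        return 0;
}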

For XFS in latency-sensitive workloads, we’ve settled on synchronous discards plus applications that use iterated truncate calls to nibble the ends off a file bit by bit, calling fsync at reasonable intervals.  It hurts to say out loud, but it is also wonderfully predictable.
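
A minimal sketch of that nibbling pattern follows; the chunk size and fsync interval are illustrative placeholders, not production values, and with "-o discard" the idea is simply that each commit only has a bounded amount of freed space to trim:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK           (64ULL << 20)   /* shrink 64 MiB at a time (illustrative) */
#define SYNC_EVERY      8               /* fsync every 8 truncates (illustrative) */

static int nibble_delete(const char *path)
{
        struct stat st;
        off_t size;
        int steps = 0;
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
                perror(path);
                return -1;
        }
        if (fstat(fd, &st) < 0) {
                perror("fstat");
                close(fd);
                return -1;
        }

        for (size = st.st_size; size > 0; ) {
                /* Free (and, with -o discard, trim) only CHUNK bytes per step. */
                size = size > (off_t)CHUNK ? size - (off_t)CHUNK : 0;
                if (ftruncate(fd, size) < 0) {
                        perror("ftruncate");
                        close(fd);
                        return -1;
                }
                /* Periodic fsync bounds how much discard work any one
                 * transaction commit has to absorb. */
                if (++steps % SYNC_EVERY == 0)
                        fsync(fd);
        }
        fsync(fd);
        close(fd);
        return unlink(path);    /* nothing left to free, so the unlink is cheap */
}

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        return nibble_delete(argv[1]) ? 1 : 0;
}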

We generally use btrfs on low-end root drives, where discards are a much bigger problem.  The btrfs async discard implementation treats re-allocating a block as equivalent to discarding it, so we avoid some discards just by reusing blocks.  It sorts pending discards to prefer larger IOs, and dribbles them out slowly to avoid saturating the drive.  It’s a giant bag of compromises, but it avoids the latencies and maintains the write-amplification targets.  We do use it on a few data-intensive workloads with higher-end flash, but we crank up the IOPS targets for the discards there.
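
As a rough sketch of what that tuning looks like, assuming a filesystem mounted with -o discard=async and the per-filesystem discard knobs that came with the async discard work (the FSID and the value below are placeholders, not production settings):

#include <stdio.h>

int main(void)
{
        /* Placeholder UUID; the real path is /sys/fs/btrfs/<filesystem UUID>/... */
        const char *knob = "/sys/fs/btrfs/<FSID>/discard/iops_limit";
        FILE *f = fopen(knob, "w");

        if (!f) {
                perror(knob);
                return 1;
        }
        fprintf(f, "1000\n");   /* allow more discard requests per second */
        fclose(f);
        return 0;
}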

-chris


Thread overview: 20+ messages
2021-05-17 15:06 How capacious and well-indexed are ext4, xfs and btrfs directories? David Howells
2021-05-17 23:22 ` Dave Chinner
2021-05-17 23:40   ` Chris Mason
2021-05-19  8:00   ` Avi Kivity
2021-05-19 12:57     ` Dave Chinner
2021-05-19 14:13       ` Avi Kivity
2021-05-18  7:24 ` David Howells
2021-05-21  5:13 ` Andreas Dilger
2021-05-23  5:51   ` Josh Triplett
2021-05-25  4:21     ` Darrick J. Wong
2021-05-25  5:00       ` Christoph Hellwig
2021-05-25 21:13     ` Andreas Dilger
2021-05-25 21:26       ` Matthew Wilcox
2021-05-25 22:13         ` Darrick J. Wong
2021-05-25 22:48         ` Andreas Dilger
2021-05-26  0:24       ` Chris Mason [this message]
2021-06-22  0:50       ` Josh Triplett
2021-05-25 22:31 ` David Howells
2021-05-25 22:58   ` Andreas Dilger
2021-05-26  0:00   ` David Howells
