From: Dave Chinner <david@fromorbit.com>
To: David Howells <dhowells@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	"Darrick J. Wong" <djwong@kernel.org>, Chris Mason <clm@fb.com>,
	linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-btrfs@vger.kernel.org, linux-cachefs@redhat.com,
	linux-fsdevel@vger.kernel.org
Subject: Re: How capacious and well-indexed are ext4, xfs and btrfs directories?
Date: Tue, 18 May 2021 09:22:37 +1000	[thread overview]
Message-ID: <20210517232237.GE2893@dread.disaster.area> (raw)
In-Reply-To: <206078.1621264018@warthog.procyon.org.uk>

On Mon, May 17, 2021 at 04:06:58PM +0100, David Howells wrote:
> Hi,
> 
> With filesystems like ext4, xfs and btrfs, what are the limits on directory
> capacity, and how well are they indexed?
> 
> The reason I ask is that inside of cachefiles, I insert fanout directories
> inside index directories to divide up the space, to cope with ext2's limits
> on directory sizes and the fact that it did linear searches (IIRC).

Don't do that for XFS. XFS directories have internal hashed btree
indexes that are far more space efficient than using fanout in
userspace. i.e. the XFS hash index uses 8 bytes per dirent, so with
a 4kB directory block size it can index about 500 entries per
block. And being O(log N) for lookup, insert and remove, the
fan-out within the directory hash per IO operation is an order of
magnitude higher than using fanout directories in userspace....
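
As a rough illustration of what that per-block fan-out means for
index depth - a back-of-the-envelope sketch only, ignoring block
headers and other on-disk overhead, so treat the numbers as
approximate:

#include <math.h>
#include <stdio.h>

/* build with: cc -o depth depth.c -lm */
int main(void)
{
	/* 8 bytes per hash entry in a 4kB block, per the text above */
	const double entries_per_block = 4096.0 / 8;
	const double dirents[] = { 1e5, 1e6, 1e7, 1e8 };

	for (size_t i = 0; i < sizeof(dirents) / sizeof(dirents[0]); i++) {
		/* levels of ~500-way fan-out needed to cover N entries */
		double depth = ceil(log(dirents[i]) / log(entries_per_block));
		printf("%12.0f entries -> ~%.0f index levels\n",
		       dirents[i], depth);
	}
	return 0;
}

i.e. even a hundred million entries only needs about three levels
of hash index.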

The capacity limit for XFS is 32GB of dirent data, which generally
equates to somewhere around 300-500 million dirents depending on
filename size. The hash index is separate from this limit (it has
its own 32GB address segment, as does the internal freespace map
for the directory)....
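
(As a rough cross-check of those numbers: assuming each on-disk
dirent costs somewhere around 12 bytes of bookkeeping plus the
name, rounded up to 8-byte alignment - an approximation, not the
exact on-disk format - 32GiB divided by ~64 bytes per entry is a
bit over 500 million short-named dirents, and divided by ~110
bytes it is roughly 300 million for ~100-byte names, which
brackets the 300-500 million range above.)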

The other design characteristic of XFS directories is that readdir
is always a sequential read through the dirent data with built-in
readahead. It does not need to look up the hash index
to determine where to read the next dirents from - that's a straight
"file offset to physical location" lookup in the extent btree, which
is always cached in memory. So that's generally not a limiting
factor, either.
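
To make that concrete, the consumer side is just a plain directory
scan - a minimal userspace sketch, nothing XFS-specific in it:

#include <dirent.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	DIR *dir = opendir(path);
	struct dirent *de;
	unsigned long count = 0;

	if (!dir) {
		perror("opendir");
		return 1;
	}
	/*
	 * On XFS this walks the dirent data sequentially with
	 * readahead; the hash index is not consulted.
	 */
	while ((de = readdir(dir)) != NULL)
		count++;
	closedir(dir);
	printf("%s: %lu entries\n", path, count);
	return 0;
}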

> For some applications, I need to be able to cache over 1M entries (render
> farm) and even a kernel tree has over 100k.

Not a problem for XFS with a single directory, but it could
definitely be a problem for others, especially as the directory
grows and shrinks. Last I measured, ext4 directory perf drops off
at about 80-90k entries using 40-byte file names, but you can get
an idea of
XFS directory scalability with large entry counts in commit
756c6f0f7efe ("xfs: reverse search directory freespace indexes").
I'll reproduce the table using a 4kB directory block size here:

     File count	      create time(sec) / rate (files/s)
      10k			  0.41 / 24.3k
      20k			  0.75 / 26.7k
     100k			  3.27 / 30.6k
     200k			  6.71 / 29.8k
       1M			 37.67 / 26.5k
       2M			 79.55 / 25.2k
      10M			552.89 / 18.1k

So that's single-threaded file create, which shows the rough limits
of insertion into a large directory. There really isn't a major
drop-off in performance until there are several million entries in
the directory. Remove is roughly the same speed for the same dirent
count.
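
If you want to see the rough shape of those numbers yourself,
something like the following single-threaded create loop is the
kind of workload being measured. Note this is not the tool that
generated the table above, just a minimal stand-in; the target
directory and file count are made-up parameters:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : "/mnt/scratch/bigdir";
	long nfiles = argc > 2 ? atol(argv[2]) : 100000;
	char path[4096];
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < nfiles; i++) {
		/* ~40 byte names, roughly matching the test above */
		snprintf(path, sizeof(path), "%s/testfile-%032ld", dir, i);
		int fd = open(path, O_CREAT | O_WRONLY, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) +
		      (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%ld creates in %.2fs (%.1fk files/s)\n",
	       nfiles, secs, nfiles / secs / 1000);
	return 0;
}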

> What I'd like to do is remove the fanout directories, so that for each logical
> "volume"[*] I have a single directory with all the files in it.  But that
> means sticking massive amounts of entries into a single directory and hoping
> it (a) isn't too slow and (b) doesn't hit the capacity limit.

Note that if you use a single directory, you are effectively
single-threading modifications to your file index. You still need
to use fanout directories if you want concurrency during
modification of the cachefiles index, but that's a different design
criterion from directory capacity and modification/lookup
scalability.
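
For the concurrency side, the usual trick is to hash the object key
to pick a fanout bucket, so concurrent creates/unlinks land in
different directories (and hence contend on different directory
locks). A minimal sketch - the hash, the 256-way fanout and the
cache root path are arbitrary choices for illustration, not what
cachefiles actually does:

#include <stdint.h>
#include <stdio.h>

#define FANOUT	256

/* FNV-1a, chosen only because it is short; any decent hash works */
static uint32_t hash_key(const char *key)
{
	uint32_t h = 2166136261u;

	while (*key) {
		h ^= (uint8_t)*key++;
		h *= 16777619u;
	}
	return h;
}

int main(void)
{
	const char *keys[] = { "object-0001", "object-0002", "object-0003" };

	for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
		printf("/var/cache/fscache/%02x/%s\n",
		       hash_key(keys[i]) % FANOUT, keys[i]);
	return 0;
}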

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
