Linux-XFS Archive on
From: Dave Chinner <>
To: "Darrick J. Wong" <>
Subject: Re: [PATCH 00/27] [RFC, WIP] xfsprogs: xfs_buf unification and AIO
Date: Fri, 16 Oct 2020 09:35:48 +1100
Message-ID: <20201015223548.GL7391@dread.disaster.area>
In-Reply-To: <20201015183756.GE9837@magnolia>

On Thu, Oct 15, 2020 at 11:37:56AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 15, 2020 at 06:21:28PM +1100, Dave Chinner wrote:
> > - AIO engine does not support discontiguous buffers.
> > 
> > - work out best way to handle IOCBs for AIO - is embedding them in
> >   the xfs_buf the only way to do this efficiently?
> The only other way I can think of is to embed MAX_AIO_EVENTS iocbs in
> the buftarg and (I guess) track which ones are free with a bitmap.
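
For reference, the bitmap-tracked pool described above could look
roughly like the sketch below. Every name in it (demo_buftarg,
demo_iocb, DEMO_MAX_AIO_EVENTS and its value of 64) is a hypothetical
stand-in and not taken from the patchset, and it does no locking:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_AIO_EVENTS 64	/* hypothetical; fits one 64-bit bitmap word */

/* Hypothetical stand-in for the AIO control block. */
struct demo_iocb { void *data; long long offset; };

struct demo_buftarg {
	uint64_t		bt_iocb_busy;	/* bit i set => bt_iocbs[i] in use */
	struct demo_iocb	bt_iocbs[DEMO_MAX_AIO_EVENTS];
};

/* Claim a free iocb slot; returns its index, or -1 if all are busy. */
static int demo_iocb_get(struct demo_buftarg *btp)
{
	for (int i = 0; i < DEMO_MAX_AIO_EVENTS; i++) {
		uint64_t bit = UINT64_C(1) << i;

		if (!(btp->bt_iocb_busy & bit)) {
			btp->bt_iocb_busy |= bit;
			return i;
		}
	}
	return -1;
}

/* Release a slot previously returned by demo_iocb_get(). */
static void demo_iocb_put(struct demo_buftarg *btp, int idx)
{
	assert(idx >= 0 && idx < DEMO_MAX_AIO_EVENTS);
	assert((btp->bt_iocb_busy & (UINT64_C(1) << idx)) != 0);
	btp->bt_iocb_busy &= ~(UINT64_C(1) << idx);
}
```

The get/put pairing is the bookkeeping cost: every submission has to
find a free bit and every completion has to clear it, and a caller
that needs many slots at once can be left waiting when the pool is
exhausted.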

I originally had an array embedded in the buftarg, and when I
realised that I had to track exactly which ones were still in use I
trashed it and moved the iocb to the struct xfs_buf so no tracking
is necessary. All I need to do is change the buffer allocation code
to allocate an iocb array when it allocates the map array so we have
a direct b_maps -> b_iocbs relationship at all times.
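
A minimal sketch of that allocation change, assuming simplified
stand-in types (demo_buf, demo_map and demo_iocb are hypothetical and
much smaller than the real struct xfs_buf, map and AIO iocb). The
only point it illustrates is that the two arrays are allocated and
freed together, so b_iocbs[i] always belongs to b_maps[i]:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real map and iocb structures are richer. */
struct demo_map  { long long bm_bn; int bm_len; };
struct demo_iocb { void *data; long long offset; size_t nbytes; };

struct demo_buf {
	int			b_map_count;
	struct demo_map		*b_maps;
	struct demo_iocb	*b_iocbs;	/* b_iocbs[i] services b_maps[i] */
};

/* Allocate the map array and the iocb array together, one iocb per map. */
static int demo_buf_alloc_maps(struct demo_buf *bp, int nmaps)
{
	assert(bp != NULL);
	if (nmaps <= 0)
		return -1;

	bp->b_maps = calloc(nmaps, sizeof(*bp->b_maps));
	bp->b_iocbs = calloc(nmaps, sizeof(*bp->b_iocbs));
	if (!bp->b_maps || !bp->b_iocbs) {
		/* free(NULL) is a no-op, so partial failure is handled here */
		free(bp->b_maps);
		free(bp->b_iocbs);
		bp->b_maps = NULL;
		bp->b_iocbs = NULL;
		bp->b_map_count = 0;
		return -1;
	}
	bp->b_map_count = nmaps;
	return 0;
}

/* Tear both arrays down together so they can never get out of step. */
static void demo_buf_free_maps(struct demo_buf *bp)
{
	assert(bp != NULL);
	free(bp->b_maps);
	free(bp->b_iocbs);
	bp->b_maps = NULL;
	bp->b_iocbs = NULL;
	bp->b_map_count = 0;
}
```

With this layout there is no free/busy state to track: an iocb
exists exactly as long as the map it services.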

> That
> would cut down the iocb memory overhead to the IOs that we're actually
> running at a cost of more bookkeeping and potential starvation issues
> for the yuuuge buffer that takes a long time to collect all NR iocbs.

We've already got a huge amount of per-buffer state, so adding iocbs
to that isn't a huge deal...

> > - rationalise the xfs_buf/xfs_buftarg split. Work out where to define
> >   the stuff that is different between kernel and userspace (e.g. the
> >   struct xfs_buf) vs the stuff that remains the same (e.g. the XBF
> >   buffer flags)
> Ow my brain.

Yeah. :(

> > - lots of code cleanup
> > 
> > - xfs_repair does not cache between phases right now - it
> >   unconditionally purges the AG cache at the end of AG processing.
> >   This keeps memory usage way down, and for filesystems that have
> >   too much metadata to cache entirely in RAM, it's *much* faster
> Why is it faster to blow out the cache?  Is it the internal overhead of
> digging through a huge buftarg hash table to find the buffer that the
> caller wants and/or whacking enough of the other stale buffers to free
> up memory to avoid the buffer cache limits?

Lots of reasons. The size of the hash table/chain length is largely
irrelevant. The biggest issue seems to be memory allocation for the
buffers - when we allocate the buffer, malloc does a mprotect() call
on the new heap space for some reason and that takes the mmap_sem in
write mode. Which serialises all the page faults being done while
reading those buffers as they take the mmap_sem shared. Hence while
we are allocating new heap space, there's massive numbers of context
switches on the mmap_sem contention. Context switch profile while
running at ~150,000 context switches a second:

   64.45%    64.45%  xfs_repair       [kernel.kallsyms]         [k] schedule
            |          |          
            |           --40.95%--asm_exc_page_fault
            |                     exc_page_fault
            |                     |          
            |                      --40.95%--down_read
            |                                rwsem_down_read_slowpath
            |                                schedule
            |                                schedule
            |          |          
            |           --7.87%--asm_exc_page_fault
            |                     exc_page_fault
            |                     |          
            |                      --7.86%--down_read
            |                                rwsem_down_read_slowpath
            |                                schedule
            |                                schedule

The __memmove_sse2_unaligned_erms() function is from the memcpy()
loop in repair/prefetch.c where it is copying metadata from the
large IO buffer into the individual xfs_bufs (i.e. read TLB faults)
and AFAICT the mprotect() syscalls are coming from malloc heap
expansion. About 7% of the context switches are from pthread_mutex
locks, and only 2.5% of the context switches are from the pread() IO
that is being done.

IOWs, while memory footprint is growing/changing, repair performance
is largely limited by mmap_sem concurrency issues.

So my reading of this is that by bounding the memory footprint of
repair by freeing the buffers back to the heap regularly, we largely
stay out of the heap grow mprotect path and avoid this mmap_sem
contention. That allows the CPU intensive parts of prefetch and
metadata verification to run more efficiently....

> >   than xfs_repair v5.6.0. On smaller filesystems, however, hitting
> >   RAM caches is much more desirable. Hence there is work to be done
> >   here determining how to manage the caches effectively.
> > 
> > - async buffer readahead works in userspace now, but unless you have
> >   a very high IOP device (think in multiples of 100kiops) the
> >   xfs_repair prefetch mechanism that trades off bandwidth for iops is
> >   still faster. More investigative work is needed here.
> (No idea what this means.)

The SSD I've been testing on peaks at about 70,000 iops for reads.
If I drive prefetch by xfs_buf_readahead() only (got patches that do
that) then, compared to 5.6.0, prefetch bandwidth drops to ~400MB/s
but the device is iops bound and so prefetch is slower than when
5.6.0 uses 2MB IOs, does ~5-10kiops and consumes 900MB/s of bandwidth.

This patchset runs the existing prefetch code at 15-20kiops and
1.8-2.1GB/s using 2MB IOs - we still get better throughput on these
SSDs by trading off iops for bandwidth.
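
One way to read these figures is as the mean amount of data moved
per IO, i.e. bandwidth divided by iops. The sketch below is only
back-of-the-envelope arithmetic: it assumes 1MB = 10^6 bytes, takes
the midpoint of each quoted range, and reads the SSD's peak as
roughly 70,000 read iops. The names io_sample, demo_samples and
avg_io_bytes are hypothetical.

```c
#include <assert.h>

/* One measured operating point: throughput and IO rate. */
struct io_sample {
	const char	*name;
	double		bytes_per_sec;
	double		iops;
};

/* Midpoints of the ranges quoted above (an assumption of this sketch). */
const struct io_sample demo_samples[] = {
	{ "readahead only",   400e6,  70000.0 },
	{ "5.6.0 prefetch",   900e6,   7500.0 },
	{ "patchset prefetch", 1.95e9, 17500.0 },
};

/* Mean bytes moved per IO, given throughput in bytes/s and IOs/s. */
double avg_io_bytes(double bytes_per_sec, double iops)
{
	assert(iops > 0.0);
	return bytes_per_sec / iops;
}
```

Under those assumptions the readahead-only case averages roughly
5.7KB per IO, while both large-IO prefetch cases average somewhere
over 100KB per IO, which is the bandwidth-for-iops trade in numbers.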

The original patchset I wrote (but never published) had similar
performance on the above hardware, but I also ran it on faster SSDs.
On those, repair could push them to around 200-250kiops with
prefetch by xfs_buf_readahead() and that was much faster than the
2-2.5GB/s that the existing prefetch could get before being limited
by the pcie 3.0 4x bus the SSD is on. I'll have some numbers from
this patchset on my faster hardware next week, but I don't expect
them to be much different...

So, essentially, if you've got more small read IOPS capacity than
you have "sparse metadata population optimised" large IO bandwidth
(or metadata too sparse to trigger the large IO optimisations) then
using xfs_buf_readahead() appears to be more efficient than using the
existing prefetch code.

That said, there's a lot of testing, observation and analysis needed
to determine what will be the best general approach here. Signs
are pointing towards "existing prefetch for rotational storage and
low iops ssds" and minimal caching and on-demand prefetch for high
end SSDs, but we'll see how that stands up...

> > - xfs_db could likely use readahead for btree traversals and other
> >   things.
> > 
> > - More memory pressure trigger testing needs to be done to determine
> >   if the trigger settings are sufficient to prevent OOM kills on
> >   different hardware and filesystem configurations.
> > 
> > - Replace AIO engine with io_uring engine?
> Hm, is Al actually going to review io_uring as he's been threatening to
> do?  I'd hold off until that happens, or one of us goes and tests it in
> anger to watch all the smoke leak out, etc...

Yeah, I'm in no hurry to do this. Just making sure people understand
that it's a potential future direction.

> > - start working on splitting the kernel xfs_buf.[ch] code the same
> >   way as the userspace code and moving xfs_buf.[ch] to fs/xfs/libxfs
> >   so that they can be updated in sync.
> Aha, so yes that answers my question in ... whichever patch that was
> somewhere around #17.

Right. I had to start somewhere, and given that userspace
requirements largely define how the split needs to occur, I decided
to start by making userspace work and then, once that is done,
change the kernel to match the same structure that userspace


Dave Chinner


Thread overview: 49+ messages
2020-10-15  7:21 Dave Chinner
2020-10-15  7:21 ` [PATCH 01/27] xfsprogs: remove unused buffer tracing code Dave Chinner
2020-10-15  7:21 ` [PATCH 02/27] xfsprogs: remove unused IO_DEBUG functionality Dave Chinner
2020-11-16  2:31   ` Eric Sandeen
2020-10-15  7:21 ` [PATCH 03/27] libxfs: get rid of b_bcount from xfs_buf Dave Chinner
2020-11-23 19:53   ` Eric Sandeen
2020-10-15  7:21 ` [PATCH 04/27] libxfs: rename buftarg->dev to btdev Dave Chinner
2020-11-16  2:33   ` Eric Sandeen
2020-10-15  7:21 ` [PATCH 05/27] xfsprogs: get rid of ancient btree tracing fragments Dave Chinner
2020-11-16  2:35   ` Eric Sandeen
2020-10-15  7:21 ` [PATCH 06/27] xfsprogs: remove xfs_buf_t typedef Dave Chinner
2020-10-15 15:22   ` Darrick J. Wong
2020-10-15 20:54     ` Dave Chinner
2020-10-15  7:21 ` [PATCH 07/27] xfsprogs: introduce liburcu support Dave Chinner
2020-10-15  7:21 ` [PATCH 08/27] libxfs: add spinlock_t wrapper Dave Chinner
2020-10-15  7:21 ` [PATCH 09/27] atomic: convert to uatomic Dave Chinner
2020-10-15  7:21 ` [PATCH 10/27] libxfs: add kernel-compatible completion API Dave Chinner
2020-10-15 17:09   ` Darrick J. Wong
2020-10-19 22:21     ` Dave Chinner
2020-10-15  7:21 ` [PATCH 11/27] libxfs: add wrappers for kernel semaphores Dave Chinner
2020-10-15  7:21 ` [PATCH 12/27] xfsprogs: convert use-once buffer reads to uncached IO Dave Chinner
2020-10-15 17:12   ` Darrick J. Wong
2020-10-19 22:36     ` Dave Chinner
2020-10-15  7:21 ` [PATCH 13/27] libxfs: introduce userspace buftarg infrastructure Dave Chinner
2020-10-15  7:21 ` [PATCH 14/27] xfs: rename libxfs_buftarg_init to libxfs_open_devices() Dave Chinner
2020-10-15  7:21 ` [PATCH 15/27] libxfs: introduce userspace buftarg infrastructure Dave Chinner
2020-10-15 17:16   ` Darrick J. Wong
2020-10-15  7:21 ` [PATCH 16/27] libxfs: add a synchronous IO engine to the buftarg Dave Chinner
2020-10-15  7:21 ` [PATCH 17/27] xfsprogs: convert libxfs_readbufr to libxfs_buf_read_uncached Dave Chinner
2020-10-15  7:21 ` [PATCH 18/27] libxfs: convert libxfs_bwrite to buftarg IO Dave Chinner
2020-10-15  7:21 ` [PATCH 19/27] libxfs: add cache infrastructure to buftarg Dave Chinner
2020-10-15  7:21 ` [PATCH 20/27] libxfs: add internal lru to btcache Dave Chinner
2020-10-15  7:21 ` [PATCH 21/27] libxfs: Add kernel list_lru wrapper Dave Chinner
2020-10-15  7:21 ` [PATCH 22/27] libxfs: introduce new buffer cache infrastructure Dave Chinner
2020-10-15 17:46   ` Darrick J. Wong
2020-10-15  7:21 ` [PATCH 23/27] libxfs: use PSI information to detect memory pressure Dave Chinner
2020-10-15 17:56   ` Darrick J. Wong
2020-10-15 21:20     ` Dave Chinner
2020-10-15  7:21 ` [PATCH 24/27] libxfs: add a buftarg cache shrinker implementation Dave Chinner
2020-10-15 18:01   ` Darrick J. Wong
2020-10-15 21:33     ` Dave Chinner
2020-10-15  7:21 ` [PATCH 25/27] libxfs: switch buffer cache implementations Dave Chinner
2020-10-15  7:21 ` [PATCH 26/27] build: set dependency correctly Dave Chinner
2020-10-15  7:21 ` [PATCH 27/27] libxfs: convert sync IO buftarg engine to AIO Dave Chinner
2020-10-15 18:26   ` Darrick J. Wong
2020-10-15 21:42     ` Dave Chinner
2020-10-15  7:29 ` [PATCH 00/27] [RFC, WIP] xfsprogs: xfs_buf unification and AIO Dave Chinner
2020-10-15 18:37 ` Darrick J. Wong
2020-10-15 22:35   ` Dave Chinner [this message]
