archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <>
To: "Darrick J. Wong" <>
Subject: Re: [PATCH 03/20] xfs: defer inode inactivation to a workqueue
Date: Mon, 2 Aug 2021 09:47:09 +1000	[thread overview]
Message-ID: <20210801234709.GD2757197@dread.disaster.area> (raw)
In-Reply-To: <20210801214910.GC2757197@dread.disaster.area>

On Mon, Aug 02, 2021 at 07:49:10AM +1000, Dave Chinner wrote:
> On Fri, Jul 30, 2021 at 09:21:12PM -0700, Darrick J. Wong wrote:
> > On Fri, Jul 30, 2021 at 02:24:00PM +1000, Dave Chinner wrote:
> > > On Thu, Jul 29, 2021 at 11:44:10AM -0700, Darrick J. Wong wrote:
> > And since the shrinkers are always a source of amusement, what /is/ up
> > with it?  I don't really like having to feed it magic numbers just to
> > get it to do what I want, which is ... let it free some memory in the
> > first round, then we'll kick the background workers when the priority
> > bumps (er, decreases), and hope that's enough not to OOM the box.
> Well, the shrinkers are intended as a one-shot memory pressure
> notification as you are trying to use them. They are intended to be
> told the amount of work that needs to be done to free memory, and
> they they calculate how much of that work should be done based on
> it's idea of the current level of memory pressure.
> One shot shrinker triggers never tend to work well because they
> treat all memory pressure the same - very light memory pressure is
> dead with by the same big hammer than deals with OOM levels of
> memory pressure.
> As it is, I'm more concerned right now with finding out why there's
> such large performance regressions in highly concurrent recursive
> chmod/unlink workloads. I spend most of friday looking at it trying
> to work out what behaviour was causing the regression, but I haven't
> isolated it yet.

So I pulled all of my patchsets out and I'm just looking at the
deferred inactivation changes now. rm -rf triggers a profile like:

-   68.94%     3.24%  [kernel]            [k] xlog_cil_commit
   - 65.70% xlog_cil_commit
      - 55.01% _raw_spin_lock
         - do_raw_spin_lock
              54.14% __pv_queued_spin_lock_slowpath
        2.26% memcpy_erms
      - 1.60% xfs_buf_item_committing
         - 1.57% xfs_buf_item_release
            - 0.70% xfs_buf_unlock
                 0.67% up
              0.57% xfs_buf_rele
        1.09% xfs_buf_item_format
      - 0.90% _raw_spin_unlock
         - 0.80% do_raw_spin_unlock
              0.61% __raw_callee_save___pv_queued_spin_unlock
      - 0.81% xfs_buf_item_size
           0.57% xfs_buf_item_size_segment.isra.0
        0.67% xfs_inode_item_format

And the path into here is split almost exactly 50/50 between
xfs_remove() (process context) and xfs_inactive (deferred context).

+   40.85%     0.02%  [kernel]            [k] xfs_remove
+   40.61%     0.00%  [kernel]            [k] xfs_inodegc_worker

rm -rf runtime without the patchset is 2m30s, but 3m41s with it.

So, essentially, the background inactivation increases the
concurrency through the transaction commit path and causes a massive
increase in CIL push lock contention (i.e. catastrophic lock
breakdown) which makes performance go backwards.

Nothing can be done about this except, well, merge the CIL
scalability patches in my patch stack that address this exact lock
contention problem.

> I suspect that it is lockstepping between user
> processes and background inactivation for the unlink - I'm seeing
> the backlink rhashtable show up in the profiles which indicates the
> unlinked list lengths are an issue and we're lockstepping the AGI.
> It also may simply be that there is too much parallelism hammering
> the transaction subsystem now....

The reason I've been seeing different symptoms is that my CIL
scalability patchset entirely eliminates this spinlock contention
and something else less obvious becomes the performance limiting
factor. i.e. the CPU profile looks like this instead:

+   33.33%     0.00%  [kernel]            [k] xfs_inodegc_worker
+   26.45%     5.49%  [kernel]            [k] xlog_cil_commit
+   23.13%     0.04%  [kernel]            [k] xfs_remove

And runtime is a little lower (3m20s) but still not "fast". The fact
that CPU time has gone down so much results in idle time  and
indicates we are now contending on a sleeping lock as the context
swtich profile indicates:

+   36.23%     0.00%  [kernel]            [k] xfs_buf_read_map
+   22.99%     0.00%  [kernel]            [k] down
+   22.99%     0.00%  [kernel]            [k] __down
+   22.99%     0.00%  [kernel]            [k] xfs_read_agi
+   11.83%     0.00%  [kernel]            [k] xfs_imap_to_bp

Over a third of the context switches are on locked buffers, 2/3s of
tehm on AGI buffers, and the rest are mostly on inode inode cluster
buffers likely doing unlinked list modifications.

IOWs, we traded CIL push lock contention for AGI and inode cluster
buffer lock stepping between the unlink() syscall context and the
background inactivation processes.

This isn't a new problem - I've known about it for years, and I have
been working towards solving it (e.g. the single unlinked list per
AGI patches).  In effect, the problem is that we always add newly
unlinked to the head - the AGI end - of the unlinked lists, so we
must always take the AGI lock in unlink() context just to update the
unlinked list. On the inactivation side, we always have to take the
AGI because we are freeing inodes.

With deferred inactivation, the unlinked lists grow large, so we
could avoid needing to modify the AGI by adding inodes to the tail
of the unlinked list instead of the head. Unfortunately, to do this
we currently need to store 64 agino tail pointers per AGI in memory.
I might try modifying my patches to do this so I can re-order the
unlinked list without needing to change on disk format. It's not
very data cache or memory efficient, but it likely will avoid most
of the AGI contention on these workloads.

However, none of this is ready for prime time, so the next thing
I'll look at is why both foreground and background are running at
the same time instead of batching effectively. i.e. unlink context
runs a bunch of unlinks adding inodes to the unlinked lists, then
background runs doing the deferred work while foreground stays out
of the way (e.g. is throttled). This will involve looking at traces,
so I suspect it's going to take me a day or two to extract repeating
patterns and understand them well enough to determine what to
do/look at next.


Dave Chinner

  reply	other threads:[~2021-08-01 23:47 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
2021-07-29 18:43 ` [PATCH 01/20] xfs: move xfs_inactive call to xfs_inode_mark_reclaimable Darrick J. Wong
2021-07-29 18:44 ` [PATCH 02/20] xfs: detach dquots from inode if we don't need to inactivate it Darrick J. Wong
2021-07-29 18:44 ` [PATCH 03/20] xfs: defer inode inactivation to a workqueue Darrick J. Wong
2021-07-30  4:24   ` Dave Chinner
2021-07-31  4:21     ` Darrick J. Wong
2021-08-01 21:49       ` Dave Chinner
2021-08-01 23:47         ` Dave Chinner [this message]
2021-08-03  8:34   ` [PATCH, alternative] xfs: per-cpu deferred inode inactivation queues Dave Chinner
2021-08-03 20:20     ` Darrick J. Wong
2021-08-04  3:20     ` [PATCH, alternative v2] " Darrick J. Wong
2021-08-04 10:03       ` [PATCH] xfs: inodegc needs to stop before freeze Dave Chinner
2021-08-04 12:37         ` Dave Chinner
2021-08-04 10:46       ` [PATCH] xfs: don't run inodegc flushes when inodegc is not active Dave Chinner
2021-08-04 16:20         ` Darrick J. Wong
2021-08-04 11:09       ` [PATCH, alternative v2] xfs: per-cpu deferred inode inactivation queues Dave Chinner
2021-08-04 15:59         ` Darrick J. Wong
2021-08-04 21:35           ` Dave Chinner
2021-08-04 11:49       ` [PATCH, pre-03/20 #1] xfs: introduce CPU hotplug infrastructure Dave Chinner
2021-08-04 11:50       ` [PATCH, pre-03/20 #2] xfs: introduce all-mounts list for cpu hotplug notifications Dave Chinner
2021-08-04 16:06         ` Darrick J. Wong
2021-08-04 21:17           ` Dave Chinner
2021-08-04 11:52       ` [PATCH, post-03/20 1/1] xfs: hook up inodegc to CPU dead notification Dave Chinner
2021-08-04 16:19         ` Darrick J. Wong
2021-08-04 21:48           ` Dave Chinner
2021-07-29 18:44 ` [PATCH 04/20] xfs: throttle inode inactivation queuing on memory reclaim Darrick J. Wong
2021-07-29 18:44 ` [PATCH 05/20] xfs: don't throttle memory reclaim trying to queue inactive inodes Darrick J. Wong
2021-07-29 18:44 ` [PATCH 06/20] xfs: throttle inodegc queuing on backlog Darrick J. Wong
2021-08-02  0:45   ` Dave Chinner
2021-08-02  1:30     ` Dave Chinner
2021-07-29 18:44 ` [PATCH 07/20] xfs: queue inodegc worker immediately when memory is tight Darrick J. Wong
2021-07-29 18:44 ` [PATCH 08/20] xfs: expose sysfs knob to control inode inactivation delay Darrick J. Wong
2021-07-29 18:44 ` [PATCH 09/20] xfs: reduce inactivation delay when free space is tight Darrick J. Wong
2021-07-29 18:44 ` [PATCH 10/20] xfs: reduce inactivation delay when quota are tight Darrick J. Wong
2021-07-29 18:44 ` [PATCH 11/20] xfs: reduce inactivation delay when realtime extents " Darrick J. Wong
2021-07-29 18:44 ` [PATCH 12/20] xfs: inactivate inodes any time we try to free speculative preallocations Darrick J. Wong
2021-07-29 18:45 ` [PATCH 13/20] xfs: flush inode inactivation work when compiling usage statistics Darrick J. Wong
2021-07-29 18:45 ` [PATCH 14/20] xfs: parallelize inode inactivation Darrick J. Wong
2021-08-02  0:55   ` Dave Chinner
2021-08-02 21:33     ` Darrick J. Wong
2021-07-29 18:45 ` [PATCH 15/20] xfs: reduce inactivation delay when AG free space are tight Darrick J. Wong
2021-07-29 18:45 ` [PATCH 16/20] xfs: queue inodegc worker immediately on backlog Darrick J. Wong
2021-07-29 18:45 ` [PATCH 17/20] xfs: don't run speculative preallocation gc when fs is frozen Darrick J. Wong
2021-07-29 18:45 ` [PATCH 18/20] xfs: scale speculative preallocation gc delay based on free space Darrick J. Wong
2021-07-29 18:45 ` [PATCH 19/20] xfs: use background worker pool when transactions can't get " Darrick J. Wong
2021-07-29 18:45 ` [PATCH 20/20] xfs: avoid buffer deadlocks when walking fs inodes Darrick J. Wong
2021-08-02 10:35 ` [PATCHSET v8 00/20] xfs: deferred inode inactivation Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210801234709.GD2757197@dread.disaster.area \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).