From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 13/24] xfs: make inode reclaim almost non-blocking
Date: Sun, 24 May 2020 08:29:35 +1000	[thread overview]
Message-ID: <20200523222935.GH2040@dread.disaster.area> (raw)
In-Reply-To: <20200522224806.GQ8230@magnolia>

On Fri, May 22, 2020 at 03:48:06PM -0700, Darrick J. Wong wrote:
> On Fri, May 22, 2020 at 01:50:18PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that dirty inode writeback doesn't cause read-modify-write
> > cycles on the inode cluster buffer under memory pressure, the need
> > to throttle memory reclaim to the rate at which we can clean dirty
> > inodes goes away. That is due to the fact that we no longer thrash
> > inode cluster buffers under memory pressure to clean dirty inodes.
> > 
> > This means inode writeback no longer stalls on memory allocation
> > or read IO, and hence can be done asynchrnously without generating
> 
> "...asynchronously..."
> 
> > memory pressure. As a result, blocking inode writeback in reclaim is
> > no longer necessary to prevent reclaim priority windup as cleaning
> > dirty inodes is no longer dependent on having memory reserves
> > available for the filesystem to make progress reclaiming inodes.
> > 
> > Hence we can convert inode reclaim to be non-blocking for shrinker
> > callouts, both for direct reclaim and kswapd.
> > 
> > On a vanilla kernel, running a 16-way fsmark create workload on a
> > 4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
> > userspace mlock(). The OOM killer gets invoked at 15GB of
> > pinned RAM.
> > 
> > With this patch alone, pinning memory triggers premature OOM
> > killer invocation, sometimes with as much as 45% of RAM being free.
> > It's trivially easy to trigger the OOM killer when reclaim does not
> > block.
> > 
> > With pinning inode clusters in RAM and then adding this patch, I can
> > reliably pin 14.5GB of RAM and still have the fsmark workload run to
> > completion. The OOM killer gets invoked at 14.75GB of pinned RAM, which
> > is only a small amount of memory less than the vanilla kernel. It is
> > much more reliable than just with async reclaim alone.
> 
> So the lack of OOM kills is the result of not having to do RMW and
> ratcheting up the reclaim priority, right?

Effectively. The ratcheting up of the reclaim priority without
writeback is a secondary effect of the RMW in inode writeback.

That is, the AIL blocks in memory reclaim while doing dirty inode
writeback, because that writeback has unbound memory demand (async
flushing). Hence it exhausts memory reserves if there are lots of
dirty inodes. It's also PF_MEMALLOC so, like kswapd, it can dip into
reserves that normal allocations can't.
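
The pattern in question looks roughly like this - a sketch from
memory, not the actual xfsaild code, so the thread name and the loop
details below are illustrative only:

/*
 * Illustrative sketch: a kernel thread that must push dirty metadata
 * under memory pressure marks itself PF_MEMALLOC, the same way kswapd
 * does, so its allocations can dip into the emergency reserves.
 */
#include <linux/kthread.h>
#include <linux/sched.h>

static int metadata_push_thread(void *data)
{
	current->flags |= PF_MEMALLOC;	/* may use memory reserves */

	while (!kthread_should_stop()) {
		/* push dirty items; RMW buffer reads here may allocate */
		schedule_timeout_interruptible(HZ / 10);
	}

	current->flags &= ~PF_MEMALLOC;
	return 0;
}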

The synchronous write behaviour of reclaim, however, bounds memory
demand at (N * ag count * pages per inode cluster), and hence it is
much more likely to make forwards progress, albeit slowly. The
synchronous write also has the effect of throttling the rate at
which reclaim cycles, hence slowing down the speed at which it ramps
up the reclaim priority. IOWs, we get both forwards progress
and lower reclaim priority because we block reclaim like this.
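
To put rough numbers on that bound, with purely illustrative values
(say 32 concurrent reclaimers, 16 AGs, and 16kB inode clusters on a
4kB page machine - none of these are measurements from the test box):

/*
 * Back-of-the-envelope calculation of the memory demand bound for
 * synchronous inode cluster writeback from reclaim.  All inputs are
 * assumed values for illustration.
 */
#include <stdio.h>

int main(void)
{
	long reclaimers = 32;		/* N: tasks in the shrinker at once */
	long ag_count = 16;		/* allocation groups in the filesystem */
	long pages_per_cluster = 4;	/* 16kB inode cluster / 4kB pages */
	long page_kb = 4;

	long pages = reclaimers * ag_count * pages_per_cluster;

	printf("bounded demand: %ld pages (~%ld MB)\n",
	       pages, pages * page_kb / 1024);
	return 0;
}

i.e. on the order of megabytes, rather than the unbound demand of
async flushing.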

IOWs, removing the synchronous writeback from reclaim does two
things. The first is that it removes the ability to make forwards
progress reclaiming inodes from XFS when there is very low free
memory. This is bad for obvious reasons.

The second is that it leads reclaim to think it can't free inode
memory quickly, and that's what causes the increase in reclaim
priority. i.e. it needs more scan loops to free inodes because
writeback of dirty inodes is slow and not making progress. This is
also bad, because we can make progress, just more slowly than the
rate memory reclaim is capable of backing off to.

The sync writeback of inode clusters from reclaim mitigated both of
these issues when they occurred, at the cost of increased allocation
latency under extreme OOM conditions...

This is why, despite everyone with OOM latency problems claiming "it
works for them so you should just merge it", just skipping inode
writeback in the shrinker has not been a solution to the problem -
it didn't solve the underlying "reclaim of dirty inodes can create
unbound memory demand" problem that the sync inode writeback
controlled.

Previous attempts to solve this problem had been focussed on
replacing the throttling the shrinker did with backoffs in the core
reclaim algorithms, but that's made no progress on the mm/ side of
things. Hence this patchset - trying to tackle the problem from a
different direction so we are no longer reliant on changing core OS
infrastructure to solve problems XFS users are having.

> And, {con|per}versely, can I run fstests with 400MB of RAM now? :D

If it is bound on sync inode writeback from memory reclaim, then it
will help. Otherwise it may make things worse, because the trade-off
we are making here is that dirty inodes can pin substantially more
memory in cache while they queue to be written back.

Yup, that's the ugly downside of this approach. Rather than have the
core memory reclaim throttle and wait according to what we need it
to do, we simply make the XFS cache footprint larger every time we
dirty an inode. It also costs us 1-2% extra CPU per transaction, so
this change certainly isn't free. IMO, it's most definitely not the
most efficient, performant or desirable solution to the problem, but
it's one that works and is wholly contained within XFS.

> > simoops shows that allocation stalls go away when async reclaim is
> > used. Vanilla kernel:
> > 
> > Run time: 1924 seconds
> > Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
> > Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
> > Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
> > work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
> > alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
> > 
> > With inode cluster pinning and async reclaim:
> > 
> > Run time: 1924 seconds
> > Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
> > Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
> > Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
> 
> I'm not familiar with simoops, and ElGoog is not helpful.  What are the
> units here?

Microseconds, IIRC.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 91+ messages
2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
2020-05-22  3:50 ` [PATCH 01/24] xfs: remove logged flag from inode log item Dave Chinner
2020-05-22  7:25   ` Christoph Hellwig
2020-05-22 21:13   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 02/24] xfs: add an inode item lock Dave Chinner
2020-05-22  6:45   ` Amir Goldstein
2020-05-22 21:24   ` Darrick J. Wong
2020-05-23  8:45   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 03/24] xfs: mark inode buffers in cache Dave Chinner
2020-05-22  7:45   ` Amir Goldstein
2020-05-22 21:35   ` Darrick J. Wong
2020-05-24 23:41     ` Dave Chinner
2020-05-23  8:48   ` Christoph Hellwig
2020-05-25  0:06     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 04/24] xfs: mark dquot " Dave Chinner
2020-05-22  7:46   ` Amir Goldstein
2020-05-22 21:38   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 05/24] xfs: mark log recovery buffers for completion Dave Chinner
2020-05-22  7:41   ` Amir Goldstein
2020-05-24 23:54     ` Dave Chinner
2020-05-22 21:41   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 06/24] xfs: call xfs_buf_iodone directly Dave Chinner
2020-05-22  7:56   ` Amir Goldstein
2020-05-22 21:53   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 07/24] xfs: clean up whacky buffer log item list reinit Dave Chinner
2020-05-22 22:01   ` Darrick J. Wong
2020-05-23  8:50   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 08/24] xfs: fold xfs_istale_done into xfs_iflush_done Dave Chinner
2020-05-22 22:10   ` Darrick J. Wong
2020-05-23  9:12   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 09/24] xfs: use direct calls for dquot IO completion Dave Chinner
2020-05-22 22:13   ` Darrick J. Wong
2020-05-23  9:16   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 10/24] xfs: clean up the buffer iodone callback functions Dave Chinner
2020-05-22 22:26   ` Darrick J. Wong
2020-05-25  0:37     ` Dave Chinner
2020-05-23  9:19   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 11/24] xfs: get rid of log item callbacks Dave Chinner
2020-05-22 22:27   ` Darrick J. Wong
2020-05-23  9:19   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 12/24] xfs: pin inode backing buffer to the inode log item Dave Chinner
2020-05-22 22:39   ` Darrick J. Wong
2020-05-23  9:34   ` Christoph Hellwig
2020-05-23 21:43     ` Dave Chinner
2020-05-24  5:31       ` Christoph Hellwig
2020-05-24 23:13         ` Dave Chinner
2020-05-22  3:50 ` [PATCH 13/24] xfs: make inode reclaim almost non-blocking Dave Chinner
2020-05-22 12:19   ` Amir Goldstein
2020-05-22 22:48   ` Darrick J. Wong
2020-05-23 22:29     ` Dave Chinner [this message]
2020-05-22  3:50 ` [PATCH 14/24] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
2020-05-22 23:06   ` Darrick J. Wong
2020-05-25  3:49     ` Dave Chinner
2020-05-23  9:40   ` Christoph Hellwig
2020-05-23 22:35     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 15/24] xfs: allow multiple reclaimers per AG Dave Chinner
2020-05-22 23:10   ` Darrick J. Wong
2020-05-23 22:35     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 16/24] xfs: don't block inode reclaim on the ILOCK Dave Chinner
2020-05-22 23:11   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 17/24] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
2020-05-22 23:14   ` Darrick J. Wong
2020-05-23 22:42     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 18/24] xfs: clean up inode reclaim comments Dave Chinner
2020-05-22 23:17   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 19/24] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
2020-05-22 23:48   ` Darrick J. Wong
2020-05-23 22:59     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 20/24] xfs: xfs_iflush() is no longer necessary Dave Chinner
2020-05-22 23:54   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 21/24] xfs: rename xfs_iflush_int() Dave Chinner
2020-05-22 12:33   ` Amir Goldstein
2020-05-22 23:57   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
2020-05-23  0:13   ` Darrick J. Wong
2020-05-23 23:14     ` Dave Chinner
2020-05-23 11:31   ` Christoph Hellwig
2020-05-23 23:23     ` Dave Chinner
2020-05-24  5:32       ` Christoph Hellwig
2020-05-23 11:39   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 23/24] xfs: factor xfs_iflush_done Dave Chinner
2020-05-23  0:20   ` Darrick J. Wong
2020-05-23 11:35   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 24/24] xfs: remove xfs_inobp_check() Dave Chinner
2020-05-23  0:16   ` Darrick J. Wong
2020-05-23 11:36   ` Christoph Hellwig
2020-05-22  4:04 ` [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
2020-05-23 16:18   ` Darrick J. Wong
2020-05-23 21:22     ` Dave Chinner
2020-05-22  6:18 ` Amir Goldstein
2020-05-22 12:01   ` Amir Goldstein
