From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	linux-xfs@vger.kernel.org, Ian Kent <raven@themaw.net>,
	rcu@vger.kernel.org
Subject: Re: [PATCH] xfs: require an rcu grace period before inode recycle
Date: Thu, 27 Jan 2022 14:01:25 -0500
Message-ID: <YfLsBdPBSsyPFgHJ@bfoster>
In-Reply-To: <20220127052609.GR59729@dread.disaster.area>

On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote:
> On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote:
> > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote:
> > 
> > > Right, background inactivation does not improve performance - it's
> > > necessary to get the transactions out of the evict() path. All we
> > > wanted was to ensure that there were no performance degradations as
> > > a result of background inactivation, not that it was faster.
> > > 
> > > If you want to confirm that there is an increase in cold cache
> > > access when the batch size is increased, cpu profiles with 'perf
> > > top'/'perf record/report' and CPU cache performance metric reporting
> > > via 'perf stat -dddd' are your friend. See elsewhere in the thread
> > > where I mention those things to Paul.
> > 
> > Dave, do you see a plausible way to eventually drop Ian's bandaid?
> > I'm not asking for that to happen this cycle and for backports Ian's
> > patch is obviously fine.
> 
> Yes, but not in the near term.
> 
> > What I really want to avoid is the situation when we are stuck with
> > keeping that bandaid in fs/namei.c, since all ways to avoid seeing
> > reused inodes would hurt XFS too badly.  And the benchmarks in this
> > thread do look like that.
> 
> The simplest way I think is to have the XFS inode allocation track
> "busy inodes" in the same way we track "busy extents". A busy extent
> is an extent that has been freed by the user, but is not yet marked
> free in the journal/on disk. If we try to reallocate that busy
> extent, we either select a different free extent to allocate, or if
> we can't find any we force the journal to disk, wait for it to
> complete (hence unbusying the extents) and retry the allocation
> again.
> 
> We can do something similar for inode allocation - it's actually a
> lockless tag lookup on the radix tree entry for the candidate inode
> number. If we find the reclaimable radix tree tag set, then we select
> a different inode. If we can't allocate a new inode, then we kick
> synchronize_rcu() and retry the allocation, allowing inodes to be
> recycled this time.
> 
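For reference, my reading of the check you're describing is roughly the
sketch below. Untested and hand-waving: the helper name and its use are
made up, but pag->pag_ici_root and XFS_ICI_RECLAIM_TAG are the existing
per-AG in-core inode cache radix tree and tag. The allocation path would
skip any candidate inode for which this returns true and only fall back
to synchronize_rcu() + retry (allowing recycle) when nothing else is
available:

static bool
xfs_inode_alloc_busy(
        struct xfs_mount        *mp,
        xfs_ino_t               ino)
{
        struct xfs_perag        *pag;
        bool                    busy;

        pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ino));
        rcu_read_lock();
        /* reclaimable entry still present: recycling may race with RCU walkers */
        busy = radix_tree_tag_get(&pag->pag_ici_root,
                                  XFS_INO_TO_AGINO(mp, ino),
                                  XFS_ICI_RECLAIM_TAG);
        rcu_read_unlock();
        xfs_perag_put(pag);
        return busy;
}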

I'm starting to poke around this area since it's become clear that the
currently proposed scheme involves too much latency in the fast
allocation/recycle path (unless Paul chimes in with his expedited grace
period variant, at which point I'll revisit). ISTM so far that a simple
"skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will
have pretty much the same pattern of behavior as this patch: one
synchronize_rcu() per batch.

IOW, background reclaim only kicks in after 30s by default, so the pool
of free inodes pretty much always consists entirely of reclaimable
inodes. On top of that, at smaller batch sizes the pool tends to share a
uniform (not yet elapsed) grace period cookie, so a stall is required
before any of them can be allocated. As the batch size increases, I do
see the population of free inodes start to contain a mix of expired and
non-expired grace period cookies. It's fairly easy to hack up an internal
icwalk scan to locate already expired inodes, but the problem is that
recycling so far outpaces grace period completion that it doesn't really
matter. We still have to stall by the time we get to the non-expired
inodes, so we're back to one stall per batch and the same general
performance characteristics as this patch.
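
FWIW, the "already expired" test in that hacked up scan amounts to
nothing more than polling the grace period cookie recorded when the VFS
side is done with the inode (the field name below is just a stand-in for
whatever this patch ends up calling it):

static bool
xfs_inode_gp_expired(
        struct xfs_inode        *ip)
{
        /* true once the GP covering the inode's last VFS use has ended */
        return poll_state_synchronize_rcu(ip->i_destroy_gp);
}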

So given all of this, I'm wondering about something like the following
high-level inode allocation algorithm (rough pseudocode sketch below):

1. If the AG has any reclaimable inodes, scan for one with an expired
grace period. If found, target that inode for physical allocation.

2. If the AG free inode count == the AG reclaimable count, then the
failure of the previous step means all reclaimable inodes are most likely
still pending a grace period, so allocate a new inode chunk (and target
it for this allocation).

3. If the AG free inode count > the reclaimable count, scan the finobt
for an inode that is not present in the radix tree (i.e. Dave's logic
above).
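
In made-up pseudocode (none of these helpers exist; they only name the
corresponding steps above, and pagi_freecount is the existing per-AG
free inode counter):

xfs_ino_t
xfs_ialloc_select(
        struct xfs_perag        *pag)
{
        xfs_ino_t               ino;

        /* 1. prefer a reclaimable inode whose grace period has expired */
        ino = xfs_iwalk_find_expired(pag);
        if (ino != NULLFSINO)
                return ino;

        /*
         * 2. everything free is reclaimable and (per step 1) presumably
         * still pending a grace period: allocate a new chunk rather than
         * stalling.
         */
        if (pag->pagi_freecount == xfs_reclaimable_count(pag))
                return xfs_ialloc_new_chunk(pag);

        /* 3. otherwise pick a finobt inode with no in-core radix tree entry */
        return xfs_finobt_find_unbusy(pag);
}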

Each of those steps could involve some heuristics to maintain
predictable behavior and avoid large scans and such, but the general
idea is that a repeated alloc/free inode workload naturally populates
the AG with enough physical inodes to always satisfy an allocation
without waiting on a grace period. IOW, the behavior is effectively what
we'd get if physical inode freeing were delayed to an RCU callback, with
the tradeoff of complicating the allocation path rather than stalling
the inactivation pipeline. Thoughts?

This is of course more involved than this patch (or similarly simple
variants that RCU-delay preexisting bits of code) and requires some more
investigation, but it certainly shouldn't be a multi-year thing. The
question is probably more whether it's enough complexity to justify in
the meantime...

> > Are there any realistic prospects of having xfs_iget() deal with
> > reuse case by allocating new in-core inode and flipping whatever
> > references you've got in XFS journalling data structures to the
> > new copy?  If I understood what you said on IRC correctly, that is...
> 
> That's ... much harder.
> 
> One of the problems is that once an inode has a log item attached to
> it, it assumes that it can be accessed without specific locking,
> etc - see xfs_inode_clean(), for example. So there's some life-cycle
> stuff that needs to be taken care of in XFS first, and the inode <->
> log item relationship is tangled.
> 
> I've been working towards removing that tangle - but that stuff is
> quite a distance down my logging rework patch queue. That queue has
> been stuck now for a year trying to get the first handful of rework
> and scalability modifications reviewed and merged, so I'm not
> holding my breath as to how long a more substantial rework of
> internal logging code will take to review and merge.
> 
> Really, though, we need the inactivation stuff to be done as part of
> the VFS inode lifecycle. I have some ideas on what to do here, but I
> suspect we'll need some changes to iput_final()/evict() to allow us
> to process final unlinks in the background and then call evict()
> ourselves when the unlink completes. That way ->destroy_inode() can
> just call xfs_reclaim_inode() to free it directly, which also helps
> us get rid of background inode freeing and hence inode recycling
> from XFS altogether. I think we _might_ be able to do this without
> needing to change any of the logging code in XFS, but I haven't
> looked any further than this into it as yet.
> 

... of whatever this ends up looking like.

Can you elaborate on what you mean by processing unlinks in the
background? I can see the value of being able to eliminate the recycle
code in XFS, but wouldn't we still have to limit and throttle against
the background work to maintain sustained removal performance? IOW,
what's the general teardown behavior you're getting at here, aside from
which parts push into the VFS?

Brian

> > Again, I'm not asking if it can be done this cycle; having a
> > realistic path to doing that eventually would be fine by me.
> 
> We're talking a year at least, probably two, before we get there...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

