From: Dave Chinner <david@fromorbit.com>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: Brian Foster <bfoster@redhat.com>,
	linux-xfs@vger.kernel.org, Ian Kent <raven@themaw.net>,
	rcu@vger.kernel.org
Subject: Re: [PATCH] xfs: require an rcu grace period before inode recycle
Date: Thu, 27 Jan 2022 16:26:09 +1100
Message-ID: <20220127052609.GR59729@dread.disaster.area>
In-Reply-To: <YfIdVq6R6xEWxy0K@zeniv-ca.linux.org.uk>

On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote:
> On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote:
> 
> > Right, background inactivation does not improve performance - it's
> > necessary to get the transactions out of the evict() path. All we
> > wanted was to ensure that there were no performance degradations as
> > a result of background inactivation, not that it was faster.
> > 
> > If you want to confirm that there is an increase in cold cache
> > access when the batch size is increased, cpu profiles with 'perf
> > top'/'perf record/report' and CPU cache performance metric reporting
> > via 'perf stat -dddd' are your friend. See elsewhere in the thread
> > where I mention those things to Paul.
> 
> Dave, do you see a plausible way to eventually drop Ian's bandaid?
> I'm not asking for that to happen this cycle and for backports Ian's
> patch is obviously fine.

Yes, but not in the near term.

> What I really want to avoid is the situation when we are stuck with
> keeping that bandaid in fs/namei.c, since all ways to avoid seeing
> reused inodes would hurt XFS too badly.  And the benchmarks in this
> thread do look like that.

The simplest way I think is to have the XFS inode allocation track
"busy inodes" in the same way we track "busy extents". A busy extent
is an extent that has been freed by the user, but is not yet marked
free in the journal/on disk. If we try to reallocate that busy
extent, we either select a different free extent to allocate, or if
we can't find any we force the journal to disk, wait for it to
complete (hence unbusying the extents) and retry the allocation
again.
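
In rough pseudo-C, ignoring the details of the real allocator and
busy extent tracking code (the helper names below are made up, not
the actual xfs_alloc_*/xfs_extent_busy_* interfaces), the pattern
is:

	/* Sketch only - illustrative helpers, not real XFS code. */
	for (;;) {
		ext = find_free_extent(args);
		if (ext && !extent_is_busy(args->pag, ext))
			return ext;		/* safe to reallocate */
		if (ext) {
			/* busy - prefer a different free extent */
			ext = find_free_extent_avoiding_busy(args);
			if (ext)
				return ext;
		}
		/*
		 * Nothing usable: force the journal to disk and wait
		 * for it so the busy extents become free on disk,
		 * then retry the allocation.
		 */
		xfs_log_force(args->mp, XFS_LOG_SYNC);
	}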

We can do something similar for inode allocation - it's actually a
lockless tag lookup on the radix tree entry for the candidate inode
number. If we find the reclaimable radix tree tag set, then we select
a different inode. If we can't allocate a new inode, then we kick
synchronize_rcu() and retry the allocation, allowing inodes to be
recycled this time.
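
Something like this, leaning on the existing per-AG inode cache
radix tree and reclaim tag. None of this exists today - the busy
check and the "find another free inode" helper are hypothetical:

	static bool
	xfs_inode_is_busy(
		struct xfs_perag	*pag,
		xfs_agino_t		agino)
	{
		bool			busy;

		rcu_read_lock();
		busy = radix_tree_tag_get(&pag->pag_ici_root, agino,
				XFS_ICI_RECLAIM_TAG);
		rcu_read_unlock();
		return busy;
	}

	/* and in the allocation path, roughly: */
	while (xfs_inode_is_busy(pag, agino)) {
		/* xfs_find_other_free_inode() is hypothetical */
		if (xfs_find_other_free_inode(pag, &agino))
			continue;	/* re-check the new candidate */
		/* nothing else free: wait out pending frees, then it
		 * is safe to recycle this inode on the retry */
		synchronize_rcu();
		break;
	}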

> Are there any realistic prospects of having xfs_iget() deal with
> reuse case by allocating new in-core inode and flipping whatever
> references you've got in XFS journalling data structures to the
> new copy?  If I understood what you said on IRC correctly, that is...

That's ... much harder.

One of the problems is that once an inode has a log item attached to
it, it assumes that it can be accessed without specific locking;
see xfs_inode_clean(), for example. So there's some life-cycle
stuff that needs to be taken care of in XFS first, and the inode <->
log item relationship is tangled.
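
From memory, xfs_inode_clean() is roughly the following - it pokes
straight at the attached log item with no locking at all, which is
exactly the sort of assumption that makes swapping the in-core
inode out from under the log item hard:

	/* roughly what xfs_inode_clean() looks like today */
	static inline bool
	xfs_inode_clean(struct xfs_inode *ip)
	{
		return !ip->i_itemp ||
		       !(ip->i_itemp->ili_fields & XFS_ILOG_ALL);
	}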

I've been working towards removing that tangle - but that stuff is
quite a distance down my logging rework patch queue. That queue has
been stuck now for a year trying to get the first handful of rework
and scalability modifications reviewed and merged, so I'm not
holding my breath as to how long a more substantial rework of
internal logging code will take to review and merge.

Really, though, we need the inactivation stuff to be done as part of
the VFS inode lifecycle. I have some ideas on what to do here, but I
suspect we'll need some changes to iput_final()/evict() to allow us
to process final unlinks in the background and then call evict()
ourselves when the unlink completes. That way ->destroy_inode() can
just call xfs_reclaim_inode() to free it directly, which also helps
us get rid of background inode freeing and hence inode recycling
from XFS altogether. I think we _might_ be able to do this without
needing to change any of the logging code in XFS, but I haven't
looked any further than this into it as yet.
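
Hand-waving wildly, ->destroy_inode would end up looking something
like this - the direct-reclaim helper is hypothetical, and the
background unlink machinery and the iput_final()/evict() changes
are the hard parts being glossed over:

	static void
	xfs_fs_destroy_inode(
		struct inode		*inode)
	{
		/*
		 * Unlink inactivation has already completed in the
		 * background and evict() was called by XFS once it
		 * finished, so there is nothing left to defer here:
		 * free the inode directly instead of queuing it for
		 * background reclaim and later recycling.
		 */
		xfs_reclaim_inode_direct(XFS_I(inode));	/* hypothetical */
	}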

> Again, I'm not asking if it can be done this cycle; having a
> realistic path to doing that eventually would be fine by me.

We're talking a year at least, probably two, before we get there...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
