[PATCH] xfs: require an rcu grace period before inode recycle

From: Brian Foster <bfoster@redhat.com>
To: linux-xfs@vger.kernel.org
Cc: Dave Chinner <david@fromorbit.com>,
	Al Viro <viro@zeniv.linux.org.uk>, Ian Kent <raven@themaw.net>,
	rcu@vger.kernel.org
Subject: [PATCH] xfs: require an rcu grace period before inode recycle
Date: Fri, 21 Jan 2022 09:24:54 -0500
Message-ID: <20220121142454.1994916-1-bfoster@redhat.com>

The XFS inode allocation algorithm aggressively reuses recently
freed inodes. This is historical behavior that has been in place for
quite some time, since XFS was imported to mainline Linux. Once the
VFS adopted RCU-walk path lookups (also some time ago), this behavior
became slightly incompatible because the inode recycle path doesn't
isolate concurrent access to the inode from the VFS.

This has recently manifested as problems in the VFS when XFS happens
to change the type or properties of a recently unlinked inode while
it is still involved in an RCU lookup. For example, if the VFS refers
to a previous incarnation of a symlink inode, obtains the ->get_link()
callback from inode_operations, and the inode then happens to change
to a non-symlink type via a recycle event, the ->get_link() callback
pointer is reset to NULL and the lookup results in a crash.

To avoid this class of problem, isolate in-core inodes for recycling
with an RCU grace period. This is the same level of protection the
VFS expects for inactivated inodes that are never reused, and so
guarantees no further concurrent access before the type or
properties of the inode change. We don't want an unconditional
synchronize_rcu() event here because that would result in a
significant performance impact to mixed inode allocation workloads.

Fortunately, we can take advantage of the recently added deferred
inactivation mechanism to mitigate the need for an RCU wait in most
cases. Deferred inactivation queues and batches the on-disk freeing
of recently destroyed inodes, and so significantly increases the
likelihood that a grace period has elapsed by the time an inode is
freed and observable by the allocation code as a reuse candidate.
Capture the current RCU grace period cookie at inode destroy time
and refer to it at allocation time to conditionally wait for an RCU
grace period if one has not elapsed in the meantime. Since only
unlinked inodes are recycle candidates and unlinked inodes always
require inactivation, we only need to poll and assign RCU state in
the inactivation codepath. Slightly adjust struct xfs_inode to fit
the new field into padding holes that conveniently preexist in the
same cacheline as the deferred inactivation list.

Finally, note that the ideal long term solution here is to
rearchitect bits of XFS' internal inode lifecycle management such
that this additional stall point is not required, but this requires
more thought, time and work to address. This approach restores
functional correctness in the meantime.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi all,

Here's the RCU fixup patch for inode reuse that I've been playing with,
re: the vfs patch discussion [1]. I've put it in pretty much the most
basic form, but I think there are a couple aspects worth thinking about:

1. Use and frequency of start_poll_synchronize_rcu() (vs.
get_state_synchronize_rcu()). The former is a bit more active than the
latter in that it triggers the start of a grace period, when necessary.
It is currently invoked per inode, which is the ideal frequency in
theory, but that could be reduced, tied to the xfs_inodegc thresholds
in some manner, etc., if there is good reason to do so.

2. The RCU cookie lifecycle. This variant updates it on inactivation
queue and nowhere else because the RCU docs imply that counter rollover
is not a significant problem. In practice, I think this means that if an
inode is stamped at least once, and the counter rolls over, future
(non-inactivation, non-unlinked) eviction -> repopulation cycles could
trigger rcu syncs. I think this would require repeated
eviction/reinstantiation cycles within a small window to be noticeable,
so I'm not sure how likely this is to occur. We could be more defensive
by resetting or refreshing the cookie. E.g., refresh (or reset to zero)
at recycle time, unconditionally refresh at destroy time (using
get_state_synchronize_rcu() for non-inactivation), etc.
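
For anyone less familiar with the polled grace period API, the
stamp-at-destroy / conditionally-wait-at-recycle pattern behind both
points above can be sketched in user space roughly as follows. This is
a toy model under loud assumptions, not kernel code: every toy_* name
is made up for illustration, a single counter stands in for RCU's
grace period sequence, counter rollover is ignored, and the difference
between start_poll_synchronize_rcu() and get_state_synchronize_rcu()
(whether a grace period is actively started) is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for RCU grace period state; NOT the kernel's encoding. */
static unsigned long toy_gp_completed;	/* grace periods completed so far */
static unsigned long toy_sync_calls;	/* blocking waits we had to perform */

/* Hypothetical analogue of taking a cookie: satisfied by the next full GP. */
static unsigned long toy_get_state(void)
{
	return toy_gp_completed + 1;
}

/* Has the grace period named by the cookie completed yet? */
static bool toy_poll_state(unsigned long cookie)
{
	return toy_gp_completed >= cookie;
}

/* Model one grace period completing in the background. */
static void toy_gp_elapse(void)
{
	toy_gp_completed++;
}

/* Hypothetical analogue of a conditional wait: block only if needed. */
static void toy_cond_synchronize(unsigned long cookie)
{
	if (!toy_poll_state(cookie)) {
		toy_sync_calls++;	/* this is the expensive case */
		toy_gp_elapse();	/* "wait" for a grace period */
	}
}

struct toy_inode {
	unsigned long destroy_gp;	/* cookie stamped at destroy time */
};

static void toy_inode_destroy(struct toy_inode *ip)
{
	ip->destroy_gp = toy_get_state();
}

static void toy_inode_recycle(struct toy_inode *ip)
{
	toy_cond_synchronize(ip->destroy_gp);
	/* only now would it be safe to change the inode's type */
}

/* Runs both cases and returns how many blocking waits were needed. */
unsigned long toy_demo(void)
{
	struct toy_inode a = { 0 }, b = { 0 };

	toy_gp_completed = 0;
	toy_sync_calls = 0;

	/* Deferred case: a grace period elapses between destroy and reuse. */
	toy_inode_destroy(&a);
	toy_gp_elapse();
	toy_inode_recycle(&a);
	assert(toy_sync_calls == 0);

	/* Immediate reuse: no grace period yet, so recycle has to wait. */
	toy_inode_destroy(&b);
	toy_inode_recycle(&b);

	return toy_sync_calls;
}
```

In this model toy_demo() counts a blocking wait only for the inode
that is reused immediately, which is the case deferred inactivation is
expected to make rare.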

Otherwise testing is ongoing, but this version at least survives an
fstests regression run.

Brian

[1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/

 fs/xfs/xfs_icache.c | 11 +++++++++++
 fs/xfs/xfs_inode.h  |  3 ++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index d019c98eb839..4931daa45ca4 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -349,6 +349,16 @@ xfs_iget_recycle(
 	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
 
+	/*
+	 * VFS RCU pathwalk lookups dictate the same lifecycle rules for an
+	 * inode recycle as for freeing an inode. I.e., we cannot repurpose the
+	 * inode until a grace period has elapsed from the time the previous
+	 * version of the inode was destroyed. In most cases a grace period has
+	 * already elapsed if the inode was (deferred) inactivated, but
+	 * synchronize here as a last resort to guarantee correctness.
+	 */
+	cond_synchronize_rcu(ip->i_destroy_gp);
+
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
 	error = xfs_reinit_inode(mp, inode);
 	if (error) {
@@ -2019,6 +2029,7 @@ xfs_inodegc_queue(
 	trace_xfs_inode_set_need_inactive(ip);
 	spin_lock(&ip->i_flags_lock);
 	ip->i_flags |= XFS_NEED_INACTIVE;
+	ip->i_destroy_gp = start_poll_synchronize_rcu();
 	spin_unlock(&ip->i_flags_lock);
 
 	gc = get_cpu_ptr(mp->m_inodegc);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index c447bf04205a..2153e3edbb86 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -40,8 +40,9 @@ typedef struct xfs_inode {
 	/* Transaction and locking information. */
 	struct xfs_inode_log_item *i_itemp;	/* logging information */
 	mrlock_t		i_lock;		/* inode lock */
-	atomic_t		i_pincount;	/* inode pin count */
 	struct llist_node	i_gclist;	/* deferred inactivation list */
+	unsigned long		i_destroy_gp;	/* destroy rcugp cookie */
+	atomic_t		i_pincount;	/* inode pin count */
 
 	/*
 	 * Bitsets of inode metadata that have been checked and/or are sick.
-- 
2.31.1