From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 28/28] xfs: rework unreferenced inode lookups
Date: Fri, 15 Nov 2019 12:26:00 -0500	[thread overview]
Message-ID: <20191115172600.GC55854@bfoster> (raw)
In-Reply-To: <20191114221602.GJ4614@dread.disaster.area>

On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote:
> On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote:
> > On Fri, Nov 01, 2019 at 10:46:18AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Looking up an unreferenced inode in the inode cache is a bit hairy.
> > > We do this for inode invalidation and writeback clustering purposes,
> > > which is all invisible to the VFS. Hence we can't take reference
> > > counts to the inode and so must be very careful how we do it.
> > > 
> > > There are several different places that all do the lookups and
> > > checks slightly differently. Fundamentally, though, they are all
> > > racy and inode reclaim has to block waiting for the inode lock if it
> > > loses the race. This is not very optimal given all the work we've
> > > already done to make reclaim non-blocking.
> > > 
> > > We can make the reclaim process nonblocking with a couple of simple
> > > changes. If we define the unreferenced lookup process in a way that
> > > will either always grab an inode in a way that reclaim will notice
> > > and skip, or will notice a reclaim has grabbed the inode so it can
> > > skip the inode, then there is no need for reclaim to cycle
> > > the inode ILOCK at all.
> > > 
> > > Selecting an inode for reclaim is already non-blocking, so if the
> > > ILOCK is held the inode will be skipped. If we ensure that reclaim
> > > holds the ILOCK until the inode is freed, then we can do the same
> > > thing in the unreferenced lookup to avoid inodes in reclaim. We can
> > > do this simply by holding the ILOCK until the RCU grace period
> > > expires and the inode freeing callback is run. As all unreferenced
> > > lookups have to hold the rcu_read_lock(), we are guaranteed that
> > > a reclaimed inode will be noticed as the trylock will fail.
> > > 
> > ...
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/mrlock.h     |  27 +++++++++
> > >  fs/xfs/xfs_icache.c |  88 +++++++++++++++++++++--------
> > >  fs/xfs/xfs_inode.c  | 131 +++++++++++++++++++++-----------------------
> > >  3 files changed, 153 insertions(+), 93 deletions(-)
> > > 
> > > diff --git a/fs/xfs/mrlock.h b/fs/xfs/mrlock.h
> > > index 79155eec341b..1752a2592bcc 100644
> > > --- a/fs/xfs/mrlock.h
> > > +++ b/fs/xfs/mrlock.h
> > ...
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 11bf4768d491..45ee3b5cd873 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -106,6 +106,7 @@ xfs_inode_free_callback(
> > >  		ip->i_itemp = NULL;
> > >  	}
> > >  
> > > +	mrunlock_excl_non_owner(&ip->i_lock);
> > >  	kmem_zone_free(xfs_inode_zone, ip);
> > >  }
> > >  
> > > @@ -132,6 +133,7 @@ xfs_inode_free(
> > >  	 * free state. The ip->i_flags_lock provides the barrier against lookup
> > >  	 * races.
> > >  	 */
> > > +	mrupdate_non_owner(&ip->i_lock);
> > 
> > Can we tie these into the proper locking interface using flags? For
> > example, something like xfs_ilock(ip, XFS_ILOCK_EXCL|XFS_ILOCK_NONOWNER)
> > or xfs_ilock(ip, XFS_ILOCK_EXCL_NONOWNER) perhaps?
> 
> I'd prefer not to make this part of the common locking interface -
> it's a one-off special use case, not something we want to propagate
> elsewhere into the code.
> 

What irks me about this is that it obfuscates rather than highlights
that fact because I have no idea what mrtryupdate_non_owner() actually
does without looking it up. We could easily name a flag
XFS_ILOCK_PENDING_RECLAIM or something similarly ridiculous to make it
blindingly obvious it should only be used in a special context.
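
Just to make the suggestion concrete, the ILOCK leg of xfs_ilock() could
grow something like the following (sketch only -- the flag name and bit
value are invented here, and the rest of the function is elided):

#define XFS_ILOCK_PENDING_RECLAIM	(1u << 31)	/* hypothetical bit */

	/* ILOCK leg of xfs_ilock(), sketch */
	if (lock_flags & XFS_ILOCK_EXCL) {
		if (lock_flags & XFS_ILOCK_PENDING_RECLAIM)
			/* reclaim-only: the unlock happens in the RCU callback */
			mrupdate_non_owner(&ip->i_lock);
		else
			mrupdate_nested(&ip->i_lock,
					XFS_ILOCK_DEP(lock_flags));
	} else if (lock_flags & XFS_ILOCK_SHARED) {
		mraccess_nested(&ip->i_lock, XFS_ILOCK_DEP(lock_flags));
	}

The xfs_inode_free() call site would then read
xfs_ilock(ip, XFS_ILOCK_EXCL | XFS_ILOCK_PENDING_RECLAIM), which at
least tells the reader that something unusual is going on.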

> Now that I think over it, I probably should have tagged this patch
> with [RFC]. I think we should just get rid of the mrlock
> wrappers rather than add more, and that would simplify this a lot.
> 

Yeah, FWIW I've been reviewing this patch as a WIP on top of all of the
non-blocking bits, as opposed to being a fundamental part of that work.
That aside, I also agree that cleaning up these wrappers might address
that concern because something like:

	/* open code ilock because ... */
	down_write_trylock_non_owner(&ip->i_lock);

... is at least readable code.
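
For instance, with the mrlock wrappers gone the two hunks above could
end up reading something like this (rough sketch only, assuming
ip->i_lock becomes a bare rw_semaphore and using the
down/up_write_non_owner() primitives from patch 27):

	/* xfs_inode_free(): pin the ILOCK across the RCU grace period */
	down_write_non_owner(&ip->i_lock);
	/* ... i_flags_lock / XFS_IRECLAIM bits as in the hunk above ... */

	/* xfs_inode_free_callback(): matching unlock just before the free */
	up_write_non_owner(&ip->i_lock);
	kmem_zone_free(xfs_inode_zone, ip);

At that point the "non owner" oddity is visible right at the call site
rather than buried behind another mr*() wrapper.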

> 
> > >  	spin_lock(&ip->i_flags_lock);
> > >  	ip->i_flags = XFS_IRECLAIM;
> > >  	ip->i_ino = 0;
> > > @@ -295,11 +297,24 @@ xfs_iget_cache_hit(
> > >  		}
> > >  
> > >  		/*
> > > -		 * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode
> > > -		 * from stomping over us while we recycle the inode. Remove it
> > > -		 * from the LRU straight away so we can re-init the VFS inode.
> > > +		 * Before we reinitialise the inode, we need to make sure
> > > +		 * reclaim does not pull it out from underneath us. We already
> > > +		 * hold the i_flags_lock, and because the XFS_IRECLAIM is not
> > > +		 * set we know the inode is still on the LRU. However, the LRU
> > > +		 * code may have just selected this inode to reclaim, so we need
> > > +		 * to ensure we hold the i_flags_lock long enough for the
> > > +		 * trylock in xfs_inode_reclaim_isolate() to fail. We do this by
> > > +		 * removing the inode from the LRU, which will spin on the LRU
> > > +		 * list locks until reclaim stops walking, at which point we
> > > +		 * know there is no possible race between reclaim isolation and
> > > +		 * this lookup.
> > > +		 *
> > 
> > Somewhat related to my question about the lru_lock on the earlier patch.
> 
> *nod*
> 
> The caveat here is that this is the slow path so spinning for a
> while doesn't really matter.
> 
> > > @@ -1022,19 +1076,7 @@ xfs_dispose_inode(
> > >  	spin_unlock(&pag->pag_ici_lock);
> > >  	xfs_perag_put(pag);
> > >  
> > > -	/*
> > > -	 * Here we do an (almost) spurious inode lock in order to coordinate
> > > -	 * with inode cache radix tree lookups.  This is because the lookup
> > > -	 * can reference the inodes in the cache without taking references.
> > > -	 *
> > > -	 * We make that OK here by ensuring that we wait until the inode is
> > > -	 * unlocked after the lookup before we go ahead and free it.
> > > -	 *
> > > -	 * XXX: need to check this is still true. Not sure it is.
> > > -	 */
> > > -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > >  	xfs_qm_dqdetach(ip);
> > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > 
> > Ok, so I'm staring at this a bit more and think I'm missing something.
> > If we put aside the change to hold ilock until the inode is freed, we
> > basically have the following (simplified) flow as the inode goes from
> > isolation to disposal:
> > 
> > 	ilock	(isolate)
> > 	iflock
> > 	set XFS_IRECLAIM
> > 	ifunlock (disposal)
> > 	iunlock
> > 	radix delete
> > 	ilock cycle (drain)
> > 	rcu free
> > 
> > What we're trying to eliminate is the ilock cycle to drain any
> > concurrent unreferenced lookups from accessing the inode once it is
> > freed. The free itself is still RCU protected.
> > 
> > Looking over at the ifree path, we now have something like this:
> > 
> > 	rcu_read_lock()
> > 	radix lookup
> > 	check XFS_IRECLAIM
> > 	ilock
> > 	if XFS_ISTALE, skip
> > 	set XFS_ISTALE
> > 	rcu_read_unlock()
> > 	iflock
> > 	/* return locked down inode */
> 
> You missed a lock.
> 
> 	rcu_read_lock()
> 	radix lookup
> >>>	i_flags_lock
> 	check XFS_IRECLAIM
> 	ilock
> 	if XFS_ISTALE, skip
> 	set XFS_ISTALE
> >>>	i_flags_unlock
> 	rcu_read_unlock()
> 	iflock
> 
> > Given that we set XFS_IRECLAIM under ilock, would we still need either
> > the ilock cycle or to hold ilock through the RCU free if the ifree side
> > (re)checked XFS_IRECLAIM after it has the ilock (but before it drops the
> > rcu read lock)?
> 
> We set XFS_IRECLAIM under the i_flags_lock.
> 
> It is the combination of rcu_read_lock() and i_flags_lock() that
> provides the RCU lookup state barriers - the ILOCK is not part of
> that at all.
> 
> The key point here is that once we've validated the inode we found
> in the radix tree under the i_flags_lock, we then take the ILOCK,
> thereby serialising the taking of the ILOCK here with the taking of
> the ILOCK in the reclaim isolation code.
> 
> i.e. all the reclaim state serialisation is actually based around
> holding the i_flags_lock, not the ILOCK. 
> 
> Once we have grabbed the ILOCK under the i_flags_lock, we can
> drop the i_flags_lock knowing that reclaim will not be able to isolate
> this inode and set XFS_IRECLAIM.
> 

Hmm, Ok. I knew i_flags_lock was in there when I wrote this up. I
intentionally left it out as a simplification because it wasn't clear to
me that it was a critical part of the lookup. I'll keep this in mind the
next time I walk through this code.
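
IOW, writing the ifree lookup out with the i_flags_lock included, what I
should have had above is roughly the following (simplified sketch as I
read this patch; skip/error paths collapsed):

	rcu_read_lock();
	spin_lock(&ip->i_flags_lock);
	if (!ip->i_ino || __xfs_iflags_test(ip, XFS_IRECLAIM))
		goto out_skip;		/* freed or owned by reclaim */
	/*
	 * Reclaim isolation trylocks both the i_flags_lock and the ILOCK,
	 * and reclaim now holds the ILOCK until the RCU free callback
	 * runs, so winning this trylock means the inode cannot be
	 * reclaimed out from under us.
	 */
	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
		goto out_skip;
	/* ... XFS_ISTALE check/set, still under the i_flags_lock ... */
	spin_unlock(&ip->i_flags_lock);
	rcu_read_unlock();
	xfs_iflock(ip);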

> > ISTM we should either have a non-reclaim inode with
> > ilock protection or a reclaim inode with RCU protection (so we can skip
> > it before it frees), but I could easily be missing something here..
> 
> Heh. Yeah, it's a complex dance, and it's all based around how
> RCU lookups and the i_flags_lock interact to provide coherent
> detection of freed inodes.
> 
> I have a nagging feeling that this whole ILOCK-held-to-rcu-free game
> can be avoided. I need to walk myself through the lookup state
> machine again and determine if ordering the XFS_IRECLAIM flag check
> after grabbing the ILOCK is sufficient to prevent ifree/iflush
> lookups from accessing the inode outside the rcu_read_lock()
> context.
> 

That's pretty much what I was wondering...
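
To spell out the ordering I was asking about (purely illustrative --
whether this is actually sufficient is exactly the thing that needs to
be proven):

	/* unreferenced lookup, still inside rcu_read_lock() */
	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
		goto out_skip;
	/*
	 * Hypothetical recheck: if reclaim has isolated the inode since
	 * our initial flags check, XFS_IRECLAIM should be visible by now
	 * and we back off while still inside the RCU read section,
	 * before the inode can be freed.
	 */
	if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		goto out_skip;
	}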

> If so, most of this patch will go away....
> 
> > > +	 * attached to the buffer so we don't need to do anything more here.
> > >  	 */
> > > -	if (ip != free_ip) {
> > > -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> > > -			rcu_read_unlock();
> > > -			delay(1);
> > > -			goto retry;
> > > -		}
> > > -
> > > -		/*
> > > -		 * Check the inode number again in case we're racing with
> > > -		 * freeing in xfs_reclaim_inode().  See the comments in that
> > > -		 * function for more information as to why the initial check is
> > > -		 * not sufficient.
> > > -		 */
> > > -		if (ip->i_ino != inum) {
> > > +	if (__xfs_iflags_test(ip, XFS_ISTALE)) {
> > 
> > Is there a correctness reason for why we move the stale check to under
> > ilock (in both iflush/ifree)?
> 
> It's under the i_flags_lock, and so I moved it up under the lookup
> hold of the i_flags_lock so we don't need to cycle it again.
> 

Yeah, but in both cases it looks like the check moved under the ilock as
well, which comes after the i_flags_lock. IOW, why grab the ilock for
stale inodes when we're just going to skip them?
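
A sketch of what I was expecting instead (ifree side, trimmed down):

	spin_lock(&ip->i_flags_lock);
	/* ... i_ino / XFS_IRECLAIM checks as before ... */
	if (__xfs_iflags_test(ip, XFS_ISTALE))
		goto out_skip;		/* already stale, no need for the ILOCK */
	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
		goto out_skip;
	__xfs_iflags_set(ip, XFS_ISTALE);
	spin_unlock(&ip->i_flags_lock);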

Brian

> > >  	/*
> > > -	 * We don't need to attach clean inodes or those only with unlogged
> > > -	 * changes (which we throw away, anyway).
> > > +	 * We don't need to attach clean inodes to the buffer - they are marked
> > > +	 * stale in memory now and will need to be re-initialised by inode
> > > +	 * allocation before they can be reused.
> > >  	 */
> > >  	if (!ip->i_itemp || xfs_inode_clean(ip)) {
> > >  		ASSERT(ip != free_ip);
> > >  		xfs_ifunlock(ip);
> > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +		if (ip != free_ip)
> > > +			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > 
> > There's an assert against this case just above, though I suppose there's
> > nothing wrong with just keeping it and making the functional code more
> > cautious.
> 
> *nod*
> 
> It follows Darrick's lead of making sure that production kernels
> don't do something stupid because of some whacky corruption we
> didn't expect to ever see.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


Thread overview: 72+ messages
2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
2019-10-31 23:45 ` [PATCH 01/28] xfs: Lower CIL flush limit for large logs Dave Chinner
2019-10-31 23:45 ` [PATCH 02/28] xfs: Throttle commits on delayed background CIL push Dave Chinner
2019-11-01 12:04   ` Brian Foster
2019-11-01 21:40     ` Dave Chinner
2019-11-04 22:48       ` Darrick J. Wong
2019-10-31 23:45 ` [PATCH 03/28] xfs: don't allow log IO to be throttled Dave Chinner
2019-10-31 23:45 ` [PATCH 04/28] xfs: Improve metadata buffer reclaim accountability Dave Chinner
2019-11-01 12:05   ` Brian Foster
2019-11-04 23:21   ` Darrick J. Wong
2019-10-31 23:45 ` [PATCH 05/28] xfs: correctly acount for reclaimable slabs Dave Chinner
2019-10-31 23:45 ` [PATCH 06/28] xfs: factor common AIL item deletion code Dave Chinner
2019-11-04 23:16   ` Darrick J. Wong
2019-10-31 23:45 ` [PATCH 07/28] xfs: tail updates only need to occur when LSN changes Dave Chinner
2019-11-04 23:18   ` Darrick J. Wong
2019-10-31 23:45 ` [PATCH 08/28] xfs: factor inode lookup from xfs_ifree_cluster Dave Chinner
2019-11-01 12:05   ` Brian Foster
2019-11-04 23:20   ` Darrick J. Wong
2019-10-31 23:45 ` [PATCH 09/28] mm: directed shrinker work deferral Dave Chinner
2019-11-04 15:25   ` Brian Foster
2019-11-14 20:49     ` Dave Chinner
2019-11-15 17:21       ` Brian Foster
2019-11-18  0:49         ` Dave Chinner
2019-11-19 15:12           ` Brian Foster
2019-10-31 23:46 ` [PATCH 10/28] shrinkers: use defer_work for GFP_NOFS sensitive shrinkers Dave Chinner
2019-10-31 23:46 ` [PATCH 11/28] mm: factor shrinker work calculations Dave Chinner
2019-11-02 10:55   ` kbuild test robot
2019-11-04 15:29   ` Brian Foster
2019-11-14 20:59     ` Dave Chinner
2019-10-31 23:46 ` [PATCH 12/28] shrinker: defer work only to kswapd Dave Chinner
2019-11-04 15:29   ` Brian Foster
2019-11-14 21:11     ` Dave Chinner
2019-11-15 17:23       ` Brian Foster
2019-10-31 23:46 ` [PATCH 13/28] shrinker: clean up variable types and tracepoints Dave Chinner
2019-11-04 15:30   ` Brian Foster
2019-10-31 23:46 ` [PATCH 14/28] mm: reclaim_state records pages reclaimed, not slabs Dave Chinner
2019-11-04 19:58   ` Brian Foster
2019-10-31 23:46 ` [PATCH 15/28] mm: back off direct reclaim on excessive shrinker deferral Dave Chinner
2019-11-04 19:58   ` Brian Foster
2019-11-14 21:28     ` Dave Chinner
2019-10-31 23:46 ` [PATCH 16/28] mm: kswapd backoff for shrinkers Dave Chinner
2019-11-04 19:58   ` Brian Foster
2019-11-14 21:41     ` Dave Chinner
2019-10-31 23:46 ` [PATCH 17/28] xfs: synchronous AIL pushing Dave Chinner
2019-11-05 17:05   ` Brian Foster
2019-10-31 23:46 ` [PATCH 18/28] xfs: don't block kswapd in inode reclaim Dave Chinner
2019-10-31 23:46 ` [PATCH 19/28] xfs: reduce kswapd blocking on inode locking Dave Chinner
2019-11-05 17:05   ` Brian Foster
2019-10-31 23:46 ` [PATCH 20/28] xfs: kill background reclaim work Dave Chinner
2019-11-05 17:05   ` Brian Foster
2019-10-31 23:46 ` [PATCH 21/28] xfs: use AIL pushing for inode reclaim IO Dave Chinner
2019-11-05 17:06   ` Brian Foster
2019-10-31 23:46 ` [PATCH 22/28] xfs: remove mode from xfs_reclaim_inodes() Dave Chinner
2019-10-31 23:46 ` [PATCH 23/28] xfs: track reclaimable inodes using a LRU list Dave Chinner
2019-10-31 23:46 ` [PATCH 24/28] xfs: reclaim inodes from the LRU Dave Chinner
2019-11-06 17:21   ` Brian Foster
2019-11-14 21:51     ` Dave Chinner
2019-10-31 23:46 ` [PATCH 25/28] xfs: remove unusued old inode reclaim code Dave Chinner
2019-11-06 17:21   ` Brian Foster
2019-10-31 23:46 ` [PATCH 26/28] xfs: use xfs_ail_push_all in xfs_reclaim_inodes Dave Chinner
2019-11-06 17:22   ` Brian Foster
2019-11-14 21:53     ` Dave Chinner
2019-10-31 23:46 ` [PATCH 27/28] rwsem: introduce down/up_write_non_owner Dave Chinner
2019-10-31 23:46 ` [PATCH 28/28] xfs: rework unreferenced inode lookups Dave Chinner
2019-11-06 22:18   ` Brian Foster
2019-11-14 22:16     ` Dave Chinner
2019-11-15 13:13       ` Christoph Hellwig
2019-11-15 17:26       ` Brian Foster [this message]
2019-11-18  1:00         ` Dave Chinner
2019-11-19 15:13           ` Brian Foster
2019-11-19 21:18             ` Dave Chinner
2019-11-20 12:42               ` Brian Foster
