Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 25/26] xfs: rework unreferenced inode lookups
Date: Thu, 17 Oct 2019 12:24:38 +1100
Message-ID: <20191017012438.GK16973@dread.disaster.area> (raw)
In-Reply-To: <20191014130719.GE12380@bfoster>

On Mon, Oct 14, 2019 at 09:07:19AM -0400, Brian Foster wrote:
> On Wed, Oct 09, 2019 at 02:21:23PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Looking up an unreferenced inode in the inode cache is a bit hairy.
> > We do this for inode invalidation and writeback clustering purposes,
> > which is all invisible to the VFS. Hence we can't take reference
> > counts to the inode and so must be very careful how we do it.
> > 
> > There are several different places that all do the lookups and
> > checks slightly differently. Fundamentally, though, they are all
> > racy and inode reclaim has to block waiting for the inode lock if it
> > loses the race. This is not very optimal given all the work we;ve
> > already done to make reclaim non-blocking.
> > 
> > We can make the reclaim process nonblocking with a couple of simple
> > changes. If we define the unreferenced lookup process in a way that
> > will either always grab an inode in a way that reclaim will notice
> > and skip, or will notice a reclaim has grabbed the inode so it can
> > skip the inode, then there is no need for reclaim to need to cycle
> > the inode ILOCK at all.
> > 
> > Selecting an inode for reclaim is already non-blocking, so if the
> > ILOCK is held the inode will be skipped. If we ensure that reclaim
> > holds the ILOCK until the inode is freed, then we can do the same
> > thing in the unreferenced lookup to avoid inodes in reclaim. We can
> > do this simply by holding the ILOCK until the RCU grace period
> > expires and the inode freeing callback is run. As all unreferenced
> > lookups have to hold the rcu_read_lock(), we are guaranteed that
> > a reclaimed inode will be noticed as the trylock will fail.
> > 
> > 
> > Additional research notes on final reclaim locking before free
> > --------------------------------------------------------------
> > 
> > 2016: 1f2dcfe89eda ("xfs: xfs_inode_free() isn't RCU safe")
> > 
> > Fixes situation where the inode is found during RCU lookup within
> > the freeing grace period, but critical structures have already been
> > freed. lookup code that has this problem is stuff like
> > xfs_iflush_cluster.
> > 
> > 
> > 2008: 455486b9ccdd ("[XFS] avoid all reclaimable inodes in xfs_sync_inodes_ag")
> > 
> > Prior to this commit, the flushing of inodes required serialisation
> > with xfs_ireclaim(), which did this lock/unlock thingy to ensure
> > that it waited for flushing in xfs_sync_inodes_ag() to complete
> > before freeing the inode:
> > 
> >                 /*
> > -                * If we can't get a reference on the VFS_I, the inode must be
> > -                * in reclaim. If we can get the inode lock without blocking,
> > -                * it is safe to flush the inode because we hold the tree lock
> > -                * and xfs_iextract will block right now. Hence if we lock the
> > -                * inode while holding the tree lock, xfs_ireclaim() is
> > -                * guaranteed to block on the inode lock we now hold and hence
> > -                * it is safe to reference the inode until we drop the inode
> > -                * locks completely.
> > +                * If we can't get a reference on the inode, it must be
> > +                * in reclaim. Leave it for the reclaim code to flush.
> >                  */
> > 
> > This case is completely gone from the modern code.
> > 
> > lock/unlock exists at start of git era. Switching to archive tree.
> > 
> > This xfs_sync() functionality goes back to 1994 when inode
> > writeback was first introduced by:
> > 
> > 47ac6d60 ("Add support to xfs_ireclaim() needed for xfs_sync().")
> > 
> > So it has been there forever -  lets see if we can get rid of it.
> > State of existing codeL
> > 
> > - xfs_iflush_cluster() does not check for XFS_IRECLAIM inode flag
> >   while holding rcu_read_lock()/i_flags_lock, so doesn't avoid
> >   reclaimable or inodes that are in the process of being reclaimed.
> >   Inodes at this point of reclaim are clean, so if xfs_iflush_cluster
> >   wins the race to the ILOCK, then inode reclaim has to wait
> >   for the lock to be dropped by xfs_iflush_cluster() once it detects
> >   the inode is clean.
> > 
> 
> Ok, so the iflush/ifree clustering functionality doesn't account for
> inodes under reclaim, thus has the potential to contend with reclaim in
> progress via ilock. The new isolate function trylocks the ilock and
> iflock to check dirty state and whatnot before it sets XFS_IRECLAIM and
> continues scanning, so we aren't blocking through that code. Both of
> those locks are held until the dispose, where ->i_ino is zeroed and
> ilock released.

Not quite. The XFS_IRECLAIM flag indicates the inode has been
isolated but may not yet have been disposed. There can be a
substantial delay between isolation and disposal, and the ip->i_ino
is not cleared until disposal is run. IOWs, it handles this case:

reclaim:				iflush/ifree

isolate
  spin_trylock(ip->i_flags_lock)
  xfs_ilock_nowait(ip, ILOCK_EXCL)
  xfs_iflock(ip)
  ip->i_flags |= XFS_IRECLAIM
  spin_unlock(ip->i_flags_lock);
<loops isolating more inodes>
					rcu_read_lock()
					ip = radix_tree_lookup()
					spin_lock(ip->i_flags_lock)
					ip->i_ino still set
					if XFS_IRECLAIM set
					  <skip inode>

So when the inode has been isolated, we see the XFS_IRECLAIM just
fine because of the i_flags_lock.

The reason why the ILOCK is taken under the i_flags_lock in
iflush/ifree is that we can have this happen if we drop the spin
lock first:

					ip = radix_tree_lookup()
					spin_lock(ip->i_flags_lock)
					ip->i_ino still set
					if XFS_IRECLAIM set
					  skip inode
					spin_unlock(ip->i_flags_lock)
					rcu_read_unlock()
					<preempted>
isolate
  spin_trylock(ip->i_flags_lock)
  xfs_ilock_nowait(ip, ILOCK_EXCL)
  xfs_iflock(ip)
  ip->i_flags |= XFS_IRECLAIM
  spin_unlock(ip->i_flags_lock);
dispose inode
  rcu_free
<...>
rcu_callbacks
  xfs_iunlock(ip, ILOCK_EXCL)
  kmem_cache_free(ip)
					<scheduled again>
					xfs_ilock_nowait(ip, ILOCK_EXCL)
					accesses freed inode

IOWs, it's the combination of using the same locking heirarchy in
the isolation routine and the iflush/ifree that provide the
serialisation. We have to serialise the taking of the ILOCK under
the i_flags_lock, because it's the i_flags_lock that actually
provides the RCU lookup object validity serialisation. Hence we have
to ensure that the inode cannot be reclaimed via RCU callbacks while
under the rcu_read_lock context. That means we have to:

a) hold off RCU freeing of inodes (rcu_read_lock)
b) hold the object spinlock to ensure the object is not yet 
queued for RCU freeing (!ip->i_ino checks)
c) Hold the object spin lock to ensure the object has not been
locked for reclaim and is about to be disposed (XFS_IRECLAIM checks)
d) Hold the object spinlock while we grab the lock(s) that will hold
off reclaim once we drop the object spin lock until we are finished
with the object (ILOCK -> iflock)

So XFS_IRECLAIM plays a part in this dance, but it's only one step
in the process...

> I'd think at this point a racing iflush/ifree would see the ->i_ino
> update. If I'm following this correctly, blocking in reclaim would
> require a race where iflush gets ->i_flags_lock and sees a valid
> ->i_ino, a reclaim in progress is waiting on ->i_flags_lock to reset
> ->i_ino, iflush releases ->i_flags_lock in time for reclaim to acquire
> it, reset ->i_ino and then release ilock before the iflush ilock_nowait
> fails (since reclaim still has it) or reclaim itself reacquires it. At
> that point, reclaim blocks on ilock and ends up waiting on iflush to
> identify that ->i_ino is zero and drop the lock. Am I following that
> correctly?
> 
> If so, then to avoid that race condition (this sounds more like a lock
> contention inefficiency than a blocking problem),

It's not a contention issue - there's real bugs if we don't order
the locking correctly here.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


  reply index

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-09  3:20 [PATCH V2 00/26] mm, xfs: non-blocking inode reclaim Dave Chinner
2019-10-09  3:20 ` [PATCH 01/26] xfs: Lower CIL flush limit for large logs Dave Chinner
2019-10-11 12:39   ` Brian Foster
2019-10-30 17:08   ` Darrick J. Wong
2019-10-09  3:21 ` [PATCH 02/26] xfs: Throttle commits on delayed background CIL push Dave Chinner
2019-10-11 12:38   ` Brian Foster
2019-10-09  3:21 ` [PATCH 03/26] xfs: don't allow log IO to be throttled Dave Chinner
2019-10-11  9:35   ` Christoph Hellwig
2019-10-11 12:39   ` Brian Foster
2019-10-30 17:14   ` Darrick J. Wong
2019-10-09  3:21 ` [PATCH 04/26] xfs: Improve metadata buffer reclaim accountability Dave Chinner
2019-10-11 12:39   ` Brian Foster
2019-10-11 12:57     ` Christoph Hellwig
2019-10-11 23:14       ` Dave Chinner
2019-10-11 23:13     ` Dave Chinner
2019-10-12 12:05       ` Brian Foster
2019-10-13  3:14         ` Dave Chinner
2019-10-14 13:05           ` Brian Foster
2019-10-30 17:25   ` Darrick J. Wong
2019-10-30 21:43     ` Dave Chinner
2019-10-31  3:06       ` Darrick J. Wong
2019-10-31 20:50         ` Dave Chinner
2019-10-31 21:05           ` Darrick J. Wong
2019-10-31 21:22             ` Christoph Hellwig
2019-11-03 21:26             ` Dave Chinner
2019-11-04 23:08               ` Darrick J. Wong
2019-10-09  3:21 ` [PATCH 05/26] xfs: correctly acount for reclaimable slabs Dave Chinner
2019-10-11 12:39   ` Brian Foster
2019-10-30 17:16   ` Darrick J. Wong
2019-10-09  3:21 ` [PATCH 06/26] xfs: synchronous AIL pushing Dave Chinner
2019-10-11  9:42   ` Christoph Hellwig
2019-10-11 12:40   ` Brian Foster
2019-10-11 23:15     ` Dave Chinner
2019-10-09  3:21 ` [PATCH 07/26] xfs: tail updates only need to occur when LSN changes Dave Chinner
2019-10-11  9:50   ` Christoph Hellwig
2019-10-11 12:40   ` Brian Foster
2019-10-09  3:21 ` [PATCH 08/26] mm: directed shrinker work deferral Dave Chinner
2019-10-14  8:46   ` Christoph Hellwig
2019-10-14 13:06     ` Brian Foster
2019-10-18  7:59     ` Dave Chinner
2019-10-09  3:21 ` [PATCH 09/26] shrinkers: use defer_work for GFP_NOFS sensitive shrinkers Dave Chinner
2019-10-09  3:21 ` [PATCH 10/26] mm: factor shrinker work calculations Dave Chinner
2019-10-09  3:21 ` [PATCH 11/26] shrinker: defer work only to kswapd Dave Chinner
2019-10-09  3:21 ` [PATCH 12/26] shrinker: clean up variable types and tracepoints Dave Chinner
2019-10-09  3:21 ` [PATCH 13/26] mm: reclaim_state records pages reclaimed, not slabs Dave Chinner
2019-10-09  3:21 ` [PATCH 14/26] mm: back off direct reclaim on excessive shrinker deferral Dave Chinner
2019-10-11 16:21   ` Matthew Wilcox
2019-10-11 23:20     ` Dave Chinner
2019-10-09  3:21 ` [PATCH 15/26] mm: kswapd backoff for shrinkers Dave Chinner
2019-10-09  3:21 ` [PATCH 16/26] xfs: synchronous AIL pushing Dave Chinner
2019-10-11 10:18   ` Christoph Hellwig
2019-10-11 15:29     ` Brian Foster
2019-10-11 23:27       ` Dave Chinner
2019-10-12 12:08         ` Brian Foster
2019-10-09  3:21 ` [PATCH 17/26] xfs: don't block kswapd in inode reclaim Dave Chinner
2019-10-11 15:29   ` Brian Foster
2019-10-09  3:21 ` [PATCH 18/26] xfs: reduce kswapd blocking on inode locking Dave Chinner
2019-10-11 10:29   ` Christoph Hellwig
2019-10-09  3:21 ` [PATCH 19/26] xfs: kill background reclaim work Dave Chinner
2019-10-11 10:31   ` Christoph Hellwig
2019-10-09  3:21 ` [PATCH 20/26] xfs: use AIL pushing for inode reclaim IO Dave Chinner
2019-10-11 17:38   ` Brian Foster
2019-10-09  3:21 ` [PATCH 21/26] xfs: remove mode from xfs_reclaim_inodes() Dave Chinner
2019-10-11 10:39   ` Christoph Hellwig
2019-10-14 13:07   ` Brian Foster
2019-10-09  3:21 ` [PATCH 22/26] xfs: track reclaimable inodes using a LRU list Dave Chinner
2019-10-11 10:42   ` Christoph Hellwig
2019-10-14 13:07   ` Brian Foster
2019-10-09  3:21 ` [PATCH 23/26] xfs: reclaim inodes from the LRU Dave Chinner
2019-10-11 10:56   ` Christoph Hellwig
2019-10-30 23:25     ` Dave Chinner
2019-10-09  3:21 ` [PATCH 24/26] xfs: remove unusued old inode reclaim code Dave Chinner
2019-10-09  3:21 ` [PATCH 25/26] xfs: rework unreferenced inode lookups Dave Chinner
2019-10-11 12:55   ` Christoph Hellwig
2019-10-11 13:39     ` Peter Zijlstra
2019-10-11 23:38     ` Dave Chinner
2019-10-14 13:07   ` Brian Foster
2019-10-17  1:24     ` Dave Chinner [this message]
2019-10-17  7:57       ` Brian Foster
2019-10-18 20:29         ` Dave Chinner
2019-10-09  3:21 ` [PATCH 26/26] xfs: use xfs_ail_push_all_sync in xfs_reclaim_inodes Dave Chinner
2019-10-11  9:55   ` Christoph Hellwig
2019-10-09  7:06 ` [PATCH V2 00/26] mm, xfs: non-blocking inode reclaim Christoph Hellwig
2019-10-11 19:03 ` Josef Bacik
2019-10-11 23:48   ` Dave Chinner
2019-10-12  0:19     ` Josef Bacik
2019-10-12  0:48       ` Dave Chinner

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191017012438.GK16973@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=bfoster@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git