From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 23/24] xfs: reclaim inodes from the LRU
Date: Fri, 9 Aug 2019 11:20:22 +1000
Message-ID: <20190809012022.GX7777@dread.disaster.area>
In-Reply-To: <20190808163905.GC24551@bfoster>

On Thu, Aug 08, 2019 at 12:39:05PM -0400, Brian Foster wrote:
> On Thu, Aug 01, 2019 at 12:17:51PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Replace the AG radix tree walking reclaim code with a list_lru
> > walker, giving us both node-aware and memcg-aware inode reclaim
> > at the XFS level. This requires adding an inode isolation function to
> > determine if the inode can be reclaimed, and a list walker to
> > dispose of the inodes that were isolated.
> > 
> > We want the isolation function to be non-blocking. If we can't
> > grab an inode then we either skip it or rotate it. If it's clean
> > then we skip it, if it's dirty then we rotate to give it time to be
> 
> Do you mean we remove it if it's clean?

No, I mean if we can't grab it and it's clean, then we just skip it,
leaving it at the head of the LRU for the next scanner to
immediately try to reclaim it. If it's dirty, we rotate it so that
time passes before we try to reclaim it again in the hope that it is
already clean by the time we've scanned through the entire LRU...

> > +++ b/fs/xfs/xfs_super.c
> ...
> > @@ -1810,23 +1811,58 @@ xfs_fs_mount(
> >  }
> >  
> >  static long
> > -xfs_fs_nr_cached_objects(
> > +xfs_fs_free_cached_objects(
> >  	struct super_block	*sb,
> >  	struct shrink_control	*sc)
> >  {
> > -	/* Paranoia: catch incorrect calls during mount setup or teardown */
> > -	if (WARN_ON_ONCE(!sb->s_fs_info))
> > -		return 0;
> > +	struct xfs_mount	*mp = XFS_M(sb);
> > +        struct xfs_ireclaim_args ra;
> 
> ^ whitespace damage

Already fixed.

> > +	long freed;
> >  
> > -	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
> > +	INIT_LIST_HEAD(&ra.freeable);
> > +	ra.lowest_lsn = NULLCOMMITLSN;
> > +	ra.dirty_skipped = 0;
> > +
> > +	freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
> > +					xfs_inode_reclaim_isolate, &ra);
> 
> This is more related to the locking discussion on the earlier patch, but
> this looks like it has more similar serialization to the example patch I
> posted than the one without locking at all. IIUC, this walk has an
> internal lock per node lru that is held across the walk and passed into
> the callback. We never cycle it, so for any given node we only allow one
> reclaimer through here at a time.

That's not a guarantee that list_lru gives us. It could drop its
internal lock at any time during that walk and we would be
blissfully unaware that it has done this. And at that point, the
reclaim context is completely unaware that other reclaim contexts
may be scanning the same LRU at the same time and are interleaving
with it.

And, really, that does not matter one iota. If multiple scanners are
interleaving, the reclaim traversal order and the decisions made are
no different from what a single reclaimer does.  i.e. we just don't
have to care if reclaim contexts interleave or not, because they
will not unnecessarily repeat work that has already been done.
That's one of the reasons for moving to IO-less LRU ordered reclaim
- it removes all the gross hacks we've had to implement to guarantee
reclaim scanning progress in one nice neat package of generic
infrastructure.

> That seems to be Ok given we don't do much in the isolation handler, the
> lock isn't held across the dispose sequence and we're still batching in
> the shrinker core on top of that. We're still serialized over the lru
> fixups such that concurrent reclaimers aren't processing the same
> inodes, however.

The only thing that we may need here is need_resched() checks, if it
turns out that holding the lock while 1024 items are scanned hogs a
single CPU for too long. If we do that we'd cycle the LRU lock and
return LRU_RETRY or LRU_REMOVED_RETRY, hence giving reclaimers
finer-grained interleaving....
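
A rough userspace model of that idea, for illustration only. All
names here (toy_walk, toy_isolate(), toy_lru_walk()) are
hypothetical, the max_held counter is a stand-in for need_resched()
becoming true, and the sketch assumes the walker restarts from the
head of the list whenever the callback reports that it cycled the
lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_walk_status { TOY_WALK_REMOVED, TOY_WALK_RETRY };

struct toy_walk {
	unsigned int	held;		/* items scanned since the lock was last cycled */
	unsigned int	max_held;	/* stand-in for need_resched() turning true */
	unsigned int	lock_cycles;	/* how often the callback cycled the lock */
};

/* Isolate callback: cycles the (imaginary) LRU lock when held too long. */
enum toy_walk_status
toy_isolate(struct toy_walk *w, bool *removed)
{
	if (w->held >= w->max_held) {
		/* A real callback would unlock, reschedule, then relock here. */
		w->lock_cycles++;
		w->held = 0;
		return TOY_WALK_RETRY;
	}
	w->held++;
	*removed = true;
	return TOY_WALK_REMOVED;
}

/* Walker: every callback invocation consumes scan budget, retries included. */
size_t
toy_lru_walk(bool *removed, size_t nr_items, size_t nr_to_walk,
	     struct toy_walk *w)
{
	size_t	freed = 0;
	size_t	i;

restart:
	for (i = 0; i < nr_items; i++) {
		if (removed[i])
			continue;	/* already off the list */
		if (nr_to_walk == 0)
			break;
		nr_to_walk--;
		/* Lock was dropped: the traversal is stale, start over. */
		if (toy_isolate(w, &removed[i]) == TOY_WALK_RETRY)
			goto restart;
		freed++;
	}
	return freed;
}
```

Because each retry still consumes scan budget, the walk terminates
even if the callback asks to retry every time, and other reclaimers
get a window to take the lock at each cycle.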

> BTW I got a lockdep splat[1] for some reason on a straight mount/unmount
> cycle with this patch.
....
> [   39.030519]  lock_acquire+0x90/0x170
> [   39.031170]  ? xfs_ilock+0xd2/0x280 [xfs]
> [   39.031603]  down_write_nested+0x4f/0xb0
> [   39.032064]  ? xfs_ilock+0xd2/0x280 [xfs]
> [   39.032684]  ? xfs_dispose_inodes+0x124/0x320 [xfs]
> [   39.033575]  xfs_ilock+0xd2/0x280 [xfs]
> [   39.034058]  xfs_dispose_inodes+0x124/0x320 [xfs]

False positive, AFAICT. It's complaining about the final xfs_ilock()
call we do before freeing the inode because we have other inodes
locked. I don't think this can deadlock because the inodes under
reclaim should not be usable by anyone else at this point because
they have the XFS_IRECLAIM flag set.

I did notice this - I added an XXX comment to the case being
complained about to note that I need to resolve this locking issue.

+        * Here we do an (almost) spurious inode lock in order to coordinate
+        * with inode cache radix tree lookups.  This is because the lookup
+        * can reference the inodes in the cache without taking references.
+        *
+        * We make that OK here by ensuring that we wait until the inode is
+        * unlocked after the lookup before we go ahead and free it. 
+        *
+        * XXX: need to check this is still true. Not sure it is.
         */

I added that last line in this patch. In more detail....

The comment is suggesting that we need to take the ILOCK to
co-ordinate with RCU protected lookups in progress before we RCU
free the inode. That's what RCU is supposed to do, so I'm not at all
sure what this is actually serialising against any more.

i.e. any racing radix tree lookup from this point in time is going
to see the XFS_IRECLAIM flag and ip->i_ino == 0 while under the
rcu_read_lock, and they will go try again after dropping all lock
context and waiting for a bit. The inode may remain visible until
the next rcu grace period expires, but all lookups will abort long
before they get anywhere near the ILOCK. And once the RCU grace
period expires, lookups will be locked out by the rcu_read_lock(),
the radix tree moves to a state where the removal of the inode is
guaranteed visible to all CPUs, and then the object is freed.
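
As an illustration of why such a lookup backs off before it ever
reaches the ILOCK, here is a minimal userspace sketch of the
validation step. It is a toy, not the XFS lookup code:
struct toy_cache_inode, TOY_IRECLAIM and toy_lookup_validate() are
hypothetical names, and the locking and RCU read-side section that
would surround the check are left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_IRECLAIM	(1u << 0)	/* stand-in for the reclaim flag */

/* Hypothetical, heavily simplified cached inode. */
struct toy_cache_inode {
	uint64_t	i_ino;		/* zeroed once reclaim has claimed the inode */
	unsigned int	i_flags;
};

/*
 * Check an inode found by a lockless tree lookup.
 * Returns 0 if it is the inode we wanted and it is still live,
 * -EAGAIN if the caller must drop everything and retry later.
 */
int
toy_lookup_validate(const struct toy_cache_inode *ip, uint64_t wanted_ino)
{
	assert(wanted_ino != 0);

	/* Being freed or already reused: the inode number no longer matches. */
	if (ip->i_ino != wanted_ino)
		return -EAGAIN;

	/* Reclaim owns this inode, so do not touch it any further. */
	if (ip->i_flags & TOY_IRECLAIM)
		return -EAGAIN;

	return 0;
}
```

Either condition alone is enough to abort the lookup, which is why
the inode lock should never be reached for an inode under reclaim.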

So the ILOCK should have no part in lookup serialisation, and I need
to go look at the history of the code to determine where and why
this was added, and whether the condition it protects against is
still a valid concern or not....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com