linux-fsdevel.vger.kernel.org archive mirror
From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 23/24] xfs: reclaim inodes from the LRU
Date: Sun, 11 Aug 2019 08:46:04 -0400	[thread overview]
Message-ID: <20190811124604.GA27188@bfoster> (raw)
In-Reply-To: <20190811021747.GE7777@dread.disaster.area>

On Sun, Aug 11, 2019 at 12:17:47PM +1000, Dave Chinner wrote:
> On Fri, Aug 09, 2019 at 08:36:32AM -0400, Brian Foster wrote:
> > On Fri, Aug 09, 2019 at 11:20:22AM +1000, Dave Chinner wrote:
> > > On Thu, Aug 08, 2019 at 12:39:05PM -0400, Brian Foster wrote:
> > > > On Thu, Aug 01, 2019 at 12:17:51PM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
...
> > > > > +	long freed;
> > > > >  
> > > > > -	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
> > > > > +	INIT_LIST_HEAD(&ra.freeable);
> > > > > +	ra.lowest_lsn = NULLCOMMITLSN;
> > > > > +	ra.dirty_skipped = 0;
> > > > > +
> > > > > +	freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
> > > > > +					xfs_inode_reclaim_isolate, &ra);
> > > > 
> > > > This is more related to the locking discussion on the earlier patch, but
> > > > this looks like it has more similar serialization to the example patch I
> > > > posted than the one without locking at all. IIUC, this walk has an
> > > > internal lock per node lru that is held across the walk and passed into
> > > > the callback. We never cycle it, so for any given node we only allow one
> > > > reclaimer through here at a time.
> > > 
> > > > That's not a guarantee that list_lru gives us. It could drop its
> > > internal lock at any time during that walk and we would be
> > > blissfully unaware that it has done this. And at that point, the
> > > reclaim context is completely unaware that other reclaim contexts
> > > may be scanning the same LRU at the same time and are interleaving
> > > with it.
> > > 
> > 
> > What is not a guarantee? I'm not following your point here. I suppose it
> > technically could drop the lock, but then it would have to restart the
> > iteration and wouldn't exactly provide predictable batching capability
> > to users.
> 
> There is no guarantee that the list_lru_shrink_walk() provides a
> single list walker at a time or that it provides predictable
> batching capability to users.
> 

Hm, Ok. I suppose the code could change if that's your point. But how is
that relevant to how XFS reclaim behavior as of this patch compares to
the intermediate patch 20? Note again that this is the only reason I bring
it up in this patch (which seems mostly sane to me). Perhaps I should
have just replied again to patch 20 to avoid confusion. Apologies for
that, but that ship has sailed...

To reiterate, I suggested not dropping the lock in patch 20 for
$reasons. You replied generally that doing so might cause more problems
than it solves. I replied with a compromise patch that leaves the lock,
but only acquires it across inode grabbing instead of the entire reclaim
operation. This essentially implements "serialized isolation batching"
as we've termed it here (assuming the patch is correct).
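
To make the term concrete, here is a rough userspace sketch of what I
mean by serialized isolation batching. It is not the example patch and
not kernel code: struct lru, lru_isolate_batch() and lru_dispose() are
hypothetical names, and a pthread mutex stands in for whichever lock
serialises the grab. The lock is held only while a batch is pulled off
the list; the dispose work runs after it is released.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these names are kernel APIs. */
struct item {
	struct item *next;
	int id;
};

struct lru {
	pthread_mutex_t lock;	/* serialises isolation only */
	struct item *head;	/* reclaim end of the list */
};

/*
 * Grab up to @batch items under the lock and hand them back on a private
 * list. Only one caller isolates at a time, so no two callers are ever
 * handed the same item.
 */
static struct item *lru_isolate_batch(struct lru *lru, int batch, int *nr)
{
	struct item *freeable = NULL, **tail = &freeable;

	*nr = 0;
	pthread_mutex_lock(&lru->lock);
	while (lru->head && *nr < batch) {
		struct item *it = lru->head;

		lru->head = it->next;
		it->next = NULL;
		*tail = it;
		tail = &it->next;
		(*nr)++;
	}
	pthread_mutex_unlock(&lru->lock);
	return freeable;
}

/* Dispose runs without the lock, so other reclaimers can isolate meanwhile. */
static int lru_dispose(struct item *freeable)
{
	int freed = 0;

	while (freeable) {
		struct item *it = freeable;

		freeable = it->next;
		it->next = NULL;	/* stand-in for the actual reclaim work */
		freed++;
	}
	return freed;
}
```

A reclaimer would call lru_isolate_batch() and then lru_dispose() on the
returned list, so concurrent callers serialise only over the grab.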

I eventually made it to this patch and observe that you end up with an
implementation that does serialized isolation batching. Of course the
locking and implementation is quite different with this being a whole
new mechanism, so performance and things could be quite different too.
But my point is that if serialized isolation batching is acceptable
behavior for XFS inode reclaim right now, then it seems reasonable to me
that it be acceptable in patch 20 for what is ultimately just an
intermediate state.

I appreciate all of the discussion and background information that
follows, but it gets way far off from the original feedback. I've still
seen no direct response to the thoughts above either here or in patch
20. Are you planning to incorporate that example patch or something
similar? If so, then I think we're on the same page.

> > This internal lock protects the integrity of the list from external
> > adds/removes, etc., but it's also passed into the callback so of course
> > it can be cycled at any point. The callback just has to notify the
> > caller to restart the walk. E.g., from __list_lru_walk_one():
> > 
> >         /*
> >          * The lru lock has been dropped, our list traversal is
> >          * now invalid and so we have to restart from scratch.
> >          */
> 
> As the designer and author of the list_lru code, I do know how it
> works. I also know exactly what problem this behaviour was
> intended to solve, because I had to solve it to meet the
> requirements I had for the infrastructure.
> 
> The isolation walk lock batching currently done is an optimisation
> to minimise lru lock contention - it amortises the cost of getting
> the lock over a substantial batch of work. If we drop the lock on
> every item we try to isolate - my initial implementations did this -
> then the lru lock thrashes badly against concurrent inserts and
> deletes and scalability is not much better than the global lock it
> was replacing.
> 

Ok.

> IOWs, the behaviour we have now is a result of lock contention
> optimisation to meet scalability requirements, not because of some
> "predictable batching" requirement. If we were to rework the
> traversal mechanism such that the lru lock was not necessary to
> protect the state of the LRU list across the batch of isolate
> callbacks, then we'd get the scalability we need but we'd completely
> change the concurrency behaviour. The list would still do LRU
> reclaim, and the isolate functions still work exactly as they
> currently do (i.e. they work on just the item passed to them) but
> we'd have concurrent reclaim contexts isolating items on the same
> LRU concurrently rather than being serialised. And that's perfectly
> fine, because the isolate/dispose architecture just doesn't care
> how the items on the LRU are isolated for disposal.....
> 

Sure, that mostly makes sense. We're free to similarly rework the old
mechanism to not require locking just as well. My point is that the
patch to move I/O submission to the AIL doesn't do that sufficiently,
despite that being the initial purpose of the lock.

BTW, the commit[1] that introduces the pag reclaim lock back in 2010
introduces the cursor at the same time and actually does mention some of
the things I'm pointing out as potential problems with patch 20. It
refers to potential for "massive contention" and prevention of
reclaimers "continually scanning the same inodes in each AG." Again, I'm
not worried about that for this series because we ultimately rework the
shrinker past those issues, but it appears that it wasn't purely a
matter of I/O submission ordering that motivated the lock (though I'm
sure that was part of it).

Given that, I think another reasonable option for patch 20 is to remove
the cursor and locking together. That at least matches some form of
historical reclaim behavior in XFS. If we took that approach, it might
make sense to split patch 20 into one patch that shifts I/O to xfsaild
(new behavior) and a subsequent one that undoes [1].

[1] 69b491c214d7 ("xfs: serialise inode reclaim within an AG")

> What I'm trying to say is that the "isolation batching" we have is
> not desirable but it is necessary, and because that's internal to
> the list_lru implementation, we can change that behaviour however
> we want and it won't affect the subsystems that own the objects
> being reclaimed. They still just get handed a list of items to
> dispose, and they all come from the reclaim end of the LRU list...
> 
> Indeed, the new XFS inode shrinker is not dependent on any specific
> batching order, it's not dependent on isolation being serialised,
> and it's not dependent on the lru_lock being held across the
> isolation function. IOWs, it's set up just right to take advantage
> of any increases in isolation concurrency that the list_lru
> infrastructure could provide...
>

Yep, it's a nice abstraction. This has no bearing on the old
implementation which does have XFS specific locking, however.
 
> > > > That seems to be Ok given we don't do much in the isolation handler, the
> > > > lock isn't held across the dispose sequence and we're still batching in
> > > > the shrinker core on top of that. We're still serialized over the lru
> > > > fixups such that concurrent reclaimers aren't processing the same
> > > > inodes, however.
> > > 
> > > The only thing that we may need here is need_resched() checks if it
> > > turns out that holding a lock for 1024 items to be scanned proves to
> > > be too long to hold on to a single CPU. If we do that we'd cycle the
> > > LRU lock and return RETRY or RETRY_REMOVE, hence enabling reclaimers
> > > finer-grained interleaving....
> > > 
> > 
> > Sure, with the caveat that we restart the traversal..
> 
> Which only re-traverses the inodes we skipped because they were
> locked at the time. IOWs, skipping inodes is rare because if it is
> in reclaim then the only things that can be contending are a radix
> tree lookup in progress or an inode clustering operation
> (write/free) in progress. Either way, they will be relatively rare
> and very short term lock holds, so if we have to restart the scan
> after dropping the lru lock then it's likely we'll restart at next
> inode in line for reclaim, anyway....
> 
> Hence I don't think having to restart a traversal would really
> matter all that much....
> 

Perhaps, that seems fairly reasonable given that we rotate dirty inodes.
I'm not totally convinced we might not thrash on an LRU with a high
population of dirty inodes or that a restart is even the simplest
approach to dealing with large batch sizes, but I'd reserve judgement on
that until there's code to review.
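
For reference, this is roughly the shape of the restart behaviour I
have in mind, as a userspace sketch only. It is loosely modelled on the
walk loop quoted above, but every name here (walk(), isolate(), the
WALK_* values, the busy flag) is made up for illustration and the lock
itself is not modelled. Dirty items are rotated to the tail, and a
RETRY from the callback throws away the traversal state and starts
again from the head.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace analogue of an LRU walk; not kernel code. */
enum walk_status { WALK_REMOVED, WALK_ROTATE, WALK_SKIP, WALK_RETRY };

struct node {
	struct node *prev, *next;
	bool dirty;	/* cannot be freed yet, gets rotated */
	bool busy;	/* stand-in for "had to cycle the lru lock" */
};

static void node_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

static void node_add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Isolate callback: only ever looks at the one item it is handed. */
static enum walk_status isolate(struct node *n)
{
	if (n->busy) {
		n->busy = false;	/* pretend we dropped and retook the lock */
		return WALK_RETRY;
	}
	if (n->dirty)
		return WALK_ROTATE;
	node_del(n);
	return WALK_REMOVED;
}

/* Walk from the head, spending one unit of *nr_to_walk per item visited. */
static long walk(struct node *head, unsigned long *nr_to_walk, long *restarts)
{
	struct node *it, *next;
	long isolated = 0;

restart:
	for (it = head->next; it != head; it = next) {
		next = it->next;
		if (!*nr_to_walk)
			break;
		--*nr_to_walk;

		switch (isolate(it)) {
		case WALK_REMOVED:
			isolated++;
			break;
		case WALK_ROTATE:
			node_del(it);
			node_add_tail(head, it);
			break;
		case WALK_SKIP:
			break;
		case WALK_RETRY:
			/* traversal state is stale, start again from the head */
			(*restarts)++;
			goto restart;
		}
	}
	return isolated;
}
```

In this toy version a list with several dirty items keeps rotating them
until the scan budget runs out, which is the kind of behaviour I'd want
to look at in real code before deciding whether a restart is cheap.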

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


Thread overview: 87+ messages
2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
2019-08-01  2:17 ` [PATCH 01/24] mm: directed shrinker work deferral Dave Chinner
2019-08-02 15:27   ` Brian Foster
2019-08-04  1:49     ` Dave Chinner
2019-08-05 17:42       ` Brian Foster
2019-08-05 23:43         ` Dave Chinner
2019-08-06 12:27           ` Brian Foster
2019-08-06 22:22             ` Dave Chinner
2019-08-07 11:13               ` Brian Foster
2019-08-01  2:17 ` [PATCH 02/24] shrinkers: use will_defer for GFP_NOFS sensitive shrinkers Dave Chinner
2019-08-02 15:27   ` Brian Foster
2019-08-04  1:50     ` Dave Chinner
2019-08-01  2:17 ` [PATCH 03/24] mm: factor shrinker work calculations Dave Chinner
2019-08-02 15:08   ` Nikolay Borisov
2019-08-04  2:05     ` Dave Chinner
2019-08-02 15:31   ` Brian Foster
2019-08-01  2:17 ` [PATCH 04/24] shrinker: defer work only to kswapd Dave Chinner
2019-08-02 15:34   ` Brian Foster
2019-08-04 16:48   ` Nikolay Borisov
2019-08-04 21:37     ` Dave Chinner
2019-08-07 16:12   ` kbuild test robot
2019-08-07 18:00   ` kbuild test robot
2019-08-01  2:17 ` [PATCH 05/24] shrinker: clean up variable types and tracepoints Dave Chinner
2019-08-01  2:17 ` [PATCH 06/24] mm: reclaim_state records pages reclaimed, not slabs Dave Chinner
2019-08-01  2:17 ` [PATCH 07/24] mm: back off direct reclaim on excessive shrinker deferral Dave Chinner
2019-08-01  2:17 ` [PATCH 08/24] mm: kswapd backoff for shrinkers Dave Chinner
2019-08-01  2:17 ` [PATCH 09/24] xfs: don't allow log IO to be throttled Dave Chinner
2019-08-01 13:39   ` Chris Mason
2019-08-01 23:58     ` Dave Chinner
2019-08-02  8:12       ` Christoph Hellwig
2019-08-02 14:11       ` Chris Mason
2019-08-02 18:34         ` Matthew Wilcox
2019-08-02 23:28         ` Dave Chinner
2019-08-05 18:32           ` Chris Mason
2019-08-05 23:09             ` Dave Chinner
2019-08-01  2:17 ` [PATCH 10/24] xfs: fix missed wakeup on l_flush_wait Dave Chinner
2019-08-01  2:17 ` [PATCH 11/24] xfs:: account for memory freed from metadata buffers Dave Chinner
2019-08-01  8:16   ` Christoph Hellwig
2019-08-01  9:21     ` Dave Chinner
2019-08-06  5:51       ` Christoph Hellwig
2019-08-01  2:17 ` [PATCH 12/24] xfs: correctly acount for reclaimable slabs Dave Chinner
2019-08-06  5:52   ` Christoph Hellwig
2019-08-06 21:05     ` Dave Chinner
2019-08-01  2:17 ` [PATCH 13/24] xfs: synchronous AIL pushing Dave Chinner
2019-08-05 17:51   ` Brian Foster
2019-08-05 23:21     ` Dave Chinner
2019-08-06 12:29       ` Brian Foster
2019-08-01  2:17 ` [PATCH 14/24] xfs: tail updates only need to occur when LSN changes Dave Chinner
2019-08-05 17:53   ` Brian Foster
2019-08-05 23:28     ` Dave Chinner
2019-08-06  5:33       ` Dave Chinner
2019-08-06 12:53         ` Brian Foster
2019-08-06 21:11           ` Dave Chinner
2019-08-01  2:17 ` [PATCH 15/24] xfs: eagerly free shadow buffers to reduce CIL footprint Dave Chinner
2019-08-05 18:03   ` Brian Foster
2019-08-05 23:33     ` Dave Chinner
2019-08-06 12:57       ` Brian Foster
2019-08-06 21:21         ` Dave Chinner
2019-08-01  2:17 ` [PATCH 16/24] xfs: Lower CIL flush limit for large logs Dave Chinner
2019-08-04 17:12   ` Nikolay Borisov
2019-08-01  2:17 ` [PATCH 17/24] xfs: don't block kswapd in inode reclaim Dave Chinner
2019-08-06 18:21   ` Brian Foster
2019-08-06 21:27     ` Dave Chinner
2019-08-07 11:14       ` Brian Foster
2019-08-01  2:17 ` [PATCH 18/24] xfs: reduce kswapd blocking on inode locking Dave Chinner
2019-08-06 18:22   ` Brian Foster
2019-08-06 21:33     ` Dave Chinner
2019-08-07 11:30       ` Brian Foster
2019-08-07 23:16         ` Dave Chinner
2019-08-01  2:17 ` [PATCH 19/24] xfs: kill background reclaim work Dave Chinner
2019-08-01  2:17 ` [PATCH 20/24] xfs: use AIL pushing for inode reclaim IO Dave Chinner
2019-08-07 18:09   ` Brian Foster
2019-08-07 23:10     ` Dave Chinner
2019-08-08 16:20       ` Brian Foster
2019-08-01  2:17 ` [PATCH 21/24] xfs: remove mode from xfs_reclaim_inodes() Dave Chinner
2019-08-01  2:17 ` [PATCH 22/24] xfs: track reclaimable inodes using a LRU list Dave Chinner
2019-08-08 16:36   ` Brian Foster
2019-08-09  0:10     ` Dave Chinner
2019-08-01  2:17 ` [PATCH 23/24] xfs: reclaim inodes from the LRU Dave Chinner
2019-08-08 16:39   ` Brian Foster
2019-08-09  1:20     ` Dave Chinner
2019-08-09 12:36       ` Brian Foster
2019-08-11  2:17         ` Dave Chinner
2019-08-11 12:46           ` Brian Foster [this message]
2019-08-01  2:17 ` [PATCH 24/24] xfs: remove unusued old inode reclaim code Dave Chinner
2019-08-06  5:57 ` [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Christoph Hellwig
2019-08-06 21:37   ` Dave Chinner
