From: Brian Foster <bfoster@redhat.com>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>,
	linux-xfs@vger.kernel.org, Al Viro <viro@zeniv.linux.org.uk>,
	Ian Kent <raven@themaw.net>,
	rcu@vger.kernel.org
Subject: Re: [PATCH] xfs: require an rcu grace period before inode recycle
Date: Wed, 26 Jan 2022 08:21:49 -0500
Message-ID: <YfFK7dYHKCEPCRoB@bfoster>
In-Reply-To: <20220126052910.GX4285@paulmck-ThinkPad-P17-Gen-1>

On Tue, Jan 25, 2022 at 09:29:10PM -0800, Paul E. McKenney wrote:
> On Wed, Jan 26, 2022 at 09:36:07AM +1100, Dave Chinner wrote:
> > On Tue, Jan 25, 2022 at 06:40:44AM -0800, Paul E. McKenney wrote:
> > > On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote:
> > > > > Ok, I think we're talking about slightly different things. What I mean
> > > > > above is that if a task removes a file and goes off doing unrelated
> > > > > $work, that inode will just sit on the percpu queue indefinitely. That's
> > > > > fine, as there's no functional need for us to process it immediately
> > > > > unless we're around -ENOSPC thresholds or some such that demand reclaim
> > > > > of the inode.
> > > > 
> > > > Yup, an occasional unlink sitting around for a while on an unlinked
> > > > list isn't going to cause a performance problem.  Indeed, such
> > > > workloads are more likely to benefit from the reduced unlink()
> > > > syscall overhead and won't even notice the increase in background
> > > > CPU overhead for inactivation of those occasional inodes.
> > > > 
> > > > > It sounds like what you're talking about is specifically
> > > > > the behavior/performance of sustained file removal (which is important
> > > > > obviously), where apparently there is a notable degradation if the
> > > > > queues become deep enough to push the inode batches out of CPU cache. So
> > > > > that makes sense...
> > > > 
> > > > Yup, sustained bulk throughput is where cache residency really
> > > > matters. And for unlink, sustained unlink workloads are quite
> > > > common; they are often something people wait for on the command
> > > > line, or they make up a performance-critical component of a highly
> > > > concurrent workload, so it's pretty important to get this part
> > > > right.
> > > > 
> > > > > > Darrick made the original assumption that we could delay
> > > > > > inactivation indefinitely and so he allowed really deep queues of up
> > > > > > to 64k deferred inactivations. But with queues this deep, we could
> > > > > > never get that background inactivation code to perform anywhere near
> > > > > > the original synchronous background inactivation code. e.g. I
> > > > > > measured 60-70% performance degradations on my scalability tests,
> > > > > > and nothing stood out in the profiles until I started looking at
> > > > > > CPU data cache misses.
> > > > > > 
> > > > > 
> > > > > ... but could you elaborate on the scalability tests involved here so I
> > > > > can get a better sense of it in practice and perhaps observe the impact
> > > > > of changes in this path?
> > > > 
> > > > The same concurrent fsmark create/traverse/unlink workloads I've
> > > > been running for the past decade+ demonstrates it pretty simply. I
> > > > also saw regressions with dbench (both op latency and throughput) as
> > > > the client count (concurrency) increased, and with compilebench.  I
> > > > didn't look much further because all the common benchmarks I ran
> > > > showed perf degradations with arbitrary delays that went away with
> > > > the current code we have.  ISTR that parts of aim7/reaim scalability
> > > > workloads that the intel zero-day infrastructure runs are quite
> > > > sensitive to background inactivation delays as well because that's a
> > > > CPU bound workload and hence any reduction in cache residency
> > > > results in a reduction of the number of concurrent jobs that can be
> > > > run.
> > > 
> > > Curiosity and all that, but has this work produced any intuition on
> > > the sensitivity of the performance/scalability to the delays?  As in
> > > the effect of microseconds vs. tens of microsecond vs. hundreds of
> > > microseconds?
> > 
> > Some, yes.
> > 
> > The upper delay threshold where performance is measurably
> > impacted is in the order of single digit milliseconds, not
> > microseconds.
> > 
> > What I saw was that as the batch processing delay goes beyond ~5ms,
> > IPC starts to fall. The CPU usage profile does not change shape, nor
> > do the proportions of where CPU time is spent change. All I see is
> > data cache misses going up substantially and IPC dropping
> > substantially. If
> > I read my notes correctly, typical change from "fast" to "slow" in
> > IPC was 0.82 to 0.39 and LLC-load-misses from 3% to 12%. The IPC
> > degradation was all done by the time the background batch processing
> > times were longer than a typical scheduler tick (10ms).
> > 
> > Now, I've been testing on Xeon CPUs with 36-76MB of L2/L3 cache, so
> > there's a fair amount of data that these can hold. I expect that
> > with smaller caches, the inflection point will be at smaller batch
> > sizes rather than larger ones. Hence while I could have used larger
> > batches for background processing (e.g. 64-128 inodes rather than
> > 32), I chose smaller batch sizes by default so that CPUs with
> > smaller caches are less likely to be adversely affected by the
> > batch size being too large. OTOH, I started to measure noticeable
> > degradation at batch sizes of 256 inodes on my machines, which is
> > why the hard queue limit got set to 256 inodes.
> > 
> > Scaling the delay/batch size down towards single inode queuing also
> > resulted in perf degradation. This was largely because of all the
> > extra scheduling overhead that switching between the user task and
> > the kernel worker task for every inode entailed. Context switch
> > rate went from a couple of thousand/sec to over 100,000/s for
> > single inode batches, and performance went backwards in proportion
> > with the amount of CPU then spent on context switches. It also led
> > to increases in buffer lock contention (hence more context
> > switches) as both the user task and the kworker try to access the
> > same buffers...
> 
> Makes sense.  Never a guarantee of easy answers.  ;-)
> 
> If it would help, I could create expedited-grace-period counterparts
> of get_state_synchronize_rcu(), start_poll_synchronize_rcu(),
> poll_state_synchronize_rcu(), and cond_synchronize_rcu().  These would
> provide sub-millisecond grace periods, in fact, sub-100-microsecond
> grace periods on smaller systems.
> 

If you have something with enough basic functionality, I'd be interested
in converting this patch over to an expedited variant to run some
tests/experiments. As it is, it seems the current approach is kind of
playing whack-a-mole between disrupting allocation performance by
populating the free inode pool with too many free but "pending rcu grace
period" inodes, and disrupting sustained remove performance by pushing
the internal inactivation queues too deep and thus losing CPU cache
residency, as Dave describes above. So if an expedited grace period can
fit within that time window on paper, it certainly seems worthwhile to
test.
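
For reference, the shape I'd expect that conversion to take (just a
sketch; the structure, field and function names below are made up for
illustration, and the expedited start variant you describe doesn't
exist yet):

#include <linux/rcupdate.h>

/* Illustrative stand-in for the real xfs_inode; only the cookie matters. */
struct example_inode {
	unsigned long	i_destroy_gp;	/* RCU grace-period cookie */
};

/* At destroy/inactivation time: snapshot (and kick off) a grace period. */
static void example_inode_mark_destroyed(struct example_inode *ip)
{
	/* swap in the expedited start variant here once it exists */
	ip->i_destroy_gp = start_poll_synchronize_rcu();
}

/* At recycle time: only reuse the inode once that grace period has ended. */
static bool example_inode_recyclable(struct example_inode *ip)
{
	return poll_state_synchronize_rcu(ip->i_destroy_gp);
}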

Otherwise the only thing that comes to mind right now is to start
playing around with the physical inode allocation algorithm to avoid
such pending inodes. I think a scanning approach may ultimately run into
the same problems with the right workload (i.e. one where all free
inodes are pending), so I suspect what this really means is either
figuring out an efficient way to locate expired inodes (maybe via our
own internal rcu callback to explicitly tag now-expired inodes as good
allocation candidates), or determining when to proceed with inode chunk
allocations when scanning is unlikely to succeed, or something similar
along those general lines...
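
The callback idea would look something like the following (again just a
sketch with made-up names; the real change would have to hook into the
existing xfs_inode lifecycle rather than a standalone structure):

#include <linux/kernel.h>
#include <linux/rcupdate.h>

/* Illustrative free-pool entry; the real state would live in xfs_inode. */
struct example_free_inode {
	struct rcu_head	rcu;
	bool		gp_expired;	/* safe to hand out to the allocator? */
};

/* Runs once a grace period has elapsed since the inode was freed. */
static void example_inode_gp_done(struct rcu_head *head)
{
	struct example_free_inode *fip =
		container_of(head, struct example_free_inode, rcu);

	/* Tag the inode as a good allocation candidate. */
	WRITE_ONCE(fip->gp_expired, true);
}

/* Called when the inode is freed back to the pool. */
static void example_inode_freed(struct example_free_inode *fip)
{
	fip->gp_expired = false;
	call_rcu(&fip->rcu, example_inode_gp_done);
}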

> Of course, nothing comes for free.  Although expedited grace periods
> are way way cheaper than they used to be, they still IPI non-idle
> non-nohz_full-userspace CPUs, which translates to roughly the CPU overhead
> of a wakeup on each IPIed CPU.  And of course disruption to aggressive
> non-nohz_full real-time applications.  Shorter latencies also translate
> to fewer updates over which to amortize grace-period overhead.
> 
> But it should get well under your single-digit milliseconds of delay.
> 

If the expedited variant were sufficient for the fast path case, I
suppose it might be interesting to see if we could throttle down to the
non-expedited variants based either on a heuristic or on feedback from
allocation-side stalls.
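
Something along these lines, maybe (purely illustrative; both the stall
counter and the expedited start function are hypothetical here):

#include <linux/atomic.h>
#include <linux/rcupdate.h>

/* Hypothetical per-mount state; nothing like this exists today. */
struct example_mount {
	atomic_t	alloc_gp_stalls;	/* recent allocation-side stalls */
};

/* Decide how aggressively to start the grace period for a destroyed inode. */
static unsigned long example_start_inode_gp(struct example_mount *mp)
{
	/*
	 * If inode allocation has recently stalled on inodes still
	 * waiting out a grace period, pay the extra cost of an
	 * expedited one.
	 */
	if (atomic_read(&mp->alloc_gp_stalls) > 0)
		return start_poll_synchronize_rcu_expedited(); /* hypothetical */

	return start_poll_synchronize_rcu();
}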

Brian

> 							Thanx, Paul
> 

