From: "Paul E. McKenney" <paulmck@linux.ibm.com>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: linux-kernel@vger.kernel.org, josh@joshtriplett.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	jiangshanlai@gmail.com
Subject: Re: dyntick-idle CPU and node's qsmask
Date: Sat, 10 Nov 2018 20:22:10 -0800	[thread overview]
Message-ID: <20181111042210.GN4170@linux.ibm.com> (raw)
In-Reply-To: <20181111030925.GA182908@google.com>

On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > Hi Paul and everyone,
> > > 
> > > I was tracing/studying the RCU code today in paul/dev branch and noticed that
> > > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > > corresponding to the leaf node for the idle CPU, and reporting a QS on their
> > > behalf.
> > > 
> > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 792 0 dti
> > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 801 2 dti
> > > rcu_sched-10    [003]    40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > 
> > > That's all good, but I was wondering if we can do better for the idle CPUs
> > > if we can somehow not set the qsmask bits of the node in the first place.
> > > Then no quiescent-state reporting would be needed for idle CPUs, right?
> > > And we would also not need to acquire the rnp lock, I think.
> > > 
> > > At least for a single-node tree RCU system, it seems that would avoid needing
> > > to acquire the lock without complications. Anyway, let me know your thoughts;
> > > I'm happy to discuss this in the hallways at LPC as well, for folks
> > > attending :)
> > 
> > We could, but that would require consulting the rcu_data structure for
> > each CPU while initializing the grace period, thus increasing the number
> > of cache misses during grace-period initialization and also shortly after
> > for any non-idle CPUs.  This seems backwards on busy systems where each
> 
> When I traced, it appeared to me that the rcu_data structure of a remote CPU
> was being consulted anyway by the rcu_sched thread. So it seems like such a
> cache miss would happen anyway, whether during grace-period initialization or
> during the fqs stage? I guess I'm trying to say that the consultation of the
> remote CPU's rcu_data happens anyway.

Hmmm...

The rcu_gp_init() function does access an rcu_data structure, but it is
that of the current CPU, so shouldn't involve a communications cache miss,
at least not in the common case.
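
For reference, the qsmask setup in rcu_gp_init() looks roughly like the
following (a heavily simplified sketch; field and function names are
approximate and vary across kernel versions, so treat it as illustrative
rather than the actual code):

	rcu_for_each_node_breadth_first(rsp, rnp) {
		raw_spin_lock_irq_rcu_node(rnp);
		rdp = this_cpu_ptr(rsp->rda);	/* current CPU's rcu_data only */
		rnp->qsmask = rnp->qsmaskinit;	/* expect a QS from each of these CPUs */
		WRITE_ONCE(rnp->gp_seq, rsp->gp_seq);
		if (rnp == rdp->mynode)		/* no remote rcu_data access here */
			(void)__note_gp_changes(rsp, rnp, rdp);
		raw_spin_unlock_irq_rcu_node(rnp);
	}

Note that the only rcu_data structure touched is the current CPU's, which is
why grace-period initialization itself should not incur communications cache
misses on remote rcu_data structures.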

Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
functions that it calls?  In that case, please see below.

> > CPU will with high probability report its own quiescent state before three
> > jiffies pass, in which case the cache misses on the rcu_data structures
> > would be wasted motion.
> 
> If all the CPUs are busy and reporting their QS themselves, then I think the
> qsmask is likely 0, so rcu_implicit_dynticks_qs() (called from
> force_qs_rnp()) wouldn't be called, and so there would be no cache misses on
> rcu_data, right?

Yes, but that assumes that all CPUs report their quiescent states before
the first call to rcu_gp_fqs().  One exception is when some CPU is
looping in the kernel for many milliseconds without passing through a
quiescent state.  This is because for recent kernels, cond_resched()
is not a quiescent state until the grace period is something like 100
milliseconds old.  (For older kernels, cond_resched() was never an RCU
quiescent state unless it actually scheduled.)
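
For context, force_qs_rnp() does skip leaf rcu_node structures whose ->qsmask
is zero, and consults remote rcu_data structures only for CPUs that have not
yet reported. Roughly (again a simplified sketch with approximate names, not
the actual code):

	rcu_for_each_leaf_node(rsp, rnp) {
		raw_spin_lock_irqsave_rcu_node(rnp, flags);
		if (!rnp->qsmask) {		/* all CPUs here already reported */
			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
			continue;		/* no remote rcu_data accesses */
		}
		mask = 0;
		for_each_leaf_node_possible_cpu(rnp, cpu) {
			bit = leaf_node_cpu_bit(rnp, cpu);
			if ((rnp->qsmask & bit) &&
			    rcu_implicit_dynticks_qs(per_cpu_ptr(rsp->rda, cpu)))
				mask |= bit;	/* remote rcu_data access happens here */
		}
		if (mask)
			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
		else
			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	}

So on a busy system where every CPU reports before the first rcu_gp_fqs()
call, the ->qsmask check means those remote accesses are never made.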

Why wait 100 milliseconds?  Because otherwise the increase in
cond_resched() overhead shows up all too well, causing 0day test robot
to complain bitterly.  Besides, I would expect that in the common case,
CPUs would be executing usermode code.

Ah, did you build with NO_HZ_FULL, boot with nohz_full CPUs, and then run
CPU-bound usermode workloads on those CPUs?  Such CPUs would appear to
be idle from an RCU perspective.  But these CPUs would never touch their
rcu_data structures, so they would likely remain in the RCU grace-period
kthread's cache.  So this should work well also.  Give or take that other
work would likely eject them from the cache, but in that case they would
be capacity cache misses rather than the aforementioned communications
cache misses.  Not that this distinction matters to whoever is measuring
performance.  ;-)

> > Now, this does increase overhead on mostly idle systems, but the theory
> > is that mostly idle systems are most able to absorb this extra overhead.
> 
> Yes. Could we use rcuperf to check the impact of such a change?

I would be very surprised if the overhead was large enough for rcuperf
to be able to see it.

> Anyway it was just an idea that popped up when I was going through traces :)
> Thanks for the discussion and happy to discuss further or try out anything.

Either way, I do appreciate your going through this.  People have found
RCU bugs this way, one of which involved RCU uselessly calling a particular
function twice in quick succession.  ;-)

							Thanx, Paul


Thread overview: 14+ messages
2018-11-10 21:46 dyntick-idle CPU and node's qsmask Joel Fernandes
2018-11-10 23:04 ` Paul E. McKenney
2018-11-11  3:09   ` Joel Fernandes
2018-11-11  4:22     ` Paul E. McKenney [this message]
2018-11-11 18:09       ` Joel Fernandes
2018-11-11 18:36         ` Paul E. McKenney
2018-11-11 21:04           ` Joel Fernandes
2018-11-20 20:42           ` Joel Fernandes
2018-11-20 22:28             ` Paul E. McKenney
2018-11-20 22:34               ` Paul E. McKenney
2018-11-21  2:06               ` Joel Fernandes
2018-11-21  2:41                 ` Paul E. McKenney
2018-11-21  4:37                   ` Joel Fernandes
2018-11-21 14:39                     ` Paul E. McKenney
