From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>,
	tj@kernel.org, mingo@redhat.com, linux-kernel@vger.kernel.org,
	der.herr@hofr.at, dave@stgolabs.net, riel@redhat.com,
	viro@ZenIV.linux.org.uk, torvalds@linux-foundation.org
Subject: Re: [RFC][PATCH 12/13] stop_machine: Remove lglock
Date: Wed, 1 Jul 2015 08:56:55 -0700
Message-ID: <20150701155655.GG3717@linux.vnet.ibm.com>
In-Reply-To: <20150701115642.GU19282@twins.programming.kicks-ass.net>

On Wed, Jul 01, 2015 at 01:56:42PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 30, 2015 at 02:32:58PM -0700, Paul E. McKenney wrote:
> 
> > > I had indeed forgotten that got farmed out to the kthread; on which, my
> > > poor desktop seems to have spent ~140 minutes of its (most recent)
> > > existence poking RCU things.
> > > 
> > >     7 root      20   0       0      0      0 S   0.0  0.0  56:34.66 rcu_sched
> > >     8 root      20   0       0      0      0 S   0.0  0.0  20:58.19 rcuos/0
> > >     9 root      20   0       0      0      0 S   0.0  0.0  18:50.75 rcuos/1
> > >    10 root      20   0       0      0      0 S   0.0  0.0  18:30.62 rcuos/2
> > >    11 root      20   0       0      0      0 S   0.0  0.0  17:33.24 rcuos/3
> > >    12 root      20   0       0      0      0 S   0.0  0.0   2:43.54 rcuos/4
> > >    13 root      20   0       0      0      0 S   0.0  0.0   3:00.31 rcuos/5
> > >    14 root      20   0       0      0      0 S   0.0  0.0   3:09.27 rcuos/6
> > >    15 root      20   0       0      0      0 S   0.0  0.0   2:52.98 rcuos/7
> > > 
> > > Which is almost as much time as my konsole:
> > > 
> > >  2853 peterz    20   0  586240 103664  41848 S   1.0  0.3 147:39.50 konsole
> > > 
> > > Which seems somewhat excessive. But who knows.
> > 
> > No idea.  How long has that system been up?  What has it been doing?
> 
> Some 40-odd days, it seems. It's my desktop: I read email (in mutt in
> Konsole), I type patches (in vim in Konsole), I compile kernels (in
> Konsole) etc..
> 
> Now konsole is threaded and each new window/tab is just another thread
> in the same process, so runtime should accumulate. However, I just found
> that for some obscure reason there are two konsole processes around, and
> the other one is the one I'm using most; it also has significantly more
> runtime.
> 
>  3264 ?        Sl   452:43          \_ /usr/bin/konsole
> 
> Must be some of that brain-damaged desktop shite that confused things --
> I see the one is started with some -session argument. Some day I'll
> discover how to destroy all that nonsense and make things behave as they
> should.

Well, you appear to be using about 6% of a CPU, or 0.7% of the entire
8-CPU system for the RCU GP kthread.  That is more than I would like to
see consumed.

Odd that four of your eight rcuos kthreads show higher consumption
than the others.  I would expect three of eight.  Are you by chance running
an eight-core system with hyperthreading disabled in hardware, via boot
parameter, or via explicit offline?  The real question I have is "is
nr_cpu_ids equal to 16 rather than to 8?"

A significant fraction of rcu_sched's CPU overhead is likely due to that
extra wakeup for the fourth leader rcuos kthread.
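
For reference, here is a back-of-the-envelope sketch of the leader
grouping that makes nr_cpu_ids matter here.  It assumes one leader per
group of roughly sqrt(nr_cpu_ids) CPUs, with the square root rounded
up; the kernel's exact rounding may differ, so treat the arithmetic as
an assumption:

#include <stdio.h>
#include <math.h>

/* Assumption: one leader rcuos kthread per group of about
 * sqrt(nr_cpu_ids) CPUs, square root rounded up.  The exact
 * rounding in the offload code may differ. */
static int nocb_leaders(int nr_cpu_ids)
{
	int width = (int)ceil(sqrt(nr_cpu_ids));	/* group width */

	return (nr_cpu_ids + width - 1) / width;	/* round up */
}

int main(void)
{
	/* nr_cpu_ids == 8 gives 3 leaders, 16 gives 4 -- hence the
	 * four busy rcuos kthreads in the top output above. */
	printf("nr_cpu_ids  8 -> %d leaders\n", nocb_leaders(8));
	printf("nr_cpu_ids 16 -> %d leaders\n", nocb_leaders(16));
	return 0;
}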

Also, do you have nohz_full set?  Just wondering why callback offloading
is enabled.  (If you want it enabled, fine, but from what I can see your
workload isn't being helped by it and it does have higher overhead.)

Even if you don't want offloading and do disable it, it would be good to
reduce the penalty.  Is there something I can do to reduce the overhead
of waking several kthreads?  Right now, I just do a series of wake_up()
calls, one for each leader rcuos kthread.
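
To make the pattern concrete, a minimal sketch -- hypothetical types
and names, not the actual kernel code, which walks the rcu_data
structures -- of that series of per-leader wakeups:

#include <linux/wait.h>

/* Hypothetical per-leader state, for this sketch only. */
struct nocb_leader {
	wait_queue_head_t wq;		/* the leader sleeps here */
	struct nocb_leader *next;	/* next leader in the list */
};

/* One wake_up() per leader rcuos kthread, as described above. */
static void wake_nocb_leaders(struct nocb_leader *head)
{
	struct nocb_leader *ldr;

	for (ldr = head; ldr; ldr = ldr->next)
		wake_up(&ldr->wq);
}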

Oh, are you running v3.10 or some such?  If so, there are some more
recent RCU changes that can help with this.  They are called out here:

http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf

> > The rcu_sched overhead is expected behavior if the system has run between
> > ten and one hundred million grace periods, give or take an order of
> > magnitude depending on the number of idle CPUs and so on.
> > 
> > The overhead for the RCU offload kthreads is what it is.  A kfree() takes
> > as much time as a kfree does, and they are all nicely counted up for you.
> 
> Yah, if only we could account it back to whoever caused it :/

It could be done, but would require increasing the size of rcu_head.
And would require costly fine-grained timing of callback execution.
Not something for production systems, I would guess.
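
To make the cost concrete, a hypothetical sketch of what such
accounting would require: an rcu_head grown by one word to record the
enqueuer, plus two clock reads around every callback.  Neither the
extra field nor account_cb_time() exists; both are made up for this
sketch:

#include <linux/types.h>
#include <linux/ktime.h>

/* Hypothetical: record who to bill for this callback's execution. */
struct rcu_head_traced {
	struct rcu_head head;	/* the existing two-pointer rcu_head */
	pid_t enqueuer;		/* extra word: task that posted the CB */
};

extern void account_cb_time(pid_t pid, ktime_t delta);	/* made up */

/* Hypothetical fine-grained timing around each callback invocation;
 * the two ktime_get() calls per callback are the cost in question. */
static void invoke_traced(struct rcu_head_traced *rhp)
{
	ktime_t t0 = ktime_get();

	rhp->head.func(&rhp->head);
	account_cb_time(rhp->enqueuer, ktime_sub(ktime_get(), t0));
}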

> > > Although here I'll once again go ahead and say something ignorant; how
> > > come that's a problem? Surely if we know the kthread thing has finished
> > > starting a GP, any one CPU issuing a full memory barrier (as would be
> > > implied by switching to the stop worker) must then indeed observe that
> > > global state, due to that transitivity thing?
> > > 
> > > That is, I'm having a wee bit of bother seeing how you'd need
> > > manipulation of global variables as you allude to below.
> > 
> > Well, I thought that you wanted to leverage the combining tree to
> > determine when the grace period had completed.  If a given CPU isn't
> > pushing its quiescent states up the combining tree, then the combining
> > tree can't do much for you.
> 
> Right, that is what I wanted, and sure, the combining thing needs to
> happen with atomics, but that's not new; it already does that.
> 
> What I was talking about was the interaction between forcing a
> quiescent state and the poking that detects that a QS had indeed been
> started.

It gets worse.

Suppose that a grace period is already in progress.  You cannot leverage
its use of the combining tree because some of the CPUs might have already
indicated a quiescent state, which means that the current grace period
won't necessarily wait for all of the CPUs that the concurrent expedited
grace period needs to wait on.  So you need to kick the current grace
period, wait for it to complete, wait for the next one to start (with
all the fun and exciting issues called out earlier), do the expedited
grace period, then wait for completion.
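
In outline, with hypothetical helper names (this is the sequence the
paragraph above describes, not code that exists):

/* Hypothetical helpers, named only for this sketch. */
extern void kick_current_gp(void);
extern void wait_for_current_gp(void);
extern void wait_for_next_gp_start(void);
extern void run_expedited_scan(void);

/* Why an in-progress GP cannot simply be reused: some CPUs may have
 * already reported a quiescent state to it, so only the next GP is
 * guaranteed to cover every CPU the expedited GP must wait on. */
static void expedited_gp_via_combining_tree(void)
{
	kick_current_gp();		/* hurry the current GP along */
	wait_for_current_gp();		/* it may exclude some CPUs... */
	wait_for_next_gp_start();	/* ...so wait for a fresh one */
	run_expedited_scan();		/* now the tree covers all CPUs */
	wait_for_current_gp();		/* and wait for it to complete */
}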

> > Well, I do have something that seems reasonably straightforward.  Sending
> > the patches along separately.  Not sure that it is worth its weight.
> > 
> > The idea is that we keep the expedited grace periods working as they do
> > now, independently of the normal grace period.  The normal grace period
> > takes a sequence number just after initialization, and checks to see
> > if an expedited grace period happened in the meantime at the beginning
> > of each quiescent-state forcing episode.  This saves the last one or
> > two quiescent-state forcing scans in the case where an expedited grace
> > period really did happen.
> > 
> > It is possible for the expedited grace period to help things along by
> > waking up the grace-period kthread, but of course doing this too much
> > further increases the time consumed by your rcu_sched kthread. 
> 
> Ah, so that is the purpose of that patch. Still, I'm having trouble
> seeing how you can do this too much; you would only be waking it if
> there was a GP pending completion, right? At which point waking it is
> the right thing.
> 
> If you wake it unconditionally, even if there's nothing to do, then yes
> that'd be a waste of cycles.

Heh!  You are already complaining about rcu_sched consuming 0.7%
of your system, and rightfully so.  Increasing this overhead still
further therefore cannot be considered a good thing unless there is some
overwhelming benefit.  And I am not seeing that benefit.  Perhaps due
to a failure of imagination, but until someone enlightens me, I have to
throttle the wakeups -- or, perhaps better, omit the wakeups entirely.
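
Omitting them entirely would still leave the passive check from the
scheme quoted above.  A hedged sketch of that check, with hypothetical
names throughout:

#include <linux/types.h>
#include <linux/compiler.h>

/* Hypothetical: counter bumped at the end of each expedited GP. */
static unsigned long expedited_gp_seq;

extern bool current_gp_done(void);		/* hypothetical */
extern void force_quiescent_state_scan(void);	/* hypothetical */

/* Snapshot the expedited counter just after GP initialization and
 * recheck it before each quiescent-state forcing scan.  If an
 * expedited GP ran in the meantime, it has already reported the
 * remaining quiescent states, so the last scan or two can be
 * skipped -- with no wakeup from the expedited side at all. */
static void normal_gp_fqs_loop(void)
{
	unsigned long snap = READ_ONCE(expedited_gp_seq);

	while (!current_gp_done()) {
		if (READ_ONCE(expedited_gp_seq) != snap)
			break;		/* expedited GP did the rest */
		force_quiescent_state_scan();
	}
}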

Actually, I am not convinced that I should push any of the patches that
leverage expedited grace periods to help out normal grace periods.

							Thanx, Paul

