Date: Tue, 30 Jun 2015 14:32:58 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra
Cc: Oleg Nesterov, tj@kernel.org, mingo@redhat.com,
	linux-kernel@vger.kernel.org, der.herr@hofr.at, dave@stgolabs.net,
	riel@redhat.com, viro@ZenIV.linux.org.uk, torvalds@linux-foundation.org
Subject: Re: [RFC][PATCH 12/13] stop_machine: Remove lglock
Message-ID: <20150630213258.GO3717@linux.vnet.ibm.com>
In-Reply-To: <20150629075645.GD19282@twins.programming.kicks-ass.net>

On Mon, Jun 29, 2015 at 09:56:46AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 26, 2015 at 09:14:28AM -0700, Paul E. McKenney wrote:
> > > To me it just makes more sense to have a single RCU state machine. With
> > > expedited we'll push it as fast as we can, but no faster.
> >
> > Suppose that someone invokes synchronize_sched_expedited(), but there
> > is no normal grace period in flight.  Then each CPU will note its own
> > quiescent state, but when it later might have tried to push it up the
> > tree, it will see that there is no grace period in effect, and will
> > therefore not bother.
>
> Right, I did mention the force grace period machinery to make sure we
> start one before poking :-)

Fair enough...

> > OK, we could have synchronize_sched_expedited() tell the grace-period
> > kthread to start a grace period if one was not already in progress.
>
> I had indeed forgotten that got farmed out to the kthread; on which, my
> poor desktop seems to have spend ~140 minutes of its (most recent)
> existence poking RCU things.
>
>     7 root      20   0       0      0      0 S   0.0  0.0  56:34.66 rcu_sched
>     8 root      20   0       0      0      0 S   0.0  0.0  20:58.19 rcuos/0
>     9 root      20   0       0      0      0 S   0.0  0.0  18:50.75 rcuos/1
>    10 root      20   0       0      0      0 S   0.0  0.0  18:30.62 rcuos/2
>    11 root      20   0       0      0      0 S   0.0  0.0  17:33.24 rcuos/3
>    12 root      20   0       0      0      0 S   0.0  0.0   2:43.54 rcuos/4
>    13 root      20   0       0      0      0 S   0.0  0.0   3:00.31 rcuos/5
>    14 root      20   0       0      0      0 S   0.0  0.0   3:09.27 rcuos/6
>    15 root      20   0       0      0      0 S   0.0  0.0   2:52.98 rcuos/7
>
> Which is almost as much time as my konsole:
>
>  2853 peterz    20   0  586240 103664  41848 S   1.0  0.3 147:39.50 konsole
>
> Which seems somewhat excessive. But who knows.

No idea.  How long has that system been up?  What has it been doing?

The rcu_sched overhead is expected behavior if the system has run
somewhere between ten and one hundred million grace periods, give or
take an order of magnitude depending on the number of idle CPUs and so
on.  The overhead for the RCU offload (rcuos) kthreads is what it is:
a kfree() takes as much time as a kfree() does, and those times are
all nicely counted up for you.
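Just to put rough numbers on that, taking the 56:34 of rcu_sched CPU
time above at face value:

	56:34 of CPU time   ~= 3,400 seconds
	3,400 s / 10^7 GPs  ~= 340 microseconds per grace period
	3,400 s / 10^8 GPs  ~=  34 microseconds per grace period

In other words, a few tens to a few hundreds of microseconds of
rcu_sched time per grace period.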
> > OK, the grace-period kthread could tell synchronize_sched_expedited()
> > when it has finished initializing the grace period, though this is
> > starting to get a bit on the Rube Goldberg side.  But this -still- is
> > not good enough, because even though the grace-period kthread has fully
> > initialized the new grace period, the individual CPUs are unaware of it.
>
> Right, so over the weekend -- I had postponed reading this rather long
> email for I was knackered -- I had figured that because we trickle the
> GP completion up, you probably equally trickle the GP start down of
> sorts and there might be 'interesting' things there.

The GP completion trickles both up and down, though the down part
shouldn't matter in this case.

> > And they will therefore continue to ignore any quiescent state that they
> > encounter, because they cannot prove that it actually happened after
> > the start of the current grace period.
>
> Right, badness :-)
>
> Although here I'll once again go ahead and say something ignorant; how
> come that's a problem? Surely if we know the kthread thing has finished
> starting a GP, any one CPU issuing a full memory barrier (as would be
> implied by switching to the stop worker) must then indeed observe that
> global state? due to that transitivity thing.
>
> That is, I'm having a wee bit of bother for seeing how you'd need
> manipulation of global variables as you elude to below.

Well, I thought that you wanted to leverage the combining tree to
determine when the grace period had completed.  If a given CPU isn't
pushing its quiescent states up the combining tree, then the combining
tree can't do much for you.

> > But this -still- isn't good enough, because
> > idle CPUs never will become aware of the new grace period -- by design,
> > as they are supposed to be able to sleep through an arbitrary number of
> > grace periods.
>
> Yes, I'm sure. Waking up seems like a serializing experience though; but
> I suppose that's not good enough if we wake up right before we force
> start the GP.

That would indeed be one of the problems that could occur.  ;-)

> > I feel like there is a much easier way, but cannot yet articulate it.
> > I came across a couple of complications and a blind alley with it thus
> > far, but it still looks promising.  I expect to be able to generate
> > actual code for it within a few days, but right now it is just weird
> > abstract shapes in my head.  (Sorry, if I knew how to describe them,
> > I could just write the code!  When I do write the code, it will probably
> > seem obvious and trivial, that being the usual outcome...)
>
> Hehe, glad to have been of help :-)

Well, I do have something that seems reasonably straightforward.
Sending the patches along separately.  Not sure that it is worth its
weight.

The idea is that we keep the expedited grace periods working as they
do now, independently of the normal grace period.  The normal grace
period takes a sequence number just after initialization, and at the
beginning of each quiescent-state forcing episode it checks whether an
expedited grace period happened in the meantime.  This saves the last
one or two quiescent-state forcing scans in the case where an expedited
grace period really did happen.
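Very roughly, and with completely made-up names -- this is only meant
to show the shape of the check, not what the actual patches look like:

/* Hypothetical counter bumped at the end of each expedited GP. */
static unsigned long exp_gp_completed;

/* Last act of the expedited grace period (hypothetical hook). */
static void exp_gp_mark_completed(void)
{
	/* Updaters are serialized elsewhere; release-publish the bump. */
	smp_store_release(&exp_gp_completed, exp_gp_completed + 1);
}

/* Grace-period kthread, just after initializing a normal GP. */
static unsigned long exp_gp_snapshot(void)
{
	return smp_load_acquire(&exp_gp_completed);
}

/* Start of each quiescent-state forcing episode. */
static bool exp_gp_happened_since(unsigned long snap)
{
	/*
	 * If the counter moved, an expedited grace period happened
	 * after this normal grace period was initialized, so the
	 * last one or two forcing scans can be skipped.
	 */
	return smp_load_acquire(&exp_gp_completed) != snap;
}

The real thing of course has to tie into the combining tree and the
existing forcing machinery, but that is the general shape.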
It is possible for the expedited grace period to help things along by
waking up the grace-period kthread, but of course doing that too often
just further increases the time consumed by your rcu_sched kthread.
One compromise is to do the wakeup only every so many grace periods,
or at most once in a given interval of time, which is the approach the
last patch in the series takes.

I will be sending the series shortly, followed by a series for the
other portions of the expedited grace-period upgrade.

							Thanx, Paul