Date: Fri, 26 Jun 2015 09:14:28 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra
Cc: Oleg Nesterov, tj@kernel.org, mingo@redhat.com,
	linux-kernel@vger.kernel.org, der.herr@hofr.at, dave@stgolabs.net,
	riel@redhat.com, viro@ZenIV.linux.org.uk, torvalds@linux-foundation.org
Subject: Re: [RFC][PATCH 12/13] stop_machine: Remove lglock
Message-ID: <20150626161415.GY3717@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
In-Reply-To: <20150626123207.GZ19282@twins.programming.kicks-ass.net>

On Fri, Jun 26, 2015 at 02:32:07PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 25, 2015 at 07:51:46AM -0700, Paul E. McKenney wrote:
> > > So please humour me and explain how all this is far more complicated ;-)
> >
> > Yeah, I do need to get RCU design/implementation documentation put together.
> >
> > In the meantime, RCU's normal grace-period machinery is designed to be
> > quite loosely coupled.  The idea is that almost all actions occur locally,
> > reducing contention and cache thrashing.  But an expedited grace period
> > needs tight coupling in order to be able to complete quickly.  Making
> > something that switches between loose and tight coupling in short order
> > is not at all simple.
>
> But expedited just means faster, we never promised that
> sync_rcu_expedited is the absolute fastest primitive ever.

Which is good, because given that it is doing something to each and
every CPU, it most assuredly won't in any way resemble the absolute
fastest primitive ever.  ;-)

> So I really should go read the RCU code I suppose, but I don't get
> what's wrong with starting a forced quiescent state, then doing the
> stop_work spray, where each work will run the regular RCU tick thing to
> push it forwards.
>
> From my feeble memories, what I remember is that the last cpu to
> complete a GP on a leaf node will push the completion up to the next
> level, until at last we've reached the root of your tree and we can
> complete the GP globally.

That is true: the task that notices the last required quiescent state
will push up the tree and notice that the grace period has ended.  If
that task is not the grace-period kthread, it will then awaken the
grace-period kthread.
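As a rough, single-threaded sketch of that leaf-to-root propagation (not
the actual kernel code -- the real rcu_node tree and its locking live in
kernel/rcu/tree.c, and the names qs_node, report_qs, and the tree shape
below are invented for illustration):

/*
 * Standalone sketch (not kernel code) of the leaf-to-root propagation
 * described above: each tree node tracks a mask of not-yet-quiescent
 * children, and the task that clears the last bit in the root observes
 * the end of the grace period.
 */
#include <stdio.h>

#define NR_CPUS		8
#define CPUS_PER_LEAF	4
#define NR_LEAVES	(NR_CPUS / CPUS_PER_LEAF)

struct qs_node {
	unsigned long qsmask;	/* bits for children still owing a QS */
	struct qs_node *parent;	/* NULL for the root */
	int grpnum;		/* this node's bit within its parent */
};

static struct qs_node root;
static struct qs_node leaves[NR_LEAVES];

static void tree_init(void)
{
	int i;

	root.qsmask = (1UL << NR_LEAVES) - 1;
	root.parent = NULL;
	for (i = 0; i < NR_LEAVES; i++) {
		leaves[i].qsmask = (1UL << CPUS_PER_LEAF) - 1;
		leaves[i].parent = &root;
		leaves[i].grpnum = i;
	}
}

/* Report a quiescent state for one CPU, pushing up while masks empty. */
static void report_qs(int cpu)
{
	struct qs_node *np = &leaves[cpu / CPUS_PER_LEAF];
	int bit = cpu % CPUS_PER_LEAF;

	for (;;) {
		np->qsmask &= ~(1UL << bit);
		if (np->qsmask)
			return;		/* siblings still owe a QS */
		if (!np->parent) {
			/* Last QS reached the root: GP is over. */
			printf("CPU %d ended the grace period; wake GP kthread\n",
			       cpu);
			return;
		}
		bit = np->grpnum;	/* clear our bit one level up */
		np = np->parent;
	}
}

int main(void)
{
	int cpu;

	tree_init();
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		report_qs(cpu);
	return 0;
}

The point of this structure is that the only globally visible event is
the final clearing of the root's mask; everything else stays local to a
single node, which is the loose coupling mentioned above.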
> To me it just makes more sense to have a single RCU state machine. With
> expedited we'll push it as fast as we can, but no faster.

Suppose that someone invokes synchronize_sched_expedited(), but there is
no normal grace period in flight.  Then each CPU will note its own
quiescent state, but when it later tries to push it up the tree, it will
see that there is no grace period in effect, and will therefore not
bother.

OK, we could have synchronize_sched_expedited() tell the grace-period
kthread to start a grace period if one was not already in progress.  But
that still isn't good enough, because the grace-period kthread will take
some time to initialize the new grace period, and if we hammer all the
CPUs before the initialization is complete, the resulting quiescent
states cannot be counted against the new grace period.  (The reason for
this is that there is some delay between the actual quiescent state and
the time that it is reported, so we have to be very careful not to
incorrectly report a quiescent state from an earlier grace period
against the current grace period.)

OK, the grace-period kthread could tell synchronize_sched_expedited()
when it has finished initializing the grace period, though this is
starting to get a bit on the Rube Goldberg side.  But this -still- is
not good enough, because even though the grace-period kthread has fully
initialized the new grace period, the individual CPUs are unaware of it.
They will therefore continue to ignore any quiescent state that they
encounter, because they cannot prove that it actually happened after the
start of the current grace period.

OK, we could have some sort of indication of when all CPUs become aware
of the new grace period by having them atomically manipulate a global
counter.  Presumably we would also have some flag indicating when this
is and is not needed, so that we avoid the killer memory contention in
the common case where it is not needed.  But this -still- isn't good
enough, because idle CPUs will never become aware of the new grace
period -- by design, as they are supposed to be able to sleep through an
arbitrary number of grace periods.

OK, so we could have some sort of indication of when all non-idle CPUs
become aware of the new grace period.  But there could be races where an
idle CPU suddenly becomes non-idle just after it was reported that all
of the non-idle CPUs were aware of the grace period.  This would result
in a hang, because this newly non-idle CPU might not have noticed the
new grace period at the time that synchronize_sched_expedited() hammers
it, which would mean that it would refuse to report the resulting
quiescent state.

OK, so the grace-period kthread could track and report the set of CPUs
that had ever been idle since synchronize_sched_expedited() contacted
it.  But holy overhead, Batman!!!

And that is just one of the possible interactions with the grace-period
kthread.  It might be in the middle of setting up a new grace period.
It might be in the middle of cleaning up after the last grace period.
It might be waiting for a grace period to complete, with the last
quiescent state just reported but not yet propagated all the way up.
All of these would need to be handled correctly, and a number of them
would be as messy as the above scenario.  Some might be even messier.

I feel like there is a much easier way, but cannot yet articulate it.
I have come across a couple of complications and a blind alley with it
thus far, but it still looks promising.
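To make the "CPUs are unaware of the new grace period" problem above
concrete, here is a toy sketch under invented names (cpu_state,
note_new_gp, report_qs_if_valid -- none of these are the real kernel
interfaces): a CPU only credits a quiescent state against a grace period
it has already noticed, so hammering it before it notices the new grace
period accomplishes nothing.

/*
 * Toy illustration (not kernel code) of why hammering CPUs too early is
 * useless: a CPU only credits a quiescent state to a grace period it
 * has already noticed.  The real bookkeeping lives in rcu_data/rcu_node
 * in kernel/rcu/tree.c.
 */
#include <stdio.h>
#include <stdbool.h>

struct cpu_state {
	unsigned long gp_seen;	/* latest GP number this CPU has noticed */
};

static unsigned long current_gp;	/* GP number set by the GP kthread */

/* Called (eventually) on each CPU, e.g. from the scheduling-clock tick. */
static void note_new_gp(struct cpu_state *cs)
{
	cs->gp_seen = current_gp;
}

/* The "hammer": try to make this CPU report a quiescent state now. */
static bool report_qs_if_valid(struct cpu_state *cs)
{
	if (cs->gp_seen != current_gp) {
		/*
		 * The CPU cannot prove its quiescent state happened after
		 * this GP started, so the report must be discarded.
		 */
		printf("QS discarded: CPU saw GP %lu, current is %lu\n",
		       cs->gp_seen, current_gp);
		return false;
	}
	printf("QS counted against GP %lu\n", current_gp);
	return true;
}

int main(void)
{
	struct cpu_state cpu0 = { .gp_seen = 0 };

	current_gp = 1;			/* GP kthread starts a new GP... */
	report_qs_if_valid(&cpu0);	/* ...but CPU 0 hasn't noticed: wasted */
	note_new_gp(&cpu0);		/* CPU 0 notices at its next tick */
	report_qs_if_valid(&cpu0);	/* now the QS counts */
	return 0;
}

Closing that window for idle CPUs, which by design never execute
anything like note_new_gp() while idle, is exactly the part that turns
Rube Goldberg in the scenarios above.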
I expect to be able to generate actual code for it within a few days,
but right now it is just weird abstract shapes in my head.  (Sorry, if I
knew how to describe them, I could just write the code!  When I do write
the code, it will probably seem obvious and trivial, that being the
usual outcome...)

							Thanx, Paul