Date: Fri, 26 Jun 2015 09:14:28 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra
Cc: Oleg Nesterov, tj@kernel.org, mingo@redhat.com,
	linux-kernel@vger.kernel.org, der.herr@hofr.at, dave@stgolabs.net,
	riel@redhat.com, viro@ZenIV.linux.org.uk, torvalds@linux-foundation.org
Subject: Re: [RFC][PATCH 12/13] stop_machine: Remove lglock
Message-ID: <20150626161415.GY3717@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
In-Reply-To: <20150626123207.GZ19282@twins.programming.kicks-ass.net>

On Fri, Jun 26, 2015 at 02:32:07PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 25, 2015 at 07:51:46AM -0700, Paul E. McKenney wrote:
> > > So please humour me and explain how all this is far more complicated ;-)
> >
> > Yeah, I do need to get RCU design/implementation documentation put together.
> >
> > In the meantime, RCU's normal grace-period machinery is designed to be
> > quite loosely coupled.  The idea is that almost all actions occur locally,
> > reducing contention and cache thrashing.  But an expedited grace period
> > needs tight coupling in order to be able to complete quickly.  Making
> > something that switches between loose and tight coupling in short order
> > is not at all simple.
>
> But expedited just means faster, we never promised that
> sync_rcu_expedited is the absolute fastest primitive ever.

Which is good, because given that it is doing something to each and
every CPU, it most assuredly won't in any way resemble the absolute
fastest primitive ever.  ;-)

> So I really should go read the RCU code I suppose, but I don't get
> what's wrong with starting a forced quiescent state, then doing the
> stop_work spray, where each work will run the regular RCU tick thing to
> push it forwards.
>
> From my feeble memories, what I remember is that the last cpu to
> complete a GP on a leaf node will push the completion up to the next
> level, until at last we've reached the root of your tree and we can
> complete the GP globally.

That is true: the task that notices the last required quiescent state
will push up the tree and notice that the grace period has ended.  If
that task is not the grace-period kthread, it will then awaken the
grace-period kthread.
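As a rough, single-threaded sketch of that leaf-to-root propagation (not
the actual kernel code -- the real rcu_node tree and its locking live in
kernel/rcu/tree.c, and the names qs_node, report_qs, and the tree shape
below are invented for illustration):

/*
 * Standalone sketch (not kernel code) of the leaf-to-root propagation
 * described above: each tree node tracks a mask of not-yet-quiescent
 * children, and the task that clears the last bit in the root observes
 * the end of the grace period.
 */
#include <stdio.h>

#define NR_CPUS		8
#define CPUS_PER_LEAF	4
#define NR_LEAVES	(NR_CPUS / CPUS_PER_LEAF)

struct qs_node {
	unsigned long qsmask;	/* bits for children still owing a QS */
	struct qs_node *parent;	/* NULL for the root */
	int grpnum;		/* this node's bit within its parent */
};

static struct qs_node root;
static struct qs_node leaves[NR_LEAVES];

static void tree_init(void)
{
	int i;

	root.qsmask = (1UL << NR_LEAVES) - 1;
	root.parent = NULL;
	for (i = 0; i < NR_LEAVES; i++) {
		leaves[i].qsmask = (1UL << CPUS_PER_LEAF) - 1;
		leaves[i].parent = &root;
		leaves[i].grpnum = i;
	}
}

/* Report a quiescent state for one CPU, pushing up while masks empty. */
static void report_qs(int cpu)
{
	struct qs_node *np = &leaves[cpu / CPUS_PER_LEAF];
	int bit = cpu % CPUS_PER_LEAF;

	for (;;) {
		np->qsmask &= ~(1UL << bit);
		if (np->qsmask)
			return;		/* siblings still owe a QS */
		if (!np->parent) {
			/* Last QS reached the root: GP is over. */
			printf("CPU %d ended the grace period; wake GP kthread\n",
			       cpu);
			return;
		}
		bit = np->grpnum;	/* clear our bit one level up */
		np = np->parent;
	}
}

int main(void)
{
	int cpu;

	tree_init();
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		report_qs(cpu);
	return 0;
}

The point of this structure is that the only globally visible event is
the final clearing of the root's mask; everything else stays local to a
single node, which is the loose coupling mentioned above.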
> To me it just makes more sense to have a single RCU state machine. With
> expedited we'll push it as fast as we can, but no faster.

Suppose that someone invokes synchronize_sched_expedited(), but there is
no normal grace period in flight.  Then each CPU will note its own
quiescent state, but when it later tries to push it up the tree, it will
see that there is no grace period in effect, and will therefore not
bother.

OK, we could have synchronize_sched_expedited() tell the grace-period
kthread to start a grace period if one was not already in progress.  But
that still isn't good enough, because the grace-period kthread will take
some time to initialize the new grace period, and if we hammer all the
CPUs before the initialization is complete, the resulting quiescent
states cannot be counted against the new grace period.  (The reason for
this is that there is some delay between the actual quiescent state and
the time that it is reported, so we have to be very careful not to
incorrectly report a quiescent state from an earlier grace period
against the current grace period.)

OK, the grace-period kthread could tell synchronize_sched_expedited()
when it has finished initializing the grace period, though this is
starting to get a bit on the Rube Goldberg side.  But this -still- is
not good enough, because even though the grace-period kthread has fully
initialized the new grace period, the individual CPUs are unaware of it.
They will therefore continue to ignore any quiescent state that they
encounter, because they cannot prove that it actually happened after the
start of the current grace period.

OK, we could have some sort of indication of when all CPUs become aware
of the new grace period by having them atomically manipulate a global
counter.  Presumably we would also have some flag indicating when this
is and is not needed, so that we avoid the killer memory contention in
the common case where it is not needed.  But this -still- isn't good
enough, because idle CPUs will never become aware of the new grace
period -- by design, as they are supposed to be able to sleep through an
arbitrary number of grace periods.

OK, so we could have some sort of indication of when all non-idle CPUs
become aware of the new grace period.  But there could be races where an
idle CPU suddenly becomes non-idle just after it was reported that all
of the non-idle CPUs were aware of the grace period.  This would result
in a hang, because this newly non-idle CPU might not have noticed the
new grace period at the time that synchronize_sched_expedited() hammers
it, which would mean that it would refuse to report the resulting
quiescent state.

OK, so the grace-period kthread could track and report the set of CPUs
that had ever been idle since synchronize_sched_expedited() contacted
it.  But holy overhead, Batman!!!

And that is just one of the possible interactions with the grace-period
kthread.  It might be in the middle of setting up a new grace period.
It might be in the middle of cleaning up after the last grace period.
It might be waiting for a grace period to complete, with the last
quiescent state just reported but not yet propagated all the way up.
All of these would need to be handled correctly, and a number of them
would be as messy as the above scenario.  Some might be even messier.

I feel like there is a much easier way, but cannot yet articulate it.
I have come across a couple of complications and a blind alley with it
thus far, but it still looks promising.
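To make the "CPUs are unaware of the new grace period" problem above
concrete, here is a toy sketch under invented names (cpu_state,
note_new_gp, report_qs_if_valid -- none of these are the real kernel
interfaces): a CPU only credits a quiescent state against a grace period
it has already noticed, so hammering it before it notices the new grace
period accomplishes nothing.

/*
 * Toy illustration (not kernel code) of why hammering CPUs too early is
 * useless: a CPU only credits a quiescent state to a grace period it
 * has already noticed.  The real bookkeeping lives in rcu_data/rcu_node
 * in kernel/rcu/tree.c.
 */
#include <stdio.h>
#include <stdbool.h>

struct cpu_state {
	unsigned long gp_seen;	/* latest GP number this CPU has noticed */
};

static unsigned long current_gp;	/* GP number set by the GP kthread */

/* Called (eventually) on each CPU, e.g. from the scheduling-clock tick. */
static void note_new_gp(struct cpu_state *cs)
{
	cs->gp_seen = current_gp;
}

/* The "hammer": try to make this CPU report a quiescent state now. */
static bool report_qs_if_valid(struct cpu_state *cs)
{
	if (cs->gp_seen != current_gp) {
		/*
		 * The CPU cannot prove its quiescent state happened after
		 * this GP started, so the report must be discarded.
		 */
		printf("QS discarded: CPU saw GP %lu, current is %lu\n",
		       cs->gp_seen, current_gp);
		return false;
	}
	printf("QS counted against GP %lu\n", current_gp);
	return true;
}

int main(void)
{
	struct cpu_state cpu0 = { .gp_seen = 0 };

	current_gp = 1;			/* GP kthread starts a new GP... */
	report_qs_if_valid(&cpu0);	/* ...but CPU 0 hasn't noticed: wasted */
	note_new_gp(&cpu0);		/* CPU 0 notices at its next tick */
	report_qs_if_valid(&cpu0);	/* now the QS counts */
	return 0;
}

Closing that window for idle CPUs, which by design never execute
anything like note_new_gp() while idle, is exactly the part that turns
Rube Goldberg in the scenarios above.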
I expect to be able to generate actual code for it within a few days,
but right now it is just weird abstract shapes in my head.  (Sorry, if I
knew how to describe them, I could just write the code!  When I do write
the code, it will probably seem obvious and trivial, that being the
usual outcome...)

							Thanx, Paul