Date: Tue, 30 Jun 2015 14:32:58 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra
Cc: Oleg Nesterov, tj@kernel.org, mingo@redhat.com,
	linux-kernel@vger.kernel.org, der.herr@hofr.at, dave@stgolabs.net,
	riel@redhat.com, viro@ZenIV.linux.org.uk, torvalds@linux-foundation.org
Subject: Re: [RFC][PATCH 12/13] stop_machine: Remove lglock
Message-ID: <20150630213258.GO3717@linux.vnet.ibm.com>
In-Reply-To: <20150629075645.GD19282@twins.programming.kicks-ass.net>

On Mon, Jun 29, 2015 at 09:56:46AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 26, 2015 at 09:14:28AM -0700, Paul E. McKenney wrote:
> > > To me it just makes more sense to have a single RCU state machine. With
> > > expedited we'll push it as fast as we can, but no faster.
> >
> > Suppose that someone invokes synchronize_sched_expedited(), but there
> > is no normal grace period in flight.  Then each CPU will note its own
> > quiescent state, but when it later might have tried to push it up the
> > tree, it will see that there is no grace period in effect, and will
> > therefore not bother.
>
> Right, I did mention the force grace period machinery to make sure we
> start one before poking :-)

Fair enough...

> > OK, we could have synchronize_sched_expedited() tell the grace-period
> > kthread to start a grace period if one was not already in progress.
>
> I had indeed forgotten that got farmed out to the kthread; on which, my
> poor desktop seems to have spend ~140 minutes of its (most recent)
> existence poking RCU things.
>
>     7 root      20   0       0      0      0 S   0.0  0.0  56:34.66 rcu_sched
>     8 root      20   0       0      0      0 S   0.0  0.0  20:58.19 rcuos/0
>     9 root      20   0       0      0      0 S   0.0  0.0  18:50.75 rcuos/1
>    10 root      20   0       0      0      0 S   0.0  0.0  18:30.62 rcuos/2
>    11 root      20   0       0      0      0 S   0.0  0.0  17:33.24 rcuos/3
>    12 root      20   0       0      0      0 S   0.0  0.0   2:43.54 rcuos/4
>    13 root      20   0       0      0      0 S   0.0  0.0   3:00.31 rcuos/5
>    14 root      20   0       0      0      0 S   0.0  0.0   3:09.27 rcuos/6
>    15 root      20   0       0      0      0 S   0.0  0.0   2:52.98 rcuos/7
>
> Which is almost as much time as my konsole:
>
>  2853 peterz    20   0  586240 103664  41848 S   1.0  0.3 147:39.50 konsole
>
> Which seems somewhat excessive. But who knows.

No idea.  How long has that system been up?  What has it been doing?

The rcu_sched overhead is expected behavior if the system has run
somewhere between ten and one hundred million grace periods, give or
take an order of magnitude depending on the number of idle CPUs and so
on.  The overhead for the RCU offload (rcuos) kthreads is what it is:
a kfree() takes as much time as a kfree() does, and those times are
all nicely counted up for you.
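Just to put rough numbers on that, taking the 56:34 of rcu_sched CPU
time above at face value:

	56:34 of CPU time   ~= 3,400 seconds
	3,400 s / 10^7 GPs  ~= 340 microseconds per grace period
	3,400 s / 10^8 GPs  ~=  34 microseconds per grace period

In other words, a few tens to a few hundreds of microseconds of
rcu_sched time per grace period.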
> > OK, the grace-period kthread could tell synchronize_sched_expedited()
> > when it has finished initializing the grace period, though this is
> > starting to get a bit on the Rube Goldberg side.  But this -still- is
> > not good enough, because even though the grace-period kthread has fully
> > initialized the new grace period, the individual CPUs are unaware of it.
>
> Right, so over the weekend -- I had postponed reading this rather long
> email for I was knackered -- I had figured that because we trickle the
> GP completion up, you probably equally trickle the GP start down of
> sorts and there might be 'interesting' things there.

The GP completion trickles both up and down, though the down part
shouldn't matter in this case.

> > And they will therefore continue to ignore any quiescent state that they
> > encounter, because they cannot prove that it actually happened after
> > the start of the current grace period.
>
> Right, badness :-)
>
> Although here I'll once again go ahead and say something ignorant; how
> come that's a problem? Surely if we know the kthread thing has finished
> starting a GP, any one CPU issuing a full memory barrier (as would be
> implied by switching to the stop worker) must then indeed observe that
> global state? due to that transitivity thing.
>
> That is, I'm having a wee bit of bother for seeing how you'd need
> manipulation of global variables as you elude to below.

Well, I thought that you wanted to leverage the combining tree to
determine when the grace period had completed.  If a given CPU isn't
pushing its quiescent states up the combining tree, then the combining
tree can't do much for you.

> > But this -still- isn't good enough, because
> > idle CPUs never will become aware of the new grace period -- by design,
> > as they are supposed to be able to sleep through an arbitrary number of
> > grace periods.
>
> Yes, I'm sure. Waking up seems like a serializing experience though; but
> I suppose that's not good enough if we wake up right before we force
> start the GP.

That would indeed be one of the problems that could occur.  ;-)

> > I feel like there is a much easier way, but cannot yet articulate it.
> > I came across a couple of complications and a blind alley with it thus
> > far, but it still looks promising.  I expect to be able to generate
> > actual code for it within a few days, but right now it is just weird
> > abstract shapes in my head.  (Sorry, if I knew how to describe them,
> > I could just write the code!  When I do write the code, it will probably
> > seem obvious and trivial, that being the usual outcome...)
>
> Hehe, glad to have been of help :-)

Well, I do have something that seems reasonably straightforward.
Sending the patches along separately.  Not sure that it is worth its
weight.

The idea is that we keep the expedited grace periods working as they
do now, independently of the normal grace period.  The normal grace
period takes a sequence number just after initialization, and at the
beginning of each quiescent-state forcing episode it checks whether an
expedited grace period happened in the meantime.  This saves the last
one or two quiescent-state forcing scans in the case where an expedited
grace period really did happen.
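Very roughly, and with completely made-up names -- this is only meant
to show the shape of the check, not what the actual patches look like:

/* Hypothetical counter bumped at the end of each expedited GP. */
static unsigned long exp_gp_completed;

/* Last act of the expedited grace period (hypothetical hook). */
static void exp_gp_mark_completed(void)
{
	/* Updaters are serialized elsewhere; release-publish the bump. */
	smp_store_release(&exp_gp_completed, exp_gp_completed + 1);
}

/* Grace-period kthread, just after initializing a normal GP. */
static unsigned long exp_gp_snapshot(void)
{
	return smp_load_acquire(&exp_gp_completed);
}

/* Start of each quiescent-state forcing episode. */
static bool exp_gp_happened_since(unsigned long snap)
{
	/*
	 * If the counter moved, an expedited grace period happened
	 * after this normal grace period was initialized, so the
	 * last one or two forcing scans can be skipped.
	 */
	return smp_load_acquire(&exp_gp_completed) != snap;
}

The real thing of course has to tie into the combining tree and the
existing forcing machinery, but that is the general shape.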
It is possible for the expedited grace period to help things along by
waking up the grace-period kthread, but of course doing that too often
just further increases the time consumed by your rcu_sched kthread.
One compromise is to do the wakeup only every so many grace periods,
or at most once in a given interval of time, which is the approach the
last patch in the series takes.

I will be sending the series shortly, followed by a series for the
other portions of the expedited grace-period upgrade.

							Thanx, Paul