From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755452AbcGTVb2 (ORCPT ); Wed, 20 Jul 2016 17:31:28 -0400 Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:58484 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751641AbcGTVb0 (ORCPT ); Wed, 20 Jul 2016 17:31:26 -0400 X-IBM-Helo: d03dlp02.boulder.ibm.com X-IBM-MailFrom: paulmck@linux.vnet.ibm.com Date: Wed, 20 Jul 2016 14:31:44 -0700 From: "Paul E. McKenney" To: Oleg Nesterov Cc: Peter Zijlstra , mingo@kernel.org, linux-kernel@vger.kernel.org, tj@kernel.org, john.stultz@linaro.org, dimitrysh@google.com, romlem@google.com, ccross@google.com, tkjos@google.com Subject: Re: [PATCH] rcu_sync: simplify the state machine, introduce __rcu_sync_enter() Reply-To: paulmck@linux.vnet.ibm.com References: <20160714192018.GM30154@twins.programming.kicks-ass.net> <20160715132709.GA27644@redhat.com> <20160715133928.GH7094@linux.vnet.ibm.com> <20160715134523.GB28589@redhat.com> <20160715153840.GI7094@linux.vnet.ibm.com> <20160715164938.GA4294@redhat.com> <20160715180140.GM7094@linux.vnet.ibm.com> <20160716171007.GA30000@redhat.com> <20160719205018.GB7094@linux.vnet.ibm.com> <20160720171602.GA5059@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160720171602.GA5059@redhat.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-MML: disable X-Content-Scanned: Fidelis XPS MAILER x-cbid: 16072021-0016-0000-0000-0000043A4A65 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 16072021-0017-0000-0000-0000314C27AD Message-Id: <20160720213144.GM7094@linux.vnet.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2016-07-20_12:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=0 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1604210000 definitions=main-1607200238 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jul 20, 2016 at 07:16:03PM +0200, Oleg Nesterov wrote: > Paul, I had to switch to internal bugzillas after the first email, and now > I feel I can't read... I'll try to answer one question right now, tomorrow > I'll reread your email, probably I need to answer something else... I know that feeling of distraction! > On 07/19, Paul E. McKenney wrote: > > > > On Sat, Jul 16, 2016 at 07:10:07PM +0200, Oleg Nesterov wrote: > > > > > And, there is another possible transition, GP_ENTER -> GP_IDLE, because > > > not it is possible to call __rcu_sync_enter() and rcu_sync_exit() in any > > > state (except obviously they should be balanced), and they do not block. > ... > > If you feel strongly about allowing rcu_sync_exit() in GP_ENTER state, > > could you please tell me your use case? Or am I confused? > > See below, > > > And I think __rcu_sync_enter() can have more users. Let's look at > > freeze_super(). It calls percpu_down_write() 3 times, and waits for 3 GP's > > sequentally. > > > > > Now we can add 3 __rcu_sync_enter's at the start and 3 rcu_sync_exit's at > > > the end (actually we can do better, just to simplify). And again, note > > > that rcu_sync_exit() will work correctly even if we (say) return -EBUSY, > > > so rcu_sync_wait and/or percpu_down_write() was not called in between, > > > and in this case we won't block waiting for GP. > > > > > > > I am not going to claim to understand freeze_super(), but it does seem > > to have a fair amount of waiting. > > > > But yes, you could put rcu_sync_enter() > ^^^^^^^^^^^^^^ > __rcu_sync_enter, see below, > > > and rcu_sync_exit() before and > > after a series of write-side enter/exit pairs in order to force things > > to stay in writer mode, if that is what you are suggesting. > > No, no, this is not what I am trying to suggest. > > The problem is that freeze_super() takes 3 semaphores for writing in row, > this means that it needs to wait for 3 GP's sequentally, and it does this > with sb->s_umount held. This is just ugly. Slow, too. ;-) > OK, lets suppose it simply does > > freeze_super(sb) > { > down_write(&sb->s_umount); > if (NEED_TO_FREEZE) { > percpu_down_write(SEM1); > percpu_down_write(SEM2); > percpu_down_write(SEM3); > } > up_write(&sb->s_umount); > } > > and every percpu_down_write() waits for GP. > > Now, suppose we add the additional enter/exit's: > > freeze_super(sb) > { > // this doesn't block > __rcu_sync_enter(SEM3); > __rcu_sync_enter(SEM2); > __rcu_sync_enter(SEM1); > > down_write(&sb->s_umount); > if (NEED_TO_FREEZE) { > percpu_down_write(SEM1); The above waits for the grace period initiated by __rcu_sync_enter(), correct? Presumably "yes", because it will invoke rcu_sync_enter(), which will see the state as GP_ENTER, and will thus wait. Or am I confused? > percpu_down_write(SEM2); > percpu_down_write(SEM3); > } > up_write(&sb->s_umount); > > rcu_sync_exit(SEM1); But your point is that if !NEED_TO_FREEZE, we will get here without waiting for a grace period. But why aren't the __rcu_sync_enter() and rcu_sync_exit() calls inside the "if" statement? > rcu_sync_exit(SEM2); > rcu_sync_exit(SEM3); And the reason it is OK to invoke rcu_sync_exit() before doing the percpu_up_write() calls is that those calls will do their own rcu_sync_exit() calls, and the above calls really don't do anything other than decrement the count. > > } > > Again, actually we can do better, just to simplify. > > Now. the fisrt percpu_down_write(SEM1) can block waiting for GP or not, > this depends on how many time it spends in down_write(). > > But the 2nd and the 3rd percpu_down_write() most likely won't block, so > in the likely case freeze_super() will need a single GP pass. > > And note that NEED_TO_FREEZE can be false, in this case rcu_sync_exit() > will be called in GP_ENTER state. Understood, the above approach allows the three grace periods to elapse concurrently, at least assuming RCU is busy at the time. (If RCU is not busy, the first one gets its own grace period, and the other two share the next grace period.) But again why aren't the __rcu_sync_enter() and rcu_sync_exit() calls inside that "if" statement? That aside, would it make sense to name __rcu_sync_enter() something like rcu_sync_begin_to_enter(), rcu_sync_pre_enter() or some such? Something to make it clear that it just starts the job and that something else is needed to finish it. > To some degree, this is like get_state_synchronize_rcu/cond_synchronize_rcu. > But obviously percpu_down_write() can not use these helpers, and in this > particular case __rcu_sync_enter() is better because it forces the start > of GP pass. OK, I now understand the motivation, thank you for your patience! > ----------------------------------------------------------------------- > As for cgroups, we want to switch cgroup_threadgroup_rwsem into the > slow mode, at least for now. > > We could add the additional hooks/hacks into rcu/sync.c but why? We can > do this without any changes outside of cgroup.c right now, just add > rcu_sync_enter() into cgroup_init(). > > But we do not want to add a pointless synchronize_sched() at boot time, > __rcu_sync_enter() looks much better. I do like that approach much better than an rcu_sync_sabotage(). I still have concern with the __rcu_sync_enter() name, and would prefer something that clearly indicates that it starts the rcu_sync() entry job, but does not synchronously complete it. > ------------------------------------------------------------------------ > And even __cgroup_procs_write() could use __rcu_sync_enter(). But lets > ignore this for now. And here is an updated state table. I do not yet separately call out __rcu_sync_enter(), though without it the rcu_sync_exit() transition out of state B cannot happen. Thanx, Paul ------------------------------------------------------------------------ o "count" is the value of the rcu_sync structure's ->gp_count field. o "state" is the value of the rcu_sync structure's ->gp_state field. o "CB?" is "yes" if there is an RCU callback pending and "no" otherwise. | count | state | CB? | next state ---+-------+-----------+-----+------------------------------------ | | | | A | 0 | GP_IDLE | no | rcu_sync_enter() -> B (wait) | | | | rcu_sync_exit() -> illegal | | | | rcu_sync_dtor() -> H | | | | callback -> cannot happen | | | | ---+-------+-----------+-----+------------------------------------ | | | | B | 1+ | GP_ENTER | yes | rcu_sync_enter() -> B (wait) | | | | last rcu_sync_exit() -> J | | | | non-last rcu_sync_exit() -> B | | | | rcu_sync_dtor() -> illegal | | | | callback -> C (ends _enter() wait) | | | | ---+-------+-----------+-----+------------------------------------ | | | | C | 1+ | GP_PASSED | no | rcu_sync_enter() -> C | | | | last rcu_sync_exit() -> E | | | | non-last rcu_sync_exit() -> C | | | | rcu_sync_dtor() -> illegal | | | | callback -> cannot happen | | | | ---+-------+-----------+-----+------------------------------------ | | | | D | 0 | GP_REPLAY | yes | rcu_sync_enter() -> F | | | | rcu_sync_exit() -> illegal | | | | rcu_sync_dtor() -> I (wait ???) | | | | callback -> E | | | | ---+-------+-----------+-----+------------------------------------ | | | | E | 0 | GP_EXIT | yes | rcu_sync_enter() -> G | | | | rcu_sync_exit() -> illegal | | | | rcu_sync_dtor() -> I (wait) | | | | callback -> A | | | | ---+-------+-----------+-----+------------------------------------ | | | | F | 1+ | GP_REPLAY | yes | rcu_sync_enter() -> F | | | | last rcu_sync_exit() -> D | | | | non-last rcu_sync_exit() -> F | | | | rcu_sync_dtor() -> illegal | | | | callback -> C | | | | ---+-------+-----------+-----+------------------------------------ | | | | G | 1+ | GP_EXIT | yes | rcu_sync_enter() -> G | | | | last rcu_sync_exit() -> D | | | | non-last rcu_sync_exit() -> F | | | | rcu_sync_dtor() -> illegal | | | | callback -> C | | | | ---+-------+-----------+-----+------------------------------------ | | | | H | 0 | GP_IDLE | no | rcu_sync_enter() -> illegal | | | | rcu_sync_exit() -> illegal | | | | rcu_sync_dtor() -> illegal | | | | callback -> cannot happen | | | | ---+-------+-----------+-----+------------------------------------ | | | | I | 0 | GP_EXIT | yes | rcu_sync_enter() -> illegal | | | | rcu_sync_exit() -> illegal | | | | rcu_sync_dtor() -> illegal | | | | callback -> H (ends dtor() wait) ---+-------+-----------+-----+------------------------------------ | | | | J | 0 | GP_ENTER | yes | rcu_sync_enter() -> B (wait) | | | | rcu_sync_exit() -> illegal | | | | rcu_sync_dtor() -> illegal | | | | callback -> A | | | | Assumptions: o Initial state is A. o Final state is H. o There will never be enough unpaired rcu_sync_enter() calls to overflow ->gp_count. o All calls to rcu_sync_exit() must pair with a preceding call to rcu_sync_enter() by that same thread. o It is illegal to invoke rcu_sync_dtor() until after the caller has ensured that there will be no future calls to either rcu_sync_enter() or rcu_sync_exit(). o It is illegal to invoke rcu_sync_dtor() while there are any unpaired calls to rcu_sync_enter().