Date: Wed, 7 Oct 2015 09:48:58 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org, jiangshanlai@gmail.com,
	dipankar@in.ibm.com, akpm@linux-foundation.org,
	mathieu.desnoyers@efficios.com, josh@joshtriplett.org,
	tglx@linutronix.de, rostedt@goodmis.org, dhowells@redhat.com,
	edumazet@google.com, dvhart@linux.intel.com, fweisbec@gmail.com,
	oleg@redhat.com, bobby.prani@gmail.com
Subject: Re: [PATCH tip/core/rcu 02/18] rcu: Move rcu_report_exp_rnp() to allow consolidation
Message-ID: <20151007164858.GQ3910@linux.vnet.ibm.com>
In-Reply-To: <20151007144024.GI17308@twins.programming.kicks-ass.net>
References: <20151006162907.GA12020@linux.vnet.ibm.com>
 <1444148977-14108-1-git-send-email-paulmck@linux.vnet.ibm.com>
 <1444148977-14108-2-git-send-email-paulmck@linux.vnet.ibm.com>
 <20151006202937.GX3604@twins.programming.kicks-ass.net>
 <20151006205850.GW3910@linux.vnet.ibm.com>
 <20151007075114.GW2881@worktop.programming.kicks-ass.net>
 <20151007143325.GF3910@linux.vnet.ibm.com>
 <20151007144024.GI17308@twins.programming.kicks-ass.net>

On Wed, Oct 07, 2015 at 04:40:24PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 07, 2015 at 07:33:25AM -0700, Paul E. McKenney wrote:
> > > I'm sure you know what that means, but I've no clue ;-) That is, I
> > > wouldn't know where to start looking in the RCU implementation to
> > > verify the barrier is either needed or sufficient.  Unless you mean
> > > _everywhere_ :-)
> >
> > Pretty much everywhere.
> >
> > Let's take the usual RCU removal pattern as an example:
> >
> > 	void f1(struct foo *p)
> > 	{
> > 		list_del_rcu(&p->next);
> > 		synchronize_rcu_expedited();
> > 		kfree(p);
> > 	}
> >
> > 	void f2(void)
> > 	{
> > 		struct foo *p;
> >
> > 		list_for_each_entry_rcu(p, &my_head, next)
> > 			do_something_with(p);
> > 	}
> >
> > So the synchronize_rcu_expedited() acts as an extremely heavyweight
> > memory barrier that pairs with the rcu_dereference() inside of
> > list_for_each_entry_rcu().  Easy enough, right?
> >
> > But what exactly within synchronize_rcu_expedited() provides the
> > ordering?  The answer is a web of lock-based critical sections and
> > explicit memory barriers, with the one you called out as needing
> > a comment being one of them.
>
> Right, but seeing there are possible implementations of sync_rcu(_exp)*()
> that do not have the whole rcu_node tree-like thing, there's more to
> this particular barrier than the semantics of sync_rcu().
>
> Some implementation choice requires this barrier upgrade -- and in
> another email I suggest it's the whole tree thing, we need to firmly
> establish the state of one level before propagating the state up, etc.
> Now I'm not entirely sure this is fully correct, but it's the best I
> could come up with.

It is pretty close.  Ignoring dyntick idle for the moment, things go
(very) roughly like this:

o	The RCU grace-period kthread notices that a new grace period
	is needed.  It initializes the tree, which includes acquiring
	every rcu_node structure's ->lock.

o	CPU A notices that there is a new grace period.  It acquires
	the ->lock of its leaf rcu_node structure, which forces full
	ordering against the grace-period kthread.

o	Some time later, that CPU A realizes that it has passed through
	a quiescent state, and again acquires its leaf rcu_node
	structure's ->lock, again enforcing full ordering, but this
	time against all CPUs corresponding to this same leaf rcu_node
	structure that previously noticed quiescent states for this
	same grace period.  Also against all prior readers on this
	same CPU.

o	Some time later, CPU B (corresponding to that same leaf
	rcu_node structure) is the last of that leaf's group of CPUs
	to notice a quiescent state.  It has also acquired that leaf's
	->lock, again forcing ordering against its prior RCU read-side
	critical sections, but also against all the prior RCU
	read-side critical sections of all other CPUs corresponding
	to this same leaf.

o	CPU B therefore moves up the tree, acquiring the parent
	rcu_node structures' ->lock.  In so doing, it forces full
	ordering against all prior RCU read-side critical sections
	of all CPUs corresponding to all leaf rcu_node structures
	subordinate to the current (non-leaf) rcu_node structure.

o	And so on, up the tree.

o	When CPU C reaches the root of the tree, and realizes that
	it is the last CPU to report a quiescent state for the
	current grace period, its acquisition of the root rcu_node
	structure's ->lock has forced full ordering against all RCU
	read-side critical sections that started before this grace
	period -- on all CPUs.  CPU C therefore awakens the
	grace-period kthread.

o	When the grace-period kthread wakes up, it does cleanup,
	which (you guessed it!) requires acquiring the ->lock of
	each rcu_node structure.  This not only forces full ordering
	against each pre-existing RCU read-side critical section, it
	also sets up things so that...

o	When CPU D notices that the grace period ended, it does so
	while holding its leaf rcu_node structure's ->lock.  This
	forces full ordering against all relevant RCU read-side
	critical sections.  This ordering prevails when CPU D later
	starts invoking RCU callbacks.

o	Just for fun, suppose that one of those callbacks does an
	"smp_store_release(&leak_gp, 1)".  Suppose further that some
	CPU E that is not yet aware that the grace period is finished
	does an "r1 = smp_load_acquire(&leak_gp)" and gets 1.  Even if
	CPU E was the very first CPU to report a quiescent state for
	the grace period, and even if CPU E has not executed any sort
	of ordering operations since, CPU E's subsequent code is
	-still- guaranteed to be fully ordered after each and every
	RCU read-side critical section that started before the grace
	period.

Hey, you asked!!!  ;-)

Again, this is a cartoon-like view of the ordering that leaves out a
lot of details, but it should get across the gist of the ordering.

							Thanx, Paul