Date: Wed, 19 Apr 2017 08:08:09 -0700
From: "Paul E. McKenney"
Reply-To: paulmck@linux.vnet.ibm.com
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org, jiangshanlai@gmail.com,
	dipankar@in.ibm.com, akpm@linux-foundation.org,
	mathieu.desnoyers@efficios.com, josh@joshtriplett.org,
	tglx@linutronix.de, rostedt@goodmis.org, dhowells@redhat.com,
	edumazet@google.com, fweisbec@gmail.com, oleg@redhat.com,
	bobby.prani@gmail.com
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick
Message-Id: <20170419150809.GL3956@linux.vnet.ibm.com>
In-Reply-To: <20170419134835.bpuhurle2jjr66hm@hirez.programming.kicks-ass.net>
References: <20170413161948.ymvzlzhporgmldvn@hirez.programming.kicks-ass.net>
	<20170413165516.GI3956@linux.vnet.ibm.com>
	<20170413170434.xk4zq3p75pu3ubxw@hirez.programming.kicks-ass.net>
	<20170413173100.GL3956@linux.vnet.ibm.com>
	<20170413174631.56ycg545gwbsb4q2@hirez.programming.kicks-ass.net>
	<20170413181926.GP3956@linux.vnet.ibm.com>
	<20170413182309.vmyivo3oqrtfhhxt@hirez.programming.kicks-ass.net>
	<20170413184232.GQ3956@linux.vnet.ibm.com>
	<20170419132226.yvo3jyweb3d2a632@hirez.programming.kicks-ass.net>
	<20170419134835.bpuhurle2jjr66hm@hirez.programming.kicks-ass.net>

On Wed, Apr 19, 2017 at 03:48:35PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 19, 2017 at 03:22:26PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 13, 2017 at 11:42:32AM -0700, Paul E. McKenney wrote:
> > 
> > > I believe that you are missing the fact that RCU grace-period
> > > initialization and cleanup walks through the rcu_node tree breadth
> > > first, using rcu_for_each_node_breadth_first().
> > 
> > Indeed. That is the part I completely missed.
> > 
> > > This macro (shown below)
> > > implements this breadth-first walk using a simple sequential traversal of
> > > the ->node[] array that provides the structures making up the rcu_node
> > > tree.  As you can see, this scan is completely independent of how CPU
> > > numbers might be mapped to rcu_data slots in the leaf rcu_node structures.
> > 
> > So this code is clearly not a hotpath, but still its performance
> > matters?
> > 
> > Seems like you cannot win here :/
> 
> So I sort of see what that code does, but I cannot quite grasp from the
> comments near there _why_ it is doing this.
> 
> My thinking is that normal (active) CPUs will update their state at tick
> time through the tree, and once the state reaches the root node, IOW all
> CPUs agree they've observed that particular state, we advance the global
> state, rinse repeat. That's how tree-rcu works.
> 
> NOHZ-idle stuff would be excluded entirely; that is, if we're allowed to
> go idle we're up-to-date, and completely drop out of the state tracking.
> When we become active again, we can simply sync the CPU's state to the
> active state and go from there -- ignoring whatever happened in the
> mean-time.
> 
> So why do we have to do machine-wide updates? How can we get to the end
> of a grace period without all CPUs already agreeing that it's complete?
> 
> /me puzzled.

This is a decent overall summary of how RCU grace periods work, but
there are quite a few corner cases that complicate things.  In this
email, I will focus on just one of them, starting with CPUs returning
from NOHZ-idle state.

In theory, you are correct when you say that we could have CPUs sync up
with the current RCU state immediately upon return from idle.  In
practice, people are already screaming at me about the single CPU-local
atomic operation and memory barriers, so adding code on the idle-exit
fastpath to acquire the leaf rcu_node structure's lock and grab the
current state would do nothing but cause Marc Zyngier and many others
to report performance bugs to me.

And even that would not be completely sufficient.  After all, the state
in the leaf rcu_node structure will be out of date during grace-period
initialization and cleanup.  So to -completely- synchronize state for
the incoming CPU, I would have to acquire the root rcu_node structure's
lock and look at the live state.  Needless to say, the performance and
scalability implications of acquiring a global lock on each and every
idle-exit event are not going to be at all pretty.
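To make the "leaf can be stale" point concrete, here is a toy standalone
model -- emphatically not the kernel code, with every name in it
(model_propagate(), gpnum[], NUM_LEAVES, and so on) invented purely for
illustration.  If a CPU comes out of idle while a new grace-period
number is only partway through its propagation pass, reading only that
CPU's own leaf yields a stale answer, which is why a -complete- sync
would have to go all the way to the root:

/* Toy model: a root plus NUM_LEAVES leaves, stored in one array. */
#include <stdio.h>

#define NUM_LEAVES	4
#define NUM_NODES	(1 + NUM_LEAVES)	/* [0] is the root, [1..] are leaves */

static unsigned long gpnum[NUM_NODES];		/* GP number each node has seen */

/*
 * Propagate a new grace-period number node by node, but stop after
 * "steps" nodes, as if we caught grace-period initialization midway.
 */
static void model_propagate(unsigned long newgp, int steps)
{
	int i;

	for (i = 0; i < steps && i < NUM_NODES; i++)
		gpnum[i] = newgp;
}

int main(void)
{
	int cpu_leaf = NUM_NODES - 1;	/* the leaf our incoming CPU hangs off of */

	model_propagate(1, NUM_NODES);	/* GP 1: fully propagated */
	model_propagate(2, 2);		/* GP 2: has reached the root and one leaf */

	printf("root sees GP %lu, but the incoming CPU's leaf still sees GP %lu\n",
	       gpnum[0], gpnum[cpu_leaf]);
	return 0;
}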
This means that even non-idle CPUs must necessarily be allowed to have
differing opinions about which grace period is currently in effect.  We
simply cannot have total agreement on when a given grace period starts
or ends, because such agreement is just too expensive.  Therefore, when
a grace period begins, the grace-period kthread scans the rcu_node tree,
propagating this transition to each rcu_node structure.  And similarly
when a grace period ends.

Because the rcu_node tree is mapped into a dense array, and because the
scan proceeds in index order, the scan operation is pretty much best-case
for the cache hardware.  But on large machines with large cache-miss
latencies, it can still inflict a bit of pain -- almost all of which has
been addressed by the switch to grace-period kthreads.  (A toy model of
this scan appears below the signature.)

Hey, you asked!!!  ;-)

							Thanx, Paul
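------------------------------------------------------------------------

For reference, here is a toy standalone model of the breadth-first scan
itself -- again not the kernel code, and every name in it
(model_rcu_node, model_for_each_node_breadth_first(), and friends) is
invented for illustration.  Because the tree is laid out level by level
in one dense array, the "breadth-first walk" is nothing more than a
linear scan of that array in index order, completely independent of how
CPUs happen to be mapped to the leaves:

/* Toy model: a two-level "rcu_node tree" stored breadth-first in one array. */
#include <stdio.h>

#define NUM_LEAVES	4
#define NUM_NODES	(1 + NUM_LEAVES)	/* root at [0], leaves at [1..4] */

struct model_rcu_node {
	unsigned long gpnum;	/* grace-period number this node has seen */
	int level;		/* 0 = root, 1 = leaf, for printing only */
};

static struct model_rcu_node node[NUM_NODES];

/*
 * The moral equivalent of rcu_for_each_node_breadth_first(): because the
 * tree lives level by level in node[], the breadth-first walk is just a
 * linear scan of the array, which the cache hardware loves.
 */
#define model_for_each_node_breadth_first(rnp) \
	for ((rnp) = &node[0]; (rnp) < &node[NUM_NODES]; (rnp)++)

/* What the grace-period kthread does, in miniature: push out a new GP. */
static void model_gp_init(unsigned long newgp)
{
	struct model_rcu_node *rnp;

	model_for_each_node_breadth_first(rnp)
		rnp->gpnum = newgp;	/* root first, then each leaf in turn */
}

int main(void)
{
	struct model_rcu_node *rnp;
	int i;

	node[0].level = 0;
	for (i = 1; i < NUM_NODES; i++)
		node[i].level = 1;

	model_gp_init(42);

	/* Note that nothing above cared which CPU maps to which leaf. */
	model_for_each_node_breadth_first(rnp)
		printf("node[%td] level=%d gpnum=%lu\n",
		       rnp - node, rnp->level, rnp->gpnum);
	return 0;
}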