Date: Wed, 19 Apr 2017 08:08:09 -0700
From: "Paul E. McKenney"
Reply-To: paulmck@linux.vnet.ibm.com
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org, jiangshanlai@gmail.com,
	dipankar@in.ibm.com, akpm@linux-foundation.org,
	mathieu.desnoyers@efficios.com, josh@joshtriplett.org,
	tglx@linutronix.de, rostedt@goodmis.org, dhowells@redhat.com,
	edumazet@google.com, fweisbec@gmail.com, oleg@redhat.com,
	bobby.prani@gmail.com
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick
Message-Id: <20170419150809.GL3956@linux.vnet.ibm.com>
In-Reply-To: <20170419134835.bpuhurle2jjr66hm@hirez.programming.kicks-ass.net>
References: <20170413161948.ymvzlzhporgmldvn@hirez.programming.kicks-ass.net>
	<20170413165516.GI3956@linux.vnet.ibm.com>
	<20170413170434.xk4zq3p75pu3ubxw@hirez.programming.kicks-ass.net>
	<20170413173100.GL3956@linux.vnet.ibm.com>
	<20170413174631.56ycg545gwbsb4q2@hirez.programming.kicks-ass.net>
	<20170413181926.GP3956@linux.vnet.ibm.com>
	<20170413182309.vmyivo3oqrtfhhxt@hirez.programming.kicks-ass.net>
	<20170413184232.GQ3956@linux.vnet.ibm.com>
	<20170419132226.yvo3jyweb3d2a632@hirez.programming.kicks-ass.net>
	<20170419134835.bpuhurle2jjr66hm@hirez.programming.kicks-ass.net>

On Wed, Apr 19, 2017 at 03:48:35PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 19, 2017 at 03:22:26PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 13, 2017 at 11:42:32AM -0700, Paul E. McKenney wrote:
> > 
> > > I believe that you are missing the fact that RCU grace-period
> > > initialization and cleanup walks through the rcu_node tree breadth
> > > first, using rcu_for_each_node_breadth_first().
> > 
> > Indeed. That is the part I completely missed.
> > 
> > > This macro (shown below)
> > > implements this breadth-first walk using a simple sequential traversal of
> > > the ->node[] array that provides the structures making up the rcu_node
> > > tree.  As you can see, this scan is completely independent of how CPU
> > > numbers might be mapped to rcu_data slots in the leaf rcu_node structures.
> > 
> > So this code is clearly not a hotpath, but still its performance
> > matters?
> > 
> > Seems like you cannot win here :/
> 
> So I sort of see what that code does, but I cannot quite grasp from the
> comments near there _why_ it is doing this.
> 
> My thinking is that normal (active) CPUs will update their state at tick
> time through the tree, and once the state reaches the root node, IOW all
> CPUs agree they've observed that particular state, we advance the global
> state, rinse repeat. That's how tree-rcu works.
> 
> NOHZ-idle stuff would be excluded entirely; that is, if we're allowed to
> go idle we're up-to-date, and completely drop out of the state tracking.
> When we become active again, we can simply sync the CPU's state to the
> active state and go from there -- ignoring whatever happened in the
> mean-time.
> 
> So why do we have to do machine-wide updates? How can we get to the end
> of a grace period without all CPUs already agreeing that it's complete?
> 
> /me puzzled.

This is a decent overall summary of how RCU grace periods work, but
there are quite a few corner cases that complicate things.  In this
email, I will focus on just one of them, starting with CPUs returning
from NOHZ-idle state.

In theory, you are correct when you say that we could have CPUs sync up
with the current RCU state immediately upon return from idle.  In
practice, people are already screaming at me about the single CPU-local
atomic operation and memory barriers, so adding code on the idle-exit
fastpath to acquire the leaf rcu_node structure's lock and grab the
current state would do nothing but cause Marc Zyngier and many others
to report performance bugs to me.

And even that would not be completely sufficient.  After all, the state
in the leaf rcu_node structure will be out of date during grace-period
initialization and cleanup.  So to -completely- synchronize state for
the incoming CPU, I would have to acquire the root rcu_node structure's
lock and look at the live state.  Needless to say, the performance and
scalability implications of acquiring a global lock on each and every
idle-exit event are not going to be at all pretty.
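To make the "leaf can be stale" point concrete, here is a toy standalone
model -- emphatically not the kernel code, with every name in it
(model_propagate(), gpnum[], NUM_LEAVES, and so on) invented purely for
illustration.  If a CPU comes out of idle while a new grace-period
number is only partway through its propagation pass, reading only that
CPU's own leaf yields a stale answer, which is why a -complete- sync
would have to go all the way to the root:

/* Toy model: a root plus NUM_LEAVES leaves, stored in one array. */
#include <stdio.h>

#define NUM_LEAVES	4
#define NUM_NODES	(1 + NUM_LEAVES)	/* [0] is the root, [1..] are leaves */

static unsigned long gpnum[NUM_NODES];		/* GP number each node has seen */

/*
 * Propagate a new grace-period number node by node, but stop after
 * "steps" nodes, as if we caught grace-period initialization midway.
 */
static void model_propagate(unsigned long newgp, int steps)
{
	int i;

	for (i = 0; i < steps && i < NUM_NODES; i++)
		gpnum[i] = newgp;
}

int main(void)
{
	int cpu_leaf = NUM_NODES - 1;	/* the leaf our incoming CPU hangs off of */

	model_propagate(1, NUM_NODES);	/* GP 1: fully propagated */
	model_propagate(2, 2);		/* GP 2: has reached the root and one leaf */

	printf("root sees GP %lu, but the incoming CPU's leaf still sees GP %lu\n",
	       gpnum[0], gpnum[cpu_leaf]);
	return 0;
}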
This means that even non-idle CPUs must necessarily be allowed to have
differing opinions about which grace period is currently in effect.  We
simply cannot have total agreement on when a given grace period starts
or ends, because such agreement is just too expensive.  Therefore, when
a grace period begins, the grace-period kthread scans the rcu_node tree,
propagating this transition to each rcu_node structure.  And similarly
when a grace period ends.

Because the rcu_node tree is mapped into a dense array, and because the
scan proceeds in index order, the scan operation is pretty much best-case
for the cache hardware.  But on large machines with large cache-miss
latencies, it can still inflict a bit of pain -- almost all of which has
been addressed by the switch to grace-period kthreads.  (A toy model of
this scan appears below the signature.)

Hey, you asked!!!  ;-)

							Thanx, Paul
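------------------------------------------------------------------------

For reference, here is a toy standalone model of the breadth-first scan
itself -- again not the kernel code, and every name in it
(model_rcu_node, model_for_each_node_breadth_first(), and friends) is
invented for illustration.  Because the tree is laid out level by level
in one dense array, the "breadth-first walk" is nothing more than a
linear scan of that array in index order, completely independent of how
CPUs happen to be mapped to the leaves:

/* Toy model: a two-level "rcu_node tree" stored breadth-first in one array. */
#include <stdio.h>

#define NUM_LEAVES	4
#define NUM_NODES	(1 + NUM_LEAVES)	/* root at [0], leaves at [1..4] */

struct model_rcu_node {
	unsigned long gpnum;	/* grace-period number this node has seen */
	int level;		/* 0 = root, 1 = leaf, for printing only */
};

static struct model_rcu_node node[NUM_NODES];

/*
 * The moral equivalent of rcu_for_each_node_breadth_first(): because the
 * tree lives level by level in node[], the breadth-first walk is just a
 * linear scan of the array, which the cache hardware loves.
 */
#define model_for_each_node_breadth_first(rnp) \
	for ((rnp) = &node[0]; (rnp) < &node[NUM_NODES]; (rnp)++)

/* What the grace-period kthread does, in miniature: push out a new GP. */
static void model_gp_init(unsigned long newgp)
{
	struct model_rcu_node *rnp;

	model_for_each_node_breadth_first(rnp)
		rnp->gpnum = newgp;	/* root first, then each leaf in turn */
}

int main(void)
{
	struct model_rcu_node *rnp;
	int i;

	node[0].level = 0;
	for (i = 1; i < NUM_NODES; i++)
		node[i].level = 1;

	model_gp_init(42);

	/* Note that nothing above cared which CPU maps to which leaf. */
	model_for_each_node_breadth_first(rnp)
		printf("node[%td] level=%d gpnum=%lu\n",
		       rnp - node, rnp->level, rnp->gpnum);
	return 0;
}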