Date: Thu, 25 Jun 2015 13:07:34 +0200
From: Peter Zijlstra
To: "Paul E. McKenney"
Cc: Oleg Nesterov, tj@kernel.org, mingo@redhat.com,
    linux-kernel@vger.kernel.org, der.herr@hofr.at, dave@stgolabs.net,
    riel@redhat.com, viro@ZenIV.linux.org.uk, torvalds@linux-foundation.org
Subject: Re: [RFC][PATCH 12/13] stop_machine: Remove lglock
Message-ID: <20150625110734.GX3644@twins.programming.kicks-ass.net>
In-Reply-To: <20150625032303.GO3717@linux.vnet.ibm.com>

On Wed, Jun 24, 2015 at 08:23:17PM -0700, Paul E. McKenney wrote:
> Here is what I had in mind, where you don't have any global trashing
> except when the ->expedited_sequence gets updated.  Passes mild
> rcutorture testing.
>
> 	/*
> +	 * Each pass through the following loop works its way
> +	 * up the rcu_node tree, returning if others have done the
> +	 * work or otherwise falls through holding the root rnp's
> +	 * ->exp_funnel_mutex.  The mapping from CPU to rcu_node structure
> +	 * can be inexact, as it is just promoting locality and is not
> +	 * strictly needed for correctness.
> 	 */
> +	rnp0 = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
> +	for (; rnp0 != NULL; rnp0 = rnp0->parent) {
> +		if (sync_sched_exp_wd(rsp, rnp1, &rsp->expedited_workdone1, s))
> 			return;
> +		mutex_lock(&rnp0->exp_funnel_mutex);
> +		if (rnp1)
> +			mutex_unlock(&rnp1->exp_funnel_mutex);
> +		rnp1 = rnp0;
> +	}
> +	rnp0 = rnp1;  /* rcu_get_root(rsp), AKA root rcu_node structure. */
> +	if (sync_sched_exp_wd(rsp, rnp0, &rsp->expedited_workdone2, s))
> +		return;

I'm still somewhat confused by the whole strictly ordered sequence vs.
this non-ordered 'polling' of global state.

This funnel thing basically waits a random amount of time, depending on
the contention on these mutexes, and tries again, ultimately serializing
on the root funnel mutex (see the boiled-down sketch below).

So on the one hand you have to strictly order these expedited callers,
but then you don't want to actually process them in order. If 'by magic'
you manage to process the 3rd in the queue, you can drop the 2nd because
it will have waited long enough. OTOH the 2nd will have waited too long.

You also do not take the actual RCU state machine into account -- this
is a parallel state machine.

Can't we integrate the force-quiescent-state machinery with the
expedited machinery -- that is, instead of building parallel state, use
the expedited thing to push the regular machinery forward?
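Just to make sure we are talking about the same pattern, here it is
boiled down to a freestanding userspace toy: pthread mutexes standing in
for ->exp_funnel_mutex, a bare counter for ->expedited_sequence. All the
names below are made up; this is my restatement of the idiom, not your
patch.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct funnel_node {
	pthread_mutex_t lock;
	struct funnel_node *parent;		/* NULL at the root */
};

static pthread_mutex_t seq_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long done_seq;		/* highest sequence already completed */

static bool work_done(unsigned long snap)
{
	bool done;

	pthread_mutex_lock(&seq_lock);
	done = done_seq >= snap;
	pthread_mutex_unlock(&seq_lock);
	return done;
}

/* Returns true if we did the work ourselves, false if somebody beat us. */
static bool funnel_and_do_work(struct funnel_node *leaf, unsigned long snap)
{
	struct funnel_node *held = NULL, *node;
	bool did_work = false;

	for (node = leaf; node; node = node->parent) {
		if (work_done(snap)) {		/* somebody else finished it */
			if (held)
				pthread_mutex_unlock(&held->lock);
			return false;
		}
		pthread_mutex_lock(&node->lock);
		if (held)
			pthread_mutex_unlock(&held->lock);
		held = node;			/* hand-over-hand towards the root */
	}

	/* Only one caller at a time gets here, holding the root's lock. */
	if (!work_done(snap)) {
		pthread_mutex_lock(&seq_lock);
		done_seq = snap;		/* stand-in for the actual work */
		pthread_mutex_unlock(&seq_lock);
		did_work = true;
	}
	pthread_mutex_unlock(&held->lock);
	return did_work;
}

int main(void)
{
	struct funnel_node root = { .parent = NULL };
	struct funnel_node leaf = { .parent = &root };

	pthread_mutex_init(&root.lock, NULL);
	pthread_mutex_init(&leaf.lock, NULL);

	printf("first caller did the work:  %d\n", funnel_and_do_work(&leaf, 1));
	printf("second caller did the work: %d\n", funnel_and_do_work(&leaf, 1));
	return 0;
}

A caller that loses a race at some intermediate node simply sits on that
node's mutex for however long its current owner needs, which is where
the random wait above comes from; only the caller that reaches the root
with the work still undone actually does it.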
We can use the stop_machine calls to force the local RCU state forward;
after all, we _know_ we just made a context switch into the stopper
thread. All we need to do is disable interrupts to hold off the tick
(which normally drives the state machine) and unconditionally advance
our state.

If we use the regular GP machinery, we also don't have to strongly order
the callers: just stick them on whatever GP was active when they came in
and let them roll. This allows much better (and more natural) concurrent
processing.
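Something like the below is what I have in mind -- a completely untested
sketch. I'm assuming rcu_sched_qs() (or whatever the right "note a
quiescent state on this CPU" helper is) can be used from the stopper;
sync_sched_exp_stop_cpu() and sync_sched_expedited_push() are made-up
names:

#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/irqflags.h>
#include <linux/rcupdate.h>
#include <linux/stop_machine.h>

/*
 * Runs in the stopper thread; getting here already implied a context
 * switch on this CPU, so with interrupts off (holding off the tick,
 * which normally drives the state machine) we can note the quiescent
 * state directly.
 */
static int sync_sched_exp_stop_cpu(void *unused)
{
	unsigned long flags;

	local_irq_save(flags);		/* hold off the tick */
	rcu_sched_qs();			/* assumed QS-reporting helper */
	local_irq_restore(flags);
	return 0;
}

/* Push every online CPU through a quiescent state via the stopper. */
static void sync_sched_expedited_push(void)
{
	int cpu;

	get_online_cpus();
	for_each_online_cpu(cpu)
		stop_one_cpu(cpu, sync_sched_exp_stop_cpu, NULL);
	put_online_cpus();
}

The regular GP machinery then notices these quiescent states and
completes the grace period; the expedited path only accelerates the
existing machinery instead of keeping its own parallel bookkeeping.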