From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Sat, 14 Jan 2017 00:00:22 -0800
From: "Paul E. McKenney"
To: Borislav Petkov
Cc: linux-kernel@vger.kernel.org, lv.zheng@intel.com, stan.kain@gmail.com,
	waffolz@hotmail.com, josh@joshtriplett.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, jiangshanlai@gmail.com,
	mingo@kernel.org, torvalds@linux-foundation.org, rafael@kernel.org
Subject: Re: [PATCH] rcu: Narrow early boot window of illegal synchronous grace periods
Reply-To: paulmck@linux.vnet.ibm.com
References: <20170113023807.GA27120@linux.vnet.ibm.com> <20170113112519.q6tmariemhhk5e56@pd.tnic>
In-Reply-To: <20170113112519.q6tmariemhhk5e56@pd.tnic>
Message-Id: <20170114080022.GS5238@linux.vnet.ibm.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jan 13, 2017 at 12:25:19PM +0100, Borislav Petkov wrote:
> On Thu, Jan 12, 2017 at 06:38:07PM -0800, Paul E. McKenney wrote:
> > The current preemptible RCU implementation goes through three phases
> > during bootup.  In the first phase, there is only one CPU that is
> > running with preemption disabled, so that a no-op is a synchronous
> > grace period.  In the second mid-boot phase, the scheduler is running,
> > but RCU has not yet gotten its kthreads spawned (and, for expedited
> > grace periods, workqueues are not yet running).  During this time, any
> > attempt to do a synchronous grace period will hang the system (or
> > complain bitterly, depending).  In the third and final phase, RCU is
> > fully operational and everything works normally.
> >
> > This has been OK for some time, but there have recently been some
> > synchronous grace periods showing up during the second mid-boot phase.
>
> You probably should add the callchain from the thread as an example and
> for future reference:
>
> early_amd_iommu_init()
> |-> acpi_put_table(ivrs_base);
>  |-> acpi_tb_put_table(table_desc);
>   |-> acpi_tb_invalidate_table(table_desc);
>    |-> acpi_tb_release_table(...)
>     |-> acpi_os_unmap_memory
>      |-> acpi_os_unmap_iomem
>       |-> acpi_os_map_cleanup
>        |-> synchronize_rcu_expedited <-- the kernel/rcu/tree_exp.h version
>                                          with CONFIG_PREEMPT_RCU=y
>
> > This commit therefore reworks RCU to permit synchronous grace periods

It now looks like this:

------------------------------------------------------------------------

Note that the code was buggy even before this commit, as it was subject
to failure on real-time systems that forced all expedited grace periods
to run as normal grace periods (for example, using the rcu_normal
ksysfs parameter).
The callchain from the failure case is as follows:

early_amd_iommu_init()
|-> acpi_put_table(ivrs_base);
 |-> acpi_tb_put_table(table_desc);
  |-> acpi_tb_invalidate_table(table_desc);
   |-> acpi_tb_release_table(...)
    |-> acpi_os_unmap_memory
     |-> acpi_os_unmap_iomem
      |-> acpi_os_map_cleanup
       |-> synchronize_rcu_expedited

The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
which caused the code to try using workqueues before they were
initialized, which did not go well.

------------------------------------------------------------------------

Does that work?

> Just say "Rework RCU to ..."
>
> "This commit" and "This patch" and the such are not needed in commit
> messages.

Fair point, but this wording appears in almost all of my patches.  ;-)
My rationale is that it provides a clear transition from describing the
problem to introducing the solution.

> > to proceed during this mid-boot phase.
> >
> > This commit accomplishes this by setting a flag from the existing
> > rcu_scheduler_starting() function which causes all synchronous grace
> > periods to take the expedited path.  The expedited path now checks this
> > flag, using the requesting task to drive the expedited grace period
> > forward during the mid-boot phase.  Finally, this flag is updated by a
> > core_initcall() function named rcu_exp_runtime_mode(), which causes the
> > runtime codepaths to be used.
> >
> > Note that this arrangement assumes that tasks are not sent POSIX signals
> > (or anything similar) from the time that the first task is spawned
> > through core_initcall() time.
> >
> > Reported-by: "Zheng, Lv"
> > Reported-by: Borislav Petkov
> > Signed-off-by: Paul E. McKenney
> >
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 321f9ed552a9..01f71e1d2e94 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -444,6 +444,10 @@ bool __rcu_is_watching(void);
> >  #error "Unknown RCU implementation specified to kernel configuration"
> >  #endif
> >
> > +#define RCU_SCHEDULER_INACTIVE	0
> > +#define RCU_SCHEDULER_INIT	1
> > +#define RCU_SCHEDULER_RUNNING	2
> > +
> >  /*
> >   * init_rcu_head_on_stack()/destroy_rcu_head_on_stack() are needed for dynamic
> >   * initialization and destruction of rcu_head on the stack. rcu_head structures
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index 80adef7d4c3d..0d6ff3e471be 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -136,6 +136,7 @@ int rcu_jiffies_till_stall_check(void);
> >  #define TPS(x)  tracepoint_string(x)
> >
> >  void rcu_early_boot_tests(void);
> > +void rcu_test_sync_prims(void);
> >
> >  /*
> >   * This function really isn't for public consumption, but RCU is special in
> > diff --git a/kernel/rcu/tiny_plugin.h b/kernel/rcu/tiny_plugin.h
> > index 196f0302e2f4..e3953bdee383 100644
> > --- a/kernel/rcu/tiny_plugin.h
> > +++ b/kernel/rcu/tiny_plugin.h
> > @@ -65,7 +65,7 @@ EXPORT_SYMBOL_GPL(rcu_scheduler_active);
> >  void __init rcu_scheduler_starting(void)
> >  {
> >  	WARN_ON(nr_context_switches() > 0);
> > -	rcu_scheduler_active = 1;
> > +	rcu_scheduler_active = RCU_SCHEDULER_RUNNING;
>
> This tiny RCU version is setting _RUNNING while the
> kernel/rcu/tree.c-one is setting it to _INIT. The tiny bypasses the
> _INIT step now?
>
> I'm guessing because you've added a third state - the _RUNNING and
> tiny doesn't need the intermediary _INIT, it is being set straight to
> _RUNNING...

Exactly, but yes, worth a comment.  The header comment for
rcu_scheduler_starting() is now as follows:

	/*
	 * During boot, we forgive RCU lockdep issues.  After this function is
	 * invoked, we start taking RCU lockdep issues seriously.  Note that
	 * unlike Tree RCU, Tiny RCU transitions directly from
	 * RCU_SCHEDULER_INACTIVE to RCU_SCHEDULER_RUNNING, skipping the
	 * RCU_SCHEDULER_INIT stage.  The reason for this is that Tiny RCU
	 * does not need kthreads, so does not have to care about the fact
	 * that the scheduler is half-initialized at a certain phase of the
	 * boot process.
	 */

> >  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 96c52e43f7ca..7bcce4607863 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -127,13 +127,16 @@ int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> >  int sysctl_panic_on_rcu_stall __read_mostly;
> >
> >  /*
> > - * The rcu_scheduler_active variable transitions from zero to one just
> > - * before the first task is spawned.  So when this variable is zero, RCU
> > + * The rcu_scheduler_active variable transitions from
> > + * RCU_SCHEDULER_INACTIVE to RCU_SCHEDULER_INIT just before the first
> > + * task is spawned.  So when this variable is RCU_SCHEDULER_INACTIVE, RCU
> >   * can assume that there is but one task, allowing RCU to (for example)
> >   * optimize synchronize_rcu() to a simple barrier().  When this variable
> >   * is one, RCU must actually do all the hard work required to detect real
>
> ... is RCU_SCHEDULER_INIT, RCU must ...
>
> >   * grace periods.  This variable is also used to suppress boot-time false
> > - * positives from lockdep-RCU error checking.
> > + * positives from lockdep-RCU error checking.  Finally, this variable
>
> . Finally, it...
>
> By now we know it is this variable :-)
>
> > + * transitions from RCU_SCHEDULER_INIT to RCU_SCHEDULER_RUNNING after RCU
> > + * is fully initialized, including all of its tasks having been spawned.
>
> s/tasks/kthreads/ ?
>
> Should make it clearer...

Good catches, fixed!

> ...
>
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index d3053e99fdb6..e59e1849b89a 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -532,18 +532,28 @@ struct rcu_exp_work {
> >  };
> >
> >  /*
> > + * Common code to drive an expedited grace period forward, used by
> > + * workqueues and mid-boot-time tasks.
> > + */
> > +static void rcu_exp_sel_wait_wake(struct rcu_state *rsp,
> > +				  smp_call_func_t func, unsigned long s)
> > +{
> > +	/* Initialize the rcu_node tree in preparation for the wait. */
> > +	sync_rcu_exp_select_cpus(rsp, func);
> > +
> > +	/* Wait and clean up, including waking everyone. */
> > +	rcu_exp_wait_wake(rsp, s);
> > +}
> > +
> > +/*
> >   * Work-queue handler to drive an expedited grace period forward.
> >   */
> >  static void wait_rcu_exp_gp(struct work_struct *wp)
> >  {
> >  	struct rcu_exp_work *rewp;
> >
> > -	/* Initialize the rcu_node tree in preparation for the wait. */
> >  	rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > -	sync_rcu_exp_select_cpus(rewp->rew_rsp, rewp->rew_func);
> > -
> > -	/* Wait and clean up, including waking everyone. */
> > -	rcu_exp_wait_wake(rewp->rew_rsp, rewp->rew_s);
> > +	rcu_exp_sel_wait_wake(rewp->rew_rsp, rewp->rew_func, rewp->rew_s);
> >  }
> >
> >  /*
> > @@ -569,12 +579,18 @@ static void _synchronize_rcu_expedited(struct rcu_state *rsp,
> >  	if (exp_funnel_lock(rsp, s))
> >  		return;  /* Someone else did our work for us. */
> >
> > -	/* Marshall arguments and schedule the expedited grace period. */
> > -	rew.rew_func = func;
> > -	rew.rew_rsp = rsp;
> > -	rew.rew_s = s;
> > -	INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> > -	schedule_work(&rew.rew_work);
> > +	/* Ensure that load happens before action based on it. */
> > +	if (unlikely(rcu_scheduler_active == RCU_SCHEDULER_INIT)) {
>
> Ok, so this variable is, AFAICT, on some hot paths. And we will query
> it each time we synchronize_sched() when we decide to do the expedited
> grace periods. There's that rcu_gp_is_expedited() which decides but I
> don't have an idea on which paths that happens...

It is marked __read_mostly for this reason, but I have always assumed
that the synchronous grace-period primitives were off the hotpath.

> In any case, should we make this var a jump label or so which gets
> patched properly or are the expedited paths comparatively seldom?

I believe that this would not buy very much, but if this variable starts
showing up on profiles, then perhaps a jump label would be appropriate.
As a separate patch, though!

> > +		/* Direct call during scheduler init and early_initcalls(). */
> > +		rcu_exp_sel_wait_wake(rsp, func, s);
> > +	} else {
> > +		/* Marshall arguments & schedule the expedited grace period. */
> > +		rew.rew_func = func;
> > +		rew.rew_rsp = rsp;
> > +		rew.rew_s = s;
> > +		INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> > +		schedule_work(&rew.rew_work);
> > +	}
> >
> >  	/* Wait for expedited grace period to complete. */
> >  	rdp = per_cpu_ptr(rsp->rda, raw_smp_processor_id());
>
> Rest looks ok to me but WTH do I know about RCU internals...

Thank you for your review and comments!

							Thanx, Paul