From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Sat, 14 Jan 2017 00:00:22 -0800
From: "Paul E. McKenney"
To: Borislav Petkov
Cc: linux-kernel@vger.kernel.org, lv.zheng@intel.com, stan.kain@gmail.com,
	waffolz@hotmail.com, josh@joshtriplett.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, jiangshanlai@gmail.com,
	mingo@kernel.org, torvalds@linux-foundation.org, rafael@kernel.org
Subject: Re: [PATCH] rcu: Narrow early boot window of illegal synchronous grace periods
Reply-To: paulmck@linux.vnet.ibm.com
References: <20170113023807.GA27120@linux.vnet.ibm.com> <20170113112519.q6tmariemhhk5e56@pd.tnic>
In-Reply-To: <20170113112519.q6tmariemhhk5e56@pd.tnic>
Message-Id: <20170114080022.GS5238@linux.vnet.ibm.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jan 13, 2017 at 12:25:19PM +0100, Borislav Petkov wrote:
> On Thu, Jan 12, 2017 at 06:38:07PM -0800, Paul E. McKenney wrote:
> > The current preemptible RCU implementation goes through three phases
> > during bootup.  In the first phase, there is only one CPU that is
> > running with preemption disabled, so that a no-op is a synchronous
> > grace period.  In the second mid-boot phase, the scheduler is running,
> > but RCU has not yet gotten its kthreads spawned (and, for expedited
> > grace periods, workqueues are not yet running).  During this time, any
> > attempt to do a synchronous grace period will hang the system (or
> > complain bitterly, depending).  In the third and final phase, RCU is
> > fully operational and everything works normally.
> >
> > This has been OK for some time, but there have recently been some
> > synchronous grace periods showing up during the second mid-boot phase.
>
> You probably should add the callchain from the thread as an example and
> for future reference:
>
> early_amd_iommu_init()
> |-> acpi_put_table(ivrs_base);
>  |-> acpi_tb_put_table(table_desc);
>   |-> acpi_tb_invalidate_table(table_desc);
>    |-> acpi_tb_release_table(...)
>     |-> acpi_os_unmap_memory
>      |-> acpi_os_unmap_iomem
>       |-> acpi_os_map_cleanup
>        |-> synchronize_rcu_expedited <-- the kernel/rcu/tree_exp.h version
>                                          with CONFIG_PREEMPT_RCU=y
>
> > This commit therefore reworks RCU to permit synchronous grace periods

It now looks like this:

------------------------------------------------------------------------

Note that the code was buggy even before this commit, as it was subject
to failure on real-time systems that forced all expedited grace periods
to run as normal grace periods (for example, using the rcu_normal
ksysfs parameter).
The callchain from the failure case is as follows:

early_amd_iommu_init()
|-> acpi_put_table(ivrs_base);
 |-> acpi_tb_put_table(table_desc);
  |-> acpi_tb_invalidate_table(table_desc);
   |-> acpi_tb_release_table(...)
    |-> acpi_os_unmap_memory
     |-> acpi_os_unmap_iomem
      |-> acpi_os_map_cleanup
       |-> synchronize_rcu_expedited

The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
which caused the code to try using workqueues before they were
initialized, which did not go well.

------------------------------------------------------------------------

Does that work?

> Just say "Rework RCU to ..."
>
> "This commit" and "This patch" and the such are not needed in commit
> messages.

Fair point, but this wording appears in almost all of my patches.  ;-)
My rationale is that it provides a clear transition from describing the
problem to introducing the solution.

> > to proceed during this mid-boot phase.
> >
> > This commit accomplishes this by setting a flag from the existing
> > rcu_scheduler_starting() function which causes all synchronous grace
> > periods to take the expedited path.  The expedited path now checks this
> > flag, using the requesting task to drive the expedited grace period
> > forward during the mid-boot phase.  Finally, this flag is updated by a
> > core_initcall() function named rcu_exp_runtime_mode(), which causes the
> > runtime codepaths to be used.
> >
> > Note that this arrangement assumes that tasks are not sent POSIX signals
> > (or anything similar) from the time that the first task is spawned
> > through core_initcall() time.
> >
> > Reported-by: "Zheng, Lv"
> > Reported-by: Borislav Petkov
> > Signed-off-by: Paul E. McKenney
> >
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 321f9ed552a9..01f71e1d2e94 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -444,6 +444,10 @@ bool __rcu_is_watching(void);
> >  #error "Unknown RCU implementation specified to kernel configuration"
> >  #endif
> >
> > +#define RCU_SCHEDULER_INACTIVE	0
> > +#define RCU_SCHEDULER_INIT	1
> > +#define RCU_SCHEDULER_RUNNING	2
> > +
> >  /*
> >   * init_rcu_head_on_stack()/destroy_rcu_head_on_stack() are needed for dynamic
> >   * initialization and destruction of rcu_head on the stack. rcu_head structures
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index 80adef7d4c3d..0d6ff3e471be 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -136,6 +136,7 @@ int rcu_jiffies_till_stall_check(void);
> >  #define TPS(x)  tracepoint_string(x)
> >
> >  void rcu_early_boot_tests(void);
> > +void rcu_test_sync_prims(void);
> >
> >  /*
> >   * This function really isn't for public consumption, but RCU is special in
> > diff --git a/kernel/rcu/tiny_plugin.h b/kernel/rcu/tiny_plugin.h
> > index 196f0302e2f4..e3953bdee383 100644
> > --- a/kernel/rcu/tiny_plugin.h
> > +++ b/kernel/rcu/tiny_plugin.h
> > @@ -65,7 +65,7 @@ EXPORT_SYMBOL_GPL(rcu_scheduler_active);
> >  void __init rcu_scheduler_starting(void)
> >  {
> >  	WARN_ON(nr_context_switches() > 0);
> > -	rcu_scheduler_active = 1;
> > +	rcu_scheduler_active = RCU_SCHEDULER_RUNNING;
>
> This tiny RCU version is setting _RUNNING while the
> kernel/rcu/tree.c-one is setting it to _INIT. The tiny bypasses the
> _INIT step now?
>
> I'm guessing because you've added a third state - the _RUNNING and
> tiny doesn't need the intermediary _INIT, it is being set straight to
> _RUNNING...

Exactly, but yes, worth a comment.  The header comment for
rcu_scheduler_starting() is now as follows:

	/*
	 * During boot, we forgive RCU lockdep issues.  After this function is
	 * invoked, we start taking RCU lockdep issues seriously.  Note that
	 * unlike Tree RCU, Tiny RCU transitions directly from
	 * RCU_SCHEDULER_INACTIVE to RCU_SCHEDULER_RUNNING, skipping the
	 * RCU_SCHEDULER_INIT stage.  The reason for this is that Tiny RCU
	 * does not need kthreads, so does not have to care about the fact
	 * that the scheduler is half-initialized at a certain phase of the
	 * boot process.
	 */

> >  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 96c52e43f7ca..7bcce4607863 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -127,13 +127,16 @@ int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> >  int sysctl_panic_on_rcu_stall __read_mostly;
> >
> >  /*
> > - * The rcu_scheduler_active variable transitions from zero to one just
> > - * before the first task is spawned.  So when this variable is zero, RCU
> > + * The rcu_scheduler_active variable transitions from
> > + * RCU_SCHEDULER_INACTIVE to RCU_SCHEDULER_INIT just before the first
> > + * task is spawned.  So when this variable is RCU_SCHEDULER_INACTIVE, RCU
> >   * can assume that there is but one task, allowing RCU to (for example)
> >   * optimize synchronize_rcu() to a simple barrier().  When this variable
> >   * is one, RCU must actually do all the hard work required to detect real
>
> ... is RCU_SCHEDULER_INIT, RCU must ...
>
> >   * grace periods.  This variable is also used to suppress boot-time false
> > - * positives from lockdep-RCU error checking.
> > + * positives from lockdep-RCU error checking.  Finally, this variable
>
> . Finally, it...
>
> By now we know it is this variable :-)
>
> > + * transitions from RCU_SCHEDULER_INIT to RCU_SCHEDULER_RUNNING after RCU
> > + * is fully initialized, including all of its tasks having been spawned.
>
> s/tasks/kthreads/ ?
>
> Should make it clearer...

Good catches, fixed!

> ...
>
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index d3053e99fdb6..e59e1849b89a 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -532,18 +532,28 @@ struct rcu_exp_work {
> >  };
> >
> >  /*
> > + * Common code to drive an expedited grace period forward, used by
> > + * workqueues and mid-boot-time tasks.
> > + */
> > +static void rcu_exp_sel_wait_wake(struct rcu_state *rsp,
> > +				  smp_call_func_t func, unsigned long s)
> > +{
> > +	/* Initialize the rcu_node tree in preparation for the wait. */
> > +	sync_rcu_exp_select_cpus(rsp, func);
> > +
> > +	/* Wait and clean up, including waking everyone. */
> > +	rcu_exp_wait_wake(rsp, s);
> > +}
> > +
> > +/*
> >   * Work-queue handler to drive an expedited grace period forward.
> >   */
> >  static void wait_rcu_exp_gp(struct work_struct *wp)
> >  {
> >  	struct rcu_exp_work *rewp;
> >
> > -	/* Initialize the rcu_node tree in preparation for the wait. */
> >  	rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > -	sync_rcu_exp_select_cpus(rewp->rew_rsp, rewp->rew_func);
> > -
> > -	/* Wait and clean up, including waking everyone. */
> > -	rcu_exp_wait_wake(rewp->rew_rsp, rewp->rew_s);
> > +	rcu_exp_sel_wait_wake(rewp->rew_rsp, rewp->rew_func, rewp->rew_s);
> >  }
> >
> >  /*
> > @@ -569,12 +579,18 @@ static void _synchronize_rcu_expedited(struct rcu_state *rsp,
> >  	if (exp_funnel_lock(rsp, s))
> >  		return;  /* Someone else did our work for us. */
> >
> > -	/* Marshall arguments and schedule the expedited grace period. */
> > -	rew.rew_func = func;
> > -	rew.rew_rsp = rsp;
> > -	rew.rew_s = s;
> > -	INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> > -	schedule_work(&rew.rew_work);
> > +	/* Ensure that load happens before action based on it. */
> > +	if (unlikely(rcu_scheduler_active == RCU_SCHEDULER_INIT)) {
>
> Ok, so this variable is, AFAICT, on some hot paths. And we will query
> it each time we synchronize_sched() when we decide to do the expedited
> grace periods. There's that rcu_gp_is_expedited() which decides but I
> don't have an idea on which paths that happens...

It is marked __read_mostly for this reason, but I have always assumed
that the synchronous grace-period primitives were off the hotpath.

> In any case, should we make this var a jump label or so which gets
> patched properly or are the expedited paths comparatively seldom?

I believe that this would not buy very much, but if this variable starts
showing up on profiles, then perhaps a jump label would be appropriate.
As a separate patch, though!

> > +		/* Direct call during scheduler init and early_initcalls(). */
> > +		rcu_exp_sel_wait_wake(rsp, func, s);
> > +	} else {
> > +		/* Marshall arguments & schedule the expedited grace period. */
> > +		rew.rew_func = func;
> > +		rew.rew_rsp = rsp;
> > +		rew.rew_s = s;
> > +		INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> > +		schedule_work(&rew.rew_work);
> > +	}
> >
> >  	/* Wait for expedited grace period to complete. */
> >  	rdp = per_cpu_ptr(rsp->rda, raw_smp_processor_id());
>
> Rest looks ok to me but WTH do I know about RCU internals...

Thank you for your review and comments!

							Thanx, Paul