linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}()
@ 2020-06-11 23:53 Paul E. McKenney
  2020-06-11 23:54 ` Paul E. McKenney
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Paul E. McKenney @ 2020-06-11 23:53 UTC (permalink / raw)
  To: linux-kernel, rcu
  Cc: tglx, luto, x86, frederic, rostedt, joel, mathieu.desnoyers,
	will, peterz

RCU needs to detect when one if its interrupt handlers interrupted an idle
state, where an idle state is either the idle loop itself or nohz_full
userspace execution.  When a CPU has been interrupted from one of these
idle states, RCU can report a quiescent state, helping the current grace
period to come to an end.

However, there are optimized kernel-entry paths where handlers that
would normally be executed in interrupt context are instead executed
directly from the base process context, but with interrupts disabled.
When such a directly-executed handler is followed by softirq processing
(which re-enables interrupts), it is possible that one of RCU's interrupt
handlers will interrupt this softirq processing.  This situation can cause
RCU to incorrectly assume that the CPU has passed through a quiescent
state, when in fact the CPU is instead in the midst of softirq processing,
and might well be within an RCU read-side critical section.  In that case,
reporting a quiescent state can result in too-short RCU grace periods,
leading to arbitrary memory corruption and a sharp degradation in the
actuarial statistics of your kernel.

The fix is for the optimized kernel-entry paths to replace the current
call to __rcu_is_watching() with a call to a new rcu_needs_irq_enter()
function, which returns true iff RCU needs explicit calls to
rcu_irq_enter() and rcu_irq_exit() surrounding the optimized invocation
of the handler.  These explicit calls are never required in Tiny RCU,
and in Tree RCU are required only if the CPU might have interrupted
nohz_full userspace execution or the idle loop.  There is the usual
precision/overhead tradeoff, with the current implementation majoring
in low common-case overhead.

While in the area, the commit also returns rcu_is_cpu_rrupt_from_idle()
to its original semantics.

This has been subjected only to very light rcutorture testing, so use
appropriate caution.  The placement of the new rcu_needs_irq_enter()
is not ideal, but the more natural include/linux/hardirq.h location has
#include issues.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index f4d5778..50d002a 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -517,6 +517,41 @@ SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
+/*
+ * RCU needs to detect when one if its interrupt handlers interrupted
+ * an idle state, where an idle state is either the idle loop itself or
+ * nohz_full userspace execution.  When a CPU has been interrupted from
+ * one of these idle states, RCU can report a quiescent state, helping
+ * the current grace period to come to an end.
+ *
+ * However, there are optimized kernel-entry paths where handlers that
+ * would normally be executed in interrupt context are instead executed
+ * directly from the base process context, but with interrupts disabled.
+ * When such a directly-executed handler is followed by softirq processing
+ * (which re-enables interrupts), it is possible that one of RCU's interrupt
+ * handlers will interrupt this softirq processing.  This situation
+ * can cause RCU to incorrectly assume that the CPU has passed through a
+ * quiescent state, when in fact the CPU is instead in the midst of softirq
+ * processing, and might well be within an RCU read-side critical section.
+ * In that case, reporting a quiescent state can result in too-short
+ * RCU grace periods, leading to arbitrary memory corruption and a sharp
+ * degradation in the actuarial statistics of your kernel.
+ *
+ * The fix is for the optimized kernel-entry paths to make use of
+ * this function, which returns true iff RCU needs explicit calls to
+ * rcu_irq_enter() and rcu_irq_exit() surrounding the optimized invocation
+ * of the handler.  These explicit calls are never required in Tiny RCU,
+ * and in Tree RCU are required only if the CPU might have interrupted
+ * nohz_full userspace execution or the idle loop.  There is the usual
+ * precision/overhead tradeoff, with the current implementation majoring
+ * in low common-case overhead.
+ */
+static __always_inline bool rcu_needs_irq_enter(void)
+{
+	return !IS_ENABLED(CONFIG_TINY_RCU) &&
+               (context_tracking_enabled_cpu(smp_processor_id()) || is_idle_task(current));
+}
+
 /**
  * idtentry_enter_cond_rcu - Handle state tracking on idtentry with conditional
  *			     RCU handling
@@ -557,7 +592,7 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs)
 		return false;
 	}
 
-	if (!__rcu_is_watching()) {
+	if (rcu_needs_irq_enter()) {
 		/*
 		 * If RCU is not watching then the same careful
 		 * sequence vs. lockdep and tracing is required
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 8512cae..957ed29 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -87,7 +87,6 @@ static inline void rcu_scheduler_starting(void) { }
 static inline void rcu_end_inkernel_boot(void) { }
 static inline bool rcu_inkernel_boot_has_ended(void) { return true; }
 static inline bool rcu_is_watching(void) { return true; }
-static inline bool __rcu_is_watching(void) { return true; }
 static inline void rcu_momentary_dyntick_idle(void) { }
 static inline void kfree_rcu_scheduler_running(void) { }
 static inline bool rcu_gp_might_be_stalled(void) { return false; }
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index d5cc9d67..e7072a0 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -64,7 +64,6 @@ extern int rcu_scheduler_active __read_mostly;
 void rcu_end_inkernel_boot(void);
 bool rcu_inkernel_boot_has_ended(void);
 bool rcu_is_watching(void);
-bool __rcu_is_watching(void);
 #ifndef CONFIG_PREEMPTION
 void rcu_all_qs(void);
 #endif
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c716ead..d833e10 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -429,30 +429,25 @@ static int rcu_is_cpu_rrupt_from_idle(void)
 {
 	long nesting;
 
-	/*
-	 * Usually called from the tick; but also used from smp_function_call()
-	 * for expedited grace periods. This latter can result in running from
-	 * the idle task, instead of an actual IPI.
-	 */
+	// Usually called from the tick; but also used from smp_function_call()
+	// for expedited grace periods. This latter can result in running from
+	// the idle task, instead of an actual IPI.
 	lockdep_assert_irqs_disabled();
 
-	/* Check for counter underflows */
+	// Check for counter underflows
 	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nesting) < 0,
 			 "RCU dynticks_nesting counter underflow!");
 	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nmi_nesting) <= 0,
 			 "RCU dynticks_nmi_nesting counter underflow/zero!");
 
-	/* Are we at first interrupt nesting level? */
+	// Are we at first interrupt nesting level?
 	nesting = __this_cpu_read(rcu_data.dynticks_nmi_nesting);
-	if (nesting > 1)
-		return false;
-
-	/*
-	 * If we're not in an interrupt, we must be in the idle task!
-	 */
+	// If we're not in an interrupt, we must be in the idle task!
 	WARN_ON_ONCE(!nesting && !is_idle_task(current));
+	if (nesting != 1)
+		return false; // Not in irq or in nested irq.
 
-	/* Does CPU appear to be idle from an RCU standpoint? */
+	// Does CPU appear to be idle from an RCU standpoint?
 	return __this_cpu_read(rcu_data.dynticks_nesting) == 0;
 }
 
@@ -1058,11 +1053,6 @@ static void rcu_disable_urgency_upon_qs(struct rcu_data *rdp)
 	}
 }
 
-noinstr bool __rcu_is_watching(void)
-{
-	return !rcu_dynticks_curr_cpu_in_eqs();
-}
-
 /**
  * rcu_is_watching - see if RCU thinks that the current CPU is not idle
  *

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}()
  2020-06-11 23:53 [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}() Paul E. McKenney
@ 2020-06-11 23:54 ` Paul E. McKenney
  2020-06-12  5:30 ` Andy Lutomirski
  2020-06-12  9:27 ` [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}() Thomas Gleixner
  2 siblings, 0 replies; 21+ messages in thread
From: Paul E. McKenney @ 2020-06-11 23:54 UTC (permalink / raw)
  To: linux-kernel, rcu
  Cc: tglx, luto, x86, frederic, rostedt, joel, mathieu.desnoyers,
	will, peterz

On Thu, Jun 11, 2020 at 04:53:05PM -0700, Paul E. McKenney wrote:
> RCU needs to detect when one if its interrupt handlers interrupted an idle
> state, where an idle state is either the idle loop itself or nohz_full
> userspace execution.  When a CPU has been interrupted from one of these
> idle states, RCU can report a quiescent state, helping the current grace
> period to come to an end.
> 
> However, there are optimized kernel-entry paths where handlers that
> would normally be executed in interrupt context are instead executed
> directly from the base process context, but with interrupts disabled.
> When such a directly-executed handler is followed by softirq processing
> (which re-enables interrupts), it is possible that one of RCU's interrupt
> handlers will interrupt this softirq processing.  This situation can cause
> RCU to incorrectly assume that the CPU has passed through a quiescent
> state, when in fact the CPU is instead in the midst of softirq processing,
> and might well be within an RCU read-side critical section.  In that case,
> reporting a quiescent state can result in too-short RCU grace periods,
> leading to arbitrary memory corruption and a sharp degradation in the
> actuarial statistics of your kernel.
> 
> The fix is for the optimized kernel-entry paths to replace the current
> call to __rcu_is_watching() with a call to a new rcu_needs_irq_enter()
> function, which returns true iff RCU needs explicit calls to
> rcu_irq_enter() and rcu_irq_exit() surrounding the optimized invocation
> of the handler.  These explicit calls are never required in Tiny RCU,
> and in Tree RCU are required only if the CPU might have interrupted
> nohz_full userspace execution or the idle loop.  There is the usual
> precision/overhead tradeoff, with the current implementation majoring
> in low common-case overhead.
> 
> While in the area, the commit also returns rcu_is_cpu_rrupt_from_idle()
> to its original semantics.
> 
> This has been subjected only to very light rcutorture testing, so use
> appropriate caution.  The placement of the new rcu_needs_irq_enter()
> is not ideal, but the more natural include/linux/hardirq.h location has
> #include issues.

And if you want more details on how I got to this patch, please see below.

							Thanx, Paul

------------------------------------------------------------------------

Thomas supplied a patch and suggested that there be an RCU-supplied
rcu_needs_irq_enter() function that says whether the full
rcu_irq_enter()/rcu_irq_exit() dance is required.  The function needing
the dance is rcu_is_cpu_rrupt_from_idle().

Assumptions:

o	It would be simpler for idtentry_enter_cond_rcu() to check a
	new rcu_needs_irq_enter() function than to do a fragile check
	of "!__rcu_is_watching() || is_idle_task(current)".  Note also
	that this check does not account for expedited grace periods
	interacting with softirq handlers having interrupted nohz_full
	userspace execution.

	This assumption seems likely to hold.

o	If CONFIG_CONTEXT_TRACKING=y, assume that rcu_user_enter()
	and rcu_user_exit() might be invoked on any CPU.  Query in to
	Frederic on whether this can be more selective.

Functions of interest:

o	rcu_is_cpu_rrupt_from_idle().  See below.

o	__rcu_is_watching().  The only call to this will likely
	be eliminated.  If so, it can be removed.

o	idtentry_enter_cond_rcu().  Replace __rcu_is_watching()
	check with a check of rcu_needs_irq_enter().

o	idtentry_exit_cond_rcu().  No change.

o	rcu_irq_enter_check_tick().  Turn on tick for nohz_full
	CPUs when required.

o	rcu_irq_exit_check_preempt().  Straight lockdep validation.

Use cases for rcu_is_cpu_rrupt_from_idle():

o	rcu_sched_clock_irq(): If not interrupted from idle, need to
	ask the scheduler for help for ->urgent_qs request.

o	rcu_pending(): If a nohz_full CPU is interrupted from idle,
	don't raise_softirq() it.  Instead, let the grace-period
	kthread detect the quiescent state.

o	rcu_exp_handler() for PREEMPT_NONE kernels:  Directly report
	a quiescent state if interrupted from idle.

o	rcu_flavor_sched_clock_irq for PREEMPT kernels:  Report a
	voluntary context switch if interrupted from idle.  Here "idle"
	includes still in the kernel but on the way to/from nohz_full
	userspace execution.

o	rcu_flavor_sched_clock_irq for PREEMPT_NONE kernels:  Report a
	quiescent state if interrupted from idle.
	
In all of the above cases for NO_HZ_FULL kernels, "idle" includes still
in the kernel but on the way to/from nohz_full userspace execution.

Information available to rcu_needs_irq_enter():

o	IS_ENABLED(CONFIG_CONTEXT_TRACKING), which indicates that
	nohz_full userspace execution is possible, and that some
	CPUs might be invoking rcu_idle_enter() and rcu_idle_exit().

	There is also tick_nohz_full_cpu(), but it is not clear that if
	this returns false that the corresponding CPU is guaranteed not
	to be invoking rcu_idle_enter() and rcu_idle_exit().

o	is_idle_task(current), which returns true if the current task
	is an idle task.  Such tasks always need to execute
	rcu_irq_enter() and rcu_irq_exit().  Or, if instrumentation
	is prohibited, rcu_nmi_enter() and rcu_nmi_exit().

o	rdp->dynticks_nesting:	If non-zero, we are in process context,
	and don't need rcu_irq_enter() or rcu_irq_exit() regardless
	of other state.  But this requires that rcu_needs_irq_enter()
	be defined within Tree RCU, so it is not necessarily a win.

->	The simple approach is for rcu_needs_irq_enter() to return:

	!IS_ENABLED(CONFIG_TINY_RCU) &&
	(IS_ENABLED(CONFIG_CONTEXT_TRACKING) || is_idle_task(current))

	Except that Frederic points out context_tracking_enabled_cpu():

	!IS_ENABLED(CONFIG_TINY_RCU) &&
	(context_tracking_enabled_cpu(smp_processor_id()) || is_idle_task(current))

o	As a result, rcu_needs_irq_enter() might need to be defined
	outside of RCU to allow inlining and to avoid #include hell.
	One candidate location is include/linux/hardirq.h, the same
	place that rcu_irq_enter_check_tick() is defined.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}()
  2020-06-11 23:53 [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}() Paul E. McKenney
  2020-06-11 23:54 ` Paul E. McKenney
@ 2020-06-12  5:30 ` Andy Lutomirski
  2020-06-12 12:40   ` Thomas Gleixner
  2020-06-12  9:27 ` [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}() Thomas Gleixner
  2 siblings, 1 reply; 21+ messages in thread
From: Andy Lutomirski @ 2020-06-12  5:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, rcu, Thomas Gleixner, Andrew Lutomirski, X86 ML,
	Frederic Weisbecker, Steven Rostedt, Joel Fernandes,
	Mathieu Desnoyers, Will Deacon, Peter Zijlstra

On Thu, Jun 11, 2020 at 4:53 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> RCU needs to detect when one if its interrupt handlers interrupted an idle
> state, where an idle state is either the idle loop itself or nohz_full
> userspace execution.  When a CPU has been interrupted from one of these
> idle states, RCU can report a quiescent state, helping the current grace
> period to come to an end.
>
> However, there are optimized kernel-entry paths where handlers that
> would normally be executed in interrupt context are instead executed
> directly from the base process context, but with interrupts disabled.
> When such a directly-executed handler is followed by softirq processing
> (which re-enables interrupts), it is possible that one of RCU's interrupt
> handlers will interrupt this softirq processing.

Why is the softirq processing special?  ISTM the basic sequence of events is:

idle, RCU not watching, IRQs on because we're using a lousy idle
mechanism that requires IRQs on.

IRQ hits.  The ones you're describing are the
DEFINE_IDTENTRY_SYSVEC_SIMPLE ones, but I'm not sure why it would
matter.  IRQ does idtentry_enter_cond_rcu().

idtentry_entry_cond_rcu() sees __rcu_is_watching() return true and
does not do rcu_irq_enter().

softirq processing turns on IRQs, a new IRQ hits, and kaboom.

But this is a bit of a strange explanation.  The SYSVEC_SIMPLE paths
don't seem to do irq_exit_rcu(), so there's no softirq processing
unless I missed something.  But more importantly, what are we doing
making it through idtentry_enter_cond_rcu() with rcu not watching at
the end?  If we hit the idle loop, surely __rcu_is_watching() should
have returned false, right?

So I added a little warning to detect the case where your patch
actually changes something (I don't believe for a second that this
warning should be committed -- it's just to get the stack trace):

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index f0b657097a2a..77330b96d8e5 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -592,6 +592,8 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs)
                return true;
        }

+       WARN_ON_ONCE(system_state == SYSTEM_RUNNING && is_idle_task(current));
+
        /*
         * If RCU is watching then RCU only wants to check
         * whether it needs to restart the tick in NOHZ

And I get this:

[    1.054927] ------------[ cut here ]------------
[    1.054928] WARNING: CPU: 1 PID: 0 at arch/x86/entry/common.c:595
idtentry_enter_cond_rcu+0x86/0xa0
[    1.054929] Modules linked in:
[    1.054930] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.7.0+ #145
[    1.054931] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31
04/01/2014
[    1.054931] RIP: 0010:idtentry_enter_cond_rcu+0x86/0xa0
[    1.054933] Code: 7c 24 08 e8 3c 3e 00 00 e8 57 a5 6e ff e8 d2 49
00 00 84 c0 74 19 90 31 c0 5b c3 65 48 8b 04 25 c0 7e 01 00 f6 40 24
02 74 99 <0f> 0b 90 eb 94 0f 0b 90 eb e2 0f 0b 90 eb 9d 66 66 2e 0f 1f
84 00
[    1.054933] RSP: 0000:ffffc9000006bc10 EFLAGS: 00010002
[    1.054934] RAX: ffff88807d529800 RBX: ffffc9000006bc38 RCX: ffffffff81c01327
[    1.054935] RDX: 0000000000000000 RSI: ffffffff81c00f49 RDI: ffffc9000006bc38
[    1.054935] RBP: ffffc9000006bc38 R08: 0000000000000000 R09: 0000000000000000
[    1.054936] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    1.054936] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    1.054937] FS:  0000000000000000(0000) GS:ffff88807dd00000(0000)
knlGS:0000000000000000
[    1.054937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.054938] CR2: 00000000ffffffff CR3: 0000000002620001 CR4: 0000000000360ee0
[    1.054938] Call Trace:
[    1.054939]  sysvec_call_function_single+0xa/0xd0
[    1.054939]  asm_sysvec_call_function_single+0x31/0x40
[    1.054940] RIP: 0010:__do_softirq+0xaa/0x437
[    1.054941] Code: 24 2c 0a 00 00 00 48 89 44 24 10 48 89 44 24 20
44 89 34 24 65 66 c7 05 22 b3 22 7e 00 00 e8 ad dc 40 ff fb 66 0f 1f
44 00 00 <b8> ff ff ff ff 48 c7 c6 00 51 60 82 0f bc 04 24 83 c0 01 49
89 f7
[    1.054941] RSP: 0000:ffffc9000006bd18 EFLAGS: 00000202
[    1.054942] RAX: 0000000000001a6e RBX: ffff88807d529800 RCX: 0000000000000000
[    1.054943] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81e000a3
[    1.054944] RBP: ffffc9000006bdc8 R08: 0000000000000000 R09: 0000000000000000
[    1.054944] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[    1.054945] R13: 0000000000000000 R14: 0000000000000082 R15: 0000000000000000
[    1.054945]  ? __do_softirq+0xa3/0x437
[    1.054945]  ? __do_softirq+0xaa/0x437
[    1.054946]  ? __do_softirq+0xa3/0x437
[    1.054946]  ? update_ts_time_stats+0x4c/0x70
[    1.054947]  irq_exit_rcu+0xaf/0xc0
[    1.054947]  sysvec_apic_timer_interrupt+0x51/0xd0
[    1.054948]  asm_sysvec_apic_timer_interrupt+0x31/0x40
[    1.054948] RIP: 0010:native_safe_halt+0xe/0x10
[    1.054949] Code: 8b 00 a8 08 74 80 eb c2 cc cc cc cc e9 07 00 00
00 0f 00 2d 46 bf 4e 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 36 bf 4e
00 fb f4 <c3> cc 41 54 55 53 48 83 ec 10 e8 43 b1 66 ff 65 8b 2d bc 07
4e 7e
[    1.054950] RSP: 0000:ffffc9000006bea0 EFLAGS: 00000206
[    1.054951] RAX: 0000000000001a69 RBX: 0000000000000001 RCX: 0000000000000000
[    1.054951] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81b30789
[    1.054952] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[    1.054952] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88807d529800
[    1.054953] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88807d529800
[    1.054953]  ? default_idle+0x19/0x160
[    1.054954]  ? native_safe_halt+0xe/0x10
[    1.054954]  default_idle+0x1e/0x160
[    1.054955]  do_idle+0x1f5/0x270
[    1.054955]  ? _raw_spin_unlock_irqrestore+0x3a/0x50
[    1.054956]  ? trace_hardirqs_on+0x20/0xf0
[    1.054956]  cpu_startup_entry+0x14/0x20
[    1.054957]  start_secondary+0x147/0x170
[    1.054957]  secondary_startup_64+0xa4/0xb0

This is saying we were idle and __rcu_is_watching() correctly returned
false.  We got sysvec_apic_timer_interrupt(), which did
rcu_irq_enter() and then turned on IRQs for softirq processing.  Then
we got sysvec_call_function_single() inside that, and
sysvec_call_function_single() noticed that RCU was already watching
and did *not* call rcu_irq_enter().  And, if
sysvec_call_function_single() does rcu_is_cpu_rrupt_from_idle(), it
will return true.  So the issue is that RCU thinks that, since it


> +static __always_inline bool rcu_needs_irq_enter(void)
> +{
> +       return !IS_ENABLED(CONFIG_TINY_RCU) &&
> +               (context_tracking_enabled_cpu(smp_processor_id()) || is_idle_task(current));
> +}

x86 shouldn't need this context tracking check -- we won't even call
this if we came from user mode, and we make sure we never run with
IRQs on when we're in CONTEXT_USER.

I think my preference would be to fix this with something more like
tglx's patch but with an explanation:

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index f0b657097a2a..93fd9d6fe033 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -571,7 +571,7 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs)
                return false;
        }

-       if (!__rcu_is_watching()) {
+       if (!__rcu_is_watching() || is_idle_task(current)) {
                /*
                 * If RCU is not watching then the same careful
                 * sequence vs. lockdep and tracing is required
@@ -579,6 +579,18 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs)
                 *
                 * This only happens for IRQs that hit the idle
                 * loop, i.e. if idle is not using MWAIT.
+                *
+                * We also do this in nested IRQs in the idie task
+                * (e.g. we get an IRQ, we run softirqs with IRQs on after
+                * processing the IRQ, and we get another IRQ inside the
+                * softirq.)  This is because RCU wants to be able to tell
+                * whether it's running in an IRQ context directly inside
+                * idle (and hence interrupted a quiescent state) or
+                * interrupted kernel code that was coincidentally in the
+                * idle task (and hence the kernel was not quiescent).  RCU
+                * does this by counting rcu_irq_enter() levels.  This won't
+                * prevent scheduling in an otherwise schedulable context
+                * because the idle task can never schedule.
                 */
                WARN_ON_ONCE(!(regs->flags & X86_EFLAGS_IF));
                lockdep_hardirqs_off(CALLER_ADDR0);

Alternatively, perhaps rcu_is_cpu_rrupt_from_idle(void) could do something like:

if (softirq_count())
  return false;

if we believe that the only way this could happen was due to a softirq.

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}()
  2020-06-11 23:53 [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}() Paul E. McKenney
  2020-06-11 23:54 ` Paul E. McKenney
  2020-06-12  5:30 ` Andy Lutomirski
@ 2020-06-12  9:27 ` Thomas Gleixner
  2020-06-12 13:57   ` Paul E. McKenney
  2 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2020-06-12  9:27 UTC (permalink / raw)
  To: paulmck, linux-kernel, rcu
  Cc: luto, x86, frederic, rostedt, joel, mathieu.desnoyers, will, peterz

"Paul E. McKenney" <paulmck@kernel.org> writes:
> +static __always_inline bool rcu_needs_irq_enter(void)
> +{
> +	return !IS_ENABLED(CONFIG_TINY_RCU) &&
> +               (context_tracking_enabled_cpu(smp_processor_id()) || is_idle_task(current));

This reintroduces the #PF problem which started the whole conditional
RCU entry discussion:

   https://lore.kernel.org/lkml/20200515235125.628629605@linutronix.de/

and which made us all come to the conclusion that we can do it always
conditional. No biscuit for you. :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}()
  2020-06-12  5:30 ` Andy Lutomirski
@ 2020-06-12 12:40   ` Thomas Gleixner
  2020-06-12 13:55     ` [PATCH x86/entry: Force rcu_irq_enter() when in idle task Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2020-06-12 12:40 UTC (permalink / raw)
  To: Andy Lutomirski, Paul E. McKenney
  Cc: LKML, rcu, Andrew Lutomirski, X86 ML, Frederic Weisbecker,
	Steven Rostedt, Joel Fernandes, Mathieu Desnoyers, Will Deacon,
	Peter Zijlstra

Andy Lutomirski <luto@kernel.org> writes:
>
> This is saying we were idle and __rcu_is_watching() correctly returned
> false.  We got sysvec_apic_timer_interrupt(), which did
> rcu_irq_enter() and then turned on IRQs for softirq processing.  Then
> we got sysvec_call_function_single() inside that, and
> sysvec_call_function_single() noticed that RCU was already watching
> and did *not* call rcu_irq_enter().  And, if
> sysvec_call_function_single() does rcu_is_cpu_rrupt_from_idle(), it
> will return true.  So the issue is that RCU thinks that, since it
>
>
>> +static __always_inline bool rcu_needs_irq_enter(void)
>> +{
>> +       return !IS_ENABLED(CONFIG_TINY_RCU) &&
>> +               (context_tracking_enabled_cpu(smp_processor_id()) || is_idle_task(current));
>> +}
>
> x86 shouldn't need this context tracking check -- we won't even call
> this if we came from user mode, and we make sure we never run with
> IRQs on when we're in CONTEXT_USER.

As I told Paul already, it's broken.

> I think my preference would be to fix this with something more like
> tglx's patch but with an explanation:

Yes, explanation is definitely required :)

> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index f0b657097a2a..93fd9d6fe033 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -571,7 +571,7 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs)
>                 return false;
>         }
>
> -       if (!__rcu_is_watching()) {
> +       if (!__rcu_is_watching() || is_idle_task(current)) {

Actually that __rcu_is_watching() check is pointless because this can
only return false when current is the idle task. If the entry came from
user mode then this path is not taken. Entry from user mode with NOHZ
full does:

  enter_from_user_mode()
    user_exit_irqoff()
      __context_tracking_exit(CONTEXT_USER)
         rcu_user_exit()
           rcu_eqs_exit(1)
       	      WRITE_ONCE(rdp->dynticks_nmi_nesting, DYNTICK_IRQ_NONIDLE);

So if an interrupt hits the kernel after enabling interrupts then:

  1) RCU is watching and out of EQS
  2) The dynticks_nmi_nesting counter is not longer relevant

That remains that way until returning to user or scheduling out to idle
which means:

     if (is_idle_task(current))

is completely sufficient. And we don't care about unconditional
rcu_irq_enter() in this case. If idle triggers a #PF which wants to
sleep then the RCU state is the least of our worries.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-12 12:40   ` Thomas Gleixner
@ 2020-06-12 13:55     ` Thomas Gleixner
  2020-06-12 14:26       ` Frederic Weisbecker
                         ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Thomas Gleixner @ 2020-06-12 13:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, rcu, Andrew Lutomirski, X86 ML, Frederic Weisbecker,
	Steven Rostedt, Joel Fernandes, Mathieu Desnoyers, Will Deacon,
	Peter Zijlstra

The idea of conditionally calling into rcu_irq_enter() only when RCU is
not watching turned out to be not completely thought through.

Paul noticed occasional premature end of grace periods in RCU torture
testing. Bisection led to the commit which made the invocation of
rcu_irq_enter() conditional on !rcu_is_watching().

It turned out that this conditional breaks RCU assumptions about the idle
task when the scheduler tick happens to be a nested interrupt. Nested
interrupts can happen when the first interrupt invokes softirq processing
on return which enables interrupts. If that nested tick interrupt does not
invoke rcu_irq_enter() then the nest accounting in RCU claims that this is
the first interrupt which might mark a quiescient state and end grace
periods prematurely.

Change the condition from !rcu_is_watching() to is_idle_task(current) which
enforces that interrupts in the idle task unconditionally invoke
rcu_irq_enter() independent of the RCU state.

This is also correct vs. user mode entries in NOHZ full scenarios because
user mode entries bring RCU out of EQS and force the RCU irq nesting state
accounting to nested. As only the first interrupt can enter from user mode
a nested tick interrupt will enter from kernel mode and as the nesting
state accounting is forced to nesting it will not do anything stupid even
if rcu_irq_enter() has not been invoked.

Fixes: 3eeec3858488 ("x86/entry: Provide idtentry_entry/exit_cond_rcu()")
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/common.c |   35 ++++++++++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 7 deletions(-)
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -557,14 +557,34 @@ bool noinstr idtentry_enter_cond_rcu(str
 		return false;
 	}
 
-	if (!__rcu_is_watching()) {
+	/*
+	 * If this entry hit the idle task invoke rcu_irq_enter() whether
+	 * RCU is watching or not.
+	 *
+	 * Interupts can nest when the first interrupt invokes softirq
+	 * processing on return which enables interrupts.
+	 *
+	 * Scheduler ticks in the idle task can mark quiescent state and
+	 * terminate a grace period, if and only if the timer interrupt is
+	 * not nested into another interrupt.
+	 *
+	 * Checking for __rcu_is_watching() here would prevent the nesting
+	 * interrupt to invoke rcu_irq_enter(). If that nested interrupt is
+	 * the tick then rcu_flavor_sched_clock_irq() would wrongfully
+	 * assume that it is the first interupt and eventually claim
+	 * quiescient state and end grace periods prematurely.
+	 *
+	 * Unconditionally invoke rcu_irq_enter() so RCU state stays
+	 * consistent.
+	 *
+	 * TINY_RCU does not support EQS, so let the compiler eliminate
+	 * this part when enabled.
+	 */
+	if (!IS_ENABLED(CONFIG_TINY_RCU) && is_idle_task(current)) {
 		/*
 		 * If RCU is not watching then the same careful
 		 * sequence vs. lockdep and tracing is required
 		 * as in enter_from_user_mode().
-		 *
-		 * This only happens for IRQs that hit the idle
-		 * loop, i.e. if idle is not using MWAIT.
 		 */
 		lockdep_hardirqs_off(CALLER_ADDR0);
 		rcu_irq_enter();
@@ -576,9 +596,10 @@ bool noinstr idtentry_enter_cond_rcu(str
 	}
 
 	/*
-	 * If RCU is watching then RCU only wants to check
-	 * whether it needs to restart the tick in NOHZ
-	 * mode.
+	 * If RCU is watching then RCU only wants to check whether it needs
+	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
+	 * already contains a warning when RCU is not watching, so no point
+	 * in having another one here.
 	 */
 	instrumentation_begin();
 	rcu_irq_enter_check_tick();

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}()
  2020-06-12  9:27 ` [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}() Thomas Gleixner
@ 2020-06-12 13:57   ` Paul E. McKenney
  0 siblings, 0 replies; 21+ messages in thread
From: Paul E. McKenney @ 2020-06-12 13:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, rcu, luto, x86, frederic, rostedt, joel,
	mathieu.desnoyers, will, peterz

On Fri, Jun 12, 2020 at 11:27:21AM +0200, Thomas Gleixner wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> > +static __always_inline bool rcu_needs_irq_enter(void)
> > +{
> > +	return !IS_ENABLED(CONFIG_TINY_RCU) &&
> > +               (context_tracking_enabled_cpu(smp_processor_id()) || is_idle_task(current));
> 
> This reintroduces the #PF problem which started the whole conditional
> RCU entry discussion:
> 
>    https://lore.kernel.org/lkml/20200515235125.628629605@linutronix.de/
> 
> and which made us all come to the conclusion that we can do it always
> conditional. No biscuit for you. :)

We can only be thankful that source-code control systems mean that my
coding session yesterday afternoon will have no permanent effect.

Let this be a lesson to all of you:  Hacking RCU while sleep-deprived
is a really bad idea.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-12 13:55     ` [PATCH x86/entry: Force rcu_irq_enter() when in idle task Thomas Gleixner
@ 2020-06-12 14:26       ` Frederic Weisbecker
  2020-06-12 14:47         ` Thomas Gleixner
  2020-06-12 15:32       ` Andy Lutomirski
                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Frederic Weisbecker @ 2020-06-12 14:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul E. McKenney, LKML, rcu, Andrew Lutomirski, X86 ML,
	Steven Rostedt, Joel Fernandes, Mathieu Desnoyers, Will Deacon,
	Peter Zijlstra

On Fri, Jun 12, 2020 at 03:55:00PM +0200, Thomas Gleixner wrote:
> The idea of conditionally calling into rcu_irq_enter() only when RCU is
> not watching turned out to be not completely thought through.
> 
> Paul noticed occasional premature end of grace periods in RCU torture
> testing. Bisection led to the commit which made the invocation of
> rcu_irq_enter() conditional on !rcu_is_watching().
> 
> It turned out that this conditional breaks RCU assumptions about the idle
> task when the scheduler tick happens to be a nested interrupt. Nested
> interrupts can happen when the first interrupt invokes softirq processing
> on return which enables interrupts. If that nested tick interrupt does not
> invoke rcu_irq_enter() then the nest accounting in RCU claims that this is
> the first interrupt which might mark a quiescient state and end grace
> periods prematurely.
> 
> Change the condition from !rcu_is_watching() to is_idle_task(current) which
> enforces that interrupts in the idle task unconditionally invoke
> rcu_irq_enter() independent of the RCU state.
> 
> This is also correct vs. user mode entries in NOHZ full scenarios because
> user mode entries bring RCU out of EQS and force the RCU irq nesting state
> accounting to nested. As only the first interrupt can enter from user mode
> a nested tick interrupt will enter from kernel mode and as the nesting
> state accounting is forced to nesting it will not do anything stupid even
> if rcu_irq_enter() has not been invoked.
> 
> Fixes: 3eeec3858488 ("x86/entry: Provide idtentry_entry/exit_cond_rcu()")
> Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Frederic Weisbecker <frederic@kernel.org>

So, in the end the call to rcu_irq_enter() in irq_enter() is going to
be useless in x86, right?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-12 14:26       ` Frederic Weisbecker
@ 2020-06-12 14:47         ` Thomas Gleixner
  0 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2020-06-12 14:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, LKML, rcu, Andrew Lutomirski, X86 ML,
	Steven Rostedt, Joel Fernandes, Mathieu Desnoyers, Will Deacon,
	Peter Zijlstra

Frederic Weisbecker <frederic@kernel.org> writes:

> On Fri, Jun 12, 2020 at 03:55:00PM +0200, Thomas Gleixner wrote:
>> The idea of conditionally calling into rcu_irq_enter() only when RCU is
>> not watching turned out to be not completely thought through.
>> 
>> Paul noticed occasional premature end of grace periods in RCU torture
>> testing. Bisection led to the commit which made the invocation of
>> rcu_irq_enter() conditional on !rcu_is_watching().
>> 
>> It turned out that this conditional breaks RCU assumptions about the idle
>> task when the scheduler tick happens to be a nested interrupt. Nested
>> interrupts can happen when the first interrupt invokes softirq processing
>> on return which enables interrupts. If that nested tick interrupt does not
>> invoke rcu_irq_enter() then the nest accounting in RCU claims that this is
>> the first interrupt which might mark a quiescient state and end grace
>> periods prematurely.
>> 
>> Change the condition from !rcu_is_watching() to is_idle_task(current) which
>> enforces that interrupts in the idle task unconditionally invoke
>> rcu_irq_enter() independent of the RCU state.
>> 
>> This is also correct vs. user mode entries in NOHZ full scenarios because
>> user mode entries bring RCU out of EQS and force the RCU irq nesting state
>> accounting to nested. As only the first interrupt can enter from user mode
>> a nested tick interrupt will enter from kernel mode and as the nesting
>> state accounting is forced to nesting it will not do anything stupid even
>> if rcu_irq_enter() has not been invoked.
>> 
>> Fixes: 3eeec3858488 ("x86/entry: Provide idtentry_entry/exit_cond_rcu()")
>> Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>
> Acked-by: Frederic Weisbecker <frederic@kernel.org>
>
> So, in the end the call to rcu_irq_enter() in irq_enter() is going to
> be useless in x86, right?

x86 is not calling irq_enter() anymore. It's using irq_enter_rcu().

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-12 13:55     ` [PATCH x86/entry: Force rcu_irq_enter() when in idle task Thomas Gleixner
  2020-06-12 14:26       ` Frederic Weisbecker
@ 2020-06-12 15:32       ` Andy Lutomirski
  2020-06-12 17:49       ` Paul E. McKenney
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Andy Lutomirski @ 2020-06-12 15:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul E. McKenney, LKML, rcu, Andrew Lutomirski, X86 ML,
	Frederic Weisbecker, Steven Rostedt, Joel Fernandes,
	Mathieu Desnoyers, Will Deacon, Peter Zijlstra

On Fri, Jun 12, 2020 at 6:55 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> The idea of conditionally calling into rcu_irq_enter() only when RCU is
> not watching turned out to be not completely thought through.
>
> Paul noticed occasional premature end of grace periods in RCU torture
> testing. Bisection led to the commit which made the invocation of
> rcu_irq_enter() conditional on !rcu_is_watching().
>
> It turned out that this conditional breaks RCU assumptions about the idle
> task when the scheduler tick happens to be a nested interrupt. Nested
> interrupts can happen when the first interrupt invokes softirq processing
> on return which enables interrupts. If that nested tick interrupt does not
> invoke rcu_irq_enter() then the nest accounting in RCU claims that this is
> the first interrupt which might mark a quiescient state and end grace
> periods prematurely.
>
> Change the condition from !rcu_is_watching() to is_idle_task(current) which
> enforces that interrupts in the idle task unconditionally invoke
> rcu_irq_enter() independent of the RCU state.
>
> This is also correct vs. user mode entries in NOHZ full scenarios because
> user mode entries bring RCU out of EQS and force the RCU irq nesting state
> accounting to nested. As only the first interrupt can enter from user mode
> a nested tick interrupt will enter from kernel mode and as the nesting
> state accounting is forced to nesting it will not do anything stupid even
> if rcu_irq_enter() has not been invoked.

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-12 13:55     ` [PATCH x86/entry: Force rcu_irq_enter() when in idle task Thomas Gleixner
  2020-06-12 14:26       ` Frederic Weisbecker
  2020-06-12 15:32       ` Andy Lutomirski
@ 2020-06-12 17:49       ` Paul E. McKenney
  2020-06-12 19:19         ` Paul E. McKenney
  2020-06-12 19:50       ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-06-15 20:16       ` [PATCH " Joel Fernandes
  4 siblings, 1 reply; 21+ messages in thread
From: Paul E. McKenney @ 2020-06-12 17:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, rcu, Andrew Lutomirski, X86 ML, Frederic Weisbecker,
	Steven Rostedt, Joel Fernandes, Mathieu Desnoyers, Will Deacon,
	Peter Zijlstra

On Fri, Jun 12, 2020 at 03:55:00PM +0200, Thomas Gleixner wrote:
> The idea of conditionally calling into rcu_irq_enter() only when RCU is
> not watching turned out to be not completely thought through.
> 
> Paul noticed occasional premature end of grace periods in RCU torture
> testing. Bisection led to the commit which made the invocation of
> rcu_irq_enter() conditional on !rcu_is_watching().
> 
> It turned out that this conditional breaks RCU assumptions about the idle
> task when the scheduler tick happens to be a nested interrupt. Nested
> interrupts can happen when the first interrupt invokes softirq processing
> on return which enables interrupts. If that nested tick interrupt does not
> invoke rcu_irq_enter() then the nest accounting in RCU claims that this is
> the first interrupt which might mark a quiescient state and end grace
> periods prematurely.

For this last sentence, how about the following?

If that nested tick interrupt does not invoke rcu_irq_enter() then the
RCU's irq-nesting checks will believe that this interrupt came directly
from idle, which will cause RCU to report a quiescent state.  Because
this interrupt instead came from a softirq handler which might have
been executing an RCU read-side critical section, this can cause the
grace period to end prematurely.

> Change the condition from !rcu_is_watching() to is_idle_task(current) which
> enforces that interrupts in the idle task unconditionally invoke
> rcu_irq_enter() independent of the RCU state.
> 
> This is also correct vs. user mode entries in NOHZ full scenarios because
> user mode entries bring RCU out of EQS and force the RCU irq nesting state
> accounting to nested. As only the first interrupt can enter from user mode
> a nested tick interrupt will enter from kernel mode and as the nesting
> state accounting is forced to nesting it will not do anything stupid even
> if rcu_irq_enter() has not been invoked.

On the testing front, just like with my busted patch yesterday, this
patch breaks the TASKS03 rcutorture scenario by preventing the Tasks
RCU grace periods from ever completing.  However, this is an unusual
configuration with NO_HZ_FULL and one CPU actually being nohz_full.
The more conventional TASKS01 and TASKS02 scenarios do just fine.

I will therefore address this issue in a follow-on patch.

> Fixes: 3eeec3858488 ("x86/entry: Provide idtentry_entry/exit_cond_rcu()")
> Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>

> ---
>  arch/x86/entry/common.c |   35 ++++++++++++++++++++++++++++-------
>  1 file changed, 28 insertions(+), 7 deletions(-)
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -557,14 +557,34 @@ bool noinstr idtentry_enter_cond_rcu(str
>  		return false;
>  	}
>  
> -	if (!__rcu_is_watching()) {
> +	/*
> +	 * If this entry hit the idle task invoke rcu_irq_enter() whether
> +	 * RCU is watching or not.
> +	 *
> +	 * Interupts can nest when the first interrupt invokes softirq
> +	 * processing on return which enables interrupts.
> +	 *
> +	 * Scheduler ticks in the idle task can mark quiescent state and
> +	 * terminate a grace period, if and only if the timer interrupt is
> +	 * not nested into another interrupt.
> +	 *
> +	 * Checking for __rcu_is_watching() here would prevent the nesting
> +	 * interrupt to invoke rcu_irq_enter(). If that nested interrupt is
> +	 * the tick then rcu_flavor_sched_clock_irq() would wrongfully
> +	 * assume that it is the first interupt and eventually claim
> +	 * quiescient state and end grace periods prematurely.
> +	 *
> +	 * Unconditionally invoke rcu_irq_enter() so RCU state stays
> +	 * consistent.
> +	 *
> +	 * TINY_RCU does not support EQS, so let the compiler eliminate
> +	 * this part when enabled.
> +	 */
> +	if (!IS_ENABLED(CONFIG_TINY_RCU) && is_idle_task(current)) {
>  		/*
>  		 * If RCU is not watching then the same careful
>  		 * sequence vs. lockdep and tracing is required
>  		 * as in enter_from_user_mode().
> -		 *
> -		 * This only happens for IRQs that hit the idle
> -		 * loop, i.e. if idle is not using MWAIT.
>  		 */
>  		lockdep_hardirqs_off(CALLER_ADDR0);
>  		rcu_irq_enter();
> @@ -576,9 +596,10 @@ bool noinstr idtentry_enter_cond_rcu(str
>  	}
>  
>  	/*
> -	 * If RCU is watching then RCU only wants to check
> -	 * whether it needs to restart the tick in NOHZ
> -	 * mode.
> +	 * If RCU is watching then RCU only wants to check whether it needs
> +	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
> +	 * already contains a warning when RCU is not watching, so no point
> +	 * in having another one here.
>  	 */
>  	instrumentation_begin();
>  	rcu_irq_enter_check_tick();

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-12 17:49       ` Paul E. McKenney
@ 2020-06-12 19:19         ` Paul E. McKenney
  2020-06-12 19:25           ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Paul E. McKenney @ 2020-06-12 19:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, rcu, Andrew Lutomirski, X86 ML, Frederic Weisbecker,
	Steven Rostedt, Joel Fernandes, Mathieu Desnoyers, Will Deacon,
	Peter Zijlstra

On Fri, Jun 12, 2020 at 10:49:53AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 12, 2020 at 03:55:00PM +0200, Thomas Gleixner wrote:
> > The idea of conditionally calling into rcu_irq_enter() only when RCU is
> > not watching turned out to be not completely thought through.
> > 
> > Paul noticed occasional premature end of grace periods in RCU torture
> > testing. Bisection led to the commit which made the invocation of
> > rcu_irq_enter() conditional on !rcu_is_watching().
> > 
> > It turned out that this conditional breaks RCU assumptions about the idle
> > task when the scheduler tick happens to be a nested interrupt. Nested
> > interrupts can happen when the first interrupt invokes softirq processing
> > on return which enables interrupts. If that nested tick interrupt does not
> > invoke rcu_irq_enter() then the nest accounting in RCU claims that this is
> > the first interrupt which might mark a quiescient state and end grace
> > periods prematurely.
> 
> For this last sentence, how about the following?
> 
> If that nested tick interrupt does not invoke rcu_irq_enter() then the
> RCU's irq-nesting checks will believe that this interrupt came directly
> from idle, which will cause RCU to report a quiescent state.  Because
> this interrupt instead came from a softirq handler which might have
> been executing an RCU read-side critical section, this can cause the
> grace period to end prematurely.
> 
> > Change the condition from !rcu_is_watching() to is_idle_task(current) which
> > enforces that interrupts in the idle task unconditionally invoke
> > rcu_irq_enter() independent of the RCU state.
> > 
> > This is also correct vs. user mode entries in NOHZ full scenarios because
> > user mode entries bring RCU out of EQS and force the RCU irq nesting state
> > accounting to nested. As only the first interrupt can enter from user mode
> > a nested tick interrupt will enter from kernel mode and as the nesting
> > state accounting is forced to nesting it will not do anything stupid even
> > if rcu_irq_enter() has not been invoked.
> 
> On the testing front, just like with my busted patch yesterday, this
> patch breaks the TASKS03 rcutorture scenario by preventing the Tasks
> RCU grace periods from ever completing.  However, this is an unusual
> configuration with NO_HZ_FULL and one CPU actually being nohz_full.
> The more conventional TASKS01 and TASKS02 scenarios do just fine.
> 
> I will therefore address this issue in a follow-on patch.

I should add that -your- patch from yesterday did -not- cause this
problem, in case that is of interest.

							Thanx, Paul

> > Fixes: 3eeec3858488 ("x86/entry: Provide idtentry_entry/exit_cond_rcu()")
> > Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
> Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
> 
> > ---
> >  arch/x86/entry/common.c |   35 ++++++++++++++++++++++++++++-------
> >  1 file changed, 28 insertions(+), 7 deletions(-)
> > --- a/arch/x86/entry/common.c
> > +++ b/arch/x86/entry/common.c
> > @@ -557,14 +557,34 @@ bool noinstr idtentry_enter_cond_rcu(str
> >  		return false;
> >  	}
> >  
> > -	if (!__rcu_is_watching()) {
> > +	/*
> > +	 * If this entry hit the idle task invoke rcu_irq_enter() whether
> > +	 * RCU is watching or not.
> > +	 *
> > +	 * Interupts can nest when the first interrupt invokes softirq
> > +	 * processing on return which enables interrupts.
> > +	 *
> > +	 * Scheduler ticks in the idle task can mark quiescent state and
> > +	 * terminate a grace period, if and only if the timer interrupt is
> > +	 * not nested into another interrupt.
> > +	 *
> > +	 * Checking for __rcu_is_watching() here would prevent the nesting
> > +	 * interrupt to invoke rcu_irq_enter(). If that nested interrupt is
> > +	 * the tick then rcu_flavor_sched_clock_irq() would wrongfully
> > +	 * assume that it is the first interupt and eventually claim
> > +	 * quiescient state and end grace periods prematurely.
> > +	 *
> > +	 * Unconditionally invoke rcu_irq_enter() so RCU state stays
> > +	 * consistent.
> > +	 *
> > +	 * TINY_RCU does not support EQS, so let the compiler eliminate
> > +	 * this part when enabled.
> > +	 */
> > +	if (!IS_ENABLED(CONFIG_TINY_RCU) && is_idle_task(current)) {
> >  		/*
> >  		 * If RCU is not watching then the same careful
> >  		 * sequence vs. lockdep and tracing is required
> >  		 * as in enter_from_user_mode().
> > -		 *
> > -		 * This only happens for IRQs that hit the idle
> > -		 * loop, i.e. if idle is not using MWAIT.
> >  		 */
> >  		lockdep_hardirqs_off(CALLER_ADDR0);
> >  		rcu_irq_enter();
> > @@ -576,9 +596,10 @@ bool noinstr idtentry_enter_cond_rcu(str
> >  	}
> >  
> >  	/*
> > -	 * If RCU is watching then RCU only wants to check
> > -	 * whether it needs to restart the tick in NOHZ
> > -	 * mode.
> > +	 * If RCU is watching then RCU only wants to check whether it needs
> > +	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
> > +	 * already contains a warning when RCU is not watching, so no point
> > +	 * in having another one here.
> >  	 */
> >  	instrumentation_begin();
> >  	rcu_irq_enter_check_tick();

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-12 19:19         ` Paul E. McKenney
@ 2020-06-12 19:25           ` Thomas Gleixner
  2020-06-12 19:28             ` Andy Lutomirski
  2020-06-12 19:34             ` Thomas Gleixner
  0 siblings, 2 replies; 21+ messages in thread
From: Thomas Gleixner @ 2020-06-12 19:25 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, rcu, Andrew Lutomirski, X86 ML, Frederic Weisbecker,
	Steven Rostedt, Joel Fernandes, Mathieu Desnoyers, Will Deacon,
	Peter Zijlstra

"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Fri, Jun 12, 2020 at 10:49:53AM -0700, Paul E. McKenney wrote:
>> I will therefore address this issue in a follow-on patch.
>
> I should add that -your- patch from yesterday did -not- cause this
> problem, in case that is of interest.

So I still can add it back and amend the changelog and comment:

Change the condition from !rcu_is_watching() to is_idle_task(current) ||
rcu_is_watching() which enforces that interrupts in the idle task
unconditionally invoke rcu_irq_enter() independent of the RCU state. For
most scenarios is_idle_task() would be sufficient but Task RCU needs it
according to Paul.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-12 19:25           ` Thomas Gleixner
@ 2020-06-12 19:28             ` Andy Lutomirski
  2020-06-12 19:34             ` Thomas Gleixner
  1 sibling, 0 replies; 21+ messages in thread
From: Andy Lutomirski @ 2020-06-12 19:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: paulmck, LKML, rcu, Andrew Lutomirski, X86 ML,
	Frederic Weisbecker, Steven Rostedt, Joel Fernandes,
	Mathieu Desnoyers, Will Deacon, Peter Zijlstra


> On Jun 12, 2020, at 12:25 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> "Paul E. McKenney" <paulmck@kernel.org> writes:
>>> On Fri, Jun 12, 2020 at 10:49:53AM -0700, Paul E. McKenney wrote:
>>> I will therefore address this issue in a follow-on patch.
>> 
>> I should add that -your- patch from yesterday did -not- cause this
>> problem, in case that is of interest.
> 
> So I still can add it back and amend the changelog and comment:
> 
> Change the condition from !rcu_is_watching() to is_idle_task(current) ||
> rcu_is_watching() which enforces that interrupts in the idle task
> unconditionally invoke rcu_irq_enter() independent of the RCU state. For
> most scenarios is_idle_task() would be sufficient but Task RCU needs it
> according to Paul.
> 
> 

Can you easily get a call trace for a case where the system is fully booted, is_idle_task returns false, and rcu_is_watching() also returns false?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-12 19:25           ` Thomas Gleixner
  2020-06-12 19:28             ` Andy Lutomirski
@ 2020-06-12 19:34             ` Thomas Gleixner
  2020-06-12 21:56               ` Paul E. McKenney
  1 sibling, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2020-06-12 19:34 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, rcu, Andrew Lutomirski, X86 ML, Frederic Weisbecker,
	Steven Rostedt, Joel Fernandes, Mathieu Desnoyers, Will Deacon,
	Peter Zijlstra

Thomas Gleixner <tglx@linutronix.de> writes:

> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> On Fri, Jun 12, 2020 at 10:49:53AM -0700, Paul E. McKenney wrote:
>>> I will therefore address this issue in a follow-on patch.
>>
>> I should add that -your- patch from yesterday did -not- cause this
>> problem, in case that is of interest.
>
> So I still can add it back and amend the changelog and comment:
>
> Change the condition from !rcu_is_watching() to is_idle_task(current) ||
> rcu_is_watching() which enforces that interrupts in the idle task
> unconditionally invoke rcu_irq_enter() independent of the RCU state. For
> most scenarios is_idle_task() would be sufficient but Task RCU needs it
> according to Paul.

After talking to Paul some more we came to the conclusion that the
failure scenario of task rcu is not completely clear and the trigger
scenario is obscure enough. This needs more investigation and the
important part which we were chasing is fixed and agreed on. So I go
with the simple version now and Paul will follow up once it can be
properly explained.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [tip: x86/entry] x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-12 13:55     ` [PATCH x86/entry: Force rcu_irq_enter() when in idle task Thomas Gleixner
                         ` (2 preceding siblings ...)
  2020-06-12 17:49       ` Paul E. McKenney
@ 2020-06-12 19:50       ` tip-bot2 for Thomas Gleixner
  2020-06-15 20:16       ` [PATCH " Joel Fernandes
  4 siblings, 0 replies; 21+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-06-12 19:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Paul E. McKenney, Thomas Gleixner, Andy Lutomirski,
	Frederic Weisbecker, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     0bf3924bfabd13ba21aa702344fc00b3b3263e5a
Gitweb:        https://git.kernel.org/tip/0bf3924bfabd13ba21aa702344fc00b3b3263e5a
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 12 Jun 2020 15:55:00 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 12 Jun 2020 21:36:33 +02:00

x86/entry: Force rcu_irq_enter() when in idle task

The idea of conditionally calling into rcu_irq_enter() only when RCU is
not watching turned out to be not completely thought through.

Paul noticed occasional premature end of grace periods in RCU torture
testing. Bisection led to the commit which made the invocation of
rcu_irq_enter() conditional on !rcu_is_watching().

It turned out that this conditional breaks RCU assumptions about the idle
task when the scheduler tick happens to be a nested interrupt. Nested
interrupts can happen when the first interrupt invokes softirq processing
on return which enables interrupts.

If that nested tick interrupt does not invoke rcu_irq_enter() then the
RCU's irq-nesting checks will believe that this interrupt came directly
from idle, which will cause RCU to report a quiescent state.  Because this
interrupt instead came from a softirq handler which might have been
executing an RCU read-side critical section, this can cause the grace
period to end prematurely.

Change the condition from !rcu_is_watching() to is_idle_task(current) which
enforces that interrupts in the idle task unconditionally invoke
rcu_irq_enter() independent of the RCU state.

This is also correct vs. user mode entries in NOHZ full scenarios because
user mode entries bring RCU out of EQS and force the RCU irq nesting state
accounting to nested. As only the first interrupt can enter from user mode
a nested tick interrupt will enter from kernel mode and as the nesting
state accounting is forced to nesting it will not do anything stupid even
if rcu_irq_enter() has not been invoked.

Fixes: 3eeec3858488 ("x86/entry: Provide idtentry_entry/exit_cond_rcu()")
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/87wo4cxubv.fsf@nanos.tec.linutronix.de

---
 arch/x86/entry/common.c | 35 ++++++++++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index f4d5778..bd3f141 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -557,14 +557,34 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs)
 		return false;
 	}
 
-	if (!__rcu_is_watching()) {
+	/*
+	 * If this entry hit the idle task invoke rcu_irq_enter() whether
+	 * RCU is watching or not.
+	 *
+	 * Interupts can nest when the first interrupt invokes softirq
+	 * processing on return which enables interrupts.
+	 *
+	 * Scheduler ticks in the idle task can mark quiescent state and
+	 * terminate a grace period, if and only if the timer interrupt is
+	 * not nested into another interrupt.
+	 *
+	 * Checking for __rcu_is_watching() here would prevent the nesting
+	 * interrupt to invoke rcu_irq_enter(). If that nested interrupt is
+	 * the tick then rcu_flavor_sched_clock_irq() would wrongfully
+	 * assume that it is the first interupt and eventually claim
+	 * quiescient state and end grace periods prematurely.
+	 *
+	 * Unconditionally invoke rcu_irq_enter() so RCU state stays
+	 * consistent.
+	 *
+	 * TINY_RCU does not support EQS, so let the compiler eliminate
+	 * this part when enabled.
+	 */
+	if (!IS_ENABLED(CONFIG_TINY_RCU) && is_idle_task(current)) {
 		/*
 		 * If RCU is not watching then the same careful
 		 * sequence vs. lockdep and tracing is required
 		 * as in enter_from_user_mode().
-		 *
-		 * This only happens for IRQs that hit the idle
-		 * loop, i.e. if idle is not using MWAIT.
 		 */
 		lockdep_hardirqs_off(CALLER_ADDR0);
 		rcu_irq_enter();
@@ -576,9 +596,10 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs)
 	}
 
 	/*
-	 * If RCU is watching then RCU only wants to check
-	 * whether it needs to restart the tick in NOHZ
-	 * mode.
+	 * If RCU is watching then RCU only wants to check whether it needs
+	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
+	 * already contains a warning when RCU is not watching, so no point
+	 * in having another one here.
 	 */
 	instrumentation_begin();
 	rcu_irq_enter_check_tick();

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-12 19:34             ` Thomas Gleixner
@ 2020-06-12 21:56               ` Paul E. McKenney
  0 siblings, 0 replies; 21+ messages in thread
From: Paul E. McKenney @ 2020-06-12 21:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, rcu, Andrew Lutomirski, X86 ML, Frederic Weisbecker,
	Steven Rostedt, Joel Fernandes, Mathieu Desnoyers, Will Deacon,
	Peter Zijlstra

On Fri, Jun 12, 2020 at 09:34:56PM +0200, Thomas Gleixner wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> 
> > "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> On Fri, Jun 12, 2020 at 10:49:53AM -0700, Paul E. McKenney wrote:
> >>> I will therefore address this issue in a follow-on patch.
> >>
> >> I should add that -your- patch from yesterday did -not- cause this
> >> problem, in case that is of interest.
> >
> > So I still can add it back and amend the changelog and comment:
> >
> > Change the condition from !rcu_is_watching() to is_idle_task(current) ||
> > rcu_is_watching() which enforces that interrupts in the idle task
> > unconditionally invoke rcu_irq_enter() independent of the RCU state. For
> > most scenarios is_idle_task() would be sufficient but Task RCU needs it
> > according to Paul.
> 
> After talking to Paul some more we came to the conclusion that the
> failure scenario of task rcu is not completely clear and the trigger
> scenario is obscure enough. This needs more investigation and the
> important part which we were chasing is fixed and agreed on. So I go
> with the simple version now and Paul will follow up once it can be
> properly explained.

So the whole TASKS03 failure issue turned out to be me applying the
wrong patches onto the wrong commits.  Retesting with -tip x86/entry
passes TASKS03, as in more than 100 instances of it.

I am rerunning the whole stack, but I don't see the need to wait for
that.  (I will be running increasingly long tests over Friday night,
Pacific Time.)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-12 13:55     ` [PATCH x86/entry: Force rcu_irq_enter() when in idle task Thomas Gleixner
                         ` (3 preceding siblings ...)
  2020-06-12 19:50       ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
@ 2020-06-15 20:16       ` Joel Fernandes
  2020-06-16  8:40         ` Thomas Gleixner
  4 siblings, 1 reply; 21+ messages in thread
From: Joel Fernandes @ 2020-06-15 20:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul E. McKenney, LKML, rcu, Andrew Lutomirski, X86 ML,
	Frederic Weisbecker, Steven Rostedt, Mathieu Desnoyers,
	Will Deacon, Peter Zijlstra

Hi Thomas, Paul,

On Fri, Jun 12, 2020 at 03:55:00PM +0200, Thomas Gleixner wrote:
> The idea of conditionally calling into rcu_irq_enter() only when RCU is
> not watching turned out to be not completely thought through.
> 
> Paul noticed occasional premature end of grace periods in RCU torture
> testing. Bisection led to the commit which made the invocation of
> rcu_irq_enter() conditional on !rcu_is_watching().
> 
> It turned out that this conditional breaks RCU assumptions about the idle
> task when the scheduler tick happens to be a nested interrupt. Nested
> interrupts can happen when the first interrupt invokes softirq processing
> on return which enables interrupts. If that nested tick interrupt does not
> invoke rcu_irq_enter() then the nest accounting in RCU claims that this is
> the first interrupt which might mark a quiescient state and end grace
> periods prematurely.
> 
> Change the condition from !rcu_is_watching() to is_idle_task(current) which
> enforces that interrupts in the idle task unconditionally invoke
> rcu_irq_enter() independent of the RCU state.
> 
> This is also correct vs. user mode entries in NOHZ full scenarios because
> user mode entries bring RCU out of EQS and force the RCU irq nesting state

I had to re-read this sentence a couple of times. The 'user mode entries'
sounds like 'entry into user mode'. It would be good to reword it to 'IRQ
entry in user mode'.

My knowledge predates the rcu-watching reworks so apologies on the below
question but I still didn't fully follow why when the idle task behaves
differently from being in user mode. Even with user mode we should have:

<user mode>  (in EQS)
  <irq entry>   <---------- exits EQS so now rcu is watching
     <softirq entry in the exit path>
        <timer tick irq entry>  <-- the buggy !watching logic prevents rcu_irq_enter
	    -> report QS since tick thinks it is first level.

Is there a subtlety here I'm missing? I checked the RCU code and I did not
see anywhere that rcu_user_enter() makes it behave differently. Both
rcu_user_enter() and rcu_idle_enter() call rcu_eqs_enter().

> accounting to nested. As only the first interrupt can enter from user mode
> a nested tick interrupt will enter from kernel mode and as the nesting
> state accounting is forced to nesting it will not do anything stupid even
> if rcu_irq_enter() has not been invoked.
> 
> Fixes: 3eeec3858488 ("x86/entry: Provide idtentry_entry/exit_cond_rcu()")
> Reported-by: "Paul E. McKenney" <paulmck@kernel.org>

> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/entry/common.c |   35 ++++++++++++++++++++++++++++-------
>  1 file changed, 28 insertions(+), 7 deletions(-)
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -557,14 +557,34 @@ bool noinstr idtentry_enter_cond_rcu(str
>  		return false;
>  	}
>  
> -	if (!__rcu_is_watching()) {
> +	/*
> +	 * If this entry hit the idle task invoke rcu_irq_enter() whether
> +	 * RCU is watching or not.
> +	 *
> +	 * Interupts can nest when the first interrupt invokes softirq
> +	 * processing on return which enables interrupts.
> +	 *
> +	 * Scheduler ticks in the idle task can mark quiescent state and
> +	 * terminate a grace period, if and only if the timer interrupt is
> +	 * not nested into another interrupt.
> +	 *
> +	 * Checking for __rcu_is_watching() here would prevent the nesting
> +	 * interrupt to invoke rcu_irq_enter(). If that nested interrupt is
> +	 * the tick then rcu_flavor_sched_clock_irq() would wrongfully
> +	 * assume that it is the first interupt and eventually claim
> +	 * quiescient state and end grace periods prematurely.
> +	 *
> +	 * Unconditionally invoke rcu_irq_enter() so RCU state stays
> +	 * consistent.

It is not really unconditional though, we are checking for the idle task
below.  I suggest a slight reword of the comments accordingly.

thanks,

 - Joel


> +	 *
> +	 * TINY_RCU does not support EQS, so let the compiler eliminate
> +	 * this part when enabled.
> +	 */
> +	if (!IS_ENABLED(CONFIG_TINY_RCU) && is_idle_task(current)) {
>  		/*
>  		 * If RCU is not watching then the same careful
>  		 * sequence vs. lockdep and tracing is required
>  		 * as in enter_from_user_mode().
> -		 *
> -		 * This only happens for IRQs that hit the idle
> -		 * loop, i.e. if idle is not using MWAIT.
>  		 */
>  		lockdep_hardirqs_off(CALLER_ADDR0);
>  		rcu_irq_enter();
> @@ -576,9 +596,10 @@ bool noinstr idtentry_enter_cond_rcu(str
>  	}
>  
>  	/*
> -	 * If RCU is watching then RCU only wants to check
> -	 * whether it needs to restart the tick in NOHZ
> -	 * mode.
> +	 * If RCU is watching then RCU only wants to check whether it needs
> +	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
> +	 * already contains a warning when RCU is not watching, so no point
> +	 * in having another one here.
>  	 */
>  	instrumentation_begin();
>  	rcu_irq_enter_check_tick();

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-15 20:16       ` [PATCH " Joel Fernandes
@ 2020-06-16  8:40         ` Thomas Gleixner
  2020-06-16 14:30           ` Joel Fernandes
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2020-06-16  8:40 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Paul E. McKenney, LKML, rcu, Andrew Lutomirski, X86 ML,
	Frederic Weisbecker, Steven Rostedt, Mathieu Desnoyers,
	Will Deacon, Peter Zijlstra

Joel,

Joel Fernandes <joel@joelfernandes.org> writes:
> On Fri, Jun 12, 2020 at 03:55:00PM +0200, Thomas Gleixner wrote:
>> This is also correct vs. user mode entries in NOHZ full scenarios because
>> user mode entries bring RCU out of EQS and force the RCU irq nesting state
>
> I had to re-read this sentence a couple of times. The 'user mode entries'
> sounds like 'entry into user mode'. It would be good to reword it to 'IRQ
> entry in user mode'.

:)

> My knowledge predates the rcu-watching reworks so apologies on the below
> question but I still didn't fully follow why when the idle task behaves
> differently from being in user mode. Even with user mode we should have:
>
> <user mode>  (in EQS)
>   <irq entry>   <---------- exits EQS so now rcu is watching
>      <softirq entry in the exit path>
>         <timer tick irq entry>  <-- the buggy !watching logic prevents rcu_irq_enter
> 	    -> report QS since tick thinks it is first level.
>
> Is there a subtlety here I'm missing? I checked the RCU code and I did not
> see anywhere that rcu_user_enter() makes it behave differently. Both
> rcu_user_enter() and rcu_idle_enter() call rcu_eqs_enter().

The interrupt hit user mode entry does:

    idtentry_enter_cond_rcu()
      if (user_mode(regs))
         enter_from_user_mode()
           user_exit_irqoff()
             __context_tracking_exit(CONTEXT_USER)
               rcu_user_exit()
                 rcu_eqs_exit(1)
                   ...
                   WRITE_ONCE(rdp->dynticks_nmi_nesting,
                              DYNTICK_IRQ_NONIDLE);

i.e. for interrupts which enter from user mode we are not invoking
rcu_irq_enter() at all.

The return from interrupt path has nothing to do with that because
that's an entry in kernel mode.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-16  8:40         ` Thomas Gleixner
@ 2020-06-16 14:30           ` Joel Fernandes
  2020-06-16 16:52             ` Andy Lutomirski
  0 siblings, 1 reply; 21+ messages in thread
From: Joel Fernandes @ 2020-06-16 14:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul E. McKenney, LKML, rcu, Andrew Lutomirski, X86 ML,
	Frederic Weisbecker, Steven Rostedt, Mathieu Desnoyers,
	Will Deacon, Peter Zijlstra

On Tue, Jun 16, 2020 at 10:40:04AM +0200, Thomas Gleixner wrote:
> Joel,
> 
> Joel Fernandes <joel@joelfernandes.org> writes:
> > On Fri, Jun 12, 2020 at 03:55:00PM +0200, Thomas Gleixner wrote:
> >> This is also correct vs. user mode entries in NOHZ full scenarios because
> >> user mode entries bring RCU out of EQS and force the RCU irq nesting state
> >
> > I had to re-read this sentence a couple of times. The 'user mode entries'
> > sounds like 'entry into user mode'. It would be good to reword it to 'IRQ
> > entry in user mode'.
> 
> :)
> 
> > My knowledge predates the rcu-watching reworks so apologies on the below
> > question but I still didn't fully follow why when the idle task behaves
> > differently from being in user mode. Even with user mode we should have:
> >
> > <user mode>  (in EQS)
> >   <irq entry>   <---------- exits EQS so now rcu is watching
> >      <softirq entry in the exit path>
> >         <timer tick irq entry>  <-- the buggy !watching logic prevents rcu_irq_enter
> > 	    -> report QS since tick thinks it is first level.
> >
> > Is there a subtlety here I'm missing? I checked the RCU code and I did not
> > see anywhere that rcu_user_enter() makes it behave differently. Both
> > rcu_user_enter() and rcu_idle_enter() call rcu_eqs_enter().
> 
> The interrupt hit user mode entry does:
> 
>     idtentry_enter_cond_rcu()
>       if (user_mode(regs))
>          enter_from_user_mode()
>            user_exit_irqoff()
>              __context_tracking_exit(CONTEXT_USER)
>                rcu_user_exit()
>                  rcu_eqs_exit(1)
>                    ...
>                    WRITE_ONCE(rdp->dynticks_nmi_nesting,
>                               DYNTICK_IRQ_NONIDLE);
> 
> i.e. for interrupts which enter from user mode we are not invoking
> rcu_irq_enter() at all.
> 
> The return from interrupt path has nothing to do with that because
> that's an entry in kernel mode.

Hi Thomas,
Ah, IRQ entry in user mode triggers the context-tracking path. Makes sense now, thanks.

This will help me when I have to propose to get rid of dynticks_nmi_nesting again :)

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH x86/entry: Force rcu_irq_enter() when in idle task
  2020-06-16 14:30           ` Joel Fernandes
@ 2020-06-16 16:52             ` Andy Lutomirski
  0 siblings, 0 replies; 21+ messages in thread
From: Andy Lutomirski @ 2020-06-16 16:52 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Thomas Gleixner, Paul E. McKenney, LKML, rcu, Andrew Lutomirski,
	X86 ML, Frederic Weisbecker, Steven Rostedt, Mathieu Desnoyers,
	Will Deacon, Peter Zijlstra

On Tue, Jun 16, 2020 at 7:30 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Tue, Jun 16, 2020 at 10:40:04AM +0200, Thomas Gleixner wrote:
> > Joel,
> >
> > Joel Fernandes <joel@joelfernandes.org> writes:
> > > On Fri, Jun 12, 2020 at 03:55:00PM +0200, Thomas Gleixner wrote:
> > >> This is also correct vs. user mode entries in NOHZ full scenarios because
> > >> user mode entries bring RCU out of EQS and force the RCU irq nesting state
> > >
> > > I had to re-read this sentence a couple of times. The 'user mode entries'
> > > sounds like 'entry into user mode'. It would be good to reword it to 'IRQ
> > > entry in user mode'.
> >
> > :)
> >
> > > My knowledge predates the rcu-watching reworks so apologies on the below
> > > question but I still didn't fully follow why when the idle task behaves
> > > differently from being in user mode. Even with user mode we should have:
> > >
> > > <user mode>  (in EQS)
> > >   <irq entry>   <---------- exits EQS so now rcu is watching
> > >      <softirq entry in the exit path>
> > >         <timer tick irq entry>  <-- the buggy !watching logic prevents rcu_irq_enter
> > >         -> report QS since tick thinks it is first level.
> > >
> > > Is there a subtlety here I'm missing? I checked the RCU code and I did not
> > > see anywhere that rcu_user_enter() makes it behave differently. Both
> > > rcu_user_enter() and rcu_idle_enter() call rcu_eqs_enter().
> >
> > The interrupt hit user mode entry does:
> >
> >     idtentry_enter_cond_rcu()
> >       if (user_mode(regs))
> >          enter_from_user_mode()
> >            user_exit_irqoff()
> >              __context_tracking_exit(CONTEXT_USER)
> >                rcu_user_exit()
> >                  rcu_eqs_exit(1)
> >                    ...
> >                    WRITE_ONCE(rdp->dynticks_nmi_nesting,
> >                               DYNTICK_IRQ_NONIDLE);
> >
> > i.e. for interrupts which enter from user mode we are not invoking
> > rcu_irq_enter() at all.
> >
> > The return from interrupt path has nothing to do with that because
> > that's an entry in kernel mode.
>
> Hi Thomas,
> Ah, IRQ entry in user mode triggers the context-tracking path. Makes sense now, thanks.
>
> This will help me when I have to propose to get rid of dynticks_nmi_nesting again :)
>

Propose away, but just keep in mind that horrible architectures like
x86 really can nest non-maskable interrupts that hit kernel code more
than one deep.

--Andy

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-06-16 16:52 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-11 23:53 [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}() Paul E. McKenney
2020-06-11 23:54 ` Paul E. McKenney
2020-06-12  5:30 ` Andy Lutomirski
2020-06-12 12:40   ` Thomas Gleixner
2020-06-12 13:55     ` [PATCH x86/entry: Force rcu_irq_enter() when in idle task Thomas Gleixner
2020-06-12 14:26       ` Frederic Weisbecker
2020-06-12 14:47         ` Thomas Gleixner
2020-06-12 15:32       ` Andy Lutomirski
2020-06-12 17:49       ` Paul E. McKenney
2020-06-12 19:19         ` Paul E. McKenney
2020-06-12 19:25           ` Thomas Gleixner
2020-06-12 19:28             ` Andy Lutomirski
2020-06-12 19:34             ` Thomas Gleixner
2020-06-12 21:56               ` Paul E. McKenney
2020-06-12 19:50       ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-06-15 20:16       ` [PATCH " Joel Fernandes
2020-06-16  8:40         ` Thomas Gleixner
2020-06-16 14:30           ` Joel Fernandes
2020-06-16 16:52             ` Andy Lutomirski
2020-06-12  9:27 ` [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}() Thomas Gleixner
2020-06-12 13:57   ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).