* [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state
@ 2011-09-26 10:19 Frederic Weisbecker
  2011-09-26 10:19 ` [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu " Frederic Weisbecker
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Frederic Weisbecker @ 2011-09-26 10:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, H. Peter Anvin, Andy Henroid, Mike Frysinger,
	Guan Xuetao, David Miller, Chris Metcalf, Hans-Christian Egtvedt,
	Ralf Baechle, Russell King, Paul Mackerras, Heiko Carstens,
	Paul Mundt, Lai Jiangshan

Hi Paul,

Two fixes have been added to this set, as the first two patches:

- The check on uses under extended quiescent states was buggy
- One more illegal use of RCU fixed, from inside tick_nohz_stop_sched_tick()

It has survived one day and one night of rcutorture on x86-64
with periodic cpu hotplug onlining/offlining...

No test on powerpc yet though... so I don't know yet why
you got an rcutorture failure.

Frederic Weisbecker (7):
  rcu: Fix preempt-unsafe debug check of rcu extended quiescent state
  rcu: Fix early call to rcu_enter_nohz() on tick stopping
  nohz: Separate out irq exit and idle loop dyntick logic
  nohz: Allow rcu extended quiescent state handling separately from
    tick stop
  x86: Enter rcu extended qs after idle notifier call
  x86: Call idle notifier after irq_enter()
  rcu: Fix early call to rcu_irq_exit()

 arch/arm/kernel/process.c                |    4 +-
 arch/avr32/kernel/process.c              |    4 +-
 arch/blackfin/kernel/process.c           |    4 +-
 arch/microblaze/kernel/process.c         |    4 +-
 arch/mips/kernel/process.c               |    4 +-
 arch/powerpc/kernel/idle.c               |    4 +-
 arch/powerpc/platforms/iseries/setup.c   |    8 +-
 arch/s390/kernel/process.c               |    4 +-
 arch/sh/kernel/idle.c                    |    4 +-
 arch/sparc/kernel/process_64.c           |    4 +-
 arch/tile/kernel/process.c               |    4 +-
 arch/um/kernel/process.c                 |    4 +-
 arch/unicore32/kernel/process.c          |    4 +-
 arch/x86/kernel/apic/apic.c              |    6 +-
 arch/x86/kernel/apic/io_apic.c           |    2 +-
 arch/x86/kernel/cpu/mcheck/mce.c         |    2 +-
 arch/x86/kernel/cpu/mcheck/therm_throt.c |    2 +-
 arch/x86/kernel/cpu/mcheck/threshold.c   |    2 +-
 arch/x86/kernel/irq.c                    |    6 +-
 arch/x86/kernel/process_32.c             |    4 +-
 arch/x86/kernel/process_64.c             |    9 ++-
 include/linux/tick.h                     |   12 ++-
 kernel/rcutree.c                         |   10 ++-
 kernel/softirq.c                         |    4 +-
 kernel/time/tick-sched.c                 |  120 +++++++++++++++++++++---------
 25 files changed, 147 insertions(+), 88 deletions(-)

-- 
1.7.5.4



* [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu extended quiescent state
  2011-09-26 10:19 [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state Frederic Weisbecker
@ 2011-09-26 10:19 ` Frederic Weisbecker
  2011-09-26 22:04   ` Pavel Ivanov
  2011-09-26 10:19 ` [PATCH 2/7] rcu: Fix early call to rcu_enter_nohz() on tick stopping Frederic Weisbecker
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Frederic Weisbecker @ 2011-09-26 10:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Lai Jiangshan, Ingo Molnar

In the rcu_check_extended_qs() function, which is used to check
for illegal uses of RCU under extended quiescent states, we look
at the local value of dynticks, which is even in an extended
quiescent state and odd otherwise.

However, we read this value without disabling preemption, which
opens a race window where we may read the state of a remote CPU
instead, as in the following scenario:

CPU 1                                                        CPU 2

bool rcu_check_extended_qs(void)
{
	struct rcu_dynticks *rdtp;

	rdtp = &per_cpu(rcu_dynticks,
			raw_smp_processor_id());

	< ---- Task is migrated here ---- >

	// CPU 1 goes idle and increments rdtp->dynticks
							/* Here we are reading the value for
							the remote CPU 1 instead of the local CPU */
							if (atomic_read(&rdtp->dynticks) & 0x1)
								return false;

							return true;
							}

The possible result is a false-positive return value from that
function, suggesting we are in an extended quiescent state in random
places.

Fix this by disabling preemption while reading that value.
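
For illustration, an equivalent fix written with explicit preemption
disabling (the patch below uses get_cpu_var()/put_cpu_var(), which
disable and re-enable preemption under the hood); this is only a
sketch of the idea, not the proposed change:

	bool rcu_check_extended_qs(void)
	{
		struct rcu_dynticks *rdtp;
		bool ext_qs;

		preempt_disable();
		/* The CPU can't change under us anymore */
		rdtp = &per_cpu(rcu_dynticks, smp_processor_id());
		/* An even ->dynticks value means extended quiescent state */
		ext_qs = !(atomic_read(&rdtp->dynticks) & 0x1);
		preempt_enable();

		return ext_qs;
	}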

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@redhat.com>
---
 kernel/rcutree.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index c9b4adf..234dca3 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -469,13 +469,15 @@ void rcu_irq_exit(void)
 
 bool rcu_check_extended_qs(void)
 {
-	struct rcu_dynticks *rdtp;
+	struct rcu_dynticks *rdtp = &get_cpu_var(rcu_dynticks);
+	bool ext_qs = true;
 
-	rdtp = &per_cpu(rcu_dynticks, raw_smp_processor_id());
 	if (atomic_read(&rdtp->dynticks) & 0x1)
-		return false;
+		ext_qs = false;
+
+	put_cpu_var(rcu_dynticks);
 
-	return true;
+	return ext_qs;
 }
 EXPORT_SYMBOL_GPL(rcu_check_extended_qs);
 
-- 
1.7.5.4



* [PATCH 2/7] rcu: Fix early call to rcu_enter_nohz() on tick stopping
  2011-09-26 10:19 [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state Frederic Weisbecker
  2011-09-26 10:19 ` [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu " Frederic Weisbecker
@ 2011-09-26 10:19 ` Frederic Weisbecker
  2011-09-26 10:19 ` [PATCH 3/7] nohz: Separate out irq exit and idle loop dyntick logic Frederic Weisbecker
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Frederic Weisbecker @ 2011-09-26 10:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

In tick_nohz_stop_sched_tick(), we enter RCU extended quiescent
state right before reprogramming the next timer.

However, reprogramming it may raise the hrtimer softirq, which
enters the scheduler to wake up the softirq thread and iterates
over the scheduler domains under RCU while searching for the
destination CPU of the task.

We need to be outside an RCU extended quiescent state to do this,
otherwise it is an illegal use of RCU:

	WARNING: at include/linux/rcupdate.h:248 select_task_rq_fair+0xc9b/0xcd0()
	Hardware name: AMD690VM-FMH
	Modules linked in:
	Pid: 0, comm: swapper Tainted: G        W   3.0.0+ #56
	Call Trace:
	[<ffffffff8105cc3f>] warn_slowpath_common+0x7f/0xc0
	[<ffffffff8105cc9a>] warn_slowpath_null+0x1a/0x20
	[<ffffffff8105239b>] select_task_rq_fair+0xc9b/0xcd0
	[<ffffffff812f2b64>] ? do_raw_spin_lock+0x54/0x160
	[<ffffffff81058ee3>] try_to_wake_up+0xd3/0x300
	[<ffffffff81090758>] ? ktime_get+0x68/0xf0
	[<ffffffff81059165>] wake_up_process+0x15/0x20
	[<ffffffff81065135>] raise_softirq_irqoff+0x65/0x110
	[<ffffffff8108a2d5>] __hrtimer_start_range_ns+0x415/0x5a0
	[<ffffffff8108a478>] hrtimer_start+0x18/0x20
	[<ffffffff810980e0>] tick_nohz_stop_sched_tick+0x2b0/0x3c0
	[<ffffffff8100a9c1>] cpu_idle+0x81/0x120
	[<ffffffff817e720f>] rest_init+0xef/0x170
	[<ffffffff817e7172>] ? rest_init+0x52/0x170
	[<ffffffff81ed6cb7>] start_kernel+0x3cb/0x3d6
	[<ffffffff81ed6346>] x86_64_start_reservations+0x131/0x135
	[<ffffffff81ed644d>] x86_64_start_kernel+0x103/0x112

Fix this by calling rcu_enter_nohz() only once everything is done
to stop the tick.
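
Schematically, the ordering inside the tick stop path changes as
follows (simplified, the actual change is in the patch below):

	/* Before (buggy): */
	rcu_enter_nohz();	/* enter RCU extended quiescent state */
	hrtimer_start(...);	/* may wake softirqd -> illegal use of RCU */

	/* After: */
	hrtimer_start(...);	/* RCU still usable here */
	rcu_enter_nohz();	/* enter RCU extended quiescent state last */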

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
---
 kernel/time/tick-sched.c |   27 ++++++++++++++++++---------
 1 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index eb98e55..9416700 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -246,19 +246,13 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
-/**
- * tick_nohz_stop_sched_tick - stop the idle tick from the idle task
- *
- * When the next event is more than a tick into the future, stop the idle tick
- * Called either from the idle loop or from irq_exit() when an idle period was
- * just interrupted by an interrupt which did not cause a reschedule.
- */
-void tick_nohz_stop_sched_tick(int inidle)
+static bool __tick_nohz_stop_sched_tick(int inidle)
 {
 	unsigned long seq, last_jiffies, next_jiffies, delta_jiffies, flags;
 	struct tick_sched *ts;
 	ktime_t last_update, expires, now;
 	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
+	bool stopped = false;
 	u64 time_delta;
 	int cpu;
 
@@ -405,7 +399,7 @@ void tick_nohz_stop_sched_tick(int inidle)
 			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
 			ts->idle_jiffies = last_jiffies;
-			rcu_enter_nohz();
+			stopped = true;
 		}
 
 		ts->idle_sleeps++;
@@ -445,6 +439,21 @@ out:
 	ts->sleep_length = ktime_sub(dev->next_event, now);
 end:
 	local_irq_restore(flags);
+
+	return stopped;
+}
+
+/**
+ * tick_nohz_stop_sched_tick - stop the idle tick from the idle task
+ *
+ * When the next event is more than a tick into the future, stop the idle tick
+ * Called either from the idle loop or from irq_exit() when an idle period was
+ * just interrupted by an interrupt which did not cause a reschedule.
+ */
+void tick_nohz_stop_sched_tick(int inidle)
+{
+	if (__tick_nohz_stop_sched_tick(inidle))
+		rcu_enter_nohz();
 }
 
 /**
-- 
1.7.5.4



* [PATCH 3/7] nohz: Separate out irq exit and idle loop dyntick logic
  2011-09-26 10:19 [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state Frederic Weisbecker
  2011-09-26 10:19 ` [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu " Frederic Weisbecker
  2011-09-26 10:19 ` [PATCH 2/7] rcu: Fix early call to rcu_enter_nohz() on tick stopping Frederic Weisbecker
@ 2011-09-26 10:19 ` Frederic Weisbecker
  2011-09-26 10:19 ` [PATCH 4/7] nohz: Allow rcu extended quiescent state handling separately from tick stop Frederic Weisbecker
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Frederic Weisbecker @ 2011-09-26 10:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Frederic Weisbecker, Mike Frysinger, Guan Xuetao,
	David Miller, Chris Metcalf, Hans-Christian Egtvedt,
	Ralf Baechle, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	H. Peter Anvin, Russell King, Paul Mackerras, Heiko Carstens,
	Paul Mundt

The tick_nohz_stop_sched_tick() function, which tries to delay
the next timer tick as long as possible, can be called from two
places:

- From the idle loop to start the dyntick idle mode
- From interrupt exit if we have interrupted the dyntick
idle mode, so that we reprogram the next tick event in
case the irq changed some internal state that requires this
action.

There are only a few minor differences between the two cases,
handled by that function and driven by the per-CPU ts->inidle
variable and the inidle parameter. Together they guarantee that
we only update the dyntick mode on irq exit if we actually
interrupted the dyntick idle mode.

Split this function into:

- tick_nohz_idle_enter(), which unconditionally sets ts->inidle to 1
and enters dynticks idle mode if it can.

- tick_nohz_irq_exit(), which only updates the dynticks idle mode
when ts->inidle is set (i.e. if tick_nohz_idle_enter() has been called).

To maintain symmetry, tick_nohz_restart_sched_tick() has been renamed
to tick_nohz_idle_exit().

The point of all this is to micro-optimize the irq exit path (no need
for local_irq_save there) and to prepare for the split between dynticks
and rcu extended quiescent state logic. We'll need this split to
further fix illegal uses of RCU in extended quiescent states.
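
For illustration, here is what a typical arch idle loop and the irq
exit path look like after this change (simplified from the hunks
below):

	/* idle loop, e.g. arch/.../process.c */
	while (1) {
		tick_nohz_idle_enter();		/* sets ts->inidle, may stop the tick */
		while (!need_resched())
			idle();
		tick_nohz_idle_exit();		/* restarts the tick */
		preempt_enable_no_resched();
		schedule();
		preempt_disable();
	}

	/* irq exit, kernel/softirq.c */
	if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
		tick_nohz_irq_exit();		/* only acts if ts->inidle is set */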

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
---
 arch/arm/kernel/process.c              |    4 +-
 arch/avr32/kernel/process.c            |    4 +-
 arch/blackfin/kernel/process.c         |    4 +-
 arch/microblaze/kernel/process.c       |    4 +-
 arch/mips/kernel/process.c             |    4 +-
 arch/powerpc/kernel/idle.c             |    4 +-
 arch/powerpc/platforms/iseries/setup.c |    8 ++--
 arch/s390/kernel/process.c             |    4 +-
 arch/sh/kernel/idle.c                  |    4 +-
 arch/sparc/kernel/process_64.c         |    4 +-
 arch/tile/kernel/process.c             |    4 +-
 arch/um/kernel/process.c               |    4 +-
 arch/unicore32/kernel/process.c        |    4 +-
 arch/x86/kernel/process_32.c           |    4 +-
 arch/x86/kernel/process_64.c           |    4 +-
 include/linux/tick.h                   |    9 ++--
 kernel/softirq.c                       |    2 +-
 kernel/time/tick-sched.c               |   87 +++++++++++++++++++------------
 18 files changed, 91 insertions(+), 71 deletions(-)

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 5e1e541..f570e8f 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -182,7 +182,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		leds_event(led_idle_start);
 		while (!need_resched()) {
 #ifdef CONFIG_HOTPLUG_CPU
@@ -208,7 +208,7 @@ void cpu_idle(void)
 			}
 		}
 		leds_event(led_idle_end);
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
index ef5a2a0..6ee7952 100644
--- a/arch/avr32/kernel/process.c
+++ b/arch/avr32/kernel/process.c
@@ -34,10 +34,10 @@ void cpu_idle(void)
 {
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		while (!need_resched())
 			cpu_idle_sleep();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 6a660fa..790b12e 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -88,10 +88,10 @@ void cpu_idle(void)
 #endif
 		if (!idle)
 			idle = default_idle;
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		while (!need_resched())
 			idle();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index 968648a..80c749b 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -103,10 +103,10 @@ void cpu_idle(void)
 		if (!idle)
 			idle = default_idle;
 
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		while (!need_resched())
 			idle();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 
 		preempt_enable_no_resched();
 		schedule();
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index d2112d3..d72a0e9 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -56,7 +56,7 @@ void __noreturn cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		while (!need_resched() && cpu_online(cpu)) {
 #ifdef CONFIG_MIPS_MT_SMTC
 			extern void smtc_idle_loop_hook(void);
@@ -77,7 +77,7 @@ void __noreturn cpu_idle(void)
 		     system_state == SYSTEM_BOOTING))
 			play_dead();
 #endif
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
index 39a2baa..878572f 100644
--- a/arch/powerpc/kernel/idle.c
+++ b/arch/powerpc/kernel/idle.c
@@ -56,7 +56,7 @@ void cpu_idle(void)
 
 	set_thread_flag(TIF_POLLING_NRFLAG);
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		while (!need_resched() && !cpu_should_die()) {
 			ppc64_runlatch_off();
 
@@ -93,7 +93,7 @@ void cpu_idle(void)
 
 		HMT_medium();
 		ppc64_runlatch_on();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		if (cpu_should_die())
 			cpu_die();
diff --git a/arch/powerpc/platforms/iseries/setup.c b/arch/powerpc/platforms/iseries/setup.c
index c25a081..e2f5fad 100644
--- a/arch/powerpc/platforms/iseries/setup.c
+++ b/arch/powerpc/platforms/iseries/setup.c
@@ -562,7 +562,7 @@ static void yield_shared_processor(void)
 static void iseries_shared_idle(void)
 {
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		while (!need_resched() && !hvlpevent_is_pending()) {
 			local_irq_disable();
 			ppc64_runlatch_off();
@@ -576,7 +576,7 @@ static void iseries_shared_idle(void)
 		}
 
 		ppc64_runlatch_on();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 
 		if (hvlpevent_is_pending())
 			process_iSeries_events();
@@ -592,7 +592,7 @@ static void iseries_dedicated_idle(void)
 	set_thread_flag(TIF_POLLING_NRFLAG);
 
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		if (!need_resched()) {
 			while (!need_resched()) {
 				ppc64_runlatch_off();
@@ -609,7 +609,7 @@ static void iseries_dedicated_idle(void)
 		}
 
 		ppc64_runlatch_on();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 541a750..db3e930 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -90,10 +90,10 @@ static void default_idle(void)
 void cpu_idle(void)
 {
 	for (;;) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		while (!need_resched())
 			default_idle();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/sh/kernel/idle.c b/arch/sh/kernel/idle.c
index 425d604..a6b9a96 100644
--- a/arch/sh/kernel/idle.c
+++ b/arch/sh/kernel/idle.c
@@ -88,7 +88,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 
 		while (!need_resched()) {
 			check_pgt_cache();
@@ -109,7 +109,7 @@ void cpu_idle(void)
 			start_critical_timings();
 		}
 
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index c158a95..1235f63 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -95,12 +95,12 @@ void cpu_idle(void)
 	set_thread_flag(TIF_POLLING_NRFLAG);
 
 	while(1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 
 		while (!need_resched() && !cpu_is_offline(cpu))
 			sparc64_yield(cpu);
 
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 
 		preempt_enable_no_resched();
 
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 9c45d8b..920e674 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -85,7 +85,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		while (!need_resched()) {
 			if (cpu_is_offline(cpu))
 				BUG();  /* no HOTPLUG_CPU */
@@ -105,7 +105,7 @@ void cpu_idle(void)
 				local_irq_enable();
 			current_thread_info()->status |= TS_POLLING;
 		}
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index fab4371..046abea 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -245,10 +245,10 @@ void default_idle(void)
 		if (need_resched())
 			schedule();
 
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		nsecs = disable_timer();
 		idle_sleep(nsecs);
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 	}
 }
 
diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
index ba401df..9999b9a 100644
--- a/arch/unicore32/kernel/process.c
+++ b/arch/unicore32/kernel/process.c
@@ -55,7 +55,7 @@ void cpu_idle(void)
 {
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		while (!need_resched()) {
 			local_irq_disable();
 			stop_critical_timings();
@@ -63,7 +63,7 @@ void cpu_idle(void)
 			local_irq_enable();
 			start_critical_timings();
 		}
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index a3d0dc5..1ab4c58 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -97,7 +97,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		while (!need_resched()) {
 
 			check_pgt_cache();
@@ -112,7 +112,7 @@ void cpu_idle(void)
 			pm_idle();
 			start_critical_timings();
 		}
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ca6f7ab..d7a6418 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -120,7 +120,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		while (!need_resched()) {
 
 			rmb();
@@ -145,7 +145,7 @@ void cpu_idle(void)
 			__exit_idle();
 		}
 
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b232ccc..3f094d4 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -121,14 +121,15 @@ static inline int tick_oneshot_mode_active(void) { return 0; }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS */
 
 # ifdef CONFIG_NO_HZ
-extern void tick_nohz_stop_sched_tick(int inidle);
-extern void tick_nohz_restart_sched_tick(void);
+extern void tick_nohz_idle_enter(void);
+extern void tick_nohz_idle_exit(void);
+extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 # else
-static inline void tick_nohz_stop_sched_tick(int inidle) { }
-static inline void tick_nohz_restart_sched_tick(void) { }
+static inline void tick_nohz_idle_enter(void) { }
+static inline void tick_nohz_idle_exit(void) { }
 static inline ktime_t tick_nohz_get_sleep_length(void)
 {
 	ktime_t len = { .tv64 = NSEC_PER_SEC/HZ };
diff --git a/kernel/softirq.c b/kernel/softirq.c
index fca82c3..d2be0e0 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -351,7 +351,7 @@ void irq_exit(void)
 #ifdef CONFIG_NO_HZ
 	/* Make sure that timer wheel updates are propagated */
 	if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
-		tick_nohz_stop_sched_tick(0);
+		tick_nohz_irq_exit();
 #endif
 	preempt_enable_no_resched();
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9416700..b0ce515 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -246,36 +246,18 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
-static bool __tick_nohz_stop_sched_tick(int inidle)
+static bool __tick_nohz_stop_sched_tick(struct tick_sched *ts)
 {
-	unsigned long seq, last_jiffies, next_jiffies, delta_jiffies, flags;
-	struct tick_sched *ts;
+	unsigned long seq, last_jiffies, next_jiffies, delta_jiffies;
 	ktime_t last_update, expires, now;
 	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
-	bool stopped = false;
+	bool stopped;
 	u64 time_delta;
 	int cpu;
 
-	local_irq_save(flags);
-
 	cpu = smp_processor_id();
 	ts = &per_cpu(tick_cpu_sched, cpu);
 
-	/*
-	 * Call to tick_nohz_start_idle stops the last_update_time from being
-	 * updated. Thus, it must not be called in the event we are called from
-	 * irq_exit() with the prior state different than idle.
-	 */
-	if (!inidle && !ts->inidle)
-		goto end;
-
-	/*
-	 * Set ts->inidle unconditionally. Even if the system did not
-	 * switch to NOHZ mode the cpu frequency governers rely on the
-	 * update of the idle time accounting in tick_nohz_start_idle().
-	 */
-	ts->inidle = 1;
-
 	now = tick_nohz_start_idle(cpu, ts);
 
 	/*
@@ -291,10 +273,10 @@ static bool __tick_nohz_stop_sched_tick(int inidle)
 	}
 
 	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
-		goto end;
+		return false;
 
 	if (need_resched())
-		goto end;
+		return false;
 
 	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
 		static int ratelimit;
@@ -304,9 +286,11 @@ static bool __tick_nohz_stop_sched_tick(int inidle)
 			       (unsigned int) local_softirq_pending());
 			ratelimit++;
 		}
-		goto end;
+		return false;
 	}
 
+	stopped = false;
+
 	ts->idle_calls++;
 	/* Read jiffies and the time when jiffies were updated last */
 	do {
@@ -437,23 +421,58 @@ out:
 	ts->next_jiffies = next_jiffies;
 	ts->last_jiffies = last_jiffies;
 	ts->sleep_length = ktime_sub(dev->next_event, now);
-end:
-	local_irq_restore(flags);
 
 	return stopped;
 }
 
+static void tick_nohz_stop_sched_tick(struct tick_sched *ts)
+{
+	if (__tick_nohz_stop_sched_tick(ts))
+		rcu_enter_nohz();
+}
+
 /**
- * tick_nohz_stop_sched_tick - stop the idle tick from the idle task
+ * tick_nohz_idle_enter - stop the idle tick from the idle task
  *
  * When the next event is more than a tick into the future, stop the idle tick
- * Called either from the idle loop or from irq_exit() when an idle period was
- * just interrupted by an interrupt which did not cause a reschedule.
+ * Called when we start the idle loop.
  */
-void tick_nohz_stop_sched_tick(int inidle)
+void tick_nohz_idle_enter(void)
 {
-	if (__tick_nohz_stop_sched_tick(inidle))
-		rcu_enter_nohz();
+	struct tick_sched *ts;
+
+	WARN_ON_ONCE(irqs_disabled());
+
+	local_irq_disable();
+
+	ts = &__get_cpu_var(tick_cpu_sched);
+	/*
+	 * set ts->inidle unconditionally. even if the system did not
+	 * switch to nohz mode the cpu frequency governers rely on the
+	 * update of the idle time accounting in tick_nohz_start_idle().
+	 */
+	ts->inidle = 1;
+	tick_nohz_stop_sched_tick(ts);
+
+	local_irq_enable();
+}
+
+/**
+ * tick_nohz_irq_exit - update next tick event from interrupt exit
+ *
+ * When an interrupt fires while we are idle and it doesn't cause
+ * a reschedule, it may still add, modify or delete a timer, enqueue
+ * an RCU callback, etc...
+ * So we need to re-calculate and reprogram the next tick event.
+ */
+void tick_nohz_irq_exit(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (!ts->inidle)
+		return;
+
+	tick_nohz_stop_sched_tick(ts);
 }
 
 /**
@@ -495,11 +514,11 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 }
 
 /**
- * tick_nohz_restart_sched_tick - restart the idle tick from the idle task
+ * tick_nohz_idle_exit - restart the idle tick from the idle task
  *
  * Restart the idle tick when the CPU is woken up from idle
  */
-void tick_nohz_restart_sched_tick(void)
+void tick_nohz_idle_exit(void)
 {
 	int cpu = smp_processor_id();
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-- 
1.7.5.4



* [PATCH 4/7] nohz: Allow rcu extended quiescent state handling separately from tick stop
  2011-09-26 10:19 [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2011-09-26 10:19 ` [PATCH 3/7] nohz: Separate out irq exit and idle loop dyntick logic Frederic Weisbecker
@ 2011-09-26 10:19 ` Frederic Weisbecker
  2011-09-26 10:44   ` Peter Zijlstra
  2011-09-26 10:19 ` [PATCH 5/7] x86: Enter rcu extended qs after idle notifier call Frederic Weisbecker
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Frederic Weisbecker @ 2011-09-26 10:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Frederic Weisbecker, Mike Frysinger, Guan Xuetao,
	David Miller, Chris Metcalf, Hans-Christian Egtvedt,
	Ralf Baechle, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	H. Peter Anvin, Russell King, Paul Mackerras, Heiko Carstens,
	Paul Mundt

It is assumed that RCU won't be used once we switch to tickless
mode, until we restart the tick. However this is not always
true; on x86-64, for example, we dereference the idle notifiers
after the tick is stopped.

To prepare for fixing this, add a parameter to tick_nohz_idle_enter()
named "rcu_ext_qs" that tells whether we want to enter RCU extended
quiescent state at the same time we stop the tick.

If no use of RCU is made in the idle loop between the
tick_nohz_idle_enter() and tick_nohz_idle_exit() calls, the parameter
must be set to true and the arch doesn't need to call rcu_enter_nohz()
and rcu_exit_nohz() explicitly.

Otherwise the parameter must be set to false and the arch is
responsible for calling:

- rcu_enter_nohz() after its last use of RCU before the CPU is put
to sleep.
- rcu_exit_nohz() before the first use of RCU after the CPU is woken
up.
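
Both cases, sketched against the interface introduced below:

	/* Case 1: no use of RCU in the idle loop */
	tick_nohz_idle_enter(true);	/* stop the tick and enter RCU extended QS */
	while (!need_resched())
		idle();
	tick_nohz_idle_exit();		/* exit RCU extended QS and restart the tick */

	/* Case 2: the arch still uses RCU after the tick is stopped */
	tick_nohz_idle_enter(false);	/* stop the tick only */
	while (!need_resched()) {
		/* ... last uses of RCU, e.g. idle notifiers ... */
		rcu_enter_nohz();
		idle();			/* CPU sleeps */
		rcu_exit_nohz();
		/* ... RCU is usable again after wakeup ... */
	}
	tick_nohz_idle_exit();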

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
---
 arch/arm/kernel/process.c              |    2 +-
 arch/avr32/kernel/process.c            |    2 +-
 arch/blackfin/kernel/process.c         |    2 +-
 arch/microblaze/kernel/process.c       |    2 +-
 arch/mips/kernel/process.c             |    2 +-
 arch/powerpc/kernel/idle.c             |    2 +-
 arch/powerpc/platforms/iseries/setup.c |    4 ++--
 arch/s390/kernel/process.c             |    2 +-
 arch/sh/kernel/idle.c                  |    2 +-
 arch/sparc/kernel/process_64.c         |    2 +-
 arch/tile/kernel/process.c             |    2 +-
 arch/um/kernel/process.c               |    2 +-
 arch/unicore32/kernel/process.c        |    2 +-
 arch/x86/kernel/process_32.c           |    2 +-
 arch/x86/kernel/process_64.c           |    2 +-
 include/linux/tick.h                   |    7 +++++--
 kernel/time/tick-sched.c               |   28 ++++++++++++++++++++++++----
 17 files changed, 45 insertions(+), 22 deletions(-)

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index f570e8f..51b0e39 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -182,7 +182,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		leds_event(led_idle_start);
 		while (!need_resched()) {
 #ifdef CONFIG_HOTPLUG_CPU
diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
index 6ee7952..5041c84 100644
--- a/arch/avr32/kernel/process.c
+++ b/arch/avr32/kernel/process.c
@@ -34,7 +34,7 @@ void cpu_idle(void)
 {
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		while (!need_resched())
 			cpu_idle_sleep();
 		tick_nohz_idle_exit();
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 790b12e..f22a0da 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -88,7 +88,7 @@ void cpu_idle(void)
 #endif
 		if (!idle)
 			idle = default_idle;
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		while (!need_resched())
 			idle();
 		tick_nohz_idle_exit();
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index 80c749b..0f5290f 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -103,7 +103,7 @@ void cpu_idle(void)
 		if (!idle)
 			idle = default_idle;
 
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		while (!need_resched())
 			idle();
 		tick_nohz_idle_exit();
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index d72a0e9..20be814 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -56,7 +56,7 @@ void __noreturn cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		while (!need_resched() && cpu_online(cpu)) {
 #ifdef CONFIG_MIPS_MT_SMTC
 			extern void smtc_idle_loop_hook(void);
diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
index 878572f..a0e31a7 100644
--- a/arch/powerpc/kernel/idle.c
+++ b/arch/powerpc/kernel/idle.c
@@ -56,7 +56,7 @@ void cpu_idle(void)
 
 	set_thread_flag(TIF_POLLING_NRFLAG);
 	while (1) {
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		while (!need_resched() && !cpu_should_die()) {
 			ppc64_runlatch_off();
 
diff --git a/arch/powerpc/platforms/iseries/setup.c b/arch/powerpc/platforms/iseries/setup.c
index e2f5fad..f239427 100644
--- a/arch/powerpc/platforms/iseries/setup.c
+++ b/arch/powerpc/platforms/iseries/setup.c
@@ -562,7 +562,7 @@ static void yield_shared_processor(void)
 static void iseries_shared_idle(void)
 {
 	while (1) {
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		while (!need_resched() && !hvlpevent_is_pending()) {
 			local_irq_disable();
 			ppc64_runlatch_off();
@@ -592,7 +592,7 @@ static void iseries_dedicated_idle(void)
 	set_thread_flag(TIF_POLLING_NRFLAG);
 
 	while (1) {
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		if (!need_resched()) {
 			while (!need_resched()) {
 				ppc64_runlatch_off();
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index db3e930..3dbaf59 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -90,7 +90,7 @@ static void default_idle(void)
 void cpu_idle(void)
 {
 	for (;;) {
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		while (!need_resched())
 			default_idle();
 		tick_nohz_idle_exit();
diff --git a/arch/sh/kernel/idle.c b/arch/sh/kernel/idle.c
index a6b9a96..bb0a627 100644
--- a/arch/sh/kernel/idle.c
+++ b/arch/sh/kernel/idle.c
@@ -88,7 +88,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 
 		while (!need_resched()) {
 			check_pgt_cache();
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index 1235f63..3c5d363 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -95,7 +95,7 @@ void cpu_idle(void)
 	set_thread_flag(TIF_POLLING_NRFLAG);
 
 	while(1) {
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 
 		while (!need_resched() && !cpu_is_offline(cpu))
 			sparc64_yield(cpu);
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 920e674..727dc85 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -85,7 +85,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		while (!need_resched()) {
 			if (cpu_is_offline(cpu))
 				BUG();  /* no HOTPLUG_CPU */
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 046abea..5693d6d 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -245,7 +245,7 @@ void default_idle(void)
 		if (need_resched())
 			schedule();
 
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		nsecs = disable_timer();
 		idle_sleep(nsecs);
 		tick_nohz_idle_exit();
diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
index 9999b9a..afa50d9 100644
--- a/arch/unicore32/kernel/process.c
+++ b/arch/unicore32/kernel/process.c
@@ -55,7 +55,7 @@ void cpu_idle(void)
 {
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		while (!need_resched()) {
 			local_irq_disable();
 			stop_critical_timings();
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 1ab4c58..8c2faa9 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -97,7 +97,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		while (!need_resched()) {
 
 			check_pgt_cache();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index d7a6418..19ca231 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -120,7 +120,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
+		tick_nohz_idle_enter(true);
 		while (!need_resched()) {
 
 			rmb();
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 3f094d4..375e7d8 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -45,6 +45,8 @@ enum tick_nohz_mode {
  * @iowait_sleeptime:	Sum of the time slept in idle with sched tick stopped, with IO outstanding
  * @sleep_length:	Duration of the current idle sleep
  * @do_timer_lst:	CPU was the last one doing do_timer before going idle
+ * @rcu_ext_qs:		Set if we want to enter RCU extended quiescent state
+ *			when the tick gets stopped.
  */
 struct tick_sched {
 	struct hrtimer			sched_timer;
@@ -67,6 +69,7 @@ struct tick_sched {
 	unsigned long			next_jiffies;
 	ktime_t				idle_expires;
 	int				do_timer_last;
+	int				rcu_ext_qs;
 };
 
 extern void __init tick_init(void);
@@ -121,14 +124,14 @@ static inline int tick_oneshot_mode_active(void) { return 0; }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS */
 
 # ifdef CONFIG_NO_HZ
-extern void tick_nohz_idle_enter(void);
+extern void tick_nohz_idle_enter(bool rcu_ext_qs);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 # else
-static inline void tick_nohz_idle_enter(void) { }
+static inline void tick_nohz_idle_enter(bool rcu_ext_qs) { }
 static inline void tick_nohz_idle_exit(void) { }
 static inline ktime_t tick_nohz_get_sleep_length(void)
 {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b0ce515..cd1a54e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -427,17 +427,30 @@ out:
 
 static void tick_nohz_stop_sched_tick(struct tick_sched *ts)
 {
-	if (__tick_nohz_stop_sched_tick(ts))
+	if (__tick_nohz_stop_sched_tick(ts) && ts->rcu_ext_qs)
 		rcu_enter_nohz();
 }
 
 /**
  * tick_nohz_idle_enter - stop the idle tick from the idle task
+ * @rcu_ext_qs: enter into rcu extended quiescent state
  *
- * When the next event is more than a tick into the future, stop the idle tick
+ * When the next event is more than a tick into the future, stop the idle tick.
  * Called when we start the idle loop.
+ *
+ * If no use of RCU is made in the idle loop between
+ * tick_nohz_idle_enter() and tick_nohz_idle_exit() calls, rcu_ext_qs
+ * must be set to true and the arch doesn't need to call rcu_enter_nohz()
+ * and rcu_exit_nohz() explicitly.
+ *
+ * Otherwise the parameter must be set to false and the arch is
+ * responsible for calling:
+ *
+ * - rcu_enter_nohz() after its last use of RCU before the CPU is put
+ *  to sleep.
+ * - rcu_exit_nohz() before the first use of RCU after the CPU is woken up.
  */
-void tick_nohz_idle_enter(void)
+void tick_nohz_idle_enter(bool rcu_ext_qs)
 {
 	struct tick_sched *ts;
 
@@ -452,6 +465,10 @@ void tick_nohz_idle_enter(void)
 	 * update of the idle time accounting in tick_nohz_start_idle().
 	 */
 	ts->inidle = 1;
+
+	if (rcu_ext_qs)
+		ts->rcu_ext_qs = 1;
+
 	tick_nohz_stop_sched_tick(ts);
 
 	local_irq_enable();
@@ -542,7 +559,10 @@ void tick_nohz_idle_exit(void)
 
 	ts->inidle = 0;
 
-	rcu_exit_nohz();
+	if (ts->rcu_ext_qs) {
+		rcu_exit_nohz();
+		ts->rcu_ext_qs = 0;
+	}
 
 	/* Update jiffies first */
 	select_nohz_load_balancer(0);
-- 
1.7.5.4



* [PATCH 5/7] x86: Enter rcu extended qs after idle notifier call
  2011-09-26 10:19 [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2011-09-26 10:19 ` [PATCH 4/7] nohz: Allow rcu extended quiescent state handling separately from tick stop Frederic Weisbecker
@ 2011-09-26 10:19 ` Frederic Weisbecker
  2011-09-26 10:19 ` [PATCH 6/7] x86: Call idle notifier after irq_enter() Frederic Weisbecker
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Frederic Weisbecker @ 2011-09-26 10:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, H. Peter Anvin

The idle notifier, called by enter_idle(), enters an RCU read-side
critical section, but at that time we have already switched into
RCU dynticks idle mode, and it is illegal to use rcu_read_lock()
in that state.

This results in rcu reporting its bad mood:

[    1.275635] WARNING: at include/linux/rcupdate.h:194 __atomic_notifier_call_chain+0xd2/0x110()
[    1.275635] Hardware name: AMD690VM-FMH
[    1.275635] Modules linked in:
[    1.275635] Pid: 0, comm: swapper Not tainted 3.0.0-rc6+ #252
[    1.275635] Call Trace:
[    1.275635]  [<ffffffff81051c8a>] warn_slowpath_common+0x7a/0xb0
[    1.275635]  [<ffffffff81051cd5>] warn_slowpath_null+0x15/0x20
[    1.275635]  [<ffffffff817d6f22>] __atomic_notifier_call_chain+0xd2/0x110
[    1.275635]  [<ffffffff817d6f71>] atomic_notifier_call_chain+0x11/0x20
[    1.275635]  [<ffffffff810018a0>] enter_idle+0x20/0x30
[    1.275635]  [<ffffffff81001995>] cpu_idle+0xa5/0x110
[    1.275635]  [<ffffffff817a7465>] rest_init+0xe5/0x140
[    1.275635]  [<ffffffff817a73c8>] ? rest_init+0x48/0x140
[    1.275635]  [<ffffffff81cc5ca3>] start_kernel+0x3d1/0x3dc
[    1.275635]  [<ffffffff81cc5321>] x86_64_start_reservations+0x131/0x135
[    1.275635]  [<ffffffff81cc5412>] x86_64_start_kernel+0xed/0xf4
[    1.275635] ---[ end trace a22d306b065d4a66 ]---

Fix this by entering rcu extended quiescent state later, just before
the CPU goes to sleep.
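
The resulting ordering in the x86-64 idle loop, simplified from the
hunk below:

	tick_nohz_idle_enter(false);	/* stop the tick, stay out of RCU extended QS */
	while (!need_resched()) {
		...
		enter_idle();		/* idle notifiers may use rcu_read_lock() */
		rcu_enter_nohz();	/* enter RCU extended QS right before sleeping */
		pm_idle();
		rcu_exit_nohz();
		...
	}
	tick_nohz_idle_exit();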

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
---
 arch/x86/kernel/process_64.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 19ca231..dee2e6c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -120,7 +120,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter(true);
+		tick_nohz_idle_enter(false);
 		while (!need_resched()) {
 
 			rmb();
@@ -136,7 +136,12 @@ void cpu_idle(void)
 			enter_idle();
 			/* Don't trace irqs off for idle */
 			stop_critical_timings();
+
+			/* enter_idle() needs rcu for notifiers */
+			rcu_enter_nohz();
 			pm_idle();
+			rcu_exit_nohz();
+
 			start_critical_timings();
 
 			/* In many cases the interrupt that ended idle
-- 
1.7.5.4



* [PATCH 6/7] x86: Call idle notifier after irq_enter()
  2011-09-26 10:19 [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2011-09-26 10:19 ` [PATCH 5/7] x86: Enter rcu extended qs after idle notifier call Frederic Weisbecker
@ 2011-09-26 10:19 ` Frederic Weisbecker
  2011-09-26 10:19 ` [PATCH 7/7] rcu: Fix early call to rcu_irq_exit() Frederic Weisbecker
  2011-09-26 18:26 ` [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state Paul E. McKenney
  7 siblings, 0 replies; 23+ messages in thread
From: Frederic Weisbecker @ 2011-09-26 10:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	H. Peter Anvin, Andy Henroid

Interrupts notify the idle exit state before calling irq_enter().
But the notifier code calls rcu_read_lock(), which is not allowed
while RCU is in an extended quiescent state. We need to wait for
rcu_irq_enter() to be called before notifying the idle exit,
otherwise this results in a grumpy RCU:

[    0.099991] WARNING: at include/linux/rcupdate.h:194 __atomic_notifier_call_chain+0xd2/0x110()
[    0.099991] Hardware name: AMD690VM-FMH
[    0.099991] Modules linked in:
[    0.099991] Pid: 0, comm: swapper Not tainted 3.0.0-rc6+ #255
[    0.099991] Call Trace:
[    0.099991]  <IRQ>  [<ffffffff81051c8a>] warn_slowpath_common+0x7a/0xb0
[    0.099991]  [<ffffffff81051cd5>] warn_slowpath_null+0x15/0x20
[    0.099991]  [<ffffffff817d6fa2>] __atomic_notifier_call_chain+0xd2/0x110
[    0.099991]  [<ffffffff817d6ff1>] atomic_notifier_call_chain+0x11/0x20
[    0.099991]  [<ffffffff81001873>] exit_idle+0x43/0x50
[    0.099991]  [<ffffffff81020439>] smp_apic_timer_interrupt+0x39/0xa0
[    0.099991]  [<ffffffff817da253>] apic_timer_interrupt+0x13/0x20
[    0.099991]  <EOI>  [<ffffffff8100ae67>] ? default_idle+0xa7/0x350
[    0.099991]  [<ffffffff8100ae65>] ? default_idle+0xa5/0x350
[    0.099991]  [<ffffffff8100b19b>] amd_e400_idle+0x8b/0x110
[    0.099991]  [<ffffffff810cb01f>] ? rcu_enter_nohz+0x8f/0x160
[    0.099991]  [<ffffffff810019a0>] cpu_idle+0xb0/0x110
[    0.099991]  [<ffffffff817a7505>] rest_init+0xe5/0x140
[    0.099991]  [<ffffffff817a7468>] ? rest_init+0x48/0x140
[    0.099991]  [<ffffffff81cc5ca3>] start_kernel+0x3d1/0x3dc
[    0.099991]  [<ffffffff81cc5321>] x86_64_start_reservations+0x131/0x135
[    0.099991]  [<ffffffff81cc5412>] x86_64_start_kernel+0xed/0xf4
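
Schematically, each affected handler is reordered as follows (the
full list is in the diff below); irq_enter() is what ends up calling
rcu_irq_enter() and makes RCU usable again:

	/* Before (buggy): */
	exit_idle();	/* notifier -> rcu_read_lock() in extended QS */
	irq_enter();

	/* After: */
	irq_enter();	/* rcu_irq_enter() exits the extended QS */
	exit_idle();	/* notifiers can safely use RCU now */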

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Henroid <andrew.d.henroid@intel.com>
---
 arch/x86/kernel/apic/apic.c              |    6 +++---
 arch/x86/kernel/apic/io_apic.c           |    2 +-
 arch/x86/kernel/cpu/mcheck/mce.c         |    2 +-
 arch/x86/kernel/cpu/mcheck/therm_throt.c |    2 +-
 arch/x86/kernel/cpu/mcheck/threshold.c   |    2 +-
 arch/x86/kernel/irq.c                    |    6 +++---
 6 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b9338b8..82af921 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -856,8 +856,8 @@ void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
 	 * Besides, if we don't timer interrupts ignore the global
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
-	exit_idle();
 	irq_enter();
+	exit_idle();
 	local_apic_timer_interrupt();
 	irq_exit();
 
@@ -1790,8 +1790,8 @@ void smp_spurious_interrupt(struct pt_regs *regs)
 {
 	u32 v;
 
-	exit_idle();
 	irq_enter();
+	exit_idle();
 	/*
 	 * Check if this really is a spurious interrupt and ACK it
 	 * if it is a vectored one.  Just in case...
@@ -1827,8 +1827,8 @@ void smp_error_interrupt(struct pt_regs *regs)
 		"Illegal register address",	/* APIC Error Bit 7 */
 	};
 
-	exit_idle();
 	irq_enter();
+	exit_idle();
 	/* First tickle the hardware, only then report what went on. -- REW */
 	v0 = apic_read(APIC_ESR);
 	apic_write(APIC_ESR, 0);
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index e529339..e1b5eec 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2275,8 +2275,8 @@ asmlinkage void smp_irq_move_cleanup_interrupt(void)
 	unsigned vector, me;
 
 	ack_APIC_irq();
-	exit_idle();
 	irq_enter();
+	exit_idle();
 
 	me = smp_processor_id();
 	for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index ff1ae9b..7ba9757 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -470,8 +470,8 @@ static inline void mce_get_rip(struct mce *m, struct pt_regs *regs)
 asmlinkage void smp_mce_self_interrupt(struct pt_regs *regs)
 {
 	ack_APIC_irq();
-	exit_idle();
 	irq_enter();
+	exit_idle();
 	mce_notify_irq();
 	mce_schedule_work();
 	irq_exit();
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 27c6251..f6bbc64 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -396,8 +396,8 @@ static void (*smp_thermal_vector)(void) = unexpected_thermal_interrupt;
 
 asmlinkage void smp_thermal_interrupt(struct pt_regs *regs)
 {
-	exit_idle();
 	irq_enter();
+	exit_idle();
 	inc_irq_stat(irq_thermal_count);
 	smp_thermal_vector();
 	irq_exit();
diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
index d746df2..aa578ca 100644
--- a/arch/x86/kernel/cpu/mcheck/threshold.c
+++ b/arch/x86/kernel/cpu/mcheck/threshold.c
@@ -19,8 +19,8 @@ void (*mce_threshold_vector)(void) = default_threshold_interrupt;
 
 asmlinkage void smp_threshold_interrupt(void)
 {
-	exit_idle();
 	irq_enter();
+	exit_idle();
 	inc_irq_stat(irq_threshold_count);
 	mce_threshold_vector();
 	irq_exit();
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 6c0802e..73cf928 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -180,8 +180,8 @@ unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 	unsigned vector = ~regs->orig_ax;
 	unsigned irq;
 
-	exit_idle();
 	irq_enter();
+	exit_idle();
 
 	irq = __this_cpu_read(vector_irq[vector]);
 
@@ -208,10 +208,10 @@ void smp_x86_platform_ipi(struct pt_regs *regs)
 
 	ack_APIC_irq();
 
-	exit_idle();
-
 	irq_enter();
 
+	exit_idle();
+
 	inc_irq_stat(x86_platform_ipis);
 
 	if (x86_platform_ipi_callback)
-- 
1.7.5.4



* [PATCH 7/7] rcu: Fix early call to rcu_irq_exit()
  2011-09-26 10:19 [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2011-09-26 10:19 ` [PATCH 6/7] x86: Call idle notifier after irq_enter() Frederic Weisbecker
@ 2011-09-26 10:19 ` Frederic Weisbecker
  2011-09-26 18:26 ` [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state Paul E. McKenney
  7 siblings, 0 replies; 23+ messages in thread
From: Frederic Weisbecker @ 2011-09-26 10:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, Peter Zijlstra

On the irq exit path, tick_nohz_irq_exit()
may raise a softirq, which leads to the wakeup
path and to select_task_rq_fair(), which uses RCU
to iterate over the scheduler domains.

This is an illegal use of RCU because we may already be in
dyntick idle mode:

[  132.978883] ===============================
[  132.978883] [ INFO: suspicious RCU usage. ]
[  132.978883] -------------------------------
[  132.978883] kernel/sched_fair.c:1707 suspicious rcu_dereference_check() usage!
[  132.978883]
[  132.978883] other info that might help us debug this:
[  132.978883]
[  132.978883]
[  132.978883] rcu_scheduler_active = 1, debug_locks = 0
[  132.978883] RCU used illegally from extended quiescent state!
[  132.978883] 2 locks held by swapper/0:
[  132.978883]  #0:  (&p->pi_lock){-.-.-.}, at: [<ffffffff8105a729>] try_to_wake_up+0x39/0x2f0
[  132.978883]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff8105556a>] select_task_rq_fair+0x6a/0xec0
[  132.978883]
[  132.978883] stack backtrace:
[  132.978883] Pid: 0, comm: swapper Tainted: G        W   3.0.0+ #178
[  132.978883] Call Trace:
[  132.978883]  <IRQ>  [<ffffffff810a01f6>] lockdep_rcu_suspicious+0xe6/0x100
[  132.978883]  [<ffffffff81055c49>] select_task_rq_fair+0x749/0xec0
[  132.978883]  [<ffffffff8105556a>] ? select_task_rq_fair+0x6a/0xec0
[  132.978883]  [<ffffffff812fe494>] ? do_raw_spin_lock+0x54/0x150
[  132.978883]  [<ffffffff810a1f2d>] ? trace_hardirqs_on+0xd/0x10
[  132.978883]  [<ffffffff8105a7c3>] try_to_wake_up+0xd3/0x2f0
[  132.978883]  [<ffffffff81094f98>] ? ktime_get+0x68/0xf0
[  132.978883]  [<ffffffff8105aa35>] wake_up_process+0x15/0x20
[  132.978883]  [<ffffffff81069dd5>] raise_softirq_irqoff+0x65/0x110
[  132.978883]  [<ffffffff8108eb65>] __hrtimer_start_range_ns+0x415/0x5a0
[  132.978883]  [<ffffffff812fe3ee>] ? do_raw_spin_unlock+0x5e/0xb0
[  132.978883]  [<ffffffff8108ed08>] hrtimer_start+0x18/0x20
[  132.978883]  [<ffffffff8109c9c3>] tick_nohz_stop_sched_tick+0x393/0x450
[  132.978883]  [<ffffffff810694f2>] irq_exit+0xd2/0x100
[  132.978883]  [<ffffffff81829e96>] do_IRQ+0x66/0xe0
[  132.978883]  [<ffffffff81820d53>] common_interrupt+0x13/0x13
[  132.978883]  <EOI>  [<ffffffff8103434b>] ? native_safe_halt+0xb/0x10
[  132.978883]  [<ffffffff810a1f2d>] ? trace_hardirqs_on+0xd/0x10
[  132.978883]  [<ffffffff810144ea>] default_idle+0xba/0x370
[  132.978883]  [<ffffffff810147fe>] amd_e400_idle+0x5e/0x130
[  132.978883]  [<ffffffff8100a9f6>] cpu_idle+0xb6/0x120
[  132.978883]  [<ffffffff817f217f>] rest_init+0xef/0x150
[  132.978883]  [<ffffffff817f20e2>] ? rest_init+0x52/0x150
[  132.978883]  [<ffffffff81ed9cf3>] start_kernel+0x3da/0x3e5
[  132.978883]  [<ffffffff81ed9346>] x86_64_start_reservations+0x131/0x135
[  132.978883]  [<ffffffff81ed944d>] x86_64_start_kernel+0x103/0x112

Fix this by calling rcu_irq_exit() after tick_nohz_stop_sched_tick().
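
The resulting irq_exit() ordering, simplified from the patch below:

	void irq_exit(void)
	{
		...
		if (!in_interrupt() && local_softirq_pending())
			invoke_softirq();
	#ifdef CONFIG_NO_HZ
		/* may raise a softirq and wake softirqd, so it needs RCU */
		if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
			tick_nohz_irq_exit();
	#endif
		rcu_irq_exit();		/* enter the extended QS only after that */
		preempt_enable_no_resched();
	}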

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/softirq.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index d2be0e0..328aabb 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -347,12 +347,12 @@ void irq_exit(void)
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
 
-	rcu_irq_exit();
 #ifdef CONFIG_NO_HZ
 	/* Make sure that timer wheel updates are propagated */
 	if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
 		tick_nohz_irq_exit();
 #endif
+	rcu_irq_exit();
 	preempt_enable_no_resched();
 }
 
-- 
1.7.5.4



* Re: [PATCH 4/7] nohz: Allow rcu extended quiescent state handling separately from tick stop
  2011-09-26 10:19 ` [PATCH 4/7] nohz: Allow rcu extended quiescent state handling separately from tick stop Frederic Weisbecker
@ 2011-09-26 10:44   ` Peter Zijlstra
  2011-09-26 16:02     ` Paul E. McKenney
  2011-09-26 17:06     ` Frederic Weisbecker
  0 siblings, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2011-09-26 10:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, LKML, Mike Frysinger, Guan Xuetao,
	David Miller, Chris Metcalf, Hans-Christian Egtvedt,
	Ralf Baechle, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	Russell King, Paul Mackerras, Heiko Carstens, Paul Mundt

On Mon, 2011-09-26 at 12:19 +0200, Frederic Weisbecker wrote:
> It is assumed that rcu won't be used once we switch to tickless
> mode and until we restart the tick. However this is not always
> true, as in x86-64 where we dereference the idle notifiers after
> the tick is stopped.
> 
> To prepare for fixing this, add a parameter to tick_nohz_enter_idle()
> named "rcu_ext_qs" that tells whether we want to enter RCU extended
> quiescent state at the same time we stop the tick.
> 
> If no use of RCU is made in the idle loop between
> tick_nohz_enter_idle() and tick_nohz_exit_idle() calls, the parameter
> must be set to true and the arch doesn't need to call rcu_enter_nohz()
> and rcu_exit_nohz() explicitly.
> 
> Otherwise the parameter must be set to false and the arch is
> responsible of calling:
> 
> - rcu_enter_nohz() after its last use of RCU before the CPU is put
> to sleep.
> - rcu_exit_nohz() before the first use of RCU after the CPU is woken
> up. 

I can't say this really makes sense:

  tick_nohz_idle_enter(false);

reads like, don't enter nohz state, not: enter nohz state but don't
enter rcu-nohz state.

I realize you want to keep the per-arch frobbing low, but since you're
already touching all of them, I think it's more important to keep the
functions readable.

Why not simply fully split nohz and rcu and modify all idle routines
with both calls?

^ permalink raw reply	[flat|nested] 23+ messages in thread
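
A minimal sketch of what the fully split idle loop suggested above could
look like, reusing the helper names discussed in this series; this is
illustrative only (cpu_sleep() is a placeholder for the arch halt call,
and each architecture's idle loop differs), not code from the posted patches:

        /* nohz and RCU extended QS handled by separate, explicit calls */
        while (1) {
                tick_nohz_idle_enter();         /* stop the tick only */
                while (!need_resched()) {
                        /* arch code may still use RCU here (idle notifiers, ...) */
                        rcu_enter_nohz();       /* last RCU use is behind us */
                        cpu_sleep();            /* placeholder: halt the CPU */
                        rcu_exit_nohz();        /* RCU is usable again */
                }
                tick_nohz_idle_exit();          /* restart the tick */
                schedule();
        }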

* Re: [PATCH 4/7] nohz: Allow rcu extended quiescent state handling seperately from tick stop
  2011-09-26 10:44   ` Peter Zijlstra
@ 2011-09-26 16:02     ` Paul E. McKenney
  2011-09-26 16:06       ` Peter Zijlstra
  2011-09-26 17:06     ` Frederic Weisbecker
  1 sibling, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2011-09-26 16:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Mike Frysinger, Guan Xuetao,
	David Miller, Chris Metcalf, Hans-Christian Egtvedt,
	Ralf Baechle, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	Russell King, Paul Mackerras, Heiko Carstens, Paul Mundt

On Mon, Sep 26, 2011 at 12:44:44PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-09-26 at 12:19 +0200, Frederic Weisbecker wrote:
> > It is assumed that rcu won't be used once we switch to tickless
> > mode and until we restart the tick. However this is not always
> > true, as in x86-64 where we dereference the idle notifiers after
> > the tick is stopped.
> > 
> > To prepare for fixing this, add a parameter to tick_nohz_enter_idle()
> > named "rcu_ext_qs" that tells whether we want to enter RCU extended
> > quiescent state at the same time we stop the tick.
> > 
> > If no use of RCU is made in the idle loop between
> > tick_nohz_enter_idle() and tick_nohz_exit_idle() calls, the parameter
> > must be set to true and the arch doesn't need to call rcu_enter_nohz()
> > and rcu_exit_nohz() explicitly.
> > 
> > Otherwise the parameter must be set to false and the arch is
> > responsible of calling:
> > 
> > - rcu_enter_nohz() after its last use of RCU before the CPU is put
> > to sleep.
> > - rcu_exit_nohz() before the first use of RCU after the CPU is woken
> > up. 
> 
> I can't say this really makes sense:
> 
>   tick_nohz_idle_enter(false);
> 
> reads like, don't enter nohz state, not: enter nohz state but don't
> enter rcu-nohz state.
> 
> I realize you want to keep the per-arch frobbing low, but since you're
> already touching all of them, I think its more important to keep the
> functions readable.
> 
> Why not simply fully split nohz and rcu and modify all idle routines
> with both calls?

This might well be the correct thing to do, but one thing that gives
me pause is that some architectures have a large number of idle routines.
If such an architecture can use tick_nohz_idle_enter(true), then that
architecture needs only one change rather than one change to each of
potentially many idle loops.

Would your readability concerns be addressed by something like the
following?

#define RCU_NO_HZ_LATER 0
#define RCU_NO_HZ_NOW   1

Then we would have one of the following:

	tick_nohz_idle_enter(RCU_NO_HZ_LATER);
	tick_nohz_idle_enter(RCU_NO_HZ_NOW);

							Thanx, Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread
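
As a rough illustration of how such a flag could be consumed, assuming the
single-parameter tick_nohz_idle_enter() discussed in this thread (a sketch,
not the implementation in the posted patches):

        void tick_nohz_idle_enter(bool rcu_ext_qs)
        {
                /* ... stop the tick as before ... */
                if (rcu_ext_qs)
                        rcu_enter_nohz();       /* enter RCU extended QS right away */
                /* otherwise the arch calls rcu_enter_nohz() itself, later */
        }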

* Re: [PATCH 4/7] nohz: Allow rcu extended quiescent state handling seperately from tick stop
  2011-09-26 16:02     ` Paul E. McKenney
@ 2011-09-26 16:06       ` Peter Zijlstra
  2011-09-26 16:32         ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2011-09-26 16:06 UTC (permalink / raw)
  To: paulmck
  Cc: Frederic Weisbecker, LKML, Mike Frysinger, Guan Xuetao,
	David Miller, Chris Metcalf, Hans-Christian Egtvedt,
	Ralf Baechle, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	Russell King, Paul Mackerras, Heiko Carstens, Paul Mundt

On Mon, 2011-09-26 at 09:02 -0700, Paul E. McKenney wrote:
> Would your readability concerns be addressed by something like the
> following?
> 
> #define RCU_NO_HZ_LATER 0
> #define RCU_NO_HZ_NOW   1
> 
> Then we would have one of the following:
> 
>         tick_nohz_idle_enter(RCU_NO_HZ_LATER);
>         tick_nohz_idle_enter(RCU_NO_HZ_NOW); 

That certainly is a lot better, except for the two different ways of
collating NO HZ in that one line.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/7] nohz: Allow rcu extended quiescent state handling seperately from tick stop
  2011-09-26 16:06       ` Peter Zijlstra
@ 2011-09-26 16:32         ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2011-09-26 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Mike Frysinger, Guan Xuetao,
	David Miller, Chris Metcalf, Hans-Christian Egtvedt,
	Ralf Baechle, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	Russell King, Paul Mackerras, Heiko Carstens, Paul Mundt

On Mon, Sep 26, 2011 at 06:06:47PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-09-26 at 09:02 -0700, Paul E. McKenney wrote:
> > Would your readability concerns be addressed by something like the
> > following?
> > 
> > #define RCU_NO_HZ_LATER 0
> > #define RCU_NO_HZ_NOW   1
> > 
> > Then we would have one of the following:
> > 
> >         tick_nohz_idle_enter(RCU_NO_HZ_LATER);
> >         tick_nohz_idle_enter(RCU_NO_HZ_NOW); 
> 
> That certainly is a lot better, except for the two different ways of
> collating NO HZ in that one line.

So perhaps RCU_NOHZ_LATER and RCU_NOHZ_NOW?

Though currently the Linux kernel drops the underscore for lower-case
"nohz" and keeps it for upper-case "NO_HZ".  Hey, it was that way when
I started messing with it many years ago!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/7] nohz: Allow rcu extended quiescent state handling seperately from tick stop
  2011-09-26 10:44   ` Peter Zijlstra
  2011-09-26 16:02     ` Paul E. McKenney
@ 2011-09-26 17:06     ` Frederic Weisbecker
  1 sibling, 0 replies; 23+ messages in thread
From: Frederic Weisbecker @ 2011-09-26 17:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, LKML, Mike Frysinger, Guan Xuetao,
	David Miller, Chris Metcalf, Hans-Christian Egtvedt,
	Ralf Baechle, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	Russell King, Paul Mackerras, Heiko Carstens, Paul Mundt

On Mon, Sep 26, 2011 at 12:44:44PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-09-26 at 12:19 +0200, Frederic Weisbecker wrote:
> > It is assumed that rcu won't be used once we switch to tickless
> > mode and until we restart the tick. However this is not always
> > true, as in x86-64 where we dereference the idle notifiers after
> > the tick is stopped.
> > 
> > To prepare for fixing this, add a parameter to tick_nohz_enter_idle()
> > named "rcu_ext_qs" that tells whether we want to enter RCU extended
> > quiescent state at the same time we stop the tick.
> > 
> > If no use of RCU is made in the idle loop between
> > tick_nohz_enter_idle() and tick_nohz_exit_idle() calls, the parameter
> > must be set to true and the arch doesn't need to call rcu_enter_nohz()
> > and rcu_exit_nohz() explicitly.
> > 
> > Otherwise the parameter must be set to false and the arch is
> > responsible of calling:
> > 
> > - rcu_enter_nohz() after its last use of RCU before the CPU is put
> > to sleep.
> > - rcu_exit_nohz() before the first use of RCU after the CPU is woken
> > up. 
> 
> I can't say this really makes sense:
> 
>   tick_nohz_idle_enter(false);
> 
> reads like, don't enter nohz state, not: enter nohz state but don't
> enter rcu-nohz state.
> 
> I realize you want to keep the per-arch frobbing low, but since you're
> already touching all of them, I think its more important to keep the
> functions readable.
> 
> Why not simply fully split nohz and rcu and modify all idle routines
> with both calls?

Right, that parameter can be a bit confusing.
I could do the split and call rcu_enter_nohz() directly from the arch
but that's going to duplicate code everywhere. For now only x86 seems
to need to call it separately. We have only tested x86 and
powerpc so far, though...

The other reason is that it avoids calling rcu_enter_nohz()
unconditionally. If the tick doesn't stop we don't need to call it. Maybe
that can matter for letting the CPU go to sleep faster.

We could have tick_nohz_idle_enter() for cases where we need to call
rcu_enter_nohz() explicitly from the arch and tick_nohz_idle_enter_norcu()
otherwise.

Or an enum as a parameter.

^ permalink raw reply	[flat|nested] 23+ messages in thread
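
A sketch of what the two call sites could look like under the two-helper
naming suggested above; only the tick_nohz_idle_enter()/_norcu() names come
from the discussion, the rest (cpu_sleep(), the _norcu exit helper) is
hypothetical:

        /* arch with no RCU use once the tick is stopped */
        tick_nohz_idle_enter_norcu();   /* stops the tick and enters RCU extended QS */
        cpu_sleep();                    /* placeholder: halt the CPU */
        tick_nohz_idle_exit_norcu();

        /* arch that still uses RCU after the tick is stopped (e.g. x86-64 idle notifiers) */
        tick_nohz_idle_enter();         /* stops the tick only */
        /* ... last RCU use, e.g. idle notifiers ... */
        rcu_enter_nohz();
        cpu_sleep();
        rcu_exit_nohz();
        tick_nohz_idle_exit();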

* Re: [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state
  2011-09-26 10:19 [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2011-09-26 10:19 ` [PATCH 7/7] rcu: Fix early call to rcu_irq_exit() Frederic Weisbecker
@ 2011-09-26 18:26 ` Paul E. McKenney
  7 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2011-09-26 18:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	H. Peter Anvin, Andy Henroid, Mike Frysinger, Guan Xuetao,
	David Miller, Chris Metcalf, Hans-Christian Egtvedt,
	Ralf Baechle, Russell King, Paul Mackerras, Heiko Carstens,
	Paul Mundt, Lai Jiangshan

On Mon, Sep 26, 2011 at 12:19:05PM +0200, Frederic Weisbecker wrote:
> Hi Paul,
> 
> Two fixes added in that set, the first two patches:
> 
> - The check on uses under extended quiescent states was buggy
> - One more illegal use of RCU fixed, from inside tick_nohz_stop_sched_tick()
> 
> It has survived one day and one night of rcutorture in x86-64
> with periodic cpu hotplug onlining/offlining...

Very good!  I have queued these and pushed them to -rcu on github
(https://github.com/paulmckrcu/linux) branch rcu/dynticks.

Some of the Power boxes are now back in the land of the living, and
will hopefully work their way through their testing backlogs some time
soon.  ;-)  At which point, I will restart PowerPC testing.

							Thanx, Paul

> No test in powerpc yet though... So I don't know yet why
> you got an rcutorture failure.
> 
> Frederic Weisbecker (7):
>   rcu: Fix preempt-unsafe debug check of rcu extended quiescent state
>   rcu: Fix early call to rcu_enter_nohz() on tick stopping
>   nohz: Separate out irq exit and idle loop dyntick logic
>   nohz: Allow rcu extended quiescent state handling seperately from
>     tick stop
>   x86: Enter rcu extended qs after idle notifier call
>   x86: Call idle notifier after irq_enter()
>   rcu: Fix early call to rcu_irq_exit()
> 
>  arch/arm/kernel/process.c                |    4 +-
>  arch/avr32/kernel/process.c              |    4 +-
>  arch/blackfin/kernel/process.c           |    4 +-
>  arch/microblaze/kernel/process.c         |    4 +-
>  arch/mips/kernel/process.c               |    4 +-
>  arch/powerpc/kernel/idle.c               |    4 +-
>  arch/powerpc/platforms/iseries/setup.c   |    8 +-
>  arch/s390/kernel/process.c               |    4 +-
>  arch/sh/kernel/idle.c                    |    4 +-
>  arch/sparc/kernel/process_64.c           |    4 +-
>  arch/tile/kernel/process.c               |    4 +-
>  arch/um/kernel/process.c                 |    4 +-
>  arch/unicore32/kernel/process.c          |    4 +-
>  arch/x86/kernel/apic/apic.c              |    6 +-
>  arch/x86/kernel/apic/io_apic.c           |    2 +-
>  arch/x86/kernel/cpu/mcheck/mce.c         |    2 +-
>  arch/x86/kernel/cpu/mcheck/therm_throt.c |    2 +-
>  arch/x86/kernel/cpu/mcheck/threshold.c   |    2 +-
>  arch/x86/kernel/irq.c                    |    6 +-
>  arch/x86/kernel/process_32.c             |    4 +-
>  arch/x86/kernel/process_64.c             |    9 ++-
>  include/linux/tick.h                     |   12 ++-
>  kernel/rcutree.c                         |   10 ++-
>  kernel/softirq.c                         |    4 +-
>  kernel/time/tick-sched.c                 |  120 +++++++++++++++++++++---------
>  25 files changed, 147 insertions(+), 88 deletions(-)
> 
> -- 
> 1.7.5.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu extended quiescent state
  2011-09-26 10:19 ` [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu " Frederic Weisbecker
@ 2011-09-26 22:04   ` Pavel Ivanov
  2011-09-27 11:50     ` Frederic Weisbecker
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Ivanov @ 2011-09-26 22:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, LKML, Peter Zijlstra, Thomas Gleixner,
	Lai Jiangshan, Ingo Molnar

On Mon, Sep 26, 2011 at 6:19 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> In the rcu_check_extended_qs() function that is used to check
> illegal uses of RCU under extended quiescent states, we look
> at the local value of dynticks that is even if an extended
> quiescent state or odd otherwise.
>
> We are looking at it without disabling the preemption though
> and this opens a race window where we may read the state of
> a remote CPU instead, like in the following scenario:
>
> CPU 1                                                        CPU 2
>
> bool rcu_check_extended_qs(void)
> {
>        struct rcu_dynticks *rdtp;
>
>        rdtp = &per_cpu(rcu_dynticks,
>                        raw_smp_processor_id());
>
>        < ---- Task is migrated here ---- >
>
>        // CPU 1 goes idle and increase rdtp->dynticks
>                                                        /* Here we are reading the value for
>                                                        the remote CPU 1 instead of the local CPU */
>                                                        if (atomic_read(&rdtp->dynticks) & 0x1)
>                                                                return false;
>
>                                                        return true;
>                                                        }
>
> The possible result of this is false positive return value of that
> function, suggesting we are in an extended quiescent state in random
> places.

How is this different from what your patch allows?

CPU 1                                               CPU 2

bool rcu_check_extended_qs(void)
{
       struct rcu_dynticks *rdtp =
               &get_cpu_var(rcu_dynticks);
       bool ext_qs = true;

       if (atomic_read(&rdtp->dynticks) & 0x1)
               ext_qs = false;

       put_cpu_var(rcu_dynticks);

       < ---- Task is migrated here ---- >

                                                  /* Here we return true/false
                                                     based on the value for the
                                                     remote CPU 1 instead of the
                                                     local CPU */

                                                  return ext_qs;
                                                  }


Looks like it can result in the same false positive result.

Pavel


> Fix this by disabling preemption while reading that value.
>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> ---
>  kernel/rcutree.c |   10 ++++++----
>  1 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index c9b4adf..234dca3 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -469,13 +469,15 @@ void rcu_irq_exit(void)
>
>  bool rcu_check_extended_qs(void)
>  {
> -       struct rcu_dynticks *rdtp;
> +       struct rcu_dynticks *rdtp = &get_cpu_var(rcu_dynticks);
> +       bool ext_qs = true;
>
> -       rdtp = &per_cpu(rcu_dynticks, raw_smp_processor_id());
>        if (atomic_read(&rdtp->dynticks) & 0x1)
> -               return false;
> +               ext_qs = false;
> +
> +       put_cpu_var(rcu_dynticks);
>
> -       return true;
> +       return ext_qs;
>  }
>  EXPORT_SYMBOL_GPL(rcu_check_extended_qs);
>
> --
> 1.7.5.4

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu extended quiescent state
  2011-09-26 22:04   ` Pavel Ivanov
@ 2011-09-27 11:50     ` Frederic Weisbecker
  2011-09-27 15:16       ` Pavel Ivanov
  0 siblings, 1 reply; 23+ messages in thread
From: Frederic Weisbecker @ 2011-09-27 11:50 UTC (permalink / raw)
  To: Pavel Ivanov
  Cc: Paul E. McKenney, LKML, Peter Zijlstra, Thomas Gleixner,
	Lai Jiangshan, Ingo Molnar

On Mon, Sep 26, 2011 at 06:04:06PM -0400, Pavel Ivanov wrote:
> On Mon, Sep 26, 2011 at 6:19 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > In the rcu_check_extended_qs() function that is used to check
> > illegal uses of RCU under extended quiescent states, we look
> > at the local value of dynticks that is even if an extended
> > quiescent state or odd otherwise.
> >
> > We are looking at it without disabling the preemption though
> > and this opens a race window where we may read the state of
> > a remote CPU instead, like in the following scenario:
> >
> > CPU 1                                                        CPU 2
> >
> > bool rcu_check_extended_qs(void)
> > {
> >        struct rcu_dynticks *rdtp;
> >
> >        rdtp = &per_cpu(rcu_dynticks,
> >                        raw_smp_processor_id());
> >
> >        < ---- Task is migrated here ---- >
> >
> >        // CPU 1 goes idle and increase rdtp->dynticks
> >                                                        /* Here we are reading the value for
> >                                                        the remote CPU 1 instead of the local CPU */
> >                                                        if (atomic_read(&rdtp->dynticks) & 0x1)
> >                                                                return false;
> >
> >                                                        return true;
> >                                                        }
> >
> > The possible result of this is false positive return value of that
> > function, suggesting we are in an extended quiescent state in random
> > places.
> 
> How is this different from what your patch allows?
> 
> CPU 1                                               CPU 2
> 
> bool rcu_check_extended_qs(void)
> {
>        struct rcu_dynticks *rdtp =
>                &get_cpu_var(rcu_dynticks);
>        bool ext_qs = true;
> 
>        if (atomic_read(&rdtp->dynticks) & 0x1)
>                ext_qs = false;
> 
>        put_cpu_var(rcu_dynticks);
> 
>        < ---- Task is migrated here ---- >
> 
>                                                   /* Here we return true/false
>                                                      based on the value for the
>                                                      remote CPU 1 instead of the
>                                                      local CPU */
> 
>                                                   return ext_qs;
>                                                   }
> 
> 
> Looks like it can result in the same false positive result.

Why?

While calling rcu_read_lock(), the task can still be migrated,
for example from CPU 0 to CPU 1, until we do our check
in rcu_check_extended_qs() with preemption disabled. But that's
not a problem: we are going to do our check on either CPU 0 or
CPU 1, and it doesn't matter much which.

What matters is that we do that check by ensuring we are really
checking the value of the cpu var on the CPU we are currently
running on, and not some other random one that can change its dynticks
value at the same time.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu extended quiescent state
  2011-09-27 11:50     ` Frederic Weisbecker
@ 2011-09-27 15:16       ` Pavel Ivanov
  2011-09-27 16:01         ` Paul E. McKenney
  2011-09-27 21:44         ` Frederic Weisbecker
  0 siblings, 2 replies; 23+ messages in thread
From: Pavel Ivanov @ 2011-09-27 15:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, LKML, Peter Zijlstra, Thomas Gleixner,
	Lai Jiangshan, Ingo Molnar

On Tue, Sep 27, 2011 at 7:50 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Mon, Sep 26, 2011 at 06:04:06PM -0400, Pavel Ivanov wrote:
>> On Mon, Sep 26, 2011 at 6:19 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>> > In the rcu_check_extended_qs() function that is used to check
>> > illegal uses of RCU under extended quiescent states, we look
>> > at the local value of dynticks that is even if an extended
>> > quiescent state or odd otherwise.
>> >
>> > We are looking at it without disabling the preemption though
>> > and this opens a race window where we may read the state of
>> > a remote CPU instead, like in the following scenario:
>> >
>> > CPU 1                                                        CPU 2
>> >
>> > bool rcu_check_extended_qs(void)
>> > {
>> >        struct rcu_dynticks *rdtp;
>> >
>> >        rdtp = &per_cpu(rcu_dynticks,
>> >                        raw_smp_processor_id());
>> >
>> >        < ---- Task is migrated here ---- >
>> >
>> >        // CPU 1 goes idle and increase rdtp->dynticks
>> >                                                        /* Here we are reading the value for
>> >                                                        the remote CPU 1 instead of the local CPU */
>> >                                                        if (atomic_read(&rdtp->dynticks) & 0x1)
>> >                                                                return false;
>> >
>> >                                                        return true;
>> >                                                        }
>> >
>> > The possible result of this is false positive return value of that
>> > function, suggesting we are in an extended quiescent state in random
>> > places.
>>
>> How is this different from what your patch allows?
>>
>> CPU 1                                               CPU 2
>>
>> bool rcu_check_extended_qs(void)
>> {
>>        struct rcu_dynticks *rdtp =
>>                &get_cpu_var(rcu_dynticks);
>>        bool ext_qs = true;
>>
>>        if (atomic_read(&rdtp->dynticks) & 0x1)
>>                ext_qs = false;
>>
>>        put_cpu_var(rcu_dynticks);
>>
>>        < ---- Task is migrated here ---- >
>>
>>                                                   /* Here we return true/false
>>                                                      based on the value for the
>>                                                      remote CPU 1 instead of the
>>                                                      local CPU */
>>
>>                                                   return ext_qs;
>>                                                   }
>>
>>
>> Looks like it can result in the same false positive result.
>
> Why?
>
> While calling rcu_read_lock(), the task can still be migrated
> for example from CPU 0 to CPU 1, until we do our check
> in rcu_check_extended_qs() with preemption disabled. But that's
> not a problem, we are going to do our check either in CPU 0 or
> CPU 1, that's doesn't matter much.

I don't have sources at hand now to check all call sites of
rcu_check_extended_qs(). But are you saying that
rcu_check_extended_qs() is always called after rcu_read_lock() ? If
that's the case then preemption is already disabled, right? Otherwise
I don't understand your reasoning here.

> What matters is that we do that check by ensuring we are really
> checking the value of the cpu var in the CPU we are currently
> running and not some other random one that can change its dynticks
> value at the same time.

Define the "CPU we are currently running on" in this context. Is it
CPU executing call to rcu_check_extended_qs() or is it CPU executing
return from rcu_check_extended_qs() ? These CPUs can be different both
before your patch and after that. And function can return extended_qs
state from either of these CPUs again both before and after the patch.
If the calling code wants these CPUs to be the same it has to disable
preemption before making the call. And if it does so then additional
preemption disabling inside the function is pointless.

All your patch does is changing possible scenarios of preemption...
Wait a minute... Is that the whole point of the patch? If CPU where
rcu_check_extended_qs() was called is in an extended qs then function
cannot be preempted and so it will correctly return true from current
CPU. If CPU where rcu_check_extended_qs() was called is not in
extended qs then function can be preempted and migrated. But CPU where
it migrates to won't be in extended qs either and thus function will
correctly return false no matter which per_cpu variable is read. Is my
understanding correct now? If so it's better to be explained in the
commit log.

BTW, isn't it possible for the CPU to wake up from extended qs because of
some interrupt coming in the middle of the function?


Pavel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu extended quiescent state
  2011-09-27 15:16       ` Pavel Ivanov
@ 2011-09-27 16:01         ` Paul E. McKenney
  2011-09-27 21:44         ` Frederic Weisbecker
  1 sibling, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2011-09-27 16:01 UTC (permalink / raw)
  To: Pavel Ivanov
  Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Thomas Gleixner,
	Lai Jiangshan, Ingo Molnar

On Tue, Sep 27, 2011 at 11:16:02AM -0400, Pavel Ivanov wrote:
> On Tue, Sep 27, 2011 at 7:50 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > On Mon, Sep 26, 2011 at 06:04:06PM -0400, Pavel Ivanov wrote:
> >> On Mon, Sep 26, 2011 at 6:19 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >> > In the rcu_check_extended_qs() function that is used to check
> >> > illegal uses of RCU under extended quiescent states, we look
> >> > at the local value of dynticks that is even if an extended
> >> > quiescent state or odd otherwise.
> >> >
> >> > We are looking at it without disabling the preemption though
> >> > and this opens a race window where we may read the state of
> >> > a remote CPU instead, like in the following scenario:
> >> >
> >> > CPU 1                                                        CPU 2
> >> >
> >> > bool rcu_check_extended_qs(void)
> >> > {
> >> >        struct rcu_dynticks *rdtp;
> >> >
> >> >        rdtp = &per_cpu(rcu_dynticks,
> >> >                        raw_smp_processor_id());
> >> >
> >> >        < ---- Task is migrated here ---- >
> >> >
> >> >        // CPU 1 goes idle and increase rdtp->dynticks
> >> >                                                        /* Here we are reading the value for
> >> >                                                        the remote CPU 1 instead of the local CPU */
> >> >                                                        if (atomic_read(&rdtp->dynticks) & 0x1)
> >> >                                                                return false;
> >> >
> >> >                                                        return true;
> >> >                                                        }
> >> >
> >> > The possible result of this is false positive return value of that
> >> > function, suggesting we are in an extended quiescent state in random
> >> > places.
> >>
> >> How is this different from what your patch allows?
> >>
> >> CPU 1                                               CPU 2
> >>
> >> bool rcu_check_extended_qs(void)
> >> {
> >>        struct rcu_dynticks *rdtp =
> >>                &get_cpu_var(rcu_dynticks);
> >>        bool ext_qs = true;
> >>
> >>        if (atomic_read(&rdtp->dynticks) & 0x1)
> >>                ext_qs = false;
> >>
> >>        put_cpu_var(rcu_dynticks);
> >>
> >>        < ---- Task is migrated here ---- >
> >>
> >>                                                   /* Here we return true/false
> >>                                                      based on the value for the
> >>                                                      remote CPU 1 instead of the
> >>                                                      local CPU */
> >>
> >>                                                   return ext_qs;
> >>                                                   }
> >>
> >>
> >> Looks like it can result in the same false positive result.
> >
> > Why?
> >
> > While calling rcu_read_lock(), the task can still be migrated
> > for example from CPU 0 to CPU 1, until we do our check
> > in rcu_check_extended_qs() with preemption disabled. But that's
> > not a problem, we are going to do our check either in CPU 0 or
> > CPU 1, that's doesn't matter much.
> 
> I don't have sources at hand now to check all call sites of
> rcu_check_extended_qs(). But are you saying that
> rcu_check_extended_qs() is always called after rcu_read_lock() ? If
> that's the case then preemption is already disabled, right? Otherwise
> I don't understand your reasoning here.

For kernels built with CONFIG_TREE_PREEMPT_RCU or CONFIG_TINY_PREEMPT_RCU,
rcu_read_lock() does not imply preempt_disable().  So you really can be
in an RCU read-side critical section without having preemption disabled.

> > What matters is that we do that check by ensuring we are really
> > checking the value of the cpu var in the CPU we are currently
> > running and not some other random one that can change its dynticks
> > value at the same time.
> 
> Define the "CPU we are currently running on" in this context. Is it
> CPU executing call to rcu_check_extended_qs() or is it CPU executing
> return from rcu_check_extended_qs() ? These CPUs can be different both
> before your patch and after that. And function can return extended_qs
> state from either of these CPUs again both before and after the patch.
> If the calling code wants these CPUs to be the same it has to disable
> preemption before making the call. And if it does so then additional
> preemption disabling inside the function is pointless.
> 
> All your patch does is changing possible scenarios of preemption...
> Wait a minute... Is that the whole point of the patch? If CPU where
> rcu_check_extended_qs() was called is in an extended qs then function
> cannot be preempted and so it will correctly return true from current
> CPU. If CPU where rcu_check_extended_qs() was called is not in
> extended qs then function can be preempted and migrated. But CPU where
> it migrates to won't be in extended qs either and thus function will
> correctly return false no matter which per_cpu variable is read. Is my
> understanding correct now? If so it's better to be explained in the
> commit log.
> 
> BTW, isn't it possible for CPU to wake up from extended qs after some
> interrupt coming in the middle of the function?

Indeed this can happen.  The calls to rcu_irq_enter() and
rcu_irq_exit() handle this transition from the dyntick-idle extended
quiescent state to RCU being usable in the interrupt handler.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread
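
To make the interrupt case concrete, the sequence described here is roughly
the following (a simplified sketch of the irq path on an idle CPU, not a
literal copy of the kernel code):

        /* CPU is idle, in the RCU dyntick-idle extended quiescent state */
        irq_enter();
                /* calls rcu_irq_enter(): leave the extended QS, RCU is usable */
        handle_the_irq();               /* hypothetical handler; may use rcu_read_lock() safely */
        irq_exit();
                /* calls tick_nohz_irq_exit(): may rearm timers, still needs RCU */
                /* then rcu_irq_exit(): re-enter the extended QS after the last RCU use */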

* Re: [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu extended quiescent state
  2011-09-27 15:16       ` Pavel Ivanov
  2011-09-27 16:01         ` Paul E. McKenney
@ 2011-09-27 21:44         ` Frederic Weisbecker
  2011-09-28  3:17           ` Yong Zhang
  2011-09-28  3:52           ` Pavel Ivanov
  1 sibling, 2 replies; 23+ messages in thread
From: Frederic Weisbecker @ 2011-09-27 21:44 UTC (permalink / raw)
  To: Pavel Ivanov
  Cc: Paul E. McKenney, LKML, Peter Zijlstra, Thomas Gleixner,
	Lai Jiangshan, Ingo Molnar

On Tue, Sep 27, 2011 at 11:16:02AM -0400, Pavel Ivanov wrote:
> > What matters is that we do that check by ensuring we are really
> > checking the value of the cpu var in the CPU we are currently
> > running and not some other random one that can change its dynticks
> > value at the same time.
> 
> Define the "CPU we are currently running on" in this context. Is it
> CPU executing call to rcu_check_extended_qs() or is it CPU executing
> return from rcu_check_extended_qs() ? These CPUs can be different both
> before your patch and after that. And function can return extended_qs
> state from either of these CPUs again both before and after the patch.
> If the calling code wants these CPUs to be the same it has to disable
> preemption before making the call. And if it does so then additional
> preemption disabling inside the function is pointless.

So, like Paul said, rcu_read_lock() doesn't necessarily imply disabling
preemption.

Hence by the time we call rcu_check_extended_qs(), the current task
can be migrated at any time: before the call, while in the function (except
for a small part of it), or after.

The CPU I was referring to when I talked about the "CPU we are currently
running on" is the CPU we are running on between the calls to get_cpu_var()
and put_cpu_var(). This one cannot change because get_cpu_var()
disables preemption.

So consider this piece of code:

        struct rcu_dynticks *rdtp =
               &get_cpu_var(rcu_dynticks);
        bool ext_qs = true;
 
        if (atomic_read(&rdtp->dynticks) & 0x1)
                ext_qs = false;
 
        put_cpu_var(rcu_dynticks);	

What I expect from disabling preemption is that when I read the local
CPU variable rdtp->dynticks, I'm sure this is the cpu var of the local
CPU and not the rdtp->dynticks of another CPU.

If I don't disable preemption, as it was without my patch:

0	struct rcu_dynticks *rdtp =
1               &__raw_get_cpu_var(rcu_dynticks);
2
3        if (atomic_read(&rdtp->dynticks) & 0x1)
4                ext_qs = false;
5
6        put_cpu_var(rcu_dynticks);	

I can fetch the rdtp of CPU 1 on line 0. Then the task migrates to CPU 2.
So on line 3 I'm reading rdtp->dynticks of CPU 1 from CPU 2, and this is
racy because CPU 1 can change the value of rdtp->dynticks concurrently.

Now indeed it's weird, because we can migrate at any time outside that
preempt-disabled section.

So let's explore the two cases where this function can be called:

- From the idle task. For now this is the only place where we can
run sections of code in RCU extended quiescent state. If any use
of RCU is made in such a section, it will hit our check.
Here there is no head-scratching about the role of disabling preemption
because the idle tasks can't be migrated. There is one per CPU, so
the rcu_dynticks variable we look at is always the same inside a
given idle task.

- From a normal task. We assume it can be migrated anytime. But
normal tasks aren't supposed to be in RCU extended quiescent state.
Still, the check can be useful there and spot, for example, cases where
we exit the idle task without calling rcu_exit_nohz().

Now from a normal task, when we call rcu_read_lock(), we assume
we can read the dynticks value of any CPU, wherever we migrate
to. So for example: we are running idle on CPU 1, then we exit
idle without calling rcu_exit_nohz(), and the next task running on this
CPU is about to call rcu_read_lock(), but at the last moment before
we do our check it migrates to CPU 2. It won't detect the issue on CPU 1
then. But it doesn't matter much: sooner or later there are fair
chances there will be a call to rcu_read_lock() on CPU 1 that
will report the issue.

That's also in anticipation of future development where we may
call rcu_enter_nohz() in more places than just idle, like in
the nohz cpusets for example.

Right?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu extended quiescent state
  2011-09-27 21:44         ` Frederic Weisbecker
@ 2011-09-28  3:17           ` Yong Zhang
  2011-09-28 12:44             ` Frederic Weisbecker
  2011-09-28  3:52           ` Pavel Ivanov
  1 sibling, 1 reply; 23+ messages in thread
From: Yong Zhang @ 2011-09-28  3:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Pavel Ivanov, Paul E. McKenney, LKML, Peter Zijlstra,
	Thomas Gleixner, Lai Jiangshan, Ingo Molnar

On Tue, Sep 27, 2011 at 11:44:56PM +0200, Frederic Weisbecker wrote:
> On Tue, Sep 27, 2011 at 11:16:02AM -0400, Pavel Ivanov wrote:
> > > What matters is that we do that check by ensuring we are really
> > > checking the value of the cpu var in the CPU we are currently
> > > running and not some other random one that can change its dynticks
> > > value at the same time.
> > 
> > Define the "CPU we are currently running on" in this context. Is it
> > CPU executing call to rcu_check_extended_qs() or is it CPU executing
> > return from rcu_check_extended_qs() ? These CPUs can be different both
> > before your patch and after that. And function can return extended_qs
> > state from either of these CPUs again both before and after the patch.
> > If the calling code wants these CPUs to be the same it has to disable
> > preemption before making the call. And if it does so then additional
> > preemption disabling inside the function is pointless.
> 
> So, like Paul said rcu_read_lock() doesn't necessary imply to disable
> preemption.
> 
> Hence by the time we call rcu_check_extended_qs() the current task
> can be migrated anytime before, while in the function (except
> a little part) or after.
> 
> The CPU I was referring to when I talked about "CPU we are currently
> running on" is the CPU we are running between the call to get_cpu_var()
> and put_cpu_var(). This one can not be changed because get_cpu_var()
> disables preemption.
> 
> So consider this piece of code:
> 
>         struct rcu_dynticks *rdtp =
>                &get_cpu_var(rcu_dynticks);
>         bool ext_qs = true;
>  
>         if (atomic_read(&rdtp->dynticks) & 0x1)
>                 ext_qs = false;
>  
>         put_cpu_var(rcu_dynticks);	
> 
> What I expect from preemption disabled is that when I read the local
> CPU variable rdtp->dyntick, I'm sure this is the CPU var of the local
> CPU and the rdtp->dyntick from another CPU.
> 
> If I don't disable preemption, like it was without my patch:
> 
> 0	struct rcu_dynticks *rdtp =
> 1               &__raw_get_cpu_var(rcu_dynticks);
> 2
> 3        if (atomic_read(&rdtp->dynticks) & 0x1)
> 4                ext_qs = false;
> 5
> 6        put_cpu_var(rcu_dynticks);	
> 
> I can fetch rdtp of CPU 1 in line 0. Then the task migrates on CPU 2.
> So on line 3 I'm reading rdtp->dynticks of CPU 1 from CPU 2 and this is
> racy because CPU 1 can change the value of rdtp->dynticks concurrently.
> 
> Now indeed it's weird because we can migrate anytime outside that preempt
> disabled section.
> 
> So let's explore the two cases where this function can be called:
> 
> - From the idle task. For now this is the only place where we can
> run sections of code in RCU extended quiescent state. If any use
> of RCU is made on such section, it will hit our check.
> Here there is no head-scratching about the role of disabling preemption
> because the idle tasks can't be migrated. There is one per cpu so
> the rcu_dynticks variable we look at is always the same inside a
> given idle task.

Yeah.

> 
> - From a normal task. We assume it can be migrated anytime. But
> normal tasks aren't supposed in RCU extended quiescent state.
> Still the check can be useful there and spot for example cases where
> we exit the idle task without calling rcu_exit_nohz().
> 
> Now from a normal task, when we call rcu_read_lock(), we assume
> we can read the value dynticks from any CPU, wherever we migrate
> to. So for example if we are running idle in CPU 1, then we exit
> idle without calling rcu_exit_nohz(), the next task running on this
> CPU is about to call rcu_read_lock(), but of on the last time before
> we do our check it migrates to CPU 2. It won't detect the issue in CPU 1
> then. But it doesn't matter much, soon or later there are fair
> chances there will be a call to rcu_read_lock() on CPU 1 that
> will report the issue.

So the main usage is to detect unbalanced rcu_enter_nohz()/rcu_exit_nohz(),
right?

If so, I suggest this should be commented somewhere, like the commit log,
because I was focusing on the idle task before and thought it was harmless
with/without this patch :)

Thanks,
Yong

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu extended quiescent state
  2011-09-27 21:44         ` Frederic Weisbecker
  2011-09-28  3:17           ` Yong Zhang
@ 2011-09-28  3:52           ` Pavel Ivanov
  2011-09-28 12:46             ` Frederic Weisbecker
  1 sibling, 1 reply; 23+ messages in thread
From: Pavel Ivanov @ 2011-09-28  3:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, LKML, Peter Zijlstra, Thomas Gleixner,
	Lai Jiangshan, Ingo Molnar

On Tue, Sep 27, 2011 at 5:44 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Tue, Sep 27, 2011 at 11:16:02AM -0400, Pavel Ivanov wrote:
>> > What matters is that we do that check by ensuring we are really
>> > checking the value of the cpu var in the CPU we are currently
>> > running and not some other random one that can change its dynticks
>> > value at the same time.
>>
>> Define the "CPU we are currently running on" in this context. Is it
>> CPU executing call to rcu_check_extended_qs() or is it CPU executing
>> return from rcu_check_extended_qs() ? These CPUs can be different both
>> before your patch and after that. And function can return extended_qs
>> state from either of these CPUs again both before and after the patch.
>> If the calling code wants these CPUs to be the same it has to disable
>> preemption before making the call. And if it does so then additional
>> preemption disabling inside the function is pointless.
>
> So, like Paul said rcu_read_lock() doesn't necessary imply to disable
> preemption.
>
> Hence by the time we call rcu_check_extended_qs() the current task
> can be migrated anytime before, while in the function (except
> a little part) or after.
>
> The CPU I was referring to when I talked about "CPU we are currently
> running on" is the CPU we are running between the call to get_cpu_var()
> and put_cpu_var(). This one can not be changed because get_cpu_var()
> disables preemption.
>
> So consider this piece of code:
>
>        struct rcu_dynticks *rdtp =
>               &get_cpu_var(rcu_dynticks);
>        bool ext_qs = true;
>
>        if (atomic_read(&rdtp->dynticks) & 0x1)
>                ext_qs = false;
>
>        put_cpu_var(rcu_dynticks);
>
> What I expect from preemption disabled is that when I read the local
> CPU variable rdtp->dyntick, I'm sure this is the CPU var of the local
> CPU and the rdtp->dyntick from another CPU.
>
> If I don't disable preemption, like it was without my patch:
>
> 0       struct rcu_dynticks *rdtp =
> 1               &__raw_get_cpu_var(rcu_dynticks);
> 2
> 3        if (atomic_read(&rdtp->dynticks) & 0x1)
> 4                ext_qs = false;
> 5
> 6        put_cpu_var(rcu_dynticks);
>
> I can fetch rdtp of CPU 1 in line 0. Then the task migrates on CPU 2.
> So on line 3 I'm reading rdtp->dynticks of CPU 1 from CPU 2 and this is
> racy because CPU 1 can change the value of rdtp->dynticks concurrently.
>
> Now indeed it's weird because we can migrate anytime outside that preempt
> disabled section.
>
> So let's explore the two cases where this function can be called:
>
> - From the idle task. For now this is the only place where we can
> run sections of code in RCU extended quiescent state. If any use
> of RCU is made on such section, it will hit our check.
> Here there is no head-scratching about the role of disabling preemption
> because the idle tasks can't be migrated. There is one per cpu so
> the rcu_dynticks variable we look at is always the same inside a
> given idle task.
>
> - From a normal task. We assume it can be migrated anytime. But
> normal tasks aren't supposed in RCU extended quiescent state.
> Still the check can be useful there and spot for example cases where
> we exit the idle task without calling rcu_exit_nohz().
>
> Now from a normal task, when we call rcu_read_lock(), we assume
> we can read the value dynticks from any CPU, wherever we migrate
> to. So for example if we are running idle in CPU 1, then we exit
> idle without calling rcu_exit_nohz(), the next task running on this
> CPU is about to call rcu_read_lock(), but of on the last time before
> we do our check it migrates to CPU 2. It won't detect the issue in CPU 1
> then. But it doesn't matter much, soon or later there are fair
> chances there will be a call to rcu_read_lock() on CPU 1 that
> will report the issue.
>
> That's also an anticipation for future development where we may
> call rcu_enter_nohz() in more place than just idle. Like in
> the Nohz cpusets for example.
>
> Right?

Right. Thank you, now I understand better what this function and this
patch are about. And I suggest adding that explanation to the log or a
comment before the function. Adding the explanation to the commit log would
make sense because it explains how the behavior of the function
differs before and after the patch and why the patch matters.


Pavel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu extended quiescent state
  2011-09-28  3:17           ` Yong Zhang
@ 2011-09-28 12:44             ` Frederic Weisbecker
  0 siblings, 0 replies; 23+ messages in thread
From: Frederic Weisbecker @ 2011-09-28 12:44 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Pavel Ivanov, Paul E. McKenney, LKML, Peter Zijlstra,
	Thomas Gleixner, Lai Jiangshan, Ingo Molnar

On Wed, Sep 28, 2011 at 11:17:40AM +0800, Yong Zhang wrote:
> On Tue, Sep 27, 2011 at 11:44:56PM +0200, Frederic Weisbecker wrote:
> > On Tue, Sep 27, 2011 at 11:16:02AM -0400, Pavel Ivanov wrote:
> > > > What matters is that we do that check by ensuring we are really
> > > > checking the value of the cpu var in the CPU we are currently
> > > > running and not some other random one that can change its dynticks
> > > > value at the same time.
> > > 
> > > Define the "CPU we are currently running on" in this context. Is it
> > > CPU executing call to rcu_check_extended_qs() or is it CPU executing
> > > return from rcu_check_extended_qs() ? These CPUs can be different both
> > > before your patch and after that. And function can return extended_qs
> > > state from either of these CPUs again both before and after the patch.
> > > If the calling code wants these CPUs to be the same it has to disable
> > > preemption before making the call. And if it does so then additional
> > > preemption disabling inside the function is pointless.
> > 
> > So, like Paul said rcu_read_lock() doesn't necessary imply to disable
> > preemption.
> > 
> > Hence by the time we call rcu_check_extended_qs() the current task
> > can be migrated anytime before, while in the function (except
> > a little part) or after.
> > 
> > The CPU I was referring to when I talked about "CPU we are currently
> > running on" is the CPU we are running between the call to get_cpu_var()
> > and put_cpu_var(). This one can not be changed because get_cpu_var()
> > disables preemption.
> > 
> > So consider this piece of code:
> > 
> >         struct rcu_dynticks *rdtp =
> >                &get_cpu_var(rcu_dynticks);
> >         bool ext_qs = true;
> >  
> >         if (atomic_read(&rdtp->dynticks) & 0x1)
> >                 ext_qs = false;
> >  
> >         put_cpu_var(rcu_dynticks);	
> > 
> > What I expect from preemption disabled is that when I read the local
> > CPU variable rdtp->dyntick, I'm sure this is the CPU var of the local
> > CPU and the rdtp->dyntick from another CPU.
> > 
> > If I don't disable preemption, like it was without my patch:
> > 
> > 0	struct rcu_dynticks *rdtp =
> > 1               &__raw_get_cpu_var(rcu_dynticks);
> > 2
> > 3        if (atomic_read(&rdtp->dynticks) & 0x1)
> > 4                ext_qs = false;
> > 5
> > 6        put_cpu_var(rcu_dynticks);	
> > 
> > I can fetch rdtp of CPU 1 in line 0. Then the task migrates on CPU 2.
> > So on line 3 I'm reading rdtp->dynticks of CPU 1 from CPU 2 and this is
> > racy because CPU 1 can change the value of rdtp->dynticks concurrently.
> > 
> > Now indeed it's weird because we can migrate anytime outside that preempt
> > disabled section.
> > 
> > So let's explore the two cases where this function can be called:
> > 
> > - From the idle task. For now this is the only place where we can
> > run sections of code in RCU extended quiescent state. If any use
> > of RCU is made on such section, it will hit our check.
> > Here there is no head-scratching about the role of disabling preemption
> > because the idle tasks can't be migrated. There is one per cpu so
> > the rcu_dynticks variable we look at is always the same inside a
> > given idle task.
> 
> Yeah.
> 
> > 
> > - From a normal task. We assume it can be migrated anytime. But
> > normal tasks aren't supposed in RCU extended quiescent state.
> > Still the check can be useful there and spot for example cases where
> > we exit the idle task without calling rcu_exit_nohz().
> > 
> > Now from a normal task, when we call rcu_read_lock(), we assume
> > we can read the value dynticks from any CPU, wherever we migrate
> > to. So for example if we are running idle in CPU 1, then we exit
> > idle without calling rcu_exit_nohz(), the next task running on this
> > CPU is about to call rcu_read_lock(), but of on the last time before
> > we do our check it migrates to CPU 2. It won't detect the issue in CPU 1
> > then. But it doesn't matter much, soon or later there are fair
> > chances there will be a call to rcu_read_lock() on CPU 1 that
> > will report the issue.
> 
> So the main usage is to detect unbalanced rcu_enter_nohz()/rcu_exit_nohz(),
> right?

In fact the primary goal is to detect RCU uses between the calls
to rcu_enter_nohz() and rcu_exit_nohz(). We found several of these bugs in the
idle code.

But it might also find other things, like unbalanced rcu_enter_nohz()/rcu_exit_nohz().

Also, rcu_enter_nohz() is used only in idle for now, but it may find
broader uses in the future. So we want this check everywhere.

> If so, I suggest this should be commented somewhere, like the commit log;
> because I was focusing on the idle task before then think it's harmless
> with/without this patch :)

You're right, I'll add a comment to explain that.

Thanks.

^ permalink raw reply	[flat|nested] 23+ messages in thread
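
For reference, the kind of pattern this check is meant to catch looks
roughly like the following sketch (gp and p are hypothetical pointers):

        rcu_enter_nohz();               /* CPU enters RCU extended quiescent state */
        ...
        rcu_read_lock();                /* illegal: RCU is not watching this CPU */
        p = rcu_dereference(gp);        /* may race with a grace period */
        rcu_read_unlock();
        ...
        rcu_exit_nohz();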

* Re: [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu extended quiescent state
  2011-09-28  3:52           ` Pavel Ivanov
@ 2011-09-28 12:46             ` Frederic Weisbecker
  0 siblings, 0 replies; 23+ messages in thread
From: Frederic Weisbecker @ 2011-09-28 12:46 UTC (permalink / raw)
  To: Pavel Ivanov
  Cc: Paul E. McKenney, LKML, Peter Zijlstra, Thomas Gleixner,
	Lai Jiangshan, Ingo Molnar

On Tue, Sep 27, 2011 at 11:52:52PM -0400, Pavel Ivanov wrote:
> On Tue, Sep 27, 2011 at 5:44 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > On Tue, Sep 27, 2011 at 11:16:02AM -0400, Pavel Ivanov wrote:
> >> > What matters is that we do that check by ensuring we are really
> >> > checking the value of the cpu var in the CPU we are currently
> >> > running and not some other random one that can change its dynticks
> >> > value at the same time.
> >>
> >> Define the "CPU we are currently running on" in this context. Is it
> >> CPU executing call to rcu_check_extended_qs() or is it CPU executing
> >> return from rcu_check_extended_qs() ? These CPUs can be different both
> >> before your patch and after that. And function can return extended_qs
> >> state from either of these CPUs again both before and after the patch.
> >> If the calling code wants these CPUs to be the same it has to disable
> >> preemption before making the call. And if it does so then additional
> >> preemption disabling inside the function is pointless.
> >
> > So, like Paul said rcu_read_lock() doesn't necessary imply to disable
> > preemption.
> >
> > Hence by the time we call rcu_check_extended_qs() the current task
> > can be migrated anytime before, while in the function (except
> > a little part) or after.
> >
> > The CPU I was referring to when I talked about "CPU we are currently
> > running on" is the CPU we are running between the call to get_cpu_var()
> > and put_cpu_var(). This one can not be changed because get_cpu_var()
> > disables preemption.
> >
> > So consider this piece of code:
> >
> >        struct rcu_dynticks *rdtp =
> >               &get_cpu_var(rcu_dynticks);
> >        bool ext_qs = true;
> >
> >        if (atomic_read(&rdtp->dynticks) & 0x1)
> >                ext_qs = false;
> >
> >        put_cpu_var(rcu_dynticks);
> >
> > What I expect from preemption disabled is that when I read the local
> > CPU variable rdtp->dyntick, I'm sure this is the CPU var of the local
> > CPU and the rdtp->dyntick from another CPU.
> >
> > If I don't disable preemption, like it was without my patch:
> >
> > 0       struct rcu_dynticks *rdtp =
> > 1               &__raw_get_cpu_var(rcu_dynticks);
> > 2
> > 3        if (atomic_read(&rdtp->dynticks) & 0x1)
> > 4                ext_qs = false;
> > 5
> > 6        put_cpu_var(rcu_dynticks);
> >
> > I can fetch rdtp of CPU 1 in line 0. Then the task migrates on CPU 2.
> > So on line 3 I'm reading rdtp->dynticks of CPU 1 from CPU 2 and this is
> > racy because CPU 1 can change the value of rdtp->dynticks concurrently.
> >
> > Now indeed it's weird because we can migrate anytime outside that preempt
> > disabled section.
> >
> > So let's explore the two cases where this function can be called:
> >
> > - From the idle task. For now this is the only place where we can
> > run sections of code in RCU extended quiescent state. If any use
> > of RCU is made on such section, it will hit our check.
> > Here there is no head-scratching about the role of disabling preemption
> > because the idle tasks can't be migrated. There is one per cpu so
> > the rcu_dynticks variable we look at is always the same inside a
> > given idle task.
> >
> > - From a normal task. We assume it can be migrated anytime. But
> > normal tasks aren't supposed in RCU extended quiescent state.
> > Still the check can be useful there and spot for example cases where
> > we exit the idle task without calling rcu_exit_nohz().
> >
> > Now from a normal task, when we call rcu_read_lock(), we assume
> > we can read the value dynticks from any CPU, wherever we migrate
> > to. So for example if we are running idle in CPU 1, then we exit
> > idle without calling rcu_exit_nohz(), the next task running on this
> > CPU is about to call rcu_read_lock(), but of on the last time before
> > we do our check it migrates to CPU 2. It won't detect the issue in CPU 1
> > then. But it doesn't matter much, soon or later there are fair
> > chances there will be a call to rcu_read_lock() on CPU 1 that
> > will report the issue.
> >
> > That's also an anticipation for future development where we may
> > call rcu_enter_nohz() in more place than just idle. Like in
> > the Nohz cpusets for example.
> >
> > Right?
> 
> Right. Thank you, now I understand better what this function and this
> patch are about. And I suggest to add that explanation to the log or a
> comment before the function. Adding explanation to commit log would
> make sense because it explains how behavior of the function is
> different before and after the patch and why the patch matters.

Yeah indeed, I'll add a comment to explain this.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2011-09-28 12:46 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-26 10:19 [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state Frederic Weisbecker
2011-09-26 10:19 ` [PATCH 1/7] rcu: Fix preempt-unsafe debug check of rcu " Frederic Weisbecker
2011-09-26 22:04   ` Pavel Ivanov
2011-09-27 11:50     ` Frederic Weisbecker
2011-09-27 15:16       ` Pavel Ivanov
2011-09-27 16:01         ` Paul E. McKenney
2011-09-27 21:44         ` Frederic Weisbecker
2011-09-28  3:17           ` Yong Zhang
2011-09-28 12:44             ` Frederic Weisbecker
2011-09-28  3:52           ` Pavel Ivanov
2011-09-28 12:46             ` Frederic Weisbecker
2011-09-26 10:19 ` [PATCH 2/7] rcu: Fix early call to rcu_enter_nohz() on tick stopping Frederic Weisbecker
2011-09-26 10:19 ` [PATCH 3/7] nohz: Separate out irq exit and idle loop dyntick logic Frederic Weisbecker
2011-09-26 10:19 ` [PATCH 4/7] nohz: Allow rcu extended quiescent state handling seperately from tick stop Frederic Weisbecker
2011-09-26 10:44   ` Peter Zijlstra
2011-09-26 16:02     ` Paul E. McKenney
2011-09-26 16:06       ` Peter Zijlstra
2011-09-26 16:32         ` Paul E. McKenney
2011-09-26 17:06     ` Frederic Weisbecker
2011-09-26 10:19 ` [PATCH 5/7] x86: Enter rcu extended qs after idle notifier call Frederic Weisbecker
2011-09-26 10:19 ` [PATCH 6/7] x86: Call idle notifier after irq_enter() Frederic Weisbecker
2011-09-26 10:19 ` [PATCH 7/7] rcu: Fix early call to rcu_irq_exit() Frederic Weisbecker
2011-09-26 18:26 ` [PATCH 0/7 v4] rcu: Fix some rcu uses in extended quiescent state Paul E. McKenney
