* [PATCH RFC nohz_full 0/7] v3 Provide infrastructure for full-system idle
@ 2013-07-09  1:29 Paul E. McKenney
  2013-07-09  1:30 ` [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-09  1:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw

Whenever there is at least one non-idle CPU, it is necessary to
periodically update timekeeping information.  Before NO_HZ_FULL, this
updating was carried out by the scheduling-clock tick, which ran on
every non-idle CPU.  With the advent of NO_HZ_FULL, it is possible
to have non-idle CPUs that are not receiving scheduling-clock ticks.
This possibility is handled by assigning a timekeeping CPU that continues
taking scheduling-clock ticks.

Unfortunately, the timekeeping CPU continues taking scheduling-clock
interrupts even when all other CPUs are completely idle, which is
not so good for energy efficiency and battery lifetime.  Clearly, it
would be good to turn off the timekeeping CPU's scheduling-clock tick
when all CPUs are completely idle.  This is conceptually simple, but
we also need good performance and scalability on large systems, which
rules out implementations based on frequently updated global counts of
non-idle CPUs as well as implementations that frequently scan all CPUs.
Nevertheless, we need a single global indicator in order to keep the
overhead of checking acceptably low.

The chosen approach is to enforce hysteresis on the non-idle to
full-system-idle transition, with the amount of hysteresis increasing
linearly with the number of CPUs, thus keeping contention acceptably low.
This approach piggybacks on RCU's existing force-quiescent-state scanning
of idle CPUs, which has the advantage of avoiding the scan entirely on
busy systems that have high levels of multiprogramming.  This scan
takes per-CPU idleness information and feeds it into a state machine
that applies the level of hysteresis required to arrive at a single
full-system-idle indicator.
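
For reference, here is a rough sketch of the resulting state machine.
The state names match those introduced in patches 4 and 6, but the
function below is only a simplified illustration of the intended
progression, not the actual implementation:

	/*
	 * Simplified sketch, invoked from the force-quiescent-state scan.
	 * "all_idle" is true if every non-timekeeping CPU was found idle,
	 * and "maxj" is the most recent time (in jiffies) that any of
	 * them entered idle.
	 */
	static void sysidle_progress(bool all_idle, unsigned long maxj)
	{
		int s = ACCESS_ONCE(full_sysidle_state);

		if (!all_idle) {
			/* Any non-idle CPU resets the hysteresis. */
			ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_NOT;
			return;
		}
		if (s == RCU_SYSIDLE_NOT) {
			/* First all-idle observation: note a short period. */
			ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_SHORT;
		} else if (s < RCU_SYSIDLE_FULL &&
			   ULONG_CMP_GE(jiffies, maxj + rcu_sysidle_delay())) {
			/* Advance one state per scan, but only after the
			 * CPU-count-dependent hysteresis delay has elapsed. */
			(void)cmpxchg(&full_sysidle_state, s, s + 1);
		}
	}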

The individual patches are as follows:

1.	Add a CONFIG_NO_HZ_FULL_SYSIDLE Kconfig parameter to enable
	this feature.  Kernels built with CONFIG_NO_HZ_FULL_SYSIDLE=n
	act exactly as they do today.

2.	Add new fields to the rcu_dynticks structure that track CPU-idle
	information.  These fields consider CPUs running usermode to be
	non-idle, in contrast with the existing fields in that structure.

3.	Track per-CPU idle states.

4.	Add full-system idle states and state variables.

5.	Expand force_qs_rnp(), dyntick_save_progress_counter(), and
	rcu_implicit_dynticks_qs() APIs to enable passing full-system
	idle state information.

6.	Add full-system-idle state machine.

7.	Force RCU's grace-period kthreads onto the timekeeping CPU.

Changes since v2:

o	Completed removing NMI support (thanks to Frederic for spotting
	the remaining cruft).

o	Fixed a state-machine bug, again spotted by Frederic.  See
	http://lists-archives.com/linux-kernel/27865835-nohz_full-add-full-system-idle-state-machine.html
	for the full details of the bug.

o	Updated commit log and comment as suggested by Josh Triplett.

Changes since v1:

o	Removed NMI support because NMI handlers cannot safely read
	the time anyway (thanks to Thomas Gleixner and Peter Zijlstra).

						Thanx, Paul

------------------------------------------------------------------------

 b/include/linux/rcupdate.h |   18 +
 b/kernel/rcutree.c         |   49 ++++-
 b/kernel/rcutree.h         |   17 +
 b/kernel/rcutree_plugin.h  |  421 ++++++++++++++++++++++++++++++++++++++++++++-
 b/kernel/time/Kconfig      |   23 ++
 5 files changed, 513 insertions(+), 15 deletions(-)



* [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state
  2013-07-09  1:29 [PATCH RFC nohz_full 0/7] v3 Provide infrastructure for full-system idle Paul E. McKenney
@ 2013-07-09  1:30 ` Paul E. McKenney
  2013-07-09  1:30   ` [PATCH RFC nohz_full 2/7] nohz_full: Add rcu_dyntick data " Paul E. McKenney
                     ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-09  1:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

At least one CPU must keep the scheduling-clock tick running for
timekeeping purposes whenever there is a non-idle CPU.  However, with
the new nohz_full adaptive-idle machinery, it is difficult to distinguish
between all CPUs really being idle and all non-idle CPUs being
in adaptive-ticks mode.  This commit therefore adds a Kconfig parameter
as a first step towards enabling a scalable detection of full-system
idle state.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/time/Kconfig | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 70f27e8..a613c2a 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -134,6 +134,29 @@ config NO_HZ_FULL_ALL
 	 Note the boot CPU will still be kept outside the range to
 	 handle the timekeeping duty.
 
+config NO_HZ_FULL_SYSIDLE
+	bool "Detect full-system idle state for full dynticks system"
+	depends on NO_HZ_FULL
+	default n
+	help
+	 At least one CPU must keep the scheduling-clock tick running
+	 for timekeeping purposes whenever there is a non-idle CPU,
+	 where "non-idle" includes CPUs with a single runnable task
+	 in adaptive-idle mode.  Because the underlying adaptive-tick
+	 support cannot distinguish between all CPUs being idle and
+	 all CPUs each running a single task in adaptive-idle mode,
+	 the underlying support simply ensures that there is always
+	 a CPU handling the scheduling-clock tick, whether or not all
+	 CPUs are idle.  This Kconfig option enables scalable detection
+	 of the all-CPUs-idle state, thus allowing the scheduling-clock
+	 tick to be disabled when all CPUs are idle.  Note that scalable
+	 detection of the all-CPUs-idle state means that larger systems
+	 will be slower to declare the all-CPUs-idle state.
+
+	 Say Y if you would like to help debug all-CPUs-idle detection.
+
+	 Say N if you are unsure.
+
 config NO_HZ
 	bool "Old Idle dynticks config"
 	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
-- 
1.8.1.5



* [PATCH RFC nohz_full 2/7] nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
  2013-07-09  1:30 ` [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
@ 2013-07-09  1:30   ` Paul E. McKenney
  2013-07-09  9:37     ` Peter Zijlstra
  2013-07-09  1:30   ` [PATCH RFC nohz_full 3/7] nohz_full: Add per-CPU idle-state tracking Paul E. McKenney
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-09  1:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds fields to the rcu_dyntick structure that are used to
detect idle CPUs.  These new fields differ from the existing ones in
that the existing ones consider a CPU executing in user mode to be idle,
whereas the new ones consider CPUs executing in user mode to be busy.
The handling of these new fields is otherwise quite similar to that for
the existing fields.  This commit also adds the initialization required
for these fields.

So, why is usermode execution treated differently, with RCU considering
it a quiescent state equivalent to idle, while in contrast the new
full-system idle state detection considers usermode execution to be
non-idle?

It turns out that although one of RCU's quiescent states is usermode
execution, it is not a full-system idle state.  This is because the
purpose of the full-system idle state is not RCU, but rather determining
when accurate timekeeping can safely be disabled.  Whenever accurate
timekeeping is required in a CONFIG_NO_HZ_FULL kernel, at least one
CPU must keep the scheduling-clock tick going.  If even one CPU is
executing in user mode, accurate timekeeping is required, particularly for
architectures where gettimeofday() and friends do not enter the kernel.
Only when all CPUs are really and truly idle can accurate timekeeping be
disabled, allowing all CPUs to turn off the scheduling clock interrupt,
thus greatly improving energy efficiency.

This naturally raises the question "Why is this code in RCU rather than in
timekeeping?", and the answer is that RCU has the data and infrastructure
to efficiently make this determination.
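
To make the contrast concrete, here is a hedged sketch of where the two
sets of per-CPU counters end up being updated once the later patches in
this series are applied.  This is a simplification; the actual calls and
irq handling are in patch 3:

	/* Usermode entry: an RCU extended quiescent state, but the new
	 * sysidle counters still consider this CPU busy. */
	void rcu_user_enter(void)
	{
		rcu_eqs_enter(true);	/* RCU counters say "idle"... */
		/* ...but no rcu_sysidle_enter(), so sysidle says "busy". */
	}

	/* Idle-loop entry: idle for both RCU and full-system-idle
	 * detection. */
	void rcu_idle_enter(void)
	{
		unsigned long flags;

		local_irq_save(flags);
		rcu_eqs_enter(false);
		rcu_sysidle_enter(&__get_cpu_var(rcu_dynticks), 0);
		local_irq_restore(flags);
	}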

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rcutree.c        |  5 +++++
 kernel/rcutree.h        |  9 +++++++++
 kernel/rcutree_plugin.h | 19 +++++++++++++++++++
 3 files changed, 33 insertions(+)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 78745ae..259e300 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -209,6 +209,10 @@ EXPORT_SYMBOL_GPL(rcu_note_context_switch);
 DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
 	.dynticks_nesting = DYNTICK_TASK_EXIT_IDLE,
 	.dynticks = ATOMIC_INIT(1),
+#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
+	.dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE,
+	.dynticks_idle = ATOMIC_INIT(1),
+#endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
 };
 
 static long blimit = 10;	/* Maximum callbacks per rcu_do_batch. */
@@ -2902,6 +2906,7 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
 	rdp->blimit = blimit;
 	init_callback_list(rdp);  /* Re-enable callbacks on this CPU. */
 	rdp->dynticks->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
+	rcu_sysidle_init_percpu_data(rdp->dynticks);
 	atomic_set(&rdp->dynticks->dynticks,
 		   (atomic_read(&rdp->dynticks->dynticks) & ~0x1) + 1);
 	raw_spin_unlock(&rnp->lock);		/* irqs remain disabled. */
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index b383258..bd99d59 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -88,6 +88,14 @@ struct rcu_dynticks {
 				    /* Process level is worth LLONG_MAX/2. */
 	int dynticks_nmi_nesting;   /* Track NMI nesting level. */
 	atomic_t dynticks;	    /* Even value for idle, else odd. */
+#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
+	long long dynticks_idle_nesting;
+				    /* irq/process nesting level from idle. */
+	atomic_t dynticks_idle;	    /* Even value for idle, else odd. */
+				    /*  "Idle" excludes userspace execution. */
+	unsigned long dynticks_idle_jiffies;
+				    /* End of last non-NMI non-idle period. */
+#endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
 #ifdef CONFIG_RCU_FAST_NO_HZ
 	bool all_lazy;		    /* Are all CPU's CBs lazy? */
 	unsigned long nonlazy_posted;
@@ -545,6 +553,7 @@ static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
 static void rcu_kick_nohz_cpu(int cpu);
 static bool init_nocb_callback_list(struct rcu_data *rdp);
+static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
 
 #endif /* #ifndef RCU_TREE_NONCORE */
 
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 769e12e..6937eb6 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2375,3 +2375,22 @@ static void rcu_kick_nohz_cpu(int cpu)
 		smp_send_reschedule(cpu);
 #endif /* #ifdef CONFIG_NO_HZ_FULL */
 }
+
+
+#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
+
+/*
+ * Initialize dynticks sysidle state for CPUs coming online.
+ */
+static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
+{
+	rdtp->dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE;
+}
+
+#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
+
+static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
+{
+}
+
+#endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
-- 
1.8.1.5



* [PATCH RFC nohz_full 3/7] nohz_full: Add per-CPU idle-state tracking
  2013-07-09  1:30 ` [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
  2013-07-09  1:30   ` [PATCH RFC nohz_full 2/7] nohz_full: Add rcu_dyntick data " Paul E. McKenney
@ 2013-07-09  1:30   ` Paul E. McKenney
  2013-07-09  1:30   ` [PATCH RFC nohz_full 4/7] nohz_full: Add full-system idle states and variables Paul E. McKenney
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-09  1:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds the code that updates the rcu_dyntick structure's
new fields to track the per-CPU idle state based on interrupts and
transitions into and out of the idle loop (NMIs are ignored because NMI
handlers cannot cleanly read out the time anyway).  This code is similar
to the code that maintains RCU's idea of per-CPU idleness, but differs
in that RCU treats CPUs running in user mode as idle, whereas this new
code does not.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rcutree.c        |  4 +++
 kernel/rcutree.h        |  2 ++
 kernel/rcutree_plugin.h | 79 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 259e300..ed9a36a 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -416,6 +416,7 @@ void rcu_idle_enter(void)
 
 	local_irq_save(flags);
 	rcu_eqs_enter(false);
+	rcu_sysidle_enter(&__get_cpu_var(rcu_dynticks), 0);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -466,6 +467,7 @@ void rcu_irq_exit(void)
 		trace_rcu_dyntick("--=", oldval, rdtp->dynticks_nesting);
 	else
 		rcu_eqs_enter_common(rdtp, oldval, true);
+	rcu_sysidle_enter(rdtp, 1);
 	local_irq_restore(flags);
 }
 
@@ -534,6 +536,7 @@ void rcu_idle_exit(void)
 
 	local_irq_save(flags);
 	rcu_eqs_exit(false);
+	rcu_sysidle_exit(&__get_cpu_var(rcu_dynticks), 0);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_exit);
@@ -585,6 +588,7 @@ void rcu_irq_enter(void)
 		trace_rcu_dyntick("++=", oldval, rdtp->dynticks_nesting);
 	else
 		rcu_eqs_exit_common(rdtp, oldval, true);
+	rcu_sysidle_exit(rdtp, 1);
 	local_irq_restore(flags);
 }
 
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index bd99d59..1895043 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -553,6 +553,8 @@ static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
 static void rcu_kick_nohz_cpu(int cpu);
 static bool init_nocb_callback_list(struct rcu_data *rdp);
+static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq);
+static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
 
 #endif /* #ifndef RCU_TREE_NONCORE */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 6937eb6..814ff47 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2380,6 +2380,77 @@ static void rcu_kick_nohz_cpu(int cpu)
 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
 
 /*
+ * Invoked to note exit from irq or task transition to idle.  Note that
+ * usermode execution does -not- count as idle here!  After all, we want
+ * to detect full-system idle states, not RCU quiescent states and grace
+ * periods.  The caller must have disabled interrupts.
+ */
+static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
+{
+	unsigned long j;
+
+	/* Adjust nesting, check for fully idle. */
+	if (irq) {
+		rdtp->dynticks_idle_nesting--;
+		WARN_ON_ONCE(rdtp->dynticks_idle_nesting < 0);
+		if (rdtp->dynticks_idle_nesting != 0)
+			return;  /* Still not fully idle. */
+	} else {
+		if ((rdtp->dynticks_idle_nesting & DYNTICK_TASK_NEST_MASK) ==
+		    DYNTICK_TASK_NEST_VALUE) {
+			rdtp->dynticks_idle_nesting = 0;
+		} else {
+			rdtp->dynticks_idle_nesting -= DYNTICK_TASK_NEST_VALUE;
+			WARN_ON_ONCE(rdtp->dynticks_idle_nesting < 0);
+			return;  /* Still not fully idle. */
+		}
+	}
+
+	/* Record start of fully idle period. */
+	j = jiffies;
+	ACCESS_ONCE(rdtp->dynticks_idle_jiffies) = j;
+	smp_mb__before_atomic_inc();
+	atomic_inc(&rdtp->dynticks_idle);
+	smp_mb__after_atomic_inc();
+	WARN_ON_ONCE(atomic_read(&rdtp->dynticks_idle) & 0x1);
+}
+
+/*
+ * Invoked to note entry to irq or task transition from idle.  Note that
+ * usermode execution does -not- count as idle here!  The caller must
+ * have disabled interrupts.
+ */
+static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
+{
+	/* Adjust nesting, check for already non-idle. */
+	if (irq) {
+		rdtp->dynticks_idle_nesting++;
+		WARN_ON_ONCE(rdtp->dynticks_idle_nesting <= 0);
+		if (rdtp->dynticks_idle_nesting != 1)
+			return; /* Already non-idle. */
+	} else {
+		/*
+		 * Allow for irq misnesting.  Yes, it really is possible
+		 * to enter an irq handler then never leave it, and maybe
+		 * also vice versa.  Handle both possibilities.
+		 */
+		if (rdtp->dynticks_idle_nesting & DYNTICK_TASK_NEST_MASK) {
+			rdtp->dynticks_idle_nesting += DYNTICK_TASK_NEST_VALUE;
+			WARN_ON_ONCE(rdtp->dynticks_idle_nesting <= 0);
+			return; /* Already non-idle. */
+		} else {
+			rdtp->dynticks_idle_nesting = DYNTICK_TASK_EXIT_IDLE;
+		}
+	}
+
+	/* Record end of idle period. */
+	smp_mb__before_atomic_inc();
+	atomic_inc(&rdtp->dynticks_idle);
+	smp_mb__after_atomic_inc();
+	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
+}
+
+/*
  * Initialize dynticks sysidle state for CPUs coming online.
  */
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
@@ -2389,6 +2460,14 @@ static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
 
 #else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
 
+static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
+{
+}
+
+static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
+{
+}
+
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
 {
 }
-- 
1.8.1.5



* [PATCH RFC nohz_full 4/7] nohz_full: Add full-system idle states and variables
  2013-07-09  1:30 ` [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
  2013-07-09  1:30   ` [PATCH RFC nohz_full 2/7] nohz_full: Add rcu_dyntick data " Paul E. McKenney
  2013-07-09  1:30   ` [PATCH RFC nohz_full 3/7] nohz_full: Add per-CPU idle-state tracking Paul E. McKenney
@ 2013-07-09  1:30   ` Paul E. McKenney
  2013-07-09  1:30   ` [PATCH RFC nohz_full 5/7] nohz_full: Add full-system-idle arguments to API Paul E. McKenney
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-09  1:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds control variables and states for full-system idle.
The system will progress through the states in numerical order when
the system is fully idle (other than the timekeeping CPU), and reset
down to the initial state if any non-timekeeping CPU goes non-idle.
The current state is kept in full_sysidle_state.

An RCU_SYSIDLE_SMALL macro is defined, and systems with this number
of CPUs or fewer move through the states more aggressively.  The idea
is that the resulting memory contention is less of a problem on small
systems.  Architectures can adjust this value (which defaults to 8)
using CONFIG_ARCH_RCU_SYSIDLE_SMALL.
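
No architecture provides that symbol in this series; as a purely
hypothetical illustration, an architecture wanting a larger threshold
could define something like the following in its Kconfig:

	config ARCH_RCU_SYSIDLE_SMALL
		int
		depends on NO_HZ_FULL_SYSIDLE
		default 32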

One flavor of RCU, designated by rcu_sysidle_state, will be in charge
of driving the state machine.  This should be the busiest flavor of RCU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rcutree_plugin.h | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 814ff47..3edae39 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2380,6 +2380,34 @@ static void rcu_kick_nohz_cpu(int cpu)
 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
 
 /*
+ * Handle small systems specially, accelerating their transition into
+ * full idle state.  Allow arches to override this code's idea of
+ * what constitutes a "small" system.
+ */
+#ifdef CONFIG_ARCH_RCU_SYSIDLE_SMALL
+#define RCU_SYSIDLE_SMALL CONFIG_ARCH_RCU_SYSIDLE_SMALL
+#else /* #ifdef CONFIG_ARCH_RCU_SYSIDLE_SMALL */
+#define RCU_SYSIDLE_SMALL 8
+#endif
+
+/*
+ * Define RCU flavor that holds sysidle state.  This needs to be the
+ * most active flavor of RCU.
+ */
+#ifdef CONFIG_PREEMPT_RCU
+static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_preempt_state;
+#else /* #ifdef CONFIG_PREEMPT_RCU */
+static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_sched_state;
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
+
+static int __maybe_unused full_sysidle_state; /* Current system-idle state. */
+#define RCU_SYSIDLE_NOT		0	/* Some CPU is not idle. */
+#define RCU_SYSIDLE_SHORT	1	/* All CPUs idle for brief period. */
+#define RCU_SYSIDLE_LONG	2	/* All CPUs idle for long enough. */
+#define RCU_SYSIDLE_FULL	3	/* All CPUs idle, ready for sysidle. */
+#define RCU_SYSIDLE_FULL_NOTED	4	/* Actually entered sysidle state. */
+
+/*
  * Invoked to note exit from irq or task transition to idle.  Note that
  * usermode execution does -not- count as idle here!  After all, we want
  * to detect full-system idle states, not RCU quiescent states and grace
-- 
1.8.1.5



* [PATCH RFC nohz_full 5/7] nohz_full: Add full-system-idle arguments to API
  2013-07-09  1:30 ` [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
                     ` (2 preceding siblings ...)
  2013-07-09  1:30   ` [PATCH RFC nohz_full 4/7] nohz_full: Add full-system idle states and variables Paul E. McKenney
@ 2013-07-09  1:30   ` Paul E. McKenney
  2013-07-09  1:30   ` [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine Paul E. McKenney
  2013-07-09  1:30   ` [PATCH RFC nohz_full 7/7] nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU Paul E. McKenney
  5 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-09  1:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds isidle and jiffies arguments to force_qs_rnp(),
dyntick_save_progress_counter(), and rcu_implicit_dynticks_qs() to enable
RCU's force-quiescent-state process to check for full-system idle.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rcutree.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index ed9a36a..9971f86 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -231,7 +231,9 @@ module_param(jiffies_till_next_fqs, ulong, 0644);
 
 static void rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp,
 				  struct rcu_data *rdp);
-static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
+static void force_qs_rnp(struct rcu_state *rsp,
+			 int (*f)(struct rcu_data *, bool *, unsigned long *),
+			 bool *isidle, unsigned long *maxj);
 static void force_quiescent_state(struct rcu_state *rsp);
 static int rcu_pending(int cpu);
 
@@ -712,7 +714,8 @@ static int rcu_is_cpu_rrupt_from_idle(void)
  * credit them with an implicit quiescent state.  Return 1 if this CPU
  * is in dynticks idle mode, which is an extended quiescent state.
  */
-static int dyntick_save_progress_counter(struct rcu_data *rdp)
+static int dyntick_save_progress_counter(struct rcu_data *rdp,
+					 bool *isidle, unsigned long *maxj)
 {
 	rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
 	return (rdp->dynticks_snap & 0x1) == 0;
@@ -724,7 +727,8 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp)
  * idle state since the last call to dyntick_save_progress_counter()
  * for this same CPU, or by virtue of having been offline.
  */
-static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
+static int rcu_implicit_dynticks_qs(struct rcu_data *rdp,
+				    bool *isidle, unsigned long *maxj)
 {
 	unsigned int curr;
 	unsigned int snap;
@@ -1345,16 +1349,19 @@ static int rcu_gp_init(struct rcu_state *rsp)
 int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
 {
 	int fqs_state = fqs_state_in;
+	bool isidle = 0;
+	unsigned long maxj;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
 	rsp->n_force_qs++;
 	if (fqs_state == RCU_SAVE_DYNTICK) {
 		/* Collect dyntick-idle snapshots. */
-		force_qs_rnp(rsp, dyntick_save_progress_counter);
+		force_qs_rnp(rsp, dyntick_save_progress_counter,
+			     &isidle, &maxj);
 		fqs_state = RCU_FORCE_QS;
 	} else {
 		/* Handle dyntick-idle and offline CPUs. */
-		force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
+		force_qs_rnp(rsp, rcu_implicit_dynticks_qs, &isidle, &maxj);
 	}
 	/* Clear flag to prevent immediate re-entry. */
 	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
@@ -2055,7 +2062,9 @@ void rcu_check_callbacks(int cpu, int user)
  *
  * The caller must have suppressed start of new grace periods.
  */
-static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
+static void force_qs_rnp(struct rcu_state *rsp,
+			 int (*f)(struct rcu_data *, bool *, unsigned long *),
+			 bool *isidle, unsigned long *maxj)
 {
 	unsigned long bit;
 	int cpu;
@@ -2079,7 +2088,7 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
 		bit = 1;
 		for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
 			if ((rnp->qsmask & bit) != 0 &&
-			    f(per_cpu_ptr(rsp->rda, cpu)))
+			    f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
 				mask |= bit;
 		}
 		if (mask != 0) {
-- 
1.8.1.5



* [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-09  1:30 ` [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
                     ` (3 preceding siblings ...)
  2013-07-09  1:30   ` [PATCH RFC nohz_full 5/7] nohz_full: Add full-system-idle arguments to API Paul E. McKenney
@ 2013-07-09  1:30   ` Paul E. McKenney
  2013-07-17 23:31     ` Frederic Weisbecker
  2013-07-09  1:30   ` [PATCH RFC nohz_full 7/7] nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU Paul E. McKenney
  5 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-09  1:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds the state machine that takes the per-CPU idle data
as input and produces a full-system-idle indication as output.  This
state machine is driven out of RCU's quiescent-state-forcing
mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
idle state and then rcu_sysidle_report() to drive the state machine.

The full-system-idle state is sampled using rcu_sys_is_idle(), which
also drives the state machine if RCU is idle (and does so by forcing
RCU to become non-idle).  This function returns true if all but the
timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
enough to avoid memory contention on the full_sysidle_state variable.
rcu_sysidle_force_exit() may be called externally to reset the state
machine back to the non-idle state.
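
As a hedged sketch of the intended caller (the adaptive-ticks hookup
itself is not part of this series), the timekeeping CPU's tick-stop
path might consult the new interface roughly as follows.  The function
name below is made up for illustration:

	/*
	 * Hypothetical check on the timekeeping CPU (tick_do_timer_cpu),
	 * called with interrupts disabled from the can-stop-tick path.
	 */
	static bool timekeeping_cpu_can_stop_tick(void)
	{
		/*
		 * True only once every other CPU has been idle long
		 * enough for the state machine to reach the full
		 * system-idle state.
		 */
		return rcu_sys_is_idle();
	}

If the timekeeping CPU then does anything more than take its own
scheduling-clock interrupt, it is expected to invoke
rcu_sysidle_force_exit() to push the state machine back to the
non-idle state.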

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>

Conflicts:

	kernel/rcutree.h
	kernel/rcutree_plugin.h
---
 include/linux/rcupdate.h |  18 ++++
 kernel/rcutree.c         |  16 ++-
 kernel/rcutree.h         |   5 +
 kernel/rcutree_plugin.h  | 275 ++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 307 insertions(+), 7 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 48f1ef9..1aa8d8c 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1011,4 +1011,22 @@ static inline bool rcu_is_nocb_cpu(int cpu) { return false; }
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
 
+/* Only for use by adaptive-ticks code. */
+#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
+extern bool rcu_sys_is_idle(void);
+extern void rcu_sysidle_force_exit(void);
+#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
+
+static inline bool rcu_sys_is_idle(void)
+{
+	return false;
+}
+
+static inline void rcu_sysidle_force_exit(void)
+{
+}
+
+#endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
+
+
 #endif /* __LINUX_RCUPDATE_H */
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 9971f86..06cfd75 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -718,6 +718,7 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp,
 					 bool *isidle, unsigned long *maxj)
 {
 	rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
+	rcu_sysidle_check_cpu(rdp, isidle, maxj);
 	return (rdp->dynticks_snap & 0x1) == 0;
 }
 
@@ -1356,11 +1357,17 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
 	rsp->n_force_qs++;
 	if (fqs_state == RCU_SAVE_DYNTICK) {
 		/* Collect dyntick-idle snapshots. */
+		if (is_sysidle_rcu_state(rsp)) {
+			isidle = 1;
+			maxj = jiffies - ULONG_MAX / 4;
+		}
 		force_qs_rnp(rsp, dyntick_save_progress_counter,
 			     &isidle, &maxj);
+		rcu_sysidle_report(rsp, isidle, maxj);
 		fqs_state = RCU_FORCE_QS;
 	} else {
 		/* Handle dyntick-idle and offline CPUs. */
+		isidle = 0;
 		force_qs_rnp(rsp, rcu_implicit_dynticks_qs, &isidle, &maxj);
 	}
 	/* Clear flag to prevent immediate re-entry. */
@@ -2087,9 +2094,12 @@ static void force_qs_rnp(struct rcu_state *rsp,
 		cpu = rnp->grplo;
 		bit = 1;
 		for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
-			if ((rnp->qsmask & bit) != 0 &&
-			    f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
-				mask |= bit;
+			if ((rnp->qsmask & bit) != 0) {
+				if ((rnp->qsmaskinit & bit) != 0)
+					*isidle = 0;
+				if (f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
+					mask |= bit;
+			}
 		}
 		if (mask != 0) {
 
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 1895043..7326a3c 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -555,6 +555,11 @@ static void rcu_kick_nohz_cpu(int cpu);
 static bool init_nocb_callback_list(struct rcu_data *rdp);
 static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq);
 static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
+static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
+				  unsigned long *maxj);
+static bool is_sysidle_rcu_state(struct rcu_state *rsp);
+static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
+			       unsigned long maxj);
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
 
 #endif /* #ifndef RCU_TREE_NONCORE */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 3edae39..b47ffb0 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -28,7 +28,7 @@
 #include <linux/gfp.h>
 #include <linux/oom.h>
 #include <linux/smpboot.h>
-#include <linux/tick.h>
+#include "time/tick-internal.h"
 
 #define RCU_KTHREAD_PRIO 1
 
@@ -2395,12 +2395,12 @@ static void rcu_kick_nohz_cpu(int cpu)
  * most active flavor of RCU.
  */
 #ifdef CONFIG_PREEMPT_RCU
-static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_preempt_state;
+static struct rcu_state *rcu_sysidle_state = &rcu_preempt_state;
 #else /* #ifdef CONFIG_PREEMPT_RCU */
-static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_sched_state;
+static struct rcu_state *rcu_sysidle_state = &rcu_sched_state;
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
 
-static int __maybe_unused full_sysidle_state; /* Current system-idle state. */
+static int full_sysidle_state;		/* Current system-idle state. */
 #define RCU_SYSIDLE_NOT		0	/* Some CPU is not idle. */
 #define RCU_SYSIDLE_SHORT	1	/* All CPUs idle for brief period. */
 #define RCU_SYSIDLE_LONG	2	/* All CPUs idle for long enough. */
@@ -2444,6 +2444,38 @@ static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
 }
 
 /*
+ * Unconditionally force exit from full system-idle state.  This is
+ * invoked when a normal CPU exits idle, but must be called separately
+ * for the timekeeping CPU (tick_do_timer_cpu).  The reason for this
+ * is that the timekeeping CPU is permitted to take scheduling-clock
+ * interrupts while the system is in system-idle state, and of course
+ * rcu_sysidle_exit() has no way of distinguishing a scheduling-clock
+ * interrupt from any other type of interrupt.
+ */
+void rcu_sysidle_force_exit(void)
+{
+	int oldstate = ACCESS_ONCE(full_sysidle_state);
+	int newoldstate;
+
+	/*
+	 * Each pass through the following loop attempts to exit full
+	 * system-idle state.  If contention proves to be a problem,
+	 * a trylock-based contention tree could be used here.
+	 */
+	while (oldstate > RCU_SYSIDLE_SHORT) {
+		newoldstate = cmpxchg(&full_sysidle_state,
+				      oldstate, RCU_SYSIDLE_NOT);
+		if (oldstate == newoldstate &&
+		    oldstate == RCU_SYSIDLE_FULL_NOTED) {
+			rcu_kick_nohz_cpu(tick_do_timer_cpu);
+			return; /* We cleared it, done! */
+		}
+		oldstate = newoldstate;
+	}
+	smp_mb(); /* Order initial oldstate fetch vs. later non-idle work. */
+}
+
+/*
  * Invoked to note entry to irq or task transition from idle.  Note that
  * usermode execution does -not- count as idle here!  The caller must
  * have disabled interrupts.
@@ -2476,6 +2508,226 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
 	atomic_inc(&rdtp->dynticks_idle);
 	smp_mb__after_atomic_inc();
 	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
+
+	/*
+	 * If we are the timekeeping CPU, we are permitted to be non-idle
+	 * during a system-idle state.  This must be the case, because
+	 * the timekeeping CPU has to take scheduling-clock interrupts
+	 * during the time that the system is transitioning to full
+	 * system-idle state.  This means that the timekeeping CPU must
+	 * invoke rcu_sysidle_force_exit() directly if it does anything
+	 * more than take a scheduling-clock interrupt.
+	 */
+	if (smp_processor_id() == tick_do_timer_cpu)
+		return;
+
+	/* Update system-idle state: We are clearly no longer fully idle! */
+	rcu_sysidle_force_exit();
+}
+
+/*
+ * Check to see if the current CPU is idle.  Note that usermode execution
+ * does not count as idle.  The caller must have disabled interrupts.
+ */
+static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
+				  unsigned long *maxj)
+{
+	int cur;
+	unsigned long j;
+	struct rcu_dynticks *rdtp = rdp->dynticks;
+
+	/*
+	 * If some other CPU has already reported non-idle, if this is
+	 * not the flavor of RCU that tracks sysidle state, or if this
+	 * is an offline or the timekeeping CPU, nothing to do.
+	 */
+	if (!*isidle || rdp->rsp != rcu_sysidle_state ||
+	    cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
+		return;
+	/* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
+
+	/* Pick up current idle and NMI-nesting counter and check. */
+	cur = atomic_read(&rdtp->dynticks_idle);
+	if (cur & 0x1) {
+		*isidle = 0; /* We are not idle! */
+		return;
+	}
+	smp_mb(); /* Read counters before timestamps. */
+
+	/* Pick up timestamps. */
+	j = ACCESS_ONCE(rdtp->dynticks_idle_jiffies);
+	/* If this CPU entered idle more recently, update maxj timestamp. */
+	if (ULONG_CMP_LT(*maxj, j))
+		*maxj = j;
+}
+
+/*
+ * Is this the flavor of RCU that is handling full-system idle?
+ */
+static bool is_sysidle_rcu_state(struct rcu_state *rsp)
+{
+	return rsp == rcu_sysidle_state;
+}
+
+/*
+ * Return a delay in jiffies based on the number of CPUs, rcu_node
+ * leaf fanout, and jiffies tick rate.  The idea is to allow larger
+ * systems more time to transition to full-idle state in order to
+ * avoid the cache thrashing that otherwise occur on the state variable.
+ * Really small systems (less than a couple of tens of CPUs) should
+ * instead use a single global atomically incremented counter, and later
+ * versions of this will automatically reconfigure themselves accordingly.
+ */
+static unsigned long rcu_sysidle_delay(void)
+{
+	if (nr_cpu_ids <= RCU_SYSIDLE_SMALL)
+		return 0;
+	return DIV_ROUND_UP(nr_cpu_ids * HZ, rcu_fanout_leaf * 1000);
+}
+
+/*
+ * Advance the full-system-idle state.  This is invoked when all of
+ * the non-timekeeping CPUs are idle.
+ */
+static void rcu_sysidle(unsigned long j)
+{
+	/* Check the current state. */
+	switch (ACCESS_ONCE(full_sysidle_state)) {
+	case RCU_SYSIDLE_NOT:
+
+		/* First time all are idle, so note a short idle period. */
+		ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_SHORT;
+		break;
+
+	case RCU_SYSIDLE_SHORT:
+
+		/*
+		 * Idle for a bit, time to advance to next state?
+		 * cmpxchg failure means race with non-idle, let them win.
+		 */
+		if (ULONG_CMP_GE(jiffies, j + rcu_sysidle_delay()))
+			(void)cmpxchg(&full_sysidle_state,
+				      RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG);
+		break;
+
+	case RCU_SYSIDLE_LONG:
+
+		/*
+		 * Do an additional check pass before advancing to full.
+		 * cmpxchg failure means race with non-idle, let them win.
+		 */
+		if (ULONG_CMP_GE(jiffies, j + rcu_sysidle_delay()))
+			(void)cmpxchg(&full_sysidle_state,
+				      RCU_SYSIDLE_LONG, RCU_SYSIDLE_FULL);
+		break;
+
+	default:
+		break;
+	}
+}
+
+/*
+ * Found a non-idle non-timekeeping CPU, so kick the system-idle state
+ * back to the beginning.
+ */
+static void rcu_sysidle_cancel(void)
+{
+	smp_mb();
+	ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_NOT;
+}
+
+/*
+ * Update the sysidle state based on the results of a force-quiescent-state
+ * scan of the CPUs' dyntick-idle state.
+ */
+static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
+			       unsigned long maxj)
+{
+	if (rsp != rcu_sysidle_state)
+		return;  /* Wrong flavor, ignore. */
+	if (isidle)
+		rcu_sysidle(maxj);    /* More idle! */
+	else
+		rcu_sysidle_cancel(); /* Idle is over. */
+}
+
+/* Callback and function for forcing an RCU grace period. */
+struct rcu_sysidle_head {
+	struct rcu_head rh;
+	int inuse;
+};
+
+static void rcu_sysidle_cb(struct rcu_head *rhp)
+{
+	struct rcu_sysidle_head *rshp;
+
+	smp_mb();  /* grace period precedes setting inuse. */
+	rshp = container_of(rhp, struct rcu_sysidle_head, rh);
+	ACCESS_ONCE(rshp->inuse) = 0;
+}
+
+/*
+ * Check to see if the system is fully idle, other than the timekeeping CPU.
+ * The caller must have disabled interrupts.
+ */
+bool rcu_sys_is_idle(void)
+{
+	static struct rcu_sysidle_head rsh;
+	int rss = ACCESS_ONCE(full_sysidle_state);
+
+	if (WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu))
+		return false;
+
+	/* Handle small-system case by doing a full scan of CPUs. */
+	if (nr_cpu_ids <= RCU_SYSIDLE_SMALL) {
+		int oldrss = rss - 1;
+
+		/*
+		 * One pass to advance to each state up to _FULL.
+		 * Give up if any pass fails to advance the state.
+		 */
+		while (rss < RCU_SYSIDLE_FULL && oldrss < rss) {
+			int cpu;
+			bool isidle = true;
+			unsigned long maxj = jiffies - ULONG_MAX / 4;
+			struct rcu_data *rdp;
+
+			/* Scan all the CPUs looking for nonidle CPUs. */
+			for_each_possible_cpu(cpu) {
+				rdp = per_cpu_ptr(rcu_sysidle_state->rda, cpu);
+				rcu_sysidle_check_cpu(rdp, &isidle, &maxj);
+				if (!isidle)
+					break;
+			}
+			rcu_sysidle_report(rcu_sysidle_state, isidle, maxj);
+			oldrss = rss;
+			rss = ACCESS_ONCE(full_sysidle_state);
+		}
+	}
+
+	/* If this is the first observation of an idle period, record it. */
+	if (rss == RCU_SYSIDLE_FULL) {
+		rss = cmpxchg(&full_sysidle_state,
+			      RCU_SYSIDLE_FULL, RCU_SYSIDLE_FULL_NOTED);
+		return rss == RCU_SYSIDLE_FULL;
+	}
+
+	smp_mb(); /* ensure rss load happens before later caller actions. */
+
+	/* If already fully idle, tell the caller (in case of races). */
+	if (rss == RCU_SYSIDLE_FULL_NOTED)
+		return true;
+
+	/*
+	 * If we aren't there yet, and a grace period is not in flight,
+	 * initiate a grace period.  Either way, tell the caller that
+	 * we are not there yet.
+	 */
+	if (nr_cpu_ids > RCU_SYSIDLE_SMALL &&
+	    !rcu_gp_in_progress(rcu_sysidle_state) &&
+	    !rsh.inuse && xchg(&rsh.inuse, 1) == 0)
+		call_rcu(&rsh.rh, rcu_sysidle_cb);
+	return false;
 }
 
 /*
@@ -2496,6 +2748,21 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
 {
 }
 
+static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
+				  unsigned long *maxj)
+{
+}
+
+static bool is_sysidle_rcu_state(struct rcu_state *rsp)
+{
+	return false;
+}
+
+static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
+			       unsigned long maxj)
+{
+}
+
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
 {
 }
-- 
1.8.1.5



* [PATCH RFC nohz_full 7/7] nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU
  2013-07-09  1:30 ` [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
                     ` (4 preceding siblings ...)
  2013-07-09  1:30   ` [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine Paul E. McKenney
@ 2013-07-09  1:30   ` Paul E. McKenney
  5 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-09  1:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Because RCU's quiescent-state-forcing mechanism is used to drive the
full-system-idle state machine, and because this mechanism is executed
by RCU's grace-period kthreads, this commit forces these kthreads to
run on the timekeeping CPU (tick_do_timer_cpu).  To do otherwise would
mean that the RCU grace-period kthreads would force the system into
non-idle state every time they drove the state machine, which would
be just a bit on the futile side.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rcutree.c        |  1 +
 kernel/rcutree.h        |  1 +
 kernel/rcutree_plugin.h | 20 +++++++++++++++++++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 06cfd75..ad9a5ec 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1286,6 +1286,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
 	struct rcu_data *rdp;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
+	rcu_bind_gp_kthread();
 	raw_spin_lock_irq(&rnp->lock);
 	rsp->gp_flags = 0; /* Clear all flags: New grace period. */
 
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 7326a3c..1602c21 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -558,6 +558,7 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
 static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
 				  unsigned long *maxj);
 static bool is_sysidle_rcu_state(struct rcu_state *rsp);
+static void rcu_bind_gp_kthread(void);
 static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
 			       unsigned long maxj);
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index b47ffb0..a4d44c3 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2544,7 +2544,7 @@ static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
 	if (!*isidle || rdp->rsp != rcu_sysidle_state ||
 	    cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
 		return;
-	/* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
+	WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);
 
 	/* Pick up current idle and NMI-nesting counter and check. */
 	cur = atomic_read(&rdtp->dynticks_idle);
@@ -2570,6 +2570,20 @@ static bool is_sysidle_rcu_state(struct rcu_state *rsp)
 }
 
 /*
+ * Bind the grace-period kthread for the sysidle flavor of RCU to the
+ * timekeeping CPU.
+ */
+static void rcu_bind_gp_kthread(void)
+{
+	int cpu = ACCESS_ONCE(tick_do_timer_cpu);
+
+	if (cpu < 0 || cpu >= nr_cpu_ids)
+		return;
+	if (raw_smp_processor_id() != cpu)
+		set_cpus_allowed_ptr(current, cpumask_of(cpu));
+}
+
+/*
  * Return a delay in jiffies based on the number of CPUs, rcu_node
  * leaf fanout, and jiffies tick rate.  The idea is to allow larger
  * systems more time to transition to full-idle state in order to
@@ -2758,6 +2772,10 @@ static bool is_sysidle_rcu_state(struct rcu_state *rsp)
 	return false;
 }
 
+static void rcu_bind_gp_kthread(void)
+{
+}
+
 static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
 			       unsigned long maxj)
 {
-- 
1.8.1.5



* Re: [PATCH RFC nohz_full 2/7] nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
  2013-07-09  1:30   ` [PATCH RFC nohz_full 2/7] nohz_full: Add rcu_dyntick data " Paul E. McKenney
@ 2013-07-09  9:37     ` Peter Zijlstra
  2013-07-09 13:23       ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2013-07-09  9:37 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, dhowells, edumazet, darren, fweisbec,
	sbw

On Mon, Jul 08, 2013 at 06:30:01PM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> This commit adds fields to the rcu_dyntick structure that are used to
> detect idle CPUs.  These new fields differ from the existing ones in
> that the existing ones consider a CPU executing in user mode to be idle,
> whereas the new ones consider CPUs executing in user mode to be busy.
> The handling of these new fields is otherwise quite similar to that for
> the existing fields.  This commit also adds the initialization required
> for these fields.
> 
> So, why is usermode execution treated differently, with RCU considering
> it a quiescent state equivalent to idle, while in contrast the new
> full-system idle state detection considers usermode execution to be
> non-idle?
> 
> It turns out that although one of RCU's quiescent states is usermode
> execution, it is not a full-system idle state.  This is because the
> purpose of the full-system idle state is not RCU, but rather determining
> when accurate timekeeping can safely be disabled.  Whenever accurate
> timekeeping is required in a CONFIG_NO_HZ_FULL kernel, at least one
> CPU must keep the scheduling-clock tick going.  If even one CPU is
> executing in user mode, accurate timekeeping is required, particularly for
> architectures where gettimeofday() and friends do not enter the kernel.
> Only when all CPUs are really and truly idle can accurate timekeeping be
> disabled, allowing all CPUs to turn off the scheduling clock interrupt,
> thus greatly improving energy efficiency.
> 
> This naturally raises the question "Why is this code in RCU rather than in
> timekeeping?", and the answer is that RCU has the data and infrastructure
> to efficiently make this determination.

but but but but... why doesn't the regular nohz code qualify? I'd think
that too would be tracking pretty much the same things, no?


* Re: [PATCH RFC nohz_full 2/7] nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
  2013-07-09  9:37     ` Peter Zijlstra
@ 2013-07-09 13:23       ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-09 13:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, dhowells, edumazet, darren, fweisbec,
	sbw

On Tue, Jul 09, 2013 at 11:37:28AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 08, 2013 at 06:30:01PM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > This commit adds fields to the rcu_dyntick structure that are used to
> > detect idle CPUs.  These new fields differ from the existing ones in
> > that the existing ones consider a CPU executing in user mode to be idle,
> > whereas the new ones consider CPUs executing in user mode to be busy.
> > The handling of these new fields is otherwise quite similar to that for
> > the existing fields.  This commit also adds the initialization required
> > for these fields.
> > 
> > So, why is usermode execution treated differently, with RCU considering
> > it a quiescent state equivalent to idle, while in contrast the new
> > full-system idle state detection considers usermode execution to be
> > non-idle?
> > 
> > It turns out that although one of RCU's quiescent states is usermode
> > execution, it is not a full-system idle state.  This is because the
> > purpose of the full-system idle state is not RCU, but rather determining
> > when accurate timekeeping can safely be disabled.  Whenever accurate
> > timekeeping is required in a CONFIG_NO_HZ_FULL kernel, at least one
> > CPU must keep the scheduling-clock tick going.  If even one CPU is
> > executing in user mode, accurate timekeeping is required, particularly for
> > architectures where gettimeofday() and friends do not enter the kernel.
> > Only when all CPUs are really and truly idle can accurate timekeeping be
> > disabled, allowing all CPUs to turn off the scheduling clock interrupt,
> > thus greatly improving energy efficiency.
> > 
> > This naturally raises the question "Why is this code in RCU rather than in
> > timekeeping?", and the answer is that RCU has the data and infrastructure
> > to efficiently make this determination.
> 
> but but but but... why doesn't the regular nohz code qualify? I'd think
> that too would be tracking pretty much the same things, no?

The regular nohz code is identifying which CPUs are idle, but is doing
so on a CPU-by-CPU basis.  Before turning off system-wide timekeeping,
we need to know that -all- of the CPUs are idle.  The regular nohz code
does not do this.

							Thanx, Paul



* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-09  1:30   ` [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine Paul E. McKenney
@ 2013-07-17 23:31     ` Frederic Weisbecker
  2013-07-18  0:41       ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2013-07-17 23:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Mon, Jul 08, 2013 at 06:30:05PM -0700, Paul E. McKenney wrote:
>  }
>  
>  /*
> + * Unconditionally force exit from full system-idle state.  This is
> + * invoked when a normal CPU exits idle, but must be called separately
> + * for the timekeeping CPU (tick_do_timer_cpu).  The reason for this
> + * is that the timekeeping CPU is permitted to take scheduling-clock
> + * interrupts while the system is in system-idle state, and of course
> + * rcu_sysidle_exit() has no way of distinguishing a scheduling-clock
> + * interrupt from any other type of interrupt.
> + */
> +void rcu_sysidle_force_exit(void)
> +{
> +	int oldstate = ACCESS_ONCE(full_sysidle_state);
> +	int newoldstate;
> +
> +	/*
> +	 * Each pass through the following loop attempts to exit full
> +	 * system-idle state.  If contention proves to be a problem,
> +	 * a trylock-based contention tree could be used here.
> +	 */
> +	while (oldstate > RCU_SYSIDLE_SHORT) {

I'm missing a key here.

Let's imagine that the timekeeper has finally set full_sysidle_state = RCU_SYSIDLE_FULL_NOTED
with cmpxchg, what guarantees that this CPU is not seeing a stale RCU_SYSIDLE_SHORT value
for example?


* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-17 23:31     ` Frederic Weisbecker
@ 2013-07-18  0:41       ` Paul E. McKenney
  2013-07-18  1:33         ` Frederic Weisbecker
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-18  0:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Thu, Jul 18, 2013 at 01:31:21AM +0200, Frederic Weisbecker wrote:
> On Mon, Jul 08, 2013 at 06:30:05PM -0700, Paul E. McKenney wrote:
> >  }
> >  
> >  /*
> > + * Unconditionally force exit from full system-idle state.  This is
> > + * invoked when a normal CPU exits idle, but must be called separately
> > + * for the timekeeping CPU (tick_do_timer_cpu).  The reason for this
> > + * is that the timekeeping CPU is permitted to take scheduling-clock
> > + * interrupts while the system is in system-idle state, and of course
> > + * rcu_sysidle_exit() has no way of distinguishing a scheduling-clock
> > + * interrupt from any other type of interrupt.
> > + */
> > +void rcu_sysidle_force_exit(void)
> > +{
> > +	int oldstate = ACCESS_ONCE(full_sysidle_state);
> > +	int newoldstate;
> > +
> > +	/*
> > +	 * Each pass through the following loop attempts to exit full
> > +	 * system-idle state.  If contention proves to be a problem,
> > +	 * a trylock-based contention tree could be used here.
> > +	 */
> > +	while (oldstate > RCU_SYSIDLE_SHORT) {
> 
> I'm missing a key here.
> 
> Let's imagine that the timekeeper has finally set full_sysidle_state = RCU_SYSIDLE_FULL_NOTED
> with cmpxchg, what guarantees that this CPU is not seeing a stale RCU_SYSIDLE_SHORT value
> for example?

Good question!  Let's see if I have a reasonable answer.  ;-)

I am going to start with the large-CPU case, so that the state is advanced
only by the grace-period kthread.

1.	Case 1: the timekeeper CPU invoked rcu_sysidle_force_exit().
	In this case, this is the same CPU that set full_sysidle_state
	to RCU_SYSIDLE_FULL_NOTED, so it is guaranteed not to see a
	stale value.

2.	Case 2: Some CPU came out of idle, and invoked rcu_sysidle_exit().
	In this case, if this CPU reads a RCU_SYSIDLE_SHORT from
	full_sysidle_state, this read must have come before the
	cmpxchg() (on some other CPU) that set it to RCU_SYSIDLE_LONG.
	Because this CPU's read from full_sysidle_state was preceded by
	an atomic_inc() that updated this CPU's ->dynticks_idle, that
	update must also precede the cmpxchg() that set full_sysidle_state
	to RCU_SYSIDLE_LONG.  Because the state advancing is done from
	within a single thread, the subsequent scan is guaranteed to see
	the first CPU's update of ->dynticks_idle, and will therefore
	refrain from advancing full_sysidle_state to RCU_SYSIDLE_FULL.

	This will in turn prevent the timekeeping thread from advancing
	the state to RCU_SYSIDLE_FULL_NOTED, so this scenario cannot
	happen.

Unfortunately, the reasoning in #2 above does not hold in the small-CPU
case because there is the possibility of both the timekeeping CPU and
the RCU grace-period kthread concurrently advancing the state machine.
This would be bad, good catch!!!

The patch below (untested) is an attempt to fix this.  If it actually
works, I will merge it in with 6/7.

Anything else I missed?  ;-)

							Thanx, Paul

------------------------------------------------------------------------

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index aa3f525..fe83085 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1364,7 +1364,7 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
 		}
 		force_qs_rnp(rsp, dyntick_save_progress_counter,
 			     &isidle, &maxj);
-		rcu_sysidle_report(rsp, isidle, maxj);
+		rcu_sysidle_report_gp(rsp, isidle, maxj);
 		fqs_state = RCU_FORCE_QS;
 	} else {
 		/* Handle dyntick-idle and offline CPUs. */
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 1602c21..657b415 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -559,8 +559,8 @@ static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
 				  unsigned long *maxj);
 static bool is_sysidle_rcu_state(struct rcu_state *rsp);
 static void rcu_bind_gp_kthread(void);
-static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
-			       unsigned long maxj);
+static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
+				  unsigned long maxj);
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
 
 #endif /* #ifndef RCU_TREE_NONCORE */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index a4d44c3..f65d9c2 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2655,14 +2655,22 @@ static void rcu_sysidle_cancel(void)
  * scan of the CPUs' dyntick-idle state.
  */
 static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
-			       unsigned long maxj)
+			       unsigned long maxj, bool gpkt)
 {
 	if (rsp != rcu_sysidle_state)
 		return;  /* Wrong flavor, ignore. */
-	if (isidle)
-		rcu_sysidle(maxj);    /* More idle! */
-	else
+	if (isidle) {
+		if (gpkt && nr_cpu_ids > RCU_SYSIDLE_SMALL)
+			rcu_sysidle(maxj);    /* More idle! */
+	} else {
 		rcu_sysidle_cancel(); /* Idle is over. */
+	}
+}
+
+static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
+				  unsigned long maxj)
+{
+	rcu_sysidle_report(rsp, isidle, maxj, true);
 }
 
 /* Callback and function for forcing an RCU grace period. */
@@ -2713,7 +2721,8 @@ bool rcu_sys_is_idle(void)
 				if (!isidle)
 					break;
 			}
-			rcu_sysidle_report(rcu_sysidle_state, isidle, maxj);
+			rcu_sysidle_report(rcu_sysidle_state,
+					   isidle, maxj, false);
 			oldrss = rss;
 			rss = ACCESS_ONCE(full_sysidle_state);
 		}
@@ -2776,8 +2785,8 @@ static void rcu_bind_gp_kthread(void)
 {
 }
 
-static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
-			       unsigned long maxj)
+static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
+				  unsigned long maxj)
 {
 }
 


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-18  0:41       ` Paul E. McKenney
@ 2013-07-18  1:33         ` Frederic Weisbecker
  2013-07-18  3:39           ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2013-07-18  1:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Wed, Jul 17, 2013 at 05:41:41PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 18, 2013 at 01:31:21AM +0200, Frederic Weisbecker wrote:
> > I'm missing a key here.
> > 
> > Let's imagine that the timekeeper has finally set full_sysidle_state = RCU_SYSIDLE_FULL_NOTED
> > with cmpxchg, what guarantees that this CPU is not seeing a stale RCU_SYSIDLE_SHORT value
> > for example?
> 
> Good question!  Let's see if I have a reasonable answer.  ;-)
> 
> I am going to start with the large-CPU case, so that the state is advanced
> only by the grace-period kthread.
> 
> 1.	Case 1: the timekeeper CPU invoked rcu_sysidle_force_exit().
> 	In this case, this is the same CPU that set full_sysidle_state
> 	to RCU_SYSIDLE_FULL_NOTED, so it is guaranteed not to see a
> 	stale value.
> 
> 2.	Case 2: Some CPU came out of idle, and invoked rcu_sysidle_exit().
> 	In this case, if this CPU reads a RCU_SYSIDLE_SHORT from
> 	full_sysidle_state, this read must have come before the
> 	cmpxchg() (on some other CPU) that set it to RCU_SYSIDLE_LONG.
> 	Because this CPU's read from full_sysidle_state was preceded by
> 	an atomic_inc() that updated this CPU's ->dynticks_idle, that
> 	update must also precede the cmpxchg() that set full_sysidle_state
> 	to RCU_SYSIDLE_LONG.  Because the state advancing is done from
> 	within a single thread, the subsequent scan is guaranteed to see
> 	the first CPU's update of ->dynticks_idle, and will therefore
> 	refrain from advancing full_sysidle_state to RCU_SYSIDLE_FULL.
> 
> 	This will in turn prevent the timekeeping thread from advancing
> 	the state to RCU_SYSIDLE_FULL_NOTED, so this scenario cannot
> 	happen.

Ok, so IIUC the safety is guaranteed in the following ordering:

    CPU 0                                               CPU 1

    idle = 1                                            smp_mb()
    //for each cpu
    if (atomic_read(rdp(1)->dyntick_idle) & 1)          atomic_inc(rdp->dyntick_idle)
          idle = 0                                      smp_mb()

    if (idle)
         cmpxchg(full_sysidle_state, SHORT, LONG)       while (full_sysidle_state > SHORT)
                                                            //reset with cmpxchg

So it's like:

    CPU 0                                              CPU 1

    read I                                             write I
    smp_mb()                                           smp_mb()
    cmpxchg S                                          read S

I still can't find what guarantees we don't read a value in CPU 1 that is way below
what we want.

> 
> Unfortunately, the reasoning in #2 above does not hold in the small-CPU
> case because there is the possibility of both the timekeeping CPU and
> the RCU grace-period kthread concurrently advancing the state machine.
> This would be bad, good catch!!!

It's not like I spotted anything myself but you're welcome :)

> 
> The patch below (untested) is an attempt to fix this.  If it actually
> works, I will merge it in with 6/7.
> 
> Anything else I missed?  ;-)

Well I guess I'll wait one more night before trying to understand
the below ;)

    Thanks!

> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index aa3f525..fe83085 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1364,7 +1364,7 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
>  		}
>  		force_qs_rnp(rsp, dyntick_save_progress_counter,
>  			     &isidle, &maxj);
> -		rcu_sysidle_report(rsp, isidle, maxj);
> +		rcu_sysidle_report_gp(rsp, isidle, maxj);
>  		fqs_state = RCU_FORCE_QS;
>  	} else {
>  		/* Handle dyntick-idle and offline CPUs. */
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 1602c21..657b415 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -559,8 +559,8 @@ static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
>  				  unsigned long *maxj);
>  static bool is_sysidle_rcu_state(struct rcu_state *rsp);
>  static void rcu_bind_gp_kthread(void);
> -static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> -			       unsigned long maxj);
> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> +				  unsigned long maxj);
>  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
>  
>  #endif /* #ifndef RCU_TREE_NONCORE */
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index a4d44c3..f65d9c2 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -2655,14 +2655,22 @@ static void rcu_sysidle_cancel(void)
>   * scan of the CPUs' dyntick-idle state.
>   */
>  static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> -			       unsigned long maxj)
> +			       unsigned long maxj, bool gpkt)
>  {
>  	if (rsp != rcu_sysidle_state)
>  		return;  /* Wrong flavor, ignore. */
> -	if (isidle)
> -		rcu_sysidle(maxj);    /* More idle! */
> -	else
> +	if (isidle) {
> +		if (gpkt && nr_cpu_ids > RCU_SYSIDLE_SMALL)
> +			rcu_sysidle(maxj);    /* More idle! */
> +	} else {
>  		rcu_sysidle_cancel(); /* Idle is over. */
> +	}
> +}
> +
> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> +				  unsigned long maxj)
> +{
> +	rcu_sysidle_report(rsp, isidle, maxj, true);
>  }
>  
>  /* Callback and function for forcing an RCU grace period. */
> @@ -2713,7 +2721,8 @@ bool rcu_sys_is_idle(void)
>  				if (!isidle)
>  					break;
>  			}
> -			rcu_sysidle_report(rcu_sysidle_state, isidle, maxj);
> +			rcu_sysidle_report(rcu_sysidle_state,
> +					   isidle, maxj, false);
>  			oldrss = rss;
>  			rss = ACCESS_ONCE(full_sysidle_state);
>  		}
> @@ -2776,8 +2785,8 @@ static void rcu_bind_gp_kthread(void)
>  {
>  }
>  
> -static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> -			       unsigned long maxj)
> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> +				  unsigned long maxj)
>  {
>  }
>  
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-18  1:33         ` Frederic Weisbecker
@ 2013-07-18  3:39           ` Paul E. McKenney
  2013-07-18 14:24             ` Frederic Weisbecker
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-18  3:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Thu, Jul 18, 2013 at 03:33:01AM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 17, 2013 at 05:41:41PM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 18, 2013 at 01:31:21AM +0200, Frederic Weisbecker wrote:
> > > I'm missing a key here.
> > > 
> > > Let's imagine that the timekeeper has finally set full_sysidle_state = RCU_SYSIDLE_FULL_NOTED
> > > with cmpxchg, what guarantees that this CPU is not seeing a stale RCU_SYSIDLE_SHORT value
> > > for example?
> > 
> > Good question!  Let's see if I have a reasonable answer.  ;-)
> > 
> > I am going to start with the large-CPU case, so that the state is advanced
> > only by the grace-period kthread.
> > 
> > 1.	Case 1: the timekeeper CPU invoked rcu_sysidle_force_exit().
> > 	In this case, this is the same CPU that set full_sysidle_state
> > 	to RCU_SYSIDLE_FULL_NOTED, so it is guaranteed not to see a
> > 	stale value.
> > 
> > 2.	Case 2: Some CPU came out of idle, and invoked rcu_sysidle_exit().
> > 	In this case, if this CPU reads a RCU_SYSIDLE_SHORT from
> > 	full_sysidle_state, this read must have come before the
> > 	cmpxchg() (on some other CPU) that set it to RCU_SYSIDLE_LONG.
> > 	Because this CPU's read from full_sysidle_state was preceded by
> > 	an atomic_inc() that updated this CPU's ->dynticks_idle, that
> > 	update must also precede the cmpxchg() that set full_sysidle_state
> > 	to RCU_SYSIDLE_LONG.  Because the state advancing is done from
> > 	within a single thread, the subsequent scan is guaranteed to see
> > 	the first CPU's update of ->dynticks_idle, and will therefore
> > 	refrain from advancing full_sysidle_state to RCU_SYSIDLE_FULL.
> > 
> > 	This will in turn prevent the timekeeping thread from advancing
> > 	the state to RCU_SYSIDLE_FULL_NOTED, so this scenario cannot
> > 	happen.
> 
> Ok, so IIUC the safety is guaranteed in the following ordering:
> 
>     CPU 0                                               CPU 1
> 
>     idle = 1                                            smp_mb()
>     //for each cpu
>     if (atomic_read(rdp(1)->dyntick_idle) & 1)          atomic_inc(rdp->dyntick_idle)
>           idle = 0                                      smp_mb()
> 
>     if (idle)
>          cmpxchg(full_sysidle_state, SHORT, LONG)       while (full_sysidle_state > SHORT)
>                                                             //reset with cmpxchg
> 
> So it's like:
> 
>     CPU 0                                              CPU 1
> 
>     read I                                             write I
>     smp_mb()                                           smp_mb()
>     cmpxchg S                                          read S
> 
> I still can't find what guarantees we don't read a value in CPU 1 that is way below
> what we want.

One key point is that there is a second cycle from LONG to FULL.

(Not saying that there is not a bug -- there might well be.  In fact,
I am starting to think that I need to do another Promela model...)

> > Unfortunately, the reasoning in #2 above does not hold in the small-CPU
> > case because there is the possibility of both the timekeeping CPU and
> > the RCU grace-period kthread concurrently advancing the state machine.
> > This would be bad, good catch!!!
> 
> It's not like I spotted anything myself but you're welcome :)

I will take them any way I can get them.  ;-)

> > The patch below (untested) is an attempt to fix this.  If it actually
> > works, I will merge it in with 6/7.
> > 
> > Anything else I missed?  ;-)
> 
> Well I guess I'll wait one more night before trying to understand
> the below ;)

The key point is that the added check means that either the timekeeping
CPU is advancing the state machine (if there are few CPUs) or the
RCU grace-period kthread is (if there are many CPUs), but never both.
Or that is the intent, anyway!
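
Or, to make the intended dispatch explicit, it is equivalent to something
like the following helper (hypothetical, just for illustration -- no such
function exists in the patch):

	/* Sketch: which context may advance the sysidle state machine? */
	static bool rcu_sysidle_may_advance(bool gpkt)
	{
		if (nr_cpu_ids <= RCU_SYSIDLE_SMALL)
			return !gpkt;	/* Small system: timekeeping CPU only. */
		return gpkt;		/* Large system: GP kthread only. */
	}

Cancellation via rcu_sysidle_cancel() remains legal from either context;
only the forward progression is restricted to a single advancer.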

							Thanx, Paul

>     Thanks!
> 
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > 
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index aa3f525..fe83085 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -1364,7 +1364,7 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
> >  		}
> >  		force_qs_rnp(rsp, dyntick_save_progress_counter,
> >  			     &isidle, &maxj);
> > -		rcu_sysidle_report(rsp, isidle, maxj);
> > +		rcu_sysidle_report_gp(rsp, isidle, maxj);
> >  		fqs_state = RCU_FORCE_QS;
> >  	} else {
> >  		/* Handle dyntick-idle and offline CPUs. */
> > diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> > index 1602c21..657b415 100644
> > --- a/kernel/rcutree.h
> > +++ b/kernel/rcutree.h
> > @@ -559,8 +559,8 @@ static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> >  				  unsigned long *maxj);
> >  static bool is_sysidle_rcu_state(struct rcu_state *rsp);
> >  static void rcu_bind_gp_kthread(void);
> > -static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> > -			       unsigned long maxj);
> > +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> > +				  unsigned long maxj);
> >  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
> >  
> >  #endif /* #ifndef RCU_TREE_NONCORE */
> > diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > index a4d44c3..f65d9c2 100644
> > --- a/kernel/rcutree_plugin.h
> > +++ b/kernel/rcutree_plugin.h
> > @@ -2655,14 +2655,22 @@ static void rcu_sysidle_cancel(void)
> >   * scan of the CPUs' dyntick-idle state.
> >   */
> >  static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> > -			       unsigned long maxj)
> > +			       unsigned long maxj, bool gpkt)
> >  {
> >  	if (rsp != rcu_sysidle_state)
> >  		return;  /* Wrong flavor, ignore. */
> > -	if (isidle)
> > -		rcu_sysidle(maxj);    /* More idle! */
> > -	else
> > +	if (isidle) {
> > +		if (gpkt && nr_cpu_ids > RCU_SYSIDLE_SMALL)
> > +			rcu_sysidle(maxj);    /* More idle! */
> > +	} else {
> >  		rcu_sysidle_cancel(); /* Idle is over. */
> > +	}
> > +}
> > +
> > +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> > +				  unsigned long maxj)
> > +{
> > +	rcu_sysidle_report(rsp, isidle, maxj, true);
> >  }
> >  
> >  /* Callback and function for forcing an RCU grace period. */
> > @@ -2713,7 +2721,8 @@ bool rcu_sys_is_idle(void)
> >  				if (!isidle)
> >  					break;
> >  			}
> > -			rcu_sysidle_report(rcu_sysidle_state, isidle, maxj);
> > +			rcu_sysidle_report(rcu_sysidle_state,
> > +					   isidle, maxj, false);
> >  			oldrss = rss;
> >  			rss = ACCESS_ONCE(full_sysidle_state);
> >  		}
> > @@ -2776,8 +2785,8 @@ static void rcu_bind_gp_kthread(void)
> >  {
> >  }
> >  
> > -static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> > -			       unsigned long maxj)
> > +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> > +				  unsigned long maxj)
> >  {
> >  }
> >  
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-18  3:39           ` Paul E. McKenney
@ 2013-07-18 14:24             ` Frederic Weisbecker
  2013-07-18 16:47               ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2013-07-18 14:24 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Wed, Jul 17, 2013 at 08:39:21PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 18, 2013 at 03:33:01AM +0200, Frederic Weisbecker wrote:
> > So it's like:
> > 
> >     CPU 0                                              CPU 1
> > 
> >     read I                                             write I
> >     smp_mb()                                           smp_mb()
> >     cmpxchg S                                          read S
> > 
> > I still can't find what guarantees we don't read a value in CPU 1 that is way below
> > what we want.
> 
> One key point is that there is a second cycle from LONG to FULL.
> 
> (Not saying that there is not a bug -- there might well be.  In fact,
> I am starting to think that I need to do another Promela model...

Now I'm very confused :)

I'm far from being a specialist on these matters but I would really love to
understand this patchset. Is there any documentation somewhere I can read
that could help, something about cycles of committed memory or something?


> 
> > > Unfortunately, the reasoning in #2 above does not hold in the small-CPU
> > > case because there is the possibility of both the timekeeping CPU and
> > > the RCU grace-period kthread concurrently advancing the state machine.
> > > This would be bad, good catch!!!
> > 
> > It's not like I spotted anything myself but you're welcome :)
> 
> I will take them any way I can get them.  ;-)
> 
> > > The patch below (untested) is an attempt to fix this.  If it actually
> > > works, I will merge it in with 6/7.
> > > 
> > > Anything else I missed?  ;-)
> > 
> > Well I guess I'll wait one more night before trying to understand
> > the below ;)
> 
> The key point is that the added check means that either the timekeeping
> CPU is advancing the state machine (if there are few CPUs) or the
> RCU grace-period kthread is (if there are many CPUs), but never both.
> Or that is the intent, anyway!

Yeah got that.

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-18 14:24             ` Frederic Weisbecker
@ 2013-07-18 16:47               ` Paul E. McKenney
  2013-07-18 22:46                 ` Frederic Weisbecker
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-18 16:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Thu, Jul 18, 2013 at 04:24:51PM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 17, 2013 at 08:39:21PM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 18, 2013 at 03:33:01AM +0200, Frederic Weisbecker wrote:
> > > So it's like:
> > > 
> > >     CPU 0                                              CPU 1
> > > 
> > >     read I                                             write I
> > >     smp_mb()                                           smp_mb()
> > >     cmpxchg S                                          read S
> > > 
> > > I still can't find what guarantees we don't read a value in CPU 1 that is way below
> > > what we want.
> > 
> > One key point is that there is a second cycle from LONG to FULL.
> > 
> > (Not saying that there is not a bug -- there might well be.  In fact,
> > I am starting to think that I need to do another Promela model...
> 
> Now I'm very confused :)

To quote a Nobel Laureate who presented at an ISEF here in Portland some
years back, "Confusion is the most productive state of mind."  ;-)

> I'm far from being a specialist on these matters but I would really love to
> understand this patchset. Is there any documentation somewhere I can read
> that could help, something about cycles of committed memory or something?

Documentation/memory-barriers.txt should suffice for this.  If you want
more rigor, http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf

But memory-barrier pairing suffices here.  Here is case 2 from my
earlier email in more detail.  The comments with capital letters
mark important memory barriers, some of which are buried in atomic
operations.

1. Some CPU coming out of idle:

o	rcu_sysidle_exit():

	smp_mb__before_atomic_inc();
	atomic_inc(&rdtp->dynticks_idle);
	smp_mb__after_atomic_inc(); /* A */

o	rcu_sysidle_force_exit():

	oldstate = ACCESS_ONCE(full_sysidle_state);

2. RCU GP kthread:

o	rcu_sysidle():

	cmpxchg(&full_sysidle_state, RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG);
		/* B */

o	rcu_sysidle_check_cpu():

	cur = atomic_read(&rdtp->dynticks_idle);

Memory barrier A pairs with memory barrier B, so that if #1's load
from full_sysidle_state sees RCU_SYSIDLE_SHORT, we know that #1's
atomic_inc() must be visible to #2's atomic_read().  This will cause #2
to recognize that the CPU came out of idle, which will in turn cause it
to invoke rcu_sysidle_cancel() instead of rcu_sysidle(), resulting in
full_sysidle_state being set to RCU_SYSIDLE_NOT.
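
To put that in litmus-test form (just a sketch, with "X" standing for the
exiting CPU's ->dynticks_idle and "S" for full_sysidle_state):

    CPU exiting idle (#1)                 RCU GP kthread (#2)
    ---------------------                 -------------------
    atomic_inc(&X);                       cmpxchg(&S, RCU_SYSIDLE_SHORT,
    smp_mb__after_atomic_inc(); /* A */               RCU_SYSIDLE_LONG); /* B */
    r1 = ACCESS_ONCE(S);                  r2 = atomic_read(&X);

Here B is the full barrier implied by cmpxchg().  The pairing of A and B
forbids the outcome (r1 == RCU_SYSIDLE_SHORT && r2 still even):  if #1 sees
the pre-cmpxchg value of S, then #2's subsequent scan must see the increment
of X, and therefore cancels rather than advances the state.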

							Thanx, Paul

> > > > Unfortunately, the reasoning in #2 above does not hold in the small-CPU
> > > > case because there is the possibility of both the timekeeping CPU and
> > > > the RCU grace-period kthread concurrently advancing the state machine.
> > > > This would be bad, good catch!!!
> > > 
> > > It's not like I spotted anything myself but you're welcome :)
> > 
> > I will take them any way I can get them.  ;-)
> > 
> > > > The patch below (untested) is an attempt to fix this.  If it actually
> > > > works, I will merge it in with 6/7.
> > > > 
> > > > Anything else I missed?  ;-)
> > > 
> > > Well I guess I'll wait one more night before trying to understand
> > > the below ;)
> > 
> > The key point is that the added check means that either the timekeeping
> > CPU is advancing the state machine (if there are few CPUs) or the
> > RCU grace-period kthread is (if there are many CPUs), but never both.
> > Or that is the intent, anyway!
> 
> Yeah got that.
> 
> Thanks!
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-18 16:47               ` Paul E. McKenney
@ 2013-07-18 22:46                 ` Frederic Weisbecker
  2013-07-19  0:24                   ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2013-07-18 22:46 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Thu, Jul 18, 2013 at 09:47:49AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 18, 2013 at 04:24:51PM +0200, Frederic Weisbecker wrote:
> > On Wed, Jul 17, 2013 at 08:39:21PM -0700, Paul E. McKenney wrote:
> > > On Thu, Jul 18, 2013 at 03:33:01AM +0200, Frederic Weisbecker wrote:
> > > > So it's like:
> > > > 
> > > >     CPU 0                                              CPU 1
> > > > 
> > > >     read I                                             write I
> > > >     smp_mb()                                           smp_mb()
> > > >     cmpxchg S                                          read S
> > > > 
> > > > I still can't find what guarantees we don't read a value in CPU 1 that is way below
> > > > what we want.
> > > 
> > > One key point is that there is a second cycle from LONG to FULL.
> > > 
> > > (Not saying that there is not a bug -- there might well be.  In fact,
> > > I am starting to think that I need to do another Promela model...
> > 
> > Now I'm very confused :)
> 
> To quote a Nobel Laureate who presented at an ISEF here in Portland some
> years back, "Confusion is the most productive state of mind."  ;-)

Then I must be a very productive guy!

> 
> > I'm far from being a specialist on these matters but I would really love to
> > understand this patchset. Is there any documentation somewhere I can read
> > that could help, something about cycles of committed memory or something?
> 
> Documentation/memory-barriers.txt should suffice for this.  If you want
> more rigor, http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
> 
> But memory-barrier pairing suffices here.  Here is case 2 from my
> earlier email in more detail.  The comments with capital letters
> mark important memory barriers, some of which are buried in atomic
> operations.
> 
> 1. Some CPU coming out of idle:
> 
> o	rcu_sysidle_exit():
> 
> 	smp_mb__before_atomic_inc();
> 	atomic_inc(&rdtp->dynticks_idle);
> 	smp_mb__after_atomic_inc(); /* A */
> 
> o	rcu_sysidle_force_exit():
> 
> 	oldstate = ACCESS_ONCE(full_sysidle_state);
> 
> 2. RCU GP kthread:
> 
> o	rcu_sysidle():
> 
> 	cmpxchg(&full_sysidle_state, RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG);
> 		/* B */
> 
> o	rcu_sysidle_check_cpu():
> 
> 	cur = atomic_read(&rdtp->dynticks_idle);
> 
> Memory barrier A pairs with memory barrier B, so that if #1's load
> from full_sysidle_state sees RCU_SYSIDLE_SHORT, we know that #1's
> atomic_inc() must be visible to #2's atomic_read().  This will cause #2
> to recognize that the CPU came out of idle, which will in turn cause it
> to invoke rcu_sysidle_cancel() instead of rcu_sysidle(), resulting in
> full_sysidle_state being set to RCU_SYSIDLE_NOT.

Ok I get it for that direction.
Now imagine CPU 0 is the RCU GP kthread (#2) and CPU 1 is idle and stays
so.

CPU 0 then does its scans and sees that all CPUs are idle, until it finally sets
full_sysidle_state to RCU_SYSIDLE_FULL and goes to sleep.

Then CPU 1 wakes up. It really has to see a value above RCU_SYSIDLE_SHORT
otherwise it won't do the cmpxchg and see the FULL_NOTED that makes it send
the IPI.

What provides the guarantee that CPU 1 sees a value above RCU_SYSIDLE_SHORT?
Not at the cmpxchg, but when it first dereferences it with ACCESS_ONCE().

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-18 22:46                 ` Frederic Weisbecker
@ 2013-07-19  0:24                   ` Paul E. McKenney
  2013-07-19  2:12                     ` Frederic Weisbecker
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-19  0:24 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Fri, Jul 19, 2013 at 12:46:21AM +0200, Frederic Weisbecker wrote:
> On Thu, Jul 18, 2013 at 09:47:49AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 18, 2013 at 04:24:51PM +0200, Frederic Weisbecker wrote:
> > > On Wed, Jul 17, 2013 at 08:39:21PM -0700, Paul E. McKenney wrote:
> > > > On Thu, Jul 18, 2013 at 03:33:01AM +0200, Frederic Weisbecker wrote:
> > > > > So it's like:
> > > > > 
> > > > >     CPU 0                                              CPU 1
> > > > > 
> > > > >     read I                                             write I
> > > > >     smp_mb()                                           smp_mb()
> > > > >     cmpxchg S                                          read S
> > > > > 
> > > > > I still can't find what guarantees we don't read a value in CPU 1 that is way below
> > > > > what we want.
> > > > 
> > > > One key point is that there is a second cycle from LONG to FULL.
> > > > 
> > > > (Not saying that there is not a bug -- there might well be.  In fact,
> > > > I am starting to think that I need to do another Promela model...
> > > 
> > > Now I'm very confused :)
> > 
> > To quote a Nobel Laureate who presented at an ISEF here in Portland some
> > years back, "Confusion is the most productive state of mind."  ;-)
> 
> Then I must be a very productive guy!

So that is your secret!  ;-)

> > > I'm far from being a specialist on these matters but I would really love to
> > > understand this patchset. Is there any documentation somewhere I can read
> > > that could help, something about cycles of committed memory or something?
> > 
> > Documentation/memory-barriers.txt should suffice for this.  If you want
> > more rigor, http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
> > 
> > But memory-barrier pairing suffices here.  Here is case 2 from my
> > earlier email in more detail.  The comments with capital letters
> > mark important memory barriers, some of which are buried in atomic
> > operations.
> > 
> > 1. Some CPU coming out of idle:
> > 
> > o	rcu_sysidle_exit():
> > 
> > 	smp_mb__before_atomic_inc();
> > 	atomic_inc(&rdtp->dynticks_idle);
> > 	smp_mb__after_atomic_inc(); /* A */
> > 
> > o	rcu_sysidle_force_exit():
> > 
> > 	oldstate = ACCESS_ONCE(full_sysidle_state);
> > 
> > 2. RCU GP kthread:
> > 
> > o	rcu_sysidle():
> > 
> > 	cmpxchg(&full_sysidle_state, RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG);
> > 		/* B */
> > 
> > o	rcu_sysidle_check_cpu():
> > 
> > 	cur = atomic_read(&rdtp->dynticks_idle);
> > 
> > Memory barrier A pairs with memory barrier B, so that if #1's load
> > from full_sysidle_state sees RCU_SYSIDLE_SHORT, we know that #1's
> > atomic_inc() must be visible to #2's atomic_read().  This will cause #2
> > to recognize that the CPU came out of idle, which will in turn cause it
> > to invoke rcu_sysidle_cancel() instead of rcu_sysidle(), resulting in
> > full_sysidle_state being set to RCU_SYSIDLE_NOT.
> 
> Ok I get it for that direction.
> Now imagine CPU 0 is the RCU GP kthread (#2) and CPU 1 is idle and stays
> so.
> 
> CPU 0 then does its scans and sees that all CPUs are idle, until it finally sets
> full_sysidle_state to RCU_SYSIDLE_FULL and goes to sleep.
> 
> Then CPU 1 wakes up. It really has to see a value above RCU_SYSIDLE_SHORT
> otherwise it won't do the cmpxchg and see the FULL_NOTED that makes it send
> the IPI.
> 
> What provides the guarantee that CPU 1 sees a value above RCU_SYSIDLE_SHORT?
> Not at the cmpxchg, but when it first dereferences it with ACCESS_ONCE().

The trick is that CPU 0 will have scanned, moved to RCU_SYSIDLE_SHORT,
scanned, moved to RCU_SYSIDLE_LONG, then scanned again before moving
to RCU_SYSIDLE_FULL.  Given CPU 1 has been idle all this time, CPU 0
will have read its ->dynticks_idle counter on each scan and seen it
having an even value.  When CPU 1 comes out of idle, it will atomically
increment its ->dynticks_idle, which will happen after CPU 0's read of
->dynticks_idle during its last scan.

Because of the memory-barrier pairing above, this means that CPU
1's read from full_sysidle_state must follow the cmpxchg() that
set full_sysidle_state to RCU_SYSIDLE_LONG (though not necessarily
the two later cmpxchg()s that set it to RCU_SYSIDLE_FULL and
RCU_SYSIDLE_FULL_NOTED).  But because RCU_SYSIDLE_LONG is greater than
RCU_SYSIDLE_SHORT, CPU 1 will take action to end the idle period.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-19  0:24                   ` Paul E. McKenney
@ 2013-07-19  2:12                     ` Frederic Weisbecker
  2013-07-19  5:06                       ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2013-07-19  2:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Thu, Jul 18, 2013 at 05:24:08PM -0700, Paul E. McKenney wrote:
> On Fri, Jul 19, 2013 at 12:46:21AM +0200, Frederic Weisbecker wrote:
> > On Thu, Jul 18, 2013 at 09:47:49AM -0700, Paul E. McKenney wrote:
> > > 1. Some CPU coming out of idle:
> > > 
> > > o	rcu_sysidle_exit():
> > > 
> > > 	smp_mb__before_atomic_inc();
> > > 	atomic_inc(&rdtp->dynticks_idle);
> > > 	smp_mb__after_atomic_inc(); /* A */
> > > 
> > > o	rcu_sysidle_force_exit():
> > > 
> > > 	oldstate = ACCESS_ONCE(full_sysidle_state);
> > > 
> > > 2. RCU GP kthread:
> > > 
> > > o	rcu_sysidle():
> > > 
> > > 	cmpxchg(&full_sysidle_state, RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG);
> > > 		/* B */
> > > 
> > > o	rcu_sysidle_check_cpu():
> > > 
> > > 	cur = atomic_read(&rdtp->dynticks_idle);
> > > 
> > > Memory barrier A pairs with memory barrier B, so that if #1's load
> > > from full_sysidle_state sees RCU_SYSIDLE_SHORT, we know that #1's
> > > atomic_inc() must be visible to #2's atomic_read().  This will cause #2
> > > to recognize that the CPU came out of idle, which will in turn cause it
> > > to invoke rcu_sysidle_cancel() instead of rcu_sysidle(), resulting in
> > > full_sysidle_state being set to RCU_SYSIDLE_NOT.
> > 
> > Ok I get it for that direction.
> > Now imagine CPU 0 is the RCU GP kthread (#2) and CPU 1 is idle and stays
> > so.
> > 
> > CPU 0 then does its scans and sees that all CPUs are idle, until it finally sets
> > full_sysidle_state to RCU_SYSIDLE_FULL and goes to sleep.
> > 
> > Then CPU 1 wakes up. It really has to see a value above RCU_SYSIDLE_SHORT
> > otherwise it won't do the cmpxchg and see the FULL_NOTED that makes it send
> > the IPI.
> > 
> > What provides the guarantee that CPU 1 sees a value above RCU_SYSIDLE_SHORT?
> > Not at the cmpxchg, but when it first dereferences it with ACCESS_ONCE().
> 
> The trick is that CPU 0 will have scanned, moved to RCU_SYSIDLE_SHORT,
> scanned, moved to RCU_SYSIDLE_LONG, then scanned again before moving
> to RCU_SYSIDLE_FULL.  Given CPU 1 has been idle all this time, CPU 0
> will have read its ->dynticks_idle counter on each scan and seen it
> having an even value.  When CPU 1 comes out of idle, it will atomically
> increment its ->dynticks_idle, which will happen after CPU 0's read of
> ->dynticks_idle during its last scan.
> 
> Because of the memory-barrier pairing above, this means that CPU
> 1's read from full_sysidle_state must follow the cmpxchg() that
> set full_sysidle_state to RCU_SYSIDLE_LONG (though not necessarily
> the two later cmpxchg()s that set it to RCU_SYSIDLE_FULL and
> RCU_SYSIDLE_FULL_NOTED).  But because RCU_SYSIDLE_LONG is greater than
> RCU_SYSIDLE_SHORT, CPU 1 will take action to end the idle period.

Lets summarize the last sequence, the following happens ordered by time:

        CPU 0                          CPU 1

     cmpxchg(&full_sysidle_state,
             RCU_SYSIDLE_SHORT,
             RCU_SYSIDLE_LONG);

     smp_mb() //cmpxchg

     atomic_read(rdtp(1)->dynticks_idle)

     //CPU 0 goes to sleep
                                       //CPU 1 wakes up
                                       atomic_inc(rdtp(1)->dynticks_idle)

                                       smp_mb()

                                       ACCESS_ONCE(full_sysidle_state)


Are you suggesting that because the CPU 1 executes its atomic_inc() _after_ (in terms
of absolute time) the atomic_read of CPU 0, the ordering settled in both sides guarantees
that the value read from CPU 1 is the one from the cmpxchg that precedes the atomic_read,
or FULL or FULL_NOTED that happen later.

If so that's a big lesson for me.                                     

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-19  2:12                     ` Frederic Weisbecker
@ 2013-07-19  5:06                       ` Paul E. McKenney
  2013-07-24 18:09                         ` Frederic Weisbecker
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-19  5:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Fri, Jul 19, 2013 at 04:12:08AM +0200, Frederic Weisbecker wrote:
> On Thu, Jul 18, 2013 at 05:24:08PM -0700, Paul E. McKenney wrote:
> > On Fri, Jul 19, 2013 at 12:46:21AM +0200, Frederic Weisbecker wrote:
> > > On Thu, Jul 18, 2013 at 09:47:49AM -0700, Paul E. McKenney wrote:
> > > > 1. Some CPU coming out of idle:
> > > > 
> > > > o	rcu_sysidle_exit():
> > > > 
> > > > 	smp_mb__before_atomic_inc();
> > > > 	atomic_inc(&rdtp->dynticks_idle);
> > > > 	smp_mb__after_atomic_inc(); /* A */
> > > > 
> > > > o	rcu_sysidle_force_exit():
> > > > 
> > > > 	oldstate = ACCESS_ONCE(full_sysidle_state);
> > > > 
> > > > 2. RCU GP kthread:
> > > > 
> > > > o	rcu_sysidle():
> > > > 
> > > > 	cmpxchg(&full_sysidle_state, RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG);
> > > > 		/* B */
> > > > 
> > > > o	rcu_sysidle_check_cpu():
> > > > 
> > > > 	cur = atomic_read(&rdtp->dynticks_idle);
> > > > 
> > > > Memory barrier A pairs with memory barrier B, so that if #1's load
> > > > from full_sysidle_state sees RCU_SYSIDLE_SHORT, we know that #1's
> > > > atomic_inc() must be visible to #2's atomic_read().  This will cause #2
> > > > to recognize that the CPU came out of idle, which will in turn cause it
> > > > to invoke rcu_sysidle_cancel() instead of rcu_sysidle(), resulting in
> > > > full_sysidle_state being set to RCU_SYSIDLE_NOT.
> > > 
> > > Ok I get it for that direction.
> > > Now imagine CPU 0 is the RCU GP kthread (#2) and CPU 1 is idle and stays
> > > so.
> > > 
> > > CPU 0 then does its scans and sees that all CPUs are idle, until it finally sets
> > > full_sysidle_state to RCU_SYSIDLE_FULL and goes to sleep.
> > > 
> > > Then CPU 1 wakes up. It really has to see a value above RCU_SYSIDLE_SHORT
> > > otherwise it won't do the cmpxchg and see the FULL_NOTED that makes it send
> > > the IPI.
> > > 
> > > What provides the guarantee that CPU 1 sees a value above RCU_SYSIDLE_SHORT?
> > > Not at the cmpxchg, but when it first dereferences it with ACCESS_ONCE().
> > 
> > The trick is that CPU 0 will have scanned, moved to RCU_SYSIDLE_SHORT,
> > scanned, moved to RCU_SYSIDLE_LONG, then scanned again before moving
> > to RCU_SYSIDLE_FULL.  Given CPU 1 has been idle all this time, CPU 0
> > will have read its ->dynticks_idle counter on each scan and seen it
> > having an even value.  When CPU 1 comes out of idle, it will atomically
> > increment its ->dynticks_idle, which will happen after CPU 0's read of
> > ->dynticks_idle during its last scan.
> > 
> > Because of the memory-barrier pairing above, this means that CPU
> > 1's read from full_sysidle_state must follow the cmpxchg() that
> > set full_sysidle_state to RCU_SYSIDLE_LONG (though not necessarily
> > the two later cmpxchg()s that set it to RCU_SYSIDLE_FULL and
> > RCU_SYSIDLE_FULL_NOTED).  But because RCU_SYSIDLE_LONG is greater than
> > RCU_SYSIDLE_SHORT, CPU 1 will take action to end the idle period.
> 
> Lets summarize the last sequence, the following happens ordered by time:
> 
>         CPU 0                          CPU 1
> 
>      cmpxchg(&full_sysidle_state,
>              RCU_SYSIDLE_SHORT,
>              RCU_SYSIDLE_LONG);
> 
>      smp_mb() //cmpxchg
> 
>      atomic_read(rdtp(1)->dynticks_idle)
> 
>      //CPU 0 goes to sleep
>                                        //CPU 1 wakes up
>                                        atomic_inc(rdtp(1)->dynticks_idle)
> 
>                                        smp_mb()
> 
>                                        ACCESS_ONCE(full_sysidle_state)
> 
> 
> Are you suggesting that because the CPU 1 executes its atomic_inc() _after_ (in terms
> of absolute time) the atomic_read of CPU 0, the ordering settled in both sides guarantees
> that the value read from CPU 1 is the one from the cmpxchg that precedes the atomic_read,
> or FULL or FULL_NOTED that happen later.
> 
> If so that's a big lesson for me.                                     

It is not absolute time that matters.  Instead, it is the fact that
CPU 0, when reading from ->dynticks_idle, read the old value before the
atomic_inc().  Therefore, anything CPU 0 did before that memory barrier
preceding CPU 0's read must come before anything CPU 1 did after that
memory barrier following the atomic_inc().  For this to work, there
must be some access to the same variable on each CPU.

Or, if you must think in terms of time, you need a separate independent
timeline for each variable, with no direct mapping from one timeline to
another, except resulting from memory-barrier interactions.
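
For contrast, here is a sketch with no common variable at all (a, b, c and
d are made up purely for illustration):

    CPU 0                                 CPU 1
    -----                                 -----
    ACCESS_ONCE(a) = 1;                   ACCESS_ONCE(b) = 1;
    smp_mb();                             smp_mb();
    r0 = ACCESS_ONCE(c);                  r1 = ACCESS_ONCE(d);

Here nothing connects the two barriers, so nothing whatsoever can be
concluded about the relative ordering of the two CPUs' accesses.  The
barriers pair up only through conflicting accesses to the same variable,
which in our case is ->dynticks_idle.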

							Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-19  5:06                       ` Paul E. McKenney
@ 2013-07-24 18:09                         ` Frederic Weisbecker
  2013-07-24 22:09                           ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2013-07-24 18:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Thu, Jul 18, 2013 at 10:06:25PM -0700, Paul E. McKenney wrote:
> > Lets summarize the last sequence, the following happens ordered by time:
> > 
> >         CPU 0                          CPU 1
> > 
> >      cmpxchg(&full_sysidle_state,
> >              RCU_SYSIDLE_SHORT,
> >              RCU_SYSIDLE_LONG);
> > 
> >      smp_mb() //cmpxchg
> > 
> >      atomic_read(rdtp(1)->dynticks_idle)
> > 
> >      //CPU 0 goes to sleep
> >                                        //CPU 1 wakes up
> >                                        atomic_inc(rdtp(1)->dynticks_idle)
> > 
> >                                        smp_mb()
> > 
> >                                        ACCESS_ONCE(full_sysidle_state)
> > 
> > 
> > Are you suggesting that because the CPU 1 executes its atomic_inc() _after_ (in terms
> > of absolute time) the atomic_read of CPU 0, the ordering settled in both sides guarantees
> > that the value read from CPU 1 is the one from the cmpxchg that precedes the atomic_read,
> > or FULL or FULL_NOTED that happen later.
> > 
> > If so that's a big lesson for me.                                     
> 
> It is not absolute time that matters.  Instead, it is the fact that
> CPU 0, when reading from ->dynticks_idle, read the old value before the
> atomic_inc().  Therefore, anything CPU 0 did before that memory barrier
> preceding CPU 0's read must come before anything CPU 1 did after that
> memory barrier following the atomic_inc().  For this to work, there
> must be some access to the same variable on each CPU.

Aren't we in the following situation?

    CPU 0                          CPU 1

    STORE A                        STORE B
    LOAD B                         LOAD A


If so and referring to your perfbook, this is an "ears to mouth" situation.
And it seems to describe there is no strong guarantee in that situation.

> 
> Or, if you must think in terms of time, you need a separate independent
> timeline for each variable, with no direct mapping from one timeline to
> another, except resulting from memory-barrier interactions.
> 
> 							Thanx, Paul
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-24 18:09                         ` Frederic Weisbecker
@ 2013-07-24 22:09                           ` Paul E. McKenney
  2013-07-24 23:26                             ` Frederic Weisbecker
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-24 22:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Wed, Jul 24, 2013 at 08:09:04PM +0200, Frederic Weisbecker wrote:
> On Thu, Jul 18, 2013 at 10:06:25PM -0700, Paul E. McKenney wrote:
> > > Lets summarize the last sequence, the following happens ordered by time:
> > > 
> > >         CPU 0                          CPU 1
> > > 
> > >      cmpxchg(&full_sysidle_state,
> > >              RCU_SYSIDLE_SHORT,
> > >              RCU_SYSIDLE_LONG);
> > > 
> > >      smp_mb() //cmpxchg
> > > 
> > >      atomic_read(rdtp(1)->dynticks_idle)
> > > 
> > >      //CPU 0 goes to sleep
> > >                                        //CPU 1 wakes up
> > >                                        atomic_inc(rdtp(1)->dynticks_idle)
> > > 
> > >                                        smp_mb()
> > > 
> > >                                        ACCESS_ONCE(full_sysidle_state)
> > > 
> > > 
> > > Are you suggesting that because the CPU 1 executes its atomic_inc() _after_ (in terms
> > > of absolute time) the atomic_read of CPU 0, the ordering settled in both sides guarantees
> > > that the value read from CPU 1 is the one from the cmpxchg that precedes the atomic_read,
> > > or FULL or FULL_NOTED that happen later.
> > > 
> > > If so that's a big lesson for me.                                     
> > 
> > It is not absolute time that matters.  Instead, it is the fact that
> > CPU 0, when reading from ->dynticks_idle, read the old value before the
> > atomic_inc().  Therefore, anything CPU 0 did before that memory barrier
> > preceding CPU 0's read must come before anything CPU 1 did after that
> > memory barrier following the atomic_inc().  For this to work, there
> > must be some access to the same variable on each CPU.
> 
> Aren't we in the following situation?
> 
>     CPU 0                          CPU 1
> 
>     STORE A                        STORE B
>     LOAD B                         LOAD A
> 
> 
> If so and referring to your perfbook, this is an "ears to mouth" situation.
> And it seems to describe there is no strong guarantee in that situation.

"Yes" to the first, but on modern hardware, "no" to the second.  The key
paragraph is Section 12.2.4.5:

	The following pairings from Table 12.1 can be used on modern
	hardware, but might fail on some systems that were produced in
	the 1990s. However, these can safely be used on all mainstream
	hardware introduced since the year 2000.

That said, you are not the first to be confused by this, so I might need
to rework this section to make it clear that each can in fact be used on
modern hardware.
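
For concreteness, the pattern at hand is the classic store-buffering
sequence with full barriers on both sides (x and y are made up and both
initially zero):

    CPU 0                                 CPU 1
    -----                                 -----
    ACCESS_ONCE(x) = 1;                   ACCESS_ONCE(y) = 1;
    smp_mb();                             smp_mb();
    r0 = ACCESS_ONCE(y);                  r1 = ACCESS_ONCE(x);

With the smp_mb()s in place, the outcome (r0 == 0 && r1 == 0) is forbidden
on all the hardware that Linux currently supports:  at least one of the two
loads must see the other CPU's store, which is exactly the property the
sysidle state machine relies on.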

If you happen to have an old Sequent NUMA-Q or Symmetry box lying around,
things are a bit different.  On the other hand, I don't believe that any
of these old boxes are still running Linux.  (Hey, I am as sentimental as
the next guy, but there are limits!)

I updated this section and pushed it, please let me know if this helps!

							Thanx, Paul

> > Or, if you must think in terms of time, you need a separate independent
> > timeline for each variable, with no direct mapping from one timeline to
> > another, except resulting from memory-barrier interactions.
> > 
> > 							Thanx, Paul
> > 
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-24 22:09                           ` Paul E. McKenney
@ 2013-07-24 23:26                             ` Frederic Weisbecker
  2013-07-26 22:52                               ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2013-07-24 23:26 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Wed, Jul 24, 2013 at 03:09:02PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 24, 2013 at 08:09:04PM +0200, Frederic Weisbecker wrote:
> > On Thu, Jul 18, 2013 at 10:06:25PM -0700, Paul E. McKenney wrote:
> > > > Lets summarize the last sequence, the following happens ordered by time:
> > > > 
> > > >         CPU 0                          CPU 1
> > > > 
> > > >      cmpxchg(&full_sysidle_state,
> > > >              RCU_SYSIDLE_SHORT,
> > > >              RCU_SYSIDLE_LONG);
> > > > 
> > > >      smp_mb() //cmpxchg
> > > > 
> > > >      atomic_read(rdtp(1)->dynticks_idle)
> > > > 
> > > >      //CPU 0 goes to sleep
> > > >                                        //CPU 1 wakes up
> > > >                                        atomic_inc(rdtp(1)->dynticks_idle)
> > > > 
> > > >                                        smp_mb()
> > > > 
> > > >                                        ACCESS_ONCE(full_sysidle_state)
> > > > 
> > > > 
> > > > Are you suggesting that because the CPU 1 executes its atomic_inc() _after_ (in terms
> > > > of absolute time) the atomic_read of CPU 0, the ordering settled in both sides guarantees
> > > > that the value read from CPU 1 is the one from the cmpxchg that precedes the atomic_read,
> > > > or FULL or FULL_NOTED that happen later.
> > > > 
> > > > If so that's a big lesson for me.                                     
> > > 
> > > It is not absolute time that matters.  Instead, it is the fact that
> > > CPU 0, when reading from ->dynticks_idle, read the old value before the
> > > atomic_inc().  Therefore, anything CPU 0 did before that memory barrier
> > > preceding CPU 0's read must come before anything CPU 1 did after that
> > > memory barrier following the atomic_inc().  For this to work, there
> > > must be some access to the same variable on each CPU.
> > 
> > Aren't we in the following situation?
> > 
> >     CPU 0                          CPU 1
> > 
> >     STORE A                        STORE B
> >     LOAD B                         LOAD A
> > 
> > 
> > If so and referring to your perfbook, this is an "ears to mouth" situation.
> > And it seems to describe there is no strong guarantee in that situation.
> 
> "Yes" to the first, but on modern hardware, "no" to the second.  The key
> paragraph is Section 12.2.4.5:
> 
> 	The following pairings from Table 12.1 can be used on modern
> 	hardware, but might fail on some systems that were produced in
> 	the 1990s. However, these can safely be used on all mainstream
> 	hardware introduced since the year 2000.

Right I missed that!

> 
> That said, you are not the first to be confused by this, so I might need
> to rework this section to make it clear that each can in fact be used on
> modern hardware.
> 
> If you happen to have an old Sequent NUMA-Q or Symmetry box lying around,
> things are a bit different.  On the other hand, I don't believe that any
> of these old boxes are still running Linux.  (Hey, I am as sentimental as
> the next guy, but there are limits!)
> 
> I updated this section and pushed it, please let me know if this helps!

I don't know, because I encountered some trouble building it; I'm seeing thousands of
lines like this:

Name "main::opt_help" used only once: possible typo at /usr/bin/a2ping line 534.
/usr/bin/a2ping: not a GS output from gs -dSAFER
./cartoons/whippersnapper300.eps -> ./cartoons/whippersnapper300.pdf
Name "main::opt_extra" used only once: possible typo at /usr/bin/a2ping line 546.
Name "main::opt_help" used only once: possible typo at /usr/bin/a2ping line 534.
/usr/bin/a2ping: not a GS output from gs -dSAFER
make: *** [embedfonts] Error 1

Anyway I looked at the diff and it looks indeed clearer, thanks!

So back to the issue, I think we made nice progress with my rusty brain ;-)
But just to be clear, I'm pasting that again for just a few clarifications:

        CPU 0                                CPU 1

       cmpxchg(&full_sysidle_state,          //CPU 1 wakes up
                RCU_SYSIDLE_SHORT,           atomic_inc(rdtp(1)->dynticks_idle)
                RCU_SYSIDLE_LONG);

       smp_mb() //cmpxchg                    smp_mb()
       atomic_read(rdtp(1)->dynticks_idle)   ACCESS_ONCE(full_sysidle_state)
      //CPU 0 goes to sleep



1) If CPU 0 sets RCU_SYSIDLE_LONG and sees dynticks_idle as even, do we have the _guarantee_
that later CPU 1 sees full_sysidle_state == RCU_SYSIDLE_LONG (or any later full_sysidle_state value)
due to the connection between atomic_read / atomic_inc and the barriers that come along?

2) You taught me once that barrier != memory committed, and it has been one of the hardest traumas
of my life. How can we be sure that CPU 1 sees the memory committed from CPU 0? Is the mere fact that
CPU 0 read an even value enough to establish the connection between the atomic_read() and atomic_inc()
and all the barriers that come along?

3) In your book it says: "recent hardware would guarantee that at least one of the loads saw the value
stored by the corresponding store".

At least one? So in our example, CPU 0 could see dynticks_idle as even (successfully seeing some prior store done by
CPU 1) but, following the reasoning in the above statement, CPU 1 might not see the corresponding store and might see,
for example, RCU_SYSIDLE_SHORT?

I'm really sorry to bother you with that... :-(

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-24 23:26                             ` Frederic Weisbecker
@ 2013-07-26 22:52                               ` Paul E. McKenney
  2013-07-27 18:13                                 ` Frederic Weisbecker
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-26 22:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Thu, Jul 25, 2013 at 01:26:44AM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 24, 2013 at 03:09:02PM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 24, 2013 at 08:09:04PM +0200, Frederic Weisbecker wrote:
> > > On Thu, Jul 18, 2013 at 10:06:25PM -0700, Paul E. McKenney wrote:
> > > > > Lets summarize the last sequence, the following happens ordered by time:
> > > > > 
> > > > >         CPU 0                          CPU 1
> > > > > 
> > > > >      cmpxchg(&full_sysidle_state,
> > > > >              RCU_SYSIDLE_SHORT,
> > > > >              RCU_SYSIDLE_LONG);
> > > > > 
> > > > >      smp_mb() //cmpxchg
> > > > > 
> > > > >      atomic_read(rdtp(1)->dynticks_idle)
> > > > > 
> > > > >      //CPU 0 goes to sleep
> > > > >                                        //CPU 1 wakes up
> > > > >                                        atomic_inc(rdtp(1)->dynticks_idle)
> > > > > 
> > > > >                                        smp_mb()
> > > > > 
> > > > >                                        ACCESS_ONCE(full_sysidle_state)
> > > > > 
> > > > > 
> > > > > Are you suggesting that because the CPU 1 executes its atomic_inc() _after_ (in terms
> > > > > of absolute time) the atomic_read of CPU 0, the ordering settled in both sides guarantees
> > > > > that the value read from CPU 1 is the one from the cmpxchg that precedes the atomic_read,
> > > > > or FULL or FULL_NOTED that happen later.
> > > > > 
> > > > > If so that's a big lesson for me.                                     
> > > > 
> > > > It is not absolute time that matters.  Instead, it is the fact that
> > > > CPU 0, when reading from ->dynticks_idle, read the old value before the
> > > > atomic_inc().  Therefore, anything CPU 0 did before that memory barrier
> > > > preceding CPU 0's read must come before anything CPU 1 did after that
> > > > memory barrier following the atomic_inc().  For this to work, there
> > > > must be some access to the same variable on each CPU.
> > > 
> > > Aren't we in the following situation?
> > > 
> > >     CPU 0                          CPU 1
> > > 
> > >     STORE A                        STORE B
> > >     LOAD B                         LOAD A
> > > 
> > > 
> > > If so and referring to your perfbook, this is an "ears to mouth" situation.
> > > And it seems to describe there is no strong guarantee in that situation.
> > 
> > "Yes" to the first, but on modern hardware, "no" to the second.  The key
> > paragraph is Section 12.2.4.5:
> > 
> > 	The following pairings from Table 12.1 can be used on modern
> > 	hardware, but might fail on some systems that were produced in
> > 	the 1990s. However, these can safely be used on all mainstream
> > 	hardware introduced since the year 2000.
> 
> Right I missed that!

Nor are you alone.  ;-)

> > That said, you are not the first to be confused by this, so I might need
> > to rework this section to make it clear that each can in fact be used on
> > modern hardware.
> > 
> > If you happen to have an old Sequent NUMA-Q or Symmetry box lying around,
> > things are a bit different.  On the other hand, I don't believe that any
> > of these old boxes are still running Linux.  (Hey, I am as sentimental as
> > the next guy, but there are limits!)
> > 
> > I updated this section and pushed it, please let me know if this helps!
> 
> I don't know because I ran into some trouble building it; I'm seeing thousands of
> lines like this:
> 
> Name "main::opt_help" used only once: possible typo at /usr/bin/a2ping line 534.
> /usr/bin/a2ping: not a GS output from gs -dSAFER
> ./cartoons/whippersnapper300.eps -> ./cartoons/whippersnapper300.pdf
> Name "main::opt_extra" used only once: possible typo at /usr/bin/a2ping line 546.
> Name "main::opt_help" used only once: possible typo at /usr/bin/a2ping line 534.
> /usr/bin/a2ping: not a GS output from gs -dSAFER
> make: *** [embedfonts] Error 1

Strange.  My version of a2ping is Ubuntu 12.04's:

a2ping.pl 2.77p, 2006-11-15 -- Written by <pts@fazekas.hu> from April 2003.

You have something more recent?

> Anyway I looked at the diff and it looks indeed clearer, thanks!

Glad it helped!

> So back to the issue, I think we made nice progress with my rusty brain ;-)
> But just to be clear, I'm pasting that again for just a few clarifications:
> 
>         CPU 0                                CPU 1
> 
>        cmpxchg(&full_sysidle_state,          //CPU 1 wakes up
>                 RCU_SYSIDLE_SHORT,           atomic_inc(rdtp(1)->dynticks_idle)
>                 RCU_SYSIDLE_LONG);
> 
>        smp_mb() //cmpxchg                    smp_mb()
>        atomic_read(rdtp(1)->dynticks_idle)   ACCESS_ONCE(full_sysidle_state)
>       //CPU 0 goes to sleep
> 
> 
> 
> 1) If CPU 0 sets RCU_SYSIDLE_LONG and sees dynticks_idle as even, do we have the _guarantee_
> that later CPU 1 sees full_sysidle_state == RCU_SYSIDLE_LONG (or any later full_sysidle_state value)
> due to the connection between atomic_read / atomic_inc and the barriers that come along?

No, we do not have this guarantee.

What happens instead is that CPU 0 later sets RCU_SYSIDLE_FULL after
having again seen CPU 1's ->dynticks_idle having an even (idle) value.
If CPU 1 later increments its ->dynticks_idle to an odd (non-idle) value,
then does a memory barrier, and then reads full_sysidle_state, then CPU
1 is guaranteed to see RCU_SYSIDLE_LONG.  Please note that CPU 1 is -not-
obligated to see RCU_SYSIDLE_FULL, much less RCU_SYSIDLE_NOTED.

However, when CPU 1 does an atomic operation on full_sysidle_state that
returns the old value, CPU 1 is guaranteed to get the most recent state.
(Without this guarantee, we could have a cactus state of values for
full_sysidle_state, which might make alternate-universe fans happy,
but which would be really hard to program.)

This is why we need an RCU_SYSIDLE_LONG state in addition to
RCU_SYSIDLE_SHORT and RCU_SYSIDLE_FULL.

> 2) You taught me once that barrier != memory committed, and it has been one of the hardest traumas
> of my life. How can we be sure that CPU 1 sees the memory committed by CPU 0? Is the mere fact that
> CPU 0 read an even value enough, thanks to the connection between the atomic_read() and the atomic_inc()
> and all the barriers that come along?

My condolences on the trauma!  Now, if you will look directly into the
light...  ;-)

We cannot be sure.  All we can be sure of is the memory-barrier
relationships and coherence order of each individual memory location.

The basic memory-barrier relationships are of the following form:

	CPU 0		CPU 1
	ref A		ref C
	smp_mb()	smp_mb()
	ref B		ref D

If the results of these references show that ref C saw the results of
ref B, then you know that ref D is guaranteed to see the results of
ref A.  There are no guarantees on time delay.  At least there are no
guarantees on time delay that the various hardware vendors are willing
to tell us about.
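
If it helps to see that pairing spelled out, here is a minimal userspace sketch
of the store->load flavor of the pattern (refs A and B as stores, refs C and D
as loads).  The names are made up for the example and smp_mb() is modeled by a
sequentially consistent fence, so treat it as an illustration rather than
kernel code:

	#include <stdatomic.h>

	static atomic_int data;		/* written by ref A, read by ref D */
	static atomic_int flag;		/* written by ref B, read by ref C */

	void cpu0(void)
	{
		atomic_store_explicit(&data, 1, memory_order_relaxed);	   /* ref A */
		atomic_thread_fence(memory_order_seq_cst);		   /* smp_mb() */
		atomic_store_explicit(&flag, 1, memory_order_relaxed);	   /* ref B */
	}

	void cpu1(int *r_c, int *r_d)
	{
		*r_c = atomic_load_explicit(&flag, memory_order_relaxed); /* ref C */
		atomic_thread_fence(memory_order_seq_cst);		   /* smp_mb() */
		*r_d = atomic_load_explicit(&data, memory_order_relaxed); /* ref D */
	}

If cpu1() sees flag==1 (ref C saw ref B), it is guaranteed to also see data==1
(ref D sees ref A).  If flag reads as 0, nothing at all is implied about data.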

For the coherence order of each individual memory location, let's focus
on full_sysidle_state.  Suppose that CPU 0 uses cmpxchg() to advance
it from RCU_SYSIDLE_NOT to RCU_SYSIDLE_SHORT to RCU_SYSIDLE_LONG to
RCU_SYSIDLE_FULL and finally to RCU_SYSIDLE_NOTED.  When CPU 1 uses
a normal load to access full_sysidle_state, it is only guaranteed to
see RCU_SYSIDLE_LONG, because RCU_SYSIDLE_LONG is the value that
CPU 0 last wrote prior to an access of CPU 1's ->dynticks_idle counter.

However, CPU 1 then uses a cmpxchg() loop to change the value of
full_sysidle_state back to RCU_SYSIDLE_NOT in rcu_sysidle_force_exit().
At this point, CPU 1 cannot be allowed to see an old value of
full_sysidle_state, because this would contradict the cmpxchg() return
values that CPU 0 saw in rcu_sysidle().  Therefore, the cmpxchg()
in rcu_sysidle_force_exit() is obligated to return the most recent
value.
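
If it helps, here is a rough userspace sketch of that kind of cmpxchg() loop.
The function name, the enum, and the use of C11 atomics are all just for the
illustration -- this is the shape of the thing, not the patch itself:

	#include <stdatomic.h>

	enum { RCU_SYSIDLE_NOT, RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG,
	       RCU_SYSIDLE_FULL, RCU_SYSIDLE_NOTED };

	static atomic_int full_sysidle_state;

	/* Drive the state back to RCU_SYSIDLE_NOT, as a forced exit would. */
	static int sysidle_force_exit_sketch(void)
	{
		int old = atomic_load(&full_sysidle_state);

		while (old > RCU_SYSIDLE_SHORT) {
			/* On failure, "old" is refreshed with the current value. */
			if (atomic_compare_exchange_strong(&full_sysidle_state,
							   &old, RCU_SYSIDLE_NOT))
				break;
		}
		return old;	/* the state we displaced, or the non-idle state we saw */
	}

Because the compare-and-exchange is an atomic read-modify-write, it must operate
on the latest value in full_sysidle_state's modification order, so the loop
cannot keep spinning on a stale copy the way a plain load could keep returning
one.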

How does the hardware do this?  Well, for loads, the hardware can look
just in its own cache, which might well contain an outdated value.
In contrast, for cmpxchg(), the CPU must invalidate full_sysidle_state
from all other caches, which ensures that the most recent value becomes
available.

So one way to guarantee most-recent results is to always
use atomic instructions to access the variables, for example,
atomic_add_return(0, &full_sysidle_state) instead of a simple load instruction.
Might be some performance degradation, but you know what they say about
free lunches.  ;-)
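
Concretely, something like the following userspace sketch, with
atomic_fetch_add() standing in for atomic_add_return() and the variable
declared as a C11 atomic purely for the example:

	#include <stdatomic.h>

	static atomic_int full_sysidle_state;

	/*
	 * Read the state via an atomic RMW.  Adding 0 changes nothing, but an
	 * atomic read-modify-write must read the latest value in the variable's
	 * modification order, whereas a plain load may legitimately return an
	 * older cached value.
	 */
	static int sysidle_state_read_latest(void)
	{
		return atomic_fetch_add(&full_sysidle_state, 0);
	}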

> 3) In your book it says: "recent hardware would guarantee that at least one of the loads saw the value
> stored by the corresponding store".
> 
> At least one? So in our example, CPU 0 could see dynticks_idle as even (successfully seeing some prior store done by
> CPU 1) but, following the reasoning of the above statement, CPU 1 might not see the corresponding store and might see,
> for example, RCU_SYSIDLE_SHORT?

Ah, yes, the kernel of Dekker's algorithm.  ;-)

	CPU 0		CPU 1

	A=1		B=1
	smp_mb()	smp_mb()
	r1=B		r2=A

The initial values of both A and B are zero.

And yes, this does have the same form as the manipulations of
full_sysidle_state and ->dynticks_idle.

Let's suppose that neither CPU saw the results of either CPU's store.
Then after both CPUs finished, we would have r1==0&&r2==0.  Let's start
with the r1==0.

The r1==0 means that CPU 0 loaded from B before CPU 1 stored to B:
After all, otherwise, we would have r1==1.  But because of memory-barrier
pairing, this means that CPU 1's load from A must see the results of
CPU 0's store to A, and therefore r2==1.

The procedure starting with r2==0 follows the same line of reasoning.
Therefore, r1==0&&r2==0 cannot happen, which implies that at least
one of the CPU's loads must see the other CPU's store.
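
If you want to poke at this one, here is a small userspace harness for exactly
that kernel (a sketch only: smp_mb() is modeled by a sequentially consistent
fence, and of course not hitting the bad outcome in a finite number of runs
proves nothing -- the real argument is the pairing one above):

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int A, B;
	static int r1, r2;
	static pthread_barrier_t start;

	static void *cpu0(void *arg)
	{
		(void)arg;
		pthread_barrier_wait(&start);
		atomic_store_explicit(&A, 1, memory_order_relaxed);
		atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
		r1 = atomic_load_explicit(&B, memory_order_relaxed);
		return NULL;
	}

	static void *cpu1(void *arg)
	{
		(void)arg;
		pthread_barrier_wait(&start);
		atomic_store_explicit(&B, 1, memory_order_relaxed);
		atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
		r2 = atomic_load_explicit(&A, memory_order_relaxed);
		return NULL;
	}

	int main(void)
	{
		for (int i = 0; i < 100000; i++) {
			pthread_t t0, t1;

			atomic_store(&A, 0);
			atomic_store(&B, 0);
			pthread_barrier_init(&start, NULL, 2);
			pthread_create(&t0, NULL, cpu0, NULL);
			pthread_create(&t1, NULL, cpu1, NULL);
			pthread_join(t0, NULL);
			pthread_join(t1, NULL);
			pthread_barrier_destroy(&start);
			if (r1 == 0 && r2 == 0) {
				printf("forbidden outcome at iteration %d\n", i);
				return 1;
			}
		}
		printf("r1==0 && r2==0 never observed\n");
		return 0;
	}

Build with "gcc -O2 -pthread".  Dropping the two fences usually lets the
forbidden outcome show up fairly quickly, on x86 for example.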

> I'm really sorry to bother you with that... :-(

Absolutely not a problem!  In fact, the above discussion is now yet
another infamous Quick Quiz in perfbook.  ;-)

							Thanx, Paul



* Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine
  2013-07-26 22:52                               ` Paul E. McKenney
@ 2013-07-27 18:13                                 ` Frederic Weisbecker
  0 siblings, 0 replies; 30+ messages in thread
From: Frederic Weisbecker @ 2013-07-27 18:13 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Fri, Jul 26, 2013 at 03:52:12PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 25, 2013 at 01:26:44AM +0200, Frederic Weisbecker wrote:
> > I don't know because I ran into some trouble building it; I'm seeing thousands of
> > lines like this:
> > 
> > Name "main::opt_help" used only once: possible typo at /usr/bin/a2ping line 534.
> > /usr/bin/a2ping: not a GS output from gs -dSAFER
> > ./cartoons/whippersnapper300.eps -> ./cartoons/whippersnapper300.pdf
> > Name "main::opt_extra" used only once: possible typo at /usr/bin/a2ping line 546.
> > Name "main::opt_help" used only once: possible typo at /usr/bin/a2ping line 534.
> > /usr/bin/a2ping: not a GS output from gs -dSAFER
> > make: *** [embedfonts] Error 1
> 
> Strange.  My version of a2ping is Ubuntu 12.04's:
> 
> a2ping.pl 2.77p, 2006-11-15 -- Written by <pts@fazekas.hu> from April 2003.
> 
> You have something more recent?

I run Fedora 15 (yeah, too lazy to upgrade), and:

a2ping.pl 2.77p, 2004-04-28 -- Written by <pts@fazekas.hu> from April 2003

> > So back to the issue, I think we made nice progress with my rusty brain ;-)
> > But just to be clear, I'm pasting that again for just a few clarifications:
> > 
> >         CPU 0                                CPU 1
> > 
> >        cmpxchg(&full_sysidle_state,          //CPU 1 wakes up
> >                 RCU_SYSIDLE_SHORT,           atomic_inc(rdtp(1)->dynticks_idle)
> >                 RCU_SYSIDLE_LONG);
> > 
> >        smp_mb() //cmpxchg                    smp_mb()
> >        atomic_read(rdtp(1)->dynticks_idle)   ACCESS_ONCE(full_sysidle_state)
> >       //CPU 0 goes to sleep
> > 
> > 
> > 
> > 1) If CPU 0 sets RCU_SYSIDLE_LONG and sees dynticks_idle as even, do we have the _guarantee_
> > that later CPU 1 sees full_sysidle_state == RCU_SYSIDLE_LONG (or any later full_sysidle_state value)
> > due to the connection between atomic_read / atomic_inc and the barriers that come along?
> 
> No, we do not have this guarantee.
> 
> What happens instead is that CPU 0 later sets RCU_SYSIDLE_FULL after
> having again seen CPU 1's ->dynticks_idle having an even (idle) value.
> If CPU 1 later increments its ->dynticks_idle to an odd (non-idle) value,
> then does a memory barrier, and then reads full_sysidle_state, then CPU
> 1 is guaranteed to see RCU_SYSIDLE_LONG.  Please note that CPU 1 is -not-
> obligated to see RCU_SYSIDLE_FULL, much less RCU_SYSIDLE_NOTED.

Aah. I think I got it now. So let's lay this out as follows:

  void cpu_0(int old, int new)
  {
       cmpxchg(&full_sysidle_state, old, new);
       smp_mb();
       atomic_read(rdtp(1)->dynticks_idle);
  }

  void cpu_1(void)
  {
       atomic_inc(rdtp(1)->dynticks_idle);
       smp_mb();
       ACCESS_ONCE(full_sysidle_state);
  }


And this scenario:

  CPU 0                                              CPU 1

  cpu_0(RCU_SYSIDLE_NOT, RCU_SYSIDLE_SHORT)
  cpu_0(RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG)
                                                     cpu_1() // guaranteed to see at least RCU_SYSIDLE_SHORT
  cpu_0(RCU_SYSIDLE_LONG, RCU_SYSIDLE_FULL)
                                                     cpu_1() // guaranteed to see at least RCU_SYSIDLE_LONG

IIUC, the double-slash comments should be true.
If so, I observe that even though we don't have memory-commit guarantees, there still seems to be some
forward-progress guarantee across the update and read sequences.

What usually guarantees such forward progress across the sequences is having
some locking, or atomic updates that also return a value (inc_return, cmpxchg, ...),
mirrored on both sides.

I can't find anything in atomic_ops.txt or in your book (ok, sorry, I have only read a few pages
of the barriers chapter :o) that describes such a forward-progress guarantee.

Would such a guarantee hold for a generic sequence like this?

//A = B = 0

 CPU 0                CPU 1

 store A = 1          store B = 1
 mb()                 mb()
 read B               read A
 ----------------------------
 store A = 2          store B = 2
 mb()                 mb()
 read B               read A

For the second sequence, can I deduce that the values read by CPU 0 and CPU 1 (B and A respectively)
are at least 1 and might be 2?

> 
> However, when CPU 1 does an atomic operation on full_sysidle_state that
> returns the old value, CPU 1 is guaranteed to get the most recent state.
> (Without this guarantee, we could have a cactus state of values for
> full_sysidle_state, which might make alternate-universe fans happy,
> but which would be really hard to program.)

Haha! Yeah for the cmpxchg I'm ok.

> 
> This is why we need an RCU_SYSIDLE_LONG state in addition to
> RCU_SYSIDLE_SHORT and RCU_SYSIDLE_FULL.
> 
> > 2) You taught me once that barrier != memory committed, and it has been one of the hardest traumas
> > of my life. How can we be sure that CPU 1 sees the memory committed by CPU 0? Is the mere fact that
> > CPU 0 read an even value enough, thanks to the connection between the atomic_read() and the atomic_inc()
> > and all the barriers that come along?
> 
> My condolences on the trauma!  Now, if you will look directly into the
> light...  ;-)
> 
> We cannot be sure.  All we can be sure of is the memory-barrier
> relationships and coherence order of each individual memory location.
> 
> The basic memory-barrier relationships are of the following form:
> 
> 	CPU 0		CPU 1
> 	ref A		ref C
> 	smp_mb()	smp_mb()
> 	ref B		ref D
> 
> If the results of these references show that ref C saw the results of
> ref B, then you know that ref D is guaranteed to see the results of
> ref A.  There are no guarantees on time delay.  At least there are no
> guarantees on time delay that the various hardware vendors are willing
> to tell us about.
> 
> For the coherence order of each individual memory location, let's focus
> on full_sysidle_state.  Suppose that CPU 0 uses cmpxchg() to advance
> it from RCU_SYSIDLE_NOT to RCU_SYSIDLE_SHORT to RCU_SYSIDLE_LONG to
> RCU_SYSIDLE_FULL and finally to RCU_SYSIDLE_NOTED.  When CPU 1 uses
> a normal load to access full_sysidle_state, it is only guaranteed to
> see RCU_SYSIDLE_LONG, because RCU_SYSIDLE_LONG is the value that
> CPU 0 last wrote prior to an access of CPU 1's ->dynticks_idle counter.

Yeah I see.

> 
> However, CPU 1 then uses a cmpxchg() loop to change the value of
> full_sysidle_state back to RCU_SYSIDLE_NOT in rcu_sysidle_force_exit().
> At this point, CPU 1 cannot be allowed to see an old value of
> full_sysidle_state, because this would contradict the cmpxchg() return
> values that CPU 0 saw in rcu_sysidle().  Therefore, the cmpxchg()
> in rcu_sysidle_force_exit() is obligated to return the most recent
> value.

Right.

> How does the hardware do this?  Well, for loads, the hardware can look
> just in its own cache, which might well contain an outdated value.
> In contrast, for cmpxchg(), the CPU must invalidate full_sysidle_state
> from all other caches, which ensures that the most recent value becomes
> available.

Doing a cmpxchg invalidates the copies in the cachelines of all the other
CPUs?

> 
> So one way to guarantee most-recent results is to always
> use atomic instructions to access the variables, for example,
> atomic_add_return(0, &full_sysidle_state) instead of a simple load instruction.
> Might be some performance degradation, but you know what they say about
> free lunches.  ;-)

Right :)

> 
> > 3) In your book it says: "recent hardware would guarantee that at least one of the loads saw the value
> > stored by the corresponding store".
> > 
> > At least one? So in our example, CPU 0 could see dynticks_idle as even (successfully seeing some prior store done by
> > CPU 1) but, following the reasoning of the above statement, CPU 1 might not see the corresponding store and might see,
> > for example, RCU_SYSIDLE_SHORT?
> 
> Ah, yes, the kernel of Dekker's algorithm.  ;-)
> 
> 	CPU 0		CPU 1
> 
> 	A=1		B=1
> 	smp_mb()	smp_mb()
> 	r1=B		r2=A
> 
> The initial values of both A and B are zero.
> 
> And yes, this does have the same form as the manipulations of
> full_sysidle_state and ->dynticks_idle.
> 
> Let's suppose that neither CPU saw the results of either CPU's store.
> Then after both CPUs finished, we would have r1==0&&r2==0.  Let's start
> with the r1==0.
> 
> The r1==0 means that CPU 0 loaded from B before CPU 1 stored to B:
> After all, otherwise, we would have r1==1.  But because of memory-barrier
> pairing, this means that CPU 1's load from A must see the results of
> CPU 0's store to A, and therefore r2==1.
> 
> The procedure starting with r2==0 follows the same line of reasoning.
> Therefore, r1==0&&r2==0 cannot happen, which implies that at least
> one of the CPU's loads must see the other CPU's store.

Hmm, ok I'll read that again tomorrow with a fresher brain :)

> 
> > I'm really sorry to bother you with that... :-(
> 
> Absolutely not a problem!  In fact, the above discussion is now yet
> another infamous Quick Quiz in perfbook.  ;-)
> 
> 							Thanx, Paul
> 


* Re: [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state
  2013-08-05  1:04   ` Frederic Weisbecker
@ 2013-08-17 23:38     ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2013-08-17 23:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Mon, Aug 05, 2013 at 03:04:55AM +0200, Frederic Weisbecker wrote:
> On Fri, Jul 26, 2013 at 04:19:18PM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > At least one CPU must keep the scheduling-clock tick running for
> > timekeeping purposes whenever there is a non-idle CPU.  However, with
> > the new nohz_full adaptive-idle machinery, it is difficult to distinguish
> > between all CPUs really being idle as opposed to all non-idle CPUs being
> > in adaptive-ticks mode.  This commit therefore adds a Kconfig parameter
> > as a first step towards enabling a scalable detection of full-system
> > idle state.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > ---
> >  kernel/time/Kconfig | 23 +++++++++++++++++++++++
> >  1 file changed, 23 insertions(+)
> > 
> > diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> > index 70f27e8..a613c2a 100644
> > --- a/kernel/time/Kconfig
> > +++ b/kernel/time/Kconfig
> > @@ -134,6 +134,29 @@ config NO_HZ_FULL_ALL
> >  	 Note the boot CPU will still be kept outside the range to
> >  	 handle the timekeeping duty.
> >  
> > +config NO_HZ_FULL_SYSIDLE
> > +	bool "Detect full-system idle state for full dynticks system"
> > +	depends on NO_HZ_FULL
> > +	default n
> > +	help
> > +	 At least one CPU must keep the scheduling-clock tick running
> > +	 for timekeeping purposes whenever there is a non-idle CPU,
> > +	 where "non-idle" includes CPUs with a single runnable task
> > +	 in adaptive-idle mode.
> 
> "adaptive-idle" is particularly confusing here. How about this:
> 
    'where "non-idle" also includes dynticks CPUs as long as they are
    running non-idle tasks.'
> 
>           Because the underlying adaptive-tick
> > +	 support cannot distinguish between all CPUs being idle and
> > +	 all CPUs each running a single task in adaptive-idle mode,
> 
> s/adaptive-idle/dynticks
> 
> Thanks.

Good point, fixed.

							Thanx, Paul

> > +	 the underlying support simply ensures that there is always
> > +	 a CPU handling the scheduling-clock tick, whether or not all
> > +	 CPUs are idle.  This Kconfig option enables scalable detection
> > +	 of the all-CPUs-idle state, thus allowing the scheduling-clock
> > +	 tick to be disabled when all CPUs are idle.  Note that scalable
> > +	 detection of the all-CPUs-idle state means that larger systems
> > +	 will be slower to declare the all-CPUs-idle state.
> > +
> > +	 Say Y if you would like to help debug all-CPUs-idle detection.
> > +
> > +	 Say N if you are unsure.
> > +
> >  config NO_HZ
> >  	bool "Old Idle dynticks config"
> >  	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
> > -- 
> > 1.8.1.5
> > 
> 



* Re: [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state
  2013-07-26 23:19 ` [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
  2013-07-29  3:35   ` Lai Jiangshan
@ 2013-08-05  1:04   ` Frederic Weisbecker
  2013-08-17 23:38     ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2013-08-05  1:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Fri, Jul 26, 2013 at 04:19:18PM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> At least one CPU must keep the scheduling-clock tick running for
> timekeeping purposes whenever there is a non-idle CPU.  However, with
> the new nohz_full adaptive-idle machinery, it is difficult to distinguish
> between all CPUs really being idle as opposed to all non-idle CPUs being
> in adaptive-ticks mode.  This commit therefore adds a Kconfig parameter
> as a first step towards enabling a scalable detection of full-system
> idle state.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/time/Kconfig | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> index 70f27e8..a613c2a 100644
> --- a/kernel/time/Kconfig
> +++ b/kernel/time/Kconfig
> @@ -134,6 +134,29 @@ config NO_HZ_FULL_ALL
>  	 Note the boot CPU will still be kept outside the range to
>  	 handle the timekeeping duty.
>  
> +config NO_HZ_FULL_SYSIDLE
> +	bool "Detect full-system idle state for full dynticks system"
> +	depends on NO_HZ_FULL
> +	default n
> +	help
> +	 At least one CPU must keep the scheduling-clock tick running
> +	 for timekeeping purposes whenever there is a non-idle CPU,
> +	 where "non-idle" includes CPUs with a single runnable task
> +	 in adaptive-idle mode.

"adaptive-idle" is particularly confusing here. How about this:

    'where "non-idle" also includes dynticks CPUs as long as they are
    running non-idle tasks.'

          Because the underlying adaptive-tick
> +	 support cannot distinguish between all CPUs being idle and
> +	 all CPUs each running a single task in adaptive-idle mode,

s/adaptive-idle/dynticks

Thanks.

> +	 the underlying support simply ensures that there is always
> +	 a CPU handling the scheduling-clock tick, whether or not all
> +	 CPUs are idle.  This Kconfig option enables scalable detection
> +	 of the all-CPUs-idle state, thus allowing the scheduling-clock
> +	 tick to be disabled when all CPUs are idle.  Note that scalable
> +	 detection of the all-CPUs-idle state means that larger systems
> +	 will be slower to declare the all-CPUs-idle state.
> +
> +	 Say Y if you would like to help debug all-CPUs-idle detection.
> +
> +	 Say N if you are unsure.
> +
>  config NO_HZ
>  	bool "Old Idle dynticks config"
>  	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
> -- 
> 1.8.1.5
> 


* Re: [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state
  2013-07-29  3:35   ` Lai Jiangshan
@ 2013-07-29 15:28     ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-29 15:28 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
	sbw

On Mon, Jul 29, 2013 at 11:35:56AM +0800, Lai Jiangshan wrote:
> On 07/27/2013 07:19 AM, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > At least one CPU must keep the scheduling-clock tick running for
> > timekeeping purposes whenever there is a non-idle CPU.  However, with
> > the new nohz_full adaptive-idle machinery, it is difficult to distinguish
> > between all CPUs really being idle as opposed to all non-idle CPUs being
> > in adaptive-ticks mode.  This commit therefore adds a Kconfig parameter
> > as a first step towards enabling a scalable detection of full-system
> > idle state.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > ---
> >  kernel/time/Kconfig | 23 +++++++++++++++++++++++
> >  1 file changed, 23 insertions(+)
> > 
> > diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> > index 70f27e8..a613c2a 100644
> > --- a/kernel/time/Kconfig
> > +++ b/kernel/time/Kconfig
> > @@ -134,6 +134,29 @@ config NO_HZ_FULL_ALL
> >  	 Note the boot CPU will still be kept outside the range to
> >  	 handle the timekeeping duty.
> >  
> > +config NO_HZ_FULL_SYSIDLE
> > +	bool "Detect full-system idle state for full dynticks system"
> > +	depends on NO_HZ_FULL
> > +	default n
> > +	help
> > +	 At least one CPU must keep the scheduling-clock tick running
> > +	 for timekeeping purposes whenever there is a non-idle CPU,
> > +	 where "non-idle" includes CPUs with a single runnable task
> > +	 in adaptive-idle mode.  Because the underlying adaptive-tick
> > +	 support cannot distinguish between all CPUs being idle and
> > +	 all CPUs each running a single task in adaptive-idle mode,
> > +	 the underlying support simply ensures that there is always
> > +	 a CPU handling the scheduling-clock tick, whether or not all
> > +	 CPUs are idle.  This Kconfig option enables scalable detection
> > +	 of the all-CPUs-idle state, thus allowing the scheduling-clock
> > +	 tick to be disabled when all CPUs are idle.  Note that scalable
> > +	 detection of the all-CPUs-idle state means that larger systems
> > +	 will be slower to declare the all-CPUs-idle state.
> > +
> > +	 Say Y if you would like to help debug all-CPUs-idle detection.
> 
> The code is needed only for debug?
> I guess not.

The code is not used only for debug, but if you enable it now, you will
likely be helping to debug it.  ;-)

							Thanx, Paul

> > +
> > +	 Say N if you are unsure.
> > +
> >  config NO_HZ
> >  	bool "Old Idle dynticks config"
> >  	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
> 



* Re: [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state
  2013-07-26 23:19 ` [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
@ 2013-07-29  3:35   ` Lai Jiangshan
  2013-07-29 15:28     ` Paul E. McKenney
  2013-08-05  1:04   ` Frederic Weisbecker
  1 sibling, 1 reply; 30+ messages in thread
From: Lai Jiangshan @ 2013-07-29  3:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
	sbw

On 07/27/2013 07:19 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> At least one CPU must keep the scheduling-clock tick running for
> timekeeping purposes whenever there is a non-idle CPU.  However, with
> the new nohz_full adaptive-idle machinery, it is difficult to distinguish
> between all CPUs really being idle as opposed to all non-idle CPUs being
> in adaptive-ticks mode.  This commit therefore adds a Kconfig parameter
> as a first step towards enabling a scalable detection of full-system
> idle state.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/time/Kconfig | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> index 70f27e8..a613c2a 100644
> --- a/kernel/time/Kconfig
> +++ b/kernel/time/Kconfig
> @@ -134,6 +134,29 @@ config NO_HZ_FULL_ALL
>  	 Note the boot CPU will still be kept outside the range to
>  	 handle the timekeeping duty.
>  
> +config NO_HZ_FULL_SYSIDLE
> +	bool "Detect full-system idle state for full dynticks system"
> +	depends on NO_HZ_FULL
> +	default n
> +	help
> +	 At least one CPU must keep the scheduling-clock tick running
> +	 for timekeeping purposes whenever there is a non-idle CPU,
> +	 where "non-idle" includes CPUs with a single runnable task
> +	 in adaptive-idle mode.  Because the underlying adaptive-tick
> +	 support cannot distinguish between all CPUs being idle and
> +	 all CPUs each running a single task in adaptive-idle mode,
> +	 the underlying support simply ensures that there is always
> +	 a CPU handling the scheduling-clock tick, whether or not all
> +	 CPUs are idle.  This Kconfig option enables scalable detection
> +	 of the all-CPUs-idle state, thus allowing the scheduling-clock
> +	 tick to be disabled when all CPUs are idle.  Note that scalable
> +	 detection of the all-CPUs-idle state means that larger systems
> +	 will be slower to declare the all-CPUs-idle state.
> +
> +	 Say Y if you would like to help debug all-CPUs-idle detection.

The code is needed only for debug?
I guess not.

> +
> +	 Say N if you are unsure.
> +
>  config NO_HZ
>  	bool "Old Idle dynticks config"
>  	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS



* [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state
  2013-07-26 23:18 [PATCH RFC nohz_full 0/7] v4 Provide infrastructure for full-system idle Paul E. McKenney
@ 2013-07-26 23:19 ` Paul E. McKenney
  2013-07-29  3:35   ` Lai Jiangshan
  2013-08-05  1:04   ` Frederic Weisbecker
  0 siblings, 2 replies; 30+ messages in thread
From: Paul E. McKenney @ 2013-07-26 23:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

At least one CPU must keep the scheduling-clock tick running for
timekeeping purposes whenever there is a non-idle CPU.  However, with
the new nohz_full adaptive-idle machinery, it is difficult to distinguish
between all CPUs really being idle as opposed to all non-idle CPUs being
in adaptive-ticks mode.  This commit therefore adds a Kconfig parameter
as a first step towards enabling a scalable detection of full-system
idle state.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/time/Kconfig | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 70f27e8..a613c2a 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -134,6 +134,29 @@ config NO_HZ_FULL_ALL
 	 Note the boot CPU will still be kept outside the range to
 	 handle the timekeeping duty.
 
+config NO_HZ_FULL_SYSIDLE
+	bool "Detect full-system idle state for full dynticks system"
+	depends on NO_HZ_FULL
+	default n
+	help
+	 At least one CPU must keep the scheduling-clock tick running
+	 for timekeeping purposes whenever there is a non-idle CPU,
+	 where "non-idle" includes CPUs with a single runnable task
+	 in adaptive-idle mode.  Because the underlying adaptive-tick
+	 support cannot distinguish between all CPUs being idle and
+	 all CPUs each running a single task in adaptive-idle mode,
+	 the underlying support simply ensures that there is always
+	 a CPU handling the scheduling-clock tick, whether or not all
+	 CPUs are idle.  This Kconfig option enables scalable detection
+	 of the all-CPUs-idle state, thus allowing the scheduling-clock
+	 tick to be disabled when all CPUs are idle.  Note that scalable
+	 detection of the all-CPUs-idle state means that larger systems
+	 will be slower to declare the all-CPUs-idle state.
+
+	 Say Y if you would like to help debug all-CPUs-idle detection.
+
+	 Say N if you are unsure.
+
 config NO_HZ
 	bool "Old Idle dynticks config"
 	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
-- 
1.8.1.5




Thread overview: 30+ messages
2013-07-09  1:29 [PATCH RFC nohz_full 0/7] v3 Provide infrastructure for full-system idle Paul E. McKenney
2013-07-09  1:30 ` [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
2013-07-09  1:30   ` [PATCH RFC nohz_full 2/7] nohz_full: Add rcu_dyntick data " Paul E. McKenney
2013-07-09  9:37     ` Peter Zijlstra
2013-07-09 13:23       ` Paul E. McKenney
2013-07-09  1:30   ` [PATCH RFC nohz_full 3/7] nohz_full: Add per-CPU idle-state tracking Paul E. McKenney
2013-07-09  1:30   ` [PATCH RFC nohz_full 4/7] nohz_full: Add full-system idle states and variables Paul E. McKenney
2013-07-09  1:30   ` [PATCH RFC nohz_full 5/7] nohz_full: Add full-system-idle arguments to API Paul E. McKenney
2013-07-09  1:30   ` [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine Paul E. McKenney
2013-07-17 23:31     ` Frederic Weisbecker
2013-07-18  0:41       ` Paul E. McKenney
2013-07-18  1:33         ` Frederic Weisbecker
2013-07-18  3:39           ` Paul E. McKenney
2013-07-18 14:24             ` Frederic Weisbecker
2013-07-18 16:47               ` Paul E. McKenney
2013-07-18 22:46                 ` Frederic Weisbecker
2013-07-19  0:24                   ` Paul E. McKenney
2013-07-19  2:12                     ` Frederic Weisbecker
2013-07-19  5:06                       ` Paul E. McKenney
2013-07-24 18:09                         ` Frederic Weisbecker
2013-07-24 22:09                           ` Paul E. McKenney
2013-07-24 23:26                             ` Frederic Weisbecker
2013-07-26 22:52                               ` Paul E. McKenney
2013-07-27 18:13                                 ` Frederic Weisbecker
2013-07-09  1:30   ` [PATCH RFC nohz_full 7/7] nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU Paul E. McKenney
2013-07-26 23:18 [PATCH RFC nohz_full 0/7] v4 Provide infrastructure for full-system idle Paul E. McKenney
2013-07-26 23:19 ` [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
2013-07-29  3:35   ` Lai Jiangshan
2013-07-29 15:28     ` Paul E. McKenney
2013-08-05  1:04   ` Frederic Weisbecker
2013-08-17 23:38     ` Paul E. McKenney
