* [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle
@ 2013-06-28 20:09 Paul E. McKenney
  2013-06-28 20:10 ` [PATCH RFC nohz_full v2 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Paul E. McKenney @ 2013-06-28 20:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw

Whenever there is at least one non-idle CPU, it is necessary to
periodically update timekeeping information.  Before NO_HZ_FULL, this
updating was carried out by the scheduling-clock tick, which ran on
every non-idle CPU.  With the advent of NO_HZ_FULL, it is possible
to have non-idle CPUs that are not receiving scheduling-clock ticks.
This possibility is handled by assigning a timekeeping CPU that continues
taking scheduling-clock ticks.

Unfortunately, the timekeeping CPU continues taking scheduling-clock
interrupts even when all other CPUs are completely idle, which is
not good for energy efficiency or battery lifetime.  Clearly, it
would be good to turn off the timekeeping CPU's scheduling-clock tick
when all CPUs are completely idle.  This is conceptually simple, but
we also need good performance and scalability on large systems, which
rules out implementations based on frequently updated global counts of
non-idle CPUs as well as implementations that frequently scan all CPUs.
Nevertheless, we need a single global indicator in order to keep the
overhead of checking acceptably low.

The chosen approach is to enforce hysteresis on the non-idle to
full-system-idle transition, with the amount of hysteresis increasing
linearly with the number of CPUs, thus keeping contention acceptably low.
This approach piggybacks on RCU's existing force-quiescent-state scanning
of idle CPUs, which has the advantage of avoiding the scan entirely on
busy systems that have high levels of multiprogramming.  This scan
takes per-CPU idleness information and feeds it into a state machine
that applies the level of hysteresis required to arrive at a single
full-system-idle indicator.
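
As a rough userspace sketch (not the kernel implementation; the cutoff of 8
mirrors the series' default RCU_SYSIDLE_SMALL, but the jiffies-per-CPU scale
factor below is an invented example value), the linearly scaled hysteresis
interval might be computed like this:

```c
#include <assert.h>

/*
 * Illustrative sketch only: the series delays the declaration of
 * full-system idle by an interval that grows linearly with the
 * number of CPUs, so that larger systems touch the shared state
 * variable less often.
 */
#define SYSIDLE_SMALL_CUTOFF	8	/* "small system" threshold */
#define JIFFIES_PER_CPU		1	/* hypothetical scale factor */

/* Hysteresis delay, in jiffies, before declaring full-system idle. */
static unsigned long sysidle_delay(int nr_cpus)
{
	if (nr_cpus <= SYSIDLE_SMALL_CUTOFF)
		return 0;	/* small systems transition eagerly */
	return (unsigned long)nr_cpus * JIFFIES_PER_CPU;
}
```

The linear scaling is the key trade-off: large systems declare full-system
idle more slowly, but in exchange avoid memory contention on the shared
indicator.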

The individual patches are as follows:

1.	Add a CONFIG_NO_HZ_FULL_SYSIDLE Kconfig parameter to enable
	this feature.  Kernels built with CONFIG_NO_HZ_FULL_SYSIDLE=n
	act exactly as they do today.

2.	Add new fields to the rcu_dynticks structure that track CPU-idle
	information.  These fields consider CPUs running usermode to be
	non-idle, in contrast with the existing fields in that structure.

3.	Track per-CPU idle states.

4.	Add full-system idle states and state variables.

5.	Expand force_qs_rnp(), dyntick_save_progress_counter(), and
	rcu_implicit_dynticks_qs() APIs to enable passing full-system
	idle state information.

6.	Add full-system-idle state machine.

7.	Force RCU's grace-period kthreads onto the timekeeping CPU.

Changes since v1:

o	Removed NMI support because NMI handlers cannot safely read
	the time anyway (thanks to Thomas Gleixner and Peter Zijlstra).

						Thanx, Paul

------------------------------------------------------------------------

 b/include/linux/rcupdate.h |   18 +
 b/kernel/rcutree.c         |   49 ++++-
 b/kernel/rcutree.h         |   17 +
 b/kernel/rcutree_plugin.h  |  407 ++++++++++++++++++++++++++++++++++++++++++++-
 b/kernel/time/Kconfig      |   23 ++
 5 files changed, 499 insertions(+), 15 deletions(-)



* [PATCH RFC nohz_full v2 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state
  2013-06-28 20:09 [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle Paul E. McKenney
@ 2013-06-28 20:10 ` Paul E. McKenney
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data " Paul E. McKenney
                     ` (5 more replies)
  2013-07-01 15:19 ` [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle Andi Kleen
  2013-07-01 19:43 ` Christoph Lameter
  2 siblings, 6 replies; 32+ messages in thread
From: Paul E. McKenney @ 2013-06-28 20:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

At least one CPU must keep the scheduling-clock tick running for
timekeeping purposes whenever there is a non-idle CPU.  However, with
the new nohz_full adaptive-idle machinery, it is difficult to distinguish
between all CPUs really being idle and all non-idle CPUs simply being
in adaptive-ticks mode.  This commit therefore adds a Kconfig parameter
as a first step toward enabling scalable detection of the full-system
idle state.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/time/Kconfig | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 70f27e8..a613c2a 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -134,6 +134,29 @@ config NO_HZ_FULL_ALL
 	 Note the boot CPU will still be kept outside the range to
 	 handle the timekeeping duty.
 
+config NO_HZ_FULL_SYSIDLE
+	bool "Detect full-system idle state for full dynticks system"
+	depends on NO_HZ_FULL
+	default n
+	help
+	 At least one CPU must keep the scheduling-clock tick running
+	 for timekeeping purposes whenever there is a non-idle CPU,
+	 where "non-idle" includes CPUs with a single runnable task
+	 in adaptive-idle mode.  Because the underlying adaptive-tick
+	 support cannot distinguish between all CPUs being idle and
+	 all CPUs each running a single task in adaptive-idle mode,
+	 the underlying support simply ensures that there is always
+	 a CPU handling the scheduling-clock tick, whether or not all
+	 CPUs are idle.  This Kconfig option enables scalable detection
+	 of the all-CPUs-idle state, thus allowing the scheduling-clock
+	 tick to be disabled when all CPUs are idle.  Note that scalable
+	 detection of the all-CPUs-idle state means that larger systems
+	 will be slower to declare the all-CPUs-idle state.
+
+	 Say Y if you would like to help debug all-CPUs-idle detection.
+
+	 Say N if you are unsure.
+
 config NO_HZ
 	bool "Old Idle dynticks config"
 	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
-- 
1.8.1.5



* [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
  2013-06-28 20:10 ` [PATCH RFC nohz_full v2 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
@ 2013-06-28 20:10   ` Paul E. McKenney
  2013-07-01 15:31     ` Josh Triplett
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 3/7] nohz_full: Add per-CPU idle-state tracking Paul E. McKenney
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2013-06-28 20:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds fields to the rcu_dyntick structure that are used to
detect idle CPUs.  These new fields differ from the existing ones in
that the existing ones consider a CPU executing in user mode to be idle,
whereas the new ones consider CPUs executing in user mode to be busy.
The handling of these new fields is otherwise quite similar to that for
the existing fields.  This commit also adds the initialization required
for these fields.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rcutree.c        |  5 +++++
 kernel/rcutree.h        |  9 +++++++++
 kernel/rcutree_plugin.h | 19 +++++++++++++++++++
 3 files changed, 33 insertions(+)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 78745ae..259e300 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -209,6 +209,10 @@ EXPORT_SYMBOL_GPL(rcu_note_context_switch);
 DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
 	.dynticks_nesting = DYNTICK_TASK_EXIT_IDLE,
 	.dynticks = ATOMIC_INIT(1),
+#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
+	.dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE,
+	.dynticks_idle = ATOMIC_INIT(1),
+#endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
 };
 
 static long blimit = 10;	/* Maximum callbacks per rcu_do_batch. */
@@ -2902,6 +2906,7 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
 	rdp->blimit = blimit;
 	init_callback_list(rdp);  /* Re-enable callbacks on this CPU. */
 	rdp->dynticks->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
+	rcu_sysidle_init_percpu_data(rdp->dynticks);
 	atomic_set(&rdp->dynticks->dynticks,
 		   (atomic_read(&rdp->dynticks->dynticks) & ~0x1) + 1);
 	raw_spin_unlock(&rnp->lock);		/* irqs remain disabled. */
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index b383258..bd99d59 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -88,6 +88,14 @@ struct rcu_dynticks {
 				    /* Process level is worth LLONG_MAX/2. */
 	int dynticks_nmi_nesting;   /* Track NMI nesting level. */
 	atomic_t dynticks;	    /* Even value for idle, else odd. */
+#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
+	long long dynticks_idle_nesting;
+				    /* irq/process nesting level from idle. */
+	atomic_t dynticks_idle;	    /* Even value for idle, else odd. */
+				    /*  "Idle" excludes userspace execution. */
+	unsigned long dynticks_idle_jiffies;
+				    /* End of last non-NMI non-idle period. */
+#endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
 #ifdef CONFIG_RCU_FAST_NO_HZ
 	bool all_lazy;		    /* Are all CPU's CBs lazy? */
 	unsigned long nonlazy_posted;
@@ -545,6 +553,7 @@ static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
 static void rcu_kick_nohz_cpu(int cpu);
 static bool init_nocb_callback_list(struct rcu_data *rdp);
+static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
 
 #endif /* #ifndef RCU_TREE_NONCORE */
 
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 769e12e..6937eb6 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2375,3 +2375,22 @@ static void rcu_kick_nohz_cpu(int cpu)
 		smp_send_reschedule(cpu);
 #endif /* #ifdef CONFIG_NO_HZ_FULL */
 }
+
+
+#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
+
+/*
+ * Initialize dynticks sysidle state for CPUs coming online.
+ */
+static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
+{
+	rdtp->dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE;
+}
+
+#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
+
+static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
+{
+}
+
+#endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
-- 
1.8.1.5



* [PATCH RFC nohz_full v2 3/7] nohz_full: Add per-CPU idle-state tracking
  2013-06-28 20:10 ` [PATCH RFC nohz_full v2 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data " Paul E. McKenney
@ 2013-06-28 20:10   ` Paul E. McKenney
  2013-07-01 15:33     ` Josh Triplett
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 4/7] nohz_full: Add full-system idle states and variables Paul E. McKenney
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2013-06-28 20:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds the code that updates the rcu_dyntick structure's
new fields to track the per-CPU idle state based on interrupts and
transitions into and out of the idle loop (NMIs are ignored because NMI
handlers cannot cleanly read out the time anyway).  This code is similar
to the code that maintains RCU's idea of per-CPU idleness, but differs
in that RCU treats CPUs running in user mode as idle, whereas this new
code does not.
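
The even/odd counter protocol that this commit applies can be modeled in
plain userspace C.  This is a simplified sketch, not the kernel code: the
real implementation uses atomic_t with memory barriers and also handles
task-level nesting via DYNTICK_TASK_NEST_VALUE, all omitted here.

```c
#include <assert.h>

/*
 * Model of the per-CPU idle counter: even while the CPU is fully
 * idle, odd otherwise, so a remote CPU can sample the counter and
 * tell whether this CPU is (or has recently been) idle.
 */
struct idle_model {
	int irq_nesting;	/* irq nesting level from the idle loop */
	unsigned int ctr;	/* even: idle, odd: non-idle */
};

static void model_task_idle_enter(struct idle_model *m)
{
	m->irq_nesting = 0;
	m->ctr++;		/* odd -> even: CPU is now idle */
}

static void model_irq_enter(struct idle_model *m)
{
	if (++m->irq_nesting == 1)
		m->ctr++;	/* even -> odd: momentarily non-idle */
}

static void model_irq_exit(struct idle_model *m)
{
	if (--m->irq_nesting == 0)
		m->ctr++;	/* odd -> even: idle again */
}

/* Run idle-enter, then one irq in and out; return the final counter. */
static unsigned int model_sequence(void)
{
	struct idle_model m = { .irq_nesting = 0, .ctr = 1 }; /* non-idle */

	model_task_idle_enter(&m);	/* ctr == 2, even: idle */
	model_irq_enter(&m);		/* ctr == 3, odd:  non-idle */
	model_irq_exit(&m);		/* ctr == 4, even: idle */
	return m.ctr;
}
```

Because every transition increments the counter, two successive remote
samples that both read the same even value prove the CPU stayed idle for
the whole interval between them.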

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rcutree.c        |  4 +++
 kernel/rcutree.h        |  2 ++
 kernel/rcutree_plugin.h | 78 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 84 insertions(+)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 259e300..ed9a36a 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -416,6 +416,7 @@ void rcu_idle_enter(void)
 
 	local_irq_save(flags);
 	rcu_eqs_enter(false);
+	rcu_sysidle_enter(&__get_cpu_var(rcu_dynticks), 0);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -466,6 +467,7 @@ void rcu_irq_exit(void)
 		trace_rcu_dyntick("--=", oldval, rdtp->dynticks_nesting);
 	else
 		rcu_eqs_enter_common(rdtp, oldval, true);
+	rcu_sysidle_enter(rdtp, 1);
 	local_irq_restore(flags);
 }
 
@@ -534,6 +536,7 @@ void rcu_idle_exit(void)
 
 	local_irq_save(flags);
 	rcu_eqs_exit(false);
+	rcu_sysidle_exit(&__get_cpu_var(rcu_dynticks), 0);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_exit);
@@ -585,6 +588,7 @@ void rcu_irq_enter(void)
 		trace_rcu_dyntick("++=", oldval, rdtp->dynticks_nesting);
 	else
 		rcu_eqs_exit_common(rdtp, oldval, true);
+	rcu_sysidle_exit(rdtp, 1);
 	local_irq_restore(flags);
 }
 
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index bd99d59..1895043 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -553,6 +553,8 @@ static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
 static void rcu_kick_nohz_cpu(int cpu);
 static bool init_nocb_callback_list(struct rcu_data *rdp);
+static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq);
+static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
 
 #endif /* #ifndef RCU_TREE_NONCORE */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 6937eb6..6f0bce4 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2380,6 +2380,76 @@ static void rcu_kick_nohz_cpu(int cpu)
 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
 
 /*
+ * Invoked to note exit from irq or task transition to idle.  Note that
+ * usermode execution does -not- count as idle here!  The caller must
+ * have disabled interrupts.
+ */
+static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
+{
+	unsigned long j;
+
+	/* Adjust nesting, check for fully idle. */
+	if (irq) {
+		rdtp->dynticks_idle_nesting--;
+		WARN_ON_ONCE(rdtp->dynticks_idle_nesting < 0);
+		if (rdtp->dynticks_idle_nesting != 0)
+			return;  /* Still not fully idle. */
+	} else {
+		if ((rdtp->dynticks_idle_nesting & DYNTICK_TASK_NEST_MASK) ==
+		    DYNTICK_TASK_NEST_VALUE) {
+			rdtp->dynticks_idle_nesting = 0;
+		} else {
+			rdtp->dynticks_idle_nesting -= DYNTICK_TASK_NEST_VALUE;
+			WARN_ON_ONCE(rdtp->dynticks_idle_nesting < 0);
+			return;  /* Still not fully idle. */
+		}
+	}
+
+	/* Record start of fully idle period. */
+	j = jiffies;
+	ACCESS_ONCE(rdtp->dynticks_idle_jiffies) = j;
+	smp_mb__before_atomic_inc();
+	atomic_inc(&rdtp->dynticks_idle);
+	smp_mb__after_atomic_inc();
+	WARN_ON_ONCE(atomic_read(&rdtp->dynticks_idle) & 0x1);
+}
+
+/*
+ * Invoked to note entry to irq or task transition from idle.  Note that
+ * usermode execution does -not- count as idle here!  The caller must
+ * have disabled interrupts.
+ */
+static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
+{
+	/* Adjust nesting, check for already non-idle. */
+	if (irq) {
+		rdtp->dynticks_idle_nesting++;
+		WARN_ON_ONCE(rdtp->dynticks_idle_nesting <= 0);
+		if (rdtp->dynticks_idle_nesting != 1)
+			return; /* Already non-idle. */
+	} else {
+		/*
+		 * Allow for irq misnesting.  Yes, it really is possible
+		 * to enter an irq handler then never leave it, and maybe
+		 * also vice versa.  Handle both possibilities.
+		 */
+		if (rdtp->dynticks_idle_nesting & DYNTICK_TASK_NEST_MASK) {
+			rdtp->dynticks_idle_nesting += DYNTICK_TASK_NEST_VALUE;
+			WARN_ON_ONCE(rdtp->dynticks_idle_nesting <= 0);
+			return; /* Already non-idle. */
+		} else {
+			rdtp->dynticks_idle_nesting = DYNTICK_TASK_EXIT_IDLE;
+		}
+	}
+
+	/* Record end of idle period. */
+	smp_mb__before_atomic_inc();
+	atomic_inc(&rdtp->dynticks_idle);
+	smp_mb__after_atomic_inc();
+	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
+}
+
+/*
  * Initialize dynticks sysidle state for CPUs coming online.
  */
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
@@ -2389,6 +2459,14 @@ static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
 
 #else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
 
+static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
+{
+}
+
+static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
+{
+}
+
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
 {
 }
-- 
1.8.1.5



* [PATCH RFC nohz_full v2 4/7] nohz_full: Add full-system idle states and variables
  2013-06-28 20:10 ` [PATCH RFC nohz_full v2 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data " Paul E. McKenney
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 3/7] nohz_full: Add per-CPU idle-state tracking Paul E. McKenney
@ 2013-06-28 20:10   ` Paul E. McKenney
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 5/7] nohz_full: Add full-system-idle arguments to API Paul E. McKenney
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2013-06-28 20:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds control variables and states for full-system idle.
The system will progress through the states in numerical order when
the system is fully idle (other than the timekeeping CPU), and will
reset back to the initial state if any non-timekeeping CPU goes non-idle.
The current state is kept in full_sysidle_state.

An RCU_SYSIDLE_SMALL macro is defined, and systems with this number
of CPUs or fewer move through the states more aggressively.  The idea
is that the resulting memory contention is less of a problem on small
systems.  Architectures can adjust this value (which defaults to 8)
using CONFIG_ARCH_RCU_SYSIDLE_SMALL.

One flavor of RCU will be in charge of driving the state machine,
defined by rcu_sysidle_state.  This should be the busiest flavor of RCU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rcutree_plugin.h | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 6f0bce4..a4b2c09 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2380,6 +2380,33 @@ static void rcu_kick_nohz_cpu(int cpu)
 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
 
 /*
+ * Handle small systems specially, accelerating their transition into
+ * full idle state.  Allow arches to override this code's idea of
+ * what constitutes a "small" system.
+ */
+#ifdef CONFIG_ARCH_RCU_SYSIDLE_SMALL
+#define RCU_SYSIDLE_SMALL CONFIG_ARCH_RCU_SYSIDLE_SMALL
+#else /* #ifdef CONFIG_ARCH_RCU_SYSIDLE_SMALL */
+#define RCU_SYSIDLE_SMALL 8
+#endif
+
+/*
+ * Define RCU flavor that holds sysidle state.  This needs to be the
+ * most active flavor of RCU.
+ */
+#ifdef CONFIG_PREEMPT_RCU
+static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_preempt_state;
+#else /* #ifdef CONFIG_PREEMPT_RCU */
+static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_sched_state;
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
+
+static int __maybe_unused full_sysidle_state; /* Current system-idle state. */
+#define RCU_SYSIDLE_NOT		0	/* Some CPU is not idle. */
+#define RCU_SYSIDLE_SHORT	1	/* All CPUs idle for brief period. */
+#define RCU_SYSIDLE_FULL	2	/* All CPUs idle, ready for sysidle. */
+#define RCU_SYSIDLE_FULL_NOTED	3	/* Actually entered sysidle state. */
+
+/*
  * Invoked to note exit from irq or task transition to idle.  Note that
  * usermode execution does -not- count as idle here!  The caller must
  * have disabled interrupts.
-- 
1.8.1.5



* [PATCH RFC nohz_full v2 5/7] nohz_full: Add full-system-idle arguments to API
  2013-06-28 20:10 ` [PATCH RFC nohz_full v2 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
                     ` (2 preceding siblings ...)
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 4/7] nohz_full: Add full-system idle states and variables Paul E. McKenney
@ 2013-06-28 20:10   ` Paul E. McKenney
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine Paul E. McKenney
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 7/7] nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU Paul E. McKenney
  5 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2013-06-28 20:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds isidle and maxj arguments to force_qs_rnp(),
dyntick_save_progress_counter(), and rcu_implicit_dynticks_qs() to enable
RCU's force-quiescent-state process to check for full-system idle.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rcutree.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index ed9a36a..9971f86 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -231,7 +231,9 @@ module_param(jiffies_till_next_fqs, ulong, 0644);
 
 static void rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp,
 				  struct rcu_data *rdp);
-static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
+static void force_qs_rnp(struct rcu_state *rsp,
+			 int (*f)(struct rcu_data *, bool *, unsigned long *),
+			 bool *isidle, unsigned long *maxj);
 static void force_quiescent_state(struct rcu_state *rsp);
 static int rcu_pending(int cpu);
 
@@ -712,7 +714,8 @@ static int rcu_is_cpu_rrupt_from_idle(void)
  * credit them with an implicit quiescent state.  Return 1 if this CPU
  * is in dynticks idle mode, which is an extended quiescent state.
  */
-static int dyntick_save_progress_counter(struct rcu_data *rdp)
+static int dyntick_save_progress_counter(struct rcu_data *rdp,
+					 bool *isidle, unsigned long *maxj)
 {
 	rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
 	return (rdp->dynticks_snap & 0x1) == 0;
@@ -724,7 +727,8 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp)
  * idle state since the last call to dyntick_save_progress_counter()
  * for this same CPU, or by virtue of having been offline.
  */
-static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
+static int rcu_implicit_dynticks_qs(struct rcu_data *rdp,
+				    bool *isidle, unsigned long *maxj)
 {
 	unsigned int curr;
 	unsigned int snap;
@@ -1345,16 +1349,19 @@ static int rcu_gp_init(struct rcu_state *rsp)
 int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
 {
 	int fqs_state = fqs_state_in;
+	bool isidle = 0;
+	unsigned long maxj;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
 	rsp->n_force_qs++;
 	if (fqs_state == RCU_SAVE_DYNTICK) {
 		/* Collect dyntick-idle snapshots. */
-		force_qs_rnp(rsp, dyntick_save_progress_counter);
+		force_qs_rnp(rsp, dyntick_save_progress_counter,
+			     &isidle, &maxj);
 		fqs_state = RCU_FORCE_QS;
 	} else {
 		/* Handle dyntick-idle and offline CPUs. */
-		force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
+		force_qs_rnp(rsp, rcu_implicit_dynticks_qs, &isidle, &maxj);
 	}
 	/* Clear flag to prevent immediate re-entry. */
 	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
@@ -2055,7 +2062,9 @@ void rcu_check_callbacks(int cpu, int user)
  *
  * The caller must have suppressed start of new grace periods.
  */
-static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
+static void force_qs_rnp(struct rcu_state *rsp,
+			 int (*f)(struct rcu_data *, bool *, unsigned long *),
+			 bool *isidle, unsigned long *maxj)
 {
 	unsigned long bit;
 	int cpu;
@@ -2079,7 +2088,7 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
 		bit = 1;
 		for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
 			if ((rnp->qsmask & bit) != 0 &&
-			    f(per_cpu_ptr(rsp->rda, cpu)))
+			    f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
 				mask |= bit;
 		}
 		if (mask != 0) {
-- 
1.8.1.5



* [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine
  2013-06-28 20:10 ` [PATCH RFC nohz_full v2 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
                     ` (3 preceding siblings ...)
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 5/7] nohz_full: Add full-system-idle arguments to API Paul E. McKenney
@ 2013-06-28 20:10   ` Paul E. McKenney
  2013-07-01 16:35     ` Frederic Weisbecker
                       ` (2 more replies)
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 7/7] nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU Paul E. McKenney
  5 siblings, 3 replies; 32+ messages in thread
From: Paul E. McKenney @ 2013-06-28 20:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds the state machine that takes the per-CPU idle data
as input and produces a full-system-idle indication as output.  This
state machine is driven out of RCU's quiescent-state-forcing
mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
idle state and then rcu_sysidle_report() to drive the state machine.

The full-system-idle state is sampled using rcu_sys_is_idle(), which
also drives the state machine if RCU is idle (and does so by forcing
RCU to become non-idle).  This function returns true if all but the
timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
enough to avoid memory contention on the full_sysidle_state state
variable.  rcu_sysidle_force_exit() may be called externally to force
the state machine back into the non-idle state.
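
Conceptually, the sampling side behaves like the sketch below.  This is an
illustration only, not the patch's rcu_sys_is_idle() (which additionally
drives the state machine when RCU itself is idle); the state values mirror
the definitions added in patch 4/7.

```c
#include <assert.h>

/* Sysidle states, mirroring the definitions in patch 4/7. */
#define RCU_SYSIDLE_NOT		0	/* Some CPU is not idle. */
#define RCU_SYSIDLE_SHORT	1	/* All CPUs idle for brief period. */
#define RCU_SYSIDLE_FULL	2	/* All CPUs idle, ready for sysidle. */
#define RCU_SYSIDLE_FULL_NOTED	3	/* Actually entered sysidle state. */

/*
 * Sketch of sampling: only the timekeeping CPU asks, and the answer
 * is "idle" only once the state machine has applied its full
 * hysteresis, i.e. reached at least RCU_SYSIDLE_FULL.
 */
static int model_sys_is_idle(int this_cpu, int timekeeping_cpu, int state)
{
	if (this_cpu != timekeeping_cpu)
		return 0;	/* only the timekeeping CPU samples */
	return state >= RCU_SYSIDLE_FULL;
}
```

The hysteresis shows up here as the gap between RCU_SYSIDLE_SHORT and
RCU_SYSIDLE_FULL: a brief all-idle period is not yet enough for the
timekeeping CPU to shut off its tick.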

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>

Conflicts:

	kernel/rcutree.h
	kernel/rcutree_plugin.h
---
 include/linux/rcupdate.h |  18 ++++
 kernel/rcutree.c         |  16 ++-
 kernel/rcutree.h         |   5 +
 kernel/rcutree_plugin.h  | 263 ++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 295 insertions(+), 7 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 48f1ef9..1aa8d8c 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1011,4 +1011,22 @@ static inline bool rcu_is_nocb_cpu(int cpu) { return false; }
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
 
+/* Only for use by adaptive-ticks code. */
+#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
+extern bool rcu_sys_is_idle(void);
+extern void rcu_sysidle_force_exit(void);
+#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
+
+static inline bool rcu_sys_is_idle(void)
+{
+	return false;
+}
+
+static inline void rcu_sysidle_force_exit(void)
+{
+}
+
+#endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
+
+
 #endif /* __LINUX_RCUPDATE_H */
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 9971f86..06cfd75 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -718,6 +718,7 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp,
 					 bool *isidle, unsigned long *maxj)
 {
 	rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
+	rcu_sysidle_check_cpu(rdp, isidle, maxj);
 	return (rdp->dynticks_snap & 0x1) == 0;
 }
 
@@ -1356,11 +1357,17 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
 	rsp->n_force_qs++;
 	if (fqs_state == RCU_SAVE_DYNTICK) {
 		/* Collect dyntick-idle snapshots. */
+		if (is_sysidle_rcu_state(rsp)) {
+			isidle = 1;
+			maxj = jiffies - ULONG_MAX / 4;
+		}
 		force_qs_rnp(rsp, dyntick_save_progress_counter,
 			     &isidle, &maxj);
+		rcu_sysidle_report(rsp, isidle, maxj);
 		fqs_state = RCU_FORCE_QS;
 	} else {
 		/* Handle dyntick-idle and offline CPUs. */
+		isidle = 0;
 		force_qs_rnp(rsp, rcu_implicit_dynticks_qs, &isidle, &maxj);
 	}
 	/* Clear flag to prevent immediate re-entry. */
@@ -2087,9 +2094,12 @@ static void force_qs_rnp(struct rcu_state *rsp,
 		cpu = rnp->grplo;
 		bit = 1;
 		for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
-			if ((rnp->qsmask & bit) != 0 &&
-			    f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
-				mask |= bit;
+			if ((rnp->qsmask & bit) != 0) {
+				if ((rnp->qsmaskinit & bit) != 0)
+					*isidle = 0;
+				if (f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
+					mask |= bit;
+			}
 		}
 		if (mask != 0) {
 
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 1895043..7326a3c 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -555,6 +555,11 @@ static void rcu_kick_nohz_cpu(int cpu);
 static bool init_nocb_callback_list(struct rcu_data *rdp);
 static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq);
 static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
+static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
+				  unsigned long *maxj);
+static bool is_sysidle_rcu_state(struct rcu_state *rsp);
+static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
+			       unsigned long maxj);
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
 
 #endif /* #ifndef RCU_TREE_NONCORE */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index a4b2c09..9b60df5 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -28,7 +28,7 @@
 #include <linux/gfp.h>
 #include <linux/oom.h>
 #include <linux/smpboot.h>
-#include <linux/tick.h>
+#include "time/tick-internal.h"
 
 #define RCU_KTHREAD_PRIO 1
 
@@ -2395,12 +2395,12 @@ static void rcu_kick_nohz_cpu(int cpu)
  * most active flavor of RCU.
  */
 #ifdef CONFIG_PREEMPT_RCU
-static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_preempt_state;
+static struct rcu_state *rcu_sysidle_state = &rcu_preempt_state;
 #else /* #ifdef CONFIG_PREEMPT_RCU */
-static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_sched_state;
+static struct rcu_state *rcu_sysidle_state = &rcu_sched_state;
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
 
-static int __maybe_unused full_sysidle_state; /* Current system-idle state. */
+static int full_sysidle_state;		/* Current system-idle state. */
 #define RCU_SYSIDLE_NOT		0	/* Some CPU is not idle. */
 #define RCU_SYSIDLE_SHORT	1	/* All CPUs idle for brief period. */
 #define RCU_SYSIDLE_FULL	2	/* All CPUs idle, ready for sysidle. */
@@ -2442,6 +2442,38 @@ static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
 }
 
 /*
+ * Unconditionally force exit from full system-idle state.  This is
+ * invoked when a normal CPU exits idle, but must be called separately
+ * for the timekeeping CPU (tick_do_timer_cpu).  The reason for this
+ * is that the timekeeping CPU is permitted to take scheduling-clock
+ * interrupts while the system is in system-idle state, and of course
+ * rcu_sysidle_exit() has no way of distinguishing a scheduling-clock
+ * interrupt from any other type of interrupt.
+ */
+void rcu_sysidle_force_exit(void)
+{
+	int oldstate = ACCESS_ONCE(full_sysidle_state);
+	int newoldstate;
+
+	/*
+	 * Each pass through the following loop attempts to exit full
+	 * system-idle state.  If contention proves to be a problem,
+	 * a trylock-based contention tree could be used here.
+	 */
+	while (oldstate > RCU_SYSIDLE_SHORT) {
+		newoldstate = cmpxchg(&full_sysidle_state,
+				      oldstate, RCU_SYSIDLE_NOT);
+		if (oldstate == newoldstate &&
+		    oldstate == RCU_SYSIDLE_FULL_NOTED) {
+			rcu_kick_nohz_cpu(tick_do_timer_cpu);
+			return; /* We cleared it, done! */
+		}
+		oldstate = newoldstate;
+	}
+	smp_mb(); /* Order initial oldstate fetch vs. later non-idle work. */
+}
+
+/*
  * Invoked to note entry to irq or task transition from idle.  Note that
  * usermode execution does -not- count as idle here!  The caller must
  * have disabled interrupts.
@@ -2474,6 +2506,214 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
 	atomic_inc(&rdtp->dynticks_idle);
 	smp_mb__after_atomic_inc();
 	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
+
+	/*
+	 * If we are the timekeeping CPU, we are permitted to be non-idle
+	 * during a system-idle state.  This must be the case, because
+	 * the timekeeping CPU has to take scheduling-clock interrupts
+	 * during the time that the system is transitioning to full
+	 * system-idle state.  This means that the timekeeping CPU must
+	 * invoke rcu_sysidle_force_exit() directly if it does anything
+	 * more than take a scheduling-clock interrupt.
+	 */
+	if (smp_processor_id() == tick_do_timer_cpu)
+		return;
+
+	/* Update system-idle state: We are clearly no longer fully idle! */
+	rcu_sysidle_force_exit();
+}
+
+/*
+ * Check to see if the current CPU is idle.  Note that usermode execution
+ * does not count as idle.  The caller must have disabled interrupts.
+ */
+static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
+				  unsigned long *maxj)
+{
+	int cur;
+	int curnmi;
+	unsigned long j;
+	struct rcu_dynticks *rdtp = rdp->dynticks;
+
+	/*
+	 * If some other CPU has already reported non-idle, if this is
+	 * not the flavor of RCU that tracks sysidle state, or if this
+	 * is an offline or the timekeeping CPU, nothing to do.
+	 */
+	if (!*isidle || rdp->rsp != rcu_sysidle_state ||
+	    cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
+		return;
+	/* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
+
+	/*
+	 * Pick up current idle and NMI-nesting counters, check.  We check
+	 * for NMIs using RCU's main ->dynticks counter.  This works because
+	 * any time ->dynticks has its low bit set, ->dynticks_idle will
+	 * too -- unless the only reason that ->dynticks's low bit is set
+	 * is due to an NMI from idle.  Which is exactly the case we need
+	 * to account for.
+	 */
+	cur = atomic_read(&rdtp->dynticks_idle);
+	curnmi = atomic_read(&rdtp->dynticks);
+	if ((cur & 0x1) || (curnmi & 0x1)) {
+		*isidle = 0; /* We are not idle! */
+		return;
+	}
+	smp_mb(); /* Read counters before timestamps. */
+
+	/* Pick up timestamps. */
+	j = ACCESS_ONCE(rdtp->dynticks_idle_jiffies);
+	/* If this CPU entered idle more recently, update maxj timestamp. */
+	if (ULONG_CMP_LT(*maxj, j))
+		*maxj = j;
+}
+
+/*
+ * Is this the flavor of RCU that is handling full-system idle?
+ */
+static bool is_sysidle_rcu_state(struct rcu_state *rsp)
+{
+	return rsp == rcu_sysidle_state;
+}
+
+/*
+ * Return a delay in jiffies based on the number of CPUs, rcu_node
+ * leaf fanout, and jiffies tick rate.  The idea is to allow larger
+ * systems more time to transition to full-idle state in order to
+ * avoid the cache thrashing that would otherwise occur on the state variable.
+ * Really small systems (less than a couple of tens of CPUs) should
+ * instead use a single global atomically incremented counter, and later
+ * versions of this will automatically reconfigure themselves accordingly.
+ */
+static unsigned long rcu_sysidle_delay(void)
+{
+	if (nr_cpu_ids <= RCU_SYSIDLE_SMALL)
+		return 0;
+	return DIV_ROUND_UP(nr_cpu_ids * HZ, rcu_fanout_leaf * 1000);
+}
+
+/*
+ * Advance the full-system-idle state.  This is invoked when all of
+ * the non-timekeeping CPUs are idle.
+ */
+static void rcu_sysidle(unsigned long j)
+{
+	/* Check the current state. */
+	switch (ACCESS_ONCE(full_sysidle_state)) {
+	case RCU_SYSIDLE_NOT:
+
+		/* First time all are idle, so note a short idle period. */
+		ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_SHORT;
+		break;
+
+	case RCU_SYSIDLE_SHORT:
+
+		/*
+		 * Idle for a bit, time to advance to next state?
+		 * cmpxchg failure means race with non-idle, let them win.
+		 */
+		if (ULONG_CMP_GE(jiffies, j + rcu_sysidle_delay()))
+			(void)cmpxchg(&full_sysidle_state,
+				      RCU_SYSIDLE_SHORT, RCU_SYSIDLE_FULL);
+		break;
+
+	default:
+		break;
+	}
+}
+
+/*
+ * Found a non-idle non-timekeeping CPU, so kick the system-idle state
+ * back to the beginning.
+ */
+static void rcu_sysidle_cancel(void)
+{
+	smp_mb();
+	ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_NOT;
+}
+
+/*
+ * Update the sysidle state based on the results of a force-quiescent-state
+ * scan of the CPUs' dyntick-idle state.
+ */
+static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
+			       unsigned long maxj)
+{
+	if (rsp != rcu_sysidle_state)
+		return;  /* Wrong flavor, ignore. */
+	if (isidle)
+		rcu_sysidle(maxj);    /* More idle! */
+	else
+		rcu_sysidle_cancel(); /* Idle is over. */
+}
+
+/* Callback and function for forcing an RCU grace period. */
+struct rcu_sysidle_head {
+	struct rcu_head rh;
+	int inuse;
+};
+
+static void rcu_sysidle_cb(struct rcu_head *rhp)
+{
+	struct rcu_sysidle_head *rshp;
+
+	smp_mb();  /* grace period precedes setting inuse. */
+	rshp = container_of(rhp, struct rcu_sysidle_head, rh);
+	ACCESS_ONCE(rshp->inuse) = 0;
+}
+
+/*
+ * Check to see if the system is fully idle, other than the timekeeping CPU.
+ * The caller must have disabled interrupts.
+ */
+bool rcu_sys_is_idle(void)
+{
+	static struct rcu_sysidle_head rsh;
+	int rss = ACCESS_ONCE(full_sysidle_state);
+
+	WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);
+
+	/* Handle small-system case by doing a full scan of CPUs. */
+	if (nr_cpu_ids <= RCU_SYSIDLE_SMALL && rss < RCU_SYSIDLE_FULL) {
+		int cpu;
+		bool isidle = true;
+		unsigned long maxj = jiffies - ULONG_MAX / 4;
+		struct rcu_data *rdp;
+
+		/* Scan all the CPUs looking for nonidle CPUs. */
+		for_each_possible_cpu(cpu) {
+			rdp = per_cpu_ptr(rcu_sysidle_state->rda, cpu);
+			rcu_sysidle_check_cpu(rdp, &isidle, &maxj);
+			if (!isidle)
+				break;
+		}
+		rcu_sysidle_report(rcu_sysidle_state, isidle, maxj);
+		rss = ACCESS_ONCE(full_sysidle_state);
+	}
+
+	/* If this is the first observation of an idle period, record it. */
+	if (rss == RCU_SYSIDLE_FULL) {
+		rss = cmpxchg(&full_sysidle_state,
+			      RCU_SYSIDLE_FULL, RCU_SYSIDLE_FULL_NOTED);
+		return rss == RCU_SYSIDLE_FULL;
+	}
+
+	smp_mb(); /* ensure rss load happens before later caller actions. */
+
+	/* If already fully idle, tell the caller (in case of races). */
+	if (rss == RCU_SYSIDLE_FULL_NOTED)
+		return true;
+
+	/*
+	 * If we aren't there yet, and a grace period is not in flight,
+	 * initiate a grace period.  Either way, tell the caller that
+	 * we are not there yet.
+	 */
+	if (nr_cpu_ids > RCU_SYSIDLE_SMALL &&
+	    !rcu_gp_in_progress(rcu_sysidle_state) &&
+	    !rsh.inuse && xchg(&rsh.inuse, 1) == 0)
+		call_rcu(&rsh.rh, rcu_sysidle_cb);
+	return false;
 }
 
 /*
@@ -2494,6 +2734,21 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
 {
 }
 
+static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
+				  unsigned long *maxj)
+{
+}
+
+static bool is_sysidle_rcu_state(struct rcu_state *rsp)
+{
+	return false;
+}
+
+static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
+			       unsigned long maxj)
+{
+}
+
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
 {
 }
-- 
1.8.1.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH RFC nohz_full v2 7/7] nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU
  2013-06-28 20:10 ` [PATCH RFC nohz_full v2 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
                     ` (4 preceding siblings ...)
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine Paul E. McKenney
@ 2013-06-28 20:10   ` Paul E. McKenney
  5 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2013-06-28 20:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, darren, fweisbec, sbw,
	Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Because RCU's quiescent-state-forcing mechanism is used to drive the
full-system-idle state machine, and because this mechanism is executed
by RCU's grace-period kthreads, this commit forces these kthreads to
run on the timekeeping CPU (tick_do_timer_cpu).  To do otherwise would
mean that the RCU grace-period kthreads would force the system into
non-idle state every time they drove the state machine, which would
be just a bit on the futile side.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rcutree.c        |  1 +
 kernel/rcutree.h        |  1 +
 kernel/rcutree_plugin.h | 20 +++++++++++++++++++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 06cfd75..ad9a5ec 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1286,6 +1286,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
 	struct rcu_data *rdp;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
+	rcu_bind_gp_kthread();
 	raw_spin_lock_irq(&rnp->lock);
 	rsp->gp_flags = 0; /* Clear all flags: New grace period. */
 
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 7326a3c..1602c21 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -558,6 +558,7 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
 static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
 				  unsigned long *maxj);
 static bool is_sysidle_rcu_state(struct rcu_state *rsp);
+static void rcu_bind_gp_kthread(void);
 static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
 			       unsigned long maxj);
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 9b60df5..4478cfd 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2543,7 +2543,7 @@ static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
 	if (!*isidle || rdp->rsp != rcu_sysidle_state ||
 	    cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
 		return;
-	/* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
+	WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);
 
 	/*
 	 * Pick up current idle and NMI-nesting counters, check.  We check
@@ -2577,6 +2577,20 @@ static bool is_sysidle_rcu_state(struct rcu_state *rsp)
 }
 
 /*
+ * Bind the grace-period kthread for the sysidle flavor of RCU to the
+ * timekeeping CPU.
+ */
+static void rcu_bind_gp_kthread(void)
+{
+	int cpu = ACCESS_ONCE(tick_do_timer_cpu);
+
+	if (cpu < 0 || cpu >= nr_cpu_ids)
+		return;
+	if (raw_smp_processor_id() != cpu)
+		set_cpus_allowed_ptr(current, cpumask_of(cpu));
+}
+
+/*
  * Return a delay in jiffies based on the number of CPUs, rcu_node
  * leaf fanout, and jiffies tick rate.  The idea is to allow larger
  * systems more time to transition to full-idle state in order to
@@ -2744,6 +2758,10 @@ static bool is_sysidle_rcu_state(struct rcu_state *rsp)
 	return false;
 }
 
+static void rcu_bind_gp_kthread(void)
+{
+}
+
 static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
 			       unsigned long maxj)
 {
-- 
1.8.1.5



* Re: [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle
  2013-06-28 20:09 [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle Paul E. McKenney
  2013-06-28 20:10 ` [PATCH RFC nohz_full v2 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
@ 2013-07-01 15:19 ` Andi Kleen
  2013-07-01 16:03   ` Paul E. McKenney
  2013-07-01 19:43 ` Christoph Lameter
  2 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2013-07-01 15:19 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	fweisbec, sbw

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>
> The individual patches are as follows:
>
> 1.	Add a CONFIG_NO_HZ_FULL_SYSIDLE Kconfig parameter to enable
> 	this feature.  Kernels built with CONFIG_NO_HZ_FULL_SYSIDLE=n
> 	act exactly as they do today.

Is this extra CONFIG option really needed? RCU already has a bewildering
variety of CONFIG options, and the no-idle CONFIG is also pretty complicated. 
At some point no one will know how to configure kernels anymore if 
these nontrivial, complicated trade-off CONFIGs keep spreading.

The facility sounds like a good thing in general. Just enable
it implicitly with NO_HZ_SYSIDLE? 

If you want a switch for testing I would advise a sysctl or sysfs knob

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data " Paul E. McKenney
@ 2013-07-01 15:31     ` Josh Triplett
  2013-07-01 15:52       ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Josh Triplett @ 2013-07-01 15:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
	sbw

On Fri, Jun 28, 2013 at 01:10:17PM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> This commit adds fields to the rcu_dyntick structure that are used to
> detect idle CPUs.  These new fields differ from the existing ones in
> that the existing ones consider a CPU executing in user mode to be idle,
> where the new ones consider CPUs executing in user mode to be busy.

Can you explain, both in the commit messages and in the comments added
by the next commit, *why* this code doesn't consider userspace a
quiescent state?

- Josh Triplett


* Re: [PATCH RFC nohz_full v2 3/7] nohz_full: Add per-CPU idle-state tracking
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 3/7] nohz_full: Add per-CPU idle-state tracking Paul E. McKenney
@ 2013-07-01 15:33     ` Josh Triplett
  0 siblings, 0 replies; 32+ messages in thread
From: Josh Triplett @ 2013-07-01 15:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
	sbw

On Fri, Jun 28, 2013 at 01:10:18PM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> This commit adds the code that updates the rcu_dyntick structure's
> new fields to track the per-CPU idle state based on interrupts and
> transitions into and out of the idle loop (NMIs are ignored because NMI
> handlers cannot cleanly read out the time anyway).  This code is similar
> to the code that maintains RCU's idea of per-CPU idleness, but differs
> in that RCU treats CPUs running in user mode as idle, where this new
> code does not.
[...]
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -2380,6 +2380,76 @@ static void rcu_kick_nohz_cpu(int cpu)
>  #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
>  
>  /*
> + * Invoked to note exit from irq or task transition to idle.  Note that
> + * usermode execution does -not- count as idle here!  The caller must
> + * have disabled interrupts.

Can you explain in the comments why this code doesn't treat userspace as
idle/quiesced?

- Josh Triplett


* Re: [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
  2013-07-01 15:31     ` Josh Triplett
@ 2013-07-01 15:52       ` Paul E. McKenney
  2013-07-01 18:16         ` Josh Triplett
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2013-07-01 15:52 UTC (permalink / raw)
  To: Josh Triplett
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
	sbw

On Mon, Jul 01, 2013 at 08:31:50AM -0700, Josh Triplett wrote:
> On Fri, Jun 28, 2013 at 01:10:17PM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > This commit adds fields to the rcu_dyntick structure that are used to
> > detect idle CPUs.  These new fields differ from the existing ones in
> > that the existing ones consider a CPU executing in user mode to be idle,
> > where the new ones consider CPUs executing in user mode to be busy.
> 
> Can you explain, both in the commit messages and in the comments added
> by the next commit, *why* this code doesn't consider userspace a
> quiescent state?

Good point!  Does the following explain it?

	Although one of RCU's quiescent states is usermode execution,
	it is not a full-system idle state.  This is because the purpose
	of the full-system idle state is not RCU, but rather determining
	when accurate timekeeping can safely be disabled.  Whenever
	accurate timekeeping is required in a CONFIG_NO_HZ_FULL kernel,
	at least one CPU must keep the scheduling-clock tick going.
	If even one CPU is executing in user mode, accurate timekeeping
	is required, particularly for architectures where gettimeofday()
	and friends do not enter the kernel.  Only when all CPUs are
	really and truly idle can accurate timekeeping be disabled,
	allowing all CPUs to turn off the scheduling clock interrupt,
	thus greatly improving energy efficiency.

	This naturally raises the question "Why is this code in RCU rather
	than in timekeeping?", and the answer is that RCU has the data
	and infrastructure to efficiently make this determination.

							Thanx, Paul



* Re: [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle
  2013-07-01 15:19 ` [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle Andi Kleen
@ 2013-07-01 16:03   ` Paul E. McKenney
  2013-07-01 16:19     ` Andi Kleen
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2013-07-01 16:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	fweisbec, sbw

On Mon, Jul 01, 2013 at 08:19:25AM -0700, Andi Kleen wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> >
> > The individual patches are as follows:
> >
> > 1.	Add a CONFIG_NO_HZ_FULL_SYSIDLE Kconfig parameter to enable
> > 	this feature.  Kernels built with CONFIG_NO_HZ_FULL_SYSIDLE=n
> > 	act exactly as they do today.
> 
> Is this extra CONFIG option really needed? RCU already has a bewildering
> variety of CONFIG options, and the no-idle CONFIG is also pretty complicated. 
> At some point no one will know how to configure kernels anymore if 
> these nontrivial, complicated trade-off CONFIGs keep spreading.
> 
> The facility sounds like a good thing in general. Just enable
> it implicitly with NO_HZ_SYSIDLE? 

I am guessing that you want CONFIG_NO_HZ_FULL to implicitly enable
the sysidle code so that CONFIG_NO_HZ_FULL_SYSIDLE can be eliminated.
I will be happy to take that step, but only after I gain full confidence
in the correctness and performance of the sysidle code.

> If you want a switch for testing I would advise a sysctl or sysfs knob

This would work well for the correctness part, but not for the performance
part.

							Thanx, Paul



* Re: [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle
  2013-07-01 16:03   ` Paul E. McKenney
@ 2013-07-01 16:19     ` Andi Kleen
  2013-07-01 19:19       ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2013-07-01 16:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andi Kleen, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt, dhowells,
	edumazet, darren, fweisbec, sbw

> I am guessing that you want CONFIG_NO_HZ_FULL to implicitly enable
> the sysidle code so that CONFIG_NO_HZ_FULL_SYSIDLE can be eliminated.
> I will be happy to take that step, but only after I gain full confidence
> in the correctness and performance of the sysidle code.

FWIW, if you want useful testing you need to enable it by default
(as part of NO_IDLE_HZ) anyway. Users will most likely pick
whatever is "default" in Kconfig.

> > If you want a switch for testing I would advise a sysctl or sysfs knob
> 
> This would work well for the correctness part, but not for the performance
> part.

What performance part? 

Are you saying this adds so many checks to hot paths that normal runtime
if() with a flag is too expensive?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine Paul E. McKenney
@ 2013-07-01 16:35     ` Frederic Weisbecker
  2013-07-01 18:10       ` Paul E. McKenney
  2013-07-01 16:53     ` Frederic Weisbecker
  2013-07-01 21:38     ` Frederic Weisbecker
  2 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2013-07-01 16:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Fri, Jun 28, 2013 at 01:10:21PM -0700, Paul E. McKenney wrote:
>  /*
> + * Unconditionally force exit from full system-idle state.  This is
> + * invoked when a normal CPU exits idle, but must be called separately
> + * for the timekeeping CPU (tick_do_timer_cpu).  The reason for this
> + * is that the timekeeping CPU is permitted to take scheduling-clock
> + * interrupts while the system is in system-idle state, and of course
> + * rcu_sysidle_exit() has no way of distinguishing a scheduling-clock
> + * interrupt from any other type of interrupt.
> + */
> +void rcu_sysidle_force_exit(void)
> +{
> +	int oldstate = ACCESS_ONCE(full_sysidle_state);
> +	int newoldstate;
> +
> +	/*
> +	 * Each pass through the following loop attempts to exit full
> +	 * system-idle state.  If contention proves to be a problem,
> +	 * a trylock-based contention tree could be used here.
> +	 */
> +	while (oldstate > RCU_SYSIDLE_SHORT) {
> +		newoldstate = cmpxchg(&full_sysidle_state,
> +				      oldstate, RCU_SYSIDLE_NOT);
> +		if (oldstate == newoldstate &&
> +		    oldstate == RCU_SYSIDLE_FULL_NOTED) {
> +			rcu_kick_nohz_cpu(tick_do_timer_cpu);
> +			return; /* We cleared it, done! */
> +		}
> +		oldstate = newoldstate;
> +	}
> +	smp_mb(); /* Order initial oldstate fetch vs. later non-idle work. */
> +}
> +
> +/*
>   * Invoked to note entry to irq or task transition from idle.  Note that
>   * usermode execution does -not- count as idle here!  The caller must
>   * have disabled interrupts.
> @@ -2474,6 +2506,214 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
>  	atomic_inc(&rdtp->dynticks_idle);
>  	smp_mb__after_atomic_inc();
>  	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
> +
> +	/*
> +	 * If we are the timekeeping CPU, we are permitted to be non-idle
> +	 * during a system-idle state.  This must be the case, because
> +	 * the timekeeping CPU has to take scheduling-clock interrupts
> +	 * during the time that the system is transitioning to full
> +	 * system-idle state.  This means that the timekeeping CPU must
> +	 * invoke rcu_sysidle_force_exit() directly if it does anything
> +	 * more than take a scheduling-clock interrupt.
> +	 */
> +	if (smp_processor_id() == tick_do_timer_cpu)
> +		return;
> +
> +	/* Update system-idle state: We are clearly no longer fully idle! */
> +	rcu_sysidle_force_exit();
> +}
> +
> +/*
> + * Check to see if the current CPU is idle.  Note that usermode execution
> + * does not count as idle.  The caller must have disabled interrupts.
> + */
> +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> +				  unsigned long *maxj)
> +{
> +	int cur;
> +	int curnmi;
> +	unsigned long j;
> +	struct rcu_dynticks *rdtp = rdp->dynticks;
> +
> +	/*
> +	 * If some other CPU has already reported non-idle, if this is
> +	 * not the flavor of RCU that tracks sysidle state, or if this
> +	 * is an offline or the timekeeping CPU, nothing to do.
> +	 */
> +	if (!*isidle || rdp->rsp != rcu_sysidle_state ||
> +	    cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
> +		return;
> +	/* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
> +
> +	/*
> +	 * Pick up current idle and NMI-nesting counters, check.  We check
> +	 * for NMIs using RCU's main ->dynticks counter.  This works because
> +	 * any time ->dynticks has its low bit set, ->dynticks_idle will
> +	 * too -- unless the only reason that ->dynticks's low bit is set
> +	 * is due to an NMI from idle.  Which is exactly the case we need
> +	 * to account for.
> +	 */
> +	cur = atomic_read(&rdtp->dynticks_idle);
> +	curnmi = atomic_read(&rdtp->dynticks);
> +	if ((cur & 0x1) || (curnmi & 0x1)) {

I think you wanted to ignore NMIs this time because they don't read walltime?

By the way, they can still read jiffies, but unlike irq_enter(), nmi_enter()
doesn't catch up with missed jiffies updates. So the behaviour doesn't change
compared to !NO_HZ_FULL.

> +		*isidle = 0; /* We are not idle! */
> +		return;
> +	}
> +	smp_mb(); /* Read counters before timestamps. */
> +
> +	/* Pick up timestamps. */
> +	j = ACCESS_ONCE(rdtp->dynticks_idle_jiffies);
> +	/* If this CPU entered idle more recently, update maxj timestamp. */
> +	if (ULONG_CMP_LT(*maxj, j))
> +		*maxj = j;

I'm a bit confused by the ordering, so I'm probably going to ask a silly question.

What makes sure that we are not reading a stale value of rdtp->dynticks_idle
in the following scenario:

    CPU 0                          CPU 1
    
                                   //CPU 1 idle
                                   //rdtp(1)->dynticks_idle == 0

sysidle_check_cpu(CPU 1) {
    rdtp(1)->dynticks_idle == 0
}
cmpxchg(full_sysidle_state, 
        ...RCU_SYSIDLE_SHORT)
                                   rcu_irq_exit() {
                                         rdtp(1)->dynticks_idle = 1
                                         smp_mb()
                                         rcu_sysidle_force_exit() {
                                            full_sysidle_state == RCU_SYSIDLE_SHORT
                                            // no cmpxchg
                                            smp_mb()
                                   ...

[1]
sysidle_check_cpu(CPU 1) {
    rdtp(1)->dynticks_idle == 0
}

cmpxchg(RCU_SYSIDLE_FULL, ...)

[2]
sysidle_check_cpu(CPU 1) {
    rdtp(1)->dynticks_idle == 0
}

cmpxchg(RCU_SYSIDLE_FULL_NOTED, ...)


I mean that in [1] and [2] I can't see anything in the ordering that guarantees that we see
the new value rdtp(1)->dynticks_idle == 1.


* Re: [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine Paul E. McKenney
  2013-07-01 16:35     ` Frederic Weisbecker
@ 2013-07-01 16:53     ` Frederic Weisbecker
  2013-07-01 18:17       ` Paul E. McKenney
  2013-07-01 21:38     ` Frederic Weisbecker
  2 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2013-07-01 16:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Fri, Jun 28, 2013 at 01:10:21PM -0700, Paul E. McKenney wrote:
> +
> +/*
> + * Check to see if the system is fully idle, other than the timekeeping CPU.
> + * The caller must have disabled interrupts.
> + */
> +bool rcu_sys_is_idle(void)
> +{
> +	static struct rcu_sysidle_head rsh;
> +	int rss = ACCESS_ONCE(full_sysidle_state);
> +
> +	WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);
> +
> +	/* Handle small-system case by doing a full scan of CPUs. */
> +	if (nr_cpu_ids <= RCU_SYSIDLE_SMALL && rss < RCU_SYSIDLE_FULL) {
> +		int cpu;
> +		bool isidle = true;
> +		unsigned long maxj = jiffies - ULONG_MAX / 4;
> +		struct rcu_data *rdp;
> +
> +		/* Scan all the CPUs looking for nonidle CPUs. */
> +		for_each_possible_cpu(cpu) {
> +			rdp = per_cpu_ptr(rcu_sysidle_state->rda, cpu);
> +			rcu_sysidle_check_cpu(rdp, &isidle, &maxj);
> +			if (!isidle)
> +				break;
> +		}
> +		rcu_sysidle_report(rcu_sysidle_state, isidle, maxj);

To clarify my worries, I would expect the following ordering:

    CPU 0                                           CPU 1

    cmpxchg(global_state, RCU_SYSIDLE_..., ...)     write rdtp(1)->idle_dynticks
    smp_mb() // implied by cmpxchg                  smp_mb()
    read rdtp(1)-idle_dynticks                      cmpxchg(global_state, RCU_SYSIDLE_NONE)

This example doesn't really make sense, because CPU 0 only wants to change the global state after
checking rdtp(1)->idle_dynticks. So we obviously want it the other way around, as you did. But then
I can't find an ordering that guarantees that exactly one of the following happens:

* CPU 0 sees CPU 1 update when it wakes up and so we reset to RCU_SYSIDLE_NONE from CPU 0
* CPU 1 wakes up and reset to RCU_SYSIDLE_NONE and sends an IPI to CPU 0

I mean, that's what the code does, but I don't see the ordering that guarantees we can't
fall into some intermediate state where CPU 0 sees CPU 1 as idle even though it has already
woken up, without the IPI being sent.

I'm sure you'll point me to my mistaken review :)


> +		rss = ACCESS_ONCE(full_sysidle_state);
> +	}
> +
> +	/* If this is the first observation of an idle period, record it. */
> +	if (rss == RCU_SYSIDLE_FULL) {
> +		rss = cmpxchg(&full_sysidle_state,
> +			      RCU_SYSIDLE_FULL, RCU_SYSIDLE_FULL_NOTED);
> +		return rss == RCU_SYSIDLE_FULL;
> +	}
> +
> +	smp_mb(); /* ensure rss load happens before later caller actions. */
> +
> +	/* If already fully idle, tell the caller (in case of races). */
> +	if (rss == RCU_SYSIDLE_FULL_NOTED)
> +		return true;
> +
> +	/*
> +	 * If we aren't there yet, and a grace period is not in flight,
> +	 * initiate a grace period.  Either way, tell the caller that
> +	 * we are not there yet.
> +	 */
> +	if (nr_cpu_ids > RCU_SYSIDLE_SMALL &&
> +	    !rcu_gp_in_progress(rcu_sysidle_state) &&
> +	    !rsh.inuse && xchg(&rsh.inuse, 1) == 0)
> +		call_rcu(&rsh.rh, rcu_sysidle_cb);
> +	return false;
>  }
>  
>  /*
> @@ -2494,6 +2734,21 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
>  {
>  }
>  
> +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> +				  unsigned long *maxj)
> +{
> +}
> +
> +static bool is_sysidle_rcu_state(struct rcu_state *rsp)
> +{
> +	return false;
> +}
> +
> +static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> +			       unsigned long maxj)
> +{
> +}
> +
>  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
>  {
>  }
> -- 
> 1.8.1.5
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine
  2013-07-01 16:35     ` Frederic Weisbecker
@ 2013-07-01 18:10       ` Paul E. McKenney
  2013-07-01 20:55         ` Frederic Weisbecker
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2013-07-01 18:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Mon, Jul 01, 2013 at 06:35:31PM +0200, Frederic Weisbecker wrote:
> On Fri, Jun 28, 2013 at 01:10:21PM -0700, Paul E. McKenney wrote:
> >  /*
> > + * Unconditionally force exit from full system-idle state.  This is
> > + * invoked when a normal CPU exits idle, but must be called separately
> > + * for the timekeeping CPU (tick_do_timer_cpu).  The reason for this
> > + * is that the timekeeping CPU is permitted to take scheduling-clock
> > + * interrupts while the system is in system-idle state, and of course
> > + * rcu_sysidle_exit() has no way of distinguishing a scheduling-clock
> > + * interrupt from any other type of interrupt.
> > + */
> > +void rcu_sysidle_force_exit(void)
> > +{
> > +	int oldstate = ACCESS_ONCE(full_sysidle_state);
> > +	int newoldstate;
> > +
> > +	/*
> > +	 * Each pass through the following loop attempts to exit full
> > +	 * system-idle state.  If contention proves to be a problem,
> > +	 * a trylock-based contention tree could be used here.
> > +	 */
> > +	while (oldstate > RCU_SYSIDLE_SHORT) {
> > +		newoldstate = cmpxchg(&full_sysidle_state,
> > +				      oldstate, RCU_SYSIDLE_NOT);
> > +		if (oldstate == newoldstate &&
> > +		    oldstate == RCU_SYSIDLE_FULL_NOTED) {
> > +			rcu_kick_nohz_cpu(tick_do_timer_cpu);
> > +			return; /* We cleared it, done! */
> > +		}
> > +		oldstate = newoldstate;
> > +	}
> > +	smp_mb(); /* Order initial oldstate fetch vs. later non-idle work. */
> > +}
> > +
> > +/*
> >   * Invoked to note entry to irq or task transition from idle.  Note that
> >   * usermode execution does -not- count as idle here!  The caller must
> >   * have disabled interrupts.
> > @@ -2474,6 +2506,214 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
> >  	atomic_inc(&rdtp->dynticks_idle);
> >  	smp_mb__after_atomic_inc();
> >  	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
> > +
> > +	/*
> > +	 * If we are the timekeeping CPU, we are permitted to be non-idle
> > +	 * during a system-idle state.  This must be the case, because
> > +	 * the timekeeping CPU has to take scheduling-clock interrupts
> > +	 * during the time that the system is transitioning to full
> > +	 * system-idle state.  This means that the timekeeping CPU must
> > +	 * invoke rcu_sysidle_force_exit() directly if it does anything
> > +	 * more than take a scheduling-clock interrupt.
> > +	 */
> > +	if (smp_processor_id() == tick_do_timer_cpu)
> > +		return;
> > +
> > +	/* Update system-idle state: We are clearly no longer fully idle! */
> > +	rcu_sysidle_force_exit();
> > +}
> > +
> > +/*
> > + * Check to see if the current CPU is idle.  Note that usermode execution
> > + * does not count as idle.  The caller must have disabled interrupts.
> > + */
> > +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> > +				  unsigned long *maxj)
> > +{
> > +	int cur;
> > +	int curnmi;
> > +	unsigned long j;
> > +	struct rcu_dynticks *rdtp = rdp->dynticks;
> > +
> > +	/*
> > +	 * If some other CPU has already reported non-idle, if this is
> > +	 * not the flavor of RCU that tracks sysidle state, or if this
> > +	 * is an offline or the timekeeping CPU, nothing to do.
> > +	 */
> > +	if (!*isidle || rdp->rsp != rcu_sysidle_state ||
> > +	    cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
> > +		return;
> > +	/* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
> > +
> > +	/*
> > +	 * Pick up current idle and NMI-nesting counters, check.  We check
> > +	 * for NMIs using RCU's main ->dynticks counter.  This works because
> > +	 * any time ->dynticks has its low bit set, ->dynticks_idle will
> > +	 * too -- unless the only reason that ->dynticks's low bit is set
> > +	 * is due to an NMI from idle.  Which is exactly the case we need
> > +	 * to account for.
> > +	 */
> > +	cur = atomic_read(&rdtp->dynticks_idle);
> > +	curnmi = atomic_read(&rdtp->dynticks);
> > +	if ((cur & 0x1) || (curnmi & 0x1)) {
> 
> I think you wanted to ignore NMIs this time because they don't read walltime?
> 
> By the way they can still read jiffies, but unlike irq_enter(), nmi_enter()
> don't catch up with missing jiffies update. So the behaviour doesn't change
> compared to !NO_HZ_FULL.

You are right, I missed this when ripping out NMI handling.  Will fix!

> > +		*isidle = 0; /* We are not idle! */
> > +		return;
> > +	}
> > +	smp_mb(); /* Read counters before timestamps. */
> > +
> > +	/* Pick up timestamps. */
> > +	j = ACCESS_ONCE(rdtp->dynticks_idle_jiffies);
> > +	/* If this CPU entered idle more recently, update maxj timestamp. */
> > +	if (ULONG_CMP_LT(*maxj, j))
> > +		*maxj = j;
> 
> So I'm a bit confused with the ordering so I'm probably going to ask a silly question.
> 
> What makes sure that we are not reading a stale value of rdtp->dynticks_idle
> in the following scenario:
> 
>     CPU 0                          CPU 1
>     
>                                    //CPU 1 idle
>                                    //rdtp(1)->dynticks_idle == 0
> 
> sysidle_check_cpu(CPU 1) {
>     rdtp(1)->dynticks_idle == 0
> }
> cmpxchg(full_sysidle_state, 
>         ...RCU_SYSIDLE_SHORT)
>                                    rcu_irq_exit() {

rcu_irq_enter(), right?

>                                          rdtp(1)->dynticks_idle = 1
>                                          smp_mb()
>                                          rcu_sysidle_force_exit() {
>                                             full_sysidle_state == RCU_SYSIDLE_SHORT
>                                             // no cmpxchg
>                                             smp_mb()
>                                    ...
> 
> [1]
> sysidle_check_cpu(CPU 1) {
>     rdtp(1)->dynticks_idle == 0
> }
> 
> cmpxchg(RCU_SYSIDLE_FULL, ...)

You know, I had an RCU_SYSIDLE_LONG state for this purpose, but later
convinced myself that I didn't need it.  :-/

Time to go put it back in, and thank you for your careful review!

							Thanx, Paul

> [2]
> sysidle_check_cpu(CPU 1) {
>     rdtp(1)->dynticks_idle == 0
> }
> 
> cmpxchg(RCU_SYSIDLE_FULL_NOTED, ...)
> 
> 
> I mean in [1] and [2] I can't see something in the ordering that guarantees that we see
> the new value rdtp(1)->dynticks_idle == 1.
> 



* Re: [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
  2013-07-01 15:52       ` Paul E. McKenney
@ 2013-07-01 18:16         ` Josh Triplett
  2013-07-01 18:23           ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Josh Triplett @ 2013-07-01 18:16 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
	sbw

On Mon, Jul 01, 2013 at 08:52:20AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 01, 2013 at 08:31:50AM -0700, Josh Triplett wrote:
> > On Fri, Jun 28, 2013 at 01:10:17PM -0700, Paul E. McKenney wrote:
> > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > 
> > > This commit adds fields to the rcu_dyntick structure that are used to
> > > detect idle CPUs.  These new fields differ from the existing ones in
> > > that the existing ones consider a CPU executing in user mode to be idle,
> > > where the new ones consider CPUs executing in user mode to be busy.
> > 
> > Can you explain, both in the commit messages and in the comments added
> > by the next commit, *why* this code doesn't consider userspace a
> > quiescent state?
> 
> Good point!  Does the following explain it?
> 
> 	Although one of RCU's quiescent states is usermode execution,
> 	it is not a full-system idle state.  This is because the purpose
> 	of the full-system idle state is not RCU, but rather determining
> 	when accurate timekeeping can safely be disabled.  Whenever
> 	accurate timekeeping is required in a CONFIG_NO_HZ_FULL kernel,
> 	at least one CPU must keep the scheduling-clock tick going.
> 	If even one CPU is executing in user mode, accurate timekeeping
> 	is required, particularly for architectures where gettimeofday()
> 	and friends do not enter the kernel.  Only when all CPUs are
> 	really and truly idle can accurate timekeeping be disabled,
> 	allowing all CPUs to turn off the scheduling clock interrupt,
> 	thus greatly improving energy efficiency.
> 
> 	This naturally raises the question "Why is this code in RCU rather
> 	than in timekeeping?", and the answer is that RCU has the data
> 	and infrastructure to efficiently make this determination.

Good explanation, thanks.

This also naturally raises the question "How can we let userspace get
accurate time without forcing a timer tick?".

- Josh Triplett


* Re: [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine
  2013-07-01 16:53     ` Frederic Weisbecker
@ 2013-07-01 18:17       ` Paul E. McKenney
  0 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2013-07-01 18:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Mon, Jul 01, 2013 at 06:53:57PM +0200, Frederic Weisbecker wrote:
> On Fri, Jun 28, 2013 at 01:10:21PM -0700, Paul E. McKenney wrote:
> > +
> > +/*
> > + * Check to see if the system is fully idle, other than the timekeeping CPU.
> > + * The caller must have disabled interrupts.
> > + */
> > +bool rcu_sys_is_idle(void)
> > +{
> > +	static struct rcu_sysidle_head rsh;
> > +	int rss = ACCESS_ONCE(full_sysidle_state);
> > +
> > +	WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);
> > +
> > +	/* Handle small-system case by doing a full scan of CPUs. */
> > +	if (nr_cpu_ids <= RCU_SYSIDLE_SMALL && rss < RCU_SYSIDLE_FULL) {
> > +		int cpu;
> > +		bool isidle = true;
> > +		unsigned long maxj = jiffies - ULONG_MAX / 4;
> > +		struct rcu_data *rdp;
> > +
> > +		/* Scan all the CPUs looking for nonidle CPUs. */
> > +		for_each_possible_cpu(cpu) {
> > +			rdp = per_cpu_ptr(rcu_sysidle_state->rda, cpu);
> > +			rcu_sysidle_check_cpu(rdp, &isidle, &maxj);
> > +			if (!isidle)
> > +				break;
> > +		}
> > +		rcu_sysidle_report(rcu_sysidle_state, isidle, maxj);
> 
> To clarify my worries, I would expect the following ordering:
> 
>     CPU 0                                           CPU 1
> 
>     cmpxchg(global_state, RCU_SYSIDLE_..., ...)     write rdtp(1)->idle_dynticks
>     smp_mb() // implied by cmpxchg                  smp_mb()
>     read rdtp(1)->idle_dynticks                     cmpxchg(global_state, RCU_SYSIDLE_NONE)
> 
> This example doesn't really make sense because CPU 0 only wants to change global state after
> checking rdtp(1)->idle_dynticks. So we obviously want it the other way around, as you did. But then
> I can't find an ordering that makes sure that only one of the following happens:
> 
> * CPU 0 sees CPU 1's update when it wakes up, and so we reset to RCU_SYSIDLE_NONE from CPU 0
> * CPU 1 wakes up, resets to RCU_SYSIDLE_NONE, and sends an IPI to CPU 0
> 
> I mean that's what the code does, but I don't get the ordering that guarantees we can't
> fall into some intermediate state when CPU 0 sees CPU 1 idle whereas it already woke up without
> sending the IPI.
> 
> I'm sure you'll point me to my mistaken review :)

Not at all -- as noted in previous email, you have helped me understand
why my original design included an RCU_SYSIDLE_LONG state.

								Thanx, Paul

> > +		rss = ACCESS_ONCE(full_sysidle_state);
> > +	}
> > +
> > +	/* If this is the first observation of an idle period, record it. */
> > +	if (rss == RCU_SYSIDLE_FULL) {
> > +		rss = cmpxchg(&full_sysidle_state,
> > +			      RCU_SYSIDLE_FULL, RCU_SYSIDLE_FULL_NOTED);
> > +		return rss == RCU_SYSIDLE_FULL;
> > +	}
> > +
> > +	smp_mb(); /* ensure rss load happens before later caller actions. */
> > +
> > +	/* If already fully idle, tell the caller (in case of races). */
> > +	if (rss == RCU_SYSIDLE_FULL_NOTED)
> > +		return true;
> > +
> > +	/*
> > +	 * If we aren't there yet, and a grace period is not in flight,
> > +	 * initiate a grace period.  Either way, tell the caller that
> > +	 * we are not there yet.
> > +	 */
> > +	if (nr_cpu_ids > RCU_SYSIDLE_SMALL &&
> > +	    !rcu_gp_in_progress(rcu_sysidle_state) &&
> > +	    !rsh.inuse && xchg(&rsh.inuse, 1) == 0)
> > +		call_rcu(&rsh.rh, rcu_sysidle_cb);
> > +	return false;
> >  }
> >  
> >  /*
> > @@ -2494,6 +2734,21 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
> >  {
> >  }
> >  
> > +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> > +				  unsigned long *maxj)
> > +{
> > +}
> > +
> > +static bool is_sysidle_rcu_state(struct rcu_state *rsp)
> > +{
> > +	return false;
> > +}
> > +
> > +static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> > +			       unsigned long maxj)
> > +{
> > +}
> > +
> >  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
> >  {
> >  }
> > -- 
> > 1.8.1.5
> > 
> 



* Re: [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
  2013-07-01 18:16         ` Josh Triplett
@ 2013-07-01 18:23           ` Paul E. McKenney
  2013-07-01 18:34             ` Josh Triplett
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2013-07-01 18:23 UTC (permalink / raw)
  To: Josh Triplett
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
	sbw

On Mon, Jul 01, 2013 at 11:16:01AM -0700, Josh Triplett wrote:
> On Mon, Jul 01, 2013 at 08:52:20AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 01, 2013 at 08:31:50AM -0700, Josh Triplett wrote:
> > > On Fri, Jun 28, 2013 at 01:10:17PM -0700, Paul E. McKenney wrote:
> > > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > > 
> > > > This commit adds fields to the rcu_dyntick structure that are used to
> > > > detect idle CPUs.  These new fields differ from the existing ones in
> > > > that the existing ones consider a CPU executing in user mode to be idle,
> > > > where the new ones consider CPUs executing in user mode to be busy.
> > > 
> > > Can you explain, both in the commit messages and in the comments added
> > > by the next commit, *why* this code doesn't consider userspace a
> > > quiescent state?
> > 
> > Good point!  Does the following explain it?
> > 
> > 	Although one of RCU's quiescent states is usermode execution,
> > 	it is not a full-system idle state.  This is because the purpose
> > 	of the full-system idle state is not RCU, but rather determining
> > 	when accurate timekeeping can safely be disabled.  Whenever
> > 	accurate timekeeping is required in a CONFIG_NO_HZ_FULL kernel,
> > 	at least one CPU must keep the scheduling-clock tick going.
> > 	If even one CPU is executing in user mode, accurate timekeeping
> > 	is required, particularly for architectures where gettimeofday()
> > 	and friends do not enter the kernel.  Only when all CPUs are
> > 	really and truly idle can accurate timekeeping be disabled,
> > 	allowing all CPUs to turn off the scheduling clock interrupt,
> > 	thus greatly improving energy efficiency.
> > 
> > 	This naturally raises the question "Why is this code in RCU rather
> > 	than in timekeeping?", and the answer is that RCU has the data
> > 	and infrastructure to efficiently make this determination.
> 
> Good explanation, thanks.
> 
> This also naturally raises the question "How can we let userspace get
> accurate time without forcing a timer tick?".

We don't.  ;-)

Without CONFIG_NO_HZ_FULL, if a CPU is running in user mode, that CPU
takes scheduling-clock interrupts.  User-mode code will therefore always
see accurate time.  For some definition of "accurate", anyway.

With CONFIG_NO_HZ_FULL and without CONFIG_NO_HZ_FULL_SYSIDLE, a single
designated CPU will always be taking scheduling-clock interrupts, which
again ensures that user-mode code will always see accurate time.

With both CONFIG_NO_HZ_FULL and CONFIG_NO_HZ_FULL_SYSIDLE, if
any CPU other than the timekeeping CPU is nonidle (where "nonidle"
includes usermode execution), then the timekeeping CPU will be taking
scheduling-clock interrupts, yet again ensuring that user-mode code will
always see accurate time.  If all CPUs are idle (in other words, we are
in RCU_SYSIDLE_FULL_NOTED state and the timekeeping CPU is also idle),
scheduling-clock interrupts will be globally disabled.  Or will be,
once I fix the bug noted by Frederic.

I am guessing that you would like this added to the explanation?  ;-)

							Thanx, Paul



* Re: [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
  2013-07-01 18:23           ` Paul E. McKenney
@ 2013-07-01 18:34             ` Josh Triplett
  2013-07-01 19:16               ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Josh Triplett @ 2013-07-01 18:34 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
	sbw

On Mon, Jul 01, 2013 at 11:23:26AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 01, 2013 at 11:16:01AM -0700, Josh Triplett wrote:
> > On Mon, Jul 01, 2013 at 08:52:20AM -0700, Paul E. McKenney wrote:
> > > On Mon, Jul 01, 2013 at 08:31:50AM -0700, Josh Triplett wrote:
> > > > On Fri, Jun 28, 2013 at 01:10:17PM -0700, Paul E. McKenney wrote:
> > > > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > > > 
> > > > > This commit adds fields to the rcu_dyntick structure that are used to
> > > > > detect idle CPUs.  These new fields differ from the existing ones in
> > > > > that the existing ones consider a CPU executing in user mode to be idle,
> > > > > where the new ones consider CPUs executing in user mode to be busy.
> > > > 
> > > > Can you explain, both in the commit messages and in the comments added
> > > > by the next commit, *why* this code doesn't consider userspace a
> > > > quiescent state?
> > > 
> > > Good point!  Does the following explain it?
> > > 
> > > 	Although one of RCU's quiescent states is usermode execution,
> > > 	it is not a full-system idle state.  This is because the purpose
> > > 	of the full-system idle state is not RCU, but rather determining
> > > 	when accurate timekeeping can safely be disabled.  Whenever
> > > 	accurate timekeeping is required in a CONFIG_NO_HZ_FULL kernel,
> > > 	at least one CPU must keep the scheduling-clock tick going.
> > > 	If even one CPU is executing in user mode, accurate timekeeping
> > > 	is required, particularly for architectures where gettimeofday()
> > > 	and friends do not enter the kernel.  Only when all CPUs are
> > > 	really and truly idle can accurate timekeeping be disabled,
> > > 	allowing all CPUs to turn off the scheduling clock interrupt,
> > > 	thus greatly improving energy efficiency.
> > > 
> > > 	This naturally raises the question "Why is this code in RCU rather
> > > 	than in timekeeping?", and the answer is that RCU has the data
> > > 	and infrastructure to efficiently make this determination.
> > 
> > Good explanation, thanks.
> > 
> > This also naturally raises the question "How can we let userspace get
> > accurate time without forcing a timer tick?".
> 
> We don't.  ;-)

We don't currently, hence my question about how we can. :)

> Without CONFIG_NO_HZ_FULL, if a CPU is running in user mode, that CPU
> takes scheduling-clock interrupts.  User-mode code will therefore always
> see accurate time.  For some definition of "accurate", anyway.
> 
> With CONFIG_NO_HZ_FULL and without CONFIG_NO_HZ_FULL_SYSIDLE, a single
> designated CPU will always be taking scheduling-clock interrupts, which
> again ensures that user-mode code will always see accurate time.
> 
> With both CONFIG_NO_HZ_FULL and CONFIG_NO_HZ_FULL_SYSIDLE, if
> any CPU other than the timekeeping CPU is nonidle (where "nonidle"
> includes usermode execution), then the timekeeping CPU will be taking
> scheduling-clock interrupts, yet again ensuring that user-mode code will
> always see accurate time.  If all CPUs are idle (in other words, we are
> in RCU_SYSIDLE_FULL_NOTED state and the timekeeping CPU is also idle),
> scheduling-clock interrupts will be globally disabled.  Or will be,
> once I fix the bug noted by Frederic.
> 
> I am guessing that you would like this added to the explanation?  ;-)

Seemed pretty clear already from your previous explanation above, but
since you've taken the time to write it... :)

- Josh Triplett


* Re: [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
  2013-07-01 18:34             ` Josh Triplett
@ 2013-07-01 19:16               ` Paul E. McKenney
  2013-07-02  5:10                 ` Mike Galbraith
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2013-07-01 19:16 UTC (permalink / raw)
  To: Josh Triplett
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, dhowells, edumazet, darren, fweisbec,
	sbw

On Mon, Jul 01, 2013 at 11:34:13AM -0700, Josh Triplett wrote:
> On Mon, Jul 01, 2013 at 11:23:26AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 01, 2013 at 11:16:01AM -0700, Josh Triplett wrote:
> > > On Mon, Jul 01, 2013 at 08:52:20AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Jul 01, 2013 at 08:31:50AM -0700, Josh Triplett wrote:
> > > > > On Fri, Jun 28, 2013 at 01:10:17PM -0700, Paul E. McKenney wrote:
> > > > > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > > > > 
> > > > > > This commit adds fields to the rcu_dyntick structure that are used to
> > > > > > detect idle CPUs.  These new fields differ from the existing ones in
> > > > > > that the existing ones consider a CPU executing in user mode to be idle,
> > > > > > where the new ones consider CPUs executing in user mode to be busy.
> > > > > 
> > > > > Can you explain, both in the commit messages and in the comments added
> > > > > by the next commit, *why* this code doesn't consider userspace a
> > > > > quiescent state?
> > > > 
> > > > Good point!  Does the following explain it?
> > > > 
> > > > 	Although one of RCU's quiescent states is usermode execution,
> > > > 	it is not a full-system idle state.  This is because the purpose
> > > > 	of the full-system idle state is not RCU, but rather determining
> > > > 	when accurate timekeeping can safely be disabled.  Whenever
> > > > 	accurate timekeeping is required in a CONFIG_NO_HZ_FULL kernel,
> > > > 	at least one CPU must keep the scheduling-clock tick going.
> > > > 	If even one CPU is executing in user mode, accurate timekeeping
> > > > 	is required, particularly for architectures where gettimeofday()
> > > > 	and friends do not enter the kernel.  Only when all CPUs are
> > > > 	really and truly idle can accurate timekeeping be disabled,
> > > > 	allowing all CPUs to turn off the scheduling clock interrupt,
> > > > 	thus greatly improving energy efficiency.
> > > > 
> > > > 	This naturally raises the question "Why is this code in RCU rather
> > > > 	than in timekeeping?", and the answer is that RCU has the data
> > > > 	and infrastructure to efficiently make this determination.
> > > 
> > > Good explanation, thanks.
> > > 
> > > This also naturally raises the question "How can we let userspace get
> > > accurate time without forcing a timer tick?".
> > 
> > We don't.  ;-)
> 
> We don't currently, hence my question about how we can. :)

Per-CPU atomic clocks?  Hardware-synchronized time across all CPUs?
Hardware detection of the full-system idle state, allowing the hardware
synchronization to be shut down in that case?  (But of course started with
full synchronization whenever something went non-idle!)  Use a periodic
hrtimer instead of the scheduling-clock tick?  (Aside from the fact that
the scheduling-clock tick is already an hrtimer in some configurations...)

The last might not be as silly as it sounds.  I believe that timekeeping
can tolerate an interrupt rate much slower than HZ, so if the timekeeping
CPU figured out that the only reason for the scheduling-clock tick
was timekeeping, it could run the tick much more slowly.  That said,
I wouldn't blame Frederic for deferring that particular increment of
complexity for a bit.  ;-)

> > Without CONFIG_NO_HZ_FULL, if a CPU is running in user mode, that CPU
> > takes scheduling-clock interrupts.  User-mode code will therefore always
> > see accurate time.  For some definition of "accurate", anyway.
> > 
> > With CONFIG_NO_HZ_FULL and without CONFIG_NO_HZ_FULL_SYSIDLE, a single
> > designated CPU will always be taking scheduling-clock interrupts, which
> > again ensures that user-mode code will always see accurate time.
> > 
> > With both CONFIG_NO_HZ_FULL and CONFIG_NO_HZ_FULL_SYSIDLE, if
> > any CPU other than the timekeeping CPU is nonidle (where "nonidle"
> > includes usermode execution), then the timekeeping CPU will be taking
> > scheduling-clock interrupts, yet again ensuring that user-mode code will
> > always see accurate time.  If all CPUs are idle (in other words, we are
> > in RCU_SYSIDLE_FULL_NOTED state and the timekeeping CPU is also idle),
> > scheduling-clock interrupts will be globally disabled.  Or will be,
> > once I fix the bug noted by Frederic.
> > 
> > I am guessing that you would like this added to the explanation?  ;-)
> 
> Seemed pretty clear already from your previous explanation above, but
> since you've taken the time to write it... :)

If the above sufficed, the additional verbiage might add more confusion
than understanding.  ;-)

							Thanx, Paul



* Re: [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle
  2013-07-01 16:19     ` Andi Kleen
@ 2013-07-01 19:19       ` Paul E. McKenney
  0 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2013-07-01 19:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	fweisbec, sbw

On Mon, Jul 01, 2013 at 06:19:10PM +0200, Andi Kleen wrote:
> > I am guessing that you want CONFIG_NO_HZ_FULL to implicitly enable
> > the sysidle code so that CONFIG_NO_HZ_FULL_SYSIDLE can be eliminated.
> > I will be happy to take that step, but only after I gain full confidence
> > in the correctness and performance of the sysidle code.
> 
> FWIW if you want useful testing you need to enable it by default
> (as part of NO_IDLE_HZ) anyways. Users will most likely pick
> whatever is "default" in Kconfig.

At this point in the process, I want testers who choose to test.  Hapless
victim testers come later.  Well, other than randconfig testers, but I
consider them to be voluntary hapless victims.  ;-)

> > > If you want a switch for testing I would advise a sysctl or sysfs knob
> > 
> > This would work well for the correctness part, but not for the performance
> > part.
> 
> What performance part? 
> 
> Are you saying this adds so many checks to hot paths that normal runtime
> if() with a flag is too expensive?

I am saying that I don't know, and that I want to make it easy for people
to find out by comparing to the base configuration -- and for me to be
able to detect this from their .config file.

							Thanx, Paul



* Re: [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle
  2013-06-28 20:09 [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle Paul E. McKenney
  2013-06-28 20:10 ` [PATCH RFC nohz_full v2 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
  2013-07-01 15:19 ` [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle Andi Kleen
@ 2013-07-01 19:43 ` Christoph Lameter
  2013-07-01 19:56   ` Paul E. McKenney
  2 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2013-07-01 19:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	fweisbec, sbw

On Fri, 28 Jun 2013, Paul E. McKenney wrote:

> Unfortunately, the timekeeping CPU continues taking scheduling-clock
> interrupts even when all other CPUs are completely idle, which is
> not so good for energy efficiency and battery lifetime.  Clearly, it
> would be good to turn off the timekeeping CPU's scheduling-clock tick
> when all CPUs are completely idle.  This is conceptually simple, but
> we also need good performance and scalability on large systems, which
> rules out implementations based on frequently updated global counts of
> non-idle CPUs as well as implementations that frequently scan all CPUs.
> Nevertheless, we need a single global indicator in order to keep the
> overhead of checking acceptably low.

Can we turn off timekeeping when no cpu needs time in adaptive mode?
Setting breakpoints in the VDSO could force timekeeping on again whenever
something needs time. Would this not be simpler?


* Re: [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle
  2013-07-01 19:43 ` Christoph Lameter
@ 2013-07-01 19:56   ` Paul E. McKenney
  2013-07-01 20:24     ` Christoph Lameter
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2013-07-01 19:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	fweisbec, sbw

On Mon, Jul 01, 2013 at 07:43:47PM +0000, Christoph Lameter wrote:
> On Fri, 28 Jun 2013, Paul E. McKenney wrote:
> 
> > Unfortunately, the timekeeping CPU continues taking scheduling-clock
> > interrupts even when all other CPUs are completely idle, which is
> > not so good for energy efficiency and battery lifetime.  Clearly, it
> > would be good to turn off the timekeeping CPU's scheduling-clock tick
> > when all CPUs are completely idle.  This is conceptually simple, but
> > we also need good performance and scalability on large systems, which
> > rules out implementations based on frequently updated global counts of
> > non-idle CPUs as well as implementations that frequently scan all CPUs.
> > Nevertheless, we need a single global indicator in order to keep the
> > overhead of checking acceptably low.
> 
> Can we turn off timekeeping when no cpu needs time in adaptive mode?
> Setting breakpoints in the VDSO could force timekeeping on again whenever
> something needs time. Would this not be simpler?

Might be.  But what causes the breakpoints to be set on a system where
there is one CPU-bound nohz_full user-mode task with all other CPUs idle?
Or are you suggesting taking a breakpoint trap on each timekeeping access
to VDSO?

							Thanx, Paul



* Re: [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle
  2013-07-01 19:56   ` Paul E. McKenney
@ 2013-07-01 20:24     ` Christoph Lameter
  2013-07-01 20:43       ` Thomas Gleixner
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2013-07-01 20:24 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	fweisbec, sbw

On Mon, 1 Jul 2013, Paul E. McKenney wrote:

> On Mon, Jul 01, 2013 at 07:43:47PM +0000, Christoph Lameter wrote:
> > On Fri, 28 Jun 2013, Paul E. McKenney wrote:
> >
> > > Unfortunately, timekeeping CPU continues taking scheduling-clock
> > > interrupts even when all other CPUs are completely idle, which is
> > > not so good for energy efficiency and battery lifetime.  Clearly, it
> > > would be good to turn off the timekeeping CPU's scheduling-clock tick
> > > when all CPUs are completely idle.  This is conceptually simple, but
> > > we also need good performance and scalability on large systems, which
> > > rules out implementations based on frequently updated global counts of
> > > non-idle CPUs as well as implementations that frequently scan all CPUs.
> > > Nevertheless, we need a single global indicator in order to keep the
> > > overhead of checking acceptably low.
> >
> > Can we turn off timekeeping when no cpu needs time in adaptive mode?
> > Setting breakpoints in the VDSO could force timekeeping on again whenever
> > something needs time. Would this not be simpler?
>
> Might be.  But what causes the breakpoints to be set on a system where
> there is one CPU-bound nohz_full user-mode task with all other CPUs idle?
> Or are you suggesting taking a breakpoint trap on each timekeeping access
> to VDSO?

Well, when the tick notices that it is the last one still enabled on the
system and that it could disable itself, it would set the breakpoint and
then turn off the tick on that last processor. The code invoked by the
breakpoint would re-enable tick processing, update the time, and then
use that new info to return the correct time.



* Re: [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle
  2013-07-01 20:24     ` Christoph Lameter
@ 2013-07-01 20:43       ` Thomas Gleixner
  0 siblings, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2013-07-01 20:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, peterz, rostedt, dhowells,
	edumazet, darren, fweisbec, sbw

On Mon, 1 Jul 2013, Christoph Lameter wrote:

> On Mon, 1 Jul 2013, Paul E. McKenney wrote:
> 
> > On Mon, Jul 01, 2013 at 07:43:47PM +0000, Christoph Lameter wrote:
> > > On Fri, 28 Jun 2013, Paul E. McKenney wrote:
> > >
> > > > Unfortunately, timekeeping CPU continues taking scheduling-clock
> > > > interrupts even when all other CPUs are completely idle, which is
> > > > not so good for energy efficiency and battery lifetime.  Clearly, it
> > > > would be good to turn off the timekeeping CPU's scheduling-clock tick
> > > > when all CPUs are completely idle.  This is conceptually simple, but
> > > > we also need good performance and scalability on large systems, which
> > > > rules out implementations based on frequently updated global counts of
> > > > non-idle CPUs as well as implementations that frequently scan all CPUs.
> > > > Nevertheless, we need a single global indicator in order to keep the
> > > > overhead of checking acceptably low.
> > >
> > > Can we turn off timekeeping when no cpu needs time in adaptive mode?
> > > Setting breakpoints in the VDSO could force timekeeping on again whenever
> > > something needs time. Would this not be simpler?
> >
> > Might be.  But what causes the breakpoints to be set on a system where
> > there is one CPU-bound nohz_full user-mode task with all other CPUs idle?
> > Or are you suggesting taking a breakpoint trap on each timekeeping access
> > to VDSO?
> 
> Well, when the tick notices that it is the last one still enabled on the
> system and that it could disable itself, it would set the breakpoint and
> then turn off the tick on that last processor. The code invoked by the
> breakpoint would re-enable tick processing, update the time, and then
> use that new info to return the correct time.

Do we really care about that corner case? Not really. 

Also, if there is one CPU which runs that tight loop in userspace, and
timekeeping gets shut down after X seconds, then whenever that task
wants to have a fast and bounded look at the current time every X*N
seconds, it's going to take a trap and spend unbounded time in the
kernel. That's quite contrary to the goal we wanted to achieve in the
first place.

Though if ALL CPUs are idle, i.e. none of the tickless CPUs is running
any workload, there is a point in switching off the time(house)keeping
CPU as well in order to save power. But that's not a problem we solve
with a breakpoint in the VDSO.

Thanks,

	tglx


* Re: [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine
  2013-07-01 18:10       ` Paul E. McKenney
@ 2013-07-01 20:55         ` Frederic Weisbecker
  0 siblings, 0 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2013-07-01 20:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Mon, Jul 01, 2013 at 11:10:40AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 01, 2013 at 06:35:31PM +0200, Frederic Weisbecker wrote:
> > What makes sure that we are not reading a stale value of rdtp->dynticks_idle
> > in the following scenario:
> > 
> >     CPU 0                          CPU 1
> >     
> >                                    //CPU 1 idle
> >                                    //rdtp(1)->dynticks_idle == 0
> > 
> > sysidle_check_cpu(CPU 1) {
> >     rdtp(1)->dynticks_idle == 0
> > }
> > cmpxchg(full_sysidle_state, 
> >         ...RCU_SYSIDLE_SHORT)
> >                                    rcu_irq_exit() {
> 
> rcu_irq_enter(), right?
>

Woops, I meant rcu_idle_exit(). But yeah rcu_irq_enter() as well.

Thanks.


* Re: [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine
  2013-06-28 20:10   ` [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine Paul E. McKenney
  2013-07-01 16:35     ` Frederic Weisbecker
  2013-07-01 16:53     ` Frederic Weisbecker
@ 2013-07-01 21:38     ` Frederic Weisbecker
  2013-07-01 22:51       ` Paul E. McKenney
  2 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2013-07-01 21:38 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Fri, Jun 28, 2013 at 01:10:21PM -0700, Paul E. McKenney wrote:
> +/*
> + * Check to see if the system is fully idle, other than the timekeeping CPU.
> + * The caller must have disabled interrupts.
> + */
> +bool rcu_sys_is_idle(void)

Where is this function called? I can't find any caller in the patchset.

> +{
> +	static struct rcu_sysidle_head rsh;
> +	int rss = ACCESS_ONCE(full_sysidle_state);
> +
> +	WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);
> +
> +	/* Handle small-system case by doing a full scan of CPUs. */
> +	if (nr_cpu_ids <= RCU_SYSIDLE_SMALL && rss < RCU_SYSIDLE_FULL) {
> +		int cpu;
> +		bool isidle = true;
> +		unsigned long maxj = jiffies - ULONG_MAX / 4;
> +		struct rcu_data *rdp;
> +
> +		/* Scan all the CPUs looking for nonidle CPUs. */
> +		for_each_possible_cpu(cpu) {
> +			rdp = per_cpu_ptr(rcu_sysidle_state->rda, cpu);
> +			rcu_sysidle_check_cpu(rdp, &isidle, &maxj);
> +			if (!isidle)
> +				break;
> +		}
> +		rcu_sysidle_report(rcu_sysidle_state, isidle, maxj);
> +		rss = ACCESS_ONCE(full_sysidle_state);
> +	}
> +
> +	/* If this is the first observation of an idle period, record it. */
> +	if (rss == RCU_SYSIDLE_FULL) {
> +		rss = cmpxchg(&full_sysidle_state,
> +			      RCU_SYSIDLE_FULL, RCU_SYSIDLE_FULL_NOTED);
> +		return rss == RCU_SYSIDLE_FULL;
> +	}
> +
> +	smp_mb(); /* ensure rss load happens before later caller actions. */
> +
> +	/* If already fully idle, tell the caller (in case of races). */
> +	if (rss == RCU_SYSIDLE_FULL_NOTED)
> +		return true;
> +
> +	/*
> +	 * If we aren't there yet, and a grace period is not in flight,
> +	 * initiate a grace period.  Either way, tell the caller that
> +	 * we are not there yet.
> +	 */
> +	if (nr_cpu_ids > RCU_SYSIDLE_SMALL &&
> +	    !rcu_gp_in_progress(rcu_sysidle_state) &&
> +	    !rsh.inuse && xchg(&rsh.inuse, 1) == 0)
> +		call_rcu(&rsh.rh, rcu_sysidle_cb);

So this starts an RCU/RCU_preempt grace period to force the global idle
detection.

Would it make sense to create a new RCU flavour instead for this purpose?
Its only per CPU quiescent state would be when the timekeeping CPU ticks
(from rcu_check_callbacks()). The other CPUs would only complete their
QS request through extended quiescent states, ie: only the timekeeping
CPU is burdened.

This way you can enqueue a callback that is executed at the end of the
grace period for that flavour, and that callback can help drive the
state machine somehow.

Now maybe that's not a good idea, because this adds some overhead to
any code that uses for_each_rcu_flavour().


> +	return false;
>  }
>  
>  /*
> @@ -2494,6 +2734,21 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
>  {
>  }
>  
> +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> +				  unsigned long *maxj)
> +{
> +}
> +
> +static bool is_sysidle_rcu_state(struct rcu_state *rsp)
> +{
> +	return false;
> +}
> +
> +static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> +			       unsigned long maxj)
> +{
> +}
> +
>  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
>  {
>  }
> -- 
> 1.8.1.5
> 


* Re: [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine
  2013-07-01 21:38     ` Frederic Weisbecker
@ 2013-07-01 22:51       ` Paul E. McKenney
  0 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2013-07-01 22:51 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, dhowells, edumazet, darren,
	sbw

On Mon, Jul 01, 2013 at 11:38:34PM +0200, Frederic Weisbecker wrote:
> On Fri, Jun 28, 2013 at 01:10:21PM -0700, Paul E. McKenney wrote:
> > +/*
> > + * Check to see if the system is fully idle, other than the timekeeping CPU.
> > + * The caller must have disabled interrupts.
> > + */
> > +bool rcu_sys_is_idle(void)
> 
> Where is this function called? I can't find any caller in the patchset.

It should be called at the point where the timekeeping CPU is going
idle.  If it returns true, then the timekeeping CPU can shut off the
scheduling-clock interrupt.

> > +{
> > +	static struct rcu_sysidle_head rsh;
> > +	int rss = ACCESS_ONCE(full_sysidle_state);
> > +
> > +	WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);
> > +
> > +	/* Handle small-system case by doing a full scan of CPUs. */
> > +	if (nr_cpu_ids <= RCU_SYSIDLE_SMALL && rss < RCU_SYSIDLE_FULL) {
> > +		int cpu;
> > +		bool isidle = true;
> > +		unsigned long maxj = jiffies - ULONG_MAX / 4;
> > +		struct rcu_data *rdp;
> > +
> > +		/* Scan all the CPUs looking for nonidle CPUs. */
> > +		for_each_possible_cpu(cpu) {
> > +			rdp = per_cpu_ptr(rcu_sysidle_state->rda, cpu);
> > +			rcu_sysidle_check_cpu(rdp, &isidle, &maxj);
> > +			if (!isidle)
> > +				break;
> > +		}
> > +		rcu_sysidle_report(rcu_sysidle_state, isidle, maxj);
> > +		rss = ACCESS_ONCE(full_sysidle_state);
> > +	}
> > +
> > +	/* If this is the first observation of an idle period, record it. */
> > +	if (rss == RCU_SYSIDLE_FULL) {
> > +		rss = cmpxchg(&full_sysidle_state,
> > +			      RCU_SYSIDLE_FULL, RCU_SYSIDLE_FULL_NOTED);
> > +		return rss == RCU_SYSIDLE_FULL;
> > +	}
> > +
> > +	smp_mb(); /* ensure rss load happens before later caller actions. */
> > +
> > +	/* If already fully idle, tell the caller (in case of races). */
> > +	if (rss == RCU_SYSIDLE_FULL_NOTED)
> > +		return true;
> > +
> > +	/*
> > +	 * If we aren't there yet, and a grace period is not in flight,
> > +	 * initiate a grace period.  Either way, tell the caller that
> > +	 * we are not there yet.
> > +	 */
> > +	if (nr_cpu_ids > RCU_SYSIDLE_SMALL &&
> > +	    !rcu_gp_in_progress(rcu_sysidle_state) &&
> > +	    !rsh.inuse && xchg(&rsh.inuse, 1) == 0)
> > +		call_rcu(&rsh.rh, rcu_sysidle_cb);
> 
> So this starts an RCU/RCU_preempt grace period to force the global idle
> detection.
> 
> Would it make sense to create a new RCU flavour instead for this purpose?
> Its only per CPU quiescent state would be when the timekeeping CPU ticks
> (from rcu_check_callbacks()). The other CPUs would only complete their
> QS request through extended quiescent states, ie: only the timekeeping
> CPU is burdened.
> 
> > This way you can enqueue a callback that is executed at the end of the
> > grace period for that flavour, and that callback can help drive the
> > state machine somehow.
> > 
> > Now maybe that's not a good idea, because this adds some overhead to
> > any code that uses for_each_rcu_flavour().

Also, it adds overhead.  The most active RCU flavor will almost always
have grace periods in flight, so the above call_rcu() should be invoked
only rarely on most systems.

							Thanx, Paul

> > +	return false;
> >  }
> >  
> >  /*
> > @@ -2494,6 +2734,21 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
> >  {
> >  }
> >  
> > +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> > +				  unsigned long *maxj)
> > +{
> > +}
> > +
> > +static bool is_sysidle_rcu_state(struct rcu_state *rsp)
> > +{
> > +	return false;
> > +}
> > +
> > +static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> > +			       unsigned long maxj)
> > +{
> > +}
> > +
> >  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
> >  {
> >  }
> > -- 
> > 1.8.1.5
> > 
> 



* Re: [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
  2013-07-01 19:16               ` Paul E. McKenney
@ 2013-07-02  5:10                 ` Mike Galbraith
  2013-07-02  5:58                   ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Mike Galbraith @ 2013-07-02  5:10 UTC (permalink / raw)
  To: paulmck
  Cc: Josh Triplett, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, niv, tglx, peterz, rostedt, dhowells,
	edumazet, darren, fweisbec, sbw

On Mon, 2013-07-01 at 12:16 -0700, Paul E. McKenney wrote: 
> On Mon, Jul 01, 2013 at 11:34:13AM -0700, Josh Triplett wrote:

> > > > This also naturally raises the question "How can we let userspace get
> > > > accurate time without forcing a timer tick?".
> > > 
> > > We don't.  ;-)
> > 
> > We don't currently, hence my question about how we can. :)
> 
> Per-CPU atomic clocks?

Great idea, who needs timekeeping code. 

http://www.euronews.com/2013/04/02/swiss-sets-sights-on-miniscule-atomic-clock/

-Mike



* Re: [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
  2013-07-02  5:10                 ` Mike Galbraith
@ 2013-07-02  5:58                   ` Paul E. McKenney
  0 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2013-07-02  5:58 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Josh Triplett, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, niv, tglx, peterz, rostedt, dhowells,
	edumazet, darren, fweisbec, sbw

On Tue, Jul 02, 2013 at 07:10:52AM +0200, Mike Galbraith wrote:
> On Mon, 2013-07-01 at 12:16 -0700, Paul E. McKenney wrote: 
> > On Mon, Jul 01, 2013 at 11:34:13AM -0700, Josh Triplett wrote:
> 
> > > > > This also naturally raises the question "How can we let userspace get
> > > > > accurate time without forcing a timer tick?".
> > > > 
> > > > We don't.  ;-)
> > > 
> > > We don't currently, hence my question about how we can. :)
> > 
> > Per-CPU atomic clocks?
> 
> Great idea, who needs timekeeping code. 
> 
> http://www.euronews.com/2013/04/02/swiss-sets-sights-on-miniscule-atomic-clock/

"in theory you’ll only need to set it once every 3,000 years, providing
of course your battery lasts that long" ;-) ;-) ;-)

							Thanx, Paul



end of thread, other threads:[~2013-07-02  5:58 UTC | newest]

Thread overview: 32+ messages
2013-06-28 20:09 [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle Paul E. McKenney
2013-06-28 20:10 ` [PATCH RFC nohz_full v2 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
2013-06-28 20:10   ` [PATCH RFC nohz_full v2 2/7] nohz_full: Add rcu_dyntick data " Paul E. McKenney
2013-07-01 15:31     ` Josh Triplett
2013-07-01 15:52       ` Paul E. McKenney
2013-07-01 18:16         ` Josh Triplett
2013-07-01 18:23           ` Paul E. McKenney
2013-07-01 18:34             ` Josh Triplett
2013-07-01 19:16               ` Paul E. McKenney
2013-07-02  5:10                 ` Mike Galbraith
2013-07-02  5:58                   ` Paul E. McKenney
2013-06-28 20:10   ` [PATCH RFC nohz_full v2 3/7] nohz_full: Add per-CPU idle-state tracking Paul E. McKenney
2013-07-01 15:33     ` Josh Triplett
2013-06-28 20:10   ` [PATCH RFC nohz_full v2 4/7] nohz_full: Add full-system idle states and variables Paul E. McKenney
2013-06-28 20:10   ` [PATCH RFC nohz_full v2 5/7] nohz_full: Add full-system-idle arguments to API Paul E. McKenney
2013-06-28 20:10   ` [PATCH RFC nohz_full v2 6/7] nohz_full: Add full-system-idle state machine Paul E. McKenney
2013-07-01 16:35     ` Frederic Weisbecker
2013-07-01 18:10       ` Paul E. McKenney
2013-07-01 20:55         ` Frederic Weisbecker
2013-07-01 16:53     ` Frederic Weisbecker
2013-07-01 18:17       ` Paul E. McKenney
2013-07-01 21:38     ` Frederic Weisbecker
2013-07-01 22:51       ` Paul E. McKenney
2013-06-28 20:10   ` [PATCH RFC nohz_full v2 7/7] nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU Paul E. McKenney
2013-07-01 15:19 ` [PATCH RFC nohz_full 0/7] v2 Provide infrastructure for full-system idle Andi Kleen
2013-07-01 16:03   ` Paul E. McKenney
2013-07-01 16:19     ` Andi Kleen
2013-07-01 19:19       ` Paul E. McKenney
2013-07-01 19:43 ` Christoph Lameter
2013-07-01 19:56   ` Paul E. McKenney
2013-07-01 20:24     ` Christoph Lameter
2013-07-01 20:43       ` Thomas Gleixner
