linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel)
@ 2012-04-30 23:54 Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 01/41] nohz: Separate idle sleeping time accounting from nohz logic Frederic Weisbecker
                   ` (41 more replies)
  0 siblings, 42 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Hi,

A summary of what this is about can be found here:
 https://lkml.org/lkml/2011/8/15/245

Changes since v2:

	* Correctly handle update of the cpuset mask when the nohz
	flag is set (courtesy of Hakan Akkan)

	* Handle the rq clock. This introduces a new update_nohz_rq_clock()
	helper that sites making use of rq->clock can call when they want to
	ensure the rq clock doesn't carry a stale value due to the targeted CPU
	being tickless. A tickless CPU doesn't maintain its rq clock
	through periodic scheduler_tick()->update_rq_clock() calls.
	I think I've added this manual call to every call site that needs it.
	I may have missed some though, or we may forget to handle tickless
	CPUs in future code. So I think we need to add some automated debug
	checks to catch that.

	* Fix a warning reported by Gilad Ben Yossef: we flush the time on
	pre-schedule, then the tick is restarted from an IPI before we do
	it manually on post-schedule. From there we try to flush the time
	again, but ts->jiffies_saved_whence is set to SAVED_NONE because
	we already flushed the time. This triggered a spurious warning.

Still a lot to do. I'm now maintaining the TODO list there:
 https://github.com/fweisbec/linux-dynticks/wiki/TODO

The git branch can be fetched from:

 git://github.com/fweisbec/linux-dynticks.git
	nohz/cpuset-v3

  
Frederic Weisbecker (40):
  nohz: Separate idle sleeping time accounting from nohz logic
  nohz: Make nohz API agnostic against idle ticks cputime accounting
  nohz: Rename ts->idle_tick to ts->last_tick
  nohz: Move nohz load balancer selection into idle logic
  nohz: Move ts->idle_calls incrementation into strict idle logic
  nohz: Move next idle expiry time record into idle logic area
  cpuset: Set up interface for nohz flag
  nohz: Try not to give the timekeeping duty to an adaptive tickless
    cpu
  x86: New cpuset nohz irq vector
  nohz: Adaptive tick stop and restart on nohz cpuset
  nohz/cpuset: Don't turn off the tick if rcu needs it
  nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued
  nohz/cpuset: Don't stop the tick if posix cpu timers are running
  nohz/cpuset: Restart tick when nohz flag is cleared on cpuset
  nohz/cpuset: Restart the tick if printk needs it
  rcu: Restart the tick on non-responding adaptive nohz CPUs
  rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  nohz: Generalize tickless cpu time accounting
  nohz/cpuset: Account user and system times in adaptive nohz mode
  nohz/cpuset: New API to flush cputimes on nohz cpusets
  nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting
    leader
  nohz/cpuset: Flush cputimes on procfs stat file read
  nohz/cpuset: Flush cputimes for getrusage() and times() syscalls
  x86: Syscall hooks for nohz cpusets
  x86: Exception hooks for nohz cpusets
  x86: Add adaptive tickless hooks on do_notify_resume()
  nohz: Don't restart the tick before scheduling to idle
  sched: Comment on rq->clock correctness in ttwu_do_wakeup() in nohz
  sched: Update rq clock on nohz CPU before migrating tasks
  sched: Update rq clock on nohz CPU before setting fair group shares
  sched: Update rq clock on tickless CPUs before calling
    check_preempt_curr()
  sched: Update rq clock earlier in unthrottle_cfs_rq
  sched: Update clock of nohz busiest rq before balancing
  sched: Update rq clock before idle balancing
  sched: Update nohz rq clock before searching busiest group on load
    balancing
  rcu: New rcu_user_enter() and rcu_user_exit() APIs
  rcu: New rcu_user_enter_irq() and rcu_user_exit_irq() APIs
  rcu: Switch to extended quiescent state in userspace from nohz cpuset
  nohz: Exit RCU idle mode when we schedule before resuming userspace
  nohz/cpuset: Disable under some configs

Hakan Akkan (1):
  nohz/cpuset: enable addition&removal of cpus while in adaptive nohz
    mode

 arch/Kconfig                       |    3 +
 arch/x86/Kconfig                   |    1 +
 arch/x86/include/asm/entry_arch.h  |    3 +
 arch/x86/include/asm/hw_irq.h      |    7 +
 arch/x86/include/asm/irq_vectors.h |    2 +
 arch/x86/include/asm/smp.h         |   11 +
 arch/x86/include/asm/thread_info.h |   10 +-
 arch/x86/kernel/entry_64.S         |   12 +-
 arch/x86/kernel/irqinit.c          |    4 +
 arch/x86/kernel/ptrace.c           |   10 +
 arch/x86/kernel/signal.c           |    3 +
 arch/x86/kernel/smp.c              |   26 ++
 arch/x86/kernel/traps.c            |   20 +-
 arch/x86/mm/fault.c                |   13 +-
 fs/proc/array.c                    |    2 +
 include/linux/cpuset.h             |   29 ++
 include/linux/kernel_stat.h        |    2 +
 include/linux/posix-timers.h       |    1 +
 include/linux/rcupdate.h           |    8 +
 include/linux/sched.h              |   10 +-
 include/linux/tick.h               |   75 ++++--
 init/Kconfig                       |    8 +
 kernel/cpuset.c                    |  141 +++++++++-
 kernel/exit.c                      |    8 +
 kernel/posix-cpu-timers.c          |   12 +
 kernel/printk.c                    |   15 +-
 kernel/rcutree.c                   |  150 ++++++++--
 kernel/sched/core.c                |  112 ++++++++-
 kernel/sched/fair.c                |   39 +++-
 kernel/sched/sched.h               |   29 ++
 kernel/softirq.c                   |    6 +-
 kernel/sys.c                       |    6 +
 kernel/time/tick-sched.c           |  542 +++++++++++++++++++++++++++++-------
 kernel/time/timer_list.c           |    7 +-
 kernel/timer.c                     |    2 +-
 35 files changed, 1148 insertions(+), 181 deletions(-)

-- 
1.7.5.4


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 01/41] nohz: Separate idle sleeping time accounting from nohz logic
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 02/41] nohz: Make nohz API agnostic against idle ticks cputime accounting Frederic Weisbecker
                   ` (40 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

As we plan to be able to stop the tick outside the idle task, we
need to prepare for separating the nohz logic from idle. As a start,
this pulls the idle sleeping time accounting out of the tick
stop/restart API and into the callers on idle entry/exit.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-sched.c |   78 +++++++++++++++++++++++++--------------------
 1 files changed, 43 insertions(+), 35 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3526038..a1ca479 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -271,10 +271,10 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
-static void tick_nohz_stop_sched_tick(struct tick_sched *ts)
+static void tick_nohz_stop_sched_tick(struct tick_sched *ts, ktime_t now)
 {
 	unsigned long seq, last_jiffies, next_jiffies, delta_jiffies;
-	ktime_t last_update, expires, now;
+	ktime_t last_update, expires;
 	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
 	u64 time_delta;
 	int cpu;
@@ -282,8 +282,6 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts)
 	cpu = smp_processor_id();
 	ts = &per_cpu(tick_cpu_sched, cpu);
 
-	now = tick_nohz_start_idle(cpu, ts);
-
 	/*
 	 * If this cpu is offline and it is the one which updates
 	 * jiffies, then give up the assignment and let it be taken by
@@ -444,6 +442,14 @@ out:
 	ts->sleep_length = ktime_sub(dev->next_event, now);
 }
 
+static void __tick_nohz_idle_enter(struct tick_sched *ts)
+{
+	ktime_t now;
+
+	now = tick_nohz_start_idle(smp_processor_id(), ts);
+	tick_nohz_stop_sched_tick(ts, now);
+}
+
 /**
  * tick_nohz_idle_enter - stop the idle tick from the idle task
  *
@@ -479,7 +485,7 @@ void tick_nohz_idle_enter(void)
 	 * update of the idle time accounting in tick_nohz_start_idle().
 	 */
 	ts->inidle = 1;
-	tick_nohz_stop_sched_tick(ts);
+	__tick_nohz_idle_enter(ts);
 
 	local_irq_enable();
 }
@@ -499,7 +505,7 @@ void tick_nohz_irq_exit(void)
 	if (!ts->inidle)
 		return;
 
-	tick_nohz_stop_sched_tick(ts);
+	__tick_nohz_idle_enter(ts);
 }
 
 /**
@@ -540,39 +546,11 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 	}
 }
 
-/**
- * tick_nohz_idle_exit - restart the idle tick from the idle task
- *
- * Restart the idle tick when the CPU is woken up from idle
- * This also exit the RCU extended quiescent state. The CPU
- * can use RCU again after this function is called.
- */
-void tick_nohz_idle_exit(void)
+static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
-	int cpu = smp_processor_id();
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	unsigned long ticks;
 #endif
-	ktime_t now;
-
-	local_irq_disable();
-
-	WARN_ON_ONCE(!ts->inidle);
-
-	ts->inidle = 0;
-
-	if (ts->idle_active || ts->tick_stopped)
-		now = ktime_get();
-
-	if (ts->idle_active)
-		tick_nohz_stop_idle(cpu, now);
-
-	if (!ts->tick_stopped) {
-		local_irq_enable();
-		return;
-	}
-
 	/* Update jiffies first */
 	select_nohz_load_balancer(0);
 	tick_do_update_jiffies64(now);
@@ -599,6 +577,36 @@ void tick_nohz_idle_exit(void)
 	ts->idle_exittime = now;
 
 	tick_nohz_restart(ts, now);
+}
+
+/**
+ * tick_nohz_idle_exit - restart the idle tick from the idle task
+ *
+ * Restart the idle tick when the CPU is woken up from idle
+ * This also exit the RCU extended quiescent state. The CPU
+ * can use RCU again after this function is called.
+ */
+void tick_nohz_idle_exit(void)
+
+{
+	int cpu = smp_processor_id();
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	ktime_t now;
+
+	local_irq_disable();
+
+	WARN_ON_ONCE(!ts->inidle);
+
+	ts->inidle = 0;
+
+	if (ts->idle_active || ts->tick_stopped)
+		now = ktime_get();
+
+	if (ts->idle_active)
+		tick_nohz_stop_idle(cpu, now);
+
+	if (ts->tick_stopped)
+		tick_nohz_restart_sched_tick(ts, now);
 
 	local_irq_enable();
 }
-- 
1.7.5.4



* [PATCH 02/41] nohz: Make nohz API agnostic against idle ticks cputime accounting
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 01/41] nohz: Separate idle sleeping time accounting from nohz logic Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 03/41] nohz: Rename ts->idle_tick to ts->last_tick Frederic Weisbecker
                   ` (39 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

When the timer tick fires, it accounts the new jiffy as part of
either system, user or idle time. This is how we record the cputime
statistics.

But when the tick is stopped from the idle task, we still need
to record the number of jiffies spent tickless until we restart
the tick and fall back to traditional tick-based cputime accounting.

To do this, we take a snapshot of jiffies when the tick is stopped
and compute the difference against the new value of jiffies when
the tick is restarted. Then we account this whole difference to
the idle cputime.

However we are preparing to be able to stop the tick from places
other than idle. So this idle time accounting needs to be performed
by the callers of the nohz APIs, not by the nohz APIs themselves,
because we now want them to be agnostic about where we stop/restart
the tick.

Therefore, we pull the tickless idle time accounting out of the
generic nohz helpers and up to the idle entry/exit callers.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-sched.c |   37 ++++++++++++++++++++++---------------
 1 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a1ca479..9373f61 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -402,7 +402,6 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts, ktime_t now)
 
 			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
-			ts->idle_jiffies = last_jiffies;
 		}
 
 		ts->idle_sleeps++;
@@ -445,9 +444,13 @@ out:
 static void __tick_nohz_idle_enter(struct tick_sched *ts)
 {
 	ktime_t now;
+	int was_stopped = ts->tick_stopped;
 
 	now = tick_nohz_start_idle(smp_processor_id(), ts);
 	tick_nohz_stop_sched_tick(ts, now);
+
+	if (!was_stopped && ts->tick_stopped)
+		ts->idle_jiffies = ts->last_jiffies;
 }
 
 /**
@@ -548,14 +551,24 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 
 static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
-	unsigned long ticks;
-#endif
 	/* Update jiffies first */
 	select_nohz_load_balancer(0);
 	tick_do_update_jiffies64(now);
 
+	touch_softlockup_watchdog();
+	/*
+	 * Cancel the scheduled timer and restore the tick
+	 */
+	ts->tick_stopped  = 0;
+	ts->idle_exittime = now;
+
+	tick_nohz_restart(ts, now);
+}
+
+static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
+{
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
+	unsigned long ticks;
 	/*
 	 * We stopped the tick in idle. Update process times would miss the
 	 * time we slept as update_process_times does only a 1 tick
@@ -568,15 +581,6 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 	if (ticks && ticks < LONG_MAX)
 		account_idle_ticks(ticks);
 #endif
-
-	touch_softlockup_watchdog();
-	/*
-	 * Cancel the scheduled timer and restore the tick
-	 */
-	ts->tick_stopped  = 0;
-	ts->idle_exittime = now;
-
-	tick_nohz_restart(ts, now);
 }
 
 /**
@@ -605,8 +609,10 @@ void tick_nohz_idle_exit(void)
 	if (ts->idle_active)
 		tick_nohz_stop_idle(cpu, now);
 
-	if (ts->tick_stopped)
+	if (ts->tick_stopped) {
 		tick_nohz_restart_sched_tick(ts, now);
+		tick_nohz_account_idle_ticks(ts);
+	}
 
 	local_irq_enable();
 }
@@ -811,7 +817,8 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 		 */
 		if (ts->tick_stopped) {
 			touch_softlockup_watchdog();
-			ts->idle_jiffies++;
+			if (idle_cpu(cpu))
+				ts->idle_jiffies++;
 		}
 		update_process_times(user_mode(regs));
 		profile_tick(CPU_PROFILING);
-- 
1.7.5.4



* [PATCH 03/41] nohz: Rename ts->idle_tick to ts->last_tick
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 01/41] nohz: Separate idle sleeping time accounting from nohz logic Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 02/41] nohz: Make nohz API agnostic against idle ticks cputime accounting Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 04/41] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
                   ` (38 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Now that the idle and nohz logic are going to be split, ts->idle_tick
becomes a misnomer: the field saves the last tick expiry time before
switching to nohz mode, and we now want to be able to switch to nohz
mode beyond the idle context.

Call it last_tick instead. This changes the timer list stat export
a bit, so we need to bump its version.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/tick.h     |    8 ++++----
 kernel/time/tick-sched.c |    4 ++--
 kernel/time/timer_list.c |    4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index ab8be90..f37fceb 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -31,10 +31,10 @@ enum tick_nohz_mode {
  * struct tick_sched - sched tick emulation and no idle tick control/stats
  * @sched_timer:	hrtimer to schedule the periodic tick in high
  *			resolution mode
- * @idle_tick:		Store the last idle tick expiry time when the tick
- *			timer is modified for idle sleeps. This is necessary
+ * @last_tick:		Store the last tick expiry time when the tick
+ *			timer is modified for nohz sleeps. This is necessary
  *			to resume the tick timer operation in the timeline
- *			when the CPU returns from idle
+ *			when the CPU returns from nohz sleep.
  * @tick_stopped:	Indicator that the idle tick has been stopped
  * @idle_jiffies:	jiffies at the entry to idle for idle time accounting
  * @idle_calls:		Total number of idle calls
@@ -51,7 +51,7 @@ struct tick_sched {
 	struct hrtimer			sched_timer;
 	unsigned long			check_clocks;
 	enum tick_nohz_mode		nohz_mode;
-	ktime_t				idle_tick;
+	ktime_t				last_tick;
 	int				inidle;
 	int				tick_stopped;
 	unsigned long			idle_jiffies;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9373f61..fc9f687 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -400,7 +400,7 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts, ktime_t now)
 		if (!ts->tick_stopped) {
 			select_nohz_load_balancer(1);
 
-			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
+			ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
 		}
 
@@ -526,7 +526,7 @@ ktime_t tick_nohz_get_sleep_length(void)
 static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 {
 	hrtimer_cancel(&ts->sched_timer);
-	hrtimer_set_expires(&ts->sched_timer, ts->idle_tick);
+	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
 
 	while (1) {
 		/* Forward the time to expire in the future */
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index 3258455..af5a7e9 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -167,7 +167,7 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
 	{
 		struct tick_sched *ts = tick_get_tick_sched(cpu);
 		P(nohz_mode);
-		P_ns(idle_tick);
+		P_ns(last_tick);
 		P(tick_stopped);
 		P(idle_jiffies);
 		P(idle_calls);
@@ -259,7 +259,7 @@ static int timer_list_show(struct seq_file *m, void *v)
 	u64 now = ktime_to_ns(ktime_get());
 	int cpu;
 
-	SEQ_printf(m, "Timer List Version: v0.6\n");
+	SEQ_printf(m, "Timer List Version: v0.7\n");
 	SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES);
 	SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);
 
-- 
1.7.5.4



* [PATCH 04/41] nohz: Move nohz load balancer selection into idle logic
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 03/41] nohz: Rename ts->idle_tick to ts->last_tick Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-05-07 15:51   ` Christoph Lameter
  2012-04-30 23:54 ` [PATCH 05/41] nohz: Move ts->idle_calls incrementation into strict " Frederic Weisbecker
                   ` (37 subsequent siblings)
  41 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

[ ** BUGGY PATCH: I need to put more thinking into this ** ]

We want the nohz load balancer to be an idle CPU, so move that
selection into the strict dyntick idle logic.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-sched.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fc9f687..b79dea2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -398,8 +398,6 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts, ktime_t now)
 		 * the scheduler tick in nohz_restart_sched_tick.
 		 */
 		if (!ts->tick_stopped) {
-			select_nohz_load_balancer(1);
-
 			ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
 		}
@@ -449,8 +447,10 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts)
 	now = tick_nohz_start_idle(smp_processor_id(), ts);
 	tick_nohz_stop_sched_tick(ts, now);
 
-	if (!was_stopped && ts->tick_stopped)
+	if (!was_stopped && ts->tick_stopped) {
 		ts->idle_jiffies = ts->last_jiffies;
+		select_nohz_load_balancer(1);
+	}
 }
 
 /**
@@ -552,7 +552,6 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
 	/* Update jiffies first */
-	select_nohz_load_balancer(0);
 	tick_do_update_jiffies64(now);
 
 	touch_softlockup_watchdog();
@@ -610,6 +609,7 @@ void tick_nohz_idle_exit(void)
 		tick_nohz_stop_idle(cpu, now);
 
 	if (ts->tick_stopped) {
+		select_nohz_load_balancer(0);
 		tick_nohz_restart_sched_tick(ts, now);
 		tick_nohz_account_idle_ticks(ts);
 	}
-- 
1.7.5.4



* [PATCH 05/41] nohz: Move ts->idle_calls incrementation into strict idle logic
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 04/41] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 06/41] nohz: Move next idle expiry time record into idle logic area Frederic Weisbecker
                   ` (36 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Since we are preparing to make the nohz API work beyond the idle
case, we need to pull the ts->idle_calls incrementation up to the
callers in idle.

To perform this, we split tick_nohz_stop_sched_tick() in two parts:
a first that checks whether we can really stop the tick for idle,
and another that actually stops it. Then, from the callers in idle,
we check if we can stop the tick, and only then increment idle_calls
before finally calling into the nohz API, which doesn't care about
these details anymore.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-sched.c |   88 +++++++++++++++++++++++++---------------------
 1 files changed, 48 insertions(+), 40 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b79dea2..12ba932 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -271,47 +271,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
-static void tick_nohz_stop_sched_tick(struct tick_sched *ts, ktime_t now)
+static void tick_nohz_stop_sched_tick(struct tick_sched *ts,
+				      ktime_t now, int cpu)
 {
 	unsigned long seq, last_jiffies, next_jiffies, delta_jiffies;
 	ktime_t last_update, expires;
 	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
 	u64 time_delta;
-	int cpu;
-
-	cpu = smp_processor_id();
-	ts = &per_cpu(tick_cpu_sched, cpu);
-
-	/*
-	 * If this cpu is offline and it is the one which updates
-	 * jiffies, then give up the assignment and let it be taken by
-	 * the cpu which runs the tick timer next. If we don't drop
-	 * this here the jiffies might be stale and do_timer() never
-	 * invoked.
-	 */
-	if (unlikely(!cpu_online(cpu))) {
-		if (cpu == tick_do_timer_cpu)
-			tick_do_timer_cpu = TICK_DO_TIMER_NONE;
-	}
-
-	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
-		return;
-
-	if (need_resched())
-		return;
 
-	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
-		static int ratelimit;
-
-		if (ratelimit < 10) {
-			printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
-			       (unsigned int) local_softirq_pending());
-			ratelimit++;
-		}
-		return;
-	}
 
-	ts->idle_calls++;
 	/* Read jiffies and the time when jiffies were updated last */
 	do {
 		seq = read_seqbegin(&xtime_lock);
@@ -439,17 +407,57 @@ out:
 	ts->sleep_length = ktime_sub(dev->next_event, now);
 }
 
+static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
+{
+	/*
+	 * If this cpu is offline and it is the one which updates
+	 * jiffies, then give up the assignment and let it be taken by
+	 * the cpu which runs the tick timer next. If we don't drop
+	 * this here the jiffies might be stale and do_timer() never
+	 * invoked.
+	 */
+	if (unlikely(!cpu_online(cpu))) {
+		if (cpu == tick_do_timer_cpu)
+			tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+	}
+
+	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
+		return false;
+
+	if (need_resched())
+		return false;
+
+	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
+		static int ratelimit;
+
+		if (ratelimit < 10) {
+			printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
+			       (unsigned int) local_softirq_pending());
+			ratelimit++;
+		}
+		return false;
+	}
+
+	return true;
+}
+
 static void __tick_nohz_idle_enter(struct tick_sched *ts)
 {
 	ktime_t now;
-	int was_stopped = ts->tick_stopped;
+	int cpu = smp_processor_id();
 
-	now = tick_nohz_start_idle(smp_processor_id(), ts);
-	tick_nohz_stop_sched_tick(ts, now);
+	now = tick_nohz_start_idle(cpu, ts);
 
-	if (!was_stopped && ts->tick_stopped) {
-		ts->idle_jiffies = ts->last_jiffies;
-		select_nohz_load_balancer(1);
+	if (can_stop_idle_tick(cpu, ts)) {
+		int was_stopped = ts->tick_stopped;
+
+		ts->idle_calls++;
+		tick_nohz_stop_sched_tick(ts, now, cpu);
+
+		if (!was_stopped && ts->tick_stopped) {
+			ts->idle_jiffies = ts->last_jiffies;
+			select_nohz_load_balancer(1);
+		}
 	}
 }
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 06/41] nohz: Move next idle expiry time record into idle logic area
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 05/41] nohz: Move ts->idle_calls incrementation into strict " Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 07/41] cpuset: Set up interface for nohz flag Frederic Weisbecker
                   ` (35 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

The next idle expiry time record and the idle sleeps tracking
are statistics that only concern idle.

Since we want the nohz APIs to become usable beyond the idle
context, let's pull up the handling of these statistics to the
callers in idle.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-sched.c |   24 ++++++++++++++----------
 1 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 12ba932..0695e9d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -271,11 +271,11 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
-static void tick_nohz_stop_sched_tick(struct tick_sched *ts,
-				      ktime_t now, int cpu)
+static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
+					 ktime_t now, int cpu)
 {
 	unsigned long seq, last_jiffies, next_jiffies, delta_jiffies;
-	ktime_t last_update, expires;
+	ktime_t last_update, expires, ret = { .tv64 = 0 };
 	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
 	u64 time_delta;
 
@@ -358,6 +358,8 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts,
 		if (ts->tick_stopped && ktime_equal(expires, dev->next_event))
 			goto out;
 
+		ret = expires;
+
 		/*
 		 * nohz_stop_sched_tick can be called several times before
 		 * the nohz_restart_sched_tick is called. This happens when
@@ -370,11 +372,6 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts,
 			ts->tick_stopped = 1;
 		}
 
-		ts->idle_sleeps++;
-
-		/* Mark expires */
-		ts->idle_expires = expires;
-
 		/*
 		 * If the expiration time == KTIME_MAX, then
 		 * in this case we simply stop the tick timer.
@@ -405,6 +402,8 @@ out:
 	ts->next_jiffies = next_jiffies;
 	ts->last_jiffies = last_jiffies;
 	ts->sleep_length = ktime_sub(dev->next_event, now);
+
+	return ret;
 }
 
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
@@ -443,7 +442,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 
 static void __tick_nohz_idle_enter(struct tick_sched *ts)
 {
-	ktime_t now;
+	ktime_t now, expires;
 	int cpu = smp_processor_id();
 
 	now = tick_nohz_start_idle(cpu, ts);
@@ -452,7 +451,12 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts)
 		int was_stopped = ts->tick_stopped;
 
 		ts->idle_calls++;
-		tick_nohz_stop_sched_tick(ts, now, cpu);
+
+		expires = tick_nohz_stop_sched_tick(ts, now, cpu);
+		if (expires.tv64 > 0LL) {
+			ts->idle_sleeps++;
+			ts->idle_expires = expires;
+		}
 
 		if (!was_stopped && ts->tick_stopped) {
 			ts->idle_jiffies = ts->last_jiffies;
-- 
1.7.5.4



* [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 06/41] nohz: Move next idle expiry time record into idle logic area Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-05-07 15:55   ` Christoph Lameter
  2012-04-30 23:54 ` [PATCH 08/41] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu Frederic Weisbecker
                   ` (34 subsequent siblings)
  41 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Prepare the interface to implement the nohz cpuset flag.
This flag, once set, will tell the system to try to
shut down the periodic timer tick when possible.

We use a per-CPU refcounter here. As long as a CPU
is contained in at least one cpuset that has the
nohz flag set, it is part of the set of CPUs that
run in adaptive nohz mode.

[ include build fix from Zen Lin ]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/Kconfig           |    3 ++
 include/linux/cpuset.h |   25 +++++++++++++++++++++++
 init/Kconfig           |    8 +++++++
 kernel/cpuset.c        |   52 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 0 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4f55c73..a0710f6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -177,6 +177,9 @@ config HAVE_ARCH_JUMP_LABEL
 	bool
 
 config HAVE_ARCH_MUTEX_CPU_RELAX
+       bool
+
+config HAVE_CPUSETS_NO_HZ
 	bool
 
 config HAVE_RCU_TABLE_FREE
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e9eaec5..5510708 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -244,4 +244,29 @@ static inline void put_mems_allowed(void)
 
 #endif /* !CONFIG_CPUSETS */
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+
+DECLARE_PER_CPU(int, cpu_adaptive_nohz_ref);
+
+static inline bool cpuset_cpu_adaptive_nohz(int cpu)
+{
+	if (per_cpu(cpu_adaptive_nohz_ref, cpu) > 0)
+		return true;
+
+	return false;
+}
+
+static inline bool cpuset_adaptive_nohz(void)
+{
+	if (__get_cpu_var(cpu_adaptive_nohz_ref) > 0)
+		return true;
+
+	return false;
+}
+#else
+static inline bool cpuset_cpu_adaptive_nohz(int cpu) { return false; }
+static inline bool cpuset_adaptive_nohz(void) { return false; }
+
+#endif /* CONFIG_CPUSETS_NO_HZ */
+
 #endif /* _LINUX_CPUSET_H */
diff --git a/init/Kconfig b/init/Kconfig
index 3f42cd6..43f7687 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -638,6 +638,14 @@ config PROC_PID_CPUSET
 	depends on CPUSETS
 	default y
 
+config CPUSETS_NO_HZ
+       bool "Tickless cpusets"
+       depends on CPUSETS && HAVE_CPUSETS_NO_HZ
+       help
+         This option lets you apply a nohz property to a cpuset such
+	 that the periodic timer tick tries to be avoided when possible on
+	 the concerned CPUs.
+
 config CGROUP_CPUACCT
 	bool "Simple CPU accounting cgroup subsystem"
 	help
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index a09ac2b..5a28cf8 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -145,6 +145,7 @@ typedef enum {
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_ADAPTIVE_NOHZ,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -183,6 +184,11 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_adaptive_nohz(const struct cpuset *cs)
+{
+	return test_bit(CS_ADAPTIVE_NOHZ, &cs->flags);
+}
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 };
@@ -1211,6 +1217,31 @@ static void cpuset_change_flag(struct task_struct *tsk,
 	cpuset_update_task_spread_flag(cgroup_cs(scan->cg), tsk);
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+
+DEFINE_PER_CPU(int, cpu_adaptive_nohz_ref);
+
+static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
+{
+	int cpu;
+	int val;
+
+	if (is_adaptive_nohz(old_cs) == is_adaptive_nohz(cs))
+		return;
+
+	for_each_cpu(cpu, cs->cpus_allowed) {
+		if (is_adaptive_nohz(cs))
+			per_cpu(cpu_adaptive_nohz_ref, cpu) += 1;
+		else
+			per_cpu(cpu_adaptive_nohz_ref, cpu) -= 1;
+	}
+}
+#else
+static inline void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
+{
+}
+#endif
+
 /*
  * update_tasks_flags - update the spread flags of tasks in the cpuset.
  * @cs: the cpuset in which each task's spread flags needs to be changed
@@ -1276,6 +1307,8 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
+	update_nohz_cpus(cs, trialcs);
+
 	mutex_lock(&callback_mutex);
 	cs->flags = trialcs->flags;
 	mutex_unlock(&callback_mutex);
@@ -1488,6 +1521,7 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_ADAPTIVE_NOHZ,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1527,6 +1561,11 @@ static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
 	case FILE_SPREAD_SLAB:
 		retval = update_flag(CS_SPREAD_SLAB, cs, val);
 		break;
+#ifdef CONFIG_CPUSETS_NO_HZ
+	case FILE_ADAPTIVE_NOHZ:
+		retval = update_flag(CS_ADAPTIVE_NOHZ, cs, val);
+		break;
+#endif
 	default:
 		retval = -EINVAL;
 		break;
@@ -1686,6 +1725,10 @@ static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)
 		return is_spread_page(cs);
 	case FILE_SPREAD_SLAB:
 		return is_spread_slab(cs);
+#ifdef CONFIG_CPUSETS_NO_HZ
+	case FILE_ADAPTIVE_NOHZ:
+		return is_adaptive_nohz(cs);
+#endif
 	default:
 		BUG();
 	}
@@ -1794,6 +1837,15 @@ static struct cftype files[] = {
 		.write_u64 = cpuset_write_u64,
 		.private = FILE_SPREAD_SLAB,
 	},
+
+#ifdef CONFIG_CPUSETS_NO_HZ
+	{
+		.name = "adaptive_nohz",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_ADAPTIVE_NOHZ,
+	},
+#endif
 };
 
 static struct cftype cft_memory_pressure_enabled = {
-- 
1.7.5.4



* [PATCH 08/41] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 07/41] cpuset: Set up interface for nohz flag Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-05-07 16:02   ` Christoph Lameter
  2012-04-30 23:54 ` [PATCH 09/41] x86: New cpuset nohz irq vector Frederic Weisbecker
                   ` (33 subsequent siblings)
  41 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Try to give the timekeeping duty to a CPU that doesn't belong
to any nohz cpuset when possible, so that we increase the chance
for the CPUs in these nohz cpusets to run out of periodic tick
mode.

[TODO: We need to find a way to ensure there is always one non-nohz
running CPU maintaining the timekeeping duty if all non-idle CPUs are
adaptive tickless]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-sched.c |   52 ++++++++++++++++++++++++++++++++++++---------
 1 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 0695e9d..f1142d5 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include <linux/profile.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/cpuset.h>
 
 #include <asm/irq_regs.h>
 
@@ -782,6 +783,45 @@ void tick_check_idle(int cpu)
 	tick_check_nohz(cpu);
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+
+/*
+ * Take the timer duty if nobody is taking care of it.
+ * If a CPU already does and it's in a nohz cpuset,
+ * then take over the duty so that it can switch to nohz mode.
+ */
+static void tick_do_timer_check_handler(int cpu)
+{
+	int handler = tick_do_timer_cpu;
+
+	if (unlikely(handler == TICK_DO_TIMER_NONE)) {
+		tick_do_timer_cpu = cpu;
+	} else {
+		if (!cpuset_adaptive_nohz() &&
+		    cpuset_cpu_adaptive_nohz(handler))
+			tick_do_timer_cpu = cpu;
+	}
+}
+
+#else
+
+static void tick_do_timer_check_handler(int cpu)
+{
+#ifdef CONFIG_NO_HZ
+	/*
+	 * Check if the do_timer duty was dropped. We don't care about
+	 * concurrency: This happens only when the cpu in charge went
+	 * into a long sleep. If two cpus happen to assign themself to
+	 * this duty, then the jiffies update is still serialized by
+	 * xtime_lock.
+	 */
+	if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
+		tick_do_timer_cpu = cpu;
+#endif
+}
+
+#endif /* CONFIG_CPUSETS_NO_HZ */
+
 /*
  * High resolution timer specific code
  */
@@ -798,17 +838,7 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 	ktime_t now = ktime_get();
 	int cpu = smp_processor_id();
 
-#ifdef CONFIG_NO_HZ
-	/*
-	 * Check if the do_timer duty was dropped. We don't care about
-	 * concurrency: This happens only when the cpu in charge went
-	 * into a long sleep. If two cpus happen to assign themself to
-	 * this duty, then the jiffies update is still serialized by
-	 * xtime_lock.
-	 */
-	if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
-		tick_do_timer_cpu = cpu;
-#endif
+	tick_do_timer_check_handler(cpu);
 
 	/* Check, if the jiffies need an update */
 	if (tick_do_timer_cpu == cpu)
-- 
1.7.5.4



* [PATCH 09/41] x86: New cpuset nohz irq vector
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 08/41] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 10/41] nohz: Adaptive tick stop and restart on nohz cpuset Frederic Weisbecker
                   ` (32 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

We need a way to send an IPI (remote or local) in order to
asynchronously restart the tick for CPUs in nohz adaptive mode.

This must be asynchronous so that we can trigger it with irqs
disabled. It must also be usable as a self-IPI, for example in
cases where restarting the tick inline would otherwise risk a
deadlock.

This only covers the x86 backend. The core tick restart function
will be defined in a later patch.

[CHECKME: Perhaps we instead need to use irq work for self IPIs.
But we also need a way to send async remote IPIs.]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/entry_arch.h  |    3 +++
 arch/x86/include/asm/hw_irq.h      |    7 +++++++
 arch/x86/include/asm/irq_vectors.h |    2 ++
 arch/x86/include/asm/smp.h         |   11 +++++++++++
 arch/x86/kernel/entry_64.S         |    4 ++++
 arch/x86/kernel/irqinit.c          |    4 ++++
 arch/x86/kernel/smp.c              |   24 ++++++++++++++++++++++++
 7 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index 0baa628..f71872d 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -10,6 +10,9 @@
  * through the ICC by us (IPIs)
  */
 #ifdef CONFIG_SMP
+#ifdef CONFIG_CPUSETS_NO_HZ
+BUILD_INTERRUPT(cpuset_update_nohz_interrupt,CPUSET_UPDATE_NOHZ_VECTOR)
+#endif
 BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
 BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
 BUILD_INTERRUPT(call_function_single_interrupt,CALL_FUNCTION_SINGLE_VECTOR)
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index eb92a6e..0d26ed7 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -35,6 +35,10 @@ extern void spurious_interrupt(void);
 extern void thermal_interrupt(void);
 extern void reschedule_interrupt(void);
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+extern void cpuset_update_nohz_interrupt(void);
+#endif
+
 extern void invalidate_interrupt(void);
 extern void invalidate_interrupt0(void);
 extern void invalidate_interrupt1(void);
@@ -152,6 +156,9 @@ extern asmlinkage void smp_irq_move_cleanup_interrupt(void);
 #endif
 #ifdef CONFIG_SMP
 extern void smp_reschedule_interrupt(struct pt_regs *);
+#ifdef CONFIG_CPUSETS_NO_HZ
+extern void smp_cpuset_update_nohz_interrupt(struct pt_regs *);
+#endif
 extern void smp_call_function_interrupt(struct pt_regs *);
 extern void smp_call_function_single_interrupt(struct pt_regs *);
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 4b44487..11bc691 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -112,6 +112,8 @@
 /* Xen vector callback to receive events in a HVM domain */
 #define XEN_HVM_EVTCHN_CALLBACK		0xf3
 
+#define CPUSET_UPDATE_NOHZ_VECTOR	0xf2
+
 /*
  * Local APIC timer IRQ vector is on a different priority level,
  * to work around the 'lost local interrupt if more than 2 IRQ
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 0434c40..475c26b 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -70,6 +70,10 @@ struct smp_ops {
 	void (*stop_other_cpus)(int wait);
 	void (*smp_send_reschedule)(int cpu);
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+	void (*smp_cpuset_update_nohz)(int cpu);
+#endif
+
 	int (*cpu_up)(unsigned cpu);
 	int (*cpu_disable)(void);
 	void (*cpu_die)(unsigned int cpu);
@@ -138,6 +142,13 @@ static inline void smp_send_reschedule(int cpu)
 	smp_ops.smp_send_reschedule(cpu);
 }
 
+static inline void smp_cpuset_update_nohz(int cpu)
+{
+#ifdef CONFIG_CPUSETS_NO_HZ
+	smp_ops.smp_cpuset_update_nohz(cpu);
+#endif
+}
+
 static inline void arch_send_call_function_single_ipi(int cpu)
 {
 	smp_ops.send_call_func_single_ipi(cpu);
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1333d98..54f269c 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1002,6 +1002,10 @@ apicinterrupt CALL_FUNCTION_VECTOR \
 	call_function_interrupt smp_call_function_interrupt
 apicinterrupt RESCHEDULE_VECTOR \
 	reschedule_interrupt smp_reschedule_interrupt
+#ifdef CONFIG_CPUSETS_NO_HZ
+apicinterrupt CPUSET_UPDATE_NOHZ_VECTOR \
+	cpuset_update_nohz_interrupt smp_cpuset_update_nohz_interrupt
+#endif
 #endif
 
 apicinterrupt ERROR_APIC_VECTOR \
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index 313fb5c..2220f3c 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -172,6 +172,10 @@ static void __init smp_intr_init(void)
 	 */
 	alloc_intr_gate(RESCHEDULE_VECTOR, reschedule_interrupt);
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+	alloc_intr_gate(CPUSET_UPDATE_NOHZ_VECTOR, cpuset_update_nohz_interrupt);
+#endif
+
 	/* IPIs for invalidation */
 #define ALLOC_INVTLB_VEC(NR) \
 	alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+NR, \
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 66c74f4..94615a3 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -123,6 +123,17 @@ static void native_smp_send_reschedule(int cpu)
 	apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+static void native_smp_cpuset_update_nohz(int cpu)
+{
+	if (unlikely(cpu_is_offline(cpu))) {
+		WARN_ON(1);
+		return;
+	}
+	apic->send_IPI_mask(cpumask_of(cpu), CPUSET_UPDATE_NOHZ_VECTOR);
+}
+#endif
+
 void native_send_call_func_single_ipi(int cpu)
 {
 	apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
@@ -267,6 +278,16 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
 	 */
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+void smp_cpuset_update_nohz_interrupt(struct pt_regs *regs)
+{
+	ack_APIC_irq();
+	irq_enter();
+	inc_irq_stat(irq_call_count);
+	irq_exit();
+}
+#endif
+
 void smp_call_function_interrupt(struct pt_regs *regs)
 {
 	ack_APIC_irq();
@@ -300,6 +321,9 @@ struct smp_ops smp_ops = {
 
 	.stop_other_cpus	= native_nmi_stop_other_cpus,
 	.smp_send_reschedule	= native_smp_send_reschedule,
+#ifdef CONFIG_CPUSETS_NO_HZ
+	.smp_cpuset_update_nohz = native_smp_cpuset_update_nohz,
+#endif
 
 	.cpu_up			= native_cpu_up,
 	.cpu_die		= native_cpu_die,
-- 
1.7.5.4



* [PATCH 10/41] nohz: Adaptive tick stop and restart on nohz cpuset
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (8 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 09/41] x86: New cpuset nohz irq vector Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 11/41] nohz/cpuset: Don't turn off the tick if rcu needs it Frederic Weisbecker
                   ` (31 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

When a CPU is included in a nohz cpuset, try to switch
it to nohz mode from the interrupt exit path if it is running
a single non-idle task.

Then restart the tick if necessary when a second task is
enqueued while the timer is stopped, so that the scheduler
tick is rearmed.

[TODO: Handle the many things done from scheduler_tick()]

[ Included build fix from Geoff Levand ]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/smp.c    |    2 +
 include/linux/sched.h    |    6 +++
 include/linux/tick.h     |   11 +++++-
 init/Kconfig             |    2 +-
 kernel/sched/core.c      |   22 ++++++++++++
 kernel/sched/sched.h     |   23 ++++++++++++
 kernel/softirq.c         |    6 ++-
 kernel/time/tick-sched.c |   84 +++++++++++++++++++++++++++++++++++++++++----
 8 files changed, 144 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 94615a3..df83671 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/tick.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -283,6 +284,7 @@ void smp_cpuset_update_nohz_interrupt(struct pt_regs *regs)
 {
 	ack_APIC_irq();
 	irq_enter();
+	tick_nohz_check_adaptive();
 	inc_irq_stat(irq_call_count);
 	irq_exit();
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0657368..dd5df2a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2746,6 +2746,12 @@ static inline void inc_syscw(struct task_struct *tsk)
 #define TASK_SIZE_OF(tsk)	TASK_SIZE
 #endif
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+extern bool sched_can_stop_tick(void);
+#else
+static inline bool sched_can_stop_tick(void) { return false; }
+#endif
+
 #ifdef CONFIG_MM_OWNER
 extern void mm_update_next_owner(struct mm_struct *mm);
 extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f37fceb..9b66fd3 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -124,11 +124,12 @@ static inline int tick_oneshot_mode_active(void) { return 0; }
 # ifdef CONFIG_NO_HZ
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
+extern void tick_nohz_restart_sched_tick(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
-# else
+# else /* !NO_HZ */
 static inline void tick_nohz_idle_enter(void) { }
 static inline void tick_nohz_idle_exit(void) { }
 
@@ -142,4 +143,12 @@ static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 # endif /* !NO_HZ */
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+extern void tick_nohz_check_adaptive(void);
+extern void tick_nohz_post_schedule(void);
+#else /* !CPUSETS_NO_HZ */
+static inline void tick_nohz_check_adaptive(void) { }
+static inline void tick_nohz_post_schedule(void) { }
+#endif /* CPUSETS_NO_HZ */
+
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index 43f7687..7cdb8be 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -640,7 +640,7 @@ config PROC_PID_CPUSET
 
 config CPUSETS_NO_HZ
        bool "Tickless cpusets"
-       depends on CPUSETS && HAVE_CPUSETS_NO_HZ
+       depends on CPUSETS && HAVE_CPUSETS_NO_HZ && NO_HZ && HIGH_RES_TIMERS
        help
          This option lets you apply a nohz property to a cpuset such
 	 that the periodic timer tick tries to be avoided when possible on
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b342f57..4f80a81 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1323,6 +1323,27 @@ static void update_avg(u64 *avg, u64 sample)
 }
 #endif
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+bool sched_can_stop_tick(void)
+{
+	struct rq *rq;
+
+	rq = this_rq();
+
+	/*
+	 * Ensure nr_running updates are visible
+	 * FIXME: the barrier is probably not enough to ensure
+	 * the updates are visible right away.
+	 */
+	smp_rmb();
+	/* More than one running task need preemption */
+	if (rq->nr_running > 1)
+		return false;
+
+	return true;
+}
+#endif
+
 static void
 ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 {
@@ -2059,6 +2080,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 * frame will be invalid.
 	 */
 	finish_task_switch(this_rq(), prev);
+	tick_nohz_post_schedule();
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 98c0c26..b89f254 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1,6 +1,7 @@
 
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/cpuset.h>
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
 
@@ -925,6 +926,28 @@ static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
 static inline void inc_nr_running(struct rq *rq)
 {
 	rq->nr_running++;
+
+	if (rq->nr_running == 2) {
+		/*
+		 * Make rq->nr_running update visible right away so that
+		 * remote CPU knows that it must restart the tick.
+		 * FIXME: This is probably not enough to ensure the update is visible
+		 */
+		smp_wmb();
+		/*
+		 * Make updates to cpu_adaptive_nohz_ref visible right now.
+		 * If the CPU is not yet in a nohz cpuset then it will see
+		 * the value on rq->nr_running later on the first time it
+		 * tries to shut down the tick. Otherwise we must send
+		 * it an IPI. But the ordering must be strict to ensure
+		 * the first case.
+		 * FIXME: That too is probably not enough to ensure the
+		 * update is visible.
+		 */
+		smp_rmb();
+		if (cpuset_cpu_adaptive_nohz(rq->cpu))
+			smp_cpuset_update_nohz(rq->cpu);
+	}
 }
 
 static inline void dec_nr_running(struct rq *rq)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 5ace266..1bacb20 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,6 +24,7 @@
 #include <linux/ftrace.h>
 #include <linux/smp.h>
 #include <linux/tick.h>
+#include <linux/cpuset.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -297,7 +298,8 @@ void irq_enter(void)
 	int cpu = smp_processor_id();
 
 	rcu_irq_enter();
-	if (is_idle_task(current) && !in_interrupt()) {
+
+	if ((is_idle_task(current) || cpuset_adaptive_nohz()) && !in_interrupt()) {
 		/*
 		 * Prevent raise_softirq from needlessly waking up ksoftirqd
 		 * here, as softirq will be serviced on return from interrupt.
@@ -349,7 +351,7 @@ void irq_exit(void)
 
 #ifdef CONFIG_NO_HZ
 	/* Make sure that timer wheel updates are propagated */
-	if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
+	if (!in_interrupt())
 		tick_nohz_irq_exit();
 #endif
 	rcu_irq_exit();
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f1142d5..43fa7ac 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -506,6 +506,24 @@ void tick_nohz_idle_enter(void)
 	local_irq_enable();
 }
 
+static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
+{
+#ifdef CONFIG_CPUSETS_NO_HZ
+	int cpu = smp_processor_id();
+
+	if (!cpuset_adaptive_nohz() || is_idle_task(current))
+		return;
+
+	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
+		return;
+
+	if (!sched_can_stop_tick())
+		return;
+
+	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+#endif
+}
+
 /**
  * tick_nohz_irq_exit - update next tick event from interrupt exit
  *
@@ -518,10 +536,12 @@ void tick_nohz_irq_exit(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 
-	if (!ts->inidle)
-		return;
-
-	__tick_nohz_idle_enter(ts);
+	if (ts->inidle) {
+		if (!need_resched())
+			__tick_nohz_idle_enter(ts);
+	} else {
+		tick_nohz_cpuset_stop_tick(ts);
+	}
 }
 
 /**
@@ -562,7 +582,7 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 	}
 }
 
-static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
+static void __tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
 	/* Update jiffies first */
 	tick_do_update_jiffies64(now);
@@ -577,6 +597,31 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 	tick_nohz_restart(ts, now);
 }
 
+/**
+ * tick_nohz_restart_sched_tick - restart the tick for a tickless CPU
+ *
+ * Restart the tick when the CPU is in adaptive tickless mode.
+ */
+void tick_nohz_restart_sched_tick(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+	unsigned long flags;
+	ktime_t now;
+
+	local_irq_save(flags);
+
+	if (!ts->tick_stopped) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	now = ktime_get();
+	__tick_nohz_restart_sched_tick(ts, now);
+
+	local_irq_restore(flags);
+}
+
+
 static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
@@ -623,7 +668,7 @@ void tick_nohz_idle_exit(void)
 
 	if (ts->tick_stopped) {
 		select_nohz_load_balancer(0);
-		tick_nohz_restart_sched_tick(ts, now);
+		__tick_nohz_restart_sched_tick(ts, now);
 		tick_nohz_account_idle_ticks(ts);
 	}
 
@@ -784,7 +829,6 @@ void tick_check_idle(int cpu)
 }
 
 #ifdef CONFIG_CPUSETS_NO_HZ
-
 /*
  * Take the timer duty if nobody is taking care of it.
 * If a CPU already does so and it's in a nohz cpuset,
@@ -803,6 +847,29 @@ static void tick_do_timer_check_handler(int cpu)
 	}
 }
 
+void tick_nohz_check_adaptive(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (ts->tick_stopped && !is_idle_task(current)) {
+		if (!sched_can_stop_tick())
+			tick_nohz_restart_sched_tick();
+	}
+}
+
+void tick_nohz_post_schedule(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	/*
+	 * No need to disable irqs here. The worst that can happen
+	 * is an irq that comes and restarts the tick before us.
+	 * tick_nohz_restart_sched_tick() is irq safe.
+	 */
+	if (ts->tick_stopped)
+		tick_nohz_restart_sched_tick();
+}
+
 #else
 
 static void tick_do_timer_check_handler(int cpu)
@@ -849,6 +916,7 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 	 * no valid regs pointer
 	 */
 	if (regs) {
+		int user = user_mode(regs);
 		/*
 		 * When we are idle and the tick is stopped, we have to touch
 		 * the watchdog as we might not schedule for a really long
@@ -862,7 +930,7 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 			if (idle_cpu(cpu))
 				ts->idle_jiffies++;
 		}
-		update_process_times(user_mode(regs));
+		update_process_times(user);
 		profile_tick(CPU_PROFILING);
 	}
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 11/41] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (9 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 10/41] nohz: Adaptive tick stop and restart on nohz cpuset Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-05-22 17:16   ` Paul E. McKenney
  2012-04-30 23:54 ` [PATCH 12/41] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued Frederic Weisbecker
                   ` (30 subsequent siblings)
  41 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

If RCU is waiting for the current CPU to complete a grace
period, don't turn off the tick. Unlike dyntick-idle, we are
not necessarily going to enter the RCU extended quiescent
state, so we may need to keep the tick to note the current
CPU's quiescent states.

[added build fix from Zen Lin]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rcupdate.h |    1 +
 kernel/rcutree.c         |    3 +--
 kernel/time/tick-sched.c |   22 ++++++++++++++++++----
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 81c04f4..e06639e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -184,6 +184,7 @@ static inline int rcu_preempt_depth(void)
 extern void rcu_sched_qs(int cpu);
 extern void rcu_bh_qs(int cpu);
 extern void rcu_check_callbacks(int cpu, int user);
+extern int rcu_pending(int cpu);
 struct notifier_block;
 extern void rcu_idle_enter(void);
 extern void rcu_idle_exit(void);
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 6c4a672..e141c7e 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -212,7 +212,6 @@ int rcu_cpu_stall_suppress __read_mostly;
 module_param(rcu_cpu_stall_suppress, int, 0644);
 
 static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
-static int rcu_pending(int cpu);
 
 /*
  * Return the number of RCU-sched batches processed thus far for debug & stats.
@@ -1915,7 +1914,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
  * by the current CPU, returning 1 if so.  This function is part of the
  * RCU implementation; it is -not- an exported member of the RCU API.
  */
-static int rcu_pending(int cpu)
+int rcu_pending(int cpu)
 {
 	return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
 	       __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 43fa7ac..4f99766 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -506,9 +506,21 @@ void tick_nohz_idle_enter(void)
 	local_irq_enable();
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+static bool can_stop_adaptive_tick(void)
+{
+	if (!sched_can_stop_tick())
+		return false;
+
+	/* Is there a grace period to complete? */
+	if (rcu_pending(smp_processor_id()))
+		return false;
+
+	return true;
+}
+
 static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
 {
-#ifdef CONFIG_CPUSETS_NO_HZ
 	int cpu = smp_processor_id();
 
 	if (!cpuset_adaptive_nohz() || is_idle_task(current))
@@ -517,12 +529,14 @@ static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
 	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
 		return;
 
-	if (!sched_can_stop_tick())
+	if (!can_stop_adaptive_tick())
 		return;
 
 	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
-#endif
 }
+#else
+static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts) { }
+#endif
 
 /**
  * tick_nohz_irq_exit - update next tick event from interrupt exit
@@ -852,7 +866,7 @@ void tick_nohz_check_adaptive(void)
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 
 	if (ts->tick_stopped && !is_idle_task(current)) {
-		if (!sched_can_stop_tick())
+		if (!can_stop_adaptive_tick())
 			tick_nohz_restart_sched_tick();
 	}
 }
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 12/41] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (10 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 11/41] nohz/cpuset: Don't turn off the tick if rcu needs it Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 13/41] nohz/cpuset: Don't stop the tick if posix cpu timers are running Frederic Weisbecker
                   ` (29 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Wake up a CPU when a timer list timer is enqueued there while
the CPU is in adaptive nohz mode. Sending an IPI makes it
reconsider the next timer to program on top of the recent
updates.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/sched.h |    4 ++--
 kernel/sched/core.c   |   24 +++++++++++++++++++++++-
 kernel/timer.c        |    2 +-
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dd5df2a..2cf5d9b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1992,9 +1992,9 @@ static inline void idle_task_exit(void) {}
 #endif
 
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-extern void wake_up_idle_cpu(int cpu);
+extern void wake_up_nohz_cpu(int cpu);
 #else
-static inline void wake_up_idle_cpu(int cpu) { }
+static inline void wake_up_nohz_cpu(int cpu) { }
 #endif
 
 extern unsigned int sysctl_sched_latency;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4f80a81..ba9e4d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -576,7 +576,7 @@ unlock:
  * account when the CPU goes back to idle and evaluates the timer
  * wheel for the next timer event.
  */
-void wake_up_idle_cpu(int cpu)
+static void wake_up_idle_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
@@ -606,6 +606,28 @@ void wake_up_idle_cpu(int cpu)
 		smp_send_reschedule(cpu);
 }
 
+static bool wake_up_cpuset_nohz_cpu(int cpu)
+{
+#ifdef CONFIG_CPUSETS_NO_HZ
+	/*
+	 * FIXME: We need to ensure that updates
+	 * on cpu_adaptive_nohz_ref are visible right
+	 * away.
+	 */
+	if (cpuset_cpu_adaptive_nohz(cpu)) {
+		smp_cpuset_update_nohz(cpu);
+		return true;
+	}
+#endif
+	return false;
+}
+
+void wake_up_nohz_cpu(int cpu)
+{
+	if (!wake_up_cpuset_nohz_cpu(cpu))
+		wake_up_idle_cpu(cpu);
+}
+
 static inline bool got_nohz_idle_kick(void)
 {
 	int cpu = smp_processor_id();
diff --git a/kernel/timer.c b/kernel/timer.c
index a297ffc..c203297 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -926,7 +926,7 @@ void add_timer_on(struct timer_list *timer, int cpu)
 	 * makes sure that a CPU on the way to idle can not evaluate
 	 * the timer wheel.
 	 */
-	wake_up_idle_cpu(cpu);
+	wake_up_nohz_cpu(cpu);
 	spin_unlock_irqrestore(&base->lock, flags);
 }
 EXPORT_SYMBOL_GPL(add_timer_on);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 13/41] nohz/cpuset: Don't stop the tick if posix cpu timers are running
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (11 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 12/41] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 14/41] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset Frederic Weisbecker
                   ` (28 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

If either a per thread or a per process posix cpu timer is running,
don't stop the tick.

TODO: restart the tick if it is stopped and a posix cpu timer
gets enqueued. Also check whether we need a memory barrier for
the per process posix timer, which can be enqueued from another
task of the group.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/posix-timers.h |    1 +
 kernel/posix-cpu-timers.c    |   12 ++++++++++++
 kernel/time/tick-sched.c     |    4 ++++
 3 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 042058f..97480c2 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -119,6 +119,7 @@ int posix_timer_event(struct k_itimer *timr, int si_private);
 void posix_cpu_timer_schedule(struct k_itimer *timer);
 
 void run_posix_cpu_timers(struct task_struct *task);
+bool posix_cpu_timers_running(struct task_struct *tsk);
 void posix_cpu_timers_exit(struct task_struct *task);
 void posix_cpu_timers_exit_group(struct task_struct *task);
 
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 125cb67..79d4c24 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -6,6 +6,7 @@
 #include <linux/posix-timers.h>
 #include <linux/errno.h>
 #include <linux/math64.h>
+#include <linux/cpuset.h>
 #include <asm/uaccess.h>
 #include <linux/kernel_stat.h>
 #include <trace/events/timer.h>
@@ -1274,6 +1275,17 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
 	return 0;
 }
 
+bool posix_cpu_timers_running(struct task_struct *tsk)
+{
+	if (!task_cputime_zero(&tsk->cputime_expires))
+		return true;
+
+	if (tsk->signal->cputimer.running)
+		return true;
+
+	return false;
+}
+
 /*
  * This is called from the timer interrupt handler.  The irq handler has
  * already updated our counts.  We need to check if any timers fire now.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4f99766..fc35d41 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -21,6 +21,7 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 #include <linux/cpuset.h>
+#include <linux/posix-timers.h>
 
 #include <asm/irq_regs.h>
 
@@ -512,6 +513,9 @@ static bool can_stop_adaptive_tick(void)
 	if (!sched_can_stop_tick())
 		return false;
 
+	if (posix_cpu_timers_running(current))
+		return false;
+
 	/* Is there a grace period to complete ? */
 	if (rcu_pending(smp_processor_id()))
 		return false;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 14/41] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (12 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 13/41] nohz/cpuset: Don't stop the tick if posix cpu timers are running Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 15/41] nohz/cpuset: Restart the tick if printk needs it Frederic Weisbecker
                   ` (27 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Issue an IPI to restart the tick on a CPU that belongs
to a cpuset when its nohz flag gets cleared.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/cpuset.h   |    2 ++
 kernel/cpuset.c          |   23 +++++++++++++++++++++++
 kernel/time/tick-sched.c |    8 ++++++++
 3 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 5510708..89ef5f3 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -263,6 +263,8 @@ static inline bool cpuset_adaptive_nohz(void)
 
 	return false;
 }
+
+extern void cpuset_exit_nohz_interrupt(void *unused);
 #else
 static inline bool cpuset_cpu_adaptive_nohz(int cpu) { return false; }
 static inline bool cpuset_adaptive_nohz(void) { return false; }
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 5a28cf8..00864a0 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1221,6 +1221,14 @@ static void cpuset_change_flag(struct task_struct *tsk,
 
 DEFINE_PER_CPU(int, cpu_adaptive_nohz_ref);
 
+static void cpu_exit_nohz(int cpu)
+{
+	preempt_disable();
+	smp_call_function_single(cpu, cpuset_exit_nohz_interrupt,
+				 NULL, true);
+	preempt_enable();
+}
+
 static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
 {
 	int cpu;
@@ -1234,6 +1242,21 @@ static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
 			per_cpu(cpu_adaptive_nohz_ref, cpu) += 1;
 		else
 			per_cpu(cpu_adaptive_nohz_ref, cpu) -= 1;
+
+		val = per_cpu(cpu_adaptive_nohz_ref, cpu);
+
+		if (!val) {
+			/*
+			 * The update to cpu_adaptive_nohz_ref must be
+			 * visible right away. So that once we restart the tick
+			 * from the IPI, it won't be stopped again due to cache
+			 * update lag.
+			 * FIXME: We probably need more to ensure this value is really
+			 * visible right away.
+			 */
+			smp_mb();
+			cpu_exit_nohz(cpu);
+		}
 	}
 }
 #else
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fc35d41..fe31add 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -875,6 +875,14 @@ void tick_nohz_check_adaptive(void)
 	}
 }
 
+void cpuset_exit_nohz_interrupt(void *unused)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (ts->tick_stopped && !is_idle_task(current))
+		tick_nohz_restart_adaptive();
+}
+
 void tick_nohz_post_schedule(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 15/41] nohz/cpuset: Restart the tick if printk needs it
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (13 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 14/41] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 16/41] rcu: Restart the tick on non-responding adaptive nohz CPUs Frederic Weisbecker
                   ` (26 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

If we are in adaptive nohz mode and printk is called, there is no
tick to wake up the logger. We need to restart the tick when that
happens. Do this asynchronously by issuing a tick restart self IPI,
to avoid deadlocking against whatever locks are currently held.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/printk.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/kernel/printk.c b/kernel/printk.c
index 32690a0..a32f291 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -41,6 +41,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/rculist.h>
+#include <linux/cpuset.h>
 
 #include <asm/uaccess.h>
 
@@ -1230,8 +1231,20 @@ int printk_needs_cpu(int cpu)
 
 void wake_up_klogd(void)
 {
-	if (waitqueue_active(&log_wait))
+	unsigned long flags;
+
+	if (waitqueue_active(&log_wait)) {
 		this_cpu_write(printk_pending, 1);
+		/* Make it visible to any interrupt from now on */
+		barrier();
+		/*
+		 * It's safe to check that even if interrupts are not disabled.
+		 * If we enable nohz adaptive mode concurrently, we'll see the
+		 * printk_pending value and thus keep a periodic tick behaviour.
+		 */
+		if (cpuset_adaptive_nohz())
+			smp_cpuset_update_nohz(smp_processor_id());
+	}
 }
 
 /**
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 16/41] rcu: Restart the tick on non-responding adaptive nohz CPUs
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (14 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 15/41] nohz/cpuset: Restart the tick if printk needs it Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-05-22 17:20   ` Paul E. McKenney
  2012-04-30 23:54 ` [PATCH 17/41] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU Frederic Weisbecker
                   ` (25 subsequent siblings)
  41 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

When a CPU in adaptive nohz mode doesn't respond to help
complete a grace period, issue it a specific IPI so that it
restarts the tick and reaches a quiescent state.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/rcutree.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e141c7e..3fffc26 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -50,6 +50,7 @@
 #include <linux/wait.h>
 #include <linux/kthread.h>
 #include <linux/prefetch.h>
+#include <linux/cpuset.h>
 
 #include "rcutree.h"
 #include <trace/events/rcu.h>
@@ -302,6 +303,20 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
 
 #ifdef CONFIG_SMP
 
+static void cpuset_update_rcu_cpu(int cpu)
+{
+#ifdef CONFIG_CPUSETS_NO_HZ
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	if (cpuset_cpu_adaptive_nohz(cpu))
+		smp_cpuset_update_nohz(cpu);
+
+	local_irq_restore(flags);
+#endif
+}
+
 /*
  * If the specified CPU is offline, tell the caller that it is in
  * a quiescent state.  Otherwise, whack it with a reschedule IPI.
@@ -325,6 +340,8 @@ static int rcu_implicit_offline_qs(struct rcu_data *rdp)
 		return 1;
 	}
 
+	cpuset_update_rcu_cpu(rdp->cpu);
+
 	/*
 	 * The CPU is online, so send it a reschedule IPI.  This forces
 	 * it through the scheduler, and (inefficiently) also handles cases
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 17/41] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (15 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 16/41] rcu: Restart the tick on non-responding adaptive nohz CPUs Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-05-22 17:27   ` Paul E. McKenney
  2012-04-30 23:54 ` [PATCH 18/41] nohz: Generalize tickless cpu time accounting Frederic Weisbecker
                   ` (24 subsequent siblings)
  41 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

If we enqueue an rcu callback, we need the CPU tick to stay
alive until we take care of it by completing the appropriate
grace period.

Thus, when we call_rcu(), send a self IPI that checks rcu_needs_cpu()
so that we restore a periodic tick behaviour that can take care of
everything.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/rcutree.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 3fffc26..b8d300c 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1749,6 +1749,13 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
 	else
 		trace_rcu_callback(rsp->name, head, rdp->qlen);
 
+	/* Restart the tick if needed to handle the callbacks */
+	if (cpuset_adaptive_nohz()) {
+		/* Make updates on nxtlist visible to self IPI */
+		barrier();
+		smp_cpuset_update_nohz(smp_processor_id());
+	}
+
 	/* If interrupts were disabled, don't dive into RCU core. */
 	if (irqs_disabled_flags(flags)) {
 		local_irq_restore(flags);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 18/41] nohz: Generalize tickless cpu time accounting
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (16 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 17/41] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 19/41] nohz/cpuset: Account user and system times in adaptive nohz mode Frederic Weisbecker
                   ` (23 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

When the CPU enters idle, it saves the jiffies stamp into
ts->idle_jiffies, increments this value by one on every timer
interrupt, and accounts "jiffies - ts->idle_jiffies" idle ticks
when it exits idle. This way we still account the idle CPU time
even if the tick is stopped.

This patch lays the groundwork to generalize this for user
and system accounting. ts->idle_jiffies becomes ts->saved_jiffies and
a new member ts->saved_jiffies_whence indicates from which domain
we saved the jiffies: user, system or idle.

This is one more step toward making the tickless infrastructure usable
beyond idle contexts.

For now this is only used by idle, but further patches make use of
it for user and system time as well.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/kernel_stat.h |    2 +
 include/linux/tick.h        |   45 ++++++++++++++++++++-------------
 kernel/sched/core.c         |   22 ++++++++++++++++
 kernel/time/tick-sched.c    |   57 ++++++++++++++++++++++++++++---------------
 kernel/time/timer_list.c    |    3 +-
 5 files changed, 90 insertions(+), 39 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 2fbd905..be90056 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -122,7 +122,9 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
 extern unsigned long long task_delta_exec(struct task_struct *);
 
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
+extern void account_user_ticks(struct task_struct *, unsigned long);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
+extern void account_system_ticks(struct task_struct *, unsigned long);
 extern void account_steal_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 9b66fd3..03b6edd 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -27,25 +27,33 @@ enum tick_nohz_mode {
 	NOHZ_MODE_HIGHRES,
 };
 
+enum tick_saved_jiffies {
+	JIFFIES_SAVED_NONE,
+	JIFFIES_SAVED_IDLE,
+	JIFFIES_SAVED_USER,
+	JIFFIES_SAVED_SYS,
+};
+
 /**
  * struct tick_sched - sched tick emulation and no idle tick control/stats
- * @sched_timer:	hrtimer to schedule the periodic tick in high
- *			resolution mode
- * @last_tick:		Store the last tick expiry time when the tick
- *			timer is modified for nohz sleeps. This is necessary
- *			to resume the tick timer operation in the timeline
- *			when the CPU returns from nohz sleep.
- * @tick_stopped:	Indicator that the idle tick has been stopped
- * @idle_jiffies:	jiffies at the entry to idle for idle time accounting
- * @idle_calls:		Total number of idle calls
- * @idle_sleeps:	Number of idle calls, where the sched tick was stopped
- * @idle_entrytime:	Time when the idle call was entered
- * @idle_waketime:	Time when the idle was interrupted
- * @idle_exittime:	Time when the idle state was left
- * @idle_sleeptime:	Sum of the time slept in idle with sched tick stopped
- * @iowait_sleeptime:	Sum of the time slept in idle with sched tick stopped, with IO outstanding
- * @sleep_length:	Duration of the current idle sleep
- * @do_timer_lst:	CPU was the last one doing do_timer before going idle
+ * @sched_timer:		hrtimer to schedule the periodic tick in high
+ *				resolution mode
+ * @last_tick:			Store the last tick expiry time when the tick
+ *				timer is modified for nohz sleeps. This is necessary
+ *				to resume the tick timer operation in the timeline
+ *				when the CPU returns from nohz sleep.
+ * @tick_stopped:		Indicator that the idle tick has been stopped
+ * @idle_calls:			Total number of idle calls
+ * @idle_sleeps:		Number of idle calls, where the sched tick was stopped
+ * @idle_entrytime:		Time when the idle call was entered
+ * @idle_waketime:		Time when the idle was interrupted
+ * @idle_exittime:		Time when the idle state was left
+ * @idle_sleeptime:		Sum of the time slept in idle with sched tick stopped
+ * @saved_jiffies:		Jiffies snapshot on tick stop for cpu time accounting
+ * @saved_jiffies_whence:	Area where we saved @saved_jiffies
+ * @iowait_sleeptime:		Sum of the time slept in idle with sched tick stopped, with IO outstanding
+ * @sleep_length:		Duration of the current idle sleep
+ * @do_timer_lst:		CPU was the last one doing do_timer before going idle
  */
 struct tick_sched {
 	struct hrtimer			sched_timer;
@@ -54,7 +62,6 @@ struct tick_sched {
 	ktime_t				last_tick;
 	int				inidle;
 	int				tick_stopped;
-	unsigned long			idle_jiffies;
 	unsigned long			idle_calls;
 	unsigned long			idle_sleeps;
 	int				idle_active;
@@ -62,6 +69,8 @@ struct tick_sched {
 	ktime_t				idle_waketime;
 	ktime_t				idle_exittime;
 	ktime_t				idle_sleeptime;
+	enum tick_saved_jiffies		saved_jiffies_whence;
+	unsigned long			saved_jiffies;
 	ktime_t				iowait_sleeptime;
 	ktime_t				sleep_length;
 	unsigned long			last_jiffies;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ba9e4d4..eca842e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2693,6 +2693,17 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 	acct_update_integrals(p);
 }
 
+void account_user_ticks(struct task_struct *p, unsigned long ticks)
+{
+	cputime_t delta_cputime, delta_scaled;
+
+	if (ticks) {
+		delta_cputime = jiffies_to_cputime(ticks);
+		delta_scaled = cputime_to_scaled(ticks);
+		account_user_time(p, delta_cputime, delta_scaled);
+	}
+}
+
 /*
  * Account guest cpu time to a process.
  * @p: the process that the cpu time gets accounted to
@@ -2770,6 +2781,17 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	__account_system_time(p, cputime, cputime_scaled, index);
 }
 
+void account_system_ticks(struct task_struct *p, unsigned long ticks)
+{
+	cputime_t delta_cputime, delta_scaled;
+
+	if (ticks) {
+		delta_cputime = jiffies_to_cputime(ticks);
+		delta_scaled = cputime_to_scaled(ticks);
+		account_system_time(p, 0, delta_cputime, delta_scaled);
+	}
+}
+
 /*
  * Account for involuntary wait time.
  * @cputime: the cpu time spent in involuntary wait
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fe31add..b5ad06d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -461,7 +461,8 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts)
 		}
 
 		if (!was_stopped && ts->tick_stopped) {
-			ts->idle_jiffies = ts->last_jiffies;
+			ts->saved_jiffies = ts->last_jiffies;
+			ts->saved_jiffies_whence = JIFFIES_SAVED_IDLE;
 			select_nohz_load_balancer(1);
 		}
 	}
@@ -640,22 +641,36 @@ void tick_nohz_restart_sched_tick(void)
 }
 
 
-static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
+static void tick_nohz_account_ticks(struct tick_sched *ts)
 {
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	unsigned long ticks;
 	/*
-	 * We stopped the tick in idle. Update process times would miss the
-	 * time we slept as update_process_times does only a 1 tick
-	 * accounting. Enforce that this is accounted to idle !
+	 * We stopped the tick. Update process times would miss the
+	 * time we ran tickless as update_process_times does only a 1 tick
+	 * accounting. Enforce that this is accounted to nohz timeslices.
 	 */
-	ticks = jiffies - ts->idle_jiffies;
+	ticks = jiffies - ts->saved_jiffies;
 	/*
 	 * We might be one off. Do not randomly account a huge number of ticks!
 	 */
-	if (ticks && ticks < LONG_MAX)
-		account_idle_ticks(ticks);
-#endif
+	if (ticks && ticks < LONG_MAX) {
+		switch (ts->saved_jiffies_whence) {
+		case JIFFIES_SAVED_IDLE:
+			account_idle_ticks(ticks);
+			break;
+		case JIFFIES_SAVED_USER:
+			account_user_ticks(current, ticks);
+			break;
+		case JIFFIES_SAVED_SYS:
+			account_system_ticks(current, ticks);
+			break;
+		case JIFFIES_SAVED_NONE:
+			break;
+		default:
+			WARN_ON_ONCE(1);
+		}
+	}
+	ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
 }
 
 /**
@@ -687,7 +702,9 @@ void tick_nohz_idle_exit(void)
 	if (ts->tick_stopped) {
 		select_nohz_load_balancer(0);
 		__tick_nohz_restart_sched_tick(ts, now);
-		tick_nohz_account_idle_ticks(ts);
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+		tick_nohz_account_ticks(ts);
+#endif
 	}
 
 	local_irq_enable();
@@ -735,7 +752,7 @@ static void tick_nohz_handler(struct clock_event_device *dev)
 	 */
 	if (ts->tick_stopped) {
 		touch_softlockup_watchdog();
-		ts->idle_jiffies++;
+		ts->saved_jiffies++;
 	}
 
 	update_process_times(user_mode(regs));
@@ -944,17 +961,17 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 	if (regs) {
 		int user = user_mode(regs);
 		/*
-		 * When we are idle and the tick is stopped, we have to touch
-		 * the watchdog as we might not schedule for a really long
-		 * time. This happens on complete idle SMP systems while
-		 * waiting on the login prompt. We also increment the "start of
-		 * idle" jiffy stamp so the idle accounting adjustment we do
-		 * when we go busy again does not account too much ticks.
+		 * When the tick is stopped, we have to touch the watchdog
+		 * as we might not schedule for a really long time. This
+		 * happens on complete idle SMP systems while waiting on
+		 * the login prompt. We also increment the last jiffy stamp
+		 * recorded when we stopped the tick so the cpu time accounting
+		 * adjustment does not account too much ticks when we flush them.
 		 */
 		if (ts->tick_stopped) {
+			/* CHECKME: may be this is only needed in idle */
 			touch_softlockup_watchdog();
-			if (idle_cpu(cpu))
-				ts->idle_jiffies++;
+			ts->saved_jiffies++;
 		}
 		update_process_times(user);
 		profile_tick(CPU_PROFILING);
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index af5a7e9..54705e3 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -169,7 +169,8 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
 		P(nohz_mode);
 		P_ns(last_tick);
 		P(tick_stopped);
-		P(idle_jiffies);
+		/* CHECKME: Do we want saved_jiffies_whence as well? */
+		P(saved_jiffies);
 		P(idle_calls);
 		P(idle_sleeps);
 		P_ns(idle_entrytime);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 19/41] nohz/cpuset: Account user and system times in adaptive nohz mode
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (17 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 18/41] nohz: Generalize tickless cpu time accounting Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 20/41] nohz/cpuset: New API to flush cputimes on nohz cpusets Frederic Weisbecker
                   ` (22 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

If we are not running the tick, we no longer account the
user/system cputime on every jiffy.

To solve this, save a snapshot of the jiffies when we stop the tick
and keep track of where we saved it: user or system. On top of this,
we account the cputime elapsed when we cross the kernel entry/exit
boundaries and when we restart the tick.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/tick.h     |   12 ++++
 kernel/sched/core.c      |    1 +
 kernel/time/tick-sched.c |  131 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 142 insertions(+), 2 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 03b6edd..598b492 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -153,11 +153,23 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 # endif /* !NO_HZ */
 
 #ifdef CONFIG_CPUSETS_NO_HZ
+extern void tick_nohz_enter_kernel(void);
+extern void tick_nohz_exit_kernel(void);
+extern void tick_nohz_enter_exception(struct pt_regs *regs);
+extern void tick_nohz_exit_exception(struct pt_regs *regs);
 extern void tick_nohz_check_adaptive(void);
+extern void tick_nohz_pre_schedule(void);
 extern void tick_nohz_post_schedule(void);
+extern bool tick_nohz_account_tick(void);
 #else /* !CPUSETS_NO_HZ */
+static inline void tick_nohz_enter_kernel(void) { }
+static inline void tick_nohz_exit_kernel(void) { }
+static inline void tick_nohz_enter_exception(struct pt_regs *regs) { }
+static inline void tick_nohz_exit_exception(struct pt_regs *regs) { }
 static inline void tick_nohz_check_adaptive(void) { }
+static inline void tick_nohz_pre_schedule(void) { }
 static inline void tick_nohz_post_schedule(void) { }
+static inline bool tick_nohz_account_tick(void) { return false; }
 #endif /* CPUSETS_NO_HZ */
 
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eca842e..5debfd7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1923,6 +1923,7 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
+	tick_nohz_pre_schedule();
 	sched_info_switch(prev, next);
 	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b5ad06d..a68909a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -526,7 +526,13 @@ static bool can_stop_adaptive_tick(void)
 
 static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
 {
+	struct pt_regs *regs = get_irq_regs();
 	int cpu = smp_processor_id();
+	int was_stopped;
+	int user = 0;
+
+	if (regs)
+		user = user_mode(regs);
 
 	if (!cpuset_adaptive_nohz() || is_idle_task(current))
 		return;
@@ -537,7 +543,36 @@ static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
 	if (!can_stop_adaptive_tick())
 		return;
 
+	/*
+	 * If we stop the tick between the syscall exit hook and the actual
+	 * return to userspace, we'll think we are in system space (due to
+	 * user_mode() thinking so). And since we passed the syscall exit hook
+	 * already we won't realize we are in userspace. So the time spent
+	 * tickless would be spuriously accounted as belonging to system.
+	 *
+	 * To avoid this kind of problem, we only stop the tick from userspace
+	 * (until we find a better solution).
+	 * We can later enter the kernel and keep the tick stopped. But the place
+	 * where we stop the tick must be userspace.
+	 * We make an exception for kernel threads since they always execute in
+	 * kernel space.
+	 */
+	if (!user && current->mm)
+		return;
+
+	was_stopped = ts->tick_stopped;
 	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+
+	if (!was_stopped && ts->tick_stopped) {
+		WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_NONE);
+		if (user)
+			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
+		else if (!current->mm)
+			ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
+
+		ts->saved_jiffies = jiffies;
+		set_thread_flag(TIF_NOHZ);
+	}
 }
 #else
 static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts) { }
@@ -864,6 +899,70 @@ void tick_check_idle(int cpu)
 }
 
 #ifdef CONFIG_CPUSETS_NO_HZ
+void tick_nohz_exit_kernel(void)
+{
+	unsigned long flags;
+	struct tick_sched *ts;
+	unsigned long delta_jiffies;
+
+	local_irq_save(flags);
+
+	ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (!ts->tick_stopped) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_SYS);
+
+	delta_jiffies = jiffies - ts->saved_jiffies;
+	account_system_ticks(current, delta_jiffies);
+
+	ts->saved_jiffies = jiffies;
+	ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
+
+	local_irq_restore(flags);
+}
+
+void tick_nohz_enter_kernel(void)
+{
+	unsigned long flags;
+	struct tick_sched *ts;
+	unsigned long delta_jiffies;
+
+	local_irq_save(flags);
+
+	ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (!ts->tick_stopped) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_USER);
+
+	delta_jiffies = jiffies - ts->saved_jiffies;
+	account_user_ticks(current, delta_jiffies);
+
+	ts->saved_jiffies = jiffies;
+	ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
+
+	local_irq_restore(flags);
+}
+
+void tick_nohz_enter_exception(struct pt_regs *regs)
+{
+	if (user_mode(regs))
+		tick_nohz_enter_kernel();
+}
+
+void tick_nohz_exit_exception(struct pt_regs *regs)
+{
+	if (user_mode(regs))
+		tick_nohz_exit_kernel();
+}
+
 /*
  * Take the timer duty if nobody is taking care of it.
  * If a CPU already does and and it's in a nohz cpuset,
@@ -882,13 +981,22 @@ static void tick_do_timer_check_handler(int cpu)
 	}
 }
 
+static void tick_nohz_restart_adaptive(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	tick_nohz_account_ticks(ts);
+	tick_nohz_restart_sched_tick();
+	clear_thread_flag(TIF_NOHZ);
+}
+
 void tick_nohz_check_adaptive(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 
 	if (ts->tick_stopped && !is_idle_task(current)) {
 		if (!can_stop_adaptive_tick())
-			tick_nohz_restart_sched_tick();
+			tick_nohz_restart_adaptive();
 	}
 }
 
@@ -900,6 +1008,26 @@ void cpuset_exit_nohz_interrupt(void *unused)
 		tick_nohz_restart_adaptive();
 }
 
+/*
+ * Flush cputime and clear hooks before context switch in case we
+ * haven't yet received the IPI that should take care of that.
+ */
+void tick_nohz_pre_schedule(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	/*
+	 * We are holding the rq lock and if we restart the tick now
+	 * we could deadlock by acquiring the lock twice. Instead
+	 * we do that on post schedule time. For now do the cleanups
+	 * on the prev task.
+	 */
+	if (ts->tick_stopped) {
+		tick_nohz_account_ticks(ts);
+		clear_thread_flag(TIF_NOHZ);
+	}
+}
+
 void tick_nohz_post_schedule(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
@@ -912,7 +1040,6 @@ void tick_nohz_post_schedule(void)
 	if (ts->tick_stopped)
 		tick_nohz_restart_sched_tick();
 }
-
 #else
 
 static void tick_do_timer_check_handler(int cpu)
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 20/41] nohz/cpuset: New API to flush cputimes on nohz cpusets
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (18 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 19/41] nohz/cpuset: Account user and system times in adaptive nohz mode Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 21/41] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader Frederic Weisbecker
                   ` (21 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Provide a new API that sends an IPI to every CPU included
in a nohz cpuset in order to flush its cputimes. It's going
to be useful for those who want to see accurate cputimes
on a nohz cpuset.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/cpuset.h   |    2 ++
 include/linux/tick.h     |    1 +
 kernel/cpuset.c          |   34 +++++++++++++++++++++++++++++++++-
 kernel/time/tick-sched.c |   21 ++++++++++++++++-----
 4 files changed, 52 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 89ef5f3..ccbc2fd 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -265,9 +265,11 @@ static inline bool cpuset_adaptive_nohz(void)
 }
 
 extern void cpuset_exit_nohz_interrupt(void *unused);
+extern void cpuset_nohz_flush_cputimes(void);
 #else
 static inline bool cpuset_cpu_adaptive_nohz(int cpu) { return false; }
 static inline bool cpuset_adaptive_nohz(void) { return false; }
+static inline void cpuset_nohz_flush_cputimes(void) { }
 
 #endif /* CONFIG_CPUSETS_NO_HZ */
 
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 598b492..3c31d6e 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -161,6 +161,7 @@ extern void tick_nohz_check_adaptive(void);
 extern void tick_nohz_pre_schedule(void);
 extern void tick_nohz_post_schedule(void);
 extern bool tick_nohz_account_tick(void);
+extern void tick_nohz_flush_current_times(bool restart_tick);
 #else /* !CPUSETS_NO_HZ */
 static inline void tick_nohz_enter_kernel(void) { }
 static inline void tick_nohz_exit_kernel(void) { }
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 00864a0..aa8304d 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -59,6 +59,7 @@
 #include <linux/mutex.h>
 #include <linux/workqueue.h>
 #include <linux/cgroup.h>
+#include <linux/tick.h>
 
 /*
  * Workqueue for cpuset related tasks.
@@ -1221,6 +1222,23 @@ static void cpuset_change_flag(struct task_struct *tsk,
 
 DEFINE_PER_CPU(int, cpu_adaptive_nohz_ref);
 
+static cpumask_t nohz_cpuset_mask;
+
+static void flush_cputime_interrupt(void *unused)
+{
+	tick_nohz_flush_current_times(false);
+}
+
+void cpuset_nohz_flush_cputimes(void)
+{
+	preempt_disable();
+	smp_call_function_many(&nohz_cpuset_mask, flush_cputime_interrupt,
+			       NULL, true);
+	preempt_enable();
+	/* Make the utime/stime updates visible */
+	smp_mb();
+}
+
 static void cpu_exit_nohz(int cpu)
 {
 	preempt_disable();
@@ -1245,7 +1263,15 @@ static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
 
 		val = per_cpu(cpu_adaptive_nohz_ref, cpu);
 
-		if (!val) {
+		if (val == 1) {
+			cpumask_set_cpu(cpu, &nohz_cpuset_mask);
+			/*
+			 * The mask update needs to be visible right away
+			 * so that this CPU is part of the cputime IPI
+			 * update right now.
+			 */
+			 smp_mb();
+		} else if (!val) {
 			/*
 			 * The update to cpu_adaptive_nohz_ref must be
 			 * visible right away. So that once we restart the tick
@@ -1256,6 +1282,12 @@ static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
 			 */
 			smp_mb();
 			cpu_exit_nohz(cpu);
+			/*
+			 * Now that the tick has been restarted and cputimes
+			 * flushed, we don't need anymore to be part of the
+			 * cputime flush IPI.
+			 */
+			cpumask_clear_cpu(cpu, &nohz_cpuset_mask);
 		}
 	}
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a68909a..5933506 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -705,7 +705,6 @@ static void tick_nohz_account_ticks(struct tick_sched *ts)
 			WARN_ON_ONCE(1);
 		}
 	}
-	ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
 }
 
 /**
@@ -739,6 +738,7 @@ void tick_nohz_idle_exit(void)
 		__tick_nohz_restart_sched_tick(ts, now);
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 		tick_nohz_account_ticks(ts);
+		ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
 #endif
 	}
 
@@ -983,9 +983,7 @@ static void tick_do_timer_check_handler(int cpu)
 
 static void tick_nohz_restart_adaptive(void)
 {
-	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
-
-	tick_nohz_account_ticks(ts);
+	tick_nohz_flush_current_times(true);
 	tick_nohz_restart_sched_tick();
 	clear_thread_flag(TIF_NOHZ);
 }
@@ -1023,7 +1021,7 @@ void tick_nohz_pre_schedule(void)
 	 * on the prev task.
 	 */
 	if (ts->tick_stopped) {
-		tick_nohz_account_ticks(ts);
+		tick_nohz_flush_current_times(true);
 		clear_thread_flag(TIF_NOHZ);
 	}
 }
@@ -1040,6 +1038,19 @@ void tick_nohz_post_schedule(void)
 	if (ts->tick_stopped)
 		tick_nohz_restart_sched_tick();
 }
+
+void tick_nohz_flush_current_times(bool restart_tick)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (ts->tick_stopped) {
+		tick_nohz_account_ticks(ts);
+		if (restart_tick)
+			ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
+		else
+			ts->saved_jiffies = jiffies;
+	}
+}
 #else
 
 static void tick_do_timer_check_handler(int cpu)
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 21/41] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (19 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 20/41] nohz/cpuset: New API to flush cputimes on nohz cpusets Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 22/41] nohz/cpuset: Flush cputimes on procfs stat file read Frederic Weisbecker
                   ` (20 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

When we wait for a zombie task, flush the cputimes on nohz cpusets
in case we are waiting for a group leader that has threads running
on nohz CPUs. This way thread_group_times() doesn't report stale
values.

<doubts>
If I understood the code correctly, by the time we call thread_group_times(),
we may have children that are still running, so this is necessary.
But I need to check deeper.
</doubts>

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/exit.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 4b4042f..c194662 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -52,6 +52,7 @@
 #include <linux/hw_breakpoint.h>
 #include <linux/oom.h>
 #include <linux/writeback.h>
+#include <linux/cpuset.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -1712,6 +1713,13 @@ repeat:
 	   (!wo->wo_pid || hlist_empty(&wo->wo_pid->tasks[wo->wo_type])))
 		goto notask;
 
+	/*
+	 * For cputime in sub-threads before adding them.
+	 * Must be called outside tasklist_lock lock because write lock
+	 * can be acquired under irqs disabled.
+	 */
+	cpuset_nohz_flush_cputimes();
+
 	set_current_state(TASK_INTERRUPTIBLE);
 	read_lock(&tasklist_lock);
 	tsk = current;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 22/41] nohz/cpuset: Flush cputimes on procfs stat file read
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (20 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 21/41] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 23/41] nohz/cpuset: Flush cputimes for getrusage() and times() syscalls Frederic Weisbecker
                   ` (19 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

When we read a process's procfs stat file, we need
to flush the cputimes of the tasks running in nohz
cpusets in case some children in the thread group are
running there.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 fs/proc/array.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index c602b8d..0dc88ad 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -397,6 +397,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	cutime = cstime = utime = stime = 0;
 	cgtime = gtime = 0;
 
+	/* For thread group times */
+	cpuset_nohz_flush_cputimes();
 	if (lock_task_sighand(task, &flags)) {
 		struct signal_struct *sig = task->signal;
 
-- 
1.7.5.4



* [PATCH 23/41] nohz/cpuset: Flush cputimes for getrusage() and times() syscalls
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (21 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 22/41] nohz/cpuset: Flush cputimes on procfs stat file read Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 24/41] x86: Syscall hooks for nohz cpusets Frederic Weisbecker
                   ` (18 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Both syscalls need to iterate through the thread group to get
the cputimes. As some threads of the group may be running in a
nohz cpuset, we need to flush the cputimes there.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sys.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 4070153..5b3e880 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -45,6 +45,7 @@
 #include <linux/syscalls.h>
 #include <linux/kprobes.h>
 #include <linux/user_namespace.h>
+#include <linux/cpuset.h>
 
 #include <linux/kmsg_dump.h>
 /* Move somewhere else to avoid recompiling? */
@@ -950,6 +951,8 @@ void do_sys_times(struct tms *tms)
 {
 	cputime_t tgutime, tgstime, cutime, cstime;
 
+	cpuset_nohz_flush_cputimes();
+
 	spin_lock_irq(&current->sighand->siglock);
 	thread_group_times(current, &tgutime, &tgstime);
 	cutime = current->signal->cutime;
@@ -1614,6 +1617,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
 		goto out;
 	}
 
+	/* For thread_group_times */
+	cpuset_nohz_flush_cputimes();
+
 	if (!lock_task_sighand(p, &flags))
 		return;
 
-- 
1.7.5.4



* [PATCH 24/41] x86: Syscall hooks for nohz cpusets
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (22 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 23/41] nohz/cpuset: Flush cputimes for getrusage() and times() syscalls Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:54 ` [PATCH 25/41] x86: Exception " Frederic Weisbecker
                   ` (17 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Add syscall hooks to notify syscall entry and exit on
CPUs running in adaptive nohz mode.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/thread_info.h |   10 +++++++---
 arch/x86/kernel/ptrace.c           |   10 ++++++++++
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index cfd8144..0c1724e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -88,6 +88,7 @@ struct thread_info {
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* 32bit process */
 #define TIF_FORK		18	/* ret_from_fork */
+#define TIF_NOHZ		19	/* in nohz userspace mode */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_DEBUG		21	/* uses debug registers */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
@@ -110,6 +111,7 @@ struct thread_info {
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
+#define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_DEBUG		(1 << TIF_DEBUG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
@@ -120,12 +122,13 @@ struct thread_info {
 /* work to do in syscall_trace_enter() */
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
-	 _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)
+	 _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT |	\
+	 _TIF_NOHZ)
 
 /* work to do in syscall_trace_leave() */
 #define _TIF_WORK_SYSCALL_EXIT	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP |	\
-	 _TIF_SYSCALL_TRACEPOINT)
+	 _TIF_SYSCALL_TRACEPOINT | _TIF_NOHZ)
 
 /* work to do on interrupt/exception return */
 #define _TIF_WORK_MASK							\
@@ -135,7 +138,8 @@ struct thread_info {
 
 /* work to do on any return to user space */
 #define _TIF_ALLWORK_MASK						\
-	((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT)
+	((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT |	\
+	_TIF_NOHZ)
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5026738..2966791 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -21,6 +21,7 @@
 #include <linux/signal.h>
 #include <linux/perf_event.h>
 #include <linux/hw_breakpoint.h>
+#include <linux/tick.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -1369,6 +1370,9 @@ long syscall_trace_enter(struct pt_regs *regs)
 {
 	long ret = 0;
 
+	/* Notify nohz task syscall early so the rest can use rcu */
+	tick_nohz_enter_kernel();
+
 	/*
 	 * If we stepped into a sysenter/syscall insn, it trapped in
 	 * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
@@ -1427,4 +1431,10 @@ void syscall_trace_leave(struct pt_regs *regs)
 			!test_thread_flag(TIF_SYSCALL_EMU);
 	if (step || test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall_exit(regs, step);
+
+	/*
+	 * Notify nohz task exit syscall at last so the rest can
+	 * use rcu.
+	 */
+	tick_nohz_exit_kernel();
 }
-- 
1.7.5.4



* [PATCH 25/41] x86: Exception hooks for nohz cpusets
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (23 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 24/41] x86: Syscall hooks for nohz cpusets Frederic Weisbecker
@ 2012-04-30 23:54 ` Frederic Weisbecker
  2012-04-30 23:55 ` [PATCH 26/41] x86: Add adaptive tickless hooks on do_notify_resume() Frederic Weisbecker
                   ` (16 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:54 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Add the necessary hooks to x86 exceptions for nohz cpusets
support. This includes traps, page faults, debug exceptions,
etc...

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/Kconfig        |    1 +
 arch/x86/kernel/traps.c |   20 ++++++++++++++------
 arch/x86/mm/fault.c     |   13 +++++++++++--
 3 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5bed94e..0d3116c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -82,6 +82,7 @@ config X86
 	select CLKEVT_I8253
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select GENERIC_IOMAP
+	select HAVE_CPUSETS_NO_HZ
 
 config INSTRUCTION_DECODER
 	def_bool (KPROBES || PERF_EVENTS)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 4bbe04d..977d0b9 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/timer.h>
 #include <linux/init.h>
+#include <linux/tick.h>
 #include <linux/bug.h>
 #include <linux/nmi.h>
 #include <linux/mm.h>
@@ -301,15 +302,17 @@ gp_in_kernel:
 /* May run on IST stack. */
 dotraplinkage void __kprobes do_int3(struct pt_regs *regs, long error_code)
 {
+	tick_nohz_enter_exception(regs);
+
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
 			== NOTIFY_STOP)
-		return;
+		goto exit;
 #endif /* CONFIG_KGDB_LOW_LEVEL_TRAP */
 
 	if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
 			== NOTIFY_STOP)
-		return;
+		goto exit;
 
 	/*
 	 * Let others (NMI) know that the debug stack is in use
@@ -320,6 +323,8 @@ dotraplinkage void __kprobes do_int3(struct pt_regs *regs, long error_code)
 	do_trap(3, SIGTRAP, "int3", regs, error_code, NULL);
 	preempt_conditional_cli(regs);
 	debug_stack_usage_dec();
+exit:
+	tick_nohz_exit_exception(regs);
 }
 
 #ifdef CONFIG_X86_64
@@ -380,6 +385,8 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 	unsigned long dr6;
 	int si_code;
 
+	tick_nohz_enter_exception(regs);
+
 	get_debugreg(dr6, 6);
 
 	/* Filter out all the reserved bits which are preset to 1 */
@@ -395,7 +402,7 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 
 	/* Catch kmemcheck conditions first of all! */
 	if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
-		return;
+		goto exit;
 
 	/* DR6 may or may not be cleared by the CPU */
 	set_debugreg(0, 6);
@@ -410,7 +417,7 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 
 	if (notify_die(DIE_DEBUG, "debug", regs, PTR_ERR(&dr6), error_code,
 							SIGTRAP) == NOTIFY_STOP)
-		return;
+		goto exit;
 
 	/*
 	 * Let others (NMI) know that the debug stack is in use
@@ -426,7 +433,7 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 				error_code, 1);
 		preempt_conditional_cli(regs);
 		debug_stack_usage_dec();
-		return;
+		goto exit;
 	}
 
 	/*
@@ -447,7 +454,8 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 	preempt_conditional_cli(regs);
 	debug_stack_usage_dec();
 
-	return;
+exit:
+	tick_nohz_exit_exception(regs);
 }
 
 /*
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f0b4caf..6c4c983 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -13,6 +13,7 @@
 #include <linux/perf_event.h>		/* perf_sw_event		*/
 #include <linux/hugetlb.h>		/* hstate_index_to_shift	*/
 #include <linux/prefetch.h>		/* prefetchw			*/
+#include <linux/tick.h>
 
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
@@ -1000,8 +1001,8 @@ static int fault_in_kernel_space(unsigned long address)
  * and the problem, and then passes it off to one of the appropriate
  * routines.
  */
-dotraplinkage void __kprobes
-do_page_fault(struct pt_regs *regs, unsigned long error_code)
+static void __kprobes
+__do_page_fault(struct pt_regs *regs, unsigned long error_code)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -1209,3 +1210,11 @@ good_area:
 
 	up_read(&mm->mmap_sem);
 }
+
+dotraplinkage void __kprobes
+do_page_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	tick_nohz_enter_exception(regs);
+	__do_page_fault(regs, error_code);
+	tick_nohz_exit_exception(regs);
+}
-- 
1.7.5.4



* [PATCH 26/41] x86: Add adaptive tickless hooks on do_notify_resume()
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (24 preceding siblings ...)
  2012-04-30 23:54 ` [PATCH 25/41] x86: Exception " Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-04-30 23:55 ` [PATCH 27/41] nohz/cpuset: enable addition&removal of cpus while in adaptive nohz mode Frederic Weisbecker
                   ` (15 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Before resuming to userspace, we may fall into do_notify_resume()
to handle signals or other things. Because we may be coming
from a syscall/exception or interrupt exit, we may be running in
RCU idle mode as we resume tickless to userspace.

However, do_notify_resume() may make use of RCU read side critical
sections, so we need to exit RCU idle mode before doing anything in
that path.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/signal.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 46a01bd..577fd93 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/personality.h>
 #include <linux/uaccess.h>
 #include <linux/user-return-notifier.h>
+#include <linux/tick.h>
 
 #include <asm/processor.h>
 #include <asm/ucontext.h>
@@ -810,6 +811,7 @@ static void do_signal(struct pt_regs *regs)
 void
 do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 {
+	tick_nohz_enter_kernel();
 #ifdef CONFIG_X86_MCE
 	/* notify userspace of pending MCEs */
 	if (thread_info_flags & _TIF_MCE_NOTIFY)
@@ -832,6 +834,7 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 #ifdef CONFIG_X86_32
 	clear_thread_flag(TIF_IRET);
 #endif /* CONFIG_X86_32 */
+	tick_nohz_exit_kernel();
 }
 
 void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
-- 
1.7.5.4



* [PATCH 27/41] nohz/cpuset: enable addition&removal of cpus while in adaptive nohz mode
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (25 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 26/41] x86: Add adaptive tickless hooks on do_notify_resume() Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-04-30 23:55 ` [PATCH 28/41] nohz: Don't restart the tick before scheduling to idle Frederic Weisbecker
                   ` (14 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Hakan Akkan, Frederic Weisbecker, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Christoph Lameter,
	Daniel Lezcano, Geoff Levand, Gilad Ben Yossef, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

From: Hakan Akkan <hakanakkan@gmail.com>

Currently, modifying the cpuset.cpus mask of a cgroup does not
update the reference counters for adaptive nohz mode if the
cpuset already had cpuset.adaptive_nohz == 1. Fix it so that
cpus can be added to or removed from an adaptive_nohz cpuset.

Signed-off-by: Hakan Akkan <hakanakkan@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/cpuset.c |  106 +++++++++++++++++++++++++++++++++++-------------------
 1 files changed, 69 insertions(+), 37 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index aa8304d..148d138 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -862,6 +862,8 @@ static void update_tasks_cpumask(struct cpuset *cs, struct ptr_heap *heap)
 	cgroup_scan_tasks(&scan);
 }
 
+static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs);
+
 /**
  * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
  * @cs: the cpuset to consider
@@ -902,6 +904,11 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (cpumask_equal(cs->cpus_allowed, trialcs->cpus_allowed))
 		return 0;
 
+	/*
+	 * Update adaptive nohz bits.
+	 */
+	update_nohz_cpus(cs, trialcs);
+
 	retval = heap_init(&heap, PAGE_SIZE, GFP_KERNEL, NULL);
 	if (retval)
 		return retval;
@@ -1247,51 +1254,73 @@ static void cpu_exit_nohz(int cpu)
 	preempt_enable();
 }
 
-static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
+static void update_cpu_nohz_flag(int cpu, int adjust)
+{
+	int ref = (per_cpu(cpu_adaptive_nohz_ref, cpu) += adjust);
+
+	if (ref == 1 && adjust > 0) {
+		cpumask_set_cpu(cpu, &nohz_cpuset_mask);
+		/*
+		 * The mask update needs to be visible right away
+		 * so that this CPU is part of the cputime IPI
+		 * update right now.
+		 */
+		 smp_mb();
+	} else if (!ref) {
+		/*
+		 * The update to cpu_adaptive_nohz_ref must be
+		 * visible right away. So that once we restart the tick
+		 * from the IPI, it won't be stopped again due to cache
+		 * update lag.
+		 * FIXME: We probably need more to ensure this value is really
+		 * visible right away.
+		 */
+		smp_mb();
+		cpu_exit_nohz(cpu);
+		/*
+		 * Now that the tick has been restarted and cputimes
+		 * flushed, we don't need anymore to be part of the
+		 * cputime flush IPI.
+		 */
+		cpumask_clear_cpu(cpu, &nohz_cpuset_mask);
+	}
+}
+
+static void update_nohz_flag(struct cpuset *old_cs, struct cpuset *cs)
 {
 	int cpu;
-	int val;
+	int adjust;
 
 	if (is_adaptive_nohz(old_cs) == is_adaptive_nohz(cs))
 		return;
 
+	adjust = is_adaptive_nohz(cs) ? 1 : -1;
 	for_each_cpu(cpu, cs->cpus_allowed) {
-		if (is_adaptive_nohz(cs))
-			per_cpu(cpu_adaptive_nohz_ref, cpu) += 1;
-		else
-			per_cpu(cpu_adaptive_nohz_ref, cpu) -= 1;
-
-		val = per_cpu(cpu_adaptive_nohz_ref, cpu);
-
-		if (val == 1) {
-			cpumask_set_cpu(cpu, &nohz_cpuset_mask);
-			/*
-			 * The mask update needs to be visible right away
-			 * so that this CPU is part of the cputime IPI
-			 * update right now.
-			 */
-			 smp_mb();
-		} else if (!val) {
-			/*
-			 * The update to cpu_adaptive_nohz_ref must be
-			 * visible right away. So that once we restart the tick
-			 * from the IPI, it won't be stopped again due to cache
-			 * update lag.
-			 * FIXME: We probably need more to ensure this value is really
-			 * visible right away.
-			 */
-			smp_mb();
-			cpu_exit_nohz(cpu);
-			/*
-			 * Now that the tick has been restarted and cputimes
-			 * flushed, we don't need anymore to be part of the
-			 * cputime flush IPI.
-			 */
-			cpumask_clear_cpu(cpu, &nohz_cpuset_mask);
-		}
+		update_cpu_nohz_flag(cpu, adjust);
 	}
 }
+
+static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
+{
+	int cpu;
+	cpumask_t cpus;
+
+	/*
+	 * Only bother if the cpuset has adaptive nohz
+	 */
+	if (!is_adaptive_nohz(cs))
+		return;
+
+	cpumask_xor(&cpus, old_cs->cpus_allowed, cs->cpus_allowed);
+
+	for_each_cpu(cpu, &cpus)
+		update_cpu_nohz_flag(cpu,
+			cpumask_test_cpu(cpu, cs->cpus_allowed) ? 1 : -1);
+}
 #else
+static inline void update_nohz_flag(struct cpuset *old_cs, struct cpuset *cs)
+{
+}
 static inline void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
 {
 }
@@ -1362,7 +1391,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
-	update_nohz_cpus(cs, trialcs);
+	update_nohz_flag(cs, trialcs);
 
 	mutex_lock(&callback_mutex);
 	cs->flags = trialcs->flags;
@@ -2006,7 +2035,8 @@ static struct cgroup_subsys_state *cpuset_create(
 /*
  * If the cpuset being removed has its flag 'sched_load_balance'
  * enabled, then simulate turning sched_load_balance off, which
- * will call async_rebuild_sched_domains().
+ * will call async_rebuild_sched_domains(). Also update adaptive
+ * nohz flag.
  */
 
 static void cpuset_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -2016,6 +2046,8 @@ static void cpuset_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
+	update_flag(CS_ADAPTIVE_NOHZ, cs, 0);
+
 	number_of_cpusets--;
 	free_cpumask_var(cs->cpus_allowed);
 	kfree(cs);
-- 
1.7.5.4



* [PATCH 28/41] nohz: Don't restart the tick before scheduling to idle
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (26 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 27/41] nohz/cpuset: enable addition&removal of cpus while in adaptive nohz mode Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-04-30 23:55 ` [PATCH 29/41] sched: Comment on rq->clock correctness in ttwu_do_wakeup() in nohz Frederic Weisbecker
                   ` (13 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

If we are running adaptive tickless and then schedule out into
the idle task, we don't need to restart the tick because
tick_nohz_idle_enter() is going to be called right away.

The only thing we need to do is save the jiffies so that
when we later restart the tick we can account for the CPU time
spent tickless while idle.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-sched.c |   18 +++++++++++-------
 1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5933506..8217409 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1029,14 +1029,18 @@ void tick_nohz_pre_schedule(void)
 void tick_nohz_post_schedule(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+	unsigned long flags;
 
-	/*
-	 * No need to disable irqs here. The worst that can happen
-	 * is an irq that comes and restart the tick before us.
-	 * tick_nohz_restart_sched_tick() is irq safe.
-	 */
-	if (ts->tick_stopped)
-		tick_nohz_restart_sched_tick();
+	local_irq_save(flags);
+	if (ts->tick_stopped) {
+		if (is_idle_task(current)) {
+			ts->saved_jiffies = jiffies;
+			ts->saved_jiffies_whence = JIFFIES_SAVED_IDLE;
+		} else {
+			tick_nohz_restart_sched_tick();
+		}
+	}
+	local_irq_restore(flags);
 }
 
 void tick_nohz_flush_current_times(bool restart_tick)
-- 
1.7.5.4



* [PATCH 29/41] sched: Comment on rq->clock correctness in ttwu_do_wakeup() in nohz
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (27 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 28/41] nohz: Don't restart the tick before scheduling to idle Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-04-30 23:55 ` [PATCH 30/41] sched: Update rq clock on nohz CPU before migrating tasks Frederic Weisbecker
                   ` (12 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5debfd7..d24da6b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1430,6 +1430,12 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 	if (p->sched_class->task_woken)
 		p->sched_class->task_woken(rq, p);
 
+	/*
+	 * For adaptive nohz case: We called ttwu_activate()
+	 * which just updated the rq clock. There is an
+	 * exception with p->on_rq != 0 but in this case
+	 * we are not idle and rq->idle_stamp == 0
+	 */
 	if (rq->idle_stamp) {
 		u64 delta = rq->clock - rq->idle_stamp;
 		u64 max = 2*sysctl_sched_migration_cost;
-- 
1.7.5.4



* [PATCH 30/41] sched: Update rq clock on nohz CPU before migrating tasks
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (28 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 29/41] sched: Comment on rq->clock correctness in ttwu_do_wakeup() in nohz Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-04-30 23:55 ` [PATCH 31/41] sched: Update rq clock on nohz CPU before setting fair group shares Frederic Weisbecker
                   ` (11 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

The sched_class::put_prev_task() callbacks of the rt and fair
classes refer to the rq clock to update their runtime
statistics. A CPU running in tickless mode may carry a stale value,
so we need to update the clock there.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c  |    6 ++++++
 kernel/sched/sched.h |    6 ++++++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d24da6b..a7e611a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5166,6 +5166,12 @@ static void migrate_tasks(unsigned int dead_cpu)
 	/* Ensure any throttled groups are reachable by pick_next_task */
 	unthrottle_offline_cfs_rqs(rq);
 
+	/*
+	 * ->put_prev_task() needs to have an up-to-date value
+	 * of rq->clock[_task]
+	 */
+	update_nohz_rq_clock(rq);
+
 	for ( ; ; ) {
 		/*
 		 * There's this thread running, bail when that's the only
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b89f254..b463e82 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -957,6 +957,12 @@ static inline void dec_nr_running(struct rq *rq)
 
 extern void update_rq_clock(struct rq *rq);
 
+static inline void update_nohz_rq_clock(struct rq *rq)
+{
+	if (cpuset_cpu_adaptive_nohz(cpu_of(rq)))
+		update_rq_clock(rq);
+}
+
 extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
 extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 31/41] sched: Update rq clock on nohz CPU before setting fair group shares
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (29 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 30/41] sched: Update rq clock on nohz CPU before migrating tasks Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-04-30 23:55 ` [PATCH 32/41] sched: Update rq clock on tickless CPUs before calling check_preempt_curr() Frederic Weisbecker
                   ` (10 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

After updating the group shares, we may update the execution time
(sched_group_set_shares() -> update_cfs_shares() -> reweight_entity() ->
update_curr()) before reweighting the entity, and this requires an
up-to-date version of the runqueue clock. Let's update it on the target
CPU if it runs tickless, because scheduler_tick() is not there to
maintain it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/fair.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aca16b8..3312abe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5519,6 +5519,11 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		se = tg->se[i];
 		/* Propagate contribution to hierarchy */
 		raw_spin_lock_irqsave(&rq->lock, flags);
+		/*
+		 * We may call update_curr() which needs an up-to-date
+		 * version of rq clock if the CPU runs tickless.
+		 */
+		update_nohz_rq_clock(rq);
 		for_each_sched_entity(se)
 			update_cfs_shares(group_cfs_rq(se));
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 32/41] sched: Update rq clock on tickless CPUs before calling check_preempt_curr()
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (30 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 31/41] sched: Update rq clock on nohz CPU before setting fair group shares Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-04-30 23:55 ` [PATCH 33/41] sched: Update rq clock earlier in unthrottle_cfs_rq Frederic Weisbecker
                   ` (9 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

check_preempt_wakeup() of the fair class needs an up-to-date sched
clock value to update the runtime stats of the current task.

When a task is woken up, activate_task() is usually called right before
ttwu_do_wakeup(), unless the task is already on the runqueue. In that
case we need to update the rq clock manually, in case the CPU runs
tickless, because ttwu_do_wakeup() calls check_preempt_wakeup().

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c |   17 ++++++++++++++++-
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7e611a..949158a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1474,6 +1474,12 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 
 	rq = __task_rq_lock(p);
 	if (p->on_rq) {
+		/*
+		 * Ensure check_preempt_curr() won't deal with a stale value
+		 * of rq clock if the CPU is tickless. BTW do we actually need
+		 * check_preempt_curr() to be called here?
+		 */
+		update_nohz_rq_clock(rq);
 		ttwu_do_wakeup(rq, p, wake_flags);
 		ret = 1;
 	}
@@ -1683,8 +1689,17 @@ static void try_to_wake_up_local(struct task_struct *p)
 	if (!(p->state & TASK_NORMAL))
 		goto out;
 
-	if (!p->on_rq)
+	if (!p->on_rq) {
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+	} else {
+		/*
+		 * Even if the task is on the runqueue we still
+		 * need to ensure check_preempt_curr() won't
+		 * deal with a stale rq clock value on a tickless
+		 * CPU
+		 */
+		update_nohz_rq_clock(rq);
+	}
 
 	ttwu_do_wakeup(rq, p, 0);
 	ttwu_stat(p, smp_processor_id(), 0);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 33/41] sched: Update rq clock earlier in unthrottle_cfs_rq
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (31 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 32/41] sched: Update rq clock on tickless CPUs before calling check_preempt_curr() Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-04-30 23:55 ` [PATCH 34/41] sched: Update clock of nohz busiest rq before balancing Frederic Weisbecker
                   ` (8 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

In this function we make use of rq->clock right before the rq clock
is updated. Let's call update_rq_clock() before that use instead, to
avoid reading a stale rq clock value.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/fair.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3312abe..42a87d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1682,15 +1682,16 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	long task_delta;
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
-
 	cfs_rq->throttled = 0;
+
+	update_rq_clock(rq);
+
 	raw_spin_lock(&cfs_b->lock);
 	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_timestamp;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 	cfs_rq->throttled_timestamp = 0;
 
-	update_rq_clock(rq);
 	/* update hierarchical throttle state */
 	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 34/41] sched: Update clock of nohz busiest rq before balancing
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (32 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 33/41] sched: Update rq clock earlier in unthrottle_cfs_rq Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-04-30 23:55 ` [PATCH 35/41] sched: Update rq clock before idle balancing Frederic Weisbecker
                   ` (7 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

move_tasks() and active_load_balance_cpu_stop() both need the
busiest rq clock to be up to date, because they may end
up calling can_migrate_task(), which uses rq->clock_task
to determine whether the task running on the busiest runqueue
is cache hot.

Hence, if the busiest runqueue is tickless, update its clock
before reading it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/fair.c |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 42a87d7..eff80e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4455,6 +4455,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			int *balance)
 {
 	int ld_moved, lb_flags = 0, active_balance = 0;
+	int clock_updated;
 	struct sched_group *group;
 	unsigned long imbalance;
 	struct rq *busiest;
@@ -4488,6 +4489,7 @@ redo:
 	schedstat_add(sd, lb_imbalance[idle], imbalance);
 
 	ld_moved = 0;
+	clock_updated = 0;
 	if (busiest->nr_running > 1) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
@@ -4498,6 +4500,12 @@ redo:
 		lb_flags |= LBF_ALL_PINNED;
 		local_irq_save(flags);
 		double_rq_lock(this_rq, busiest);
+		/*
+		 * move_tasks() may end up calling can_migrate_task(), which
+		 * requires an up-to-date value of the rq clock.
+		 */
+		update_nohz_rq_clock(busiest);
+		clock_updated = 1;
 		ld_moved = move_tasks(this_rq, this_cpu, busiest,
 				      imbalance, sd, idle, &lb_flags);
 		double_rq_unlock(this_rq, busiest);
@@ -4563,6 +4571,13 @@ redo:
 				busiest->active_balance = 1;
 				busiest->push_cpu = this_cpu;
 				active_balance = 1;
+				/*
+				 * active_load_balance_cpu_stop() may end up calling
+				 * can_migrate_task(), which requires an up-to-date
+				 * value of the rq clock.
+				 */
+				if (!clock_updated)
+					update_nohz_rq_clock(busiest);
 			}
 			raw_spin_unlock_irqrestore(&busiest->lock, flags);
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 35/41] sched: Update rq clock before idle balancing
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (33 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 34/41] sched: Update clock of nohz busiest rq before balancing Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-05-02  3:36   ` Michael Wang
  2012-04-30 23:55 ` [PATCH 36/41] sched: Update nohz rq clock before searching busiest group on load balancing Frederic Weisbecker
                   ` (6 subsequent siblings)
  41 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

idle_balance() is called from schedule() right before we schedule the
idle task. It needs to record the idle timestamp at that time and for
this the rq clock must be accurate. If the CPU is running tickless
we need to update the rq clock manually.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/fair.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eff80e0..cd871e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4638,6 +4638,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	int pulled_task = 0;
 	unsigned long next_balance = jiffies + HZ;
 
+	update_nohz_rq_clock(this_rq);
 	this_rq->idle_stamp = this_rq->clock;
 
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 36/41] sched: Update nohz rq clock before searching busiest group on load balancing
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (34 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 35/41] sched: Update rq clock before idle balancing Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-04-30 23:55 ` [PATCH 37/41] rcu: New rcu_user_enter() and rcu_user_exit() APIs Frederic Weisbecker
                   ` (5 subsequent siblings)
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

While load balancing toward a target rq, we look for the busiest group.
This operation may require an up-to-date rq clock if we end up calling
scale_rt_power(). To this end, update it manually if the target is
running tickless.

DOUBT: don't we actually also need this in vanilla kernel, in case
this_cpu is in dyntick-idle mode?

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/fair.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd871e7..af8377f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4466,6 +4466,19 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
 	schedstat_inc(sd, lb_count[idle]);
 
+	/*
+	 * find_busiest_group() may need an up-to-date cpu clock
+	 * (see scale_rt_power()). If the CPU is nohz,
+	 * its clock may be stale.
+	 */
+	if (cpuset_cpu_adaptive_nohz(this_cpu)) {
+		local_irq_save(flags);
+		raw_spin_lock(&this_rq->lock);
+		update_rq_clock(this_rq);
+		raw_spin_unlock(&this_rq->lock);
+		local_irq_restore(flags);
+	}
+
 redo:
 	group = find_busiest_group(sd, this_cpu, &imbalance, idle,
 				   cpus, balance);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 37/41] rcu: New rcu_user_enter() and rcu_user_exit() APIs
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (35 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 36/41] sched: Update nohz rq clock before searching busiest group on load balancing Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-05-22 18:23   ` Paul E. McKenney
  2012-04-30 23:55 ` [PATCH 38/41] rcu: New rcu_user_enter_irq() and rcu_user_exit_irq() APIs Frederic Weisbecker
                   ` (4 subsequent siblings)
  41 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

These two APIs are provided to help the implementation
of an adaptive tickless kernel (cf: nohz cpusets). We need
to run in the RCU extended quiescent state while we are in
userland, so that a tickless CPU is not involved in the
global RCU state machine and can shut down its tick safely.

These APIs are called from syscall and exception entry/exit
points and can't be called from interrupt.

They are essentially the same as rcu_idle_enter() and
rcu_idle_exit(), minus the checks that ensure the CPU is
running the idle task.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rcupdate.h |    5 ++
 kernel/rcutree.c         |  107 ++++++++++++++++++++++++++++++++-------------
 2 files changed, 81 insertions(+), 31 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index e06639e..6539290 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -191,6 +191,11 @@ extern void rcu_idle_exit(void);
 extern void rcu_irq_enter(void);
 extern void rcu_irq_exit(void);
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+void rcu_user_enter(void);
+void rcu_user_exit(void);
+#endif
+
 /*
  * Infrastructure to implement the synchronize_() primitives in
  * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index b8d300c..cba1332 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -357,16 +357,8 @@ static int rcu_implicit_offline_qs(struct rcu_data *rdp)
 
 #endif /* #ifdef CONFIG_SMP */
 
-/*
- * rcu_idle_enter_common - inform RCU that current CPU is moving towards idle
- *
- * If the new value of the ->dynticks_nesting counter now is zero,
- * we really have entered idle, and must do the appropriate accounting.
- * The caller must have disabled interrupts.
- */
-static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
+static void rcu_check_idle_enter(long long oldval)
 {
-	trace_rcu_dyntick("Start", oldval, 0);
 	if (!is_idle_task(current)) {
 		struct task_struct *idle = idle_task(smp_processor_id());
 
@@ -376,6 +368,18 @@ static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
 			  current->pid, current->comm,
 			  idle->pid, idle->comm); /* must be idle task! */
 	}
+}
+
+/*
+ * rcu_idle_enter_common - inform RCU that current CPU is moving towards idle
+ *
+ * If the new value of the ->dynticks_nesting counter now is zero,
+ * we really have entered idle, and must do the appropriate accounting.
+ * The caller must have disabled interrupts.
+ */
+static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
+{
+	trace_rcu_dyntick("Start", oldval, 0);
 	rcu_prepare_for_idle(smp_processor_id());
 	/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
 	smp_mb__before_atomic_inc();  /* See above. */
@@ -384,6 +388,22 @@ static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
 	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
 }
 
+static long long __rcu_idle_enter(void)
+{
+	unsigned long flags;
+	long long oldval;
+	struct rcu_dynticks *rdtp;
+
+	local_irq_save(flags);
+	rdtp = &__get_cpu_var(rcu_dynticks);
+	oldval = rdtp->dynticks_nesting;
+	rdtp->dynticks_nesting = 0;
+	rcu_idle_enter_common(rdtp, oldval);
+	local_irq_restore(flags);
+
+	return oldval;
+}
+
 /**
  * rcu_idle_enter - inform RCU that current CPU is entering idle
  *
@@ -398,16 +418,15 @@ static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
  */
 void rcu_idle_enter(void)
 {
-	unsigned long flags;
 	long long oldval;
-	struct rcu_dynticks *rdtp;
 
-	local_irq_save(flags);
-	rdtp = &__get_cpu_var(rcu_dynticks);
-	oldval = rdtp->dynticks_nesting;
-	rdtp->dynticks_nesting = 0;
-	rcu_idle_enter_common(rdtp, oldval);
-	local_irq_restore(flags);
+	oldval = __rcu_idle_enter();
+	rcu_check_idle_enter(oldval);
+}
+
+void rcu_user_enter(void)
+{
+	__rcu_idle_enter();
 }
 
 /**
@@ -437,6 +456,7 @@ void rcu_irq_exit(void)
 	oldval = rdtp->dynticks_nesting;
 	rdtp->dynticks_nesting--;
 	WARN_ON_ONCE(rdtp->dynticks_nesting < 0);
+
 	if (rdtp->dynticks_nesting)
 		trace_rcu_dyntick("--=", oldval, rdtp->dynticks_nesting);
 	else
@@ -444,6 +464,20 @@ void rcu_irq_exit(void)
 	local_irq_restore(flags);
 }
 
+static void rcu_check_idle_exit(struct rcu_dynticks *rdtp, long long oldval)
+{
+	if (!is_idle_task(current)) {
+		struct task_struct *idle = idle_task(smp_processor_id());
+
+		trace_rcu_dyntick("Error on exit: not idle task",
+				  oldval, rdtp->dynticks_nesting);
+		ftrace_dump(DUMP_ALL);
+		WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",
+			  current->pid, current->comm,
+			  idle->pid, idle->comm); /* must be idle task! */
+	}
+}
+
 /*
  * rcu_idle_exit_common - inform RCU that current CPU is moving away from idle
  *
@@ -460,16 +494,18 @@ static void rcu_idle_exit_common(struct rcu_dynticks *rdtp, long long oldval)
 	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
 	rcu_cleanup_after_idle(smp_processor_id());
 	trace_rcu_dyntick("End", oldval, rdtp->dynticks_nesting);
-	if (!is_idle_task(current)) {
-		struct task_struct *idle = idle_task(smp_processor_id());
+}
 
-		trace_rcu_dyntick("Error on exit: not idle task",
-				  oldval, rdtp->dynticks_nesting);
-		ftrace_dump(DUMP_ALL);
-		WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",
-			  current->pid, current->comm,
-			  idle->pid, idle->comm); /* must be idle task! */
-	}
+static long long __rcu_idle_exit(struct rcu_dynticks *rdtp)
+{
+	long long oldval;
+
+	oldval = rdtp->dynticks_nesting;
+	WARN_ON_ONCE(oldval != 0);
+	rdtp->dynticks_nesting = LLONG_MAX / 2;
+	rcu_idle_exit_common(rdtp, oldval);
+
+	return oldval;
 }
 
 /**
@@ -485,16 +521,25 @@ static void rcu_idle_exit_common(struct rcu_dynticks *rdtp, long long oldval)
  */
 void rcu_idle_exit(void)
 {
+	long long oldval;
+	struct rcu_dynticks *rdtp;
 	unsigned long flags;
+
+	local_irq_save(flags);
+	rdtp = &__get_cpu_var(rcu_dynticks);
+	oldval = __rcu_idle_exit(rdtp);
+	rcu_check_idle_exit(rdtp, oldval);
+	local_irq_restore(flags);
+}
+
+void rcu_user_exit(void)
+{
 	struct rcu_dynticks *rdtp;
-	long long oldval;
+	unsigned long flags;
 
 	local_irq_save(flags);
 	rdtp = &__get_cpu_var(rcu_dynticks);
-	oldval = rdtp->dynticks_nesting;
-	WARN_ON_ONCE(oldval != 0);
-	rdtp->dynticks_nesting = DYNTICK_TASK_NESTING;
-	rcu_idle_exit_common(rdtp, oldval);
+	 __rcu_idle_exit(rdtp);
 	local_irq_restore(flags);
 }
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 38/41] rcu: New rcu_user_enter_irq() and rcu_user_exit_irq() APIs
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (36 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 37/41] rcu: New rcu_user_enter() and rcu_user_exit() APIs Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-05-22 18:33   ` Paul E. McKenney
  2012-04-30 23:55 ` [PATCH 39/41] rcu: Switch to extended quiescent state in userspace from nohz cpuset Frederic Weisbecker
                   ` (3 subsequent siblings)
  41 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

A CPU running in adaptive tickless mode wants to enter into
RCU extended quiescent state while running in userspace. This
way we can shut down the tick that is usually needed on each
CPU for the needs of RCU.

Typically, RCU enters the extended quiescent state when we resume
to userspace through a syscall or exception exit; this is done
using rcu_user_enter(). RCU then exits this state by calling
rcu_user_exit() from syscall or exception entry.

However, there are two other points where we may want to enter
or exit this state. Some remote CPU may require a tickless CPU
to restart its tick for any reason and send it an IPI for
this purpose. As we restart the tick, we don't want to resume
from the IPI in RCU extended quiescent state anymore.
Similarly, we may stop the tick from an interrupt in userspace and
we need to be able to enter RCU extended quiescent state when we
resume from this interrupt to userspace.

To these ends, we provide two new APIs:

- rcu_user_enter_irq(). This must be called from a non-nesting
interrupt, between rcu_irq_enter() and rcu_irq_exit().
After the irq calls rcu_irq_exit(), we'll run in the RCU extended
quiescent state.

- rcu_user_exit_irq(). This must be called from a non-nesting
interrupt, interrupting an RCU extended quiescent state, and
between rcu_irq_enter() and rcu_irq_exit(). After the irq calls
rcu_irq_exit(), we'll be prevented from resuming the RCU extended
quiescent state.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rcupdate.h |    2 ++
 kernel/rcutree.c         |   24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 6539290..3cf1d51 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -194,6 +194,8 @@ extern void rcu_irq_exit(void);
 #ifdef CONFIG_CPUSETS_NO_HZ
 void rcu_user_enter(void);
 void rcu_user_exit(void);
+void rcu_user_enter_irq(void);
+void rcu_user_exit_irq(void);
 #endif
 
 /*
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index cba1332..2adc5a0 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -429,6 +429,18 @@ void rcu_user_enter(void)
 	__rcu_idle_enter();
 }
 
+void rcu_user_enter_irq(void)
+{
+	unsigned long flags;
+	struct rcu_dynticks *rdtp;
+
+	local_irq_save(flags);
+	rdtp = &__get_cpu_var(rcu_dynticks);
+	WARN_ON_ONCE(rdtp->dynticks_nesting == 1);
+	rdtp->dynticks_nesting = 1;
+	local_irq_restore(flags);
+}
+
 /**
  * rcu_irq_exit - inform RCU that current CPU is exiting irq towards idle
  *
@@ -543,6 +555,18 @@ void rcu_user_exit(void)
 	local_irq_restore(flags);
 }
 
+void rcu_user_exit_irq(void)
+{
+	unsigned long flags;
+	struct rcu_dynticks *rdtp;
+
+	local_irq_save(flags);
+	rdtp = &__get_cpu_var(rcu_dynticks);
+	WARN_ON_ONCE(rdtp->dynticks_nesting == 0);
+	rdtp->dynticks_nesting = (LLONG_MAX / 2) + 1;
+	local_irq_restore(flags);
+}
+
 /**
  * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
  *
-- 
1.7.5.4



* [PATCH 39/41] rcu: Switch to extended quiescent state in userspace from nohz cpuset
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (37 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 38/41] rcu: New rcu_user_enter_irq() and rcu_user_exit_irq() APIs Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-05-22 18:36   ` Paul E. McKenney
  2012-04-30 23:55 ` [PATCH 40/41] nohz: Exit RCU idle mode when we schedule before resuming userspace Frederic Weisbecker
                   ` (2 subsequent siblings)
  41 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

When we switch to adaptive nohz mode and run in userspace,
we can still receive IPIs from the RCU core if a grace period
has been started by another CPU, because we need to take part
in its completion.

However, running in userspace is similar to running idle in
that we don't make use of RCU there, so the CPU can be
considered to be in an RCU extended quiescent state. The
benefit of running in that mode is that we are no longer
disturbed by needless IPIs coming from the RCU core.

To achieve this, we just need to use the RCU extended quiescent state
APIs at the following points:

- kernel exit or tick stop in userspace: here we switch to extended
quiescent state because we run in userspace without the tick.

- kernel entry or tick restart: here we exit the extended quiescent
state because either we enter the kernel and may make use of RCU
read-side critical sections anytime, or we need the timer tick for some
reason and that takes care of RCU grace periods in the traditional way.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/tick.h     |    3 +++
 kernel/time/tick-sched.c |   27 +++++++++++++++++++++++++--
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 3c31d6e..e2a49ad 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -153,6 +153,8 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 # endif /* !NO_HZ */
 
 #ifdef CONFIG_CPUSETS_NO_HZ
+DECLARE_PER_CPU(int, nohz_task_ext_qs);
+
 extern void tick_nohz_enter_kernel(void);
 extern void tick_nohz_exit_kernel(void);
 extern void tick_nohz_enter_exception(struct pt_regs *regs);
@@ -160,6 +162,7 @@ extern void tick_nohz_exit_exception(struct pt_regs *regs);
 extern void tick_nohz_check_adaptive(void);
 extern void tick_nohz_pre_schedule(void);
 extern void tick_nohz_post_schedule(void);
+extern void tick_nohz_cpu_exit_qs(void);
 extern bool tick_nohz_account_tick(void);
 extern void tick_nohz_flush_current_times(bool restart_tick);
 #else /* !CPUSETS_NO_HZ */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8217409..b15ab5e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -565,10 +565,13 @@ static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
 
 	if (!was_stopped && ts->tick_stopped) {
 		WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_NONE);
-		if (user)
+		if (user) {
 			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
-		else if (!current->mm)
+			__get_cpu_var(nohz_task_ext_qs) = 1;
+			rcu_user_enter_irq();
+		} else if (!current->mm) {
 			ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
+		}
 
 		ts->saved_jiffies = jiffies;
 		set_thread_flag(TIF_NOHZ);
@@ -899,6 +902,8 @@ void tick_check_idle(int cpu)
 }
 
 #ifdef CONFIG_CPUSETS_NO_HZ
+DEFINE_PER_CPU(int, nohz_task_ext_qs);
+
 void tick_nohz_exit_kernel(void)
 {
 	unsigned long flags;
@@ -922,6 +927,9 @@ void tick_nohz_exit_kernel(void)
 	ts->saved_jiffies = jiffies;
 	ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
 
+	__get_cpu_var(nohz_task_ext_qs) = 1;
+	rcu_user_enter();
+
 	local_irq_restore(flags);
 }
 
@@ -940,6 +948,11 @@ void tick_nohz_enter_kernel(void)
 		return;
 	}
 
+	if (__get_cpu_var(nohz_task_ext_qs) == 1) {
+		__get_cpu_var(nohz_task_ext_qs) = 0;
+		rcu_user_exit();
+	}
+
 	WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_USER);
 
 	delta_jiffies = jiffies - ts->saved_jiffies;
@@ -951,6 +964,14 @@ void tick_nohz_enter_kernel(void)
 	local_irq_restore(flags);
 }
 
+void tick_nohz_cpu_exit_qs(void)
+{
+	if (__get_cpu_var(nohz_task_ext_qs)) {
+		rcu_user_exit_irq();
+		__get_cpu_var(nohz_task_ext_qs) = 0;
+	}
+}
+
 void tick_nohz_enter_exception(struct pt_regs *regs)
 {
 	if (user_mode(regs))
@@ -986,6 +1007,7 @@ static void tick_nohz_restart_adaptive(void)
 	tick_nohz_flush_current_times(true);
 	tick_nohz_restart_sched_tick();
 	clear_thread_flag(TIF_NOHZ);
+	tick_nohz_cpu_exit_qs();
 }
 
 void tick_nohz_check_adaptive(void)
@@ -1023,6 +1045,7 @@ void tick_nohz_pre_schedule(void)
 	if (ts->tick_stopped) {
 		tick_nohz_flush_current_times(true);
 		clear_thread_flag(TIF_NOHZ);
+		/* FIXME: warn if we are in RCU idle mode */
 	}
 }
 
-- 
1.7.5.4



* [PATCH 40/41] nohz: Exit RCU idle mode when we schedule before resuming userspace
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (38 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 39/41] rcu: Switch to extended quiescent state in userspace from nohz cpuset Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-04-30 23:55 ` [PATCH 41/41] nohz/cpuset: Disable under some configs Frederic Weisbecker
  2012-05-07 22:10 ` [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Geoff Levand
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

When a CPU running tickless resumes userspace, it enters RCU
idle mode. But if we are preempted on kernel exit, after we
entered RCU idle mode but before we actually resumed userspace,
through an explicit call to schedule(), we need to re-enable RCU
in case this function makes use of RCU read-side critical sections,
and also for the sake of the next task to be scheduled.

NOTE: If we are preempted while running adaptive tickless, it means
we will receive an IPI that will exit the RCU idle mode for us. So
this patch is useful only when such an IPI arrives too late.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/entry_64.S |    8 ++++----
 include/linux/tick.h       |    3 ++-
 kernel/sched/core.c        |   14 ++++++++++++++
 kernel/time/tick-sched.c   |    9 ++++++---
 4 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 54f269c..c86d963 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -522,7 +522,7 @@ sysret_careful:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	pushq_cfi %rdi
-	call schedule
+	call schedule_user
 	popq_cfi %rdi
 	jmp sysret_check
 
@@ -630,7 +630,7 @@ int_careful:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	pushq_cfi %rdi
-	call schedule
+	call schedule_user
 	popq_cfi %rdi
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
@@ -898,7 +898,7 @@ retint_careful:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	pushq_cfi %rdi
-	call  schedule
+	call  schedule_user
 	popq_cfi %rdi
 	GET_THREAD_INFO(%rcx)
 	DISABLE_INTERRUPTS(CLBR_NONE)
@@ -1398,7 +1398,7 @@ paranoid_userspace:
 paranoid_schedule:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_ANY)
-	call schedule
+	call schedule_user
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
 	jmp paranoid_userspace
diff --git a/include/linux/tick.h b/include/linux/tick.h
index e2a49ad..93add37 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -162,7 +162,7 @@ extern void tick_nohz_exit_exception(struct pt_regs *regs);
 extern void tick_nohz_check_adaptive(void);
 extern void tick_nohz_pre_schedule(void);
 extern void tick_nohz_post_schedule(void);
-extern void tick_nohz_cpu_exit_qs(void);
+extern void tick_nohz_cpu_exit_qs(bool irq);
 extern bool tick_nohz_account_tick(void);
 extern void tick_nohz_flush_current_times(bool restart_tick);
 #else /* !CPUSETS_NO_HZ */
@@ -173,6 +173,7 @@ static inline void tick_nohz_exit_exception(struct pt_regs *regs) { }
 static inline void tick_nohz_check_adaptive(void) { }
 static inline void tick_nohz_pre_schedule(void) { }
 static inline void tick_nohz_post_schedule(void) { }
+static inline void tick_nohz_cpu_exit_qs(bool irq) { }
 static inline bool tick_nohz_account_tick(void) { return false; }
 #endif /* CPUSETS_NO_HZ */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 949158a..c8d3793 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3379,6 +3379,20 @@ int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 }
 #endif
 
+asmlinkage void __sched schedule_user(void)
+{
+	/*
+	 * We may arrive here before resuming userspace.
+	 * If we are running tickless, RCU may be in idle
+	 * mode. We need to reenable RCU for the next task
+	 * and also in case schedule() make use of RCU itself.
+	 */
+	preempt_disable();
+	tick_nohz_cpu_exit_qs(false);
+	preempt_enable_no_resched();
+	schedule();
+}
+
 #ifdef CONFIG_PREEMPT
 /*
  * this is the entry point to schedule() from in-kernel preemption
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b15ab5e..586f970 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -964,10 +964,13 @@ void tick_nohz_enter_kernel(void)
 	local_irq_restore(flags);
 }
 
-void tick_nohz_cpu_exit_qs(void)
+void tick_nohz_cpu_exit_qs(bool irq)
 {
 	if (__get_cpu_var(nohz_task_ext_qs)) {
-		rcu_user_exit_irq();
+		if (irq)
+			rcu_user_exit_irq();
+		else
+			rcu_user_exit();
 		__get_cpu_var(nohz_task_ext_qs) = 0;
 	}
 }
@@ -1007,7 +1010,7 @@ static void tick_nohz_restart_adaptive(void)
 	tick_nohz_flush_current_times(true);
 	tick_nohz_restart_sched_tick();
 	clear_thread_flag(TIF_NOHZ);
-	tick_nohz_cpu_exit_qs();
+	tick_nohz_cpu_exit_qs(true);
 }
 
 void tick_nohz_check_adaptive(void)
-- 
1.7.5.4



* [PATCH 41/41] nohz/cpuset: Disable under some configs
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (39 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 40/41] nohz: Exit RCU idle mode when we schedule before resuming userspace Frederic Weisbecker
@ 2012-04-30 23:55 ` Frederic Weisbecker
  2012-05-07 22:10 ` [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Geoff Levand
  41 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-30 23:55 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

This shows the various things that are not yet handled by
the nohz cpusets: perf events, irq work, irq time accounting.

But there are further things that have yet to be handled:
sched clock tick, rq clock, sched_class::task_tick(),
cpu load, complete handling of cputimes, ...

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 init/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 7cdb8be..3080b16 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -640,7 +640,7 @@ config PROC_PID_CPUSET
 
 config CPUSETS_NO_HZ
        bool "Tickless cpusets"
-       depends on CPUSETS && HAVE_CPUSETS_NO_HZ && NO_HZ && HIGH_RES_TIMERS
+       depends on CPUSETS && HAVE_CPUSETS_NO_HZ && NO_HZ && HIGH_RES_TIMERS && !IRQ_TIME_ACCOUNTING
        help
          This options let you apply a nohz property to a cpuset such
 	 that the periodic timer tick tries to be avoided when possible on
-- 
1.7.5.4



* Re: [PATCH 35/41] sched: Update rq clock before idle balancing
  2012-04-30 23:55 ` [PATCH 35/41] sched: Update rq clock before idle balancing Frederic Weisbecker
@ 2012-05-02  3:36   ` Michael Wang
  2012-05-02 10:55     ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Michael Wang @ 2012-05-02  3:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On 05/01/2012 07:55 AM, Frederic Weisbecker wrote:

> idle_balance() is called from schedule() right before we schedule the
> idle task. It needs to record the idle timestamp at that time and for
> this the rq clock must be accurate. If the CPU is running tickless
> we need to update the rq clock manually.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Alessio Igor Bogani <abogani@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Avi Kivity <avi@redhat.com>
> Cc: Chris Metcalf <cmetcalf@tilera.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Geoff Levand <geoff@infradead.org>
> Cc: Gilad Ben Yossef <gilad@benyossef.com>
> Cc: Hakan Akkan <hakanakkan@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Kevin Hilman <khilman@ti.com>
> Cc: Max Krasnyansky <maxk@qualcomm.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/sched/fair.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eff80e0..cd871e7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4638,6 +4638,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
>  	int pulled_task = 0;
>  	unsigned long next_balance = jiffies + HZ;
> 
> +	update_nohz_rq_clock(this_rq);


I'm not sure why we have to care about nohz here. If we really need an
accurate clock, shouldn't we do the update unconditionally?

Something that also confused me is the description:
"If the CPU is running tickless we need to update the rq clock manually."

I think the CPU will enter tickless mode only after the idle thread
has already been switched in, which then invokes
tick_nohz_idle_enter()->tick_nohz_stop_sched_tick(), doesn't it?

And if we invoke idle_balance() for a CPU, that means it hasn't entered
idle yet (the current task is not the idle task), so how can such a CPU
be in tickless mode?

Regards,
Michael Wang

>  	this_rq->idle_stamp = this_rq->clock;
> 
>  	if (this_rq->avg_idle < sysctl_sched_migration_cost)



* Re: [PATCH 35/41] sched: Update rq clock before idle balancing
  2012-05-02  3:36   ` Michael Wang
@ 2012-05-02 10:55     ` Frederic Weisbecker
  0 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-02 10:55 UTC (permalink / raw)
  To: Michael Wang
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Wed, May 02, 2012 at 11:36:07AM +0800, Michael Wang wrote:
> On 05/01/2012 07:55 AM, Frederic Weisbecker wrote:
> 
> > idle_balance() is called from schedule() right before we schedule the
> > idle task. It needs to record the idle timestamp at that time and for
> > this the rq clock must be accurate. If the CPU is running tickless
> > we need to update the rq clock manually.
> > 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Alessio Igor Bogani <abogani@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Avi Kivity <avi@redhat.com>
> > Cc: Chris Metcalf <cmetcalf@tilera.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Cc: Geoff Levand <geoff@infradead.org>
> > Cc: Gilad Ben Yossef <gilad@benyossef.com>
> > Cc: Hakan Akkan <hakanakkan@gmail.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Kevin Hilman <khilman@ti.com>
> > Cc: Max Krasnyansky <maxk@qualcomm.com>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  kernel/sched/fair.c |    1 +
> >  1 files changed, 1 insertions(+), 0 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index eff80e0..cd871e7 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4638,6 +4638,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> >  	int pulled_task = 0;
> >  	unsigned long next_balance = jiffies + HZ;
> > 
> > +	update_nohz_rq_clock(this_rq);
> 
> 
> I'm not sure why we have to care about nohz here. If we really need an
> accurate clock, shouldn't we do the update unconditionally?

This concerns adaptive tickless CPUs only. So I wanted to keep the overhead
low for CPUs that are not in adaptive tickless mode. update_nohz_rq_clock()
takes care of that. It only updates the rq clock if the CPU is adaptive tickless.

> 
> Something that also confused me is the description:
> "If the CPU is running tickless we need to update the rq clock manually."
> 
> I think the CPU will enter tickless mode only after the idle thread
> has already been switched in, which then invokes
> tick_nohz_idle_enter()->tick_nohz_stop_sched_tick(), doesn't it?

An adaptive tickless CPU tries to shut down the tick even when the CPU
is not idle. By the time we are about to sleep and schedule the idle
task, we may already have been tickless for a while.

> 
> And if we invoke idle_balance() for a CPU, that means it hasn't entered
> idle yet (the current task is not the idle task), so how can such a CPU
> be in tickless mode?
> 
> Regards,
> Michael Wang
> 
> >  	this_rq->idle_stamp = this_rq->clock;
> > 
> >  	if (this_rq->avg_idle < sysctl_sched_migration_cost)
> 


* Re: [PATCH 04/41] nohz: Move nohz load balancer selection into idle logic
  2012-04-30 23:54 ` [PATCH 04/41] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
@ 2012-05-07 15:51   ` Christoph Lameter
  0 siblings, 0 replies; 96+ messages in thread
From: Christoph Lameter @ 2012-05-07 15:51 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Hakan Akkan, Ingo Molnar, Kevin Hilman,
	Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 1 May 2012, Frederic Weisbecker wrote:

> [ ** BUGGY PATCH: I need to put more thinking into this ** ]
>
> We want the nohz load balancer to be an idle CPU, thus
> move that selection to strict dyntick idle logic.

An idle cpu? We may want to put it on a busy cpu that is running high
latency OS tasks. Would it be possible to have an option where we can pin
the load balancer to a specific cpu?


* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-04-30 23:54 ` [PATCH 07/41] cpuset: Set up interface for nohz flag Frederic Weisbecker
@ 2012-05-07 15:55   ` Christoph Lameter
  2012-05-08 14:20     ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-05-07 15:55 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Hakan Akkan, Ingo Molnar, Kevin Hilman,
	Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 1 May 2012, Frederic Weisbecker wrote:

> Prepare the interface to implement the nohz cpuset flag.
> This flag, once set, will tell the system to try to
> shutdown the periodic timer tick when possible.
>
> We use here a per cpu refcounter. As long as a CPU
> is contained into at least one cpuset that has the
> nohz flag set, it is part of the set of CPUs that
> run into adaptive nohz mode.

As I have said before: It would be much simpler if one could specify the
set of nohz cpus independently of cpusets. Having a flag f.e. as a file in

	/sys/devices/system/cpu/cpuX/nohz

?



* Re: [PATCH 08/41] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
  2012-04-30 23:54 ` [PATCH 08/41] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu Frederic Weisbecker
@ 2012-05-07 16:02   ` Christoph Lameter
  2012-05-08 17:35     ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-05-07 16:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Hakan Akkan, Ingo Molnar, Kevin Hilman,
	Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 1 May 2012, Frederic Weisbecker wrote:

> Try to give the timekeeping duty to a CPU that doesn't belong
> to any nohz cpuset when possible, so that we increase the chance
> for these nohz cpusets to run their CPUs out of periodic tick
> mode.
>
> [TODO: We need to find a way to ensure there is always one non-nohz
> running CPU maintaining the timekeeping duty if every non-idle CPUs are
> adaptive tickless]

I sure wish this would also be pinnable to a specific cpu.


* Re: [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel)
  2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
                   ` (40 preceding siblings ...)
  2012-04-30 23:55 ` [PATCH 41/41] nohz/cpuset: Disable under some configs Frederic Weisbecker
@ 2012-05-07 22:10 ` Geoff Levand
  41 siblings, 0 replies; 96+ messages in thread
From: Geoff Levand @ 2012-05-07 22:10 UTC (permalink / raw)
  To: Frederic Weisbecker, Kevin Hilman
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Gilad Ben Yossef, Hakan Akkan, Ingo Molnar, Kevin Hilman,
	Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

Hi All,

On Tue, 2012-05-01 at 01:54 +0200, Frederic Weisbecker wrote:
> A summary of what this is about can be found here:
>  https://lkml.org/lkml/2011/8/15/245
> 
> Changes since v2:
> ...

I rebased my ARM patches to Frederic's v3:

  http://git.kernel.org/?p=linux/kernel/git/geoff/nohz.git

-Geoff



* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-07 15:55   ` Christoph Lameter
@ 2012-05-08 14:20     ` Frederic Weisbecker
  2012-05-08 14:50       ` Peter Zijlstra
  2012-05-08 15:16       ` Christoph Lameter
  0 siblings, 2 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-08 14:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Hakan Akkan, Ingo Molnar, Kevin Hilman,
	Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

2012/5/7 Christoph Lameter <cl@linux.com>:
> On Tue, 1 May 2012, Frederic Weisbecker wrote:
>
>> Prepare the interface to implement the nohz cpuset flag.
>> This flag, once set, will tell the system to try to
>> shutdown the periodic timer tick when possible.
>>
>> We use here a per cpu refcounter. As long as a CPU
>> is contained into at least one cpuset that has the
>> nohz flag set, it is part of the set of CPUs that
>> run into adaptive nohz mode.
>
> As I have said before: It would be much simpler if one could specify the
> set of nohz cpus independently of cpusets. Having a flag f.e. as a file in
>
>        /sys/devices/system/cpu/cpuX/nohz

I don't know if it would be simpler. It's just a different interface
to set a per-CPU property.
Cpusets or sysfs, I don't mind either way.

What is the usual policy on where to put which kind of CPU property?


* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-08 14:20     ` Frederic Weisbecker
@ 2012-05-08 14:50       ` Peter Zijlstra
  2012-05-08 15:18         ` Christoph Lameter
  2012-05-08 15:16       ` Christoph Lameter
  1 sibling, 1 reply; 96+ messages in thread
From: Peter Zijlstra @ 2012-05-08 14:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 2012-05-08 at 16:20 +0200, Frederic Weisbecker wrote:
> 2012/5/7 Christoph Lameter <cl@linux.com>:
> > On Tue, 1 May 2012, Frederic Weisbecker wrote:
> >
> >> Prepare the interface to implement the nohz cpuset flag.
> >> This flag, once set, will tell the system to try to
> >> shutdown the periodic timer tick when possible.
> >>
> >> We use here a per cpu refcounter. As long as a CPU
> >> is contained in at least one cpuset that has the
> >> nohz flag set, it is part of the set of CPUs that
> >> run in adaptive nohz mode.
> >
> > As I have said before: It would be much simpler if one could specify the
> > set of nohz cpus independently of cpusets. Having a flag f.e. as a file in
> >
> >        /sys/devices/system/cpu/cpuX/nohz
> 
> I don't know if it would be simpler. It's just a different interface
> to set a per-CPU property.
> Cpusets or sysfs, I don't mind either way.
> 
> What is the usual policy on where to put which kind of CPU property?

There's no such policy, but I don't get why Christoph objects to
cpusets, its the option I would prefer. You're going to use cpusets
anyway to partition your system, might as well also use it to mark a
whole partition/set as nohz.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-08 14:20     ` Frederic Weisbecker
  2012-05-08 14:50       ` Peter Zijlstra
@ 2012-05-08 15:16       ` Christoph Lameter
  1 sibling, 0 replies; 96+ messages in thread
From: Christoph Lameter @ 2012-05-08 15:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Hakan Akkan, Ingo Molnar, Kevin Hilman,
	Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 8 May 2012, Frederic Weisbecker wrote:

> > As I have said before: It would be much simpler if one could specify the
> > set of nohz cpus independently of cpusets. Having a flag f.e. as a file in
> >
> >        /sys/devices/system/cpu/cpuX/nohz
>
> I don't know if it would be simpler. It's just a different interface
> to set a per-CPU property.
> Cpusets or sysfs, I don't mind either way.
>
> What is the usual policy on where to put which kind of CPU property?

/sys/devices/system/cpu contains the state information for each processor.

There are already cpumasks in there that can be used to monitor and modify
per processor behavior.
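For reference, that directory already exposes per-processor state on a mainline kernel; the proposed nohz file would sit alongside the existing per-cpu controls:

```shell
# Topology and state masks the kernel maintains under sysfs
# (readable without root on any Linux box):
cat /sys/devices/system/cpu/online      # e.g. "0-3"
cat /sys/devices/system/cpu/possible
# Each cpuN directory carries per-processor controls (hotplug etc.):
ls -d /sys/devices/system/cpu/cpu0
```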

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-08 14:50       ` Peter Zijlstra
@ 2012-05-08 15:18         ` Christoph Lameter
  2012-05-08 15:27           ` Peter Zijlstra
  0 siblings, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-05-08 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 8 May 2012, Peter Zijlstra wrote:

> There's no such policy, but I don't get why Christoph objects to
> cpusets, its the option I would prefer. You're going to use cpusets
> anyway to partition your system, might as well also use it to mark a
> whole partition/set as nohz.

We are currently not using cpusets but are simply isolating processors as
needed. Not sure that I want the overhead (administratively as well as in
kernel) to deal with this.

Someone may be using a different partitioning technique (like cgroups) etc
and then won't be able to use nohz. Having it not depend on cpusets makes
it more universal.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-08 15:18         ` Christoph Lameter
@ 2012-05-08 15:27           ` Peter Zijlstra
  2012-05-08 15:38             ` Christoph Lameter
  0 siblings, 1 reply; 96+ messages in thread
From: Peter Zijlstra @ 2012-05-08 15:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 2012-05-08 at 10:18 -0500, Christoph Lameter wrote:
> On Tue, 8 May 2012, Peter Zijlstra wrote:
> 
> > There's no such policy, but I don't get why Christoph objects to
> > cpusets, its the option I would prefer. You're going to use cpusets
> > anyway to partition your system, might as well also use it to mark a
> > whole partition/set as nohz.
> 
> We are currently not using cpusets but are simply isolating processors as
> needed. Not sure that I want the overhead (administratively as well as in
> kernel) to deal with this.

isolating how? The only way to do that is with the (broken) isolcpus
crap and cpusets. There is no other way.

> Someone may be using a different partitioning technique (like cgroups) etc
> and then won't be able to use nohz. Having it not depend on cpusets makes
> it more universal.

You seem terminally confused on the cpuset vs cgroups thing. One more time:
cpusets is a cgroup controller. Without cgroup support there is no
cpusets.

Furthermore there is no other partitioning scheme, cpusets is it.



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-08 15:27           ` Peter Zijlstra
@ 2012-05-08 15:38             ` Christoph Lameter
  2012-05-08 15:48               ` Peter Zijlstra
  0 siblings, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-05-08 15:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 8 May 2012, Peter Zijlstra wrote:

> isolating how? The only way to do that is with the (broken) isolcpus
> crap and cpusets. There is no other way.

For some reason this seems to work here. What is broken with isolcpus?

> Furthermore there is no other partitioning scheme, cpusets is it.

One can partition the system any way one wants by setting cpu affinities
and memory policies etc. No need for cpusets/cgroups.
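The manual setup described here looks roughly like this (a sketch; `taskset` wraps `sched_setaffinity()` and `numactl` wraps `set_mempolicy()`/`mbind()`):

```shell
# Pin a command to cpu0 with sched_setaffinity(), via taskset:
taskset -c 0 echo "pinned to cpu0"
# Show the current shell's affinity mask:
taskset -p $$
# On NUMA boxes, memory placement is a separate knob (set_mempolicy()):
# numactl --membind=0 <cmd>
```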


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-08 15:38             ` Christoph Lameter
@ 2012-05-08 15:48               ` Peter Zijlstra
  2012-05-08 15:57                 ` Christoph Lameter
  0 siblings, 1 reply; 96+ messages in thread
From: Peter Zijlstra @ 2012-05-08 15:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 2012-05-08 at 10:38 -0500, Christoph Lameter wrote:
> On Tue, 8 May 2012, Peter Zijlstra wrote:
> 
> > isolating how? The only way to do that is with the (broken) isolcpus
> > crap and cpusets. There is no other way.
> 
> For some reason this seems to work here. What is broken with isolcpus?

It mostly still works I think, but iirc there were a few places that
ignored the cpuisol mask.

But really the moment we get proper means of flushing cpu state
(currently achievable by unplug-replug) isolcpu gets deprecated and
eventually removed.

cpusets can do what isolcpu can and more (provided this flush thing).

> > Furthermore there is no other partitioning scheme, cpusets is it.
> 
> One can partition the system any way one wants by setting cpu affinities
> and memory policies etc. No need for cpusets/cgroups.

Not so, the load-balancer will still try to move the tasks and
subsequently fail. Partitioning means it won't even try to move tasks
across the partition boundary.

By proper partitioning you can split load balance domains (or completely
disable the load-balancer by giving it a single cpu domain).
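With the 2012-era v1 cpuset filesystem, the partitioning described above is set up like this (needs root; paths assume the conventional /dev/cpuset mount point):

```shell
mount -t cgroup -o cpuset none /dev/cpuset

# Carve out an exclusive partition on cpus 2-3:
mkdir /dev/cpuset/rt
echo 2-3 > /dev/cpuset/rt/cpuset.cpus
echo 0   > /dev/cpuset/rt/cpuset.mems
echo 1   > /dev/cpuset/rt/cpuset.cpu_exclusive

# Split the sched domains: once the root set stops load balancing,
# each child set with load balancing enabled is its own balance domain.
echo 0 > /dev/cpuset/cpuset.sched_load_balance

# Move a task into the partition:
echo $$ > /dev/cpuset/rt/tasks
```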



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-08 15:48               ` Peter Zijlstra
@ 2012-05-08 15:57                 ` Christoph Lameter
  2012-05-08 16:16                   ` Peter Zijlstra
  0 siblings, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-05-08 15:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 8 May 2012, Peter Zijlstra wrote:

> > For some reason this seems to work here. What is broken with isolcpus?
>
> It mostly still works I think, but iirc there were a few places that
> ignored the cpuisol mask.

Yes there is still superfluous stuff going on on isolated processors.

> But really the moment we get proper means of flushing cpu state
> (currently achievable by unplug-replug) isolcpu gets deprecated and
> eventually removed.

Not sure what that means and how that is relevant. Scheduler?

> cpusets can do what isolcpu can and more (provided this flush thing).

cpusets is a pretty heavy-handed thing and causes inefficiencies in the
allocators if compiled into the kernel because checks will have to be done
in hot allocation paths.

> > > Furthermore there is no other partitioning scheme, cpusets is it.
> >
> > One can partition the system any way one wants by setting cpu affinities
> > and memory policies etc. No need for cpusets/cgroups.
>
> Not so, the load-balancer will still try to move the tasks and
> subsequently fail. Partitioning means it won't even try to move tasks
> across the partition boundary.

Ok so the scheduler is inefficient on this. Maybe that can be improved?

Setting affinities should not cause overhead in the scheduler.

> By proper partitioning you can split load balance domains (or completely
> disable the load-balancer by giving it a single cpu domain).

I thought that was the point of isolcpus?


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-08 15:57                 ` Christoph Lameter
@ 2012-05-08 16:16                   ` Peter Zijlstra
  2012-05-08 16:25                     ` Peter Zijlstra
  2012-05-08 19:50                     ` Mike Galbraith
  0 siblings, 2 replies; 96+ messages in thread
From: Peter Zijlstra @ 2012-05-08 16:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 2012-05-08 at 10:57 -0500, Christoph Lameter wrote:
> On Tue, 8 May 2012, Peter Zijlstra wrote:
> 
> > > For some reason this seems to work here. What is broken with isolcpus?
> >
> > It mostly still works I think, but iirc there were a few places that
> > ignored the cpuisol mask.
> 
> Yes there is still superfluous stuff going on on isolated processors.

Aside from that..

> > But really the moment we get proper means of flushing cpu state
> > (currently achievable by unplug-replug) isolcpu gets deprecated and
> > eventually removed.
> 
> Not sure what that means and how that is relevant. Scheduler?

Things like stray timers, an unplug-replug cycle will push all timers
away. So if you create a partition with cpus that have run other tasks
but in the future will be dedicated to this 'special' task, you need to
flush all these things.

This is currently only possible through the unplug-replug hack.
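The unplug-replug hack mentioned above is just a hotplug round-trip through sysfs (needs root); taking the cpu down forces per-cpu state such as timers to migrate off it:

```shell
# Flush stray per-cpu state (e.g. timers) off cpu3:
echo 0 > /sys/devices/system/cpu/cpu3/online   # unplug: state migrates away
echo 1 > /sys/devices/system/cpu/cpu3/online   # replug: cpu comes back clean
```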

For isolcpus this usually isn't a problem since the cpus will be idle
until you start something on them. But if you were to change workloads
you could run into this.

> > cpusets can do what isolcpu can and more (provided this flush thing).
> 
> cpusets is a pretty heavy handed thing and causes inefficiencies in the
> allocators if compiled into the kernel because checks will have to be done
> in hot allocation paths.

Should we then re-implement those bits using mpols? Thereby avoiding
duplicate mask operations?

> > > > Furthermore there is no other partitioning scheme, cpusets is it.
> > >
> > > One can partition the system any way one wants by setting cpu affinities
> > > and memory policies etc. No need for cpusets/cgroups.
> >
> > Not so, the load-balancer will still try to move the tasks and
> > subsequently fail. Partitioning means it won't even try to move tasks
> > across the partition boundary.
> 
> Ok so the scheduler is inefficient on this. Maybe that can be improved?

No, it simply doesn't (and cannot) know this... well it could but I think
it's an NP-hard problem. The way it's been solved is by means of explicit
configuration using cpusets.

> Setting affinities should not cause overhead in the scheduler.

To the contrary, it must. It makes the placement problem harder. It adds
constraints to an otherwise uniform problem.

> > By proper partitioning you can split load balance domains (or completely
> > disable the load-balancer by giving it a single cpu domain).
> 
> I thought that was the point of isolcpus?

I have the same problem with isolcpus that you seem to have with the
cpuset stuff on the allocator paths.

isolcpus is a very limited hack that adds more pain than it's worth. It's
yet another mask to check and its functionality is completely available
through cpusets.

You cannot create multi-cpu partitions using isolcpus, you cannot
dynamically reconfigure it.

And on the scheduler side cpusets doesn't add runtime overhead to normal
things, only sched_setaffinity() and a few other rare operations get
slightly more expensive. And it allows reducing runtime overhead by
making the load-balancer domains smaller.

All wins in my book.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-08 16:16                   ` Peter Zijlstra
@ 2012-05-08 16:25                     ` Peter Zijlstra
  2012-05-08 19:50                     ` Mike Galbraith
  1 sibling, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2012-05-08 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Thomas Gleixner, Geoff Levand, linaro-sched-sig, Daniel Lezcano,
	Stephen Hemminger, LKML, Chris Metcalf, Gilad Ben Yossef,
	Hakan Akkan, Alessio Igor Bogani, Avi Kivity, Max Krasnyansky,
	Steven Rostedt, Andrew Morton, Ingo Molnar

On Tue, 2012-05-08 at 18:16 +0200, Peter Zijlstra wrote:
> No, it simply doesn't (and cannot) know this... well it could but I think
> it's an NP-hard problem. The way it's been solved is by means of explicit
> configuration using cpusets. 

Yeah, it looks like a combinatorics problem. Anyway, even if we could
solve that problem we'd still end up with the cpuset infrastructure.
Only auto-magically configured instead of manually.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/41] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
  2012-05-07 16:02   ` Christoph Lameter
@ 2012-05-08 17:35     ` Frederic Weisbecker
  0 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-08 17:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Hakan Akkan, Ingo Molnar, Kevin Hilman,
	Max Krasnyansky, Paul E. McKenney, Peter Zijlstra,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

2012/5/7 Christoph Lameter <cl@linux.com>:
> On Tue, 1 May 2012, Frederic Weisbecker wrote:
>
>> Try to give the timekeeping duty to a CPU that doesn't belong
>> to any nohz cpuset when possible, so that we increase the chance
>> for these nohz cpusets to run their CPUs out of periodic tick
>> mode.
>>
>> [TODO: We need to find a way to ensure there is always one non-nohz
>> running CPU maintaining the timekeeping duty if all non-idle CPUs are
>> adaptive tickless]
>
> I sure wish this would also be pinnable to a specific cpu.

Yeah, well we need to be more flexible and allow for fine-grained sets of CPUs,
I quoted some reasons in one of our previous discussions:
https://lkml.org/lkml/2012/3/29/559

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-08 16:16                   ` Peter Zijlstra
  2012-05-08 16:25                     ` Peter Zijlstra
@ 2012-05-08 19:50                     ` Mike Galbraith
  2012-05-08 20:45                       ` Christoph Lameter
  1 sibling, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2012-05-08 19:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 2012-05-08 at 18:16 +0200, Peter Zijlstra wrote: 
> On Tue, 2012-05-08 at 10:57 -0500, Christoph Lameter wrote:

> isolcpus is a very limited hack that adds more pain than it's worth. It's
> yet another mask to check and its functionality is completely available
> through cpusets.

Agreed.

> You cannot create multi-cpu partitions using isolcpus, you cannot
> dynamically reconfigure it.

Big plus for cpusets.

> And on the scheduler side cpusets doesn't add runtime overhead to normal
> things, only sched_setaffinity() and a few other rare operations get
> slightly more expensive. And it allows reducing runtime overhead by
> making the load-balancer domains smaller.

Very big deal if you have a load that doesn't do all the performance 'i'
dotting and 't' crossing it maybe could have, but ends up on a big box.

-Mike


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-08 19:50                     ` Mike Galbraith
@ 2012-05-08 20:45                       ` Christoph Lameter
  2012-05-09  4:21                         ` Mike Galbraith
  0 siblings, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-05-08 20:45 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 8 May 2012, Mike Galbraith wrote:

> On Tue, 2012-05-08 at 18:16 +0200, Peter Zijlstra wrote:
> > On Tue, 2012-05-08 at 10:57 -0500, Christoph Lameter wrote:
>
> > isolcpus is a very limited hack that adds more pain than it's worth. It's
> > yet another mask to check and its functionality is completely available
> > through cpusets.
>
> Agreed.

How would that work? By creating cpusets that only have a single cpu in
them?

> > You cannot create multi-cpu partitions using isolcpus, you cannot
> > dynamically reconfigure it.
>
> Big plus for cpusets.

Why would you want to do anything like it? cpusets are confusing. You can
have a cpu be part of multiple cpusets. Which nohz setting applies for a
particular cpu then? If any of the cpusets have nohz set then it applies
to the cpu? And thus someone in a cpuset that does not have nohz set will
find that a cpu will have nohz functionality?

It's not a good match for this. You would want a per cpu attribute for
nohz.

> > And on the scheduler side cpusets doesn't add runtime overhead to normal
> > things, only sched_setaffinity() and a few other rare operations get
> > slightly more expensive. And it allows reducing runtime overhead by
> > making the load-balancer domains smaller.
>
> Very big deal if you have a load that doesn't do all the performance 'i'
> dotting and 't' crossing it maybe could have, but ends up on a big box.

isolcpus are not part of load balancer domains.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-08 20:45                       ` Christoph Lameter
@ 2012-05-09  4:21                         ` Mike Galbraith
  2012-05-09 11:02                           ` Frederic Weisbecker
                                             ` (2 more replies)
  0 siblings, 3 replies; 96+ messages in thread
From: Mike Galbraith @ 2012-05-09  4:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Tue, 2012-05-08 at 15:45 -0500, Christoph Lameter wrote: 
> On Tue, 8 May 2012, Mike Galbraith wrote:
> 
> > On Tue, 2012-05-08 at 18:16 +0200, Peter Zijlstra wrote:
> > > On Tue, 2012-05-08 at 10:57 -0500, Christoph Lameter wrote:
> >
> > > isolcpus is a very limited hack that adds more pain than it's worth. It's
> > > yet another mask to check and its functionality is completely available
> > > through cpusets.
> >
> > Agreed.
> 
> How would that work? By creating cpusets that only have a single cpu in
> them?

No, just turn load balancing off for exclusive set, domains go poof.

> > > You cannot create multi-cpu partitions using isolcpus, you cannot
> > > dynamically reconfigure it.
> >
> > Big plus for cpusets.
> 
> Why would you want to do anything like it? cpusets are confusing. You can
> have a cpu be part of multiple cpusets. Which nohz setting applies for a
> particular cpu then? If any of the cpusets have nohz set then it applies
> to the cpu? And thus someone in a cpuset that does not have nohz set will
> find that a cpu will have nohz functionality?

nohz has to be at least an exclusive set property.

> It's not a good match for this. You would want a per cpu attribute for
> nohz.

Or per cpuset, which can be the same thing as per cpu if you want.

> > > And on the scheduler side cpusets doesn't add runtime overhead to normal
> > > things, only sched_setaffinity() and a few other rare operations get
> > > slightly more expensive. And it allows reducing runtime overhead by
> > > making the load-balancer domains smaller.
> >
> > Very big deal if you have a load that doesn't do all the performance 'i'
> > dotting and 't' crossing it maybe could have, but ends up on a big box.
> 
> isolcpus are not part of load balancer domains.

Yup, so if you have an application with an RT component, somewhat
sensitive, needs isolation from rest of a big box, but app also has
SCHED_OTHER components.  isolcpus is a pain, everything has to be static
and nailed to the floor.  Load just works when plugged into a cpuset.

-Mike


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-09  4:21                         ` Mike Galbraith
@ 2012-05-09 11:02                           ` Frederic Weisbecker
  2012-05-09 11:07                           ` Frederic Weisbecker
  2012-05-09 14:22                           ` Christoph Lameter
  2 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-09 11:02 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Christoph Lameter, Peter Zijlstra, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

2012/5/9 Mike Galbraith <efault@gmx.de>:
> On Tue, 2012-05-08 at 15:45 -0500, Christoph Lameter wrote:
>> On Tue, 8 May 2012, Mike Galbraith wrote:
>>
>> > On Tue, 2012-05-08 at 18:16 +0200, Peter Zijlstra wrote:
>> > > On Tue, 2012-05-08 at 10:57 -0500, Christoph Lameter wrote:
>> >
>> > > isolcpus is a very limited hack that adds more pain than it's worth. It's
>> > > yet another mask to check and its functionality is completely available
>> > > through cpusets.
>> >
>> > Agreed.
>>
>> How would that work? By creating cpusets that only have a single cpu in
>> them?
>
> No, just turn load balancing off for exclusive set, domains go poof.

I don't think it's

>> > > You cannot create multi-cpu partitions using isolcpus, you cannot
>> > > dynamically reconfigure it.
>> >
>> > Big plus for cpusets.
>>
>> Why would you want to do anything like it? cpusets are confusing. You can
>> have a cpu be part of multiple cpusets. Which nohz setting applies for a
>> particular cpu then? If any of the cpusets have nohz set then it applies
>> to the cpu? And thus someone in a cpuset that does not have nohz set will
>> find that a cpu will have nohz functionality?
>
> nohz has to be at least an exclusive set property.
>
>> It's not a good match for this. You would want a per cpu attribute for
>> nohz.
>
> Or per cpuset, which can be the same thing as per cpu if you want.
>
>> > > And on the scheduler side cpusets doesn't add runtime overhead to normal
>> > > things, only sched_setaffinity() and a few other rare operations get
>> > > slightly more expensive. And it allows reducing runtime overhead by
>> > > making the load-balancer domains smaller.
>> >
>> > Very big deal if you have a load that doesn't do all the performance 'i'
>> > dotting and 't' crossing it maybe could have, but ends up on a big box.
>>
>> isolcpus are not part of load balancer domains.
>
> Yup, so if you have an application with an RT component, somewhat
> sensitive, needs isolation from rest of a big box, but app also has
> SCHED_OTHER components.  isolcpus is a pain, everything has to be static
> and nailed to the floor.  Load just works when plugged into a cpuset.
>
> -Mike
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-09  4:21                         ` Mike Galbraith
  2012-05-09 11:02                           ` Frederic Weisbecker
@ 2012-05-09 11:07                           ` Frederic Weisbecker
  2012-05-09 14:23                             ` Christoph Lameter
  2012-05-09 14:22                           ` Christoph Lameter
  2 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-09 11:07 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Christoph Lameter, Peter Zijlstra, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

(Sorry, pressed sent too quickly)

2012/5/9 Mike Galbraith <efault@gmx.de>:
> On Tue, 2012-05-08 at 15:45 -0500, Christoph Lameter wrote:
>> On Tue, 8 May 2012, Mike Galbraith wrote:
>>
>> > On Tue, 2012-05-08 at 18:16 +0200, Peter Zijlstra wrote:
>> > > On Tue, 2012-05-08 at 10:57 -0500, Christoph Lameter wrote:
>> >
>> > > isolcpus is a very limited hack that adds more pain than it's worth. It's
>> > > yet another mask to check and its functionality is completely available
>> > > through cpusets.
>> >
>> > Agreed.
>>
>> How would that work? By creating cpusets that only have a single cpu in
>> them?
>
> No, just turn load balancing off for exclusive set, domains go poof.
>
>> > > You cannot create multi-cpu partitions using isolcpus, you cannot
>> > > dynamically reconfigure it.
>> >
>> > Big plus for cpusets.
>>
>> Why would you want to do anything like it? cpusets are confusing. You can
>> have a cpu be part of multiple cpusets. Which nohz setting applies for a
>> particular cpu then? If any of the cpusets have nohz set then it applies
>> to the cpu? And thus someone in a cpuset that does not have nohz set will
>> find that a cpu will have nohz functionality?
>
> nohz has to be at least an exclusive set property.

I don't think it's a good idea. It will prevent a set of nohz CPUs from
being used for any other kind of partition. There is no good reason for
that.

Also that doesn't really solve the issue. The root cpuset will still
have the nohz flag turned off.

Maybe I should indeed use sysfs instead; it's actually true that
cpusets are confusing for this kind of thing.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-09  4:21                         ` Mike Galbraith
  2012-05-09 11:02                           ` Frederic Weisbecker
  2012-05-09 11:07                           ` Frederic Weisbecker
@ 2012-05-09 14:22                           ` Christoph Lameter
  2012-05-09 14:47                             ` Mike Galbraith
  2 siblings, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-05-09 14:22 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Wed, 9 May 2012, Mike Galbraith wrote:

> > It's not a good match for this. You would want a per cpu attribute for
> > nohz.
>
> Or per cpuset, which can be the same thing as per cpu if you want.

But now we start to manage per cpu characteristics with cpusets whereas
before cpusets was used mainly to manage groups of processors assigned to
applications! This means there will be a requirement to use cpusets in
environments that have not used them before.

> > isolcpus are not part of load balancer domains.
>
> Yup, so if you have an application with an RT component, somewhat
> sensitive, needs isolation from rest of a big box, but app also has
> SCHED_OTHER components.  isolcpus is a pain, everything has to be static
> and nailed to the floor.  Load just works when plugged into a cpuset.

Well you have low latency requirements. If you code for lowest latency
then you have to consider cache sizes, cache sharing etc etc. This means
you will have to nail down everything anyways. Cpusets would just be
another thing that one has to worry about.

The loads definitely won't work right if just "plugged into a cpuset".

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-09 11:07                           ` Frederic Weisbecker
@ 2012-05-09 14:23                             ` Christoph Lameter
  0 siblings, 0 replies; 96+ messages in thread
From: Christoph Lameter @ 2012-05-09 14:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Mike Galbraith, Peter Zijlstra, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Wed, 9 May 2012, Frederic Weisbecker wrote:

> Maybe I should indeed rather use sysfs, it's actually true that cpusets
> is confusing for this kind of thing.

Yes, let's manage the cpu characteristics from the directory with the cpu
properties.


* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-09 14:22                           ` Christoph Lameter
@ 2012-05-09 14:47                             ` Mike Galbraith
  2012-05-09 15:05                               ` Christoph Lameter
  0 siblings, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2012-05-09 14:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Wed, 2012-05-09 at 09:22 -0500, Christoph Lameter wrote: 
> On Wed, 9 May 2012, Mike Galbraith wrote:

> > > isolcpus are not part of load balancer domains.
> >
> > Yup, so if you have an application with an RT component, somewhat
> > sensitive, needs isolation from rest of a big box, but app also has
> > SCHED_OTHER components.  isolcpus is a pain, everything has to be static
> > and nailed to the floor.  Load just works when plugged into a cpuset.
> 
> Well you have low latency requirements. If you code for lowest latency
> then you have to consider cache sizes, cache sharing etc etc. This means
> you will have to nail down everything anyways. Cpusets would just be
> another thing that one has to worry about.
> 
> The loads definitely won't work right if just "plugged into a cpuset".

You're talking about serious RT/HPC.  I'm talking about apps/loads with
modest requirements, like "Please keep that evil nVidia (this that the
other) thing the _hell_ away from me, I cannot deal with its futzing
around in the kernel for a _full second_ at a time".

-Mike



* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-09 14:47                             ` Mike Galbraith
@ 2012-05-09 15:05                               ` Christoph Lameter
  2012-05-09 15:33                                 ` Mike Galbraith
  0 siblings, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-05-09 15:05 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Wed, 9 May 2012, Mike Galbraith wrote:

> On Wed, 2012-05-09 at 09:22 -0500, Christoph Lameter wrote:
> > On Wed, 9 May 2012, Mike Galbraith wrote:
>
> > > > isolcpus are not part of load balancer domains.
> > >
> > > Yup, so if you have an application with an RT component, somewhat
> > > sensitive, needs isolation from rest of a big box, but app also has
> > > SCHED_OTHER components.  isolcpus is a pain, everything has to be static
> > > and nailed to the floor.  Load just works when plugged into a cpuset.
> >
> > Well you have low latency requirements. If you code for lowest latency
> > then you have to consider cache sizes, cache sharing etc etc. This means
> > you will have to nail down everything anyways. Cpusets would just be
> > another thing that one has to worry about.
> >
> > The loads definitely won't work right if just "plugged into a cpuset".
>
> You're talking about serious RT/HPC.  I'm talking about apps/loads with
> modest requirements, like "Please keep that evil nVidia (this that the
> other) thing the _hell_ away from me, I cannot deal with its futzing
> around in the kernel for a _full second_ at a time".

Well, I hope you understand that I do not want yet another layer of complexity
thrown in by having to deal with cpusets too, in addition to the pinning,
caches, etc etc.

I do not get how a cpuset could be used by an application with load
balancing disabled. Seems to defeat the purpose of the cpuset (which IMHO
is to generate a custom load balancing domain after all). You would
have to manually pin the processes to processors of the cpuset anyway.

If you already have to pin then why would you want a cpuset on top of
that?




* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-09 15:05                               ` Christoph Lameter
@ 2012-05-09 15:33                                 ` Mike Galbraith
  2012-05-09 15:40                                   ` Christoph Lameter
  0 siblings, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2012-05-09 15:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Wed, 2012-05-09 at 10:05 -0500, Christoph Lameter wrote:

> I do not get how a cpuset could be used by an application with load
> balancing disabled. Seems to defeat the purpose of the cpuset (which IMHO
> is to generate a custom load balancing domain after all). You would
> have to manually pin the processes to processors of the cpuset anyway.
> 
> If you already have to pin then why would you want a cpuset on top of
> that?

You don't have to turn load balancing completely off.  I have at least
one customer with modest isolation requirements, an exclusive set with
load balancing enabled is perfect for them.  OTOH, for tightly constrained
RT, I have no choice but to turn off everything I can get my hands on.
Any such app manages itself or is busted crud, so all is well, cpusets
work nicely.

I don't like cgroups much, but I've become rather fond of cpusets.
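
The kind of exclusive, still-balanced set described here can be sketched with
the legacy cpuset v1 interface. The mount point, set name, and CPU/node numbers
below are illustrative assumptions only:

```shell
# Illustrative only: requires root and a cpuset filesystem mount, e.g.:
#   mount -t cpuset none /dev/cpuset
mkdir /dev/cpuset/rtset
echo 2-3 > /dev/cpuset/rtset/cpus                # CPUs owned by the set
echo 0   > /dev/cpuset/rtset/mems                # memory node(s) of the set
echo 1   > /dev/cpuset/rtset/cpu_exclusive       # no sibling set may use these CPUs
echo 1   > /dev/cpuset/rtset/sched_load_balance  # keep load balancing within the set
echo $$  > /dev/cpuset/rtset/tasks               # move the current shell into the set
```

For the tight-constraint RT case, sched_load_balance would instead be set to 0
(in the set and in the root set) so the CPUs fall out of the balancer domains.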

-Mike




* Re: [PATCH 07/41] cpuset: Set up interface for nohz flag
  2012-05-09 15:33                                 ` Mike Galbraith
@ 2012-05-09 15:40                                   ` Christoph Lameter
  0 siblings, 0 replies; 96+ messages in thread
From: Christoph Lameter @ 2012-05-09 15:40 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Kevin Hilman, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Steven Rostedt, Sven-Thorsten Dietrich,
	Thomas Gleixner

On Wed, 9 May 2012, Mike Galbraith wrote:

> I don't like cgroups much, but I've become rather fond of cpusets.

Well, that we agree on. And given the future of many more cores, it just
makes sense to segment the system on processor and node boundaries instead
of adding overhead for managing slices of that.




* Re: [PATCH 11/41] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-04-30 23:54 ` [PATCH 11/41] nohz/cpuset: Don't turn off the tick if rcu needs it Frederic Weisbecker
@ 2012-05-22 17:16   ` Paul E. McKenney
  2012-05-23 13:52     ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-22 17:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 01, 2012 at 01:54:45AM +0200, Frederic Weisbecker wrote:
> If RCU is waiting for the current CPU to complete a grace
> period, don't turn off the tick. Unlike dyntick-idle, we
> are not necessarily going to enter into rcu extended quiescent
> state, so we may need to keep the tick to note current CPU's
> quiescent states.
> 
> [added build fix from Zen Lin]

Hello, Frederic,

One question below -- why not rcu_needs_cpu() instead of rcu_pending()?

							Thanx, Paul

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Alessio Igor Bogani <abogani@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Avi Kivity <avi@redhat.com>
> Cc: Chris Metcalf <cmetcalf@tilera.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Geoff Levand <geoff@infradead.org>
> Cc: Gilad Ben Yossef <gilad@benyossef.com>
> Cc: Hakan Akkan <hakanakkan@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Kevin Hilman <khilman@ti.com>
> Cc: Max Krasnyansky <maxk@qualcomm.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/rcupdate.h |    1 +
>  kernel/rcutree.c         |    3 +--
>  kernel/time/tick-sched.c |   22 ++++++++++++++++++----
>  3 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 81c04f4..e06639e 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -184,6 +184,7 @@ static inline int rcu_preempt_depth(void)
>  extern void rcu_sched_qs(int cpu);
>  extern void rcu_bh_qs(int cpu);
>  extern void rcu_check_callbacks(int cpu, int user);
> +extern int rcu_pending(int cpu);
>  struct notifier_block;
>  extern void rcu_idle_enter(void);
>  extern void rcu_idle_exit(void);
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 6c4a672..e141c7e 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -212,7 +212,6 @@ int rcu_cpu_stall_suppress __read_mostly;
>  module_param(rcu_cpu_stall_suppress, int, 0644);
> 
>  static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
> -static int rcu_pending(int cpu);
> 
>  /*
>   * Return the number of RCU-sched batches processed thus far for debug & stats.
> @@ -1915,7 +1914,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
>   * by the current CPU, returning 1 if so.  This function is part of the
>   * RCU implementation; it is -not- an exported member of the RCU API.
>   */
> -static int rcu_pending(int cpu)
> +int rcu_pending(int cpu)
>  {
>  	return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
>  	       __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 43fa7ac..4f99766 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -506,9 +506,21 @@ void tick_nohz_idle_enter(void)
>  	local_irq_enable();
>  }
> 
> +#ifdef CONFIG_CPUSETS_NO_HZ
> +static bool can_stop_adaptive_tick(void)
> +{
> +	if (!sched_can_stop_tick())
> +		return false;
> +
> +	/* Is there a grace period to complete ? */
> +	if (rcu_pending(smp_processor_id()))

You lost me on this one.  Why can't this be rcu_needs_cpu()?

> +		return false;
> +
> +	return true;
> +}
> +
>  static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
>  {
> -#ifdef CONFIG_CPUSETS_NO_HZ
>  	int cpu = smp_processor_id();
> 
>  	if (!cpuset_adaptive_nohz() || is_idle_task(current))
> @@ -517,12 +529,14 @@ static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
>  	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
>  		return;
> 
> -	if (!sched_can_stop_tick())
> +	if (!can_stop_adaptive_tick())
>  		return;
> 
>  	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
> -#endif
>  }
> +#else
> +static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts) { }
> +#endif
> 
>  /**
>   * tick_nohz_irq_exit - update next tick event from interrupt exit
> @@ -852,7 +866,7 @@ void tick_nohz_check_adaptive(void)
>  	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
> 
>  	if (ts->tick_stopped && !is_idle_task(current)) {
> -		if (!sched_can_stop_tick())
> +		if (!can_stop_adaptive_tick())
>  			tick_nohz_restart_sched_tick();
>  	}
>  }
> -- 
> 1.7.5.4
> 



* Re: [PATCH 16/41] rcu: Restart the tick on non-responding adaptive nohz CPUs
  2012-04-30 23:54 ` [PATCH 16/41] rcu: Restart the tick on non-responding adaptive nohz CPUs Frederic Weisbecker
@ 2012-05-22 17:20   ` Paul E. McKenney
  2012-05-23 13:57     ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-22 17:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 01, 2012 at 01:54:50AM +0200, Frederic Weisbecker wrote:
> When a CPU in adaptive nohz mode doesn't respond to complete
> a grace period, issue it a specific IPI so that it restarts
> the tick and chases a quiescent state.

Hello, Frederic,

I don't understand the need for this patch.  If the CPU is in
adaptive-tick mode, RCU should see it as being in dyntick-idle mode,
right?  If so, shouldn't RCU have already recognized the CPU as being
in an extended quiescent state?

Or is this a belt-and-suspenders situation?

							Thanx, Paul

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Alessio Igor Bogani <abogani@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Avi Kivity <avi@redhat.com>
> Cc: Chris Metcalf <cmetcalf@tilera.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Geoff Levand <geoff@infradead.org>
> Cc: Gilad Ben Yossef <gilad@benyossef.com>
> Cc: Hakan Akkan <hakanakkan@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Kevin Hilman <khilman@ti.com>
> Cc: Max Krasnyansky <maxk@qualcomm.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/rcutree.c |   17 +++++++++++++++++
>  1 files changed, 17 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index e141c7e..3fffc26 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -50,6 +50,7 @@
>  #include <linux/wait.h>
>  #include <linux/kthread.h>
>  #include <linux/prefetch.h>
> +#include <linux/cpuset.h>
> 
>  #include "rcutree.h"
>  #include <trace/events/rcu.h>
> @@ -302,6 +303,20 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
> 
>  #ifdef CONFIG_SMP
> 
> +static void cpuset_update_rcu_cpu(int cpu)
> +{
> +#ifdef CONFIG_CPUSETS_NO_HZ
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	if (cpuset_cpu_adaptive_nohz(cpu))
> +		smp_cpuset_update_nohz(cpu);
> +
> +	local_irq_restore(flags);
> +#endif
> +}
> +
>  /*
>   * If the specified CPU is offline, tell the caller that it is in
>   * a quiescent state.  Otherwise, whack it with a reschedule IPI.
> @@ -325,6 +340,8 @@ static int rcu_implicit_offline_qs(struct rcu_data *rdp)
>  		return 1;
>  	}
> 
> +	cpuset_update_rcu_cpu(rdp->cpu);
> +
>  	/*
>  	 * The CPU is online, so send it a reschedule IPI.  This forces
>  	 * it through the scheduler, and (inefficiently) also handles cases
> -- 
> 1.7.5.4
> 



* Re: [PATCH 17/41] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  2012-04-30 23:54 ` [PATCH 17/41] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU Frederic Weisbecker
@ 2012-05-22 17:27   ` Paul E. McKenney
  2012-05-22 17:30     ` Paul E. McKenney
  2012-05-23 14:00     ` Frederic Weisbecker
  0 siblings, 2 replies; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-22 17:27 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 01, 2012 at 01:54:51AM +0200, Frederic Weisbecker wrote:
> If we enqueue an rcu callback, we need the CPU tick to stay
> alive until we take care of those by completing the appropriate
> grace period.
> 
> Thus, when we call_rcu(), send a self IPI that checks rcu_needs_cpu()
> so that we restore a periodic tick behaviour that can take care of
> everything.

Ouch, I hadn't considered RCU callbacks being posted from within an
extended quiescent state.  I guess I need to make __call_rcu() either
complain about this or handle it correctly...  It would -usually- be
harmless, but there is getting to be quite a bit of active machinery
in the various idle loops, so just assuming that it cannot happen is
probably getting to be an obsolete assumption.

							Thanx, Paul

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Alessio Igor Bogani <abogani@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Avi Kivity <avi@redhat.com>
> Cc: Chris Metcalf <cmetcalf@tilera.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Geoff Levand <geoff@infradead.org>
> Cc: Gilad Ben Yossef <gilad@benyossef.com>
> Cc: Hakan Akkan <hakanakkan@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Kevin Hilman <khilman@ti.com>
> Cc: Max Krasnyansky <maxk@qualcomm.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/rcutree.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 3fffc26..b8d300c 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1749,6 +1749,13 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
>  	else
>  		trace_rcu_callback(rsp->name, head, rdp->qlen);
> 
> +	/* Restart the timer if needed to handle the callbacks */
> +	if (cpuset_adaptive_nohz()) {
> +		/* Make updates on nxtlist visible to self IPI */
> +		barrier();
> +		smp_cpuset_update_nohz(smp_processor_id());
> +	}
> +
>  	/* If interrupts were disabled, don't dive into RCU core. */
>  	if (irqs_disabled_flags(flags)) {
>  		local_irq_restore(flags);
> -- 
> 1.7.5.4
> 



* Re: [PATCH 17/41] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  2012-05-22 17:27   ` Paul E. McKenney
@ 2012-05-22 17:30     ` Paul E. McKenney
  2012-05-23 14:03       ` Frederic Weisbecker
  2012-05-23 14:00     ` Frederic Weisbecker
  1 sibling, 1 reply; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-22 17:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 22, 2012 at 10:27:14AM -0700, Paul E. McKenney wrote:
> On Tue, May 01, 2012 at 01:54:51AM +0200, Frederic Weisbecker wrote:
> > If we enqueue an rcu callback, we need the CPU tick to stay
> > alive until we take care of those by completing the appropriate
> > grace period.
> > 
> > Thus, when we call_rcu(), send a self IPI that checks rcu_needs_cpu()
> > so that we restore a periodic tick behaviour that can take care of
> > everything.
> 
> Ouch, I hadn't considered RCU callbacks being posted from within an
> extended quiescent state.  I guess I need to make __call_rcu() either
> complain about this or handle it correctly...  It would -usually- be
> harmless, but there is getting to be quite a bit of active machinery
> in the various idle loops, so just assuming that it cannot happen is
> probably getting to be an obsolete assumption.

Adaptive ticks does restart the tick upon entering the kernel, correct?
If so, wouldn't the return to userspace cause adaptive tick to automatically
handle a callback posted from within the kernel?

(And yes, I still need to handle the possibility of callbacks being posted
from the idle loop, but that is a different extended quiescent state.)

							Thanx, Paul

> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Alessio Igor Bogani <abogani@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Avi Kivity <avi@redhat.com>
> > Cc: Chris Metcalf <cmetcalf@tilera.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Cc: Geoff Levand <geoff@infradead.org>
> > Cc: Gilad Ben Yossef <gilad@benyossef.com>
> > Cc: Hakan Akkan <hakanakkan@gmail.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Kevin Hilman <khilman@ti.com>
> > Cc: Max Krasnyansky <maxk@qualcomm.com>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  kernel/rcutree.c |    7 +++++++
> >  1 files changed, 7 insertions(+), 0 deletions(-)
> > 
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index 3fffc26..b8d300c 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -1749,6 +1749,13 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
> >  	else
> >  		trace_rcu_callback(rsp->name, head, rdp->qlen);
> > 
> > +	/* Restart the timer if needed to handle the callbacks */
> > +	if (cpuset_adaptive_nohz()) {
> > +		/* Make updates on nxtlist visible to self IPI */
> > +		barrier();
> > +		smp_cpuset_update_nohz(smp_processor_id());
> > +	}
> > +
> >  	/* If interrupts were disabled, don't dive into RCU core. */
> >  	if (irqs_disabled_flags(flags)) {
> >  		local_irq_restore(flags);
> > -- 
> > 1.7.5.4
> > 



* Re: [PATCH 37/41] rcu: New rcu_user_enter() and rcu_user_exit() APIs
  2012-04-30 23:55 ` [PATCH 37/41] rcu: New rcu_user_enter() and rcu_user_exit() APIs Frederic Weisbecker
@ 2012-05-22 18:23   ` Paul E. McKenney
  2012-05-23 14:22     ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-22 18:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 01, 2012 at 01:55:11AM +0200, Frederic Weisbecker wrote:
> These two APIs are provided to help the implementation
> of an adaptive tickless kernel (cf: nohz cpusets). We need
> to run into RCU extended quiescent state when we are in
> userland so that a tickless CPU is not involved in the
> global RCU state machine and can shutdown its tick safely.
> 
> These APIs are called from syscall and exception entry/exit
> points and can't be called from interrupt.
> 
> They are essentially the same as rcu_idle_enter() and
> rcu_idle_exit() minus the checks that ensure the CPU is
> running the idle task.

This looks reasonably sane.  There are a few nits like missing comment
headers for functions and the need for tracing, but I can handle that
when I pull it in.  I am happy to do that pretty much any time, but not
before the API stabilizes.  ;-)

So let me know when it is ready for -rcu.

							Thanx, Paul

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Alessio Igor Bogani <abogani@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Avi Kivity <avi@redhat.com>
> Cc: Chris Metcalf <cmetcalf@tilera.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Geoff Levand <geoff@infradead.org>
> Cc: Gilad Ben Yossef <gilad@benyossef.com>
> Cc: Hakan Akkan <hakanakkan@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Kevin Hilman <khilman@ti.com>
> Cc: Max Krasnyansky <maxk@qualcomm.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/rcupdate.h |    5 ++
>  kernel/rcutree.c         |  107 ++++++++++++++++++++++++++++++++-------------
>  2 files changed, 81 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index e06639e..6539290 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -191,6 +191,11 @@ extern void rcu_idle_exit(void);
>  extern void rcu_irq_enter(void);
>  extern void rcu_irq_exit(void);
> 
> +#ifdef CONFIG_CPUSETS_NO_HZ
> +void rcu_user_enter(void);
> +void rcu_user_exit(void);
> +#endif
> +
>  /*
>   * Infrastructure to implement the synchronize_() primitives in
>   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index b8d300c..cba1332 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -357,16 +357,8 @@ static int rcu_implicit_offline_qs(struct rcu_data *rdp)
> 
>  #endif /* #ifdef CONFIG_SMP */
> 
> -/*
> - * rcu_idle_enter_common - inform RCU that current CPU is moving towards idle
> - *
> - * If the new value of the ->dynticks_nesting counter now is zero,
> - * we really have entered idle, and must do the appropriate accounting.
> - * The caller must have disabled interrupts.
> - */
> -static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
> +static void rcu_check_idle_enter(long long oldval)
>  {
> -	trace_rcu_dyntick("Start", oldval, 0);
>  	if (!is_idle_task(current)) {
>  		struct task_struct *idle = idle_task(smp_processor_id());
> 
> @@ -376,6 +368,18 @@ static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
>  			  current->pid, current->comm,
>  			  idle->pid, idle->comm); /* must be idle task! */
>  	}
> +}
> +
> +/*
> + * rcu_idle_enter_common - inform RCU that current CPU is moving towards idle
> + *
> + * If the new value of the ->dynticks_nesting counter now is zero,
> + * we really have entered idle, and must do the appropriate accounting.
> + * The caller must have disabled interrupts.
> + */
> +static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
> +{
> +	trace_rcu_dyntick("Start", oldval, 0);
>  	rcu_prepare_for_idle(smp_processor_id());
>  	/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
>  	smp_mb__before_atomic_inc();  /* See above. */
> @@ -384,6 +388,22 @@ static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
>  	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
>  }
> 
> +static long long __rcu_idle_enter(void)
> +{
> +	unsigned long flags;
> +	long long oldval;
> +	struct rcu_dynticks *rdtp;
> +
> +	local_irq_save(flags);
> +	rdtp = &__get_cpu_var(rcu_dynticks);
> +	oldval = rdtp->dynticks_nesting;
> +	rdtp->dynticks_nesting = 0;
> +	rcu_idle_enter_common(rdtp, oldval);
> +	local_irq_restore(flags);
> +
> +	return oldval;
> +}
> +
>  /**
>   * rcu_idle_enter - inform RCU that current CPU is entering idle
>   *
> @@ -398,16 +418,15 @@ static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
>   */
>  void rcu_idle_enter(void)
>  {
> -	unsigned long flags;
>  	long long oldval;
> -	struct rcu_dynticks *rdtp;
> 
> -	local_irq_save(flags);
> -	rdtp = &__get_cpu_var(rcu_dynticks);
> -	oldval = rdtp->dynticks_nesting;
> -	rdtp->dynticks_nesting = 0;
> -	rcu_idle_enter_common(rdtp, oldval);
> -	local_irq_restore(flags);
> +	oldval = __rcu_idle_enter();
> +	rcu_check_idle_enter(oldval);
> +}
> +
> +void rcu_user_enter(void)
> +{
> +	__rcu_idle_enter();
>  }
> 
>  /**
> @@ -437,6 +456,7 @@ void rcu_irq_exit(void)
>  	oldval = rdtp->dynticks_nesting;
>  	rdtp->dynticks_nesting--;
>  	WARN_ON_ONCE(rdtp->dynticks_nesting < 0);
> +
>  	if (rdtp->dynticks_nesting)
>  		trace_rcu_dyntick("--=", oldval, rdtp->dynticks_nesting);
>  	else
> @@ -444,6 +464,20 @@ void rcu_irq_exit(void)
>  	local_irq_restore(flags);
>  }
> 
> +static void rcu_check_idle_exit(struct rcu_dynticks *rdtp, long long oldval)
> +{
> +	if (!is_idle_task(current)) {
> +		struct task_struct *idle = idle_task(smp_processor_id());
> +
> +		trace_rcu_dyntick("Error on exit: not idle task",
> +				  oldval, rdtp->dynticks_nesting);
> +		ftrace_dump(DUMP_ALL);
> +		WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",
> +			  current->pid, current->comm,
> +			  idle->pid, idle->comm); /* must be idle task! */
> +	}
> +}
> +
>  /*
>   * rcu_idle_exit_common - inform RCU that current CPU is moving away from idle
>   *
> @@ -460,16 +494,18 @@ static void rcu_idle_exit_common(struct rcu_dynticks *rdtp, long long oldval)
>  	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>  	rcu_cleanup_after_idle(smp_processor_id());
>  	trace_rcu_dyntick("End", oldval, rdtp->dynticks_nesting);
> -	if (!is_idle_task(current)) {
> -		struct task_struct *idle = idle_task(smp_processor_id());
> +}
> 
> -		trace_rcu_dyntick("Error on exit: not idle task",
> -				  oldval, rdtp->dynticks_nesting);
> -		ftrace_dump(DUMP_ALL);
> -		WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",
> -			  current->pid, current->comm,
> -			  idle->pid, idle->comm); /* must be idle task! */
> -	}
> +static long long __rcu_idle_exit(struct rcu_dynticks *rdtp)
> +{
> +	long long oldval;
> +
> +	oldval = rdtp->dynticks_nesting;
> +	WARN_ON_ONCE(oldval != 0);
> +	rdtp->dynticks_nesting = LLONG_MAX / 2;
> +	rcu_idle_exit_common(rdtp, oldval);
> +
> +	return oldval;
>  }
> 
>  /**
> @@ -485,16 +521,25 @@ static void rcu_idle_exit_common(struct rcu_dynticks *rdtp, long long oldval)
>   */
>  void rcu_idle_exit(void)
>  {
> +	long long oldval;
> +	struct rcu_dynticks *rdtp;
>  	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	rdtp = &__get_cpu_var(rcu_dynticks);
> +	oldval = __rcu_idle_exit(rdtp);
> +	rcu_check_idle_exit(rdtp, oldval);
> +	local_irq_restore(flags);
> +}
> +
> +void rcu_user_exit(void)
> +{
>  	struct rcu_dynticks *rdtp;
> -	long long oldval;
> +	unsigned long flags;
> 
>  	local_irq_save(flags);
>  	rdtp = &__get_cpu_var(rcu_dynticks);
> -	oldval = rdtp->dynticks_nesting;
> -	WARN_ON_ONCE(oldval != 0);
> -	rdtp->dynticks_nesting = DYNTICK_TASK_NESTING;
> -	rcu_idle_exit_common(rdtp, oldval);
> +	 __rcu_idle_exit(rdtp);
>  	local_irq_restore(flags);
>  }
> 
> -- 
> 1.7.5.4
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 38/41] rcu: New rcu_user_enter_irq() and rcu_user_exit_irq() APIs
  2012-04-30 23:55 ` [PATCH 38/41] rcu: New rcu_user_enter_irq() and rcu_user_exit_irq() APIs Frederic Weisbecker
@ 2012-05-22 18:33   ` Paul E. McKenney
  2012-05-23 14:31     ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-22 18:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 01, 2012 at 01:55:12AM +0200, Frederic Weisbecker wrote:
> A CPU running in adaptive tickless mode wants to enter into
> RCU extended quiescent state while running in userspace. This
> way we can shut down the tick that is usually needed on each
> CPU for the needs of RCU.
> 
> Typically, RCU enters the extended quiescent state when we resume
> to userspace through a syscall or exception exit; this is done
> using rcu_user_enter(). RCU then exits this state by calling
> rcu_user_exit() from syscall or exception entry.
> 
> However there are two other points where we may want to enter
> or exit this state. Some remote CPU may require a tickless CPU
> to restart its tick for any reason and send it an IPI for
> this purpose. As we restart the tick, we don't want to resume
> from the IPI in RCU extended quiescent state anymore.
> Similarly we may stop the tick from an interrupt in userspace and
> we need to be able to enter RCU extended quiescent state when we
> resume from this interrupt to userspace.
> 
> To these ends, we provide two new APIs:
> 
> - rcu_user_enter_irq(). This must be called from a non-nesting
> interrupt between rcu_irq_enter() and rcu_irq_exit().
> After the irq calls rcu_irq_exit(), we'll run in RCU extended
> quiescent state.
> 
> - rcu_user_exit_irq(). This must be called from a non-nesting
> interrupt, interrupting an RCU extended quiescent state, and
> between rcu_irq_enter() and rcu_irq_exit(). After the irq calls
> rcu_irq_exit(), we'll be prevented from resuming the RCU extended
> quiescent state.

In both cases, the IRQ handler must correspond to an interrupt from
task/thread/process/whatever level, so that it is illegal to call
these from an interrupt handler that was invoked from within another
interrupt.  Right?

A couple more questions and comments below.

							Thanx, Paul

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Alessio Igor Bogani <abogani@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Avi Kivity <avi@redhat.com>
> Cc: Chris Metcalf <cmetcalf@tilera.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Geoff Levand <geoff@infradead.org>
> Cc: Gilad Ben Yossef <gilad@benyossef.com>
> Cc: Hakan Akkan <hakanakkan@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Kevin Hilman <khilman@ti.com>
> Cc: Max Krasnyansky <maxk@qualcomm.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/rcupdate.h |    2 ++
>  kernel/rcutree.c         |   24 ++++++++++++++++++++++++
>  2 files changed, 26 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 6539290..3cf1d51 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -194,6 +194,8 @@ extern void rcu_irq_exit(void);
>  #ifdef CONFIG_CPUSETS_NO_HZ
>  void rcu_user_enter(void);
>  void rcu_user_exit(void);
> +void rcu_user_enter_irq(void);
> +void rcu_user_exit_irq(void);
>  #endif
> 
>  /*
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index cba1332..2adc5a0 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -429,6 +429,18 @@ void rcu_user_enter(void)
>  	__rcu_idle_enter();
>  }
> 
> +void rcu_user_enter_irq(void)

It took me a bit to correctly parse the name, which goes something
like RCU adaptive-tick user enter while in an IRQ handler.  A header
comment would help.  (I can supply one when it is time for this to
go into -rcu.)
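For illustration, such a header comment might read something like the following (a hypothetical sketch of the wording, not the comment that eventually went into -rcu):

```c
/**
 * rcu_user_enter_irq - inform RCU that we will resume userspace
 * after the current interrupt handler completes
 *
 * Must be called from a non-nesting interrupt taken at task level,
 * between rcu_irq_enter() and rcu_irq_exit().  Once the handler
 * calls rcu_irq_exit(), the CPU runs in RCU extended quiescent state.
 */
```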

> +{
> +	unsigned long flags;
> +	struct rcu_dynticks *rdtp;
> +
> +	local_irq_save(flags);
> +	rdtp = &__get_cpu_var(rcu_dynticks);
> +	WARN_ON_ONCE(rdtp->dynticks_nesting == 1);
> +	rdtp->dynticks_nesting = 1;
> +	local_irq_restore(flags);
> +}
> +
>  /**
>   * rcu_irq_exit - inform RCU that current CPU is exiting irq towards idle
>   *
> @@ -543,6 +555,18 @@ void rcu_user_exit(void)
>  	local_irq_restore(flags);
>  }
> 
> +void rcu_user_exit_irq(void)
> +{
> +	unsigned long flags;
> +	struct rcu_dynticks *rdtp;
> +
> +	local_irq_save(flags);
> +	rdtp = &__get_cpu_var(rcu_dynticks);
> +	WARN_ON_ONCE(rdtp->dynticks_nesting == 0);

For symmetry, wouldn't this be as follows?

	WARN_ON_ONCE(rdtp->dynticks_nesting >= LLONG_MAX / 4);

In other words, complain if the task is trying to exit RCU-idle state when
it has already exited from RCU-idle state?

Of course, it had better not be zero as well.  Or negative, for that
matter.
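Modeled in plain user-space C, the nesting-counter discipline and the symmetric check suggested here look like this (a toy sketch of the invariant only; the constants mirror the patch, where LLONG_MAX / 2 marks the non-idle state, but none of this is kernel code):

```c
#include <assert.h>
#include <limits.h>

/* Toy user-space model of rdtp->dynticks_nesting (not kernel code). */
static long long dynticks_nesting = LLONG_MAX / 2;	/* running non-idle */

/* Analogue of rcu_user_enter_irq(): mark that we will resume userspace
 * (RCU-idle) once the current interrupt returns. */
static void user_enter_irq(void)
{
	assert(dynticks_nesting != 1);	/* not already marked RCU-idle */
	dynticks_nesting = 1;
}

/* Analogue of rcu_user_exit_irq(), with the symmetric check suggested
 * above: complain if we have already exited the RCU-idle state. */
static void user_exit_irq(void)
{
	assert(dynticks_nesting < LLONG_MAX / 4);
	dynticks_nesting = (LLONG_MAX / 2) + 1;
}
```

The point of the `LLONG_MAX / 4` bound is that any value at or above it means the counter was already reset to a non-idle value, so a second "exit" is a bug.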

> +	rdtp->dynticks_nesting = (LLONG_MAX / 2) + 1;
> +	local_irq_restore(flags);
> +}
> +
>  /**
>   * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
>   *
> -- 
> 1.7.5.4
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 39/41] rcu: Switch to extended quiescent state in userspace from nohz cpuset
  2012-04-30 23:55 ` [PATCH 39/41] rcu: Switch to extended quiescent state in userspace from nohz cpuset Frederic Weisbecker
@ 2012-05-22 18:36   ` Paul E. McKenney
  2012-05-22 23:04     ` Paul E. McKenney
  2012-05-23 14:33     ` Frederic Weisbecker
  0 siblings, 2 replies; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-22 18:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 01, 2012 at 01:55:13AM +0200, Frederic Weisbecker wrote:
> When we switch to adaptive nohz mode and we run in userspace,
> we can still receive IPIs from the RCU core if a grace period
> has been started by another CPU, because we need to take part
> in its completion.
> 
> However, running in userspace is similar to running in
> idle because we don't make use of RCU there, thus we can be
> considered to be in an RCU extended quiescent state. The
> benefit of running in that mode is that we are no
> longer disturbed by needless IPIs coming from the RCU core.
> 
> To perform this, we just need to use the RCU extended quiescent state
> APIs on the following points:
> 
> - kernel exit or tick stop in userspace: here we switch to extended
> quiescent state because we run in userspace without the tick.
> 
> - kernel entry or tick restart: here we exit the extended quiescent
> state because either we enter the kernel and we may make use of RCU
> read side critical section anytime, or we need the timer tick for some
> reason and that takes care of RCU grace period in a traditional way.

One FIXME question below.

							Thanx, Paul

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Alessio Igor Bogani <abogani@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Avi Kivity <avi@redhat.com>
> Cc: Chris Metcalf <cmetcalf@tilera.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Geoff Levand <geoff@infradead.org>
> Cc: Gilad Ben Yossef <gilad@benyossef.com>
> Cc: Hakan Akkan <hakanakkan@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Kevin Hilman <khilman@ti.com>
> Cc: Max Krasnyansky <maxk@qualcomm.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/tick.h     |    3 +++
>  kernel/time/tick-sched.c |   27 +++++++++++++++++++++++++--
>  2 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 3c31d6e..e2a49ad 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -153,6 +153,8 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
>  # endif /* !NO_HZ */
> 
>  #ifdef CONFIG_CPUSETS_NO_HZ
> +DECLARE_PER_CPU(int, nohz_task_ext_qs);
> +
>  extern void tick_nohz_enter_kernel(void);
>  extern void tick_nohz_exit_kernel(void);
>  extern void tick_nohz_enter_exception(struct pt_regs *regs);
> @@ -160,6 +162,7 @@ extern void tick_nohz_exit_exception(struct pt_regs *regs);
>  extern void tick_nohz_check_adaptive(void);
>  extern void tick_nohz_pre_schedule(void);
>  extern void tick_nohz_post_schedule(void);
> +extern void tick_nohz_cpu_exit_qs(void);
>  extern bool tick_nohz_account_tick(void);
>  extern void tick_nohz_flush_current_times(bool restart_tick);
>  #else /* !CPUSETS_NO_HZ */
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 8217409..b15ab5e 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -565,10 +565,13 @@ static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
> 
>  	if (!was_stopped && ts->tick_stopped) {
>  		WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_NONE);
> -		if (user)
> +		if (user) {
>  			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
> -		else if (!current->mm)
> +			__get_cpu_var(nohz_task_ext_qs) = 1;
> +			rcu_user_enter_irq();
> +		} else if (!current->mm) {
>  			ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
> +		}
> 
>  		ts->saved_jiffies = jiffies;
>  		set_thread_flag(TIF_NOHZ);
> @@ -899,6 +902,8 @@ void tick_check_idle(int cpu)
>  }
> 
>  #ifdef CONFIG_CPUSETS_NO_HZ
> +DEFINE_PER_CPU(int, nohz_task_ext_qs);
> +
>  void tick_nohz_exit_kernel(void)
>  {
>  	unsigned long flags;
> @@ -922,6 +927,9 @@ void tick_nohz_exit_kernel(void)
>  	ts->saved_jiffies = jiffies;
>  	ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
> 
> +	__get_cpu_var(nohz_task_ext_qs) = 1;
> +	rcu_user_enter();
> +
>  	local_irq_restore(flags);
>  }
> 
> @@ -940,6 +948,11 @@ void tick_nohz_enter_kernel(void)
>  		return;
>  	}
> 
> +	if (__get_cpu_var(nohz_task_ext_qs) == 1) {
> +		__get_cpu_var(nohz_task_ext_qs) = 0;
> +		rcu_user_exit();
> +	}
> +
>  	WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_USER);
> 
>  	delta_jiffies = jiffies - ts->saved_jiffies;
> @@ -951,6 +964,14 @@ void tick_nohz_enter_kernel(void)
>  	local_irq_restore(flags);
>  }
> 
> +void tick_nohz_cpu_exit_qs(void)
> +{
> +	if (__get_cpu_var(nohz_task_ext_qs)) {
> +		rcu_user_exit_irq();
> +		__get_cpu_var(nohz_task_ext_qs) = 0;
> +	}
> +}
> +
>  void tick_nohz_enter_exception(struct pt_regs *regs)
>  {
>  	if (user_mode(regs))
> @@ -986,6 +1007,7 @@ static void tick_nohz_restart_adaptive(void)
>  	tick_nohz_flush_current_times(true);
>  	tick_nohz_restart_sched_tick();
>  	clear_thread_flag(TIF_NOHZ);
> +	tick_nohz_cpu_exit_qs();
>  }
> 
>  void tick_nohz_check_adaptive(void)
> @@ -1023,6 +1045,7 @@ void tick_nohz_pre_schedule(void)
>  	if (ts->tick_stopped) {
>  		tick_nohz_flush_current_times(true);
>  		clear_thread_flag(TIF_NOHZ);
> +		/* FIXME: warn if we are in RCU idle mode */

This would be WARN_ON_ONCE(rcu_is_cpu_idle()) or some such, correct?

>  	}
>  }
> 
> -- 
> 1.7.5.4
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 39/41] rcu: Switch to extended quiescent state in userspace from nohz cpuset
  2012-05-22 18:36   ` Paul E. McKenney
@ 2012-05-22 23:04     ` Paul E. McKenney
  2012-05-23 14:33     ` Frederic Weisbecker
  1 sibling, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-22 23:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 22, 2012 at 11:36:30AM -0700, Paul E. McKenney wrote:
> On Tue, May 01, 2012 at 01:55:13AM +0200, Frederic Weisbecker wrote:
> > When we switch to adaptive nohz mode and we run in userspace,
> > we can still receive IPIs from the RCU core if a grace period
> > has been started by another CPU, because we need to take part
> > in its completion.
> > 
> > However, running in userspace is similar to running in
> > idle because we don't make use of RCU there, thus we can be
> > considered to be in an RCU extended quiescent state. The
> > benefit of running in that mode is that we are no
> > longer disturbed by needless IPIs coming from the RCU core.
> > 
> > To perform this, we just need to use the RCU extended quiescent state
> > APIs on the following points:
> > 
> > - kernel exit or tick stop in userspace: here we switch to extended
> > quiescent state because we run in userspace without the tick.
> > 
> > - kernel entry or tick restart: here we exit the extended quiescent
> > state because either we enter the kernel and we may make use of RCU
> > read side critical section anytime, or we need the timer tick for some
> > reason and that takes care of RCU grace period in a traditional way.
> 
> One FIXME question below.

And I found out one reason: WARN_ON_ONCE(rcu_is_cpu_idle()) works only
if CONFIG_PROVE_RCU is set.  I am fixing this.

							Thanx, Paul



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/41] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-05-22 17:16   ` Paul E. McKenney
@ 2012-05-23 13:52     ` Frederic Weisbecker
  2012-05-23 15:15       ` Paul E. McKenney
  0 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-23 13:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 22, 2012 at 10:16:58AM -0700, Paul E. McKenney wrote:
> On Tue, May 01, 2012 at 01:54:45AM +0200, Frederic Weisbecker wrote:
> > If RCU is waiting for the current CPU to complete a grace
> > period, don't turn off the tick. Unlike dyntick-idle, we
> > are not necessarily going to enter into rcu extended quiescent
> > state, so we may need to keep the tick to note current CPU's
> > quiescent states.
> > 
> > [added build fix from Zen Lin]
> 
> Hello, Frederic,
> 
> One question below -- why not rcu_needs_cpu() instead of rcu_pending()?
> 
> 							Thanx, Paul
> 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Alessio Igor Bogani <abogani@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Avi Kivity <avi@redhat.com>
> > Cc: Chris Metcalf <cmetcalf@tilera.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Cc: Geoff Levand <geoff@infradead.org>
> > Cc: Gilad Ben Yossef <gilad@benyossef.com>
> > Cc: Hakan Akkan <hakanakkan@gmail.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Kevin Hilman <khilman@ti.com>
> > Cc: Max Krasnyansky <maxk@qualcomm.com>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  include/linux/rcupdate.h |    1 +
> >  kernel/rcutree.c         |    3 +--
> >  kernel/time/tick-sched.c |   22 ++++++++++++++++++----
> >  3 files changed, 20 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 81c04f4..e06639e 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -184,6 +184,7 @@ static inline int rcu_preempt_depth(void)
> >  extern void rcu_sched_qs(int cpu);
> >  extern void rcu_bh_qs(int cpu);
> >  extern void rcu_check_callbacks(int cpu, int user);
> > +extern int rcu_pending(int cpu);
> >  struct notifier_block;
> >  extern void rcu_idle_enter(void);
> >  extern void rcu_idle_exit(void);
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index 6c4a672..e141c7e 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -212,7 +212,6 @@ int rcu_cpu_stall_suppress __read_mostly;
> >  module_param(rcu_cpu_stall_suppress, int, 0644);
> > 
> >  static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
> > -static int rcu_pending(int cpu);
> > 
> >  /*
> >   * Return the number of RCU-sched batches processed thus far for debug & stats.
> > @@ -1915,7 +1914,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
> >   * by the current CPU, returning 1 if so.  This function is part of the
> >   * RCU implementation; it is -not- an exported member of the RCU API.
> >   */
> > -static int rcu_pending(int cpu)
> > +int rcu_pending(int cpu)
> >  {
> >  	return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
> >  	       __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 43fa7ac..4f99766 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -506,9 +506,21 @@ void tick_nohz_idle_enter(void)
> >  	local_irq_enable();
> >  }
> > 
> > +#ifdef CONFIG_CPUSETS_NO_HZ
> > +static bool can_stop_adaptive_tick(void)
> > +{
> > +	if (!sched_can_stop_tick())
> > +		return false;
> > +
> > +	/* Is there a grace period to complete ? */
> > +	if (rcu_pending(smp_processor_id()))
> 
> You lost me on this one.  Why can't this be rcu_needs_cpu()?

We already have an rcu_needs_cpu() check in tick_nohz_stop_sched_tick()
that prevents the tick from shutting down if the CPU has local callbacks to handle.

The rcu_pending() check is there in case some other CPU is waiting for the
current one to help complete a grace period, by reporting a quiescent state
for example. This happens because we may stop the tick in the kernel, not only
in userspace. And if we are in the kernel, we still need to be part of the global
state machine.
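As a plain user-space sketch of this two-level decision (the kernel predicates are stubbed out as globals here; only the control flow mirrors the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Stubs standing in for the real kernel predicates. */
static bool sched_can_stop_tick_stub;	/* scheduler says tick can stop */
static bool rcu_pending_stub;		/* a remote grace period needs us */

/* Mirrors can_stop_adaptive_tick(): rcu_needs_cpu() (local callbacks)
 * is already checked in tick_nohz_stop_sched_tick(); rcu_pending()
 * additionally keeps the tick alive when another CPU is waiting on
 * this one, e.g. for a quiescent-state report. */
static bool can_stop_adaptive_tick(void)
{
	if (!sched_can_stop_tick_stub)
		return false;
	if (rcu_pending_stub)		/* a grace period to complete? */
		return false;
	return true;
}
```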

> 
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> >  static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
> >  {
> > -#ifdef CONFIG_CPUSETS_NO_HZ
> >  	int cpu = smp_processor_id();
> > 
> >  	if (!cpuset_adaptive_nohz() || is_idle_task(current))
> > @@ -517,12 +529,14 @@ static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
> >  	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
> >  		return;
> > 
> > -	if (!sched_can_stop_tick())
> > +	if (!can_stop_adaptive_tick())
> >  		return;
> > 
> >  	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
> > -#endif
> >  }
> > +#else
> > +static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts) { }
> > +#endif
> > 
> >  /**
> >   * tick_nohz_irq_exit - update next tick event from interrupt exit
> > @@ -852,7 +866,7 @@ void tick_nohz_check_adaptive(void)
> >  	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
> > 
> >  	if (ts->tick_stopped && !is_idle_task(current)) {
> > -		if (!sched_can_stop_tick())
> > +		if (!can_stop_adaptive_tick())
> >  			tick_nohz_restart_sched_tick();
> >  	}
> >  }
> > -- 
> > 1.7.5.4
> > 
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 16/41] rcu: Restart the tick on non-responding adaptive nohz CPUs
  2012-05-22 17:20   ` Paul E. McKenney
@ 2012-05-23 13:57     ` Frederic Weisbecker
  2012-05-23 15:20       ` Paul E. McKenney
  0 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-23 13:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 22, 2012 at 10:20:50AM -0700, Paul E. McKenney wrote:
> On Tue, May 01, 2012 at 01:54:50AM +0200, Frederic Weisbecker wrote:
> > When a CPU in adaptive nohz mode doesn't respond to complete
> > a grace period, issue it a specific IPI so that it restarts
> > the tick and chases a quiescent state.
> 
> Hello, Frederic,
> 
> I don't understand the need for this patch.  If the CPU is in
> adaptive-tick mode, RCU should see it as being in dyntick-idle mode,
> right?  If so, shouldn't RCU have already recognized the CPU as being
> in an extended quiescent state?
> 
> Or is this a belt-and-suspenders situation?
> 
> 							Thanx, Paul

If the tickless CPU is in userspace, it is in extended quiescent state. But
not if it runs tickless in the kernel. In this case we need to send it an IPI
so that it restarts the tick after checking rcu_pending().
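The distinction can be summed up in a small decision table (a user-space sketch of the rule just described, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Where a nohz CPU may be when a grace period is waiting on it. */
enum cpu_mode {
	CPU_USER_TICKLESS,	/* userspace: already in extended QS */
	CPU_KERNEL_TICKLESS,	/* kernel without tick: must be poked */
	CPU_TICKING,		/* tick running: handled periodically */
};

/* Only a CPU running tickless in the kernel needs the IPI that
 * restarts its tick so it can check rcu_pending(). */
static bool needs_tick_restart_ipi(enum cpu_mode mode)
{
	return mode == CPU_KERNEL_TICKLESS;
}
```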

> 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Alessio Igor Bogani <abogani@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Avi Kivity <avi@redhat.com>
> > Cc: Chris Metcalf <cmetcalf@tilera.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Cc: Geoff Levand <geoff@infradead.org>
> > Cc: Gilad Ben Yossef <gilad@benyossef.com>
> > Cc: Hakan Akkan <hakanakkan@gmail.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Kevin Hilman <khilman@ti.com>
> > Cc: Max Krasnyansky <maxk@qualcomm.com>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  kernel/rcutree.c |   17 +++++++++++++++++
> >  1 files changed, 17 insertions(+), 0 deletions(-)
> > 
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index e141c7e..3fffc26 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -50,6 +50,7 @@
> >  #include <linux/wait.h>
> >  #include <linux/kthread.h>
> >  #include <linux/prefetch.h>
> > +#include <linux/cpuset.h>
> > 
> >  #include "rcutree.h"
> >  #include <trace/events/rcu.h>
> > @@ -302,6 +303,20 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
> > 
> >  #ifdef CONFIG_SMP
> > 
> > +static void cpuset_update_rcu_cpu(int cpu)
> > +{
> > +#ifdef CONFIG_CPUSETS_NO_HZ
> > +	unsigned long flags;
> > +
> > +	local_irq_save(flags);
> > +
> > +	if (cpuset_cpu_adaptive_nohz(cpu))
> > +		smp_cpuset_update_nohz(cpu);
> > +
> > +	local_irq_restore(flags);
> > +#endif
> > +}
> > +
> >  /*
> >   * If the specified CPU is offline, tell the caller that it is in
> >   * a quiescent state.  Otherwise, whack it with a reschedule IPI.
> > @@ -325,6 +340,8 @@ static int rcu_implicit_offline_qs(struct rcu_data *rdp)
> >  		return 1;
> >  	}
> > 
> > +	cpuset_update_rcu_cpu(rdp->cpu);
> > +
> >  	/*
> >  	 * The CPU is online, so send it a reschedule IPI.  This forces
> >  	 * it through the scheduler, and (inefficiently) also handles cases
> > -- 
> > 1.7.5.4
> > 
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 17/41] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  2012-05-22 17:27   ` Paul E. McKenney
  2012-05-22 17:30     ` Paul E. McKenney
@ 2012-05-23 14:00     ` Frederic Weisbecker
  2012-05-23 16:01       ` Paul E. McKenney
  1 sibling, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-23 14:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 22, 2012 at 10:27:14AM -0700, Paul E. McKenney wrote:
> On Tue, May 01, 2012 at 01:54:51AM +0200, Frederic Weisbecker wrote:
> > If we enqueue an rcu callback, we need the CPU tick to stay
> > alive until we take care of it by completing the appropriate
> > grace period.
> > 
> > Thus, when we call_rcu(), send a self IPI that checks rcu_needs_cpu()
> > so that we restore a periodic tick behaviour that can take care of
> > everything.
> 
> Ouch, I hadn't considered RCU callbacks being posted from within an
> extended quiescent state.  I guess I need to make __call_rcu() either
> complain about this or handle it correctly...  It would -usually- be
> harmless, but there is getting to be quite a bit of active machinery
> in the various idle loops, so just assuming that it cannot happen is
> probably getting to be an obsolete assumption.

Maybe first provide some detection to warn in such a case. And if it happens
to warn too much, perhaps you can allow it?

> 
> 							Thanx, Paul
> 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Alessio Igor Bogani <abogani@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Avi Kivity <avi@redhat.com>
> > Cc: Chris Metcalf <cmetcalf@tilera.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Cc: Geoff Levand <geoff@infradead.org>
> > Cc: Gilad Ben Yossef <gilad@benyossef.com>
> > Cc: Hakan Akkan <hakanakkan@gmail.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Kevin Hilman <khilman@ti.com>
> > Cc: Max Krasnyansky <maxk@qualcomm.com>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  kernel/rcutree.c |    7 +++++++
> >  1 files changed, 7 insertions(+), 0 deletions(-)
> > 
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index 3fffc26..b8d300c 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -1749,6 +1749,13 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
> >  	else
> >  		trace_rcu_callback(rsp->name, head, rdp->qlen);
> > 
> > +	/* Restart the timer if needed to handle the callbacks */
> > +	if (cpuset_adaptive_nohz()) {
> > +		/* Make updates on nxtlist visible to self IPI */
> > +		barrier();
> > +		smp_cpuset_update_nohz(smp_processor_id());
> > +	}
> > +
> >  	/* If interrupts were disabled, don't dive into RCU core. */
> >  	if (irqs_disabled_flags(flags)) {
> >  		local_irq_restore(flags);
> > -- 
> > 1.7.5.4
> > 
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 17/41] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  2012-05-22 17:30     ` Paul E. McKenney
@ 2012-05-23 14:03       ` Frederic Weisbecker
  2012-05-23 16:15         ` Paul E. McKenney
  0 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-23 14:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 22, 2012 at 10:30:47AM -0700, Paul E. McKenney wrote:
> On Tue, May 22, 2012 at 10:27:14AM -0700, Paul E. McKenney wrote:
> > On Tue, May 01, 2012 at 01:54:51AM +0200, Frederic Weisbecker wrote:
> > > If we enqueue an rcu callback, we need the CPU tick to stay
> > > alive until we take care of those by completing the appropriate
> > > grace period.
> > > 
> > > Thus, when we call_rcu(), send a self IPI that checks rcu_needs_cpu()
> > > so that we restore a periodic tick behaviour that can take care of
> > > everything.
> > 
> > Ouch, I hadn't considered RCU callbacks being posted from within an
> > extended quiescent state.  I guess I need to make __call_rcu() either
> > complain about this or handle it correctly...  It would -usually- be
> > harmless, but there is getting to be quite a bit of active machinery
> > in the various idle loops, so just assuming that it cannot happen is
> > probably getting to be an obsolete assumption.
> 
> Adaptive ticks does restart the tick upon entering the kernel, correct?

No, it keeps the tick down. The tick is restarted only if it's needed:
when more than one task is on the runqueue, when a posix cpu timer is running,
when another CPU needs the current one to report a quiescent state, etc.
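
The conditions listed here can be folded into a single predicate, as in this sketch. The struct and function names are invented for illustration; the real checks are spread across sched_can_stop_tick() and the nohz code.

```c
#include <assert.h>
#include <stdbool.h>

struct cpu_demo_state {
    int nr_running;                /* tasks on this CPU's runqueue */
    bool posix_cpu_timer_running;  /* a posix cpu timer is armed */
    bool qs_report_needed;         /* another CPU waits on our QS report */
};

static bool can_keep_tick_stopped(const struct cpu_demo_state *cs)
{
    if (cs->nr_running > 1)
        return false;              /* tick needed for preemption */
    if (cs->posix_cpu_timer_running)
        return false;              /* tick needed to elapse the timer */
    if (cs->qs_report_needed)
        return false;              /* tick needed to report a QS */
    return true;
}
```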

> If so, wouldn't the return to userspace cause adaptive tick to automatically
> handle a callback posted from within the kernel?
> 
> (And yes, I still need to handle the possibility of callbacks being posted
> from the idle loop, but that is a different extended quiescent state.)
> 
> 							Thanx, Paul
> 
> > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > Cc: Alessio Igor Bogani <abogani@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Avi Kivity <avi@redhat.com>
> > > Cc: Chris Metcalf <cmetcalf@tilera.com>
> > > Cc: Christoph Lameter <cl@linux.com>
> > > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > > Cc: Geoff Levand <geoff@infradead.org>
> > > Cc: Gilad Ben Yossef <gilad@benyossef.com>
> > > Cc: Hakan Akkan <hakanakkan@gmail.com>
> > > Cc: Ingo Molnar <mingo@kernel.org>
> > > Cc: Kevin Hilman <khilman@ti.com>
> > > Cc: Max Krasnyansky <maxk@qualcomm.com>
> > > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > ---
> > >  kernel/rcutree.c |    7 +++++++
> > >  1 files changed, 7 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > index 3fffc26..b8d300c 100644
> > > --- a/kernel/rcutree.c
> > > +++ b/kernel/rcutree.c
> > > @@ -1749,6 +1749,13 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
> > >  	else
> > >  		trace_rcu_callback(rsp->name, head, rdp->qlen);
> > > 
> > > +	/* Restart the timer if needed to handle the callbacks */
> > > +	if (cpuset_adaptive_nohz()) {
> > > +		/* Make updates on nxtlist visible to self IPI */
> > > +		barrier();
> > > +		smp_cpuset_update_nohz(smp_processor_id());
> > > +	}
> > > +
> > >  	/* If interrupts were disabled, don't dive into RCU core. */
> > >  	if (irqs_disabled_flags(flags)) {
> > >  		local_irq_restore(flags);
> > > -- 
> > > 1.7.5.4
> > > 
> 


* Re: [PATCH 37/41] rcu: New rcu_user_enter() and rcu_user_exit() APIs
  2012-05-22 18:23   ` Paul E. McKenney
@ 2012-05-23 14:22     ` Frederic Weisbecker
  2012-05-23 16:28       ` Paul E. McKenney
  0 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-23 14:22 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 22, 2012 at 11:23:06AM -0700, Paul E. McKenney wrote:
> On Tue, May 01, 2012 at 01:55:11AM +0200, Frederic Weisbecker wrote:
> > These two APIs are provided to help the implementation
> > of an adaptive tickless kernel (cf: nohz cpusets). We need
> > to run into RCU extended quiescent state when we are in
> > userland so that a tickless CPU is not involved in the
> > global RCU state machine and can shutdown its tick safely.
> > 
> > These APIs are called from syscall and exception entry/exit
> > points and can't be called from interrupt.
> > 
> > They are essentially the same as rcu_idle_enter() and
> > rcu_idle_exit() minus the checks that ensure the CPU is
> > running the idle task.
> 
> This looks reasonably sane.  There are a few nits like missing comment
> headers for functions and the need for tracing, but I can handle that
> when I pull it in.  I am happy to do that pretty much any time, but not
> before the API stabilizes.  ;-)
> 
> So let me know when it is ready for -rcu.

Ok. So would you be willing to host this specific part in -rcu? I don't
know whether these APIs would be welcome upstream while they have no upstream
users yet. OTOH it would be easier for me if I didn't need to include these
patches in my endless rebases.

Another solution is to host that in some separate tree. In yours or in -tip.
Ingo seemed to be willing to host this patchset.

What do you think?

I believe I need to rebase against your latest changes though.


* Re: [PATCH 38/41] rcu: New rcu_user_enter_irq() and rcu_user_exit_irq() APIs
  2012-05-22 18:33   ` Paul E. McKenney
@ 2012-05-23 14:31     ` Frederic Weisbecker
  0 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-23 14:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 22, 2012 at 11:33:52AM -0700, Paul E. McKenney wrote:
> On Tue, May 01, 2012 at 01:55:12AM +0200, Frederic Weisbecker wrote:
> > A CPU running in adaptive tickless mode wants to enter into
> > RCU extended quiescent state while running in userspace. This
> > way we can shut down the tick that is usually needed on each
> > CPU for the needs of RCU.
> > 
> > Typically, RCU enters the extended quiescent state when we resume
> > to userspace through a syscall or exception exit, this is done
> > using rcu_user_enter(). Then RCU exit this state by calling
> > rcu_user_exit() from syscall or exception entry.
> > 
> > However there are two other points where we may want to enter
> > or exit this state. Some remote CPU may require a tickless CPU
> > to restart its tick for any reason and send it an IPI for
> > this purpose. As we restart the tick, we don't want to resume
> > from the IPI in RCU extended quiescent state anymore.
> > Similarly we may stop the tick from an interrupt in userspace and
> > we need to be able to enter RCU extended quiescent state when we
> > resume from this interrupt to userspace.
> > 
> > To these ends, we provide two new APIs:
> > 
> > - rcu_user_enter_irq(). This must be called from a non-nesting
> > interrupt between rcu_irq_enter() and rcu_irq_exit().
> > After the irq calls rcu_irq_exit(), we'll run into RCU extended
> > quiescent state.
> > 
> > - rcu_user_exit_irq(). This must be called from a non-nesting
> > interrupt, interrupting an RCU extended quiescent state, and
> > between rcu_irq_enter() and rcu_irq_exit(). After the irq calls
> > rcu_irq_exit(), we'll prevent resuming the RCU extended quiescent
> > state.
> 
> In both cases, the IRQ handler must correspond to an interrupt from
> task/thread/process/whatever level, so that it is illegal to call
> these from an interrupt handler that was invoked from within another
> interrupt.  Right?

Indeed.

> 
> A couple more questions and comments below.
> 
> 							Thanx, Paul
> 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Alessio Igor Bogani <abogani@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Avi Kivity <avi@redhat.com>
> > Cc: Chris Metcalf <cmetcalf@tilera.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Cc: Geoff Levand <geoff@infradead.org>
> > Cc: Gilad Ben Yossef <gilad@benyossef.com>
> > Cc: Hakan Akkan <hakanakkan@gmail.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Kevin Hilman <khilman@ti.com>
> > Cc: Max Krasnyansky <maxk@qualcomm.com>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  include/linux/rcupdate.h |    2 ++
> >  kernel/rcutree.c         |   24 ++++++++++++++++++++++++
> >  2 files changed, 26 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 6539290..3cf1d51 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -194,6 +194,8 @@ extern void rcu_irq_exit(void);
> >  #ifdef CONFIG_CPUSETS_NO_HZ
> >  void rcu_user_enter(void);
> >  void rcu_user_exit(void);
> > +void rcu_user_enter_irq(void);
> > +void rcu_user_exit_irq(void);
> >  #endif
> > 
> >  /*
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index cba1332..2adc5a0 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -429,6 +429,18 @@ void rcu_user_enter(void)
> >  	__rcu_idle_enter();
> >  }
> > 
> > +void rcu_user_enter_irq(void)
> 
> It took me a bit to correctly parse the name, which goes something
> like RCU adaptive-tick user enter while in an IRQ handler.  A header
> comment would help.  (I can supply one when it is time for this to
> go into -rcu.)

Sure. I must confess I haven't focused on comments so far, but this
will need some before getting merged anywhere.


> 
> > +{
> > +	unsigned long flags;
> > +	struct rcu_dynticks *rdtp;
> > +
> > +	local_irq_save(flags);
> > +	rdtp = &__get_cpu_var(rcu_dynticks);
> > +	WARN_ON_ONCE(rdtp->dynticks_nesting == 1);
> > +	rdtp->dynticks_nesting = 1;
> > +	local_irq_restore(flags);
> > +}
> > +
> >  /**
> >   * rcu_irq_exit - inform RCU that current CPU is exiting irq towards idle
> >   *
> > @@ -543,6 +555,18 @@ void rcu_user_exit(void)
> >  	local_irq_restore(flags);
> >  }
> > 
> > +void rcu_user_exit_irq(void)
> > +{
> > +	unsigned long flags;
> > +	struct rcu_dynticks *rdtp;
> > +
> > +	local_irq_save(flags);
> > +	rdtp = &__get_cpu_var(rcu_dynticks);
> > +	WARN_ON_ONCE(rdtp->dynticks_nesting == 0);
> 
> For symmetry, wouldn't this be as follows?
> 
> 	WARN_ON_ONCE(rdtp->dynticks_nesting >= LLONG_MAX / 4);
> 
> In other words, complain if the task is trying to exit RCU-idle state when
> it has already exited from RCU-idle state?

Maybe, yeah. Note this was done before your patch
"rcu: Allow nesting of rcu_idle_enter() and rcu_idle_exit()", so I may need
to rebase and check that my patch is still correct on top of yours.
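
For reference, the bookkeeping under discussion can be modeled in a few lines of userspace C. Only the two values (1 and LLONG_MAX/2 + 1) come from the quoted patch, and the exit-side warning condition is the symmetric one Paul suggests; everything else is illustrative. Each helper returns nonzero when its WARN_ON_ONCE would fire.

```c
#include <assert.h>
#include <limits.h>

static long long dynticks_nesting = (LLONG_MAX / 2) + 1;  /* non-idle */

static int demo_user_enter_irq(void)
{
    int warn = (dynticks_nesting == 1);  /* already armed for QS entry? */
    dynticks_nesting = 1;  /* next rcu_irq_exit() enters extended QS */
    return warn;
}

static int demo_user_exit_irq(void)
{
    /* Symmetric check: complain if we are already out of RCU-idle state */
    int warn = (dynticks_nesting >= LLONG_MAX / 4);
    dynticks_nesting = (LLONG_MAX / 2) + 1;  /* stay out of extended QS */
    return warn;
}
```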

> 
> Of course, it had better not be zero as well.  Or negative, for that
> matter.
> 
> > +	rdtp->dynticks_nesting = (LLONG_MAX / 2) + 1;
> > +	local_irq_restore(flags);
> > +}
> > +
> >  /**
> >   * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
> >   *
> > -- 
> > 1.7.5.4
> > 
> 


* Re: [PATCH 39/41] rcu: Switch to extended quiescent state in userspace from nohz cpuset
  2012-05-22 18:36   ` Paul E. McKenney
  2012-05-22 23:04     ` Paul E. McKenney
@ 2012-05-23 14:33     ` Frederic Weisbecker
  1 sibling, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-23 14:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Tue, May 22, 2012 at 11:36:30AM -0700, Paul E. McKenney wrote:
> On Tue, May 01, 2012 at 01:55:13AM +0200, Frederic Weisbecker wrote:
> > When we switch to adaptive nohz mode and we run in userspace,
> > we can still receive IPIs from the RCU core if a grace period
> > has been started by another CPU because we need to take part
> > of its completion.
> > 
> > However, running in userspace is similar to running in idle
> > because we don't make use of RCU there, thus we can be
> > considered as running in RCU extended quiescent state. The
> > benefit when running into that mode is that we are not
> > anymore disturbed by needless IPIs coming from the RCU core.
> > 
> > To perform this, we just need to use the RCU extended quiescent state
> > APIs on the following points:
> > 
> > - kernel exit or tick stop in userspace: here we switch to extended
> > quiescent state because we run in userspace without the tick.
> > 
> > - kernel entry or tick restart: here we exit the extended quiescent
> > state because either we enter the kernel and we may make use of RCU
> > read side critical section anytime, or we need the timer tick for some
> > reason and that takes care of RCU grace period in a traditional way.
> 
> One FIXME question below.
> 
> 							Thanx, Paul
> 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Alessio Igor Bogani <abogani@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Avi Kivity <avi@redhat.com>
> > Cc: Chris Metcalf <cmetcalf@tilera.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Cc: Geoff Levand <geoff@infradead.org>
> > Cc: Gilad Ben Yossef <gilad@benyossef.com>
> > Cc: Hakan Akkan <hakanakkan@gmail.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Kevin Hilman <khilman@ti.com>
> > Cc: Max Krasnyansky <maxk@qualcomm.com>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  include/linux/tick.h     |    3 +++
> >  kernel/time/tick-sched.c |   27 +++++++++++++++++++++++++--
> >  2 files changed, 28 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/tick.h b/include/linux/tick.h
> > index 3c31d6e..e2a49ad 100644
> > --- a/include/linux/tick.h
> > +++ b/include/linux/tick.h
> > @@ -153,6 +153,8 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
> >  # endif /* !NO_HZ */
> > 
> >  #ifdef CONFIG_CPUSETS_NO_HZ
> > +DECLARE_PER_CPU(int, nohz_task_ext_qs);
> > +
> >  extern void tick_nohz_enter_kernel(void);
> >  extern void tick_nohz_exit_kernel(void);
> >  extern void tick_nohz_enter_exception(struct pt_regs *regs);
> > @@ -160,6 +162,7 @@ extern void tick_nohz_exit_exception(struct pt_regs *regs);
> >  extern void tick_nohz_check_adaptive(void);
> >  extern void tick_nohz_pre_schedule(void);
> >  extern void tick_nohz_post_schedule(void);
> > +extern void tick_nohz_cpu_exit_qs(void);
> >  extern bool tick_nohz_account_tick(void);
> >  extern void tick_nohz_flush_current_times(bool restart_tick);
> >  #else /* !CPUSETS_NO_HZ */
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 8217409..b15ab5e 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -565,10 +565,13 @@ static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
> > 
> >  	if (!was_stopped && ts->tick_stopped) {
> >  		WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_NONE);
> > -		if (user)
> > +		if (user) {
> >  			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
> > -		else if (!current->mm)
> > +			__get_cpu_var(nohz_task_ext_qs) = 1;
> > +			rcu_user_enter_irq();
> > +		} else if (!current->mm) {
> >  			ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
> > +		}
> > 
> >  		ts->saved_jiffies = jiffies;
> >  		set_thread_flag(TIF_NOHZ);
> > @@ -899,6 +902,8 @@ void tick_check_idle(int cpu)
> >  }
> > 
> >  #ifdef CONFIG_CPUSETS_NO_HZ
> > +DEFINE_PER_CPU(int, nohz_task_ext_qs);
> > +
> >  void tick_nohz_exit_kernel(void)
> >  {
> >  	unsigned long flags;
> > @@ -922,6 +927,9 @@ void tick_nohz_exit_kernel(void)
> >  	ts->saved_jiffies = jiffies;
> >  	ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
> > 
> > +	__get_cpu_var(nohz_task_ext_qs) = 1;
> > +	rcu_user_enter();
> > +
> >  	local_irq_restore(flags);
> >  }
> > 
> > @@ -940,6 +948,11 @@ void tick_nohz_enter_kernel(void)
> >  		return;
> >  	}
> > 
> > +	if (__get_cpu_var(nohz_task_ext_qs) == 1) {
> > +		__get_cpu_var(nohz_task_ext_qs) = 0;
> > +		rcu_user_exit();
> > +	}
> > +
> >  	WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_USER);
> > 
> >  	delta_jiffies = jiffies - ts->saved_jiffies;
> > @@ -951,6 +964,14 @@ void tick_nohz_enter_kernel(void)
> >  	local_irq_restore(flags);
> >  }
> > 
> > +void tick_nohz_cpu_exit_qs(void)
> > +{
> > +	if (__get_cpu_var(nohz_task_ext_qs)) {
> > +		rcu_user_exit_irq();
> > +		__get_cpu_var(nohz_task_ext_qs) = 0;
> > +	}
> > +}
> > +
> >  void tick_nohz_enter_exception(struct pt_regs *regs)
> >  {
> >  	if (user_mode(regs))
> > @@ -986,6 +1007,7 @@ static void tick_nohz_restart_adaptive(void)
> >  	tick_nohz_flush_current_times(true);
> >  	tick_nohz_restart_sched_tick();
> >  	clear_thread_flag(TIF_NOHZ);
> > +	tick_nohz_cpu_exit_qs();
> >  }
> > 
> >  void tick_nohz_check_adaptive(void)
> > @@ -1023,6 +1045,7 @@ void tick_nohz_pre_schedule(void)
> >  	if (ts->tick_stopped) {
> >  		tick_nohz_flush_current_times(true);
> >  		clear_thread_flag(TIF_NOHZ);
> > +		/* FIXME: warn if we are in RCU idle mode */
> 
> This would be WARN_ON_ONCE(rcu_is_cpu_idle()) or some such, correct?

Yeah indeed. I'll add that.

> 
> >  	}
> >  }
> > 
> > -- 
> > 1.7.5.4
> > 
> 


* Re: [PATCH 11/41] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-05-23 13:52     ` Frederic Weisbecker
@ 2012-05-23 15:15       ` Paul E. McKenney
  2012-05-23 16:06         ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-23 15:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Wed, May 23, 2012 at 03:52:09PM +0200, Frederic Weisbecker wrote:
> On Tue, May 22, 2012 at 10:16:58AM -0700, Paul E. McKenney wrote:
> > On Tue, May 01, 2012 at 01:54:45AM +0200, Frederic Weisbecker wrote:
> > > If RCU is waiting for the current CPU to complete a grace
> > > period, don't turn off the tick. Unlike dynctik-idle, we
> > > are not necessarily going to enter into rcu extended quiescent
> > > state, so we may need to keep the tick to note current CPU's
> > > quiescent states.
> > > 
> > > [added build fix from Zen Lin]
> > 
> > Hello, Frederic,
> > 
> > One question below -- why not rcu_needs_cpu() instead of rcu_pending()?
> > 
> > 							Thanx, Paul
> > 
> > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > Cc: Alessio Igor Bogani <abogani@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Avi Kivity <avi@redhat.com>
> > > Cc: Chris Metcalf <cmetcalf@tilera.com>
> > > Cc: Christoph Lameter <cl@linux.com>
> > > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > > Cc: Geoff Levand <geoff@infradead.org>
> > > Cc: Gilad Ben Yossef <gilad@benyossef.com>
> > > Cc: Hakan Akkan <hakanakkan@gmail.com>
> > > Cc: Ingo Molnar <mingo@kernel.org>
> > > Cc: Kevin Hilman <khilman@ti.com>
> > > Cc: Max Krasnyansky <maxk@qualcomm.com>
> > > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > ---
> > >  include/linux/rcupdate.h |    1 +
> > >  kernel/rcutree.c         |    3 +--
> > >  kernel/time/tick-sched.c |   22 ++++++++++++++++++----
> > >  3 files changed, 20 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > index 81c04f4..e06639e 100644
> > > --- a/include/linux/rcupdate.h
> > > +++ b/include/linux/rcupdate.h
> > > @@ -184,6 +184,7 @@ static inline int rcu_preempt_depth(void)
> > >  extern void rcu_sched_qs(int cpu);
> > >  extern void rcu_bh_qs(int cpu);
> > >  extern void rcu_check_callbacks(int cpu, int user);
> > > +extern int rcu_pending(int cpu);
> > >  struct notifier_block;
> > >  extern void rcu_idle_enter(void);
> > >  extern void rcu_idle_exit(void);
> > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > index 6c4a672..e141c7e 100644
> > > --- a/kernel/rcutree.c
> > > +++ b/kernel/rcutree.c
> > > @@ -212,7 +212,6 @@ int rcu_cpu_stall_suppress __read_mostly;
> > >  module_param(rcu_cpu_stall_suppress, int, 0644);
> > > 
> > >  static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
> > > -static int rcu_pending(int cpu);
> > > 
> > >  /*
> > >   * Return the number of RCU-sched batches processed thus far for debug & stats.
> > > @@ -1915,7 +1914,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
> > >   * by the current CPU, returning 1 if so.  This function is part of the
> > >   * RCU implementation; it is -not- an exported member of the RCU API.
> > >   */
> > > -static int rcu_pending(int cpu)
> > > +int rcu_pending(int cpu)
> > >  {
> > >  	return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
> > >  	       __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > index 43fa7ac..4f99766 100644
> > > --- a/kernel/time/tick-sched.c
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -506,9 +506,21 @@ void tick_nohz_idle_enter(void)
> > >  	local_irq_enable();
> > >  }
> > > 
> > > +#ifdef CONFIG_CPUSETS_NO_HZ
> > > +static bool can_stop_adaptive_tick(void)
> > > +{
> > > +	if (!sched_can_stop_tick())
> > > +		return false;
> > > +
> > > +	/* Is there a grace period to complete ? */
> > > +	if (rcu_pending(smp_processor_id()))
> > 
> > You lost me on this one.  Why can't this be rcu_needs_cpu()?
> 
We already have an rcu_needs_cpu() check in tick_nohz_stop_sched_tick()
that prevents the tick from being shut down if the CPU has local callbacks to handle.
> 
> The rcu_pending() check is there in case some other CPU is waiting for the
> current one to help completing a grace period, by reporting a quiescent state
> for example. This happens because we may stop the tick in the kernel, not only
> userspace. And if we are in the kernel, we still need to be part of the global
> state machine.

Ah!  But RCU will notice that the CPU is in dyntick-idle mode, and will
therefore take any needed quiescent-state action on that CPU's behalf.
So there should be no need to call rcu_pending() anywhere outside of the
RCU core code.

							Thanx, Paul

> > > +		return false;
> > > +
> > > +	return true;
> > > +}
> > > +
> > >  static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
> > >  {
> > > -#ifdef CONFIG_CPUSETS_NO_HZ
> > >  	int cpu = smp_processor_id();
> > > 
> > >  	if (!cpuset_adaptive_nohz() || is_idle_task(current))
> > > @@ -517,12 +529,14 @@ static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
> > >  	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
> > >  		return;
> > > 
> > > -	if (!sched_can_stop_tick())
> > > +	if (!can_stop_adaptive_tick())
> > >  		return;
> > > 
> > >  	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
> > > -#endif
> > >  }
> > > +#else
> > > +static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts) { }
> > > +#endif
> > > 
> > >  /**
> > >   * tick_nohz_irq_exit - update next tick event from interrupt exit
> > > @@ -852,7 +866,7 @@ void tick_nohz_check_adaptive(void)
> > >  	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
> > > 
> > >  	if (ts->tick_stopped && !is_idle_task(current)) {
> > > -		if (!sched_can_stop_tick())
> > > +		if (!can_stop_adaptive_tick())
> > >  			tick_nohz_restart_sched_tick();
> > >  	}
> > >  }
> > > -- 
> > > 1.7.5.4
> > > 
> > 
> 



* Re: [PATCH 16/41] rcu: Restart the tick on non-responding adaptive nohz CPUs
  2012-05-23 13:57     ` Frederic Weisbecker
@ 2012-05-23 15:20       ` Paul E. McKenney
  2012-05-23 15:57         ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-23 15:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Wed, May 23, 2012 at 03:57:24PM +0200, Frederic Weisbecker wrote:
> On Tue, May 22, 2012 at 10:20:50AM -0700, Paul E. McKenney wrote:
> > On Tue, May 01, 2012 at 01:54:50AM +0200, Frederic Weisbecker wrote:
> > > When a CPU in adaptive nohz mode doesn't respond to complete
> > > a grace period, issue it a specific IPI so that it restarts
> > > the tick and chases a quiescent state.
> > 
> > Hello, Frederic,
> > 
> > I don't understand the need for this patch.  If the CPU is in
> > adaptive-tick mode, RCU should see it as being in dyntick-idle mode,
> > right?  If so, shouldn't RCU have already recognized the CPU as being
> > in an extended quiescent state?
> > 
> > Or is this a belt-and-suspenders situation?
> > 
> > 							Thanx, Paul
> 
> If the tickless CPU is in userspace, it is in extended quiescent state. But
> not if it runs tickless in the kernel. In this case we need to send it an IPI
> so that it restarts the tick after checking rcu_pending().

But if it has registered itself with RCU as idle, for example, by calling
rcu_user_enter(), then RCU will be ignoring that CPU, posting quiescent
states as needed on its behalf.  So I still don't understand the need
for this patch.

							Thanx, Paul

> > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > Cc: Alessio Igor Bogani <abogani@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Avi Kivity <avi@redhat.com>
> > > Cc: Chris Metcalf <cmetcalf@tilera.com>
> > > Cc: Christoph Lameter <cl@linux.com>
> > > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > > Cc: Geoff Levand <geoff@infradead.org>
> > > Cc: Gilad Ben Yossef <gilad@benyossef.com>
> > > Cc: Hakan Akkan <hakanakkan@gmail.com>
> > > Cc: Ingo Molnar <mingo@kernel.org>
> > > Cc: Kevin Hilman <khilman@ti.com>
> > > Cc: Max Krasnyansky <maxk@qualcomm.com>
> > > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > ---
> > >  kernel/rcutree.c |   17 +++++++++++++++++
> > >  1 files changed, 17 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > index e141c7e..3fffc26 100644
> > > --- a/kernel/rcutree.c
> > > +++ b/kernel/rcutree.c
> > > @@ -50,6 +50,7 @@
> > >  #include <linux/wait.h>
> > >  #include <linux/kthread.h>
> > >  #include <linux/prefetch.h>
> > > +#include <linux/cpuset.h>
> > > 
> > >  #include "rcutree.h"
> > >  #include <trace/events/rcu.h>
> > > @@ -302,6 +303,20 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
> > > 
> > >  #ifdef CONFIG_SMP
> > > 
> > > +static void cpuset_update_rcu_cpu(int cpu)
> > > +{
> > > +#ifdef CONFIG_CPUSETS_NO_HZ
> > > +	unsigned long flags;
> > > +
> > > +	local_irq_save(flags);
> > > +
> > > +	if (cpuset_cpu_adaptive_nohz(cpu))
> > > +		smp_cpuset_update_nohz(cpu);
> > > +
> > > +	local_irq_restore(flags);
> > > +#endif
> > > +}
> > > +
> > >  /*
> > >   * If the specified CPU is offline, tell the caller that it is in
> > >   * a quiescent state.  Otherwise, whack it with a reschedule IPI.
> > > @@ -325,6 +340,8 @@ static int rcu_implicit_offline_qs(struct rcu_data *rdp)
> > >  		return 1;
> > >  	}
> > > 
> > > +	cpuset_update_rcu_cpu(rdp->cpu);
> > > +
> > >  	/*
> > >  	 * The CPU is online, so send it a reschedule IPI.  This forces
> > >  	 * it through the scheduler, and (inefficiently) also handles cases
> > > -- 
> > > 1.7.5.4
> > > 
> > 
> 



* Re: [PATCH 16/41] rcu: Restart the tick on non-responding adaptive nohz CPUs
  2012-05-23 15:20       ` Paul E. McKenney
@ 2012-05-23 15:57         ` Frederic Weisbecker
  0 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-23 15:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Wed, May 23, 2012 at 08:20:09AM -0700, Paul E. McKenney wrote:
> On Wed, May 23, 2012 at 03:57:24PM +0200, Frederic Weisbecker wrote:
> > On Tue, May 22, 2012 at 10:20:50AM -0700, Paul E. McKenney wrote:
> > > On Tue, May 01, 2012 at 01:54:50AM +0200, Frederic Weisbecker wrote:
> > > > When a CPU in adaptive nohz mode doesn't respond to complete
> > > > a grace period, issue it a specific IPI so that it restarts
> > > > the tick and chases a quiescent state.
> > > 
> > > Hello, Frederic,
> > > 
> > > I don't understand the need for this patch.  If the CPU is in
> > > adaptive-tick mode, RCU should see it as being in dyntick-idle mode,
> > > right?  If so, shouldn't RCU have already recognized the CPU as being
> > > in an extended quiescent state?
> > > 
> > > Or is this a belt-and-suspenders situation?
> > > 
> > > 							Thanx, Paul
> > 
> > If the tickless CPU is in userspace, it is in extended quiescent state. But
> > not if it runs tickless in the kernel. In this case we need to send it an IPI
> > so that it restarts the tick after checking rcu_pending().
> 
> But if it has registered itself with RCU as idle, for example, by calling
> rcu_user_enter(), then RCU will be ignoring that CPU, posting quiescent
> states as needed on its behalf.  So I still don't understand the need
> for this patch.

Indeed, if we are going to stop the tick and enter the extended quiescent
state, we can skip this check. I can optimize that.

But if we stop the tick while still running in the kernel, where we may
enter RCU read-side critical sections at any time, we need to avoid
stopping the tick while there is a global grace period to complete.


* Re: [PATCH 17/41] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  2012-05-23 14:00     ` Frederic Weisbecker
@ 2012-05-23 16:01       ` Paul E. McKenney
  0 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-23 16:01 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Wed, May 23, 2012 at 04:00:15PM +0200, Frederic Weisbecker wrote:
> On Tue, May 22, 2012 at 10:27:14AM -0700, Paul E. McKenney wrote:
> > On Tue, May 01, 2012 at 01:54:51AM +0200, Frederic Weisbecker wrote:
> > > If we enqueue an rcu callback, we need the CPU tick to stay
> > > alive until we take care of those by completing the appropriate
> > > grace period.
> > > 
> > > Thus, when we call_rcu(), send a self IPI that checks rcu_needs_cpu()
> > > so that we restore a periodic tick behaviour that can take care of
> > > everything.
> > 
> > Ouch, I hadn't considered RCU callbacks being posted from within an
> > extended quiescent state.  I guess I need to make __call_rcu() either
> > complain about this or handle it correctly...  It would -usually- be
> > harmless, but there is getting to be quite a bit of active machinery
> > in the various idle loops, so just assuming that it cannot happen is
> > probably getting to be an obsolete assumption.
> 
> Maybe first provide some detection to warn in such a case. And if it happens
> to warn too often, perhaps you can then allow it?

Heh.  It is just as simple to allow it as it is to warn about it, so
I am just transitioning immediately to allowing it.

							Thanx, Paul

> > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > Cc: Alessio Igor Bogani <abogani@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Avi Kivity <avi@redhat.com>
> > > Cc: Chris Metcalf <cmetcalf@tilera.com>
> > > Cc: Christoph Lameter <cl@linux.com>
> > > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > > Cc: Geoff Levand <geoff@infradead.org>
> > > Cc: Gilad Ben Yossef <gilad@benyossef.com>
> > > Cc: Hakan Akkan <hakanakkan@gmail.com>
> > > Cc: Ingo Molnar <mingo@kernel.org>
> > > Cc: Kevin Hilman <khilman@ti.com>
> > > Cc: Max Krasnyansky <maxk@qualcomm.com>
> > > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > ---
> > >  kernel/rcutree.c |    7 +++++++
> > >  1 files changed, 7 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > index 3fffc26..b8d300c 100644
> > > --- a/kernel/rcutree.c
> > > +++ b/kernel/rcutree.c
> > > @@ -1749,6 +1749,13 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
> > >  	else
> > >  		trace_rcu_callback(rsp->name, head, rdp->qlen);
> > > 
> > > +	/* Restart the timer if needed to handle the callbacks */
> > > +	if (cpuset_adaptive_nohz()) {
> > > +		/* Make updates on nxtlist visible to self IPI */
> > > +		barrier();
> > > +		smp_cpuset_update_nohz(smp_processor_id());
> > > +	}
> > > +
> > >  	/* If interrupts were disabled, don't dive into RCU core. */
> > >  	if (irqs_disabled_flags(flags)) {
> > >  		local_irq_restore(flags);
> > > -- 
> > > 1.7.5.4
> > > 
> > 
> 



* Re: [PATCH 11/41] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-05-23 15:15       ` Paul E. McKenney
@ 2012-05-23 16:06         ` Frederic Weisbecker
  2012-05-23 16:27           ` Paul E. McKenney
  0 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-23 16:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Wed, May 23, 2012 at 08:15:42AM -0700, Paul E. McKenney wrote:
> On Wed, May 23, 2012 at 03:52:09PM +0200, Frederic Weisbecker wrote:
> > > > +#ifdef CONFIG_CPUSETS_NO_HZ
> > > > +static bool can_stop_adaptive_tick(void)
> > > > +{
> > > > +	if (!sched_can_stop_tick())
> > > > +		return false;
> > > > +
> > > > +	/* Is there a grace period to complete ? */
> > > > +	if (rcu_pending(smp_processor_id()))
> > > 
> > > You lost me on this one.  Why can't this be rcu_needs_cpu()?
> > 
> > We already have an rcu_needs_cpu() check in tick_nohz_stop_sched_tick()
> > that prevents the tick from shutting down if the CPU has local callbacks to handle.
> > 
> > The rcu_pending() check is there in case some other CPU is waiting for the
> > current one to help completing a grace period, by reporting a quiescent state
> > for example. This happens because we may stop the tick in the kernel, not only
> > userspace. And if we are in the kernel, we still need to be part of the global
> > state machine.
> 
> Ah!  But RCU will notice that the CPU is in dyntick-idle mode, and will
> therefore take any needed quiescent-state action on that CPU's behalf.
> So there should be no need to call rcu_pending() anywhere outside of the
> RCU core code.


No. If the tick is stopped and we are in the kernel, we may be using RCU
anytime, so we need to be part of the RCU core.


* Re: [PATCH 17/41] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  2012-05-23 14:03       ` Frederic Weisbecker
@ 2012-05-23 16:15         ` Paul E. McKenney
  2012-05-31 15:56           ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-23 16:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Wed, May 23, 2012 at 04:03:36PM +0200, Frederic Weisbecker wrote:
> On Tue, May 22, 2012 at 10:30:47AM -0700, Paul E. McKenney wrote:
> > On Tue, May 22, 2012 at 10:27:14AM -0700, Paul E. McKenney wrote:
> > > On Tue, May 01, 2012 at 01:54:51AM +0200, Frederic Weisbecker wrote:
> > > > If we enqueue an rcu callback, we need the CPU tick to stay
> > > > alive until we take care of those by completing the appropriate
> > > > grace period.
> > > > 
> > > > Thus, when we call_rcu(), send a self IPI that checks rcu_needs_cpu()
> > > > so that we restore a periodic tick behaviour that can take care of
> > > > everything.
> > > 
> > > Ouch, I hadn't considered RCU callbacks being posted from within an
> > > extended quiescent state.  I guess I need to make __call_rcu() either
> > > complain about this or handle it correctly...  It would -usually- be
> > > harmless, but there is getting to be quite a bit of active machinery
> > > in the various idle loops, so just assuming that it cannot happen is
> > > probably getting to be an obsolete assumption.
> > 
> > Adaptive ticks does restart the tick upon entering the kernel, correct?
> 
> No, it keeps the tick down. The tick is restarted only if it's needed:
> when more than one task is on the runqueue, a posix cpu timer is running,
> a CPU needs the current one to report a quiescent state, etc...

Ah, I didn't realize that you didn't restart the tick upon entry to the
kernel.  So this is why you need the IPI -- because there is no tick, if
the system call runs for a long time, RCU is not guaranteed to make any
progress on that CPU.

In the common case, this will not be a problem because system calls
normally spend a short amount of time in the kernel, so normally RCU's
dyntick-idle detection will handle this case.  The exception to this
rule is when there is a long CPU-bound code path in the kernel, where
"long" means many milliseconds.  In this exception case, this CPU needs
to be interrupted or whatever is needed to force the CPU to progress
through RCU.

							Thanx, Paul

> > If so, wouldn't the return to userspace cause adaptive tick to automatically
> > handle a callback posted from within the kernel?
> > 
> > (And yes, I still need to handle the possibility of callbacks being posted
> > from the idle loop, but that is a different extended quiescent state.)
> > 
> > 							Thanx, Paul
> > 
> > > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > > Cc: Alessio Igor Bogani <abogani@kernel.org>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: Avi Kivity <avi@redhat.com>
> > > > Cc: Chris Metcalf <cmetcalf@tilera.com>
> > > > Cc: Christoph Lameter <cl@linux.com>
> > > > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > > > Cc: Geoff Levand <geoff@infradead.org>
> > > > Cc: Gilad Ben Yossef <gilad@benyossef.com>
> > > > Cc: Hakan Akkan <hakanakkan@gmail.com>
> > > > Cc: Ingo Molnar <mingo@kernel.org>
> > > > Cc: Kevin Hilman <khilman@ti.com>
> > > > Cc: Max Krasnyansky <maxk@qualcomm.com>
> > > > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > > Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> > > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > > ---
> > > >  kernel/rcutree.c |    7 +++++++
> > > >  1 files changed, 7 insertions(+), 0 deletions(-)
> > > > 
> > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > > index 3fffc26..b8d300c 100644
> > > > --- a/kernel/rcutree.c
> > > > +++ b/kernel/rcutree.c
> > > > @@ -1749,6 +1749,13 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
> > > >  	else
> > > >  		trace_rcu_callback(rsp->name, head, rdp->qlen);
> > > > 
> > > > +	/* Restart the timer if needed to handle the callbacks */
> > > > +	if (cpuset_adaptive_nohz()) {
> > > > +		/* Make updates on nxtlist visible to self IPI */
> > > > +		barrier();
> > > > +		smp_cpuset_update_nohz(smp_processor_id());
> > > > +	}
> > > > +
> > > >  	/* If interrupts were disabled, don't dive into RCU core. */
> > > >  	if (irqs_disabled_flags(flags)) {
> > > >  		local_irq_restore(flags);
> > > > -- 
> > > > 1.7.5.4
> > > > 
> > 



* Re: [PATCH 11/41] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-05-23 16:06         ` Frederic Weisbecker
@ 2012-05-23 16:27           ` Paul E. McKenney
  2012-05-31 16:01             ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-23 16:27 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Wed, May 23, 2012 at 06:06:33PM +0200, Frederic Weisbecker wrote:
> On Wed, May 23, 2012 at 08:15:42AM -0700, Paul E. McKenney wrote:
> > On Wed, May 23, 2012 at 03:52:09PM +0200, Frederic Weisbecker wrote:
> > > > > +#ifdef CONFIG_CPUSETS_NO_HZ
> > > > > +static bool can_stop_adaptive_tick(void)
> > > > > +{
> > > > > +	if (!sched_can_stop_tick())
> > > > > +		return false;
> > > > > +
> > > > > +	/* Is there a grace period to complete ? */
> > > > > +	if (rcu_pending(smp_processor_id()))
> > > > 
> > > > You lost me on this one.  Why can't this be rcu_needs_cpu()?
> > > 
> > > We already have an rcu_needs_cpu() check in tick_nohz_stop_sched_tick()
> > > that prevents the tick from shutting down if the CPU has local callbacks to handle.
> > > 
> > > The rcu_pending() check is there in case some other CPU is waiting for the
> > > current one to help completing a grace period, by reporting a quiescent state
> > > for example. This happens because we may stop the tick in the kernel, not only
> > > userspace. And if we are in the kernel, we still need to be part of the global
> > > state machine.
> > 
> > Ah!  But RCU will notice that the CPU is in dyntick-idle mode, and will
> > therefore take any needed quiescent-state action on that CPU's behalf.
> > So there should be no need to call rcu_pending() anywhere outside of the
> > RCU core code.
> 
> No. If the tick is stopped and we are in the kernel, we may be using RCU
> anytime, so we need to be part of the RCU core.

OK, so the only problem is if we spend a long time CPU-bound in the kernel,
where "long" is milliseconds or tens of milliseconds.  In that case, the
RCU core will notice that the CPU has not responded but is not idle, for
example, in rcu_implicit_dynticks_qs().  It can take action at this point
to get the offending CPU to pay attention to RCU.

Does this make sense, or am I still missing something?



* Re: [PATCH 37/41] rcu: New rcu_user_enter() and rcu_user_exit() APIs
  2012-05-23 14:22     ` Frederic Weisbecker
@ 2012-05-23 16:28       ` Paul E. McKenney
  0 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-23 16:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Wed, May 23, 2012 at 04:22:17PM +0200, Frederic Weisbecker wrote:
> On Tue, May 22, 2012 at 11:23:06AM -0700, Paul E. McKenney wrote:
> > On Tue, May 01, 2012 at 01:55:11AM +0200, Frederic Weisbecker wrote:
> > > These two APIs are provided to help the implementation
> > > of an adaptive tickless kernel (cf: nohz cpusets). We need
> > > to run into RCU extended quiescent state when we are in
> > > userland so that a tickless CPU is not involved in the
> > > global RCU state machine and can shutdown its tick safely.
> > > 
> > > These APIs are called from syscall and exception entry/exit
> > > points and can't be called from interrupt.
> > > 
> > > They are essentially the same as rcu_idle_enter() and
> > > rcu_idle_exit(), minus the checks that ensure the CPU is
> > > running the idle task.
> > 
> > This looks reasonably sane.  There are a few nits like missing comment
> > headers for functions and the need for tracing, but I can handle that
> > when I pull it in.  I am happy to do that pretty much any time, but not
> > before the API stabilizes.  ;-)
> > 
> > So let me know when it is ready for -rcu.
> 
> Ok. So would you be willing to host this specific part in -rcu? I don't
> know if these APIs are welcome upstream if they have no upstream users
> yet. OTOH it would be easier for me if I don't need to include these patches
> in my endless rebases.
> 
> Another solution is to host that in some separate tree. In yours or in -tip.
> Ingo seemed to be willing to host this patchset.
> 
> What do you think?
> 
> I believe I need to rebase against your latest changes though.

Indeed, there has been a bit of churn in this area from RCU_FAST_NO_HZ.

							Thanx, Paul



* Re: [PATCH 17/41] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  2012-05-23 16:15         ` Paul E. McKenney
@ 2012-05-31 15:56           ` Frederic Weisbecker
  0 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-31 15:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Wed, May 23, 2012 at 09:15:14AM -0700, Paul E. McKenney wrote:
> On Wed, May 23, 2012 at 04:03:36PM +0200, Frederic Weisbecker wrote:
> > On Tue, May 22, 2012 at 10:30:47AM -0700, Paul E. McKenney wrote:
> > > On Tue, May 22, 2012 at 10:27:14AM -0700, Paul E. McKenney wrote:
> > > > On Tue, May 01, 2012 at 01:54:51AM +0200, Frederic Weisbecker wrote:
> > > > > If we enqueue an rcu callback, we need the CPU tick to stay
> > > > > alive until we take care of those by completing the appropriate
> > > > > grace period.
> > > > > 
> > > > > Thus, when we call_rcu(), send a self IPI that checks rcu_needs_cpu()
> > > > > so that we restore a periodic tick behaviour that can take care of
> > > > > everything.
> > > > 
> > > > Ouch, I hadn't considered RCU callbacks being posted from within an
> > > > extended quiescent state.  I guess I need to make __call_rcu() either
> > > > complain about this or handle it correctly...  It would -usually- be
> > > > harmless, but there is getting to be quite a bit of active machinery
> > > > in the various idle loops, so just assuming that it cannot happen is
> > > > probably getting to be an obsolete assumption.
> > > 
> > > Adaptive ticks does restart the tick upon entering the kernel, correct?
> > 
> > No, it keeps the tick down. The tick is restarted only if it's needed:
> > when more than one task is on the runqueue, a posix cpu timer is running,
> > a CPU needs the current one to report a quiescent state, etc...
> 
> Ah, I didn't realize that you didn't restart the tick upon entry to the
> kernel.  So this is why you need the IPI -- because there is no tick, if
> the system call runs for a long time, RCU is not guaranteed to make any
> progress on that CPU.
> 
> In the common case, this will not be a problem because system calls
> normally spend a short amount of time in the kernel, so normally RCU's
> dyntick-idle detection will handle this case.  The exception to this
> rule is when there is a long CPU-bound code path in the kernel, where
> "long" means many milliseconds.  In this exception case, this CPU needs
> to be interrupted or whatever is needed to force the CPU to progress
> through RCU.

Exactly!


* Re: [PATCH 11/41] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-05-23 16:27           ` Paul E. McKenney
@ 2012-05-31 16:01             ` Frederic Weisbecker
  2012-05-31 22:02               ` Paul E. McKenney
  0 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-05-31 16:01 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Wed, May 23, 2012 at 09:27:39AM -0700, Paul E. McKenney wrote:
> On Wed, May 23, 2012 at 06:06:33PM +0200, Frederic Weisbecker wrote:
> > On Wed, May 23, 2012 at 08:15:42AM -0700, Paul E. McKenney wrote:
> > > On Wed, May 23, 2012 at 03:52:09PM +0200, Frederic Weisbecker wrote:
> > > > > > +#ifdef CONFIG_CPUSETS_NO_HZ
> > > > > > +static bool can_stop_adaptive_tick(void)
> > > > > > +{
> > > > > > +	if (!sched_can_stop_tick())
> > > > > > +		return false;
> > > > > > +
> > > > > > +	/* Is there a grace period to complete ? */
> > > > > > +	if (rcu_pending(smp_processor_id()))
> > > > > 
> > > > > You lost me on this one.  Why can't this be rcu_needs_cpu()?
> > > > 
> > > > We already have an rcu_needs_cpu() check in tick_nohz_stop_sched_tick()
> > > > that prevents the tick from shutting down if the CPU has local callbacks to handle.
> > > > 
> > > > The rcu_pending() check is there in case some other CPU is waiting for the
> > > > current one to help completing a grace period, by reporting a quiescent state
> > > > for example. This happens because we may stop the tick in the kernel, not only
> > > > userspace. And if we are in the kernel, we still need to be part of the global
> > > > state machine.
> > > 
> > > Ah!  But RCU will notice that the CPU is in dyntick-idle mode, and will
> > > therefore take any needed quiescent-state action on that CPU's behalf.
> > > So there should be no need to call rcu_pending() anywhere outside of the
> > > RCU core code.
> > 
> > No. If the tick is stopped and we are in the kernel, we may be using RCU
> > anytime, so we need to be part of the RCU core.
> 
> OK, so the only problem is if we spend a long time CPU-bound in the kernel,
> where "long" is milliseconds or tens of milliseconds.  In that case, the
> RCU core will notice that the CPU has not responded but is not idle, for
> example, in rcu_implicit_dynticks_qs().  It can take action at this point
> to get the offending CPU to pay attention to RCU.
> 
> Does this make sense, or am I still missing something?

Yeah, that's exactly the purpose of the rcu_pending() check before shutting
down the tick, and of the IPI that wakes it back up.


* Re: [PATCH 11/41] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-05-31 16:01             ` Frederic Weisbecker
@ 2012-05-31 22:02               ` Paul E. McKenney
  0 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2012-05-31 22:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Hakan Akkan, Ingo Molnar,
	Kevin Hilman, Max Krasnyansky, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner

On Thu, May 31, 2012 at 06:01:21PM +0200, Frederic Weisbecker wrote:
> On Wed, May 23, 2012 at 09:27:39AM -0700, Paul E. McKenney wrote:
> > On Wed, May 23, 2012 at 06:06:33PM +0200, Frederic Weisbecker wrote:
> > > On Wed, May 23, 2012 at 08:15:42AM -0700, Paul E. McKenney wrote:
> > > > On Wed, May 23, 2012 at 03:52:09PM +0200, Frederic Weisbecker wrote:
> > > > > > > +#ifdef CONFIG_CPUSETS_NO_HZ
> > > > > > > +static bool can_stop_adaptive_tick(void)
> > > > > > > +{
> > > > > > > +	if (!sched_can_stop_tick())
> > > > > > > +		return false;
> > > > > > > +
> > > > > > > +	/* Is there a grace period to complete ? */
> > > > > > > +	if (rcu_pending(smp_processor_id()))
> > > > > > 
> > > > > > You lost me on this one.  Why can't this be rcu_needs_cpu()?
> > > > > 
> > > > > We already have an rcu_needs_cpu() check in tick_nohz_stop_sched_tick()
> > > > > that prevents the tick from shutting down if the CPU has local callbacks to handle.
> > > > > 
> > > > > The rcu_pending() check is there in case some other CPU is waiting for the
> > > > > current one to help completing a grace period, by reporting a quiescent state
> > > > > for example. This happens because we may stop the tick in the kernel, not only
> > > > > userspace. And if we are in the kernel, we still need to be part of the global
> > > > > state machine.
> > > > 
> > > > Ah!  But RCU will notice that the CPU is in dyntick-idle mode, and will
> > > > therefore take any needed quiescent-state action on that CPU's behalf.
> > > > So there should be no need to call rcu_pending() anywhere outside of the
> > > > RCU core code.
> > > 
> > > No. If the tick is stopped and we are in the kernel, we may be using RCU
> > > anytime, so we need to be part of the RCU core.
> > 
> > OK, so the only problem is if we spend a long time CPU-bound in the kernel,
> > where "long" is milliseconds or tens of milliseconds.  In that case, the
> > RCU core will notice that the CPU has not responded but is not idle, for
> > example, in rcu_implicit_dynticks_qs().  It can take action at this point
> > to get the offending CPU to pay attention to RCU.
> > 
> > Does this make sense, or am I still missing something?
> 
> Yeah that's exactly the purpose of the rcu_pending() check before shutting down
> the tick and the IPI to wake it up.

Hmmm...  We appear to be talking past each other.

If you use rcu_pending(), you defeat CONFIG_RCU_FAST_NO_HZ and thus fail
to shut off the tick in situations where the application does a system
call involving an RCU update every few tens of milliseconds.  This is not
good.

What we should do instead is to call rcu_needs_cpu() instead of rcu_pending().
In the common case of short system calls, this will allow the tick to be
turned off a higher fraction of the time with no penalty.  In the very
unusual case where a system call runs CPU-bound for tens of milliseconds,
RCU's existing force_quiescent_state() machinery can easily be used to
force the CPU to pay attention to RCU.

Make sense, or am I missing something?

(And yes, the CONFIG_RCU_FAST_NO_HZ heuristics likely need to be adjusted
to better support adaptive ticks -- try less hard to retire callbacks,
for example.)

							Thanx, Paul



end of thread, other threads:[~2012-05-31 22:04 UTC | newest]

Thread overview: 96+ messages (download: mbox.gz / follow: Atom feed)
2012-04-30 23:54 [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 01/41] nohz: Separate idle sleeping time accounting from nohz logic Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 02/41] nohz: Make nohz API agnostic against idle ticks cputime accounting Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 03/41] nohz: Rename ts->idle_tick to ts->last_tick Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 04/41] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
2012-05-07 15:51   ` Christoph Lameter
2012-04-30 23:54 ` [PATCH 05/41] nohz: Move ts->idle_calls incrementation into strict " Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 06/41] nohz: Move next idle expiry time record into idle logic area Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 07/41] cpuset: Set up interface for nohz flag Frederic Weisbecker
2012-05-07 15:55   ` Christoph Lameter
2012-05-08 14:20     ` Frederic Weisbecker
2012-05-08 14:50       ` Peter Zijlstra
2012-05-08 15:18         ` Christoph Lameter
2012-05-08 15:27           ` Peter Zijlstra
2012-05-08 15:38             ` Christoph Lameter
2012-05-08 15:48               ` Peter Zijlstra
2012-05-08 15:57                 ` Christoph Lameter
2012-05-08 16:16                   ` Peter Zijlstra
2012-05-08 16:25                     ` Peter Zijlstra
2012-05-08 19:50                     ` Mike Galbraith
2012-05-08 20:45                       ` Christoph Lameter
2012-05-09  4:21                         ` Mike Galbraith
2012-05-09 11:02                           ` Frederic Weisbecker
2012-05-09 11:07                           ` Frederic Weisbecker
2012-05-09 14:23                             ` Christoph Lameter
2012-05-09 14:22                           ` Christoph Lameter
2012-05-09 14:47                             ` Mike Galbraith
2012-05-09 15:05                               ` Christoph Lameter
2012-05-09 15:33                                 ` Mike Galbraith
2012-05-09 15:40                                   ` Christoph Lameter
2012-05-08 15:16       ` Christoph Lameter
2012-04-30 23:54 ` [PATCH 08/41] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu Frederic Weisbecker
2012-05-07 16:02   ` Christoph Lameter
2012-05-08 17:35     ` Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 09/41] x86: New cpuset nohz irq vector Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 10/41] nohz: Adaptive tick stop and restart on nohz cpuset Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 11/41] nohz/cpuset: Don't turn off the tick if rcu needs it Frederic Weisbecker
2012-05-22 17:16   ` Paul E. McKenney
2012-05-23 13:52     ` Frederic Weisbecker
2012-05-23 15:15       ` Paul E. McKenney
2012-05-23 16:06         ` Frederic Weisbecker
2012-05-23 16:27           ` Paul E. McKenney
2012-05-31 16:01             ` Frederic Weisbecker
2012-05-31 22:02               ` Paul E. McKenney
2012-04-30 23:54 ` [PATCH 12/41] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 13/41] nohz/cpuset: Don't stop the tick if posix cpu timers are running Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 14/41] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 15/41] nohz/cpuset: Restart the tick if printk needs it Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 16/41] rcu: Restart the tick on non-responding adaptive nohz CPUs Frederic Weisbecker
2012-05-22 17:20   ` Paul E. McKenney
2012-05-23 13:57     ` Frederic Weisbecker
2012-05-23 15:20       ` Paul E. McKenney
2012-05-23 15:57         ` Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 17/41] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU Frederic Weisbecker
2012-05-22 17:27   ` Paul E. McKenney
2012-05-22 17:30     ` Paul E. McKenney
2012-05-23 14:03       ` Frederic Weisbecker
2012-05-23 16:15         ` Paul E. McKenney
2012-05-31 15:56           ` Frederic Weisbecker
2012-05-23 14:00     ` Frederic Weisbecker
2012-05-23 16:01       ` Paul E. McKenney
2012-04-30 23:54 ` [PATCH 18/41] nohz: Generalize tickless cpu time accounting Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 19/41] nohz/cpuset: Account user and system times in adaptive nohz mode Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 20/41] nohz/cpuset: New API to flush cputimes on nohz cpusets Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 21/41] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 22/41] nohz/cpuset: Flush cputimes on procfs stat file read Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 23/41] nohz/cpuset: Flush cputimes for getrusage() and times() syscalls Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 24/41] x86: Syscall hooks for nohz cpusets Frederic Weisbecker
2012-04-30 23:54 ` [PATCH 25/41] x86: Exception " Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 26/41] x86: Add adaptive tickless hooks on do_notify_resume() Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 27/41] nohz/cpuset: enable addition&removal of cpus while in adaptive nohz mode Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 28/41] nohz: Don't restart the tick before scheduling to idle Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 29/41] sched: Comment on rq->clock correctness in ttwu_do_wakeup() in nohz Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 30/41] sched: Update rq clock on nohz CPU before migrating tasks Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 31/41] sched: Update rq clock on nohz CPU before setting fair group shares Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 32/41] sched: Update rq clock on tickless CPUs before calling check_preempt_curr() Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 33/41] sched: Update rq clock earlier in unthrottle_cfs_rq Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 34/41] sched: Update clock of nohz busiest rq before balancing Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 35/41] sched: Update rq clock before idle balancing Frederic Weisbecker
2012-05-02  3:36   ` Michael Wang
2012-05-02 10:55     ` Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 36/41] sched: Update nohz rq clock before searching busiest group on load balancing Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 37/41] rcu: New rcu_user_enter() and rcu_user_exit() APIs Frederic Weisbecker
2012-05-22 18:23   ` Paul E. McKenney
2012-05-23 14:22     ` Frederic Weisbecker
2012-05-23 16:28       ` Paul E. McKenney
2012-04-30 23:55 ` [PATCH 38/41] rcu: New rcu_user_enter_irq() and rcu_user_exit_irq() APIs Frederic Weisbecker
2012-05-22 18:33   ` Paul E. McKenney
2012-05-23 14:31     ` Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 39/41] rcu: Switch to extended quiescent state in userspace from nohz cpuset Frederic Weisbecker
2012-05-22 18:36   ` Paul E. McKenney
2012-05-22 23:04     ` Paul E. McKenney
2012-05-23 14:33     ` Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 40/41] nohz: Exit RCU idle mode when we schedule before resuming userspace Frederic Weisbecker
2012-04-30 23:55 ` [PATCH 41/41] nohz/cpuset: Disable under some configs Frederic Weisbecker
2012-05-07 22:10 ` [RFC][PATCH 00/41] Nohz cpusets v3 (adaptive tickless kernel) Geoff Levand
