* [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
@ 2012-03-21 13:58 Frederic Weisbecker
  2012-03-21 13:58 ` Frederic Weisbecker
                   ` (34 more replies)
  0 siblings, 35 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

Hi all,

A summary of what this is about can be found here:
  https://lkml.org/lkml/2011/8/15/245

There are still a lot of things to handle, especially regarding
what is done by scheduler_tick(), but we also need to:

- completely handle cputime accounting (we need to find every "reader"
of cputime and flush cputimes for all of them)
- handle perf
- handle irqtime fine-grained accounting
- handle ilb load balancing
- etc...

Nonetheless, it is time to post a new iteration of the patchset
because the design has changed a bit, some bugs have been fixed,
and there are more simplifications, more unification with the
dynticks-idle code, namespace fixes, and various improvements here
and there.

The git branch can be fetched from:

git://github.com/fweisbec/linux-dynticks.git
	nohz/cpuset-v2

Changelog since v1:

- Rebase against 3.3-rc7 + tip:timers/core branch targeted
for 3.4-rc1

- Refine some changelogs

- Adapt against latest rcu changes: introduce new APIs
  rcu_user_enter(), rcu_user_exit(), rcu_user_enter_irq()
  and rcu_user_exit_irq()

- Handle RCU idle mode with do_notify_resume() path

- Fix deadlock after double rq lock on schedule:
  schedule() -> rq_lock -> next is idle task ->
  tick_nohz_restart_sched_tick() -> wake up softirq ->
  rq lock

- Fix lockup while issuing flush times IPI on exit path:

  CPU 0                          CPU 1

  read_lock(tasklist_lock)
                                 write_lock_irq(tasklist_lock)
  smp_call_function(CPU 1)
  * deadlock *

- Many namespace renames (cpuset_* to tick_nohz_*) and code migration
from sched.c to tick-sched.c

- Separate the code that determines whether we can stop the idle tick,
and don't use it for adaptive tickless mode.

- Fix adaptive tickless mode being set on idle by accident. TIF_NOHZ was
then missing on the following task that ran tickless, leading to some
illegal uses of RCU.

- Restart the tick anytime more than one task is on the runqueue. We
were previously only covering wake ups; now we also handle migration
and any other source of task enqueuing.

- Handle use of RCU in schedule() when called right before resuming userspace
(new schedule_user() API)

- Take the decision to stop the tick from irq exit instead of from the
middle of the timer interrupt. This gives more opportunities to stop it
and is one more step toward unifying idle and adaptive tickless.

- Unify tickless idle and tickless user/system CPU time accounting infrastructures.

- If the tick is stopped adaptively and we are going to schedule the idle
task, don't restart the tick.

- Remove task_nohz_mode per cpu var and use ts->tick_stopped instead. This
leads to more unification between idle tickless and adaptive tickless.
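
To make two of the changelog entries above concrete (restarting the
tick as soon as more than one task is runnable, and deciding to stop it
from irq exit), here is a minimal user-space sketch. It is not kernel
code: struct sketch_rq and both sketch_* functions are hypothetical
names made up for illustration, and the "exactly one runnable task"
condition is an assumption that ignores the other checks the series
performs (RCU, posix cpu timers, printk, ...).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified per-CPU state; not the kernel's struct rq. */
struct sketch_rq {
	unsigned int nr_running;	/* runnable tasks on this CPU */
	bool tick_stopped;		/* true while the tick is stopped */
};

/*
 * Every enqueue source (wake up, migration, ...) is assumed to funnel
 * through this one place, so none of them can be missed.
 */
void sketch_enqueue_task(struct sketch_rq *rq)
{
	assert(rq != NULL);

	rq->nr_running++;

	/* More than one runnable task: preemption needs the tick again. */
	if (rq->nr_running > 1 && rq->tick_stopped)
		rq->tick_stopped = false;
}

/* Evaluated on irq exit rather than in the middle of the timer interrupt. */
void sketch_irq_exit(struct sketch_rq *rq)
{
	assert(rq != NULL);

	/* Assumption: a single runnable task is the only condition checked. */
	if (rq->nr_running == 1 && !rq->tick_stopped)
		rq->tick_stopped = true;
}
```

With one runnable task, sketch_irq_exit() stops the tick; a later
sketch_enqueue_task() restarts it, and further irq exits leave it
running for as long as two tasks stay on the runqueue.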



Frederic Weisbecker (32):
  nohz: Separate idle sleeping time accounting from nohz logic
  nohz: Make nohz API agnostic against idle ticks cputime accounting
  nohz: Rename ts->idle_tick to ts->last_tick
  nohz: Move nohz load balancer selection into idle logic
  nohz: Move ts->idle_calls incrementation into strict idle logic
  nohz: Move next idle expiry time record into idle logic area
  cpuset: Set up interface for nohz flag
  nohz: Try not to give the timekeeping duty to an adaptive tickless
    cpu
  x86: New cpuset nohz irq vector
  nohz: Adaptive tick stop and restart on nohz cpuset
  nohz/cpuset: Don't turn off the tick if rcu needs it
  nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued
  nohz/cpuset: Don't stop the tick if posix cpu timers are running
  nohz/cpuset: Restart tick when nohz flag is cleared on cpuset
  nohz/cpuset: Restart the tick if printk needs it
  rcu: Restart the tick on non-responding adaptive nohz CPUs
  rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  nohz: Generalize tickless cpu time accounting
  nohz/cpuset: Account user and system times in adaptive nohz mode
  nohz/cpuset: New API to flush cputimes on nohz cpusets
  nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting
    leader
  nohz/cpuset: Flush cputimes on procfs stat file read
  nohz/cpuset: Flush cputimes for getrusage() and times() syscalls
  x86: Syscall hooks for nohz cpusets
  x86: Exception hooks for nohz cpusets
  x86: Add adaptive tickless hooks on do_notify_resume()
  nohz: Don't restart the tick before scheduling to idle
  rcu: New rcu_user_enter() and rcu_user_exit() APIs
  rcu: New rcu_user_enter_irq() and rcu_user_exit_irq() APIs
  rcu: Switch to extended quiescent state in userspace from nohz cpuset
  nohz: Exit RCU idle mode when we schedule before resuming userspace
  nohz/cpuset: Disable under some configs

 arch/Kconfig                       |    3 +
 arch/x86/Kconfig                   |    1 +
 arch/x86/include/asm/entry_arch.h  |    3 +
 arch/x86/include/asm/hw_irq.h      |    7 +
 arch/x86/include/asm/irq_vectors.h |    2 +
 arch/x86/include/asm/smp.h         |   11 +
 arch/x86/include/asm/thread_info.h |   10 +-
 arch/x86/kernel/entry_64.S         |   12 +-
 arch/x86/kernel/irqinit.c          |    4 +
 arch/x86/kernel/ptrace.c           |   10 +
 arch/x86/kernel/signal.c           |    3 +
 arch/x86/kernel/smp.c              |   26 ++
 arch/x86/kernel/traps.c            |   20 +-
 arch/x86/mm/fault.c                |   13 +-
 fs/proc/array.c                    |    2 +
 include/linux/cpuset.h             |   29 ++
 include/linux/kernel_stat.h        |    2 +
 include/linux/posix-timers.h       |    1 +
 include/linux/rcupdate.h           |    8 +
 include/linux/sched.h              |   10 +-
 include/linux/tick.h               |   75 ++++--
 init/Kconfig                       |    8 +
 kernel/cpuset.c                    |  107 +++++++
 kernel/exit.c                      |    8 +
 kernel/posix-cpu-timers.c          |   12 +
 kernel/printk.c                    |   15 +-
 kernel/rcutree.c                   |  150 ++++++++--
 kernel/sched/core.c                |   83 ++++++-
 kernel/sched/sched.h               |   23 ++
 kernel/softirq.c                   |    6 +-
 kernel/sys.c                       |    6 +
 kernel/time/tick-sched.c           |  540 +++++++++++++++++++++++++++++-------
 kernel/time/timer_list.c           |    7 +-
 kernel/timer.c                     |    2 +-
 34 files changed, 1042 insertions(+), 177 deletions(-)

-- 
1.7.5.4



* [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-04-04 15:33   ` warning in tick_nohz_irq_exit Stephen Hemminger
  2012-03-21 13:58 ` [PATCH 01/32] nohz: Separate idle sleeping time accounting from nohz logic Frederic Weisbecker
                   ` (33 subsequent siblings)
  34 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

*** BLURB HERE ***


-- 
1.7.5.4



* [PATCH 01/32] nohz: Separate idle sleeping time accounting from nohz logic
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
  2012-03-21 13:58 ` Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 02/32] nohz: Make nohz API agnostic against idle ticks cputime accounting Frederic Weisbecker
                   ` (32 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

As we plan to be able to stop the tick outside the idle task, we
need to prepare for separating nohz logic from idle. As a start,
this pulls the idle sleeping time accounting out of the tick
stop/restart API to the callers on idle entry/exit.
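
The shape of that split can be sketched in user space as follows. This
is only an illustration under stated assumptions: struct split_ts and
the split_* functions are hypothetical stand-ins, and timestamps are
plain integers rather than ktime_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few tick_sched fields involved. */
struct split_ts {
	int idle_active;		/* idle sleep time accounting running */
	int tick_stopped;		/* tick currently stopped */
	uint64_t idle_entrytime;	/* timestamp of the last idle entry */
};

/* Idle-only part: start the idle sleep time accounting. */
uint64_t split_start_idle(struct split_ts *ts, uint64_t clock_now)
{
	assert(ts != NULL);
	ts->idle_entrytime = clock_now;
	ts->idle_active = 1;
	return clock_now;
}

/*
 * Generic part: stops the tick and knows nothing about idle accounting.
 * The timestamp now comes from the caller instead of being computed here.
 */
void split_stop_tick(struct split_ts *ts, uint64_t now)
{
	assert(ts != NULL);
	(void)now;	/* the real code derives the sleep length from it */
	ts->tick_stopped = 1;
}

/* Idle entry: do the accounting first, then call the generic helper. */
void split_idle_enter(struct split_ts *ts, uint64_t clock_now)
{
	uint64_t now = split_start_idle(ts, clock_now);

	split_stop_tick(ts, now);
}
```

A non-idle caller could then use split_stop_tick() alone without
touching the idle statistics, which is the property the patch is after.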

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 kernel/time/tick-sched.c |   78 +++++++++++++++++++++++++--------------------
 1 files changed, 43 insertions(+), 35 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3526038..a1ca479 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -271,10 +271,10 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
-static void tick_nohz_stop_sched_tick(struct tick_sched *ts)
+static void tick_nohz_stop_sched_tick(struct tick_sched *ts, ktime_t now)
 {
 	unsigned long seq, last_jiffies, next_jiffies, delta_jiffies;
-	ktime_t last_update, expires, now;
+	ktime_t last_update, expires;
 	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
 	u64 time_delta;
 	int cpu;
@@ -282,8 +282,6 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts)
 	cpu = smp_processor_id();
 	ts = &per_cpu(tick_cpu_sched, cpu);
 
-	now = tick_nohz_start_idle(cpu, ts);
-
 	/*
 	 * If this cpu is offline and it is the one which updates
 	 * jiffies, then give up the assignment and let it be taken by
@@ -444,6 +442,14 @@ out:
 	ts->sleep_length = ktime_sub(dev->next_event, now);
 }
 
+static void __tick_nohz_idle_enter(struct tick_sched *ts)
+{
+	ktime_t now;
+
+	now = tick_nohz_start_idle(smp_processor_id(), ts);
+	tick_nohz_stop_sched_tick(ts, now);
+}
+
 /**
  * tick_nohz_idle_enter - stop the idle tick from the idle task
  *
@@ -479,7 +485,7 @@ void tick_nohz_idle_enter(void)
 	 * update of the idle time accounting in tick_nohz_start_idle().
 	 */
 	ts->inidle = 1;
-	tick_nohz_stop_sched_tick(ts);
+	__tick_nohz_idle_enter(ts);
 
 	local_irq_enable();
 }
@@ -499,7 +505,7 @@ void tick_nohz_irq_exit(void)
 	if (!ts->inidle)
 		return;
 
-	tick_nohz_stop_sched_tick(ts);
+	__tick_nohz_idle_enter(ts);
 }
 
 /**
@@ -540,39 +546,11 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 	}
 }
 
-/**
- * tick_nohz_idle_exit - restart the idle tick from the idle task
- *
- * Restart the idle tick when the CPU is woken up from idle
- * This also exit the RCU extended quiescent state. The CPU
- * can use RCU again after this function is called.
- */
-void tick_nohz_idle_exit(void)
+static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
-	int cpu = smp_processor_id();
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	unsigned long ticks;
 #endif
-	ktime_t now;
-
-	local_irq_disable();
-
-	WARN_ON_ONCE(!ts->inidle);
-
-	ts->inidle = 0;
-
-	if (ts->idle_active || ts->tick_stopped)
-		now = ktime_get();
-
-	if (ts->idle_active)
-		tick_nohz_stop_idle(cpu, now);
-
-	if (!ts->tick_stopped) {
-		local_irq_enable();
-		return;
-	}
-
 	/* Update jiffies first */
 	select_nohz_load_balancer(0);
 	tick_do_update_jiffies64(now);
@@ -599,6 +577,36 @@ void tick_nohz_idle_exit(void)
 	ts->idle_exittime = now;
 
 	tick_nohz_restart(ts, now);
+}
+
+/**
+ * tick_nohz_idle_exit - restart the idle tick from the idle task
+ *
+ * Restart the idle tick when the CPU is woken up from idle
+ * This also exit the RCU extended quiescent state. The CPU
+ * can use RCU again after this function is called.
+ */
+void tick_nohz_idle_exit(void)
+
+{
+	int cpu = smp_processor_id();
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	ktime_t now;
+
+	local_irq_disable();
+
+	WARN_ON_ONCE(!ts->inidle);
+
+	ts->inidle = 0;
+
+	if (ts->idle_active || ts->tick_stopped)
+		now = ktime_get();
+
+	if (ts->idle_active)
+		tick_nohz_stop_idle(cpu, now);
+
+	if (ts->tick_stopped)
+		tick_nohz_restart_sched_tick(ts, now);
 
 	local_irq_enable();
 }
-- 
1.7.5.4



* [PATCH 02/32] nohz: Make nohz API agnostic against idle ticks cputime accounting
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
  2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 01/32] nohz: Separate idle sleeping time accounting from nohz logic Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 03/32] nohz: Rename ts->idle_tick to ts->last_tick Frederic Weisbecker
                   ` (31 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

When the timer tick fires, it accounts the new jiffy as either part
of system, user or idle time. This is how we record the cputime
statistics.

But when the tick is stopped from the idle task, we still need
to record the number of jiffies spent tickless until we restart
the tick and fall back to traditional tick-based cputime accounting.

To do this, we take a snapshot of jiffies when the tick is stopped
and compute the difference against the new value of jiffies when
the tick is restarted. Then we account this whole difference to
the idle cputime.
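
As a rough user-space illustration of this snapshot-and-difference
scheme (struct acct_ts and the acct_* helpers are hypothetical names,
and the idle_cputime field merely stands in for the real idle cputime
counter):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the few tick_sched fields involved. */
struct acct_ts {
	unsigned long idle_jiffies;	/* jiffies snapshot at tick stop */
	unsigned long idle_cputime;	/* idle time accounted so far, in ticks */
	int tick_stopped;
};

/* The tick has just been stopped from idle: remember the current jiffies. */
void acct_tick_stopped(struct acct_ts *ts, unsigned long jiffies_now)
{
	assert(ts != NULL);
	ts->idle_jiffies = jiffies_now;
	ts->tick_stopped = 1;
}

/* The tick is restarted: account the whole tickless span as idle time. */
unsigned long acct_tick_restarted(struct acct_ts *ts, unsigned long jiffies_now)
{
	unsigned long ticks;

	assert(ts != NULL);

	/* Unsigned subtraction stays correct across a jiffies wraparound. */
	ticks = jiffies_now - ts->idle_jiffies;

	/* Same sanity bound as the diff below: skip absurdly large deltas. */
	if (ticks && ticks < LONG_MAX)
		ts->idle_cputime += ticks;

	ts->tick_stopped = 0;
	return ticks;
}
```

For instance, a snapshot taken at jiffies 1000 followed by a restart at
jiffies 1250 accounts 250 ticks of idle time in one go.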

However, we are preparing to be able to stop the tick from places
other than idle. So this idle time accounting needs to be performed by
the callers of the nohz APIs, not by the nohz APIs themselves, because
we now want them to be agnostic about where the tick is stopped or
restarted.

Therefore, we pull the tickless idle time accounting out of the generic
nohz helpers, up to the idle entry/exit callers.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 kernel/time/tick-sched.c |   37 ++++++++++++++++++++++---------------
 1 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a1ca479..9373f61 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -402,7 +402,6 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts, ktime_t now)
 
 			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
-			ts->idle_jiffies = last_jiffies;
 		}
 
 		ts->idle_sleeps++;
@@ -445,9 +444,13 @@ out:
 static void __tick_nohz_idle_enter(struct tick_sched *ts)
 {
 	ktime_t now;
+	int was_stopped = ts->tick_stopped;
 
 	now = tick_nohz_start_idle(smp_processor_id(), ts);
 	tick_nohz_stop_sched_tick(ts, now);
+
+	if (!was_stopped && ts->tick_stopped)
+		ts->idle_jiffies = ts->last_jiffies;
 }
 
 /**
@@ -548,14 +551,24 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 
 static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
-	unsigned long ticks;
-#endif
 	/* Update jiffies first */
 	select_nohz_load_balancer(0);
 	tick_do_update_jiffies64(now);
 
+	touch_softlockup_watchdog();
+	/*
+	 * Cancel the scheduled timer and restore the tick
+	 */
+	ts->tick_stopped  = 0;
+	ts->idle_exittime = now;
+
+	tick_nohz_restart(ts, now);
+}
+
+static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
+{
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
+	unsigned long ticks;
 	/*
 	 * We stopped the tick in idle. Update process times would miss the
 	 * time we slept as update_process_times does only a 1 tick
@@ -568,15 +581,6 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 	if (ticks && ticks < LONG_MAX)
 		account_idle_ticks(ticks);
 #endif
-
-	touch_softlockup_watchdog();
-	/*
-	 * Cancel the scheduled timer and restore the tick
-	 */
-	ts->tick_stopped  = 0;
-	ts->idle_exittime = now;
-
-	tick_nohz_restart(ts, now);
 }
 
 /**
@@ -605,8 +609,10 @@ void tick_nohz_idle_exit(void)
 	if (ts->idle_active)
 		tick_nohz_stop_idle(cpu, now);
 
-	if (ts->tick_stopped)
+	if (ts->tick_stopped) {
 		tick_nohz_restart_sched_tick(ts, now);
+		tick_nohz_account_idle_ticks(ts);
+	}
 
 	local_irq_enable();
 }
@@ -811,7 +817,8 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 		 */
 		if (ts->tick_stopped) {
 			touch_softlockup_watchdog();
-			ts->idle_jiffies++;
+			if (idle_cpu(cpu))
+				ts->idle_jiffies++;
 		}
 		update_process_times(user_mode(regs));
 		profile_tick(CPU_PROFILING);
-- 
1.7.5.4



* [PATCH 03/32] nohz: Rename ts->idle_tick to ts->last_tick
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 02/32] nohz: Make nohz API agnostic against idle ticks cputime accounting Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 04/32] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
                   ` (30 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

Now that idle and nohz logic are going to be split, ts->idle_tick
becomes a misnomer for the field that saves the last tick before
switching to nohz mode, because we now want to be able to switch to
nohz mode beyond the idle context.

Call it last_tick instead. This slightly changes the timer list
stat export, so we need to bump its version.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 include/linux/tick.h     |    8 ++++----
 kernel/time/tick-sched.c |    4 ++--
 kernel/time/timer_list.c |    4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index ab8be90..f37fceb 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -31,10 +31,10 @@ enum tick_nohz_mode {
  * struct tick_sched - sched tick emulation and no idle tick control/stats
  * @sched_timer:	hrtimer to schedule the periodic tick in high
  *			resolution mode
- * @idle_tick:		Store the last idle tick expiry time when the tick
- *			timer is modified for idle sleeps. This is necessary
+ * @last_tick:		Store the last tick expiry time when the tick
+ *			timer is modified for nohz sleeps. This is necessary
  *			to resume the tick timer operation in the timeline
- *			when the CPU returns from idle
+ *			when the CPU returns from nohz sleep.
  * @tick_stopped:	Indicator that the idle tick has been stopped
  * @idle_jiffies:	jiffies at the entry to idle for idle time accounting
  * @idle_calls:		Total number of idle calls
@@ -51,7 +51,7 @@ struct tick_sched {
 	struct hrtimer			sched_timer;
 	unsigned long			check_clocks;
 	enum tick_nohz_mode		nohz_mode;
-	ktime_t				idle_tick;
+	ktime_t				last_tick;
 	int				inidle;
 	int				tick_stopped;
 	unsigned long			idle_jiffies;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9373f61..fc9f687 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -400,7 +400,7 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts, ktime_t now)
 		if (!ts->tick_stopped) {
 			select_nohz_load_balancer(1);
 
-			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
+			ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
 		}
 
@@ -526,7 +526,7 @@ ktime_t tick_nohz_get_sleep_length(void)
 static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 {
 	hrtimer_cancel(&ts->sched_timer);
-	hrtimer_set_expires(&ts->sched_timer, ts->idle_tick);
+	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
 
 	while (1) {
 		/* Forward the time to expire in the future */
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index 3258455..af5a7e9 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -167,7 +167,7 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
 	{
 		struct tick_sched *ts = tick_get_tick_sched(cpu);
 		P(nohz_mode);
-		P_ns(idle_tick);
+		P_ns(last_tick);
 		P(tick_stopped);
 		P(idle_jiffies);
 		P(idle_calls);
@@ -259,7 +259,7 @@ static int timer_list_show(struct seq_file *m, void *v)
 	u64 now = ktime_to_ns(ktime_get());
 	int cpu;
 
-	SEQ_printf(m, "Timer List Version: v0.6\n");
+	SEQ_printf(m, "Timer List Version: v0.7\n");
 	SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES);
 	SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);
 
-- 
1.7.5.4



* [PATCH 04/32] nohz: Move nohz load balancer selection into idle logic
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 03/32] nohz: Rename ts->idle_tick to ts->last_tick Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 05/32] nohz: Move ts->idle_calls incrementation into strict " Frederic Weisbecker
                   ` (29 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

[ ** BUGGY PATCH: I need to put more thinking into this ** ]

We want the nohz load balancer to be an idle CPU, thus
move that selection to strict dyntick idle logic.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 kernel/time/tick-sched.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fc9f687..b79dea2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -398,8 +398,6 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts, ktime_t now)
 		 * the scheduler tick in nohz_restart_sched_tick.
 		 */
 		if (!ts->tick_stopped) {
-			select_nohz_load_balancer(1);
-
 			ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
 		}
@@ -449,8 +447,10 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts)
 	now = tick_nohz_start_idle(smp_processor_id(), ts);
 	tick_nohz_stop_sched_tick(ts, now);
 
-	if (!was_stopped && ts->tick_stopped)
+	if (!was_stopped && ts->tick_stopped) {
 		ts->idle_jiffies = ts->last_jiffies;
+		select_nohz_load_balancer(1);
+	}
 }
 
 /**
@@ -552,7 +552,6 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
 	/* Update jiffies first */
-	select_nohz_load_balancer(0);
 	tick_do_update_jiffies64(now);
 
 	touch_softlockup_watchdog();
@@ -610,6 +609,7 @@ void tick_nohz_idle_exit(void)
 		tick_nohz_stop_idle(cpu, now);
 
 	if (ts->tick_stopped) {
+		select_nohz_load_balancer(0);
 		tick_nohz_restart_sched_tick(ts, now);
 		tick_nohz_account_idle_ticks(ts);
 	}
-- 
1.7.5.4



* [PATCH 05/32] nohz: Move ts->idle_calls incrementation into strict idle logic
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 04/32] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 06/32] nohz: Move next idle expiry time record into idle logic area Frederic Weisbecker
                   ` (28 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

Since we want to prepare the nohz API to work beyond the idle
case, we need to pull the ts->idle_calls incrementation up to
the callers in idle.

To perform this, we split tick_nohz_stop_sched_tick() in two
parts: a first one that checks if we can really stop the tick for idle,
and another that actually stops it. Then from the callers in idle,
we check if we can stop the tick, only then increment idle_calls,
and finally relay to the nohz API, which no longer cares about these
details.
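
As a rough userspace illustration of the split described above (not
the kernel code itself), the sketch below separates the idle-only
check from the generic stop path. All model_* names and struct
tick_model are hypothetical stand-ins; only the idea of "check,
account, then relay" comes from this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few struct tick_sched fields used here. */
struct tick_model {
	bool nohz_active;		/* models ts->nohz_mode != NOHZ_MODE_INACTIVE */
	bool tick_stopped;		/* models ts->tick_stopped */
	unsigned long idle_calls;	/* models ts->idle_calls */
};

/* Idle-only policy: may the tick be stopped right now? */
static bool model_can_stop_idle_tick(const struct tick_model *ts,
				     bool need_resched, bool softirq_pending)
{
	if (!ts->nohz_active)
		return false;
	if (need_resched || softirq_pending)
		return false;
	return true;
}

/* Generic mechanism: stops the tick and knows nothing about idle stats. */
static void model_stop_sched_tick(struct tick_model *ts)
{
	assert(ts);
	ts->tick_stopped = true;
}

/* Idle caller: check first, account the call, then relay. */
static void model_idle_enter(struct tick_model *ts,
			     bool need_resched, bool softirq_pending)
{
	if (model_can_stop_idle_tick(ts, need_resched, softirq_pending)) {
		ts->idle_calls++;
		model_stop_sched_tick(ts);
	}
}
```

With this shape, a future non-idle caller could reuse the stop path
without touching idle_calls, which is the point of the split.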

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 kernel/time/tick-sched.c |   88 +++++++++++++++++++++++++---------------------
 1 files changed, 48 insertions(+), 40 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b79dea2..12ba932 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -271,47 +271,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
-static void tick_nohz_stop_sched_tick(struct tick_sched *ts, ktime_t now)
+static void tick_nohz_stop_sched_tick(struct tick_sched *ts,
+				      ktime_t now, int cpu)
 {
 	unsigned long seq, last_jiffies, next_jiffies, delta_jiffies;
 	ktime_t last_update, expires;
 	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
 	u64 time_delta;
-	int cpu;
-
-	cpu = smp_processor_id();
-	ts = &per_cpu(tick_cpu_sched, cpu);
-
-	/*
-	 * If this cpu is offline and it is the one which updates
-	 * jiffies, then give up the assignment and let it be taken by
-	 * the cpu which runs the tick timer next. If we don't drop
-	 * this here the jiffies might be stale and do_timer() never
-	 * invoked.
-	 */
-	if (unlikely(!cpu_online(cpu))) {
-		if (cpu == tick_do_timer_cpu)
-			tick_do_timer_cpu = TICK_DO_TIMER_NONE;
-	}
-
-	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
-		return;
-
-	if (need_resched())
-		return;
 
-	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
-		static int ratelimit;
-
-		if (ratelimit < 10) {
-			printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
-			       (unsigned int) local_softirq_pending());
-			ratelimit++;
-		}
-		return;
-	}
 
-	ts->idle_calls++;
 	/* Read jiffies and the time when jiffies were updated last */
 	do {
 		seq = read_seqbegin(&xtime_lock);
@@ -439,17 +407,57 @@ out:
 	ts->sleep_length = ktime_sub(dev->next_event, now);
 }
 
+static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
+{
+	/*
+	 * If this cpu is offline and it is the one which updates
+	 * jiffies, then give up the assignment and let it be taken by
+	 * the cpu which runs the tick timer next. If we don't drop
+	 * this here the jiffies might be stale and do_timer() never
+	 * invoked.
+	 */
+	if (unlikely(!cpu_online(cpu))) {
+		if (cpu == tick_do_timer_cpu)
+			tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+	}
+
+	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
+		return false;
+
+	if (need_resched())
+		return false;
+
+	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
+		static int ratelimit;
+
+		if (ratelimit < 10) {
+			printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
+			       (unsigned int) local_softirq_pending());
+			ratelimit++;
+		}
+		return false;
+	}
+
+	return true;
+}
+
 static void __tick_nohz_idle_enter(struct tick_sched *ts)
 {
 	ktime_t now;
-	int was_stopped = ts->tick_stopped;
+	int cpu = smp_processor_id();
 
-	now = tick_nohz_start_idle(smp_processor_id(), ts);
-	tick_nohz_stop_sched_tick(ts, now);
+	now = tick_nohz_start_idle(cpu, ts);
 
-	if (!was_stopped && ts->tick_stopped) {
-		ts->idle_jiffies = ts->last_jiffies;
-		select_nohz_load_balancer(1);
+	if (can_stop_idle_tick(cpu, ts)) {
+		int was_stopped = ts->tick_stopped;
+
+		ts->idle_calls++;
+		tick_nohz_stop_sched_tick(ts, now, cpu);
+
+		if (!was_stopped && ts->tick_stopped) {
+			ts->idle_jiffies = ts->last_jiffies;
+			select_nohz_load_balancer(1);
+		}
 	}
 }
 
-- 
1.7.5.4



* [PATCH 06/32] nohz: Move next idle expiry time record into idle logic area
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 05/32] nohz: Move ts->idle_calls incrementation into strict " Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 07/32] cpuset: Set up interface for nohz flag Frederic Weisbecker
                   ` (27 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

The next idle expiry time record and idle sleeps tracking are
statistics that only concern idle.

Since we want the nohz APIs to become usable beyond idle
context, let's pull up the handling of these statistics to the
callers in idle.
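
A minimal userspace sketch of that idea, assuming a simplified model
where times are plain nanosecond integers: the stop function only
reports the expiry it programmed (0 meaning "nothing reprogrammed"),
and the idle caller decides what to record. The model_* names and
struct idle_stats_model are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the idle statistics kept in struct tick_sched. */
struct idle_stats_model {
	unsigned long idle_sleeps;	/* models ts->idle_sleeps */
	int64_t idle_expires;		/* models ts->idle_expires */
};

/*
 * Generic stop path: returns the expiry it programmed, or 0 if the
 * device was already programmed for that expiry and nothing changed.
 */
static int64_t model_stop_sched_tick(int64_t programmed, int64_t wanted)
{
	assert(wanted > 0);
	if (programmed == wanted)
		return 0;
	return wanted;
}

/* Idle caller: only this level knows about the idle statistics. */
static void model_idle_enter(struct idle_stats_model *st,
			     int64_t programmed, int64_t wanted)
{
	int64_t expires = model_stop_sched_tick(programmed, wanted);

	if (expires > 0) {
		st->idle_sleeps++;
		st->idle_expires = expires;
	}
}
```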

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 kernel/time/tick-sched.c |   24 ++++++++++++++----------
 1 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 12ba932..0695e9d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -271,11 +271,11 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
-static void tick_nohz_stop_sched_tick(struct tick_sched *ts,
-				      ktime_t now, int cpu)
+static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
+					 ktime_t now, int cpu)
 {
 	unsigned long seq, last_jiffies, next_jiffies, delta_jiffies;
-	ktime_t last_update, expires;
+	ktime_t last_update, expires, ret = { .tv64 = 0 };
 	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
 	u64 time_delta;
 
@@ -358,6 +358,8 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts,
 		if (ts->tick_stopped && ktime_equal(expires, dev->next_event))
 			goto out;
 
+		ret = expires;
+
 		/*
 		 * nohz_stop_sched_tick can be called several times before
 		 * the nohz_restart_sched_tick is called. This happens when
@@ -370,11 +372,6 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts,
 			ts->tick_stopped = 1;
 		}
 
-		ts->idle_sleeps++;
-
-		/* Mark expires */
-		ts->idle_expires = expires;
-
 		/*
 		 * If the expiration time == KTIME_MAX, then
 		 * in this case we simply stop the tick timer.
@@ -405,6 +402,8 @@ out:
 	ts->next_jiffies = next_jiffies;
 	ts->last_jiffies = last_jiffies;
 	ts->sleep_length = ktime_sub(dev->next_event, now);
+
+	return ret;
 }
 
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
@@ -443,7 +442,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 
 static void __tick_nohz_idle_enter(struct tick_sched *ts)
 {
-	ktime_t now;
+	ktime_t now, expires;
 	int cpu = smp_processor_id();
 
 	now = tick_nohz_start_idle(cpu, ts);
@@ -452,7 +451,12 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts)
 		int was_stopped = ts->tick_stopped;
 
 		ts->idle_calls++;
-		tick_nohz_stop_sched_tick(ts, now, cpu);
+
+		expires = tick_nohz_stop_sched_tick(ts, now, cpu);
+		if (expires.tv64 > 0LL) {
+			ts->idle_sleeps++;
+			ts->idle_expires = expires;
+		}
 
 		if (!was_stopped && ts->tick_stopped) {
 			ts->idle_jiffies = ts->last_jiffies;
-- 
1.7.5.4



* [PATCH 07/32] cpuset: Set up interface for nohz flag
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 06/32] nohz: Move next idle expiry time record into idle logic area Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 14:50   ` Christoph Lameter
  2012-03-21 13:58 ` [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu Frederic Weisbecker
                   ` (26 subsequent siblings)
  34 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

Prepare the interface to implement the nohz cpuset flag.
This flag, once set, will tell the system to try to
shut down the periodic timer tick when possible.

We use a per-CPU refcounter here. As long as a CPU
is contained in at least one cpuset that has the
nohz flag set, it is part of the set of CPUs that
run in adaptive nohz mode.
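
To make the refcounting rule concrete, here is a small userspace
model, under the assumption of a fixed 4-CPU system and a plain
boolean array in place of a cpumask. struct cpuset_model and the
model_* helpers are hypothetical; the counter is named after the
per-CPU variable introduced by this patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4	/* assumption: small fixed CPU count */

/* One counter per CPU: number of nohz cpusets containing that CPU. */
static int cpu_adaptive_nohz_ref[MODEL_NR_CPUS];

/* Hypothetical, simplified cpuset: a CPU membership array plus the flag. */
struct cpuset_model {
	bool cpus_allowed[MODEL_NR_CPUS];
	bool adaptive_nohz;
};

/* Flip the flag and adjust the refcount of every CPU in the set. */
static void model_set_adaptive_nohz(struct cpuset_model *cs, bool enable)
{
	int cpu;

	if (cs->adaptive_nohz == enable)
		return;		/* no change, so no refcount update */

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (!cs->cpus_allowed[cpu])
			continue;
		cpu_adaptive_nohz_ref[cpu] += enable ? 1 : -1;
		assert(cpu_adaptive_nohz_ref[cpu] >= 0);
	}
	cs->adaptive_nohz = enable;
}

/* A CPU is adaptive nohz while at least one nohz cpuset contains it. */
static bool model_cpu_adaptive_nohz(int cpu)
{
	return cpu_adaptive_nohz_ref[cpu] > 0;
}
```

With two overlapping sets, clearing the flag on one of them leaves the
shared CPUs in adaptive nohz mode as long as the other set still has
the flag.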

[ include build fix from Zen Lin ]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 arch/Kconfig           |    3 ++
 include/linux/cpuset.h |   25 +++++++++++++++++++++++
 init/Kconfig           |    8 +++++++
 kernel/cpuset.c        |   52 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 0 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4f55c73..a0710f6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -177,6 +177,9 @@ config HAVE_ARCH_JUMP_LABEL
 	bool
 
 config HAVE_ARCH_MUTEX_CPU_RELAX
+       bool
+
+config HAVE_CPUSETS_NO_HZ
 	bool
 
 config HAVE_RCU_TABLE_FREE
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e9eaec5..5510708 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -244,4 +244,29 @@ static inline void put_mems_allowed(void)
 
 #endif /* !CONFIG_CPUSETS */
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+
+DECLARE_PER_CPU(int, cpu_adaptive_nohz_ref);
+
+static inline bool cpuset_cpu_adaptive_nohz(int cpu)
+{
+	if (per_cpu(cpu_adaptive_nohz_ref, cpu) > 0)
+		return true;
+
+	return false;
+}
+
+static inline bool cpuset_adaptive_nohz(void)
+{
+	if (__get_cpu_var(cpu_adaptive_nohz_ref) > 0)
+		return true;
+
+	return false;
+}
+#else
+static inline bool cpuset_cpu_adaptive_nohz(int cpu) { return false; }
+static inline bool cpuset_adaptive_nohz(void) { return false; }
+
+#endif /* CONFIG_CPUSETS_NO_HZ */
+
 #endif /* _LINUX_CPUSET_H */
diff --git a/init/Kconfig b/init/Kconfig
index 3f42cd6..43f7687 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -638,6 +638,14 @@ config PROC_PID_CPUSET
 	depends on CPUSETS
 	default y
 
+config CPUSETS_NO_HZ
+       bool "Tickless cpusets"
+       depends on CPUSETS && HAVE_CPUSETS_NO_HZ
+       help
+         This options let you apply a nohz property to a cpuset such
+	 that the periodic timer tick tries to be avoided when possible on
+	 the concerned CPUs.
+
 config CGROUP_CPUACCT
 	bool "Simple CPU accounting cgroup subsystem"
 	help
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index a09ac2b..5a28cf8 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -145,6 +145,7 @@ typedef enum {
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_ADAPTIVE_NOHZ,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -183,6 +184,11 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_adaptive_nohz(const struct cpuset *cs)
+{
+	return test_bit(CS_ADAPTIVE_NOHZ, &cs->flags);
+}
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 };
@@ -1211,6 +1217,31 @@ static void cpuset_change_flag(struct task_struct *tsk,
 	cpuset_update_task_spread_flag(cgroup_cs(scan->cg), tsk);
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+
+DEFINE_PER_CPU(int, cpu_adaptive_nohz_ref);
+
+static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
+{
+	int cpu;
+	int val;
+
+	if (is_adaptive_nohz(old_cs) == is_adaptive_nohz(cs))
+		return;
+
+	for_each_cpu(cpu, cs->cpus_allowed) {
+		if (is_adaptive_nohz(cs))
+			per_cpu(cpu_adaptive_nohz_ref, cpu) += 1;
+		else
+			per_cpu(cpu_adaptive_nohz_ref, cpu) -= 1;
+	}
+}
+#else
+static inline void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
+{
+}
+#endif
+
 /*
  * update_tasks_flags - update the spread flags of tasks in the cpuset.
  * @cs: the cpuset in which each task's spread flags needs to be changed
@@ -1276,6 +1307,8 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
+	update_nohz_cpus(cs, trialcs);
+
 	mutex_lock(&callback_mutex);
 	cs->flags = trialcs->flags;
 	mutex_unlock(&callback_mutex);
@@ -1488,6 +1521,7 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_ADAPTIVE_NOHZ,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1527,6 +1561,11 @@ static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
 	case FILE_SPREAD_SLAB:
 		retval = update_flag(CS_SPREAD_SLAB, cs, val);
 		break;
+#ifdef CONFIG_CPUSETS_NO_HZ
+	case FILE_ADAPTIVE_NOHZ:
+		retval = update_flag(CS_ADAPTIVE_NOHZ, cs, val);
+		break;
+#endif
 	default:
 		retval = -EINVAL;
 		break;
@@ -1686,6 +1725,10 @@ static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)
 		return is_spread_page(cs);
 	case FILE_SPREAD_SLAB:
 		return is_spread_slab(cs);
+#ifdef CONFIG_CPUSETS_NO_HZ
+	case FILE_ADAPTIVE_NOHZ:
+		return is_adaptive_nohz(cs);
+#endif
 	default:
 		BUG();
 	}
@@ -1794,6 +1837,15 @@ static struct cftype files[] = {
 		.write_u64 = cpuset_write_u64,
 		.private = FILE_SPREAD_SLAB,
 	},
+
+#ifdef CONFIG_CPUSETS_NO_HZ
+	{
+		.name = "adaptive_nohz",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_ADAPTIVE_NOHZ,
+	},
+#endif
 };
 
 static struct cftype cft_memory_pressure_enabled = {
-- 
1.7.5.4



* [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 07/32] cpuset: Set up interface for nohz flag Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 14:52   ` Christoph Lameter
  2012-03-21 13:58 ` [PATCH 09/32] x86: New cpuset nohz irq vector Frederic Weisbecker
                   ` (25 subsequent siblings)
  34 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

Try to give the timekeeping duty to a CPU that doesn't belong
to any nohz cpuset when possible, so that we increase the chance
for these nohz cpusets to run their CPUs out of periodic tick
mode.

[TODO: We need to find a way to ensure there is always one non-nohz
running CPU maintaining the timekeeping duty if all non-idle CPUs are
adaptive tickless]
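
The handover rule can be modelled in userspace roughly as follows.
This is a sketch under simplifying assumptions: a fixed 4-CPU system,
a boolean array instead of the cpuset refcount, and the calling CPU
passed explicitly. MODEL_NR_CPUS, cpu_is_adaptive_nohz[] and
model_check_handler() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS		4	/* assumption: small fixed CPU count */
#define TICK_DO_TIMER_NONE	(-1)

/* CPU currently in charge of timekeeping, or TICK_DO_TIMER_NONE. */
static int tick_do_timer_cpu = TICK_DO_TIMER_NONE;

/* Hypothetical stand-in for the per-CPU nohz cpuset refcount test. */
static bool cpu_is_adaptive_nohz[MODEL_NR_CPUS];

/*
 * Called from the tick of @cpu: take the duty if nobody has it, or if
 * the current handler is in a nohz cpuset while @cpu is not.
 */
static void model_check_handler(int cpu)
{
	int handler = tick_do_timer_cpu;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	if (handler == TICK_DO_TIMER_NONE) {
		tick_do_timer_cpu = cpu;
	} else if (!cpu_is_adaptive_nohz[cpu] &&
		   cpu_is_adaptive_nohz[handler]) {
		tick_do_timer_cpu = cpu;
	}
}
```

Note that the duty only moves when a non-nohz CPU actually takes a
tick, which is why the TODO above still stands.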

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 kernel/time/tick-sched.c |   52 ++++++++++++++++++++++++++++++++++++---------
 1 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 0695e9d..f1142d5 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include <linux/profile.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/cpuset.h>
 
 #include <asm/irq_regs.h>
 
@@ -782,6 +783,45 @@ void tick_check_idle(int cpu)
 	tick_check_nohz(cpu);
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+
+/*
+ * Take the timer duty if nobody is taking care of it.
+ * If a CPU already does and it's in a nohz cpuset,
+ * then take the charge so that it can switch to nohz mode.
+ */
+static void tick_do_timer_check_handler(int cpu)
+{
+	int handler = tick_do_timer_cpu;
+
+	if (unlikely(handler == TICK_DO_TIMER_NONE)) {
+		tick_do_timer_cpu = cpu;
+	} else {
+		if (!cpuset_adaptive_nohz() &&
+		    cpuset_cpu_adaptive_nohz(handler))
+			tick_do_timer_cpu = cpu;
+	}
+}
+
+#else
+
+static void tick_do_timer_check_handler(int cpu)
+{
+#ifdef CONFIG_NO_HZ
+	/*
+	 * Check if the do_timer duty was dropped. We don't care about
+	 * concurrency: This happens only when the cpu in charge went
+	 * into a long sleep. If two cpus happen to assign themself to
+	 * this duty, then the jiffies update is still serialized by
+	 * xtime_lock.
+	 */
+	if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
+		tick_do_timer_cpu = cpu;
+#endif
+}
+
+#endif /* CONFIG_CPUSETS_NO_HZ */
+
 /*
  * High resolution timer specific code
  */
@@ -798,17 +838,7 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 	ktime_t now = ktime_get();
 	int cpu = smp_processor_id();
 
-#ifdef CONFIG_NO_HZ
-	/*
-	 * Check if the do_timer duty was dropped. We don't care about
-	 * concurrency: This happens only when the cpu in charge went
-	 * into a long sleep. If two cpus happen to assign themself to
-	 * this duty, then the jiffies update is still serialized by
-	 * xtime_lock.
-	 */
-	if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
-		tick_do_timer_cpu = cpu;
-#endif
+	tick_do_timer_check_handler(cpu);
 
 	/* Check, if the jiffies need an update */
 	if (tick_do_timer_cpu == cpu)
-- 
1.7.5.4



* [PATCH 09/32] x86: New cpuset nohz irq vector
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (8 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 10/32] nohz: Adaptive tick stop and restart on nohz cpuset Frederic Weisbecker
                   ` (24 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

We need a way to send an IPI (remote or local) in order to
asynchronously restart the tick for CPUs in nohz adaptive mode.

This must be asynchronous such that we can trigger it with irqs
disabled. This must be usable as a self-IPI as well, for example
in cases where restarting the tick inline would otherwise expose
us to random deadlock scenarios.

This only settles the x86 backend. The core tick restart function
will be defined in a later patch.

[CHECKME: Perhaps we instead need to use irq work for self IPIs.
But we also need a way to send async remote IPIs.]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 arch/x86/include/asm/entry_arch.h  |    3 +++
 arch/x86/include/asm/hw_irq.h      |    7 +++++++
 arch/x86/include/asm/irq_vectors.h |    2 ++
 arch/x86/include/asm/smp.h         |   11 +++++++++++
 arch/x86/kernel/entry_64.S         |    4 ++++
 arch/x86/kernel/irqinit.c          |    4 ++++
 arch/x86/kernel/smp.c              |   24 ++++++++++++++++++++++++
 7 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index 0baa628..f71872d 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -10,6 +10,9 @@
  * through the ICC by us (IPIs)
  */
 #ifdef CONFIG_SMP
+#ifdef CONFIG_CPUSETS_NO_HZ
+BUILD_INTERRUPT(cpuset_update_nohz_interrupt,CPUSET_UPDATE_NOHZ_VECTOR)
+#endif
 BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
 BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
 BUILD_INTERRUPT(call_function_single_interrupt,CALL_FUNCTION_SINGLE_VECTOR)
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index eb92a6e..0d26ed7 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -35,6 +35,10 @@ extern void spurious_interrupt(void);
 extern void thermal_interrupt(void);
 extern void reschedule_interrupt(void);
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+extern void cpuset_update_nohz_interrupt(void);
+#endif
+
 extern void invalidate_interrupt(void);
 extern void invalidate_interrupt0(void);
 extern void invalidate_interrupt1(void);
@@ -152,6 +156,9 @@ extern asmlinkage void smp_irq_move_cleanup_interrupt(void);
 #endif
 #ifdef CONFIG_SMP
 extern void smp_reschedule_interrupt(struct pt_regs *);
+#ifdef CONFIG_CPUSETS_NO_HZ
+extern void smp_cpuset_update_nohz_interrupt(struct pt_regs *);
+#endif
 extern void smp_call_function_interrupt(struct pt_regs *);
 extern void smp_call_function_single_interrupt(struct pt_regs *);
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 4b44487..11bc691 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -112,6 +112,8 @@
 /* Xen vector callback to receive events in a HVM domain */
 #define XEN_HVM_EVTCHN_CALLBACK		0xf3
 
+#define CPUSET_UPDATE_NOHZ_VECTOR	0xf2
+
 /*
  * Local APIC timer IRQ vector is on a different priority level,
  * to work around the 'lost local interrupt if more than 2 IRQ
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 0434c40..475c26b 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -70,6 +70,10 @@ struct smp_ops {
 	void (*stop_other_cpus)(int wait);
 	void (*smp_send_reschedule)(int cpu);
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+	void (*smp_cpuset_update_nohz)(int cpu);
+#endif
+
 	int (*cpu_up)(unsigned cpu);
 	int (*cpu_disable)(void);
 	void (*cpu_die)(unsigned int cpu);
@@ -138,6 +142,13 @@ static inline void smp_send_reschedule(int cpu)
 	smp_ops.smp_send_reschedule(cpu);
 }
 
+static inline void smp_cpuset_update_nohz(int cpu)
+{
+#ifdef CONFIG_CPUSETS_NO_HZ
+	smp_ops.smp_cpuset_update_nohz(cpu);
+#endif
+}
+
 static inline void arch_send_call_function_single_ipi(int cpu)
 {
 	smp_ops.send_call_func_single_ipi(cpu);
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1333d98..54f269c 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1002,6 +1002,10 @@ apicinterrupt CALL_FUNCTION_VECTOR \
 	call_function_interrupt smp_call_function_interrupt
 apicinterrupt RESCHEDULE_VECTOR \
 	reschedule_interrupt smp_reschedule_interrupt
+#ifdef CONFIG_CPUSETS_NO_HZ
+apicinterrupt CPUSET_UPDATE_NOHZ_VECTOR \
+	cpuset_update_nohz_interrupt smp_cpuset_update_nohz_interrupt
+#endif
 #endif
 
 apicinterrupt ERROR_APIC_VECTOR \
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index 313fb5c..2220f3c 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -172,6 +172,10 @@ static void __init smp_intr_init(void)
 	 */
 	alloc_intr_gate(RESCHEDULE_VECTOR, reschedule_interrupt);
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+	alloc_intr_gate(CPUSET_UPDATE_NOHZ_VECTOR, cpuset_update_nohz_interrupt);
+#endif
+
 	/* IPIs for invalidation */
 #define ALLOC_INVTLB_VEC(NR) \
 	alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+NR, \
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 66c74f4..94615a3 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -123,6 +123,17 @@ static void native_smp_send_reschedule(int cpu)
 	apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+static void native_smp_cpuset_update_nohz(int cpu)
+{
+	if (unlikely(cpu_is_offline(cpu))) {
+		WARN_ON(1);
+		return;
+	}
+	apic->send_IPI_mask(cpumask_of(cpu), CPUSET_UPDATE_NOHZ_VECTOR);
+}
+#endif
+
 void native_send_call_func_single_ipi(int cpu)
 {
 	apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
@@ -267,6 +278,16 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
 	 */
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+void smp_cpuset_update_nohz_interrupt(struct pt_regs *regs)
+{
+	ack_APIC_irq();
+	irq_enter();
+	inc_irq_stat(irq_call_count);
+	irq_exit();
+}
+#endif
+
 void smp_call_function_interrupt(struct pt_regs *regs)
 {
 	ack_APIC_irq();
@@ -300,6 +321,9 @@ struct smp_ops smp_ops = {
 
 	.stop_other_cpus	= native_nmi_stop_other_cpus,
 	.smp_send_reschedule	= native_smp_send_reschedule,
+#ifdef CONFIG_CPUSETS_NO_HZ
+	.smp_cpuset_update_nohz = native_smp_cpuset_update_nohz,
+#endif
 
 	.cpu_up			= native_cpu_up,
 	.cpu_die		= native_cpu_die,
-- 
1.7.5.4



* [PATCH 10/32] nohz: Adaptive tick stop and restart on nohz cpuset
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (9 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 09/32] x86: New cpuset nohz irq vector Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it Frederic Weisbecker
                   ` (23 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

When a CPU is included in a nohz cpuset, try to switch
it to nohz mode from the interrupt exit path if it is running
a single non-idle task.

Then restart the tick, if necessary, when a second task is
enqueued while the timer is stopped, so that the scheduler
tick is rearmed.
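
The intended behaviour can be pictured with the following userspace
model. It is only a sketch: struct cpu_model and the model_* helpers
are hypothetical, and the restart is done inline here for simplicity,
whereas the series restarts the tick through the nohz IPI.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state, reduced to what the rule needs. */
struct cpu_model {
	bool adaptive_nohz;		/* CPU belongs to a nohz cpuset */
	unsigned int nr_running;	/* runnable non-idle tasks */
	bool tick_stopped;
};

/* The tick may go away only with exactly one runnable non-idle task. */
static bool model_can_stop_tick(const struct cpu_model *c)
{
	return c->adaptive_nohz && c->nr_running == 1;
}

/* Models the check done on the interrupt exit path. */
static void model_irq_exit(struct cpu_model *c)
{
	if (!c->tick_stopped && model_can_stop_tick(c))
		c->tick_stopped = true;
}

/* Models enqueuing a task: a second task forces the tick back on. */
static void model_enqueue_task(struct cpu_model *c)
{
	assert(c);
	c->nr_running++;
	if (c->tick_stopped && c->nr_running > 1)
		c->tick_stopped = false;
}
```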

[TODO: Handle the many things done from scheduler_tick()]

[ Included build fix from Geoff Levand ]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 arch/x86/kernel/smp.c    |    2 +
 include/linux/sched.h    |    6 +++
 include/linux/tick.h     |   11 +++++-
 init/Kconfig             |    2 +-
 kernel/sched/core.c      |   22 ++++++++++++
 kernel/sched/sched.h     |   23 ++++++++++++
 kernel/softirq.c         |    6 ++-
 kernel/time/tick-sched.c |   84 +++++++++++++++++++++++++++++++++++++++++----
 8 files changed, 144 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 94615a3..df83671 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/tick.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -283,6 +284,7 @@ void smp_cpuset_update_nohz_interrupt(struct pt_regs *regs)
 {
 	ack_APIC_irq();
 	irq_enter();
+	tick_nohz_check_adaptive();
 	inc_irq_stat(irq_call_count);
 	irq_exit();
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0657368..dd5df2a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2746,6 +2746,12 @@ static inline void inc_syscw(struct task_struct *tsk)
 #define TASK_SIZE_OF(tsk)	TASK_SIZE
 #endif
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+extern bool sched_can_stop_tick(void);
+#else
+static inline bool sched_can_stop_tick(void) { return false; }
+#endif
+
 #ifdef CONFIG_MM_OWNER
 extern void mm_update_next_owner(struct mm_struct *mm);
 extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f37fceb..9b66fd3 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -124,11 +124,12 @@ static inline int tick_oneshot_mode_active(void) { return 0; }
 # ifdef CONFIG_NO_HZ
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
+extern void tick_nohz_restart_sched_tick(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
-# else
+# else /* !NO_HZ */
 static inline void tick_nohz_idle_enter(void) { }
 static inline void tick_nohz_idle_exit(void) { }
 
@@ -142,4 +143,12 @@ static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 # endif /* !NO_HZ */
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+extern void tick_nohz_check_adaptive(void);
+extern void tick_nohz_post_schedule(void);
+#else /* !CPUSETS_NO_HZ */
+static inline void tick_nohz_check_adaptive(void) { }
+static inline void tick_nohz_post_schedule(void) { }
+#endif /* CPUSETS_NO_HZ */
+
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index 43f7687..7cdb8be 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -640,7 +640,7 @@ config PROC_PID_CPUSET
 
 config CPUSETS_NO_HZ
        bool "Tickless cpusets"
-       depends on CPUSETS && HAVE_CPUSETS_NO_HZ
+       depends on CPUSETS && HAVE_CPUSETS_NO_HZ && NO_HZ && HIGH_RES_TIMERS
        help
          This options let you apply a nohz property to a cpuset such
 	 that the periodic timer tick tries to be avoided when possible on
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b342f57..4f80a81 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1323,6 +1323,27 @@ static void update_avg(u64 *avg, u64 sample)
 }
 #endif
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+bool sched_can_stop_tick(void)
+{
+	struct rq *rq;
+
+	rq = this_rq();
+
+	/*
+	 * Ensure nr_running updates are visible
+	 * FIXME: the barrier is probably not enough to ensure
+	 * the updates are visible right away.
+	 */
+	smp_rmb();
+	/* More than one running task needs preemption */
+	if (rq->nr_running > 1)
+		return false;
+
+	return true;
+}
+#endif
+
 static void
 ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 {
@@ -2059,6 +2080,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 * frame will be invalid.
 	 */
 	finish_task_switch(this_rq(), prev);
+	tick_nohz_post_schedule();
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 98c0c26..b89f254 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1,6 +1,7 @@
 
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/cpuset.h>
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
 
@@ -925,6 +926,28 @@ static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
 static inline void inc_nr_running(struct rq *rq)
 {
 	rq->nr_running++;
+
+	if (rq->nr_running == 2) {
+		/*
+		 * Make rq->nr_running update visible right away so that
+		 * remote CPU knows that it must restart the tick.
+		 * FIXME: This is probably not enough to ensure the update is visible
+		 */
+		smp_wmb();
+		/*
+		 * Make updates to cpu_adaptive_nohz_ref visible right now.
+		 * If the CPU is not yet in a nohz cpuset then it will see
+		 * the value on rq->nr_running later on the first time it
+		 * tries to shut down the tick. Otherwise we must send
+		 * it an IPI. But the ordering must be strict to ensure
+		 * the first case.
+		 * FIXME: That too is probably not enough to ensure the
+		 * update is visible.
+		 */
+		smp_rmb();
+		if (cpuset_cpu_adaptive_nohz(rq->cpu))
+			smp_cpuset_update_nohz(rq->cpu);
+	}
 }
 
 static inline void dec_nr_running(struct rq *rq)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 5ace266..1bacb20 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,6 +24,7 @@
 #include <linux/ftrace.h>
 #include <linux/smp.h>
 #include <linux/tick.h>
+#include <linux/cpuset.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -297,7 +298,8 @@ void irq_enter(void)
 	int cpu = smp_processor_id();
 
 	rcu_irq_enter();
-	if (is_idle_task(current) && !in_interrupt()) {
+
+	if ((is_idle_task(current) || cpuset_adaptive_nohz()) && !in_interrupt()) {
 		/*
 		 * Prevent raise_softirq from needlessly waking up ksoftirqd
 		 * here, as softirq will be serviced on return from interrupt.
@@ -349,7 +351,7 @@ void irq_exit(void)
 
 #ifdef CONFIG_NO_HZ
 	/* Make sure that timer wheel updates are propagated */
-	if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
+	if (!in_interrupt())
 		tick_nohz_irq_exit();
 #endif
 	rcu_irq_exit();
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f1142d5..43fa7ac 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -506,6 +506,24 @@ void tick_nohz_idle_enter(void)
 	local_irq_enable();
 }
 
+static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
+{
+#ifdef CONFIG_CPUSETS_NO_HZ
+	int cpu = smp_processor_id();
+
+	if (!cpuset_adaptive_nohz() || is_idle_task(current))
+		return;
+
+	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
+		return;
+
+	if (!sched_can_stop_tick())
+		return;
+
+	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+#endif
+}
+
 /**
  * tick_nohz_irq_exit - update next tick event from interrupt exit
  *
@@ -518,10 +536,12 @@ void tick_nohz_irq_exit(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 
-	if (!ts->inidle)
-		return;
-
-	__tick_nohz_idle_enter(ts);
+	if (ts->inidle) {
+		if (!need_resched())
+			__tick_nohz_idle_enter(ts);
+	} else {
+		tick_nohz_cpuset_stop_tick(ts);
+	}
 }
 
 /**
@@ -562,7 +582,7 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 	}
 }
 
-static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
+static void __tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
 	/* Update jiffies first */
 	tick_do_update_jiffies64(now);
@@ -577,6 +597,31 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 	tick_nohz_restart(ts, now);
 }
 
+/**
+ * tick_nohz_restart_sched_tick - restart the tick for a tickless CPU
+ *
+ * Restart the tick when the CPU is in adaptive tickless mode.
+ */
+void tick_nohz_restart_sched_tick(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+	unsigned long flags;
+	ktime_t now;
+
+	local_irq_save(flags);
+
+	if (!ts->tick_stopped) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	now = ktime_get();
+	__tick_nohz_restart_sched_tick(ts, now);
+
+	local_irq_restore(flags);
+}
+
+
 static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
@@ -623,7 +668,7 @@ void tick_nohz_idle_exit(void)
 
 	if (ts->tick_stopped) {
 		select_nohz_load_balancer(0);
-		tick_nohz_restart_sched_tick(ts, now);
+		__tick_nohz_restart_sched_tick(ts, now);
 		tick_nohz_account_idle_ticks(ts);
 	}
 
@@ -784,7 +829,6 @@ void tick_check_idle(int cpu)
 }
 
 #ifdef CONFIG_CPUSETS_NO_HZ
-
 /*
  * Take the timer duty if nobody is taking care of it.
  * If a CPU already does and and it's in a nohz cpuset,
@@ -803,6 +847,29 @@ static void tick_do_timer_check_handler(int cpu)
 	}
 }
 
+void tick_nohz_check_adaptive(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (ts->tick_stopped && !is_idle_task(current)) {
+		if (!sched_can_stop_tick())
+			tick_nohz_restart_sched_tick();
+	}
+}
+
+void tick_nohz_post_schedule(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	/*
+	 * No need to disable irqs here. The worst that can happen
+	 * is an irq that comes and restarts the tick before us.
+	 * tick_nohz_restart_sched_tick() is irq safe.
+	 */
+	if (ts->tick_stopped)
+		tick_nohz_restart_sched_tick();
+}
+
 #else
 
 static void tick_do_timer_check_handler(int cpu)
@@ -849,6 +916,7 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 	 * no valid regs pointer
 	 */
 	if (regs) {
+		int user = user_mode(regs);
 		/*
 		 * When we are idle and the tick is stopped, we have to touch
 		 * the watchdog as we might not schedule for a really long
@@ -862,7 +930,7 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 			if (idle_cpu(cpu))
 				ts->idle_jiffies++;
 		}
-		update_process_times(user_mode(regs));
+		update_process_times(user);
 		profile_tick(CPU_PROFILING);
 	}
 
-- 
1.7.5.4


* [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (10 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 10/32] nohz: Adaptive tick stop and restart on nohz cpuset Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 14:54   ` Christoph Lameter
  2012-03-21 13:58 ` [PATCH 12/32] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued Frederic Weisbecker
                   ` (22 subsequent siblings)
  34 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

If RCU is waiting for the current CPU to complete a grace
period, don't turn off the tick. Unlike dyntick-idle, we
are not necessarily going to enter an RCU extended quiescent
state, so we may need to keep the tick to note the current CPU's
quiescent states.

[added build fix from Zen Lin]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
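Note for readers (below the '---', so not part of the commit message):
the check added here can be modelled in user space roughly as follows.
This is only a sketch under assumptions: struct cpu_model and the
model_* helpers are made up for illustration. They stand in for
rq->nr_running and the result of rcu_pending(), and are not kernel
interfaces.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU state the real checks read. */
struct cpu_model {
	unsigned int nr_running;	/* models rq->nr_running */
	bool rcu_pending;		/* models rcu_pending(cpu) != 0 */
};

/* Models sched_can_stop_tick(): more than one task needs preemption. */
static bool model_sched_can_stop_tick(const struct cpu_model *c)
{
	assert(c);
	return c->nr_running <= 1;
}

/* Models can_stop_adaptive_tick() as of this patch. */
static bool model_can_stop_adaptive_tick(const struct cpu_model *c)
{
	if (!model_sched_can_stop_tick(c))
		return false;

	/* RCU still wants a quiescent state from this CPU: keep the tick. */
	if (c->rcu_pending)
		return false;

	return true;
}
```

With this model, a CPU running a single task and with no RCU work
pending may stop its tick. Either a second runnable task or a pending
grace period keeps it ticking.
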
 include/linux/rcupdate.h |    1 +
 kernel/rcutree.c         |    3 +--
 kernel/time/tick-sched.c |   22 ++++++++++++++++++----
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 81c04f4..e06639e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -184,6 +184,7 @@ static inline int rcu_preempt_depth(void)
 extern void rcu_sched_qs(int cpu);
 extern void rcu_bh_qs(int cpu);
 extern void rcu_check_callbacks(int cpu, int user);
+extern int rcu_pending(int cpu);
 struct notifier_block;
 extern void rcu_idle_enter(void);
 extern void rcu_idle_exit(void);
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 6c4a672..e141c7e 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -212,7 +212,6 @@ int rcu_cpu_stall_suppress __read_mostly;
 module_param(rcu_cpu_stall_suppress, int, 0644);
 
 static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
-static int rcu_pending(int cpu);
 
 /*
  * Return the number of RCU-sched batches processed thus far for debug & stats.
@@ -1915,7 +1914,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
  * by the current CPU, returning 1 if so.  This function is part of the
  * RCU implementation; it is -not- an exported member of the RCU API.
  */
-static int rcu_pending(int cpu)
+int rcu_pending(int cpu)
 {
 	return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
 	       __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 43fa7ac..4f99766 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -506,9 +506,21 @@ void tick_nohz_idle_enter(void)
 	local_irq_enable();
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+static bool can_stop_adaptive_tick(void)
+{
+	if (!sched_can_stop_tick())
+		return false;
+
+	/* Is there a grace period to complete ? */
+	if (rcu_pending(smp_processor_id()))
+		return false;
+
+	return true;
+}
+
 static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
 {
-#ifdef CONFIG_CPUSETS_NO_HZ
 	int cpu = smp_processor_id();
 
 	if (!cpuset_adaptive_nohz() || is_idle_task(current))
@@ -517,12 +529,14 @@ static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
 	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
 		return;
 
-	if (!sched_can_stop_tick())
+	if (!can_stop_adaptive_tick())
 		return;
 
 	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
-#endif
 }
+#else
+static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts) { }
+#endif
 
 /**
  * tick_nohz_irq_exit - update next tick event from interrupt exit
@@ -852,7 +866,7 @@ void tick_nohz_check_adaptive(void)
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 
 	if (ts->tick_stopped && !is_idle_task(current)) {
-		if (!sched_can_stop_tick())
+		if (!can_stop_adaptive_tick())
 			tick_nohz_restart_sched_tick();
 	}
 }
-- 
1.7.5.4


* [PATCH 12/32] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (11 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 13/32] nohz/cpuset: Don't stop the tick if posix cpu timers are running Frederic Weisbecker
                   ` (21 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

Wake up a CPU when a timer list timer is enqueued there and
the CPU is in adaptive nohz mode. Sending it an IPI makes
it reconsider the next timer to program, taking the recent
updates into account.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
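Note for readers (below the '---', so not part of the commit message):
the dispatch done by wake_up_nohz_cpu() can be sketched in user space
as below. Everything named model_* or MODEL_* is hypothetical. The
sketch assumes that cpuset_cpu_adaptive_nohz(cpu) amounts to a
positive per-CPU reference count, and it ignores the fact that the
real wake_up_idle_cpu() only kicks a CPU that is actually idle.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

enum wake_kind { WAKE_NONE, WAKE_CPUSET_IPI, WAKE_IDLE_KICK };

/* Hypothetical per-CPU state; the patch reads cpu_adaptive_nohz_ref. */
static int model_adaptive_nohz_ref[MODEL_NR_CPUS];
/* Records which wakeup path was last taken for each CPU. */
static enum wake_kind model_last_wake[MODEL_NR_CPUS];

/* Models wake_up_cpuset_nohz_cpu(): IPI only CPUs inside a nohz cpuset. */
static bool model_wake_up_cpuset_nohz_cpu(int cpu)
{
	if (model_adaptive_nohz_ref[cpu] > 0) {
		model_last_wake[cpu] = WAKE_CPUSET_IPI;
		return true;
	}
	return false;
}

/* Models wake_up_nohz_cpu(): fall back to the dynticks-idle kick. */
static void model_wake_up_nohz_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	if (!model_wake_up_cpuset_nohz_cpu(cpu))
		model_last_wake[cpu] = WAKE_IDLE_KICK;
}
```

The point of the ordering is that the cpuset IPI takes precedence, and
the existing idle wakeup is only used for CPUs outside any nohz cpuset.
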
 include/linux/sched.h |    4 ++--
 kernel/sched/core.c   |   24 +++++++++++++++++++++++-
 kernel/timer.c        |    2 +-
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dd5df2a..2cf5d9b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1992,9 +1992,9 @@ static inline void idle_task_exit(void) {}
 #endif
 
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-extern void wake_up_idle_cpu(int cpu);
+extern void wake_up_nohz_cpu(int cpu);
 #else
-static inline void wake_up_idle_cpu(int cpu) { }
+static inline void wake_up_nohz_cpu(int cpu) { }
 #endif
 
 extern unsigned int sysctl_sched_latency;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4f80a81..ba9e4d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -576,7 +576,7 @@ unlock:
  * account when the CPU goes back to idle and evaluates the timer
  * wheel for the next timer event.
  */
-void wake_up_idle_cpu(int cpu)
+static void wake_up_idle_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
@@ -606,6 +606,28 @@ void wake_up_idle_cpu(int cpu)
 		smp_send_reschedule(cpu);
 }
 
+static bool wake_up_cpuset_nohz_cpu(int cpu)
+{
+#ifdef CONFIG_CPUSETS_NO_HZ
+	/*
+	 * FIXME: We need to ensure that updates
+	 * on cpu_adaptive_nohz_ref are visible right
+	 * away.
+	 */
+	if (cpuset_cpu_adaptive_nohz(cpu)) {
+		smp_cpuset_update_nohz(cpu);
+		return true;
+	}
+#endif
+	return false;
+}
+
+void wake_up_nohz_cpu(int cpu)
+{
+	if (!wake_up_cpuset_nohz_cpu(cpu))
+		wake_up_idle_cpu(cpu);
+}
+
 static inline bool got_nohz_idle_kick(void)
 {
 	int cpu = smp_processor_id();
diff --git a/kernel/timer.c b/kernel/timer.c
index a297ffc..c203297 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -926,7 +926,7 @@ void add_timer_on(struct timer_list *timer, int cpu)
 	 * makes sure that a CPU on the way to idle can not evaluate
 	 * the timer wheel.
 	 */
-	wake_up_idle_cpu(cpu);
+	wake_up_nohz_cpu(cpu);
 	spin_unlock_irqrestore(&base->lock, flags);
 }
 EXPORT_SYMBOL_GPL(add_timer_on);
-- 
1.7.5.4


* [PATCH 13/32] nohz/cpuset: Don't stop the tick if posix cpu timers are running
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (12 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 12/32] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 14/32] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset Frederic Weisbecker
                   ` (20 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

If either a per thread or a per process posix cpu timer is running,
don't stop the tick.

TODO: restart the tick if it is stopped and a posix cpu timer is
enqueued. We probably also need a memory barrier for the per
process posix timer, which can be enqueued from another task
of the group.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
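Note for readers (below the '---', so not part of the commit message):
a user-space sketch of the check added by posix_cpu_timers_running().
The model_* types are hypothetical and only mirror the three expiry
fields of the per-thread cputime_expires and the running flag of the
per-process cputimer. Locking and memory ordering are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the three expiry fields of struct task_cputime. */
struct model_cputime {
	unsigned long long utime;
	unsigned long long stime;
	unsigned long long sum_exec_runtime;
};

struct model_task {
	struct model_cputime cputime_expires;	/* per-thread timers */
	bool group_cputimer_running;		/* models signal->cputimer.running */
};

/* True when no per-thread expiry is armed. */
static bool model_cputime_zero(const struct model_cputime *ct)
{
	return !ct->utime && !ct->stime && !ct->sum_exec_runtime;
}

/* Models posix_cpu_timers_running(): any armed timer keeps the tick. */
static bool model_posix_cpu_timers_running(const struct model_task *t)
{
	assert(t);

	if (!model_cputime_zero(&t->cputime_expires))
		return true;

	if (t->group_cputimer_running)
		return true;

	return false;
}
```

A task with any non-zero per-thread expiry, or whose thread group has a
running process-wide cputimer, is reported as having timers running.
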
 include/linux/posix-timers.h |    1 +
 kernel/posix-cpu-timers.c    |   12 ++++++++++++
 kernel/time/tick-sched.c     |    4 ++++
 3 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 042058f..97480c2 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -119,6 +119,7 @@ int posix_timer_event(struct k_itimer *timr, int si_private);
 void posix_cpu_timer_schedule(struct k_itimer *timer);
 
 void run_posix_cpu_timers(struct task_struct *task);
+bool posix_cpu_timers_running(struct task_struct *tsk);
 void posix_cpu_timers_exit(struct task_struct *task);
 void posix_cpu_timers_exit_group(struct task_struct *task);
 
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 125cb67..79d4c24 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -6,6 +6,7 @@
 #include <linux/posix-timers.h>
 #include <linux/errno.h>
 #include <linux/math64.h>
+#include <linux/cpuset.h>
 #include <asm/uaccess.h>
 #include <linux/kernel_stat.h>
 #include <trace/events/timer.h>
@@ -1274,6 +1275,17 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
 	return 0;
 }
 
+bool posix_cpu_timers_running(struct task_struct *tsk)
+{
+	if (!task_cputime_zero(&tsk->cputime_expires))
+		return true;
+
+	if (tsk->signal->cputimer.running)
+		return true;
+
+	return false;
+}
+
 /*
  * This is called from the timer interrupt handler.  The irq handler has
  * already updated our counts.  We need to check if any timers fire now.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4f99766..fc35d41 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -21,6 +21,7 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 #include <linux/cpuset.h>
+#include <linux/posix-timers.h>
 
 #include <asm/irq_regs.h>
 
@@ -512,6 +513,9 @@ static bool can_stop_adaptive_tick(void)
 	if (!sched_can_stop_tick())
 		return false;
 
+	if (posix_cpu_timers_running(current))
+		return false;
+
 	/* Is there a grace period to complete ? */
 	if (rcu_pending(smp_processor_id()))
 		return false;
-- 
1.7.5.4


* [PATCH 14/32] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (13 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 13/32] nohz/cpuset: Don't stop the tick if posix cpu timers are running Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 15/32] nohz/cpuset: Restart the tick if printk needs it Frederic Weisbecker
                   ` (19 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

Issue an IPI to restart the tick on a CPU that belongs
to a cpuset when its nohz flag gets cleared.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
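Note for readers (below the '---', so not part of the commit message):
the reference counting that triggers the IPI can be sketched in user
space as below. All model_* names are hypothetical. The real handler
additionally skips the restart when the idle task is running, and the
real code needs the memory ordering discussed in the FIXME, neither of
which this single-threaded model attempts to capture.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Hypothetical per-CPU state: nohz cpuset refcount and tick status. */
static int model_nohz_ref[MODEL_NR_CPUS];
static bool model_tick_stopped[MODEL_NR_CPUS];
static int model_exit_ipis[MODEL_NR_CPUS];

/* Models cpuset_exit_nohz_interrupt() running on the target CPU. */
static void model_exit_nohz_interrupt(int cpu)
{
	model_exit_ipis[cpu]++;
	if (model_tick_stopped[cpu])
		model_tick_stopped[cpu] = false;	/* tick restarted */
}

/* Models one iteration of update_nohz_cpus() for a single CPU. */
static void model_update_nohz_cpu(int cpu, bool nohz_enabled)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	model_nohz_ref[cpu] += nohz_enabled ? 1 : -1;

	/* Last nohz cpuset went away: kick the CPU out of tickless mode. */
	if (model_nohz_ref[cpu] == 0)
		model_exit_nohz_interrupt(cpu);
}
```

A CPU shared by two nohz cpusets keeps its tick stopped when the first
flag is cleared and only gets the IPI once the count reaches zero.
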
 include/linux/cpuset.h   |    2 ++
 kernel/cpuset.c          |   23 +++++++++++++++++++++++
 kernel/time/tick-sched.c |    8 ++++++++
 3 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 5510708..89ef5f3 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -263,6 +263,8 @@ static inline bool cpuset_adaptive_nohz(void)
 
 	return false;
 }
+
+extern void cpuset_exit_nohz_interrupt(void *unused);
 #else
 static inline bool cpuset_cpu_adaptive_nohz(int cpu) { return false; }
 static inline bool cpuset_adaptive_nohz(void) { return false; }
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 5a28cf8..00864a0 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1221,6 +1221,14 @@ static void cpuset_change_flag(struct task_struct *tsk,
 
 DEFINE_PER_CPU(int, cpu_adaptive_nohz_ref);
 
+static void cpu_exit_nohz(int cpu)
+{
+	preempt_disable();
+	smp_call_function_single(cpu, cpuset_exit_nohz_interrupt,
+				 NULL, true);
+	preempt_enable();
+}
+
 static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
 {
 	int cpu;
@@ -1234,6 +1242,21 @@ static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
 			per_cpu(cpu_adaptive_nohz_ref, cpu) += 1;
 		else
 			per_cpu(cpu_adaptive_nohz_ref, cpu) -= 1;
+
+		val = per_cpu(cpu_adaptive_nohz_ref, cpu);
+
+		if (!val) {
+			/*
+			 * The update to cpu_adaptive_nohz_ref must be
+			 * visible right away. So that once we restart the tick
+			 * from the IPI, it won't be stopped again due to cache
+			 * update lag.
+			 * FIXME: We probably need more to ensure this value is really
+			 * visible right away.
+			 */
+			smp_mb();
+			cpu_exit_nohz(cpu);
+		}
 	}
 }
 #else
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fc35d41..fe31add 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -875,6 +875,14 @@ void tick_nohz_check_adaptive(void)
 	}
 }
 
+void cpuset_exit_nohz_interrupt(void *unused)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (ts->tick_stopped && !is_idle_task(current))
+		tick_nohz_restart_sched_tick();
+}
+
 void tick_nohz_post_schedule(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
-- 
1.7.5.4


* [PATCH 15/32] nohz/cpuset: Restart the tick if printk needs it
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (14 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 14/32] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 16/32] rcu: Restart the tick on non-responding adaptive nohz CPUs Frederic Weisbecker
                   ` (18 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

If we are in nohz adaptive mode and printk is called, there is no
tick to wake up the logger. We need to restart the tick when that
happens. Do this asynchronously by issuing a tick restart self IPI
to avoid deadlocking against whatever locks the printk caller holds.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 kernel/printk.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/kernel/printk.c b/kernel/printk.c
index 32690a0..a32f291 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -41,6 +41,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/rculist.h>
+#include <linux/cpuset.h>
 
 #include <asm/uaccess.h>
 
@@ -1230,8 +1231,20 @@ int printk_needs_cpu(int cpu)
 
 void wake_up_klogd(void)
 {
-	if (waitqueue_active(&log_wait))
+	unsigned long flags;
+
+	if (waitqueue_active(&log_wait)) {
 		this_cpu_write(printk_pending, 1);
+		/* Make it visible from any interrupt from now */
+		barrier();
+		/*
+		 * It's safe to check that even if interrupts are not disabled.
+		 * If we enable nohz adaptive mode concurrently, we'll see the
+		 * printk_pending value and thus keep a periodic tick behaviour.
+		 */
+		if (cpuset_adaptive_nohz())
+			smp_cpuset_update_nohz(smp_processor_id());
+	}
 }
 
 /**
-- 
1.7.5.4


* [PATCH 16/32] rcu: Restart the tick on non-responding adaptive nohz CPUs
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (15 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 15/32] nohz/cpuset: Restart the tick if printk needs it Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 17/32] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU Frederic Weisbecker
                   ` (17 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

When a CPU in adaptive nohz mode fails to report the quiescent
state needed to complete a grace period, send it a specific IPI
so that it restarts the tick and chases a quiescent state.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 kernel/rcutree.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e141c7e..3fffc26 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -50,6 +50,7 @@
 #include <linux/wait.h>
 #include <linux/kthread.h>
 #include <linux/prefetch.h>
+#include <linux/cpuset.h>
 
 #include "rcutree.h"
 #include <trace/events/rcu.h>
@@ -302,6 +303,20 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
 
 #ifdef CONFIG_SMP
 
+static void cpuset_update_rcu_cpu(int cpu)
+{
+#ifdef CONFIG_CPUSETS_NO_HZ
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	if (cpuset_cpu_adaptive_nohz(cpu))
+		smp_cpuset_update_nohz(cpu);
+
+	local_irq_restore(flags);
+#endif
+}
+
 /*
  * If the specified CPU is offline, tell the caller that it is in
  * a quiescent state.  Otherwise, whack it with a reschedule IPI.
@@ -325,6 +340,8 @@ static int rcu_implicit_offline_qs(struct rcu_data *rdp)
 		return 1;
 	}
 
+	cpuset_update_rcu_cpu(rdp->cpu);
+
 	/*
 	 * The CPU is online, so send it a reschedule IPI.  This forces
 	 * it through the scheduler, and (inefficiently) also handles cases
-- 
1.7.5.4



* [PATCH 17/32] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (16 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 16/32] rcu: Restart the tick on non-responding adaptive nohz CPUs Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 18/32] nohz: Generalize tickless cpu time accounting Frederic Weisbecker
                   ` (16 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

If we enqueue an RCU callback, we need the CPU tick to stay
alive until the callback has been handled by completing the
appropriate grace period.

Thus, when we call_rcu(), send a self IPI that checks rcu_needs_cpu()
so that we restore a periodic tick behaviour that can take care of
everything.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 kernel/rcutree.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 3fffc26..b8d300c 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1749,6 +1749,13 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
 	else
 		trace_rcu_callback(rsp->name, head, rdp->qlen);
 
+	/* Restart the timer if needed to handle the callbacks */
+	if (cpuset_adaptive_nohz()) {
+		/* Make updates on nxtlist visible to self IPI */
+		barrier();
+		smp_cpuset_update_nohz(smp_processor_id());
+	}
+
 	/* If interrupts were disabled, don't dive into RCU core. */
 	if (irqs_disabled_flags(flags)) {
 		local_irq_restore(flags);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 18/32] nohz: Generalize tickless cpu time accounting
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (17 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 17/32] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 19/32] nohz/cpuset: Account user and system times in adaptive nohz mode Frederic Weisbecker
                   ` (15 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

When the CPU enters idle, it saves the jiffies stamp into
ts->idle_jiffies, increments this value by one every time
there is a timer interrupt and accounts "jiffies - ts->idle_jiffies"
idle ticks when it exits idle. This way we still account the
idle CPU time even if the tick is stopped.

This patch lays the groundwork to generalize this for user
and system accounting. ts->idle_jiffies becomes ts->saved_jiffies and
a new member ts->saved_jiffies_whence indicates from which domain
we saved the jiffies: user, system or idle.

This is one more step toward making the tickless infrastructure usable
beyond idle contexts.

For now this is only used by idle but further patches make use of
it for user and system.
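
As a rough illustration only, the snapshot scheme can be modelled in
plain userspace C. This is a sketch, not kernel code: struct tick_model
and the model_*() helpers are hypothetical names made up for the
example, and the "accounted" array merely stands in for the real
cputime accounting.

```c
#include <assert.h>

/* Hypothetical userspace model of the snapshot scheme; not kernel code. */
enum whence { SAVED_NONE, SAVED_IDLE, SAVED_USER, SAVED_SYS };

struct tick_model {
	unsigned long jiffies;        /* stand-in for the global jiffies */
	unsigned long saved_jiffies;  /* snapshot taken when the tick stops */
	enum whence whence;           /* domain the snapshot belongs to */
	int tick_stopped;
	unsigned long accounted[4];   /* ticks accounted, indexed by whence */
};

/* Stop the tick: remember when, and on behalf of which domain. */
static void model_stop_tick(struct tick_model *ts, enum whence w)
{
	ts->saved_jiffies = ts->jiffies;
	ts->whence = w;
	ts->tick_stopped = 1;
}

/* Time passes without a tick on this CPU (another CPU keeps jiffies). */
static void model_advance(struct tick_model *ts, unsigned long n)
{
	ts->jiffies += n;
}

/*
 * A timer interrupt fires while the tick is stopped. The regular tick
 * path accounts one tick by itself, so move the snapshot forward by one
 * to avoid accounting that tick a second time on restart.
 */
static void model_timer_interrupt(struct tick_model *ts)
{
	assert(ts->tick_stopped);
	ts->accounted[ts->whence]++;
	ts->saved_jiffies++;
}

/* Restart the tick: flush the ticks that elapsed since the snapshot. */
static void model_restart_tick(struct tick_model *ts)
{
	ts->accounted[ts->whence] += ts->jiffies - ts->saved_jiffies;
	ts->whence = SAVED_NONE;
	ts->tick_stopped = 0;
}
```

In this model, stopping the tick in idle, letting ten jiffies elapse
with one timer interrupt in between and then restarting the tick
accounts ten idle ticks in total, not eleven.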

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 include/linux/kernel_stat.h |    2 +
 include/linux/tick.h        |   45 +++++++++++++++++++++--------------
 kernel/sched/core.c         |   22 +++++++++++++++++
 kernel/time/tick-sched.c    |   55 +++++++++++++++++++++++++++---------------
 kernel/time/timer_list.c    |    3 +-
 5 files changed, 88 insertions(+), 39 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 2fbd905..be90056 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -122,7 +122,9 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
 extern unsigned long long task_delta_exec(struct task_struct *);
 
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
+extern void account_user_ticks(struct task_struct *, unsigned long);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
+extern void account_system_ticks(struct task_struct *, unsigned long);
 extern void account_steal_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 9b66fd3..03b6edd 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -27,25 +27,33 @@ enum tick_nohz_mode {
 	NOHZ_MODE_HIGHRES,
 };
 
+enum tick_saved_jiffies {
+	JIFFIES_SAVED_NONE,
+	JIFFIES_SAVED_IDLE,
+	JIFFIES_SAVED_USER,
+	JIFFIES_SAVED_SYS,
+};
+
 /**
  * struct tick_sched - sched tick emulation and no idle tick control/stats
- * @sched_timer:	hrtimer to schedule the periodic tick in high
- *			resolution mode
- * @last_tick:		Store the last tick expiry time when the tick
- *			timer is modified for nohz sleeps. This is necessary
- *			to resume the tick timer operation in the timeline
- *			when the CPU returns from nohz sleep.
- * @tick_stopped:	Indicator that the idle tick has been stopped
- * @idle_jiffies:	jiffies at the entry to idle for idle time accounting
- * @idle_calls:		Total number of idle calls
- * @idle_sleeps:	Number of idle calls, where the sched tick was stopped
- * @idle_entrytime:	Time when the idle call was entered
- * @idle_waketime:	Time when the idle was interrupted
- * @idle_exittime:	Time when the idle state was left
- * @idle_sleeptime:	Sum of the time slept in idle with sched tick stopped
- * @iowait_sleeptime:	Sum of the time slept in idle with sched tick stopped, with IO outstanding
- * @sleep_length:	Duration of the current idle sleep
- * @do_timer_lst:	CPU was the last one doing do_timer before going idle
+ * @sched_timer:		hrtimer to schedule the periodic tick in high
+ *				resolution mode
+ * @last_tick:			Store the last tick expiry time when the tick
+ *				timer is modified for nohz sleeps. This is necessary
+ *				to resume the tick timer operation in the timeline
+ *				when the CPU returns from nohz sleep.
+ * @tick_stopped:		Indicator that the idle tick has been stopped
+ * @idle_calls:			Total number of idle calls
+ * @idle_sleeps:		Number of idle calls, where the sched tick was stopped
+ * @idle_entrytime:		Time when the idle call was entered
+ * @idle_waketime:		Time when the idle was interrupted
+ * @idle_exittime:		Time when the idle state was left
+ * @idle_sleeptime:		Sum of the time slept in idle with sched tick stopped
+ * @saved_jiffies:		Jiffies snapshot on tick stop for cpu time accounting
+ * @saved_jiffies_whence:	Area where we saved @saved_jiffies
+ * @iowait_sleeptime:		Sum of the time slept in idle with sched tick stopped, with IO outstanding
+ * @sleep_length:		Duration of the current idle sleep
+ * @do_timer_lst:		CPU was the last one doing do_timer before going idle
  */
 struct tick_sched {
 	struct hrtimer			sched_timer;
@@ -54,7 +62,6 @@ struct tick_sched {
 	ktime_t				last_tick;
 	int				inidle;
 	int				tick_stopped;
-	unsigned long			idle_jiffies;
 	unsigned long			idle_calls;
 	unsigned long			idle_sleeps;
 	int				idle_active;
@@ -62,6 +69,8 @@ struct tick_sched {
 	ktime_t				idle_waketime;
 	ktime_t				idle_exittime;
 	ktime_t				idle_sleeptime;
+	enum tick_saved_jiffies		saved_jiffies_whence;
+	unsigned long			saved_jiffies;
 	ktime_t				iowait_sleeptime;
 	ktime_t				sleep_length;
 	unsigned long			last_jiffies;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ba9e4d4..eca842e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2693,6 +2693,17 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 	acct_update_integrals(p);
 }
 
+void account_user_ticks(struct task_struct *p, unsigned long ticks)
+{
+	cputime_t delta_cputime, delta_scaled;
+
+	if (ticks) {
+		delta_cputime = jiffies_to_cputime(ticks);
+		delta_scaled = cputime_to_scaled(delta_cputime);
+		account_user_time(p, delta_cputime, delta_scaled);
+	}
+}
+
 /*
  * Account guest cpu time to a process.
  * @p: the process that the cpu time gets accounted to
@@ -2770,6 +2781,17 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	__account_system_time(p, cputime, cputime_scaled, index);
 }
 
+void account_system_ticks(struct task_struct *p, unsigned long ticks)
+{
+	cputime_t delta_cputime, delta_scaled;
+
+	if (ticks) {
+		delta_cputime = jiffies_to_cputime(ticks);
+		delta_scaled = cputime_to_scaled(delta_cputime);
+		account_system_time(p, 0, delta_cputime, delta_scaled);
+	}
+}
+
 /*
  * Account for involuntary wait time.
  * @cputime: the cpu time spent in involuntary wait
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fe31add..9359e6c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -461,7 +461,8 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts)
 		}
 
 		if (!was_stopped && ts->tick_stopped) {
-			ts->idle_jiffies = ts->last_jiffies;
+			ts->saved_jiffies = ts->last_jiffies;
+			ts->saved_jiffies_whence = JIFFIES_SAVED_IDLE;
 			select_nohz_load_balancer(1);
 		}
 	}
@@ -640,22 +641,34 @@ void tick_nohz_restart_sched_tick(void)
 }
 
 
-static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
+static void tick_nohz_account_ticks(struct tick_sched *ts)
 {
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	unsigned long ticks;
 	/*
-	 * We stopped the tick in idle. Update process times would miss the
-	 * time we slept as update_process_times does only a 1 tick
-	 * accounting. Enforce that this is accounted to idle !
+	 * We stopped the tick. Update process times would miss the
+	 * time we ran tickless as update_process_times does only a 1 tick
+	 * accounting. Enforce that this is accounted to nohz timeslices.
 	 */
-	ticks = jiffies - ts->idle_jiffies;
+	ticks = jiffies - ts->saved_jiffies;
 	/*
 	 * We might be one off. Do not randomly account a huge number of ticks!
 	 */
-	if (ticks && ticks < LONG_MAX)
-		account_idle_ticks(ticks);
-#endif
+	if (ticks && ticks < LONG_MAX) {
+		switch (ts->saved_jiffies_whence) {
+		case JIFFIES_SAVED_IDLE:
+			account_idle_ticks(ticks);
+			break;
+		case JIFFIES_SAVED_USER:
+			account_user_ticks(current, ticks);
+			break;
+		case JIFFIES_SAVED_SYS:
+			account_system_ticks(current, ticks);
+			break;
+		default:
+			WARN_ON_ONCE(1);
+		}
+	}
+	ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
 }
 
 /**
@@ -687,7 +700,9 @@ void tick_nohz_idle_exit(void)
 	if (ts->tick_stopped) {
 		select_nohz_load_balancer(0);
 		__tick_nohz_restart_sched_tick(ts, now);
-		tick_nohz_account_idle_ticks(ts);
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+		tick_nohz_account_ticks(ts);
+#endif
 	}
 
 	local_irq_enable();
@@ -735,7 +750,7 @@ static void tick_nohz_handler(struct clock_event_device *dev)
 	 */
 	if (ts->tick_stopped) {
 		touch_softlockup_watchdog();
-		ts->idle_jiffies++;
+		ts->saved_jiffies++;
 	}
 
 	update_process_times(user_mode(regs));
@@ -944,17 +959,17 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 	if (regs) {
 		int user = user_mode(regs);
 		/*
-		 * When we are idle and the tick is stopped, we have to touch
-		 * the watchdog as we might not schedule for a really long
-		 * time. This happens on complete idle SMP systems while
-		 * waiting on the login prompt. We also increment the "start of
-		 * idle" jiffy stamp so the idle accounting adjustment we do
-		 * when we go busy again does not account too much ticks.
+		 * When the tick is stopped, we have to touch the watchdog
+		 * as we might not schedule for a really long time. This
+		 * happens on complete idle SMP systems while waiting on
+		 * the login prompt. We also increment the last jiffy stamp
+		 * recorded when we stopped the tick so the cpu time accounting
+		 * adjustment does not account too many ticks when we flush them.
 		 */
 		if (ts->tick_stopped) {
+			/* CHECKME: maybe this is only needed in idle */
 			touch_softlockup_watchdog();
-			if (idle_cpu(cpu))
-				ts->idle_jiffies++;
+			ts->saved_jiffies++;
 		}
 		update_process_times(user);
 		profile_tick(CPU_PROFILING);
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index af5a7e9..54705e3 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -169,7 +169,8 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
 		P(nohz_mode);
 		P_ns(last_tick);
 		P(tick_stopped);
-		P(idle_jiffies);
+		/* CHECKME: Do we want saved_jiffies_whence as well? */
+		P(saved_jiffies);
 		P(idle_calls);
 		P(idle_sleeps);
 		P_ns(idle_entrytime);
-- 
1.7.5.4



* [PATCH 19/32] nohz/cpuset: Account user and system times in adaptive nohz mode
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (18 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 18/32] nohz: Generalize tickless cpu time accounting Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 20/32] nohz/cpuset: New API to flush cputimes on nohz cpusets Frederic Weisbecker
                   ` (14 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

If we are not running the tick, we no longer account the
user/system cputime on every jiffy.

To solve this, save a snapshot of the jiffies when we stop the tick
and keep track of where we saved it: user or system. On top of this,
we account the cputime elapsed when we cross the kernel entry/exit
boundaries and when we restart the tick.
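
The boundary accounting described above can be sketched as a small
userspace model. This is only an illustration under simplifying
assumptions: struct tick_model and the model_*() helpers are
hypothetical names, utime/stime are plain tick counters, and the tick
is assumed to be stopped for the whole sequence.

```c
#include <assert.h>

/* Hypothetical userspace model; the names below are not kernel APIs. */
enum whence { SAVED_NONE, SAVED_USER, SAVED_SYS };

struct tick_model {
	unsigned long jiffies;        /* stand-in for the global jiffies */
	unsigned long saved_jiffies;  /* snapshot of jiffies */
	enum whence whence;           /* where the snapshot was taken */
	unsigned long utime;          /* user ticks accounted so far */
	unsigned long stime;          /* system ticks accounted so far */
};

/* The tick is only stopped from userspace in this model. */
static void model_stop_tick_in_user(struct tick_model *ts)
{
	ts->saved_jiffies = ts->jiffies;
	ts->whence = SAVED_USER;
}

/* Kernel entry: what elapsed since the snapshot was user time. */
static void model_enter_kernel(struct tick_model *ts)
{
	assert(ts->whence == SAVED_USER);
	ts->utime += ts->jiffies - ts->saved_jiffies;
	ts->saved_jiffies = ts->jiffies;
	ts->whence = SAVED_SYS;
}

/* Kernel exit: what elapsed since the snapshot was system time. */
static void model_exit_kernel(struct tick_model *ts)
{
	assert(ts->whence == SAVED_SYS);
	ts->stime += ts->jiffies - ts->saved_jiffies;
	ts->saved_jiffies = ts->jiffies;
	ts->whence = SAVED_USER;
}

/* Tick restart: flush the remainder to whichever domain we are in. */
static void model_restart_tick(struct tick_model *ts)
{
	unsigned long delta = ts->jiffies - ts->saved_jiffies;

	if (ts->whence == SAVED_USER)
		ts->utime += delta;
	else if (ts->whence == SAVED_SYS)
		ts->stime += delta;
	ts->whence = SAVED_NONE;
}
```

With 7 jiffies spent in userspace, 3 in the kernel and 5 more in
userspace before the tick restarts, the model ends up with 12 user
ticks and 3 system ticks.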

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 include/linux/tick.h     |   12 ++++
 kernel/sched/core.c      |    1 +
 kernel/time/tick-sched.c |  131 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 142 insertions(+), 2 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 03b6edd..598b492 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -153,11 +153,23 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 # endif /* !NO_HZ */
 
 #ifdef CONFIG_CPUSETS_NO_HZ
+extern void tick_nohz_enter_kernel(void);
+extern void tick_nohz_exit_kernel(void);
+extern void tick_nohz_enter_exception(struct pt_regs *regs);
+extern void tick_nohz_exit_exception(struct pt_regs *regs);
 extern void tick_nohz_check_adaptive(void);
+extern void tick_nohz_pre_schedule(void);
 extern void tick_nohz_post_schedule(void);
+extern bool tick_nohz_account_tick(void);
 #else /* !CPUSETS_NO_HZ */
+static inline void tick_nohz_enter_kernel(void) { }
+static inline void tick_nohz_exit_kernel(void) { }
+static inline void tick_nohz_enter_exception(struct pt_regs *regs) { }
+static inline void tick_nohz_exit_exception(struct pt_regs *regs) { }
 static inline void tick_nohz_check_adaptive(void) { }
+static inline void tick_nohz_pre_schedule(void) { }
 static inline void tick_nohz_post_schedule(void) { }
+static inline bool tick_nohz_account_tick(void) { return false; }
 #endif /* CPUSETS_NO_HZ */
 
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eca842e..5debfd7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1923,6 +1923,7 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
+	tick_nohz_pre_schedule();
 	sched_info_switch(prev, next);
 	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9359e6c..ff78126 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -526,7 +526,13 @@ static bool can_stop_adaptive_tick(void)
 
 static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
 {
+	struct pt_regs *regs = get_irq_regs();
 	int cpu = smp_processor_id();
+	int was_stopped;
+	int user = 0;
+
+	if (regs)
+		user = user_mode(regs);
 
 	if (!cpuset_adaptive_nohz() || is_idle_task(current))
 		return;
@@ -537,7 +543,36 @@ static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
 	if (!can_stop_adaptive_tick())
 		return;
 
+	/*
+	 * If we stop the tick between the syscall exit hook and the actual
+	 * return to userspace, we'll think we are in system space (due to
+	 * user_mode() thinking so). And since we passed the syscall exit hook
+	 * already we won't realize we are in userspace. So the time spent
+	 * tickless would be spuriously accounted as belonging to system.
+	 *
+	 * To avoid this kind of problem, we only stop the tick from userspace
+	 * (until we find a better solution).
+	 * We can later enter the kernel and keep the tick stopped. But the place
+	 * where we stop the tick must be userspace.
+	 * We make an exception for kernel threads since they always execute in
+	 * kernel space.
+	 */
+	if (!user && current->mm)
+		return;
+
+	was_stopped = ts->tick_stopped;
 	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+
+	if (!was_stopped && ts->tick_stopped) {
+		WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_NONE);
+		if (user)
+			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
+		else if (!current->mm)
+			ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
+
+		ts->saved_jiffies = jiffies;
+		set_thread_flag(TIF_NOHZ);
+	}
 }
 #else
 static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts) { }
@@ -862,6 +897,70 @@ void tick_check_idle(int cpu)
 }
 
 #ifdef CONFIG_CPUSETS_NO_HZ
+void tick_nohz_exit_kernel(void)
+{
+	unsigned long flags;
+	struct tick_sched *ts;
+	unsigned long delta_jiffies;
+
+	local_irq_save(flags);
+
+	ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (!ts->tick_stopped) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_SYS);
+
+	delta_jiffies = jiffies - ts->saved_jiffies;
+	account_system_ticks(current, delta_jiffies);
+
+	ts->saved_jiffies = jiffies;
+	ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
+
+	local_irq_restore(flags);
+}
+
+void tick_nohz_enter_kernel(void)
+{
+	unsigned long flags;
+	struct tick_sched *ts;
+	unsigned long delta_jiffies;
+
+	local_irq_save(flags);
+
+	ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (!ts->tick_stopped) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_USER);
+
+	delta_jiffies = jiffies - ts->saved_jiffies;
+	account_user_ticks(current, delta_jiffies);
+
+	ts->saved_jiffies = jiffies;
+	ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
+
+	local_irq_restore(flags);
+}
+
+void tick_nohz_enter_exception(struct pt_regs *regs)
+{
+	if (user_mode(regs))
+		tick_nohz_enter_kernel();
+}
+
+void tick_nohz_exit_exception(struct pt_regs *regs)
+{
+	if (user_mode(regs))
+		tick_nohz_exit_kernel();
+}
+
 /*
  * Take the timer duty if nobody is taking care of it.
  * If a CPU already does and and it's in a nohz cpuset,
@@ -880,13 +979,22 @@ static void tick_do_timer_check_handler(int cpu)
 	}
 }
 
+static void tick_nohz_restart_adaptive(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	tick_nohz_account_ticks(ts);
+	tick_nohz_restart_sched_tick();
+	clear_thread_flag(TIF_NOHZ);
+}
+
 void tick_nohz_check_adaptive(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 
 	if (ts->tick_stopped && !is_idle_task(current)) {
 		if (!can_stop_adaptive_tick())
-			tick_nohz_restart_sched_tick();
+			tick_nohz_restart_adaptive();
 	}
 }
 
@@ -898,6 +1006,26 @@ void cpuset_exit_nohz_interrupt(void *unused)
 		tick_nohz_restart_adaptive();
 }
 
+/*
+ * Flush cputime and clear hooks before context switch in case we
+ * haven't yet received the IPI that should take care of that.
+ */
+void tick_nohz_pre_schedule(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	/*
+	 * We are holding the rq lock and if we restart the tick now
+	 * we could deadlock by acquiring the lock twice. Instead
+	 * we do that on post schedule time. For now do the cleanups
+	 * on the prev task.
+	 */
+	if (ts->tick_stopped) {
+		tick_nohz_account_ticks(ts);
+		clear_thread_flag(TIF_NOHZ);
+	}
+}
+
 void tick_nohz_post_schedule(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
@@ -910,7 +1038,6 @@ void tick_nohz_post_schedule(void)
 	if (ts->tick_stopped)
 		tick_nohz_restart_sched_tick();
 }
-
 #else
 
 static void tick_do_timer_check_handler(int cpu)
-- 
1.7.5.4



* [PATCH 20/32] nohz/cpuset: New API to flush cputimes on nohz cpusets
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (19 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 19/32] nohz/cpuset: Account user and system times in adaptive nohz mode Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 21/32] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader Frederic Weisbecker
                   ` (13 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

Provide a new API that sends an IPI to every CPU included
in a nohz cpuset in order to flush its cputimes. This is
useful for those that want to see accurate cputimes
on a nohz cpuset.
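
A minimal userspace sketch of what a flush does on the target CPU,
assuming a single accounting domain for simplicity. The names
(struct tick_model, model_flush()) are hypothetical and only mirror
the idea: a flush that keeps the tick stopped must move the snapshot
forward, so that a later flush does not account the same ticks twice.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel implementation. */
struct tick_model {
	unsigned long jiffies;        /* stand-in for the global jiffies */
	unsigned long saved_jiffies;  /* snapshot of jiffies */
	bool tick_stopped;
	bool snapshot_valid;          /* false once the tick is to restart */
	unsigned long accounted;      /* ticks flushed so far */
};

/*
 * Flush the ticks elapsed since the snapshot. If the tick stays
 * stopped (restart_tick == false), re-arm the snapshot at the current
 * jiffies. Otherwise the snapshot is simply invalidated.
 */
static void model_flush(struct tick_model *ts, bool restart_tick)
{
	if (!ts->tick_stopped)
		return;

	ts->accounted += ts->jiffies - ts->saved_jiffies;
	if (restart_tick)
		ts->snapshot_valid = false;
	else
		ts->saved_jiffies = ts->jiffies;
}
```

Two back-to-back flushes with restart_tick == false account the
elapsed ticks only once in this model, which is the property a reader
of cputimes relies on.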

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 include/linux/cpuset.h   |    2 ++
 include/linux/tick.h     |    1 +
 kernel/cpuset.c          |   34 +++++++++++++++++++++++++++++++++-
 kernel/time/tick-sched.c |   21 ++++++++++++++++-----
 4 files changed, 52 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 89ef5f3..ccbc2fd 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -265,9 +265,11 @@ static inline bool cpuset_adaptive_nohz(void)
 }
 
 extern void cpuset_exit_nohz_interrupt(void *unused);
+extern void cpuset_nohz_flush_cputimes(void);
 #else
 static inline bool cpuset_cpu_adaptive_nohz(int cpu) { return false; }
 static inline bool cpuset_adaptive_nohz(void) { return false; }
+static inline void cpuset_nohz_flush_cputimes(void) { }
 
 #endif /* CONFIG_CPUSETS_NO_HZ */
 
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 598b492..3c31d6e 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -161,6 +161,7 @@ extern void tick_nohz_check_adaptive(void);
 extern void tick_nohz_pre_schedule(void);
 extern void tick_nohz_post_schedule(void);
 extern bool tick_nohz_account_tick(void);
+extern void tick_nohz_flush_current_times(bool restart_tick);
 #else /* !CPUSETS_NO_HZ */
 static inline void tick_nohz_enter_kernel(void) { }
 static inline void tick_nohz_exit_kernel(void) { }
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 00864a0..aa8304d 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -59,6 +59,7 @@
 #include <linux/mutex.h>
 #include <linux/workqueue.h>
 #include <linux/cgroup.h>
+#include <linux/tick.h>
 
 /*
  * Workqueue for cpuset related tasks.
@@ -1221,6 +1222,23 @@ static void cpuset_change_flag(struct task_struct *tsk,
 
 DEFINE_PER_CPU(int, cpu_adaptive_nohz_ref);
 
+static cpumask_t nohz_cpuset_mask;
+
+static void flush_cputime_interrupt(void *unused)
+{
+	tick_nohz_flush_current_times(false);
+}
+
+void cpuset_nohz_flush_cputimes(void)
+{
+	preempt_disable();
+	smp_call_function_many(&nohz_cpuset_mask, flush_cputime_interrupt,
+			       NULL, true);
+	preempt_enable();
+	/* Make the utime/stime updates visible */
+	smp_mb();
+}
+
 static void cpu_exit_nohz(int cpu)
 {
 	preempt_disable();
@@ -1245,7 +1263,15 @@ static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
 
 		val = per_cpu(cpu_adaptive_nohz_ref, cpu);
 
-		if (!val) {
+		if (val == 1) {
+			cpumask_set_cpu(cpu, &nohz_cpuset_mask);
+			/*
+			 * The mask update needs to be visible right away
+			 * so that this CPU is part of the cputime IPI
+			 * update right now.
+			 */
+			smp_mb();
+		} else if (!val) {
 			/*
 			 * The update to cpu_adaptive_nohz_ref must be
 			 * visible right away. So that once we restart the tick
@@ -1256,6 +1282,12 @@ static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
 			 */
 			smp_mb();
 			cpu_exit_nohz(cpu);
+			/*
+			 * Now that the tick has been restarted and cputimes
+			 * flushed, we no longer need to be part of the
+			 * cputime flush IPI.
+			 */
+			cpumask_clear_cpu(cpu, &nohz_cpuset_mask);
 		}
 	}
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index ff78126..6706a7d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -703,7 +703,6 @@ static void tick_nohz_account_ticks(struct tick_sched *ts)
 			WARN_ON_ONCE(1);
 		}
 	}
-	ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
 }
 
 /**
@@ -737,6 +736,7 @@ void tick_nohz_idle_exit(void)
 		__tick_nohz_restart_sched_tick(ts, now);
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 		tick_nohz_account_ticks(ts);
+		ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
 #endif
 	}
 
@@ -981,9 +981,7 @@ static void tick_do_timer_check_handler(int cpu)
 
 static void tick_nohz_restart_adaptive(void)
 {
-	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
-
-	tick_nohz_account_ticks(ts);
+	tick_nohz_flush_current_times(true);
 	tick_nohz_restart_sched_tick();
 	clear_thread_flag(TIF_NOHZ);
 }
@@ -1021,7 +1019,7 @@ void tick_nohz_pre_schedule(void)
 	 * on the prev task.
 	 */
 	if (ts->tick_stopped) {
-		tick_nohz_account_ticks(ts);
+		tick_nohz_flush_current_times(true);
 		clear_thread_flag(TIF_NOHZ);
 	}
 }
@@ -1038,6 +1036,19 @@ void tick_nohz_post_schedule(void)
 	if (ts->tick_stopped)
 		tick_nohz_restart_sched_tick();
 }
+
+void tick_nohz_flush_current_times(bool restart_tick)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (ts->tick_stopped) {
+		tick_nohz_account_ticks(ts);
+		if (restart_tick)
+			ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
+		else
+			ts->saved_jiffies = jiffies;
+	}
+}
 #else
 
 static void tick_do_timer_check_handler(int cpu)
-- 
1.7.5.4



* [PATCH 21/32] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (20 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 20/32] nohz/cpuset: New API to flush cputimes on nohz cpusets Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-27 14:10   ` Gilad Ben-Yossef
  2012-03-21 13:58 ` [PATCH 22/32] nohz/cpuset: Flush cputimes on procfs stat file read Frederic Weisbecker
                   ` (12 subsequent siblings)
  34 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

When we wait for a zombie task, flush the cputimes on nohz cpusets
in case we are waiting for a group leader that has threads running
on nohz CPUs. This way thread_group_times() doesn't report stale
values.

<doubts>
If I understood the code correctly, by the time we call thread_group_times()
we may have children that are still running, so this is necessary.
But I need to check this more thoroughly.
</doubts>

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 kernel/exit.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 4b4042f..c194662 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -52,6 +52,7 @@
 #include <linux/hw_breakpoint.h>
 #include <linux/oom.h>
 #include <linux/writeback.h>
+#include <linux/cpuset.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -1712,6 +1713,13 @@ repeat:
 	   (!wo->wo_pid || hlist_empty(&wo->wo_pid->tasks[wo->wo_type])))
 		goto notask;
 
+	/*
+	 * Flush cputime in sub-threads before adding them.
+	 * Must be called outside tasklist_lock because the write lock
+	 * can be acquired with irqs disabled.
+	 */
+	cpuset_nohz_flush_cputimes();
+
 	set_current_state(TASK_INTERRUPTIBLE);
 	read_lock(&tasklist_lock);
 	tsk = current;
-- 
1.7.5.4



* [PATCH 22/32] nohz/cpuset: Flush cputimes on procfs stat file read
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (21 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 21/32] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 23/32] nohz/cpuset: Flush cputimes for getrusage() and times() syscalls Frederic Weisbecker
                   ` (11 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

When we read a process's procfs stat file, we need
to flush the cputimes of the tasks running in nohz
cpusets in case some threads of the thread group are
running there.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 fs/proc/array.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index c602b8d..0dc88ad 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -397,6 +397,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	cutime = cstime = utime = stime = 0;
 	cgtime = gtime = 0;
 
+	/* For thread group times */
+	cpuset_nohz_flush_cputimes();
 	if (lock_task_sighand(task, &flags)) {
 		struct signal_struct *sig = task->signal;
 
-- 
1.7.5.4



* [PATCH 23/32] nohz/cpuset: Flush cputimes for getrusage() and times() syscalls
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (22 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 22/32] nohz/cpuset: Flush cputimes on procfs stat file read Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 24/32] x86: Syscall hooks for nohz cpusets Frederic Weisbecker
                   ` (10 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

Both syscalls need to iterate through the thread group to get
the cputimes. As some threads of the group may be running on
a nohz cpuset, we need to flush the cputimes there.
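
To make the affected readers concrete, here is a minimal user-space
sketch (not part of this patch) of the two calls in question. The helper
names process_utime_usec() and process_utime_ticks() are made up for
illustration; getrusage(RUSAGE_SELF) and times() are the standard POSIX
interfaces. Both report times summed over the whole thread group, which
is why a thread running tickless on another CPU could leave them stale:

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/times.h>

/*
 * Hypothetical helper: process-wide user CPU time in microseconds,
 * or -1 on error. getrusage(RUSAGE_SELF) covers all threads of the
 * calling process.
 */
static long long process_utime_usec(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return (long long)ru.ru_utime.tv_sec * 1000000LL + ru.ru_utime.tv_usec;
}

/*
 * Hypothetical helper: process-wide user CPU time in clock ticks,
 * or -1 on error, as reported by times().
 */
static long long process_utime_ticks(void)
{
	struct tms t;

	if (times(&t) == (clock_t)-1)
		return -1;
	return (long long)t.tms_utime;
}
```

A multi-threaded test that pins one busy thread to a nohz CPU and polls
these helpers from another thread would be one way to observe whether
the flush works, assuming such a setup is available.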

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 kernel/sys.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 4070153..5b3e880 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -45,6 +45,7 @@
 #include <linux/syscalls.h>
 #include <linux/kprobes.h>
 #include <linux/user_namespace.h>
+#include <linux/cpuset.h>
 
 #include <linux/kmsg_dump.h>
 /* Move somewhere else to avoid recompiling? */
@@ -950,6 +951,8 @@ void do_sys_times(struct tms *tms)
 {
 	cputime_t tgutime, tgstime, cutime, cstime;
 
+	cpuset_nohz_flush_cputimes();
+
 	spin_lock_irq(&current->sighand->siglock);
 	thread_group_times(current, &tgutime, &tgstime);
 	cutime = current->signal->cutime;
@@ -1614,6 +1617,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
 		goto out;
 	}
 
+	/* For thread_group_times */
+	cpuset_nohz_flush_cputimes();
+
 	if (!lock_task_sighand(p, &flags))
 		return;
 
-- 
1.7.5.4



* [PATCH 24/32] x86: Syscall hooks for nohz cpusets
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (23 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 23/32] nohz/cpuset: Flush cputimes for getrusage() and times() syscalls Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 25/32] x86: Exception " Frederic Weisbecker
                   ` (9 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

Add syscall hooks to notify syscall entry and exit on
CPUs running in adaptive nohz mode.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 arch/x86/include/asm/thread_info.h |   10 +++++++---
 arch/x86/kernel/ptrace.c           |   10 ++++++++++
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index cfd8144..0c1724e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -88,6 +88,7 @@ struct thread_info {
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* 32bit process */
 #define TIF_FORK		18	/* ret_from_fork */
+#define TIF_NOHZ		19	/* in nohz userspace mode */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_DEBUG		21	/* uses debug registers */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
@@ -110,6 +111,7 @@ struct thread_info {
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
+#define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_DEBUG		(1 << TIF_DEBUG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
@@ -120,12 +122,13 @@ struct thread_info {
 /* work to do in syscall_trace_enter() */
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
-	 _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)
+	 _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT |	\
+	 _TIF_NOHZ)
 
 /* work to do in syscall_trace_leave() */
 #define _TIF_WORK_SYSCALL_EXIT	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP |	\
-	 _TIF_SYSCALL_TRACEPOINT)
+	 _TIF_SYSCALL_TRACEPOINT | _TIF_NOHZ)
 
 /* work to do on interrupt/exception return */
 #define _TIF_WORK_MASK							\
@@ -135,7 +138,8 @@ struct thread_info {
 
 /* work to do on any return to user space */
 #define _TIF_ALLWORK_MASK						\
-	((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT)
+	((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT |	\
+	_TIF_NOHZ)
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5026738..2966791 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -21,6 +21,7 @@
 #include <linux/signal.h>
 #include <linux/perf_event.h>
 #include <linux/hw_breakpoint.h>
+#include <linux/tick.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -1369,6 +1370,9 @@ long syscall_trace_enter(struct pt_regs *regs)
 {
 	long ret = 0;
 
+	/* Notify nohz task syscall early so the rest can use rcu */
+	tick_nohz_enter_kernel();
+
 	/*
 	 * If we stepped into a sysenter/syscall insn, it trapped in
 	 * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
@@ -1427,4 +1431,10 @@ void syscall_trace_leave(struct pt_regs *regs)
 			!test_thread_flag(TIF_SYSCALL_EMU);
 	if (step || test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall_exit(regs, step);
+
+	/*
+	 * Notify the nohz task's syscall exit last so that everything
+	 * above can still use rcu.
+	 */
+	tick_nohz_exit_kernel();
 }
-- 
1.7.5.4



* [PATCH 25/32] x86: Exception hooks for nohz cpusets
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (24 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 24/32] x86: Syscall hooks for nohz cpusets Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 26/32] x86: Add adaptive tickless hooks on do_notify_resume() Frederic Weisbecker
                   ` (8 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

Add the necessary hooks to the x86 exception handlers for nohz
cpusets support. This covers traps, page faults, debug exceptions,
etc.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 arch/x86/Kconfig        |    1 +
 arch/x86/kernel/traps.c |   20 ++++++++++++++------
 arch/x86/mm/fault.c     |   13 +++++++++++--
 3 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5bed94e..0d3116c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -82,6 +82,7 @@ config X86
 	select CLKEVT_I8253
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select GENERIC_IOMAP
+	select HAVE_CPUSETS_NO_HZ
 
 config INSTRUCTION_DECODER
 	def_bool (KPROBES || PERF_EVENTS)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 4bbe04d..977d0b9 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/timer.h>
 #include <linux/init.h>
+#include <linux/tick.h>
 #include <linux/bug.h>
 #include <linux/nmi.h>
 #include <linux/mm.h>
@@ -301,15 +302,17 @@ gp_in_kernel:
 /* May run on IST stack. */
 dotraplinkage void __kprobes do_int3(struct pt_regs *regs, long error_code)
 {
+	tick_nohz_enter_exception(regs);
+
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
 			== NOTIFY_STOP)
-		return;
+		goto exit;
 #endif /* CONFIG_KGDB_LOW_LEVEL_TRAP */
 
 	if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
 			== NOTIFY_STOP)
-		return;
+		goto exit;
 
 	/*
 	 * Let others (NMI) know that the debug stack is in use
@@ -320,6 +323,8 @@ dotraplinkage void __kprobes do_int3(struct pt_regs *regs, long error_code)
 	do_trap(3, SIGTRAP, "int3", regs, error_code, NULL);
 	preempt_conditional_cli(regs);
 	debug_stack_usage_dec();
+exit:
+	tick_nohz_exit_exception(regs);
 }
 
 #ifdef CONFIG_X86_64
@@ -380,6 +385,8 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 	unsigned long dr6;
 	int si_code;
 
+	tick_nohz_enter_exception(regs);
+
 	get_debugreg(dr6, 6);
 
 	/* Filter out all the reserved bits which are preset to 1 */
@@ -395,7 +402,7 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 
 	/* Catch kmemcheck conditions first of all! */
 	if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
-		return;
+		goto exit;
 
 	/* DR6 may or may not be cleared by the CPU */
 	set_debugreg(0, 6);
@@ -410,7 +417,7 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 
 	if (notify_die(DIE_DEBUG, "debug", regs, PTR_ERR(&dr6), error_code,
 							SIGTRAP) == NOTIFY_STOP)
-		return;
+		goto exit;
 
 	/*
 	 * Let others (NMI) know that the debug stack is in use
@@ -426,7 +433,7 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 				error_code, 1);
 		preempt_conditional_cli(regs);
 		debug_stack_usage_dec();
-		return;
+		goto exit;
 	}
 
 	/*
@@ -447,7 +454,8 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 	preempt_conditional_cli(regs);
 	debug_stack_usage_dec();
 
-	return;
+exit:
+	tick_nohz_exit_exception(regs);
 }
 
 /*
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f0b4caf..6c4c983 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -13,6 +13,7 @@
 #include <linux/perf_event.h>		/* perf_sw_event		*/
 #include <linux/hugetlb.h>		/* hstate_index_to_shift	*/
 #include <linux/prefetch.h>		/* prefetchw			*/
+#include <linux/tick.h>
 
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
@@ -1000,8 +1001,8 @@ static int fault_in_kernel_space(unsigned long address)
  * and the problem, and then passes it off to one of the appropriate
  * routines.
  */
-dotraplinkage void __kprobes
-do_page_fault(struct pt_regs *regs, unsigned long error_code)
+static void __kprobes
+__do_page_fault(struct pt_regs *regs, unsigned long error_code)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -1209,3 +1210,11 @@ good_area:
 
 	up_read(&mm->mmap_sem);
 }
+
+dotraplinkage void __kprobes
+do_page_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	tick_nohz_enter_exception(regs);
+	__do_page_fault(regs, error_code);
+	tick_nohz_exit_exception(regs);
+}
-- 
1.7.5.4



* [PATCH 26/32] x86: Add adaptive tickless hooks on do_notify_resume()
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (25 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 25/32] x86: Exception " Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 27/32] nohz: Don't restart the tick before scheduling to idle Frederic Weisbecker
                   ` (7 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

Before resuming to userspace, we may fall into do_notify_resume()
to handle signals or other things. And because we may be coming
from syscall/exception or interrupt exit, we may be running in
RCU idle mode as we resume tickless to userspace.

However, do_notify_resume() may make use of RCU read side critical
sections, so we need to exit RCU idle mode before doing anything in
that path.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 arch/x86/kernel/signal.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 46a01bd..577fd93 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/personality.h>
 #include <linux/uaccess.h>
 #include <linux/user-return-notifier.h>
+#include <linux/tick.h>
 
 #include <asm/processor.h>
 #include <asm/ucontext.h>
@@ -810,6 +811,7 @@ static void do_signal(struct pt_regs *regs)
 void
 do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 {
+	tick_nohz_enter_kernel();
 #ifdef CONFIG_X86_MCE
 	/* notify userspace of pending MCEs */
 	if (thread_info_flags & _TIF_MCE_NOTIFY)
@@ -832,6 +834,7 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 #ifdef CONFIG_X86_32
 	clear_thread_flag(TIF_IRET);
 #endif /* CONFIG_X86_32 */
+	tick_nohz_exit_kernel();
 }
 
 void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
-- 
1.7.5.4



* [PATCH 27/32] nohz: Don't restart the tick before scheduling to idle
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (26 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 26/32] x86: Add adaptive tickless hooks on do_notify_resume() Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 28/32] rcu: New rcu_user_enter() and rcu_user_exit() APIs Frederic Weisbecker
                   ` (6 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

If we were running adaptive tickless but then we schedule out and
enter the idle task, we don't need to restart the tick because
tick_nohz_idle_enter() is going to be called right away.

The only thing we need to do is to save the jiffies such that
when we later restart the tick we can account the CPU time spent
while idle was tickless.
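
As a rough illustration of the idea, here is a small user-space model
(not the kernel code) of the snapshot-then-account scheme. struct
tick_state, snapshot_jiffies() and account_since_snapshot() are
hypothetical names; only saved_jiffies, saved_jiffies_whence,
JIFFIES_SAVED_NONE and JIFFIES_SAVED_IDLE come from this series:

```c
#include <assert.h>

/* Hypothetical user-space model; not the kernel's struct tick_sched. */
enum jiffies_whence {
	JIFFIES_SAVED_NONE,
	JIFFIES_SAVED_IDLE,
};

struct tick_state {
	unsigned long saved_jiffies;
	enum jiffies_whence saved_jiffies_whence;
	unsigned long idle_ticks;	/* accumulated idle time, in jiffies */
};

/* Record the point at which we entered idle with the tick already stopped. */
static void snapshot_jiffies(struct tick_state *ts, unsigned long now)
{
	ts->saved_jiffies = now;
	ts->saved_jiffies_whence = JIFFIES_SAVED_IDLE;
}

/*
 * When the tick is restarted later, charge the elapsed jiffies to idle
 * and clear the snapshot. Returns the number of jiffies accounted.
 */
static unsigned long account_since_snapshot(struct tick_state *ts,
					    unsigned long now)
{
	unsigned long delta = 0;

	if (ts->saved_jiffies_whence == JIFFIES_SAVED_IDLE) {
		/* Unsigned subtraction stays correct across wraparound. */
		delta = now - ts->saved_jiffies;
		ts->idle_ticks += delta;
	}
	ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
	return delta;
}
```

In this model nothing is accounted when no snapshot was taken, which
mirrors the intent of resetting the whence field once the time has been
flushed.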

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 kernel/time/tick-sched.c |   18 +++++++++++-------
 1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6706a7d..4b9bdfb 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1027,14 +1027,18 @@ void tick_nohz_pre_schedule(void)
 void tick_nohz_post_schedule(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+	unsigned long flags;
 
-	/*
-	 * No need to disable irqs here. The worst that can happen
-	 * is an irq that comes and restart the tick before us.
-	 * tick_nohz_restart_sched_tick() is irq safe.
-	 */
-	if (ts->tick_stopped)
-		tick_nohz_restart_sched_tick();
+	local_irq_save(flags);
+	if (ts->tick_stopped) {
+		if (is_idle_task(current)) {
+			ts->saved_jiffies = jiffies;
+			ts->saved_jiffies_whence = JIFFIES_SAVED_IDLE;
+		} else {
+			tick_nohz_restart_sched_tick();
+		}
+	}
+	local_irq_restore(flags);
 }
 
 void tick_nohz_flush_current_times(bool restart_tick)
-- 
1.7.5.4



* [PATCH 28/32] rcu: New rcu_user_enter() and rcu_user_exit() APIs
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (27 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 27/32] nohz: Don't restart the tick before scheduling to idle Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 29/32] rcu: New rcu_user_enter_irq() and rcu_user_exit_irq() APIs Frederic Weisbecker
                   ` (5 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

These two APIs are provided to help the implementation
of an adaptive tickless kernel (cf: nohz cpusets). We need
to enter an RCU extended quiescent state when we are in
userland so that a tickless CPU is not involved in the
global RCU state machine and can shut down its tick safely.

These APIs are called from syscall and exception entry/exit
points and can't be called from interrupt context.

They are essentially the same as rcu_idle_enter() and
rcu_idle_exit() minus the checks that ensure the CPU is
running the idle task.
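
The relationship between the idle and user variants can be sketched
with a toy user-space model (again, not the kernel code). struct
dynticks_model and the model_*() functions are hypothetical names. The
even/odd convention of the dynticks counter and the nesting values
follow the WARN_ON_ONCE() checks and assignments in the diff below;
the starting state in the comment is an assumption of the model:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical per-CPU model; not the kernel's struct rcu_dynticks. */
struct dynticks_model {
	long long nesting;	/* 0 while in the extended quiescent state */
	unsigned int dynticks;	/* even: extended QS, odd: RCU is watching */
	bool running_idle_task;
};

/*
 * Assumed starting state for the model: RCU watching, i.e.
 * nesting = LLONG_MAX / 2 and dynticks odd.
 */

/* Shared entry path, used by both the idle and the user variant. */
static void model_eqs_enter(struct dynticks_model *m)
{
	m->nesting = 0;
	m->dynticks++;
	assert((m->dynticks & 0x1) == 0);	/* must now be even */
}

/* Shared exit path. */
static void model_eqs_exit(struct dynticks_model *m)
{
	assert(m->nesting == 0);
	m->nesting = LLONG_MAX / 2;
	m->dynticks++;
	assert((m->dynticks & 0x1) == 1);	/* must now be odd */
}

/*
 * Idle variant: same transition plus the "must be the idle task" check.
 * Returns false where the real code would emit a warning.
 */
static bool model_idle_enter(struct dynticks_model *m)
{
	model_eqs_enter(m);
	return m->running_idle_task;
}

/* User variant: the transition alone, without the idle task check. */
static void model_user_enter(struct dynticks_model *m)
{
	model_eqs_enter(m);
}
```

Splitting the check out of the common path is what lets a non-idle
task enter the extended quiescent state without tripping the warning.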

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 include/linux/rcupdate.h |    5 ++
 kernel/rcutree.c         |  107 ++++++++++++++++++++++++++++++++-------------
 2 files changed, 81 insertions(+), 31 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index e06639e..6539290 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -191,6 +191,11 @@ extern void rcu_idle_exit(void);
 extern void rcu_irq_enter(void);
 extern void rcu_irq_exit(void);
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+void rcu_user_enter(void);
+void rcu_user_exit(void);
+#endif
+
 /*
  * Infrastructure to implement the synchronize_() primitives in
  * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index b8d300c..cba1332 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -357,16 +357,8 @@ static int rcu_implicit_offline_qs(struct rcu_data *rdp)
 
 #endif /* #ifdef CONFIG_SMP */
 
-/*
- * rcu_idle_enter_common - inform RCU that current CPU is moving towards idle
- *
- * If the new value of the ->dynticks_nesting counter now is zero,
- * we really have entered idle, and must do the appropriate accounting.
- * The caller must have disabled interrupts.
- */
-static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
+static void rcu_check_idle_enter(long long oldval)
 {
-	trace_rcu_dyntick("Start", oldval, 0);
 	if (!is_idle_task(current)) {
 		struct task_struct *idle = idle_task(smp_processor_id());
 
@@ -376,6 +368,18 @@ static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
 			  current->pid, current->comm,
 			  idle->pid, idle->comm); /* must be idle task! */
 	}
+}
+
+/*
+ * rcu_idle_enter_common - inform RCU that current CPU is moving towards idle
+ *
+ * If the new value of the ->dynticks_nesting counter now is zero,
+ * we really have entered idle, and must do the appropriate accounting.
+ * The caller must have disabled interrupts.
+ */
+static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
+{
+	trace_rcu_dyntick("Start", oldval, 0);
 	rcu_prepare_for_idle(smp_processor_id());
 	/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
 	smp_mb__before_atomic_inc();  /* See above. */
@@ -384,6 +388,22 @@ static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
 	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
 }
 
+static long long __rcu_idle_enter(void)
+{
+	unsigned long flags;
+	long long oldval;
+	struct rcu_dynticks *rdtp;
+
+	local_irq_save(flags);
+	rdtp = &__get_cpu_var(rcu_dynticks);
+	oldval = rdtp->dynticks_nesting;
+	rdtp->dynticks_nesting = 0;
+	rcu_idle_enter_common(rdtp, oldval);
+	local_irq_restore(flags);
+
+	return oldval;
+}
+
 /**
  * rcu_idle_enter - inform RCU that current CPU is entering idle
  *
@@ -398,16 +418,15 @@ static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
  */
 void rcu_idle_enter(void)
 {
-	unsigned long flags;
 	long long oldval;
-	struct rcu_dynticks *rdtp;
 
-	local_irq_save(flags);
-	rdtp = &__get_cpu_var(rcu_dynticks);
-	oldval = rdtp->dynticks_nesting;
-	rdtp->dynticks_nesting = 0;
-	rcu_idle_enter_common(rdtp, oldval);
-	local_irq_restore(flags);
+	oldval = __rcu_idle_enter();
+	rcu_check_idle_enter(oldval);
+}
+
+void rcu_user_enter(void)
+{
+	__rcu_idle_enter();
 }
 
 /**
@@ -437,6 +456,7 @@ void rcu_irq_exit(void)
 	oldval = rdtp->dynticks_nesting;
 	rdtp->dynticks_nesting--;
 	WARN_ON_ONCE(rdtp->dynticks_nesting < 0);
+
 	if (rdtp->dynticks_nesting)
 		trace_rcu_dyntick("--=", oldval, rdtp->dynticks_nesting);
 	else
@@ -444,6 +464,20 @@ void rcu_irq_exit(void)
 	local_irq_restore(flags);
 }
 
+static void rcu_check_idle_exit(struct rcu_dynticks *rdtp, long long oldval)
+{
+	if (!is_idle_task(current)) {
+		struct task_struct *idle = idle_task(smp_processor_id());
+
+		trace_rcu_dyntick("Error on exit: not idle task",
+				  oldval, rdtp->dynticks_nesting);
+		ftrace_dump(DUMP_ALL);
+		WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",
+			  current->pid, current->comm,
+			  idle->pid, idle->comm); /* must be idle task! */
+	}
+}
+
 /*
  * rcu_idle_exit_common - inform RCU that current CPU is moving away from idle
  *
@@ -460,16 +494,18 @@ static void rcu_idle_exit_common(struct rcu_dynticks *rdtp, long long oldval)
 	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
 	rcu_cleanup_after_idle(smp_processor_id());
 	trace_rcu_dyntick("End", oldval, rdtp->dynticks_nesting);
-	if (!is_idle_task(current)) {
-		struct task_struct *idle = idle_task(smp_processor_id());
+}
 
-		trace_rcu_dyntick("Error on exit: not idle task",
-				  oldval, rdtp->dynticks_nesting);
-		ftrace_dump(DUMP_ALL);
-		WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",
-			  current->pid, current->comm,
-			  idle->pid, idle->comm); /* must be idle task! */
-	}
+static long long __rcu_idle_exit(struct rcu_dynticks *rdtp)
+{
+	long long oldval;
+
+	oldval = rdtp->dynticks_nesting;
+	WARN_ON_ONCE(oldval != 0);
+	rdtp->dynticks_nesting = LLONG_MAX / 2;
+	rcu_idle_exit_common(rdtp, oldval);
+
+	return oldval;
 }
 
 /**
@@ -485,16 +521,25 @@ static void rcu_idle_exit_common(struct rcu_dynticks *rdtp, long long oldval)
  */
 void rcu_idle_exit(void)
 {
+	long long oldval;
+	struct rcu_dynticks *rdtp;
 	unsigned long flags;
+
+	local_irq_save(flags);
+	rdtp = &__get_cpu_var(rcu_dynticks);
+	oldval = __rcu_idle_exit(rdtp);
+	rcu_check_idle_exit(rdtp, oldval);
+	local_irq_restore(flags);
+}
+
+void rcu_user_exit(void)
+{
 	struct rcu_dynticks *rdtp;
-	long long oldval;
+	unsigned long flags;
 
 	local_irq_save(flags);
 	rdtp = &__get_cpu_var(rcu_dynticks);
-	oldval = rdtp->dynticks_nesting;
-	WARN_ON_ONCE(oldval != 0);
-	rdtp->dynticks_nesting = DYNTICK_TASK_NESTING;
-	rcu_idle_exit_common(rdtp, oldval);
+	__rcu_idle_exit(rdtp);
 	local_irq_restore(flags);
 }
 
-- 
1.7.5.4



* [PATCH 29/32] rcu: New rcu_user_enter_irq() and rcu_user_exit_irq() APIs
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (28 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 28/32] rcu: New rcu_user_enter() and rcu_user_exit() APIs Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 30/32] rcu: Switch to extended quiescent state in userspace from nohz cpuset Frederic Weisbecker
                   ` (4 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

A CPU running in adaptive tickless mode wants to enter into
RCU extended quiescent state while running in userspace. This
way we can shut down the tick that is usually needed on each
CPU for the needs of RCU.

Typically, RCU enters the extended quiescent state when we resume
to userspace through a syscall or exception exit. This is done
using rcu_user_enter(). RCU then exits this state by calling
rcu_user_exit() from syscall or exception entry.

However there are two other points where we may want to enter
or exit this state. Some remote CPU may require a tickless CPU
to restart its tick for any reason and send it an IPI for
this purpose. As we restart the tick, we don't want to resume
from the IPI in RCU extended quiescent state anymore.
Similarly we may stop the tick from an interrupt in userspace and
we need to be able to enter RCU extended quiescent state when we
resume from this interrupt to userspace.

To these ends, we provide two new APIs:

- rcu_user_enter_irq(). This must be called from a non-nesting
interrupt between rcu_irq_enter() and rcu_irq_exit().
After the irq calls rcu_irq_exit(), we'll run in RCU extended
quiescent state.

- rcu_user_exit_irq(). This must be called from a non-nesting
interrupt, interrupting an RCU extended quiescent state, and
between rcu_irq_enter() and rcu_irq_exit(). After the irq calls
rcu_irq_exit(), we won't resume the RCU extended quiescent
state.
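
To illustrate the nesting trick behind these two APIs, here is a small
userspace sketch. It is only a toy model and not kernel code: the
struct, the model_*() names and the in_eqs flag are made up for the
illustration. The only parts taken from this patch are the values
written to dynticks_nesting (1, and (LLONG_MAX / 2) + 1).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Toy stand-in for the per-CPU rcu_dynticks state; hypothetical names. */
struct model_rdtp {
	long long nesting;	/* models dynticks_nesting */
	bool in_eqs;		/* true while in extended quiescent state */
};

/* Task-level nesting value, as used by __rcu_idle_exit() in patch 28. */
#define MODEL_TASK_NESTING (LLONG_MAX / 2)

/* Models rcu_irq_enter(): leave EQS if the irq interrupted it. */
static void model_irq_enter(struct model_rdtp *r)
{
	if (r->nesting++ == 0)
		r->in_eqs = false;
}

/* Models rcu_irq_exit(): enter EQS when nesting drops to 0. */
static void model_irq_exit(struct model_rdtp *r)
{
	assert(r->nesting > 0);
	if (--r->nesting == 0)
		r->in_eqs = true;
}

/* Models rcu_user_enter_irq(): the pending irq exit will land on 0. */
static void model_user_enter_irq(struct model_rdtp *r)
{
	r->nesting = 1;
}

/* Models rcu_user_exit_irq(): the pending irq exit will land on the
 * task-level value instead of 0, so EQS is not resumed. */
static void model_user_exit_irq(struct model_rdtp *r)
{
	r->nesting = MODEL_TASK_NESTING + 1;
}
```

In this model, an irq that calls model_user_enter_irq() ends in the
extended quiescent state once model_irq_exit() runs, and an irq that
calls model_user_exit_irq() ends with the task-level nesting value.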

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 include/linux/rcupdate.h |    2 ++
 kernel/rcutree.c         |   24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 6539290..3cf1d51 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -194,6 +194,8 @@ extern void rcu_irq_exit(void);
 #ifdef CONFIG_CPUSETS_NO_HZ
 void rcu_user_enter(void);
 void rcu_user_exit(void);
+void rcu_user_enter_irq(void);
+void rcu_user_exit_irq(void);
 #endif
 
 /*
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index cba1332..2adc5a0 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -429,6 +429,18 @@ void rcu_user_enter(void)
 	__rcu_idle_enter();
 }
 
+void rcu_user_enter_irq(void)
+{
+	unsigned long flags;
+	struct rcu_dynticks *rdtp;
+
+	local_irq_save(flags);
+	rdtp = &__get_cpu_var(rcu_dynticks);
+	WARN_ON_ONCE(rdtp->dynticks_nesting == 1);
+	rdtp->dynticks_nesting = 1;
+	local_irq_restore(flags);
+}
+
 /**
  * rcu_irq_exit - inform RCU that current CPU is exiting irq towards idle
  *
@@ -543,6 +555,18 @@ void rcu_user_exit(void)
 	local_irq_restore(flags);
 }
 
+void rcu_user_exit_irq(void)
+{
+	unsigned long flags;
+	struct rcu_dynticks *rdtp;
+
+	local_irq_save(flags);
+	rdtp = &__get_cpu_var(rcu_dynticks);
+	WARN_ON_ONCE(rdtp->dynticks_nesting == 0);
+	rdtp->dynticks_nesting = (LLONG_MAX / 2) + 1;
+	local_irq_restore(flags);
+}
+
 /**
  * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
  *
-- 
1.7.5.4



* [PATCH 30/32] rcu: Switch to extended quiescent state in userspace from nohz cpuset
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (29 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 29/32] rcu: New rcu_user_enter_irq() and rcu_user_exit_irq() APIs Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 31/32] nohz: Exit RCU idle mode when we schedule before resuming userspace Frederic Weisbecker
                   ` (3 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

When we switch to adaptive nohz mode and we run in userspace,
we can still receive IPIs from the RCU core if a grace period
has been started by another CPU, because we need to take part
in its completion.

However, running in userspace is similar to running in idle
because we don't make use of RCU there, so we can be considered
as running in RCU extended quiescent state. The benefit of
running in that mode is that we are no longer disturbed by
needless IPIs coming from the RCU core.

To achieve this, we just need to use the RCU extended quiescent
state APIs at the following points:

- kernel exit or tick stop in userspace: here we switch to extended
quiescent state because we run in userspace without the tick.

- kernel entry or tick restart: here we exit the extended quiescent
state because either we enter the kernel and may use RCU read side
critical sections at any time, or we need the timer tick for some
reason and the tick takes care of RCU grace periods in the traditional way.
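
The pairing of these two points through the per-CPU nohz_task_ext_qs
flag can be sketched in userspace as below. This is a toy model under
stated assumptions: the model_*() names and the counters are
hypothetical, and the early return while the tick is still running is
an assumption about code that the diff context does not show.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU model; the counters stand in for the real RCU calls. */
struct model_cpu {
	bool tick_stopped;	/* models ts->tick_stopped */
	int ext_qs;		/* models the per-CPU nohz_task_ext_qs flag */
	int eqs_enters;		/* number of modelled rcu_user_enter() calls */
	int eqs_exits;		/* number of modelled rcu_user_exit() calls */
};

/* Models tick_nohz_exit_kernel(): we are about to resume userspace. */
static void model_exit_kernel(struct model_cpu *c)
{
	if (!c->tick_stopped)
		return;		/* assumption: the tick still serves RCU */
	c->ext_qs = 1;
	c->eqs_enters++;
}

/* Models tick_nohz_enter_kernel(): syscall or exception entry. */
static void model_enter_kernel(struct model_cpu *c)
{
	if (c->ext_qs == 1) {
		c->ext_qs = 0;
		c->eqs_exits++;
	}
	/* every exit must pair with an earlier enter */
	assert(c->eqs_exits <= c->eqs_enters);
}
```

The flag makes the exit idempotent: a second kernel entry without an
intervening kernel exit does not call the RCU exit hook again.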

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 include/linux/tick.h     |    3 +++
 kernel/time/tick-sched.c |   27 +++++++++++++++++++++++++--
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 3c31d6e..e2a49ad 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -153,6 +153,8 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 # endif /* !NO_HZ */
 
 #ifdef CONFIG_CPUSETS_NO_HZ
+DECLARE_PER_CPU(int, nohz_task_ext_qs);
+
 extern void tick_nohz_enter_kernel(void);
 extern void tick_nohz_exit_kernel(void);
 extern void tick_nohz_enter_exception(struct pt_regs *regs);
@@ -160,6 +162,7 @@ extern void tick_nohz_exit_exception(struct pt_regs *regs);
 extern void tick_nohz_check_adaptive(void);
 extern void tick_nohz_pre_schedule(void);
 extern void tick_nohz_post_schedule(void);
+extern void tick_nohz_cpu_exit_qs(void);
 extern bool tick_nohz_account_tick(void);
 extern void tick_nohz_flush_current_times(bool restart_tick);
 #else /* !CPUSETS_NO_HZ */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4b9bdfb..6c66977 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -565,10 +565,13 @@ static void tick_nohz_cpuset_stop_tick(struct tick_sched *ts)
 
 	if (!was_stopped && ts->tick_stopped) {
 		WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_NONE);
-		if (user)
+		if (user) {
 			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
-		else if (!current->mm)
+			__get_cpu_var(nohz_task_ext_qs) = 1;
+			rcu_user_enter_irq();
+		} else if (!current->mm) {
 			ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
+		}
 
 		ts->saved_jiffies = jiffies;
 		set_thread_flag(TIF_NOHZ);
@@ -897,6 +900,8 @@ void tick_check_idle(int cpu)
 }
 
 #ifdef CONFIG_CPUSETS_NO_HZ
+DEFINE_PER_CPU(int, nohz_task_ext_qs);
+
 void tick_nohz_exit_kernel(void)
 {
 	unsigned long flags;
@@ -920,6 +925,9 @@ void tick_nohz_exit_kernel(void)
 	ts->saved_jiffies = jiffies;
 	ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
 
+	__get_cpu_var(nohz_task_ext_qs) = 1;
+	rcu_user_enter();
+
 	local_irq_restore(flags);
 }
 
@@ -938,6 +946,11 @@ void tick_nohz_enter_kernel(void)
 		return;
 	}
 
+	if (__get_cpu_var(nohz_task_ext_qs) == 1) {
+		__get_cpu_var(nohz_task_ext_qs) = 0;
+		rcu_user_exit();
+	}
+
 	WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_USER);
 
 	delta_jiffies = jiffies - ts->saved_jiffies;
@@ -949,6 +962,14 @@ void tick_nohz_enter_kernel(void)
 	local_irq_restore(flags);
 }
 
+void tick_nohz_cpu_exit_qs(void)
+{
+	if (__get_cpu_var(nohz_task_ext_qs)) {
+		rcu_user_exit_irq();
+		__get_cpu_var(nohz_task_ext_qs) = 0;
+	}
+}
+
 void tick_nohz_enter_exception(struct pt_regs *regs)
 {
 	if (user_mode(regs))
@@ -984,6 +1005,7 @@ static void tick_nohz_restart_adaptive(void)
 	tick_nohz_flush_current_times(true);
 	tick_nohz_restart_sched_tick();
 	clear_thread_flag(TIF_NOHZ);
+	tick_nohz_cpu_exit_qs();
 }
 
 void tick_nohz_check_adaptive(void)
@@ -1021,6 +1043,7 @@ void tick_nohz_pre_schedule(void)
 	if (ts->tick_stopped) {
 		tick_nohz_flush_current_times(true);
 		clear_thread_flag(TIF_NOHZ);
+		/* FIXME: warn if we are in RCU idle mode */
 	}
 }
 
-- 
1.7.5.4



* [PATCH 31/32] nohz: Exit RCU idle mode when we schedule before resuming userspace
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (30 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 30/32] rcu: Switch to extended quiescent state in userspace from nohz cpuset Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-21 13:58 ` [PATCH 32/32] nohz/cpuset: Disable under some configs Frederic Weisbecker
                   ` (2 subsequent siblings)
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

When a CPU running tickless resumes userspace, it enters
RCU idle mode. But if we are preempted on kernel exit through an
explicit call to schedule(), after we entered RCU idle mode but
before we actually resumed userspace, we need to re-enable RCU in
case schedule() makes use of RCU read side critical sections,
and also for the next task to be scheduled.

NOTE: If we are preempted while running adaptive tickless, it means
we will receive an IPI that will exit the RCU idle mode for us. So
this patch is useful only when such an IPI arrives too late.
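
The race described in the NOTE can be sketched with a toy userspace
model. The model_*() names and counters are hypothetical; only the
irq/non-irq dispatch and the "exit the quiescent state, then
schedule" ordering come from this patch.

```c
#include <assert.h>
#include <stdbool.h>

struct model_state {
	int ext_qs;	/* models nohz_task_ext_qs */
	int task_exits;	/* modelled rcu_user_exit() calls */
	int irq_exits;	/* modelled rcu_user_exit_irq() calls */
	int schedules;	/* modelled schedule() calls */
};

/* Models tick_nohz_cpu_exit_qs(bool irq). */
static void model_cpu_exit_qs(struct model_state *s, bool irq)
{
	if (!s->ext_qs)
		return;
	if (irq)
		s->irq_exits++;
	else
		s->task_exits++;
	s->ext_qs = 0;
}

/* Models schedule_user(): leave RCU idle mode first, then schedule. */
static void model_schedule_user(struct model_state *s)
{
	model_cpu_exit_qs(s, false);
	assert(!s->ext_qs);	/* RCU must be watching before we schedule */
	s->schedules++;
}
```

If the IPI has already exited the quiescent state (the irq == true
path), the call made from the schedule path finds the flag clear and
does nothing. Otherwise the schedule path does the exit itself.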

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 arch/x86/kernel/entry_64.S |    8 ++++----
 include/linux/tick.h       |    3 ++-
 kernel/sched/core.c        |   14 ++++++++++++++
 kernel/time/tick-sched.c   |    9 ++++++---
 4 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 54f269c..c86d963 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -522,7 +522,7 @@ sysret_careful:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	pushq_cfi %rdi
-	call schedule
+	call schedule_user
 	popq_cfi %rdi
 	jmp sysret_check
 
@@ -630,7 +630,7 @@ int_careful:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	pushq_cfi %rdi
-	call schedule
+	call schedule_user
 	popq_cfi %rdi
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
@@ -898,7 +898,7 @@ retint_careful:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	pushq_cfi %rdi
-	call  schedule
+	call  schedule_user
 	popq_cfi %rdi
 	GET_THREAD_INFO(%rcx)
 	DISABLE_INTERRUPTS(CLBR_NONE)
@@ -1398,7 +1398,7 @@ paranoid_userspace:
 paranoid_schedule:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_ANY)
-	call schedule
+	call schedule_user
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
 	jmp paranoid_userspace
diff --git a/include/linux/tick.h b/include/linux/tick.h
index e2a49ad..93add37 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -162,7 +162,7 @@ extern void tick_nohz_exit_exception(struct pt_regs *regs);
 extern void tick_nohz_check_adaptive(void);
 extern void tick_nohz_pre_schedule(void);
 extern void tick_nohz_post_schedule(void);
-extern void tick_nohz_cpu_exit_qs(void);
+extern void tick_nohz_cpu_exit_qs(bool irq);
 extern bool tick_nohz_account_tick(void);
 extern void tick_nohz_flush_current_times(bool restart_tick);
 #else /* !CPUSETS_NO_HZ */
@@ -173,6 +173,7 @@ static inline void tick_nohz_exit_exception(struct pt_regs *regs) { }
 static inline void tick_nohz_check_adaptive(void) { }
 static inline void tick_nohz_pre_schedule(void) { }
 static inline void tick_nohz_post_schedule(void) { }
+static inline void tick_nohz_cpu_exit_qs(bool irq) { }
 static inline bool tick_nohz_account_tick(void) { return false; }
 #endif /* CPUSETS_NO_HZ */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5debfd7..cd4cb58 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3358,6 +3358,20 @@ int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 }
 #endif
 
+asmlinkage void __sched schedule_user(void)
+{
+	/*
+	 * We may arrive here before resuming userspace.
+	 * If we are running tickless, RCU may be in idle
+	 * mode. We need to reenable RCU for the next task
+	 * and also in case schedule() makes use of RCU itself.
+	 */
+	preempt_disable();
+	tick_nohz_cpu_exit_qs(false);
+	preempt_enable_no_resched();
+	schedule();
+}
+
 #ifdef CONFIG_PREEMPT
 /*
  * this is the entry point to schedule() from in-kernel preemption
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6c66977..8b6a21b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -962,10 +962,13 @@ void tick_nohz_enter_kernel(void)
 	local_irq_restore(flags);
 }
 
-void tick_nohz_cpu_exit_qs(void)
+void tick_nohz_cpu_exit_qs(bool irq)
 {
 	if (__get_cpu_var(nohz_task_ext_qs)) {
-		rcu_user_exit_irq();
+		if (irq)
+			rcu_user_exit_irq();
+		else
+			rcu_user_exit();
 		__get_cpu_var(nohz_task_ext_qs) = 0;
 	}
 }
@@ -1005,7 +1008,7 @@ static void tick_nohz_restart_adaptive(void)
 	tick_nohz_flush_current_times(true);
 	tick_nohz_restart_sched_tick();
 	clear_thread_flag(TIF_NOHZ);
-	tick_nohz_cpu_exit_qs();
+	tick_nohz_cpu_exit_qs(true);
 }
 
 void tick_nohz_check_adaptive(void)
-- 
1.7.5.4



* [PATCH 32/32] nohz/cpuset: Disable under some configs
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (31 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 31/32] nohz: Exit RCU idle mode when we schedule before resuming userspace Frederic Weisbecker
@ 2012-03-21 13:58 ` Frederic Weisbecker
  2012-03-27 15:02 ` [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Gilad Ben-Yossef
  2012-03-30  0:33 ` Kevin Hilman
  34 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-21 13:58 UTC (permalink / raw)
  To: LKML, linaro-sched-sig
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

This shows the various things that are not yet handled by
the nohz cpusets: perf events, irq work, irq time accounting.

But there are further things that have yet to be handled:
sched clock tick, runqueue clock, sched_class::task_tick(),
cpu load, complete handling of cputimes, ...

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zen Lin <zen@openhuawei.org>
---
 init/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 7cdb8be..3080b16 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -640,7 +640,7 @@ config PROC_PID_CPUSET
 
 config CPUSETS_NO_HZ
        bool "Tickless cpusets"
-       depends on CPUSETS && HAVE_CPUSETS_NO_HZ && NO_HZ && HIGH_RES_TIMERS
+       depends on CPUSETS && HAVE_CPUSETS_NO_HZ && NO_HZ && HIGH_RES_TIMERS && !IRQ_TIME_ACCOUNTING
        help
          This options let you apply a nohz property to a cpuset such
 	 that the periodic timer tick tries to be avoided when possible on
-- 
1.7.5.4



* Re: [PATCH 07/32] cpuset: Set up interface for nohz flag
  2012-03-21 13:58 ` [PATCH 07/32] cpuset: Set up interface for nohz flag Frederic Weisbecker
@ 2012-03-21 14:50   ` Christoph Lameter
  2012-03-22  4:03     ` Mike Galbraith
  2012-03-27 11:19     ` Frederic Weisbecker
  0 siblings, 2 replies; 96+ messages in thread
From: Christoph Lameter @ 2012-03-21 14:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, 21 Mar 2012, Frederic Weisbecker wrote:

> Prepare the interface to implement the nohz cpuset flag.
> This flag, once set, will tell the system to try to
> shutdown the periodic timer tick when possible.
>
> We use here a per cpu refcounter. As long as a CPU
> is contained into at least one cpuset that has the
> nohz flag set, it is part of the set of CPUs that
> run into adaptive nohz mode.

What are the drawbacks for nohz?

If there are none: Can we make nohz default behavior without relying on
cpusets?



* Re: [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
  2012-03-21 13:58 ` [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu Frederic Weisbecker
@ 2012-03-21 14:52   ` Christoph Lameter
  2012-03-27 10:50     ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-03-21 14:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, 21 Mar 2012, Frederic Weisbecker wrote:

> Try to give the timekeeping duty to a CPU that doesn't belong
> to any nohz cpuset when possible, so that we increase the chance
> for these nohz cpusets to run their CPUs out of periodic tick
> mode.

Any way to manually specify which cpu? We f.e. always "sacrifice" cpu 0
for OS activities. We would like to have all OS processing
restricted to cpu 0 so that the rest of the processors do not experience
the OS noise.


* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-21 13:58 ` [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it Frederic Weisbecker
@ 2012-03-21 14:54   ` Christoph Lameter
  2012-03-22  7:38     ` Gilad Ben-Yossef
  2012-03-27 12:13     ` Frederic Weisbecker
  0 siblings, 2 replies; 96+ messages in thread
From: Christoph Lameter @ 2012-03-21 14:54 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, 21 Mar 2012, Frederic Weisbecker wrote:

> If RCU is waiting for the current CPU to complete a grace
> period, don't turn off the tick. Unlike dyntick-idle, we
> are not necessarily going to enter into rcu extended quiescent
> state, so we may need to keep the tick to note current CPU's
> quiescent states.

Is there any way for userspace to know that the tick is not off yet due to
this? It would make sense for us to have a busy loop in user space that
waits until the OS has completed all processing if that avoids future
latencies for the application.



* Re: [PATCH 07/32] cpuset: Set up interface for nohz flag
  2012-03-21 14:50   ` Christoph Lameter
@ 2012-03-22  4:03     ` Mike Galbraith
  2012-03-22 16:26       ` Christoph Lameter
  2012-03-27 11:22       ` Frederic Weisbecker
  2012-03-27 11:19     ` Frederic Weisbecker
  1 sibling, 2 replies; 96+ messages in thread
From: Mike Galbraith @ 2012-03-22  4:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, 2012-03-21 at 09:50 -0500, Christoph Lameter wrote: 
> On Wed, 21 Mar 2012, Frederic Weisbecker wrote:
> 
> > Prepare the interface to implement the nohz cpuset flag.
> > This flag, once set, will tell the system to try to
> > shutdown the periodic timer tick when possible.
> >
> > We use here a per cpu refcounter. As long as a CPU
> > is contained into at least one cpuset that has the
> > nohz flag set, it is part of the set of CPUs that
> > run into adaptive nohz mode.
> 
> What are the drawbacks for nohz?

For nohz in general, latency.  To make it at all usable for rt loads, I
had to make isolated cores immune from playing load balancer.  Even so,
to achieve target latency, I had to hack up cpusets to let the user
dynamically switch nohz off for specified sets (and the tick has to be
skewed in both cases or you can just forget it).  With nohz, I can't
quite achieve 30us jitter target, turn it off, I get single digit.  Out
of the current box, triple digit for simple synchronized frame timers +
compute worker-bees load on 64 cores.  Patch 4 probably helps that, but
don't _think_ it'll fix it.  If you (currently) ever become balancer,
your latency target is smoking wreckage.

-Mike



* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-21 14:54   ` Christoph Lameter
@ 2012-03-22  7:38     ` Gilad Ben-Yossef
  2012-03-22 16:18       ` Christoph Lameter
  2012-03-22 17:18       ` Chris Metcalf
  2012-03-27 12:13     ` Frederic Weisbecker
  1 sibling, 2 replies; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-22  7:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, Mar 21, 2012 at 4:54 PM, Christoph Lameter <cl@linux.com> wrote:
> On Wed, 21 Mar 2012, Frederic Weisbecker wrote:
>
>> If RCU is waiting for the current CPU to complete a grace
>> period, don't turn off the tick. Unlike dyntick-idle, we
>> are not necessarily going to enter into rcu extended quiescent
>> state, so we may need to keep the tick to note current CPU's
>> quiescent states.
>
> Is there any way for userspace to know that the tick is not off yet due to
> this? It would make sense for us to have a busy loop in user space that
> waits until the OS has completed all processing if that avoids future
> latencies for the application.
>

I previously suggested having the user register to receive a signal
when the tick is turned off. Since the user task is by design always
the current task when the tick is turned off, *I think* you can simply
mark the signal pending when you turn the tick off.

The user would register a signal handler that sets a flag when it is
called, and then busy loop waiting on that flag.

Checking this approach is on my todo list, but alas I have made no
progress with it yet :-)

Gilad



-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-22  7:38     ` Gilad Ben-Yossef
@ 2012-03-22 16:18       ` Christoph Lameter
  2012-03-27 15:21         ` Gilad Ben-Yossef
  2012-03-22 17:18       ` Chris Metcalf
  1 sibling, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-03-22 16:18 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Thu, 22 Mar 2012, Gilad Ben-Yossef wrote:

> > Is there any way for userspace to know that the tick is not off yet due to
> > this? It would make sense for us to have a busy loop in user space that
> > waits until the OS has completed all processing if that avoids future
> > latencies for the application.
> >
>
> I previously suggested having the user register to receive a signal
> when the tick is turned off. Since the user task is by design always
> the current task when the tick is turned off, *I think* you can simply
> mark the signal pending when you turn the tick off.

Ok that sounds good. You would define a new signal for this?

So we would start up the application. The app will do all prep work (memory
allocation, device setup etc etc) and then wait for the signal to be
received. After that it would enter the low latency processing phase.

Could we also get a signal if something disrupts the peace and switches
the timer interrupt on again?
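
The application side of that flow could look roughly like the sketch
below. To be clear about the assumptions: no such notification exists
in this patchset, so SIGUSR1 and SIGUSR2 are only placeholders for the
hypothetical "tick stopped" and "tick restarted" signals, and all
function names here are made up.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical: placeholders for signals the kernel does not define. */
#define SIG_TICK_STOPPED   SIGUSR1
#define SIG_TICK_RESTARTED SIGUSR2

/* Set to 1 while the tick is reported off, 0 otherwise. */
static volatile sig_atomic_t tick_is_off;

static void on_tick_event(int sig)
{
	tick_is_off = (sig == SIG_TICK_STOPPED);
}

/* Install one handler for both notifications; returns 0 on success. */
static int watch_tick_events(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_tick_event;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIG_TICK_STOPPED, &sa, NULL) != 0)
		return -1;
	return sigaction(SIG_TICK_RESTARTED, &sa, NULL);
}

/* Busy-wait without syscalls until the tick is reported off. */
static void wait_for_quiet_cpu(void)
{
	while (!tick_is_off)
		;
}
```

The app would call watch_tick_events() during its prep work, then
wait_for_quiet_cpu() before entering the low latency phase, and could
poll tick_is_off afterwards to notice a disruption.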



* Re: [PATCH 07/32] cpuset: Set up interface for nohz flag
  2012-03-22  4:03     ` Mike Galbraith
@ 2012-03-22 16:26       ` Christoph Lameter
  2012-03-22 19:20         ` Mike Galbraith
  2012-03-27 11:22       ` Frederic Weisbecker
  1 sibling, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-03-22 16:26 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Thu, 22 Mar 2012, Mike Galbraith wrote:

> > > We use here a per cpu refcounter. As long as a CPU
> > > is contained into at least one cpuset that has the
> > > nohz flag set, it is part of the set of CPUs that
> > > run into adaptive nohz mode.
> >
> > What are the drawbacks for nohz?
>
> For nohz in general, latency.  To make it at all usable for rt loads, I

Well nohz while a process is running on a dedicated cpu means the cpu is
running full power and no disruptions occur. This is a tremendous benefit.

> had to make isolated cores immune from playing load balancer.  Even so,
> to achieve target latency, I had to hack up cpusets to let the user
> dynamically switch nohz off for specified sets (and the tick has to be
> skewed in both cases or you can just forget it).  With nohz, I can't
> quite achieve 30us jitter target, turn it off, I get single digit.  Out

Less than 10us jitter can already be accomplished by building a kernel with
certain options off (like for example preemption...) and ensuring that
stuff stays off certain processors. Let's not confuse realtime with low
latency. Real time in the sense of deterministic execution is bad for
latency because overhead is added to ensure the determinism which
increases latency.

I have a patch here that adds a system call to simply switch off the timer
interrupt cold turkey and with that I can get down to 1-2 usecs jitter.

> of the current box, triple digit for simple synchronized frame timers +
> compute worker-bees load on 64 cores.  Patch 4 probably helps that, but
> don't _think_ it'll fix it.  If you (currently) ever become balancer,
> your latency target is smoking wreckage.

Yes so we need something to tell the system which cpu is the sacrificial
lamb that will not run low latency applications.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-22  7:38     ` Gilad Ben-Yossef
  2012-03-22 16:18       ` Christoph Lameter
@ 2012-03-22 17:18       ` Chris Metcalf
  2012-03-27 15:31         ` Gilad Ben-Yossef
  1 sibling, 1 reply; 96+ messages in thread
From: Chris Metcalf @ 2012-03-22 17:18 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Christoph Lameter, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On 3/22/2012 3:38 AM, Gilad Ben-Yossef wrote:
> On Wed, Mar 21, 2012 at 4:54 PM, Christoph Lameter <cl@linux.com> wrote:
>> On Wed, 21 Mar 2012, Frederic Weisbecker wrote:
>>
>>> If RCU is waiting for the current CPU to complete a grace
>>> period, don't turn off the tick. Unlike dyntick-idle, we
>>> are not necessarily going to enter into rcu extended quiescent
>>> state, so we may need to keep the tick to note current CPU's
>>> quiescent states.
>> Is there any way for userspace to know that the tick is not off yet due to
>> this? It would make sense for us to have busy loop in user space that
>> waits until the OS has completed all processing if that avoids future
>> latencies for the application.
>>
> I previously suggested having the user register to receive a signal
> when the tick
> is turned off. Since the tick is always turned off the user task is
> the current task
> by design, *I think* you can simply mark the signal pending when you
> turn the tick off.
>
> The user would register a signal handler to set a flag when it is
> called and then busy
> loop waiting for a flag to clear.

This sounds plausible, but the kernel would have to know that the tick not
only was stopped currently, but also would still be stopped when the signal
handler's sigreturn syscall was performed.  The problem we've seen is that
it's sometimes somewhat nondeterministic when the kernel might decide it
needed some more ticking, once you let kernel code start to run.  For
example, for RCU ops the kernel can choose to ignore the nohz cpuset cores
when they're running userspace code only, but as soon as they get back into
the kernel for any reason, you may need to schedule a grace period, and so
just returning from the "you have no more ticks!" signal handler ends up
causing ticks to be scheduled.

The approach we took for the Tilera dataplane mode was to have a syscall
that would hold the task in the kernel until any ticks were done, and only
then return to userspace.  (This is the same set_dataplane() syscall that
also offers some flags to control and debug the dataplane stuff in general;
in fact the "hold in kernel" support is a mode we set for all syscalls, to
keep things deterministic.)  This way the "busy loop" is done in the
kernel, but in fact we explicitly go into idle until the next tick, so it's
lower-power.

An alternative approach, not so good for power but at least avoiding the
"use the kernel to avoid the kernel" aspect of signals, would be to
register a location in userspace that the kernel would write to when it
disabled the tick, and userspace could then just spin reading memory.
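
As a rough illustration of that last alternative, here is a minimal
userspace sketch. It is not code from the patchset or from the Tilera
tree: the status word, its layout and the function name are all
hypothetical, and how the location would be registered with the kernel
is left out entirely. It only shows the polling side, with a spin
budget so the caller can give up.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status word; in the proposal the kernel owns the write side. */
struct tick_status {
	atomic_int tick_stopped;	/* 0 = tick running, 1 = tick stopped */
};

/*
 * Poll the status word up to max_spins times.
 * Returns true as soon as the tick is seen stopped, false if the
 * spin budget runs out first.
 */
bool wait_tick_stopped(struct tick_status *st, unsigned long max_spins)
{
	assert(st != NULL);

	for (unsigned long i = 0; i < max_spins; i++) {
		/* Acquire pairs with a release store on the writer side. */
		if (atomic_load_explicit(&st->tick_stopped, memory_order_acquire))
			return true;
	}
	return false;
}
```

Since the tick can be restarted at any time, a true return only says
the tick was off at the moment of the load; the application would
still need some way to learn that it came back.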

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/32] cpuset: Set up interface for nohz flag
  2012-03-22 16:26       ` Christoph Lameter
@ 2012-03-22 19:20         ` Mike Galbraith
  0 siblings, 0 replies; 96+ messages in thread
From: Mike Galbraith @ 2012-03-22 19:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Thu, 2012-03-22 at 11:26 -0500, Christoph Lameter wrote: 
> On Thu, 22 Mar 2012, Mike Galbraith wrote:
> 
> > > > We use here a per cpu refcounter. As long as a CPU
> > > > is contained into at least one cpuset that has the
> > > > nohz flag set, it is part of the set of CPUs that
> > > > run into adaptive nohz mode.
> > >
> > > What are the drawbacks for nohz?
> >
> > For nohz in general, latency.  To make it at all usable for rt loads, I
> 
> Well nohz while a process is running on a dedicated cpu means the cpu is
> running full power and no disruptions occur. This is a tremendous benefit.

In the context of single task burning in userspace, you bet.

> Less than 10us jitter can already be accomplished by building a kernel with
> certain options off (like for example preemption...) and ensuring that
> stuff stays off certain processors. Let's not confuse realtime with low
> latency. Real time in the sense of deterministic execution is bad for
> latency because overhead is added to ensure the determinism which
> increases latency.

Yeah, I know RT pays heavily for determinism.  It loses on best case.

> > of the current box, triple digit for simple synchronized frame timers +
> > compute worker-bees load on 64 cores.  Patch 4 probably helps that, but
> > don't _think_ it'll fix it.  If you (currently) ever become balancer,
> > your latency target is smoking wreckage.
> 
> Yes so we need something to tell the system which cpu is the sacrificial
> lamb that will not run low latency applications.

Definitely a lamb is required.

(This set is targeted at HPC, so I'll shut up now.. but RT is HPC too)

-Mike


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
  2012-03-21 14:52   ` Christoph Lameter
@ 2012-03-27 10:50     ` Frederic Weisbecker
  2012-03-27 16:08       ` Christoph Lameter
  0 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-27 10:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin,
	Dimitri Sivanich

On Wed, Mar 21, 2012 at 09:52:24AM -0500, Christoph Lameter wrote:
> On Wed, 21 Mar 2012, Frederic Weisbecker wrote:
> 
> > Try to give the timekeeping duty to a CPU that doesn't belong
> > to any nohz cpuset when possible, so that we increase the chance
> > for these nohz cpusets to run their CPUs out of periodic tick
> > mode.
> 
> Any way to manually specify which cpu? We f.e. always "sacrifice" cpu 0
> for OS activities. We would like to have all OS processing things
> restricted to cpu 0 so that the rest of the processors do not experience
> the OS noise.

Somebody is trying to do this: https://lkml.org/lkml/2011/11/8/346

But in the case of nohz cpusets there is a problem to solve:

What if all CPUs are tickless (idle or busy)? Who must take
the timekeeping duty? Should we pick one of the busy CPUs? Or
keep one CPU with the tick even if it's idle? How do we choose
this CPU?

Maybe we need to define another flag on cpusets to assign the
timekeeping duty to any CPU on a flagged set. This way we can
force that duty to the CPU(s) we want.
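
To make the selection question concrete, here is a small sketch of one
possible policy. This is an assumption, not what the patchset does:
CPU sets are plain 64-bit masks, the "timekeeper" cpuset flag is the
hypothetical one suggested above, and the last fallback (lowest online
CPU when everything is tickless) is only one possible answer to the
open question.

```c
#include <assert.h>
#include <stdint.h>

#define NO_CPU (-1)

/* Index of the lowest set bit, or NO_CPU if the mask is empty. */
int first_cpu(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return NO_CPU;
}

/*
 * Pick the CPU that carries the timekeeping duty:
 *  1. an online CPU in a set flagged for timekeeping (hypothetical flag),
 *  2. else an online CPU outside every nohz cpuset,
 *  3. else any online CPU (assumed fallback when all are tickless).
 */
int pick_timekeeper(uint64_t online, uint64_t nohz, uint64_t flagged)
{
	uint64_t preferred = online & flagged;
	uint64_t ticking = online & ~nohz;

	assert(online != 0);

	if (preferred)
		return first_cpu(preferred);
	if (ticking)
		return first_cpu(ticking);
	return first_cpu(online);
}
```

With CPUs 0-7 online and CPUs 1-7 in nohz cpusets, this picks CPU 0,
which matches the "sacrifice cpu 0" setup Christoph describes.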

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/32] cpuset: Set up interface for nohz flag
  2012-03-21 14:50   ` Christoph Lameter
  2012-03-22  4:03     ` Mike Galbraith
@ 2012-03-27 11:19     ` Frederic Weisbecker
  1 sibling, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-27 11:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, Mar 21, 2012 at 09:50:27AM -0500, Christoph Lameter wrote:
> On Wed, 21 Mar 2012, Frederic Weisbecker wrote:
> 
> > Prepare the interface to implement the nohz cpuset flag.
> > This flag, once set, will tell the system to try to
> > shutdown the periodic timer tick when possible.
> >
> > We use here a per cpu refcounter. As long as a CPU
> > is contained into at least one cpuset that has the
> > nohz flag set, it is part of the set of CPUs that
> > run into adaptive nohz mode.
> 
> What are the drawbacks for nohz?
> 
> If there are none: Can we make nohz default behavior without relying on
> cpusets?

I can't tell for now. I haven't yet covered everything the timer is
handling. Until that happens I can't do measurements.

In theory this sounds like a win in every case. I just would like
to test that in practice. This sets up hooks in kernel entry/exit.
More IPIs here and there. Maybe this adds overhead on workloads
involving a lot of syscalls or exceptions. I don't know.

Given this is not yet entirely clear, I think it may be better to keep
this interface around until the patchset reaches a version that becomes
mergeable. Then at this point we can get serious testing coverage to take
the decision to drop the interface and make it unconditional on
CONFIG_ADAPTIVE_NO_HZ.
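
For reference, the per-CPU refcount rule from the quoted changelog can
be sketched as below. This is a standalone illustration with made-up
names (demo_nohz_get() and friends), not the code from patch 07: each
CPU counts the nohz-flagged cpusets that contain it, and it is in
adaptive nohz mode as long as that count is non-zero.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 8

/* One counter per CPU: number of nohz-flagged cpusets containing it. */
static int nohz_refcount[DEMO_NR_CPUS];

/* Called when a cpuset containing @cpu gets its nohz flag set. */
void demo_nohz_get(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	nohz_refcount[cpu]++;
}

/* Called when such a cpuset clears the flag or drops @cpu. */
void demo_nohz_put(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	assert(nohz_refcount[cpu] > 0);
	nohz_refcount[cpu]--;
}

/* A CPU is adaptive nohz while at least one flagged cpuset holds it. */
bool demo_cpu_is_adaptive_nohz(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return nohz_refcount[cpu] > 0;
}
```

Dropping the interface later, as discussed above, would amount to
treating every CPU as if its count were always non-zero.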

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/32] cpuset: Set up interface for nohz flag
  2012-03-22  4:03     ` Mike Galbraith
  2012-03-22 16:26       ` Christoph Lameter
@ 2012-03-27 11:22       ` Frederic Weisbecker
  2012-03-27 11:53         ` Mike Galbraith
  1 sibling, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-27 11:22 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Christoph Lameter, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Thu, Mar 22, 2012 at 05:03:53AM +0100, Mike Galbraith wrote:
> On Wed, 2012-03-21 at 09:50 -0500, Christoph Lameter wrote: 
> > On Wed, 21 Mar 2012, Frederic Weisbecker wrote:
> > 
> > > Prepare the interface to implement the nohz cpuset flag.
> > > This flag, once set, will tell the system to try to
> > > shutdown the periodic timer tick when possible.
> > >
> > > We use here a per cpu refcounter. As long as a CPU
> > > is contained into at least one cpuset that has the
> > > nohz flag set, it is part of the set of CPUs that
> > > run into adaptive nohz mode.
> > 
> > What are the drawbacks for nohz?
> 
> For nohz in general, latency.  To make it at all usable for rt loads, I
> had to make isolated cores immune from playing load balancer.  Even so,
> to achieve target latency, I had to hack up cpusets to let the user
> dynamically switch nohz off for specified sets (and the tick has to be
> skewed in both cases or you can just forget it).  With nohz, I can't
> quite achieve 30us jitter target, turn it off, I get single digit.  Out
> of the current box, triple digit for simple synchronized frame timers +
> compute worker-bees load on 64 cores.  Patch 4 probably helps that, but
> don't _think_ it'll fix it.  If you (currently) ever become balancer,
> your latency target is smoking wreckage.

But this is because of waking up from CPU low power mode, right? If so
then busy tickless shouldn't be concerned. We can certainly have
configurations where the tick is not stopped in idle but can be elsewhere.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/32] cpuset: Set up interface for nohz flag
  2012-03-27 11:22       ` Frederic Weisbecker
@ 2012-03-27 11:53         ` Mike Galbraith
  2012-03-27 11:56           ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2012-03-27 11:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, 2012-03-27 at 13:22 +0200, Frederic Weisbecker wrote: 
> On Thu, Mar 22, 2012 at 05:03:53AM +0100, Mike Galbraith wrote:
> > On Wed, 2012-03-21 at 09:50 -0500, Christoph Lameter wrote: 
> > > On Wed, 21 Mar 2012, Frederic Weisbecker wrote:
> > > 
> > > > Prepare the interface to implement the nohz cpuset flag.
> > > > This flag, once set, will tell the system to try to
> > > > shutdown the periodic timer tick when possible.
> > > >
> > > > We use here a per cpu refcounter. As long as a CPU
> > > > is contained into at least one cpuset that has the
> > > > nohz flag set, it is part of the set of CPUs that
> > > > run into adaptive nohz mode.
> > > 
> > > What are the drawbacks for nohz?
> > 
> > For nohz in general, latency.  To make it at all usable for rt loads, I
> > had to make isolated cores immune from playing load balancer.  Even so,
> > to achieve target latency, I had to hack up cpusets to let the user
> > dynamically switch nohz off for specified sets (and the tick has to be
> > skewed in both cases or you can just forget it).  With nohz, I can't
> > quite achieve 30us jitter target, turn it off, I get single digit.  Out
> > of the current box, triple digit for simple synchronized frame timers +
> > compute worker-bees load on 64 cores.  Patch 4 probably helps that, but
> > don't _think_ it'll fix it.  If you (currently) ever become balancer,
> > your latency target is smoking wreckage.
> 
> But this is because of waking up from CPU low power mode, right? If so
> then busy tickless shouldn't be concerned. We can certainly have
> configurations where the tick is not stopped in idle but can be elsewhere.

Boxen are restricted to C1 (even at that Q6600 _sucks rocks_, but more
modern CPUs don't).  ATM, ticked is cheaper, I can't get there from here
with nohz.

-Mike 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/32] cpuset: Set up interface for nohz flag
  2012-03-27 11:53         ` Mike Galbraith
@ 2012-03-27 11:56           ` Frederic Weisbecker
  2012-03-27 12:31             ` Mike Galbraith
  0 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-27 11:56 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Christoph Lameter, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, Mar 27, 2012 at 01:53:03PM +0200, Mike Galbraith wrote:
> On Tue, 2012-03-27 at 13:22 +0200, Frederic Weisbecker wrote: 
> > On Thu, Mar 22, 2012 at 05:03:53AM +0100, Mike Galbraith wrote:
> > > On Wed, 2012-03-21 at 09:50 -0500, Christoph Lameter wrote: 
> > > > On Wed, 21 Mar 2012, Frederic Weisbecker wrote:
> > > > 
> > > > > Prepare the interface to implement the nohz cpuset flag.
> > > > > This flag, once set, will tell the system to try to
> > > > > shutdown the periodic timer tick when possible.
> > > > >
> > > > > We use here a per cpu refcounter. As long as a CPU
> > > > > is contained into at least one cpuset that has the
> > > > > nohz flag set, it is part of the set of CPUs that
> > > > > run into adaptive nohz mode.
> > > > 
> > > > What are the drawbacks for nohz?
> > > 
> > > For nohz in general, latency.  To make it at all usable for rt loads, I
> > > had to make isolated cores immune from playing load balancer.  Even so,
> > > to achieve target latency, I had to hack up cpusets to let the user
> > > dynamically switch nohz off for specified sets (and the tick has to be
> > > skewed in both cases or you can just forget it).  With nohz, I can't
> > > quite achieve 30us jitter target, turn it off, I get single digit.  Out
> > > of the current box, triple digit for simple synchronized frame timers +
> > > compute worker-bees load on 64 cores.  Patch 4 probably helps that, but
> > > don't _think_ it'll fix it.  If you (currently) ever become balancer,
> > > your latency target is smoking wreckage.
> > 
> > But this is because of waking up from CPU low power mode, right? If so
> > then busy tickless shouldn't be concerned. We can certainly have
> > configurations where the tick is not stopped in idle but can be elsewhere.
> 
> Boxen are restricted to C1 (even at that Q6600 _sucks rocks_, but more
> modern CPUs don't).  ATM, ticked is cheaper, I can't get there from here
> with nohz.

Ok but there is a difference between idle nohz and busy nohz.
Idle nohz may let the CPU enter into low power mode. busy nohz
(implemented by this patchset) doesn't because it stops the tick
when the CPU runs.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-21 14:54   ` Christoph Lameter
  2012-03-22  7:38     ` Gilad Ben-Yossef
@ 2012-03-27 12:13     ` Frederic Weisbecker
  2012-03-27 16:13       ` Christoph Lameter
  1 sibling, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-27 12:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, Mar 21, 2012 at 09:54:39AM -0500, Christoph Lameter wrote:
> On Wed, 21 Mar 2012, Frederic Weisbecker wrote:
> 
> > If RCU is waiting for the current CPU to complete a grace
> > period, don't turn off the tick. Unlike dyntick-idle, we
> > are not necessarily going to enter into rcu extended quiescent
> > state, so we may need to keep the tick to note current CPU's
> > quiescent states.
> 
> Is there any way for userspace to know that the tick is not off yet due to
> this? It would make sense for us to have busy loop in user space that
> waits until the OS has completed all processing if that avoids future
> latencies for the application.

What is the usecase you have in mind? Is it for realtime purpose?
The "tick stopped" is a volatile and relative state.

Relative because if a timer list is enqueued to fire 1 second later,
the tick will be stopped until that happens. How do we consider this (common)
case?

Also as Chris noted it is volatile because the tick can be restarted anytime
for random reasons: the CPU receives an IPI which makes it restart the
periodic tick.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/32] cpuset: Set up interface for nohz flag
  2012-03-27 11:56           ` Frederic Weisbecker
@ 2012-03-27 12:31             ` Mike Galbraith
  0 siblings, 0 replies; 96+ messages in thread
From: Mike Galbraith @ 2012-03-27 12:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, 2012-03-27 at 13:56 +0200, Frederic Weisbecker wrote: 
> Ok but there is a difference between idle nohz and busy nohz.
> Idle nohz may let the CPU enter into low power mode. busy nohz
> (implemented by this patchset) doesn't because it stops the tick
> when the CPU runs.

Yeah.  Your set is tagged important here (stare at, and play with;).

-Mike


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 21/32] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader
  2012-03-21 13:58 ` [PATCH 21/32] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader Frederic Weisbecker
@ 2012-03-27 14:10   ` Gilad Ben-Yossef
  2012-03-27 14:23     ` Gilad Ben-Yossef
  0 siblings, 1 reply; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-27 14:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, Mar 21, 2012 at 3:58 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> When we wait for a zombie task, flush the cputimes on nohz cpusets
> in case we are waiting for a group leader that has threads running
> in nohz CPUs. This way thread_group_times() doesn't report stale
> values.
>
> <doubts>
> If I understood the code well, by the time we call that thread_group_times(),
> we may have children that are still running, so this is necessary.
> But I need to check deeper.
> </doubts>
>
...
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 4b4042f..c194662 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -52,6 +52,7 @@
>  #include <linux/hw_breakpoint.h>
>  #include <linux/oom.h>
>  #include <linux/writeback.h>
> +#include <linux/cpuset.h>
>
>  #include <asm/uaccess.h>
>  #include <asm/unistd.h>
> @@ -1712,6 +1713,13 @@ repeat:
>           (!wo->wo_pid || hlist_empty(&wo->wo_pid->tasks[wo->wo_type])))
>                goto notask;
>
> +       /*
> +        * For cputime in sub-threads before adding them.
> +        * Must be called outside tasklist_lock lock because write lock
> +        * can be acquired under irqs disabled.
> +        */
> +       cpuset_nohz_flush_cputimes();
> +
>        set_current_state(TASK_INTERRUPTIBLE);
>        read_lock(&tasklist_lock);
>        tsk = current;
> --
> 1.7.5.4
>

I believe this patch is not needed because after this point we call
do_wait_thread/ptrace_do_wait, which both call wait_consider_task,
which calls wait_task_stopped/zombie/continued, which all eventually
call getrusage, which calls k_getrusage where you added a call to
cpuset_nohz_flush_cputimes() in another patch :-)

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 21/32] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader
  2012-03-27 14:10   ` Gilad Ben-Yossef
@ 2012-03-27 14:23     ` Gilad Ben-Yossef
  2012-03-28 11:20       ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-27 14:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, Mar 27, 2012 at 4:10 PM, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
> On Wed, Mar 21, 2012 at 3:58 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>> When we wait for a zombie task, flush the cputimes on nohz cpusets
>> in case we are waiting for a group leader that has threads running
>> in nohz CPUs. This way thread_group_times() doesn't report stale
>> values.
>>
>> <doubts>
>> If I understood the code well, by the time we call that thread_group_times(),
>> we may have children that are still running, so this is necessary.
>> But I need to check deeper.
>> </doubts>
>>
> ...
>>
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 4b4042f..c194662 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -52,6 +52,7 @@
>>  #include <linux/hw_breakpoint.h>
>>  #include <linux/oom.h>
>>  #include <linux/writeback.h>
>> +#include <linux/cpuset.h>
>>
>>  #include <asm/uaccess.h>
>>  #include <asm/unistd.h>
>> @@ -1712,6 +1713,13 @@ repeat:
>>           (!wo->wo_pid || hlist_empty(&wo->wo_pid->tasks[wo->wo_type])))
>>                goto notask;
>>
>> +       /*
>> +        * For cputime in sub-threads before adding them.
>> +        * Must be called outside tasklist_lock lock because write lock
>> +        * can be acquired under irqs disabled.
>> +        */
>> +       cpuset_nohz_flush_cputimes();
>> +
>>        set_current_state(TASK_INTERRUPTIBLE);
>>        read_lock(&tasklist_lock);
>>        tsk = current;
>> --
>> 1.7.5.4
>>
>
> I believe this patch is not needed because after this point we call
> do_wait_thread/ptrace_do_wait, which both call wait_consider_task,
> which calls wait_task_stopped/zombie/continued, which all eventually
> call getrusage, which calls k_getrusage where you added a call to
> cpuset_nohz_flush_cputimes() in another patch :-)
>

OK, I now see that wait_task_zombie actually calls
thread_group_times() directly, unlike the other wait_task_* functions,
so what I wrote above does not hold for the zombie case.

It does result in more than one IPI for each isolated core (something
like 3 really) for the other cases though:
one from this patch and the rest from the k_getrusage calls.

I wonder what would be a better way to do it. In theory we can send
the IPI only to nohz cpuset cores that actually
run tasks from the thread group. Finding which is not trivial though...
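
Just to spell out the idea (a sketch only, with hypothetical types and
names, and CPU sets as 64-bit masks): walk the threads of the group,
collect the CPUs where one is currently running, and intersect that
with the nohz cpuset mask. The hard part mentioned above is not shown:
in the kernel the "running on CPU n" information can change under us
while we build the mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of one thread in the group. */
struct demo_thread {
	int cpu;	/* CPU the thread is on (0..63) */
	bool running;	/* true if it is currently executing there */
};

/*
 * Return the mask of nohz CPUs that currently run a thread of the
 * group; only those would need the cputime flush IPI.
 */
uint64_t demo_flush_targets(const struct demo_thread *threads, size_t n,
			    uint64_t nohz_mask)
{
	uint64_t targets = 0;

	assert(threads != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		assert(threads[i].cpu >= 0 && threads[i].cpu < 64);
		if (threads[i].running)
			targets |= UINT64_C(1) << threads[i].cpu;
	}
	return targets & nohz_mask;
}
```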

Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (32 preceding siblings ...)
  2012-03-21 13:58 ` [PATCH 32/32] nohz/cpuset: Disable under some configs Frederic Weisbecker
@ 2012-03-27 15:02 ` Gilad Ben-Yossef
  2012-03-27 15:04   ` Gilad Ben-Yossef
                     ` (2 more replies)
  2012-03-30  0:33 ` Kevin Hilman
  34 siblings, 3 replies; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-27 15:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, Mar 21, 2012 at 3:58 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> Hi all,
>
> A summary of what this is about can be found here:
>  https://lkml.org/lkml/2011/8/15/245
>
> There are still a lot of things to handle. Especially about
> what is done by scheduler_tick() but we also need to:
>
> - completely handle cputime accounting (need to find every "reader"
> of cputime and flush cputimes for all of them).
> -handle  perf
> - handle irqtime finegrained accounting
> - handle ilb load balancing
> - etc...
>

I gave the new version a spin (x86 8 way VM) and it looks cool.

I did get the following warning once, but couldn't recreate it:

[   31.812741] ------------[ cut here ]------------
[   31.812741] WARNING: at /home/giladb/Workspace/linux/kernel/time/tick-sched.c:706 tick_nohz_account_ticks+0x7c/0x90()
[   31.812741] Hardware name: Bochs
[   31.812741] Modules linked in:
[   31.812741] Pid: 1006, comm: sh Not tainted 3.3.0-rc7+ #167
[   31.812741] Call Trace:
[   31.812741]  [<c102a3ad>] warn_slowpath_common+0x6d/0xa0
[   31.812741]  [<c106be0c>] ? tick_nohz_account_ticks+0x7c/0x90
[   31.812741]  [<c106be0c>] ? tick_nohz_account_ticks+0x7c/0x90
[   31.812741]  [<c102a3fd>] warn_slowpath_null+0x1d/0x20
[   31.812741]  [<c106be0c>] tick_nohz_account_ticks+0x7c/0x90
[   31.812741]  [<c106be5f>] tick_nohz_flush_current_times+0x3f/0x80
[   31.812741]  [<c106bf8d>] tick_nohz_restart_adaptive+0xd/0x30
[   31.812741]  [<c106c02e>] tick_nohz_check_adaptive+0x3e/0x50
[   31.812741]  [<c1018180>] smp_cpuset_update_nohz_interrupt+0x20/0x30
[   31.812741]  [<c1639c6a>] cpuset_update_nohz_interrupt+0x2a/0x30
[   31.812741]  [<c16395fd>] ? _raw_spin_unlock_irq+0xd/0x30
[   31.812741]  [<c10575c6>] finish_task_switch+0x46/0xa0
[   31.812741]  [<c1638558>] __schedule+0x398/0x910
[   31.812741]  [<c10ef2f1>] ? deactivate_slab+0x611/0x730
[   31.812741]  [<c1120777>] ? __find_get_block+0x97/0x1a0
[   31.812741]  [<c1221214>] ? cpumask_next_and+0x24/0xa0
[   31.812741]  [<c10558cb>] ? get_parent_ip+0xb/0x40
[   31.812741]  [<c1638b50>] schedule+0x30/0x50
[   31.812741]  [<c16379b5>] schedule_hrtimeout_range_clock+0xf5/0x110
[   31.812741]  [<c10558cb>] ? get_parent_ip+0xb/0x40
[   31.812741]  [<c10586db>] ? sub_preempt_count+0x7b/0xb0
[   31.812741]  [<c1639633>] ? _raw_spin_unlock_irqrestore+0x13/0x40
[   31.812741]  [<c1054140>] ? __wake_up+0x40/0x50
[   31.812741]  [<c1294d1f>] ? put_ldisc+0x3f/0xa0
[   31.812741]  [<c16379e2>] schedule_hrtimeout_range+0x12/0x20
[   31.812741]  [<c1107969>] poll_schedule_timeout+0x39/0x60
[   31.812741]  [<c1108020>] do_sys_poll+0x400/0x490
[   31.812741]  [<c1054d15>] ? cpuacct_charge+0x65/0x70
[   31.812741]  [<c1107a20>] ? poll_freewait+0x70/0x70
[   31.812741]  [<c1107af0>] ? __pollwait+0xd0/0xd0
[   31.812741]  [<c1107af0>] ? __pollwait+0xd0/0xd0
[   31.812741]  [<c10094a3>] ? native_sched_clock+0x33/0xe0
[   31.812741]  [<c105a0e2>] ? sched_clock_local+0xb2/0x190
[   31.812741]  [<c1054d15>] ? cpuacct_charge+0x65/0x70
[   31.812741]  [<c105b376>] ? update_curr+0x1a6/0x2a0
[   31.812741]  [<c105a2f9>] ? sched_clock_cpu+0x139/0x190
[   31.812741]  [<c105a0e2>] ? sched_clock_local+0xb2/0x190
[   31.812741]  [<c104dd43>] ? hrtimer_forward+0x163/0x1b0
[   31.812741]  [<c10644e2>] ? ktime_get+0x62/0x100
[   31.812741]  [<c1018b56>] ? lapic_next_event+0x16/0x20
[   31.812741]  [<c1069df2>] ? clockevents_program_event+0xc2/0x170
[   31.812741]  [<c106b514>] ? tick_program_event+0x24/0x30
[   31.812741]  [<c104cd1d>] ? hrtimer_interrupt+0x1ad/0x2e0
[   31.812741]  [<c1095128>] ? rcu_pending+0x58/0x70
[   31.812741]  [<c1030a3d>] ? irq_exit+0x6d/0x80
[   31.812741]  [<c1019363>] ? smp_apic_timer_interrupt+0x53/0x90
[   31.812741]  [<c11e0128>] ? avc_has_perm_noaudit+0xc8/0x360
[   31.812741]  [<c163a3b6>] ? apic_timer_interrupt+0x2a/0x30
[   31.812741]  [<c128f31e>] ? tty_ioctl+0x47e/0xa30
[   31.812741]  [<c11e0d66>] ? inode_has_perm+0x36/0x50
[   31.812741]  [<c11e13e8>] ? file_has_perm+0xa8/0xb0
[   31.812741]  [<c128eea0>] ? tty_check_change+0xe0/0xe0
[   31.812741]  [<c1106763>] ? do_vfs_ioctl+0x83/0x570
[   31.812741]  [<c11e4e46>] ? selinux_file_ioctl+0x56/0x110
[   31.812741]  [<c1108224>] sys_poll+0x54/0xb0
[   31.812741]  [<c1639b29>] syscall_call+0x7/0xb
[   31.812741] ---[ end trace 1d7d659b4aead681 ]---

With the two patches I'll attach to the next replies to this message,
I've been able to get a task running on an isolated CPU with 0 timer
interrupts.

In my case, I also had to disable the clocksource watchdog, but only
because TSC is not stable on my VM.
This is really not a nohz/cpuset problem.

The patch set does introduce one source of interference to CPU
isolation: the cputime flush IPI. Every time you run a command in the
shell, 3-4 IPIs are sent to the nohz cpuset to flush the cputimes so
that thread group times get computed correctly. That's not very nice :-)

I've tried disabling the IPI send, just to see how it goes, and as far
as I've been able to tell you get a bare-metal-like environment for
100% CPU-bound code with no interrupts. Of course, ps/top then show 0%
CPU utilization for that task, since without the IPI the time it spends
on the CPU is not registered... That is a small price to pay in my eyes
for bare metal performance on Linux, but what do I know? :-)

Overall, way cool. Please keep it up !

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-27 15:02 ` [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Gilad Ben-Yossef
@ 2012-03-27 15:04   ` Gilad Ben-Yossef
  2012-03-27 15:05     ` Gilad Ben-Yossef
  2012-03-27 15:10   ` Peter Zijlstra
  2012-03-28 11:43   ` Frederic Weisbecker
  2 siblings, 1 reply; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-27 15:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

commit c28c1ce3b410db9c59cc78c403bcbc0f076e25fe
Author: Gilad Ben-Yossef <gilad@benyossef.com>
Date:   Mon Feb 13 15:47:24 2012 +0200

    timer: make __next_timer_interrupt explicit about no future event

    While playing with Frederic's adaptive tick patch set, I've noticed
    that no matter what I do, even though I can get the scheduler tick
    to turn itself off, I still got an interrupt from the timer once
    every few seconds, although it did not run the scheduler tick code.

    After poking at it for some time I believe what is happening is as
    follows:

    1. tick_nohz_stop_sched_tick()  [kernel/time/tick-sched.c]  calls
       get_next_timer_interrupt() [kernel/timer.c] with last_jiffies as
       parameter to get the next timer event, which in turn calls
       __next_timer_interrupt()

    2. __next_timer_interrupt() starts with a default expiry time of
       (base->timer_jiffies + NEXT_TIMER_MAX_DELTA) and searches for
       the next timer event.

    3. Having failed to find any, __next_timer_interrupt() returns the
       default expiry time, which is returned by get_next_timer_interrupt()
       to tick_nohz_stop_sched_tick()

    4. tick_nohz_stop_sched_tick() now subtracts the value of last_jiffies
       from the return value and checks whether the delta is smaller than
       NEXT_TIMER_MAX_DELTA. If it is not (and multiple other things are
       aligned just right...) it cancels the timer interrupt.

    5. Alas, base->timer_jiffies and last_jiffies are (usually) not equal,
       with the result being that tick_nohz_stop_sched_tick() thinks there
       is a timer event pending sometime in the distant future, approx.
       12 days from now(!). It therefore must keep the underlying timer
       firing every KTIME_MAX nsecs so that the clocksource will not wrap
       around.

    The end result is that we get a timer interrupt firing every KTIME_MAX
    nsecs even though there is no future timer event at all.

    The attached patch tries to fix the above, by adding an explicit
    boolean return value to __next_timer_interrupt() to indicate whether
    or not it found a future timer event and having
    get_next_timer_interrupt() return (last_jiffies + NEXT_TIMER_MAX_DELTA)
    to indicate there is no timer event, in the same way it does for offline
    CPUs.

    Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
    CC: Thomas Gleixner <tglx@linutronix.de>
    Cc: Alessio Igor Bogani <abogani@kernel.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Avi Kivity <avi@redhat.com>
    Cc: Chris Metcalf <cmetcalf@tilera.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
    Cc: Geoff Levand <geoff@infradead.org>
    Cc: Gilad Ben Yossef <gilad@benyossef.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Max Krasnyansky <maxk@qualcomm.com>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephen Hemminger <shemminger@vyatta.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Zen Lin <zen@openhuawei.org>

diff --git a/kernel/timer.c b/kernel/timer.c
index c203297..500b484 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1187,11 +1187,13 @@ static inline void __run_timers(struct tvec_base *base)
  * is used on S/390 to stop all activity when a CPU is idle.
  * This function needs to be called with interrupts disabled.
  */
-static unsigned long __next_timer_interrupt(struct tvec_base *base)
+static bool __next_timer_interrupt(struct tvec_base *base,
+					unsigned long *next_timer)
 {
 	unsigned long timer_jiffies = base->timer_jiffies;
 	unsigned long expires = timer_jiffies + NEXT_TIMER_MAX_DELTA;
-	int index, slot, array, found = 0;
+	int index, slot, array;
+	bool found = false;
 	struct timer_list *nte;
 	struct tvec *varray[4];

@@ -1202,12 +1204,12 @@ static unsigned long __next_timer_interrupt(struct tvec_base *base)
 			if (tbase_get_deferrable(nte->base))
 				continue;

-			found = 1;
+			found = true;
 			expires = nte->expires;
 			/* Look at the cascade bucket(s)? */
 			if (!index || slot < index)
 				goto cascade;
-			return expires;
+			goto out;
 		}
 		slot = (slot + 1) & TVR_MASK;
 	} while (slot != index);
@@ -1233,7 +1235,7 @@ cascade:
 				if (tbase_get_deferrable(nte->base))
 					continue;

-				found = 1;
+				found = true;
 				if (time_before(nte->expires, expires))
 					expires = nte->expires;
 			}
@@ -1245,7 +1247,7 @@ cascade:
 				/* Look at the cascade bucket(s)? */
 				if (!index || slot < index)
 					break;
-				return expires;
+				goto out;
 			}
 			slot = (slot + 1) & TVN_MASK;
 		} while (slot != index);
@@ -1254,7 +1256,10 @@ cascade:
 			timer_jiffies += TVN_SIZE - index;
 		timer_jiffies >>= TVN_BITS;
 	}
-	return expires;
+out:
+	if (found)
+		*next_timer = expires;
+	return found;
 }

 /*
@@ -1317,9 +1322,15 @@ unsigned long get_next_timer_interrupt(unsigned long now)
 	if (cpu_is_offline(smp_processor_id()))
 		return now + NEXT_TIMER_MAX_DELTA;
 	spin_lock(&base->lock);
-	if (time_before_eq(base->next_timer, base->timer_jiffies))
-		base->next_timer = __next_timer_interrupt(base);
-	expires = base->next_timer;
+	if (time_before_eq(base->next_timer, base->timer_jiffies)) {
+
+		if (__next_timer_interrupt(base, &expires))
+			base->next_timer = expires;
+		else
+			expires = now + NEXT_TIMER_MAX_DELTA;
+	} else
+		expires = base->next_timer;
+
 	spin_unlock(&base->lock);

 	if (time_before_eq(expires, now))



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-27 15:04   ` Gilad Ben-Yossef
@ 2012-03-27 15:05     ` Gilad Ben-Yossef
  2012-03-27 16:22       ` Christoph Lameter
  0 siblings, 1 reply; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-27 15:05 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

commit 013aed27b52122bda38ec9719263c0d09e8acf30
Author: Gilad Ben-Yossef <gilad@benyossef.com>
Date:   Sun Feb 26 15:38:06 2012 +0200

    mm: make vmstat_update periodic run conditional

    vmstat_update runs every second from the work queue to update statistics
    and drain per cpu pages back into the global page allocator.

    This is useful in most circumstances but is wasteful if the CPU doesn't
    actually generate any VM activity. This can happen when the CPU is idle
    or running a long-term CPU-bound task (e.g. CPU isolation), in which
    case the periodic vmstat_update timer needlessly interrupts the CPU.

    This patch tries to make vmstat_update schedule itself for the next
    round only if there was any work for it to do in the previous run.
    The assumption is that if for a whole second we didn't see any VM
    activity it is reasonable to assume that the CPU is not using the
    VM because it is idle or runs a long-term single CPU-bound task.

    CPUs that do keep the vmstat_update periodic work scheduled are
    used to monitor the CPUs that have turned vmstat_update off for
    signs of VM activity and re-schedule the periodic work on them.

    Care is taken to make sure at least one CPU always keeps the
    vmstat_update periodic work scheduled, including in the case
    where the last standing vmstat_update runner is being taken
    offline.

    Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
    Cc: Alessio Igor Bogani <abogani@kernel.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Avi Kivity <avi@redhat.com>
    Cc: Chris Metcalf <cmetcalf@tilera.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
    Cc: Geoff Levand <geoff@infradead.org>
    Cc: Gilad Ben Yossef <gilad@benyossef.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Max Krasnyansky <maxk@qualcomm.com>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephen Hemminger <shemminger@vyatta.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Zen Lin <zen@openhuawei.org>

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 65efb92..67bf202 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -200,7 +200,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);

-void refresh_cpu_vm_stats(int);
+bool refresh_cpu_vm_stats(int);
 void refresh_zone_stat_thresholds(void);

 int calculate_pressure_threshold(struct zone *zone);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f600557..a835dc3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -14,6 +14,7 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/cpumask.h>
 #include <linux/vmstat.h>
 #include <linux/sched.h>
 #include <linux/math64.h>
@@ -434,11 +435,12 @@ EXPORT_SYMBOL(dec_zone_page_state);
  * with the global counters. These could cause remote node cache line
  * bouncing and will have to be only done when necessary.
  */
-void refresh_cpu_vm_stats(int cpu)
+bool refresh_cpu_vm_stats(int cpu)
 {
 	struct zone *zone;
 	int i;
 	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	bool vm_activity = false;

 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset *p;
@@ -485,14 +487,21 @@ void refresh_cpu_vm_stats(int cpu)
 		if (p->expire)
 			continue;

-		if (p->pcp.count)
+		if (p->pcp.count) {
+			vm_activity = true;
 			drain_zone_pages(zone, &p->pcp);
+		}
 #endif
 	}

 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		if (global_diff[i])
+		if (global_diff[i]) {
 			atomic_long_add(global_diff[i], &vm_stat[i]);
+			vm_activity = true;
+		}
+
+	return vm_activity;
+
 }

 #endif
@@ -1141,22 +1150,73 @@ static const struct file_operations proc_vmstat_file_operations = {
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
+static struct cpumask vmstat_off_cpus;
+static DEFINE_MUTEX(vmstat_off_lock);

-static void vmstat_update(struct work_struct *w)
+static inline bool need_vmstat(int cpu)
 {
-	refresh_cpu_vm_stats(smp_processor_id());
-	schedule_delayed_work(&__get_cpu_var(vmstat_work),
-		round_jiffies_relative(sysctl_stat_interval));
+	struct zone *zone;
+	int i;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pageset *p;
+
+		p = per_cpu_ptr(zone->pageset, cpu);
+
+		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
+			if (p->vm_stat_diff[i])
+				return true;
+
+		if (zone_to_nid(zone) != numa_node_id() && p->pcp.count)
+			return true;
+	}
+
+	return false;
 }

-static void __cpuinit start_cpu_timer(int cpu)
+static void vmstat_update(struct work_struct *w);
+
+static void start_cpu_timer(int cpu)
 {
 	struct delayed_work *work = &per_cpu(vmstat_work, cpu);

-	INIT_DELAYED_WORK_DEFERRABLE(work, vmstat_update);
+	cpumask_clear_cpu(cpu, &vmstat_off_cpus);
 	schedule_delayed_work_on(cpu, work, __round_jiffies_relative(HZ, cpu));
 }

+static void __cpuinit setup_cpu_timer(int cpu)
+{
+	struct delayed_work *work = &per_cpu(vmstat_work, cpu);
+
+	INIT_DELAYED_WORK_DEFERRABLE(work, vmstat_update);
+	start_cpu_timer(cpu);
+}
+
+static void vmstat_update(struct work_struct *w)
+{
+	int cpu, this_cpu = smp_processor_id();
+	int sleepy_cpu_counter = 0;
+	static DEFINE_SPINLOCK(lock);
+
+	if (spin_trylock(&lock)) {
+
+		for_each_cpu_and(cpu, &vmstat_off_cpus, cpu_online_mask)
+			if (need_vmstat(cpu))
+				start_cpu_timer(cpu);
+			else
+				sleepy_cpu_counter++;
+
+		spin_unlock(&lock);
+	}
+
+	if (likely(refresh_cpu_vm_stats(this_cpu) ||
+		(sleepy_cpu_counter >= num_online_cpus())))
+			schedule_delayed_work(&__get_cpu_var(vmstat_work),
+				round_jiffies_relative(sysctl_stat_interval));
+	else
+		cpumask_set_cpu(this_cpu, &vmstat_off_cpus);
+}
+
 /*
  * Use the cpu notifier to insure that the thresholds are recalculated
  * when necessary.
@@ -1165,23 +1225,27 @@ static int __cpuinit vmstat_cpuup_callback(struct notifier_block *nfb,
 		unsigned long action,
 		void *hcpu)
 {
+	long this_cpu = smp_processor_id();
 	long cpu = (long)hcpu;

 	switch (action) {
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		refresh_zone_stat_thresholds();
-		start_cpu_timer(cpu);
+		setup_cpu_timer(cpu);
 		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	case CPU_DOWN_PREPARE:
 	case CPU_DOWN_PREPARE_FROZEN:
-		cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
-		per_cpu(vmstat_work, cpu).work.func = NULL;
+		if (!cpumask_test_cpu(cpu, &vmstat_off_cpus)) {
+			cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+			per_cpu(vmstat_work, cpu).work.func = NULL;
+		} else if (cpumask_test_cpu(this_cpu, &vmstat_off_cpus))
+			start_cpu_timer(this_cpu);
 		break;
 	case CPU_DOWN_FAILED:
 	case CPU_DOWN_FAILED_FROZEN:
-		start_cpu_timer(cpu);
+		setup_cpu_timer(cpu);
 		break;
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
@@ -1205,7 +1269,7 @@ static int __init setup_vmstat(void)
 	register_cpu_notifier(&vmstat_notifier);

 	for_each_online_cpu(cpu)
-		start_cpu_timer(cpu);
+		setup_cpu_timer(cpu);
 #endif
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);




^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-27 15:02 ` [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Gilad Ben-Yossef
  2012-03-27 15:04   ` Gilad Ben-Yossef
@ 2012-03-27 15:10   ` Peter Zijlstra
  2012-03-27 15:18     ` Gilad Ben-Yossef
  2012-05-22 21:31     ` Thomas Gleixner
  2012-03-28 11:43   ` Frederic Weisbecker
  2 siblings, 2 replies; 96+ messages in thread
From: Peter Zijlstra @ 2012-03-27 15:10 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Christoph Lameter,
	Daniel Lezcano, Geoff Levand, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, 2012-03-27 at 17:02 +0200, Gilad Ben-Yossef wrote:
> 
> In my case, I also had to disable the clocksource watchdog, but only
> because TSC is not stable on my VM.
> This is really not a nohz/cpuset problem. 

No but that thing is annoying, I ran afoul of it too the other day.

Thomas, would you object to a means of turning that thing off? And if
not, do you have a preference as to what particular means
(sysctl/sysfs/etc..) ?

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-27 15:10   ` Peter Zijlstra
@ 2012-03-27 15:18     ` Gilad Ben-Yossef
  2012-05-22 21:31     ` Thomas Gleixner
  1 sibling, 0 replies; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-27 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Christoph Lameter,
	Daniel Lezcano, Geoff Levand, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, Mar 27, 2012 at 5:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2012-03-27 at 17:02 +0200, Gilad Ben-Yossef wrote:
>>
>> In my case, I also had to disable the clocksource watchdog, but only
>> because TSC is not stable on my VM.
>> This is really not a nohz/cpuset problem.
>
> No but that thing is annoying, I ran afoul of it too the other day.
>
> Thomas, would you object to a means of turning that thing off? And if
> not, do you have a preference as to what particular means
> (sysctl/sysfs/etc..) ?

For what it's worth, there's already a CONFIG_CLOCKSOURCE_WATCHDOG
option, which is hard-coded to true right now.
Making it selectable was the path of least resistance for me:

commit 7a7af328cc0c8f8b837f8b23b4099e5bfd4c5462
Author: Gilad Ben-Yossef <gilad@benyossef.com>
Date:   Mon Mar 12 17:12:41 2012 +0200

    x86: make clocksource watchdog configurable (not for mainline)

    The clock source watchdog will wake up idle cores.

    Since I'm using KVM to test, where the TSC is always marked
    unstable, I've added this option to allow disabling it to
    assist testing.

    This is not intended for mainlining, just a reference for
    how I tested the patch set.

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0d3116c..21ffc76 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -100,9 +100,6 @@ config ARCH_DEFCONFIG
 config GENERIC_CMOS_UPDATE
 	def_bool y

-config CLOCKSOURCE_WATCHDOG
-	def_bool y
-
 config GENERIC_CLOCKEVENTS
 	def_bool y

@@ -1690,6 +1687,12 @@ config HOTPLUG_CPU
 	    automatically on SMP systems. )
 	  Say N if you want to disable CPU hotplug.

+config CLOCKSOURCE_WATCHDOG
+	bool "Clocksource watchdog"
+	default y
+	help
+	  Enable clock source watchdog.
+
 config COMPAT_VDSO
 	def_bool y
 	prompt "Compat VDSO support"
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index a45ca16..30223da 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -450,6 +450,8 @@ static void clocksource_enqueue_watchdog(struct clocksource *cs)
 static inline void clocksource_dequeue_watchdog(struct clocksource *cs) { }
 static inline void clocksource_resume_watchdog(void) { }
 static inline int clocksource_watchdog_kthread(void *data) { return 0; }
+void clocksource_mark_unstable(struct clocksource *cs) { }
+

 #endif /* CONFIG_CLOCKSOURCE_WATCHDOG */




^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-22 16:18       ` Christoph Lameter
@ 2012-03-27 15:21         ` Gilad Ben-Yossef
  2012-03-28 12:39           ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-27 15:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Thu, Mar 22, 2012 at 6:18 PM, Christoph Lameter <cl@linux.com> wrote:
> On Thu, 22 Mar 2012, Gilad Ben-Yossef wrote:
>
>> > Is there any way for userspace to know that the tick is not off yet due to
>> > this? It would make sense for us to have busy loop in user space that
>> > waits until the OS has completed all processing if that avoids future
>> > latencies for the application.
>> >
>>
>> I previously suggested having the user register to receive a signal
>> when the tick
>> is turned off. Since the tick is always turned off the user task is
>> the current task
>> by design, *I think* you can simply mark the signal pending when you
>> turn the tick off.
>
> Ok that sounds good. You would define a new signal for this?
>

My gut instinct is to let the process register a specific signal
(probably in the RT range) that it wants to receive when the tick
goes off and/or on.

> So we would startup the application. App will do all prep work (memory
> allocation, device setup etc etc) and then wait for the signal to be
> received. After that it would enter the low latency processing phase.
>
> Could we also get a signal if something disrupts the peace and switches
> the timer interrupt on again?
>

I think you'll have to, since once you have the tick turned off there
is no guarantee that it won't get turned on again by a timer scheduling
a task or by an IPI.



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-22 17:18       ` Chris Metcalf
@ 2012-03-27 15:31         ` Gilad Ben-Yossef
  2012-03-27 15:43           ` Chris Metcalf
  0 siblings, 1 reply; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-27 15:31 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Christoph Lameter, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Thu, Mar 22, 2012 at 7:18 PM, Chris Metcalf <cmetcalf@tilera.com> wrote:
> On 3/22/2012 3:38 AM, Gilad Ben-Yossef wrote:
>> On Wed, Mar 21, 2012 at 4:54 PM, Christoph Lameter <cl@linux.com> wrote:
>>> On Wed, 21 Mar 2012, Frederic Weisbecker wrote:
>>>
>>>> If RCU is waiting for the current CPU to complete a grace
>>>> period, don't turn off the tick. Unlike dynctik-idle, we
>>>> are not necessarily going to enter into rcu extended quiescent
>>>> state, so we may need to keep the tick to note current CPU's
>>>> quiescent states.
>>> Is there any way for userspace to know that the tick is not off yet due to
>>> this? It would make sense for us to have busy loop in user space that
>>> waits until the OS has completed all processing if that avoids future
>>> latencies for the application.
>>>
>> I previously suggested having the user register to receive a signal
>> when the tick
>> is turned off. Since the tick is always turned off the user task is
>> the current task
>> by design, *I think* you can simply mark the signal pending when you
>> turn the tick off.
>>
>> The user would register a signal handler to set a flag when it is
>> called and then busy
>> loop waiting for a flag to clear.
>
> This sounds plausible, but the kernel would have to know that the tick not
> only was stopped currently, but also would still be stopped when the signal
> handler's sigreturn syscall was performed.

Well, I'd say send a signal when the tick is turned off and another
signal when it's turned on again.


> The problem we've seen is that
> it's sometimes somewhat nondeterministic when the kernel might decide it
> needed some more ticking, once you let kernel code start to run.  For
> example, for RCU ops the kernel can choose to ignore the nohz cpuset cores
> when they're running userspace code only, but as soon as they get back into
> the kernel for any reason, you may need to schedule a grace period, and so
> just returning from the "you have no more ticks!" signal handler ends up
> causing ticks to be scheduled.

There is no real difference, from the user's standpoint, between the
sigreturn syscall doing something that causes the tick to be turned on
and an IPI or timer turning on the tick a nanosecond after the sigreturn
syscall has returned.

The sigreturn syscall turning the tick on is just a special, though
annoying, case of the tick getting turned on by something.

> The approach we took for the Tilera dataplane mode was to have a syscall
> that would hold the task in the kernel until any ticks were done, and only
> then return to userspace.  (This is the same set_dataplane() syscall that
> also offers some flags to control and debug the dataplane stuff in general;
> in fact the "hold in kernel" support is a mode we set for all syscalls, to
> keep things deterministic.)  This way the "busy loop" is done in the
> kernel, but in fact we explicitly go into idle until the next tick, so it's
> lower-power.
>

Yes, I saw that. My gripe with it is that it puts the policy of what to
do while we wait for the tick to go away in the kernel. I usually hate
it when the kernel makes decisions on what to do. I want it to provide
mechanisms and let the programmer set the policy, e.g. have an LED blink
while you're waiting for the tick to go away, so that the poor end user
will know we are still waiting for the stars to align just right...

I'm not sure that is so big a deal, but that is why I thought of a
signal handler.

> An alternative approach, not so good for power but at least avoiding the
> "use the kernel to avoid the kernel" aspect of signals, would be to
> register a location in userspace that the kernel would write to when it
> disabled the tick, and userspace could then just spin reading memory.
>

That's cool for letting you know when the tick goes away, but not for
alerting you when it suddenly comes back... :-)

Gilad

> --
> Chris Metcalf, Tilera Corp.
> http://www.tilera.com
>




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-27 15:31         ` Gilad Ben-Yossef
@ 2012-03-27 15:43           ` Chris Metcalf
  2012-03-28  8:36             ` Gilad Ben-Yossef
  0 siblings, 1 reply; 96+ messages in thread
From: Chris Metcalf @ 2012-03-27 15:43 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Christoph Lameter, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On 3/27/2012 11:31 AM, Gilad Ben-Yossef wrote:
> On Thu, Mar 22, 2012 at 7:18 PM, Chris Metcalf <cmetcalf@tilera.com> wrote:
>> On 3/22/2012 3:38 AM, Gilad Ben-Yossef wrote:
>>> On Wed, Mar 21, 2012 at 4:54 PM, Christoph Lameter <cl@linux.com> wrote:
>>>> On Wed, 21 Mar 2012, Frederic Weisbecker wrote:
>>>>
>>>>> If RCU is waiting for the current CPU to complete a grace
>>>>> period, don't turn off the tick. Unlike dynctik-idle, we
>>>>> are not necessarily going to enter into rcu extended quiescent
>>>>> state, so we may need to keep the tick to note current CPU's
>>>>> quiescent states.
>>>> Is there any way for userspace to know that the tick is not off yet due to
>>>> this? It would make sense for us to have busy loop in user space that
>>>> waits until the OS has completed all processing if that avoids future
>>>> latencies for the application.
>>>>
>>> I previously suggested having the user register to receive a signal
>>> when the tick
>>> is turned off. Since the tick is always turned off the user task is
>>> the current task
>>> by design, *I think* you can simply mark the signal pending when you
>>> turn the tick off.
>>>
>>> The user would register a signal handler to set a flag when it is
>>> called and then busy
>>> loop waiting for a flag to clear.
>> This sounds plausible, but the kernel would have to know that the tick not
>> only was stopped currently, but also would still be stopped when the signal
>> handler's sigreturn syscall was performed.
> Well, I'd say send a signal when the tick is turned off and another
> signal when it's
> turned on again.

The thing is, what our customers seem to want is to be able to tell the
kernel to go away and not bother them again, ever, as long as their
application is running correctly.  Obviously if it crashes, or if some
intervention is required, or whatever, they want the kernel to step in, but
otherwise the proposed signal mechanisms don't seem to help the case that
they're interested in.  I don't think we've seen a customer application
where the signal mechanism would be helpful (unfortunately, since it does
seem like a cool idea).

Basically if the kernel interrupts a nohz application core, that's a fail. 
It's interesting to know that such a fail has happened, but sending a
signal just makes it an even worse fail: more overhead.  One thing I could
imagine that might be useful would be to register a region of user memory
that the kernel could put statistics of some kind into, obviously the
"bool" flag that says whether you're running tickless, but also things like
a count of the number of interrupts (e.g. ticks, but really anything) the
kernel had to deliver, the time of the last interrupt that was delivered,
maybe some breakdown by type of interrupt, etc.  Then if the application
detects an interruption, or perhaps just periodically, it can inspect that
state area and report on any bad developments: and these would be basically
kernel bugs from failing to protect the nohz core the way it had asked, or
else application bugs from accidentally requesting a kernel service
unintentionally.
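
As a rough illustration of the statistics area described above, the
sketch below shows one possible layout and how an application might poll
it. Everything here is hypothetical: struct nohz_stats, its fields and
nohz_new_interrupts() are invented for this example, and how the region
would be registered with the kernel is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout of the registered user memory region. */
struct nohz_stats {
	bool     tickless;          /* true while the tick is stopped */
	uint64_t nr_interrupts;     /* interrupts delivered to this core */
	uint64_t last_interrupt_ns; /* time of the most recent interrupt */
};

/*
 * Return how many interrupts were counted since the previous call.
 * *seen holds the counter value observed last time and is updated.
 */
uint64_t nohz_new_interrupts(const volatile struct nohz_stats *st,
			     uint64_t *seen)
{
	uint64_t now, delta;

	assert(st && seen);
	now = st->nr_interrupts;
	delta = now - *seen;
	*seen = now;
	return delta;
}
```

An application would call nohz_new_interrupts() periodically, or after
noticing a stall, and treat any nonzero result as something to report.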

>> The problem we've seen is that
>> it's sometimes somewhat nondeterministic when the kernel might decide it
>> needed some more ticking, once you let kernel code start to run.  For
>> example, for RCU ops the kernel can choose to ignore the nohz cpuset cores
>> when they're running userspace code only, but as soon as they get back into
>> the kernel for any reason, you may need to schedule a grace period, and so
>> just returning from the "you have no more ticks!" signal handler ends up
>> causing ticks to be scheduled.
> There is no real difference from the user's standpoint between the
> signal return syscall doing something that causes the tick to be turned
> on and an IPI or timer that turns on the tick a nanosecond after the
> signal return syscall returned.
>
> The signal return syscall turning the tick on is just a special, though
> annoying, case of the tick getting turned on by something.

Yes, but see above: the claim I'm making is that we can arrange for a
well-behaved application to *expect* not to get kernel interrupts, so if
they happen, something has gone wrong.

>> The approach we took for the Tilera dataplane mode was to have a syscall
>> that would hold the task in the kernel until any ticks were done, and only
>> then return to userspace.  (This is the same set_dataplane() syscall that
>> also offers some flags to control and debug the dataplane stuff in general;
>> in fact the "hold in kernel" support is a mode we set for all syscalls, to
>> keep things deterministic.)  This way the "busy loop" is done in the
>> kernel, but in fact we explicitly go into idle until the next tick, so it's
>> lower-power.
>>
> Yes, I saw that. My gripe with it is that it puts the policy of what to
> do while we wait for the tick to go away in the kernel. I usually hate
> the kernel taking decisions on what to do. I want it to give mechanisms
> and let the programmer set the policy, e.g. have an LED blink while
> you're waiting for the tick to go away so that the poor end user will
> know we are still waiting for the stars to align just right...

This is a fair point.  On the other hand, the way we implemented it is
basically just a mode flag that is checked on all returns from the kernel,
that allows userspace to invoke kernel functions "synchronously", but
slowly, and not get hammered later by unexpected interrupts.  So from that
point of view, we don't expect userspace to have anything useful to do on
return from syscalls or page faults other than wait in the kernel anyway. 
But if the application did want to do something fancy for those few
hundredths of a second while the ticks settle, you could imagine not using
this "wait in kernel" mode, and instead spinning on the proposed data
structure described above.

> I'm not sure that is so big a deal, but that is why I thought of a
> signal handler.
>
>> An alternative approach, not so good for power but at least avoiding the
>> "use the kernel to avoid the kernel" aspect of signals, would be to
>> register a location in userspace that the kernel would write to when it
>> disabled the tick, and userspace could then just spin reading memory.
>>
> That's cool for letting you know when the tick goes away but not for
> alerting you when it suddenly comes back... :-)

Yes, and in fact delivering a signal is not a bad way to let the
application know that either it, or the kernel, just screwed up.  Currently
our dataplane code just handles this case with console backtraces (for the
"debug" mode) or by shooting down the application with SIGKILL (in "strict"
mode when it's said it wasn't going to use the kernel any more).

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
  2012-03-27 10:50     ` Frederic Weisbecker
@ 2012-03-27 16:08       ` Christoph Lameter
  2012-03-27 16:47         ` Peter Zijlstra
  2012-03-30  1:34         ` Frederic Weisbecker
  0 siblings, 2 replies; 96+ messages in thread
From: Christoph Lameter @ 2012-03-27 16:08 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin,
	Dimitri Sivanich

On Tue, 27 Mar 2012, Frederic Weisbecker wrote:

> > Any way to manually specify which cpu? We f.e. always "sacrifice" cpu 0
> > for OS activities. We would like to have all OS processing things
> > restricted to cpu 0 so that the rest of the processors do not experience
> > the OS noise.
>
> Somebody tries to do this: https://lkml.org/lkml/2011/11/8/346
>
> But in the case of nohz cpusets there is a problem to solve:
>
> What if all CPUs are tickless (idle or busy), who must take
> the timekeeping duty? Should we pick one of the busy CPUs? Or
> keep one CPU with the tick even if it's idle? How do we choose
> this CPU?

Then it's the user's fault because he specified the processor to use. There
is no picking if it's manually assigned.

> Maybe we need to define another flag on cpusets to assign the
> timekeeping duty to any CPU on a flagged set. This way we can
> force that duty to the CPU(s) we want.

I wish you would disentangle the nohz work from the cpusets. Cpusets is
aged and being replaced by cgroups. And the cgroup work is something that
is not suitable for many loads given the VM overhead added.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-27 12:13     ` Frederic Weisbecker
@ 2012-03-27 16:13       ` Christoph Lameter
  2012-03-27 16:24         ` Steven Rostedt
  2012-03-28 11:53         ` Frederic Weisbecker
  0 siblings, 2 replies; 96+ messages in thread
From: Christoph Lameter @ 2012-03-27 16:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, 27 Mar 2012, Frederic Weisbecker wrote:

> > Is there any way for userspace to know that the tick is not off yet due to
> > this? It would make sense for us to have busy loop in user space that
> > waits until the OS has completed all processing if that avoids future
> > latencies for the application.
>
> What is the usecase you have in mind? Is it for realtime purpose?

Please do not use "realtime" since I am not sure what you mean by that.
It's for low latency applications that cannot use "realtime" because that
implies high latencies.

> The "tick stopped" is a volatile and relative state.

The use case is an application that cannot tolerate the latencies
introduced by timer tick processing. It will only start running when the
system is in a sufficiently quiet state.

> Relative because if a timer list is enqueued to fire 1 second later,
> the tick will be stopped until that happens. How do we consider this (common)
> case?
>
> Also as Chris noted it is volatile because the tick can be restarted anytime
> for random reasons: the CPU receives an IPI which makes it restart the
> periodic tick.

OK, some sort of notification would be good for that case. If a timer tick
happens and that was unavoidable then it would be good to log the reason
why this occurred so that the system can be configured in such a way that
these interruptions are minimized.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-27 15:05     ` Gilad Ben-Yossef
@ 2012-03-27 16:22       ` Christoph Lameter
  2012-03-28  6:47         ` Gilad Ben-Yossef
  0 siblings, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-03-27 16:22 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, 27 Mar 2012, Gilad Ben-Yossef wrote:

> +static void vmstat_update(struct work_struct *w)
> +{
> +	int cpu, this_cpu = smp_processor_id();
> +	int sleepy_cpu_counter = 0;
> +	static spinlock_t lock;
> +
> +	if(spin_trylock(&lock)) {

Trylock would cause cache bouncing between vmstat runs on various
processors. The reason that vmstat_update exists is to avoid these cache
bounces. Please no exclusive cacheline acquisition by all cpus by default.

The best method would be to assign a sacrificial lamb cpu and check if we
are running on that cpu. That way taking a lock can be avoided.
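
A rough userspace sketch of that suggestion follows. It assumes a
hypothetical constant VMSTAT_SCAN_CPU naming the designated cpu and a
hypothetical helper should_scan_all_cpus(); neither is a kernel symbol.
In the kernel the caller would presumably pass smp_processor_id(), here
the cpu number is simply a parameter. The point is that the check only
compares against a value nobody writes, so no cpu has to take the
cacheline exclusively the way spin_trylock() on a shared lock would.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: the one cpu designated to do the global scan. */
#define VMSTAT_SCAN_CPU 0

/*
 * Decide whether the calling cpu should do the cross-cpu work.
 * Read-only comparison: no lock word is written by anybody.
 */
bool should_scan_all_cpus(int this_cpu)
{
	assert(this_cpu >= 0);
	return this_cpu == VMSTAT_SCAN_CPU;
}
```

Every other cpu returns false immediately and only handles its own
per-cpu counters.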


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-27 16:13       ` Christoph Lameter
@ 2012-03-27 16:24         ` Steven Rostedt
  2012-03-28  0:42           ` Christoph Lameter
  2012-03-28 11:53         ` Frederic Weisbecker
  1 sibling, 1 reply; 96+ messages in thread
From: Steven Rostedt @ 2012-03-27 16:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, 2012-03-27 at 11:13 -0500, Christoph Lameter wrote:

> Please do not use "realtime" since I am not sure what you mean by that.
> It's for low latency applications that cannot use "realtime" because that
> implies high latencies.

This statement totally confuses me, as the whole point of the -rt
(realtime) patch, is for lower latencies. Where do you get "realtime"
implies high latencies from?

-- Steve



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
  2012-03-27 16:08       ` Christoph Lameter
@ 2012-03-27 16:47         ` Peter Zijlstra
  2012-03-28  1:12           ` Christoph Lameter
  2012-03-30  1:34         ` Frederic Weisbecker
  1 sibling, 1 reply; 96+ messages in thread
From: Peter Zijlstra @ 2012-03-27 16:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin,
	Dimitri Sivanich

On Tue, 2012-03-27 at 11:08 -0500, Christoph Lameter wrote:
> 
> I wish you would disentangle the nohz work from the cpusets. Cpusets is
> aged and being replaced by cgroups. And the cgroup work is something that
> is not suitable for many loads given the VM overhead added. 

What VM overhead? Are you talking about the memcg nonsense? That's
entirely optional, you don't need to either build that or enable it.

And if we ever get rid of that multiple hierarchy nonsense I don't see a
reason to get rid of cpuset at all. The only reason to want to replace
it is to avoid the dis-joint-ness it has with the cpu controller (and
possibly the memcg one).



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-27 16:24         ` Steven Rostedt
@ 2012-03-28  0:42           ` Christoph Lameter
  2012-03-28  1:06             ` Steven Rostedt
  0 siblings, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-03-28  0:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, 27 Mar 2012, Steven Rostedt wrote:

> On Tue, 2012-03-27 at 11:13 -0500, Christoph Lameter wrote:
>
> > Please do not use "realtime" since I am not sure what you mean by that.
> > It's for low latency applications that cannot use "realtime" because that
> > implies high latencies.
>
> This statement totally confuses me, as the whole point of the -rt
> (realtime) patch, is for lower latencies. Where do you get "realtime"
> implies high latencies from?

Obviously compiling a kernel with preemption introduces additional
overhead to guarantee more deterministic behavior. Additional overhead
increases latencies generated by the OS in general. Compile a kernel
without preemption and it will run faster and thus have lower latencies.

Realtime avoids high latency spikes but in general increases average OS
latencies.




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-28  0:42           ` Christoph Lameter
@ 2012-03-28  1:06             ` Steven Rostedt
  2012-03-28  1:19               ` Christoph Lameter
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Rostedt @ 2012-03-28  1:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, 2012-03-27 at 19:42 -0500, Christoph Lameter wrote:
> On Tue, 27 Mar 2012, Steven Rostedt wrote:
> 
> > On Tue, 2012-03-27 at 11:13 -0500, Christoph Lameter wrote:
> >
> > > Please do not use "realtime" since I am not sure what you mean by that.
> > > It's for low latency applications that cannot use "realtime" because that
> > > implies high latencies.
> >
> > This statement totally confuses me, as the whole point of the -rt
> > (realtime) patch, is for lower latencies. Where do you get "realtime"
> > implies high latencies from?
> 
> Obviously compiling a kernel with preemption introduces additional
> overhead to guarantee more deterministic behavior. Additional overhead
> increases latencies generated by the OS in general. Compile a kernel
> without preemption and it will run faster and thus have lower latencies.

I call that "lower overhead".

> 
> Realtime avoids high latency spikes but in general increases average OS
> latencies.

Now to me the definition of a latency is the difference in time
something was supposed to happen and the time it actually does. A
reaction time.

I see you are calling the time spent in the kernel a latency. The time
added to complete a task is called "overhead". Yes, realtime adds
overhead to keep reaction time latency to a minimum.

According to wikipedia, the first thing it says about latency is:

"Latency is a measure of time delay experienced in a system, the precise
definition of which depends on the system and the time being measured.
Latencies may have different meaning in different contexts."

That last sentence is key. So let's avoid the term "latency" as it
obviously has a different meaning to the both of us.

Instead, let's use "determinism" (what we call latency in the realtime
world) and "overhead" (what you seem to see as latency caused by the
kernel).
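
The two quantities being argued about can be separated with a small
calculation. The sketch below is only an illustration: struct
delay_summary and summarize_delays() are names invented here, and the
sample values mentioned afterwards are made up, not measurements.

```c
#include <assert.h>
#include <stddef.h>

struct delay_summary {
	unsigned long mean;  /* average cost, i.e. "overhead" */
	unsigned long worst; /* largest sample, the bound on "determinism" */
};

/* Summarize n delay samples (any consistent time unit). */
struct delay_summary summarize_delays(const unsigned long *samples, size_t n)
{
	struct delay_summary s = { 0, 0 };
	unsigned long sum = 0;
	size_t i;

	assert(samples && n > 0);
	for (i = 0; i < n; i++) {
		sum += samples[i];
		if (samples[i] > s.worst)
			s.worst = samples[i];
	}
	s.mean = sum / n; /* integer division, truncates */
	return s;
}
```

With invented samples {2, 2, 2, 2, 22} for one kernel and {7, 7, 7, 7, 7}
for another, the first has the lower mean (6 against 7) while the second
has the lower worst case (7 against 22), which is the trade-off both
sides are describing with different words.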

-- Steve



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
  2012-03-27 16:47         ` Peter Zijlstra
@ 2012-03-28  1:12           ` Christoph Lameter
  2012-03-28  8:39             ` Peter Zijlstra
  0 siblings, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-03-28  1:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin,
	Dimitri Sivanich

On Tue, 27 Mar 2012, Peter Zijlstra wrote:

> On Tue, 2012-03-27 at 11:08 -0500, Christoph Lameter wrote:
> >
> > I wish you would disentangle the nohz work from the cpusets. Cpusets is
> > aged and being replaced by cgroups. And the cgroup work is something that
> > is not suitable for many loads given the VM overhead added.
>
> What VM overhead? Are you talking about the memcg nonsense? That's
> entirely optional, you don't need to either build that or enable it.

cgroups in general cause much more complex VM processing, with multiple
LRUs and additional checks in various places.

Even just adding cpusets enables the group scheduler functionality f.e.
which creates significantly larger scheduling latencies. It also
complicates key VM allocation paths etc.

> And if we ever get rid of that multiple hierarchy nonsense I don't see a
> reason to get rid of cpuset at all. The only reason to want to replace
> it is to avoid the dis-joint-ness it has with the cpu controller (and
> possibly the memcg one).

I like cpusets much more than cgroups. I agree with you.

But I am not sure that cpusets are needed for nohz. We already have an
isolcpu set and it sounds to me that nohz is generally useful.

It would seem that the nohz patches would be much simpler if they did not
require cpusets to administer. The only thing that would be needed is to
have one cpu that is not subject to nohz. The logical choice is a
timekeeper cpu (which is usually cpu 0). Having that configurable would be
an extra bonus.
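
The arrangement suggested here, every cpu except one configurable
timekeeper being eligible for nohz, can be sketched with plain bitmasks.
This is a userspace illustration only: nohz_allowed_mask() is a
hypothetical helper, not part of the patchset, and it uses a 64-bit mask
in place of the kernel's cpumask type.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the set of cpus that would be allowed to stop their tick:
 * every online cpu except the designated timekeeper.
 * Bit n of a mask stands for cpu n (at most 64 cpus in this sketch).
 */
uint64_t nohz_allowed_mask(uint64_t online_mask, unsigned int timekeeper_cpu)
{
	assert(timekeeper_cpu < 64);
	return online_mask & ~(UINT64_C(1) << timekeeper_cpu);
}
```

With four cpus online (mask 0xF) and cpu 0 keeping time, the result is
0xE, i.e. cpus 1-3 may go tickless.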


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-28  1:06             ` Steven Rostedt
@ 2012-03-28  1:19               ` Christoph Lameter
  2012-03-28  1:35                 ` Steven Rostedt
  0 siblings, 1 reply; 96+ messages in thread
From: Christoph Lameter @ 2012-03-28  1:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, 27 Mar 2012, Steven Rostedt wrote:

> > Obviously compiling a kernel with preemption introduces additional
> > overhead to guarantee more deterministic behavior. Additional overhead
> > increases latencies generated by the OS in general. Compile a kernel
> > without preemption and it will run faster and thus have lower latencies.
>
> I call that "lower overhead".

Good marketing but it does not change the facts.

> "Latency is a measure of time delay experienced in a system, the precise
> definition of which depends on the system and the time being measured.
> Latencies may have different meaning in different contexts."
>
> That last sentence is key. So lets avoid the term "latency" as it
> obviously has a different meaning to the both of us.
>
> Instead, lets use "determinism" (what we call latency in the realtime
> world) and "overhead" (what you seem to see as latency caused by the
> kernel).

I sure wish you would be using the term determinism instead of "latency".

Overhead causes latency and the definition that you quoted is what I am
talking about. Latencies are the delays in processing experienced by the
application through the speed of system calls and by interruptions of
a user space process by the kernel for various reasons.






^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-28  1:19               ` Christoph Lameter
@ 2012-03-28  1:35                 ` Steven Rostedt
  2012-03-28  3:17                   ` Steven Rostedt
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Rostedt @ 2012-03-28  1:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, 2012-03-27 at 20:19 -0500, Christoph Lameter wrote:
> On Tue, 27 Mar 2012, Steven Rostedt wrote:
> 
> > > Obviously compiling a kernel with preemption introduces additional
> > > overhead to guarantee more deterministic behavior. Additional overhead
> > > increases latencies generated by the OS in general. Compile a kernel
> > > without preemption and it will run faster and thus have lower latencies.
> >
> > I call that "lower overhead".
> 
> Good marketing but it does not change the facts.

I see we are mixing the paint for the bike shed.

> 
> > "Latency is a measure of time delay experienced in a system, the precise
> > definition of which depends on the system and the time being measured.
> > Latencies may have different meaning in different contexts."
> >
> > That last sentence is key. So lets avoid the term "latency" as it
> > obviously has a different meaning to the both of us.
> >
> > Instead, lets use "determinism" (what we call latency in the realtime
> > world) and "overhead" (what you seem to see as latency caused by the
> > kernel).
> 
> I sure wish you would be using the term determinism instead of "latency".
> 
> Overhead causes latency and the definition that you quoted is what I am
> talking about. Latencies are the delays in processing experienced by the
> application through the speed of system calls and by interruptions  of
> a user space process by the kernel for various reasons.

I could also argue that a non-preempt kernel has a large latency as
well. Although it may have good throughput for one task, another task
may suffer from a large latency waiting for a lower priority task to get
out of a system call.

You say tomAYto I say tomAHto.

Read the article: http://en.wikipedia.org/wiki/Latency_%28engineering%29

Especially the section about: Computer hardware and operating system latency

You'll see that it describes latency much closer to my definition than
yours.

Heck, google "operating system latency" and you'll see a lot of talk
about reaction times and how fast the hardware can do its job. I don't
see anything about the time a system call takes.

-- Steve




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-28  1:35                 ` Steven Rostedt
@ 2012-03-28  3:17                   ` Steven Rostedt
  2012-03-28  7:55                     ` Gilad Ben-Yossef
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Rostedt @ 2012-03-28  3:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, 2012-03-27 at 21:35 -0400, Steven Rostedt wrote:

> > > I call that "lower overhead".
> > 
> > Good marketing but it does not change the facts.

I'm replying again because this comment just pisses me off.

I'm the only male in my household, living with a wife, two teenage
daughters and two bitches (I own two female dogs). This is not the time
of month to be arguing with me!

The fact is, you live in your own little world. You see things from your
own little perspective. You can define the time a system call takes as a
latency, but that is just one very small aspect of latencies. There's
lots of other kinds of latencies and if you did the search I told you
to, you would see that. In fact, the latency caused by system calls is
such a small niche of the types of latencies there are. I'm not counting
the time a system call waits for a device. Although a preempt kernel
would be faster for such a case.

Having zero preemption in the kernel makes the system calls the fastest.
And to a single thread running by itself on a CPU, this would be the
latency it is most interested in. But for a system as a whole, this can
actually be the cause of much larger latencies.

Compile your kernel without preemption (Server). Use it for a while as a
desktop. Then compile it with CONFIG_PREEMPT, the Preemptible Kernel
option (which in the menu is also called: Low-Latency Desktop!)

Run that for a bit. You'll notice a smoother feeling with the
CONFIG_PREEMPT kernel.

You're talking about overhead not latency. Sure, overhead can cause
latencies, and latencies can cause overhead. What we call latency has
*nothing* to do with marketing. We happen to handle the other 98% of
latencies in the system.

-- Steve




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-27 16:22       ` Christoph Lameter
@ 2012-03-28  6:47         ` Gilad Ben-Yossef
  0 siblings, 0 replies; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-28  6:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, Mar 27, 2012 at 6:22 PM, Christoph Lameter <cl@linux.com> wrote:
> On Tue, 27 Mar 2012, Gilad Ben-Yossef wrote:
>
>> +static void vmstat_update(struct work_struct *w)
>> +{
>> +     int cpu, this_cpu = smp_processor_id();
>> +     int sleepy_cpu_counter = 0;
>> +     static spinlock_t lock;
>> +
>> +     if(spin_trylock(&lock)) {
>
> Trylock would cause cache bouncing between vmstat runs on various
> processors. The reason that vmstat_update exists is to avoid these cache
> bounces. Please no exclusive cacheline acquisition by all cpus by default.
>

Right, I didn't think of that.

> The best method would be to assign a sacrificial lamb cpu and check if we
> are running on that cpu. That way taking a lock can be avoided.

Cool, I'll do it.

Thanks

Gilad



-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-28  3:17                   ` Steven Rostedt
@ 2012-03-28  7:55                     ` Gilad Ben-Yossef
  2012-03-28 12:21                       ` Frederic Weisbecker
  2012-03-28 14:02                       ` Steven Rostedt
  0 siblings, 2 replies; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-28  7:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Christoph Lameter, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, Mar 28, 2012 at 5:17 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 2012-03-27 at 21:35 -0400, Steven Rostedt wrote:
>
>> > > I call that "lower overhead".
>> >
>> > Good marketing but it does not change the facts.
>
> I'm replying again because this comment just pisses me off.
>
> I'm the only male in my household, living with a wife, two teenage
> daughters and two bitches (I own two female dogs). This is not the time
> of month to be arguing with me!

LOL. I'm married with two daughters. I feel your pain  :-)

>
> The fact is, you live in your own little world. You see things from your
> own little perspective. You can define the time a system call takes as a
> latency, but that is just one very small aspect of latencies. There's
> lots of other kinds of latencies and if you did the search I told you
> to, you would see that. In fact, the latency caused by system calls is
> such a small niche of the types of latencies there are. I'm not counting
> the time a system call waits for a device. Although a preempt kernel
> would be faster for such a case.
>

At the risk of butting in on this little flame war, I think it is worth
mentioning that this discussion arose in the context of a feature
(cpuset/nohz) that deals with a single task running alone on a CPU and
making zero use of kernel services, from scheduling, through interrupts,
to system calls. It's just a pure 100% cpu bound task.

For the work loads this is intended for, the time it takes to respond to
an interrupt or context switch to the kernel and back for a system call,
is too high. It doesn't matter how predictable that time is - if the time
to do a system call in the best possible case is too high for you, having
that time predictable only means you are predictably late :-)

This is not an observation about Linux, or preempt-rt, or any OS for that
matter. It's just a statement of fact. There is nothing you can do to make
the OS better to get the time lower. It's just a process that doesn't want
OS involvement at all from the point it is started until it's done, except
in exception cases. The entire OS is overhead.

The traditional way to deal with these beasts is to run it on bare metal
with no OS. What we're discussing is a way to get Linux to give a task
bare metal like performance. That is useful because you can debug and
manage the tasks using OS services, but still get bare metal like
performance. In the context of this discussion, *any* kernel activity on
that CPU during its run time is overhead, really.

Preempt-rt is good. I know because I was involved in implementing it in a
real time system that has saved dozens of lives already. When that system
misses a dead line, that is exactly what you get - a line of dead people.
Seriously. But it works. It just doesn't happen to apply here.

I don't know if this is what Christoph meant to say, but can we please
try to get along now? :-)

Thanks,
Gilad



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-27 15:43           ` Chris Metcalf
@ 2012-03-28  8:36             ` Gilad Ben-Yossef
  0 siblings, 0 replies; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-28  8:36 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Christoph Lameter, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, Mar 27, 2012 at 5:43 PM, Chris Metcalf <cmetcalf@tilera.com> wrote:

> The thing is, what our customers seem to want is to be able to tell the
> kernel to go away and not bother them again, ever, as long as their
> application is running correctly.  Obviously if it crashes, or if some
> intervention is required, or whatever, they want the kernel to step in, but
> otherwise the proposed signal mechanisms don't seem to help the case that
> they're interested in.  I don't think we've seen a customer application
> where the signal mechanism would be helpful (unfortunately, since it does
> seem like a cool idea).

I understand. I think the key phrase here is "our customers". Which is fine -
we're all doing this to scratch a personal (or corporate...) itch. But the
question is: are there other possible users? Can we build a mechanism that
serves both "our customers" and the other guys? I think we can.

Case in point: consider high performance computing people. They take their
4096-way SGI machine, carve off a few "system" CPUs and run a dedicated
process on each of the remaining cores doing Fourier transforms, or whatever
it is HPC people do, with the result spilling into shared memory.

It's a 100% CPU-bound single task pinned to a single core. The scheduler tick
and all other kernel activity are as much a nuisance to them as they are to
your customers. But if the kernel happens to start the tick for 10 seconds
during their 37-hour run, they certainly don't want to have that process
killed! Logging the incident can be useful for later analysis, though.

This is why I believe the signal mechanism is useful - your customers can have
code like this (add memory barriers as needed, of course):

tick goes away signal handler:

nohz = 1;

tick comes back signal handler:

if (!app_started)
   nohz = 0;
else
   abort();

The main function will have something like this:

while(!nohz);
app_started = 1;
...


The HPC people, on the other hand, can put code in the signal handler to just
record a timestamp in a log in shared memory.

Same mechanism, two use cases.
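
The pseudocode above could be fleshed out roughly as follows. This is only a
sketch under loud assumptions: the tick-off/tick-on signals are a proposal in
this thread, not an existing kernel interface, so SIG_TICK_OFF and SIG_TICK_ON
(mapped here onto two real-time signal numbers) and the helper names
install_tick_handlers() and wait_for_quiet_cpu() are hypothetical placeholders.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>

/*
 * Hypothetical signal numbers: no kernel delivers these today. They stand in
 * for whatever signals a task would register for "tick went away" and
 * "tick came back".
 */
#define SIG_TICK_OFF (SIGRTMIN)
#define SIG_TICK_ON  (SIGRTMIN + 1)

/* sig_atomic_t + volatile: safe to share between handler and main flow. */
static volatile sig_atomic_t nohz;
static volatile sig_atomic_t app_started;

/* "tick goes away" handler */
static void on_tick_off(int sig)
{
	(void)sig;
	nohz = 1;
}

/* "tick comes back" handler */
static void on_tick_on(int sig)
{
	(void)sig;
	if (!app_started)
		nohz = 0;	/* still in setup: just keep waiting */
	else
		abort();	/* strict policy: any later tick is a failure */
}

/* Install both handlers; returns 0 on success, -1 on error. */
static int install_tick_handlers(void)
{
	struct sigaction sa;

	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;

	sa.sa_handler = on_tick_off;
	if (sigaction(SIG_TICK_OFF, &sa, NULL) != 0)
		return -1;

	sa.sa_handler = on_tick_on;
	if (sigaction(SIG_TICK_ON, &sa, NULL) != 0)
		return -1;

	return 0;
}

/* Busy-wait in user space until the tick is reported off, then commit. */
static void wait_for_quiet_cpu(void)
{
	while (!nohz)
		;
	app_started = 1;
}
```

The main function would call install_tick_handlers() after its setup work,
then wait_for_quiet_cpu(), then enter the processing loop. A small window
remains between leaving the wait loop and setting app_started, so a real
implementation would have to decide how to treat a tick arriving right there.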

>
> Basically if the kernel interrupts a nohz application core, that's a fail.
> It's interesting to know that such a fail has happened, but sending a
> signal just makes it an even worse fail: more overhead.

So in the lab register a handler to abort() the app to debug it.
In production install a SIG_IGN signal handler and hope for the best :-)

> One thing I could
> imagine that might be useful would be to register a region of user memory
> that the kernel could put statistics of some kind into, obviously the
> "bool" flag that says whether you're running tickless, but also things like
> a count of the number of interrupts (e.g. ticks, but really anything) the
> kernel had to deliver, the time of the last interrupt that was delivered,
> maybe some breakdown by type of interrupt, etc.  Then if the application
> detects an interruption, or perhaps just periodically, it can inspect that
> state area and report on any bad developments: and these would be basically
> kernel bugs from failing to protect the nohz core the way it had asked, or
> else application bugs from accidentally requesting a kernel service
> unintentionally.

I think you've re-invented /proc/interrupts and maybe a couple of
debugfs/tracing entries :-)
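
As a rough illustration of using /proc/interrupts this way, an application
could sample the per-CPU count of one row before and after its run and compare
the two. The sketch below assumes the usual layout (a row label followed by
one decimal counter per online CPU) and uses the x86 "LOC" (local timer) row
as its example; row labels differ between architectures, and both helper names
are made up for this example.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * If `line` is the /proc/interrupts row labelled `label`, store the counter
 * in column `cpu` (0-based position among the numeric columns) in *count and
 * return 0. Return -1 if the row does not match or has too few columns.
 */
static int irq_count_from_line(const char *line, const char *label,
			       int cpu, unsigned long long *count)
{
	const char *p = line;
	size_t len = strlen(label);
	char *end;
	int i;

	while (*p == ' ')
		p++;
	if (strncmp(p, label, len) != 0 || p[len] != ':')
		return -1;
	p += len + 1;

	for (i = 0; i <= cpu; i++) {
		unsigned long long v;

		errno = 0;
		v = strtoull(p, &end, 10);
		if (end == p || errno != 0)
			return -1;	/* ran out of numeric columns */
		if (i == cpu)
			*count = v;
		p = end;
	}
	return 0;
}

/*
 * Read the local timer interrupt count for one CPU column. The "LOC" label
 * is an x86 assumption; very wide machines may need a larger line buffer.
 */
static int local_timer_irqs(int cpu, unsigned long long *count)
{
	char line[4096];
	FILE *f = fopen("/proc/interrupts", "r");
	int ret = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (irq_count_from_line(line, "LOC", cpu, count) == 0) {
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}
```

Reading the file is itself a system call, so this only suits the "inspect
periodically or after the run" case, not the hot path.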
>
>>> The problem we've seen is that
>>> it's sometimes somewhat nondeterministic when the kernel might decide it
>>> needed some more ticking, once you let kernel code start to run.  For
>>> example, for RCU ops the kernel can choose to ignore the nohz cpuset cores
>>> when they're running userspace code only, but as soon as they get back into
>>> the kernel for any reason, you may need to schedule a grace period, and so
>>> just returning from the "you have no more ticks!" signal handler ends up
>>> causing ticks to be scheduled.
>> There is no real difference from the user stand point between the
>> return signal sys call
>> doing something that causes the tick to be turned on and an IPI or
>> timer that turns on
>> the tick a nano second after the signal return system call returned.
>>
>> The return signal syscall setting the tick on is just a private,
>> though annoying, case of the
>> tick getting turned on by something.
>
> Yes, but see above: the claim I'm making is that we can arrange for a
> well-behaved application to *expect* not to get kernel interrupts, so if
> they happen, something has gone wrong.
>

If that is your usage scenario, arrange things to never get an interrupt and
install a signal handler that aborts the app when the first
signal arrives after the app has started, at least in the lab.

Personally, I would probably use exactly this in the lab but put a SIG_IGN
in production. If the kernel delivers a single tick once every 398 days, and I
didn't manage to catch it in the lab, I probably would not want it to abort in
the field, but that's just me.

For example, if I understood the code you posted correctly, if I run an app
on a non-isolated core of Tilera ZOL that allocates slightly too much
memory, the page allocator will IPI all cores, including the isolated ones,
to get them to spill their per-cpu pages back to the page allocator.

Do you want to abort the app when that happens in production? Some people
will say yes, some people will say no - I just want to log that. I can certainly
see the value in both points of view.

So - let's provide a mechanism to let these two guys get what they want.
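
To make "mechanism in the kernel, policy in the application" concrete, here is
a minimal sketch of how one proposed "tick came back" signal could serve all
three camps. Everything kernel-facing is an assumption: SIG_TICK_ON is a
hypothetical signal mapped onto a real-time signal number, and the policy
names and the in-process log (standing in for a shared-memory log) are
invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical: stands in for the proposed "tick came back" signal. */
#define SIG_TICK_ON (SIGRTMIN + 1)

enum tick_policy {
	TICK_ABORT,	/* lab: die on the first disturbance */
	TICK_LOG,	/* HPC: record when it happened, keep running */
	TICK_IGNORE	/* production: hope for the best */
};

#define TICK_LOG_SLOTS 64
static struct timespec tick_log[TICK_LOG_SLOTS];
static volatile sig_atomic_t tick_log_count;

static void tick_abort(int sig)
{
	(void)sig;
	abort();
}

/* Record a timestamp; clock_gettime() is async-signal-safe per POSIX. */
static void tick_record(int sig)
{
	int n = tick_log_count;

	(void)sig;
	if (n < TICK_LOG_SLOTS) {
		clock_gettime(CLOCK_MONOTONIC, &tick_log[n]);
		tick_log_count = n + 1;
	}
}

/* Select what happens when the tick returns; 0 on success, -1 on error. */
static int set_tick_policy(enum tick_policy policy)
{
	struct sigaction sa;

	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;

	switch (policy) {
	case TICK_ABORT:
		sa.sa_handler = tick_abort;
		break;
	case TICK_LOG:
		sa.sa_handler = tick_record;
		break;
	case TICK_IGNORE:
		sa.sa_handler = SIG_IGN;
		break;
	default:
		return -1;
	}
	return sigaction(SIG_TICK_ON, &sa, NULL);
}
```

The same binary could then pick TICK_ABORT in the lab and TICK_IGNORE or
TICK_LOG in the field, for instance from a command-line option.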

>>> The approach we took for the Tilera dataplane mode was to have a syscall
>>> that would hold the task in the kernel until any ticks were done, and only
>>> then return to userspace.  (This is the same set_dataplane() syscall that
>>> also offers some flags to control and debug the dataplane stuff in general;
>>> in fact the "hold in kernel" support is a mode we set for all syscalls, to
>>> keep things deterministic.)  This way the "busy loop" is done in the
>>> kernel, but in fact we explicitly go into idle until the next tick, so it's
>>> lower-power.
>>>
>> Yes, I saw that. My gripe with it is that puts the policy of what to do
>> while we wait for the tick to go away in the kernel. I usually hate the
>> kernel to take decisions on what to do. I want it to give mechanisms
>> and let the programmer set the policy - e.g. have a LED blink while
>> you're waiting for the tick to go away so that the poor end user will
>> know we are still waiting for the stars to align just right...
>
> This is a fair point.  On the other hand, the way we implemented it is
> basically just a mode flag that is checked on all returns from the kernel,
> that allow userspace to invoke kernel functions "synchronously", but
> slowly, and not get hammered later by unexpected interrupts.  So from that
> point of view, we don't expect userspace to have anything useful to do on
> return from syscalls or page faults other than wait in the kernel anyway.
> But if the application did want to do something fancy for those few
> hundredths of a second while the ticks settle, you could imagine not using
> this "wait in kernel" mode, and instead spinning on the proposed data
> structure described above.
>
>> I'm not sure that is so big a deal, but that is why I thought of a
>> signal handler.
>>
>>> An alternative approach, not so good for power but at least avoiding the
>>> "use the kernel to avoid the kernel" aspect of signals, would be to
>>> register a location in userspace that the kernel would write to when it
>>> disabled the tick, and userspace could then just spin reading memory.
>>>
>> That's cool for letting you know when the tick goes away but not for alarming
>> you when it suddenly came back... :-)
>
> Yes, and in fact delivering a signal is not a bad way to let the
> application know that either it, or the kernel, just screwed up.  Currently
> our dataplane code just handles this case with console backtraces (for the
> "debug" mode) or by shooting down the application with SIGKILL (in "strict"
> mode when it's said it wasn't going to use the kernel any more).
>

I didn't think about doing system calls later after starting. You can
certainly re-use the signal handler approach there (app_started = 0, do the
syscall and then wait again), but I admit that this is more involved than just
issuing the system call and letting the kernel sort itself out.

I guess we can always support a callback function to let you run code when a
nohz task returns from kernel to user space - then you can do whatever you
want...

Gilad


* Re: [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
  2012-03-28  1:12           ` Christoph Lameter
@ 2012-03-28  8:39             ` Peter Zijlstra
  2012-03-28 13:11               ` Dimitri Sivanich
  2012-03-28 15:51               ` Chris Metcalf
  0 siblings, 2 replies; 96+ messages in thread
From: Peter Zijlstra @ 2012-03-28  8:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin,
	Dimitri Sivanich

On Tue, 2012-03-27 at 20:12 -0500, Christoph Lameter wrote:
> On Tue, 27 Mar 2012, Peter Zijlstra wrote:
> 
> > On Tue, 2012-03-27 at 11:08 -0500, Christoph Lameter wrote:
> > >
> > > I wish you would disentangle the nohz work from the cpusets. Cpusets is
> > > aged and being replaced by cgroups. And the cgroup work is something that
> > > is not suitable for many loads given the VM overhead added.
> >
> > What VM overhead? Are you talking about the memcg nonsense? That's
> > entirely optional, you don't need to either build that or enable it.
> 
> cgroups in general cause a much more complex VM processing with multiple
> LRUs and additional checks in various places.

Uhm, not if you don't have the memcg thing enabled, the controllers are
separate.

> Even just adding cpusets enables the group scheduler functionality f.e.
> which creates significantly larger scheduling latencies. Also complicates
> key allocation VM paths etc etc.

No, you're mistaken.

It's perfectly possible to compile a kernel with

CONFIG_CGROUPS=y
CONFIG_CPUSETS=y
CONFIG_CGROUP_MEM_RES_CTLR=n
CONFIG_CGROUP_SCHED=n

That will give you cpusets, but not the cpu (sched) controller crap and
not the memcg (vm) controller muck.

> > And if we ever get rid of that multiple hierarchy nonsense I don't see a
> > reason to get rid of cpuset at all. The only reason to want to replace
> > it is to avoid the dis-joint-ness it has with the cpu controller (and
> > possible the memcg one).
> 
> I like cpusets much more than cgroups. I agree with you.

cpusets is a cgroup controller..

> But I am not sure that cpusets are needed for nohz. We already have an
> isolcpu set and it sounds to me that nohz is generally useful.

I really really want to kill isolcpu in favour of cpusets, the amount of
disparity and overlap in features is driving me insane.

isolcpu will only create separate cpus, you can do the same with cpusets
by creating 1-cpu sets and disabling load_balance on the root set.
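
For illustration, the cpuset side of that could look roughly like the sketch
below. It assumes a cgroup (v1) cpuset hierarchy is already mounted at a
directory passed in as `root`, with the prefixed control file names
cpuset.sched_load_balance, cpuset.cpus and cpuset.mems; a legacy
"mount -t cpuset" exposes the same files without the prefix. The helper names
and the choice of memory node 0 are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Write a short string to <dir>/<file>; returns 0 on success, -1 on error. */
static int write_ctl(const char *dir, const char *file, const char *val)
{
	char path[4096];
	FILE *f;
	int ok;

	if (snprintf(path, sizeof(path), "%s/%s", dir, file) >= (int)sizeof(path))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(val, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Create a cpuset named `name` under `root` that contains only `cpu`. */
static int make_single_cpu_set(const char *root, const char *name, int cpu)
{
	char dir[4096], cpus[16];

	if (snprintf(dir, sizeof(dir), "%s/%s", root, name) >= (int)sizeof(dir))
		return -1;
	snprintf(cpus, sizeof(cpus), "%d", cpu);

	/* Disable load balancing on the root set, as described above. */
	if (write_ctl(root, "cpuset.sched_load_balance", "0") != 0)
		return -1;

	if (mkdir(dir, 0755) != 0 && errno != EEXIST)
		return -1;
	if (write_ctl(dir, "cpuset.cpus", cpus) != 0)
		return -1;

	/* Assumption: memory node 0 is acceptable for this set. */
	return write_ctl(dir, "cpuset.mems", "0");
}
```

A task would then be moved into the new set by writing its pid to the set's
"tasks" file. The remaining CPUs would normally go into a sibling set that
keeps load balancing enabled.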

The only difference is that isolcpu will never have had a task running
on the cpu and hence its timer lists etc will be guaranteed empty. So
once we add an interface to push away and/or wait for a cpu to quiesce
we should end up with the same state.

At that point I'll rip isolcpu out. 

> It would seem that the nohz patches would be much simpler if it would not
> require cpusets to administer. The only thing that would be needed is to
> have one cpu that is not subject to nohz. The logical choice is a
> timekeeper cpu (which is usually cpu 0). Having that configurable would be
> an extra bonus.

As Frederic has been saying, the nohz stuff adds syscall overhead: it
needs to timestamp on kernel entry/exit etc. Making it unconditional
will add this overhead to everybody and this might not be acceptable.





* Re: [PATCH 21/32] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader
  2012-03-27 14:23     ` Gilad Ben-Yossef
@ 2012-03-28 11:20       ` Frederic Weisbecker
  0 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-28 11:20 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, Mar 27, 2012 at 04:23:14PM +0200, Gilad Ben-Yossef wrote:
> On Tue, Mar 27, 2012 at 4:10 PM, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
> > On Wed, Mar 21, 2012 at 3:58 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >> When we wait for a zombie task, flush the cputimes on nohz cpusets
> >> in case we are waiting for a group leader that has threads running
> >> in nohz CPUs. This way thread_group_times() doesn't report stale
> >> values.
> >>
> >> <doubts>
> >> If I understood well the code, by the time we call that thread_group_times(),
> >> we may have childs that are still running, so this is necessary.
> >> But I need to check deeper.
> >> </doubts>
> >>
> > ...
> >>
> >> diff --git a/kernel/exit.c b/kernel/exit.c
> >> index 4b4042f..c194662 100644
> >> --- a/kernel/exit.c
> >> +++ b/kernel/exit.c
> >> @@ -52,6 +52,7 @@
> >>  #include <linux/hw_breakpoint.h>
> >>  #include <linux/oom.h>
> >>  #include <linux/writeback.h>
> >> +#include <linux/cpuset.h>
> >>
> >>  #include <asm/uaccess.h>
> >>  #include <asm/unistd.h>
> >> @@ -1712,6 +1713,13 @@ repeat:
> >>           (!wo->wo_pid || hlist_empty(&wo->wo_pid->tasks[wo->wo_type])))
> >>                goto notask;
> >>
> >> +       /*
> >> +        * For cputime in sub-threads before adding them.
> >> +        * Must be called outside tasklist_lock lock because write lock
> >> +        * can be acquired under irqs disabled.
> >> +        */
> >> +       cpuset_nohz_flush_cputimes();
> >> +
> >>        set_current_state(TASK_INTERRUPTIBLE);
> >>        read_lock(&tasklist_lock);
> >>        tsk = current;
> >> --
> >> 1.7.5.4
> >>
> >
> > I believe this patch is not needed because after this point we call
> > do_wait_thread /ptrace_do_wait, which both call wait_consider_task,
> > which calls wait_task_stopped/zombie/continued, which all eventually
> > calls getrusage, which calls k_getrusage where you added a call to
> > cpuset_nohz_flush_cputimes() in another patch :-)
> >
> 
> OK, I now see that wait_task_zombie actually calls
> thread_group_times() directly, unlike other wait_task_*
> what I wrote above is not needed.
> 
> It does result in more than one IPI for each isolated core (something
> like 3 really) for the other cases though:
> one from this patch and the rest from the one in k_getrusage calls.

Yeah I realize we may be calling getrusage() from each of the wait_*()
things if the user requests the rusage. That, plus the IPI done in this
patch, is too much.

> 
> I wonder what would be a better way to do it. In theory we can send
> the IPI only to nohz cpuset cores that actually
> run tasks from the thread group. Finding which is not trivial though...

I also realize that we only call wait_task_zombie() on group leaders
if they don't have any subthread left (see delay_group_leader() test).
But then we call thread_group_times() to get the time of all threads
in the group from wait_task_zombie().

Now I'm confused.

> 
> Gilad
> 
> > Gilad
> >

* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-27 15:02 ` [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Gilad Ben-Yossef
  2012-03-27 15:04   ` Gilad Ben-Yossef
  2012-03-27 15:10   ` Peter Zijlstra
@ 2012-03-28 11:43   ` Frederic Weisbecker
  2 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-28 11:43 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, Mar 27, 2012 at 05:02:34PM +0200, Gilad Ben-Yossef wrote:
> On Wed, Mar 21, 2012 at 3:58 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > Hi all,
> >
> > A summary of what this is about can be found here:
> >  https://lkml.org/lkml/2011/8/15/245
> >
> > There are still a lot of things to handle. Especially about
> > what is done by scheduler_tick() but we also need to:
> >
> > - completely handle cputime accounting (need to find every "reader"
> > of cputime and flush cputimes for all of them).
> > -handle  perf
> > - handle irqtime finegrained accounting
> > - handle ilb load balancing
> > - etc...
> >
> 
> I gave the new version a spin (x86 8 way VM) and it looks cool.
> 
> I did get the following warning once, but couldn't recreate it:
> 
> [   31.812741] ------------[ cut here ]------------
> [   31.812741] WARNING: at
> /home/giladb/Workspace/linux/kernel/time/tick-sched.c:706
> tick_nohz_account_ticks+0x7c/0x90()
> [   31.812741] Hardware name: Bochs
> [   31.812741] Modules linked in:
> [   31.812741] Pid: 1006, comm: sh Not tainted 3.3.0-rc7+ #167
> [   31.812741] Call Trace:
> [   31.812741]  [<c102a3ad>] warn_slowpath_common+0x6d/0xa0
> [   31.812741]  [<c106be0c>] ? tick_nohz_account_ticks+0x7c/0x90
> [   31.812741]  [<c106be0c>] ? tick_nohz_account_ticks+0x7c/0x90
> [   31.812741]  [<c102a3fd>] warn_slowpath_null+0x1d/0x20
> [   31.812741]  [<c106be0c>] tick_nohz_account_ticks+0x7c/0x90
> [   31.812741]  [<c106be5f>] tick_nohz_flush_current_times+0x3f/0x80
> [   31.812741]  [<c106bf8d>] tick_nohz_restart_adaptive+0xd/0x30
> [   31.812741]  [<c106c02e>] tick_nohz_check_adaptive+0x3e/0x50
> [   31.812741]  [<c1018180>] smp_cpuset_update_nohz_interrupt+0x20/0x30
> [   31.812741]  [<c1639c6a>] cpuset_update_nohz_interrupt+0x2a/0x30
> [   31.812741]  [<c16395fd>] ? _raw_spin_unlock_irq+0xd/0x30
> [   31.812741]  [<c10575c6>] finish_task_switch+0x46/0xa0
> [   31.812741]  [<c1638558>] __schedule+0x398/0x910
> [   31.812741]  [<c10ef2f1>] ? deactivate_slab+0x611/0x730
> [   31.812741]  [<c1120777>] ? __find_get_block+0x97/0x1a0
> [   31.812741]  [<c1221214>] ? cpumask_next_and+0x24/0xa0
> [   31.812741]  [<c10558cb>] ? get_parent_ip+0xb/0x40
> [   31.812741]  [<c1638b50>] schedule+0x30/0x50
> [   31.812741]  [<c16379b5>] schedule_hrtimeout_range_clock+0xf5/0x110
> [   31.812741]  [<c10558cb>] ? get_parent_ip+0xb/0x40
> [   31.812741]  [<c10586db>] ? sub_preempt_count+0x7b/0xb0
> [   31.812741]  [<c1639633>] ? _raw_spin_unlock_irqrestore+0x13/0x40
> [   31.812741]  [<c1054140>] ? __wake_up+0x40/0x50
> [   31.812741]  [<c1294d1f>] ? put_ldisc+0x3f/0xa0
> [   31.812741]  [<c16379e2>] schedule_hrtimeout_range+0x12/0x20
> [   31.812741]  [<c1107969>] poll_schedule_timeout+0x39/0x60
> [   31.812741]  [<c1108020>] do_sys_poll+0x400/0x490
> [   31.812741]  [<c1054d15>] ? cpuacct_charge+0x65/0x70
> [   31.812741]  [<c1107a20>] ? poll_freewait+0x70/0x70
> [   31.812741]  [<c1107af0>] ? __pollwait+0xd0/0xd0
> [   31.812741]  [<c1107af0>] ? __pollwait+0xd0/0xd0
> [   31.812741]  [<c10094a3>] ? native_sched_clock+0x33/0xe0
> [   31.812741]  [<c105a0e2>] ? sched_clock_local+0xb2/0x190
> [   31.812741]  [<c1054d15>] ? cpuacct_charge+0x65/0x70
> [   31.812741]  [<c105b376>] ? update_curr+0x1a6/0x2a0
> [   31.812741]  [<c105a2f9>] ? sched_clock_cpu+0x139/0x190
> [   31.812741]  [<c105a0e2>] ? sched_clock_local+0xb2/0x190
> [   31.812741]  [<c104dd43>] ? hrtimer_forward+0x163/0x1b0
> [   31.812741]  [<c10644e2>] ? ktime_get+0x62/0x100
> [   31.812741]  [<c1018b56>] ? lapic_next_event+0x16/0x20
> [   31.812741]  [<c1069df2>] ? clockevents_program_event+0xc2/0x170
> [   31.812741]  [<c106b514>] ? tick_program_event+0x24/0x30
> [   31.812741]  [<c104cd1d>] ? hrtimer_interrupt+0x1ad/0x2e0
> [   31.812741]  [<c1095128>] ? rcu_pending+0x58/0x70
> [   31.812741]  [<c1030a3d>] ? irq_exit+0x6d/0x80
> [   31.812741]  [<c1019363>] ? smp_apic_timer_interrupt+0x53/0x90
> [   31.812741]  [<c11e0128>] ? avc_has_perm_noaudit+0xc8/0x360
> [   31.812741]  [<c163a3b6>] ? apic_timer_interrupt+0x2a/0x30
> [   31.812741]  [<c128f31e>] ? tty_ioctl+0x47e/0xa30
> [   31.812741]  [<c11e0d66>] ? inode_has_perm+0x36/0x50
> [   31.812741]  [<c11e13e8>] ? file_has_perm+0xa8/0xb0
> [   31.812741]  [<c128eea0>] ? tty_check_change+0xe0/0xe0
> [   31.812741]  [<c1106763>] ? do_vfs_ioctl+0x83/0x570
> [   31.812741]  [<c11e4e46>] ? selinux_file_ioctl+0x56/0x110
> [   31.812741]  [<c1108224>] sys_poll+0x54/0xb0
> [   31.812741]  [<c1639b29>] syscall_call+0x7/0xb
> [   31.812741] ---[ end trace 1d7d659b4aead681 ]---

Ah interesting. I think I see how that happened: we flushed the
time on tick_nohz_pre_schedule() and set SAVED_JIFFIES_NONE.
Then we received a nohz IPI before we could restart the tick
from tick_nohz_post_schedule(). With ts->tick_stopped we expect
that ts->saved_jiffies_whence != SAVED_JIFFIES_NONE but that's
wrong.

I'll fix that.
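
The ordering described above can be replayed with a toy user-space model. To
be clear, this is not the kernel code: the struct and the TOY_ names are
invented, only the two fields mirror ts->tick_stopped and
ts->saved_jiffies_whence from the explanation, and the bodies are assumptions
that model just this one scenario.

```c
#include <stdbool.h>

/* Toy stand-ins; TOY_SAVED_JIFFIES_SET is a made-up "anything but NONE". */
enum toy_whence {
	TOY_SAVED_JIFFIES_NONE,
	TOY_SAVED_JIFFIES_SET
};

struct toy_tick_sched {
	bool tick_stopped;
	enum toy_whence saved_jiffies_whence;
};

/* Models the flush done from tick_nohz_pre_schedule(). */
static void toy_pre_schedule(struct toy_tick_sched *ts)
{
	if (ts->tick_stopped)
		ts->saved_jiffies_whence = TOY_SAVED_JIFFIES_NONE;
}

/* Models the tick restart done from tick_nohz_post_schedule(). */
static void toy_post_schedule(struct toy_tick_sched *ts)
{
	ts->tick_stopped = false;
}

/*
 * Models the sanity check reached from the nohz IPI: it assumes that a
 * stopped tick implies saved jiffies, and returns true when that assumption
 * is violated (i.e. when the warning would fire).
 */
static bool toy_ipi_would_warn(const struct toy_tick_sched *ts)
{
	return ts->tick_stopped &&
	       ts->saved_jiffies_whence == TOY_SAVED_JIFFIES_NONE;
}
```

In this model, an IPI that lands between toy_pre_schedule() and
toy_post_schedule() sees a stopped tick with no saved jiffies, which is the
window the warning caught.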

> 
> With the two patches I'll attach to the next replies to this message,
> I've been able to get a task running
> on an isolated CPU with 0 timer interrupts.
> 
> In my case, I also had to disable the clocksource watchdog, but only
> because TSC is not stable on my VM.
> This is really not a nohz/cpuset problem.

Yeah that's a particular issue on its own. I luckily don't have it
on my main testbox.

> There is one source of interference to cpu isolation this causes,
> which is the cputime flush IPI. Every time you
> run a command in the shell you get 3 - 4 IPIs sent to the nohz cpuset
> to flush the cputimes so that thread group
> times get computed correctly. That's not very nice :-)
> 
> I've tried disabling the IPI send, just to see how it goes and as far
> as I've been able to tell you get bare metal like
> environment for a 100% cpu bound code with no interrupts. Of course.
> ps/top then show 0% cpu utilization for
> that task since without the IPI the times it spends on the CPU is not
> registered... that is a small price to pay
> in my eyes for bare metal performance on Linux, but what do I know? :-)

Yeah I'm sure we can reduce the amount of IPIs for the nohz thing. I've just
set up one big IPI that executes on every tickless CPU for cases like cputime.
And there may be too many IPIs sent for the scheduler and RCU. We can certainly
optimize everything. I'm not yet at the optimization stage but rather still
in the correctness one unfortunately :)

Thanks!

> 
> Overall, way cool. Please keep it up !
> 
> Gilad
> 

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-27 16:13       ` Christoph Lameter
  2012-03-27 16:24         ` Steven Rostedt
@ 2012-03-28 11:53         ` Frederic Weisbecker
  1 sibling, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-28 11:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, Mar 27, 2012 at 11:13:39AM -0500, Christoph Lameter wrote:
> On Tue, 27 Mar 2012, Frederic Weisbecker wrote:
> 
> > > Is there any way for userspace to know that the tick is not off yet due to
> > > this? It would make sense for us to have busy loop in user space that
> > > waits until the OS has completed all processing if that avoids future
> > > latencies for the application.
> >
> > What is the usecase you have in mind? Is it for realtime purpose?
> 
> Please do not use "realtime" since I am not sure what you mean by that.
> It's for low latency applications that cannot use "realtime" because that
> implies high latencies.
> 
> > The "tick stopped" is a volatile and relative state.
> 
> The use case is an application that cannot tolerate the latencies
> introduced by timer tick processing. It will only start running when the
> system is in a sufficiently quiet state.
> 
> > Relative because if a timer list is enqueued to fire 1 second later,
> > the tick will be stopped until that happens. How do we consider this (common)
> > case?
> >
> > Also as Chris noted it is volatile because the tick can be restarted anytime
> > for random reasons: the CPU receives an IPI which makes it restart the
> > periodic tick.
> 
> Ok some sort of notification would be good for that case. If a timer tick
> happens and that was unavoidable then it would be good to log the reason
> why this occurred so that the system can be configured in such a way that
> these interruptions are minimized.

Tracing is probably the right place to log these things. But that's
about debugging. This won't be a notification on top of which your app
could recover.


* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-28  7:55                     ` Gilad Ben-Yossef
@ 2012-03-28 12:21                       ` Frederic Weisbecker
  2012-03-28 12:41                         ` Gilad Ben-Yossef
  2012-03-28 14:02                       ` Steven Rostedt
  1 sibling, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-28 12:21 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Steven Rostedt, Christoph Lameter, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, Mar 28, 2012 at 09:55:37AM +0200, Gilad Ben-Yossef wrote:
> On Wed, Mar 28, 2012 at 5:17 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Tue, 2012-03-27 at 21:35 -0400, Steven Rostedt wrote:
> >
> >> > > I call that "lower overhead".
> >> >
> >> > Good marketing but it does not change the facts.
> >
> > I'm replying again because this comment just pisses me off.
> >
> > I'm the only male in my household, living with a wife, two teenage
> > daughters and two bitches (I own two female dogs). This is not the time
> > of month to be arguing with me!
> 
> LOL. I'm married with two daughters. I feel your pain  :-)
> 
> >
> > The fact is, you live in your own little world. You see things from your
> > own little perspective. You can define the time a system call takes as a
> > latency, but that is just one very small aspect of latencies. There's
> > lots of other kinds of latencies and if you did the search I told you
> > to, you would see that. In fact, the latency caused by system calls is
> > such a small niche of the types of latencies there are. I'm not counting
> > the time a system call waits for a device. Although a preempt kernel
> > would be faster for such a case.
> >
> 
> At the risk of butting in on this little flame war, I think it is worth
> mentioning that this discussion arose in the context of a feature
> (cpuset/nohz) that deals with a single task running alone on a CPU and
> making zero use of kernel services, from scheduling, through interrupts,
> to system calls. It's just a pure 100% CPU-bound task.

Note that cpu isolation is a specific usecase of nohz cpusets. But it's
intended to be more generally useful (probably for most workloads). That
means we really want to support syscalls, interrupts and everything. That's
why it is called adaptive tickless and not just userspace isolated tickless.

Not that what I'm saying moves the debate on latency forward;
I just wanted to ensure there is no misunderstanding of this patchset :)


* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-27 15:21         ` Gilad Ben-Yossef
@ 2012-03-28 12:39           ` Frederic Weisbecker
  2012-03-28 12:57             ` Gilad Ben-Yossef
  0 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-28 12:39 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Christoph Lameter, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Tue, Mar 27, 2012 at 05:21:34PM +0200, Gilad Ben-Yossef wrote:
> On Thu, Mar 22, 2012 at 6:18 PM, Christoph Lameter <cl@linux.com> wrote:
> > On Thu, 22 Mar 2012, Gilad Ben-Yossef wrote:
> >
> >> > Is there any way for userspace to know that the tick is not off yet due to
> >> > this? It would make sense for us to have busy loop in user space that
> >> > waits until the OS has completed all processing if that avoids future
> >> > latencies for the application.
> >> >
> >>
> >> I previously suggested having the user register to receive a signal
> >> when the tick
> >> is turned off. Since the tick is always turned off the user task is
> >> the current task
> >> by design, *I think* you can simply mark the signal pending when you
> >> turn the tick off.
> >
> > Ok that sounds good. You would define a new signal for this?
> >
> 
> My gut instinct is to let the process register with a specific signal
> (probably in the RT range) it wants to receive when the tick goes off
> and/or on.

Note the signal itself could trigger an event that could restart the tick.
Calling call_rcu() is sufficient for that. We can probably optimize that
one day by assigning another CPU to handle the callbacks of a tickless
CPU but for now...

> 
> > So we would startup the application. App will do all prep work (memory
> > allocation, device setup etc etc) and then wait for the signal to be
> > received. After that it would enter the low latency processing phase.
> >
> > Could we also get a signal if something disrupts the peace and switches
> > the timer interrupt on again?
> >
> 
> I think you'll have to since once you have the tick turned off there
> is no guarantee that
> it won't get turned on by a timer scheduling a task or an IPI.

The problem with this scheme is that if the task is running with the
guarantee that nothing is going to disturb it (it assumes so when it
is notified that the timer is stopped), can it seriously recover from
the fact the timer has been restarted once it gets notified about it?

I have a hard time imagining that. It's like an RT task running a
critical part that suddenly receives a notification from the kernel that
says "what's up dude? hey by the way you're not real time anymore" :) 
How are we recovering from that?

Maybe instead of focusing on these notifications, we should try hard to
shut down the tick before we reach userspace: delegate RCU work
to another CPU, avoid needless IPIs, avoid needless timer list timers, etc...
Fix those things one by one such that we can configure things to the point we
get closer to a guarantee of CPU isolation.

Does that sound reasonable?


> 
> 
> -- 
> Gilad Ben-Yossef
> Chief Coffee Drinker
> gilad@benyossef.com
> Israel Cell: +972-52-8260388
> US Cell: +1-973-8260388
> http://benyossef.com
> 
> "If you take a class in large-scale robotics, can you end up in a
> situation where the homework eats your dog?"
>  -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-28 12:21                       ` Frederic Weisbecker
@ 2012-03-28 12:41                         ` Gilad Ben-Yossef
  0 siblings, 0 replies; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-28 12:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Steven Rostedt, Christoph Lameter, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, Mar 28, 2012 at 2:21 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Wed, Mar 28, 2012 at 09:55:37AM +0200, Gilad Ben-Yossef wrote:
>
>> At the risk of butting in on this little flame war, I think it is
>> worth mentioning
>> that this discussion arose in the context of a feature
>> (cpuset/nohz) that deals
>> with a single task running alone on a CPU and making zero use of
>> kernel services,
>> from scheduling, through interrupts, to system calls. It's just a pure
>> 100% cpu  bound task.
>
> Note that cpu isolation is a specific usecase of nohz cpusets. But it's
> intended to be more generally useful (probably for most workloads). That
> means we really want to support syscalls, interrupts and everything. That's
> why it is called adaptive tickless and not just userspace isolated tickless.

Point taken.  I guess I suffer from myopic vision just like the next guy  :-)

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-28 12:39           ` Frederic Weisbecker
@ 2012-03-28 12:57             ` Gilad Ben-Yossef
  2012-03-28 13:38               ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-28 12:57 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, Mar 28, 2012 at 2:39 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Tue, Mar 27, 2012 at 05:21:34PM +0200, Gilad Ben-Yossef wrote:
>> On Thu, Mar 22, 2012 at 6:18 PM, Christoph Lameter <cl@linux.com> wrote:
>> > On Thu, 22 Mar 2012, Gilad Ben-Yossef wrote:
>> >
>> >> > Is there any way for userspace to know that the tick is not off yet due to
>> >> > this? It would make sense for us to have busy loop in user space that
>> >> > waits until the OS has completed all processing if that avoids future
>> >> > latencies for the application.
>> >> >
>> >>
>> >> I previously suggested having the user register to receive a signal
>> >> when the tick
>> >> is turned off. Since the tick is always turned off the user task is
>> >> the current task
>> >> by design, *I think* you can simply mark the signal pending when you
>> >> turn the tick off.
>> >
>> > Ok that sounds good. You would define a new signal for this?
>> >
>>
>> My gut instinct is to let the process register with a specific signal
>> (properly the RT range)
>> it wants to receive when the tick goes off and/or on.
>
> Note the signal itself could trigger an event that could restart the tick.
> Calling call_rcu() is sufficient for that. We can probably optimize that
> one day by assigning another CPU to handle the callbacks of a tickless
> CPU but for now...
>



>>
>> > So we would startup the application. App will do all prep work (memory
>> > allocation, device setup etc etc) and then wait for the signal to be
>> > received. After that it would enter the low latency processing phase.
>> >
>> > Could we also get a signal if something disrupts the peace and switches
>> > the timer interrupt on again?
>> >
>>
>> I think you'll have to since once you have the tick turned off there
>> is no guarantee that
>> it won't get turned on by a timer scheduling a task or an IPI.
>
> The problem with this scheme is that if the task is running with the
> guarantee that nothing is going to disturb it (it assumes so when it
> is notified that the timer is stopped), can it seriously recover from
> the fact the timer has been restarted once it gets notified about it?

Recovery in this context involves a programmer/system architect looking
into what made the tick start and making sure that won't happen the next
time around.

I know it's not quite what you had in mind, but it works :-)

>
> I have a hard time imagining that. It's like an RT task running a
> critical part that suddenly receives a notification from the kernel that
> says "what's up dude? hey by the way you're not real time anymore" :)
> How are we recovering from that?

The point is that it is the difference between a QA report that says:

"Performance dropped below the acceptable level for 10 ms at some point
during the test run"

and

"We got an indication that the kernel resumed the tick on us, so the test
was stopped and here is the stack trace for all the tasks running,
plus the logs".


> Maybe instead of focusing on these notifications, we should try hard to
> shut down the tick before we reach userspace: delegate RCU work
> to another CPU, avoid needless IPIs, avoid needless timer list timers, etc...
> Fix those things one by one such that we can configure things to the point we
> get closer to a guarantee of CPU isolation.
>
> Does that sound reasonable?

It does to me :-)

Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
  2012-03-28  8:39             ` Peter Zijlstra
@ 2012-03-28 13:11               ` Dimitri Sivanich
  2012-03-28 15:51               ` Chris Metcalf
  1 sibling, 0 replies; 96+ messages in thread
From: Dimitri Sivanich @ 2012-03-28 13:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Gilad Ben Yossef, Ingo Molnar,
	Max Krasnyansky, Paul E. McKenney, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, Mar 28, 2012 at 10:39:43AM +0200, Peter Zijlstra wrote:
> On Tue, 2012-03-27 at 20:12 -0500, Christoph Lameter wrote:
> > On Tue, 27 Mar 2012, Peter Zijlstra wrote:
> > 
> > > On Tue, 2012-03-27 at 11:08 -0500, Christoph Lameter wrote:
> > > >
> > > > I wish you would disentangle the nohz work from the cpusets. Cpusets is
> > > > aged and being replaced by cgroups. And the cgroup work is something that
> > > > is not suitable for many loads given the VM overhead added.
> > >
> > > What VM overhead? Are you talking about the memcg nonsense? That's
> > > entirely optional, you don't need to either build that or enable it.
> > 
> > cgroups in general cause a much more complex VM processing with multiple
> > LRUs and additional checks in various places.
> 
> Uhm, not if you don't have the memcg thing enabled, the controllers are
> separate.
> 
> > Even just adding cpusets enables the group scheduler functionality f.e.
> > which creates significantly larger scheduling latencies. Also complicates
> > key allocation VM paths etc etc.
> 
> No, you're mistaken.
> 
> Its perfectly possible to compile a kernel with
> 
> CONFIG_CGROUPS=y
> CONFIG_CPUSETS=y
> CONFIG_CGROUP_MEM_RES_CTLR=n
> CONFIG_CGROUP_SCHED=n
> 
> That will give you cpusets, but not the cpu (sched) controller crap and
> not the memcg (vm) controller muck.
> 
> > > And if we ever get rid of that multiple hierarchy nonsense I don't see a
> > > reason to get rid of cpuset at all. The only reason to want to replace
> > > it is to avoid the dis-joint-ness it has with the cpu controller (and
> > > possible the memcg one).
> > 
> > I like cpusets much more than cgroups. I agree with you.
> 
> cpusets is a cgroup controller..
> 
> > But I am not sure that cpusets are needed for nohz. We already have an
> > isolcpu set and it sounds to me that nohz is generally useful.
> 
> I really really want to kill isolcpu in favour of cpusets, the amount of
> disparity and overlap in features is driving me insane.
> 
> isolcpu will only create separate cpus, you can do the same with cpusets
> by creating 1 cpu sets and disabling load_balance on the root set.
> 
> The only difference is that isolcpu will never have had a task running
> on the cpu and hence its timer lists etc will be guaranteed empty. So
> once we add an interface to push away and/or wait for a cpu to quiesce
> we should end up with the same state.

That is the main reason why I've been in favor of keeping isolcpus.  If there
were some way to clean up the timer lists, etc. in a cpuset, then I would use
that in place of isolcpus.

However, I suggested something to move timers quite a while back and it
met with resistance (due to having to ensure that every timer was not somehow
constrained to a single cpu).

> 
> At that point I'll rip isolcpu out. 
> 
> > It would seem that the nohz patches would be much simpler if it would not
> > require cpusets to administer. The only thing that would be needed is to
> > have one cpu that is not subject to nohz. The logical choice is a
> > timekeeper cpu (which is usually cpu 0). Having that configurable would be
> > an extra bonus.
> 
> Like Frederic has been telling, the nohz stuff adds syscall overhead, it
> needs to timestamp on kernel entry/exit etc.. Making it unconditional
> will add this overhead to everybody and this might not be acceptable.
> 
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-28 12:57             ` Gilad Ben-Yossef
@ 2012-03-28 13:38               ` Frederic Weisbecker
  0 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-28 13:38 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Christoph Lameter, LKML, linaro-sched-sig, Alessio Igor Bogani,
	Andrew Morton, Avi Kivity, Chris Metcalf, Daniel Lezcano,
	Geoff Levand, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, Mar 28, 2012 at 02:57:44PM +0200, Gilad Ben-Yossef wrote:
> On Wed, Mar 28, 2012 at 2:39 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > On Tue, Mar 27, 2012 at 05:21:34PM +0200, Gilad Ben-Yossef wrote:
> >> On Thu, Mar 22, 2012 at 6:18 PM, Christoph Lameter <cl@linux.com> wrote:
> >> > On Thu, 22 Mar 2012, Gilad Ben-Yossef wrote:
> >> >
> >> >> > Is there any way for userspace to know that the tick is not off yet due to
> >> >> > this? It would make sense for us to have busy loop in user space that
> >> >> > waits until the OS has completed all processing if that avoids future
> >> >> > latencies for the application.
> >> >> >
> >> >>
> >> >> I previously suggested having the user register to receive a signal
> >> >> when the tick
> >> >> is turned off. Since the tick is always turned off the user task is
> >> >> the current task
> >> >> by design, *I think* you can simply mark the signal pending when you
> >> >> turn the tick off.
> >> >
> >> > Ok that sounds good. You would define a new signal for this?
> >> >
> >>
> >> My gut instinct is to let the process register with a specific signal
> >> (properly the RT range)
> >> it wants to receive when the tick goes off and/or on.
> >
> > Note the signal itself could trigger an event that could restart the tick.
> > Calling call_rcu() is sufficient for that. We can probably optimize that
> > one day by assigning another CPU to handle the callbacks of a tickless
> > CPU but for now...
> >
> 
> 
> 
> >>
> >> > So we would startup the application. App will do all prep work (memory
> >> > allocation, device setup etc etc) and then wait for the signal to be
> >> > received. After that it would enter the low latency processing phase.
> >> >
> >> > Could we also get a signal if something disrupts the peace and switches
> >> > the timer interrupt on again?
> >> >
> >>
> >> I think you'll have to since once you have the tick turned off there
> >> is no guarantee that
> >> it won't get turned on by a timer scheduling a task or an IPI.
> >
> > The problem with this scheme is that if the task is running with the
> > guarantee that nothing is going to disturb it (it assumes so when it
> > is notified that the timer is stopped), can it seriously recover from
> > the fact the timer has been restarted once it gets notified about it?
> 
> Recovery in this context involves a programmer/system architect looking
> into what made the tick start and making sure that won't happen the next
> time around.
> 
> I know it's not quite what you had in mind, but it works :-)

So this is about fixing bugs. Tracing may fit better for that.

> 
> >
> > I have a hard time imagining that. It's like an RT task running a
> > critical part that suddenly receives a notification from the kernel that
> > says "what's up dude? hey by the way you're not real time anymore" :)
> > How are we recovering from that?
> 
> The point is that it is the difference between a QA report that says:
> 
> "Performance dropped below the acceptable level for 10 ms at some point
> during the test run"
> 
> and
> 
> "We got an indication that the kernel resumed the tick on us, so the test
> was stopped and here is the stack trace for all the tasks running,
> plus the logs".

That's about post-run analysis, which sounds like a job for tracing.

> 
> 
> > Maybe instead of focusing on these notifications, we should try hard to
> > shut down the tick before we reach userspace: delegate RCU work
> > to another CPU, avoid needless IPIs, avoid needless timer list timers, etc...
> > Fix those things one by one such that we can configure things to the point we
> > get closer to a guarantee of CPU isolation.
> >
> > Does that sound reasonable?
> 
> It does to me :-)
> 
> Gilad
> 
> 
> -- 
> Gilad Ben-Yossef
> Chief Coffee Drinker
> gilad@benyossef.com
> Israel Cell: +972-52-8260388
> US Cell: +1-973-8260388
> http://benyossef.com
> 
> "If you take a class in large-scale robotics, can you end up in a
> situation where the homework eats your dog?"
>  -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2012-03-28  7:55                     ` Gilad Ben-Yossef
  2012-03-28 12:21                       ` Frederic Weisbecker
@ 2012-03-28 14:02                       ` Steven Rostedt
  1 sibling, 0 replies; 96+ messages in thread
From: Steven Rostedt @ 2012-03-28 14:02 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Christoph Lameter, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Daniel Lezcano, Geoff Levand, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Stephen Hemminger,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

On Wed, 2012-03-28 at 09:55 +0200, Gilad Ben-Yossef wrote:
> On Wed, Mar 28, 2012 at 5:17 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Tue, 2012-03-27 at 21:35 -0400, Steven Rostedt wrote:
> >
> >> > > I call that "lower overhead".
> >> >
> >> > Good marketing but it does not change the facts.
> >
> > I'm replying again because this comment just pisses me off.
> >
> > I'm the only male in my household, living with a wife, two teenage
> > daughters and two bitches (I own two female dogs). This is not the time
> > of month to be arguing with me!
> 
> LOL. I'm married with two daughters. I feel your pain  :-)

Are they teenagers yet? When they become that, the male of the family
becomes the point of frustration relief for them :-p  I've been their
target for the last week. ;-)

> At the risk of butting in on this little flame war, I think it is
> worth mentioning
> that this discussion arose in the context of a feature
> (cpuset/nohz) that deals
> with a single task running alone on a CPU and making zero use of
> kernel services,
> from scheduling, through interrupts, to system calls. It's just a pure
> 100% cpu  bound task.

I perfectly understand, and I said several times that latency has
different meanings for different people. It's as abused as the term
realtime is. I just took offense that he claimed I was using the term as
a marketing ploy.

> 
> For the work loads this is intended for, the time it takes to respond
> to an interrupt
> or context switch to the kernel and back for a system call, is too
> high. It doesn't matter
> how predictable that time is - if the time to do a system call in the
> best possible case
> is too high for you, having that time predictable only means you are
> predictably late :-)

And I agree with this too.

> 
> This is not an observation about Linux, or preempt-rt, or any OS for
> that matter. It's just
> a statement of fact. There is nothing you can do to make the OS better
> to get the time
> lower. It's just a process that doesn't want OS involvement at all
> from the point it is started
> until it's done, except in exception cases. The entire OS is overhead.

Agreed as well.

> 
> The traditional way to deal with these beasts is to run it on bare
> metal with no OS.

I've always said the best realtime OS was DOS ;-)

> What we're discussing is a way to get Linux to give a task bare metal
> like performance.
> That is useful because you can debug and manage the tasks using OS services, but
> still get bare metal like performance. In the context of this
> discussion, *any* kernel
> activity on that CPU during its run time is overhead, really.

Agreed too.

> 
> Preempt-rt is good. I know because I was involved in implementing it
> in a real time
> system that has saved dozens of lives already. When that system misses
> a dead line,
> that is exactly what you get - a line of dead people. Seriously. But
> it works. It just doesn't
> happen to apply here.

And this is why I asked to use the term overhead and not latency. As
latency has different meanings in different contexts, overhead and
determinism are more concrete. Let's stick with those terms so we don't
need to pick the color of the bike shed.

> 
> I don't know if this is what Christoph meant to say, but can we please
> try to get along now? :-)

Oh, I love Christoph. He knows I do. I was just releasing my built up
frustration on him ;-)

-- Steve



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
  2012-03-28  8:39             ` Peter Zijlstra
  2012-03-28 13:11               ` Dimitri Sivanich
@ 2012-03-28 15:51               ` Chris Metcalf
  1 sibling, 0 replies; 96+ messages in thread
From: Chris Metcalf @ 2012-03-28 15:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin,
	Dimitri Sivanich

On 3/28/2012 4:39 AM, Peter Zijlstra wrote:
>> It would seem that the nohz patches would be much simpler if it would not
>> > require cpusets to administer. The only thing that would be needed is to
>> > have one cpu that is not subject to nohz. The logical choice is a
>> > timekeeper cpu (which is usually cpu 0). Having that configurable would be
>> > an extra bonus.
> Like Frederic has been telling, the nohz stuff adds syscall overhead, it
> needs to timestamp on kernel entry/exit etc.. Making it unconditional
> will add this overhead to everybody and this might not be acceptable.

And more generally, some kind of configuration step is important so the
kernel knows what userspace wants to have happen there.  On a nohz cpu the
kernel will try to optimize for running a single task with no OS overhead. 
On a default cpu the kernel will be trying to use the cpu as part of a pool
for optimizing the overall system experience, including work stealing,
background threads, etc.  This could be a per-process thing, or it could be
a per-core thing, but either way, the kernel can't really guess effectively.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
                   ` (33 preceding siblings ...)
  2012-03-27 15:02 ` [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Gilad Ben-Yossef
@ 2012-03-30  0:33 ` Kevin Hilman
  2012-03-30  0:45   ` Frederic Weisbecker
  34 siblings, 1 reply; 96+ messages in thread
From: Kevin Hilman @ 2012-03-30  0:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Thomas Gleixner, Geoff Levand,
	Stephen Hemminger, Daniel Lezcano, Chris Metcalf, Ingo Molnar,
	Gilad Ben Yossef, Alessio Igor Bogani, Avi Kivity,
	Max Krasnyansky, Zen Lin, Steven Rostedt, Christoph Lameter,
	Andrew Morton

Frederic Weisbecker <fweisbec@gmail.com> writes:

> There are still a lot of things to handle. Especially about
> what is done by scheduler_tick() but we also need to:
>
> - completely handle cputime accounting (need to find every "reader"
> of cputime and flush cputimes for all of them).
> -handle  perf
> - handle irqtime finegrained accounting
> - handle ilb load balancing
> - etc...

  - add more arch support :)

For ARM, I've started hacking things into place and have some
preliminary patches to add ARM support for the new IPI, syscall support,
exception hooks, etc.[1]

This is enough to get some basic things running and test the nohz
cpusets functionality on ARM, and for others to start trying it out.

For anyone interested in collaborating on ARM support, my
work-in-progress branch is below[1] based on Frederic's cpuset-v2
branch.  Patches and testing welcome.

Kevin

[1] git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux.git wip/arm-nohz-cpusets

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-30  0:33 ` Kevin Hilman
@ 2012-03-30  0:45   ` Frederic Weisbecker
  2012-03-30  2:07     ` Geoff Levand
  0 siblings, 1 reply; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-30  0:45 UTC (permalink / raw)
  To: Kevin Hilman, Geoff Levand
  Cc: LKML, linaro-sched-sig, Thomas Gleixner, Stephen Hemminger,
	Daniel Lezcano, Chris Metcalf, Ingo Molnar, Gilad Ben Yossef,
	Alessio Igor Bogani, Avi Kivity, Max Krasnyansky, Zen Lin,
	Steven Rostedt, Christoph Lameter, Andrew Morton

On Thu, Mar 29, 2012 at 05:33:46PM -0700, Kevin Hilman wrote:
> Frederic Weisbecker <fweisbec@gmail.com> writes:
> 
> > There are still a lot of things to handle. Especially about
> > what is done by scheduler_tick() but we also need to:
> >
> > - completely handle cputime accounting (need to find every "reader"
> > of cputime and flush cputimes for all of them).
> > -handle  perf
> > - handle irqtime finegrained accounting
> > - handle ilb load balancing
> > - etc...
> 
>   - add more arch support :)
> 
> For ARM, I've started hacking things into place and have some
> preliminary patches to add ARM support for the new IPI, syscall support,
> exception hooks, etc.[1]
> 
> This is enough to get some basic things running and test the nohz
> cpusets functionality on ARM, and for others to start trying it out.
> 
> For anyone interested in collaborating on ARM support, my
> work-in-progress branch is below[1] based on Frederic's cpuset-v2
> branch.  Patches and testing welcome.
> 
> Kevin
> 
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux.git wip/arm-nohz-cpusets

Nice! Thanks for working on this. You may want to work with Geoff Levand who's working
on the ARM port too.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
  2012-03-27 16:08       ` Christoph Lameter
  2012-03-27 16:47         ` Peter Zijlstra
@ 2012-03-30  1:34         ` Frederic Weisbecker
  1 sibling, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-03-30  1:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Daniel Lezcano, Geoff Levand,
	Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Peter Zijlstra, Stephen Hemminger, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin,
	Dimitri Sivanich

On Tue, Mar 27, 2012 at 11:08:05AM -0500, Christoph Lameter wrote:
> On Tue, 27 Mar 2012, Frederic Weisbecker wrote:
> 
> > > Any way to manually specify which cpu? We f.e. always "sacrifice" cpu 0
> > > for OS activities. We would like to have all Os processing things
> > > restricted to cpu 0 so that the rest of the processors do not experience
> > > the OS noise.
> >
> > Somebody tries to do this: https://lkml.org/lkml/2011/11/8/346
> >
> > But in the case of nohz cpusets there is a problem to solve:
> >
> > What if every CPUs are tickless (idle or busy), who must take
> > the timekeeping duty? Should we pick one of the busy CPUs? Or
> > keep one CPU with the tick even if it's idle? How do we choose
> > this CPU?
> 
> Then its the users fault because he specified the processor to use. There
> is no picking if its manually assigned.

If you can only bind the timekeeping to a single CPU, what happens when
that CPU goes idle and it's the only CPU that is not in a nohz cpuset? If there
are other CPUs that are busy but tickless, they need the timekeeping to be
maintained. This means we need to keep the periodic tick on the CPU the
timekeeping is bound to, even if that CPU is idle. That's not good for powersaving.

Let's illustrate this with a practical example. We have 4 CPUs.
CPU 0, 2 and 3 are in a nohz cpuset. CPU 1 is "normal". As long as CPU 1
is busy we are fine and we can give it the timekeeping duty. Now it goes
idle and the other CPUs (0, 2, 3) are busy and they run without the tick.

We have the choice between:

1) Keep the periodic tick on CPU 1, even if it's idle, and let the timekeeping
duty be handled there.

2) Pick either CPU 0, 2 or 3 and restart the tick in one of them to handle
the timekeeping duty.

I think both options are valid; it just depends on what the user
wants. 1) will consume more power because the periodic tick prevents CPU 1
from entering low power mode. But this is good if you want to ensure isolation
on CPU 0, 2 and 3.
2) is good if you want powersaving and you don't care about isolation: you just
want to run tickless when possible but no big deal if we can't.

If you can bind the timekeeping to only one CPU, you can't make
such a fine-grained decision. Rather, you need this timekeeping duty to be
defined on a set of CPUs. And cpuset is of course a very good candidate
to define properties on a set of CPUs ;)
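
The two options can be modelled in a few lines of plain C. This is a toy
userspace model, not code from the patchset: pick_timekeeper(), enum
tk_policy and the bitmask arguments are made up for illustration, and the
real decision would live in the nohz code and use proper cpumasks.

```c
#include <assert.h>

enum tk_policy {
	TK_PREFER_ISOLATION,	/* option 1: keep the tick on an idle non-nohz CPU */
	TK_PREFER_POWERSAVE,	/* option 2: restart the tick on a busy nohz CPU */
};

/*
 * Pick the CPU that carries the timekeeping duty.
 *   nohz_mask: bit n set if CPU n is in a nohz cpuset
 *   busy_mask: bit n set if CPU n is running a task
 * Returns a CPU number, or -1 if every CPU is idle and nobody needs it.
 */
int pick_timekeeper(unsigned int nohz_mask, unsigned int busy_mask,
		    int nr_cpus, enum tk_policy policy)
{
	int cpu, idle_normal = -1, busy_nohz = -1;

	assert(nr_cpus > 0 && nr_cpus <= 32);

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		int nohz = (nohz_mask >> cpu) & 1;
		int busy = (busy_mask >> cpu) & 1;

		if (busy && !nohz)
			return cpu;	/* this CPU is ticking anyway */
		if (!busy && !nohz && idle_normal < 0)
			idle_normal = cpu;
		if (busy && nohz && busy_nohz < 0)
			busy_nohz = cpu;
	}

	if (busy_nohz < 0)
		return -1;		/* everything is idle */
	if (policy == TK_PREFER_ISOLATION && idle_normal >= 0)
		return idle_normal;	/* 1) tick stays on the idle CPU */
	return busy_nohz;		/* 2) one nohz CPU gets its tick back */
}
```

With the 4 CPU setup above (nohz_mask 0xd, CPU 1 idle, busy_mask 0xd),
TK_PREFER_ISOLATION returns CPU 1 and TK_PREFER_POWERSAVE returns CPU 0.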

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-30  0:45   ` Frederic Weisbecker
@ 2012-03-30  2:07     ` Geoff Levand
  2012-03-30 14:10       ` Kevin Hilman
  0 siblings, 1 reply; 96+ messages in thread
From: Geoff Levand @ 2012-03-30  2:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Kevin Hilman, LKML, linaro-sched-sig, Thomas Gleixner,
	Stephen Hemminger, Daniel Lezcano, Chris Metcalf, Ingo Molnar,
	Gilad Ben Yossef, Alessio Igor Bogani, Avi Kivity,
	Max Krasnyansky, Zen Lin, Steven Rostedt, Christoph Lameter,
	Andrew Morton

Hi All,

On Fri, 2012-03-30 at 02:45 +0200, Frederic Weisbecker wrote:
> On Thu, Mar 29, 2012 at 05:33:46PM -0700, Kevin Hilman wrote:
> > For anyone interested in collaborating on ARM support, my
> > work-in-progress branch is below[1] based on Frederic's cpuset-v2
> > branch.  Patches and testing welcome.
> > 
> > [1] git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux.git wip/arm-nohz-cpusets
> 
> Nice! Thanks for working on this. You may want to work with Geoff Levand who's working
> on the ARM port too.

You can see my work here:

  http://git.kernel.org/?p=linux/kernel/git/geoff/nohz.git

I also put a simple test I got from Frederic here:

  http://git.kernel.org/?p=linux/kernel/git/geoff/nohz-tests.git

Feel free to expand on either!

Kevin, I'll take a look at your work, I guess it has the same things as
mine.  Will you be at the Collaboration Summit next week?

-Geoff






^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-30  2:07     ` Geoff Levand
@ 2012-03-30 14:10       ` Kevin Hilman
  0 siblings, 0 replies; 96+ messages in thread
From: Kevin Hilman @ 2012-03-30 14:10 UTC (permalink / raw)
  To: Geoff Levand
  Cc: Frederic Weisbecker, LKML, linaro-sched-sig, Thomas Gleixner,
	Stephen Hemminger, Daniel Lezcano, Chris Metcalf, Ingo Molnar,
	Gilad Ben Yossef, Alessio Igor Bogani, Avi Kivity,
	Max Krasnyansky, Zen Lin, Steven Rostedt, Christoph Lameter,
	Andrew Morton

Geoff Levand <geoff@infradead.org> writes:

> Hi All,
>
> On Fri, 2012-03-30 at 02:45 +0200, Frederic Weisbecker wrote:
>> On Thu, Mar 29, 2012 at 05:33:46PM -0700, Kevin Hilman wrote:
>> > For anyone interested in collaborating on ARM support, my
>> > work-in-progress branch is below[1] based on Frederic's cpuset-v2
>> > branch.  Patches and testing welcome.
>> > 
>> > [1] git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux.git wip/arm-nohz-cpusets
>> 
>> Nice! Thanks for working on this. You may want to work with Geoff Levand who's working
>> on the ARM port too.
>
> You can see my work here:
>
>   http://git.kernel.org/?p=linux/kernel/git/geoff/nohz.git

Great! Will have a closer look.

A quick look suggests we've done almost exactly the same thing.  Guess
that's a good sign. :) 

> I also put a simple test I got from Frederic here:
>
>   http://git.kernel.org/?p=linux/kernel/git/geoff/nohz-tests.git

Great, having a common test to compare results was going to be my next
question.  I'll start playing with these next week.   Thanks.

> Feel free to expand on either!

> Kevin, I'll take a look at your work, I guess it has the same things as
> mine.  

Yeah, we've done basically the same thing.  I see though that you've
also already handled some RCU idle cases that I hadn't.

Since you're a little further along than me, I'll switch to yours and
start using that as the base.

> Will you be at the Collaboration Summit next week?

Unfortunately, no.

Kevin

* warning in tick_nohz_irq_exit
  2012-03-21 13:58 ` Frederic Weisbecker
@ 2012-04-04 15:33   ` Stephen Hemminger
  2012-04-04 20:45     ` Frederic Weisbecker
  0 siblings, 1 reply; 96+ messages in thread
From: Stephen Hemminger @ 2012-04-04 15:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

I'm using a test kernel that merges the no-hz branch from your github repo
with current upstream.

Running it on a laptop, I'm seeing warning splats. They may or may not be
related to a write to cpuset/NAME/tasks returning ENOSPC.
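
On the ENOSPC mentioned above: one common cause, not confirmed for this
report, is that the cpuset's cpus or mems file is still empty when a task is
attached. A minimal Python sketch of a pre-check follows. The helper name is
hypothetical, NAME stays a placeholder, and checking both file-name schemes
("cpuset.cpus" on a cgroup mount, "cpus" on a legacy cpuset mount) is an
assumption about how the hierarchy was mounted.

```python
import os


def empty_cpuset_files(cpuset_dir):
    """Return the names of the cpus/mems files in cpuset_dir that are empty.

    Hypothetical helper for illustration only. For each of "cpus" and
    "mems", the prefixed name ("cpuset.cpus") is tried first, then the
    bare name ("cpus"); only the first one that exists is inspected.
    """
    empty = []
    for kind in ("cpus", "mems"):
        for name in ("cpuset." + kind, kind):
            path = os.path.join(cpuset_dir, name)
            if os.path.exists(path):
                with open(path) as f:
                    # An unconfigured cpuset reads back as a blank line.
                    if not f.read().strip():
                        empty.append(name)
                break
    return empty


# Example call (the mount point is an assumption, adjust to your system):
#   empty_cpuset_files("/dev/cpuset/NAME")
```

If the returned list is non-empty, populating those files before writing to
tasks would rule this particular cause out.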

[  476.389422] ------------[ cut here ]------------
[  476.389440] WARNING: at kernel/time/tick-sched.c:567 tick_nohz_irq_exit+0x11e/0x194()
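
For anyone comparing splats across test runs, the WARNING line above can be
picked apart mechanically. The following is a small Python sketch; the
function name and the regular expression are assumptions that match only the
format shown in this report (other kernel versions print the line
differently).

```python
import re

# Matches lines of the form seen in this report:
#   [  476.389440] WARNING: at <file>:<line> <func>+<offset>/<size>()
WARN_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+WARNING: at "
    r"(?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>[\w.]+)\+(?P<off>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)\(\)"
)


def parse_warning(line):
    """Return the fields of a WARNING line as a dict, or None if no match."""
    m = WARN_RE.search(line)
    if m is None:
        return None
    return {
        "time": float(m.group("ts")),        # seconds since boot
        "file": m.group("file"),             # source file of the WARN site
        "line": int(m.group("line")),        # source line of the WARN site
        "func": m.group("func"),             # function containing the WARN
        "offset": int(m.group("off"), 16),   # byte offset into the function
        "size": int(m.group("size"), 16),    # total size of the function
    }
```

Feeding it the line above yields file kernel/time/tick-sched.c, line 567 and
function tick_nohz_irq_exit; ordinary dmesg lines return None.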


Full dmesg:

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 3.3.1-1-amd64-vyatta (shemminger@s6510) (gcc version 4.6.3 (Debian 4.6.3-1) ) #1 SMP Tue Apr 3 15:52:51 PDT 2012
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.3.1-1-amd64-vyatta root=UUID=b7241ed2-14f8-4d17-b9c0-87c5a6876d4c ro quiet
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009e000 (usable)
[    0.000000]  BIOS-e820: 000000000009e000 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000cf6b0000 (usable)
[    0.000000]  BIOS-e820: 00000000cf6b0000 - 00000000cf6ca000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000cf6ca000 - 00000000cf6ce000 (ACPI NVS)
[    0.000000]  BIOS-e820: 00000000cf6ce000 - 00000000d0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000f8000000 - 00000000fc000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[    0.000000]  BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved)
[    0.000000]  BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
[    0.000000]  BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved)
[    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[    0.000000]  BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI 2.4 present.
[    0.000000] DMI: FUJITSU LifeBook S6510/FJNB1D3, BIOS Version 1.31  02/05/2009
[    0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
[    0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
[    0.000000] No AGP bridge found
[    0.000000] last_pfn = 0x130000 max_arch_pfn = 0x400000000
[    0.000000] MTRR default type: uncachable
[    0.000000] MTRR fixed ranges enabled:
[    0.000000]   00000-9FFFF write-back
[    0.000000]   A0000-BFFFF uncachable
[    0.000000]   C0000-CFFFF write-protect
[    0.000000]   D0000-DFFFF uncachable
[    0.000000]   E0000-FFFFF write-protect
[    0.000000] MTRR variable ranges enabled:
[    0.000000]   0 base 0D0000000 mask FF0000000 uncachable
[    0.000000]   1 base 0E0000000 mask FE0000000 uncachable
[    0.000000]   2 base 000000000 mask F00000000 write-back
[    0.000000]   3 base 100000000 mask FE0000000 write-back
[    0.000000]   4 base 120000000 mask FF0000000 write-back
[    0.000000]   5 base 0CF700000 mask FFFF00000 uncachable
[    0.000000]   6 base 0CF800000 mask FFF800000 uncachable
[    0.000000]   7 disabled
[    0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[    0.000000] e820 update range: 00000000cf700000 - 0000000100000000 (usable) ==> (reserved)
[    0.000000] last_pfn = 0xcf6b0 max_arch_pfn = 0x400000000
[    0.000000] initial memory mapped : 0 - 20000000
[    0.000000] Base memory trampoline at [ffff880000099000] 99000 size 20480
[    0.000000] init_memory_mapping: 0000000000000000-00000000cf6b0000
[    0.000000]  0000000000 - 00cf600000 page 2M
[    0.000000]  00cf600000 - 00cf6b0000 page 4k
[    0.000000] kernel direct mapping tables up to cf6b0000 @ 1fffa000-20000000
[    0.000000] init_memory_mapping: 0000000100000000-0000000130000000
[    0.000000]  0100000000 - 0130000000 page 2M
[    0.000000] kernel direct mapping tables up to 130000000 @ cf6aa000-cf6b0000
[    0.000000] RAMDISK: 36a60000 - 37528000
[    0.000000] ACPI: RSDP 00000000000f6150 00024 (v02 FUJ   )
[    0.000000] ACPI: XSDT 00000000cf6bfa35 00074 (v01 FUJ    PC       01310000 FUJ  00000100)
[    0.000000] ACPI: FACP 00000000cf6c80b1 000F4 (v03 FUJ    PC       01310000 FUJ  00000100)
[    0.000000] ACPI: DSDT 00000000cf6bfaa9 08608 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.000000] ACPI: FACS 00000000cf6cdfc0 00040
[    0.000000] ACPI: HPET 00000000cf6c81a5 00038 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.000000] ACPI: MCFG 00000000cf6c81dd 0003C (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.000000] ACPI: SSDT 00000000cf6c8219 004EF (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.000000] ACPI: SSDT 00000000cf6c8708 001CA (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.000000] ACPI: SSDT 00000000cf6c88d2 0106D (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.000000] ACPI: SSDT 00000000cf6c993f 00447 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.000000] ACPI: APIC 00000000cf6c9d86 00068 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.000000] ACPI: BOOT 00000000cf6c9dee 00028 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.000000] ACPI: SLIC 00000000cf6c9e16 00176 (v01 FUJ    PC       01310000 FUJ  00000100)
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] No NUMA configuration found
[    0.000000] Faking a node at 0000000000000000-0000000130000000
[    0.000000] Initmem setup node 0 0000000000000000-0000000130000000
[    0.000000]   NODE_DATA [000000012fffb000 - 000000012fffffff]
[    0.000000]  [ffffea0000000000-ffffea0004bfffff] PMD -> [ffff88012b600000-ffff88012f5fffff] on node 0
[    0.000000] Zone PFN ranges:
[    0.000000]   DMA      0x00000010 -> 0x00001000
[    0.000000]   DMA32    0x00001000 -> 0x00100000
[    0.000000]   Normal   0x00100000 -> 0x00130000
[    0.000000] Movable zone start PFN for each node
[    0.000000] Early memory PFN ranges
[    0.000000]     0: 0x00000010 -> 0x0000009e
[    0.000000]     0: 0x00000100 -> 0x000cf6b0
[    0.000000]     0: 0x00100000 -> 0x00130000
[    0.000000] On node 0 totalpages: 1046078
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 5 pages reserved
[    0.000000]   DMA zone: 3913 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 16320 pages used for memmap
[    0.000000]   DMA32 zone: 829168 pages, LIFO batch:31
[    0.000000]   Normal zone: 3072 pages used for memmap
[    0.000000]   Normal zone: 193536 pages, LIFO batch:31
[    0.000000] ACPI: PM-Timer IO Port: 0x1008
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
[    0.000000] ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
[    0.000000] IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ2 used by override.
[    0.000000] ACPI: IRQ9 used by override.
[    0.000000] Using ACPI (MADT) for SMP configuration information
[    0.000000] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[    0.000000] SMP: Allowing 2 CPUs, 0 hotplug CPUs
[    0.000000] nr_irqs_gsi: 40
[    0.000000] PM: Registered nosave memory: 000000000009e000 - 00000000000a0000
[    0.000000] PM: Registered nosave memory: 00000000000a0000 - 00000000000e0000
[    0.000000] PM: Registered nosave memory: 00000000000e0000 - 0000000000100000
[    0.000000] PM: Registered nosave memory: 00000000cf6b0000 - 00000000cf6ca000
[    0.000000] PM: Registered nosave memory: 00000000cf6ca000 - 00000000cf6ce000
[    0.000000] PM: Registered nosave memory: 00000000cf6ce000 - 00000000d0000000
[    0.000000] PM: Registered nosave memory: 00000000d0000000 - 00000000f8000000
[    0.000000] PM: Registered nosave memory: 00000000f8000000 - 00000000fc000000
[    0.000000] PM: Registered nosave memory: 00000000fc000000 - 00000000fec00000
[    0.000000] PM: Registered nosave memory: 00000000fec00000 - 00000000fec10000
[    0.000000] PM: Registered nosave memory: 00000000fec10000 - 00000000fed00000
[    0.000000] PM: Registered nosave memory: 00000000fed00000 - 00000000fed14000
[    0.000000] PM: Registered nosave memory: 00000000fed14000 - 00000000fed1a000
[    0.000000] PM: Registered nosave memory: 00000000fed1a000 - 00000000fed1c000
[    0.000000] PM: Registered nosave memory: 00000000fed1c000 - 00000000fed90000
[    0.000000] PM: Registered nosave memory: 00000000fed90000 - 00000000fee00000
[    0.000000] PM: Registered nosave memory: 00000000fee00000 - 00000000fee01000
[    0.000000] PM: Registered nosave memory: 00000000fee01000 - 00000000ff000000
[    0.000000] PM: Registered nosave memory: 00000000ff000000 - 0000000100000000
[    0.000000] Allocating PCI resources starting at d0000000 (gap: d0000000:28000000)
[    0.000000] Booting paravirtualized kernel on bare hardware
[    0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:2 nr_node_ids:1
[    0.000000] PERCPU: Embedded 27 pages/cpu @ffff88012fc00000 s80192 r8192 d22208 u1048576
[    0.000000] pcpu-alloc: s80192 r8192 d22208 u1048576 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 0 1 
[    0.000000] Built 1 zonelists in Node order, mobility grouping on.  Total pages: 1026617
[    0.000000] Policy zone: Normal
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.3.1-1-amd64-vyatta root=UUID=b7241ed2-14f8-4d17-b9c0-87c5a6876d4c ro quiet
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[    0.000000] Checking aperture...
[    0.000000] No AGP bridge found
[    0.000000] Memory: 4033240k/4980736k available (3452k kernel code, 796424k absent, 151072k reserved, 3171k data, 560k init)
[    0.000000] SLUB: Genslabs=15, HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
[    0.000000] Hierarchical RCU implementation.
[    0.000000] 	CONFIG_RCU_FANOUT set to non-default value of 32
[    0.000000] 	RCU dyntick-idle grace-period acceleration is enabled.
[    0.000000] NR_IRQS:4352 nr_irqs:512 16
[    0.000000] Extended CMOS year: 2000
[    0.000000] Console: colour VGA+ 80x25
[    0.000000] console [tty0] enabled
[    0.000000] hpet clockevent registered
[    0.000000] Fast TSC calibration using PIT
[    0.000000] Detected 2394.105 MHz processor.
[    0.010005] Calibrating delay loop (skipped), value calculated using timer frequency.. 4788.21 BogoMIPS (lpj=23941050)
[    0.010014] pid_max: default: 32768 minimum: 301
[    0.010052] Security Framework initialized
[    0.010745] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
[    0.022235] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
[    0.023585] Mount-cache hash table entries: 256
[    0.023888] CPU: Physical Processor ID: 0
[    0.023893] CPU: Processor Core ID: 0
[    0.023897] mce: CPU supports 6 MCE banks
[    0.023911] CPU0: Thermal monitoring enabled (TM2)
[    0.023918] using mwait in idle threads.
[    0.024936] ACPI: Core revision 20120111
[    0.036120] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.136162] CPU0: Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz stepping 0b
[    0.140000] Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
[    0.140000] PEBS disabled due to CPU errata.
[    0.140000] ... version:                2
[    0.140000] ... bit width:              40
[    0.140000] ... generic registers:      2
[    0.140000] ... value mask:             000000ffffffffff
[    0.140000] ... max period:             000000007fffffff
[    0.140000] ... fixed-purpose events:   3
[    0.140000] ... event mask:             0000000700000003
[    0.140000] NMI watchdog enabled, takes one hw-pmu counter.
[    0.140000] Booting Node   0, Processors  #1 Ok.
[    0.140000] smpboot cpu 1: start_ip = 99000
[    0.143775] NMI watchdog enabled, takes one hw-pmu counter.
[    0.143826] Brought up 2 CPUs
[    0.143831] Total of 2 processors activated (9576.42 BogoMIPS).
[    0.147081] devtmpfs: initialized
[    0.153207] PM: Registering ACPI NVS region at cf6ca000 (16384 bytes)
[    0.153207] print_constraints: dummy: 
[    0.153207] NET: Registered protocol family 16
[    0.153207] ACPI: bus type pci registered
[    0.153207] PCI: MMCONFIG for domain 0000 [bus 00-3f] at [mem 0xf8000000-0xfbffffff] (base 0xf8000000)
[    0.153207] PCI: MMCONFIG at [mem 0xf8000000-0xfbffffff] reserved in E820
[    0.159024] PCI: Using configuration type 1 for base access
[    0.160062] bio: create slab <bio-0> at 0
[    0.160069] ACPI: Added _OSI(Module Device)
[    0.160069] ACPI: Added _OSI(Processor Device)
[    0.160069] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.160069] ACPI: Added _OSI(Processor Aggregator Device)
[    0.162586] ACPI: EC: Look up EC in DSDT
[    0.172631] ACPI:      00000000cf6cda9f 003B7 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.173514] ACPI: Dynamic OEM Table Load:
[    0.173520] ACPI:                (null) 003B7 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.175848] ACPI: SSDT 00000000cf6cac19 002BC (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.176501] ACPI: Dynamic OEM Table Load:
[    0.176507] ACPI: SSDT           (null) 002BC (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.176755] ACPI: SSDT 00000000cf6cb119 00627 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.177375] ACPI: Dynamic OEM Table Load:
[    0.177380] ACPI: SSDT           (null) 00627 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.180423] ACPI: SSDT 00000000cf6cb061 000B8 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.181067] ACPI: Dynamic OEM Table Load:
[    0.181073] ACPI: SSDT           (null) 000B8 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.181251] ACPI: SSDT 00000000cf6cb740 00047 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.181873] ACPI: Dynamic OEM Table Load:
[    0.181879] ACPI: SSDT           (null) 00047 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
[    0.182301] ACPI: Interpreter enabled
[    0.182308] ACPI: (supports S0 S3 S4 S5)
[    0.182347] ACPI: Using IOAPIC for interrupt routing
[    0.190533] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
[    0.190533] ACPI: ACPI Dock Station Driver: 1 docks/bays found
[    0.190533] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    0.190924] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
[    0.191834] pci_root PNP0A08:00: host bridge window [io  0x0000-0x0cf7]
[    0.191839] pci_root PNP0A08:00: host bridge window [io  0x0d00-0xffff]
[    0.191845] pci_root PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff]
[    0.191850] pci_root PNP0A08:00: host bridge window [mem 0x000d0000-0x000d3fff]
[    0.191855] pci_root PNP0A08:00: host bridge window [mem 0x000d4000-0x000d7fff]
[    0.191860] pci_root PNP0A08:00: host bridge window [mem 0x000d8000-0x000dbfff]
[    0.191865] pci_root PNP0A08:00: host bridge window [mem 0x000dc000-0x000dffff]
[    0.191871] pci_root PNP0A08:00: host bridge window [mem 0xd0000000-0xfebfffff]
[    0.191876] pci_root PNP0A08:00: host bridge window [mem 0xfed40000-0xfed44fff]
[    0.191938] PCI host bridge to bus 0000:00
[    0.191938] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
[    0.191938] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
[    0.191938] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
[    0.191938] pci_bus 0000:00: root bus resource [mem 0x000d0000-0x000d3fff]
[    0.191938] pci_bus 0000:00: root bus resource [mem 0x000d4000-0x000d7fff]
[    0.191938] pci_bus 0000:00: root bus resource [mem 0x000d8000-0x000dbfff]
[    0.191938] pci_bus 0000:00: root bus resource [mem 0x000dc000-0x000dffff]
[    0.191938] pci_bus 0000:00: root bus resource [mem 0xd0000000-0xfebfffff]
[    0.191938] pci_bus 0000:00: root bus resource [mem 0xfed40000-0xfed44fff]
[    0.191938] pci 0000:00:00.0: [8086:2a00] type 0 class 0x000600
[    0.191938] pci 0000:00:02.0: [8086:2a02] type 0 class 0x000300
[    0.191938] pci 0000:00:02.0: reg 10: [mem 0xfc000000-0xfc0fffff 64bit]
[    0.191938] pci 0000:00:02.0: reg 18: [mem 0xe0000000-0xefffffff 64bit pref]
[    0.191938] pci 0000:00:02.0: reg 20: [io  0x1800-0x1807]
[    0.191938] pci 0000:00:02.1: [8086:2a03] type 0 class 0x000380
[    0.191938] pci 0000:00:02.1: reg 10: [mem 0xfc100000-0xfc1fffff 64bit]
[    0.191938] pci 0000:00:1a.0: [8086:2834] type 0 class 0x000c03
[    0.191938] pci 0000:00:1a.0: reg 20: [io  0x1820-0x183f]
[    0.191938] pci 0000:00:1a.1: [8086:2835] type 0 class 0x000c03
[    0.191938] pci 0000:00:1a.1: reg 20: [io  0x1840-0x185f]
[    0.191938] pci 0000:00:1a.7: [8086:283a] type 0 class 0x000c03
[    0.191938] pci 0000:00:1a.7: reg 10: [mem 0xfc704800-0xfc704bff]
[    0.191938] pci 0000:00:1a.7: PME# supported from D0 D3hot D3cold
[    0.191938] pci 0000:00:1b.0: [8086:284b] type 0 class 0x000403
[    0.191938] pci 0000:00:1b.0: reg 10: [mem 0xfc700000-0xfc703fff 64bit]
[    0.191938] pci 0000:00:1b.0: PME# supported from D0 D3hot D3cold
[    0.191938] pci 0000:00:1c.0: [8086:283f] type 1 class 0x000604
[    0.191938] pci 0000:00:1c.0: PME# supported from D0 D3hot D3cold
[    0.191938] pci 0000:00:1c.2: [8086:2843] type 1 class 0x000604
[    0.191938] pci 0000:00:1c.2: PME# supported from D0 D3hot D3cold
[    0.191938] pci 0000:00:1d.0: [8086:2830] type 0 class 0x000c03
[    0.191938] pci 0000:00:1d.0: reg 20: [io  0x1860-0x187f]
[    0.191938] pci 0000:00:1d.1: [8086:2831] type 0 class 0x000c03
[    0.191938] pci 0000:00:1d.1: reg 20: [io  0x1880-0x189f]
[    0.191938] pci 0000:00:1d.2: [8086:2832] type 0 class 0x000c03
[    0.191938] pci 0000:00:1d.2: reg 20: [io  0x18a0-0x18bf]
[    0.191938] pci 0000:00:1d.7: [8086:2836] type 0 class 0x000c03
[    0.191938] pci 0000:00:1d.7: reg 10: [mem 0xfc704c00-0xfc704fff]
[    0.191952] pci 0000:00:1d.7: PME# supported from D0 D3hot D3cold
[    0.191986] pci 0000:00:1e.0: [8086:2448] type 1 class 0x000604
[    0.192099] pci 0000:00:1f.0: [8086:2815] type 0 class 0x000601
[    0.192215] pci 0000:00:1f.0: quirk: [io  0x1000-0x107f] claimed by ICH6 ACPI/GPIO/TCO
[    0.192225] pci 0000:00:1f.0: quirk: [io  0x1180-0x11bf] claimed by ICH6 GPIO
[    0.192232] pci 0000:00:1f.0: ICH7 LPC Generic IO decode 1 PIO at fd00 (mask 007f)
[    0.192301] pci 0000:00:1f.1: [8086:2850] type 0 class 0x000101
[    0.192323] pci 0000:00:1f.1: reg 10: [io  0x0000-0x0007]
[    0.192338] pci 0000:00:1f.1: reg 14: [io  0x0000-0x0003]
[    0.192354] pci 0000:00:1f.1: reg 18: [io  0x0000-0x0007]
[    0.192369] pci 0000:00:1f.1: reg 1c: [io  0x0000-0x0003]
[    0.192384] pci 0000:00:1f.1: reg 20: [io  0x1810-0x181f]
[    0.192450] pci 0000:00:1f.2: [8086:2829] type 0 class 0x000106
[    0.192483] pci 0000:00:1f.2: reg 10: [io  0x1c00-0x1c07]
[    0.192498] pci 0000:00:1f.2: reg 14: [io  0x18d4-0x18d7]
[    0.192514] pci 0000:00:1f.2: reg 18: [io  0x18d8-0x18df]
[    0.192530] pci 0000:00:1f.2: reg 1c: [io  0x18d0-0x18d3]
[    0.192545] pci 0000:00:1f.2: reg 20: [io  0x18e0-0x18ff]
[    0.192563] pci 0000:00:1f.2: reg 24: [mem 0xfc704000-0xfc7047ff]
[    0.192640] pci 0000:00:1f.2: PME# supported from D3hot
[    0.192672] pci 0000:00:1f.3: [8086:283e] type 0 class 0x000c05
[    0.192694] pci 0000:00:1f.3: reg 10: [mem 0x00000000-0x000000ff]
[    0.192742] pci 0000:00:1f.3: reg 20: [io  0x1c20-0x1c3f]
[    0.192890] pci 0000:04:00.0: [11ab:4363] type 0 class 0x000200
[    0.192927] pci 0000:04:00.0: reg 10: [mem 0xfc200000-0xfc203fff 64bit]
[    0.192949] pci 0000:04:00.0: reg 18: [io  0x2000-0x20ff]
[    0.193021] pci 0000:04:00.0: reg 30: [mem 0x00000000-0x0001ffff pref]
[    0.193136] pci 0000:04:00.0: supports D1 D2
[    0.193140] pci 0000:04:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[    0.220028] pci 0000:00:1c.0: PCI bridge to [bus 04-07]
[    0.220036] pci 0000:00:1c.0:   bridge window [io  0x2000-0x2fff]
[    0.220044] pci 0000:00:1c.0:   bridge window [mem 0xfc200000-0xfc2fffff]
[    0.220150] pci 0000:0c:00.0: [8086:4229] type 0 class 0x000280
[    0.220197] pci 0000:0c:00.0: reg 10: [mem 0xfc300000-0xfc301fff 64bit]
[    0.220413] pci 0000:0c:00.0: PME# supported from D0 D3hot D3cold
[    0.240028] pci 0000:00:1c.2: PCI bridge to [bus 0c-0f]
[    0.240039] pci 0000:00:1c.2:   bridge window [mem 0xfc300000-0xfc3fffff]
[    0.240107] pci 0000:1c:03.0: [1217:7136] type 2 class 0x000607
[    0.240137] pci 0000:1c:03.0: reg 10: [mem 0x00000000-0x00000fff]
[    0.240185] pci 0000:1c:03.0: supports D1 D2
[    0.240189] pci 0000:1c:03.0: PME# supported from D0 D1 D2 D3hot D3cold
[    0.240227] pci 0000:1c:03.2: [1217:7120] type 0 class 0x000805
[    0.240257] pci 0000:1c:03.2: reg 10: [mem 0xfc402800-0xfc4028ff]
[    0.240380] pci 0000:1c:03.2: supports D1 D2
[    0.240384] pci 0000:1c:03.2: PME# supported from D0 D1 D2 D3hot D3cold
[    0.240416] pci 0000:1c:03.3: [1217:7130] type 0 class 0x000180
[    0.240446] pci 0000:1c:03.3: reg 10: [mem 0xfc400000-0xfc400fff]
[    0.240569] pci 0000:1c:03.3: supports D1 D2
[    0.240573] pci 0000:1c:03.3: PME# supported from D0 D1 D2 D3hot D3cold
[    0.240607] pci 0000:1c:03.4: [1217:00f7] type 0 class 0x000c00
[    0.240633] pci 0000:1c:03.4: reg 10: [mem 0xfc401000-0xfc401fff]
[    0.240650] pci 0000:1c:03.4: reg 14: [mem 0xfc402000-0xfc4027ff]
[    0.240754] pci 0000:1c:03.4: supports D1 D2
[    0.240758] pci 0000:1c:03.4: PME# supported from D0 D1 D2 D3hot
[    0.240831] pci 0000:00:1e.0: PCI bridge to [bus 1c-1d] (subtractive decode)
[    0.240841] pci 0000:00:1e.0:   bridge window [mem 0xfc400000-0xfc4fffff]
[    0.240852] pci 0000:00:1e.0:   bridge window [io  0x0000-0x0cf7] (subtractive decode)
[    0.240857] pci 0000:00:1e.0:   bridge window [io  0x0d00-0xffff] (subtractive decode)
[    0.240862] pci 0000:00:1e.0:   bridge window [mem 0x000a0000-0x000bffff] (subtractive decode)
[    0.240867] pci 0000:00:1e.0:   bridge window [mem 0x000d0000-0x000d3fff] (subtractive decode)
[    0.240873] pci 0000:00:1e.0:   bridge window [mem 0x000d4000-0x000d7fff] (subtractive decode)
[    0.240878] pci 0000:00:1e.0:   bridge window [mem 0x000d8000-0x000dbfff] (subtractive decode)
[    0.240883] pci 0000:00:1e.0:   bridge window [mem 0x000dc000-0x000dffff] (subtractive decode)
[    0.240889] pci 0000:00:1e.0:   bridge window [mem 0xd0000000-0xfebfffff] (subtractive decode)
[    0.240894] pci 0000:00:1e.0:   bridge window [mem 0xfed40000-0xfed44fff] (subtractive decode)
[    0.240955] pci_bus 0000:1d: [bus 1d-20] partially hidden behind transparent bridge 0000:1c [bus 1c-1d]
[    0.240986] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
[    0.241271] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.RP01._PRT]
[    0.241362] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.RP03._PRT]
[    0.241497] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCIB._PRT]
[    0.241562]  pci0000:00: Requesting ACPI _OSC control (0x1d)
[    0.241568]  pci0000:00: ACPI _OSC request failed (AE_NOT_FOUND), returned control mask: 0x1d
[    0.241572] ACPI _OSC control for PCIe not granted, disabling ASPM
[    0.250489] ACPI: PCI Interrupt Link [LNKA] (IRQs 1 3 4 5 6 7 10 12 14 15) *11
[    0.250489] ACPI: PCI Interrupt Link [LNKB] (IRQs 1 3 4 5 6 7 11 12 14 15) *0, disabled.
[    0.250489] ACPI: PCI Interrupt Link [LNKC] (IRQs 1 3 4 5 6 7 10 12 14 15) *11
[    0.250489] ACPI: PCI Interrupt Link [LNKD] (IRQs 1 3 4 5 6 7 11 12 14 15) *0, disabled.
[    0.250489] ACPI: PCI Interrupt Link [LNKE] (IRQs 1 3 4 5 6 7 10 12 14 15) *11
[    0.250536] ACPI: PCI Interrupt Link [LNKF] (IRQs 1 3 4 5 6 7 *11 12 14 15)
[    0.250618] ACPI: PCI Interrupt Link [LNKG] (IRQs 1 3 4 5 6 7 10 12 14 15) *11
[    0.250701] ACPI: PCI Interrupt Link [LNKH] (IRQs 1 3 4 5 6 7 *11 12 14 15)
[    0.250750] vgaarb: device added: PCI:0000:00:02.0,decodes=io+mem,owns=io+mem,locks=none
[    0.250750] vgaarb: loaded
[    0.250750] vgaarb: bridge control possible 0000:00:02.0
[    0.250750] SCSI subsystem initialized
[    0.250750] libata version 3.00 loaded.
[    0.250750] usbcore: registered new interface driver usbfs
[    0.250750] usbcore: registered new interface driver hub
[    0.250750] usbcore: registered new device driver usb
[    0.250750] PCI: Using ACPI for IRQ routing
[    0.260440] PCI: pci_cache_line_size set to 64 bytes
[    0.260592] reserve RAM buffer: 000000000009e000 - 000000000009ffff 
[    0.260597] reserve RAM buffer: 00000000cf6b0000 - 00000000cfffffff 
[    0.260630] HPET: 3 timers in total, 0 timers will be used for per-cpu timer
[    0.260630] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[    0.260630] hpet0: 3 comparators, 64-bit 14.318180 MHz counter
[    0.270038] Switching to clocksource hpet
[    0.273784] pnp: PnP ACPI init
[    0.273805] ACPI: bus type pnp registered
[    0.274317] pnp 00:00: [bus 00-ff]
[    0.274323] pnp 00:00: [io  0x0000-0x0cf7 window]
[    0.274327] pnp 00:00: [io  0x0cf8-0x0cff]
[    0.274331] pnp 00:00: [io  0x0d00-0xffff window]
[    0.274336] pnp 00:00: [mem 0x000a0000-0x000bffff window]
[    0.274340] pnp 00:00: [mem 0x000c0000-0x000c3fff window]
[    0.274345] pnp 00:00: [mem 0x000c4000-0x000c7fff window]
[    0.274349] pnp 00:00: [mem 0x000c8000-0x000cbfff window]
[    0.274353] pnp 00:00: [mem 0x000cc000-0x000cffff window]
[    0.274358] pnp 00:00: [mem 0x000d0000-0x000d3fff window]
[    0.274362] pnp 00:00: [mem 0x000d4000-0x000d7fff window]
[    0.274366] pnp 00:00: [mem 0x000d8000-0x000dbfff window]
[    0.274371] pnp 00:00: [mem 0x000dc000-0x000dffff window]
[    0.274375] pnp 00:00: [mem 0x000e0000-0x000e3fff window]
[    0.274379] pnp 00:00: [mem 0x000e4000-0x000e7fff window]
[    0.274384] pnp 00:00: [mem 0x000e8000-0x000ebfff window]
[    0.274388] pnp 00:00: [mem 0x000ec000-0x000effff window]
[    0.274393] pnp 00:00: [mem 0x000f0000-0x000fffff window]
[    0.274397] pnp 00:00: [mem 0xd0000000-0xfebfffff window]
[    0.274402] pnp 00:00: [mem 0xfed40000-0xfed44fff window]
[    0.274542] pnp 00:00: Plug and Play ACPI device, IDs PNP0a08 PNP0a03 (active)
[    0.274657] pnp 00:01: [io  0x0010-0x001f]
[    0.274662] pnp 00:01: [io  0x0024-0x0025]
[    0.274666] pnp 00:01: [io  0x0028-0x0029]
[    0.274670] pnp 00:01: [io  0x002c-0x002d]
[    0.274673] pnp 00:01: [io  0x002e-0x002f]
[    0.274677] pnp 00:01: [io  0x0030-0x0031]
[    0.274681] pnp 00:01: [io  0x0034-0x0035]
[    0.274684] pnp 00:01: [io  0x0038-0x0039]
[    0.274688] pnp 00:01: [io  0x003c-0x003d]
[    0.274692] pnp 00:01: [io  0x0000-0xffffffffffffffff disabled]
[    0.274697] pnp 00:01: [io  0x0050-0x0053]
[    0.274700] pnp 00:01: [io  0x0061]
[    0.274704] pnp 00:01: [io  0x0063]
[    0.274707] pnp 00:01: [io  0x0065]
[    0.274715] pnp 00:01: [io  0x0067]
[    0.274719] pnp 00:01: [io  0x0072-0x0077]
[    0.274722] pnp 00:01: [io  0x0080]
[    0.274726] pnp 00:01: [io  0x0090-0x009f]
[    0.274729] pnp 00:01: [io  0x0092]
[    0.274733] pnp 00:01: [io  0x00a4-0x00a5]
[    0.274737] pnp 00:01: [io  0x00a8-0x00a9]
[    0.274740] pnp 00:01: [io  0x00ac-0x00ad]
[    0.274744] pnp 00:01: [io  0x00b0-0x00b1]
[    0.274748] pnp 00:01: [io  0x00b2-0x00b3]
[    0.274751] pnp 00:01: [io  0x00b4-0x00b5]
[    0.274755] pnp 00:01: [io  0x00b8-0x00b9]
[    0.274759] pnp 00:01: [io  0x00bc-0x00bd]
[    0.274762] pnp 00:01: [io  0x04d0-0x04d1]
[    0.274766] pnp 00:01: [io  0x0680-0x069f]
[    0.274770] pnp 00:01: [io  0x0800-0x080f]
[    0.274773] pnp 00:01: [io  0x1000-0x107f]
[    0.274777] pnp 00:01: [io  0x1080-0x10ff]
[    0.274781] pnp 00:01: [io  0x1100-0x111f]
[    0.274784] pnp 00:01: [io  0x1180-0x11bf]
[    0.274788] pnp 00:01: [io  0x1640-0x164f]
[    0.274792] pnp 00:01: [io  0xf800-0xf87f]
[    0.274796] pnp 00:01: [io  0xf880-0xf8ff]
[    0.274800] pnp 00:01: [io  0xfc00-0xfc7f]
[    0.274803] pnp 00:01: [io  0xfc80-0xfcff]
[    0.274807] pnp 00:01: [io  0xfd0c-0xfd7f]
[    0.274811] pnp 00:01: [io  0xfe00-0xfe03]
[    0.275050] system 00:01: [io  0x04d0-0x04d1] has been reserved
[    0.275056] system 00:01: [io  0x0680-0x069f] has been reserved
[    0.275061] system 00:01: [io  0x0800-0x080f] has been reserved
[    0.275067] system 00:01: [io  0x1000-0x107f] has been reserved
[    0.275072] system 00:01: [io  0x1080-0x10ff] has been reserved
[    0.275077] system 00:01: [io  0x1100-0x111f] has been reserved
[    0.275082] system 00:01: [io  0x1180-0x11bf] has been reserved
[    0.275088] system 00:01: [io  0x1640-0x164f] has been reserved
[    0.275093] system 00:01: [io  0xf800-0xf87f] has been reserved
[    0.275099] system 00:01: [io  0xf880-0xf8ff] has been reserved
[    0.275104] system 00:01: [io  0xfc00-0xfc7f] has been reserved
[    0.275109] system 00:01: [io  0xfc80-0xfcff] has been reserved
[    0.275115] system 00:01: [io  0xfd0c-0xfd7f] has been reserved
[    0.275120] system 00:01: [io  0xfe00-0xfe03] has been reserved
[    0.275127] system 00:01: Plug and Play ACPI device, IDs PNP0c02 (active)
[    0.275285] pnp 00:02: [mem 0xfed1c000-0xfed1ffff]
[    0.275290] pnp 00:02: [mem 0xfed14000-0xfed17fff]
[    0.275294] pnp 00:02: [mem 0xfed18000-0xfed18fff]
[    0.275298] pnp 00:02: [mem 0xfed19000-0xfed19fff]
[    0.275302] pnp 00:02: [mem 0xf8000000-0xfbffffff]
[    0.275306] pnp 00:02: [mem 0xfed20000-0xfed3ffff]
[    0.275311] pnp 00:02: [mem 0xfed40000-0xfed3ffff disabled]
[    0.275315] pnp 00:02: [mem 0xfed45000-0xfed8ffff]
[    0.275319] pnp 00:02: [mem 0xfef00000-0xfeffffff]
[    0.275449] system 00:02: [mem 0xfed1c000-0xfed1ffff] has been reserved
[    0.275456] system 00:02: [mem 0xfed14000-0xfed17fff] has been reserved
[    0.275462] system 00:02: [mem 0xfed18000-0xfed18fff] has been reserved
[    0.275467] system 00:02: [mem 0xfed19000-0xfed19fff] has been reserved
[    0.275473] system 00:02: [mem 0xf8000000-0xfbffffff] has been reserved
[    0.275478] system 00:02: [mem 0xfed20000-0xfed3ffff] has been reserved
[    0.275484] system 00:02: [mem 0xfed45000-0xfed8ffff] has been reserved
[    0.275489] system 00:02: [mem 0xfef00000-0xfeffffff] has been reserved
[    0.275496] system 00:02: Plug and Play ACPI device, IDs PNP0c02 (active)
[    0.275813] pnp 00:03: [io  0x004e-0x004f]
[    0.275818] pnp 00:03: [io  0xfd00-0xfd0b]
[    0.275822] pnp 00:03: [mem 0xfed40000-0xfed44fff]
[    0.275934] pnp 00:03: Plug and Play ACPI device, IDs IFX0102 PNP0c31 (active)
[    0.275990] pnp 00:04: [io  0x0000-0x000f]
[    0.275995] pnp 00:04: [io  0x0081-0x008f]
[    0.276003] pnp 00:04: [io  0x00c0-0x00df]
[    0.276007] pnp 00:04: [dma 4]
[    0.276114] pnp 00:04: Plug and Play ACPI device, IDs PNP0200 (active)
[    0.276145] pnp 00:05: [io  0x0060]
[    0.276149] pnp 00:05: [io  0x0064]
[    0.276162] pnp 00:05: [irq 1]
[    0.276273] pnp 00:05: Plug and Play ACPI device, IDs PNP0303 (active)
[    0.276303] pnp 00:06: [io  0x00f0-0x00fe]
[    0.276313] pnp 00:06: [irq 13]
[    0.276420] pnp 00:06: Plug and Play ACPI device, IDs PNP0c04 (active)
[    0.276466] pnp 00:07: [irq 12]
[    0.276578] pnp 00:07: Plug and Play ACPI device, IDs PNP0f13 (active)
[    0.276610] pnp 00:08: [io  0x0070-0x0071]
[    0.276619] pnp 00:08: [irq 8]
[    0.276731] pnp 00:08: Plug and Play ACPI device, IDs PNP0b00 (active)
[    0.276761] pnp 00:09: [io  0x0061]
[    0.276872] pnp 00:09: Plug and Play ACPI device, IDs PNP0800 (active)
[    0.277523] pnp: PnP ACPI: found 10 devices
[    0.277527] ACPI: ACPI bus type pnp unregistered
[    0.284690] PCI: max bus depth: 2 pci_try_num: 3
[    0.284752] pci 0000:00:1c.0: BAR 15: assigned [mem 0xd0000000-0xd00fffff pref]
[    0.284761] pci 0000:00:1f.3: BAR 0: assigned [mem 0xd0100000-0xd01000ff]
[    0.284774] pci 0000:00:1e.0: BAR 15: assigned [mem 0xd4000000-0xd7ffffff pref]
[    0.284782] pci 0000:00:1e.0: BAR 13: assigned [io  0x3000-0x3fff]
[    0.284790] pci 0000:00:1c.2: BAR 15: assigned [mem 0xd0200000-0xd03fffff 64bit pref]
[    0.284798] pci 0000:00:1c.2: BAR 13: assigned [io  0x4000-0x4fff]
[    0.284807] pci 0000:00:1c.0: BAR 15: assigned [mem 0xd0400000-0xd06fffff pref]
[    0.284814] pci 0000:04:00.0: BAR 6: assigned [mem 0xd0400000-0xd041ffff pref]
[    0.284819] pci 0000:00:1c.0: PCI bridge to [bus 04-07]
[    0.284825] pci 0000:00:1c.0:   bridge window [io  0x2000-0x2fff]
[    0.284835] pci 0000:00:1c.0:   bridge window [mem 0xfc200000-0xfc2fffff]
[    0.284842] pci 0000:00:1c.0:   bridge window [mem 0xd0400000-0xd06fffff pref]
[    0.284853] pci 0000:00:1c.2: PCI bridge to [bus 0c-0f]
[    0.284859] pci 0000:00:1c.2:   bridge window [io  0x4000-0x4fff]
[    0.284869] pci 0000:00:1c.2:   bridge window [mem 0xfc300000-0xfc3fffff]
[    0.284877] pci 0000:00:1c.2:   bridge window [mem 0xd0200000-0xd03fffff 64bit pref]
[    0.284897] pci 0000:1c:03.0: BAR 0: assigned [mem 0xd0000000-0xd0000fff]
[    0.284909] pci 0000:1c:03.0: BAR 16: assigned [mem 0xd8000000-0xdbffffff]
[    0.284915] pci 0000:1c:03.0: BAR 15: assigned [mem 0xd4000000-0xd7ffffff pref]
[    0.284920] pci 0000:1c:03.0: BAR 14: assigned [io  0x3000-0x30ff]
[    0.284925] pci 0000:1c:03.0: BAR 13: assigned [io  0x3400-0x34ff]
[    0.284930] pci 0000:1c:03.0: CardBus bridge to [bus 1d-20]
[    0.284934] pci 0000:1c:03.0:   bridge window [io  0x3400-0x34ff]
[    0.284942] pci 0000:1c:03.0:   bridge window [io  0x3000-0x30ff]
[    0.284950] pci 0000:1c:03.0:   bridge window [mem 0xd4000000-0xd7ffffff pref]
[    0.284959] pci 0000:1c:03.0:   bridge window [mem 0xd8000000-0xdbffffff]
[    0.284967] pci 0000:00:1e.0: PCI bridge to [bus 1c-1d]
[    0.284972] pci 0000:00:1e.0:   bridge window [io  0x3000-0x3fff]
[    0.284981] pci 0000:00:1e.0:   bridge window [mem 0xfc400000-0xfc4fffff]
[    0.284989] pci 0000:00:1e.0:   bridge window [mem 0xd4000000-0xd7ffffff pref]
[    0.285039] pci 0000:00:1e.0: setting latency timer to 64
[    0.285057] pci_bus 0000:00: resource 4 [io  0x0000-0x0cf7]
[    0.285062] pci_bus 0000:00: resource 5 [io  0x0d00-0xffff]
[    0.285066] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff]
[    0.285071] pci_bus 0000:00: resource 7 [mem 0x000d0000-0x000d3fff]
[    0.285076] pci_bus 0000:00: resource 8 [mem 0x000d4000-0x000d7fff]
[    0.285080] pci_bus 0000:00: resource 9 [mem 0x000d8000-0x000dbfff]
[    0.285085] pci_bus 0000:00: resource 10 [mem 0x000dc000-0x000dffff]
[    0.285090] pci_bus 0000:00: resource 11 [mem 0xd0000000-0xfebfffff]
[    0.285094] pci_bus 0000:00: resource 12 [mem 0xfed40000-0xfed44fff]
[    0.285099] pci_bus 0000:04: resource 0 [io  0x2000-0x2fff]
[    0.285104] pci_bus 0000:04: resource 1 [mem 0xfc200000-0xfc2fffff]
[    0.285109] pci_bus 0000:04: resource 2 [mem 0xd0400000-0xd06fffff pref]
[    0.285114] pci_bus 0000:0c: resource 0 [io  0x4000-0x4fff]
[    0.285118] pci_bus 0000:0c: resource 1 [mem 0xfc300000-0xfc3fffff]
[    0.285123] pci_bus 0000:0c: resource 2 [mem 0xd0200000-0xd03fffff 64bit pref]
[    0.285128] pci_bus 0000:1c: resource 0 [io  0x3000-0x3fff]
[    0.285132] pci_bus 0000:1c: resource 1 [mem 0xfc400000-0xfc4fffff]
[    0.285137] pci_bus 0000:1c: resource 2 [mem 0xd4000000-0xd7ffffff pref]
[    0.285142] pci_bus 0000:1c: resource 4 [io  0x0000-0x0cf7]
[    0.285146] pci_bus 0000:1c: resource 5 [io  0x0d00-0xffff]
[    0.285151] pci_bus 0000:1c: resource 6 [mem 0x000a0000-0x000bffff]
[    0.285155] pci_bus 0000:1c: resource 7 [mem 0x000d0000-0x000d3fff]
[    0.285160] pci_bus 0000:1c: resource 8 [mem 0x000d4000-0x000d7fff]
[    0.285164] pci_bus 0000:1c: resource 9 [mem 0x000d8000-0x000dbfff]
[    0.285169] pci_bus 0000:1c: resource 10 [mem 0x000dc000-0x000dffff]
[    0.285174] pci_bus 0000:1c: resource 11 [mem 0xd0000000-0xfebfffff]
[    0.285179] pci_bus 0000:1c: resource 12 [mem 0xfed40000-0xfed44fff]
[    0.285183] pci_bus 0000:1d: resource 0 [io  0x3400-0x34ff]
[    0.285188] pci_bus 0000:1d: resource 1 [io  0x3000-0x30ff]
[    0.285192] pci_bus 0000:1d: resource 2 [mem 0xd4000000-0xd7ffffff pref]
[    0.285197] pci_bus 0000:1d: resource 3 [mem 0xd8000000-0xdbffffff]
[    0.285241] NET: Registered protocol family 2
[    0.285467] IP route cache hash table entries: 131072 (order: 8, 1048576 bytes)
[    0.287260] TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
[    0.292371] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[    0.293032] TCP: Hash tables configured (established 524288 bind 65536)
[    0.293036] TCP reno registered
[    0.293054] UDP hash table entries: 2048 (order: 4, 65536 bytes)
[    0.293117] UDP-Lite hash table entries: 2048 (order: 4, 65536 bytes)
[    0.293289] NET: Registered protocol family 1
[    0.293317] pci 0000:00:02.0: Boot video device
[    0.293693] PCI: CLS 64 bytes, default 64
[    0.293766] Trying to unpack rootfs image as initramfs...
[    0.748181] Freeing initrd memory: 11040k freed
[    0.754749] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    0.754758] Placing 64MB software IO TLB between ffff8800cb6aa000 - ffff8800cf6aa000
[    0.754763] software IO TLB at phys 0xcb6aa000 - 0xcf6aa000
[    0.754786] Simple Boot Flag at 0x7b set to 0x1
[    0.756515] audit: initializing netlink socket (disabled)
[    0.756535] type=2000 audit(1333552189.750:1): initialized
[    0.757428] HugeTLB registered 2 MB page size, pre-allocated 0 pages
[    0.765853] Registering unionfs 2.5.11 (for 3.3.0-rc3)
[    0.766222] msgmni has been set to 7898
[    0.766724] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
[    0.766730] io scheduler noop registered
[    0.766733] io scheduler deadline registered
[    0.766949] io scheduler cfq registered (default)
[    0.767215] pcieport 0000:00:1c.0: irq 40 for MSI/MSI-X
[    0.767442] pcieport 0000:00:1c.2: irq 41 for MSI/MSI-X
[    0.768293] intel_idle: MWAIT substates: 0x22220
[    0.768297] intel_idle: does not run on family 6 model 15
[    0.768443] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    0.769879] tpm_tis 00:03: 1.2 TPM (device-id 0xB, rev-id 16)
[    1.470021] tpm_tis 00:03: Operation Timed out
[    1.470076] tpm_tis 00:03: TPM self test failed
[    1.474343] brd: module loaded
[    1.474781] Fixed MDIO Bus: probed
[    1.474790] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    1.474849] ehci_hcd 0000:00:1a.7: setting latency timer to 64
[    1.474855] ehci_hcd 0000:00:1a.7: EHCI Host Controller
[    1.474867] ehci_hcd 0000:00:1a.7: new USB bus registered, assigned bus number 1
[    1.474919] ehci_hcd 0000:00:1a.7: debug port 1
[    1.478799] ehci_hcd 0000:00:1a.7: cache line size of 64 is not supported
[    1.478827] ehci_hcd 0000:00:1a.7: irq 23, io mem 0xfc704800
[    1.490055] ehci_hcd 0000:00:1a.7: USB 2.0 started, EHCI 1.00
[    1.490107] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[    1.490112] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.490117] usb usb1: Product: EHCI Host Controller
[    1.490121] usb usb1: Manufacturer: Linux 3.3.1-1-amd64-vyatta ehci_hcd
[    1.490125] usb usb1: SerialNumber: 0000:00:1a.7
[    1.490409] hub 1-0:1.0: USB hub found
[    1.490419] hub 1-0:1.0: 4 ports detected
[    1.490555] ehci_hcd 0000:00:1d.7: setting latency timer to 64
[    1.490561] ehci_hcd 0000:00:1d.7: EHCI Host Controller
[    1.490571] ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 2
[    1.490616] ehci_hcd 0000:00:1d.7: debug port 1
[    1.494486] ehci_hcd 0000:00:1d.7: cache line size of 64 is not supported
[    1.494496] ehci_hcd 0000:00:1d.7: irq 23, io mem 0xfc704c00
[    1.510035] ehci_hcd 0000:00:1d.7: USB 2.0 started, EHCI 1.00
[    1.510083] usb usb2: New USB device found, idVendor=1d6b, idProduct=0002
[    1.510088] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.510092] usb usb2: Product: EHCI Host Controller
[    1.510096] usb usb2: Manufacturer: Linux 3.3.1-1-amd64-vyatta ehci_hcd
[    1.510100] usb usb2: SerialNumber: 0000:00:1d.7
[    1.510387] hub 2-0:1.0: USB hub found
[    1.510396] hub 2-0:1.0: 6 ports detected
[    1.510568] uhci_hcd: USB Universal Host Controller Interface driver
[    1.510603] uhci_hcd 0000:00:1a.0: setting latency timer to 64
[    1.510610] uhci_hcd 0000:00:1a.0: UHCI Host Controller
[    1.510620] uhci_hcd 0000:00:1a.0: new USB bus registered, assigned bus number 3
[    1.510671] uhci_hcd 0000:00:1a.0: irq 22, io base 0x00001820
[    1.510726] usb usb3: New USB device found, idVendor=1d6b, idProduct=0001
[    1.510731] usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.510735] usb usb3: Product: UHCI Host Controller
[    1.510739] usb usb3: Manufacturer: Linux 3.3.1-1-amd64-vyatta uhci_hcd
[    1.510743] usb usb3: SerialNumber: 0000:00:1a.0
[    1.511006] hub 3-0:1.0: USB hub found
[    1.511015] hub 3-0:1.0: 2 ports detected
[    1.511128] uhci_hcd 0000:00:1a.1: setting latency timer to 64
[    1.511134] uhci_hcd 0000:00:1a.1: UHCI Host Controller
[    1.511143] uhci_hcd 0000:00:1a.1: new USB bus registered, assigned bus number 4
[    1.511179] uhci_hcd 0000:00:1a.1: irq 22, io base 0x00001840
[    1.511233] usb usb4: New USB device found, idVendor=1d6b, idProduct=0001
[    1.511238] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.511242] usb usb4: Product: UHCI Host Controller
[    1.511246] usb usb4: Manufacturer: Linux 3.3.1-1-amd64-vyatta uhci_hcd
[    1.511250] usb usb4: SerialNumber: 0000:00:1a.1
[    1.511510] hub 4-0:1.0: USB hub found
[    1.511519] hub 4-0:1.0: 2 ports detected
[    1.511630] uhci_hcd 0000:00:1d.0: setting latency timer to 64
[    1.511636] uhci_hcd 0000:00:1d.0: UHCI Host Controller
[    1.511646] uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 5
[    1.511682] uhci_hcd 0000:00:1d.0: irq 22, io base 0x00001860
[    1.511735] usb usb5: New USB device found, idVendor=1d6b, idProduct=0001
[    1.511741] usb usb5: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.511745] usb usb5: Product: UHCI Host Controller
[    1.511749] usb usb5: Manufacturer: Linux 3.3.1-1-amd64-vyatta uhci_hcd
[    1.511753] usb usb5: SerialNumber: 0000:00:1d.0
[    1.512008] hub 5-0:1.0: USB hub found
[    1.512016] hub 5-0:1.0: 2 ports detected
[    1.512128] uhci_hcd 0000:00:1d.1: setting latency timer to 64
[    1.512134] uhci_hcd 0000:00:1d.1: UHCI Host Controller
[    1.512143] uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 6
[    1.512180] uhci_hcd 0000:00:1d.1: irq 22, io base 0x00001880
[    1.512235] usb usb6: New USB device found, idVendor=1d6b, idProduct=0001
[    1.512241] usb usb6: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.512245] usb usb6: Product: UHCI Host Controller
[    1.512249] usb usb6: Manufacturer: Linux 3.3.1-1-amd64-vyatta uhci_hcd
[    1.512253] usb usb6: SerialNumber: 0000:00:1d.1
[    1.512508] hub 6-0:1.0: USB hub found
[    1.512516] hub 6-0:1.0: 2 ports detected
[    1.512628] uhci_hcd 0000:00:1d.2: setting latency timer to 64
[    1.512634] uhci_hcd 0000:00:1d.2: UHCI Host Controller
[    1.512643] uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 7
[    1.512680] uhci_hcd 0000:00:1d.2: irq 22, io base 0x000018a0
[    1.512733] usb usb7: New USB device found, idVendor=1d6b, idProduct=0001
[    1.512739] usb usb7: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.512743] usb usb7: Product: UHCI Host Controller
[    1.512747] usb usb7: Manufacturer: Linux 3.3.1-1-amd64-vyatta uhci_hcd
[    1.512751] usb usb7: SerialNumber: 0000:00:1d.2
[    1.513013] hub 7-0:1.0: USB hub found
[    1.513022] hub 7-0:1.0: 2 ports detected
[    1.513377] i8042: PNP: PS/2 Controller [PNP0303:KBC,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
[    1.515536] i8042: Detected active multiplexing controller, rev 1.1
[    1.517407] serio: i8042 KBD port at 0x60,0x64 irq 1
[    1.517417] serio: i8042 AUX0 port at 0x60,0x64 irq 12
[    1.517422] serio: i8042 AUX1 port at 0x60,0x64 irq 12
[    1.517427] serio: i8042 AUX2 port at 0x60,0x64 irq 12
[    1.517432] serio: i8042 AUX3 port at 0x60,0x64 irq 12
[    1.517874] mousedev: PS/2 mouse device common for all mice
[    1.517988] rtc_cmos 00:08: RTC can wake from S4
[    1.518239] rtc_cmos 00:08: rtc core: registered rtc_cmos as rtc0
[    1.518282] rtc0: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
[    1.518344] cpuidle: using governor ladder
[    1.518347] cpuidle: using governor menu
[    1.518697] No iBFT detected.
[    1.518728] Netfilter messages via NETLINK v0.30.
[    1.518806] ip_tables: (C) 2000-2006 Netfilter Core Team
[    1.518811] TCP cubic registered
[    1.518816] NET: Registered protocol family 17
[    1.518889] Registering the dns_resolver key type
[    1.519225] registered taskstats version 1
[    1.520227] rtc_cmos 00:08: setting system clock to 2012-04-04 15:09:51 UTC (1333552191)
[    1.522640] Freeing unused kernel memory: 560k freed
[    1.522909] Write protecting the kernel read-only data: 6144k
[    1.527588] Freeing unused kernel memory: 624k freed
[    1.532348] Freeing unused kernel memory: 608k freed
[    1.547150] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input0
[    1.657281] udevd[612]: starting version 175
[    1.701917] sky2: driver version 1.30
[    1.702034] sky2 0000:04:00.0: Yukon-2 EC Ultra chip revision 3
[    1.702182] sky2 0000:04:00.0: irq 42 for MSI/MSI-X
[    1.702515] sky2 0000:04:00.0: eth0: addr 00:17:42:8a:b4:05
[    1.706543] ata_piix 0000:00:1f.1: version 2.13
[    1.706630] ata_piix 0000:00:1f.1: setting latency timer to 64
[    1.714519] scsi0 : ata_piix
[    1.717709] scsi1 : ata_piix
[    1.718371] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x1810 irq 14
[    1.718377] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0x1818 irq 15
[    1.742743] thermal LNXTHERM:00: registered as thermal_zone0
[    1.742749] ACPI: Thermal Zone [TZ00] (27 C)
[    1.742977] thermal LNXTHERM:01: registered as thermal_zone1
[    1.742982] ACPI: Thermal Zone [TZ01] (27 C)
[    1.747831] ahci 0000:00:1f.2: version 3.0
[    1.747943] ahci 0000:00:1f.2: irq 43 for MSI/MSI-X
[    1.748050] ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 3 ports 3 Gbps 0x7 impl SATA mode
[    1.748058] ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ccc 
[    1.748067] ahci 0000:00:1f.2: setting latency timer to 64
[    1.750080] Refined TSC clocksource calibration: 2393.999 MHz.
[    1.750088] Switching to clocksource tsc
[    1.752771] scsi2 : ahci
[    1.756589] scsi3 : ahci
[    1.756750] scsi4 : ahci
[    1.756953] ata3: SATA max UDMA/133 abar m2048@0xfc704000 port 0xfc704100 irq 43
[    1.756960] ata4: SATA max UDMA/133 abar m2048@0xfc704000 port 0xfc704180 irq 43
[    1.756966] ata5: SATA max UDMA/133 abar m2048@0xfc704000 port 0xfc704200 irq 43
[    1.930028] usb 1-3: new high-speed USB device number 3 using ehci_hcd
[    2.100033] ata4: SATA link down (SStatus 0 SControl 300)
[    2.100068] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    2.100097] ata5: SATA link down (SStatus 0 SControl 300)
[    2.100475] ata3.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
[    2.100482] ata3.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
[    2.100570] ata3.00: ATA-9: M4-CT128M4SSD2, 0309, max UDMA/100
[    2.100576] ata3.00: 250069680 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[    2.100978] ata3.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
[    2.100985] ata3.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
[    2.101073] ata3.00: configured for UDMA/100
[    2.101255] scsi 2:0:0:0: Direct-Access     ATA      M4-CT128M4SSD2   0309 PQ: 0 ANSI: 5
[    2.101501] sd 2:0:0:0: [sda] 250069680 512-byte logical blocks: (128 GB/119 GiB)
[    2.101527] sd 2:0:0:0: Attached scsi generic sg0 type 0
[    2.101622] sd 2:0:0:0: [sda] Write Protect is off
[    2.101628] sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    2.101677] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    2.102522]  sda: sda1 sda2 < sda5 sda6 >
[    2.103156] sd 2:0:0:0: [sda] Attached SCSI disk
[    2.106992] sdhci: Secure Digital Host Controller Interface driver
[    2.106997] sdhci: Copyright(c) Pierre Ossman
[    2.107472] sdhci-pci 0000:1c:03.2: SDHCI controller found [1217:7120] (rev 2)
[    2.107563] mmc0: no vmmc regulator found
[    2.107612] Registered led device: mmc0::
[    2.108674] mmc0: SDHCI controller on PCI [0000:1c:03.2] using PIO
[    2.109899] usb 1-3: New USB device found, idVendor=046d, idProduct=09b2
[    2.109905] usb 1-3: New USB device strings: Mfr=0, Product=2, SerialNumber=0
[    2.109910] usb 1-3: Product: OEM Camera
[    2.170066] firewire_ohci: Added fw-ohci device 0000:1c:03.4, OHCI v1.10, 8 IR + 8 IT contexts, quirks 0x10
[    2.225306] device-mapper: uevent: version 1.0.3
[    2.225727] device-mapper: ioctl: 4.22.0-ioctl (2011-10-19) initialised: dm-devel@redhat.com
[    2.259177] Btrfs loaded
[    2.266136] PM: Starting manual resume from disk
[    2.283545] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[    2.450029] usb 3-2: new full-speed USB device number 2 using uhci_hcd
[    2.629744] usb 3-2: New USB device found, idVendor=0c24, idProduct=000f
[    2.629750] usb 3-2: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    2.637278] udevd[854]: starting version 175
[    2.670194] firewire_core: created device fw0: GUID 00000e1003f448d1, S400
[    2.784513] Linux agpgart interface v0.103
[    2.787669] input: Lid Switch as /devices/LNXSYSTM:00/device:00/PNP0C0D:00/input/input1
[    2.787713] ACPI: Lid Switch [LID]
[    2.787804] input: Power Button as /devices/LNXSYSTM:00/device:00/PNP0C0C:00/input/input2
[    2.787812] ACPI: Power Button [PWRB]
[    2.799854] agpgart-intel 0000:00:00.0: Intel 965GM Chipset
[    2.800060] agpgart-intel 0000:00:00.0: detected gtt size: 524288K total, 262144K mappable
[    2.800066] Monitor-Mwait will be used to enter C-1 state
[    2.801730] agpgart-intel 0000:00:00.0: detected 8192K stolen memory
[    2.802036] agpgart-intel 0000:00:00.0: AGP aperture is 256M @ 0xe0000000
[    2.808253] ACPI: Deprecated procfs I/F for battery is loaded, please retry with CONFIG_ACPI_PROCFS_POWER cleared
[    2.808269] ACPI: Battery Slot [CMB1] (battery present)
[    2.815046] ACPI: Deprecated procfs I/F for battery is loaded, please retry with CONFIG_ACPI_PROCFS_POWER cleared
[    2.815062] ACPI: Battery Slot [CMB2] (battery present)
[    2.816520] Monitor-Mwait will be used to enter C-2 state
[    2.824425] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.07
[    2.824582] iTCO_wdt: Found a ICH8M TCO device (Version=2, TCOBASE=0x1060)
[    2.826922] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
[    2.827356] Monitor-Mwait will be used to enter C-3 state
[    2.827381] Marking TSC unstable due to TSC halts in idle
[    2.827417] ACPI: acpi_idle registered with cpuidle
[    2.827566] ACPI: Deprecated procfs I/F for AC is loaded, please retry with CONFIG_ACPI_PROCFS_POWER cleared
[    2.829919] ACPI: AC Adapter [AC] (off-line)
[    2.839175] Switching to clocksource hpet
[    2.852557] cfg80211: Calling CRDA to update world regulatory domain
[    2.880008] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input3
[    2.880008] ACPI: Power Button [PWRF]
[    2.887729] iwl4965: Intel(R) Wireless WiFi 4965 driver for Linux, in-tree:
[    2.887734] iwl4965: Copyright(c) 2003-2011 Intel Corporation
[    2.887864] iwl4965 0000:0c:00.0: Detected Intel(R) Wireless WiFi Link 4965AGN, REV=0x4
[    2.927386] iwl4965 0000:0c:00.0: device EEPROM VER=0x36, CALIB=0x5
[    2.927444] iwl4965 0000:0c:00.0: Tunable channels: 11 802.11bg, 13 802.11a channels
[    2.927677] iwl4965 0000:0c:00.0: irq 44 for MSI/MSI-X
[    2.927848] usb 5-1: new low-speed USB device number 2 using uhci_hcd
[    2.939578] tpm_inf_pnp 00:03: Found TPM with ID IFX0102
[    2.939650] tpm_inf_pnp 00:03: TPM found: config base 0x4e, data base 0xfd00, chip version 0x000b, vendor id 0x15d1 (Infineon), product id 0x000b (SLB 9635 TT 1.2)
[    2.956878] input: PC Speaker as /devices/platform/pcspkr/input/input4
[    3.028008] input: Fujitsu Application Panel buttons as /devices/pci0000:00/0000:00:1f.3/i2c-0/0-0019/input/input5
[    3.079666] iwl4965 0000:0c:00.0: loaded firmware version 228.61.2.24
[    3.080152] Registered led device: phy0-led
[    3.097917] cfg80211: World regulatory domain updated:
[    3.097923] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[    3.097929] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[    3.097934] cfg80211:   (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[    3.097939] cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[    3.097944] cfg80211:   (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[    3.097949] cfg80211:   (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[    3.099297] ieee80211 phy0: Selected rate control algorithm 'iwl-4965-rs'
[    3.118778] usb 5-1: New USB device found, idVendor=1050, idProduct=0010
[    3.118784] usb 5-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[    3.118790] usb 5-1: Product: Yubico Yubikey II
[    3.118793] usb 5-1: Manufacturer: Yubico
[    3.118796] usb 5-1: SerialNumber: 0000367025
[    3.161328] input: Yubico Yubico Yubikey II as /devices/pci0000:00/0000:00:1d.0/usb5/5-1/5-1:1.0/input/input6
[    3.161490] generic-usb 0003:1050:0010.0001: input,hidraw0: USB HID v1.11 Keyboard [Yubico Yubico Yubikey II] on usb-0000:00:1d.0-1/input0
[    3.161526] usbcore: registered new interface driver usbhid
[    3.161530] usbhid: USB HID core driver
[    3.454412] Adding 1951740k swap on /dev/sda5.  Priority:-1 extents:1 across:1951740k SS
[    3.460632] EXT4-fs (sda1): re-mounted. Opts: (null)
[    3.647225] psmouse serio4: synaptics: Touchpad model: 1, fw: 6.2, id: 0x1a0b1, caps: 0xa04713/0x202000/0x0
[    3.685537] input: SynPS/2 Synaptics TouchPad as /devices/platform/i8042/serio4/input/input7
[    3.745928] EXT4-fs (sda1): re-mounted. Opts: discard,errors=remount-ro
[    4.027211] input: Yubico Yubico Yubikey II as /devices/pci0000:00/0000:00:1d.0/usb5/5-1/5-1:1.0/input/input8
[    4.027371] generic-usb 0003:1050:0010.0002: input,hidraw0: USB HID v1.11 Keyboard [Yubico Yubico Yubikey II] on usb-0000:00:1d.0-1/input0
[    6.167412] Intel AES-NI instructions are not detected.
[    6.181055] padlock_sha: VIA PadLock Hash Engine not detected.
[   11.493873] loop: module loaded
[   11.970959] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
[   12.164577] NET: Registered protocol family 10
[   12.221536] RPC: Registered named UNIX socket transport module.
[   12.221541] RPC: Registered udp transport module.
[   12.221544] RPC: Registered tcp transport module.
[   12.221548] RPC: Registered tcp NFSv4.1 backchannel transport module.
[   12.321045] fuse init (API version 7.18)
[   12.609444] NET: Registered protocol family 15
[   13.088426] sky2 0000:04:00.0: eth0: enabling interface
[   13.089637] ADDRCONF(NETDEV_UP): eth0: link is not ready
[   13.261673] [drm] Initialized drm 1.1.0 20060810
[   13.271149] i915 0000:00:02.0: setting latency timer to 64
[   13.342156] mtrr: type mismatch for e0000000,10000000 old: write-back new: write-combining
[   13.342161] [drm] MTRR allocation failed.  Graphics performance may suffer.
[   13.343984] i915 0000:00:02.0: irq 45 for MSI/MSI-X
[   13.344001] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[   13.344005] [drm] Driver supports precise vblank timestamp query.
[   13.344071] vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[   13.349856] ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   14.078303] [drm] initialized overlay support
[   14.161389] input: ACPI Virtual Keyboard Device as /devices/virtual/input/input9
[   14.289222] fbcon: inteldrmfb (fb0) is primary device
[   14.289439] Console: switching to colour frame buffer device 160x50
[   14.289450] fb0: inteldrmfb frame buffer device
[   14.289453] drm: registered panic notifier
[   14.308314] acpi device:04: registered as cooling_device2
[   14.308479] input: Video Bus as /devices/LNXSYSTM:00/device:00/PNP0A08:00/LNXVIDEO:00/input/input10
[   14.310326] ACPI: Video Device [GFX0] (multi-head: yes  rom: no  post: no)
[   14.310466] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
[   14.603377] lp: driver loaded but no devices found
[   14.610928] ppdev: user-space parallel port driver
[   17.794824] Ebtables v2.0 registered
[   17.809543] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   18.204652] wlan0: authenticate with 00:22:90:93:49:d0 (try 1)
[   18.213234] wlan0: authenticated
[   18.268777] wlan0: associate with 00:22:90:93:49:d0 (try 1)
[   18.462396] wlan0: associate with 00:22:90:93:49:d0 (try 2)
[   18.480746] wlan0: RX AssocResp from 00:22:90:93:49:d0 (capab=0x431 status=0 aid=6)
[   18.480750] wlan0: associated
[   18.480753] wlan0: moving STA 00:22:90:93:49:d0 to state 1
[   18.480755] wlan0: moving STA 00:22:90:93:49:d0 to state 2
[   18.542165] ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[   18.542234] cfg80211: Calling CRDA for country: US
[   18.546200] cfg80211: Regulatory domain changed to country: US
[   18.546204] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[   18.546207] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2700 mBm)
[   18.546210] cfg80211:   (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 1700 mBm)
[   18.546212] cfg80211:   (5250000 KHz - 5330000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[   18.546215] cfg80211:   (5490000 KHz - 5600000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[   18.546217] cfg80211:   (5650000 KHz - 5710000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[   18.546220] cfg80211:   (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 3000 mBm)
[   19.633036] wlan0: moving STA 00:22:90:93:49:d0 to state 3
[   30.191204] wlan0: no IPv6 routers present
[  476.389422] ------------[ cut here ]------------
[  476.389440] WARNING: at kernel/time/tick-sched.c:567 tick_nohz_irq_exit+0x11e/0x194()
[  476.389447] Hardware name: LifeBook S6510
[  476.389452] Modules linked in: kvm_intel kvm ip6table_filter ip6_tables iptable_filter ebtable_nat ebtables acpi_cpufreq mperf cpufreq_ondemand cpufreq_conservative cpufreq_userspace cpufreq_stats freq_table cpufreq_powersave parport_pc ppdev lp parport binfmt_misc i915 drm_kms_helper drm i2c_algo_bit uinput deflate ctr twofish_generic twofish_x86_64_3way twofish_x86_64 twofish_common camellia serpent_sse2_x86_64 xts lrw gf128mul serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5 des_generic xcbc rmd160 sha512_generic crypto_null af_key fuse nfs nfs_acl lockd auth_rpcgss sunrpc ipv6 loop sha256_generic aes_x86_64 cryptd aes_generic cbc dm_crypt usbhid hid arc4 apanel input_polldev i2c_i801 i2c_core serio_raw pcspkr psmouse evdev tpm_infineon iwl4965 iwlegacy mac80211 cfg80211 rfkill video ac iTCO_wdt intel_agp battery button processor intel_gtt agpgart ext4 crc16 jbd2 btrfs crc32c libcrc32c zlib_deflate dm_mod firewire_ohci sdhci_pci sdhci firewire_core
  pata_acpi ata_generic thermal thermal_sys mmc_core crc_itu_t ahci libahci ata_piix sky2 [last unloaded: scsi_wait_scan]
[  476.389700] Pid: 9, comm: kworker/1:0 Not tainted 3.3.1-1-amd64-vyatta #1
[  476.389707] Call Trace:
[  476.389711]  <IRQ>  [<ffffffff81037e38>] ? warn_slowpath_common+0x78/0x8c
[  476.389732]  [<ffffffff8106f6c2>] ? tick_nohz_irq_exit+0x11e/0x194
[  476.389743]  [<ffffffff8103d40b>] ? irq_exit+0x73/0x79
[  476.389753]  [<ffffffff8100fbd7>] ? do_IRQ+0x82/0x98
[  476.389766]  [<ffffffff8135af6e>] ? common_interrupt+0x6e/0x6e
[  476.389771]  <EOI>  [<ffffffff81014f09>] ? native_read_tsc+0x2/0xf
[  476.389794]  [<ffffffff811a4515>] ? paravirt_read_tsc+0x5/0x8
[  476.389804]  [<ffffffff811a45b0>] ? delay_tsc+0x29/0x5e
[  476.389816]  [<ffffffffa0486063>] ? sclhi+0x5d/0x63 [i2c_algo_bit]
[  476.389827]  [<ffffffffa04861d7>] ? i2c_outb.isra.4+0x3c/0x8e [i2c_algo_bit]
[  476.389838]  [<ffffffffa04865e4>] ? bit_xfer+0x34b/0x3fc [i2c_algo_bit]
[  476.389849]  [<ffffffff810532d4>] ? __hrtimer_start_range_ns+0x297/0x2b8
[  476.389883]  [<ffffffffa0509e58>] ? intel_i2c_quirk_xfer+0x71/0xb9 [i915]
[  476.389901]  [<ffffffffa029596b>] ? i2c_transfer+0x90/0xf3 [i2c_core]
[  476.389929]  [<ffffffffa05063c5>] ? intel_sdvo_write_cmd+0x25f/0x2e5 [i915]
[  476.389940]  [<ffffffff811a383e>] ? vsnprintf+0x3ee/0x427
[  476.389969]  [<ffffffffa0508a5c>] ? intel_sdvo_detect+0x2d/0x1e0 [i915]
[  476.389980]  [<ffffffff8104c015>] ? queue_delayed_work_on+0xb0/0xc8
[  476.389996]  [<ffffffffa04d121c>] ? output_poll_execute+0x97/0x16c [drm_kms_helper]
[  476.390008]  [<ffffffffa04d1185>] ? drm_format_num_planes+0x8a/0x8a [drm_kms_helper]
[  476.390018]  [<ffffffff8104c20c>] ? process_one_work+0x157/0x296
[  476.390028]  [<ffffffff8104cc77>] ? worker_thread+0xc2/0x145
[  476.390037]  [<ffffffff8104cbb5>] ? manage_workers.isra.24+0x15b/0x15b
[  476.390046]  [<ffffffff8104ff44>] ? kthread+0x7d/0x85
[  476.390056]  [<ffffffff8135cbe4>] ? kernel_thread_helper+0x4/0x10
[  476.390066]  [<ffffffff8104fec7>] ? kthread_freezable_should_stop+0x37/0x37
[  476.390075]  [<ffffffff8135cbe0>] ? gs_change+0x13/0x13
[  476.390081] ---[ end trace f87639a4a779b971 ]---
[  769.514514] ------------[ cut here ]------------
[  769.514523] WARNING: at kernel/time/tick-sched.c:706 tick_nohz_account_ticks+0x77/0x80()
[  769.514526] Hardware name: LifeBook S6510
[  769.514527] Modules linked in: kvm_intel kvm ip6table_filter ip6_tables iptable_filter ebtable_nat ebtables acpi_cpufreq mperf cpufreq_ondemand cpufreq_conservative cpufreq_userspace cpufreq_stats freq_table cpufreq_powersave parport_pc ppdev lp parport binfmt_misc i915 drm_kms_helper drm i2c_algo_bit uinput deflate ctr twofish_generic twofish_x86_64_3way twofish_x86_64 twofish_common camellia serpent_sse2_x86_64 xts lrw gf128mul serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5 des_generic xcbc rmd160 sha512_generic crypto_null af_key fuse nfs nfs_acl lockd auth_rpcgss sunrpc ipv6 loop sha256_generic aes_x86_64 cryptd aes_generic cbc dm_crypt usbhid hid arc4 apanel input_polldev i2c_i801 i2c_core serio_raw pcspkr psmouse evdev tpm_infineon iwl4965 iwlegacy mac80211 cfg80211 rfkill video ac iTCO_wdt intel_agp battery button processor intel_gtt agpgart ext4 crc16 jbd2 btrfs crc32c libcrc32c zlib_deflate dm_mod firewire_ohci sdhci_pci sdhci firewire_core
  pata_acpi ata_generic thermal thermal_sys mmc_core crc_itu_t ahci libahci ata_piix sky2 [last unloaded: scsi_wait_scan]
[  769.514619] Pid: 0, comm: swapper/0 Tainted: G        W    3.3.1-1-amd64-vyatta #1
[  769.514621] Call Trace:
[  769.514623]  <IRQ>  [<ffffffff81037e38>] ? warn_slowpath_common+0x78/0x8c
[  769.514632]  [<ffffffff8106eee9>] ? tick_nohz_account_ticks+0x77/0x80
[  769.514635]  [<ffffffff8106fb59>] ? tick_nohz_flush_current_times+0x24/0x48
[  769.514639]  [<ffffffff81073a0b>] ? generic_smp_call_function_single_interrupt+0xca/0xeb
[  769.514644]  [<ffffffff81024800>] ? smp_call_function_single_interrupt+0x10/0x20
[  769.514649]  [<ffffffff8135c6ce>] ? call_function_single_interrupt+0x6e/0x80
[  769.514650]  <EOI>  [<ffffffff8106e1bc>] ? tick_notify+0x1fc/0x354
[  769.514657]  [<ffffffff81055884>] ? arch_local_irq_enable+0x4/0x8
[  769.514660]  [<ffffffff81057ec6>] ? finish_task_switch+0x44/0xc2
[  769.514665]  [<ffffffff8135a40e>] ? __schedule+0x444/0x4ae
[  769.514669]  [<ffffffff8101452d>] ? paravirt_read_tsc+0x5/0x8
[  769.514673]  [<ffffffff8100d245>] ? cpu_idle+0xa7/0xac
[  769.514676]  [<ffffffff8168eabc>] ? start_kernel+0x342/0x34d
[  769.514680]  [<ffffffff8168e140>] ? early_idt_handlers+0x140/0x140
[  769.514683]  [<ffffffff8168e3c3>] ? x86_64_start_kernel+0x104/0x111
[  769.514685] ---[ end trace f87639a4a779b972 ]---


* Re: warning in tick_nohz_irq_exit
  2012-04-04 15:33   ` warning in tick_nohz_irq_exit Stephen Hemminger
@ 2012-04-04 20:45     ` Frederic Weisbecker
  0 siblings, 0 replies; 96+ messages in thread
From: Frederic Weisbecker @ 2012-04-04 20:45 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: LKML, linaro-sched-sig, Alessio Igor Bogani, Andrew Morton,
	Avi Kivity, Chris Metcalf, Christoph Lameter, Daniel Lezcano,
	Geoff Levand, Gilad Ben Yossef, Ingo Molnar, Max Krasnyansky,
	Paul E. McKenney, Peter Zijlstra, Steven Rostedt,
	Sven-Thorsten Dietrich, Thomas Gleixner, Zen Lin

2012/4/4 Stephen Hemminger <shemminger@vyatta.com>:
> Using test kernel merged no-hz from your github repo with
> current upstream.
>
> Tried running this on laptop and seeing warning splats.
> May or may not be related to write to cpuset/NAME/tasks
> returning ENOSPC.
>
> [  476.389422] ------------[ cut here ]------------
> [  476.389440] WARNING: at kernel/time/tick-sched.c:567 tick_nohz_irq_exit+0x11e/0x194()
>
>
> Full dmesg:
>
> [    0.000000] Initializing cgroup subsys cpuset
> [    0.000000] Initializing cgroup subsys cpu
> [    0.000000] Linux version 3.3.1-1-amd64-vyatta (shemminger@s6510) (gcc version 4.6.3 (Debian 4.6.3-1) ) #1 SMP Tue Apr 3 15:52:51 PDT 2012
> [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.3.1-1-amd64-vyatta root=UUID=b7241ed2-14f8-4d17-b9c0-87c5a6876d4c ro quiet
> [    0.000000] BIOS-provided physical RAM map:
> [    0.000000]  BIOS-e820: 0000000000000000 - 000000000009e000 (usable)
> [    0.000000]  BIOS-e820: 000000000009e000 - 00000000000a0000 (reserved)
> [    0.000000]  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> [    0.000000]  BIOS-e820: 0000000000100000 - 00000000cf6b0000 (usable)
> [    0.000000]  BIOS-e820: 00000000cf6b0000 - 00000000cf6ca000 (ACPI data)
> [    0.000000]  BIOS-e820: 00000000cf6ca000 - 00000000cf6ce000 (ACPI NVS)
> [    0.000000]  BIOS-e820: 00000000cf6ce000 - 00000000d0000000 (reserved)
> [    0.000000]  BIOS-e820: 00000000f8000000 - 00000000fc000000 (reserved)
> [    0.000000]  BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
> [    0.000000]  BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved)
> [    0.000000]  BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
> [    0.000000]  BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved)
> [    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
> [    0.000000]  BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
> [    0.000000]  BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
> [    0.000000] NX (Execute Disable) protection: active
> [    0.000000] DMI 2.4 present.
> [    0.000000] DMI: FUJITSU LifeBook S6510/FJNB1D3, BIOS Version 1.31  02/05/2009
> [    0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
> [    0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
> [    0.000000] No AGP bridge found
> [    0.000000] last_pfn = 0x130000 max_arch_pfn = 0x400000000
> [    0.000000] MTRR default type: uncachable
> [    0.000000] MTRR fixed ranges enabled:
> [    0.000000]   00000-9FFFF write-back
> [    0.000000]   A0000-BFFFF uncachable
> [    0.000000]   C0000-CFFFF write-protect
> [    0.000000]   D0000-DFFFF uncachable
> [    0.000000]   E0000-FFFFF write-protect
> [    0.000000] MTRR variable ranges enabled:
> [    0.000000]   0 base 0D0000000 mask FF0000000 uncachable
> [    0.000000]   1 base 0E0000000 mask FE0000000 uncachable
> [    0.000000]   2 base 000000000 mask F00000000 write-back
> [    0.000000]   3 base 100000000 mask FE0000000 write-back
> [    0.000000]   4 base 120000000 mask FF0000000 write-back
> [    0.000000]   5 base 0CF700000 mask FFFF00000 uncachable
> [    0.000000]   6 base 0CF800000 mask FFF800000 uncachable
> [    0.000000]   7 disabled
> [    0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
> [    0.000000] e820 update range: 00000000cf700000 - 0000000100000000 (usable) ==> (reserved)
> [    0.000000] last_pfn = 0xcf6b0 max_arch_pfn = 0x400000000
> [    0.000000] initial memory mapped : 0 - 20000000
> [    0.000000] Base memory trampoline at [ffff880000099000] 99000 size 20480
> [    0.000000] init_memory_mapping: 0000000000000000-00000000cf6b0000
> [    0.000000]  0000000000 - 00cf600000 page 2M
> [    0.000000]  00cf600000 - 00cf6b0000 page 4k
> [    0.000000] kernel direct mapping tables up to cf6b0000 @ 1fffa000-20000000
> [    0.000000] init_memory_mapping: 0000000100000000-0000000130000000
> [    0.000000]  0100000000 - 0130000000 page 2M
> [    0.000000] kernel direct mapping tables up to 130000000 @ cf6aa000-cf6b0000
> [    0.000000] RAMDISK: 36a60000 - 37528000
> [    0.000000] ACPI: RSDP 00000000000f6150 00024 (v02 FUJ   )
> [    0.000000] ACPI: XSDT 00000000cf6bfa35 00074 (v01 FUJ    PC       01310000 FUJ  00000100)
> [    0.000000] ACPI: FACP 00000000cf6c80b1 000F4 (v03 FUJ    PC       01310000 FUJ  00000100)
> [    0.000000] ACPI: DSDT 00000000cf6bfaa9 08608 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.000000] ACPI: FACS 00000000cf6cdfc0 00040
> [    0.000000] ACPI: HPET 00000000cf6c81a5 00038 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.000000] ACPI: MCFG 00000000cf6c81dd 0003C (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.000000] ACPI: SSDT 00000000cf6c8219 004EF (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.000000] ACPI: SSDT 00000000cf6c8708 001CA (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.000000] ACPI: SSDT 00000000cf6c88d2 0106D (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.000000] ACPI: SSDT 00000000cf6c993f 00447 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.000000] ACPI: APIC 00000000cf6c9d86 00068 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.000000] ACPI: BOOT 00000000cf6c9dee 00028 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.000000] ACPI: SLIC 00000000cf6c9e16 00176 (v01 FUJ    PC       01310000 FUJ  00000100)
> [    0.000000] ACPI: Local APIC address 0xfee00000
> [    0.000000] No NUMA configuration found
> [    0.000000] Faking a node at 0000000000000000-0000000130000000
> [    0.000000] Initmem setup node 0 0000000000000000-0000000130000000
> [    0.000000]   NODE_DATA [000000012fffb000 - 000000012fffffff]
> [    0.000000]  [ffffea0000000000-ffffea0004bfffff] PMD -> [ffff88012b600000-ffff88012f5fffff] on node 0
> [    0.000000] Zone PFN ranges:
> [    0.000000]   DMA      0x00000010 -> 0x00001000
> [    0.000000]   DMA32    0x00001000 -> 0x00100000
> [    0.000000]   Normal   0x00100000 -> 0x00130000
> [    0.000000] Movable zone start PFN for each node
> [    0.000000] Early memory PFN ranges
> [    0.000000]     0: 0x00000010 -> 0x0000009e
> [    0.000000]     0: 0x00000100 -> 0x000cf6b0
> [    0.000000]     0: 0x00100000 -> 0x00130000
> [    0.000000] On node 0 totalpages: 1046078
> [    0.000000]   DMA zone: 64 pages used for memmap
> [    0.000000]   DMA zone: 5 pages reserved
> [    0.000000]   DMA zone: 3913 pages, LIFO batch:0
> [    0.000000]   DMA32 zone: 16320 pages used for memmap
> [    0.000000]   DMA32 zone: 829168 pages, LIFO batch:31
> [    0.000000]   Normal zone: 3072 pages used for memmap
> [    0.000000]   Normal zone: 193536 pages, LIFO batch:31
> [    0.000000] ACPI: PM-Timer IO Port: 0x1008
> [    0.000000] ACPI: Local APIC address 0xfee00000
> [    0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
> [    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
> [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
> [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
> [    0.000000] ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
> [    0.000000] IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
> [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
> [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
> [    0.000000] ACPI: IRQ0 used by override.
> [    0.000000] ACPI: IRQ2 used by override.
> [    0.000000] ACPI: IRQ9 used by override.
> [    0.000000] Using ACPI (MADT) for SMP configuration information
> [    0.000000] ACPI: HPET id: 0x8086a201 base: 0xfed00000
> [    0.000000] SMP: Allowing 2 CPUs, 0 hotplug CPUs
> [    0.000000] nr_irqs_gsi: 40
> [    0.000000] PM: Registered nosave memory: 000000000009e000 - 00000000000a0000
> [    0.000000] PM: Registered nosave memory: 00000000000a0000 - 00000000000e0000
> [    0.000000] PM: Registered nosave memory: 00000000000e0000 - 0000000000100000
> [    0.000000] PM: Registered nosave memory: 00000000cf6b0000 - 00000000cf6ca000
> [    0.000000] PM: Registered nosave memory: 00000000cf6ca000 - 00000000cf6ce000
> [    0.000000] PM: Registered nosave memory: 00000000cf6ce000 - 00000000d0000000
> [    0.000000] PM: Registered nosave memory: 00000000d0000000 - 00000000f8000000
> [    0.000000] PM: Registered nosave memory: 00000000f8000000 - 00000000fc000000
> [    0.000000] PM: Registered nosave memory: 00000000fc000000 - 00000000fec00000
> [    0.000000] PM: Registered nosave memory: 00000000fec00000 - 00000000fec10000
> [    0.000000] PM: Registered nosave memory: 00000000fec10000 - 00000000fed00000
> [    0.000000] PM: Registered nosave memory: 00000000fed00000 - 00000000fed14000
> [    0.000000] PM: Registered nosave memory: 00000000fed14000 - 00000000fed1a000
> [    0.000000] PM: Registered nosave memory: 00000000fed1a000 - 00000000fed1c000
> [    0.000000] PM: Registered nosave memory: 00000000fed1c000 - 00000000fed90000
> [    0.000000] PM: Registered nosave memory: 00000000fed90000 - 00000000fee00000
> [    0.000000] PM: Registered nosave memory: 00000000fee00000 - 00000000fee01000
> [    0.000000] PM: Registered nosave memory: 00000000fee01000 - 00000000ff000000
> [    0.000000] PM: Registered nosave memory: 00000000ff000000 - 0000000100000000
> [    0.000000] Allocating PCI resources starting at d0000000 (gap: d0000000:28000000)
> [    0.000000] Booting paravirtualized kernel on bare hardware
> [    0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:2 nr_node_ids:1
> [    0.000000] PERCPU: Embedded 27 pages/cpu @ffff88012fc00000 s80192 r8192 d22208 u1048576
> [    0.000000] pcpu-alloc: s80192 r8192 d22208 u1048576 alloc=1*2097152
> [    0.000000] pcpu-alloc: [0] 0 1
> [    0.000000] Built 1 zonelists in Node order, mobility grouping on.  Total pages: 1026617
> [    0.000000] Policy zone: Normal
> [    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.3.1-1-amd64-vyatta root=UUID=b7241ed2-14f8-4d17-b9c0-87c5a6876d4c ro quiet
> [    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
> [    0.000000] Checking aperture...
> [    0.000000] No AGP bridge found
> [    0.000000] Memory: 4033240k/4980736k available (3452k kernel code, 796424k absent, 151072k reserved, 3171k data, 560k init)
> [    0.000000] SLUB: Genslabs=15, HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
> [    0.000000] Hierarchical RCU implementation.
> [    0.000000]  CONFIG_RCU_FANOUT set to non-default value of 32
> [    0.000000]  RCU dyntick-idle grace-period acceleration is enabled.
> [    0.000000] NR_IRQS:4352 nr_irqs:512 16
> [    0.000000] Extended CMOS year: 2000
> [    0.000000] Console: colour VGA+ 80x25
> [    0.000000] console [tty0] enabled
> [    0.000000] hpet clockevent registered
> [    0.000000] Fast TSC calibration using PIT
> [    0.000000] Detected 2394.105 MHz processor.
> [    0.010005] Calibrating delay loop (skipped), value calculated using timer frequency.. 4788.21 BogoMIPS (lpj=23941050)
> [    0.010014] pid_max: default: 32768 minimum: 301
> [    0.010052] Security Framework initialized
> [    0.010745] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
> [    0.022235] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
> [    0.023585] Mount-cache hash table entries: 256
> [    0.023888] CPU: Physical Processor ID: 0
> [    0.023893] CPU: Processor Core ID: 0
> [    0.023897] mce: CPU supports 6 MCE banks
> [    0.023911] CPU0: Thermal monitoring enabled (TM2)
> [    0.023918] using mwait in idle threads.
> [    0.024936] ACPI: Core revision 20120111
> [    0.036120] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [    0.136162] CPU0: Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz stepping 0b
> [    0.140000] Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
> [    0.140000] PEBS disabled due to CPU errata.
> [    0.140000] ... version:                2
> [    0.140000] ... bit width:              40
> [    0.140000] ... generic registers:      2
> [    0.140000] ... value mask:             000000ffffffffff
> [    0.140000] ... max period:             000000007fffffff
> [    0.140000] ... fixed-purpose events:   3
> [    0.140000] ... event mask:             0000000700000003
> [    0.140000] NMI watchdog enabled, takes one hw-pmu counter.
> [    0.140000] Booting Node   0, Processors  #1 Ok.
> [    0.140000] smpboot cpu 1: start_ip = 99000
> [    0.143775] NMI watchdog enabled, takes one hw-pmu counter.
> [    0.143826] Brought up 2 CPUs
> [    0.143831] Total of 2 processors activated (9576.42 BogoMIPS).
> [    0.147081] devtmpfs: initialized
> [    0.153207] PM: Registering ACPI NVS region at cf6ca000 (16384 bytes)
> [    0.153207] print_constraints: dummy:
> [    0.153207] NET: Registered protocol family 16
> [    0.153207] ACPI: bus type pci registered
> [    0.153207] PCI: MMCONFIG for domain 0000 [bus 00-3f] at [mem 0xf8000000-0xfbffffff] (base 0xf8000000)
> [    0.153207] PCI: MMCONFIG at [mem 0xf8000000-0xfbffffff] reserved in E820
> [    0.159024] PCI: Using configuration type 1 for base access
> [    0.160062] bio: create slab <bio-0> at 0
> [    0.160069] ACPI: Added _OSI(Module Device)
> [    0.160069] ACPI: Added _OSI(Processor Device)
> [    0.160069] ACPI: Added _OSI(3.0 _SCP Extensions)
> [    0.160069] ACPI: Added _OSI(Processor Aggregator Device)
> [    0.162586] ACPI: EC: Look up EC in DSDT
> [    0.172631] ACPI:      00000000cf6cda9f 003B7 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.173514] ACPI: Dynamic OEM Table Load:
> [    0.173520] ACPI:                (null) 003B7 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.175848] ACPI: SSDT 00000000cf6cac19 002BC (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.176501] ACPI: Dynamic OEM Table Load:
> [    0.176507] ACPI: SSDT           (null) 002BC (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.176755] ACPI: SSDT 00000000cf6cb119 00627 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.177375] ACPI: Dynamic OEM Table Load:
> [    0.177380] ACPI: SSDT           (null) 00627 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.180423] ACPI: SSDT 00000000cf6cb061 000B8 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.181067] ACPI: Dynamic OEM Table Load:
> [    0.181073] ACPI: SSDT           (null) 000B8 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.181251] ACPI: SSDT 00000000cf6cb740 00047 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.181873] ACPI: Dynamic OEM Table Load:
> [    0.181879] ACPI: SSDT           (null) 00047 (v01 FUJ    FJNB1D3  01310000 FUJ  00000100)
> [    0.182301] ACPI: Interpreter enabled
> [    0.182308] ACPI: (supports S0 S3 S4 S5)
> [    0.182347] ACPI: Using IOAPIC for interrupt routing
> [    0.190533] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
> [    0.190533] ACPI: ACPI Dock Station Driver: 1 docks/bays found
> [    0.190533] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
> [    0.190924] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
> [    0.191834] pci_root PNP0A08:00: host bridge window [io  0x0000-0x0cf7]
> [    0.191839] pci_root PNP0A08:00: host bridge window [io  0x0d00-0xffff]
> [    0.191845] pci_root PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff]
> [    0.191850] pci_root PNP0A08:00: host bridge window [mem 0x000d0000-0x000d3fff]
> [    0.191855] pci_root PNP0A08:00: host bridge window [mem 0x000d4000-0x000d7fff]
> [    0.191860] pci_root PNP0A08:00: host bridge window [mem 0x000d8000-0x000dbfff]
> [    0.191865] pci_root PNP0A08:00: host bridge window [mem 0x000dc000-0x000dffff]
> [    0.191871] pci_root PNP0A08:00: host bridge window [mem 0xd0000000-0xfebfffff]
> [    0.191876] pci_root PNP0A08:00: host bridge window [mem 0xfed40000-0xfed44fff]
> [    0.191938] PCI host bridge to bus 0000:00
> [    0.191938] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> [    0.191938] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> [    0.191938] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> [    0.191938] pci_bus 0000:00: root bus resource [mem 0x000d0000-0x000d3fff]
> [    0.191938] pci_bus 0000:00: root bus resource [mem 0x000d4000-0x000d7fff]
> [    0.191938] pci_bus 0000:00: root bus resource [mem 0x000d8000-0x000dbfff]
> [    0.191938] pci_bus 0000:00: root bus resource [mem 0x000dc000-0x000dffff]
> [    0.191938] pci_bus 0000:00: root bus resource [mem 0xd0000000-0xfebfffff]
> [    0.191938] pci_bus 0000:00: root bus resource [mem 0xfed40000-0xfed44fff]
> [    0.191938] pci 0000:00:00.0: [8086:2a00] type 0 class 0x000600
> [    0.191938] pci 0000:00:02.0: [8086:2a02] type 0 class 0x000300
> [    0.191938] pci 0000:00:02.0: reg 10: [mem 0xfc000000-0xfc0fffff 64bit]
> [    0.191938] pci 0000:00:02.0: reg 18: [mem 0xe0000000-0xefffffff 64bit pref]
> [    0.191938] pci 0000:00:02.0: reg 20: [io  0x1800-0x1807]
> [    0.191938] pci 0000:00:02.1: [8086:2a03] type 0 class 0x000380
> [    0.191938] pci 0000:00:02.1: reg 10: [mem 0xfc100000-0xfc1fffff 64bit]
> [    0.191938] pci 0000:00:1a.0: [8086:2834] type 0 class 0x000c03
> [    0.191938] pci 0000:00:1a.0: reg 20: [io  0x1820-0x183f]
> [    0.191938] pci 0000:00:1a.1: [8086:2835] type 0 class 0x000c03
> [    0.191938] pci 0000:00:1a.1: reg 20: [io  0x1840-0x185f]
> [    0.191938] pci 0000:00:1a.7: [8086:283a] type 0 class 0x000c03
> [    0.191938] pci 0000:00:1a.7: reg 10: [mem 0xfc704800-0xfc704bff]
> [    0.191938] pci 0000:00:1a.7: PME# supported from D0 D3hot D3cold
> [    0.191938] pci 0000:00:1b.0: [8086:284b] type 0 class 0x000403
> [    0.191938] pci 0000:00:1b.0: reg 10: [mem 0xfc700000-0xfc703fff 64bit]
> [    0.191938] pci 0000:00:1b.0: PME# supported from D0 D3hot D3cold
> [    0.191938] pci 0000:00:1c.0: [8086:283f] type 1 class 0x000604
> [    0.191938] pci 0000:00:1c.0: PME# supported from D0 D3hot D3cold
> [    0.191938] pci 0000:00:1c.2: [8086:2843] type 1 class 0x000604
> [    0.191938] pci 0000:00:1c.2: PME# supported from D0 D3hot D3cold
> [    0.191938] pci 0000:00:1d.0: [8086:2830] type 0 class 0x000c03
> [    0.191938] pci 0000:00:1d.0: reg 20: [io  0x1860-0x187f]
> [    0.191938] pci 0000:00:1d.1: [8086:2831] type 0 class 0x000c03
> [    0.191938] pci 0000:00:1d.1: reg 20: [io  0x1880-0x189f]
> [    0.191938] pci 0000:00:1d.2: [8086:2832] type 0 class 0x000c03
> [    0.191938] pci 0000:00:1d.2: reg 20: [io  0x18a0-0x18bf]
> [    0.191938] pci 0000:00:1d.7: [8086:2836] type 0 class 0x000c03
> [    0.191938] pci 0000:00:1d.7: reg 10: [mem 0xfc704c00-0xfc704fff]
> [    0.191952] pci 0000:00:1d.7: PME# supported from D0 D3hot D3cold
> [    0.191986] pci 0000:00:1e.0: [8086:2448] type 1 class 0x000604
> [    0.192099] pci 0000:00:1f.0: [8086:2815] type 0 class 0x000601
> [    0.192215] pci 0000:00:1f.0: quirk: [io  0x1000-0x107f] claimed by ICH6 ACPI/GPIO/TCO
> [    0.192225] pci 0000:00:1f.0: quirk: [io  0x1180-0x11bf] claimed by ICH6 GPIO
> [    0.192232] pci 0000:00:1f.0: ICH7 LPC Generic IO decode 1 PIO at fd00 (mask 007f)
> [    0.192301] pci 0000:00:1f.1: [8086:2850] type 0 class 0x000101
> [    0.192323] pci 0000:00:1f.1: reg 10: [io  0x0000-0x0007]
> [    0.192338] pci 0000:00:1f.1: reg 14: [io  0x0000-0x0003]
> [    0.192354] pci 0000:00:1f.1: reg 18: [io  0x0000-0x0007]
> [    0.192369] pci 0000:00:1f.1: reg 1c: [io  0x0000-0x0003]
> [    0.192384] pci 0000:00:1f.1: reg 20: [io  0x1810-0x181f]
> [    0.192450] pci 0000:00:1f.2: [8086:2829] type 0 class 0x000106
> [    0.192483] pci 0000:00:1f.2: reg 10: [io  0x1c00-0x1c07]
> [    0.192498] pci 0000:00:1f.2: reg 14: [io  0x18d4-0x18d7]
> [    0.192514] pci 0000:00:1f.2: reg 18: [io  0x18d8-0x18df]
> [    0.192530] pci 0000:00:1f.2: reg 1c: [io  0x18d0-0x18d3]
> [    0.192545] pci 0000:00:1f.2: reg 20: [io  0x18e0-0x18ff]
> [    0.192563] pci 0000:00:1f.2: reg 24: [mem 0xfc704000-0xfc7047ff]
> [    0.192640] pci 0000:00:1f.2: PME# supported from D3hot
> [    0.192672] pci 0000:00:1f.3: [8086:283e] type 0 class 0x000c05
> [    0.192694] pci 0000:00:1f.3: reg 10: [mem 0x00000000-0x000000ff]
> [    0.192742] pci 0000:00:1f.3: reg 20: [io  0x1c20-0x1c3f]
> [    0.192890] pci 0000:04:00.0: [11ab:4363] type 0 class 0x000200
> [    0.192927] pci 0000:04:00.0: reg 10: [mem 0xfc200000-0xfc203fff 64bit]
> [    0.192949] pci 0000:04:00.0: reg 18: [io  0x2000-0x20ff]
> [    0.193021] pci 0000:04:00.0: reg 30: [mem 0x00000000-0x0001ffff pref]
> [    0.193136] pci 0000:04:00.0: supports D1 D2
> [    0.193140] pci 0000:04:00.0: PME# supported from D0 D1 D2 D3hot D3cold
> [    0.220028] pci 0000:00:1c.0: PCI bridge to [bus 04-07]
> [    0.220036] pci 0000:00:1c.0:   bridge window [io  0x2000-0x2fff]
> [    0.220044] pci 0000:00:1c.0:   bridge window [mem 0xfc200000-0xfc2fffff]
> [    0.220150] pci 0000:0c:00.0: [8086:4229] type 0 class 0x000280
> [    0.220197] pci 0000:0c:00.0: reg 10: [mem 0xfc300000-0xfc301fff 64bit]
> [    0.220413] pci 0000:0c:00.0: PME# supported from D0 D3hot D3cold
> [    0.240028] pci 0000:00:1c.2: PCI bridge to [bus 0c-0f]
> [    0.240039] pci 0000:00:1c.2:   bridge window [mem 0xfc300000-0xfc3fffff]
> [    0.240107] pci 0000:1c:03.0: [1217:7136] type 2 class 0x000607
> [    0.240137] pci 0000:1c:03.0: reg 10: [mem 0x00000000-0x00000fff]
> [    0.240185] pci 0000:1c:03.0: supports D1 D2
> [    0.240189] pci 0000:1c:03.0: PME# supported from D0 D1 D2 D3hot D3cold
> [    0.240227] pci 0000:1c:03.2: [1217:7120] type 0 class 0x000805
> [    0.240257] pci 0000:1c:03.2: reg 10: [mem 0xfc402800-0xfc4028ff]
> [    0.240380] pci 0000:1c:03.2: supports D1 D2
> [    0.240384] pci 0000:1c:03.2: PME# supported from D0 D1 D2 D3hot D3cold
> [    0.240416] pci 0000:1c:03.3: [1217:7130] type 0 class 0x000180
> [    0.240446] pci 0000:1c:03.3: reg 10: [mem 0xfc400000-0xfc400fff]
> [    0.240569] pci 0000:1c:03.3: supports D1 D2
> [    0.240573] pci 0000:1c:03.3: PME# supported from D0 D1 D2 D3hot D3cold
> [    0.240607] pci 0000:1c:03.4: [1217:00f7] type 0 class 0x000c00
> [    0.240633] pci 0000:1c:03.4: reg 10: [mem 0xfc401000-0xfc401fff]
> [    0.240650] pci 0000:1c:03.4: reg 14: [mem 0xfc402000-0xfc4027ff]
> [    0.240754] pci 0000:1c:03.4: supports D1 D2
> [    0.240758] pci 0000:1c:03.4: PME# supported from D0 D1 D2 D3hot
> [    0.240831] pci 0000:00:1e.0: PCI bridge to [bus 1c-1d] (subtractive decode)
> [    0.240841] pci 0000:00:1e.0:   bridge window [mem 0xfc400000-0xfc4fffff]
> [    0.240852] pci 0000:00:1e.0:   bridge window [io  0x0000-0x0cf7] (subtractive decode)
> [    0.240857] pci 0000:00:1e.0:   bridge window [io  0x0d00-0xffff] (subtractive decode)
> [    0.240862] pci 0000:00:1e.0:   bridge window [mem 0x000a0000-0x000bffff] (subtractive decode)
> [    0.240867] pci 0000:00:1e.0:   bridge window [mem 0x000d0000-0x000d3fff] (subtractive decode)
> [    0.240873] pci 0000:00:1e.0:   bridge window [mem 0x000d4000-0x000d7fff] (subtractive decode)
> [    0.240878] pci 0000:00:1e.0:   bridge window [mem 0x000d8000-0x000dbfff] (subtractive decode)
> [    0.240883] pci 0000:00:1e.0:   bridge window [mem 0x000dc000-0x000dffff] (subtractive decode)
> [    0.240889] pci 0000:00:1e.0:   bridge window [mem 0xd0000000-0xfebfffff] (subtractive decode)
> [    0.240894] pci 0000:00:1e.0:   bridge window [mem 0xfed40000-0xfed44fff] (subtractive decode)
> [    0.240955] pci_bus 0000:1d: [bus 1d-20] partially hidden behind transparent bridge 0000:1c [bus 1c-1d]
> [    0.240986] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
> [    0.241271] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.RP01._PRT]
> [    0.241362] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.RP03._PRT]
> [    0.241497] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCIB._PRT]
> [    0.241562]  pci0000:00: Requesting ACPI _OSC control (0x1d)
> [    0.241568]  pci0000:00: ACPI _OSC request failed (AE_NOT_FOUND), returned control mask: 0x1d
> [    0.241572] ACPI _OSC control for PCIe not granted, disabling ASPM
> [    0.250489] ACPI: PCI Interrupt Link [LNKA] (IRQs 1 3 4 5 6 7 10 12 14 15) *11
> [    0.250489] ACPI: PCI Interrupt Link [LNKB] (IRQs 1 3 4 5 6 7 11 12 14 15) *0, disabled.
> [    0.250489] ACPI: PCI Interrupt Link [LNKC] (IRQs 1 3 4 5 6 7 10 12 14 15) *11
> [    0.250489] ACPI: PCI Interrupt Link [LNKD] (IRQs 1 3 4 5 6 7 11 12 14 15) *0, disabled.
> [    0.250489] ACPI: PCI Interrupt Link [LNKE] (IRQs 1 3 4 5 6 7 10 12 14 15) *11
> [    0.250536] ACPI: PCI Interrupt Link [LNKF] (IRQs 1 3 4 5 6 7 *11 12 14 15)
> [    0.250618] ACPI: PCI Interrupt Link [LNKG] (IRQs 1 3 4 5 6 7 10 12 14 15) *11
> [    0.250701] ACPI: PCI Interrupt Link [LNKH] (IRQs 1 3 4 5 6 7 *11 12 14 15)
> [    0.250750] vgaarb: device added: PCI:0000:00:02.0,decodes=io+mem,owns=io+mem,locks=none
> [    0.250750] vgaarb: loaded
> [    0.250750] vgaarb: bridge control possible 0000:00:02.0
> [    0.250750] SCSI subsystem initialized
> [    0.250750] libata version 3.00 loaded.
> [    0.250750] usbcore: registered new interface driver usbfs
> [    0.250750] usbcore: registered new interface driver hub
> [    0.250750] usbcore: registered new device driver usb
> [    0.250750] PCI: Using ACPI for IRQ routing
> [    0.260440] PCI: pci_cache_line_size set to 64 bytes
> [    0.260592] reserve RAM buffer: 000000000009e000 - 000000000009ffff
> [    0.260597] reserve RAM buffer: 00000000cf6b0000 - 00000000cfffffff
> [    0.260630] HPET: 3 timers in total, 0 timers will be used for per-cpu timer
> [    0.260630] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
> [    0.260630] hpet0: 3 comparators, 64-bit 14.318180 MHz counter
> [    0.270038] Switching to clocksource hpet
> [    0.273784] pnp: PnP ACPI init
> [    0.273805] ACPI: bus type pnp registered
> [    0.274317] pnp 00:00: [bus 00-ff]
> [    0.274323] pnp 00:00: [io  0x0000-0x0cf7 window]
> [    0.274327] pnp 00:00: [io  0x0cf8-0x0cff]
> [    0.274331] pnp 00:00: [io  0x0d00-0xffff window]
> [    0.274336] pnp 00:00: [mem 0x000a0000-0x000bffff window]
> [    0.274340] pnp 00:00: [mem 0x000c0000-0x000c3fff window]
> [    0.274345] pnp 00:00: [mem 0x000c4000-0x000c7fff window]
> [    0.274349] pnp 00:00: [mem 0x000c8000-0x000cbfff window]
> [    0.274353] pnp 00:00: [mem 0x000cc000-0x000cffff window]
> [    0.274358] pnp 00:00: [mem 0x000d0000-0x000d3fff window]
> [    0.274362] pnp 00:00: [mem 0x000d4000-0x000d7fff window]
> [    0.274366] pnp 00:00: [mem 0x000d8000-0x000dbfff window]
> [    0.274371] pnp 00:00: [mem 0x000dc000-0x000dffff window]
> [    0.274375] pnp 00:00: [mem 0x000e0000-0x000e3fff window]
> [    0.274379] pnp 00:00: [mem 0x000e4000-0x000e7fff window]
> [    0.274384] pnp 00:00: [mem 0x000e8000-0x000ebfff window]
> [    0.274388] pnp 00:00: [mem 0x000ec000-0x000effff window]
> [    0.274393] pnp 00:00: [mem 0x000f0000-0x000fffff window]
> [    0.274397] pnp 00:00: [mem 0xd0000000-0xfebfffff window]
> [    0.274402] pnp 00:00: [mem 0xfed40000-0xfed44fff window]
> [    0.274542] pnp 00:00: Plug and Play ACPI device, IDs PNP0a08 PNP0a03 (active)
> [    0.274657] pnp 00:01: [io  0x0010-0x001f]
> [    0.274662] pnp 00:01: [io  0x0024-0x0025]
> [    0.274666] pnp 00:01: [io  0x0028-0x0029]
> [    0.274670] pnp 00:01: [io  0x002c-0x002d]
> [    0.274673] pnp 00:01: [io  0x002e-0x002f]
> [    0.274677] pnp 00:01: [io  0x0030-0x0031]
> [    0.274681] pnp 00:01: [io  0x0034-0x0035]
> [    0.274684] pnp 00:01: [io  0x0038-0x0039]
> [    0.274688] pnp 00:01: [io  0x003c-0x003d]
> [    0.274692] pnp 00:01: [io  0x0000-0xffffffffffffffff disabled]
> [    0.274697] pnp 00:01: [io  0x0050-0x0053]
> [    0.274700] pnp 00:01: [io  0x0061]
> [    0.274704] pnp 00:01: [io  0x0063]
> [    0.274707] pnp 00:01: [io  0x0065]
> [    0.274715] pnp 00:01: [io  0x0067]
> [    0.274719] pnp 00:01: [io  0x0072-0x0077]
> [    0.274722] pnp 00:01: [io  0x0080]
> [    0.274726] pnp 00:01: [io  0x0090-0x009f]
> [    0.274729] pnp 00:01: [io  0x0092]
> [    0.274733] pnp 00:01: [io  0x00a4-0x00a5]
> [    0.274737] pnp 00:01: [io  0x00a8-0x00a9]
> [    0.274740] pnp 00:01: [io  0x00ac-0x00ad]
> [    0.274744] pnp 00:01: [io  0x00b0-0x00b1]
> [    0.274748] pnp 00:01: [io  0x00b2-0x00b3]
> [    0.274751] pnp 00:01: [io  0x00b4-0x00b5]
> [    0.274755] pnp 00:01: [io  0x00b8-0x00b9]
> [    0.274759] pnp 00:01: [io  0x00bc-0x00bd]
> [    0.274762] pnp 00:01: [io  0x04d0-0x04d1]
> [    0.274766] pnp 00:01: [io  0x0680-0x069f]
> [    0.274770] pnp 00:01: [io  0x0800-0x080f]
> [    0.274773] pnp 00:01: [io  0x1000-0x107f]
> [    0.274777] pnp 00:01: [io  0x1080-0x10ff]
> [    0.274781] pnp 00:01: [io  0x1100-0x111f]
> [    0.274784] pnp 00:01: [io  0x1180-0x11bf]
> [    0.274788] pnp 00:01: [io  0x1640-0x164f]
> [    0.274792] pnp 00:01: [io  0xf800-0xf87f]
> [    0.274796] pnp 00:01: [io  0xf880-0xf8ff]
> [    0.274800] pnp 00:01: [io  0xfc00-0xfc7f]
> [    0.274803] pnp 00:01: [io  0xfc80-0xfcff]
> [    0.274807] pnp 00:01: [io  0xfd0c-0xfd7f]
> [    0.274811] pnp 00:01: [io  0xfe00-0xfe03]
> [    0.275050] system 00:01: [io  0x04d0-0x04d1] has been reserved
> [    0.275056] system 00:01: [io  0x0680-0x069f] has been reserved
> [    0.275061] system 00:01: [io  0x0800-0x080f] has been reserved
> [    0.275067] system 00:01: [io  0x1000-0x107f] has been reserved
> [    0.275072] system 00:01: [io  0x1080-0x10ff] has been reserved
> [    0.275077] system 00:01: [io  0x1100-0x111f] has been reserved
> [    0.275082] system 00:01: [io  0x1180-0x11bf] has been reserved
> [    0.275088] system 00:01: [io  0x1640-0x164f] has been reserved
> [    0.275093] system 00:01: [io  0xf800-0xf87f] has been reserved
> [    0.275099] system 00:01: [io  0xf880-0xf8ff] has been reserved
> [    0.275104] system 00:01: [io  0xfc00-0xfc7f] has been reserved
> [    0.275109] system 00:01: [io  0xfc80-0xfcff] has been reserved
> [    0.275115] system 00:01: [io  0xfd0c-0xfd7f] has been reserved
> [    0.275120] system 00:01: [io  0xfe00-0xfe03] has been reserved
> [    0.275127] system 00:01: Plug and Play ACPI device, IDs PNP0c02 (active)
> [    0.275285] pnp 00:02: [mem 0xfed1c000-0xfed1ffff]
> [    0.275290] pnp 00:02: [mem 0xfed14000-0xfed17fff]
> [    0.275294] pnp 00:02: [mem 0xfed18000-0xfed18fff]
> [    0.275298] pnp 00:02: [mem 0xfed19000-0xfed19fff]
> [    0.275302] pnp 00:02: [mem 0xf8000000-0xfbffffff]
> [    0.275306] pnp 00:02: [mem 0xfed20000-0xfed3ffff]
> [    0.275311] pnp 00:02: [mem 0xfed40000-0xfed3ffff disabled]
> [    0.275315] pnp 00:02: [mem 0xfed45000-0xfed8ffff]
> [    0.275319] pnp 00:02: [mem 0xfef00000-0xfeffffff]
> [    0.275449] system 00:02: [mem 0xfed1c000-0xfed1ffff] has been reserved
> [    0.275456] system 00:02: [mem 0xfed14000-0xfed17fff] has been reserved
> [    0.275462] system 00:02: [mem 0xfed18000-0xfed18fff] has been reserved
> [    0.275467] system 00:02: [mem 0xfed19000-0xfed19fff] has been reserved
> [    0.275473] system 00:02: [mem 0xf8000000-0xfbffffff] has been reserved
> [    0.275478] system 00:02: [mem 0xfed20000-0xfed3ffff] has been reserved
> [    0.275484] system 00:02: [mem 0xfed45000-0xfed8ffff] has been reserved
> [    0.275489] system 00:02: [mem 0xfef00000-0xfeffffff] has been reserved
> [    0.275496] system 00:02: Plug and Play ACPI device, IDs PNP0c02 (active)
> [    0.275813] pnp 00:03: [io  0x004e-0x004f]
> [    0.275818] pnp 00:03: [io  0xfd00-0xfd0b]
> [    0.275822] pnp 00:03: [mem 0xfed40000-0xfed44fff]
> [    0.275934] pnp 00:03: Plug and Play ACPI device, IDs IFX0102 PNP0c31 (active)
> [    0.275990] pnp 00:04: [io  0x0000-0x000f]
> [    0.275995] pnp 00:04: [io  0x0081-0x008f]
> [    0.276003] pnp 00:04: [io  0x00c0-0x00df]
> [    0.276007] pnp 00:04: [dma 4]
> [    0.276114] pnp 00:04: Plug and Play ACPI device, IDs PNP0200 (active)
> [    0.276145] pnp 00:05: [io  0x0060]
> [    0.276149] pnp 00:05: [io  0x0064]
> [    0.276162] pnp 00:05: [irq 1]
> [    0.276273] pnp 00:05: Plug and Play ACPI device, IDs PNP0303 (active)
> [    0.276303] pnp 00:06: [io  0x00f0-0x00fe]
> [    0.276313] pnp 00:06: [irq 13]
> [    0.276420] pnp 00:06: Plug and Play ACPI device, IDs PNP0c04 (active)
> [    0.276466] pnp 00:07: [irq 12]
> [    0.276578] pnp 00:07: Plug and Play ACPI device, IDs PNP0f13 (active)
> [    0.276610] pnp 00:08: [io  0x0070-0x0071]
> [    0.276619] pnp 00:08: [irq 8]
> [    0.276731] pnp 00:08: Plug and Play ACPI device, IDs PNP0b00 (active)
> [    0.276761] pnp 00:09: [io  0x0061]
> [    0.276872] pnp 00:09: Plug and Play ACPI device, IDs PNP0800 (active)
> [    0.277523] pnp: PnP ACPI: found 10 devices
> [    0.277527] ACPI: ACPI bus type pnp unregistered
> [    0.284690] PCI: max bus depth: 2 pci_try_num: 3
> [    0.284752] pci 0000:00:1c.0: BAR 15: assigned [mem 0xd0000000-0xd00fffff pref]
> [    0.284761] pci 0000:00:1f.3: BAR 0: assigned [mem 0xd0100000-0xd01000ff]
> [    0.284774] pci 0000:00:1e.0: BAR 15: assigned [mem 0xd4000000-0xd7ffffff pref]
> [    0.284782] pci 0000:00:1e.0: BAR 13: assigned [io  0x3000-0x3fff]
> [    0.284790] pci 0000:00:1c.2: BAR 15: assigned [mem 0xd0200000-0xd03fffff 64bit pref]
> [    0.284798] pci 0000:00:1c.2: BAR 13: assigned [io  0x4000-0x4fff]
> [    0.284807] pci 0000:00:1c.0: BAR 15: assigned [mem 0xd0400000-0xd06fffff pref]
> [    0.284814] pci 0000:04:00.0: BAR 6: assigned [mem 0xd0400000-0xd041ffff pref]
> [    0.284819] pci 0000:00:1c.0: PCI bridge to [bus 04-07]
> [    0.284825] pci 0000:00:1c.0:   bridge window [io  0x2000-0x2fff]
> [    0.284835] pci 0000:00:1c.0:   bridge window [mem 0xfc200000-0xfc2fffff]
> [    0.284842] pci 0000:00:1c.0:   bridge window [mem 0xd0400000-0xd06fffff pref]
> [    0.284853] pci 0000:00:1c.2: PCI bridge to [bus 0c-0f]
> [    0.284859] pci 0000:00:1c.2:   bridge window [io  0x4000-0x4fff]
> [    0.284869] pci 0000:00:1c.2:   bridge window [mem 0xfc300000-0xfc3fffff]
> [    0.284877] pci 0000:00:1c.2:   bridge window [mem 0xd0200000-0xd03fffff 64bit pref]
> [    0.284897] pci 0000:1c:03.0: BAR 0: assigned [mem 0xd0000000-0xd0000fff]
> [    0.284909] pci 0000:1c:03.0: BAR 16: assigned [mem 0xd8000000-0xdbffffff]
> [    0.284915] pci 0000:1c:03.0: BAR 15: assigned [mem 0xd4000000-0xd7ffffff pref]
> [    0.284920] pci 0000:1c:03.0: BAR 14: assigned [io  0x3000-0x30ff]
> [    0.284925] pci 0000:1c:03.0: BAR 13: assigned [io  0x3400-0x34ff]
> [    0.284930] pci 0000:1c:03.0: CardBus bridge to [bus 1d-20]
> [    0.284934] pci 0000:1c:03.0:   bridge window [io  0x3400-0x34ff]
> [    0.284942] pci 0000:1c:03.0:   bridge window [io  0x3000-0x30ff]
> [    0.284950] pci 0000:1c:03.0:   bridge window [mem 0xd4000000-0xd7ffffff pref]
> [    0.284959] pci 0000:1c:03.0:   bridge window [mem 0xd8000000-0xdbffffff]
> [    0.284967] pci 0000:00:1e.0: PCI bridge to [bus 1c-1d]
> [    0.284972] pci 0000:00:1e.0:   bridge window [io  0x3000-0x3fff]
> [    0.284981] pci 0000:00:1e.0:   bridge window [mem 0xfc400000-0xfc4fffff]
> [    0.284989] pci 0000:00:1e.0:   bridge window [mem 0xd4000000-0xd7ffffff pref]
> [    0.285039] pci 0000:00:1e.0: setting latency timer to 64
> [    0.285057] pci_bus 0000:00: resource 4 [io  0x0000-0x0cf7]
> [    0.285062] pci_bus 0000:00: resource 5 [io  0x0d00-0xffff]
> [    0.285066] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff]
> [    0.285071] pci_bus 0000:00: resource 7 [mem 0x000d0000-0x000d3fff]
> [    0.285076] pci_bus 0000:00: resource 8 [mem 0x000d4000-0x000d7fff]
> [    0.285080] pci_bus 0000:00: resource 9 [mem 0x000d8000-0x000dbfff]
> [    0.285085] pci_bus 0000:00: resource 10 [mem 0x000dc000-0x000dffff]
> [    0.285090] pci_bus 0000:00: resource 11 [mem 0xd0000000-0xfebfffff]
> [    0.285094] pci_bus 0000:00: resource 12 [mem 0xfed40000-0xfed44fff]
> [    0.285099] pci_bus 0000:04: resource 0 [io  0x2000-0x2fff]
> [    0.285104] pci_bus 0000:04: resource 1 [mem 0xfc200000-0xfc2fffff]
> [    0.285109] pci_bus 0000:04: resource 2 [mem 0xd0400000-0xd06fffff pref]
> [    0.285114] pci_bus 0000:0c: resource 0 [io  0x4000-0x4fff]
> [    0.285118] pci_bus 0000:0c: resource 1 [mem 0xfc300000-0xfc3fffff]
> [    0.285123] pci_bus 0000:0c: resource 2 [mem 0xd0200000-0xd03fffff 64bit pref]
> [    0.285128] pci_bus 0000:1c: resource 0 [io  0x3000-0x3fff]
> [    0.285132] pci_bus 0000:1c: resource 1 [mem 0xfc400000-0xfc4fffff]
> [    0.285137] pci_bus 0000:1c: resource 2 [mem 0xd4000000-0xd7ffffff pref]
> [    0.285142] pci_bus 0000:1c: resource 4 [io  0x0000-0x0cf7]
> [    0.285146] pci_bus 0000:1c: resource 5 [io  0x0d00-0xffff]
> [    0.285151] pci_bus 0000:1c: resource 6 [mem 0x000a0000-0x000bffff]
> [    0.285155] pci_bus 0000:1c: resource 7 [mem 0x000d0000-0x000d3fff]
> [    0.285160] pci_bus 0000:1c: resource 8 [mem 0x000d4000-0x000d7fff]
> [    0.285164] pci_bus 0000:1c: resource 9 [mem 0x000d8000-0x000dbfff]
> [    0.285169] pci_bus 0000:1c: resource 10 [mem 0x000dc000-0x000dffff]
> [    0.285174] pci_bus 0000:1c: resource 11 [mem 0xd0000000-0xfebfffff]
> [    0.285179] pci_bus 0000:1c: resource 12 [mem 0xfed40000-0xfed44fff]
> [    0.285183] pci_bus 0000:1d: resource 0 [io  0x3400-0x34ff]
> [    0.285188] pci_bus 0000:1d: resource 1 [io  0x3000-0x30ff]
> [    0.285192] pci_bus 0000:1d: resource 2 [mem 0xd4000000-0xd7ffffff pref]
> [    0.285197] pci_bus 0000:1d: resource 3 [mem 0xd8000000-0xdbffffff]
> [    0.285241] NET: Registered protocol family 2
> [    0.285467] IP route cache hash table entries: 131072 (order: 8, 1048576 bytes)
> [    0.287260] TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
> [    0.292371] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
> [    0.293032] TCP: Hash tables configured (established 524288 bind 65536)
> [    0.293036] TCP reno registered
> [    0.293054] UDP hash table entries: 2048 (order: 4, 65536 bytes)
> [    0.293117] UDP-Lite hash table entries: 2048 (order: 4, 65536 bytes)
> [    0.293289] NET: Registered protocol family 1
> [    0.293317] pci 0000:00:02.0: Boot video device
> [    0.293693] PCI: CLS 64 bytes, default 64
> [    0.293766] Trying to unpack rootfs image as initramfs...
> [    0.748181] Freeing initrd memory: 11040k freed
> [    0.754749] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> [    0.754758] Placing 64MB software IO TLB between ffff8800cb6aa000 - ffff8800cf6aa000
> [    0.754763] software IO TLB at phys 0xcb6aa000 - 0xcf6aa000
> [    0.754786] Simple Boot Flag at 0x7b set to 0x1
> [    0.756515] audit: initializing netlink socket (disabled)
> [    0.756535] type=2000 audit(1333552189.750:1): initialized
> [    0.757428] HugeTLB registered 2 MB page size, pre-allocated 0 pages
> [    0.765853] Registering unionfs 2.5.11 (for 3.3.0-rc3)
> [    0.766222] msgmni has been set to 7898
> [    0.766724] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
> [    0.766730] io scheduler noop registered
> [    0.766733] io scheduler deadline registered
> [    0.766949] io scheduler cfq registered (default)
> [    0.767215] pcieport 0000:00:1c.0: irq 40 for MSI/MSI-X
> [    0.767442] pcieport 0000:00:1c.2: irq 41 for MSI/MSI-X
> [    0.768293] intel_idle: MWAIT substates: 0x22220
> [    0.768297] intel_idle: does not run on family 6 model 15
> [    0.768443] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> [    0.769879] tpm_tis 00:03: 1.2 TPM (device-id 0xB, rev-id 16)
> [    1.470021] tpm_tis 00:03: Operation Timed out
> [    1.470076] tpm_tis 00:03: TPM self test failed
> [    1.474343] brd: module loaded
> [    1.474781] Fixed MDIO Bus: probed
> [    1.474790] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
> [    1.474849] ehci_hcd 0000:00:1a.7: setting latency timer to 64
> [    1.474855] ehci_hcd 0000:00:1a.7: EHCI Host Controller
> [    1.474867] ehci_hcd 0000:00:1a.7: new USB bus registered, assigned bus number 1
> [    1.474919] ehci_hcd 0000:00:1a.7: debug port 1
> [    1.478799] ehci_hcd 0000:00:1a.7: cache line size of 64 is not supported
> [    1.478827] ehci_hcd 0000:00:1a.7: irq 23, io mem 0xfc704800
> [    1.490055] ehci_hcd 0000:00:1a.7: USB 2.0 started, EHCI 1.00
> [    1.490107] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
> [    1.490112] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> [    1.490117] usb usb1: Product: EHCI Host Controller
> [    1.490121] usb usb1: Manufacturer: Linux 3.3.1-1-amd64-vyatta ehci_hcd
> [    1.490125] usb usb1: SerialNumber: 0000:00:1a.7
> [    1.490409] hub 1-0:1.0: USB hub found
> [    1.490419] hub 1-0:1.0: 4 ports detected
> [    1.490555] ehci_hcd 0000:00:1d.7: setting latency timer to 64
> [    1.490561] ehci_hcd 0000:00:1d.7: EHCI Host Controller
> [    1.490571] ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 2
> [    1.490616] ehci_hcd 0000:00:1d.7: debug port 1
> [    1.494486] ehci_hcd 0000:00:1d.7: cache line size of 64 is not supported
> [    1.494496] ehci_hcd 0000:00:1d.7: irq 23, io mem 0xfc704c00
> [    1.510035] ehci_hcd 0000:00:1d.7: USB 2.0 started, EHCI 1.00
> [    1.510083] usb usb2: New USB device found, idVendor=1d6b, idProduct=0002
> [    1.510088] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> [    1.510092] usb usb2: Product: EHCI Host Controller
> [    1.510096] usb usb2: Manufacturer: Linux 3.3.1-1-amd64-vyatta ehci_hcd
> [    1.510100] usb usb2: SerialNumber: 0000:00:1d.7
> [    1.510387] hub 2-0:1.0: USB hub found
> [    1.510396] hub 2-0:1.0: 6 ports detected
> [    1.510568] uhci_hcd: USB Universal Host Controller Interface driver
> [    1.510603] uhci_hcd 0000:00:1a.0: setting latency timer to 64
> [    1.510610] uhci_hcd 0000:00:1a.0: UHCI Host Controller
> [    1.510620] uhci_hcd 0000:00:1a.0: new USB bus registered, assigned bus number 3
> [    1.510671] uhci_hcd 0000:00:1a.0: irq 22, io base 0x00001820
> [    1.510726] usb usb3: New USB device found, idVendor=1d6b, idProduct=0001
> [    1.510731] usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> [    1.510735] usb usb3: Product: UHCI Host Controller
> [    1.510739] usb usb3: Manufacturer: Linux 3.3.1-1-amd64-vyatta uhci_hcd
> [    1.510743] usb usb3: SerialNumber: 0000:00:1a.0
> [    1.511006] hub 3-0:1.0: USB hub found
> [    1.511015] hub 3-0:1.0: 2 ports detected
> [    1.511128] uhci_hcd 0000:00:1a.1: setting latency timer to 64
> [    1.511134] uhci_hcd 0000:00:1a.1: UHCI Host Controller
> [    1.511143] uhci_hcd 0000:00:1a.1: new USB bus registered, assigned bus number 4
> [    1.511179] uhci_hcd 0000:00:1a.1: irq 22, io base 0x00001840
> [    1.511233] usb usb4: New USB device found, idVendor=1d6b, idProduct=0001
> [    1.511238] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> [    1.511242] usb usb4: Product: UHCI Host Controller
> [    1.511246] usb usb4: Manufacturer: Linux 3.3.1-1-amd64-vyatta uhci_hcd
> [    1.511250] usb usb4: SerialNumber: 0000:00:1a.1
> [    1.511510] hub 4-0:1.0: USB hub found
> [    1.511519] hub 4-0:1.0: 2 ports detected
> [    1.511630] uhci_hcd 0000:00:1d.0: setting latency timer to 64
> [    1.511636] uhci_hcd 0000:00:1d.0: UHCI Host Controller
> [    1.511646] uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 5
> [    1.511682] uhci_hcd 0000:00:1d.0: irq 22, io base 0x00001860
> [    1.511735] usb usb5: New USB device found, idVendor=1d6b, idProduct=0001
> [    1.511741] usb usb5: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> [    1.511745] usb usb5: Product: UHCI Host Controller
> [    1.511749] usb usb5: Manufacturer: Linux 3.3.1-1-amd64-vyatta uhci_hcd
> [    1.511753] usb usb5: SerialNumber: 0000:00:1d.0
> [    1.512008] hub 5-0:1.0: USB hub found
> [    1.512016] hub 5-0:1.0: 2 ports detected
> [    1.512128] uhci_hcd 0000:00:1d.1: setting latency timer to 64
> [    1.512134] uhci_hcd 0000:00:1d.1: UHCI Host Controller
> [    1.512143] uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 6
> [    1.512180] uhci_hcd 0000:00:1d.1: irq 22, io base 0x00001880
> [    1.512235] usb usb6: New USB device found, idVendor=1d6b, idProduct=0001
> [    1.512241] usb usb6: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> [    1.512245] usb usb6: Product: UHCI Host Controller
> [    1.512249] usb usb6: Manufacturer: Linux 3.3.1-1-amd64-vyatta uhci_hcd
> [    1.512253] usb usb6: SerialNumber: 0000:00:1d.1
> [    1.512508] hub 6-0:1.0: USB hub found
> [    1.512516] hub 6-0:1.0: 2 ports detected
> [    1.512628] uhci_hcd 0000:00:1d.2: setting latency timer to 64
> [    1.512634] uhci_hcd 0000:00:1d.2: UHCI Host Controller
> [    1.512643] uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 7
> [    1.512680] uhci_hcd 0000:00:1d.2: irq 22, io base 0x000018a0
> [    1.512733] usb usb7: New USB device found, idVendor=1d6b, idProduct=0001
> [    1.512739] usb usb7: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> [    1.512743] usb usb7: Product: UHCI Host Controller
> [    1.512747] usb usb7: Manufacturer: Linux 3.3.1-1-amd64-vyatta uhci_hcd
> [    1.512751] usb usb7: SerialNumber: 0000:00:1d.2
> [    1.513013] hub 7-0:1.0: USB hub found
> [    1.513022] hub 7-0:1.0: 2 ports detected
> [    1.513377] i8042: PNP: PS/2 Controller [PNP0303:KBC,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
> [    1.515536] i8042: Detected active multiplexing controller, rev 1.1
> [    1.517407] serio: i8042 KBD port at 0x60,0x64 irq 1
> [    1.517417] serio: i8042 AUX0 port at 0x60,0x64 irq 12
> [    1.517422] serio: i8042 AUX1 port at 0x60,0x64 irq 12
> [    1.517427] serio: i8042 AUX2 port at 0x60,0x64 irq 12
> [    1.517432] serio: i8042 AUX3 port at 0x60,0x64 irq 12
> [    1.517874] mousedev: PS/2 mouse device common for all mice
> [    1.517988] rtc_cmos 00:08: RTC can wake from S4
> [    1.518239] rtc_cmos 00:08: rtc core: registered rtc_cmos as rtc0
> [    1.518282] rtc0: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
> [    1.518344] cpuidle: using governor ladder
> [    1.518347] cpuidle: using governor menu
> [    1.518697] No iBFT detected.
> [    1.518728] Netfilter messages via NETLINK v0.30.
> [    1.518806] ip_tables: (C) 2000-2006 Netfilter Core Team
> [    1.518811] TCP cubic registered
> [    1.518816] NET: Registered protocol family 17
> [    1.518889] Registering the dns_resolver key type
> [    1.519225] registered taskstats version 1
> [    1.520227] rtc_cmos 00:08: setting system clock to 2012-04-04 15:09:51 UTC (1333552191)
> [    1.522640] Freeing unused kernel memory: 560k freed
> [    1.522909] Write protecting the kernel read-only data: 6144k
> [    1.527588] Freeing unused kernel memory: 624k freed
> [    1.532348] Freeing unused kernel memory: 608k freed
> [    1.547150] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input0
> [    1.657281] udevd[612]: starting version 175
> [    1.701917] sky2: driver version 1.30
> [    1.702034] sky2 0000:04:00.0: Yukon-2 EC Ultra chip revision 3
> [    1.702182] sky2 0000:04:00.0: irq 42 for MSI/MSI-X
> [    1.702515] sky2 0000:04:00.0: eth0: addr 00:17:42:8a:b4:05
> [    1.706543] ata_piix 0000:00:1f.1: version 2.13
> [    1.706630] ata_piix 0000:00:1f.1: setting latency timer to 64
> [    1.714519] scsi0 : ata_piix
> [    1.717709] scsi1 : ata_piix
> [    1.718371] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x1810 irq 14
> [    1.718377] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0x1818 irq 15
> [    1.742743] thermal LNXTHERM:00: registered as thermal_zone0
> [    1.742749] ACPI: Thermal Zone [TZ00] (27 C)
> [    1.742977] thermal LNXTHERM:01: registered as thermal_zone1
> [    1.742982] ACPI: Thermal Zone [TZ01] (27 C)
> [    1.747831] ahci 0000:00:1f.2: version 3.0
> [    1.747943] ahci 0000:00:1f.2: irq 43 for MSI/MSI-X
> [    1.748050] ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 3 ports 3 Gbps 0x7 impl SATA mode
> [    1.748058] ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ccc
> [    1.748067] ahci 0000:00:1f.2: setting latency timer to 64
> [    1.750080] Refined TSC clocksource calibration: 2393.999 MHz.
> [    1.750088] Switching to clocksource tsc
> [    1.752771] scsi2 : ahci
> [    1.756589] scsi3 : ahci
> [    1.756750] scsi4 : ahci
> [    1.756953] ata3: SATA max UDMA/133 abar m2048@0xfc704000 port 0xfc704100 irq 43
> [    1.756960] ata4: SATA max UDMA/133 abar m2048@0xfc704000 port 0xfc704180 irq 43
> [    1.756966] ata5: SATA max UDMA/133 abar m2048@0xfc704000 port 0xfc704200 irq 43
> [    1.930028] usb 1-3: new high-speed USB device number 3 using ehci_hcd
> [    2.100033] ata4: SATA link down (SStatus 0 SControl 300)
> [    2.100068] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [    2.100097] ata5: SATA link down (SStatus 0 SControl 300)
> [    2.100475] ata3.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
> [    2.100482] ata3.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
> [    2.100570] ata3.00: ATA-9: M4-CT128M4SSD2, 0309, max UDMA/100
> [    2.100576] ata3.00: 250069680 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
> [    2.100978] ata3.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
> [    2.100985] ata3.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
> [    2.101073] ata3.00: configured for UDMA/100
> [    2.101255] scsi 2:0:0:0: Direct-Access     ATA      M4-CT128M4SSD2   0309 PQ: 0 ANSI: 5
> [    2.101501] sd 2:0:0:0: [sda] 250069680 512-byte logical blocks: (128 GB/119 GiB)
> [    2.101527] sd 2:0:0:0: Attached scsi generic sg0 type 0
> [    2.101622] sd 2:0:0:0: [sda] Write Protect is off
> [    2.101628] sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
> [    2.101677] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> [    2.102522]  sda: sda1 sda2 < sda5 sda6 >
> [    2.103156] sd 2:0:0:0: [sda] Attached SCSI disk
> [    2.106992] sdhci: Secure Digital Host Controller Interface driver
> [    2.106997] sdhci: Copyright(c) Pierre Ossman
> [    2.107472] sdhci-pci 0000:1c:03.2: SDHCI controller found [1217:7120] (rev 2)
> [    2.107563] mmc0: no vmmc regulator found
> [    2.107612] Registered led device: mmc0::
> [    2.108674] mmc0: SDHCI controller on PCI [0000:1c:03.2] using PIO
> [    2.109899] usb 1-3: New USB device found, idVendor=046d, idProduct=09b2
> [    2.109905] usb 1-3: New USB device strings: Mfr=0, Product=2, SerialNumber=0
> [    2.109910] usb 1-3: Product: OEM Camera
> [    2.170066] firewire_ohci: Added fw-ohci device 0000:1c:03.4, OHCI v1.10, 8 IR + 8 IT contexts, quirks 0x10
> [    2.225306] device-mapper: uevent: version 1.0.3
> [    2.225727] device-mapper: ioctl: 4.22.0-ioctl (2011-10-19) initialised: dm-devel@redhat.com
> [    2.259177] Btrfs loaded
> [    2.266136] PM: Starting manual resume from disk
> [    2.283545] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
> [    2.450029] usb 3-2: new full-speed USB device number 2 using uhci_hcd
> [    2.629744] usb 3-2: New USB device found, idVendor=0c24, idProduct=000f
> [    2.629750] usb 3-2: New USB device strings: Mfr=0, Product=0, SerialNumber=0
> [    2.637278] udevd[854]: starting version 175
> [    2.670194] firewire_core: created device fw0: GUID 00000e1003f448d1, S400
> [    2.784513] Linux agpgart interface v0.103
> [    2.787669] input: Lid Switch as /devices/LNXSYSTM:00/device:00/PNP0C0D:00/input/input1
> [    2.787713] ACPI: Lid Switch [LID]
> [    2.787804] input: Power Button as /devices/LNXSYSTM:00/device:00/PNP0C0C:00/input/input2
> [    2.787812] ACPI: Power Button [PWRB]
> [    2.799854] agpgart-intel 0000:00:00.0: Intel 965GM Chipset
> [    2.800060] agpgart-intel 0000:00:00.0: detected gtt size: 524288K total, 262144K mappable
> [    2.800066] Monitor-Mwait will be used to enter C-1 state
> [    2.801730] agpgart-intel 0000:00:00.0: detected 8192K stolen memory
> [    2.802036] agpgart-intel 0000:00:00.0: AGP aperture is 256M @ 0xe0000000
> [    2.808253] ACPI: Deprecated procfs I/F for battery is loaded, please retry with CONFIG_ACPI_PROCFS_POWER cleared
> [    2.808269] ACPI: Battery Slot [CMB1] (battery present)
> [    2.815046] ACPI: Deprecated procfs I/F for battery is loaded, please retry with CONFIG_ACPI_PROCFS_POWER cleared
> [    2.815062] ACPI: Battery Slot [CMB2] (battery present)
> [    2.816520] Monitor-Mwait will be used to enter C-2 state
> [    2.824425] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.07
> [    2.824582] iTCO_wdt: Found a ICH8M TCO device (Version=2, TCOBASE=0x1060)
> [    2.826922] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
> [    2.827356] Monitor-Mwait will be used to enter C-3 state
> [    2.827381] Marking TSC unstable due to TSC halts in idle
> [    2.827417] ACPI: acpi_idle registered with cpuidle
> [    2.827566] ACPI: Deprecated procfs I/F for AC is loaded, please retry with CONFIG_ACPI_PROCFS_POWER cleared
> [    2.829919] ACPI: AC Adapter [AC] (off-line)
> [    2.839175] Switching to clocksource hpet
> [    2.852557] cfg80211: Calling CRDA to update world regulatory domain
> [    2.880008] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input3
> [    2.880008] ACPI: Power Button [PWRF]
> [    2.887729] iwl4965: Intel(R) Wireless WiFi 4965 driver for Linux, in-tree:
> [    2.887734] iwl4965: Copyright(c) 2003-2011 Intel Corporation
> [    2.887864] iwl4965 0000:0c:00.0: Detected Intel(R) Wireless WiFi Link 4965AGN, REV=0x4
> [    2.927386] iwl4965 0000:0c:00.0: device EEPROM VER=0x36, CALIB=0x5
> [    2.927444] iwl4965 0000:0c:00.0: Tunable channels: 11 802.11bg, 13 802.11a channels
> [    2.927677] iwl4965 0000:0c:00.0: irq 44 for MSI/MSI-X
> [    2.927848] usb 5-1: new low-speed USB device number 2 using uhci_hcd
> [    2.939578] tpm_inf_pnp 00:03: Found TPM with ID IFX0102
> [    2.939650] tpm_inf_pnp 00:03: TPM found: config base 0x4e, data base 0xfd00, chip version 0x000b, vendor id 0x15d1 (Infineon), product id 0x000b (SLB 9635 TT 1.2)
> [    2.956878] input: PC Speaker as /devices/platform/pcspkr/input/input4
> [    3.028008] input: Fujitsu Application Panel buttons as /devices/pci0000:00/0000:00:1f.3/i2c-0/0-0019/input/input5
> [    3.079666] iwl4965 0000:0c:00.0: loaded firmware version 228.61.2.24
> [    3.080152] Registered led device: phy0-led
> [    3.097917] cfg80211: World regulatory domain updated:
> [    3.097923] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
> [    3.097929] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
> [    3.097934] cfg80211:   (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
> [    3.097939] cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
> [    3.097944] cfg80211:   (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
> [    3.097949] cfg80211:   (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
> [    3.099297] ieee80211 phy0: Selected rate control algorithm 'iwl-4965-rs'
> [    3.118778] usb 5-1: New USB device found, idVendor=1050, idProduct=0010
> [    3.118784] usb 5-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
> [    3.118790] usb 5-1: Product: Yubico Yubikey II
> [    3.118793] usb 5-1: Manufacturer: Yubico
> [    3.118796] usb 5-1: SerialNumber: 0000367025
> [    3.161328] input: Yubico Yubico Yubikey II as /devices/pci0000:00/0000:00:1d.0/usb5/5-1/5-1:1.0/input/input6
> [    3.161490] generic-usb 0003:1050:0010.0001: input,hidraw0: USB HID v1.11 Keyboard [Yubico Yubico Yubikey II] on usb-0000:00:1d.0-1/input0
> [    3.161526] usbcore: registered new interface driver usbhid
> [    3.161530] usbhid: USB HID core driver
> [    3.454412] Adding 1951740k swap on /dev/sda5.  Priority:-1 extents:1 across:1951740k SS
> [    3.460632] EXT4-fs (sda1): re-mounted. Opts: (null)
> [    3.647225] psmouse serio4: synaptics: Touchpad model: 1, fw: 6.2, id: 0x1a0b1, caps: 0xa04713/0x202000/0x0
> [    3.685537] input: SynPS/2 Synaptics TouchPad as /devices/platform/i8042/serio4/input/input7
> [    3.745928] EXT4-fs (sda1): re-mounted. Opts: discard,errors=remount-ro
> [    4.027211] input: Yubico Yubico Yubikey II as /devices/pci0000:00/0000:00:1d.0/usb5/5-1/5-1:1.0/input/input8
> [    4.027371] generic-usb 0003:1050:0010.0002: input,hidraw0: USB HID v1.11 Keyboard [Yubico Yubico Yubikey II] on usb-0000:00:1d.0-1/input0
> [    6.167412] Intel AES-NI instructions are not detected.
> [    6.181055] padlock_sha: VIA PadLock Hash Engine not detected.
> [   11.493873] loop: module loaded
> [   11.970959] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
> [   12.164577] NET: Registered protocol family 10
> [   12.221536] RPC: Registered named UNIX socket transport module.
> [   12.221541] RPC: Registered udp transport module.
> [   12.221544] RPC: Registered tcp transport module.
> [   12.221548] RPC: Registered tcp NFSv4.1 backchannel transport module.
> [   12.321045] fuse init (API version 7.18)
> [   12.609444] NET: Registered protocol family 15
> [   13.088426] sky2 0000:04:00.0: eth0: enabling interface
> [   13.089637] ADDRCONF(NETDEV_UP): eth0: link is not ready
> [   13.261673] [drm] Initialized drm 1.1.0 20060810
> [   13.271149] i915 0000:00:02.0: setting latency timer to 64
> [   13.342156] mtrr: type mismatch for e0000000,10000000 old: write-back new: write-combining
> [   13.342161] [drm] MTRR allocation failed.  Graphics performance may suffer.
> [   13.343984] i915 0000:00:02.0: irq 45 for MSI/MSI-X
> [   13.344001] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
> [   13.344005] [drm] Driver supports precise vblank timestamp query.
> [   13.344071] vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
> [   13.349856] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> [   14.078303] [drm] initialized overlay support
> [   14.161389] input: ACPI Virtual Keyboard Device as /devices/virtual/input/input9
> [   14.289222] fbcon: inteldrmfb (fb0) is primary device
> [   14.289439] Console: switching to colour frame buffer device 160x50
> [   14.289450] fb0: inteldrmfb frame buffer device
> [   14.289453] drm: registered panic notifier
> [   14.308314] acpi device:04: registered as cooling_device2
> [   14.308479] input: Video Bus as /devices/LNXSYSTM:00/device:00/PNP0A08:00/LNXVIDEO:00/input/input10
> [   14.310326] ACPI: Video Device [GFX0] (multi-head: yes  rom: no  post: no)
> [   14.310466] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
> [   14.603377] lp: driver loaded but no devices found
> [   14.610928] ppdev: user-space parallel port driver
> [   17.794824] Ebtables v2.0 registered
> [   17.809543] ip6_tables: (C) 2000-2006 Netfilter Core Team
> [   18.204652] wlan0: authenticate with 00:22:90:93:49:d0 (try 1)
> [   18.213234] wlan0: authenticated
> [   18.268777] wlan0: associate with 00:22:90:93:49:d0 (try 1)
> [   18.462396] wlan0: associate with 00:22:90:93:49:d0 (try 2)
> [   18.480746] wlan0: RX AssocResp from 00:22:90:93:49:d0 (capab=0x431 status=0 aid=6)
> [   18.480750] wlan0: associated
> [   18.480753] wlan0: moving STA 00:22:90:93:49:d0 to state 1
> [   18.480755] wlan0: moving STA 00:22:90:93:49:d0 to state 2
> [   18.542165] ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
> [   18.542234] cfg80211: Calling CRDA for country: US
> [   18.546200] cfg80211: Regulatory domain changed to country: US
> [   18.546204] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
> [   18.546207] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2700 mBm)
> [   18.546210] cfg80211:   (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 1700 mBm)
> [   18.546212] cfg80211:   (5250000 KHz - 5330000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
> [   18.546215] cfg80211:   (5490000 KHz - 5600000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
> [   18.546217] cfg80211:   (5650000 KHz - 5710000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
> [   18.546220] cfg80211:   (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 3000 mBm)
> [   19.633036] wlan0: moving STA 00:22:90:93:49:d0 to state 3
> [   30.191204] wlan0: no IPv6 routers present
> [  476.389422] ------------[ cut here ]------------
> [  476.389440] WARNING: at kernel/time/tick-sched.c:567 tick_nohz_irq_exit+0x11e/0x194()

Ah, nice one. Were you running a specific test suite?
I'll try to figure out what happened once I have access to a test box.

Thanks.

> [  476.389447] Hardware name: LifeBook S6510
> [  476.389452] Modules linked in: kvm_intel kvm ip6table_filter ip6_tables iptable_filter ebtable_nat ebtables acpi_cpufreq mperf cpufreq_ondemand cpufreq_conservative cpufreq_userspace cpufreq_stats freq_table cpufreq_powersave parport_pc ppdev lp parport binfmt_misc i915 drm_kms_helper drm i2c_algo_bit uinput deflate ctr twofish_generic twofish_x86_64_3way twofish_x86_64 twofish_common camellia serpent_sse2_x86_64 xts lrw gf128mul serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5 des_generic xcbc rmd160 sha512_generic crypto_null af_key fuse nfs nfs_acl lockd auth_rpcgss sunrpc ipv6 loop sha256_generic aes_x86_64 cryptd aes_generic cbc dm_crypt usbhid hid arc4 apanel input_polldev i2c_i801 i2c_core serio_raw pcspkr psmouse evdev tpm_infineon iwl4965 iwlegacy mac80211 cfg80211 rfkill video ac iTCO_wdt intel_agp battery button processor intel_gtt agpgart ext4 crc16 jbd2 btrfs crc32c libcrc32c zlib_deflate dm_mod firewire_ohci sdhci_pci sdhci firewire_core
>  pata_acpi ata_generic thermal thermal_sys mmc_core crc_itu_t ahci libahci ata_piix sky2 [last unloaded: scsi_wait_scan]
> [  476.389700] Pid: 9, comm: kworker/1:0 Not tainted 3.3.1-1-amd64-vyatta #1
> [  476.389707] Call Trace:
> [  476.389711]  <IRQ>  [<ffffffff81037e38>] ? warn_slowpath_common+0x78/0x8c
> [  476.389732]  [<ffffffff8106f6c2>] ? tick_nohz_irq_exit+0x11e/0x194
> [  476.389743]  [<ffffffff8103d40b>] ? irq_exit+0x73/0x79
> [  476.389753]  [<ffffffff8100fbd7>] ? do_IRQ+0x82/0x98
> [  476.389766]  [<ffffffff8135af6e>] ? common_interrupt+0x6e/0x6e
> [  476.389771]  <EOI>  [<ffffffff81014f09>] ? native_read_tsc+0x2/0xf
> [  476.389794]  [<ffffffff811a4515>] ? paravirt_read_tsc+0x5/0x8
> [  476.389804]  [<ffffffff811a45b0>] ? delay_tsc+0x29/0x5e
> [  476.389816]  [<ffffffffa0486063>] ? sclhi+0x5d/0x63 [i2c_algo_bit]
> [  476.389827]  [<ffffffffa04861d7>] ? i2c_outb.isra.4+0x3c/0x8e [i2c_algo_bit]
> [  476.389838]  [<ffffffffa04865e4>] ? bit_xfer+0x34b/0x3fc [i2c_algo_bit]
> [  476.389849]  [<ffffffff810532d4>] ? __hrtimer_start_range_ns+0x297/0x2b8
> [  476.389883]  [<ffffffffa0509e58>] ? intel_i2c_quirk_xfer+0x71/0xb9 [i915]
> [  476.389901]  [<ffffffffa029596b>] ? i2c_transfer+0x90/0xf3 [i2c_core]
> [  476.389929]  [<ffffffffa05063c5>] ? intel_sdvo_write_cmd+0x25f/0x2e5 [i915]
> [  476.389940]  [<ffffffff811a383e>] ? vsnprintf+0x3ee/0x427
> [  476.389969]  [<ffffffffa0508a5c>] ? intel_sdvo_detect+0x2d/0x1e0 [i915]
> [  476.389980]  [<ffffffff8104c015>] ? queue_delayed_work_on+0xb0/0xc8
> [  476.389996]  [<ffffffffa04d121c>] ? output_poll_execute+0x97/0x16c [drm_kms_helper]
> [  476.390008]  [<ffffffffa04d1185>] ? drm_format_num_planes+0x8a/0x8a [drm_kms_helper]
> [  476.390018]  [<ffffffff8104c20c>] ? process_one_work+0x157/0x296
> [  476.390028]  [<ffffffff8104cc77>] ? worker_thread+0xc2/0x145
> [  476.390037]  [<ffffffff8104cbb5>] ? manage_workers.isra.24+0x15b/0x15b
> [  476.390046]  [<ffffffff8104ff44>] ? kthread+0x7d/0x85
> [  476.390056]  [<ffffffff8135cbe4>] ? kernel_thread_helper+0x4/0x10
> [  476.390066]  [<ffffffff8104fec7>] ? kthread_freezable_should_stop+0x37/0x37
> [  476.390075]  [<ffffffff8135cbe0>] ? gs_change+0x13/0x13
> [  476.390081] ---[ end trace f87639a4a779b971 ]---
> [  769.514514] ------------[ cut here ]------------
> [  769.514523] WARNING: at kernel/time/tick-sched.c:706 tick_nohz_account_ticks+0x77/0x80()
> [  769.514526] Hardware name: LifeBook S6510
> [  769.514527] Modules linked in: kvm_intel kvm ip6table_filter ip6_tables iptable_filter ebtable_nat ebtables acpi_cpufreq mperf cpufreq_ondemand cpufreq_conservative cpufreq_userspace cpufreq_stats freq_table cpufreq_powersave parport_pc ppdev lp parport binfmt_misc i915 drm_kms_helper drm i2c_algo_bit uinput deflate ctr twofish_generic twofish_x86_64_3way twofish_x86_64 twofish_common camellia serpent_sse2_x86_64 xts lrw gf128mul serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5 des_generic xcbc rmd160 sha512_generic crypto_null af_key fuse nfs nfs_acl lockd auth_rpcgss sunrpc ipv6 loop sha256_generic aes_x86_64 cryptd aes_generic cbc dm_crypt usbhid hid arc4 apanel input_polldev i2c_i801 i2c_core serio_raw pcspkr psmouse evdev tpm_infineon iwl4965 iwlegacy mac80211 cfg80211 rfkill video ac iTCO_wdt intel_agp battery button processor intel_gtt agpgart ext4 crc16 jbd2 btrfs crc32c libcrc32c zlib_deflate dm_mod firewire_ohci sdhci_pci sdhci firewire_core
>  pata_acpi ata_generic thermal thermal_sys mmc_core crc_itu_t ahci libahci ata_piix sky2 [last unloaded: scsi_wait_scan]
> [  769.514619] Pid: 0, comm: swapper/0 Tainted: G        W    3.3.1-1-amd64-vyatta #1
> [  769.514621] Call Trace:
> [  769.514623]  <IRQ>  [<ffffffff81037e38>] ? warn_slowpath_common+0x78/0x8c
> [  769.514632]  [<ffffffff8106eee9>] ? tick_nohz_account_ticks+0x77/0x80
> [  769.514635]  [<ffffffff8106fb59>] ? tick_nohz_flush_current_times+0x24/0x48
> [  769.514639]  [<ffffffff81073a0b>] ? generic_smp_call_function_single_interrupt+0xca/0xeb
> [  769.514644]  [<ffffffff81024800>] ? smp_call_function_single_interrupt+0x10/0x20
> [  769.514649]  [<ffffffff8135c6ce>] ? call_function_single_interrupt+0x6e/0x80
> [  769.514650]  <EOI>  [<ffffffff8106e1bc>] ? tick_notify+0x1fc/0x354
> [  769.514657]  [<ffffffff81055884>] ? arch_local_irq_enable+0x4/0x8
> [  769.514660]  [<ffffffff81057ec6>] ? finish_task_switch+0x44/0xc2
> [  769.514665]  [<ffffffff8135a40e>] ? __schedule+0x444/0x4ae
> [  769.514669]  [<ffffffff8101452d>] ? paravirt_read_tsc+0x5/0x8
> [  769.514673]  [<ffffffff8100d245>] ? cpu_idle+0xa7/0xac
> [  769.514676]  [<ffffffff8168eabc>] ? start_kernel+0x342/0x34d
> [  769.514680]  [<ffffffff8168e140>] ? early_idt_handlers+0x140/0x140
> [  769.514683]  [<ffffffff8168e3c3>] ? x86_64_start_kernel+0x104/0x111
> [  769.514685] ---[ end trace f87639a4a779b972 ]---


* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-03-27 15:10   ` Peter Zijlstra
  2012-03-27 15:18     ` Gilad Ben-Yossef
@ 2012-05-22 21:31     ` Thomas Gleixner
  2012-05-22 21:50       ` Steven Rostedt
  1 sibling, 1 reply; 96+ messages in thread
From: Thomas Gleixner @ 2012-05-22 21:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben-Yossef, Frederic Weisbecker, LKML, linaro-sched-sig,
	Alessio Igor Bogani, Andrew Morton, Avi Kivity, Chris Metcalf,
	Christoph Lameter, Daniel Lezcano, Geoff Levand, Ingo Molnar,
	Max Krasnyansky, Paul E. McKenney, Stephen Hemminger,
	Steven Rostedt, Sven-Thorsten Dietrich, Zen Lin

On Tue, 27 Mar 2012, Peter Zijlstra wrote:

> On Tue, 2012-03-27 at 17:02 +0200, Gilad Ben-Yossef wrote:
> > 
> > In my case, I also had to disable the clocksource watchdog, but only
> > because TSC is not stable on my VM.
> > This is really not a nohz/cpuset problem. 
> 
> No but that thing is annoying, I ran afoul of it too the other day.
> 
> Thomas, would you object to a means of turning that thing off? And if
> not, do you have a preference as to what particular means
> (sysctl/sysfs/etc..) ?

We have the commandline option "tsc=reliable" already. That disables
the stupid watchdog. Handle with care.
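
To check whether a given boot actually carried such an option, the "tsc=" value can be pulled out of the kernel command line. The following is a minimal userspace sketch, not kernel code: find_tsc_option() is a hypothetical helper name, and it only reports the first "tsc=" token it sees, which is a simplification.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the value of the first "tsc=" token found in a space-separated
 * command line into out (always NUL-terminated, truncated if needed).
 * Returns 1 if such a token exists, 0 otherwise.
 * Hypothetical helper for illustration only; not a kernel function.
 */
static int find_tsc_option(const char *cmdline, char *out, size_t outlen)
{
	const char *p = cmdline;

	if (outlen == 0)
		return 0;

	while (*p) {
		size_t toklen;

		/* Skip separators between tokens. */
		while (*p == ' ' || *p == '\n')
			p++;
		if (!*p)
			break;

		/* Length of the current token, up to the next separator. */
		toklen = strcspn(p, " \n");

		/* Only match "tsc=" at the start of a token, so "notsc" is ignored. */
		if (toklen >= 4 && !strncmp(p, "tsc=", 4)) {
			size_t vlen = toklen - 4;

			if (vlen >= outlen)
				vlen = outlen - 1;
			memcpy(out, p + 4, vlen);
			out[vlen] = '\0';
			return 1;
		}
		p += toklen;
	}
	return 0;
}
```

In practice the input string would be the single line read from /proc/cmdline (for example with fopen() and fgets()); a result of "reliable" means the clocksource watchdog was asked to leave the TSC alone on that boot.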

To take it a level further, use the patch below :)

Thanks,

	tglx

------------>
x86: Add tsc=perfect option which enforces sched_clock_stable
    
The sched_clock dance with updating local clocks can be avoided even
if the CPU does not have all the magic features.
    
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 590900c..dc8ecc3 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -109,6 +109,10 @@ static int __init tsc_setup(char *str)
 {
 	if (!strcmp(str, "reliable"))
 		tsc_clocksource_reliable = 1;
+	if (!strcmp(str, "perfect")) {
+		tsc_clocksource_reliable = 1;
+		sched_clock_stable = 1;
+	}
 	if (!strncmp(str, "noirqtime", 9))
 		no_sched_irq_time = 1;
 	return 1;
 
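
The effect of each "tsc=" value after the patch above can be modelled in userspace. This is only a sketch: struct tsc_flags and tsc_setup_model() are hypothetical names standing in for the kernel globals and for tsc_setup(), and the string comparisons simply mirror the ones in the hunk.

```c
#include <string.h>

/* Userspace stand-ins for the kernel globals touched by tsc_setup(). */
struct tsc_flags {
	int clocksource_reliable;	/* models tsc_clocksource_reliable */
	int sched_clock_stable;		/* models sched_clock_stable */
	int no_sched_irq_time;		/* models no_sched_irq_time */
};

/*
 * Mirror of the patched option parsing: "perfect" sets everything that
 * "reliable" sets and additionally forces sched_clock_stable.
 * Hypothetical model for illustration; not the kernel function itself.
 */
static void tsc_setup_model(const char *str, struct tsc_flags *f)
{
	if (!strcmp(str, "reliable"))
		f->clocksource_reliable = 1;
	if (!strcmp(str, "perfect")) {
		f->clocksource_reliable = 1;
		f->sched_clock_stable = 1;
	}
	/* Prefix match on the first 9 characters, as in the original code. */
	if (!strncmp(str, "noirqtime", 9))
		f->no_sched_irq_time = 1;
}
```

Calling tsc_setup_model("perfect", &f) on a zeroed struct leaves both clocksource_reliable and sched_clock_stable set, whereas "reliable" sets only the first; that difference is the whole point of the new option.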


* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-05-22 21:31     ` Thomas Gleixner
@ 2012-05-22 21:50       ` Steven Rostedt
  2012-05-22 22:22         ` Thomas Gleixner
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Rostedt @ 2012-05-22 21:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Gilad Ben-Yossef, Frederic Weisbecker, LKML,
	linaro-sched-sig, Alessio Igor Bogani, Andrew Morton, Avi Kivity,
	Chris Metcalf, Christoph Lameter, Daniel Lezcano, Geoff Levand,
	Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Sven-Thorsten Dietrich, Zen Lin

On Tue, 2012-05-22 at 23:31 +0200, Thomas Gleixner wrote:

> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 590900c..dc8ecc3 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -109,6 +109,10 @@ static int __init tsc_setup(char *str)
>  {
>  	if (!strcmp(str, "reliable"))
>  		tsc_clocksource_reliable = 1;
> +	if (!strcmp(str, "perfect")) {
> +		tsc_clocksource_reliable = 1;
> +		sched_clock_stable = 1;
> +	}

	else if(!strcmp(str, "pony")) {
		tsc_clocksource_reliable = 1;
		sched_clock_stable = 1;
		tsc_perfect_smp_synchronization = 1;
	}

-- Steve

>  	if (!strncmp(str, "noirqtime", 9))
>  		no_sched_irq_time = 1;
>  	return 1;
>  




* Re: [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel)
  2012-05-22 21:50       ` Steven Rostedt
@ 2012-05-22 22:22         ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2012-05-22 22:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Gilad Ben-Yossef, Frederic Weisbecker, LKML,
	linaro-sched-sig, Alessio Igor Bogani, Andrew Morton, Avi Kivity,
	Chris Metcalf, Christoph Lameter, Daniel Lezcano, Geoff Levand,
	Ingo Molnar, Max Krasnyansky, Paul E. McKenney,
	Stephen Hemminger, Sven-Thorsten Dietrich, Zen Lin

On Tue, 22 May 2012, Steven Rostedt wrote:

> On Tue, 2012-05-22 at 23:31 +0200, Thomas Gleixner wrote:
> 
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index 590900c..dc8ecc3 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -109,6 +109,10 @@ static int __init tsc_setup(char *str)
> >  {
> >  	if (!strcmp(str, "reliable"))
> >  		tsc_clocksource_reliable = 1;
> > +	if (!strcmp(str, "perfect")) {
> > +		tsc_clocksource_reliable = 1;
> > +		sched_clock_stable = 1;
> > +	}
> 
> 	else if(!strcmp(str, "pony")) {
> 		tsc_clocksource_reliable = 1;
> 		sched_clock_stable = 1;
> 		tsc_perfect_smp_synchronization = 1;

	else if (!strcmp(str, "real"))
	     panic("Can't handle real TSCs!\n");



Thread overview: 96+ messages
2012-03-21 13:58 [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Frederic Weisbecker
2012-03-21 13:58 ` Frederic Weisbecker
2012-04-04 15:33   ` warning in tick_nohz_irq_exit Stephen Hemminger
2012-04-04 20:45     ` Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 01/32] nohz: Separate idle sleeping time accounting from nohz logic Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 02/32] nohz: Make nohz API agnostic against idle ticks cputime accounting Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 03/32] nohz: Rename ts->idle_tick to ts->last_tick Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 04/32] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 05/32] nohz: Move ts->idle_calls incrementation into strict " Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 06/32] nohz: Move next idle expiry time record into idle logic area Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 07/32] cpuset: Set up interface for nohz flag Frederic Weisbecker
2012-03-21 14:50   ` Christoph Lameter
2012-03-22  4:03     ` Mike Galbraith
2012-03-22 16:26       ` Christoph Lameter
2012-03-22 19:20         ` Mike Galbraith
2012-03-27 11:22       ` Frederic Weisbecker
2012-03-27 11:53         ` Mike Galbraith
2012-03-27 11:56           ` Frederic Weisbecker
2012-03-27 12:31             ` Mike Galbraith
2012-03-27 11:19     ` Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 08/32] nohz: Try not to give the timekeeping duty to an adaptive tickless cpu Frederic Weisbecker
2012-03-21 14:52   ` Christoph Lameter
2012-03-27 10:50     ` Frederic Weisbecker
2012-03-27 16:08       ` Christoph Lameter
2012-03-27 16:47         ` Peter Zijlstra
2012-03-28  1:12           ` Christoph Lameter
2012-03-28  8:39             ` Peter Zijlstra
2012-03-28 13:11               ` Dimitri Sivanich
2012-03-28 15:51               ` Chris Metcalf
2012-03-30  1:34         ` Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 09/32] x86: New cpuset nohz irq vector Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 10/32] nohz: Adaptive tick stop and restart on nohz cpuset Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 11/32] nohz/cpuset: Don't turn off the tick if rcu needs it Frederic Weisbecker
2012-03-21 14:54   ` Christoph Lameter
2012-03-22  7:38     ` Gilad Ben-Yossef
2012-03-22 16:18       ` Christoph Lameter
2012-03-27 15:21         ` Gilad Ben-Yossef
2012-03-28 12:39           ` Frederic Weisbecker
2012-03-28 12:57             ` Gilad Ben-Yossef
2012-03-28 13:38               ` Frederic Weisbecker
2012-03-22 17:18       ` Chris Metcalf
2012-03-27 15:31         ` Gilad Ben-Yossef
2012-03-27 15:43           ` Chris Metcalf
2012-03-28  8:36             ` Gilad Ben-Yossef
2012-03-27 12:13     ` Frederic Weisbecker
2012-03-27 16:13       ` Christoph Lameter
2012-03-27 16:24         ` Steven Rostedt
2012-03-28  0:42           ` Christoph Lameter
2012-03-28  1:06             ` Steven Rostedt
2012-03-28  1:19               ` Christoph Lameter
2012-03-28  1:35                 ` Steven Rostedt
2012-03-28  3:17                   ` Steven Rostedt
2012-03-28  7:55                     ` Gilad Ben-Yossef
2012-03-28 12:21                       ` Frederic Weisbecker
2012-03-28 12:41                         ` Gilad Ben-Yossef
2012-03-28 14:02                       ` Steven Rostedt
2012-03-28 11:53         ` Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 12/32] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 13/32] nohz/cpuset: Don't stop the tick if posix cpu timers are running Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 14/32] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 15/32] nohz/cpuset: Restart the tick if printk needs it Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 16/32] rcu: Restart the tick on non-responding adaptive nohz CPUs Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 17/32] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 18/32] nohz: Generalize tickless cpu time accounting Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 19/32] nohz/cpuset: Account user and system times in adaptive nohz mode Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 20/32] nohz/cpuset: New API to flush cputimes on nohz cpusets Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 21/32] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader Frederic Weisbecker
2012-03-27 14:10   ` Gilad Ben-Yossef
2012-03-27 14:23     ` Gilad Ben-Yossef
2012-03-28 11:20       ` Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 22/32] nohz/cpuset: Flush cputimes on procfs stat file read Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 23/32] nohz/cpuset: Flush cputimes for getrusage() and times() syscalls Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 24/32] x86: Syscall hooks for nohz cpusets Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 25/32] x86: Exception " Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 26/32] x86: Add adaptive tickless hooks on do_notify_resume() Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 27/32] nohz: Don't restart the tick before scheduling to idle Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 28/32] rcu: New rcu_user_enter() and rcu_user_exit() APIs Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 29/32] rcu: New rcu_user_enter_irq() and rcu_user_exit_irq() APIs Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 30/32] rcu: Switch to extended quiescent state in userspace from nohz cpuset Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 31/32] nohz: Exit RCU idle mode when we schedule before resuming userspace Frederic Weisbecker
2012-03-21 13:58 ` [PATCH 32/32] nohz/cpuset: Disable under some configs Frederic Weisbecker
2012-03-27 15:02 ` [RFC][PATCH 00/32] Nohz cpusets v2 (adaptive tickless kernel) Gilad Ben-Yossef
2012-03-27 15:04   ` Gilad Ben-Yossef
2012-03-27 15:05     ` Gilad Ben-Yossef
2012-03-27 16:22       ` Christoph Lameter
2012-03-28  6:47         ` Gilad Ben-Yossef
2012-03-27 15:10   ` Peter Zijlstra
2012-03-27 15:18     ` Gilad Ben-Yossef
2012-05-22 21:31     ` Thomas Gleixner
2012-05-22 21:50       ` Steven Rostedt
2012-05-22 22:22         ` Thomas Gleixner
2012-03-28 11:43   ` Frederic Weisbecker
2012-03-30  0:33 ` Kevin Hilman
2012-03-30  0:45   ` Frederic Weisbecker
2012-03-30  2:07     ` Geoff Levand
2012-03-30 14:10       ` Kevin Hilman
