* [ANNOUNCE] 3.8-rc2-nohz2
@ 2013-01-08  2:08 Frederic Weisbecker
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner


Hi,

Here is a new version of the full dynticks patchset based on 3.8-rc2.
It addresses most of the feedback I got on the previous release (see the
list of changes below).

Thank you for your reviews, they are really useful!

This version is pullable at:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	3.8-rc2-nohz2

For the details on how to use it, see the "How to use" section of
https://lwn.net/Articles/530345/.
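
For example (an assumption based on that article and the TODO list
below, which notes that the full_nohz= and rcu_nocbs= masks should
match), a 4-CPU machine could boot with:

	full_nohz=1-3 rcu_nocbs=1-3

to keep CPU 0 as the timekeeping CPU and let CPUs 1-3 run tickless.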

Changes since 3.8-rc1-nohz1:

* Let the user choose between CONFIG_VIRT_CPU_ACCOUNTING_NATIVE and
CONFIG_VIRT_CPU_ACCOUNTING_GEN if both are available (thanks Li Zhong).
[patch 03/33]

* Move code that exports context tracking state to its own commit to
make the review easier (thanks Paul Gortmaker) [patch 02/33]

* Rename vtime_accounting() to vtime_accounting_enabled() (thanks
Paul Gortmaker) [patch 04/33]

* Fix vtime_enter_user / vtime_user_enter confusion. (thanks Li Zhong)
[patch 03/33]

* Fix grammar, spelling and foggy explanations. (thanks Paul Gortmaker)
[patch 04/33]

* Fix "hook" based naming (thanks Ingo Molnar) [patch 01/33]

* Fix is_nocb_cpu() orphan declaration (thanks Namhyung Kim) [patch 22/33]

* Add full dynticks runqueue clock debugging [patch 29-30/33]

* Fix missing rq clock update in update_cpu_load_nohz(), found thanks to
the debugging code in the previous patch. [patch 32/33] That's not yet a
full solution for the nohz rt power scale though.

* Partly handle update_cpu_load_active() [patch 33/33] (we still have to handle
calc_load_account_active)


The TODO list has slightly shrunk and also slightly grown :)

- Handle calc_load_account_active().

- Handle sched_class->task_tick()

- Handle rt power scaling

- Make sure rcu_nocbs mask matches full_nohz's.

- Get the nohz printk patchset merged.

- Posix cpu timers enqueued while the tick is off. Probably no big deal
but I need to look into it.

- Several trivial bits: perf_event_task_tick(), profile_tick(),
sched_clock_tick(), etc...

Enjoy!

---
Frederic Weisbecker (41):
      irq_work: Fix racy IRQ_WORK_BUSY flag setting
      irq_work: Fix racy check on work pending flag
      irq_work: Remove CONFIG_HAVE_IRQ_WORK
      nohz: Add API to check tick state
      irq_work: Don't stop the tick with pending works
      irq_work: Make self-IPIs optable
      printk: Wake up klogd using irq_work
      Merge branch 'nohz/printk-v8' into 3.8-rc2-nohz2-base
      context_tracking: Add comments on interface and internals
      context_tracking: Export context state for generic vtime
      cputime: Generic on-demand virtual cputime accounting
      cputime: Allow dynamic switch between tick/virtual based cputime accounting
      cputime: Use accessors to read task cputime stats
      cputime: Safely read cputime of full dynticks CPUs
      nohz: Basic full dynticks interface
      nohz: Assign timekeeping duty to a non-full-nohz CPU
      nohz: Trace timekeeping update
      nohz: Wake up full dynticks CPUs when a timer gets enqueued
      rcu: Restart the tick on non-responding full dynticks CPUs
      sched: Comment on rq->clock correctness in ttwu_do_wakeup() in nohz
      sched: Update rq clock on nohz CPU before migrating tasks
      sched: Update rq clock on nohz CPU before setting fair group shares
      sched: Update rq clock on tickless CPUs before calling check_preempt_curr()
      sched: Update rq clock earlier in unthrottle_cfs_rq
      sched: Update clock of nohz busiest rq before balancing
      sched: Update rq clock before idle balancing
      sched: Update nohz rq clock before searching busiest group on load balancing
      nohz: Move nohz load balancer selection into idle logic
      nohz: Full dynticks mode
      nohz: Only stop the tick on RCU nocb CPUs
      nohz: Don't turn off the tick if rcu needs it
      nohz: Don't stop the tick if posix cpu timers are running
      nohz: Add some tracing
      rcu: Don't keep the tick for RCU while in userspace
      profiling: Remove unused timer hook
      timer: Don't run non-pinned timer to full dynticks CPUs
      sched: Use an accessor to read rq clock
      sched: Debug nohz rq clock
      sched: Remove broken check for skip clock update
      sched: Update rq clock before rt sched average scale
      sched: Disable lb_bias feature for full dynticks

Steven Rostedt (2):
      irq_work: Flush work on CPU_DYING
      irq_work: Warn if there's still work on cpu_down

 arch/alpha/Kconfig                     |    1 -
 arch/alpha/kernel/osf_sys.c            |    6 +-
 arch/arm/Kconfig                       |    1 -
 arch/arm64/Kconfig                     |    1 -
 arch/blackfin/Kconfig                  |    1 -
 arch/frv/Kconfig                       |    1 -
 arch/hexagon/Kconfig                   |    1 -
 arch/ia64/include/asm/cputime.h        |    6 +-
 arch/ia64/include/asm/thread_info.h    |    4 +-
 arch/ia64/include/asm/xen/minstate.h   |    2 +-
 arch/ia64/kernel/asm-offsets.c         |    2 +-
 arch/ia64/kernel/entry.S               |   16 +-
 arch/ia64/kernel/fsys.S                |    4 +-
 arch/ia64/kernel/head.S                |    4 +-
 arch/ia64/kernel/ivt.S                 |    8 +-
 arch/ia64/kernel/minstate.h            |    2 +-
 arch/ia64/kernel/time.c                |    4 +-
 arch/mips/Kconfig                      |    1 -
 arch/parisc/Kconfig                    |    1 -
 arch/powerpc/Kconfig                   |    1 -
 arch/powerpc/include/asm/cputime.h     |    6 +-
 arch/powerpc/include/asm/lppaca.h      |    2 +-
 arch/powerpc/include/asm/ppc_asm.h     |    4 +-
 arch/powerpc/kernel/entry_64.S         |    4 +-
 arch/powerpc/kernel/time.c             |    4 +-
 arch/powerpc/platforms/pseries/dtl.c   |    6 +-
 arch/powerpc/platforms/pseries/setup.c |    6 +-
 arch/s390/Kconfig                      |    1 -
 arch/s390/kernel/vtime.c               |    6 +-
 arch/sh/Kconfig                        |    1 -
 arch/sparc/Kconfig                     |    1 -
 arch/x86/Kconfig                       |    1 -
 arch/x86/kernel/apm_32.c               |   11 +-
 drivers/isdn/mISDN/stack.c             |    7 +-
 drivers/staging/iio/trigger/Kconfig    |    1 -
 fs/binfmt_elf.c                        |    8 +-
 fs/binfmt_elf_fdpic.c                  |    7 +-
 include/asm-generic/cputime.h          |    1 +
 include/linux/context_tracking.h       |   28 ++++
 include/linux/hardirq.h                |    4 +-
 include/linux/init_task.h              |   11 ++
 include/linux/irq_work.h               |   20 +++
 include/linux/kernel_stat.h            |    2 +-
 include/linux/posix-timers.h           |    1 +
 include/linux/printk.h                 |    3 -
 include/linux/profile.h                |   13 --
 include/linux/rcupdate.h               |    8 +
 include/linux/sched.h                  |   48 +++++++-
 include/linux/tick.h                   |   26 ++++-
 include/linux/vtime.h                  |   51 +++++---
 init/Kconfig                           |   20 ++-
 kernel/acct.c                          |    6 +-
 kernel/context_tracking.c              |   91 ++++++++++----
 kernel/cpu.c                           |    4 +-
 kernel/delayacct.c                     |    7 +-
 kernel/exit.c                          |    6 +-
 kernel/fork.c                          |    8 +-
 kernel/hrtimer.c                       |    3 +-
 kernel/irq_work.c                      |  131 ++++++++++++++-----
 kernel/posix-cpu-timers.c              |   39 +++++-
 kernel/printk.c                        |   36 +++---
 kernel/profile.c                       |   24 ----
 kernel/rcutree.c                       |   19 ++-
 kernel/rcutree.h                       |    1 -
 kernel/rcutree_plugin.h                |   13 +--
 kernel/sched/core.c                    |  104 ++++++++++++++--
 kernel/sched/cputime.c                 |  222 +++++++++++++++++++++++++++-----
 kernel/sched/fair.c                    |   96 ++++++++++----
 kernel/sched/features.h                |    3 +
 kernel/sched/rt.c                      |    8 +-
 kernel/sched/sched.h                   |   50 +++++++
 kernel/sched/stats.h                   |    8 +-
 kernel/sched/stop_task.c               |    8 +-
 kernel/signal.c                        |   12 +-
 kernel/softirq.c                       |   11 +-
 kernel/time/Kconfig                    |    9 ++
 kernel/time/tick-broadcast.c           |    3 +-
 kernel/time/tick-common.c              |    5 +-
 kernel/time/tick-sched.c               |  144 ++++++++++++++++++---
 kernel/timer.c                         |    6 +-
 kernel/tsacct.c                        |   19 ++-
 81 files changed, 1117 insertions(+), 358 deletions(-)

* [PATCH 01/33] context_tracking: Add comments on interface and internals
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

This subsystem lacks explanations of its purpose and
design. Add the missing comments.
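
To illustrate where these probes are expected to fire, here is a sketch
of hypothetical arch glue on the syscall slow path (this is not part of
the patch and the function names are illustrative only):

	void syscall_trace_enter(struct pt_regs *regs)
	{
		/* First stop in the kernel: leave RCU's extended
		 * quiescent state so RCU read sides work again. */
		user_exit();
		/* ... tracing, audit, etc... */
	}

	void syscall_trace_leave(struct pt_regs *regs)
	{
		/* ... tracing, audit, etc... */
		/* Last high level step before returning: from here on,
		 * RCU can stop relying on the tick. */
		user_enter();
	}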

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/context_tracking.c |   73 ++++++++++++++++++++++++++++++++++++++------
 1 files changed, 63 insertions(+), 10 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index e0e07fd..4b360b4 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -1,3 +1,19 @@
+/*
+ * Context tracking: Probe on high level context boundaries such as kernel
+ * and userspace. This includes syscalls and exceptions entry/exit.
+ *
+ * This is used by RCU to remove its dependency on the timer tick while a CPU
+ * runs in userspace.
+ *
+ *  Started by Frederic Weisbecker:
+ *
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
+ *
+ * Many thanks to Gilad Ben-Yossef, Paul McKenney, Ingo Molnar, Andrew Morton,
+ * Steven Rostedt, Peter Zijlstra for suggestions and improvements.
+ *
+ */
+
 #include <linux/context_tracking.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
@@ -6,8 +22,8 @@
 
 struct context_tracking {
 	/*
-	 * When active is false, hooks are not set to
-	 * minimize overhead: TIF flags are cleared
+	 * When active is false, probes are unset in order
+	 * to minimize overhead: TIF flags are cleared
 	 * and calls to user_enter/exit are ignored. This
 	 * may be further optimized using static keys.
 	 */
@@ -24,6 +40,15 @@ static DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
 #endif
 };
 
+/**
+ * user_enter - Inform the context tracking that the CPU is going to
+ *              enter userspace mode.
+ *
+ * This function must be called right before we switch from the kernel
+ * to userspace, when it's guaranteed the remaining kernel instructions
+ * to execute won't use any RCU read side critical section, because this
+ * function sets RCU in an extended quiescent state.
+ */
 void user_enter(void)
 {
 	unsigned long flags;
@@ -39,40 +64,68 @@ void user_enter(void)
 	if (in_interrupt())
 		return;
 
+	/* Kernel threads aren't supposed to go to userspace */
 	WARN_ON_ONCE(!current->mm);
 
 	local_irq_save(flags);
 	if (__this_cpu_read(context_tracking.active) &&
 	    __this_cpu_read(context_tracking.state) != IN_USER) {
 		__this_cpu_write(context_tracking.state, IN_USER);
+		/*
+		 * At this stage, only low level arch entry code remains and
+		 * then we'll run in userspace. We can assume there won't be
+		 * any RCU read-side critical section until the next call to
+		 * user_exit() or rcu_irq_enter(). Let's remove RCU's dependency
+		 * on the tick.
+		 */
 		rcu_user_enter();
 	}
 	local_irq_restore(flags);
 }
 
+
+/**
+ * user_exit - Inform the context tracking that the CPU is
+ *             exiting userspace mode and entering the kernel.
+ *
+ * This function must be called after we enter the kernel from userspace,
+ * before any use of RCU read side critical section. This potentially includes
+ * any high level kernel code like syscalls, exceptions, signal handling, etc...
+ *
+ * This call supports re-entrancy. This way it can be called from any exception
+ * handler without needing to know if we came from userspace or not.
+ */
 void user_exit(void)
 {
 	unsigned long flags;
 
-	/*
-	 * Some contexts may involve an exception occuring in an irq,
-	 * leading to that nesting:
-	 * rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
-	 * This would mess up the dyntick_nesting count though. And rcu_irq_*()
-	 * helpers are enough to protect RCU uses inside the exception. So
-	 * just return immediately if we detect we are in an IRQ.
-	 */
 	if (in_interrupt())
 		return;
 
 	local_irq_save(flags);
 	if (__this_cpu_read(context_tracking.state) == IN_USER) {
 		__this_cpu_write(context_tracking.state, IN_KERNEL);
+		/*
+		 * We are going to run code that may use RCU. Inform
+		 * RCU core about that (ie: we may need the tick again).
+		 */
 		rcu_user_exit();
 	}
 	local_irq_restore(flags);
 }
 
+
+/**
+ * context_tracking_task_switch - context switch the syscall callbacks
+ *
+ * The context tracking uses the syscall slow path to implement its user-kernel
+ * boundaries probes on syscalls. This way it doesn't impact the syscall fast
+ * path on CPUs that don't do context tracking.
+ *
+ * But we need to clear the flag on the previous task because it may later
+ * migrate to some CPU that doesn't do the context tracking. As such the TIF
+ * flag may not be desired there.
+ */
 void context_tracking_task_switch(struct task_struct *prev,
 			     struct task_struct *next)
 {
-- 
1.7.5.4


* [PATCH 02/33] context_tracking: Export context state for generic vtime
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

Export the context state: whether we run in user or kernel mode,
from the context tracking subsystem's point of view.

This is going to be used by the generic virtual cputime
accounting subsystem, which is needed to implement the full
dynticks feature.
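
As a sketch of the intended use (an assumption mirroring how the next
patch calls these accessors), a caller such as the vtime code can query
the exported state like this:

	#include <linux/context_tracking.h>
	#include <linux/vtime.h>

	static void flush_pending_time(struct task_struct *tsk)
	{
		/* Probes are unset on this CPU, nothing is tracked */
		if (!context_tracking_active())
			return;

		if (context_tracking_in_user())
			vtime_account_user(tsk);	/* interrupted userspace */
		else
			vtime_account_system(tsk);	/* interrupted the kernel */
	}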

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/context_tracking.h |   28 ++++++++++++++++++++++++++++
 kernel/context_tracking.c        |   16 +---------------
 2 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index e24339c..b28d161 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,12 +3,40 @@
 
 #ifdef CONFIG_CONTEXT_TRACKING
 #include <linux/sched.h>
+#include <linux/percpu.h>
+
+struct context_tracking {
+	/*
+	 * When active is false, probes are unset in order
+	 * to minimize overhead: TIF flags are cleared
+	 * and calls to user_enter/exit are ignored. This
+	 * may be further optimized using static keys.
+	 */
+	bool active;
+	enum {
+		IN_KERNEL = 0,
+		IN_USER,
+	} state;
+};
+
+DECLARE_PER_CPU(struct context_tracking, context_tracking);
+
+static inline bool context_tracking_in_user(void)
+{
+	return __this_cpu_read(context_tracking.state) == IN_USER;
+}
+
+static inline bool context_tracking_active(void)
+{
+	return __this_cpu_read(context_tracking.active);
+}
 
 extern void user_enter(void);
 extern void user_exit(void);
 extern void context_tracking_task_switch(struct task_struct *prev,
 					 struct task_struct *next);
 #else
+static inline bool context_tracking_in_user(void) { return false; }
 static inline void user_enter(void) { }
 static inline void user_exit(void) { }
 static inline void context_tracking_task_switch(struct task_struct *prev,
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 4b360b4..c952770 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -17,24 +17,10 @@
 #include <linux/context_tracking.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
-#include <linux/percpu.h>
 #include <linux/hardirq.h>
 
-struct context_tracking {
-	/*
-	 * When active is false, probes are unset in order
-	 * to minimize overhead: TIF flags are cleared
-	 * and calls to user_enter/exit are ignored. This
-	 * may be further optimized using static keys.
-	 */
-	bool active;
-	enum {
-		IN_KERNEL = 0,
-		IN_USER,
-	} state;
-};
 
-static DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
+DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
 #ifdef CONFIG_CONTEXT_TRACKING_FORCE
 	.active = true,
 #endif
-- 
1.7.5.4


* [PATCH 03/33] cputime: Generic on-demand virtual cputime accounting
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

If we want to stop the tick beyond idle, we need to be
able to account the cputime without using the tick.

Virtual based cputime accounting solves that problem by
hooking into kernel/user boundaries.

However, implementing CONFIG_VIRT_CPU_ACCOUNTING requires
setting low level hooks and involves more overhead. But
we already have a generic context tracking subsystem
that is required, for RCU's needs, by archs that want to
shut down the tick outside idle.

This patch implements a generic virtual based cputime
accounting that relies on these generic kernel/user hooks.

There are some upsides to doing this:

- No arch code is required to implement CONFIG_VIRT_CPU_ACCOUNTING
if context tracking is already built (it is already necessary for RCU
in full tickless mode).

- We can rely on the generic context tracking subsystem to dynamically
(de)activate the hooks, so that we can switch anytime between virtual
and tick based accounting. This way we don't have the overhead
of the virtual accounting when the tick is running periodically.

And a few downsides:

- It relies on jiffies and the hooks are set in high level code. This
results in less precise cputime accounting than a true native
virtual based cputime accounting, which hooks into low level code and
uses a CPU hardware clock. Precision is not the goal here though.

- There is probably more overhead than with a native virtual based
cputime accounting. But it relies on hooks that are already set anyway
(see the condensed sketch below).
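
A condensed view of the mechanism (taken from the kernel/sched/cputime.c
hunk below; the HZ=100 arithmetic in the comment is an illustrative
assumption):

	static DEFINE_PER_CPU(long, last_jiffies) = INITIAL_JIFFIES;

	void vtime_account_user(struct task_struct *tsk)
	{
		/* Each flush accounts the jiffies elapsed since the
		 * previous flush and moves the per-CPU stamp forward. */
		long delta = jiffies - __this_cpu_read(last_jiffies);
		cputime_t delta_cpu = jiffies_to_cputime(delta);

		__this_cpu_add(last_jiffies, delta);
		/* e.g. with HZ=100, 50ms spent in userspace between two
		 * syscalls flushes 5 ticks worth of utime here. */
		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
	}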

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/ia64/include/asm/cputime.h        |    6 +-
 arch/ia64/include/asm/thread_info.h    |    4 +-
 arch/ia64/include/asm/xen/minstate.h   |    2 +-
 arch/ia64/kernel/asm-offsets.c         |    2 +-
 arch/ia64/kernel/entry.S               |   16 +++---
 arch/ia64/kernel/fsys.S                |    4 +-
 arch/ia64/kernel/head.S                |    4 +-
 arch/ia64/kernel/ivt.S                 |    8 ++--
 arch/ia64/kernel/minstate.h            |    2 +-
 arch/ia64/kernel/time.c                |    4 +-
 arch/powerpc/include/asm/cputime.h     |    6 +-
 arch/powerpc/include/asm/lppaca.h      |    2 +-
 arch/powerpc/include/asm/ppc_asm.h     |    4 +-
 arch/powerpc/kernel/entry_64.S         |    4 +-
 arch/powerpc/kernel/time.c             |    4 +-
 arch/powerpc/platforms/pseries/dtl.c   |    6 +-
 arch/powerpc/platforms/pseries/setup.c |    6 +-
 include/linux/vtime.h                  |   16 ++++++
 init/Kconfig                           |   15 +++++-
 kernel/context_tracking.c              |    6 ++-
 kernel/sched/cputime.c                 |   92 ++++++++++++++++++++++++++++++--
 21 files changed, 164 insertions(+), 49 deletions(-)

diff --git a/arch/ia64/include/asm/cputime.h b/arch/ia64/include/asm/cputime.h
index 7fcf7f0..af15d84 100644
--- a/arch/ia64/include/asm/cputime.h
+++ b/arch/ia64/include/asm/cputime.h
@@ -11,14 +11,14 @@
  * as published by the Free Software Foundation; either version
  * 2 of the License, or (at your option) any later version.
  *
- * If we have CONFIG_VIRT_CPU_ACCOUNTING, we measure cpu time in nsec.
+ * If we have CONFIG_VIRT_CPU_ACCOUNTING_NATIVE, we measure cpu time in nsec.
  * Otherwise we measure cpu time in jiffies using the generic definitions.
  */
 
 #ifndef __IA64_CPUTIME_H
 #define __IA64_CPUTIME_H
 
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 #include <asm-generic/cputime.h>
 #else
 
@@ -105,5 +105,5 @@ static inline void cputime_to_timeval(const cputime_t ct, struct timeval *val)
 
 extern void arch_vtime_task_switch(struct task_struct *tsk);
 
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 #endif /* __IA64_CPUTIME_H */
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index ff2ae41..020d655 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -31,7 +31,7 @@ struct thread_info {
 	mm_segment_t addr_limit;	/* user-level address space limit */
 	int preempt_count;		/* 0=premptable, <0=BUG; will also serve as bh-counter */
 	struct restart_block restart_block;
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	__u64 ac_stamp;
 	__u64 ac_leave;
 	__u64 ac_stime;
@@ -69,7 +69,7 @@ struct thread_info {
 #define task_stack_page(tsk)	((void *)(tsk))
 
 #define __HAVE_THREAD_FUNCTIONS
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 #define setup_thread_stack(p, org)			\
 	*task_thread_info(p) = *task_thread_info(org);	\
 	task_thread_info(p)->ac_stime = 0;		\
diff --git a/arch/ia64/include/asm/xen/minstate.h b/arch/ia64/include/asm/xen/minstate.h
index c57fa91..00cf03e 100644
--- a/arch/ia64/include/asm/xen/minstate.h
+++ b/arch/ia64/include/asm/xen/minstate.h
@@ -1,5 +1,5 @@
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 /* read ar.itc in advance, and use it before leaving bank 0 */
 #define XEN_ACCOUNT_GET_STAMP		\
 	MOV_FROM_ITC(pUStk, p6, r20, r2);
diff --git a/arch/ia64/kernel/asm-offsets.c b/arch/ia64/kernel/asm-offsets.c
index a48bd9a..46c9e30 100644
--- a/arch/ia64/kernel/asm-offsets.c
+++ b/arch/ia64/kernel/asm-offsets.c
@@ -41,7 +41,7 @@ void foo(void)
 	DEFINE(TI_FLAGS, offsetof(struct thread_info, flags));
 	DEFINE(TI_CPU, offsetof(struct thread_info, cpu));
 	DEFINE(TI_PRE_COUNT, offsetof(struct thread_info, preempt_count));
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	DEFINE(TI_AC_STAMP, offsetof(struct thread_info, ac_stamp));
 	DEFINE(TI_AC_LEAVE, offsetof(struct thread_info, ac_leave));
 	DEFINE(TI_AC_STIME, offsetof(struct thread_info, ac_stime));
diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
index e25b784..7a89f0b 100644
--- a/arch/ia64/kernel/entry.S
+++ b/arch/ia64/kernel/entry.S
@@ -724,7 +724,7 @@ GLOBAL_ENTRY(__paravirt_leave_syscall)
 #endif
 .global __paravirt_work_processed_syscall;
 __paravirt_work_processed_syscall:
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	adds r2=PT(LOADRS)+16,r12
 	MOV_FROM_ITC(pUStk, p9, r22, r19)	// fetch time at leave
 	adds r18=TI_FLAGS+IA64_TASK_SIZE,r13
@@ -762,7 +762,7 @@ __paravirt_work_processed_syscall:
 
 	ld8 r29=[r2],16		// M0|1 load cr.ipsr
 	ld8 r28=[r3],16		// M0|1 load cr.iip
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 (pUStk) add r14=TI_AC_LEAVE+IA64_TASK_SIZE,r13
 	;;
 	ld8 r30=[r2],16		// M0|1 load cr.ifs
@@ -793,7 +793,7 @@ __paravirt_work_processed_syscall:
 	ld8.fill r1=[r3],16			// M0|1 load r1
 (pUStk) mov r17=1				// A
 	;;
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 (pUStk) st1 [r15]=r17				// M2|3
 #else
 (pUStk) st1 [r14]=r17				// M2|3
@@ -813,7 +813,7 @@ __paravirt_work_processed_syscall:
 	shr.u r18=r19,16		// I0|1 get byte size of existing "dirty" partition
 	COVER				// B    add current frame into dirty partition & set cr.ifs
 	;;
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	mov r19=ar.bsp			// M2   get new backing store pointer
 	st8 [r14]=r22			// M	save time at leave
 	mov f10=f0			// F    clear f10
@@ -948,7 +948,7 @@ GLOBAL_ENTRY(__paravirt_leave_kernel)
 	adds r16=PT(CR_IPSR)+16,r12
 	adds r17=PT(CR_IIP)+16,r12
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	.pred.rel.mutex pUStk,pKStk
 	MOV_FROM_PSR(pKStk, r22, r29)	// M2 read PSR now that interrupts are disabled
 	MOV_FROM_ITC(pUStk, p9, r22, r29)	// M  fetch time at leave
@@ -981,7 +981,7 @@ GLOBAL_ENTRY(__paravirt_leave_kernel)
 	;;
 	ld8.fill r12=[r16],16
 	ld8.fill r13=[r17],16
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 (pUStk)	adds r3=TI_AC_LEAVE+IA64_TASK_SIZE,r18
 #else
 (pUStk)	adds r18=IA64_TASK_THREAD_ON_USTACK_OFFSET,r18
@@ -989,7 +989,7 @@ GLOBAL_ENTRY(__paravirt_leave_kernel)
 	;;
 	ld8 r20=[r16],16	// ar.fpsr
 	ld8.fill r15=[r17],16
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 (pUStk)	adds r18=IA64_TASK_THREAD_ON_USTACK_OFFSET,r18	// deferred
 #endif
 	;;
@@ -997,7 +997,7 @@ GLOBAL_ENTRY(__paravirt_leave_kernel)
 	ld8.fill r2=[r17]
 (pUStk)	mov r17=1
 	;;
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	//  mmi_ :  ld8 st1 shr;;         mmi_ : st8 st1 shr;;
 	//  mib  :  mov add br        ->  mib  : ld8 add br
 	//  bbb_ :  br  nop cover;;       mbb_ : mov br  cover;;
diff --git a/arch/ia64/kernel/fsys.S b/arch/ia64/kernel/fsys.S
index e662f17..c4cd45d 100644
--- a/arch/ia64/kernel/fsys.S
+++ b/arch/ia64/kernel/fsys.S
@@ -529,7 +529,7 @@ GLOBAL_ENTRY(paravirt_fsys_bubble_down)
 	nop.i 0
 	;;
 	mov ar.rsc=0				// M2   set enforced lazy mode, pl 0, LE, loadrs=0
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	MOV_FROM_ITC(p0, p6, r30, r23)		// M    get cycle for accounting
 #else
 	nop.m 0
@@ -555,7 +555,7 @@ GLOBAL_ENTRY(paravirt_fsys_bubble_down)
 	cmp.ne pKStk,pUStk=r0,r0		// A    set pKStk <- 0, pUStk <- 1
 	br.call.sptk.many b7=ia64_syscall_setup	// B
 	;;
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	// mov.m r30=ar.itc is called in advance
 	add r16=TI_AC_STAMP+IA64_TASK_SIZE,r2
 	add r17=TI_AC_LEAVE+IA64_TASK_SIZE,r2
diff --git a/arch/ia64/kernel/head.S b/arch/ia64/kernel/head.S
index 4738ff7..9be4e49 100644
--- a/arch/ia64/kernel/head.S
+++ b/arch/ia64/kernel/head.S
@@ -1073,7 +1073,7 @@ END(ia64_native_sched_clock)
 sched_clock = ia64_native_sched_clock
 #endif
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 GLOBAL_ENTRY(cycle_to_cputime)
 	alloc r16=ar.pfs,1,0,0,0
 	addl r8=THIS_CPU(ia64_cpu_info) + IA64_CPUINFO_NSEC_PER_CYC_OFFSET,r0
@@ -1091,7 +1091,7 @@ GLOBAL_ENTRY(cycle_to_cputime)
 	shrp r8=r9,r8,IA64_NSEC_PER_CYC_SHIFT
 	br.ret.sptk.many rp
 END(cycle_to_cputime)
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_IA64_BRL_EMU
 
diff --git a/arch/ia64/kernel/ivt.S b/arch/ia64/kernel/ivt.S
index fa25689..689ffca 100644
--- a/arch/ia64/kernel/ivt.S
+++ b/arch/ia64/kernel/ivt.S
@@ -784,7 +784,7 @@ ENTRY(break_fault)
 
 (p8)	adds r28=16,r28				// A    switch cr.iip to next bundle
 (p9)	adds r8=1,r8				// A    increment ei to next slot
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	;;
 	mov b6=r30				// I0   setup syscall handler branch reg early
 #else
@@ -801,7 +801,7 @@ ENTRY(break_fault)
 	//
 ///////////////////////////////////////////////////////////////////////
 	st1 [r16]=r0				// M2|3 clear current->thread.on_ustack flag
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	MOV_FROM_ITC(p0, p14, r30, r18)		// M    get cycle for accounting
 #else
 	mov b6=r30				// I0   setup syscall handler branch reg early
@@ -817,7 +817,7 @@ ENTRY(break_fault)
 	cmp.eq p14,p0=r9,r0			// A    are syscalls being traced/audited?
 	br.call.sptk.many b7=ia64_syscall_setup	// B
 1:
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	// mov.m r30=ar.itc is called in advance, and r13 is current
 	add r16=TI_AC_STAMP+IA64_TASK_SIZE,r13	// A
 	add r17=TI_AC_LEAVE+IA64_TASK_SIZE,r13	// A
@@ -1043,7 +1043,7 @@ END(ia64_syscall_setup)
 	DBG_FAULT(16)
 	FAULT(16)
 
-#if defined(CONFIG_VIRT_CPU_ACCOUNTING) && defined(__IA64_ASM_PARAVIRTUALIZED_NATIVE)
+#if defined(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE) && defined(__IA64_ASM_PARAVIRTUALIZED_NATIVE)
 	/*
 	 * There is no particular reason for this code to be here, other than
 	 * that there happens to be space here that would go unused otherwise.
diff --git a/arch/ia64/kernel/minstate.h b/arch/ia64/kernel/minstate.h
index d56753a..cc82a7d 100644
--- a/arch/ia64/kernel/minstate.h
+++ b/arch/ia64/kernel/minstate.h
@@ -4,7 +4,7 @@
 #include "entry.h"
 #include "paravirt_inst.h"
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 /* read ar.itc in advance, and use it before leaving bank 0 */
 #define ACCOUNT_GET_STAMP				\
 (pUStk) mov.m r20=ar.itc;
diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index b1995ef..94a474d 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -77,7 +77,7 @@ static struct clocksource clocksource_itc = {
 };
 static struct clocksource *itc_clocksource;
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 
 #include <linux/kernel_stat.h>
 
@@ -142,7 +142,7 @@ void vtime_account_idle(struct task_struct *tsk)
 	account_idle_time(vtime_delta(tsk));
 }
 
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 static irqreturn_t
 timer_interrupt (int irq, void *dev_id)
diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index 483733b..607559a 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -8,7 +8,7 @@
  * as published by the Free Software Foundation; either version
  * 2 of the License, or (at your option) any later version.
  *
- * If we have CONFIG_VIRT_CPU_ACCOUNTING, we measure cpu time in
+ * If we have CONFIG_VIRT_CPU_ACCOUNTING_NATIVE, we measure cpu time in
  * the same units as the timebase.  Otherwise we measure cpu time
  * in jiffies using the generic definitions.
  */
@@ -16,7 +16,7 @@
 #ifndef __POWERPC_CPUTIME_H
 #define __POWERPC_CPUTIME_H
 
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 #include <asm-generic/cputime.h>
 #ifdef __KERNEL__
 static inline void setup_cputime_one_jiffy(void) { }
@@ -231,5 +231,5 @@ static inline cputime_t clock_t_to_cputime(const unsigned long clk)
 static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
 
 #endif /* __KERNEL__ */
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 #endif /* __POWERPC_CPUTIME_H */
diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
index 531fe0c3..b1e7f2a 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -145,7 +145,7 @@ struct dtl_entry {
 extern struct kmem_cache *dtl_cache;
 
 /*
- * When CONFIG_VIRT_CPU_ACCOUNTING = y, the cpu accounting code controls
+ * When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE = y, the cpu accounting code controls
  * reading from the dispatch trace log.  If other code wants to consume
  * DTL entries, it can set this pointer to a function that will get
  * called once for each DTL entry that gets processed.
diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
index ea2a86e..2d0e1f5 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -24,7 +24,7 @@
  * user_time and system_time fields in the paca.
  */
 
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 #define ACCOUNT_CPU_USER_ENTRY(ra, rb)
 #define ACCOUNT_CPU_USER_EXIT(ra, rb)
 #define ACCOUNT_STOLEN_TIME
@@ -70,7 +70,7 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
 
 #endif /* CONFIG_PPC_SPLPAR */
 
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 /*
  * Macros for storing registers into and loading registers from
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index b310a05..a0ca42f 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -94,7 +94,7 @@ system_call_common:
 	addi	r9,r1,STACK_FRAME_OVERHEAD
 	ld	r11,exception_marker@toc(r2)
 	std	r11,-16(r9)		/* "regshere" marker */
-#if defined(CONFIG_VIRT_CPU_ACCOUNTING) && defined(CONFIG_PPC_SPLPAR)
+#if defined(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE) && defined(CONFIG_PPC_SPLPAR)
 BEGIN_FW_FTR_SECTION
 	beq	33f
 	/* if from user, see if there are any DTL entries to process */
@@ -110,7 +110,7 @@ BEGIN_FW_FTR_SECTION
 	addi	r9,r1,STACK_FRAME_OVERHEAD
 33:
 END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING && CONFIG_PPC_SPLPAR */
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE && CONFIG_PPC_SPLPAR */
 
 	/*
 	 * A syscall should always be called with interrupts enabled
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index b3b1435..1ffe109 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -143,7 +143,7 @@ EXPORT_SYMBOL_GPL(ppc_proc_freq);
 unsigned long ppc_tb_freq;
 EXPORT_SYMBOL_GPL(ppc_tb_freq);
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 /*
  * Factors for converting from cputime_t (timebase ticks) to
  * jiffies, microseconds, seconds, and clock_t (1/USER_HZ seconds).
@@ -377,7 +377,7 @@ void vtime_account_user(struct task_struct *tsk)
 	account_user_time(tsk, utime, utimescaled);
 }
 
-#else /* ! CONFIG_VIRT_CPU_ACCOUNTING */
+#else /* ! CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 #define calc_cputime_factors()
 #endif
 
diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index a764854..0cc0ac0 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -57,7 +57,7 @@ static u8 dtl_event_mask = 0x7;
  */
 static int dtl_buf_entries = N_DISPATCH_LOG;
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 struct dtl_ring {
 	u64	write_index;
 	struct dtl_entry *write_ptr;
@@ -142,7 +142,7 @@ static u64 dtl_current_index(struct dtl *dtl)
 	return per_cpu(dtl_rings, dtl->cpu).write_index;
 }
 
-#else /* CONFIG_VIRT_CPU_ACCOUNTING */
+#else /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 static int dtl_start(struct dtl *dtl)
 {
@@ -188,7 +188,7 @@ static u64 dtl_current_index(struct dtl *dtl)
 {
 	return lppaca_of(dtl->cpu).dtl_idx;
 }
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 static int dtl_enable(struct dtl *dtl)
 {
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index ca55882..527e12c 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -281,7 +281,7 @@ static struct notifier_block pci_dn_reconfig_nb = {
 
 struct kmem_cache *dtl_cache;
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 /*
  * Allocate space for the dispatch trace log for all possible cpus
  * and register the buffers with the hypervisor.  This is used for
@@ -332,12 +332,12 @@ static int alloc_dispatch_logs(void)
 
 	return 0;
 }
-#else /* !CONFIG_VIRT_CPU_ACCOUNTING */
+#else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 static inline int alloc_dispatch_logs(void)
 {
 	return 0;
 }
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 static int alloc_dispatch_log_kmem_cache(void)
 {
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index ae30ab5..21ef703 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -14,9 +14,25 @@ extern void vtime_account(struct task_struct *tsk);
 static inline void vtime_task_switch(struct task_struct *prev) { }
 static inline void vtime_account_system(struct task_struct *tsk) { }
 static inline void vtime_account_system_irqsafe(struct task_struct *tsk) { }
+static inline void vtime_account_user(struct task_struct *tsk) { }
 static inline void vtime_account(struct task_struct *tsk) { }
 #endif
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
+static inline void vtime_user_enter(struct task_struct *tsk)
+{
+	vtime_account_system(tsk);
+}
+static inline void vtime_user_exit(struct task_struct *tsk)
+{
+	vtime_account_user(tsk);
+}
+#else
+static inline void vtime_user_enter(struct task_struct *tsk) { }
+static inline void vtime_user_exit(struct task_struct *tsk) { }
+#endif
+
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 extern void irqtime_account_irq(struct task_struct *tsk);
 #else
diff --git a/init/Kconfig b/init/Kconfig
index 5cc8713..51b5c33 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -322,6 +322,9 @@ source "kernel/time/Kconfig"
 
 menu "CPU/Task time and stats accounting"
 
+config VIRT_CPU_ACCOUNTING
+	bool
+
 choice
 	prompt "Cputime accounting"
 	default TICK_CPU_ACCOUNTING if !PPC64
@@ -338,9 +341,10 @@ config TICK_CPU_ACCOUNTING
 
 	  If unsure, say Y.
 
-config VIRT_CPU_ACCOUNTING
+config VIRT_CPU_ACCOUNTING_NATIVE
 	bool "Deterministic task and CPU time accounting"
 	depends on HAVE_VIRT_CPU_ACCOUNTING
+	select VIRT_CPU_ACCOUNTING
 	help
 	  Select this option to enable more accurate task and CPU time
 	  accounting.  This is done by reading a CPU counter on each
@@ -350,6 +354,15 @@ config VIRT_CPU_ACCOUNTING
 	  this also enables accounting of stolen time on logically-partitioned
 	  systems.
 
+config VIRT_CPU_ACCOUNTING_GEN
+	bool "Full dynticks CPU time accounting"
+	depends on HAVE_CONTEXT_TRACKING
+	select VIRT_CPU_ACCOUNTING
+	select CONTEXT_TRACKING
+	help
+	  Implement a generic virtual based cputime accounting by using
+	  the context tracking subsystem.
+
 config IRQ_TIME_ACCOUNTING
 	bool "Fine granularity task level IRQ time accounting"
 	depends on HAVE_IRQ_TIME_ACCOUNTING
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index c952770..bd461ad 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -56,7 +56,7 @@ void user_enter(void)
 	local_irq_save(flags);
 	if (__this_cpu_read(context_tracking.active) &&
 	    __this_cpu_read(context_tracking.state) != IN_USER) {
-		__this_cpu_write(context_tracking.state, IN_USER);
+		vtime_user_enter(current);
 		/*
 		 * At this stage, only low level arch entry code remains and
 		 * then we'll run in userspace. We can assume there won't be
@@ -65,6 +65,7 @@ void user_enter(void)
 		 * on the tick.
 		 */
 		rcu_user_enter();
+		__this_cpu_write(context_tracking.state, IN_USER);
 	}
 	local_irq_restore(flags);
 }
@@ -90,12 +91,13 @@ void user_exit(void)
 
 	local_irq_save(flags);
 	if (__this_cpu_read(context_tracking.state) == IN_USER) {
-		__this_cpu_write(context_tracking.state, IN_KERNEL);
 		/*
 		 * We are going to run code that may use RCU. Inform
 		 * RCU core about that (ie: we may need the tick again).
 		 */
 		rcu_user_exit();
+		vtime_user_exit(current);
+		__this_cpu_write(context_tracking.state, IN_KERNEL);
 	}
 	local_irq_restore(flags);
 }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 293b202..3749a0e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -3,6 +3,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/kernel_stat.h>
 #include <linux/static_key.h>
+#include <linux/context_tracking.h>
 #include "sched.h"
 
 
@@ -495,10 +496,24 @@ void vtime_task_switch(struct task_struct *prev)
 #ifndef __ARCH_HAS_VTIME_ACCOUNT
 void vtime_account(struct task_struct *tsk)
 {
-	if (in_interrupt() || !is_idle_task(tsk))
-		vtime_account_system(tsk);
-	else
-		vtime_account_idle(tsk);
+	if (!in_interrupt()) {
+		/*
+		 * If we interrupted user, context_tracking_in_user()
+		 * is 1 because the context tracking doesn't hook
+		 * on irq entry/exit. This way we know if
+		 * we need to flush user time on kernel entry.
+		 */
+		if (context_tracking_in_user()) {
+			vtime_account_user(tsk);
+			return;
+		}
+
+		if (is_idle_task(tsk)) {
+			vtime_account_idle(tsk);
+			return;
+		}
+	}
+	vtime_account_system(tsk);
 }
 EXPORT_SYMBOL_GPL(vtime_account);
 #endif /* __ARCH_HAS_VTIME_ACCOUNT */
@@ -587,3 +602,72 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
 	cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st);
 }
 #endif
+
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+static DEFINE_PER_CPU(long, last_jiffies) = INITIAL_JIFFIES;
+
+static cputime_t get_vtime_delta(void)
+{
+	long delta;
+
+	delta = jiffies - __this_cpu_read(last_jiffies);
+	__this_cpu_add(last_jiffies, delta);
+
+	return jiffies_to_cputime(delta);
+}
+
+void vtime_account_system(struct task_struct *tsk)
+{
+	cputime_t delta_cpu = get_vtime_delta();
+
+	account_system_time(tsk, irq_count(), delta_cpu, cputime_to_scaled(delta_cpu));
+}
+
+void vtime_account_user(struct task_struct *tsk)
+{
+	cputime_t delta_cpu = get_vtime_delta();
+
+	/*
+	 * This is an unfortunate hack: if we flush user time only on
+	 * irq entry, we miss the jiffies update and the time is spuriously
+	 * accounted to system time.
+	 */
+	if (context_tracking_in_user())
+		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+}
+
+void vtime_account_idle(struct task_struct *tsk)
+{
+	cputime_t delta_cpu = get_vtime_delta();
+
+	account_idle_time(delta_cpu);
+}
+
+static int __cpuinit vtime_cpu_notify(struct notifier_block *self,
+				      unsigned long action, void *hcpu)
+{
+	long cpu = (long)hcpu;
+	long *last_jiffies_cpu = per_cpu_ptr(&last_jiffies, cpu);
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		/*
+		 * CHECKME: ensure that's visible by the CPU
+		 * once it wakes up
+		 */
+		*last_jiffies_cpu = jiffies;
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static int __init init_vtime(void)
+{
+	cpu_notifier(vtime_cpu_notify, 0);
+	return 0;
+}
+early_initcall(init_vtime);
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
-- 
1.7.5.4


* [PATCH 04/33] cputime: Allow dynamic switch between tick/virtual based cputime accounting
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

Allow dynamically switching between tick based and virtual based
cputime accounting. This way we can provide a kind of "on-demand"
virtual based cputime accounting. In this mode, the kernel relies
on the user hooks subsystem to dynamically hook into kernel/user
boundaries.

This is in preparation for being able to stop the timer tick in
more places than just the idle state. Doing so will depend on
CONFIG_VIRT_CPU_ACCOUNTING, which makes it possible to account the
cputime without the tick by hooking into kernel/user boundaries.

Depending on whether the tick is stopped or not, we can switch
between tick based and vtime based accounting anytime, in order to
minimize the overhead associated with the user hooks. The key
dispatch is sketched below.
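
A condensed sketch of that dispatch (simplified from the
kernel/sched/cputime.c hunk below):

	void account_process_tick(struct task_struct *p, int user_tick)
	{
		/* When the boundary hooks already account the time, a
		 * stray tick must not account it a second time. */
		if (vtime_accounting_enabled()) {
			vtime_account_user(p);
			return;
		}
		/* ... classic tick based accounting ... */
	}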

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/kernel_stat.h |    2 +-
 include/linux/sched.h       |    4 +-
 include/linux/vtime.h       |    8 ++++++
 kernel/fork.c               |    2 +-
 kernel/sched/cputime.c      |   58 +++++++++++++++++++++++++++---------------
 kernel/time/tick-sched.c    |    5 +++-
 6 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 66b7078..ed5f6ed 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -127,7 +127,7 @@ extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t)
 extern void account_steal_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline void account_process_tick(struct task_struct *tsk, int user)
 {
 	vtime_account_user(tsk);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 206bb08..66b2344 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -605,7 +605,7 @@ struct signal_struct {
 	cputime_t utime, stime, cutime, cstime;
 	cputime_t gtime;
 	cputime_t cgtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	struct cputime prev_cputime;
 #endif
 	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
@@ -1365,7 +1365,7 @@ struct task_struct {
 
 	cputime_t utime, stime, utimescaled, stimescaled;
 	cputime_t gtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	struct cputime prev_cputime;
 #endif
 	unsigned long nvcsw, nivcsw; /* context switch counts */
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 21ef703..5368af9 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -10,12 +10,20 @@ extern void vtime_account_system_irqsafe(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 extern void vtime_account_user(struct task_struct *tsk);
 extern void vtime_account(struct task_struct *tsk);
+
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+extern bool vtime_accounting_enabled(void);
 #else
+static inline bool vtime_accounting_enabled(void) { return true; }
+#endif
+
+#else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 static inline void vtime_task_switch(struct task_struct *prev) { }
 static inline void vtime_account_system(struct task_struct *tsk) { }
 static inline void vtime_account_system_irqsafe(struct task_struct *tsk) { }
 static inline void vtime_account_user(struct task_struct *tsk) { }
 static inline void vtime_account(struct task_struct *tsk) { }
+static inline bool vtime_accounting_enabled(void) { return false; }
 #endif
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
diff --git a/kernel/fork.c b/kernel/fork.c
index 65ca6d2..81b5209 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1230,7 +1230,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 
 	p->utime = p->stime = p->gtime = 0;
 	p->utimescaled = p->stimescaled = 0;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	p->prev_cputime.utime = p->prev_cputime.stime = 0;
 #endif
 #if defined(SPLIT_RSS_COUNTING)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3749a0e..3ea4233 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -317,8 +317,6 @@ out:
 	rcu_read_unlock();
 }
 
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
-
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 /*
  * Account a tick to a process and cpustat
@@ -388,6 +386,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 						struct rq *rq) {}
 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
 
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 /*
  * Account a single tick of cpu time.
  * @p: the process that the cpu time gets accounted to
@@ -398,6 +397,11 @@ void account_process_tick(struct task_struct *p, int user_tick)
 	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
 	struct rq *rq = this_rq();
 
+	if (vtime_accounting_enabled()) {
+		vtime_account_user(p);
+		return;
+	}
+
 	if (sched_clock_irqtime) {
 		irqtime_account_process_tick(p, user_tick, rq);
 		return;
@@ -439,29 +443,13 @@ void account_idle_ticks(unsigned long ticks)
 
 	account_idle_time(jiffies_to_cputime(ticks));
 }
-
 #endif
 
+
 /*
  * Use precise platform statistics if available:
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
-void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
-{
-	*ut = p->utime;
-	*st = p->stime;
-}
-
-void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
-{
-	struct task_cputime cputime;
-
-	thread_group_cputime(p, &cputime);
-
-	*ut = cputime.utime;
-	*st = cputime.stime;
-}
-
 void vtime_account_system_irqsafe(struct task_struct *tsk)
 {
 	unsigned long flags;
@@ -517,8 +505,25 @@ void vtime_account(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(vtime_account);
 #endif /* __ARCH_HAS_VTIME_ACCOUNT */
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 
-#else
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+	*ut = p->utime;
+	*st = p->stime;
+}
+
+void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+	struct task_cputime cputime;
+
+	thread_group_cputime(p, &cputime);
+
+	*ut = cputime.utime;
+	*st = cputime.stime;
+}
+#else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifndef nsecs_to_cputime
 # define nsecs_to_cputime(__nsecs)	nsecs_to_jiffies(__nsecs)
@@ -548,6 +553,12 @@ static void cputime_adjust(struct task_cputime *curr,
 {
 	cputime_t rtime, utime, total;
 
+	if (vtime_accounting_enabled()) {
+		*ut = curr->utime;
+		*st = curr->stime;
+		return;
+	}
+
 	utime = curr->utime;
 	total = utime + curr->stime;
 
@@ -601,7 +612,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
 	thread_group_cputime(p, &cputime);
 	cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st);
 }
-#endif
+#endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static DEFINE_PER_CPU(long, last_jiffies) = INITIAL_JIFFIES;
@@ -643,6 +654,11 @@ void vtime_account_idle(struct task_struct *tsk)
 	account_idle_time(delta_cpu);
 }
 
+bool vtime_accounting_enabled(void)
+{
+	return context_tracking_active();
+}
+
 static int __cpuinit vtime_cpu_notify(struct notifier_block *self,
 				      unsigned long action, void *hcpu)
 {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fb8e5e4..314b9ee 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -632,8 +632,11 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 
 static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 {
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	unsigned long ticks;
+
+	if (vtime_accounting_enabled())
+		return;
 	/*
 	 * We stopped the tick in idle. Update process times would miss the
 	 * time we slept as update_process_times does only a 1 tick
-- 
1.7.5.4


* [PATCH 05/33] cputime: Use accessors to read task cputime stats
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running on a full
dynticks CPU, we'll need to do some extra computation. This
way we can account the time the task spent tickless in
userspace since its last cputime snapshot.
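
As an illustration, the conversion pattern used throughout this
patch replaces direct field reads with the new accessor (a minimal
sketch, "tsk" being any task_struct pointer):

	cputime_t utime, stime;

	/* Before: direct reads of the fields */
	utime = tsk->utime;
	stime = tsk->stime;

	/* After: the accessor, which can later add the tickless delta */
	task_cputime(tsk, &utime, &stime);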

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/alpha/kernel/osf_sys.c |    6 ++++--
 arch/x86/kernel/apm_32.c    |   11 ++++++-----
 drivers/isdn/mISDN/stack.c  |    7 ++++++-
 fs/binfmt_elf.c             |    8 ++++++--
 fs/binfmt_elf_fdpic.c       |    7 +++++--
 include/linux/sched.h       |   18 ++++++++++++++++++
 kernel/acct.c               |    6 ++++--
 kernel/cpu.c                |    4 +++-
 kernel/delayacct.c          |    7 +++++--
 kernel/exit.c               |    6 ++++--
 kernel/posix-cpu-timers.c   |   28 ++++++++++++++++++++++------
 kernel/sched/cputime.c      |    9 +++++----
 kernel/signal.c             |   12 ++++++++----
 kernel/tsacct.c             |   19 +++++++++++++------
 14 files changed, 109 insertions(+), 39 deletions(-)

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index 14db93e..dbc1760 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -1139,6 +1139,7 @@ struct rusage32 {
 SYSCALL_DEFINE2(osf_getrusage, int, who, struct rusage32 __user *, ru)
 {
 	struct rusage32 r;
+	cputime_t utime, stime;
 
 	if (who != RUSAGE_SELF && who != RUSAGE_CHILDREN)
 		return -EINVAL;
@@ -1146,8 +1147,9 @@ SYSCALL_DEFINE2(osf_getrusage, int, who, struct rusage32 __user *, ru)
 	memset(&r, 0, sizeof(r));
 	switch (who) {
 	case RUSAGE_SELF:
-		jiffies_to_timeval32(current->utime, &r.ru_utime);
-		jiffies_to_timeval32(current->stime, &r.ru_stime);
+		task_cputime(current, &utime, &stime);
+		jiffies_to_timeval32(utime, &r.ru_utime);
+		jiffies_to_timeval32(stime, &r.ru_stime);
 		r.ru_minflt = current->min_flt;
 		r.ru_majflt = current->maj_flt;
 		break;
diff --git a/arch/x86/kernel/apm_32.c b/arch/x86/kernel/apm_32.c
index d65464e..8d7012b 100644
--- a/arch/x86/kernel/apm_32.c
+++ b/arch/x86/kernel/apm_32.c
@@ -899,6 +899,7 @@ static void apm_cpu_idle(void)
 	static int use_apm_idle; /* = 0 */
 	static unsigned int last_jiffies; /* = 0 */
 	static unsigned int last_stime; /* = 0 */
+	cputime_t stime;
 
 	int apm_idle_done = 0;
 	unsigned int jiffies_since_last_check = jiffies - last_jiffies;
@@ -906,23 +907,23 @@ static void apm_cpu_idle(void)
 
 	WARN_ONCE(1, "deprecated apm_cpu_idle will be deleted in 2012");
 recalc:
+	task_cputime(current, NULL, &stime);
 	if (jiffies_since_last_check > IDLE_CALC_LIMIT) {
 		use_apm_idle = 0;
-		last_jiffies = jiffies;
-		last_stime = current->stime;
 	} else if (jiffies_since_last_check > idle_period) {
 		unsigned int idle_percentage;
 
-		idle_percentage = current->stime - last_stime;
+		idle_percentage = stime - last_stime;
 		idle_percentage *= 100;
 		idle_percentage /= jiffies_since_last_check;
 		use_apm_idle = (idle_percentage > idle_threshold);
 		if (apm_info.forbid_idle)
 			use_apm_idle = 0;
-		last_jiffies = jiffies;
-		last_stime = current->stime;
 	}
 
+	last_jiffies = jiffies;
+	last_stime = stime;
+
 	bucket = IDLE_LEAKY_MAX;
 
 	while (!need_resched()) {
diff --git a/drivers/isdn/mISDN/stack.c b/drivers/isdn/mISDN/stack.c
index 5f21f62..deda591 100644
--- a/drivers/isdn/mISDN/stack.c
+++ b/drivers/isdn/mISDN/stack.c
@@ -18,6 +18,7 @@
 #include <linux/slab.h>
 #include <linux/mISDNif.h>
 #include <linux/kthread.h>
+#include <linux/sched.h>
 #include "core.h"
 
 static u_int	*debug;
@@ -202,6 +203,9 @@ static int
 mISDNStackd(void *data)
 {
 	struct mISDNstack *st = data;
+#ifdef MISDN_MSG_STATS
+	cputime_t utime, stime;
+#endif
 	int err = 0;
 
 	sigfillset(&current->blocked);
@@ -303,9 +307,10 @@ mISDNStackd(void *data)
 	       "msg %d sleep %d stopped\n",
 	       dev_name(&st->dev->dev), st->msg_cnt, st->sleep_cnt,
 	       st->stopped_cnt);
+	task_cputime(st->thread, &utime, &stime);
 	printk(KERN_DEBUG
 	       "mISDNStackd daemon for %s utime(%ld) stime(%ld)\n",
-	       dev_name(&st->dev->dev), st->thread->utime, st->thread->stime);
+	       dev_name(&st->dev->dev), utime, stime);
 	printk(KERN_DEBUG
 	       "mISDNStackd daemon for %s nvcsw(%ld) nivcsw(%ld)\n",
 	       dev_name(&st->dev->dev), st->thread->nvcsw, st->thread->nivcsw);
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 0c42cdb..49d0b43 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -33,6 +33,7 @@
 #include <linux/elf.h>
 #include <linux/utsname.h>
 #include <linux/coredump.h>
+#include <linux/sched.h>
 #include <asm/uaccess.h>
 #include <asm/param.h>
 #include <asm/page.h>
@@ -1320,8 +1321,11 @@ static void fill_prstatus(struct elf_prstatus *prstatus,
 		cputime_to_timeval(cputime.utime, &prstatus->pr_utime);
 		cputime_to_timeval(cputime.stime, &prstatus->pr_stime);
 	} else {
-		cputime_to_timeval(p->utime, &prstatus->pr_utime);
-		cputime_to_timeval(p->stime, &prstatus->pr_stime);
+		cputime_t utime, stime;
+
+		task_cputime(p, &utime, &stime);
+		cputime_to_timeval(utime, &prstatus->pr_utime);
+		cputime_to_timeval(stime, &prstatus->pr_stime);
 	}
 	cputime_to_timeval(p->signal->cutime, &prstatus->pr_cutime);
 	cputime_to_timeval(p->signal->cstime, &prstatus->pr_cstime);
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index dc84732..cb240dd 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1375,8 +1375,11 @@ static void fill_prstatus(struct elf_prstatus *prstatus,
 		cputime_to_timeval(cputime.utime, &prstatus->pr_utime);
 		cputime_to_timeval(cputime.stime, &prstatus->pr_stime);
 	} else {
-		cputime_to_timeval(p->utime, &prstatus->pr_utime);
-		cputime_to_timeval(p->stime, &prstatus->pr_stime);
+		cputime_t utime, stime;
+
+		task_cputime(p, &utime, &stime);
+		cputime_to_timeval(utime, &prstatus->pr_utime);
+		cputime_to_timeval(stime, &prstatus->pr_stime);
 	}
 	cputime_to_timeval(p->signal->cutime, &prstatus->pr_cutime);
 	cputime_to_timeval(p->signal->cstime, &prstatus->pr_cstime);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 66b2344..d57e20f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1792,6 +1792,24 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
+static inline void task_cputime(struct task_struct *t,
+				cputime_t *utime, cputime_t *stime)
+{
+	if (utime)
+		*utime = t->utime;
+	if (stime)
+		*stime = t->stime;
+}
+
+static inline void task_cputime_scaled(struct task_struct *t,
+				       cputime_t *utimescaled,
+				       cputime_t *stimescaled)
+{
+	if (utimescaled)
+		*utimescaled = t->utimescaled;
+	if (stimescaled)
+		*stimescaled = t->stimescaled;
+}
 extern void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
 extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
 
diff --git a/kernel/acct.c b/kernel/acct.c
index 051e071..e8b1627 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -566,6 +566,7 @@ out:
 void acct_collect(long exitcode, int group_dead)
 {
 	struct pacct_struct *pacct = &current->signal->pacct;
+	cputime_t utime, stime;
 	unsigned long vsize = 0;
 
 	if (group_dead && current->mm) {
@@ -593,8 +594,9 @@ void acct_collect(long exitcode, int group_dead)
 		pacct->ac_flag |= ACORE;
 	if (current->flags & PF_SIGNALED)
 		pacct->ac_flag |= AXSIG;
-	pacct->ac_utime += current->utime;
-	pacct->ac_stime += current->stime;
+	task_cputime(current, &utime, &stime);
+	pacct->ac_utime += utime;
+	pacct->ac_stime += stime;
 	pacct->ac_minflt += current->min_flt;
 	pacct->ac_majflt += current->maj_flt;
 	spin_unlock_irq(&current->sighand->siglock);
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 3046a50..e5d5e8e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -224,11 +224,13 @@ void clear_tasks_mm_cpumask(int cpu)
 static inline void check_for_tasks(int cpu)
 {
 	struct task_struct *p;
+	cputime_t utime, stime;
 
 	write_lock_irq(&tasklist_lock);
 	for_each_process(p) {
+		task_cputime(p, &utime, &stime);
 		if (task_cpu(p) == cpu && p->state == TASK_RUNNING &&
-		    (p->utime || p->stime))
+		    (utime || stime))
 			printk(KERN_WARNING "Task %s (pid = %d) is on cpu %d "
 				"(state = %ld, flags = %x)\n",
 				p->comm, task_pid_nr(p), cpu,
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 418b3f7..d473988 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -106,6 +106,7 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
 	unsigned long long t2, t3;
 	unsigned long flags;
 	struct timespec ts;
+	cputime_t utime, stime, stimescaled, utimescaled;
 
 	/* Though tsk->delays accessed later, early exit avoids
 	 * unnecessary returning of other data
@@ -114,12 +115,14 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
 		goto done;
 
 	tmp = (s64)d->cpu_run_real_total;
-	cputime_to_timespec(tsk->utime + tsk->stime, &ts);
+	task_cputime(tsk, &utime, &stime);
+	cputime_to_timespec(utime + stime, &ts);
 	tmp += timespec_to_ns(&ts);
 	d->cpu_run_real_total = (tmp < (s64)d->cpu_run_real_total) ? 0 : tmp;
 
 	tmp = (s64)d->cpu_scaled_run_real_total;
-	cputime_to_timespec(tsk->utimescaled + tsk->stimescaled, &ts);
+	task_cputime_scaled(tsk, &utimescaled, &stimescaled);
+	cputime_to_timespec(utimescaled + stimescaled, &ts);
 	tmp += timespec_to_ns(&ts);
 	d->cpu_scaled_run_real_total =
 		(tmp < (s64)d->cpu_scaled_run_real_total) ? 0 : tmp;
diff --git a/kernel/exit.c b/kernel/exit.c
index b4df219..5d1b0ff 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -85,6 +85,7 @@ static void __exit_signal(struct task_struct *tsk)
 	bool group_dead = thread_group_leader(tsk);
 	struct sighand_struct *sighand;
 	struct tty_struct *uninitialized_var(tty);
+	cputime_t utime, stime;
 
 	sighand = rcu_dereference_check(tsk->sighand,
 					lockdep_tasklist_lock_is_held());
@@ -123,8 +124,9 @@ static void __exit_signal(struct task_struct *tsk)
 		 * We won't ever get here for the group leader, since it
 		 * will have been the last reference on the signal_struct.
 		 */
-		sig->utime += tsk->utime;
-		sig->stime += tsk->stime;
+		task_cputime(tsk, &utime, &stime);
+		sig->utime += utime;
+		sig->stime += stime;
 		sig->gtime += tsk->gtime;
 		sig->min_flt += tsk->min_flt;
 		sig->maj_flt += tsk->maj_flt;
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index a278cad..165d476 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -155,11 +155,19 @@ static void bump_cpu_timer(struct k_itimer *timer,
 
 static inline cputime_t prof_ticks(struct task_struct *p)
 {
-	return p->utime + p->stime;
+	cputime_t utime, stime;
+
+	task_cputime(p, &utime, &stime);
+
+	return utime + stime;
 }
 static inline cputime_t virt_ticks(struct task_struct *p)
 {
-	return p->utime;
+	cputime_t utime;
+
+	task_cputime(p, &utime, NULL);
+
+	return utime;
 }
 
 static int
@@ -471,18 +479,23 @@ static void cleanup_timers(struct list_head *head,
  */
 void posix_cpu_timers_exit(struct task_struct *tsk)
 {
+	cputime_t utime, stime;
+
 	add_device_randomness((const void*) &tsk->se.sum_exec_runtime,
 						sizeof(unsigned long long));
+	task_cputime(tsk, &utime, &stime);
 	cleanup_timers(tsk->cpu_timers,
-		       tsk->utime, tsk->stime, tsk->se.sum_exec_runtime);
+		       utime, stime, tsk->se.sum_exec_runtime);
 
 }
 void posix_cpu_timers_exit_group(struct task_struct *tsk)
 {
 	struct signal_struct *const sig = tsk->signal;
+	cputime_t utime, stime;
 
+	task_cputime(tsk, &utime, &stime);
 	cleanup_timers(tsk->signal->cpu_timers,
-		       tsk->utime + sig->utime, tsk->stime + sig->stime,
+		       utime + sig->utime, stime + sig->stime,
 		       tsk->se.sum_exec_runtime + sig->sum_sched_runtime);
 }
 
@@ -1226,11 +1239,14 @@ static inline int task_cputime_expired(const struct task_cputime *sample,
 static inline int fastpath_timer_check(struct task_struct *tsk)
 {
 	struct signal_struct *sig;
+	cputime_t utime, stime;
+
+	task_cputime(tsk, &utime, &stime);
 
 	if (!task_cputime_zero(&tsk->cputime_expires)) {
 		struct task_cputime task_sample = {
-			.utime = tsk->utime,
-			.stime = tsk->stime,
+			.utime = utime,
+			.stime = stime,
 			.sum_exec_runtime = tsk->se.sum_exec_runtime
 		};
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3ea4233..07912dd 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -296,6 +296,7 @@ static __always_inline bool steal_account_process_tick(void)
 void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 {
 	struct signal_struct *sig = tsk->signal;
+	cputime_t utime, stime;
 	struct task_struct *t;
 
 	times->utime = sig->utime;
@@ -309,8 +310,9 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 
 	t = tsk;
 	do {
-		times->utime += t->utime;
-		times->stime += t->stime;
+		task_cputime(t, &utime, &stime);
+		times->utime += utime;
+		times->stime += stime;
 		times->sum_exec_runtime += task_sched_runtime(t);
 	} while_each_thread(tsk, t);
 out:
@@ -594,11 +596,10 @@ static void cputime_adjust(struct task_cputime *curr,
 void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 {
 	struct task_cputime cputime = {
-		.utime = p->utime,
-		.stime = p->stime,
 		.sum_exec_runtime = p->se.sum_exec_runtime,
 	};
 
+	task_cputime(p, &cputime.utime, &cputime.stime);
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 7aaa51d..9b3e319 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1638,6 +1638,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	unsigned long flags;
 	struct sighand_struct *psig;
 	bool autoreap = false;
+	cputime_t utime, stime;
 
 	BUG_ON(sig == -1);
 
@@ -1675,8 +1676,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 				       task_uid(tsk));
 	rcu_read_unlock();
 
-	info.si_utime = cputime_to_clock_t(tsk->utime + tsk->signal->utime);
-	info.si_stime = cputime_to_clock_t(tsk->stime + tsk->signal->stime);
+	task_cputime(tsk, &utime, &stime);
+	info.si_utime = cputime_to_clock_t(utime + tsk->signal->utime);
+	info.si_stime = cputime_to_clock_t(stime + tsk->signal->stime);
 
 	info.si_status = tsk->exit_code & 0x7f;
 	if (tsk->exit_code & 0x80)
@@ -1740,6 +1742,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	unsigned long flags;
 	struct task_struct *parent;
 	struct sighand_struct *sighand;
+	cputime_t utime, stime;
 
 	if (for_ptracer) {
 		parent = tsk->parent;
@@ -1758,8 +1761,9 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	info.si_uid = from_kuid_munged(task_cred_xxx(parent, user_ns), task_uid(tsk));
 	rcu_read_unlock();
 
-	info.si_utime = cputime_to_clock_t(tsk->utime);
-	info.si_stime = cputime_to_clock_t(tsk->stime);
+	task_cputime(tsk, &utime, &stime);
+	info.si_utime = cputime_to_clock_t(utime);
+	info.si_stime = cputime_to_clock_t(stime);
 
  	info.si_code = why;
  	switch (why) {
diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 625df0b..017181f 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -32,6 +32,7 @@ void bacct_add_tsk(struct user_namespace *user_ns,
 {
 	const struct cred *tcred;
 	struct timespec uptime, ts;
+	cputime_t utime, stime, utimescaled, stimescaled;
 	u64 ac_etime;
 
 	BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);
@@ -65,10 +66,15 @@ void bacct_add_tsk(struct user_namespace *user_ns,
 	stats->ac_ppid	 = pid_alive(tsk) ?
 		task_tgid_nr_ns(rcu_dereference(tsk->real_parent), pid_ns) : 0;
 	rcu_read_unlock();
-	stats->ac_utime = cputime_to_usecs(tsk->utime);
-	stats->ac_stime = cputime_to_usecs(tsk->stime);
-	stats->ac_utimescaled = cputime_to_usecs(tsk->utimescaled);
-	stats->ac_stimescaled = cputime_to_usecs(tsk->stimescaled);
+
+	task_cputime(tsk, &utime, &stime);
+	stats->ac_utime = cputime_to_usecs(utime);
+	stats->ac_stime = cputime_to_usecs(stime);
+
+	task_cputime_scaled(tsk, &utimescaled, &stimescaled);
+	stats->ac_utimescaled = cputime_to_usecs(utimescaled);
+	stats->ac_stimescaled = cputime_to_usecs(stimescaled);
+
 	stats->ac_minflt = tsk->min_flt;
 	stats->ac_majflt = tsk->maj_flt;
 
@@ -122,13 +128,14 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 void acct_update_integrals(struct task_struct *tsk)
 {
 	if (likely(tsk->mm)) {
-		cputime_t time, dtime;
+		cputime_t time, dtime, stime, utime;
 		struct timeval value;
 		unsigned long flags;
 		u64 delta;
 
 		local_irq_save(flags);
-		time = tsk->stime + tsk->utime;
+		task_cputime(tsk, &utime, &stime);
+		time = stime + utime;
 		dtime = time - tsk->acct_timexpd;
 		jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
 		delta = value.tv_sec;
-- 
1.7.5.4



* [PATCH 06/33] cputime: Safely read cputime of full dynticks CPUs
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 05/33] cputime: Use accessors to read task cputime stats Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-09 14:54   ` Steven Rostedt
  2013-01-08  2:08 ` [PATCH 07/33] nohz: Basic full dynticks interface Frederic Weisbecker
                   ` (26 subsequent siblings)
  32 siblings, 1 reply; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

While remotely reading the cputime of a task running on a
full dynticks CPU, the values stored in the utime/stime fields
of struct task_struct may be stale. These values may be those
of the last kernel <-> user transition snapshot, and we need
to add the tickless time spent since that snapshot.

To fix this, flush the cputime of the dynticks CPUs on
kernel <-> user transitions and record the time and context
where we did so. Then, on top of this snapshot and the current
time, perform the fixup on the reader side from the
task_cputime() accessors.

FIXME: do the same for idle and guest time.
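
In outline, the scheme pairs seqlock writers on the transition
path with retrying remote readers (a sketch of the pattern, not
the exact patch code):

	unsigned int seq;
	cputime_t utime, stime;
	long delta;

	/* writer: kernel -> user transition on the dynticks CPU */
	write_seqlock(&tsk->vtime_seqlock);
	/* flush accumulated system time, stamp "now running in user" */
	write_sequnlock(&tsk->vtime_seqlock);

	/* reader: possibly on another CPU */
	do {
		seq = read_seqbegin(&t->vtime_seqlock);
		utime = t->utime;
		stime = t->stime;
		delta = jiffies - t->prev_jiffies;
		/* add delta to utime or stime depending on the
		 * recorded context (user vs system) */
	} while (read_seqretry(&t->vtime_seqlock, seq));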

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/s390/kernel/vtime.c      |    6 +-
 include/asm-generic/cputime.h |    1 +
 include/linux/hardirq.h       |    4 +-
 include/linux/init_task.h     |   11 ++++
 include/linux/sched.h         |   16 +++++
 include/linux/vtime.h         |   41 ++++++--------
 kernel/fork.c                 |    6 ++
 kernel/sched/cputime.c        |  123 ++++++++++++++++++++++++++++++-----------
 kernel/softirq.c              |    6 +-
 9 files changed, 150 insertions(+), 64 deletions(-)

diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index e84b8b6..ce9cc5a 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -127,7 +127,7 @@ void vtime_account_user(struct task_struct *tsk)
  * Update process times based on virtual cpu times stored by entry.S
  * to the lowcore fields user_timer, system_timer & steal_clock.
  */
-void vtime_account(struct task_struct *tsk)
+void vtime_account_irq_enter(struct task_struct *tsk)
 {
 	struct thread_info *ti = task_thread_info(tsk);
 	u64 timer, system;
@@ -145,10 +145,10 @@ void vtime_account(struct task_struct *tsk)
 
 	virt_timer_forward(system);
 }
-EXPORT_SYMBOL_GPL(vtime_account);
+EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
 
 void vtime_account_system(struct task_struct *tsk)
-__attribute__((alias("vtime_account")));
+__attribute__((alias("vtime_account_irq_enter")));
 EXPORT_SYMBOL_GPL(vtime_account_system);
 
 void __kprobes vtime_stop_cpu(void)
diff --git a/include/asm-generic/cputime.h b/include/asm-generic/cputime.h
index 9a62937..3e704d5 100644
--- a/include/asm-generic/cputime.h
+++ b/include/asm-generic/cputime.h
@@ -10,6 +10,7 @@ typedef unsigned long __nocast cputime_t;
 #define cputime_to_jiffies(__ct)	(__force unsigned long)(__ct)
 #define cputime_to_scaled(__ct)		(__ct)
 #define jiffies_to_cputime(__hz)	(__force cputime_t)(__hz)
+#define jiffies_to_scaled(__hz)		(__force cputime_t)(__hz)
 
 typedef u64 __nocast cputime64_t;
 
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 624ef3f..7105d5c 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -153,7 +153,7 @@ extern void rcu_nmi_exit(void);
  */
 #define __irq_enter()					\
 	do {						\
-		vtime_account_irq_enter(current);	\
+		account_irq_enter_time(current);	\
 		add_preempt_count(HARDIRQ_OFFSET);	\
 		trace_hardirq_enter();			\
 	} while (0)
@@ -169,7 +169,7 @@ extern void irq_enter(void);
 #define __irq_exit()					\
 	do {						\
 		trace_hardirq_exit();			\
-		vtime_account_irq_exit(current);	\
+		account_irq_exit_time(current);		\
 		sub_preempt_count(HARDIRQ_OFFSET);	\
 	} while (0)
 
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6d087c5..a6ef59f 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -10,6 +10,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
 #include <linux/securebits.h>
+#include <linux/seqlock.h>
 #include <net/net_namespace.h>
 
 #ifdef CONFIG_SMP
@@ -141,6 +142,15 @@ extern struct task_group root_task_group;
 # define INIT_PERF_EVENTS(tsk)
 #endif
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+# define INIT_VTIME(tsk)						\
+	.vtime_seqlock = __SEQLOCK_UNLOCKED(tsk.vtime_seqlock),	\
+	.prev_jiffies = INITIAL_JIFFIES, /* CHECKME */		\
+	.prev_jiffies_whence = JIFFIES_SYS,
+#else
+# define INIT_VTIME(tsk)
+#endif
+
 #define INIT_TASK_COMM "swapper"
 
 /*
@@ -210,6 +220,7 @@ extern struct task_group root_task_group;
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
 	INIT_CPUSET_SEQ							\
+	INIT_VTIME(tsk)							\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d57e20f..3bca36e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1368,6 +1368,15 @@ struct task_struct {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	struct cputime prev_cputime;
 #endif
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+	seqlock_t vtime_seqlock;
+	long prev_jiffies;
+	enum {
+		JIFFIES_SLEEPING = 0,
+		JIFFIES_USER,
+		JIFFIES_SYS,
+	} prev_jiffies_whence;
+#endif
 	unsigned long nvcsw, nivcsw; /* context switch counts */
 	struct timespec start_time; 		/* monotonic time */
 	struct timespec real_start_time;	/* boot based time */
@@ -1792,6 +1801,12 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+extern void task_cputime(struct task_struct *t,
+			 cputime_t *utime, cputime_t *stime);
+extern void task_cputime_scaled(struct task_struct *t,
+				cputime_t *utimescaled, cputime_t *stimescaled);
+#else
 static inline void task_cputime(struct task_struct *t,
 				cputime_t *utime, cputime_t *stime)
 {
@@ -1810,6 +1825,7 @@ static inline void task_cputime_scaled(struct task_struct *t,
 	if (stimescaled)
 		*stimescaled = t->stimescaled;
 }
+#endif
 extern void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
 extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
 
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 5368af9..4a60dbd 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -9,34 +9,37 @@ extern void vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_system_irqsafe(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 extern void vtime_account_user(struct task_struct *tsk);
-extern void vtime_account(struct task_struct *tsk);
+extern void vtime_account_irq_enter(struct task_struct *tsk);
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern bool vtime_accounting_enabled(void);
-#else
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline bool vtime_accounting_enabled(void) { return true; }
 #endif
 
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
+
 static inline void vtime_task_switch(struct task_struct *prev) { }
 static inline void vtime_account_system(struct task_struct *tsk) { }
 static inline void vtime_account_system_irqsafe(struct task_struct *tsk) { }
 static inline void vtime_account_user(struct task_struct *tsk) { }
-static inline void vtime_account(struct task_struct *tsk) { }
+static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
 static inline bool vtime_accounting_enabled(void) { return false; }
 #endif
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
-static inline void vtime_user_enter(struct task_struct *tsk)
-{
-	vtime_account_system(tsk);
-}
+extern void arch_vtime_task_switch(struct task_struct *tsk);
+extern void vtime_account_irq_exit(struct task_struct *tsk);
+extern bool vtime_accounting_enabled(void);
+extern void vtime_user_enter(struct task_struct *tsk);
 static inline void vtime_user_exit(struct task_struct *tsk)
 {
 	vtime_account_user(tsk);
 }
 #else
+static inline void vtime_account_irq_exit(struct task_struct *tsk)
+{
+	/* On hard|softirq exit we always account to hard|softirq cputime */
+	vtime_account_system(tsk);
+}
 static inline void vtime_user_enter(struct task_struct *tsk) { }
 static inline void vtime_user_exit(struct task_struct *tsk) { }
 #endif
@@ -47,25 +50,15 @@ extern void irqtime_account_irq(struct task_struct *tsk);
 static inline void irqtime_account_irq(struct task_struct *tsk) { }
 #endif
 
-static inline void vtime_account_irq_enter(struct task_struct *tsk)
+static inline void account_irq_enter_time(struct task_struct *tsk)
 {
-	/*
-	 * Hardirq can interrupt idle task anytime. So we need vtime_account()
-	 * that performs the idle check in CONFIG_VIRT_CPU_ACCOUNTING.
-	 * Softirq can also interrupt idle task directly if it calls
-	 * local_bh_enable(). Such case probably don't exist but we never know.
-	 * Ksoftirqd is not concerned because idle time is flushed on context
-	 * switch. Softirqs in the end of hardirqs are also not a problem because
-	 * the idle time is flushed on hardirq time already.
-	 */
-	vtime_account(tsk);
+	vtime_account_irq_enter(tsk);
 	irqtime_account_irq(tsk);
 }
 
-static inline void vtime_account_irq_exit(struct task_struct *tsk)
+static inline void account_irq_exit_time(struct task_struct *tsk)
 {
-	/* On hard|softirq exit we always account to hard|softirq cputime */
-	vtime_account_system(tsk);
+	vtime_account_irq_exit(tsk);
 	irqtime_account_irq(tsk);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 81b5209..75fd270 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1233,6 +1233,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	p->prev_cputime.utime = p->prev_cputime.stime = 0;
 #endif
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+	seqlock_init(&p->vtime_seqlock);
+	p->prev_jiffies_whence = JIFFIES_SLEEPING; /*CHECKME: idle tasks? */
+	p->prev_jiffies = jiffies;
+#endif
+
 #if defined(SPLIT_RSS_COUNTING)
 	memset(&p->rss_stat, 0, sizeof(p->rss_stat));
 #endif
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 07912dd..bf4f72d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -484,7 +484,7 @@ void vtime_task_switch(struct task_struct *prev)
  * vtime_account().
  */
 #ifndef __ARCH_HAS_VTIME_ACCOUNT
-void vtime_account(struct task_struct *tsk)
+void vtime_account_irq_enter(struct task_struct *tsk)
 {
 	if (!in_interrupt()) {
 		/*
@@ -505,7 +505,7 @@ void vtime_account(struct task_struct *tsk)
 	}
 	vtime_account_system(tsk);
 }
-EXPORT_SYMBOL_GPL(vtime_account);
+EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
 #endif /* __ARCH_HAS_VTIME_ACCOUNT */
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 
@@ -616,41 +616,67 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-static DEFINE_PER_CPU(long, last_jiffies) = INITIAL_JIFFIES;
-
-static cputime_t get_vtime_delta(void)
+static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	long delta;
 
-	delta = jiffies - __this_cpu_read(last_jiffies);
-	__this_cpu_add(last_jiffies, delta);
+	delta = jiffies - tsk->prev_jiffies;
+	tsk->prev_jiffies += delta;
 
 	return jiffies_to_cputime(delta);
 }
 
-void vtime_account_system(struct task_struct *tsk)
+static void __vtime_account_system(struct task_struct *tsk)
 {
-	cputime_t delta_cpu = get_vtime_delta();
+	cputime_t delta_cpu = get_vtime_delta(tsk);
 
 	account_system_time(tsk, irq_count(), delta_cpu, cputime_to_scaled(delta_cpu));
 }
 
+void vtime_account_system(struct task_struct *tsk)
+{
+	write_seqlock(&tsk->vtime_seqlock);
+	__vtime_account_system(tsk);
+	write_sequnlock(&tsk->vtime_seqlock);
+}
+
+void vtime_account_irq_exit(struct task_struct *tsk)
+{
+	write_seqlock(&tsk->vtime_seqlock);
+	if (context_tracking_in_user())
+		tsk->prev_jiffies_whence = JIFFIES_USER;
+	__vtime_account_system(tsk);
+	write_sequnlock(&tsk->vtime_seqlock);
+}
+
 void vtime_account_user(struct task_struct *tsk)
 {
-	cputime_t delta_cpu = get_vtime_delta();
+	cputime_t delta_cpu = get_vtime_delta(tsk);
 
 	/*
 	 * This is an unfortunate hack: if we flush user time only on
 	 * irq entry, we miss the jiffies update and the time is spuriously
 	 * accounted to system time.
 	 */
-	if (context_tracking_in_user())
+	if (context_tracking_in_user()) {
+		write_seqlock(&tsk->vtime_seqlock);
+		tsk->prev_jiffies_whence = JIFFIES_SYS;
 		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+		write_sequnlock(&tsk->vtime_seqlock);
+	}
+}
+
+void vtime_user_enter(struct task_struct *tsk)
+{
+	write_seqlock(&tsk->vtime_seqlock);
+	tsk->prev_jiffies_whence = JIFFIES_USER;
+	__vtime_account_system(tsk);
+	write_sequnlock(&tsk->vtime_seqlock);
 }
 
 void vtime_account_idle(struct task_struct *tsk)
 {
-	cputime_t delta_cpu = get_vtime_delta();
+	cputime_t delta_cpu = get_vtime_delta(tsk);
 
 	account_idle_time(delta_cpu);
 }
@@ -660,31 +686,64 @@ bool vtime_accounting_enabled(void)
 	return context_tracking_active();
 }
 
-static int __cpuinit vtime_cpu_notify(struct notifier_block *self,
-				      unsigned long action, void *hcpu)
+void arch_vtime_task_switch(struct task_struct *prev)
 {
-	long cpu = (long)hcpu;
-	long *last_jiffies_cpu = per_cpu_ptr(&last_jiffies, cpu);
+	write_seqlock(&prev->vtime_seqlock);
+	prev->prev_jiffies_whence = JIFFIES_SLEEPING;
+	write_sequnlock(&prev->vtime_seqlock);
 
-	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		/*
-		 * CHECKME: ensure that's visible by the CPU
-		 * once it wakes up
-		 */
-		*last_jiffies_cpu = jiffies;
-	default:
-		break;
-	}
+	write_seqlock(&current->vtime_seqlock);
+	current->prev_jiffies_whence = JIFFIES_SYS;
+	current->prev_jiffies = jiffies;
+	write_sequnlock(&current->vtime_seqlock);
+}
+
+void task_cputime(struct task_struct *t, cputime_t *utime, cputime_t *stime)
+{
+	unsigned int seq;
+	long delta;
+
+	do {
+		seq = read_seqbegin(&t->vtime_seqlock);
+
+		*utime = t->utime;
+		*stime = t->stime;
+
+		if (t->prev_jiffies_whence == JIFFIES_SLEEPING ||
+		    is_idle_task(t))
+			continue;
 
-	return NOTIFY_OK;
+		delta = jiffies - t->prev_jiffies;
+
+		if (t->prev_jiffies_whence == JIFFIES_USER)
+			*utime += delta;
+		else if (t->prev_jiffies_whence == JIFFIES_SYS)
+			*stime += delta;
+	} while (read_seqretry(&t->vtime_seqlock, seq));
 }
 
-static int __init init_vtime(void)
+void task_cputime_scaled(struct task_struct *t,
+			 cputime_t *utimescaled, cputime_t *stimescaled)
 {
-	cpu_notifier(vtime_cpu_notify, 0);
-	return 0;
+	unsigned int seq;
+	long delta;
+
+	do {
+		seq = read_seqbegin(&t->vtime_seqlock);
+
+		*utimescaled = t->utimescaled;
+		*stimescaled = t->stimescaled;
+
+		if (t->prev_jiffies_whence == JIFFIES_SLEEPING ||
+		    is_idle_task(t))
+			continue;
+
+		delta = jiffies - t->prev_jiffies;
+
+		if (t->prev_jiffies_whence == JIFFIES_USER)
+			*utimescaled += jiffies_to_scaled(delta);
+		else if (t->prev_jiffies_whence == JIFFIES_SYS)
+			*stimescaled += jiffies_to_scaled(delta);
+	} while (read_seqretry(&t->vtime_seqlock, seq));
 }
-early_initcall(init_vtime);
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
diff --git a/kernel/softirq.c b/kernel/softirq.c
index ed567ba..f5cc25f 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -221,7 +221,7 @@ asmlinkage void __do_softirq(void)
 	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
-	vtime_account_irq_enter(current);
+	account_irq_enter_time(current);
 
 	__local_bh_disable((unsigned long)__builtin_return_address(0),
 				SOFTIRQ_OFFSET);
@@ -272,7 +272,7 @@ restart:
 
 	lockdep_softirq_exit();
 
-	vtime_account_irq_exit(current);
+	account_irq_exit_time(current);
 	__local_bh_enable(SOFTIRQ_OFFSET);
 	tsk_restore_flags(current, old_flags, PF_MEMALLOC);
 }
@@ -341,7 +341,7 @@ static inline void invoke_softirq(void)
  */
 void irq_exit(void)
 {
-	vtime_account_irq_exit(current);
+	account_irq_exit_time(current);
 	trace_hardirq_exit();
 	sub_preempt_count(IRQ_EXIT_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
-- 
1.7.5.4



* [PATCH 07/33] nohz: Basic full dynticks interface
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 06/33] cputime: Safely read cputime of full dynticks CPUs Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-02-11 14:35   ` Borislav Petkov
  2013-01-08  2:08 ` [PATCH 08/33] nohz: Assign timekeeping duty to a non-full-nohz CPU Frederic Weisbecker
                   ` (25 subsequent siblings)
  32 siblings, 1 reply; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

Start with a very simple interface to define the full dynticks CPUs:
a cpumask defined at boot time through the "full_nohz="
kernel parameter.

Make sure you keep at least one CPU outside this range to handle
the timekeeping.

Also, the full_nohz= mask must match the rcu_nocbs= mask.
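
For example, on an 8-CPU machine, keeping CPU 0 for timekeeping
(an illustrative command line, not taken from the patch):

	full_nohz=1-7 rcu_nocbs=1-7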

Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/tick.h     |    7 +++++++
 kernel/time/Kconfig      |    9 +++++++++
 kernel/time/tick-sched.c |   23 +++++++++++++++++++++++
 3 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 553272e..2d4f6f0 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -157,6 +157,13 @@ static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 # endif /* !NO_HZ */
 
+#ifdef CONFIG_NO_HZ_FULL
+int tick_nohz_full_cpu(int cpu);
+#else
+static inline int tick_nohz_full_cpu(int cpu) { return 0; }
+#endif
+
+
 # ifdef CONFIG_CPU_IDLE_GOV_MENU
 extern void menu_hrtimer_cancel(void);
 # else
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 8601f0d..dc6381d 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -70,6 +70,15 @@ config NO_HZ
 	  only trigger on an as-needed basis both when the system is
 	  busy and when the system is idle.
 
+config NO_HZ_FULL
+	bool "Full tickless system"
+	depends on NO_HZ && RCU_USER_QS && VIRT_CPU_ACCOUNTING_GEN && RCU_NOCB_CPU && SMP
+	select CONTEXT_TRACKING_FORCE
+	help
+	  Try to be tickless everywhere, not just in idle. (You need
+	  to fill in the "full_nohz=" boot parameter.)
+
+
 config HIGH_RES_TIMERS
 	bool "High Resolution Timer Support"
 	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 314b9ee..494a2aa 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -142,6 +142,29 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
 	profile_tick(CPU_PROFILING);
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+static cpumask_var_t full_nohz_mask;
+bool have_full_nohz_mask;
+
+int tick_nohz_full_cpu(int cpu)
+{
+	if (!have_full_nohz_mask)
+		return 0;
+
+	return cpumask_test_cpu(cpu, full_nohz_mask);
+}
+
+/* Parse the boot-time nohz CPU list from the kernel parameters. */
+static int __init tick_nohz_full_setup(char *str)
+{
+	alloc_bootmem_cpumask_var(&full_nohz_mask);
+	have_full_nohz_mask = true;
+	cpulist_parse(str, full_nohz_mask);
+	return 1;
+}
+__setup("full_nohz=", tick_nohz_full_setup);
+#endif
+
 /*
  * NOHZ - aka dynamic tick functionality
  */
-- 
1.7.5.4



* [PATCH 08/33] nohz: Assign timekeeping duty to a non-full-nohz CPU
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 07/33] nohz: Basic full dynticks interface Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-02-15 11:57   ` Borislav Petkov
  2013-01-08  2:08 ` [PATCH 09/33] nohz: Trace timekeeping update Frederic Weisbecker
                   ` (24 subsequent siblings)
  32 siblings, 1 reply; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

This way the full nohz CPUs can safely run with the tick
stopped, with a guarantee that somebody else is taking
care of the jiffies and gtod progression.

NOTE: this doesn't handle CPU hotplug. Also we could use something
more elaborate wrt. powersaving if we have more than one non-full-nohz
CPU running. But let's use this KISS solution for now.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[fix have_nohz_full_mask offcase]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/time/tick-broadcast.c |    3 ++-
 kernel/time/tick-common.c    |    5 ++++-
 kernel/time/tick-sched.c     |    9 ++++++++-
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index f113755..596c547 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -537,7 +537,8 @@ void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 		bc->event_handler = tick_handle_oneshot_broadcast;
 
 		/* Take the do_timer update */
-		tick_do_timer_cpu = cpu;
+		if (!tick_nohz_full_cpu(cpu))
+			tick_do_timer_cpu = cpu;
 
 		/*
 		 * We must be careful here. There might be other CPUs
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index b1600a6..83f2bd9 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -163,7 +163,10 @@ static void tick_setup_device(struct tick_device *td,
 		 * this cpu:
 		 */
 		if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
-			tick_do_timer_cpu = cpu;
+			if (!tick_nohz_full_cpu(cpu))
+				tick_do_timer_cpu = cpu;
+			else
+				tick_do_timer_cpu = TICK_DO_TIMER_NONE;
 			tick_next_period = ktime_get();
 			tick_period = ktime_set(0, NSEC_PER_SEC / HZ);
 		}
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 494a2aa..b75e302 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -112,7 +112,8 @@ static void tick_sched_do_timer(ktime_t now)
 	 * this duty, then the jiffies update is still serialized by
 	 * jiffies_lock.
 	 */
-	if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
+	if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)
+	    && !tick_nohz_full_cpu(cpu))
 		tick_do_timer_cpu = cpu;
 #endif
 
@@ -163,6 +164,8 @@ static int __init tick_nohz_full_setup(char *str)
 	return 1;
 }
 __setup("full_nohz=", tick_nohz_full_setup);
+#else
+#define have_full_nohz_mask (0)
 #endif
 
 /*
@@ -512,6 +515,10 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 		return false;
 	}
 
+	/* If there are full nohz CPUs around, we need to keep the timekeeping duty */
+	if (have_full_nohz_mask && tick_do_timer_cpu == cpu)
+		return false;
+
 	return true;
 }
 
-- 
1.7.5.4



* [PATCH 09/33] nohz: Trace timekeeping update
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 08/33] nohz: Assign timekeeping duty to a non-full-nohz CPU Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 10/33] nohz: Wake up full dynticks CPUs when a timer gets enqueued Frederic Weisbecker
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

Not for merge. This may become a real tracepoint.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-sched.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b75e302..a35ae96 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -118,8 +118,10 @@ static void tick_sched_do_timer(ktime_t now)
 #endif
 
 	/* Check, if the jiffies need an update */
-	if (tick_do_timer_cpu == cpu)
+	if (tick_do_timer_cpu == cpu) {
+		trace_printk("do timekeeping\n");
 		tick_do_update_jiffies64(now);
+	}
 }
 
 static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
-- 
1.7.5.4



* [PATCH 10/33] nohz: Wake up full dynticks CPUs when a timer gets enqueued
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (8 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 09/33] nohz: Trace timekeeping update Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 11/33] rcu: Restart the tick on non-responding full dynticks CPUs Frederic Weisbecker
                   ` (22 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

Wake up a CPU when a timer list timer is enqueued there while
the CPU is in full dynticks mode. Sending it an IPI makes it
reconsider the next timer to program on top of recent
updates.
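
For instance, arming a timer on a remote full dynticks CPU now
goes through the new wake_up_nohz_cpu() path (a hedged usage
sketch; "my_timer" and "my_func" are made-up names):

	static void my_func(unsigned long data);	/* made-up callback */
	static struct timer_list my_timer;

	setup_timer(&my_timer, my_func, 0);
	my_timer.expires = jiffies + HZ;
	add_timer_on(&my_timer, 3);	/* IPIs CPU 3 if it runs full nohz */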

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/sched.h |    4 ++--
 kernel/sched/core.c   |   18 +++++++++++++++++-
 kernel/timer.c        |    2 +-
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3bca36e..32860ae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2061,9 +2061,9 @@ static inline void idle_task_exit(void) {}
 #endif
 
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-extern void wake_up_idle_cpu(int cpu);
+extern void wake_up_nohz_cpu(int cpu);
 #else
-static inline void wake_up_idle_cpu(int cpu) { }
+static inline void wake_up_nohz_cpu(int cpu) { }
 #endif
 
 extern unsigned int sysctl_sched_latency;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..63b25e2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -587,7 +587,7 @@ unlock:
  * account when the CPU goes back to idle and evaluates the timer
  * wheel for the next timer event.
  */
-void wake_up_idle_cpu(int cpu)
+static void wake_up_idle_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
@@ -617,6 +617,22 @@ void wake_up_idle_cpu(int cpu)
 		smp_send_reschedule(cpu);
 }
 
+static bool wake_up_full_nohz_cpu(int cpu)
+{
+	if (tick_nohz_full_cpu(cpu)) {
+		smp_send_reschedule(cpu);
+		return true;
+	}
+
+	return false;
+}
+
+void wake_up_nohz_cpu(int cpu)
+{
+	if (!wake_up_full_nohz_cpu(cpu))
+		wake_up_idle_cpu(cpu);
+}
+
 static inline bool got_nohz_idle_kick(void)
 {
 	int cpu = smp_processor_id();
diff --git a/kernel/timer.c b/kernel/timer.c
index ff3b516..970b57d 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -936,7 +936,7 @@ void add_timer_on(struct timer_list *timer, int cpu)
 	 * makes sure that a CPU on the way to idle can not evaluate
 	 * the timer wheel.
 	 */
-	wake_up_idle_cpu(cpu);
+	wake_up_nohz_cpu(cpu);
 	spin_unlock_irqrestore(&base->lock, flags);
 }
 EXPORT_SYMBOL_GPL(add_timer_on);
-- 
1.7.5.4



* [PATCH 11/33] rcu: Restart the tick on non-responding full dynticks CPUs
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (9 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 10/33] nohz: Wake up full dynticks CPUs when a timer gets enqueued Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 12/33] sched: Comment on rq->clock correctness in ttwu_do_wakeup() in nohz Frederic Weisbecker
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

When a CPU in full dynticks mode doesn't respond in time to
complete a grace period, issue it a specific IPI so that it
restarts the tick and chases a quiescent state.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rcutree.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e441b77..302d360 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -53,6 +53,7 @@
 #include <linux/delay.h>
 #include <linux/stop_machine.h>
 #include <linux/random.h>
+#include <linux/tick.h>
 
 #include "rcutree.h"
 #include <trace/events/rcu.h>
@@ -743,6 +744,12 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp)
 	return (rdp->dynticks_snap & 0x1) == 0;
 }
 
+static void rcu_kick_nohz_cpu(int cpu)
+{
+	if (tick_nohz_full_cpu(cpu))
+		smp_send_reschedule(cpu);
+}
+
 /*
  * Return true if the specified CPU has passed through a quiescent
  * state by virtue of being in or having passed through an dynticks
@@ -790,6 +797,9 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 		rdp->offline_fqs++;
 		return 1;
 	}
+
+	rcu_kick_nohz_cpu(rdp->cpu);
+
 	return 0;
 }
 
-- 
1.7.5.4



* [PATCH 12/33] sched: Comment on rq->clock correctness in ttwu_do_wakeup() in nohz
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (10 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 11/33] rcu: Restart the tick on non-responding full dynticks CPUs Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 13/33] sched: Update rq clock on nohz CPU before migrating tasks Frederic Weisbecker
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

Just to avoid confusion.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 63b25e2..bfac40f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1302,6 +1302,12 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 	if (p->sched_class->task_woken)
 		p->sched_class->task_woken(rq, p);
 
+	/*
+	 * For the adaptive nohz case: we called ttwu_activate(),
+	 * which just updated the rq clock. There is an
+	 * exception when p->on_rq != 0, but in that case
+	 * we are not idle and rq->idle_stamp == 0.
+	 */
 	if (rq->idle_stamp) {
 		u64 delta = rq->clock - rq->idle_stamp;
 		u64 max = 2*sysctl_sched_migration_cost;
-- 
1.7.5.4



* [PATCH 13/33] sched: Update rq clock on nohz CPU before migrating tasks
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (11 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 12/33] sched: Comment on rq->clock correctness in ttwu_do_wakeup() in nohz Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 14/33] sched: Update rq clock on nohz CPU before setting fair group shares Frederic Weisbecker
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

The sched_class::put_prev_task() callbacks of the rt and fair
classes refer to the rq clock to update their runtime
statistics. A CPU running in tickless mode may carry a stale
value, so we need to update the clock there.
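
The resulting pattern for callers (a minimal sketch built on the
update_nohz_rq_clock() helper added below):

	raw_spin_lock(&rq->lock);
	/* no-op unless cpu_of(rq) runs in full dynticks mode */
	update_nohz_rq_clock(rq);
	/* ... code that relies on an up-to-date rq->clock ... */
	raw_spin_unlock(&rq->lock);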

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c  |    6 ++++++
 kernel/sched/sched.h |    7 +++++++
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bfac40f..2fcbb03 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4894,6 +4894,12 @@ static void migrate_tasks(unsigned int dead_cpu)
 	 */
 	rq->stop = NULL;
 
+	/*
+	 * ->put_prev_task() need to have an up-to-date value
+	 * of rq->clock[_task]
+	 */
+	update_nohz_rq_clock(rq);
+
 	for ( ; ; ) {
 		/*
 		 * There's this thread running, bail when that's the only
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..f24d91e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3,6 +3,7 @@
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
+#include <linux/tick.h>
 
 #include "cpupri.h"
 
@@ -963,6 +964,12 @@ static inline void dec_nr_running(struct rq *rq)
 
 extern void update_rq_clock(struct rq *rq);
 
+static inline void update_nohz_rq_clock(struct rq *rq)
+{
+	if (tick_nohz_full_cpu(cpu_of(rq)))
+		update_rq_clock(rq);
+}
+
 extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
 extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 14/33] sched: Update rq clock on nohz CPU before setting fair group shares
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (12 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 13/33] sched: Update rq clock on nohz CPU before migrating tasks Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 15/33] sched: Update rq clock on tickless CPUs before calling check_preempt_curr() Frederic Weisbecker
                   ` (18 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

We may update the execution time (sched_group_set_shares() ->
update_cfs_shares() -> reweight_entity() -> update_curr()) before
reweighting the entity after updating the group shares, and this requires
an up-to-date version of the runqueue clock. Let's update it on the target
CPU if it runs tickless, because scheduler_tick() is not there to maintain
it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/fair.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..a96f0f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6068,6 +6068,11 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		se = tg->se[i];
 		/* Propagate contribution to hierarchy */
 		raw_spin_lock_irqsave(&rq->lock, flags);
+		/*
+		 * We may call update_curr() which needs an up-to-date
+		 * version of rq clock if the CPU runs tickless.
+		 */
+		update_nohz_rq_clock(rq);
 		for_each_sched_entity(se)
 			update_cfs_shares(group_cfs_rq(se));
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 15/33] sched: Update rq clock on tickless CPUs before calling check_preempt_curr()
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (13 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 14/33] sched: Update rq clock on nohz CPU before setting fair group shares Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 16/33] sched: Update rq clock earlier in unthrottle_cfs_rq Frederic Weisbecker
                   ` (17 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

check_preempt_wakeup() of the fair class needs an up-to-date sched
clock value to update the runtime stats of the current task.

When a task is woken up, activate_task() is usually called right before
ttwu_do_wakeup(), unless the task is already on the runqueue. In that
case we need to update the rq clock manually if the CPU runs tickless,
because ttwu_do_wakeup() calls check_preempt_wakeup().

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c |   17 ++++++++++++++++-
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2fcbb03..3c1a806 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1346,6 +1346,12 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 
 	rq = __task_rq_lock(p);
 	if (p->on_rq) {
+		/*
+		 * Ensure check_preempt_curr() won't deal with a stale value
+		 * of rq clock if the CPU is tickless. BTW do we actually need
+		 * check_preempt_curr() to be called here?
+		 */
+		update_nohz_rq_clock(rq);
 		ttwu_do_wakeup(rq, p, wake_flags);
 		ret = 1;
 	}
@@ -1523,8 +1529,17 @@ static void try_to_wake_up_local(struct task_struct *p)
 	if (!(p->state & TASK_NORMAL))
 		goto out;
 
-	if (!p->on_rq)
+	if (!p->on_rq) {
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+	} else {
+		/*
+		 * Even if the task is on the runqueue, we still
+		 * need to ensure check_preempt_curr() won't
+		 * deal with a stale rq clock value on a tickless
+		 * CPU.
+		 */
+		update_nohz_rq_clock(rq);
+	}
 
 	ttwu_do_wakeup(rq, p, 0);
 	ttwu_stat(p, smp_processor_id(), 0);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 16/33] sched: Update rq clock earlier in unthrottle_cfs_rq
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (14 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 15/33] sched: Update rq clock on tickless CPUs before calling check_preempt_curr() Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 17/33] sched: Update clock of nohz busiest rq before balancing Frederic Weisbecker
                   ` (16 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

In this function we make use of rq->clock right before the
update of the rq clock. Let's move the update_rq_clock() call
up, before that use, to avoid reading a stale rq clock value.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/fair.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a96f0f2..3d65ac7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2279,14 +2279,15 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	long task_delta;
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
-
 	cfs_rq->throttled = 0;
+
+	update_rq_clock(rq);
+
 	raw_spin_lock(&cfs_b->lock);
 	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_clock;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 
-	update_rq_clock(rq);
 	/* update hierarchical throttle state */
 	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 17/33] sched: Update clock of nohz busiest rq before balancing
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (15 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 16/33] sched: Update rq clock earlier in unthrottle_cfs_rq Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08 10:20   ` Li Zhong
  2013-01-08  2:08 ` [PATCH 18/33] sched: Update rq clock before idle balancing Frederic Weisbecker
                   ` (15 subsequent siblings)
  32 siblings, 1 reply; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

move_tasks() and active_load_balance_cpu_stop() both need
the busiest rq clock to be up to date, because they may end
up calling can_migrate_task(), which uses rq->clock_task
to determine whether the task running on the busiest runqueue
is cache hot.

Hence if the busiest runqueue is tickless, update its clock
before reading it.
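
Condensed, the fix has this shape (a sketch of the load_balance()
hunks below; the clock_updated flag avoids refreshing the clock twice
in the same balancing pass):

    cur_ld_moved = move_tasks(&env);
    /*
     * A retry via "goto more_balance" or the active balance below may
     * call can_migrate_task() again, so refresh the clock once here.
     */
    update_nohz_rq_clock(busiest);  /* no-op unless busiest is full nohz */
    clock_updated = 1;
    ...
    if (!busiest->active_balance) {
        busiest->active_balance = 1;
        if (!clock_updated)
            update_nohz_rq_clock(busiest);
    }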

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ Forward port conflicts ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/sched/fair.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d65ac7..e78d81104 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5002,6 +5002,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 {
 	int ld_moved, cur_ld_moved, active_balance = 0;
 	int lb_iterations, max_lb_iterations;
+	int clock_updated;
 	struct sched_group *group;
 	struct rq *busiest;
 	unsigned long flags;
@@ -5045,6 +5046,7 @@ redo:
 
 	ld_moved = 0;
 	lb_iterations = 1;
+	clock_updated = 0;
 	if (busiest->nr_running > 1) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
@@ -5068,6 +5070,14 @@ more_balance:
 		 */
 		cur_ld_moved = move_tasks(&env);
 		ld_moved += cur_ld_moved;
+
+		/*
+		 * move_tasks() may end up calling can_migrate_task(), which
+		 * requires an up-to-date value of the rq clock.
+		 */
+		update_nohz_rq_clock(busiest);
+		clock_updated = 1;
+
 		double_rq_unlock(env.dst_rq, busiest);
 		local_irq_restore(flags);
 
@@ -5163,6 +5173,13 @@ more_balance:
 				busiest->active_balance = 1;
 				busiest->push_cpu = this_cpu;
 				active_balance = 1;
+				/*
+				 * active_load_balance_cpu_stop() may end up calling
+				 * can_migrate_task(), which requires an up-to-date
+				 * value of the rq clock.
+				 */
+				if (!clock_updated)
+					update_nohz_rq_clock(busiest);
 			}
 			raw_spin_unlock_irqrestore(&busiest->lock, flags);
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 18/33] sched: Update rq clock before idle balancing
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (16 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 17/33] sched: Update clock of nohz busiest rq before balancing Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 19/33] sched: Update nohz rq clock before searching busiest group on load balancing Frederic Weisbecker
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

idle_balance() is called from schedule() right before we schedule the
idle task. It needs to record the idle timestamp at that time, and for
this the rq clock must be accurate. If the CPU is running tickless,
we need to update the rq clock manually.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/fair.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e78d81104..698137d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5241,6 +5241,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	int pulled_task = 0;
 	unsigned long next_balance = jiffies + HZ;
 
+	update_nohz_rq_clock(this_rq);
 	this_rq->idle_stamp = this_rq->clock;
 
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 19/33] sched: Update nohz rq clock before searching busiest group on load balancing
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (17 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 18/33] sched: Update rq clock before idle balancing Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 20/33] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

While load balancing an rq target, we look for the busiest group.
This operation may require an up-to-date rq clock if we end up
calling scale_rt_power(). To this end, update it manually if the
target is running tickless.

DOUBT: don't we actually also need this in the vanilla kernel, in
case this_cpu is in dyntick-idle mode?

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/fair.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 698137d..473f50f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5023,6 +5023,19 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
 	schedstat_inc(sd, lb_count[idle]);
 
+	/*
+	 * find_busiest_group() may need an up-to-date rq clock
+	 * (see scale_rt_power()). If the CPU is nohz, its
+	 * clock may be stale.
+	 */
+	if (tick_nohz_full_cpu(this_cpu)) {
+		local_irq_save(flags);
+		raw_spin_lock(&this_rq->lock);
+		update_rq_clock(this_rq);
+		raw_spin_unlock(&this_rq->lock);
+		local_irq_restore(flags);
+	}
+
 redo:
 	group = find_busiest_group(&env, balance);
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 20/33] nohz: Move nohz load balancer selection into idle logic
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (18 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 19/33] sched: Update nohz rq clock before searching busiest group on load balancing Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 21/33] nohz: Full dynticks mode Frederic Weisbecker
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

[ ** BUGGY PATCH: I need to put more thinking into this ** ]

We want the nohz load balancer to be an idle CPU, thus
move that selection to the strict dyntick idle logic.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ added movement of calc_load_exit_idle() ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/time/tick-sched.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a35ae96..5cefed8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -444,9 +444,6 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 		 * the scheduler tick in nohz_restart_sched_tick.
 		 */
 		if (!ts->tick_stopped) {
-			nohz_balance_enter_idle(cpu);
-			calc_load_enter_idle();
-
 			ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
 		}
@@ -542,8 +539,11 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts)
 			ts->idle_expires = expires;
 		}
 
-		if (!was_stopped && ts->tick_stopped)
+		if (!was_stopped && ts->tick_stopped) {
 			ts->idle_jiffies = ts->last_jiffies;
+			nohz_balance_enter_idle(cpu);
+			calc_load_enter_idle();
+		}
 	}
 }
 
@@ -651,7 +651,6 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 	tick_do_update_jiffies64(now);
 	update_cpu_load_nohz();
 
-	calc_load_exit_idle();
 	touch_softlockup_watchdog();
 	/*
 	 * Cancel the scheduled timer and restore the tick
@@ -711,6 +710,8 @@ void tick_nohz_idle_exit(void)
 		tick_nohz_stop_idle(cpu, now);
 
 	if (ts->tick_stopped) {
+		nohz_balance_enter_idle(cpu);
+		calc_load_exit_idle();
 		tick_nohz_restart_sched_tick(ts, now);
 		tick_nohz_account_idle_ticks(ts);
 	}
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 21/33] nohz: Full dynticks mode
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (19 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 20/33] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 22/33] nohz: Only stop the tick on RCU nocb CPUs Frederic Weisbecker
                   ` (11 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

When a CPU is in full dynticks mode, try to switch
it to nohz mode from the interrupt exit path if it is
running a single non-idle task.

Then restart the tick if a second task gets enqueued
while the timer is stopped, so that the scheduler
tick is rearmed.
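
The enqueue side and the IPI target pair up like this (a condensed
sketch of the sched.h and core.c hunks below):

    /* Enqueue side, in inc_nr_running(): */
    rq->nr_running++;
    if (rq->nr_running == 2 && tick_nohz_full_cpu(rq->cpu)) {
        smp_wmb();                      /* order nr_running write vs the IPI */
        smp_send_reschedule(rq->cpu);   /* kick the tick back on */
    }

    /* IPI target, in sched_can_stop_tick(): */
    smp_rmb();                  /* pairs with the smp_wmb() above */
    if (this_rq()->nr_running > 1)
        return false;           /* a second task needs the tick for preemption */
    return true;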

[TODO: Check remaining things to be done from scheduler_tick()]

[ Included build fix from Geoff Levand ]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/sched.h    |    6 +++++
 include/linux/tick.h     |    2 +
 kernel/sched/core.c      |   22 ++++++++++++++++++++-
 kernel/sched/sched.h     |   10 +++++++++
 kernel/softirq.c         |    5 ++-
 kernel/time/tick-sched.c |   47 ++++++++++++++++++++++++++++++++++++++++-----
 6 files changed, 83 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 32860ae..132897d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2846,6 +2846,12 @@ static inline void inc_syscw(struct task_struct *tsk)
 #define TASK_SIZE_OF(tsk)	TASK_SIZE
 #endif
 
+#ifdef CONFIG_NO_HZ_FULL
+extern bool sched_can_stop_tick(void);
+#else
+static inline bool sched_can_stop_tick(void) { return false; }
+#endif
+
 #ifdef CONFIG_MM_OWNER
 extern void mm_update_next_owner(struct mm_struct *mm);
 extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 2d4f6f0..dfb90ea 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -159,8 +159,10 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 
 #ifdef CONFIG_NO_HZ_FULL
 int tick_nohz_full_cpu(int cpu);
+extern void tick_nohz_full_check(void);
 #else
 static inline int tick_nohz_full_cpu(int cpu) { return 0; }
+static inline void tick_nohz_full_check(void) { }
 #endif
 
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c1a806..7b6156a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1238,6 +1238,24 @@ static void update_avg(u64 *avg, u64 sample)
 }
 #endif
 
+#ifdef CONFIG_NO_HZ_FULL
+bool sched_can_stop_tick(void)
+{
+	struct rq *rq;
+
+	rq = this_rq();
+
+	/* Make sure rq->nr_running update is visible after the IPI */
+	smp_rmb();
+
+	/* More than one running task needs preemption */
+	if (rq->nr_running > 1)
+		return false;
+
+	return true;
+}
+#endif
+
 static void
 ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 {
@@ -1380,7 +1398,8 @@ static void sched_ttwu_pending(void)
 
 void scheduler_ipi(void)
 {
-	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
+	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
+	    && !tick_nohz_full_cpu(smp_processor_id()))
 		return;
 
 	/*
@@ -1397,6 +1416,7 @@ void scheduler_ipi(void)
 	 * somewhat pessimize the simple resched case.
 	 */
 	irq_enter();
+	tick_nohz_full_check();
 	sched_ttwu_pending();
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f24d91e..63915fe 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -955,6 +955,16 @@ static inline u64 steal_ticks(u64 steal)
 static inline void inc_nr_running(struct rq *rq)
 {
 	rq->nr_running++;
+
+#ifdef CONFIG_NO_HZ_FULL
+	if (rq->nr_running == 2) {
+		if (tick_nohz_full_cpu(rq->cpu)) {
+			/* Order rq->nr_running write against the IPI */
+			smp_wmb();
+			smp_send_reschedule(rq->cpu);
+		}
+	}
+#endif
 }
 
 static inline void dec_nr_running(struct rq *rq)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index f5cc25f..6342078 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -307,7 +307,8 @@ void irq_enter(void)
 	int cpu = smp_processor_id();
 
 	rcu_irq_enter();
-	if (is_idle_task(current) && !in_interrupt()) {
+
+	if ((is_idle_task(current) || tick_nohz_full_cpu(cpu)) && !in_interrupt()) {
 		/*
 		 * Prevent raise_softirq from needlessly waking up ksoftirqd
 		 * here, as softirq will be serviced on return from interrupt.
@@ -349,7 +350,7 @@ void irq_exit(void)
 
 #ifdef CONFIG_NO_HZ
 	/* Make sure that timer wheel updates are propagated */
-	if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
+	if (!in_interrupt())
 		tick_nohz_irq_exit();
 #endif
 	rcu_irq_exit();
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5cefed8..c1a1917 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -587,6 +587,24 @@ void tick_nohz_idle_enter(void)
 	local_irq_enable();
 }
 
+static void tick_nohz_full_stop_tick(struct tick_sched *ts)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	int cpu = smp_processor_id();
+
+	if (!tick_nohz_full_cpu(cpu) || is_idle_task(current))
+		return;
+
+	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
+		return;
+
+	if (!sched_can_stop_tick())
+		return;
+
+	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+#endif
+}
+
 /**
  * tick_nohz_irq_exit - update next tick event from interrupt exit
  *
@@ -599,12 +617,15 @@ void tick_nohz_irq_exit(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 
-	if (!ts->inidle)
-		return;
-
-	/* Cancel the timer because CPU already waken up from the C-states*/
-	menu_hrtimer_cancel();
-	__tick_nohz_idle_enter(ts);
+	if (ts->inidle) {
+		if (!need_resched()) {
+			/* Cancel the timer because the CPU was already woken up from the C-states */
+			menu_hrtimer_cancel();
+			__tick_nohz_idle_enter(ts);
+		}
+	} else {
+		tick_nohz_full_stop_tick(ts);
+	}
 }
 
 /**
@@ -835,6 +856,20 @@ static inline void tick_check_nohz(int cpu) { }
 
 #endif /* NO_HZ */
 
+#ifdef CONFIG_NO_HZ_FULL
+void tick_nohz_full_check(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (tick_nohz_full_cpu(smp_processor_id())) {
+		if (ts->tick_stopped && !is_idle_task(current)) {
+			if (!sched_can_stop_tick())
+				tick_nohz_restart_sched_tick(ts, ktime_get());
+		}
+	}
+}
+#endif /* CONFIG_NO_HZ_FULL */
+
 /*
  * Called from irq_enter to notify about the possible interruption of idle()
  */
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 22/33] nohz: Only stop the tick on RCU nocb CPUs
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (20 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 21/33] nohz: Full dynticks mode Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 23/33] nohz: Don't turn off the tick if rcu needs it Frederic Weisbecker
                   ` (10 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

On a full dynticks CPU, we want the RCU callbacks to be
offloaded to another CPU; otherwise we need to keep
the tick to wait for grace period completion.

Ensure the full dynticks CPU is also an rcu_nocb one.
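
The resulting stop condition (condensed from the tick-sched.c hunk
below; later patches in this series add more checks):

    static bool can_stop_full_tick(int cpu)
    {
        if (!sched_can_stop_tick())
            return false;       /* more than one runnable task */

        if (!rcu_is_nocb_cpu(cpu))
            return false;       /* callbacks handled locally: keep the tick */

        return true;
    }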

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rcupdate.h |    7 +++++++
 kernel/rcutree.c         |    6 +++---
 kernel/rcutree.h         |    1 -
 kernel/rcutree_plugin.h  |   13 ++++---------
 kernel/time/tick-sched.c |   20 +++++++++++++++++---
 5 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 275aa3f..829312e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -992,4 +992,11 @@ static inline notrace void rcu_read_unlock_sched_notrace(void)
 #define kfree_rcu(ptr, rcu_head)					\
 	__kfree_rcu(&((ptr)->rcu_head), offsetof(typeof(*(ptr)), rcu_head))
 
+#ifdef CONFIG_RCU_NOCB_CPU
+bool rcu_is_nocb_cpu(int cpu);
+#else
+static inline bool rcu_is_nocb_cpu(int cpu) { return false; }
+#endif
+
+
 #endif /* __LINUX_RCUPDATE_H */
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 302d360..e9e0ffa 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1589,7 +1589,7 @@ rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
 			  struct rcu_node *rnp, struct rcu_data *rdp)
 {
 	/* No-CBs CPUs do not have orphanable callbacks. */
-	if (is_nocb_cpu(rdp->cpu))
+	if (rcu_is_nocb_cpu(rdp->cpu))
 		return;
 
 	/*
@@ -2651,10 +2651,10 @@ static void _rcu_barrier(struct rcu_state *rsp)
 	 * corresponding CPU's preceding callbacks have been invoked.
 	 */
 	for_each_possible_cpu(cpu) {
-		if (!cpu_online(cpu) && !is_nocb_cpu(cpu))
+		if (!cpu_online(cpu) && !rcu_is_nocb_cpu(cpu))
 			continue;
 		rdp = per_cpu_ptr(rsp->rda, cpu);
-		if (is_nocb_cpu(cpu)) {
+		if (rcu_is_nocb_cpu(cpu)) {
 			_rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
 					   rsp->n_barrier_done);
 			atomic_inc(&rsp->barrier_cpu_count);
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 4b69291..fbbad93 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -536,7 +536,6 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu);
 static void print_cpu_stall_info_end(void);
 static void zero_cpu_stall_ticks(struct rcu_data *rdp);
 static void increment_cpu_stall_ticks(void);
-static bool is_nocb_cpu(int cpu);
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
 			    bool lazy);
 static bool rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index f6e5ec2..625b327 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2160,7 +2160,7 @@ static int __init rcu_nocb_setup(char *str)
 __setup("rcu_nocbs=", rcu_nocb_setup);
 
 /* Is the specified CPU a no-CPUs CPU? */
-static bool is_nocb_cpu(int cpu)
+bool rcu_is_nocb_cpu(int cpu)
 {
 	if (have_rcu_nocb_mask)
 		return cpumask_test_cpu(cpu, rcu_nocb_mask);
@@ -2218,7 +2218,7 @@ static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
 			    bool lazy)
 {
 
-	if (!is_nocb_cpu(rdp->cpu))
+	if (!rcu_is_nocb_cpu(rdp->cpu))
 		return 0;
 	__call_rcu_nocb_enqueue(rdp, rhp, &rhp->next, 1, lazy);
 	return 1;
@@ -2235,7 +2235,7 @@ static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
 	long qll = rsp->qlen_lazy;
 
 	/* If this is not a no-CBs CPU, tell the caller to do it the old way. */
-	if (!is_nocb_cpu(smp_processor_id()))
+	if (!rcu_is_nocb_cpu(smp_processor_id()))
 		return 0;
 	rsp->qlen = 0;
 	rsp->qlen_lazy = 0;
@@ -2275,7 +2275,7 @@ static bool nocb_cpu_expendable(int cpu)
 	 * If there are no no-CB CPUs or if this CPU is not a no-CB CPU,
 	 * then offlining this CPU is harmless.  Let it happen.
 	 */
-	if (!have_rcu_nocb_mask || is_nocb_cpu(cpu))
+	if (!have_rcu_nocb_mask || rcu_is_nocb_cpu(cpu))
 		return 1;
 
 	/* If no memory, play it safe and keep the CPU around. */
@@ -2456,11 +2456,6 @@ static void __init rcu_init_nocb(void)
 
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 
-static bool is_nocb_cpu(int cpu)
-{
-	return false;
-}
-
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
 			    bool lazy)
 {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c1a1917..805eded 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -587,6 +587,19 @@ void tick_nohz_idle_enter(void)
 	local_irq_enable();
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+static bool can_stop_full_tick(int cpu)
+{
+	if (!sched_can_stop_tick())
+		return false;
+
+	if (!rcu_is_nocb_cpu(cpu))
+		return false;
+
+	return true;
+}
+#endif
+
 static void tick_nohz_full_stop_tick(struct tick_sched *ts)
 {
 #ifdef CONFIG_NO_HZ_FULL
@@ -598,7 +611,7 @@ static void tick_nohz_full_stop_tick(struct tick_sched *ts)
 	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
 		return;
 
-	if (!sched_can_stop_tick())
+	if (!can_stop_full_tick(cpu))
 		return;
 
 	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
@@ -860,10 +873,11 @@ static inline void tick_check_nohz(int cpu) { }
 void tick_nohz_full_check(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+	int cpu = smp_processor_id();
 
-	if (tick_nohz_full_cpu(smp_processor_id())) {
+	if (tick_nohz_full_cpu(cpu)) {
 		if (ts->tick_stopped && !is_idle_task(current)) {
-			if (!sched_can_stop_tick())
+			if (!can_stop_full_tick(cpu))
 				tick_nohz_restart_sched_tick(ts, ktime_get());
 		}
 	}
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 23/33] nohz: Don't turn off the tick if rcu needs it
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (21 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 22/33] nohz: Only stop the tick on RCU nocb CPUs Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 24/33] nohz: Don't stop the tick if posix cpu timers are running Frederic Weisbecker
                   ` (9 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

If RCU is waiting for the current CPU to complete a grace
period, don't turn off the tick. Unlike dyntick-idle, we
are not necessarily going to enter an RCU extended quiescent
state, so we may need to keep the tick to note the current
CPU's quiescent states.

[added build fix from Zen Lin]

CHECKME: OTOH we don't want to handle a locally started
grace period; that should be offloaded for rcu_nocb CPUs.
What we want is to be kicked if we stay dynticks in the kernel
for too long (i.e. to report a quiescent state).
rcu_pending() is perhaps overkill just for that.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/rcupdate.h |    1 +
 kernel/rcutree.c         |    3 +--
 kernel/time/tick-sched.c |    3 +++
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 829312e..2ebadac 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -211,6 +211,7 @@ static inline int rcu_preempt_depth(void)
 extern void rcu_sched_qs(int cpu);
 extern void rcu_bh_qs(int cpu);
 extern void rcu_check_callbacks(int cpu, int user);
+extern int rcu_pending(int cpu);
 struct notifier_block;
 extern void rcu_idle_enter(void);
 extern void rcu_idle_exit(void);
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e9e0ffa..6ba3e02 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -232,7 +232,6 @@ module_param(jiffies_till_next_fqs, ulong, 0644);
 
 static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
 static void force_quiescent_state(struct rcu_state *rsp);
-static int rcu_pending(int cpu);
 
 /*
  * Return the number of RCU-sched batches processed thus far for debug & stats.
@@ -2521,7 +2520,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
  * by the current CPU, returning 1 if so.  This function is part of the
  * RCU implementation; it is -not- an exported member of the RCU API.
  */
-static int rcu_pending(int cpu)
+int rcu_pending(int cpu)
 {
 	struct rcu_state *rsp;
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 805eded..743b021 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -596,6 +596,9 @@ static bool can_stop_full_tick(int cpu)
 	if (!rcu_is_nocb_cpu(cpu))
 		return false;
 
+	if (rcu_pending(cpu))
+		return false;
+
 	return true;
 }
 #endif
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 24/33] nohz: Don't stop the tick if posix cpu timers are running
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (22 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 23/33] nohz: Don't turn off the tick if rcu needs it Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 25/33] nohz: Add some tracing Frederic Weisbecker
                   ` (8 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

If either a per-thread or a per-process posix CPU timer is running,
don't stop the tick.
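
The check itself is simple (a sketch matching the posix-cpu-timers.c
hunk below):

    bool posix_cpu_timers_running(struct task_struct *tsk)
    {
        /* A per-thread timer armed on this task? */
        if (!task_cputime_zero(&tsk->cputime_expires))
            return true;

        /* A per-process timer running somewhere in the group? */
        return tsk->signal->cputimer.running;
    }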

TODO: restart the tick if it is stopped and a posix CPU timer gets
enqueued. Check: we probably need a memory barrier for the per-process
posix timer, which can be enqueued from another task
of the group.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/posix-timers.h |    1 +
 kernel/posix-cpu-timers.c    |   11 +++++++++++
 kernel/time/tick-sched.c     |    4 ++++
 3 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 042058f..97480c2 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -119,6 +119,7 @@ int posix_timer_event(struct k_itimer *timr, int si_private);
 void posix_cpu_timer_schedule(struct k_itimer *timer);
 
 void run_posix_cpu_timers(struct task_struct *task);
+bool posix_cpu_timers_running(struct task_struct *tsk);
 void posix_cpu_timers_exit(struct task_struct *task);
 void posix_cpu_timers_exit_group(struct task_struct *task);
 
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 165d476..15f8f4f 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1269,6 +1269,17 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
 	return 0;
 }
 
+bool posix_cpu_timers_running(struct task_struct *tsk)
+{
+	if (!task_cputime_zero(&tsk->cputime_expires))
+		return true;
+
+	if (tsk->signal->cputimer.running)
+		return true;
+
+	return false;
+}
+
 /*
  * This is called from the timer interrupt handler.  The irq handler has
  * already updated our counts.  We need to check if any timers fire now.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 743b021..5543a4d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -21,6 +21,7 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
+#include <linux/posix-timers.h>
 
 #include <asm/irq_regs.h>
 
@@ -599,6 +600,9 @@ static bool can_stop_full_tick(int cpu)
 	if (rcu_pending(cpu))
 		return false;
 
+	if (posix_cpu_timers_running(current))
+		return false;
+
 	return true;
 }
 #endif
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 25/33] nohz: Add some tracing
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (23 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 24/33] nohz: Don't stop the tick if posix cpu timers are running Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 26/33] rcu: Don't keep the tick for RCU while in userspace Frederic Weisbecker
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

Not for merge, just for debugging.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-sched.c |   27 ++++++++++++++++++++++-----
 1 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5543a4d..1cd93a9 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -142,6 +142,7 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
 			ts->idle_jiffies++;
 	}
 #endif
+	trace_printk("tick\n");
 	update_process_times(user_mode(regs));
 	profile_tick(CPU_PROFILING);
 }
@@ -591,17 +592,30 @@ void tick_nohz_idle_enter(void)
 #ifdef CONFIG_NO_HZ_FULL
 static bool can_stop_full_tick(int cpu)
 {
-	if (!sched_can_stop_tick())
+	if (!sched_can_stop_tick()) {
+		trace_printk("Can't stop: sched\n");
 		return false;
+	}
 
-	if (!rcu_is_nocb_cpu(cpu))
+	if (!rcu_is_nocb_cpu(cpu)) {
+		trace_printk("Can't stop: not RCU nocb\n");
 		return false;
+	}
 
-	if (rcu_pending(cpu))
+	/*
+	 * Keep the tick if we are asked to report a quiescent state.
+	 * This must be further optimized (avoid checks for local callbacks,
+	 * ignore RCU in userspace, etc...)
+	 */
+	if (rcu_pending(cpu)) {
+		trace_printk("Can't stop: RCU pending\n");
 		return false;
+	}
 
-	if (posix_cpu_timers_running(current))
+	if (posix_cpu_timers_running(current)) {
+		trace_printk("Can't stop: posix CPU timers running\n");
 		return false;
+	}
 
 	return true;
 }
@@ -615,12 +629,15 @@ static void tick_nohz_full_stop_tick(struct tick_sched *ts)
 	if (!tick_nohz_full_cpu(cpu) || is_idle_task(current))
 		return;
 
-	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
+	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE) {
+		trace_printk("Can't stop: NOHZ_MODE_INACTIVE\n");
 		return;
+	}
 
 	if (!can_stop_full_tick(cpu))
 		return;
 
+	trace_printk("Stop tick\n");
 	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
 #endif
 }
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 26/33] rcu: Don't keep the tick for RCU while in userspace
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (24 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 25/33] nohz: Add some tracing Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  4:06   ` Paul E. McKenney
  2013-01-08  2:08 ` [PATCH 27/33] profiling: Remove unused timer hook Frederic Weisbecker
                   ` (6 subsequent siblings)
  32 siblings, 1 reply; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

If we are interrupting userspace, we don't need to keep
the tick for RCU: quiescent states don't need to be reported
because we will soon run in userspace, and local callbacks are
handled by the nocb threads.

CHECKME: Do the nocb threads actually handle the global
grace period completion for local callbacks?

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-sched.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 1cd93a9..ecba8b7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -22,6 +22,7 @@
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
+#include <linux/context_tracking.h>
 
 #include <asm/irq_regs.h>
 
@@ -604,10 +605,9 @@ static bool can_stop_full_tick(int cpu)
 
 	/*
 	 * Keep the tick if we are asked to report a quiescent state.
-	 * This must be further optimized (avoid checks for local callbacks,
-	 * ignore RCU in userspace, etc...)
+	 * This must be further optimized (avoid checks for local callbacks)
 	 */
-	if (rcu_pending(cpu)) {
+	if (!context_tracking_in_user() && rcu_pending(cpu)) {
 		trace_printk("Can't stop: RCU pending\n");
 		return false;
 	}
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 27/33] profiling: Remove unused timer hook
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (25 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 26/33] rcu: Don't keep the tick for RCU while in userspace Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 28/33] timer: Don't run non-pinned timer to full dynticks CPUs Frederic Weisbecker
                   ` (5 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

The last remaining user was oprofile, and its use was removed
a while ago in commit bc078e4eab65f11bbaeed380593ab8151b30d703
("oprofile: convert oprofile from timer_hook to hrtimer").

There doesn't seem to have been any upstream user of this hook
for about two years now, and I'm not aware of any out-of-tree
user either.

Let's remove it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/profile.h |   13 -------------
 kernel/profile.c        |   24 ------------------------
 2 files changed, 0 insertions(+), 37 deletions(-)

diff --git a/include/linux/profile.h b/include/linux/profile.h
index a0fc322..2112390 100644
--- a/include/linux/profile.h
+++ b/include/linux/profile.h
@@ -82,9 +82,6 @@ int task_handoff_unregister(struct notifier_block * n);
 int profile_event_register(enum profile_type, struct notifier_block * n);
 int profile_event_unregister(enum profile_type, struct notifier_block * n);
 
-int register_timer_hook(int (*hook)(struct pt_regs *));
-void unregister_timer_hook(int (*hook)(struct pt_regs *));
-
 struct pt_regs;
 
 #else
@@ -135,16 +132,6 @@ static inline int profile_event_unregister(enum profile_type t, struct notifier_
 #define profile_handoff_task(a) (0)
 #define profile_munmap(a) do { } while (0)
 
-static inline int register_timer_hook(int (*hook)(struct pt_regs *))
-{
-	return -ENOSYS;
-}
-
-static inline void unregister_timer_hook(int (*hook)(struct pt_regs *))
-{
-	return;
-}
-
 #endif /* CONFIG_PROFILING */
 
 #endif /* _LINUX_PROFILE_H */
diff --git a/kernel/profile.c b/kernel/profile.c
index 1f39181..dc3384e 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -37,9 +37,6 @@ struct profile_hit {
 #define NR_PROFILE_HIT		(PAGE_SIZE/sizeof(struct profile_hit))
 #define NR_PROFILE_GRP		(NR_PROFILE_HIT/PROFILE_GRPSZ)
 
-/* Oprofile timer tick hook */
-static int (*timer_hook)(struct pt_regs *) __read_mostly;
-
 static atomic_t *prof_buffer;
 static unsigned long prof_len, prof_shift;
 
@@ -208,25 +205,6 @@ int profile_event_unregister(enum profile_type type, struct notifier_block *n)
 }
 EXPORT_SYMBOL_GPL(profile_event_unregister);
 
-int register_timer_hook(int (*hook)(struct pt_regs *))
-{
-	if (timer_hook)
-		return -EBUSY;
-	timer_hook = hook;
-	return 0;
-}
-EXPORT_SYMBOL_GPL(register_timer_hook);
-
-void unregister_timer_hook(int (*hook)(struct pt_regs *))
-{
-	WARN_ON(hook != timer_hook);
-	timer_hook = NULL;
-	/* make sure all CPUs see the NULL hook */
-	synchronize_sched();  /* Allow ongoing interrupts to complete. */
-}
-EXPORT_SYMBOL_GPL(unregister_timer_hook);
-
-
 #ifdef CONFIG_SMP
 /*
  * Each cpu has a pair of open-addressed hashtables for pending
@@ -436,8 +414,6 @@ void profile_tick(int type)
 {
 	struct pt_regs *regs = get_irq_regs();
 
-	if (type == CPU_PROFILING && timer_hook)
-		timer_hook(regs);
 	if (!user_mode(regs) && prof_cpu_mask != NULL &&
 	    cpumask_test_cpu(smp_processor_id(), prof_cpu_mask))
 		profile_hit(type, (void *)profile_pc(regs));
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 28/33] timer: Don't run non-pinned timer to full dynticks CPUs
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (26 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 27/33] profiling: Remove unused timer hook Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 29/33] sched: Use an accessor to read rq clock Frederic Weisbecker
                   ` (4 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

While trying to find a target for a non-pinned timer, use
the following logic:

- Use the closest (from a sched domain POV) busy CPU that
is not full dynticks

- If none, use the closest idle CPU that is not full dynticks.

So this is biased toward isolation over powersaving. This is
a quick hack until we provide a way for the user to tune that
policy; a CPU mask affinity for non-pinned timers could be such
a solution. A condensed sketch of the selection follows.
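
Sketch of the reworked get_nohz_timer_target() loop (variable names
as in the core.c hunk below):

    for_each_domain(cpu, sd) {
        for_each_cpu(i, sched_domain_span(sd)) {
            if (tick_nohz_full_cpu(i))
                continue;           /* never target a full nohz CPU */

            if (!idle_cpu(i)) {
                target = i;         /* best case: a close busy CPU */
                goto unlock;
            }

            if (target == -1)
                target = i;         /* remember the closest idle CPU */
        }
    }
    /* Fallback in case of NULL domain */
    if (target == -1)
        target = cpu;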

Original-patch-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/hrtimer.c    |    3 ++-
 kernel/sched/core.c |   26 +++++++++++++++++++++++---
 kernel/timer.c      |    3 ++-
 3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 6db7a5e..f5da6fb 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -159,7 +159,8 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
 static int hrtimer_get_target(int this_cpu, int pinned)
 {
 #ifdef CONFIG_NO_HZ
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
+	if (!pinned && get_sysctl_timer_migration() &&
+	    (idle_cpu(this_cpu) || tick_nohz_full_cpu(this_cpu)))
 		return get_nohz_timer_target();
 #endif
 	return this_cpu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7b6156a..e2884c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -560,22 +560,42 @@ void resched_cpu(int cpu)
  */
 int get_nohz_timer_target(void)
 {
-	int cpu = smp_processor_id();
 	int i;
 	struct sched_domain *sd;
+	int cpu = smp_processor_id();
+	int target = -1;
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
 		for_each_cpu(i, sched_domain_span(sd)) {
+			/*
+			 * This is biased toward the CPU isolation use case:
+			 * try to migrate the timer to a busy non-full-nohz
+			 * CPU. If there is none, then prefer an idle CPU
+			 * over a full nohz one.
+			 * We shouldn't do policy here (isolation vs. powersaving),
+			 * so this is a temporary hack. Being able to affine
+			 * non-pinned timers would be a better solution.
+			 */
+			if (tick_nohz_full_cpu(i))
+				continue;
+
 			if (!idle_cpu(i)) {
-				cpu = i;
+				target = i;
 				goto unlock;
 			}
+
+			if (target == -1)
+				target = i;
 		}
 	}
+	/* Fallback in case of NULL domain */
+	if (target == -1)
+		target = cpu;
 unlock:
 	rcu_read_unlock();
-	return cpu;
+
+	return target;
 }
 /*
  * When add_timer_on() enqueues a timer into the timer wheel of an
diff --git a/kernel/timer.c b/kernel/timer.c
index 970b57d..51dd02b 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -738,7 +738,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 	cpu = smp_processor_id();
 
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
+	if (!pinned && get_sysctl_timer_migration() &&
+	    (idle_cpu(cpu) || tick_nohz_full_cpu(cpu)))
 		cpu = get_nohz_timer_target();
 #endif
 	new_base = per_cpu(tvec_bases, cpu);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 29/33] sched: Use an accessor to read rq clock
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (27 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 28/33] timer: Don't run non-pinned timer to full dynticks CPUs Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 30/33] sched: Debug nohz " Frederic Weisbecker
                   ` (3 subsequent siblings)
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

Read the runqueue clock through an accessor. This way
we'll be able to detect and debug stale rq clocks on
full dynticks CPUs.
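
The conversion is mechanical, e.g. (the pattern applied throughout
the diff below):

	-	u64 now = rq_of(cfs_rq)->clock_task;
	+	u64 now = rq_clock_task(rq_of(cfs_rq));

Funneling every read through rq_clock()/rq_clock_task() gives a
single choke point where a staleness check can later be hooked,
which is what the next patch does.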

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c      |    6 +++---
 kernel/sched/fair.c      |   42 +++++++++++++++++++++---------------------
 kernel/sched/rt.c        |    8 ++++----
 kernel/sched/sched.h     |   10 ++++++++++
 kernel/sched/stats.h     |    8 ++++----
 kernel/sched/stop_task.c |    8 ++++----
 6 files changed, 46 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e2884c5..15ba35e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -672,7 +672,7 @@ void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
 
-	while ((s64)(rq->clock - rq->age_stamp) > period) {
+	while ((s64)(rq_clock(rq) - rq->age_stamp) > period) {
 		/*
 		 * Inline assembly required to prevent the compiler
 		 * optimising this loop into a divmod call.
@@ -1347,7 +1347,7 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 	 * we are not idle and rq->idle_stamp == 0
 	 */
 	if (rq->idle_stamp) {
-		u64 delta = rq->clock - rq->idle_stamp;
+		u64 delta = rq_clock(rq) - rq->idle_stamp;
 		u64 max = 2*sysctl_sched_migration_cost;
 
 		if (delta > max)
@@ -2722,7 +2722,7 @@ static u64 do_task_delta_exec(struct task_struct *p, struct rq *rq)
 
 	if (task_current(rq, p)) {
 		update_rq_clock(rq);
-		ns = rq->clock_task - p->se.exec_start;
+		ns = rq_clock_task(rq) - p->se.exec_start;
 		if ((s64)ns < 0)
 			ns = 0;
 	}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 473f50f..bd9113a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -685,7 +685,7 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
 static void update_curr(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	u64 now = rq_of(cfs_rq)->clock_task;
+	u64 now = rq_clock_task(rq_of(cfs_rq));
 	unsigned long delta_exec;
 
 	if (unlikely(!curr))
@@ -717,7 +717,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	schedstat_set(se->statistics.wait_start, rq_of(cfs_rq)->clock);
+	schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq)));
 }
 
 /*
@@ -737,14 +737,14 @@ static void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	schedstat_set(se->statistics.wait_max, max(se->statistics.wait_max,
-			rq_of(cfs_rq)->clock - se->statistics.wait_start));
+			rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start));
 	schedstat_set(se->statistics.wait_count, se->statistics.wait_count + 1);
 	schedstat_set(se->statistics.wait_sum, se->statistics.wait_sum +
-			rq_of(cfs_rq)->clock - se->statistics.wait_start);
+			rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start);
 #ifdef CONFIG_SCHEDSTATS
 	if (entity_is_task(se)) {
 		trace_sched_stat_wait(task_of(se),
-			rq_of(cfs_rq)->clock - se->statistics.wait_start);
+			rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start);
 	}
 #endif
 	schedstat_set(se->statistics.wait_start, 0);
@@ -770,7 +770,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	/*
 	 * We are starting a new run period:
 	 */
-	se->exec_start = rq_of(cfs_rq)->clock_task;
+	se->exec_start = rq_clock_task(rq_of(cfs_rq));
 }
 
 /**************************************************
@@ -1496,7 +1496,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
+	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 
@@ -1511,7 +1511,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	 * accumulated while sleeping.
 	 */
 	if (unlikely(se->avg.decay_count <= 0)) {
-		se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
+		se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
 		if (se->avg.decay_count) {
 			/*
 			 * In a wake-up migration we have to approximate the
@@ -1585,7 +1585,7 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		tsk = task_of(se);
 
 	if (se->statistics.sleep_start) {
-		u64 delta = rq_of(cfs_rq)->clock - se->statistics.sleep_start;
+		u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.sleep_start;
 
 		if ((s64)delta < 0)
 			delta = 0;
@@ -1602,7 +1602,7 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		}
 	}
 	if (se->statistics.block_start) {
-		u64 delta = rq_of(cfs_rq)->clock - se->statistics.block_start;
+		u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.block_start;
 
 		if ((s64)delta < 0)
 			delta = 0;
@@ -1785,9 +1785,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 			struct task_struct *tsk = task_of(se);
 
 			if (tsk->state & TASK_INTERRUPTIBLE)
-				se->statistics.sleep_start = rq_of(cfs_rq)->clock;
+				se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
 			if (tsk->state & TASK_UNINTERRUPTIBLE)
-				se->statistics.block_start = rq_of(cfs_rq)->clock;
+				se->statistics.block_start = rq_clock(rq_of(cfs_rq));
 		}
 #endif
 	}
@@ -2062,7 +2062,7 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
 	if (unlikely(cfs_rq->throttle_count))
 		return cfs_rq->throttled_clock_task;
 
-	return rq_of(cfs_rq)->clock_task - cfs_rq->throttled_clock_task_time;
+	return rq_clock_task(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
 }
 
 /* returns 0 on failure to allocate runtime */
@@ -2121,7 +2121,7 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 
 	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq->clock - cfs_rq->runtime_expires) < 0))
+	if (likely((s64)(rq_clock(rq) - cfs_rq->runtime_expires) < 0))
 		return;
 
 	if (cfs_rq->runtime_remaining < 0)
@@ -2210,7 +2210,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 #ifdef CONFIG_SMP
 	if (!cfs_rq->throttle_count) {
 		/* adjust cfs_rq_clock_task() */
-		cfs_rq->throttled_clock_task_time += rq->clock_task -
+		cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
 					     cfs_rq->throttled_clock_task;
 	}
 #endif
@@ -2225,7 +2225,7 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 
 	/* group is entering throttled state, stop time */
 	if (!cfs_rq->throttle_count)
-		cfs_rq->throttled_clock_task = rq->clock_task;
+		cfs_rq->throttled_clock_task = rq_clock_task(rq);
 	cfs_rq->throttle_count++;
 
 	return 0;
@@ -2264,7 +2264,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		rq->nr_running -= task_delta;
 
 	cfs_rq->throttled = 1;
-	cfs_rq->throttled_clock = rq->clock;
+	cfs_rq->throttled_clock = rq_clock(rq);
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 	raw_spin_unlock(&cfs_b->lock);
@@ -2284,7 +2284,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	update_rq_clock(rq);
 
 	raw_spin_lock(&cfs_b->lock);
-	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_clock;
+	cfs_b->throttled_time += rq_clock(rq) - cfs_rq->throttled_clock;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -2687,7 +2687,7 @@ static void unthrottle_offline_cfs_rqs(struct rq *rq)
 #else /* CONFIG_CFS_BANDWIDTH */
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
 {
-	return rq_of(cfs_rq)->clock_task;
+	return rq_clock_task(rq_of(cfs_rq));
 }
 
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
@@ -4292,7 +4292,7 @@ unsigned long scale_rt_power(int cpu)
 	age_stamp = ACCESS_ONCE(rq->age_stamp);
 	avg = ACCESS_ONCE(rq->rt_avg);
 
-	total = sched_avg_period() + (rq->clock - age_stamp);
+	total = sched_avg_period() + (rq_clock(rq) - age_stamp);
 
 	if (unlikely(total < avg)) {
 		/* Ensures that power won't end up being negative */
@@ -5255,7 +5255,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	unsigned long next_balance = jiffies + HZ;
 
 	update_nohz_rq_clock(this_rq);
-	this_rq->idle_stamp = this_rq->clock;
+	this_rq->idle_stamp = rq_clock(this_rq);
 
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
 		return;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 418feb0..b1eb08b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -924,7 +924,7 @@ static void update_curr_rt(struct rq *rq)
 	if (curr->sched_class != &rt_sched_class)
 		return;
 
-	delta_exec = rq->clock_task - curr->se.exec_start;
+	delta_exec = rq_clock_task(rq) - curr->se.exec_start;
 	if (unlikely((s64)delta_exec < 0))
 		delta_exec = 0;
 
@@ -934,7 +934,7 @@ static void update_curr_rt(struct rq *rq)
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
 
-	curr->se.exec_start = rq->clock_task;
+	curr->se.exec_start = rq_clock_task(rq);
 	cpuacct_charge(curr, delta_exec);
 
 	sched_rt_avg_update(rq, delta_exec);
@@ -1383,7 +1383,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	} while (rt_rq);
 
 	p = rt_task_of(rt_se);
-	p->se.exec_start = rq->clock_task;
+	p->se.exec_start = rq_clock_task(rq);
 
 	return p;
 }
@@ -2029,7 +2029,7 @@ static void set_curr_task_rt(struct rq *rq)
 {
 	struct task_struct *p = rq->curr;
 
-	p->se.exec_start = rq->clock_task;
+	p->se.exec_start = rq_clock_task(rq);
 
 	/* The running task is never eligible for pushing */
 	dequeue_pushable_task(rq, p);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 63915fe..e1bac76 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -502,6 +502,16 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+static inline u64 rq_clock(struct rq *rq)
+{
+	return rq->clock;
+}
+
+static inline u64 rq_clock_task(struct rq *rq)
+{
+	return rq->clock_task;
+}
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 2ef90a5..17d7065 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -61,7 +61,7 @@ static inline void sched_info_reset_dequeued(struct task_struct *t)
  */
 static inline void sched_info_dequeued(struct task_struct *t)
 {
-	unsigned long long now = task_rq(t)->clock, delta = 0;
+	unsigned long long now = rq_clock(task_rq(t)), delta = 0;
 
 	if (unlikely(sched_info_on()))
 		if (t->sched_info.last_queued)
@@ -79,7 +79,7 @@ static inline void sched_info_dequeued(struct task_struct *t)
  */
 static void sched_info_arrive(struct task_struct *t)
 {
-	unsigned long long now = task_rq(t)->clock, delta = 0;
+	unsigned long long now = rq_clock(task_rq(t)), delta = 0;
 
 	if (t->sched_info.last_queued)
 		delta = now - t->sched_info.last_queued;
@@ -100,7 +100,7 @@ static inline void sched_info_queued(struct task_struct *t)
 {
 	if (unlikely(sched_info_on()))
 		if (!t->sched_info.last_queued)
-			t->sched_info.last_queued = task_rq(t)->clock;
+			t->sched_info.last_queued = rq_clock(task_rq(t));
 }
 
 /*
@@ -112,7 +112,7 @@ static inline void sched_info_queued(struct task_struct *t)
  */
 static inline void sched_info_depart(struct task_struct *t)
 {
-	unsigned long long delta = task_rq(t)->clock -
+	unsigned long long delta = rq_clock(task_rq(t)) -
 					t->sched_info.last_arrival;
 
 	rq_sched_info_depart(task_rq(t), delta);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index da5eb5b..e08fbee 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -28,7 +28,7 @@ static struct task_struct *pick_next_task_stop(struct rq *rq)
 	struct task_struct *stop = rq->stop;
 
 	if (stop && stop->on_rq) {
-		stop->se.exec_start = rq->clock_task;
+		stop->se.exec_start = rq_clock_task(rq);
 		return stop;
 	}
 
@@ -57,7 +57,7 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 	struct task_struct *curr = rq->curr;
 	u64 delta_exec;
 
-	delta_exec = rq->clock_task - curr->se.exec_start;
+	delta_exec = rq_clock_task(rq) - curr->se.exec_start;
 	if (unlikely((s64)delta_exec < 0))
 		delta_exec = 0;
 
@@ -67,7 +67,7 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
 
-	curr->se.exec_start = rq->clock_task;
+	curr->se.exec_start = rq_clock_task(rq);
 	cpuacct_charge(curr, delta_exec);
 }
 
@@ -79,7 +79,7 @@ static void set_curr_task_stop(struct rq *rq)
 {
 	struct task_struct *stop = rq->stop;
 
-	stop->se.exec_start = rq->clock_task;
+	stop->se.exec_start = rq_clock_task(rq);
 }
 
 static void switched_to_stop(struct rq *rq, struct task_struct *p)
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 30/33] sched: Debug nohz rq clock
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (28 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 29/33] sched: Use an accessor to read rq clock Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-03-20 23:23   ` Kevin Hilman
  2013-01-08  2:08 ` [PATCH 31/33] sched: Remove broken check for skip clock update Frederic Weisbecker
                   ` (2 subsequent siblings)
  32 siblings, 1 reply; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

The runqueue clock is supposed to be periodically updated by the
tick. On full dynticks CPUs we call update_nohz_rq_clock() before
reading it. But the scheduler code is complicated enough that we
may miss some update_nohz_rq_clock() calls before reading the
runqueue clock.

This therefore introduces a new debugging feature that detects
when the rq clock is stale due to missing updates on full
dynticks CPUs.

This can later be expanded to debug stale clocks on dynticks-idle
CPUs.
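
As an illustration, this is the kind of pattern the check is meant
to catch (a hypothetical sketch, not code from this series):

	/*
	 * On a full dynticks CPU the tick no longer refreshes
	 * rq->clock, so this read may be several ticks stale and
	 * now warns once when the drift exceeds 3 ticks:
	 */
	delta_exec = rq_clock_task(rq) - curr->se.exec_start;

	/* The cure is to refresh the clock before reading it: */
	update_nohz_rq_clock(rq);
	delta_exec = rq_clock_task(rq) - curr->se.exec_start;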

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/sched.h |   23 +++++++++++++++++++++++
 1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e1bac76..0fef0b3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -502,16 +502,39 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+static inline void rq_clock_check(struct rq *rq)
+{
+#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_NO_HZ_FULL)
+	unsigned long long clock;
+	unsigned long flags;
+	int cpu;
+
+	cpu = cpu_of(rq);
+	if (!tick_nohz_full_cpu(cpu) || rq->curr == rq->idle)
+		return;
+
+	local_irq_save(flags);
+	clock = sched_clock_cpu(cpu_of(rq));
+	local_irq_restore(flags);
+
+	if (abs(clock - rq->clock) > (TICK_NSEC * 3))
+		WARN_ON_ONCE(1);
+#endif
+}
+
 static inline u64 rq_clock(struct rq *rq)
 {
+	rq_clock_check(rq);
 	return rq->clock;
 }
 
 static inline u64 rq_clock_task(struct rq *rq)
 {
+	rq_clock_check(rq);
 	return rq->clock_task;
 }
 
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 31/33] sched: Remove broken check for skip clock update
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (29 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 30/33] sched: Debug nohz " Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:11   ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 32/33] sched: Update rq clock before rt sched average scale Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 33/33] sched: Disable lb_bias feature for full dynticks Frederic Weisbecker
  32 siblings, 1 reply; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

rq->skip_clock_update shouldn't be negative. Thus the check
in put_prev_task() is useless.

It was probably intended to do the following check:

	if (prev->on_rq && !rq->skip_clock_update)

We only want to update the clock if the previous task is not
voluntarily sleeping: otherwise deactivate_task() already did
the rq clock update in schedule(). And we want to skip the
update if a ttwu already did it for us, in which case
rq->skip_clock_update is 1.

But update_rq_clock() already takes care of all that, so we
can just remove the broken condition.
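
For reference, update_rq_clock() already looks roughly like this (a
sketch from memory of the 3.8-era code, not part of this patch), so
the skip logic is handled in one central place:

	void update_rq_clock(struct rq *rq)
	{
		s64 delta;

		if (rq->skip_clock_update > 0)
			return;	/* e.g. a ttwu already updated the clock */

		delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
		rq->clock += delta;
		update_rq_clock_task(rq, delta);
	}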

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15ba35e..8dfc461 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2886,7 +2886,7 @@ static inline void schedule_debug(struct task_struct *prev)
 
 static void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-	if (prev->on_rq || rq->skip_clock_update < 0)
+	if (prev->on_rq)
 		update_rq_clock(rq);
 	prev->sched_class->put_prev_task(rq, prev);
 }
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 32/33] sched: Update rq clock before rt sched average scale
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (30 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 31/33] sched: Remove broken check for skip clock update Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  2013-01-08  2:08 ` [PATCH 33/33] sched: Disable lb_bias feature for full dynticks Frederic Weisbecker
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

When we restart the tick, we catch up with the CPU load
statistics. But we also scale down the rt power. This
requires an up-to-date rq clock.

DISCLAIMER: I don't know if this is needed when we wake
up from idle. Maybe this is handled from the reader side
in scale_rt_power(). I have yet to understand exactly what
this function does.

scale_rt_power() also needs to handle full tickless CPUs
when they run RT tasks, and to remotely perform the missing
scaling down usually done from the tick.
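
The rq clock dependency is visible in scale_rt_power() itself (this
line is quoted from patch 29 in this series):

	total = sched_avg_period() + (rq_clock(rq) - age_stamp);

	/*
	 * A stale rq clock shrinks the elapsed window "total", so
	 * rq->rt_avg eats a proportionally larger share of it and
	 * the remaining CFS power gets scaled down too aggressively.
	 */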

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8dfc461..b35d122 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2650,6 +2650,7 @@ void update_cpu_load_nohz(void)
 	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
 	if (pending_updates) {
 		this_rq->last_load_update_tick = curr_jiffies;
+		update_rq_clock(this_rq);
 		/*
 		 * We were idle, this means load 0, the current load might be
 		 * !0 due to remote wakeups and the sort.
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 33/33] sched: Disable lb_bias feature for full dynticks
  2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
                   ` (31 preceding siblings ...)
  2013-01-08  2:08 ` [PATCH 32/33] sched: Update rq clock before rt sched average scale Frederic Weisbecker
@ 2013-01-08  2:08 ` Frederic Weisbecker
  32 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:08 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

If we run in full dynticks mode, we have no way to
update the CPU load stats as typically done by
update_cpu_load_active().

Hence we need to force the LB_BIAS sched feature off,
because we can't rely on these statistics to measure a
CPU's load when looking for a load balancing target.

Instead, let's rely only on the current runqueue load
weight.
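
With the feature compiled out, both source_load() and target_load()
effectively reduce to the instantaneous queue weight, i.e. (a sketch
of the resulting behavior, not literal code):

	static unsigned long source_load(int cpu, int type)
	{
		/* cpu_load[] history is unreliable without the tick */
		return weighted_cpuload(cpu);
	}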

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/fair.c     |   13 +++++++++++--
 kernel/sched/features.h |    3 +++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd9113a..4a55e8a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2903,6 +2903,15 @@ static unsigned long weighted_cpuload(const int cpu)
 	return cpu_rq(cpu)->load.weight;
 }
 
+static inline int sched_lb_bias(void)
+{
+#ifndef CONFIG_NO_HZ_FULL
+	return sched_feat(LB_BIAS);
+#else
+	return 0;
+#endif
+}
+
 /*
  * Return a low guess at the load of a migration-source cpu weighted
  * according to the scheduling class and "nice" value.
@@ -2915,7 +2924,7 @@ static unsigned long source_load(int cpu, int type)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long total = weighted_cpuload(cpu);
 
-	if (type == 0 || !sched_feat(LB_BIAS))
+	if (type == 0 || !sched_lb_bias())
 		return total;
 
 	return min(rq->cpu_load[type-1], total);
@@ -2930,7 +2939,7 @@ static unsigned long target_load(int cpu, int type)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long total = weighted_cpuload(cpu);
 
-	if (type == 0 || !sched_feat(LB_BIAS))
+	if (type == 0 || !sched_lb_bias())
 		return total;
 
 	return max(rq->cpu_load[type-1], total);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1ad1d2b..4c8c113 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -43,7 +43,10 @@ SCHED_FEAT(ARCH_POWER, true)
 
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(DOUBLE_TICK, false)
+
+#ifndef CONFIG_NO_HZ_FULL
 SCHED_FEAT(LB_BIAS, true)
+#endif
 
 /*
  * Spin-wait on mutex acquisition when the mutex owner is running on
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 31/33] sched: Remove broken check for skip clock update
  2013-01-08  2:08 ` [PATCH 31/33] sched: Remove broken check for skip clock update Frederic Weisbecker
@ 2013-01-08  2:11   ` Frederic Weisbecker
  0 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08  2:11 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

Please ignore this patch, it's broken.

Thanks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 26/33] rcu: Don't keep the tick for RCU while in userspace
  2013-01-08  2:08 ` [PATCH 26/33] rcu: Don't keep the tick for RCU while in userspace Frederic Weisbecker
@ 2013-01-08  4:06   ` Paul E. McKenney
  0 siblings, 0 replies; 60+ messages in thread
From: Paul E. McKenney @ 2013-01-08  4:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul Gortmaker,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner

On Tue, Jan 08, 2013 at 03:08:26AM +0100, Frederic Weisbecker wrote:
> If we are interrupting userspace, we don't need to keep
> the tick for RCU: quiescent states don't need to be reported
> because we soon run in userspace and local callbacks are handled
> by the nocb threads.
> 
> CHECKME: Do the nocb threads actually handle the global
> grace period completion for local callbacks?

First answering this for the nocb stuff in mainline:  In this case,
the grace-period startup is handled by a CPU that is not a nocb
CPU, and there has to be at least one.  The grace-period completion
is handled by the grace-period kthreads.  The nocb CPUs need do
nothing, at least assuming that they get back into dyntick-idle
(or adaptive tickless) state reasonably quickly.

Second, for the version in -rcu: In this case, the nocb kthreads
register the need for a grace period using a new mechanism that
pushes the need up the rcu_node tree.  The grace-period completion
is again handled by the grace-period kthreads.  This allows all
CPUs to be nocb CPUs.

So, in either case, yes, the below code should be safe as long as
the CPU gets into an RCU-idle state quickly (as in within a few
milliseconds or so).

						Thanx, Paul

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Alessio Igor Bogani <abogani@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Chris Metcalf <cmetcalf@tilera.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Geoff Levand <geoff@infradead.org>
> Cc: Gilad Ben Yossef <gilad@benyossef.com>
> Cc: Hakan Akkan <hakanakkan@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Li Zhong <zhong@linux.vnet.ibm.com>
> Cc: Namhyung Kim <namhyung.kim@lge.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/time/tick-sched.c |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 1cd93a9..ecba8b7 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -22,6 +22,7 @@
>  #include <linux/module.h>
>  #include <linux/irq_work.h>
>  #include <linux/posix-timers.h>
> +#include <linux/context_tracking.h>
> 
>  #include <asm/irq_regs.h>
> 
> @@ -604,10 +605,9 @@ static bool can_stop_full_tick(int cpu)
> 
>  	/*
>  	 * Keep the tick if we are asked to report a quiescent state.
> -	 * This must be further optimized (avoid checks for local callbacks,
> -	 * ignore RCU in userspace, etc...
> +	 * This must be further optimized (avoid checks for local callbacks)
>  	 */
> -	if (rcu_pending(cpu)) {
> +	if (!context_tracking_in_user() && rcu_pending(cpu)) {
>  		trace_printk("Can't stop: RCU pending\n");
>  		return false;
>  	}
> -- 
> 1.7.5.4
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 17/33] sched: Update clock of nohz busiest rq before balancing
  2013-01-08  2:08 ` [PATCH 17/33] sched: Update clock of nohz busiest rq before balancing Frederic Weisbecker
@ 2013-01-08 10:20   ` Li Zhong
  2013-03-07 23:51     ` Frederic Weisbecker
  0 siblings, 1 reply; 60+ messages in thread
From: Li Zhong @ 2013-01-08 10:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Namhyung Kim, Paul E. McKenney, Paul Gortmaker,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner

On Tue, 2013-01-08 at 03:08 +0100, Frederic Weisbecker wrote:
> move_tasks() and active_load_balance_cpu_stop() both need
> to have the busiest rq clock uptodate because they may end
> up calling can_migrate_task() that uses rq->clock_task
> to determine if the task running in the busiest runqueue
> is cache hot.
> 
> Hence if the busiest runqueue is tickless, update its clock
> before reading it.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Alessio Igor Bogani <abogani@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Chris Metcalf <cmetcalf@tilera.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Geoff Levand <geoff@infradead.org>
> Cc: Gilad Ben Yossef <gilad@benyossef.com>
> Cc: Hakan Akkan <hakanakkan@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Li Zhong <zhong@linux.vnet.ibm.com>
> Cc: Namhyung Kim <namhyung.kim@lge.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> [ Forward port conflicts ]
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/sched/fair.c |   17 +++++++++++++++++
>  1 files changed, 17 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3d65ac7..e78d81104 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5002,6 +5002,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  {
>  	int ld_moved, cur_ld_moved, active_balance = 0;
>  	int lb_iterations, max_lb_iterations;
> +	int clock_updated;
>  	struct sched_group *group;
>  	struct rq *busiest;
>  	unsigned long flags;
> @@ -5045,6 +5046,7 @@ redo:
> 
>  	ld_moved = 0;
>  	lb_iterations = 1;
> +	clock_updated = 0;
>  	if (busiest->nr_running > 1) {
>  		/*
>  		 * Attempt to move tasks. If find_busiest_group has found
> @@ -5068,6 +5070,14 @@ more_balance:
>  		 */
>  		cur_ld_moved = move_tasks(&env);
>  		ld_moved += cur_ld_moved;
> +
> +		/*
> +		 * Move tasks may end up calling can_migrate_task() which
> +		 * requires an uptodate value of the rq clock.
> +		 */
> +		update_nohz_rq_clock(busiest);
> +		clock_updated = 1;

According to the change log, it seems these lines should be added before
move_tasks() above? 

Thanks, Zhong

> +
>  		double_rq_unlock(env.dst_rq, busiest);
>  		local_irq_restore(flags);
> 
> @@ -5163,6 +5173,13 @@ more_balance:
>  				busiest->active_balance = 1;
>  				busiest->push_cpu = this_cpu;
>  				active_balance = 1;
> +				/*
> +				 * active_load_balance_cpu_stop may end up calling
> +				 * can_migrate_task() which requires an uptodate
> +				 * value of the rq clock.
> +				 */
> +				if (!clock_updated)
> +					update_nohz_rq_clock(busiest);
>  			}
>  			raw_spin_unlock_irqrestore(&busiest->lock, flags);
> 



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 03/33] cputime: Generic on-demand virtual cputime accounting
  2013-01-08  2:08 ` [PATCH 03/33] cputime: Generic on-demand virtual cputime accounting Frederic Weisbecker
@ 2013-01-08 20:23   ` Steven Rostedt
  2013-01-08 20:26   ` Steven Rostedt
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 60+ messages in thread
From: Steven Rostedt @ 2013-01-08 20:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Thomas Gleixner

On Tue, 2013-01-08 at 03:08 +0100, Frederic Weisbecker wrote:

> +++ b/kernel/sched/cputime.c
> @@ -3,6 +3,7 @@
>  #include <linux/tsacct_kern.h>
>  #include <linux/kernel_stat.h>
>  #include <linux/static_key.h>
> +#include <linux/context_tracking.h>
>  #include "sched.h"
>  
> 
> @@ -495,10 +496,24 @@ void vtime_task_switch(struct task_struct *prev)
>  #ifndef __ARCH_HAS_VTIME_ACCOUNT
>  void vtime_account(struct task_struct *tsk)
>  {
> -	if (in_interrupt() || !is_idle_task(tsk))
> -		vtime_account_system(tsk);
> -	else
> -		vtime_account_idle(tsk);
> +	if (!in_interrupt()) {
> +		/*
> +		 * If we interrupted user, context_tracking_in_user()
> +		 * is 1 because the context tracking don't hook

s/don't/doesn't/

> +		 * on irq entry/exit. This way we know if
> +		 * we need to flush user time on kernel entry.

Also, the above comment is simply confusing. Why not just say something
like:

 Context tracking doesn't hook on irq entry/exit. The context will still
 be user context if the interrupt preempted user space.

No need to explain the implementation details of
"context_tracking_in_user() is 1 ...".

-- Steve

> +		 */
> +		if (context_tracking_in_user()) {
> +			vtime_account_user(tsk);
> +			return;
> +		}
> +
> +		if (is_idle_task(tsk)) {
> +			vtime_account_idle(tsk);
> +			return;
> +		}
> +	}
> +	vtime_account_system(tsk);
>  }
>  EXPORT_SYMBOL_GPL(vtime_account);
>  #endif /* __ARCH_HAS_VTIME_ACCOUNT */
> @@ -587,3 +602,72 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
>  	cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st);
>  }
>  #endif



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 03/33] cputime: Generic on-demand virtual cputime accounting
  2013-01-08  2:08 ` [PATCH 03/33] cputime: Generic on-demand virtual cputime accounting Frederic Weisbecker
  2013-01-08 20:23   ` Steven Rostedt
@ 2013-01-08 20:26   ` Steven Rostedt
  2013-01-08 21:00     ` Paul E. McKenney
  2013-01-08 20:45   ` Steven Rostedt
  2013-01-09 13:46   ` Steven Rostedt
  3 siblings, 1 reply; 60+ messages in thread
From: Steven Rostedt @ 2013-01-08 20:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Thomas Gleixner

On Tue, 2013-01-08 at 03:08 +0100, Frederic Weisbecker wrote:

> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index c952770..bd461ad 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -56,7 +56,7 @@ void user_enter(void)
>  	local_irq_save(flags);
>  	if (__this_cpu_read(context_tracking.active) &&
>  	    __this_cpu_read(context_tracking.state) != IN_USER) {
> -		__this_cpu_write(context_tracking.state, IN_USER);
> +		vtime_user_enter(current);
>  		/*
>  		 * At this stage, only low level arch entry code remains and
>  		 * then we'll run in userspace. We can assume there won't be
> @@ -65,6 +65,7 @@ void user_enter(void)
>  		 * on the tick.
>  		 */
>  		rcu_user_enter();

Hmm, the rcu_user_enter() can do quite a bit. Too bad we are accounting
it as user time. I wonder if we could move the vtime_user_enter() below
it. But then if vtime_user_enter() calls rcu_read_lock() it breaks.

The notorious chicken vs egg ordeal!

-- Steve
 
> +		__this_cpu_write(context_tracking.state, IN_USER);
>  	}
>  	local_irq_restore(flags);
>  }
> @@ -90,12 +91,13 @@ void user_exit(void)
>  
>  	local_irq_save(flags);
>  	if (__this_cpu_read(context_tracking.state) == IN_USER) {
> -		__this_cpu_write(context_tracking.state, IN_KERNEL);
>  		/*
>  		 * We are going to run code that may use RCU. Inform
>  		 * RCU core about that (ie: we may need the tick again).
>  		 */
>  		rcu_user_exit();
> +		vtime_user_exit(current);
> +		__this_cpu_write(context_tracking.state, IN_KERNEL);
>  	}
>  	local_irq_restore(flags);
>  }



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 03/33] cputime: Generic on-demand virtual cputime accounting
  2013-01-08  2:08 ` [PATCH 03/33] cputime: Generic on-demand virtual cputime accounting Frederic Weisbecker
  2013-01-08 20:23   ` Steven Rostedt
  2013-01-08 20:26   ` Steven Rostedt
@ 2013-01-08 20:45   ` Steven Rostedt
  2013-01-09 13:46   ` Steven Rostedt
  3 siblings, 0 replies; 60+ messages in thread
From: Steven Rostedt @ 2013-01-08 20:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Thomas Gleixner

On Tue, 2013-01-08 at 03:08 +0100, Frederic Weisbecker wrote:

> +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> +static DEFINE_PER_CPU(long, last_jiffies) = INITIAL_JIFFIES;
> +
> +static cputime_t get_vtime_delta(void)
> +{
> +	long delta;
> +
> +	delta = jiffies - __this_cpu_read(last_jiffies);
> +	__this_cpu_add(last_jiffies, delta);
> +
> +	return jiffies_to_cputime(delta);
> +}
> +
> +void vtime_account_system(struct task_struct *tsk)
> +{
> +	cputime_t delta_cpu = get_vtime_delta();
> +
> +	account_system_time(tsk, irq_count(), delta_cpu, cputime_to_scaled(delta_cpu));
> +}
> +
> +void vtime_account_user(struct task_struct *tsk)
> +{
> +	cputime_t delta_cpu = get_vtime_delta();
> +
> +	/*
> +	 * This is an unfortunate hack: if we flush user time only on
> +	 * irq entry, we miss the jiffies update and the time is spuriously
> +	 * accounted to system time.
> +	 */
> +	if (context_tracking_in_user())
> +		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));

Hmm, you called get_vtime_delta() up above, which updates the
last_jiffies per_cpu variable, but if for some reason,
context_tracking_in_user() isn't true, we throw away the delta. What
happens to those lost jiffies? They go nowhere.

Shouldn't it be? :

	if (context_tracking_in_user()) {
		delta_cpu = get_vtime_delta();
		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
	}

-- Steve

> +}
> +
> +void vtime_account_idle(struct task_struct *tsk)
> +{
> +	cputime_t delta_cpu = get_vtime_delta();
> +
> +	account_idle_time(delta_cpu);
> +}
> +
> +static int __cpuinit vtime_cpu_notify(struct notifier_block *self,
> +				      unsigned long action, void *hcpu)
> +{
> +	long cpu = (long)hcpu;
> +	long *last_jiffies_cpu = per_cpu_ptr(&last_jiffies, cpu);
> +
> +	switch (action) {
> +	case CPU_UP_PREPARE:
> +	case CPU_UP_PREPARE_FROZEN:
> +		/*
> +		 * CHECKME: ensure that's visible by the CPU
> +		 * once it wakes up
> +		 */
> +		*last_jiffies_cpu = jiffies;
> +	default:
> +		break;
> +	}
> +
> +	return NOTIFY_OK;
> +}
> +
> +static int __init init_vtime(void)
> +{
> +	cpu_notifier(vtime_cpu_notify, 0);
> +	return 0;
> +}
> +early_initcall(init_vtime);
> +#endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 03/33] cputime: Generic on-demand virtual cputime accounting
  2013-01-08 20:26   ` Steven Rostedt
@ 2013-01-08 21:00     ` Paul E. McKenney
  0 siblings, 0 replies; 60+ messages in thread
From: Paul E. McKenney @ 2013-01-08 21:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frederic Weisbecker, LKML, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim, Paul Gortmaker,
	Peter Zijlstra, Thomas Gleixner

On Tue, Jan 08, 2013 at 03:26:11PM -0500, Steven Rostedt wrote:
> On Tue, 2013-01-08 at 03:08 +0100, Frederic Weisbecker wrote:
> 
> > diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> > index c952770..bd461ad 100644
> > --- a/kernel/context_tracking.c
> > +++ b/kernel/context_tracking.c
> > @@ -56,7 +56,7 @@ void user_enter(void)
> >  	local_irq_save(flags);
> >  	if (__this_cpu_read(context_tracking.active) &&
> >  	    __this_cpu_read(context_tracking.state) != IN_USER) {
> > -		__this_cpu_write(context_tracking.state, IN_USER);
> > +		vtime_user_enter(current);
> >  		/*
> >  		 * At this stage, only low level arch entry code remains and
> >  		 * then we'll run in userspace. We can assume there won't be
> > @@ -65,6 +65,7 @@ void user_enter(void)
> >  		 * on the tick.
> >  		 */
> >  		rcu_user_enter();
> 
> Hmm, the rcu_user_enter() can do quite a bit. Too bad we are accounting
> it as user time. I wonder if we could move the vtime_user_enter() below
> it. But then if vtime_user_enter() calls rcu_read_lock() it breaks.

If RCU_FAST_NO_HZ=y, the current mainline rcu_user_enter() can be a
bit expensive.  It is going on a diet for 3.9, however.

But there is a lower limit because the CPU moving to adaptive-tick user
mode must reliably inform other CPUs of this, which involves some
overhead due to memory-ordering issues.

							Thanx, Paul

> The notorious chicken vs egg ordeal!
> 
> -- Steve
> 
> > +		__this_cpu_write(context_tracking.state, IN_USER);
> >  	}
> >  	local_irq_restore(flags);
> >  }
> > @@ -90,12 +91,13 @@ void user_exit(void)
> >  
> >  	local_irq_save(flags);
> >  	if (__this_cpu_read(context_tracking.state) == IN_USER) {
> > -		__this_cpu_write(context_tracking.state, IN_KERNEL);
> >  		/*
> >  		 * We are going to run code that may use RCU. Inform
> >  		 * RCU core about that (ie: we may need the tick again).
> >  		 */
> >  		rcu_user_exit();
> > +		vtime_user_exit(current);
> > +		__this_cpu_write(context_tracking.state, IN_KERNEL);
> >  	}
> >  	local_irq_restore(flags);
> >  }
> 
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/33] cputime: Allow dynamic switch between tick/virtual based cputime accounting
  2013-01-08  2:08 ` [PATCH 04/33] cputime: Allow dynamic switch between tick/virtual based " Frederic Weisbecker
@ 2013-01-08 21:20   ` Steven Rostedt
  2013-01-08 23:22     ` Frederic Weisbecker
  0 siblings, 1 reply; 60+ messages in thread
From: Steven Rostedt @ 2013-01-08 21:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Thomas Gleixner

On Tue, 2013-01-08 at 03:08 +0100, Frederic Weisbecker wrote:

> @@ -439,29 +443,13 @@ void account_idle_ticks(unsigned long ticks)
>  
>  	account_idle_time(jiffies_to_cputime(ticks));
>  }
> -
>  #endif
>  
> +

Spurious newline.

-- Steve

>  /*
>   * Use precise platform statistics if available:
>   */
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING
> -void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
> -{
> -	*ut = p->utime;
> -	*st = p->stime;
> -}
> -
> -void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
> -{
> -	struct task_cputime cputime;
> -
> -	thread_group_cputime(p, &cputime);
> -
> -	*ut = cputime.utime;
> -	*st = cputime.stime;
> -}
> -
>  void vtime_account_system_irqsafe(struct task_struct *tsk)
>  {
>  	unsigned long flags;
> @@ -517,8 +505,25 @@ void vtime_account(struct task_struct *tsk)
>  }
>  EXPORT_SYMBOL_GPL(vtime_account);
>  #endif /* __ARCH_HAS_VTIME_ACCOUNT */
> +#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
>  
> -#else
> +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> +void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
> +{
> +	*ut = p->utime;
> +	*st = p->stime;
> +}

Why not keep this out in the open like:

static void __task_cputime_adjusted() {
	...
}

#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
void task_cputime_adjusted(...)
{
	__task_cputime_adjusted(p, ut, st);
}

> +
> +void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
> +{
> +	struct task_cputime cputime;
> +
> +	thread_group_cputime(p, &cputime);
> +
> +	*ut = cputime.utime;
> +	*st = cputime.stime;
> +}
> +#else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
>  
>  #ifndef nsecs_to_cputime
>  # define nsecs_to_cputime(__nsecs)	nsecs_to_jiffies(__nsecs)
> @@ -548,6 +553,12 @@ static void cputime_adjust(struct task_cputime *curr,
>  {
>  	cputime_t rtime, utime, total;
>  
> +	if (vtime_accounting_enabled()) {
> +		*ut = curr->utime;
> +		*st = curr->stime;

Then here, we can do:

		__task_cputime_adjusted(curr, ut, st);
> +		return;

Isn't it supposed to do basically the same thing when
vtime_accounting_enabled() is set?

> +	}
> +
>  	utime = curr->utime;
>  	total = utime + curr->stime;
>  
> @@ -601,7 +612,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
>  	thread_group_cputime(p, &cputime);
>  	cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st);
>  }
> -#endif
> +#endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
>  
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>  static DEFINE_PER_CPU(long, last_jiffies) = INITIAL_JIFFIES;
> @@ -643,6 +654,11 @@ void vtime_account_idle(struct task_struct *tsk)
>  	account_idle_time(delta_cpu);
>  }
>  
> +bool vtime_accounting_enabled(void)
> +{
> +	return context_tracking_active();
> +}
> +
>  static int __cpuinit vtime_cpu_notify(struct notifier_block *self,
>  				      unsigned long action, void *hcpu)
>  {
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index fb8e5e4..314b9ee 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -632,8 +632,11 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
>  
>  static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
>  {
> -#ifndef CONFIG_VIRT_CPU_ACCOUNTING
> +#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
>  	unsigned long ticks;
> +
> +	if (vtime_accounting_enabled())
> +		return;

If this can be dynamically changed at runtime, wouldn't some of these
accounting variables get corrupted? Like the last_jiffies per_cpu
variable?

-- Steve

>  	/*
>  	 * We stopped the tick in idle. Update process times would miss the
>  	 * time we slept as update_process_times does only a 1 tick



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/33] cputime: Allow dynamic switch between tick/virtual based cputime accounting
  2013-01-08 21:20   ` Steven Rostedt
@ 2013-01-08 23:22     ` Frederic Weisbecker
  0 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-08 23:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Thomas Gleixner

2013/1/8 Steven Rostedt <rostedt@goodmis.org>:
> On Tue, 2013-01-08 at 03:08 +0100, Frederic Weisbecker wrote:
>
>> @@ -439,29 +443,13 @@ void account_idle_ticks(unsigned long ticks)
>>
>>       account_idle_time(jiffies_to_cputime(ticks));
>>  }
>> -
>>  #endif
>>
>> +
>
> Spurious newline.

There may be even more of those in the other patches :)

>> +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
>> +void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
>> +{
>> +     *ut = p->utime;
>> +     *st = p->stime;
>> +}
>
> Why not keep this out in the open like:
>
> static void __task_cputime_adjusted() {
>         ...
> }
>
> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> void task_cputime_adjusted(...)
> {
>         __task_cputime_adjusted(p, ut, st);
> }
>
>> +
>> +void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
>> +{
>> +     struct task_cputime cputime;
>> +
>> +     thread_group_cputime(p, &cputime);
>> +
>> +     *ut = cputime.utime;
>> +     *st = cputime.stime;
>> +}
>> +#else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
>>
>>  #ifndef nsecs_to_cputime
>>  # define nsecs_to_cputime(__nsecs)   nsecs_to_jiffies(__nsecs)
>> @@ -548,6 +553,12 @@ static void cputime_adjust(struct task_cputime *curr,
>>  {
>>       cputime_t rtime, utime, total;
>>
>> +     if (vtime_accounting_enabled()) {
>> +             *ut = curr->utime;
>> +             *st = curr->stime;
>
> Then here, we can do:
>
>                 __task_cputime_adjusted(curr, ut, st);
>> +             return;

Note curr above is not a task but a struct task_cputime. But it could be:

__task_cputime_adjusted(curr->utime, curr->stime, ut, st);

But thinking more about it, I should just remove this:

+     if (vtime_accounting_enabled()) {
+             *ut = curr->utime;
+             *st = curr->stime;

It concerns CONFIG_VIRT_CPU_ACCOUNTING_GEN, which is actually not
precise enough (it's jiffies based) and thus needs the same adjustment
performed on normal tick-based cputime.

>
> Isn't suppose to do basically the same thing, when
> vtime_accounting_enabled() is set?
>
>> +     }
>> +
>>       utime = curr->utime;
>>       total = utime + curr->stime;
>>
>> @@ -601,7 +612,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
>>       thread_group_cputime(p, &cputime);
>>       cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st);
>>  }
>> -#endif
>> +#endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
>>
>>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>>  static DEFINE_PER_CPU(long, last_jiffies) = INITIAL_JIFFIES;
>> @@ -643,6 +654,11 @@ void vtime_account_idle(struct task_struct *tsk)
>>       account_idle_time(delta_cpu);
>>  }
>>
>> +bool vtime_accounting_enabled(void)
>> +{
>> +     return context_tracking_active();
>> +}
>> +
>>  static int __cpuinit vtime_cpu_notify(struct notifier_block *self,
>>                                     unsigned long action, void *hcpu)
>>  {
>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>> index fb8e5e4..314b9ee 100644
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -632,8 +632,11 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
>>
>>  static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
>>  {
>> -#ifndef CONFIG_VIRT_CPU_ACCOUNTING
>> +#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
>>       unsigned long ticks;
>> +
>> +     if (vtime_accounting_enabled())
>> +             return;
>
> If this can be dynamically changed at runtime, wouldn't some of these
> accounting variables get corrupted? Like the last_jiffies per_cpu
> variable?

It can't yet be dynamically changed at runtime because full dynticks
CPUs are defined through boot parameters. But it's indeed something
we'll need to care about once we have a runtime interface.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 03/33] cputime: Generic on-demand virtual cputime accounting
  2013-01-08  2:08 ` [PATCH 03/33] cputime: Generic on-demand virtual cputime accounting Frederic Weisbecker
                     ` (2 preceding siblings ...)
  2013-01-08 20:45   ` Steven Rostedt
@ 2013-01-09 13:46   ` Steven Rostedt
  2013-01-09 13:50     ` Steven Rostedt
  3 siblings, 1 reply; 60+ messages in thread
From: Steven Rostedt @ 2013-01-09 13:46 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Thomas Gleixner

On Tue, 2013-01-08 at 03:08 +0100, Frederic Weisbecker wrote:

> diff --git a/init/Kconfig b/init/Kconfig
> index 5cc8713..51b5c33 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -322,6 +322,9 @@ source "kernel/time/Kconfig"
>  
>  menu "CPU/Task time and stats accounting"
>  
> +config VIRT_CPU_ACCOUNTING
> +	bool
> +
>  choice
>  	prompt "Cputime accounting"
>  	default TICK_CPU_ACCOUNTING if !PPC64
> @@ -338,9 +341,10 @@ config TICK_CPU_ACCOUNTING
>  
>  	  If unsure, say Y.
>  
> -config VIRT_CPU_ACCOUNTING
> +config VIRT_CPU_ACCOUNTING_NATIVE
>  	bool "Deterministic task and CPU time accounting"
>  	depends on HAVE_VIRT_CPU_ACCOUNTING
> +	select VIRT_CPU_ACCOUNTING
>  	help
>  	  Select this option to enable more accurate task and CPU time
>  	  accounting.  This is done by reading a CPU counter on each
> @@ -350,6 +354,15 @@ config VIRT_CPU_ACCOUNTING
>  	  this also enables accounting of stolen time on logically-partitioned
>  	  systems.
>  
> +config VIRT_CPU_ACCOUNTING_GEN
> +	bool "Full dynticks CPU time accounting"
> +	depends on HAVE_CONTEXT_TRACKING
> +	select VIRT_CPU_ACCOUNTING
> +	select CONTEXT_TRACKING

	select CONTEXT_TRACKING_FORCE

Otherwise the user time never gets updated.
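
I.e. the entry would end up looking something like this (just a
sketch; I'm assuming the FORCE select comes in addition to the
existing CONTEXT_TRACKING one):

config VIRT_CPU_ACCOUNTING_GEN
	bool "Full dynticks CPU time accounting"
	depends on HAVE_CONTEXT_TRACKING
	select VIRT_CPU_ACCOUNTING
	select CONTEXT_TRACKING
	select CONTEXT_TRACKING_FORCE
	help
	  Implement a generic virtual based cputime accounting by using
	  the context tracking subsystem.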

-- Steve

> +	help
> +	  Implement a generic virtual based cputime accounting by using
> +	  the context tracking subsystem.
> +
>  config IRQ_TIME_ACCOUNTING
>  	bool "Fine granularity task level IRQ time accounting"
>  	depends on HAVE_IRQ_TIME_ACCOUNTING





* Re: [PATCH 03/33] cputime: Generic on-demand virtual cputime accounting
  2013-01-09 13:46   ` Steven Rostedt
@ 2013-01-09 13:50     ` Steven Rostedt
  0 siblings, 0 replies; 60+ messages in thread
From: Steven Rostedt @ 2013-01-09 13:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Thomas Gleixner

On Wed, 2013-01-09 at 08:46 -0500, Steven Rostedt wrote:
>   
> > +config VIRT_CPU_ACCOUNTING_GEN
> > +	bool "Full dynticks CPU time accounting"
> > +	depends on HAVE_CONTEXT_TRACKING
> > +	select VIRT_CPU_ACCOUNTING
> > +	select CONTEXT_TRACKING
> 
> 	select CONTEXT_TRACKING_FORCE
> 
> Otherwise the user time never gets updated.
> 

Bah, kernel time is now screwed up with this:


# time /work/c/kernelspin 5

real    0m5.001s
user    0m1.785s
sys     0m0.000s
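
For reference, kernelspin is just a local test hammering the kernel;
the actual source isn't in this thread, but a hypothetical equivalent
that should show the same symptom would be:

/* Spin in syscalls for argv[1] seconds: nearly all of the CPU time
 * should show up as system time, not 1.7s of user and zero sys. */
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(int argc, char **argv)
{
	time_t end = time(NULL) + (argc > 1 ? atoi(argv[1]) : 5);

	while (time(NULL) < end)
		syscall(SYS_getpid);	/* force a real kernel entry */

	return 0;
}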


-- Steve




* Re: [PATCH 06/33] cputime: Safely read cputime of full dynticks CPUs
  2013-01-08  2:08 ` [PATCH 06/33] cputime: Safely read cputime of full dynticks CPUs Frederic Weisbecker
@ 2013-01-09 14:54   ` Steven Rostedt
  2013-01-09 18:35     ` Frederic Weisbecker
  0 siblings, 1 reply; 60+ messages in thread
From: Steven Rostedt @ 2013-01-09 14:54 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Thomas Gleixner

On Tue, 2013-01-08 at 03:08 +0100, Frederic Weisbecker wrote:

> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 07912dd..bf4f72d 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -484,7 +484,7 @@ void vtime_task_switch(struct task_struct *prev)
>   * vtime_account().
>   */
>  #ifndef __ARCH_HAS_VTIME_ACCOUNT
> -void vtime_account(struct task_struct *tsk)
> +void vtime_account_irq_enter(struct task_struct *tsk)
>  {
>  	if (!in_interrupt()) {
>  		/*
> @@ -505,7 +505,7 @@ void vtime_account(struct task_struct *tsk)
>  	}
>  	vtime_account_system(tsk);
>  }
> -EXPORT_SYMBOL_GPL(vtime_account);
> +EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
>  #endif /* __ARCH_HAS_VTIME_ACCOUNT */
>  #endif /* CONFIG_VIRT_CPU_ACCOUNTING */
>  
> @@ -616,41 +616,67 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
>  #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
>  
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> -static DEFINE_PER_CPU(long, last_jiffies) = INITIAL_JIFFIES;
> -
> -static cputime_t get_vtime_delta(void)
> +static cputime_t get_vtime_delta(struct task_struct *tsk)
>  {
>  	long delta;
>  
> -	delta = jiffies - __this_cpu_read(last_jiffies);
> -	__this_cpu_add(last_jiffies, delta);
> +	delta = jiffies - tsk->prev_jiffies;
> +	tsk->prev_jiffies += delta;
>  
>  	return jiffies_to_cputime(delta);
>  }
>  
> -void vtime_account_system(struct task_struct *tsk)
> +static void __vtime_account_system(struct task_struct *tsk)
>  {
> -	cputime_t delta_cpu = get_vtime_delta();
> +	cputime_t delta_cpu = get_vtime_delta(tsk);
>  
>  	account_system_time(tsk, irq_count(), delta_cpu, cputime_to_scaled(delta_cpu));
>  }
>  
> +void vtime_account_system(struct task_struct *tsk)
> +{
> +	write_seqlock(&tsk->vtime_seqlock);
> +	__vtime_account_system(tsk);

__vtime_account_system() calls account_system_time()
account_system_time() calls __account_system_time()
__account_system_time() calls acct_update_integrals()
(when CONFIG_TASK_XACCT is set)
acct_update_integrals() calls task_cputime()
task_cputime() grabs t->vtime_seqlock for read

 DEADLOCK

ironically the subject says *Safely* read cputime ;-)
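
Spelled out (a sketch, assuming vtime_seqlock is a plain seqlock_t as
in this patch), the write path ends up spinning on its own lock:

void vtime_account_system(struct task_struct *tsk)
{
	write_seqlock(&tsk->vtime_seqlock);	/* sequence goes odd */
	__vtime_account_system(tsk);
	/* ... which recurses down to task_cputime(), i.e.: */
	{
		unsigned int seq;
		cputime_t utime, stime;

		do {
			/* read_seqbegin() waits for an even sequence,
			 * but we made it odd above: spins forever. */
			seq = read_seqbegin(&tsk->vtime_seqlock);
			utime = tsk->utime;
			stime = tsk->stime;
		} while (read_seqretry(&tsk->vtime_seqlock, seq));
	}
	write_sequnlock(&tsk->vtime_seqlock);	/* never reached */
}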

-- Steve

> +	write_sequnlock(&tsk->vtime_seqlock);
> +}
> +
> +void vtime_account_irq_exit(struct task_struct *tsk)
> +{
> +	write_seqlock(&tsk->vtime_seqlock);
> +	if (context_tracking_in_user())
> +		tsk->prev_jiffies_whence = JIFFIES_USER;
> +	__vtime_account_system(tsk);
> +	write_sequnlock(&tsk->vtime_seqlock);
> +}
> +




* Re: [PATCH 06/33] cputime: Safely read cputime of full dynticks CPUs
  2013-01-09 14:54   ` Steven Rostedt
@ 2013-01-09 18:35     ` Frederic Weisbecker
  0 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-01-09 18:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Thomas Gleixner

On Wed, Jan 09, 2013 at 09:54:11AM -0500, Steven Rostedt wrote:
> On Tue, 2013-01-08 at 03:08 +0100, Frederic Weisbecker wrote:
> 
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index 07912dd..bf4f72d 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -484,7 +484,7 @@ void vtime_task_switch(struct task_struct *prev)
> >   * vtime_account().
> >   */
> >  #ifndef __ARCH_HAS_VTIME_ACCOUNT
> > -void vtime_account(struct task_struct *tsk)
> > +void vtime_account_irq_enter(struct task_struct *tsk)
> >  {
> >  	if (!in_interrupt()) {
> >  		/*
> > @@ -505,7 +505,7 @@ void vtime_account(struct task_struct *tsk)
> >  	}
> >  	vtime_account_system(tsk);
> >  }
> > -EXPORT_SYMBOL_GPL(vtime_account);
> > +EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
> >  #endif /* __ARCH_HAS_VTIME_ACCOUNT */
> >  #endif /* CONFIG_VIRT_CPU_ACCOUNTING */
> >  
> > @@ -616,41 +616,67 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
> >  #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
> >  
> >  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> > -static DEFINE_PER_CPU(long, last_jiffies) = INITIAL_JIFFIES;
> > -
> > -static cputime_t get_vtime_delta(void)
> > +static cputime_t get_vtime_delta(struct task_struct *tsk)
> >  {
> >  	long delta;
> >  
> > -	delta = jiffies - __this_cpu_read(last_jiffies);
> > -	__this_cpu_add(last_jiffies, delta);
> > +	delta = jiffies - tsk->prev_jiffies;
> > +	tsk->prev_jiffies += delta;
> >  
> >  	return jiffies_to_cputime(delta);
> >  }
> >  
> > -void vtime_account_system(struct task_struct *tsk)
> > +static void __vtime_account_system(struct task_struct *tsk)
> >  {
> > -	cputime_t delta_cpu = get_vtime_delta();
> > +	cputime_t delta_cpu = get_vtime_delta(tsk);
> >  
> >  	account_system_time(tsk, irq_count(), delta_cpu, cputime_to_scaled(delta_cpu));
> >  }
> >  
> > +void vtime_account_system(struct task_struct *tsk)
> > +{
> > +	write_seqlock(&tsk->vtime_seqlock);
> > +	__vtime_account_system(tsk);
> 
> __vtime_account_system() calls account_system_time()
> account_system_time() calls __account_system_time()
> __account_system_time() calls acct_update_integrals()
> (when CONFIG_TASK_XACCT is set)
> acct_update_integrals() calls task_cputime()
> task_cputime() grabs t->vtime_seqlock for read
> 
>  DEADLOCK
> 
> ironically the subject says *Safely* read cputime ;-)

Well at least it crashes safely. Safely as in "deterministic".


* Re: [PATCH 07/33] nohz: Basic full dynticks interface
  2013-01-08  2:08 ` [PATCH 07/33] nohz: Basic full dynticks interface Frederic Weisbecker
@ 2013-02-11 14:35   ` Borislav Petkov
  2013-02-20 16:32     ` Borislav Petkov
  2013-03-07 23:35     ` Frederic Weisbecker
  0 siblings, 2 replies; 60+ messages in thread
From: Borislav Petkov @ 2013-02-11 14:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Steven Rostedt, Thomas Gleixner

On Tue, Jan 08, 2013 at 03:08:07AM +0100, Frederic Weisbecker wrote:

[ … ]

> diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> index 8601f0d..dc6381d 100644
> --- a/kernel/time/Kconfig
> +++ b/kernel/time/Kconfig
> @@ -70,6 +70,15 @@ config NO_HZ
>  	  only trigger on an as-needed basis both when the system is
>  	  busy and when the system is idle.
>  
> +config NO_HZ_FULL
> +       bool "Full tickless system"

I think you want to say here "Almost-completely tickless system".
"Almost" because of that one CPU outside of the range :-)

> +       depends on NO_HZ && RCU_USER_QS && VIRT_CPU_ACCOUNTING_GEN && RCU_NOCB_CPU && SMP
> +       select CONTEXT_TRACKING_FORCE
> +       help
> +         Try to be tickless everywhere, not just in idle. (You need
> +	 to fill up the full_nohz_mask boot parameter).
> +
> +
>  config HIGH_RES_TIMERS
>  	bool "High Resolution Timer Support"
>  	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 314b9ee..494a2aa 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -142,6 +142,29 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
>  	profile_tick(CPU_PROFILING);
>  }
>  
> +#ifdef CONFIG_NO_HZ_FULL
> +static cpumask_var_t full_nohz_mask;
> +bool have_full_nohz_mask;
> +
> +int tick_nohz_full_cpu(int cpu)
> +{
> +	if (!have_full_nohz_mask)
> +		return 0;
> +
> +	return cpumask_test_cpu(cpu, full_nohz_mask);
> +}
> +
> +/* Parse the boot-time nohz CPU list from the kernel parameters. */
> +static int __init tick_nohz_full_setup(char *str)
> +{
> +	alloc_bootmem_cpumask_var(&full_nohz_mask);
> +	have_full_nohz_mask = true;
> +	cpulist_parse(str, full_nohz_mask);

Don't you want to check retval of cpulist_parse first here before
assigning have_full_nohz_mask and allocating cpumask var?

We don't trust userspace, you know.

> +	return 1;
> +}
> +__setup("full_nohz=", tick_nohz_full_setup);

I'd guess this kernel parameter needs to go into
Documentation/kernel-parameters.txt along with a referral to
Documentation/cputopology.txt which explains how to specify cpulists for
n00bs like me :-)

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH 08/33] nohz: Assign timekeeping duty to a non-full-nohz CPU
  2013-01-08  2:08 ` [PATCH 08/33] nohz: Assign timekeeping duty to a non-full-nohz CPU Frederic Weisbecker
@ 2013-02-15 11:57   ` Borislav Petkov
  2013-02-20 15:57     ` Frederic Weisbecker
  0 siblings, 1 reply; 60+ messages in thread
From: Borislav Petkov @ 2013-02-15 11:57 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Steven Rostedt, Thomas Gleixner

On Tue, Jan 08, 2013 at 03:08:08AM +0100, Frederic Weisbecker wrote:
> This way the full nohz CPUs can safely run with the tick
> stopped with a guarantee that somebody else is taking
> care of the jiffies and gtod progression.
> 
> NOTE: this doesn't handle CPU hotplug. Also we could use something
> more elaborate wrt. powersaving if we have more than one non full-nohz
> CPU running. But let's use this KISS solution for now.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Alessio Igor Bogani <abogani@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Chris Metcalf <cmetcalf@tilera.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Geoff Levand <geoff@infradead.org>
> Cc: Gilad Ben Yossef <gilad@benyossef.com>
> Cc: Hakan Akkan <hakanakkan@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Li Zhong <zhong@linux.vnet.ibm.com>
> Cc: Namhyung Kim <namhyung.kim@lge.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> [fix have_nohz_full_mask offcase]
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/time/tick-broadcast.c |    3 ++-
>  kernel/time/tick-common.c    |    5 ++++-
>  kernel/time/tick-sched.c     |    9 ++++++++-
>  3 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
> index f113755..596c547 100644
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -537,7 +537,8 @@ void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
>  		bc->event_handler = tick_handle_oneshot_broadcast;
>  
>  		/* Take the do_timer update */
> -		tick_do_timer_cpu = cpu;
> +		if (!tick_nohz_full_cpu(cpu))
> +			tick_do_timer_cpu = cpu;
>  
>  		/*
>  		 * We must be careful here. There might be other CPUs
> diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
> index b1600a6..83f2bd9 100644
> --- a/kernel/time/tick-common.c
> +++ b/kernel/time/tick-common.c
> @@ -163,7 +163,10 @@ static void tick_setup_device(struct tick_device *td,
>  		 * this cpu:
>  		 */
>  		if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
> -			tick_do_timer_cpu = cpu;
> +			if (!tick_nohz_full_cpu(cpu))
> +				tick_do_timer_cpu = cpu;
> +			else
> +				tick_do_timer_cpu = TICK_DO_TIMER_NONE;
>  			tick_next_period = ktime_get();
>  			tick_period = ktime_set(0, NSEC_PER_SEC / HZ);
>  		}
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 494a2aa..b75e302 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -112,7 +112,8 @@ static void tick_sched_do_timer(ktime_t now)
>  	 * this duty, then the jiffies update is still serialized by
>  	 * jiffies_lock.
>  	 */
> -	if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
> +	if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)
> +	    && !tick_nohz_full_cpu(cpu))
>  		tick_do_timer_cpu = cpu;
>  #endif
>  
> @@ -163,6 +164,8 @@ static int __init tick_nohz_full_setup(char *str)
>  	return 1;
>  }
>  __setup("full_nohz=", tick_nohz_full_setup);
> +#else
> +#define have_full_nohz_mask (0)
>  #endif

Looks like this hunk should be part of the previous patch, no?


...

Oui, oui monsieur! :-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH 08/33] nohz: Assign timekeeping duty to a non-full-nohz CPU
  2013-02-15 11:57   ` Borislav Petkov
@ 2013-02-20 15:57     ` Frederic Weisbecker
  0 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-02-20 15:57 UTC (permalink / raw)
  To: Borislav Petkov, Frederic Weisbecker, LKML, Alessio Igor Bogani,
	Andrew Morton, Chris Metcalf, Christoph Lameter, Geoff Levand,
	Gilad Ben Yossef, Hakan Akkan, Ingo Molnar, Li Zhong,
	Namhyung Kim, Paul E. McKenney, Paul Gortmaker, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner

2013/2/15 Borislav Petkov <bp@alien8.de>:
> On Tue, Jan 08, 2013 at 03:08:08AM +0100, Frederic Weisbecker wrote:
>> This way the full nohz CPUs can safely run with the tick
>> stopped with a guarantee that somebody else is taking
>> care of the jiffies and gtod progression.
>>
>> NOTE: this doesn't handle CPU hotplug. Also we could use something
>> more elaborate wrt. powersaving if we have more than one non full-nohz
>> CPU running. But let's use this KISS solution for now.
>>
>> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
>> Cc: Alessio Igor Bogani <abogani@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Chris Metcalf <cmetcalf@tilera.com>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: Geoff Levand <geoff@infradead.org>
>> Cc: Gilad Ben Yossef <gilad@benyossef.com>
>> Cc: Hakan Akkan <hakanakkan@gmail.com>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: Li Zhong <zhong@linux.vnet.ibm.com>
>> Cc: Namhyung Kim <namhyung.kim@lge.com>
>> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> [fix have_nohz_full_mask offcase]
>> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
>> ---
>>  kernel/time/tick-broadcast.c |    3 ++-
>>  kernel/time/tick-common.c    |    5 ++++-
>>  kernel/time/tick-sched.c     |    9 ++++++++-
>>  3 files changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
>> index f113755..596c547 100644
>> --- a/kernel/time/tick-broadcast.c
>> +++ b/kernel/time/tick-broadcast.c
>> @@ -537,7 +537,8 @@ void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
>>               bc->event_handler = tick_handle_oneshot_broadcast;
>>
>>               /* Take the do_timer update */
>> -             tick_do_timer_cpu = cpu;
>> +             if (!tick_nohz_full_cpu(cpu))
>> +                     tick_do_timer_cpu = cpu;
>>
>>               /*
>>                * We must be careful here. There might be other CPUs
>> diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
>> index b1600a6..83f2bd9 100644
>> --- a/kernel/time/tick-common.c
>> +++ b/kernel/time/tick-common.c
>> @@ -163,7 +163,10 @@ static void tick_setup_device(struct tick_device *td,
>>                * this cpu:
>>                */
>>               if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
>> -                     tick_do_timer_cpu = cpu;
>> +                     if (!tick_nohz_full_cpu(cpu))
>> +                             tick_do_timer_cpu = cpu;
>> +                     else
>> +                             tick_do_timer_cpu = TICK_DO_TIMER_NONE;
>>                       tick_next_period = ktime_get();
>>                       tick_period = ktime_set(0, NSEC_PER_SEC / HZ);
>>               }
>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>> index 494a2aa..b75e302 100644
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -112,7 +112,8 @@ static void tick_sched_do_timer(ktime_t now)
>>        * this duty, then the jiffies update is still serialized by
>>        * jiffies_lock.
>>        */
>> -     if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
>> +     if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)
>> +         && !tick_nohz_full_cpu(cpu))
>>               tick_do_timer_cpu = cpu;
>>  #endif
>>
>> @@ -163,6 +164,8 @@ static int __init tick_nohz_full_setup(char *str)
>>       return 1;
>>  }
>>  __setup("full_nohz=", tick_nohz_full_setup);
>> +#else
>> +#define have_full_nohz_mask (0)
>>  #endif
>
> Looks like this hunk should be part of the previous patch, no?

Ah probably yeah :)

> Oui, oui monsieur! :-)

:)


* Re: [PATCH 07/33] nohz: Basic full dynticks interface
  2013-02-11 14:35   ` Borislav Petkov
@ 2013-02-20 16:32     ` Borislav Petkov
  2013-03-07 23:41       ` Frederic Weisbecker
  2013-03-07 23:35     ` Frederic Weisbecker
  1 sibling, 1 reply; 60+ messages in thread
From: Borislav Petkov @ 2013-02-20 16:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Steven Rostedt, Thomas Gleixner

On Mon, Feb 11, 2013 at 03:35:29PM +0100, Borislav Petkov wrote:
> > +/* Parse the boot-time nohz CPU list from the kernel parameters. */
> > +static int __init tick_nohz_full_setup(char *str)
> > +{
> > +	alloc_bootmem_cpumask_var(&full_nohz_mask);
> > +	have_full_nohz_mask = true;
> > +	cpulist_parse(str, full_nohz_mask);
> 
> Don't you want to check retval of cpulist_parse first here before
> assigning have_full_nohz_mask and allocating cpumask var?
> 
> We don't trust userspace, you know.
> 
> > +	return 1;
> > +}
> > +__setup("full_nohz=", tick_nohz_full_setup);

One more thing. AFAICT, full_nohz requires rcu_nocbs to pass in the same
mask, right?

Maybe tick_nohz_full_setup() could be made to call rcu_nocb_setup()
without the need to pass "rcu_nocbs=" option on the cmd line; in the
sense that if user supplies a full_nohz mask, she wants the same mask
for rcu_nocbs...
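
Something along these lines maybe (only a sketch: rcu_nocb_setup() is
currently static in kernel/rcutree_plugin.h, so it would first have to
be made callable from here):

static int __init tick_nohz_full_setup(char *str)
{
	alloc_bootmem_cpumask_var(&full_nohz_mask);
	have_full_nohz_mask = true;
	cpulist_parse(str, full_nohz_mask);

	/* Imply rcu_nocbs= with the same mask */
	rcu_nocb_setup(str);
	return 1;
}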

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH 07/33] nohz: Basic full dynticks interface
  2013-02-11 14:35   ` Borislav Petkov
  2013-02-20 16:32     ` Borislav Petkov
@ 2013-03-07 23:35     ` Frederic Weisbecker
  2013-03-08 10:17       ` Borislav Petkov
  1 sibling, 1 reply; 60+ messages in thread
From: Frederic Weisbecker @ 2013-03-07 23:35 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Frederic Weisbecker, LKML, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

2013/2/11 Borislav Petkov <bp@alien8.de>:
> On Tue, Jan 08, 2013 at 03:08:07AM +0100, Frederic Weisbecker wrote:
>
> [ … ]
>
>> diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
>> index 8601f0d..dc6381d 100644
>> --- a/kernel/time/Kconfig
>> +++ b/kernel/time/Kconfig
>> @@ -70,6 +70,15 @@ config NO_HZ
>>         only trigger on an as-needed basis both when the system is
>>         busy and when the system is idle.
>>
>> +config NO_HZ_FULL
>> +       bool "Full tickless system"
>
> I think you want to say here "Almost-completely tickless system".
> "Almost" because of that one CPU outside of the range :-)

I think that "Full dynticks system" would express well what happens?

>
>> +       depends on NO_HZ && RCU_USER_QS && VIRT_CPU_ACCOUNTING_GEN && RCU_NOCB_CPU && SMP
>> +       select CONTEXT_TRACKING_FORCE
>> +       help
>> +         Try to be tickless everywhere, not just in idle. (You need
>> +      to fill up the full_nohz_mask boot parameter).
>> +
>> +
>>  config HIGH_RES_TIMERS
>>       bool "High Resolution Timer Support"
>>       depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>> index 314b9ee..494a2aa 100644
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -142,6 +142,29 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
>>       profile_tick(CPU_PROFILING);
>>  }
>>
>> +#ifdef CONFIG_NO_HZ_FULL
>> +static cpumask_var_t full_nohz_mask;
>> +bool have_full_nohz_mask;
>> +
>> +int tick_nohz_full_cpu(int cpu)
>> +{
>> +     if (!have_full_nohz_mask)
>> +             return 0;
>> +
>> +     return cpumask_test_cpu(cpu, full_nohz_mask);
>> +}
>> +
>> +/* Parse the boot-time nohz CPU list from the kernel parameters. */
>> +static int __init tick_nohz_full_setup(char *str)
>> +{
>> +     alloc_bootmem_cpumask_var(&full_nohz_mask);
>> +     have_full_nohz_mask = true;
>> +     cpulist_parse(str, full_nohz_mask);
>
> Don't you want to check retval of cpulist_parse first here before
> assigning have_full_nohz_mask and allocating cpumask var?
>
> We don't trust userspace, you know.

Yeah sure. I was really in draft/laboratory mode until now. But I
guess I need to start thinking about such details, since nobody seems
to oppose the whole design. Time to zoom in.
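
For the record, the checked version should look about like this
(cpulist_parse() returns 0 on success, a negative error otherwise):

static int __init tick_nohz_full_setup(char *str)
{
	alloc_bootmem_cpumask_var(&full_nohz_mask);
	if (cpulist_parse(str, full_nohz_mask) < 0) {
		pr_warning("NOHZ: Incorrect full_nohz cpumask\n");
		return 1;
	}
	have_full_nohz_mask = true;
	return 1;
}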

>
>> +     return 1;
>> +}
>> +__setup("full_nohz=", tick_nohz_full_setup);
>
> I'd guess this kernel parameter needs to go into
> Documentation/kernel-parameters.txt along with a referral to
> Documentation/cputopology.txt which explains how to specify cpulists for
> n00bs like me :-)

Sure, I'll add that on my TODO list.

Thanks!


* Re: [PATCH 07/33] nohz: Basic full dynticks interface
  2013-02-20 16:32     ` Borislav Petkov
@ 2013-03-07 23:41       ` Frederic Weisbecker
  0 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-03-07 23:41 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Frederic Weisbecker, LKML, Alessio Igor Bogani, Andrew Morton,
	Chris Metcalf, Christoph Lameter, Geoff Levand, Gilad Ben Yossef,
	Hakan Akkan, Ingo Molnar, Li Zhong, Namhyung Kim,
	Paul E. McKenney, Paul Gortmaker, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

2013/2/20 Borislav Petkov <bp@alien8.de>:
> On Mon, Feb 11, 2013 at 03:35:29PM +0100, Borislav Petkov wrote:
>> > +/* Parse the boot-time nohz CPU list from the kernel parameters. */
>> > +static int __init tick_nohz_full_setup(char *str)
>> > +{
>> > +   alloc_bootmem_cpumask_var(&full_nohz_mask);
>> > +   have_full_nohz_mask = true;
>> > +   cpulist_parse(str, full_nohz_mask);
>>
>> Don't you want to check retval of cpulist_parse first here before
>> assigning have_full_nohz_mask and allocating cpumask var?
>>
>> We don't trust userspace, you know.
>>
>> > +   return 1;
>> > +}
>> > +__setup("full_nohz=", tick_nohz_full_setup);
>
> One more thing. AFAICT, full_nohz requires rcu_nocbs to pass in the same
> mask, right?

Right!

> Maybe tick_nohz_full_setup() could be made to call rcu_nocb_setup()
> without the need to pass "rcu_nocbs=" option on the cmd line; in the
> sense that if user supplies a full_nohz mask, she wants the same mask
> for rcu_nocbs...

Yeah that's probably something we want. (added to the TODO list)

Thanks.


* Re: [PATCH 17/33] sched: Update clock of nohz busiest rq before balancing
  2013-01-08 10:20   ` Li Zhong
@ 2013-03-07 23:51     ` Frederic Weisbecker
  0 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-03-07 23:51 UTC (permalink / raw)
  To: Li Zhong
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Namhyung Kim, Paul E. McKenney, Paul Gortmaker,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner

2013/1/8 Li Zhong <zhong@linux.vnet.ibm.com>:
> On Tue, 2013-01-08 at 03:08 +0100, Frederic Weisbecker wrote:
>> move_tasks() and active_load_balance_cpu_stop() both need
>> to have the busiest rq clock uptodate because they may end
>> up calling can_migrate_task() that uses rq->clock_task
>> to determine if the task running in the busiest runqueue
>> is cache hot.
>>
>> Hence if the busiest runqueue is tickless, update its clock
>> before reading it.
>>
>> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
>> Cc: Alessio Igor Bogani <abogani@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Chris Metcalf <cmetcalf@tilera.com>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: Geoff Levand <geoff@infradead.org>
>> Cc: Gilad Ben Yossef <gilad@benyossef.com>
>> Cc: Hakan Akkan <hakanakkan@gmail.com>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: Li Zhong <zhong@linux.vnet.ibm.com>
>> Cc: Namhyung Kim <namhyung.kim@lge.com>
>> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> [ Forward port conflicts ]
>> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
>> ---
>>  kernel/sched/fair.c |   17 +++++++++++++++++
>>  1 files changed, 17 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3d65ac7..e78d81104 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5002,6 +5002,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>  {
>>       int ld_moved, cur_ld_moved, active_balance = 0;
>>       int lb_iterations, max_lb_iterations;
>> +     int clock_updated;
>>       struct sched_group *group;
>>       struct rq *busiest;
>>       unsigned long flags;
>> @@ -5045,6 +5046,7 @@ redo:
>>
>>       ld_moved = 0;
>>       lb_iterations = 1;
>> +     clock_updated = 0;
>>       if (busiest->nr_running > 1) {
>>               /*
>>                * Attempt to move tasks. If find_busiest_group has found
>> @@ -5068,6 +5070,14 @@ more_balance:
>>                */
>>               cur_ld_moved = move_tasks(&env);
>>               ld_moved += cur_ld_moved;
>> +
>> +             /*
>> +              * Move tasks may end up calling can_migrate_task() which
>> +              * requires an uptodate value of the rq clock.
>> +              */
>> +             update_nohz_rq_clock(busiest);
>> +             clock_updated = 1;
>
> According to the change log, it seems these lines should be added before
> move_tasks() above?

Yeah, but eventually it seems that can_migrate_task() doesn't make use
of the rq clock anymore. I guess I'll just drop that patch.

Thanks.


* Re: [PATCH 07/33] nohz: Basic full dynticks interface
  2013-03-07 23:35     ` Frederic Weisbecker
@ 2013-03-08 10:17       ` Borislav Petkov
  2013-03-08 13:45         ` Frederic Weisbecker
  0 siblings, 1 reply; 60+ messages in thread
From: Borislav Petkov @ 2013-03-08 10:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Steven Rostedt, Thomas Gleixner

On Fri, Mar 08, 2013 at 12:35:47AM +0100, Frederic Weisbecker wrote:
> I think that "Full dynticks system" would express well what happens?

Yeah, it probably doesn't really matter all that much in the end -
people will refer to this with different names like with other features
in Linux anyway. :-)

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH 07/33] nohz: Basic full dynticks interface
  2013-03-08 10:17       ` Borislav Petkov
@ 2013-03-08 13:45         ` Frederic Weisbecker
  2013-03-08 14:32           ` Borislav Petkov
  0 siblings, 1 reply; 60+ messages in thread
From: Frederic Weisbecker @ 2013-03-08 13:45 UTC (permalink / raw)
  To: Borislav Petkov, Frederic Weisbecker, LKML, Alessio Igor Bogani,
	Andrew Morton, Chris Metcalf, Christoph Lameter, Geoff Levand,
	Gilad Ben Yossef, Hakan Akkan, Ingo Molnar, Li Zhong,
	Namhyung Kim, Paul E. McKenney, Paul Gortmaker, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner

2013/3/8 Borislav Petkov <bp@alien8.de>:
> On Fri, Mar 08, 2013 at 12:35:47AM +0100, Frederic Weisbecker wrote:
>> I think that "Full dynticks system" would express well what happens?
>
> Yeah, it probably doesn't really matter all that much in the end -
> people will refer to this with different names like with other features
> in Linux anyway. :-)

Right, with some more or less precision, and different shades of
metaphor or metonymy :)

Dynticks, tickless, nohz, cpu isolation, pendulum clock noise free
kernel, bare metal performance snort, walking alone and free through
Latvian forests.


* Re: [PATCH 07/33] nohz: Basic full dynticks interface
  2013-03-08 13:45         ` Frederic Weisbecker
@ 2013-03-08 14:32           ` Borislav Petkov
  2013-03-08 16:55             ` Frederic Weisbecker
  0 siblings, 1 reply; 60+ messages in thread
From: Borislav Petkov @ 2013-03-08 14:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Steven Rostedt, Thomas Gleixner

On Fri, Mar 08, 2013 at 02:45:12PM +0100, Frederic Weisbecker wrote:
> 2013/3/8 Borislav Petkov <bp@alien8.de>:
> > On Fri, Mar 08, 2013 at 12:35:47AM +0100, Frederic Weisbecker wrote:
> >> I think that "Full dynticks system" would express well what happens?
> >
> > Yeah, it probably doesn't really matter all that much in the end -
> > people will refer to this with different names like with other features
> > in Linux anyway. :-)
> 
> Right, with some more or less precision, and different shades of
> metaphor or metonymy :)
> 
> Dynticks, tickless, nohz, cpu isolation, pendulum clock noise free
> kernel, bare metal performance snort, walking alone and free through
> Latvian forests.

"... completely naked." You definitely need that too. :-)

Ok, put all those above in the Kconfig help text and ship it - people
will *now* know what it is.

LOL.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH 07/33] nohz: Basic full dynticks interface
  2013-03-08 14:32           ` Borislav Petkov
@ 2013-03-08 16:55             ` Frederic Weisbecker
  0 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-03-08 16:55 UTC (permalink / raw)
  To: Borislav Petkov, Frederic Weisbecker, LKML, Alessio Igor Bogani,
	Andrew Morton, Chris Metcalf, Christoph Lameter, Geoff Levand,
	Gilad Ben Yossef, Hakan Akkan, Ingo Molnar, Li Zhong,
	Namhyung Kim, Paul E. McKenney, Paul Gortmaker, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner

2013/3/8 Borislav Petkov <bp@alien8.de>:
> On Fri, Mar 08, 2013 at 02:45:12PM +0100, Frederic Weisbecker wrote:
>> 2013/3/8 Borislav Petkov <bp@alien8.de>:
>> > On Fri, Mar 08, 2013 at 12:35:47AM +0100, Frederic Weisbecker wrote:
>> >> I think that "Full dynticks system" would express well what happens?
>> >
>> > Yeah, it probably doesn't really matter all that much in the end -
>> > people will refer to this with different names like with other features
>> > in Linux anyway. :-)
>>
>> Right, with some more or less precision, and different shades of
>> metaphor or metonymy :)
>>
>> Dynticks, tickless, nohz, cpu isolation, pendulum clock noise free
>> kernel, bare metal performance snort, walking alone and free through
>> Latvian forests.
>
> "... completely naked." You definitely need that too. :-)

I didn't mean _that_ free, but while at it, the picture will never be
complete without a chainsaw.

> Ok, put all those above in the Kconfig help text and ship it - people
> will *now* know what it is.

At last!


* Re: [PATCH 30/33] sched: Debug nohz rq clock
  2013-01-08  2:08 ` [PATCH 30/33] sched: Debug nohz " Frederic Weisbecker
@ 2013-03-20 23:23   ` Kevin Hilman
  2013-04-11 16:47     ` Frederic Weisbecker
  0 siblings, 1 reply; 60+ messages in thread
From: Kevin Hilman @ 2013-03-20 23:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Steven Rostedt, Thomas Gleixner

Hi Frederic,

On 01/07/2013 06:08 PM, Frederic Weisbecker wrote:
> The runqueue clock is supposed to be periodically updated by the
> tick. On full dynticks CPU we call update_nohz_rq_clock() before
> reading it. Now the scheduler code is complicated enough that we
> may miss some update_nohz_rq_clock() calls before reading the
> runqueue clock.
> 
> This therefore introduces a new debugging feature that detects
> when the rq clock is stale due to missing updates on full
> dynticks CPUs.
> 
> This can be later expanded to debug stale clocks on dynticks-idle
> CPUs.

[...]

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e1bac76..0fef0b3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -502,16 +502,39 @@ DECLARE_PER_CPU(struct rq, runqueues);
>  #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
>  #define raw_rq()		(&__raw_get_cpu_var(runqueues))
>  
> +static inline void rq_clock_check(struct rq *rq)
> +{
> +#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_NO_HZ_FULL)
> +	unsigned long long clock;
> +	unsigned long flags;
> +	int cpu;
> +
> +	cpu = cpu_of(rq);
> +	if (!tick_nohz_full_cpu(cpu) || rq->curr == rq->idle)
> +		return;
> +
> +	local_irq_save(flags);
> +	clock = sched_clock_cpu(cpu_of(rq));
> +	local_irq_restore(flags);
> +
> +	if (abs(clock - rq->clock) > (TICK_NSEC * 3))
> +		WARN_ON_ONCE(1);
> +#endif
> +}

In working on the ARM port for full nohz, I'm hitting this
warning early in the kernel boot, well before userspace starts
(dump below[2].)

I've seen a few different variations of this, but the common
thing for all of them is the use of wait_for_completion().

During boot, only swapper is running so it seems
any waiting of sufficient length during boot will always trigger
this warning.  The hack below[1] skips the check when current is the
init task, but I'm not sure if it's the right fix.

Kevin

[1]
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f96329b..56e74df 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -512,7 +512,8 @@ static inline void rq_clock_check(struct rq *rq)
 	int cpu;

 	cpu = cpu_of(rq);
-	if (!tick_nohz_full_cpu(cpu) || rq->curr == rq->idle)
+	if (!tick_nohz_full_cpu(cpu) || rq->curr == rq->idle ||
+	    is_global_init(current))
 		return;


[2] Example warning dump
[    1.101013] ------------[ cut here ]------------
[    1.105865] WARNING: at /work/kernel/linaro/dev/kernel/sched/sched.h:534 __schedule+0x510/0x790()
[    1.115051] Modules linked in:
[    1.118316] [<c00159d0>] (unwind_backtrace+0x0/0x104) from [<c054dc98>] (dump_stack+0x20/0x24)
[    1.127227] [<c054dc98>] (dump_stack+0x20/0x24) from [<c0037f58>] (warn_slowpath_common+0x5c/0x78)
[    1.136535] [<c0037f58>] (warn_slowpath_common+0x5c/0x78) from [<c0037fa0>] (warn_slowpath_null+0x2c/0x34)
[    1.146545] [<c0037fa0>] (warn_slowpath_null+0x2c/0x34) from [<c0560834>] (__schedule+0x510/0x790)
[    1.155822] [<c0560834>] (__schedule+0x510/0x790) from [<c0560ba8>] (schedule+0x40/0x80)
[    1.164215] [<c0560ba8>] (schedule+0x40/0x80) from [<c055e288>] (schedule_timeout+0x180/0x250)
[    1.173156] [<c055e288>] (schedule_timeout+0x180/0x250) from [<c05601a4>] (wait_for_common+0xb8/0x15c)
[    1.182800] [<c05601a4>] (wait_for_common+0xb8/0x15c) from [<c0560268>] (wait_for_completion+0x20/0x24)
[    1.192535] [<c0560268>] (wait_for_completion+0x20/0x24) from [<c005ce4c>] (kthread_create_on_node+0x84/0xe8)
[    1.202819] [<c005ce4c>] (kthread_create_on_node+0x84/0xe8) from [<c005732c>] (__alloc_workqueue_key+0x36c/0x440)
[    1.213470] [<c005732c>] (__alloc_workqueue_key+0x36c/0x440) from [<c052d380>] (rpc_init_mempool+0x44/0x118)
[    1.223663] [<c052d380>] (rpc_init_mempool+0x44/0x118) from [<c07f1714>] (init_sunrpc+0x10/0x6c)
[    1.232757] [<c07f1714>] (init_sunrpc+0x10/0x6c) from [<c00087f0>] (do_one_initcall+0x48/0x190)
[    1.241790] [<c00087f0>] (do_one_initcall+0x48/0x190) from [<c07c3988>] (do_basic_setup+0x9c/0xd0)
[    1.251068] [<c07c3988>] (do_basic_setup+0x9c/0xd0) from [<c07c3a88>] (kernel_init_freeable+0xcc/0x16c)
[    1.260833] [<c07c3a88>] (kernel_init_freeable+0xcc/0x16c) from [<c0548840>] (kernel_init+0x1c/0xf4)
[    1.270294] [<c0548840>] (kernel_init+0x1c/0xf4) from [<c000e390>] (ret_from_fork+0x14/0x20)
> +
>  static inline u64 rq_clock(struct rq *rq)
>  {
> +	rq_clock_check(rq);
>  	return rq->clock;
>  }
>  
>  static inline u64 rq_clock_task(struct rq *rq)
>  {
> +	rq_clock_check(rq);
>  	return rq->clock_task;
>  }
>  
> +
>  #ifdef CONFIG_SMP
>  
>  #define rcu_dereference_check_sched_domain(p) \
> 


* Re: [PATCH 30/33] sched: Debug nohz rq clock
  2013-03-20 23:23   ` Kevin Hilman
@ 2013-04-11 16:47     ` Frederic Weisbecker
  0 siblings, 0 replies; 60+ messages in thread
From: Frederic Weisbecker @ 2013-04-11 16:47 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: LKML, Alessio Igor Bogani, Andrew Morton, Chris Metcalf,
	Christoph Lameter, Geoff Levand, Gilad Ben Yossef, Hakan Akkan,
	Ingo Molnar, Li Zhong, Namhyung Kim, Paul E. McKenney,
	Paul Gortmaker, Peter Zijlstra, Steven Rostedt, Thomas Gleixner

On Wed, Mar 20, 2013 at 04:23:34PM -0700, Kevin Hilman wrote:
> Hi Frederic,
> 
> On 01/07/2013 06:08 PM, Frederic Weisbecker wrote:
> > The runqueue clock is supposed to be periodically updated by the
> > tick. On full dynticks CPU we call update_nohz_rq_clock() before
> > reading it. Now the scheduler code is complicated enough that we
> > may miss some update_nohz_rq_clock() calls before reading the
> > runqueue clock.
> > 
> > This therefore introduces a new debugging feature that detects
> > when the rq clock is stale due to missing updates on full
> > dynticks CPUs.
> > 
> > This can be later expanded to debug stale clocks on dynticks-idle
> > CPUs.
> 
> [...]
> 
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index e1bac76..0fef0b3 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -502,16 +502,39 @@ DECLARE_PER_CPU(struct rq, runqueues);
> >  #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
> >  #define raw_rq()		(&__raw_get_cpu_var(runqueues))
> >  
> > +static inline void rq_clock_check(struct rq *rq)
> > +{
> > +#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_NO_HZ_FULL)
> > +	unsigned long long clock;
> > +	unsigned long flags;
> > +	int cpu;
> > +
> > +	cpu = cpu_of(rq);
> > +	if (!tick_nohz_full_cpu(cpu) || rq->curr == rq->idle)
> > +		return;
> > +
> > +	local_irq_save(flags);
> > +	clock = sched_clock_cpu(cpu_of(rq));
> > +	local_irq_restore(flags);
> > +
> > +	if (abs(clock - rq->clock) > (TICK_NSEC * 3))
> > +		WARN_ON_ONCE(1);
> > +#endif
> > +}
> 
> In working on the ARM port for full nohz, I'm hitting this
> warning early in the kernel boot, well before userspace starts
> (dump below[2].)
> 
> I've seen a few different variations of this, but the common
> thing for all of them is the use of wait_for_completion().
> 
> During boot, only swapper is running so it seems
> any waiting of sufficient length during boot will always trigger
> this warning.  The hack below[1] skips the check when current is the
> init task, but I'm not sure if it's the right fix.
> 
> Kevin
> 
> [1]
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f96329b..56e74df 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -512,7 +512,8 @@ static inline void rq_clock_check(struct rq *rq)
>  	int cpu;
> 
>  	cpu = cpu_of(rq);
> -	if (!tick_nohz_full_cpu(cpu) || rq->curr == rq->idle)
> +	if (!tick_nohz_full_cpu(cpu) || rq->curr == rq->idle ||
> +	    is_global_init(current))

Makes sense. But we seem to be taking a new direction there after feedback from Ingo
and Peterz: tag the scheduler entry and exit points and catch missing rq clock
updates since the last scheduler entry point.
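
Roughly like below (field and helper names are invented, this is only
the idea): mark the clock stale on scheduler entry, let the update
clear it, and make the accessors warn on a stale read:

static inline void rq_clock_sched_enter(struct rq *rq)
{
	rq->clock_stale = 1;		/* scheduler entry point */
}

static inline void rq_clock_updated(struct rq *rq)
{
	rq->clock_stale = 0;		/* update_rq_clock() ran */
}

static inline u64 rq_clock(struct rq *rq)
{
	WARN_ON_ONCE(rq->clock_stale);	/* read without prior update */
	return rq->clock;
}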

thanks.


Thread overview: 60+ messages
2013-01-08  2:08 [ANNOUNCE] 3.8-rc2-nohz2 Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 01/33] context_tracking: Add comments on interface and internals Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 02/33] context_tracking: Export context state for generic vtime Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 03/33] cputime: Generic on-demand virtual cputime accounting Frederic Weisbecker
2013-01-08 20:23   ` Steven Rostedt
2013-01-08 20:26   ` Steven Rostedt
2013-01-08 21:00     ` Paul E. McKenney
2013-01-08 20:45   ` Steven Rostedt
2013-01-09 13:46   ` Steven Rostedt
2013-01-09 13:50     ` Steven Rostedt
2013-01-08  2:08 ` [PATCH 04/33] cputime: Allow dynamic switch between tick/virtual based " Frederic Weisbecker
2013-01-08 21:20   ` Steven Rostedt
2013-01-08 23:22     ` Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 05/33] cputime: Use accessors to read task cputime stats Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 06/33] cputime: Safely read cputime of full dynticks CPUs Frederic Weisbecker
2013-01-09 14:54   ` Steven Rostedt
2013-01-09 18:35     ` Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 07/33] nohz: Basic full dynticks interface Frederic Weisbecker
2013-02-11 14:35   ` Borislav Petkov
2013-02-20 16:32     ` Borislav Petkov
2013-03-07 23:41       ` Frederic Weisbecker
2013-03-07 23:35     ` Frederic Weisbecker
2013-03-08 10:17       ` Borislav Petkov
2013-03-08 13:45         ` Frederic Weisbecker
2013-03-08 14:32           ` Borislav Petkov
2013-03-08 16:55             ` Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 08/33] nohz: Assign timekeeping duty to a non-full-nohz CPU Frederic Weisbecker
2013-02-15 11:57   ` Borislav Petkov
2013-02-20 15:57     ` Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 09/33] nohz: Trace timekeeping update Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 10/33] nohz: Wake up full dynticks CPUs when a timer gets enqueued Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 11/33] rcu: Restart the tick on non-responding full dynticks CPUs Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 12/33] sched: Comment on rq->clock correctness in ttwu_do_wakeup() in nohz Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 13/33] sched: Update rq clock on nohz CPU before migrating tasks Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 14/33] sched: Update rq clock on nohz CPU before setting fair group shares Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 15/33] sched: Update rq clock on tickless CPUs before calling check_preempt_curr() Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 16/33] sched: Update rq clock earlier in unthrottle_cfs_rq Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 17/33] sched: Update clock of nohz busiest rq before balancing Frederic Weisbecker
2013-01-08 10:20   ` Li Zhong
2013-03-07 23:51     ` Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 18/33] sched: Update rq clock before idle balancing Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 19/33] sched: Update nohz rq clock before searching busiest group on load balancing Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 20/33] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 21/33] nohz: Full dynticks mode Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 22/33] nohz: Only stop the tick on RCU nocb CPUs Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 23/33] nohz: Don't turn off the tick if rcu needs it Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 24/33] nohz: Don't stop the tick if posix cpu timers are running Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 25/33] nohz: Add some tracing Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 26/33] rcu: Don't keep the tick for RCU while in userspace Frederic Weisbecker
2013-01-08  4:06   ` Paul E. McKenney
2013-01-08  2:08 ` [PATCH 27/33] profiling: Remove unused timer hook Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 28/33] timer: Don't run non-pinned timer to full dynticks CPUs Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 29/33] sched: Use an accessor to read rq clock Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 30/33] sched: Debug nohz " Frederic Weisbecker
2013-03-20 23:23   ` Kevin Hilman
2013-04-11 16:47     ` Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 31/33] sched: Remove broken check for skip clock update Frederic Weisbecker
2013-01-08  2:11   ` Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 32/33] sched: Update rq clock before rt sched average scale Frederic Weisbecker
2013-01-08  2:08 ` [PATCH 33/33] sched: Disable lb_bias feature for full dynticks Frederic Weisbecker
