* [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
@ 2011-08-15 15:51 Frederic Weisbecker
  2011-08-15 15:51 ` [PATCH 01/32 RESEND] nohz: Drop useless call in tick_nohz_start_idle() Frederic Weisbecker
                   ` (33 more replies)
  0 siblings, 34 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:51 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

This is still at the draft stage. It's far from covering everything
the periodic tick does, but it has made some progress since the last
posting, so I think it's time for another early release.

= What's that? = 

On the mainline kernel we have a feature (CONFIG_NO_HZ) that is
able to turn off the periodic scheduler tick when the CPU has
nothing to do, namely when it's running the idle task.

The scheduler tick handles many things like RCU and scheduler
internal state, jiffies accounting, wall time accounting, load
accounting, cputime accounting, the timer wheel, posix cpu timers,
etc...

However, by the time we run idle and the CPU is going to sleep,
none of these things are useful to it, so we can shut the tick
down.

The benefit is energy saving: we avoid waking up the CPU
needlessly with useless interrupts.

What this patchset does is extend that feature to non-idle
cases, implementing a new kind of "adaptive nohz". But the
purpose is different and so is the implementation.

= How does that work? =

It tries to handle all the things that the timer tick usually
handles, but using different tricks. Sometimes we can't really
afford to avoid the periodic tick; but when we can, we need to
take some special care:

- We can't shut down the tick if we have more than one task
running, because preemption needs the tick. But I believe that one
day we can avoid the periodic tick for that too, and instead
anticipate when the scheduler really needs the tick.

- We can't shut down the tick if RCU needs to complete a grace
period from the current CPU, or if it has callbacks to handle.

- We can't shut down the tick if a posix cpu timer is queued. Similarly
to the preemption case, we should be able to anticipate that with a
precise timer and avoid a periodic check based on HZ.

- We need to restart the tick when more than one non-idle task is in the runqueue.

- We need to handle process accounting, RCU, rq clock, task tick, etc...

For now, this patchset handles only part of these needs.

= What's the interface? =

We use the cpuset interface, adding a nohz flag to it.
As long as a CPU is part of a nohz cpuset, this CPU will
try to enter adaptive nohz mode when it can, even if it is also
part of another cpuset that is not nohz.

= Why do we need that? =

There are at least two potential users of this feature:

* High performance computing: To optimize throughput, some
workloads run one task per CPU that spends most of its time in
userspace. These tasks don't want and don't need to suffer the
overhead of the timer interrupt: it consumes CPU time and it
thrashes the CPU cache.

* Real time: Minimizing timer interrupts means fewer interrupts and
thus fewer critical sections, which usually induce latency.

= What's missing? =

Many things, like handling of perf events, irq work, the sched clock
tick, the runqueue clock, sched_class::task_tick(), the cpu load, ...

The handling of cputimes is also incomplete, as there are other places
that use utime/stime. Process time accounting is incomplete overall.

But the work is moving forward, and an early posting seemed very
much needed at this stage.

For those who want to play:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
	nohz/cpuset-v1

Frederic Weisbecker (32):
  nohz: Drop useless call in tick_nohz_start_idle()
  nohz: Drop ts->idle_active
  nohz: Drop useless ts->inidle check before rearming the tick
  nohz: Separate idle sleeping time accounting from nohz switching
  nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  nohz: Move idle ticks stats tracking out of nohz handlers
  nohz: Rename ts->idle_tick to ts->last_tick
  nohz: Move nohz load balancer selection into idle logic
  nohz: Move ts->idle_calls into strict idle logic
  nohz: Move next idle expiring time record into idle logic area
  cpuset: Set up interface for nohz flag
  nohz: Try not to give the timekeeping duty to a cpuset nohz cpu
  nohz: Adaptive tick stop and restart on nohz cpuset
  nohz/cpuset: Don't turn off the tick if rcu needs it
  nohz/cpuset: Restart tick when switching to idle task
  nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued
  x86: New cpuset nohz irq vector
  nohz/cpuset: Don't stop the tick if posix cpu timers are running
  nohz/cpuset: Restart tick when nohz flag is cleared on cpuset
  nohz/cpuset: Restart the tick if printk needs it
  rcu: Restart the tick on non-responding adaptive nohz CPUs
  rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  nohz/cpuset: Account user and system times in adaptive nohz mode
  nohz/cpuset: Handle kernel entry/exit to account cputime
  nohz/cpuset: New API to flush cputimes on nohz cpusets
  nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader
  nohz/cpuset: Flush cputimes on procfs stat file read
  nohz/cpuset: Flush cputimes for getrusage() and times() syscalls
  x86: Syscall hooks for nohz cpusets
  x86: Exception hooks for nohz cpusets
  rcu: Switch to extended quiescent state in userspace from nohz cpuset
  nohz/cpuset: Disable under some configs

 arch/Kconfig                           |    3 +
 arch/arm/kernel/process.c              |    4 +-
 arch/avr32/kernel/process.c            |    4 +-
 arch/blackfin/kernel/process.c         |    4 +-
 arch/microblaze/kernel/process.c       |    4 +-
 arch/mips/kernel/process.c             |    4 +-
 arch/powerpc/kernel/idle.c             |    4 +-
 arch/powerpc/platforms/iseries/setup.c |    8 +-
 arch/s390/kernel/process.c             |    4 +-
 arch/sh/kernel/idle.c                  |    4 +-
 arch/sparc/kernel/process_64.c         |    4 +-
 arch/tile/kernel/process.c             |    4 +-
 arch/um/kernel/process.c               |    4 +-
 arch/unicore32/kernel/process.c        |    4 +-
 arch/x86/Kconfig                       |    1 +
 arch/x86/include/asm/entry_arch.h      |    3 +
 arch/x86/include/asm/hw_irq.h          |    6 +
 arch/x86/include/asm/irq_vectors.h     |    2 +
 arch/x86/include/asm/smp.h             |   11 +
 arch/x86/include/asm/thread_info.h     |   10 +-
 arch/x86/kernel/entry_64.S             |    4 +
 arch/x86/kernel/irqinit.c              |    4 +
 arch/x86/kernel/process_32.c           |    4 +-
 arch/x86/kernel/process_64.c           |    5 +-
 arch/x86/kernel/ptrace.c               |   10 +
 arch/x86/kernel/smp.c                  |   26 ++
 arch/x86/kernel/traps.c                |   22 +-
 arch/x86/mm/fault.c                    |   13 +-
 fs/proc/array.c                        |    2 +
 include/linux/cpuset.h                 |   29 ++
 include/linux/kernel_stat.h            |    2 +
 include/linux/posix-timers.h           |    1 +
 include/linux/rcupdate.h               |    1 +
 include/linux/sched.h                  |   10 +-
 include/linux/tick.h                   |   50 +++-
 init/Kconfig                           |    8 +
 kernel/cpuset.c                        |  105 +++++++
 kernel/exit.c                          |    2 +
 kernel/posix-cpu-timers.c              |   12 +
 kernel/printk.c                        |   17 +-
 kernel/rcutree.c                       |   28 ++-
 kernel/sched.c                         |  132 +++++++++-
 kernel/softirq.c                       |    6 +-
 kernel/sys.c                           |    6 +
 kernel/time/tick-sched.c               |  479 ++++++++++++++++++++++++--------
 kernel/time/timer_list.c               |    4 +-
 kernel/timer.c                         |    8 +-
 47 files changed, 897 insertions(+), 185 deletions(-)

-- 
1.7.5.4



* [PATCH 01/32 RESEND] nohz: Drop useless call in tick_nohz_start_idle()
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
@ 2011-08-15 15:51 ` Frederic Weisbecker
  2011-08-29 14:23   ` Peter Zijlstra
  2011-08-15 15:51 ` [PATCH 02/32 RESEND] nohz: Drop ts->idle_active Frederic Weisbecker
                   ` (32 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:51 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

The call to update_ts_time_stats() there is useless. All
we need there is to save the idle entry time.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/time/tick-sched.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index d5097c4..58e1a96 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -185,9 +185,6 @@ static ktime_t tick_nohz_start_idle(int cpu, struct tick_sched *ts)
 	ktime_t now;
 
 	now = ktime_get();
-
-	update_ts_time_stats(cpu, ts, now, NULL);
-
 	ts->idle_entrytime = now;
 	ts->idle_active = 1;
 	sched_clock_idle_sleep_event();
-- 
1.7.5.4



* [PATCH 02/32 RESEND] nohz: Drop ts->idle_active
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
  2011-08-15 15:51 ` [PATCH 01/32 RESEND] nohz: Drop useless call in tick_nohz_start_idle() Frederic Weisbecker
@ 2011-08-15 15:51 ` Frederic Weisbecker
  2011-08-29 14:23   ` Peter Zijlstra
  2011-08-15 15:52 ` [PATCH 03/32 RESEND] nohz: Drop useless ts->inidle check before rearming the tick Frederic Weisbecker
                   ` (31 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:51 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

ts->idle_active is used to know whether we want to account the idle
sleep time. But checking ts->inidle is enough for that.

So drop that field and use ts->inidle instead. This simplifies the code.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 include/linux/tick.h     |    1 -
 kernel/time/tick-sched.c |   14 ++++++--------
 2 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index b232ccc..532e650 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -56,7 +56,6 @@ struct tick_sched {
 	unsigned long			idle_jiffies;
 	unsigned long			idle_calls;
 	unsigned long			idle_sleeps;
-	int				idle_active;
 	ktime_t				idle_entrytime;
 	ktime_t				idle_waketime;
 	ktime_t				idle_exittime;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 58e1a96..c4d7113 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -157,7 +157,7 @@ update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_upda
 {
 	ktime_t delta;
 
-	if (ts->idle_active) {
+	if (ts->inidle) {
 		delta = ktime_sub(now, ts->idle_entrytime);
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
 		if (nr_iowait_cpu(cpu) > 0)
@@ -175,7 +175,6 @@ static void tick_nohz_stop_idle(int cpu, ktime_t now)
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 
 	update_ts_time_stats(cpu, ts, now, NULL);
-	ts->idle_active = 0;
 
 	sched_clock_idle_wakeup_event(0);
 }
@@ -186,7 +185,6 @@ static ktime_t tick_nohz_start_idle(int cpu, struct tick_sched *ts)
 
 	now = ktime_get();
 	ts->idle_entrytime = now;
-	ts->idle_active = 1;
 	sched_clock_idle_sleep_event();
 	return now;
 }
@@ -502,11 +500,11 @@ void tick_nohz_restart_sched_tick(void)
 	ktime_t now;
 
 	local_irq_disable();
-	if (ts->idle_active || (ts->inidle && ts->tick_stopped))
-		now = ktime_get();
 
-	if (ts->idle_active)
+	if (ts->inidle) {
+		now = ktime_get();
 		tick_nohz_stop_idle(cpu, now);
+	}
 
 	if (!ts->inidle || !ts->tick_stopped) {
 		ts->inidle = 0;
@@ -677,10 +675,10 @@ static inline void tick_check_nohz(int cpu)
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now;
 
-	if (!ts->idle_active && !ts->tick_stopped)
+	if (!ts->inidle && !ts->tick_stopped)
 		return;
 	now = ktime_get();
-	if (ts->idle_active)
+	if (ts->inidle)
 		tick_nohz_stop_idle(cpu, now);
 	if (ts->tick_stopped) {
 		tick_nohz_update_jiffies(now);
-- 
1.7.5.4



* [PATCH 03/32 RESEND] nohz: Drop useless ts->inidle check before rearming the tick
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
  2011-08-15 15:51 ` [PATCH 01/32 RESEND] nohz: Drop useless call in tick_nohz_start_idle() Frederic Weisbecker
  2011-08-15 15:51 ` [PATCH 02/32 RESEND] nohz: Drop ts->idle_active Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-29 14:23   ` Peter Zijlstra
  2011-08-15 15:52 ` [PATCH 04/32] nohz: Separate idle sleeping time accounting from nohz switching Frederic Weisbecker
                   ` (30 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

We only need to check ts->tick_stopped to ensure the tick
was stopped and we want to re-enable it. Checking ts->inidle
there is useless.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/time/tick-sched.c |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c4d7113..5934aee 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -504,16 +504,14 @@ void tick_nohz_restart_sched_tick(void)
 	if (ts->inidle) {
 		now = ktime_get();
 		tick_nohz_stop_idle(cpu, now);
+		ts->inidle = 0;
 	}
 
-	if (!ts->inidle || !ts->tick_stopped) {
-		ts->inidle = 0;
+	if (!ts->tick_stopped) {
 		local_irq_enable();
 		return;
 	}
 
-	ts->inidle = 0;
-
 	rcu_exit_nohz();
 
 	/* Update jiffies first */
-- 
1.7.5.4



* [PATCH 04/32] nohz: Separate idle sleeping time accounting from nohz switching
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 03/32 RESEND] nohz: Drop useless ts->inidle check before rearming the tick Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-29 14:23   ` Peter Zijlstra
  2011-08-29 14:23   ` Peter Zijlstra
  2011-08-15 15:52 ` [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs Frederic Weisbecker
                   ` (29 subsequent siblings)
  33 siblings, 2 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

To prepare for making nohz mode switching independent from idle,
pull the idle sleeping time accounting out of the tick stop API.

This requires implementing new APIs to call when we
enter/exit idle.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 arch/arm/kernel/process.c              |    4 +-
 arch/avr32/kernel/process.c            |    4 +-
 arch/blackfin/kernel/process.c         |    4 +-
 arch/microblaze/kernel/process.c       |    4 +-
 arch/mips/kernel/process.c             |    4 +-
 arch/powerpc/kernel/idle.c             |    4 +-
 arch/powerpc/platforms/iseries/setup.c |    8 +-
 arch/s390/kernel/process.c             |    4 +-
 arch/sh/kernel/idle.c                  |    4 +-
 arch/sparc/kernel/process_64.c         |    4 +-
 arch/tile/kernel/process.c             |    4 +-
 arch/um/kernel/process.c               |    4 +-
 arch/unicore32/kernel/process.c        |    4 +-
 arch/x86/kernel/process_32.c           |    4 +-
 arch/x86/kernel/process_64.c           |    5 +-
 include/linux/tick.h                   |   10 ++-
 kernel/softirq.c                       |    2 +-
 kernel/time/tick-sched.c               |  102 ++++++++++++++++++-------------
 18 files changed, 98 insertions(+), 81 deletions(-)

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 5e1e541..27b68b0 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -182,7 +182,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		leds_event(led_idle_start);
 		while (!need_resched()) {
 #ifdef CONFIG_HOTPLUG_CPU
@@ -208,7 +208,7 @@ void cpu_idle(void)
 			}
 		}
 		leds_event(led_idle_end);
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
index ef5a2a0..e683a34 100644
--- a/arch/avr32/kernel/process.c
+++ b/arch/avr32/kernel/process.c
@@ -34,10 +34,10 @@ void cpu_idle(void)
 {
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		while (!need_resched())
 			cpu_idle_sleep();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 6a660fa..8082a8f 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -88,10 +88,10 @@ void cpu_idle(void)
 #endif
 		if (!idle)
 			idle = default_idle;
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		while (!need_resched())
 			idle();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index 968648a..1b295b2 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -103,10 +103,10 @@ void cpu_idle(void)
 		if (!idle)
 			idle = default_idle;
 
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		while (!need_resched())
 			idle();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 
 		preempt_enable_no_resched();
 		schedule();
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index c28fbe6..3aa4020 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -56,7 +56,7 @@ void __noreturn cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		while (!need_resched() && cpu_online(cpu)) {
 #ifdef CONFIG_MIPS_MT_SMTC
 			extern void smtc_idle_loop_hook(void);
@@ -77,7 +77,7 @@ void __noreturn cpu_idle(void)
 		     system_state == SYSTEM_BOOTING))
 			play_dead();
 #endif
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
index 39a2baa..1108260 100644
--- a/arch/powerpc/kernel/idle.c
+++ b/arch/powerpc/kernel/idle.c
@@ -56,7 +56,7 @@ void cpu_idle(void)
 
 	set_thread_flag(TIF_POLLING_NRFLAG);
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		while (!need_resched() && !cpu_should_die()) {
 			ppc64_runlatch_off();
 
@@ -93,7 +93,7 @@ void cpu_idle(void)
 
 		HMT_medium();
 		ppc64_runlatch_on();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 		preempt_enable_no_resched();
 		if (cpu_should_die())
 			cpu_die();
diff --git a/arch/powerpc/platforms/iseries/setup.c b/arch/powerpc/platforms/iseries/setup.c
index c25a081..d40dcd9 100644
--- a/arch/powerpc/platforms/iseries/setup.c
+++ b/arch/powerpc/platforms/iseries/setup.c
@@ -562,7 +562,7 @@ static void yield_shared_processor(void)
 static void iseries_shared_idle(void)
 {
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		while (!need_resched() && !hvlpevent_is_pending()) {
 			local_irq_disable();
 			ppc64_runlatch_off();
@@ -576,7 +576,7 @@ static void iseries_shared_idle(void)
 		}
 
 		ppc64_runlatch_on();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 
 		if (hvlpevent_is_pending())
 			process_iSeries_events();
@@ -592,7 +592,7 @@ static void iseries_dedicated_idle(void)
 	set_thread_flag(TIF_POLLING_NRFLAG);
 
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		if (!need_resched()) {
 			while (!need_resched()) {
 				ppc64_runlatch_off();
@@ -609,7 +609,7 @@ static void iseries_dedicated_idle(void)
 		}
 
 		ppc64_runlatch_on();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 541a750..560cd94 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -90,10 +90,10 @@ static void default_idle(void)
 void cpu_idle(void)
 {
 	for (;;) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		while (!need_resched())
 			default_idle();
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/sh/kernel/idle.c b/arch/sh/kernel/idle.c
index 425d604..b7ea6ff 100644
--- a/arch/sh/kernel/idle.c
+++ b/arch/sh/kernel/idle.c
@@ -88,7 +88,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 
 		while (!need_resched()) {
 			check_pgt_cache();
@@ -109,7 +109,7 @@ void cpu_idle(void)
 			start_critical_timings();
 		}
 
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index c158a95..5c36632 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -95,12 +95,12 @@ void cpu_idle(void)
 	set_thread_flag(TIF_POLLING_NRFLAG);
 
 	while(1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 
 		while (!need_resched() && !cpu_is_offline(cpu))
 			sparc64_yield(cpu);
 
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 
 		preempt_enable_no_resched();
 
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 9c45d8b..cc1bd4f 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -85,7 +85,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		while (!need_resched()) {
 			if (cpu_is_offline(cpu))
 				BUG();  /* no HOTPLUG_CPU */
@@ -105,7 +105,7 @@ void cpu_idle(void)
 				local_irq_enable();
 			current_thread_info()->status |= TS_POLLING;
 		}
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index fab4371..f1b3864 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -245,10 +245,10 @@ void default_idle(void)
 		if (need_resched())
 			schedule();
 
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		nsecs = disable_timer();
 		idle_sleep(nsecs);
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 	}
 }
 
diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
index ba401df..e2df91a 100644
--- a/arch/unicore32/kernel/process.c
+++ b/arch/unicore32/kernel/process.c
@@ -55,7 +55,7 @@ void cpu_idle(void)
 {
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		while (!need_resched()) {
 			local_irq_disable();
 			stop_critical_timings();
@@ -63,7 +63,7 @@ void cpu_idle(void)
 			local_irq_enable();
 			start_critical_timings();
 		}
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index a3d0dc5..1d7e26c 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -97,7 +97,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		while (!need_resched()) {
 
 			check_pgt_cache();
@@ -112,7 +112,7 @@ void cpu_idle(void)
 			pm_idle();
 			start_critical_timings();
 		}
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ca6f7ab..5fce49b 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -120,7 +120,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_enter_idle();
 		while (!need_resched()) {
 
 			rmb();
@@ -144,8 +144,7 @@ void cpu_idle(void)
 			   loops can be woken up without interrupt. */
 			__exit_idle();
 		}
-
-		tick_nohz_restart_sched_tick();
+		tick_nohz_exit_idle();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 532e650..04f6418 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -120,14 +120,16 @@ static inline int tick_oneshot_mode_active(void) { return 0; }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS */
 
 # ifdef CONFIG_NO_HZ
-extern void tick_nohz_stop_sched_tick(int inidle);
-extern void tick_nohz_restart_sched_tick(void);
+extern void tick_nohz_enter_idle(void);
+extern void tick_nohz_exit_idle(void);
+extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 # else
-static inline void tick_nohz_stop_sched_tick(int inidle) { }
-static inline void tick_nohz_restart_sched_tick(void) { }
+static inline void tick_nohz_enter_idle(void) { }
+static inline void tick_nohz_exit_idle(void) { }
+
 static inline ktime_t tick_nohz_get_sleep_length(void)
 {
 	ktime_t len = { .tv64 = NSEC_PER_SEC/HZ };
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 40cf63d..67a1401 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -343,7 +343,7 @@ void irq_exit(void)
 #ifdef CONFIG_NO_HZ
 	/* Make sure that timer wheel updates are propagated */
 	if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
-		tick_nohz_stop_sched_tick(0);
+		tick_nohz_irq_exit();
 #endif
 	preempt_enable_no_resched();
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5934aee..df6bb4c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -249,38 +249,19 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
  * Called either from the idle loop or from irq_exit() when an idle period was
  * just interrupted by an interrupt which did not cause a reschedule.
  */
-void tick_nohz_stop_sched_tick(int inidle)
+static void tick_nohz_stop_sched_tick(ktime_t now)
 {
-	unsigned long seq, last_jiffies, next_jiffies, delta_jiffies, flags;
+	unsigned long seq, last_jiffies, next_jiffies, delta_jiffies;
 	struct tick_sched *ts;
-	ktime_t last_update, expires, now;
+	ktime_t last_update, expires;
 	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
 	u64 time_delta;
 	int cpu;
 
-	local_irq_save(flags);
-
 	cpu = smp_processor_id();
 	ts = &per_cpu(tick_cpu_sched, cpu);
 
 	/*
-	 * Call to tick_nohz_start_idle stops the last_update_time from being
-	 * updated. Thus, it must not be called in the event we are called from
-	 * irq_exit() with the prior state different than idle.
-	 */
-	if (!inidle && !ts->inidle)
-		goto end;
-
-	/*
-	 * Set ts->inidle unconditionally. Even if the system did not
-	 * switch to NOHZ mode the cpu frequency governers rely on the
-	 * update of the idle time accounting in tick_nohz_start_idle().
-	 */
-	ts->inidle = 1;
-
-	now = tick_nohz_start_idle(cpu, ts);
-
-	/*
 	 * If this cpu is offline and it is the one which updates
 	 * jiffies, then give up the assignment and let it be taken by
 	 * the cpu which runs the tick timer next. If we don't drop
@@ -293,10 +274,10 @@ void tick_nohz_stop_sched_tick(int inidle)
 	}
 
 	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
-		goto end;
+		return;
 
 	if (need_resched())
-		goto end;
+		return;
 
 	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
 		static int ratelimit;
@@ -306,7 +287,7 @@ void tick_nohz_stop_sched_tick(int inidle)
 			       (unsigned int) local_softirq_pending());
 			ratelimit++;
 		}
-		goto end;
+		return;
 	}
 
 	ts->idle_calls++;
@@ -443,10 +424,31 @@ out:
 	ts->next_jiffies = next_jiffies;
 	ts->last_jiffies = last_jiffies;
 	ts->sleep_length = ktime_sub(dev->next_event, now);
-end:
-	local_irq_restore(flags);
 }
 
+static void __tick_nohz_enter_idle(struct tick_sched *ts, int cpu)
+{
+	ktime_t now;
+
+	now = tick_nohz_start_idle(cpu, ts);
+	tick_nohz_stop_sched_tick(now);
+}
+
+void tick_nohz_enter_idle(void)
+{
+	struct tick_sched *ts;
+	int cpu;
+
+	local_irq_disable();
+
+	ts = &__get_cpu_var(tick_cpu_sched);
+	ts->inidle = 1;
+	cpu = smp_processor_id();
+	__tick_nohz_enter_idle(ts, cpu);
+
+	local_irq_enable();
+}
+
 /**
  * tick_nohz_get_sleep_length - return the length of the current sleep
  *
@@ -490,27 +492,12 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
  *
  * Restart the idle tick when the CPU is woken up from idle
  */
-void tick_nohz_restart_sched_tick(void)
+static void tick_nohz_restart_sched_tick(ktime_t now, struct tick_sched *ts)
 {
 	int cpu = smp_processor_id();
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	unsigned long ticks;
 #endif
-	ktime_t now;
-
-	local_irq_disable();
-
-	if (ts->inidle) {
-		now = ktime_get();
-		tick_nohz_stop_idle(cpu, now);
-		ts->inidle = 0;
-	}
-
-	if (!ts->tick_stopped) {
-		local_irq_enable();
-		return;
-	}
 
 	rcu_exit_nohz();
 
@@ -541,10 +528,39 @@ void tick_nohz_restart_sched_tick(void)
 	ts->idle_exittime = now;
 
 	tick_nohz_restart(ts, now);
+}
+
+void tick_nohz_exit_idle(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+	ktime_t now;
+
+	local_irq_disable();
+
+	if (!ts->inidle) {
+		local_irq_enable();
+		return;
+	}
+
+	now = ktime_get();
+
+	tick_nohz_stop_idle(smp_processor_id(), now);
+	ts->inidle = 0;
+
+	if (ts->tick_stopped)
+		tick_nohz_restart_sched_tick(now, ts);
 
 	local_irq_enable();
 }
 
+void tick_nohz_irq_exit(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (ts->inidle)
+		__tick_nohz_enter_idle(ts, smp_processor_id());
+}
+
 static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
 {
 	hrtimer_forward(&ts->sched_timer, now, tick_period);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 04/32] nohz: Separate idle sleeping time accounting from nohz switching Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-29 14:25   ` Peter Zijlstra
  2011-08-15 15:52 ` [PATCH 06/32] nohz: Move idle ticks stats tracking out of nohz handlers Frederic Weisbecker
                   ` (28 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

To prepare for the nohz / idle logic split, pull the rcu dynticks
idle mode switching out into the strict idle entry/exit paths.

This makes dyntick mode possible without always entering the rcu
extended quiescent state.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/time/tick-sched.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index df6bb4c..8937d4a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -385,7 +385,6 @@ static void tick_nohz_stop_sched_tick(ktime_t now)
 			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
 			ts->idle_jiffies = last_jiffies;
-			rcu_enter_nohz();
 		}
 
 		ts->idle_sleeps++;
@@ -429,9 +428,13 @@ out:
 static void __tick_nohz_enter_idle(struct tick_sched *ts, int cpu)
 {
 	ktime_t now;
+	int was_stopped = ts->tick_stopped;
 
 	now = tick_nohz_start_idle(cpu, ts);
 	tick_nohz_stop_sched_tick(now);
+
+	if (!was_stopped && ts->tick_stopped)
+		rcu_enter_nohz();
 }
 
 void tick_nohz_enter_idle(void)
@@ -499,8 +502,6 @@ static void tick_nohz_restart_sched_tick(ktime_t now, struct tick_sched *ts)
 	unsigned long ticks;
 #endif
 
-	rcu_exit_nohz();
-
 	/* Update jiffies first */
 	select_nohz_load_balancer(0);
 	tick_do_update_jiffies64(now);
@@ -547,8 +548,10 @@ void tick_nohz_exit_idle(void)
 	tick_nohz_stop_idle(smp_processor_id(), now);
 	ts->inidle = 0;
 
-	if (ts->tick_stopped)
+	if (ts->tick_stopped) {
+		rcu_exit_nohz();
 		tick_nohz_restart_sched_tick(now, ts);
+	}
 
 	local_irq_enable();
 }
-- 
1.7.5.4



* [PATCH 06/32] nohz: Move idle ticks stats tracking out of nohz handlers
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-29 14:28   ` Peter Zijlstra
  2011-08-15 15:52 ` [PATCH 07/32] nohz: Rename ts->idle_tick to ts->last_tick Frederic Weisbecker
                   ` (27 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Idle tick time tracking is currently merged into the nohz
stop/restart handlers. Pull it out into the idle entry/exit
handlers instead, so that the nohz APIs are more independent
of the idle logic.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/time/tick-sched.c |   34 +++++++++++++++++++---------------
 1 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8937d4a..21b187c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -384,7 +384,6 @@ static void tick_nohz_stop_sched_tick(ktime_t now)
 
 			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
-			ts->idle_jiffies = last_jiffies;
 		}
 
 		ts->idle_sleeps++;
@@ -433,8 +432,10 @@ static void __tick_nohz_enter_idle(struct tick_sched *ts, int cpu)
 	now = tick_nohz_start_idle(cpu, ts);
 	tick_nohz_stop_sched_tick(now);
 
-	if (!was_stopped && ts->tick_stopped)
+	if (!was_stopped && ts->tick_stopped) {
+		ts->idle_jiffies = ts->last_jiffies;
 		rcu_enter_nohz();
+	}
 }
 
 void tick_nohz_enter_idle(void)
@@ -498,16 +499,26 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 static void tick_nohz_restart_sched_tick(ktime_t now, struct tick_sched *ts)
 {
 	int cpu = smp_processor_id();
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
-	unsigned long ticks;
-#endif
 
 	/* Update jiffies first */
 	select_nohz_load_balancer(0);
 	tick_do_update_jiffies64(now);
 	cpumask_clear_cpu(cpu, nohz_cpu_mask);
 
+	touch_softlockup_watchdog();
+	/*
+	 * Cancel the scheduled timer and restore the tick
+	 */
+	ts->tick_stopped  = 0;
+	ts->idle_exittime = now;
+
+	tick_nohz_restart(ts, now);
+}
+
+static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
+{
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
+	unsigned long ticks;
 	/*
 	 * We stopped the tick in idle. Update process times would miss the
 	 * time we slept as update_process_times does only a 1 tick
@@ -520,15 +531,6 @@ static void tick_nohz_restart_sched_tick(ktime_t now, struct tick_sched *ts)
 	if (ticks && ticks < LONG_MAX)
 		account_idle_ticks(ticks);
 #endif
-
-	touch_softlockup_watchdog();
-	/*
-	 * Cancel the scheduled timer and restore the tick
-	 */
-	ts->tick_stopped  = 0;
-	ts->idle_exittime = now;
-
-	tick_nohz_restart(ts, now);
 }
 
 void tick_nohz_exit_idle(void)
@@ -551,6 +553,7 @@ void tick_nohz_exit_idle(void)
 	if (ts->tick_stopped) {
 		rcu_exit_nohz();
 		tick_nohz_restart_sched_tick(now, ts);
+		tick_nohz_account_idle_ticks(ts);
 	}
 
 	local_irq_enable();
@@ -766,7 +769,8 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 		 */
 		if (ts->tick_stopped) {
 			touch_softlockup_watchdog();
-			ts->idle_jiffies++;
+			if (idle_cpu(cpu))
+				ts->idle_jiffies++;
 		}
 		update_process_times(user_mode(regs));
 		profile_tick(CPU_PROFILING);
-- 
1.7.5.4



* [PATCH 07/32] nohz: Rename ts->idle_tick to ts->last_tick
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 06/32] nohz: Move idle ticks stats tracking out of nohz handlers Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-15 15:52 ` [PATCH 08/32] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
                   ` (26 subsequent siblings)
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Now that the idle and nohz logic are being split, idle_tick becomes
a misnomer for a field that saves the last tick expiry time before
switching to nohz.

Call it last_tick instead. This slightly changes the timer list
stats export, so bump its version.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 include/linux/tick.h     |    8 ++++----
 kernel/time/tick-sched.c |    4 ++--
 kernel/time/timer_list.c |    4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 04f6418..849a0b2 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -30,10 +30,10 @@ enum tick_nohz_mode {
  * struct tick_sched - sched tick emulation and no idle tick control/stats
  * @sched_timer:	hrtimer to schedule the periodic tick in high
  *			resolution mode
- * @idle_tick:		Store the last idle tick expiry time when the tick
- *			timer is modified for idle sleeps. This is necessary
+ * @last_tick:		Store the last tick expiry time when the tick
+ *			timer is modified for nohz sleeps. This is necessary
  *			to resume the tick timer operation in the timeline
- *			when the CPU returns from idle
+ *			when the CPU returns from nohz sleep.
  * @tick_stopped:	Indicator that the idle tick has been stopped
  * @idle_jiffies:	jiffies at the entry to idle for idle time accounting
  * @idle_calls:		Total number of idle calls
@@ -50,7 +50,7 @@ struct tick_sched {
 	struct hrtimer			sched_timer;
 	unsigned long			check_clocks;
 	enum tick_nohz_mode		nohz_mode;
-	ktime_t				idle_tick;
+	ktime_t				last_tick;
 	int				inidle;
 	int				tick_stopped;
 	unsigned long			idle_jiffies;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 21b187c..bca1519 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -382,7 +382,7 @@ static void tick_nohz_stop_sched_tick(ktime_t now)
 		if (!ts->tick_stopped) {
 			select_nohz_load_balancer(1);
 
-			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
+			ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
 		}
 
@@ -468,7 +468,7 @@ ktime_t tick_nohz_get_sleep_length(void)
 static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 {
 	hrtimer_cancel(&ts->sched_timer);
-	hrtimer_set_expires(&ts->sched_timer, ts->idle_tick);
+	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
 
 	while (1) {
 		/* Forward the time to expire in the future */
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index 3258455..af5a7e9 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -167,7 +167,7 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
 	{
 		struct tick_sched *ts = tick_get_tick_sched(cpu);
 		P(nohz_mode);
-		P_ns(idle_tick);
+		P_ns(last_tick);
 		P(tick_stopped);
 		P(idle_jiffies);
 		P(idle_calls);
@@ -259,7 +259,7 @@ static int timer_list_show(struct seq_file *m, void *v)
 	u64 now = ktime_to_ns(ktime_get());
 	int cpu;
 
-	SEQ_printf(m, "Timer List Version: v0.6\n");
+	SEQ_printf(m, "Timer List Version: v0.7\n");
 	SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES);
 	SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);
 
-- 
1.7.5.4



* [PATCH 08/32] nohz: Move nohz load balancer selection into idle logic
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 07/32] nohz: Rename ts->idle_tick to ts->last_tick Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-29 14:45   ` Peter Zijlstra
  2011-08-15 15:52 ` [PATCH 09/32] nohz: Move ts->idle_calls into strict " Frederic Weisbecker
                   ` (25 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

We want the nohz load balancer to be an idle CPU, so move that
selection into the strict dyntick idle logic.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/time/tick-sched.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index bca1519..de1b629 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -380,8 +380,6 @@ static void tick_nohz_stop_sched_tick(ktime_t now)
 		 * the scheduler tick in nohz_restart_sched_tick.
 		 */
 		if (!ts->tick_stopped) {
-			select_nohz_load_balancer(1);
-
 			ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
 		}
@@ -434,6 +432,7 @@ static void __tick_nohz_enter_idle(struct tick_sched *ts, int cpu)
 
 	if (!was_stopped && ts->tick_stopped) {
 		ts->idle_jiffies = ts->last_jiffies;
+		select_nohz_load_balancer(1);
 		rcu_enter_nohz();
 	}
 }
@@ -501,7 +500,6 @@ static void tick_nohz_restart_sched_tick(ktime_t now, struct tick_sched *ts)
 	int cpu = smp_processor_id();
 
 	/* Update jiffies first */
-	select_nohz_load_balancer(0);
 	tick_do_update_jiffies64(now);
 	cpumask_clear_cpu(cpu, nohz_cpu_mask);
 
@@ -552,6 +550,7 @@ void tick_nohz_exit_idle(void)
 
 	if (ts->tick_stopped) {
 		rcu_exit_nohz();
+		select_nohz_load_balancer(0);
 		tick_nohz_restart_sched_tick(now, ts);
 		tick_nohz_account_idle_ticks(ts);
 	}
-- 
1.7.5.4



* [PATCH 09/32] nohz: Move ts->idle_calls into strict idle logic
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 08/32] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-29 14:47   ` Peter Zijlstra
  2011-08-15 15:52 ` [PATCH 10/32] nohz: Move next idle expiring time record into idle logic area Frederic Weisbecker
                   ` (24 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Split the nohz switch into two parts: a first that checks whether
we can really stop the tick, and a second that actually stops it.
This way we can pull the idle_calls stat increment out into the
strict idle logic.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/time/tick-sched.c |   87 ++++++++++++++++++++++++---------------------
 1 files changed, 46 insertions(+), 41 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index de1b629..2794150 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -249,48 +249,14 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
  * Called either from the idle loop or from irq_exit() when an idle period was
  * just interrupted by an interrupt which did not cause a reschedule.
  */
-static void tick_nohz_stop_sched_tick(ktime_t now)
+static void tick_nohz_stop_sched_tick(ktime_t now, int cpu, struct tick_sched *ts)
 {
 	unsigned long seq, last_jiffies, next_jiffies, delta_jiffies;
-	struct tick_sched *ts;
 	ktime_t last_update, expires;
 	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
 	u64 time_delta;
-	int cpu;
-
-	cpu = smp_processor_id();
-	ts = &per_cpu(tick_cpu_sched, cpu);
-
-	/*
-	 * If this cpu is offline and it is the one which updates
-	 * jiffies, then give up the assignment and let it be taken by
-	 * the cpu which runs the tick timer next. If we don't drop
-	 * this here the jiffies might be stale and do_timer() never
-	 * invoked.
-	 */
-	if (unlikely(!cpu_online(cpu))) {
-		if (cpu == tick_do_timer_cpu)
-			tick_do_timer_cpu = TICK_DO_TIMER_NONE;
-	}
 
-	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
-		return;
-
-	if (need_resched())
-		return;
-
-	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
-		static int ratelimit;
-
-		if (ratelimit < 10) {
-			printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
-			       (unsigned int) local_softirq_pending());
-			ratelimit++;
-		}
-		return;
-	}
 
-	ts->idle_calls++;
 	/* Read jiffies and the time when jiffies were updated last */
 	do {
 		seq = read_seqbegin(&xtime_lock);
@@ -422,18 +388,57 @@ out:
 	ts->sleep_length = ktime_sub(dev->next_event, now);
 }
 
+static bool tick_nohz_can_stop_tick(int cpu, struct tick_sched *ts)
+{
+	/*
+	 * If this cpu is offline and it is the one which updates
+	 * jiffies, then give up the assignment and let it be taken by
+	 * the cpu which runs the tick timer next. If we don't drop
+	 * this here the jiffies might be stale and do_timer() never
+	 * invoked.
+	 */
+	if (unlikely(!cpu_online(cpu))) {
+		if (cpu == tick_do_timer_cpu)
+			tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+	}
+
+	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
+		return false;
+
+	if (need_resched())
+		return false;
+
+	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
+		static int ratelimit;
+
+		if (ratelimit < 10) {
+			printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
+			       (unsigned int) local_softirq_pending());
+			ratelimit++;
+		}
+		return false;
+	}
+
+	return true;
+}
+
 static void __tick_nohz_enter_idle(struct tick_sched *ts, int cpu)
 {
 	ktime_t now;
-	int was_stopped = ts->tick_stopped;
 
 	now = tick_nohz_start_idle(cpu, ts);
-	tick_nohz_stop_sched_tick(now);
 
-	if (!was_stopped && ts->tick_stopped) {
-		ts->idle_jiffies = ts->last_jiffies;
-		select_nohz_load_balancer(1);
-		rcu_enter_nohz();
+	if (tick_nohz_can_stop_tick(cpu, ts)) {
+		int was_stopped = ts->tick_stopped;
+
+		ts->idle_calls++;
+		tick_nohz_stop_sched_tick(now, cpu, ts);
+
+		if (!was_stopped && ts->tick_stopped) {
+			ts->idle_jiffies = ts->last_jiffies;
+			select_nohz_load_balancer(1);
+			rcu_enter_nohz();
+		}
 	}
 }
 
-- 
1.7.5.4



* [PATCH 10/32] nohz: Move next idle expiring time record into idle logic area
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (8 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 09/32] nohz: Move ts->idle_calls into strict " Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-15 15:52 ` [PATCH 11/32] cpuset: Set up interface for nohz flag Frederic Weisbecker
                   ` (23 subsequent siblings)
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Move the next idle expiry time record and the idle sleeps tracking
into the idle entry functions, as they are not generic nohz stats.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/time/tick-sched.c |   21 ++++++++++++---------
 1 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 2794150..f5e12da 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -249,10 +249,10 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
  * Called either from the idle loop or from irq_exit() when an idle period was
  * just interrupted by an interrupt which did not cause a reschedule.
  */
-static void tick_nohz_stop_sched_tick(ktime_t now, int cpu, struct tick_sched *ts)
+static ktime_t tick_nohz_stop_sched_tick(ktime_t now, int cpu, struct tick_sched *ts)
 {
 	unsigned long seq, last_jiffies, next_jiffies, delta_jiffies;
-	ktime_t last_update, expires;
+	ktime_t last_update, expires, ret = { .tv64 = 0 };
 	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
 	u64 time_delta;
 
@@ -338,6 +338,8 @@ static void tick_nohz_stop_sched_tick(ktime_t now, int cpu, struct tick_sched *t
 		if (ts->tick_stopped && ktime_equal(expires, dev->next_event))
 			goto out;
 
+		ret = expires;
+
 		/*
 		 * nohz_stop_sched_tick can be called several times before
 		 * the nohz_restart_sched_tick is called. This happens when
@@ -350,11 +352,6 @@ static void tick_nohz_stop_sched_tick(ktime_t now, int cpu, struct tick_sched *t
 			ts->tick_stopped = 1;
 		}
 
-		ts->idle_sleeps++;
-
-		/* Mark expires */
-		ts->idle_expires = expires;
-
 		/*
 		 * If the expiration time == KTIME_MAX, then
 		 * in this case we simply stop the tick timer.
@@ -386,6 +383,8 @@ out:
 	ts->next_jiffies = next_jiffies;
 	ts->last_jiffies = last_jiffies;
 	ts->sleep_length = ktime_sub(dev->next_event, now);
+
+	return ret;
 }
 
 static bool tick_nohz_can_stop_tick(int cpu, struct tick_sched *ts)
@@ -424,7 +423,7 @@ static bool tick_nohz_can_stop_tick(int cpu, struct tick_sched *ts)
 
 static void __tick_nohz_enter_idle(struct tick_sched *ts, int cpu)
 {
-	ktime_t now;
+	ktime_t now, expires;
 
 	now = tick_nohz_start_idle(cpu, ts);
 
@@ -432,7 +431,11 @@ static void __tick_nohz_enter_idle(struct tick_sched *ts, int cpu)
 		int was_stopped = ts->tick_stopped;
 
 		ts->idle_calls++;
-		tick_nohz_stop_sched_tick(now, cpu, ts);
+		expires = tick_nohz_stop_sched_tick(now, cpu, ts);
+		if (expires.tv64 > 0LL) {
+			ts->idle_sleeps++;
+			ts->idle_expires = expires;
+		}
 
 		if (!was_stopped && ts->tick_stopped) {
 			ts->idle_jiffies = ts->last_jiffies;
-- 
1.7.5.4



* [PATCH 11/32] cpuset: Set up interface for nohz flag
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (9 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 10/32] nohz: Move next idle expiring time record into idle logic area Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-15 15:52 ` [PATCH 12/32] nohz: Try not to give the timekeeping duty to a cpuset nohz cpu Frederic Weisbecker
                   ` (22 subsequent siblings)
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Prepare the interface to implement the nohz cpuset flag.
Once set, this flag tells the system to try to shut down
the periodic timer tick when possible.

We use a per-cpu refcounter here: as long as a CPU belongs
to at least one cpuset that has the nohz flag set, it is
part of the set of CPUs that run in adaptive nohz mode.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 arch/Kconfig           |    3 ++
 include/linux/cpuset.h |   22 ++++++++++++++++++++
 init/Kconfig           |    8 +++++++
 kernel/cpuset.c        |   52 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 341ac95..5fe21c4 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -172,6 +172,9 @@ config HAVE_ARCH_JUMP_LABEL
 	bool
 
 config HAVE_ARCH_MUTEX_CPU_RELAX
+       bool
+
+config HAVE_CPUSETS_NO_HZ
 	bool
 
 config HAVE_RCU_TABLE_FREE
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e9eaec5..62e5d5a 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -244,4 +244,26 @@ static inline void put_mems_allowed(void)
 
 #endif /* !CONFIG_CPUSETS */
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+
+DECLARE_PER_CPU(int, cpu_adaptive_nohz_ref);
+
+static inline bool cpuset_cpu_adaptive_nohz(int cpu)
+{
+	if (per_cpu(cpu_adaptive_nohz_ref, cpu) > 0)
+		return true;
+
+	return false;
+}
+
+static inline bool cpuset_adaptive_nohz(void)
+{
+	if (__get_cpu_var(cpu_adaptive_nohz_ref) > 0)
+		return true;
+
+	return false;
+}
+
+#endif /* CONFIG_CPUSETS_NO_HZ */
+
 #endif /* _LINUX_CPUSET_H */
diff --git a/init/Kconfig b/init/Kconfig
index 3cf7855..0cb591a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -624,6 +624,14 @@ config PROC_PID_CPUSET
 	depends on CPUSETS
 	default y
 
+config CPUSETS_NO_HZ
+       bool "Tickless cpusets"
+       depends on CPUSETS && HAVE_CPUSETS_NO_HZ
+       help
+         This option lets you apply a nohz property to a cpuset such
+	 that the periodic timer tick tries to be avoided when possible on
+	 the concerned CPUs.
+
 config CGROUP_CPUACCT
 	bool "Simple CPU accounting cgroup subsystem"
 	help
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9c9b754..3135096 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -132,6 +132,7 @@ typedef enum {
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_ADAPTIVE_NOHZ,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -170,6 +171,11 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_adaptive_nohz(const struct cpuset *cs)
+{
+	return test_bit(CS_ADAPTIVE_NOHZ, &cs->flags);
+}
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 };
@@ -1189,6 +1195,31 @@ static void cpuset_change_flag(struct task_struct *tsk,
 	cpuset_update_task_spread_flag(cgroup_cs(scan->cg), tsk);
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+
+DEFINE_PER_CPU(int, cpu_adaptive_nohz_ref);
+
+static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
+{
+	int cpu;
+	int val;
+
+	if (is_adaptive_nohz(old_cs) == is_adaptive_nohz(cs))
+		return;
+
+	for_each_cpu(cpu, cs->cpus_allowed) {
+		if (is_adaptive_nohz(cs))
+			per_cpu(cpu_adaptive_nohz_ref, cpu) += 1;
+		else
+			per_cpu(cpu_adaptive_nohz_ref, cpu) -= 1;
+	}
+}
+#else
+static inline void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
+{
+}
+#endif
+
 /*
  * update_tasks_flags - update the spread flags of tasks in the cpuset.
  * @cs: the cpuset in which each task's spread flags needs to be changed
@@ -1254,6 +1285,8 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
+	update_nohz_cpus(cs, trialcs);
+
 	mutex_lock(&callback_mutex);
 	cs->flags = trialcs->flags;
 	mutex_unlock(&callback_mutex);
@@ -1472,6 +1505,7 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_ADAPTIVE_NOHZ,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1511,6 +1545,11 @@ static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
 	case FILE_SPREAD_SLAB:
 		retval = update_flag(CS_SPREAD_SLAB, cs, val);
 		break;
+#ifdef CONFIG_CPUSETS_NO_HZ
+	case FILE_ADAPTIVE_NOHZ:
+		retval = update_flag(CS_ADAPTIVE_NOHZ, cs, val);
+		break;
+#endif
 	default:
 		retval = -EINVAL;
 		break;
@@ -1670,6 +1709,10 @@ static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)
 		return is_spread_page(cs);
 	case FILE_SPREAD_SLAB:
 		return is_spread_slab(cs);
+#ifdef CONFIG_CPUSETS_NO_HZ
+	case FILE_ADAPTIVE_NOHZ:
+		return is_adaptive_nohz(cs);
+#endif
 	default:
 		BUG();
 	}
@@ -1778,6 +1821,15 @@ static struct cftype files[] = {
 		.write_u64 = cpuset_write_u64,
 		.private = FILE_SPREAD_SLAB,
 	},
+
+#ifdef CONFIG_CPUSETS_NO_HZ
+	{
+		.name = "adaptive_nohz",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_ADAPTIVE_NOHZ,
+	},
+#endif
 };
 
 static struct cftype cft_memory_pressure_enabled = {
-- 
1.7.5.4
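The per-CPU reference counting done by update_nohz_cpus() in this patch can be pictured with a small stand-alone model: several cpusets may cover the same CPU, so each CPU counts how many adaptive-nohz cpusets include it. The sketch below is a toy user-space model under that reading of the patch; the names are illustrative, not the kernel API.

```c
#define NR_CPUS 4

/* Toy stand-in for the per-CPU cpu_adaptive_nohz_ref counter. */
static int nohz_ref[NR_CPUS];

/* Toy cpuset: a CPU bitmask plus the adaptive-nohz flag. */
struct toy_cpuset {
	unsigned int cpus_allowed;	/* bitmask over NR_CPUS */
	int adaptive_nohz;
};

/* Mirrors update_nohz_cpus(): when the flag flips, adjust the
 * counter of every CPU the cpuset covers. */
static void toy_update_nohz_cpus(int was_nohz, const struct toy_cpuset *cs)
{
	int cpu;

	if (was_nohz == cs->adaptive_nohz)
		return;
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cs->cpus_allowed & (1u << cpu))
			nohz_ref[cpu] += cs->adaptive_nohz ? 1 : -1;
}

/* A CPU is adaptive-nohz while at least one cpuset references it. */
static int toy_cpu_adaptive_nohz(int cpu)
{
	return nohz_ref[cpu] > 0;
}
```

A CPU stays eligible for adaptive nohz as long as its count is non-zero, which is why clearing the flag on one cpuset does not disturb a CPU shared with another nohz cpuset.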


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 12/32] nohz: Try not to give the timekeeping duty to a cpuset nohz cpu
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (10 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 11/32] cpuset: Set up interface for nohz flag Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-29 14:55   ` Peter Zijlstra
  2011-08-15 15:52 ` [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset Frederic Weisbecker
                   ` (21 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Try to give the timekeeping duty to a CPU that doesn't belong
to any nohz cpuset when possible, so that we increase the chance
for these nohz cpusets to run their CPUs out of periodic tick
mode.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/time/tick-sched.c |   52 ++++++++++++++++++++++++++++++++++++---------
 1 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f5e12da..5f41ef7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include <linux/profile.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/cpuset.h>
 
 #include <asm/irq_regs.h>
 
@@ -729,6 +730,45 @@ void tick_check_idle(int cpu)
 	tick_check_nohz(cpu);
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+
+/*
+ * Take the timer duty if nobody is taking care of it.
+ * If a CPU already does and it's in a nohz cpuset,
+ * then take over the duty so that it can switch to nohz mode.
+ */
+static void tick_do_timer_check_handler(int cpu)
+{
+	int handler = tick_do_timer_cpu;
+
+	if (unlikely(handler == TICK_DO_TIMER_NONE)) {
+		tick_do_timer_cpu = cpu;
+	} else {
+		if (!cpuset_adaptive_nohz() &&
+		    cpuset_cpu_adaptive_nohz(handler))
+			tick_do_timer_cpu = cpu;
+	}
+}
+
+#else
+
+static void tick_do_timer_check_handler(int cpu)
+{
+#ifdef CONFIG_NO_HZ
+	/*
+	 * Check if the do_timer duty was dropped. We don't care about
+	 * concurrency: This happens only when the cpu in charge went
+	 * into a long sleep. If two cpus happen to assign themself to
+	 * this duty, then the jiffies update is still serialized by
+	 * xtime_lock.
+	 */
+	if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
+		tick_do_timer_cpu = cpu;
+#endif
+}
+
+#endif /* CONFIG_CPUSETS_NO_HZ */
+
 /*
  * High resolution timer specific code
  */
@@ -745,17 +785,7 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 	ktime_t now = ktime_get();
 	int cpu = smp_processor_id();
 
-#ifdef CONFIG_NO_HZ
-	/*
-	 * Check if the do_timer duty was dropped. We don't care about
-	 * concurrency: This happens only when the cpu in charge went
-	 * into a long sleep. If two cpus happen to assign themself to
-	 * this duty, then the jiffies update is still serialized by
-	 * xtime_lock.
-	 */
-	if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
-		tick_do_timer_cpu = cpu;
-#endif
+	tick_do_timer_check_handler(cpu);
 
 	/* Check, if the jiffies need an update */
 	if (tick_do_timer_cpu == cpu)
-- 
1.7.5.4
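The handoff policy in tick_do_timer_check_handler() above reduces to two rules: take the duty if nobody holds it, and take it over when the current holder sits in a nohz cpuset while we do not. A minimal stand-alone sketch of just that decision (toy names, illustrative only, none of the xtime_lock serialization modeled):

```c
#define TICK_DO_TIMER_NONE	(-1)
#define NR_CPUS 4

/* Toy stand-ins for tick_do_timer_cpu and the cpuset lookup. */
static int do_timer_cpu = TICK_DO_TIMER_NONE;
static int in_nohz_cpuset[NR_CPUS];

/* Mirrors tick_do_timer_check_handler(): take the duty if nobody
 * holds it; otherwise steal it when the holder is in a nohz cpuset
 * and we are not, so the holder gets a chance to stop its tick. */
static void toy_check_handler(int cpu)
{
	int handler = do_timer_cpu;

	if (handler == TICK_DO_TIMER_NONE)
		do_timer_cpu = cpu;
	else if (!in_nohz_cpuset[cpu] && in_nohz_cpuset[handler])
		do_timer_cpu = cpu;
}
```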


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (11 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 12/32] nohz: Try not to give the timekeeping duty to a cpuset nohz cpu Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-29 15:25   ` Peter Zijlstra
                     ` (2 more replies)
  2011-08-15 15:52 ` [PATCH 14/32] nohz/cpuset: Don't turn off the tick if rcu needs it Frederic Weisbecker
                   ` (20 subsequent siblings)
  33 siblings, 3 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

When a CPU is included in a nohz cpuset, try to switch
it to nohz mode from the timer interrupt when the current
task is the only non-idle task running there.

Then restart the tick if necessary from the wakeup path
when we enqueue a second task while the tick is stopped,
so that the scheduler tick is rearmed.

This assumes we are using the TTWU_QUEUE sched feature, so I
still need to handle the case where it is off (or rather, handle
it properly), because we need the adaptive tick restart, and what
will come along in further patches, to be done locally and before
the new task ever gets scheduled.

I also need to look at the ARCH_WANT_INTERRUPTS_ON_CTXSW case
and at remote wakeups.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 include/linux/cpuset.h   |    4 +++
 include/linux/sched.h    |    6 ++++
 include/linux/tick.h     |   12 ++++++++-
 init/Kconfig             |    2 +-
 kernel/sched.c           |   35 +++++++++++++++++++++++++
 kernel/softirq.c         |    4 +-
 kernel/time/tick-sched.c |   63 ++++++++++++++++++++++++++++++++++++++-------
 7 files changed, 112 insertions(+), 14 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 62e5d5a..799b9a4 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -264,6 +264,10 @@ static inline bool cpuset_adaptive_nohz(void)
 	return false;
 }
 
+extern void cpuset_update_nohz(void);
+#else
+static inline void cpuset_update_nohz(void) { }
+
 #endif /* CONFIG_CPUSETS_NO_HZ */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dbe021a..53a95b5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2652,6 +2652,12 @@ static inline void inc_syscw(struct task_struct *tsk)
 #define TASK_SIZE_OF(tsk)	TASK_SIZE
 #endif
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+extern bool cpuset_nohz_can_stop_tick(void);
+#else
+static inline bool cpuset_nohz_can_stop_tick(void) { return false; }
+#endif
+
 #ifdef CONFIG_MM_OWNER
 extern void mm_update_next_owner(struct mm_struct *mm);
 extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 849a0b2..cc4880e 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -122,11 +122,21 @@ static inline int tick_oneshot_mode_active(void) { return 0; }
 # ifdef CONFIG_NO_HZ
 extern void tick_nohz_enter_idle(void);
 extern void tick_nohz_exit_idle(void);
+extern void tick_nohz_restart_sched_tick(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
-# else
+
+#ifdef CONFIG_CPUSETS_NO_HZ
+DECLARE_PER_CPU(int, task_nohz_mode);
+
+extern int tick_nohz_adaptive_mode(void);
+#else /* !CPUSETS_NO_HZ */
+static inline int tick_nohz_adaptive_mode(void) { return 0; }
+#endif /* CPUSETS_NO_HZ */
+
+# else /* !NO_HZ */
 static inline void tick_nohz_enter_idle(void) { }
 static inline void tick_nohz_exit_idle(void) { }
 
diff --git a/init/Kconfig b/init/Kconfig
index 0cb591a..7a144ad 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -626,7 +626,7 @@ config PROC_PID_CPUSET
 
 config CPUSETS_NO_HZ
        bool "Tickless cpusets"
-       depends on CPUSETS && HAVE_CPUSETS_NO_HZ
+       depends on CPUSETS && HAVE_CPUSETS_NO_HZ && NO_HZ && HIGH_RES_TIMERS
        help
          This option lets you apply a nohz property to a cpuset such
 	 that the periodic timer tick tries to be avoided when possible on
diff --git a/kernel/sched.c b/kernel/sched.c
index 609a867..0e1aa4e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2433,6 +2433,38 @@ static void update_avg(u64 *avg, u64 sample)
 }
 #endif
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+DEFINE_PER_CPU(int, task_nohz_mode);
+
+bool cpuset_nohz_can_stop_tick(void)
+{
+	struct rq *rq;
+
+	rq = this_rq();
+
+	/* More than one running task needs preemption */
+	if (rq->nr_running > 1)
+		return false;
+
+	return true;
+}
+
+static void cpuset_nohz_restart_tick(void)
+{
+	__get_cpu_var(task_nohz_mode) = 0;
+	tick_nohz_restart_sched_tick();
+}
+
+void cpuset_update_nohz(void)
+{
+	if (!tick_nohz_adaptive_mode())
+		return;
+
+	if (!cpuset_nohz_can_stop_tick())
+		cpuset_nohz_restart_tick();
+}
+#endif
+
 static void
 ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 {
@@ -2560,6 +2592,8 @@ static void sched_ttwu_pending(void)
 		ttwu_do_activate(rq, p, 0);
 	}
 
+	cpuset_update_nohz();
+
 	raw_spin_unlock(&rq->lock);
 }
 
@@ -2620,6 +2654,7 @@ static void ttwu_queue(struct task_struct *p, int cpu)
 
 	raw_spin_lock(&rq->lock);
 	ttwu_do_activate(rq, p, 0);
+	cpuset_update_nohz();
 	raw_spin_unlock(&rq->lock);
 }
 
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 67a1401..2dbeeb9 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -297,7 +297,7 @@ void irq_enter(void)
 	int cpu = smp_processor_id();
 
 	rcu_irq_enter();
-	if (idle_cpu(cpu) && !in_interrupt()) {
+	if ((idle_cpu(cpu) || tick_nohz_adaptive_mode()) && !in_interrupt()) {
 		/*
 		 * Prevent raise_softirq from needlessly waking up ksoftirqd
 		 * here, as softirq will be serviced on return from interrupt.
@@ -342,7 +342,7 @@ void irq_exit(void)
 	rcu_irq_exit();
 #ifdef CONFIG_NO_HZ
 	/* Make sure that timer wheel updates are propagated */
-	if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
+	if (!in_interrupt())
 		tick_nohz_irq_exit();
 #endif
 	preempt_enable_no_resched();
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5f41ef7..fb97cd0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -499,12 +499,7 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 	}
 }
 
-/**
- * tick_nohz_restart_sched_tick - restart the idle tick from the idle task
- *
- * Restart the idle tick when the CPU is woken up from idle
- */
-static void tick_nohz_restart_sched_tick(ktime_t now, struct tick_sched *ts)
+static void __tick_nohz_restart_sched_tick(ktime_t now, struct tick_sched *ts)
 {
 	int cpu = smp_processor_id();
 
@@ -522,6 +517,31 @@ static void tick_nohz_restart_sched_tick(ktime_t now, struct tick_sched *ts)
 	tick_nohz_restart(ts, now);
 }
 
+/**
+ * tick_nohz_restart_sched_tick - restart the idle tick from the idle task
+ *
+ * Restart the idle tick when the CPU is woken up from idle
+ */
+void tick_nohz_restart_sched_tick(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+	unsigned long flags;
+	ktime_t now;
+
+	local_irq_save(flags);
+
+	if (!ts->tick_stopped) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	now = ktime_get();
+	__tick_nohz_restart_sched_tick(now, ts);
+
+	local_irq_restore(flags);
+}
+
+
 static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
@@ -560,7 +580,7 @@ void tick_nohz_exit_idle(void)
 	if (ts->tick_stopped) {
 		rcu_exit_nohz();
 		select_nohz_load_balancer(0);
-		tick_nohz_restart_sched_tick(now, ts);
+		__tick_nohz_restart_sched_tick(now, ts);
 		tick_nohz_account_idle_ticks(ts);
 	}
 
@@ -570,9 +590,14 @@ void tick_nohz_exit_idle(void)
 void tick_nohz_irq_exit(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+	int cpu = smp_processor_id();
 
-	if (ts->inidle)
-		__tick_nohz_enter_idle(ts, smp_processor_id());
+	if (ts->inidle && !need_resched())
+		__tick_nohz_enter_idle(ts, cpu);
+	else if (tick_nohz_adaptive_mode() && !idle_cpu(cpu)) {
+		if (tick_nohz_can_stop_tick(cpu, ts))
+			tick_nohz_stop_sched_tick(ktime_get(), cpu, ts);
+	}
 }
 
 static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
@@ -732,6 +757,20 @@ void tick_check_idle(int cpu)
 
 #ifdef CONFIG_CPUSETS_NO_HZ
 
+int tick_nohz_adaptive_mode(void)
+{
+	return __get_cpu_var(task_nohz_mode);
+}
+
+static void tick_nohz_cpuset_stop_tick(int user)
+{
+	if (!cpuset_adaptive_nohz() || tick_nohz_adaptive_mode())
+		return;
+
+	if (cpuset_nohz_can_stop_tick())
+		__get_cpu_var(task_nohz_mode) = 1;
+}
+
 /*
  * Take the timer duty if nobody is taking care of it.
  * If a CPU already does and it's in a nohz cpuset,
@@ -752,6 +791,8 @@ static void tick_do_timer_check_handler(int cpu)
 
 #else
 
+static void tick_nohz_cpuset_stop_tick(int user) { }
+
 static void tick_do_timer_check_handler(int cpu)
 {
 #ifdef CONFIG_NO_HZ
@@ -796,6 +837,7 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 	 * no valid regs pointer
 	 */
 	if (regs) {
+		int user = user_mode(regs);
 		/*
 		 * When we are idle and the tick is stopped, we have to touch
 		 * the watchdog as we might not schedule for a really long
@@ -809,8 +851,9 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 			if (idle_cpu(cpu))
 				ts->idle_jiffies++;
 		}
-		update_process_times(user_mode(regs));
+		update_process_times(user);
 		profile_tick(CPU_PROFILING);
+		tick_nohz_cpuset_stop_tick(user);
 	}
 
 	hrtimer_forward(timer, now, tick_period);
-- 
1.7.5.4
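The stop/restart dance in this patch can be modeled in isolation: the timer interrupt stops the tick when at most one task is runnable, and the wakeup path restarts it as soon as a second task is enqueued. Below is a toy sketch under that reading (illustrative names; none of the locking, RCU, or memory-ordering details are modeled):

```c
/* Toy per-CPU state (illustrative; no locking modeled). */
static int nr_running = 1;
static int task_nohz_mode;	/* tick stopped while non-idle */
static int tick_stopped;

/* Mirrors cpuset_nohz_can_stop_tick(): more than one runnable
 * task needs the tick for preemption. */
static int toy_can_stop_tick(void)
{
	return nr_running <= 1;
}

/* Timer-interrupt side, as tick_nohz_cpuset_stop_tick(). */
static void toy_tick(void)
{
	if (!task_nohz_mode && toy_can_stop_tick()) {
		task_nohz_mode = 1;
		tick_stopped = 1;	/* via tick_nohz_irq_exit() */
	}
}

/* Wakeup side, as cpuset_update_nohz() after ttwu_do_activate(). */
static void toy_enqueue(void)
{
	nr_running++;
	if (task_nohz_mode && !toy_can_stop_tick()) {
		task_nohz_mode = 0;
		tick_stopped = 0;	/* tick_nohz_restart_sched_tick() */
	}
}
```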


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 14/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (12 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-16 20:13   ` Paul E. McKenney
  2011-08-29 15:36   ` Peter Zijlstra
  2011-08-15 15:52 ` [PATCH 15/32] nohz/cpuset: Restart tick when switching to idle task Frederic Weisbecker
                   ` (19 subsequent siblings)
  33 siblings, 2 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

If RCU is waiting for the current CPU to complete a grace
period, don't turn off the tick. Unlike dynctik-idle, we
are not necessarily going to enter into rcu extended quiescent
state, so we may need to keep the tick to note current CPU's
quiescent states.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 include/linux/rcupdate.h |    1 +
 kernel/rcutree.c         |    3 +--
 kernel/sched.c           |   14 ++++++++++++++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 99f9aa7..55a482a 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -133,6 +133,7 @@ static inline int rcu_preempt_depth(void)
 extern void rcu_sched_qs(int cpu);
 extern void rcu_bh_qs(int cpu);
 extern void rcu_check_callbacks(int cpu, int user);
+extern int rcu_pending(int cpu);
 struct notifier_block;
 
 #ifdef CONFIG_NO_HZ
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index ba06207..0009bfc 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -205,7 +205,6 @@ int rcu_cpu_stall_suppress __read_mostly;
 module_param(rcu_cpu_stall_suppress, int, 0644);
 
 static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
-static int rcu_pending(int cpu);
 
 /*
  * Return the number of RCU-sched batches processed thus far for debug & stats.
@@ -1729,7 +1728,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
  * by the current CPU, returning 1 if so.  This function is part of the
  * RCU implementation; it is -not- an exported member of the RCU API.
  */
-static int rcu_pending(int cpu)
+int rcu_pending(int cpu)
 {
 	return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
 	       __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
diff --git a/kernel/sched.c b/kernel/sched.c
index 0e1aa4e..353a66f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2439,6 +2439,7 @@ DEFINE_PER_CPU(int, task_nohz_mode);
 bool cpuset_nohz_can_stop_tick(void)
 {
 	struct rq *rq;
+	int cpu;
 
 	rq = this_rq();
 
@@ -2446,6 +2447,19 @@ bool cpuset_nohz_can_stop_tick(void)
 	if (rq->nr_running > 1)
 		return false;
 
+	cpu = smp_processor_id();
+
+	/*
+	 * FIXME: will probably be removed soon as it's
+	 * already checked from tick_nohz_stop_sched_tick()
+	 */
+	if (rcu_needs_cpu(cpu))
+		return false;
+
+	/* Is there a grace period to complete ? */
+	if (rcu_pending(cpu))
+		return false;
+
 	return true;
 }
 
-- 
1.7.5.4
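With this patch, the can-stop-tick check grows two RCU conditions on top of the nr_running test. A stand-alone sketch of the combined predicate, with toy stand-ins for rcu_needs_cpu() and rcu_pending() (illustrative only):

```c
/* Toy stand-ins for the kernel predicates (illustrative only). */
static int nr_running = 1;
static int rcu_needs_this_cpu;	/* pending RCU callbacks here */
static int rcu_pending_here;	/* grace period waits on this CPU */

/* Mirrors cpuset_nohz_can_stop_tick() after this patch: keep the
 * tick while RCU still needs quiescent states from this CPU. */
static int toy_can_stop_tick(void)
{
	if (nr_running > 1)
		return 0;
	if (rcu_needs_this_cpu)	/* like rcu_needs_cpu() */
		return 0;
	if (rcu_pending_here)	/* like rcu_pending() */
		return 0;
	return 1;
}
```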


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 15/32] nohz/cpuset: Restart tick when switching to idle task
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (13 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 14/32] nohz/cpuset: Don't turn off the tick if rcu needs it Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-29 15:43   ` Peter Zijlstra
  2011-08-15 15:52 ` [PATCH 16/32] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued Frederic Weisbecker
                   ` (18 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Ideally, if we are in adaptive nohz mode and we switch to the
idle task, we shouldn't restart the tick since the idle path
is going to stop it again soon anyway.

That optimization requires some minor tweaks here and there
though; let's handle it later.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/sched.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 353a66f..9b6b8eb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2477,6 +2477,18 @@ void cpuset_update_nohz(void)
 	if (!cpuset_nohz_can_stop_tick())
 		cpuset_nohz_restart_tick();
 }
+
+static void cpuset_nohz_task_switch(struct task_struct *next)
+{
+	int cpu = smp_processor_id();
+
+	if (tick_nohz_adaptive_mode() && next == idle_task(cpu))
+		cpuset_nohz_restart_tick();
+}
+#else
+static void cpuset_nohz_task_switch(struct task_struct *next)
+{
+}
 #endif
 
 static void
@@ -3023,6 +3035,7 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
+	cpuset_nohz_task_switch(next);
 	sched_info_switch(prev, next);
 	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
-- 
1.7.5.4
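The interim behavior of this patch, restarting the tick whenever we switch to the idle task while in adaptive mode and letting the normal idle path stop it again, can be sketched as follows (toy names, illustrative only):

```c
#define TOY_IDLE_PID 0	/* pid 0 stands in for the idle task */

/* Toy state (illustrative only). */
static int adaptive_mode = 1;
static int tick_restarted;

/* Mirrors cpuset_nohz_task_switch() from prepare_task_switch():
 * for now, switching to idle simply restarts the tick; the idle
 * entry path will stop it again through the usual dynticks code. */
static void toy_task_switch(int next_pid)
{
	if (adaptive_mode && next_pid == TOY_IDLE_PID) {
		adaptive_mode = 0;
		tick_restarted = 1;	/* cpuset_nohz_restart_tick() */
	}
}
```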


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 16/32] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (14 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 15/32] nohz/cpuset: Restart tick when switching to idle task Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-29 15:51   ` Peter Zijlstra
  2011-08-29 15:55   ` Peter Zijlstra
  2011-08-15 15:52 ` [PATCH 17/32] x86: New cpuset nohz irq vector Frederic Weisbecker
                   ` (17 subsequent siblings)
  33 siblings, 2 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Wake up a CPU when a timer list timer is enqueued there while
the CPU is in adaptive nohz mode. Sending it an IPI makes it
reconsider the next timer to program, taking the recent updates
into account.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 include/linux/sched.h    |    4 ++--
 kernel/sched.c           |   33 ++++++++++++++++++++++++++++++++-
 kernel/time/tick-sched.c |    5 ++++-
 kernel/timer.c           |    2 +-
 4 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 53a95b5..5ff0764 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1947,9 +1947,9 @@ static inline void idle_task_exit(void) {}
 #endif
 
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-extern void wake_up_idle_cpu(int cpu);
+extern void wake_up_nohz_cpu(int cpu);
 #else
-static inline void wake_up_idle_cpu(int cpu) { }
+static inline void wake_up_nohz_cpu(int cpu) { }
 #endif
 
 extern unsigned int sysctl_sched_latency;
diff --git a/kernel/sched.c b/kernel/sched.c
index 9b6b8eb..8bf8280 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1234,7 +1234,7 @@ unlock:
  * account when the CPU goes back to idle and evaluates the timer
  * wheel for the next timer event.
  */
-void wake_up_idle_cpu(int cpu)
+static void wake_up_idle_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
@@ -1264,6 +1264,37 @@ void wake_up_idle_cpu(int cpu)
 		smp_send_reschedule(cpu);
 }
 
+
+static bool wake_up_cpuset_nohz_cpu(int cpu)
+{
+#ifdef CONFIG_CPUSETS_NO_HZ
+	/* Ensure task_nohz_mode update is visible */
+	smp_rmb();
+	/*
+	 * Even if task_nohz_mode is set concurrently, what
+	 * matters is that by the time we do that check, we know
+	 * that the CPU has not reached tick_nohz_stop_sched_tick().
+	 * As we are holding the base->lock and that lock needs
+	 * to be taken by tick_nohz_stop_sched_tick() we know
+	 * we are preceding it and it will see our update
+	 * synchronously. Thus we know we don't need to send an
+	 * IPI to that CPU.
+	 */
+	if (per_cpu(task_nohz_mode, cpu)) {
+		smp_cpuset_update_nohz(cpu);
+		return true;
+	}
+#endif
+
+	return false;
+}
+
+void wake_up_nohz_cpu(int cpu)
+{
+	if (!wake_up_cpuset_nohz_cpu(cpu))
+		wake_up_idle_cpu(cpu);
+}
+
 #endif /* CONFIG_NO_HZ */
 
 static u64 sched_avg_period(void)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fb97cd0..9e450d8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -767,8 +767,11 @@ static void tick_nohz_cpuset_stop_tick(int user)
 	if (!cpuset_adaptive_nohz() || tick_nohz_adaptive_mode())
 		return;
 
-	if (cpuset_nohz_can_stop_tick())
+	if (cpuset_nohz_can_stop_tick()) {
 		__get_cpu_var(task_nohz_mode) = 1;
+		/* Nohz mode must be visible to wake_up_nohz_cpu() */
+		smp_wmb();
+	}
 }
 
 /*
diff --git a/kernel/timer.c b/kernel/timer.c
index 8cff361..8cdbd48 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -880,7 +880,7 @@ void add_timer_on(struct timer_list *timer, int cpu)
 	 * makes sure that a CPU on the way to idle can not evaluate
 	 * the timer wheel.
 	 */
-	wake_up_idle_cpu(cpu);
+	wake_up_nohz_cpu(cpu);
 	spin_unlock_irqrestore(&base->lock, flags);
 }
 EXPORT_SYMBOL_GPL(add_timer_on);
-- 
1.7.5.4
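The wakeup dispatch added here picks between two paths: if the target CPU runs tickless but busy, send the cpuset-nohz IPI so it re-evaluates its timers; otherwise fall back to the existing idle wakeup. A toy model of that choice (illustrative names; the real code additionally relies on the smp_wmb()/smp_rmb() pairing and base->lock, as the comment in the patch explains):

```c
#define NR_CPUS 4

/* Toy per-CPU state and effect recorders (illustrative). */
static int task_nohz_mode[NR_CPUS];	/* tick stopped while busy */
static int nohz_ipi_sent[NR_CPUS];	/* smp_cpuset_update_nohz() */
static int idle_kick_sent[NR_CPUS];	/* wake_up_idle_cpu() path */

/* Mirrors wake_up_cpuset_nohz_cpu(); the kernel also needs an
 * smp_rmb() here, pairing with the smp_wmb() done when the target
 * sets task_nohz_mode. */
static int toy_wake_up_cpuset_nohz_cpu(int cpu)
{
	if (task_nohz_mode[cpu]) {
		nohz_ipi_sent[cpu] = 1;	/* re-evaluate next timer */
		return 1;
	}
	return 0;
}

/* Mirrors wake_up_nohz_cpu(), called from add_timer_on(). */
static void toy_wake_up_nohz_cpu(int cpu)
{
	if (!toy_wake_up_cpuset_nohz_cpu(cpu))
		idle_kick_sent[cpu] = 1;
}
```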


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 17/32] x86: New cpuset nohz irq vector
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (15 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 16/32] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-15 15:52 ` [PATCH 18/32] nohz/cpuset: Don't stop the tick if posix cpu timers are running Frederic Weisbecker
                   ` (16 subsequent siblings)
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

We need a way to send an IPI (remote or local) that is
handled asynchronously and in any context, in order to
restart the tick for CPUs in adaptive nohz mode.

Generic smp operations don't fit because they need
interrupts to be enabled and they execute the function
in place when the destination CPU is the current one.
But we always need the function to be executed in irq
context so it happens quickly and restarting the tick
doesn't mess up with random lock scenarios in place.

In fact this is a temporary solution; what we really need
is an irq work subsystem that supports remote enqueuing.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 arch/x86/include/asm/entry_arch.h  |    3 +++
 arch/x86/include/asm/hw_irq.h      |    6 ++++++
 arch/x86/include/asm/irq_vectors.h |    2 ++
 arch/x86/include/asm/smp.h         |   11 +++++++++++
 arch/x86/kernel/entry_64.S         |    4 ++++
 arch/x86/kernel/irqinit.c          |    4 ++++
 arch/x86/kernel/smp.c              |   26 ++++++++++++++++++++++++++
 7 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index 1cd6d26..019cf29 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -10,6 +10,9 @@
  * through the ICC by us (IPIs)
  */
 #ifdef CONFIG_SMP
+#ifdef CONFIG_CPUSETS_NO_HZ
+BUILD_INTERRUPT(cpuset_update_nohz_interrupt,CPUSET_UPDATE_NOHZ_VECTOR)
+#endif
 BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
 BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
 BUILD_INTERRUPT(call_function_single_interrupt,CALL_FUNCTION_SINGLE_VECTOR)
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index bb9efe8..1978050 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -34,6 +34,9 @@ extern void irq_work_interrupt(void);
 extern void spurious_interrupt(void);
 extern void thermal_interrupt(void);
 extern void reschedule_interrupt(void);
+#ifdef CONFIG_CPUSETS_NO_HZ
+extern void cpuset_update_nohz_interrupt(void);
+#endif
 extern void mce_self_interrupt(void);
 
 extern void invalidate_interrupt(void);
@@ -153,6 +156,9 @@ extern asmlinkage void smp_irq_move_cleanup_interrupt(void);
 #endif
 #ifdef CONFIG_SMP
 extern void smp_reschedule_interrupt(struct pt_regs *);
+#ifdef CONFIG_CPUSETS_NO_HZ
+extern void smp_cpuset_update_nohz_interrupt(struct pt_regs *);
+#endif
 extern void smp_call_function_interrupt(struct pt_regs *);
 extern void smp_call_function_single_interrupt(struct pt_regs *);
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 6e976ee..5e33fec 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -117,6 +117,8 @@
 /* Xen vector callback to receive events in a HVM domain */
 #define XEN_HVM_EVTCHN_CALLBACK		0xf3
 
+#define CPUSET_UPDATE_NOHZ_VECTOR	0xf2
+
 /*
  * Local APIC timer IRQ vector is on a different priority level,
  * to work around the 'lost local interrupt if more than 2 IRQ
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 73b11bc..66dc629 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -70,6 +70,10 @@ struct smp_ops {
 	void (*stop_other_cpus)(int wait);
 	void (*smp_send_reschedule)(int cpu);
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+	void (*smp_cpuset_update_nohz)(int cpu);
+#endif
+
 	int (*cpu_up)(unsigned cpu);
 	int (*cpu_disable)(void);
 	void (*cpu_die)(unsigned int cpu);
@@ -138,6 +142,13 @@ static inline void smp_send_reschedule(int cpu)
 	smp_ops.smp_send_reschedule(cpu);
 }
 
+static inline void smp_cpuset_update_nohz(int cpu)
+{
+#ifdef CONFIG_CPUSETS_NO_HZ
+	smp_ops.smp_cpuset_update_nohz(cpu);
+#endif
+}
+
 static inline void arch_send_call_function_single_ipi(int cpu)
 {
 	smp_ops.send_call_func_single_ipi(cpu);
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index d656f68..06d79c2 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -996,6 +996,10 @@ apicinterrupt CALL_FUNCTION_VECTOR \
 	call_function_interrupt smp_call_function_interrupt
 apicinterrupt RESCHEDULE_VECTOR \
 	reschedule_interrupt smp_reschedule_interrupt
+#ifdef CONFIG_CPUSETS_NO_HZ
+apicinterrupt CPUSET_UPDATE_NOHZ_VECTOR \
+	cpuset_update_nohz_interrupt smp_cpuset_update_nohz_interrupt
+#endif
 #endif
 
 apicinterrupt ERROR_APIC_VECTOR \
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index f470e4e..ba5665c 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -172,6 +172,10 @@ static void __init smp_intr_init(void)
 	 */
 	alloc_intr_gate(RESCHEDULE_VECTOR, reschedule_interrupt);
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+	alloc_intr_gate(CPUSET_UPDATE_NOHZ_VECTOR, cpuset_update_nohz_interrupt);
+#endif
+
 	/* IPIs for invalidation */
 #define ALLOC_INVTLB_VEC(NR) \
 	alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+NR, \
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 013e7eb..7c3e399 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -22,6 +22,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/cpuset.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -121,6 +122,17 @@ static void native_smp_send_reschedule(int cpu)
 	apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+static void native_smp_cpuset_update_nohz(int cpu)
+{
+	if (unlikely(cpu_is_offline(cpu))) {
+		WARN_ON(1);
+		return;
+	}
+	apic->send_IPI_mask(cpumask_of(cpu), CPUSET_UPDATE_NOHZ_VECTOR);
+}
+#endif
+
 void native_send_call_func_single_ipi(int cpu)
 {
 	apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
@@ -206,6 +218,17 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
 	 */
 }
 
+#ifdef CONFIG_CPUSETS_NO_HZ
+void smp_cpuset_update_nohz_interrupt(struct pt_regs *regs)
+{
+	ack_APIC_irq();
+	irq_enter();
+	cpuset_update_nohz();
+	inc_irq_stat(irq_call_count);
+	irq_exit();
+}
+#endif
+
 void smp_call_function_interrupt(struct pt_regs *regs)
 {
 	ack_APIC_irq();
@@ -231,6 +254,9 @@ struct smp_ops smp_ops = {
 
 	.stop_other_cpus	= native_stop_other_cpus,
 	.smp_send_reschedule	= native_smp_send_reschedule,
+#ifdef CONFIG_CPUSETS_NO_HZ
+	.smp_cpuset_update_nohz = native_smp_cpuset_update_nohz,
+#endif
 
 	.cpu_up			= native_cpu_up,
 	.cpu_die		= native_cpu_die,
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 18/32] nohz/cpuset: Don't stop the tick if posix cpu timers are running
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (16 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 17/32] x86: New cpuset nohz irq vector Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-29 15:59   ` Peter Zijlstra
  2011-08-15 15:52 ` [PATCH 19/32] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset Frederic Weisbecker
                   ` (15 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

If either a per thread or a per process posix cpu timer is running,
don't stop the tick.

TODO: restart the tick if it is stopped and a posix cpu timer is
enqueued. Check whether we need a memory barrier for the per
process posix timer, which can be enqueued from another task
of the group.
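The intent of the check this patch adds can be sketched as a minimal userspace model (the struct fields below are hypothetical stand-ins for task_struct's cputime_expires and signal->cputimer.running, not the kernel's definitions):

```c
#include <stdbool.h>

/*
 * Minimal model of posix_cpu_timers_running(): the tick may only be
 * stopped when no per-thread expiry is armed and the shared process-wide
 * cputimer is not running. All names here are illustrative.
 */
struct cputime_model {
	unsigned long utime_expires;	/* per-thread expiries, 0 == unarmed */
	unsigned long stime_expires;
	unsigned long sched_expires;
	int group_cputimer_running;	/* models signal->cputimer.running */
};

/* Any armed per-thread or per-process timer forbids stopping the tick. */
static bool timers_running(const struct cputime_model *t)
{
	if (t->utime_expires || t->stime_expires || t->sched_expires)
		return true;
	return t->group_cputimer_running != 0;
}
```

In the patch itself this result feeds cpuset_nohz_can_stop_tick(), which bails out before the tick is shut down.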

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 include/linux/posix-timers.h |    1 +
 kernel/posix-cpu-timers.c    |   12 ++++++++++++
 kernel/sched.c               |    4 ++++
 3 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 959c141..5092cfd 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -116,6 +116,7 @@ int posix_timer_event(struct k_itimer *timr, int si_private);
 void posix_cpu_timer_schedule(struct k_itimer *timer);
 
 void run_posix_cpu_timers(struct task_struct *task);
+bool posix_cpu_timers_running(struct task_struct *tsk);
 void posix_cpu_timers_exit(struct task_struct *task);
 void posix_cpu_timers_exit_group(struct task_struct *task);
 
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 58f405b..f284fa4 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -6,6 +6,7 @@
 #include <linux/posix-timers.h>
 #include <linux/errno.h>
 #include <linux/math64.h>
+#include <linux/cpuset.h>
 #include <asm/uaccess.h>
 #include <linux/kernel_stat.h>
 #include <trace/events/timer.h>
@@ -1300,6 +1301,17 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
 	return 0;
 }
 
+bool posix_cpu_timers_running(struct task_struct *tsk)
+{
+	if (!task_cputime_zero(&tsk->cputime_expires))
+		return true;
+
+	if (tsk->signal->cputimer.running)
+		return true;
+
+	return false;
+}
+
 /*
  * This is called from the timer interrupt handler.  The irq handler has
  * already updated our counts.  We need to check if any timers fire now.
diff --git a/kernel/sched.c b/kernel/sched.c
index 8bf8280..78ea0a5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -71,6 +71,7 @@
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
 #include <linux/slab.h>
+#include <linux/posix-timers.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -2491,6 +2492,9 @@ bool cpuset_nohz_can_stop_tick(void)
 	if (rcu_pending(cpu))
 		return false;
 
+	if (posix_cpu_timers_running(current))
+		return false;
+
 	return true;
 }
 
-- 
1.7.5.4



* [PATCH 19/32] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (17 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 18/32] nohz/cpuset: Don't stop the tick if posix cpu timers are running Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-29 16:02   ` Peter Zijlstra
  2011-08-15 15:52 ` [PATCH 20/32] nohz/cpuset: Restart the tick if printk needs it Frederic Weisbecker
                   ` (14 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Issue an IPI to restart the tick on a CPU that belongs
to a cpuset when its nohz flag gets cleared.
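The reference-counting that drives this IPI can be modeled in a few lines of plain C (a toy sketch of update_nohz_cpus(); the arrays and names are illustrative, and a flag stands in for the smp_mb() + cpu_exit_nohz() IPI):

```c
/*
 * Each nohz cpuset covering a CPU holds a reference on it. When the last
 * reference is dropped, the CPU must be told to restart its tick.
 */
#define NR_MODEL_CPUS 4

static int nohz_ref[NR_MODEL_CPUS];		/* cpu_adaptive_nohz_ref */
static int tick_restart_sent[NR_MODEL_CPUS];	/* "IPI issued" marker */

static void model_update_nohz_cpu(int cpu, int turning_on)
{
	if (turning_on) {
		nohz_ref[cpu]++;
	} else if (--nohz_ref[cpu] == 0) {
		/* Stands in for smp_mb() followed by cpu_exit_nohz(cpu) */
		tick_restart_sent[cpu] = 1;
	}
}
```

Note the restart is only triggered on the 1 -> 0 transition: a CPU shared by two nohz cpusets keeps its tick stopped while either flag remains set.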

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 include/linux/cpuset.h |    1 +
 kernel/cpuset.c        |   21 +++++++++++++++++++++
 kernel/sched.c         |    8 ++++++++
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 799b9a4..7f9d78d 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -265,6 +265,7 @@ static inline bool cpuset_adaptive_nohz(void)
 }
 
 extern void cpuset_update_nohz(void);
+extern void cpuset_exit_nohz_interrupt(void *unused);
 #else
 static inline void cpuset_update_nohz(void) { }
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3135096..ee3b0d0 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1199,6 +1199,14 @@ static void cpuset_change_flag(struct task_struct *tsk,
 
 DEFINE_PER_CPU(int, cpu_adaptive_nohz_ref);
 
+static void cpu_exit_nohz(int cpu)
+{
+	preempt_disable();
+	smp_call_function_single(cpu, cpuset_exit_nohz_interrupt,
+				 NULL, true);
+	preempt_enable();
+}
+
 static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
 {
 	int cpu;
@@ -1212,6 +1220,19 @@ static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
 			per_cpu(cpu_adaptive_nohz_ref, cpu) += 1;
 		else
 			per_cpu(cpu_adaptive_nohz_ref, cpu) -= 1;
+
+		val = per_cpu(cpu_adaptive_nohz_ref, cpu);
+
+		if (!val) {
+			/*
+			 * The update to cpu_adaptive_nohz_ref must be
+			 * visible right away. So that once we restart the tick
+			 * from the IPI, it won't be stopped again due to cache
+			 * update lag.
+			 */
+			smp_mb();
+			cpu_exit_nohz(cpu);
+		}
 	}
 }
 #else
diff --git a/kernel/sched.c b/kernel/sched.c
index 78ea0a5..75378be 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2513,6 +2513,14 @@ void cpuset_update_nohz(void)
 		cpuset_nohz_restart_tick();
 }
 
+void cpuset_exit_nohz_interrupt(void *unused)
+{
+	if (!tick_nohz_adaptive_mode())
+		return;
+
+	cpuset_nohz_restart_tick();
+}
+
 static void cpuset_nohz_task_switch(struct task_struct *next)
 {
 	int cpu = smp_processor_id();
-- 
1.7.5.4



* [PATCH 20/32] nohz/cpuset: Restart the tick if printk needs it
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (18 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 19/32] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-15 15:52 ` [PATCH 21/32] rcu: Restart the tick on non-responding adaptive nohz CPUs Frederic Weisbecker
                   ` (13 subsequent siblings)
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

If we are in nohz adaptive mode when printk is called, there is no
tick to wake up the klogd logger. We need to restart the tick when
that happens.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/printk.c |   17 ++++++++++++++++-
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/kernel/printk.c b/kernel/printk.c
index 3518539..aff07f0 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -41,6 +41,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/rculist.h>
+#include <linux/tick.h>
 
 #include <asm/uaccess.h>
 
@@ -1220,8 +1221,22 @@ int printk_needs_cpu(int cpu)
 
 void wake_up_klogd(void)
 {
-	if (waitqueue_active(&log_wait))
+	unsigned long flags;
+
+	if (waitqueue_active(&log_wait)) {
 		this_cpu_write(printk_pending, 1);
+		/* Make it visible from any interrupt from now */
+		barrier();
+		/*
+		 * It's safe to check that even if interrupts are not disabled.
+		 * If we enable nohz adaptive mode concurrently, we'll see the
+		 * printk_pending value and thus keep a periodic tick behaviour.
+		 * Unless it's possible that tick_nohz_adaptive_mode() reads
+		 * its value before the barrier()?
+		 */
+		if (tick_nohz_adaptive_mode())
+			smp_cpuset_update_nohz(smp_processor_id());
+	}
 }
 
 /**
-- 
1.7.5.4



* [PATCH 21/32] rcu: Restart the tick on non-responding adaptive nohz CPUs
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (19 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 20/32] nohz/cpuset: Restart the tick if printk needs it Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-15 15:52 ` [PATCH 22/32] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU Frederic Weisbecker
                   ` (12 subsequent siblings)
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

When a CPU in adaptive nohz mode doesn't respond in time to
complete a grace period, issue it a specific IPI so that it
restarts the tick and chases a quiescent state.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/rcutree.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 0009bfc..d496c70 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -50,6 +50,7 @@
 #include <linux/wait.h>
 #include <linux/kthread.h>
 #include <linux/prefetch.h>
+#include <linux/cpuset.h>
 
 #include "rcutree.h"
 
@@ -295,6 +296,20 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
 
 #ifdef CONFIG_SMP
 
+static void cpuset_update_rcu_cpu(int cpu)
+{
+#ifdef CONFIG_CPUSETS_NO_HZ
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	if (cpuset_cpu_adaptive_nohz(cpu))
+		smp_cpuset_update_nohz(cpu);
+
+	local_irq_restore(flags);
+#endif
+}
+
 /*
  * If the specified CPU is offline, tell the caller that it is in
  * a quiescent state.  Otherwise, whack it with a reschedule IPI.
@@ -317,6 +332,8 @@ static int rcu_implicit_offline_qs(struct rcu_data *rdp)
 		return 1;
 	}
 
+	cpuset_update_rcu_cpu(rdp->cpu);
+
 	/* If preemptible RCU, no point in sending reschedule IPI. */
 	if (rdp->preemptible)
 		return 0;
-- 
1.7.5.4



* [PATCH 22/32] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (20 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 21/32] rcu: Restart the tick on non-responding adaptive nohz CPUs Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-16 20:20   ` Paul E. McKenney
  2011-08-15 15:52 ` [PATCH 23/32] nohz/cpuset: Account user and system times in adaptive nohz mode Frederic Weisbecker
                   ` (11 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

If we enqueue an rcu callback, we need the CPU tick to stay
alive until the callback is taken care of by completing the
appropriate grace period.

Thus, when we call_rcu(), send a self IPI that checks rcu_needs_cpu()
so that we restore a periodic tick behaviour that can take care of
everything.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/rcutree.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d496c70..b5643ce2 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -51,6 +51,7 @@
 #include <linux/kthread.h>
 #include <linux/prefetch.h>
 #include <linux/cpuset.h>
+#include <linux/tick.h>
 
 #include "rcutree.h"
 
@@ -1546,6 +1547,13 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
 	rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
 	rdp->qlen++;
 
+	/* Restart the timer if needed to handle the callbacks */
+	if (tick_nohz_adaptive_mode()) {
+		/* Make updates on nxtlist visible to self IPI */
+		barrier();
+		smp_cpuset_update_nohz(smp_processor_id());
+	}
+
 	/* If interrupts were disabled, don't dive into RCU core. */
 	if (irqs_disabled_flags(flags)) {
 		local_irq_restore(flags);
-- 
1.7.5.4



* [PATCH 23/32] nohz/cpuset: Account user and system times in adaptive nohz mode
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (21 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 22/32] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-15 15:52 ` [PATCH 24/32] nohz/cpuset: Handle kernel entry/exit to account cputime Frederic Weisbecker
                   ` (10 subsequent siblings)
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

If we are not running the tick, we no longer account the cputime
regularly at every jiffy.

Lay the groundwork to account that cputime from the points that
require it. Start by catching up from timer interrupts and when we
schedule out a process. We record the last jiffies value and the
ring from which we saved it, and compute the difference later when
we can catch up.

For now it assumes we haven't switched to another ring while we
were running nohz.

TODO: wrap operation on jiffies?
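On the wrap question: as long as the delta is computed with unsigned arithmetic on the jiffies type, a single wraparound between the save and the flush is harmless, which is the same property the kernel's time_after() helpers rely on (only multiple full wraps would be lost). A plain C demonstration, not kernel code:

```c
/*
 * The delta the patch computes, `jiffies - ts->saved_jiffies`, is modular
 * unsigned subtraction, so it yields the correct elapsed count even when
 * the counter overflows once in between.
 */
static unsigned long jiffies_delta(unsigned long now, unsigned long saved)
{
	return now - saved;	/* well-defined modular arithmetic */
}
```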

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 include/linux/kernel_stat.h |    2 ++
 include/linux/tick.h        |   11 +++++++++++
 kernel/sched.c              |   23 +++++++++++++++++++++++
 kernel/time/tick-sched.c    |   39 +++++++++++++++++++++++++++++++++++++++
 kernel/timer.c              |    6 ++++--
 5 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 0cce2db..14cfce4 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -114,7 +114,9 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
 extern unsigned long long task_delta_exec(struct task_struct *);
 
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
+extern void account_user_jiffies(struct task_struct *, unsigned long);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
+extern void account_system_jiffies(struct task_struct *, unsigned long);
 extern void account_steal_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
diff --git a/include/linux/tick.h b/include/linux/tick.h
index cc4880e..ea6dfb7 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -26,6 +26,12 @@ enum tick_nohz_mode {
 	NOHZ_MODE_HIGHRES,
 };
 
+enum tick_saved_jiffies {
+	JIFFIES_SAVED_NONE,
+	JIFFIES_SAVED_USER,
+	JIFFIES_SAVED_SYS,
+};
+
 /**
  * struct tick_sched - sched tick emulation and no idle tick control/stats
  * @sched_timer:	hrtimer to schedule the periodic tick in high
@@ -60,6 +66,8 @@ struct tick_sched {
 	ktime_t				idle_waketime;
 	ktime_t				idle_exittime;
 	ktime_t				idle_sleeptime;
+	enum tick_saved_jiffies		saved_jiffies_whence;
+	unsigned long			saved_jiffies;
 	ktime_t				iowait_sleeptime;
 	ktime_t				sleep_length;
 	unsigned long			last_jiffies;
@@ -132,8 +140,11 @@ extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 DECLARE_PER_CPU(int, task_nohz_mode);
 
 extern int tick_nohz_adaptive_mode(void);
+extern bool tick_nohz_account_tick(void);
+extern void tick_nohz_flush_current_times(void);
 #else /* !CPUSETS_NO_HZ */
 static inline int tick_nohz_adaptive_mode(void) { return 0; }
+static inline bool tick_nohz_account_tick(void) { return false; }
 #endif /* CPUSETS_NO_HZ */
 
 # else /* !NO_HZ */
diff --git a/kernel/sched.c b/kernel/sched.c
index 75378be..a58f993 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2500,6 +2500,7 @@ bool cpuset_nohz_can_stop_tick(void)
 
 static void cpuset_nohz_restart_tick(void)
 {
+	tick_nohz_flush_current_times();
 	__get_cpu_var(task_nohz_mode) = 0;
 	tick_nohz_restart_sched_tick();
 }
@@ -3838,6 +3839,17 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 	acct_update_integrals(p);
 }
 
+void account_user_jiffies(struct task_struct *p, unsigned long count)
+{
+	cputime_t delta_cputime, delta_scaled;
+
+	if (count) {
+		delta_cputime = jiffies_to_cputime(count);
+		delta_scaled = cputime_to_scaled(count);
+		account_user_time(p, delta_cputime, delta_scaled);
+	}
+}
+
 /*
  * Account guest cpu time to a process.
  * @p: the process that the cpu time gets accounted to
@@ -3922,6 +3934,17 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	__account_system_time(p, cputime, cputime_scaled, target_cputime64);
 }
 
+void account_system_jiffies(struct task_struct *p, unsigned long count)
+{
+	cputime_t delta_cputime, delta_scaled;
+
+	if (count) {
+		delta_cputime = jiffies_to_cputime(count);
+		delta_scaled = cputime_to_scaled(count);
+		account_system_time(p, 0, delta_cputime, delta_scaled);
+	}
+}
+
 /*
  * Account for involuntary wait time.
  * @cputime: the cpu time spent in involuntary wait
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9e450d8..c3a8f26 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -764,6 +764,8 @@ int tick_nohz_adaptive_mode(void)
 
 static void tick_nohz_cpuset_stop_tick(int user)
 {
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
 	if (!cpuset_adaptive_nohz() || tick_nohz_adaptive_mode())
 		return;
 
@@ -771,6 +773,13 @@ static void tick_nohz_cpuset_stop_tick(int user)
 		__get_cpu_var(task_nohz_mode) = 1;
 		/* Nohz mode must be visible to wake_up_nohz_cpu() */
 		smp_wmb();
+
+		WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_NONE);
+		ts->saved_jiffies = jiffies;
+		if (user)
+			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
+		else
+			ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
 	}
 }
 
@@ -792,6 +801,36 @@ static void tick_do_timer_check_handler(int cpu)
 	}
 }
 
+bool tick_nohz_account_tick(void)
+{
+	struct tick_sched *ts;
+	unsigned long delta_jiffies;
+
+	if (!tick_nohz_adaptive_mode())
+		return false;
+
+	ts = &__get_cpu_var(tick_cpu_sched);
+
+	delta_jiffies = jiffies - ts->saved_jiffies;
+	if (ts->saved_jiffies_whence == JIFFIES_SAVED_SYS)
+		account_system_jiffies(current, delta_jiffies);
+	else
+		account_user_jiffies(current, delta_jiffies);
+
+	ts->saved_jiffies = jiffies;
+
+	return true;
+}
+
+void tick_nohz_flush_current_times(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	tick_nohz_account_tick();
+
+	ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
+}
+
 #else
 
 static void tick_nohz_cpuset_stop_tick(int user) { }
diff --git a/kernel/timer.c b/kernel/timer.c
index 8cdbd48..db984ff 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1288,8 +1288,10 @@ void update_process_times(int user_tick)
 	struct task_struct *p = current;
 	int cpu = smp_processor_id();
 
-	/* Note: this timer irq context must be accounted for as well. */
-	account_process_tick(p, user_tick);
+	if (!tick_nohz_account_tick()) {
+		/* Note: this timer irq context must be accounted for as well. */
+		account_process_tick(p, user_tick);
+	}
 	run_local_timers();
 	rcu_check_callbacks(cpu, user_tick);
 	printk_tick();
-- 
1.7.5.4



* [PATCH 24/32] nohz/cpuset: Handle kernel entry/exit to account cputime
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (22 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 23/32] nohz/cpuset: Account user and system times in adaptive nohz mode Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-16 20:38   ` Paul E. McKenney
  2011-08-15 15:52 ` [PATCH 25/32] nohz/cpuset: New API to flush cputimes on nohz cpusets Frederic Weisbecker
                   ` (9 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Provide a few APIs that archs can call to signal that they are
entering or exiting the kernel, so that when we are in nohz
adaptive mode we know precisely where to account the cputime.

The new APIs are:

- tick_nohz_enter_kernel() (called when we enter a syscall)
- tick_nohz_exit_kernel() (called when we exit a syscall)
- tick_nohz_enter_exception() (called when we enter any
  exception, trap, faults...but not irqs)
- tick_nohz_exit_exception() (called when we exit any exception)

Hooks into syscalls are typically driven by the TIF_NOHZ thread
flag.

In addition, we use the value returned by user_mode(regs) from
the timer interrupt to know where we are. However, while
user_mode(regs) != 0 reliably tells us we are in userspace,
user_mode(regs) == 0 does not reliably tell us we are in the
kernel.

Consider the following scenario: we stop the tick after syscall
return, so we set TIF_NOHZ, but the syscall exit hook is already
behind us. If we haven't yet returned to userspace, then we have
user_mode(regs) == 0. If on top of that we assume we are in
system mode, and later we issue a syscall but restart the tick
right before reaching the syscall entry hook, then we have no clue
that the whole elapsed cputime was spent in userspace rather than
in the kernel.

The only way to fix this is to start entering nohz mode only once
we know we are in userspace for the first time, such as when we
reach the kernel exit hook or when a timer tick fires with
user_mode(regs) == 1. Kernel threads don't have this worry.

This sucks but for now I have no better solution. Let's hope we
can find better.
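The user/system boundary bookkeeping described above can be modeled with a small state machine (a toy sketch of the JIFFIES_SAVED_* logic; names and types are illustrative, not the kernel's):

```c
/*
 * Jiffies elapsed since the last boundary crossing are user time when we
 * enter the kernel, and system time when we exit back to userspace.
 */
enum model_whence { MODEL_SAVED_NONE, MODEL_SAVED_USER, MODEL_SAVED_SYS };

struct acct_model {
	enum model_whence whence;
	unsigned long saved;		/* jiffies at the last crossing */
	unsigned long user_jiffies;
	unsigned long sys_jiffies;
};

/* Mirrors tick_nohz_enter_kernel(): flush the pending user time. */
static void model_enter_kernel(struct acct_model *m, unsigned long now)
{
	if (m->whence == MODEL_SAVED_USER)
		m->user_jiffies += now - m->saved;
	m->saved = now;
	m->whence = MODEL_SAVED_SYS;
}

/* Mirrors tick_nohz_exit_kernel(): flush the pending system time. */
static void model_exit_kernel(struct acct_model *m, unsigned long now)
{
	if (m->whence == MODEL_SAVED_SYS)
		m->sys_jiffies += now - m->saved;
	m->saved = now;
	m->whence = MODEL_SAVED_USER;
}
```

Starting from MODEL_SAVED_NONE instead of MODEL_SAVED_USER models exactly the ambiguity the message describes: nothing can be accounted until a first known userspace crossing sets the state.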

TODO: wrap operation on jiffies?

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 include/linux/tick.h     |    8 +++
 kernel/sched.c           |    1 +
 kernel/time/tick-sched.c |  114 ++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 109 insertions(+), 14 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index ea6dfb7..3ad649f 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -139,10 +139,18 @@ extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 #ifdef CONFIG_CPUSETS_NO_HZ
 DECLARE_PER_CPU(int, task_nohz_mode);
 
+extern void tick_nohz_enter_kernel(void);
+extern void tick_nohz_exit_kernel(void);
+extern void tick_nohz_enter_exception(struct pt_regs *regs);
+extern void tick_nohz_exit_exception(struct pt_regs *regs);
 extern int tick_nohz_adaptive_mode(void);
 extern bool tick_nohz_account_tick(void);
 extern void tick_nohz_flush_current_times(void);
 #else /* !CPUSETS_NO_HZ */
+static inline void tick_nohz_enter_kernel(void) { }
+static inline void tick_nohz_exit_kernel(void) { }
+static inline void tick_nohz_enter_exception(struct pt_regs *regs) { }
+static inline void tick_nohz_exit_exception(struct pt_regs *regs) { }
 static inline int tick_nohz_adaptive_mode(void) { return 0; }
 static inline bool tick_nohz_account_tick(void) { return false; }
 #endif /* CPUSETS_NO_HZ */
diff --git a/kernel/sched.c b/kernel/sched.c
index a58f993..c49c1b1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2503,6 +2503,7 @@ static void cpuset_nohz_restart_tick(void)
 	tick_nohz_flush_current_times();
 	__get_cpu_var(task_nohz_mode) = 0;
 	tick_nohz_restart_sched_tick();
+	clear_thread_flag(TIF_NOHZ);
 }
 
 void cpuset_update_nohz(void)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c3a8f26..d8f01b8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -595,8 +595,9 @@ void tick_nohz_irq_exit(void)
 	if (ts->inidle && !need_resched())
 		__tick_nohz_enter_idle(ts, cpu);
 	else if (tick_nohz_adaptive_mode() && !idle_cpu(cpu)) {
-		if (tick_nohz_can_stop_tick(cpu, ts))
-			tick_nohz_stop_sched_tick(ktime_get(), cpu, ts);
+		if (ts->saved_jiffies_whence != JIFFIES_SAVED_NONE
+		    && tick_nohz_can_stop_tick(cpu, ts))
+				tick_nohz_stop_sched_tick(ktime_get(), cpu, ts);
 	}
 }
 
@@ -757,6 +758,74 @@ void tick_check_idle(int cpu)
 
 #ifdef CONFIG_CPUSETS_NO_HZ
 
+void tick_nohz_exit_kernel(void)
+{
+	unsigned long flags;
+	struct tick_sched *ts;
+	unsigned long delta_jiffies;
+
+	local_irq_save(flags);
+
+	if (!tick_nohz_adaptive_mode()) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	ts = &__get_cpu_var(tick_cpu_sched);
+
+	WARN_ON_ONCE(ts->saved_jiffies_whence == JIFFIES_SAVED_USER);
+
+	if (ts->saved_jiffies_whence == JIFFIES_SAVED_SYS) {
+		delta_jiffies = jiffies - ts->saved_jiffies;
+		account_system_jiffies(current, delta_jiffies);
+	}
+
+	ts->saved_jiffies = jiffies;
+	ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
+
+	local_irq_restore(flags);
+}
+
+void tick_nohz_enter_kernel(void)
+{
+	unsigned long flags;
+	struct tick_sched *ts;
+	unsigned long delta_jiffies;
+
+	local_irq_save(flags);
+
+	if (!tick_nohz_adaptive_mode()) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	ts = &__get_cpu_var(tick_cpu_sched);
+
+	WARN_ON_ONCE(ts->saved_jiffies_whence == JIFFIES_SAVED_SYS);
+
+	if (ts->saved_jiffies_whence == JIFFIES_SAVED_USER) {
+		delta_jiffies = jiffies - ts->saved_jiffies;
+		account_user_jiffies(current, delta_jiffies);
+	}
+
+	ts->saved_jiffies = jiffies;
+	ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
+
+	local_irq_restore(flags);
+}
+
+void tick_nohz_enter_exception(struct pt_regs *regs)
+{
+	if (user_mode(regs))
+		tick_nohz_enter_kernel();
+}
+
+void tick_nohz_exit_exception(struct pt_regs *regs)
+{
+	if (user_mode(regs))
+		tick_nohz_exit_kernel();
+}
+
 int tick_nohz_adaptive_mode(void)
 {
 	return __get_cpu_var(task_nohz_mode);
@@ -766,20 +835,33 @@ static void tick_nohz_cpuset_stop_tick(int user)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 
-	if (!cpuset_adaptive_nohz() || tick_nohz_adaptive_mode())
+	if (!cpuset_adaptive_nohz())
 		return;
 
+	if (tick_nohz_adaptive_mode()) {
+		if (user && ts->saved_jiffies_whence == JIFFIES_SAVED_NONE) {
+			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
+			ts->saved_jiffies = jiffies;
+		}
+
+		return;
+	}
+
 	if (cpuset_nohz_can_stop_tick()) {
 		__get_cpu_var(task_nohz_mode) = 1;
 		/* Nohz mode must be visible to wake_up_nohz_cpu() */
 		smp_wmb();
 
+		set_thread_flag(TIF_NOHZ);
 		WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_NONE);
-		ts->saved_jiffies = jiffies;
-		if (user)
+
+		if (user) {
 			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
-		else
+			ts->saved_jiffies = jiffies;
+		} else if (!current->mm) {
 			ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
+			ts->saved_jiffies = jiffies;
+		}
 	}
 }
 
@@ -803,7 +885,7 @@ static void tick_do_timer_check_handler(int cpu)
 
 bool tick_nohz_account_tick(void)
 {
-	struct tick_sched *ts;
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 	unsigned long delta_jiffies;
 
 	if (!tick_nohz_adaptive_mode())
@@ -811,11 +893,15 @@ bool tick_nohz_account_tick(void)
 
 	ts = &__get_cpu_var(tick_cpu_sched);
 
+	if (ts->saved_jiffies_whence == JIFFIES_SAVED_NONE)
+		return false;
+
 	delta_jiffies = jiffies - ts->saved_jiffies;
-	if (ts->saved_jiffies_whence == JIFFIES_SAVED_SYS)
-		account_system_jiffies(current, delta_jiffies);
-	else
+
+	if (ts->saved_jiffies_whence == JIFFIES_SAVED_USER)
 		account_user_jiffies(current, delta_jiffies);
+	else
+		account_system_jiffies(current, delta_jiffies);
 
 	ts->saved_jiffies = jiffies;
 
@@ -825,12 +911,12 @@ bool tick_nohz_account_tick(void)
 void tick_nohz_flush_current_times(void)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+	unsigned long delta_jiffies;
+	struct pt_regs *regs;
 
-	tick_nohz_account_tick();
-
-	ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
+	if (tick_nohz_account_tick())
+		ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
 }
-
 #else
 
 static void tick_nohz_cpuset_stop_tick(int user) { }
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 25/32] nohz/cpuset: New API to flush cputimes on nohz cpusets
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (23 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 24/32] nohz/cpuset: Handle kernel entry/exit to account cputime Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-15 15:52 ` [PATCH 26/32] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader Frederic Weisbecker
                   ` (8 subsequent siblings)
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Provide a new API that sends an IPI to every CPU included
in a nohz cpuset in order to flush its cputimes. This is
going to be useful for anyone who wants to see accurate
cputimes on a nohz cpuset.
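
The bookkeeping that flush_cputime_interrupt() triggers on each CPU can
be sketched as a small user-space model (a hedged illustration: the enum,
field names and the restart_tick semantics mirror the patch, but none of
this is the kernel implementation):

```c
#include <assert.h>

/* Illustrative model of the tick_sched fields the patch uses. */
enum jiffies_saved { JIFFIES_SAVED_NONE, JIFFIES_SAVED_USER, JIFFIES_SAVED_SYS };

struct tick_model {
	unsigned long saved_jiffies;	/* jiffies at the last snapshot */
	enum jiffies_saved whence;	/* what the elapsed time counts as */
	unsigned long utime, stime;	/* accumulated user/system jiffies */
};

/* Model of tick_nohz_flush_current_times(restart_tick): account the
 * jiffies elapsed since the snapshot, and forget the snapshot state
 * only when the tick is being restarted.  The IPI path passes
 * restart_tick = false, so the CPU keeps accounting after the flush. */
static void flush_current_times(struct tick_model *ts, unsigned long now,
				int restart_tick)
{
	if (ts->whence == JIFFIES_SAVED_NONE)
		return;
	if (ts->whence == JIFFIES_SAVED_USER)
		ts->utime += now - ts->saved_jiffies;
	else
		ts->stime += now - ts->saved_jiffies;
	ts->saved_jiffies = now;
	if (restart_tick)
		ts->whence = JIFFIES_SAVED_NONE;
}
```

This is why the IPI handler can run repeatedly without losing time: each
flush re-snapshots jiffies while leaving the accounting mode in place.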

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 include/linux/cpuset.h   |    2 ++
 include/linux/tick.h     |    2 +-
 kernel/cpuset.c          |   34 +++++++++++++++++++++++++++++++++-
 kernel/sched.c           |    2 +-
 kernel/time/tick-sched.c |    4 ++--
 5 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 7f9d78d..569da83 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -265,9 +265,11 @@ static inline bool cpuset_adaptive_nohz(void)
 }
 
 extern void cpuset_update_nohz(void);
+extern void cpuset_nohz_flush_cputimes(void);
 extern void cpuset_exit_nohz_interrupt(void *unused);
 #else
 static inline void cpuset_update_nohz(void) { }
+static inline void cpuset_nohz_flush_cputimes(void) { }
 
 #endif /* CONFIG_CPUSETS_NO_HZ */
 
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 3ad649f..9d0270e 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -145,7 +145,7 @@ extern void tick_nohz_enter_exception(struct pt_regs *regs);
 extern void tick_nohz_exit_exception(struct pt_regs *regs);
 extern int tick_nohz_adaptive_mode(void);
 extern bool tick_nohz_account_tick(void);
-extern void tick_nohz_flush_current_times(void);
+extern void tick_nohz_flush_current_times(bool restart_tick);
 #else /* !CPUSETS_NO_HZ */
 static inline void tick_nohz_enter_kernel(void) { }
 static inline void tick_nohz_exit_kernel(void) { }
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index ee3b0d0..61c3f96 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -59,6 +59,7 @@
 #include <linux/mutex.h>
 #include <linux/workqueue.h>
 #include <linux/cgroup.h>
+#include <linux/tick.h>
 
 /*
  * Workqueue for cpuset related tasks.
@@ -1199,6 +1200,23 @@ static void cpuset_change_flag(struct task_struct *tsk,
 
 DEFINE_PER_CPU(int, cpu_adaptive_nohz_ref);
 
+static cpumask_t nohz_cpuset_mask;
+
+static void flush_cputime_interrupt(void *unused)
+{
+	tick_nohz_flush_current_times(false);
+}
+
+void cpuset_nohz_flush_cputimes(void)
+{
+	preempt_disable();
+	smp_call_function_many(&nohz_cpuset_mask, flush_cputime_interrupt,
+			       NULL, true);
+	preempt_enable();
+	/* Make the utime/stime updates visible */
+	smp_mb();
+}
+
 static void cpu_exit_nohz(int cpu)
 {
 	preempt_disable();
@@ -1223,7 +1241,15 @@ static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
 
 		val = per_cpu(cpu_adaptive_nohz_ref, cpu);
 
-		if (!val) {
+		if (val == 1) {
+			cpumask_set_cpu(cpu, &nohz_cpuset_mask);
+			/*
+			 * The mask update needs to be visible right away
+			 * so that this CPU is part of the cputime IPI
+			 * update right now.
+			 */
+			 smp_mb();
+		} else if (!val) {
 			/*
 			 * The update to cpu_adaptive_nohz_ref must be
 			 * visible right away. So that once we restart the tick
@@ -1232,6 +1258,12 @@ static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
 			 */
 			smp_mb();
 			cpu_exit_nohz(cpu);
+			/*
+			 * Now that the tick has been restarted and cputimes
+			 * flushed, we don't need anymore to be part of the
+			 * cputime flush IPI.
+			 */
+			cpumask_clear_cpu(cpu, &nohz_cpuset_mask);
 		}
 	}
 }
diff --git a/kernel/sched.c b/kernel/sched.c
index c49c1b1..2bcd456 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2500,7 +2500,7 @@ bool cpuset_nohz_can_stop_tick(void)
 
 static void cpuset_nohz_restart_tick(void)
 {
-	tick_nohz_flush_current_times();
+	tick_nohz_flush_current_times(true);
 	__get_cpu_var(task_nohz_mode) = 0;
 	tick_nohz_restart_sched_tick();
 	clear_thread_flag(TIF_NOHZ);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index d8f01b8..9a2ba5b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -908,13 +908,13 @@ bool tick_nohz_account_tick(void)
 	return true;
 }
 
-void tick_nohz_flush_current_times(void)
+void tick_nohz_flush_current_times(bool restart_tick)
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 	unsigned long delta_jiffies;
 	struct pt_regs *regs;
 
-	if (tick_nohz_account_tick())
+	if (tick_nohz_account_tick() && restart_tick)
 		ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
 }
 #else
-- 
1.7.5.4



* [PATCH 26/32] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (24 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 25/32] nohz/cpuset: New API to flush cputimes on nohz cpusets Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-15 15:52 ` [PATCH 27/32] nohz/cpuset: Flush cputimes on procfs stat file read Frederic Weisbecker
                   ` (7 subsequent siblings)
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

When we wait for a zombie task, flush the cputimes on nohz
cpusets in case we are waiting for a group leader that has
threads running on nohz CPUs. This way thread_group_times()
doesn't report stale values.

<doubts>
If I understood the code correctly, by the time we call
thread_group_times() we may have children that are still running,
so this is necessary. But I need to check more deeply.
</doubts>

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/exit.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index f2b321b..43fb9ac 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -51,6 +51,7 @@
 #include <trace/events/sched.h>
 #include <linux/hw_breakpoint.h>
 #include <linux/oom.h>
+#include <linux/cpuset.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -1262,6 +1263,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		 * group, which consolidates times for all threads in the
 		 * group including the group leader.
 		 */
+		cpuset_nohz_flush_cputimes();
 		thread_group_times(p, &tgutime, &tgstime);
 		spin_lock_irq(&p->real_parent->sighand->siglock);
 		psig = p->real_parent->signal;
-- 
1.7.5.4



* [PATCH 27/32] nohz/cpuset: Flush cputimes on procfs stat file read
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (25 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 26/32] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-15 15:52 ` [PATCH 28/32] nohz/cpuset: Flush cputimes for getrusage() and times() syscalls Frederic Weisbecker
                   ` (6 subsequent siblings)
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

When we read a process's procfs stat file, we need
to flush the cputimes of the tasks running in nohz
cpusets in case some children in the thread group are
running there.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 fs/proc/array.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 9b45ee8..20e1c75 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -397,6 +397,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	cutime = cstime = utime = stime = cputime_zero;
 	cgtime = gtime = cputime_zero;
 
+	/* For thread group times */
+	cpuset_nohz_flush_cputimes();
 	if (lock_task_sighand(task, &flags)) {
 		struct signal_struct *sig = task->signal;
 
-- 
1.7.5.4



* [PATCH 28/32] nohz/cpuset: Flush cputimes for getrusage() and times() syscalls
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (26 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 27/32] nohz/cpuset: Flush cputimes on procfs stat file read Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-15 15:52 ` [PATCH 29/32] x86: Syscall hooks for nohz cpusets Frederic Weisbecker
                   ` (5 subsequent siblings)
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Both syscalls need to iterate through the thread group to get
the cputimes. As some threads of the group may be running on a
nohz cpuset, we need to flush the cputimes there.
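
For reference, a minimal user-space caller that exercises this path
through the times() syscall (plain POSIX code, nothing here is specific
to the patch; after the patch, do_sys_times() flushes nohz cputimes
before summing the thread group's times):

```c
#include <assert.h>
#include <sys/times.h>

/* Returns 0 on success and fills *ut/*st with the calling thread
 * group's user/system time, in clock ticks (see sysconf(_SC_CLK_TCK)). */
static int read_self_times(long *ut, long *st)
{
	struct tms t;

	/* times() lands in do_sys_times(), which walks the thread group
	 * via thread_group_times(). */
	if (times(&t) == (clock_t)-1)
		return -1;
	*ut = (long)t.tms_utime;
	*st = (long)t.tms_stime;
	return 0;
}
```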

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 kernel/sys.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index e4128b2..f6de4b1 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -43,6 +43,7 @@
 #include <linux/syscalls.h>
 #include <linux/kprobes.h>
 #include <linux/user_namespace.h>
+#include <linux/cpuset.h>
 
 #include <linux/kmsg_dump.h>
 
@@ -908,6 +909,8 @@ void do_sys_times(struct tms *tms)
 {
 	cputime_t tgutime, tgstime, cutime, cstime;
 
+	cpuset_nohz_flush_cputimes();
+
 	spin_lock_irq(&current->sighand->siglock);
 	thread_group_times(current, &tgutime, &tgstime);
 	cutime = current->signal->cutime;
@@ -1536,6 +1539,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
 		goto out;
 	}
 
+	/* For thread_group_times */
+	cpuset_nohz_flush_cputimes();
+
 	if (!lock_task_sighand(p, &flags))
 		return;
 
-- 
1.7.5.4



* [PATCH 29/32] x86: Syscall hooks for nohz cpusets
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (27 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 28/32] nohz/cpuset: Flush cputimes for getrusage() and times() syscalls Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-15 15:52 ` [PATCH 30/32] x86: Exception " Frederic Weisbecker
                   ` (4 subsequent siblings)
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Add syscall hooks to notify syscall entry and exit on
CPUs running in adaptive nohz mode.
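
The reason adding _TIF_NOHZ to the work masks is enough to get the hooks
called can be sketched as below: the syscall entry assembly compares the
thread_info flags against the mask and branches to syscall_trace_enter()
when any bit matches. The bit positions follow the patch; the helper
function is illustrative, not kernel code:

```c
#include <assert.h>

/* Subset of the x86 thread_info flags involved here. */
#define TIF_SYSCALL_TRACE	0	/* syscall trace active */
#define TIF_NOHZ		19	/* in nohz userspace mode */

#define _TIF_SYSCALL_TRACE	(1u << TIF_SYSCALL_TRACE)
#define _TIF_NOHZ		(1u << TIF_NOHZ)

/* Simplified work mask: any set bit forces the slow syscall path. */
#define _TIF_WORK_SYSCALL_ENTRY	(_TIF_SYSCALL_TRACE | _TIF_NOHZ)

/* Models the test the entry assembly performs on thread_info flags. */
static int syscall_needs_slow_path(unsigned int ti_flags)
{
	return (ti_flags & _TIF_WORK_SYSCALL_ENTRY) != 0;
}
```

So once tick_nohz_cpuset_stop_tick() sets TIF_NOHZ, every subsequent
syscall goes through syscall_trace_enter()/syscall_trace_leave(), where
tick_nohz_enter_kernel()/tick_nohz_exit_kernel() run.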

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 arch/x86/include/asm/thread_info.h |   10 +++++++---
 arch/x86/kernel/ptrace.c           |   10 ++++++++++
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 1f2e61e..0e3329f 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -87,6 +87,7 @@ struct thread_info {
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* 32bit process */
 #define TIF_FORK		18	/* ret_from_fork */
+#define TIF_NOHZ		19	/* in nohz userspace mode */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_DEBUG		21	/* uses debug registers */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
@@ -110,6 +111,7 @@ struct thread_info {
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
+#define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_DEBUG		(1 << TIF_DEBUG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
 #define _TIF_FREEZE		(1 << TIF_FREEZE)
@@ -121,12 +123,13 @@ struct thread_info {
 /* work to do in syscall_trace_enter() */
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
-	 _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)
+	 _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT |	\
+	 _TIF_NOHZ)
 
 /* work to do in syscall_trace_leave() */
 #define _TIF_WORK_SYSCALL_EXIT	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP |	\
-	 _TIF_SYSCALL_TRACEPOINT)
+	 _TIF_SYSCALL_TRACEPOINT | _TIF_NOHZ)
 
 /* work to do on interrupt/exception return */
 #define _TIF_WORK_MASK							\
@@ -136,7 +139,8 @@ struct thread_info {
 
 /* work to do on any return to user space */
 #define _TIF_ALLWORK_MASK						\
-	((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT)
+	((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT |	\
+	_TIF_NOHZ)
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 6e619f2..486a65c 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -21,6 +21,7 @@
 #include <linux/signal.h>
 #include <linux/perf_event.h>
 #include <linux/hw_breakpoint.h>
+#include <linux/tick.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -1385,6 +1386,9 @@ long syscall_trace_enter(struct pt_regs *regs)
 {
 	long ret = 0;
 
+	/* Notify nohz task syscall early so the rest can use rcu */
+	tick_nohz_enter_kernel();
+
 	/*
 	 * If we stepped into a sysenter/syscall insn, it trapped in
 	 * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
@@ -1446,4 +1450,10 @@ void syscall_trace_leave(struct pt_regs *regs)
 			!test_thread_flag(TIF_SYSCALL_EMU);
 	if (step || test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall_exit(regs, step);
+
+	/*
+	 * Notify nohz task exit syscall at last so the rest can
+	 * use rcu.
+	 */
+	tick_nohz_exit_kernel();
 }
-- 
1.7.5.4



* [PATCH 30/32] x86: Exception hooks for nohz cpusets
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (28 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 29/32] x86: Syscall hooks for nohz cpusets Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-15 15:52 ` [PATCH 31/32] rcu: Switch to extended quiescent state in userspace from nohz cpuset Frederic Weisbecker
                   ` (3 subsequent siblings)
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

Add the necessary hooks to x86 exceptions for nohz cpusets
support. This includes traps, page faults, debug exceptions,
etc...

TODO: handle do_notify_resume(). It is not an exception, but
it is executed at the end of the syscall/irq return path, so
it is beyond our hooks and yet shouldn't be considered
userspace code.
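
The wrapping pattern applied to each handler can be modeled as follows
(hypothetical names; the point is that the enter/exit pair only fires
when the exception interrupted user mode, so faults taken from kernel
mode leave the user/system bookkeeping alone):

```c
#include <assert.h>

/* Minimal stand-in for pt_regs: we only care about user_mode(regs). */
struct regs_model { int user_mode; };

static int depth;	/* models "currently accounting system time" */

static void nohz_enter_exception(const struct regs_model *regs)
{
	if (regs->user_mode)
		depth++;	/* start accounting system time */
}

static void nohz_exit_exception(const struct regs_model *regs)
{
	if (regs->user_mode)
		depth--;	/* resume accounting user time */
}

/* The shape the patch gives do_page_fault(): bracket the real handler
 * with the enter/exit pair so both hooks see the same regs. */
static void wrapped_handler(const struct regs_model *regs,
			    void (*handler)(const struct regs_model *))
{
	nohz_enter_exception(regs);
	handler(regs);
	nohz_exit_exception(regs);
}
```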

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 arch/x86/Kconfig        |    1 +
 arch/x86/kernel/traps.c |   22 +++++++++++++++-------
 arch/x86/mm/fault.c     |   13 +++++++++++--
 3 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67979f4..1b12b1a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -69,6 +69,7 @@ config X86
 	select IRQ_FORCED_THREADING
 	select USE_GENERIC_SMP_HELPERS if SMP
 	select HAVE_BPF_JIT if (X86_64 && NET)
+	select HAVE_CPUSETS_NO_HZ
 
 config INSTRUCTION_DECODER
 	def_bool (KPROBES || PERF_EVENTS)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b9b6716..d98f2f4 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/timer.h>
 #include <linux/init.h>
+#include <linux/tick.h>
 #include <linux/bug.h>
 #include <linux/nmi.h>
 #include <linux/mm.h>
@@ -456,24 +457,28 @@ void restart_nmi(void)
 /* May run on IST stack. */
 dotraplinkage void __kprobes do_int3(struct pt_regs *regs, long error_code)
 {
+	tick_nohz_enter_exception(regs);
+
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
 			== NOTIFY_STOP)
-		return;
+		goto exit;
 #endif /* CONFIG_KGDB_LOW_LEVEL_TRAP */
 #ifdef CONFIG_KPROBES
 	if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
 			== NOTIFY_STOP)
-		return;
+		goto exit;
 #else
 	if (notify_die(DIE_TRAP, "int3", regs, error_code, 3, SIGTRAP)
 			== NOTIFY_STOP)
-		return;
+		goto exit;
 #endif
 
 	preempt_conditional_sti(regs);
 	do_trap(3, SIGTRAP, "int3", regs, error_code, NULL);
 	preempt_conditional_cli(regs);
+exit:
+	tick_nohz_exit_exception(regs);
 }
 
 #ifdef CONFIG_X86_64
@@ -534,6 +539,8 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 	unsigned long dr6;
 	int si_code;
 
+	tick_nohz_enter_exception(regs);
+
 	get_debugreg(dr6, 6);
 
 	/* Filter out all the reserved bits which are preset to 1 */
@@ -549,7 +556,7 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 
 	/* Catch kmemcheck conditions first of all! */
 	if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
-		return;
+		goto exit;
 
 	/* DR6 may or may not be cleared by the CPU */
 	set_debugreg(0, 6);
@@ -564,7 +571,7 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 
 	if (notify_die(DIE_DEBUG, "debug", regs, PTR_ERR(&dr6), error_code,
 							SIGTRAP) == NOTIFY_STOP)
-		return;
+		goto exit;
 
 	/* It's safe to allow irq's after DR6 has been saved */
 	preempt_conditional_sti(regs);
@@ -573,7 +580,7 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 		handle_vm86_trap((struct kernel_vm86_regs *) regs,
 				error_code, 1);
 		preempt_conditional_cli(regs);
-		return;
+		goto exit;
 	}
 
 	/*
@@ -593,7 +600,8 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
 		send_sigtrap(tsk, regs, error_code, si_code);
 	preempt_conditional_cli(regs);
 
-	return;
+exit:
+	tick_nohz_exit_exception(regs);
 }
 
 /*
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 4d09df0..aa2e2e3 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -13,6 +13,7 @@
 #include <linux/perf_event.h>		/* perf_sw_event		*/
 #include <linux/hugetlb.h>		/* hstate_index_to_shift	*/
 #include <linux/prefetch.h>		/* prefetchw			*/
+#include <linux/tick.h>
 
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
@@ -971,8 +972,8 @@ static int fault_in_kernel_space(unsigned long address)
  * and the problem, and then passes it off to one of the appropriate
  * routines.
  */
-dotraplinkage void __kprobes
-do_page_fault(struct pt_regs *regs, unsigned long error_code)
+static void __kprobes
+__do_page_fault(struct pt_regs *regs, unsigned long error_code)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -1180,3 +1181,11 @@ good_area:
 
 	up_read(&mm->mmap_sem);
 }
+
+dotraplinkage void __kprobes
+do_page_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	tick_nohz_enter_exception(regs);
+	__do_page_fault(regs, error_code);
+	tick_nohz_exit_exception(regs);
+}
-- 
1.7.5.4



* [PATCH 31/32] rcu: Switch to extended quiescent state in userspace from nohz cpuset
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (29 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 30/32] x86: Exception " Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-16 20:44   ` Paul E. McKenney
  2011-08-15 15:52 ` [PATCH 32/32] nohz/cpuset: Disable under some configs Frederic Weisbecker
                   ` (2 subsequent siblings)
  33 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

When we switch to adaptive nohz mode and run in userspace,
we can still receive IPIs from the RCU core if a grace period
has been started by another CPU, because we need to take part
in its completion.

However running in userspace is similar to running idle
because we don't make use of RCU there, thus we can be
considered to be in an RCU extended quiescent state. The
benefit of running in that mode is that we are no longer
disturbed by needless IPIs coming from the RCU core.

To perform this, we just need to use the RCU extended
quiescent state APIs at the following points:

- kernel exit or tick stop in userspace: here we switch to extended
quiescent state because we run in userspace without the tick.

- kernel entry or tick restart: here we exit the extended quiescent
state because either we enter the kernel and may use RCU read side
critical sections anytime, or we need the timer tick for some reason,
and that takes care of RCU grace periods in the traditional way.

TODO: hook into do_notify_resume(), because we may have called
rcu_enter_nohz() from the syscall exit hook but might call
do_notify_resume() right after, which may use RCU.
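
The balancing that the per-CPU nohz_task_ext_qs flag provides can be
sketched in user space (a hedged model: the counter stands in for the
real RCU dynticks state, and all names are illustrative):

```c
#include <assert.h>

static int ext_qs;		/* models per-CPU nohz_task_ext_qs */
static int rcu_watching = 1;	/* 1: RCU tracks this CPU, 0: extended QS */

/* Models tick_nohz_exit_kernel() / tick stop in userspace. */
static void model_enter_user_nohz(void)
{
	ext_qs = 1;
	rcu_watching = 0;	/* rcu_enter_nohz() */
}

/* Called from both tick_nohz_enter_kernel() and tick_nohz_cpu_exit_qs()
 * in the patch: the guard makes the exit idempotent, so whichever path
 * runs first wins and the second one is a no-op, keeping the
 * rcu_enter_nohz()/rcu_exit_nohz() calls balanced. */
static void model_exit_qs(void)
{
	if (ext_qs) {
		ext_qs = 0;
		rcu_watching = 1;	/* rcu_exit_nohz() */
	}
}
```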

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 include/linux/tick.h     |    2 ++
 kernel/sched.c           |    1 +
 kernel/time/tick-sched.c |   21 +++++++++++++++++++++
 3 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 9d0270e..4e7555f 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -138,12 +138,14 @@ extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 
 #ifdef CONFIG_CPUSETS_NO_HZ
 DECLARE_PER_CPU(int, task_nohz_mode);
+DECLARE_PER_CPU(int, nohz_task_ext_qs);
 
 extern void tick_nohz_enter_kernel(void);
 extern void tick_nohz_exit_kernel(void);
 extern void tick_nohz_enter_exception(struct pt_regs *regs);
 extern void tick_nohz_exit_exception(struct pt_regs *regs);
 extern int tick_nohz_adaptive_mode(void);
+extern void tick_nohz_cpu_exit_qs(void);
 extern bool tick_nohz_account_tick(void);
 extern void tick_nohz_flush_current_times(bool restart_tick);
 #else /* !CPUSETS_NO_HZ */
diff --git a/kernel/sched.c b/kernel/sched.c
index 2bcd456..576d0bf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2504,6 +2504,7 @@ static void cpuset_nohz_restart_tick(void)
 	__get_cpu_var(task_nohz_mode) = 0;
 	tick_nohz_restart_sched_tick();
 	clear_thread_flag(TIF_NOHZ);
+	tick_nohz_cpu_exit_qs();
 }
 
 void cpuset_update_nohz(void)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9a2ba5b..b611b77 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -757,6 +757,7 @@ void tick_check_idle(int cpu)
 }
 
 #ifdef CONFIG_CPUSETS_NO_HZ
+DEFINE_PER_CPU(int, nohz_task_ext_qs);
 
 void tick_nohz_exit_kernel(void)
 {
@@ -783,6 +784,9 @@ void tick_nohz_exit_kernel(void)
 	ts->saved_jiffies = jiffies;
 	ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
 
+	__get_cpu_var(nohz_task_ext_qs) = 1;
+	rcu_enter_nohz();
+
 	local_irq_restore(flags);
 }
 
@@ -799,6 +803,11 @@ void tick_nohz_enter_kernel(void)
 		return;
 	}
 
+	if (__get_cpu_var(nohz_task_ext_qs) == 1) {
+		__get_cpu_var(nohz_task_ext_qs) = 0;
+		rcu_exit_nohz();
+	}
+
 	ts = &__get_cpu_var(tick_cpu_sched);
 
 	WARN_ON_ONCE(ts->saved_jiffies_whence == JIFFIES_SAVED_SYS);
@@ -814,6 +823,16 @@ void tick_nohz_enter_kernel(void)
 	local_irq_restore(flags);
 }
 
+void tick_nohz_cpu_exit_qs(void)
+{
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+	if (__get_cpu_var(nohz_task_ext_qs)) {
+		rcu_exit_nohz();
+		__get_cpu_var(nohz_task_ext_qs) = 0;
+	}
+}
+
 void tick_nohz_enter_exception(struct pt_regs *regs)
 {
 	if (user_mode(regs))
@@ -858,6 +877,8 @@ static void tick_nohz_cpuset_stop_tick(int user)
 		if (user) {
 			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
 			ts->saved_jiffies = jiffies;
+			__get_cpu_var(nohz_task_ext_qs) = 1;
+			rcu_enter_nohz();
 		} else if (!current->mm) {
 			ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
 			ts->saved_jiffies = jiffies;
-- 
1.7.5.4



* [PATCH 32/32] nohz/cpuset: Disable under some configs
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (30 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 31/32] rcu: Switch to extended quiescent state in userspace from nohz cpuset Frederic Weisbecker
@ 2011-08-15 15:52 ` Frederic Weisbecker
  2011-08-17 16:36 ` [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Avi Kivity
  2011-08-24 14:41 ` Gilad Ben-Yossef
  33 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-15 15:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Paul Menage,
	Peter Zijlstra, Stephen Hemminger, Thomas Gleixner, Tim Pepper

This shows the various things that are not yet handled by
the nohz cpusets: perf events, irq work, irq time accounting.

But there are further things that have yet to be handled:
sched clock tick, runqueue clock, sched_class::task_tick(),
rq clock, cpu load, complete handling of cputimes, ...

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
---
 init/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 7a144ad..6b6adde 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -626,7 +626,7 @@ config PROC_PID_CPUSET
 
 config CPUSETS_NO_HZ
        bool "Tickless cpusets"
-       depends on CPUSETS && HAVE_CPUSETS_NO_HZ && NO_HZ && HIGH_RES_TIMERS
+       depends on CPUSETS && HAVE_CPUSETS_NO_HZ && NO_HZ && HIGH_RES_TIMERS && !PERF_EVENTS && !PROFILING && !IRQ_WORK && !IRQ_TIME_ACCOUNTING
        help
          This options let you apply a nohz property to a cpuset such
 	 that the periodic timer tick tries to be avoided when possible on
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* Re: [PATCH 14/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2011-08-15 15:52 ` [PATCH 14/32] nohz/cpuset: Don't turn off the tick if rcu needs it Frederic Weisbecker
@ 2011-08-16 20:13   ` Paul E. McKenney
  2011-08-17  2:10     ` Frederic Weisbecker
  2011-08-29 15:36   ` Peter Zijlstra
  1 sibling, 1 reply; 139+ messages in thread
From: Paul E. McKenney @ 2011-08-16 20:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul Menage, Peter Zijlstra, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 15, 2011 at 05:52:11PM +0200, Frederic Weisbecker wrote:
> If RCU is waiting for the current CPU to complete a grace
> period, don't turn off the tick. Unlike dynctik-idle, we

s/dynctik/dyntick/  ;-)

> are not necessarily going to enter into rcu extended quiescent
> state, so we may need to keep the tick to note current CPU's
> quiescent states.

One question below...

							Thanx, Paul

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Anton Blanchard <anton@au1.ibm.com>
> Cc: Avi Kivity <avi@redhat.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Paul Menage <menage@google.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
> ---
>  include/linux/rcupdate.h |    1 +
>  kernel/rcutree.c         |    3 +--
>  kernel/sched.c           |   14 ++++++++++++++
>  3 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 99f9aa7..55a482a 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -133,6 +133,7 @@ static inline int rcu_preempt_depth(void)
>  extern void rcu_sched_qs(int cpu);
>  extern void rcu_bh_qs(int cpu);
>  extern void rcu_check_callbacks(int cpu, int user);
> +extern int rcu_pending(int cpu);
>  struct notifier_block;
> 
>  #ifdef CONFIG_NO_HZ
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index ba06207..0009bfc 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -205,7 +205,6 @@ int rcu_cpu_stall_suppress __read_mostly;
>  module_param(rcu_cpu_stall_suppress, int, 0644);
> 
>  static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
> -static int rcu_pending(int cpu);
> 
>  /*
>   * Return the number of RCU-sched batches processed thus far for debug & stats.
> @@ -1729,7 +1728,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
>   * by the current CPU, returning 1 if so.  This function is part of the
>   * RCU implementation; it is -not- an exported member of the RCU API.
>   */
> -static int rcu_pending(int cpu)
> +int rcu_pending(int cpu)
>  {
>  	return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
>  	       __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 0e1aa4e..353a66f 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2439,6 +2439,7 @@ DEFINE_PER_CPU(int, task_nohz_mode);
>  bool cpuset_nohz_can_stop_tick(void)
>  {
>  	struct rq *rq;
> +	int cpu;
> 
>  	rq = this_rq();
> 
> @@ -2446,6 +2447,19 @@ bool cpuset_nohz_can_stop_tick(void)
>  	if (rq->nr_running > 1)
>  		return false;
> 
> +	cpu = smp_processor_id();
> +
> +	/*
> +	 * FIXME: will probably be removed soon as it's
> +	 * already checked from tick_nohz_stop_sched_tick()
> +	 */
> +	if (rcu_needs_cpu(cpu))
> +		return false;
> +
> +	/* Is there a grace period to complete ? */
> +	if (rcu_pending(cpu))

This is from a quiescent state for both RCU and RCU-bh, right?
Or can there be RCU or RCU-bh read-side critical sections held
across here?  (It would be mildly bad if so.)

But force_quiescent_state() will catch cases where RCU needs
quiescent states from CPUs, so is this check really needed?

> +		return false;
> +
>  	return true;
>  }
> 
> -- 
> 1.7.5.4
> 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 22/32] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  2011-08-15 15:52 ` [PATCH 22/32] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU Frederic Weisbecker
@ 2011-08-16 20:20   ` Paul E. McKenney
  2011-08-17  2:18     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Paul E. McKenney @ 2011-08-16 20:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul Menage, Peter Zijlstra, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 15, 2011 at 05:52:19PM +0200, Frederic Weisbecker wrote:
> If we enqueue an rcu callback, we need the CPU tick to stay
> alive until we take care of those by completing the appropriate
> grace period.
> 
> Thus, when we call_rcu(), send a self IPI that checks rcu_needs_cpu()
> so that we restore a periodic tick behaviour that can take care of
> everything.

One question below.

							Thanx, Paul

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Anton Blanchard <anton@au1.ibm.com>
> Cc: Avi Kivity <avi@redhat.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Paul Menage <menage@google.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
> ---
>  kernel/rcutree.c |    8 ++++++++
>  1 files changed, 8 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index d496c70..b5643ce2 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -51,6 +51,7 @@
>  #include <linux/kthread.h>
>  #include <linux/prefetch.h>
>  #include <linux/cpuset.h>
> +#include <linux/tick.h>
> 
>  #include "rcutree.h"
> 
> @@ -1546,6 +1547,13 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
>  	rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
>  	rdp->qlen++;
> 
> +	/* Restart the timer if needed to handle the callbacks */
> +	if (tick_nohz_adaptive_mode()) {
> +		/* Make updates on nxtlist visible to self IPI */
> +		barrier();
> +		smp_cpuset_update_nohz(smp_processor_id());
> +	}
> +

But this must be happening in a system call or interrupt handler, right?
If so, won't we get a chance to check things on exit from the system call
or interrupt?  Or are you hooking only into syscall entry?

>  	/* If interrupts were disabled, don't dive into RCU core. */
>  	if (irqs_disabled_flags(flags)) {
>  		local_irq_restore(flags);
> -- 
> 1.7.5.4
> 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 24/32] nohz/cpuset: Handle kernel entry/exit to account cputime
  2011-08-15 15:52 ` [PATCH 24/32] nohz/cpuset: Handle kernel entry/exit to account cputime Frederic Weisbecker
@ 2011-08-16 20:38   ` Paul E. McKenney
  2011-08-17  2:30     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Paul E. McKenney @ 2011-08-16 20:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul Menage, Peter Zijlstra, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 15, 2011 at 05:52:21PM +0200, Frederic Weisbecker wrote:
> Provide a few APIs that archs can call to tell they are entering
> or exiting the kernel so that when we are in nohz adaptive mode
> we know precisely where we need to account the cputime.
> 
> The new APIs are:
> 
> - tick_nohz_enter_kernel() (called when we enter a syscall)
> - tick_nohz_exit_kernel() (called when we exit a syscall)
> - tick_nohz_enter_exception() (called when we enter any
>   exception, trap, faults...but not irqs)
> - tick_nohz_exit_exception() (called when we exit any exception)
> 
> Hooks into syscalls are typically driven by the TIF_NOHZ thread
> flag.
> 
> In addition, we use the value returned by user_mode(regs) from
> the timer interrupt to know where we are.
> Nonetheless, we can rely on user_mode(regs) != 0 to know
> we are in userspace, but we can't rely on user_mode(regs) == 0
> to know we are in the system.
> 
> Consider the following scenario: we stop the tick after syscall
> return, so we set TIF_NOHZ but the syscall exit hook is behind us.
> If we haven't yet returned to userspace, then we have
> user_mode(regs) == 0. If on top of that we consider we are in
> system mode, and later we issue a syscall but restart the tick
> right before reaching the syscall entry hook, then we have no clue
> that the whole elapsed cputime was not in the system but in the
> userspace.
> 
> The only way to fix this is to only start entering nohz mode once
> we know we are in userspace a first time, like when we reach the
> kernel exit hook or when a timer tick with user_mode(regs) == 1
> fires. Kernel threads don't have this worry.
> 
> This sucks but for now I have no better solution. Let's hope we
> can find better.
> 
> TODO: wrap operation on jiffies?

Hmmm...  Does the RCU dyntick-idle code need to know about exception
entry and exit?

							Thanx, Paul

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Anton Blanchard <anton@au1.ibm.com>
> Cc: Avi Kivity <avi@redhat.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Paul Menage <menage@google.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
> ---
>  include/linux/tick.h     |    8 +++
>  kernel/sched.c           |    1 +
>  kernel/time/tick-sched.c |  114 ++++++++++++++++++++++++++++++++++++++++------
>  3 files changed, 109 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index ea6dfb7..3ad649f 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -139,10 +139,18 @@ extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
>  #ifdef CONFIG_CPUSETS_NO_HZ
>  DECLARE_PER_CPU(int, task_nohz_mode);
> 
> +extern void tick_nohz_enter_kernel(void);
> +extern void tick_nohz_exit_kernel(void);
> +extern void tick_nohz_enter_exception(struct pt_regs *regs);
> +extern void tick_nohz_exit_exception(struct pt_regs *regs);
>  extern int tick_nohz_adaptive_mode(void);
>  extern bool tick_nohz_account_tick(void);
>  extern void tick_nohz_flush_current_times(void);
>  #else /* !CPUSETS_NO_HZ */
> +static inline void tick_nohz_enter_kernel(void) { }
> +static inline void tick_nohz_exit_kernel(void) { }
> +static inline void tick_nohz_enter_exception(struct pt_regs *regs) { }
> +static inline void tick_nohz_exit_exception(struct pt_regs *regs) { }
>  static inline int tick_nohz_adaptive_mode(void) { return 0; }
>  static inline bool tick_nohz_account_tick(void) { return false; }
>  #endif /* CPUSETS_NO_HZ */
> diff --git a/kernel/sched.c b/kernel/sched.c
> index a58f993..c49c1b1 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2503,6 +2503,7 @@ static void cpuset_nohz_restart_tick(void)
>  	tick_nohz_flush_current_times();
>  	__get_cpu_var(task_nohz_mode) = 0;
>  	tick_nohz_restart_sched_tick();
> +	clear_thread_flag(TIF_NOHZ);
>  }
> 
>  void cpuset_update_nohz(void)
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index c3a8f26..d8f01b8 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -595,8 +595,9 @@ void tick_nohz_irq_exit(void)
>  	if (ts->inidle && !need_resched())
>  		__tick_nohz_enter_idle(ts, cpu);
>  	else if (tick_nohz_adaptive_mode() && !idle_cpu(cpu)) {
> -		if (tick_nohz_can_stop_tick(cpu, ts))
> -			tick_nohz_stop_sched_tick(ktime_get(), cpu, ts);
> +		if (ts->saved_jiffies_whence != JIFFIES_SAVED_NONE
> +		    && tick_nohz_can_stop_tick(cpu, ts))
> +				tick_nohz_stop_sched_tick(ktime_get(), cpu, ts);
>  	}
>  }
> 
> @@ -757,6 +758,74 @@ void tick_check_idle(int cpu)
> 
>  #ifdef CONFIG_CPUSETS_NO_HZ
> 
> +void tick_nohz_exit_kernel(void)
> +{
> +	unsigned long flags;
> +	struct tick_sched *ts;
> +	unsigned long delta_jiffies;
> +
> +	local_irq_save(flags);
> +
> +	if (!tick_nohz_adaptive_mode()) {
> +		local_irq_restore(flags);
> +		return;
> +	}
> +
> +	ts = &__get_cpu_var(tick_cpu_sched);
> +
> +	WARN_ON_ONCE(ts->saved_jiffies_whence == JIFFIES_SAVED_USER);
> +
> +	if (ts->saved_jiffies_whence == JIFFIES_SAVED_SYS) {
> +		delta_jiffies = jiffies - ts->saved_jiffies;
> +		account_system_jiffies(current, delta_jiffies);
> +	}
> +
> +	ts->saved_jiffies = jiffies;
> +	ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
> +
> +	local_irq_restore(flags);
> +}
> +
> +void tick_nohz_enter_kernel(void)
> +{
> +	unsigned long flags;
> +	struct tick_sched *ts;
> +	unsigned long delta_jiffies;
> +
> +	local_irq_save(flags);
> +
> +	if (!tick_nohz_adaptive_mode()) {
> +		local_irq_restore(flags);
> +		return;
> +	}
> +
> +	ts = &__get_cpu_var(tick_cpu_sched);
> +
> +	WARN_ON_ONCE(ts->saved_jiffies_whence == JIFFIES_SAVED_SYS);
> +
> +	if (ts->saved_jiffies_whence == JIFFIES_SAVED_USER) {
> +		delta_jiffies = jiffies - ts->saved_jiffies;
> +		account_user_jiffies(current, delta_jiffies);
> +	}
> +
> +	ts->saved_jiffies = jiffies;
> +	ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
> +
> +	local_irq_restore(flags);
> +}
> +
> +void tick_nohz_enter_exception(struct pt_regs *regs)
> +{
> +	if (user_mode(regs))
> +		tick_nohz_enter_kernel();
> +}
> +
> +void tick_nohz_exit_exception(struct pt_regs *regs)
> +{
> +	if (user_mode(regs))
> +		tick_nohz_exit_kernel();
> +}
> +
>  int tick_nohz_adaptive_mode(void)
>  {
>  	return __get_cpu_var(task_nohz_mode);
> @@ -766,20 +835,33 @@ static void tick_nohz_cpuset_stop_tick(int user)
>  {
>  	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
> 
> -	if (!cpuset_adaptive_nohz() || tick_nohz_adaptive_mode())
> +	if (!cpuset_adaptive_nohz())
>  		return;
> 
> +	if (tick_nohz_adaptive_mode()) {
> +		if (user && ts->saved_jiffies_whence == JIFFIES_SAVED_NONE) {
> +			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
> +			ts->saved_jiffies = jiffies;
> +		}
> +
> +		return;
> +	}
> +
>  	if (cpuset_nohz_can_stop_tick()) {
>  		__get_cpu_var(task_nohz_mode) = 1;
>  		/* Nohz mode must be visible to wake_up_nohz_cpu() */
>  		smp_wmb();
> 
> +		set_thread_flag(TIF_NOHZ);
>  		WARN_ON_ONCE(ts->saved_jiffies_whence != JIFFIES_SAVED_NONE);
> -		ts->saved_jiffies = jiffies;
> -		if (user)
> +
> +		if (user) {
>  			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
> -		else
> +			ts->saved_jiffies = jiffies;
> +		} else if (!current->mm) {
>  			ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
> +			ts->saved_jiffies = jiffies;
> +		}
>  	}
>  }
> 
> @@ -803,7 +885,7 @@ static void tick_do_timer_check_handler(int cpu)
> 
>  bool tick_nohz_account_tick(void)
>  {
> -	struct tick_sched *ts;
> +	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
>  	unsigned long delta_jiffies;
> 
>  	if (!tick_nohz_adaptive_mode())
> @@ -811,11 +893,15 @@ bool tick_nohz_account_tick(void)
> 
>  	ts = &__get_cpu_var(tick_cpu_sched);
> 
> +	if (ts->saved_jiffies_whence == JIFFIES_SAVED_NONE)
> +		return false;
> +
>  	delta_jiffies = jiffies - ts->saved_jiffies;
> -	if (ts->saved_jiffies_whence == JIFFIES_SAVED_SYS)
> -		account_system_jiffies(current, delta_jiffies);
> -	else
> +
> +	if (ts->saved_jiffies_whence == JIFFIES_SAVED_USER)
>  		account_user_jiffies(current, delta_jiffies);
> +	else
> +		account_system_jiffies(current, delta_jiffies);
> 
>  	ts->saved_jiffies = jiffies;
> 
> @@ -825,12 +911,12 @@ bool tick_nohz_account_tick(void)
>  void tick_nohz_flush_current_times(void)
>  {
>  	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
> +	unsigned long delta_jiffies;
> +	struct pt_regs *regs;
> 
> -	tick_nohz_account_tick();
> -
> -	ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
> +	if (tick_nohz_account_tick())
> +		ts->saved_jiffies_whence = JIFFIES_SAVED_NONE;
>  }
> -
>  #else
> 
>  static void tick_nohz_cpuset_stop_tick(int user) { }
> -- 
> 1.7.5.4
> 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 31/32] rcu: Switch to extended quiescent state in userspace from nohz cpuset
  2011-08-15 15:52 ` [PATCH 31/32] rcu: Switch to extended quiescent state in userspace from nohz cpuset Frederic Weisbecker
@ 2011-08-16 20:44   ` Paul E. McKenney
  2011-08-17  2:43     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Paul E. McKenney @ 2011-08-16 20:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul Menage, Peter Zijlstra, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 15, 2011 at 05:52:28PM +0200, Frederic Weisbecker wrote:
> When we switch to adaptive nohz mode and we run in userspace,
> we can still receive IPIs from the RCU core if a grace period
> has been started by another CPU, because we need to take part
> in its completion.
> 
> However running in userspace is similar to that of running in
> idle because we don't make use of RCU there, thus we can be
> considered as running in RCU extended quiescent state. The
> benefit when running into that mode is that we are not
> anymore disturbed by needless IPIs coming from the RCU core.
> 
> To perform this, we just need to use the RCU extended quiescent state
> APIs on the following points:
> 
> - kernel exit or tick stop in userspace: here we switch to extended
> quiescent state because we run in userspace without the tick.
> 
> - kernel entry or tick restart: here we exit the extended quiescent
> state because either we enter the kernel and we may make use of RCU
> read side critical section anytime, or we need the timer tick for some
> reason and that takes care of RCU grace period in a traditional way.
> 
> TODO: hook into do_notify_resume() because we may have called
> rcu_enter_nohz() from syscall exit hook, but we might call
> do_notify_resume() right after, which may use RCU.

I don't see exactly how the exception path works, but this does reassure
me a bit on the syscall path.

							Thanx, Paul

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Anton Blanchard <anton@au1.ibm.com>
> Cc: Avi Kivity <avi@redhat.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Paul Menage <menage@google.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
> ---
>  include/linux/tick.h     |    2 ++
>  kernel/sched.c           |    1 +
>  kernel/time/tick-sched.c |   21 +++++++++++++++++++++
>  3 files changed, 24 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 9d0270e..4e7555f 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -138,12 +138,14 @@ extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
> 
>  #ifdef CONFIG_CPUSETS_NO_HZ
>  DECLARE_PER_CPU(int, task_nohz_mode);
> +DECLARE_PER_CPU(int, nohz_task_ext_qs);
> 
>  extern void tick_nohz_enter_kernel(void);
>  extern void tick_nohz_exit_kernel(void);
>  extern void tick_nohz_enter_exception(struct pt_regs *regs);
>  extern void tick_nohz_exit_exception(struct pt_regs *regs);
>  extern int tick_nohz_adaptive_mode(void);
> +extern void tick_nohz_cpu_exit_qs(void);
>  extern bool tick_nohz_account_tick(void);
>  extern void tick_nohz_flush_current_times(bool restart_tick);
>  #else /* !CPUSETS_NO_HZ */
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 2bcd456..576d0bf 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2504,6 +2504,7 @@ static void cpuset_nohz_restart_tick(void)
>  	__get_cpu_var(task_nohz_mode) = 0;
>  	tick_nohz_restart_sched_tick();
>  	clear_thread_flag(TIF_NOHZ);
> +	tick_nohz_cpu_exit_qs();
>  }
> 
>  void cpuset_update_nohz(void)
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 9a2ba5b..b611b77 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -757,6 +757,7 @@ void tick_check_idle(int cpu)
>  }
> 
>  #ifdef CONFIG_CPUSETS_NO_HZ
> +DEFINE_PER_CPU(int, nohz_task_ext_qs);
> 
>  void tick_nohz_exit_kernel(void)
>  {
> @@ -783,6 +784,9 @@ void tick_nohz_exit_kernel(void)
>  	ts->saved_jiffies = jiffies;
>  	ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
> 
> +	__get_cpu_var(nohz_task_ext_qs) = 1;
> +	rcu_enter_nohz();

OK, I was wondering how this was going to work if RCU didn't
know about kernel entry/exit.  Whew!!!  ;-)

> +
>  	local_irq_restore(flags);
>  }
> 
> @@ -799,6 +803,11 @@ void tick_nohz_enter_kernel(void)
>  		return;
>  	}
> 
> +	if (__get_cpu_var(nohz_task_ext_qs) == 1) {
> +		__get_cpu_var(nohz_task_ext_qs) = 0;
> +		rcu_exit_nohz();
> +	}
> +
>  	ts = &__get_cpu_var(tick_cpu_sched);
> 
>  	WARN_ON_ONCE(ts->saved_jiffies_whence == JIFFIES_SAVED_SYS);
> @@ -814,6 +823,16 @@ void tick_nohz_enter_kernel(void)
>  	local_irq_restore(flags);
>  }
> 
> +void tick_nohz_cpu_exit_qs(void)
> +{
> +	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
> +
> +	if (__get_cpu_var(nohz_task_ext_qs)) {
> +		rcu_exit_nohz();
> +		__get_cpu_var(nohz_task_ext_qs) = 0;
> +	}
> +}
> +
>  void tick_nohz_enter_exception(struct pt_regs *regs)
>  {
>  	if (user_mode(regs))
> @@ -858,6 +877,8 @@ static void tick_nohz_cpuset_stop_tick(int user)
>  		if (user) {
>  			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
>  			ts->saved_jiffies = jiffies;
> +			__get_cpu_var(nohz_task_ext_qs) = 1;
> +			rcu_enter_nohz();

When entering an exception, shouldn't we call rcu_exit_nohz() rather
than rcu_enter_nohz()?  Or is this a "didn't really mean an exception"
code path?

>  		} else if (!current->mm) {
>  			ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
>  			ts->saved_jiffies = jiffies;
> -- 
> 1.7.5.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 14/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2011-08-16 20:13   ` Paul E. McKenney
@ 2011-08-17  2:10     ` Frederic Weisbecker
  2011-08-17  2:49       ` Paul E. McKenney
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-17  2:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul Menage, Peter Zijlstra, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Tue, Aug 16, 2011 at 01:13:42PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 15, 2011 at 05:52:11PM +0200, Frederic Weisbecker wrote:
> > If RCU is waiting for the current CPU to complete a grace
> > period, don't turn off the tick. Unlike dynctik-idle, we
> 
> s/dynctik/dyntick/  ;-)

Heh! :)

> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 99f9aa7..55a482a 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -133,6 +133,7 @@ static inline int rcu_preempt_depth(void)
> >  extern void rcu_sched_qs(int cpu);
> >  extern void rcu_bh_qs(int cpu);
> >  extern void rcu_check_callbacks(int cpu, int user);
> > +extern int rcu_pending(int cpu);
> >  struct notifier_block;
> > 
> >  #ifdef CONFIG_NO_HZ
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index ba06207..0009bfc 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -205,7 +205,6 @@ int rcu_cpu_stall_suppress __read_mostly;
> >  module_param(rcu_cpu_stall_suppress, int, 0644);
> > 
> >  static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
> > -static int rcu_pending(int cpu);
> > 
> >  /*
> >   * Return the number of RCU-sched batches processed thus far for debug & stats.
> > @@ -1729,7 +1728,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
> >   * by the current CPU, returning 1 if so.  This function is part of the
> >   * RCU implementation; it is -not- an exported member of the RCU API.
> >   */
> > -static int rcu_pending(int cpu)
> > +int rcu_pending(int cpu)
> >  {
> >  	return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
> >  	       __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 0e1aa4e..353a66f 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2439,6 +2439,7 @@ DEFINE_PER_CPU(int, task_nohz_mode);
> >  bool cpuset_nohz_can_stop_tick(void)
> >  {
> >  	struct rq *rq;
> > +	int cpu;
> > 
> >  	rq = this_rq();
> > 
> > @@ -2446,6 +2447,19 @@ bool cpuset_nohz_can_stop_tick(void)
> >  	if (rq->nr_running > 1)
> >  		return false;
> > 
> > +	cpu = smp_processor_id();
> > +
> > +	/*
> > +	 * FIXME: will probably be removed soon as it's
> > +	 * already checked from tick_nohz_stop_sched_tick()
> > +	 */
> > +	if (rcu_needs_cpu(cpu))
> > +		return false;
> > +
> > +	/* Is there a grace period to complete ? */
> > +	if (rcu_pending(cpu))
> 
> This is from a quiescent state for both RCU and RCU-bh, right?
> > Or can there be RCU or RCU-bh read-side critical sections held
> across here?  (It would be mildly bad if so.)

Yeah this can happen. This is called from the timer interrupt
or from an IPI. We can be in any kind of rcu critical section.

> 
> But force_quiescent_state() will catch cases where RCU needs
> quiescent states from CPUs, so is this check really needed?

Yeah, we should receive IPIs from CPUs that need us, so this check
is really an optimization: no need to go through a cycle of timer
shutdown/restart if we can complete something
right away.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 22/32] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
  2011-08-16 20:20   ` Paul E. McKenney
@ 2011-08-17  2:18     ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-17  2:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul Menage, Peter Zijlstra, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Tue, Aug 16, 2011 at 01:20:05PM -0700, Paul E. McKenney wrote:
> > @@ -1546,6 +1547,13 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
> >  	rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> >  	rdp->qlen++;
> > 
> > +	/* Restart the timer if needed to handle the callbacks */
> > +	if (tick_nohz_adaptive_mode()) {
> > +		/* Make updates on nxtlist visible to self IPI */
> > +		barrier();
> > +		smp_cpuset_update_nohz(smp_processor_id());
> > +	}
> > +
> 
> But this must be happening in a system call or interrupt handler, right?
> If so, won't we get a chance to check things on exit from the system call
> or interrupt?  Or are you hooking only into syscall entry?

Sure, in theory when we call call_rcu() we are in the kernel, and thus
not very far from a syscall/exception/irq exit to userspace or from
being scheduled out.

We could rely on that assumption and I bet it won't ever be a problem
in practice. I just wanted to ensure we don't spend too much time
before handling that callback.
 
> >  	/* If interrupts were disabled, don't dive into RCU core. */
> >  	if (irqs_disabled_flags(flags)) {
> >  		local_irq_restore(flags);
> > -- 
> > 1.7.5.4
> > 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 24/32] nohz/cpuset: Handle kernel entry/exit to account cputime
  2011-08-16 20:38   ` Paul E. McKenney
@ 2011-08-17  2:30     ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-17  2:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul Menage, Peter Zijlstra, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Tue, Aug 16, 2011 at 01:38:20PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 15, 2011 at 05:52:21PM +0200, Frederic Weisbecker wrote:
> > Provide a few APIs that archs can call to tell they are entering
> > or exiting the kernel so that when we are in nohz adaptive mode
> > we know precisely where we need to account the cputime.
> > 
> > The new APIs are:
> > 
> > - tick_nohz_enter_kernel() (called when we enter a syscall)
> > - tick_nohz_exit_kernel() (called when we exit a syscall)
> > - tick_nohz_enter_exception() (called when we enter any
> >   exception, trap, faults...but not irqs)
> > - tick_nohz_exit_exception() (called when we exit any exception)
> > 
> > Hooks into syscalls are typically driven by the TIF_NOHZ thread
> > flag.
> > 
> > In addition, we use the value returned by user_mode(regs) from
> > the timer interrupt to know where we are.
> > Nonetheless, we can rely on user_mode(regs) != 0 to know
> > we are in userspace, but we can't rely on user_mode(regs) == 0
> > to know we are in the system.
> > 
> > Consider the following scenario: we stop the tick after syscall
> > return, so we set TIF_NOHZ but the syscall exit hook is behind us.
> > If we haven't yet returned to userspace, then we have
> > user_mode(regs) == 0. If on top of that we consider we are in
> > system mode, and later we issue a syscall but restart the tick
> > right before reaching the syscall entry hook, then we have no clue
> > that the whole elapsed cputime was not in the system but in the
> > userspace.
> > 
> > The only way to fix this is to only start entering nohz mode once
> > we know we are in userspace a first time, like when we reach the
> > kernel exit hook or when a timer tick with user_mode(regs) == 1
> > fires. Kernel threads don't have this worry.
> > 
> > This sucks but for now I have no better solution. Let's hope we
> > can find better.
> > 
> > TODO: wrap operation on jiffies?
> 
> Hmmm...  Does the RCU dyntick-idle code need to know about exception
> entry and exit?
> 
> 							Thanx, Paul

At that point it doesn't, because we don't yet call rcu_enter_nohz()
when switching to userspace. Instead we shut down the tick and restart
it when needed, for example when a remote CPU sends us an IPI to complete
a grace period.

The patch that switches to extended qs is the 31/32 and it handles
syscalls and exceptions as well.

I wanted the support for RCU extended quiescent states to come late
in the patchset so that it's considered an incremental feature
and not a core piece of the adaptive nohz (ie: it's not mandatory,
just an optimization). This way we can use cpuset nohz without the
rcu extended quiescent state feature and hence keep that small part
bisectable.

Patch 30 activates support for cpuset nohz (support from x86).
Patch 31 activates the rcu extended quiescent state support in
userspace as a bonus.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 31/32] rcu: Switch to extended quiescent state in userspace from nohz cpuset
  2011-08-16 20:44   ` Paul E. McKenney
@ 2011-08-17  2:43     ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-17  2:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul Menage, Peter Zijlstra, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Tue, Aug 16, 2011 at 01:44:15PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 15, 2011 at 05:52:28PM +0200, Frederic Weisbecker wrote:
> > When we switch to adaptive nohz mode and we run in userspace,
> > we can still receive IPIs from the RCU core if a grace period
> > has been started by another CPU because we need to take part
> > in its completion.
> > 
> > However running in userspace is similar to running in idle
> > because we don't make use of RCU there, thus we can be
> > considered as running in an RCU extended quiescent state. The
> > benefit of running in that mode is that we are no longer
> > disturbed by needless IPIs coming from the RCU core.
> > 
> > To perform this, we just need to use the RCU extended quiescent state
> > APIs on the following points:
> > 
> > - kernel exit or tick stop in userspace: here we switch to extended
> > quiescent state because we run in userspace without the tick.
> > 
> > - kernel entry or tick restart: here we exit the extended quiescent
> > state because either we enter the kernel and we may make use of RCU
> > read side critical section anytime, or we need the timer tick for some
> > reason and that takes care of RCU grace period in a traditional way.
> > 
> > TODO: hook into do_notify_resume() because we may have called
> > rcu_enter_nohz() from syscall exit hook, but we might call
> > do_notify_resume() right after, which may use RCU.
> 
> I don't see exactly how the exception path works, but this does reassure
> me a bit on the syscall path.

On the syscall path we directly call tick_nohz_enter,exit_kernel() and that
takes care of all the rcu tricky bits.

The exception paths call tick_nohz_enter,exit_exception() which are
essentially conditional wrappers around tick_nohz_enter,exit_kernel()
after checking user_mode(regs). Hmm, now I realize I can't rely on
user_mode() to know whether we should exit the rcu extended quiescent
state: even when user_mode(regs) != 1, we may have already gone through
the syscall exit hook and thus already be in the rcu extended quiescent
state. So tick_nohz_enter_exception() may forget to call rcu_exit_nohz()
sometimes.

I need to fix that.

> 
> 							Thanx, Paul
> 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Anton Blanchard <anton@au1.ibm.com>
> > Cc: Avi Kivity <avi@redhat.com>
> > Cc: Ingo Molnar <mingo@elte.hu>
> > Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> > Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Paul Menage <menage@google.com>
> > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
> > ---
> >  include/linux/tick.h     |    2 ++
> >  kernel/sched.c           |    1 +
> >  kernel/time/tick-sched.c |   21 +++++++++++++++++++++
> >  3 files changed, 24 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/tick.h b/include/linux/tick.h
> > index 9d0270e..4e7555f 100644
> > --- a/include/linux/tick.h
> > +++ b/include/linux/tick.h
> > @@ -138,12 +138,14 @@ extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
> > 
> >  #ifdef CONFIG_CPUSETS_NO_HZ
> >  DECLARE_PER_CPU(int, task_nohz_mode);
> > +DECLARE_PER_CPU(int, nohz_task_ext_qs);
> > 
> >  extern void tick_nohz_enter_kernel(void);
> >  extern void tick_nohz_exit_kernel(void);
> >  extern void tick_nohz_enter_exception(struct pt_regs *regs);
> >  extern void tick_nohz_exit_exception(struct pt_regs *regs);
> >  extern int tick_nohz_adaptive_mode(void);
> > +extern void tick_nohz_cpu_exit_qs(void);
> >  extern bool tick_nohz_account_tick(void);
> >  extern void tick_nohz_flush_current_times(bool restart_tick);
> >  #else /* !CPUSETS_NO_HZ */
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 2bcd456..576d0bf 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2504,6 +2504,7 @@ static void cpuset_nohz_restart_tick(void)
> >  	__get_cpu_var(task_nohz_mode) = 0;
> >  	tick_nohz_restart_sched_tick();
> >  	clear_thread_flag(TIF_NOHZ);
> > +	tick_nohz_cpu_exit_qs();
> >  }
> > 
> >  void cpuset_update_nohz(void)
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 9a2ba5b..b611b77 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -757,6 +757,7 @@ void tick_check_idle(int cpu)
> >  }
> > 
> >  #ifdef CONFIG_CPUSETS_NO_HZ
> > +DEFINE_PER_CPU(int, nohz_task_ext_qs);
> > 
> >  void tick_nohz_exit_kernel(void)
> >  {
> > @@ -783,6 +784,9 @@ void tick_nohz_exit_kernel(void)
> >  	ts->saved_jiffies = jiffies;
> >  	ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
> > 
> > +	__get_cpu_var(nohz_task_ext_qs) = 1;
> > +	rcu_enter_nohz();
> 
> OK, I was wondering how this was going to work if RCU didn't
> know about kernel entry/exit.  Whew!!!  ;-)
> 
> > +
> >  	local_irq_restore(flags);
> >  }
> > 
> > @@ -799,6 +803,11 @@ void tick_nohz_enter_kernel(void)
> >  		return;
> >  	}
> > 
> > +	if (__get_cpu_var(nohz_task_ext_qs) == 1) {
> > +		__get_cpu_var(nohz_task_ext_qs) = 0;
> > +		rcu_exit_nohz();
> > +	}
> > +
> >  	ts = &__get_cpu_var(tick_cpu_sched);
> > 
> >  	WARN_ON_ONCE(ts->saved_jiffies_whence == JIFFIES_SAVED_SYS);
> > @@ -814,6 +823,16 @@ void tick_nohz_enter_kernel(void)
> >  	local_irq_restore(flags);
> >  }
> > 
> > +void tick_nohz_cpu_exit_qs(void)
> > +{
> > +	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
> > +
> > +	if (__get_cpu_var(nohz_task_ext_qs)) {
> > +		rcu_exit_nohz();
> > +		__get_cpu_var(nohz_task_ext_qs) = 0;
> > +	}
> > +}
> > +
> >  void tick_nohz_enter_exception(struct pt_regs *regs)
> >  {
> >  	if (user_mode(regs))
> > @@ -858,6 +877,8 @@ static void tick_nohz_cpuset_stop_tick(int user)
> >  		if (user) {
> >  			ts->saved_jiffies_whence = JIFFIES_SAVED_USER;
> >  			ts->saved_jiffies = jiffies;
> > +			__get_cpu_var(nohz_task_ext_qs) = 1;
> > +			rcu_enter_nohz();
> 
> When entering an exception, shouldn't we call rcu_exit_nohz() rather
> than rcu_enter_nohz()?  Or is this a "didn't really mean an exception"
> code path?
> 
> >  		} else if (!current->mm) {
> >  			ts->saved_jiffies_whence = JIFFIES_SAVED_SYS;
> >  			ts->saved_jiffies = jiffies;
> > -- 
> > 1.7.5.4
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 14/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2011-08-17  2:10     ` Frederic Weisbecker
@ 2011-08-17  2:49       ` Paul E. McKenney
  0 siblings, 0 replies; 139+ messages in thread
From: Paul E. McKenney @ 2011-08-17  2:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul Menage, Peter Zijlstra, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Wed, Aug 17, 2011 at 04:10:27AM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 16, 2011 at 01:13:42PM -0700, Paul E. McKenney wrote:
> > On Mon, Aug 15, 2011 at 05:52:11PM +0200, Frederic Weisbecker wrote:
> > > If RCU is waiting for the current CPU to complete a grace
> > > period, don't turn off the tick. Unlike dynctik-idle, we
> > 
> > s/dynctik/dyntick/  ;-)
> 
> Heh! :)
> 
> > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > index 99f9aa7..55a482a 100644
> > > --- a/include/linux/rcupdate.h
> > > +++ b/include/linux/rcupdate.h
> > > @@ -133,6 +133,7 @@ static inline int rcu_preempt_depth(void)
> > >  extern void rcu_sched_qs(int cpu);
> > >  extern void rcu_bh_qs(int cpu);
> > >  extern void rcu_check_callbacks(int cpu, int user);
> > > +extern int rcu_pending(int cpu);
> > >  struct notifier_block;
> > > 
> > >  #ifdef CONFIG_NO_HZ
> > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > index ba06207..0009bfc 100644
> > > --- a/kernel/rcutree.c
> > > +++ b/kernel/rcutree.c
> > > @@ -205,7 +205,6 @@ int rcu_cpu_stall_suppress __read_mostly;
> > >  module_param(rcu_cpu_stall_suppress, int, 0644);
> > > 
> > >  static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
> > > -static int rcu_pending(int cpu);
> > > 
> > >  /*
> > >   * Return the number of RCU-sched batches processed thus far for debug & stats.
> > > @@ -1729,7 +1728,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
> > >   * by the current CPU, returning 1 if so.  This function is part of the
> > >   * RCU implementation; it is -not- an exported member of the RCU API.
> > >   */
> > > -static int rcu_pending(int cpu)
> > > +int rcu_pending(int cpu)
> > >  {
> > >  	return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
> > >  	       __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
> > > diff --git a/kernel/sched.c b/kernel/sched.c
> > > index 0e1aa4e..353a66f 100644
> > > --- a/kernel/sched.c
> > > +++ b/kernel/sched.c
> > > @@ -2439,6 +2439,7 @@ DEFINE_PER_CPU(int, task_nohz_mode);
> > >  bool cpuset_nohz_can_stop_tick(void)
> > >  {
> > >  	struct rq *rq;
> > > +	int cpu;
> > > 
> > >  	rq = this_rq();
> > > 
> > > @@ -2446,6 +2447,19 @@ bool cpuset_nohz_can_stop_tick(void)
> > >  	if (rq->nr_running > 1)
> > >  		return false;
> > > 
> > > +	cpu = smp_processor_id();
> > > +
> > > +	/*
> > > +	 * FIXME: will probably be removed soon as it's
> > > +	 * already checked from tick_nohz_stop_sched_tick()
> > > +	 */
> > > +	if (rcu_needs_cpu(cpu))
> > > +		return false;
> > > +
> > > +	/* Is there a grace period to complete ? */
> > > +	if (rcu_pending(cpu))
> > 
> > This is from a quiescent state for both RCU and RCU-bh, right?
> > Or can their be RCU or RCU-bh read-side critical sections held
> > across here?  (It would be mildly bad if so.)
> 
> Yeah this can happen. This is called from the timer interrupt
> or from an IPI. We can be in any kind of rcu critical section.
> 
> > But force_quiescent_state() will catch cases where RCU needs
> > quiescent states from CPUs, so is this check really needed?
> 
> Yeah we should receive IPIs from CPUs that need us. This can
> be an optimization though. No need to run into a cycle of
> timer shutdown/restart if we can complete something
> right away.

Never mind...  I was confusing this with rcu_needs_cpu(), which should not
be called from within an RCU-sched or RCU-bh read-side critical section.

It is plenty fine to call rcu_pending() from within RCU read-side
critical sections.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (31 preceding siblings ...)
  2011-08-15 15:52 ` [PATCH 32/32] nohz/cpuset: Disable under some configs Frederic Weisbecker
@ 2011-08-17 16:36 ` Avi Kivity
  2011-08-18 13:25   ` Frederic Weisbecker
  2011-08-24 14:41 ` Gilad Ben-Yossef
  33 siblings, 1 reply; 139+ messages in thread
From: Avi Kivity @ 2011-08-17 16:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Ingo Molnar, Lai Jiangshan,
	Paul E . McKenney, Paul Menage, Peter Zijlstra,
	Stephen Hemminger, Thomas Gleixner, Tim Pepper

On 08/15/2011 08:51 AM, Frederic Weisbecker wrote:
> = What's the interface =
>
> We use the cpuset interface by adding a nohz flag to it.
> As long as a CPU is part of a nohz cpuset, then this CPU will
> try to enter into adaptive nohz mode when it can, even if it is part
> of another cpuset that is not nohz.
>
>

Why not do it unconditionally?  That is, if all the conditions are 
fulfilled, disable the tick regardless of any cpuset settings.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-17 16:36 ` [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Avi Kivity
@ 2011-08-18 13:25   ` Frederic Weisbecker
  2011-08-20  7:45     ` Paul Menage
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-18 13:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: LKML, Andrew Morton, Anton Blanchard, Ingo Molnar, Lai Jiangshan,
	Paul E . McKenney, Paul Menage, Peter Zijlstra,
	Stephen Hemminger, Thomas Gleixner, Tim Pepper

On Wed, Aug 17, 2011 at 09:36:47AM -0700, Avi Kivity wrote:
> On 08/15/2011 08:51 AM, Frederic Weisbecker wrote:
> >= What's the interface =
> >
> >We use the cpuset interface by adding a nohz flag to it.
> >As long as a CPU is part of a nohz cpuset, then this CPU will
> >try to enter into adaptive nohz mode when it can, even if it is part
> >of another cpuset that is not nohz.
> >
> >
> 
> Why not do it unconditionally?  That is, if all the conditions are
> fulfilled, disable the tick regardless of any cpuset settings.

Because I'm not sure it's a win on every workload. This involves
some hooks here and there (syscall slow path), IPIs, etc...

But perhaps if one day it is proven to behave better in most cases
then we can make it enabled by default on cpusets?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-18 13:25   ` Frederic Weisbecker
@ 2011-08-20  7:45     ` Paul Menage
  2011-08-23 16:36       ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Paul Menage @ 2011-08-20  7:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Avi Kivity, LKML, Andrew Morton, Anton Blanchard, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Peter Zijlstra,
	Stephen Hemminger, Thomas Gleixner, Tim Pepper

On Thu, Aug 18, 2011 at 6:25 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>>
>> Why not do it unconditionally?  That is, if all the conditions are
>> fulfilled, disable the tick regardless of any cpuset settings.
>
> Because I'm not sure it's a win on every workload. This involves
> some hooks here and there (syscall slow path), IPIs, etc...

I agree with Avi. I'd be inclined to investigate further to see if
there are any important workloads on which it's not a win - and then
add the extra complexity to control it from cpusets if necessary.
Unless there's really a good reason to make it configurable, it's
simpler to make it unconditional.

Paul

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-20  7:45     ` Paul Menage
@ 2011-08-23 16:36       ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-23 16:36 UTC (permalink / raw)
  To: Paul Menage
  Cc: Avi Kivity, LKML, Andrew Morton, Anton Blanchard, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Peter Zijlstra,
	Stephen Hemminger, Thomas Gleixner, Tim Pepper

On Sat, Aug 20, 2011 at 12:45:48AM -0700, Paul Menage wrote:
> On Thu, Aug 18, 2011 at 6:25 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >>
> >> Why not do it unconditionally?  That is, if all the conditions are
> >> fulfilled, disable the tick regardless of any cpuset settings.
> >
> > Because I'm not sure it's a win on every workload. This involves
> > some hooks here and there (syscall slow path), IPIs, etc...
> 
> I agree with Avi. I'd be inclined to investigate further to see if
> there are any important workloads on which it's not a win - and then
> add the extra complexity to control it from cpusets if necessary.
> Unless there's really a good reason to make it configurable, it's
> simpler to make it unconditional.

There is another thing. We still need to have a CPU with the periodic
tick to maintain the jiffies and walltime progression.

Also on some workloads like HPC (I mean, that's just a personal guess),
it may make sense to also migrate the peripheral interrupts' affinity
to that same CPU that keeps the periodic tick, so that we have only one
CPU handling those background things and all the others can be fully
dedicated to their main task.

So with these factors combined, we need to be able to precisely choose at
least one CPU that doesn't run in adaptive nohz mode. Having that nohz cpuset
solves that issue (although I haven't yet enforced keeping one non-nohz cpu,
but I should).

Perhaps having a simple interface that defines a fixed CPU to handle
jiffies/walltime would be enough but the cpuset offers more flexibility.
I just can't test every possible workload to know if it has no downside
somewhere.

May be a global toggle in sysfs for a global adaptive nohz would be enough?
I have no idea.

Is anybody aware of any workload that involves very frequent syscalls? Like
more than 100 Hz?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
                   ` (32 preceding siblings ...)
  2011-08-17 16:36 ` [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Avi Kivity
@ 2011-08-24 14:41 ` Gilad Ben-Yossef
  2011-08-30 14:06   ` Frederic Weisbecker
  33 siblings, 1 reply; 139+ messages in thread
From: Gilad Ben-Yossef @ 2011-08-24 14:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Peter Zijlstra,
	Stephen Hemminger, Thomas Gleixner, Tim Pepper

Hi,

On Mon, Aug 15, 2011 at 6:51 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
> For those who want to play:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
>        nohz/cpuset-v1


You caught me in playful mood, so I took it for a spin... :-)

I know this is far from being production ready, but I hope you'll find
the feedback useful.

First a short description of my testing setup is in order, I believe:

I've set up a small x86 VM with 4 CPUs running your git tree and a
minimal buildroot system. I've created 2 cpusets: sys and nohz, and
then assigned every task I could to the sys cpuset and set
adaptive_nohz on the nohz set.

To make double sure I have no task on my nohz cpuset CPU, I've booted
the system with the isolcpus command line isolating the same cpu I've
assigned to the nohz set. This shouldn't be needed of course, but just
in case.

I then ran a silly program I've written that basically eats CPU cycles
(https://github.com/gby/cpueat) and assigned it to the nohz set and
monitored the number of interrupts using /proc/interrupts

Now, for the things I've noticed -

1. Before I turn adaptive_nohz to 1, when no task is running on the
nohz cpuset cpu, the tick is indeed idle (regular nohz case) and very
few function call IPIs are seen. However, when I turn adaptive_nohz to
1 (but still with no task running on the CPU), the tick remains idle,
but I get an IPI function call interrupt at almost the rate the tick
would have fired.

2. When I run my little cpueat program on the nohz CPU, the tick does
not actually go off. Instead it ticks away as usual. I know it is
the only eligible task to run, since as soon as I kill it the tick
turns off (regular nohz mode again). I've tinkered around and found
out that what stops the tick going away is the check for rcu_pending()
in cpuset_nohz_can_stop_tick(). It seems to always be true. When I
removed that check experimentally and repeat the test, the tick indeed
stops with my cpueat task running. Of course, I don't suggest this is
the sane thing to do - I just wondered if that what stopped the tick
going away and it seems that it is.

3. My little cpueat program tries to fork a child process after 100k
iterations of some CPU-bound loop. It usually takes a few seconds to
happen. The idea is to make sure that the tick resumes when nr_running
> 1. In my case, I got a kernel panic. Since it happened with some
debug code I added and with aforementioned experimental removal of
rcu_pending check, I'm assuming for now it's all my fault but will
look into verifying it further and will send panic logs if it proves
useful.

Cheers,
Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com
"Dance like no one is watching, love like you'll never be hurt, sing
like no one is listening... but for BEEP sake you better code like
you're going to maintain it for years!"

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 01/32 RESEND] nohz: Drop useless call in tick_nohz_start_idle()
  2011-08-15 15:51 ` [PATCH 01/32 RESEND] nohz: Drop useless call in tick_nohz_start_idle() Frederic Weisbecker
@ 2011-08-29 14:23   ` Peter Zijlstra
  2011-08-29 17:10     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 14:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:51 +0200, Frederic Weisbecker wrote:
> The call to update_ts_time_stats() there is useless. All
> we need is to save the idle entry_time.
> 
> 
Would have been clearer if you just said the call was a NOP. The whole
second sentence distracts and confuses as it's irrelevant to the change
at hand.

If you want to expand you can explain that it's a NOP because
update_ts_time_stats() requires either ts->idle_active and/or
@last_update_time and our callsite has neither.

Although this assumes it's never called when ts->idle_active is already
set, is this so (likely)? Do we want a WARN_ON_ONCE() testing that
assumption?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 02/32 RESEND] nohz: Drop ts->idle_active
  2011-08-15 15:51 ` [PATCH 02/32 RESEND] nohz: Drop ts->idle_active Frederic Weisbecker
@ 2011-08-29 14:23   ` Peter Zijlstra
  2011-08-29 16:15     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 14:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:51 +0200, Frederic Weisbecker wrote:
> ts->idle_active is used to know if we want to account the idle sleep
> time. But ts->inidle is enough to check that.
> 
While possibly true, it's not immediately obvious and no hints are
supplied. For example: tick_check_nohz() would disable ->idle_active..
where is this mirrored in the ->inidle state.

Also, tick_nohz_stop_sched_tick() has this comment:

	/*
	 * Set ts->inidle unconditionally. Even if the system did not
	 * switch to NOHZ mode the cpu frequency governers rely on the
	 * update of the idle time accounting in tick_nohz_start_idle().
	 */
	ts->inidle = 1;

Which suggests the ->inidle state doesn't accurately reflect things.

This is all rather hairy code, such changes really want more in terms of
explanation.



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 03/32 RESEND] nohz: Drop useless ts->inidle check before rearming the tick
  2011-08-15 15:52 ` [PATCH 03/32 RESEND] nohz: Drop useless ts->inidle check before rearming the tick Frederic Weisbecker
@ 2011-08-29 14:23   ` Peter Zijlstra
  2011-08-29 16:58     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 14:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> We only need to check if we have ts->stopped to ensure the tick
> was stopped and we want to re-enable it. Checking ts->inidle
> there is useless. 

/me goes la-la-la-la... 

It would so help poor little me who hasn't stared at this code in detail
for the past several days and is thus horridly confused if you'd expand
your reasoning somewhat.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 04/32] nohz: Separate idle sleeping time accounting from nohz switching
  2011-08-15 15:52 ` [PATCH 04/32] nohz: Separate idle sleeping time accounting from nohz switching Frederic Weisbecker
@ 2011-08-29 14:23   ` Peter Zijlstra
  2011-08-29 16:32     ` Frederic Weisbecker
  2011-08-29 14:23   ` Peter Zijlstra
  1 sibling, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 14:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> To prepare for having nohz mode switching independent from idle,
> pull the idle sleeping time accounting out of the tick stop API.
> 
> This implies to implement some new API to call when we
> enter/exit idle. 

I mean, I really love brevity, but you seem to just not state all the
important bits ;-)

So the goal is to disable the tick more often (say when running 1
userbound task), why does that need new hooks? If we already had the
tick disabled, the tick_nohz_stop_sched_tick() call on going idle will
simply not do anything.

If we go from idle to running something we want to enable the tick
initially because doing the task wakeup involves RCU etc.. Once we find
the task is indeed userbound and we've finished all our state we can
disable the thing again.



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 04/32] nohz: Separate idle sleeping time accounting from nohz switching
  2011-08-15 15:52 ` [PATCH 04/32] nohz: Separate idle sleeping time accounting from nohz switching Frederic Weisbecker
  2011-08-29 14:23   ` Peter Zijlstra
@ 2011-08-29 14:23   ` Peter Zijlstra
  2011-08-29 17:01     ` Frederic Weisbecker
  1 sibling, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 14:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> +extern void tick_nohz_exit_idle(void);
> +extern void tick_nohz_irq_exit(void); 

For consistency's sake, pick either tick_nohz_exit_*() or
tick_nohz_*_exit() but don't go mix and match (I prefer the former).

Also, there isn't a matching irq_enter() callback..

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-15 15:52 ` [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs Frederic Weisbecker
@ 2011-08-29 14:25   ` Peter Zijlstra
  2011-08-29 17:11     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 14:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> To prepare for nohz / idle logic split, pull out the rcu dynticks
> idle mode switching to strict idle entry/exit areas.
> 
> So we make the dyntick mode possible without always involving rcu
> extended quiescent state. 

Why is this a good thing? I would be thinking that if we're a userspace
bound task and we disable the tick rcu would be finished on this cpu and
thus the extended quiescent state is just what we want?



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 06/32] nohz: Move idle ticks stats tracking out of nohz handlers
  2011-08-15 15:52 ` [PATCH 06/32] nohz: Move idle ticks stats tracking out of nohz handlers Frederic Weisbecker
@ 2011-08-29 14:28   ` Peter Zijlstra
  2011-09-06  0:35     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 14:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> Idle ticks time tracking is merged into nohz stop/restart
> handlers. Pull it out into idle entry/exit handlers instead,
> so that the nohz APIs are more idle independent. 

Are you trying to say:

Currently idle time tracking is part of the nohz state tracking,
separate this so that we can disable the tick while we're non-idle?

If so, how does idle time tracking work on a !NOHZ kernel, surely such
things can be idle too..

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 08/32] nohz: Move nohz load balancer selection into idle logic
  2011-08-15 15:52 ` [PATCH 08/32] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
@ 2011-08-29 14:45   ` Peter Zijlstra
  2011-09-08 14:08     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 14:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> We want the nohz load balancer to be an idle CPU, thus
> move that selection to strict dyntick idle logic.

Again, the important part is missing, why is this correct?

I'm not at all convinced this is correct, suppose all your cpus (except
the system CPU, which we'll assume has many tasks) are busy running 1
task. Then two of them get an extra task, now if those two happen to be
SMT siblings you want the load-balancer to pull one task out of the SMT
pair, however nobody is pulling since nobody is idle.

AFAICT this breaks stuff and the ILB needs some serious attention in
order to fix this.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/32] nohz: Move ts->idle_calls into strict idle logic
  2011-08-15 15:52 ` [PATCH 09/32] nohz: Move ts->idle_calls into strict " Frederic Weisbecker
@ 2011-08-29 14:47   ` Peter Zijlstra
  2011-08-29 17:34     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 14:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> +static bool tick_nohz_can_stop_tick(int cpu, struct tick_sched *ts)
> +{
> +       /*
> +        * If this cpu is offline and it is the one which updates
> +        * jiffies, then give up the assignment and let it be taken by
> +        * the cpu which runs the tick timer next. If we don't drop
> +        * this here the jiffies might be stale and do_timer() never
> +        * invoked.
> +        */
> +       if (unlikely(!cpu_online(cpu))) {
> +               if (cpu == tick_do_timer_cpu)
> +                       tick_do_timer_cpu = TICK_DO_TIMER_NONE;
> +       }
> +
> +       if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
> +               return false;
> +
> +       if (need_resched())
> +               return false;
> +
> +       if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
> +               static int ratelimit;
> +
> +               if (ratelimit < 10) {
> +                       printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
> +                              (unsigned int) local_softirq_pending());
> +                       ratelimit++;
> +               }
> +               return false;
> +       }
> +
> +       return true;
> +} 

Why aren't rcu_needs_cpu(), printk_needs_cpu() and arch_needs_cpu()
in there?

That are typical 'can we go sleep now?' functions.
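
[ Editor's sketch of the consolidation Peter is asking for, with all the
  kernel state stubbed out as plain booleans; the function body is
  illustrative, not the actual kernel code: ]

```c
#include <stdbool.h>

/* All the "does anything still need this CPU?" predicates gathered in
 * one place. Real kernel state is stubbed with file-scope flags. */
static bool rcu_needs, printk_needs, arch_needs;
static bool softirq_pending_flag, need_resched_flag;

static bool tick_nohz_can_stop_tick(void)
{
	if (need_resched_flag)		/* another task wants the CPU */
		return false;
	if (softirq_pending_flag)	/* pending softirq work */
		return false;
	if (rcu_needs)			/* rcu_needs_cpu(): callbacks pending */
		return false;
	if (printk_needs)		/* printk_needs_cpu(): console output */
		return false;
	if (arch_needs)			/* arch_needs_cpu(): arch-specific work */
		return false;
	return true;
}
```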

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 12/32] nohz: Try not to give the timekeeping duty to a cpuset nohz cpu
  2011-08-15 15:52 ` [PATCH 12/32] nohz: Try not to give the timekeeping duty to a cpuset nohz cpu Frederic Weisbecker
@ 2011-08-29 14:55   ` Peter Zijlstra
  2011-08-30 15:17     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 14:55 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Dimitri Sivanich

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> Try to give the timekeeping duty to a CPU that doesn't belong
> to any nohz cpuset when possible, so that we increase the chance
> for these nohz cpusets to run their CPUs out of periodic tick
> mode. 

You and Dimitri might want to get together:

lkml.kernel.org/r/20110823195628.GB4533@sgi.com

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset
  2011-08-15 15:52 ` [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset Frederic Weisbecker
@ 2011-08-29 15:25   ` Peter Zijlstra
  2011-09-06 13:03     ` Frederic Weisbecker
  2011-08-29 15:28   ` Peter Zijlstra
  2011-08-29 15:32   ` Peter Zijlstra
  2 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 15:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> When a CPU is included in a nohz cpuset, try to switch
> it to nohz mode from the timer interrupt if it is the
> only non-idle task running.
> 
> Then restart the tick if necessary from the wakeup path
> if we are enqueuing a second task while the timer is stopped,
> so that the scheduler tick is rearmed.

Shouldn't you first put the syscall hooks in place before allowing the
tick to be switched off? It seems this patch is somewhat too early in
the series.

> This assumes we are using TTWU_QUEUE sched feature so I need
> to handle the off case (or actually not handle it but properly),
> because we need the adaptive tick restart and what will come
> along in further patches to be done locally and before the new
> task ever gets scheduled.

We could certainly remove that feature flag and always use it, it was
mostly a transition debug switch in case something didn't work or
performance issues were found due to this.

> I also need to look at the ARCH_WANT_INTERRUPTS_ON_CTXSW case
> and the remote wakeups.

Well, ideally we'd simply get rid of that, rmk has some preliminary
patches in that direction.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset
  2011-08-15 15:52 ` [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset Frederic Weisbecker
  2011-08-29 15:25   ` Peter Zijlstra
@ 2011-08-29 15:28   ` Peter Zijlstra
  2011-08-29 18:02     ` Frederic Weisbecker
  2011-08-29 15:32   ` Peter Zijlstra
  2 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 15:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> +bool cpuset_nohz_can_stop_tick(void)
> +{
> +       struct rq *rq;
> +
> +       rq = this_rq();
> +
> +       /* More than one running task needs preemption */
> +       if (rq->nr_running > 1)
> +               return false;
> +
> +       return true;
> +} 

int sched_needs_cpu(int cpu), seems the right name, matches the existing
{rcu,printk,arch}_needs_cpu() functions.
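
[ Editor's sketch of the rename Peter suggests, matching the
  *_needs_cpu() convention; the per-cpu runqueue lookup is stubbed with
  a plain array, and the name sched_needs_cpu() is hypothetical: ]

```c
#include <stdbool.h>

#define NR_CPUS 4

/* Stub for rq->nr_running on each CPU. */
static unsigned int nr_running[NR_CPUS];

static bool sched_needs_cpu(int cpu)
{
	/* More than one runnable task: the tick is needed for preemption. */
	return nr_running[cpu] > 1;
}
```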

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset
  2011-08-15 15:52 ` [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset Frederic Weisbecker
  2011-08-29 15:25   ` Peter Zijlstra
  2011-08-29 15:28   ` Peter Zijlstra
@ 2011-08-29 15:32   ` Peter Zijlstra
  2 siblings, 0 replies; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 15:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> +static void cpuset_nohz_restart_tick(void)
> +{
> +       __get_cpu_var(task_nohz_mode) = 0;
> +       tick_nohz_restart_sched_tick();
> +}
> +
> +void cpuset_update_nohz(void)
> +{
> +       if (!tick_nohz_adaptive_mode())
> +               return;
> +
> +       if (!cpuset_nohz_can_stop_tick())
> +               cpuset_nohz_restart_tick();
> +}
> +#endif
> +
>  static void
>  ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
>  {
> @@ -2560,6 +2592,8 @@ static void sched_ttwu_pending(void)
>                 ttwu_do_activate(rq, p, 0);
>         }
>  
> +       cpuset_update_nohz();
> +
>         raw_spin_unlock(&rq->lock);
>  }
>  
> @@ -2620,6 +2654,7 @@ static void ttwu_queue(struct task_struct *p, int cpu)
>  
>         raw_spin_lock(&rq->lock);
>         ttwu_do_activate(rq, p, 0);
> +       cpuset_update_nohz();
>         raw_spin_unlock(&rq->lock);
>  }

That really has nothing to do with cpusets, why doesn't that live in the
tick_nohz_* namespace? Something like tick_nohz_sched_wakeup() or so.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 14/32] nohz/cpuset: Don't turn off the tick if rcu needs it
  2011-08-15 15:52 ` [PATCH 14/32] nohz/cpuset: Don't turn off the tick if rcu needs it Frederic Weisbecker
  2011-08-16 20:13   ` Paul E. McKenney
@ 2011-08-29 15:36   ` Peter Zijlstra
  1 sibling, 0 replies; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 15:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> @@ -2446,6 +2447,19 @@ bool cpuset_nohz_can_stop_tick(void)
>         if (rq->nr_running > 1)
>                 return false;
>  
> +       cpu = smp_processor_id();
> +
> +       /*
> +        * FIXME: will probably be removed soon as it's
> +        * already checked from tick_nohz_stop_sched_tick()
> +        */
> +       if (rcu_needs_cpu(cpu))
> +               return false;
> +
> +       /* Is there a grace period to complete ? */
> +       if (rcu_pending(cpu))
> +               return false;
> +
>         return true;
>  } 

This really shouldn't live in sched.c, also I would have expected
tick_nohz_can_stop_tick() to do this, its named like it does this.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 15/32] nohz/cpuset: Restart tick when switching to idle task
  2011-08-15 15:52 ` [PATCH 15/32] nohz/cpuset: Restart tick when switching to idle task Frederic Weisbecker
@ 2011-08-29 15:43   ` Peter Zijlstra
  2011-08-30 15:04     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 15:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> Ideally if we are in adaptive nohz mode and we switch to the
> the idle task, we shouldn't restart the tick since it's going
> to stop the tick soon anyway.
> 
> That optimization requires some minor tweaks here and there
> though, lets handle that later. 

You have a knack for confusing changelogs.. so basically you say:

  Restart the tick when we switch to idle.

Now all that needs is an explanation of why..

Also, please drop the whole cpuset_nohz_ stuff, this really isn't about
cpusets; cpusets simply provide the interface, the functionality lives in
the tick_nohz_ namespace.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 16/32] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued
  2011-08-15 15:52 ` [PATCH 16/32] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued Frederic Weisbecker
@ 2011-08-29 15:51   ` Peter Zijlstra
  2011-08-29 15:55   ` Peter Zijlstra
  1 sibling, 0 replies; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 15:51 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> Wake up a CPU when a timer list timer is enqueued there and
> the CPU is in adaptive nohz mode. Sending an IPI to it makes
> it reconsider the next timer to program on top of recent
> updates.
> 

>  include/linux/sched.h    |    4 ++--
>  kernel/sched.c           |   33 ++++++++++++++++++++++++++++++++-
>  kernel/time/tick-sched.c |    5 ++++-
>  kernel/timer.c           |    2 +-
>  4 files changed, 39 insertions(+), 5 deletions(-) 

So here I would have expected timer_needs_cpu() and an addition to
tick_nohz_can_stop_tick(). Why does sched.c get touched at all?

Also, all the delta_jiffies stuff in the current
tick_nohz_stop_sched_tick() deals with this, why duplicate the logic?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 16/32] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued
  2011-08-15 15:52 ` [PATCH 16/32] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued Frederic Weisbecker
  2011-08-29 15:51   ` Peter Zijlstra
@ 2011-08-29 15:55   ` Peter Zijlstra
  2011-08-30 15:06     ` Frederic Weisbecker
  1 sibling, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 15:55 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-29 at 17:51 +0200, Peter Zijlstra wrote:
> 
> Also, all the delta_jiffies stuff in the current
> tick_nohz_stop_sched_tick() deals with this, why duplicate the logic?
> 
Damn, n/m this is about new timers.. tick_nohz_new_timer() then.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 18/32] nohz/cpuset: Don't stop the tick if posix cpu timers are running
  2011-08-15 15:52 ` [PATCH 18/32] nohz/cpuset: Don't stop the tick if posix cpu timers are running Frederic Weisbecker
@ 2011-08-29 15:59   ` Peter Zijlstra
  0 siblings, 0 replies; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 15:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> If either a per thread or a per process posix cpu timer is running,
> don't stop the tick.
> 
> TODO: restart the tick if it is stopped and a posix cpu timer is
> enqueued. Check we probably need a memory barrier for the per
> process posix timer that can be enqueued from another task
> of the group.


it would

> +++ b/kernel/sched.c
> @@ -71,6 +71,7 @@
>  #include <linux/ctype.h>
>  #include <linux/ftrace.h>
>  #include <linux/slab.h>
> +#include <linux/posix-timers.h>
>  
>  #include <asm/tlb.h>
>  #include <asm/irq_regs.h>
> @@ -2491,6 +2492,9 @@ bool cpuset_nohz_can_stop_tick(void)
>  	if (rcu_pending(cpu))
>  		return false;
>  
> +	if (posix_cpu_timers_running(current))
> +		return false;
> +
>  	return true;
>  }

Doesn't belong here, go poke at tick_nohz_can_stop_tick().

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 19/32] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset
  2011-08-15 15:52 ` [PATCH 19/32] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset Frederic Weisbecker
@ 2011-08-29 16:02   ` Peter Zijlstra
  2011-08-30 15:10     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 16:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> +++ b/kernel/cpuset.c
> @@ -1199,6 +1199,14 @@ static void cpuset_change_flag(struct task_struct *tsk,
>  
>  DEFINE_PER_CPU(int, cpu_adaptive_nohz_ref);
>  
> +static void cpu_exit_nohz(int cpu)
> +{
> +       preempt_disable();
> +       smp_call_function_single(cpu, cpuset_exit_nohz_interrupt,
> +                                NULL, true);
> +       preempt_enable();
> +}
> +
>  static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
>  {
>         int cpu;
> @@ -1212,6 +1220,19 @@ static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
>                         per_cpu(cpu_adaptive_nohz_ref, cpu) += 1;
>                 else
>                         per_cpu(cpu_adaptive_nohz_ref, cpu) -= 1;
> +
> +               val = per_cpu(cpu_adaptive_nohz_ref, cpu);
> +
> +               if (!val) {
> +                       /*
> +                        * The update to cpu_adaptive_nohz_ref must be
> +                        * visible right away. So that once we restart the tick
> +                        * from the IPI, it won't be stopped again due to cache
> +                        * update lag.
> +                        */
> +                       smp_mb();
> +                       cpu_exit_nohz(cpu);
> +               }
>         }
>  }
>  #else
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 78ea0a5..75378be 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2513,6 +2513,14 @@ void cpuset_update_nohz(void)
>                 cpuset_nohz_restart_tick();
>  }
>  
> +void cpuset_exit_nohz_interrupt(void *unused)
> +{
> +       if (!tick_nohz_adaptive_mode())
> +               return;
> +
> +       cpuset_nohz_restart_tick();
> +} 

You do this just to annoy me, right? Why doesn't it live in cpuset.c
where you use it?
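
[ Editor's sketch of the ordering the quoted smp_mb() comment relies
  on: the refcount update must be visible before the IPI handler runs,
  or the target CPU could re-stop its tick against a stale value.
  Modeled here with C11 atomics and a direct call standing in for the
  IPI; all names are illustrative: ]

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int cpu_adaptive_nohz_ref;
static bool tick_stopped = true;

/* Stand-in for the IPI handler running on the target CPU. */
static void cpuset_exit_nohz_interrupt(void)
{
	/* The handler re-reads the ref; it must observe the new value. */
	if (atomic_load(&cpu_adaptive_nohz_ref) == 0)
		tick_stopped = false;	/* restart the tick */
}

static void clear_nohz_flag(void)
{
	atomic_fetch_sub(&cpu_adaptive_nohz_ref, 1);
	/* seq_cst atomics give the smp_mb() ordering from the patch:
	 * publish the count before poking the remote CPU. */
	if (atomic_load(&cpu_adaptive_nohz_ref) == 0)
		cpuset_exit_nohz_interrupt();	/* stand-in for the IPI */
}
```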

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 02/32 RESEND] nohz: Drop ts->idle_active
  2011-08-29 14:23   ` Peter Zijlstra
@ 2011-08-29 16:15     ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-29 16:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 04:23:12PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:51 +0200, Frederic Weisbecker wrote:
> > ts->idle_active is used to know if we want to account the idle sleep
> > time. But ts->inidle is enough to check that.
> > 
> While possibly true, its not immediately obvious and no hints are
> supplied. For example: tick_check_nohz() would disable ->idle_active..
> where is this mirrored in the ->inidle state.

Hmm, you're right. By the time we call tick_check_nohz() (irq_enter())
and tick_nohz_stop_sched_tick() (irq_exit()) there may be a softirq
and then another hard irq that could update the idle time spuriously.

So the mapping inidle - idle_active is wrong.

Let's drop that patch.

> 
> Also, tick_nohz_stop_sched_tick() has this comment:
> 
> 	/*
> 	 * Set ts->inidle unconditionally. Even if the system did not
> 	 * switch to NOHZ mode the cpu frequency governers rely on the
> 	 * update of the idle time accounting in tick_nohz_start_idle().
> 	 */
> 	ts->inidle = 1;
> 
> Which suggests the ->inidle state doesn't accurately reflect things.
> 
> This is all rather hairy code, such changes really want more in terms of
> explanation.
> 
> 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 04/32] nohz: Separate idle sleeping time accounting from nohz switching
  2011-08-29 14:23   ` Peter Zijlstra
@ 2011-08-29 16:32     ` Frederic Weisbecker
  2011-08-29 17:44       ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-29 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 04:23:19PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > To prepare for having nohz mode switching independent from idle,
> > pull the idle sleeping time accounting out of the tick stop API.
> > 
> > This implies to implement some new API to call when we
> > enter/exit idle. 
> 
> I mean, I really love brevity, but you seem to just not state all the
> important bits ;-)
> 
> So the goal is to disable the tick more often (say when running 1
> userbound task), why does that need new hooks? If we already had the
> tick disabled, the tick_nohz_stop_sched_tick() call on going idle will
> simply not do anything.
> 
> If we go from idle to running something we want to enable the tick
> initially because doing the task wakeup involves RCU etc.. Once we find
> the task is indeed userbound and we've finished all our state we can
> disable the thing again.

That's because we are going to have two different sources of stop/restarting
the tick: either idle or a random task. In the case of idle we have very
specific things to handle like idle time accounting, idle stats, rcu, ...

I could do these things conditionally using some idle_cpu() checks but
the end result would not be very proper.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 03/32 RESEND] nohz: Drop useless ts->inidle check before rearming the tick
  2011-08-29 14:23   ` Peter Zijlstra
@ 2011-08-29 16:58     ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-29 16:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 04:23:15PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > We only need to check if we have ts->stopped to ensure the tick
> > was stopped and we want to re-enable it. Checking ts->inidle
> > there is useless. 
> 
> /me goes la-la-la-la... 
> 
> It would so help poor little me who hasn't stared at this code in detail
> for the past several days and is thus horridly confused if you'd expand
> your reasoning somewhat.

Sorry, I'm no big fan of writing changelogs and sometimes the lack
of them is unfortunately visible :)

It needs to be refactored due to the previous patch being broken.
But the rationale, indeed missing here, is that if you have ts->stopped
then you have ts->inidle. Once you entered tick_nohz_stop_sched_tick()
you have ts->inidle set and only once you reached that step the tick
can be stopped, so the following check:

	if (!ts->inidle || !ts->tick_stopped)

can be summed up with:

	if (!ts->tick_stopped)


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 04/32] nohz: Separate idle sleeping time accounting from nohz switching
  2011-08-29 14:23   ` Peter Zijlstra
@ 2011-08-29 17:01     ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-29 17:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Mon, Aug 29, 2011 at 04:23:29PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > +extern void tick_nohz_exit_idle(void);
> > +extern void tick_nohz_irq_exit(void); 
> 
> For consistencies sake, pick either tick_nohz_exit_*() or
> tick_nohz_*_exit() but don't go mix and match (I prefer the former).

Agreed, I will.

> Also, there isn't a matching irq_enter() callback..

Yeah, probably tick_check_nohz() should be renamed accordingly.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 01/32 RESEND] nohz: Drop useless call in tick_nohz_start_idle()
  2011-08-29 14:23   ` Peter Zijlstra
@ 2011-08-29 17:10     ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-29 17:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Mon, Aug 29, 2011 at 04:23:10PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:51 +0200, Frederic Weisbecker wrote:
> > The call to update_ts_time_stats() there is useless. All
> > we need is to save the idle entry_time.
> > 
> > 
> Would have been clearer if you just said the call was a NOP. The whole
> second sentence distracts and confuses as its irrelevant to the change
> at hand.
> 
> If you want to expand you can explain that its a NOP because
> update_ts_time_stats() requires either ts->idle_active and/or
> @last_update_time and our callsite has neither.

Right, will fix the changelog.

> 
> Although this assumes its never called when ts->idle_active is already
> set, is this so (likely)? Do we want a WARN_ON_ONCE() testing that
> assumption?

Not sure. Looking at the ondemand cpufreq governor, it calls
get_cpu_idle_time_us() from an initcall.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-29 14:25   ` Peter Zijlstra
@ 2011-08-29 17:11     ` Frederic Weisbecker
  2011-08-29 17:49       ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-29 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 04:25:22PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > To prepare for nohz / idle logic split, pull out the rcu dynticks
> > idle mode switching to strict idle entry/exit areas.
> > 
> > So we make the dyntick mode possible without always involving rcu
> > extended quiescent state. 
> 
> Why is this a good thing? I would be thinking that if we're a userspace
> bound task and we disable the tick rcu would be finished on this cpu and
> thus the extended quiescent state is just what we want?

But we can stop the tick from the kernel, not just userspace.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/32] nohz: Move ts->idle_calls into strict idle logic
  2011-08-29 14:47   ` Peter Zijlstra
@ 2011-08-29 17:34     ` Frederic Weisbecker
  2011-08-29 17:59       ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-29 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 04:47:47PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > +static bool tick_nohz_can_stop_tick(int cpu, struct tick_sched *ts)
> > +{
> > +       /*
> > +        * If this cpu is offline and it is the one which updates
> > +        * jiffies, then give up the assignment and let it be taken by
> > +        * the cpu which runs the tick timer next. If we don't drop
> > +        * this here the jiffies might be stale and do_timer() never
> > +        * invoked.
> > +        */
> > +       if (unlikely(!cpu_online(cpu))) {
> > +               if (cpu == tick_do_timer_cpu)
> > +                       tick_do_timer_cpu = TICK_DO_TIMER_NONE;
> > +       }
> > +
> > +       if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
> > +               return false;
> > +
> > +       if (need_resched())
> > +               return false;
> > +
> > +       if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
> > +               static int ratelimit;
> > +
> > +               if (ratelimit < 10) {
> > +                       printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
> > +                              (unsigned int) local_softirq_pending());
> > +                       ratelimit++;
> > +               }
> > +               return false;
> > +       }
> > +
> > +       return true;
> > +} 
> 
> Why aren't rcu_needs_cpu(), printk_needs_cpu() and arch_needs_cpu() not
> in there?
> 
> That are typical 'can we go sleep now?' functions.

Because when one of these functions is positive, the ts->next_jiffies and
ts->last_jiffies stats are updated. Not with the above.
Also I start to think the above checks are only useful in the idle case.

We still want tick_nohz_stop_sched_tick() to have the *needs_cpu() checks
so that they can restore a HZ periodic behaviour on interrupt return if
needed.

That said I wonder if some of the above conditions should restore a periodic
behaviour on interrupt return...

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 04/32] nohz: Separate idle sleeping time accounting from nohz switching
  2011-08-29 16:32     ` Frederic Weisbecker
@ 2011-08-29 17:44       ` Peter Zijlstra
  2011-08-29 22:53         ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 17:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-29 at 18:32 +0200, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2011 at 04:23:19PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > To prepare for having nohz mode switching independent from idle,
> > > pull the idle sleeping time accounting out of the tick stop API.
> > > 
> > > This implies to implement some new API to call when we
> > > enter/exit idle. 
> > 
> > I mean, I really love brevity, but you seem to just not state all the
> > important bits ;-)
> > 
> > So the goal is to disable the tick more often (say when running 1
> > userbound task), why does that need new hooks? If we already had the
> > tick disabled, the tick_nohz_stop_sched_tick() call on going idle will
> > simply not do anything.
> > 
> > If we go from idle to running something we want to enable the tick
> > initially because doing the task wakeup involves RCU etc.. Once we find
> > the task is indeed userbound and we've finished all our state we can
> > disable the thing again.
> 
> That's because we are going to have two different sources of stop/restarting
> the tick: either idle or a random task. In the case of idle we have very
> specific things to handle like idle time accounting, idle stats, rcu, ...
> 
> I could do these things conditionally using a some idle_cpu() checks but
> the end result would not be very proper.

Right, but you didn't explain any of that in the changelog. So the
reasoning is that because tick_nohz_stop_sched_tick() does more than
just stop the tick, and this extra work needs to be isolated to just the
idle case, therefore we need hooks specific for the idle loop.



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-29 17:11     ` Frederic Weisbecker
@ 2011-08-29 17:49       ` Peter Zijlstra
  2011-08-29 17:59         ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 17:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-29 at 19:11 +0200, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2011 at 04:25:22PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > To prepare for nohz / idle logic split, pull out the rcu dynticks
> > > idle mode switching to strict idle entry/exit areas.
> > > 
> > > So we make the dyntick mode possible without always involving rcu
> > > extended quiescent state. 
> > 
> > Why is this a good thing? I would be thinking that if we're a userspace
> > bound task and we disable the tick rcu would be finished on this cpu and
> > thus the extended quiescent state is just what we want?
> 
> But we can stop the tick from the kernel, not just userspace.

Humm!? I'm confused, I thought the idea was to only stop the tick when
we're 'stuck' in a user bound task. Now I get that we have to stop the
tick from kernel space (as in the interrupt will clearly run in kernel
space), but assuming the normal return from interrupt path doesn't use
rcu, and using rcu (as per a later patch) re-enables the tick again, it
doesn't matter, right?

Also, RCU needs the tick to drive the state machine, so how can you stop
the tick and not also stop the RCU state machine?



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/32] nohz: Move ts->idle_calls into strict idle logic
  2011-08-29 17:34     ` Frederic Weisbecker
@ 2011-08-29 17:59       ` Peter Zijlstra
  2011-08-29 18:23         ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 17:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-29 at 19:34 +0200, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2011 at 04:47:47PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > +static bool tick_nohz_can_stop_tick(int cpu, struct tick_sched *ts)
> > > +{
> > > +       /*
> > > +        * If this cpu is offline and it is the one which updates
> > > +        * jiffies, then give up the assignment and let it be taken by
> > > +        * the cpu which runs the tick timer next. If we don't drop
> > > +        * this here the jiffies might be stale and do_timer() never
> > > +        * invoked.
> > > +        */
> > > +       if (unlikely(!cpu_online(cpu))) {
> > > +               if (cpu == tick_do_timer_cpu)
> > > +                       tick_do_timer_cpu = TICK_DO_TIMER_NONE;
> > > +       }
> > > +
> > > +       if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
> > > +               return false;
> > > +
> > > +       if (need_resched())
> > > +               return false;
> > > +
> > > +       if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
> > > +               static int ratelimit;
> > > +
> > > +               if (ratelimit < 10) {
> > > +                       printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
> > > +                              (unsigned int) local_softirq_pending());
> > > +                       ratelimit++;
> > > +               }
> > > +               return false;
> > > +       }
> > > +
> > > +       return true;
> > > +} 
> > 
> > Why aren't rcu_needs_cpu(), printk_needs_cpu() and arch_needs_cpu() not
> > in there?
> > 
> > That are typical 'can we go sleep now?' functions.
> 
> Because when one of these functions are positive, the ts->next_jiffies and
> ts->last_jiffies stats are updated. Not with the above.
> Also I start to think the above checks are only useful in the idle case.

Then call it tick_nohz_can_stop_tick_idle() or so, and create
tick_nohz_can_stop_tick() to deal with all stuff.

> We still want tick_nohz_stop_sched_tick() to have the *needs_cpu() checks
> so that they can restore a HZ periodic behaviour on interrupt return if
> needed.

Well, no, on interrupt return you shouldn't do anything. If you've
stopped the tick it stays stopped until you do something that needs it,
then that action will re-enable it.

> That said I wonder if some of the above conditions should restore a periodic
> behaviour on interrupt return...

I would expect the tick not to be stopped when tick_nohz_can_stop_tick()
returns false. If it returns true, then I expect anything that needs it
to re-enable it.



* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-29 17:49       ` Peter Zijlstra
@ 2011-08-29 17:59         ` Frederic Weisbecker
  2011-08-29 18:06           ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-29 17:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 07:49:15PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-29 at 19:11 +0200, Frederic Weisbecker wrote:
> > On Mon, Aug 29, 2011 at 04:25:22PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > > To prepare for nohz / idle logic split, pull out the rcu dynticks
> > > > idle mode switching to strict idle entry/exit areas.
> > > > 
> > > > So we make the dyntick mode possible without always involving rcu
> > > > extended quiescent state. 
> > > 
> > > Why is this a good thing? I would be thinking that if we're a userspace
> > > bound task and we disable the tick rcu would be finished on this cpu and
> > > thus the extended quiescent state is just what we want?
> > 
> > But we can stop the tick from the kernel, not just userspace.
> 
> Humm!? I'm confused, I thought the idea was to only stop the tick when
> we're 'stuck' in a user bound task. Now I get that we have to stop the
> tick from kernel space (as in the interrupt will clearly run in kernel
> space), but assuming the normal return from interrupt path doesn't use
> rcu, and using rcu (as per a later patch) re-enables the tick again, it
> doesn't matter, right?

Yeah. Either the interrupt returns to userspace, in which case we call
rcu_enter_nohz(), or we return to kernel space, in which case a further
use of rcu will restart the tick.

That said, not every use of rcu restarts it. RCU read-side critical
sections don't need the tick. But we do need it as long as there is an
RCU callback enqueued on some CPU.

> Also, RCU needs the tick to drive the state machine, so how can you stop
> the tick and not also stop the RCU state machine?

This is why we have rcu_needs_cpu() and rcu_pending() checks before
stopping the tick.

rcu_needs_cpu() checks whether we have a local callback enqueued, in
which case the local CPU is responsible for driving the RCU state
machine.

rcu_pending() tells us whether another CPU has started a grace period,
in which case we need the tick to help complete it.
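[Editor's note: the gating described above can be modeled in a few lines of
plain userspace C. The *_model() helpers below are illustrative stand-ins for
the kernel's rcu_needs_cpu() and rcu_pending(), not their real implementations.]

```c
#include <assert.h>
#include <stdbool.h>

/* Modeled per-CPU RCU state; the real kernel tracks this per CPU. */
static bool local_callbacks_queued;	/* input to the rcu_needs_cpu() model */
static bool grace_period_started;	/* input to the rcu_pending() model */

/* True when this CPU has callbacks and must drive the state machine. */
static bool rcu_needs_cpu_model(void)
{
	return local_callbacks_queued;
}

/* True when another CPU started a grace period this CPU must help finish. */
static bool rcu_pending_model(void)
{
	return grace_period_started;
}

/* The tick may be stopped only when RCU has no work for this CPU. */
static bool rcu_allows_tick_stop(void)
{
	return !rcu_needs_cpu_model() && !rcu_pending_model();
}
```

Either condition alone is enough to keep the tick running in this model.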


* Re: [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset
  2011-08-29 15:28   ` Peter Zijlstra
@ 2011-08-29 18:02     ` Frederic Weisbecker
  2011-08-29 18:07       ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-29 18:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 05:28:09PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > +bool cpuset_nohz_can_stop_tick(void)
> > +{
> > +       struct rq *rq;
> > +
> > +       rq = this_rq();
> > +
> > +       /* More than one running task need preemption */
> > +       if (rq->nr_running > 1)
> > +               return false;
> > +
> > +       return true;
> > +} 
> 
> int sched_needs_cpu(int cpu), seems the right name, matches the existing
> {rcu,printk,arch}_needs_cpu() functions.

tick_nohz_stop_sched_tick() already handles that by keeping a periodic
behaviour if one of these conditions is met.

It also has the upside of restoring the periodic behaviour, if needed,
on irq return when the tick was stopped.
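[Editor's note: the nr_running check quoted above, under Peter's suggested
sched_needs_cpu() naming, boils down to the following one-liner. The model
function is hypothetical, for illustration only, and is not mainline code.]

```c
#include <assert.h>
#include <stdbool.h>

/* With more than one runnable task on the runqueue, the tick is still
 * needed for preemption, so the scheduler "needs the cpu" to keep ticking. */
static bool sched_needs_cpu_model(unsigned int nr_running)
{
	return nr_running > 1;
}
```

A lone running task (or an idle CPU) lets the tick be stopped in this model.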


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-29 17:59         ` Frederic Weisbecker
@ 2011-08-29 18:06           ` Peter Zijlstra
  2011-08-29 23:35             ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 18:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-29 at 19:59 +0200, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2011 at 07:49:15PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-08-29 at 19:11 +0200, Frederic Weisbecker wrote:
> > > On Mon, Aug 29, 2011 at 04:25:22PM +0200, Peter Zijlstra wrote:
> > > > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > > > To prepare for nohz / idle logic split, pull out the rcu dynticks
> > > > > idle mode switching to strict idle entry/exit areas.
> > > > > 
> > > > > So we make the dyntick mode possible without always involving rcu
> > > > > extended quiescent state. 
> > > > 
> > > > Why is this a good thing? I would be thinking that if we're a userspace
> > > > bound task and we disable the tick rcu would be finished on this cpu and
> > > > thus the extended quiescent state is just what we want?
> > > 
> > > But we can stop the tick from the kernel, not just userspace.
> > 
> > Humm!? I'm confused, I thought the idea was to only stop the tick when
> > we're 'stuck' in a user bound task. Now I get that we have to stop the
> > tick from kernel space (as in the interrupt will clearly run in kernel
> > space), but assuming the normal return from interrupt path doesn't use
> > rcu, and using rcu (as per a later patch) re-enables the tick again, it
> > doesn't matter, right?
> 
> Yeah. Either the interrupt returns to userspace and then we call
> rcu_enter_nohz() or we return to kernel space and then a further
> use of rcu will restart the tick.
> 
> Now this is not any use of rcu. Uses of rcu read side critical section
> don't need the tick. 

But but but, then how is it going to stop a grace period from happening?
The grace period state is per-cpu and the whole state machine is tick
driven.

Now some of the new RCU things go kick cpus with IPIs to push grace
periods along, but I would expect you don't want that to happen either,
the whole purpose here is to leave a cpu alone, unperturbed.

That means it has to be in an extended grace period when we stop the
tick.

> But we need it as long as there is an RCU callback
> enqueued on some CPU.

Well, no, only if there's one enqueued on this cpu because then we can't
enter the extended grace period.

> > Also, RCU needs the tick to drive the state machine, so how can you stop
> > the tick and not also stop the RCU state machine?
> 
> This is why we have rcu_needs_cpu() and rcu_pending() checks before
> stopping the tick.
> 
> rcu_needs_cpu() checks we have no local callback enqueued, in which
> case the local CPU is responsible of the RCU state machine.
> 
> rcu_pending() is there to know if another CPU started a grace period
> so we need the tick to complete it.

Hence the extended grace period, so we don't need to complete grace
periods.


* Re: [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset
  2011-08-29 18:02     ` Frederic Weisbecker
@ 2011-08-29 18:07       ` Peter Zijlstra
  2011-08-29 18:28         ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 18:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-29 at 20:02 +0200, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2011 at 05:28:09PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > +bool cpuset_nohz_can_stop_tick(void)
> > > +{
> > > +       struct rq *rq;
> > > +
> > > +       rq = this_rq();
> > > +
> > > +       /* More than one running task need preemption */
> > > +       if (rq->nr_running > 1)
> > > +               return false;
> > > +
> > > +       return true;
> > > +} 
> > 
> > int sched_needs_cpu(int cpu), seems the right name, matches the existing
> > {rcu,printk,arch}_needs_cpu() functions.
> 
> tick_nohz_stop_sched_tick() already handles that by keeping a periodic
> behaviour if one of these conditions are met.

What? tick_nohz_stop_sched_tick() most surely cannot access struct rq,
so it cannot do the nr_running test.

> It has also the upside to restore the periodic behaviour if needed
> from irq return if the tick was stopped.

Again, what?


* Re: [PATCH 09/32] nohz: Move ts->idle_calls into strict idle logic
  2011-08-29 17:59       ` Peter Zijlstra
@ 2011-08-29 18:23         ` Frederic Weisbecker
  2011-08-29 18:33           ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-29 18:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 07:59:49PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-29 at 19:34 +0200, Frederic Weisbecker wrote:
> > On Mon, Aug 29, 2011 at 04:47:47PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > > +static bool tick_nohz_can_stop_tick(int cpu, struct tick_sched *ts)
> > > > +{
> > > > +       /*
> > > > +        * If this cpu is offline and it is the one which updates
> > > > +        * jiffies, then give up the assignment and let it be taken by
> > > > +        * the cpu which runs the tick timer next. If we don't drop
> > > > +        * this here the jiffies might be stale and do_timer() never
> > > > +        * invoked.
> > > > +        */
> > > > +       if (unlikely(!cpu_online(cpu))) {
> > > > +               if (cpu == tick_do_timer_cpu)
> > > > +                       tick_do_timer_cpu = TICK_DO_TIMER_NONE;
> > > > +       }
> > > > +
> > > > +       if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
> > > > +               return false;
> > > > +
> > > > +       if (need_resched())
> > > > +               return false;
> > > > +
> > > > +       if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
> > > > +               static int ratelimit;
> > > > +
> > > > +               if (ratelimit < 10) {
> > > > +                       printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
> > > > +                              (unsigned int) local_softirq_pending());
> > > > +                       ratelimit++;
> > > > +               }
> > > > +               return false;
> > > > +       }
> > > > +
> > > > +       return true;
> > > > +} 
> > > 
> > > Why aren't rcu_needs_cpu(), printk_needs_cpu() and arch_needs_cpu() not
> > > in there?
> > > 
> > > That are typical 'can we go sleep now?' functions.
> > 
> > Because when one of these functions are positive, the ts->next_jiffies and
> > ts->last_jiffies stats are updated. Not with the above.
> > Also I start to think the above checks are only useful in the idle case.
> 
> Then call it tick_nohz_can_stop_tick_idle() or so, and create
> tick_nohz_can_stop_tick() to deal with all stuff.

Yeah I need to have a deeper look into these checks.

> 
> > We still want tick_nohz_stop_sched_tick() to have the *needs_cpu() checks
> > so that they can restore a HZ periodic behaviour on interrupt return if
> > needed.
> 
> Well, no, on interrupt return you shouldn't do anything. If you've
> stopped the tick it stays stopped until you do something that needs it,
> then that action will re-enable it.

Sure, when something needs the tick in this mode we usually receive an
IPI and restart the tick from there. But then, on interrupt return (our
IPI return), tick_nohz_stop_sched_tick() handles the *needs_cpu() cases
very well by entering a kind of "light" HZ mode: it logically switches
to nohz mode but programs the next timer one tick away, assuming the
need is a matter of one tick and we will switch to a real nohz
behaviour soon after.

I don't see a good reason to duplicate that logic with a pure
restart from the IPI.
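[Editor's note: a rough userspace sketch of the "light" HZ mode described
above, assuming a 1000 Hz tick. The structure and constant are illustrative,
not the kernel's.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_PERIOD_NS 1000000ULL	/* one tick at HZ=1000, illustrative */

struct tick_decision {
	bool logically_nohz;	/* bookkeeping says the tick is "stopped" */
	uint64_t next_event_ns;	/* when the next timer interrupt fires */
};

/* If some *_needs_cpu() condition holds, switch the bookkeeping to nohz
 * mode anyway, but program the next event just one tick away, expecting
 * to reach real nohz behaviour shortly after. */
static struct tick_decision stop_tick_model(uint64_t now_ns, bool needs_cpu,
					    uint64_t next_timer_ns)
{
	struct tick_decision d = { .logically_nohz = true };

	d.next_event_ns = needs_cpu ? now_ns + TICK_PERIOD_NS : next_timer_ns;
	return d;
}
```

Both paths end up logically in nohz mode; only the programmed deadline differs.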

> > That said I wonder if some of the above conditions should restore a periodic
> > behaviour on interrupt return...
> 
> I would expect the tick not to be stopped when tick_nohz_can_stop_tick()
> returns false. If it returns true, then I expect anything that needs it
> to re-enable it.
> 

Yeah. In the case of need_resched() in idle I believe the CPU doesn't
really go to sleep later so it should be fine. But for the case of
softirq pending or nohz_mode, I'm not sure...


* Re: [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset
  2011-08-29 18:07       ` Peter Zijlstra
@ 2011-08-29 18:28         ` Frederic Weisbecker
  2011-08-30 12:44           ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-29 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Mon, Aug 29, 2011 at 08:07:10PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-29 at 20:02 +0200, Frederic Weisbecker wrote:
> > On Mon, Aug 29, 2011 at 05:28:09PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > > +bool cpuset_nohz_can_stop_tick(void)
> > > > +{
> > > > +       struct rq *rq;
> > > > +
> > > > +       rq = this_rq();
> > > > +
> > > > +       /* More than one running task need preemption */
> > > > +       if (rq->nr_running > 1)
> > > > +               return false;
> > > > +
> > > > +       return true;
> > > > +} 
> > > 
> > > int sched_needs_cpu(int cpu), seems the right name, matches the existing
> > > {rcu,printk,arch}_needs_cpu() functions.
> > 
> > tick_nohz_stop_sched_tick() already handles that by keeping a periodic
> > behaviour if one of these conditions are met.
> 
> What? tick_nohz_stop_sched_tick() most surely cannot access struct rq,
> so it cannot do the nr_running test.

I was talking about {rcu,printk,arch}_needs_cpu() functions.

> 
> > It has also the upside to restore the periodic behaviour if needed
> > from irq return if the tick was stopped.
> 
> Again, what?

If the tick is stopped, then an irq fires and something calls printk()
or call_rcu(); on interrupt return, tick_nohz_stop_sched_tick() checks
that with {rcu,printk,arch}_needs_cpu() and restores a periodic
behaviour until nobody else needs the CPU.


* Re: [PATCH 09/32] nohz: Move ts->idle_calls into strict idle logic
  2011-08-29 18:23         ` Frederic Weisbecker
@ 2011-08-29 18:33           ` Peter Zijlstra
  2011-08-30 14:45             ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-29 18:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, 2011-08-29 at 20:23 +0200, Frederic Weisbecker wrote:

> > Well, no, on interrupt return you shouldn't do anything. If you've
> > stopped the tick it stays stopped until you do something that needs it,
> > then that action will re-enable it.
> 
> Sure, when something needs the tick in this mode, we usually
> receive an IPI and restart the tick from there but then
> tick_nohz_stop_sched_tick() handles the cases with *needs_cpu()
> very well on interrupt return (our IPI return) by doing a kind
> of "light" HZ mode by logically switching to nohz mode but
> with the next timer happening in HZ, assuming it's a matter
> of one tick and we will switch to a real nohz behaviour soon.
> 
> I don't see a good reason to duplicate that logic with a pure
> restart from the IPI.

That sounds like an optimization, and should thus be done later.

> > > That said I wonder if some of the above conditions should restore a periodic
> > > behaviour on interrupt return...
> > 
> > I would expect the tick not to be stopped when tick_nohz_can_stop_tick()
> > returns false. If it returns true, then I expect anything that needs it
> > to re-enable it.
> > 
> 
> Yeah. In the case of need_resched() in idle I believe the CPU doesn't
> really go to sleep later so it should be fine. But for the case of
> softirq pending or nohz_mode, I'm not sure...

softirqs shouldn't be pending when you go into nohz mode..

That is, I'm really not seeing what's wrong with the very simple:


  if (tick_nohz_can_stop_tick())
	tick_nohz_stop_tick();


and relying on everybody who invalidates tick_nohz_can_stop_tick(), to
do:

  tick_nohz_start_tick();

I'm also not quite sure why you always IPI, is that to avoid lock
inversions?
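[Editor's note: Peter's proposal amounts to a simple invariant, sketched here
as a toy model. The *_model() names shadow his sketch above and are not real
kernel APIs.]

```c
#include <assert.h>
#include <stdbool.h>

static bool tick_active = true;
static bool cpu_has_work;	/* any *_needs_cpu()-style condition */

static bool tick_nohz_can_stop_tick_model(void)
{
	return !cpu_has_work;
}

/* Caller side: stop the tick only when nothing needs it. */
static void idle_path_model(void)
{
	if (tick_nohz_can_stop_tick_model())
		tick_active = false;
}

/* Whoever invalidates the condition is responsible for restarting,
 * i.e. tick_nohz_start_tick() in Peter's sketch. */
static void queue_work_model(void)
{
	cpu_has_work = true;
	tick_active = true;
}
```

The key property: the stop path never re-enables anything, and the producers
of work always do.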


* Re: [PATCH 04/32] nohz: Separate idle sleeping time accounting from nohz switching
  2011-08-29 17:44       ` Peter Zijlstra
@ 2011-08-29 22:53         ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-29 22:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 07:44:01PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-29 at 18:32 +0200, Frederic Weisbecker wrote:
> > On Mon, Aug 29, 2011 at 04:23:19PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > > To prepare for having nohz mode switching independant from idle,
> > > > pull the idle sleeping time accounting out of the tick stop API.
> > > > 
> > > > This implies to implement some new API to call when we
> > > > enter/exit idle. 
> > > 
> > > I mean, I really love brevity, but you seem to just not state all the
> > > important bits ;-)
> > > 
> > > So the goal is to disable the tick more often (say when running 1
> > > userbound task), why does that need new hooks? If we already had the
> > > tick disabled, the tick_nohz_stop_sched_tick() call on going idle will
> > > simply not do anything.
> > > 
> > > If we go from idle to running something we want to enable the tick
> > > initially because doing the task wakeup involves RCU etc.. Once we find
> > > the task is indeed userbound and we've finished all our state we can
> > > disable the thing again.
> > 
> > That's because we are going to have two different sources of stop/restarting
> > the tick: either idle or a random task. In the case of idle we have very
> > specific things to handle like idle time accounting, idle stats, rcu, ...
> > 
> > I could do these things conditionally using some idle_cpu() checks but
> > the end result would not be very proper.
> 
> Right, but you didn't explain any of that in the changelog. So the
> reasoning is that because tick_nohz_stop_sched_tick() does more than
> just stop the tick, and this extra work needs to be isolated to just the
> idle case, therefore we need hooks specific for the idle loop.

Right, I'll update the changelog.


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-29 18:06           ` Peter Zijlstra
@ 2011-08-29 23:35             ` Frederic Weisbecker
  2011-08-30 11:17               ` Peter Zijlstra
                                 ` (2 more replies)
  0 siblings, 3 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-29 23:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 08:06:00PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-29 at 19:59 +0200, Frederic Weisbecker wrote:
> > On Mon, Aug 29, 2011 at 07:49:15PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2011-08-29 at 19:11 +0200, Frederic Weisbecker wrote:
> > > > On Mon, Aug 29, 2011 at 04:25:22PM +0200, Peter Zijlstra wrote:
> > > > > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > > > > To prepare for nohz / idle logic split, pull out the rcu dynticks
> > > > > > idle mode switching to strict idle entry/exit areas.
> > > > > > 
> > > > > > So we make the dyntick mode possible without always involving rcu
> > > > > > extended quiescent state. 
> > > > > 
> > > > > Why is this a good thing? I would be thinking that if we're a userspace
> > > > > bound task and we disable the tick rcu would be finished on this cpu and
> > > > > thus the extended quiescent state is just what we want?
> > > > 
> > > > But we can stop the tick from the kernel, not just userspace.
> > > 
> > > Humm!? I'm confused, I thought the idea was to only stop the tick when
> > > we're 'stuck' in a user bound task. Now I get that we have to stop the
> > > tick from kernel space (as in the interrupt will clearly run in kernel
> > > space), but assuming the normal return from interrupt path doesn't use
> > > rcu, and using rcu (as per a later patch) re-enables the tick again, it
> > > doesn't matter, right?
> > 
> > Yeah. Either the interrupt returns to userspace and then we call
> > rcu_enter_nohz() or we return to kernel space and then a further
> > use of rcu will restart the tick.
> > 
> > Now this is not any use of rcu. Uses of rcu read side critical section
> > don't need the tick. 
> 
> But but but, then how is it going to stop a grace period from happening?
> The grace period state is per-cpu and the whole state machine is tick
> driven.

But RCU read-side critical sections (preemption disabled, rcu_read_lock(),
softirq disabled) don't need the tick to enforce the critical section
itself.

OTOH the tick is needed to spot non-critical sections when we are asked
to cooperate in completing a grace period. But if no callbacks have been
enqueued on the whole system, we are fine.

> Now some of the new RCU things go kick cpus with IPIs to push grace
> periods along, but I would expect you don't want that to happen either,
> the whole purpose here is to leave a cpu alone, unperturbed.

Sure, we want the CPU to be unperturbed, but not if that sacrifices
correctness. As long as we run in the kernel we want to receive such
IPIs so we can restart the tick as needed.

> That means it has to be in an extended grace period when we stop the
> tick.

You mean extended quiescent state?

As a summary, here is what we do:

- If we are in the kernel, we can't enter the extended quiescent state
because we may make use of rcu at any time there. But if we run nohz, we
don't have the tick to report quiescent states to the RCU machinery and
help complete grace periods. So as soon as we receive an rcu IPI from
another CPU (sent because the grace period is being extended by our nohz
CPU not reporting quiescent states), we restart the tick. We are
optimistic enough to consider that we may still avoid a lot of ticks,
even at the risk of being disturbed at some random rate.
So even with the IPI we consider it an upside.

- If we are in userspace, we can run in the extended quiescent state.
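[Editor's note: the two-case summary above can be condensed into a toy
interrupt-exit model. This is illustrative userspace C, not the kernel's
irq exit path.]

```c
#include <assert.h>
#include <stdbool.h>

enum context { CTX_USER, CTX_KERNEL };

struct cpu_model {
	bool tick_stopped;
	bool in_rcu_eqs;	/* RCU extended quiescent state */
};

/* On interrupt exit with the tick stopped: resuming to userspace may
 * enter the RCU extended quiescent state; resuming to the kernel must
 * keep RCU watching, relying on a later RCU IPI to restart the tick. */
static void irq_exit_model(struct cpu_model *cpu, enum context resume_to)
{
	if (!cpu->tick_stopped)
		return;
	cpu->in_rcu_eqs = (resume_to == CTX_USER);
}
```

Only the userspace resume path gets to park RCU in this model.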

> 
> > But we need it as long as there is an RCU callback
> > enqueued on some CPU.
> 
> Well, no, only if there's one enqueued on this cpu because then we can't
> enter the extended grace period.

True if we are in userspace. 

> > > Also, RCU needs the tick to drive the state machine, so how can you stop
> > > the tick and not also stop the RCU state machine?
> > 
> > This is why we have rcu_needs_cpu() and rcu_pending() checks before
> > stopping the tick.
> > 
> > rcu_needs_cpu() checks we have no local callback enqueued, in which
> > case the local CPU is responsible of the RCU state machine.
> > 
> > rcu_pending() is there to know if another CPU started a grace period
> > so we need the tick to complete it.
> 
> Hence the extended grace period, so we don't need to complete grace
> periods.

I hope the above explanations made things more clear.


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-29 23:35             ` Frederic Weisbecker
@ 2011-08-30 11:17               ` Peter Zijlstra
  2011-08-30 14:11                 ` Frederic Weisbecker
  2011-08-30 11:19               ` Peter Zijlstra
  2011-08-30 11:21               ` Peter Zijlstra
  2 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-30 11:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Tue, 2011-08-30 at 01:35 +0200, Frederic Weisbecker wrote:
> But rcu read side critical sections (preemption disabled, rcu_read_lock(),
> softirq disabled) don't need the tick to enforce the critical section
> itself. 

Note that with PREEMPT_RCU only rcu_read_lock() actually marks an RCU
read-side critical section; none of the others should be used as such.
Relying on preempt_disable(), local_bh_disable() and similar has been
broken for a long while now.




* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-29 23:35             ` Frederic Weisbecker
  2011-08-30 11:17               ` Peter Zijlstra
@ 2011-08-30 11:19               ` Peter Zijlstra
  2011-08-30 14:26                 ` Frederic Weisbecker
  2011-08-30 11:21               ` Peter Zijlstra
  2 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-30 11:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Tue, 2011-08-30 at 01:35 +0200, Frederic Weisbecker wrote:
> 
> OTOH it is needed to find non-critical sections when asked to cooperate
> in a grace period completion. But if no callback have been enqueued on
> the whole system we are fine. 

It's that 'whole system' clause that I have a problem with. It would be
perfectly fine to have a number of cpus very busy generating rcu
callbacks; however, that should not mean our adaptive nohz cpu gets
bothered to complete grace periods.

Requiring it to participate in the grace period state machine is a fail,
plain and simple.


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-29 23:35             ` Frederic Weisbecker
  2011-08-30 11:17               ` Peter Zijlstra
  2011-08-30 11:19               ` Peter Zijlstra
@ 2011-08-30 11:21               ` Peter Zijlstra
  2011-08-30 14:32                 ` Frederic Weisbecker
  2 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-30 11:21 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Tue, 2011-08-30 at 01:35 +0200, Frederic Weisbecker wrote:
> > That means it has to be in an extended grace period when we stop the
> > tick.
> 
> You mean extended quiescent state?

Yeah that :-)

> As a summary here is what we do:
> 
> - if we are in the kernel, we can't run into extended quiescent state because
> we may make use of rcu anytime there. But if we run nohz we don't have the tick
> to notice quiescent states to the RCU machinery and help completing grace periods
> so as soon as we receive an rcu IPI from another CPU (due to the grace period
> beeing extended because our nohz CPU doesn't report quiescent states), we restart
> the tick. We are optimistic enough to consider that we may avoid a lot of ticks
> even if there are some risks to be disturbed in some random rates.
> So even with the IPI we consider it as an upside.
> 
> - if we are in userspace we can run in extended quiescent state.

But you can only disable the tick/enter extended quiescent state while
in kernel-space. Thus the second clause is precluded from ever being
true.




* Re: [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset
  2011-08-29 18:28         ` Frederic Weisbecker
@ 2011-08-30 12:44           ` Peter Zijlstra
  2011-08-30 14:38             ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-30 12:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Mon, 2011-08-29 at 20:28 +0200, Frederic Weisbecker wrote:
> tick_nohz_stop_sched_tick() checks that
> with {rcu,printk,arch}_needs_cpu() and restores a periodic behaviour
> until nobody else needs the CPU. 

tick_nohz_stop_sched_tick() should not restore stuff; at worst it should
fail to stop the tick, but it should never enable it. That's just weird.


* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-24 14:41 ` Gilad Ben-Yossef
@ 2011-08-30 14:06   ` Frederic Weisbecker
  2011-08-31  3:47     ` Mike Galbraith
  2011-08-31 13:57     ` Gilad Ben-Yossef
  0 siblings, 2 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 14:06 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Peter Zijlstra,
	Stephen Hemminger, Thomas Gleixner, Tim Pepper

On Wed, Aug 24, 2011 at 05:41:05PM +0300, Gilad Ben-Yossef wrote:
> Hi,
> 
> On Mon, Aug 15, 2011 at 6:51 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >
> > For those who want to play:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
> >        nohz/cpuset-v1
> 
> 
> You caught me in playful mood, so I took it for a spin... :-)
> 
> I know this is far from being production ready, but I hope you'll find
> the feedback useful.
> 
> First a short description of my testing setup is in order, I believe:
> 
> I've set up a small x86 VM with 4 CPUs running your git tree and a
> minimal buildroot system. I've created 2 cpusets: sys and nohz, and
> then assigned every task I could to the sys cpuset and set
> adaptive_nohz on the nohz set.
> 
> To make double sure I have no task on my nohz cpuset CPU, I've booted
> the system with the isolcpus command line isolating the same cpu I've
> assigned to the nohz set. This shouldn't be needed of course, but just
> in case.

Ah, I haven't tested with isolcpus, especially as it's headed toward
removal.

> 
> I then ran a silly program I've written that basically eats CPU cycles
> (https://github.com/gby/cpueat) and assigned it to the nohz set and
> monitored the number of interrupts using /proc/interrupts
> 
> Now, for the things I've noticed -
> 
> 1. Before I turn adaptive_nohz to 1, when no task is running on the
> nohz cpuset cpu, the tick is indeed idle (regular nohz case) and very
> few function call IPIs are seen. However, when I turn adaptive_nohz to
> 1 (but still with no task running on the CPU), the tick remains idle,
> but I get an IPI function call interrupt almost in the rate the tick
> would have been.

Yeah, I believe this is due to RCU trying to wake up our nohz CPU.
I need to take a deeper look there.

> 2. When I run my little cpueat program on the nohz CPU, the tick does
> not actually go off. Instead it ticks away as usual. I know it is
> the only eligible task to run, since as soon as I kill it the tick
> turns off (regular nohz mode again). I've tinkered around and found
> out that what stops the tick going away is the check for rcu_pending()
> in cpuset_nohz_can_stop_tick(). It seems to always be true. When I
> removed that check experimentally and repeat the test, the tick indeed
> stops with my cpueat task running. Of course, I don't suggest this is
> the sane thing to do - I just wondered if that what stopped the tick
> going away and it seems that it is.

Are you sure the tick never goes off?
But yeah, maybe there is something in your system that constantly requires
RCU grace periods to complete. I should drop the rcu_pending() check as
long as we only want to stop the tick from userspace, because there we
are off the RCU state machine.


> 3. My little cpueat program tries to fork a child process after 100k
> iteration of some CPU bound loop. It usually takes a few seconds to
> happen. The idea is to make sure that the tick resumes when nr_running
> > 1. In my case, I got a kernel panic. Since it happened with some
> debug code I added and with aforementioned experimental removal of
> rcu_pending check, I'm assuming for now it's all my fault but will
> look into verifying it further and will send panic logs if it proves
> useful.

I got some panics too but haven't seen any for some time. I've made a
lot of changes since then though, so I assume the condition that triggered
it just went away.

IIRC, it was a lock inversion between the rq lock and some other lock.
Very nice condition for a cool lockup ;)


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 11:17               ` Peter Zijlstra
@ 2011-08-30 14:11                 ` Frederic Weisbecker
  2011-08-30 14:13                   ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, Aug 30, 2011 at 01:17:42PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-30 at 01:35 +0200, Frederic Weisbecker wrote:
> > But rcu read side critical sections (preemption disabled, rcu_read_lock(),
> > softirq disabled) don't need the tick to enforce the critical section
> > itself. 
> 
> Note that with PREEMPT_RCU only rcu_read_lock() actually marks an rcu
> read side critical section; none of the others should be used as such.
> Relying on preempt_disable(), local_bh_disable() and similar has been
> broken for a long while now.

Sure yeah.

My point was that the patchset doesn't care about all that anyway. Read
side critical sections still work as usual. What changes is the way we
notice periods where we are *not* in rcu read side critical sections.
This was previously partly done through the tick. It still is, but now
we need to remotely wake up that tick first.


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 14:11                 ` Frederic Weisbecker
@ 2011-08-30 14:13                   ` Peter Zijlstra
  2011-08-30 14:27                     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-30 14:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, 2011-08-30 at 16:11 +0200, Frederic Weisbecker wrote:
> On Tue, Aug 30, 2011 at 01:17:42PM +0200, Peter Zijlstra wrote:
> > On Tue, 2011-08-30 at 01:35 +0200, Frederic Weisbecker wrote:
> > > But rcu read side critical sections (preemption disabled, rcu_read_lock(),
> > > softirq disabled) don't need the tick to enforce the critical section
> > > itself. 
> > 
> > Note that with PREEMPT_RCU only rcu_read_lock() actually marks an rcu
> > read side critical section; none of the others should be used as such.
> > Relying on preempt_disable(), local_bh_disable() and similar has been
> > broken for a long while now.
> 
> Sure yeah.
> 
> My point was that the patchset doesn't care about all that anyway. Read
> side critical sections still work as usual. What changes is the way we
> notice periods where we are *not* in rcu read side critical sections.
> This was previously partly done through the tick. It still is, but now
> we need to remotely wake up that tick first.

But you wake up the tick for any callback - systemwide - right? That's a
massive fail.


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 11:19               ` Peter Zijlstra
@ 2011-08-30 14:26                 ` Frederic Weisbecker
  2011-08-30 15:22                   ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Tue, Aug 30, 2011 at 01:19:18PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-30 at 01:35 +0200, Frederic Weisbecker wrote:
> > 
> > OTOH it is needed to find non-critical sections when asked to cooperate
> > in a grace period completion. But if no callbacks have been enqueued on
> > the whole system we are fine. 
> 
> Its that 'whole system' clause that I have a problem with. It would be
> perfectly fine to have a number of cpus very busy generating rcu
> callbacks, however this should not mean our adaptive nohz cpu should be
> bothered to complete grace periods.
> 
> Requiring it to participate in the grace period state machine is a fail,
> plain and simple.

We need those nohz CPUs to participate because they may use read side
critical sections themselves. So we need them to delay grace period
completion until their running rcu read side critical sections end, like
any other CPU. Otherwise their supposed rcu read side critical sections
wouldn't be effective.

Either that, or we need to stop the tick only when we are in userspace.
I'm not sure that would be a good idea.

We discussed this problem; I believe it mostly resides in rcu sched,
because finding quiescent states for rcu bh is easy, while rcu sched needs
the tick or context switches. (For rcu preempt I have no idea.)
So for now that's the sanest option we found amongst:

- Having explicit hooks in preempt_disable() and local_irq_restore()
to notice the end of rcu sched critical sections, so that we don't need the
tick anymore to find quiescent states. But that's going to be costly. And we
may miss some more implicitly non-preemptable code paths.

- Rely on context switches only. I believe in practice it should be fine.
But in theory this delays the grace period completion for an unbounded
amount of time.



* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 14:13                   ` Peter Zijlstra
@ 2011-08-30 14:27                     ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 14:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, Aug 30, 2011 at 04:13:57PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-30 at 16:11 +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 30, 2011 at 01:17:42PM +0200, Peter Zijlstra wrote:
> > > On Tue, 2011-08-30 at 01:35 +0200, Frederic Weisbecker wrote:
> > > > But rcu read side critical sections (preemption disabled, rcu_read_lock(),
> > > > softirq disabled) don't need the tick to enforce the critical section
> > > > itself. 
> > > 
> > > Note that with PREEMPT_RCU only rcu_read_lock() actually marks an rcu
> > > read side critical section; none of the others should be used as such.
> > > Relying on preempt_disable(), local_bh_disable() and similar has been
> > > broken for a long while now.
> > 
> > Sure yeah.
> > 
> > My point was that the patchset doesn't care about all that anyway. Read
> > side critical sections still work as usual. What changes is the way we
> > notice periods where we are *not* in rcu read side critical sections.
> > This was previously partly done through the tick. It still is, but now
> > we need to remotely wake up that tick first.
> 
> But you wake up the tick for any callback - systemwide - right? That's a
> massive fail.

Until we find a better solution, yeah.


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 11:21               ` Peter Zijlstra
@ 2011-08-30 14:32                 ` Frederic Weisbecker
  2011-08-30 15:26                   ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 14:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, Aug 30, 2011 at 01:21:55PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-30 at 01:35 +0200, Frederic Weisbecker wrote:
> > > That means it has to be in an extended grace period when we stop the
> > > tick.
> > 
> > You mean extended quiescent state?
> 
> Yeah that :-)
> 
> > As a summary here is what we do:
> > 
> > - if we are in the kernel, we can't run in an extended quiescent state because
> > we may make use of rcu anytime there. But if we run nohz we don't have the tick
> > to report quiescent states to the RCU machinery and help complete grace periods,
> > so as soon as we receive an rcu IPI from another CPU (due to the grace period
> > being extended because our nohz CPU doesn't report quiescent states), we restart
> > the tick. We are optimistic enough to consider that we may avoid a lot of ticks
> > even if there is some risk of being disturbed at some random rate.
> > So even with the IPI we consider it an upside.
> > 
> > - if we are in userspace we can run in extended quiescent state.
> 
> But you can only disable the tick/enter extended quiescent state while
> in kernel-space. Thus the second clause is precluded from ever being
> true.

No, we have a specific stacking in the irq:

	rcu_irq_enter()

	disable tick...
	if (user)
		rcu_enter_nohz();

	rcu_irq_exit() <-- extended quiescent state entry effective only there

And by the time we call rcu_irq_exit() and resume to userspace, we are
not supposed to have any rcu read side critical sections (minus the case
of a signal with do_notify_resume(), which I have yet to handle).


* Re: [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset
  2011-08-30 12:44           ` Peter Zijlstra
@ 2011-08-30 14:38             ` Frederic Weisbecker
  2011-08-30 15:28               ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 14:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, Aug 30, 2011 at 02:44:10PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-29 at 20:28 +0200, Frederic Weisbecker wrote:
> > tick_nohz_stop_sched_tick() checks that
> > with {rcu,printk,arch}_needs_cpu() and restores a periodic behaviour
> > until nobody else needs the CPU. 
> 
> tick_nohz_stop_sched_tick() should not restore stuff, it should at worst
> fail to stop, but never enable. That's just weird.

It's not really enablement, it's still nohz behaviour, just with the next
tick one HZ away :o)

Like you said before, it's an optimization. I could handle the idle and
non-idle cases differently there, but I'm not sure it would really be a
good thing.


* Re: [PATCH 09/32] nohz: Move ts->idle_calls into strict idle logic
  2011-08-29 18:33           ` Peter Zijlstra
@ 2011-08-30 14:45             ` Frederic Weisbecker
  2011-08-30 15:33               ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 14:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 08:33:23PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-29 at 20:23 +0200, Frederic Weisbecker wrote:
> 
> > > Well, no, on interrupt return you shouldn't do anything. If you've
> > > stopped the tick it stays stopped until you do something that needs it,
> > > then that action will re-enable it.
> > 
> > Sure, when something needs the tick in this mode, we usually
> > receive an IPI and restart the tick from there, but then
> > tick_nohz_stop_sched_tick() handles the cases with *needs_cpu()
> > very well on interrupt return (our IPI return) by doing a kind
> > of "light" HZ mode: logically switching to nohz mode but
> > with the next timer happening in HZ, assuming it's a matter
> > of one tick and we will switch to real nohz behaviour soon.
> > 
> > I don't see a good reason to duplicate that logic with a pure
> > restart from the IPI.
> 
> That sounds like an optimization, and should thus be done later.

The optimization is already there upstream. I could split out the logic
for the non-idle case, but I'm not sure I see the point of that.
 
> > > > That said I wonder if some of the above conditions should restore a periodic
> > > > behaviour on interrupt return...
> > > 
> > > I would expect the tick not to be stopped when tick_nohz_can_stop_tick()
> > > returns false. If it returns true, then I expect anything that needs it
> > > to re-enable it.
> > > 
> > 
> > Yeah. In the case of need_resched() in idle I believe the CPU doesn't
> > really go to sleep later so it should be fine. But for the case of
> > softirq pending or nohz_mode, I'm not sure...
> 
> softirqs shouldn't be pending when you go into nohz mode..

You mean it can't happen or we don't want that to happen?

> 
> That is, I'm really not seeing what's wrong with the very simple:
> 
> 
>   if (tick_nohz_can_stop_tick())
> 	tick_nohz_stop_tick();
> 
> 
> and relying on everybody who invalidates tick_nohz_can_stop_tick(), to
> do:
> 
>   tick_nohz_start_tick();

Maybe for the non-idle case. But for the idle case I need to make sure
this happens somewhere.

> 
> I'm also not quite sure why you always IPI, is that to avoid lock
> inversions?

Exactly! I think I wrote that in some changelog but I'm not sure. I'll
check that.


* Re: [PATCH 15/32] nohz/cpuset: Restart tick when switching to idle task
  2011-08-29 15:43   ` Peter Zijlstra
@ 2011-08-30 15:04     ` Frederic Weisbecker
  2011-08-30 15:35       ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 15:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Mon, Aug 29, 2011 at 05:43:40PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > Ideally if we are in adaptive nohz mode and we switch to the
> > the idle task, we shouldn't restart the tick since it's going
> > to stop the tick soon anyway.
> > 
> > That optimization requires some minor tweaks here and there
> > though, lets handle that later. 
> 
> You have a knack for confusing changelogs.. so basically you say:
> 
>   Restart the tick when we switch to idle.
> 
> Now all that needs is an explanation of why..

In the end we don't want to restart the tick when we switch to idle.
But to support that I'll need to tweak various things in idle time
accounting, otherwise the cpu time spent in idle is going to be
accounted as system time.

That's definitely meant to be temporary.

> Also, please drop the whole cpuset_nohz_ stuff, this really isn't about
> cpusets; cpusets simply provide the interface, the functionality lives in
> the tick_nohz_ namespace.

Agreed this sucks.


* Re: [PATCH 16/32] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued
  2011-08-29 15:55   ` Peter Zijlstra
@ 2011-08-30 15:06     ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 15:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 05:55:37PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-29 at 17:51 +0200, Peter Zijlstra wrote:
> > 
> > Also, all the delta_jiffies stuff in the current
> > tick_nohz_stop_sched_tick() deals with this, why duplicate the logic?
> > 
> Damn, n/m this is about new timers.. tick_nohz_new_timer() then.

I can call it tick_nohz_new_timer() if you prefer, yeah. I personally
don't mind either name.


* Re: [PATCH 19/32] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset
  2011-08-29 16:02   ` Peter Zijlstra
@ 2011-08-30 15:10     ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 15:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 06:02:01PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > +++ b/kernel/cpuset.c
> > @@ -1199,6 +1199,14 @@ static void cpuset_change_flag(struct task_struct *tsk,
> >  
> >  DEFINE_PER_CPU(int, cpu_adaptive_nohz_ref);
> >  
> > +static void cpu_exit_nohz(int cpu)
> > +{
> > +       preempt_disable();
> > +       smp_call_function_single(cpu, cpuset_exit_nohz_interrupt,
> > +                                NULL, true);
> > +       preempt_enable();
> > +}
> > +
> >  static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
> >  {
> >         int cpu;
> > @@ -1212,6 +1220,19 @@ static void update_nohz_cpus(struct cpuset *old_cs, struct cpuset *cs)
> >                         per_cpu(cpu_adaptive_nohz_ref, cpu) += 1;
> >                 else
> >                         per_cpu(cpu_adaptive_nohz_ref, cpu) -= 1;
> > +
> > +               val = per_cpu(cpu_adaptive_nohz_ref, cpu);
> > +
> > +               if (!val) {
> > +                       /*
> > +                        * The update to cpu_adaptive_nohz_ref must be
> > +                        * visible right away. So that once we restart the tick
> > +                        * from the IPI, it won't be stopped again due to cache
> > +                        * update lag.
> > +                        */
> > +                       smp_mb();
> > +                       cpu_exit_nohz(cpu);
> > +               }
> >         }
> >  }
> >  #else
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 78ea0a5..75378be 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2513,6 +2513,14 @@ void cpuset_update_nohz(void)
> >                 cpuset_nohz_restart_tick();
> >  }
> >  
> > +void cpuset_exit_nohz_interrupt(void *unused)
> > +{
> > +       if (!tick_nohz_adaptive_mode())
> > +               return;
> > +
> > +       cpuset_nohz_restart_tick();
> > +} 
> 
> You do this just to annoy me, right? Why doesn't it live in cpuset.c
> where you use it?


Right, I'll move all those things and fix the naming / namespace.


* Re: [PATCH 12/32] nohz: Try not to give the timekeeping duty to a cpuset nohz cpu
  2011-08-29 14:55   ` Peter Zijlstra
@ 2011-08-30 15:17     ` Frederic Weisbecker
  2011-08-30 15:30       ` Dimitri Sivanich
  2011-08-30 15:37       ` Peter Zijlstra
  0 siblings, 2 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Dimitri Sivanich, Paul Menage

On Mon, Aug 29, 2011 at 04:55:45PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > Try to give the timekeeping duty to a CPU that doesn't belong
> > to any nohz cpuset when possible, so that we increase the chance
> > for these nohz cpusets to run their CPUs out of periodic tick
> > mode. 
> 
> You and Dimitri might want to get together:
> 
> lkml.kernel.org/r/20110823195628.GB4533@sgi.com

Right!

There is another missing piece in my patchset. If all non-adaptive-nohz
CPUs are sleeping, then none of them is handling the do_timer duty and
the adaptive nohz CPUs run with stale jiffies and wall time.

I need to handle that.


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 14:26                 ` Frederic Weisbecker
@ 2011-08-30 15:22                   ` Peter Zijlstra
  2011-08-30 18:45                     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-30 15:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Tue, 2011-08-30 at 16:26 +0200, Frederic Weisbecker wrote:
> On Tue, Aug 30, 2011 at 01:19:18PM +0200, Peter Zijlstra wrote:
> > On Tue, 2011-08-30 at 01:35 +0200, Frederic Weisbecker wrote:
> > > 
> > > OTOH it is needed to find non-critical sections when asked to cooperate
> > > in a grace period completion. But if no callbacks have been enqueued on
> > > the whole system we are fine. 
> > 
> > Its that 'whole system' clause that I have a problem with. It would be
> > perfectly fine to have a number of cpus very busy generating rcu
> > callbacks, however this should not mean our adaptive nohz cpu should be
> > bothered to complete grace periods.
> > 
> > Requiring it to participate in the grace period state machine is a fail,
> > plain and simple.
> 
> We need those nohz CPUs to participate because they may use read side
> critical sections themselves. So we need them to delay grace period
> completion until their running rcu read side critical sections end, like
> any other CPU. Otherwise their supposed rcu read side critical sections
> wouldn't be effective.
> 
> Either that, or we need to stop the tick only when we are in userspace.
> I'm not sure that would be a good idea.

Well the simple fact is that rcu, when considered system-wide, is pretty
much always busy, voiding any and all benefit you might want to gain.

> We discussed this problem; I believe it mostly resides in rcu sched,
> because finding quiescent states for rcu bh is easy, while rcu sched needs
> the tick or context switches. (For rcu preempt I have no idea.)
> So for now that's the sanest option we found amongst:
> 
> - Having explicit hooks in preempt_disable() and local_irq_restore()
> to notice the end of rcu sched critical sections, so that we don't need the
> tick anymore to find quiescent states. But that's going to be costly. And we
> may miss some more implicitly non-preemptable code paths.
> 
> - Rely on context switches only. I believe in practice it should be fine.
> But in theory this delays the grace period completion for an unbounded
> amount of time.

Right, so what we can do is keep a per-cpu context switch counter (I'm
sure we have one someplace and we already have the
rcu_note_context_switch() callback in case we need another) and have
another cpu (outside of our extended nohz domain) drive our state
machine.

But I'm sure Paul can say more sensible things than me here.


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 14:32                 ` Frederic Weisbecker
@ 2011-08-30 15:26                   ` Peter Zijlstra
  2011-08-30 15:33                     ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-30 15:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, 2011-08-30 at 16:32 +0200, Frederic Weisbecker wrote:
> On Tue, Aug 30, 2011 at 01:21:55PM +0200, Peter Zijlstra wrote:
> > On Tue, 2011-08-30 at 01:35 +0200, Frederic Weisbecker wrote:
> > > > That means it has to be in an extended grace period when we stop the
> > > > tick.
> > > 
> > > You mean extended quiescent state?
> > 
> > Yeah that :-)
> > 
> > > As a summary here is what we do:
> > > 
> > > - if we are in the kernel, we can't run in an extended quiescent state because
> > > we may make use of rcu anytime there. But if we run nohz we don't have the tick
> > > to report quiescent states to the RCU machinery and help complete grace periods,
> > > so as soon as we receive an rcu IPI from another CPU (due to the grace period
> > > being extended because our nohz CPU doesn't report quiescent states), we restart
> > > the tick. We are optimistic enough to consider that we may avoid a lot of ticks
> > > even if there is some risk of being disturbed at some random rate.
> > > So even with the IPI we consider it an upside.
> > > 
> > > - if we are in userspace we can run in extended quiescent state.
> > 
> > But you can only disable the tick/enter extended quiescent state while
> > in kernel-space. Thus the second clause is precluded from ever being
> > true.
> 
> No, we have a specific stacking in the irq:
> 
> 	rcu_irq_enter()
> 
> 	disable tick...
> 	if (user)
> 		rcu_enter_nohz();
> 
> 	rcu_irq_exit() <-- extended quiescent state entry effective only there
> 
> And by the time we call rcu_irq_exit() and resume to userspace, we are
> not supposed to have any rcu read side critical sections (minus the case
> of a signal with do_notify_resume(), which I have yet to handle).

See, all that is still kernelspace ;-) I think I know what you mean to
say though, but seeing as you note there is even now a known shortcoming,
I'm not very confident it's a solid construction. What will help us find
such holes?

I would much rather we not rely on such fragile things too much.. this
RCU stuff wants way more thought, as it stands your patch-set doesn't do
anything useful IMO.


* Re: [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset
  2011-08-30 14:38             ` Frederic Weisbecker
@ 2011-08-30 15:28               ` Peter Zijlstra
  0 siblings, 0 replies; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-30 15:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, 2011-08-30 at 16:38 +0200, Frederic Weisbecker wrote:
> On Tue, Aug 30, 2011 at 02:44:10PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-08-29 at 20:28 +0200, Frederic Weisbecker wrote:
> > > tick_nohz_stop_sched_tick() checks that
> > > with {rcu,printk,arch}_needs_cpu() and restores a periodic behaviour
> > > until nobody else needs the CPU. 
> > 
> > tick_nohz_stop_sched_tick() should not restore stuff, it should at worst
> > fail to stop, but never enable. That's just weird.
> 
> It's not really enablement, it's still nohz behaviour, just with the next
> tick one HZ away :o)
> 
> Like you said before, it's an optimization. 

I'm still not feeling very confident about all that..

> I can do the things differently
> for idle and non-idle cases there but I'm not sure it's really a good thing.

it's not, but I'm very sure I've lost you on why that should be the case.


* Re: [PATCH 12/32] nohz: Try not to give the timekeeping duty to a cpuset nohz cpu
  2011-08-30 15:17     ` Frederic Weisbecker
@ 2011-08-30 15:30       ` Dimitri Sivanich
  2011-08-30 15:37       ` Peter Zijlstra
  1 sibling, 0 replies; 139+ messages in thread
From: Dimitri Sivanich @ 2011-08-30 15:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, Aug 30, 2011 at 05:17:06PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2011 at 04:55:45PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > Try to give the timekeeping duty to a CPU that doesn't belong
> > > to any nohz cpuset when possible, so that we increase the chance
> > > for these nohz cpusets to run their CPUs out of periodic tick
> > > mode. 
> > 
> > You and Dimitri might want to get together:
> > 
> > lkml.kernel.org/r/20110823195628.GB4533@sgi.com
> 
> Right!
> 
> There is another missing piece in my patchset. If all the non-adaptive-nohz
> CPUs are sleeping, then none of them is handling the do_timer duty and the
> adaptive-nohz CPUs run with stale jiffies and walltime.
> 
> I need to handle that.

Frederic,

Please note that the patch that Peter references (lkml.kernel.org/r/20110823195628.GB4533@sgi.com) supports only the nohz=off case.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/32] nohz: Move ts->idle_calls into strict idle logic
  2011-08-30 14:45             ` Frederic Weisbecker
@ 2011-08-30 15:33               ` Peter Zijlstra
  2011-09-06 16:35                 ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-30 15:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Tue, 2011-08-30 at 16:45 +0200, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2011 at 08:33:23PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-08-29 at 20:23 +0200, Frederic Weisbecker wrote:
> > 
> > > > Well, no, on interrupt return you shouldn't do anything. If you've
> > > > stopped the tick it stays stopped until you do something that needs it,
> > > > then that action will re-enable it.
> > > 
> > > Sure, when something needs the tick in this mode, we usually
> > > receive an IPI and restart the tick from there but then
> > > tick_nohz_stop_sched_tick() handles the cases with *needs_cpu()
> > > very well on interrupt return (our IPI return) by doing a kind
> > > of "light" HZ mode by logically switching to nohz mode but
> > > with the next timer happening in HZ, assuming it's a matter
> > > of one tick and we will switch to a real nohz behaviour soon.
> > > 
> > > I don't see a good reason to duplicate that logic with a pure
> > > restart from the IPI.
> > 
> > That sounds like an optimization, and should thus be done later.
> 
> The optimization is already there upstream. I can split the logic for
> non-idle case but I'm not sure about the point of that.

Care to point me to the relevant code? I can't remember a
half-assed nohz state.. then again, maybe I didn't look hard enough.

> > > > > That said I wonder if some of the above conditions should restore a periodic
> > > > > behaviour on interrupt return...
> > > > 
> > > > I would expect the tick not to be stopped when tick_nohz_can_stop_tick()
> > > > returns false. If it returns true, then I expect anything that needs it
> > > > to re-enable it.
> > > > 
> > > 
> > > Yeah. In the case of need_resched() in idle I believe the CPU doesn't
> > > really go to sleep later so it should be fine. But for the case of
> > > softirq pending or nohz_mode, I'm not sure...
> > 
> > softirqs shouldn't be pending when you go into nohz mode..
> 
> You mean it can't happen or we don't want that to happen?

We don't want that: going into nohz with pending softirqs means the
softirqs will be delayed for an unknown amount of time, which should not
occur.

tick_nohz_stop_sched_tick() has:

        if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
                static int ratelimit;

                if (ratelimit < 10) {
                        printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
                               (unsigned int) local_softirq_pending());
                        ratelimit++;
                }
                goto end;
        }

which should warn us if this ever was to occur.

> > 
> > That is, I'm really not seeing what's wrong with the very simple:
> > 
> > 
> >   if (tick_nohz_can_stop_tick())
> > 	tick_nohz_stop_tick();
> > 
> > 
> > and relying on everybody who invalidates tick_nohz_can_stop_tick(), to
> > do:
> > 
> >   tick_nohz_start_tick();
> 
> Maybe for the non-idle case. But for the idle case I need to ensure
> this is necessary somewhere.

How exactly do idle and non-idle differ? It's about stopping the tick;
regardless of whether we're idle or not, if someone needs the tick we need
to start it again.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 15:26                   ` Peter Zijlstra
@ 2011-08-30 15:33                     ` Frederic Weisbecker
  2011-08-30 15:42                       ` Peter Zijlstra
  2011-08-30 20:58                       ` Peter Zijlstra
  0 siblings, 2 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 15:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, Aug 30, 2011 at 05:26:33PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-30 at 16:32 +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 30, 2011 at 01:21:55PM +0200, Peter Zijlstra wrote:
> > > On Tue, 2011-08-30 at 01:35 +0200, Frederic Weisbecker wrote:
> > > > > That means it has to be in an extended grace period when we stop the
> > > > > tick.
> > > > 
> > > > You mean extended quiescent state?
> > > 
> > > Yeah that :-)
> > > 
> > > > As a summary here is what we do:
> > > > 
> > > > - if we are in the kernel, we can't run into extended quiescent state because
> > > > we may make use of rcu anytime there. But if we run nohz we don't have the tick
> > > > to report quiescent states to the RCU machinery and help completing grace periods,
> > > > so as soon as we receive an rcu IPI from another CPU (due to the grace period
> > > > being extended because our nohz CPU doesn't report quiescent states), we restart
> > > > the tick. We are optimistic enough to consider that we may avoid a lot of ticks
> > > > even if there is some risk of being disturbed at some random rate.
> > > > So even with the IPI we consider it as an upside.
> > > > 
> > > > - if we are in userspace we can run in extended quiescent state.
> > > 
> > > But you can only disable the tick/enter extended quiescent state while
> > > in kernel-space. Thus the second clause is precluded from ever being
> > > true.
> > 
> > No, we have a specific stacking in the irq:
> > 
> > 	rcu_irq_enter()
> > 
> > 	disable tick...
> > 	if (user)
> > 		rcu_enter_nohz();
> > 
> > 	rcu_irq_exit() <-- extended quiescent state entry effective only there
> > 
> > And by the time we call rcu_irq_exit() and we resume to userspace, we are
> > not supposed to have rcu read side critical sections (minus the case of
> > a signal with do_notify_resume(), which I have yet to handle).
> 
> See all that is still kernelspace ;-) I think I know what you mean to
> say though, but seeing as you note there is even now a known shortcoming
> I'm not very confident its a solid construction. What will help us find
> such holes?

This: https://lkml.org/lkml/2011/6/23/744

It's in one of Paul's branches and should make it for the next merge window.
This should detect any such holes. I made it on purpose for the nohz cpusets
when I saw how error prone that can be with rcu :)

> I would much rather we not rely on such fragile things too much.. this
> RCU stuff wants way more thought, as it stands your patch-set doesn't do
> anything useful IMO.

Not sure what you mean. Well, that RCU thing is fragile for sure, but we have
the tools ready to find the problems.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 15/32] nohz/cpuset: Restart tick when switching to idle task
  2011-08-30 15:04     ` Frederic Weisbecker
@ 2011-08-30 15:35       ` Peter Zijlstra
  0 siblings, 0 replies; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-30 15:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, 2011-08-30 at 17:04 +0200, Frederic Weisbecker wrote:
> > ... so basically you say:
> > 
> >   Restart the tick when we switch to idle.
> > 
> > Now all that needs is an explanation of why..
> 
> In the end we don't want to restart the tick when we switch to idle.
> But to support that I'll need to tweak various things on idle time
> accounting, otherwise the [cpu] time spent in idle is going to be
> accounted as system time.

Ok, so then the changelog should say so, preferably with a little bit
more detail on how exactly the idle time accounting would go tits-up.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 12/32] nohz: Try not to give the timekeeping duty to a cpuset nohz cpu
  2011-08-30 15:17     ` Frederic Weisbecker
  2011-08-30 15:30       ` Dimitri Sivanich
@ 2011-08-30 15:37       ` Peter Zijlstra
  2011-08-30 22:44         ` Frederic Weisbecker
  1 sibling, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-30 15:37 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Dimitri Sivanich, Paul Menage

On Tue, 2011-08-30 at 17:17 +0200, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2011 at 04:55:45PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > Try to give the timekeeping duty to a CPU that doesn't belong
> > > to any nohz cpuset when possible, so that we increase the chance
> > > for these nohz cpusets to run their CPUs out of periodic tick
> > > mode. 
> > 
> > You and Dimitri might want to get together:
> > 
> > lkml.kernel.org/r/20110823195628.GB4533@sgi.com
> 
> Right!
> 
> There is another missing piece in my patchset. If all the non-adaptive-nohz
> CPUs are sleeping, then none of them is handling the do_timer duty and the
> adaptive-nohz CPUs run with stale jiffies and walltime.

Doesn't nohz already deal with the case of all cpus being idle? In that
case the cpu that wakes up first gets to play catch up on irq_enter() or
so.



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 15:33                     ` Frederic Weisbecker
@ 2011-08-30 15:42                       ` Peter Zijlstra
  2011-08-30 18:53                         ` Frederic Weisbecker
  2011-08-30 20:58                       ` Peter Zijlstra
  1 sibling, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-30 15:42 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, 2011-08-30 at 17:33 +0200, Frederic Weisbecker wrote:
> > See all that is still kernelspace ;-) I think I know what you mean to
> > say though, but seeing as you note there is even now a known shortcoming
> > I'm not very confident its a solid construction. What will help us find
> > such holes?
> 
> This: https://lkml.org/lkml/2011/6/23/744
> 
> It's in one of Paul's branches and should make it for the next merge window.
> This should detect any such holes. I made it on purpose for the nohz cpusets
> when I saw how error prone that can be with rcu :)

OK, good ;-)

> > I would much rather we not rely on such fragile things too much.. this
> > RCU stuff wants way more thought, as it stands your patch-set doesn't do
> > anything useful IMO.
> 
> Not sure what you mean. Well, that RCU thing is fragile for sure, but we have
> the tools ready to find the problems.

Right that thing you linked above does catch abuse, still your current
proposal means that due to RCU it will basically never disable the tick.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 15:22                   ` Peter Zijlstra
@ 2011-08-30 18:45                     ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Tue, Aug 30, 2011 at 05:22:33PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-30 at 16:26 +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 30, 2011 at 01:19:18PM +0200, Peter Zijlstra wrote:
> > > On Tue, 2011-08-30 at 01:35 +0200, Frederic Weisbecker wrote:
> > > > 
> > > > OTOH it is needed to find non-critical sections when asked to cooperate
> > > > in a grace period completion. But if no callback have been enqueued on
> > > > the whole system we are fine. 
> > > 
> > > Its that 'whole system' clause that I have a problem with. It would be
> > > perfectly fine to have a number of cpus very busy generating rcu
> > > callbacks, however this should not mean our adaptive nohz cpu should be
> > > bothered to complete grace periods.
> > > 
> > > Requiring it to participate in the grace period state machine is a fail,
> > > plain and simple.
> > 
> > We need those nohz CPUs to participate because they may use read side
> > critical section themselves. So we need them to delay running grace period
> > until the end of their running rcu read side critical sections, like any
> > other CPUs. Otherwise their supposed rcu read side critical section wouldn't
> > be effective.
> > 
> > Either that or we need to only stop the tick when we are in userspace.
> > I'm not sure it would be a good idea.
> 
> Well the simple fact is that rcu, when considered system-wide, is pretty
> much always busy, voiding any and all benefit you might want to gain.

With my testcase, a stupid userspace loop on a single CPU among 4, I actually
see only a little RCU activity, especially as every other CPU is pretty much idle.
So there are some cases where it's not so pointless.

> > We discussed this problem, I believe the problem mostly resides in rcu sched.
> > Because finding quiescent states for rcu bh is easy, but rcu sched needs
> > the tick or context switches. (For rcu preempt I have no idea.)
> > So for now that's the sanest way we found amongst:
> > 
> > - Having explicit hooks in preempt_disable() and local_irq_restore()
> > to notice end of rcu sched critical section. So that we don't need the tick
> > anymore to find quiescent states. But that's going to be costly. And we may
> > miss some more implicitly non-preemptable code path.
> > 
> > - Rely on context switches only. I believe in practice it should be fine.
> > But in theory this delays the grace period completion for an unbounded
> > amount of time.
> 
> Right, so what we can do is keep a per-cpu context switch counter (I'm
> sure we have one someplace and we already have the
> rcu_note_context_switch() callback in case we need another) and have
> another cpu (outside of our extended nohz domain) drive our state
> machine.
> 
> But I'm sure Paul can say more sensible things than me here.

Yeah I hope we can find some solution to minimize these IPIs.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 15:42                       ` Peter Zijlstra
@ 2011-08-30 18:53                         ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 18:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, Aug 30, 2011 at 05:42:32PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-30 at 17:33 +0200, Frederic Weisbecker wrote:
> > > See all that is still kernelspace ;-) I think I know what you mean to
> > > say though, but seeing as you note there is even now a known shortcoming
> > > I'm not very confident its a solid construction. What will help us find
> > > such holes?
> > 
> > This: https://lkml.org/lkml/2011/6/23/744
> > 
> > It's in one of Paul's branches and should make it for the next merge window.
> > This should detect any such holes. I made it on purpose for the nohz cpusets
> > when I saw how error prone that can be with rcu :)
> 
> OK, good ;-)
> 
> > > I would much rather we not rely on such fragile things too much.. this
> > > RCU stuff wants way more thought, as it stands your patch-set doesn't do
> > > anything useful IMO.
> > 
> > Not sure what you mean. Well, that RCU thing is fragile for sure, but we have
> > the tools ready to find the problems.
> 
> Right that thing you linked above does catch abuse, still your current
> proposal means that due to RCU it will basically never disable the tick.

At least when we are in userspace it does, as long as we have no local rcu
callbacks to handle. We have the rcu_pending() check but we can remove it for userspace.

Now for kernel space I still think it's worth it, for example when all other
CPUs are idle, or when they don't queue that many RCU callbacks. In theory, if
we have a grace period to complete 100 times per second but HZ=1000, then we
still avoid a lot of timer interrupts.
If we can't remove the tick entirely, we can at least minimize it.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 15:33                     ` Frederic Weisbecker
  2011-08-30 15:42                       ` Peter Zijlstra
@ 2011-08-30 20:58                       ` Peter Zijlstra
  2011-08-30 22:24                         ` Frederic Weisbecker
  1 sibling, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-30 20:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, 2011-08-30 at 17:42 +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-30 at 17:33 +0200, Frederic Weisbecker wrote:
> > > See all that is still kernelspace ;-) I think I know what you mean to
> > > say though, but seeing as you note there is even now a known shortcoming
> > > I'm not very confident its a solid construction. What will help us find
> > > such holes?
> > 
> > This: https://lkml.org/lkml/2011/6/23/744
> > 
> > It's in one of Paul's branches and should make it for the next merge window.
> > This should detect any such holes. I made it on purpose for the nohz cpusets
> > when I saw how error prone that can be with rcu :)
> 
> OK, good ;-)
> 
> > > I would much rather we not rely on such fragile things too much.. this
> > > RCU stuff wants way more thought, as it stands your patch-set doesn't do
> > > anything useful IMO.
> > 
> > Not sure what you mean. Well, that RCU thing is fragile for sure, but we have
> > the tools ready to find the problems.
> 
> Right that thing you linked above does catch abuse, still your current
> proposal means that due to RCU it will basically never disable the tick.

So how about something like:

Assuming we are in rcu_nohz state; on kernel enter we leave rcu_nohz but
don't start the tick, instead we assign another cpu to run our state
machine.

On kernel exit we 'donate' all our rcu state to a willing victim (the
same that earlier was kind enough to drive our state) and undo our
entire GP accounting and re-enter rcu_nohz state.

If between that time we did restart the tick, we take back our rcu state
and skip the donate and rcu_nohz enter on kernel exit.

I really should go read all those docs Paul send me to see how insane
the above is.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 20:58                       ` Peter Zijlstra
@ 2011-08-30 22:24                         ` Frederic Weisbecker
  2011-08-31  9:17                           ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 22:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Tue, Aug 30, 2011 at 10:58:38PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-30 at 17:42 +0200, Peter Zijlstra wrote:
> > On Tue, 2011-08-30 at 17:33 +0200, Frederic Weisbecker wrote:
> > > > See all that is still kernelspace ;-) I think I know what you mean to
> > > > say though, but seeing as you note there is even now a known shortcoming
> > > > I'm not very confident its a solid construction. What will help us find
> > > > such holes?
> > > 
> > > This: https://lkml.org/lkml/2011/6/23/744
> > > 
> > > It's in one of Paul's branches and should make it for the next merge window.
> > > This should detect any such holes. I made it on purpose for the nohz cpusets
> > > when I saw how error prone that can be with rcu :)
> > 
> > OK, good ;-)
> > 
> > > > I would much rather we not rely on such fragile things too much.. this
> > > > RCU stuff wants way more thought, as it stands your patch-set doesn't do
> > > > anything useful IMO.
> > > 
> > > Not sure what you mean. Well, that RCU thing is fragile for sure, but we have
> > > the tools ready to find the problems.
> > 
> > Right that thing you linked above does catch abuse, still your current
> > proposal means that due to RCU it will basically never disable the tick.
> 
> So how about something like:
> 
> Assuming we are in rcu_nohz state; on kernel enter we leave rcu_nohz but
> don't start the tick, instead we assign another cpu to run our state
> machine.

The nohz CPU still has to notice its own quiescent states. Now it could be
an optimization to ask another CPU to handle all the rest once that quiescent
state is found. That doesn't solve our main problem though, which is to
reliably report quiescent states when asked for.

> On kernel exit we 'donate' all our rcu state to a willing victim (the
> same that earlier was kind enough to drive our state) and undo our
> entire GP accounting and re-enter rcu_nohz state.

That's already what rcu_enter_nohz() does.

> If between that time we did restart the tick, we take back our rcu state
> and skip the donate and rcu_nohz enter on kernel exit.

That's also what is done in this patchset. As soon as we re-enter the kernel,
or if the tick had to be restarted before we re-entered the kernel, we call
rcu_exit_nohz(), which pulls the CPU back into the whole RCU machinery.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 12/32] nohz: Try not to give the timekeeping duty to a cpuset nohz cpu
  2011-08-30 15:37       ` Peter Zijlstra
@ 2011-08-30 22:44         ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-30 22:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Dimitri Sivanich, Paul Menage

On Tue, Aug 30, 2011 at 05:37:28PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-30 at 17:17 +0200, Frederic Weisbecker wrote:
> > On Mon, Aug 29, 2011 at 04:55:45PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > > Try to give the timekeeping duty to a CPU that doesn't belong
> > > > to any nohz cpuset when possible, so that we increase the chance
> > > > for these nohz cpusets to run their CPUs out of periodic tick
> > > > mode. 
> > > 
> > > You and Dimitri might want to get together:
> > > 
> > > lkml.kernel.org/r/20110823195628.GB4533@sgi.com
> > 
> > Right!
> > 
> > There is another missing piece in my patchset. If all the non-adaptive-nohz
> > CPUs are sleeping, then none of them is handling the do_timer duty and the
> > adaptive-nohz CPUs run with stale jiffies and walltime.
> 
> Doesn't nohz already deal with the case of all cpus being idle? In that
> case the cpu that wakes up first gets to play catch up on irq_enter() or
> so.

Sure, and that works for the nohz idle case. But that's not enough anymore
in the case of adaptive nohz CPUs. They can run for a while without the tick,
and if nobody else maintains a tick either, then jiffies and walltime are
not maintained anymore.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-30 14:06   ` Frederic Weisbecker
@ 2011-08-31  3:47     ` Mike Galbraith
  2011-08-31  9:28       ` Peter Zijlstra
  2011-08-31 13:57     ` Gilad Ben-Yossef
  1 sibling, 1 reply; 139+ messages in thread
From: Mike Galbraith @ 2011-08-31  3:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben-Yossef, LKML, Andrew Morton, Anton Blanchard,
	Avi Kivity, Ingo Molnar, Lai Jiangshan, Paul E . McKenney,
	Paul Menage, Peter Zijlstra, Stephen Hemminger, Thomas Gleixner,
	Tim Pepper

On Tue, 2011-08-30 at 16:06 +0200, Frederic Weisbecker wrote:

> Ah I haven't tested with that isolcpus especially as it's headed toward
> removal.

It is?  Is it being replaced by something?  (I find it useful)

	-Mike



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-30 22:24                         ` Frederic Weisbecker
@ 2011-08-31  9:17                           ` Peter Zijlstra
  2011-08-31 13:37                             ` Frederic Weisbecker
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-31  9:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Wed, 2011-08-31 at 00:24 +0200, Frederic Weisbecker wrote:
> On Tue, Aug 30, 2011 at 10:58:38PM +0200, Peter Zijlstra wrote:
> > On Tue, 2011-08-30 at 17:42 +0200, Peter Zijlstra wrote:
> > > On Tue, 2011-08-30 at 17:33 +0200, Frederic Weisbecker wrote:
> > > > > See all that is still kernelspace ;-) I think I know what you mean to
> > > > > say though, but seeing as you note there is even now a known shortcoming
> > > > > I'm not very confident its a solid construction. What will help us find
> > > > > such holes?
> > > > 
> > > > This: https://lkml.org/lkml/2011/6/23/744
> > > > 
> > > > It's in one of Paul's branches and should make it for the next merge window.
> > > > This should detect any such holes. I made it on purpose for the nohz cpusets
> > > > when I saw how error prone that can be with rcu :)
> > > 
> > > OK, good ;-)
> > > 
> > > > > I would much rather we not rely on such fragile things too much.. this
> > > > > RCU stuff wants way more thought, as it stands your patch-set doesn't do
> > > > > anything useful IMO.
> > > > 
> > > > Not sure what you mean. Well, that RCU thing is fragile for sure, but we have
> > > > the tools ready to find the problems.
> > > 
> > > Right that thing you linked above does catch abuse, still your current
> > > proposal means that due to RCU it will basically never disable the tick.
> > 
> > So how about something like:
> > 
> > Assuming we are in rcu_nohz state; on kernel enter we leave rcu_nohz but
> > don't start the tick, instead we assign another cpu to run our state
> > machine.
> 
> The nohz CPU still has to notice its own quiescent states. 

Why? rcu-sched can use a context-switch counter, and rcu-preempt doesn't
even need that. Remote cpus can notice those just fine.

> Now it could be
> an optimization to ask another CPU to handle all the rest once that quiescent
> state is found. That doesn't solve our main problem though which is to
> reliably report quiescent states when asked for.

No, seriously, RCU should not, ever, need to re-enable the tick. Imagine
an HPC workload where the system cores are also responsible for all IO
and all the adaptive-nohz cores are simply crunching numbers. In that
scenario you'll have a very high rcu usage because the system cores are
all very busy arranging work for the computation cores.

> > On kernel exit we 'donate' all our rcu state to a willing victim (the
> > same that earlier was kind enough to drive our state) and undo our
> > entire GP accounting and re-enter rcu_nohz state.
> 
> That's already what does rcu_enter_nohz().

Almost but not quite, it doesn't donate the callbacks for example
(something it does do on hotplug -- and therefore any assumption the
callback will in fact run on the cpu you submit it on is already
broken).

> > If between that time we did restart the tick, we take back our rcu state
> > and skip the donate and rcu_nohz enter on kernel exit.
> 
> That's also what is done in this patchset. 

It's not; since you don't hand off the grace period detection, you don't
take it back now, do you..

> As soon as we re-enter the kernel
> or the tick had to be restarted before we re-enter the kernel,

Another impossibility, you can only restart the tick from the kernel.

>  we call
> rcu_exit_nohz() that pulls back the CPU to the whole RCU machinery.

But you then also start the tick again..

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-31  3:47     ` Mike Galbraith
@ 2011-08-31  9:28       ` Peter Zijlstra
  2011-08-31 10:26         ` Mike Galbraith
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-31  9:28 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Frederic Weisbecker, Gilad Ben-Yossef, LKML, Andrew Morton,
	Anton Blanchard, Avi Kivity, Ingo Molnar, Lai Jiangshan,
	Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Wed, 2011-08-31 at 05:47 +0200, Mike Galbraith wrote:
> On Tue, 2011-08-30 at 16:06 +0200, Frederic Weisbecker wrote:
> 
> > Ah I haven't tested with that isolcpus especially as it's headed toward
> > removal.
> 
> It is?  Is it being replaced by something?  (I find it useful)

cpusets? Afaict all it does is not include a number of cpus in the sched
domains, creating a bunch of independent scheduling cpus. You can use
cpusets to get to the same state.


[cpuisol seems to already be broken for SCHED_FIFO tasks, since they
share the def_root_domain]

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-31  9:28       ` Peter Zijlstra
@ 2011-08-31 10:26         ` Mike Galbraith
  2011-08-31 10:33           ` Peter Zijlstra
  2011-08-31 14:05           ` Gilad Ben-Yossef
  0 siblings, 2 replies; 139+ messages in thread
From: Mike Galbraith @ 2011-08-31 10:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Gilad Ben-Yossef, LKML, Andrew Morton,
	Anton Blanchard, Avi Kivity, Ingo Molnar, Lai Jiangshan,
	Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Wed, 2011-08-31 at 11:28 +0200, Peter Zijlstra wrote:
> On Wed, 2011-08-31 at 05:47 +0200, Mike Galbraith wrote:
> > On Tue, 2011-08-30 at 16:06 +0200, Frederic Weisbecker wrote:
> > 
> > > Ah I haven't tested with that isolcpus especially as it's headed toward
> > > removal.
> > 
> > It is?  Is it being replaced by something?  (I find it useful)
> 
> cpusets? Afaict all it does is not include a number of cpus in the sched
> domains, creating a bunch of independent scheduling cpus. You can use
> cpusets to get to the same state.

Guess I'll have to try creating a cpuset per cpu, and see how it
compares to isolcpus.  cset shield --cpu 4-63 didn't work well enough.

	-Mike



* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-31 10:26         ` Mike Galbraith
@ 2011-08-31 10:33           ` Peter Zijlstra
  2011-08-31 14:00             ` Gilad Ben-Yossef
  2011-08-31 14:05           ` Gilad Ben-Yossef
  1 sibling, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-31 10:33 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Frederic Weisbecker, Gilad Ben-Yossef, LKML, Andrew Morton,
	Anton Blanchard, Avi Kivity, Ingo Molnar, Lai Jiangshan,
	Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Wed, 2011-08-31 at 12:26 +0200, Mike Galbraith wrote:
> On Wed, 2011-08-31 at 11:28 +0200, Peter Zijlstra wrote:
> > On Wed, 2011-08-31 at 05:47 +0200, Mike Galbraith wrote:
> > > On Tue, 2011-08-30 at 16:06 +0200, Frederic Weisbecker wrote:
> > > 
> > > > Ah I haven't tested with that isolcpus especially as it's headed toward
> > > > removal.
> > > 
> > > It is?  Is it being replaced by something?  (I find it useful)
> > 
> > cpusets? Afaict all it does is not include a number of cpus in the sched
> > domains, creating a bunch of independent scheduling cpus. You can use
> > cpusets to get to the same state.
> 
> Guess I'll have to try creating a cpuset per cpu, and see how it
> compares to isolcpus.  cset shield --cpu 4-63 didn't work well enough.

You need to play with the sched_load_balance file, no idea what this
cset utility is though, never encountered it before.


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-31  9:17                           ` Peter Zijlstra
@ 2011-08-31 13:37                             ` Frederic Weisbecker
  2011-08-31 14:41                               ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-08-31 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Wed, Aug 31, 2011 at 11:17:25AM +0200, Peter Zijlstra wrote:
> On Wed, 2011-08-31 at 00:24 +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 30, 2011 at 10:58:38PM +0200, Peter Zijlstra wrote:
> > > On Tue, 2011-08-30 at 17:42 +0200, Peter Zijlstra wrote:
> > > > On Tue, 2011-08-30 at 17:33 +0200, Frederic Weisbecker wrote:
> > > > > > See all that is still kernelspace ;-) I think I know what you mean to
> > > > > > say though, but seeing as you note there is even now a known shortcoming
> > > > > > I'm not very confident its a solid construction. What will help us find
> > > > > > such holes?
> > > > > 
> > > > > This: https://lkml.org/lkml/2011/6/23/744
> > > > > 
> > > > > It's in one of Paul's branches and should make it for the next merge window.
> > > > > This should detect any of such holes. I made that on purpose for the nohz cpusets
> > > > > when I saw how much error prone that can be with rcu :)
> > > > 
> > > > OK, good ;-)
> > > > 
> > > > > > I would much rather we not rely on such fragile things too much.. this
> > > > > > RCU stuff wants way more thought, as it stands your patch-set doesn't do
> > > > > > anything useful IMO.
> > > > > 
> > > > > Not sure what you mean. Well that Rcu thing for sure is fragile but we have
> > > > > the tools ready to find the problems. 
> > > > 
> > > > Right that thing you linked above does catch abuse, still your current
> > > > proposal means that due to RCU it will basically never disable the tick.
> > > 
> > > So how about something like:
> > > 
> > > Assuming we are in rcu_nohz state; on kernel enter we leave rcu_nohz but
> > > don't start the tick, instead we assign another cpu to run our state
> > > machine.
> > 
> > The nohz CPU still has to notice its own quiescent states. 
> 
> Why? rcu-sched can use a context-switch counter, rcu-preempt doesn't
> even need that. Remote cpus can notice those just fine.

If it's fine to rely only on context switches, which in theory don't
happen within a bounded time, then ok.

Would be nice to hear about Paul's opinion on that.
 
> > Now it could be
> > an optimization to ask another CPU to handle all the rest once that quiescent
> > state is found. That doesn't solve our main problem though which is to
> > reliably report quiescent states when asked for.
> 
> No, seriously, RCU should not, ever, need to re-enable the tick. Imagine
> a HPC workload where the system cores are also responsible for all IO
> and all the adaptive-nohz cores are simply crunching numbers. In that
> scenario you'll have a very high rcu usage because the system cores are
> all very busy arranging work for the computation cores.

Of course, if we find a better way than having to restart the tick, I'm
all for doing it that way.

That said, if it requires significant changes, this should be done
outside this patchset, maybe as an optimization afterward; the patchset
is already big while still missing very important features that the
tick handles.

> > > On kernel exit we 'donate' all our rcu state to a willing victim (the
> > > same that earlier was kind enough to drive our state) and undo our
> > > entire GP accounting and re-enter rcu_nohz state.
> > 
> > That's already what does rcu_enter_nohz().
> 
> Almost but not quite, it doesn't donate the callbacks for example
> (something it does do on hotplug -- and therefore any assumption the
> callback will in fact run on the cpu you submit it on is already
> broken).

Good to know, so that would avoid restarting the tick on call_rcu()?
Sounds good, but again I think this should be done later.

> 
> > > If between that time we did restart the tick, we take back our rcu state
> > > and skip the donate and rcu_nohz enter on kernel exit.
> > 
> > That's also what is done in this patchset. 
> 
> It's not, since you don't hand off the grace period detection you don't
> take it back now, do you..

So you are talking about grace period started locally due to local
callbacks enqueued, right?


> > As soon as we re-enter the kernel
> > or the tick had to be restarted before we re-enter the kernel,
> 
> Another impossibility, you can only restart the tick from the kernel.

Ok, I meant it can be restarted from an interrupt that interrupts
userspace. I was talking about kernel enter/exit, considering the new
hooks introduced (syscalls and exceptions).

> >  we call
> > rcu_exit_nohz() that pulls back the CPU to the whole RCU machinery.
> 
> But you then also start the tick again..

When we enter kernel? (minus interrupts)
No we only call rcu_exit_nohz().


* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-30 14:06   ` Frederic Weisbecker
  2011-08-31  3:47     ` Mike Galbraith
@ 2011-08-31 13:57     ` Gilad Ben-Yossef
  2011-08-31 14:30       ` Peter Zijlstra
  1 sibling, 1 reply; 139+ messages in thread
From: Gilad Ben-Yossef @ 2011-08-31 13:57 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Peter Zijlstra,
	Stephen Hemminger, Thomas Gleixner, Tim Pepper

On Tue, Aug 30, 2011 at 5:06 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:

>
> >
> > To make double sure I have no task on my nohz cpuset CPU, I've booted
> > the system with the isolcpus command line isolating the same cpu I've
> > assigned to the nohz set. This shouldn't be needed of course, but just
> > in case.
>
> Ah I haven't tested with that isolcpus especially as it's headed toward
> removal.
>

I had the cpuisol option in the boot loader config but I did set up a
proper cpuset as well, so I believe it made no difference.

I added the cpuisol option after noticing how many tasks I was unable
to move from the root cpuset to the system cpuset due to them being
bound per CPU, and hoped (for no good reason, I admit) that cpuisol
would somehow help with the task isolation I needed to test the
nohz task.

That's a different obstacle for workloads of the kind where a nohz
cpuset would be useful, but it should probably be discussed in another
thread :-)

>
> >
> > I then ran a silly program I've written that basically eats CPU cycles
> > (https://github.com/gby/cpueat) and assigned it to the nohz set and
> > monitored the number of interrupts using /proc/interrupts
> >
> > Now, for the things I've noticed -
> >
> > 1. Before I turn adaptive_nohz to 1, when no task is running on the
> > nohz cpuset cpu, the tick is indeed idle (regular nohz case) and very
> > few function call IPIs are seen. However, when I turn adaptive_nohz to
> > 1 (but still with no task running on the CPU), the tick remains idle,
> > but I get an IPI function call interrupt almost in the rate the tick
> > would have been.
>
> Yeah I believe this is due to RCU that tries to wake up our nohz CPU.
> I need to have a deeper look there.

I believe you are right with the reason for the IPI.

Before setting adaptive_nohz for the cpuset I did not get the IPI on
an idle CPU. After setting it  I started getting the IPI regularly
even when the CPU was idle.

> > 2. When I run my little cpueat program on the nohz CPU, the tick does
> > not actually goes off. Instead it ticks away as usual. I know it is
> > the only legible task to run, since as soon as I kill it  the tick
> > turns off (regular nohz mode again). I've tinkered around and found
> > out that what stops the tick going away is the check for rcu_pending()
> > in cpuset_nohz_can_stop_tick(). It seems to always be true. When I
> > removed that check experimentally and repeat the test, the tick indeed
> > stops with my cpueat task running. Of course, I don't suggest this is
> > the sane thing to do - I just wondered if that what stopped the tick
> > going away and it seems that it is.
>
> Are you sure the tick never goes off?

Yes, I put debug code in the cpuset_nohz_can_stop_tick(). Every time
the function was called for that CPU rcu_pending() returned 1.


> But yeah may be there is something that constantly requires RCU grace
> periods to complete in your system. I should drop the rcu_pending()
> check as long as we want to stop the tick from userspace because
> there we are off the RCU state machine.

I added debug code to rcu_pending() and noticed that the rcu_bh was
the one pending each time.

I found that odd - my VM didn't even have a network interface
configured (except maybe for lo), let alone any network traffic. I
thought rcu_bh was mostly used by networking code (Paul?)


> > 3. My little cpueat program tries to fork a child process after 100k
> > iteration of some CPU bound loop. It usually takes a few seconds to
> > happen. The idea is to make sure that the tick resumes when nr_running
> > > 1. In my case, I got a kernel panic. Since it happened with some
> > debug code I added and with aforementioned experimental removal of
> > rcu_pending check, I'm assuming for now it's all my fault but will
> > look into verifying it further and will send panic logs if it proves
> > useful.
>
> I got some panic too but haven't seen any for some time. I made a
> lot of changes since then though so I thought the condition to trigger
> it just went away.
>
> IIRC, it was a locking inversion against the rq lock and some other lock.
> Very nice condition for a cool lockup ;)

It certainly sounds exciting :-)

Let me know if I an help test anything else.

Thanks,
Gilad


--
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com
"Dance like no one is watching, love like you'll never be hurt, sing
like no one is listening... but for BEEP sake you better code like
you're going to maintain it for years!"


* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-31 10:33           ` Peter Zijlstra
@ 2011-08-31 14:00             ` Gilad Ben-Yossef
  2011-08-31 14:26               ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Gilad Ben-Yossef @ 2011-08-31 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Frederic Weisbecker, LKML, Andrew Morton,
	Anton Blanchard, Avi Kivity, Ingo Molnar, Lai Jiangshan,
	Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

>> > cpusets? Afaict all it does is not include a number of cpus in the sched

I think you meant to write cpuisol here :-)

>> > domains, creating a bunch of independent scheduling cpus. You can use
>> > cpusets to get to the same state.

>>
>> Guess I'll have to try creating a cpuset per cpu, and see how it
>> compares to isolcpus.  cset shield --cpu 4-63 didn't work well enough.
>
> You need to play with the sched_load_balance file, no idea what this
> cset utility is though, never encountered it before.
>

It's just a python wrapper around the /cgroup file system.

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com
"Dance like no one is watching, love like you'll never be hurt, sing
like no one is listening... but for BEEP sake you better code like
you're going to maintain it for years!"


* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-31 10:26         ` Mike Galbraith
  2011-08-31 10:33           ` Peter Zijlstra
@ 2011-08-31 14:05           ` Gilad Ben-Yossef
  2011-08-31 16:12             ` Mike Galbraith
  1 sibling, 1 reply; 139+ messages in thread
From: Gilad Ben-Yossef @ 2011-08-31 14:05 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Andrew Morton,
	Anton Blanchard, Avi Kivity, Ingo Molnar, Lai Jiangshan,
	Paul E . McKenney, Stephen Hemminger, Thomas Gleixner,
	Tim Pepper

> Guess I'll have to try creating a cpuset per cpu, and see how it
> compares to isolcpus.  cset shield --cpu 4-63 didn't work well enough.

Try adding:

# cset shield --kthread on

It will try to move non bound kernel threads to the system set from
the root set.

The results seem to vary from introducing system instability, through
doing nothing at all, all the way to actually achieving better
isolation, depending on the specific kernel threads you have running
and your kernel version. Sometimes you get more than one of these
results at the same time. It's really like Russian roulette in a way,
only way geekier...

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com
"Dance like no one is watching, love like you'll never be hurt, sing
like no one is listening... but for BEEP sake you better code like
you're going to maintain it for years!"


* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-31 14:00             ` Gilad Ben-Yossef
@ 2011-08-31 14:26               ` Peter Zijlstra
  0 siblings, 0 replies; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-31 14:26 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Mike Galbraith, Frederic Weisbecker, LKML, Andrew Morton,
	Anton Blanchard, Avi Kivity, Ingo Molnar, Lai Jiangshan,
	Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Wed, 2011-08-31 at 17:00 +0300, Gilad Ben-Yossef wrote:
> >> > cpusets? Afaict all it does is not include a number of cpus in the sched
> 
> I think you meant to write cpuisol here :-)

Uhh, yeah ;-)

> >> > domains, creating a bunch of independent scheduling cpus. You can use
> >> > cpusets to get to the same state.
> 
> >>
> >> Guess I'll have to try creating a cpuset per cpu, and see how it
> >> compares to isolcpus.  cset shield --cpu 4-63 didn't work well enough.
> >
> > You need to play with the sched_load_balance file, no idea what this
> > cset utility is though, never encountered it before.
> >
> 
> It's just a python wrapper around the /cgroup file system.

ok, just never encountered it.


* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-31 13:57     ` Gilad Ben-Yossef
@ 2011-08-31 14:30       ` Peter Zijlstra
  0 siblings, 0 replies; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-31 14:30 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Frederic Weisbecker, LKML, Andrew Morton, Anton Blanchard,
	Avi Kivity, Ingo Molnar, Lai Jiangshan, Paul E . McKenney,
	Paul Menage, Stephen Hemminger, Thomas Gleixner, Tim Pepper

On Wed, 2011-08-31 at 16:57 +0300, Gilad Ben-Yossef wrote:
> I added the cpuisol option after noticing how many tasks I was unable
> to move from the root cpuset to the system cpuset due to them being
> bound per CPU 

Right, so ideally those tasks should be idle and not interfere. Where
this is not so, we should make it so.

When userspace didn't ask for anything to happen, nothing should happen.
When it did ask for it, well then it shouldn't complain it does :-)

Furthermore things like:

linux-2.6# git grep on_each_cpu mm/
mm/page_alloc.c:        on_each_cpu(drain_local_pages, NULL, 1);
mm/slab.c:      on_each_cpu(do_drain, cachep, 1);
mm/slab.c:      on_each_cpu(do_ccupdate_local, (void *)new, 1);
mm/slub.c:      on_each_cpu(flush_cpu_slab, s, 1);
mm/swap.c:      return schedule_on_each_cpu(lru_add_drain_per_cpu);

Should be converted to smp_call_function_many() and for each we should
keep a cpumask of cpus where there's work to do, avoiding disturbing
cpus that have been quiet.





* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-31 13:37                             ` Frederic Weisbecker
@ 2011-08-31 14:41                               ` Peter Zijlstra
  2011-09-01 16:40                                 ` Paul E. McKenney
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-08-31 14:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Wed, 2011-08-31 at 15:37 +0200, Frederic Weisbecker wrote:
> > Why? rcu-sched can use a context-switch counter, rcu-preempt doesn't
> > even need that. Remote cpus can notice those just fine.
> 
> If that's fine to only rely on context switches, which don't happen in
> a bounded time in theory, then ok.

But (!PREEMPT) rcu already depends on that, and suffers this lack of
time-bounds. What it does to expedite matters is force context switches,
but nowhere is it written the GP is bounded by anything sane.

> > But you then also start the tick again..
> 
> When we enter kernel? (minus interrupts)
> No we only call rcu_exit_nohz(). 

So thinking more about all this:

rcu_exit_nohz() will make remote cpus wait for us, this is exactly what
is needed because we might have looked at pointers. Lacking a tick we
don't progress our own state but that is fine, !PREEMPT RCU wouldn't
have been able to progress our state anyway since we haven't scheduled
(there's nothing to schedule to except idle, see below).

Then when we leave the kernel (or go idle) we re-enter rcu_nohz state,
and the other cpus will ignore our contribution (since we have entered a
QS and can't be holding any pointers) the other CPUs can continue and
complete the GP and run the callbacks.

I haven't fully considered PREEMPT RCU quite yet, but I'm thinking we
can get away with something similar.

So per the above we don't need the tick at all (for the case of
nr_running=[0,1]), RCU will sort itself out.

Now I forgot where all you send IPIs from, and I'll go look at these
patches once more.

As for call_rcu() for that we can indeed wake the tick (on leaving
kernel space or entering idle, no need to IPI since we can't process
anything before that anyway) or we could hand off our call list to a
'willing' victim.

But yeah, input from Paul would be nice...


* Re: [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks)
  2011-08-31 14:05           ` Gilad Ben-Yossef
@ 2011-08-31 16:12             ` Mike Galbraith
  0 siblings, 0 replies; 139+ messages in thread
From: Mike Galbraith @ 2011-08-31 16:12 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Andrew Morton,
	Anton Blanchard, Avi Kivity, Ingo Molnar, Lai Jiangshan,
	Paul E . McKenney, Stephen Hemminger, Thomas Gleixner,
	Tim Pepper

On Wed, 2011-08-31 at 17:05 +0300, Gilad Ben-Yossef wrote:
> > Guess I'll have to try creating a cpuset per cpu, and see how it
> > compares to isolcpus.  cset shield --cpu 4-63 didn't work well enough.
> 
> Try adding:
> 
> # cset shield --kthread on

Yeah, did that.

Killing load balancing in the cpuset should eliminate the problem.
Well, for the test app I was running anyway, which has everything
pinned, and pure RT.

	-Mike



* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-08-31 14:41                               ` Peter Zijlstra
@ 2011-09-01 16:40                                 ` Paul E. McKenney
  2011-09-01 17:13                                   ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Paul E. McKenney @ 2011-09-01 16:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Andrew Morton, Anton Blanchard,
	Avi Kivity, Ingo Molnar, Lai Jiangshan, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Wed, Aug 31, 2011 at 04:41:00PM +0200, Peter Zijlstra wrote:
> On Wed, 2011-08-31 at 15:37 +0200, Frederic Weisbecker wrote:
> > > Why? rcu-sched can use a context-switch counter, rcu-preempt doesn't
> > > even need that. Remote cpus can notice those just fine.
> > 
> > If that's fine to only rely on context switches, which don't happen in
> > a bounded time in theory, then ok.
> 
> But (!PREEMPT) rcu already depends on that, and suffers this lack of
> time-bounds. What it does to expedite matters is force context switches,
> but nowhere is it written the GP is bounded by anything sane.

Ah, but it really is written, among other things, by the OOM killer.  ;-)

> > > But you then also start the tick again..
> > 
> > When we enter kernel? (minus interrupts)
> > No we only call rcu_exit_nohz(). 
> 
> So thinking more about all this:
> 
> rcu_exit_nohz() will make remote cpus wait for us, this is exactly what
> is needed because we might have looked at pointers. Lacking a tick we
> don't progress our own state but that is fine, !PREEMPT RCU wouldn't
> have been able to progress our state anyway since we haven't scheduled
> (there's nothing to schedule to except idle, see below).

Lacking a tick, the CPU also fails to respond to state updates from
other CPUs.

> Then when we leave the kernel (or go idle) we re-enter rcu_nohz state,
> and the other cpus will ignore our contribution (since we have entered a
> QS and can't be holding any pointers) the other CPUs can continue and
> complete the GP and run the callbacks.

This is true.

> I haven't fully considered PREEMPT RCU quite yet, but I'm thinking we
> can get away with something similar.

All the ways I know of to make PREEMPT_RCU live without a scheduling
clock tick while not in some form of dyntick-idle mode require either
IPIs or read-side memory barriers.  The special case where all CPUs
are in dyntick-idle mode and something needs to happen also needs to
be handled correctly.

Or are you saying that PREEMPT_RCU does not need a CPU to take
scheduling-clock interrupts while that CPU is in dyntick-idle mode?
That is true enough.

> So per the above we don't need the tick at all (for the case of
> nr_running=[0,1]), RCU will sort itself out.
> 
> Now I forgot where all you send IPIs from, and I'll go look at these
> patches once more.
> 
> As for call_rcu() for that we can indeed wake the tick (on leaving
> kernel space or entering idle, no need to IPI since we can't process
> anything before that anyway) or we could hand off our call list to a
> 'willing' victim.
> 
> But yeah, input from Paul would be nice...

In the call_rcu() case, I do have some code in preparation that allows
CPUs to have non-empty callback queues and still be tickless.  There
are some tricky corner cases, but it does look possible.  (Famous last
words...)

The reason for doing this is that people are enabling
CONFIG_RCU_FAST_NO_HZ on systems that have no business enabling it.
Bad choice of names on my part.

							Thanx, Paul


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-09-01 16:40                                 ` Paul E. McKenney
@ 2011-09-01 17:13                                   ` Peter Zijlstra
  2011-09-02  1:41                                     ` Paul E. McKenney
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-09-01 17:13 UTC (permalink / raw)
  To: paulmck
  Cc: Frederic Weisbecker, LKML, Andrew Morton, Anton Blanchard,
	Avi Kivity, Ingo Molnar, Lai Jiangshan, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Thu, 2011-09-01 at 09:40 -0700, Paul E. McKenney wrote:
> On Wed, Aug 31, 2011 at 04:41:00PM +0200, Peter Zijlstra wrote:
> > On Wed, 2011-08-31 at 15:37 +0200, Frederic Weisbecker wrote:
> > > > Why? rcu-sched can use a context-switch counter, rcu-preempt doesn't
> > > > even need that. Remote cpus can notice those just fine.
> > > 
> > > If that's fine to only rely on context switches, which don't happen in
> > > a bounded time in theory, then ok.
> > 
> > But (!PREEMPT) rcu already depends on that, and suffers this lack of
> > time-bounds. What it does to expedite matters is force context switches,
> > but nowhere is it written the GP is bounded by anything sane.
> 
> Ah, but it really is written, among other things, by the OOM killer.  ;-)

Well there is that of course :-) But I think the below argument relies
on what we already have without requiring more.

> > > > But you then also start the tick again..
> > > 
> > > When we enter kernel? (minus interrupts)
> > > No we only call rcu_exit_nohz(). 
> > 
> > So thinking more about all this:
> > 
> > rcu_exit_nohz() will make remote cpus wait for us, this is exactly what
> > is needed because we might have looked at pointers. Lacking a tick we
> > don't progress our own state but that is fine, !PREEMPT RCU wouldn't
> > have been able to progress our state anyway since we haven't scheduled
> > (there's nothing to schedule to except idle, see below).
> 
> Lacking a tick, the CPU also fails to respond to state updates from
> other CPUs.

I'm sure I'll have to go re-read your documents, but does that matter?
If we would have had a tick we still couldn't have progressed since we
wouldn't have scheduled, etc., so we would hold up GP completion anyway.

> > Then when we leave the kernel (or go idle) we re-enter rcu_nohz state,
> > and the other cpus will ignore our contribution (since we have entered a
> > QS and can't be holding any pointers) the other CPUs can continue and
> > complete the GP and run the callbacks.
> 
> This is true.

So suppose all other CPUs completed the GP and our CPU is the one
holding things up, now I don't see rcu_enter_nohz() doing anything much
at all, who is responsible for GP completion?

> > I haven't fully considered PREEMPT RCU quite yet, but I'm thinking we
> > can get away with something similar.
> 
> All the ways I know of to make PREEMPT_RCU live without a scheduling
> clock tick while not in some form of dyntick-idle mode require either
> IPIs or read-side memory barriers.  The special case where all CPUs
> are in dyntick-idle mode and something needs to happen also needs to
> be handled correctly.
> 
> Or are you saying that PREEMPT_RCU does not need a CPU to take
> scheduling-clock interrupts while that CPU is in dyntick-idle mode?
> That is true enough.

I'm not saying anything much about PREEMPT_RCU, I voiced an
ill-considered suspicion :-)

So in the nr_running=[0,1] case we're in rcu_nohz state when idle or
when in userspace. The only interesting part is being in kernel space
where we cannot be in rcu_nohz state because we might actually use
pointers and thus have to stop callbacks from destroying state etc..

The only PREEMPT_RCU implementation I can recall is the counting one,
and that one does indeed want a tick, because even in kernel space it
could move things forward if the 'old' index counter reaches 0.

Now we could possibly add magic to rcu_read_unlock_special() to restart
the tick in that case.

Now clearly all that might be non-applicable to the current one, will
have to wrap my head around the current PREEMPT_RCU implementation some
more.

> > So per the above we don't need the tick at all (for the case of
> > nr_running=[0,1]), RCU will sort itself out.
> > 
> > Now I forgot where all you send IPIs from, and I'll go look at these
> > patches once more.
> > 
> > As for call_rcu() for that we can indeed wake the tick (on leaving
> > kernel space or entering idle, no need to IPI since we can't process
> > anything before that anyway) or we could hand off our call list to a
> > 'willing' victim.
> > 
> > But yeah, input from Paul would be nice...
> 
> In the call_rcu() case, I do have some code in preparation that allows
> CPUs to have non-empty callback queues and still be tickless.  There
> are some tricky corner cases, but it does look possible.  (Famous last
> words...)

Handing your callbacks to someone else is one solution, but I'm not
overly worried about restarting the tick if we do call_rcu().

> The reason for doing this is that people are enabling
> CONFIG_RCU_FAST_NO_HZ on systems that have no business enabling it.
> Bad choice of names on my part.

hehe :-)


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-09-01 17:13                                   ` Peter Zijlstra
@ 2011-09-02  1:41                                     ` Paul E. McKenney
  2011-09-02  8:24                                       ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Paul E. McKenney @ 2011-09-02  1:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Andrew Morton, Anton Blanchard,
	Avi Kivity, Ingo Molnar, Lai Jiangshan, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Thu, Sep 01, 2011 at 07:13:00PM +0200, Peter Zijlstra wrote:
> On Thu, 2011-09-01 at 09:40 -0700, Paul E. McKenney wrote:
> > On Wed, Aug 31, 2011 at 04:41:00PM +0200, Peter Zijlstra wrote:
> > > On Wed, 2011-08-31 at 15:37 +0200, Frederic Weisbecker wrote:
> > > > > Why? rcu-sched can use a context-switch counter, rcu-preempt doesn't
> > > > > even need that. Remote cpus can notice those just fine.
> > > > 
> > > > If that's fine to only rely on context switches, which don't happen in
> > > > a bounded time in theory, then ok.
> > > 
> > > But (!PREEMPT) rcu already depends on that, and suffers this lack of
> > > time-bounds. What it does to expedite matters is force context switches,
> > > but nowhere is it written the GP is bounded by anything sane.
> > 
> > Ah, but it really is written, among other things, by the OOM killer.  ;-)
> 
> Well there is that of course :-) But I think the below argument relies
> on what we already have without requiring more.

Almost.  ;-)

> > > > > But you then also start the tick again..
> > > > 
> > > > When we enter kernel? (minus interrupts)
> > > > No we only call rcu_exit_nohz(). 
> > > 
> > > So thinking more about all this:
> > > 
> > > rcu_exit_nohz() will make remote cpus wait for us, this is exactly what
> > > is needed because we might have looked at pointers. Lacking a tick we
> > > don't progress our own state but that is fine, !PREEMPT RCU wouldn't
> > > have been able to progress our state anyway since we haven't scheduled
> > > (there's nothing to schedule to except idle, see below).
> > 
> > Lacking a tick, the CPU also fails to respond to state updates from
> > other CPUs.
> 
> I'm sure I'll have to go re-read your documents, but does that matter?
> If we would have had a tick we still couldn't have progressed since we
> wouldn't have scheduled etc.. so we would hold up GP completion any way.

There are two phases to quiescent-state detection: (1) actually
detecting the quiescent state and (2) reporting detection to the
RCU core.  If you turn off the tick at an inopportune time, you
can have CPUs that have detected the quiescent state, but not yet reported it.

You asked the follow-up question below, so please see below.

> > > Then when we leave the kernel (or go idle) we re-enter rcu_nohz state,
> > > and the other cpus will ignore our contribution (since we have entered a
> > > QS and can't be holding any pointers) the other CPUs can continue and
> > > complete the GP and run the callbacks.
> > 
> > This is true.
> 
> So suppose all other CPUs completed the GP and our CPU is the one
> holding things up, now I don't see rcu_enter_nohz() doing anything much
> at all, who is responsible for GP completion?

Any CPU that has RCU callbacks queued that are waiting for the current or
some subsequent grace period to complete is responsible for pushing the
current grace period forward, hence the checks in the non-RCU_FAST_NO_HZ
variants of rcu_needs_cpu().  This is why CPUs with callbacks that are
not yet done cannot currently disable the tick -- because we need at
least one CPU to detect the fact that dyntick-idle CPUs are in fact in
extended quiescent states.
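The rule Paul describes can be sketched as a tiny userspace model: a CPU may
only stop its tick while it has no callbacks still waiting on a grace period.
All names and fields below are illustrative stand-ins, not the kernel's actual
rcu_needs_cpu() or tick code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only: the kernel's real state lives in
 * rcu_data/tick_sched, not in a struct like this. */
struct cpu_model {
	int queued_callbacks;	/* callbacks waiting for a grace period */
	bool tick_stopped;
};

/* Mirrors the spirit of rcu_needs_cpu(): callbacks pending means
 * this CPU must keep ticking to push the grace period forward. */
static bool model_rcu_needs_cpu(const struct cpu_model *cm)
{
	return cm->queued_callbacks > 0;
}

/* Returns true if the tick was actually stopped. */
static bool model_try_stop_tick(struct cpu_model *cm)
{
	if (model_rcu_needs_cpu(cm))
		return false;	/* keep the tick, GP needs a pusher */
	cm->tick_stopped = true;
	return true;
}
```

With callbacks queued the model refuses to stop the tick, which is exactly the
behavior the non-RCU_FAST_NO_HZ rcu_needs_cpu() checks enforce.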

Again, I believe that I can do better, hence the in-progress rewrite of
RCU_FAST_NO_HZ.  Either that or get most people to stop using it, and
given its name, getting people to stop using it is likely an exercise
in futility.  "But it is FAST, and that is good, and it involves NO_HZ,
which saves energy, which is also good.  Therefore, I will enable it
-everywhere-!!!"

Sigh.  It will be much easier to rewrite it, ugly corner cases
notwithstanding.  :-(

> > > I haven't fully considered PREEMPT RCU quite yet, but I'm thinking we
> > > can get away with something similar.
> > 
> > All the ways I know of to make PREEMPT_RCU live without a scheduling
> > clock tick while not in some form of dyntick-idle mode require either
> > IPIs or read-side memory barriers.  The special case where all CPUs
> > are in dyntick-idle mode and something needs to happen also needs to
> > be handled correctly.
> > 
> > Or are you saying that PREEMPT_RCU does not need a CPU to take
> > scheduling-clock interrupts while that CPU is in dyntick-idle mode?
> > That is true enough.
> 
> I'm not saying anything much about PREEMPT_RCU, I voiced an
> ill-considered suspicion :-)

;-)

> So in the nr_running=[0,1] case we're in rcu_nohz state when idle or
> when in userspace. The only interesting part is being in kernel space
> where we cannot be in rcu_nohz state because we might actually use
> pointers and thus have to stop callbacks from destroying state etc..

Yep!

> The only PREEMPT_RCU implementation I can recall is the counting one,
> and that one does indeed want a tick, because even in kernel space it
> could move things forward if the 'old' index counter reaches 0.
> 
> Now we could possibly add magic to rcu_read_unlock_special() to restart
> the tick in that case.

Not from NMI handlers we can't.  Unless I am really confused about the
code that restarts the tick.  Which is not impossible, but ...

I don't currently have an opinion about the advisability of restarting
the tick from hardIRQ handlers, but I do feel the need to point out
the possibility.

> Now clearly all that might be non-applicable to the current one, will
> have to wrap my head around the current PREEMPT_RCU implementation some
> more.

Indeed, the documentation is going much more slowly than I would like...

> > > So per the above we don't need the tick at all (for the case of
> > > nr_running=[0,1]), RCU will sort itself out.
> > > 
> > > Now I forgot where all you send IPIs from, and I'll go look at these
> > > patches once more.
> > > 
> > > As for call_rcu() for that we can indeed wake the tick (on leaving
> > > kernel space or entering idle, no need to IPI since we can't process
> > > anything before that anyway) or we could hand off our call list to a
> > > 'willing' victim.
> > > 
> > > But yeah, input from Paul would be nice...
> > 
> > In the call_rcu() case, I do have some code in preparation that allows
> > CPUs to have non-empty callback queues and still be tickless.  There
> > are some tricky corner cases, but it does look possible.  (Famous last
> > words...)
> 
> Handing your callback off to someone else is one solution, but I'm not
> overly worried about restarting the tick if we do call_rcu().

As long as the handoff doesn't turn into a battery-killing game of
RCU-callback hot potato.  And I am seriously concerned about this
possibility.

I will be updating the Dyntick-Idle doc to cover the new RCU_FAST_NO_HZ
algorithm if and when I get it into human-readable form.

> > The reason for doing this is that people are enabling
> > CONFIG_RCU_FAST_NO_HZ on systems that have no business enabling it.
> > Bad choice of names on my part.
> 
> hehe :-)

Sigh!  Scanning dyntick-idle state on a system with 256 CPUs.  What could
possibly go wrong?  :-/

							Thanx, Paul


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-09-02  1:41                                     ` Paul E. McKenney
@ 2011-09-02  8:24                                       ` Peter Zijlstra
  2011-09-04 19:37                                         ` Paul E. McKenney
  0 siblings, 1 reply; 139+ messages in thread
From: Peter Zijlstra @ 2011-09-02  8:24 UTC (permalink / raw)
  To: paulmck
  Cc: Frederic Weisbecker, LKML, Andrew Morton, Anton Blanchard,
	Avi Kivity, Ingo Molnar, Lai Jiangshan, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Thu, 2011-09-01 at 18:41 -0700, Paul E. McKenney wrote:
> > Now we could possibly add magic to rcu_read_unlock_special() to restart
> > the tick in that case.
> 
> Not from NMI handlers we can't.  Unless I am really confused about the
> code that restarts the tick.  Which is not impossible, but ... 

Uhm.. good point ;-) yes that will be somewhat detrimental to the
overall OS health if we attempt such a thing.


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-09-02  8:24                                       ` Peter Zijlstra
@ 2011-09-04 19:37                                         ` Paul E. McKenney
  2011-09-05 14:28                                           ` Peter Zijlstra
  0 siblings, 1 reply; 139+ messages in thread
From: Paul E. McKenney @ 2011-09-04 19:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Andrew Morton, Anton Blanchard,
	Avi Kivity, Ingo Molnar, Lai Jiangshan, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Fri, Sep 02, 2011 at 10:24:31AM +0200, Peter Zijlstra wrote:
> On Thu, 2011-09-01 at 18:41 -0700, Paul E. McKenney wrote:
> > > Now we could possibly add magic to rcu_read_unlock_special() to restart
> > > the tick in that case.
> > 
> > Not from NMI handlers we can't.  Unless I am really confused about the
> > code that restarts the tick.  Which is not impossible, but ... 
> 
> Uhm.. good point ;-) yes that will be somewhat detrimental to the
> overall OS health if we attempt such a thing.

But if you don't mind, I would very much like to use your question as
a Quick Quiz.  It should be a good one.  ;-)

							Thanx, Paul


* Re: [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs
  2011-09-04 19:37                                         ` Paul E. McKenney
@ 2011-09-05 14:28                                           ` Peter Zijlstra
  0 siblings, 0 replies; 139+ messages in thread
From: Peter Zijlstra @ 2011-09-05 14:28 UTC (permalink / raw)
  To: paulmck
  Cc: Frederic Weisbecker, LKML, Andrew Morton, Anton Blanchard,
	Avi Kivity, Ingo Molnar, Lai Jiangshan, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Sun, 2011-09-04 at 12:37 -0700, Paul E. McKenney wrote:
> 
> But if you don't mind, I would very much like to use your question as
> a Quick Quiz.  It should be a good one.  ;-) 

No problem.. :-)


* Re: [PATCH 06/32] nohz: Move idle ticks stats tracking out of nohz handlers
  2011-08-29 14:28   ` Peter Zijlstra
@ 2011-09-06  0:35     ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Mon, Aug 29, 2011 at 04:28:08PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > Idle ticks time tracking is merged into nohz stop/restart
> > handlers. Pull it out into idle entry/exit handlers instead,
> > so that the nohz APIs are more idle independent.
> 
> Are you trying to say:
> 
> Currently idle time tracking is part of the nohz state tracking,
> separate this so that we can disable the tick while we're non-idle?

Right.

> 
> If so, how does idle time tracking work on a !NOHZ kernel, surely such
> things can be idle too..

In that case the tick calls account_process_tick(), which delegates
to account_idle_time() if we are idle when the tick fires.

That is also used in NOHZ kernels, in fact.

On every tick we call account_process_tick(); when the tick is stopped
we save the current jiffies value in ts->idle_jiffies, and when it is
restarted we account jiffies - ts->idle_jiffies as idle time.
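That bookkeeping can be modeled in userspace roughly like this; the struct is
a stand-in for the relevant fields of the kernel's tick_sched, not its real
layout:

```c
#include <assert.h>

/* Illustrative stand-in for tick_sched's idle accounting fields. */
struct tick_model {
	unsigned long idle_jiffies;	/* jiffies snapshot at tick stop */
	unsigned long idle_accounted;	/* total jiffies accounted as idle */
};

/* When the tick stops, remember when that happened. */
static void model_tick_stop(struct tick_model *ts, unsigned long jiffies)
{
	ts->idle_jiffies = jiffies;
}

/* When the tick restarts, the elapsed jiffies were all idle time. */
static void model_tick_restart(struct tick_model *ts, unsigned long jiffies)
{
	ts->idle_accounted += jiffies - ts->idle_jiffies;
}
```

A stop at jiffy 1000 followed by a restart at jiffy 1042 accounts 42 jiffies
of idle time, which is the jiffies - ts->idle_jiffies delta described above.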


* Re: [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset
  2011-08-29 15:25   ` Peter Zijlstra
@ 2011-09-06 13:03     ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-09-06 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper, Paul Menage

On Mon, Aug 29, 2011 at 05:25:36PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > When a CPU is included in a nohz cpuset, try to switch
> > it to nohz mode from the timer interrupt if it is running
> > only one non-idle task.
> > 
> > Then restart the tick if necessary from the wakeup path
> > if we are enqueuing a second task while the timer is stopped,
> > so that the scheduler tick is rearmed.
> 
> Shouldn't you first put the syscall hooks in place before allowing the
> tick to be switched off? It seems this patch is somewhat too early in
> the series.

I don't think it's necessary, that part doesn't depend on userspace hooks.
The whole thing is enabled very late anyway.

 
> > This assumes we are using TTWU_QUEUE sched feature so I need
> > to handle the off case (or actually not handle it but properly),
> > because we need the adaptive tick restart and what will come
> > along in further patches to be done locally and before the new
> > task ever gets scheduled.
> 
> We could certainly remove that feature flag and always use it, it was
> mostly a transition debug switch in case something didn't work or
> performance issues were found due to this.

Ok, good.

> > I also need to look at the ARCH_WANT_INTERRUPTS_ON_CTXW case
> > and the remote wakeups.
> 
> Well, ideally we'd simply get rid of that, rmk has some preliminary
> patches in that direction.

Great!

Thanks.


* Re: [PATCH 09/32] nohz: Move ts->idle_calls into strict idle logic
  2011-08-30 15:33               ` Peter Zijlstra
@ 2011-09-06 16:35                 ` Frederic Weisbecker
  0 siblings, 0 replies; 139+ messages in thread
From: Frederic Weisbecker @ 2011-09-06 16:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Tue, Aug 30, 2011 at 05:33:42PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-30 at 16:45 +0200, Frederic Weisbecker wrote:
> > On Mon, Aug 29, 2011 at 08:33:23PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2011-08-29 at 20:23 +0200, Frederic Weisbecker wrote:
> > > 
> > > > > Well, no, on interrupt return you shouldn't do anything. If you've
> > > > > stopped the tick it stays stopped until you do something that needs it,
> > > > > then that action will re-enable it.
> > > > 
> > > > Sure, when something needs the tick in this mode, we usually
> > > > receive an IPI and restart the tick from there but then
> > > > tick_nohz_stop_sched_tick() handles the cases with *needs_cpu()
> > > > very well on interrupt return (our IPI return) by doing a kind
> > > > of "light" HZ mode by logically switching to nohz mode but
> > > > with the next timer happening in HZ, assuming it's a matter
> > > > of one tick and we will switch to a real nohz behaviour soon.
> > > > 
> > > > I don't see a good reason to duplicate that logic with a pure
> > > > restart from the IPI.
> > > 
> > > That sounds like an optimization, and should thus be done later.
> > 
> > The optimization is already there upstream. I can split the logic for
> > non-idle case but I'm not sure about the point of that.
> 
> care to point me to the relevant code, because I can't remember a
> half-assed nohz state.. then again, maybe I didn't look hard enough.

It's in tick_nohz_stop_sched_tick():

	if (rcu_needs_cpu(cpu) || printk_needs_cpu(cpu) ||
	    arch_needs_cpu(cpu)) {
		next_jiffies = last_jiffies + 1;
		delta_jiffies = 1;
	} else {
		/* Get the next timer wheel timer */
		next_jiffies = get_next_timer_interrupt(last_jiffies);
		delta_jiffies = next_jiffies - last_jiffies;
	}

There are two cases:

- tick_nohz_stop_sched_tick() is called from idle. If one of the *_needs_cpu()
is true, then the tick is not stopped. It doesn't even enter the nohz logic.

- tick_nohz_stop_sched_tick() is called from an interrupt exit while we are
idle. If one of the *_needs_cpu() is true and the tick was previously
stopped, we program the next timer to fire in 1 jiffy, but without exiting
nohz mode. Otherwise, if the tick was not stopped, we don't stop it.

In fact those *_needs_cpu() are treated like timer-list timers or hrtimers
that expire one jiffy from now.

When we are in such a scheme:

	tick_nohz_stop_sched_tick();
	while (!need_resched())
		hlt();
	tick_nohz_restart_sched_tick();

It avoids going back and forth between the HZ/nohz logic from inside
the loop.
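A rough userspace model of the branch quoted earlier, where needs_cpu stands
in for the combined rcu/printk/arch checks and next_timer for the value
get_next_timer_interrupt() would return (both are assumptions for the sake of
the sketch):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the "light" nohz decision: if anything still needs the
 * CPU, stay logically in nohz mode but program the next event just
 * one jiffy away; otherwise sleep until the next timer-wheel timer. */
static unsigned long model_next_event(bool needs_cpu,
				      unsigned long last_jiffies,
				      unsigned long next_timer)
{
	if (needs_cpu)
		return last_jiffies + 1;	/* one-jiffy "light" nohz */
	return next_timer;			/* real long nohz sleep */
}
```

So a pending *_needs_cpu() behaves exactly like a timer expiring one jiffy
from now, without a full exit from the nohz state.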
		
 
> > > > > > That said I wonder if some of the above conditions should restore a periodic
> > > > > > behaviour on interrupt return...
> > > > > 
> > > > > I would expect the tick not to be stopped when tick_nohz_can_stop_tick()
> > > > > returns false. If it returns true, then I expect anything that needs it
> > > > > to re-enable it.
> > > > > 
> > > > 
> > > > Yeah. In the case of need_resched() in idle I believe the CPU doesn't
> > > > really go to sleep later so it should be fine. But for the case of
> > > > softirq pending or nohz_mode, I'm not sure...
> > > 
> > > softirqs shouldn't be pending when you go into nohz mode..
> > 
> > You mean it can't happen or we don't want that to happen?
> 
> We don't want that.. going into nohz with pending softirqs means the
> softirqs will be delayed for an unknown amount of time, this should not
> occur.
> 
> tick_nohz_stop_sched_tick() has:
> 
>         if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
>                 static int ratelimit;
>
>                 if (ratelimit < 10) {
>                         printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
>                                (unsigned int) local_softirq_pending());
>                         ratelimit++;
>                 }
>                 goto end;
>         }
> 
> which should warn us if this ever was to occur.

Yeah. Moreover that check is weird, because if softirqs are pending at
that time, the softirqd thread should already have been woken up: either
we come from idle and are not in an interrupt (a softirq raised outside
interrupt exit wakes softirqd), or we come from an interrupt and the
softirqs have already been executed. If there were too many of them,
softirqd would have been woken up as well.

Thus the need_resched() condition that comes before should have made us
exit the function already.

> > > 
> > > That is, I'm really not seeing what's wrong with the very simple:
> > > 
> > > 
> > >   if (tick_nohz_can_stop_tick())
> > > 	tick_nohz_stop_tick();
> > > 
> > > 
> > > and relying on everybody who invalidates tick_nohz_can_stop_tick(), to
> > > do:
> > > 
> > >   tick_nohz_start_tick();
> > 
> > May be for the non-idle case. But for the idle case I need to ensure
> > this is necessary somewhere.
> 
> How exactly do idle and non-idle differ? Its about stopping the tick,
> regardless of if we're idle or not if someone needs the thing we need to
> start it again.

It may differ because of the way the idle loop is built: it treats some
things as a 1-jiffy timer to avoid a full tick restart.


* Re: [PATCH 08/32] nohz: Move nohz load balancer selection into idle logic
  2011-08-29 14:45   ` Peter Zijlstra
@ 2011-09-08 14:08     ` Frederic Weisbecker
  2011-09-08 17:16       ` Paul E. McKenney
  0 siblings, 1 reply; 139+ messages in thread
From: Frederic Weisbecker @ 2011-09-08 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Anton Blanchard, Avi Kivity, Ingo Molnar,
	Lai Jiangshan, Paul E . McKenney, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Mon, Aug 29, 2011 at 04:45:17PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > We want the nohz load balancer to be an idle CPU, thus
> > move that selection to strict dyntick idle logic.
> 
> Again, the important part is missing, why is this correct?
> 
> I'm not at all convinced this is correct, suppose all your cpus (except
> the system CPU, which we'll assume has many tasks) are busy running 1
> task. Then two of them get an extra task, now if those two happen to be
> SMT siblings you want the load-balancer to pull on task out from the SMT
> pair, however nobody is pulling since nobody is idle.
> 
> AFAICT this breaks stuff and the ILB needs some serious attention in
> order to fix this.

Right, support for trigger_load_balance() in scheduler_tick() is still
missing here.

What about using the CPU that has to stay awake with a periodic tick to
handle jiffies? We could force that CPU to be the idle load balancer.
The problem is perhaps finding the right frequency for doing that,
because we have all the runqueues to handle.
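The fallback being proposed could be sketched roughly like this, purely as an
illustration: prefer an idle CPU as the idle load balancer, and fall back to
the timekeeping CPU when none is idle. Nothing here mirrors the kernel's
actual ILB selection code:

```c
#include <assert.h>

/* Hypothetical model: cpu_nr_running[i] is the runqueue length of
 * CPU i, timekeeping_cpu is the CPU kept awake for jiffies. */
static int model_pick_ilb(const int *cpu_nr_running, int ncpus,
			  int timekeeping_cpu)
{
	int i;

	for (i = 0; i < ncpus; i++)
		if (cpu_nr_running[i] == 0)
			return i;	/* an idle CPU can balance */
	return timekeeping_cpu;		/* fall back to the tick keeper */
}
```

With every CPU busy running tasks, the timekeeping CPU ends up doing the
balancing; as soon as any CPU goes idle, it takes over as usual.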


* Re: [PATCH 08/32] nohz: Move nohz load balancer selection into idle logic
  2011-09-08 14:08     ` Frederic Weisbecker
@ 2011-09-08 17:16       ` Paul E. McKenney
  0 siblings, 0 replies; 139+ messages in thread
From: Paul E. McKenney @ 2011-09-08 17:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Andrew Morton, Anton Blanchard, Avi Kivity,
	Ingo Molnar, Lai Jiangshan, Paul Menage, Stephen Hemminger,
	Thomas Gleixner, Tim Pepper

On Thu, Sep 08, 2011 at 04:08:54PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2011 at 04:45:17PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-08-15 at 17:52 +0200, Frederic Weisbecker wrote:
> > > We want the nohz load balancer to be an idle CPU, thus
> > > move that selection to strict dyntick idle logic.
> > 
> > Again, the important part is missing, why is this correct?
> > 
> > I'm not at all convinced this is correct, suppose all your cpus (except
> > the system CPU, which we'll assume has many tasks) are busy running 1
> > task. Then two of them get an extra task, now if those two happen to be
> > SMT siblings you want the load-balancer to pull on task out from the SMT
> > pair, however nobody is pulling since nobody is idle.
> > 
> > AFAICT this breaks stuff and the ILB needs some serious attention in
> > order to fix this.
> 
> Right, support for trigger_load_balance() in scheduler_tick() is still
> missing here.
> 
> What about using the CPU that has to stay awake with a periodic tick to
> handle jiffies? We could force that CPU to be the idle load balancer.
> The problem is perhaps finding the right frequency for doing that,
> because we have all the runqueues to handle.

If this CPU can also be the RCU grace-period advancer of last resort,
that would make it easier to arrive at an improved RCU_FAST_NO_HZ.

							Thanx, Paul



end of thread, other threads:[~2011-09-09  2:29 UTC | newest]

Thread overview: 139+ messages (download: mbox.gz / follow: Atom feed)
2011-08-15 15:51 [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Frederic Weisbecker
2011-08-15 15:51 ` [PATCH 01/32 RESEND] nohz: Drop useless call in tick_nohz_start_idle() Frederic Weisbecker
2011-08-29 14:23   ` Peter Zijlstra
2011-08-29 17:10     ` Frederic Weisbecker
2011-08-15 15:51 ` [PATCH 02/32 RESEND] nohz: Drop ts->idle_active Frederic Weisbecker
2011-08-29 14:23   ` Peter Zijlstra
2011-08-29 16:15     ` Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 03/32 RESEND] nohz: Drop useless ts->inidle check before rearming the tick Frederic Weisbecker
2011-08-29 14:23   ` Peter Zijlstra
2011-08-29 16:58     ` Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 04/32] nohz: Separate idle sleeping time accounting from nohz switching Frederic Weisbecker
2011-08-29 14:23   ` Peter Zijlstra
2011-08-29 16:32     ` Frederic Weisbecker
2011-08-29 17:44       ` Peter Zijlstra
2011-08-29 22:53         ` Frederic Weisbecker
2011-08-29 14:23   ` Peter Zijlstra
2011-08-29 17:01     ` Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 05/32] nohz: Move rcu dynticks idle mode handling to idle enter/exit APIs Frederic Weisbecker
2011-08-29 14:25   ` Peter Zijlstra
2011-08-29 17:11     ` Frederic Weisbecker
2011-08-29 17:49       ` Peter Zijlstra
2011-08-29 17:59         ` Frederic Weisbecker
2011-08-29 18:06           ` Peter Zijlstra
2011-08-29 23:35             ` Frederic Weisbecker
2011-08-30 11:17               ` Peter Zijlstra
2011-08-30 14:11                 ` Frederic Weisbecker
2011-08-30 14:13                   ` Peter Zijlstra
2011-08-30 14:27                     ` Frederic Weisbecker
2011-08-30 11:19               ` Peter Zijlstra
2011-08-30 14:26                 ` Frederic Weisbecker
2011-08-30 15:22                   ` Peter Zijlstra
2011-08-30 18:45                     ` Frederic Weisbecker
2011-08-30 11:21               ` Peter Zijlstra
2011-08-30 14:32                 ` Frederic Weisbecker
2011-08-30 15:26                   ` Peter Zijlstra
2011-08-30 15:33                     ` Frederic Weisbecker
2011-08-30 15:42                       ` Peter Zijlstra
2011-08-30 18:53                         ` Frederic Weisbecker
2011-08-30 20:58                       ` Peter Zijlstra
2011-08-30 22:24                         ` Frederic Weisbecker
2011-08-31  9:17                           ` Peter Zijlstra
2011-08-31 13:37                             ` Frederic Weisbecker
2011-08-31 14:41                               ` Peter Zijlstra
2011-09-01 16:40                                 ` Paul E. McKenney
2011-09-01 17:13                                   ` Peter Zijlstra
2011-09-02  1:41                                     ` Paul E. McKenney
2011-09-02  8:24                                       ` Peter Zijlstra
2011-09-04 19:37                                         ` Paul E. McKenney
2011-09-05 14:28                                           ` Peter Zijlstra
2011-08-15 15:52 ` [PATCH 06/32] nohz: Move idle ticks stats tracking out of nohz handlers Frederic Weisbecker
2011-08-29 14:28   ` Peter Zijlstra
2011-09-06  0:35     ` Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 07/32] nohz: Rename ts->idle_tick to ts->last_tick Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 08/32] nohz: Move nohz load balancer selection into idle logic Frederic Weisbecker
2011-08-29 14:45   ` Peter Zijlstra
2011-09-08 14:08     ` Frederic Weisbecker
2011-09-08 17:16       ` Paul E. McKenney
2011-08-15 15:52 ` [PATCH 09/32] nohz: Move ts->idle_calls into strict " Frederic Weisbecker
2011-08-29 14:47   ` Peter Zijlstra
2011-08-29 17:34     ` Frederic Weisbecker
2011-08-29 17:59       ` Peter Zijlstra
2011-08-29 18:23         ` Frederic Weisbecker
2011-08-29 18:33           ` Peter Zijlstra
2011-08-30 14:45             ` Frederic Weisbecker
2011-08-30 15:33               ` Peter Zijlstra
2011-09-06 16:35                 ` Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 10/32] nohz: Move next idle expiring time record into idle logic area Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 11/32] cpuset: Set up interface for nohz flag Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 12/32] nohz: Try not to give the timekeeping duty to a cpuset nohz cpu Frederic Weisbecker
2011-08-29 14:55   ` Peter Zijlstra
2011-08-30 15:17     ` Frederic Weisbecker
2011-08-30 15:30       ` Dimitri Sivanich
2011-08-30 15:37       ` Peter Zijlstra
2011-08-30 22:44         ` Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 13/32] nohz: Adaptive tick stop and restart on nohz cpuset Frederic Weisbecker
2011-08-29 15:25   ` Peter Zijlstra
2011-09-06 13:03     ` Frederic Weisbecker
2011-08-29 15:28   ` Peter Zijlstra
2011-08-29 18:02     ` Frederic Weisbecker
2011-08-29 18:07       ` Peter Zijlstra
2011-08-29 18:28         ` Frederic Weisbecker
2011-08-30 12:44           ` Peter Zijlstra
2011-08-30 14:38             ` Frederic Weisbecker
2011-08-30 15:28               ` Peter Zijlstra
2011-08-29 15:32   ` Peter Zijlstra
2011-08-15 15:52 ` [PATCH 14/32] nohz/cpuset: Don't turn off the tick if rcu needs it Frederic Weisbecker
2011-08-16 20:13   ` Paul E. McKenney
2011-08-17  2:10     ` Frederic Weisbecker
2011-08-17  2:49       ` Paul E. McKenney
2011-08-29 15:36   ` Peter Zijlstra
2011-08-15 15:52 ` [PATCH 15/32] nohz/cpuset: Restart tick when switching to idle task Frederic Weisbecker
2011-08-29 15:43   ` Peter Zijlstra
2011-08-30 15:04     ` Frederic Weisbecker
2011-08-30 15:35       ` Peter Zijlstra
2011-08-15 15:52 ` [PATCH 16/32] nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued Frederic Weisbecker
2011-08-29 15:51   ` Peter Zijlstra
2011-08-29 15:55   ` Peter Zijlstra
2011-08-30 15:06     ` Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 17/32] x86: New cpuset nohz irq vector Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 18/32] nohz/cpuset: Don't stop the tick if posix cpu timers are running Frederic Weisbecker
2011-08-29 15:59   ` Peter Zijlstra
2011-08-15 15:52 ` [PATCH 19/32] nohz/cpuset: Restart tick when nohz flag is cleared on cpuset Frederic Weisbecker
2011-08-29 16:02   ` Peter Zijlstra
2011-08-30 15:10     ` Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 20/32] nohz/cpuset: Restart the tick if printk needs it Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 21/32] rcu: Restart the tick on non-responding adaptive nohz CPUs Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 22/32] rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU Frederic Weisbecker
2011-08-16 20:20   ` Paul E. McKenney
2011-08-17  2:18     ` Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 23/32] nohz/cpuset: Account user and system times in adaptive nohz mode Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 24/32] nohz/cpuset: Handle kernel entry/exit to account cputime Frederic Weisbecker
2011-08-16 20:38   ` Paul E. McKenney
2011-08-17  2:30     ` Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 25/32] nohz/cpuset: New API to flush cputimes on nohz cpusets Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 26/32] nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 27/32] nohz/cpuset: Flush cputimes on procfs stat file read Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 28/32] nohz/cpuset: Flush cputimes for getrusage() and times() syscalls Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 29/32] x86: Syscall hooks for nohz cpusets Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 30/32] x86: Exception " Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 31/32] rcu: Switch to extended quiescent state in userspace from nohz cpuset Frederic Weisbecker
2011-08-16 20:44   ` Paul E. McKenney
2011-08-17  2:43     ` Frederic Weisbecker
2011-08-15 15:52 ` [PATCH 32/32] nohz/cpuset: Disable under some configs Frederic Weisbecker
2011-08-17 16:36 ` [RFC PATCH 00/32] Nohz cpusets (was: Nohz Tasks) Avi Kivity
2011-08-18 13:25   ` Frederic Weisbecker
2011-08-20  7:45     ` Paul Menage
2011-08-23 16:36       ` Frederic Weisbecker
2011-08-24 14:41 ` Gilad Ben-Yossef
2011-08-30 14:06   ` Frederic Weisbecker
2011-08-31  3:47     ` Mike Galbraith
2011-08-31  9:28       ` Peter Zijlstra
2011-08-31 10:26         ` Mike Galbraith
2011-08-31 10:33           ` Peter Zijlstra
2011-08-31 14:00             ` Gilad Ben-Yossef
2011-08-31 14:26               ` Peter Zijlstra
2011-08-31 14:05           ` Gilad Ben-Yossef
2011-08-31 16:12             ` Mike Galbraith
2011-08-31 13:57     ` Gilad Ben-Yossef
2011-08-31 14:30       ` Peter Zijlstra
