* [PATCH 00/10] nohz: Tick dependency mask v2
@ 2015-07-23 16:42 Frederic Weisbecker
  2015-07-23 16:42 ` [PATCH 01/10] nohz: Remove idle task special case Frederic Weisbecker
                   ` (9 more replies)
  0 siblings, 10 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-23 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Preeti U Murthy, Christoph Lameter, Ingo Molnar, Viresh Kumar,
	Rik van Riel

Currently in nohz full configs, the tick dependency is checked
asynchronously by the nohz code, from interrupts and context switches,
for each concerned subsystem through a set of functions provided by
those subsystems. These functions are made of many conditions and
details that can be heavyweight: sched_can_stop_tick(),
posix_cpu_timer_can_stop_tick(), perf_event_can_stop_tick()...

Thomas suggested a few months ago to make that tick dependency check
synchronous. Instead of checking subsystem details from each interrupt
to guess if the tick can be stopped, every subsystem that may have a
tick dependency should itself set a flag specifying the state of that
dependency. This way we can verify whether we can stop the tick with a
single lightweight mask check.
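
As an illustration of the idea, here is a minimal userspace model
(hypothetical: C11 atomics stand in for the kernel's primitives). Each
subsystem owns a bit in a shared mask, and the tick-stop path boils
down to a single load:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum { DEP_POSIX_TIMER, DEP_PERF, DEP_SCHED, DEP_CLOCK_UNSTABLE };

static atomic_ulong tick_dep;

static void dep_set(int bit)
{
	atomic_fetch_or(&tick_dep, 1UL << bit);
}

static void dep_clear(int bit)
{
	atomic_fetch_and(&tick_dep, ~(1UL << bit));
}

/* What the irq exit path has to evaluate: one lightweight check. */
static bool can_stop_tick(void)
{
	return atomic_load(&tick_dep) == 0;
}

int main(void)
{
	dep_set(DEP_PERF);
	printf("%d\n", can_stop_tick());	/* 0: perf holds the tick */
	dep_clear(DEP_PERF);
	printf("%d\n", can_stop_tick());	/* 1: tick can be stopped */
	return 0;
}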

There is still one last bit of work to do (add static keys to the
tick_dependency APIs) but it's otherwise complete.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	timers/core-v4

HEAD: 2e9d80bececd2cb49aeddaf61660de76730559f9

Thanks,
	Frederic
---

Frederic Weisbecker (10):
      nohz: Remove idle task special case
      nohz: Restart nohz full tick from irq exit
      nohz: Move tick_nohz_restart_sched_tick() above its users
      nohz: Remove useless argument on tick_nohz_task_switch()
      nohz: New tick dependency mask
      perf: Migrate perf to use new tick dependency mask model
      sched: Migrate sched to use new tick dependency mask model
      posix-cpu-timers: Migrate to use new tick dependency mask model
      sched-clock: Migrate to use new tick dependency mask model
      nohz: Remove task switch obsolete tick dependency check


 include/linux/perf_event.h     |   6 --
 include/linux/posix-timers.h   |   3 -
 include/linux/sched.h          |   3 -
 include/linux/tick.h           |  36 ++++---
 kernel/events/core.c           |  19 +---
 kernel/sched/clock.c           |   5 +
 kernel/sched/core.c            |  14 +--
 kernel/sched/sched.h           |  56 ++++++++---
 kernel/time/posix-cpu-timers.c |  92 +++++++++---------
 kernel/time/tick-sched.c       | 215 ++++++++++++++++++++++++++---------------
 kernel/time/tick-sched.h       |   1 +
 11 files changed, 265 insertions(+), 185 deletions(-)


* [PATCH 01/10] nohz: Remove idle task special case
  2015-07-23 16:42 [PATCH 00/10] nohz: Tick dependency mask v2 Frederic Weisbecker
@ 2015-07-23 16:42 ` Frederic Weisbecker
  2015-07-23 16:42 ` [PATCH 02/10] nohz: Restart nohz full tick from irq exit Frederic Weisbecker
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-23 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Preeti U Murthy, Christoph Lameter, Ingo Molnar, Viresh Kumar,
	Rik van Riel

In the early days of nohz full, idle dynticks and full dynticks weren't
well integrated and we couldn't allow full dynticks calls on idle
without risking messing up the idle tick statistics. This is why we
prevented such a thing from happening.

Nowadays full dynticks and idle dynticks are better integrated and
interact without known issues.

So let's remove that special case.

Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/time/tick-sched.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c792429..d6c8eff 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -208,10 +208,8 @@ void __tick_nohz_full_check(void)
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
 	if (tick_nohz_full_cpu(smp_processor_id())) {
-		if (ts->tick_stopped && !is_idle_task(current)) {
-			if (!can_stop_full_tick())
-				tick_nohz_restart_sched_tick(ts, ktime_get());
-		}
+		if (ts->tick_stopped && !can_stop_full_tick())
+			tick_nohz_restart_sched_tick(ts, ktime_get());
 	}
 }
 
@@ -710,7 +708,7 @@ static void tick_nohz_full_stop_tick(struct tick_sched *ts)
 #ifdef CONFIG_NO_HZ_FULL
 	int cpu = smp_processor_id();
 
-	if (!tick_nohz_full_cpu(cpu) || is_idle_task(current))
+	if (!tick_nohz_full_cpu(cpu))
 		return;
 
 	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
-- 
2.1.4



* [PATCH 02/10] nohz: Restart nohz full tick from irq exit
  2015-07-23 16:42 [PATCH 00/10] nohz: Tick dependency mask v2 Frederic Weisbecker
  2015-07-23 16:42 ` [PATCH 01/10] nohz: Remove idle task special case Frederic Weisbecker
@ 2015-07-23 16:42 ` Frederic Weisbecker
  2015-07-23 16:42 ` [PATCH 03/10] nohz: Move tick_nohz_restart_sched_tick() above its users Frederic Weisbecker
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-23 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Preeti U Murthy, Christoph Lameter, Ingo Molnar, Viresh Kumar,
	Rik van Riel

Restart the tick when necessary from the irq exit path. It makes nohz
full more flexible, simplifies the related IPIs and doesn't bring
significant overhead on irq exit.

In the longer term, this will let us piggyback the nohz kick on the
scheduler IPI instead of sending a dedicated IPI that often doubles
the scheduler IPI on task wakeup. This will require more changes
though, including a careful review of resched_curr() callers to
include nohz full needs.

Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/tick.h     |  8 --------
 kernel/time/tick-sched.c | 34 ++++++++++------------------------
 2 files changed, 10 insertions(+), 32 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 1ca93f2..7d35b0f 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -147,7 +147,6 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
 		cpumask_or(mask, mask, tick_nohz_full_mask);
 }
 
-extern void __tick_nohz_full_check(void);
 extern void tick_nohz_full_kick(void);
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
@@ -156,7 +155,6 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk);
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
 static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }
-static inline void __tick_nohz_full_check(void) { }
 static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void tick_nohz_full_kick(void) { }
 static inline void tick_nohz_full_kick_all(void) { }
@@ -190,12 +188,6 @@ static inline void housekeeping_affine(struct task_struct *t)
 #endif
 }
 
-static inline void tick_nohz_full_check(void)
-{
-	if (tick_nohz_full_enabled())
-		__tick_nohz_full_check();
-}
-
 static inline void tick_nohz_task_switch(struct task_struct *tsk)
 {
 	if (tick_nohz_full_enabled())
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index d6c8eff..a06cd4a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -197,25 +197,9 @@ static bool can_stop_full_tick(void)
 	return true;
 }
 
-static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now);
-
-/*
- * Re-evaluate the need for the tick on the current CPU
- * and restart it if necessary.
- */
-void __tick_nohz_full_check(void)
-{
-	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
-
-	if (tick_nohz_full_cpu(smp_processor_id())) {
-		if (ts->tick_stopped && !can_stop_full_tick())
-			tick_nohz_restart_sched_tick(ts, ktime_get());
-	}
-}
-
 static void nohz_full_kick_work_func(struct irq_work *work)
 {
-	__tick_nohz_full_check();
+	/* Empty, the tick restart happens on tick_nohz_irq_exit() */
 }
 
 static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) = {
@@ -250,7 +234,7 @@ void tick_nohz_full_kick_cpu(int cpu)
 
 static void nohz_full_kick_ipi(void *info)
 {
-	__tick_nohz_full_check();
+	/* Empty, the tick restart happens on tick_nohz_irq_exit() */
 }
 
 /*
@@ -703,7 +687,9 @@ out:
 	return tick;
 }
 
-static void tick_nohz_full_stop_tick(struct tick_sched *ts)
+static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now);
+
+static void tick_nohz_full_update_tick(struct tick_sched *ts)
 {
 #ifdef CONFIG_NO_HZ_FULL
 	int cpu = smp_processor_id();
@@ -714,10 +700,10 @@ static void tick_nohz_full_stop_tick(struct tick_sched *ts)
 	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
 		return;
 
-	if (!can_stop_full_tick())
-		return;
-
-	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+	if (can_stop_full_tick())
+		tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+	else if (ts->tick_stopped)
+		tick_nohz_restart_sched_tick(ts, ktime_get());
 #endif
 }
 
@@ -847,7 +833,7 @@ void tick_nohz_irq_exit(void)
 	if (ts->inidle)
 		__tick_nohz_idle_enter(ts);
 	else
-		tick_nohz_full_stop_tick(ts);
+		tick_nohz_full_update_tick(ts);
 }
 
 /**
-- 
2.1.4



* [PATCH 03/10] nohz: Move tick_nohz_restart_sched_tick() above its users
  2015-07-23 16:42 [PATCH 00/10] nohz: Tick dependency mask v2 Frederic Weisbecker
  2015-07-23 16:42 ` [PATCH 01/10] nohz: Remove idle task special case Frederic Weisbecker
  2015-07-23 16:42 ` [PATCH 02/10] nohz: Restart nohz full tick from irq exit Frederic Weisbecker
@ 2015-07-23 16:42 ` Frederic Weisbecker
  2015-07-23 16:42 ` [PATCH 04/10] nohz: Remove useless argument on tick_nohz_task_switch() Frederic Weisbecker
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-23 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Preeti U Murthy, Christoph Lameter, Ingo Molnar, Viresh Kumar,
	Rik van Riel

Move tick_nohz_restart_sched_tick() above its callers to get rid of
the function declaration/definition dance.

Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/time/tick-sched.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a06cd4a..6b0d14d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -687,7 +687,22 @@ out:
 	return tick;
 }
 
-static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now);
+static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
+{
+	/* Update jiffies first */
+	tick_do_update_jiffies64(now);
+	update_cpu_load_nohz();
+
+	calc_load_exit_idle();
+	touch_softlockup_watchdog();
+	/*
+	 * Cancel the scheduled timer and restore the tick
+	 */
+	ts->tick_stopped  = 0;
+	ts->idle_exittime = now;
+
+	tick_nohz_restart(ts, now);
+}
 
 static void tick_nohz_full_update_tick(struct tick_sched *ts)
 {
@@ -848,23 +863,6 @@ ktime_t tick_nohz_get_sleep_length(void)
 	return ts->sleep_length;
 }
 
-static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
-{
-	/* Update jiffies first */
-	tick_do_update_jiffies64(now);
-	update_cpu_load_nohz();
-
-	calc_load_exit_idle();
-	touch_softlockup_watchdog();
-	/*
-	 * Cancel the scheduled timer and restore the tick
-	 */
-	ts->tick_stopped  = 0;
-	ts->idle_exittime = now;
-
-	tick_nohz_restart(ts, now);
-}
-
 static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-- 
2.1.4



* [PATCH 04/10] nohz: Remove useless argument on tick_nohz_task_switch()
  2015-07-23 16:42 [PATCH 00/10] nohz: Tick dependency mask v2 Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2015-07-23 16:42 ` [PATCH 03/10] nohz: Move tick_nohz_restart_sched_tick() above its users Frederic Weisbecker
@ 2015-07-23 16:42 ` Frederic Weisbecker
  2015-08-03 12:39   ` Peter Zijlstra
  2015-07-23 16:42 ` [PATCH 05/10] nohz: New tick dependency mask Frederic Weisbecker
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-23 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Preeti U Murthy, Christoph Lameter, Ingo Molnar, Viresh Kumar,
	Rik van Riel

The task_struct argument is a leftover from early code and is no
longer used. Remove it.

Cc: Christoph Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/tick.h     | 8 ++++----
 kernel/sched/core.c      | 2 +-
 kernel/time/tick-sched.c | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7d35b0f..48d901f 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -150,7 +150,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
 extern void tick_nohz_full_kick(void);
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
-extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void __tick_nohz_task_switch(void);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -158,7 +158,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }
 static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void tick_nohz_full_kick(void) { }
 static inline void tick_nohz_full_kick_all(void) { }
-static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline void __tick_nohz_task_switch(void) { }
 #endif
 
 static inline const struct cpumask *housekeeping_cpumask(void)
@@ -188,10 +188,10 @@ static inline void housekeeping_affine(struct task_struct *t)
 #endif
 }
 
-static inline void tick_nohz_task_switch(struct task_struct *tsk)
+static inline void tick_nohz_task_switch(void)
 {
 	if (tick_nohz_full_enabled())
-		__tick_nohz_task_switch(tsk);
+		__tick_nohz_task_switch();
 }
 
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78b4bad10..4d34035 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2489,7 +2489,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		put_task_struct(prev);
 	}
 
-	tick_nohz_task_switch(current);
+	tick_nohz_task_switch();
 	return rq;
 }
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6b0d14d..3319e16 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -258,7 +258,7 @@ void tick_nohz_full_kick_all(void)
  * It might need the tick due to per task/process properties:
  * perf events, posix cpu timers, ...
  */
-void __tick_nohz_task_switch(struct task_struct *tsk)
+void __tick_nohz_task_switch(void)
 {
 	unsigned long flags;
 
-- 
2.1.4



* [PATCH 05/10] nohz: New tick dependency mask
  2015-07-23 16:42 [PATCH 00/10] nohz: Tick dependency mask v2 Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2015-07-23 16:42 ` [PATCH 04/10] nohz: Remove useless argument on tick_nohz_task_switch() Frederic Weisbecker
@ 2015-07-23 16:42 ` Frederic Weisbecker
  2015-07-24 16:55   ` Chris Metcalf
                     ` (2 more replies)
  2015-07-23 16:42 ` [PATCH 06/10] perf: Migrate perf to use new tick dependency mask model Frederic Weisbecker
                   ` (4 subsequent siblings)
  9 siblings, 3 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-23 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Preeti U Murthy, Christoph Lameter, Ingo Molnar, Viresh Kumar,
	Rik van Riel

The tick dependency is evaluated on every IRQ. This is a batch of
checks which determine whether it is safe to stop the tick or not.
These checks split into many details: posix cpu timers, scheduler,
sched clock, perf events. Each of these is in turn made of smaller
details: posix cpu timers involve checking process-wide timers then
thread-wide timers; perf involves checking freq events then more
per-CPU details.

Checking these details asynchronously every time we update the full
dynticks state brings avoidable overhead and a messy layout.

Let's introduce two tick dependency masks: one for system-wide
dependencies and another for per-CPU dependencies. The subsystems are
responsible for setting and clearing their dependency through a set of
APIs that take care of concurrent dependency mask modifications and of
triggering the IPI to restart the relevant CPU tick whenever needed.

This new dependency engine stays beside the old one until all
subsystems having a tick dependency are converted to it.
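
The set side relies on the setter learning the previous value of the
mask, so that only the empty -> non-empty transition triggers the kick
IPIs. A userspace model of that semantic (hypothetical:
atomic_fetch_or() stands in for the cmpxchg() loop used in the patch):

#include <stdio.h>
#include <stdatomic.h>

static atomic_ulong dep_mask;

/* Returns the previous mask value, like the cmpxchg() loop below. */
static unsigned long dep_set(int bit)
{
	return atomic_fetch_or(&dep_mask, 1UL << bit);
}

int main(void)
{
	if (!dep_set(1))
		puts("mask was empty: kick the nohz full CPUs");
	if (!dep_set(2))
		puts("never printed: mask was already non-empty");
	return 0;
}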

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/tick.h     |  20 +++++++
 kernel/time/tick-sched.c | 142 +++++++++++++++++++++++++++++++++++++++++++++--
 kernel/time/tick-sched.h |   1 +
 3 files changed, 158 insertions(+), 5 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 48d901f..daafcce 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -97,6 +97,18 @@ static inline void tick_broadcast_exit(void)
 	tick_broadcast_oneshot_control(TICK_BROADCAST_EXIT);
 }
 
+enum tick_dependency_bit {
+	TICK_POSIX_TIMER_BIT	= 0,
+	TICK_PERF_EVENTS_BIT	= 1,
+	TICK_SCHED_BIT		= 2,
+	TICK_CLOCK_UNSTABLE_BIT	= 3
+};
+
+#define TICK_POSIX_TIMER_MASK		(1 << TICK_POSIX_TIMER_BIT)
+#define TICK_PERF_EVENTS_MASK		(1 << TICK_PERF_EVENTS_BIT)
+#define TICK_SCHED_MASK			(1 << TICK_SCHED_BIT)
+#define TICK_CLOCK_UNSTABLE_MASK	(1 << TICK_CLOCK_UNSTABLE_BIT)
+
 #ifdef CONFIG_NO_HZ_COMMON
 extern int tick_nohz_tick_stopped(void);
 extern void tick_nohz_idle_enter(void);
@@ -147,6 +159,14 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
 		cpumask_or(mask, mask, tick_nohz_full_mask);
 }
 
+extern void tick_nohz_set_tick_dependency(enum tick_dependency_bit bit);
+extern void tick_nohz_set_tick_dependency_delayed(enum tick_dependency_bit bit);
+extern void tick_nohz_clear_tick_dependency(enum tick_dependency_bit bit);
+extern void tick_nohz_set_tick_dependency_cpu(enum tick_dependency_bit bit, int cpu);
+extern void tick_nohz_clear_tick_dependency_cpu(enum tick_dependency_bit bit, int cpu);
+extern void tick_nohz_set_tick_dependency_this_cpu(enum tick_dependency_bit bit);
+extern void tick_nohz_clear_tick_dependency_this_cpu(enum tick_dependency_bit bit);
+
 extern void tick_nohz_full_kick(void);
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3319e16..a64646e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -156,11 +156,43 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
 cpumask_var_t tick_nohz_full_mask;
 cpumask_var_t housekeeping_mask;
 bool tick_nohz_full_running;
+static unsigned long tick_dependency;
 
-static bool can_stop_full_tick(void)
+static void trace_tick_dependency(unsigned long dep)
+{
+	if (dep & TICK_POSIX_TIMER_MASK) {
+		trace_tick_stop(0, "posix timers running\n");
+		return;
+	}
+
+	if (dep & TICK_PERF_EVENTS_MASK) {
+		trace_tick_stop(0, "perf events running\n");
+		return;
+	}
+
+	if (dep & TICK_SCHED_MASK) {
+		trace_tick_stop(0, "more than 1 task in runqueue\n");
+		return;
+	}
+
+	if (dep & TICK_CLOCK_UNSTABLE_MASK)
+		trace_tick_stop(0, "unstable sched clock\n");
+}
+
+static bool can_stop_full_tick(struct tick_sched *ts)
 {
 	WARN_ON_ONCE(!irqs_disabled());
 
+	if (tick_dependency) {
+		trace_tick_dependency(tick_dependency);
+		return false;
+	}
+
+	if (ts->tick_dependency) {
+		trace_tick_dependency(ts->tick_dependency);
+		return false;
+	}
+
 	if (!sched_can_stop_tick()) {
 		trace_tick_stop(0, "more than 1 task in runqueue\n");
 		return false;
@@ -176,9 +208,10 @@ static bool can_stop_full_tick(void)
 		return false;
 	}
 
-	/* sched_clock_tick() needs us? */
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 	/*
+	 * sched_clock_tick() needs us?
+	 *
 	 * TODO: kick full dynticks CPUs when
 	 * sched_clock_stable is set.
 	 */
@@ -253,6 +286,103 @@ void tick_nohz_full_kick_all(void)
 	preempt_enable();
 }
 
+unsigned long __tick_nohz_set_tick_dependency(enum tick_dependency_bit bit,
+					      unsigned long *dep)
+{
+	unsigned long prev;
+	unsigned long old = *dep;
+	unsigned long mask = BIT_MASK(bit);
+
+	while ((prev = cmpxchg(dep, old, old | mask)) != old) {
+		old = prev;
+		cpu_relax();
+	}
+
+	return prev;
+}
+
+void __tick_nohz_clear_tick_dependency(enum tick_dependency_bit bit,
+				       unsigned long *dep)
+{
+	clear_bit(bit, dep);
+}
+
+void tick_nohz_set_tick_dependency(enum tick_dependency_bit bit)
+{
+	unsigned long prev;
+
+	prev = __tick_nohz_set_tick_dependency(bit, &tick_dependency);
+	if (!prev)
+		tick_nohz_full_kick_all();
+}
+
+static void kick_all_work_fn(struct work_struct *work)
+{
+       tick_nohz_full_kick_all();
+}
+static DECLARE_WORK(kick_all_work, kick_all_work_fn);
+
+void tick_nohz_set_tick_dependency_delayed(enum tick_dependency_bit bit)
+{
+	unsigned long prev;
+
+	prev = __tick_nohz_set_tick_dependency(bit, &tick_dependency);
+	if (!prev) {
+		/*
+		* We need the IPIs to be sent from sane process context.
+		* The posix cpu timers are always set with irqs disabled.
+		*/
+		schedule_work(&kick_all_work);
+	}
+}
+
+void tick_nohz_clear_tick_dependency(enum tick_dependency_bit bit)
+{
+	__tick_nohz_clear_tick_dependency(bit, &tick_dependency);
+}
+
+void tick_nohz_set_tick_dependency_cpu(enum tick_dependency_bit bit, int cpu)
+{
+	unsigned long prev;
+	struct tick_sched *ts;
+
+	ts = per_cpu_ptr(&tick_cpu_sched, cpu);
+
+	prev = __tick_nohz_set_tick_dependency(bit, &ts->tick_dependency);
+	if (!prev)
+		tick_nohz_full_kick_cpu(cpu);
+}
+
+void tick_nohz_clear_tick_dependency_cpu(enum tick_dependency_bit bit, int cpu)
+{
+	struct tick_sched *ts = per_cpu_ptr(&tick_cpu_sched, cpu);
+
+	__tick_nohz_clear_tick_dependency(bit, &ts->tick_dependency);
+}
+
+/*
+ * Local dependency must have its own flavour due to NMI-safe requirement
+ * on perf.
+ */
+void tick_nohz_set_tick_dependency_this_cpu(enum tick_dependency_bit bit)
+{
+	unsigned long prev;
+	struct tick_sched *ts;
+
+	ts = this_cpu_ptr(&tick_cpu_sched);
+
+	prev = __tick_nohz_set_tick_dependency(bit, &ts->tick_dependency);
+	if (!prev)
+		tick_nohz_full_kick();
+}
+
+void tick_nohz_clear_tick_dependency_this_cpu(enum tick_dependency_bit bit)
+{
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	__tick_nohz_clear_tick_dependency(bit, &ts->tick_dependency);
+}
+
 /*
  * Re-evaluate the need for the tick as we switch the current task.
  * It might need the tick due to per task/process properties:
@@ -261,15 +391,17 @@ void tick_nohz_full_kick_all(void)
 void __tick_nohz_task_switch(void)
 {
 	unsigned long flags;
+	struct tick_sched *ts;
 
 	local_irq_save(flags);
 
 	if (!tick_nohz_full_cpu(smp_processor_id()))
 		goto out;
 
-	if (tick_nohz_tick_stopped() && !can_stop_full_tick())
+	ts = this_cpu_ptr(&tick_cpu_sched);
+
+	if (ts->tick_stopped && !can_stop_full_tick(ts))
 		tick_nohz_full_kick();
-
 out:
 	local_irq_restore(flags);
 }
@@ -715,7 +847,7 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
 	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
 		return;
 
-	if (can_stop_full_tick())
+	if (can_stop_full_tick(ts))
 		tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
 	else if (ts->tick_stopped)
 		tick_nohz_restart_sched_tick(ts, ktime_get());
diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
index a4a8d4e..d327f70 100644
--- a/kernel/time/tick-sched.h
+++ b/kernel/time/tick-sched.h
@@ -60,6 +60,7 @@ struct tick_sched {
 	u64				next_timer;
 	ktime_t				idle_expires;
 	int				do_timer_last;
+	unsigned long			tick_dependency;
 };
 
 extern struct tick_sched *tick_get_tick_sched(int cpu);
-- 
2.1.4



* [PATCH 06/10] perf: Migrate perf to use new tick dependency mask model
  2015-07-23 16:42 [PATCH 00/10] nohz: Tick dependency mask v2 Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2015-07-23 16:42 ` [PATCH 05/10] nohz: New tick dependency mask Frederic Weisbecker
@ 2015-07-23 16:42 ` Frederic Weisbecker
  2015-07-23 16:42 ` [PATCH 07/10] sched: Migrate sched " Frederic Weisbecker
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-23 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Preeti U Murthy, Christoph Lameter, Ingo Molnar, Viresh Kumar,
	Rik van Riel

Instead of providing asynchronous checks for the nohz subsystem to verify
perf event tick dependency, migrate perf to the new mask.

Perf needs the tick for two situations:

1) Freq events. We could set the tick dependency when those are
installed on a CPU context. But setting a global dependency on top of
the global freq events accounting is much easier. If people want that
to be optimized, we can still refine it later at the per-CPU tick
dependency level. This patch doesn't change the current behaviour
anyway.

2) Throttled events: this is a per-cpu dependency.
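
For 1), only the 0 -> 1 and 1 -> 0 transitions of the freq events
count may toggle the global mask bit. A userspace model of that
pattern (hypothetical, mirroring the nr_freq_events hunks below):

#include <stdio.h>
#include <stdatomic.h>

static atomic_int nr_freq_events;

static void account_freq_event(void)
{
	/* atomic_fetch_add() returns the previous value. */
	if (atomic_fetch_add(&nr_freq_events, 1) == 0)
		puts("first freq event: set the global tick dependency");
}

static void unaccount_freq_event(void)
{
	if (atomic_fetch_sub(&nr_freq_events, 1) == 1)
		puts("last freq event gone: clear the global tick dependency");
}

int main(void)
{
	account_freq_event();	/* sets the dependency */
	account_freq_event();	/* no mask change */
	unaccount_freq_event();	/* no mask change */
	unaccount_freq_event();	/* clears the dependency */
	return 0;
}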

Cc: Christoph Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/perf_event.h |  6 ------
 kernel/events/core.c       | 19 +++++--------------
 kernel/time/tick-sched.c   |  6 ------
 3 files changed, 5 insertions(+), 26 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2027809..ff25348 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1013,12 +1013,6 @@ static inline int __perf_event_disable(void *info)			{ return -1; }
 static inline void perf_event_task_tick(void)				{ }
 #endif
 
-#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_NO_HZ_FULL)
-extern bool perf_event_can_stop_tick(void);
-#else
-static inline bool perf_event_can_stop_tick(void)			{ return true; }
-#endif
-
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
 extern void perf_restore_debug_store(void);
 #else
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d3dae34..9980bd9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3064,17 +3064,6 @@ done:
 	return rotate;
 }
 
-#ifdef CONFIG_NO_HZ_FULL
-bool perf_event_can_stop_tick(void)
-{
-	if (atomic_read(&nr_freq_events) ||
-	    __this_cpu_read(perf_throttled_count))
-		return false;
-	else
-		return true;
-}
-#endif
-
 void perf_event_task_tick(void)
 {
 	struct list_head *head = this_cpu_ptr(&active_ctx_list);
@@ -3085,6 +3074,7 @@ void perf_event_task_tick(void)
 
 	__this_cpu_inc(perf_throttled_seq);
 	throttled = __this_cpu_xchg(perf_throttled_count, 0);
+	tick_nohz_clear_tick_dependency_this_cpu(TICK_PERF_EVENTS_BIT);
 
 	list_for_each_entry_safe(ctx, tmp, head, active_ctx_list)
 		perf_adjust_freq_unthr_context(ctx, throttled);
@@ -3453,7 +3443,8 @@ static void unaccount_event(struct perf_event *event)
 	if (event->attr.task)
 		atomic_dec(&nr_task_events);
 	if (event->attr.freq)
-		atomic_dec(&nr_freq_events);
+		if (atomic_dec_and_test(&nr_freq_events))
+			tick_nohz_clear_tick_dependency(TICK_PERF_EVENTS_BIT);
 	if (is_cgroup_event(event))
 		static_key_slow_dec_deferred(&perf_sched_events);
 	if (has_branch_stack(event))
@@ -6089,9 +6080,9 @@ static int __perf_event_overflow(struct perf_event *event,
 		if (unlikely(throttle
 			     && hwc->interrupts >= max_samples_per_tick)) {
 			__this_cpu_inc(perf_throttled_count);
+			tick_nohz_set_tick_dependency_this_cpu(TICK_PERF_EVENTS_BIT);
 			hwc->interrupts = MAX_INTERRUPTS;
 			perf_log_throttle(event, 0);
-			tick_nohz_full_kick();
 			ret = 1;
 		}
 	}
@@ -7477,7 +7468,7 @@ static void account_event(struct perf_event *event)
 		atomic_inc(&nr_task_events);
 	if (event->attr.freq) {
 		if (atomic_inc_return(&nr_freq_events) == 1)
-			tick_nohz_full_kick_all();
+			tick_nohz_set_tick_dependency(TICK_PERF_EVENTS_BIT);
 	}
 	if (has_branch_stack(event))
 		static_key_slow_inc(&perf_sched_events.key);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a64646e..fbe4736 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -22,7 +22,6 @@
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
-#include <linux/perf_event.h>
 #include <linux/context_tracking.h>
 
 #include <asm/irq_regs.h>
@@ -203,11 +202,6 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 		return false;
 	}
 
-	if (!perf_event_can_stop_tick()) {
-		trace_tick_stop(0, "perf events running\n");
-		return false;
-	}
-
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 	/*
 	 * sched_clock_tick() needs us?
-- 
2.1.4



* [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-07-23 16:42 [PATCH 00/10] nohz: Tick dependency mask v2 Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2015-07-23 16:42 ` [PATCH 06/10] perf: Migrate perf to use new tick dependency mask model Frederic Weisbecker
@ 2015-07-23 16:42 ` Frederic Weisbecker
  2015-07-23 16:55   ` Frederic Weisbecker
                     ` (2 more replies)
  2015-07-23 16:42 ` [PATCH 08/10] posix-cpu-timers: Migrate " Frederic Weisbecker
                   ` (2 subsequent siblings)
  9 siblings, 3 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-23 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Preeti U Murthy, Christoph Lameter, Ingo Molnar, Viresh Kumar,
	Rik van Riel

Instead of providing asynchronous checks for the nohz subsystem to verify
sched tick dependency, migrate sched to the new mask.

The easiest approach is to recycle the current asynchronous tick
dependency check, which verifies the class of the current task and its
requirements for periodic preemption checks.

We need to evaluate this tick dependency in three places:

1) Task enqueue: One or more tasks have been enqueued, we must check
   if those are competing with the current task.

2) Task dequeue: A possibly competing task has been dequeued, clear the
   tick dependency if needed.

3) schedule(): we might be switching to a task of another scheduler
   class. Each class has its own preemption rules, so we must
   re-evaluate the dependency.

This doesn't change much compared to the previous layout, except that
3) has to be done with the rq locked, to avoid the mask change racing
with a remote enqueue.

We could get rid of 3) by checking the highest prio tasks of the
runqueue instead of its current task.
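
Condensed from the hunks below, the three evaluation points look
roughly like this (a sketch: the actual patch only triggers the check
on the relevant nr_running transitions, and
sched_update_tick_dependency() checks sched_can_stop_tick() to toggle
the per-CPU TICK_SCHED_BIT):

static inline void add_nr_running(struct rq *rq, unsigned count)
{
	rq->nr_running += count;
	/* 1) Enqueue: new task(s) may compete with rq->curr. */
	sched_update_tick_dependency(rq);
}

static inline void sub_nr_running(struct rq *rq, unsigned count)
{
	rq->nr_running -= count;
	/* 2) Dequeue: a competing task may be gone. */
	sched_update_tick_dependency(rq);
}

/*
 * 3) is a sched_update_tick_dependency(rq) call in finish_task_switch(),
 * which must run while the rq is still locked (see the fix later in
 * this thread).
 */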

Cc: Christoph Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/sched.h    |  3 ---
 kernel/sched/core.c      | 12 ++++++-----
 kernel/sched/sched.h     | 56 +++++++++++++++++++++++++++++++++++-------------
 kernel/time/tick-sched.c |  5 -----
 4 files changed, 48 insertions(+), 28 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae21f15..88c99a2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2296,10 +2296,7 @@ static inline void wake_up_nohz_cpu(int cpu) { }
 #endif
 
 #ifdef CONFIG_NO_HZ_FULL
-extern bool sched_can_stop_tick(void);
 extern u64 scheduler_tick_max_deferment(void);
-#else
-static inline bool sched_can_stop_tick(void) { return false; }
 #endif
 
 #ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d34035..6c3db36 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -714,21 +714,22 @@ static inline bool got_nohz_idle_kick(void)
 #endif /* CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
-bool sched_can_stop_tick(void)
+bool sched_can_stop_tick(struct rq *rq)
 {
+	struct task_struct *curr = rq->curr;
 	/*
 	 * FIFO realtime policy runs the highest priority task. Other runnable
 	 * tasks are of a lower priority. The scheduler tick does nothing.
 	 */
-	if (current->policy == SCHED_FIFO)
+	if (curr->policy == SCHED_FIFO)
 		return true;
 
 	/*
 	 * Round-robin realtime tasks time slice with other tasks at the same
 	 * realtime priority. Is this task the only one at this priority?
 	 */
-	if (current->policy == SCHED_RR) {
-		struct sched_rt_entity *rt_se = &current->rt;
+	if (curr->policy == SCHED_RR) {
+		struct sched_rt_entity *rt_se = &curr->rt;
 
 		return rt_se->run_list.prev == rt_se->run_list.next;
 	}
@@ -738,7 +739,7 @@ bool sched_can_stop_tick(void)
 	 * nr_running update is assumed to be visible
 	 * after IPI is sent from wakers.
 	 */
-	if (this_rq()->nr_running > 1)
+	if (rq->nr_running > 1)
 		return false;
 
 	return true;
@@ -2489,6 +2490,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		put_task_struct(prev);
 	}
 
+	sched_update_tick_dependency(rq);
 	tick_nohz_task_switch();
 	return rq;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 84d4879..5037acf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1321,6 +1321,38 @@ unsigned long to_ratio(u64 period, u64 runtime);
 
 extern void init_task_runnable_average(struct task_struct *p);
 
+#ifdef CONFIG_NO_HZ_FULL
+extern bool sched_can_stop_tick(struct rq *rq);
+
+/*
+ * Tick is needed if more than one task runs on a CPU.
+ * Send the target an IPI to kick it out of nohz mode.
+ *
+ * We assume that IPI implies full memory barrier and the
+ * new value of rq->nr_running is visible on reception
+ * from the target.
+ */
+static inline void sched_update_tick_dependency(struct rq *rq)
+{
+	int cpu;
+
+	if (!tick_nohz_full_enabled())
+		return;
+
+	cpu = cpu_of(rq);
+
+	if (!tick_nohz_full_cpu(rq->cpu))
+		return;
+
+	if (sched_can_stop_tick(rq))
+		tick_nohz_clear_tick_dependency_cpu(TICK_SCHED_BIT, cpu);
+	else
+		tick_nohz_set_tick_dependency_cpu(TICK_SCHED_BIT, cpu);
+}
+#else
+static inline void sched_update_tick_dependency(struct rq *rq) { }
+#endif
+
 static inline void add_nr_running(struct rq *rq, unsigned count)
 {
 	unsigned prev_nr = rq->nr_running;
@@ -1332,26 +1364,20 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 		if (!rq->rd->overload)
 			rq->rd->overload = true;
 #endif
-
-#ifdef CONFIG_NO_HZ_FULL
-		if (tick_nohz_full_cpu(rq->cpu)) {
-			/*
-			 * Tick is needed if more than one task runs on a CPU.
-			 * Send the target an IPI to kick it out of nohz mode.
-			 *
-			 * We assume that IPI implies full memory barrier and the
-			 * new value of rq->nr_running is visible on reception
-			 * from the target.
-			 */
-			tick_nohz_full_kick_cpu(rq->cpu);
-		}
-#endif
+		/* Check if new task(s) need periodic preemption check */
+		sched_update_tick_dependency(rq);
 	}
 }
 
 static inline void sub_nr_running(struct rq *rq, unsigned count)
 {
-	rq->nr_running -= count;
+	unsigned prev_nr = rq->nr_running;
+
+	rq->nr_running = prev_nr - count;
+	if (prev_nr > 1) {
+		/* Check if we still need preemption */
+		sched_update_tick_dependency(rq);
+	}
 }
 
 static inline void rq_last_tick_reset(struct rq *rq)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fbe4736..e6447bd 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -192,11 +192,6 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 		return false;
 	}
 
-	if (!sched_can_stop_tick()) {
-		trace_tick_stop(0, "more than 1 task in runqueue\n");
-		return false;
-	}
-
 	if (!posix_cpu_timers_can_stop_tick(current)) {
 		trace_tick_stop(0, "posix timers running\n");
 		return false;
-- 
2.1.4



* [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-07-23 16:42 [PATCH 00/10] nohz: Tick dependency mask v2 Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2015-07-23 16:42 ` [PATCH 07/10] sched: Migrate sched " Frederic Weisbecker
@ 2015-07-23 16:42 ` Frederic Weisbecker
  2015-07-24 16:57   ` Chris Metcalf
  2015-07-23 16:42 ` [PATCH 09/10] sched-clock: " Frederic Weisbecker
  2015-07-23 16:42 ` [PATCH 10/10] nohz: Remove task switch obsolete tick dependency check Frederic Weisbecker
  9 siblings, 1 reply; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-23 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Preeti U Murthy, Christoph Lameter, Ingo Molnar, Viresh Kumar,
	Rik van Riel

Instead of providing asynchronous checks for the nohz subsystem to verify
posix cpu timers tick dependency, migrate the latter to the new mask.

In order to keep track of the running timers and expose the tick
dependency accordingly, we must probe timer queuing and dequeuing on
the thread and process lists.
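
The bookkeeping then reduces to a global counter whose 0 <-> 1
transitions toggle the mask bit. The set side uses the delayed variant
because posix cpu timers are always armed with irqs disabled, while
the kick-all IPIs must be sent from process context. Condensed from
the hunk below, with comments added:

static atomic_t cpu_timer_active;

static void cpu_timer_inc_tick_dependency(void)
{
	/* First queued timer anywhere: set the global dependency. */
	if (atomic_inc_return(&cpu_timer_active) == 1)
		tick_nohz_set_tick_dependency_delayed(TICK_POSIX_TIMER_BIT);
}

static void cpu_timer_dec_tick_dependency(void)
{
	/* Last queued timer gone: the tick may be stopped again. */
	if (atomic_dec_and_test(&cpu_timer_active))
		tick_nohz_clear_tick_dependency(TICK_POSIX_TIMER_BIT);
}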

Cc: Christoph Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/posix-timers.h   |  3 --
 kernel/time/posix-cpu-timers.c | 92 +++++++++++++++++++++---------------------
 kernel/time/tick-sched.c       |  5 ---
 3 files changed, 47 insertions(+), 53 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 907f3fd..62d44c1 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -128,9 +128,6 @@ void posix_cpu_timer_schedule(struct k_itimer *timer);
 void run_posix_cpu_timers(struct task_struct *task);
 void posix_cpu_timers_exit(struct task_struct *task);
 void posix_cpu_timers_exit_group(struct task_struct *task);
-
-bool posix_cpu_timers_can_stop_tick(struct task_struct *tsk);
-
 void set_process_cpu_timer(struct task_struct *task, unsigned int clock_idx,
 			   cputime_t *newval, cputime_t *oldval);
 
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 892e3da..d343c39 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -333,6 +333,46 @@ static int posix_cpu_clock_get(const clockid_t which_clock, struct timespec *tp)
 	return err;
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+static atomic_t cpu_timer_active;
+
+static void cpu_timer_inc_tick_dependency(void)
+{
+	if (atomic_inc_return(&cpu_timer_active) == 1)
+		tick_nohz_set_tick_dependency_delayed(TICK_POSIX_TIMER_BIT);
+}
+
+static void cpu_timer_dec_tick_dependency(void)
+{
+	if (atomic_dec_and_test(&cpu_timer_active))
+		tick_nohz_clear_tick_dependency(TICK_POSIX_TIMER_BIT);
+}
+#else
+static void cpu_timer_inc_tick_dependency(void) { }
+static void cpu_timer_dec_tick_dependency(void) { }
+#endif
+
+static void cpu_timer_list_prepare_firing(struct cpu_timer_list *t,
+					  struct list_head *firing)
+{
+	list_move_tail(&t->entry, firing);
+	cpu_timer_dec_tick_dependency();
+}
+
+static void cpu_timer_list_dequeue(struct cpu_timer_list *t)
+{
+	if (!list_empty(&t->entry))
+		cpu_timer_dec_tick_dependency();
+	list_del_init(&t->entry);
+}
+
+static void cpu_timer_list_enqueue(struct cpu_timer_list *t,
+				   struct list_head *head)
+{
+	list_add(&t->entry, head);
+	cpu_timer_inc_tick_dependency();
+}
+
 
 /*
  * Validate the clockid_t for a new CPU-clock timer, and initialize the timer.
@@ -409,7 +449,7 @@ static int posix_cpu_timer_del(struct k_itimer *timer)
 		if (timer->it.cpu.firing)
 			ret = TIMER_RETRY;
 		else
-			list_del(&timer->it.cpu.entry);
+			cpu_timer_list_dequeue(&timer->it.cpu);
 
 		unlock_task_sighand(p, &flags);
 	}
@@ -425,7 +465,7 @@ static void cleanup_timers_list(struct list_head *head)
 	struct cpu_timer_list *timer, *next;
 
 	list_for_each_entry_safe(timer, next, head, entry)
-		list_del_init(&timer->entry);
+		cpu_timer_list_dequeue(timer);
 }
 
 /*
@@ -490,7 +530,7 @@ static void arm_timer(struct k_itimer *timer)
 			break;
 		listpos = &next->entry;
 	}
-	list_add(&nt->entry, listpos);
+	cpu_timer_list_enqueue(nt, listpos);
 
 	if (listpos == head) {
 		unsigned long long exp = nt->expires;
@@ -582,39 +622,6 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
 	return 0;
 }
 
-#ifdef CONFIG_NO_HZ_FULL
-static void nohz_kick_work_fn(struct work_struct *work)
-{
-	tick_nohz_full_kick_all();
-}
-
-static DECLARE_WORK(nohz_kick_work, nohz_kick_work_fn);
-
-/*
- * We need the IPIs to be sent from sane process context.
- * The posix cpu timers are always set with irqs disabled.
- */
-static void posix_cpu_timer_kick_nohz(void)
-{
-	if (context_tracking_is_enabled())
-		schedule_work(&nohz_kick_work);
-}
-
-bool posix_cpu_timers_can_stop_tick(struct task_struct *tsk)
-{
-	if (!task_cputime_zero(&tsk->cputime_expires))
-		return false;
-
-	/* Check if cputimer is running. This is accessed without locking. */
-	if (READ_ONCE(tsk->signal->cputimer.running))
-		return false;
-
-	return true;
-}
-#else
-static inline void posix_cpu_timer_kick_nohz(void) { }
-#endif
-
 /*
  * Guts of sys_timer_settime for CPU timers.
  * This is called with the timer locked and interrupts disabled.
@@ -659,7 +666,7 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int timer_flags,
 		timer->it.cpu.firing = -1;
 		ret = TIMER_RETRY;
 	} else
-		list_del_init(&timer->it.cpu.entry);
+		cpu_timer_list_dequeue(&timer->it.cpu);
 
 	/*
 	 * We need to sample the current value to convert the new
@@ -761,8 +768,7 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int timer_flags,
 		sample_to_timespec(timer->it_clock,
 				   old_incr, &old->it_interval);
 	}
-	if (!ret)
-		posix_cpu_timer_kick_nohz();
+
 	return ret;
 }
 
@@ -844,7 +850,7 @@ check_timers_list(struct list_head *timers,
 			return t->expires;
 
 		t->firing = 1;
-		list_move_tail(&t->entry, firing);
+		cpu_timer_list_prepare_firing(t, firing);
 	}
 
 	return 0;
@@ -1073,8 +1079,6 @@ void posix_cpu_timer_schedule(struct k_itimer *timer)
 	arm_timer(timer);
 	unlock_task_sighand(p, &flags);
 
-	/* Kick full dynticks CPUs in case they need to tick on the new timer */
-	posix_cpu_timer_kick_nohz();
 out:
 	timer->it_overrun_last = timer->it_overrun;
 	timer->it_overrun = -1;
@@ -1243,7 +1247,7 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
 		}
 
 		if (!*newval)
-			goto out;
+			return;
 		*newval += now;
 	}
 
@@ -1261,8 +1265,6 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
 			tsk->signal->cputime_expires.virt_exp = *newval;
 		break;
 	}
-out:
-	posix_cpu_timer_kick_nohz();
 }
 
 static int do_cpu_nanosleep(const clockid_t which_clock, int flags,
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index e6447bd..4805f4e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -192,11 +192,6 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 		return false;
 	}
 
-	if (!posix_cpu_timers_can_stop_tick(current)) {
-		trace_tick_stop(0, "posix timers running\n");
-		return false;
-	}
-
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 	/*
 	 * sched_clock_tick() needs us?
-- 
2.1.4



* [PATCH 09/10] sched-clock: Migrate to use new tick dependency mask model
  2015-07-23 16:42 [PATCH 00/10] nohz: Tick dependency mask v2 Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2015-07-23 16:42 ` [PATCH 08/10] posix-cpu-timers: Migrate " Frederic Weisbecker
@ 2015-07-23 16:42 ` Frederic Weisbecker
  2015-07-23 16:42 ` [PATCH 10/10] nohz: Remove task switch obsolete tick dependency check Frederic Weisbecker
  9 siblings, 0 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-23 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Preeti U Murthy, Christoph Lameter, Ingo Molnar, Viresh Kumar,
	Rik van Riel

Instead of checking sched_clock_stable from the nohz subsystem to
verify its tick dependency, migrate it to the new mask in order to
include it in the all-in-one check.

Cc: Christoph Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/clock.c     |  5 +++++
 kernel/time/tick-sched.c | 26 ++++++--------------------
 2 files changed, 11 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index c0a2051..30e2524 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -61,6 +61,7 @@
 #include <linux/static_key.h>
 #include <linux/workqueue.h>
 #include <linux/compiler.h>
+#include <linux/tick.h>
 
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -89,6 +90,8 @@ static void __set_sched_clock_stable(void)
 {
 	if (!sched_clock_stable())
 		static_key_slow_inc(&__sched_clock_stable);
+
+	tick_nohz_clear_tick_dependency(TICK_CLOCK_UNSTABLE_BIT);
 }
 
 void set_sched_clock_stable(void)
@@ -108,6 +111,8 @@ static void __clear_sched_clock_stable(struct work_struct *work)
 	/* XXX worry about clock continuity */
 	if (sched_clock_stable())
 		static_key_slow_dec(&__sched_clock_stable);
+
+	tick_nohz_set_tick_dependency(TICK_CLOCK_UNSTABLE_BIT);
 }
 
 static DECLARE_WORK(sched_clock_work, __clear_sched_clock_stable);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4805f4e..7b74b93 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -174,8 +174,13 @@ static void trace_tick_dependency(unsigned long dep)
 		return;
 	}
 
-	if (dep & TICK_CLOCK_UNSTABLE_MASK)
+#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
+	if (dep & TICK_CLOCK_UNSTABLE_MASK) {
 		trace_tick_stop(0, "unstable sched clock\n");
+		WARN_ONCE(tick_nohz_full_running,
+			  "NO_HZ FULL will not work with unstable sched clock");
+	}
+#endif
 }
 
 static bool can_stop_full_tick(struct tick_sched *ts)
@@ -192,25 +197,6 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 		return false;
 	}
 
-#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
-	/*
-	 * sched_clock_tick() needs us?
-	 *
-	 * TODO: kick full dynticks CPUs when
-	 * sched_clock_stable is set.
-	 */
-	if (!sched_clock_stable()) {
-		trace_tick_stop(0, "unstable sched clock\n");
-		/*
-		 * Don't allow the user to think they can get
-		 * full NO_HZ with this machine.
-		 */
-		WARN_ONCE(tick_nohz_full_running,
-			  "NO_HZ FULL will not work with unstable sched clock");
-		return false;
-	}
-#endif
-
 	return true;
 }
 
-- 
2.1.4



* [PATCH 10/10] nohz: Remove task switch obsolete tick dependency check
  2015-07-23 16:42 [PATCH 00/10] nohz: Tick dependency mask v2 Frederic Weisbecker
                   ` (8 preceding siblings ...)
  2015-07-23 16:42 ` [PATCH 09/10] sched-clock: " Frederic Weisbecker
@ 2015-07-23 16:42 ` Frederic Weisbecker
  9 siblings, 0 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-23 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Preeti U Murthy, Christoph Lameter, Ingo Molnar, Viresh Kumar,
	Rik van Riel

The task switch check was there to evaluate task-level tick
dependencies. Now that all of them have been converted to the new
CPU-wide or system-wide tick dependency masks, toggled by the
concerned subsystems, we can safely remove it.

Cc: Christoph Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/tick.h     |  8 --------
 kernel/sched/core.c      |  2 +-
 kernel/time/tick-sched.c | 23 -----------------------
 3 files changed, 1 insertion(+), 32 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index daafcce..609f769 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -170,7 +170,6 @@ extern void tick_nohz_clear_tick_dependency_this_cpu(enum tick_dependency_bit bi
 extern void tick_nohz_full_kick(void);
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
-extern void __tick_nohz_task_switch(void);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -178,7 +177,6 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }
 static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void tick_nohz_full_kick(void) { }
 static inline void tick_nohz_full_kick_all(void) { }
-static inline void __tick_nohz_task_switch(void) { }
 #endif
 
 static inline const struct cpumask *housekeeping_cpumask(void)
@@ -208,10 +206,4 @@ static inline void housekeeping_affine(struct task_struct *t)
 #endif
 }
 
-static inline void tick_nohz_task_switch(void)
-{
-	if (tick_nohz_full_enabled())
-		__tick_nohz_task_switch();
-}
-
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6c3db36..eb926bb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2491,7 +2491,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	}
 
 	sched_update_tick_dependency(rq);
-	tick_nohz_task_switch();
+
 	return rq;
 }
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7b74b93..f7391a6 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -353,29 +353,6 @@ void tick_nohz_clear_tick_dependency_this_cpu(enum tick_dependency_bit bit)
 	__tick_nohz_clear_tick_dependency(bit, &ts->tick_dependency);
 }
 
-/*
- * Re-evaluate the need for the tick as we switch the current task.
- * It might need the tick due to per task/process properties:
- * perf events, posix cpu timers, ...
- */
-void __tick_nohz_task_switch(void)
-{
-	unsigned long flags;
-	struct tick_sched *ts;
-
-	local_irq_save(flags);
-
-	if (!tick_nohz_full_cpu(smp_processor_id()))
-		goto out;
-
-	ts = this_cpu_ptr(&tick_cpu_sched);
-
-	if (ts->tick_stopped && !can_stop_full_tick(ts))
-		tick_nohz_full_kick();
-out:
-	local_irq_restore(flags);
-}
-
 /* Parse the boot-time nohz CPU list from the kernel parameters. */
 static int __init tick_nohz_full_setup(char *str)
 {
-- 
2.1.4



* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-07-23 16:42 ` [PATCH 07/10] sched: Migrate sched " Frederic Weisbecker
@ 2015-07-23 16:55   ` Frederic Weisbecker
  2015-07-24 16:56   ` Chris Metcalf
  2015-08-03 14:00   ` Peter Zijlstra
  2 siblings, 0 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-23 16:55 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Thu, Jul 23, 2015 at 06:42:12PM +0200, Frederic Weisbecker wrote:
> Instead of providing asynchronous checks for the nohz subsystem to verify
> sched tick dependency, migrate sched to the new mask.
> 
> The easiest approach is to recycle the current asynchronous tick
> dependency check, which verifies the class of the current task and its
> requirements for periodic preemption checks.
> 
> We need to evaluate this tick dependency in three places:
> 
> 1) Task enqueue: One or more tasks have been enqueued, we must check
>    if those are competing with the current task.
> 
> 2) Task dequeue: A possibly competing task has been dequeued, clear the
>    tick dependency if needed.
> 
> 3) schedule(): we might be switching to a task of another scheduler
>    class. Each class has its own preemption rules, so we must
>    re-evaluate the dependency.
> 
> This doesn't change much compared to the previous layout, except that
> 3) has to be done with the rq locked, to avoid the mask change racing
> with a remote enqueue.
> 
> We could get rid of 3) by checking the highest prio tasks of the
> runqueue instead of its current task.
> 
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> ---
>  include/linux/sched.h    |  3 ---
>  kernel/sched/core.c      | 12 ++++++-----
>  kernel/sched/sched.h     | 56 +++++++++++++++++++++++++++++++++++-------------
>  kernel/time/tick-sched.c |  5 -----
>  4 files changed, 48 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ae21f15..88c99a2 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2296,10 +2296,7 @@ static inline void wake_up_nohz_cpu(int cpu) { }
>  #endif
>  
>  #ifdef CONFIG_NO_HZ_FULL
> -extern bool sched_can_stop_tick(void);
>  extern u64 scheduler_tick_max_deferment(void);
> -#else
> -static inline bool sched_can_stop_tick(void) { return false; }
>  #endif
>  
>  #ifdef CONFIG_SCHED_AUTOGROUP
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d34035..6c3db36 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -714,21 +714,22 @@ static inline bool got_nohz_idle_kick(void)
>  #endif /* CONFIG_NO_HZ_COMMON */
>  
>  #ifdef CONFIG_NO_HZ_FULL
> -bool sched_can_stop_tick(void)
> +bool sched_can_stop_tick(struct rq *rq)
>  {
> +	struct task_struct *curr = rq->curr;
>  	/*
>  	 * FIFO realtime policy runs the highest priority task. Other runnable
>  	 * tasks are of a lower priority. The scheduler tick does nothing.
>  	 */
> -	if (current->policy == SCHED_FIFO)
> +	if (curr->policy == SCHED_FIFO)
>  		return true;
>  
>  	/*
>  	 * Round-robin realtime tasks time slice with other tasks at the same
>  	 * realtime priority. Is this task the only one at this priority?
>  	 */
> -	if (current->policy == SCHED_RR) {
> -		struct sched_rt_entity *rt_se = &current->rt;
> +	if (curr->policy == SCHED_RR) {
> +		struct sched_rt_entity *rt_se = &curr->rt;
>  
>  		return rt_se->run_list.prev == rt_se->run_list.next;
>  	}
> @@ -738,7 +739,7 @@ bool sched_can_stop_tick(void)
>  	 * nr_running update is assumed to be visible
>  	 * after IPI is sent from wakers.
>  	 */
> -	if (this_rq()->nr_running > 1)
> +	if (rq->nr_running > 1)
>  		return false;
>  
>  	return true;
> @@ -2489,6 +2490,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  		put_task_struct(prev);
>  	}
>  
> +	sched_update_tick_dependency(rq);

A small mistake here that I forgot to fix. This call is supposed to be between
perf_event_task_sched_in() and finish_lock_switch(). It must be made while the
rq is locked, otherwise there is a risk that a remote CPU enqueues or dequeues
concurrently and messes up the dependency mask.

Here is the correct diff:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae21f15..88c99a2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2296,10 +2296,7 @@ static inline void wake_up_nohz_cpu(int cpu) { }
 #endif
 
 #ifdef CONFIG_NO_HZ_FULL
-extern bool sched_can_stop_tick(void);
 extern u64 scheduler_tick_max_deferment(void);
-#else
-static inline bool sched_can_stop_tick(void) { return false; }
 #endif
 
 #ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d34035..58b16d3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -714,21 +714,22 @@ static inline bool got_nohz_idle_kick(void)
 #endif /* CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
-bool sched_can_stop_tick(void)
+bool sched_can_stop_tick(struct rq *rq)
 {
+	struct task_struct *curr = rq->curr;
 	/*
 	 * FIFO realtime policy runs the highest priority task. Other runnable
 	 * tasks are of a lower priority. The scheduler tick does nothing.
 	 */
-	if (current->policy == SCHED_FIFO)
+	if (curr->policy == SCHED_FIFO)
 		return true;
 
 	/*
 	 * Round-robin realtime tasks time slice with other tasks at the same
 	 * realtime priority. Is this task the only one at this priority?
 	 */
-	if (current->policy == SCHED_RR) {
-		struct sched_rt_entity *rt_se = &current->rt;
+	if (curr->policy == SCHED_RR) {
+		struct sched_rt_entity *rt_se = &curr->rt;
 
 		return rt_se->run_list.prev == rt_se->run_list.next;
 	}
@@ -738,7 +739,7 @@ bool sched_can_stop_tick(void)
 	 * nr_running update is assumed to be visible
 	 * after IPI is sent from wakers.
 	 */
-	if (this_rq()->nr_running > 1)
+	if (rq->nr_running > 1)
 		return false;
 
 	return true;
@@ -2471,6 +2472,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	vtime_task_switch(prev);
 	finish_arch_switch(prev);
 	perf_event_task_sched_in(prev, current);
+	sched_update_tick_dependency(rq);
 	finish_lock_switch(rq, prev);
 	finish_arch_post_lock_switch();
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 84d4879..5037acf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1321,6 +1321,38 @@ unsigned long to_ratio(u64 period, u64 runtime);
 
 extern void init_task_runnable_average(struct task_struct *p);
 
+#ifdef CONFIG_NO_HZ_FULL
+extern bool sched_can_stop_tick(struct rq *rq);
+
+/*
+ * Tick is needed if more than one task runs on a CPU.
+ * Send the target an IPI to kick it out of nohz mode.
+ *
+ * We assume that IPI implies full memory barrier and the
+ * new value of rq->nr_running is visible on reception
+ * from the target.
+ */
+static inline void sched_update_tick_dependency(struct rq *rq)
+{
+	int cpu;
+
+	if (!tick_nohz_full_enabled())
+		return;
+
+	cpu = cpu_of(rq);
+
+	if (!tick_nohz_full_cpu(rq->cpu))
+		return;
+
+	if (sched_can_stop_tick(rq))
+		tick_nohz_clear_tick_dependency_cpu(TICK_SCHED_BIT, cpu);
+	else
+		tick_nohz_set_tick_dependency_cpu(TICK_SCHED_BIT, cpu);
+}
+#else
+static inline void sched_update_tick_dependency(struct rq *rq) { }
+#endif
+
 static inline void add_nr_running(struct rq *rq, unsigned count)
 {
 	unsigned prev_nr = rq->nr_running;
@@ -1332,26 +1364,20 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 		if (!rq->rd->overload)
 			rq->rd->overload = true;
 #endif
-
-#ifdef CONFIG_NO_HZ_FULL
-		if (tick_nohz_full_cpu(rq->cpu)) {
-			/*
-			 * Tick is needed if more than one task runs on a CPU.
-			 * Send the target an IPI to kick it out of nohz mode.
-			 *
-			 * We assume that IPI implies full memory barrier and the
-			 * new value of rq->nr_running is visible on reception
-			 * from the target.
-			 */
-			tick_nohz_full_kick_cpu(rq->cpu);
-		}
-#endif
+		/* Check if new task(s) need periodic preemption check */
+		sched_update_tick_dependency(rq);
 	}
 }
 
 static inline void sub_nr_running(struct rq *rq, unsigned count)
 {
-	rq->nr_running -= count;
+	unsigned prev_nr = rq->nr_running;
+
+	rq->nr_running = prev_nr - count;
+	if (prev_nr > 1) {
+		/* Check if we still need preemption */
+		sched_update_tick_dependency(rq);
+	}
 }
 
 static inline void rq_last_tick_reset(struct rq *rq)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fbe4736..e6447bd 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -192,11 +192,6 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 		return false;
 	}
 
-	if (!sched_can_stop_tick()) {
-		trace_tick_stop(0, "more than 1 task in runqueue\n");
-		return false;
-	}
-
 	if (!posix_cpu_timers_can_stop_tick(current)) {
 		trace_tick_stop(0, "posix timers running\n");
 		return false;

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/10] nohz: New tick dependency mask
  2015-07-23 16:42 ` [PATCH 05/10] nohz: New tick dependency mask Frederic Weisbecker
@ 2015-07-24 16:55   ` Chris Metcalf
  2015-07-24 17:16     ` Frederic Weisbecker
  2015-08-03 12:43   ` Peter Zijlstra
  2015-08-03 12:57   ` Peter Zijlstra
  2 siblings, 1 reply; 59+ messages in thread
From: Chris Metcalf @ 2015-07-24 16:55 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On 07/23/2015 12:42 PM, Frederic Weisbecker wrote:
> +unsigned long __tick_nohz_set_tick_dependency(enum tick_dependency_bit bit,
> +					      unsigned long *dep)
> +{
> +	unsigned long prev;
> +	unsigned long old = *dep;
> +	unsigned long mask = BIT_MASK(bit);
> +
> +	while ((prev = cmpxchg(dep, old, old | mask)) != old) {
> +		old = prev;
> +		cpu_relax();
> +	}
> +
> +	return prev;
> +}

Why not use set_bit() here?  It is suitably atomic.

> +		/*
> +		* We need the IPIs to be sent from sane process context.
> +		* The posix cpu timers are always set with irqs disabled.
> +		*/

The block comment indentation is not quite right here.

> +void tick_nohz_set_tick_dependency_cpu(enum tick_dependency_bit bit, int cpu)
> +{
> +	unsigned long prev;
> +	struct tick_sched *ts;
> +
> +	ts = per_cpu_ptr(&tick_cpu_sched, cpu);
> +
> +	prev = __tick_nohz_set_tick_dependency(bit, &ts->tick_dependency);
> +	if (!prev)
> +		tick_nohz_full_kick_cpu(cpu);
> +}

I could imagine arguments for a WARN_ON() if cpu == smp_processor_id() 
here, since then you should be using the _thiscpu() variant.
Or, you could transparently call the _thiscpu() variant in that case.
I think some comment explaining why the approach you chose is better
than those alternatives would be helpful here, perhaps.
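
Roughly one of these at the top of the function is all I mean (just a
sketch):

	/* Either catch misuse of the cross-cpu variant on the local cpu... */
	WARN_ON_ONCE(cpu == smp_processor_id());

	/* ...or transparently redirect to the local variant: */
	if (cpu == smp_processor_id()) {
		tick_nohz_set_tick_dependency_this_cpu(bit);
		return;
	}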

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-07-23 16:42 ` [PATCH 07/10] sched: Migrate sched " Frederic Weisbecker
  2015-07-23 16:55   ` Frederic Weisbecker
@ 2015-07-24 16:56   ` Chris Metcalf
  2015-07-29 13:01     ` Frederic Weisbecker
  2015-08-03 14:00   ` Peter Zijlstra
  2 siblings, 1 reply; 59+ messages in thread
From: Chris Metcalf @ 2015-07-24 16:56 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On 07/23/2015 12:42 PM, Frederic Weisbecker wrote:
> +static inline void sched_update_tick_dependency(struct rq *rq)
> +{
> +	int cpu;
> +
> +	if (!tick_nohz_full_enabled())
> +		return;
> +
> +	cpu = cpu_of(rq);
> +
> +	if (!tick_nohz_full_cpu(rq->cpu))
> +		return;
> +
> +	if (sched_can_stop_tick(rq))
> +		tick_nohz_clear_tick_dependency_cpu(TICK_SCHED_BIT, cpu);
> +	else
> +		tick_nohz_set_tick_dependency_cpu(TICK_SCHED_BIT, cpu);
> +}

Is it worth asserting that the rq is locked at this point? Presumably 
that's a requirement so that the cpu can't change out from under you.
-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-07-23 16:42 ` [PATCH 08/10] posix-cpu-timers: Migrate " Frederic Weisbecker
@ 2015-07-24 16:57   ` Chris Metcalf
  2015-07-29 13:23     ` Frederic Weisbecker
  0 siblings, 1 reply; 59+ messages in thread
From: Chris Metcalf @ 2015-07-24 16:57 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On 07/23/2015 12:42 PM, Frederic Weisbecker wrote:
> +static void cpu_timer_list_dequeue(struct cpu_timer_list *t)
> +{
> +	if (!list_empty(&t->entry))
> +		cpu_timer_dec_tick_dependency();
> +	list_del_init(&t->entry);
> +}

Is the list_empty() test necessary? It wasn't in the original 
posix-timers code, and it feels like a pretty serious bug if you're 
doing a list_del on an empty list.

At a higher level, is the posix-cpu-timers code here really providing 
the right semantics? It seems like before, the code was checking a 
struct task-specific state, and now you are setting a global state such 
that if ANY task anywhere in the system (even on housekeeping cores) has 
a pending posix cpu timer, then nothing can go into nohz_full mode.

Perhaps what is needed is a task_struct->tick_dependency to go along 
with the system-wide and per-cpu flag words?
-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/10] nohz: New tick dependency mask
  2015-07-24 16:55   ` Chris Metcalf
@ 2015-07-24 17:16     ` Frederic Weisbecker
  2015-07-24 17:43       ` Chris Metcalf
  0 siblings, 1 reply; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-24 17:16 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Fri, Jul 24, 2015 at 12:55:35PM -0400, Chris Metcalf wrote:
> On 07/23/2015 12:42 PM, Frederic Weisbecker wrote:
> >+unsigned long __tick_nohz_set_tick_dependency(enum tick_dependency_bit bit,
> >+					      unsigned long *dep)
> >+{
> >+	unsigned long prev;
> >+	unsigned long old = *dep;
> >+	unsigned long mask = BIT_MASK(bit);
> >+
> >+	while ((prev = cmpxchg(dep, old, old | mask)) != old) {
> >+		old = prev;
> >+		cpu_relax();
> >+	}
> >+
> >+	return prev;
> >+}
> 
> Why not use set_bit() here?  It is suitably atomic.

Because I don't want to send an IPI if the CPU already had bits set in
the dependency.

Ideally I need something like test_and_set_bit() but which returns the
whole previous value and not just the previous value of the bit.
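
Something of this shape, in other words (just a sketch, the helper name
is made up):

	static inline unsigned long dep_fetch_or(unsigned long *dep,
						 unsigned long mask)
	{
		unsigned long old = *dep, prev;

		/* Atomically OR the mask in, returning the whole previous word */
		while ((prev = cmpxchg(dep, old, old | mask)) != old)
			old = prev;

		return old;
	}

so that the caller can kick only on the empty -> non-empty transition:

	if (!dep_fetch_or(&ts->tick_dependency, BIT_MASK(bit)))
		tick_nohz_full_kick_cpu(cpu);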

> 
> >+		/*
> >+		* We need the IPIs to be sent from sane process context.
> >+		* The posix cpu timers are always set with irqs disabled.
> >+		*/
> 
> The block comment indentation is not quite right here.

Right.

> 
> >+void tick_nohz_set_tick_dependency_cpu(enum tick_dependency_bit bit, int cpu)
> >+{
> >+	unsigned long prev;
> >+	struct tick_sched *ts;
> >+
> >+	ts = per_cpu_ptr(&tick_cpu_sched, cpu);
> >+
> >+	prev = __tick_nohz_set_tick_dependency(bit, &ts->tick_dependency);
> >+	if (!prev)
> >+		tick_nohz_full_kick_cpu(cpu);
> >+}
> 
> I could imagine arguments for a WARN_ON() if cpu == smp_processor_id() here,
> since then you should be using the _thiscpu() variant.
> Or, you could transparently call the _thiscpu() variant in that case.
> I think some comment explaining why the approach you chose is better
> than those alternatives would be helpful here, perhaps.

It's fine to use this variant for a local dependency, but if you want it to
be NMI-safe you need the _this_cpu() version. Perf events require that.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/10] nohz: New tick dependency mask
  2015-07-24 17:16     ` Frederic Weisbecker
@ 2015-07-24 17:43       ` Chris Metcalf
  2015-08-03 12:48         ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Chris Metcalf @ 2015-07-24 17:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On 07/24/2015 01:16 PM, Frederic Weisbecker wrote:
> On Fri, Jul 24, 2015 at 12:55:35PM -0400, Chris Metcalf wrote:
>> On 07/23/2015 12:42 PM, Frederic Weisbecker wrote:
>>> +unsigned long __tick_nohz_set_tick_dependency(enum tick_dependency_bit bit,
>>> +					      unsigned long *dep)
>>> +{
>>> +	unsigned long prev;
>>> +	unsigned long old = *dep;
>>> +	unsigned long mask = BIT_MASK(bit);
>>> +
>>> +	while ((prev = cmpxchg(dep, old, old | mask)) != old) {
>>> +		old = prev;
>>> +		cpu_relax();
>>> +	}
>>> +
>>> +	return prev;
>>> +}
>> Why not use set_bit() here?  It is suitably atomic.
> Because I don't want to send an IPI if the CPU already had bits set in
> the dependency.
>
> Ideally I need something like test_and_set_bit() but which returns the
> whole previous value and not just the previous value of the bit.

Ah, of course.  Peter, maybe we need atomic_or_return() as part
of your new atomic_or/_and/_xor series?  Certainly on tilegx, and
likely other architectures, we can do better than Frederic's
cmpxchg() loop.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-07-24 16:56   ` Chris Metcalf
@ 2015-07-29 13:01     ` Frederic Weisbecker
  0 siblings, 0 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-29 13:01 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Fri, Jul 24, 2015 at 12:56:42PM -0400, Chris Metcalf wrote:
> On 07/23/2015 12:42 PM, Frederic Weisbecker wrote:
> >+static inline void sched_update_tick_dependency(struct rq *rq)
> >+{
> >+	int cpu;
> >+
> >+	if (!tick_nohz_full_enabled())
> >+		return;
> >+
> >+	cpu = cpu_of(rq);
> >+
> >+	if (!tick_nohz_full_cpu(rq->cpu))
> >+		return;
> >+
> >+	if (sched_can_stop_tick(rq))
> >+		tick_nohz_clear_tick_dependency_cpu(TICK_SCHED_BIT, cpu);
> >+	else
> >+		tick_nohz_set_tick_dependency_cpu(TICK_SCHED_BIT, cpu);
> >+}
> 
> Is it worth asserting that the rq is locked at this point? Presumably that's
> a requirement so that the cpu can't change out from under you.

Indeed it has become a requirement here. I'll add a check.

Thanks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-07-24 16:57   ` Chris Metcalf
@ 2015-07-29 13:23     ` Frederic Weisbecker
  2015-07-29 17:24       ` Chris Metcalf
  0 siblings, 1 reply; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-29 13:23 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Fri, Jul 24, 2015 at 12:57:24PM -0400, Chris Metcalf wrote:
> On 07/23/2015 12:42 PM, Frederic Weisbecker wrote:
> >+static void cpu_timer_list_dequeue(struct cpu_timer_list *t)
> >+{
> >+	if (!list_empty(&t->entry))
> >+		cpu_timer_dec_tick_dependency();
> >+	list_del_init(&t->entry);
> >+}
> 
> Is the list_empty() test necessary? It wasn't in the original posix-timers
> code, and it feels like a pretty serious bug if you're doing a list_del on
> an empty list.

No, multiple calls to list_del_init() are fine on a list_head as long as it
has been correctly initialized with INIT_LIST_HEAD() and plain list_del()
hasn't been called on it at some point before.

The check is necessary because we also do that dequeue when we change the
timer: we disarm it in case it was armed somewhere before.
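
To illustrate (sketch only; the queue head here is made up):

	LIST_HEAD(some_timer_queue);
	struct cpu_timer_list t;

	INIT_LIST_HEAD(&t.entry);              /* self-linked, list_empty() == true */
	list_add(&t.entry, &some_timer_queue); /* armed, list_empty() == false */
	list_del_init(&t.entry);               /* disarmed, self-linked again */
	list_del_init(&t.entry);               /* harmless no-op, still self-linked */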

> At a higher level, is the posix-cpu-timers code here really providing the
> right semantics? It seems like before, the code was checking a struct
> task-specific state, and now you are setting a global state such that if ANY
> task anywhere in the system (even on housekeeping cores) has a pending posix
> cpu timer, then nothing can go into nohz_full mode.
> 
> Perhaps what is needed is a task_struct->tick_dependency to go along with
> the system-wide and per-cpu flag words?

That's an excellent point! Indeed the tick dependency check on posix-cpu-timers
was made on task granularity before and now it's a global dependency.

Which means that if any task in the system has a posix-cpu-timer enqueued, it
prevents all CPUs from shutting down the tick. I need to mention that in the
changelog.

Now here is the rationale: I expect that nohz full users are not interested in
posix cpu timers at all. The only chance for one to run without breaking the
isolation is on housekeeping CPUs. So perhaps there is a corner case somewhere
but I assume there isn't until somebody reports an issue.

Keeping a task level dependency check means that we need to update it on context
switch. Plus it's not only about task but also process. So that means two
states to update on context switch and to check from interrupts. I don't think
it's worth the effort if there is no user at all.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-07-29 13:23     ` Frederic Weisbecker
@ 2015-07-29 17:24       ` Chris Metcalf
  2015-07-30  0:44         ` Frederic Weisbecker
  0 siblings, 1 reply; 59+ messages in thread
From: Chris Metcalf @ 2015-07-29 17:24 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On 07/29/2015 09:23 AM, Frederic Weisbecker wrote:
>> At a higher level, is the posix-cpu-timers code here really providing the
>> >right semantics? It seems like before, the code was checking a struct
>> >task-specific state, and now you are setting a global state such that if ANY
>> >task anywhere in the system (even on housekeeping cores) has a pending posix
>> >cpu timer, then nothing can go into nohz_full mode.
>> >
>> >Perhaps what is needed is a task_struct->tick_dependency to go along with
>> >the system-wide and per-cpu flag words?
> That's an excellent point! Indeed the tick dependency check on posix-cpu-timers
> was made on task granularity before and now it's a global dependency.
>
> Which means that if any task in the system has a posix-cpu-timer enqueued, it
> prevents all CPUs from shutting down the tick. I need to mention that in the
> changelog.
>
> Now here is the rationale: I expect that nohz full users are not interested in
> posix cpu timers at all. The only chance for one to run without breaking the
> isolation is on housekeeping CPUs. So perhaps there is a corner case somewhere
> but I assume there isn't until somebody reports an issue.
>
> Keeping a task level dependency check means that we need to update it on context
> switch. Plus it's not only about task but also process. So that means two
> states to update on context switch and to check from interrupts. I don't think
> it's worth the effort if there is no user at all.

I really worry about this!  The vision EZchip offers our customers is
that they can run whatever they want on the slow path housekeeping
cores, i.e. random control-plane code.  Then, on the fast-path cores,
they run their nohz_full stuff without interruption.  Often they don't
even know what the hell is running on their control plane cores - SNMP
or random third-party crap or god knows what.  And there is a decent
likelihood that some posix cpu timer code might sneak in.

You mentioned needing two fields, for task and for process, but in
fact let's just add the one field to the one thing that needs it and
not worry about additional possible future needs.  And note that it's
the task_struct->signal where we need to add the field for posix cpu
timers (the signal_struct) since that's where the sharing occurs, and
given CLONE_SIGHAND I imagine it could be different from the general
"process" model anyway.

In any case it seems like we don't need to do work at context switch.
Updates to the task's tick_dependency are just done as normal in the
task context via "current->signal->". When we are returning to user
space and we want to check the tick, again, we can just read via
"current->signal->".  Why would we need to copy the value around at
task switch time?  That's only necessary if you want to do something
like read/write the task tick_dependency via the cpu index, I would think.
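
Concretely, all I have in mind is something like this (sketch only; the
field and bit names are invented, and the atomic op and the kick are
elided):

	/* arming a process-wide posix cpu timer, from task context: */
	current->signal->tick_dependency |= BIT_MASK(TICK_POSIX_TIMER_BIT);

	/* and on the can-stop-tick / return-to-user path: */
	if (current->tick_dependency || current->signal->tick_dependency)
		return false;	/* keep the tick */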

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-07-29 17:24       ` Chris Metcalf
@ 2015-07-30  0:44         ` Frederic Weisbecker
  2015-07-30 14:31           ` Luiz Capitulino
  2015-07-30 19:35           ` Chris Metcalf
  0 siblings, 2 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-30  0:44 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel


On Wed, Jul 29, 2015 at 01:24:16PM -0400, Chris Metcalf wrote:
> On 07/29/2015 09:23 AM, Frederic Weisbecker wrote:
> >>At a higher level, is the posix-cpu-timers code here really providing the
> >>>right semantics? It seems like before, the code was checking a struct
> >>>task-specific state, and now you are setting a global state such that if ANY
> >>>task anywhere in the system (even on housekeeping cores) has a pending posix
> >>>cpu timer, then nothing can go into nohz_full mode.
> >>>
> >>>Perhaps what is needed is a task_struct->tick_dependency to go along with
> >>>the system-wide and per-cpu flag words?
> >That's an excellent point! Indeed the tick dependency check on posix-cpu-timers
> >was made on task granularity before and now it's a global dependency.
> >
> >Which means that if any task in the system has a posix-cpu-timer enqueued, it
> >prevents all CPUs from shutting down the tick. I need to mention that in the
> >changelog.
> >
> >Now here is the rationale: I expect that nohz full users are not interested in
> >posix cpu timers at all. The only chance for one to run without breaking the
> >isolation is on housekeeping CPUs. So perhaps there is a corner case somewhere
> >but I assume there isn't until somebody reports an issue.
> >
> >Keeping a task level dependency check means that we need to update it on context
> >switch. Plus it's not only about task but also process. So that means two
> >states to update on context switch and to check from interrupts. I don't think
> >it's worth the effort if there is no user at all.
> 
> I really worry about this!  The vision EZchip offers our customers is
> that they can run whatever they want on the slow path housekeeping
> cores, i.e. random control-plane code.  Then, on the fast-path cores,
> they run their nohz_full stuff without interruption.  Often they don't
> even know what the hell is running on their control plane cores - SNMP
> or random third-party crap or god knows what.  And there is a decent
> likelihood that some posix cpu timer code might sneak in.

I see. But note that installing a posix cpu timer ends up triggering an
IPI to all nohz full CPUs. That's how nohz full has always behaved.
So users running posix timers on nohz should already suffer issues anyway.

> 
> You mentioned needing two fields, for task and for process, but in
> fact let's just add the one field to the one thing that needs it and
> not worry about additional possible future needs.  And note that it's
> the task_struct->signal where we need to add the field for posix cpu
> timers (the signal_struct) since that's where the sharing occurs, and
> given CLONE_SIGHAND I imagine it could be different from the general
> "process" model anyway.

Well, posix cpu timers can be installed per process (signal struct) or
per thread (task struct).

But we can certainly simplify that with a per process flag and expand
the thread dependency to the process scope.

Still there is the issue of telling the CPUs where a process runs when
a posix timer is installed there. There is no process-like tsk->cpus_allowed.
Either we send an IPI everywhere like we do now or we iterate through all
threads in the process to OR all their cpumasks in order to send that IPI.
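
The latter would look roughly like this (untested sketch; p is any
task_struct in the target process):

	struct task_struct *t;
	cpumask_var_t mask;

	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
		return;

	cpumask_clear(mask);
	rcu_read_lock();
	/* OR together the affinity of every thread of the process */
	for_each_thread(p, t)
		cpumask_or(mask, mask, tsk_cpus_allowed(t));
	rcu_read_unlock();

	/* ...then IPI only the CPUs in 'mask' instead of every nohz full CPU */
	free_cpumask_var(mask);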

> 
> In any case it seems like we don't need to do work at context switch.
> Updates to the task's tick_dependency are just done as normal in the
> task context via "current->signal->". When we are returning to user
> space and we want to check the tick, again, we can just read via
> "current->signal->".  Why would we need to copy the value around at
> task switch time?  That's only necessary if you want to do something
> like read/write the task tick_dependency via the cpu index, I would think.

Yeah you're right, at least the context switch should be fine.

Thanks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-07-30  0:44         ` Frederic Weisbecker
@ 2015-07-30 14:31           ` Luiz Capitulino
  2015-07-30 14:46             ` Frederic Weisbecker
  2015-07-30 19:35           ` Chris Metcalf
  1 sibling, 1 reply; 59+ messages in thread
From: Luiz Capitulino @ 2015-07-30 14:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, LKML, Peter Zijlstra, Thomas Gleixner,
	Preeti U Murthy, Christoph Lameter, Ingo Molnar, Viresh Kumar,
	Rik van Riel

On Thu, 30 Jul 2015 02:44:45 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> 
> On Wed, Jul 29, 2015 at 01:24:16PM -0400, Chris Metcalf wrote:
> > On 07/29/2015 09:23 AM, Frederic Weisbecker wrote:
> > >>At a higher level, is the posix-cpu-timers code here really providing the
> > >>>right semantics? It seems like before, the code was checking a struct
> > >>>task-specific state, and now you are setting a global state such that if ANY
> > >>>task anywhere in the system (even on housekeeping cores) has a pending posix
> > >>>cpu timer, then nothing can go into nohz_full mode.
> > >>>
> > >>>Perhaps what is needed is a task_struct->tick_dependency to go along with
> > >>>the system-wide and per-cpu flag words?
> > >That's an excellent point! Indeed the tick dependency check on posix-cpu-timers
> > >was made on task granularity before and now it's a global dependency.
> > >
> > >Which means that if any task in the system has a posix-cpu-timer enqueued, it
> > >prevents all CPUs from shutting down the tick. I need to mention that in the
> > >changelog.
> > >
> > >Now here is the rationale: I expect that nohz full users are not interested in
> > >posix cpu timers at all. The only chance for one to run without breaking the
> > >isolation is on housekeeping CPUs. So perhaps there is a corner case somewhere
> > >but I assume there isn't until somebody reports an issue.
> > >
> > >Keeping a task level dependency check means that we need to update it on context
> > >switch. Plus it's not only about task but also process. So that means two
> > >states to update on context switch and to check from interrupts. I don't think
> > >it's worth the effort if there is no user at all.
> > 
> > I really worry about this!  The vision EZchip offers our customers is
> > that they can run whatever they want on the slow path housekeeping
> > cores, i.e. random control-plane code.  Then, on the fast-path cores,
> > they run their nohz_full stuff without interruption.  Often they don't
> > even know what the hell is running on their control plane cores - SNMP
> > or random third-party crap or god knows what.  And there is a decent
> > likelihood that some posix cpu timer code might sneak in.

I share this thinking. We do exactly the same thing for KVM-RT and I
wouldn't be surprised at all if a posix timer pops up on the
housekeeping CPUs.

> I see. But note that installing a posix cpu timer ends up triggering an
> IPI to all nohz full CPUs. That's how nohz full has always behaved.
> So users running posix timers on nohz should already suffer issues anyway.

I haven't checked how this would affect us, but it seems a lot less serious
than not having nohz at all.

> 
> > 
> > You mentioned needing two fields, for task and for process, but in
> > fact let's just add the one field to the one thing that needs it and
> > not worry about additional possible future needs.  And note that it's
> > the task_struct->signal where we need to add the field for posix cpu
> > timers (the signal_struct) since that's where the sharing occurs, and
> > given CLONE_SIGHAND I imagine it could be different from the general
> > "process" model anyway.
> 
> Well, posix cpu timers can be installed per process (signal struct) or
> per thread (task struct).
> 
> But we can certainly simplify that with a per process flag and expand
> the thread dependency to the process scope.
> 
> Still there is the issue of telling the CPUs where a process runs when
> a posix timer is installed there. There is no process-like tsk->cpus_allowed.
> Either we send an IPI everywhere like we do now or we iterate through all
> threads in the process to OR all their cpumasks in order to send that IPI.
> 
> > 
> > In any case it seems like we don't need to do work at context switch.
> > Updates to the task's tick_dependency are just done as normal in the
> > task context via "current->signal->". When we are returning to user
> > space and we want to check the tick, again, we can just read via
> > "current->signal->".  Why would we need to copy the value around at
> > task switch time?  That's only necessary if you want to do something
> > like read/write the task tick_dependency via the cpu index, I would think.
> 
> Yeah you're right, at least the context switch should be fine.
> 
> Thanks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-07-30 14:31           ` Luiz Capitulino
@ 2015-07-30 14:46             ` Frederic Weisbecker
  0 siblings, 0 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-30 14:46 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Chris Metcalf, LKML, Peter Zijlstra, Thomas Gleixner,
	Preeti U Murthy, Christoph Lameter, Ingo Molnar, Viresh Kumar,
	Rik van Riel

On Thu, Jul 30, 2015 at 10:31:47AM -0400, Luiz Capitulino wrote:
> On Thu, 30 Jul 2015 02:44:45 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > On Wed, Jul 29, 2015 at 01:24:16PM -0400, Chris Metcalf wrote:
> > > I really worry about this!  The vision EZchip offers our customers is
> > > that they can run whatever they want on the slow path housekeeping
> > > cores, i.e. random control-plane code.  Then, on the fast-path cores,
> > > they run their nohz_full stuff without interruption.  Often they don't
> > > even know what the hell is running on their control plane cores - SNMP
> > > or random third-party crap or god knows what.  And there is a decent
> > > likelihood that some posix cpu timer code might sneak in.
> 
> I share this thinking. We do exactly the same thing for KVM-RT and I
> wouldn't be surprised at all if a posix timer pops up on the
> housekeeping CPUs.

Ok you guys convinced me, I'll reiterate with a per-process mask.

> 
> > I see. But note that installing a posix cpu timer ends up triggering an
> > IPI to all nohz full CPUs. That's how nohz full has always behaved.
> > So users running posix timers on nohz should already suffer issues anyway.
> 
> I haven't checked how this would affect us, but it seems a lot less serious
> than not having nohz at all.

Indeed. It's the difference between just receiving one IPI on every CPU and
taking a tick interrupt every 1/HZ seconds.

I'll just keep that global IPI to tell all CPUs that they may run a task
with a posix cpu timer queued. It's necessary for the CPUs to restart the
tick if needed. It's the current behavior and it makes things very simple.
Besides, nobody has complained about it yet.

I can fix it if people request it but this will be in another patchset because
it's a complicated issue on its own.

To summarize: this patchset won't change the current upstream behaviour.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-07-30  0:44         ` Frederic Weisbecker
  2015-07-30 14:31           ` Luiz Capitulino
@ 2015-07-30 19:35           ` Chris Metcalf
  2015-07-30 19:45             ` Frederic Weisbecker
  1 sibling, 1 reply; 59+ messages in thread
From: Chris Metcalf @ 2015-07-30 19:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On 07/29/2015 08:44 PM, Frederic Weisbecker wrote:
> On Wed, Jul 29, 2015 at 01:24:16PM -0400, Chris Metcalf wrote:
>> On 07/29/2015 09:23 AM, Frederic Weisbecker wrote:
>>>> At a higher level, is the posix-cpu-timers code here really providing the
>>>>> right semantics? It seems like before, the code was checking a struct
>>>>> task-specific state, and now you are setting a global state such that if ANY
>>>>> task anywhere in the system (even on housekeeping cores) has a pending posix
>>>>> cpu timer, then nothing can go into nohz_full mode.
>>>>>
>>>>> Perhaps what is needed is a task_struct->tick_dependency to go along with
>>>>> the system-wide and per-cpu flag words?
>>> That's an excellent point! Indeed the tick dependency check on posix-cpu-timers
>>> was made on task granularity before and now it's a global dependency.
>>>
>>> Which means that if any task in the system has a posix-cpu-timer enqueued, it
>>> prevents all CPUs from shutting down the tick. I need to mention that in the
>>> changelog.
>>>
>>> Now here is the rationale: I expect that nohz full users are not interested in
>>> posix cpu timers at all. The only chance for one to run without breaking the
>>> isolation is on housekeeping CPUs. So perhaps there is a corner case somewhere
>>> but I assume there isn't until somebody reports an issue.
>>>
>>> Keeping a task level dependency check means that we need to update it on context
>>> switch. Plus it's not only about task but also process. So that means two
>>> states to update on context switch and to check from interrupts. I don't think
>>> it's worth the effort if there is no user at all.
>> I really worry about this!  The vision EZchip offers our customers is
>> that they can run whatever they want on the slow path housekeeping
>> cores, i.e. random control-plane code.  Then, on the fast-path cores,
>> they run their nohz_full stuff without interruption.  Often they don't
>> even know what the hell is running on their control plane cores - SNMP
>> or random third-party crap or god knows what.  And there is a decent
>> likelihood that some posix cpu timer code might sneak in.
> I see. But note that installing a posix cpu timer ends up triggering an
> IPI to all nohz full CPUs. That's how nohz full has always behaved.
> So users running posix timers on nohz should already suffer issues anyway.

True now, yes, I'm just looking ahead to doing better when we have
a chance to improve things.

>> You mentioned needing two fields, for task and for process, but in
>> fact let's just add the one field to the one thing that needs it and
>> not worry about additional possible future needs.  And note that it's
>> the task_struct->signal where we need to add the field for posix cpu
>> timers (the signal_struct) since that's where the sharing occurs, and
>> given CLONE_SIGHAND I imagine it could be different from the general
>> "process" model anyway.
> Well, posix cpu timers can be installed per process (signal struct) or
> per thread (task struct).
>
> But we can certainly simplify that with a per process flag and expand
> the thread dependency to the process scope.
>
> Still there is the issue of telling the CPUs where a process runs when
> a posix timer is installed there. There is no process-like tsk->cpus_allowed.
> Either we send an IPI everywhere like we do now or we iterate through all
> threads in the process to OR all their cpumasks in order to send that IPI.

Is there a reason the actual timer can't run on a housekeeping
core?  Then when it does wake_up_process() or whatever, the
specific target task will get an IPI to wake up at that point.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-07-30 19:35           ` Chris Metcalf
@ 2015-07-30 19:45             ` Frederic Weisbecker
  2015-07-30 19:52               ` Chris Metcalf
  0 siblings, 1 reply; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-30 19:45 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Thu, Jul 30, 2015 at 03:35:06PM -0400, Chris Metcalf wrote:
> On 07/29/2015 08:44 PM, Frederic Weisbecker wrote:
> >I see. But note that installing a posix cpu timer ends up triggering an
> >IPI to all nohz full CPUs. That's how nohz full has always behaved.
> >So users running posix timers on nohz should already suffer issues anyway.
> 
> True now, yes, I'm just looking ahead to doing better when we have
> a chance to improve things.

Sure, but we can fix that global IPI as well anyway.

> 
> >>You mentioned needing two fields, for task and for process, but in
> >>fact let's just add the one field to the one thing that needs it and
> >>not worry about additional possible future needs.  And note that it's
> >>the task_struct->signal where we need to add the field for posix cpu
> >>timers (the signal_struct) since that's where the sharing occurs, and
> >>given CLONE_SIGHAND I imagine it could be different from the general
> >>"process" model anyway.
> >Well, posix cpu timers can be installed per process (signal struct) or
> >per thread (task struct).
> >
> >But we can certainly simplify that with a per process flag and expand
> >the thread dependency to the process scope.
> >
> >Still there is the issue of telling the CPUs where a process runs when
> >a posix timer is installed there. There is no process-like tsk->cpus_allowed.
> >Either we send an IPI everywhere like we do now or we iterate through all
> >threads in the process to OR all their cpumasks in order to send that IPI.
> 
> Is there a reason the actual timer can't run on a housekeeping
> core?  Then when it does wake_up_process() or whatever, the
> specific target task will get an IPI to wake up at that point.

It makes sense if people run posix cpu timers on nohz full CPUs. But nobody
has reported such a use case yet.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-07-30 19:45             ` Frederic Weisbecker
@ 2015-07-30 19:52               ` Chris Metcalf
  2015-07-31 14:49                 ` Frederic Weisbecker
  0 siblings, 1 reply; 59+ messages in thread
From: Chris Metcalf @ 2015-07-30 19:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On 07/30/2015 03:45 PM, Frederic Weisbecker wrote:
>
>>>> You mentioned needing two fields, for task and for process, but in
>>>> fact let's just add the one field to the one thing that needs it and
>>>> not worry about additional possible future needs.  And note that it's
>>>> the task_struct->signal where we need to add the field for posix cpu
>>>> timers (the signal_struct) since that's where the sharing occurs, and
>>>> given CLONE_SIGHAND I imagine it could be different from the general
>>>> "process" model anyway.
>>> Well, posix cpu timers can be installed per process (signal struct) or
>>> per thread (task struct).
>>>
>>> But we can certainly simplify that with a per process flag and expand
>>> the thread dependency to the process scope.
>>>
>>> Still there is the issue of telling the CPUs where a process runs when
>>> a posix timer is installed there. There is no process-like tsk->cpus_allowed.
>>> Either we send an IPI everywhere like we do now or we iterate through all
>>> threads in the process to OR all their cpumasks in order to send that IPI.
>> Is there a reason the actual timer can't run on a housekeeping
>> core?  Then when it does wake_up_process() or whatever, the
>> specific target task will get an IPI to wake up at that point.
> It makes sense if people run posix cpu timers on nohz full CPUs. But nobody
> has reported such a use case yet.

The corner case I was trying to address with my comment above
is when a process includes both housekeeping and nohz_full threads.
This is generally a bad idea in my experience, but our customers
do this sometimes (usually because they're porting a big pile of
code from somewhere else), and if so it would be good if we didn't
have to keep every thread in that task ticking; presumably it is
enough to ensure the timer lands on a housekeeping core instead,
possibly the one for the non-fast-path thread in question, and then
the regular IPIs from wake_up_process() will be sufficient if for
some lame reason the signal ends up handled on a nohz_full core.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-07-30 19:52               ` Chris Metcalf
@ 2015-07-31 14:49                 ` Frederic Weisbecker
  2015-08-03 15:59                   ` Chris Metcalf
  2015-08-03 17:12                   ` Peter Zijlstra
  0 siblings, 2 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-07-31 14:49 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Thu, Jul 30, 2015 at 03:52:54PM -0400, Chris Metcalf wrote:
> On 07/30/2015 03:45 PM, Frederic Weisbecker wrote:
> >
> >>>>You mentioned needing two fields, for task and for process, but in
> >>>>fact let's just add the one field to the one thing that needs it and
> >>>>not worry about additional possible future needs.  And note that it's
> >>>>the task_struct->signal where we need to add the field for posix cpu
> >>>>timers (the signal_struct) since that's where the sharing occurs, and
> >>>>given CLONE_SIGHAND I imagine it could be different from the general
> >>>>"process" model anyway.
> >>>Well, posix cpu timers can be installed per process (signal struct) or
> >>>per thread (task struct).
> >>>
> >>>But we can certainly simplify that with a per process flag and expand
> >>>the thread dependency to the process scope.
> >>>
> >>>Still there is the issue of telling the CPUs where a process runs when
> >>>a posix timer is installed there. There is no process-like tsk->cpus_allowed.
> >>>Either we send an IPI everywhere like we do now or we iterate through all
> >>>threads in the process to OR all their cpumasks in order to send that IPI.
> >>Is there a reason the actual timer can't run on a housekeeping
> >>core?  Then when it does wake_up_process() or whatever, the
> >>specific target task will get an IPI to wake up at that point.
> >It makes sense if people run posix cpu timers on nohz full CPUs. But nobody
> >has reported such a use case yet.
> 
> The corner case I was trying to address with my comment above
> is when a process includes both housekeeping and nohz_full threads.
> This is generally a bad idea in my experience, but our customers
> do this sometimes (usually because they're porting a big pile of
> code from somewhere else), and if so it would be good if we didn't
> have to keep every thread in that task ticking; presumably it is
> enough to ensure the timer lands on a housekeeping core instead,
> possibly the one for the non-fast-path thread in question, and then
> the regular IPIs from wake_up_process() will be sufficient if for
> some lame reason the signal ends up handled on a nohz_full core.

Instead of doing a per-signal dependency, I'm going to use a per-task one,
which means that if a per-process timer is enqueued, every thread of that
process will carry the tick dependency. But if the timer is enqueued on a
single thread, only that thread is concerned.

We'll see if offloading really becomes needed. It's not quite free, because
the housekeepers would have to poll all nohz CPUs at HZ frequency.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 04/10] nohz: Remove useless argument on tick_nohz_task_switch()
  2015-07-23 16:42 ` [PATCH 04/10] nohz: Remove useless argument on tick_nohz_task_switch() Frederic Weisbecker
@ 2015-08-03 12:39   ` Peter Zijlstra
  2015-08-03 12:49     ` Frederic Weisbecker
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-03 12:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Thu, Jul 23, 2015 at 06:42:09PM +0200, Frederic Weisbecker wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 78b4bad10..4d34035 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2489,7 +2489,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  		put_task_struct(prev);
>  	}
>  
> -	tick_nohz_task_switch(current);
> +	tick_nohz_task_switch();
>  	return rq;
>  }

OK, so I just noticed this one. Please explain? WTF does it make sense
to have per task tick state?

If we have >1 tasks, we need the tick. If we have 1 task, per-task ==
per-cpu.

So what gives?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/10] nohz: New tick dependency mask
  2015-07-23 16:42 ` [PATCH 05/10] nohz: New tick dependency mask Frederic Weisbecker
  2015-07-24 16:55   ` Chris Metcalf
@ 2015-08-03 12:43   ` Peter Zijlstra
  2015-08-03 13:05     ` Frederic Weisbecker
  2015-08-03 12:57   ` Peter Zijlstra
  2 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-03 12:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Thu, Jul 23, 2015 at 06:42:10PM +0200, Frederic Weisbecker wrote:
> +unsigned long __tick_nohz_set_tick_dependency(enum tick_dependency_bit bit,
> +					      unsigned long *dep)
> +{
> +	unsigned long prev;
> +	unsigned long old = *dep;
> +	unsigned long mask = BIT_MASK(bit);
> +
> +	while ((prev = cmpxchg(dep, old, old | mask)) != old) {
> +		old = prev;
> +		cpu_relax();
> +	}
> +
> +	return prev;
> +}

That's typically called a fetch_or(). The function name, which is
entirely too long, also doesn't suggest a return value.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/10] nohz: New tick dependency mask
  2015-07-24 17:43       ` Chris Metcalf
@ 2015-08-03 12:48         ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-03 12:48 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Frederic Weisbecker, LKML, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Fri, Jul 24, 2015 at 01:43:38PM -0400, Chris Metcalf wrote:
> On 07/24/2015 01:16 PM, Frederic Weisbecker wrote:
> >On Fri, Jul 24, 2015 at 12:55:35PM -0400, Chris Metcalf wrote:
> >>On 07/23/2015 12:42 PM, Frederic Weisbecker wrote:
> >>>+unsigned long __tick_nohz_set_tick_dependency(enum tick_dependency_bit bit,
> >>>+					      unsigned long *dep)
> >>>+{
> >>>+	unsigned long prev;
> >>>+	unsigned long old = *dep;
> >>>+	unsigned long mask = BIT_MASK(bit);
> >>>+
> >>>+	while ((prev = cmpxchg(dep, old, old | mask)) != old) {
> >>>+		old = prev;
> >>>+		cpu_relax();
> >>>+	}
> >>>+
> >>>+	return prev;
> >>>+}
> >>Why not use set_bit() here?  It is suitably atomic.
> >Because I don't want to send an IPI if the CPU already had bits set in
> >the dependency.
> >
> >Ideally I need something like test_and_set_bit() but which returns the
> >whole previous value and not just the previous value of the bit.
> 
> Ah, of course.  Peter, maybe we need atomic_or_return() as part
> of your new atomic_or/_and/_xor series?  Certainly on tilegx, and
> likely other architectures, we can do better than Frederic's
> cmpxchg() loop.

No, atomic_or_return() would return the new value and is entirely
pointless for the logic ops since they're not reversible (with the
exception of xor).

What you'd need is atomic_fetch_or(), but we don't have any fetch_$op
primitives at all. Introducing them might make sense, but it'll have to
be a separate series.

Note that I have a fetch_or() macro in kernel/sched/core.c.
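
Roughly this shape, in macro form (paraphrasing, not a verbatim copy):

	#define fetch_or(ptr, val)					\
	({	typeof(*(ptr)) __old, __val = *(ptr);			\
		for (;;) {						\
			__old = cmpxchg((ptr), __val, __val | (val));	\
			if (__old == __val)				\
				break;					\
			__val = __old;					\
		}							\
		__old;	/* whole previous value */			\
	})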

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 04/10] nohz: Remove useless argument on tick_nohz_task_switch()
  2015-08-03 12:39   ` Peter Zijlstra
@ 2015-08-03 12:49     ` Frederic Weisbecker
  2015-08-03 13:04       ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Frederic Weisbecker @ 2015-08-03 12:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 02:39:59PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 23, 2015 at 06:42:09PM +0200, Frederic Weisbecker wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 78b4bad10..4d34035 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2489,7 +2489,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> >  		put_task_struct(prev);
> >  	}
> >  
> > -	tick_nohz_task_switch(current);
> > +	tick_nohz_task_switch();
> >  	return rq;
> >  }
> 
> OK, so I just noticed this one. Please explain? WTF does it make sense
> to have per task tick state?
> 
> If we have >1 tasks, we need the tick. If we have 1 task, per-task ==
> per-cpu.
> 
> So what gives?

tick_nohz_task_switch() is removed at the end of the patchset. There is no
more per-task dependency after this specific set.

Now we just had a talk with Chris about posix cpu timers, whose tick dependency
turned from per-task/per-process into a global dependency. When a posix timer is
queued somewhere, we keep the tick everywhere (still talking about this patchset,
not Linus' tree).

Since some nohz full users may still want to run posix timers on housekeepers,
I'll turn it into a per-task dependency again. But this context switch call will
still be removed, and the scheduler tick dependency stays per-cpu, not per-task.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/10] nohz: New tick dependency mask
  2015-07-23 16:42 ` [PATCH 05/10] nohz: New tick dependency mask Frederic Weisbecker
  2015-07-24 16:55   ` Chris Metcalf
  2015-08-03 12:43   ` Peter Zijlstra
@ 2015-08-03 12:57   ` Peter Zijlstra
  2015-08-03 13:09     ` Frederic Weisbecker
  2 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-03 12:57 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Thu, Jul 23, 2015 at 06:42:10PM +0200, Frederic Weisbecker wrote:
> +void tick_nohz_set_tick_dependency(enum tick_dependency_bit bit)
> +{
> +	unsigned long prev;
> +
> +	prev = __tick_nohz_set_tick_dependency(bit, &tick_dependency);
> +	if (!prev)
> +		tick_nohz_full_kick_all();
> +}

> +void tick_nohz_set_tick_dependency_cpu(enum tick_dependency_bit bit, int cpu)
> +{
> +	unsigned long prev;
> +	struct tick_sched *ts;
> +
> +	ts = per_cpu_ptr(&tick_cpu_sched, cpu);
> +
> +	prev = __tick_nohz_set_tick_dependency(bit, &ts->tick_dependency);
> +	if (!prev)
> +		tick_nohz_full_kick_cpu(cpu);
> +}

> +/*
> + * Local dependency must have its own flavour due to NMI-safe requirement
> + * on perf.
> + */

That doesn't make any sense:

  tick_nohz_set_tick_dependency_this_cpu();

(sheesh, you're nowhere near lazy enough, that's insane to type) is
almost identical to:

  tick_nohz_set_tick_dependency_cpu(.cpu = smp_processor_id());

The only difference is a _very_ slight reduction in cost for computing
the per-cpu offset.

> +void tick_nohz_set_tick_dependency_this_cpu(enum tick_dependency_bit bit)
> +{
> +	unsigned long prev;
> +	struct tick_sched *ts;
> +
> +	ts = this_cpu_ptr(&tick_cpu_sched);
> +
> +	prev = __tick_nohz_set_tick_dependency(bit, &ts->tick_dependency);
> +	if (!prev)
> +		tick_nohz_full_kick();
> +}


And on that naming: could we please shorten them? This is really
ridiculous; it has 'tick' in it twice.

What's wrong with:

	tick_nohz_set_dep()
	tick_nohz_set_dep_cpu()

And just kill the this_cpu() version.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 04/10] nohz: Remove useless argument on tick_nohz_task_switch()
  2015-08-03 12:49     ` Frederic Weisbecker
@ 2015-08-03 13:04       ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-03 13:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 02:49:54PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 03, 2015 at 02:39:59PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 23, 2015 at 06:42:09PM +0200, Frederic Weisbecker wrote:
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 78b4bad10..4d34035 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -2489,7 +2489,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> > >  		put_task_struct(prev);
> > >  	}
> > >  
> > > -	tick_nohz_task_switch(current);
> > > +	tick_nohz_task_switch();
> > >  	return rq;
> > >  }
> > 
> > OK, so I just noticed this one. Please explain? WTF does it make sense
> > to have per task tick state?
> > 
> > If we have >1 tasks, we need the tick. If we have 1 task, per-task ==
> > per-cpu.
> > 
> > So what gives?
> 
> tick_nohz_task_switch() is removed at the end of the patchset. There is no
> more per-task dependency after this specific set.

OK, I'll keep reading ;-)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/10] nohz: New tick dependency mask
  2015-08-03 12:43   ` Peter Zijlstra
@ 2015-08-03 13:05     ` Frederic Weisbecker
  2015-08-03 13:24       ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Frederic Weisbecker @ 2015-08-03 13:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 02:43:50PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 23, 2015 at 06:42:10PM +0200, Frederic Weisbecker wrote:
> > +unsigned long __tick_nohz_set_tick_dependency(enum tick_dependency_bit bit,
> > +					      unsigned long *dep)
> > +{
> > +	unsigned long prev;
> > +	unsigned long old = *dep;
> > +	unsigned long mask = BIT_MASK(bit);
> > +
> > +	while ((prev = cmpxchg(dep, old, old | mask)) != old) {
> > +		old = prev;
> > +		cpu_relax();
> > +	}
> > +
> > +	return prev;
> > +}
> 
> That's typically called a fetch_or(). The function name, which is
> entirely too long, also doesn't suggest a return value.

Nice, I can export fetch_or() from sched code to atomic_fetch_or() like
you just suggested and bring it to asm_generic if you like.
Although given that function name, I fear some people may miss it when they
need such functionality. I mean, people are interested in a test_and_set_bit()
that returns the whole value; a test_and_or() would better echo the test_and_set()
naming. Then again, "test" rather suggests we are dealing with a mask test.
So perhaps fetch_and_or()? Of course having "and" then "or" in the same function name
might be confusing, but after all we have function names starting with "test" then "and".
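
Something like this, I guess (an untested sketch, just generalizing the
cmpxchg() loop above to any integer type; the name and its final home are
still to be agreed on):

#define fetch_or(ptr, mask)						\
({									\
	typeof(ptr) _ptr = (ptr);					\
	typeof(mask) _mask = (mask);					\
	typeof(*_ptr) _old, _val = *_ptr;				\
									\
	for (;;) {							\
		_old = cmpxchg(_ptr, _val, _val | _mask);		\
		if (_old == _val)					\
			break;						\
		_val = _old;						\
	}								\
	_old;	/* the value observed before the OR */			\
})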

Concerning the nohz function name, tick_nohz is a long prefix that is very likely
to produce too long names. Perhaps I should trim tick_nohz to just nohz.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/10] nohz: New tick dependency mask
  2015-08-03 12:57   ` Peter Zijlstra
@ 2015-08-03 13:09     ` Frederic Weisbecker
  2015-08-03 13:29       ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Frederic Weisbecker @ 2015-08-03 13:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 02:57:17PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 23, 2015 at 06:42:10PM +0200, Frederic Weisbecker wrote:
> > +void tick_nohz_set_tick_dependency(enum tick_dependency_bit bit)
> > +{
> > +	unsigned long prev;
> > +
> > +	prev = __tick_nohz_set_tick_dependency(bit, &tick_dependency);
> > +	if (!prev)
> > +		tick_nohz_full_kick_all();
> > +}
> 
> > +void tick_nohz_set_tick_dependency_cpu(enum tick_dependency_bit bit, int cpu)
> > +{
> > +	unsigned long prev;
> > +	struct tick_sched *ts;
> > +
> > +	ts = per_cpu_ptr(&tick_cpu_sched, cpu);
> > +
> > +	prev = __tick_nohz_set_tick_dependency(bit, &ts->tick_dependency);
> > +	if (!prev)
> > +		tick_nohz_full_kick_cpu(cpu);
> > +}
> 
> > +/*
> > + * Local dependency must have its own flavour due to NMI-safe requirement
> > + * on perf.
> > + */
> 
> That doesn't make any sense:
> 
>   tick_nohz_set_tick_dependency_this_cpu();
> 
> (sheesh, you're nowhere near lazy enough, that's insane to type) is
> almost identical to:
> 
>   tick_nohz_set_tick_dependency_cpu(.cpu = smp_processor_id());
> 
> The only difference is a _very_ slight reduction in cost for computing
> the per-cpu offset.

But the local one must be NMI-safe. Now I can do:

    if (cpu == smp_processor_id())
        tick_nohz_full_kick() // NMI-safe
    else
        tick_nohz_full_kick_cpu(cpu); // not NMI-safe.

> 
> > +void tick_nohz_set_tick_dependency_this_cpu(enum tick_dependency_bit bit)
> > +{
> > +	unsigned long prev;
> > +	struct tick_sched *ts;
> > +
> > +	ts = this_cpu_ptr(&tick_cpu_sched);
> > +
> > +	prev = __tick_nohz_set_tick_dependency(bit, &ts->tick_dependency);
> > +	if (!prev)
> > +		tick_nohz_full_kick();
> > +}
> 
> 
> And on that naming: could we please shorten them? This is really
> ridiculous, it has 'tick' in it twice.
> 
> What's wrong with:
> 
> 	tick_nohz_set_dep()
> 	tick_nohz_set_dep_cpu()

Right.

Thanks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/10] nohz: New tick dependency mask
  2015-08-03 13:05     ` Frederic Weisbecker
@ 2015-08-03 13:24       ` Peter Zijlstra
  2015-08-03 13:49         ` Frederic Weisbecker
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-03 13:24 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 03:05:37PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 03, 2015 at 02:43:50PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 23, 2015 at 06:42:10PM +0200, Frederic Weisbecker wrote:
> > > +unsigned long __tick_nohz_set_tick_dependency(enum tick_dependency_bit bit,
> > > +					      unsigned long *dep)
> > > +{
> > > +	unsigned long prev;
> > > +	unsigned long old = *dep;
> > > +	unsigned long mask = BIT_MASK(bit);
> > > +
> > > +	while ((prev = cmpxchg(dep, old, old | mask)) != old) {
> > > +		old = prev;
> > > +		cpu_relax();
> > > +	}
> > > +
> > > +	return prev;
> > > +}
> > 
> > That's typically called a fetch_or(). The function name, which is
> > entirely too long, also doesn't suggest a return value.
> 
> Nice, I can export fetch_or() from sched code to atomic_fetch_or() like
> you just suggested and bring it to asm_generic if you like.

asm_generic means you _have_ to go implement it for all archs that do
not use asm_generic.

The version I have is also unsuited for the atomic name, as it deals
with all integer types that cmpxchg() can consume (and I need it that
way, because thread_info::flags does not have a consistent type across
archs; unsigned long vs unsigned int is the most common variation).

You can go add it to linux/atomic.h I suppose, with the regular #ifdef
guards we have there. Then add:

  atomic{,long,64}_fetch_{add,sub,inc,dec,or,and,xor}()

I suppose, using cmpxchg() loops and then let archs implement as we go.
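
E.g. (sketch only; the usual #ifndef guard so archs can override it, and I
haven't thought hard about naming or argument order):

#ifndef atomic_fetch_or
static inline int atomic_fetch_or(int mask, atomic_t *v)
{
	int old, val = atomic_read(v);

	for (;;) {
		old = atomic_cmpxchg(v, val, val | mask);
		if (old == val)
			break;
		val = old;
	}

	return old;
}
#endif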

> So perhaps fetch_and_or()? Of course having "and" then "or" in the same function name
> might be confusing, but after all we have function names starting with "test" then "and".

No, fetch_$op is a well-known name for these things; don't go change it.
That just means people will _not_ find it.

> Concerning the nohz function name, tick_nohz is a long prefix that is very likely
> to produce too long names. Perhaps I should trim tick_nohz to just nohz.

No, then we'll get confused with the regular nohz stuff. But yes, a 10
letter prefix sucks; at least you can try not to then add another 30 on
top of it.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/10] nohz: New tick dependency mask
  2015-08-03 13:09     ` Frederic Weisbecker
@ 2015-08-03 13:29       ` Peter Zijlstra
  2015-08-03 13:55         ` Frederic Weisbecker
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-03 13:29 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 03:09:39PM +0200, Frederic Weisbecker wrote:

> > That doesn't make any sense:
> > 
> >   tick_nohz_set_tick_dependency_this_cpu();
> > 
> > (sheesh, you're nowhere near lazy enough, that's insane to type) is
> > almost identical to:
> > 
> >   tick_nohz_set_tick_dependency_cpu(.cpu = smp_processor_id());
> > 
> > The only difference is a _very_ slight reduction in cost for computing
> > the per-cpu offset.
> 
> But the local one must be NMI-safe. Now I can do:
> 
>     if (cpu == smp_processor_id())
>         tick_nohz_full_kick() // NMI-safe
>     else
>         tick_nohz_full_kick_cpu(cpu); // not NMI-safe.

Urgh, I missed that. But yes, I suppose that's OK, seeing how we end up
with a smaller interface.

I was going to say that with a bit of luck GCC could optimize it, but
it's not inline, so no, it cannot.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/10] nohz: New tick dependency mask
  2015-08-03 13:24       ` Peter Zijlstra
@ 2015-08-03 13:49         ` Frederic Weisbecker
  0 siblings, 0 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-08-03 13:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 03:24:07PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 03, 2015 at 03:05:37PM +0200, Frederic Weisbecker wrote:
> > On Mon, Aug 03, 2015 at 02:43:50PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 23, 2015 at 06:42:10PM +0200, Frederic Weisbecker wrote:
> > > > +unsigned long __tick_nohz_set_tick_dependency(enum tick_dependency_bit bit,
> > > > +					      unsigned long *dep)
> > > > +{
> > > > +	unsigned long prev;
> > > > +	unsigned long old = *dep;
> > > > +	unsigned long mask = BIT_MASK(bit);
> > > > +
> > > > +	while ((prev = cmpxchg(dep, old, old | mask)) != old) {
> > > > +		old = prev;
> > > > +		cpu_relax();
> > > > +	}
> > > > +
> > > > +	return prev;
> > > > +}
> > > 
> > > That's typically called a fetch_or(). The function name, which is
> > > entirely too long, also doesn't suggest a return value.
> > 
> > Nice, I can export fetch_or() from sched code to atomic_fetch_or() like
> > you just suggested and bring it to asm_generic if you like.
> 
> asm_generic means you _have_ to go implement it for all archs that do
> not use asm_generic.
> 
> The version I have is also unsuited for the atomic name, as it deals
> with all integer types that cmpxchg() can consume (and I need it that
> way, because thread_info::flags does not have a consistent type across
> archs (unsigned long vs unsigned int is the most common variation).

Indeed, and I need clear_bit() to be the symmetric call, and that doesn't
seem to operate on atomic_t.

> 
> You can go add it to linux/atomic.h I suppose, with the regular #ifdef
> guards we have there. Then add:
> 
>   atomic{,long,64}_fetch_{add,sub,inc,dec,or,and,xor}()
> 
> I suppose, using cmpxchg() loops and then let archs implement as we go.

Right.

> 
> > So perhaps fetch_and_or()? Of course having "and" then "or" in the same function name
> > might be confusing, but after all we have function names starting with "test" then "and".
> 
> No fetch_$op is a well known name for these things, don't go change it.
> That just means people will _not_ find it.

Ok.

> 
> > Concerning the nohz function name, tick_nohz is a long prefix that is very likely
> > to produce too long names. Perhaps I should trim tick_nohz to just nohz.
> 
> No, then we'll get confused with the regular nohz stuff. But yes, a 10
> letter prefix sucks; at least you can try not to then add another 30 on
> top of it.

Well, the regular nohz stuff (that is, idle nohz, right?) and nohz full have the same prefix.
We just have tick_nohz_full_* for very specific things, but that's still compatible with
a common reduced prefix.

But let's keep it that way for now; if a rename is needed, we'll do it beyond the scope
of this set. I'll just take your tick_nohz_set_dep() suggestion.

Thanks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/10] nohz: New tick dependency mask
  2015-08-03 13:29       ` Peter Zijlstra
@ 2015-08-03 13:55         ` Frederic Weisbecker
  2015-08-03 14:11           ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Frederic Weisbecker @ 2015-08-03 13:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 03:29:58PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 03, 2015 at 03:09:39PM +0200, Frederic Weisbecker wrote:
> 
> > > That doesn't make any sense:
> > > 
> > >   tick_nohz_set_tick_dependency_this_cpu();
> > > 
> > > (sheesh, you're nowhere near lazy enough, that's insane to type) is
> > > almost identical to:
> > > 
> > >   tick_nohz_set_tick_dependency_cpu(.cpu = smp_processor_id());
> > > 
> > > The only difference is a _very_ slight reduction in cost for computing
> > > the per-cpu offset.
> > 
> > But the local one must be NMI-safe. Now I can do:
> > 
> >     if (cpu == smp_processor_id())
> >         tick_nohz_full_kick() // NMI-safe
> >     else
> >         tick_nohz_full_kick_cpu(cpu); // not NMI-safe.
> 
> Urgh, I missed that. But yes, I suppose that's OK, seeing how we end up
> with a smaller interface.
> 
> I was going to say that with a bit of luck GCC could optimize it, but
> it's not inline, so no, it cannot.

I might inline all these set_dep() things to introduce static keys on these
APIs. But the kicks themselves will remain real calls.

OK, how about tick_nohz_set_dep_nmi(), so that we know exactly what the purpose
is here? Still a long function name, but it's clear.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-07-23 16:42 ` [PATCH 07/10] sched: Migrate sched " Frederic Weisbecker
  2015-07-23 16:55   ` Frederic Weisbecker
  2015-07-24 16:56   ` Chris Metcalf
@ 2015-08-03 14:00   ` Peter Zijlstra
  2015-08-03 14:50     ` Frederic Weisbecker
  2 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-03 14:00 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Thu, Jul 23, 2015 at 06:42:12PM +0200, Frederic Weisbecker wrote:
> Instead of providing asynchronous checks for the nohz subsystem to verify
> sched tick dependency, migrate sched to the new mask.
> 
> The easiest is to recycle the current asynchronous tick dependency check
> which verifies the class of the current task and its requirements for
> periodic preemption checks.
> 
> We need to evaluate this tick dependency in three places:
> 
> 1) Task enqueue: One or more tasks have been enqueued, we must check
>    if those are competing with the current task.
> 
> 2) Task dequeue: A possibly competing task has been dequeued, clear the
>    tick dependency if needed.
> 
> 3) schedule(): we might be switching to a task of another scheduler
>    class. Each class has its preemption rules, we must re-evaluate it.

This is insane.. You add a whole bunch of work per wakeup/sleep/context
switch to avoid some work at tick time. That's a broken trade-off.

We can context switch _waaaay_ more than we have ticks.

Furthermore, you do tons of pointless work: we call add_nr_running()
from the individual classes, and then your routine goes and checks what
class we're in, etc.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/10] nohz: New tick dependency mask
  2015-08-03 13:55         ` Frederic Weisbecker
@ 2015-08-03 14:11           ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-03 14:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 03:55:34PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 03, 2015 at 03:29:58PM +0200, Peter Zijlstra wrote:
> > On Mon, Aug 03, 2015 at 03:09:39PM +0200, Frederic Weisbecker wrote:
> > 
> > > > That doesn't make any sense:
> > > > 
> > > >   tick_nohz_set_tick_dependency_this_cpu();
> > > > 
> > > > (sheesh, you're nowhere near lazy enough, that's insane to type) is
> > > > almost identical to:
> > > > 
> > > >   tick_nohz_set_tick_dependency_cpu(.cpu = smp_processor_id());
> > > > 
> > > > The only difference is a _very_ slight reduction in cost for computing
> > > > the per-cpu offset.
> > > 
> > > But the local one must be NMI-safe. Now I can do:
> > > 
> > >     if (cpu == smp_processor_id())
> > >         tick_nohz_full_kick() // NMI-safe
> > >     else
> > >         tick_nohz_full_kick_cpu(cpu); // not NMI-safe.
> > 
> > Urgh, I missed that. But yes, I suppose that's OK, seeing how we end up
> > with a smaller interface.
> > 
> > I was going to say that with a bit of luck GCC could optimize it, but
> > it's not inline, so no, it cannot.
> 
> I might inline all these set_dep() things to introduce static keys on these
> APIs.. But the kick itself will remain real calls.

Sure, but first check if GCC will optimize:

static inline void foo(int cpu)
{
	if (cpu == smp_processor_id())
		bar1();
	else
		bar2();
}


	foo(smp_processor_id());

Into a direct call to bar1(), if not see if we can make it so. If not,
there's no point in inlining at all.

> Ok how about tick_nohz_set_dep_nmi() so that we know exactly what's the purpose
> here. Still a long function name but it's clear.

Only for the set, if you really care about it. The alternative is
WARN_ON(in_nmi() && cpu != smp_processor_id()) or somesuch.
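
IOW, something like this (a rough sketch only, using the shorter names and
assuming the fetch_or() from the other subthread):

void tick_nohz_set_dep_cpu(enum tick_dependency_bit bit, int cpu)
{
	struct tick_sched *ts = per_cpu_ptr(&tick_cpu_sched, cpu);

	/* only kick when the first dependency bit gets set */
	if (fetch_or(&ts->tick_dependency, BIT_MASK(bit)))
		return;

	if (cpu == smp_processor_id()) {
		tick_nohz_full_kick();		/* NMI-safe */
	} else {
		/* the IPI based kick is not NMI-safe */
		WARN_ON_ONCE(in_nmi());
		tick_nohz_full_kick_cpu(cpu);
	}
}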

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-08-03 14:00   ` Peter Zijlstra
@ 2015-08-03 14:50     ` Frederic Weisbecker
  2015-08-03 17:09       ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Frederic Weisbecker @ 2015-08-03 14:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 04:00:46PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 23, 2015 at 06:42:12PM +0200, Frederic Weisbecker wrote:
> > Instead of providing asynchronous checks for the nohz subsystem to verify
> > sched tick dependency, migrate sched to the new mask.
> > 
> > The easiest is to recycle the current asynchronous tick dependency check
> > which verifies the class of the current task and its requirements for
> > periodic preemption checks.
> > 
> > We need to evaluate this tick dependency in three places:
> > 
> > 1) Task enqueue: One or more tasks have been enqueued, we must check
> >    if those are competing with the current task.
> > 
> > 2) Task dequeue: A possibly competing task has been dequeued, clear the
> >    tick dependency if needed.
> > 
> > 3) schedule(): we might be switching to a task of another scheduler
> >    class. Each class has its preemption rules, we must re-evaluate it.
> 
> This is insane.. You add a whole bunch of work per wakeup/sleep/context
> switch to avoid some work at tick time. That's a broken trade-off.
> 
> We can context switch _waaaay_ more than we have ticks.
> 
> Furthermore, you do tons of pointless work, we call add_nr_running()
> from the individual classes, and then your routine goes and checks what
> class we're in etc..

I think I could remove the context switch part. But then I need to find a
way to perform these checks at task enqueue and dequeue time:

  sched_update_dependency(cpu)
  {
      if (SCHED_FIFO task on the cpu runqueue) {
          tick_nohz_clear_dep(cpu)
	  return;
      }

      if (SCHED_RR task on the cpu runqueue) {
          if (more than one such task) {
              tick_nohz_set_dep(cpu)
	      return;
	  }
      }

      if (SCHED_NORMAL task on the cpu runqueue) {
          if (more than one such task) {
	      tick_nohz_set_dep(cpu)
	      return;
	  }
      }

      tick_nohz_clear_dep();
   }

That's still heavyweight because enqueue and dequeue can be very frequent
but we get rid of the sched_switch hook because we don't care about the
current task at all.

Now, consider that we could cut all these checks into parts and optimize
them per sched_class::enqueue/dequeue.

So we can divide the dependency into:

         struct rq {
	     ...
	     int nr_fifo;
	     int nr_rr;
	     int nr_normal;
	}


	int rq_update_tick_dep(struct rq *rq)
	{
	     if (rq->nr_fifo && (rq->nr_rr > 1 || rq->nr_normal > 1))
	         tick_nohz_set_dep(SCHED_TICK_DEP);
             else
	         tick_nohz_set_dep(SCHED_TICK_DEP)
        }

Then we add or decrement the relevant counter fields from the various sched_class::enqueue/dequeue.
I think I saw that some of these counters already exist, but perhaps not all of them. There are
per-class runqueues, but rt_nr_running counts tasks without distinction of policies.
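
For example (a sketch; rq_update_tick_dep() is the helper above, and these
counter helpers are made up for illustration, to be called from the class
enqueue/dequeue paths):

static inline void rq_inc_policy(struct rq *rq, struct task_struct *p)
{
	if (p->policy == SCHED_FIFO)
		rq->nr_fifo++;
	else if (p->policy == SCHED_RR)
		rq->nr_rr++;
	else
		rq->nr_normal++;
	rq_update_tick_dep(rq);
}

static inline void rq_dec_policy(struct rq *rq, struct task_struct *p)
{
	if (p->policy == SCHED_FIFO)
		rq->nr_fifo--;
	else if (p->policy == SCHED_RR)
		rq->nr_rr--;
	else
		rq->nr_normal--;
	rq_update_tick_dep(rq);
}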

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-07-31 14:49                 ` Frederic Weisbecker
@ 2015-08-03 15:59                   ` Chris Metcalf
  2015-08-03 18:01                     ` Frederic Weisbecker
  2015-08-03 17:12                   ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Chris Metcalf @ 2015-08-03 15:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On 07/31/2015 10:49 AM, Frederic Weisbecker wrote:
> Instead of doing a per signal dependency, I'm going to use a per task
> one. Which means that if a per-process timer is enqueued, every thread
> of that process will have the tick dependency. But if the timer is
> enqueued to a single thread, only the thread is concerned.
>
> We'll see if offloading becomes really needed. It's not quite free because
> the housekeepers will have to poll on all nohz CPUs at a Hz frequency.

Seems reasonable for now!

Why would we need the Hz frequency polling, though?  I would
think it should be possible to just arrange it such that the timer
for posix cpu timers would just always be placed either on the core
that requested it, or if that core is nohz_full, on a housekeeping
core.  Then it would eventually fire from the housekeeping core,
and the logic could be such that (for a process-wide timer) it
would preferentially interrupt threads from that process that
were running on the housekeeping cores.  No polling.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-08-03 14:50     ` Frederic Weisbecker
@ 2015-08-03 17:09       ` Peter Zijlstra
  2015-08-03 17:30         ` Frederic Weisbecker
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-03 17:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 04:50:33PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 03, 2015 at 04:00:46PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 23, 2015 at 06:42:12PM +0200, Frederic Weisbecker wrote:
> > > Instead of providing asynchronous checks for the nohz subsystem to verify
> > > sched tick dependency, migrate sched to the new mask.
> > > 
> > > The easiest is to recycle the current asynchronous tick dependency check
> > > which verifies the class of the current task and its requirements for
> > > periodic preemption checks.
> > > 
> > > We need to evaluate this tick dependency in three places:
> > > 
> > > 1) Task enqueue: One or more tasks have been enqueued, we must check
> > >    if those are competing with the current task.
> > > 
> > > 2) Task dequeue: A possibly competing task has been dequeued, clear the
> > >    tick dependency if needed.
> > > 
> > > 3) schedule(): we might be switching to a task of another scheduler
> > >    class. Each class has its preemption rules, we must re-evaluate it.
> > 
> > This is insane.. You add a whole bunch of work per wakeup/sleep/context
> > switch to avoid some work at tick time. That's a broken trade-off.
> > 
> > We can context switch _waaaay_ more than we have ticks.
> > 
> > Furthermore, you do tons of pointless work: we call add_nr_running()
> > from the individual classes, and then your routine goes and checks what
> > class we're in, etc.
> 
> I think I could remove the context switch part. But then I need to find a
> way to perform these checks at task enqueue and dequeue time:

Uhm, but you already do!?

>   sched_update_dependency(cpu)
>   {
>       if (SCHED_FIFO task on the cpu runqueue) {
>           tick_nohz_clear_dep(cpu)
> 	  return;
>       }
> 
>       if (SCHED_RR task on the cpu runqueue) {
>           if (more than one such task) {
>               tick_nohz_set_dep(cpu)
> 	      return;
> 	  }
>       }
> 
>       if (SCHED_NORMAL task on the cpu runqueue) {
>           if (more than one such task) {
> 	      tick_nohz_set_dep(cpu)
> 	      return;
> 	  }
>       }
> 
>       tick_nohz_clear_dep();
>    }
> 
> That's still heavyweight because enqueue and dequeue can be very frequent
> but we get rid of the sched_switch hook because we don't care about the
> current task at all.
> 
> Now, consider that we could cut all these checks into parts and optimize
> them per sched_class::enqueue/dequeue.

You can, seeing how {add,sub}_nr_running() is called from the individual
classes.

> So we can divide the dependency into:
> 
>          struct rq {
> 	     ...
> 	     int nr_fifo;
> 	     int nr_rr;

Those are currently summed together in: rq->rt.rt_nr_total, I suppose we
can split RR out.

> 	     int nr_normal;

That's called: rq->cfs.h_nr_running

But you've forgotten about SCHED_DEADLINE, we count those in:
rq->dl.dl_nr_running.

> 	}
> 
> 
> 	int rq_update_tick_dep(struct rq *rq)
> 	{
> 	     if (rq->nr_fifo && (rq->nr_rr > 1 || rq->nr_normal > 1))
> 	         tick_nohz_set_dep(SCHED_TICK_DEP);
>              else
> 	         tick_nohz_set_dep(SCHED_TICK_DEP)
>         }
> 
> > Then we add or decrement the relevant counter fields from the various
> > sched_class::enqueue/dequeue.  I think I saw that some of these counters
> > already exist, but perhaps not all of them. There are per-class runqueues,
> > but rt_nr_running counts tasks without distinction of policies.

Right. At which point you'll end up with:

	if (rq->dl.dl_nr_running > 1 || rq->rt.rr_nr_total > 1 || rq->cfs.h_nr_running > 1)
		tick_nohz_set_dep(SCHED_TICK_DEP)
	else
		tick_nohz_clear_dep(SCHED_TICK_DEP)

But I fear that'll still be rather expensive in some cases. Imagine a
case where we frequently flip between 1-2 tasks on the queue for any one
of those classes: then we'll do a whole bunch of dep flips, each of which
is an atomic op.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-07-31 14:49                 ` Frederic Weisbecker
  2015-08-03 15:59                   ` Chris Metcalf
@ 2015-08-03 17:12                   ` Peter Zijlstra
  2015-08-03 17:39                     ` Frederic Weisbecker
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-03 17:12 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, LKML, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Fri, Jul 31, 2015 at 04:49:55PM +0200, Frederic Weisbecker wrote:
> Instead of doing a per signal dependency, I'm going to use a per task
> one.

Urgh, does this mean you'll keep the horrid tick_nohz_task_switch()
thing?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-08-03 17:09       ` Peter Zijlstra
@ 2015-08-03 17:30         ` Frederic Weisbecker
  2015-08-04  7:41           ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Frederic Weisbecker @ 2015-08-03 17:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 07:09:11PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 03, 2015 at 04:50:33PM +0200, Frederic Weisbecker wrote:
> > I think I could remove the context switch part. But then I need to find a
> > way to perform these checks on enqueue and dequeue task time:
> 
> Uhm, but you already do!?

Sure, but I would like to avoid adding a context switch step, although
dequeuing itself often happens on context switch, so we have no choice
but to check at that step.

> > So we can divide the dependency into:
> > 
> >          struct rq {
> > 	     ...
> > 	     int nr_fifo;
> > 	     int nr_rr;
> 
> Those are currently summed together in: rq->rt.rt_nr_total, I suppose we
> can split RR out.

Right

> 
> > 	     int nr_normal;
> 
> That's called: rq->cfs.h_nr_running

Ok.

> 
> But you've forgotten about SCHED_DEADLINE, we count those in:
> rq->dl.dl_nr_running.

Indeed. Hmm, there is no preemption between SCHED_DEADLINE tasks, right?
So I can treat it like SCHED_FIFO.

> > 	}
> > 
> > 
> > 	int rq_update_tick_dep(struct rq *rq)
> > 	{
> > 	     if (rq->nr_fifo && (rq->nr_rr > 1 || rq->nr_normal > 1))

Oops I meant:

    if (rq->nr_fifo || (rq->nr_rr < 2 && rq->nr_normal < 2))
       clear_dep()
    else
       set_dep()

> > 	         tick_nohz_set_dep(SCHED_TICK_DEP);
> >              else
> > 	         tick_nohz_set_dep(SCHED_TICK_DEP)
> >         }
> > 
> > Then we add or dec the relevant counter fields from the various
> > sched_class::enqueue/dequeue.  I think I saw some of these counters
> > already exist but perhaps not all of them. There are per class rqs but
> > rt_nr_running counts tasks without distinction of policies.
> 
> Right. At which point you'll end up with:
> 
> 	if (rq->dl.dl_nr_running > 1 || rq->rt.rr_nr_total > 1 || rq->cfs.h_nr_running > 1)
> 		tick_nohz_set_dep(SCHED_TICK_DEP)
> 	else
> 		tick_nohz_clear_dep(SCHED_TICK_DEP)

If there is no preemption between deadline tasks, and SCHED_DEADLINE is of higher
priority than SCHED_RR, that would rather be:

        if (rq->dl.dl_nr_running || rq->rt.fifo_nr_running || (rq->rt.rr_nr_running < 2 && rq->cfs.h_nr_running < 2))
	    clear_dep()
	else
	    set_dep()

> 
> But I fear that'll still be rather expensive in some cases. Imagine a
> case where we frequently flip between 1-2 tasks on the queue for any one
> of those classes: then we'll do a whole bunch of dep flips, each of which
> is an atomic op.

Indeed. Now doing such a thing on a nohz full CPU sounds insane.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-08-03 17:12                   ` Peter Zijlstra
@ 2015-08-03 17:39                     ` Frederic Weisbecker
  2015-08-03 19:07                       ` Peter Zijlstra
  2015-08-06 17:13                       ` Chris Metcalf
  0 siblings, 2 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-08-03 17:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, LKML, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 07:12:43PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 31, 2015 at 04:49:55PM +0200, Frederic Weisbecker wrote:
> > Instead of doing a per signal dependency, I'm going to use a per task
> > one.
> 
> Urgh, does this mean you'll keep the horrid tick_nohz_task_switch()
> thing?

I thought I would drop it, but now that I think about it more, I think I
need to keep it because if we enqueue a posix timer to a sleeping task
and that task later wakes up, we must restart the tick, and that is only
possible with a check on context switch :-(

This current patchset removed the need for that with a global dependency
for posix timers: as long as there is one enqueued, we keep the tick. But
Chris and Luiz fear that Tilera users have posix timers on housekeepers.
They also suggested we offload the posix timers. I fear it's going to incur
high overhead, as it means polling on the target task's context of execution.
Unless we move the task itself to the housekeepers...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-08-03 15:59                   ` Chris Metcalf
@ 2015-08-03 18:01                     ` Frederic Weisbecker
  0 siblings, 0 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-08-03 18:01 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 11:59:07AM -0400, Chris Metcalf wrote:
> On 07/31/2015 10:49 AM, Frederic Weisbecker wrote:
> >Instead of doing a per signal dependency, I'm going to use a per task
> >one. Which means that if a per-process timer is enqueued, every thread
> >of that process will have the tick dependency. But if the timer is
> >enqueued to a single thread, only the thread is concerned.
> >
> >We'll see if offloading becomes really needed. It's not quite free because
> >the housekeepers will have to poll on all nohz CPUs at a Hz frequency.
> 
> Seems reasonable for now!
> 
> Why would we need the Hz frequency polling, though?  I would
> think it should be possible to just arrange it such that the timer
> for posix cpu timers would just always be placed either on the core
> that requested it, or if that core is nohz_full, on a housekeeping
> core.  Then it would eventually fire from the housekeeping core,
> and the logic could be such that (for a process-wide timer) it
> would preferentially interrupt threads from that process that
> were running on the housekeeping cores.  No polling.

But you need to periodically poll on timer expiration from a housekeeper.
It's not only about firing the timer, it's about elapsing it against the
target's cputime.

Since there is no tick on a nohz full CPU to account the time spent by
the task, you must do that elsewhere. And if you don't poll at a sufficient
frequency, the time accounted is less precise (a quick round-trip to kernel space
can be missed if the polling frequency is too low). Or you can combine it
with VIRT_CPU_ACCOUNTING_GEN, which we are using currently and which records the
time spent in user and kernel space using hooks. Still, you must periodically check
that the timer hasn't expired, at a frequency no coarser than the remaining
expiration time. That's easy in the case of a timer attached to a single task, but what
about a timer attached to a process? You must poll at least at expiration/nr_threads,
since with nr_threads runnable threads the process-wide cputime can advance up to
nr_threads times faster than the wall clock; so you must handle thread creation as well.

Offloading posix timers sounds like a big headache if we don't poll at Hz frequency.

That said, Rik has posted patches that offload cputime accounting. I'm not yet sure
that patchset is a good idea, but offloading posix timers could be done on top of it.

Another thing: now I recall why I turned posix timers into a global tick dependency.
In the case of a per-task/process dependency, we still need the context switch hook, because
if we enqueue a timer to a sleeping task, the tick must be restarted when the task wakes
up. And that requires a check on context switch.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-08-03 17:39                     ` Frederic Weisbecker
@ 2015-08-03 19:07                       ` Peter Zijlstra
  2015-08-06 17:13                       ` Chris Metcalf
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-03 19:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, LKML, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 07:39:37PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 03, 2015 at 07:12:43PM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 31, 2015 at 04:49:55PM +0200, Frederic Weisbecker wrote:
> > > Instead of doing a per signal dependency, I'm going to use a per task
> > > one.
> > 
> > Urgh, does this mean you'll keep the horrid tick_nohz_task_switch()
> > thing?
> 
> I thought I would drop it, but now that I think about it more, I think I
> need to keep it because if we enqueue a posix timer to a sleeping task
> and that task later wakes up, we must restart the tick, and that is only
> possible with a check on context switch :-(
> 
> This current patchset removed the need for that with a global dependency
> for posix timers: as long as there is one enqueued, we keep the tick. But
> Chris and Luiz fear that Tilera users have posix timers on housekeepers.
> They also suggested we offload the posix timers. I fear it's going to incur
> high overhead, as it means polling on the target task's context of execution.
> Unless we move the task itself to the housekeepers...

At least do something like the below; that irq save/restore is expensive
and can be trivially avoided.

Also, tick_nohz_full_kick() checks that tick_nohz_full_cpu() thing
again.

---
 kernel/sched/core.c      | 2 +-
 kernel/time/tick-sched.c | 9 +--------
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d34035bb3ee..57d1af7c0660 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2471,6 +2471,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	vtime_task_switch(prev);
 	finish_arch_switch(prev);
 	perf_event_task_sched_in(prev, current);
+	tick_nohz_task_switch();
 	finish_lock_switch(rq, prev);
 	finish_arch_post_lock_switch();
 
@@ -2489,7 +2490,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		put_task_struct(prev);
 	}
 
-	tick_nohz_task_switch();
 	return rq;
 }
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3319e16f31e5..9f2225ef230c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -260,18 +260,11 @@ void tick_nohz_full_kick_all(void)
  */
 void __tick_nohz_task_switch(void)
 {
-	unsigned long flags;
-
-	local_irq_save(flags);
-
 	if (!tick_nohz_full_cpu(smp_processor_id()))
-		goto out;
+		return;
 
 	if (tick_nohz_tick_stopped() && !can_stop_full_tick())
 		tick_nohz_full_kick();
-
-out:
-	local_irq_restore(flags);
 }
 
 /* Parse the boot-time nohz CPU list from the kernel parameters. */

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-08-03 17:30         ` Frederic Weisbecker
@ 2015-08-04  7:41           ` Peter Zijlstra
  2015-08-10 14:02             ` Juri Lelli
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-04  7:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 03, 2015 at 07:30:32PM +0200, Frederic Weisbecker wrote:
> > But you've forgotten about SCHED_DEADLINE, we count those in:
> > rq->dl.dl_nr_running.
> 
> Indeed. Hmm, there is no preemption between SCHED_DEADLINE tasks, right?
> So I can treat it like SCHED_FIFO.

Sadly no. Even though EDF has static job priority (once a job is
activated its priority doesn't change anymore) DEADLINE also has a CBS
component and that needs the tick regardless, even with a single task.

So any deadline task running means we cannot stop the tick.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] posix-cpu-timers: Migrate to use new tick dependency mask model
  2015-08-03 17:39                     ` Frederic Weisbecker
  2015-08-03 19:07                       ` Peter Zijlstra
@ 2015-08-06 17:13                       ` Chris Metcalf
  1 sibling, 0 replies; 59+ messages in thread
From: Chris Metcalf @ 2015-08-06 17:13 UTC (permalink / raw)
  To: Frederic Weisbecker, Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

On 8/3/2015 1:39 PM, Frederic Weisbecker wrote:
> On Mon, Aug 03, 2015 at 07:12:43PM +0200, Peter Zijlstra wrote:
>> On Fri, Jul 31, 2015 at 04:49:55PM +0200, Frederic Weisbecker wrote:
>>> Instead of doing a per signal dependency, I'm going to use a per task
>>> one.
>> Urgh, does this mean you'll keep the horrid tick_nohz_task_switch()
>> thing?
> This current patchset removed the need for that with a global dependency
> for posix timers: as long as there is one enqueued, we keep the tick. But
> Chris and Luiz fear that Tilera users have posix timers on housekeepers.

I think Luiz was representing a different class of users, not Tilera ones, FWIW.

> But you need to periodically poll on timer expiration from a housekeeper.
> It's not only about firing the timer, it's about elapsing it against the
> target cputime.

Oh right, that makes sense.  Layering this on top of the existing offloading patches that
you mentioned sounds like it will get us close to what we want, though.

> Another thing: now I recall why I turned posix timers into a global tick dependency.
> In the case of a per-task/process dependency, we still need the context switch hook, because
> if we enqueue a timer to a sleeping task, the tick must be restarted when the task wakes
> up. And that requires a check on context switch.

Another approach might be to separately track nohz_full and housekeeping states, and then
add a hook at task migration time.  This is presumably much less frequent than context switch,
and would allow re-assessing this kind of state when moving a task from housekeeping to
nohz_full or vice versa.  Then a global tick dependency would be OK when a thread was running
on a nohz_full core (because frankly, that's just stupid, and you get what's coming to you),
but for housekeeping cores we could avoid having to worry about it. (I admit this idea is
half-baked but maybe it will inspire further baking.)

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-08-04  7:41           ` Peter Zijlstra
@ 2015-08-10 14:02             ` Juri Lelli
  2015-08-10 14:16               ` Frederic Weisbecker
  0 siblings, 1 reply; 59+ messages in thread
From: Juri Lelli @ 2015-08-10 14:02 UTC (permalink / raw)
  To: Peter Zijlstra, Frederic Weisbecker
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

Hi,

On 04/08/15 08:41, Peter Zijlstra wrote:
> On Mon, Aug 03, 2015 at 07:30:32PM +0200, Frederic Weisbecker wrote:
>>> But you've forgotten about SCHED_DEADLINE, we count those in:
>>> rq->dl.dl_nr_running.
>>
>> Indeed. Hmm, there is no preemption between SCHED_DEADLINE tasks, right?
>> So I can treat it like SCHED_FIFO.
> 
> Sadly no. Even though EDF has static job priority (once a job is
> activated its priority doesn't change anymore) DEADLINE also has a CBS
> component and that needs the tick regardless, even with a single task.
> 
> So any deadline task running means we cannot stop the tick.

Well, couldn't we stop it when we use hrtick? As start_hrtick_dl() sets
the hrtick to fire at the task's runtime depletion instant.
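
(For reference, start_hrtick_dl() is just this one-liner in
kernel/sched/deadline.c, if I read it right:

static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
{
	hrtick_start(rq, p->dl.runtime);
}

so the hardware event is already programmed at the depletion instant.)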

Thanks,

- Juri



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-08-10 14:02             ` Juri Lelli
@ 2015-08-10 14:16               ` Frederic Weisbecker
  2015-08-10 14:28                 ` Peter Zijlstra
  2015-08-10 15:33                 ` Christoph Lameter
  0 siblings, 2 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-08-10 14:16 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, LKML, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 10, 2015 at 03:02:04PM +0100, Juri Lelli wrote:
> Hi,
> 
> On 04/08/15 08:41, Peter Zijlstra wrote:
> > On Mon, Aug 03, 2015 at 07:30:32PM +0200, Frederic Weisbecker wrote:
> >>> But you've forgotten about SCHED_DEADLINE, we count those in:
> >>> rq->dl.dl_nr_running.
> >>
> >> Indeed. Hmm, there is no preemption between SCHED_DEADLINE tasks, right?
> >> So I can treat it like SCHED_FIFO.
> > 
> > Sadly no. Even though EDF has static job priority (once a job is
> > activated its priority doesn't change anymore) DEADLINE also has a CBS
> > component and that needs the tick regardless, even with a single task.
> > 
> > So any deadline task running means we cannot stop the tick.
> 
> Well, couldn't we stop it when we use hrtick? As start_hrtick_dl() sets
> the hrtick to fire at the task's runtime depletion instant.

hrtick() only does the task tick part of scheduler_tick(). There are
many other things that need to be updated: the active CPU load (which is buggy
with full dynticks, btw, because __update_cpu_load() expects nothing other than
regular frequency updates or idle decay), calc_global_load(), and the
load balancing stuff.

I've considered relying on hrtick many times, btw, but everyone seems to say it has a lot
of overhead, especially due to clock reprogramming on schedule() calls.
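
For reference, this is roughly what the full tick covers (paraphrased from
kernel/sched/core.c around this version, so take the details with a grain
of salt); hrtick() only replicates the task_tick() line:

void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;

	sched_clock_tick();

	raw_spin_lock(&rq->lock);
	update_rq_clock(rq);
	curr->sched_class->task_tick(rq, curr, 0);	/* the hrtick part */
	update_cpu_load_active(rq);	/* assumes a regular tick cadence */
	calc_global_load_tick(rq);
	raw_spin_unlock(&rq->lock);

	perf_event_task_tick();

#ifdef CONFIG_SMP
	rq->idle_balance = idle_cpu(cpu);
	trigger_load_balance(rq);	/* load balancing, not in hrtick either */
#endif
}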

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-08-10 14:16               ` Frederic Weisbecker
@ 2015-08-10 14:28                 ` Peter Zijlstra
  2015-08-10 15:11                   ` Peter Zijlstra
  2015-08-10 15:33                 ` Christoph Lameter
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-10 14:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Juri Lelli, LKML, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 10, 2015 at 04:16:58PM +0200, Frederic Weisbecker wrote:

> I've considered relying on hrtick many times, btw, but everyone seems to say it has a lot
> of overhead, especially due to clock reprogramming on schedule() calls.

Yeah, I have some vague ideas of how to take out much of that overhead
(tglx will launch frozen sharks at me I suspect), but we cannot get
around the overhead of actually having to program the hardware and that
is still a significant amount on many machines.

Supposedly machines with TSC deadline are better, but I've not tried
to benchmark that.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-08-10 14:28                 ` Peter Zijlstra
@ 2015-08-10 15:11                   ` Peter Zijlstra
  2015-08-10 15:29                     ` Frederic Weisbecker
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-10 15:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Juri Lelli, LKML, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 10, 2015 at 04:28:47PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 10, 2015 at 04:16:58PM +0200, Frederic Weisbecker wrote:
> 
> > I've considered relying on hrtick many times, btw, but everyone seems to say it has a lot
> > of overhead, especially due to clock reprogramming on schedule() calls.
> 
> Yeah, I have some vague ideas of how to take out much of that overhead
> (tglx will launch frozen sharks at me I suspect), but we cannot get
> around the overhead of actually having to program the hardware and that
> is still a significant amount on many machines.
> 
> Supposedly machines with TSC deadline are better, but I've not tried
> to benchmark that.

Basically something along these lines.. which avoids a whole bunch of
hrtimer stuff.

But without fast hardware it's all still pointless.

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 76dd4f0da5ca..c279950cb8c3 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -200,6 +200,7 @@ struct hrtimer_cpu_base {
 	unsigned int			nr_retries;
 	unsigned int			nr_hangs;
 	unsigned int			max_hang_time;
+	ktime_t				expires_sched;
 #endif
 	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
 } ____cacheline_aligned;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5c7ae4b641c4..be9c0a555eaa 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -68,6 +68,7 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
 {
 	.lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock),
 	.seq = SEQCNT_ZERO(hrtimer_bases.seq),
+	.expires_sched = { .tv64 = KTIME_MAX, },
 	.clock_base =
 	{
 		{
@@ -460,7 +461,7 @@ static inline void hrtimer_update_next_timer(struct hrtimer_cpu_base *cpu_base,
 static ktime_t __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base)
 {
 	struct hrtimer_clock_base *base = cpu_base->clock_base;
-	ktime_t expires, expires_next = { .tv64 = KTIME_MAX };
+	ktime_t expires, expires_next = cpu_base->expires_sched;
 	unsigned int active = cpu_base->active_bases;
 
 	hrtimer_update_next_timer(cpu_base, NULL);
@@ -1289,6 +1290,33 @@ static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now)
 
 #ifdef CONFIG_HIGH_RES_TIMERS
 
+void sched_hrtick_set(u64 ns)
+{
+	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
+	ktime_t expires = ktime_add_ns(ktime_get(), ns);
+
+	raw_spin_lock(&cpu_base->lock);
+	cpu_base->expires_sched = expires;
+
+	if (expires.tv64 < cpu_base->expires_next.tv64)
+		hrtimer_force_reprogram(cpu_base, 0);
+
+	raw_spin_unlock(&cpu_base->lock);
+}
+
+void sched_hrtick_cancel(void)
+{
+	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
+
+	raw_spin_lock(&cpu_base->lock);
+	/*
+	 * If the current event was this sched event, eat the superfluous
+	 * interrupt rather than touch the hardware again.
+	 */
+	cpu_base->expires_sched.tv64 = KTIME_MAX;
+	raw_spin_unlock(&cpu_base->lock);
+}
+
 /*
  * High resolution timer interrupt
  * Called with interrupts disabled
@@ -1316,6 +1344,13 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 	 */
 	cpu_base->expires_next.tv64 = KTIME_MAX;
 
+	if (cpu_base->expires_sched.tv64 < now.tv64) {
+		cpu_base->expires_sched.tv64 = KTIME_MAX;
+		raw_spin_unlock(&cpu_base->lock);
+		scheduler_hrtick();
+		raw_spin_lock(&cpu_base->lock);
+	}
+
 	__hrtimer_run_queues(cpu_base, now);
 
 	/* Reevaluate the clock bases for the next expiry */

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-08-10 15:11                   ` Peter Zijlstra
@ 2015-08-10 15:29                     ` Frederic Weisbecker
  2015-08-10 15:43                       ` Juri Lelli
  2015-08-10 16:41                       ` Peter Zijlstra
  0 siblings, 2 replies; 59+ messages in thread
From: Frederic Weisbecker @ 2015-08-10 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, LKML, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 10, 2015 at 05:11:51PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 10, 2015 at 04:28:47PM +0200, Peter Zijlstra wrote:
> > On Mon, Aug 10, 2015 at 04:16:58PM +0200, Frederic Weisbecker wrote:
> > 
> > > I've considered relying on hrtick many times, btw, but everyone seems to say it has a lot
> > > of overhead, especially due to clock reprogramming on schedule() calls.
> > 
> > Yeah, I have some vague ideas of how to take out much of that overhead
> > (tglx will launch frozen sharks at me I suspect), but we cannot get
> > around the overhead of actually having to program the hardware and that
> > is still a significant amount on many machines.
> > 
> > Supposedly machines with TSC deadline are better, but I've not tried
> > to benchmark that.
> 
> Basically something along these lines.. which avoids a whole bunch of
> hrtimer stuff.
> 
> But without fast hardware it's all still pointless.
> 
> diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
> index 76dd4f0da5ca..c279950cb8c3 100644
> --- a/include/linux/hrtimer.h
> +++ b/include/linux/hrtimer.h
> @@ -200,6 +200,7 @@ struct hrtimer_cpu_base {
>  	unsigned int			nr_retries;
>  	unsigned int			nr_hangs;
>  	unsigned int			max_hang_time;
> +	ktime_t				expires_sched;
>  #endif
>  	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
>  } ____cacheline_aligned;
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 5c7ae4b641c4..be9c0a555eaa 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -68,6 +68,7 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
>  {
>  	.lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock),
>  	.seq = SEQCNT_ZERO(hrtimer_bases.seq),
> +	.expires_sched = { .tv64 = KTIME_MAX, },
>  	.clock_base =
>  	{
>  		{
> @@ -460,7 +461,7 @@ static inline void hrtimer_update_next_timer(struct hrtimer_cpu_base *cpu_base,
>  static ktime_t __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base)
>  {
>  	struct hrtimer_clock_base *base = cpu_base->clock_base;
> -	ktime_t expires, expires_next = { .tv64 = KTIME_MAX };
> +	ktime_t expires, expires_next = cpu_base->expires_sched;
>  	unsigned int active = cpu_base->active_bases;
>  
>  	hrtimer_update_next_timer(cpu_base, NULL);
> @@ -1289,6 +1290,33 @@ static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now)
>  
>  #ifdef CONFIG_HIGH_RES_TIMERS
>  
> +void sched_hrtick_set(u64 ns)
> +{
> +	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
> +	ktime_t expires = ktime_add_ns(ktime_get(), ns);
> +
> +	raw_spin_lock(&cpu_base->lock);
> +	cpu_base->expires_sched = expires;
> +
> +	if (expires.tv64 < cpu_base->expires_next.tv64)
> +		hrtimer_force_reprogram(cpu_base, 0);
> +
> +	raw_spin_unlock(&cpu_base->lock);
> +}
> +
> +void sched_hrtick_cancel(void)
> +{
> +	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
> +
> +	raw_spin_lock(&cpu_base->lock);
> +	/*
> +	 * If the current event was this sched event, eat the superfluous
> +	 * interrupt rather than touch the hardware again.
> +	 */
> +	cpu_base->expires_sched.tv64 = KTIME_MAX;
> +	raw_spin_unlock(&cpu_base->lock);
> +}

Well, there could be a cleaner way to do this without tying it to the scheduler
tick. This could be some sort of hrtimer_cancel_soft() which more generally cancels a
timer without cancelling the interrupt itself. We might still want to keep track of that
lost interrupt though, in case a later clock reprogramming fits it,
with a field like cpu_base->expires_interrupt. I thought about expires_soft and expires_hard,
but I think that terminology is already used :-)

That said, that feature at least wouldn't fit nohz full, which really wants to avoid spurious
interrupts.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-08-10 14:16               ` Frederic Weisbecker
  2015-08-10 14:28                 ` Peter Zijlstra
@ 2015-08-10 15:33                 ` Christoph Lameter
  1 sibling, 0 replies; 59+ messages in thread
From: Christoph Lameter @ 2015-08-10 15:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Juri Lelli, Peter Zijlstra, LKML, Thomas Gleixner,
	Preeti U Murthy, Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, 10 Aug 2015, Frederic Weisbecker wrote:

> I've considered relying on hrtick many times, btw, but everyone seems to say it has a lot
> of overhead, especially due to clock reprogramming on schedule() calls.

Depends on how many ticks you can save, I would think. It certainly is
worthwhile if you can avoid 10 ticks. Can we figure out how far in the
future that event is? If it's more than 10 ticks away, then it's worth
switching off the tick?



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-08-10 15:29                     ` Frederic Weisbecker
@ 2015-08-10 15:43                       ` Juri Lelli
  2015-08-10 16:41                       ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Juri Lelli @ 2015-08-10 15:43 UTC (permalink / raw)
  To: Frederic Weisbecker, Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Preeti U Murthy, Christoph Lameter,
	Ingo Molnar, Viresh Kumar, Rik van Riel

Hi,

On 10/08/15 16:29, Frederic Weisbecker wrote:
> On Mon, Aug 10, 2015 at 05:11:51PM +0200, Peter Zijlstra wrote:
>> On Mon, Aug 10, 2015 at 04:28:47PM +0200, Peter Zijlstra wrote:
>>> On Mon, Aug 10, 2015 at 04:16:58PM +0200, Frederic Weisbecker wrote:
>>>
>>>> I've considered relying on hrtick many times, btw, but everyone seems to say it has a lot
>>>> of overhead, especially due to clock reprogramming on schedule() calls.
>>>
>>> Yeah, I have some vague ideas of how to take out much of that overhead
>>> (tglx will launch frozen sharks at me I suspect), but we cannot get
>>> around the overhead of actually having to program the hardware and that
>>> is still a significant amount on many machines.
>>>
>>> Supposedly machines with TSC deadline are better, but I've not tried
>>> to benchmark that.
>>
>> Basically something along these lines.. which avoids a whole bunch of
>> hrtimer stuff.
>>
>> But without fast hardware it's all still pointless.
>>
>> diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
>> index 76dd4f0da5ca..c279950cb8c3 100644
>> --- a/include/linux/hrtimer.h
>> +++ b/include/linux/hrtimer.h
>> @@ -200,6 +200,7 @@ struct hrtimer_cpu_base {
>>  	unsigned int			nr_retries;
>>  	unsigned int			nr_hangs;
>>  	unsigned int			max_hang_time;
>> +	ktime_t				expires_sched;
>>  #endif
>>  	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
>>  } ____cacheline_aligned;
>> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
>> index 5c7ae4b641c4..be9c0a555eaa 100644
>> --- a/kernel/time/hrtimer.c
>> +++ b/kernel/time/hrtimer.c
>> @@ -68,6 +68,7 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
>>  {
>>  	.lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock),
>>  	.seq = SEQCNT_ZERO(hrtimer_bases.seq),
>> +	.expires_sched = { .tv64 = KTIME_MAX, },
>>  	.clock_base =
>>  	{
>>  		{
>> @@ -460,7 +461,7 @@ static inline void hrtimer_update_next_timer(struct hrtimer_cpu_base *cpu_base,
>>  static ktime_t __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base)
>>  {
>>  	struct hrtimer_clock_base *base = cpu_base->clock_base;
>> -	ktime_t expires, expires_next = { .tv64 = KTIME_MAX };
>> +	ktime_t expires, expires_next = cpu_base->expires_sched;
>>  	unsigned int active = cpu_base->active_bases;
>>  
>>  	hrtimer_update_next_timer(cpu_base, NULL);
>> @@ -1289,6 +1290,33 @@ static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now)
>>  
>>  #ifdef CONFIG_HIGH_RES_TIMERS
>>  
>> +void sched_hrtick_set(u64 ns)
>> +{
>> +	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
>> +	ktime_t expires = ktime_add_ns(ktime_get(), ns);
>> +
>> +	raw_spin_lock(&cpu_base->lock);
>> +	cpu_base->expires_sched = expires;
>> +
>> +	if (expires.tv64 < cpu_base->expires_next.tv64)
>> +		hrtimer_force_reprogram(cpu_base, 0);
>> +
>> +	raw_spin_unlock(&cpu_base->lock);
>> +}
>> +
>> +void sched_hrtick_cancel(void)
>> +{
>> +	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
>> +
>> +	raw_spin_lock(&cpu_base->lock);
>> +	/*
>> +	 * If the current event was this sched event, eat the superfluous
>> +	 * interrupt rather than touch the hardware again.
>> +	 */
>> +	cpu_base->expires_sched.tv64 = KTIME_MAX;
>> +	raw_spin_unlock(&cpu_base->lock);
>> +}
> 
> Well, there could be a more proper way to do this without tying that to the scheduler
> tick. This could be some sort of hrtimer_cancel_soft() which more generally cancels a
> timer without cancelling the interrupt itself. We might want to still keep track of that
> lost interrupt though, in case a later clock reprogramming happens to match it.
> We could use a field like cpu_base->expires_interrupt. I thought about expires_soft and expires_hard,
> but I think that terminology is already used :-)
> 
> That said, that feature at least wouldn't fit nohz full, which really wants to avoid spurious
> interrupts.
> 

Quite a detailed reply to my naive question :).
Thanks a lot Frederic and Peter for this!

As far as SCHED_DEADLINE is concerned, I guess the bottom line is
that it only makes sense to use hrtick for sub-millisecond accounting
(without nohz full).

Best,

- Juri


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 07/10] sched: Migrate sched to use new tick dependency mask model
  2015-08-10 15:29                     ` Frederic Weisbecker
  2015-08-10 15:43                       ` Juri Lelli
@ 2015-08-10 16:41                       ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2015-08-10 16:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Juri Lelli, LKML, Thomas Gleixner, Preeti U Murthy,
	Christoph Lameter, Ingo Molnar, Viresh Kumar, Rik van Riel

On Mon, Aug 10, 2015 at 05:29:22PM +0200, Frederic Weisbecker wrote:

> Well, there could be a more proper way to do this without tying that
> to the scheduler tick. This could be some sort of
> hrtimer_cancel_soft() which more generally cancels a timer without
> cancelling the interrupt itself.

So the last time I looked at this -- a long, long time ago -- the whole
hrtimer rbtree took a significant amount of time. Hence the proposal
here to avoid all of it for this special timer.

> We might want to still keep track of
> that lost interrupt though, in case a later clock reprogramming
> happens to match it. We could use a field like
> cpu_base->expires_interrupt. I thought about expires_soft and
> expires_hard, but I think that terminology is already used :-)

It's easy enough to 'fix', but typically you'd reprogram a new sched tick
anyway, so it's moot. Touching the hardware twice, once to cancel the
old timer and once to program the new one, is double the pain.

The only case where you really take that interrupt is when you cancel
and go idle, I suppose, and we could special-case that if the NOHZ code
doesn't already DTRT (by accident) etc.
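
A sketch of that idle special case, building on the expires_sched field
from the patch above (sched_hrtick_cancel_idle() is a made-up name and
this is illustrative only):

void sched_hrtick_cancel_idle(void)
{
	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);

	raw_spin_lock(&cpu_base->lock);
	cpu_base->expires_sched.tv64 = KTIME_MAX;
	/*
	 * Unlike the common cancel path, pay for one reprogram here:
	 * an idle CPU would otherwise take a pointless wakeup from
	 * the stale sched event.
	 */
	hrtimer_force_reprogram(cpu_base, 0);
	raw_spin_unlock(&cpu_base->lock);
}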


^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2015-08-10 16:41 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-23 16:42 [PATCH 00/10] nohz: Tick dependency mask v2 Frederic Weisbecker
2015-07-23 16:42 ` [PATCH 01/10] nohz: Remove idle task special case Frederic Weisbecker
2015-07-23 16:42 ` [PATCH 02/10] nohz: Restart nohz full tick from irq exit Frederic Weisbecker
2015-07-23 16:42 ` [PATCH 03/10] nohz: Move tick_nohz_restart_sched_tick() above its users Frederic Weisbecker
2015-07-23 16:42 ` [PATCH 04/10] nohz: Remove useless argument on tick_nohz_task_switch() Frederic Weisbecker
2015-08-03 12:39   ` Peter Zijlstra
2015-08-03 12:49     ` Frederic Weisbecker
2015-08-03 13:04       ` Peter Zijlstra
2015-07-23 16:42 ` [PATCH 05/10] nohz: New tick dependency mask Frederic Weisbecker
2015-07-24 16:55   ` Chris Metcalf
2015-07-24 17:16     ` Frederic Weisbecker
2015-07-24 17:43       ` Chris Metcalf
2015-08-03 12:48         ` Peter Zijlstra
2015-08-03 12:43   ` Peter Zijlstra
2015-08-03 13:05     ` Frederic Weisbecker
2015-08-03 13:24       ` Peter Zijlstra
2015-08-03 13:49         ` Frederic Weisbecker
2015-08-03 12:57   ` Peter Zijlstra
2015-08-03 13:09     ` Frederic Weisbecker
2015-08-03 13:29       ` Peter Zijlstra
2015-08-03 13:55         ` Frederic Weisbecker
2015-08-03 14:11           ` Peter Zijlstra
2015-07-23 16:42 ` [PATCH 06/10] perf: Migrate perf to use new tick dependency mask model Frederic Weisbecker
2015-07-23 16:42 ` [PATCH 07/10] sched: Migrate sched " Frederic Weisbecker
2015-07-23 16:55   ` Frederic Weisbecker
2015-07-24 16:56   ` Chris Metcalf
2015-07-29 13:01     ` Frederic Weisbecker
2015-08-03 14:00   ` Peter Zijlstra
2015-08-03 14:50     ` Frederic Weisbecker
2015-08-03 17:09       ` Peter Zijlstra
2015-08-03 17:30         ` Frederic Weisbecker
2015-08-04  7:41           ` Peter Zijlstra
2015-08-10 14:02             ` Juri Lelli
2015-08-10 14:16               ` Frederic Weisbecker
2015-08-10 14:28                 ` Peter Zijlstra
2015-08-10 15:11                   ` Peter Zijlstra
2015-08-10 15:29                     ` Frederic Weisbecker
2015-08-10 15:43                       ` Juri Lelli
2015-08-10 16:41                       ` Peter Zijlstra
2015-08-10 15:33                 ` Christoph Lameter
2015-07-23 16:42 ` [PATCH 08/10] posix-cpu-timers: Migrate " Frederic Weisbecker
2015-07-24 16:57   ` Chris Metcalf
2015-07-29 13:23     ` Frederic Weisbecker
2015-07-29 17:24       ` Chris Metcalf
2015-07-30  0:44         ` Frederic Weisbecker
2015-07-30 14:31           ` Luiz Capitulino
2015-07-30 14:46             ` Frederic Weisbecker
2015-07-30 19:35           ` Chris Metcalf
2015-07-30 19:45             ` Frederic Weisbecker
2015-07-30 19:52               ` Chris Metcalf
2015-07-31 14:49                 ` Frederic Weisbecker
2015-08-03 15:59                   ` Chris Metcalf
2015-08-03 18:01                     ` Frederic Weisbecker
2015-08-03 17:12                   ` Peter Zijlstra
2015-08-03 17:39                     ` Frederic Weisbecker
2015-08-03 19:07                       ` Peter Zijlstra
2015-08-06 17:13                       ` Chris Metcalf
2015-07-23 16:42 ` [PATCH 09/10] sched-clock: " Frederic Weisbecker
2015-07-23 16:42 ` [PATCH 10/10] nohz: Remove task switch obsolete tick dependency check Frederic Weisbecker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).