* [GIT PULL] isolation: 1Hz residual tick offloading v3
@ 2018-01-04  4:25 Frederic Weisbecker
  2018-01-04  4:25 ` [PATCH 1/5] sched: Rename init_rq_hrtick to hrtick_rq_init Frederic Weisbecker
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2018-01-04  4:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Wanpeng Li, Mike Galbraith, Rik van Riel

Ingo,

Please pull the sched/0hz branch that can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	sched/0hz

HEAD: 9e932b2cc707209febd130978a5eb9f4a943a3f4

--
Now that scheduler_tick() has become resilient towards the absence of
ticks, current->sched_class->task_tick() is the last piece that needs
at least a 1Hz tick to keep scheduler stats alive.

This patchset adds a flag to the isolcpus boot option to offload the
residual 1Hz tick. This way the nohz_full CPUs no longer have any tick
(assuming nothing else requires it), as their residual 1Hz tick is
offloaded to the housekeepers.

For quick testing, say on CPUs 1-7:

	"isolcpus=nohz_offload,domain,1-7"

Thanks,
	Frederic
---

Frederic Weisbecker (5):
      sched: Rename init_rq_hrtick to hrtick_rq_init
      sched/isolation: Add scheduler tick offloading interface
      nohz: Allow to check if remote CPU tick is stopped
      sched/isolation: Residual 1Hz scheduler tick offload
      sched/isolation: Document "nohz_offload" flag


 Documentation/admin-guide/kernel-parameters.txt |  7 +-
 include/linux/sched/isolation.h                 |  3 +-
 include/linux/tick.h                            |  2 +
 kernel/sched/core.c                             | 94 +++++++++++++++++++++++--
 kernel/sched/isolation.c                        | 10 +++
 kernel/sched/sched.h                            |  2 +
 kernel/time/tick-sched.c                        |  7 ++
 7 files changed, 117 insertions(+), 8 deletions(-)


* [PATCH 1/5] sched: Rename init_rq_hrtick to hrtick_rq_init
  2018-01-04  4:25 [GIT PULL] isolation: 1Hz residual tick offloading v3 Frederic Weisbecker
@ 2018-01-04  4:25 ` Frederic Weisbecker
  2018-01-04  4:25 ` [PATCH 2/5] sched/isolation: Add scheduler tick offloading interface Frederic Weisbecker
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2018-01-04  4:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Wanpeng Li, Mike Galbraith, Rik van Riel

Do that rename in order to normalize the hrtick namespace.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 644fa2e..d72d0e9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -333,7 +333,7 @@ void hrtick_start(struct rq *rq, u64 delay)
 }
 #endif /* CONFIG_SMP */
 
-static void init_rq_hrtick(struct rq *rq)
+static void hrtick_rq_init(struct rq *rq)
 {
 #ifdef CONFIG_SMP
 	rq->hrtick_csd_pending = 0;
@@ -351,7 +351,7 @@ static inline void hrtick_clear(struct rq *rq)
 {
 }
 
-static inline void init_rq_hrtick(struct rq *rq)
+static inline void hrtick_rq_init(struct rq *rq)
 {
 }
 #endif	/* CONFIG_SCHED_HRTICK */
@@ -5955,7 +5955,7 @@ void __init sched_init(void)
 		rq->last_sched_tick = 0;
 #endif
 #endif /* CONFIG_SMP */
-		init_rq_hrtick(rq);
+		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
 	}
 
-- 
2.7.4


* [PATCH 2/5] sched/isolation: Add scheduler tick offloading interface
  2018-01-04  4:25 [GIT PULL] isolation: 1Hz residual tick offloading v3 Frederic Weisbecker
  2018-01-04  4:25 ` [PATCH 1/5] sched: Rename init_rq_hrtick to hrtick_rq_init Frederic Weisbecker
@ 2018-01-04  4:25 ` Frederic Weisbecker
  2018-01-04  4:25 ` [PATCH 3/5] nohz: Allow to check if remote CPU tick is stopped Frederic Weisbecker
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2018-01-04  4:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Wanpeng Li, Mike Galbraith, Rik van Riel

Add the boot option that will allow us to offload the 1Hz scheduler tick
to the housekeeping CPU.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched/isolation.h | 3 ++-
 kernel/sched/isolation.c        | 6 ++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d849431..c831855 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -11,7 +11,8 @@ enum hk_flags {
 	HK_FLAG_MISC		= (1 << 2),
 	HK_FLAG_SCHED		= (1 << 3),
 	HK_FLAG_TICK		= (1 << 4),
-	HK_FLAG_DOMAIN		= (1 << 5),
+	HK_FLAG_TICK_SCHED	= (1 << 5),
+	HK_FLAG_DOMAIN		= (1 << 6),
 };
 
 #ifdef CONFIG_CPU_ISOLATION
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index b71b436..264ddcd 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -136,6 +136,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
 			continue;
 		}
 
+		if (!strncmp(str, "nohz_offload,", 13)) {
+			str += 13;
+			flags |= HK_FLAG_TICK | HK_FLAG_TICK_SCHED;
+			continue;
+		}
+
 		if (!strncmp(str, "domain,", 7)) {
 			str += 7;
 			flags |= HK_FLAG_DOMAIN;
-- 
2.7.4
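
Editorial sketch, not part of the patch: the flag parsing above can be
tried in userspace with something like the following. The HK_FLAG_* values
are copied from the hunk in this patch; parse_isolcpus_flags() is a
hypothetical helper name, and the "nohz," branch and the default to
"domain" when no flag is given are assumptions based on the surrounding
context and the documented default, not on code shown in this thread.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Flag values copied from the enum hk_flags hunk in this patch. */
enum hk_flags {
	HK_FLAG_TICK		= (1 << 4),
	HK_FLAG_TICK_SCHED	= (1 << 5),
	HK_FLAG_DOMAIN		= (1 << 6),
};

/*
 * Hypothetical userspace helper: consume the leading flag list of an
 * isolcpus= value and return the accumulated flags. On success *rest
 * points at the first character after the flags (the CPU list).
 * Returns -1 on an unknown flag.
 */
int parse_isolcpus_flags(const char *str, const char **rest)
{
	int flags = 0;

	assert(str != NULL && rest != NULL);

	while (isalpha((unsigned char)*str)) {
		if (!strncmp(str, "nohz,", 5)) {
			str += 5;
			flags |= HK_FLAG_TICK;
			continue;
		}

		/* New in this patch: implies nohz plus the offloaded 1Hz tick. */
		if (!strncmp(str, "nohz_offload,", 13)) {
			str += 13;
			flags |= HK_FLAG_TICK | HK_FLAG_TICK_SCHED;
			continue;
		}

		if (!strncmp(str, "domain,", 7)) {
			str += 7;
			flags |= HK_FLAG_DOMAIN;
			continue;
		}

		return -1;	/* unknown flag */
	}

	/* Assumed default when no flag is given: domain isolation. */
	if (!flags)
		flags = HK_FLAG_DOMAIN;

	*rest = str;
	return flags;
}
```

With the cover letter's example, "nohz_offload,domain,1-7", this sets the
TICK, TICK_SCHED and DOMAIN bits and leaves "1-7" as the CPU list.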


* [PATCH 3/5] nohz: Allow to check if remote CPU tick is stopped
  2018-01-04  4:25 [GIT PULL] isolation: 1Hz residual tick offloading v3 Frederic Weisbecker
  2018-01-04  4:25 ` [PATCH 1/5] sched: Rename init_rq_hrtick to hrtick_rq_init Frederic Weisbecker
  2018-01-04  4:25 ` [PATCH 2/5] sched/isolation: Add scheduler tick offloading interface Frederic Weisbecker
@ 2018-01-04  4:25 ` Frederic Weisbecker
  2018-01-04  4:25 ` [PATCH 4/5] sched/isolation: Residual 1Hz scheduler tick offload Frederic Weisbecker
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2018-01-04  4:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Wanpeng Li, Mike Galbraith, Rik van Riel

This check is racy but provides a good heuristic to determine whether
a CPU may need a remote tick or not.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/tick.h     | 2 ++
 kernel/time/tick-sched.c | 7 +++++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7cc3592..944c829 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -114,6 +114,7 @@ enum tick_dep_bits {
 #ifdef CONFIG_NO_HZ_COMMON
 extern bool tick_nohz_enabled;
 extern int tick_nohz_tick_stopped(void);
+extern int tick_nohz_tick_stopped_cpu(int cpu);
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
@@ -125,6 +126,7 @@ extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 #else /* !CONFIG_NO_HZ_COMMON */
 #define tick_nohz_enabled (0)
 static inline int tick_nohz_tick_stopped(void) { return 0; }
+static inline int tick_nohz_tick_stopped_cpu(int cpu) { return 0; }
 static inline void tick_nohz_idle_enter(void) { }
 static inline void tick_nohz_idle_exit(void) { }
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f7cc7ab..97c4317 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -486,6 +486,13 @@ int tick_nohz_tick_stopped(void)
 	return __this_cpu_read(tick_cpu_sched.tick_stopped);
 }
 
+int tick_nohz_tick_stopped_cpu(int cpu)
+{
+	struct tick_sched *ts = per_cpu_ptr(&tick_cpu_sched, cpu);
+
+	return ts->tick_stopped;
+}
+
 /**
  * tick_nohz_update_jiffies - update jiffies when idle was interrupted
  *
-- 
2.7.4
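
Editorial sketch, not part of the patch: a userspace model of the remote
check. The kernel code reads the per-CPU field directly and accepts the
race; this sketch swaps in C11 relaxed atomics so the concurrent access is
well defined in portable C. NR_CPUS, tick_set_stopped() and
tick_stopped_cpu() are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS 8	/* hypothetical CPU count for this sketch */

struct tick_sched {
	atomic_int tick_stopped;
};

/* Stand-in for the per-CPU tick_cpu_sched variable. */
static struct tick_sched tick_cpu_sched[NR_CPUS];

/* Owner side: a CPU records whether its own tick is stopped. */
void tick_set_stopped(int cpu, int stopped)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	atomic_store_explicit(&tick_cpu_sched[cpu].tick_stopped, stopped,
			      memory_order_relaxed);
}

/*
 * Remote side: a snapshot of another CPU's state. The value may be
 * stale by the time the caller acts on it, so it is only a heuristic.
 */
int tick_stopped_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return atomic_load_explicit(&tick_cpu_sched[cpu].tick_stopped,
				    memory_order_relaxed);
}
```

The relaxed ordering mirrors the intent stated in the changelog: the
reader only wants a hint, and no other data is published through this
flag.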


* [PATCH 4/5] sched/isolation: Residual 1Hz scheduler tick offload
  2018-01-04  4:25 [GIT PULL] isolation: 1Hz residual tick offloading v3 Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2018-01-04  4:25 ` [PATCH 3/5] nohz: Allow to check if remote CPU tick is stopped Frederic Weisbecker
@ 2018-01-04  4:25 ` Frederic Weisbecker
  2018-01-12 19:22   ` Luiz Capitulino
  2018-01-04  4:25 ` [PATCH 5/5] sched/isolation: Document "nohz_offload" flag Frederic Weisbecker
  2018-01-12 19:18 ` [GIT PULL] isolation: 1Hz residual tick offloading v3 Luiz Capitulino
  5 siblings, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2018-01-04  4:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Wanpeng Li, Mike Galbraith, Rik van Riel

When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
keep the scheduler stats alive. However this residual tick is a burden
for bare metal tasks that can't stand any interruption at all, or want
to minimize them.

Adding the boot parameter "isolcpus=nohz_offload" will now outsource
these scheduler ticks to the global workqueue so that a housekeeping CPU
handles that tick remotely.

Note it's still up to the user to affine the global workqueues to the
housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or
domain isolation.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c      | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/isolation.c |  4 +++
 kernel/sched/sched.h     |  2 ++
 3 files changed, 91 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d72d0e9..b964890 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3052,9 +3052,14 @@ void scheduler_tick(void)
  */
 u64 scheduler_tick_max_deferment(void)
 {
-	struct rq *rq = this_rq();
-	unsigned long next, now = READ_ONCE(jiffies);
+	struct rq *rq;
+	unsigned long next, now;
 
+	if (!housekeeping_cpu(smp_processor_id(), HK_FLAG_TICK_SCHED))
+		return ktime_to_ns(KTIME_MAX);
+
+	rq = this_rq();
+	now = READ_ONCE(jiffies);
 	next = rq->last_sched_tick + HZ;
 
 	if (time_before_eq(next, now))
@@ -3062,7 +3067,82 @@ u64 scheduler_tick_max_deferment(void)
 
 	return jiffies_to_nsecs(next - now);
 }
-#endif
+
+struct tick_work {
+	int			cpu;
+	struct delayed_work	work;
+};
+
+static struct tick_work __percpu *tick_work_cpu;
+
+static void sched_tick_remote(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct tick_work *twork = container_of(dwork, struct tick_work, work);
+	int cpu = twork->cpu;
+	struct rq *rq = cpu_rq(cpu);
+	struct rq_flags rf;
+
+	/*
+	 * Handle the tick only if it appears the remote CPU is running
+	 * in full dynticks mode. The check is racy by nature, but
+	 * missing a tick or having one too much is no big deal.
+	 */
+	if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
+		rq_lock_irq(rq, &rf);
+		update_rq_clock(rq);
+		rq->curr->sched_class->task_tick(rq, rq->curr, 0);
+		rq_unlock_irq(rq, &rf);
+	}
+
+	queue_delayed_work(system_unbound_wq, dwork, HZ);
+}
+
+static void sched_tick_start(int cpu)
+{
+	struct tick_work *twork;
+
+	if (housekeeping_cpu(cpu, HK_FLAG_TICK_SCHED))
+		return;
+
+	WARN_ON_ONCE(!tick_work_cpu);
+
+	twork = per_cpu_ptr(tick_work_cpu, cpu);
+	twork->cpu = cpu;
+	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
+	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void sched_tick_stop(int cpu)
+{
+	struct tick_work *twork;
+
+	if (housekeeping_cpu(cpu, HK_FLAG_TICK_SCHED))
+		return;
+
+	WARN_ON_ONCE(!tick_work_cpu);
+
+	twork = per_cpu_ptr(tick_work_cpu, cpu);
+	cancel_delayed_work_sync(&twork->work);
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
+int __init sched_tick_offload_init(void)
+{
+	tick_work_cpu = alloc_percpu(struct tick_work);
+	if (!tick_work_cpu) {
+		pr_err("Can't allocate remote tick struct\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+#else
+static void sched_tick_start(int cpu) { }
+static void sched_tick_stop(int cpu) { }
+#endif /* CONFIG_NO_HZ_FULL */
 
 #if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
 				defined(CONFIG_PREEMPT_TRACER))
@@ -5713,6 +5793,7 @@ int sched_cpu_starting(unsigned int cpu)
 {
 	set_cpu_rq_start_time(cpu);
 	sched_rq_cpu_starting(cpu);
+	sched_tick_start(cpu);
 	return 0;
 }
 
@@ -5724,6 +5805,7 @@ int sched_cpu_dying(unsigned int cpu)
 
 	/* Handle pending wakeups and then migrate everything off */
 	sched_ttwu_pending();
+	sched_tick_stop(cpu);
 
 	rq_lock_irqsave(rq, &rf);
 	if (rq->rd) {
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 264ddcd..c5e7e90a 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -12,6 +12,7 @@
 #include <linux/kernel.h>
 #include <linux/static_key.h>
 #include <linux/ctype.h>
+#include "sched.h"
 
 DEFINE_STATIC_KEY_FALSE(housekeeping_overriden);
 EXPORT_SYMBOL_GPL(housekeeping_overriden);
@@ -60,6 +61,9 @@ void __init housekeeping_init(void)
 
 	static_branch_enable(&housekeeping_overriden);
 
+	if (housekeeping_flags & HK_FLAG_TICK_SCHED)
+		sched_tick_offload_init();
+
 	/* We need at least one CPU to handle housekeeping work */
 	WARN_ON_ONCE(cpumask_empty(housekeeping_mask));
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b19552a2..5a3b82c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1587,6 +1587,7 @@ extern void post_init_entity_util_avg(struct sched_entity *se);
 
 #ifdef CONFIG_NO_HZ_FULL
 extern bool sched_can_stop_tick(struct rq *rq);
+extern int __init sched_tick_offload_init(void);
 
 /*
  * Tick may be needed by tasks in the runqueue depending on their policy and
@@ -1611,6 +1612,7 @@ static inline void sched_update_tick_dependency(struct rq *rq)
 		tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
 }
 #else
+static inline int sched_tick_offload_init(void) { return 0; }
 static inline void sched_update_tick_dependency(struct rq *rq) { }
 #endif
 
-- 
2.7.4
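
Editorial sketch, not part of the patch: the deferment arithmetic in
scheduler_tick_max_deferment() can be checked in userspace as below. HZ is
set to a hypothetical 250, KTIME_MAX_NS stands in for
ktime_to_ns(KTIME_MAX), and the tick_offloaded parameter replaces the
!housekeeping_cpu(..., HK_FLAG_TICK_SCHED) test. The comparison helper
follows the usual wrap-safe signed-difference idiom.

```c
#include <assert.h>
#include <stdint.h>

#define HZ		250UL		/* hypothetical tick rate */
#define NSEC_PER_SEC	1000000000ULL
#define KTIME_MAX_NS	INT64_MAX	/* stands in for ktime_to_ns(KTIME_MAX) */

static_assert(NSEC_PER_SEC % HZ == 0, "one jiffy must be a whole number of ns");

/* Wrap-safe "a is before or equal to b" on free-running jiffies. */
static int time_before_eq(unsigned long a, unsigned long b)
{
	return (long)(b - a) >= 0;
}

/*
 * How long the local tick may be deferred, in nanoseconds.
 * tick_offloaded != 0 models a CPU whose 1Hz tick is handled remotely:
 * it never needs to wake up for the scheduler's sake.
 */
uint64_t max_deferment_ns(int tick_offloaded, unsigned long last_sched_tick,
			  unsigned long now)
{
	unsigned long next;

	if (tick_offloaded)
		return KTIME_MAX_NS;

	/* Otherwise keep at least one scheduler tick per second. */
	next = last_sched_tick + HZ;
	if (time_before_eq(next, now))
		return 0;

	return (uint64_t)(next - now) * (NSEC_PER_SEC / HZ);
}
```

For example, with the last tick at jiffy 1000 and the current jiffy at
1100, a non-offloaded CPU may defer for 150 jiffies, which is 600 ms at
HZ=250. The unsigned subtraction keeps the result correct across a
jiffies wrap.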


* [PATCH 5/5] sched/isolation: Document "nohz_offload" flag
  2018-01-04  4:25 [GIT PULL] isolation: 1Hz residual tick offloading v3 Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2018-01-04  4:25 ` [PATCH 4/5] sched/isolation: Residual 1Hz scheduler tick offload Frederic Weisbecker
@ 2018-01-04  4:25 ` Frederic Weisbecker
  2018-01-12 19:18 ` [GIT PULL] isolation: 1Hz residual tick offloading v3 Luiz Capitulino
  5 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2018-01-04  4:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Wanpeng Li, Mike Galbraith, Rik van Riel

Document the interface to offload the 1Hz scheduler tick in full
dynticks mode. Also improve the comment about the existing "nohz" flag
in order to differentiate its behaviour.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index af7104a..2524296 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1749,7 +1749,12 @@
 			specified in the flag list (default: domain):
 
 			nohz
-			  Disable the tick when a single task runs.
+			  Disable the tick when a single task runs. A residual 1Hz
+			  tick remains to maintain scheduler stats alive.
+			nohz_offload
+			  Like nohz but the residual 1Hz tick is offloaded to
+			  housekeeping CPUs, leaving the CPU free of any tick if
+			  nothing else requests it.
 			domain
 			  Isolate from the general SMP balancing and scheduling
 			  algorithms. Note that performing domain isolation this way
-- 
2.7.4


* Re: [GIT PULL] isolation: 1Hz residual tick offloading v3
  2018-01-04  4:25 [GIT PULL] isolation: 1Hz residual tick offloading v3 Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2018-01-04  4:25 ` [PATCH 5/5] sched/isolation: Document "nohz_offload" flag Frederic Weisbecker
@ 2018-01-12 19:18 ` Luiz Capitulino
  2018-01-16 15:41   ` Frederic Weisbecker
  5 siblings, 1 reply; 22+ messages in thread
From: Luiz Capitulino @ 2018-01-12 19:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Christoph Lameter, Paul E . McKenney,
	Wanpeng Li, Mike Galbraith, Rik van Riel

On Thu,  4 Jan 2018 05:25:32 +0100
Frederic Weisbecker <frederic@kernel.org> wrote:

> Ingo,
> 
> Please pull the sched/0hz branch that can be found at:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> 	sched/0hz
> 
> HEAD: 9e932b2cc707209febd130978a5eb9f4a943a3f4
> 
> --
> Now that scheduler_tick() has become resilient towards the absence of
> ticks, current->sched_class->task_tick() is the last piece that needs
> at least a 1Hz tick to keep scheduler stats alive.
> 
> This patchset adds a flag to the isolcpus boot option to offload the
> residual 1Hz tick. This way the nohz_full CPUs no longer have any tick
> (assuming nothing else requires it), as their residual 1Hz tick is
> offloaded to the housekeepers.
> 
> For quick testing, say on CPUs 1-7:
> 
> 	"isolcpus=nohz_offload,domain,1-7"

Sorry for being very late to this series, but I've a few comments to
make (one right now and others in individual patches).

Why are we extending isolcpus= given that it's a deprecated interface?
Some people have already moved away from isolcpus= now, but with this
new feature they will be forced back to using it.

What about just adding the new functionality to nohz_full=? That is,
no new options, just make the tick go away since this has always been
what nohz_full= was intended to do?

> 
> Thanks,
> 	Frederic
> ---
> 
> Frederic Weisbecker (5):
>       sched: Rename init_rq_hrtick to hrtick_rq_init
>       sched/isolation: Add scheduler tick offloading interface
>       nohz: Allow to check if remote CPU tick is stopped
>       sched/isolation: Residual 1Hz scheduler tick offload
>       sched/isolation: Document "nohz_offload" flag
> 
> 
>  Documentation/admin-guide/kernel-parameters.txt |  7 +-
>  include/linux/sched/isolation.h                 |  3 +-
>  include/linux/tick.h                            |  2 +
>  kernel/sched/core.c                             | 94 +++++++++++++++++++++++--
>  kernel/sched/isolation.c                        | 10 +++
>  kernel/sched/sched.h                            |  2 +
>  kernel/time/tick-sched.c                        |  7 ++
>  7 files changed, 117 insertions(+), 8 deletions(-)
> 


* Re: [PATCH 4/5] sched/isolation: Residual 1Hz scheduler tick offload
  2018-01-04  4:25 ` [PATCH 4/5] sched/isolation: Residual 1Hz scheduler tick offload Frederic Weisbecker
@ 2018-01-12 19:22   ` Luiz Capitulino
  2018-01-16 15:57     ` Frederic Weisbecker
  0 siblings, 1 reply; 22+ messages in thread
From: Luiz Capitulino @ 2018-01-12 19:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Christoph Lameter, Paul E . McKenney,
	Wanpeng Li, Mike Galbraith, Rik van Riel

On Thu,  4 Jan 2018 05:25:36 +0100
Frederic Weisbecker <frederic@kernel.org> wrote:

> When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> keep the scheduler stats alive. However this residual tick is a burden
> for bare metal tasks that can't stand any interruption at all, or want
> to minimize them.
> 
> Adding the boot parameter "isolcpus=nohz_offload" will now outsource
> these scheduler ticks to the global workqueue so that a housekeeping CPU
> handles that tick remotely.
> 
> Note it's still up to the user to affine the global workqueues to the
> housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or
> domain isolation.
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: Chris Metcalf <cmetcalf@mellanox.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Wanpeng Li <kernellwp@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> ---
>  kernel/sched/core.c      | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/isolation.c |  4 +++
>  kernel/sched/sched.h     |  2 ++
>  3 files changed, 91 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d72d0e9..b964890 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3052,9 +3052,14 @@ void scheduler_tick(void)
>   */
>  u64 scheduler_tick_max_deferment(void)
>  {
> -	struct rq *rq = this_rq();
> -	unsigned long next, now = READ_ONCE(jiffies);
> +	struct rq *rq;
> +	unsigned long next, now;
>  
> +	if (!housekeeping_cpu(smp_processor_id(), HK_FLAG_TICK_SCHED))
> +		return ktime_to_ns(KTIME_MAX);
> +
> +	rq = this_rq();
> +	now = READ_ONCE(jiffies);
>  	next = rq->last_sched_tick + HZ;
>  
>  	if (time_before_eq(next, now))
> @@ -3062,7 +3067,82 @@ u64 scheduler_tick_max_deferment(void)
>  
>  	return jiffies_to_nsecs(next - now);
>  }
> -#endif
> +
> +struct tick_work {
> +	int			cpu;
> +	struct delayed_work	work;
> +};
> +
> +static struct tick_work __percpu *tick_work_cpu;
> +
> +static void sched_tick_remote(struct work_struct *work)
> +{
> +	struct delayed_work *dwork = to_delayed_work(work);
> +	struct tick_work *twork = container_of(dwork, struct tick_work, work);
> +	int cpu = twork->cpu;
> +	struct rq *rq = cpu_rq(cpu);
> +	struct rq_flags rf;
> +
> +	/*
> +	 * Handle the tick only if it appears the remote CPU is running
> +	 * in full dynticks mode. The check is racy by nature, but
> +	 * missing a tick or having one too much is no big deal.
> +	 */
> +	if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
> +		rq_lock_irq(rq, &rf);
> +		update_rq_clock(rq);
> +		rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> +		rq_unlock_irq(rq, &rf);
> +	}

OK, so this executes task_tick() remotely. What about account_process_tick()?
Don't we need it as well?

In particular, when I run a hog application on a nohz_full core configured
with tick offload, I can see in top that the CPU usage goes from 100%
to idle for a few seconds every couple of seconds. Could this be related?

Also, in my testing I still sometimes see the tick, sometimes at 10 or
20 second intervals. Is this expected? I'll dig deeper next week.

> +
> +	queue_delayed_work(system_unbound_wq, dwork, HZ);
> +}
> +
> +static void sched_tick_start(int cpu)
> +{
> +	struct tick_work *twork;
> +
> +	if (housekeeping_cpu(cpu, HK_FLAG_TICK_SCHED))
> +		return;
> +
> +	WARN_ON_ONCE(!tick_work_cpu);
> +
> +	twork = per_cpu_ptr(tick_work_cpu, cpu);
> +	twork->cpu = cpu;
> +	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> +	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +}
> +
> +#ifdef CONFIG_HOTPLUG_CPU
> +static void sched_tick_stop(int cpu)
> +{
> +	struct tick_work *twork;
> +
> +	if (housekeeping_cpu(cpu, HK_FLAG_TICK_SCHED))
> +		return;
> +
> +	WARN_ON_ONCE(!tick_work_cpu);
> +
> +	twork = per_cpu_ptr(tick_work_cpu, cpu);
> +	cancel_delayed_work_sync(&twork->work);
> +}
> +#endif /* CONFIG_HOTPLUG_CPU */
> +
> +int __init sched_tick_offload_init(void)
> +{
> +	tick_work_cpu = alloc_percpu(struct tick_work);
> +	if (!tick_work_cpu) {
> +		pr_err("Can't allocate remote tick struct\n");
> +		return -ENOMEM;
> +	}
> +
> +	return 0;
> +}
> +
> +#else
> +static void sched_tick_start(int cpu) { }
> +static void sched_tick_stop(int cpu) { }
> +#endif /* CONFIG_NO_HZ_FULL */
>  
>  #if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
>  				defined(CONFIG_PREEMPT_TRACER))
> @@ -5713,6 +5793,7 @@ int sched_cpu_starting(unsigned int cpu)
>  {
>  	set_cpu_rq_start_time(cpu);
>  	sched_rq_cpu_starting(cpu);
> +	sched_tick_start(cpu);
>  	return 0;
>  }
>  
> @@ -5724,6 +5805,7 @@ int sched_cpu_dying(unsigned int cpu)
>  
>  	/* Handle pending wakeups and then migrate everything off */
>  	sched_ttwu_pending();
> +	sched_tick_stop(cpu);
>  
>  	rq_lock_irqsave(rq, &rf);
>  	if (rq->rd) {
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 264ddcd..c5e7e90a 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -12,6 +12,7 @@
>  #include <linux/kernel.h>
>  #include <linux/static_key.h>
>  #include <linux/ctype.h>
> +#include "sched.h"
>  
>  DEFINE_STATIC_KEY_FALSE(housekeeping_overriden);
>  EXPORT_SYMBOL_GPL(housekeeping_overriden);
> @@ -60,6 +61,9 @@ void __init housekeeping_init(void)
>  
>  	static_branch_enable(&housekeeping_overriden);
>  
> +	if (housekeeping_flags & HK_FLAG_TICK_SCHED)
> +		sched_tick_offload_init();
> +
>  	/* We need at least one CPU to handle housekeeping work */
>  	WARN_ON_ONCE(cpumask_empty(housekeeping_mask));
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b19552a2..5a3b82c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1587,6 +1587,7 @@ extern void post_init_entity_util_avg(struct sched_entity *se);
>  
>  #ifdef CONFIG_NO_HZ_FULL
>  extern bool sched_can_stop_tick(struct rq *rq);
> +extern int __init sched_tick_offload_init(void);
>  
>  /*
>   * Tick may be needed by tasks in the runqueue depending on their policy and
> @@ -1611,6 +1612,7 @@ static inline void sched_update_tick_dependency(struct rq *rq)
>  		tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
>  }
>  #else
> +static inline int sched_tick_offload_init(void) { return 0; }
>  static inline void sched_update_tick_dependency(struct rq *rq) { }
>  #endif
>  


* Re: [GIT PULL] isolation: 1Hz residual tick offloading v3
  2018-01-12 19:18 ` [GIT PULL] isolation: 1Hz residual tick offloading v3 Luiz Capitulino
@ 2018-01-16 15:41   ` Frederic Weisbecker
  2018-01-16 16:52     ` Luiz Capitulino
  2018-01-16 17:58     ` Mike Galbraith
  0 siblings, 2 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2018-01-16 15:41 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Christoph Lameter, Paul E . McKenney,
	Wanpeng Li, Mike Galbraith, Rik van Riel

On Fri, Jan 12, 2018 at 02:18:13PM -0500, Luiz Capitulino wrote:
> On Thu,  4 Jan 2018 05:25:32 +0100
> Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> > Ingo,
> > 
> > Please pull the sched/0hz branch that can be found at:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > 	sched/0hz
> > 
> > HEAD: 9e932b2cc707209febd130978a5eb9f4a943a3f4
> > 
> > --
> > Now that scheduler_tick() has become resilient towards the absence of
> > ticks, current->sched_class->task_tick() is the last piece that needs
> > at least a 1Hz tick to keep scheduler stats alive.
> > 
> > This patchset adds a flag to the isolcpus boot option to offload the
> > residual 1Hz tick. This way the nohz_full CPUs no longer have any tick
> > (assuming nothing else requires it), as their residual 1Hz tick is
> > offloaded to the housekeepers.
> > 
> > For quick testing, say on CPUs 1-7:
> > 
> > 	"isolcpus=nohz_offload,domain,1-7"
> 
> Sorry for being very late to this series, but I've a few comments to
> make (one right now and others in individual patches).
> 
> Why are we extending isolcpus= given that it's a deprecated interface?
> Some people have already moved away from isolcpus= now, but with this
> new feature they will be forced back to using it.

I tried to remove isolcpus, or at least change the way it works so that its
effects are reversible (ie: affine the init task instead of isolating domains),
but that got NACKed because userspace relies on the existing behaviour.

That's when I realized that kernel parameters are like userspace ABIs,
they can't be removed easily whether we deprecate them or not.

Also I needed to be able to control the various isolation features, and
nohz_full is the wrong place to do that as nohz_full is really just an
isolation feature like the others, nohz_full= should really just imply
full dynticks and not watchdog, workqueue or tilegx NAPI isolation...

So isolcpus= is now the place where we control the isolation features
and nohz is one of them.

The complaint about isolcpus is that its result is immutable. I'm thinking
about making it modifiable through cpusets, but I only see two possible
solutions:

- Make the root cpuset modifiable
- Create a directory called "isolcpus" visible on the first cpuset mount
  and move all processes there.
 
> What about just adding the new functionality to nohz_full=? That is,
> no new options, just make the tick go away since this has always been
> what nohz_full= was intended to do?

We can, or have isolcpus=nohz do it, as both do almost the same thing.

But I'm worried about the overhead for people used to nohz_full= once
they upgrade their kernels and see those workqueues run once per second.

We can still affine those workqueues (in fact the whole unbound workqueue
mask) outside the nohz_full range. Still, current users may be surprised
by that new overhead on housekeeping CPUs...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] sched/isolation: Residual 1Hz scheduler tick offload
  2018-01-12 19:22   ` Luiz Capitulino
@ 2018-01-16 15:57     ` Frederic Weisbecker
  2018-01-16 16:53       ` Luiz Capitulino
  0 siblings, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2018-01-16 15:57 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Christoph Lameter, Paul E . McKenney,
	Wanpeng Li, Mike Galbraith, Rik van Riel

On Fri, Jan 12, 2018 at 02:22:58PM -0500, Luiz Capitulino wrote:
> On Thu,  4 Jan 2018 05:25:36 +0100
> Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> > When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> > keep the scheduler stats alive. However this residual tick is a burden
> > for bare metal tasks that can't stand any interruption at all, or want
> > to minimize them.
> > 
> > Adding the boot parameter "isolcpus=nohz_offload" will now outsource
> > these scheduler ticks to the global workqueue so that a housekeeping CPU
> > handles that tick remotely.
> > 
> > Note it's still up to the user to affine the global workqueues to the
> > housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or
> > domains isolation.
> > 
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > Cc: Chris Metcalf <cmetcalf@mellanox.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Luiz Capitulino <lcapitulino@redhat.com>
> > Cc: Mike Galbraith <efault@gmx.de>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Wanpeng Li <kernellwp@gmail.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > ---
> >  kernel/sched/core.c      | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
> >  kernel/sched/isolation.c |  4 +++
> >  kernel/sched/sched.h     |  2 ++
> >  3 files changed, 91 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d72d0e9..b964890 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3052,9 +3052,14 @@ void scheduler_tick(void)
> >   */
> >  u64 scheduler_tick_max_deferment(void)
> >  {
> > -	struct rq *rq = this_rq();
> > -	unsigned long next, now = READ_ONCE(jiffies);
> > +	struct rq *rq;
> > +	unsigned long next, now;
> >  
> > +	if (!housekeeping_cpu(smp_processor_id(), HK_FLAG_TICK_SCHED))
> > +		return ktime_to_ns(KTIME_MAX);
> > +
> > +	rq = this_rq();
> > +	now = READ_ONCE(jiffies);
> >  	next = rq->last_sched_tick + HZ;
> >  
> >  	if (time_before_eq(next, now))
> > @@ -3062,7 +3067,82 @@ u64 scheduler_tick_max_deferment(void)
> >  
> >  	return jiffies_to_nsecs(next - now);
> >  }
> > -#endif
> > +
> > +struct tick_work {
> > +	int			cpu;
> > +	struct delayed_work	work;
> > +};
> > +
> > +static struct tick_work __percpu *tick_work_cpu;
> > +
> > +static void sched_tick_remote(struct work_struct *work)
> > +{
> > +	struct delayed_work *dwork = to_delayed_work(work);
> > +	struct tick_work *twork = container_of(dwork, struct tick_work, work);
> > +	int cpu = twork->cpu;
> > +	struct rq *rq = cpu_rq(cpu);
> > +	struct rq_flags rf;
> > +
> > +	/*
> > +	 * Handle the tick only if it appears the remote CPU is running
> > +	 * in full dynticks mode. The check is racy by nature, but
> > +	 * missing a tick or having one too much is no big deal.
> > +	 */
> > +	if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
> > +		rq_lock_irq(rq, &rf);
> > +		update_rq_clock(rq);
> > +		rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> > +		rq_unlock_irq(rq, &rf);
> > +	}
> 
> OK, so this executes task_tick() remotely. What about account_process_tick()?
> Don't we need it as well?

Nope, tasks in nohz_full mode have their special accounting that doesn't
rely on the tick.
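
As for how the quoted sched_tick_remote() finds the CPU it has to tick for:
the workqueue callback only receives a pointer to the innermost work item,
and two container_of() steps walk back out to the enclosing struct tick_work.
A minimal userspace sketch of that pattern follows. The fake_work and
fake_delayed_work types and the tick_work_cpu_of() helper are hypothetical
stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(): step back from a
 * pointer to an embedded member to the structure that contains it. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for struct work_struct and struct delayed_work. */
struct fake_work {
	int pending;
};

struct fake_delayed_work {
	struct fake_work work;
	long timer;
};

/* Same layout idea as the struct tick_work in the patch. */
struct tick_work {
	int cpu;
	struct fake_delayed_work work;
};

/* Mirrors the first lines of sched_tick_remote(): recover the target CPU
 * from nothing but the innermost work pointer. */
int tick_work_cpu_of(struct fake_work *work)
{
	struct fake_delayed_work *dwork =
		container_of(work, struct fake_delayed_work, work);
	struct tick_work *twork = container_of(dwork, struct tick_work, work);

	assert(twork != NULL);
	return twork->cpu;
}
```

Passing &tw.work.work for a struct tick_work tw with tw.cpu set to 5 gives
back 5, which is why the patch can keep one tick_work per CPU and needs no
separate lookup table.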

> 
> In particular, when I run a hog application on a nohz_full core configured
> with tick offload, I can see in top that the CPU usage goes from 100%
> to idle for a few seconds every couple of seconds. Could this be related?
> 
> Also, in my testing I'm sometimes seeing the tick. Sometimes at 10 or
> 20 seconds interval. Is this expected? I'll dig deeper next week.

That's expected, see the changelog: the offload is not affine by default.
You need to either also isolate the domains:

    isolcpus=nohz_offload,domain

or tweak the workqueue cpumask through:

    /sys/devices/virtual/workqueue/cpumask
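
For the second option, a rough sketch of computing the mask to write, assuming
the file takes a hexadecimal CPU bitmask and the machine has at most 32 CPUs
so the mask fits in a single hex word. The helper names housekeeping_mask()
and write_cpumask() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Mask of housekeeping CPUs: every CPU in 0..nr_cpus-1 that lies outside
 * the isolated range [first_iso, last_iso]. Limited to 32 CPUs here so the
 * result fits one hex word (an assumption made for this sketch). */
uint32_t housekeeping_mask(unsigned nr_cpus, unsigned first_iso,
			   unsigned last_iso)
{
	uint32_t mask = 0;

	assert(nr_cpus >= 1 && nr_cpus <= 32);
	for (unsigned cpu = 0; cpu < nr_cpus; cpu++)
		if (cpu < first_iso || cpu > last_iso)
			mask |= UINT32_C(1) << cpu;
	return mask;
}

/* Write the mask as hex to the given file. Returns 0 on success and -1 on
 * any failure (missing file, no permission, short write). */
int write_cpumask(const char *path, uint32_t mask)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%x\n", (unsigned)mask) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

On an 8 CPU box with CPUs 1-7 isolated, housekeeping_mask(8, 1, 7) is 0x1, and
write_cpumask("/sys/devices/virtual/workqueue/cpumask", 0x1) run as root would
confine the unbound workqueues, and with them the offloaded tick, to CPU 0.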

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] isolation: 1Hz residual tick offloading v3
  2018-01-16 15:41   ` Frederic Weisbecker
@ 2018-01-16 16:52     ` Luiz Capitulino
  2018-01-16 22:51       ` Frederic Weisbecker
  2018-01-16 17:58     ` Mike Galbraith
  1 sibling, 1 reply; 22+ messages in thread
From: Luiz Capitulino @ 2018-01-16 16:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Christoph Lameter, Paul E . McKenney,
	Wanpeng Li, Mike Galbraith, Rik van Riel

On Tue, 16 Jan 2018 16:41:00 +0100
Frederic Weisbecker <frederic@kernel.org> wrote:

> On Fri, Jan 12, 2018 at 02:18:13PM -0500, Luiz Capitulino wrote:
> > On Thu,  4 Jan 2018 05:25:32 +0100
> > Frederic Weisbecker <frederic@kernel.org> wrote:
> >   
> > > Ingo,
> > > 
> > > Please pull the sched/0hz branch that can be found at:
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > > 	sched/0hz
> > > 
> > > HEAD: 9e932b2cc707209febd130978a5eb9f4a943a3f4
> > > 
> > > --
> > > Now that scheduler_tick() has become resilient towards the absence of
> > > ticks, current->sched_class->task_tick() is the last piece that needs
> > > at least 1Hz tick to keep scheduler stats alive.
> > > 
> > > This patchset adds a flag to the isolcpus boot option to offload the
> > > residual 1Hz tick. This way the nohz_full CPUs no longer have any tick
> > > (assuming nothing else requires it) as their residual 1Hz tick is
> > > offloaded to the housekeepers.
> > > 
> > > For quick testing, say on CPUs 1-7:
> > > 
> > > 	"isolcpus=nohz_offload,domain,1-7"  
> > 
> > Sorry for being very late to this series, but I've a few comments to
> > make (one right now and others in individual patches).
> > 
> > Why are we extending isolcpus= given that it's a deprecated interface?
> > Some people have already moved away from isolcpus= now, but with this
> > new feature they will be forced back to using it.  
> 
> I tried to remove isolcpus or at least change the way it works so that its
> effects are reversible (ie: affine the init task instead of isolating domains)
> but that got nacked due to the behaviour's expectations for userspace.
> 
> That's when I realized that kernel parameters are like userspace ABIs,
> they can't be removed easily whether we deprecate them or not.
> 
> Also I needed to be able to control the various isolation features, and
> nohz_full is the wrong place to do that as nohz_full is really just an
> isolation feature like the others, nohz_full= should really just imply
> full dynticks and not watchdog, workqueue or tilegx NAPI isolation...

Yeah, I completely agree with that.

> So isolcpus= is now the place where we control the isolation features
> and nohz is one of them.

That's the part I'm not very sure about. We've been advising users to
move away from isolcpus= when possible, but this much-wanted nohz_offload
feature will force everyone back to using isolcpus= again.

I have the impression this series is trying to solve two problems:

 1. How (and where) we control the various isolation features in the
    kernel

 2. Where we add the control for the tick offload feature

I think item 1 is too complex to solve right now. IMHO, this series
should focus on item 2. And regarding item 2, I think we have two
choices to make:

 1. Make tick offload a first class citizen by making it default to
    nohz_full=. If there are regressions, we handle them

 2. Add a new option to nohz_full=, like nohz_full=tick_offload

As an avid user of nohz_full I'm dying to see option 1 happening,
but I'm not totally sure what the consequences can be.

Another idea is to add CONFIG_NOHZ_TICK_OFFLOAD as an experimental
feature.

> The complaint about isolcpus is that its result is immutable. I'm thinking about
> making it modifiable through cpusets but I only see two possible solutions:
> 
> - Make the root cpuset modifiable
> - Create a directory called "isolcpus" visible on the first cpuset mount
>   and move all processes there.

So, if we move the control of the tick offload to nohz_full= itself,
we can completely ditch any isolcpus= change in this series.

I think this should give you a great relief :)

> > What about just adding the new functionality to nohz_full=? That is,
> > no new options, just make the tick go away since this has always been
> > what nohz_full= was intended to do?  
> 
> We can, or have isolcpus=nohz to do it, as both do almost the same.
> 
> But I'm worried about the overhead for people used to nohz_full= once
> they upgrade their kernels and see those workqueues once per second.
> 
> We can still affine those workqueues (in fact the whole unbound workqueue
> mask) outside the nohz_full range. Still current users may be surprised
> about that new overhead on housekeeping CPUs...
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] sched/isolation: Residual 1Hz scheduler tick offload
  2018-01-16 15:57     ` Frederic Weisbecker
@ 2018-01-16 16:53       ` Luiz Capitulino
  0 siblings, 0 replies; 22+ messages in thread
From: Luiz Capitulino @ 2018-01-16 16:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Christoph Lameter, Paul E . McKenney,
	Wanpeng Li, Mike Galbraith, Rik van Riel

On Tue, 16 Jan 2018 16:57:45 +0100
Frederic Weisbecker <frederic@kernel.org> wrote:

> On Fri, Jan 12, 2018 at 02:22:58PM -0500, Luiz Capitulino wrote:
> > On Thu,  4 Jan 2018 05:25:36 +0100
> > Frederic Weisbecker <frederic@kernel.org> wrote:
> >   
> > > When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> > > keep the scheduler stats alive. However this residual tick is a burden
> > > for bare metal tasks that can't stand any interruption at all, or want
> > > to minimize them.
> > > 
> > > Adding the boot parameter "isolcpus=nohz_offload" will now outsource
> > > these scheduler ticks to the global workqueue so that a housekeeping CPU
> > > handles that tick remotely.
> > > 
> > > Note it's still up to the user to affine the global workqueues to the
> > > housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or
> > > domains isolation.
> > > 
> > > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > > Cc: Chris Metcalf <cmetcalf@mellanox.com>
> > > Cc: Christoph Lameter <cl@linux.com>
> > > Cc: Luiz Capitulino <lcapitulino@redhat.com>
> > > Cc: Mike Galbraith <efault@gmx.de>
> > > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Rik van Riel <riel@redhat.com>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > Cc: Wanpeng Li <kernellwp@gmail.com>
> > > Cc: Ingo Molnar <mingo@kernel.org>
> > > ---
> > >  kernel/sched/core.c      | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
> > >  kernel/sched/isolation.c |  4 +++
> > >  kernel/sched/sched.h     |  2 ++
> > >  3 files changed, 91 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index d72d0e9..b964890 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -3052,9 +3052,14 @@ void scheduler_tick(void)
> > >   */
> > >  u64 scheduler_tick_max_deferment(void)
> > >  {
> > > -	struct rq *rq = this_rq();
> > > -	unsigned long next, now = READ_ONCE(jiffies);
> > > +	struct rq *rq;
> > > +	unsigned long next, now;
> > >  
> > > +	if (!housekeeping_cpu(smp_processor_id(), HK_FLAG_TICK_SCHED))
> > > +		return ktime_to_ns(KTIME_MAX);
> > > +
> > > +	rq = this_rq();
> > > +	now = READ_ONCE(jiffies);
> > >  	next = rq->last_sched_tick + HZ;
> > >  
> > >  	if (time_before_eq(next, now))
> > > @@ -3062,7 +3067,82 @@ u64 scheduler_tick_max_deferment(void)
> > >  
> > >  	return jiffies_to_nsecs(next - now);
> > >  }
> > > -#endif
> > > +
> > > +struct tick_work {
> > > +	int			cpu;
> > > +	struct delayed_work	work;
> > > +};
> > > +
> > > +static struct tick_work __percpu *tick_work_cpu;
> > > +
> > > +static void sched_tick_remote(struct work_struct *work)
> > > +{
> > > +	struct delayed_work *dwork = to_delayed_work(work);
> > > +	struct tick_work *twork = container_of(dwork, struct tick_work, work);
> > > +	int cpu = twork->cpu;
> > > +	struct rq *rq = cpu_rq(cpu);
> > > +	struct rq_flags rf;
> > > +
> > > +	/*
> > > +	 * Handle the tick only if it appears the remote CPU is running
> > > +	 * in full dynticks mode. The check is racy by nature, but
> > > +	 * missing a tick or having one too much is no big deal.
> > > +	 */
> > > +	if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
> > > +		rq_lock_irq(rq, &rf);
> > > +		update_rq_clock(rq);
> > > +		rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> > > +		rq_unlock_irq(rq, &rf);
> > > +	}  
> > 
> > OK, so this executes task_tick() remotely. What about account_process_tick()?
> > Don't we need it as well?  
> 
> Nope, tasks in nohz_full mode have their special accounting that doesn't
> rely on the tick.

OK, excellent.

> > In particular, when I run a hog application on a nohz_full core configured
> > with tick offload, I can see in top that the CPU usage goes from 100%
> > to idle for a few seconds every couple of seconds. Could this be related?
> > 
> > Also, in my testing I'm sometimes seeing the tick. Sometimes at 10 or
> > 20 seconds interval. Is this expected? I'll dig deeper next week.  
> 
> That's expected, see the changelog: the offload is not affine by default.
> You need to either also isolate the domains:
> 
>     isolcpus=nohz_offload,domain
> 
> or tweak the workqueue cpumask through:
> 
>     /sys/devices/virtual/workqueue/cpumask

Yeah, I already do that. Later today or tomorrow I'll debug this to
see if the problem is in my setup or not.

> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] isolation: 1Hz residual tick offloading v3
  2018-01-16 15:41   ` Frederic Weisbecker
  2018-01-16 16:52     ` Luiz Capitulino
@ 2018-01-16 17:58     ` Mike Galbraith
  2018-01-16 22:53       ` Frederic Weisbecker
  2018-01-17 14:51       ` Christopher Lameter
  1 sibling, 2 replies; 22+ messages in thread
From: Mike Galbraith @ 2018-01-16 17:58 UTC (permalink / raw)
  To: Frederic Weisbecker, Luiz Capitulino
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Christoph Lameter, Paul E . McKenney,
	Wanpeng Li, Rik van Riel

On Tue, 2018-01-16 at 16:41 +0100, Frederic Weisbecker wrote:
> On Fri, Jan 12, 2018 at 02:18:13PM -0500, Luiz Capitulino wrote:
> 
> > Why are we extending isolcpus= given that it's a deprecated interface?
> > Some people have already moved away from isolcpus= now, but with this
> > new feature they will be forced back to using it.
> 
> I tried to remove isolcpus or at least change the way it works so that its
> effects are reversible (ie: affine the init task instead of isolating domains)
> but that got nacked due to the behaviour's expectations for userspace.

So we paint ourselves into a static corner forever more, despite every
bit of this being all about "properties of sets of cpus", ie precisely
what cpusets was born to do.  That's sad, dynamic wasn't that far away.

	-Mike

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] isolation: 1Hz residual tick offloading v3
  2018-01-16 16:52     ` Luiz Capitulino
@ 2018-01-16 22:51       ` Frederic Weisbecker
  2018-01-17 17:38         ` Luiz Capitulino
  0 siblings, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2018-01-16 22:51 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Christoph Lameter, Paul E . McKenney,
	Wanpeng Li, Mike Galbraith, Rik van Riel

On Tue, Jan 16, 2018 at 11:52:11AM -0500, Luiz Capitulino wrote:
> On Tue, 16 Jan 2018 16:41:00 +0100
> Frederic Weisbecker <frederic@kernel.org> wrote:
> > So isolcpus= is now the place where we control the isolation features
> > and nohz is one of them.
> 
> That's the part I'm not very sure about. We've been advising users to
> move away from isolcpus= when possible, but this much-wanted nohz_offload
> feature will force everyone back to using isolcpus= again.

Note "isolcpus=nohz" only implies nohz. You need to add "domain" to get
the behaviour that you've been advising users against. We are simply
reusing a kernel parameter that was abandoned to now control the isolation
features that were disorganized and opaque behind nohz.
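
To make the flag semantics concrete, here is a small userspace sketch of how
a leading flag list such as "nohz_offload,domain,1-7" can be split from the
CPU list. It is not the kernel parser: the ISOL_* names and parse_isolcpus()
are hypothetical, and falling back to domain isolation when no flag is given
is an assumption about the legacy behaviour.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag bits, one per isolation feature named in this thread. */
enum {
	ISOL_DOMAIN       = 1 << 0,
	ISOL_NOHZ         = 1 << 1,
	ISOL_NOHZ_OFFLOAD = 1 << 2,
};

/* Consume leading "flag," tokens from arg, return the flag bits and point
 * *cpulist at whatever follows the last recognised flag. */
unsigned parse_isolcpus(const char *arg, const char **cpulist)
{
	static const struct {
		const char *name;	/* includes the trailing comma */
		unsigned flag;
	} known[] = {
		{ "nohz_offload,", ISOL_NOHZ_OFFLOAD },
		{ "nohz,",         ISOL_NOHZ },
		{ "domain,",       ISOL_DOMAIN },
	};
	unsigned flags = 0;
	int matched;

	assert(arg != NULL && cpulist != NULL);
	do {
		matched = 0;
		for (size_t i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
			size_t len = strlen(known[i].name);

			if (strncmp(arg, known[i].name, len) == 0) {
				flags |= known[i].flag;
				arg += len;
				matched = 1;
				break;
			}
		}
	} while (matched);

	/* Assumption: a bare CPU list keeps the old domain-isolation meaning. */
	if (!flags)
		flags = ISOL_DOMAIN;

	*cpulist = arg;
	return flags;
}
```

With this sketch, "nohz,1-7" yields only ISOL_NOHZ, while the scheduler
domain isolation that users were advised against shows up only when "domain"
is listed or no flag is given at all.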

> 
> I have the impression this series is trying to solve two problems:
> 
>  1. How (and where) we control the various isolation features in the
>     kernel

No, that has already been done in the previous merge window. We have a
dedicated isolation subsystem now (kernel/sched/isolation.c) and an
interface to control all these isolation features that were abusively
implied by nohz. The initial plan was to introduce "cpu_isolation=" but it
looked too much like "isolcpus=". So in fact, why not use "isolcpus=" and
give it a second life? And there we are.

In the end the goal is to propagate what is passed to "isolcpus=" to cpusets.


> 
>  2. Where we add the control for the tick offload feature
> 
> I think item 1 is too complex to solve right now. IMHO, this series
> should focus on item 2. And regarding item 2, I think we have two
> choices to make:
> 
>  1. Make tick offload a first class citizen by making it default to
>     nohz_full=. If there are regressions, we handle them

That's a possible way to go.

> 
>  2. Add a new option to nohz_full=, like nohz_full=tick_offload
> 
> As an avid user of nohz_full I'm dying to see option 1 happening,
> but I'm not totally sure what the consequences can be.

"nohz_full=" parameter has been badly designed as it implies much more
than just full dynticks. So I'm not really looking forward to expanding
it.

> Another idea is to add CONFIG_NOHZ_TICK_OFFLOAD as an experimental
> feature.

I fear it's way too distro-unfriendly. They will want to have it as a
capability without necessarily running it. Just like they do with
CONFIG_NO_HZ_FULL.

> 
> > The complaint about isolcpus is that its result is immutable. I'm thinking about
> > making it modifiable through cpusets but I only see two possible solutions:
> > 
> > - Make the root cpuset modifiable
> > - Create a directory called "isolcpus" visible on the first cpuset mount
> >   and move all processes there.
> 
> So, if we move the control of the tick offload to nohz_full= itself,
> we can completely ditch any isolcpus= change in this series.
> 
> I think this should give you a great relief :)

Not at all :)

What would be a great relief to me is that we can finally propagate isolcpus=
to cpusets so that we can continue to expand it without a second thought.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] isolation: 1Hz residual tick offloading v3
  2018-01-16 17:58     ` Mike Galbraith
@ 2018-01-16 22:53       ` Frederic Weisbecker
  2018-01-17 14:51       ` Christopher Lameter
  1 sibling, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2018-01-16 22:53 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Luiz Capitulino, Ingo Molnar, LKML, Peter Zijlstra,
	Chris Metcalf, Thomas Gleixner, Christoph Lameter,
	Paul E . McKenney, Wanpeng Li, Rik van Riel

On Tue, Jan 16, 2018 at 06:58:18PM +0100, Mike Galbraith wrote:
> On Tue, 2018-01-16 at 16:41 +0100, Frederic Weisbecker wrote:
> > On Fri, Jan 12, 2018 at 02:18:13PM -0500, Luiz Capitulino wrote:
> > 
> > > Why are we extending isolcpus= given that it's a deprecated interface?
> > > Some people have already moved away from isolcpus= now, but with this
> > > new feature they will be forced back to using it.
> > 
> > I tried to remove isolcpus or at least change the way it works so that its
> > effects are reversible (ie: affine the init task instead of isolating domains)
> > but that got nacked due to the behaviour's expectations for userspace.
> 
> So we paint ourselves into a static corner forever more, despite every
> bit of this being all about "properties of sets of cpus", ie precisely
> what cpusets was born to do.  That's sad, dynamic wasn't that far away.

Hence why we need to propagate "isolcpus=" to cpusets.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] isolation: 1Hz residual tick offloading v3
  2018-01-16 17:58     ` Mike Galbraith
  2018-01-16 22:53       ` Frederic Weisbecker
@ 2018-01-17 14:51       ` Christopher Lameter
  2018-01-17 15:59         ` Mike Galbraith
  1 sibling, 1 reply; 22+ messages in thread
From: Christopher Lameter @ 2018-01-17 14:51 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Frederic Weisbecker, Luiz Capitulino, Ingo Molnar, LKML,
	Peter Zijlstra, Chris Metcalf, Thomas Gleixner,
	Paul E . McKenney, Wanpeng Li, Rik van Riel

On Tue, 16 Jan 2018, Mike Galbraith wrote:

> > I tried to remove isolcpus or at least change the way it works so that its
> > effects are reversible (ie: affine the init task instead of isolating domains)
> > but that got nacked due to the behaviour's expectations for userspace.
>
> So we paint ourselves into a static corner forever more, despite every
> bit of this being all about "properties of sets of cpus", ie precisely
> what cpusets was born to do.  That's sad, dynamic wasn't that far away.

cpusets was born in order to isolate applications to sets of processors.
The properties of sets of cpus were not on the horizon when SGI started
this.

We have sets of cpus associated with affinity masks in the form of bitmaps
etc etc which is much more lightweight than having to lug around the cgroup
overhead everywhere.

A simple bitmask is much better if you have to control detailed system
behavior for each core and are planning each process's role because you
need to make full use of the hardware resources available.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] isolation: 1Hz residual tick offloading v3
  2018-01-17 14:51       ` Christopher Lameter
@ 2018-01-17 15:59         ` Mike Galbraith
  2018-01-17 16:32           ` Christopher Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Mike Galbraith @ 2018-01-17 15:59 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Frederic Weisbecker, Luiz Capitulino, Ingo Molnar, LKML,
	Peter Zijlstra, Chris Metcalf, Thomas Gleixner,
	Paul E . McKenney, Wanpeng Li, Rik van Riel

On Wed, 2018-01-17 at 08:51 -0600, Christopher Lameter wrote:
> On Tue, 16 Jan 2018, Mike Galbraith wrote:
> 
> > > I tried to remove isolcpus or at least change the way it works so that its
> > > effects are reversible (ie: affine the init task instead of isolating domains)
> > > but that got nacked due to the behaviour's expectations for userspace.
> >
> > So we paint ourselves into a static corner forever more, despite every
> > bit of this being all about "properties of sets of cpus", ie precisely
> > what cpusets was born to do.  That's sad, dynamic wasn't that far away.
> 
> cpusets was born in order to isolate applications to sets of processors.
> The properties of sets of cpus were not on the horizon when SGI started
> this.

Domain connectivity very much is a property of a set of CPUs, a rather
important one, and one managed by cpusets.  NOHZ_FULL is a property of
a set of cpus, thus a most excellent fit.  Other things are as well.

> We have sets of cpus associated with affinity masks in the form of bitmaps
> etc etc which is much more lightweight than having to lug around the cgroup
> overhead everywhere.

What does everywhere mean, set creation time?

> A simple bitmask is much better if you have to control detailed system
> behavior for each core and are planning each process's role because you
> need to make full use of the hardware resources available.

If you live in a static world, maybe.

I like the flexibility of being able to configure on the fly.  One tiny
example: for a high performance aircraft manufacturer, having military
simulation background, I know that simulators frequently have to be
ready to go at the drop of a hat, so I twiddled cpusets to let them
flip their extra fancy video game (80 cores, real controls/avionics...
"game over, insert one gold bar to continue" kind of fancy) from low
power idle to full bore hard realtime with one poke to a cpuset file.

Static may be fine for some, for others, dynamic is much better.

	-Mike

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] isolation: 1Hz residual tick offloading v3
  2018-01-17 15:59         ` Mike Galbraith
@ 2018-01-17 16:32           ` Christopher Lameter
  2018-01-17 16:58             ` Mike Galbraith
  0 siblings, 1 reply; 22+ messages in thread
From: Christopher Lameter @ 2018-01-17 16:32 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Frederic Weisbecker, Luiz Capitulino, Ingo Molnar, LKML,
	Peter Zijlstra, Chris Metcalf, Thomas Gleixner,
	Paul E . McKenney, Wanpeng Li, Rik van Riel

On Wed, 17 Jan 2018, Mike Galbraith wrote:

> Domain connectivity very much is a property of a set of CPUs, a rather
> important one, and one managed by cpusets.  NOHZ_FULL is a property of
> a set of cpus, thus a most excellent fit.  Other things are as well.

Not sure what domain refers to in this context.

> > We have sets of cpus associated with affinity masks in the form of bitmaps
> > etc etc which is much more lightweight than having to lug around the cgroup
> > overhead everywhere.
>
> What does everywhere mean, set creation time?

You would need to create multiple cgroups to create what you want. Those
will "inherit" characteristics from higher levels etc etc. It gets
needlessly complicated and difficult to debug if something goes wrong.

> > A simple bitmask is much better if you have to control detailed system
> > behavior for each core and are planning each process's role because you
> > need to make full use of the hardware resources available.
>
> If you live in a static world, maybe.

Why would that be restricted to a static world?

> I like the flexibility of being able to configure on the fly.  One tiny
> example: for a high performance aircraft manufacturer, having military
> simulation background, I know that simulators frequently have to be
> ready to go at the drop of a hat, so I twiddled cpusets to let them
> flip their extra fancy video game (80 cores, real controls/avionics...
> "game over, insert one gold bar to continue" kind of fancy) from low
> power idle to full bore hard realtime with one poke to a cpuset file.
>
> Static may be fine for some, for others, dynamic is much better.

The problem is that I may be flipping a flag in a cpuset to enable
something but some other cpuset somewhere in the complex hierarchy does
something different that causes a conflict. The directness of control is
lost. Instead there is the fog of complexity created by the cgroups that
have various plugins and whatnot.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] isolation: 1Hz residual tick offloading v3
  2018-01-17 16:32           ` Christopher Lameter
@ 2018-01-17 16:58             ` Mike Galbraith
  0 siblings, 0 replies; 22+ messages in thread
From: Mike Galbraith @ 2018-01-17 16:58 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Frederic Weisbecker, Luiz Capitulino, Ingo Molnar, LKML,
	Peter Zijlstra, Chris Metcalf, Thomas Gleixner,
	Paul E . McKenney, Wanpeng Li, Rik van Riel

On Wed, 2018-01-17 at 10:32 -0600, Christopher Lameter wrote:
> On Wed, 17 Jan 2018, Mike Galbraith wrote:
> 
> > Domain connectivity very much is a property of a set of CPUs, a rather
> > important one, and one managed by cpusets.  NOHZ_FULL is a property of
> > a set of cpus, thus a most excellent fit.  Other things are as well.
> 
> Not sure what domain refers to in this context.

Scheduler domains, load balancing.

> > > We have sets of cpus associated with affinity masks in the form of bitmaps
> > > etc etc which is much more lightweight than having to lug around the cgroup
> > > overhead everywhere.
> >
> > What does everywhere mean, set creation time?
> 
> You would need to create multiple cgroups to create what you want. Those
> will "inherit" characteristics from higher levels etc etc. It gets
> needlessly complicated and difficult to debug if something goes wrong.

It's only as complicated as you make it.  What I create is dirt simple,
an exclusive system set and an exclusive realtime set, both directly
under root.  It doesn't get any simpler than that.
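
Roughly, and only as a sketch: the setup boils down to a mkdir plus three
writes per set. The code below assumes a cgroup v1 cpuset hierarchy (commonly
mounted at /sys/fs/cgroup/cpuset, though the mount point varies) with its
usual cpuset.cpus, cpuset.mems and cpuset.cpu_exclusive files. The helper
names cpuset_write() and make_exclusive_set() are invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Write one value into <dir>/<file>. Returns 0 on success, -1 on failure. */
int cpuset_write(const char *dir, const char *file, const char *value)
{
	char path[512];
	FILE *f;
	int ok;
	int n = snprintf(path, sizeof(path), "%s/%s", dir, file);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(value, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Create <root>/<name> and mark it CPU-exclusive. cpus and mems are list
 * strings such as "1-7" and "0". An already existing directory is reused. */
int make_exclusive_set(const char *root, const char *name,
		       const char *cpus, const char *mems)
{
	char dir[256];
	int n = snprintf(dir, sizeof(dir), "%s/%s", root, name);

	assert(cpus != NULL && mems != NULL);
	if (n < 0 || (size_t)n >= sizeof(dir))
		return -1;
	if (mkdir(dir, 0755) != 0 && errno != EEXIST)
		return -1;
	if (cpuset_write(dir, "cpuset.cpus", cpus) != 0)
		return -1;
	if (cpuset_write(dir, "cpuset.mems", mems) != 0)
		return -1;
	return cpuset_write(dir, "cpuset.cpu_exclusive", "1");
}
```

Calling make_exclusive_set() twice with non-overlapping CPU lists, once for
a "system" set and once for a "rt" set, would give the two sets described
above. Moving tasks into them and toggling load balancing is left out here.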

> > > A simple bitmask is much better if you have to control detailed system
> > > behavior for each core and are planning each process's role because you
> > > need to make full use of the hardware resources available.
> >
> > If you live in a static world, maybe.
> 
> Why would that be restricted to a static world?

Guess I misunderstood, unimportant.

> > I like the flexibility of being able to configure on the fly.  One tiny
> > example: for a high performance aircraft manufacturer, having military
> > simulation background, I know that simulators frequently have to be
> > ready to go at the drop of a hat, so I twiddled cpusets to let them
> > flip their extra fancy video game (80 cores, real controls/avionics...
> > "game over, insert one gold bar to continue" kind of fancy) from low
> > power idle to full bore hard realtime with one poke to a cpuset file.
> >
> > Static may be fine for some, for others, dynamic is much better.
> 
> The problem is that I may be flipping a flag in a cpuset to enable
> something but some other cpuset somewhere in the complex hierarchy does
> something different that causes a conflict.

That's what exclusive sets are for, zero set overlap.  It would be very
difficult to both connect and disconnect scheduler domains :)

>  The directness of control is
> lost. Instead there is the fog of complexity created by the cgroups that
> have various plugins and whatnot.

You don't have to use any of the other controllers, I don't, just tell
systemthing to pretty please NOT co-mount controllers, and whatever to
ensure it keeps its tentacles off of your toys, and you're fine.

	-Mike

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] isolation: 1Hz residual tick offloading v3
  2018-01-16 22:51       ` Frederic Weisbecker
@ 2018-01-17 17:38         ` Luiz Capitulino
  2018-01-18  3:04           ` Frederic Weisbecker
  0 siblings, 1 reply; 22+ messages in thread
From: Luiz Capitulino @ 2018-01-17 17:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Christoph Lameter, Paul E . McKenney,
	Wanpeng Li, Mike Galbraith, Rik van Riel

On Tue, 16 Jan 2018 23:51:29 +0100
Frederic Weisbecker <frederic@kernel.org> wrote:

> On Tue, Jan 16, 2018 at 11:52:11AM -0500, Luiz Capitulino wrote:
> > On Tue, 16 Jan 2018 16:41:00 +0100
> > Frederic Weisbecker <frederic@kernel.org> wrote:  
> > > So isolcpus= is now the place where we control the isolation features
> > > and nohz is one of them.  
> > 
> > That's the part I'm not very sure about. We've been advising users to
> > move away from isolcpus= when possible, but this much-wanted nohz_offload
> > feature will force everyone back to using isolcpus= again.  
> 
> Note "isolcpus=nohz" only implies nohz. You need to add "domain" to get
> the behaviour that you've been advising users against. We are simply
> reusing a kernel parameter that was abandoned to now control the isolation
> features that were disorganized and opaque behind nohz.
> 
> > 
> > I have the impression this series is trying to solve two problems:
> > 
> >  1. How (and where) we control the various isolation features in the
> >     kernel  
> 
> No, that has already been done in the previous merge window. We have a
> dedicated isolation subsystem now (kernel/sched/isolation.c) and
> an interface to control all these isolation features that were abusively implied
> by nohz. The initial plan was to introduce "cpu_isolation=" but it looked too much like
> "isolcpus=". Then in fact, why not use "isolcpus=" and give it a second life?
> And there we are.

OK, I get it now. But then the series has to un-deprecate isolcpus=, otherwise
it doesn't make sense to use it.


* Re: [GIT PULL] isolation: 1Hz residual tick offloading v3
  2018-01-17 17:38         ` Luiz Capitulino
@ 2018-01-18  3:04           ` Frederic Weisbecker
  2018-01-18 14:02             ` Luiz Capitulino
  0 siblings, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2018-01-18  3:04 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Christoph Lameter, Paul E . McKenney,
	Wanpeng Li, Mike Galbraith, Rik van Riel

On Wed, Jan 17, 2018 at 12:38:01PM -0500, Luiz Capitulino wrote:
> On Tue, 16 Jan 2018 23:51:29 +0100
> Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> > On Tue, Jan 16, 2018 at 11:52:11AM -0500, Luiz Capitulino wrote:
> > > On Tue, 16 Jan 2018 16:41:00 +0100
> > > Frederic Weisbecker <frederic@kernel.org> wrote:  
> > > > So isolcpus= is now the place where we control the isolation features
> > > > and nohz is one of them.  
> > > 
> > > That's the part I'm not very sure about. We've been advising users to
> > > > move away from isolcpus= when possible, but this much-wanted nohz_offload
> > > feature will force everyone back to using isolcpus= again.  
> > 
> > Note "isolcpus=nohz" only implies nohz. You need to add "domain" to get
> > the behaviour that you've been advising users against. We are simply
> > reusing a kernel parameter that was abandoned to now control the isolation
> > features that were disorganized and opaque behind nohz.
> > 
> > > 
> > > I have the impression this series is trying to solve two problems:
> > > 
> > >  1. How (and where) we control the various isolation features in the
> > >     kernel  
> > 
> > No, that has already been done in the previous merge window. We have a
> > dedicated isolation subsystem now (kernel/sched/isolation.c) and
> > an interface to control all these isolation features that were abusively implied
> > by nohz. The initial plan was to introduce "cpu_isolation=" but it looked too much like
> > "isolcpus=". Then in fact, why not use "isolcpus=" and give it a second life?
> > And there we are.
> 
> OK, I get it now. But then the series has to un-deprecate isolcpus=, otherwise
> it doesn't make sense to use it.

Good point. Also, I think you convinced me to just apply that tick offload
to the existing nohz kernel parameters right away, that is, to both the existing
"nohz_full=" and "isolcpus=nohz".

After all, that tick offload is an implementation detail.

Like you said, if people complain about a regression, we can still fix it
with a new option. But in the end I doubt this will be needed.

I'll respin with that.

Thanks!


* Re: [GIT PULL] isolation: 1Hz residual tick offloading v3
  2018-01-18  3:04           ` Frederic Weisbecker
@ 2018-01-18 14:02             ` Luiz Capitulino
  0 siblings, 0 replies; 22+ messages in thread
From: Luiz Capitulino @ 2018-01-18 14:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Chris Metcalf,
	Thomas Gleixner, Christoph Lameter, Paul E . McKenney,
	Wanpeng Li, Mike Galbraith, Rik van Riel

On Thu, 18 Jan 2018 04:04:43 +0100
Frederic Weisbecker <frederic@kernel.org> wrote:

> On Wed, Jan 17, 2018 at 12:38:01PM -0500, Luiz Capitulino wrote:
> > On Tue, 16 Jan 2018 23:51:29 +0100
> > Frederic Weisbecker <frederic@kernel.org> wrote:
> >   
> > > On Tue, Jan 16, 2018 at 11:52:11AM -0500, Luiz Capitulino wrote:  
> > > > On Tue, 16 Jan 2018 16:41:00 +0100
> > > > Frederic Weisbecker <frederic@kernel.org> wrote:    
> > > > > So isolcpus= is now the place where we control the isolation features
> > > > > and nohz is one of them.    
> > > > 
> > > > That's the part I'm not very sure about. We've been advising users to
> > > > > move away from isolcpus= when possible, but this much-wanted nohz_offload
> > > > feature will force everyone back to using isolcpus= again.    
> > > 
> > > Note "isolcpus=nohz" only implies nohz. You need to add "domain" to get
> > > the behaviour that you've been advising users against. We are simply
> > > reusing a kernel parameter that was abandoned to now control the isolation
> > > features that were disorganized and opaque behind nohz.
> > >   
> > > > 
> > > > I have the impression this series is trying to solve two problems:
> > > > 
> > > >  1. How (and where) we control the various isolation features in the
> > > >     kernel    
> > > 
> > > No, that has already been done in the previous merge window. We have a
> > > dedicated isolation subsystem now (kernel/sched/isolation.c) and
> > > an interface to control all these isolation features that were abusively implied
> > > by nohz. The initial plan was to introduce "cpu_isolation=" but it looked too much like
> > > > "isolcpus=". Then in fact, why not use "isolcpus=" and give it a second life?
> > > And there we are.  
> > 
> > OK, I get it now. But then the series has to un-deprecate isolcpus=, otherwise
> > it doesn't make sense to use it.  
> 
> Good point. Also, I think you convinced me to just apply that tick offload
> to the existing nohz kernel parameters right away, that is, to both the existing
> "nohz_full=" and "isolcpus=nohz".
> 
> After all, that tick offload is an implementation detail.
> 
> Like you said, if people complain about a regression, we can still fix it
> with a new option. But in the end I doubt this will be needed.
> 
> I'll respin with that.

Exciting times!

Btw, I do have this problem where I have a hog app on an isolated core
with isolcpus=nohz_offload,domain,... and I see top -d1 going from 100%
to 0% and then back from 0% to 100% every few seconds or so. I'll debug
it when you post the next version.
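
For anyone who wants to watch the same symptom without top, here is a
minimal sketch that derives a per-CPU busy percentage from two samples of
a "cpuN" line in /proc/stat. The helper names (parse_cpu_times,
busy_percent, read_cpu_times) are hypothetical and only for illustration;
this is not how top is implemented and it does not diagnose the cause, it
only makes the 100% / 0% swings easy to log.

```python
def parse_cpu_times(line):
    """Return the first eight counters of a /proc/stat "cpuN" line.

    Field order: user nice system idle iowait irq softirq steal.
    The guest fields are skipped because they are already counted in user/nice.
    """
    fields = line.split()
    return [int(value) for value in fields[1:9]]


def busy_percent(before, after):
    """Busy share of the time elapsed between two samples, in percent."""
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    if total <= 0:
        # No accounted time elapsed between the samples.
        return 0.0
    idle = deltas[3] + deltas[4]  # idle + iowait
    return 100.0 * (total - idle) / total


def read_cpu_times(cpu, path="/proc/stat"):
    """Read the current counters for one CPU (hypothetical helper)."""
    prefix = "cpu%d " % cpu
    with open(path) as stat_file:
        for line in stat_file:
            if line.startswith(prefix):
                return parse_cpu_times(line)
    raise ValueError("no entry for cpu%d in %s" % (cpu, path))
```

Calling read_cpu_times() for the isolated CPU once per second and feeding
consecutive samples to busy_percent() gives one number per interval that
can be logged next to the hog's runtime.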


end of thread, other threads:[~2018-01-18 14:02 UTC | newest]

Thread overview: 22+ messages
2018-01-04  4:25 [GIT PULL] isolation: 1Hz residual tick offloading v3 Frederic Weisbecker
2018-01-04  4:25 ` [PATCH 1/5] sched: Rename init_rq_hrtick to hrtick_rq_init Frederic Weisbecker
2018-01-04  4:25 ` [PATCH 2/5] sched/isolation: Add scheduler tick offloading interface Frederic Weisbecker
2018-01-04  4:25 ` [PATCH 3/5] nohz: Allow to check if remote CPU tick is stopped Frederic Weisbecker
2018-01-04  4:25 ` [PATCH 4/5] sched/isolation: Residual 1Hz scheduler tick offload Frederic Weisbecker
2018-01-12 19:22   ` Luiz Capitulino
2018-01-16 15:57     ` Frederic Weisbecker
2018-01-16 16:53       ` Luiz Capitulino
2018-01-04  4:25 ` [PATCH 5/5] sched/isolation: Document "nohz_offload" flag Frederic Weisbecker
2018-01-12 19:18 ` [GIT PULL] isolation: 1Hz residual tick offloading v3 Luiz Capitulino
2018-01-16 15:41   ` Frederic Weisbecker
2018-01-16 16:52     ` Luiz Capitulino
2018-01-16 22:51       ` Frederic Weisbecker
2018-01-17 17:38         ` Luiz Capitulino
2018-01-18  3:04           ` Frederic Weisbecker
2018-01-18 14:02             ` Luiz Capitulino
2018-01-16 17:58     ` Mike Galbraith
2018-01-16 22:53       ` Frederic Weisbecker
2018-01-17 14:51       ` Christopher Lameter
2018-01-17 15:59         ` Mike Galbraith
2018-01-17 16:32           ` Christopher Lameter
2018-01-17 16:58             ` Mike Galbraith
