Linux-rt-users archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH 0/3] newidle_balance() latency mitigation
@ 2020-04-28  5:02 Scott Wood
  2020-04-28  5:02 ` [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch() Scott Wood
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Scott Wood @ 2020-04-28  5:02 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot
  Cc: Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel, linux-rt-users

These patches mitigate latency caused by newidle_balance() on large
systems, by enabling interrupts when the lock is dropped, and exiting
early at various points if an RT task is runnable on the current CPU.

When applied to an RT kernel on a 72-core machine (2 threads per core), I
saw significant reductions in latency as reported by rteval -- from
over 500us to around 160us with hyperthreading disabled, and from
over 1400us to around 380us with hyperthreading enabled.

This isn't the first time something like this has been tried:
https://lore.kernel.org/lkml/20121222003019.433916240@goodmis.org/
That attempt ended up being reverted:
https://lore.kernel.org/lkml/5122CD9C.9070702@oracle.com/

The problem in that case was the failure to keep BH disabled, and the
difficulty of fixing that when called from the post_schedule() hook. 
This patchset uses finish_task_switch() to call newidle_balance(), which
enters in non-atomic context so we have full control over what we disable
and when.

There was a note at the end about wanting further discussion on the matter --
does anyone remember if that ever happened and what the conclusion was?
Are there any other issues with enabling interrupts here and/or moving
the newidle_balance() call?

Rik van Riel (1):
  sched,rt: break out of load balancing if an RT task appears

Scott Wood (2):
  sched/fair: Call newidle_balance() from finish_task_switch()
  sched/fair: Enable interrupts when dropping lock in newidle_balance()

 kernel/sched/core.c  |  7 +++--
 kernel/sched/fair.c  | 72 +++++++++++++++++++++++---------------------
 kernel/sched/sched.h | 12 +++++---
 3 files changed, 49 insertions(+), 42 deletions(-)

-- 
2.18.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-04-28  5:02 [RFC PATCH 0/3] newidle_balance() latency mitigation Scott Wood
@ 2020-04-28  5:02 ` Scott Wood
  2020-04-28 21:37   ` Valentin Schneider
  2020-04-29  8:27   ` Vincent Guittot
  2020-04-28  5:02 ` [RFC PATCH 2/3] sched/fair: Enable interrupts when dropping lock in newidle_balance() Scott Wood
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Scott Wood @ 2020-04-28  5:02 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot
  Cc: Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users, Scott Wood

Thus, newidle_balance() is entered with interrupts enabled, which allows
(in the next patch) enabling interrupts when the lock is dropped.

Signed-off-by: Scott Wood <swood@redhat.com>
---
 kernel/sched/core.c  |  7 ++++---
 kernel/sched/fair.c  | 45 ++++++++++++++++----------------------------
 kernel/sched/sched.h |  6 ++----
 3 files changed, 22 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..0294beb8d16c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3241,6 +3241,10 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	}
 
 	tick_nohz_task_switch();
+
+	if (is_idle_task(current))
+		newidle_balance();
+
 	return rq;
 }
 
@@ -3919,8 +3923,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		   rq->nr_running == rq->cfs.h_nr_running)) {
 
 		p = pick_next_task_fair(rq, prev, rf);
-		if (unlikely(p == RETRY_TASK))
-			goto restart;
 
 		/* Assumes fair_sched_class->next == idle_sched_class */
 		if (!p) {
@@ -3931,7 +3933,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		return p;
 	}
 
-restart:
 #ifdef CONFIG_SMP
 	/*
 	 * We must do the balancing pass before put_next_task(), such
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..74c3c5280d6b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6758,8 +6758,6 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	if (rq->nr_running)
 		return 1;
-
-	return newidle_balance(rq, rf) != 0;
 }
 #endif /* CONFIG_SMP */
 
@@ -6934,9 +6932,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se;
 	struct task_struct *p;
-	int new_tasks;
 
-again:
 	if (!sched_fair_runnable(rq))
 		goto idle;
 
@@ -7050,19 +7046,6 @@ done: __maybe_unused;
 	if (!rf)
 		return NULL;
 
-	new_tasks = newidle_balance(rq, rf);
-
-	/*
-	 * Because newidle_balance() releases (and re-acquires) rq->lock, it is
-	 * possible for any higher priority task to appear. In that case we
-	 * must re-start the pick_next_entity() loop.
-	 */
-	if (new_tasks < 0)
-		return RETRY_TASK;
-
-	if (new_tasks > 0)
-		goto again;
-
 	/*
 	 * rq is about to be idle, check if we need to update the
 	 * lost_idle_time of clock_pelt
@@ -10425,14 +10408,23 @@ static inline void nohz_newidle_balance(struct rq *this_rq) { }
  *     0 - failed, no new tasks
  *   > 0 - success, new (fair) tasks present
  */
-int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
+int newidle_balance(void)
 {
 	unsigned long next_balance = jiffies + HZ;
-	int this_cpu = this_rq->cpu;
+	int this_cpu;
 	struct sched_domain *sd;
+	struct rq *this_rq;
 	int pulled_task = 0;
 	u64 curr_cost = 0;
 
+	preempt_disable();
+	this_rq = this_rq();
+	this_cpu = this_rq->cpu;
+	local_bh_disable();
+	raw_spin_lock_irq(&this_rq->lock);
+
+	update_rq_clock(this_rq);
+
 	update_misfit_status(NULL, this_rq);
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
@@ -10444,15 +10436,7 @@ int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	 * Do not pull tasks towards !active CPUs...
 	 */
 	if (!cpu_active(this_cpu))
-		return 0;
-
-	/*
-	 * This is OK, because current is on_cpu, which avoids it being picked
-	 * for load-balance and preemption/IRQs are still disabled avoiding
-	 * further scheduler activity on it and we're being very careful to
-	 * re-start the picking loop.
-	 */
-	rq_unpin_lock(this_rq, rf);
+		goto out_unlock;
 
 	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
 	    !READ_ONCE(this_rq->rd->overload)) {
@@ -10534,7 +10518,10 @@ int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (pulled_task)
 		this_rq->idle_stamp = 0;
 
-	rq_repin_lock(this_rq, rf);
+out_unlock:
+	raw_spin_unlock_irq(&this_rq->lock);
+	local_bh_enable();
+	preempt_enable();
 
 	return pulled_task;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..3d97c51544d7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1504,13 +1504,13 @@ static inline void unregister_sched_domain_sysctl(void)
 }
 #endif
 
-extern int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
+extern int newidle_balance(void);
 
 #else
 
 static inline void sched_ttwu_pending(void) { }
 
-static inline int newidle_balance(struct rq *this_rq, struct rq_flags *rf) { return 0; }
+static inline int newidle_balance(void) { return 0; }
 
 #endif /* CONFIG_SMP */
 
@@ -1742,8 +1742,6 @@ extern const u32		sched_prio_to_wmult[40];
 #define ENQUEUE_MIGRATED	0x00
 #endif
 
-#define RETRY_TASK		((void *)-1UL)
-
 struct sched_class {
 	const struct sched_class *next;
 
-- 
2.18.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH 2/3] sched/fair: Enable interrupts when dropping lock in newidle_balance()
  2020-04-28  5:02 [RFC PATCH 0/3] newidle_balance() latency mitigation Scott Wood
  2020-04-28  5:02 ` [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch() Scott Wood
@ 2020-04-28  5:02 ` Scott Wood
  2020-04-28  5:02 ` [RFC PATCH 3/3] sched,rt: break out of load balancing if an RT task appears Scott Wood
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Scott Wood @ 2020-04-28  5:02 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot
  Cc: Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users, Scott Wood

When combined with the next patch, which breaks out of rebalancing
when an RT task is runnable, significant latency reductions are seen
on systems with many CPUs.

Signed-off-by: Scott Wood <swood@redhat.com>
---
 kernel/sched/fair.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 74c3c5280d6b..dfde7f0ce3db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10376,7 +10376,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
 		return;
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock_irq(&this_rq->lock);
 	/*
 	 * This CPU is going to be idle and blocked load of idle CPUs
 	 * need to be updated. Run the ilb locally as it is a good
@@ -10385,7 +10385,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	 */
 	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
 		kick_ilb(NOHZ_STATS_KICK);
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock_irq(&this_rq->lock);
 }
 
 #else /* !CONFIG_NO_HZ_COMMON */
@@ -10452,7 +10452,7 @@ int newidle_balance(void)
 		goto out;
 	}
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock_irq(&this_rq->lock);
 
 	update_blocked_averages(this_cpu);
 	rcu_read_lock();
@@ -10493,7 +10493,7 @@ int newidle_balance(void)
 	}
 	rcu_read_unlock();
 
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock_irq(&this_rq->lock);
 
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
-- 
2.18.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH 3/3] sched,rt: break out of load balancing if an RT task appears
  2020-04-28  5:02 [RFC PATCH 0/3] newidle_balance() latency mitigation Scott Wood
  2020-04-28  5:02 ` [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch() Scott Wood
  2020-04-28  5:02 ` [RFC PATCH 2/3] sched/fair: Enable interrupts when dropping lock in newidle_balance() Scott Wood
@ 2020-04-28  5:02 ` Scott Wood
  2020-04-28 21:56   ` Valentin Schneider
  2020-04-28 13:27 ` [RFC PATCH 0/3] newidle_balance() latency mitigation Steven Rostedt
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Scott Wood @ 2020-04-28  5:02 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot
  Cc: Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel, linux-rt-users

From: Rik van Riel <riel@redhat.com>

Bugzilla: 1331562

The CFS load balancer can take a little while, to the point of
it having a special LBF_NEED_BREAK flag, when the task moving
code takes a breather.

However, at that point it will jump right back in to load balancing,
without checking whether the CPU has gained any runnable real time
(or deadline) tasks.

Only idle_balance used to check for runnable real time tasks on a
CPU. This patch moves that check into a separate inline function,
and calls that function in load_balance, at approximately the same
granularity that LBF_NEED_BREAK happens.

Besides breaking out of load_balance, this patch also clears
continue_balancing, in order for rebalance_domains to break out
of its loop when a realtime task becomes runnable.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Clark Williams <williams@redhat.com>
Signed-off-by: Clark Williams <williams@redhat.com>
---
 kernel/sched/fair.c  | 19 +++++++++++++++++--
 kernel/sched/sched.h |  6 ++++++
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dfde7f0ce3db..e7437e4e40b4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9377,10 +9377,16 @@ voluntary_active_balance(struct lb_env *env)
 	return 0;
 }
 
-static int need_active_balance(struct lb_env *env)
+static int need_active_balance(struct lb_env *env, int *continue_balancing)
 {
 	struct sched_domain *sd = env->sd;
 
+	/* Run the realtime task now; load balance later. */
+	if (rq_has_runnable_rt_task(env->dst_rq)) {
+		*continue_balancing = 0;
+		return 0;
+	}
+
 	if (voluntary_active_balance(env))
 		return 1;
 
@@ -9394,6 +9400,10 @@ static int should_we_balance(struct lb_env *env)
 	struct sched_group *sg = env->sd->groups;
 	int cpu, balance_cpu = -1;
 
+	/* Run the realtime task now; load balance later. */
+	if (rq_has_runnable_rt_task(env->dst_rq))
+		return 0;
+
 	/*
 	 * Ensure the balancing environment is consistent; can happen
 	 * when the softirq triggers 'during' hotplug.
@@ -9521,6 +9531,11 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
 		local_irq_restore(rf.flags);
 
+		if (rq_has_runnable_rt_task(this_rq)) {
+			*continue_balancing = 0;
+			goto out;
+		}
+
 		if (env.flags & LBF_NEED_BREAK) {
 			env.flags &= ~LBF_NEED_BREAK;
 			goto more_balance;
@@ -9604,7 +9619,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		if (idle != CPU_NEWLY_IDLE)
 			sd->nr_balance_failed++;
 
-		if (need_active_balance(&env)) {
+		if (need_active_balance(&env, continue_balancing)) {
 			unsigned long flags;
 
 			raw_spin_lock_irqsave(&busiest->lock, flags);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3d97c51544d7..a2a01dfd2bea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1878,6 +1878,12 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
 
 	return rq->idle_state;
 }
+
+/* Is there a task of a high priority class? */
+static inline bool rq_has_runnable_rt_task(struct rq *rq)
+{
+	return unlikely(rq->nr_running != rq->cfs.h_nr_running);
+}
 #else
 static inline void idle_set_state(struct rq *rq,
 				  struct cpuidle_state *idle_state)
-- 
2.18.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/3] newidle_balance() latency mitigation
  2020-04-28  5:02 [RFC PATCH 0/3] newidle_balance() latency mitigation Scott Wood
                   ` (2 preceding siblings ...)
  2020-04-28  5:02 ` [RFC PATCH 3/3] sched,rt: break out of load balancing if an RT task appears Scott Wood
@ 2020-04-28 13:27 ` Steven Rostedt
  2020-04-29 23:13 ` Valentin Schneider
  2020-04-30 12:48 ` Vincent Guittot
  5 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2020-04-28 13:27 UTC (permalink / raw)
  To: Scott Wood
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Rik van Riel, Mel Gorman, linux-kernel, linux-rt-users

On Tue, 28 Apr 2020 00:02:39 -0500
Scott Wood <swood@redhat.com> wrote:

> There was a note at the end about wanting further discussion on the matter --
> does anyone remember if that ever happened and what the conclusion was?
> Are there any other issues with enabling interrupts here and/or moving
> the newidle_balance() call?

Honestly, I think we got distracted by other issues and never revisited
it :-/

-- Steve

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-04-28  5:02 ` [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch() Scott Wood
@ 2020-04-28 21:37   ` Valentin Schneider
  2020-04-28 22:09     ` Peter Zijlstra
  2020-04-28 22:33     ` Scott Wood
  2020-04-29  8:27   ` Vincent Guittot
  1 sibling, 2 replies; 28+ messages in thread
From: Valentin Schneider @ 2020-04-28 21:37 UTC (permalink / raw)
  To: Scott Wood
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users


On 28/04/20 06:02, Scott Wood wrote:
> Thus, newidle_balance() is entered with interrupts enabled, which allows
> (in the next patch) enabling interrupts when the lock is dropped.
>
> Signed-off-by: Scott Wood <swood@redhat.com>
> ---
>  kernel/sched/core.c  |  7 ++++---
>  kernel/sched/fair.c  | 45 ++++++++++++++++----------------------------
>  kernel/sched/sched.h |  6 ++----
>  3 files changed, 22 insertions(+), 36 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9a2fbf98fd6f..0294beb8d16c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3241,6 +3241,10 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>       }
>
>       tick_nohz_task_switch();
> +
> +	if (is_idle_task(current))
> +		newidle_balance();
> +

This means we must go through a switch_to(idle) before figuring out we
could've switched to a CFS task, and do it then. I'm curious to see the
performance impact of that.

>       return rq;
>  }
>
> @@ -10425,14 +10408,23 @@ static inline void nohz_newidle_balance(struct rq *this_rq) { }
>   *     0 - failed, no new tasks
>   *   > 0 - success, new (fair) tasks present
>   */
> -int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> +int newidle_balance(void)
>  {
>       unsigned long next_balance = jiffies + HZ;
> -	int this_cpu = this_rq->cpu;
> +	int this_cpu;
>       struct sched_domain *sd;
> +	struct rq *this_rq;
>       int pulled_task = 0;
>       u64 curr_cost = 0;
>
> +	preempt_disable();
> +	this_rq = this_rq();
> +	this_cpu = this_rq->cpu;
> +	local_bh_disable();
> +	raw_spin_lock_irq(&this_rq->lock);
> +
> +	update_rq_clock(this_rq);
> +
>       update_misfit_status(NULL, this_rq);

I'm thinking this should be moved to where newidle_balance() used to be,
otherwise we have a window where the rq is flagged as having a misfit
task despite not having any runnable CFS tasks.

>       /*
>        * We must set idle_stamp _before_ calling idle_balance(), such that we

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 3/3] sched,rt: break out of load balancing if an RT task appears
  2020-04-28  5:02 ` [RFC PATCH 3/3] sched,rt: break out of load balancing if an RT task appears Scott Wood
@ 2020-04-28 21:56   ` Valentin Schneider
  2020-04-28 22:33     ` Scott Wood
  0 siblings, 1 reply; 28+ messages in thread
From: Valentin Schneider @ 2020-04-28 21:56 UTC (permalink / raw)
  To: Scott Wood
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users


On 28/04/20 06:02, Scott Wood wrote:
> From: Rik van Riel <riel@redhat.com>
>
> Bugzilla: 1331562
>
> The CFS load balancer can take a little while, to the point of
> it having a special LBF_NEED_BREAK flag, when the task moving
> code takes a breather.
>
> However, at that point it will jump right back in to load balancing,
> without checking whether the CPU has gained any runnable real time
> (or deadline) tasks.
>
> Only idle_balance used to check for runnable real time tasks on a
> CPU. This patch moves that check into a separate inline function,
> and calls that function in load_balance, at approximately the same
> granularity that LBF_NEED_BREAK happens.
>
> Besides breaking out of load_balance, this patch also clears
> continue_balancing, in order for rebalance_domains to break out
> of its loop when a realtime task becomes runnable.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Clark Williams <williams@redhat.com>
> Signed-off-by: Clark Williams <williams@redhat.com>
> ---
>  kernel/sched/fair.c  | 19 +++++++++++++++++--
>  kernel/sched/sched.h |  6 ++++++
>  2 files changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dfde7f0ce3db..e7437e4e40b4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9394,6 +9400,10 @@ static int should_we_balance(struct lb_env *env)
>       struct sched_group *sg = env->sd->groups;
>       int cpu, balance_cpu = -1;
>
> +	/* Run the realtime task now; load balance later. */
> +	if (rq_has_runnable_rt_task(env->dst_rq))
> +		return 0;
> +

I have a feeling this isn't very nice to CFS tasks, since we would now
"waste" load-balance attempts if they happen to coincide with an RT task
being runnable.

On your 72 CPUs machine, the system-wide balance happens (at best) every
72ms if you have idle time, every ~2300ms otherwise (every balance
CPU gets to try to balance however, so it's not as horrible as I'm making
it sound). This is totally worst-case scenario territory, and you'd hope
newidle_balance() could help here and there (as it isn't gated by any
balance interval).

Still, even for a single rq, postponing a system-wide balance for a
full balance interval (i.e. ~2 secs worst case here) just because we had a
single RT task running when we tried to balance seems a bit much.

It may be possible to hack something to detect those cases and reset the
interval to "now" when e.g. dequeuing the last RT task (& after having
previously aborted a load-balance due to RT/DL/foobar).

>       /*
>        * Ensure the balancing environment is consistent; can happen
>        * when the softirq triggers 'during' hotplug.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3d97c51544d7..a2a01dfd2bea 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1878,6 +1878,12 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
>
>       return rq->idle_state;
>  }
> +
> +/* Is there a task of a high priority class? */
> +static inline bool rq_has_runnable_rt_task(struct rq *rq)
> +{
> +	return unlikely(rq->nr_running != rq->cfs.h_nr_running);

Seeing as that can be RT, DL or stopper, that name is somewhat misleading.

> +}
>  #else
>  static inline void idle_set_state(struct rq *rq,
>                                 struct cpuidle_state *idle_state)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-04-28 21:37   ` Valentin Schneider
@ 2020-04-28 22:09     ` Peter Zijlstra
  2020-04-28 22:55       ` Scott Wood
  2020-04-28 22:33     ` Scott Wood
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2020-04-28 22:09 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Scott Wood, Steven Rostedt, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users

On Tue, Apr 28, 2020 at 10:37:18PM +0100, Valentin Schneider wrote:
> 
> On 28/04/20 06:02, Scott Wood wrote:
> > Thus, newidle_balance() is entered with interrupts enabled, which allows
> > (in the next patch) enabling interrupts when the lock is dropped.
> >
> > Signed-off-by: Scott Wood <swood@redhat.com>
> > ---
> >  kernel/sched/core.c  |  7 ++++---
> >  kernel/sched/fair.c  | 45 ++++++++++++++++----------------------------
> >  kernel/sched/sched.h |  6 ++----
> >  3 files changed, 22 insertions(+), 36 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 9a2fbf98fd6f..0294beb8d16c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3241,6 +3241,10 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> >       }
> >
> >       tick_nohz_task_switch();
> > +
> > +	if (is_idle_task(current))
> > +		newidle_balance();
> > +
> 
> This means we must go through a switch_to(idle) before figuring out we
> could've switched to a CFS task, and do it then. I'm curious to see the
> performance impact of that.

Also, if you move it this late, this is entirely the wrong place. If you
do it after the context switch either use the balance_callback or put it
in the idle path.

But what Valentin said; this needs a fair bit of support, the whole
reason we've never done this is to avoid that double context switch...

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-04-28 21:37   ` Valentin Schneider
  2020-04-28 22:09     ` Peter Zijlstra
@ 2020-04-28 22:33     ` Scott Wood
  2020-04-29 12:00       ` Valentin Schneider
  1 sibling, 1 reply; 28+ messages in thread
From: Scott Wood @ 2020-04-28 22:33 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users

On Tue, 2020-04-28 at 22:37 +0100, Valentin Schneider wrote:
> On 28/04/20 06:02, Scott Wood wrote:
> > Thus, newidle_balance() is entered with interrupts enabled, which allows
> > (in the next patch) enabling interrupts when the lock is dropped.
> > 
> > Signed-off-by: Scott Wood <swood@redhat.com>
> > ---
> >  kernel/sched/core.c  |  7 ++++---
> >  kernel/sched/fair.c  | 45 ++++++++++++++++----------------------------
> >  kernel/sched/sched.h |  6 ++----
> >  3 files changed, 22 insertions(+), 36 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 9a2fbf98fd6f..0294beb8d16c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3241,6 +3241,10 @@ static struct rq *finish_task_switch(struct
> > task_struct *prev)
> >       }
> > 
> >       tick_nohz_task_switch();
> > +
> > +	if (is_idle_task(current))
> > +		newidle_balance();
> > +
> 
> This means we must go through a switch_to(idle) before figuring out we
> could've switched to a CFS task, and do it then. I'm curious to see the
> performance impact of that.

Any particular benchmark I should try?

> >      return rq;
> >  }
> > 
> > @@ -10425,14 +10408,23 @@ static inline void nohz_newidle_balance(struct
> > rq *this_rq) { }
> >   *     0 - failed, no new tasks
> >   *   > 0 - success, new (fair) tasks present
> >   */
> > -int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > +int newidle_balance(void)
> >  {
> >       unsigned long next_balance = jiffies + HZ;
> > -	int this_cpu = this_rq->cpu;
> > +	int this_cpu;
> >       struct sched_domain *sd;
> > +	struct rq *this_rq;
> >       int pulled_task = 0;
> >       u64 curr_cost = 0;
> > 
> > +	preempt_disable();
> > +	this_rq = this_rq();
> > +	this_cpu = this_rq->cpu;
> > +	local_bh_disable();
> > +	raw_spin_lock_irq(&this_rq->lock);
> > +
> > +	update_rq_clock(this_rq);
> > +
> >       update_misfit_status(NULL, this_rq);
> 
> I'm thinking this should be moved to where newidle_balance() used to be,
> otherwise we have a window where the rq is flagged as having a misfit
> task despite not having any runnable CFS tasks.

OK

-Scott



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 3/3] sched,rt: break out of load balancing if an RT task appears
  2020-04-28 21:56   ` Valentin Schneider
@ 2020-04-28 22:33     ` Scott Wood
  2020-04-28 22:52       ` Scott Wood
  2020-04-29 12:01       ` Valentin Schneider
  0 siblings, 2 replies; 28+ messages in thread
From: Scott Wood @ 2020-04-28 22:33 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users

On Tue, 2020-04-28 at 22:56 +0100, Valentin Schneider wrote:
> On 28/04/20 06:02, Scott Wood wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index dfde7f0ce3db..e7437e4e40b4 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9394,6 +9400,10 @@ static int should_we_balance(struct lb_env *env)
> >       struct sched_group *sg = env->sd->groups;
> >       int cpu, balance_cpu = -1;
> > 
> > +	/* Run the realtime task now; load balance later. */
> > +	if (rq_has_runnable_rt_task(env->dst_rq))
> > +		return 0;
> > +
> 
> I have a feeling this isn't very nice to CFS tasks, since we would now
> "waste" load-balance attempts if they happen to coincide with an RT task
> being runnable.
>
> On your 72 CPUs machine, the system-wide balance happens (at best) every
> 72ms if you have idle time, every ~2300ms otherwise (every balance
> CPU gets to try to balance however, so it's not as horrible as I'm making
> it sound). This is totally worst-case scenario territory, and you'd hope
> newidle_balance() could help here and there (as it isn't gated by any
> balance interval).
> 
> Still, even for a single rq, postponing a system-wide balance for a
> full balance interval (i.e. ~2 secs worst case here) just because we had a
> single RT task running when we tried to balance seems a bit much.
> 
> It may be possible to hack something to detect those cases and reset the
> interval to "now" when e.g. dequeuing the last RT task (& after having
> previously aborted a load-balance due to RT/DL/foobar).

Yeah, some way to retry at an appropriate time after aborting a rebalance
would be good.


> > +
> > +/* Is there a task of a high priority class? */
> > +static inline bool rq_has_runnable_rt_task(struct rq *rq)
> > +{
> > +	return unlikely(rq->nr_running != rq->cfs.h_nr_running);
> 
> Seeing as that can be RT, DL or stopper, that name is somewhat misleading.

rq_has_runnable_rt_dl_task()?  Or is there some term that unambiguously
encompasses both?

-Scott



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 3/3] sched,rt: break out of load balancing if an RT task appears
  2020-04-28 22:33     ` Scott Wood
@ 2020-04-28 22:52       ` Scott Wood
  2020-04-29 12:01       ` Valentin Schneider
  1 sibling, 0 replies; 28+ messages in thread
From: Scott Wood @ 2020-04-28 22:52 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users

On Tue, 2020-04-28 at 17:33 -0500, Scott Wood wrote:
> On Tue, 2020-04-28 at 22:56 +0100, Valentin Schneider wrote:
> > On 28/04/20 06:02, Scott Wood wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index dfde7f0ce3db..e7437e4e40b4 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -9394,6 +9400,10 @@ static int should_we_balance(struct lb_env
> > > *env)
> > >       struct sched_group *sg = env->sd->groups;
> > >       int cpu, balance_cpu = -1;
> > > 
> > > +	/* Run the realtime task now; load balance later. */
> > > +	if (rq_has_runnable_rt_task(env->dst_rq))
> > > +		return 0;
> > > +
> > 
> > I have a feeling this isn't very nice to CFS tasks, since we would now
> > "waste" load-balance attempts if they happen to coincide with an RT task
> > being runnable.
> > 
> > On your 72 CPUs machine, the system-wide balance happens (at best) every
> > 72ms if you have idle time, every ~2300ms otherwise (every balance
> > CPU gets to try to balance however, so it's not as horrible as I'm
> > making
> > it sound). This is totally worst-case scenario territory, and you'd hope
> > newidle_balance() could help here and there (as it isn't gated by any
> > balance interval).
> > 
> > Still, even for a single rq, postponing a system-wide balance for a
> > full balance interval (i.e. ~2 secs worst case here) just because we had
> > a
> > single RT task running when we tried to balance seems a bit much.
> > 
> > It may be possible to hack something to detect those cases and reset the
> > interval to "now" when e.g. dequeuing the last RT task (& after having
> > previously aborted a load-balance due to RT/DL/foobar).
> 
> Yeah, some way to retry at an appropriate time after aborting a rebalance
> would be good.

Another option is to limit the bailing out to newidle balancing (as the
patchset currently stands, it isn't checking the right rq for global
balancing anyway).  On RT the softirq runs from thread context, so enabling
interrupts and (on RT) preemption should suffice to avoid latency problems
in the global rebalance.

-Scott



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-04-28 22:09     ` Peter Zijlstra
@ 2020-04-28 22:55       ` Scott Wood
  2020-04-28 23:02         ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Scott Wood @ 2020-04-28 22:55 UTC (permalink / raw)
  To: Peter Zijlstra, Valentin Schneider
  Cc: Steven Rostedt, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Rik van Riel, Mel Gorman, linux-kernel, linux-rt-users

On Wed, 2020-04-29 at 00:09 +0200, Peter Zijlstra wrote:
> On Tue, Apr 28, 2020 at 10:37:18PM +0100, Valentin Schneider wrote:
> > On 28/04/20 06:02, Scott Wood wrote:
> > > Thus, newidle_balance() is entered with interrupts enabled, which
> > > allows
> > > (in the next patch) enabling interrupts when the lock is dropped.
> > > 
> > > Signed-off-by: Scott Wood <swood@redhat.com>
> > > ---
> > >  kernel/sched/core.c  |  7 ++++---
> > >  kernel/sched/fair.c  | 45 ++++++++++++++++---------------------------
> > > -
> > >  kernel/sched/sched.h |  6 ++----
> > >  3 files changed, 22 insertions(+), 36 deletions(-)
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 9a2fbf98fd6f..0294beb8d16c 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -3241,6 +3241,10 @@ static struct rq *finish_task_switch(struct
> > > task_struct *prev)
> > >       }
> > > 
> > >       tick_nohz_task_switch();
> > > +
> > > +	if (is_idle_task(current))
> > > +		newidle_balance();
> > > +
> > 
> > This means we must go through a switch_to(idle) before figuring out we
> > could've switched to a CFS task, and do it then. I'm curious to see the
> > performance impact of that.
> 
> Also, if you move it this late, this is entirely the wrong place. If you
> do it after the context switch either use the balance_callback or put it
> in the idle path.
> 
> But what Valentin said; this needs a fair bit of support, the whole
> reason we've never done this is to avoid that double context switch...
> 

balance_callback() enters with the rq lock held but BH not separately
disabled, which interferes with the ability to enable interrupts but not BH.
It also gets called from rt_mutex_setprio() and __sched_setscheduler(), and
I didn't want the caller of those to be stuck with the latency.

-Scott



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-04-28 22:55       ` Scott Wood
@ 2020-04-28 23:02         ` Peter Zijlstra
  2020-04-28 23:20           ` Scott Wood
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2020-04-28 23:02 UTC (permalink / raw)
  To: Scott Wood
  Cc: Valentin Schneider, Steven Rostedt, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users

On Tue, Apr 28, 2020 at 05:55:03PM -0500, Scott Wood wrote:
> On Wed, 2020-04-29 at 00:09 +0200, Peter Zijlstra wrote:

> > Also, if you move it this late, this is entirely the wrong place. If you
> > do it after the context switch either use the balance_callback or put it
> > in the idle path.
> > 
> > But what Valentin said; this needs a fair bit of support, the whole
> > reason we've never done this is to avoid that double context switch...
> > 
> 
> balance_callback() enters with the rq lock held but BH not separately

BH? softirqs you mean? Pray tell more.

> disabled, which interferes with the ability to enable interrupts but not BH.
> It also gets called from rt_mutex_setprio() and __sched_setscheduler(), and
> I didn't want the caller of those to be stuck with the latency.

You're not reading it right.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-04-28 23:02         ` Peter Zijlstra
@ 2020-04-28 23:20           ` Scott Wood
  2020-04-29  9:05             ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Scott Wood @ 2020-04-28 23:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Steven Rostedt, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users

On Wed, 2020-04-29 at 01:02 +0200, Peter Zijlstra wrote:
> On Tue, Apr 28, 2020 at 05:55:03PM -0500, Scott Wood wrote:
> > On Wed, 2020-04-29 at 00:09 +0200, Peter Zijlstra wrote:
> > > Also, if you move it this late, this is entirely the wrong place. If
> > > you
> > > do it after the context switch either use the balance_callback or put
> > > it
> > > in the idle path.
> > > 
> > > But what Valentin said; this needs a fair bit of support, the whole
> > > reason we've never done this is to avoid that double context switch...
> > > 
> > 
> > balance_callback() enters with the rq lock held but BH not separately
> 
> BH? softirqs you mean? Pray tell more.

In https://lore.kernel.org/lkml/5122CD9C.9070702@oracle.com/ the need to
keep softirqs disabled during rebalance was brought up, but simply wrapping
the lock dropping in local_bh_enable()/local_bh_disable() meant that
local_bh_enable() would be called with interrupts disabled, which isn't
allowed.

> > disabled, which interferes with the ability to enable interrupts but not
> > BH.
> > It also gets called from rt_mutex_setprio() and __sched_setscheduler(),
> > and
> > I didn't want the caller of those to be stuck with the latency.
> 
> You're not reading it right.

Could you elaborate?

-Scott



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-04-28  5:02 ` [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch() Scott Wood
  2020-04-28 21:37   ` Valentin Schneider
@ 2020-04-29  8:27   ` Vincent Guittot
  2020-04-30  1:36     ` Scott Wood
  1 sibling, 1 reply; 28+ messages in thread
From: Vincent Guittot @ 2020-04-29  8:27 UTC (permalink / raw)
  To: Scott Wood
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Dietmar Eggemann,
	Rik van Riel, Mel Gorman, linux-kernel, linux-rt-users

On Tue, 28 Apr 2020 at 07:02, Scott Wood <swood@redhat.com> wrote:
>
> Thus, newidle_balance() is entered with interrupts enabled, which allows
> (in the next patch) enabling interrupts when the lock is dropped.

The comment fails to explain which changes have been done to
newidle_balance(),  for which reasons and what are the impact for CFS
scheduler

>
> Signed-off-by: Scott Wood <swood@redhat.com>
> ---
>  kernel/sched/core.c  |  7 ++++---
>  kernel/sched/fair.c  | 45 ++++++++++++++++----------------------------
>  kernel/sched/sched.h |  6 ++----
>  3 files changed, 22 insertions(+), 36 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9a2fbf98fd6f..0294beb8d16c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3241,6 +3241,10 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>         }
>
>         tick_nohz_task_switch();
> +
> +       if (is_idle_task(current))
> +               newidle_balance();
> +
>         return rq;
>  }
>
> @@ -3919,8 +3923,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>                    rq->nr_running == rq->cfs.h_nr_running)) {
>
>                 p = pick_next_task_fair(rq, prev, rf);
> -               if (unlikely(p == RETRY_TASK))
> -                       goto restart;
>
>                 /* Assumes fair_sched_class->next == idle_sched_class */
>                 if (!p) {
> @@ -3931,7 +3933,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>                 return p;
>         }
>
> -restart:
>  #ifdef CONFIG_SMP
>         /*
>          * We must do the balancing pass before put_next_task(), such
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02f323b85b6d..74c3c5280d6b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6758,8 +6758,6 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>         if (rq->nr_running)
>                 return 1;
> -
> -       return newidle_balance(rq, rf) != 0;

Missing return ?

>  }
>  #endif /* CONFIG_SMP */
>
> @@ -6934,9 +6932,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>         struct cfs_rq *cfs_rq = &rq->cfs;
>         struct sched_entity *se;
>         struct task_struct *p;
> -       int new_tasks;
>
> -again:
>         if (!sched_fair_runnable(rq))
>                 goto idle;
>
> @@ -7050,19 +7046,6 @@ done: __maybe_unused;
>         if (!rf)
>                 return NULL;
>
> -       new_tasks = newidle_balance(rq, rf);
> -
> -       /*
> -        * Because newidle_balance() releases (and re-acquires) rq->lock, it is
> -        * possible for any higher priority task to appear. In that case we
> -        * must re-start the pick_next_entity() loop.
> -        */
> -       if (new_tasks < 0)
> -               return RETRY_TASK;
> -
> -       if (new_tasks > 0)
> -               goto again;
> -
>         /*
>          * rq is about to be idle, check if we need to update the
>          * lost_idle_time of clock_pelt
> @@ -10425,14 +10408,23 @@ static inline void nohz_newidle_balance(struct rq *this_rq) { }
>   *     0 - failed, no new tasks
>   *   > 0 - success, new (fair) tasks present
>   */
> -int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> +int newidle_balance(void)
>  {
>         unsigned long next_balance = jiffies + HZ;
> -       int this_cpu = this_rq->cpu;
> +       int this_cpu;
>         struct sched_domain *sd;
> +       struct rq *this_rq;
>         int pulled_task = 0;
>         u64 curr_cost = 0;
>
> +       preempt_disable();
> +       this_rq = this_rq();
> +       this_cpu = this_rq->cpu;
> +       local_bh_disable();
> +       raw_spin_lock_irq(&this_rq->lock);
> +
> +       update_rq_clock(this_rq);
> +
>         update_misfit_status(NULL, this_rq);
>         /*
>          * We must set idle_stamp _before_ calling idle_balance(), such that we
> @@ -10444,15 +10436,7 @@ int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>          * Do not pull tasks towards !active CPUs...
>          */
>         if (!cpu_active(this_cpu))
> -               return 0;
> -
> -       /*
> -        * This is OK, because current is on_cpu, which avoids it being picked
> -        * for load-balance and preemption/IRQs are still disabled avoiding
> -        * further scheduler activity on it and we're being very careful to
> -        * re-start the picking loop.
> -        */
> -       rq_unpin_lock(this_rq, rf);
> +               goto out_unlock;
>
>         if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>             !READ_ONCE(this_rq->rd->overload)) {
> @@ -10534,7 +10518,10 @@ int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>         if (pulled_task)
>                 this_rq->idle_stamp = 0;
>
> -       rq_repin_lock(this_rq, rf);
> +out_unlock:
> +       raw_spin_unlock_irq(&this_rq->lock);
> +       local_bh_enable();
> +       preempt_enable();
>
>         return pulled_task;
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index db3a57675ccf..3d97c51544d7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1504,13 +1504,13 @@ static inline void unregister_sched_domain_sysctl(void)
>  }
>  #endif
>
> -extern int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
> +extern int newidle_balance(void);
>
>  #else
>
>  static inline void sched_ttwu_pending(void) { }
>
> -static inline int newidle_balance(struct rq *this_rq, struct rq_flags *rf) { return 0; }
> +static inline int newidle_balance(void) { return 0; }
>
>  #endif /* CONFIG_SMP */
>
> @@ -1742,8 +1742,6 @@ extern const u32          sched_prio_to_wmult[40];
>  #define ENQUEUE_MIGRATED       0x00
>  #endif
>
> -#define RETRY_TASK             ((void *)-1UL)
> -
>  struct sched_class {
>         const struct sched_class *next;
>
> --
> 2.18.2
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-04-28 23:20           ` Scott Wood
@ 2020-04-29  9:05             ` Peter Zijlstra
  2020-04-30  1:31               ` Scott Wood
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2020-04-29  9:05 UTC (permalink / raw)
  To: Scott Wood
  Cc: Valentin Schneider, Steven Rostedt, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users

On Tue, Apr 28, 2020 at 06:20:32PM -0500, Scott Wood wrote:
> On Wed, 2020-04-29 at 01:02 +0200, Peter Zijlstra wrote:
> > On Tue, Apr 28, 2020 at 05:55:03PM -0500, Scott Wood wrote:
> > > On Wed, 2020-04-29 at 00:09 +0200, Peter Zijlstra wrote:
> > > > Also, if you move it this late, this is entirely the wrong place. If
> > > > you
> > > > do it after the context switch either use the balance_callback or put
> > > > it
> > > > in the idle path.
> > > > 
> > > > But what Valentin said; this needs a fair bit of support, the whole
> > > > reason we've never done this is to avoid that double context switch...
> > > > 
> > > 
> > > balance_callback() enters with the rq lock held but BH not separately
> > 
> > BH? softirqs you mean? Pray tell more.
> 
> In https://lore.kernel.org/lkml/5122CD9C.9070702@oracle.com/ the need to
> keep softirqs disabled during rebalance was brought up, but simply wrapping
> the lock dropping in local_bh_enable()/local_bh_disable() meant that
> local_bh_enable() would be called with interrupts disabled, which isn't
> allowed.

That thread, nor your explanation make any sense -- why do we care about
softirqs?, nor do I see how placing it in finish_task_switch() helps
with any of this.

> > > disabled, which interferes with the ability to enable interrupts but not
> > > BH.
> > > It also gets called from rt_mutex_setprio() and __sched_setscheduler(),
> > > and
> > > I didn't want the caller of those to be stuck with the latency.
> > 
> > You're not reading it right.
> 
> Could you elaborate?

If you were to do a queue_balance_callback() from somewhere in the
pick_next_task() machinery, then the balance_callback() at the end of
__schedule() would run it, and it'd be gone. How would
rt_mutex_setprio() / __sched_setscheduler() be affected?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-04-28 22:33     ` Scott Wood
@ 2020-04-29 12:00       ` Valentin Schneider
  0 siblings, 0 replies; 28+ messages in thread
From: Valentin Schneider @ 2020-04-29 12:00 UTC (permalink / raw)
  To: Scott Wood
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users


On 28/04/20 23:33, Scott Wood wrote:
> On Tue, 2020-04-28 at 22:37 +0100, Valentin Schneider wrote:
>> On 28/04/20 06:02, Scott Wood wrote:
>> > Thus, newidle_balance() is entered with interrupts enabled, which allows
>> > (in the next patch) enabling interrupts when the lock is dropped.
>> >
>> > Signed-off-by: Scott Wood <swood@redhat.com>
>> > ---
>> >  kernel/sched/core.c  |  7 ++++---
>> >  kernel/sched/fair.c  | 45 ++++++++++++++++----------------------------
>> >  kernel/sched/sched.h |  6 ++----
>> >  3 files changed, 22 insertions(+), 36 deletions(-)
>> >
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index 9a2fbf98fd6f..0294beb8d16c 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -3241,6 +3241,10 @@ static struct rq *finish_task_switch(struct
>> > task_struct *prev)
>> >       }
>> >
>> >       tick_nohz_task_switch();
>> > +
>> > +	if (is_idle_task(current))
>> > +		newidle_balance();
>> > +
>>
>> This means we must go through a switch_to(idle) before figuring out we
>> could've switched to a CFS task, and do it then. I'm curious to see the
>> performance impact of that.
>
> Any particular benchmark I should try?
>

I'm going to be very original and suggest hackbench :-)

That would just be the first stop however, you would also want to try
something less wakeup-intensive, maybe sysbench and the like - I'm thinking
if you spawn ~1.5*nr_cpu_ids CPU-hogs, you'll hit that double switch fairly
easily.

And then there's always the big boys benchmarks like specjbb and co - I'd
suggest having a look at Mel's mmtests.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 3/3] sched,rt: break out of load balancing if an RT task appears
  2020-04-28 22:33     ` Scott Wood
  2020-04-28 22:52       ` Scott Wood
@ 2020-04-29 12:01       ` Valentin Schneider
  1 sibling, 0 replies; 28+ messages in thread
From: Valentin Schneider @ 2020-04-29 12:01 UTC (permalink / raw)
  To: Scott Wood
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users


On 28/04/20 23:33, Scott Wood wrote:
>> > +
>> > +/* Is there a task of a high priority class? */
>> > +static inline bool rq_has_runnable_rt_task(struct rq *rq)
>> > +{
>> > +	return unlikely(rq->nr_running != rq->cfs.h_nr_running);
>>
>> Seeing as that can be RT, DL or stopper, that name is somewhat misleading.
>
> rq_has_runnable_rt_dl_task()?  Or is there some term that unambiguously
> encompasses both?
>

Naming is a pain as always; I'd shove it in fair.c as
"rq_has_higher_tasks()" or similar.

> -Scott

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/3] newidle_balance() latency mitigation
  2020-04-28  5:02 [RFC PATCH 0/3] newidle_balance() latency mitigation Scott Wood
                   ` (3 preceding siblings ...)
  2020-04-28 13:27 ` [RFC PATCH 0/3] newidle_balance() latency mitigation Steven Rostedt
@ 2020-04-29 23:13 ` Valentin Schneider
  2020-04-30  7:44   ` Vincent Guittot
  2020-04-30 12:48 ` Vincent Guittot
  5 siblings, 1 reply; 28+ messages in thread
From: Valentin Schneider @ 2020-04-29 23:13 UTC (permalink / raw)
  To: Scott Wood
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users


On 28/04/20 06:02, Scott Wood wrote:
> These patches mitigate latency caused by newidle_balance() on large
> systems, by enabling interrupts when the lock is dropped, and exiting
> early at various points if an RT task is runnable on the current CPU.
>
> When applied to an RT kernel on a 72-core machine (2 threads per core), I
> saw significant reductions in latency as reported by rteval -- from
> over 500us to around 160us with hyperthreading disabled, and from
> over 1400us to around 380us with hyperthreading enabled.
>
> This isn't the first time something like this has been tried:
> https://lore.kernel.org/lkml/20121222003019.433916240@goodmis.org/
> That attempt ended up being reverted:
> https://lore.kernel.org/lkml/5122CD9C.9070702@oracle.com/
>
> The problem in that case was the failure to keep BH disabled, and the
> difficulty of fixing that when called from the post_schedule() hook.
> This patchset uses finish_task_switch() to call newidle_balance(), which
> enters in non-atomic context so we have full control over what we disable
> and when.
>
> There was a note at the end about wanting further discussion on the matter --
> does anyone remember if that ever happened and what the conclusion was?
> Are there any other issues with enabling interrupts here and/or moving
> the newidle_balance() call?
>

Random thought that just occurred to me; in the grand scheme of things,
with something in the same spirit as task-stealing (i.e. don't bother with
a full fledged balance at newidle, just pick one spare task somewhere),
none of this would be required.

Sadly I don't think anyone has been looking at it any recently.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-04-29  9:05             ` Peter Zijlstra
@ 2020-04-30  1:31               ` Scott Wood
  2020-05-11 10:58                 ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Scott Wood @ 2020-04-30  1:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Steven Rostedt, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users

On Wed, 2020-04-29 at 11:05 +0200, Peter Zijlstra wrote:
> On Tue, Apr 28, 2020 at 06:20:32PM -0500, Scott Wood wrote:
> > On Wed, 2020-04-29 at 01:02 +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 28, 2020 at 05:55:03PM -0500, Scott Wood wrote:
> > > > On Wed, 2020-04-29 at 00:09 +0200, Peter Zijlstra wrote:
> > > > > Also, if you move it this late, this is entirely the wrong
> > > > > place.  If you do it after the context switch either use the
> > > > > balance_callback or put it in the idle path.
> > > > > 
> > > > > But what Valentin said; this needs a fair bit of support, the
> > > > > whole reason we've never done this is to avoid that double
> > > > > context switch...
> > > > > 
> > > > 
> > > > balance_callback() enters with the rq lock held but BH not
> > > > separately
> > > 
> > > BH? softirqs you mean? Pray tell more.
> > 
> > In https://lore.kernel.org/lkml/5122CD9C.9070702@oracle.com/ the need to
> > keep softirqs disabled during rebalance was brought up, but simply
> > wrapping
> > the lock dropping in local_bh_enable()/local_bh_disable() meant that
> > local_bh_enable() would be called with interrupts disabled, which isn't
> > allowed.
> 
> That thread, nor your explanation make any sense -- why do we care about
> softirqs?, 

I was trusting Steve's claim that that was the issue (it seemed plausible
given that system-wide rebalancing is done from a softirq).  If things have
changed since then, great.  If that was never the issue, then there's the
question of what caused the bug Sasha saw.

> nor do I see how placing it in finish_task_switch() helps
> with any of this.

It lets us do the local_bh_enable() after IRQs are enabled, since we don't
enter with any existing atomic context.  Though I suppose we could instead
do another lock drop at the end of newidle_balance() just to enable
softirqs.

> > > > disabled, which interferes with the ability to enable interrupts
> > > > but not BH.  It also gets called from rt_mutex_setprio() and
> > > > __sched_setscheduler(), and I didn't want the caller of those to
> > > > be stuck with the latency.
> > > 
> > > You're not reading it right.
> > 
> > Could you elaborate?
> 
> If you were to do a queue_balance_callback() from somewhere in the
> pick_next_task() machinery, then the balance_callback() at the end of
> __schedule() would run it, and it'd be gone. How would
> rt_mutex_setprio() / __sched_setscheduler() be affected?

The rq lock is dropped between queue_balance_callback() and the
balance_callback() at the end of __schedule().  What stops
setprio/setscheduler on another cpu from doing the callback at that
point?

-Scott



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-04-29  8:27   ` Vincent Guittot
@ 2020-04-30  1:36     ` Scott Wood
  0 siblings, 0 replies; 28+ messages in thread
From: Scott Wood @ 2020-04-30  1:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Dietmar Eggemann,
	Rik van Riel, Mel Gorman, linux-kernel, linux-rt-users

On Wed, 2020-04-29 at 10:27 +0200, Vincent Guittot wrote:
> On Tue, 28 Apr 2020 at 07:02, Scott Wood <swood@redhat.com> wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 02f323b85b6d..74c3c5280d6b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6758,8 +6758,6 @@ balance_fair(struct rq *rq, struct task_struct
> > *prev, struct rq_flags *rf)
> >  {
> >         if (rq->nr_running)
> >                 return 1;
> > -
> > -       return newidle_balance(rq, rf) != 0;
> 
> Missing return ?

Yes, editing error during last minute cleanup. :-P

-Scott



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/3] newidle_balance() latency mitigation
  2020-04-29 23:13 ` Valentin Schneider
@ 2020-04-30  7:44   ` Vincent Guittot
  2020-04-30 10:14     ` Valentin Schneider
  0 siblings, 1 reply; 28+ messages in thread
From: Vincent Guittot @ 2020-04-30  7:44 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Scott Wood, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users

On Thu, 30 Apr 2020 at 01:13, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
>
> On 28/04/20 06:02, Scott Wood wrote:
> > These patches mitigate latency caused by newidle_balance() on large
> > systems, by enabling interrupts when the lock is dropped, and exiting
> > early at various points if an RT task is runnable on the current CPU.
> >
> > When applied to an RT kernel on a 72-core machine (2 threads per core), I
> > saw significant reductions in latency as reported by rteval -- from
> > over 500us to around 160us with hyperthreading disabled, and from
> > over 1400us to around 380us with hyperthreading enabled.
> >
> > This isn't the first time something like this has been tried:
> > https://lore.kernel.org/lkml/20121222003019.433916240@goodmis.org/
> > That attempt ended up being reverted:
> > https://lore.kernel.org/lkml/5122CD9C.9070702@oracle.com/
> >
> > The problem in that case was the failure to keep BH disabled, and the
> > difficulty of fixing that when called from the post_schedule() hook.
> > This patchset uses finish_task_switch() to call newidle_balance(), which
> > enters in non-atomic context so we have full control over what we disable
> > and when.
> >
> > There was a note at the end about wanting further discussion on the matter --
> > does anyone remember if that ever happened and what the conclusion was?
> > Are there any other issues with enabling interrupts here and/or moving
> > the newidle_balance() call?
> >
>
> Random thought that just occurred to me; in the grand scheme of things,
> with something in the same spirit as task-stealing (i.e. don't bother with
> a full fledged balance at newidle, just pick one spare task somewhere),
> none of this would be required.

newly idle load balance already stops after picking 1 task
Now if your proposal is to pick one random task on one random cpu, I'm
clearly not sure that's a good idea


>
> Sadly I don't think anyone has been looking at it any recently.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/3] newidle_balance() latency mitigation
  2020-04-30  7:44   ` Vincent Guittot
@ 2020-04-30 10:14     ` Valentin Schneider
  2020-04-30 12:42       ` Vincent Guittot
  0 siblings, 1 reply; 28+ messages in thread
From: Valentin Schneider @ 2020-04-30 10:14 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Scott Wood, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users


On 30/04/20 08:44, Vincent Guittot wrote:
> On Thu, 30 Apr 2020 at 01:13, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>>
>>
>> On 28/04/20 06:02, Scott Wood wrote:
>> > These patches mitigate latency caused by newidle_balance() on large
>> > systems, by enabling interrupts when the lock is dropped, and exiting
>> > early at various points if an RT task is runnable on the current CPU.
>> >
>> > When applied to an RT kernel on a 72-core machine (2 threads per core), I
>> > saw significant reductions in latency as reported by rteval -- from
>> > over 500us to around 160us with hyperthreading disabled, and from
>> > over 1400us to around 380us with hyperthreading enabled.
>> >
>> > This isn't the first time something like this has been tried:
>> > https://lore.kernel.org/lkml/20121222003019.433916240@goodmis.org/
>> > That attempt ended up being reverted:
>> > https://lore.kernel.org/lkml/5122CD9C.9070702@oracle.com/
>> >
>> > The problem in that case was the failure to keep BH disabled, and the
>> > difficulty of fixing that when called from the post_schedule() hook.
>> > This patchset uses finish_task_switch() to call newidle_balance(), which
>> > enters in non-atomic context so we have full control over what we disable
>> > and when.
>> >
>> > There was a note at the end about wanting further discussion on the matter --
>> > does anyone remember if that ever happened and what the conclusion was?
>> > Are there any other issues with enabling interrupts here and/or moving
>> > the newidle_balance() call?
>> >
>>
>> Random thought that just occurred to me; in the grand scheme of things,
>> with something in the same spirit as task-stealing (i.e. don't bother with
>> a full fledged balance at newidle, just pick one spare task somewhere),
>> none of this would be required.
>
> newly idle load balance already stops after picking 1 task

Mph, I had already forgotten your changes there. Is that really always the
case for newidle? In e.g. the busiest->group_type == group_fully_busy case,
I think we can pull more than one task.

> Now if your proposal is to pick one random task on one random cpu, I'm
> clearly not sure that's a good idea
>

IIRC Steve's implementation was to "simply" pull one task from any CPU
within the LLC domain that had > 1 runnable tasks. I quite like this since
picking any one task is almost always better than switching to the idle
task, but it wasn't a complete newidle_balance() replacement just yet.

>
>>
>> Sadly I don't think anyone has been looking at it any recently.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/3] newidle_balance() latency mitigation
  2020-04-30 10:14     ` Valentin Schneider
@ 2020-04-30 12:42       ` Vincent Guittot
  2020-04-30 13:56         ` Valentin Schneider
  0 siblings, 1 reply; 28+ messages in thread
From: Vincent Guittot @ 2020-04-30 12:42 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Scott Wood, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users

On Thu, 30 Apr 2020 at 12:14, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
>
> On 30/04/20 08:44, Vincent Guittot wrote:
> > On Thu, 30 Apr 2020 at 01:13, Valentin Schneider
> > <valentin.schneider@arm.com> wrote:
> >>
> >>
> >> On 28/04/20 06:02, Scott Wood wrote:
> >> > These patches mitigate latency caused by newidle_balance() on large
> >> > systems, by enabling interrupts when the lock is dropped, and exiting
> >> > early at various points if an RT task is runnable on the current CPU.
> >> >
> >> > When applied to an RT kernel on a 72-core machine (2 threads per core), I
> >> > saw significant reductions in latency as reported by rteval -- from
> >> > over 500us to around 160us with hyperthreading disabled, and from
> >> > over 1400us to around 380us with hyperthreading enabled.
> >> >
> >> > This isn't the first time something like this has been tried:
> >> > https://lore.kernel.org/lkml/20121222003019.433916240@goodmis.org/
> >> > That attempt ended up being reverted:
> >> > https://lore.kernel.org/lkml/5122CD9C.9070702@oracle.com/
> >> >
> >> > The problem in that case was the failure to keep BH disabled, and the
> >> > difficulty of fixing that when called from the post_schedule() hook.
> >> > This patchset uses finish_task_switch() to call newidle_balance(), which
> >> > enters in non-atomic context so we have full control over what we disable
> >> > and when.
> >> >
> >> > There was a note at the end about wanting further discussion on the matter --
> >> > does anyone remember if that ever happened and what the conclusion was?
> >> > Are there any other issues with enabling interrupts here and/or moving
> >> > the newidle_balance() call?
> >> >
> >>
> >> Random thought that just occurred to me; in the grand scheme of things,
> >> with something in the same spirit as task-stealing (i.e. don't bother with
> >> a full fledged balance at newidle, just pick one spare task somewhere),
> >> none of this would be required.
> >
> > newly idle load balance already stops after picking 1 task
>
> Mph, I had already forgotten your changes there. Is that really always the
> case for newidle? In e.g. the busiest->group_type == group_fully_busy case,
> I think we can pull more than one task.

for newly_idle load balance, detach_tasks stops after finding 1 suitable task

>
> > Now if your proposal is to pick one random task on one random cpu, I'm
> > clearly not sure that's a good idea
> >
>
> IIRC Steve's implementation was to "simply" pull one task from any CPU
> within the LLC domain that had > 1 runnable tasks. I quite like this since
> picking any one task is almost always better than switching to the idle
> task, but it wasn't a complete newidle_balance() replacement just yet.
>
> >
> >>
> >> Sadly I don't think anyone has been looking at it any recently.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/3] newidle_balance() latency mitigation
  2020-04-28  5:02 [RFC PATCH 0/3] newidle_balance() latency mitigation Scott Wood
                   ` (4 preceding siblings ...)
  2020-04-29 23:13 ` Valentin Schneider
@ 2020-04-30 12:48 ` Vincent Guittot
  5 siblings, 0 replies; 28+ messages in thread
From: Vincent Guittot @ 2020-04-30 12:48 UTC (permalink / raw)
  To: Scott Wood
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Dietmar Eggemann,
	Rik van Riel, Mel Gorman, linux-kernel, linux-rt-users

On Tue, 28 Apr 2020 at 07:02, Scott Wood <swood@redhat.com> wrote:
>
> These patches mitigate latency caused by newidle_balance() on large
> systems, by enabling interrupts when the lock is dropped, and exiting
> early at various points if an RT task is runnable on the current CPU.
>
> When applied to an RT kernel on a 72-core machine (2 threads per core), I
> saw significant reductions in latency as reported by rteval -- from
> over 500us to around 160us with hyperthreading disabled, and from
> over 1400us to around 380us with hyperthreading enabled.

Do you know how each patch contributes to the decrease ? Because patch
3 not only impacts newly idle lb but each and every lb. So most of the
decrease might come from aborting the busy or idle lb at the highest
sched_domai level which scan all cpus and moving newly idle load
balance is not a major part.

>
> This isn't the first time something like this has been tried:
> https://lore.kernel.org/lkml/20121222003019.433916240@goodmis.org/
> That attempt ended up being reverted:
> https://lore.kernel.org/lkml/5122CD9C.9070702@oracle.com/
>
> The problem in that case was the failure to keep BH disabled, and the
> difficulty of fixing that when called from the post_schedule() hook.
> This patchset uses finish_task_switch() to call newidle_balance(), which
> enters in non-atomic context so we have full control over what we disable
> and when.
>
> There was a note at the end about wanting further discussion on the matter --
> does anyone remember if that ever happened and what the conclusion was?
> Are there any other issues with enabling interrupts here and/or moving
> the newidle_balance() call?
>
> Rik van Riel (1):
>   sched,rt: break out of load balancing if an RT task appears
>
> Scott Wood (2):
>   sched/fair: Call newidle_balance() from finish_task_switch()
>   sched/fair: Enable interrupts when dropping lock in newidle_balance()
>
>  kernel/sched/core.c  |  7 +++--
>  kernel/sched/fair.c  | 72 +++++++++++++++++++++++---------------------
>  kernel/sched/sched.h | 12 +++++---
>  3 files changed, 49 insertions(+), 42 deletions(-)
>
> --
> 2.18.2
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/3] newidle_balance() latency mitigation
  2020-04-30 12:42       ` Vincent Guittot
@ 2020-04-30 13:56         ` Valentin Schneider
  0 siblings, 0 replies; 28+ messages in thread
From: Valentin Schneider @ 2020-04-30 13:56 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Scott Wood, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users


On 30/04/20 13:42, Vincent Guittot wrote:
>> >> Random thought that just occurred to me; in the grand scheme of things,
>> >> with something in the same spirit as task-stealing (i.e. don't bother with
>> >> a full fledged balance at newidle, just pick one spare task somewhere),
>> >> none of this would be required.
>> >
>> > newly idle load balance already stops after picking 1 task
>>
>> Mph, I had already forgotten your changes there. Is that really always the
>> case for newidle? In e.g. the busiest->group_type == group_fully_busy case,
>> I think we can pull more than one task.
>
> for newly_idle load balance, detach_tasks stops after finding 1 suitable task
>

Right, I hadn't noticed

  7e96fa5875d4 ("sched: pull only one task during NEWIDLE balancing to limit critical section")

>>
>> > Now if your proposal is to pick one random task on one random cpu, I'm
>> > clearly not sure that's a good idea
>> >
>>
>> IIRC Steve's implementation was to "simply" pull one task from any CPU
>> within the LLC domain that had > 1 runnable tasks. I quite like this since
>> picking any one task is almost always better than switching to the idle
>> task, but it wasn't a complete newidle_balance() replacement just yet.
>>
>> >
>> >>
>> >> Sadly I don't think anyone has been looking at it any recently.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-04-30  1:31               ` Scott Wood
@ 2020-05-11 10:58                 ` Peter Zijlstra
  2020-05-11 12:13                   ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2020-05-11 10:58 UTC (permalink / raw)
  To: Scott Wood
  Cc: Valentin Schneider, Steven Rostedt, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users

On Wed, Apr 29, 2020 at 08:31:39PM -0500, Scott Wood wrote:
> > If you were to do a queue_balance_callback() from somewhere in the
> > pick_next_task() machinery, then the balance_callback() at the end of
> > __schedule() would run it, and it'd be gone. How would
> > rt_mutex_setprio() / __sched_setscheduler() be affected?
> 
> The rq lock is dropped between queue_balance_callback() and the
> balance_callback() at the end of __schedule().  What stops
> setprio/setscheduler on another cpu from doing the callback at that
> point?

Hurmm.. fair point, and that might explain some issues I had a while
back. Let me poke a little at that.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()
  2020-05-11 10:58                 ` Peter Zijlstra
@ 2020-05-11 12:13                   ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2020-05-11 12:13 UTC (permalink / raw)
  To: Scott Wood
  Cc: Valentin Schneider, Steven Rostedt, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Rik van Riel, Mel Gorman, linux-kernel,
	linux-rt-users

On Mon, May 11, 2020 at 12:58:00PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 29, 2020 at 08:31:39PM -0500, Scott Wood wrote:
> > > If you were to do a queue_balance_callback() from somewhere in the
> > > pick_next_task() machinery, then the balance_callback() at the end of
> > > __schedule() would run it, and it'd be gone. How would
> > > rt_mutex_setprio() / __sched_setscheduler() be affected?
> > 
> > The rq lock is dropped between queue_balance_callback() and the
> > balance_callback() at the end of __schedule().  What stops
> > setprio/setscheduler on another cpu from doing the callback at that
> > point?
> 
> Hurmm.. fair point, and that might explain some issues I had a while
> back. Let me poke a little at that.

How's this?

---
 kernel/sched/core.c  | 109 ++++++++++++++++++++++++++++++---------------------
 kernel/sched/sched.h |   2 +
 2 files changed, 67 insertions(+), 44 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dfb8ab61cbdd..610e9da557ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3078,6 +3078,59 @@ static inline void finish_task(struct task_struct *prev)
 #endif
 }
 
+#ifdef CONFIG_SMP
+
+/* rq->lock is NOT held, but preemption is disabled */
+static void do_balance_callbacks(struct callback_head *head)
+{
+	void (*func)(struct rq *rq);
+	struct callback_head *next;
+
+	while (head) {
+		func = (void (*)(struct rq *))head->func;
+		next = head->next;
+		head->next = NULL;
+		head = next;
+
+		func(rq);
+	}
+}
+
+static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
+{
+	struct callback_head *head = rq->balance_callback;
+	if (head)
+		rq->balance_callback = NULL;
+}
+
+static void __balance_callbacks(struct rq *rq)
+{
+	do_balance_callbacks(splice_balance_callbacks(rq));
+}
+
+static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
+{
+	unsigned long flags;
+
+	if (unlikely(head)) {
+		raw_spin_lock_irqsave(&rq->lock, flags);
+		do_balance_callbacks(head);
+		raw_spin_unlock_irqrestore(&rq->lock, flags);
+	}
+}
+
+#else
+
+static inline void __balance_callbacks(struct rq *rq)
+{
+}
+
+static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
+{
+}
+
+#endif
+
 static inline void
 prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 {
@@ -3103,6 +3156,7 @@ static inline void finish_lock_switch(struct rq *rq)
 	 * prev into current:
 	 */
 	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
+	__balance_callbacks(rq));
 	raw_spin_unlock_irq(&rq->lock);
 }
 
@@ -3244,43 +3298,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	return rq;
 }
 
-#ifdef CONFIG_SMP
-
-/* rq->lock is NOT held, but preemption is disabled */
-static void __balance_callback(struct rq *rq)
-{
-	struct callback_head *head, *next;
-	void (*func)(struct rq *rq);
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&rq->lock, flags);
-	head = rq->balance_callback;
-	rq->balance_callback = NULL;
-	while (head) {
-		func = (void (*)(struct rq *))head->func;
-		next = head->next;
-		head->next = NULL;
-		head = next;
-
-		func(rq);
-	}
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
-}
-
-static inline void balance_callback(struct rq *rq)
-{
-	if (unlikely(rq->balance_callback))
-		__balance_callback(rq);
-}
-
-#else
-
-static inline void balance_callback(struct rq *rq)
-{
-}
-
-#endif
-
 /**
  * schedule_tail - first thing a freshly forked thread must call.
  * @prev: the thread we just switched away from.
@@ -3300,7 +3317,6 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 	 */
 
 	rq = finish_task_switch(prev);
-	balance_callback(rq);
 	preempt_enable();
 
 	if (current->set_child_tid)
@@ -4090,10 +4106,11 @@ static void __sched notrace __schedule(bool preempt)
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
 		rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
-		rq_unlock_irq(rq, &rf);
-	}
 
-	balance_callback(rq);
+		rq_unpin_lock(rq, &rf);
+		__balance_callbacks(rq);
+		raw_spin_unlock_irq(&rq->lock);
+	}
 }
 
 void __noreturn do_task_dead(void)
@@ -4499,9 +4516,11 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 out_unlock:
 	/* Avoid rq from going away on us: */
 	preempt_disable();
-	__task_rq_unlock(rq, &rf);
 
-	balance_callback(rq);
+	rq_unpin_lock(rq, &rf);
+	__balance_callbacks(rq);
+	raw_spin_unlock(&rq->lock);
+
 	preempt_enable();
 }
 #else
@@ -4775,6 +4794,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	int retval, oldprio, oldpolicy = -1, queued, running;
 	int new_effective_prio, policy = attr->sched_policy;
 	const struct sched_class *prev_class;
+	struct callback_head *head;
 	struct rq_flags rf;
 	int reset_on_fork;
 	int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
@@ -5013,6 +5033,7 @@ static int __sched_setscheduler(struct task_struct *p,
 
 	/* Avoid rq from going away on us: */
 	preempt_disable();
+	head = splice_balance_callbacks(rq);
 	task_rq_unlock(rq, p, &rf);
 
 	if (pi) {
@@ -5021,7 +5042,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	}
 
 	/* Run balance callbacks after we've adjusted the PI chain: */
-	balance_callback(rq);
+	balance_callbacks(head);
 	preempt_enable();
 
 	return 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d7fc4caf0dfd..3855d354760d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1189,6 +1189,8 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 #ifdef CONFIG_SCHED_DEBUG
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
 	rf->clock_update_flags = 0;
+
+	SCHED_WARN_ON(rq->balance_callback);
 #endif
 }
 

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, back to index

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-28  5:02 [RFC PATCH 0/3] newidle_balance() latency mitigation Scott Wood
2020-04-28  5:02 ` [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch() Scott Wood
2020-04-28 21:37   ` Valentin Schneider
2020-04-28 22:09     ` Peter Zijlstra
2020-04-28 22:55       ` Scott Wood
2020-04-28 23:02         ` Peter Zijlstra
2020-04-28 23:20           ` Scott Wood
2020-04-29  9:05             ` Peter Zijlstra
2020-04-30  1:31               ` Scott Wood
2020-05-11 10:58                 ` Peter Zijlstra
2020-05-11 12:13                   ` Peter Zijlstra
2020-04-28 22:33     ` Scott Wood
2020-04-29 12:00       ` Valentin Schneider
2020-04-29  8:27   ` Vincent Guittot
2020-04-30  1:36     ` Scott Wood
2020-04-28  5:02 ` [RFC PATCH 2/3] sched/fair: Enable interrupts when dropping lock in newidle_balance() Scott Wood
2020-04-28  5:02 ` [RFC PATCH 3/3] sched,rt: break out of load balancing if an RT task appears Scott Wood
2020-04-28 21:56   ` Valentin Schneider
2020-04-28 22:33     ` Scott Wood
2020-04-28 22:52       ` Scott Wood
2020-04-29 12:01       ` Valentin Schneider
2020-04-28 13:27 ` [RFC PATCH 0/3] newidle_balance() latency mitigation Steven Rostedt
2020-04-29 23:13 ` Valentin Schneider
2020-04-30  7:44   ` Vincent Guittot
2020-04-30 10:14     ` Valentin Schneider
2020-04-30 12:42       ` Vincent Guittot
2020-04-30 13:56         ` Valentin Schneider
2020-04-30 12:48 ` Vincent Guittot

Linux-rt-users archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-rt-users/0 linux-rt-users/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-rt-users linux-rt-users/ https://lore.kernel.org/linux-rt-users \
		linux-rt-users@vger.kernel.org
	public-inbox-index linux-rt-users

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-rt-users


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git