* [PATCH] sched/fair: Load balance aggressively for SCHED_IDLE CPUs
@ 2019-12-24 5:13 Viresh Kumar
2020-01-02 17:29 ` Steven Rostedt
2020-01-07 12:42 ` Peter Zijlstra
0 siblings, 2 replies; 6+ messages in thread
From: Viresh Kumar @ 2019-12-24 5:13 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman
Cc: Viresh Kumar, linux-kernel
The fair scheduler performs periodic load balancing on every CPU to
check whether it can pull some tasks from other busy CPUs. The interval
of this periodic load balance is set to sd->balance_interval for idle
CPUs and is calculated by multiplying sd->balance_interval by
sd->busy_factor (set to 32 by default) for busy CPUs. The
multiplication is done for busy CPUs to avoid load balancing too often
and to rather spend more time executing actual tasks. While that is the
right thing to do for CPUs busy with SCHED_OTHER or SCHED_BATCH tasks,
it may not be optimal for CPUs running only SCHED_IDLE tasks.
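For reference, the interval scaling at the time of this patch is done
in get_sd_balance_interval(), which looks roughly like this (simplified
sketch of the helper in kernel/sched/fair.c):

static inline unsigned long
get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
{
	unsigned long interval = sd->balance_interval;

	/* busy CPUs balance far less frequently */
	if (cpu_busy)
		interval *= sd->busy_factor;

	/* scale ms to jiffies and clamp to sane bounds */
	interval = msecs_to_jiffies(interval);
	interval = clamp(interval, 1UL, max_load_balance_interval);

	return interval;
}

So with a default balance_interval of a few milliseconds, a CPU that is
treated as busy can end up waiting a few hundred milliseconds between
balance attempts.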
With the recent enhancements in the fair scheduler around SCHED_IDLE
CPUs, we now prefer to enqueue a newly-woken task to a SCHED_IDLE
CPU instead of other busy or idle CPUs. The same reasoning should be
applied to the load balancer as well to make it migrate tasks more
aggressively to a SCHED_IDLE CPU, as that will reduce the scheduling
latency of the migrated (SCHED_OTHER) tasks.
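As a reminder, the wakeup fast path already treats a SCHED_IDLE CPU
like an idle one; select_idle_sibling() now starts roughly like this
(excerpt from kernels around v5.4/v5.5, simplified):

	/* an idle or SCHED_IDLE target is used right away */
	if (available_idle_cpu(target) || sched_idle_cpu(target))
		return target;

	/* same for a cache-affine previous CPU */
	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
		return prev;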
This patch makes minimal changes to the fair scheduler to do the next
load balance soon after the last non-SCHED_IDLE task is dequeued from a
runqueue, i.e. as soon as the CPU becomes SCHED_IDLE. Also,
sd->busy_factor is ignored while calculating the balance_interval for
such CPUs, to avoid delaying the periodic load balance by a few hundred
milliseconds for SCHED_IDLE CPUs.
This was tested on the ARM64 Hikey620 platform (octa-core) with the
help of rt-app, and it was verified using kernel traces that the newly
SCHED_IDLE CPU performs load balancing shortly after it becomes
SCHED_IDLE and pulls tasks from other busy CPUs.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++------------
1 file changed, 27 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83500b5b93dc..645eb248a2d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5210,6 +5210,18 @@ static inline void update_overutilized_status(struct rq *rq)
static inline void update_overutilized_status(struct rq *rq) { }
#endif
+/* Runqueue only has SCHED_IDLE tasks enqueued */
+static int sched_idle_rq(struct rq *rq)
+{
+ return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
+ rq->nr_running);
+}
+
+static int sched_idle_cpu(int cpu)
+{
+ return sched_idle_rq(cpu_rq(cpu));
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -5324,6 +5336,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct sched_entity *se = &p->se;
int task_sleep = flags & DEQUEUE_SLEEP;
int idle_h_nr_running = task_has_idle_policy(p);
+ bool was_sched_idle = sched_idle_rq(rq);
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -5370,6 +5383,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
sub_nr_running(rq, 1);
+ /* balance early to pull high priority tasks */
+ if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
+ rq->next_balance = jiffies;
+
util_est_dequeue(&rq->cfs, p, task_sleep);
hrtick_update(rq);
}
@@ -5392,15 +5409,6 @@ static struct {
#endif /* CONFIG_NO_HZ_COMMON */
-/* CPU only has SCHED_IDLE tasks enqueued */
-static int sched_idle_cpu(int cpu)
-{
- struct rq *rq = cpu_rq(cpu);
-
- return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
- rq->nr_running);
-}
-
static unsigned long cpu_load(struct rq *rq)
{
return cfs_rq_load_avg(&rq->cfs);
@@ -9531,6 +9539,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
{
int continue_balancing = 1;
int cpu = rq->cpu;
+ int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
unsigned long interval;
struct sched_domain *sd;
/* Earliest time when we have to do rebalance again */
@@ -9567,7 +9576,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
break;
}
- interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
+ interval = get_sd_balance_interval(sd, busy);
need_serialize = sd->flags & SD_SERIALIZE;
if (need_serialize) {
@@ -9582,10 +9591,16 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
* env->dst_cpu, so we can't know our idle
* state even if we migrated tasks. Update it.
*/
- idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
+ if (idle_cpu(cpu)) {
+ idle = CPU_IDLE;
+ busy = 0;
+ } else {
+ idle = CPU_NOT_IDLE;
+ busy = !sched_idle_cpu(cpu);
+ }
}
sd->last_balance = jiffies;
- interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
+ interval = get_sd_balance_interval(sd, busy);
}
if (need_serialize)
spin_unlock(&balancing);
--
2.21.0.rc0.269.g1a574e7a288b
* Re: [PATCH] sched/fair: Load balance aggressively for SCHED_IDLE CPUs
2019-12-24 5:13 [PATCH] sched/fair: Load balance aggressively for SCHED_IDLE CPUs Viresh Kumar
@ 2020-01-02 17:29 ` Steven Rostedt
[not found] ` <20200107112518.fqqzldnflqxonptf@vireshk-i7>
2020-01-07 12:42 ` Peter Zijlstra
1 sibling, 1 reply; 6+ messages in thread
From: Steven Rostedt @ 2020-01-02 17:29 UTC
To: Viresh Kumar
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On Tue, 24 Dec 2019 10:43:30 +0530
Viresh Kumar <viresh.kumar@linaro.org> wrote:
> This was tested on the ARM64 Hikey620 platform (octa-core) with the
> help of rt-app, and it was verified using kernel traces that the newly
> SCHED_IDLE CPU performs load balancing shortly after it becomes
> SCHED_IDLE and pulls tasks from other busy CPUs.
Can you post the actual steps you used to test this and show the before
and after results? Then others can reproduce what you have shown and
even run other tests to see if this change has any other side effects.
Thanks!
-- Steve
* Re: [PATCH] sched/fair: Load balance aggressively for SCHED_IDLE CPUs
2019-12-24 5:13 [PATCH] sched/fair: Load balance aggressively for SCHED_IDLE CPUs Viresh Kumar
2020-01-02 17:29 ` Steven Rostedt
@ 2020-01-07 12:42 ` Peter Zijlstra
2020-01-08 8:05 ` Vincent Guittot
1 sibling, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2020-01-07 12:42 UTC
To: Viresh Kumar
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel
On Tue, Dec 24, 2019 at 10:43:30AM +0530, Viresh Kumar wrote:
> The fair scheduler performs periodic load balancing on every CPU to
> check whether it can pull some tasks from other busy CPUs. The interval
> of this periodic load balance is set to sd->balance_interval for idle
> CPUs and is calculated by multiplying sd->balance_interval by
> sd->busy_factor (set to 32 by default) for busy CPUs. The
> multiplication is done for busy CPUs to avoid load balancing too often
> and to rather spend more time executing actual tasks. While that is the
> right thing to do for CPUs busy with SCHED_OTHER or SCHED_BATCH tasks,
> it may not be optimal for CPUs running only SCHED_IDLE tasks.
>
> With the recent enhancements in the fair scheduler around SCHED_IDLE
> CPUs, we now prefer to enqueue a newly-woken task to a SCHED_IDLE
> CPU instead of other busy or idle CPUs. The same reasoning should be
> applied to the load balancer as well to make it migrate tasks more
> aggressively to a SCHED_IDLE CPU, as that will reduce the scheduling
> latency of the migrated (SCHED_OTHER) tasks.
>
> This patch makes minimal changes to the fair scheduler to do the next
> load balance soon after the last non-SCHED_IDLE task is dequeued from a
> runqueue, i.e. as soon as the CPU becomes SCHED_IDLE. Also,
> sd->busy_factor is ignored while calculating the balance_interval for
> such CPUs, to avoid delaying the periodic load balance by a few hundred
> milliseconds for SCHED_IDLE CPUs.
>
> This was tested on the ARM64 Hikey620 platform (octa-core) with the
> help of rt-app, and it was verified using kernel traces that the newly
> SCHED_IDLE CPU performs load balancing shortly after it becomes
> SCHED_IDLE and pulls tasks from other busy CPUs.
Nothing seems really objectionable here; I have a few comments below.
Vincent?
> @@ -5324,6 +5336,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> struct sched_entity *se = &p->se;
> int task_sleep = flags & DEQUEUE_SLEEP;
> int idle_h_nr_running = task_has_idle_policy(p);
> + bool was_sched_idle = sched_idle_rq(rq);
>
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> @@ -5370,6 +5383,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!se)
> sub_nr_running(rq, 1);
>
> + /* balance early to pull high priority tasks */
> + if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
> + rq->next_balance = jiffies;
> +
> util_est_dequeue(&rq->cfs, p, task_sleep);
> hrtick_update(rq);
> }
This can effectively set ->next_balance in the past, but given we only
tickle the balancer on every jiffy edge, that is of no concern. It just
made me stumble when reading this.
Not sure whether it even deserves a comment or not...
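For completeness, the tickle in question is just the tick-time check in
trigger_load_balance(), roughly:

void trigger_load_balance(struct rq *rq)
{
	/* Don't need to rebalance while attached to NULL domain */
	if (unlikely(on_null_domain(rq)))
		return;

	if (time_after_eq(jiffies, rq->next_balance))
		raise_softirq(SCHED_SOFTIRQ);

	nohz_balancer_kick(rq);
}

so a ->next_balance in the past simply means the softirq fires at the
next tick.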
> @@ -9531,6 +9539,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> {
> int continue_balancing = 1;
> int cpu = rq->cpu;
> + int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
> unsigned long interval;
> struct sched_domain *sd;
> /* Earliest time when we have to do rebalance again */
> @@ -9567,7 +9576,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> break;
> }
>
> - interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
> + interval = get_sd_balance_interval(sd, busy);
>
> need_serialize = sd->flags & SD_SERIALIZE;
> if (need_serialize) {
> @@ -9582,10 +9591,16 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> * env->dst_cpu, so we can't know our idle
> * state even if we migrated tasks. Update it.
> */
> - idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
> + if (idle_cpu(cpu)) {
> + idle = CPU_IDLE;
> + busy = 0;
> + } else {
> + idle = CPU_NOT_IDLE;
> + busy = !sched_idle_cpu(cpu);
> + }
This is inconsistent vs the earlier code. That is, why not write it
like:
idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
> }
> sd->last_balance = jiffies;
> - interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
> + interval = get_sd_balance_interval(sd, busy);
> }
> if (need_serialize)
> spin_unlock(&balancing);
* Re: [PATCH] sched/fair: Load balance aggressively for SCHED_IDLE CPUs
2020-01-07 12:42 ` Peter Zijlstra
@ 2020-01-08 8:05 ` Vincent Guittot
0 siblings, 0 replies; 6+ messages in thread
From: Vincent Guittot @ 2020-01-08 8:05 UTC
To: Peter Zijlstra
Cc: Viresh Kumar, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel
On Tue, 7 Jan 2020 at 13:42, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Dec 24, 2019 at 10:43:30AM +0530, Viresh Kumar wrote:
> > The fair scheduler performs periodic load balancing on every CPU to
> > check whether it can pull some tasks from other busy CPUs. The interval
> > of this periodic load balance is set to sd->balance_interval for idle
> > CPUs and is calculated by multiplying sd->balance_interval by
> > sd->busy_factor (set to 32 by default) for busy CPUs. The
> > multiplication is done for busy CPUs to avoid load balancing too often
> > and to rather spend more time executing actual tasks. While that is the
> > right thing to do for CPUs busy with SCHED_OTHER or SCHED_BATCH tasks,
> > it may not be optimal for CPUs running only SCHED_IDLE tasks.
> >
> > With the recent enhancements in the fair scheduler around SCHED_IDLE
> > CPUs, we now prefer to enqueue a newly-woken task to a SCHED_IDLE
> > CPU instead of other busy or idle CPUs. The same reasoning should be
> > applied to the load balancer as well to make it migrate tasks more
> > aggressively to a SCHED_IDLE CPU, as that will reduce the scheduling
> > latency of the migrated (SCHED_OTHER) tasks.
> >
> > This patch makes minimal changes to the fair scheduler to do the next
> > load balance soon after the last non-SCHED_IDLE task is dequeued from a
> > runqueue, i.e. as soon as the CPU becomes SCHED_IDLE. Also,
> > sd->busy_factor is ignored while calculating the balance_interval for
> > such CPUs, to avoid delaying the periodic load balance by a few hundred
> > milliseconds for SCHED_IDLE CPUs.
> >
> > This was tested on the ARM64 Hikey620 platform (octa-core) with the
> > help of rt-app, and it was verified using kernel traces that the newly
> > SCHED_IDLE CPU performs load balancing shortly after it becomes
> > SCHED_IDLE and pulls tasks from other busy CPUs.
>
> Nothing seems really objectionable here; I have a few comments below.
>
> Vincent?
The change makes sense to me. This should fix the last remaining long
scheduling latency of SCHED_OTHER tasks in the presence of SCHED_IDLE
tasks.
With the change proposed by Peter below, you can add my
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>
>
> > @@ -5324,6 +5336,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > struct sched_entity *se = &p->se;
> > int task_sleep = flags & DEQUEUE_SLEEP;
> > int idle_h_nr_running = task_has_idle_policy(p);
> > + bool was_sched_idle = sched_idle_rq(rq);
> >
> > for_each_sched_entity(se) {
> > cfs_rq = cfs_rq_of(se);
> > @@ -5370,6 +5383,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > if (!se)
> > sub_nr_running(rq, 1);
> >
> > + /* balance early to pull high priority tasks */
> > + if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
> > + rq->next_balance = jiffies;
> > +
> > util_est_dequeue(&rq->cfs, p, task_sleep);
> > hrtick_update(rq);
> > }
>
> This can effectively set ->next_balance in the past, but given we only
> tickle the balancer on every jiffy edge, that is of no concern. It just
> made me stumble when reading this.
>
> Not sure it even deserves a comment or not..
>
> > @@ -9531,6 +9539,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> > {
> > int continue_balancing = 1;
> > int cpu = rq->cpu;
> > + int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
> > unsigned long interval;
> > struct sched_domain *sd;
> > /* Earliest time when we have to do rebalance again */
> > @@ -9567,7 +9576,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> > break;
> > }
> >
> > - interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
> > + interval = get_sd_balance_interval(sd, busy);
> >
> > need_serialize = sd->flags & SD_SERIALIZE;
> > if (need_serialize) {
> > @@ -9582,10 +9591,16 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> > * env->dst_cpu, so we can't know our idle
> > * state even if we migrated tasks. Update it.
> > */
> > - idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
> > + if (idle_cpu(cpu)) {
> > + idle = CPU_IDLE;
> > + busy = 0;
> > + } else {
> > + idle = CPU_NOT_IDLE;
> > + busy = !sched_idle_cpu(cpu);
> > + }
>
> This is inconsistent vs the earlier code. That is, why not write it
> like:
>
> idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
> busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
This looks easier to read
>
> > }
> > sd->last_balance = jiffies;
> > - interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
> > + interval = get_sd_balance_interval(sd, busy);
> > }
> > if (need_serialize)
> > spin_unlock(&balancing);