* [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level
@ 2021-01-25  6:02 Aubrey Li
  2021-01-25  9:06 ` Mel Gorman
  2021-01-25 10:56 ` Vincent Guittot
  0 siblings, 2 replies; 11+ messages in thread
From: Aubrey Li @ 2021-01-25  6:02 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot
  Cc: linux-kernel, Aubrey Li, Andi Kleen, Tim Chen,
	Srinivas Pandruvada, Rafael J . Wysocki, Aubrey Li

A long-tail load balance cost is observed on the newly idle path.
It is caused by a race window between the first nr_running check
of the busiest runqueue and its nr_running recheck in detach_tasks.

Before the busiest runqueue is locked, its tasks could be pulled by
other CPUs and its nr_running could drop to 1. This causes
detach_tasks to break out with the LBF_ALL_PINNED flag set and
triggers a load_balance redo at the same sched_domain level.

In order to find the new busiest sched_group and CPU, load balance
recomputes and updates the various load statistics, which eventually
leads to the long-tail load balance cost.

This patch introduces a variable (sched_nr_lb_redo) to limit the
number of load balance redos. Combined with sysctl_sched_nr_migrate,
the max load balance cost is reduced from 100+ us to 70+ us, measured
on a 4s x86 system with 192 logical CPUs.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
 kernel/sched/fair.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae7ceba..b59f371 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7407,6 +7407,8 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+	unsigned int		redo_cnt;
+	unsigned int		redo_max;
 
 	enum fbq_type		fbq_type;
 	enum migration_type	migration_type;
@@ -9525,6 +9527,7 @@ static int should_we_balance(struct lb_env *env)
 	return group_balance_cpu(sg) == env->dst_cpu;
 }
 
+static const unsigned int sched_nr_lb_redo = 1;
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -9547,6 +9550,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.dst_grpmask    = sched_group_span(sd->groups),
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
+		.redo_max	= sched_nr_lb_redo,
 		.cpus		= cpus,
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
@@ -9682,7 +9686,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			 * destination group that is receiving any migrated
 			 * load.
 			 */
-			if (!cpumask_subset(cpus, env.dst_grpmask)) {
+			if (!cpumask_subset(cpus, env.dst_grpmask) &&
+					++env.redo_cnt < env.redo_max) {
 				env.loop = 0;
 				env.loop_break = sched_nr_migrate_break;
 				goto redo;
-- 
2.7.4



* Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level
  2021-01-25  6:02 [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level Aubrey Li
@ 2021-01-25  9:06 ` Mel Gorman
  2021-01-25 13:53   ` Li, Aubrey
  2021-01-25 10:56 ` Vincent Guittot
  1 sibling, 1 reply; 11+ messages in thread
From: Mel Gorman @ 2021-01-25  9:06 UTC (permalink / raw)
  To: Aubrey Li
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, bristot, linux-kernel, Andi Kleen, Tim Chen,
	Srinivas Pandruvada, Rafael J . Wysocki, Aubrey Li

On Mon, Jan 25, 2021 at 02:02:58PM +0800, Aubrey Li wrote:
> A long-tail load balance cost is observed on the newly idle path,
> this is caused by a race window between the first nr_running check
> of the busiest runqueue and its nr_running recheck in detach_tasks.
> 
> Before the busiest runqueue is locked, the tasks on the busiest
> runqueue could be pulled by other CPUs and nr_running of the busiest
> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
> flag set, and triggers load_balance redo at the same sched_domain level.
> 
> In order to find the new busiest sched_group and CPU, load balance will
> recompute and update the various load statistics, which eventually leads
> to the long-tail load balance cost.
> 
> This patch introduces a variable(sched_nr_lb_redo) to limit load balance
> redo times, combined with sysctl_sched_nr_migrate, the max load balance
> cost is reduced from 100+ us to 70+ us, measured on a 4s x86 system with
> 192 logical CPUs.
> 
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>

If redo_max is a constant, why is it not a #define instead of increasing
the size of lb_env?
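
[For illustration, a minimal sketch of the #define variant being
suggested; the macro name here is hypothetical and the hunk context is
taken from the patch above:]

#define SCHED_NR_LB_REDO	1

			/* redo_cnt stays in lb_env; only the constant limit becomes a macro */
			if (!cpumask_subset(cpus, env.dst_grpmask) &&
					++env.redo_cnt < SCHED_NR_LB_REDO) {
				env.loop = 0;
				env.loop_break = sched_nr_migrate_break;
				goto redo;
			}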

-- 
Mel Gorman
SUSE Labs


* Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level
  2021-01-25  6:02 [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level Aubrey Li
  2021-01-25  9:06 ` Mel Gorman
@ 2021-01-25 10:56 ` Vincent Guittot
  2021-01-25 14:00   ` Li, Aubrey
  1 sibling, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2021-01-25 10:56 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, Andi Kleen, Tim Chen,
	Srinivas Pandruvada, Rafael J . Wysocki, Aubrey Li

On Mon, 25 Jan 2021 at 06:50, Aubrey Li <aubrey.li@intel.com> wrote:
>
> A long-tail load balance cost is observed on the newly idle path,
> this is caused by a race window between the first nr_running check
> of the busiest runqueue and its nr_running recheck in detach_tasks.
>
> Before the busiest runqueue is locked, the tasks on the busiest
> runqueue could be pulled by other CPUs and nr_running of the busiest
> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED

It would be better to detect that when trying to detach a task, like below:

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env)

        lockdep_assert_held(&env->src_rq->lock);

+       /*
+        * Another CPU has emptied this runqueue in the meantime.
+        * Just return and leave the load_balance properly.
+        */
+       if (env->src_rq->nr_running <= 1 && !env->loop) {
+               /* Clear the flag as we will not test any task */
+               env->flags &= ~LBF_ALL_PINNED;
+               return 0;
+       }
+
        if (env->imbalance <= 0)
                return 0;


> flag set, and triggers load_balance redo at the same sched_domain level.
>
> In order to find the new busiest sched_group and CPU, load balance will
> recompute and update the various load statistics, which eventually leads
> to the long-tail load balance cost.
>
> This patch introduces a variable(sched_nr_lb_redo) to limit load balance
> redo times, combined with sysctl_sched_nr_migrate, the max load balance
> cost is reduced from 100+ us to 70+ us, measured on a 4s x86 system with
> 192 logical CPUs.
>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> ---
>  kernel/sched/fair.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ae7ceba..b59f371 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7407,6 +7407,8 @@ struct lb_env {
>         unsigned int            loop;
>         unsigned int            loop_break;
>         unsigned int            loop_max;
> +       unsigned int            redo_cnt;
> +       unsigned int            redo_max;
>
>         enum fbq_type           fbq_type;
>         enum migration_type     migration_type;
> @@ -9525,6 +9527,7 @@ static int should_we_balance(struct lb_env *env)
>         return group_balance_cpu(sg) == env->dst_cpu;
>  }
>
> +static const unsigned int sched_nr_lb_redo = 1;
>  /*
>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>   * tasks if there is an imbalance.
> @@ -9547,6 +9550,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>                 .dst_grpmask    = sched_group_span(sd->groups),
>                 .idle           = idle,
>                 .loop_break     = sched_nr_migrate_break,
> +               .redo_max       = sched_nr_lb_redo,
>                 .cpus           = cpus,
>                 .fbq_type       = all,
>                 .tasks          = LIST_HEAD_INIT(env.tasks),
> @@ -9682,7 +9686,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>                          * destination group that is receiving any migrated
>                          * load.
>                          */
> -                       if (!cpumask_subset(cpus, env.dst_grpmask)) {
> +                       if (!cpumask_subset(cpus, env.dst_grpmask) &&
> +                                       ++env.redo_cnt < env.redo_max) {
>                                 env.loop = 0;
>                                 env.loop_break = sched_nr_migrate_break;
>                                 goto redo;
> --
> 2.7.4
>


* Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level
  2021-01-25  9:06 ` Mel Gorman
@ 2021-01-25 13:53   ` Li, Aubrey
  2021-01-25 14:40     ` Mel Gorman
  0 siblings, 1 reply; 11+ messages in thread
From: Li, Aubrey @ 2021-01-25 13:53 UTC (permalink / raw)
  To: Mel Gorman, Aubrey Li
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, bristot, linux-kernel, Andi Kleen, Tim Chen,
	Srinivas Pandruvada, Rafael J . Wysocki

On 2021/1/25 17:06, Mel Gorman wrote:
> On Mon, Jan 25, 2021 at 02:02:58PM +0800, Aubrey Li wrote:
>> A long-tail load balance cost is observed on the newly idle path,
>> this is caused by a race window between the first nr_running check
>> of the busiest runqueue and its nr_running recheck in detach_tasks.
>>
>> Before the busiest runqueue is locked, the tasks on the busiest
>> runqueue could be pulled by other CPUs and nr_running of the busiest
>> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
>> flag set, and triggers load_balance redo at the same sched_domain level.
>>
>> In order to find the new busiest sched_group and CPU, load balance will
>> recompute and update the various load statistics, which eventually leads
>> to the long-tail load balance cost.
>>
>> This patch introduces a variable(sched_nr_lb_redo) to limit load balance
>> redo times, combined with sysctl_sched_nr_migrate, the max load balance
>> cost is reduced from 100+ us to 70+ us, measured on a 4s x86 system with
>> 192 logical CPUs.
>>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
>> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> 
> If redo_max is a constant, why is it not a #define instead of increasing
> the size of lb_env?
> 

I followed the existing variable sched_nr_migrate_break; I think this might
become a tunable as well.

Thanks,
-Aubrey


* Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level
  2021-01-25 10:56 ` Vincent Guittot
@ 2021-01-25 14:00   ` Li, Aubrey
  2021-01-25 14:51     ` Vincent Guittot
  0 siblings, 1 reply; 11+ messages in thread
From: Li, Aubrey @ 2021-01-25 14:00 UTC (permalink / raw)
  To: Vincent Guittot, Aubrey Li
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, Andi Kleen, Tim Chen,
	Srinivas Pandruvada, Rafael J . Wysocki

On 2021/1/25 18:56, Vincent Guittot wrote:
> On Mon, 25 Jan 2021 at 06:50, Aubrey Li <aubrey.li@intel.com> wrote:
>>
>> A long-tail load balance cost is observed on the newly idle path,
>> this is caused by a race window between the first nr_running check
>> of the busiest runqueue and its nr_running recheck in detach_tasks.
>>
>> Before the busiest runqueue is locked, the tasks on the busiest
>> runqueue could be pulled by other CPUs and nr_running of the busiest
>> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
> 
> We should better detect that when trying to detach task like below

This is a compromise, from my understanding. If we give up load balance
this time due to the race condition, we do reduce the load balance cost on
the newly idle path, but if there is indeed an imbalance at the same
sched_domain level, we have to wait for the next softirq entry to handle
it. This means the tasks on the second busiest runqueue have to stay
longer, which could introduce tail latency as well. That's why I
introduced a variable to control the redo loops. I'll send this to the
benchmark queue to see if it makes any difference.
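
[For context, "the next softirq entry" refers to the periodic balance
path: scheduler_tick() calls trigger_load_balance(), which raises
SCHED_SOFTIRQ when rq->next_balance is due, and the softirq handler then
runs rebalance_domains(). A paraphrased sketch from that era's
kernel/sched/fair.c:]

	void trigger_load_balance(struct rq *rq)
	{
		/* Don't need to rebalance while attached to NULL domain */
		if (unlikely(on_null_domain(rq)))
			return;

		/* Kick the periodic balance softirq when the next balance is due */
		if (time_after_eq(jiffies, rq->next_balance))
			raise_softirq(SCHED_SOFTIRQ);

		nohz_balancer_kick(rq);
	}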

Thanks,
-Aubrey

> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env)
> 
>         lockdep_assert_held(&env->src_rq->lock);
> 
> +       /*
> +        * Another CPU has emptied this runqueue in the meantime.
> +        * Just return and leave the load_balance properly.
> +        */
> +       if (env->src_rq->nr_running <= 1 && !env->loop) {
> +               /* Clear the flag as we will not test any task */
> +               env->flags &= ~LBF_ALL_PINNED;
> +               return 0;
> +       }
> +
>         if (env->imbalance <= 0)
>                 return 0;
> 
> 
>> flag set, and triggers load_balance redo at the same sched_domain level.
>>
>> In order to find the new busiest sched_group and CPU, load balance will
>> recompute and update the various load statistics, which eventually leads
>> to the long-tail load balance cost.
>>
>> This patch introduces a variable(sched_nr_lb_redo) to limit load balance
>> redo times, combined with sysctl_sched_nr_migrate, the max load balance
>> cost is reduced from 100+ us to 70+ us, measured on a 4s x86 system with
>> 192 logical CPUs.
>>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
>> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>> ---
>>  kernel/sched/fair.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ae7ceba..b59f371 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7407,6 +7407,8 @@ struct lb_env {
>>         unsigned int            loop;
>>         unsigned int            loop_break;
>>         unsigned int            loop_max;
>> +       unsigned int            redo_cnt;
>> +       unsigned int            redo_max;
>>
>>         enum fbq_type           fbq_type;
>>         enum migration_type     migration_type;
>> @@ -9525,6 +9527,7 @@ static int should_we_balance(struct lb_env *env)
>>         return group_balance_cpu(sg) == env->dst_cpu;
>>  }
>>
>> +static const unsigned int sched_nr_lb_redo = 1;
>>  /*
>>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>>   * tasks if there is an imbalance.
>> @@ -9547,6 +9550,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>                 .dst_grpmask    = sched_group_span(sd->groups),
>>                 .idle           = idle,
>>                 .loop_break     = sched_nr_migrate_break,
>> +               .redo_max       = sched_nr_lb_redo,
>>                 .cpus           = cpus,
>>                 .fbq_type       = all,
>>                 .tasks          = LIST_HEAD_INIT(env.tasks),
>> @@ -9682,7 +9686,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>                          * destination group that is receiving any migrated
>>                          * load.
>>                          */
>> -                       if (!cpumask_subset(cpus, env.dst_grpmask)) {
>> +                       if (!cpumask_subset(cpus, env.dst_grpmask) &&
>> +                                       ++env.redo_cnt < env.redo_max) {
>>                                 env.loop = 0;
>>                                 env.loop_break = sched_nr_migrate_break;
>>                                 goto redo;
>> --
>> 2.7.4
>>



* Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level
  2021-01-25 13:53   ` Li, Aubrey
@ 2021-01-25 14:40     ` Mel Gorman
  0 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2021-01-25 14:40 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, bristot, linux-kernel,
	Andi Kleen, Tim Chen, Srinivas Pandruvada, Rafael J . Wysocki

On Mon, Jan 25, 2021 at 09:53:28PM +0800, Li, Aubrey wrote:
> On 2021/1/25 17:06, Mel Gorman wrote:
> > On Mon, Jan 25, 2021 at 02:02:58PM +0800, Aubrey Li wrote:
> >> A long-tail load balance cost is observed on the newly idle path,
> >> this is caused by a race window between the first nr_running check
> >> of the busiest runqueue and its nr_running recheck in detach_tasks.
> >>
> >> Before the busiest runqueue is locked, the tasks on the busiest
> >> runqueue could be pulled by other CPUs and nr_running of the busiest
> >> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
> >> flag set, and triggers load_balance redo at the same sched_domain level.
> >>
> >> In order to find the new busiest sched_group and CPU, load balance will
> >> recompute and update the various load statistics, which eventually leads
> >> to the long-tail load balance cost.
> >>
> >> This patch introduces a variable(sched_nr_lb_redo) to limit load balance
> >> redo times, combined with sysctl_sched_nr_migrate, the max load balance
> >> cost is reduced from 100+ us to 70+ us, measured on a 4s x86 system with
> >> 192 logical CPUs.
> >>
> >> Cc: Andi Kleen <ak@linux.intel.com>
> >> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> >> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> >> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> > 
> > If redo_max is a constant, why is it not a #define instead of increasing
> > the size of lb_env?
> > 
> 
> I followed the existing variable sched_nr_migrate_break, I think this might
> be a tunable as well.
> 

I don't think it is; the tunable is sched_nr_migrate, and it's not clear
to me at all why sched_nr_migrate_break is not also a #define. It just
happens that sched_nr_migrate == sched_nr_migrate_break by default.
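
[For reference, the two values are defined roughly as follows in that
era's sources (paraphrased); only the first is an actual tunable:]

	/* kernel/sched/core.c: exposed as the sysctl kernel.sched_nr_migrate */
	const_debug unsigned int sysctl_sched_nr_migrate = 32;

	/* kernel/sched/fair.c: a plain constant that happens to match it */
	static const unsigned int sched_nr_migrate_break = 32;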

-- 
Mel Gorman
SUSE Labs


* Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level
  2021-01-25 14:00   ` Li, Aubrey
@ 2021-01-25 14:51     ` Vincent Guittot
  2021-01-26  1:40       ` Li, Aubrey
  2021-02-23  5:41       ` Li, Aubrey
  0 siblings, 2 replies; 11+ messages in thread
From: Vincent Guittot @ 2021-01-25 14:51 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, Andi Kleen, Tim Chen,
	Srinivas Pandruvada, Rafael J . Wysocki

On Mon, 25 Jan 2021 at 15:00, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>
> On 2021/1/25 18:56, Vincent Guittot wrote:
> > On Mon, 25 Jan 2021 at 06:50, Aubrey Li <aubrey.li@intel.com> wrote:
> >>
> >> A long-tail load balance cost is observed on the newly idle path,
> >> this is caused by a race window between the first nr_running check
> >> of the busiest runqueue and its nr_running recheck in detach_tasks.
> >>
> >> Before the busiest runqueue is locked, the tasks on the busiest
> >> runqueue could be pulled by other CPUs and nr_running of the busiest
> >> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
> >
> > We should better detect that when trying to detach task like below
>
> This should be a compromise from my understanding. If we give up load balance
> this time due to the race condition, we do reduce the load balance cost on the
> newly idle path, but if there is an imbalance indeed at the same sched_domain

The redo path is there for the case where LB has found an imbalance but
can't move some load from this busiest rq to the dest rq because of cpu
affinity. So it tries to fix the imbalance by moving load onto another
rq of the group. In your case, the imbalance has disappeared because it
has already been pulled by another rq, so you don't have to try to find
another imbalance. And I would even say you should not, in order to let
the other levels take a chance to spread the load.
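
[For context, the redo path referred to here is the LBF_ALL_PINNED
handling in load_balance(); an abbreviated, paraphrased excerpt of that
era's fair.c:]

	if (unlikely(env.flags & LBF_ALL_PINNED)) {
		/*
		 * Every task on the busiest rq was pinned: drop that CPU
		 * from the candidate mask and retry at this level only if
		 * candidate CPUs outside the destination group remain.
		 */
		__cpumask_clear_cpu(cpu_of(busiest), cpus);
		if (!cpumask_subset(cpus, env.dst_grpmask)) {
			env.loop = 0;
			env.loop_break = sched_nr_migrate_break;
			goto redo;
		}
		goto out_all_pinned;
	}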

> level, we have to wait the next softirq entry to handle that imbalance. This
> means the tasks on the second busiest runqueue have to stay longer, which could
> introduce tail latency as well. That's why I introduced a variable to control
> the redo loops. I'll send this to the benchmark queue to see if it makes any

TBH, I don't like multiplying the number of knobs

> difference.
>
> Thanks,
> -Aubrey
>
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env)
> >
> >         lockdep_assert_held(&env->src_rq->lock);
> >
> > +       /*
> > +        * Another CPU has emptied this runqueue in the meantime.
> > +        * Just return and leave the load_balance properly.
> > +        */
> > +       if (env->src_rq->nr_running <= 1 && !env->loop) {
> > +               /* Clear the flag as we will not test any task */
> > +               env->flags &= ~LBF_ALL_PINNED;
> > +               return 0;
> > +       }
> > +
> >         if (env->imbalance <= 0)
> >                 return 0;
> >
> >
> >> flag set, and triggers load_balance redo at the same sched_domain level.
> >>
> >> In order to find the new busiest sched_group and CPU, load balance will
> >> recompute and update the various load statistics, which eventually leads
> >> to the long-tail load balance cost.
> >>
> >> This patch introduces a variable(sched_nr_lb_redo) to limit load balance
> >> redo times, combined with sysctl_sched_nr_migrate, the max load balance
> >> cost is reduced from 100+ us to 70+ us, measured on a 4s x86 system with
> >> 192 logical CPUs.
> >>
> >> Cc: Andi Kleen <ak@linux.intel.com>
> >> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> >> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> >> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> >> ---
> >>  kernel/sched/fair.c | 7 ++++++-
> >>  1 file changed, 6 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index ae7ceba..b59f371 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7407,6 +7407,8 @@ struct lb_env {
> >>         unsigned int            loop;
> >>         unsigned int            loop_break;
> >>         unsigned int            loop_max;
> >> +       unsigned int            redo_cnt;
> >> +       unsigned int            redo_max;
> >>
> >>         enum fbq_type           fbq_type;
> >>         enum migration_type     migration_type;
> >> @@ -9525,6 +9527,7 @@ static int should_we_balance(struct lb_env *env)
> >>         return group_balance_cpu(sg) == env->dst_cpu;
> >>  }
> >>
> >> +static const unsigned int sched_nr_lb_redo = 1;
> >>  /*
> >>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
> >>   * tasks if there is an imbalance.
> >> @@ -9547,6 +9550,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> >>                 .dst_grpmask    = sched_group_span(sd->groups),
> >>                 .idle           = idle,
> >>                 .loop_break     = sched_nr_migrate_break,
> >> +               .redo_max       = sched_nr_lb_redo,
> >>                 .cpus           = cpus,
> >>                 .fbq_type       = all,
> >>                 .tasks          = LIST_HEAD_INIT(env.tasks),
> >> @@ -9682,7 +9686,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> >>                          * destination group that is receiving any migrated
> >>                          * load.
> >>                          */
> >> -                       if (!cpumask_subset(cpus, env.dst_grpmask)) {
> >> +                       if (!cpumask_subset(cpus, env.dst_grpmask) &&
> >> +                                       ++env.redo_cnt < env.redo_max) {
> >>                                 env.loop = 0;
> >>                                 env.loop_break = sched_nr_migrate_break;
> >>                                 goto redo;
> >> --
> >> 2.7.4
> >>
>


* Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level
  2021-01-25 14:51     ` Vincent Guittot
@ 2021-01-26  1:40       ` Li, Aubrey
  2021-02-23  5:41       ` Li, Aubrey
  1 sibling, 0 replies; 11+ messages in thread
From: Li, Aubrey @ 2021-01-26  1:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Aubrey Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, Andi Kleen, Tim Chen,
	Srinivas Pandruvada, Rafael J . Wysocki

On 2021/1/25 22:51, Vincent Guittot wrote:
> On Mon, 25 Jan 2021 at 15:00, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>>
>> On 2021/1/25 18:56, Vincent Guittot wrote:
>>> On Mon, 25 Jan 2021 at 06:50, Aubrey Li <aubrey.li@intel.com> wrote:
>>>>
>>>> A long-tail load balance cost is observed on the newly idle path,
>>>> this is caused by a race window between the first nr_running check
>>>> of the busiest runqueue and its nr_running recheck in detach_tasks.
>>>>
>>>> Before the busiest runqueue is locked, the tasks on the busiest
>>>> runqueue could be pulled by other CPUs and nr_running of the busiest
>>>> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
>>>
>>> We should better detect that when trying to detach task like below
>>
>> This should be a compromise from my understanding. If we give up load balance
>> this time due to the race condition, we do reduce the load balance cost on the
>> newly idle path, but if there is an imbalance indeed at the same sched_domain
> 
> Redo path is there in case, LB has found an imbalance but it can't
> move some loads from this busiest rq to dest rq because of some cpu
> affinity. So it tries to fix the imbalance by moving load onto another
> rq of the group. In your case, the imbalance has disappeared because
> it has already been pulled by another rq so you don't have to try to
> find another imbalance. And I would even say you should not in order
> to let other level to take a chance to spread the load

Here is one simple case I have seen:
1) CPU_a becomes idle and invokes newly idle balance
2) Group_b is found as the busiest group
3) CPU_b_1 is found as the busiest CPU, nr_running = 5
4) detach_tasks rechecks CPU_b_1's runqueue, nr_running = 1, goto redo
5) Group_b is still found as the busiest group
6) This time CPU_b_2 is found as the busiest CPU, nr_running = 3
7) detach_tasks succeeds and 2 tasks are moved

If we skipped the redo:
- CPU_a would exit load balance and remain idle
- the tasks would stay on CPU_b_2's runqueue and wait for the next load balancing

The two tasks could have been moved to the idle CPU and executed
immediately.

> 
>> level, we have to wait the next softirq entry to handle that imbalance. This
>> means the tasks on the second busiest runqueue have to stay longer, which could
>> introduce tail latency as well. That's why I introduced a variable to control
>> the redo loops. I'll send this to the benchmark queue to see if it makes any
> 
> TBH, I don't like multiplying the number of knobs
I see.

Thanks,
-Aubrey


* Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level
  2021-01-25 14:51     ` Vincent Guittot
  2021-01-26  1:40       ` Li, Aubrey
@ 2021-02-23  5:41       ` Li, Aubrey
  2021-02-23 17:33         ` Vincent Guittot
  1 sibling, 1 reply; 11+ messages in thread
From: Li, Aubrey @ 2021-02-23  5:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Aubrey Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, Andi Kleen, Tim Chen,
	Srinivas Pandruvada, Rafael J . Wysocki

Hi Vincent,

Sorry for the delay; I just returned from the Chinese New Year holiday.

On 2021/1/25 22:51, Vincent Guittot wrote:
> On Mon, 25 Jan 2021 at 15:00, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>>
>> On 2021/1/25 18:56, Vincent Guittot wrote:
>>> On Mon, 25 Jan 2021 at 06:50, Aubrey Li <aubrey.li@intel.com> wrote:
>>>>
>>>> A long-tail load balance cost is observed on the newly idle path,
>>>> this is caused by a race window between the first nr_running check
>>>> of the busiest runqueue and its nr_running recheck in detach_tasks.
>>>>
>>>> Before the busiest runqueue is locked, the tasks on the busiest
>>>> runqueue could be pulled by other CPUs and nr_running of the busiest
>>>> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
>>>
>>> We should better detect that when trying to detach task like below
>>
>> This should be a compromise from my understanding. If we give up load balance
>> this time due to the race condition, we do reduce the load balance cost on the
>> newly idle path, but if there is an imbalance indeed at the same sched_domain
> 
> Redo path is there in case, LB has found an imbalance but it can't
> move some loads from this busiest rq to dest rq because of some cpu
> affinity. So it tries to fix the imbalance by moving load onto another
> rq of the group. In your case, the imbalance has disappeared because
> it has already been pulled by another rq so you don't have to try to
> find another imbalance. And I would even say you should not in order
> to let other level to take a chance to spread the load
> 
>> level, we have to wait the next softirq entry to handle that imbalance. This
>> means the tasks on the second busiest runqueue have to stay longer, which could
>> introduce tail latency as well. That's why I introduced a variable to control
>> the redo loops. I'll send this to the benchmark queue to see if it makes any
> 
> TBH, I don't like multiplying the number of knobs

Sure, I can take your approach, :)

>>>
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env)
>>>
>>>         lockdep_assert_held(&env->src_rq->lock);
>>>
>>> +       /*
>>> +        * Another CPU has emptied this runqueue in the meantime.
>>> +        * Just return and leave the load_balance properly.
>>> +        */
>>> +       if (env->src_rq->nr_running <= 1 && !env->loop) {

May I know why !env->loop is needed here? IIUC, if detach_tasks is invoked
again after LBF_NEED_BREAK, env->loop could be non-zero, but as long as
src_rq's nr_running <= 1, we should return immediately with the
LBF_ALL_PINNED flag cleared.

How about the following change?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..1761d33accaa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7683,8 +7683,11 @@ static int detach_tasks(struct lb_env *env)
 		 * We don't want to steal all, otherwise we may be treated likewise,
 		 * which could at worst lead to a livelock crash.
 		 */
-		if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+		if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1) {
+			/* Clear the flag as we will not test any task */
+			env->flags &= ~LBF_ALL_PINNED;
 			break;
+		}
 
 		p = list_last_entry(tasks, struct task_struct, se.group_node);
 
Thanks,
-Aubrey


* Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level
  2021-02-23  5:41       ` Li, Aubrey
@ 2021-02-23 17:33         ` Vincent Guittot
  2021-02-24  2:55           ` Li, Aubrey
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2021-02-23 17:33 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, Andi Kleen, Tim Chen,
	Srinivas Pandruvada, Rafael J . Wysocki

On Tue, 23 Feb 2021 at 06:41, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>
> Hi Vincent,
>
> Sorry for the delay, I just returned from Chinese New Year holiday.
>
> On 2021/1/25 22:51, Vincent Guittot wrote:
> > On Mon, 25 Jan 2021 at 15:00, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
> >>
> >> On 2021/1/25 18:56, Vincent Guittot wrote:
> >>> On Mon, 25 Jan 2021 at 06:50, Aubrey Li <aubrey.li@intel.com> wrote:
> >>>>
> >>>> A long-tail load balance cost is observed on the newly idle path,
> >>>> this is caused by a race window between the first nr_running check
> >>>> of the busiest runqueue and its nr_running recheck in detach_tasks.
> >>>>
> >>>> Before the busiest runqueue is locked, the tasks on the busiest
> >>>> runqueue could be pulled by other CPUs and nr_running of the busiest
> >>>> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
> >>>
> >>> We should better detect that when trying to detach task like below
> >>
> >> This should be a compromise from my understanding. If we give up load balance
> >> this time due to the race condition, we do reduce the load balance cost on the
> >> newly idle path, but if there is an imbalance indeed at the same sched_domain
> >
> > Redo path is there in case, LB has found an imbalance but it can't
> > move some loads from this busiest rq to dest rq because of some cpu
> > affinity. So it tries to fix the imbalance by moving load onto another
> > rq of the group. In your case, the imbalance has disappeared because
> > it has already been pulled by another rq so you don't have to try to
> > find another imbalance. And I would even say you should not in order
> > to let other level to take a chance to spread the load
> >
> >> level, we have to wait the next softirq entry to handle that imbalance. This
> >> means the tasks on the second busiest runqueue have to stay longer, which could
> >> introduce tail latency as well. That's why I introduced a variable to control
> >> the redo loops. I'll send this to the benchmark queue to see if it makes any
> >
> > TBH, I don't like multiplying the number of knobs
>
> Sure, I can take your approach, :)
>
> >>>
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env)
> >>>
> >>>         lockdep_assert_held(&env->src_rq->lock);
> >>>
> >>> +       /*
> >>> +        * Another CPU has emptied this runqueue in the meantime.
> >>> +        * Just return and leave the load_balance properly.
> >>> +        */
> >>> +       if (env->src_rq->nr_running <= 1 && !env->loop) {
>
> May I know why !env->loop is needed here? IIUC, if detach_tasks is invoked

IIRC, my point was to do the test only when trying to detach the 1st
task. A lot of things can happen when a break is involved, but TBH I
can't remember a precise use case. It may be over-cautious.

> from LBF_NEED_BREAK, env->loop could be non-zero, but as long as src_rq's
> nr_running <=1, we should return immediately with LBF_ALL_PINNED flag cleared.
>
> How about the following change?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04a3ce20da67..1761d33accaa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7683,8 +7683,11 @@ static int detach_tasks(struct lb_env *env)
>                  * We don't want to steal all, otherwise we may be treated likewise,
>                  * which could at worst lead to a livelock crash.
>                  */
> -               if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
> +               if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1) {

IMO, we must do the test before the while (!list_empty(tasks)) { loop,
because src_rq might have become empty if the waiting tasks have been
pulled by another cpu and the running one became idle in the meantime.

> +                       /* Clear the flag as we will not test any task */
> +                       env->flag &= ~LBF_ALL_PINNED;
>                         break;
> +               }
>
>                 p = list_last_entry(tasks, struct task_struct, se.group_node);
>
> Thanks,
> -Aubrey


* Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level
  2021-02-23 17:33         ` Vincent Guittot
@ 2021-02-24  2:55           ` Li, Aubrey
  0 siblings, 0 replies; 11+ messages in thread
From: Li, Aubrey @ 2021-02-24  2:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Aubrey Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, Andi Kleen, Tim Chen,
	Srinivas Pandruvada, Rafael J . Wysocki

On 2021/2/24 1:33, Vincent Guittot wrote:
> On Tue, 23 Feb 2021 at 06:41, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>>
>> Hi Vincent,
>>
>> Sorry for the delay, I just returned from Chinese New Year holiday.
>>
>> On 2021/1/25 22:51, Vincent Guittot wrote:
>>> On Mon, 25 Jan 2021 at 15:00, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>>>>
>>>> On 2021/1/25 18:56, Vincent Guittot wrote:
>>>>> On Mon, 25 Jan 2021 at 06:50, Aubrey Li <aubrey.li@intel.com> wrote:
>>>>>>
>>>>>> A long-tail load balance cost is observed on the newly idle path,
>>>>>> this is caused by a race window between the first nr_running check
>>>>>> of the busiest runqueue and its nr_running recheck in detach_tasks.
>>>>>>
>>>>>> Before the busiest runqueue is locked, the tasks on the busiest
>>>>>> runqueue could be pulled by other CPUs and nr_running of the busiest
>>>>>> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
>>>>>
>>>>> We should better detect that when trying to detach task like below
>>>>
>>>> This should be a compromise from my understanding. If we give up load balance
>>>> this time due to the race condition, we do reduce the load balance cost on the
>>>> newly idle path, but if there is an imbalance indeed at the same sched_domain
>>>
>>> Redo path is there in case, LB has found an imbalance but it can't
>>> move some loads from this busiest rq to dest rq because of some cpu
>>> affinity. So it tries to fix the imbalance by moving load onto another
>>> rq of the group. In your case, the imbalance has disappeared because
>>> it has already been pulled by another rq so you don't have to try to
>>> find another imbalance. And I would even say you should not in order
>>> to let other level to take a chance to spread the load
>>>
>>>> level, we have to wait the next softirq entry to handle that imbalance. This
>>>> means the tasks on the second busiest runqueue have to stay longer, which could
>>>> introduce tail latency as well. That's why I introduced a variable to control
>>>> the redo loops. I'll send this to the benchmark queue to see if it makes any
>>>
>>> TBH, I don't like multiplying the number of knobs
>>
>> Sure, I can take your approach, :)
>>
>>>>>
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env)
>>>>>
>>>>>         lockdep_assert_held(&env->src_rq->lock);
>>>>>
>>>>> +       /*
>>>>> +        * Another CPU has emptied this runqueue in the meantime.
>>>>> +        * Just return and leave the load_balance properly.
>>>>> +        */
>>>>> +       if (env->src_rq->nr_running <= 1 && !env->loop) {
>>
>> May I know why !env->loop is needed here? IIUC, if detach_tasks is invoked
> 
> IIRC,  my point was to do the test only when trying to detach the 1st
> task. A lot of things can happen when a break is involved but TBH I
> can't remember a precise UC. It may be over cautious

When the break happens, the rq is unlocked and local irqs are restored,
so it's still possible that the rq is emptied by another CPU.

> 
>> from LBF_NEED_BREAK, env->loop could be non-zero, but as long as src_rq's
>> nr_running <=1, we should return immediately with LBF_ALL_PINNED flag cleared.
>>
>> How about the following change?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 04a3ce20da67..1761d33accaa 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7683,8 +7683,11 @@ static int detach_tasks(struct lb_env *env)
>>                  * We don't want to steal all, otherwise we may be treated likewise,
>>                  * which could at worst lead to a livelock crash.
>>                  */
>> -               if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
>> +               if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1) {
> 
> IMO, we must do the test before:  while (!list_empty(tasks)) {
> 
> because src_rq might have become empty if waiting tasks have been
> pulled by another cpu and the running one became idle in the meantime

Okay, and after the running one becomes idle, LBF_ALL_PINNED is still
set, so it needs to be cleared as well. Thanks!

> 
>> +                       /* Clear the flag as we will not test any task */
>> +                       env->flag &= ~LBF_ALL_PINNED;
>>                         break;
>> +               }
>>
>>                 p = list_last_entry(tasks, struct task_struct, se.group_node);
>>
>> Thanks,
>> -Aubrey
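
[Putting this sub-thread together, a sketch of what the combined
suggestion might look like: check nr_running before the task loop in
detach_tasks() and clear LBF_ALL_PINNED there. This is an illustration
of the idea discussed above, not necessarily the exact patch that was
eventually merged:]

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ static int detach_tasks(struct lb_env *env)
 
 	lockdep_assert_held(&env->src_rq->lock);
 
+	/*
+	 * Source run queue has been emptied by another CPU, clear
+	 * LBF_ALL_PINNED flag as we will not test any task.
+	 */
+	if (env->src_rq->nr_running <= 1) {
+		env->flags &= ~LBF_ALL_PINNED;
+		return 0;
+	}
+
 	if (env->imbalance <= 0)
 		return 0;
 
 	while (!list_empty(tasks)) {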


