linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched: Consider capacity for certain load balancing decisions
@ 2023-02-01  1:20 Xi Wang
  2023-02-03  9:51 ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Xi Wang @ 2023-02-01  1:20 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: Juri Lelli, Ingo Molnar, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Ben Segall, linux-kernel, linux-fsdevel, Xi Wang

After load balancing was split into different scenarios, CPU capacity
is ignored for the "migrate_task" case, which means a thread can stay
on a softirq-heavy CPU for an extended amount of time.

By comparing nr_running/capacity instead of just nr_running, we can add
CPU capacity back into "migrate_task" decisions. This benefits
workloads running on machines with heavy network traffic. The change
is unlikely to cause serious problems for other workloads, though some
corner cases may still need to be considered.

Signed-off-by: Xi Wang <xii@google.com>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f8736991427..aad14bc04544 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10368,8 +10368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			break;
 
 		case migrate_task:
-			if (busiest_nr < nr_running) {
+			if (busiest_nr * capacity < nr_running * busiest_capacity) {
 				busiest_nr = nr_running;
+				busiest_capacity = capacity;
 				busiest = rq;
 			}
 			break;
-- 
2.39.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] sched: Consider capacity for certain load balancing decisions
  2023-02-01  1:20 [PATCH] sched: Consider capacity for certain load balancing decisions Xi Wang
@ 2023-02-03  9:51 ` Peter Zijlstra
  2023-02-03 18:47   ` Xi Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2023-02-03  9:51 UTC (permalink / raw)
  To: Xi Wang
  Cc: Vincent Guittot, Juri Lelli, Ingo Molnar, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, Ben Segall, linux-kernel, linux-fsdevel

On Tue, Jan 31, 2023 at 05:20:32PM -0800, Xi Wang wrote:
> After load balancing was split into different scenarios, CPU capacity
> is ignored for the "migrate_task" case, which means a thread can stay
> on a softirq heavy cpu for an extended amount of time.
> 
> By comparing nr_running/capacity instead of just nr_running we can add
> CPU capacity back into "migrate_task" decisions. This benefits
> workloads running on machines with heavy network traffic. The change
> is unlikely to cause serious problems for other workloads but maybe
> some corner cases still need to be considered.
> 
> Signed-off-by: Xi Wang <xii@google.com>
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0f8736991427..aad14bc04544 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10368,8 +10368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  			break;
>  
>  		case migrate_task:
> -			if (busiest_nr < nr_running) {
> +			if (busiest_nr * capacity < nr_running * busiest_capacity) {
>  				busiest_nr = nr_running;
> +				busiest_capacity = capacity;
>  				busiest = rq;
>  			}
>  			break;

I don't think this is correct. The migrate_task case is work-conserving,
and your change can severely break that I think.



* Re: [PATCH] sched: Consider capacity for certain load balancing decisions
  2023-02-03  9:51 ` Peter Zijlstra
@ 2023-02-03 18:47   ` Xi Wang
  2023-02-06  9:28     ` Vincent Guittot
  0 siblings, 1 reply; 6+ messages in thread
From: Xi Wang @ 2023-02-03 18:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Juri Lelli, Ingo Molnar, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, Ben Segall, linux-kernel, linux-fsdevel

On Fri, Feb 3, 2023 at 1:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jan 31, 2023 at 05:20:32PM -0800, Xi Wang wrote:
> > After load balancing was split into different scenarios, CPU capacity
> > is ignored for the "migrate_task" case, which means a thread can stay
> > on a softirq heavy cpu for an extended amount of time.
> >
> > By comparing nr_running/capacity instead of just nr_running we can add
> > CPU capacity back into "migrate_task" decisions. This benefits
> > workloads running on machines with heavy network traffic. The change
> > is unlikely to cause serious problems for other workloads but maybe
> > some corner cases still need to be considered.
> >
> > Signed-off-by: Xi Wang <xii@google.com>
> > ---
> >  kernel/sched/fair.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0f8736991427..aad14bc04544 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10368,8 +10368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> >                       break;
> >
> >               case migrate_task:
> > -                     if (busiest_nr < nr_running) {
> > +                     if (busiest_nr * capacity < nr_running * busiest_capacity) {
> >                               busiest_nr = nr_running;
> > +                             busiest_capacity = capacity;
> >                               busiest = rq;
> >                       }
> >                       break;
>
> I don't think this is correct. The migrate_task case is work-conserving,
> and your change can severely break that I think.
>

I think you meant this kind of scenario:
cpu 0: idle
cpu 1: 2 tasks
cpu 2: 1 task but only has 30% of capacity
Pulling from cpu 2 is good for the task but lowers the overall cpu
throughput.

The problem we have is:
cpu 0: idle
cpu 1: 1 task
cpu 2: 1 task but only has 60% of capacity due to net softirq
The task on cpu 2 stays there and runs slower. (This can also be
considered non-work-conserving if we account for softirq time like a task.)

Maybe the logic can be merged like this: Use capacity but pick from
nr_running > 1 cpus first, then nr_running == 1 cpus if not found.


* Re: [PATCH] sched: Consider capacity for certain load balancing decisions
  2023-02-03 18:47   ` Xi Wang
@ 2023-02-06  9:28     ` Vincent Guittot
  2023-02-07  1:03       ` Xi Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Vincent Guittot @ 2023-02-06  9:28 UTC (permalink / raw)
  To: Xi Wang
  Cc: Peter Zijlstra, Juri Lelli, Ingo Molnar, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, Ben Segall, linux-kernel, linux-fsdevel

On Fri, 3 Feb 2023 at 19:47, Xi Wang <xii@google.com> wrote:
>
> On Fri, Feb 3, 2023 at 1:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Jan 31, 2023 at 05:20:32PM -0800, Xi Wang wrote:
> > > After load balancing was split into different scenarios, CPU capacity
> > > is ignored for the "migrate_task" case, which means a thread can stay
> > > on a softirq heavy cpu for an extended amount of time.
> > >
> > > By comparing nr_running/capacity instead of just nr_running we can add
> > > CPU capacity back into "migrate_task" decisions. This benefits
> > > workloads running on machines with heavy network traffic. The change
> > > is unlikely to cause serious problems for other workloads but maybe
> > > some corner cases still need to be considered.
> > >
> > > Signed-off-by: Xi Wang <xii@google.com>
> > > ---
> > >  kernel/sched/fair.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 0f8736991427..aad14bc04544 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10368,8 +10368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > >                       break;
> > >
> > >               case migrate_task:
> > > -                     if (busiest_nr < nr_running) {
> > > +                     if (busiest_nr * capacity < nr_running * busiest_capacity) {
> > >                               busiest_nr = nr_running;
> > > +                             busiest_capacity = capacity;
> > >                               busiest = rq;
> > >                       }
> > >                       break;
> >
> > I don't think this is correct. The migrate_task case is work-conserving,
> > and your change can severely break that I think.
> >
>
> I think you meant this kind of scenario:
> cpu 0: idle
> cpu 1: 2 tasks
> cpu 2: 1 task but only has 30% of capacity
> Pulling from cpu 2 is good for the task but lowers the overall cpu
> throughput.
>
> The problem we have is:
> cpu 0: idle
> cpu 1: 1 task
> cpu 2: 1 task but only has 60% of capacity due to net softirq
> The task on cpu 2 stays there and runs slower. (This can also be
> considered non work-conserving if we account softirq like a task.)

When load_balance runs for these 2 CPUs, cpu2 should be tagged as
misfit_task because of its reduced capacity and should be selected in
priority by cpu0 to pull the task. Do you have more details on your
topology?


>
> Maybe the logic can be merged like this: Use capacity but pick from
> nr_running > 1 cpus first, then nr_running == 1 cpus if not found.


* Re: [PATCH] sched: Consider capacity for certain load balancing decisions
  2023-02-06  9:28     ` Vincent Guittot
@ 2023-02-07  1:03       ` Xi Wang
  2023-02-07  8:08         ` Vincent Guittot
  0 siblings, 1 reply; 6+ messages in thread
From: Xi Wang @ 2023-02-07  1:03 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Juri Lelli, Ingo Molnar, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, Ben Segall, linux-kernel, linux-fsdevel

On Mon, Feb 6, 2023 at 1:28 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Fri, 3 Feb 2023 at 19:47, Xi Wang <xii@google.com> wrote:
> >
> > On Fri, Feb 3, 2023 at 1:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Tue, Jan 31, 2023 at 05:20:32PM -0800, Xi Wang wrote:
> > > > After load balancing was split into different scenarios, CPU capacity
> > > > is ignored for the "migrate_task" case, which means a thread can stay
> > > > on a softirq heavy cpu for an extended amount of time.
> > > >
> > > > By comparing nr_running/capacity instead of just nr_running we can add
> > > > CPU capacity back into "migrate_task" decisions. This benefits
> > > > workloads running on machines with heavy network traffic. The change
> > > > is unlikely to cause serious problems for other workloads but maybe
> > > > some corner cases still need to be considered.
> > > >
> > > > Signed-off-by: Xi Wang <xii@google.com>
> > > > ---
> > > >  kernel/sched/fair.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 0f8736991427..aad14bc04544 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -10368,8 +10368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > > >                       break;
> > > >
> > > >               case migrate_task:
> > > > -                     if (busiest_nr < nr_running) {
> > > > +                     if (busiest_nr * capacity < nr_running * busiest_capacity) {
> > > >                               busiest_nr = nr_running;
> > > > +                             busiest_capacity = capacity;
> > > >                               busiest = rq;
> > > >                       }
> > > >                       break;
> > >
> > > I don't think this is correct. The migrate_task case is work-conserving,
> > > and your change can severely break that I think.
> > >
> >
> > I think you meant this kind of scenario:
> > cpu 0: idle
> > cpu 1: 2 tasks
> > cpu 2: 1 task but only has 30% of capacity
> > Pulling from cpu 2 is good for the task but lowers the overall cpu
> > throughput.
> >
> > The problem we have is:
> > cpu 0: idle
> > cpu 1: 1 task
> > cpu 2: 1 task but only has 60% of capacity due to net softirq
> > The task on cpu 2 stays there and runs slower. (This can also be
> > considered non work-conserving if we account softirq like a task.)
>
> When load_balance runs for these 2 CPUs, cpu2 should be tagged as
> misfit_task because of its reduced capacity and should be selected in
> priority by cpu0 to pull the task. Do you have more details on your
> topology?

The topology is a 64-core AMD machine with 2 hyperthreads per core.

I am not familiar with the related code, but I think there are cases
where a task fits the CPU's capacity yet could still run faster
elsewhere, e.g. bursty workloads: thread pool threads have variable
utilization because they process more or fewer requests based on CPU
availability (picking the next request from a shared queue when the
previous one is done). A thread that has enough CPU cycles but runs
slower due to softirqs can also directly affect application
performance.

> >
> > Maybe the logic can be merged like this: Use capacity but pick from
> > nr_running > 1 cpus first, then nr_running == 1 cpus if not found.


* Re: [PATCH] sched: Consider capacity for certain load balancing decisions
  2023-02-07  1:03       ` Xi Wang
@ 2023-02-07  8:08         ` Vincent Guittot
  0 siblings, 0 replies; 6+ messages in thread
From: Vincent Guittot @ 2023-02-07  8:08 UTC (permalink / raw)
  To: Xi Wang
  Cc: Peter Zijlstra, Juri Lelli, Ingo Molnar, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, Ben Segall, linux-kernel, linux-fsdevel

On Tue, 7 Feb 2023 at 02:03, Xi Wang <xii@google.com> wrote:
>
> On Mon, Feb 6, 2023 at 1:28 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Fri, 3 Feb 2023 at 19:47, Xi Wang <xii@google.com> wrote:
> > >
> > > On Fri, Feb 3, 2023 at 1:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Tue, Jan 31, 2023 at 05:20:32PM -0800, Xi Wang wrote:
> > > > > After load balancing was split into different scenarios, CPU capacity
> > > > > is ignored for the "migrate_task" case, which means a thread can stay
> > > > > on a softirq heavy cpu for an extended amount of time.
> > > > >
> > > > > By comparing nr_running/capacity instead of just nr_running we can add
> > > > > CPU capacity back into "migrate_task" decisions. This benefits
> > > > > workloads running on machines with heavy network traffic. The change
> > > > > is unlikely to cause serious problems for other workloads but maybe
> > > > > some corner cases still need to be considered.
> > > > >
> > > > > Signed-off-by: Xi Wang <xii@google.com>
> > > > > ---
> > > > >  kernel/sched/fair.c | 3 ++-
> > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index 0f8736991427..aad14bc04544 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -10368,8 +10368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > > > >                       break;
> > > > >
> > > > >               case migrate_task:
> > > > > -                     if (busiest_nr < nr_running) {
> > > > > +                     if (busiest_nr * capacity < nr_running * busiest_capacity) {
> > > > >                               busiest_nr = nr_running;
> > > > > +                             busiest_capacity = capacity;
> > > > >                               busiest = rq;
> > > > >                       }
> > > > >                       break;
> > > >
> > > > I don't think this is correct. The migrate_task case is work-conserving,
> > > > and your change can severely break that I think.
> > > >
> > >
> > > I think you meant this kind of scenario:
> > > cpu 0: idle
> > > cpu 1: 2 tasks
> > > cpu 2: 1 task but only has 30% of capacity
> > > Pulling from cpu 2 is good for the task but lowers the overall cpu
> > > throughput.
> > >
> > > The problem we have is:
> > > cpu 0: idle
> > > cpu 1: 1 task
> > > cpu 2: 1 task but only has 60% of capacity due to net softirq
> > > The task on cpu 2 stays there and runs slower. (This can also be
> > > considered non work-conserving if we account softirq like a task.)
> >
> > When load_balance runs for these 2 CPUs, cpu2 should be tagged as
> > misfit_task because of its reduced capacity and should be selected in
> > priority by cpu0 to pull the task. Do you have more details on your
> > topology?
>
> The topology is a 64-core AMD machine with 2 hyperthreads per core.
>
> I am not familiar with the related code, but I think there are cases
> where a task fits the CPU's capacity yet could still run faster
> elsewhere, e.g. bursty workloads: thread pool threads have variable
> utilization because they process more or fewer requests based on CPU
> availability (picking the next request from a shared queue when the
> previous one is done). A thread that has enough CPU cycles but runs
> slower due to softirqs can also directly affect application
> performance.

misfit_task is also used to detect CPUs with reduced capacity; we can
have a look at update_sg_lb_stats(). Your use case above should fall
into this case.

>
> > >
> > > Maybe the logic can be merged like this: Use capacity but pick from
> > > nr_running > 1 cpus first, then nr_running == 1 cpus if not found.


end of thread, other threads:[~2023-02-07  8:08 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-01  1:20 [PATCH] sched: Consider capacity for certain load balancing decisions Xi Wang
2023-02-03  9:51 ` Peter Zijlstra
2023-02-03 18:47   ` Xi Wang
2023-02-06  9:28     ` Vincent Guittot
2023-02-07  1:03       ` Xi Wang
2023-02-07  8:08         ` Vincent Guittot
