* Perf regression from scheduler load_balance rework in 5.5?
@ 2022-06-23 19:50 David Chen
  2022-06-24  8:22 ` Vincent Guittot
  0 siblings, 1 reply; 7+ messages in thread
From: David Chen @ 2022-06-23 19:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Vincent Guittot

Hi,

I'm working on upgrading our kernel from 4.14 to 5.10.
However, I'm seeing a performance regression when doing random reads from a Windows client through
smbd against a well-cached file.

One thing I noticed is that on the new kernel, the smbd thread doing socket I/O tends to stay on
the same cpu core as the net_rx softirq, whereas on the old kernel it tends to be moved around
more randomly. And when they are on the same cpu, that cpu tends to saturate and performance
drops.

For example, here are the durations (ns) the thread spent on each cpu, captured with bpftrace.
On 4.14:
@cputime[7]: 20741458382
@cputime[0]: 25219285005
@cputime[6]: 30892418441
@cputime[5]: 31032404613
@cputime[3]: 33511324691
@cputime[1]: 35564174562
@cputime[4]: 39313421965
@cputime[2]: 55779811909 (net_rx cpu)

On 5.10:
@cputime[3]: 2150554823
@cputime[5]: 3294276626
@cputime[7]: 4277890448
@cputime[4]: 5094586003
@cputime[1]: 6058168291
@cputime[0]: 14688093441
@cputime[6]: 17578229533
@cputime[2]: 223473400411 (net_rx cpu)

I also tried setting the cpu affinity of the smbd thread away from the net_rx cpu, and that does
bring the performance back on par with the old kernel.
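
For reference, a minimal sketch of that workaround (illustrative only: it assumes 8 cpus with the
net_rx softirq on cpu 2 as in the numbers above, and takes the target TID on the command line):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the given TID (or the calling thread if no TID is passed) to every
 * cpu except the one handling net_rx. */
int main(int argc, char **argv)
{
	cpu_set_t mask;
	int cpu, net_rx_cpu = 2, nr_cpus = 8;
	pid_t tid = (argc > 1) ? (pid_t)atoi(argv[1]) : 0;

	CPU_ZERO(&mask);
	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (cpu != net_rx_cpu)
			CPU_SET(cpu, &mask);

	if (sched_setaffinity(tid, sizeof(mask), &mask) != 0) {
		perror("sched_setaffinity");
		return 1;
	}
	return 0;
}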

I noticed that the scheduler's load_balance code was reworked in 5.5, so I ran the test on 5.4 and
5.5, and the behavior did indeed change between those two versions.

Anyone know how to work around this?

Thanks,
David


* Re: Perf regression from scheduler load_balance rework in 5.5?
  2022-06-23 19:50 Perf regression from scheduler load_balance rework in 5.5? David Chen
@ 2022-06-24  8:22 ` Vincent Guittot
  2022-06-24 13:16   ` Zhang Qiao
  0 siblings, 1 reply; 7+ messages in thread
From: Vincent Guittot @ 2022-06-24  8:22 UTC (permalink / raw)
  To: David Chen; +Cc: linux-kernel, Ingo Molnar

On Thu, 23 Jun 2022 at 21:50, David Chen <david.chen@nutanix.com> wrote:
>
> Hi,
>
> I'm working on upgrading our kernel from 4.14 to 5.10
> However, I'm seeing performance regression when doing rand read from windows client through smbd
> with a well cached file.
>
> One thing I noticed is that on the new kernel, the smbd thread doing socket I/O tends to stay on
> the same cpu core as the net_rx softirq, where as in the old kernel it tends to be moved around
> more randomly. And when they are on the same cpu, it tends to saturate the cpu more and causes
> performance to drop.
>
> For example, here's the duration (ns) the thread spend on each cpu I captured using bpftrace
> On 4.14:
> @cputime[7]: 20741458382
> @cputime[0]: 25219285005
> @cputime[6]: 30892418441
> @cputime[5]: 31032404613
> @cputime[3]: 33511324691
> @cputime[1]: 35564174562
> @cputime[4]: 39313421965
> @cputime[2]: 55779811909 (net_rx cpu)
>
> On 5.10:
> @cputime[3]: 2150554823
> @cputime[5]: 3294276626
> @cputime[7]: 4277890448
> @cputime[4]: 5094586003
> @cputime[1]: 6058168291
> @cputime[0]: 14688093441
> @cputime[6]: 17578229533
> @cputime[2]: 223473400411 (net_rx cpu)
>
> I also tried setting the cpu affinity of the smbd thread away from the net_rx cpu and indeed that
> seems to bring the perf on par with old kernel.
>
> I noticed that there's scheduler load_balance rework in 5.5, so I did the test on 5.4 and 5.5 and
> it did show the behavior changed between 5.4 and 5.5.

Have you tested v5.18? Several improvements have gone in since v5.5.

>
> Anyone know how to work around this?

Have you enabled IRQ_TIME_ACCOUNTING?

When the time spent in interrupt context becomes significant, the scheduler
migrates the task to another cpu.
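
Roughly (paraphrasing the v5.10 sources from memory, so treat the exact code as approximate): the
irq/softirq time tracked with IRQ_TIME_ACCOUNTING feeds a PELT signal that shrinks the capacity
the load balancer sees for that cpu:

/* kernel/sched/sched.h (approximate): the capacity left for CFS is scaled
 * down in proportion to the time the cpu spends in hard/soft irq context. */
static inline unsigned long
scale_irq_capacity(unsigned long util, unsigned long irq, unsigned long max)
{
	util *= (max - irq);
	util /= max;
	return util;
}

update_cpu_capacity() uses this (via scale_rt_capacity()) to set rq->cpu_capacity, so a cpu that
spends a lot of time in net_rx ends up with less capacity than its siblings.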

Vincent
>
> Thanks,
> David


* Re: Perf regression from scheduler load_balance rework in 5.5?
  2022-06-24  8:22 ` Vincent Guittot
@ 2022-06-24 13:16   ` Zhang Qiao
  2022-06-27 10:59     ` Vincent Guittot
  0 siblings, 1 reply; 7+ messages in thread
From: Zhang Qiao @ 2022-06-24 13:16 UTC (permalink / raw)
  To: Vincent Guittot, David Chen; +Cc: linux-kernel, Ingo Molnar


Hi,
On 2022/6/24 16:22, Vincent Guittot wrote:
> On Thu, 23 Jun 2022 at 21:50, David Chen <david.chen@nutanix.com> wrote:
>>
>> Hi,
>>
>> I'm working on upgrading our kernel from 4.14 to 5.10
>> However, I'm seeing performance regression when doing rand read from windows client through smbd
>> with a well cached file.
>>
>> One thing I noticed is that on the new kernel, the smbd thread doing socket I/O tends to stay on
>> the same cpu core as the net_rx softirq, where as in the old kernel it tends to be moved around
>> more randomly. And when they are on the same cpu, it tends to saturate the cpu more and causes
>> performance to drop.
>>
>> For example, here's the duration (ns) the thread spend on each cpu I captured using bpftrace
>> On 4.14:
>> @cputime[7]: 20741458382
>> @cputime[0]: 25219285005
>> @cputime[6]: 30892418441
>> @cputime[5]: 31032404613
>> @cputime[3]: 33511324691
>> @cputime[1]: 35564174562
>> @cputime[4]: 39313421965
>> @cputime[2]: 55779811909 (net_rx cpu)
>>
>> On 5.10:
>> @cputime[3]: 2150554823
>> @cputime[5]: 3294276626
>> @cputime[7]: 4277890448
>> @cputime[4]: 5094586003
>> @cputime[1]: 6058168291
>> @cputime[0]: 14688093441
>> @cputime[6]: 17578229533
>> @cputime[2]: 223473400411 (net_rx cpu)
>>
>> I also tried setting the cpu affinity of the smbd thread away from the net_rx cpu and indeed that
>> seems to bring the perf on par with old kernel.

I observed the same problem for the past two weeks.

>>
>> I noticed that there's scheduler load_balance rework in 5.5, so I did the test on 5.4 and 5.5 and
>> it did show the behavior changed between 5.4 and 5.5.
> 
> Have you tested v5.18 ? several improvements happened since v5.5
> 
>>
>> Anyone know how to work around this?
> 
> Have you enabled IRQ_TIME_ACCOUNTING ?


CONFIG_IRQ_TIME_ACCOUNTING=y.

> 
> When the time spent under interrupt becomes significant, scheduler
> migrate task on another cpu


My board has two cpus, and I used iperf3 to test upload bandwidth. I saw the same situation:
the iperf3 thread runs on the same cpu as the NET_RX softirq.

After debugging in find_busiest_group(), I noticed that when an idle cpu (env->idle is CPU_IDLE or
CPU_NEWLY_IDLE) tries to pull the task, busiest->group_type == group_fully_busy,
busiest->sum_h_nr_running == 1 and local->group_type == group_has_spare, so the load balance
fails in find_busiest_group(), as follows:

find_busiest_group():
    ...
    if (busiest->group_type != group_overloaded) {
	....
	if (busiest->sum_h_nr_running == 1)
		goto out_balanced;     ----> the load balance returns here.
....


Thanks,
Qiao


> Vincent>>
>> Thanks,
>> David
> .
> 


* Re: Perf regression from scheduler load_balance rework in 5.5?
  2022-06-24 13:16   ` Zhang Qiao
@ 2022-06-27 10:59     ` Vincent Guittot
  2022-06-29 21:45       ` David Chen
  2022-06-30  7:02       ` Zhang Qiao
  0 siblings, 2 replies; 7+ messages in thread
From: Vincent Guittot @ 2022-06-27 10:59 UTC (permalink / raw)
  To: Zhang Qiao; +Cc: David Chen, linux-kernel, Ingo Molnar

Hi,

On Friday 24 June 2022 at 21:16:05 (+0800), Zhang Qiao wrote:
> 
> Hi,
> On 2022/6/24 16:22, Vincent Guittot wrote:
> > On Thu, 23 Jun 2022 at 21:50, David Chen <david.chen@nutanix.com> wrote:
> >>
> >> Hi,
> >>
> >> I'm working on upgrading our kernel from 4.14 to 5.10
> >> However, I'm seeing performance regression when doing rand read from windows client through smbd
> >> with a well cached file.
> >>
> >> One thing I noticed is that on the new kernel, the smbd thread doing socket I/O tends to stay on
> >> the same cpu core as the net_rx softirq, where as in the old kernel it tends to be moved around
> >> more randomly. And when they are on the same cpu, it tends to saturate the cpu more and causes
> >> performance to drop.
> >>
> >> For example, here's the duration (ns) the thread spend on each cpu I captured using bpftrace
> >> On 4.14:
> >> @cputime[7]: 20741458382
> >> @cputime[0]: 25219285005
> >> @cputime[6]: 30892418441
> >> @cputime[5]: 31032404613
> >> @cputime[3]: 33511324691
> >> @cputime[1]: 35564174562
> >> @cputime[4]: 39313421965
> >> @cputime[2]: 55779811909 (net_rx cpu)
> >>
> >> On 5.10:
> >> @cputime[3]: 2150554823
> >> @cputime[5]: 3294276626
> >> @cputime[7]: 4277890448
> >> @cputime[4]: 5094586003
> >> @cputime[1]: 6058168291
> >> @cputime[0]: 14688093441
> >> @cputime[6]: 17578229533
> >> @cputime[2]: 223473400411 (net_rx cpu)
> >>
> >> I also tried setting the cpu affinity of the smbd thread away from the net_rx cpu and indeed that
> >> seems to bring the perf on par with old kernel.
> 
> I observed the same problem for the past two weeks.
> 
> >>
> >> I noticed that there's scheduler load_balance rework in 5.5, so I did the test on 5.4 and 5.5 and
> >> it did show the behavior changed between 5.4 and 5.5.
> > 
> > Have you tested v5.18 ? several improvements happened since v5.5
> > 
> >>
> >> Anyone know how to work around this?
> > 
> > Have you enabled IRQ_TIME_ACCOUNTING ?
> 
> 
> CONFIG_IRQ_TIME_ACCOUNTING=y.
> 
> > 
> > When the time spent under interrupt becomes significant, scheduler
> > migrate task on another cpu
> 
> 
> My board has two cpus, and i used iperf3 to test upload bandwidth,then I saw the same situation,
> the iperf3 thread run on the same cpu as the NET_RX softirq.
> 
> After debug in find_busiest_group(), i noticed when the cpu(env->idle is CPU_IDLE or CPU_NEWLY_IDLE) try to pull task,
> the busiest->group_type == group_fully_busy, busiest->sum_h_nr_running == 1, local->group_type==group_has_spare,
> and the loadbalance will failed at find_busiest_group(), as follows:
> 
> find_busiest_group():
>     ...
>     if (busiest->group_type != group_overloaded) {
> 	....
> 	if (busiest->sum_h_nr_running == 1)
> 		goto out_balanced;     ----> loadbalance will returned at here.

Yes, you're right, we filter out that case. Could you try the patch below?
I use the misfit task state to detect a cpu with reduced capacity, and migrate_load
to check whether it is worth migrating the task to the dst cpu.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6775a117f3c1..013dcd97472b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8757,11 +8757,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
                if (local_group)
                        continue;
 
-               /* Check for a misfit task on the cpu */
-               if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
-                   sgs->group_misfit_task_load < rq->misfit_task_load) {
-                       sgs->group_misfit_task_load = rq->misfit_task_load;
-                       *sg_status |= SG_OVERLOAD;
+               if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
+                       /* Check for a misfit task on the cpu */
+                       if (sgs->group_misfit_task_load < rq->misfit_task_load) {
+                               sgs->group_misfit_task_load = rq->misfit_task_load;
+                               *sg_status |= SG_OVERLOAD;
+                       }
+                } else if ((env->idle != CPU_NOT_IDLE) &&
+                           (group->group_weight == 1) &&
+                           (rq->cfs.h_nr_running == 1) &&
+                           check_cpu_capacity(rq, env->sd) &&
+                           (sgs->group_misfit_task_load < cpu_load(rq))) {
+                       /* Check for a task running on a CPU with reduced capacity */
+                       sgs->group_misfit_task_load = cpu_load(rq);
                }
        }
 
@@ -8814,7 +8822,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
         * CPUs in the group should either be possible to resolve
         * internally or be covered by avg_load imbalance (eventually).
         */
-       if (sgs->group_type == group_misfit_task &&
+       if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+           (sgs->group_type == group_misfit_task) &&
            (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
             sds->local_stat.group_type != group_has_spare))
                return false;
@@ -9360,9 +9369,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
        busiest = &sds->busiest_stat;
 
        if (busiest->group_type == group_misfit_task) {
-               /* Set imbalance to allow misfit tasks to be balanced. */
-               env->migration_type = migrate_misfit;
-               env->imbalance = 1;
+               if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
+                       /* Set imbalance to allow misfit tasks to be balanced. */
+                       env->migration_type = migrate_misfit;
+                       env->imbalance = 1;
+               } else {
+                       /* Set group overloaded as one cpu has reduced capacity */
+                       env->migration_type = migrate_load;
+                       env->imbalance = busiest->group_misfit_task_load;
+               }
                return;
        }
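
For reference, the two helpers the new branch relies on look roughly like this in v5.10
(paraphrased, approximate):

/* A cpu is considered to have noticeably reduced capacity when the capacity
 * left after irq/RT/DL pressure drops below cpu_capacity_orig scaled by the
 * sched domain's imbalance_pct. */
static inline int
check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
{
	return ((rq->cpu_capacity * sd->imbalance_pct) <
		(rq->cpu_capacity_orig * 100));
}

/* Runqueue load used as the imbalance to migrate in the migrate_load case. */
static unsigned long cpu_load(struct rq *rq)
{
	return READ_ONCE(rq->cfs.avg.load_avg);
}

So an idle cpu should now be able to pull the single running task off a cpu whose capacity has
been eaten by the net_rx softirq, instead of bailing out in the sum_h_nr_running == 1 case above.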


> ....
> 
> 
> Thanks,
> Qiao
> 
> 
> > Vincent>>
> >> Thanks,
> >> David
> > .
> > 


* RE: Perf regression from scheduler load_balance rework in 5.5?
  2022-06-27 10:59     ` Vincent Guittot
@ 2022-06-29 21:45       ` David Chen
  2022-06-30  7:01         ` Vincent Guittot
  2022-06-30  7:02       ` Zhang Qiao
  1 sibling, 1 reply; 7+ messages in thread
From: David Chen @ 2022-06-29 21:45 UTC (permalink / raw)
  To: Vincent Guittot, Zhang Qiao; +Cc: linux-kernel, Ingo Molnar



> -----Original Message-----
> From: Vincent Guittot <vincent.guittot@linaro.org>
> Sent: Monday, June 27, 2022 4:00 AM
> To: Zhang Qiao <zhangqiao22@huawei.com>
> Cc: David Chen <david.chen@nutanix.com>; linux-kernel@vger.kernel.org; Ingo Molnar <mingo@redhat.com>
> Subject: Re: Perf regression from scheduler load_balance rework in 5.5?
> 
> Hi,
> 
> On Friday 24 June 2022 at 21:16:05 (+0800), Zhang Qiao wrote:
> >
> > Hi,
> > On 2022/6/24 16:22, Vincent Guittot wrote:
> > > On Thu, 23 Jun 2022 at 21:50, David Chen <david.chen@nutanix.com> wrote:
> > >>
> > >> Hi,
> > >>
> > >> I'm working on upgrading our kernel from 4.14 to 5.10
> > >> However, I'm seeing performance regression when doing rand read from windows client through smbd
> > >> with a well cached file.
> > >>
> > >> One thing I noticed is that on the new kernel, the smbd thread doing socket I/O tends to stay on
> > >> the same cpu core as the net_rx softirq, where as in the old kernel it tends to be moved around
> > >> more randomly. And when they are on the same cpu, it tends to saturate the cpu more and causes
> > >> performance to drop.
> > >>
> > >> For example, here's the duration (ns) the thread spend on each cpu I captured using bpftrace
> > >> On 4.14:
> > >> @cputime[7]: 20741458382
> > >> @cputime[0]: 25219285005
> > >> @cputime[6]: 30892418441
> > >> @cputime[5]: 31032404613
> > >> @cputime[3]: 33511324691
> > >> @cputime[1]: 35564174562
> > >> @cputime[4]: 39313421965
> > >> @cputime[2]: 55779811909 (net_rx cpu)
> > >>
> > >> On 5.10:
> > >> @cputime[3]: 2150554823
> > >> @cputime[5]: 3294276626
> > >> @cputime[7]: 4277890448
> > >> @cputime[4]: 5094586003
> > >> @cputime[1]: 6058168291
> > >> @cputime[0]: 14688093441
> > >> @cputime[6]: 17578229533
> > >> @cputime[2]: 223473400411 (net_rx cpu)
> > >>
> > >> I also tried setting the cpu affinity of the smbd thread away from the net_rx cpu and indeed that
> > >> seems to bring the perf on par with old kernel.
> >
> > I observed the same problem for the past two weeks.
> >
> > >>
> > >> I noticed that there's scheduler load_balance rework in 5.5, so I did the test on 5.4 and 5.5 and
> > >> it did show the behavior changed between 5.4 and 5.5.
> > >
> > > Have you tested v5.18 ? several improvements happened since v5.5
> > >
> > >>
> > >> Anyone know how to work around this?
> > >
> > > Have you enabled IRQ_TIME_ACCOUNTING ?
> >
> >
> > CONFIG_IRQ_TIME_ACCOUNTING=y.
> >
> > >
> > > When the time spent under interrupt becomes significant, scheduler
> > > migrate task on another cpu
> >
> >
> > My board has two cpus, and i used iperf3 to test upload bandwidth,then I saw the same situation,
> > the iperf3 thread run on the same cpu as the NET_RX softirq.
> >
> > After debug in find_busiest_group(), i noticed when the cpu(env->idle is CPU_IDLE or CPU_NEWLY_IDLE) try to pull task,
> > the busiest->group_type == group_fully_busy, busiest->sum_h_nr_running == 1, local->group_type==group_has_spare,
> > and the loadbalance will failed at find_busiest_group(), as follows:
> >
> > find_busiest_group():
> >     ...
> >     if (busiest->group_type != group_overloaded) {
> > 	....
> > 	if (busiest->sum_h_nr_running == 1)
> > 		goto out_balanced;     ----> loadbalance will returned at here.
> 
> Yes, you're right, we filter such case. Could you try the patch below ?
> I use the misfit task state to detect cpu with reduced capacity and migrate_load
> to check if it worth migration the task on the dst cpu.
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6775a117f3c1..013dcd97472b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8757,11 +8757,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                 if (local_group)
>                         continue;
> 
> -               /* Check for a misfit task on the cpu */
> -               if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> -                   sgs->group_misfit_task_load < rq->misfit_task_load) {
> -                       sgs->group_misfit_task_load = rq->misfit_task_load;
> -                       *sg_status |= SG_OVERLOAD;
> +               if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
> +                       /* Check for a misfit task on the cpu */
> +                       if (sgs->group_misfit_task_load < rq->misfit_task_load) {
> +                               sgs->group_misfit_task_load = rq->misfit_task_load;
> +                               *sg_status |= SG_OVERLOAD;
> +                       }
> +                } else if ((env->idle != CPU_NOT_IDLE) &&
> +                           (group->group_weight == 1) &&
> +                           (rq->cfs.h_nr_running == 1) &&
> +                           check_cpu_capacity(rq, env->sd) &&
> +                           (sgs->group_misfit_task_load < cpu_load(rq))) {
> +                       /* Check for a task running on a CPU with reduced capacity */
> +                       sgs->group_misfit_task_load = cpu_load(rq);
>                 }
>         }
> 
> @@ -8814,7 +8822,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>          * CPUs in the group should either be possible to resolve
>          * internally or be covered by avg_load imbalance (eventually).
>          */
> -       if (sgs->group_type == group_misfit_task &&
> +       if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> +           (sgs->group_type == group_misfit_task) &&
>             (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
>              sds->local_stat.group_type != group_has_spare))
>                 return false;
> @@ -9360,9 +9369,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>         busiest = &sds->busiest_stat;
> 
>         if (busiest->group_type == group_misfit_task) {
> -               /* Set imbalance to allow misfit tasks to be balanced. */
> -               env->migration_type = migrate_misfit;
> -               env->imbalance = 1;
> +               if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
> +                       /* Set imbalance to allow misfit tasks to be balanced. */
> +                       env->migration_type = migrate_misfit;
> +                       env->imbalance = 1;
> +               } else {
> +                       /* Set group overloaded as one cpu has reduced capacity */
> +                       env->migration_type = migrate_load;
> +                       env->imbalance = busiest->group_misfit_task_load;
> +               }
>                 return;
>         }
> 
> 
> > ....
> >
> >
> > Thanks,
> > Qiao
> >
> >
> > > Vincent>>
> > >> Thanks,
> > >> David
> > > .
> > >

Hi,

I applied the patch on top of 5.10 and also enabled CONFIG_IRQ_TIME_ACCOUNTING.
And it did fix the issue I had.

Thanks,
David


* Re: Perf regression from scheduler load_balance rework in 5.5?
  2022-06-29 21:45       ` David Chen
@ 2022-06-30  7:01         ` Vincent Guittot
  0 siblings, 0 replies; 7+ messages in thread
From: Vincent Guittot @ 2022-06-30  7:01 UTC (permalink / raw)
  To: David Chen; +Cc: Zhang Qiao, linux-kernel, Ingo Molnar

Hi David,

On Wed, 29 Jun 2022 at 23:45, David Chen <david.chen@nutanix.com> wrote:
>
>
[...]

>
> Hi,
>
> I applied the patch on top of 5.10 and also enabled CONFIG_IRQ_TIME_ACCOUNTING.
> And it did fix the issue I had.

Thanks for testing. I'm going to prepare a patch.

Vincent
>
> Thanks,
> David


* Re: Perf regression from scheduler load_balance rework in 5.5?
  2022-06-27 10:59     ` Vincent Guittot
  2022-06-29 21:45       ` David Chen
@ 2022-06-30  7:02       ` Zhang Qiao
  1 sibling, 0 replies; 7+ messages in thread
From: Zhang Qiao @ 2022-06-30  7:02 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: David Chen, linux-kernel, Ingo Molnar



On 2022/6/27 18:59, Vincent Guittot wrote:
> Hi,
> 
> On Friday 24 June 2022 at 21:16:05 (+0800), Zhang Qiao wrote:
>>
>> Hi,
>> On 2022/6/24 16:22, Vincent Guittot wrote:
>>> On Thu, 23 Jun 2022 at 21:50, David Chen <david.chen@nutanix.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm working on upgrading our kernel from 4.14 to 5.10
>>>> However, I'm seeing performance regression when doing rand read from windows client through smbd
>>>> with a well cached file.
>>>>
>>>> One thing I noticed is that on the new kernel, the smbd thread doing socket I/O tends to stay on
>>>> the same cpu core as the net_rx softirq, where as in the old kernel it tends to be moved around
>>>> more randomly. And when they are on the same cpu, it tends to saturate the cpu more and causes
>>>> performance to drop.
>>>>
>>>> For example, here's the duration (ns) the thread spend on each cpu I captured using bpftrace
>>>> On 4.14:
>>>> @cputime[7]: 20741458382
>>>> @cputime[0]: 25219285005
>>>> @cputime[6]: 30892418441
>>>> @cputime[5]: 31032404613
>>>> @cputime[3]: 33511324691
>>>> @cputime[1]: 35564174562
>>>> @cputime[4]: 39313421965
>>>> @cputime[2]: 55779811909 (net_rx cpu)
>>>>
>>>> On 5.10:
>>>> @cputime[3]: 2150554823
>>>> @cputime[5]: 3294276626
>>>> @cputime[7]: 4277890448
>>>> @cputime[4]: 5094586003
>>>> @cputime[1]: 6058168291
>>>> @cputime[0]: 14688093441
>>>> @cputime[6]: 17578229533
>>>> @cputime[2]: 223473400411 (net_rx cpu)
>>>>
>>>> I also tried setting the cpu affinity of the smbd thread away from the net_rx cpu and indeed that
>>>> seems to bring the perf on par with old kernel.
>>
>> I observed the same problem for the past two weeks.
>>
>>>>
>>>> I noticed that there's scheduler load_balance rework in 5.5, so I did the test on 5.4 and 5.5 and
>>>> it did show the behavior changed between 5.4 and 5.5.
>>>
>>> Have you tested v5.18 ? several improvements happened since v5.5
>>>
>>>>
>>>> Anyone know how to work around this?
>>>
>>> Have you enabled IRQ_TIME_ACCOUNTING ?
>>
>>
>> CONFIG_IRQ_TIME_ACCOUNTING=y.
>>
>>>
>>> When the time spent under interrupt becomes significant, scheduler
>>> migrate task on another cpu
>>
>>
>> My board has two cpus, and i used iperf3 to test upload bandwidth,then I saw the same situation,
>> the iperf3 thread run on the same cpu as the NET_RX softirq.
>>
>> After debug in find_busiest_group(), i noticed when the cpu(env->idle is CPU_IDLE or CPU_NEWLY_IDLE) try to pull task,
>> the busiest->group_type == group_fully_busy, busiest->sum_h_nr_running == 1, local->group_type==group_has_spare,
>> and the loadbalance will failed at find_busiest_group(), as follows:
>>
>> find_busiest_group():
>>     ...
>>     if (busiest->group_type != group_overloaded) {
>> 	....
>> 	if (busiest->sum_h_nr_running == 1)
>> 		goto out_balanced;     ----> loadbalance will returned at here.
> 
> Yes, you're right, we filter such case. Could you try the patch below ?
> I use the misfit task state to detect cpu with reduced capacity and migrate_load
> to check if it worth migration the task on the dst cpu. 


Hi,

I tested with this patch and it works.

> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6775a117f3c1..013dcd97472b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8757,11 +8757,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                 if (local_group)
>                         continue;
>  
> -               /* Check for a misfit task on the cpu */
> -               if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> -                   sgs->group_misfit_task_load < rq->misfit_task_load) {
> -                       sgs->group_misfit_task_load = rq->misfit_task_load;
> -                       *sg_status |= SG_OVERLOAD;
> +               if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
> +                       /* Check for a misfit task on the cpu */
> +                       if (sgs->group_misfit_task_load < rq->misfit_task_load) {
> +                               sgs->group_misfit_task_load = rq->misfit_task_load;
> +                               *sg_status |= SG_OVERLOAD;
> +                       }
> +                } else if ((env->idle != CPU_NOT_IDLE) &&
> +                           (group->group_weight == 1) &&
> +                           (rq->cfs.h_nr_running == 1) &&
> +                           check_cpu_capacity(rq, env->sd) &&
> +                           (sgs->group_misfit_task_load < cpu_load(rq))) {
> +                       /* Check for a task running on a CPU with reduced capacity */
> +                       sgs->group_misfit_task_load = cpu_load(rq);
>                 }
>         }
>  
> @@ -8814,7 +8822,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>          * CPUs in the group should either be possible to resolve
>          * internally or be covered by avg_load imbalance (eventually).
>          */
> -       if (sgs->group_type == group_misfit_task &&
> +       if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> +           (sgs->group_type == group_misfit_task) &&
>             (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
>              sds->local_stat.group_type != group_has_spare))
>                 return false;
> @@ -9360,9 +9369,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>         busiest = &sds->busiest_stat;
>  
>         if (busiest->group_type == group_misfit_task) {
> -               /* Set imbalance to allow misfit tasks to be balanced. */
> -               env->migration_type = migrate_misfit;
> -               env->imbalance = 1;
> +               if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
> +                       /* Set imbalance to allow misfit tasks to be balanced. */
> +                       env->migration_type = migrate_misfit;
> +                       env->imbalance = 1;
> +               } else {
> +                       /* Set group overloaded as one cpu has reduced capacity */
> +                       env->migration_type = migrate_load;
> +                       env->imbalance = busiest->group_misfit_task_load;
> +               }
>                 return;
>         }
> 
> 
>> ....
>>
>>
>> Thanks,
>> Qiao
>>
>>
>>> Vincent>>
>>>> Thanks,
>>>> David
>>> .
>>>
> .
> 


end of thread, other threads:[~2022-06-30  7:02 UTC | newest]

Thread overview: 7+ messages
2022-06-23 19:50 Perf regression from scheduler load_balance rework in 5.5? David Chen
2022-06-24  8:22 ` Vincent Guittot
2022-06-24 13:16   ` Zhang Qiao
2022-06-27 10:59     ` Vincent Guittot
2022-06-29 21:45       ` David Chen
2022-06-30  7:01         ` Vincent Guittot
2022-06-30  7:02       ` Zhang Qiao
