* [patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
@ 2016-08-30  5:42 Mike Galbraith
  2016-08-31 10:01 ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Galbraith @ 2016-08-30  5:42 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Rik van Riel


43f4d666 partially cured spurious migrations, but when there are
completely idle groups on a lightly loaded processor, and there is
a buddy pair occupying the busiest group, we will not attempt to
migrate due to select_idle_sibling() buddy placement, leaving the
busiest queue with one task.  We skip balancing, but increment
nr_balance_failed until we kick active balancing, and bounce a
buddy pair endlessly, demolishing throughput.
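
For reference, the "kick active balancing" above is driven by a per-domain
failure counter; a simplified, paraphrased sketch of the mechanism (not
verbatim fair.c of any particular release) looks like:

	/* in load_balance(), when no task could be pulled: */
	sd->nr_balance_failed++;
	if (need_active_balance(&env)) {
		/* wake the cpu stopper to push the currently running
		 * task off the busiest CPU -- an "active" migration */
	}

	static int need_active_balance(struct lb_env *env)
	{
		struct sched_domain *sd = env->sd;

		/* other conditions elided */
		return unlikely(sd->nr_balance_failed > sd->cache_nice_tries + 2);
	}

Each failed attempt thus brings the domain closer to forcibly migrating a
running task, which is what rips the buddy pair apart.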

Regression detected on X5472 box, which has 4 MC groups of 2 cores.

netperf -l 60 -H 127.0.0.1 -t UDP_STREAM -i5,1 -I 95,5
pre:
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      : 66.421%
!!!                       Local CPU util  : 0.000%
!!!                       Remote CPU util : 0.000%

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00     1779143      0    15539.49
212992           60.00     1773551           15490.65

post:
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00     3719377      0    32486.01
212992           60.00     3717492           32469.54

Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Fixes: caeb178c ("sched/fair: Make update_sd_pick_busiest() return 'true' on a busier sd")
Cc: <stable@vger.kernel.org> # v3.18+
---
 kernel/sched/fair.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7249,11 +7249,12 @@ static struct sched_group *find_busiest_
 		 * This cpu is idle. If the busiest group is not overloaded
 		 * and there is no imbalance between this and busiest group
 		 * wrt idle cpus, it is balanced. The imbalance becomes
-		 * significant if the diff is greater than 1 otherwise we
-		 * might end up to just move the imbalance on another group
+		 * significant if the diff is greater than 2 otherwise we
+		 * may end up merely moving the imbalance to another group,
+		 * or bouncing a buddy pair needlessly.
 		 */
 		if ((busiest->group_type != group_overloaded) &&
-				(local->idle_cpus <= (busiest->idle_cpus + 1)))
+				(local->idle_cpus <= (busiest->idle_cpus + 2)))
 			goto out_balanced;
 	} else {
 		/*

* Re: [patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
  2016-08-30  5:42 [patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations Mike Galbraith
@ 2016-08-31 10:01 ` Peter Zijlstra
  2016-08-31 10:18   ` Mike Galbraith
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2016-08-31 10:01 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, Rik van Riel, Vincent Guittot

On Tue, Aug 30, 2016 at 07:42:55AM +0200, Mike Galbraith wrote:
> 
> 43f4d666 partially cured spurious migrations, but when there are
> completely idle groups on a lightly loaded processor, and there is
> a buddy pair occupying the busiest group, we will not attempt to
> migrate due to select_idle_sibling() buddy placement, leaving the
> busiest queue with one task.  We skip balancing, but increment
> nr_balance_failed until we kick active balancing, and bounce a
> buddy pair endlessly, demolishing throughput.

Have you run this patch through other benchmarks? It looks like
something that might make something else go funny.

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7249,11 +7249,12 @@ static struct sched_group *find_busiest_
>  		 * This cpu is idle. If the busiest group is not overloaded
>  		 * and there is no imbalance between this and busiest group
>  		 * wrt idle cpus, it is balanced. The imbalance becomes
> -		 * significant if the diff is greater than 1 otherwise we
> -		 * might end up to just move the imbalance on another group
> +		 * significant if the diff is greater than 2 otherwise we
> +		 * may end up merely moving the imbalance to another group,
> +		 * or bouncing a buddy pair needlessly.
>  		 */
>  		if ((busiest->group_type != group_overloaded) &&
> -				(local->idle_cpus <= (busiest->idle_cpus + 1)))
> +				(local->idle_cpus <= (busiest->idle_cpus + 2)))
>  			goto out_balanced;

So 43f4d66637bc ("sched: Improve sysbench performance by fixing spurious
active migration") 's +1 made sense in that its a tie breaker. If you
have 3 tasks on 2 groups, one group will have to have 2 tasks, and
bouncing the one task around just isn't going to help _anything_.

Incrementing that to +2 has the effect that if you have two tasks on two
groups, 0,2 is a valid distribution. Which I understand is exactly what
you want for this workload. But if the two tasks are unrelated, 1,1
really is a better spread.
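
To make the arithmetic concrete, an illustrative walk-through of the check
quoted above for the X5472's 2-core MC groups (editorial example; the
numbers follow directly from the changelog's scenario):

	/* buddy pair packed into one group, another group completely idle:
	 *   local->idle_cpus == 2, busiest->idle_cpus == 0
	 *   +1: 2 <= 0 + 1 is false -> imbalance declared, active balancing
	 *       eventually rips the pair apart
	 *   +2: 2 <= 0 + 2 is true  -> out_balanced, the pair is left alone
	 *
	 * three tasks on two 2-core groups (the tie-breaker case above):
	 *   local->idle_cpus == 1, busiest->idle_cpus == 0
	 *   +1: 1 <= 0 + 1 is true  -> already balanced, no bouncing
	 */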

* Re: [patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
  2016-08-31 10:01 ` Peter Zijlstra
@ 2016-08-31 10:18   ` Mike Galbraith
  2016-08-31 10:36     ` Mike Galbraith
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Galbraith @ 2016-08-31 10:18 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Rik van Riel, Vincent Guittot

On Wed, 2016-08-31 at 12:01 +0200, Peter Zijlstra wrote:
> On Tue, Aug 30, 2016 at 07:42:55AM +0200, Mike Galbraith wrote:
> > 
> > 43f4d666 partially cured spurious migrations, but when there are
> > completely idle groups on a lightly loaded processor, and there is
> > a buddy pair occupying the busiest group, we will not attempt to
> > migrate due to select_idle_sibling() buddy placement, leaving the
> > busiest queue with one task.  We skip balancing, but increment
> > nr_balance_failed until we kick active balancing, and bounce a
> > buddy pair endlessly, demolishing throughput.
> 
> Have you run this patch through other benchmarks? It looks like
> something that might make something else go funny.

No, but it will be going through SUSE's performance test grid.

> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7249,11 +7249,12 @@ static struct sched_group *find_busiest_
> >  		 * This cpu is idle. If the busiest group is not overloaded
> >  		 * and there is no imbalance between this and busiest group
> >  		 * wrt idle cpus, it is balanced. The imbalance becomes
> > -		 * significant if the diff is greater than 1 otherwise we
> > -		 * might end up to just move the imbalance on another group
> > +		 * significant if the diff is greater than 2 otherwise we
> > +		 * may end up merely moving the imbalance to another group,
> > +		 * or bouncing a buddy pair needlessly.
> >  		 */
> >  		if ((busiest->group_type != group_overloaded) &&
> > -				(local->idle_cpus <= (busiest->idle_cpus + 1)))
> > +				(local->idle_cpus <= (busiest->idle_cpus + 2)))
> >  			goto out_balanced;
> 
> So 43f4d66637bc ("sched: Improve sysbench performance by fixing spurious
> active migration") 's +1 made sense in that its a tie breaker. If you
> have 3 tasks on 2 groups, one group will have to have 2 tasks, and
> bouncing the one task around just isn't going to help _anything_.

Yeah, but frequently tasks don't come in ones, so, you end up with an
endless tug of war between LB ripping communicating buddies apart, and
select_idle_sibling() pulling them back together.. bouncing cow
syndrome.

> Incrementing that to +2 has the effect that if you have two tasks on two
> groups, 0,2 is a valid distribution. Which I understand is exactly what
> you want for this workload. But if the two tasks are unrelated, 1,1
> really is a better spread.

True.  Better ideas welcome.

	-Mike

* Re: [patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
  2016-08-31 10:18   ` Mike Galbraith
@ 2016-08-31 10:36     ` Mike Galbraith
  2016-08-31 15:52       ` Vincent Guittot
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Galbraith @ 2016-08-31 10:36 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Rik van Riel, Vincent Guittot

On Wed, 2016-08-31 at 12:18 +0200, Mike Galbraith wrote:
> On Wed, 2016-08-31 at 12:01 +0200, Peter Zijlstra wrote:

> > So 43f4d66637bc ("sched: Improve sysbench performance by fixing spurious
> > active migration") 's +1 made sense in that its a tie breaker. If you
> > have 3 tasks on 2 groups, one group will have to have 2 tasks, and
> > bouncing the one task around just isn't going to help _anything_.
> 
> Yeah, but frequently tasks don't come in ones, so, you end up with an
> endless tug of war between LB ripping communicating buddies apart, and
> select_idle_sibling() pulling them back together.. bouncing cow
> syndrome.

The whole business of trying to balance groups down to the single task
seems a bit illogical given we care enough to wake to shared cache in
the first place, creating the 'imbalance' we then try to correct. 
'course that weakens your unrelated tasks (which may meet on a sleeping
lock or whatever) argument not one bit, it's also valid.

hrm.

	-Mike

* Re: [patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
  2016-08-31 10:36     ` Mike Galbraith
@ 2016-08-31 15:52       ` Vincent Guittot
  2016-09-01  4:11         ` Mike Galbraith
  0 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2016-08-31 15:52 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, LKML, Rik van Riel

On 31 August 2016 at 12:36, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> On Wed, 2016-08-31 at 12:18 +0200, Mike Galbraith wrote:
>> On Wed, 2016-08-31 at 12:01 +0200, Peter Zijlstra wrote:
>
>> > So 43f4d66637bc ("sched: Improve sysbench performance by fixing spurious
>> > active migration") 's +1 made sense in that its a tie breaker. If you
>> > have 3 tasks on 2 groups, one group will have to have 2 tasks, and
>> > bouncing the one task around just isn't going to help _anything_.
>>
>> Yeah, but frequently tasks don't come in ones, so, you end up with an
>> endless tug of war between LB ripping communicating buddies apart, and
>> select_idle_sibling() pulling them back together.. bouncing cow
>> syndrome.
>

Replacing +1 with +2 fixes this use case, which involves 2 threads, but
similar behavior can happen with, for example, 3 tasks on a system with
4 cores per MC.

IIUC, you have:
- on one side, the periodic load balance that spreads the 2 tasks across
  the system
- on the other side, the wake-up path that moves the task back into the
  same MC.

Isn't your regression more linked to spurious migration than to where
the task is scheduled? I don't see any direct relation between the
client and the server in this netperf test, is there?

We could either remove the condition which tries to keep an even number
of tasks in each group until the busiest group becomes overloaded, but
that means unrelated tasks may have to share the same resources; or we
could try to prevent the migration at wake up. I was looking at
wake_affine, which seems to choose the local cpu when both prev and
local cpu are idle. I wonder if the local cpu is really a better choice
when both are idle.
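
A purely illustrative sketch of that wake-side idea (hypothetical, not the
actual wake_affine() code of this era): prefer the wakee's previous CPU
whenever both candidates are idle, so the wakeup does not pull the task
back into the waker's group right after load balancing spread them:

	/* hypothetical tweak somewhere in the affine-wakeup path */
	if (idle_cpu(this_cpu) && idle_cpu(prev_cpu))
		return prev_cpu;	/* stay put instead of following the waker */

Whether that helps or hurts would of course depend on how warm prev_cpu's
cache still is.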

Vincent

> The whole business of trying to balance groups down to the single task
> seems a bit illogical given we care enough to wake to shared cache in
> the first place, creating the 'imbalance' we then try to correct.
> 'course that weakens your unrelated tasks (which may meet on a sleeping
> lock or whatever) argument not one bit, it's also valid.
>
> hrm.
>
>         -Mike

* Re: [patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
  2016-08-31 15:52       ` Vincent Guittot
@ 2016-09-01  4:11         ` Mike Galbraith
  2016-09-01  6:37           ` Mike Galbraith
  2016-09-01  8:09           ` Vincent Guittot
  0 siblings, 2 replies; 14+ messages in thread
From: Mike Galbraith @ 2016-09-01  4:11 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: Peter Zijlstra, LKML, Rik van Riel

On Wed, 2016-08-31 at 17:52 +0200, Vincent Guittot wrote:
> On 31 August 2016 at 12:36, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> > On Wed, 2016-08-31 at 12:18 +0200, Mike Galbraith wrote:
> > > On Wed, 2016-08-31 at 12:01 +0200, Peter Zijlstra wrote:
> > 
> > > > So 43f4d66637bc ("sched: Improve sysbench performance by fixing spurious
> > > > active migration") 's +1 made sense in that its a tie breaker. If you
> > > > have 3 tasks on 2 groups, one group will have to have 2 tasks, and
> > > > bouncing the one task around just isn't going to help _anything_.
> > > 
> > > Yeah, but frequently tasks don't come in ones, so, you end up with an
> > > endless tug of war between LB ripping communicating buddies apart, and
> > > select_idle_sibling() pulling them back together.. bouncing cow
> > > syndrome.
> > 
> 
> replacing +1 by +2 fixes this use case that involves 2 threads but
> similar behavior can happen with 3 tasks on system with 4 cores per MC
> as an example
> 
> IIUC, you have on
> - one side, periodic load balance that spreads the 2 tasks in the system
> - on the other side, wake up path that moves the task back in the same MC.

Yup.

> Isn't your regression more linked to spurious migration than where the
> task is scheduled ? I don't see any direct relation between the client
> and the server in this netperf test, isn't it ?

         netperf  4360 [004]  1207.865265:       sched:sched_wakeup: netserver:4361 [120] success=1 CPU:002
         netperf  4360 [004]  1207.865274:       sched:sched_wakeup: netserver:4361 [120] success=1 CPU:002
         netperf  4360 [004]  1207.865280:       sched:sched_wakeup: netserver:4361 [120] success=1 CPU:002
       netserver  4361 [002]  1207.865313:       sched:sched_wakeup: netperf:4360 [120] success=1 CPU:004
         netperf  4360 [004]  1207.865340:       sched:sched_wakeup: kworker/u16:4:89 [120] success=1 CPU:000
         netperf  4360 [004]  1207.865345:       sched:sched_wakeup: kworker/u16:5:90 [120] success=1 CPU:006
         netperf  4360 [004]  1207.865355:       sched:sched_wakeup: kworker/u16:5:90 [120] success=1 CPU:006
         netperf  4360 [004]  1207.865357:       sched:sched_wakeup: kworker/u16:4:89 [120] success=1 CPU:000
         netperf  4360 [004]  1207.865369:       sched:sched_wakeup: netserver:4361 [120] success=1 CPU:002
       netserver  4361 [002]  1207.865377:       sched:sched_wakeup: netperf:4360 [120] success=1 CPU:004
         netperf  4360 [004]  1207.865476:       sched:sched_wakeup: perf:4359 [120] success=1 CPU:003

It's not limited to this load, anything at all that is communicating
will do the same on these or similar processors.

This trying to be perfect looks like a booboo to me, as we are now
specifically asking our left hand to undo what our right hand did to crank
up throughput.  For the diagnosed processor at least, one of those
hands definitely wants to be slapped.

This doesn't seem to be an issue for L3-equipped CPUs, but perhaps it is
even for some modern processors, dunno (the boxen where the regression was
detected are far from new).

> we could either remove the condition which tries to keep an even
> number of tasks in each group until busiest group becomes overloaded
> but it means that unrelated tasks may have to share same resources
> or we could try to prevent the migration at wake up. I was looking at
> wake_affine which seems to choose local cpu  when both prev and local
> cpu are idle. I wonder if local cpu is  really a better choice when
> both are idle

I don't see a great alternative to turning it off off the top of my
head, at least for processors with multiple LLCs.  Yeah, unrelated
tasks could end up sharing a cache needlessly, but will that hurt as
badly as tasks not munching tasty hot data definitely does?

	-Mike

* Re: [patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
  2016-09-01  4:11         ` Mike Galbraith
@ 2016-09-01  6:37           ` Mike Galbraith
  2016-09-01  8:09           ` Vincent Guittot
  1 sibling, 0 replies; 14+ messages in thread
From: Mike Galbraith @ 2016-09-01  6:37 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: Peter Zijlstra, LKML, Rik van Riel

On Thu, 2016-09-01 at 06:11 +0200, Mike Galbraith wrote:

> I don't see a great alternative to turning it off off the top of my
> head, at least for processors with multiple LLCs.

Here of course I mean other than saying it's just not worth worrying
about such old processors, iff it's only old dogs that are affected. 

'Ignore it, they're dinosaurs' is a valid option.  I really doubt there
are oodles of such boxen in active use out in the real world.. if there
were, surely there would be gripes, which afaik has not happened.

	-Mike

* Re: [patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
  2016-09-01  4:11         ` Mike Galbraith
  2016-09-01  6:37           ` Mike Galbraith
@ 2016-09-01  8:09           ` Vincent Guittot
  2016-09-05 16:26             ` [v2 patch " Mike Galbraith
  1 sibling, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2016-09-01  8:09 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, LKML, Rik van Riel

On 1 September 2016 at 06:11, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> On Wed, 2016-08-31 at 17:52 +0200, Vincent Guittot wrote:
>> On 31 August 2016 at 12:36, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
>> > On Wed, 2016-08-31 at 12:18 +0200, Mike Galbraith wrote:
>> > > On Wed, 2016-08-31 at 12:01 +0200, Peter Zijlstra wrote:
>> >
>> > > > So 43f4d66637bc ("sched: Improve sysbench performance by fixing spurious
>> > > > active migration") 's +1 made sense in that its a tie breaker. If you
>> > > > have 3 tasks on 2 groups, one group will have to have 2 tasks, and
>> > > > bouncing the one task around just isn't going to help _anything_.
>> > >
>> > > Yeah, but frequently tasks don't come in ones, so, you end up with an
>> > > endless tug of war between LB ripping communicating buddies apart, and
>> > > select_idle_sibling() pulling them back together.. bouncing cow
>> > > syndrome.
>> >
>>
>> replacing +1 by +2 fixes this use case that involves 2 threads but
>> similar behavior can happen with 3 tasks on system with 4 cores per MC
>> as an example
>>
>> IIUC, you have on
>> - one side, periodic load balance that spreads the 2 tasks in the system
>> - on the other side, wake up path that moves the task back in the same MC.
>
> Yup.
>
>> Isn't your regression more linked to spurious migration than where the
>> task is scheduled ? I don't see any direct relation between the client
>> and the server in this netperf test, isn't it ?
>
>          netperf  4360 [004]  1207.865265:       sched:sched_wakeup: netserver:4361 [120] success=1 CPU:002
>          netperf  4360 [004]  1207.865274:       sched:sched_wakeup: netserver:4361 [120] success=1 CPU:002
>          netperf  4360 [004]  1207.865280:       sched:sched_wakeup: netserver:4361 [120] success=1 CPU:002
>        netserver  4361 [002]  1207.865313:       sched:sched_wakeup: netperf:4360 [120] success=1 CPU:004
>          netperf  4360 [004]  1207.865340:       sched:sched_wakeup: kworker/u16:4:89 [120] success=1 CPU:000
>          netperf  4360 [004]  1207.865345:       sched:sched_wakeup: kworker/u16:5:90 [120] success=1 CPU:006
>          netperf  4360 [004]  1207.865355:       sched:sched_wakeup: kworker/u16:5:90 [120] success=1 CPU:006
>          netperf  4360 [004]  1207.865357:       sched:sched_wakeup: kworker/u16:4:89 [120] success=1 CPU:000
>          netperf  4360 [004]  1207.865369:       sched:sched_wakeup: netserver:4361 [120] success=1 CPU:002
>        netserver  4361 [002]  1207.865377:       sched:sched_wakeup: netperf:4360 [120] success=1 CPU:004
>          netperf  4360 [004]  1207.865476:       sched:sched_wakeup: perf:4359 [120] success=1 CPU:003

I would have expected a net_rx softirq in the middle.
Never mind, I agree that we can find lots of use cases with communicating tasks.

>
> It's not limited to this load, anything at all that is communicating
> will do the same on these or similar processors.
>
> This trying to be perfect looks like a booboo to me, as we are now
> specifically asking our left hand undo what our right hand did to crank
> up throughput.  For the diagnosed processor at least, one of those
> hands definitely wants to be slapped.
>
> This doesn't seem to be an issue for L3 equipped CPUs, but perhaps is
> for some even modern processors, dunno (the boxen where regression was
> detected are far from new).
>
>> we could either remove the condition which tries to keep an even
>> number of tasks in each group until busiest group becomes overloaded
>> but it means that unrelated tasks may have to share same resources
>> or we could try to prevent the migration at wake up. I was looking at
>> wake_affine which seems to choose local cpu  when both prev and local
>> cpu are idle. I wonder if local cpu is  really a better choice when
>> both are idle
>
> I don't see a great alternative to turning it off off the top of my
> head, at least for processors with multiple LLCs.  Yeah, unrelated
> tasks could end up sharing a cache needlessly, but will that hurt as
> badly as tasks not munching tasty hot data definitely does?

Memory-intensive tasks will probably be hurt.

>
>         -Mike

* [v2 patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
  2016-09-01  8:09           ` Vincent Guittot
@ 2016-09-05 16:26             ` Mike Galbraith
  2016-09-06 13:01               ` Vincent Guittot
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Galbraith @ 2016-09-05 16:26 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: Peter Zijlstra, LKML, Rik van Riel

Coming back to this, how about this instead: only increase the group
imbalance threshold when sd_llc_size == 2.  Newer L3-equipped
processors then aren't affected.
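
For reference, sd_llc_size is the number of CPUs sharing a last-level
cache.  On the X5472 above two cores share an L2 and there is no L3, so
sd_llc_size == 2; on a typical L3-equipped part many cores share the L3,
the value is larger, and the check below keeps the old +1 behaviour.
Illustrative numbers, inferred from the topology described in the
changelog:

	/* X5472: cores paired on an L2, no L3  -> sd_llc_size == 2 -> imbalance = 2
	 * L3 part, e.g. 4+ cores per L3        -> sd_llc_size >= 4 -> imbalance = 1
	 */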



43f4d666 partially cured spurious migrations, but when there are
completely idle groups on a lightly loaded processor, and there is
a buddy pair occupying the busiest group, we will not attempt to
migrate due to select_idle_sibling() buddy placement, leaving the
busiest queue with one task.  We skip balancing, but increment
nr_balance_failed until we kick active balancing, and bounce a
buddy pair endlessly, demolishing throughput.

Increase group imbalance threshold to two when sd_llc_size == 2 to
allow buddies to share L2 without affecting larger L3 processors.

Regression detected on X5472 box, which has 4 MC groups of 2 cores.

netperf -l 60 -H 127.0.0.1 -t UDP_STREAM -i5,1 -I 95,5
pre:
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      : 66.421%
!!!                       Local CPU util  : 0.000%
!!!                       Remote CPU util : 0.000%

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00     1779143      0    15539.49
212992           60.00     1773551           15490.65

post:
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00     3719377      0    32486.01
212992           60.00     3717492           32469.54

Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Fixes: caeb178c ("sched/fair: Make update_sd_pick_busiest() return 'true' on a busier sd")
Cc: <stable@vger.kernel.org> # v3.18+
---
 kernel/sched/fair.c |   17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7249,12 +7249,19 @@ static struct sched_group *find_busiest_
 		 * This cpu is idle. If the busiest group is not overloaded
 		 * and there is no imbalance between this and busiest group
 		 * wrt idle cpus, it is balanced. The imbalance becomes
-		 * significant if the diff is greater than 1 otherwise we
-		 * might end up to just move the imbalance on another group
+		 * significant if the diff is greater than 1 for most CPUs,
+		 * or 2 for older CPUs having multiple groups of 2 cores
+		 * sharing an L2, otherwise we may end up uselessly moving
+		 * the imbalance to another group, or starting a tug of war
+		 * with idle L2 groups constantly ripping communicating
+		 * tasks apart, and no L3 to mitigate the cache miss pain.
 		 */
-		if ((busiest->group_type != group_overloaded) &&
-				(local->idle_cpus <= (busiest->idle_cpus + 1)))
-			goto out_balanced;
+		if (busiest->group_type != group_overloaded) {
+			int imbalance = __this_cpu_read(sd_llc_size) == 2 ? 2 : 1;
+
+			if (local->idle_cpus <= busiest->idle_cpus + imbalance)
+				goto out_balanced;
+		}
 	} else {
 		/*
 		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use

* Re: [v2 patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
  2016-09-05 16:26             ` [v2 patch " Mike Galbraith
@ 2016-09-06 13:01               ` Vincent Guittot
  2016-09-06 13:07                 ` Mike Galbraith
  0 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2016-09-06 13:01 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, LKML, Rik van Riel

On Monday 05 Sep 2016 at 18:26:53 (+0200), Mike Galbraith wrote:
> Coming back to this, how about this instead, only increase the group
> imbalance threshold when sd_llc_size == 2.  Newer L3 equipped
> processors then aren't affected.
>

Not sure that all systems with sd_llc_size == 2 want this behavior.

Why not add a sched_feature for changing the 2nd half of the test for some systems?

something like below

---
 kernel/sched/fair.c     | 11 ++++++++---
 kernel/sched/features.h |  7 +++++++
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4185e0a..65c9363 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7395,9 +7395,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		 * significant if the diff is greater than 1 otherwise we
 		 * might end up to just move the imbalance on another group
 		 */
-		if ((busiest->group_type != group_overloaded) &&
-				(local->idle_cpus <= (busiest->idle_cpus + 1)))
-			goto out_balanced;
+		if (busiest->group_type != group_overloaded) {
+			int imbalance = 1;
+			if (!sched_feat(BALANCE_IDLE_CPUS))
+				imbalance = __this_cpu_read(sd_llc_size);
+
+			if (local->idle_cpus <= busiest->idle_cpus + imbalance)
+				goto out_balanced;
+		}
 	} else {
 		/*
 		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 69631fa..16c34ec 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -69,3 +69,10 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
+/*
+ * Try to balance the number of idle CPUs in each group to minimize contention
+ * on shared resources. Nevertheless, some older systems without L3 seem to
+ * prefer to share resources to minimize migrations between groups
+ */
+SCHED_FEAT(BALANCE_IDLE_CPUS, true)
+
--
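
Were something like this merged, an affected machine could opt out of the
strict idle-cpu balancing at run time through the usual sched_features
knob (this assumes CONFIG_SCHED_DEBUG and a mounted debugfs):

	echo NO_BALANCE_IDLE_CPUS > /sys/kernel/debug/sched_features

though, as the follow-up below argues, expecting users to know they need to
flip such a switch is exactly the objection.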

> 
> 
> 43f4d666 partially cured spurious migrations, but when there are
> completely idle groups on a lightly loaded processor, and there is
> a buddy pair occupying the busiest group, we will not attempt to
> migrate due to select_idle_sibling() buddy placement, leaving the
> busiest queue with one task.  We skip balancing, but increment
> nr_balance_failed until we kick active balancing, and bounce a
> buddy pair endlessly, demolishing throughput.
> 
> Increase group imbalance threshold to two when sd_llc_size == 2 to
> allow buddies to share L2 without affecting larger L3 processors.
> 
> Regression detected on X5472 box, which has 4 MC groups of 2 cores.
> 
> netperf -l 60 -H 127.0.0.1 -t UDP_STREAM -i5,1 -I 95,5
> pre:
> !!! WARNING
> !!! Desired confidence was not achieved within the specified iterations.
> !!! This implies that there was variability in the test environment that
> !!! must be investigated before going further.
> !!! Confidence intervals: Throughput      : 66.421%
> !!!                       Local CPU util  : 0.000%
> !!!                       Remote CPU util : 0.000%
> 
> Socket  Message  Elapsed      Messages                
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
> 212992   65507   60.00     1779143      0    15539.49
> 212992           60.00     1773551           15490.65
> 
> post:
> Socket  Message  Elapsed      Messages                
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
> 212992   65507   60.00     3719377      0    32486.01
> 212992           60.00     3717492           32469.54
> 
> Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
> Fixes: caeb178c sched/fair: Make update_sd_pick_busiest() return 'true' on a busier sd
> Cc: <stable@vger.kernel.org> # v3.18+
> ---
>  kernel/sched/fair.c |   17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7249,12 +7249,19 @@ static struct sched_group *find_busiest_
>  		 * This cpu is idle. If the busiest group is not overloaded
>  		 * and there is no imbalance between this and busiest group
>  		 * wrt idle cpus, it is balanced. The imbalance becomes
> -		 * significant if the diff is greater than 1 otherwise we
> -		 * might end up to just move the imbalance on another group
> +		 * significant if the diff is greater than 1 for most CPUs,
> +		 * or 2 for older CPUs having multiple groups of 2 cores
> +		 * sharing an L2, otherwise we may end up uselessly moving
> +		 * the imbalance to another group, or starting a tug of war
> +		 * with idle L2 groups constantly ripping communicating
> +		 * tasks apart, and no L3 to mitigate the cache miss pain.
>  		 */
> -		if ((busiest->group_type != group_overloaded) &&
> -				(local->idle_cpus <= (busiest->idle_cpus + 1)))
> -			goto out_balanced;
> +		if (busiest->group_type != group_overloaded) {
> +			int imbalance = __this_cpu_read(sd_llc_size) == 2 ? 2 : 1;
> +
> +			if (local->idle_cpus <= busiest->idle_cpus + imbalance)
> +				goto out_balanced;
> +		}
>  	} else {
>  		/*
>  		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use

* Re: [v2 patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
  2016-09-06 13:01               ` Vincent Guittot
@ 2016-09-06 13:07                 ` Mike Galbraith
  2016-09-06 13:42                   ` Vincent Guittot
  2016-09-06 13:44                   ` Mike Galbraith
  0 siblings, 2 replies; 14+ messages in thread
From: Mike Galbraith @ 2016-09-06 13:07 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: Peter Zijlstra, LKML, Rik van Riel

On Tue, 2016-09-06 at 15:01 +0200, Vincent Guittot wrote:
> On Monday 05 Sep 2016 at 18:26:53 (+0200), Mike Galbraith wrote:
> > Coming back to this, how about this instead, only increase the group
> > imbalance threshold when sd_llc_size == 2.  Newer L3 equipped
> > processors then aren't affected.
> > 
> 
> Not sure that all systems with sd_llc_size == 2 wants this behavior.
> 
> Why not adding a sched_feature for changing the 2nd half of the test
> for some systems ?

Because users won't know, and shouldn't need to know that they need to
flip that switch.

	-Mike

* Re: [v2 patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
  2016-09-06 13:07                 ` Mike Galbraith
@ 2016-09-06 13:42                   ` Vincent Guittot
  2016-09-06 13:59                     ` Mike Galbraith
  2016-09-06 13:44                   ` Mike Galbraith
  1 sibling, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2016-09-06 13:42 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, LKML, Rik van Riel

On 6 September 2016 at 15:07, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> On Tue, 2016-09-06 at 15:01 +0200, Vincent Guittot wrote:
>> On Monday 05 Sep 2016 at 18:26:53 (+0200), Mike Galbraith wrote:
>> > Coming back to this, how about this instead, only increase the group
>> > imbalance threshold when sd_llc_size == 2.  Newer L3 equipped
>> > processors then aren't affected.
>> >
>>
>> Not sure that all systems with sd_llc_size == 2 wants this behavior.
>>
>> Why not adding a sched_feature for changing the 2nd half of the test
>> for some systems ?
>
> Because users won't know, and shouldn't need to know that they need to
> flip that switch.

fair enough

so how can we detect this specific system configuration?

sd_llc_size == 2 is not enough IMHO

>
>         -Mike

* Re: [v2 patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
  2016-09-06 13:07                 ` Mike Galbraith
  2016-09-06 13:42                   ` Vincent Guittot
@ 2016-09-06 13:44                   ` Mike Galbraith
  1 sibling, 0 replies; 14+ messages in thread
From: Mike Galbraith @ 2016-09-06 13:44 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: Peter Zijlstra, LKML, Rik van Riel

On Tue, 2016-09-06 at 15:07 +0200, Mike Galbraith wrote:
> On Tue, 2016-09-06 at 15:01 +0200, Vincent Guittot wrote:
> > > On Monday 05 Sep 2016 at 18:26:53 (+0200), Mike Galbraith wrote:
> > > Coming back to this, how about this instead, only increase the
> > > group
> > > imbalance threshold when sd_llc_size == 2.  Newer L3 equipped
> > > processors then aren't affected.
> > > 
> > 
> > Not sure that all systems with sd_llc_size == 2 wants this
> > behavior.
> > 
> > Why not adding a sched_feature for changing the 2nd half of the
> > test
> > for some systems ?
> 
> Because users won't know, and shouldn't need to know that they need to
> flip that switch.

The patchlet just puts these CPUs back in the same boat they were in
before Rik's patch landed.  Also, if people are doing serious compute,
they're unlikely to leave any placement decisions up to the scheduler.

	-Mike

* Re: [v2 patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
  2016-09-06 13:42                   ` Vincent Guittot
@ 2016-09-06 13:59                     ` Mike Galbraith
  0 siblings, 0 replies; 14+ messages in thread
From: Mike Galbraith @ 2016-09-06 13:59 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: Peter Zijlstra, LKML, Rik van Riel

On Tue, 2016-09-06 at 15:42 +0200, Vincent Guittot wrote:
> On 6 September 2016 at 15:07, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> > On Tue, 2016-09-06 at 15:01 +0200, Vincent Guittot wrote:
> > > On Monday 05 Sep 2016 at 18:26:53 (+0200), Mike Galbraith wrote:
> > > > Coming back to this, how about this instead, only increase the group
> > > > imbalance threshold when sd_llc_size == 2.  Newer L3 equipped
> > > > processors then aren't affected.
> > > > 
> > > 
> > > Not sure that all systems with sd_llc_size == 2 wants this behavior.
> > > 
> > > Why not adding a sched_feature for changing the 2nd half of the test
> > > for some systems ?
> > 
> > Because users won't know, and shouldn't need to know that they need to
> > flip that switch.
> 
> fair enough
> 
> so how can we detect this specific system configuration ?
> 
> sd_llc_size == 2 is not enough IMHO

Seems to me that _is_ the pertinent configuration.  If that isn't
enough, I guess we just say too bad about old multiple LLC CPUs.

	-Mike
