* [RFC] sched: The removal of idle_balance()
From: Steven Rostedt @ 2013-02-15  6:13 UTC
  To: LKML
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Paul Turner, Frederic Weisbecker, Andrew Morton, Mike Galbraith,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

I've been working on cleaning up the scheduler a little and I moved the
call to idle_balance() from directly in the scheduler proper into the
idle class. Benchmarks (well hackbench) improved slightly as I did this.
I was adding some more tweaks and running perf stat on the results when
I made a mistake and noticed a drastic change.

My runs looked something like this on my i7 (4 cores, 4 hyperthreads):

[root@bxtest ~]# perf stat -a -r 100  /work/c/hackbench 500
Time: 16.354
Time: 25.299
Time: 20.621
Time: 19.457
Time: 14.484
Time: 7.615
Time: 35.346
Time: 29.366
Time: 18.474
Time: 14.492
Time: 5.660
Time: 25.955
Time: 9.363
Time: 34.834
Time: 18.736
Time: 30.895
Time: 33.827
Time: 11.237
Time: 17.031
Time: 18.615
Time: 29.222
Time: 14.298
Time: 35.798
Time: 7.109
Time: 16.437
Time: 18.782
Time: 4.923
Time: 10.595
Time: 16.685
Time: 9.000
Time: 18.686
Time: 21.355
Time: 10.280
Time: 21.159
Time: 30.955
Time: 15.496
Time: 6.452
Time: 19.625
Time: 20.656
Time: 19.679
Time: 12.484
Time: 31.189
Time: 19.136
Time: 20.763
Time: 11.415
Time: 15.652
Time: 23.935
Time: 28.225
Time: 9.930
Time: 11.658
[...]

My changes made the average improve by a second or two. The
output from perf stat looked like this:

 Performance counter stats for '/work/c/hackbench 500' (100 runs):

     199820.045583 task-clock                #    8.016 CPUs utilized            ( +-  5.29% ) [100.00%]
         3,594,264 context-switches          #    0.018 M/sec                    ( +-  5.94% ) [100.00%]
           352,240 cpu-migrations            #    0.002 M/sec                    ( +-  3.31% ) [100.00%]
         1,006,732 page-faults               #    0.005 M/sec                    ( +-  0.56% )
   293,801,912,874 cycles                    #    1.470 GHz                      ( +-  4.20% ) [100.00%]
   261,808,125,109 stalled-cycles-frontend   #   89.11% frontend cycles idle     ( +-  4.38% ) [100.00%]
   <not supported> stalled-cycles-backend  
   135,521,344,089 instructions              #    0.46  insns per cycle        
                                             #    1.93  stalled cycles per insn  ( +-  4.37% ) [100.00%]
    26,198,116,586 branches                  #  131.109 M/sec                    ( +-  4.59% ) [100.00%]
       115,326,812 branch-misses             #    0.44% of all branches          ( +-  4.12% )

      24.929136087 seconds time elapsed                                          ( +-  5.31% )

Again, my patches made slight improvements, bringing it down to 22 or 21 seconds at best.

But then when I made a small tweak, it looked like this:

[root@bxtest ~]# perf stat -a -r 100  /work/c/hackbench 500
Time: 5.820
Time: 28.815
Time: 5.032
Time: 17.151
Time: 8.347
Time: 5.142
Time: 5.138
Time: 18.695
Time: 5.099
Time: 4.994
Time: 5.016
Time: 5.076
Time: 5.049
Time: 21.453
Time: 5.241
Time: 10.498
Time: 5.011
Time: 6.142
Time: 4.953
Time: 5.145
Time: 5.004
Time: 14.848
Time: 5.846
Time: 5.076
Time: 5.826
Time: 5.108
Time: 5.122
Time: 5.254
Time: 5.309
Time: 5.018
Time: 7.561
Time: 5.176
Time: 21.142
Time: 5.063
Time: 5.235
Time: 6.535
Time: 4.993
Time: 5.219
Time: 5.070
Time: 5.232
Time: 5.029
Time: 5.091
Time: 6.092
Time: 5.020
[...]

 Performance counter stats for '/work/c/hackbench 500' (100 runs):

      98258.962617 task-clock                #    7.998 CPUs utilized            ( +- 12.12% ) [100.00%]
         2,572,651 context-switches          #    0.026 M/sec                    ( +-  9.35% ) [100.00%]
           224,004 cpu-migrations            #    0.002 M/sec                    ( +-  5.01% ) [100.00%]
           913,813 page-faults               #    0.009 M/sec                    ( +-  0.71% )
   215,927,081,108 cycles                    #    2.198 GHz                      ( +-  5.48% ) [100.00%]
   189,246,626,321 stalled-cycles-frontend   #   87.64% frontend cycles idle     ( +-  6.07% ) [100.00%]
   <not supported> stalled-cycles-backend  
   102,965,954,824 instructions              #    0.48  insns per cycle        
                                             #    1.84  stalled cycles per insn  ( +-  5.40% ) [100.00%]
    19,280,914,558 branches                  #  196.226 M/sec                    ( +-  5.89% ) [100.00%]
        87,284,617 branch-misses             #    0.45% of all branches          ( +-  5.06% )

      12.285025160 seconds time elapsed                                          ( +- 12.14% )

And it consistently looked like that. I thought to myself, geeze! That
tweak made one hell of an improvement. But that tweak should not have, as
I just moved some code around. Things were only being called in
different places.

Looking at my change, I discovered my *bug*, which in this case
happened to be a true feature. It prevented idle_balance() from ever
being called.

This is a 50% improvement, on a benchmark that stresses the scheduler!
OK, I know that hackbench isn't a real-world benchmark, but this got me
thinking. I started looking into the history of idle_balance() and
discovered that it has existed since the start of git history (2005), and
is probably older (I didn't bother checking other historical archives,
although I did find this: http://lwn.net/Articles/109371/ ). That was a
time when SMP processors were just becoming affordable to the public. It's
when I first bought my own. But those were small boxes, nothing large. 8
CPUs was still considered huge back then (for us mere mortals).

idle_balance() embodies the notion that when a CPU is about to go idle, it
should snoop around the other CPUs and pull over anything that might be
runnable there. But this pull actually hurts the task more than it helps,
as the task loses all of its cache. Just letting the normal tick-based load
balancing do its job saves these tasks from constantly having their cache
ripped out from underneath them.
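
For reference, the 3.8 code does roughly the following (a simplified
sketch of idle_balance() in kernel/sched/fair.c, with locking and
bookkeeping elided):

void idle_balance(int this_cpu, struct rq *this_rq)
{
	struct sched_domain *sd;
	int pulled_task = 0;

	this_rq->idle_stamp = this_rq->clock;

	/* the throttle: don't bother if we expect to be idle for
	 * less than one migration cost */
	if (this_rq->avg_idle < sysctl_sched_migration_cost)
		return;

	/* walk up the sched domains, trying to pull something
	 * runnable over from a busier CPU at each level */
	for_each_domain(this_cpu, sd) {
		int balance = 1;

		if (!(sd->flags & SD_BALANCE_NEWIDLE))
			continue;

		pulled_task = load_balance(this_cpu, this_rq, sd,
					   CPU_NEWLY_IDLE, &balance);
		if (pulled_task) {
			this_rq->idle_stamp = 0;
			break;
		}
	}
}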

With idle_balance():

perf stat -r 10 -e cache-misses /work/c/hackbench 500

 Performance counter stats for '/work/c/hackbench 500' (10 runs):

       720,120,346 cache-misses                                                  ( +-  9.87% )

      34.445262454 seconds time elapsed                                          ( +- 32.55% )

perf stat -r 10 -a -e sched:sched_migrate_task -a /work/c/hackbench 500 

 Performance counter stats for '/work/c/hackbench 500' (10 runs):

           306,398 sched:sched_migrate_task                                      ( +-  4.62% )

      18.376370212 seconds time elapsed                                          ( +- 14.15% )


When we remove idle_balance():

perf stat -r 10 -e cache-misses /work/c/hackbench 500

 Performance counter stats for '/work/c/hackbench 500' (10 runs):

       550,392,064 cache-misses                                                  ( +-  4.89% )

      12.836740930 seconds time elapsed                                          ( +- 23.53% )

perf stat -r 10 -a -e sched:sched_migrate_task -a /work/c/hackbench 500 

 Performance counter stats for '/work/c/hackbench 500' (10 runs):

           219,725 sched:sched_migrate_task                                      ( +-  2.83% )

       8.019037539 seconds time elapsed                                          ( +-  6.90% )

(cut down to just 10 runs to save time)

The cache misses dropped by ~23% and migrations dropped by ~28%. I
really believe that idle_balance() hurts performance, and not just
for something like hackbench: the aggressive migration that
idle_balance() causes takes a large toll on a process's cache.

Think about it some more: just because we go idle isn't enough reason to
pull a runnable task over. CPUs go idle all the time, and tasks are woken
up all the time. There's no reason we can't just wait for the sched
tick to decide it's time to do a bit of balancing. Sure, it would be nice
if the idle CPU did the work. But I think that frame of mind was an
incorrect notion from back in the early 2000s that does not apply to
today's hardware, or perhaps it doesn't apply to the (relatively) new
CFS scheduler. If you want aggressive scheduling, make the task rt, and
it will do aggressive scheduling.
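
(For contrast, the tick-driven path that would remain is roughly this,
simplified from the 3.8 code; the softirq handler then runs
rebalance_domains():)

/* called from scheduler_tick() on every CPU */
void trigger_load_balance(struct rq *rq, int cpu)
{
	/* periodic balancing only fires once this runqueue's
	 * next_balance interval has expired */
	if (time_after_eq(jiffies, rq->next_balance))
		raise_softirq(SCHED_SOFTIRQ);
}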

But anyway, please try it yourself. It's a really simple patch. This
isn't the final patch; if this proves to be as big a win as
hackbench shows, the complete removal of idle_balance() would be in order.

Who knows, maybe I'm missing something and this is just a fluke with
hackbench. I'm Cc'ing the gurus of the scheduler. Maybe they can show
me why idle_balance() is correct.

Go forth and test!

-- Steve

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1dff78a..a9317b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2927,9 +2927,6 @@ need_resched:
 
 	pre_schedule(rq, prev);
 
-	if (unlikely(!rq->nr_running))
-		idle_balance(cpu, rq);
-
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
 	clear_tsk_need_resched(prev);




* Re: [RFC] sched: The removal of idle_balance()
From: Mike Galbraith @ 2013-02-15  7:26 UTC
  To: Steven Rostedt
  Cc: LKML, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

On Fri, 2013-02-15 at 01:13 -0500, Steven Rostedt wrote:

> Think about it some more: just because we go idle isn't enough reason to
> pull a runnable task over. CPUs go idle all the time, and tasks are woken
> up all the time. There's no reason we can't just wait for the sched
> tick to decide it's time to do a bit of balancing. Sure, it would be nice
> if the idle CPU did the work. But I think that frame of mind was an
> incorrect notion from back in the early 2000s that does not apply to
> today's hardware, or perhaps it doesn't apply to the (relatively) new
> CFS scheduler. If you want aggressive scheduling, make the task rt, and
> it will do aggressive scheduling.

(the throttle is supposed to keep idle_balance() from doing severe
damage; that may want a peek/tweak)

Hackbench spreads itself with FORK/EXEC balancing; how does, say, a kbuild
do with no idle_balance()?

-Mike



* Re: [RFC] sched: The removal of idle_balance()
From: Joonsoo Kim @ 2013-02-15  7:45 UTC
  To: Steven Rostedt
  Cc: LKML, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Mike Galbraith, Arnaldo Carvalho de Melo, Clark Williams,
	Andrew Theurer

Hello, Steven.

On Fri, Feb 15, 2013 at 01:13:39AM -0500, Steven Rostedt wrote:

>  Performance counter stats for '/work/c/hackbench 500' (100 runs):
> 
>      199820.045583 task-clock                #    8.016 CPUs utilized            ( +-  5.29% ) [100.00%]
>          3,594,264 context-switches          #    0.018 M/sec                    ( +-  5.94% ) [100.00%]
>            352,240 cpu-migrations            #    0.002 M/sec                    ( +-  3.31% ) [100.00%]
>          1,006,732 page-faults               #    0.005 M/sec                    ( +-  0.56% )
>    293,801,912,874 cycles                    #    1.470 GHz                      ( +-  4.20% ) [100.00%]
>    261,808,125,109 stalled-cycles-frontend   #   89.11% frontend cycles idle     ( +-  4.38% ) [100.00%]
>    <not supported> stalled-cycles-backend  
>    135,521,344,089 instructions              #    0.46  insns per cycle        
>                                              #    1.93  stalled cycles per insn  ( +-  4.37% ) [100.00%]
>     26,198,116,586 branches                  #  131.109 M/sec                    ( +-  4.59% ) [100.00%]
>        115,326,812 branch-misses             #    0.44% of all branches          ( +-  4.12% )
> 
>       24.929136087 seconds time elapsed                                          ( +-  5.31% )
> 
>  Performance counter stats for '/work/c/hackbench 500' (100 runs):
> 
>       98258.962617 task-clock                #    7.998 CPUs utilized            ( +- 12.12% ) [100.00%]
>          2,572,651 context-switches          #    0.026 M/sec                    ( +-  9.35% ) [100.00%]
>            224,004 cpu-migrations            #    0.002 M/sec                    ( +-  5.01% ) [100.00%]
>            913,813 page-faults               #    0.009 M/sec                    ( +-  0.71% )
>    215,927,081,108 cycles                    #    2.198 GHz                      ( +-  5.48% ) [100.00%]
>    189,246,626,321 stalled-cycles-frontend   #   87.64% frontend cycles idle     ( +-  6.07% ) [100.00%]
>    <not supported> stalled-cycles-backend  
>    102,965,954,824 instructions              #    0.48  insns per cycle        
>                                              #    1.84  stalled cycles per insn  ( +-  5.40% ) [100.00%]
>     19,280,914,558 branches                  #  196.226 M/sec                    ( +-  5.89% ) [100.00%]
>         87,284,617 branch-misses             #    0.45% of all branches          ( +-  5.06% )
> 
>       12.285025160 seconds time elapsed                                          ( +- 12.14% )

IMHO, the cycles numbers are somewhat strange.
Why is one 1.470 GHz and the other 2.198 GHz?

In my quick test, I get the results below.

- Before Patch
 Performance counter stats for 'perf bench sched messaging -g 300' (10 runs):

      40847.488740 task-clock                #    3.232 CPUs utilized            ( +-  1.24% )
           511,070 context-switches          #    0.013 M/sec                    ( +-  7.28% )
           117,882 cpu-migrations            #    0.003 M/sec                    ( +-  5.14% )
         1,360,501 page-faults               #    0.033 M/sec                    ( +-  0.12% )
   118,534,394,180 cycles                    #    2.902 GHz                      ( +-  1.23% ) [50.70%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
    46,217,340,271 instructions              #    0.39  insns per cycle          ( +-  0.56% ) [76.93%]
     8,592,447,548 branches                  #  210.354 M/sec                    ( +-  0.75% ) [75.50%]
       273,367,481 branch-misses             #    3.18% of all branches          ( +-  0.26% ) [75.49%]

      12.639049245 seconds time elapsed                                          ( +-  2.29% )

- After Patch
 Performance counter stats for 'perf bench sched messaging -g 300' (10 runs):

      42053.008632 task-clock                #    2.932 CPUs utilized            ( +-  0.91% )
           672,759 context-switches          #    0.016 M/sec                    ( +-  2.76% )
            83,374 cpu-migrations            #    0.002 M/sec                    ( +-  4.46% )
         1,362,900 page-faults               #    0.032 M/sec                    ( +-  0.20% )
   121,457,601,848 cycles                    #    2.888 GHz                      ( +-  0.93% ) [50.75%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
    47,854,828,552 instructions              #    0.39  insns per cycle          ( +-  0.36% ) [77.09%]
     8,981,553,714 branches                  #  213.577 M/sec                    ( +-  0.42% ) [75.41%]
       274,229,438 branch-misses             #    3.05% of all branches          ( +-  0.20% ) [75.44%]

      14.340330678 seconds time elapsed                                          ( +-  1.79% )

Thanks.


* Re: [RFC] sched: The removal of idle_balance()
From: Peter Zijlstra @ 2013-02-15 12:07 UTC
  To: Mike Galbraith
  Cc: Steven Rostedt, LKML, Linus Torvalds, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

On Fri, 2013-02-15 at 08:26 +0100, Mike Galbraith wrote:
> 
> (the throttle is supposed to keep idle_balance() from doing severe
> damage; that may want a peek/tweak)

Right, as it stands idle_balance() can do a lot of work and if the avg
idle time is less than the time we spend looking for a suitable task we
lose.

I've wanted to make this smarter by having the cpufreq/cpuidle avg idle
time guesstimator in the scheduler core so we actually know how long we
expect to be idle, and couple that with a cache refresh cost per sched
domain (something we used to have pre-2.6.21 or so) so we can auto-limit
the domain traversal for idle_balance().
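
Something like this, say (a purely hypothetical sketch;
avg_idle_estimate() and sd->cache_refresh_cost are invented names):

	for_each_domain(this_cpu, sd) {
		/* stop the new-idle walk once the expected idle time
		 * can't pay for refilling this domain's worth of cache */
		if (avg_idle_estimate(this_cpu) < sd->cache_refresh_cost)
			break;

		if (sd->flags & SD_BALANCE_NEWIDLE)
			pulled_task = load_balance(this_cpu, this_rq, sd,
						   CPU_NEWLY_IDLE, &balance);
		if (pulled_task)
			break;
	}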

So far that's all fantasy though..

Related: I wanted to use the idle time guesstimate to 'optimize' the idle
loop. Currently that stuff is stupid expensive and pokes at timer
hardware etc.; if we know we won't be idle longer than it takes to poke
at the timer hardware, don't go into nohz mode, etc.

Anyway, none of that couldn't be done if it lived at post_schedule();
just a tangent.



* Re: [RFC] sched: The removal of idle_balance()
From: Peter Zijlstra @ 2013-02-15 12:21 UTC
  To: Mike Galbraith
  Cc: Steven Rostedt, LKML, Linus Torvalds, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer


On Fri, 2013-02-15 at 08:26 +0100, Mike Galbraith wrote:
> 
> (the throttle is supposed to keep idle_balance() from doing severe
> damage; that may want a peek/tweak)

Right, as it stands idle_balance() can do a lot of work and if the avg
idle time is less than the time we spend looking for a suitable task we
lose.

I've wanted to make this smarter by having the cpufreq/cpuidle avg idle
time guesstimator in the scheduler core so we actually know how long we
expect to be idle, and couple that with a cache refresh cost per sched
domain (something we used to have pre-2.6.21 or so) so we can auto-limit
the domain traversal for idle_balance().

So far that's all fantasy though..

Related: I wanted to use the idle time guesstimate to 'optimize' the idle
loop. Currently that stuff is stupid expensive and pokes at timer
hardware etc.; if we know we won't be idle longer than it takes to poke
at the timer hardware, don't go into nohz mode, etc.

Anyway, all of that is independent of the exact location where we call
that stuff.



* Re: [RFC] sched: The removal of idle_balance()
From: Mike Galbraith @ 2013-02-15 12:32 UTC
  To: Peter Zijlstra
  Cc: Steven Rostedt, LKML, Linus Torvalds, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

On Fri, 2013-02-15 at 13:21 +0100, Peter Zijlstra wrote: 
> On Fri, 2013-02-15 at 08:26 +0100, Mike Galbraith wrote:
> > 
> > (the throttle is supposed to keep idle_balance() from doing severe
> > damage; that may want a peek/tweak)
> 
> Right, as it stands idle_balance() can do a lot of work and if the avg
> idle time is less than the time we spend looking for a suitable task we
> lose.
> 
> I've wanted to make this smarter by having the cpufreq/cpuidle avg idle
> time guesstimator in the scheduler core so we actually know how long we
> expect to be idle, and couple that with a cache refresh cost per sched
> domain (something we used to have pre-2.6.21 or so) so we can auto-limit
> the domain traversal for idle_balance().
> 
> So far that's all fantasy though..
> 
> Related: I wanted to use the idle time guesstimate to 'optimize' the idle
> loop. Currently that stuff is stupid expensive and pokes at timer
> hardware etc.; if we know we won't be idle longer than it takes to poke
> at the timer hardware, don't go into nohz mode, etc.

Yup.  My trees have nohz throttled too; it's too expensive for fast
switchers scheduling cross-core.

-Mike



* Re: [RFC] sched: The removal of idle_balance()
From: Steven Rostedt @ 2013-02-15 15:05 UTC
  To: Joonsoo Kim
  Cc: LKML, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Mike Galbraith, Arnaldo Carvalho de Melo, Clark Williams,
	Andrew Theurer

On Fri, 2013-02-15 at 16:45 +0900, Joonsoo Kim wrote:
> Hello, Steven.

> - Before Patch
> Performance counter stats for 'perf bench sched messaging -g 300' (10 runs):
> 
>       40847.488740 task-clock                #    3.232 CPUs utilized            ( +-  1.24% )
>            511,070 context-switches          #    0.013 M/sec                    ( +-  7.28% )
>            117,882 cpu-migrations            #    0.003 M/sec                    ( +-  5.14% )
>          1,360,501 page-faults               #    0.033 M/sec                    ( +-  0.12% )
>    118,534,394,180 cycles                    #    2.902 GHz                      ( +-  1.23% ) [50.70%]
>    <not supported> stalled-cycles-frontend 
>    <not supported> stalled-cycles-backend  
>     46,217,340,271 instructions              #    0.39  insns per cycle          ( +-  0.56% ) [76.93%]
>      8,592,447,548 branches                  #  210.354 M/sec                    ( +-  0.75% ) [75.50%]
>        273,367,481 branch-misses             #    3.18% of all branches          ( +-  0.26% ) [75.49%]
> 
>       12.639049245 seconds time elapsed                                          ( +-  2.29% )
> 
> - After Patch
>  Performance counter stats for 'perf bench sched messaging -g 300' (10 runs):
> 
>       42053.008632 task-clock                #    2.932 CPUs utilized            ( +-  0.91% )
>            672,759 context-switches          #    0.016 M/sec                    ( +-  2.76% )
>             83,374 cpu-migrations            #    0.002 M/sec                    ( +-  4.46% )
>          1,362,900 page-faults               #    0.032 M/sec                    ( +-  0.20% )
>    121,457,601,848 cycles                    #    2.888 GHz                      ( +-  0.93% ) [50.75%]
>    <not supported> stalled-cycles-frontend 
>    <not supported> stalled-cycles-backend  
>     47,854,828,552 instructions              #    0.39  insns per cycle          ( +-  0.36% ) [77.09%]
>      8,981,553,714 branches                  #  213.577 M/sec                    ( +-  0.42% ) [75.41%]
>        274,229,438 branch-misses             #    3.05% of all branches          ( +-  0.20% ) [75.44%]
> 
>       14.340330678 seconds time elapsed                                          ( +-  1.79% )
> 

Interesting that perf bench gives me a little better performance with
idle_balance() than without too. But hackbench still shows a huge
performance gain without idle_balance(). The funny part about that is
that perf bench sched messaging is based on hackbench??

I would really like to know why hackbench gets a 50% performance boost
without idle balancing. Perhaps it is some kind of fluke :-/

-- Steve




* Re: [RFC] sched: The removal of idle_balance()
From: Steven Rostedt @ 2013-02-16 16:12 UTC
  To: Mike Galbraith
  Cc: LKML, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

On Fri, 2013-02-15 at 08:26 +0100, Mike Galbraith wrote:
> On Fri, 2013-02-15 at 01:13 -0500, Steven Rostedt wrote:
> 
> > Think about it some more: just because we go idle isn't enough reason to
> > pull a runnable task over. CPUs go idle all the time, and tasks are woken
> > up all the time. There's no reason we can't just wait for the sched
> > tick to decide it's time to do a bit of balancing. Sure, it would be nice
> > if the idle CPU did the work. But I think that frame of mind was an
> > incorrect notion from back in the early 2000s that does not apply to
> > today's hardware, or perhaps it doesn't apply to the (relatively) new
> > CFS scheduler. If you want aggressive scheduling, make the task rt, and
> > it will do aggressive scheduling.
> 
> (the throttle is supposed to keep idle_balance() from doing severe
> damage; that may want a peek/tweak)
> 
> Hackbench spreads itself with FORK/EXEC balancing; how does, say, a kbuild
> do with no idle_balance()?
> 

Interesting, I added this patch and it brought down my hackbench to the
same level as removing idle_balance(). Although, on initial tests, it
doesn't seem to help much else (compiles and such), it doesn't seem
to hurt things either.

The idea of this patch is that we do not want to run idle_balance() if
a task will wake up soon. It adds the heuristic that if the previous
task was in the TASK_UNINTERRUPTIBLE state, it will probably wake up in
the near future, because it is blocked on IO or even a mutex. Especially
if it is blocked on a mutex, it will likely wake up soon, thus the CPU
technically isn't quite idle. Avoiding the idle balance in this case
brings hackbench back down (by 50%) on my box.

Ideally, I would have liked to use rq->nr_uninterruptible, but that
counter is only meaningful as a sum over all CPUs, as it may be
incremented on one CPU but then decremented on another. Thus my
heuristic can only look at the task immediately going to sleep.
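
To illustrate (a hypothetical trace, not actual kernel code):

	/*
	 * CPU0: task T blocks in TASK_UNINTERRUPTIBLE
	 *           -> rq0->nr_uninterruptible++
	 * CPU1: try_to_wake_up(T) runs here
	 *           -> rq1->nr_uninterruptible--
	 *
	 * rq0's counter drifts up and rq1's drifts down forever;
	 * only the sum over all runqueues means anything.
	 */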

-- Steve

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1dff78a..886a9af 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2928,7 +2928,7 @@ need_resched:
 	pre_schedule(rq, prev);
 
 	if (unlikely(!rq->nr_running))
-		idle_balance(cpu, rq);
+		idle_balance(cpu, rq, prev);
 
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ed18c74..a29ea5e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5208,7 +5208,7 @@ out:
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-void idle_balance(int this_cpu, struct rq *this_rq)
+void idle_balance(int this_cpu, struct rq *this_rq, struct task_struct *prev)
 {
 	struct sched_domain *sd;
 	int pulled_task = 0;
@@ -5216,6 +5216,9 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 
 	this_rq->idle_stamp = this_rq->clock;
 
+	if (!(prev->state & TASK_UNINTERRUPTIBLE))
+		return;
+
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
 		return;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..f259070 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -876,11 +876,11 @@ extern const struct sched_class idle_sched_class;
 #ifdef CONFIG_SMP
 
 extern void trigger_load_balance(struct rq *rq, int cpu);
-extern void idle_balance(int this_cpu, struct rq *this_rq);
+extern void idle_balance(int this_cpu, struct rq *this_rq, struct task_struct *prev);
 
 #else	/* CONFIG_SMP */
 
-static inline void idle_balance(int cpu, struct rq *rq)
+static inline void idle_balance(int cpu, struct rq *rq, struct task_struct *prev)
 {
 }
 




* Re: [RFC] sched: The removal of idle_balance()
From: Mike Galbraith @ 2013-02-17  6:26 UTC
  To: Steven Rostedt
  Cc: LKML, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

On Sat, 2013-02-16 at 11:12 -0500, Steven Rostedt wrote:
> On Fri, 2013-02-15 at 08:26 +0100, Mike Galbraith wrote:
> > On Fri, 2013-02-15 at 01:13 -0500, Steven Rostedt wrote:
> > 
> > > Think about it some more: just because we go idle isn't enough reason to
> > > pull a runnable task over. CPUs go idle all the time, and tasks are woken
> > > up all the time. There's no reason we can't just wait for the sched
> > > tick to decide it's time to do a bit of balancing. Sure, it would be nice
> > > if the idle CPU did the work. But I think that frame of mind was an
> > > incorrect notion from back in the early 2000s that does not apply to
> > > today's hardware, or perhaps it doesn't apply to the (relatively) new
> > > CFS scheduler. If you want aggressive scheduling, make the task rt, and
> > > it will do aggressive scheduling.
> > 
> > (the throttle is supposed to keep idle_balance() from doing severe
> > damage; that may want a peek/tweak)
> > 
> > Hackbench spreads itself with FORK/EXEC balancing; how does, say, a kbuild
> > do with no idle_balance()?
> > 
> 
> Interesting, I added this patch and it brought down my hackbench to the
> same level as removing idle_balance().

The typo did its job well :)

Hrm, turning idle balancing off here does not help hackbench at all.
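
(For the record, I flip it via the sched_domain flags files; something
like this, assuming CONFIG_SCHED_DEBUG and SD_BALANCE_NEWIDLE being 0x02:

  for f in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; do
          echo $(( $(cat $f) & ~0x02 )) > $f
  done
)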

3.8.0-master

Q6600 +SD_BALANCE_NEWIDLE
 Performance counter stats for 'hackbench -l 500' (100 runs):

       5221.559519 task-clock                #    4.001 CPUs utilized            ( +-  0.26% ) [100.00%]
            129863 context-switches          #    0.025 M/sec                    ( +-  3.65% ) [100.00%]
              7576 cpu-migrations            #    0.001 M/sec                    ( +-  4.60% ) [100.00%]
             31095 page-faults               #    0.006 M/sec                    ( +-  0.39% )
       12258227539 cycles                    #    2.348 GHz                      ( +-  0.27% ) [49.91%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        5395089628 instructions              #    0.44  insns per cycle          ( +-  0.28% ) [74.99%]
        1012563262 branches                  #  193.920 M/sec                    ( +-  0.28% ) [75.08%]
          43217098 branch-misses             #    4.27% of all branches          ( +-  0.23% ) [75.01%]

       1.305024749 seconds time elapsed                                          ( +-  0.26% )

Q6600 -SD_BALANCE_NEWIDLE

 Performance counter stats for 'hackbench -l 500' (100 runs):

       5356.549500 task-clock                #    4.001 CPUs utilized            ( +-  0.37% ) [100.00%]
            153093 context-switches          #    0.029 M/sec                    ( +-  3.20% ) [100.00%]
              6887 cpu-migrations            #    0.001 M/sec                    ( +-  4.65% ) [100.00%]
             31248 page-faults               #    0.006 M/sec                    ( +-  0.48% )
       12141992004 cycles                    #    2.267 GHz                      ( +-  0.30% ) [49.90%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        5426436261 instructions              #    0.45  insns per cycle          ( +-  0.22% ) [75.00%]
        1016967893 branches                  #  189.855 M/sec                    ( +-  0.22% ) [75.09%]
          43207200 branch-misses             #    4.25% of all branches          ( +-  0.13% ) [75.01%]

       1.338768889 seconds time elapsed                                          ( +-  0.37% )

E5620+HT +SD_BALANCE_NEWIDLE
 Performance counter stats for 'hackbench -l 500' (100 runs):

       3884.162557 task-clock                #    7.997 CPUs utilized            ( +-  0.14% ) [100.00%]
             97366 context-switches          #    0.025 M/sec                    ( +-  1.68% ) [100.00%]
             12383 CPU-migrations            #    0.003 M/sec                    ( +-  3.29% ) [100.00%]
             30749 page-faults               #    0.008 M/sec                    ( +-  0.13% )
        9377671582 cycles                    #    2.414 GHz                      ( +-  0.11% ) [83.04%]
        6973792586 stalled-cycles-frontend   #   74.37% frontend cycles idle     ( +-  0.15% ) [83.27%]
        2529338603 stalled-cycles-backend    #   26.97% backend  cycles idle     ( +-  0.32% ) [66.93%]
        5214109586 instructions              #    0.56  insns per cycle        
                                             #    1.34  stalled cycles per insn  ( +-  0.07% ) [83.50%]
         984681811 branches                  #  253.512 M/sec                    ( +-  0.07% ) [83.56%]
           7050196 branch-misses             #    0.72% of all branches          ( +-  0.49% ) [83.24%]

       0.485726223 seconds time elapsed                                          ( +-  0.14% )

E5620+HT -SD_BALANCE_NEWIDLE
 Performance counter stats for 'hackbench -l 500' (100 runs):

       4124.204725 task-clock                #    7.996 CPUs utilized            ( +-  0.20% ) [100.00%]
            151292 context-switches          #    0.037 M/sec                    ( +-  1.49% ) [100.00%]
             12504 CPU-migrations            #    0.003 M/sec                    ( +-  2.84% ) [100.00%]
             30685 page-faults               #    0.007 M/sec                    ( +-  0.07% )
        9566938118 cycles                    #    2.320 GHz                      ( +-  0.16% ) [83.09%]
        7483411444 stalled-cycles-frontend   #   78.22% frontend cycles idle     ( +-  0.22% ) [83.21%]
        2848475061 stalled-cycles-backend    #   29.77% backend  cycles idle     ( +-  0.38% ) [66.82%]
        5360541017 instructions              #    0.56  insns per cycle        
                                             #    1.40  stalled cycles per insn  ( +-  0.11% ) [83.48%]
        1011027557 branches                  #  245.145 M/sec                    ( +-  0.11% ) [83.59%]
           7964016 branch-misses             #    0.79% of all branches          ( +-  0.55% ) [83.32%]

       0.515779138 seconds time elapsed                                          ( +-  0.20% )

	-Mike



* Re: [RFC] sched: The removal of idle_balance()
From: Mike Galbraith @ 2013-02-17  6:26 UTC
  To: Steven Rostedt
  Cc: LKML, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

On Fri, 2013-02-15 at 01:13 -0500, Steven Rostedt wrote: 
> I've been working on cleaning up the scheduler a little and I moved the
> call to idle_balance() from directly in the scheduler proper into the
> idle class. Benchmarks (well hackbench) improved slightly as I did this.
> I was adding some more tweaks and running perf stat on the results when
> I made a mistake and noticed a drastic change.
> 
> My runs looked something like this on my i7 (4 cores, 4 hyperthreads):
> 

> 293,801,912,874 cycles                    #    1.470 GHz                      ( +-  4.20% ) [100.00%]

> 215,927,081,108 cycles                    #    2.198 GHz                      ( +-  5.48% ) [100.00%]

Hm.  Maybe set the governor to performance?
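
E.g. something like this (assuming the usual cpufreq sysfs layout; paths
can vary per distro):

  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
          echo performance > $g
  done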

-Mike



* Re: [RFC] sched: The removal of idle_balance()
From: Mike Galbraith @ 2013-02-17  7:14 UTC
  To: Steven Rostedt
  Cc: LKML, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

On Sun, 2013-02-17 at 07:26 +0100, Mike Galbraith wrote: 
> On Sat, 2013-02-16 at 11:12 -0500, Steven Rostedt wrote:
> > On Fri, 2013-02-15 at 08:26 +0100, Mike Galbraith wrote:
> > > On Fri, 2013-02-15 at 01:13 -0500, Steven Rostedt wrote:
> > > 
> > > > Think about it some more: just because we go idle isn't enough reason to
> > > > pull a runnable task over. CPUs go idle all the time, and tasks are woken
> > > > up all the time. There's no reason we can't just wait for the sched
> > > > tick to decide it's time to do a bit of balancing. Sure, it would be nice
> > > > if the idle CPU did the work. But I think that frame of mind was an
> > > > incorrect notion from back in the early 2000s that does not apply to
> > > > today's hardware, or perhaps it doesn't apply to the (relatively) new
> > > > CFS scheduler. If you want aggressive scheduling, make the task rt, and
> > > > it will do aggressive scheduling.
> > > 
> > > (the throttle is supposed to keep idle_balance() from doing severe
> > > damage; that may want a peek/tweak)
> > > 
> > > Hackbench spreads itself with FORK/EXEC balancing; how does, say, a kbuild
> > > do with no idle_balance()?
> > > 
> > 
> > Interesting, I added this patch and it brought down my hackbench to the
> > same level as removing idle_balance().
> 
> The typo did it's job well :)
> 
> Hrm, turning idle balancing off here does not help hackbench at all.

(And puts a dent in x264 ultrafast)

+SD_BALANCE_NEWIDLE
encoded 600 frames, 425.04 fps, 22132.71 kb/s
encoded 600 frames, 416.07 fps, 22132.71 kb/s
encoded 600 frames, 417.49 fps, 22132.71 kb/s
encoded 600 frames, 420.65 fps, 22132.71 kb/s
encoded 600 frames, 425.55 fps, 22132.71 kb/s
encoded 600 frames, 425.58 fps, 22132.71 kb/s
encoded 600 frames, 426.18 fps, 22132.71 kb/s
encoded 600 frames, 424.21 fps, 22132.71 kb/s
encoded 600 frames, 422.20 fps, 22132.71 kb/s
encoded 600 frames, 423.15 fps, 22132.71 kb/s

-SD_BALANCE_NEWIDLE
encoded 600 frames, 378.52 fps, 22132.71 kb/s
encoded 600 frames, 378.75 fps, 22132.71 kb/s
encoded 600 frames, 378.20 fps, 22132.71 kb/s
encoded 600 frames, 372.54 fps, 22132.71 kb/s
encoded 600 frames, 366.69 fps, 22132.71 kb/s
encoded 600 frames, 378.46 fps, 22132.71 kb/s
encoded 600 frames, 379.89 fps, 22132.71 kb/s
encoded 600 frames, 382.25 fps, 22132.71 kb/s
encoded 600 frames, 384.10 fps, 22132.71 kb/s
encoded 600 frames, 375.24 fps, 22132.71 kb/s





* Re: [RFC] sched: The removal of idle_balance()
From: Steven Rostedt @ 2013-02-17 21:54 UTC
  To: Mike Galbraith
  Cc: LKML, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

On Sun, 2013-02-17 at 08:14 +0100, Mike Galbraith wrote:

> (And puts a dent in x264 ultrafast)
> 
> +SD_BALANCE_NEWIDLE
> encoded 600 frames, 425.04 fps, 22132.71 kb/s
> encoded 600 frames, 416.07 fps, 22132.71 kb/s
> encoded 600 frames, 417.49 fps, 22132.71 kb/s
> encoded 600 frames, 420.65 fps, 22132.71 kb/s
> encoded 600 frames, 425.55 fps, 22132.71 kb/s
> encoded 600 frames, 425.58 fps, 22132.71 kb/s
> encoded 600 frames, 426.18 fps, 22132.71 kb/s
> encoded 600 frames, 424.21 fps, 22132.71 kb/s
> encoded 600 frames, 422.20 fps, 22132.71 kb/s
> encoded 600 frames, 423.15 fps, 22132.71 kb/s
> 
> -SD_BALANCE_NEWIDLE
> encoded 600 frames, 378.52 fps, 22132.71 kb/s
> encoded 600 frames, 378.75 fps, 22132.71 kb/s
> encoded 600 frames, 378.20 fps, 22132.71 kb/s
> encoded 600 frames, 372.54 fps, 22132.71 kb/s
> encoded 600 frames, 366.69 fps, 22132.71 kb/s
> encoded 600 frames, 378.46 fps, 22132.71 kb/s
> encoded 600 frames, 379.89 fps, 22132.71 kb/s
> encoded 600 frames, 382.25 fps, 22132.71 kb/s
> encoded 600 frames, 384.10 fps, 22132.71 kb/s
> encoded 600 frames, 375.24 fps, 22132.71 kb/s

What about my last patch? The one that avoids idle_balance() if the
previous task was in the TASK_UNINTERRUPTIBLE state. That one gave the
same performance increase that removing idle_balance() did on my box.

-- Steve





* Re: [RFC] sched: The removal of idle_balance()
From: Mike Galbraith @ 2013-02-18  3:42 UTC
  To: Steven Rostedt
  Cc: LKML, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

On Sun, 2013-02-17 at 16:54 -0500, Steven Rostedt wrote:
> On Sun, 2013-02-17 at 08:14 +0100, Mike Galbraith wrote:
> 
> > (And puts a dent in x264 ultrafast) 
 
> What about my last patch? The one that avoids idle_balance() if the
> previous task was in the TASK_UNINTERRUPTIBLE state. That one gave the
> same performance increase that removing idle_balance() did on my box.

I didn't try it, figuring it was pretty much the same as turning it off,
but I just did.  The patch (minus the typo) has no effect on either x264
or hackbench (it surely will for -rt, but rt tasks here aren't sent to
burn in rt hell).

-Mike



* Re: [RFC] sched: The removal of idle_balance()
From: Srikar Dronamraju @ 2013-02-18  8:13 UTC
  To: Steven Rostedt
  Cc: LKML, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Mike Galbraith, Arnaldo Carvalho de Melo, Clark Williams,
	Andrew Theurer

> The cache misses dropped by ~23% and migrations dropped by ~28%. I
> really believe that idle_balance() hurts performance, and not just
> for something like hackbench: the aggressive migration that
> idle_balance() causes takes a large toll on a process's cache.
> 
> Think about it some more: just because we go idle isn't enough reason to
> pull a runnable task over. CPUs go idle all the time, and tasks are woken
> up all the time. There's no reason we can't just wait for the sched
> tick to decide it's time to do a bit of balancing. Sure, it would be nice
> if the idle CPU did the work. But I think that frame of mind was an
> incorrect notion from back in the early 2000s that does not apply to
> today's hardware, or perhaps it doesn't apply to the (relatively) new
> CFS scheduler. If you want aggressive scheduling, make the task rt, and
> it will do aggressive scheduling.
> 

How is it that the normal tick-based load balancing gets it right while
idle_balance() gets it wrong?  Can it be because of the different
cpu_idle_type?

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [RFC] sched: The removal of idle_balance()
From: Steven Rostedt @ 2013-02-18 15:23 UTC
  To: Mike Galbraith
  Cc: LKML, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

On Mon, 2013-02-18 at 04:42 +0100, Mike Galbraith wrote:
> On Sun, 2013-02-17 at 16:54 -0500, Steven Rostedt wrote:
> > On Sun, 2013-02-17 at 08:14 +0100, Mike Galbraith wrote:
> > 
> > > (And puts a dent in x264 ultrafast) 
>  
> > What about my last patch? The one that avoids idle_balance() if the
> > previous task was in the TASK_UNINTERRUPTIBLE state. That one gave the
> > same performance increase that removing idle_balance() did on my box.
> 
> I didn't try it, figuring it was pretty much the same as turning it off,
> but I just did.  The patch (minus the typo) has no effect on either x264
> or hackbench (it surely will for -rt, but rt tasks here aren't sent to
> burn in rt hell).

So it had no effect on your tests? That's actually good: if it has a
positive effect on some workloads and no effect on others, that's still
a net win.

-- Steve




* Re: [RFC] sched: The removal of idle_balance()
From: Steven Rostedt @ 2013-02-18 15:25 UTC
  To: Srikar Dronamraju
  Cc: LKML, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Mike Galbraith, Arnaldo Carvalho de Melo, Clark Williams,
	Andrew Theurer

On Mon, 2013-02-18 at 13:43 +0530, Srikar Dronamraju wrote:
> > The cache misses dropped by ~23% and migrations dropped by ~28%. I
> > really believe that idle_balance() hurts performance, and not just
> > for something like hackbench: the aggressive migration that
> > idle_balance() causes takes a large toll on a process's cache.
> > 
> > Think about it some more: just because we go idle isn't enough reason to
> > pull a runnable task over. CPUs go idle all the time, and tasks are woken
> > up all the time. There's no reason we can't just wait for the sched
> > tick to decide it's time to do a bit of balancing. Sure, it would be nice
> > if the idle CPU did the work. But I think that frame of mind was an
> > incorrect notion from back in the early 2000s that does not apply to
> > today's hardware, or perhaps it doesn't apply to the (relatively) new
> > CFS scheduler. If you want aggressive scheduling, make the task rt, and
> > it will do aggressive scheduling.
> > 
> 
> > How is it that the normal tick-based load balancing gets it right while
> > idle_balance() gets it wrong?  Can it be because of the different
> > cpu_idle_type?
> 

Currently it looks to be a fluke on my box, as this performance increase
can't be duplicated elsewhere (yet). But from looking at my traces, it
seems that my box does the idle balance at just the wrong time, which
causes these issues.
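
(If anyone wants to stare at the same thing on their box, something like
this should capture the relevant events; adjust to taste:

  trace-cmd record -e sched_migrate_task -e sched_switch /work/c/hackbench 500
  trace-cmd report | less
)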

-- Steve




* Re: [RFC] sched: The removal of idle_balance()
From: Mike Galbraith @ 2013-02-18 17:22 UTC
  To: Steven Rostedt
  Cc: LKML, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Paul Turner, Frederic Weisbecker, Andrew Morton,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

On Mon, 2013-02-18 at 10:23 -0500, Steven Rostedt wrote: 
> On Mon, 2013-02-18 at 04:42 +0100, Mike Galbraith wrote:
> > On Sun, 2013-02-17 at 16:54 -0500, Steven Rostedt wrote:
> > > On Sun, 2013-02-17 at 08:14 +0100, Mike Galbraith wrote:
> > > 
> > > > (And puts a dent in x264 ultrafast) 
> >  
> > > What about my last patch? The one that avoids idle_balance() if the
> > > previous task was in the TASK_UNINTERRUPTIBLE state. That one gave the
> > > same performance increase that removing idle_balance() did on my box.
> > 
> > I didn't try it, figuring it was pretty much the same as turning it off,
> > but I just did.  The patch (minus the typo) has no effect on either x264
> > or hackbench (it surely will for -rt, but rt tasks here aren't sent to
> > burn in rt hell).
> 
> So it had no effect on your tests? That's actually good: if it has a
> positive effect on some workloads and no effect on others, that's still
> a net win.

Yeah, for clarity: with the "!" removed, there was zero effect on either
hackbench or x264 ultrafast.

	-Mike



* Re: [RFC] sched: The removal of idle_balance()
From: Rakib Mullick @ 2013-02-19  4:13 UTC
  To: Steven Rostedt
  Cc: Srikar Dronamraju, LKML, Linus Torvalds, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Paul Turner,
	Frederic Weisbecker, Andrew Morton, Mike Galbraith,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

On Mon, Feb 18, 2013 at 9:25 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2013-02-18 at 13:43 +0530, Srikar Dronamraju wrote:
>> > The cache misses dropped by ~23% and migrations dropped by ~28%. I
>> > really believe that idle_balance() hurts performance, and not just
>> > for something like hackbench: the aggressive migration that
>> > idle_balance() causes takes a large toll on a process's cache.
>> >
>> > Think about it some more: just because we go idle isn't enough reason to
>> > pull a runnable task over. CPUs go idle all the time, and tasks are woken
>> > up all the time. There's no reason we can't just wait for the sched
>> > tick to decide it's time to do a bit of balancing. Sure, it would be nice
>> > if the idle CPU did the work. But I think that frame of mind was an
>> > incorrect notion from back in the early 2000s that does not apply to
>> > today's hardware, or perhaps it doesn't apply to the (relatively) new
>> > CFS scheduler. If you want aggressive scheduling, make the task rt, and
>> > it will do aggressive scheduling.
>> >
>>
>> How is it that the normal tick-based load balancing gets it right while
>> idle_balance() gets it wrong?  Can it be because of the different
>> cpu_idle_type?
>>
>
> Currently it looks to be a fluke on my box, as this performance increase
> can't be duplicated elsewhere (yet). But from looking at my traces, it
> seems that my box does the idle balance at just the wrong time, which
> causes these issues.
>
A default hackbench run creates 400 tasks (10 groups * 40 tasks), and on
an i7 system (4 cores, HT) idle_balance() shouldn't be in action, because
on an 8-CPU system we're assigning 400 tasks. If idle_balance() comes in,
that means we've done something wrong while distributing tasks among the
CPUs, which indicates a problem during fork/exec/wake balancing?

Thanks,
Rakib.


* Re: [RFC] sched: The removal of idle_balance()
From: Michael Wang @ 2013-02-19  7:29 UTC
  To: Rakib Mullick
  Cc: Steven Rostedt, Srikar Dronamraju, LKML, Linus Torvalds,
	Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Paul Turner,
	Frederic Weisbecker, Andrew Morton, Mike Galbraith,
	Arnaldo Carvalho de Melo, Clark Williams, Andrew Theurer

On 02/19/2013 12:13 PM, Rakib Mullick wrote:
> On Mon, Feb 18, 2013 at 9:25 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> On Mon, 2013-02-18 at 13:43 +0530, Srikar Dronamraju wrote:
>>>> The cache misses dropped by ~23% and migrations dropped by ~28%. I
>>>> really believe that idle_balance() hurts performance, and not just
>>>> for something like hackbench: the aggressive migration that
>>>> idle_balance() causes takes a large toll on a process's cache.
>>>>
>>>> Think about it some more: just because we go idle isn't enough reason to
>>>> pull a runnable task over. CPUs go idle all the time, and tasks are woken
>>>> up all the time. There's no reason we can't just wait for the sched
>>>> tick to decide it's time to do a bit of balancing. Sure, it would be nice
>>>> if the idle CPU did the work. But I think that frame of mind was an
>>>> incorrect notion from back in the early 2000s that does not apply to
>>>> today's hardware, or perhaps it doesn't apply to the (relatively) new
>>>> CFS scheduler. If you want aggressive scheduling, make the task rt, and
>>>> it will do aggressive scheduling.
>>>>
>>>
>>> How is it that the normal tick-based load balancing gets it right while
>>> idle_balance() gets it wrong?  Can it be because of the different
>>> cpu_idle_type?
>>>
>>
>> Currently it looks to be a fluke on my box, as this performance increase
>> can't be duplicated elsewhere (yet). But from looking at my traces, it
>> seems that my box does the idle balance at just the wrong time, which
>> causes these issues.
>>
> A default hackbench run creates 400 tasks (10 groups * 40 tasks), and on
> an i7 system (4 cores, HT) idle_balance() shouldn't be in action, because
> on an 8-CPU system we're assigning 400 tasks. If idle_balance() comes in,
> that means we've done something wrong while distributing tasks among the
> CPUs, which indicates a problem during fork/exec/wake balancing?

Hmm... I think, unless we have a promise that all those threads behave
the same at every moment, then even if each CPU has the same load, there
is still a chance that some CPU will finish its work faster when it owns
more 'sleepy' tasks at some moment.

So if idle_balance() happens, I would say that the workload is not heavy
enough to keep all the CPUs busy all the time, but I won't say it's
imbalanced.

Regards,
Michael Wang

> 
> Thanks,
> Rakib.


