* weakness of runnable load tracking?
@ 2012-12-05 15:19 Alex Shi
  2012-12-06  3:13 ` Alex Shi
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Shi @ 2012-12-05 15:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Paul Turner
  Cc: lkml, Vincent Guittot, Preeti U Murthy, Alex Shi

Hi Paul & Ingo:

The runnable load tracking patch set introduces a good way to track each
entity's/rq's running time.
But when I try to use it in load balancing, I find that burst-forking many
new tasks makes just a few CPUs heavy while the other CPUs get almost no
tasks assigned. That is because a newly forked task's load_avg_contrib is
zero right after creation, so no matter how many tasks are assigned to an
idle CPU, they do not increase that CPU's cfs_rq->runnable_load_avg or
rq->avg.load_avg_contrib (a rough sketch of the contribution formula is
below).
Actually, if it were only a new-task issue, we could set a new task's
initial load_avg equal to its load_weight. But if we burst-wake many tasks
that have been sleeping for a long time, we hit the same problem, since
their contributions have decayed to zero. The only solution I can think of
is to record the se's load_avg_contrib just before dequeue, not decay that
value, and add it to the new cfs_rq when the task is woken up. But if we
do that, the runnable load tracking becomes totally meaningless.
So do you have any ideas about burst-wakeup balancing with runnable load
tracking?
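For reference, the per-entity contribution in the patch set is derived
roughly as follows (a simplified sketch with assumed names, not the
kernel's fixed-point code):

/*
 * Simplified model of how a task's load contribution is computed:
 * a freshly forked entity, or one whose history has fully decayed
 * while it slept, has runnable_avg_sum == 0, so it contributes ~0
 * no matter how large its load weight is.
 */
unsigned long task_load_contrib(unsigned long weight,
				unsigned int runnable_avg_sum,
				unsigned int runnable_avg_period)
{
	return weight * (unsigned long)runnable_avg_sum /
	       (runnable_avg_period + 1);
}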


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: weakness of runnable load tracking?
  2012-12-05 15:19 weakness of runnable load tracking? Alex Shi
@ 2012-12-06  3:13 ` Alex Shi
  2012-12-06  5:57   ` Mike Galbraith
                     ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Alex Shi @ 2012-12-06  3:13 UTC (permalink / raw)
  To: Alex Shi
  Cc: Ingo Molnar, Peter Zijlstra, Paul Turner, lkml, Vincent Guittot,
	Preeti U Murthy, Andrew Morton, Venkatesh Pallipadi, Tejun Heo,
	Alex Shi

On 12/05/2012 11:19 PM, Alex Shi wrote:
> Hi Paul&Ingo:
> 
> Runnable load tracking patch set introduce a good way to tracking each
> entity/rq's running time.
> But when I try to enable it in load balance, I found burst forking
> many new tasks will make just few cpu heavy while other cpu has no
> much task assigned. That is due to the new forked task's
> load_avg_contrib is zero after just created. then no matter how many
> tasks assigned to a CPU can not increase the cfs_rq->runnable_load_avg
> or rq->avg.load_avg_contrib if this cpu idle.
> Actually, if just for new task issue, we can set new task's initial
> load_avg same as load_weight. but if we want to burst wake up many
> long time sleeping tasks, it has the same issue here since their were
> decayed to zero. So what solution I can thought is recording the se's
> load_avg_contrib just before dequeue, and don't decay the value, when
> it was waken up, add this value to new cfs_rq. but if so, the runnable
> load tracking is total meaningless.
> So do you have some idea of burst wakeup balancing with runnable load tracking?

Hi Paul & Ingo:

In short: burst forking/waking tasks have had no time to accumulate a
load contribution, so their runnable load is taken as zero. That makes
select_task_rq make a wrong decision about which group is idlest.

There are still three kinds of solution that could help with this issue.

a, set a non-zero minimum value for long-time-sleeping tasks. But that
seems unfair to other tasks that only sleep for a short while.

b, just use the runnable load contribution in load balance, and keep
using nr_running to judge the idlest group in select_task_rq_fair. But
that may cause a few more migrations in later load balancing.

c, consider both the runnable load and nr_running in the group: if,
within the searched domain, nr_running increases by a certain number,
say double the domain span, within a certain time, we treat it as a
burst of forking/waking and just use nr_running as the idlest-group
criterion.

IMHO, I like the 3rd one a bit more. As for the time window used to
judge whether a burst happened: since we calculate the runnable average
every tick, if nr_running increases by more than sd->span_weight within
2 ticks, that means a burst is happening (see the sketch below). What's
your opinion of this?
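A minimal sketch of that check, with made-up names only (the
two-ticks-ago bookkeeping does not exist in the scheduler today and
would have to be added):

/*
 * Hypothetical burst test for option (c): if the number of runnable
 * tasks in the searched domain grew by more than the domain span
 * within the last two ticks, treat it as a fork/wake burst and let
 * the idlest-group selection fall back to nr_running instead of
 * runnable_load_avg while this holds.
 */
int sd_burst_happening(unsigned int nr_running_now,
		       unsigned int nr_running_two_ticks_ago,
		       unsigned int span_weight)
{
	return nr_running_now > nr_running_two_ticks_ago + span_weight;
}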

Any comments are appreciated!

Regards!
Alex
> 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: weakness of runnable load tracking?
  2012-12-06  3:13 ` Alex Shi
@ 2012-12-06  5:57   ` Mike Galbraith
  2012-12-06  8:06     ` Alex Shi
  2012-12-06  6:52   ` Preeti U Murthy
  2012-12-06 10:45   ` Paul Turner
  2 siblings, 1 reply; 12+ messages in thread
From: Mike Galbraith @ 2012-12-06  5:57 UTC (permalink / raw)
  To: Alex Shi
  Cc: Alex Shi, Ingo Molnar, Peter Zijlstra, Paul Turner, lkml,
	Vincent Guittot, Preeti U Murthy, Andrew Morton,
	Venkatesh Pallipadi, Tejun Heo

On Thu, 2012-12-06 at 11:13 +0800, Alex Shi wrote: 
> On 12/05/2012 11:19 PM, Alex Shi wrote:
> > Hi Paul&Ingo:
> > 
> > Runnable load tracking patch set introduce a good way to tracking each
> > entity/rq's running time.
> > But when I try to enable it in load balance, I found burst forking
> > many new tasks will make just few cpu heavy while other cpu has no
> > much task assigned. That is due to the new forked task's
> > load_avg_contrib is zero after just created. then no matter how many
> > tasks assigned to a CPU can not increase the cfs_rq->runnable_load_avg
> > or rq->avg.load_avg_contrib if this cpu idle.
> > Actually, if just for new task issue, we can set new task's initial
> > load_avg same as load_weight. but if we want to burst wake up many
> > long time sleeping tasks, it has the same issue here since their were
> > decayed to zero. So what solution I can thought is recording the se's
> > load_avg_contrib just before dequeue, and don't decay the value, when
> > it was waken up, add this value to new cfs_rq. but if so, the runnable
> > load tracking is total meaningless.
> > So do you have some idea of burst wakeup balancing with runnable load tracking?
> 
> Hi Paul & Ingo:
> 
> In a short word of this issue: burst forking/waking tasks have no time
> accumulate the load contribute, their runnable load are taken as zero.
> that make select_task_rq do a wrong decision on which group is idlest.

As you pointed out above, new tasks can (and imho should) be born with
full weight.  Tasks _may_ become thin, but they're all born hungry.

> There is still 3 kinds of solution is helpful for this issue.
> 
> a, set a unzero minimum value for the long time sleeping task. but it
> seems unfair for other tasks these just sleep a short while.
> 
> b, just use runnable load contrib in load balance. Still using
> nr_running to judge idlest group in select_task_rq_fair. but that may
> cause a bit more migrations in future load balance.
> 
> c, consider both runnable load and nr_running in the group: like in the
> searching domain, the nr_running number increased a certain number, like
> double of the domain span, in a certain time. we will think it's a burst
> forking/waking happened, then just count the nr_running as the idlest
> group criteria.
> 
> IMHO, I like the 3rd one a bit more. as to the certain time to judge if
> a burst happened, since we will calculate the runnable avg at very tick,
> so if increased nr_running is beyond sd->span_weight in 2 ticks, means
> burst happening. What's your opinion of this?
> 
> Any comments are appreciated!

IMHO, for fork and bursty wake balancing, the only thing meaningful is
the here and now state of runqueues tasks are being dumped into.

Just because tasks are historically short running, you don't necessarily
want to take a gaggle and wedge them into a too small group just to even
out load averages.  If there was a hole available that you passed up by
using average load, you lose utilization.  I can see how this load
tracking stuff can average out to a win on a ~heavily loaded box, but
bursty stuff I don't see how it can do anything but harm, so imho, the
user should choose which is best for his box, instantaneous or history.

WRT burst detection: any window you define can be longer than the burst.

$.02

-Mike


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: weakness of runnable load tracking?
  2012-12-06  3:13 ` Alex Shi
  2012-12-06  5:57   ` Mike Galbraith
@ 2012-12-06  6:52   ` Preeti U Murthy
  2012-12-06  8:14     ` Alex Shi
  2012-12-06 10:45   ` Paul Turner
  2 siblings, 1 reply; 12+ messages in thread
From: Preeti U Murthy @ 2012-12-06  6:52 UTC (permalink / raw)
  To: Alex Shi
  Cc: Alex Shi, Ingo Molnar, Peter Zijlstra, Paul Turner, lkml,
	Vincent Guittot, Andrew Morton, Venkatesh Pallipadi, Tejun Heo

Hi Alex,
> Hi Paul & Ingo:
> 
> In a short word of this issue: burst forking/waking tasks have no time
> accumulate the load contribute, their runnable load are taken as zero.

After performing some experiments on the way PJT's metric calculates
the load, I observed a few things. Based on these observations, let me
see if I can address the issue of why PJT's metric calculates the load
of bursty tasks as 0.

When we speak about a burst-waking task (I will not go into forking
here), we should also speak about its duty cycle: it burst-wakes for 1ms
of a 10ms cycle, or for 1s of a 10s cycle, both being 10% tasks with
respect to their duty cycles. Let us see how the load is calculated by
PJT's metric in each of the above cases.
           --
          |  |
          |  |
__________|  |
          A  B
          1ms
          <->
      10ms
<------------>
          Example 1

When the task wakes up at A, it is not yet runnable, and an update of
the task load takes place. Its runtime so far is 0, and the time it has
existed is 10ms, hence the load is 0/10 * 1024. Since a scheduler tick
happens at B (a scheduler tick fires every 1ms, 4ms or 10ms; let us
assume 1ms), another update of the load takes place. PJT's metric
divides the elapsed time into 1ms windows. There is just one 1ms window,
hence the runtime is 1ms and the load is 1ms/10ms * 1024.

*If the time elapsed between A and B were < 1ms, then PJT's metric would
not capture it*.

Under those circumstances the load remains 0/10ms * 1024 = 0. This is
the situation you are pointing out. If this cycle continues throughout
the lifetime of the task, the load remains at 0. The question is whether
tasks which run for periods < 1ms are fine to be treated as zero
workloads. If that is fine, then what PJT's metric is doing is right;
maybe we should ignore such workloads because they hardly contribute to
the load. Otherwise we will need to reduce the load-update window to
< 1ms to capture such loads.


Just for some additional information, so that we know what happens to
different kinds of loads with PJT's metric, consider the situation
below:
                             ------
                            |      |
                            |      |
____________________________|      |
                            A      B
                               1s
                            <------>
<----------------------------------->
      10s
<------------>
                           Example 2

Here at A the task wakes, just like in Example 1, and the load is taken
as 0. Between A and B, if we consider the load to be updated at every
scheduler tick, the load slowly increases from 0 to 1024 at B. It is
1024 here, although this is also a 10% task, whereas in Example 1 the
load of a 10% task is 102.4. So what is fishy?

In my opinion, PJT's metric gives tasks some time to prove their
activeness after they wake up. In Example 2 the task has stayed awake
long enough (1s), irrespective of what percentage of the total run time
that is; therefore the metric calculates a load big enough to balance.

In the example that you quoted, the tasks may not have run long enough
to be considered candidates for load balancing.

So, essentially, what PJT's metric is doing is characterising a task by
the amount it has run so far.
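To make the two examples concrete, here is a rough user-space
simulation of the decaying average (assuming the ~1ms accumulation
windows and the y^32 = 0.5 decay used by the patch set; a floating-point
sketch, not the kernel's fixed-point code):

#include <math.h>
#include <stdio.h>

/*
 * Walk 1ms windows in order, decaying old history by y (y^32 = 0.5).
 * Every window adds 1024 to the period; only runnable windows add 1024
 * to the sum. The returned value models the load contribution at B.
 */
static double sim_load(int sleep_ms, int run_ms)
{
	const double y = pow(0.5, 1.0 / 32.0);
	double sum = 0.0, period = 0.0;
	int i;

	for (i = 0; i < sleep_ms + run_ms; i++) {
		sum = sum * y + (i >= sleep_ms ? 1024.0 : 0.0);
		period = period * y + 1024.0;
	}
	return 1024.0 * sum / period;
}

int main(void)
{
	/* Example 1: sleeps 9ms, runs 1ms -> stays near a tenth of 1024 */
	printf("1ms out of 10ms: %.0f\n", sim_load(9, 1));
	/* Example 2: sleeps 9s, runs 1s -> reaches ~1024 at B */
	printf("1s out of 10s:   %.0f\n", sim_load(9000, 1000));
	return 0;
}

Built with -lm, this prints roughly 113 for the first case and 1024 for
the second, which is the asymmetry described above.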


> that make select_task_rq do a wrong decision on which group is idlest.
> 
> There is still 3 kinds of solution is helpful for this issue.
> 
> a, set a unzero minimum value for the long time sleeping task. but it
> seems unfair for other tasks these just sleep a short while.
> 
> b, just use runnable load contrib in load balance. Still using
> nr_running to judge idlest group in select_task_rq_fair. but that may
> cause a bit more migrations in future load balance.
> 
> c, consider both runnable load and nr_running in the group: like in the
> searching domain, the nr_running number increased a certain number, like
> double of the domain span, in a certain time. we will think it's a burst
> forking/waking happened, then just count the nr_running as the idlest
> group criteria.
> 
> IMHO, I like the 3rd one a bit more. as to the certain time to judge if
> a burst happened, since we will calculate the runnable avg at very tick,
> so if increased nr_running is beyond sd->span_weight in 2 ticks, means
> burst happening. What's your opinion of this?
> 
> Any comments are appreciated!


So PJT's metric seems to be capturing the load of these bursty tasks
correctly, but you are right in pointing out that when too many such
loads queue up on a cpu, the metric will consider the load on that cpu
to be 0, which might not be such a good idea.

It is true that we need to bring in nr_running somewhere. Let me now go
through your suggestions on where to include nr_running and get back to
you on this. I had planned on including nr_running while selecting the
busy group in update_sd_lb_stats, but select_task_rq_fair is yet another
place to do this, that's right. Good that this issue was brought up :)

> Regards!
> Alex
>>
>>
> 

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: weakness of runnable load tracking?
  2012-12-06  5:57   ` Mike Galbraith
@ 2012-12-06  8:06     ` Alex Shi
  2012-12-06  9:12       ` Mike Galbraith
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Shi @ 2012-12-06  8:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Alex Shi, Ingo Molnar, Peter Zijlstra, Paul Turner, lkml,
	Vincent Guittot, Preeti U Murthy, Andrew Morton,
	Arjan van de Ven, Tejun Heo

>>
>> Hi Paul & Ingo:
>>
>> In a short word of this issue: burst forking/waking tasks have no time
>> accumulate the load contribute, their runnable load are taken as zero.
>> that make select_task_rq do a wrong decision on which group is idlest.
> 
> As you pointed out above, new tasks can (and imho should) be born with
> full weight.  Tasks _may_ become thin, but they're all born hungry.

Thanks for the comments. I think so. :)
> 
>> There is still 3 kinds of solution is helpful for this issue.
>>
>> a, set a unzero minimum value for the long time sleeping task. but it
>> seems unfair for other tasks these just sleep a short while.
>>
>> b, just use runnable load contrib in load balance. Still using
>> nr_running to judge idlest group in select_task_rq_fair. but that may
>> cause a bit more migrations in future load balance.
>>
>> c, consider both runnable load and nr_running in the group: like in the
>> searching domain, the nr_running number increased a certain number, like
>> double of the domain span, in a certain time. we will think it's a burst
>> forking/waking happened, then just count the nr_running as the idlest
>> group criteria.
>>
>> IMHO, I like the 3rd one a bit more. as to the certain time to judge if
>> a burst happened, since we will calculate the runnable avg at very tick,
>> so if increased nr_running is beyond sd->span_weight in 2 ticks, means
>> burst happening. What's your opinion of this?
>>
>> Any comments are appreciated!
> 
> IMHO, for fork and bursty wake balancing, the only thing meaningful is
> the here and now state of runqueues tasks are being dumped into.
> 
> Just because tasks are historically short running, you don't necessarily
> want to take a gaggle and wedge them into a too small group just to even
> out load averages.  If there was a hole available that you passed up by
> using average load, you lose utilization.  I can see how this load
> tracking stuff can average out to a win on a ~heavily loaded box, but
> bursty stuff I don't see how it can do anything but harm, so imho, the
> user should choose which is best for his box, instantaneous or history.

Do you mean the system administrator needs to make this choice?
That may be a hard decision.  :)
Any suggestions on what to base the decision on?
> 
> WRT burst detection: any window you define can be longer than the burst.

Maybe we can define 2 wakeups on the same cpu within 1 tick as a burst,
and if a cpu has already taken a waking task, we'd better skip it. :)
Anyway, the hard point is that we cannot predict the future.


> 
> $.02
> 
> -Mike
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: weakness of runnable load tracking?
  2012-12-06  6:52   ` Preeti U Murthy
@ 2012-12-06  8:14     ` Alex Shi
  0 siblings, 0 replies; 12+ messages in thread
From: Alex Shi @ 2012-12-06  8:14 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Alex Shi, Ingo Molnar, Peter Zijlstra, Paul Turner, lkml,
	Vincent Guittot, Andrew Morton, Tejun Heo

On 12/06/2012 02:52 PM, Preeti U Murthy wrote:
> Hi Alex,
>> Hi Paul & Ingo:
>>
>> In a short word of this issue: burst forking/waking tasks have no time
>> accumulate the load contribute, their runnable load are taken as zero.
> 
> On performing certain experiments on the way PJT's metric calculates the
> load,I observed a few things.Based on these observations let me see if i
> can address the issue of why PJT's metric is calculating the load of
> bursty tasks as 0.
> 
> When we speak about a burst waking task(I will not go into forking
> here),we should also speak about its duty cycle-it burst wakes for 1ms
> for a 10ms duty cycle or burst wakes 9s out of a 10s duty cycle-both
> being 10% tasks wrt their duty cycles.Lets see how load is calculated by
> PJT's metric in each of the above cases.
>            --
>           |  |
>           |  |
> __________|  |
>           A  B
>           1ms
>           <->
>       10ms
> <------------>
>           Example 1
> 
> When the task wakes up at A,it is not yet runnable,and an update of the
> task load takes place.Its runtime so far is 0,and its existing time is
> 10ms.Hence the load is 0/10*1024.Since a scheduler tick happens at B( a
> scheduler tick happens for every 1ms,10ms or 4ms.Let us assume 1ms),an
> update of the load takes place.PJT's metric divides the time elapsed
> into 1ms windows.There is just 1ms window,and hence the runtime is 1ms
> and the load is 1ms/10ms*1024.
> 
> *If the time elapsed between A and B were to be < 1ms,then PJT's metric
> will not capture it*.

A nice description of this issue. :)
> 
> And under these circumstances the load remains 0/10ms*1024=0.This is the
> situation you are pointing out.Let us assume that these cycle continues
> throughout the lifetime of the load,then the load remains at 0.The
> question is if such tasks which run for periods<1ms is ok to be termed
> as 0 workloads.If it is fine,then what PJT's metric is doing is
> right.Maybe we should ignore such workloads because they hardly
> contribute to the load.Otherwise we will need to reduce the window of
> load update to < 1ms to capture such loads.
> 
> 
> Just for some additional info so that we know what happens to different
> kinds of loads with PJT's metric,consider the below situation:
>                              ------
>                             |      |
>                             |      |
> ____________________________|      |
>                             A      B
>                                1s
>                             <------>
> <----------------------------------->
>       10s
> <------------>
>                            Example 2
> 
> Here at A,the task wakes,just like in Example1 and the load is termed 0.
> In between A and B for every scheduler tick if we consider the load to
> get updated,then the load slowly increases from 0 to 1024 at B.It is
> 1024 here,although this is also a 10% task,whereas in Example1 the load
> is 102.4 - a 10% task.So what is fishy?
> 
> In my opinion,PJT's metric gives the tasks some time to prove their
> activeness after they wake up.In Example2 the task has stayed awake too
> long-1s; irrespective of what % of the total run time it is.Therefore it
> calculates the load to be big enough to balance.
> 
> In the example that you have quoted,the tasks may not have run long
> enough to consider them as candidates for load balance.
> 
> So,essentially what PJT's metric is doing is characterising a task by
> the amount it has run so far.
> 
> 
>> that make select_task_rq do a wrong decision on which group is idlest.
>>
>> There is still 3 kinds of solution is helpful for this issue.
>>
>> a, set a unzero minimum value for the long time sleeping task. but it
>> seems unfair for other tasks these just sleep a short while.
>>
>> b, just use runnable load contrib in load balance. Still using
>> nr_running to judge idlest group in select_task_rq_fair. but that may
>> cause a bit more migrations in future load balance.
>>
>> c, consider both runnable load and nr_running in the group: like in the
>> searching domain, the nr_running number increased a certain number, like
>> double of the domain span, in a certain time. we will think it's a burst
>> forking/waking happened, then just count the nr_running as the idlest
>> group criteria.
>>
>> IMHO, I like the 3rd one a bit more. as to the certain time to judge if
>> a burst happened, since we will calculate the runnable avg at very tick,
>> so if increased nr_running is beyond sd->span_weight in 2 ticks, means
>> burst happening. What's your opinion of this?
>>
>> Any comments are appreciated!
> 
> 
> So Pjt's metric rightly seems to be capturing the load of these bursty
> tasks but you are right in pointing out that when too many such loads
> queue up on the cpu,this metric will consider the load on the cpu as
> 0,which might not be such a good idea.
> 
> It is true that we need to bring in nr_running somewhere.Let me now go
> through your suggestions on where to include nr_running and get back on
> this.I had planned on including nr_running while selecting the busy
> group in update_sd_lb_stats,but select_task_rq_fair is yet another place
> to do this, thats right.Good that this issue was brought up :)

Do you have details on how you would enable that in update_sd_lb_stats?
My impression is that in periodic load balance we can let time smooth
out the load variation.
> 
>> Regards!
>> Alex
>>>
>>>
>>
> 
> Regards
> Preeti U Murthy
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: weakness of runnable load tracking?
  2012-12-06  8:06     ` Alex Shi
@ 2012-12-06  9:12       ` Mike Galbraith
  2012-12-06 15:13         ` Alex Shi
  0 siblings, 1 reply; 12+ messages in thread
From: Mike Galbraith @ 2012-12-06  9:12 UTC (permalink / raw)
  To: Alex Shi
  Cc: Alex Shi, Ingo Molnar, Peter Zijlstra, Paul Turner, lkml,
	Vincent Guittot, Preeti U Murthy, Andrew Morton,
	Arjan van de Ven, Tejun Heo

On Thu, 2012-12-06 at 16:06 +0800, Alex Shi wrote: 
> >>
> >> Hi Paul & Ingo:
> >>
> >> In a short word of this issue: burst forking/waking tasks have no time
> >> accumulate the load contribute, their runnable load are taken as zero.
> >> that make select_task_rq do a wrong decision on which group is idlest.
> > 
> > As you pointed out above, new tasks can (and imho should) be born with
> > full weight.  Tasks _may_ become thin, but they're all born hungry.
> 
> Thanks for comments. I think so. :)
> > 
> >> There is still 3 kinds of solution is helpful for this issue.
> >>
> >> a, set a unzero minimum value for the long time sleeping task. but it
> >> seems unfair for other tasks these just sleep a short while.
> >>
> >> b, just use runnable load contrib in load balance. Still using
> >> nr_running to judge idlest group in select_task_rq_fair. but that may
> >> cause a bit more migrations in future load balance.
> >>
> >> c, consider both runnable load and nr_running in the group: like in the
> >> searching domain, the nr_running number increased a certain number, like
> >> double of the domain span, in a certain time. we will think it's a burst
> >> forking/waking happened, then just count the nr_running as the idlest
> >> group criteria.
> >>
> >> IMHO, I like the 3rd one a bit more. as to the certain time to judge if
> >> a burst happened, since we will calculate the runnable avg at very tick,
> >> so if increased nr_running is beyond sd->span_weight in 2 ticks, means
> >> burst happening. What's your opinion of this?
> >>
> >> Any comments are appreciated!
> > 
> > IMHO, for fork and bursty wake balancing, the only thing meaningful is
> > the here and now state of runqueues tasks are being dumped into.
> > 
> > Just because tasks are historically short running, you don't necessarily
> > want to take a gaggle and wedge them into a too small group just to even
> > out load averages.  If there was a hole available that you passed up by
> > using average load, you lose utilization.  I can see how this load
> > tracking stuff can average out to a win on a ~heavily loaded box, but
> > bursty stuff I don't see how it can do anything but harm, so imho, the
> > user should choose which is best for his box, instantaneous or history.
> 
> Do you mean the system administrator need to do this choice?

That's my gut feeling just from pondering potential pitfalls.

> It's may a hard decision.  :)

Yup, very hard.

> Any suggestions of decision basis?

Same as most buttons.. poke it and <cringe> see what happens :) 

> > WRT burst detection: any window you define can be longer than the burst.
> 
> Maybe we can define 2 waking on same cpu in 1 tick is a burst happened,
> and if the cpu had taken a waking task. we'd better skip this cpu. :)
> Anyway, the hard point is we can not predict future.

No matter what the metric, you'll be reacting after the fact.

Somebody needs to code up that darn omniscience algorithm.  In a pinch,
a simple undo the past will suffice :)

-Mike


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: weakness of runnable load tracking?
  2012-12-06  3:13 ` Alex Shi
  2012-12-06  5:57   ` Mike Galbraith
  2012-12-06  6:52   ` Preeti U Murthy
@ 2012-12-06 10:45   ` Paul Turner
  2012-12-06 15:10     ` Alex Shi
  2012-12-07  2:14     ` Alex Shi
  2 siblings, 2 replies; 12+ messages in thread
From: Paul Turner @ 2012-12-06 10:45 UTC (permalink / raw)
  To: Alex Shi
  Cc: Alex Shi, Ingo Molnar, Peter Zijlstra, lkml, Vincent Guittot,
	Preeti U Murthy, Andrew Morton, Venkatesh Pallipadi, Tejun Heo

On Wed, Dec 5, 2012 at 7:13 PM, Alex Shi <alex.shi@intel.com> wrote:
> On 12/05/2012 11:19 PM, Alex Shi wrote:
>> Hi Paul&Ingo:
>>
>> Runnable load tracking patch set introduce a good way to tracking each
>> entity/rq's running time.
>> But when I try to enable it in load balance, I found burst forking
>> many new tasks will make just few cpu heavy while other cpu has no
>> much task assigned. That is due to the new forked task's
>> load_avg_contrib is zero after just created. then no matter how many
>> tasks assigned to a CPU can not increase the cfs_rq->runnable_load_avg
>> or rq->avg.load_avg_contrib if this cpu idle.
>> Actually, if just for new task issue, we can set new task's initial
>> load_avg same as load_weight. but if we want to burst wake up many
>> long time sleeping tasks, it has the same issue here since their were
>> decayed to zero. So what solution I can thought is recording the se's
>> load_avg_contrib just before dequeue, and don't decay the value, when
>> it was waken up, add this value to new cfs_rq. but if so, the runnable
>> load tracking is total meaningless.
>> So do you have some idea of burst wakeup balancing with runnable load tracking?
>
> Hi Paul & Ingo:
>
> In a short word of this issue: burst forking/waking tasks have no time
> accumulate the load contribute, their runnable load are taken as zero.
> that make select_task_rq do a wrong decision on which group is idlest.

So these aren't strictly comparable; bursting and forking tasks have
fairly different characteristics here.

When we fork a task we intentionally reset the previous history.  This
means that a forked task that immediately runs is going to show up as
100% runnable and then converge to its true value.  This was fairly
intentionally chosen so that tasks would "start" fast rather than
having to worry about ramp up.

The treatment of a burst wake-up however is a little more interesting.
 There are two reasonable trains of thought one can follow, the first
is that:
- If it IS truly bursty you don't really want it factoring into long
term averages since steady state is not going to include that task;
hence a low average is ok.  Anything that's more frequent than this is
going to show up by definition of being within the periods.
- The other is that if it truly is idle for _enormous_ amounts of time
we want to give some cognizance to the fact that it might be more
bursty when it wakes up.

It is my intuition that the greatest carnage here is actually caused
by wake-up load-balancing getting in the way of periodic balance in
establishing a steady state.  That these entities happen to not be
runnable very often is just a red herring; they don't contribute
enough load average to matter in the periodic case.  Increasing their
load isn't going to really help this -- stronger, you don't want them
affecting the steady state.  I suspect more mileage would result from
reducing the interference wake-up load-balancing has with steady
state.

e.g. One thing you can think about is considering tasks moved by
wake-up load balance as "stolen", and allow periodic load-balance to
re-settle things as if select_idle_sibling had not ignored it :-)

>
> There is still 3 kinds of solution is helpful for this issue.
>
> a, set a unzero minimum value for the long time sleeping task. but it
> seems unfair for other tasks these just sleep a short while.
>

I think this is reasonable to do regardless; we set such a cap in the
cgroup case already.  Although you still obviously want this
threshold to be fairly low.  I suspect this is a weak improvement.
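As a purely illustrative sketch of such a lower bound (the divisor and
the helper name are assumptions, not the existing cgroup cap):

/*
 * Hypothetical floor on a runnable task's contribution so that a burst
 * of fully decayed wake-ups is still visible to idle-cpu selection.
 * The 1/16 divisor is just an arbitrary example of a fairly low
 * threshold.
 */
unsigned long contrib_with_floor(unsigned long contrib,
				 unsigned long weight)
{
	unsigned long floor = weight / 16;

	return contrib < floor ? floor : contrib;
}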

> b, just use runnable load contrib in load balance. Still using
> nr_running to judge idlest group in select_task_rq_fair. but that may
> cause a bit more migrations in future load balance.

I don't think this is a good approach.  The whole point of using
blocked load is so that you can converge on a steady state where you
don't NEED to move tasks.  What disrupts this is we naturally prefer
idle cpus on wake-up balance to reduce wake-up latency.  As above, I
think the better answer is making these two processes more
co-operative.

>
> c, consider both runnable load and nr_running in the group: like in the
> searching domain, the nr_running number increased a certain number, like
> double of the domain span, in a certain time. we will think it's a burst
> forking/waking happened, then just count the nr_running as the idlest
> group criteria.

This feels like a bit of a hack.  I suspect this is more binary:

If there's already something running on all the cpus then we should
let the periodic load balancer do placement taking averages into
account.

Otherwise, we're in wake-idle and we throw the cat in the bathwater.

>
> IMHO, I like the 3rd one a bit more. as to the certain time to judge if
> a burst happened, since we will calculate the runnable avg at very tick,
> so if increased nr_running is beyond sd->span_weight in 2 ticks, means
> burst happening. What's your opinion of this?
>

What are you defining as the "right" behavior for a group of tasks
waking up that want to use only a short burst?

This seems to suggest you think spreading them is the best answer?
What's the motivation for that?  Also: What does your topology look
like that's preventing select_idle_sibling from pushing tasks (and
then new-idle subsequently continuing to pull)?

Certainly putting a lower bound on a task's weight would help new-idle
find the right cpu to pull these from.

> Any comments are appreciated!
>
> Regards!
> Alex
>>
>>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: weakness of runnable load tracking?
  2012-12-06 10:45   ` Paul Turner
@ 2012-12-06 15:10     ` Alex Shi
  2012-12-08 12:20       ` Alex Shi
  2012-12-07  2:14     ` Alex Shi
  1 sibling, 1 reply; 12+ messages in thread
From: Alex Shi @ 2012-12-06 15:10 UTC (permalink / raw)
  To: Paul Turner
  Cc: Alex Shi, Ingo Molnar, Peter Zijlstra, lkml, Vincent Guittot,
	Preeti U Murthy, Andrew Morton, Venkatesh Pallipadi, Tejun Heo


>> Hi Paul & Ingo:
>>
>> In a short word of this issue: burst forking/waking tasks have no time
>> accumulate the load contribute, their runnable load are taken as zero.
>> that make select_task_rq do a wrong decision on which group is idlest.
> 
> So these aren't strictly comparable; bursting and forking tasks have
> fairly different characteristics here.

Many thanks for looking into this. :)
> 
> When we fork a task we intentionally reset the previous history.  This
> means that a forked task that immediately runs is going to show up as
> 100% runnable and then converge to it's true value.  This was fairly
> intentionally chosen so that tasks would "start" fast rather than
> having to worry about ramp up.

I am sorry, but I don't see the 100% runnable value for a newly forked
task. I believe the code needs the following patch to initialize
decay_count and load_avg_contrib; otherwise they hold random values.
In enqueue_entity_load_avg(), p->se.avg.runnable_avg_sum for a newly
forked task is always zero, either because se.avg.last_runnable_update
is set to clock_task due to decay_count <= 0, or because we only do
__synchronize_entity_decay, not update_entity_load_avg.

===========
>From a161000dbece6e95bf3b81e9246d51784589d393 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Mon, 3 Dec 2012 17:30:39 +0800
Subject: [PATCH 05/12] sched: load tracking bug fix

We need to initialize se.avg.{decay_count, load_avg_contrib} to zero
after a new task is forked.
Otherwise the random values of the above fields give incorrect
statistics when the new task is enqueued:
    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dae0d2..e6533e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1534,6 +1534,8 @@ static void __sched_fork(struct task_struct *p)
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
+	p->se.avg.decay_count = 0;
+	p->se.avg.load_avg_contrib = 0;
 #endif
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
-- 
1.7.5.4


Yes, the runnable load will quickly become 100% after se->on_rq is set
and one tick has passed.
But if we are forking many tasks within one tick, I didn't find a useful
chance to update the load avg of the new tasks.
So I guess we need the following patch:

==========
>From 6de8083606b6e0b55e201eadbb945c60fe7631b6 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Thu, 6 Dec 2012 22:48:24 +0800
Subject: [PATCH 12/12] sched: set initial load avg of new forked task as its
 load weight

A new task has no runnable sum at the time it first becomes runnable,
which makes burst forking select only a few idle cpus to place tasks on.
Set the initial load avg of a newly forked task to its load weight to
resolve this issue.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    2 +-
 kernel/sched/fair.c   |   18 ++++++++++++++++--
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e483ccb..12063fa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1043,6 +1043,7 @@ struct sched_domain;
 #else
 #define ENQUEUE_WAKING		0
 #endif
+#define ENQUEUE_NEWTASK		8
 
 #define DEQUEUE_SLEEP		1
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 980904d..6a4d225 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1643,7 +1643,7 @@ void wake_up_new_task(struct task_struct *p)
 #endif
 
 	rq = __task_rq_lock(p);
-	activate_task(rq, p, 0);
+	activate_task(rq, p, ENQUEUE_NEWTASK);
 	p->on_rq = 1;
 	trace_sched_wakeup_new(p, true);
 	check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d6eb91..f15b0ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1289,8 +1289,9 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
-						  int wakeup)
+						  int flags)
 {
+	int wakeup = flags & ENQUEUE_WAKEUP;
 	/*
 	 * We track migrations using entity decay_count <= 0, on a wake-up
 	 * migration we use a negative decay count to track the remote decays
@@ -1324,6 +1325,13 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 		update_entity_load_avg(se, 0);
 	}
 
+	/*
+	 * Set the initial load avg of a new task to its load weight
+	 * so that a burst of forks does not make a few cpus too heavy.
+	 */
+	if (flags & ENQUEUE_NEWTASK)
+		se->avg.load_avg_contrib = se->load.weight;
+
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
@@ -1488,7 +1496,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	account_entity_enqueue(cfs_rq, se);
-	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
+	enqueue_entity_load_avg(cfs_rq, se, flags &
+				(ENQUEUE_WAKEUP | ENQUEUE_NEWTASK));
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
@@ -2580,6 +2589,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	int newtask = flags & ENQUEUE_NEWTASK;
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -2598,6 +2608,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running++;
 
 		flags = ENQUEUE_WAKEUP;
+		flags &= ~ENQUEUE_NEWTASK;
 	}
 
 	for_each_sched_entity(se) {
@@ -2616,6 +2627,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		inc_nr_running(rq);
 	}
 	hrtick_update(rq);
+	if (newtask)
+		printk(KERN_ERR "rq avg data on a new task: avg sum %u avg contrib %lu\n",
+			rq->avg.runnable_avg_sum, rq->avg.load_avg_contrib);
 }
 
 static void set_next_buddy(struct sched_entity *se);
-- 
1.7.5.4


My sleepy brain is nearly stalling. I will think over your other suggestions over the weekend. :)

> 
> The treatment of a burst wake-up however is a little more interesting.
>  There are two reasonable trains of thought one can follow, the first
> is that:
> - If it IS truly bursty you don't really want it factoring into long
> term averages since steady state is not going to include that task;
> hence a low average is ok.  Anything that's more frequent then this is
> going to show up by definition of being within the periods.
> - The other is that if it truly is idle for _enormous_ amounts of time
> we want to give some cognizance to the fact that it might be more
> bursty when it wakes up.
> 
> It is my intuition that the greatest carnage here is actually caused
> by wake-up load-balancing getting in the way of periodic in
> establishing a steady state.  That these entities happen to not be
> runnable very often is just a red herring; they don't contribute
> enough load average to matter in the periodic case.  Increasing their
> load isn't going to really help this -- stronger, you don't want them
> affecting the steady state.  I suspect more mileage would result from
> reducing the interference wake-up load-balancing has with steady
> state.
> 
> e.g. One thing you can think about is considering tasks moved by
> wake-up load balance as "stolen", and allow periodic load-balance to
> re-settle things as if select_idle_sibling had not ignored it :-)
> 
>>
>> There is still 3 kinds of solution is helpful for this issue.
>>
>> a, set a unzero minimum value for the long time sleeping task. but it
>> seems unfair for other tasks these just sleep a short while.
>>
> 
> I think this is  reasonable to do regardless, we set such a cap in the
> cgroup case already.  Although you still obviously want this
> threshhold to be fairly low.  I suspect this is a weak improvement.
> 
>> b, just use runnable load contrib in load balance. Still using
>> nr_running to judge idlest group in select_task_rq_fair. but that may
>> cause a bit more migrations in future load balance.
> 
> I don't think this is a good approach.  The whole point of using
> blocked load is so that you can converge on a steady state where you
> don't NEED to move tasks.  What disrupts this is we naturally prefer
> idle cpus on wake-up balance to reduce wake-up latency.  As above, I
> think the better answer is making these two processes more
> co-operative.
> 
>>
>> c, consider both runnable load and nr_running in the group: like in the
>> searching domain, the nr_running number increased a certain number, like
>> double of the domain span, in a certain time. we will think it's a burst
>> forking/waking happened, then just count the nr_running as the idlest
>> group criteria.
> 
> This feels like a bit of a hack.  I suspect this is more binary:
> 
> If there's already something running on all the cpus then we should
> let the periodic load balancer do placement taking averages into
> account.
> 
> Otherwise, we're in wake-idle and we throw the cat in the bathwater.
> 
>>
>> IMHO, I like the 3rd one a bit more. as to the certain time to judge if
>> a burst happened, since we will calculate the runnable avg at very tick,
>> so if increased nr_running is beyond sd->span_weight in 2 ticks, means
>> burst happening. What's your opinion of this?
>>
> 
> What are you defining as the "right" behavior for a group of tasks
> waking up that want to use only a short burst?
> 
> This seems to suggest you think spreading them is the best answer?
> What's the motivation for that?  Also: What does your topology look
> like that's preventing select_idle_sibling from pushing tasks (and
> then new-idle subsequently continuing to pull)?
> 
> Certainly putting a lower bound on a tasks weight would help new-idle
> find the right cpu to pull these from.
> 
>> Any comments are appreciated!
>>
>> 

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: weakness of runnable load tracking?
  2012-12-06  9:12       ` Mike Galbraith
@ 2012-12-06 15:13         ` Alex Shi
  0 siblings, 0 replies; 12+ messages in thread
From: Alex Shi @ 2012-12-06 15:13 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Alex Shi, Ingo Molnar, Peter Zijlstra, Paul Turner, lkml,
	Vincent Guittot, Preeti U Murthy, Andrew Morton,
	Arjan van de Ven, Tejun Heo

On 12/06/2012 05:12 PM, Mike Galbraith wrote:
> On Thu, 2012-12-06 at 16:06 +0800, Alex Shi wrote: 
>>>>
>>>> Hi Paul & Ingo:
>>>>
>>>> In a short word of this issue: burst forking/waking tasks have no time
>>>> accumulate the load contribute, their runnable load are taken as zero.
>>>> that make select_task_rq do a wrong decision on which group is idlest.
>>>
>>> As you pointed out above, new tasks can (and imho should) be born with
>>> full weight.  Tasks _may_ become thin, but they're all born hungry.
>>
>> Thanks for comments. I think so. :)
>>>
>>>> There is still 3 kinds of solution is helpful for this issue.
>>>>
>>>> a, set a unzero minimum value for the long time sleeping task. but it
>>>> seems unfair for other tasks these just sleep a short while.
>>>>
>>>> b, just use runnable load contrib in load balance. Still using
>>>> nr_running to judge idlest group in select_task_rq_fair. but that may
>>>> cause a bit more migrations in future load balance.
>>>>
>>>> c, consider both runnable load and nr_running in the group: like in the
>>>> searching domain, the nr_running number increased a certain number, like
>>>> double of the domain span, in a certain time. we will think it's a burst
>>>> forking/waking happened, then just count the nr_running as the idlest
>>>> group criteria.
>>>>
>>>> IMHO, I like the 3rd one a bit more. as to the certain time to judge if
>>>> a burst happened, since we will calculate the runnable avg at very tick,
>>>> so if increased nr_running is beyond sd->span_weight in 2 ticks, means
>>>> burst happening. What's your opinion of this?
>>>>
>>>> Any comments are appreciated!
>>>
>>> IMHO, for fork and bursty wake balancing, the only thing meaningful is
>>> the here and now state of runqueues tasks are being dumped into.
>>>
>>> Just because tasks are historically short running, you don't necessarily
>>> want to take a gaggle and wedge them into a too small group just to even
>>> out load averages.  If there was a hole available that you passed up by
>>> using average load, you lose utilization.  I can see how this load
>>> tracking stuff can average out to a win on a ~heavily loaded box, but
>>> bursty stuff I don't see how it can do anything but harm, so imho, the
>>> user should choose which is best for his box, instantaneous or history.
>>
>> Do you mean the system administrator need to do this choice?
> 
> That's my gut feeling just from pondering potential pitfalls.
> 
>> It's may a hard decision.  :)
> 
> Yup, very hard.
> 
>> Any suggestions of decision basis?
> 
> Same as most buttons.. poke it and <cringe> see what happens :) 

:D
> 
>>> WRT burst detection: any window you define can be longer than the burst.
>>
>> Maybe we can define 2 waking on same cpu in 1 tick is a burst happened,
>> and if the cpu had taken a waking task. we'd better skip this cpu. :)
>> Anyway, the hard point is we can not predict future.
> 
> No matter what the metric, you'll be reacting after the fact.
> 
> Somebody needs to code up that darn omniscience algorithm.  In a pinch,
> a simple undo the past will suffice :)

Yes. I see.
> 
> -Mike
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: weakness of runnable load tracking?
  2012-12-06 10:45   ` Paul Turner
  2012-12-06 15:10     ` Alex Shi
@ 2012-12-07  2:14     ` Alex Shi
  1 sibling, 0 replies; 12+ messages in thread
From: Alex Shi @ 2012-12-07  2:14 UTC (permalink / raw)
  To: Paul Turner
  Cc: Alex Shi, Ingo Molnar, Peter Zijlstra, lkml, Vincent Guittot,
	Preeti U Murthy, Andrew Morton, Venkatesh Pallipadi, Tejun Heo

> 
> The treatment of a burst wake-up however is a little more interesting.
>  There are two reasonable trains of thought one can follow, the first
> is that:
> - If it IS truly bursty you don't really want it factoring into long
> term averages since steady state is not going to include that task;
> hence a low average is ok.  Anything that's more frequent then this is
> going to show up by definition of being within the periods.
> - The other is that if it truly is idle for _enormous_ amounts of time
> we want to give some cognizance to the fact that it might be more
> bursty when it wakes up.
> 
> It is my intuition that the greatest carnage here is actually caused
> by wake-up load-balancing getting in the way of periodic in
> establishing a steady state.  That these entities happen to not be
> runnable very often is just a red herring; they don't contribute
> enough load average to matter in the periodic case.  Increasing their
> load isn't going to really help this -- stronger, you don't want them
> affecting the steady state.  I suspect more mileage would result from
> reducing the interference wake-up load-balancing has with steady
> state.
> 
> e.g. One thing you can think about is considering tasks moved by
> wake-up load balance as "stolen", and allow periodic load-balance to
> re-settle things as if select_idle_sibling had not ignored it :-)

Considering that periodic load balance should spread tasks well, we can
assume the burst-waking tasks were already spread widely among the cpus
before they slept. That will relieve the burst-waking imbalance.

Plus, given the uncertain utilisation of waking tasks, I guess we can
forget the extra treatment for burst waking. If the woken tasks keep
running for a long time, periodic LB will handle them again.
> 
>>
>> There is still 3 kinds of solution is helpful for this issue.
>>
>> a, set a unzero minimum value for the long time sleeping task. but it
>> seems unfair for other tasks these just sleep a short while.
>>
> 
> I think this is  reasonable to do regardless, we set such a cap in the
> cgroup case already.  Although you still obviously want this
> threshhold to be fairly low.  I suspect this is a weak improvement.

Agreed.
> 
>> b, just use runnable load contrib in load balance. Still using
>> nr_running to judge idlest group in select_task_rq_fair. but that may
>> cause a bit more migrations in future load balance.
> 
> I don't think this is a good approach.  The whole point of using
> blocked load is so that you can converge on a steady state where you
> don't NEED to move tasks.  What disrupts this is we naturally prefer
> idle cpus on wake-up balance to reduce wake-up latency.  As above, I
> think the better answer is making these two processes more
> co-operative.

Sure, instantaneous load in fork/exec/wake balancing does interfere
with the later periodic LB.
> 
>>
>> c, consider both runnable load and nr_running in the group: like in the
>> searching domain, the nr_running number increased a certain number, like
>> double of the domain span, in a certain time. we will think it's a burst
>> forking/waking happened, then just count the nr_running as the idlest
>> group criteria.
> 
> This feels like a bit of a hack.  I suspect this is more binary:
> 
> If there's already something running on all the cpus then we should
> let the periodic load balancer do placement taking averages into
> account.
> 
> Otherwise, we're in wake-idle and we throw the cat in the bathwater.

As Mike said, a simple time window is never enough to define a 'burst'.
So it would likely prove to be a poor optimisation in some cases. :)
> 
>>
>> IMHO, I like the 3rd one a bit more. as to the certain time to judge if
>> a burst happened, since we will calculate the runnable avg at very tick,
>> so if increased nr_running is beyond sd->span_weight in 2 ticks, means
>> burst happening. What's your opinion of this?
>>
> 
> What are you defining as the "right" behavior for a group of tasks
> waking up that want to use only a short burst?
> 
> This seems to suggest you think spreading them is the best answer?
> What's the motivation for that?  Also: What does your topology look
> like that's preventing select_idle_sibling from pushing tasks (and
> then new-idle subsequently continuing to pull)?

Actually, I have no real-world case of burst waking causing an
imbalance among cpus.

Considering the reasons above, I will give up on optimizing for this.
It's not too late to reconsider if such a case is really triggered.
> 
> Certainly putting a lower bound on a tasks weight would help new-idle
> find the right cpu to pull these from.
> 
>> Any comments are appreciated!
>>
>> Regards!
>> Alex
>>>
>>>
>>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: weakness of runnable load tracking?
  2012-12-06 15:10     ` Alex Shi
@ 2012-12-08 12:20       ` Alex Shi
  0 siblings, 0 replies; 12+ messages in thread
From: Alex Shi @ 2012-12-08 12:20 UTC (permalink / raw)
  To: Paul Turner
  Cc: Alex Shi, Ingo Molnar, Peter Zijlstra, lkml, Vincent Guittot,
	Preeti U Murthy, Andrew Morton, Venkatesh Pallipadi, Tejun Heo

On 12/06/2012 11:10 PM, Alex Shi wrote:
> 
>>> Hi Paul & Ingo:
>>>
>>> In a short word of this issue: burst forking/waking tasks have no time
>>> accumulate the load contribute, their runnable load are taken as zero.
>>> that make select_task_rq do a wrong decision on which group is idlest.
>>
>> So these aren't strictly comparable; bursting and forking tasks have
>> fairly different characteristics here.
> 
> Many thanks for looking into this. :)
>>
>> When we fork a task we intentionally reset the previous history.  This
>> means that a forked task that immediately runs is going to show up as
>> 100% runnable and then converge to it's true value.  This was fairly
>> intentionally chosen so that tasks would "start" fast rather than
>> having to worry about ramp up.
> 
> I am sorry for didn't see the 100% runnable for a new forked task. 
> I believe the code need the following patch to initialize decay_count, 
> and load_avg_contrib. otherwise they are random value. 
> In enqueue_entity_load_avg() p->se.avg.runnable_avg_sum for new forked 
> task is always zero, either because se.avg.last_runnable_update is set
>  as clock_task due to decay_count <=0, or just do 
> __synchronize_entity_decay not update_entity_load_avg.

Paul:
Would you like to give some comments on the following patches?

> 
> ===========
> From a161000dbece6e95bf3b81e9246d51784589d393 Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@intel.com>
> Date: Mon, 3 Dec 2012 17:30:39 +0800
> Subject: [PATCH 05/12] sched: load tracking bug fix
> 
> We need initialize the se.avg.{decay_count, load_avg_contrib} to zero
> after a new task forked.
> Otherwise random values of above variable give a incorrect statistic
> data when do new task enqueue:
>     enqueue_task_fair
>         enqueue_entity
>             enqueue_entity_load_avg
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/core.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5dae0d2..e6533e1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1534,6 +1534,8 @@ static void __sched_fork(struct task_struct *p)
>  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>  	p->se.avg.runnable_avg_period = 0;
>  	p->se.avg.runnable_avg_sum = 0;
> +	p->se.avg.decay_count = 0;
> +	p->se.avg.load_avg_contrib = 0;
>  #endif
>  #ifdef CONFIG_SCHEDSTATS
>  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 12+ messages in thread
