* [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
@ 2016-10-07 19:38 Joseph Salisbury
  2016-10-07 19:57 ` Linus Torvalds
  2016-10-08  8:00 ` Peter Zijlstra
  0 siblings, 2 replies; 46+ messages in thread
From: Joseph Salisbury @ 2016-10-07 19:38 UTC (permalink / raw)
  To: peterz; +Cc: torvalds, efault, tglx, LKML, mingo

Hello Peter,

A kernel bug report was opened against Ubuntu [0].  After a kernel
bisect, it was found that reverting the following commit resolved this bug:

commit 3d30544f02120b884bba2a9466c87dba980e3be5
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Jun 21 14:27:50 2016 +0200

    sched/fair: Apply more PELT fixes

The regression was introduced as of v4.8-rc1.  The bug can be reproduced
on an X1 Carbon with the following:
stress -c $your_total_cpu_cores

I was hoping to get your feedback, since you are the patch author.  Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request? 


Thanks,

Joe


[0] http://pad.lv/1627108


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-07 19:38 [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes Joseph Salisbury
@ 2016-10-07 19:57 ` Linus Torvalds
  2016-10-07 20:22   ` Joseph Salisbury
  2016-10-08  8:00 ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Linus Torvalds @ 2016-10-07 19:57 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: Peter Zijlstra, Mike Galbraith, Thomas Gleixner, LKML, Ingo Molnar

On Fri, Oct 7, 2016 at 12:38 PM, Joseph Salisbury
<joseph.salisbury@canonical.com> wrote:
>
> A kernel bug report was opened against Ubuntu [0].  After a kernel
> bisect, it was found that reverting the following commit resolved this bug:

Hmm. Interesting, and it sounds like we should revert that unless
somebody figures out *why* following the rules wrt cfs_rq updates causes
problems. But I also wonder what the Ubuntu kernel config is. Does
Ubuntu enable CONFIG_SCHED_AUTOGROUP=y, for example? Because
regardless of any other scheduler issues, autogrouping *should* mean
that when you run some CPU hogger in one session, that should still
balance all CPU time with other sessions..

I'm not seeing anything odd on my xps13, which should have a similar
CPU to the X1 Carbon.

               Linus


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-07 19:57 ` Linus Torvalds
@ 2016-10-07 20:22   ` Joseph Salisbury
  2016-10-07 20:37     ` Linus Torvalds
  0 siblings, 1 reply; 46+ messages in thread
From: Joseph Salisbury @ 2016-10-07 20:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Mike Galbraith, Thomas Gleixner, LKML, Ingo Molnar

On 10/07/2016 03:57 PM, Linus Torvalds wrote:
> On Fri, Oct 7, 2016 at 12:38 PM, Joseph Salisbury
> <joseph.salisbury@canonical.com> wrote:
>> A kernel bug report was opened against Ubuntu [0].  After a kernel
>> bisect, it was found that reverting the following commit resolved this bug:
> Hmm. Interesting, and it sounds like we should revert that unless
> somebody figures out *why* following the rules wrt cfq updates causes
> problems. But I also wonder what the Ubuntu kernel config is. Does
> Ubuntu enable CONFIG_SCHED_AUTOGROUP=y, for example? Because
> regardless of any other scheduler issues, autogrouping *should* mean
> that when you run some CPU hogger in one session, that should still
> balance all CPU time with other sessions..
>
> I'm not seeing anything odd on my xps13, which should have a similar
> CPU to the X1 Carbon.
>
>                Linus
Hi Linus,

Yes, CONFIG_SCHED_AUTOGROUP is enabled in the Ubuntu kernel.  However,
that config was also enabled in the Ubuntu 4.4 kernels without seeing
this issue.  I can try disabling the config in the 4.8-based kernel and
see if that changes things.

Thanks,

Joe


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-07 20:22   ` Joseph Salisbury
@ 2016-10-07 20:37     ` Linus Torvalds
  0 siblings, 0 replies; 46+ messages in thread
From: Linus Torvalds @ 2016-10-07 20:37 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: Peter Zijlstra, Mike Galbraith, Thomas Gleixner, LKML, Ingo Molnar

On Fri, Oct 7, 2016 at 1:22 PM, Joseph Salisbury
<joseph.salisbury@canonical.com> wrote:
>
> Yes, CONFIG_SCHED_AUTOGROUP is enabled in the Ubuntu kernel.  However,
> that config was also enable in the Ubuntu 4.4 kerrnels without seeing
> this issue.   I can try disabling the config in the 4.8 based kernel and
> see if that changes things.

No, that wouldn't make any sense. I just wanted to know that the
option was enabled, because that option really *should* help buffer
other processes from one session that is a CPU hog.

So something is seriously wrong in that situation if other things get
very choppy. Of course, the fact that it apparently happens on one
particular machine only means that it's hard to figure out what
triggers it. Maybe some unlucky combination of cpufreq and thermal
throttling by the hardware, coupled with the scheduler change.

Peter?

                 Linus


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-07 19:38 [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes Joseph Salisbury
  2016-10-07 19:57 ` Linus Torvalds
@ 2016-10-08  8:00 ` Peter Zijlstra
  2016-10-08  8:39   ` Ingo Molnar
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-08  8:00 UTC (permalink / raw)
  To: Joseph Salisbury; +Cc: torvalds, efault, tglx, LKML, mingo, Vincent Guittot

On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
> Hello Peter,
> 
> A kernel bug report was opened against Ubuntu [0].  After a kernel
> bisect, it was found that reverting the following commit resolved this bug:
> 
> commit 3d30544f02120b884bba2a9466c87dba980e3be5
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Tue Jun 21 14:27:50 2016 +0200
> 
>     sched/fair: Apply more PELT fixes

That commit doesn't revert cleanly, did you take out more?

> The regression was introduced as of v4.8-rc1.  The bug can be reproduced
> on an X1 Carbon with the following:
> stress -c $your_total_cpu_cores
> 
> I was hoping to get your feedback, since you are the patch author.  Do
> you think gathering any additional data will help diagnose this issue,
> or would it be best to submit a revert request? 

I'll try and have a look asap, but I'm traveling next week so it might
be a tad slower than normal.

If you could provide a /proc/sched_debug dump while the thing is
running, that would be useful.

Thanks!


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-08  8:00 ` Peter Zijlstra
@ 2016-10-08  8:39   ` Ingo Molnar
  2016-10-08 11:37     ` Vincent Guittot
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2016-10-08  8:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joseph Salisbury, torvalds, efault, tglx, LKML, Vincent Guittot


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
> > Hello Peter,
> > 
> > A kernel bug report was opened against Ubuntu [0].  After a kernel
> > bisect, it was found that reverting the following commit resolved this bug:
> > 
> > commit 3d30544f02120b884bba2a9466c87dba980e3be5
> > Author: Peter Zijlstra <peterz@infradead.org>
> > Date:   Tue Jun 21 14:27:50 2016 +0200
> > 
> >     sched/fair: Apply more PELT fixes
> 
> That commit doesn't revert cleanly, did you take out more?

Note that it reverts cleanly from v4.8 - while it does not revert
cleanly from current upstream, which has more changes in that area.

I suspect Joseph tested a v4.8-ish kernel.

> I'll try and have a look asap, but I'm traveling next week so it might a
> tad slower than normal.
> 
> If you could provide a /proc/sched_debug dump while the thing is running
> that'd might be useful.

Also, running the latest scheduler bits would be useful:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

... just on the off chance that another change in this area fixed the bug, plus to 
make it easier to send test patches.

Upstream merge commit af79ad2b1f33 should also be pretty safe to try - it's v4.8 
with the v4.9 scheduler bits applied.

Thanks,

	Ingo


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-08  8:39   ` Ingo Molnar
@ 2016-10-08 11:37     ` Vincent Guittot
  2016-10-08 11:49       ` Mike Galbraith
  0 siblings, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2016-10-08 11:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Joseph Salisbury, Linus Torvalds, Mike Galbraith,
	Thomas Gleixner, LKML

On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>> > Hello Peter,
>> >
>> > A kernel bug report was opened against Ubuntu [0].  After a kernel
>> > bisect, it was found that reverting the following commit resolved this bug:
>> >
>> > commit 3d30544f02120b884bba2a9466c87dba980e3be5
>> > Author: Peter Zijlstra <peterz@infradead.org>
>> > Date:   Tue Jun 21 14:27:50 2016 +0200
>> >
>> >     sched/fair: Apply more PELT fixes

This patch only speeds up the update of the task group load in order
to reflect the new load balance, but it should not change the final
value and, as a result, the final behavior. I will try to reproduce it
on my target later today.

>>
>> That commit doesn't revert cleanly, did you take out more?
>
> Note that it reverts cleanly from v4.8 - while it does to revert from current
> upstream that did more changes in that area.
>
> I suspect Josheph tested a v4.8-ish kernel.
>
>> I'll try and have a look asap, but I'm traveling next week so it might a
>> tad slower than normal.
>>
>> If you could provide a /proc/sched_debug dump while the thing is running
>> that'd might be useful.
>
> Also, running the latest scheduler bits would be useful:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>
> ... just on the off chance that another change in this area fixed the bug, plus to
> make it easier to send test patches.
>
> Upstream merge commit af79ad2b1f33 should also be pretty safe to try - it's v4.8
> with the v4.9 scheduler bits applied.
>
> Thanks,
>
>         Ingo


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-08 11:37     ` Vincent Guittot
@ 2016-10-08 11:49       ` Mike Galbraith
  2016-10-12 12:20         ` Vincent Guittot
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Galbraith @ 2016-10-08 11:49 UTC (permalink / raw)
  To: Vincent Guittot, Ingo Molnar
  Cc: Peter Zijlstra, Joseph Salisbury, Linus Torvalds, Thomas Gleixner, LKML

On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
> > > > Hello Peter,
> > > > 
> > > > A kernel bug report was opened against Ubuntu [0].  After a
> > > > kernel
> > > > bisect, it was found that reverting the following commit
> > > > resolved this bug:
> > > > 
> > > > commit 3d30544f02120b884bba2a9466c87dba980e3be5
> > > > Author: Peter Zijlstra <peterz@infradead.org>
> > > > Date:   Tue Jun 21 14:27:50 2016 +0200
> > > > 
> > > >     sched/fair: Apply more PELT fixes
> 
> This patch only speeds up the update of task group load in order to
> reflect the new load balance but It should not change the final value
> and as a result the final behavior. I will try to reproduce it in my
> target later today

FWIW, I tried and failed w/wo autogroup on 4.8 and master.

	-Mike


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-08 11:49       ` Mike Galbraith
@ 2016-10-12 12:20         ` Vincent Guittot
  2016-10-12 15:35           ` Joseph Salisbury
  2016-10-12 16:21           ` Joseph Salisbury
  0 siblings, 2 replies; 46+ messages in thread
From: Vincent Guittot @ 2016-10-12 12:20 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Thomas Gleixner,
	LKML, Mike Galbraith

On 8 October 2016 at 13:49, Mike Galbraith <efault@gmx.de> wrote:
> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>> >
>> > * Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> > > On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>> > > > Hello Peter,
>> > > >
>> > > > A kernel bug report was opened against Ubuntu [0].  After a
>> > > > kernel
>> > > > bisect, it was found that reverting the following commit
>> > > > resolved this bug:
>> > > >
>> > > > commit 3d30544f02120b884bba2a9466c87dba980e3be5
>> > > > Author: Peter Zijlstra <peterz@infradead.org>
>> > > > Date:   Tue Jun 21 14:27:50 2016 +0200
>> > > >
>> > > >     sched/fair: Apply more PELT fixes
>>
>> This patch only speeds up the update of task group load in order to
>> reflect the new load balance but It should not change the final value
>> and as a result the final behavior. I will try to reproduce it in my
>> target later today
>
> FWIW, I tried and failed w/wo autogroup on 4.8 and master.

Me too

Is it possible to get some dump of /proc/sched_debug while the problem occurs?

Vincent

>         -Mike


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-12 12:20         ` Vincent Guittot
@ 2016-10-12 15:35           ` Joseph Salisbury
  2016-10-12 16:21           ` Joseph Salisbury
  1 sibling, 0 replies; 46+ messages in thread
From: Joseph Salisbury @ 2016-10-12 15:35 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Thomas Gleixner,
	LKML, Mike Galbraith

On 10/12/2016 08:20 AM, Vincent Guittot wrote:
> On 8 October 2016 at 13:49, Mike Galbraith <efault@gmx.de> wrote:
>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>>>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>>>
>>>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>>>>>> Hello Peter,
>>>>>>
>>>>>> A kernel bug report was opened against Ubuntu [0].  After a
>>>>>> kernel
>>>>>> bisect, it was found that reverting the following commit
>>>>>> resolved this bug:
>>>>>>
>>>>>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>>>>>> Author: Peter Zijlstra <peterz@infradead.org>
>>>>>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>>>>>
>>>>>>     sched/fair: Apply more PELT fixes
>>> This patch only speeds up the update of task group load in order to
>>> reflect the new load balance but It should not change the final value
>>> and as a result the final behavior. I will try to reproduce it in my
>>> target later today
>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
> Me too
>
> Is it possible to get some dump of  /proc/sched_debug while the problem occurs ?
Yes, I requested that data from the bug reporter.  I also built a test
kernel and requested testing of the repo Ingo pointed me at.  We should
have an update shortly.

Joe


>
> Vincent
>
>>         -Mike


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-12 12:20         ` Vincent Guittot
  2016-10-12 15:35           ` Joseph Salisbury
@ 2016-10-12 16:21           ` Joseph Salisbury
  2016-10-13 10:58             ` Vincent Guittot
  1 sibling, 1 reply; 46+ messages in thread
From: Joseph Salisbury @ 2016-10-12 16:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Thomas Gleixner,
	LKML, Mike Galbraith

On 10/12/2016 08:20 AM, Vincent Guittot wrote:
> On 8 October 2016 at 13:49, Mike Galbraith <efault@gmx.de> wrote:
>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>>>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>>>
>>>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>>>>>> Hello Peter,
>>>>>>
>>>>>> A kernel bug report was opened against Ubuntu [0].  After a
>>>>>> kernel
>>>>>> bisect, it was found that reverting the following commit
>>>>>> resolved this bug:
>>>>>>
>>>>>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>>>>>> Author: Peter Zijlstra <peterz@infradead.org>
>>>>>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>>>>>
>>>>>>     sched/fair: Apply more PELT fixes
>>> This patch only speeds up the update of task group load in order to
>>> reflect the new load balance but It should not change the final value
>>> and as a result the final behavior. I will try to reproduce it in my
>>> target later today
>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
> Me too
>
> Is it possible to get some dump of  /proc/sched_debug while the problem occurs ?
>
> Vincent
>
>>         -Mike

The output from /proc/sched_debug can be seen here:
http://paste.ubuntu.com/23312351/

Ingo, the latest scheduler bits also still exhibit the bug:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-12 16:21           ` Joseph Salisbury
@ 2016-10-13 10:58             ` Vincent Guittot
  2016-10-13 15:52               ` Joseph Salisbury
  0 siblings, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2016-10-13 10:58 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Thomas Gleixner,
	LKML, Mike Galbraith

Hi,

On 12 October 2016 at 18:21, Joseph Salisbury
<joseph.salisbury@canonical.com> wrote:
> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>> On 8 October 2016 at 13:49, Mike Galbraith <efault@gmx.de> wrote:
>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>>> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>>>>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>
>>>>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>>>>>>> Hello Peter,
>>>>>>>
>>>>>>> A kernel bug report was opened against Ubuntu [0].  After a
>>>>>>> kernel
>>>>>>> bisect, it was found that reverting the following commit
>>>>>>> resolved this bug:
>>>>>>>
>>>>>>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>>>>>>> Author: Peter Zijlstra <peterz@infradead.org>
>>>>>>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>>>>>>
>>>>>>>     sched/fair: Apply more PELT fixes
>>>> This patch only speeds up the update of task group load in order to
>>>> reflect the new load balance but It should not change the final value
>>>> and as a result the final behavior. I will try to reproduce it in my
>>>> target later today
>>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
>> Me too
>>
>> Is it possible to get some dump of  /proc/sched_debug while the problem occurs ?
>>
>> Vincent
>>
>>>         -Mike
>
> The output from /proc/shed_debug can be seen here:
> http://paste.ubuntu.com/23312351/

I have looked at the dump and there is something very odd for the
system.slice task group, where the display manager is running.
system.slice->tg_load_avg is around 381697, but tg_load_avg is
normally equal to the sum of system.slice[cpu]->tg_load_avg_contrib,
whereas the sum of system.slice[cpu]->tg_load_avg_contrib is only 1013
in our case. We can have some differences because the dump of
/proc/sched_debug is not atomic and some changes can happen in
between, but nothing like this difference.
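
To make that relation concrete, here is a toy user-space model of the
delta accounting that is supposed to keep tg->load_avg in sync with
the per-cpu contributions (field names mirror the kernel ones, but
this is only an illustration, not the kernel code):

/* Toy model of the tg load accounting, illustration only. */
#include <stdio.h>

#define NR_CPUS 4

struct cfs_rq_model {
        long load_avg;            /* cfs_rq->avg.load_avg */
        long tg_load_avg_contrib; /* what this cpu last published */
};

static long tg_load_avg;                   /* tg->load_avg */
static struct cfs_rq_model cfs_rq[NR_CPUS];

/* Publish this cpu's load change into the task group total. */
static void update_tg_load_avg(int cpu)
{
        long delta = cfs_rq[cpu].load_avg - cfs_rq[cpu].tg_load_avg_contrib;

        tg_load_avg += delta;
        cfs_rq[cpu].tg_load_avg_contrib = cfs_rq[cpu].load_avg;
}

int main(void)
{
        long sum = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                cfs_rq[cpu].load_avg = 250 + cpu; /* arbitrary per-cpu loads */
                update_tg_load_avg(cpu);
                sum += cfs_rq[cpu].tg_load_avg_contrib;
        }

        /* Invariant: tg_load_avg tracks the sum of the per-cpu contribs. */
        printf("tg_load_avg=%ld sum(contrib)=%ld\n", tg_load_avg, sum);
        return 0;
}

So, modulo races, the per-cpu deltas can only ever remove what was
contributed; a tg->load_avg of 381697 against contribs summing to 1013
means some contribution was added and never removed.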

The main effect of this very high value is that the weight/prio of
the sched_entity that represents system.slice in the root cfs_rq is
very low (lower than a task with the lowest nice prio), so the
system.slice task group rarely gets the CPU compared to the user.slice
task group: less than 1% for system.slice, where lightDM and xorg are
running, compared to 99% for user.slice, where the stress tasks are
running. This is confirmed by the se->avg.util_avg value of the task
groups, which reflects how much time each task group effectively runs
on a CPU:
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991
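
Roughly speaking, and ignoring the various scaling factors and
corrections (so this is only a back-of-the-envelope estimate, not the
kernel's exact calc_cfs_shares()), the weight of the group entity on a
cpu behaves like tg->shares * local_load / tg->load_avg. Plugging in
the numbers from the dump shows the collapse:

/* Back-of-the-envelope estimate, illustration only. */
#include <stdio.h>

int main(void)
{
        long tg_shares = 1024;       /* default group shares             */
        long local_load = 253;       /* ~ one cpu's part of the 1013 sum */
        long tg_load_sane = 1013;    /* what tg->load_avg should be      */
        long tg_load_buggy = 381697; /* what the dump actually shows     */

        /* the kernel clamps the result to a small minimum (MIN_SHARES) */
        printf("sane : se weight ~= %ld\n", tg_shares * local_load / tg_load_sane);
        printf("buggy: se weight ~= %ld\n", tg_shares * local_load / tg_load_buggy);
        return 0;
}

That is roughly 255 in the sane case versus the clamped minimum in the
buggy case, which matches the <1% vs 99% split seen in util_avg.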

This difference in weight/priority explains why the system becomes
unresponsive. What I can't explain for now is why
system.slice->tg_load_avg = 381697 whereas it should be around 1013,
and how the patch can generate this situation.

Is it possible to have a dump of /proc/sched_debug before starting
the stress command? That would show whether the problem is there from
the beginning but not noticed because the system is not overloaded, or
whether it only appears when the user starts to load the system.

Thanks,

>
> Ingo, the latest scheduler bits also still exhibit the bug:
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>
>


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-13 10:58             ` Vincent Guittot
@ 2016-10-13 15:52               ` Joseph Salisbury
  2016-10-13 16:48                 ` Vincent Guittot
  0 siblings, 1 reply; 46+ messages in thread
From: Joseph Salisbury @ 2016-10-13 15:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Thomas Gleixner,
	LKML, Mike Galbraith

On 10/13/2016 06:58 AM, Vincent Guittot wrote:
> Hi,
>
> On 12 October 2016 at 18:21, Joseph Salisbury
> <joseph.salisbury@canonical.com> wrote:
>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>>> On 8 October 2016 at 13:49, Mike Galbraith <efault@gmx.de> wrote:
>>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>>>> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>>>>>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>
>>>>>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>>>>>>>> Hello Peter,
>>>>>>>>
>>>>>>>> A kernel bug report was opened against Ubuntu [0].  After a
>>>>>>>> kernel
>>>>>>>> bisect, it was found that reverting the following commit
>>>>>>>> resolved this bug:
>>>>>>>>
>>>>>>>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>>>>>>>> Author: Peter Zijlstra <peterz@infradead.org>
>>>>>>>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>>>>>>>
>>>>>>>>     sched/fair: Apply more PELT fixes
>>>>> This patch only speeds up the update of task group load in order to
>>>>> reflect the new load balance but It should not change the final value
>>>>> and as a result the final behavior. I will try to reproduce it in my
>>>>> target later today
>>>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
>>> Me too
>>>
>>> Is it possible to get some dump of  /proc/sched_debug while the problem occurs ?
>>>
>>> Vincent
>>>
>>>>         -Mike
>> The output from /proc/shed_debug can be seen here:
>> http://paste.ubuntu.com/23312351/
> I have looked at the dump and there is something very odd for
> system.slice task group where the display manager is running.
> system.slice->tg_load_avg is around 381697 but  tg_load_avg is
> normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
> whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
> case. We can have some differences because the dump of
> /proc/shed_debug is not atomic and some changes can happen but nothing
> like this difference.
>
> The main effect of this quite high value is that the weight/prio of
> the sched_entity that represents system.slice in root cfs_rq is very
> low (lower than task with the smallest nice prio) so the system.slice
> task group will not get the CPU quite often compared to the user.slice
> task group: less than 1% for the system.slice where lightDM and xorg
> are running compared 99% for the user.slice where the stress tasks are
> running. This is confirmed by the se->avg.util_avg value of the task
> groups which reflect how much time each task group is effectively
> running on a CPU:
> system.slice[CPU3].se->avg.util_avg = 8 whereas
> user.slice[CPU3].se->avg.util_avg = 991
>
> This difference of weight/priority explains why the system becomes
> unresponsive. For now, I can't explain is why
> system.slice->tg_load_avg = 381697 whereas is should be around 1013
> and how the patch can generate this situation.
>
> Is it possible to have a dump of /proc/sched_debug before starting
> stress command ? to check if the problem is there from the beginning
> but not seen because not overloaded. Or if it the problem comes when
> user starts to load the system
Here is the dump before stress is started:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy

Here it is after:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy


>
> Thanks,
>
>> Ingo, the latest scheduler bits also still exhibit the bug:
>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>>
>>


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-13 15:52               ` Joseph Salisbury
@ 2016-10-13 16:48                 ` Vincent Guittot
  2016-10-13 18:49                   ` Dietmar Eggemann
  0 siblings, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2016-10-13 16:48 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Thomas Gleixner,
	LKML, Mike Galbraith

On 13 October 2016 at 17:52, Joseph Salisbury
<joseph.salisbury@canonical.com> wrote:
> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
>> Hi,
>>
>> On 12 October 2016 at 18:21, Joseph Salisbury
>> <joseph.salisbury@canonical.com> wrote:
>>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>>>> On 8 October 2016 at 13:49, Mike Galbraith <efault@gmx.de> wrote:
>>>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>>>>> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>>>>>>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>
>>>>>>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>>>>>>>>> Hello Peter,
>>>>>>>>>
>>>>>>>>> A kernel bug report was opened against Ubuntu [0].  After a
>>>>>>>>> kernel
>>>>>>>>> bisect, it was found that reverting the following commit
>>>>>>>>> resolved this bug:
>>>>>>>>>
>>>>>>>>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>>>>>>>>> Author: Peter Zijlstra <peterz@infradead.org>
>>>>>>>>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>>>>>>>>
>>>>>>>>>     sched/fair: Apply more PELT fixes
>>>>>> This patch only speeds up the update of task group load in order to
>>>>>> reflect the new load balance but It should not change the final value
>>>>>> and as a result the final behavior. I will try to reproduce it in my
>>>>>> target later today
>>>>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
>>>> Me too
>>>>
>>>> Is it possible to get some dump of  /proc/sched_debug while the problem occurs ?
>>>>
>>>> Vincent
>>>>
>>>>>         -Mike
>>> The output from /proc/shed_debug can be seen here:
>>> http://paste.ubuntu.com/23312351/
>> I have looked at the dump and there is something very odd for
>> system.slice task group where the display manager is running.
>> system.slice->tg_load_avg is around 381697 but  tg_load_avg is
>> normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
>> whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
>> case. We can have some differences because the dump of
>> /proc/shed_debug is not atomic and some changes can happen but nothing
>> like this difference.
>>
>> The main effect of this quite high value is that the weight/prio of
>> the sched_entity that represents system.slice in root cfs_rq is very
>> low (lower than task with the smallest nice prio) so the system.slice
>> task group will not get the CPU quite often compared to the user.slice
>> task group: less than 1% for the system.slice where lightDM and xorg
>> are running compared 99% for the user.slice where the stress tasks are
>> running. This is confirmed by the se->avg.util_avg value of the task
>> groups which reflect how much time each task group is effectively
>> running on a CPU:
>> system.slice[CPU3].se->avg.util_avg = 8 whereas
>> user.slice[CPU3].se->avg.util_avg = 991
>>
>> This difference of weight/priority explains why the system becomes
>> unresponsive. For now, I can't explain is why
>> system.slice->tg_load_avg = 381697 whereas is should be around 1013
>> and how the patch can generate this situation.
>>
>> Is it possible to have a dump of /proc/sched_debug before starting
>> stress command ? to check if the problem is there from the beginning
>> but not seen because not overloaded. Or if it the problem comes when
>> user starts to load the system
> Here is the dump before stress is started:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy

This one is OK.
The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
#12~lp1627108Commit3d30544Reverted
so this is without the culprit commit.

>
> Here it is after:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
>

This one has the exact same odd values for system.slice->tg_load_avg
as the 1st dump that you sent yesterday.
The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu,
so this dump has been done with a different kernel than the dump above.
As I can't find any stress task in the dump, I tend to believe that
the dump was taken before starting the stress tasks and not after
starting them. Can you confirm?

If I'm right, it means that the problem was already there before
starting the stress tasks.


>
>>
>> Thanks,
>>
>>> Ingo, the latest scheduler bits also still exhibit the bug:
>>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>>>
>>>
>


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-13 16:48                 ` Vincent Guittot
@ 2016-10-13 18:49                   ` Dietmar Eggemann
  2016-10-13 21:34                     ` Vincent Guittot
  0 siblings, 1 reply; 46+ messages in thread
From: Dietmar Eggemann @ 2016-10-13 18:49 UTC (permalink / raw)
  To: Vincent Guittot, Joseph Salisbury
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Thomas Gleixner,
	LKML, Mike Galbraith

On 13/10/16 17:48, Vincent Guittot wrote:
> On 13 October 2016 at 17:52, Joseph Salisbury
> <joseph.salisbury@canonical.com> wrote:
>> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
>>> Hi,
>>>
>>> On 12 October 2016 at 18:21, Joseph Salisbury
>>> <joseph.salisbury@canonical.com> wrote:
>>>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>>>>> On 8 October 2016 at 13:49, Mike Galbraith <efault@gmx.de> wrote:
>>>>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>>>>>> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>>>>>>>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>>
>>>>>>>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>>>>>>>>>> Hello Peter,
>>>>>>>>>>
>>>>>>>>>> A kernel bug report was opened against Ubuntu [0].  After a
>>>>>>>>>> kernel
>>>>>>>>>> bisect, it was found that reverting the following commit
>>>>>>>>>> resolved this bug:
>>>>>>>>>>
>>>>>>>>>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>>>>>>>>>> Author: Peter Zijlstra <peterz@infradead.org>
>>>>>>>>>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>>>>>>>>>
>>>>>>>>>>     sched/fair: Apply more PELT fixes
>>>>>>> This patch only speeds up the update of task group load in order to
>>>>>>> reflect the new load balance but It should not change the final value
>>>>>>> and as a result the final behavior. I will try to reproduce it in my
>>>>>>> target later today
>>>>>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
>>>>> Me too
>>>>>
>>>>> Is it possible to get some dump of  /proc/sched_debug while the problem occurs ?
>>>>>
>>>>> Vincent
>>>>>
>>>>>>         -Mike
>>>> The output from /proc/shed_debug can be seen here:
>>>> http://paste.ubuntu.com/23312351/
>>> I have looked at the dump and there is something very odd for
>>> system.slice task group where the display manager is running.
>>> system.slice->tg_load_avg is around 381697 but  tg_load_avg is
>>> normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
>>> whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
>>> case. We can have some differences because the dump of
>>> /proc/shed_debug is not atomic and some changes can happen but nothing
>>> like this difference.
>>>
>>> The main effect of this quite high value is that the weight/prio of
>>> the sched_entity that represents system.slice in root cfs_rq is very
>>> low (lower than task with the smallest nice prio) so the system.slice
>>> task group will not get the CPU quite often compared to the user.slice
>>> task group: less than 1% for the system.slice where lightDM and xorg
>>> are running compared 99% for the user.slice where the stress tasks are
>>> running. This is confirmed by the se->avg.util_avg value of the task
>>> groups which reflect how much time each task group is effectively
>>> running on a CPU:
>>> system.slice[CPU3].se->avg.util_avg = 8 whereas
>>> user.slice[CPU3].se->avg.util_avg = 991
>>>
>>> This difference of weight/priority explains why the system becomes
>>> unresponsive. For now, I can't explain is why
>>> system.slice->tg_load_avg = 381697 whereas is should be around 1013
>>> and how the patch can generate this situation.
>>>
>>> Is it possible to have a dump of /proc/sched_debug before starting
>>> stress command ? to check if the problem is there from the beginning
>>> but not seen because not overloaded. Or if it the problem comes when
>>> user starts to load the system
>> Here is the dump before stress is started:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
> 
> This one is ok.
> The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
> #12~lp1627108Commit3d30544Reverted
> so this is without the culprit commit
> 
>>
>> Here it is after:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
>>
> 
> This one has the exact same odds values for system.slice->tg_load_avg
> than the 1st dump that you sent yesterday
> The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu
> So this dump has been done with a different kernel than for the dump above.
> As I can't find any stress task in the dump, i tend to believe that
> the dump has been done before starting the stress tasks and not after
> starting them. Can you confirm ?
> 
> If i'm right, it mean that the problem was already there before
> starting stress tasks.

Could it be a problem I'm also seeing on my ARM64 Juno (6 logical cpus) w/o systemd
and w/o autogroup (tip/sched/core 447976ef4fd0)?

When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, tg_x->load_avg
becomes > 6*1024 before any task has run in it.

tg_x   : 0xffff800975800d80
tg_y_1 : 0xffff800975800c00
tg_y_2 : 0xffff80097543d200 

 mkdir-2177 [002] 117.235241: bprint: sched_online_group: tg=0xffff800975800d80 tg->parent=0xffff000008fd0300
 mkdir-2177 [002] 117.235244: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=0
 mkdir-2177 [002] 117.235247: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=1
 mkdir-2177 [002] 117.235249: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=2
 mkdir-2177 [002] 117.235251: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=3
 mkdir-2177 [002] 117.235253: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=4
 mkdir-2177 [002] 117.235255: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=5

 mkdir-2181 [002] 117.353155: bprint: sched_online_group: tg=0xffff800975800c00 tg->parent=0xffff800975800d80
 mkdir-2181 [002] 117.353158: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=0
 mkdir-2181 [002] 117.353162: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=0 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
 mkdir-2181 [002] 117.353164: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=1
 mkdir-2181 [002] 117.353167: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=1024 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
 mkdir-2181 [002] 117.353168: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=2
 mkdir-2181 [002] 117.353171: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=2048 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
 mkdir-2181 [002] 117.353173: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=3
 mkdir-2181 [002] 117.353175: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=3072 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
 mkdir-2181 [002] 117.353177: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=4
 mkdir-2181 [002] 117.353179: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=4096 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
 mkdir-2181 [002] 117.353180: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=5
 mkdir-2181 [002] 117.353183: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=5120 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024

 mkdir-2185 [001] 117.502980: bprint: sched_online_group: tg=0xffff80097543d200 tg->parent=0xffff800975800d80
 mkdir-2185 [001] 117.502982: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=0
 mkdir-2185 [001] 117.502987: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6144 cfs_rq->avg.load_avg=1068 cfs_rq->tg_load_avg_contrib=1024 delta=44
 mkdir-2185 [001] 117.502988: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=1
 mkdir-2185 [001] 117.502992: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6188 cfs_rq->avg.load_avg=1058 cfs_rq->tg_load_avg_contrib=1024 delta=34
 mkdir-2185 [001] 117.502993: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=2
 mkdir-2185 [001] 117.502996: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6222 cfs_rq->avg.load_avg=1092 cfs_rq->tg_load_avg_contrib=1024 delta=68
 mkdir-2185 [001] 117.502998: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=3
 mkdir-2185 [001] 117.503001: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6290 cfs_rq->avg.load_avg=1069 cfs_rq->tg_load_avg_contrib=1024 delta=45
 mkdir-2185 [001] 117.503002: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=4
 mkdir-2185 [001] 117.503005: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6335 cfs_rq->avg.load_avg=1064 cfs_rq->tg_load_avg_contrib=1024 delta=40
 mkdir-2185 [001] 117.503006: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=5
 mkdir-2185 [001] 117.503009: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=*6375* cfs_rq->avg.load_avg=1061 cfs_rq->tg_load_avg_contrib=1024 delta=37

We add delta=1024 to tg_x->load_avg during tg_root/tg_x/tg_y_1 initialization but only
delta=~40 for tg_root/tg_x/tg_y_2.   

system.slice->tg_load_avg = 381697 is still much higher than that, and AFAICS it's
an i5-5300U CPU, so only 4 logical cpus.

How many system.slice/foo groups actually exist in this system? I only see user.slice
related cfs_rq[x]:/autogroups-xxx on my Ubuntu 16.04 desktop system.

[...]


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-13 18:49                   ` Dietmar Eggemann
@ 2016-10-13 21:34                     ` Vincent Guittot
  2016-10-14  8:24                       ` Vincent Guittot
  0 siblings, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2016-10-13 21:34 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Joseph Salisbury, Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith

On 13 October 2016 at 20:49, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 13/10/16 17:48, Vincent Guittot wrote:
>> On 13 October 2016 at 17:52, Joseph Salisbury
>> <joseph.salisbury@canonical.com> wrote:
>>> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
>>>> Hi,
>>>>
>>>> On 12 October 2016 at 18:21, Joseph Salisbury
>>>> <joseph.salisbury@canonical.com> wrote:
>>>>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>>>>>> On 8 October 2016 at 13:49, Mike Galbraith <efault@gmx.de> wrote:
>>>>>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>>>>>>> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>>>>>>>>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>>>
>>>>>>>>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>>>>>>>>>>> Hello Peter,
>>>>>>>>>>>
>>>>>>>>>>> A kernel bug report was opened against Ubuntu [0].  After a
>>>>>>>>>>> kernel
>>>>>>>>>>> bisect, it was found that reverting the following commit
>>>>>>>>>>> resolved this bug:
>>>>>>>>>>>
>>>>>>>>>>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>>>>>>>>>>> Author: Peter Zijlstra <peterz@infradead.org>
>>>>>>>>>>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>>>>>>>>>>
>>>>>>>>>>>     sched/fair: Apply more PELT fixes
>>>>>>>> This patch only speeds up the update of task group load in order to
>>>>>>>> reflect the new load balance but It should not change the final value
>>>>>>>> and as a result the final behavior. I will try to reproduce it in my
>>>>>>>> target later today
>>>>>>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
>>>>>> Me too
>>>>>>
>>>>>> Is it possible to get some dump of  /proc/sched_debug while the problem occurs ?
>>>>>>
>>>>>> Vincent
>>>>>>
>>>>>>>         -Mike
>>>>> The output from /proc/shed_debug can be seen here:
>>>>> http://paste.ubuntu.com/23312351/
>>>> I have looked at the dump and there is something very odd for
>>>> system.slice task group where the display manager is running.
>>>> system.slice->tg_load_avg is around 381697 but  tg_load_avg is
>>>> normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
>>>> whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
>>>> case. We can have some differences because the dump of
>>>> /proc/shed_debug is not atomic and some changes can happen but nothing
>>>> like this difference.
>>>>
>>>> The main effect of this quite high value is that the weight/prio of
>>>> the sched_entity that represents system.slice in root cfs_rq is very
>>>> low (lower than task with the smallest nice prio) so the system.slice
>>>> task group will not get the CPU quite often compared to the user.slice
>>>> task group: less than 1% for the system.slice where lightDM and xorg
>>>> are running compared 99% for the user.slice where the stress tasks are
>>>> running. This is confirmed by the se->avg.util_avg value of the task
>>>> groups which reflect how much time each task group is effectively
>>>> running on a CPU:
>>>> system.slice[CPU3].se->avg.util_avg = 8 whereas
>>>> user.slice[CPU3].se->avg.util_avg = 991
>>>>
>>>> This difference of weight/priority explains why the system becomes
>>>> unresponsive. For now, I can't explain is why
>>>> system.slice->tg_load_avg = 381697 whereas is should be around 1013
>>>> and how the patch can generate this situation.
>>>>
>>>> Is it possible to have a dump of /proc/sched_debug before starting
>>>> stress command ? to check if the problem is there from the beginning
>>>> but not seen because not overloaded. Or if it the problem comes when
>>>> user starts to load the system
>>> Here is the dump before stress is started:
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
>>
>> This one is ok.
>> The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
>> #12~lp1627108Commit3d30544Reverted
>> so this is without the culprit commit
>>
>>>
>>> Here it is after:
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
>>>
>>
>> This one has the exact same odds values for system.slice->tg_load_avg
>> than the 1st dump that you sent yesterday
>> The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu
>> So this dump has been done with a different kernel than for the dump above.
>> As I can't find any stress task in the dump, i tend to believe that
>> the dump has been done before starting the stress tasks and not after
>> starting them. Can you confirm ?
>>
>> If i'm right, it mean that the problem was already there before
>> starting stress tasks.
>
> Could it be a problem I'm also seeing on my ARM64 Juno (6 logical cpus) w/o systemd
> and w/o autogroup (tip/sched/core 447976ef4fd0):
>
> When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the tg_x->load_avg
> becomes > 6*1024 before any tasks ran in it.

This is normal, as se->avg.load_avg is initialized to
scale_load_down(se->load.weight) and this se->avg.load_avg will be
added to tg_x[cpu]->cfs_rq->avg.load_avg when the entity is attached
to the cfs_rq.
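
A toy model of that accumulation on a 6-cpu system (illustration only,
not the kernel's attach path) reproduces the progression in the trace:

/* Toy model: creating one child group under tg_x on 6 cpus. */
#include <stdio.h>

int main(void)
{
        long tg_x_load_avg = 0;
        int cpu;

        for (cpu = 0; cpu < 6; cpu++) {
                /* new group se starts at scale_load_down(se->load.weight) */
                long se_load_avg = 1024;

                /* attach: the parent cfs_rq picks up the child's load,
                 * and the next tg update publishes the delta. */
                tg_x_load_avg += se_load_avg;
                printf("cpu=%d tg_x->load_avg=%ld\n", cpu, tg_x_load_avg);
        }
        return 0;
}

This reaches 6*1024 = 6144 without any task having run; the additional
small deltas from the second child group then push it slightly past
6*1024, which is the value observed in the trace above.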

>
> tg_x   : 0xffff800975800d80
> tg_y_1 : 0xffff800975800c00
> tg_y_2 : 0xffff80097543d200
>
>  mkdir-2177 [002] 117.235241: bprint: sched_online_group: tg=0xffff800975800d80 tg->parent=0xffff000008fd0300
>  mkdir-2177 [002] 117.235244: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=0
>  mkdir-2177 [002] 117.235247: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=1
>  mkdir-2177 [002] 117.235249: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=2
>  mkdir-2177 [002] 117.235251: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=3
>  mkdir-2177 [002] 117.235253: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=4
>  mkdir-2177 [002] 117.235255: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=5
>
>  mkdir-2181 [002] 117.353155: bprint: sched_online_group: tg=0xffff800975800c00 tg->parent=0xffff800975800d80
>  mkdir-2181 [002] 117.353158: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=0
>  mkdir-2181 [002] 117.353162: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=0 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>  mkdir-2181 [002] 117.353164: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=1
>  mkdir-2181 [002] 117.353167: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=1024 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>  mkdir-2181 [002] 117.353168: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=2
>  mkdir-2181 [002] 117.353171: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=2048 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>  mkdir-2181 [002] 117.353173: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=3
>  mkdir-2181 [002] 117.353175: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=3072 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>  mkdir-2181 [002] 117.353177: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=4
>  mkdir-2181 [002] 117.353179: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=4096 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>  mkdir-2181 [002] 117.353180: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=5
>  mkdir-2181 [002] 117.353183: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=5120 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>

This looks coherent, as tg_x->load_avg = 5120 == the sum of
cfs_rq[cpu]->load_avg, and cfs_rq->tg_load_avg_contrib will be set to
cfs_rq[cpu]->load_avg at the end of the function call.
In the buggy case, tg->load_avg = 381697 but the sum of
cfs_rq[cpu]->tg_load_avg_contrib = 1013 = the sum of
cfs_rq[cpu]->avg.load_avg, so there is no way to remove the 381697.

>  mkdir-2185 [001] 117.502980: bprint: sched_online_group: tg=0xffff80097543d200 tg->parent=0xffff800975800d80
>  mkdir-2185 [001] 117.502982: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=0
>  mkdir-2185 [001] 117.502987: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6144 cfs_rq->avg.load_avg=1068 cfs_rq->tg_load_avg_contrib=1024 delta=44
>  mkdir-2185 [001] 117.502988: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=1
>  mkdir-2185 [001] 117.502992: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6188 cfs_rq->avg.load_avg=1058 cfs_rq->tg_load_avg_contrib=1024 delta=34
>  mkdir-2185 [001] 117.502993: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=2
>  mkdir-2185 [001] 117.502996: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6222 cfs_rq->avg.load_avg=1092 cfs_rq->tg_load_avg_contrib=1024 delta=68
>  mkdir-2185 [001] 117.502998: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=3
>  mkdir-2185 [001] 117.503001: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6290 cfs_rq->avg.load_avg=1069 cfs_rq->tg_load_avg_contrib=1024 delta=45
>  mkdir-2185 [001] 117.503002: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=4
>  mkdir-2185 [001] 117.503005: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6335 cfs_rq->avg.load_avg=1064 cfs_rq->tg_load_avg_contrib=1024 delta=40
>  mkdir-2185 [001] 117.503006: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=5
>  mkdir-2185 [001] 117.503009: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=*6375* cfs_rq->avg.load_avg=1061 cfs_rq->tg_load_avg_contrib=1024 delta=37
>
> We add delta=1024 to tg_x->load_avg during tg_root/tg_x/tg_y_1 initialization but only
> delta=~40 for tg_root/tg_x/tg_y_2.

IMO, this comes from the decay of tg_y_2->se[cpu]->avg.load_avg
before it is attached to tg_x.

But I don't think it's the same issue, as tg_x->load_avg is still
coherent with the sum of cfs_rq->avg.load_avg and the sum of
cfs_rq->tg_load_avg_contrib.

>
> system.slice->tg_load_avg = 381697 is still pretty much higher and AFAICS, it's an
> i5-5300U CPU so only 4 logical cpus.
>
> How many system.slice/foo's actually exists in this system? I only see user.slice

There are several dozen child groups in system.slice, but all

> related cfs_rq[x]:/autogroups-xxx on my Ubuntu-16.04 Desktop system.
>
> [...]


* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-13 21:34                     ` Vincent Guittot
@ 2016-10-14  8:24                       ` Vincent Guittot
  2016-10-14 13:10                         ` Dietmar Eggemann
  0 siblings, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2016-10-14  8:24 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Joseph Salisbury, Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith

On 13 October 2016 at 23:34, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 13 October 2016 at 20:49, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 13/10/16 17:48, Vincent Guittot wrote:
>>> On 13 October 2016 at 17:52, Joseph Salisbury
>>> <joseph.salisbury@canonical.com> wrote:
>>>> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
>>>>> Hi,
>>>>>
>>>>> On 12 October 2016 at 18:21, Joseph Salisbury
>>>>> <joseph.salisbury@canonical.com> wrote:
>>>>>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>>>>>>> On 8 October 2016 at 13:49, Mike Galbraith <efault@gmx.de> wrote:
>>>>>>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>>>>>>>> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>>>>>>>>>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>>>>>>>>>>>> Hello Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> A kernel bug report was opened against Ubuntu [0].  After a
>>>>>>>>>>>> kernel
>>>>>>>>>>>> bisect, it was found that reverting the following commit
>>>>>>>>>>>> resolved this bug:
>>>>>>>>>>>>
>>>>>>>>>>>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>>>>>>>>>>>> Author: Peter Zijlstra <peterz@infradead.org>
>>>>>>>>>>>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>>>>>>>>>>>
>>>>>>>>>>>>     sched/fair: Apply more PELT fixes
>>>>>>>>> This patch only speeds up the update of task group load in order to
>>>>>>>>> reflect the new load balance but It should not change the final value
>>>>>>>>> and as a result the final behavior. I will try to reproduce it in my
>>>>>>>>> target later today
>>>>>>>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
>>>>>>> Me too
>>>>>>>
>>>>>>> Is it possible to get some dump of  /proc/sched_debug while the problem occurs ?
>>>>>>>
>>>>>>> Vincent
>>>>>>>
>>>>>>>>         -Mike
>>>>>> The output from /proc/shed_debug can be seen here:
>>>>>> http://paste.ubuntu.com/23312351/
>>>>> I have looked at the dump and there is something very odd for
>>>>> system.slice task group where the display manager is running.
>>>>> system.slice->tg_load_avg is around 381697 but  tg_load_avg is
>>>>> normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
>>>>> whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
>>>>> case. We can have some differences because the dump of
>>>>> /proc/shed_debug is not atomic and some changes can happen but nothing
>>>>> like this difference.
>>>>>
>>>>> The main effect of this quite high value is that the weight/prio of
>>>>> the sched_entity that represents system.slice in root cfs_rq is very
>>>>> low (lower than task with the smallest nice prio) so the system.slice
>>>>> task group will not get the CPU quite often compared to the user.slice
>>>>> task group: less than 1% for the system.slice where lightDM and xorg
>>>>> are running compared 99% for the user.slice where the stress tasks are
>>>>> running. This is confirmed by the se->avg.util_avg value of the task
>>>>> groups which reflect how much time each task group is effectively
>>>>> running on a CPU:
>>>>> system.slice[CPU3].se->avg.util_avg = 8 whereas
>>>>> user.slice[CPU3].se->avg.util_avg = 991
>>>>>
>>>>> This difference of weight/priority explains why the system becomes
>>>>> unresponsive. For now, I can't explain is why
>>>>> system.slice->tg_load_avg = 381697 whereas is should be around 1013
>>>>> and how the patch can generate this situation.
>>>>>
>>>>> Is it possible to have a dump of /proc/sched_debug before starting
>>>>> stress command ? to check if the problem is there from the beginning
>>>>> but not seen because not overloaded. Or if it the problem comes when
>>>>> user starts to load the system
>>>> Here is the dump before stress is started:
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
>>>
>>> This one is ok.
>>> The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
>>> #12~lp1627108Commit3d30544Reverted
>>> so this is without the culprit commit
>>>
>>>>
>>>> Here it is after:
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
>>>>
>>>
>>> This one has the exact same odds values for system.slice->tg_load_avg
>>> than the 1st dump that you sent yesterday
>>> The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu
>>> So this dump has been done with a different kernel than for the dump above.
>>> As I can't find any stress task in the dump, i tend to believe that
>>> the dump has been done before starting the stress tasks and not after
>>> starting them. Can you confirm ?
>>>
>>> If I'm right, it means that the problem was already there before
>>> starting the stress tasks.
>>
>> Could it be a problem I'm also seeing on my ARM64 Juno (6 logical cpus) w/o systemd
>> and w/o autogroup (tip/sched/core 447976ef4fd0):
>>
>> When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the tg_x->load_avg
>> becomes > 6*1024 before any tasks ran in it.
>
> This is normal as se->avg.load_avg is initialized to
> scale_load_down(se->load.weight) and this se->avg.load_avg will be
> added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq
>
>>
>> tg_x   : 0xffff800975800d80
>> tg_y_1 : 0xffff800975800c00
>> tg_y_2 : 0xffff80097543d200
>>
>>  mkdir-2177 [002] 117.235241: bprint: sched_online_group: tg=0xffff800975800d80 tg->parent=0xffff000008fd0300
>>  mkdir-2177 [002] 117.235244: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=0
>>  mkdir-2177 [002] 117.235247: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=1
>>  mkdir-2177 [002] 117.235249: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=2
>>  mkdir-2177 [002] 117.235251: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=3
>>  mkdir-2177 [002] 117.235253: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=4
>>  mkdir-2177 [002] 117.235255: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=5
>>
>>  mkdir-2181 [002] 117.353155: bprint: sched_online_group: tg=0xffff800975800c00 tg->parent=0xffff800975800d80
>>  mkdir-2181 [002] 117.353158: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=0
>>  mkdir-2181 [002] 117.353162: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=0 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>>  mkdir-2181 [002] 117.353164: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=1
>>  mkdir-2181 [002] 117.353167: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=1024 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>>  mkdir-2181 [002] 117.353168: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=2
>>  mkdir-2181 [002] 117.353171: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=2048 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>>  mkdir-2181 [002] 117.353173: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=3
>>  mkdir-2181 [002] 117.353175: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=3072 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>>  mkdir-2181 [002] 117.353177: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=4
>>  mkdir-2181 [002] 117.353179: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=4096 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>>  mkdir-2181 [002] 117.353180: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=5
>>  mkdir-2181 [002] 117.353183: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=5120 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>>
>
> This looks coherent, as tg_x->load_avg = 5120 == sum of
> cfs_rq[cpu]->load_avg, and cfs_rq->tg_load_avg_contrib will be set to
> cfs_rq[cpu]->load_avg at the end of the function call.
> In the buggy case, tg->load_avg = 381697 but the sum of
> cfs_rq[cpu]->tg_load_avg_contrib = 1013 = cfs_rq[cpu]->avg.load_avg, so
> there is no way to remove the 381697.
>
>>  mkdir-2185 [001] 117.502980: bprint: sched_online_group: tg=0xffff80097543d200 tg->parent=0xffff800975800d80
>>  mkdir-2185 [001] 117.502982: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=0
>>  mkdir-2185 [001] 117.502987: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6144 cfs_rq->avg.load_avg=1068 cfs_rq->tg_load_avg_contrib=1024 delta=44
>>  mkdir-2185 [001] 117.502988: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=1
>>  mkdir-2185 [001] 117.502992: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6188 cfs_rq->avg.load_avg=1058 cfs_rq->tg_load_avg_contrib=1024 delta=34
>>  mkdir-2185 [001] 117.502993: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=2
>>  mkdir-2185 [001] 117.502996: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6222 cfs_rq->avg.load_avg=1092 cfs_rq->tg_load_avg_contrib=1024 delta=68
>>  mkdir-2185 [001] 117.502998: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=3
>>  mkdir-2185 [001] 117.503001: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6290 cfs_rq->avg.load_avg=1069 cfs_rq->tg_load_avg_contrib=1024 delta=45
>>  mkdir-2185 [001] 117.503002: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=4
>>  mkdir-2185 [001] 117.503005: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6335 cfs_rq->avg.load_avg=1064 cfs_rq->tg_load_avg_contrib=1024 delta=40
>>  mkdir-2185 [001] 117.503006: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=5
>>  mkdir-2185 [001] 117.503009: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=*6375* cfs_rq->avg.load_avg=1061 cfs_rq->tg_load_avg_contrib=1024 delta=37
>>
>> We add delta=1024 to tg_x->load_avg during tg_root/tg_x/tg_y_1 initialization but only
>> delta=~40 for tg_root/tg_x/tg_y_2.
>
> IMO, this comes from the decay of tg_y_2->se[cpu]->avg.load_avg before
> it is attached to tg_x.
>
> But I don't think that it's the same issue, as tg_x->load_avg is
> still coherent with the sum of cfs_rq->avg.load_avg and the sum of
> cfs_rq->tg_load_avg_contrib.
>
>>
>> system.slice->tg_load_avg = 381697 is still pretty much higher and AFAICS, it's an
>> i5-5300U CPU so only 4 logical cpus.
>>
>> How many system.slice/foo's actually exist in this system? I only see user.slice
>
> There are several dozen children in system.slice, but all

but all children have null load_avg and load_avg_contrib.

>
>> related cfs_rq[x]:/autogroups-xxx on my Ubuntu-16.04 Desktop system.
>>
>> [...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-14  8:24                       ` Vincent Guittot
@ 2016-10-14 13:10                         ` Dietmar Eggemann
  2016-10-14 15:18                           ` Vincent Guittot
  0 siblings, 1 reply; 46+ messages in thread
From: Dietmar Eggemann @ 2016-10-14 13:10 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Joseph Salisbury, Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith

On 14/10/16 09:24, Vincent Guittot wrote:
> On 13 October 2016 at 23:34, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>> On 13 October 2016 at 20:49, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>> On 13/10/16 17:48, Vincent Guittot wrote:
>>>> On 13 October 2016 at 17:52, Joseph Salisbury
>>>> <joseph.salisbury@canonical.com> wrote:
>>>>> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 12 October 2016 at 18:21, Joseph Salisbury
>>>>>> <joseph.salisbury@canonical.com> wrote:
>>>>>>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>>>>>>>> On 8 October 2016 at 13:49, Mike Galbraith <efault@gmx.de> wrote:
>>>>>>>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>>>>>>>>> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>>>>>>>>>>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:

[...]

>>> When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the tg_x->load_avg
>>> becomes > 6*1024 before any tasks ran in it.
>>
>> This is normal as se->avg.load_avg is initialized to
>> scale_load_down(se->load.weight) and this se->avg.load_avg will be
>> added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq

Yeah, you're right; even when I've created 50 second-level groups,
tg_x->load_avg is ~6800.

Could it have something to do with the fact that .se->load.weight = 2
for all these task groups on a 64-bit system?

So when we call __update_load_avg(..., se->on_rq *
scale_load_down(se->load.weight), ...), we pass a weight argument of 0
for these se's.

Does not happen with:

-       if (shares < MIN_SHARES)
-               shares = MIN_SHARES;
+       if (shares < scale_load(MIN_SHARES))
+               shares = scale_load(MIN_SHARES);

in  calc_cfs_shares().
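
(For reference, on a 64-bit kernel scale_load_down() shifts right by
SCHED_FIXEDPOINT_SHIFT (10), so a group se->load.weight of 2 gives
scale_load_down(2) = 2 >> 10 = 0, which is how the weight argument of 0
ends up being passed to __update_load_avg() above.)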

[...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-14 13:10                         ` Dietmar Eggemann
@ 2016-10-14 15:18                           ` Vincent Guittot
  2016-10-14 16:04                             ` Joseph Salisbury
  0 siblings, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2016-10-14 15:18 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Joseph Salisbury, Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith

On Friday 14 Oct 2016 at 14:10:07 (+0100), Dietmar Eggemann wrote:
> On 14/10/16 09:24, Vincent Guittot wrote:
> > On 13 October 2016 at 23:34, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> >> On 13 October 2016 at 20:49, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>> On 13/10/16 17:48, Vincent Guittot wrote:
> >>>> On 13 October 2016 at 17:52, Joseph Salisbury
> >>>> <joseph.salisbury@canonical.com> wrote:
> >>>>> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 12 October 2016 at 18:21, Joseph Salisbury
> >>>>>> <joseph.salisbury@canonical.com> wrote:
> >>>>>>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
> >>>>>>>> On 8 October 2016 at 13:49, Mike Galbraith <efault@gmx.de> wrote:
> >>>>>>>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
> >>>>>>>>>> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
> >>>>>>>>>>> * Peter Zijlstra <peterz@infradead.org> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
> 
> [...]
> 
> >>> When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the tg_x->load_avg
> >>> becomes > 6*1024 before any tasks ran in it.
> >>
> >> This is normal as se->avg.load_avg is initialized to
> >> scale_load_down(se->load.weight) and this se->avg.load_avg will be
> >> added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq
> 
> Yeah, you right, even when I've created 50 second level groups,
> tg_x->load_avg is ~6800.
> 
> Could it have something to do with the fact that .se->load.weight = 2
> for all these task groups? on a 64bit system?

I don't think so. The problem really comes from tg->load_avg = 381697
while the sum of cfs_rq[cpu]->tg_load_avg_contrib = 1013, which is << tg->load_avg,
and cfs_rq[cpu]->tg_load_avg_contrib == cfs_rq[cpu]->avg.load_avg, so we can't
expect any negative delta to remove this large value.
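
To make the arithmetic concrete, here is a toy model of the contrib/delta
mechanism (userspace illustration only, not kernel code; the way the 1013 is
split across the 4 cpus is made up):

	/* Toy model: each cfs_rq can only ever subtract its own last
	 * published contribution (delta = load_avg - contrib) from
	 * tg->load_avg, so a stale tg->load_avg can never decay out.
	 */
	#include <stdio.h>

	int main(void)
	{
		long tg_load_avg = 381697;			/* observed stale value */
		long contrib[4] = { 250, 250, 250, 263 };	/* sums to 1013 */
		int cpu;

		/* best case: every cfs_rq decays to 0 and publishes its delta */
		for (cpu = 0; cpu < 4; cpu++)
			tg_load_avg += 0 - contrib[cpu];

		printf("%ld\n", tg_load_avg);			/* still 380684 */
		return 0;
	}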

> 
> In case we call  __update_load_avg(..., se->on_rq *
> scale_load_down(se->load.weight), ...) we pass a weight argument of 0
> for these se's.
> 
> Does not happen with:
> 
> -       if (shares < MIN_SHARES)
> -               shares = MIN_SHARES;
> +       if (shares < scale_load(MIN_SHARES))
> +               shares = scale_load(MIN_SHARES);
> 
> in  calc_cfs_shares().
> 
> [...]
> 
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-14 15:18                           ` Vincent Guittot
@ 2016-10-14 16:04                             ` Joseph Salisbury
  2016-10-17  9:09                               ` Vincent Guittot
  0 siblings, 1 reply; 46+ messages in thread
From: Joseph Salisbury @ 2016-10-14 16:04 UTC (permalink / raw)
  To: Vincent Guittot, Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Thomas Gleixner,
	LKML, Mike Galbraith, omer.akram

On 10/14/2016 11:18 AM, Vincent Guittot wrote:
> Le Friday 14 Oct 2016 à 14:10:07 (+0100), Dietmar Eggemann a écrit :
>> On 14/10/16 09:24, Vincent Guittot wrote:
>>> On 13 October 2016 at 23:34, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>>> On 13 October 2016 at 20:49, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>> On 13/10/16 17:48, Vincent Guittot wrote:
>>>>>> On 13 October 2016 at 17:52, Joseph Salisbury
>>>>>> <joseph.salisbury@canonical.com> wrote:
>>>>>>> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 12 October 2016 at 18:21, Joseph Salisbury
>>>>>>>> <joseph.salisbury@canonical.com> wrote:
>>>>>>>>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>>>>>>>>>> On 8 October 2016 at 13:49, Mike Galbraith <efault@gmx.de> wrote:
>>>>>>>>>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>>>>>>>>>>> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>>>>>>>>>>>>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>> [...]
>>
>>>>> When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the tg_x->load_avg
>>>>> becomes > 6*1024 before any tasks ran in it.
>>>> This is normal as se->avg.load_avg is initialized to
>>>> scale_load_down(se->load.weight) and this se->avg.load_avg will be
>>>> added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq
>> Yeah, you right, even when I've created 50 second level groups,
>> tg_x->load_avg is ~6800.
>>
>> Could it have something to do with the fact that .se->load.weight = 2
>> for all these task groups? on a 64bit system?
> I don't think so, the problem really comes from tg->load_avg = 381697
> but sum of cfs_rq[cpu]->tg_load_avg_contrib = 1013 which is << tg->load_avg
> and cfs_rq[cpu]->tg_load_avg_contrib == cfs_rq[cpu]->avg.load_avg so we can't 
> expect any negative delta to remove this large value
>
>> In case we call  __update_load_avg(..., se->on_rq *
>> scale_load_down(se->load.weight), ...) we pass a weight argument of 0
>> for these se's.
>>
>> Does not happen with:
>>
>> -       if (shares < MIN_SHARES)
>> -               shares = MIN_SHARES;
>> +       if (shares < scale_load(MIN_SHARES))
>> +               shares = scale_load(MIN_SHARES);
>>
>> in  calc_cfs_shares().
>>
>> [...]
>>
>>
Adding Omer to CC list, as he is able to reproduce this bug.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-14 16:04                             ` Joseph Salisbury
@ 2016-10-17  9:09                               ` Vincent Guittot
  2016-10-17 11:49                                 ` Dietmar Eggemann
  0 siblings, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2016-10-17  9:09 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On Friday 14 Oct 2016 at 12:04:02 (-0400), Joseph Salisbury wrote:
> On 10/14/2016 11:18 AM, Vincent Guittot wrote:
> > Le Friday 14 Oct 2016 à 14:10:07 (+0100), Dietmar Eggemann a écrit :
> >> On 14/10/16 09:24, Vincent Guittot wrote:
> >>> On 13 October 2016 at 23:34, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> >>>> On 13 October 2016 at 20:49, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>>>> On 13/10/16 17:48, Vincent Guittot wrote:
> >>>>>> On 13 October 2016 at 17:52, Joseph Salisbury
> >>>>>> <joseph.salisbury@canonical.com> wrote:
> >>>>>>> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> On 12 October 2016 at 18:21, Joseph Salisbury
> >>>>>>>> <joseph.salisbury@canonical.com> wrote:
> >>>>>>>>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
> >>>>>>>>>> On 8 October 2016 at 13:49, Mike Galbraith <efault@gmx.de> wrote:
> >>>>>>>>>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
> >>>>>>>>>>>> On 8 October 2016 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
> >>>>>>>>>>>>> * Peter Zijlstra <peterz@infradead.org> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
> >> [...]
> >>
> >>>>> When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the tg_x->load_avg
> >>>>> becomes > 6*1024 before any tasks ran in it.
> >>>> This is normal as se->avg.load_avg is initialized to
> >>>> scale_load_down(se->load.weight) and this se->avg.load_avg will be
> >>>> added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq
> >> Yeah, you right, even when I've created 50 second level groups,
> >> tg_x->load_avg is ~6800.
> >>
> >> Could it have something to do with the fact that .se->load.weight = 2
> >> for all these task groups? on a 64bit system?
> > I don't think so, the problem really comes from tg->load_avg = 381697
> > but sum of cfs_rq[cpu]->tg_load_avg_contrib = 1013 which is << tg->load_avg
> > and cfs_rq[cpu]->tg_load_avg_contrib == cfs_rq[cpu]->avg.load_avg so we can't 
> > expect any negative delta to remove this large value
> >
> >> In case we call  __update_load_avg(..., se->on_rq *
> >> scale_load_down(se->load.weight), ...) we pass a weight argument of 0
> >> for these se's.
> >>
> >> Does not happen with:
> >>
> >> -       if (shares < MIN_SHARES)
> >> -               shares = MIN_SHARES;
> >> +       if (shares < scale_load(MIN_SHARES))
> >> +               shares = scale_load(MIN_SHARES);
> >>
> >> in  calc_cfs_shares().
> >>
> >> [...]
> >>
> >>
> Adding Omer to CC list, as he is able to reproduce this bug.

Could you try the patch below on top of the faulty kernel?

---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b03fb5..8926685 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2902,7 +2902,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 {
-	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	unsigned long load_avg = READ_ONCE(cfs_rq->avg.load_avg);
+	long delta = load_avg - cfs_rq->tg_load_avg_contrib;
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -2912,7 +2913,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 
 	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
-		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->tg_load_avg_contrib = load_avg;
 	}
 }
 
-- 
2.7.4


> 

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-17  9:09                               ` Vincent Guittot
@ 2016-10-17 11:49                                 ` Dietmar Eggemann
  2016-10-17 13:19                                   ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Dietmar Eggemann @ 2016-10-17 11:49 UTC (permalink / raw)
  To: Vincent Guittot, Joseph Salisbury
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Thomas Gleixner,
	LKML, Mike Galbraith, omer.akram

Hi Vincent,

On 17/10/16 10:09, Vincent Guittot wrote:
> Le Friday 14 Oct 2016 à 12:04:02 (-0400), Joseph Salisbury a écrit :
>> On 10/14/2016 11:18 AM, Vincent Guittot wrote:
>>> Le Friday 14 Oct 2016 à 14:10:07 (+0100), Dietmar Eggemann a écrit :
>>>> On 14/10/16 09:24, Vincent Guittot wrote:

[...]

> Could you try the patch below on top of the faulty kernel ?
> 
> ---
>  kernel/sched/fair.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..8926685 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2902,7 +2902,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>   */
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  {
> -	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> +	unsigned long load_avg = READ_ONCE(cfs_rq->avg.load_avg);
> +	long delta = load_avg - cfs_rq->tg_load_avg_contrib;
>  
>  	/*
>  	 * No need to update load_avg for root_task_group as it is not used.
> @@ -2912,7 +2913,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  
>  	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>  		atomic_long_add(delta, &cfs_rq->tg->load_avg);
> -		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> +		cfs_rq->tg_load_avg_contrib = load_avg;
>  	}
>  }

I tested it on an Ubuntu 16.10 Server (on top of the default 4.8.0-22-generic
kernel) on a Lenovo T430, and it didn't help.

What seems to cure it is to get rid of this snippet (part of the commit
mentioned earlier in this thread: 3d30544f0212):

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 039de34f1521..16c692049fbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -726,7 +726,6 @@ void post_init_entity_util_avg(struct sched_entity *se)
        struct sched_avg *sa = &se->avg;
        long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
        u64 now = cfs_rq_clock_task(cfs_rq);
-       int tg_update;
 
        if (cap > 0) {
                if (cfs_rq->avg.util_avg != 0) {
@@ -759,10 +758,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
                }
        }
 
-       tg_update = update_cfs_rq_load_avg(now, cfs_rq, false);
+       update_cfs_rq_load_avg(now, cfs_rq, false);
        attach_entity_load_avg(cfs_rq, se);
-       if (tg_update)
-               update_tg_load_avg(cfs_rq, false);
 }
 
 #else /* !CONFIG_SMP */

BTW, I guess .tg_load_avg can initially reach ~300000-400000 on such a system
because systemd creates all ~100 services (and therefore the corresponding
second-level tg's) at once. In my previous example, there was 500ms between
the creation of two tg's, so there was a lot of decaying going on in between.
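
As a back-of-the-envelope check (assuming each of those ~100 tg's initially
contributes 1024 per logical cpu on a 4-cpu machine):

	100 tg's * 4 cpus * 1024 = 409600

which, minus whatever decays between the creations, is right in the observed
300000-400000 range.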

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-17 11:49                                 ` Dietmar Eggemann
@ 2016-10-17 13:19                                   ` Peter Zijlstra
  2016-10-17 13:54                                     ` Vincent Guittot
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-17 13:19 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8b03fb5..8926685 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2902,7 +2902,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >   */
> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
> >  {
> > -	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > +	unsigned long load_avg = READ_ONCE(cfs_rq->avg.load_avg);
> > +	long delta = load_avg - cfs_rq->tg_load_avg_contrib;
> >  
> >  	/*
> >  	 * No need to update load_avg for root_task_group as it is not used.
> > @@ -2912,7 +2913,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
> >  
> >  	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> >  		atomic_long_add(delta, &cfs_rq->tg->load_avg);
> > -		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> > +		cfs_rq->tg_load_avg_contrib = load_avg;
> >  	}
> >  }
> 
> Tested it on an Ubuntu 16.10 Server (on top of the default 4.8.0-22-generic
> kernel) on a Lenovo T430 and it didn't help.

Right, I don't think that race exists; we update cfs_rq->avg.load_avg
with rq->lock held and have it held here, so it cannot change under us.

This could do with a few lockdep_assert_held() instances to make that
explicit, though.
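
Something like this, I suppose (untested sketch on top of the current code):

 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 {
 	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 
+	/* cfs_rq->avg.load_avg only ever changes with rq->lock held */
+	lockdep_assert_held(&rq_of(cfs_rq)->lock);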

> What seems to cure it is to get rid of this snippet (part of the commit
> mentioned earlier in this thread: 3d30544f0212):
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 039de34f1521..16c692049fbf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -726,7 +726,6 @@ void post_init_entity_util_avg(struct sched_entity *se)
>         struct sched_avg *sa = &se->avg;
>         long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
>         u64 now = cfs_rq_clock_task(cfs_rq);
> -       int tg_update;
>  
>         if (cap > 0) {
>                 if (cfs_rq->avg.util_avg != 0) {
> @@ -759,10 +758,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
>                 }
>         }
>  
> -       tg_update = update_cfs_rq_load_avg(now, cfs_rq, false);
> +       update_cfs_rq_load_avg(now, cfs_rq, false);
>         attach_entity_load_avg(cfs_rq, se);
> -       if (tg_update)
> -               update_tg_load_avg(cfs_rq, false);
>  }
>  
>  #else /* !CONFIG_SMP */
> 
> BTW, I guess we can reach .tg_load_avg up to ~300000-400000 on such a system
> initially because systemd will create all ~100 services (and therefore the
> corresponding 2. level tg's) at once. In my previous example, there was 500ms
> between the creation of 2 tg's so there was a lot of decaying going on in between.

Cute... on current kernels that translates to simply removing the call
to update_tg_load_avg(). Let's see if we can figure out what goes
sideways first, though, because it _should_ decay back out. And if that
can fail here, I'm not seeing why it wouldn't fail elsewhere either.

I'll see if I can reproduce this with a script creating heaps of cgroups
in a hurry; I have a total lack of system-disease on all my machines.


/me goes prod..

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-17 13:19                                   ` Peter Zijlstra
@ 2016-10-17 13:54                                     ` Vincent Guittot
  2016-10-17 22:52                                       ` Dietmar Eggemann
  0 siblings, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2016-10-17 13:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On 17 October 2016 at 15:19, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:
>
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 8b03fb5..8926685 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -2902,7 +2902,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> >   */
>> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>> >  {
>> > -   long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>> > +   unsigned long load_avg = READ_ONCE(cfs_rq->avg.load_avg);
>> > +   long delta = load_avg - cfs_rq->tg_load_avg_contrib;
>> >
>> >     /*
>> >      * No need to update load_avg for root_task_group as it is not used.
>> > @@ -2912,7 +2913,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>> >
>> >     if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>> >             atomic_long_add(delta, &cfs_rq->tg->load_avg);
>> > -           cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>> > +           cfs_rq->tg_load_avg_contrib = load_avg;
>> >     }
>> >  }
>>
>> Tested it on an Ubuntu 16.10 Server (on top of the default 4.8.0-22-generic
>> kernel) on a Lenovo T430 and it didn't help.
>
> Right, I don't think that race exists, we update cfs_rq->avg.load_avg
> with rq->lock held and have it held here, so it cannot change under us.

Yes, I agree. It was just to be sure that an unexpected race condition
doesn't happen.

>
> This might do with a few lockdep_assert_held() instances to clarify this
> though.
>
>> What seems to cure it is to get rid of this snippet (part of the commit
>> mentioned earlier in this thread: 3d30544f0212):
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 039de34f1521..16c692049fbf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -726,7 +726,6 @@ void post_init_entity_util_avg(struct sched_entity *se)
>>         struct sched_avg *sa = &se->avg;
>>         long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
>>         u64 now = cfs_rq_clock_task(cfs_rq);
>> -       int tg_update;
>>
>>         if (cap > 0) {
>>                 if (cfs_rq->avg.util_avg != 0) {
>> @@ -759,10 +758,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
>>                 }
>>         }
>>
>> -       tg_update = update_cfs_rq_load_avg(now, cfs_rq, false);
>> +       update_cfs_rq_load_avg(now, cfs_rq, false);
>>         attach_entity_load_avg(cfs_rq, se);
>> -       if (tg_update)
>> -               update_tg_load_avg(cfs_rq, false);
>>  }
>>
>>  #else /* !CONFIG_SMP */
>>
>> BTW, I guess we can reach .tg_load_avg up to ~300000-400000 on such a system
>> initially because systemd will create all ~100 services (and therefore the
>> corresponding 2. level tg's) at once. In my previous example, there was 500ms
>> between the creation of 2 tg's so there was a lot of decaying going on in between.
>
> Cute... on current kernels that translates to simply removing the call
> to update_tg_load_avg(), lets see if we can figure out what goes
> sideways first though, because it _should_ decay back out. And if that

Yes, reaching ~300000-400000 is not an issue in itself; the problem is
that load_avg has decayed but the decay has not been reflected in
tg->load_avg in the buggy case.

> can fail here, I'm not seeing why that wouldn't fail elsewhere either.
>
> I'll see if I can reproduce this with a script creating heaps of cgroups
> in a hurry, I have a total lack of system-disease on all my machines.
>
>
> /me goes prod..

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-17 13:54                                     ` Vincent Guittot
@ 2016-10-17 22:52                                       ` Dietmar Eggemann
  2016-10-18  8:43                                         ` Vincent Guittot
  2016-10-18  9:07                                         ` Peter Zijlstra
  0 siblings, 2 replies; 46+ messages in thread
From: Dietmar Eggemann @ 2016-10-17 22:52 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: Joseph Salisbury, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
	LKML, Mike Galbraith, omer.akram

On 10/17/2016 02:54 PM, Vincent Guittot wrote:
> On 17 October 2016 at 15:19, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:

[...]

>>> BTW, I guess we can reach .tg_load_avg up to ~300000-400000 on such a system
>>> initially because systemd will create all ~100 services (and therefore the
>>> corresponding 2. level tg's) at once. In my previous example, there was 500ms
>>> between the creation of 2 tg's so there was a lot of decaying going on in between.
>>
>> Cute... on current kernels that translates to simply removing the call
>> to update_tg_load_avg(), lets see if we can figure out what goes
>> sideways first though, because it _should_ decay back out. And if that
> 
> yes, Reaching ~300000-400000 is not an issue in itself, the problem is
> that load_avg has decayed but it has not been reflected in
> tg->load_avg in the buggy case
> 
>> can fail here, I'm not seeing why that wouldn't fail elsewhere either.
>>
>> I'll see if I can reproduce this with a script creating heaps of cgroups
>> in a hurry, I have a total lack of system-disease on all my machines.


Something looks weird related to the use of for_each_possible_cpu(i) in
online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).

If I print out the cpu id and the cpu masks inside the for_each_possible_cpu(i)
loop, I get:

[ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l
80

/proc/sched_debug:

cfs_rq[0]:/system.slice
  ...
  .tg_load_avg                   : 323584
  ...

80 * 1024 * 4 (non-existent cpu4-cpu7) = 327680 (with a little bit of decay,
this could account for the extra load on the system.slice tg)

Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
of system.slice tg is 0 after startup.
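
i.e., only as an illustration of what I tested, not a proposed fix:

-	for_each_possible_cpu(i) {
+	for_each_online_cpu(i) {

in online_fair_sched_group().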

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-17 22:52                                       ` Dietmar Eggemann
@ 2016-10-18  8:43                                         ` Vincent Guittot
  2016-10-18  9:07                                         ` Peter Zijlstra
  1 sibling, 0 replies; 46+ messages in thread
From: Vincent Guittot @ 2016-10-18  8:43 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On 18 October 2016 at 00:52, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 10/17/2016 02:54 PM, Vincent Guittot wrote:
>> On 17 October 2016 at 15:19, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:
>
> [...]
>
>>>> BTW, I guess we can reach .tg_load_avg up to ~300000-400000 on such a system
>>>> initially because systemd will create all ~100 services (and therefore the
>>>> corresponding 2. level tg's) at once. In my previous example, there was 500ms
>>>> between the creation of 2 tg's so there was a lot of decaying going on in between.
>>>
>>> Cute... on current kernels that translates to simply removing the call
>>> to update_tg_load_avg(), lets see if we can figure out what goes
>>> sideways first though, because it _should_ decay back out. And if that
>>
>> yes, Reaching ~300000-400000 is not an issue in itself, the problem is
>> that load_avg has decayed but it has not been reflected in
>> tg->load_avg in the buggy case
>>
>>> can fail here, I'm not seeing why that wouldn't fail elsewhere either.
>>>
>>> I'll see if I can reproduce this with a script creating heaps of cgroups
>>> in a hurry, I have a total lack of system-disease on all my machines.
>

Hi Dietmar,

>
> Something looks weird related to the use of for_each_possible_cpu(i) in
> online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).
>
> In case I print out cpu id and the cpu masks inside the for_each_possible_cpu(i)
> I get:
>
> [ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>

Thanks to your description above, I have been able to reproduce the
issue on my ARM platform.
The key point is to have cpu_possible_mask different from
cpu_present_mask in order to reproduce the problem. When
cpu_present_mask equals cpu_possible_mask, I can't reproduce the
problem.
I create a 1st-level task group tg-l1. Then, each time I create a new
task group inside tg-l1, tg-l1.tg_load_avg increases by 1024 * the
number of cpus that are possible but not present, as you described
below.
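
(As an example with Dietmar's numbers below: with 8 possible but only 4
present cpus, each new child group leaves 4 * 1024 = 4096 of load behind
on tg-l1.tg_load_avg that is never decayed.)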

Thanks
Vincent

> T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l
> 80
>
> /proc/sched_debug:
>
> cfs_rq[0]:/system.slice
>   ...
>   .tg_load_avg                   : 323584
>   ...
>
> 80 * 1024 * 4 (not existent cpu4-cpu7) = 327680 (with a little bit of decay,
> this could be this extra load on the systen.slice tg)
>
> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
> of system.slice tg is 0 after startup.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-17 22:52                                       ` Dietmar Eggemann
  2016-10-18  8:43                                         ` Vincent Guittot
@ 2016-10-18  9:07                                         ` Peter Zijlstra
  2016-10-18  9:45                                           ` Vincent Guittot
  2016-10-18 11:15                                           ` Dietmar Eggemann
  1 sibling, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-18  9:07 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:
> 
> Something looks weird related to the use of for_each_possible_cpu(i) in
> online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).
> 
> In case I print out cpu id and the cpu masks inside the for_each_possible_cpu(i)
> I get:
> 
> [ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

OK, you have a buggy BIOS :-) It enumerates too many CPU slots. There is
no reason to have 4 empty CPU slots on a machine that cannot do physical
hotplug.

This also explains why it doesn't show on many machines; most machines
will not have this, and possible_mask == present_mask == online_mask in
almost all cases.

x86 folk, can we detect the lack of physical hotplug capability, FW_WARN
on this, and lower possible_mask to present_mask?

> [ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> 
> T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l
> 80
> 
> /proc/sched_debug:
> 
> cfs_rq[0]:/system.slice
>   ...
>   .tg_load_avg                   : 323584
>   ...
> 
> 80 * 1024 * 4 (not existent cpu4-cpu7) = 327680 (with a little bit of decay,
> this could be this extra load on the systen.slice tg)
> 
> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
> of system.slice tg is 0 after startup.

Right, so the reason for using present_mask is that it avoids having to
deal with hotplug; also, all the per-cpu memory is allocated and present
for !online CPUs anyway, so we might as well set it up properly.

(You might want to start booting your laptop with "possible_cpus=4" to
save some memory FWIW.)

But yes, we have a bug here too... /me ponders

So aside from funny BIOSes, this should also show up when creating
cgroups when you have offlined a few CPUs, which is far more common I'd
think.

On IRC you mentioned that adding list_add_leaf_cfs_rq() to
online_fair_sched_group() cures this, this would actually match with
unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid
a few instructions on the enqueue path, so that's all good.

I'm just not immediately seeing how that cures things. The only relevant
user of the leaf_cfs_rq list seems to be update_blocked_averages() which
is called from the balance code (idle_balance() and
rebalance_domains()). But neither should call that for offline (or
!present) CPUs.


Humm..

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-18  9:07                                         ` Peter Zijlstra
@ 2016-10-18  9:45                                           ` Vincent Guittot
  2016-10-18 10:34                                             ` Peter Zijlstra
  2016-10-18 11:15                                           ` Dietmar Eggemann
  1 sibling, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2016-10-18  9:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:
>>
>> Something looks weird related to the use of for_each_possible_cpu(i) in
>> online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).
>>
>> In case I print out cpu id and the cpu masks inside the for_each_possible_cpu(i)
>> I get:
>>
>> [ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>
> OK, you have a buggy BIOS :-) It enumerates too many CPU slots. There is
> no reason to have 4 empty CPU slots on a machine that cannot do physical
> hotplug.
>
> This also explains why it doesn't show on many machines, most machines
> will not have this and possible_mask == present_mask == online_mask for
> most all cases.
>
> x86 folk, can we detect the lack of physical hotplug capability and FW_WARN
> on this and lowering possible_mask to present_mask?
>
>> [ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>>
>> T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l
>> 80
>>
>> /proc/sched_debug:
>>
>> cfs_rq[0]:/system.slice
>>   ...
>>   .tg_load_avg                   : 323584
>>   ...
>>
>> 80 * 1024 * 4 (not existent cpu4-cpu7) = 327680 (with a little bit of decay,
>> this could be this extra load on the systen.slice tg)
>>
>> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
>> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
>> of system.slice tg is 0 after startup.
>
> Right, so the reason for using present_mask is that it avoids having to
> deal with hotplug, also all the per-cpu memory is allocated and present
> for !online CPUs anyway, so might as well set it up properly anyway.
>
> (You might want to start booting your laptop with "possible_cpus=4" to
> save some memory FWIW.)
>
> But yes, we have a bug here too... /me ponders
>
> So aside from funny BIOSes, this should also show up when creating
> cgroups when you have offlined a few CPUs, which is far more common I'd
> think.

The problem is also that the load of the tg->se[cpu] that represents
the tg->cfs_rq[cpu] is initialized to 1024 in:
alloc_fair_sched_group
     for_each_possible_cpu(i) {
         init_entity_runnable_average(se);
            sa->load_avg = scale_load_down(se->load.weight);

Initializing sa->load_avg to 1024 for a newly created task makes
sense, as we don't know yet what its real load will be, but I'm not sure
that we have to do the same for a se that represents a task group. This
load should be initialized to 0, and it will increase when tasks are
moved/attached into the task group.

>
> On IRC you mentioned that adding list_add_leaf_cfs_rq() to
> online_fair_sched_group() cures this, this would actually match with
> unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid
> a few instructions on the enqueue path, so that's all good.
>
> I'm just not immediately seeing how that cures things. The only relevant
> user of the leaf_cfs_rq list seems to be update_blocked_averages() which
> is called from the balance code (idle_balance() and
> rebalance_domains()). But neither should call that for offline (or
> !present) CPUs.
>
>
> Humm..

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-18  9:45                                           ` Vincent Guittot
@ 2016-10-18 10:34                                             ` Peter Zijlstra
  2016-10-18 11:56                                               ` Vincent Guittot
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-18 10:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
> > So aside from funny BIOSes, this should also show up when creating
> > cgroups when you have offlined a few CPUs, which is far more common I'd
> > think.
> 
> The problem is also that the load of the tg->se[cpu] that represents
> the tg->cfs_rq[cpu] is initialized to 1024 in:
> alloc_fair_sched_group
>      for_each_possible_cpu(i) {
>          init_entity_runnable_average(se);
>             sa->load_avg = scale_load_down(se->load.weight);
> 
> Initializing  sa->load_avg to 1024 for a newly created task makes
> sense as we don't know yet what will be its real load but i'm not sure
> that we have to do the same for se that represents a task group. This
> load should be initialized to 0 and it will increase when task will be
> moved/attached into task group

Yes, I think that makes sense, not sure how horrible that is with the
current state of things, but after your propagate patch, that
reinstates the interactivity hack that should work for sure.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-18  9:07                                         ` Peter Zijlstra
  2016-10-18  9:45                                           ` Vincent Guittot
@ 2016-10-18 11:15                                           ` Dietmar Eggemann
  2016-10-18 12:07                                             ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Dietmar Eggemann @ 2016-10-18 11:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On 18/10/16 10:07, Peter Zijlstra wrote:
> On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:

[...]

>> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
>> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
>> of system.slice tg is 0 after startup.
> 
> Right, so the reason for using present_mask is that it avoids having to
> deal with hotplug, also all the per-cpu memory is allocated and present
> for !online CPUs anyway, so might as well set it up properly anyway.
> 
> (You might want to start booting your laptop with "possible_cpus=4" to
> save some memory FWIW.)

The question for me is: could this be the reason on the X1 Carbon
platform as well?

The initial pastebin from Joseph (http://paste.ubuntu.com/23312351)
showed .tg_load_avg : 381697 on a machine with 4 logical cpus. With
somewhat more than 80 services, this might be the same problem.

> 
> But yes, we have a bug here too... /me ponders
> 
> So aside from funny BIOSes, this should also show up when creating
> cgroups when you have offlined a few CPUs, which is far more common I'd
> think.

Yes.

> On IRC you mentioned that adding list_add_leaf_cfs_rq() to
> online_fair_sched_group() cures this, this would actually match with
> unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid
> a few instructions on the enqueue path, so that's all good.

Yes, I was able to recreate a similar problem (not related to the cpu
masks) on ARM64 (6 logical cpus). I created 100 second-level tg's but only
put one task (no cpu affinity, so it could run on multiple cpus) into one
of these tg's (mainly to see the related cfs_rq's in /proc/sched_debug).

I get a remaining .tg_load_avg : 49898 for cfs_rq[x]:/tg_1.

> I'm just not immediately seeing how that cures things. The only relevant
> user of the leaf_cfs_rq list seems to be update_blocked_averages() which
> is called from the balance code (idle_balance() and
> rebalance_domains()). But neither should call that for offline (or
> !present) CPUs.

Assuming this is load from the 99 second-level tg's which never had a task
running, putting list_add_leaf_cfs_rq() into online_fair_sched_group()
for all cpus makes sure that all the 'blocked load' gets decayed.
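
(The change I tried is essentially a one-liner, sketch only:

+		list_add_leaf_cfs_rq(tg->cfs_rq[i]);

inside the per-cpu loop of online_fair_sched_group(), so that these
cfs_rq's stay on the leaf list and get decayed by
update_blocked_averages().)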

Doing what Vincent just suggested, initializing tg se's with 0 instead
of 1024, prevents this from being necessary.

[...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-18 10:34                                             ` Peter Zijlstra
@ 2016-10-18 11:56                                               ` Vincent Guittot
  2016-10-18 21:58                                                 ` Joonwoo Park
                                                                   ` (4 more replies)
  0 siblings, 5 replies; 46+ messages in thread
From: Vincent Guittot @ 2016-10-18 11:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> > On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
> > > So aside from funny BIOSes, this should also show up when creating
> > > cgroups when you have offlined a few CPUs, which is far more common I'd
> > > think.
> > 
> > The problem is also that the load of the tg->se[cpu] that represents
> > the tg->cfs_rq[cpu] is initialized to 1024 in:
> > alloc_fair_sched_group
> >      for_each_possible_cpu(i) {
> >          init_entity_runnable_average(se);
> >             sa->load_avg = scale_load_down(se->load.weight);
> > 
> > Initializing  sa->load_avg to 1024 for a newly created task makes
> > sense as we don't know yet what will be its real load but i'm not sure
> > that we have to do the same for se that represents a task group. This
> > load should be initialized to 0 and it will increase when task will be
> > moved/attached into task group
> 
> Yes, I think that makes sense, not sure how horrible that is with the

That should not be too bad, because this initial value only matters for
the few dozen ms that follow the creation of the task group.

>
> current state of things, but after your propagate patch, that
> reinstates the interactivity hack that should work for sure.

The patch below fixes the issue on my platform:

Dietmar, Omer, can you confirm that this fixes the problem on your platforms too?

---
 kernel/sched/fair.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b03fb5..89776ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * will definitely be update (after enqueue).
 	 */
 	sa->period_contrib = 1023;
-	sa->load_avg = scale_load_down(se->load.weight);
+	/*
+	 * Tasks are initialized with full load to be seen as heavy tasks until
+	 * they get a chance to stabilize to their real load level.
+	 * Group entities are initialized with zero load to reflect the fact that
+	 * nothing has been attached to the task group yet.
+	 */
+	if (entity_is_task(se))
+		sa->load_avg = scale_load_down(se->load.weight);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-18 11:15                                           ` Dietmar Eggemann
@ 2016-10-18 12:07                                             ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-18 12:07 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On Tue, Oct 18, 2016 at 12:15:11PM +0100, Dietmar Eggemann wrote:
> On 18/10/16 10:07, Peter Zijlstra wrote:
> > On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:

> > On IRC you mentioned that adding list_add_leaf_cfs_rq() to
> > online_fair_sched_group() cures this, this would actually match with
> > unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid
> > a few instructions on the enqueue path, so that's all good.
> 
> Yes, I was able to recreate a similar problem (not related to the cpu
> masks) on ARM64 (6 logical cpus). I created 100 2. level tg's but only
> put one task (no cpu affinity, so it could run on multiple cpus) in one
> of these tg's (mainly to see the related cfs_rq's in /proc/sched_debug).
> 
> I get a remaining .tg_load_avg : 49898 for cfs_rq[x]:/tg_1

Ah, and since all those CPUs are online, we decay all that load away. OK
makes sense now.

>  > I'm just not immediately seeing how that cures things. The only relevant
> > user of the leaf_cfs_rq list seems to be update_blocked_averages() which
> > is called from the balance code (idle_balance() and
> > rebalance_domains()). But neither should call that for offline (or
> > !present) CPUs.
> 
> Assuming this is load from the 99 2. level tg's which never had a task
> running, putting list_add_leaf_cfs_rq() into online_fair_sched_group()
> for all cpus makes sure that all the 'blocked load' get's decayed.
> 
> Doing what Vincent just suggested, not initializing tg se's w/ 1024 but
> w/ 0 instead prevents this from being necessary.

Indeed. I just worry about the cases where we do not propagate the load
up, e.g. the stuff fixed by:

  1476695653-12309-5-git-send-email-vincent.guittot@linaro.org

If we hit an intermediate cgroup with 0 load, we might get some
interactivity issues.

But it could be I got lost again :-)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-18 11:56                                               ` Vincent Guittot
@ 2016-10-18 21:58                                                 ` Joonwoo Park
  2016-10-19  6:42                                                   ` Vincent Guittot
  2016-10-19  9:46                                                 ` Dietmar Eggemann
                                                                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 46+ messages in thread
From: Joonwoo Park @ 2016-10-18 21:58 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: Dietmar Eggemann, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram



On 10/18/2016 04:56 AM, Vincent Guittot wrote:
> Le Tuesday 18 Oct 2016 à 12:34:12 (+0200), Peter Zijlstra a écrit :
>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> So aside from funny BIOSes, this should also show up when creating
>>>> cgroups when you have offlined a few CPUs, which is far more common I'd
>>>> think.
>>>
>>> The problem is also that the load of the tg->se[cpu] that represents
>>> the tg->cfs_rq[cpu] is initialized to 1024 in:
>>> alloc_fair_sched_group
>>>      for_each_possible_cpu(i) {
>>>          init_entity_runnable_average(se);
>>>             sa->load_avg = scale_load_down(se->load.weight);
>>>
>>> Initializing  sa->load_avg to 1024 for a newly created task makes
>>> sense as we don't know yet what will be its real load but i'm not sure
>>> that we have to do the same for se that represents a task group. This
>>> load should be initialized to 0 and it will increase when task will be
>>> moved/attached into task group
>>
>> Yes, I think that makes sense, not sure how horrible that is with the
>
> That should not be that bad because this initial value is only useful for
> the few dozens of ms that follow the creation of the task group
>
>>
>> current state of things, but after your propagate patch, that
>> reinstates the interactivity hack that should work for sure.
>
> The patch below fixes the issue on my platform:
>
> Dietmar, Omer can you confirm that this fix the problem of your platform too ?

I just noticed this thread after posting 
https://lkml.org/lkml/2016/10/18/719...
I noticed this bug a while ago and have had the patch above for at least a
week, but unfortunately didn't have time to post it...
I think Omer had the same problem I was trying to fix, and I believe the
patch I posted should address it.

Vincent, your version fixes my test case as well.
This is the sched_stat output from the same test case I had in my changelog.
Note that dd-2030, which is in the root cgroup, had the same runtime as
dd-2033, which is in a child cgroup.

dd (2030, #threads: 1)
-------------------------------------------------------------------
se.exec_start                                :        275700.024137
se.vruntime                                  :         10589.114654
se.sum_exec_runtime                          :          1576.837993
se.nr_migrations                             :                    0
nr_switches                                  :                  159
nr_voluntary_switches                        :                    0
nr_involuntary_switches                      :                  159
se.load.weight                               :              1048576
se.avg.load_sum                              :             48840575
se.avg.util_sum                              :             19741820
se.avg.load_avg                              :                 1022
se.avg.util_avg                              :                  413
se.avg.last_update_time                      :         275700024137
policy                                       :                    0
prio                                         :                  120
clock-delta                                  :                   34
dd (2033, #threads: 1)
-------------------------------------------------------------------
se.exec_start                                :        275710.037178
se.vruntime                                  :          2383.802868
se.sum_exec_runtime                          :          1576.547591
se.nr_migrations                             :                    0
nr_switches                                  :                  162
nr_voluntary_switches                        :                    0
nr_involuntary_switches                      :                  162
se.load.weight                               :              1048576
se.avg.load_sum                              :             48316646
se.avg.util_sum                              :             21235249
se.avg.load_avg                              :                 1011
se.avg.util_avg                              :                  444
se.avg.last_update_time                      :         275710037178
policy                                       :                    0
prio                                         :                  120
clock-delta                                  :                   36

Thanks,
Joonwoo

>
> ---
>  kernel/sched/fair.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * will definitely be update (after enqueue).
>  	 */
>  	sa->period_contrib = 1023;
> -	sa->load_avg = scale_load_down(se->load.weight);
> +	/*
> +	 * Tasks are intialized with full load to be seen as heavy task until
> +	 * they get a chance to stabilize to their real load level.
> +	 * group entity are intialized with null load to reflect the fact that
> +	 * nothing has been attached yet to the task group.
> +	 */
> +	if (entity_is_task(se))
> +		sa->load_avg = scale_load_down(se->load.weight);
>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  	/*
>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway
>
>
>
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-18 21:58                                                 ` Joonwoo Park
@ 2016-10-19  6:42                                                   ` Vincent Guittot
  0 siblings, 0 replies; 46+ messages in thread
From: Vincent Guittot @ 2016-10-19  6:42 UTC (permalink / raw)
  To: Joonwoo Park
  Cc: Peter Zijlstra, Dietmar Eggemann, Joseph Salisbury, Ingo Molnar,
	Linus Torvalds, Thomas Gleixner, LKML, Mike Galbraith,
	omer.akram

On 18 October 2016 at 23:58, Joonwoo Park <joonwoop@codeaurora.org> wrote:
>
>
> On 10/18/2016 04:56 AM, Vincent Guittot wrote:
>>
>> Le Tuesday 18 Oct 2016 à 12:34:12 (+0200), Peter Zijlstra a écrit :
>>>
>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>>>
>>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org>
>>>> wrote:
>>>>>
>>>>> So aside from funny BIOSes, this should also show up when creating
>>>>> cgroups when you have offlined a few CPUs, which is far more common I'd
>>>>> think.
>>>>
>>>>
>>>> The problem is also that the load of the tg->se[cpu] that represents
>>>> the tg->cfs_rq[cpu] is initialized to 1024 in:
>>>> alloc_fair_sched_group
>>>>      for_each_possible_cpu(i) {
>>>>          init_entity_runnable_average(se);
>>>>             sa->load_avg = scale_load_down(se->load.weight);
>>>>
>>>> Initializing  sa->load_avg to 1024 for a newly created task makes
>>>> sense as we don't know yet what will be its real load but i'm not sure
>>>> that we have to do the same for se that represents a task group. This
>>>> load should be initialized to 0 and it will increase when task will be
>>>> moved/attached into task group
>>>
>>>
>>> Yes, I think that makes sense, not sure how horrible that is with the
>>
>>
>> That should not be that bad because this initial value is only useful for
>> the few dozens of ms that follow the creation of the task group
>>
>>>
>>> current state of things, but after your propagate patch, that
>>> reinstates the interactivity hack that should work for sure.
>>
>>
>> The patch below fixes the issue on my platform:
>>
>> Dietmar, Omer can you confirm that this fix the problem of your platform
>> too ?
>
>
> I just noticed this thread after posting
> https://lkml.org/lkml/2016/10/18/719...
> Noticed this bug while a ago and had the patch above at least a week but
> unfortunately didn't have time to post...
> I think Omer had same problem I was trying to fix and I believe patch I post
> should address it.
>
> Vincent, your version fixes my test case as well.

Thanks for testing.
Can I consider this as a Tested-by?

> This is sched_stat from the same test case I had in my changelog.
> Note dd-2030 which is in root cgroup had same runtime as dd-2033 which is in
> child cgroup.
>
> dd (2030, #threads: 1)
> -------------------------------------------------------------------
> se.exec_start                                :        275700.024137
> se.vruntime                                  :         10589.114654
> se.sum_exec_runtime                          :          1576.837993
> se.nr_migrations                             :                    0
> nr_switches                                  :                  159
> nr_voluntary_switches                        :                    0
> nr_involuntary_switches                      :                  159
> se.load.weight                               :              1048576
> se.avg.load_sum                              :             48840575
> se.avg.util_sum                              :             19741820
> se.avg.load_avg                              :                 1022
> se.avg.util_avg                              :                  413
> se.avg.last_update_time                      :         275700024137
> policy                                       :                    0
> prio                                         :                  120
> clock-delta                                  :                   34
> dd (2033, #threads: 1)
> -------------------------------------------------------------------
> se.exec_start                                :        275710.037178
> se.vruntime                                  :          2383.802868
> se.sum_exec_runtime                          :          1576.547591
> se.nr_migrations                             :                    0
> nr_switches                                  :                  162
> nr_voluntary_switches                        :                    0
> nr_involuntary_switches                      :                  162
> se.load.weight                               :              1048576
> se.avg.load_sum                              :             48316646
> se.avg.util_sum                              :             21235249
> se.avg.load_avg                              :                 1011
> se.avg.util_avg                              :                  444
> se.avg.last_update_time                      :         275710037178
> policy                                       :                    0
> prio                                         :                  120
> clock-delta                                  :                   36
>
> Thanks,
> Joonwoo
>
>
>>
>> ---
>>  kernel/sched/fair.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity
>> *se)
>>          * will definitely be update (after enqueue).
>>          */
>>         sa->period_contrib = 1023;
>> -       sa->load_avg = scale_load_down(se->load.weight);
>> +       /*
>> +        * Tasks are intialized with full load to be seen as heavy task
>> until
>> +        * they get a chance to stabilize to their real load level.
>> +        * group entity are intialized with null load to reflect the fact
>> that
>> +        * nothing has been attached yet to the task group.
>> +        */
>> +       if (entity_is_task(se))
>> +               sa->load_avg = scale_load_down(se->load.weight);
>>         sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>         /*
>>          * At this point, util_avg won't be used in select_task_rq_fair
>> anyway
>>
>>
>>
>>
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-18 11:56                                               ` Vincent Guittot
  2016-10-18 21:58                                                 ` Joonwoo Park
@ 2016-10-19  9:46                                                 ` Dietmar Eggemann
  2016-10-19 11:25                                                   ` Vincent Guittot
  2016-10-19 11:33                                                 ` Peter Zijlstra
                                                                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 46+ messages in thread
From: Dietmar Eggemann @ 2016-10-19  9:46 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: Joseph Salisbury, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
	LKML, Mike Galbraith, omer.akram

On 18/10/16 12:56, Vincent Guittot wrote:
> Le Tuesday 18 Oct 2016 à 12:34:12 (+0200), Peter Zijlstra a écrit :
>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:

[...]

> 
> The patch below fixes the issue on my platform: 
> 
> Dietmar, Omer can you confirm that this fix the problem of your platform too ?

It fixes this broken BIOS issue on my T430 (cpu_possible_mask >
cpu_online_mask). I ran the original test with the cpu hogs (stress -c
4). The launch time of applications becomes normal again.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

But this test only makes sure that we don't see any ghost contribution
(from non-existing cpus) any more.

We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
(with the highest tg having a task enqueued) a little bit more, with and
without your v5 'sched: reflect sched_entity move into task_group's load'.

> ---
>  kernel/sched/fair.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * will definitely be update (after enqueue).
>  	 */
>  	sa->period_contrib = 1023;
> -	sa->load_avg = scale_load_down(se->load.weight);
> +	/*
> +	 * Tasks are intialized with full load to be seen as heavy task until
> +	 * they get a chance to stabilize to their real load level.
> +	 * group entity are intialized with null load to reflect the fact that
> +	 * nothing has been attached yet to the task group.
> +	 */
> +	if (entity_is_task(se))
> +		sa->load_avg = scale_load_down(se->load.weight);
>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  	/*
>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-19  9:46                                                 ` Dietmar Eggemann
@ 2016-10-19 11:25                                                   ` Vincent Guittot
  2016-10-19 15:33                                                     ` Dietmar Eggemann
  0 siblings, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2016-10-19 11:25 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On 19 October 2016 at 11:46, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 18/10/16 12:56, Vincent Guittot wrote:
>> Le Tuesday 18 Oct 2016 à 12:34:12 (+0200), Peter Zijlstra a écrit :
>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> [...]
>
>>
>> The patch below fixes the issue on my platform:
>>
>> Dietmar, Omer can you confirm that this fix the problem of your platform too ?
>
> It fixes this broken BIOS issue on my T430 ( cpu_possible_mask >
> cpu_online_mask). I ran the original test with the cpu hogs (stress -c
> 4). Launch time of applications becomes normal again.
>
> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Thanks

>
> But this test only makes sure that we don't see any ghost contribution
> (from non-existing cpus) any more.
>
> We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
> (with the highest tg having a task enqueued) a little bit more, with and
> without your v5 'sched: reflect sched_entity move into task_group's load'.

Can you elaborate ?

Vincent

>
>> ---
>>  kernel/sched/fair.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>        * will definitely be update (after enqueue).
>>        */
>>       sa->period_contrib = 1023;
>> -     sa->load_avg = scale_load_down(se->load.weight);
>> +     /*
>> +      * Tasks are intialized with full load to be seen as heavy task until
>> +      * they get a chance to stabilize to their real load level.
>> +      * group entity are intialized with null load to reflect the fact that
>> +      * nothing has been attached yet to the task group.
>> +      */
>> +     if (entity_is_task(se))
>> +             sa->load_avg = scale_load_down(se->load.weight);
>>       sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>       /*
>>        * At this point, util_avg won't be used in select_task_rq_fair anyway

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-18 11:56                                               ` Vincent Guittot
  2016-10-18 21:58                                                 ` Joonwoo Park
  2016-10-19  9:46                                                 ` Dietmar Eggemann
@ 2016-10-19 11:33                                                 ` Peter Zijlstra
  2016-10-19 11:50                                                   ` Vincent Guittot
  2016-10-19 13:30                                                 ` Morten Rasmussen
  2016-10-19 14:49                                                 ` Joseph Salisbury
  4 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-19 11:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:

> ---
>  kernel/sched/fair.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * will definitely be update (after enqueue).
>  	 */
>  	sa->period_contrib = 1023;
> -	sa->load_avg = scale_load_down(se->load.weight);
> +	/*
> +	 * Tasks are intialized with full load to be seen as heavy task until
> +	 * they get a chance to stabilize to their real load level.
> +	 * group entity are intialized with null load to reflect the fact that
> +	 * nothing has been attached yet to the task group.
> +	 */
> +	if (entity_is_task(se))
> +		sa->load_avg = scale_load_down(se->load.weight);
>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  	/*
>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway
> 

Vince, could you post a proper version of this patch with changelogs and
tags so that we can get that merged into Linus' tree and stable for 4.8?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-19 11:33                                                 ` Peter Zijlstra
@ 2016-10-19 11:50                                                   ` Vincent Guittot
  0 siblings, 0 replies; 46+ messages in thread
From: Vincent Guittot @ 2016-10-19 11:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On 19 October 2016 at 13:33, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
>
>> ---
>>  kernel/sched/fair.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>        * will definitely be update (after enqueue).
>>        */
>>       sa->period_contrib = 1023;
>> -     sa->load_avg = scale_load_down(se->load.weight);
>> +     /*
>> +      * Tasks are intialized with full load to be seen as heavy task until
>> +      * they get a chance to stabilize to their real load level.
>> +      * group entity are intialized with null load to reflect the fact that
>> +      * nothing has been attached yet to the task group.
>> +      */
>> +     if (entity_is_task(se))
>> +             sa->load_avg = scale_load_down(se->load.weight);
>>       sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>       /*
>>        * At this point, util_avg won't be used in select_task_rq_fair anyway
>>
>
> Vince, could you post a proper version of this patch with changelogs and
> tags so that we can get that merged into Linus' tree and stable for 4.8?

Yes, I just have to finish the changelog and then I'll send it.

>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-18 11:56                                               ` Vincent Guittot
                                                                   ` (2 preceding siblings ...)
  2016-10-19 11:33                                                 ` Peter Zijlstra
@ 2016-10-19 13:30                                                 ` Morten Rasmussen
  2016-10-19 17:41                                                   ` Vincent Guittot
  2016-10-19 14:49                                                 ` Joseph Salisbury
  4 siblings, 1 reply; 46+ messages in thread
From: Morten Rasmussen @ 2016-10-19 13:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Dietmar Eggemann, Joseph Salisbury, Ingo Molnar,
	Linus Torvalds, Thomas Gleixner, LKML, Mike Galbraith,
	omer.akram

On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
> Le Tuesday 18 Oct 2016 à 12:34:12 (+0200), Peter Zijlstra a écrit :
> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> > > On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > So aside from funny BIOSes, this should also show up when creating
> > > > cgroups when you have offlined a few CPUs, which is far more common I'd
> > > > think.
> > > 
> > > The problem is also that the load of the tg->se[cpu] that represents
> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
> > > alloc_fair_sched_group
> > >      for_each_possible_cpu(i) {
> > >          init_entity_runnable_average(se);
> > >             sa->load_avg = scale_load_down(se->load.weight);
> > > 
> > > Initializing  sa->load_avg to 1024 for a newly created task makes
> > > sense as we don't know yet what will be its real load but i'm not sure
> > > that we have to do the same for se that represents a task group. This
> > > load should be initialized to 0 and it will increase when task will be
> > > moved/attached into task group
> > 
> > Yes, I think that makes sense, not sure how horrible that is with the
> 
> That should not be that bad because this initial value is only useful for
> the few dozens of ms that follow the creation of the task group

IMHO, it doesn't make much sense to initialize empty containers, which
group sched_entities really are, to 1024. It is meant to represent what
is in it, and at creation it is empty, so in my opinion initializing it
to zero makes sense.
 
> > current state of things, but after your propagate patch, that
> > reinstates the interactivity hack that should work for sure.

It actually works on mainline/tip as well.

As I see it, the fundamental problem is keeping group entities up to
date. Because the load_weight, and hence se->avg.load_avg, of each per-cpu
group sched_entity depends on the group cfs_rq->tg_load_avg_contrib of
all cpus (tg->load_avg), including those that might be empty and
therefore not enqueued, we must ensure that they are updated some other
way, most naturally as part of update_blocked_averages().
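
To get a feel for the effect, here is a rough arithmetic sketch (plain
userspace C, not kernel code; it only mimics the proportional share
calculation done along the lines of calc_cfs_shares(), and the numbers are
assumptions: default tg->shares of 1024, one task of load 1024 on the only
busy cpu, and 5 never-used cpus each leaving a stale contribution of 1024):

-----
#include <stdio.h>

int main(void)
{
	long tg_shares = 1024;		/* default group weight */
	long cpu_load  = 1024;		/* contribution of the busy cpu */
	long ghost     = 5 * 1024;	/* stale contribs of never-used cpus */

	/* tg->load_avg as seen without and with the stale contributions */
	long tg_load_clean = cpu_load;
	long tg_load_stale = cpu_load + ghost;

	/* per-cpu group entity weight ~ tg->shares * cpu contrib / tg->load_avg */
	printf("group se weight, clean: %ld\n", tg_shares * cpu_load / tg_load_clean);
	printf("group se weight, stale: %ld\n", tg_shares * cpu_load / tg_load_stale);

	return 0;
}
-----

With these assumptions the busy cpu's group entity ends up with roughly a
sixth of the weight it should have, which is the flavour of starvation the
grouped-hog numbers further down show.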

To guarantee that, it basically boils down to making sure:
Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
leaf_cfs_rq_list.

We can do that in different ways: 1) Add all cfs_rqs to the
leaf_cfs_rq_list at task group creation, or 2) initialize group
sched_entity contributions to zero and make sure that they are added to
leaf_cfs_rq_list as soon as a sched_entity (task or group) is enqueued
on it.
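
For reference, option 1 could reuse the existing list_add_leaf_cfs_rq()
helper from online_fair_sched_group(), as mentioned earlier in the thread.
A sketch only (untested, and not what is being proposed here), roughly a
loop like this added to online_fair_sched_group():

-----
	int i;

	for_each_possible_cpu(i) {
		struct rq *rq = cpu_rq(i);

		raw_spin_lock_irq(&rq->lock);
		/* keep this cfs_rq on the leaf list so that
		 * update_blocked_averages() walks it and any stale
		 * contribution gets decayed */
		list_add_leaf_cfs_rq(tg->cfs_rq[i]);
		raw_spin_unlock_irq(&rq->lock);
	}
-----

That keeps every group cfs_rq visible to the blocked-load decay, at the cost
of a longer leaf list to walk.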

Vincent's patch below gives us the second option.

>  kernel/sched/fair.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * will definitely be update (after enqueue).
>  	 */
>  	sa->period_contrib = 1023;
> -	sa->load_avg = scale_load_down(se->load.weight);
> +	/*
> +	 * Tasks are intialized with full load to be seen as heavy task until
> +	 * they get a chance to stabilize to their real load level.
> +	 * group entity are intialized with null load to reflect the fact that
> +	 * nothing has been attached yet to the task group.
> +	 */
> +	if (entity_is_task(se))
> +		sa->load_avg = scale_load_down(se->load.weight);
>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  	/*
>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway

I would suggest adding a comment somewhere stating that we need to keep
group cfs_rqs up to date:

-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index abb3763dff69..2b820d489be0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
 		if (throttled_hierarchy(cfs_rq))
 			continue;
 
+		/*
+		 * Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
+		 * _must_ be on the leaf_cfs_rq_list to ensure that group shares
+		 * are updated correctly.
+		 */
 		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
 	}
-----

I did a couple of simple tests on tip/sched/core to test whether
Vincent's fix works even without reflecting group load/util in the group
hierarchy:

Juno (2xA57+4xA53)

tip:
	grouped hog(1) alone: 2841
	non-grouped hogs(6) alone: 40830
	grouped hog(1): 218
	non-grouped hogs(6): 40580

tip+vg:
	grouped hog alone: 2849
	non-grouped hogs(6) alone: 40831
	grouped hog: 2363
	non-grouped hogs: 38418

See script below for details, but we basically see that the grouped task
is not getting its 'fair' share on tip, while it does with Vincent's
patch.

To summarize, I think Vincent's patch makes sense and works :-) More
testing is needed, of course, to see if there are other problems.

-----

# Create 100 task groups:
for i in `seq 1 100`;
do
        cgcreate -g cpu:/root/test$i
done

NCPUS=$(grep -c ^processor /proc/cpuinfo)

# Run single cpu hog inside task group on first cpu _alone_:
cgexec -g cpu:/root/test100 taskset 0x01 sysbench --test=cpu \
--num-threads=1 --max-time=5 --max-requests=1000000 run | \
awk '{if ($4=="events:") {print "grouped hog(1) alone: " $5}}'

# Run cpu hogs outside task group _alone_:
sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
--max-requests=1000000 run | awk '{if ($4=="events:") \
{print "non-grouped hogs('$NCPUS') alone: " $5}}'

# Run cpu hogs outside task group:
sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
--max-requests=1000000 run | awk '{if ($4=="events:") \
{print "non-grouped hogs('$NCPUS'): " $5}}' &

# Run single cpu hog inside task group on first cpu:
cgexec -g cpu:/root/test100 taskset 0x01 sysbench \
--test=cpu --num-threads=1 --max-time=5 \
--max-requests=1000000 run | awk '{if ($4=="events:") \
{print "grouped hog(1): " $5}}'

wait

# Delete task groups:
for i in `seq 1 100`;
do
        cgdelete -g cpu:/root/test$i
done

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-18 11:56                                               ` Vincent Guittot
                                                                   ` (3 preceding siblings ...)
  2016-10-19 13:30                                                 ` Morten Rasmussen
@ 2016-10-19 14:49                                                 ` Joseph Salisbury
  2016-10-19 14:53                                                   ` Vincent Guittot
  4 siblings, 1 reply; 46+ messages in thread
From: Joseph Salisbury @ 2016-10-19 14:49 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: Dietmar Eggemann, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
	LKML, Mike Galbraith, omer.akram

On 10/18/2016 07:56 AM, Vincent Guittot wrote:
> Le Tuesday 18 Oct 2016 à 12:34:12 (+0200), Peter Zijlstra a écrit :
>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> So aside from funny BIOSes, this should also show up when creating
>>>> cgroups when you have offlined a few CPUs, which is far more common I'd
>>>> think.
>>> The problem is also that the load of the tg->se[cpu] that represents
>>> the tg->cfs_rq[cpu] is initialized to 1024 in:
>>> alloc_fair_sched_group
>>>      for_each_possible_cpu(i) {
>>>          init_entity_runnable_average(se);
>>>             sa->load_avg = scale_load_down(se->load.weight);
>>>
>>> Initializing  sa->load_avg to 1024 for a newly created task makes
>>> sense as we don't know yet what will be its real load but i'm not sure
>>> that we have to do the same for se that represents a task group. This
>>> load should be initialized to 0 and it will increase when task will be
>>> moved/attached into task group
>> Yes, I think that makes sense, not sure how horrible that is with the
> That should not be that bad because this initial value is only useful for
> the few dozens of ms that follow the creation of the task group 
>
>> current state of things, but after your propagate patch, that
>> reinstates the interactivity hack that should work for sure.
> The patch below fixes the issue on my platform: 
>
> Dietmar, Omer can you confirm that this fix the problem of your platform too ?
>
> ---
>  kernel/sched/fair.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * will definitely be update (after enqueue).
>  	 */
>  	sa->period_contrib = 1023;
> -	sa->load_avg = scale_load_down(se->load.weight);
> +	/*
> +	 * Tasks are intialized with full load to be seen as heavy task until
> +	 * they get a chance to stabilize to their real load level.
> +	 * group entity are intialized with null load to reflect the fact that
> +	 * nothing has been attached yet to the task group.
> +	 */
> +	if (entity_is_task(se))
> +		sa->load_avg = scale_load_down(se->load.weight);
>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  	/*
>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway
>
>
>
>
Omer reports that this patch fixes the bug for him as well. Thanks
for the great work, Vincent!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-19 14:49                                                 ` Joseph Salisbury
@ 2016-10-19 14:53                                                   ` Vincent Guittot
  0 siblings, 0 replies; 46+ messages in thread
From: Vincent Guittot @ 2016-10-19 14:53 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: Peter Zijlstra, Dietmar Eggemann, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On 19 October 2016 at 16:49, Joseph Salisbury
<joseph.salisbury@canonical.com> wrote:
> On 10/18/2016 07:56 AM, Vincent Guittot wrote:
>> Le Tuesday 18 Oct 2016 à 12:34:12 (+0200), Peter Zijlstra a écrit :
>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> So aside from funny BIOSes, this should also show up when creating
>>>>> cgroups when you have offlined a few CPUs, which is far more common I'd
>>>>> think.
>>>> The problem is also that the load of the tg->se[cpu] that represents
>>>> the tg->cfs_rq[cpu] is initialized to 1024 in:
>>>> alloc_fair_sched_group
>>>>      for_each_possible_cpu(i) {
>>>>          init_entity_runnable_average(se);
>>>>             sa->load_avg = scale_load_down(se->load.weight);
>>>>
>>>> Initializing  sa->load_avg to 1024 for a newly created task makes
>>>> sense as we don't know yet what will be its real load but i'm not sure
>>>> that we have to do the same for se that represents a task group. This
>>>> load should be initialized to 0 and it will increase when task will be
>>>> moved/attached into task group
>>> Yes, I think that makes sense, not sure how horrible that is with the
>> That should not be that bad because this initial value is only useful for
>> the few dozens of ms that follow the creation of the task group
>>
>>> current state of things, but after your propagate patch, that
>>> reinstates the interactivity hack that should work for sure.
>> The patch below fixes the issue on my platform:
>>
>> Dietmar, Omer can you confirm that this fix the problem of your platform too ?
>>
>> ---
>>  kernel/sched/fair.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>        * will definitely be update (after enqueue).
>>        */
>>       sa->period_contrib = 1023;
>> -     sa->load_avg = scale_load_down(se->load.weight);
>> +     /*
>> +      * Tasks are intialized with full load to be seen as heavy task until
>> +      * they get a chance to stabilize to their real load level.
>> +      * group entity are intialized with null load to reflect the fact that
>> +      * nothing has been attached yet to the task group.
>> +      */
>> +     if (entity_is_task(se))
>> +             sa->load_avg = scale_load_down(se->load.weight);
>>       sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>       /*
>>        * At this point, util_avg won't be used in select_task_rq_fair anyway
>>
>>
>>
>>
> Omer also reports that this patch fixes the bug for him as well.  Thanks
> for the great work, Vincent!

Thanks

>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-19 11:25                                                   ` Vincent Guittot
@ 2016-10-19 15:33                                                     ` Dietmar Eggemann
  2016-10-19 17:33                                                       ` Joonwoo Park
  2016-10-19 17:50                                                       ` Vincent Guittot
  0 siblings, 2 replies; 46+ messages in thread
From: Dietmar Eggemann @ 2016-10-19 15:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On 19/10/16 12:25, Vincent Guittot wrote:
> On 19 October 2016 at 11:46, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 18/10/16 12:56, Vincent Guittot wrote:
>>> Le Tuesday 18 Oct 2016 à 12:34:12 (+0200), Peter Zijlstra a écrit :
>>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:

[...]

>> But this test only makes sure that we don't see any ghost contribution
>> (from non-existing cpus) any more.
>>
>> We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
>> (with the highest tg having a task enqueued) a little bit more, with and
>> without your v5 'sched: reflect sched_entity move into task_group's load'.
> 
> Can you elaborate ?

I'll try :-)

I thought I would see some different behaviour because of the fact that
the tg se's are initialized differently [1024 versus 0].

But I can't spot any difference. The test case is running a sysbench
thread affine to cpu1 in tg_root/tg_1/tg_11/tg_111 on tip/sched/core on
an ARM64 Juno (6 logical cpus).
The moment the sysbench task is put into tg_111
tg_111->se[1]->avg.load_avg gets updated to 0 anyway because of the
huge time difference between creating this tg and attaching a task to
it. So the tg->se[2]->avg.load_avg signals for tg_111, tg_11 and tg_1
look exactly the same w/o and w/ your patch.

But your patch helps in this (very synthetic) test case as well. W/o
your patch I see remaining tg->load_avg for tg_1 and tg_11 after the
test case has finished because the tg's were exclusively used on cpu1.

# cat /proc/sched_debug

 cfs_rq[1]:/tg_1
   .tg_load_avg_contrib           : 0
   .tg_load_avg                   : 5120 (5 (unused cpus) * 1024 * 1)
 cfs_rq[1]:/tg_1/tg_11/tg_111
   .tg_load_avg_contrib           : 0
   .tg_load_avg                   : 0
 cfs_rq[1]:/tg_1/tg_11
   .tg_load_avg_contrib           : 0
   .tg_load_avg                   : 5120

With your patch applied all the .tg_load_avg are 0.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-19 15:33                                                     ` Dietmar Eggemann
@ 2016-10-19 17:33                                                       ` Joonwoo Park
  2016-10-19 17:50                                                       ` Vincent Guittot
  1 sibling, 0 replies; 46+ messages in thread
From: Joonwoo Park @ 2016-10-19 17:33 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Peter Zijlstra, Joseph Salisbury, Ingo Molnar,
	Linus Torvalds, Thomas Gleixner, LKML, Mike Galbraith,
	omer.akram

On Wed, Oct 19, 2016 at 04:33:03PM +0100, Dietmar Eggemann wrote:
> On 19/10/16 12:25, Vincent Guittot wrote:
> > On 19 October 2016 at 11:46, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >> On 18/10/16 12:56, Vincent Guittot wrote:
> >>> Le Tuesday 18 Oct 2016 à 12:34:12 (+0200), Peter Zijlstra a écrit :
> >>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> >>>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> [...]
> 
> >> But this test only makes sure that we don't see any ghost contribution
> >> (from non-existing cpus) any more.
> >>
> >> We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
> >> (with the highest tg having a task enqueued) a little bit more, with and
> >> without your v5 'sched: reflect sched_entity move into task_group's load'.
> > 
> > Can you elaborate ?
> 
> I try :-)
> 
> I thought I will see some different behaviour because of the fact that
> the tg se's are initialized differently [1024 versus 0].

This is exactly what I was also worried about, and that's the reason I
tried to fix this in a different way.
However, I didn't find any behaviour difference once a task was attached to
a child cfs_rq, which is the point we really care about.

I found this bug while making the patch at https://lkml.org/lkml/2016/10/18/841,
which will fail with a wrong task_group load_avg.
I tested Vincent's patch and the above together and confirmed it's still good.

Though I know Ingo has already sent out the pull request.  Anyway.

Tested-by: Joonwoo Park <joonwoop@codeaurora.org>

Thanks,
Joonwoo

> 
> But I can't spot any difference. The test case is running a sysbench
> thread affine to cpu1 in tg_root/tg_1/tg_11/tg_111 on tip/sched/core on
> an ARM64 Juno (6 logical cpus).
> The moment the sysbench task is put into tg_111
> tg_111->se[1]->avg.load_avg gets updated to 0 any way because of the
> huge time difference between creating this tg and attaching a task to
> it. So the tg->se[2]->avg.load_avg signals for tg_111, tg_11 and tg_1
> look exactly the same w/o and w/ your patch.
> 
> But your patch helps in this (very synthetic) test case as well. W/o
> your patch I see remaining tg->load_avg for tg_1 and tg_11 after the
> test case has finished because the tg's were exclusively used on cpu1.
> 
> # cat /proc/sched_debug
> 
>  cfs_rq[1]:/tg_1
>    .tg_load_avg_contrib           : 0
>    .tg_load_avg                   : 5120 (5 (unused cpus) * 1024 * 1)
>  cfs_rq[1]:/tg_1/tg_11/tg_111
>    .tg_load_avg_contrib           : 0
>    .tg_load_avg                   : 0
>  cfs_rq[1]:/tg_1/tg_11
>    .tg_load_avg_contrib           : 0
>    .tg_load_avg                   : 5120
> 
> With your patch applied all the .tg_load_avg are 0.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-19 13:30                                                 ` Morten Rasmussen
@ 2016-10-19 17:41                                                   ` Vincent Guittot
  2016-10-20  7:56                                                     ` Morten Rasmussen
  0 siblings, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2016-10-19 17:41 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Dietmar Eggemann, Joseph Salisbury, Ingo Molnar,
	Linus Torvalds, Thomas Gleixner, LKML, Mike Galbraith,
	omer.akram

On 19 October 2016 at 15:30, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
>> Le Tuesday 18 Oct 2016 à 12:34:12 (+0200), Peter Zijlstra a écrit :
>> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>> > > On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>> > > > So aside from funny BIOSes, this should also show up when creating
>> > > > cgroups when you have offlined a few CPUs, which is far more common I'd
>> > > > think.
>> > >
>> > > The problem is also that the load of the tg->se[cpu] that represents
>> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
>> > > alloc_fair_sched_group
>> > >      for_each_possible_cpu(i) {
>> > >          init_entity_runnable_average(se);
>> > >             sa->load_avg = scale_load_down(se->load.weight);
>> > >
>> > > Initializing  sa->load_avg to 1024 for a newly created task makes
>> > > sense as we don't know yet what will be its real load but i'm not sure
>> > > that we have to do the same for se that represents a task group. This
>> > > load should be initialized to 0 and it will increase when task will be
>> > > moved/attached into task group
>> >
>> > Yes, I think that makes sense, not sure how horrible that is with the
>>
>> That should not be that bad because this initial value is only useful for
>> the few dozens of ms that follow the creation of the task group
>
> IMHO, it doesn't make much sense to initialize empty containers, which
> group sched_entities really are, to 1024. It is meant to represent what
> is in it, and a creation it is empty, so in my opinion initializing it
> to zero make sense.
>
>> > current state of things, but after your propagate patch, that
>> > reinstates the interactivity hack that should work for sure.
>
> It actually works on mainline/tip as well.
>
> As I see it, the fundamental problem is keeping group entities up to
> date. Because the load_weight and hence se->avg.load_avg each per-cpu
> group sched_entity depends on the group cfs_rq->tg_load_avg_contrib for
> all cpus (tg->load_avg), including those that might be empty and
> therefore not enqueued, we must ensure that they are updated some other
> way. Most naturally as part of update_blocked_averages().
>
> To guarantee that, it basically boils down to making sure:
> Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
> leaf_cfs_rq_list.
>
> We can do that in different ways: 1) Add all cfs_rqs to the
> leaf_cfs_rq_list at task group creation, or 2) initialize group
> sched_entity contributions to zero and make sure that they are added to
> leaf_cfs_rq_list as soon as a sched_entity (task or group) is enqueued
> on it.
>
> Vincent patch below gives us the second option.
>
>>  kernel/sched/fair.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>        * will definitely be update (after enqueue).
>>        */
>>       sa->period_contrib = 1023;
>> -     sa->load_avg = scale_load_down(se->load.weight);
>> +     /*
>> +      * Tasks are intialized with full load to be seen as heavy task until
>> +      * they get a chance to stabilize to their real load level.
>> +      * group entity are intialized with null load to reflect the fact that
>> +      * nothing has been attached yet to the task group.
>> +      */
>> +     if (entity_is_task(se))
>> +             sa->load_avg = scale_load_down(se->load.weight);
>>       sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>       /*
>>        * At this point, util_avg won't be used in select_task_rq_fair anyway
>
> I would suggest adding a comment somewhere stating that we need to keep
> group cfs_rqs up to date:
>
> -----
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index abb3763dff69..2b820d489be0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
>                 if (throttled_hierarchy(cfs_rq))
>                         continue;
>
> +               /*
> +                * Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
> +                * _must_ be on the leaf_cfs_rq_list to ensure that group shares
> +                * are updated correctly.
> +                */

As discussed on IRC, the point is that even if the leaf cfs_rq is
added to the leaf_cfs_rq_list, it doesn't ensure that it will be
updated correctly for unplugged CPUs.

>                 if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
>                         update_tg_load_avg(cfs_rq, 0);
>         }
> -----
>
> I did a couple of simple tests on tip/sched/core to test whether
> Vincent's fix works even without reflecting group load/util in the group
> hierarchy:
>
> Juno (2xA57+4xA53)
>
> tip:
>         grouped hog(1) alone: 2841
>         non-grouped hogs(6) alone: 40830
>         grouped hog(1): 218
>         non-grouped hogs(6): 40580
>
> tip+vg:
>         grouped hog alone: 2849
>         non-grouped hogs(6) alone: 40831
>         grouped hog: 2363
>         non-grouped hogs: 38418
>
> See script below for details, but we basically see that the grouped task
> is not getting its 'fair' share on tip, while it does with Vincent's
> patch.
>
> To summarize, I think Vincent's patch makes sense and works :-) More
> testing is needed of cause to see if there are other problems.
>
> -----
>
> # Create 100 task groups:
> for i in `seq 1 100`;
> do
>         cgcreate -g cpu:/root/test$i
> done
>
> NCPUS=$(grep -c ^processor /proc/cpuinfo)
>
> # Run single cpu hog inside task group on first cpu _alone_:
> cgexec -g cpu:/root/test100 taskset 0x01 sysbench --test=cpu \
> --num-threads=1 --max-time=5 --max-requests=1000000 run | \
> awk '{if ($4=="events:") {print "grouped hog(1) alone: " $5}}'
>
> # Run cpu hogs outside task group _alone_:
> sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
> --max-requests=1000000 run | awk '{if ($4=="events:") \
> {print "non-grouped hogs('$NCPUS') alone: " $5}}'
>
> # Run cpu hogs outside task group:
> sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
> --max-requests=1000000 run | awk '{if ($4=="events:") \
> {print "non-grouped hogs('$NCPUS'): " $5}}' &
>
> # Run single cpu hog inside task group on first cpu:
> cgexec -g cpu:/root/test100 taskset 0x01 sysbench \
> --test=cpu --num-threads=1 --max-time=5 \
> --max-requests=1000000 run | awk '{if ($4=="events:") \
> {print "grouped hog(1): " $5}}'
>
> wait
>
> # Delete task groups:
> for i in `seq 1 100`;
> do
>         cgdelete -g cpu:/root/test$i
> done

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-19 15:33                                                     ` Dietmar Eggemann
  2016-10-19 17:33                                                       ` Joonwoo Park
@ 2016-10-19 17:50                                                       ` Vincent Guittot
  1 sibling, 0 replies; 46+ messages in thread
From: Vincent Guittot @ 2016-10-19 17:50 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, LKML, Mike Galbraith, omer.akram

On 19 October 2016 at 17:33, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 19/10/16 12:25, Vincent Guittot wrote:
>> On 19 October 2016 at 11:46, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>> On 18/10/16 12:56, Vincent Guittot wrote:
>>>> Le Tuesday 18 Oct 2016 à 12:34:12 (+0200), Peter Zijlstra a écrit :
>>>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>>>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> [...]
>
>>> But this test only makes sure that we don't see any ghost contribution
>>> (from non-existing cpus) any more.
>>>
>>> We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
>>> (with the highest tg having a task enqueued) a little bit more, with and
>>> without your v5 'sched: reflect sched_entity move into task_group's load'.
>>
>> Can you elaborate ?
>
> I try :-)
>
> I thought I will see some different behaviour because of the fact that
> the tg se's are initialized differently [1024 versus 0].

This difference should be noticeable (if at all) only during the few
hundred ms after the creation of the task group, until the load_avg
has reached its real value.

>
> But I can't spot any difference. The test case is running a sysbench
> thread affine to cpu1 in tg_root/tg_1/tg_11/tg_111 on tip/sched/core on
> an ARM64 Juno (6 logical cpus).
> The moment the sysbench task is put into tg_111
> tg_111->se[1]->avg.load_avg gets updated to 0 any way because of the
> huge time difference between creating this tg and attaching a task to
> it. So the tg->se[2]->avg.load_avg signals for tg_111, tg_11 and tg_1
> look exactly the same w/o and w/ your patch.
>
> But your patch helps in this (very synthetic) test case as well. W/o
> your patch I see remaining tg->load_avg for tg_1 and tg_11 after the
> test case has finished because the tg's were exclusively used on cpu1.
>
> # cat /proc/sched_debug
>
>  cfs_rq[1]:/tg_1
>    .tg_load_avg_contrib           : 0
>    .tg_load_avg                   : 5120 (5 (unused cpus) * 1024 * 1)
>  cfs_rq[1]:/tg_1/tg_11/tg_111
>    .tg_load_avg_contrib           : 0
>    .tg_load_avg                   : 0
>  cfs_rq[1]:/tg_1/tg_11
>    .tg_load_avg_contrib           : 0
>    .tg_load_avg                   : 5120
>
> With your patch applied all the .tg_load_avg are 0.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
  2016-10-19 17:41                                                   ` Vincent Guittot
@ 2016-10-20  7:56                                                     ` Morten Rasmussen
  0 siblings, 0 replies; 46+ messages in thread
From: Morten Rasmussen @ 2016-10-20  7:56 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Dietmar Eggemann, Joseph Salisbury, Ingo Molnar,
	Linus Torvalds, Thomas Gleixner, LKML, Mike Galbraith,
	omer.akram

On Wed, Oct 19, 2016 at 07:41:36PM +0200, Vincent Guittot wrote:
> On 19 October 2016 at 15:30, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
> >> Le Tuesday 18 Oct 2016 à 12:34:12 (+0200), Peter Zijlstra a écrit :
> >> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> >> > > On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > > > So aside from funny BIOSes, this should also show up when creating
> >> > > > cgroups when you have offlined a few CPUs, which is far more common I'd
> >> > > > think.
> >> > >
> >> > > The problem is also that the load of the tg->se[cpu] that represents
> >> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
> >> > > alloc_fair_sched_group
> >> > >      for_each_possible_cpu(i) {
> >> > >          init_entity_runnable_average(se);
> >> > >             sa->load_avg = scale_load_down(se->load.weight);
> >> > >
> >> > > Initializing  sa->load_avg to 1024 for a newly created task makes
> >> > > sense as we don't know yet what will be its real load but i'm not sure
> >> > > that we have to do the same for se that represents a task group. This
> >> > > load should be initialized to 0 and it will increase when task will be
> >> > > moved/attached into task group
> >> >
> >> > Yes, I think that makes sense, not sure how horrible that is with the
> >>
> >> That should not be that bad because this initial value is only useful for
> >> the few dozens of ms that follow the creation of the task group
> >
> > IMHO, it doesn't make much sense to initialize empty containers, which
> > group sched_entities really are, to 1024. It is meant to represent what
> > is in it, and a creation it is empty, so in my opinion initializing it
> > to zero make sense.
> >
> >> > current state of things, but after your propagate patch, that
> >> > reinstates the interactivity hack that should work for sure.
> >
> > It actually works on mainline/tip as well.
> >
> > As I see it, the fundamental problem is keeping group entities up to
> > date. Because the load_weight and hence se->avg.load_avg each per-cpu
> > group sched_entity depends on the group cfs_rq->tg_load_avg_contrib for
> > all cpus (tg->load_avg), including those that might be empty and
> > therefore not enqueued, we must ensure that they are updated some other
> > way. Most naturally as part of update_blocked_averages().
> >
> > To guarantee that, it basically boils down to making sure:
> > Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
> > leaf_cfs_rq_list.
> >
> > We can do that in different ways: 1) Add all cfs_rqs to the
> > leaf_cfs_rq_list at task group creation, or 2) initialize group
> > sched_entity contributions to zero and make sure that they are added to
> > leaf_cfs_rq_list as soon as a sched_entity (task or group) is enqueued
> > on it.
> >
> > Vincent patch below gives us the second option.
> >
> >>  kernel/sched/fair.c | 9 ++++++++-
> >>  1 file changed, 8 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 8b03fb5..89776ac 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
> >>        * will definitely be update (after enqueue).
> >>        */
> >>       sa->period_contrib = 1023;
> >> -     sa->load_avg = scale_load_down(se->load.weight);
> >> +     /*
> >> +      * Tasks are intialized with full load to be seen as heavy task until
> >> +      * they get a chance to stabilize to their real load level.
> >> +      * group entity are intialized with null load to reflect the fact that
> >> +      * nothing has been attached yet to the task group.
> >> +      */
> >> +     if (entity_is_task(se))
> >> +             sa->load_avg = scale_load_down(se->load.weight);
> >>       sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
> >>       /*
> >>        * At this point, util_avg won't be used in select_task_rq_fair anyway
> >
> > I would suggest adding a comment somewhere stating that we need to keep
> > group cfs_rqs up to date:
> >
> > -----
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index abb3763dff69..2b820d489be0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
> >                 if (throttled_hierarchy(cfs_rq))
> >                         continue;
> >
> > +               /*
> > +                * Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
> > +                * _must_ be on the leaf_cfs_rq_list to ensure that group shares
> > +                * are updated correctly.
> > +                */
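
The invariant that comment describes could, purely hypothetically, also be
spelled out as a debug check; this is only a sketch, not an existing check
in the kernel, and it assumes the cfs_rq->on_list and tg_load_avg_contrib
fields of this kernel version:

	/* Hypothetical sketch: any cfs_rq still contributing to
	 * tg->load_avg must be reachable via the leaf list, otherwise
	 * its contribution can never be decayed. */
	WARN_ON_ONCE(cfs_rq->tg_load_avg_contrib && !cfs_rq->on_list);
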
> 
> As discussed on IRC, the point is that even if the leaf cfs_rq is
> added to the leaf_cfs_rq_list, that doesn't ensure that it will be
> updated correctly for unplugged CPUs.

Agreed. We have to ensure that tg_load_avg_contrib is zeroed for leaf
cfs_rqs belonging to unplugged cpus. And if we modify the above to say
the leaf_cfs_rq_list of an online cpu, then we should be covered, I think.
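
To illustrate that unplugged-cpu point with the same toy model used earlier
in the thread (again just a sketch, not the kernel's hotplug path): the
stale contribution has to be pulled out of the group total explicitly,
because update_blocked_averages() will no longer walk that cfs_rq and decay
it for us.

/* Toy model only: on unplug, zero the per-cpu contribution so that
 * tg_load_avg() stops seeing load that can never be decayed again. */
static void toy_cpu_offline(long contrib[NR_CPUS], int cpu)
{
	contrib[cpu] = 0;	/* i.e. zero that cfs_rq's tg_load_avg_contrib */
}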

^ permalink raw reply	[flat|nested] 46+ messages in thread


Thread overview: 46+ messages
2016-10-07 19:38 [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes Joseph Salisbury
2016-10-07 19:57 ` Linus Torvalds
2016-10-07 20:22   ` Joseph Salisbury
2016-10-07 20:37     ` Linus Torvalds
2016-10-08  8:00 ` Peter Zijlstra
2016-10-08  8:39   ` Ingo Molnar
2016-10-08 11:37     ` Vincent Guittot
2016-10-08 11:49       ` Mike Galbraith
2016-10-12 12:20         ` Vincent Guittot
2016-10-12 15:35           ` Joseph Salisbury
2016-10-12 16:21           ` Joseph Salisbury
2016-10-13 10:58             ` Vincent Guittot
2016-10-13 15:52               ` Joseph Salisbury
2016-10-13 16:48                 ` Vincent Guittot
2016-10-13 18:49                   ` Dietmar Eggemann
2016-10-13 21:34                     ` Vincent Guittot
2016-10-14  8:24                       ` Vincent Guittot
2016-10-14 13:10                         ` Dietmar Eggemann
2016-10-14 15:18                           ` Vincent Guittot
2016-10-14 16:04                             ` Joseph Salisbury
2016-10-17  9:09                               ` Vincent Guittot
2016-10-17 11:49                                 ` Dietmar Eggemann
2016-10-17 13:19                                   ` Peter Zijlstra
2016-10-17 13:54                                     ` Vincent Guittot
2016-10-17 22:52                                       ` Dietmar Eggemann
2016-10-18  8:43                                         ` Vincent Guittot
2016-10-18  9:07                                         ` Peter Zijlstra
2016-10-18  9:45                                           ` Vincent Guittot
2016-10-18 10:34                                             ` Peter Zijlstra
2016-10-18 11:56                                               ` Vincent Guittot
2016-10-18 21:58                                                 ` Joonwoo Park
2016-10-19  6:42                                                   ` Vincent Guittot
2016-10-19  9:46                                                 ` Dietmar Eggemann
2016-10-19 11:25                                                   ` Vincent Guittot
2016-10-19 15:33                                                     ` Dietmar Eggemann
2016-10-19 17:33                                                       ` Joonwoo Park
2016-10-19 17:50                                                       ` Vincent Guittot
2016-10-19 11:33                                                 ` Peter Zijlstra
2016-10-19 11:50                                                   ` Vincent Guittot
2016-10-19 13:30                                                 ` Morten Rasmussen
2016-10-19 17:41                                                   ` Vincent Guittot
2016-10-20  7:56                                                     ` Morten Rasmussen
2016-10-19 14:49                                                 ` Joseph Salisbury
2016-10-19 14:53                                                   ` Vincent Guittot
2016-10-18 11:15                                           ` Dietmar Eggemann
2016-10-18 12:07                                             ` Peter Zijlstra
