* CFS scheduler unfairly prefers pinned tasks
@ 2015-10-05 21:48 paul.szabo
  2015-10-06  2:45 ` Mike Galbraith
  0 siblings, 1 reply; 48+ messages in thread
From: paul.szabo @ 2015-10-05 21:48 UTC (permalink / raw)
  To: linux-kernel

The Linux CFS scheduler prefers pinned tasks and unfairly
gives more CPU time to tasks that have set CPU affinity.
This effect is observed with or without CGROUP controls.

To demonstrate: on an otherwise idle machine, as some user
run several processes pinned to each CPU, one for each CPU
(as many as CPUs present in the system) e.g. for a quad-core
non-HyperThreaded machine:

  taskset -c 0 perl -e 'while(1){1}' &
  taskset -c 1 perl -e 'while(1){1}' &
  taskset -c 2 perl -e 'while(1){1}' &
  taskset -c 3 perl -e 'while(1){1}' &

and (as that same or some other user) run some without
pinning:

  perl -e 'while(1){1}' &
  perl -e 'while(1){1}' &

and use e.g.   top   to observe that the pinned processes get
more CPU time than "fair".
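
To put a number on it instead of eyeballing top (a minimal sketch; the 60
second interval is arbitrary), compare the accumulated CPU time of the perl
processes before and after a fixed interval:

  ps -C perl -o pid,psr,cputime,pcpu,args
  sleep 60
  ps -C perl -o pid,psr,cputime,pcpu,args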

Fairness is obtained when either:
 - there are as many un-pinned processes as CPUs; or
 - with CGROUP controls and the two kinds of processes run by
   different users, when there is just one un-pinned process; or
 - if the pinning is turned off for these processes (e.g. with
   taskset -p, see below), or they are started without it.
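
(Un-pinning an already-running process, as in the last point, can be done
with taskset -p; e.g. on the quad-core example, with $pid being one of the
pinned perl processes:

  taskset -pc 0-3 $pid

after which that process is free to run on any of CPUs 0-3.)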

Any insight is welcome!

---

I would appreciate replies direct to me as I am not subscribed to the
linux-kernel mailing list (but will try to watch the archives).

This bug is also reported to Debian, please see
  http://bugs.debian.org/800945

I use Debian with the 3.16 kernel, have not yet tried 4.* kernels.


Thanks, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia


* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-05 21:48 CFS scheduler unfairly prefers pinned tasks paul.szabo
@ 2015-10-06  2:45 ` Mike Galbraith
  2015-10-06 10:06   ` paul.szabo
  2015-10-08  8:19   ` Mike Galbraith
  0 siblings, 2 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-06  2:45 UTC (permalink / raw)
  To: paul.szabo; +Cc: linux-kernel

On Tue, 2015-10-06 at 08:48 +1100, paul.szabo@sydney.edu.au wrote:
> The Linux CFS scheduler prefers pinned tasks and unfairly
> gives more CPU time to tasks that have set CPU affinity.
> This effect is observed with or without CGROUP controls.
> 
> To demonstrate: on an otherwise idle machine, as some user
> run several processes pinned to each CPU, one for each CPU
> (as many as CPUs present in the system) e.g. for a quad-core
> non-HyperThreaded machine:
> 
>   taskset -c 0 perl -e 'while(1){1}' &
>   taskset -c 1 perl -e 'while(1){1}' &
>   taskset -c 2 perl -e 'while(1){1}' &
>   taskset -c 3 perl -e 'while(1){1}' &
> 
> and (as that same or some other user) run some without
> pinning:
> 
>   perl -e 'while(1){1}' &
>   perl -e 'while(1){1}' &
> 
> and use e.g.   top   to observe that the pinned processes get
> more CPU time than "fair".
> 
> Fairness is obtained when either:
>  - there are as many un-pinned processes as CPUs; or
>  - with CGROUP controls and the two kinds of processes run by
>    different users, when there is just one un-pinned process; or
>  - if the pinning is turned off for these processes (or they
>    are started without).
> 
> Any insight is welcome!

If they can all migrate, load balancing can move any of them to try to
fix the permanent imbalance, so they'll all bounce about sharing a CPU
with some other hog, and it all kinda sorta works out.

When most are pinned, to make it work out long term you'd have to be
short term unfair, walking the unpinned minority around the box in a
carefully orchestrated dance... and have omniscient powers that assure
that none of the tasks you're trying to equalize is gonna do something
rude like leave, sleep, fork or whatever, and muck up the grand plan.

	-Mike



* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-06  2:45 ` Mike Galbraith
@ 2015-10-06 10:06   ` paul.szabo
  2015-10-06 12:17     ` Mike Galbraith
  2015-10-08  8:19   ` Mike Galbraith
  1 sibling, 1 reply; 48+ messages in thread
From: paul.szabo @ 2015-10-06 10:06 UTC (permalink / raw)
  To: umgwanakikbuti; +Cc: linux-kernel

Dear Mike,

>> .. CFS ... unfairly gives more CPU time to [pinned] tasks ...
>
> If they can all migrate, load balancing can move any of them to try to
> fix the permanent imbalance, so they'll all bounce about sharing a CPU
> with some other hog, and it all kinda sorta works out.
>
> When most are pinned, to make it work out long term you'd have to be
> short term unfair, walking the unpinned minority around the box in a
> carefully orchestrated dance... and have omniscient powers that assure
> that none of the tasks you're trying to equalize is gonna do something
> rude like leave, sleep, fork or whatever, and muck up the grand plan.

Could not your argument be turned around: for a pinned task it is harder
to find an idle CPU, so they should get less time?

But really... those pinned tasks do not hog the CPU forever. Whatever
kicks them off: could not that be done just a little earlier?

And further... the CFS is meant to be fair, using things like vruntime
to preempt, and throttling. Why are those pinned tasks not preempted or
throttled?

Thanks, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia


* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-06 10:06   ` paul.szabo
@ 2015-10-06 12:17     ` Mike Galbraith
  2015-10-06 20:44       ` paul.szabo
  0 siblings, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2015-10-06 12:17 UTC (permalink / raw)
  To: paul.szabo; +Cc: linux-kernel

On Tue, 2015-10-06 at 21:06 +1100, paul.szabo@sydney.edu.au wrote:

> And further... the CFS is meant to be fair, using things like vruntime
> to preempt, and throttling. Why are those pinned tasks not preempted or
> throttled?

Imagine you own an 8192 CPU box for a moment, all CPUs having one pinned
task, plus one extra unpinned task, and ponder what would have to happen
in order to meet your utilization expectation.  <time passes>  Right.
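
(Arithmetic: that's 8193 runnable tasks on 8192 CPUs, so a strictly fair
scheduler owes each task 8192/8193 ~= 99.99% of a CPU.  The only way to
give the lone unpinned task its share is to walk it across all 8192 CPUs,
shaving roughly 0.01% off every pinned task along the way.)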

What you're seeing is not a bug.  No task can occupy more than one CPU
at a time, making space reservation on multiple CPUs a very bad idea.

	-Mike



* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-06 12:17     ` Mike Galbraith
@ 2015-10-06 20:44       ` paul.szabo
  2015-10-07  1:28         ` Mike Galbraith
  0 siblings, 1 reply; 48+ messages in thread
From: paul.szabo @ 2015-10-06 20:44 UTC (permalink / raw)
  To: umgwanakikbuti; +Cc: linux-kernel

Dear Mike,

>> ... the CFS is meant to be fair, using things like vruntime
>> to preempt, and throttling. Why are those pinned tasks not preempted or
>> throttled?
>
> Imagine you own an 8192 CPU box for a moment, all CPUs having one pinned
> task, plus one extra unpinned task, and ponder what would have to happen
> in order to meet your utilization expectation. ...

Sorry, but the kernel contradicts that. As per my original report, things are
"fair" in the case of:
 - with CGROUP controls and the two kinds of processes run by
   different users, when there is just one un-pinned process
and that is so on my quad-core i5-3470 baby or my 32-core 4*E5-4627v2
server (and everywhere that I tested). The kernel is smart and gets it
right for one un-pinned process: why not for two?
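
For concreteness, "with CGROUP controls" above means a per-user split along
these lines (a minimal sketch using the libcgroup tools; the group names
userA/userB are placeholders, not my exact setup):

  cgcreate -g cpu:/userA -g cpu:/userB
  cgexec -g cpu:userA taskset -c 0 perl -e 'while(1){1}' &   # ...one per CPU
  cgexec -g cpu:userB perl -e 'while(1){1}' &                # the un-pinned one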

Now re-testing further (on some machines with CGROUP): on the i5-3470
things are fair still with one un-pinned (become un-fair with two), on
the 4*E5-4627v2 are fair still with 4 un-pinned (become un-fair with 5).
Does this suggest that the kernel does things right within each physical
CPU, but breaks across several (or exactly the contrary)? Maybe not: on a
2*E5530 machine, things are fair with just one un-pinned and un-fair
with 2 already.

> What you're seeing is not a bug.  No task can occupy more than one CPU
> at a time, making space reservation on multiple CPUs a very bad idea.

I agree that pinning may be bad... should not the kernel penalize the
badly pinned processes?

Cheers, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia


* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-06 20:44       ` paul.szabo
@ 2015-10-07  1:28         ` Mike Galbraith
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-07  1:28 UTC (permalink / raw)
  To: paul.szabo; +Cc: linux-kernel

On Wed, 2015-10-07 at 07:44 +1100, paul.szabo@sydney.edu.au wrote:

> I agree that pinning may be bad... should not the kernel penalize the
> badly pinned processes?

I didn't say pinning is bad, I said that what you're seeing is not a bug.

	-Mike



* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-06  2:45 ` Mike Galbraith
  2015-10-06 10:06   ` paul.szabo
@ 2015-10-08  8:19   ` Mike Galbraith
  2015-10-08 10:54     ` paul.szabo
  2015-10-10  3:59     ` Wanpeng Li
  1 sibling, 2 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-08  8:19 UTC (permalink / raw)
  To: paul.szabo, Peter Zijlstra; +Cc: linux-kernel

On Tue, 2015-10-06 at 04:45 +0200, Mike Galbraith wrote:
> On Tue, 2015-10-06 at 08:48 +1100, paul.szabo@sydney.edu.au wrote:
> > The Linux CFS scheduler prefers pinned tasks and unfairly
> > gives more CPU time to tasks that have set CPU affinity.
> > This effect is observed with or without CGROUP controls.
> > 
> > To demonstrate: on an otherwise idle machine, as some user
> > run several processes pinned to each CPU, one for each CPU
> > (as many as CPUs present in the system) e.g. for a quad-core
> > non-HyperThreaded machine:
> > 
> >   taskset -c 0 perl -e 'while(1){1}' &
> >   taskset -c 1 perl -e 'while(1){1}' &
> >   taskset -c 2 perl -e 'while(1){1}' &
> >   taskset -c 3 perl -e 'while(1){1}' &
> > 
> > and (as that same or some other user) run some without
> > pinning:
> > 
> >   perl -e 'while(1){1}' &
> >   perl -e 'while(1){1}' &
> > 
> > and use e.g.   top   to observe that the pinned processes get
> > more CPU time than "fair".

I see a fairness issue with pinned tasks and group scheduling, but one
opposite to your complaint.
 
Two task groups, one with 8 hogs (oink), one with 1 (pert), all are pinned.
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
 3269 root      20   0    4060    724    648 R 100.0 0.004   1:00.02 1 oink
 3270 root      20   0    4060    652    576 R 100.0 0.004   0:59.84 2 oink
 3271 root      20   0    4060    692    616 R 100.0 0.004   0:59.95 3 oink
 3274 root      20   0    4060    608    532 R 100.0 0.004   1:00.01 6 oink
 3273 root      20   0    4060    728    652 R 99.90 0.005   0:59.98 5 oink
 3272 root      20   0    4060    644    568 R 99.51 0.004   0:59.80 4 oink
 3268 root      20   0    4060    612    536 R 99.41 0.004   0:59.67 0 oink
 3279 root      20   0    8312    804    708 R 88.83 0.005   0:53.06 7 pert
 3275 root      20   0    4060    656    580 R 11.07 0.004   0:06.98 7 oink
That group share math would make a huge compute group with progress
checkpoints, sharing an SGI monster with one other hog, amusing to watch.
  
	-Mike



* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-08  8:19   ` Mike Galbraith
@ 2015-10-08 10:54     ` paul.szabo
  2015-10-08 11:19       ` Peter Zijlstra
  2015-10-08 14:25       ` CFS scheduler unfairly prefers pinned tasks Mike Galbraith
  2015-10-10  3:59     ` Wanpeng Li
  1 sibling, 2 replies; 48+ messages in thread
From: paul.szabo @ 2015-10-08 10:54 UTC (permalink / raw)
  To: peterz, umgwanakikbuti; +Cc: linux-kernel

Dear Mike,

> I see a fairness issue ... but one opposite to your complaint.

Why is that opposite? I think it would be fair for the one pert process
to get 100% CPU, and the many oink processes can get everything else. That
one oink is at a lowly 10% (when the others are at 100%) is of no consequence.

What happens when you un-pin pert: does it get 100%? What if you run two
perts? Have you reproduced my observations?

---

Good to see that you agree on the fairness issue... it MUST be fixed!
CFS might be wrong or wasteful, but never unfair.

Cheers, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia


* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-08 10:54     ` paul.szabo
@ 2015-10-08 11:19       ` Peter Zijlstra
  2015-10-10 13:22         ` [patch] sched: disable task group re-weighting on the desktop Mike Galbraith
  2015-10-08 14:25       ` CFS scheduler unfairly prefers pinned tasks Mike Galbraith
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-08 11:19 UTC (permalink / raw)
  To: paul.szabo; +Cc: umgwanakikbuti, linux-kernel

On Thu, Oct 08, 2015 at 09:54:21PM +1100, paul.szabo@sydney.edu.au wrote:
> Good to see that you agree on the fairness issue... it MUST be fixed!
> CFS might be wrong or wasteful, but never unfair.

I've not yet had time to look at the case at hand, but there are what are
called 'infeasible weight' scenarios for which it is impossible to be
fair.

Also, CFS must remain a practical scheduler, which places bounds on the
amount of weird cases we can deal with.


* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-08 10:54     ` paul.szabo
  2015-10-08 11:19       ` Peter Zijlstra
@ 2015-10-08 14:25       ` Mike Galbraith
  2015-10-08 21:55         ` paul.szabo
  1 sibling, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2015-10-08 14:25 UTC (permalink / raw)
  To: paul.szabo; +Cc: peterz, linux-kernel

On Thu, 2015-10-08 at 21:54 +1100, paul.szabo@sydney.edu.au wrote:
> Dear Mike,
> 
> > I see a fairness issue ... but one opposite to your complaint.
> 
> Why is that opposite? I think it would be fair for the one pert process
> to get 100% CPU, the many oink processes can get everything else. That
> one oink is lowly 10% (when others are 100%) is of no consequence.

Well, not exactly opposite, only opposite in that the one pert task also
receives MORE than its fair share when unpinned.  Two 100% hogs sharing
one CPU should each get 50% of that CPU.  The fact that the oink group
contains 8 tasks vs 1 for the pert group should be irrelevant, but what
that last oinker is getting is 1/9 of a CPU, and there just happen to be
9 runnable tasks total, 1 in group pert, and 8 in group oink.
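
Back of the envelope, assuming both groups sit at the default 1024
cpu.shares: the oink group's weight gets split across the 8 CPUs it is
busy on, so its entity on CPU 7 carries roughly 1024/8 = 128, while pert's
entity there carries the full 1024.  128 / (128 + 1024) = 1/9, which is
just about what top shows for that last oinker.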

IFF that ratio were to prove to be a constant, AND the oink group were a
massively parallel and synchronized compute job on a huge box, that
entire compute job would not merely be slowed down by the factor of 2
that a fair distribution would do to it; on say a 1000 core box, it'd
be.. utterly dead, because you'd put it out of your misery.

vogelweide:~/:[0]# cgexec -g cpu:foo bash
vogelweide:~/:[0]# for i in `seq 0 63`; do taskset -c $i cpuhog& done
[1] 8025
[2] 8026
...
vogelweide:~/:[130]# cgexec -g cpu:bar bash
vogelweide:~/:[130]# taskset -c 63 pert 10 (report every 10 seconds)
2260.91 MHZ CPU
perturbation threshold 0.024 usecs.
pert/s:      255 >2070.76us:       38 min:  0.05 max:4065.46 avg: 93.83 sum/s: 23946us overhead: 2.39%
pert/s:      255 >2070.32us:       37 min:  1.32 max:4039.94 avg: 92.82 sum/s: 23744us overhead: 2.37%
pert/s:      253 >2069.85us:       38 min:  0.05 max:4036.44 avg: 94.89 sum/s: 24054us overhead: 2.41%

Hm, that's a kinda odd looking number from my 64 core box, but whatever,
it's far from fair according to my definition thereof.  Poor little oink
plus all other cycles not spent in pert's tight loop add up to ~24ms/s.

> Good to see that you agree on the fairness issue... it MUST be fixed!
> CFS might be wrong or wasteful, but never unfair.

Weeell, we've disagreed on pretty much everything we've talked about so
far, but I can well imagine that what I see in the share update business
_could_ be part of your massive compute job woes.

	-Mike



* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-08 14:25       ` CFS scheduler unfairly prefers pinned tasks Mike Galbraith
@ 2015-10-08 21:55         ` paul.szabo
  2015-10-09  1:56           ` Mike Galbraith
  2015-10-09  2:40           ` Mike Galbraith
  0 siblings, 2 replies; 48+ messages in thread
From: paul.szabo @ 2015-10-08 21:55 UTC (permalink / raw)
  To: umgwanakikbuti; +Cc: linux-kernel, peterz

Dear Mike,

>>> I see a fairness issue ... but one opposite to your complaint.
>> Why is that opposite? ...
>
> Well, not exactly opposite, only opposite in that the one pert task also
> receives MORE than its fair share when unpinned.  Two 100% hogs sharing
> one CPU should each get 50% of that CPU. ...

But you are using CGROUPs, grouping all oinks into one group, and the
one pert into another: requesting each group to get same total CPU.
Since pert has one process only, the most he can get is 100% (not 400%),
and it is quite OK for the oinks together to get 700%.

> IFF ... massively parallel and synchronized ...

You would be making the assumption that you had the machine to yourself:
might be the wrong thing to assume.

>> Good to see that you agree ...
> Weeell, we've disagreed on pretty much everything ...

Sorry I disagree: we do agree on the essence. :-)

Cheers, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia


* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-08 21:55         ` paul.szabo
@ 2015-10-09  1:56           ` Mike Galbraith
  2015-10-09  2:40           ` Mike Galbraith
  1 sibling, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-09  1:56 UTC (permalink / raw)
  To: paul.szabo; +Cc: linux-kernel, peterz

On Fri, 2015-10-09 at 08:55 +1100, paul.szabo@sydney.edu.au wrote:
> Dear Mike,
> 
> >>> I see a fairness issue ... but one opposite to your complaint.
> >> Why is that opposite? ...
> >
> > Well, not exactly opposite, only opposite in that the one pert task also
> > receives MORE than its fair share when unpinned.  Two 100% hogs sharing
> > one CPU should each get 50% of that CPU. ...
> 
> But you are using CGROUPs, grouping all oinks into one group, and the
> one pert into another: requesting each group to get same total CPU.
> Since pert has one process only, the most he can get is 100% (not 400%),
> and it is quite OK for the oinks together to get 700%.

Well, that of course depends on what you call fair.  I realize why and
where it happens.  I told weight adjustment to keep its grubby mitts off
of autogroups, and of course the "problem" went away.  Back to the
viewpoint thing, with two users, each having been _placed_ in a group, I
can well imagine a user who is trying to use all of his authorized
bandwidth raising an eyebrow when he sees one of his tasks getting 24
whole milliseconds per second with an allegedly fair scheduler.

I can see it both ways.  What's going to come out of this is probably
going to be "tough titty, yes, group scheduling has side effects, and
this is one".  I already know it does.  Question is only whether the
weight adjustment gears are spinning as intended or not.

> > IFF ... massively parallel and synchronized ...
> 
> You would be making the assumption that you had the machine to yourself:
> might be the wrong thing to assume.

Yup, it would be a doomed attempt to run a load which cannot thrive in a
shared environment in such an environment.  Are any of the compute loads
you're having trouble with.. in the math department..  perhaps doing oh,
say complex math goop that feeds the output of one parallel computation
into the next parallel computation? :)

	-Mike



* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-08 21:55         ` paul.szabo
  2015-10-09  1:56           ` Mike Galbraith
@ 2015-10-09  2:40           ` Mike Galbraith
  2015-10-11  9:43             ` paul.szabo
  1 sibling, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2015-10-09  2:40 UTC (permalink / raw)
  To: paul.szabo; +Cc: linux-kernel, peterz

On Fri, 2015-10-09 at 08:55 +1100, paul.szabo@sydney.edu.au wrote:

> >> Good to see that you agree ...
> > Weeell, we've disagreed on pretty much everything ...
> 
> Sorry I disagree: we do agree on the essence. :-)

P.S.

To some extent.  If the essence is $subject, nope, we definitely
disagree.  If the essence is that _group_ scheduling is not strictly
fair, then we agree.  The "must be fixed" bit I also disagree with.
"Maybe wants fixing" I can agree with ;-)

	-Mike



* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-08  8:19   ` Mike Galbraith
  2015-10-08 10:54     ` paul.szabo
@ 2015-10-10  3:59     ` Wanpeng Li
  2015-10-10  7:58       ` Wanpeng Li
  1 sibling, 1 reply; 48+ messages in thread
From: Wanpeng Li @ 2015-10-10  3:59 UTC (permalink / raw)
  To: paul.szabo, Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel

Hi Paul,
On 10/8/15 4:19 PM, Mike Galbraith wrote:
> On Tue, 2015-10-06 at 04:45 +0200, Mike Galbraith wrote:
>> On Tue, 2015-10-06 at 08:48 +1100, paul.szabo@sydney.edu.au wrote:
>>> The Linux CFS scheduler prefers pinned tasks and unfairly
>>> gives more CPU time to tasks that have set CPU affinity.
>>> This effect is observed with or without CGROUP controls.
>>>
>>> To demonstrate: on an otherwise idle machine, as some user
>>> run several processes pinned to each CPU, one for each CPU
>>> (as many as CPUs present in the system) e.g. for a quad-core
>>> non-HyperThreaded machine:
>>>
>>>    taskset -c 0 perl -e 'while(1){1}' &
>>>    taskset -c 1 perl -e 'while(1){1}' &
>>>    taskset -c 2 perl -e 'while(1){1}' &
>>>    taskset -c 3 perl -e 'while(1){1}' &
>>>
>>> and (as that same or some other user) run some without
>>> pinning:
>>>
>>>    perl -e 'while(1){1}' &
>>>    perl -e 'while(1){1}' &
>>>
>>> and use e.g.   top   to observe that the pinned processes get
>>> more CPU time than "fair".

Interesting, I can reproduce it w/ your simple script. However, they are
fair when the number of pinned perl tasks is equal to the number of
unpinned perl tasks. I will dig into it more deeply.

Regards,
Wanpeng Li


* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-10  3:59     ` Wanpeng Li
@ 2015-10-10  7:58       ` Wanpeng Li
  0 siblings, 0 replies; 48+ messages in thread
From: Wanpeng Li @ 2015-10-10  7:58 UTC (permalink / raw)
  To: paul.szabo, Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel

On 10/10/15 11:59 AM, Wanpeng Li wrote:
> Hi Paul,
> On 10/8/15 4:19 PM, Mike Galbraith wrote:
>> On Tue, 2015-10-06 at 04:45 +0200, Mike Galbraith wrote:
>>> On Tue, 2015-10-06 at 08:48 +1100, paul.szabo@sydney.edu.au wrote:
>>>> The Linux CFS scheduler prefers pinned tasks and unfairly
>>>> gives more CPU time to tasks that have set CPU affinity.
>>>> This effect is observed with or without CGROUP controls.
>>>>
>>>> To demonstrate: on an otherwise idle machine, as some user
>>>> run several processes pinned to each CPU, one for each CPU
>>>> (as many as CPUs present in the system) e.g. for a quad-core
>>>> non-HyperThreaded machine:
>>>>
>>>>    taskset -c 0 perl -e 'while(1){1}' &
>>>>    taskset -c 1 perl -e 'while(1){1}' &
>>>>    taskset -c 2 perl -e 'while(1){1}' &
>>>>    taskset -c 3 perl -e 'while(1){1}' &
>>>>
>>>> and (as that same or some other user) run some without
>>>> pinning:
>>>>
>>>>    perl -e 'while(1){1}' &
>>>>    perl -e 'while(1){1}' &
>>>>
>>>> and use e.g.   top   to observe that the pinned processes get
>>>> more CPU time than "fair".
>
> Interesting, I can reproduce it w/ your simple script. However, they are
> fair when the number of pinned perl tasks is equal to the number of
> unpinned perl tasks. I will dig into it more deeply.

For the pinned tasks, when I set the task affinity to all the available
cpus instead of a separate cpu each as in your test, things are fair
between the pinned tasks and the unpinned tasks. So I suspect it is the
overhead associated with the migration stuff.
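
(I.e., on the quad-core example, starting the "pinned" tasks like

  taskset -c 0-3 perl -e 'while(1){1}' &

instead of pinning each one to a separate cpu.)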

Regards,
Wanpeng Li



* [patch] sched: disable task group re-weighting on the desktop
  2015-10-08 11:19       ` Peter Zijlstra
@ 2015-10-10 13:22         ` Mike Galbraith
  2015-10-10 14:03           ` kbuild test robot
                             ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-10 13:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: paul.szabo, linux-kernel

On Thu, 2015-10-08 at 13:19 +0200, Peter Zijlstra wrote:
> On Thu, Oct 08, 2015 at 09:54:21PM +1100, paul.szabo@sydney.edu.au wrote:
> > Good to see that you agree on the fairness issue... it MUST be fixed!
> > CFS might be wrong or wasteful, but never unfair.
> 
> I've not yet had time to look at the case at hand, but there are wat is
> called 'infeasible weight' scenarios for which it is impossible to be
> fair.

And sometimes, group wide fairness ain't all that wonderful anyway.

> Also, CFS must remain a practical scheduler, which places bounds on the
> amount of weird cases we can deal with.

Yup, and on a practical note...

master, 1 group of 8 (oink) vs 8 groups of 1 (pert)
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND                                                                                                                                 
 5618 root      20   0    8312    840    744 R 90.46 0.005   1:40.48 0 pert                                                                                                                                    
 5630 root      20   0    8312    720    624 R 90.46 0.004   1:38.40 4 pert                                                                                                                                    
 5615 root      20   0    8312    768    672 R 89.48 0.005   1:39.25 6 pert                                                                                                                                    
 5621 root      20   0    8312    792    696 R 89.34 0.005   1:38.49 2 pert                                                                                                                                    
 5627 root      20   0    8312    760    664 R 89.06 0.005   1:36.53 5 pert                                                                                                                                    
 5645 root      20   0    8312    804    708 R 89.06 0.005   1:34.69 1 pert                                                                                                                                    
 5624 root      20   0    8312    716    620 R 88.64 0.004   1:38.45 7 pert                                                                                                                                    
 5612 root      20   0    8312    716    620 R 83.03 0.004   1:40.11 3 pert                                                                                                                                    
 5633 root      20   0    8312    792    696 R 10.94 0.005   0:11.59 4 oink                                                                                                                                    
 5635 root      20   0    8312    804    708 R 10.80 0.005   0:11.74 2 oink                                                                                                                                    
 5637 root      20   0    8312    796    700 R 10.80 0.005   0:11.34 5 oink                                                                                                                                    
 5639 root      20   0    8312    836    740 R 10.80 0.005   0:11.71 2 oink                                                                                                                                    
 5634 root      20   0    8312    840    744 R 10.66 0.005   0:11.36 7 oink                                                                                                                                    
 5636 root      20   0    8312    756    660 R 10.66 0.005   0:11.68 1 oink                                                                                                                                    
 5640 root      20   0    8312    752    656 R 10.10 0.005   0:11.41 7 oink                                                                                                                                    
 5638 root      20   0    8312    804    708 R 9.818 0.005   0:11.99 7 oink

Avg 98.2s per pert group vs 92.8s total for the 8 task oink group.  Not _perfect_, but ok.

Before reading further, now would be a good time for readers to chant the
"perfect is the enemy of good" mantra, pretending my not so scientific
measurements had actually shown perfect group wide distribution.  You're
gonna see good, and it doesn't resemble perfect.. which is good ;-)

master+, 1 group of 8 (oink) vs 8 groups of 1 (pert)
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND                                                                                                                                 
19269 root      20   0    8312    716    620 R 77.25 0.004   1:39.43 2 pert                                                                                                                                    
19263 root      20   0    8312    752    656 R 76.65 0.005   1:43.70 7 pert                                                                                                                                    
19257 root      20   0    8312    760    664 R 72.85 0.005   1:37.08 5 pert                                                                                                                                    
19260 root      20   0    8312    804    704 R 71.86 0.005   1:40.42 1 pert                                                                                                                                    
19273 root      20   0    8312    748    652 R 71.26 0.005   1:41.98 6 pert                                                                                                                                    
19266 root      20   0    8312    752    656 R 67.47 0.005   1:41.69 4 pert                                                                                                                                    
19254 root      20   0    8312    744    648 R 61.28 0.005   1:42.88 4 pert                                                                                                                                    
19277 root      20   0    8312    836    740 R 56.29 0.005   0:46.16 5 oink                                                                                                                                    
19281 root      20   0    8312    768    672 R 55.89 0.005   0:42.05 0 oink                                                                                                                                    
19283 root      20   0    8312    840    744 R 44.91 0.005   0:53.05 3 oink                                                                                                                                    
19282 root      20   0    8312    800    704 R 30.74 0.005   0:41.70 3 oink                                                                                                                                    
19284 root      20   0    8312    724    628 R 28.14 0.004   0:42.08 3 oink                                                                                                                                    
19278 root      20   0    8312    752    656 R 25.15 0.005   0:42.26 3 oink                                                                                                                                    
19280 root      20   0    8312    756    660 R 24.35 0.005   0:40.39 3 oink                                                                                                                                    
19279 root      20   0    8312    836    740 R 23.95 0.005   0:45.71 3 oink

Avg 101.6s per pert group vs 353.4s for the 8 task oink group.  Not remotely
fair total group utilization wise.  Ah, but now onward to interactivity...

master, 8 groups of 1 (pert) vs desktop (mplayer BigBuckBunny-DivXPlusHD.mkv)
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND                                                                                                                                 
 4068 root      20   0    8312    724    628 R 99.64 0.004   1:04.32 6 pert                                                                                                                                    
 4065 root      20   0    8312    744    648 R 99.45 0.005   1:04.92 5 pert                                                                                                                                    
 4071 root      20   0    8312    748    652 R 99.27 0.005   1:03.12 7 pert                                                                                                                                    
 4077 root      20   0    8312    840    744 R 98.72 0.005   1:01.46 3 pert                                                                                                                                    
 4074 root      20   0    8312    796    700 R 98.18 0.005   1:03.38 1 pert                                                                                                                                    
 4079 root      20   0    8312    720    624 R 97.99 0.004   1:01.45 4 pert                                                                                                                                    
 4062 root      20   0    8312    836    740 R 96.72 0.005   1:03.44 0 pert                                                                                                                                    
 4059 root      20   0    8312    720    624 R 94.16 0.004   1:04.92 2 pert                                                                                                                                    
 4082 root      20   0 1094400 154324  33592 S 4.197 0.954   0:02.69 0 mplayer                                                                                                                                 
 1029 root      20   0  465332 151540  40816 R 3.285 0.937   0:24.59 2 Xorg                                                                                                                                    
 1773 root      20   0  662592  73308  42012 S 2.007 0.453   0:12.84 5 konsole                                                                                                                                 
  771 root      20   0   11416   1964   1824 S 0.730 0.012   0:10.45 0 rngd                                                                                                                                    
 1722 root      20   0 2866772  65224  51152 S 0.365 0.403   0:03.44 2 kwin                                                                                                                                    
 1769 root      20   0  711684  54212  38020 S 0.182 0.335   0:00.39 1 kmix

That is NOT good.  Mplayer and friends need more than that.  Interactivity
is _horrible_, and buck is an unwatchable mess (no biggy, I know every frame).

master+, 8 groups of 1 (pert) vs desktop (mplayer BigBuckBunny-DivXPlusHD.mkv)
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND                                                                                                                                 
 4346 root      20   0    8312    756    660 R 99.20 0.005   0:59.89 5 pert                                                                                                                                    
 4349 root      20   0    8312    748    652 R 98.80 0.005   1:00.77 6 pert                                                                                                                                    
 4343 root      20   0    8312    720    624 R 94.81 0.004   1:02.11 2 pert                                                                                                                                    
 4331 root      20   0    8312    724    628 R 91.22 0.004   1:01.16 3 pert                                                                                                                                    
 4340 root      20   0    8312    720    624 R 91.22 0.004   1:01.06 7 pert                                                                                                                                    
 4328 root      20   0    8312    836    740 R 90.42 0.005   1:00.07 4 pert                                                                                                                                    
 4334 root      20   0    8312    756    660 R 87.82 0.005   0:59.84 1 pert                                                                                                                                    
 4337 root      20   0    8312    824    728 R 76.85 0.005   0:52.20 0 pert                                                                                                                                    
 4352 root      20   0 1058812 123876  33388 S 29.34 0.766   0:25.01 3 mplayer                                                                                                                                 
 1029 root      20   0  471168 156748  40316 R 22.36 0.969   0:42.23 3 Xorg                                                                                                                                    
 1773 root      20   0  663080  74176  42012 S 4.192 0.459   0:17.98 1 konsole                                                                                                                                 
  771 root      20   0   11416   1964   1824 R 1.198 0.012   0:13.45 0 rngd                                                                                                                                    
 1722 root      20   0 2866880  65340  51152 R 0.599 0.404   0:04.87 3 kwin                                                                                                                                    
 1788 root       9 -11  516744  11932   8536 S 0.599 0.074   0:01.01 0 pulseaudio                                                                                                                              
 1733 root      20   0 3369480 141564  71776 S 0.200 0.875   0:05.51 1 plasma-desktop                                                                                                                          

That's good.  Interactivity is fine, I can't even tell pert groups exist by
watching buck kick squirrel butt for the 10387th time.  With master, and one 8
hog group vs desktop/mplayer, I can see the hog group interfere with mplayer.
Add another hog group, mplayer lurches quite badly.  I can feel even one group
while using mouse wheel to scroll through mail.  With master+, I see/feel none
of that unpleasantness.

Conclusion: task group re-weighting is the mortal enemy of a good desktop.
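
For anyone who wants to flip it at runtime rather than rebuild: with
CONFIG_SCHED_DEBUG it is the usual sched_features knob (assuming debugfs
is mounted at /sys/kernel/debug):

  echo SMP_FAIR_GROUPS > /sys/kernel/debug/sched_features     # group wide re-weighting on
  echo NO_SMP_FAIR_GROUPS > /sys/kernel/debug/sched_features  # fixed group weight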

sched: disable task group re-weighting on the desktop

Task group wide utilization based weight may work well for servers, but it
is horrible on the desktop.  8 groups of 1 hog demolishes interactivity, 1
group of 8 hogs has noticeable impact, 2 such groups are very very noticeable.

Turn it off if autogroup is enabled, and add a feature to let people set the
definition of fair to what serves them best.  For the desktop, fixed group
weight wins hands down, no contest....

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 kernel/sched/fair.c     |   10 ++++++----
 kernel/sched/features.h |   14 ++++++++++++++
 2 files changed, 20 insertions(+), 4 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2372,6 +2372,8 @@ static long calc_cfs_shares(struct cfs_r
 {
 	long tg_weight, load, shares;
 
+	if (!sched_feat(SMP_FAIR_GROUPS))
+		return tg->shares;
 	tg_weight = calc_tg_weight(tg, cfs_rq);
 	load = cfs_rq_load_avg(cfs_rq);
 
@@ -2420,10 +2422,10 @@ static void update_cfs_shares(struct cfs
 	se = tg->se[cpu_of(rq_of(cfs_rq))];
 	if (!se || throttled_hierarchy(cfs_rq))
 		return;
-#ifndef CONFIG_SMP
-	if (likely(se->load.weight == tg->shares))
-		return;
-#endif
+	if (!IS_ENABLED(CONFIG_SMP) || !sched_feat(SMP_FAIR_GROUPS)) {
+		if (likely(se->load.weight == tg->shares))
+			return;
+	}
 	shares = calc_cfs_shares(cfs_rq, tg);
 
 	reweight_entity(cfs_rq_of(se), se, shares);
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -88,3 +88,17 @@ SCHED_FEAT(LB_MIN, false)
  */
 SCHED_FEAT(NUMA,	true)
 #endif
+
+#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+/*
+ * With SMP_FAIR_GROUPS set, group wide activity determines the share for
+ * all group members.  This does very bad things to interactivity when
+ * a desktop box is heavily loaded.  Default to off when autogroup is
+ * enabled, and let all users set it to what works best for them.
+ */
+#ifndef CONFIG_SCHED_AUTOGROUP
+SCHED_FEAT(SMP_FAIR_GROUPS, true)
+#else
+SCHED_FEAT(SMP_FAIR_GROUPS, false)
+#endif
+#endif




* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-10 13:22         ` [patch] sched: disable task group re-weighting on the desktop Mike Galbraith
@ 2015-10-10 14:03           ` kbuild test robot
  2015-10-10 14:41             ` Mike Galbraith
  2015-10-10 17:01           ` Peter Zijlstra
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 48+ messages in thread
From: kbuild test robot @ 2015-10-10 14:03 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: kbuild-all, Peter Zijlstra, paul.szabo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4686 bytes --]

Hi Mike,

[auto build test ERROR on v4.3-rc4 -- if it's inappropriate base, please ignore]

config: mips-allyesconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=mips 

All errors (new ones prefixed by >>):

   In file included from kernel/sched/fair.c:36:0:
   kernel/sched/fair.c: In function 'update_cfs_shares':
>> kernel/sched/sched.h:1001:24: error: implicit declaration of function 'static_branch_SMP_FAIR_GROUPS' [-Werror=implicit-function-declaration]
    #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
                           ^
   kernel/sched/fair.c:2425:34: note: in expansion of macro 'sched_feat'
     if (!IS_ENABLED(CONFIG_SMP) || !sched_feat(SMP_FAIR_GROUPS)) {
                                     ^
   kernel/sched/sched.h:1001:59: error: '__SCHED_FEAT_SMP_FAIR_GROUPS' undeclared (first use in this function)
    #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
                                                              ^
   kernel/sched/fair.c:2425:34: note: in expansion of macro 'sched_feat'
     if (!IS_ENABLED(CONFIG_SMP) || !sched_feat(SMP_FAIR_GROUPS)) {
                                     ^
   kernel/sched/sched.h:1001:59: note: each undeclared identifier is reported only once for each function it appears in
    #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
                                                              ^
   kernel/sched/fair.c:2425:34: note: in expansion of macro 'sched_feat'
     if (!IS_ENABLED(CONFIG_SMP) || !sched_feat(SMP_FAIR_GROUPS)) {
                                     ^
   cc1: some warnings being treated as errors

vim +/static_branch_SMP_FAIR_GROUPS +1001 kernel/sched/sched.h

029632fb kernel/sched.h       Peter Zijlstra 2011-10-25   985  };
029632fb kernel/sched.h       Peter Zijlstra 2011-10-25   986  
029632fb kernel/sched.h       Peter Zijlstra 2011-10-25   987  #undef SCHED_FEAT
029632fb kernel/sched.h       Peter Zijlstra 2011-10-25   988  
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   989  #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   990  #define SCHED_FEAT(name, enabled)					\
c5905afb kernel/sched/sched.h Ingo Molnar    2012-02-24   991  static __always_inline bool static_branch_##name(struct static_key *key) \
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   992  {									\
6e76ea8a kernel/sched/sched.h Jason Baron    2014-07-02   993  	return static_key_##enabled(key);				\
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   994  }
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   995  
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   996  #include "features.h"
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   997  
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   998  #undef SCHED_FEAT
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   999  
c5905afb kernel/sched/sched.h Ingo Molnar    2012-02-24  1000  extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06 @1001  #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06  1002  #else /* !(SCHED_DEBUG && HAVE_JUMP_LABEL) */
029632fb kernel/sched.h       Peter Zijlstra 2011-10-25  1003  #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06  1004  #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
029632fb kernel/sched.h       Peter Zijlstra 2011-10-25  1005  
cbee9f88 kernel/sched/sched.h Peter Zijlstra 2012-10-25  1006  #ifdef CONFIG_NUMA_BALANCING
cbee9f88 kernel/sched/sched.h Peter Zijlstra 2012-10-25  1007  #define sched_feat_numa(x) sched_feat(x)
3105b86a kernel/sched/sched.h Mel Gorman     2012-11-23  1008  #ifdef CONFIG_SCHED_DEBUG
3105b86a kernel/sched/sched.h Mel Gorman     2012-11-23  1009  #define numabalancing_enabled sched_feat_numa(NUMA)

:::::: The code at line 1001 was first introduced by commit
:::::: f8b6d1cc7dc15cf3de538b864eefaedad7a84d85 sched: Use jump_labels for sched_feat

:::::: TO: Peter Zijlstra <a.p.zijlstra@chello.nl>
:::::: CC: Ingo Molnar <mingo@elte.hu>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 39228 bytes --]


* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-10 14:03           ` kbuild test robot
@ 2015-10-10 14:41             ` Mike Galbraith
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-10 14:41 UTC (permalink / raw)
  To: kbuild test robot; +Cc: kbuild-all, Peter Zijlstra, paul.szabo, linux-kernel

On Sat, 2015-10-10 at 22:03 +0800, kbuild test robot wrote:
> Hi Mike,

Hi there pin-the-tail-on-the-donkey bot.  Eeee Ahhh :)

sched: disable task group wide utilization based weight on the desktop

Task group wide utilization based weight may work well for servers, but it
is horrible on the desktop.  8 groups of 1 hog demolishes interactivity, 1
group of 8 hogs has noticeable impact, 2 such groups are very very noticeable.

Turn it off if autogroup is enabled, and add a feature to let people set the
definition of fair to what serves them best.  For the desktop, fixed group
weight wins hands down, no contest....

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 kernel/sched/fair.c     |    5 +++++
 kernel/sched/features.h |   14 ++++++++++++++
 2 files changed, 19 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2372,6 +2372,8 @@ static long calc_cfs_shares(struct cfs_r
 {
 	long tg_weight, load, shares;
 
+	if (!sched_feat(SMP_FAIR_GROUPS))
+		return tg->shares;
 	tg_weight = calc_tg_weight(tg, cfs_rq);
 	load = cfs_rq_load_avg(cfs_rq);
 
@@ -2423,6 +2425,9 @@ static void update_cfs_shares(struct cfs
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
 		return;
+#else
+	if (!sched_feat(SMP_FAIR_GROUPS) && se->load.weight == tg->shares)
+		return;
 #endif
 	shares = calc_cfs_shares(cfs_rq, tg);
 
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -88,3 +88,17 @@ SCHED_FEAT(LB_MIN, false)
  */
 SCHED_FEAT(NUMA,	true)
 #endif
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/*
+ * With SMP_FAIR_GROUPS set, group wide activity determines the share for
+ * all group members.  This does very bad things to interactivity when
+ * a desktop box is heavily loaded.  Default to off when autogroup is
+ * enabled, and let all users set it to what works best for them.
+ */
+#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+SCHED_FEAT(SMP_FAIR_GROUPS, true)
+#else
+SCHED_FEAT(SMP_FAIR_GROUPS, false)
+#endif
+#endif







* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-10 13:22         ` [patch] sched: disable task group re-weighting on the desktop Mike Galbraith
  2015-10-10 14:03           ` kbuild test robot
@ 2015-10-10 17:01           ` Peter Zijlstra
  2015-10-10 17:13             ` Peter Zijlstra
  2015-10-11  2:25             ` Mike Galbraith
  2015-10-10 20:14           ` [patch] sched: disable task group re-weighting on the desktop paul.szabo
  2015-10-11 19:46           ` paul.szabo
  3 siblings, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-10 17:01 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: paul.szabo, linux-kernel

On Sat, Oct 10, 2015 at 03:22:49PM +0200, Mike Galbraith wrote:
> Ah, but now onward to interactivity...
>
> master, 8 groups of 1 (pert) vs desktop (mplayer BigBuckBunny-DivXPlusHD.mkv)
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
>  4068 root      20   0    8312    724    628 R 99.64 0.004   1:04.32 6 pert
>  4065 root      20   0    8312    744    648 R 99.45 0.005   1:04.92 5 pert
>  4071 root      20   0    8312    748    652 R 99.27 0.005   1:03.12 7 pert
>  4077 root      20   0    8312    840    744 R 98.72 0.005   1:01.46 3 pert
>  4074 root      20   0    8312    796    700 R 98.18 0.005   1:03.38 1 pert
>  4079 root      20   0    8312    720    624 R 97.99 0.004   1:01.45 4 pert
>  4062 root      20   0    8312    836    740 R 96.72 0.005   1:03.44 0 pert
>  4059 root      20   0    8312    720    624 R 94.16 0.004   1:04.92 2 pert
>  4082 root      20   0 1094400 154324  33592 S 4.197 0.954   0:02.69 0 mplayer
>  1029 root      20   0  465332 151540  40816 R 3.285 0.937   0:24.59 2 Xorg
>  1773 root      20   0  662592  73308  42012 S 2.007 0.453   0:12.84 5 konsole
>   771 root      20   0   11416   1964   1824 S 0.730 0.012   0:10.45 0 rngd
>  1722 root      20   0 2866772  65224  51152 S 0.365 0.403   0:03.44 2 kwin
>  1769 root      20   0  711684  54212  38020 S 0.182 0.335   0:00.39 1 kmix
>
> That is NOT good.  Mplayer and friends need more than that.  Interactivity
> is _horrible_, and buck is an unwatchable mess (no biggy, I know every frame).
>
> master+, 8 groups of 1 (pert) vs desktop (mplayer BigBuckBunny-DivXPlusHD.mkv)
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
>  4346 root      20   0    8312    756    660 R 99.20 0.005   0:59.89 5 pert
>  4349 root      20   0    8312    748    652 R 98.80 0.005   1:00.77 6 pert
>  4343 root      20   0    8312    720    624 R 94.81 0.004   1:02.11 2 pert
>  4331 root      20   0    8312    724    628 R 91.22 0.004   1:01.16 3 pert
>  4340 root      20   0    8312    720    624 R 91.22 0.004   1:01.06 7 pert
>  4328 root      20   0    8312    836    740 R 90.42 0.005   1:00.07 4 pert
>  4334 root      20   0    8312    756    660 R 87.82 0.005   0:59.84 1 pert
>  4337 root      20   0    8312    824    728 R 76.85 0.005   0:52.20 0 pert
>  4352 root      20   0 1058812 123876  33388 S 29.34 0.766   0:25.01 3 mplayer
>  1029 root      20   0  471168 156748  40316 R 22.36 0.969   0:42.23 3 Xorg
>  1773 root      20   0  663080  74176  42012 S 4.192 0.459   0:17.98 1 konsole
>   771 root      20   0   11416   1964   1824 R 1.198 0.012   0:13.45 0 rngd
>  1722 root      20   0 2866880  65340  51152 R 0.599 0.404   0:04.87 3 kwin
>  1788 root       9 -11  516744  11932   8536 S 0.599 0.074   0:01.01 0 pulseaudio
>  1733 root      20   0 3369480 141564  71776 S 0.200 0.875   0:05.51 1 plasma-desktop
>
> That's good.  Interactivity is fine...

But the patch is most horrible.. :/ It completely destroys everything
group scheduling is supposed to be.

What are these oink/pert things? Both spinners just with amusing names
to distinguish them?

Is the interactivity the same (horrible) at fe32d3cd5e8e (ie, before the
load tracking rewrite from Yuyang)?


* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-10 17:01           ` Peter Zijlstra
@ 2015-10-10 17:13             ` Peter Zijlstra
  2015-10-11  2:25             ` Mike Galbraith
  1 sibling, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-10 17:13 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: paul.szabo, linux-kernel

On Sat, Oct 10, 2015 at 07:01:42PM +0200, Peter Zijlstra wrote:
> On Sat, Oct 10, 2015 at 03:22:49PM +0200, Mike Galbraith wrote:
> > Ah, but now onward to interactivity...
> >
> > master, 8 groups of 1 (pert) vs desktop (mplayer BigBuckBunny-DivXPlusHD.mkv)
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
> >  4068 root      20   0    8312    724    628 R 99.64 0.004   1:04.32 6 pert
> >  4065 root      20   0    8312    744    648 R 99.45 0.005   1:04.92 5 pert
> >  4071 root      20   0    8312    748    652 R 99.27 0.005   1:03.12 7 pert
> >  4077 root      20   0    8312    840    744 R 98.72 0.005   1:01.46 3 pert
> >  4074 root      20   0    8312    796    700 R 98.18 0.005   1:03.38 1 pert
> >  4079 root      20   0    8312    720    624 R 97.99 0.004   1:01.45 4 pert
> >  4062 root      20   0    8312    836    740 R 96.72 0.005   1:03.44 0 pert
> >  4059 root      20   0    8312    720    624 R 94.16 0.004   1:04.92 2 pert
> >  4082 root      20   0 1094400 154324  33592 S 4.197 0.954   0:02.69 0 mplayer
> >  1029 root      20   0  465332 151540  40816 R 3.285 0.937   0:24.59 2 Xorg
> >  1773 root      20   0  662592  73308  42012 S 2.007 0.453   0:12.84 5 konsole
> >   771 root      20   0   11416   1964   1824 S 0.730 0.012   0:10.45 0 rngd
> >  1722 root      20   0 2866772  65224  51152 S 0.365 0.403   0:03.44 2 kwin
> >  1769 root      20   0  711684  54212  38020 S 0.182 0.335   0:00.39 1 kmix

Ah wait, so you have 8 groups of cycle soakers vs 1 group of desktop?
That means your desktop will get 1/9th of the total time, and that is
almost so:

 2.69+24.59+12.84+10.45+3.44+.39 = 54.4

vs

 (64.32+64.92+63.12+61.46+63.38+61.45+63.44+64.92)/8 = 63.37625

Which isn't too far off.

This really appears to be a case where you get what you ask for.



* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-10 13:22         ` [patch] sched: disable task group re-weighting on the desktop Mike Galbraith
  2015-10-10 14:03           ` kbuild test robot
  2015-10-10 17:01           ` Peter Zijlstra
@ 2015-10-10 20:14           ` paul.szabo
  2015-10-11  2:38             ` Mike Galbraith
  2015-10-11 19:46           ` paul.szabo
  3 siblings, 1 reply; 48+ messages in thread
From: paul.szabo @ 2015-10-10 20:14 UTC (permalink / raw)
  To: peterz, umgwanakikbuti; +Cc: linux-kernel

Dear Mike,

You CCed me on this patch. Is that because you expect this to solve "my"
problem also? You had some measurements of many oinks vs many perts or
vs "desktop", but not many oinks vs 1 or 2 perts as per my "complaint". 
You also changed the subject line, so maybe this is all un-related.

Thanks, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-10 17:01           ` Peter Zijlstra
  2015-10-10 17:13             ` Peter Zijlstra
@ 2015-10-11  2:25             ` Mike Galbraith
  2015-10-11 17:42               ` 4.3 group scheduling regression Mike Galbraith
  1 sibling, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2015-10-11  2:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: paul.szabo, linux-kernel

On Sat, 2015-10-10 at 19:01 +0200, Peter Zijlstra wrote:

> But the patch is most horrible.. :/ It completely destroys everything
> group scheduling is supposed to be.

Yeah, and it works great but...

> What are these oink/pert things? Both spinners just with amusing names
> to distinguish them?

(yeah).

> Is the interactivity the same (horrible) at fe32d3cd5e8e (ie, before the
> load tracking rewrite from Yuyang)?

...you're right.  fe32d3cd5e8e isn't as good as master with a big dent
in its skull, but it is far from the ugly beast I clubbed to death.

	-Mike


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-10 20:14           ` [patch] sched: disable task group re-weighting on the desktop paul.szabo
@ 2015-10-11  2:38             ` Mike Galbraith
  2015-10-11  9:25               ` paul.szabo
  0 siblings, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2015-10-11  2:38 UTC (permalink / raw)
  To: paul.szabo; +Cc: peterz, linux-kernel

On Sun, 2015-10-11 at 07:14 +1100, paul.szabo@sydney.edu.au wrote:
> Dear Mike,
> 
> You CCed me on this patch. Is that because you expect this to solve "my"
> problem also? You had some measurements of many oinks vs many perts or
> vs "desktop", but not many oinks vs 1 or 2 perts as per my "complaint". 
> You also changed the subject line, so maybe this is all un-related.

I haven't seen the problem you reported.  I did stumble upon a problem,
but turns out that is only present in master, so yes, un-related.

	-Mike  


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-11  2:38             ` Mike Galbraith
@ 2015-10-11  9:25               ` paul.szabo
  2015-10-11 12:49                 ` Mike Galbraith
  0 siblings, 1 reply; 48+ messages in thread
From: paul.szabo @ 2015-10-11  9:25 UTC (permalink / raw)
  To: umgwanakikbuti; +Cc: linux-kernel, peterz

Dear Mike,

> ... so yes, un-related.

Thanks for clarifying.

> I haven't seen the problem you reported. ...

You mean you chose not to reproduce: you persisted in pinning your
perts, whereas the problem was stated with un-pinned perts (and pinned
oinks). But that is OK... others did reproduce, and anyway I believe
I have now fixed my problem. (Solution in that "other" email thread.)

Cheers,

Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-09  2:40           ` Mike Galbraith
@ 2015-10-11  9:43             ` paul.szabo
  0 siblings, 0 replies; 48+ messages in thread
From: paul.szabo @ 2015-10-11  9:43 UTC (permalink / raw)
  To: umgwanakikbuti; +Cc: linux-kernel, peterz, wanpeng.li

I wrote:

  The Linux CFS scheduler prefers pinned tasks and unfairly
  gives more CPU time to tasks that have set CPU affinity.

I believe I have now solved the problem, simply by setting:

  for n in /proc/sys/kernel/sched_domain/cpu*/domain0/min_interval; do echo 0 > $n; done
  for n in /proc/sys/kernel/sched_domain/cpu*/domain0/max_interval; do echo 1 > $n; done

I am not sure what the domain1 values (which I see exist on my
4*E5-4627v2 server) are for. So far I do not see any negative effects of
using these (extreme?) settings. (Explanation of what these things are
meant for, or pointers to documentation, would be appreciated.)
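
For what it is worth, a hedged reading of the load balancer of that era is
that these knobs bound sd->balance_interval: the per-domain interval backs
off (doubles) toward max_interval while the domain keeps looking balanced,
and drops back toward min_interval once an imbalance is acted on, so setting
min=0 and max=1 pins the periodic balancer to its shortest cycle. A
standalone toy model of that back-off/reset behaviour (all names and units
are invented for illustration; this is not kernel code):

  #include <stdio.h>

  /* Toy model of a per-domain balance interval under the back-off/reset
   * policy described above. Field names and units are invented; only the
   * clamp-to-[min_interval, max_interval] behaviour is the point. */
  struct toy_domain {
          unsigned int min_interval;
          unsigned int max_interval;
          unsigned int balance_interval;
  };

  static void balance_pass(struct toy_domain *sd, int acted_on_imbalance)
  {
          if (acted_on_imbalance)
                  sd->balance_interval = sd->min_interval;  /* re-check soon */
          else if (sd->balance_interval < sd->max_interval)
                  sd->balance_interval *= 2;                /* back off */

          if (sd->balance_interval < 1)
                  sd->balance_interval = 1;
          if (sd->balance_interval > sd->max_interval)
                  sd->balance_interval = sd->max_interval;
  }

  int main(void)
  {
          struct toy_domain sd = { .min_interval = 0, .max_interval = 1,
                                   .balance_interval = 1 };
          int pass;

          /* with min=0/max=1 the interval can never grow: the balancer
           * effectively re-checks at its shortest period every time */
          for (pass = 0; pass < 4; pass++) {
                  balance_pass(&sd, pass & 1);
                  printf("pass %d: interval = %u\n", pass, sd.balance_interval);
          }
          return 0;
  }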

---

Thanks for the insightful discussion.

(Scary, isn't it?)

Thanks, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-11  9:25               ` paul.szabo
@ 2015-10-11 12:49                 ` Mike Galbraith
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-11 12:49 UTC (permalink / raw)
  To: paul.szabo; +Cc: linux-kernel, peterz

On Sun, 2015-10-11 at 20:25 +1100, paul.szabo@sydney.edu.au wrote:
> Dear Mike,
> 
> > ... so yes, un-related.
> 
> Thanks for clarifying.
> 
> > I haven't seen the problem you reported. ...
> 
> You mean you chose not to reproduce: you persisted in pinning your
> perts..

There was hard data to the contrary in your mailbox as you wrote that.

	-Mike 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* 4.3 group scheduling regression
  2015-10-11  2:25             ` Mike Galbraith
@ 2015-10-11 17:42               ` Mike Galbraith
  2015-10-12  7:23                 ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2015-10-11 17:42 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yuyang Du

(change subject, CCs)

On Sun, 2015-10-11 at 04:25 +0200, Mike Galbraith wrote:

> > Is the interactivity the same (horrible) at fe32d3cd5e8e (ie, before the
> > load tracking rewrite from Yuyang)?

It is the rewrite, 9d89c257dfb9c51a532d69397f6eed75e5168c35.

Watching 8 single hog groups vs 1 tbench group, master vs 4.2.3, I saw
no big hairy difference, just as 1 group of 8 hogs vs 8 groups of 1.

8 single hog groups vs the less hungry mplayer otoh is quite different

100 second scripted recordings:
(note: "testo" is kde konsole acting as task group launch vehicle)

master
 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  oink:(8)              | 787637.964 ms |    16242 | avg:    0.557 ms | max:   68.993 ms | max at:    239.126118 s
  mplayer:(25)          |   5477.234 ms |     8504 | avg:   16.395 ms | max: 2100.233 ms | max at:    282.850734 s
  Xorg:997              |   1773.218 ms |     4680 | avg:    4.857 ms | max: 1640.194 ms | max at:    285.660210 s
  konsole:1789          |    649.323 ms |     1261 | avg:    6.747 ms | max:  156.282 ms | max at:    265.548523 s
  testo:(9)             |    454.046 ms |     2867 | avg:    5.961 ms | max:  276.371 ms | max at:    245.511282 s
  plasma-desktop:1753   |    223.251 ms |     1582 | avg:    4.220 ms | max:  299.354 ms | max at:    337.242542 s
  kwin:1745             |    156.746 ms |     2879 | avg:    2.398 ms | max:  355.765 ms | max at:    337.242490 s
  pulseaudio:1797       |     60.268 ms |     2573 | avg:    0.695 ms | max:   36.069 ms | max at:    292.318120 s
  threaded-ml:3477      |     47.076 ms |     3878 | avg:    7.083 ms | max: 1898.940 ms | max at:    254.919367 s
  perf:3437             |     28.525 ms |        4 | avg:  129.042 ms | max:  498.816 ms | max at:    336.102154 s

4.2.3
 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  oink:(8)              | 741307.292 ms |    42325 | avg:    1.276 ms | max:   23.598 ms | max at:    192.459790 s
  mplayer:(25)          |  35296.804 ms |    35423 | avg:    1.715 ms | max:   71.972 ms | max at:    128.737783 s
  Xorg:929              |  13257.917 ms |    21583 | avg:    0.091 ms | max:   27.983 ms | max at:    102.272376 s
  testo:(9)             |   2315.080 ms |    13213 | avg:    0.133 ms | max:    6.632 ms | max at:    201.422570 s
  konsole:1747          |    938.939 ms |     1458 | avg:    0.096 ms | max:   15.006 ms | max at:    102.260294 s
  kwin:1703             |    815.384 ms |    17376 | avg:    0.464 ms | max:    9.311 ms | max at:    119.026179 s
  pulseaudio:1762       |    396.168 ms |    14338 | avg:    0.020 ms | max:    6.514 ms | max at:    115.928179 s
  threaded-ml:3477      |    310.132 ms |    23966 | avg:    0.428 ms | max:   27.974 ms | max at:    134.100588 s
  plasma-desktop:1711   |    239.232 ms |     1577 | avg:    0.048 ms | max:    7.072 ms | max at:    102.060279 s
  perf:3434             |     65.705 ms |        2 | avg:    0.054 ms | max:    0.105 ms | max at:    102.011221 s

master, mplayer solo reference
 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  mplayer:(25)          |  32171.732 ms |    18416 | avg:    0.012 ms | max:    4.405 ms | max at:   4911.226038 s
  Xorg:948              |  14271.286 ms |    17396 | avg:    0.016 ms | max:    0.082 ms | max at:   4911.243020 s
  testo:4121            |   3594.784 ms |    11607 | avg:    0.015 ms | max:    0.078 ms | max at:   4981.705240 s
  kwin:1650             |   1209.387 ms |    17562 | avg:    0.012 ms | max:    1.612 ms | max at:   4911.245523 s
  konsole:1728          |    967.914 ms |     1498 | avg:    0.007 ms | max:    0.048 ms | max at:   4997.903759 s
  pulseaudio:1750       |    684.342 ms |    14460 | avg:    0.013 ms | max:    0.552 ms | max at:   4957.743502 s
  threaded-ml:4153      |    641.893 ms |    15748 | avg:    0.016 ms | max:    2.201 ms | max at:   4923.928810 s
  plasma-desktop:1658   |    150.068 ms |      569 | avg:    0.011 ms | max:    0.390 ms | max at:   4911.258650 s
  perf:4126             |     43.854 ms |        3 | avg:    0.022 ms | max:    0.051 ms | max at:   4959.327694 s


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-10 13:22         ` [patch] sched: disable task group re-weighting on the desktop Mike Galbraith
                             ` (2 preceding siblings ...)
  2015-10-10 20:14           ` [patch] sched: disable task group re-weighting on the desktop paul.szabo
@ 2015-10-11 19:46           ` paul.szabo
  2015-10-12  1:59             ` Mike Galbraith
  3 siblings, 1 reply; 48+ messages in thread
From: paul.szabo @ 2015-10-11 19:46 UTC (permalink / raw)
  To: peterz, umgwanakikbuti; +Cc: linux-kernel

Dear Mike,

Did you check whether setting min_- and max_interval e.g. as per
  https://lkml.org/lkml/2015/10/11/34
would help with your issue (instead of your "horrible gs destroying"
patch)?

Cheers, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12  8:04                     ` Peter Zijlstra
@ 2015-10-12  0:53                       ` Yuyang Du
  2015-10-12  9:12                         ` Peter Zijlstra
  2015-10-12  8:48                       ` Mike Galbraith
  1 sibling, 1 reply; 48+ messages in thread
From: Yuyang Du @ 2015-10-12  0:53 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel

Good morning, Peter.

On Mon, Oct 12, 2015 at 10:04:07AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 12, 2015 at 09:44:57AM +0200, Mike Galbraith wrote:
> 
> > It's odd to me that things look pretty much the same good/bad tree with
> > hogs vs hogs or hogs vs tbench (with top anyway, just adding up times).
> > Seems Xorg+mplayer more or less playing cross group ping-pong must be
> > the BadThing trigger.
>
> Ohh, wait, Xorg and mplayer are _not_ in the same group? I was assuming
> you had your entire user session in 1 (auto) group and was competing
> against 8 manual cgroups.
> 
> So how exactly are things configured?
 
Hmm... my impression is the naughty boy mplayer (+Xorg) isn't favored, due 
to the per CPU group entity share distribution. Let me dig more.

Sorry.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-11 19:46           ` paul.szabo
@ 2015-10-12  1:59             ` Mike Galbraith
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-12  1:59 UTC (permalink / raw)
  To: paul.szabo; +Cc: peterz, linux-kernel

On Mon, 2015-10-12 at 06:46 +1100, paul.szabo@sydney.edu.au wrote:
> Dear Mike,
> 
> Did you check whether setting min_- and max_interval e.g. as per
>   https://lkml.org/lkml/2015/10/11/34
> would help with your issue (instead of your "horrible gs destroying"
> patch)?

I spent a lot of MY time looking into YOUR problem, only to be accused
of actively avoiding reproduction thereof, and now you toss another cute
little dart my way.  Looking into your problem wasn't a complete waste
of my time, as it led me to something that actually looks interesting.
Thanks for that, and goodbye. *PLONK*

	-Mike  


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12  9:12                         ` Peter Zijlstra
@ 2015-10-12  2:12                           ` Yuyang Du
  2015-10-12 10:23                             ` Mike Galbraith
  2015-10-12 11:47                             ` Peter Zijlstra
  0 siblings, 2 replies; 48+ messages in thread
From: Yuyang Du @ 2015-10-12  2:12 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel

On Mon, Oct 12, 2015 at 11:12:06AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 12, 2015 at 08:53:51AM +0800, Yuyang Du wrote:
> > Good morning, Peter.
> > 
> > On Mon, Oct 12, 2015 at 10:04:07AM +0200, Peter Zijlstra wrote:
> > > On Mon, Oct 12, 2015 at 09:44:57AM +0200, Mike Galbraith wrote:
> > > 
> > > > It's odd to me that things look pretty much the same good/bad tree with
> > > > hogs vs hogs or hogs vs tbench (with top anyway, just adding up times).
> > > > Seems Xorg+mplayer more or less playing cross group ping-pong must be
> > > > the BadThing trigger.
> > >
> > > Ohh, wait, Xorg and mplayer are _not_ in the same group? I was assuming
> > > you had your entire user session in 1 (auto) group and was competing
> > > against 8 manual cgroups.
> > > 
> > > So how exactly are things configured?
> >  
> > Hmm... my impression is the naughty boy mplayer (+Xorg) isn't favored, due 
> > to the per CPU group entity share distribution. Let me dig more.
> 
> So in the old code we had 'magic' to deal with the case where a cgroup
> was consuming less than 1 cpu's worth of runtime. For example, a single
> task running in the group.
> 
> In that scenario it might be possible that the group entity weight:
> 
> 	se->weight = (tg->shares * cfs_rq->weight) / tg->weight;
> 
> Strongly deviates from the tg->shares; you want the single task reflect
> the full group shares to the next level; due to the whole distributed
> approximation stuff.

Yeah, I thought so.
 
> I see you've deleted all that code; see the former
> __update_group_entity_contrib().
 
Probably not there, it actually was an icky way to adjust things.

> It could be that we need to bring that back. But let me think a little
> bit more on this.. I'm having a hard time waking :/

I am guessing it is in calc_tg_weight(), and naughty boys do make them more
favored, what a reality...

Mike, beg you test the following?

--

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4df37a4..b184da0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2370,7 +2370,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
 	 */
 	tg_weight = atomic_long_read(&tg->load_avg);
 	tg_weight -= cfs_rq->tg_load_avg_contrib;
-	tg_weight += cfs_rq_load_avg(cfs_rq);
+	tg_weight += cfs_rq->load.weight;
 
 	return tg_weight;
 }
@@ -2380,7 +2380,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 	long tg_weight, load, shares;
 
 	tg_weight = calc_tg_weight(tg, cfs_rq);
-	load = cfs_rq_load_avg(cfs_rq);
+	load = cfs_rq->load.weight;
 
 	shares = (tg->shares * load);
 	if (tg_weight)

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-11 17:42               ` 4.3 group scheduling regression Mike Galbraith
@ 2015-10-12  7:23                 ` Peter Zijlstra
  2015-10-12  7:44                   ` Mike Galbraith
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-12  7:23 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Yuyang Du

On Sun, Oct 11, 2015 at 07:42:01PM +0200, Mike Galbraith wrote:
> (change subject, CCs)
> 
> On Sun, 2015-10-11 at 04:25 +0200, Mike Galbraith wrote:
> 
> > > Is the interactivity the same (horrible) at fe32d3cd5e8e (ie, before the
> > > load tracking rewrite from Yuyang)?
> 
> It is the rewrite, 9d89c257dfb9c51a532d69397f6eed75e5168c35.

Just to be sure, so 9d89c257dfb9^1 is good, while 9d89c257dfb9 is bad?

And *groan*, _just_ the thing I need on a monday morning ;-)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12  7:23                 ` Peter Zijlstra
@ 2015-10-12  7:44                   ` Mike Galbraith
  2015-10-12  8:04                     ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2015-10-12  7:44 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yuyang Du

On Mon, 2015-10-12 at 09:23 +0200, Peter Zijlstra wrote:
> On Sun, Oct 11, 2015 at 07:42:01PM +0200, Mike Galbraith wrote:
> > (change subject, CCs)
> > 
> > On Sun, 2015-10-11 at 04:25 +0200, Mike Galbraith wrote:
> > 
> > > > Is the interactivity the same (horrible) at fe32d3cd5e8e (ie, before the
> > > > load tracking rewrite from Yuyang)?
> > 
> > It is the rewrite, 9d89c257dfb9c51a532d69397f6eed75e5168c35.
> 
> Just to be sure, so 9d89c257dfb9^1 is good, while 9d89c257dfb9 is bad?

Yeah, I went ahead and bisected.
 
> And *groan*, _just_ the thing I need on a monday morning ;-)

Sorry 'bout that.

It's odd to me that things look pretty much the same good/bad tree with
hogs vs hogs or hogs vs tbench (with top anyway, just adding up times).
Seems Xorg+mplayer more or less playing cross group ping-pong must be
the BadThing trigger.

	-Mike



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12  7:44                   ` Mike Galbraith
@ 2015-10-12  8:04                     ` Peter Zijlstra
  2015-10-12  0:53                       ` Yuyang Du
  2015-10-12  8:48                       ` Mike Galbraith
  0 siblings, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-12  8:04 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Yuyang Du

On Mon, Oct 12, 2015 at 09:44:57AM +0200, Mike Galbraith wrote:

> It's odd to me that things look pretty much the same good/bad tree with
> hogs vs hogs or hogs vs tbench (with top anyway, just adding up times).
> Seems Xorg+mplayer more or less playing cross group ping-pong must be
> the BadThing trigger.

Ohh, wait, Xorg and mplayer are _not_ in the same group? I was assuming
you had your entire user session in 1 (auto) group and was competing
against 8 manual cgroups.

So how exactly are things configured?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12  8:04                     ` Peter Zijlstra
  2015-10-12  0:53                       ` Yuyang Du
@ 2015-10-12  8:48                       ` Mike Galbraith
  1 sibling, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-12  8:48 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yuyang Du

On Mon, 2015-10-12 at 10:04 +0200, Peter Zijlstra wrote:
> On Mon, Oct 12, 2015 at 09:44:57AM +0200, Mike Galbraith wrote:
> 
> > It's odd to me that things look pretty much the same good/bad tree with
> > hogs vs hogs or hogs vs tbench (with top anyway, just adding up times).
> > Seems Xorg+mplayer more or less playing cross group ping-pong must be
> > the BadThing trigger.
> 
> Ohh, wait, Xorg and mplayer are _not_ in the same group? I was assuming
> you had your entire user session in 1 (auto) group and was competing
> against 8 manual cgroups.
> 
> So how exactly are things configured?

I turned autogroup on as to not have to muck about creating groups, so
Xorg is in its per session group, and each konsole instance in its.  I
launched groups via testo (aka konsole) -e <content> in a little script
to turn it loose at once to run for 100 seconds and kill itself, but
that's not necessary 'course.  Start 1 hog in 8 konsole tabs, and
mplayer in the 9th, ickiness follows.

	-Mike


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12  0:53                       ` Yuyang Du
@ 2015-10-12  9:12                         ` Peter Zijlstra
  2015-10-12  2:12                           ` Yuyang Du
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-12  9:12 UTC (permalink / raw)
  To: Yuyang Du; +Cc: Mike Galbraith, linux-kernel

On Mon, Oct 12, 2015 at 08:53:51AM +0800, Yuyang Du wrote:
> Good morning, Peter.
> 
> On Mon, Oct 12, 2015 at 10:04:07AM +0200, Peter Zijlstra wrote:
> > On Mon, Oct 12, 2015 at 09:44:57AM +0200, Mike Galbraith wrote:
> > 
> > > It's odd to me that things look pretty much the same good/bad tree with
> > > hogs vs hogs or hogs vs tbench (with top anyway, just adding up times).
> > > Seems Xorg+mplayer more or less playing cross group ping-pong must be
> > > the BadThing trigger.
> >
> > Ohh, wait, Xorg and mplayer are _not_ in the same group? I was assuming
> > you had your entire user session in 1 (auto) group and was competing
> > against 8 manual cgroups.
> > 
> > So how exactly are things configured?
>  
> Hmm... my impression is the naughty boy mplayer (+Xorg) isn't favored, due 
> to the per CPU group entity share distribution. Let me dig more.

So in the old code we had 'magic' to deal with the case where a cgroup
was consuming less than 1 cpu's worth of runtime. For example, a single
task running in the group.

In that scenario it might be possible that the group entity weight:

	se->weight = (tg->shares * cfs_rq->weight) / tg->weight;

Strongly deviates from the tg->shares; you want the single task reflect
the full group shares to the next level; due to the whole distributed
approximation stuff.

I see you've deleted all that code; see the former
__update_group_entity_contrib().

It could be that we need to bring that back. But let me think a little
bit more on this.. I'm having a hard time waking :/
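
For concreteness, a minimal sketch of the weight formula above with
invented numbers; this is just the arithmetic, not the kernel
implementation:

  #include <stdio.h>

  /* The group-entity weight formula quoted above,
   *     se_weight = tg_shares * cfs_rq_weight / tg_weight,
   * evaluated for two invented cases: the group's only runnable task on
   * this CPU (the entity should reflect the full shares), and the group's
   * load split evenly over two CPUs. */
  static unsigned long group_se_weight(unsigned long tg_shares,
                                       unsigned long cfs_rq_weight,
                                       unsigned long tg_weight)
  {
          if (!tg_weight)
                  return tg_shares;
          return tg_shares * cfs_rq_weight / tg_weight;
  }

  int main(void)
  {
          /* one nice-0 task (weight 1024) alone in a 1024-share group */
          printf("single task here : %lu\n", group_se_weight(1024, 1024, 1024));
          /* same group spread evenly over two CPUs */
          printf("half the load    : %lu\n", group_se_weight(1024, 512, 1024));
          return 0;
  }

With the whole group's weight local, the entity carries the full 1024;
split across CPUs, each entity carries only its share, which is the
distributed approximation being discussed.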

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12  2:12                           ` Yuyang Du
@ 2015-10-12 10:23                             ` Mike Galbraith
  2015-10-12 19:55                               ` Yuyang Du
  2015-10-12 11:47                             ` Peter Zijlstra
  1 sibling, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2015-10-12 10:23 UTC (permalink / raw)
  To: Yuyang Du; +Cc: Peter Zijlstra, linux-kernel

On Mon, 2015-10-12 at 10:12 +0800, Yuyang Du wrote:

> I am guessing it is in calc_tg_weight(), and naughty boys do make them more
> favored, what a reality...
> 
> Mike, beg you test the following?

Wow, that was quick.  Dinky patch made it all better.

 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  oink:(8)              | 739056.970 ms |    27270 | avg:    2.043 ms | max:   29.105 ms | max at:    339.988310 s
  mplayer:(25)          |  36448.997 ms |    44670 | avg:    1.886 ms | max:   72.808 ms | max at:    302.153121 s
  Xorg:988              |  13334.908 ms |    22210 | avg:    0.081 ms | max:   25.005 ms | max at:    269.068666 s
  testo:(9)             |   2558.540 ms |    13703 | avg:    0.124 ms | max:    6.412 ms | max at:    279.235272 s
  konsole:1781          |   1084.316 ms |     1457 | avg:    0.006 ms | max:    1.039 ms | max at:    268.863379 s
  kwin:1734             |    879.645 ms |    17855 | avg:    0.458 ms | max:   15.788 ms | max at:    268.854992 s
  pulseaudio:1808       |    356.334 ms |    15023 | avg:    0.028 ms | max:    6.134 ms | max at:    324.479766 s
  threaded-ml:3483      |    292.782 ms |    25769 | avg:    0.364 ms | max:   40.387 ms | max at:    294.550515 s
  plasma-desktop:1745   |    265.055 ms |     1470 | avg:    0.102 ms | max:   21.886 ms | max at:    267.724902 s
  perf:3439             |     61.677 ms |        2 | avg:    0.117 ms | max:    0.232 ms | max at:    367.043889 s


> --
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..b184da0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2370,7 +2370,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
>  	 */
>  	tg_weight = atomic_long_read(&tg->load_avg);
>  	tg_weight -= cfs_rq->tg_load_avg_contrib;
> -	tg_weight += cfs_rq_load_avg(cfs_rq);
> +	tg_weight += cfs_rq->load.weight;
>  
>  	return tg_weight;
>  }
> @@ -2380,7 +2380,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
>  	long tg_weight, load, shares;
>  
>  	tg_weight = calc_tg_weight(tg, cfs_rq);
> -	load = cfs_rq_load_avg(cfs_rq);
> +	load = cfs_rq->load.weight;
>  
>  	shares = (tg->shares * load);
>  	if (tg_weight)



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12  2:12                           ` Yuyang Du
  2015-10-12 10:23                             ` Mike Galbraith
@ 2015-10-12 11:47                             ` Peter Zijlstra
  2015-10-12 19:32                               ` Yuyang Du
  2015-10-13  2:22                               ` Mike Galbraith
  1 sibling, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-12 11:47 UTC (permalink / raw)
  To: Yuyang Du; +Cc: Mike Galbraith, linux-kernel

On Mon, Oct 12, 2015 at 10:12:31AM +0800, Yuyang Du wrote:
> On Mon, Oct 12, 2015 at 11:12:06AM +0200, Peter Zijlstra wrote:

> > So in the old code we had 'magic' to deal with the case where a cgroup
> > was consuming less than 1 cpu's worth of runtime. For example, a single
> > task running in the group.
> > 
> > In that scenario it might be possible that the group entity weight:
> > 
> > 	se->weight = (tg->shares * cfs_rq->weight) / tg->weight;
> > 
> > Strongly deviates from the tg->shares; you want the single task reflect
> > the full group shares to the next level; due to the whole distributed
> > approximation stuff.
> 
> Yeah, I thought so.
>  
> > I see you've deleted all that code; see the former
> > __update_group_entity_contrib().
>  
> Probably not there, it actually was an icky way to adjust things.

Yeah, no argument there.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..b184da0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2370,7 +2370,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
>  	 */
>  	tg_weight = atomic_long_read(&tg->load_avg);
>  	tg_weight -= cfs_rq->tg_load_avg_contrib;
> -	tg_weight += cfs_rq_load_avg(cfs_rq);
> +	tg_weight += cfs_rq->load.weight;
>  
>  	return tg_weight;
>  }
> @@ -2380,7 +2380,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
>  	long tg_weight, load, shares;
>  
>  	tg_weight = calc_tg_weight(tg, cfs_rq);
> -	load = cfs_rq_load_avg(cfs_rq);
> +	load = cfs_rq->load.weight;
>  
>  	shares = (tg->shares * load);
>  	if (tg_weight)

Aah, yes very much so. I completely overlooked that :-(

When calculating shares we very much want the current load, not the load
average.

Also, should we do the below? At this point se->on_rq is still 0 so
reweight_entity() will not update (dequeue/enqueue) the accounting, but
we'll have just accounted the 'old' load.weight.

Doing it this way around we'll first update the weight and then account
it, which seems more accurate.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 700eb548315f..d2efef565aed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3009,8 +3009,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	enqueue_entity_load_avg(cfs_rq, se);
-	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
+	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12 11:47                             ` Peter Zijlstra
@ 2015-10-12 19:32                               ` Yuyang Du
  2015-10-13  8:07                                 ` Peter Zijlstra
  2015-10-13  2:22                               ` Mike Galbraith
  1 sibling, 1 reply; 48+ messages in thread
From: Yuyang Du @ 2015-10-12 19:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel

On Mon, Oct 12, 2015 at 01:47:23PM +0200, Peter Zijlstra wrote:
> 
> Also, should we do the below? At this point se->on_rq is still 0 so
> reweight_entity() will not update (dequeue/enqueue) the accounting, but
> we'll have just accounted the 'old' load.weight.
> 
> Doing it this way around we'll first update the weight and then account
> it, which seems more accurate.
 
I think the original looks ok.

The account_entity_enqueue() adds child entity's load.weight to parent's load:

update_load_add(&cfs_rq->load, se->load.weight)

Then recalculate the shares.

Then reweight_entity() resets the parent entity's load.weight.
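
To put numbers on that ordering, as I read the exchange above (all values
invented; this is bookkeeping arithmetic, not the kernel code):

  #include <stdio.h>

  /* Bookkeeping order described above, with invented numbers:
   * tg->shares = 1024, parent cfs_rq load before the enqueue = 2048,
   * the rest of the group's weight (1024) sits on other CPUs, and the
   * child being enqueued weighs 1024. */
  int main(void)
  {
          unsigned long tg_shares = 1024;
          unsigned long other_cpus = 1024;        /* group weight elsewhere */
          unsigned long cfs_rq_load = 2048;       /* this CPU, pre-enqueue */
          unsigned long child = 1024;
          unsigned long shares_after, shares_before;

          /* existing order: account_entity_enqueue() first, so the shares
           * recomputation already sees the new child's weight */
          shares_after = tg_shares * (cfs_rq_load + child)
                         / (other_cpus + cfs_rq_load + child);

          /* the proposed swap would recompute from the pre-enqueue load */
          shares_before = tg_shares * cfs_rq_load
                          / (other_cpus + cfs_rq_load);

          printf("account then reweight: %lu\n", shares_after);   /* 768 */
          printf("reweight then account: %lu\n", shares_before);  /* 682 */
          return 0;
  }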

> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 700eb548315f..d2efef565aed 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3009,8 +3009,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 */
>  	update_curr(cfs_rq);
>  	enqueue_entity_load_avg(cfs_rq, se);
> -	account_entity_enqueue(cfs_rq, se);
>  	update_cfs_shares(cfs_rq);
> +	account_entity_enqueue(cfs_rq, se);
>  
>  	if (flags & ENQUEUE_WAKEUP) {
>  		place_entity(cfs_rq, se, 0);

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12 10:23                             ` Mike Galbraith
@ 2015-10-12 19:55                               ` Yuyang Du
  2015-10-13  4:08                                 ` Mike Galbraith
  2015-10-13  8:06                                 ` Peter Zijlstra
  0 siblings, 2 replies; 48+ messages in thread
From: Yuyang Du @ 2015-10-12 19:55 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, linux-kernel

On Mon, Oct 12, 2015 at 12:23:31PM +0200, Mike Galbraith wrote:
> On Mon, 2015-10-12 at 10:12 +0800, Yuyang Du wrote:
> 
> > I am guessing it is in calc_tg_weight(), and naughty boys do make them more
> > favored, what a reality...
> > 
> > Mike, beg you test the following?
> 
> Wow, that was quick.  Dinky patch made it all better.
> 
>  -----------------------------------------------------------------------------------------------------------------
>   Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
>  -----------------------------------------------------------------------------------------------------------------
>   oink:(8)              | 739056.970 ms |    27270 | avg:    2.043 ms | max:   29.105 ms | max at:    339.988310 s
>   mplayer:(25)          |  36448.997 ms |    44670 | avg:    1.886 ms | max:   72.808 ms | max at:    302.153121 s
>   Xorg:988              |  13334.908 ms |    22210 | avg:    0.081 ms | max:   25.005 ms | max at:    269.068666 s
>   testo:(9)             |   2558.540 ms |    13703 | avg:    0.124 ms | max:    6.412 ms | max at:    279.235272 s
>   konsole:1781          |   1084.316 ms |     1457 | avg:    0.006 ms | max:    1.039 ms | max at:    268.863379 s
>   kwin:1734             |    879.645 ms |    17855 | avg:    0.458 ms | max:   15.788 ms | max at:    268.854992 s
>   pulseaudio:1808       |    356.334 ms |    15023 | avg:    0.028 ms | max:    6.134 ms | max at:    324.479766 s
>   threaded-ml:3483      |    292.782 ms |    25769 | avg:    0.364 ms | max:   40.387 ms | max at:    294.550515 s
>   plasma-desktop:1745   |    265.055 ms |     1470 | avg:    0.102 ms | max:   21.886 ms | max at:    267.724902 s
>   perf:3439             |     61.677 ms |        2 | avg:    0.117 ms | max:    0.232 ms | max at:    367.043889 s

Phew...

I think maybe the real disease is the tg->load_avg is not updated in time.
I.e., it is after migrate, the source cfs_rq does not decrease its contribution
to the parent's tg->load_avg fast enough.

--

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4df37a4..3dba883 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2686,12 +2686,13 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
 static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
 	struct sched_avg *sa = &cfs_rq->avg;
-	int decayed;
+	int decayed, updated = 0;
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		updated = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
@@ -2708,7 +2709,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
 
-	return decayed;
+	return decayed | updated;
 }
 
 /* Update task and its cfs_rq load average */

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-13  4:08                                 ` Mike Galbraith
@ 2015-10-12 20:42                                   ` Yuyang Du
  0 siblings, 0 replies; 48+ messages in thread
From: Yuyang Du @ 2015-10-12 20:42 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, linux-kernel

On Tue, Oct 13, 2015 at 06:08:34AM +0200, Mike Galbraith wrote:
> It sounded like you wanted me to run the below alone.  If so, it's a nogo.
  
Yes, thanks.

Then it is the sad fact that after the migrate, when removed_load_avg is
added in migrate_task_rq_fair(), we don't get a chance to update the tg
fast enough for mplayer to be weighted up to the group's share at the
destination.

>  -----------------------------------------------------------------------------------------------------------------
>   Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
>  -----------------------------------------------------------------------------------------------------------------
>   oink:(8)              | 787001.236 ms |    21641 | avg:    0.377 ms | max:   21.991 ms | max at:     51.504005 s
>   mplayer:(25)          |   4256.224 ms |     7264 | avg:   19.698 ms | max: 2087.489 ms | max at:    115.294922 s
>   Xorg:1011             |   1507.958 ms |     4081 | avg:    8.349 ms | max: 1652.200 ms | max at:    126.908021 s
>   konsole:1752          |    697.806 ms |     1186 | avg:    5.749 ms | max:  160.189 ms | max at:     53.037952 s
>   testo:(9)             |    438.164 ms |     2551 | avg:    6.616 ms | max:  215.527 ms | max at:    117.302455 s
>   plasma-desktop:1716   |    280.418 ms |     1624 | avg:    3.701 ms | max:  574.806 ms | max at:     53.582261 s
>   kwin:1708             |    144.986 ms |     2422 | avg:    3.301 ms | max:  315.707 ms | max at:    116.555721 s
> 
> > --
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 4df37a4..3dba883 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2686,12 +2686,13 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> >  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >  {
> >  	struct sched_avg *sa = &cfs_rq->avg;
> > -	int decayed;
> > +	int decayed, updated = 0;
> >  
> >  	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> >  		long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> >  		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> >  		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> > +		updated = 1;
> >  	}
> >  
> >  	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
> > @@ -2708,7 +2709,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >  	cfs_rq->load_last_update_time_copy = sa->last_update_time;
> >  #endif
> >  
> > -	return decayed;
> > +	return decayed | updated;

A typo: decayed || updated, but shouldn't make any difference.

> >  }
> >  
> >  /* Update task and its cfs_rq load average */
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-13  8:06                                 ` Peter Zijlstra
@ 2015-10-13  0:35                                   ` Yuyang Du
  2015-10-13  8:10                                   ` Peter Zijlstra
  1 sibling, 0 replies; 48+ messages in thread
From: Yuyang Du @ 2015-10-13  0:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel

On Tue, Oct 13, 2015 at 10:06:48AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2015 at 03:55:17AM +0800, Yuyang Du wrote:
> 
> > I think maybe the real disease is the tg->load_avg is not updated in time.
> > I.e., it is after migrate, the source cfs_rq does not decrease its contribution
> > to the parent's tg->load_avg fast enough.
> 
> No, using the load_avg for shares calculation seems wrong; that would
> mean we'd first have to ramp up the avg before you react.
> 
> You want to react quickly to actual load changes, esp. going up.
> 
> We use the avg to guess the global group load, since that's the best
> compromise we have, but locally it doesn't make sense to use the avg if
> we have the actual values.

In Mike's case, since the mplayer group has only one active task, after
the task migrates, the source cfs_rq should have zero contrib to the
tg, so at the destination, the group entity should have the entire tg's
share. It is just a question of whether the zeroing can be as fast as we need.

But yes, in a general case, the load_avg (that has the blocked load) is
likely to lag behind. Using the actual load.weight to accelerate the
process makes sense. It is especially helpful to the less hungry tasks.
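
Putting rough numbers on that, following the calc_tg_weight()/
calc_cfs_shares() shape from the patch earlier in the thread (all values
invented; illustration only):

  #include <stdio.h>

  /* Share of a 1024-share group seen at the destination CPU, following the
   * calc_tg_weight()/calc_cfs_shares() shape quoted earlier in the thread:
   * tg_weight = tg_load - local_contrib + local, and the shares scale with
   * the local term. All numbers are invented. */
  static unsigned long calc_shares(unsigned long tg_shares,
                                   unsigned long tg_load,
                                   unsigned long local_contrib,
                                   unsigned long local)
  {
          unsigned long tg_weight = tg_load - local_contrib + local;

          if (!tg_weight)
                  return tg_shares;
          return tg_shares * local / tg_weight;
  }

  int main(void)
  {
          /* 800 of the group's load is still charged to the source CPU in
           * tg_load; the destination's own average has only ramped to 200 */
          printf("lagging load_avg  : %lu\n", calc_shares(1024, 1000, 200, 200));

          /* instantaneous load.weight: the destination sees the task's full
           * 1024 immediately, though the stale 800 still dilutes tg_weight */
          printf("actual load.weight: %lu\n", calc_shares(1024, 1000, 200, 1024));
          return 0;
  }

The remaining gap to the full 1024 in the second case is exactly the stale
source-side contribution still sitting in tg->load_avg.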

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-13  8:10                                   ` Peter Zijlstra
@ 2015-10-13  0:37                                     ` Yuyang Du
  0 siblings, 0 replies; 48+ messages in thread
From: Yuyang Du @ 2015-10-13  0:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel

On Tue, Oct 13, 2015 at 10:10:23AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2015 at 10:06:48AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 13, 2015 at 03:55:17AM +0800, Yuyang Du wrote:
> > 
> > > I think maybe the real disease is the tg->load_avg is not updated in time.
> > > I.e., it is after migrate, the source cfs_rq does not decrease its contribution
> > > to the parent's tg->load_avg fast enough.
> > 
> > No, using the load_avg for shares calculation seems wrong; that would
> > mean we'd first have to ramp up the avg before you react.
> > 
> > You want to react quickly to actual load changes, esp. going up.
> > 
> > We use the avg to guess the global group load, since that's the best
> > compromise we have, but locally it doesn't make sense to use the avg if
> > we have the actual values.
> 
> That is, can you send the original patch with a Changelog etc.. so that
> I can press 'A' :-)

Sure, in minutes, :)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12 11:47                             ` Peter Zijlstra
  2015-10-12 19:32                               ` Yuyang Du
@ 2015-10-13  2:22                               ` Mike Galbraith
  1 sibling, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-13  2:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Yuyang Du, linux-kernel

On Mon, 2015-10-12 at 13:47 +0200, Peter Zijlstra wrote:

> Also, should we do the below?

Ew.  Box said "Either you quilt pop/burn, or I boot windows." ;-)

	-Mike


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12 19:55                               ` Yuyang Du
@ 2015-10-13  4:08                                 ` Mike Galbraith
  2015-10-12 20:42                                   ` Yuyang Du
  2015-10-13  8:06                                 ` Peter Zijlstra
  1 sibling, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2015-10-13  4:08 UTC (permalink / raw)
  To: Yuyang Du; +Cc: Peter Zijlstra, linux-kernel

On Tue, 2015-10-13 at 03:55 +0800, Yuyang Du wrote:
> On Mon, Oct 12, 2015 at 12:23:31PM +0200, Mike Galbraith wrote:
> > On Mon, 2015-10-12 at 10:12 +0800, Yuyang Du wrote:
> > 
> > > I am guessing it is in calc_tg_weight(), and naughty boys do make them more
> > > favored, what a reality...
> > > 
> > > Mike, beg you test the following?
> > 
> > Wow, that was quick.  Dinky patch made it all better.
> > 
> >  -----------------------------------------------------------------------------------------------------------------
> >   Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
> >  -----------------------------------------------------------------------------------------------------------------
> >   oink:(8)              | 739056.970 ms |    27270 | avg:    2.043 ms | max:   29.105 ms | max at:    339.988310 s
> >   mplayer:(25)          |  36448.997 ms |    44670 | avg:    1.886 ms | max:   72.808 ms | max at:    302.153121 s
> >   Xorg:988              |  13334.908 ms |    22210 | avg:    0.081 ms | max:   25.005 ms | max at:    269.068666 s
> >   testo:(9)             |   2558.540 ms |    13703 | avg:    0.124 ms | max:    6.412 ms | max at:    279.235272 s
> >   konsole:1781          |   1084.316 ms |     1457 | avg:    0.006 ms | max:    1.039 ms | max at:    268.863379 s
> >   kwin:1734             |    879.645 ms |    17855 | avg:    0.458 ms | max:   15.788 ms | max at:    268.854992 s
> >   pulseaudio:1808       |    356.334 ms |    15023 | avg:    0.028 ms | max:    6.134 ms | max at:    324.479766 s
> >   threaded-ml:3483      |    292.782 ms |    25769 | avg:    0.364 ms | max:   40.387 ms | max at:    294.550515 s
> >   plasma-desktop:1745   |    265.055 ms |     1470 | avg:    0.102 ms | max:   21.886 ms | max at:    267.724902 s
> >   perf:3439             |     61.677 ms |        2 | avg:    0.117 ms | max:    0.232 ms | max at:    367.043889 s
> 
> Phew...
> 
> I think maybe the real disease is the tg->load_avg is not updated in time.
> I.e., it is after migrate, the source cfs_rq does not decrease its contribution
> to the parent's tg->load_avg fast enough.

It sounded like you wanted me to run the below alone.  If so, it's a nogo.
 
 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  oink:(8)              | 787001.236 ms |    21641 | avg:    0.377 ms | max:   21.991 ms | max at:     51.504005 s
  mplayer:(25)          |   4256.224 ms |     7264 | avg:   19.698 ms | max: 2087.489 ms | max at:    115.294922 s
  Xorg:1011             |   1507.958 ms |     4081 | avg:    8.349 ms | max: 1652.200 ms | max at:    126.908021 s
  konsole:1752          |    697.806 ms |     1186 | avg:    5.749 ms | max:  160.189 ms | max at:     53.037952 s
  testo:(9)             |    438.164 ms |     2551 | avg:    6.616 ms | max:  215.527 ms | max at:    117.302455 s
  plasma-desktop:1716   |    280.418 ms |     1624 | avg:    3.701 ms | max:  574.806 ms | max at:     53.582261 s
  kwin:1708             |    144.986 ms |     2422 | avg:    3.301 ms | max:  315.707 ms | max at:    116.555721 s

> --
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..3dba883 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2686,12 +2686,13 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
>  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
>  	struct sched_avg *sa = &cfs_rq->avg;
> -	int decayed;
> +	int decayed, updated = 0;
>  
>  	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
>  		long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
>  		sa->load_avg = max_t(long, sa->load_avg - r, 0);
>  		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> +		updated = 1;
>  	}
>  
>  	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
> @@ -2708,7 +2709,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  	cfs_rq->load_last_update_time_copy = sa->last_update_time;
>  #endif
>  
> -	return decayed;
> +	return decayed | updated;
>  }
>  
>  /* Update task and its cfs_rq load average */



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12 19:55                               ` Yuyang Du
  2015-10-13  4:08                                 ` Mike Galbraith
@ 2015-10-13  8:06                                 ` Peter Zijlstra
  2015-10-13  0:35                                   ` Yuyang Du
  2015-10-13  8:10                                   ` Peter Zijlstra
  1 sibling, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-13  8:06 UTC (permalink / raw)
  To: Yuyang Du; +Cc: Mike Galbraith, linux-kernel

On Tue, Oct 13, 2015 at 03:55:17AM +0800, Yuyang Du wrote:

> I think maybe the real disease is the tg->load_avg is not updated in time.
> I.e., it is after migrate, the source cfs_rq does not decrease its contribution
> to the parent's tg->load_avg fast enough.

No, using the load_avg for shares calculation seems wrong; that would
mean we'd first have to ramp up the avg before you react.

You want to react quickly to actual load changes, esp. going up.

We use the avg to guess the global group load, since that's the best
compromise we have, but locally it doesn't make sense to use the avg if
we have the actual values.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-12 19:32                               ` Yuyang Du
@ 2015-10-13  8:07                                 ` Peter Zijlstra
  0 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-13  8:07 UTC (permalink / raw)
  To: Yuyang Du; +Cc: Mike Galbraith, linux-kernel

On Tue, Oct 13, 2015 at 03:32:47AM +0800, Yuyang Du wrote:
> On Mon, Oct 12, 2015 at 01:47:23PM +0200, Peter Zijlstra wrote:
> > 
> > Also, should we do the below? At this point se->on_rq is still 0 so
> > reweight_entity() will not update (dequeue/enqueue) the accounting, but
> > we'll have just accounted the 'old' load.weight.
> > 
> > Doing it this way around we'll first update the weight and then account
> > it, which seems more accurate.
>  
> I think the original looks ok.
> 
> The account_entity_enqueue() adds child entity's load.weight to parent's load:
> 
> update_load_add(&cfs_rq->load, se->load.weight)
> 
> Then recalculate the shares.
> 
> Then reweight_entity() resets the parent entity's load.weight.

Yes, some days I should just not be allowed near a keyboard :)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 4.3 group scheduling regression
  2015-10-13  8:06                                 ` Peter Zijlstra
  2015-10-13  0:35                                   ` Yuyang Du
@ 2015-10-13  8:10                                   ` Peter Zijlstra
  2015-10-13  0:37                                     ` Yuyang Du
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-13  8:10 UTC (permalink / raw)
  To: Yuyang Du; +Cc: Mike Galbraith, linux-kernel

On Tue, Oct 13, 2015 at 10:06:48AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2015 at 03:55:17AM +0800, Yuyang Du wrote:
> 
> > I think maybe the real disease is the tg->load_avg is not updated in time.
> > I.e., it is after migrate, the source cfs_rq does not decrease its contribution
> > to the parent's tg->load_avg fast enough.
> 
> No, using the load_avg for shares calculation seems wrong; that would
> mean we'd first have to ramp up the avg before you react.
> 
> You want to react quickly to actual load changes, esp. going up.
> 
> We use the avg to guess the global group load, since that's the best
> compromise we have, but locally it doesn't make sense to use the avg if
> we have the actual values.

That is, can you send the original patch with a Changelog etc.. so that
I can press 'A' :-)

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread

Thread overview: 48+ messages
2015-10-05 21:48 CFS scheduler unfairly prefers pinned tasks paul.szabo
2015-10-06  2:45 ` Mike Galbraith
2015-10-06 10:06   ` paul.szabo
2015-10-06 12:17     ` Mike Galbraith
2015-10-06 20:44       ` paul.szabo
2015-10-07  1:28         ` Mike Galbraith
2015-10-08  8:19   ` Mike Galbraith
2015-10-08 10:54     ` paul.szabo
2015-10-08 11:19       ` Peter Zijlstra
2015-10-10 13:22         ` [patch] sched: disable task group re-weighting on the desktop Mike Galbraith
2015-10-10 14:03           ` kbuild test robot
2015-10-10 14:41             ` Mike Galbraith
2015-10-10 17:01           ` Peter Zijlstra
2015-10-10 17:13             ` Peter Zijlstra
2015-10-11  2:25             ` Mike Galbraith
2015-10-11 17:42               ` 4.3 group scheduling regression Mike Galbraith
2015-10-12  7:23                 ` Peter Zijlstra
2015-10-12  7:44                   ` Mike Galbraith
2015-10-12  8:04                     ` Peter Zijlstra
2015-10-12  0:53                       ` Yuyang Du
2015-10-12  9:12                         ` Peter Zijlstra
2015-10-12  2:12                           ` Yuyang Du
2015-10-12 10:23                             ` Mike Galbraith
2015-10-12 19:55                               ` Yuyang Du
2015-10-13  4:08                                 ` Mike Galbraith
2015-10-12 20:42                                   ` Yuyang Du
2015-10-13  8:06                                 ` Peter Zijlstra
2015-10-13  0:35                                   ` Yuyang Du
2015-10-13  8:10                                   ` Peter Zijlstra
2015-10-13  0:37                                     ` Yuyang Du
2015-10-12 11:47                             ` Peter Zijlstra
2015-10-12 19:32                               ` Yuyang Du
2015-10-13  8:07                                 ` Peter Zijlstra
2015-10-13  2:22                               ` Mike Galbraith
2015-10-12  8:48                       ` Mike Galbraith
2015-10-10 20:14           ` [patch] sched: disable task group re-weighting on the desktop paul.szabo
2015-10-11  2:38             ` Mike Galbraith
2015-10-11  9:25               ` paul.szabo
2015-10-11 12:49                 ` Mike Galbraith
2015-10-11 19:46           ` paul.szabo
2015-10-12  1:59             ` Mike Galbraith
2015-10-08 14:25       ` CFS scheduler unfairly prefers pinned tasks Mike Galbraith
2015-10-08 21:55         ` paul.szabo
2015-10-09  1:56           ` Mike Galbraith
2015-10-09  2:40           ` Mike Galbraith
2015-10-11  9:43             ` paul.szabo
2015-10-10  3:59     ` Wanpeng Li
2015-10-10  7:58       ` Wanpeng Li
