* [patch] CFS scheduler, v4
@ 2007-04-20 14:04 Ingo Molnar
  2007-04-20 21:37 ` Gene Heskett
                   ` (6 more replies)
  0 siblings, 7 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-20 14:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett


i'm pleased to announce release -v4 of the CFS patchset. The patch 
against v2.6.21-rc7 can be downloaded from:

    http://redhat.com/~mingo/cfs-scheduler/
 
this CFS release too is mainly about fixing regressions and improving 
interactivity, so the rate of change is relatively low:

    11 files changed, 136 insertions(+), 72 deletions(-)

in particular the preemption fix could resolve the 'desktop slows down 
under IO load' reports and the 'firefox does not switch tabs fast 
enough' reports as well. The suspend2 crash and the yield related 
Kaffeine hangs should be resolved as well.

Changes since -v3:

 - usability fix: automatic renicing of kernel threads such as keventd, 
   OOM tasks and tasks doing privileged hardware access (such as Xorg). 
   (This is a substitute for group scheduling until the group scheduling 
    details have been worked out.)

 - bugfix: buggy yield() caused suspend2 problems

 - preemption fix: it caused desktop app latencies

As usual, any sort of feedback, bugreport, fix and suggestion is more 
than welcome,

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, v4
  2007-04-20 14:04 [patch] CFS scheduler, v4 Ingo Molnar
@ 2007-04-20 21:37 ` Gene Heskett
  2007-04-21 20:47   ` S.Çağlar Onur
  2007-04-20 21:39 ` mdew .
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 149+ messages in thread
From: Gene Heskett @ 2007-04-20 21:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau

On Friday 20 April 2007, Ingo Molnar wrote:
>i'm pleased to announce release -v4 of the CFS patchset. The patch
>against v2.6.21-rc7 can be downloaded from:
>
>    http://redhat.com/~mingo/cfs-scheduler/
>
>this CFS release too is mainly about fixing regressions and improving
>interactivity, so the rate of change is relatively low:
>
>    11 files changed, 136 insertions(+), 72 deletions(-)
>
>in particular the preemption fix could resolve the 'desktop slows down
>under IO load' reports and the 'firefox does not switch tabs fast
>enough' reports as well. The suspend2 crash and the yield related
>Kaffeine hangs should be resolved as well.
>
>Changes since -v3:
>
> - usability fix: automatic renicing of kernel threads such as keventd,
>   OOM tasks and tasks doing privileged hardware access (such as Xorg).
>   (This is a substitute for group scheduling until the group scheduling
>    details have been worked out.)
>
> - bugfix: buggy yield() caused suspend2 problems
>
> - preemption fix: it caused desktop app latencies
>
>As usual, any sort of feedback, bugreport, fix and suggestion is more
>than welcome,

I've been running this one for several hours now, with amanda running in the
background due to a typo in one of my scripts, so now it's playing catch-up.

This one is another keeper IMO, or as we are fond of saying around here, it's
good enough for the girls I go with.  If this isn't the best one so far, it's
very, very close, and I'm getting pickier.  kmail is the only thing that's
lagging, and that's just kmail, which I believe is single-threaded.  Even
with gzip eating 95% of the cpu, graphics animations like the cards in
patience are moving at at least 80% speed.  Nice, keep this one and use it
as the reference.

>	Ingo



-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
In order to dial out, it is necessary to broaden one's dimension.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, v4
  2007-04-20 14:04 [patch] CFS scheduler, v4 Ingo Molnar
  2007-04-20 21:37 ` Gene Heskett
@ 2007-04-20 21:39 ` mdew .
  2007-04-21  6:47   ` Ingo Molnar
  2007-04-21 12:12 ` [REPORT] cfs-v4 vs sd-0.44 Willy Tarreau
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 149+ messages in thread
From: mdew . @ 2007-04-20 21:39 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Any chance of supporting 2.6.20?

On 4/21/07, Ingo Molnar <mingo@elte.hu> wrote:
>
> i'm pleased to announce release -v4 of the CFS patchset. The patch
> against v2.6.21-rc7 can be downloaded from:
>
>     http://redhat.com/~mingo/cfs-scheduler/
>
> this CFS release too is mainly about fixing regressions and improving
> interactivity, so the rate of change is relatively low:
>
>     11 files changed, 136 insertions(+), 72 deletions(-)
>
> in particular the preemption fix could resolve the 'desktop slows down
> under IO load' reports and the 'firefox does not switch tabs fast
> enough' reports as well. The suspend2 crash and the yield related
> Kaffeine hangs should be resolved as well.
>
> Changes since -v3:
>
>  - usability fix: automatic renicing of kernel threads such as keventd,
>    OOM tasks and tasks doing privileged hardware access (such as Xorg).
>    (This is a substitute for group scheduling until the group scheduling
>     details have been worked out.)
>
>  - bugfix: buggy yield() caused suspend2 problems
>
>  - preemption fix: it caused desktop app latencies
>
> As usual, any sort of feedback, bugreport, fix and suggestion is more
> than welcome,
>
>         Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, v4
  2007-04-20 21:39 ` mdew .
@ 2007-04-21  6:47   ` Ingo Molnar
  2007-04-21  7:55     ` [patch] CFS scheduler, v4, for v2.6.20.7 Ingo Molnar
  0 siblings, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-21  6:47 UTC (permalink / raw)
  To: mdew .; +Cc: linux-kernel


* mdew . <some.nzguy@gmail.com> wrote:

> Any chance of supporting 2.6.20?

okay, it seems it was less work to backport it to v2.6.20 than it was to 
answer all the "where's the v2.6.20 version?" emails ;-) You can 
download the v2.6.20.7 version of CFS from:

    http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v4-v2.6.20.7.patch

let me know how well it works for you and how it compares to the vanilla 
scheduler and/or other schedulers you might have tried.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [patch] CFS scheduler, v4, for v2.6.20.7
  2007-04-21  6:47   ` Ingo Molnar
@ 2007-04-21  7:55     ` Ingo Molnar
  0 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-21  7:55 UTC (permalink / raw)
  To: mdew .; +Cc: linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> > Any chance of supporting 2.6.20?
> 
> okay, it seems it was less work to backport it to v2.6.20 than it was 
> to answer all the "where's the v2.6.20 version?" emails ;-) You can 
> download the v2.6.20.7 version of CFS from:
> 
>     http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v4-v2.6.20.7.patch
> 
> let me know how well it works for you and how it compares to the 
> vanilla scheduler and/or other schedulers you might have tried.

(and i changed the subject line as well, so that people can find this 
mail on lkml, if they want to.)

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [REPORT] cfs-v4 vs sd-0.44
  2007-04-20 14:04 [patch] CFS scheduler, v4 Ingo Molnar
  2007-04-20 21:37 ` Gene Heskett
  2007-04-20 21:39 ` mdew .
@ 2007-04-21 12:12 ` Willy Tarreau
  2007-04-21 12:40   ` Con Kolivas
                     ` (4 more replies)
  2007-04-21 20:35 ` [patch] CFS scheduler, v4 S.Çağlar Onur
                   ` (3 subsequent siblings)
  6 siblings, 5 replies; 149+ messages in thread
From: Willy Tarreau @ 2007-04-21 12:12 UTC (permalink / raw)
  To: Ingo Molnar, Con Kolivas
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

Hi Ingo, Hi Con,

I promised to perform some tests on your code. I'm short in time right now,
but I observed behaviours that should be commented on.

1) machine : dual athlon 1533 MHz, 1G RAM, kernel 2.6.21-rc7 + either scheduler
   Test:  ./ocbench -R 250000 -S 750000 -x 8 -y 8
   ocbench: http://linux.1wt.eu/sched/
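
   For reference, a rough stand-in for that workload (this is NOT the real
   ocbench, whose source is at the URL above; it is just a sketch that forks
   64 workers which each burn the CPU for about 250 ms and then sleep 750 ms):

/*
 * Minimal stand-in for the ocbench workload (not the real ocbench source):
 * each worker burns the CPU for ~250 ms, then sleeps for 750 ms, forever;
 * 64 such workers roughly reproduce the "-R 250000 -S 750000 -x 8 -y 8" run.
 */
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#define NPROC		64
#define RUN_US		250000		/* busy-loop time per iteration */
#define SLEEP_US	750000		/* sleep time per iteration */

static void worker(void)
{
	struct timespec start, now;

	for (;;) {
		clock_gettime(CLOCK_MONOTONIC, &start);
		do {	/* burn CPU for RUN_US microseconds */
			clock_gettime(CLOCK_MONOTONIC, &now);
		} while ((now.tv_sec - start.tv_sec) * 1000000L +
			 (now.tv_nsec - start.tv_nsec) / 1000L < RUN_US);
		usleep(SLEEP_US);	/* then go idle */
	}
}

int main(void)
{
	int i;

	for (i = 0; i < NPROC; i++)
		if (fork() == 0)
			worker();	/* children never return */
	for (i = 0; i < NPROC; i++)
		wait(NULL);		/* parent just sits here; ^C to stop */
	return 0;
}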

2) SD-0.44

   Feels good, but becomes jerky at moderately high loads. I've started
   64 ocbench with a 250 ms busy loop and 750 ms sleep time. The system
   always responds correctly but under X, mouse jumps quite a bit and
   typing in xterm or even text console feels slightly jerky. The CPU is
   not completely used, and the load varies a lot (see below). However,
   the load is shared equally between all 64 ocbench, and they do not
   deviate even after 4000 iterations. X uses less than 1% CPU during
   those tests.

   Here's the vmstat output :

willy@pcw:~$ vmstat 1
   procs                      memory      swap          io     system      cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
 0  0  0      0 919856   6648  57788    0    0    22     2    4   148 31 49 20
 0  0  0      0 919856   6648  57788    0    0     0     0    2   285 32 50 19
28  0  0      0 919836   6648  57788    0    0     0     0    0   331 24 40 36
64  0  0      0 919836   6648  57788    0    0     0     0    1   618 23 40 37
65  0  0      0 919836   6648  57788    0    0     0     0    0   571 21 36 43
35  0  0      0 919836   6648  57788    0    0     0     0    3   382 32 50 18
 2  0  0      0 919836   6648  57788    0    0     0     0    0   308 37 61  2
 8  0  0      0 919836   6648  57788    0    0     0     0    1   533 36 65  0
32  0  0      0 919768   6648  57788    0    0     0     0   93   706 33 62  5
62  0  0      0 919712   6648  57788    0    0     0     0   65   617 32 54 13
63  0  0      0 919712   6648  57788    0    0     0     0    1   569 28 48 23
40  0  0      0 919712   6648  57788    0    0     0     0    0   427 26 50 24
 4  0  0      0 919712   6648  57788    0    0     0     0    1   382 29 48 23
 4  0  0      0 919712   6648  57788    0    0     0     0    0   383 34 65  0
14  0  0      0 919712   6648  57788    0    0     0     0    1   769 39 61  0
40  0  0      0 919712   6648  57788    0    0     0     0    0   384 37 52 11
54  0  0      0 919712   6648  57788    0    0     0     0    1   715 31 60  8
58  0  2      0 919712   6648  57788    0    0     0     0    1   611 34 65  0
41  0  0      0 919712   6648  57788    0    0     0     0   19   395 28 45 27
 0  0  0      0 919712   6648  57788    0    0     0     0   31   421 23 32 45
 0  0  0      0 919712   6648  57788    0    0     0     0   31   328 34 44 22
29  0  0      0 919712   6648  57788    0    0     0     0   34   369 32 43 25
65  0  0      0 919712   6648  57788    0    0     0     0   31   410 24 35 40
47  0  1      0 919712   6648  57788    0    0     0     0   42   538 25 39 35

3) CFS-v4

  Feels even better, mouse movements are very smooth even under high load.
  I noticed that X gets reniced to -19 with this scheduler. I've not looked
  at the code yet but this looked suspicious to me. I've reniced it to 0 and
  it did not change any behaviour. Still very good. The 64 ocbench share
  equal CPU time and show exact same progress after 2000 iterations. The CPU
  load is more smoothly spread according to vmstat, and there's no idle (see
  below). BUT I now think it was wrong to let new processes start with no
  timeslice at all, because it can take tens of seconds to start a new process
  when only 64 ocbench are there. Simply starting "killall ocbench" takes about
  10 seconds. On a smaller machine (VIA C3-533), it took me more than one
  minute to do "su -", even from console, so that's not X. BTW, X uses less
  than 1% CPU during those tests.

willy@pcw:~$ vmstat 1
   procs                      memory      swap          io     system      cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
12  0  2      0 922120   6532  57540    0    0   299    29   31   386 17 27 57
12  0  2      0 922096   6532  57556    0    0     0     0    1   776 37 63  0
14  0  2      0 922096   6532  57556    0    0     0     0    1   782 35 65  0
13  0  1      0 922096   6532  57556    0    0     0     0    0   782 38 62  0
14  0  1      0 922096   6532  57556    0    0     0     0    1   782 36 64  0
13  0  1      0 922096   6532  57556    0    0     0     0    2   785 38 62  0
13  0  1      0 922096   6532  57556    0    0     0     0    1   774 35 65  0
14  0  1      0 922096   6532  57556    0    0     0     0    0   784 36 64  0
13  0  1      0 922096   6532  57556    0    0     0     0    1   767 37 63  0
13  0  1      0 922096   6532  57556    0    0     0     0    1   785 41 59  0
14  0  1      0 922096   6532  57556    0    0     0     0    0   779 38 62  0
19  0  1      0 922096   6532  57556    0    0     0     0    1   816 38 62  0
22  0  1      0 922096   6532  57556    0    0     0     0    0   817 35 65  0
19  0  1      0 922096   6532  57556    0    0     0     0    1   817 39 61  0
21  0  1      0 922096   6532  57556    0    0     0     0    0   849 36 64  0
20  0  0      0 922096   6532  57556    0    0     0     0    1   793 36 64  0
21  0  0      0 922096   6532  57556    0    0     0     0    0   815 37 63  0
19  0  0      0 922096   6532  57556    0    0     0     0    1   824 35 65  0
21  0  0      0 922096   6532  57556    0    0     0     0    0   817 35 65  0
26  0  0      0 922096   6532  57556    0    0     0     0    1   824 38 62  0
26  0  0      0 922096   6532  57556    0    0     0     0    1   817 35 65  0
26  0  0      0 922096   6532  57556    0    0     0     0    0   811 37 63  0
26  0  0      0 922096   6532  57556    0    0     0     0    1   804 34 66  0
16  0  0      0 922096   6532  57556    0    0     0     0   39   850 35 65  0
18  0  0      0 922096   6532  57556    0    0     0     0    1   801 39 61  0


4) first impressions

I think that CFS is based on a more promising concept but is less mature
and is dangerous right now with certain workloads. SD shows some strange
behaviours, like not using all the available CPU and a little jerkiness, but
it is more robust and may be the less risky solution for a first step towards
a better scheduler in mainline. It may well also be the last O(1) scheduler,
to be replaced sometime later when CFS (or any other one) combines the
smoothness of CFS with the robustness of SD.

I'm sorry not to spend more time on them right now, I hope that other people
will do.

Regards,
Willy


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 12:12 ` [REPORT] cfs-v4 vs sd-0.44 Willy Tarreau
@ 2007-04-21 12:40   ` Con Kolivas
  2007-04-21 13:02     ` Willy Tarreau
  2007-04-21 15:46   ` Ingo Molnar
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 149+ messages in thread
From: Con Kolivas @ 2007-04-21 12:40 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On Saturday 21 April 2007 22:12, Willy Tarreau wrote:
> Hi Ingo, Hi Con,
>
> I promised to perform some tests on your code. I'm short in time right now,
> but I observed behaviours that should be commented on.
>
> 1) machine : dual athlon 1533 MHz, 1G RAM, kernel 2.6.21-rc7 + either
> scheduler Test:  ./ocbench -R 250000 -S 750000 -x 8 -y 8
>    ocbench: http://linux.1wt.eu/sched/
>
> 2) SD-0.44
>
>    Feels good, but becomes jerky at moderately high loads. I've started
>    64 ocbench with a 250 ms busy loop and 750 ms sleep time. The system
>    always responds correctly but under X, mouse jumps quite a bit and
>    typing in xterm or even text console feels slightly jerky. The CPU is
>    not completely used, and the load varies a lot (see below). However,
>    the load is shared equally between all 64 ocbench, and they do not
>    deviate even after 4000 iterations. X uses less than 1% CPU during
>    those tests.
>
>    Here's the vmstat output :
[snip]

> 3) CFS-v4
>
>   Feels even better, mouse movements are very smooth even under high load.
>   I noticed that X gets reniced to -19 with this scheduler. I've not looked
>   at the code yet but this looked suspicious to me. I've reniced it to 0
> and it did not change any behaviour. Still very good. The 64 ocbench share
> equal CPU time and show exact same progress after 2000 iterations. The CPU
> load is more smoothly spread according to vmstat, and there's no idle (see
> below). BUT I now think it was wrong to let new processes start with no
> timeslice at all, because it can take tens of seconds to start a new
> process when only 64 ocbench are there. Simply starting "killall ocbench"
> takes about 10 seconds. On a smaller machine (VIA C3-533), it took me more
> than one minute to do "su -", even from console, so that's not X. BTW, X
> uses less than 1% CPU during those tests.
>
> willy@pcw:~$ vmstat 1
[snip]

> 4) first impressions
>
> I think that CFS is based on a more promising concept but is less mature
> and is dangerous right now with certain workloads. SD shows some strange
> behaviours like not using all CPU available and a little jerkyness, but is
> more robust and may be the less risky solution for a first step towards
> a better scheduler in mainline, but it may also probably be the last O(1)
> scheduler, which may be replaced sometime later when CFS (or any other one)
> shows at the same time the smoothness of CFS and the robustness of SD.

I assumed from your description that you were running X at nice 0 during all 
this testing and left the tunables of both SD and CFS at their defaults; with 
the defaults, the effective equivalent of the "timeslice" in CFS tends to be 
smaller than in SD.

> I'm sorry not to spend more time on them right now, I hope that other
> people will do.

Thanks for that interesting testing you've done. The fluctuating cpu load and 
the apparently high idle time means there is almost certainly a bug still in 
the cpu accounting I do in update_cpu_clock. It looks suspicious to me 
already on just my first glance. Fortunately the throughput does not appear 
to be adversely affected on other benchmarks so I suspect it's lying about 
the idle time and it's not really there. Which means it's likely also 
accounting the cpu time wrongly. Which also means there's something I can fix 
and improve SD further. Great stuff, thanks! 

-- 
-ck

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 12:40   ` Con Kolivas
@ 2007-04-21 13:02     ` Willy Tarreau
  0 siblings, 0 replies; 149+ messages in thread
From: Willy Tarreau @ 2007-04-21 13:02 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On Sat, Apr 21, 2007 at 10:40:18PM +1000, Con Kolivas wrote:
> On Saturday 21 April 2007 22:12, Willy Tarreau wrote:
> > Hi Ingo, Hi Con,
> >
> > I promised to perform some tests on your code. I'm short in time right now,
> > but I observed behaviours that should be commented on.
> >
> > 1) machine : dual athlon 1533 MHz, 1G RAM, kernel 2.6.21-rc7 + either
> > scheduler Test:  ./ocbench -R 250000 -S 750000 -x 8 -y 8
> >    ocbench: http://linux.1wt.eu/sched/
> >
> > 2) SD-0.44
> >
> >    Feels good, but becomes jerky at moderately high loads. I've started
> >    64 ocbench with a 250 ms busy loop and 750 ms sleep time. The system
> >    always responds correctly but under X, mouse jumps quite a bit and
> >    typing in xterm or even text console feels slightly jerky. The CPU is
> >    not completely used, and the load varies a lot (see below). However,
> >    the load is shared equally between all 64 ocbench, and they do not
> >    deviate even after 4000 iterations. X uses less than 1% CPU during
> >    those tests.
> >
> >    Here's the vmstat output :
> [snip]
> 
> > 3) CFS-v4
> >
> >   Feels even better, mouse movements are very smooth even under high load.
> >   I noticed that X gets reniced to -19 with this scheduler. I've not looked
> >   at the code yet but this looked suspicious to me. I've reniced it to 0
> > and it did not change any behaviour. Still very good. The 64 ocbench share
> > equal CPU time and show exact same progress after 2000 iterations. The CPU
> > load is more smoothly spread according to vmstat, and there's no idle (see
> > below). BUT I now think it was wrong to let new processes start with no
> > timeslice at all, because it can take tens of seconds to start a new
> > process when only 64 ocbench are there. Simply starting "killall ocbench"
> > takes about 10 seconds. On a smaller machine (VIA C3-533), it took me more
> > than one minute to do "su -", even from console, so that's not X. BTW, X
> > uses less than 1% CPU during those tests.
> >
> > willy@pcw:~$ vmstat 1
> [snip]
> 
> > 4) first impressions
> >
> > I think that CFS is based on a more promising concept but is less mature
> > and is dangerous right now with certain workloads. SD shows some strange
> > behaviours like not using all CPU available and a little jerkyness, but is
> > more robust and may be the less risky solution for a first step towards
> > a better scheduler in mainline, but it may also probably be the last O(1)
> > scheduler, which may be replaced sometime later when CFS (or any other one)
> > shows at the same time the smoothness of CFS and the robustness of SD.
> 
> I assumed from your description that you were running X nice 0 during all this 
Yes, that's what I did.

> testing and left the tunables from both SD and CFS at their defaults;

Yes too, because I don't have enough time to try many combinations this weekend.

> this 
> tends to have the effective equivalent of "timeslice" in CFS smaller than SD.

If you look at the CS column in vmstat, you'll see that there are about twice
as many context switches with CFS as with SD, meaning the average timeslice
would be about half as long with CFS. But my impression is that some tasks
occasionally get very long timeslices with SD, while this never happens with
CFS, hence the very smooth versus jerky feeling, which cannot be explained
by halved timeslices alone.

> > I'm sorry not to spend more time on them right now, I hope that other
> > people will do.
> 
> Thanks for that interesting testing you've done. The fluctuating cpu load and 
> the apparently high idle time means there is almost certainly a bug still in 
> the cpu accounting I do in update_cpu_clock. It looks suspicious to me 
> already on just my first glance. Fortunately the throughput does not appear 
> to be adversely affected on other benchmarks so I suspect it's lying about 
> the idle time and it's not really there. Which means it's likely also 
> accounting the cpu time wrongly.

It is possible that only the measurement is wrong, because the time was evenly
distributed among the 64 processes. Maybe the fix could also prevent some
tasks from occasionally stealing a slice and reduce the jerky feeling.
Anyway, it's just a bit jerky, no more of the freezes we've known for years ;-)

> Which also means there's something I can fix and improve SD further.
> Great stuff, thanks! 

You're welcome !

Cheers,
Willy


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 12:12 ` [REPORT] cfs-v4 vs sd-0.44 Willy Tarreau
  2007-04-21 12:40   ` Con Kolivas
@ 2007-04-21 15:46   ` Ingo Molnar
  2007-04-21 16:18     ` Willy Tarreau
  2007-04-21 15:55   ` Con Kolivas
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-21 15:46 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett


* Willy Tarreau <w@1wt.eu> wrote:

> I promised to perform some tests on your code. I'm short in time right 
> now, but I observed behaviours that should be commented on.

thanks for the feedback!

> 3) CFS-v4
> 
>   Feels even better, mouse movements are very smooth even under high 
>   load. I noticed that X gets reniced to -19 with this scheduler. I've 
>   not looked at the code yet but this looked suspicious to me. I've 
>   reniced it to 0 and it did not change any behaviour. Still very 
>   good. The 64 ocbench share equal CPU time and show exact same 
>   progress after 2000 iterations. The CPU load is more smoothly spread 
>   according to vmstat, and there's no idle (see below). BUT I now 
>   think it was wrong to let new processes start with no timeslice at 
>   all, because it can take tens of seconds to start a new process when 
>   only 64 ocbench are there. [...]

ok, i'll modify that portion and add back the 50%/50% parent/child CPU 
time sharing approach again. (which CFS had in -v1) That should not 
change the rest of your test and should improve the task startup 
characteristics.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 12:12 ` [REPORT] cfs-v4 vs sd-0.44 Willy Tarreau
  2007-04-21 12:40   ` Con Kolivas
  2007-04-21 15:46   ` Ingo Molnar
@ 2007-04-21 15:55   ` Con Kolivas
  2007-04-21 16:00     ` Ingo Molnar
  2007-04-21 18:17   ` Gene Heskett
  2007-04-22  1:51   ` Con Kolivas
  4 siblings, 1 reply; 149+ messages in thread
From: Con Kolivas @ 2007-04-21 15:55 UTC (permalink / raw)
  To: Willy Tarreau, William Lee Irwin III
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On Saturday 21 April 2007 22:12, Willy Tarreau wrote:
> I promised to perform some tests on your code. I'm short in time right now,
> but I observed behaviours that should be commented on.

>   Feels even better, mouse movements are very smooth even under high load.
>   I noticed that X gets reniced to -19 with this scheduler. I've not looked
>   at the code yet but this looked suspicious to me.

Looks like this code does it:

+int sysctl_sched_privileged_nice_level __read_mostly = -19;

Anything that ends up calling sched_privileged_task() one way or another gets 
nice -19, and this is enabled by default.

--- linux-cfs-2.6.20.7.q.orig/arch/i386/kernel/ioport.c
+++ linux-cfs-2.6.20.7.q/arch/i386/kernel/ioport.c

+	if (turn_on) {
+		if (!capable(CAP_SYS_RAWIO))
+			return -EPERM;
+		/*
+		 * Task will be accessing hardware IO ports,
+		 * mark it as special with the scheduler too:
+		 */
+		sched_privileged_task(current);
+	}

presumably that selects out X as a privileged task... and sets it to nice -19 
by default.

-- 
-ck

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 15:55   ` Con Kolivas
@ 2007-04-21 16:00     ` Ingo Molnar
  2007-04-21 16:12       ` Willy Tarreau
                         ` (4 more replies)
  0 siblings, 5 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-21 16:00 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Willy Tarreau, William Lee Irwin III, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett


* Con Kolivas <kernel@kolivas.org> wrote:

> >   Feels even better, mouse movements are very smooth even under high 
> >   load. I noticed that X gets reniced to -19 with this scheduler. 
> >   I've not looked at the code yet but this looked suspicious to me. 
> >   I've reniced it to 0 and it did not change any behaviour. Still 
> >   very good.
> 
> Looks like this code does it:
> 
> +int sysctl_sched_privileged_nice_level __read_mostly = -19;

correct. Note that Willy reniced X back to 0 so it had no relevance on 
his test. Also note that i pointed this change out in the -v4 CFS 
announcement:

|| Changes since -v3:
||
||  - usability fix: automatic renicing of kernel threads such as 
||    keventd, OOM tasks and tasks doing privileged hardware access
||    (such as Xorg).

i've attached it below in a standalone form, feel free to put it into 
SD! :)

	Ingo

---
 arch/i386/kernel/ioport.c   |   13 ++++++++++---
 arch/x86_64/kernel/ioport.c |    8 ++++++--
 drivers/block/loop.c        |    5 ++++-
 include/linux/sched.h       |    7 +++++++
 kernel/sched.c              |   40 ++++++++++++++++++++++++++++++++++++++++
 kernel/workqueue.c          |    2 +-
 mm/oom_kill.c               |    4 +++-
 7 files changed, 71 insertions(+), 8 deletions(-)

Index: linux/arch/i386/kernel/ioport.c
===================================================================
--- linux.orig/arch/i386/kernel/ioport.c
+++ linux/arch/i386/kernel/ioport.c
@@ -64,9 +64,15 @@ asmlinkage long sys_ioperm(unsigned long
 
 	if ((from + num <= from) || (from + num > IO_BITMAP_BITS))
 		return -EINVAL;
-	if (turn_on && !capable(CAP_SYS_RAWIO))
-		return -EPERM;
-
+	if (turn_on) {
+		if (!capable(CAP_SYS_RAWIO))
+			return -EPERM;
+		/*
+		 * Task will be accessing hardware IO ports,
+		 * mark it as special with the scheduler too:
+		 */
+		sched_privileged_task(current);
+	}
 	/*
 	 * If it's the first ioperm() call in this thread's lifetime, set the
 	 * IO bitmap up. ioperm() is much less timing critical than clone(),
@@ -145,6 +151,7 @@ asmlinkage long sys_iopl(unsigned long u
 	if (level > old) {
 		if (!capable(CAP_SYS_RAWIO))
 			return -EPERM;
+		sched_privileged_task(current);
 	}
 	t->iopl = level << 12;
 	regs->eflags = (regs->eflags & ~X86_EFLAGS_IOPL) | t->iopl;
Index: linux/arch/x86_64/kernel/ioport.c
===================================================================
--- linux.orig/arch/x86_64/kernel/ioport.c
+++ linux/arch/x86_64/kernel/ioport.c
@@ -41,8 +41,11 @@ asmlinkage long sys_ioperm(unsigned long
 
 	if ((from + num <= from) || (from + num > IO_BITMAP_BITS))
 		return -EINVAL;
-	if (turn_on && !capable(CAP_SYS_RAWIO))
-		return -EPERM;
+	if (turn_on) {
+		if (!capable(CAP_SYS_RAWIO))
+			return -EPERM;
+		sched_privileged_task(current);
+	}
 
 	/*
 	 * If it's the first ioperm() call in this thread's lifetime, set the
@@ -113,6 +116,7 @@ asmlinkage long sys_iopl(unsigned int le
 	if (level > old) {
 		if (!capable(CAP_SYS_RAWIO))
 			return -EPERM;
+		sched_privileged_task(current);
 	}
 	regs->eflags = (regs->eflags &~ X86_EFLAGS_IOPL) | (level << 12);
 	return 0;
Index: linux/drivers/block/loop.c
===================================================================
--- linux.orig/drivers/block/loop.c
+++ linux/drivers/block/loop.c
@@ -588,7 +588,10 @@ static int loop_thread(void *data)
 	 */
 	current->flags |= PF_NOFREEZE;
 
-	set_user_nice(current, -20);
+	/*
+	 * The loop thread is important enough to be given a boost:
+	 */
+	sched_privileged_task(current);
 
 	while (!kthread_should_stop() || lo->lo_bio) {
 
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1256,6 +1256,13 @@ static inline int rt_mutex_getprio(struc
 #endif
 
 extern void set_user_nice(struct task_struct *p, long nice);
+/*
+ * Task has special privileges, give it more CPU power:
+ */
+extern void sched_privileged_task(struct task_struct *p);
+
+extern int sysctl_sched_privileged_nice_level;
+
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
 extern int can_nice(const struct task_struct *p, const int nice);
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3251,6 +3251,46 @@ out_unlock:
 EXPORT_SYMBOL(set_user_nice);
 
 /*
+ * Nice level for privileged tasks. (can be set to 0 for this
+ * to be turned off)
+ */
+int sysctl_sched_privileged_nice_level __read_mostly = -19;
+
+static int __init privileged_nice_level_setup(char *str)
+{
+	sysctl_sched_privileged_nice_level = simple_strtoul(str, NULL, 0);
+	return 1;
+}
+__setup("privileged_nice_level=", privileged_nice_level_setup);
+
+/*
+ * Tasks with special privileges call this and gain extra nice
+ * levels:
+ */
+void sched_privileged_task(struct task_struct *p)
+{
+	long new_nice = sysctl_sched_privileged_nice_level;
+	long old_nice = TASK_NICE(p);
+
+	if (new_nice >= old_nice)
+		return;
+	/*
+	 * Setting the sysctl to 0 turns off the boosting:
+	 */
+	if (unlikely(!new_nice))
+		return;
+
+	if (new_nice < -20)
+		new_nice = -20;
+	else if (new_nice > 19)
+		new_nice = 19;
+
+	set_user_nice(p, new_nice);
+}
+
+EXPORT_SYMBOL(sched_privileged_task);
+
+/*
  * can_nice - check if a task can reduce its nice value
  * @p: task
  * @nice: nice value
Index: linux/kernel/workqueue.c
===================================================================
--- linux.orig/kernel/workqueue.c
+++ linux/kernel/workqueue.c
@@ -355,7 +355,7 @@ static int worker_thread(void *__cwq)
 	if (!cwq->freezeable)
 		current->flags |= PF_NOFREEZE;
 
-	set_user_nice(current, -5);
+	sched_privileged_task(current);
 
 	/* Block and flush all signals */
 	sigfillset(&blocked);
Index: linux/mm/oom_kill.c
===================================================================
--- linux.orig/mm/oom_kill.c
+++ linux/mm/oom_kill.c
@@ -291,7 +291,9 @@ static void __oom_kill_task(struct task_
 	 * all the memory it needs. That way it should be able to
 	 * exit() and clear out its resources quickly...
 	 */
-	p->time_slice = HZ;
+	if (p->policy == SCHED_NORMAL || p->policy == SCHED_BATCH)
+		sched_privileged_task(p);
+
 	set_tsk_thread_flag(p, TIF_MEMDIE);
 
 	force_sig(SIGKILL, p);

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 16:00     ` Ingo Molnar
@ 2007-04-21 16:12       ` Willy Tarreau
  2007-04-21 16:39       ` William Lee Irwin III
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 149+ messages in thread
From: Willy Tarreau @ 2007-04-21 16:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Con Kolivas, William Lee Irwin III, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

On Sat, Apr 21, 2007 at 06:00:08PM +0200, Ingo Molnar wrote:
> 
> * Con Kolivas <kernel@kolivas.org> wrote:
> 
> > >   Feels even better, mouse movements are very smooth even under high 
> > >   load. I noticed that X gets reniced to -19 with this scheduler. 
> > >   I've not looked at the code yet but this looked suspicious to me. 
> > >   I've reniced it to 0 and it did not change any behaviour. Still 
> > >   very good.
> > 
> > Looks like this code does it:
> > 
> > +int sysctl_sched_privileged_nice_level __read_mostly = -19;
> 
> correct. Note that Willy reniced X back to 0 so it had no relevance on 
> his test.

Anyway, my X was mostly unused (below 1% CPU), which was my intent when
replacing glxgears with ocbench. We have not settled yet on how to handle
the special case of X. Let's at least try to get the best schedulers without
this problem, then see how to make them behave best when taking X into account.

> Also note that i pointed this change out in the -v4 CFS 
> announcement:
> 
> || Changes since -v3:
> ||
> ||  - usability fix: automatic renicing of kernel threads such as 
> ||    keventd, OOM tasks and tasks doing privileged hardware access
> ||    (such as Xorg).
> 
> i've attached it below in a standalone form, feel free to put it into 
> SD! :)

Con, I think it could be a good idea, since you recommend renicing X with
SD. Most of the problem users face with renicing X is that they need
to change their configs or scripts. If the kernel can reliably detect X and
handle it differently, why not do it?

It makes me think that this hint might be used to set some flags in the task
struct in order to apply different treatment than just renicing. It is indeed
possible that nice is not the best solution and that something else would be
even better (e.g. longer timeslices, but without changing priority in the
queues). Just an idea anyway.

OK, back to work ;-)
Willy


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 15:46   ` Ingo Molnar
@ 2007-04-21 16:18     ` Willy Tarreau
  2007-04-21 16:34       ` Linus Torvalds
  2007-04-21 17:03       ` Geert Bosch
  0 siblings, 2 replies; 149+ messages in thread
From: Willy Tarreau @ 2007-04-21 16:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On Sat, Apr 21, 2007 at 05:46:14PM +0200, Ingo Molnar wrote:
> 
> * Willy Tarreau <w@1wt.eu> wrote:
> 
> > I promised to perform some tests on your code. I'm short in time right 
> > now, but I observed behaviours that should be commented on.
> 
> thanks for the feedback!
> 
> > 3) CFS-v4
> > 
> >   Feels even better, mouse movements are very smooth even under high 
> >   load. I noticed that X gets reniced to -19 with this scheduler. I've 
> >   not looked at the code yet but this looked suspicious to me. I've 
> >   reniced it to 0 and it did not change any behaviour. Still very 
> >   good. The 64 ocbench share equal CPU time and show exact same 
> >   progress after 2000 iterations. The CPU load is more smoothly spread 
> >   according to vmstat, and there's no idle (see below). BUT I now 
> >   think it was wrong to let new processes start with no timeslice at 
> >   all, because it can take tens of seconds to start a new process when 
> >   only 64 ocbench are there. [...]
> 
> ok, i'll modify that portion and add back the 50%/50% parent/child CPU 
> time sharing approach again. (which CFS had in -v1) That should not 
> change the rest of your test and should improve the task startup 
> characteristics.

If you remember, with 50/50, I noticed some difficulty forking many
processes. I think that during a fork(), the parent has a higher probability
of forking other processes than the child. So we should at least use
something like 67/33 or 75/25 for parent/child.

There are many shell scripts out there doing a lot of fork(), and it seems
reasonable to let them keep some CPU so they can continue to work.

Also, I believe that (in shells) most forked processes do not even consume
a full timeslice (e.g. $(uname -n) is very fast). This means that assigning
them a shorter one will not hurt them, while preserving the shell's
performance against CPU hogs.
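
To make the split concrete, here is a toy userspace model of the idea (this
is not CFS or SD code; the struct and function names are invented for the
sketch):

/*
 * Toy model of splitting the parent's remaining timeslice at fork() time,
 * with a configurable parent share.  Not scheduler code, names made up.
 */
#include <stdio.h>

struct toy_task {
	const char *comm;
	unsigned int slice_us;		/* remaining timeslice in microseconds */
};

/* Give the child (100 - parent_pct)% of the parent's remaining slice. */
static void toy_fork_split(struct toy_task *parent, struct toy_task *child,
			   unsigned int parent_pct)
{
	child->slice_us  = parent->slice_us * (100 - parent_pct) / 100;
	parent->slice_us = parent->slice_us * parent_pct / 100;
}

int main(void)
{
	struct toy_task shell = { "bash",  10000 };
	struct toy_task cmd   = { "uname", 0 };

	toy_fork_split(&shell, &cmd, 75);	/* the 75/25 split suggested above */
	printf("%s keeps %u us, %s starts with %u us\n",
	       shell.comm, shell.slice_us, cmd.comm, cmd.slice_us);
	return 0;
}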

Willy


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 16:18     ` Willy Tarreau
@ 2007-04-21 16:34       ` Linus Torvalds
  2007-04-21 16:42         ` William Lee Irwin III
                           ` (2 more replies)
  2007-04-21 17:03       ` Geert Bosch
  1 sibling, 3 replies; 149+ messages in thread
From: Linus Torvalds @ 2007-04-21 16:34 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, Con Kolivas, linux-kernel, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett



On Sat, 21 Apr 2007, Willy Tarreau wrote:
> 
> If you remember, with 50/50, I noticed some difficulties to fork many
> processes. I think that during a fork(), the parent has a higher probability
> of forking other processes than the child. So at least, we should use
> something like 67/33 or 75/25 for parent/child.

It would be even better to simply have the rule:
 - child gets almost no points at startup
 - but when a parent does a "waitpid()" call and blocks, it will spread 
   out its points to the children (the "vfork()" blocking is another case 
   that is really the same).

This is a very special kind of "priority inversion" logic: you give higher 
priority to the things you wait for. Not because of holding any locks, but 
simply because a blocking waitpid really is a damn big hint that "ok, the 
child now works for the parent".
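
A toy model of that rule (this is not kernel code; every name below is
invented for the sketch): the child starts with almost nothing, and the
blocking parent hands its points over to the child it waits for.

/* Toy userspace model of the waitpid() hand-off described above. */
#include <stdio.h>

struct toy_task {
	const char *comm;
	unsigned int points;		/* scheduling credit, arbitrary units */
};

/* fork: the child starts with (almost) nothing. */
static void toy_fork(struct toy_task *child)
{
	child->points = 1;
}

/* parent blocks in waitpid(child): give the child what the parent had. */
static void toy_waitpid_block(struct toy_task *parent, struct toy_task *child)
{
	child->points += parent->points;
	parent->points = 0;		/* the sleeping parent does not need them */
}

int main(void)
{
	struct toy_task shell = { "bash", 1000 };
	struct toy_task ls    = { "ls",   0 };

	toy_fork(&ls);
	toy_waitpid_block(&shell, &ls);
	printf("%s runs with %u points while %s sleeps with %u\n",
	       ls.comm, ls.points, shell.comm, shell.points);
	return 0;
}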

		Linus

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 16:00     ` Ingo Molnar
  2007-04-21 16:12       ` Willy Tarreau
@ 2007-04-21 16:39       ` William Lee Irwin III
  2007-04-21 17:15       ` Jan Engelhardt
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 149+ messages in thread
From: William Lee Irwin III @ 2007-04-21 16:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Con Kolivas, Willy Tarreau, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

On Sat, Apr 21, 2007 at 06:00:08PM +0200, Ingo Molnar wrote:
>  arch/i386/kernel/ioport.c   |   13 ++++++++++---
>  arch/x86_64/kernel/ioport.c |    8 ++++++--
>  drivers/block/loop.c        |    5 ++++-
>  include/linux/sched.h       |    7 +++++++
>  kernel/sched.c              |   40 ++++++++++++++++++++++++++++++++++++++++
>  kernel/workqueue.c          |    2 +-
>  mm/oom_kill.c               |    4 +++-
>  7 files changed, 71 insertions(+), 8 deletions(-)

Yum. I'm going to see what this does for glxgears (I presume it's a
screensaver) on my dual G5 driving a 42" wall-mounted TV for a display. ;)

More seriously, there should be more portable ways of doing this. I
suspect even someone using fbdev on i386/x86-64 might be left out here.


-- wli

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 16:34       ` Linus Torvalds
@ 2007-04-21 16:42         ` William Lee Irwin III
  2007-04-21 18:55           ` Kyle Moffett
  2007-04-21 16:53         ` Willy Tarreau
  2007-04-21 16:53         ` Ingo Molnar
  2 siblings, 1 reply; 149+ messages in thread
From: William Lee Irwin III @ 2007-04-21 16:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Willy Tarreau, Ingo Molnar, Con Kolivas, linux-kernel,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

On Sat, 21 Apr 2007, Willy Tarreau wrote:
>> If you remember, with 50/50, I noticed some difficulties to fork many
>> processes. I think that during a fork(), the parent has a higher probability
>> of forking other processes than the child. So at least, we should use
>> something like 67/33 or 75/25 for parent/child.

On Sat, Apr 21, 2007 at 09:34:07AM -0700, Linus Torvalds wrote:
> It would be even better to simply have the rule:
>  - child gets almost no points at startup
>  - but when a parent does a "waitpid()" call and blocks, it will spread 
>    out its points to the childred (the "vfork()" blocking is another case 
>    that is really the same).
> This is a very special kind of "priority inversion" logic: you give higher 
> priority to the things you wait for. Not because of holding any locks, but 
> simply because a blockign waitpid really is a damn big hint that "ok, the 
> child now works for the parent".

An in-kernel scheduler API might help. void yield_to(struct task_struct *)?

A userspace API might be nice, too. e.g. int sched_yield_to(pid_t).
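
Neither call exists today; as a sketch of how the userspace variant might be
used, with the hypothetical sched_yield_to() stubbed out as a plain
sched_yield():

/*
 * Sketch only: sched_yield_to() is the hypothetical API from this mail,
 * not a real syscall.  The stub below degrades to sched_yield(), just to
 * show how a parent might use such an API around fork()/waitpid().
 */
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int sched_yield_to(pid_t target)
{
	(void)target;			/* a real syscall would yield to this task */
	return sched_yield();
}

int main(void)
{
	pid_t child = fork();

	if (child == 0) {
		execlp("true", "true", (char *)NULL);
		_exit(127);
	}
	sched_yield_to(child);		/* hint: run my child, I am about to wait */
	return waitpid(child, NULL, 0) == child ? 0 : 1;
}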


-- wli

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 16:34       ` Linus Torvalds
  2007-04-21 16:42         ` William Lee Irwin III
@ 2007-04-21 16:53         ` Willy Tarreau
  2007-04-21 16:53         ` Ingo Molnar
  2 siblings, 0 replies; 149+ messages in thread
From: Willy Tarreau @ 2007-04-21 16:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Con Kolivas, linux-kernel, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On Sat, Apr 21, 2007 at 09:34:07AM -0700, Linus Torvalds wrote:
> 
> 
> On Sat, 21 Apr 2007, Willy Tarreau wrote:
> > 
> > If you remember, with 50/50, I noticed some difficulties to fork many
> > processes. I think that during a fork(), the parent has a higher probability
> > of forking other processes than the child. So at least, we should use
> > something like 67/33 or 75/25 for parent/child.
> 
> It would be even better to simply have the rule:
>  - child gets almost no points at startup
>  - but when a parent does a "waitpid()" call and blocks, it will spread 
>    out its points to the childred (the "vfork()" blocking is another case 
>    that is really the same).
> 
> This is a very special kind of "priority inversion" logic: you give higher 
> priority to the things you wait for. Not because of holding any locks, but 
> simply because a blockign waitpid really is a damn big hint that "ok, the 
> child now works for the parent".

I like this idea a lot. I don't know if it can be applied to pipes and unix
sockets, but it's clearly a way of saying "hurry up, I'm waiting for you",
which seems natural for inter-process communication. Also, if we can do
this on unix sockets, it would help a lot with X!

Willy


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 16:34       ` Linus Torvalds
  2007-04-21 16:42         ` William Lee Irwin III
  2007-04-21 16:53         ` Willy Tarreau
@ 2007-04-21 16:53         ` Ingo Molnar
  2007-04-21 16:57           ` Willy Tarreau
  2007-04-21 18:09           ` Ulrich Drepper
  2 siblings, 2 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-21 16:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Willy Tarreau, Con Kolivas, linux-kernel, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> It would be even better to simply have the rule:
>  - child gets almost no points at startup
>  - but when a parent does a "waitpid()" call and blocks, it will spread 
>    out its points to the childred (the "vfork()" blocking is another case 
>    that is really the same).
> 
> This is a very special kind of "priority inversion" logic: you give 
> higher priority to the things you wait for. Not because of holding any 
> locks, but simply because a blockign waitpid really is a damn big hint 
> that "ok, the child now works for the parent".

yeah. One problem i can see with the implementation of this though is 
that shells typically do nonspecific waits - for example bash does this 
on a simple 'ls' command:

  21310 clone(child_stack=0,  ...) = 21399
  ...
  21399 execve("/bin/ls", 
  ...
  21310 waitpid(-1, <unfinished ...>

the PID is -1 so we don't actually know which task we are waiting for. We 
could use the first entry from the p->children list, but that looks too 
specific of a hack to me. It should catch most of the 
synchronous-helper-task cases though.
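
In miniature, the difference between the two forms of wait (plain POSIX code,
nothing scheduler-specific):

/*
 * waitpid(-1, ...) reaps whichever child exits next and tells the kernel
 * nothing about which child the parent actually depends on, while
 * waitpid(child, ...) names the task being waited for.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int status;
	pid_t child = fork();

	if (child == 0) {
		execlp("ls", "ls", (char *)NULL);
		_exit(127);
	}

	/* Nonspecific wait, as bash does above: */
	/*	waitpid(-1, &status, 0); */

	/* Qualified wait: the kernel knows exactly whom we block on. */
	waitpid(child, &status, 0);
	printf("child %d exited with status %d\n", (int)child,
	       WEXITSTATUS(status));
	return 0;
}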

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 16:53         ` Ingo Molnar
@ 2007-04-21 16:57           ` Willy Tarreau
  2007-04-21 18:09           ` Ulrich Drepper
  1 sibling, 0 replies; 149+ messages in thread
From: Willy Tarreau @ 2007-04-21 16:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Con Kolivas, linux-kernel, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On Sat, Apr 21, 2007 at 06:53:47PM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > It would be even better to simply have the rule:
> >  - child gets almost no points at startup
> >  - but when a parent does a "waitpid()" call and blocks, it will spread 
> >    out its points to the childred (the "vfork()" blocking is another case 
> >    that is really the same).
> > 
> > This is a very special kind of "priority inversion" logic: you give 
> > higher priority to the things you wait for. Not because of holding any 
> > locks, but simply because a blockign waitpid really is a damn big hint 
> > that "ok, the child now works for the parent".
> 
> yeah. One problem i can see with the implementation of this though is 
> that shells typically do nonspecific waits - for example bash does this 
> on a simple 'ls' command:
> 
>   21310 clone(child_stack=0,  ...) = 21399
>   ...
>   21399 execve("/bin/ls", 
>   ...
>   21310 waitpid(-1, <unfinished ...>
> 
> the PID is -1 so we dont actually know which task we are waiting for. We 
> could use the first entry from the p->children list, but that looks too 
> specific of a hack to me. It should catch most of the 
> synchronous-helper-task cases though.

The last one should be more appropriate IMHO. If you waitpid(), it's very
likely that you're waiting for the result of the very last fork().

Willy


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 16:18     ` Willy Tarreau
  2007-04-21 16:34       ` Linus Torvalds
@ 2007-04-21 17:03       ` Geert Bosch
  1 sibling, 0 replies; 149+ messages in thread
From: Geert Bosch @ 2007-04-21 17:03 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, Con Kolivas, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett


On Apr 21, 2007, at 12:18, Willy Tarreau wrote:
> Also, I believe that (in shells) most forked processes do not even consume
> a full timeslice (e.g. $(uname -n) is very fast). This means that assigning
> them a shorter one will not hurt them, while preserving the shell's
> performance against CPU hogs.

On a fast machine, during regression testing of GCC, I've noticed we create
an average of 500 processes per second during an hour or so. There are other
workloads like this. So, most processes start, execute and complete in 2ms.
How does fairness work in a situation like this?
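
A tiny approximation of that kind of workload, for anyone who wants to poke
at it (this is not the GCC testsuite, just a fork/exec/wait loop):

/*
 * Fork a stream of short-lived children running /bin/true and reap each
 * one, the way a build or testsuite driver does.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NCHILDREN	500	/* roughly one second's worth at the rate above */

int main(void)
{
	int i;

	for (i = 0; i < NCHILDREN; i++) {
		pid_t pid = fork();

		if (pid == 0) {
			execl("/bin/true", "true", (char *)NULL);
			_exit(127);
		}
		waitpid(pid, NULL, 0);	/* block until this child is gone */
	}
	printf("forked and reaped %d short-lived children\n", NCHILDREN);
	return 0;
}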

   -Geert

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 16:00     ` Ingo Molnar
  2007-04-21 16:12       ` Willy Tarreau
  2007-04-21 16:39       ` William Lee Irwin III
@ 2007-04-21 17:15       ` Jan Engelhardt
  2007-04-21 19:00         ` Ingo Molnar
  2007-04-21 22:54       ` Denis Vlasenko
  2007-04-21 23:59       ` Con Kolivas
  4 siblings, 1 reply; 149+ messages in thread
From: Jan Engelhardt @ 2007-04-21 17:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Con Kolivas, Willy Tarreau, William Lee Irwin III, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett


On Apr 21 2007 18:00, Ingo Molnar wrote:
>* Con Kolivas <kernel@kolivas.org> wrote:
>
>> >   Feels even better, mouse movements are very smooth even under high 
>> >   load. I noticed that X gets reniced to -19 with this scheduler. 
>> >   I've not looked at the code yet but this looked suspicious to me. 
>> >   I've reniced it to 0 and it did not change any behaviour. Still 
>> >   very good.
>> 
>> Looks like this code does it:
>> 
>> +int sysctl_sched_privileged_nice_level __read_mostly = -19;
>
>correct. Note that Willy reniced X back to 0 so it had no relevance on 
>his test. Also note that i pointed this change out in the -v4 CFS 
>announcement:
>
>|| Changes since -v3:
>||
>||  - usability fix: automatic renicing of kernel threads such as 
>||    keventd, OOM tasks and tasks doing privileged hardware access
>||    (such as Xorg).
>
>i've attached it below in a standalone form, feel free to put it into 
>SD! :)

Assume X goes crazy (lacking any statistics, I make the unproven
statement that this happens more often than kthreads going berserk);
then having it reniced to minus something is not too nice.



Jan
-- 

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 16:53         ` Ingo Molnar
  2007-04-21 16:57           ` Willy Tarreau
@ 2007-04-21 18:09           ` Ulrich Drepper
  1 sibling, 0 replies; 149+ messages in thread
From: Ulrich Drepper @ 2007-04-21 18:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Willy Tarreau, Con Kolivas, linux-kernel,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

On 4/21/07, Ingo Molnar <mingo@elte.hu> wrote:
> on a simple 'ls' command:
>
>   21310 clone(child_stack=0,  ...) = 21399
>   ...
>   21399 execve("/bin/ls",
>   ...
>   21310 waitpid(-1, <unfinished ...>
>
> the PID is -1 so we dont actually know which task we are waiting for.

That's a special case.  Most programs don't do this.  In fact, in
multi-threaded code you'd better never do it, since such an unqualified
wait might catch the child another thread waits for (particularly bad
if one thread uses system()).

And even in the case of bash, we can probably change the code to use a
qualified wait when there are no other children.  This is known at
any time, and I expect that most of the time there are no background
processes.  At least in shell scripts.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 12:12 ` [REPORT] cfs-v4 vs sd-0.44 Willy Tarreau
                     ` (2 preceding siblings ...)
  2007-04-21 15:55   ` Con Kolivas
@ 2007-04-21 18:17   ` Gene Heskett
  2007-04-22  1:26     ` Con Kolivas
  2007-04-22  8:07     ` William Lee Irwin III
  2007-04-22  1:51   ` Con Kolivas
  4 siblings, 2 replies; 149+ messages in thread
From: Gene Heskett @ 2007-04-21 18:17 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, Con Kolivas, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar

On Saturday 21 April 2007, Willy Tarreau wrote:
>Hi Ingo, Hi Con,
>
>I promised to perform some tests on your code. I'm short in time right now,
>but I observed behaviours that should be commented on.
>
>1) machine : dual athlon 1533 MHz, 1G RAM, kernel 2.6.21-rc7 + either
> scheduler Test:  ./ocbench -R 250000 -S 750000 -x 8 -y 8
>   ocbench: http://linux.1wt.eu/sched/
>
>2) SD-0.44
>
>   Feels good, but becomes jerky at moderately high loads. I've started
>   64 ocbench with a 250 ms busy loop and 750 ms sleep time. The system
>   always responds correctly but under X, mouse jumps quite a bit and
>   typing in xterm or even text console feels slightly jerky. The CPU is
>   not completely used, and the load varies a lot (see below). However,
>   the load is shared equally between all 64 ocbench, and they do not
>   deviate even after 4000 iterations. X uses less than 1% CPU during
>   those tests.
>
>   Here's the vmstat output :
>
[snip]
>
>3) CFS-v4
>
>  Feels even better, mouse movements are very smooth even under high load.
>  I noticed that X gets reniced to -19 with this scheduler. I've not looked
>  at the code yet but this looked suspicious to me. I've reniced it to 0 and
>  it did not change any behaviour. Still very good. The 64 ocbench share
>  equal CPU time and show exact same progress after 2000 iterations. The CPU
>  load is more smoothly spread according to vmstat, and there's no idle (see
>  below). BUT I now think it was wrong to let new processes start with no
>  timeslice at all, because it can take tens of seconds to start a new
> process when only 64 ocbench are there. Simply starting "killall ocbench"
> takes about 10 seconds. On a smaller machine (VIA C3-533), it took me more
> than one minute to do "su -", even from console, so that's not X. BTW, X
> uses less than 1% CPU during those tests.
>
>willy@pcw:~$ vmstat 1
>   procs                      memory      swap          io     system     
> cpu r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us
> sy id 12  0  2      0 922120   6532  57540    0    0   299    29   31   386
> 17 27 57 12  0  2      0 922096   6532  57556    0    0     0     0    1  
> 776 37 63  0 14  0  2      0 922096   6532  57556    0    0     0     0   
> 1   782 35 65  0 13  0  1      0 922096   6532  57556    0    0     0     0
>    0   782 38 62  0 14  0  1      0 922096   6532  57556    0    0     0   
>  0    1   782 36 64  0 13  0  1      0 922096   6532  57556    0    0     0
>     0    2   785 38 62  0 13  0  1      0 922096   6532  57556    0    0   
>  0     0    1   774 35 65  0 14  0  1      0 922096   6532  57556    0    0
>     0     0    0   784 36 64  0 13  0  1      0 922096   6532  57556    0  
>  0     0     0    1   767 37 63  0 13  0  1      0 922096   6532  57556   
> 0    0     0     0    1   785 41 59  0 14  0  1      0 922096   6532  57556
>    0    0     0     0    0   779 38 62  0 19  0  1      0 922096   6532 
> 57556    0    0     0     0    1   816 38 62  0 22  0  1      0 922096  
> 6532  57556    0    0     0     0    0   817 35 65  0 19  0  1      0
> 922096   6532  57556    0    0     0     0    1   817 39 61  0 21  0  1    
>  0 922096   6532  57556    0    0     0     0    0   849 36 64  0 20  0  0 
>     0 922096   6532  57556    0    0     0     0    1   793 36 64  0 21  0 
> 0      0 922096   6532  57556    0    0     0     0    0   815 37 63  0 19 
> 0  0      0 922096   6532  57556    0    0     0     0    1   824 35 65  0
> 21  0  0      0 922096   6532  57556    0    0     0     0    0   817 35 65
>  0 26  0  0      0 922096   6532  57556    0    0     0     0    1   824 38
> 62  0 26  0  0      0 922096   6532  57556    0    0     0     0    1   817
> 35 65  0 26  0  0      0 922096   6532  57556    0    0     0     0    0  
> 811 37 63  0 26  0  0      0 922096   6532  57556    0    0     0     0   
> 1   804 34 66  0 16  0  0      0 922096   6532  57556    0    0     0     0
>   39   850 35 65  0 18  0  0      0 922096   6532  57556    0    0     0   
>  0    1   801 39 61  0
>
>
>4) first impressions
>
>I think that CFS is based on a more promising concept but is less mature
>and is dangerous right now with certain workloads. SD shows some strange
>behaviours, like not using all of the available CPU and a little jerkiness,
>but it is more robust and may be the least risky solution for a first step
>towards a better scheduler in mainline. It may also well be the last O(1)
>scheduler, to be replaced sometime later once CFS (or any other scheduler)
>combines the smoothness of CFS with the robustness of SD.
>
>I'm sorry I cannot spend more time on them right now; I hope that other
>people will.
>
>Regards,
>Willy

More first impressions of sd-0.44 vs CFS-v4

CFS-v4 is quite smooth in terms of the user's experience, but after prolonged 
observation approaching 24 hours, it appears to choke the cpu hog off a bit 
even when the system has nothing else to do.  My amanda runs went from 1 to 
1.5 hours (depending on how much time it took gzip to handle the amount of 
data tar handed it) up to about 165 minutes and change, or nearly 3 hours, 
pretty consistently over 5 runs.

sd-0.44 so far seems to be handling the same load (there's a backup running 
right now) fairly well also, and possibly there's a bit more snap to the 
system now.  A switch to screen 1 from this screen 8, and the loading of that 
screen image (the Cassini shot of Saturn from the backside, the one showing 
that teeny dot to the left of Saturn that is actually us), took 10 seconds 
with the stock 2.6.21-rc7, 3 seconds with the best of Ingo's patches, and now, 
with Con's latest, is 1 second flat.  Another screen however takes 4 seconds, 
so maybe that first screen had been looked at since I rebooted.  However, 
amanda is still getting estimates, so gzip hasn't put a tiewrap around the 
kernel's neck just yet.

Some minutes later, gzip is smunching /usr/src, and the machine doesn't even 
know it's running, as sd-0.44 isn't giving gzip more than 75% of the cpu and 
is probably averaging less than 50%.  And it scared me a bit, as it started 
out at not over 5% for the first minute or so.  It's running in the 70s now 
according to gkrellm, with an occasional blip to 95%.  And the machine 
generally feels good.

I had previously given CFS-v4 a score of 95, but that was before I saw the 
general slowdown, and I believe my first impression of this one is also a 95.  
This is on a scale where the best of the earlier CFS patches is 100 and stock 
2.6.21-rc7 gets a 0.0.  This scheduler seems to be giving gzip ever more cpu 
as time progresses, and the cpu is warming up quite nicely, from about 132F 
idling to 149.9F now.  And my keyboard is still alive and well.

Generally speaking, Con, I believe this one is also a keeper.  And we'll see 
how long a backup run takes.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
All God's children are not beautiful.  Most of God's children are, in fact,
barely presentable.
		-- Fran Lebowitz, "Metropolitan Life"

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 16:42         ` William Lee Irwin III
@ 2007-04-21 18:55           ` Kyle Moffett
  2007-04-21 19:49             ` Ulrich Drepper
  0 siblings, 1 reply; 149+ messages in thread
From: Kyle Moffett @ 2007-04-21 18:55 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linus Torvalds, Willy Tarreau, Ingo Molnar, Con Kolivas,
	linux-kernel, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett

On Apr 21, 2007, at 12:42:41, William Lee Irwin III wrote:
> On Sat, 21 Apr 2007, Willy Tarreau wrote:
>>> If you remember, with 50/50, I noticed some difficulty forking
>>> many processes. I think that during a fork(), the parent has a
>>> higher probability of forking other processes than the child. So
>>> at least, we should use something like 67/33 or 75/25 for
>>> parent/child.
>
> On Sat, Apr 21, 2007 at 09:34:07AM -0700, Linus Torvalds wrote:
>> It would be even better to simply have the rule:
>>  - child gets almost no points at startup
>>  - but when a parent does a "waitpid()" call and blocks, it will  
>> spread out its points to the children (the "vfork()" blocking is
>> another case that is really the same).
>> This is a very special kind of "priority inversion" logic: you  
>> give higher priority to the things you wait for. Not because of  
>> holding any locks, but simply because a blocking waitpid really is
>> a damn big hint that "ok, the child now works for the parent".
>
> An in-kernel scheduler API might help. void yield_to(struct  
> task_struct *)?
>
> A userspace API might be nice, too. e.g. int sched_yield_to(pid_t).

It might be nice if it were possible to actively contribute your CPU
time to a child process.  For example:
int sched_donate(pid_t pid, struct timeval *time, int percentage);

Maybe also a way to pass CPU time over a UNIX socket (analogous to
SCM_RIGHTS), along with information on what process/user passed it.
That would make it possible to really fix X properly on a local
system.  You could make the X client library pass CPU time to the X
server whenever it requests a CPU-intensive rendering operation.
Ordinarily X would nice all of its client service threads to +10, but
when a client passes CPU time to its thread over the socket, its
service thread would temporarily get the scheduling properties of the
client.  I'm not a scheduler guru, but that's what makes the most
sense from an application-programmer point of view.

Cheers,
Kyle Moffett
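A sketch of how the interface proposed above might be used.  sched_donate() does
not exist in any kernel, so it is stubbed here purely to show the intended calling
convention, and the X-server pid lookup is left out.

/* Purely illustrative: sched_donate() is only a proposal in this thread.
 * The stub shows the intended calling convention: donate up to
 * 'percentage' of our CPU share to 'pid' for the given duration. */
#include <sys/time.h>
#include <sys/types.h>
#include <errno.h>

static int sched_donate(pid_t pid, struct timeval *time, int percentage)
{
	/* would be a real syscall in an actual implementation */
	(void)pid; (void)time; (void)percentage;
	errno = ENOSYS;
	return -1;
}

/* An X client library could wrap an expensive rendering request like
 * this, so the X server does the work with the client's own CPU share: */
static void donate_to_x_server(pid_t x_server_pid)
{
	struct timeval tv = { .tv_sec = 0, .tv_usec = 50000 };	/* 50 ms */

	sched_donate(x_server_pid, &tv, 100);
}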


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 17:15       ` Jan Engelhardt
@ 2007-04-21 19:00         ` Ingo Molnar
  2007-04-22 13:18           ` Mark Lord
  0 siblings, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-21 19:00 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Con Kolivas, Willy Tarreau, William Lee Irwin III, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett


* Jan Engelhardt <jengelh@linux01.gwdg.de> wrote:

> > i've attached it below in a standalone form, feel free to put it 
> > into SD! :)
> 
> Assume X went crazy (lacking any statistics, I make the unproven 
> statement that this happens more often than kthreads going berserk), 
> then having it niced with minus something is not too nice.

i've not experienced a 'runaway X' personally; at most it would crash or 
lock up ;) The value is boot-time and sysctl configurable as well, back 
down to 0.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 18:55           ` Kyle Moffett
@ 2007-04-21 19:49             ` Ulrich Drepper
  2007-04-21 23:17               ` William Lee Irwin III
  2007-04-21 23:35               ` Linus Torvalds
  0 siblings, 2 replies; 149+ messages in thread
From: Ulrich Drepper @ 2007-04-21 19:49 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: William Lee Irwin III, Linus Torvalds, Willy Tarreau,
	Ingo Molnar, Con Kolivas, linux-kernel, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On 4/21/07, Kyle Moffett <mrmacman_g4@mac.com> wrote:
> It might be nice if it was possible to actively contribute your CPU
> time to a child process.  For example:
> int sched_donate(pid_t pid, struct timeval *time, int percentage);

If you do this, and it has been requested many a time, then please
generalize it.  We have the same issue with futexes.  If a FUTEX_WAIT
call is issued, the remaining time in the slot should be given to the
thread currently owning the futex.  For non-PI futexes this needs an
extension of the interface, but I would be up for that.  It can have
big benefits for the throughput of an application.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, v4
  2007-04-20 14:04 [patch] CFS scheduler, v4 Ingo Molnar
                   ` (2 preceding siblings ...)
  2007-04-21 12:12 ` [REPORT] cfs-v4 vs sd-0.44 Willy Tarreau
@ 2007-04-21 20:35 ` S.Çağlar Onur
  2007-04-22  8:30 ` Michael Gerdau
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 149+ messages in thread
From: S.Çağlar Onur @ 2007-04-21 20:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, Willy Tarreau, Gene Heskett

[-- Attachment #1: Type: text/plain, Size: 13325 bytes --]

Hi Ingo;

On Friday 20 April 2007, Ingo Molnar wrote:
> As usual, any sort of feedback, bugreport, fix and suggestion is more
> than welcome,

I tried hard and found another problem for you :)

With Linus's current git + CFSv4, as soon as I start a guest in VirtualBox [1], 
the system enters the following loop;

- Whole system freezes for ~5 secs.
- System works well for ~5 secs.
- Whole system freezes for ~5 secs.

Again, mainline has no issues, but it's 100% reproducible under CFSv4.

The following "ps aux" and "top" outputs were grabbed while the system was 
working well; I cannot do anything else while the system freezes, except move 
the mouse for fun :P

[caglar@zangetsu][~]> ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0      0     0 ?        Ss   22:26   0:00 init [3]
root         2  0.0  0.0      0     0 ?        S    22:26   0:00 [migration/0]
root         3  0.3  0.0      0     0 ?        SN   22:26   0:11 [ksoftirqd/0]
root         4  0.0  0.0      0     0 ?        S<   22:26   0:00 [events/0]
root         5  0.0  0.0      0     0 ?        S<   22:26   0:00 [khelper]
root         6  0.0  0.0      0     0 ?        S<   22:26   0:00 [kthread]
root        26  0.0  0.0      0     0 ?        S<   22:26   0:00 [kblockd/0]
root        27  0.0  0.0      0     0 ?        S<   22:26   0:00 [kacpid]
root       124  0.0  0.0      0     0 ?        S<   22:26   0:00 [kseriod]
root       137  0.0  0.0      0     0 ?        S<   22:26   0:00 [kapmd]
root       145  0.0  0.0      0     0 ?        S    22:26   0:00 [pdflush]
root       146  0.0  0.0      0     0 ?        S    22:26   0:00 [pdflush]
root       147  0.0  0.0      0     0 ?        S<   22:26   0:00 [kswapd0]
root       148  0.0  0.0      0     0 ?        S<   22:26   0:00 [aio/0]
root       802  0.0  0.0      0     0 ?        S<   22:26   0:00 [kpsmoused]
root       844  0.0  0.0      0     0 ?        S<   22:26   0:00 [ata/0]
root       845  0.0  0.0      0     0 ?        S<   22:26   0:00 [ata_aux]
root       856  0.0  0.0      0     0 ?        S<   22:26   0:01 [scsi_eh_0]
root       857  0.0  0.0      0     0 ?        S<   22:26   0:00 [scsi_eh_1]
root       869  0.0  0.0      0     0 ?        S<   22:26   0:00 
[ksuspend_usbd]
root       872  0.0  0.0      0     0 ?        S<   22:26   0:00 [khubd]
root       919  0.0  0.0      0     0 ?        S<   22:26   0:00 [khpsbpkt]
root       927  0.0  0.0      0     0 ?        S<   22:26   0:00 [knodemgrd_0]
root       982  0.0  0.0      0     0 ?        S<   22:26   0:00 [xfslogd/0]
root       983  0.0  0.0      0     0 ?        S<   22:26   0:00 [xfsdatad/0]
root       985  0.0  0.0      0     0 ?        S<   22:26   0:00 [xfsbufd]
root       986  0.0  0.0      0     0 ?        S<   22:26   0:00 [xfssyncd]
root      1050  0.0  0.0      0     0 ?        S<s  22:26   
0:01 /sbin/udevd --daemon
root      2831  0.0  0.0      0     0 ?        S<   22:26   0:00 [kondemand/0]
root      2993  0.0  0.0      0     0 ?        S<   22:26   0:00 [tifm/0]
root      3039  0.0  0.0      0     0 ?        S<   22:26   0:00 [pccardd]
root      3063  0.0  0.0      0     0 ?        S<   22:26   0:00 [ipw2200/0]
root      3127  0.0  0.0      0     0 ?        Ss   22:26   0:00 Comar
root      3278  0.0  0.0      0     0 ?        S    22:26   0:00 ComarRPC
root      3334  0.0  0.0      0     0 ?        Ss   22:26   
0:00 /usr/sbin/syslogd -m 15
root      3338  0.0  0.0      0     0 ?        Ss   22:26   
0:00 /usr/sbin/klogd -c 3 -2
root      3340  0.0  0.0      0     0 ?        Ss   22:26   
0:00 /sbin/dhcpcd -R -Y -N -t 20 eth1
root      3345  0.0  0.0      0     0 ?        Ss   22:26   
0:00 /usr/sbin/acpid -c /etc/acpi/events
dbus      3351  0.0  0.0      0     0 ?        Ss   22:26   
0:00 /usr/bin/dbus-daemon --system
root      3354  0.0  0.0      0     0 ?        Ss   22:26   
0:00 /usr/sbin/console-kit-daemon
root      3418  0.0  0.0      0     0 ?        Ss   22:26   
0:00 /usr/sbin/polkitd
hal       3420  0.0  0.0      0     0 ?        Ss   22:26   
0:00 /usr/sbin/hald --daemon=yes --use-syslog
root      3421  0.0  0.0      0     0 ?        S    22:26   0:00 hald-runner
hal       3432  0.0  0.0      0     0 ?        S    22:26   0:00 
hald-addon-keyboard: listening on /dev/input/event0
hal       3435  0.0  0.0      0     0 ?        S    22:26   0:00 
hald-addon-keyboard: listening on /dev/input/event1
hal       3436  0.0  0.0      0     0 ?        S    22:26   0:00 
hald-addon-keyboard: listening on /dev/input/event2
hal       3437  0.0  0.0      0     0 ?        S    22:26   0:00 
hald-addon-keyboard: listening on /dev/input/event4
root      3444  0.0  0.0      0     0 tty1     Ss+  22:26   
0:00 /sbin/mingetty --noclear tty1
root      3445  0.0  0.0      0     0 tty2     Ss+  22:26   
0:00 /sbin/mingetty --noclear tty2
root      3446  0.0  0.0      0     0 tty3     Ss+  22:26   
0:00 /sbin/mingetty tty3
root      3447  0.0  0.0      0     0 tty4     Ss+  22:26   
0:00 /sbin/mingetty tty4
root      3448  0.0  0.0      0     0 tty5     Ss+  22:26   
0:00 /sbin/mingetty tty5
root      3449  0.0  0.0      0     0 tty6     Ss+  22:26   
0:00 /sbin/mingetty tty6
root      3450  0.0  0.0      0     0 ?        S    22:26   
0:00 /usr/libexec/hald-addon-cpufreq
hal       3451  0.0  0.0      0     0 ?        S    22:26   0:00 
hald-addon-acpi: listening on acpid socket /var/run/acpid.socket
root      3576  0.0  0.0      0     0 ?        S    22:26   0:00 
hald-addon-storage: polling /dev/sr0 (every 2 sec)
root      3582  0.0  0.0      0     0 ?        S<s  22:26   0:00 /sbin/auditd
root      3584  0.0  0.0      0     0 ?        S<s  22:26   0:00 
python /sbin/audispd
nobody    3588  0.0  0.0      0     0 ?        Ss   22:26   
0:00 /usr/sbin/mdnsd
root      3589  0.0  0.0      0     0 ?        Ss   22:26   
0:01 /opt/sun-jdk/bin/java -DConfigFile=/opt/zemberek-server/config/conf.ini -Djava.library.path=/opt/zemberek-server/li
root      3591  0.0  0.0      0     0 ?        S<   22:26   0:00 [kauditd]
root      3637  0.0  0.0      0     0 ?        Ss   22:26   
0:00 /usr/kde/3.5/bin/kdm
root      3656  0.0  0.0      0     0 ?        Ss   22:26   
0:00 /usr/sbin/cupsd
root      3661  8.9  0.0      0     0 tty7     S<s+ 22:26   
5:06 /usr/bin/X -br -nolisten tcp :0 vt7 -auth /var/run/xauth/A:0-uzBkHQ
root      3662  0.0  0.0      0     0 ?        S    22:26   0:00 -:0
caglar    3713  0.0  0.0      0     0 ?        Ss   22:27   
0:00 /bin/sh /usr/kde/3.5/bin/startkde
caglar    3740  0.0  0.0      0     0 ?        Ss   22:27   
0:00 /usr/bin/gpg-agent --daemon
caglar    3744  0.0  0.0      0     0 ?        S    22:27   
0:00 /usr/bin/dbus-launch --auto-syntax --exit-with-session
caglar    3745  0.0  0.0      0     0 ?        Ss   22:27   
0:00 /usr/bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
root      3763  0.0  0.0      0     0 ?        S    22:27   0:00 
start_kdeinit --new-startup +kcminit_startup
caglar    3764  0.0  0.0      0     0 ?        Ss   22:27   0:00 kdeinit 
Running...
caglar    3767  0.0  0.0      0     0 ?        S    22:27   0:00 dcopserver 
[kdeinit] --nosid
caglar    3769  0.0  0.0      0     0 ?        S    22:27   0:00 klauncher 
[kdeinit] --new-startup
caglar    3771  0.0  0.0      0     0 ?        S    22:27   0:01 kded 
[kdeinit] --new-startup
caglar    3776  0.0  0.0      0     0 ?        S    22:27   0:00 kwrapper 
ksmserver
caglar    3778  0.0  0.0      0     0 ?        S    22:27   0:00 ksmserver 
[kdeinit]
caglar    3779  0.0  0.0      0     0 ?        S    22:27   0:02 kwin 
[kdeinit] -session 10dfd5e1dc000117615244300000037860015_1177150228_731669
caglar    3781  0.0  0.0      0     0 ?        S    22:27   0:01 kdesktop 
[kdeinit]
caglar    3784  0.1  0.0      0     0 ?        S    22:27   0:06 kicker 
[kdeinit]
caglar    3790  0.0  0.0      0     0 ?        S    22:27   
0:00 /usr/kde/3.5/bin/artsd -F 10 -S 4096 -s 3 -m artsmessage -c drkonqi -l 
3 -f
caglar    3792  0.0  0.0      0     0 ?        S    22:27   0:00 kaccess 
[kdeinit]
caglar    3795  0.0  0.0      0     0 ?        S    22:27   0:00 kgpg -session 
10dfd5e1dc000116939040300000034540016_1177150228_693587
caglar    3797  0.0  0.0      0     0 ?        S    22:27   0:00 kmix 
[kdeinit] -session 10dfd5e1dc000117381541300000036690022_1177150228_688422
caglar    3799  0.0  0.0      0     0 ?        S    22:27   0:03 
yakuake -session 10dfd5e1dc000117550619600000038750011_1177150228_688557
caglar    3801  0.1  0.0      0     0 ?        S    22:27   0:06 
akregator -session 10dfd5e1dc000116863705100000035770011_1177150228_688758
caglar    3803  1.6  0.0      0     0 ?        S    22:27   0:57 
amarokapp -session 10dfd5e1dc000117589538100000037590019_1177150228_689118
caglar    3806  0.0  0.0      0     0 ?        S    22:27   0:00 knotify 
[kdeinit]
caglar    3820  2.2  0.0      0     0 ?        S    22:27   1:15 
kmail -session 10dfd5e1dc000117702739100000037820008_1177150228_689302
caglar    3825  0.0  0.0      0     0 ?        S    22:27   0:00 kpowersave 
[kdeinit]
caglar    3828  0.0  0.0      0     0 ?        S    22:27   0:00 klipper 
[kdeinit]
caglar    3840  0.4  0.0      0     0 ?        S    22:27   0:15 
beagled /usr/lib/beagle/BeagleDaemon.exe --bg
caglar    4027  0.0  0.0      0     0 ?        S    22:28   0:00 
ruby /usr/kde/3.5/share/apps/amarok/scripts/score_default/score_default.rb
caglar    4423  1.6  0.0      0     0 ?        SN   22:29   0:53 
beagled-helper /usr/lib/beagle/IndexHelper.exe
caglar    4479  0.0  0.0      0     0 ?        S    22:29   0:00 konqueror 
[kdeinit] -mimetype inode/directory system:/media/camera
caglar    4780  0.9  0.0      0     0 ?        S    22:35   0:27 
kopete -caption Kopete -icon kopete -miniicon kopete
caglar    6513  0.0  0.0      0     0 pts/0    Ss   23:10   0:00 /bin/bash
caglar    6642  0.2  0.0      0     0 ?        S    23:15   
0:01 /usr/share/VirtualBox/VirtualBox
caglar    6655  0.0  0.0      0     0 ?        S    23:15   0:00 kio_file 
[kdeinit] 
file /tmp/ksocket-caglar/klauncher7DO9ja.slave-socket /tmp/ksocket-caglar/kdesktop1c3ZQb.slave-s
caglar    6657  0.0  0.0      0     0 ?        S    23:15   
0:00 /usr/share/VirtualBox/VBoxSVC --daemonize
caglar    6663  0.0  0.0      0     0 ?        S    23:15   
0:00 /usr/share/VirtualBox/VBoxXPCOMIPCD
caglar    6684 71.7  0.0      0     0 ?        S    23:16   
5:40 /usr/share/VirtualBox/VirtualBox -startvm 
8c00f6d9-e215-4501-a7aa-4ba19eb55acc
caglar    6730  0.0  0.0      0     0 pts/0    R+   23:24   0:00 ps aux

[caglar@zangetsu][~]> top
top - 23:24:27 up 58 min,  3 users,  load average: 13.59, 13.00, 7.75
Tasks: 102 total,   1 running, 101 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.4%us,  0.4%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.2%hi,  0.0%si,  0.0%st
Mem:   2067672k total,  1935300k used,   132372k free,      288k buffers
Swap:  2096440k total,        0k used,  2096440k free,   961236k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6684 caglar    20   0  625m 570m  15m S 99.2 28.3   5:56.17 VirtualBox
  844 root       1 -19     0    0    0 S  0.2  0.0   0:00.56 ata/0
 6731 caglar    20   0  2312 1124  856 R  0.2  0.1   0:00.04 top
    1 root      20   0  1608  556  484 S  0.0  0.0   0:00.92 init
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    3 root      39  19     0    0    0 S  0.0  0.0   0:11.85 ksoftirqd/0
    4 root       1 -19     0    0    0 S  0.0  0.0   0:00.07 events/0
    5 root       1 -19     0    0    0 S  0.0  0.0   0:00.02 khelper
    6 root       1 -19     0    0    0 S  0.0  0.0   0:00.00 kthread
   26 root       1 -19     0    0    0 S  0.0  0.0   0:00.04 kblockd/0
   27 root       1 -19     0    0    0 S  0.0  0.0   0:00.00 kacpid
  124 root       1 -19     0    0    0 S  0.0  0.0   0:00.00 kseriod
  137 root       1 -19     0    0    0 S  0.0  0.0   0:00.00 kapmd
  145 root      20   0     0    0    0 S  0.0  0.0   0:00.00 pdflush
  146 root      20   0     0    0    0 S  0.0  0.0   0:00.27 pdflush
  147 root       1 -19     0    0    0 S  0.0  0.0   0:00.00 kswapd0
  148 root       1 -19     0    0    0 S  0.0  0.0   0:00.00 aio/0
  802 root       1 -19     0    0    0 S  0.0  0.0   0:00.00 kpsmoused
  845 root       1 -19     0    0    0 S  0.0  0.0   0:00.00 ata_aux
  856 root       1 -19     0    0    0 D  0.0  0.0   0:01.04 scsi_eh_0
  857 root       1 -19     0    0    0 S  0.0  0.0   0:00.00 scsi_eh_1
  869 root       1 -19     0    0    0 S  0.0  0.0   0:00.00 ksuspend_usbd
  872 root       1 -19     0    0    0 S  0.0  0.0   0:00.00 khubd
  919 root       1 -19     0    0    0 S  0.0  0.0   0:00.00 khpsbpkt
  927 root       1 -19     0    0    0 S  0.0  0.0   0:00.00 knodemgrd_0
  982 root       1 -19     0    0    0 S  0.0  0.0   0:00.30 xfslogd/0
  983 root       1 -19     0    0    0 S  0.0  0.0   0:00.10 xfsdatad/0


[1] http://www.virtualbox.org/
Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, v4
  2007-04-20 21:37 ` Gene Heskett
@ 2007-04-21 20:47   ` S.Çağlar Onur
  2007-04-22  1:22     ` Gene Heskett
  0 siblings, 1 reply; 149+ messages in thread
From: S.Çağlar Onur @ 2007-04-21 20:47 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, Willy Tarreau

[-- Attachment #1: Type: text/plain, Size: 677 bytes --]

On Saturday 21 April 2007, Gene Heskett wrote: 
> This one is another keeper IMO, or as we are fond of saying around here,
> its good enough for the girls I go with.  If this isn't the best one so
> far, its very very close and I'm getting pickier.  kmail is the only thing
> that's lagging, and that's just kmail, which I believe is single threaded. 

Add +1 for the kmail lags (by the way, mine are freezes instead of lags, because I 
cannot use konsole etc. while these happen)

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 16:00     ` Ingo Molnar
                         ` (2 preceding siblings ...)
  2007-04-21 17:15       ` Jan Engelhardt
@ 2007-04-21 22:54       ` Denis Vlasenko
  2007-04-22  0:08         ` Con Kolivas
  2007-04-21 23:59       ` Con Kolivas
  4 siblings, 1 reply; 149+ messages in thread
From: Denis Vlasenko @ 2007-04-21 22:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Con Kolivas, Willy Tarreau, William Lee Irwin III, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett

On Saturday 21 April 2007 18:00, Ingo Molnar wrote:
> correct. Note that Willy reniced X back to 0 so it had no relevance on 
> his test. Also note that i pointed this change out in the -v4 CFS 
> announcement:
> 
> || Changes since -v3:
> ||
> ||  - usability fix: automatic renicing of kernel threads such as 
> ||    keventd, OOM tasks and tasks doing privileged hardware access
> ||    (such as Xorg).
> 
> i've attached it below in a standalone form, feel free to put it into 
> SD! :)

But X problems have nothing to do with "privileged hardware access".
X problems are related to priority inversions between server and client
processes, and to the "one server process - many client processes" case.

I think the synchronous nature of Xlib (clients cannot fire-and-forget
their commands to the X server; with Xlib each command waits for an ACK
from the server) also adds some amount of pain.
--
vda

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 19:49             ` Ulrich Drepper
@ 2007-04-21 23:17               ` William Lee Irwin III
  2007-04-21 23:35               ` Linus Torvalds
  1 sibling, 0 replies; 149+ messages in thread
From: William Lee Irwin III @ 2007-04-21 23:17 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Kyle Moffett, Linus Torvalds, Willy Tarreau, Ingo Molnar,
	Con Kolivas, linux-kernel, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On 4/21/07, Kyle Moffett <mrmacman_g4@mac.com> wrote:
>> It might be nice if it was possible to actively contribute your CPU
>> time to a child process.  For example:
>> int sched_donate(pid_t pid, struct timeval *time, int percentage);

On Sat, Apr 21, 2007 at 12:49:52PM -0700, Ulrich Drepper wrote:
> If you do this, and it has been requested many a time, then please
> generalize it.  We have the same issue with futexes.  If a FUTEX_WAIT
> call is issued, the remaining time in the slot should be given to the
> thread currently owning the futex.  For non-PI futexes this needs an
> extension of the interface, but I would be up for that.  It can have
> big benefits for the throughput of an application.

It's encouraging to hear support for a more full-featured API (or, for
that matter, any response at all) on this front.


-- wli

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 19:49             ` Ulrich Drepper
  2007-04-21 23:17               ` William Lee Irwin III
@ 2007-04-21 23:35               ` Linus Torvalds
  2007-04-22  1:46                 ` Ulrich Drepper
  1 sibling, 1 reply; 149+ messages in thread
From: Linus Torvalds @ 2007-04-21 23:35 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Kyle Moffett, William Lee Irwin III, Willy Tarreau, Ingo Molnar,
	Con Kolivas, linux-kernel, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett



On Sat, 21 Apr 2007, Ulrich Drepper wrote:
> 
> If you do this, and it has been requested many a time, then please
> generalize it.  We have the same issue with futexes.  If a FUTEX_WAIT
> call is issued, the remaining time in the slot should be given to the
> thread currently owning the futex.

And how the hell do you imagine you'd even *know* what thread holds the 
futex?

The whole point of the "f" part of the futex is that it's fast, and we 
never see the non-contended case in the kernel. 

So we know who *blocks*, but we don't know who actually didn't block.

		Linus
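To illustrate the point, here is a minimal futex-based lock in the classic
three-state style (a sketch, not any particular library's implementation): the
uncontended acquire is a single userspace compare-and-swap, so the kernel only
ever hears about the waiters, never about the owner.

/* 0 = free, 1 = locked, 2 = locked with waiters.  The fast path never
 * enters the kernel, which is why the kernel cannot know who owns the
 * lock; only the contended path issues FUTEX_WAIT/FUTEX_WAKE. */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int futex_wait(int *uaddr, int val)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static int futex_wake(int *uaddr, int nr)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAKE, nr, NULL, NULL, 0);
}

static void lock(int *f)
{
	int c = __sync_val_compare_and_swap(f, 0, 1);

	if (c == 0)
		return;			/* fast path: no syscall, kernel sees nothing */
	do {
		if (c == 2 || __sync_val_compare_and_swap(f, 1, 2) != 0)
			futex_wait(f, 2);	/* slow path: kernel only sees the waiter */
	} while ((c = __sync_val_compare_and_swap(f, 0, 2)) != 0);
}

static void unlock(int *f)
{
	if (__sync_fetch_and_sub(f, 1) != 1) {	/* there were waiters */
		*f = 0;
		futex_wake(f, 1);
	}
}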

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 16:00     ` Ingo Molnar
                         ` (3 preceding siblings ...)
  2007-04-21 22:54       ` Denis Vlasenko
@ 2007-04-21 23:59       ` Con Kolivas
  2007-04-22 13:04         ` Juliusz Chroboczek
  2007-04-22 13:23         ` [REPORT] cfs-v4 vs sd-0.44 Mark Lord
  4 siblings, 2 replies; 149+ messages in thread
From: Con Kolivas @ 2007-04-21 23:59 UTC (permalink / raw)
  To: Ingo Molnar, ck list, Bill Davidsen
  Cc: Willy Tarreau, William Lee Irwin III, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett

On Sunday 22 April 2007 02:00, Ingo Molnar wrote:
> * Con Kolivas <kernel@kolivas.org> wrote:
> > >   Feels even better, mouse movements are very smooth even under high
> > >   load. I noticed that X gets reniced to -19 with this scheduler.
> > >   I've not looked at the code yet but this looked suspicious to me.
> > >   I've reniced it to 0 and it did not change any behaviour. Still
> > >   very good.
> >
> > Looks like this code does it:
> >
> > +int sysctl_sched_privileged_nice_level __read_mostly = -19;
>
> correct. 

Oh, I definitely was not advocating against renicing X; I just suspect that 
virtually all the users who gave glowing reports comparing CFS to SD had no 
idea it had reniced X to -19 behind their back, and that they were comparing 
it to SD running X at nice 0.  I think that had they been comparing CFS with 
X at nice -19 to SD with X at nice -10, in this soft and squishy 
interactivity-comparison land, their thoughts might have been different.  I 
missed it in the announcement and had to go looking in the code, since Willy 
just kind of tripped over it unwittingly as well.

> Note that Willy reniced X back to 0 so it had no relevance on 
> his test.

Oh yes, I did notice that, but since the array swap is the longest remaining 
deadline in SD, which would cause noticeable jerks, renicing X on SD by 
default would make the experience very different, since reniced tasks do much 
better across array swaps than non-niced tasks.  I really should go and make 
the whole thing one circular list and blow away the array swap (if I can 
figure out how to do it).

> Also note that i pointed this change out in the -v4 CFS 
>
> announcement:
> || Changes since -v3:
> ||
> ||  - usability fix: automatic renicing of kernel threads such as
> ||    keventd, OOM tasks and tasks doing privileged hardware access
> ||    (such as Xorg).

Reading the changelog in the gloss-over fashion that I unfortunately did, even 
I missed it. 

> i've attached it below in a standalone form, feel free to put it into
> SD! :)

Hmm, well, I have tried my best to make all the changes without changing 
"policy" as much as possible, since that trips over so many emotive issues 
that no one can agree on, and I don't have a strong opinion on this, as I 
thought it would be better for it to be a config option for X in userspace 
instead.  Either way it needs to be turned on/off by the admin, and doing it 
by default in the kernel is... not universally accepted as good.  What else 
accesses ioports and could get privileged nice levels?  Does this make it 
relatively exploitable just by poking an ioport?

> 	Ingo
>
> ---
>  arch/i386/kernel/ioport.c   |   13 ++++++++++---
>  arch/x86_64/kernel/ioport.c |    8 ++++++--
>  drivers/block/loop.c        |    5 ++++-
>  include/linux/sched.h       |    7 +++++++
>  kernel/sched.c              |   40

Thanks for the patch.  I'll consider it.  Since end users are testing this in 
fuzzy interactivity land, I may simply be forced to do this just so that 
comparisons between CFS and SD are meaningful; otherwise they're not really 
being compared on a level playing field.  I had almost given up SD for dead 
meat with all the momentum CFS had gained... until recently.

-- 
-ck
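On the ioport question above: the diffstat quoted in this message touches
arch/i386/kernel/ioport.c and arch/x86_64/kernel/ioport.c, which suggests the
privileged nice level is granted when a task asks for raw I/O-port access the
way Xorg does.  ioperm()/iopl() below are standard Linux calls; whether the CFS
patch actually renices every caller of them is exactly the open question.

/* A root-only program requesting raw I/O-port access (x86, needs
 * CAP_SYS_RAWIO), the same request Xorg makes.  Under the -v4 patch such
 * a task would presumably receive the privileged nice level. */
#include <sys/io.h>
#include <stdio.h>

int main(void)
{
	/* Ask for the parallel-port I/O range 0x378..0x37a. */
	if (ioperm(0x378, 3, 1) < 0) {
		perror("ioperm");
		return 1;
	}
	printf("raw port access granted\n");
	return 0;
}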

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 22:54       ` Denis Vlasenko
@ 2007-04-22  0:08         ` Con Kolivas
  2007-04-22  4:58           ` Mike Galbraith
  0 siblings, 1 reply; 149+ messages in thread
From: Con Kolivas @ 2007-04-22  0:08 UTC (permalink / raw)
  To: Denis Vlasenko
  Cc: Ingo Molnar, Willy Tarreau, William Lee Irwin III, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett

On Sunday 22 April 2007 08:54, Denis Vlasenko wrote:
> On Saturday 21 April 2007 18:00, Ingo Molnar wrote:
> > correct. Note that Willy reniced X back to 0 so it had no relevance on
> > his test. Also note that i pointed this change out in the -v4 CFS
> >
> > announcement:
> > || Changes since -v3:
> > ||
> > ||  - usability fix: automatic renicing of kernel threads such as
> > ||    keventd, OOM tasks and tasks doing privileged hardware access
> > ||    (such as Xorg).
> >
> > i've attached it below in a standalone form, feel free to put it into
> > SD! :)
>
> But X problems have nothing to do with "privileged hardware access".
> X problems are related to priority inversions between server and client
> processes, and "one server process - many client processes" case.

It's not for privileged-hardware-access reasons that this code is there.  It 
is obfuscation/advertising, to make it look like there is a valid reason for X 
somehow getting negative nice levels in the kernel, so that interactive 
testing of CFS looks better by default.

-- 
-ck

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, v4
  2007-04-21 20:47   ` S.Çağlar Onur
@ 2007-04-22  1:22     ` Gene Heskett
  0 siblings, 0 replies; 149+ messages in thread
From: Gene Heskett @ 2007-04-22  1:22 UTC (permalink / raw)
  To: caglar
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, Willy Tarreau

On Saturday 21 April 2007, S.Çağlar Onur wrote:
>On Saturday 21 April 2007, Gene Heskett wrote:
>> This one is another keeper IMO, or as we are fond of saying around here,
>> its good enough for the girls I go with.  If this isn't the best one so
>> far, its very very close and I'm getting pickier.  kmail is the only thing
>> that's lagging, and that's just kmail, which I believe is single threaded.
>
>Add +1 for the kmail lags (by the way, mine are freezes instead of lags, because I
>cannot use konsole etc. while these happen)
>
>Cheers

Yes, you are correct; the composer in particular, or the response to the + key 
for the next message, will freeze for the second or maybe two that kmail is 
sorting and storing incoming mail.  This is a major problem for users of 
dialup who fetch mail on an automatic basis, because kmail is frozen for much 
of the time the much slower modem communications take to complete.  Compare 
that to a dsl circuit, where one can have fetchmail doing the sucking and 
handing it off to procmail for treatment by spamassassin and its ilk before 
finally storing the incoming mail in /var/spool/mail/gene.  kmail sees none of 
that background activity at all; they actually run asynchronously here.

kmail then picks that up and sorts it into the correct kmail folder, and this 
does cause the lag/freeze while it's doing that.

This latter lag/freeze is all I see, but for those who are using kmail to 
directly access their ISP's mailserver(s), it isn't a 1 second freeze but a 
10-30 second freeze, and that is truly a cast iron bitch version of a PITA.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
VICARIOUSLY experience some reason to LIVE!!

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 18:17   ` Gene Heskett
@ 2007-04-22  1:26     ` Con Kolivas
  2007-04-22  2:07       ` Gene Heskett
  2007-04-22  8:07     ` William Lee Irwin III
  1 sibling, 1 reply; 149+ messages in thread
From: Con Kolivas @ 2007-04-22  1:26 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Willy Tarreau, Ingo Molnar, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar

On Sunday 22 April 2007 04:17, Gene Heskett wrote:
> More first impressions of sd-0.44 vs CFS-v4

Thanks Gene.
>
> CFS-v4 is quite smooth in terms of the users experience but after prolonged
> observations approaching 24 hours, it appears to choke the cpu hog off a
> bit even when the system has nothing else to do.  My amanda runs went from
> 1 to 1.5 hours depending on how much time it took gzip to handle the amount
> of data tar handed it, up to about 165m & change, or nearly 3 hours pretty
> consistently over 5 runs.
>
> sd-0.44 so far seems to be handling the same load (theres a backup running
> right now) fairly well also, and possibly theres a bit more snap to the
> system now.  A switch to screen 1 from this screen 8, and the loading of
> that screen image, which is the Cassini shot of saturn from the backside,
> the one showing that teeny dot to the left of Saturn that is actually us,
> took 10 seconds with the stock 2.6.21-rc7, 3 seconds with the best of
> Ingo's patches, and now with Con's latest, is 1 second flat. Another screen
> however is 4 seconds, so maybe that first scren had been looked at since I
> rebooted. However, amanda is still getting estimates so gzip hasn't put a
> tiewrap around the kernels neck just yet.
>
> Some minutes later, gzip is smunching /usr/src, and the machine doesn't
> even know its running as sd-0.44 isn't giving gzip more than 75% to gzip,
> and probably averaging less than 50%. And it scared me a bit as it started
> out at not over 5% for the first minute or so.  Running in the 70's now
> according to gkrellm, with an occasional blip to 95%.  And the machine
> generally feels good.
>
> I had previously given CFS-v4 a 95 score but that was before I saw the
> general slowdown, and I believe my first impression of this one is also a
> 95.  This on a scale of the best one of the earlier CFS patches being 100,
> and stock 2.6.21-rc7 gets a 0.0.  This scheduler seems to be giving gzip
> ever more cpu as time progresses, and the cpu is warming up quite nicely,
> from about 132F idling to 149.9F now.  And my keyboard is still alive and
> well.

I'm not sure how much weight to put on what you see as the measured cpu usage; 
I have a feeling it's being wrongly reported in SD currently.  Concentrate 
more on the actual progress and behaviour of things, as you've already done.

> Generally speaking, Con, I believe this one is also a keeper.  And we'll
> see how long a backup run takes.

Great thanks for feedback.

-- 
-ck

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 23:35               ` Linus Torvalds
@ 2007-04-22  1:46                 ` Ulrich Drepper
  2007-04-22  7:02                   ` William Lee Irwin III
  0 siblings, 1 reply; 149+ messages in thread
From: Ulrich Drepper @ 2007-04-22  1:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kyle Moffett, William Lee Irwin III, Willy Tarreau, Ingo Molnar,
	Con Kolivas, linux-kernel, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On 4/21/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> And how the hell do you imagine you'd even *know* what thread holds the
> futex?

We know this in most cases.  This is information recorded, for
instance, in the mutex data structure.  You might have missed my "the
interface must be extended" part.  This means the PID of the owning
thread will have to be passed down.  For PI mutexes this is not
necessary, since the kernel already has access to the information.


> The whole point of the "f" part of the mutex is that it's fast, and we
> never see the non-contended case in the kernel.

See above.  Believe me, I know how futexes work.  But I also know what
additional information we collect.  For mutexes and in part for
rwlocks we know which thread owns the sync object.  In that case we
can easily provide the kernel with the information.
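A minimal sketch of the kind of bookkeeping being described (this is not glibc's
actual mutex layout; the struct and helper names are illustrative): the lock word
is a futex, and a second field records the owner's TID at acquisition time, so a
contending thread knows exactly which task it is waiting on.

/* Illustrative only: a futex-based lock that records its owner's TID.
 * The sleeping/waking slow path is omitted for brevity. */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct owned_mutex {
	int   futex;	/* 0 = free, nonzero = locked */
	pid_t owner;	/* TID recorded by the current lock holder */
};

static pid_t my_gettid(void)
{
	return (pid_t)syscall(SYS_gettid);
}

static int try_lock(struct owned_mutex *m)
{
	if (__sync_val_compare_and_swap(&m->futex, 0, 1) == 0) {
		m->owner = my_gettid();	/* contenders can now see whom they wait on */
		return 1;
	}
	return 0;	/* contended: m->owner names the thread to boost/yield to */
}

static void unlock_it(struct owned_mutex *m)
{
	m->owner = 0;
	__sync_lock_release(&m->futex);	/* store 0 with release semantics */
}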

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 12:12 ` [REPORT] cfs-v4 vs sd-0.44 Willy Tarreau
                     ` (3 preceding siblings ...)
  2007-04-21 18:17   ` Gene Heskett
@ 2007-04-22  1:51   ` Con Kolivas
  4 siblings, 0 replies; 149+ messages in thread
From: Con Kolivas @ 2007-04-22  1:51 UTC (permalink / raw)
  To: Willy Tarreau, ck list
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On Saturday 21 April 2007 22:12, Willy Tarreau wrote:
> 2) SD-0.44
>
>    Feels good, but becomes jerky at moderately high loads. I've started
>    64 ocbench with a 250 ms busy loop and 750 ms sleep time. The system
>    always responds correctly but under X, mouse jumps quite a bit and
>    typing in xterm or even text console feels slightly jerky. The CPU is
>    not completely used, and the load varies a lot (see below). However,
>    the load is shared equally between all 64 ocbench, and they do not
>    deviate even after 4000 iterations. X uses less than 1% CPU during
>    those tests.

Found it.  I broke SMP balancing again, so there is serious scope for 
improvement on SMP hardware.  That explains the huge load variations.  Expect 
yet another fix soon, which should improve behaviour further :)

-- 
-ck

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-22  1:26     ` Con Kolivas
@ 2007-04-22  2:07       ` Gene Heskett
  0 siblings, 0 replies; 149+ messages in thread
From: Gene Heskett @ 2007-04-22  2:07 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Willy Tarreau, Ingo Molnar, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar

On Saturday 21 April 2007, Con Kolivas wrote:
>On Sunday 22 April 2007 04:17, Gene Heskett wrote:
>> More first impressions of sd-0.44 vs CFS-v4
>
>Thanks Gene.
>
>> CFS-v4 is quite smooth in terms of the users experience but after
>> prolonged observations approaching 24 hours, it appears to choke the cpu
>> hog off a bit even when the system has nothing else to do.  My amanda runs
>> went from 1 to 1.5 hours depending on how much time it took gzip to handle
>> the amount of data tar handed it, up to about 165m & change, or nearly 3
>> hours pretty consistently over 5 runs.
>>
>> sd-0.44 so far seems to be handling the same load (theres a backup running
>> right now) fairly well also, and possibly theres a bit more snap to the
>> system now.  A switch to screen 1 from this screen 8, and the loading of
>> that screen image, which is the Cassini shot of saturn from the backside,
>> the one showing that teeny dot to the left of Saturn that is actually us,
>> took 10 seconds with the stock 2.6.21-rc7, 3 seconds with the best of
>> Ingo's patches, and now with Con's latest, is 1 second flat. Another
>> screen however is 4 seconds, so maybe that first scren had been looked at
>> since I rebooted. However, amanda is still getting estimates so gzip
>> hasn't put a tiewrap around the kernels neck just yet.
>>
>> Some minutes later, gzip is smunching /usr/src, and the machine doesn't
>> even know its running as sd-0.44 isn't giving gzip more than 75% to gzip,
>> and probably averaging less than 50%. And it scared me a bit as it started
>> out at not over 5% for the first minute or so.  Running in the 70's now
>> according to gkrellm, with an occasional blip to 95%.  And the machine
>> generally feels good.
>>
>> I had previously given CFS-v4 a 95 score but that was before I saw the
>> general slowdown, and I believe my first impression of this one is also a
>> 95.  This on a scale of the best one of the earlier CFS patches being 100,
>> and stock 2.6.21-rc7 gets a 0.0.  This scheduler seems to be giving gzip
>> ever more cpu as time progresses, and the cpu is warming up quite nicely,
>> from about 132F idling to 149.9F now.  And my keyboard is still alive and
>> well.
>
>I'm not sure how much weight to put on what you see as the measured cpu
> usage. I have a feeling it's being wrongly reported in SD currently.
> Concentrate more on the actual progress and behaviour of things as you've
> already done.
>
>> Generally speaking, Con, I believe this one is also a keeper.  And we'll
>> see how long a backup run takes.

It looks as if it could have been 10 minutes quicker according to amplot, but 
that's entirely within the expected variation that amanda's scheduler might 
introduce.  But the run that just finished, under CFS-v5, was only 1h:47m, not 
including the verify run.  The previous backup, using sd-0.44, took 2h:28m for 
a similar but not identical operation according to amplot.  That's a big 
enough difference to be an indicator, I believe, but without knowing how much 
of that time was burned by gzip, it's an apples-and-oranges comparison.  We'll 
see if it repeats; I coded 'catchup' to do 2 in a row.

>Great thanks for feedback.

You're quite welcome, Con.

ATM I'm doing the same thing again, but booted to a CFS-v5 delta that Ingo 
sent me privately, and except for the kmail lag/freezes everything is cool, 
though the cpu managed to hit 150.7F during the height of one of the 
gzip --best smunching operations.  I believe the /dev/hdd writes are cranked 
well up from the earlier CFS patches also.  Unforch, this isn't something 
that's been coded into amplot, so I'm stuck watching the hdd display in 
gkrellm and making SWAGs.  And we all know what they are worth.  I've made a 
lot of them in my 72 years, and my track record, with some glaring exceptions 
like my 2nd wife that I won't bore you with the details of, has been fairly 
decent. :)

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
You will be run over by a beer truck.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-22  0:08         ` Con Kolivas
@ 2007-04-22  4:58           ` Mike Galbraith
  0 siblings, 0 replies; 149+ messages in thread
From: Mike Galbraith @ 2007-04-22  4:58 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Denis Vlasenko, Ingo Molnar, Willy Tarreau,
	William Lee Irwin III, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On Sun, 2007-04-22 at 10:08 +1000, Con Kolivas wrote:
> On Sunday 22 April 2007 08:54, Denis Vlasenko wrote:
> > On Saturday 21 April 2007 18:00, Ingo Molnar wrote:
> > > correct. Note that Willy reniced X back to 0 so it had no relevance on
> > > his test. Also note that i pointed this change out in the -v4 CFS
> > >
> > > announcement:
> > > || Changes since -v3:
> > > ||
> > > ||  - usability fix: automatic renicing of kernel threads such as
> > > ||    keventd, OOM tasks and tasks doing privileged hardware access
> > > ||    (such as Xorg).
> > >
> > > i've attached it below in a standalone form, feel free to put it into
> > > SD! :)
> >
> > But X problems have nothing to do with "privileged hardware access".
> > X problems are related to priority inversions between server and client
> > processes, and "one server process - many client processes" case.
> 
> It's not a privileged hardware access reason that this code is there. This is 
> obfuscation/advertising to make it look like there is a valid reason for X 
> getting negative nice levels somehow in the kernel to make interactive 
> testing of CFS better by default.

That's not a very nice thing to say, and the renicing has no benefit unless 
you specifically want to run multiple heavy X-hitting clients.

I boot with that feature disabled, specifically to be able to measure 
fairness in a pure environment, and it's still _much_ smoother and snappier 
than any RSDL/SD kernel I ever tried.

	-Mike


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-22  1:46                 ` Ulrich Drepper
@ 2007-04-22  7:02                   ` William Lee Irwin III
  2007-04-22  7:17                     ` Ulrich Drepper
  0 siblings, 1 reply; 149+ messages in thread
From: William Lee Irwin III @ 2007-04-22  7:02 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Linus Torvalds, Kyle Moffett, Willy Tarreau, Ingo Molnar,
	Con Kolivas, linux-kernel, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On 4/21/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> And how the hell do you imagine you'd even *know* what thread holds the
>> futex?

On Sat, Apr 21, 2007 at 06:46:58PM -0700, Ulrich Drepper wrote:
> We know this in most cases.  This is information recorded, for
> instance, in the mutex data structure.  You might have missed my "the
> interface must be extended" part.  This means the PID of the owning
> thread will have to be passed down.  For PI mutexes this is not
> necessary since the kernel already has access to the information.

I'm just looking for what people want the API to be here. With that in
hand we can just go out and do whatever needs to be done.


-- wli

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-22  7:02                   ` William Lee Irwin III
@ 2007-04-22  7:17                     ` Ulrich Drepper
  2007-04-22  8:48                       ` William Lee Irwin III
  0 siblings, 1 reply; 149+ messages in thread
From: Ulrich Drepper @ 2007-04-22  7:17 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linus Torvalds, Kyle Moffett, Willy Tarreau, Ingo Molnar,
	Con Kolivas, linux-kernel, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On 4/22/07, William Lee Irwin III <wli@holomorphy.com> wrote:
> I'm just looking for what people want the API to be here. With that in
> hand we can just go out and do whatever needs to be done.

I think a sched_yield_to is one interface:

   int sched_yield_to(pid_t);

For futex(), the extension is needed for the FUTEX_WAIT operation.  We
need a new operation, FUTEX_WAIT_FOR or similar, which takes another (a
fourth) parameter: the PID of the target.

For FUTEX_LOCK_PI we need no extension.  The futex value is the PID of
the current owner.  This is required for the whole interface to work
in the first place.
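A sketch of the call sites such an extension would enable.  Neither
FUTEX_WAIT_FOR nor sched_yield_to() exists; both are only proposals in this
thread, the opcode value is made up, and passing the owner TID in the last
futex() argument slot is just one plausible encoding, not a defined ABI.

/* Hypothetical wrappers for the interfaces proposed above. */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef FUTEX_WAIT_FOR
#define FUTEX_WAIT_FOR	128	/* made-up opcode, for illustration only */
#endif

/* Wait on *uaddr while it still equals val, naming 'owner' as the thread
 * we are waiting for, so our remaining timeslice can be handed to it. */
static int futex_wait_for(int *uaddr, int val, pid_t owner)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT_FOR, val, NULL, NULL, owner);
}

/* The plain scheduler-level variant of the same idea. */
static int sched_yield_to(pid_t pid)
{
	(void)pid;
	return -1;	/* no such syscall exists yet; stub for illustration */
}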

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 18:17   ` Gene Heskett
  2007-04-22  1:26     ` Con Kolivas
@ 2007-04-22  8:07     ` William Lee Irwin III
  2007-04-22 11:11       ` Gene Heskett
  1 sibling, 1 reply; 149+ messages in thread
From: William Lee Irwin III @ 2007-04-22  8:07 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Willy Tarreau, Ingo Molnar, Con Kolivas, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar

On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
> CFS-v4 is quite smooth in terms of the users experience but after prolonged 
> observations approaching 24 hours, it appears to choke the cpu hog off a bit 
> even when the system has nothing else to do.  My amanda runs went from 1 to 
> 1.5 hours depending on how much time it took gzip to handle the amount of 
> data tar handed it, up to about 165m & change, or nearly 3 hours pretty 
> consistently over 5 runs.

Welcome to infinite history. I'm not surprised, apart from the time
scale of anomalies being much larger than I anticipated.


On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
> sd-0.44 so far seems to be handling the same load (theres a backup running 
> right now) fairly well also, and possibly theres a bit more snap to the 
> system now.  A switch to screen 1 from this screen 8, and the loading of that 
> screen image, which is the Cassini shot of saturn from the backside, the one 
> showing that teeny dot to the left of Saturn that is actually us, took 10 
> seconds with the stock 2.6.21-rc7, 3 seconds with the best of Ingo's patches, 
> and now with Con's latest, is 1 second flat. Another screen however is 4 
> seconds, so maybe that first scren had been looked at since I rebooted. 
> However, amanda is still getting estimates so gzip hasn't put a tiewrap 
> around the kernels neck just yet.

Not sure what you mean by gzip putting a tiewrap around the kernel's neck.
Could you clarify?


On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
> Some minutes later, gzip is smunching /usr/src, and the machine doesn't even 
> know its running as sd-0.44 isn't giving gzip more than 75% to gzip, and 
> probably averaging less than 50%. And it scared me a bit as it started out at 
> not over 5% for the first minute or so.  Running in the 70's now according to 
> gkrellm, with an occasional blip to 95%.  And the machine generally feels 
> good.

I wonder what's behind that sort of initial and steady-state behavior.


On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
> I had previously given CFS-v4 a 95 score but that was before I saw
> the general slowdown, and I believe my first impression of this one
> is also a 95.  This on a scale of the best one of the earlier CFS
> patches being 100, and stock 2.6.21-rc7 gets a 0.0.  This scheduler
> seems to be giving gzip ever more cpu as time progresses, and the cpu
> is warming up quite nicely, from about 132F idling to 149.9F now.
> And my keyboard is still alive and well.
> Generally speaking, Con, I believe this one is also a keeper.  And we'll see 
> how long a backup run takes.

Pardon my saying so but you appear to be describing anomalous behavior
in terms of "scheduler warmups."

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, v4
  2007-04-20 14:04 [patch] CFS scheduler, v4 Ingo Molnar
                   ` (3 preceding siblings ...)
  2007-04-21 20:35 ` [patch] CFS scheduler, v4 S.Çağlar Onur
@ 2007-04-22  8:30 ` Michael Gerdau
  2007-04-23 22:47   ` Ingo Molnar
  2007-04-23  1:12 ` [patch] CFS scheduler, -v5 Ingo Molnar
  2007-04-23  9:28 ` crash with CFS v4 and qemu/kvm (was: [patch] CFS scheduler, v4) Christian Hesse
  6 siblings, 1 reply; 149+ messages in thread
From: Michael Gerdau @ 2007-04-22  8:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett

[-- Attachment #1: Type: text/plain, Size: 1094 bytes --]

> i'm pleased to announce release -v4 of the CFS patchset. The patch 
> against v2.6.21-rc7 can be downloaded from:
> 
>     http://redhat.com/~mingo/cfs-scheduler/

I can't get 2.6.21-rc7-CFS-v4 to boot. Immediately after selecting
this kernel I see a very fast scrolling (looping?) sequence of addresses
which I don't know how to stop in order to write them down. They don't
appear in any kernel log either. However, I do see two Tux logos at the
top of the screen.

I'm using the very same .config I also use with 2.6.21-rc7-sd0.x

What could be wrong, and how could I track that down?

FWIW this also happened with 2.6.21-rc7-CFS-v3 and thus for
2.6.21-rc7-CFS-v4 I did a pristine extract of 2.6.20.tar.bz2 and
applied all patches freshly.

System is a Dell XPS M1710, Intel Core2 T7600 2.33, 4GB

Best,
Michael
-- 
 Technosis GmbH, Geschäftsführer: Michael Gerdau, Tobias Dittmar
 Sitz Hamburg; HRB 89145 Amtsgericht Hamburg
 Vote against SPAM - see http://www.politik-digital.de/spam/
 Michael Gerdau       email: mgd@technosis.de
 GPG-keys available on request or at public keyserver

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-22  7:17                     ` Ulrich Drepper
@ 2007-04-22  8:48                       ` William Lee Irwin III
  2007-04-22 16:16                         ` Ulrich Drepper
  0 siblings, 1 reply; 149+ messages in thread
From: William Lee Irwin III @ 2007-04-22  8:48 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Linus Torvalds, Kyle Moffett, Willy Tarreau, Ingo Molnar,
	Con Kolivas, linux-kernel, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On 4/22/07, William Lee Irwin III <wli@holomorphy.com> wrote:
>> I'm just looking for what people want the API to be here. With that in
>> hand we can just go out and do whatever needs to be done.

On Sun, Apr 22, 2007 at 12:17:31AM -0700, Ulrich Drepper wrote:
> I think a sched_yield_to is one interface:
>   int sched_yield_to(pid_t);

All clear on that front.


On Sun, Apr 22, 2007 at 12:17:31AM -0700, Ulrich Drepper wrote:
> For futex(), the extension is needed for the FUTEX_WAIT operation.  We
> need a new operation FUTEX_WAIT_FOR or so which takes another (the
> fourth) parameter which is the PID of the target.
> For FUTEX_LOCK_PI we need no extension.  The futex value is the PID of
> the current owner.  This is required for the whole interface to work
> in the first place.

We'll have to send things out and see what sticks here. There seems to
be some pickiness above.


-- wli

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-22  8:07     ` William Lee Irwin III
@ 2007-04-22 11:11       ` Gene Heskett
  0 siblings, 0 replies; 149+ messages in thread
From: Gene Heskett @ 2007-04-22 11:11 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Willy Tarreau, Ingo Molnar, Con Kolivas, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar

On Sunday 22 April 2007, William Lee Irwin III wrote:
>On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
>> CFS-v4 is quite smooth in terms of the user's experience but after
>> prolonged observations approaching 24 hours, it appears to choke the cpu
>> hog off a bit even when the system has nothing else to do.  My amanda runs
>> went from 1 to 1.5 hours depending on how much time it took gzip to handle
>> the amount of data tar handed it, up to about 165m & change, or nearly 3
>> hours pretty consistently over 5 runs.
>
>Welcome to infinite history. I'm not surprised, apart from the time
>scale of anomalies being much larger than I anticipated.

[...]

>Pardon my saying so but you appear to be describing anomalous behavior
>in terms of "scheduler warmups."

Well, that was what I saw: it took gzip about 4 or 5 minutes to get to the 
first 90% hit in htop's display, and it first hit the top of the display at 
only 5%.  And the next backup run took about 2h:21m, so we're back in the 
ballpark.  I'd reset amanda's schedule for a faster dumpcycle too, along with 
giving the old girl a new drive, all about the time we started playing with 
this, so the times I'm recording now may well be nominal.  I suppose I should 
boot a plain 2.6.21-rc7 and make a run & time that, but I don't enjoy 
masochism THAT much. :)

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I've enjoyed just about as much of this as I can stand.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 23:59       ` Con Kolivas
@ 2007-04-22 13:04         ` Juliusz Chroboczek
  2007-04-22 23:24           ` Linus Torvalds
  2007-04-22 13:23         ` [REPORT] cfs-v4 vs sd-0.44 Mark Lord
  1 sibling, 1 reply; 149+ messages in thread
From: Juliusz Chroboczek @ 2007-04-22 13:04 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, ck list, Bill Davidsen, Willy Tarreau,
	William Lee Irwin III, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

> Oh I definitely was not advocating against renicing X,

Why not do it in the X server itself?  This will avoid controversial
policy in the kernel, and have the added advantage of working with
X servers that don't directly access hardware.

Con, if you tell me ``if you're running under Linux and such and such
/sys variable has value so-and-so, then it's definitely a good idea to
call nice(42) at the X server's start up'', then I'll commit it into
X.Org.  (Please CC both me and the list, so I can point any people
complaining to the archives.)

                                        Juliusz

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 19:00         ` Ingo Molnar
@ 2007-04-22 13:18           ` Mark Lord
  2007-04-22 13:27             ` Ingo Molnar
  0 siblings, 1 reply; 149+ messages in thread
From: Mark Lord @ 2007-04-22 13:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Engelhardt, Con Kolivas, Willy Tarreau,
	William Lee Irwin III, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

Ingo Molnar wrote:
> * Jan Engelhardt <jengelh@linux01.gwdg.de> wrote:
> 
>>> i've attached it below in a standalone form, feel free to put it 
>>> into SD! :)
>> Assume X went crazy (lacking any statistics, I make the unproven 
>> statement that this happens more often than kthreads going berserk), 
>> then having it niced with minus something is not too nice.
> 
> i've not experienced a 'runaway X' personally, at most it would crash or 
> lock up ;) The value is boot-time and sysctl configurable as well back 
> to 0.
>

Mmmm.. I've had to kill off the odd X that was locking in 100% CPU usage.
In the past, this has happened maybe 1-3 times a year or so on my notebook.

Now mind you, that usage could have been due to some client process,
but X is where the 100% showed up, so X is what I nuked.

Cheers

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-21 23:59       ` Con Kolivas
  2007-04-22 13:04         ` Juliusz Chroboczek
@ 2007-04-22 13:23         ` Mark Lord
  1 sibling, 0 replies; 149+ messages in thread
From: Mark Lord @ 2007-04-22 13:23 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, ck list, Bill Davidsen, Willy Tarreau,
	William Lee Irwin III, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

Con Kolivas wrote:
>
> Oh I definitely was not advocating against renicing X, I just suspect that 
> virtually all the users who gave glowing reports to CFS comparing it to SD 
> had no idea it had reniced X to -19 behind their back and that they were 
> comparing it to SD running X at nice 0.

I really do wish I wouldn't feel the need to keep stepping in here
to manually exclude my own results from such wide brush strokes.

I'm one of those "users", and I've never even tried CFS v4 (yet).
All prior versions did NOT do the renicing.

The renicing was in the CFS v4 announcement, right up front for all to see,
and the code for it has been posted separately with encouragement for RSDL
or whatever to also adopt it.

Now, with it in all of the various "me-too" schedulers,
maybe they'll all start to shine a little more on real users' systems.
So far, the stock 2.6.20 scheduler remains my own current preference,
despite really good results with CFS v1.

Cheers


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-22 13:18           ` Mark Lord
@ 2007-04-22 13:27             ` Ingo Molnar
  2007-04-22 13:30               ` Mark Lord
  2007-04-25  8:16               ` Pavel Machek
  0 siblings, 2 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-22 13:27 UTC (permalink / raw)
  To: Mark Lord
  Cc: Jan Engelhardt, Con Kolivas, Willy Tarreau,
	William Lee Irwin III, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett


* Mark Lord <lkml@rtr.ca> wrote:

> > i've not experienced a 'runaway X' personally, at most it would 
> > crash or lock up ;) The value is boot-time and sysctl configurable 
> > as well back to 0.
> 
> Mmmm.. I've had to kill off the odd X that was locking in 100% CPU 
> usage. In the past, this has happened maybe 1-3 times a year or so on 
> my notebook.
> 
> Now mind you, that usage could have been due to some client process, 
> but X is where the 100% showed up, so X is what I nuked.

well, i just simulated a runaway X at nice -19 on CFS (on a UP box), and 
while the box was a tad laggy, i was able to killall it without 
problems, within 2 seconds that also included a 'su'. So it's not an 
issue in CFS, it can be turned off, and because every distro has another 
way to renice Xorg, this is a convenience hack until Xorg standardizes 
it into some xorg.conf field. (It also makes sure that X isn't preempted 
by other userspace stuff while it does timing-sensitive operations like 
setting the video modes up or switching video modes, etc.)

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-22 13:27             ` Ingo Molnar
@ 2007-04-22 13:30               ` Mark Lord
  2007-04-25  8:16               ` Pavel Machek
  1 sibling, 0 replies; 149+ messages in thread
From: Mark Lord @ 2007-04-22 13:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Engelhardt, Con Kolivas, Willy Tarreau,
	William Lee Irwin III, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

Ingo Molnar wrote:
>
> well, i just simulated a runaway X at nice -19 on CFS (on a UP box), and 
> while the box was a tad laggy, i was able to killall it without 
> problems, within 2 seconds that also included a 'su'. So it's not an 
> issue in CFS, it can be turned off, and because every distro has another 
> way to renice Xorg, this is a convenience hack until Xorg standardizes 
> it into some xorg.conf field. (It also makes sure that X isnt preempted 
> by other userspace stuff while it does timing-sensitive operations like 
> setting the video modes up or switching video modes, etc.)

Good!

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-22  8:48                       ` William Lee Irwin III
@ 2007-04-22 16:16                         ` Ulrich Drepper
  2007-04-23  0:07                           ` Rusty Russell
  0 siblings, 1 reply; 149+ messages in thread
From: Ulrich Drepper @ 2007-04-22 16:16 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linus Torvalds, Kyle Moffett, Willy Tarreau, Ingo Molnar,
	Con Kolivas, linux-kernel, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett, Rusty Russell

On 4/22/07, William Lee Irwin III <wli@holomorphy.com> wrote:
> On Sun, Apr 22, 2007 at 12:17:31AM -0700, Ulrich Drepper wrote:
> > For futex(), the extension is needed for the FUTEX_WAIT operation.  We
> > need a new operation FUTEX_WAIT_FOR or so which takes another (the
> > fourth) parameter which is the PID of the target.
> > For FUTEX_LOCK_PI we need no extension.  The futex value is the PID of
> > the current owner.  This is required for the whole interface to work
> > in the first place.
>
> We'll have to send things out and see what sticks here. There seems to
> be some pickiness above.

I know Rusty will shudder since it makes futexes yet more complicated
(although only if the user wants it), but if you introduce the concept
of "yield to" then this extension really makes sense, and it is quite a
simple extension.  Plus: I'm the one most affected by the change, since
I have to change code to use it, and I'm fine with it.

Oh, last time I didn't mention the cases of
waitpid()/wait4()/waitid() explicitly naming a process to wait on.  I
think it's clear that those cases also should be changed to use yield-to
if possible.  I don't have a good suggestion for what to do when the
call waits for any child.  Perhaps yielding to the last created one is
fine.  If delays through reading on a pipe are recognized as well and
handled with yield-to, then the time slot will automatically be
forwarded to the first runnable process in the pipe sequence.  I.e.,
running

    grep foo /etc/passwd | cut -d: -f2 | crack

probably will create 'crack' last.  Giving the remainder of the time
slot should result in recognizing that it waits for 'cut', which in turn
waits for 'grep'.  So in the end 'grep' gets the timeslot.  Seems
quite complicated from the outside but I can imagine quite good
results from this.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-22 13:04         ` Juliusz Chroboczek
@ 2007-04-22 23:24           ` Linus Torvalds
  2007-04-23  1:34             ` Nick Piggin
  2007-04-23  2:42             ` [report] renicing X, cfs-v5 vs sd-0.46 Ingo Molnar
  0 siblings, 2 replies; 149+ messages in thread
From: Linus Torvalds @ 2007-04-22 23:24 UTC (permalink / raw)
  To: Juliusz Chroboczek
  Cc: Con Kolivas, Ingo Molnar, ck list, Bill Davidsen, Willy Tarreau,
	William Lee Irwin III, linux-kernel, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett



On Sun, 22 Apr 2007, Juliusz Chroboczek wrote:
> 
> Why not do it in the X server itself?  This will avoid controversial
> policy in the kernel, and have the added advantage of working with
> X servers that don't directly access hardware.

It's wrong *wherever* you do it.

The X server should not be re-niced. It was done in the past, and it was 
wrong then (and caused problems - we had to tell people to undo it, 
because some distros had started doing it by default).

If you have a single client, the X server is *not* more important than the 
client, and indeed, renicing the X server causes bad patterns: just 
because the client sends a request does not mean that the X server should 
immediately be given the CPU as being "more important". 

In other words, the things that make it important that the X server _can_ 
get CPU time if needed are all totally different from the X server being 
"more important". The X server is more important only in the presense of 
multiple clients, not on its own! Needing to renice it is a hack for a bad 
scheduler, and shows that somebody doesn't understand the problem!

		Linus

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-22 16:16                         ` Ulrich Drepper
@ 2007-04-23  0:07                           ` Rusty Russell
  0 siblings, 0 replies; 149+ messages in thread
From: Rusty Russell @ 2007-04-23  0:07 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: William Lee Irwin III, Linus Torvalds, Kyle Moffett,
	Willy Tarreau, Ingo Molnar, Con Kolivas, linux-kernel,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

On Sun, 2007-04-22 at 09:16 -0700, Ulrich Drepper wrote:
> On 4/22/07, William Lee Irwin III <wli@holomorphy.com> wrote:
> > On Sun, Apr 22, 2007 at 12:17:31AM -0700, Ulrich Drepper wrote:
> > > For futex(), the extension is needed for the FUTEX_WAIT operation.  We
> > > need a new operation FUTEX_WAIT_FOR or so which takes another (the
> > > fourth) parameter which is the PID of the target.
> > > For FUTEX_LOCK_PI we need no extension.  The futex value is the PID of
> > > the current owner.  This is required for the whole interface to work
> > > in the first place.
> >
> > We'll have to send things out and see what sticks here. There seems to
> > be some pickiness above.
> 
> I know Rusty will shudder since it makes futexes yet more complicated
> (although only if the user wants it) but if you introduce the concept
> of "yield to" then this extension makes really sense and it is a quite
> simple extension.  Plus: I'm the most affected by the change since I
> have to change code to use it and I'm fine with it.

Hi Uli,

	I wouldn't worry: futexes long ago jumped the shark.

	I think it was inevitable that once we started endorsing programs
bypassing the kernel for IPC, we'd want some form of yield_to().
And yield_to(p) has much more sane semantics than yield().

Cheers,
Rusty.



^ permalink raw reply	[flat|nested] 149+ messages in thread

* [patch] CFS scheduler, -v5
  2007-04-20 14:04 [patch] CFS scheduler, v4 Ingo Molnar
                   ` (4 preceding siblings ...)
  2007-04-22  8:30 ` Michael Gerdau
@ 2007-04-23  1:12 ` Ingo Molnar
  2007-04-23  1:25   ` Nick Piggin
                     ` (4 more replies)
  2007-04-23  9:28 ` crash with CFS v4 and qemu/kvm (was: [patch] CFS scheduler, v4) Christian Hesse
  6 siblings, 5 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23  1:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper


i'm pleased to announce release -v5 of the CFS scheduler patchset. The 
patch against v2.6.21-rc7 and v2.6.20.7 can be downloaded from:

    http://redhat.com/~mingo/cfs-scheduler/

this CFS release mainly fixes regressions and improves interactivity:

    13 files changed, 211 insertions(+), 199 deletions(-)

the biggest user-visible change in -v5 are various interactivity 
improvements (especially under higher load) to fix reported regressions, 
and an improved way of handling nice levels. There's also a new 
sys_sched_yield_to() syscall implementation for i686 and x86_64.

All known regressions have been fixed. (knock on wood)

[ Note: while CFS's default preemption granularity is currently set to 5 
  msecs, this value does not directly transform into timeslices: for 
  example two CPU-intense tasks will have effective timeslices of 10 
  msecs with this setting. ]

Changes since -v4:

 - interactivity bugfix: fix xterm latencies and general desktop delays 
   and child task startup delays under load. (reported by Willy Tarreau 
   and Caglar Onur)

 - bugfix: the in_atomic_preempt_off() call on !PREEMPT_BKL was buggy
   and spammed the console with bogus warnings.

 - implementation fix: make the nice levels implementation
   starvation-free and smpnice-friendly. Remove the nice_offset hack.

 - feature: add initial sys_sched_yield_to() implementation. Not hooked 
   into the futex code yet, but testers are encouraged to give the 
   syscall a try: on i686 the new syscall is __NR_yield_to==320, on 
   x86_64 it's __NR_yield_to==280. The prototype is 
   sys_sched_yield_to(pid_t), as suggested by Ulrich Drepper. (A minimal 
   test-invocation sketch follows at the end of this mail.)

 - usability feature: add CONFIG_RENICE_X: those who dont want the 
   kernel to renice X should disable this option. (the boot option and 
   the sysctl is still available too)

 - removed my home-made "Con was right about scheduling fairness" 
   attribution to Con's scheduler interactivity work - some have 
   suggested that Con might want to see another text there. Con,
   please feel free to fill it in!

 - feature: make the CPU usage of nice levels logarithmic instead of 
   linear. This is more usable and more intuitive. (Going four nice 
   levels forwards/backwards gives half/twice the CPU power - see the 
   small numeric illustration after this list.) [ This was requested a 
   number of times in the past few years and is straightforward under 
   CFS because in CFS nice levels are not tied to any timeslice 
   distribution mechanism. ]

 - cleanup: removed the stupid "Ingo was here" banner printk from 
   sched_init(), the -cfs EXTRAVERSION serves the purpose (of 
   identifying a booted up kernel as a CFS one) equally well.

 - various other code cleanups
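
To make the logarithmic nice rule above concrete, here is a tiny 
userspace illustration of the arithmetic it implies. This only 
demonstrates the "four levels = factor of two" relationship; it is not 
the weight table or code the patch actually uses.

/* build: gcc -o niceweight niceweight.c -lm */
#include <math.h>
#include <stdio.h>

/* Relative CPU share implied by "four nice levels = half/twice the CPU
 * power": weight(nice) ~ 2^(-nice/4).  Illustration only. */
int main(void)
{
	int nice;

	for (nice = -20; nice <= 19; nice += 4)
		printf("nice %3d -> relative weight %.3f\n",
		       nice, pow(2.0, -nice / 4.0));
	return 0;
}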

As usual, any sort of feedback, bugreport, fix and suggestion is more 
than welcome,

	Ingo
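
For testers who want to try the new syscall before the futex hook-up 
lands, here is the minimal invocation sketch mentioned above. There is 
no glibc wrapper; the syscall numbers are the ones quoted in this 
announcement and may of course change in later releases.

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_yield_to
# ifdef __x86_64__
#  define __NR_yield_to 280	/* x86_64 number from the announcement */
# else
#  define __NR_yield_to 320	/* i686 number from the announcement */
# endif
#endif

int main(int argc, char **argv)
{
	pid_t target = (argc > 1) ? (pid_t)atoi(argv[1]) : getppid();

	/* sys_sched_yield_to(pid_t): hand the rest of our slice to 'target'.
	 * Returns -1 with errno ENOSYS on kernels without the CFS -v5 patch. */
	long ret = syscall(__NR_yield_to, target);

	printf("sched_yield_to(%d) = %ld\n", (int)target, ret);
	return 0;
}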

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  1:12 ` [patch] CFS scheduler, -v5 Ingo Molnar
@ 2007-04-23  1:25   ` Nick Piggin
  2007-04-23  2:39     ` Gene Heskett
  2007-04-23  2:55     ` Ingo Molnar
  2007-04-23  3:19   ` [patch] CFS scheduler, -v5 (build problem - make headers_check fails) Zach Carter
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 149+ messages in thread
From: Nick Piggin @ 2007-04-23  1:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper

On Mon, Apr 23, 2007 at 03:12:29AM +0200, Ingo Molnar wrote:
> 
> i'm pleased to announce release -v5 of the CFS scheduler patchset. The 
> patch against v2.6.21-rc7 and v2.6.20.7 can be downloaded from:
> 
>     http://redhat.com/~mingo/cfs-scheduler/
> 
> this CFS release mainly fixes regressions and improves interactivity:
> 
>     13 files changed, 211 insertions(+), 199 deletions(-)
> 
> the biggest user-visible change in -v5 are various interactivity 
> improvements (especially under higher load) to fix reported regressions, 
> and an improved way of handling nice levels. There's also a new 
> sys_sched_yield_to() syscall implementation for i686 and x86_64.
> 
> All known regressions have been fixed. (knock on wood)

I think the granularity is still much too low. Why not increase it to
something more reasonable as a default?


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-22 23:24           ` Linus Torvalds
@ 2007-04-23  1:34             ` Nick Piggin
  2007-04-23 15:56               ` Linus Torvalds
  2007-04-23  2:42             ` [report] renicing X, cfs-v5 vs sd-0.46 Ingo Molnar
  1 sibling, 1 reply; 149+ messages in thread
From: Nick Piggin @ 2007-04-23  1:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Juliusz Chroboczek, Con Kolivas, Ingo Molnar, ck list,
	Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

On Sun, Apr 22, 2007 at 04:24:47PM -0700, Linus Torvalds wrote:
> 
> 
> On Sun, 22 Apr 2007, Juliusz Chroboczek wrote:
> > 
> > Why not do it in the X server itself?  This will avoid controversial
> > policy in the kernel, and have the added advantage of working with
> > X servers that don't directly access hardware.
> 
> It's wrong *wherever* you do it.
> 
> The X server should not be re-niced. It was done in the past, and it was 
> wrong then (and caused problems - we had to tell people to undo it, 
> because some distros had started doing it by default).

The 2.6 scheduler can get very bad latency problems with the X server
reniced.


> If you have a single client, the X server is *not* more important than the 
> client, and indeed, renicing the X server causes bad patterns: just 
> because the client sends a request does not mean that the X server should 
> immediately be given the CPU as being "more important". 

If the client is doing some processing, and the user moves the mouse, it
feels much more interactive if the pointer moves rather than waits for
the client to finish processing.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  1:25   ` Nick Piggin
@ 2007-04-23  2:39     ` Gene Heskett
  2007-04-23  3:08       ` Ingo Molnar
  2007-04-23  2:55     ` Ingo Molnar
  1 sibling, 1 reply; 149+ messages in thread
From: Gene Heskett @ 2007-04-23  2:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Mark Lord,
	Ulrich Drepper

On Sunday 22 April 2007, Nick Piggin wrote:
>On Mon, Apr 23, 2007 at 03:12:29AM +0200, Ingo Molnar wrote:
>> i'm pleased to announce release -v5 of the CFS scheduler patchset. The
>> patch against v2.6.21-rc7 and v2.6.20.7 can be downloaded from:
>>
>>     http://redhat.com/~mingo/cfs-scheduler/
>>
>> this CFS release mainly fixes regressions and improves interactivity:
>>
>>     13 files changed, 211 insertions(+), 199 deletions(-)
>>
>> the biggest user-visible change in -v5 are various interactivity
>> improvements (especially under higher load) to fix reported regressions,
>> and an improved way of handling nice levels. There's also a new
>> sys_sched_yield_to() syscall implementation for i686 and x86_64.
>>
>> All known regressions have been fixed. (knock on wood)
>
>I think the granularity is still much too low. Why not increase it to
>something more reasonable as a default?

I haven't approached that yet, but I just noticed, having been booted to this 
for all of 5 minutes, that although I told it not to renice X when my script 
ran 'make oldconfig' (I answered n), there it is, sitting at -19 
according to htop.

The .config says otherwise:
[root@coyote linux-2.6.21-rc7-CFS-v5]# grep RENICE .config
# CONFIG_RENICE_X is not set

So v5 reniced X in spite of the 'no' setting.

Although I hadn't noticed it, one way or the other, I just set it (X) back to 
the default -1 so that I'm comparing the same apples when I do compare.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Fortune finishes the great quotations, #2

	If at first you don't succeed, think how many people
	you've made happy.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [report] renicing X, cfs-v5 vs sd-0.46
  2007-04-22 23:24           ` Linus Torvalds
  2007-04-23  1:34             ` Nick Piggin
@ 2007-04-23  2:42             ` Ingo Molnar
  2007-04-23 15:09               ` Linus Torvalds
  1 sibling, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23  2:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The X server should not be re-niced. It was done in the past, and it 
> was wrong then (and caused problems - we had to tell people to undo 
> it, because some distros had started doing it by default).
> 
> If you have a single client, the X server is *not* more important than 
> the client, and indeed, renicing the X server causes bad patterns: 
> just because the client sends a request does not mean that the X 
> server should immediately be given the CPU as being "more important".

You are completely right in the case of traditional schedulers.

Note that this is not the case for CFS though. CFS has natural, built-in 
buffering against high-rate preemptions from lower nice-level 
SCHED_OTHER tasks. So while X will indeed get more CPU time (and that i 
think is fully justified), it won't get nearly as high a 
context-switch rate as under priority/runqueue-based schedulers.

To demonstrate this i have done the following simple experiment: i 
started 4 xterms on a single-CPU box, then i started the 'yes' utility 
in each xterm and resized all of the xterms to just 2 lines vertical. 
This generates a _lot_ of screen refresh events. Naturally, such a 
workload utilizes the whole CPU.

Using CFS-v5, with Xorg at nice 0, the context-switch rate is low:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 472132  13712 178604    0    0     0    32  113  170 83 17  0  0  0
 2  0      0 472172  13712 178604    0    0     0     0  112  184 85 15  0  0  0
 2  0      0 472196  13712 178604    0    0     0     0  108  162 83 17  0  0  0
 1  0      0 472076  13712 178604    0    0     0     0  115  189 86 14  0  0  0

X's CPU utilization is 49%, the xterms go to 12% each. Userspace 
utilization is 85%, system utilization is 15%.

Renicing X to -10 increases context-switching, but not dramatically so, 
because it is throttled by CFS:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 475752  13492 176320    0    0     0    64  116 1498 85 15  0  0  0
 4  0      0 475752  13492 176320    0    0     0     0  107 1488 84 16  0  0  0
 4  0      0 475752  13492 176320    0    0     0     0  140 1514 86 14  0  0  0
 4  0      0 475752  13492 176320    0    0     0     0  107 1477 85 15  0  0  0
 4  0      0 475752  13492 176320    0    0     0     0  122 1498 84 16  0  0  0

The system is still usable, Xorg is 44% busy, each xterm is 14% busy. 
User utilization 85%, system utilization is 15% - just like in the first 
case.

"Performance of scrolling" is exactly the same in both cases (i have 
tested this by inserting periodic beeps after every 10,000 lines of text 
scrolled) - but the screen refresh rate is alot more eye-pleasing in the 
nice -10 case. (screen refresh it happens at ~500 Hz, while in the nice 
0 case it happens at ~40 Hz and visibly flickers. This is especially 
noticeable if the xterms have full size.)

I have tested the same workload on vanilla v2.6.21-rc7 and on SD-0.46
too, and they give roughly the same xterm scheduling behavior when Xorg 
is at nice 0:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 450564  14844 194976    0    0     0     0  287  594 58 10 32  0  0
 4  0      0 450704  14844 194976    0    0     0     0  108  370 89 11  0  0  0
 0  0      0 449588  14844 194976    0    0     0     0  175  434 85 13  2  0  0
 3  0      0 450688  14852 194976    0    0     0    32  242  315 62  9 29  0  0

but when Xorg is reniced to -10 on the vanilla or SD schedulers, it 
indeed gives the markedly higher context-switching behavior you 
predicted:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 5  0      0 452272  13936 194896    0    0     0     0  126 14147 78 22  0  0  0
 4  0      0 452252  13944 194896    0    0     0    64  155 14143 80 20  0  0  0
 5  0      0 452612  13944 194896    0    0     0     0  187 14031 79 21  0  0  0
 4  0      0 452624  13944 194896    0    0     0     0  121 14300 82 18  0  0  0

User time drops to 78%, system time increases to 22%. "Scrolling 
performance" clearly decreases.

so i agree that renicing X can be a very bad idea, but it very much 
depends on the scheduler implementation too.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  1:25   ` Nick Piggin
  2007-04-23  2:39     ` Gene Heskett
@ 2007-04-23  2:55     ` Ingo Molnar
  2007-04-23  3:22       ` Nick Piggin
  1 sibling, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23  2:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper


* Nick Piggin <npiggin@suse.de> wrote:

> > the biggest user-visible change in -v5 are various interactivity 
> > improvements (especially under higher load) to fix reported 
> > regressions, and an improved way of handling nice levels. There's 
> > also a new sys_sched_yield_to() syscall implementation for i686 and 
> > x86_64.
> > 
> > All known regressions have been fixed. (knock on wood)
> 
> I think the granularity is still much too low. Why not increase it to 
> something more reasonable as a default?

note that CFS's "granularity" value is not directly comparable to 
"timeslice length":

> [ Note: while CFS's default preemption granularity is currently set to
>   5 msecs, this value does not directly transform into timeslices: for 
>   example two CPU-intense tasks will have effective timeslices of 10 
>   msecs with this setting. ]

also, i just checked SD: 0.46 defaults to 8 msecs rr_interval (on 1 CPU 
systems), which is lower than the 10 msecs effective timeslice length 
CFS-v5 achieves on two CPU-bound tasks.

(in -v6 i'll scale the granularity up a bit with the number of CPUs, 
like SD does. That should get the right result on larger SMP boxes too.)

while i agree it's a tad too finegrained still, I agree with Con's 
choice: rather err on the side of being too finegrained and lose some 
small amount of throughput on cache-intense workloads like compile jobs, 
than err on the side of being visibly too choppy for users on the 
desktop.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  2:39     ` Gene Heskett
@ 2007-04-23  3:08       ` Ingo Molnar
  0 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23  3:08 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Mark Lord,
	Ulrich Drepper


* Gene Heskett <gene.heskett@gmail.com> wrote:

> I haven't approached that yet, but I just noticed, having been booted 
> to this for all of 5 minutes, that although I told it not to renice X 
> when my script ran 'make oldconfig' (I answered n), there it 
> is, sitting at -19 according to htop.
> 
> The .config says otherwise:
> [root@coyote linux-2.6.21-rc7-CFS-v5]# grep RENICE .config
> # CONFIG_RENICE_X is not set
> 
> So v5 reniced X in spite of the 'no' setting.

Hmm, apparently your X uses ioperm() while mine uses iopl(), and i only 
turned off the renicing for iopl. (I fixed this in my tree and it will 
show up in -v6.)

> Although I hadn't noticed it, one way or the other, I just set it (X) 
> back to the default -1 so that I'm comparing the same apples when I do 
> compare.

note that CFS handles negative nice levels differently from other 
schedulers, so the disadvantages of aggressively reniced X (lost 
throughput due to overscheduling, worse interactivity) do _not_ apply to 
CFS.

I think the 'fair' setting would be whatever the scheduler writer 
recommends: for SD, X probably performs better at around nice 0 (i'll 
let Con correct me if his experience is different). On CFS, nice -10 is 
perfectly fine too, and you'll have a zippier desktop under higher 
loads. (on servers this might be unnecessary/disadvantageous, so there 
it can be turned off.)

(also, in my tree i've changed the default from -19 to -10 to make it 
less scary to people and to leave more levels to the sysadmin, this 
change too will show up in -v6.)

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5 (build problem - make headers_check fails)
  2007-04-23  1:12 ` [patch] CFS scheduler, -v5 Ingo Molnar
  2007-04-23  1:25   ` Nick Piggin
@ 2007-04-23  3:19   ` Zach Carter
  2007-04-23 10:03     ` Ingo Molnar
  2007-04-23  5:16   ` [patch] CFS scheduler, -v5 Markus Trippelsdorf
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 149+ messages in thread
From: Zach Carter @ 2007-04-23  3:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper



Ingo Molnar wrote:
> i'm pleased to announce release -v5 of the CFS scheduler patchset. The 
> patch against v2.6.21-rc7 and v2.6.20.7 can be downloaded from:
> 

FYI, make headers_check seems to fail on this:

[carter@hoth linux-2.6]$ make headers_check

[snip]

   CHECK   include/linux/usb/cdc.h
   CHECK   include/linux/usb/audio.h
make[2]: *** No rule to make target `/src/linux-2.6/usr/include/linux/.check.sched.h', needed by 
`__headerscheck'.  Stop.
make[1]: *** [linux] Error 2
make: *** [headers_check] Error 2
[carter@hoth linux-2.6]$

This also fails if I have CONFIG_HEADERS_CHECK=y in my .config

unset CONFIG_HEADERS_CHECK and it builds just fine.

-Zach

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  2:55     ` Ingo Molnar
@ 2007-04-23  3:22       ` Nick Piggin
  2007-04-23  3:43         ` Ingo Molnar
  0 siblings, 1 reply; 149+ messages in thread
From: Nick Piggin @ 2007-04-23  3:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper

On Mon, Apr 23, 2007 at 04:55:53AM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > > the biggest user-visible change in -v5 are various interactivity 
> > > improvements (especially under higher load) to fix reported 
> > > regressions, and an improved way of handling nice levels. There's 
> > > also a new sys_sched_yield_to() syscall implementation for i686 and 
> > > x86_64.
> > > 
> > > All known regressions have been fixed. (knock on wood)
> > 
> > I think the granularity is still much too low. Why not increase it to 
> > something more reasonable as a default?
> 
> note that CFS's "granularity" value is not directly comparable to 
> "timeslice length":

Right, but it does introduce the kbuild regression, and as we
discussed, this will be only worse on newer CPUs with bigger
caches or less naturally context switchy workloads.


> > [ Note: while CFS's default preemption granularity is currently set to
> >   5 msecs, this value does not directly transform into timeslices: for 
> >   example two CPU-intense tasks will have effective timeslices of 10 
> >   msecs with this setting. ]
> 
> also, i just checked SD: 0.46 defaults to 8 msecs rr_interval (on 1 CPU 
> systems), which is lower than the 10 msecs effective timeslice length 
> CVS-v5 achieves on two CPU-bound tasks.

This is still about an order of magnitude less than the current
scheduler's default timeslice, so I still think it is too small.


> (in -v6 i'll scale the granularity up a bit with the number of CPUs, 
> like SD does. That should get the right result on larger SMP boxes too.)

I don't really like the scaling with SMP thing. The cache effects are
still going to be significant on small systems, and there are lots of
non-desktop users of those (eg. clusters).


> while i agree it's a tad too finegrained still, I agree with Con's 
> choice: rather err on the side of being too finegrained and lose some 
> small amount of throughput on cache-intense workloads like compile jobs, 
> than err on the side of being visibly too choppy for users on the 
> desktop.

So cfs gets too choppy if you make the effective timeslice comparable
to mainline?

My approach is completely the opposite. For testing, I prefer to make
the timeslice as large as possible so any problems or regressions are
really noticeable and will be reported; it can be scaled back to be 
smaller once those kinks are ironed out.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  3:22       ` Nick Piggin
@ 2007-04-23  3:43         ` Ingo Molnar
  2007-04-23  4:06           ` Nick Piggin
  0 siblings, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23  3:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper


* Nick Piggin <npiggin@suse.de> wrote:

> > note that CFS's "granularity" value is not directly comparable to 
> > "timeslice length":
> 
> Right, but it does introduce the kbuild regression, [...]

Note that i increased the granularity from 1msec to 5msecs after your 
kbuild report, could you perhaps retest kbuild with the default settings 
of -v5?

> [...] and as we discussed, this will be only worse on newer CPUs with 
> bigger caches or less naturally context switchy workloads.

yeah - but they'll all be quad core, so the SMP timeslice multiplicator 
should do the trick. Most of the CFS testers use single-CPU systems.

> > (in -v6 i'll scale the granularity up a bit with the number of CPUs, 
> > like SD does. That should get the right result on larger SMP boxes 
> > too.)
> 
> I don't really like the scaling with SMP thing. The cache effects are 
> still going to be significant on small systems, and there are lots of 
> non-desktop users of those (eg. clusters).

Clusters using CFS will want to tune the granularity up drastically 
anyway, to 1 second or more, to maximize throughput. I think a small 
default with a scale-up-on-SMP rule is pretty sane. We'll gather some 
more kbuild data and see what happens, ok?

> > while i agree it's a tad too finegrained still, I agree with Con's 
> > choice: rather err on the side of being too finegrained and lose 
> > some small amount of throughput on cache-intense workloads like 
> > compile jobs, than err on the side of being visibly too choppy for 
> > users on the desktop.
> 
> So cfs gets too choppy if you make the effective timeslice comparable 
> to mainline?

it doesn't in any test i do, but again, i'm erring on the side of it 
being more interactive.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  3:43         ` Ingo Molnar
@ 2007-04-23  4:06           ` Nick Piggin
  2007-04-23  7:10             ` Ingo Molnar
  2007-04-23  9:25             ` Ingo Molnar
  0 siblings, 2 replies; 149+ messages in thread
From: Nick Piggin @ 2007-04-23  4:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper

On Mon, Apr 23, 2007 at 05:43:10AM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > > note that CFS's "granularity" value is not directly comparable to 
> > > "timeslice length":
> > 
> > Right, but it does introduce the kbuild regression, [...]
> 
> Note that i increased the granularity from 1msec to 5msecs after your 
> kbuild report, could you perhaps retest kbuild with the default settings 
> of -v5?

I'm looking at mysql again today, but I will try eventually. It was
just a simple kbuild.


> > [...] and as we discussed, this will be only worse on newer CPUs with 
> > bigger caches or less naturally context switchy workloads.
> 
> yeah - but they'll all be quad core, so the SMP timeslice multiplicator 
> should do the trick. Most of the CFS testers use single-CPU systems.

But desktop users could have quad thread and even 8 thread CPUs 
soon, so if the number doesn't work for both then you're in trouble.
It just smells like a hack to scale with CPU numbers.

 
> > > (in -v6 i'll scale the granularity up a bit with the number of CPUs, 
> > > like SD does. That should get the right result on larger SMP boxes 
> > > too.)
> > 
> > I don't really like the scaling with SMP thing. The cache effects are 
> > still going to be significant on small systems, and there are lots of 
> > non-desktop users of those (eg. clusters).
> 
> Clusters using CFS will want to tune the granularity up drastically 
> anyway, to 1 second or more, to maximize throughput. I think a small 
> default with a scale-up-on-SMP rule is pretty sane. We'll gather some 
> more kbuild data and see what happens, ok?
> 
> > > while i agree it's a tad too finegrained still, I agree with Con's 
> > > choice: rather err on the side of being too finegrained and lose 
> > > some small amount of throughput on cache-intense workloads like 
> > > compile jobs, than err on the side of being visibly too choppy for 
> > > users on the desktop.
> > 
> > So cfs gets too choppy if you make the effective timeslice comparable 
> > to mainline?
> 
> it doesn't in any test i do, but again, i'm erring on the side of it 
> being more interactive.

I'd start by erring on the side of trying to ensure no obvious
performance regressions like this because that's the easy part. Suppose
everybody finds your scheduler wonderfully interactive, but you can't
make it so with a larger timeslice?

For _real_ desktop systems, sure, erring on the side of being more
interactive is fine. For RFC patches for testing, I really think you
could be taking advantage of the fact that people will give you feedback
on the issue.



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  1:12 ` [patch] CFS scheduler, -v5 Ingo Molnar
  2007-04-23  1:25   ` Nick Piggin
  2007-04-23  3:19   ` [patch] CFS scheduler, -v5 (build problem - make headers_check fails) Zach Carter
@ 2007-04-23  5:16   ` Markus Trippelsdorf
  2007-04-23  5:27     ` Markus Trippelsdorf
  2007-04-23 12:20   ` Guillaume Chazarain
  2007-04-24 16:54   ` Christian Hesse
  4 siblings, 1 reply; 149+ messages in thread
From: Markus Trippelsdorf @ 2007-04-23  5:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper

On Mon, Apr 23, 2007 at 03:12:29AM +0200, Ingo Molnar wrote:
> 
> i'm pleased to announce release -v5 of the CFS scheduler patchset. The 
> patch against v2.6.21-rc7 and v2.6.20.7 can be downloaded from:
...
>  - feature: add initial sys_sched_yield_to() implementation. Not hooked 
>    into the futex code yet, but testers are encouraged to give the 
>    syscalls a try, on i686 the new syscall is __NR_yield_to==320, on 
>    x86_64 it's __NR_yield_to==280. The prototype is 
>    sys_sched_yield_to(pid_t), as suggested by Ulrich Drepper.

The new version does not link here (amd64,smp):

  LD      .tmp_vmlinux1
  arch/x86_64/kernel/built-in.o:(.rodata+0x1dd8): undefined reference to
  `sys_yield_to'

-- 
Markus

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  5:16   ` [patch] CFS scheduler, -v5 Markus Trippelsdorf
@ 2007-04-23  5:27     ` Markus Trippelsdorf
  2007-04-23  6:21       ` Ingo Molnar
  0 siblings, 1 reply; 149+ messages in thread
From: Markus Trippelsdorf @ 2007-04-23  5:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper

On Mon, Apr 23, 2007 at 07:16:59AM +0200, Markus Trippelsdorf wrote:
> On Mon, Apr 23, 2007 at 03:12:29AM +0200, Ingo Molnar wrote:
> > 
> > i'm pleased to announce release -v5 of the CFS scheduler patchset. The 
> > patch against v2.6.21-rc7 and v2.6.20.7 can be downloaded from:
> ...
> >  - feature: add initial sys_sched_yield_to() implementation. Not hooked 
> >    into the futex code yet, but testers are encouraged to give the 
> >    syscalls a try, on i686 the new syscall is __NR_yield_to==320, on 
> >    x86_64 it's __NR_yield_to==280. The prototype is 
> >    sys_sched_yield_to(pid_t), as suggested by Ulrich Drepper.
> 
> The new version does not link here (amd64,smp):
> 
>   LD      .tmp_vmlinux1
>   arch/x86_64/kernel/built-in.o:(.rodata+0x1dd8): undefined reference to
>   `sys_yield_to'

Changing  sys_yield_to to sys_sched_yield_to in include/asm-x86_64/unistd.h
fixes the problem.
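
For anyone hitting the same link error, the change amounts to the
one-liner below in the x86_64 syscall table. This is reconstructed
from the error message and the description above, not copied from the
actual patch.

/* include/asm-x86_64/unistd.h -- reconstructed sketch, not the actual patch.
 * The syscall table entry referenced sys_yield_to while the implementation
 * is called sys_sched_yield_to, hence the undefined reference above. */
#define __NR_yield_to		280
__SYSCALL(__NR_yield_to, sys_sched_yield_to)	/* was: sys_yield_to */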
-- 
Markus

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  5:27     ` Markus Trippelsdorf
@ 2007-04-23  6:21       ` Ingo Molnar
  2007-04-25 11:43         ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23  6:21 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper


* Markus Trippelsdorf <markus@trippelsdorf.de> wrote:

> > The new version does not link here (amd64,smp):
> > 
> >   LD      .tmp_vmlinux1
> >   arch/x86_64/kernel/built-in.o:(.rodata+0x1dd8): undefined reference to
> >   `sys_yield_to'
> 
> Changing sys_yield_to to sys_sched_yield_to in 
> include/asm-x86_64/unistd.h fixes the problem.

thanks. I edited the -v5 patch so new downloads should have the fix. (i 
also test-booted x86_64 with this patch)

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  4:06           ` Nick Piggin
@ 2007-04-23  7:10             ` Ingo Molnar
  2007-04-23  7:25               ` Nick Piggin
  2007-04-23  9:25             ` Ingo Molnar
  1 sibling, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23  7:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper


* Nick Piggin <npiggin@suse.de> wrote:

> > yeah - but they'll all be quad core, so the SMP timeslice 
> > multiplicator should do the trick. Most of the CFS testers use 
> > single-CPU systems.
> 
> But desktop users could have quad thread and even 8 thread CPUs 
> soon, so if the number doesn't work for both then you're in trouble. 
> It just smells like a hack to scale with CPU numbers.

hm, i still like Con's approach in this case because it makes 
independent sense: in essence we calculate the "human visible" effective 
latency of a physical resource: more CPUs/threads means more parallelism 
and less visible choppiness of whatever basic chunking of workloads 
there might be, hence larger size chunking can be done.

> > it doesn't in any test i do, but again, i'm erring on the side of it 
> > being more interactive.
> 
> I'd start by erring on the side of trying to ensure no obvious 
> performance regressions like this because that's the easy part. 
> Suppose everybody finds your scheduler wonderfully interactive, but 
> you can't make it so with a larger timeslice?

look at CFS's design and you'll see that it can easily take larger 
timeslices :) I really don't need any reinforcement on that part. But i 
do need reinforcement and test results on the basic part: _can_ this 
design be interactive enough on the desktop? So far the feedback has 
been affirmative, but more testing is needed.

server scheduling, while obviously of prime importance to us, is really 
'easy' in comparison technically, because it has a lot fewer human factors 
and is thus a much more deterministic task.

> For _real_ desktop systems, sure, erring on the side of being more 
> interactive is fine. For RFC patches for testing, I really think you 
> could be taking advantage of the fact that people will give you 
> feedback on the issue.

90% of the testers are using CFS on desktops. 80% of the scheduler 
complaints concern the human (latency/behavior/consistency) 
aspect of the upstream scheduler. (Sure, we don't want to turn that 
around into '80% of the complaints come due to performance' - so i 
increased the granularity, based on your kbuild feedback, to near that of 
SD's, to show that mini-timeslices are not a necessity in CFS, but i 
really think that server scheduling is the easier part.)

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  7:10             ` Ingo Molnar
@ 2007-04-23  7:25               ` Nick Piggin
  2007-04-23  7:35                 ` Ingo Molnar
  0 siblings, 1 reply; 149+ messages in thread
From: Nick Piggin @ 2007-04-23  7:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper

On Mon, Apr 23, 2007 at 09:10:50AM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > > yeah - but they'll all be quad core, so the SMP timeslice 
> > > multiplicator should do the trick. Most of the CFS testers use 
> > > single-CPU systems.
> > 
> > But desktop users could have quad thread and even 8 thread CPUs 
> > soon, so if the number doesn't work for both then you're in trouble. 
> > It just smells like a hack to scale with CPU numbers.
> 
> hm, i still like Con's approach in this case because it makes 
> independent sense: in essence we calculate the "human visible" effective 
> latency of a physical resource: more CPUs/threads means more parallelism 
> and less visible choppiness of whatever basic chunking of workloads 
> there might be, hence larger size chunking can be done.

If there were no penalty, you would like the timeslice as small as
possible.

There is a penalty, which is why we want larger timeslices.

This penalty is still almost as significant on multiprocessor systems
as it is on single processor systems (remote memory / coherency
traffic makes it slightly worse on some multiprocessors, but nothing
like the basic cache<->RAM order of magnitude problem).


> > > it doesn't in any test i do, but again, i'm erring on the side of it 
> > > being more interactive.
> > 
> > I'd start by erring on the side of trying to ensure no obvious 
> > performance regressions like this because that's the easy part. 
> > Suppose everybody finds your scheduler wonderfully interactive, but 
> > you can't make it so with a larger timeslice?
> 
> look at CFS's design and you'll see that it can easily take larger 
> timeslices :) I really dont need any reinforcement on that part. But i 

By default, I mean.

> do need reinforcement and test results on the basic part: _can_ this 
> design be interactive enough on the desktop? So far the feedback has 
> been affirmative, but more testing is needed.

It seems to be fairly easy to make a scheduler interactive if the
timeslice is as low as that (not that I've released one for wider
testing, but just by my own observations). So I don't think we'd
need to go to rbtree based scheduling just for that.


> server scheduling, while obviously of prime importance to us, is really 
> 'easy' in comparison technically, because it has a lot less human factors 
> and is thus a much more deterministic task.

But there are lots of shades of grey (CPU efficiency on desktops
is often important, and sometimes servers need to do interactive
sorts of things).

It would be much better if a single scheduler with default
settings would be reasonable for all.


> > For _real_ desktop systems, sure, erring on the side of being more 
> > interactive is fine. For RFC patches for testing, I really think you 
> > could be taking advantage of the fact that people will give you 
> > feedback on the issue.
> 
> 90% of the testers are using CFS on desktops. 80% of the scheduler 
> complaints come regarding the human (latency/behavior/consistency) 
> aspect of the upstream scheduler. (Sure, we dont want to turn that 
> around into '80% of the complaints come due to performance' - so i 
> increased the granularity based on your kbuild feedback to near that of 
> SD's, to show that mini-timeslices are not a necessity in CFS, but i 
> really think that server scheduling is the easier part.)

So why not solve that (or at least not introduce obvious regressions),
and then focus on the hard part?

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  7:25               ` Nick Piggin
@ 2007-04-23  7:35                 ` Ingo Molnar
  0 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23  7:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper


* Nick Piggin <npiggin@suse.de> wrote:

> > do need reinforcement and test results on the basic part: _can_ this 
> > design be interactive enough on the desktop? So far the feedback has 
> > been affirmative, but more testing is needed.
> 
> It seems to be fairly easy to make a scheduler interactive if the 
> timeslice is as low as that (not that I've released one for wider 
> testing, but just by my own observations). [...]

ok, i'll bite: please release such a scheduler that does that with 
5-8-10msec range timeslices :-)

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  4:06           ` Nick Piggin
  2007-04-23  7:10             ` Ingo Molnar
@ 2007-04-23  9:25             ` Ingo Molnar
  1 sibling, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23  9:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper


* Nick Piggin <npiggin@suse.de> wrote:

> > yeah - but they'll all be quad core, so the SMP timeslice 
> > multiplicator should do the trick. Most of the CFS testers use 
> > single-CPU systems.
> 
> But desktop users could have quad thread and even 8 thread CPUs 
> soon, [...]

SMT is indeed an issue, so i think what should be used to scale 
timeslices isnt num_online_cpus(), but the sum of all CPU's ->cpu_power 
value (scaled down by SCHED_LOAD_SCALE). That way if the thread is not a 
'full CPU', then the scaling will be proportionally smaller. Can you see 
any hole in that?
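
A rough sketch of that scaling, purely as an illustration (cpu_power_of() 
here is a placeholder accessor for the per-CPU ->cpu_power value, not an 
existing helper):

static unsigned long total_cpu_power(void)
{
	unsigned long sum = 0;
	int cpu;

	/* SCHED_LOAD_SCALE is one full CPU's worth of ->cpu_power */
	for_each_online_cpu(cpu)
		sum += cpu_power_of(cpu);

	return sum;
}

static unsigned long scale_timeslice(unsigned long base_ns)
{
	/* an SMT sibling contributes less than SCHED_LOAD_SCALE, so it scales less */
	return base_ns * total_cpu_power() / SCHED_LOAD_SCALE;
}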

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* crash with CFS v4 and qemu/kvm (was: [patch] CFS scheduler, v4)
  2007-04-20 14:04 [patch] CFS scheduler, v4 Ingo Molnar
                   ` (5 preceding siblings ...)
  2007-04-23  1:12 ` [patch] CFS scheduler, -v5 Ingo Molnar
@ 2007-04-23  9:28 ` Christian Hesse
  2007-04-23 10:18     ` Ingo Molnar
  6 siblings, 1 reply; 149+ messages in thread
From: Christian Hesse @ 2007-04-23  9:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Gene Heskett, kvm-devel, Avi Kivity

On Friday 20 April 2007, Ingo Molnar wrote:
> i'm pleased to announce release -v4 of the CFS patchset.

Hi Ingo, hi Avi, hi all,

I'm trying to use kvm-20 with cfs v4 and get a crash:

eworm@revo:~$ /usr/local/kvm/bin/qemu -snapshot /mnt/data/virtual/qemu/winxp.img
kvm_run: failed entry, reason 7
kvm_run returned -8

It works (though it is a bit slow) if I start qemu with strace, so for me it 
looks like a race condition?

I did not test any earlier versions of cfs and kvm in combination - I can't 
say if it happens there as well.
-- 
Regards,
Chris

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5 (build problem - make headers_check fails)
  2007-04-23  3:19   ` [patch] CFS scheduler, -v5 (build problem - make headers_check fails) Zach Carter
@ 2007-04-23 10:03     ` Ingo Molnar
  0 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23 10:03 UTC (permalink / raw)
  To: Zach Carter
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper


* Zach Carter <linux@zachcarter.com> wrote:

> FYI, make headers_check seems to fail on this:
> 
> [carter@hoth linux-2.6]$ make headers_check

> make[2]: *** No rule to make target 
> `/src/linux-2.6/usr/include/linux/.check.sched.h', needed by 
> `__headerscheck'.  Stop.
> make[1]: *** [linux] Error 2
> make: *** [headers_check] Error 2
> [carter@hoth linux-2.6]$
> 
> This also fails if I have CONFIG_HEADERS_CHECK=y in my .config

ah, indeed - the patch below should fix this. It will be in -v6.

	Ingo

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -2,7 +2,6 @@
 #define _LINUX_SCHED_H
 
 #include <linux/auxvec.h>	/* For AT_VECTOR_SIZE */
-#include <linux/rbtree.h>	/* For run_node */
 /*
  * cloning flags:
  */
@@ -37,6 +36,8 @@
 
 #ifdef __KERNEL__
 
+#include <linux/rbtree.h>	/* For run_node */
+
 struct sched_param {
 	int sched_priority;
 };

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: crash with CFS v4 and qemu/kvm (was: [patch] CFS scheduler, v4)
@ 2007-04-23 10:18     ` Ingo Molnar
  0 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23 10:18 UTC (permalink / raw)
  To: Christian Hesse
  Cc: linux-kernel, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Gene Heskett, kvm-devel, Avi Kivity


* Christian Hesse <mail@earthworm.de> wrote:

> On Friday 20 April 2007, Ingo Molnar wrote:
> > i'm pleased to announce release -v4 of the CFS patchset.
> 
> Hi Ingo, hi Avi, hi all,
> 
> I'm trying to use kvm-20 with cfs v4 and get a crash:
> 
> eworm@revo:~$ /usr/local/kvm/bin/qemu -snapshot /mnt/data/virtual/qemu/winxp.img
> kvm_run: failed entry, reason 7
> kvm_run returned -8
> 
> It works (though it is a bit slow) if I start qemu with strace, so for 
> me it looks like a race condition?

hm. Can you work it around with:

   echo 0 > /proc/sys/kernel/sched_granularity_ns

?

If yes then this is a wakeup race: some piece of code relies on the 
upstream scheduler preempting the waker task immediately in 99% of the 
cases.

and you might want to test -v5 too which i released earlier today. It 
has no bugfix in this area though, so it will likely still trigger this 
race - but it will also hopefully be even more pleasant to use than -v4 
;-)

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  1:12 ` [patch] CFS scheduler, -v5 Ingo Molnar
                     ` (2 preceding siblings ...)
  2007-04-23  5:16   ` [patch] CFS scheduler, -v5 Markus Trippelsdorf
@ 2007-04-23 12:20   ` Guillaume Chazarain
  2007-04-23 12:36     ` Ingo Molnar
  2007-04-24 16:54   ` Christian Hesse
  4 siblings, 1 reply; 149+ messages in thread
From: Guillaume Chazarain @ 2007-04-23 12:20 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

2007/4/23, Ingo Molnar <mingo@elte.hu>:

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
+#include "sched_stats.h"
+#include "sched_rt.c"
+#include "sched_fair.c"
+#include "sched_debug.c"

Index: linux/kernel/sched_stats.h
===================================================================
--- /dev/null
+++ linux/kernel/sched_stats.h

These look unnatural if it were to be included in mainline.

WBR.

-- 
Guillaume

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23 12:20   ` Guillaume Chazarain
@ 2007-04-23 12:36     ` Ingo Molnar
  0 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23 12:36 UTC (permalink / raw)
  To: Guillaume Chazarain; +Cc: linux-kernel


* Guillaume Chazarain <guichaz@yahoo.fr> wrote:

> 2007/4/23, Ingo Molnar <mingo@elte.hu>:
> 
> Index: linux/kernel/sched.c
> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> +#include "sched_stats.h"
> +#include "sched_rt.c"
> +#include "sched_fair.c"
> +#include "sched_debug.c"
> 
> Index: linux/kernel/sched_stats.h
> ===================================================================
> --- /dev/null
> +++ linux/kernel/sched_stats.h
> 
> These look unnatural if it were to be included in mainline.

agreed - these will likely be separate modules - i just wanted to have 
an easy way of sharing infrastructure between sched.c and these.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [report] renicing X, cfs-v5 vs sd-0.46
  2007-04-23  2:42             ` [report] renicing X, cfs-v5 vs sd-0.46 Ingo Molnar
@ 2007-04-23 15:09               ` Linus Torvalds
  2007-04-23 17:19                 ` Gene Heskett
                                   ` (2 more replies)
  0 siblings, 3 replies; 149+ messages in thread
From: Linus Torvalds @ 2007-04-23 15:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett



On Mon, 23 Apr 2007, Ingo Molnar wrote:
> 
> You are completely right in the case of traditional schedulers.

And apparently I'm completely right with CFS too.

> Using CFS-v5, with Xorg at nice 0, the context-switch rate is low:
> 
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  2  0      0 472132  13712 178604    0    0     0    32  113  170 83 17  0  0  0
>  2  0      0 472172  13712 178604    0    0     0     0  112  184 85 15  0  0  0
>  2  0      0 472196  13712 178604    0    0     0     0  108  162 83 17  0  0  0
>  1  0      0 472076  13712 178604    0    0     0     0  115  189 86 14  0  0  0

Around 170 context switches per second.

> Renicing X to -10 increases context-switching, but not dramatically so, 
> because it is throttled by CFS:
> 
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  4  0      0 475752  13492 176320    0    0     0    64  116 1498 85 15  0  0  0
>  4  0      0 475752  13492 176320    0    0     0     0  107 1488 84 16  0  0  0
>  4  0      0 475752  13492 176320    0    0     0     0  140 1514 86 14  0  0  0
>  4  0      0 475752  13492 176320    0    0     0     0  107 1477 85 15  0  0  0
>  4  0      0 475752  13492 176320    0    0     0     0  122 1498 84 16  0  0  0

Did you even *look* at your own numbers? Maybe you looked at "interrupts". 
The context switch numbers go from 170 per second, to 1500 per second!

If that's not "dramatically so", I don't know what is! Just how many 
orders of magnitude worse does it have to be, to be "dramatic"? Apparently 
one order of magnitude isn't "dramatic"?

So you were wrong. The fact that it was still "usable" is a good 
indication, but how about just admitting that you were wrong, and that 
renicing X is the *WRONG*THING*TO*DO*.

Just don't do it. It's wrong. It was wrong with the old schedulers, it's 
wrong with the new scheduler, it's just WRONG.

It was a hack, and it's a failed hack. And the fact that you don't seem to 
realize that it's a failure, even when your OWN numbers clearly show that 
it's failed, is a bit scary.

		Linus

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-23  1:34             ` Nick Piggin
@ 2007-04-23 15:56               ` Linus Torvalds
  2007-04-23 19:11                 ` Ingo Molnar
  0 siblings, 1 reply; 149+ messages in thread
From: Linus Torvalds @ 2007-04-23 15:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Juliusz Chroboczek, Con Kolivas, Ingo Molnar, ck list,
	Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett



On Mon, 23 Apr 2007, Nick Piggin wrote:
> > If you have a single client, the X server is *not* more important than the 
> > client, and indeed, renicing the X server causes bad patterns: just 
> > because the client sends a request does not mean that the X server should 
> > immediately be given the CPU as being "more important". 
> 
> If the client is doing some processing, and the user moves the mouse, it
> feels much more interactive if the pointer moves rather than waits for
> the client to finish processing.

.. yes. However, that should be automatically true if the X process just 
has "enough CPU time" to merit being scheduled to.

Which it normally should always have, exactly because it's an 
"interactive" process (regardless of how the scheduler is done - any 
scheduler should always give sleepers good latency. The current one 
obviously does it by giving interactivity-bonuses, CFS does it by trying 
to be fair in giving out CPU time).

The problem tends to be the following scenario:

 - the X server is very CPU-busy, because it has lots of clients 
   connecting to it, and it's not getting any "bonus" for doing work for 
   those clients (ie it uses up its time-slice and thus becomes "less 
   important" than other processes, since it's already gotten its "fair" 
   slice of CPU - never mind that it was really unfair to not give it 
   more)

 - there is some process that is *not* dependent on X, that can (and does) 
   run, because X has spent its CPU time serving others.

but the point I'm trying to make is that X shouldn't get more CPU-time 
because it's "more important" (it's not: and as noted earlier, thinking 
that it's more important skews the problem and makes for too *much* 
scheduling). X should get more CPU time simply because it should get its 
"fair CPU share" relative to the *sum* of the clients, not relative to any 
client individually.

Once you actually do give the X server "fair share" of the CPU, I'm sure
that you can still get into bad situations (trivial example: make clients 
that on purpose do X requests that are expensive for the server, but are 
cheap to generate). But it's likely not going to be an issue in practice 
any more.

Scheduling is not something you can do "perfectly". There's no point in 
even trying. To do "perfect" scheduling, you'd have to have ESP and know 
exactly what the user expects and know the future too. What you should aim 
for is the "obvious cases".

And I don't think anybody really disputes the fact that a process that 
does work for other processes "obviously" should get the CPU time skewed 
towards it (and away from the clients - not from non-clients!). I think 
the only real issue is that nobody really knows how to do it well (or at 
all).

I think the "schedule by user" would be reasonable in practice - not 
perfect by any means, but it *does* fall into the same class of issues: 
users are not in general "more important" than other users, but they 
should be treated fairly across the user, not on a per-process basis.

		Linus

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [report] renicing X, cfs-v5 vs sd-0.46
  2007-04-23 15:09               ` Linus Torvalds
@ 2007-04-23 17:19                 ` Gene Heskett
  2007-04-23 17:19                 ` Gene Heskett
  2007-04-23 19:48                 ` Ingo Molnar
  2 siblings, 0 replies; 149+ messages in thread
From: Gene Heskett @ 2007-04-23 17:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Juliusz Chroboczek, Con Kolivas, ck list,
	Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar

On Monday 23 April 2007, Linus Torvalds wrote:
>On Mon, 23 Apr 2007, Ingo Molnar wrote:
>> You are completely right in the case of traditional schedulers.
>
>And apparently I'm completely right with CFS too.
>
>> Using CFS-v5, with Xorg at nice 0, the context-switch rate is low:
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>>  2  0      0 472132  13712 178604    0    0     0    32  113  170 83 17  0  0  0
>>  2  0      0 472172  13712 178604    0    0     0     0  112  184 85 15  0  0  0
>>  2  0      0 472196  13712 178604    0    0     0     0  108  162 83 17  0  0  0
>>  1  0      0 472076  13712 178604    0    0     0     0  115  189 86 14  0  0  0
>
>Around 170 context switches per second.
>
>> Renicing X to -10 increases context-switching, but not dramatically so,
>> because it is throttled by CFS:
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>>  4  0      0 475752  13492 176320    0    0     0    64  116 1498 85 15  0  0  0
>>  4  0      0 475752  13492 176320    0    0     0     0  107 1488 84 16  0  0  0
>>  4  0      0 475752  13492 176320    0    0     0     0  140 1514 86 14  0  0  0
>>  4  0      0 475752  13492 176320    0    0     0     0  107 1477 85 15  0  0  0
>>  4  0      0 475752  13492 176320    0    0     0     0  122 1498 84 16  0  0  0
>
>Did you even *look* at your own numbers? Maybe you looked at "interrupts".
>The context switch numbers go from 170 per second, to 1500 per second!
>
>If that's not "dramatically so", I don't know what is! Just how many
>orders of magnitude worse does it have to be, to be "dramatic"? Apparently
>one order of magnitude isn't "dramatic"?
>
>So you were wrong. The fact that it was still "usable" is a good
>indication, but how about just admitting that you were wrong, and that
>renicing X is the *WRONG*THING*TO*DO*.
>
>Just don't do it. It's wrong. It was wrong with the old schedulers, it's
>wrong with the new scheduler, it's just WRONG.
>
>It was a hack, and it's a failed hack. And the fact that you don't seem to
>realize that it's a failure, even when your OWN numbers clearly show that
>it's failed, is a bit scary.
>
>		Linus

This message prompted me to do some checking in re context switches myself, 
and I've come to the conclusion that there could be a bug in vmstat itself.

Run singly, the context switching is reasonable even for a -19 niceness of X; 
it's only showing about 200 or so on the first loop of vmstat.  But throw in 
the -n 1 arguments and it goes crazy on the second and subsequent loops.

X nice=0
[root@coyote ~]# vmstat -n 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0    324  62836  37952 518080    0    0   786   446  474  201 10  4 82  4  0
 0  0    324  62712  37952 518080    0    0     0     0 1309 2361  2  5 93  0  0
 2  0    324  62712  37952 518080    0    0     0     0 1275 2203  2  4 94  0  0
 0  0    324  62744  37952 518080    0    0     0     0 1305 2224  1  2 97  0  0
 0  0    324  62744  37952 518080    0    0     0     0 1291 2232  0  1 99  0  0

X nice=-10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0    324  62432  38052 518080    0    0   784   445  476  205 10  4 82  4  0
 0  0    324  62432  38052 518080    0    0     0     0 1190 3223  1  1 98  0  0
 2  0    324  62440  38052 518080    0    0     0     0 1209 3210  2  3 95  0  0
 0  0    324  62316  38060 518080    0    0     0   232 1201 3355  3  4 92  1  0
 2  0    324  62316  38060 518080    0    0     0     0 1207 2794  1  2 97  0  0

X nice=10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0    324  62372  38184 518132    0    0   783   445  477  209 10  4 82  4  0
 0  0    324  62372  38192 518132    0    0     0   272 1318 2262  0  3 97  0  0
 0  0    324  62372  38192 518132    0    0     0     0 1293 2249  1  4 95  0  0
 0  0    324  62248  38192 518132    0    0     0     0 1280 2443  4  2 94  0  0
 0  0    324  62248  38192 518132    0    0     0     4 1294 2272  0  3 97  0  0

Now, I have NDI which set of figures is the true set, but please note that in 
all 3 cases the reported values for cs didn't scale up and down all that much 
if separated out into 1st pass, and subsequent passes.

And, even with X nice=10, the system is still fairly smooth and usable.

This is with 2.6.21-rc7-CFS-v5 I built late last evening.  At Xnice=10 I just 
played a game of patience to watch the card animations and they were 
absolutely acceptably smooth. (and I won it in about 112 moves :)

From this user's viewpoint, it (cfs-v5) works, and works very well indeed, and 
it deserves a place as one of 3 selectable options in mainline.  The other 2 
being the existing mainline way, & Con K's sd-0.45 or later.  Both of these 
seem to be very large enhancements to the user experience over current 
mainline, which I'd discuss in terms borrowed from Joanne Dow.  Comparatively 
speaking, mainline has a very high vacuum.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Jayne: "Let's move this conversation in a not-Jayne's-fault direction."
				--Episode #14, "Objects in Space"

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-23 15:56               ` Linus Torvalds
@ 2007-04-23 19:11                 ` Ingo Molnar
  2007-04-23 19:52                   ` Linus Torvalds
                                     ` (2 more replies)
  0 siblings, 3 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23 19:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Juliusz Chroboczek, Con Kolivas, ck list,
	Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> but the point I'm trying to make is that X shouldn't get more CPU-time 
> because it's "more important" (it's not: and as noted earlier, 
> thinking that it's more important skews the problem and makes for too 
> *much* scheduling). X should get more CPU time simply because it 
> should get its "fair CPU share" relative to the *sum* of the clients, 
> not relative to any client individually.

yeah. And this is not a pipe dream and i think it does not need a 
'wakeup matrix' or other complexities.

I am --->.<---- this close to being able to do this very robustly under 
CFS via simple rules of economy and trade: there the p->wait_runtime 
metric is intentionally a "physical resource" of "hard-earned right to 
execute on the CPU, by having waited on it" the sum of which is bound 
for the whole system.

So while with other, heuristic approaches we always had the problem of 
creating a "hyper-inflation" of an uneconomic virtual currency that 
could be freely printed by certain tasks, in CFS the economy of this is 
strict and the finegrained plus/minus balance is strictly managed by a 
conservative and independent central bank.

So we can actually let tasks "trade" in these very physical units of 
"right to execute on the CPU". A task giving it to another task means 
that this task _already gave up CPU time in the past_. So it's the 
robust equivalent of an economy's "money earned" concept, and this 
"money"'s distribution (and redistribution) is totally fair and totally 
balanced and is not prone to "inflation".

The "give scheduler money" transaction can be both an "implicit 
transaction" (for example when writing to UNIX domain sockets or 
blocking on a pipe, etc.), or it could be an "explicit transaction": 
sched_yield_to(). This latter i've already implemented for CFS, but it's 
much less useful than the really significant implicit ones, the ones 
which will help X.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [report] renicing X, cfs-v5 vs sd-0.46
  2007-04-23 15:09               ` Linus Torvalds
  2007-04-23 17:19                 ` Gene Heskett
  2007-04-23 17:19                 ` Gene Heskett
@ 2007-04-23 19:48                 ` Ingo Molnar
  2007-04-23 20:56                   ` Michael K. Edwards
  2 siblings, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23 19:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> >  4  0      0 475752  13492 176320    0    0     0     0  107 1477 85 15  0  0  0
> >  4  0      0 475752  13492 176320    0    0     0     0  122 1498 84 16  0  0  0
> 
> Did you even *look* at your own numbers? Maybe you looked at 
> "interrpts". The context switch numbers go from 170 per second, to 
> 1500 per second!

i think i managed to look at the correct column :) 1500 per second is 
the absolute ceiling under CFS.

but, even though this utterly ugly hack of renicing (Arjan immediately 
slapped me for it when i mentioned it to him and he correctly predicted 
that lkml would go amok on anything like this) undeniably behaves better 
under CFS and gives a _visually better_ desktop at 1500 context 
switches per second, i share your unease about it on architectural and 
policy grounds. Doing this hack upstream could easily hinder the 
efficient creation of a healthy economy for "scheduler money", by 
forcibly hacking X out of the picture - while X could be such a nice 
(and important) prototype for a cool and useful new scheduling 
infrastructure.

Basically this hack is bad on policy grounds because it is giving X a 
"legislated, unfair monopoly" on the system. It's the equivalent of a 
state-guaranteed monopoly in certain 'strategic industries'. It has some 
advantages but it is very much net harmful. Most of the time the 
"strategic importance" of any industry can be cleanly driven by the 
normal mechanics of supply and demand: anything important is recognized 
by 'people' as important via actual actions of giving it 'money'. (This 
approach also gives formerly-strategic industries the boot quickly, were 
they to become less strategic to people as things evolve.)

still, recognizing all the very real advantages of a cleaner approach, 
my primary present goal with CFS is to reach "maximum interactivity" 
here and today on a maximally broad set of workloads, whatever it 
takes, and then to look back and figure out cleaner ways while still 
carefully keeping that maximum interactivity property of CFS.

For this particular auto-renicing hack here are the observed objective 
advantages to the user:

 1) while it's still an ugly hack, the increased context-switching rate
    (surprisingly to me!) still has actual, objective, undeniable 
    positive effects even in this totally X-centric worst-case messaging 
    scenario i tried to trigger:

        - visibly better eye-pleasing X behavior under the same
          "performance of scrolling"

        - no hung mouse pointer. Ever. I'd not go as far as Windows to 
          put the mouse refresh code into the kernel, but now having 
          experienced under CFS the 'mouse never hangs under any load' 
          phenomenon for a longer time, i have to admit i got addicted 
          to it. It gives instant, emotionally positive feedback about 
          "yes, your system is still fine, just overworked a bit", and 
          it also gives a "you caused something to happen on this box, 
          cool boy!" reassurance to the impatient human who is waiting 
          on it - be it such a minimal thing as a moving mouse 
          pointer.

    There's a new argument as well, not amongst the issues i raised 
    before: people are happily spending 40-50% of their CPU's
    power on Beryl just to get a more ergonomic desktop via 3D effects, 
    so why not allow them to achieve another type of visual ergonomy by 
    allowing an increased, maximum-throttled X context-switch rate,
    without any measurable drop in performance, to a tunable maximum? 
    I can see no easy way for X itself to control this context-switching
    "refresh" rate in a sane way, as its workload is largely detached
    from client workloads and there's no communication between clients.

 2) it's the absolute worst maximum rate you'll ever see under CFS, and
    i definitely concentrated on triggering the worst-case. On other
    schedulers i easily got to 14K context-switches per second or worse,
    depending on the X workload, which hurts performance and makes it 
    behave visually worse. On CFS the 1400 context-switches is the 
    _ceiling_, it did not measurably hurt performance and it is a tunable 
    ceiling.

 3) this behavior was totally uncontrollable on other schedulers i tried
    and indeed has hurt performance there. On CFS this is still totally
    tunable and controllable on several levels.

i'm not saying that any of this reduces the ugliness of the hack, or 
that any of this makes the strategic disadvantages of this hack 
disappear, i simply tried to point out that despite the existing 
conventional wisdom it's apparently much more useful in practice on CFS 
than on other schedulers.

And if the "economy of scheduling" experiment fails in practice for some 
presently unknown technological reason, we might well have to go back 
to ugly tricks like this one. With its 5 lines and limited scope i think 
it still beats 500 lines of convoluted scheduling heuristics :-/ Right 
now i'm very positive about the "economy of scheduling" angle, i think 
we have a realistic chance to pull it off.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-23 19:11                 ` Ingo Molnar
@ 2007-04-23 19:52                   ` Linus Torvalds
  2007-04-23 20:33                     ` Ingo Molnar
                                       ` (3 more replies)
  2007-04-23 20:05                   ` Willy Tarreau
  2007-04-24 21:05                   ` 'Scheduler Economy' prototype patch for CFS Ingo Molnar
  2 siblings, 4 replies; 149+ messages in thread
From: Linus Torvalds @ 2007-04-23 19:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Juliusz Chroboczek, Con Kolivas, ck list,
	Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett



On Mon, 23 Apr 2007, Ingo Molnar wrote:
> 
> The "give scheduler money" transaction can be both an "implicit 
> transaction" (for example when writing to UNIX domain sockets or 
> blocking on a pipe, etc.), or it could be an "explicit transaction": 
> sched_yield_to(). This latter i've already implemented for CFS, but it's 
> much less useful than the really significant implicit ones, the ones 
> which will help X.

Yes. It would be wonderful to get it working automatically, so please say 
something about the implementation..

The "perfect" situation would be that when somebody goes to sleep, any 
extra points it had could be given to whoever it woke up last. Note that 
for something like X, it means that the points are 100% ephemeral: it gets 
points when a client sends it a request, but it would *lose* the points 
again when it sends the reply!

So it would only accumulate "scheduling points" while multiple clients 
are actively waiting for it, which actually sounds like exactly the right 
thing. However, I don't really see how to do it well, especially since the 
kernel cannot actually match up the client that gave some scheduling 
points to the reply that X sends back.

There are subtle semantics with these kinds of things: especially if the 
scheduling points are only awarded when a process goes to sleep, if X is 
busy and continues to use the CPU (for another client), it wouldn't give 
any scheduling points back to clients and they really do accumulate with 
the server. Which again sounds like it would be exactly the right thing 
(both in the sense that the server that runs more gets more points, but 
also in the sense that we *only* give points at actual scheduling events).

But how do you actually *give/track* points? A simple "last woken up by 
this process" thing that triggers when it goes to sleep? It might work, 
but on the other hand, especially with more complex things (and networking 
tends to be pretty complex) the actual wakeup may be done by a software 
irq. Do we just say "it ran within the context of X, so we assume X was 
the one that caused it?" It probably would work, but we've generally tried 
very hard to avoid accessing "current" from interrupt context, including 
bh's..
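
A minimal sketch of that "last woken up by this process" bookkeeping, just 
to make the question concrete (the p->last_woken field and both helpers are 
assumptions, nothing like this exists in the posted patches):

/* remember whom we woke most recently */
static inline void note_wakeup(struct task_struct *waker,
			       struct task_struct *wakee)
{
	waker->last_woken = wakee;
}

/* on going to sleep, hand any surplus wait_runtime to the last wakee */
static inline void donate_on_sleep(struct task_struct *p)
{
	struct task_struct *to = p->last_woken;

	if (to && p->wait_runtime > 0) {
		to->wait_runtime += p->wait_runtime;
		p->wait_runtime = 0;
	}
}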

			Linus

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-23 19:11                 ` Ingo Molnar
  2007-04-23 19:52                   ` Linus Torvalds
@ 2007-04-23 20:05                   ` Willy Tarreau
  2007-04-24 21:05                   ` 'Scheduler Economy' prototype patch for CFS Ingo Molnar
  2 siblings, 0 replies; 149+ messages in thread
From: Willy Tarreau @ 2007-04-23 20:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Nick Piggin, Juliusz Chroboczek, Con Kolivas,
	ck list, Bill Davidsen, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

Hi !

On Mon, Apr 23, 2007 at 09:11:43PM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > but the point I'm trying to make is that X shouldn't get more CPU-time 
> > because it's "more important" (it's not: and as noted earlier, 
> > thinking that it's more important skews the problem and makes for too 
> > *much* scheduling). X should get more CPU time simply because it 
> > should get its "fair CPU share" relative to the *sum* of the clients, 
> > not relative to any client individually.
> 
> yeah. And this is not a pipe dream and i think it does not need a 
> 'wakeup matrix' or other complexities.
> 
> I am --->.<---- this close to being able to do this very robustly under 
> CFS via simple rules of economy and trade: there the p->wait_runtime 
> metric is intentionally a "physical resource" of "hard-earned right to 
> execute on the CPU, by having waited on it" the sum of which is bound 
> for the whole system.
>
> So while with other, heuristic approaches we always had the problem of 
> creating a "hyper-inflation" of an uneconomic virtual currency that 
> could be freely printed by certain tasks, in CFS the economy of this is 
> strict and the finegrained plus/minus balance is strictly managed by a 
> conservative and independent central bank.
> 
> So we can actually let tasks "trade" in these very physical units of 
> "right to execute on the CPU". A task giving it to another task means 
> that this task _already gave up CPU time in the past_. So it's the 
> robust equivalent of an economy's "money earned" concept, and this 
> "money"'s distribution (and redistribution) is totally fair and totally 
> balanced and is not prone to "inflation".
> 
> The "give scheduler money" transaction can be both an "implicit 
> transaction" (for example when writing to UNIX domain sockets or 
> blocking on a pipe, etc.), or it could be an "explicit transaction": 
> sched_yield_to(). This latter i've already implemented for CFS, but it's 
> much less useful than the really significant implicit ones, the ones 
> which will help X.

I don't think that a task should _give_ its slice to the task it's waiting
on, but it should _lend_ it: if the second task (the server, e.g. X) does
not eat everything, maybe the first one will need to use the remains.

We had a good example with glxgears. Glxgears may need more CPU than X on
some machines, less on others. But it needs CPU. So it must not give all
it has to X otherwise it will stop. But it can tell X "hey, if you need
some CPU, I have some here, help yourself". When X has exhausted its slice,
it can then use some from the client. Hmmm no, better, X first serves itself
in the client's share, and may then use (parts of) its own if it needs more.

This could be seen as CPU resource buckets. Indeed, it's even a
problem of economy as you said. If you want someone to do something for you,
either it's very quick and simple and he can do it for free once in a while,
or you take all of his time and you have to pay him for this time.

Of course, you don't always know whom X is working for, and this will cause
X to sometimes run for one task on another one's resources. But as long as
the work is done, it's OK. Hey, after all, many of us sometimes work for
customers in real life and take some of their time to work on the kernel
and everyone is happy with it.

I think that if we could have a (small) list of CPU buckets per task, it
would permit us to do such a thing. We would then have to ensure that
pipes or unix sockets correctly present their buckets to their servers.
If we consider that each task only has its own bucket and can lend it to
one and only one server at a time, it should not look too horrible.

Basically, just something like this (thinking while typing):

struct task_struct {
  ...
  struct {
     struct list_head list;
     int money_left;
  } cpu_bucket;
  ...
};

Then, waking up another process would consist of linking our bucket into
its own bucket list. The server can identify the task it's borrowing
from by looking at which task_struct the list belongs to.
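
A minimal sketch of that lending step, on top of the structure above (the 
borrowed_buckets list on the server side and both helpers are assumptions, 
nothing from any posted patch):

/* a client lends its (single) bucket to the server it is waking up */
static void lend_bucket(struct task_struct *client, struct task_struct *server)
{
	list_add_tail(&client->cpu_bucket.list, &server->borrowed_buckets);
}

/* the server identifies the lender from a borrowed bucket */
static struct task_struct *bucket_owner(struct list_head *entry)
{
	return list_entry(entry, struct task_struct, cpu_bucket.list);
}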

Also, it creates some inheritance between processes. When doing such a
thing:


 $ fgrep DST=1.2.3.4 fw.log | sed -e 's/1.2/A.B/' | gzip -c3 >fw-anon.gz

Then fgrep would lend some CPU to sed which in turn would present them both
to gzip. Maybe we need two lists in order for the structures to be unstacked
upon gzip's sleep() :-/

I don't know if I'm clear enough.

Cheers,
Willy


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-23 19:52                   ` Linus Torvalds
@ 2007-04-23 20:33                     ` Ingo Molnar
  2007-04-23 20:44                       ` Ingo Molnar
                                         ` (2 more replies)
  2007-04-23 22:48                     ` Jeremy Fitzhardinge
                                       ` (2 subsequent siblings)
  3 siblings, 3 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23 20:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Juliusz Chroboczek, Con Kolivas, ck list,
	Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > The "give scheduler money" transaction can be both an "implicit 
> > transaction" (for example when writing to UNIX domain sockets or 
> > blocking on a pipe, etc.), or it could be an "explicit transaction": 
> > sched_yield_to(). This latter i've already implemented for CFS, but 
> > it's much less useful than the really significant implicit ones, the 
> > ones which will help X.
> 
> Yes. It would be wonderful to get it working automatically, so please 
> say something about the implementation..

i agree that the devil will be in the details, but so far it's really 
simple. I'll put all this into separate helper functions so that places 
can just use it in a natural way. The existing yield-to bit is this:

static void
yield_task_fair(struct rq *rq, struct task_struct *p, struct task_struct *p_to)
{
        struct rb_node *curr, *next, *first;
        struct task_struct *p_next;

        /*
         * yield-to support: if we are on the same runqueue then
         * give half of our wait_runtime (if it's positive) to the other task:
         */
        if (p_to && p->wait_runtime > 0) {
                p->wait_runtime >>= 1;
                p_to->wait_runtime += p->wait_runtime;
        }

the above is the basic expression of: "charge a positive bank balance". 

(we obviously dont want to allow people to 'share' their loans with 
others ;), nor do we want to allow a net negative balance. CFS is really 
brutally cold-hearted, it has a strict 'no loans' policy - the easiest 
economic way to manage 'inflation', besides the basic act of not 
printing new money, ever.)

[note, due to the nanoseconds unit there's no rounding loss to worry 
about.]

that's all. No runqueue locking, no wakeup decisions even! [Note: see 
detail #1 below for cases where we need to touch the tree]. Really 
low-overhead. Accumulated 'new money' will be acted upon in the next 
schedule() call or in the next scheduler tick, whichever comes sooner. 
Note that in most cases when tasks communicate there will be a natural 
schedule() anyway, which drives this.

p->wait_runtime is also very finegrained: it is in nanoseconds, so a 
task can 'pay' at arbitrary granularity in essence, and there is in 
essence zero 'small coin overhead' and 'conversion loss' in this money 
system. (as you might remember, sharing p->timeslice had inherent 
rounding and sharing problems due to its low jiffy resolution)

detail #1: for decoupled workloads where there is no direct sleep/wake 
coupling between worker and producer, there should also be a way to 
update a task's position in the fairness tree, if it accumulates 
significant amount of new p->wait_runtime. I think this can be done by 
making this an extra field: p->new_wait_runtime, which gets picked up by 
the task if it runs, or which gets propagated into the task's tree 
position if the p->new_wait_runtime value goes above the 
sched_granularity_ns value. But it would work pretty well even without 
this, the server will take advantage of the p->new_wait_runtime 
immediately when it runs, so as long as enough clients 'feed' it with 
money, it will always have enough to keep going.

detail #2: changes to p->wait_runtime are totally lockless, as long as 
they are 64-bit atomic. So the above code is a bit naive on 32-bit 
systems, but no locking is needed otherwise, other than having a stable 
reference to a task structure. (i designed CFS for 64-bit systems)

detail #3: i suspect i should rename p->wait_runtime to a more intuitive 
name - perhaps p->right_to_run? I want to avoid calling it p->timeslice 
because it's not really a timeslice, it's the thing you earned, the 
'timeslice' is a totally locally decided property that has no direct 
connection to this physical resource. I also dont want to call it 
p->cpu_credit, because it is _not_ a credit system: every positive value 
there has been earned the hard way: by 'working' for the system via 
waiting on the runqueue - scaled down to the 'expected fair runtime' - 
i.e. roughly scaled down by 1/rq->nr_running.

detail #4: the scheduler is also a charity: when it has no other work 
left it will let tasks execute "for free" ;-) But otherwise, in any sort 
of saturated market situation CFS is very much a cold hearted 
capitalist.

about the 50% rule: it was a totally arbitrary case for yield_to(), and 
in other cases it should rather be: "give me _all_ the money you have, 
i'll make it work for you as much as i can". And the receiver should 
also perhaps record the amount of 'money' it got from the client, and 
_give back_ any unused proportion of it. (only where easily doable, in 
1:1 task relationships) I.e.:

        p_to->wait_runtime += p->wait_runtime;
	p->wait_runtime = 0;

	schedule();

the former two lines put into a sched_pay(p) API perhaps?
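
a minimal sketch of such a helper (illustrative only - sched_pay() is just 
the name floated above, nothing that exists in the patchset):

static inline void sched_pay(struct task_struct *p)
{
	/* hand current's whole positive balance over to @p */
	if (current->wait_runtime > 0) {
		p->wait_runtime += current->wait_runtime;
		current->wait_runtime = 0;
	}
}

with that, the explicit form above becomes: sched_pay(p_to); schedule();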

> The "perfect" situation would be that when somebody goes to sleep, any 
> extra points it had could be given to whoever it woke up last. Note 
> that for something like X, it means that the points are 100% 
> ephemeral: it gets points when a client sends it a request, but it 
> would *lose* the points again when it sends the reply!

yeah, exactly. X could even attempt to explicitly manage some of those 
payments it received: it already has an internal automatic notion of 
'expensive clients' (which do large X requests).

> There are subtle semantics with these kinds of things: especially if 
> the scheduling points are only awarded when a process goes to sleep, 
> if X is busy and continues to use the CPU (for another client), it 
> wouldn't give any scheduling points back to clients and they really do 
> accumulate with the server. Which again sounds like it would be 
> exactly the right thing (both in the sense that the server that runs 
> more gets more points, but also in the sense that we *only* give 
> points at actual scheduling events).

yeah. Not only would it cause accumulation of 'money', that money would 
buy it lower latencies as well: with more p->wait_runtime X will get on 
the CPU faster. So it's a basic "work batching" mechanism.

> But how do you actually *give/track* points? [...]

tracking: these points are all hard-earned and their sum in the system 
has a maximum. I.e. any point you 'give' to X was something you already 
earned and it's something that the scheduler would have given CPU time 
for in the near future. So as long as the transaction is balanced (the 
total sum does not change), this does not create any unfairness 
anywhere.

giving: it can be done lockless (64-bit atomic, they are nanoseconds). 
'Collecting' p->new_wait_runtime up to sched_granularity_ns will also 
make sure this field is the only typical impact that clients have on the 
server - the tree position will only be recalculated at a frequency 
determined by sched_granularity_ns.

> [...] A simple "last woken up by this process" thing that triggers 
> when it goes to sleep? It might work, but on the other hand, 
> especially with more complex things (and networking tends to be pretty 
> complex) the actual wakeup may be done by a software irq. Do we just 
> say "it ran within the context of X, so we assume X was the one that 
> caused it?" It probably would work, but we've generally tried very 
> hard to avoid accessing "current" from interrupt context, including 
> bh's..

i'd concentrate on specific synchronous instances first, where we know 
the producer and the consumer.

Later on we could attach "money" to localhost network packets for 
example, to 'spread' fairness into more complex parts of the system. 
(Note that such 'money' could even be passed along with a _physical_ 
packet, for example in a cluster - so it makes lots of sense on both the 
small and the large scale.) I'd not directly attach 'money' to softirqs, 
i'd attach it to the _object of work_: the packet, the workqueue entry, 
the IO request, etc., etc.
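
purely as an illustration (this struct is invented for the example, it is 
not in the patch), the 'money' would simply travel as a field inside the 
object of work:

	struct work_item {
		struct list_head	list;
		u64			donation_ns;	/* 'money' attached by the producer */
		void			(*func)(struct work_item *);
	};

	/* whoever ends up consuming the item gets paid for it: */
	static void process_one(struct work_item *w)
	{
		current->wait_runtime += w->donation_ns;	/* rq locking omitted */
		w->donation_ns = 0;
		w->func(w);
	}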

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-23 20:33                     ` Ingo Molnar
@ 2007-04-23 20:44                       ` Ingo Molnar
  2007-04-23 21:03                         ` Ingo Molnar
  2007-04-23 21:53                       ` Guillaume Chazarain
  2007-04-24  7:04                       ` Rogan Dawes
  2 siblings, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23 20:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Juliusz Chroboczek, Con Kolivas, ck list,
	Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett


* Ingo Molnar <mingo@elte.hu> wrote:

> (we obviously dont want to allow people to 'share' their loans with 
> others ;), nor do we want to allow a net negative balance. CFS is 
> really brutally cold-hearted, it has a strict 'no loans' policy - the 
> easiest economic way to manage 'inflation', besides the basic act of 
> not printing new money, ever.)

sorry, i was a bit imprecise here. There is a case where CFS can give 
out a 'loan' to tasks. The scheduler tick has a low resolution, so it is 
fundamentally inevitable [*] that tasks will run a bit more than they 
should, and at heavy context-switching rates these errors can add up 
significantly. Furthermore, we want to batch up workloads.

So CFS has a "no loans larger than sched_granularity_ns" policy (which 
defaults to 5msec), and it captures these sub-granularity 'loans' with 
nanosec accounting. This too is a very sane economic policy and is 
anti-inflationary :-)
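
in sketch form (field and helper names are illustrative, this is not the 
exact cfs-v4 code):

	/* nanosec accounting of what the task really ran: */
	delta_exec_ns = now_ns - p->exec_start_ns;
	p->wait_runtime -= delta_exec_ns;

	/* cap the 'loan': once a task has run more than one granularity
	 * past its fair share, it gets preempted */
	if (p->wait_runtime < -(s64)sched_granularity_ns)
		resched_task(p);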

	Ingo

[*] i fundamentally hate 'fundamentally inevitable' conditions so i 
    have plans to make the scheduler tick be fed from the rbtree and 
    thus become a true high-resolution timer. This not only increases 
    fairness (=='precision of scheduling') more, but it also decreases 
    the number of timer interrupts on a running system - extending 
    dynticks to sched-ticks too. Thomas and I shaped dynticks to enable 
    that in an easy way: the scheduler tick is today already a high-res 
    timer (but which is currently still driven via the jiffy mechanism).

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [report] renicing X, cfs-v5 vs sd-0.46
  2007-04-23 19:48                 ` Ingo Molnar
@ 2007-04-23 20:56                   ` Michael K. Edwards
  0 siblings, 0 replies; 149+ messages in thread
From: Michael K. Edwards @ 2007-04-23 20:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Juliusz Chroboczek, Con Kolivas, ck list,
	Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett

On 4/23/07, Ingo Molnar <mingo@elte.hu> wrote:
> Basically this hack is bad on policy grounds because it is giving X a
> "legislated, unfair monopoly" on the system. It's the equivalent of a
> state-guaranteed monopoly in certain 'strategic industries'. It has some
> advantages but it is very much net harmful. Most of the time the
> "strategic importance" of any industry can be cleanly driven by the
> normal mechanics of supply and demand: anything important is recognized
> by 'people' as important via actual actions of giving it 'money'. (This
> approach also gives formerly-strategic industries the boot quickly, were
> they to become less strategic to people as things evolve.)

If you're going to drag free-market economics into it, why not
actually use the techniques of free-market economics?  Design a
bidding system in which agents (tasks) earn "money" by getting things
done, and can use that "money" to bid on "resources".  You will of
course need accurate cost accounting in order to decide which bids are
most "profitable" for the scheduler to accept, and accurate transfer
accounting to design price structures for contracts between agents in
which one agrees to accomplish work on behalf of another.  Actual
revenues come from doing the work that the consumer wants done and is
willing to pay for.  Etc., etc.  Has your horsepucky filter kicked in
yet?

If your system doesn't work this way -- perhaps because you think as I
do that scheduler design is principally an engineering problem, not an
economics problem -- then analogies from economics are probably worth
zip.  Yes, I wrote earlier about "economic dispatch" -- that's an
operations problem, a control theory problem, an _engineering_
problem, that happens to have a set of engineering goals and
constraints that take profitability into account.  I think you might
be able to design a better Linux scheduler anchored in the techniques
and literature of control theory, perhaps specifically with reference
to electric-utility economic dispatch, because the systems under
control and the goals of control are similar.

But there's a good reason not to treat X as special.  Namely, that it
_isn't_.  It may be the only program on many people's Linux desktops
with an opaque control structure -- a separate class of interactive
activities hidden inside an oversubscribed push-model pipeline stage
-- but it's hardly the only program designed this way.  Treat the X
server as an easily instrumented exemplar of an event-loop-centric
design whose thread structure doesn't distinguish between fast-twitch
and best-effort activity patterns.  I wrote earlier about what one
might do about this (attach urgency to the work in the queue
instead of the worker being asked to do it).

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-23 20:44                       ` Ingo Molnar
@ 2007-04-23 21:03                         ` Ingo Molnar
  0 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23 21:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Juliusz Chroboczek, Con Kolivas, ck list,
	Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett


* Ingo Molnar <mingo@elte.hu> wrote:

> sorry, i was a bit imprecise here. There is a case where CFS can give 
> out a 'loan' to tasks. The scheduler tick has a low resolution, so it 
> is fundamentally inevitable [*] that tasks will run a bit more than 
> they should, and at heavy context-switching rates these errors can 
> add up significantly. Furthermore, we want to batch up workloads.
> 
> So CFS has a "no loans larger than sched_granularity_ns" policy (which 
> defaults to 5msec), and it captures these sub-granularity 'loans' with 
> nanosec accounting. This too is a very sane economic policy and is 
> anti-inflationary :-)

at which point i guess i should rename CFS to 'EFS' (the Economic Fair 
Scheduler)? =B-)

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-23 20:33                     ` Ingo Molnar
  2007-04-23 20:44                       ` Ingo Molnar
@ 2007-04-23 21:53                       ` Guillaume Chazarain
  2007-04-24  7:04                       ` Rogan Dawes
  2 siblings, 0 replies; 149+ messages in thread
From: Guillaume Chazarain @ 2007-04-23 21:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Nick Piggin, Juliusz Chroboczek, Con Kolivas,
	ck list, Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

2007/4/23, Ingo Molnar <mingo@elte.hu>:
>                 p->wait_runtime >>= 1;
>                 p_to->wait_runtime += p->wait_runtime;

I have no problem with clients giving some credit to X,
I am more concerned with X giving half of its credit to
a single client, a quarter of its credit to another client, etc...

For example, a client could set up a periodic wakeup
from X, and then periodically get some credit for free.
Would that be possible?

Thanks.

-- 
Guillaume

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, v4
  2007-04-22  8:30 ` Michael Gerdau
@ 2007-04-23 22:47   ` Ingo Molnar
  0 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-23 22:47 UTC (permalink / raw)
  To: Michael Gerdau
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett


* Michael Gerdau <mgd@technosis.de> wrote:

> > i'm pleased to announce release -v4 of the CFS patchset. The patch 
> > against v2.6.21-rc7 can be downloaded from:
> > 
> >     http://redhat.com/~mingo/cfs-scheduler/
> 
> I can't get 2.6.21-rc7-CFS-v4 to boot. Immediately after selecting 
> this kernel I see a very fast scrolling (loop?) sequence of addrs 
> which I don't know how to stop to write them down. They don't appear 
> in any kernel log either. However I see two Tux at the top of the 
> screen.

could you try -v5? It has at least one such bug fixed. (If it still 
happens then please try to make a digital picture of the screen and send 
the picture to me.)

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-23 19:52                   ` Linus Torvalds
  2007-04-23 20:33                     ` Ingo Molnar
@ 2007-04-23 22:48                     ` Jeremy Fitzhardinge
  2007-04-24  0:59                       ` Li, Tong N
  2007-04-24  3:46                     ` Peter Williams
  2007-04-24 15:08                     ` Ray Lee
  3 siblings, 1 reply; 149+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-23 22:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nick Piggin, Juliusz Chroboczek, Con Kolivas,
	ck list, Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

Linus Torvalds wrote:
> The "perfect" situation would be that when somebody goes to sleep, any 
> extra points it had could be given to whoever it woke up last. Note that 
> for something like X, it means that the points are 100% ephemeral: it gets 
> points when a client sends it a request, but it would *lose* the points 
> again when it sends the reply!
>
> So it would only accumulate "scheduling points" while multiple clients 
> are actively waiting for it, which actually sounds like exactly the right 
> thing. However, I don't really see how to do it well, especially since the 
> kernel cannot actually match up the client that gave some scheduling 
> points to the reply that X sends back.
>   

This works out in quite an interesting way.  If the economy is closed -
all clients and servers are managed by the same scheduler - then the
server could get no inherent CPU priority and live entirely on donated
shares.  If that were the case, you'd have to make sure that the server
used the donation from client A on client A's work, otherwise you'd get
freeloaders - but maybe it will all work out.

It gets more interesting when you have a non-closed system - the X
server is working on behalf of external clients over TCP.  Presumably
wakeups from incoming TCP connections wouldn't have any scheduler shares
associated with them, so the X server would have to use its inherent CPU
allocation to service those requests.  Or the external client could
effectively end up freeloading off portions of the local clients' donations.

    J

^ permalink raw reply	[flat|nested] 149+ messages in thread

* RE: [REPORT] cfs-v4 vs sd-0.44
  2007-04-23 22:48                     ` Jeremy Fitzhardinge
@ 2007-04-24  0:59                       ` Li, Tong N
  2007-04-24  1:57                         ` Bill Huey
  2007-04-24 21:27                         ` William Lee Irwin III
  0 siblings, 2 replies; 149+ messages in thread
From: Li, Tong N @ 2007-04-24  0:59 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Linus Torvalds
  Cc: Ingo Molnar, Nick Piggin, Juliusz Chroboczek, Con Kolivas,
	ck list, Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

I don't know if we've discussed this or not. Since both CFS and SD claim
to be fair, I'd like to hear more opinions on the fairness aspect of
these designs. In areas such as OS, networking, and real-time, fairness,
and its more general form, proportional fairness, are well-defined
terms. In fact, perfect fairness is not feasible since it requires all
runnable threads to be running simultaneously and scheduled with
infinitesimally small quanta (like a fluid system). So to evaluate if a
new scheduling algorithm is fair, the common approach is to take the
ideal fair algorithm (often referred to as Generalized Processor
Scheduling or GPS) as a reference model and analyze if the new algorithm
can achieve a constant error bound (different error metrics also exist).
I understand that via experiments we can show a design is reasonably
fair in the common case, but IMHO, to claim that a design is fair, there
needs to be some kind of formal analysis on the fairness bound, and this
bound should be proven to be constant. Even if the bound is not
constant, at least this analysis can help us better understand and
predict the degree of fairness that users would experience (e.g., would
the system be less fair if the number of threads increases? What happens
if a large number of threads dynamically join and leave the system?).
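
(For concreteness, one common way such a bound is stated - this is the 
textbook GPS-relative form, not a claim about CFS or SD: with weights w_i 
and S_i(t1,t2) the CPU time task i receives in [t1,t2], the scheduler is 
proportionally fair with constant error E if, for any interval and any 
two continuously runnable tasks i and j,

	| S_i(t1,t2)/w_i - S_j(t1,t2)/w_j |  <=  E

where E is independent of the number of runnable tasks; GPS itself 
achieves E = 0.)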

  tong

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  0:59                       ` Li, Tong N
@ 2007-04-24  1:57                         ` Bill Huey
  2007-04-24 18:01                           ` Li, Tong N
  2007-04-24 21:27                         ` William Lee Irwin III
  1 sibling, 1 reply; 149+ messages in thread
From: Bill Huey @ 2007-04-24  1:57 UTC (permalink / raw)
  To: Li, Tong N
  Cc: Jeremy Fitzhardinge, Linus Torvalds, Ingo Molnar, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett, Bill Huey (hui)

On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> I don't know if we've discussed this or not. Since both CFS and SD claim
> to be fair, I'd like to hear more opinions on the fairness aspect of
> these designs. In areas such as OS, networking, and real-time, fairness,
> and its more general form, proportional fairness, are well-defined
> terms. In fact, perfect fairness is not feasible since it requires all
> runnable threads to be running simultaneously and scheduled with
> infinitesimally small quanta (like a fluid system). So to evaluate if a

Unfortunately, fairness is rather non-formal in this context and probably
isn't strictly desirable given how hacky much of Linux userspace is. Until
there's a method of doing directed yields - like what Will has prescribed,
a kind of allotment to a thread doing work for another - a completely
strict mechanism is probably problematic with regards to corner cases.

X for example is largely non-thread-safe. Until they can get their xcb
framework in place and additional thread infrastructure to do hand-off
properly, it's going to be difficult to schedule for it. It's well known to
be problematic.

You announced your scheduler without CCing any of the relevant people here
(and risk being completely ignored in lkml traffic):

	http://lkml.org/lkml/2007/4/20/286

What is your opinion of both CFS and SD? How can your work be useful
to either scheduler mentioned or to the Linux kernel on its own?

> I understand that via experiments we can show a design is reasonably
> fair in the common case, but IMHO, to claim that a design is fair, there
> needs to be some kind of formal analysis on the fairness bound, and this
> bound should be proven to be constant. Even if the bound is not
> constant, at least this analysis can help us better understand and
> predict the degree of fairness that users would experience (e.g., would
> the system be less fair if the number of threads increases? What happens
> if a large number of threads dynamically join and leave the system?).

Will has been thinking about this, but you have to also consider the
practicalities of your approach versus Con's and Ingo's.

I'm all for things like proportional scheduling and the extensions
needed to do it properly. It would be highly relevant to some version
of the -rt patch if not that patch directly.

bill


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-23 19:52                   ` Linus Torvalds
  2007-04-23 20:33                     ` Ingo Molnar
  2007-04-23 22:48                     ` Jeremy Fitzhardinge
@ 2007-04-24  3:46                     ` Peter Williams
  2007-04-24  4:52                       ` Arjan van de Ven
  2007-04-24 15:08                     ` Ray Lee
  3 siblings, 1 reply; 149+ messages in thread
From: Peter Williams @ 2007-04-24  3:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nick Piggin, Juliusz Chroboczek, Con Kolivas,
	ck list, Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Gene Heskett

Linus Torvalds wrote:
> 
> On Mon, 23 Apr 2007, Ingo Molnar wrote:
>> The "give scheduler money" transaction can be both an "implicit 
>> transaction" (for example when writing to UNIX domain sockets or 
>> blocking on a pipe, etc.), or it could be an "explicit transaction": 
>> sched_yield_to(). This latter i've already implemented for CFS, but it's 
>> much less useful than the really significant implicit ones, the ones 
>> which will help X.
> 
> Yes. It would be wonderful to get it working automatically, so please say 
> something about the implementation..
> 
> The "perfect" situation would be that when somebody goes to sleep, any 
> extra points it had could be given to whoever it woke up last. Note that 
> for something like X, it means that the points are 100% ephemeral: it gets 
> points when a client sends it a request, but it would *lose* the points 
> again when it sends the reply!
> 
> So it would only accumulate "scheduling points" while multiple clients 
> are actively waiting for it, which actually sounds like exactly the right 
> thing. However, I don't really see how to do it well, especially since the 
> kernel cannot actually match up the client that gave some scheduling 
> points to the reply that X sends back.
> 
> There are subtle semantics with these kinds of things: especially if the 
> scheduling points are only awarded when a process goes to sleep, if X is 
> busy and continues to use the CPU (for another client), it wouldn't give 
> any scheduling points back to clients and they really do accumulate with 
> the server. Which again sounds like it would be exactly the right thing 
> (both in the sense that the server that runs more gets more points, but 
> also in the sense that we *only* give points at actual scheduling events).
> 
> But how do you actually *give/track* points? A simple "last woken up by 
> this process" thing that triggers when it goes to sleep? It might work, 
> but on the other hand, especially with more complex things (and networking 
> tends to be pretty complex) the actual wakeup may be done by a software 
> irq. Do we just say "it ran within the context of X, so we assume X was 
> the one that caused it?" It probably would work, but we've generally tried 
> very hard to avoid accessing "current" from interrupt context, including 
> bh's.

Within reason, it's not the number of clients that X has that causes its 
CPU bandwidth use to sky rocket and cause problems.  It's more to do 
with what type of clients they are.  Most GUIs (even ones that are 
constantly updating visual data (e.g. gkrellm -- I can open quite a 
large number of these without increasing X's CPU usage very much)) cause 
very little load on the X server.  The exceptions to this are the 
various terminal emulators (e.g. xterm, gnome-terminal, etc.) when being 
used to run output intensive command line programs e.g. try "ls -lR /" 
in an xterm.  The other way (that I've noticed) that X's CPU bandwidth 
usage can sky rocket is when you grab a large window and wiggle it about 
a lot, but hopefully that doesn't happen often, so the problem that needs 
to be addressed is the one caused by text output on xterm and its ilk.

So I think that an elaborate scheme for distributing "points" between X 
and its clients would be overkill.  A good scheduler will make sure 
other tasks such as audio streamers get CPU when they need it with good 
responsiveness even when X takes off by giving them higher priority 
because their CPU bandwidth use is low.

The one problem that might still be apparent in these cases is the mouse 
becoming jerky while X is working like crazy to spew out text too fast 
for anyone to read.  But the only way to fix that is to give X more 
bandwidth but if it's already running at about 95% of a CPU that's 
unlikely to help.  To fix this you would probably need to modify X so 
that it knows re-rendering the cursor is more important than rendering 
text in an xterm.

In normal circumstances, the re-rendering of the mouse happens quickly 
enough for the user to experience good responsiveness because X's normal 
CPU use is low enough for it to be given high priority.

Just because the O(1) scheduler tried this model and failed doesn't mean that the 
model is bad.  O(1) was a flawed implementation of a good model.

Peter
PS Doing a kernel build in an xterm isn't an example of high enough 
output to cause a problem as (on my system) it only raises X's 
consumption from 0-2% to 2-5%.  The type of output that causes the 
problem is usually flying past too fast to read.
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  3:46                     ` Peter Williams
@ 2007-04-24  4:52                       ` Arjan van de Ven
  2007-04-24  6:21                         ` Peter Williams
  0 siblings, 1 reply; 149+ messages in thread
From: Arjan van de Ven @ 2007-04-24  4:52 UTC (permalink / raw)
  To: Peter Williams
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Juliusz Chroboczek,
	Con Kolivas, ck list, Bill Davidsen, Willy Tarreau,
	William Lee Irwin III, linux-kernel, Andrew Morton,
	Mike Galbraith, Thomas Gleixner, caglar, Gene Heskett


> Within reason, it's not the number of clients that X has that causes its 
> CPU bandwidth use to sky rocket and cause problems.  It's more to do 
> with what type of clients they are.  Most GUIs (even ones that are 
> constantly updating visual data (e.g. gkrellm -- I can open quite a 
> large number of these without increasing X's CPU usage very much)) cause 
> very little load on the X server.  The exceptions to this are the 


there are actually 2 and not just 1 "X server", and they are VERY VERY
different in behavior.

Case 1: Accelerated driver

If X talks to a decent enough card it supports well with acceleration,
it will be very rare for X itself to spend any kind of significant
amount of CPU time, all the really heavy stuff is done in hardware, and
asynchronously at that. A bit of batching will greatly improve system
performance in this case.

Case 2: Unaccelerated VESA

Some drivers in X, especially the VESA and NV drivers (which are quite
common, vesa is used on all hardware without a special driver nowadays),
have no or not enough acceleration to matter for modern desktops. This
means the CPU is doing all the heavy lifting, in the X program. In this
case even a simple "move the window a bit" becomes quite a bit of a CPU
hog already.

The cases are fundamentally different in behavior, because in the first
case, X hardly consumes the time it would get in any scheme, while in
the second case X really is CPU bound and will happily consume any CPU
time it can get.



-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  4:52                       ` Arjan van de Ven
@ 2007-04-24  6:21                         ` Peter Williams
  2007-04-24  6:36                           ` Ingo Molnar
  0 siblings, 1 reply; 149+ messages in thread
From: Peter Williams @ 2007-04-24  6:21 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Juliusz Chroboczek,
	Con Kolivas, ck list, Bill Davidsen, Willy Tarreau,
	William Lee Irwin III, linux-kernel, Andrew Morton,
	Mike Galbraith, Thomas Gleixner, caglar, Gene Heskett

Arjan van de Ven wrote:
>> Within reason, it's not the number of clients that X has that causes its 
>> CPU bandwidth use to sky rocket and cause problems.  It's more to do 
>> with what type of clients they are.  Most GUIs (even ones that are 
>> constantly updating visual data (e.g. gkrellm -- I can open quite a 
>> large number of these without increasing X's CPU usage very much)) cause 
>> very little load on the X server.  The exceptions to this are the 
> 
> 
> there are actually 2 and not just 1 "X server", and they are VERY VERY
> different in behavior.
> 
> Case 1: Accelerated driver
> 
> If X talks to a decent enough card it supports well with acceleration,
> it will be very rare for X itself to spend any kind of significant
> amount of CPU time, all the really heavy stuff is done in hardware, and
> asynchronously at that. A bit of batching will greatly improve system
> performance in this case.
> 
> Case 2: Unaccelerated VESA
> 
> Some drivers in X, especially the VESA and NV drivers (which are quite
> common, vesa is used on all hardware without a special driver nowadays),
> have no or not enough acceleration to matter for modern desktops. This
> means the CPU is doing all the heavy lifting, in the X program. In this
> case even a simple "move the window a bit" becomes quite a bit of a CPU
> hog already.

Mine's a:

SiS 661/741/760 PCI/AGP or 662/761Gx PCIE VGA Display adapter according 
to X's display settings tool.  Which category does that fall into?

It's not a special adapter and is just the one that came with the 
motherboard. It doesn't use much CPU unless I grab a window and wiggle 
it all over the screen or do something like "ls -lR /" in an xterm.

> 
> The cases are fundamentally different in behavior, because in the first
> case, X hardly consumes the time it would get in any scheme, while in
> the second case X really is CPU bound and will happily consume any CPU
> time it can get.

Which still doesn't justify an elaborate "points" sharing scheme. 
Whichever way you look at that, that's just another way of giving X more 
CPU bandwidth and there are simpler ways to give X more CPU if it needs 
it.  However, I think there's something seriously wrong if it needs the 
-19 nice that I've heard mentioned.  You might as well just run it as a 
real time process.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  6:21                         ` Peter Williams
@ 2007-04-24  6:36                           ` Ingo Molnar
  2007-04-24  7:00                             ` Gene Heskett
  2007-04-26  0:51                             ` SD renice recommendation was: " Con Kolivas
  0 siblings, 2 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-24  6:36 UTC (permalink / raw)
  To: Peter Williams
  Cc: Arjan van de Ven, Linus Torvalds, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Thomas Gleixner, caglar,
	Gene Heskett


* Peter Williams <pwil3058@bigpond.net.au> wrote:

> > The cases are fundamentally different in behavior, because in the 
> > first case, X hardly consumes the time it would get in any scheme, 
> > while in the second case X really is CPU bound and will happily 
> > consume any CPU time it can get.
> 
> Which still doesn't justify an elaborate "points" sharing scheme. 
> Whichever way you look at that that's just another way of giving X 
> more CPU bandwidth and there are simpler ways to give X more CPU if it 
> needs it.  However, I think there's something seriously wrong if it 
> needs the -19 nice that I've heard mentioned.

Gene has done some testing under CFS with X reniced to +10 and the 
desktop still worked smoothly for him. So CFS does not 'need' a reniced 
X. There are simply advantages to negative nice levels: for example 
screen refreshes are smoother on any scheduler i tried. BUT, there is a 
caveat: on non-CFS schedulers i tried X is much more prone to get into 
'overscheduling' scenarios that visibly hurt X's performance, while on 
CFS there's a max of 1000-1500 context switches a second at nice -10. 
(which, considering the cost of a context switch, is well under 1% 
overhead.)
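
(back-of-the-envelope, assuming a context switch costs on the order of 
1-2 usecs on current hardware: 1500 switches/sec x ~2 usecs is ~3 msecs 
of CPU per second, i.e. around 0.3%.)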

So, my point is, the nice level of X for desktop users should not be set 
lower than a low limit suggested by that particular scheduler's author. 
That limit is scheduler-specific. Con i think recommends a nice level of 
-1 for X when using SD [Con, can you confirm?], while my tests show that 
if you want you can go as low as -10 under CFS, without any bad 
side-effects. (-19 was a bit too much)

> [...]  You might as well just run it as a real time process.

hm, that would be a bad idea under any scheduler (including CFS), 
because real time processes can starve other processes indefinitely.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  7:08                               ` Ingo Molnar
@ 2007-04-24  6:45                                 ` David Lang
  2007-04-24  7:24                                   ` Ingo Molnar
  2007-04-24  7:12                                 ` Gene Heskett
                                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 149+ messages in thread
From: David Lang @ 2007-04-24  6:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gene Heskett, Peter Williams, Arjan van de Ven, Linus Torvalds,
	Nick Piggin, Juliusz Chroboczek, Con Kolivas, ck list,
	Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Thomas Gleixner,
	caglar

On Tue, 24 Apr 2007, Ingo Molnar wrote:

> * Gene Heskett <gene.heskett@gmail.com> wrote:
>
>>> Gene has done some testing under CFS with X reniced to +10 and the
>>> desktop still worked smoothly for him.
>>
>> As a data point here, and probably nothing to do with X, but I did
>> manage to lock it up, solid, reset button time tonight, by wanting
>> 'smart' to get done with an update session after amanda had started.
>> I took both smart processes I could see in htop all the way to -19,
>> but when it was about done about 3 minutes later, everything came to
>> an instant, frozen, reset button required lockup.  I should have
>> stopped at -17 I guess. :(
>
> yeah, i guess this has little to do with X. I think in your scenario it
> might have been smarter to either stop, or to renice the workloads that
> took away CPU power from others to _positive_ nice levels. Negative nice
> levels can indeed be dangerous.
>
> (Btw., to protect against such mishaps in the future i have changed the
> SysRq-N [SysRq-Nice] implementation in my tree to not only change
> real-time tasks to SCHED_OTHER, but to also renice negative nice levels
> back to 0 - this will show up in -v6. That way you'd only have had to
> hit SysRq-N to get the system out of the wedge.)

if you are trying to unwedge a system it may be a good idea to renice all tasks 
to 0, since it could be that a task at +19 is holding a lock that something else is 
waiting for.

David Lang

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  6:36                           ` Ingo Molnar
@ 2007-04-24  7:00                             ` Gene Heskett
  2007-04-24  7:08                               ` Ingo Molnar
  2007-04-26  0:51                             ` SD renice recommendation was: " Con Kolivas
  1 sibling, 1 reply; 149+ messages in thread
From: Gene Heskett @ 2007-04-24  7:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Williams, Arjan van de Ven, Linus Torvalds, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Thomas Gleixner, caglar

On Tuesday 24 April 2007, Ingo Molnar wrote:
>* Peter Williams <pwil3058@bigpond.net.au> wrote:
>> > The cases are fundamentally different in behavior, because in the
>> > first case, X hardly consumes the time it would get in any scheme,
>> > while in the second case X really is CPU bound and will happily
>> > consume any CPU time it can get.
>>
>> Which still doesn't justify an elaborate "points" sharing scheme.
>> Whichever way you look at that that's just another way of giving X
>> more CPU bandwidth and there are simpler ways to give X more CPU if it
>> needs it.  However, I think there's something seriously wrong if it
>> needs the -19 nice that I've heard mentioned.
>
>Gene has done some testing under CFS with X reniced to +10 and the
>desktop still worked smoothly for him.

As a data point here, and probably nothing to do with X, but I did manage to 
lock it up, solid, reset button time tonight, by wanting 'smart' to get done 
with an update session after amanda had started.  I took both smart processes 
I could see in htop all the way to -19, but when it was about done about 3 
minutes later, everything came to an instant, frozen, reset button required 
lockup.  I should have stopped at -17 I guess. :(

>So CFS does not 'need' a reniced 
>X. There are simply advantages to negative nice levels: for example
>screen refreshes are smoother on any scheduler i tried. BUT, there is a
>caveat: on non-CFS schedulers i tried X is much more prone to get into
>'overscheduling' scenarios that visibly hurt X's performance, while on
>CFS there's a max of 1000-1500 context switches a second at nice -10.
>(which, considering the cost of a context switch is well under 1%
>overhead.)
>
>So, my point is, the nice level of X for desktop users should not be set
>lower than a low limit suggested by that particular scheduler's author.
>That limit is scheduler-specific. Con i think recommends a nice level of
>-1 for X when using SD [Con, can you confirm?], while my tests show that
>if you want you can go as low as -10 under CFS, without any bad
>side-effects. (-19 was a bit too much)
>
>> [...]  You might as well just run it as a real time process.
>
>hm, that would be a bad idea under any scheduler (including CFS),
>because real time processes can starve other processes indefinitely.
>
>	Ingo



-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I have discovered that all human evil comes from this, man's being unable
to sit still in a room.
		-- Blaise Pascal

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-23 20:33                     ` Ingo Molnar
  2007-04-23 20:44                       ` Ingo Molnar
  2007-04-23 21:53                       ` Guillaume Chazarain
@ 2007-04-24  7:04                       ` Rogan Dawes
  2007-04-24  7:31                         ` Ingo Molnar
  2 siblings, 1 reply; 149+ messages in thread
From: Rogan Dawes @ 2007-04-24  7:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Nick Piggin, Gene Heskett, Juliusz Chroboczek,
	Mike Galbraith, linux-kernel, Peter Williams, ck list,
	Thomas Gleixner, William Lee Irwin III, Andrew Morton,
	Bill Davidsen, Willy Tarreau, Arjan van de Ven

Ingo Molnar wrote:

> static void
> yield_task_fair(struct rq *rq, struct task_struct *p, struct task_struct *p_to)
> {
>         struct rb_node *curr, *next, *first;
>         struct task_struct *p_next;
> 
>         /*
>          * yield-to support: if we are on the same runqueue then
>          * give half of our wait_runtime (if it's positive) to the other task:
>          */
>         if (p_to && p->wait_runtime > 0) {
>                 p->wait_runtime >>= 1;
>                 p_to->wait_runtime += p->wait_runtime;
>         }
> 
> the above is the basic expression of: "charge a positive bank balance". 
> 

[..]

> [note, due to the nanoseconds unit there's no rounding loss to worry 
> about.]

Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss?

> 	Ingo

Rogan

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  7:00                             ` Gene Heskett
@ 2007-04-24  7:08                               ` Ingo Molnar
  2007-04-24  6:45                                 ` David Lang
                                                   ` (3 more replies)
  0 siblings, 4 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-24  7:08 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Peter Williams, Arjan van de Ven, Linus Torvalds, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Thomas Gleixner, caglar


* Gene Heskett <gene.heskett@gmail.com> wrote:

> > Gene has done some testing under CFS with X reniced to +10 and the 
> > desktop still worked smoothly for him.
> 
> As a data point here, and probably nothing to do with X, but I did 
> manage to lock it up, solid, reset button time tonight, by wanting 
> 'smart' to get done with an update session after amanda had started.  
> I took both smart processes I could see in htop all the way to -19, 
> but when it was about done about 3 minutes later, everything came to 
> an instant, frozen, reset button required lockup.  I should have 
> stopped at -17 I guess. :(

yeah, i guess this has little to do with X. I think in your scenario it 
might have been smarter to either stop, or to renice the workloads that 
took away CPU power from others to _positive_ nice levels. Negative nice 
levels can indeed be dangerous.

(Btw., to protect against such mishaps in the future i have changed the 
SysRq-N [SysRq-Nice] implementation in my tree to not only change 
real-time tasks to SCHED_OTHER, but to also renice negative nice levels 
back to 0 - this will show up in -v6. That way you'd only have had to 
hit SysRq-N to get the system out of the wedge.)
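
in sketch form, the extra step amounts to something like this (not the 
actual -v6 diff, just an illustration of what gets added on top of the 
existing normalize-RT-tasks pass of SysRq-N):

	struct task_struct *p;
	unsigned long flags;

	read_lock_irqsave(&tasklist_lock, flags);
	for_each_process(p) {
		if (TASK_NICE(p) < 0)
			set_user_nice(p, 0);	/* pull negative nice back to 0 */
	}
	read_unlock_irqrestore(&tasklist_lock, flags);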

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  7:08                               ` Ingo Molnar
  2007-04-24  6:45                                 ` David Lang
@ 2007-04-24  7:12                                 ` Gene Heskett
  2007-04-24  7:14                                   ` Ingo Molnar
  2007-04-24  7:25                                 ` Ingo Molnar
  2007-04-24  7:33                                 ` Ingo Molnar
  3 siblings, 1 reply; 149+ messages in thread
From: Gene Heskett @ 2007-04-24  7:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Williams, Arjan van de Ven, Linus Torvalds, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Thomas Gleixner, caglar

On Tuesday 24 April 2007, Ingo Molnar wrote:
>* Gene Heskett <gene.heskett@gmail.com> wrote:
>> > Gene has done some testing under CFS with X reniced to +10 and the
>> > desktop still worked smoothly for him.
>>
>> As a data point here, and probably nothing to do with X, but I did
>> manage to lock it up, solid, reset button time tonight, by wanting
>> 'smart' to get done with an update session after amanda had started.
>> I took both smart processes I could see in htop all the way to -19,
>> but when it was about done about 3 minutes later, everything came to
>> an instant, frozen, reset button required lockup.  I should have
>> stopped at -17 I guess. :(
>
>yeah, i guess this has little to do with X. I think in your scenario it
>might have been smarter to either stop, or to renice the workloads that
>took away CPU power from others to _positive_ nice levels. Negative nice
>levels can indeed be dangerous.
>
>(Btw., to protect against such mishaps in the future i have changed the
>SysRq-N [SysRq-Nice] implementation in my tree to not only change
>real-time tasks to SCHED_OTHER, but to also renice negative nice levels
>back to 0 - this will show up in -v6. That way you'd only have had to
>hit SysRq-N to get the system out of the wedge.)
>
>	Ingo

That sounds handy, particularly with idiots like me at the wheel...


-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
When a Banker jumps out of a window, jump after him--that's where the money 
is.
		-- Robespierre

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  7:12                                 ` Gene Heskett
@ 2007-04-24  7:14                                   ` Ingo Molnar
  2007-04-24 14:36                                     ` Gene Heskett
  0 siblings, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-24  7:14 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Peter Williams, Arjan van de Ven, Linus Torvalds, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Thomas Gleixner, caglar


* Gene Heskett <gene.heskett@gmail.com> wrote:

> > (Btw., to protect against such mishaps in the future i have changed 
> > the SysRq-N [SysRq-Nice] implementation in my tree to not only 
> > change real-time tasks to SCHED_OTHER, but to also renice negative 
> > nice levels back to 0 - this will show up in -v6. That way you'd 
> > only have had to hit SysRq-N to get the system out of the wedge.)
> 
> That sounds handy, particularly with idiots like me at the wheel...

by that standard i guess we tinkerers are all idiots ;)

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  6:45                                 ` David Lang
@ 2007-04-24  7:24                                   ` Ingo Molnar
  2007-04-24 14:38                                     ` Gene Heskett
  0 siblings, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-24  7:24 UTC (permalink / raw)
  To: David Lang
  Cc: Gene Heskett, Peter Williams, Arjan van de Ven, Linus Torvalds,
	Nick Piggin, Juliusz Chroboczek, Con Kolivas, ck list,
	Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Thomas Gleixner,
	caglar


* David Lang <david.lang@digitalinsight.com> wrote:

> > (Btw., to protect against such mishaps in the future i have changed 
> > the SysRq-N [SysRq-Nice] implementation in my tree to not only 
> > change real-time tasks to SCHED_OTHER, but to also renice negative 
> > nice levels back to 0 - this will show up in -v6. That way you'd 
> > only have had to hit SysRq-N to get the system out of the wedge.)
> 
> if you are trying to unwedge a system it may be a good idea to renice 
> all tasks to 0, it could be that a task at +19 is holding a lock that 
> something else is waiting for.

Yeah, that's possible too, but +19 tasks are getting a small but 
guaranteed share of the CPU so eventually it ought to release it. It's 
still a possibility, but i think i'll wait for a specific incident to 
happen first, and then react to that incident :-)

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  7:08                               ` Ingo Molnar
  2007-04-24  6:45                                 ` David Lang
  2007-04-24  7:12                                 ` Gene Heskett
@ 2007-04-24  7:25                                 ` Ingo Molnar
  2007-04-24 14:39                                   ` Gene Heskett
  2007-04-24 14:42                                   ` Gene Heskett
  2007-04-24  7:33                                 ` Ingo Molnar
  3 siblings, 2 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-24  7:25 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Peter Williams, Arjan van de Ven, Linus Torvalds, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Thomas Gleixner, caglar


* Ingo Molnar <mingo@elte.hu> wrote:

> yeah, i guess this has little to do with X. I think in your scenario 
> it might have been smarter to either stop, or to renice the workloads 
> that took away CPU power from others to _positive_ nice levels. 
> Negative nice levels can indeed be dangerous.

btw., was X itself at nice 0 or nice -10 when the lockup happened?

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  7:04                       ` Rogan Dawes
@ 2007-04-24  7:31                         ` Ingo Molnar
  2007-04-24  8:25                           ` Rogan Dawes
  0 siblings, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-24  7:31 UTC (permalink / raw)
  To: Rogan Dawes
  Cc: Linus Torvalds, Nick Piggin, Gene Heskett, Juliusz Chroboczek,
	Mike Galbraith, linux-kernel, Peter Williams, ck list,
	Thomas Gleixner, William Lee Irwin III, Andrew Morton,
	Bill Davidsen, Willy Tarreau, Arjan van de Ven


* Rogan Dawes <lists@dawes.za.net> wrote:

> >        if (p_to && p->wait_runtime > 0) {
> >                p->wait_runtime >>= 1;
> >                p_to->wait_runtime += p->wait_runtime;
> >        }
> >
> >the above is the basic expression of: "charge a positive bank balance". 
> >
> 
> [..]
> 
> > [note, due to the nanoseconds unit there's no rounding loss to worry 
> > about.]
> 
> Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss?

yes. But note that we'll only truly have to worry about that when we 
have context-switching performance in that range - currently it's at 
least 2-3 orders of magnitude above that. Microseconds seemed to me to 
be too coarse already, that's why i picked nanoseconds and 64-bit 
arithmetics for CFS.
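
(concretely: the halving above can lose at most 1 nanosecond per 
yield_to(), so even at a few thousand such operations per second the 
accumulated error stays in the microseconds-per-second range - several 
orders of magnitude below the 5 msec granularity.)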

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  7:08                               ` Ingo Molnar
                                                   ` (2 preceding siblings ...)
  2007-04-24  7:25                                 ` Ingo Molnar
@ 2007-04-24  7:33                                 ` Ingo Molnar
  3 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-24  7:33 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Peter Williams, Arjan van de Ven, Linus Torvalds, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Thomas Gleixner, caglar


* Ingo Molnar <mingo@elte.hu> wrote:

> [...] That way you'd only have had to hit SysRq-N to get the system 
> out of the wedge.)

small correction: Alt-SysRq-N.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  7:31                         ` Ingo Molnar
@ 2007-04-24  8:25                           ` Rogan Dawes
  2007-04-24 15:03                             ` Chris Friesen
  0 siblings, 1 reply; 149+ messages in thread
From: Rogan Dawes @ 2007-04-24  8:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Nick Piggin, Gene Heskett, Juliusz Chroboczek,
	Mike Galbraith, linux-kernel, Peter Williams, ck list,
	Thomas Gleixner, William Lee Irwin III, Andrew Morton,
	Bill Davidsen, Willy Tarreau, Arjan van de Ven

Ingo Molnar wrote:
> * Rogan Dawes <lists@dawes.za.net> wrote:
> 
>>>        if (p_to && p->wait_runtime > 0) {
>>>                p->wait_runtime >>= 1;
>>>                p_to->wait_runtime += p->wait_runtime;
>>>        }
>>>
>>> the above is the basic expression of: "charge a positive bank balance". 
>>>
>> [..]
>>
>>> [note, due to the nanoseconds unit there's no rounding loss to worry 
>>> about.]
>> Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss?
> 
> yes. But note that we'll only truly have to worry about that when we 
> have context-switching performance in that range - currently it's at 
> least 2-3 orders of magnitude above that. Microseconds seemed to me to 
> be too coarse already, that's why i picked nanoseconds and 64-bit 
> arithmetics for CFS.
> 
> 	Ingo

I guess my point was if we somehow get to an odd number of nanoseconds, 
we'd end up with rounding errors. I'm not sure if your algorithm will 
ever allow that.

Rogan

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: crash with CFS v4 and qemu/kvm (was: [patch] CFS scheduler, v4)
@ 2007-04-24 10:54       ` Christian Hesse
  0 siblings, 0 replies; 149+ messages in thread
From: Christian Hesse @ 2007-04-24 10:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Gene Heskett, kvm-devel, Avi Kivity

[-- Attachment #1: Type: text/plain, Size: 1277 bytes --]

On Monday 23 April 2007, Ingo Molnar wrote:
> * Christian Hesse <mail@earthworm.de> wrote:
> > On Friday 20 April 2007, Ingo Molnar wrote:
> > > i'm pleased to announce release -v4 of the CFS patchset.
> >
> > Hi Ingo, hi Avi, hi all,
> >
> > I'm trying to use kvm-20 with cfs v4 and get a crash:
> >
> > eworm@revo:~$ /usr/local/kvm/bin/qemu -snapshot
> > /mnt/data/virtual/qemu/winxp.img kvm_run: failed entry, reason 7
> > kvm_run returned -8
> >
> > It works (though it is a bit slow) if I start qemu with strace, so for
> > me it looks like a race condition?
>
> hm. Can you work it around with:
>
>    echo 0 > /proc/sys/kernel/sched_granularity_ns
>
> ?
>
> If yes then this is a wakeup race: some piece of code relies on the
> upstream scheduler preempting the waker task immediately in 99% of the
> cases.
>
> and you might want to test -v5 too which i released earlier today. It
> has no bugfix in this area though, so it will likely still trigger this
> race - but it will also hopefully be even more pleasant to use than -v4
> ;-)

Hi Ingo,

This was kvm's fault. It works perfectly now without modifications to the 
scheduler. If anybody is interested in details please see the kvm mailing 
list archives.
-- 
Regards,
Chris

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  7:14                                   ` Ingo Molnar
@ 2007-04-24 14:36                                     ` Gene Heskett
  0 siblings, 0 replies; 149+ messages in thread
From: Gene Heskett @ 2007-04-24 14:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Williams, Arjan van de Ven, Linus Torvalds, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Thomas Gleixner, caglar

On Tuesday 24 April 2007, Ingo Molnar wrote:
>* Gene Heskett <gene.heskett@gmail.com> wrote:
>> > (Btw., to protect against such mishaps in the future i have changed
>> > the SysRq-N [SysRq-Nice] implementation in my tree to not only
>> > change real-time tasks to SCHED_OTHER, but to also renice negative
>> > nice levels back to 0 - this will show up in -v6. That way you'd
>> > only have had to hit SysRq-N to get the system out of the wedge.)
>>
>> That sounds handy, particularly with idiots like me at the wheel...
>
>by that standard i guess we tinkerers are all idiots ;)
>
>	Ingo

Eiyyyup!

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Man's horizons are bounded by his vision.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  7:24                                   ` Ingo Molnar
@ 2007-04-24 14:38                                     ` Gene Heskett
  2007-04-24 17:44                                       ` Willy Tarreau
  0 siblings, 1 reply; 149+ messages in thread
From: Gene Heskett @ 2007-04-24 14:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Lang, Peter Williams, Arjan van de Ven, Linus Torvalds,
	Nick Piggin, Juliusz Chroboczek, Con Kolivas, ck list,
	Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Thomas Gleixner,
	caglar

On Tuesday 24 April 2007, Ingo Molnar wrote:
>* David Lang <david.lang@digitalinsight.com> wrote:
>> > (Btw., to protect against such mishaps in the future i have changed
>> > the SysRq-N [SysRq-Nice] implementation in my tree to not only
>> > change real-time tasks to SCHED_OTHER, but to also renice negative
>> > nice levels back to 0 - this will show up in -v6. That way you'd
>> > only have had to hit SysRq-N to get the system out of the wedge.)
>>
>> if you are trying to unwedge a system it may be a good idea to renice
>> all tasks to 0, it could be that a task at +19 is holding a lock that
>> something else is waiting for.
>
>Yeah, that's possible too, but +19 tasks are getting a small but
>guaranteed share of the CPU so eventually it ought to release it. It's
>still a possibility, but i think i'll wait for a specific incident to
>happen first, and then react to that incident :-)
>
>	Ingo

In the instance I created, even the SysRq+b was ignored, and ISTR that's 
supposed to initiate a reboot, is it not?  So it was well and truly wedged.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I use technology in order to hate it more properly.
		-- Nam June Paik

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  7:25                                 ` Ingo Molnar
@ 2007-04-24 14:39                                   ` Gene Heskett
  2007-04-24 14:42                                   ` Gene Heskett
  1 sibling, 0 replies; 149+ messages in thread
From: Gene Heskett @ 2007-04-24 14:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Williams, Arjan van de Ven, Linus Torvalds, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Thomas Gleixner, caglar

On Tuesday 24 April 2007, Ingo Molnar wrote:
>* Ingo Molnar <mingo@elte.hu> wrote:
>> yeah, i guess this has little to do with X. I think in your scenario
>> it might have been smarter to either stop, or to renice the workloads
>> that took away CPU power from others to _positive_ nice levels.
>> Negative nice levels can indeed be dangerous.
>
>btw., was X itself at nice 0 or nice -10 when the lockup happened?
>
>	Ingo

Memory could be hazy Ingo, but I think X was at 0 when that occurred.


-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I use technology in order to hate it more properly.
		-- Nam June Paik

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  7:25                                 ` Ingo Molnar
  2007-04-24 14:39                                   ` Gene Heskett
@ 2007-04-24 14:42                                   ` Gene Heskett
  1 sibling, 0 replies; 149+ messages in thread
From: Gene Heskett @ 2007-04-24 14:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Williams, Arjan van de Ven, Linus Torvalds, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Thomas Gleixner, caglar

On Tuesday 24 April 2007, Ingo Molnar wrote:
>* Ingo Molnar <mingo@elte.hu> wrote:
>> yeah, i guess this has little to do with X. I think in your scenario
>> it might have been smarter to either stop, or to renice the workloads
>> that took away CPU power from others to _positive_ nice levels.
>> Negative nice levels can indeed be dangerous.
>
>btw., was X itself at nice 0 or nice -10 when the lockup happened?
>
>	Ingo

Memory could be fuzzy Ingo, but I think it was at 0 at the time.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I know it all.  I just can't remember it all at once.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  8:25                           ` Rogan Dawes
@ 2007-04-24 15:03                             ` Chris Friesen
  2007-04-24 15:07                               ` Rogan Dawes
  0 siblings, 1 reply; 149+ messages in thread
From: Chris Friesen @ 2007-04-24 15:03 UTC (permalink / raw)
  To: Rogan Dawes
  Cc: Ingo Molnar, Linus Torvalds, Nick Piggin, Gene Heskett,
	Juliusz Chroboczek, Mike Galbraith, linux-kernel, Peter Williams,
	ck list, Thomas Gleixner, William Lee Irwin III, Andrew Morton,
	Bill Davidsen, Willy Tarreau, Arjan van de Ven

Rogan Dawes wrote:

> I guess my point was if we somehow get to an odd number of nanoseconds, 
> we'd end up with rounding errors. I'm not sure if your algorithm will 
> ever allow that.

And Ingo's point was that when it takes thousands of nanoseconds for a 
single context switch, an error of half a nanosecond is down in the noise.

Chris

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24 15:03                             ` Chris Friesen
@ 2007-04-24 15:07                               ` Rogan Dawes
  2007-04-24 15:15                                 ` Chris Friesen
                                                   ` (2 more replies)
  0 siblings, 3 replies; 149+ messages in thread
From: Rogan Dawes @ 2007-04-24 15:07 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Ingo Molnar, Linus Torvalds, Nick Piggin, Gene Heskett,
	Juliusz Chroboczek, Mike Galbraith, linux-kernel, Peter Williams,
	ck list, Thomas Gleixner, William Lee Irwin III, Andrew Morton,
	Bill Davidsen, Willy Tarreau, Arjan van de Ven

Chris Friesen wrote:
> Rogan Dawes wrote:
> 
>> I guess my point was if we somehow get to an odd number of 
>> nanoseconds, we'd end up with rounding errors. I'm not sure if your 
>> algorithm will ever allow that.
> 
> And Ingo's point was that when it takes thousands of nanoseconds for a 
> single context switch, an error of half a nanosecond is down in the noise.
> 
> Chris

My concern was that since Ingo said that this is a closed economy, with 
a fixed sum/total, if we lose a nanosecond here and there, eventually 
we'll lose them all.

Some folks have uptimes of multiple years.

Of course, I could (very likely!) be full of it! ;-)

Rogan

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-23 19:52                   ` Linus Torvalds
                                       ` (2 preceding siblings ...)
  2007-04-24  3:46                     ` Peter Williams
@ 2007-04-24 15:08                     ` Ray Lee
  2007-04-25  9:32                       ` Ingo Molnar
  3 siblings, 1 reply; 149+ messages in thread
From: Ray Lee @ 2007-04-24 15:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nick Piggin, Juliusz Chroboczek, Con Kolivas,
	ck list, Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

On 4/23/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, 23 Apr 2007, Ingo Molnar wrote:
> >
> > The "give scheduler money" transaction can be both an "implicit
> > transaction" (for example when writing to UNIX domain sockets or
> > blocking on a pipe, etc.), or it could be an "explicit transaction":
> > sched_yield_to(). This latter i've already implemented for CFS, but it's
> > much less useful than the really significant implicit ones, the ones
> > which will help X.
>
> Yes. It would be wonderful to get it working automatically, so please say
> something about the implementation..
>
> The "perfect" situation would be that when somebody goes to sleep, any
> extra points it had could be given to whoever it woke up last. Note that
> for something like X, it means that the points are 100% ephemeral: it gets
> points when a client sends it a request, but it would *lose* the points
> again when it sends the reply!

It would seem like there should be a penalty associated with sending
those points as well, so that two processes communicating quickly with
each other won't get into a mutual love-fest that'll capture the
scheduler's attention.

Ray

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24 15:07                               ` Rogan Dawes
@ 2007-04-24 15:15                                 ` Chris Friesen
  2007-04-24 23:55                                 ` Peter Williams
  2007-04-25  9:29                                 ` Ingo Molnar
  2 siblings, 0 replies; 149+ messages in thread
From: Chris Friesen @ 2007-04-24 15:15 UTC (permalink / raw)
  To: Rogan Dawes
  Cc: Ingo Molnar, Linus Torvalds, Nick Piggin, Gene Heskett,
	Juliusz Chroboczek, Mike Galbraith, linux-kernel, Peter Williams,
	ck list, Thomas Gleixner, William Lee Irwin III, Andrew Morton,
	Bill Davidsen, Willy Tarreau, Arjan van de Ven

Rogan Dawes wrote:

> My concern was that since Ingo said that this is a closed economy, with 
> a fixed sum/total, if we lose a nanosecond here and there, eventually 
> we'll lose them all.

I assume Ingo has set it up so that the system doesn't "lose" partial 
nanoseconds, but rather they'd just be accounted to the wrong task.

Chris

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  1:12 ` [patch] CFS scheduler, -v5 Ingo Molnar
                     ` (3 preceding siblings ...)
  2007-04-23 12:20   ` Guillaume Chazarain
@ 2007-04-24 16:54   ` Christian Hesse
  2007-04-25  9:25     ` Ingo Molnar
  4 siblings, 1 reply; 149+ messages in thread
From: Christian Hesse @ 2007-04-24 16:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper

[-- Attachment #1: Type: text/plain, Size: 380 bytes --]

On Monday 23 April 2007, Ingo Molnar wrote:
> i'm pleased to announce release -v5 of the CFS scheduler patchset.

Hi Ingo,

I just noticed that with cfs all processes (except some kernel threads) run on 
cpu 0. I don't think this is the expected cpu affinity for an smp system, is 
it? I remember about half of the processes running on each core with mainline.
-- 
Regards,
Chris

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24 14:38                                     ` Gene Heskett
@ 2007-04-24 17:44                                       ` Willy Tarreau
  2007-04-25  0:30                                         ` Gene Heskett
  2007-04-25  0:32                                         ` Gene Heskett
  0 siblings, 2 replies; 149+ messages in thread
From: Willy Tarreau @ 2007-04-24 17:44 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Ingo Molnar, David Lang, Peter Williams, Arjan van de Ven,
	Linus Torvalds, Nick Piggin, Juliusz Chroboczek, Con Kolivas,
	ck list, Bill Davidsen, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Thomas Gleixner, caglar

On Tue, Apr 24, 2007 at 10:38:32AM -0400, Gene Heskett wrote:
> On Tuesday 24 April 2007, Ingo Molnar wrote:
> >* David Lang <david.lang@digitalinsight.com> wrote:
> >> > (Btw., to protect against such mishaps in the future i have changed
> >> > the SysRq-N [SysRq-Nice] implementation in my tree to not only
> >> > change real-time tasks to SCHED_OTHER, but to also renice negative
> >> > nice levels back to 0 - this will show up in -v6. That way you'd
> >> > only have had to hit SysRq-N to get the system out of the wedge.)
> >>
> >> if you are trying to unwedge a system it may be a good idea to renice
> >> all tasks to 0, it could be that a task at +19 is holding a lock that
> >> something else is waiting for.
> >
> >Yeah, that's possible too, but +19 tasks are getting a small but
> >guaranteed share of the CPU so eventually it ought to release it. It's
> >still a possibility, but i think i'll wait for a specific incident to
> >happen first, and then react to that incident :-)
> >
> >	Ingo
> 
> In the instance I created, even the SysRq+b was ignored, and ISTR thats 
> supposed to initiate a reboot is it not?  So it was well and truly wedged.

On many machines I use this on, I have to release Alt while still holding B.
Don't know why, but it works like this.

Willy


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  1:57                         ` Bill Huey
@ 2007-04-24 18:01                           ` Li, Tong N
  0 siblings, 0 replies; 149+ messages in thread
From: Li, Tong N @ 2007-04-24 18:01 UTC (permalink / raw)
  To: Bill Huey
  Cc: Jeremy Fitzhardinge, Linus Torvalds, Ingo Molnar, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett

On Mon, 2007-04-23 at 18:57 -0700, Bill Huey wrote: 
> On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> > I don't know if we've discussed this or not. Since both CFS and SD claim
> > to be fair, I'd like to hear more opinions on the fairness aspect of
> > these designs. In areas such as OS, networking, and real-time, fairness,
> > and its more general form, proportional fairness, are well-defined
> > terms. In fact, perfect fairness is not feasible since it requires all
> > runnable threads to be running simultaneously and scheduled with
> > infinitesimally small quanta (like a fluid system). So to evaluate if a
> 
> Unfortunately, fairness is rather non-formal in this context and probably
> isn't strictly desirable given how hack much of Linux userspace is. Until
> there's a method of doing directed yields, like what Will has prescribed
> a kind of allotment to thread doing work for another a completely strict
> mechanism, it is probably problematic with regards to corner cases.
> 
> X for example is largely non-thread safe. Until they can get their xcb
> framework in place and addition thread infrastructure to do hand off
> properly, it's going to be difficult schedule for it. It's well known to
> be problematic.

I agree. I just think calling the designs "perfectly" or "completely"
fair is too strong. It might cause unnecessary confusion that
overshadows the actual merits of these designs. If we were to evaluate
specifically the fairness aspect of a design, then I'd suggest defining
it more formally.

> You announced your scheduler without CCing any of the relevant people here
> (and risk being completely ignored in lkml traffic):
> 
> 	http://lkml.org/lkml/2007/4/20/286
> 
> What is your opinion of both CFS and SDL ? How can you work be useful
> to either scheduler mentioned or to the Linux kernel on its own ?

I like SD for its simplicity. My concern with CFS is the RB tree
structure. Log(n) seems high to me given the fact that we had an O(1)
scheduler. Many algorithms achieve strong fairness guarantees at the
cost of log(n) time. Thus, I tend to think that if log(n) is acceptable, we
might also want to look at other algorithms (e.g., start-time first)
with better fairness properties and see if they could be extended to be
general purpose.

> > I understand that via experiments we can show a design is reasonably
> > fair in the common case, but IMHO, to claim that a design is fair, there
> > needs to be some kind of formal analysis on the fairness bound, and this
> > bound should be proven to be constant. Even if the bound is not
> > constant, at least this analysis can help us better understand and
> > predict the degree of fairness that users would experience (e.g., would
> > the system be less fair if the number of threads increases? What happens
> > if a large number of threads dynamically join and leave the system?).
> 
> Will has been thinking about this, but you have to also consider the
> practicalities of your approach versus Con's and Ingo's.

I consider my work an approach to extend an existing scheduler to
support proportional fairness. Many proportional-share designs are
lacking things such as the good interactive support that Linux does well.
This is why I designed it on top of the existing scheduler so that it
can leverage things such as dynamic priorities. Regardless of the
underlying scheduler, SD or CFS, I think the algorithm I used would
still apply and thus we can extend the scheduler similarly.

> I'm all for things like proportional scheduling and the extensions
> needed to do it properly. It would be highly relevant to some version
> of the -rt patch if not that patch directly.

I'd love it to be considered for inclusion in the -rt patch. I'm new to 
this, so would you please let me know what to do?

Thanks,

  tong

^ permalink raw reply	[flat|nested] 149+ messages in thread

* 'Scheduler Economy' prototype patch for CFS
  2007-04-23 19:11                 ` Ingo Molnar
  2007-04-23 19:52                   ` Linus Torvalds
  2007-04-23 20:05                   ` Willy Tarreau
@ 2007-04-24 21:05                   ` Ingo Molnar
  2 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-24 21:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> > but the point I'm trying to make is that X shouldn't get more 
> > CPU-time because it's "more important" (it's not: and as noted 
> > earlier, thinking that it's more important skews the problem and 
> > makes for too *much* scheduling). X should get more CPU time simply 
> > because it should get it's "fair CPU share" relative to the *sum* of 
> > the clients, not relative to any client individually.
> 
> yeah. And this is not a pipe dream and i think it does not need a 
> 'wakeup matrix' or other complexities.
> 
> I am --->.<---- this close to being able to do this very robustly 
> under CFS via simple rules of economy and trade: there the 
> p->wait_runtime metric is intentionally a "physical resource" of 
> "hard-earned right to execute on the CPU, by having waited on it" the 
> sum of which is bound for the whole system.
> 
> So while with other, heuristic approaches we always had the problem of 
> creating a "hyper-inflation" of an uneconomic virtual currency that 
> could be freely printed by certain tasks, in CFS the economy of this 
> is strict and the finegrained plus/minus balance is strictly managed 
> by a conservative and independent central bank.
> 
> So we can actually let tasks "trade" in these very physical units of 
> "right to execute on the CPU". A task giving it to another task means 
> that this task _already gave up CPU time in the past_. So it's the 
> robust equivalent of an economy's "money earned" concept, and this 
> "money"'s distribution (and redistribution) is totally fair and 
> totally balanced and is not prone to "inflation".
> 
> The "give scheduler money" transaction can be both an "implicit 
> transaction" (for example when writing to UNIX domain sockets or 
> blocking on a pipe, etc.), or it could be an "explicit transaction": 
> sched_yield_to(). This latter i've already implemented for CFS, but 
> it's much less useful than the really significant implicit ones, the 
> ones which will help X.

today i've implemented a quick prototype of this "Scheduler Economy" 
feature, ontop of CFS. (I added a sysctl to be able to easily test with 
it turned on or off - in a final version no such sysctl would be 
available.)

initial testing shows that clients are indeed able to shuffle a measurable 
amount of CPU time from themselves into the X server.

to test it, i took the xterm portion of Bill Davidsen's X test-script, 
to create 4 really busy scrolling-xterms:

  for n in 1 2 3 4; do
    vpos=$[(n-1)*80+50]
    xterm -geom 100x2-0+${vpos} -e bash -c "while true; do echo \$RANDOM; done" &
  done

starting this on an idle 1-CPU system gives an about ~40% busy X 
(everything including X running at nice level 0), with the remaining 60% 
CPU time split between the xterm and bash processes.

then i started 5 busy-loops to 'starve' X:

  for (( i=0; i < 5; i++ )) do
    bash -c "while :; do :; done" &
  done

this resulted in X getting only about 13% of CPU time, the bash 
busy-loops got around 13% each too.

then i turned on the 'Scheduler Economy' feature:

  echo 1 > /proc/sys/kernel/sched_economy

at which point clients started passing small units of 'scheduler money' 
to the X server, which resulted in X's CPU utilization going up to near 
20%, and the busy-loops dropping down to 10% each. [i'm still seeing 
occasional hiccups though that happen due to X clients starving 
themselves, so this is still preliminary.]

about the implementation: i tried to keep it as simple as possible. 
There's no costly tracking of individual "transactions", a summary 
account is used per work object (attached to the X unix domain socket in 
this case). The overhead to unix domain socket performance is not 
measurable:

  # echo 1 > /proc/sys/kernel/sched_economy
  # ./lat_unix
  AF_UNIX sock stream latency: 8.0439 microseconds
  # ./lat_unix
  AF_UNIX sock stream latency: 8.0219 microseconds

  # echo 0 > /proc/sys/kernel/sched_economy
  # ./lat_unix
  AF_UNIX sock stream latency: 8.0702 microseconds
  # ./lat_unix
  AF_UNIX sock stream latency: 8.0451 microseconds

the patch is below. Again, it is a prototype, with a few hacks to 
separate 'X client' from 'X server' unix domain socket use. The real 
solution will probably need some API additions either via new syscalls, 
or via new options to unix domain sockets. The kernel side was pretty 
straightforward - it was easy to embed the 'account' in the socket data 
structure and it was easy to find the places that give/take 'money'.

	Ingo

---
 include/linux/sched.h |   31 ++++++++++++++++++++
 include/net/af_unix.h |    2 +
 kernel/sched.c        |   75 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_fair.c   |    9 ++++++
 kernel/sysctl.c       |    8 +++++
 net/unix/af_unix.c    |   22 ++++++++++++++
 6 files changed, 147 insertions(+)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -805,6 +805,36 @@ struct sched_class {
 	void (*task_new) (struct rq *rq, struct task_struct *p);
 };
 
+/*
+ * Scheduler work account object: it consists of the price, the current
+ * amount of money attached to the account, plus a limit over which tasks
+ * should not have to refill the bucket:
+ */
+struct sched_account {
+	u64 price;
+	u64 balance;
+	u64 limit;
+};
+
+/*
+ * Set up a scheduler work account object:
+ */
+extern void sched_setup_account(struct sched_account *account, u64 price);
+
+/*
+ * A client task can pay into a given work account and can thus
+ * instruct the scheduler to move CPU resources from the current
+ * task to the server task:
+ */
+extern void sched_pay(struct sched_account *account);
+
+/*
+ * A server task can pick up payment from a work account object,
+ * and thus get the CPU resources that clients committed to that
+ * account:
+ */
+extern void sched_withdraw(struct sched_account *account);
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -1233,6 +1263,7 @@ static inline void idle_task_exit(void) 
 extern void sched_idle_next(void);
 extern char * sched_print_task_state(struct task_struct *p, char *buffer);
 
+extern unsigned int sysctl_sched_economy;
 extern unsigned int sysctl_sched_granularity;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_delayed_wakeups;
Index: linux/include/net/af_unix.h
===================================================================
--- linux.orig/include/net/af_unix.h
+++ linux/include/net/af_unix.h
@@ -4,6 +4,7 @@
 #include <linux/socket.h>
 #include <linux/un.h>
 #include <linux/mutex.h>
+#include <linux/sched.h>
 #include <net/sock.h>
 
 extern void unix_inflight(struct file *fp);
@@ -85,6 +86,7 @@ struct unix_sock {
         atomic_t                inflight;
         spinlock_t		lock;
         wait_queue_head_t       peer_wait;
+	struct sched_account	work_account;
 };
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
 
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3158,6 +3158,81 @@ long fastcall __sched sleep_on_timeout(w
 
 EXPORT_SYMBOL(sleep_on_timeout);
 
+/*
+ * Set up a scheduler work account object:
+ */
+void sched_setup_account(struct sched_account *account, u64 price)
+{
+	account->price = price;
+	account->balance = 0;
+	/*
+	 * Set up a reasonable limit, so that clients do not waste
+	 * their resources when the server has had enough pay
+	 * already:
+	 */
+	account->limit = sysctl_sched_granularity * 10;
+}
+
+/*
+ * A client task can pay into a given work account and can thus
+ * instruct the scheduler to move CPU resources from the current
+ * task to the server task:
+ */
+void sched_pay(struct sched_account *account)
+{
+	s64 money_available = current->wait_runtime;
+	u64 balance = account->balance;
+	u64 price = account->price;
+	u64 limit = account->limit;
+
+	if (!sysctl_sched_economy)
+		return;
+
+	/*
+	 * There's not enough money available, the task wont be able
+	 * to help the server:
+	 */
+	if (money_available < price)
+		return;
+	/*
+	 * Pay the server - but only up to a reasonable limit:
+	 */
+	if (balance >= limit)
+		return;
+
+	account->balance = balance + price;
+	current->wait_runtime = money_available - price;
+}
+
+/*
+ * A server task can pick up payment from a work account object,
+ * and thus get the CPU resources that clients committed to that
+ * account:
+ */
+void sched_withdraw(struct sched_account *account)
+{
+	struct task_struct *server = current;
+	u64 balance = account->balance;
+	u64 price = account->price;
+
+	if (!sysctl_sched_economy)
+		return;
+
+	/*
+	 * Note that we only pick up the price for this particular
+	 * work transaction - we dont withdraw the whole balance:
+	 */
+	if (balance < price)
+		return;
+	/*
+	 * No need to lock anything - we are running already.
+	 * The new wait_runtime value will be taken into account
+	 * (and will be an advantage) next time this task reschedules:
+	 */
+	server->wait_runtime += price;
+	account->balance = balance - price;
+}
+
 #ifdef CONFIG_RT_MUTEXES
 
 /*
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -15,6 +15,15 @@
 unsigned int sysctl_sched_granularity __read_mostly = 3000000;
 
 /*
+ * debug: "Scheduler Economy" flag. This flag activates the scheduler
+ * payment system which transfers 'money' from client tasks to server
+ * tasks. This flag can be turned on and off anytime. (It is for debugging
+ * purposes, if it works out then the economy will be enabled
+ * unconditionally.)
+ */
+unsigned int sysctl_sched_economy = 0;
+
+/*
  * Debug: delay the effect of wakeups to until the next scheduler tick.
  * (default: off, no effect)
  */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -230,6 +230,14 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_economy",
+		.data		= &sysctl_sched_economy,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= KERN_PANIC,
 		.procname	= "panic",
 		.data		= &panic_timeout,
Index: linux/net/unix/af_unix.c
===================================================================
--- linux.orig/net/unix/af_unix.c
+++ linux/net/unix/af_unix.c
@@ -596,6 +596,12 @@ static struct sock * unix_create1(struct
 	atomic_set(&u->inflight, sock ? 0 : -1);
 	mutex_init(&u->readlock); /* single task reading lock */
 	init_waitqueue_head(&u->peer_wait);
+	/*
+	 * Set up a scheduler work account object, with a default
+	 * price of 100 usecs:
+	 */
+	sched_setup_account(&u->work_account, 100000ULL);
+
 	unix_insert_socket(unix_sockets_unbound, sk);
 out:
 	return sk;
@@ -1501,6 +1507,14 @@ static int unix_stream_sendmsg(struct ki
 		    (other->sk_shutdown & RCV_SHUTDOWN))
 			goto pipe_err_free;
 
+		/*
+		 * Payment for the work the server is going to do on
+		 * behalf of this client (the uid test is a hack to
+		 * detect X clients):
+		 */
+		if (current->uid)
+			sched_pay(&unix_sk(other)->work_account);
+
 		skb_queue_tail(&other->sk_receive_queue, skb);
 		unix_state_runlock(other);
 		other->sk_data_ready(other, size);
@@ -1806,6 +1820,14 @@ static int unix_stream_recvmsg(struct ki
 	mutex_unlock(&u->readlock);
 	scm_recv(sock, msg, siocb->scm, flags);
 out:
+	/*
+	 * Get payment for the work the server is now going to do on
+	 * behalf of clients (the !uid test is a hack to detect the
+	 * X server):
+	 */
+	if (copied && !current->uid)
+		sched_withdraw(&u->work_account);
+
 	return copied ? : err;
 }
 

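for completeness, here is a hypothetical user-space pair that drives the 
sched_pay()/sched_withdraw() path through an ordinary AF_UNIX stream. It is 
not part of the patch, just a toy to poke at the accounting; the socket path 
and buffer size are arbitrary. Run "econ server" as root and "econ client" 
as an ordinary user, so the uid hack above classifies the two sides:

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <sys/stat.h>
  #include <sys/un.h>

  #define SOCK_PATH "/tmp/sched-economy-test"

  int main(int argc, char **argv)
  {
      struct sockaddr_un addr;
      char buf[4096];
      int fd, peer;

      memset(&addr, 0, sizeof(addr));
      addr.sun_family = AF_UNIX;
      strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);

      fd = socket(AF_UNIX, SOCK_STREAM, 0);
      if (fd < 0) {
          perror("socket");
          return 1;
      }

      if (argc > 1 && !strcmp(argv[1], "server")) {
          /* run as root: unix_stream_recvmsg() withdraws for us */
          unlink(SOCK_PATH);
          if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) ||
              listen(fd, 1) || chmod(SOCK_PATH, 0666)) {
              perror("server setup");
              return 1;
          }
          peer = accept(fd, NULL, NULL);
          if (peer < 0) {
              perror("accept");
              return 1;
          }
          while (read(peer, buf, sizeof(buf)) > 0)
              ;                         /* "work": just drain the stream */
      } else {
          /* run as non-root: unix_stream_sendmsg() pays on our behalf */
          if (connect(fd, (struct sockaddr *)&addr, sizeof(addr))) {
              perror("connect");
              return 1;
          }
          memset(buf, 'x', sizeof(buf));
          for (;;)                      /* keep the server's account fed */
              if (write(fd, buf, sizeof(buf)) < 0) {
                  perror("write");
                  break;
              }
      }
      return 0;
  }
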
^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  0:59                       ` Li, Tong N
  2007-04-24  1:57                         ` Bill Huey
@ 2007-04-24 21:27                         ` William Lee Irwin III
  2007-04-24 22:18                           ` Bernd Eckenfels
  2007-04-25  1:22                           ` Li, Tong N
  1 sibling, 2 replies; 149+ messages in thread
From: William Lee Irwin III @ 2007-04-24 21:27 UTC (permalink / raw)
  To: Li, Tong N
  Cc: Jeremy Fitzhardinge, Linus Torvalds, Ingo Molnar, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, linux-kernel, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett

On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> I don't know if we've discussed this or not. Since both CFS and SD claim
> to be fair, I'd like to hear more opinions on the fairness aspect of
> these designs. In areas such as OS, networking, and real-time, fairness,
> and its more general form, proportional fairness, are well-defined
> terms. In fact, perfect fairness is not feasible since it requires all
> runnable threads to be running simultaneously and scheduled with
> infinitesimally small quanta (like a fluid system). So to evaluate if a
> new scheduling algorithm is fair, the common approach is to take the
> ideal fair algorithm (often referred to as Generalized Processor
> Scheduling or GPS) as a reference model and analyze if the new algorithm
> can achieve a constant error bound (different error metrics also exist).

Could you explain for the audience the technical definition of fairness
and what sorts of error metrics are commonly used? There seems to be
some disagreement, and you're neutral enough of an observer that your
statement would help.


On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> I understand that via experiments we can show a design is reasonably
> fair in the common case, but IMHO, to claim that a design is fair, there
> needs to be some kind of formal analysis on the fairness bound, and this
> bound should be proven to be constant. Even if the bound is not
> constant, at least this analysis can help us better understand and
> predict the degree of fairness that users would experience (e.g., would
> the system be less fair if the number of threads increases? What happens
> if a large number of threads dynamically join and leave the system?).

Carrying out this sort of analysis on various policies would help, but
I'd expect most of them to be difficult to analyze. cfs' current
->fair_key computation should be simple enough to analyze, at least
ignoring nice numbers, though I've done nothing rigorous in this area.


-- wli

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24 21:27                         ` William Lee Irwin III
@ 2007-04-24 22:18                           ` Bernd Eckenfels
  2007-04-25  1:22                           ` Li, Tong N
  1 sibling, 0 replies; 149+ messages in thread
From: Bernd Eckenfels @ 2007-04-24 22:18 UTC (permalink / raw)
  To: linux-kernel

In article <20070424212717.GR31925@holomorphy.com> you wrote:
> Could you explain for the audience the technical definition of fairness
> and what sorts of error metrics are commonly used? There seems to be
> some disagreement, and you're neutral enough of an observer that your
> statement would help.

And while we are at it, why is it a good thing? I could understand that fair
means no misbehaving (intentionally or unintentionally) application can
harm the rest of the system. However, a responsive desktop might not
necessarily be very fair to compute jobs.

Even something as simple as "who gets accounted" can be quite different in
different workloads. (Larger multi-user systems tend to be fair based on the
user, on servers you balance more by thread or job, and single-user systems
should be as unfair as the user wants them as long as no process can "run
away")

Gruss
Bernd

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24 15:07                               ` Rogan Dawes
  2007-04-24 15:15                                 ` Chris Friesen
@ 2007-04-24 23:55                                 ` Peter Williams
  2007-04-25  9:29                                 ` Ingo Molnar
  2 siblings, 0 replies; 149+ messages in thread
From: Peter Williams @ 2007-04-24 23:55 UTC (permalink / raw)
  To: Rogan Dawes
  Cc: Chris Friesen, Ingo Molnar, Linus Torvalds, Nick Piggin,
	Gene Heskett, Juliusz Chroboczek, Mike Galbraith, linux-kernel,
	ck list, Thomas Gleixner, William Lee Irwin III, Andrew Morton,
	Bill Davidsen, Willy Tarreau, Arjan van de Ven

Rogan Dawes wrote:
> Chris Friesen wrote:
>> Rogan Dawes wrote:
>>
>>> I guess my point was if we somehow get to an odd number of 
>>> nanoseconds, we'd end up with rounding errors. I'm not sure if your 
>>> algorithm will ever allow that.
>>
>> And Ingo's point was that when it takes thousands of nanoseconds for a 
>> single context switch, an error of half a nanosecond is down in the 
>> noise.
>>
>> Chris
> 
> My concern was that since Ingo said that this is a closed economy, with 
> a fixed sum/total, if we lose a nanosecond here and there, eventually 
> we'll lose them all.
> 
> Some folks have uptimes of multiple years.
> 
> Of course, I could (very likely!) be full of it! ;-)

And won't be using any new scheduler on these computers anyhow, as 
that would involve bringing the system down to install the new kernel. :-)

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24 17:44                                       ` Willy Tarreau
@ 2007-04-25  0:30                                         ` Gene Heskett
  2007-04-25  0:32                                         ` Gene Heskett
  1 sibling, 0 replies; 149+ messages in thread
From: Gene Heskett @ 2007-04-25  0:30 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, David Lang, Peter Williams, Arjan van de Ven,
	Linus Torvalds, Nick Piggin, Juliusz Chroboczek, Con Kolivas,
	ck list, Bill Davidsen, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Thomas Gleixner, caglar

On Tuesday 24 April 2007, Willy Tarreau wrote:
>On Tue, Apr 24, 2007 at 10:38:32AM -0400, Gene Heskett wrote:
>> On Tuesday 24 April 2007, Ingo Molnar wrote:
>> >* David Lang <david.lang@digitalinsight.com> wrote:
>> >> > (Btw., to protect against such mishaps in the future i have changed
>> >> > the SysRq-N [SysRq-Nice] implementation in my tree to not only
>> >> > change real-time tasks to SCHED_OTHER, but to also renice negative
>> >> > nice levels back to 0 - this will show up in -v6. That way you'd
>> >> > only have had to hit SysRq-N to get the system out of the wedge.)
>> >>
>> >> if you are trying to unwedge a system it may be a good idea to renice
>> >> all tasks to 0, it could be that a task at +19 is holding a lock that
>> >> something else is waiting for.
>> >
>> >Yeah, that's possible too, but +19 tasks are getting a small but
>> >guaranteed share of the CPU so eventually it ought to release it. It's
>> >still a possibility, but i think i'll wait for a specific incident to
>> >happen first, and then react to that incident :-)
>> >
>> >	Ingo
>>
>> In the instance I created, even the SysRq+b was ignored, and ISTR thats
>> supposed to initiate a reboot is it not?  So it was well and truly wedged.
>
>On many machines I use this on, I have to release Alt while still holding B.
>Don't know why, but it works like this.
>
>Willy

Yeah, Willy, and pardon a slight bit of sarcasm here but that's how we get the 
reputation for needing virgins to sacrifice, regular experienced girls just 
wouldn't do.

This isn't APL running on an IBM 5120, so it should Just Work(TM) and not need 
a seance or something to conjure up the right spell.  Besides, the reset 
button is only about 6 feet away...  I get some exercise that way by getting 
up to push it. :)

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
It is so soon that I am done for, I wonder what I was begun for.
		-- Epitaph, Cheltenham Churchyard

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24 17:44                                       ` Willy Tarreau
  2007-04-25  0:30                                         ` Gene Heskett
@ 2007-04-25  0:32                                         ` Gene Heskett
  1 sibling, 0 replies; 149+ messages in thread
From: Gene Heskett @ 2007-04-25  0:32 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, David Lang, Peter Williams, Arjan van de Ven,
	Linus Torvalds, Nick Piggin, Juliusz Chroboczek, Con Kolivas,
	ck list, Bill Davidsen, William Lee Irwin III, linux-kernel,
	Andrew Morton, Mike Galbraith, Thomas Gleixner, caglar

On Tuesday 24 April 2007, Willy Tarreau wrote:
>On Tue, Apr 24, 2007 at 10:38:32AM -0400, Gene Heskett wrote:
>> On Tuesday 24 April 2007, Ingo Molnar wrote:
>> >* David Lang <david.lang@digitalinsight.com> wrote:
>> >> > (Btw., to protect against such mishaps in the future i have changed
>> >> > the SysRq-N [SysRq-Nice] implementation in my tree to not only
>> >> > change real-time tasks to SCHED_OTHER, but to also renice negative
>> >> > nice levels back to 0 - this will show up in -v6. That way you'd
>> >> > only have had to hit SysRq-N to get the system out of the wedge.)
>> >>
>> >> if you are trying to unwedge a system it may be a good idea to renice
>> >> all tasks to 0, it could be that a task at +19 is holding a lock that
>> >> something else is waiting for.
>> >
>> >Yeah, that's possible too, but +19 tasks are getting a small but
>> >guaranteed share of the CPU so eventually it ought to release it. It's
>> >still a possibility, but i think i'll wait for a specific incident to
>> >happen first, and then react to that incident :-)
>> >
>> >	Ingo
>>
>> In the instance I created, even the SysRq+b was ignored, and ISTR thats
>> supposed to initiate a reboot is it not?  So it was well and truly wedged.
>
>On many machines I use this on, I have to release Alt while still holding B.
>Don't know why, but it works like this.
>
>Willy

Yeah, Willy, and pardon a slight bit of sarcasm here but that's how we get the 
reputation for needing virgins to sacrifice, regular experienced girls just 
wouldn't do.

This isn't APL running on an IBM 5120, so it should Just Work(TM) and not need 
a seance or something to conjure up the right spell.  Besides, the reset 
button is only about 6 feet away...  I get some exercise that way by getting 
up to push it. :)

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
It is so soon that I am done for, I wonder what I was begun for.
		-- Epitaph, Cheltenham Churchyard

^ permalink raw reply	[flat|nested] 149+ messages in thread

* RE: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24 21:27                         ` William Lee Irwin III
  2007-04-24 22:18                           ` Bernd Eckenfels
@ 2007-04-25  1:22                           ` Li, Tong N
  2007-04-25  6:05                             ` William Lee Irwin III
  2007-04-25  9:44                             ` Ingo Molnar
  1 sibling, 2 replies; 149+ messages in thread
From: Li, Tong N @ 2007-04-25  1:22 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Jeremy Fitzhardinge, Linus Torvalds, Ingo Molnar, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, linux-kernel, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett

> Could you explain for the audience the technical definition of
fairness
> and what sorts of error metrics are commonly used? There seems to be
> some disagreement, and you're neutral enough of an observer that your
> statement would help.

The definition for proportional fairness assumes that each thread has a
weight, which, for example, can be specified by the user, or something mapped
from thread priorities, nice values, etc. A scheduler achieves ideal
proportional fairness if (1) it is work-conserving, i.e., it never
leaves a processor idle if there are runnable threads, and (2) for any
two threads, i and j, in any time interval, the ratio of their CPU time
is greater than or equal to the ratio of their weights, assuming that
thread i is continuously runnable in the entire interval and both
threads have fixed weights throughout the interval. A corollary of this
is that if both threads i and j are continuously runnable with fixed
weights in the time interval, then the ratio of their CPU time should be
equal to the ratio of their weights. This definition is pretty
restrictive since it requires the properties to hold for any thread in
any interval, which is not feasible. In practice, all algorithms try to
approximate this ideal scheduler (often referred to as Generalized
Processor Scheduling or GPS). Two error metrics are often used: 

(1) lag(t): for any interval [t1, t2], the lag of a thread at time t \in
[t1, t2] is S'(t1, t) - S(t1, t), where S' is the CPU time the thread
would receive in the interval [t1, t] under the ideal scheduler and S is
the actual CPU time it receives under the scheduler being evaluated.

(2) The second metric doesn't really have an agreed-upon name. Some call
it a fairness measure and some call it something else. Anyway, different from
lag, which is kind of an absolute measure for one thread, this metric
(call it F) defines a relative measure between two threads over any time
interval:

F(t1, t2) = S_i(t1, t2) / w_i - S_j(t1, t2) / w_j,

where S_i and S_j are the CPU time the two threads receive in the
interval [t1, t2] and w_i and w_j are their weights, assuming both
weights don't change throughout the interval.

The goal of a proportional-share scheduling algorithm is to minimize the
above metrics. If the lag function is bounded by a constant for any
thread in any time interval, then the algorithm is considered to be
fair. You may notice that the second metric is actually weaker than the
first. In fact, if an algorithm achieves a constant lag bound, it must
also achieve a constant bound for the second metric, but the reverse is
not necessarily true. But in some settings, people have focused on the
second metric and still consider an algorithm to be fair as long as the
second metric is bounded by a constant.
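
To make these concrete: given measured CPU times S_i and weights w_i over an
interval, both metrics are just arithmetic. A toy calculation for a
hypothetical 1-CPU interval (numbers and names made up for illustration):

  #include <stdio.h>

  #define NTHREADS 3

  int main(void)
  {
      /* made-up samples: CPU time received (ns) and weights over [t1,t2] */
      double S[NTHREADS] = { 400e6, 350e6, 250e6 };
      double w[NTHREADS] = { 2.0, 1.0, 1.0 };
      double interval = 1e9;    /* length of [t1,t2] in ns, one CPU */
      double wsum = 0.0;
      int i, j;

      for (i = 0; i < NTHREADS; i++)
          wsum += w[i];

      /* lag(t2) = S'(t1,t2) - S(t1,t2), where S' is the ideal GPS share,
         assuming every thread was continuously runnable in [t1,t2] */
      for (i = 0; i < NTHREADS; i++)
          printf("thread %d: lag = %.0f ns\n",
                 i, interval * w[i] / wsum - S[i]);

      /* pairwise metric F(t1,t2) = S_i/w_i - S_j/w_j */
      for (i = 0; i < NTHREADS; i++)
          for (j = i + 1; j < NTHREADS; j++)
              printf("threads %d,%d: F = %.0f ns\n",
                     i, j, S[i] / w[i] - S[j] / w[j]);
      return 0;
  }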

> 
> On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> > I understand that via experiments we can show a design is reasonably
> > fair in the common case, but IMHO, to claim that a design is fair,
there
> > needs to be some kind of formal analysis on the fairness bound, and
this
> > bound should be proven to be constant. Even if the bound is not
> > constant, at least this analysis can help us better understand and
> > predict the degree of fairness that users would experience (e.g.,
would
> > the system be less fair if the number of threads increases? What
happens
> > if a large number of threads dynamically join and leave the
system?).
> 
> Carrying out this sort of analysis on various policies would help, but
> I'd expect most of them to be difficult to analyze. cfs' current
> ->fair_key computation should be simple enough to analyze, at least
> ignoring nice numbers, though I've done nothing rigorous in this area.
> 

If we can derive some invariants from the algorithm, it'd help the
analysis. An example is the deficit round-robin (DRR) algorithm in
networking. Its analysis utilizes the fact that the number of rounds each
flow (in this case, each thread) goes through in any time interval differs
by at most one.
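
For what it's worth, a toy sketch of that DRR loop (hypothetical code with
made-up names and packet sizes, just to show where the invariant comes from):

  #include <stdio.h>

  #define NFLOWS  2
  #define MAXPKTS 8

  struct flow {
      unsigned int quantum;         /* bytes of credit added per round */
      unsigned int deficit;         /* unused credit carried over */
      unsigned int pkt[MAXPKTS];    /* queued packet sizes */
      unsigned int head, tail;
  };

  static void drr_round(int id, struct flow *f)
  {
      if (f->head == f->tail)
          return;                   /* idle flows accumulate no credit */

      f->deficit += f->quantum;
      while (f->head != f->tail && f->pkt[f->head] <= f->deficit) {
          f->deficit -= f->pkt[f->head];
          printf("flow %d sent %u bytes\n", id, f->pkt[f->head]);
          f->head++;
      }
      if (f->head == f->tail)
          f->deficit = 0;           /* drop leftover credit when empty */
  }

  int main(void)
  {
      struct flow flows[NFLOWS] = {
          { .quantum = 500, .pkt = { 300, 300, 300 }, .tail = 3 },
          { .quantum = 500, .pkt = { 1200 }, .tail = 1 },
      };
      int round, i;

      /* each backlogged flow gains exactly one quantum per round, so no
         flow can fall more than one round behind any other */
      for (round = 0; round < 3; round++)
          for (i = 0; i < NFLOWS; i++)
              drr_round(i, &flows[i]);
      return 0;
  }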

Hope you didn't get bored by all of this. :)

  tong

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-25  1:22                           ` Li, Tong N
@ 2007-04-25  6:05                             ` William Lee Irwin III
  2007-04-25  9:44                             ` Ingo Molnar
  1 sibling, 0 replies; 149+ messages in thread
From: William Lee Irwin III @ 2007-04-25  6:05 UTC (permalink / raw)
  To: Li, Tong N
  Cc: Jeremy Fitzhardinge, Linus Torvalds, Ingo Molnar, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, linux-kernel, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett

On Tue, Apr 24, 2007 at 06:22:53PM -0700, Li, Tong N wrote:
> The goal of a proportional-share scheduling algorithm is to minimize the
> above metrics. If the lag function is bounded by a constant for any
> thread in any time interval, then the algorithm is considered to be
> fair. You may notice that the second metric is actually weaker than
> first. In fact, if an algorithm achieves a constant lag bound, it must
> also achieve a constant bound for the second metric, but the reverse is
> not necessarily true. But in some settings, people have focused on the
> second metric and still consider an algorithm to be fair as long as the
> second metric is bounded by a constant.

Using these metrics it is possible to write benchmarks quantifying
fairness as a performance metric, provided weights for nice numbers.

Not so coincidentally, this also entails a test of whether nice numbers
are working as intended.
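
A minimal sketch of such a benchmark (hypothetical program; the nice levels,
runtime and the eventual choice of per-nice weights are all placeholders):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <signal.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <sys/resource.h>

  #define NKIDS   4
  #define RUNTIME 30                    /* seconds of wall time */

  int main(void)
  {
      static const int level[NKIDS] = { 0, 0, 5, 10 };
      pid_t pid[NKIDS];
      int i;

      for (i = 0; i < NKIDS; i++) {
          pid[i] = fork();
          if (pid[i] < 0) {
              perror("fork");
              exit(1);
          }
          if (pid[i] == 0) {
              nice(level[i]);
              for (;;)                  /* pure CPU hog */
                  ;
          }
      }

      sleep(RUNTIME);

      for (i = 0; i < NKIDS; i++) {
          struct rusage ru;
          double secs;

          kill(pid[i], SIGKILL);
          wait4(pid[i], NULL, 0, &ru);
          secs = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
          printf("nice %2d: %6.2f s CPU (%5.1f%% of wall time)\n",
                 level[i], secs, 100.0 * secs / RUNTIME);
      }
      return 0;
  }

Feeding the per-child CPU times plus a chosen weight table into the lag/F
computations above then gives a number for how fair the running scheduler
actually was over the interval.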


-- wli

P.S. Divide by the length of the time interval to rephrase in terms of
CPU bandwidth.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-22 13:27             ` Ingo Molnar
  2007-04-22 13:30               ` Mark Lord
@ 2007-04-25  8:16               ` Pavel Machek
  2007-04-25  8:22                 ` Ingo Molnar
  2007-04-25 10:19                 ` Alan Cox
  1 sibling, 2 replies; 149+ messages in thread
From: Pavel Machek @ 2007-04-25  8:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark Lord, Jan Engelhardt, Con Kolivas, Willy Tarreau,
	William Lee Irwin III, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett

Hi!


> it into some xorg.conf field. (It also makes sure that X isnt preempted 
> by other userspace stuff while it does timing-sensitive operations like 
> setting the video modes up or switching video modes, etc.)

X is privileged. It can just cli around the critical section.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-25  8:16               ` Pavel Machek
@ 2007-04-25  8:22                 ` Ingo Molnar
  2007-04-25 10:19                 ` Alan Cox
  1 sibling, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-25  8:22 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Mark Lord, Jan Engelhardt, Con Kolivas, Willy Tarreau,
	William Lee Irwin III, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett


* Pavel Machek <pavel@ucw.cz> wrote:

> > it into some xorg.conf field. (It also makes sure that X isnt 
> > preempted by other userspace stuff while it does timing-sensitive 
> > operations like setting the video modes up or switching video modes, 
> > etc.)
> 
> X is priviledged. It can just cli around the critical section.

yes, that is a tool that can be used too (and is used by most drivers) - 
my point was rather that besides the disadvantages, not preempting X can 
be an advantage too - not that there are no other (and often more 
suitable) tools to do the same.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-24 16:54   ` Christian Hesse
@ 2007-04-25  9:25     ` Ingo Molnar
  2007-04-25 10:51       ` Christian Hesse
  0 siblings, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-25  9:25 UTC (permalink / raw)
  To: Christian Hesse
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper


* Christian Hesse <mail@earthworm.de> wrote:

> On Monday 23 April 2007, Ingo Molnar wrote:
> > i'm pleased to announce release -v5 of the CFS scheduler patchset.
> 
> Hi Ingo,
> 
> I just noticed that with cfs all processes (except some kernel 
> threads) run on cpu 0. I don't think this is expected cpu affinity for 
> an smp system? I remember about half of the processes running on each 
> core with mainline.

i've got several SMP systems with CFS and all distribute the load 
properly to all CPUs, so it would be nice if you could tell me more 
about how the problem manifests itself on your system.

for example, if you start two infinite loops:

    for (( N=0; N < 2; N++ )); do ( while :; do :; done ) & done

do they end up on the same CPU?

Or do you mean that the default placement of single tasks starts at 
CPU#0, while with mainline they were alternating?
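
(for reference, here is a tiny hog that reports its own placement - a
hypothetical helper, assuming a glibc that already exports sched_getcpu():)

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sched.h>
  #include <unistd.h>

  int main(void)
  {
      volatile unsigned long n;

      for (;;) {
          for (n = 0; n < 100000000UL; n++)
              ;                         /* burn some CPU */
          printf("pid %d is on cpu %d\n", (int)getpid(), sched_getcpu());
      }
      return 0;
  }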

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24 15:07                               ` Rogan Dawes
  2007-04-24 15:15                                 ` Chris Friesen
  2007-04-24 23:55                                 ` Peter Williams
@ 2007-04-25  9:29                                 ` Ingo Molnar
  2 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-25  9:29 UTC (permalink / raw)
  To: Rogan Dawes
  Cc: Chris Friesen, Linus Torvalds, Nick Piggin, Gene Heskett,
	Juliusz Chroboczek, Mike Galbraith, linux-kernel, Peter Williams,
	ck list, Thomas Gleixner, William Lee Irwin III, Andrew Morton,
	Bill Davidsen, Willy Tarreau, Arjan van de Ven


* Rogan Dawes <lists@dawes.za.net> wrote:

> My concern was that since Ingo said that this is a closed economy, 
> with a fixed sum/total, if we lose a nanosecond here and there, 
> eventually we'll lose them all.

it's not a closed economy - the CPU constantly produces a resource: "CPU 
cycles to be spent", and tasks constantly consume that resource. So in 
that sense small inaccuracies are not a huge issue. But you are correct 
that each and every such inaccuracy has to be justified. For example 
larger inaccuracies on the order of SCHED_LOAD_SCALE are a problem 
because they can indeed sum up, and i fixed up a couple of such 
inaccuracies in -v6-to-be.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24 15:08                     ` Ray Lee
@ 2007-04-25  9:32                       ` Ingo Molnar
  0 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-25  9:32 UTC (permalink / raw)
  To: ray-gmail
  Cc: Linus Torvalds, Nick Piggin, Juliusz Chroboczek, Con Kolivas,
	ck list, Bill Davidsen, Willy Tarreau, William Lee Irwin III,
	linux-kernel, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Gene Heskett


* Ray Lee <madrabbit@gmail.com> wrote:

> It would seem like there should be a penalty associated with sending 
> those points as well, so that two processes communicating quickly with 
> each other won't get into a mutual love-fest that'll capture the 
> scheduler's attention.

it's not really "points", but "nanoseconds you are allowed to execute on 
the CPU". And thus two processes communicating with each other quickly 
and sending around this resource does get the attention of CFS: the 
resource is gradually consumed because the two processes are running on 
the CPU while they are communicating with each other. So it all works 
out fine.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-25  1:22                           ` Li, Tong N
  2007-04-25  6:05                             ` William Lee Irwin III
@ 2007-04-25  9:44                             ` Ingo Molnar
  2007-04-25 11:58                               ` William Lee Irwin III
  1 sibling, 1 reply; 149+ messages in thread
From: Ingo Molnar @ 2007-04-25  9:44 UTC (permalink / raw)
  To: Li, Tong N
  Cc: William Lee Irwin III, Jeremy Fitzhardinge, Linus Torvalds,
	Nick Piggin, Juliusz Chroboczek, Con Kolivas, ck list,
	Bill Davidsen, Willy Tarreau, linux-kernel, Andrew Morton,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett


* Li, Tong N <tong.n.li@intel.com> wrote:

> [...] A corollary of this is that if both threads i and j are 
> continuously runnable with fixed weights in the time interval, then 
> the ratio of their CPU time should be equal to the ratio of their 
> weights. This definition is pretty restrictive since it requires the 
> properties to hold for any thread in any interval, which is not 
> feasible. [...]

yes, it's a pretty strong definition, but also note that while it is 
definitely not easy to implement, the solution is nevertheless feasible 
in my opinion and there exists a scheduler that implements it: CFS.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-25  8:16               ` Pavel Machek
  2007-04-25  8:22                 ` Ingo Molnar
@ 2007-04-25 10:19                 ` Alan Cox
  1 sibling, 0 replies; 149+ messages in thread
From: Alan Cox @ 2007-04-25 10:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Mark Lord, Jan Engelhardt, Con Kolivas,
	Willy Tarreau, William Lee Irwin III, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett

> > it into some xorg.conf field. (It also makes sure that X isnt preempted 
> > by other userspace stuff while it does timing-sensitive operations like 
> > setting the video modes up or switching video modes, etc.)
> 
> X is priviledged. It can just cli around the critical section.

Not really. X can use iopl3 but if it disables interrupts you get
priority inversions and hangs, so in practice it can't do that. 

Alan

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-25  9:25     ` Ingo Molnar
@ 2007-04-25 10:51       ` Christian Hesse
  2007-04-25 10:56         ` Ingo Molnar
  0 siblings, 1 reply; 149+ messages in thread
From: Christian Hesse @ 2007-04-25 10:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper

[-- Attachment #1: Type: text/plain, Size: 1120 bytes --]

On Wednesday 25 April 2007, Ingo Molnar wrote:
> * Christian Hesse <mail@earthworm.de> wrote:
> > On Monday 23 April 2007, Ingo Molnar wrote:
> > > i'm pleased to announce release -v5 of the CFS scheduler patchset.
> >
> > Hi Ingo,
> >
> > I just noticed that with cfs all processes (except some kernel
> > threads) run on cpu 0. I don't think this is expected cpu affinity for
> > an smp system? I remember about half of the processes running on each
> > core with mainline.
>
> i've got several SMP systems with CFS and all distribute the load
> properly to all CPUs, so it would be nice if you could tell me more
> about how the problem manifests itself on your system.
>
> for example, if you start two infinite loops:
>
>     for (( N=0; N < 2; N++ )); do ( while :; do :; done ) & done
>
> do they end up on the same CPU?
>
> Or do you mean that the default placement of single tasks starts at
> CPU#0, while with mainline they were alternating?

That was not your fault. I updated suspend2 to 2.2.9.13 and everything works 
as expected again. Sorry for the noise.
-- 
Regards,
Chris

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-25 10:51       ` Christian Hesse
@ 2007-04-25 10:56         ` Ingo Molnar
  0 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-25 10:56 UTC (permalink / raw)
  To: Christian Hesse
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Ulrich Drepper


* Christian Hesse <mail@earthworm.de> wrote:

> > Or do you mean that the default placement of single tasks starts at 
> > CPU#0, while with mainline they were alternating?
> 
> That was not your fault. I updated suspend2 to 2.2.9.13 and everything 
> works as expected again. Sorry for the noise.

ok, great!

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-23  6:21       ` Ingo Molnar
@ 2007-04-25 11:43         ` Srivatsa Vaddagiri
  2007-04-25 12:51           ` Ingo Molnar
  0 siblings, 1 reply; 149+ messages in thread
From: Srivatsa Vaddagiri @ 2007-04-25 11:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Markus Trippelsdorf, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Ulrich Drepper

On Mon, Apr 23, 2007 at 08:21:16AM +0200, Ingo Molnar wrote:
> > Changing sys_yield_to to sys_sched_yield_to in 
> > include/asm-x86_64/unistd.h fixes the problem.
> 
> thanks. I edited the -v5 patch so new downloads should have the fix. (i 
> also test-booted x86_64 with this patch)

I downloaded -v5 and noticed this:

--- linux.orig/include/asm-x86_64/unistd.h
+++ linux/include/asm-x86_64/unistd.h
@@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages                279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_yield_to          280
+__SYSCALL(__NR_move_pages, sys_sched_yield_to)

s/__NR_move_pages/__NR_yield_to in the above line?
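
For clarity, the hunk with that substitution applied would read (sketch of
the intended fix, not a tested patch):

+#define __NR_yield_to          280
+__SYSCALL(__NR_yield_to, sys_sched_yield_to)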

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-25  9:44                             ` Ingo Molnar
@ 2007-04-25 11:58                               ` William Lee Irwin III
  2007-04-25 20:13                                 ` Willy Tarreau
  0 siblings, 1 reply; 149+ messages in thread
From: William Lee Irwin III @ 2007-04-25 11:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Li, Tong N, Jeremy Fitzhardinge, Linus Torvalds, Nick Piggin,
	Juliusz Chroboczek, Con Kolivas, ck list, Bill Davidsen,
	Willy Tarreau, linux-kernel, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett

* Li, Tong N <tong.n.li@intel.com> wrote:
>> [...] A corollary of this is that if both threads i and j are 
>> continuously runnable with fixed weights in the time interval, then 
>> the ratio of their CPU time should be equal to the ratio of their 
>> weights. This definition is pretty restrictive since it requires the 
>> properties to hold for any thread in any interval, which is not 
>> feasible. [...]

On Wed, Apr 25, 2007 at 11:44:03AM +0200, Ingo Molnar wrote:
> yes, it's a pretty strong definition, but also note that while it is 
> definitely not easy to implement, the solution is nevertheless feasible 
> in my opinion and there exists a scheduler that implements it: CFS.

The feasibility comment refers to the unimplementability of schedulers
with infinitesimal timeslices/quanta/sched_granularity_ns. It's no
failing of cfs (or any other scheduler) if, say, the ratios are not
exact within a time interval of one nanosecond or one picosecond.

One of the reasons you get the results you do is that what you use for
->fair_key is very close to the definition of lag, which is used as a
metric of fairness. It differs in a couple of ways, but how it's 
computed and used for queueing can be altered to more precisely match.

The basic concept you appear to be trying to implement is a greedy
algorithm: run the task with the largest lag first. As far as I can
tell, this is sound enough, though I have no formal proof. So with the
lag computation and queueing adjusted appropriately, it should work out.
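
A toy rendering of that greedy rule (illustrative C, not the CFS code path;
the field names are made up): on every decision, pick the runnable task with
the largest lag, i.e. service owed minus service received.

    struct toy_task {
            long long        owed_ns;       /* weight/total_weight * wall time */
            long long        received_ns;   /* CPU time actually consumed */
            struct toy_task *next;
    };

    static long long lag_ns(const struct toy_task *t)
    {
            return t->owed_ns - t->received_ns;
    }

    static struct toy_task *pick_next(struct toy_task *runnable)
    {
            struct toy_task *best = runnable;
            struct toy_task *t;

            for (t = runnable; t; t = t->next)
                    if (lag_ns(t) > lag_ns(best))
                            best = t;
            return best;    /* O(n) scan here; a tree keyed by lag makes it O(log n) */
    }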

Adjustments to the lag computation for arrivals and departures
during execution are among the missing pieces. Some algorithmic devices
are also needed to account for the varying growth rates of lags of tasks
waiting to run, which arise from differing priorities/weights.

There are no mysteries.


-- wli

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [patch] CFS scheduler, -v5
  2007-04-25 11:43         ` Srivatsa Vaddagiri
@ 2007-04-25 12:51           ` Ingo Molnar
  0 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2007-04-25 12:51 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Markus Trippelsdorf, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Ulrich Drepper


* Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:

> +#define __NR_yield_to          280
> +__SYSCALL(__NR_move_pages, sys_sched_yield_to)
> 
> s/__NR_move_pages/__NR_yield_to in the above line?

yeah, thanks.

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-25 11:58                               ` William Lee Irwin III
@ 2007-04-25 20:13                                 ` Willy Tarreau
  2007-04-26 17:57                                   ` Li, Tong N
  0 siblings, 1 reply; 149+ messages in thread
From: Willy Tarreau @ 2007-04-25 20:13 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Li, Tong N, Jeremy Fitzhardinge, Linus Torvalds,
	Nick Piggin, Juliusz Chroboczek, Con Kolivas, ck list,
	Bill Davidsen, linux-kernel, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett

On Wed, Apr 25, 2007 at 04:58:40AM -0700, William Lee Irwin III wrote:

> Adjustments to the lag computation for arrivals and departures
> during execution are among the missing pieces. Some algorithmic devices
> are also needed to account for the varying growth rates of lags of tasks
> waiting to run, which arise from differing priorities/weights.

that was the principle of my proposal of sorting tasks by expected completion
time and using +/- credit to compensate for too large/too short slice used.

Willy


^ permalink raw reply	[flat|nested] 149+ messages in thread

* SD renice recommendation was: Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-24  6:36                           ` Ingo Molnar
  2007-04-24  7:00                             ` Gene Heskett
@ 2007-04-26  0:51                             ` Con Kolivas
  1 sibling, 0 replies; 149+ messages in thread
From: Con Kolivas @ 2007-04-26  0:51 UTC (permalink / raw)
  To: Ingo Molnar, ck
  Cc: Peter Williams, Arjan van de Ven, Linus Torvalds, Nick Piggin,
	Juliusz Chroboczek, ck list, Bill Davidsen, Willy Tarreau,
	William Lee Irwin III, linux-kernel, Andrew Morton,
	Mike Galbraith, Thomas Gleixner, caglar, Gene Heskett

On Tuesday 24 April 2007 16:36, Ingo Molnar wrote:

> So, my point is, the nice level of X for desktop users should not be set
> lower than a low limit suggested by that particular scheduler's author.
> That limit is scheduler-specific. Con i think recommends a nice level of
> -1 for X when using SD [Con, can you confirm?], while my tests show that
> if you want you can go as low as -10 under CFS, without any bad
> side-effects. (-19 was a bit too much)

Nice 0 as a default for X, but if renicing, nice -10 as the lower limit for X 
on SD. The reason for that on SD is that the priority of freshly woken up 
tasks (ie not fully cpu bound) for both nice 0 and nice -10 will still be the 
same at PRIO 1 (see the prio_matrix). Therefore, there will _not_ be 
preemption of the nice 0 task and a context switch _unless_ it is already cpu 
bound and has consumed a certain number of cycles and has been demoted. 
Contrary to popular belief, it is not universal that a less niced task will 
preempt its more niced counterpart; it depends entirely on the implementation 
of nice. Yes, the context switch rate will go up with a reniced X because the 
conditions that lead to preemption are more likely to be met, but preemption 
definitely does not happen on every single wakeup of the reniced X.
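
As a rough illustration of that point (a toy model, not SD's actual code or
prio_matrix): wakeup preemption only happens when the woken task's effective
priority is strictly better than the running task's, so two tasks that map
to the same dynamic priority never preempt each other on wakeup.

    static int toy_effective_prio(int nice, int demotions)
    {
            /*
             * Toy stand-in for SD's prio_matrix, NOT the real table: every
             * task starts at PRIO 1 when it wakes up, regardless of nice,
             * and only moves to a worse priority after it has been demoted
             * for burning CPU.  (In real SD, nice mainly changes how quickly
             * demotion happens.)
             */
            (void)nice;
            return 1 + demotions;
    }

    static int preempts_on_wakeup(int wakee_nice, int curr_nice, int curr_demotions)
    {
            /* the wakee is freshly woken, so it has zero demotions */
            return toy_effective_prio(wakee_nice, 0) <
                   toy_effective_prio(curr_nice, curr_demotions);
    }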

Alas, again, I am forced to spend as little time as possible at the pc for my 
health, so expect _very few_ responses via email from me. Luckily SD is in 
pretty fine shape with version 0.46.

-- 
-ck

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-25 20:13                                 ` Willy Tarreau
@ 2007-04-26 17:57                                   ` Li, Tong N
  2007-04-26 19:18                                     ` Willy Tarreau
  2007-04-26 23:26                                     ` William Lee Irwin III
  0 siblings, 2 replies; 149+ messages in thread
From: Li, Tong N @ 2007-04-26 17:57 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: William Lee Irwin III, Ingo Molnar, Jeremy Fitzhardinge,
	Linus Torvalds, Nick Piggin, Juliusz Chroboczek, Con Kolivas,
	ck list, Bill Davidsen, linux-kernel, Andrew Morton,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett, Siddha, Suresh B, Barnes,
	Jesse

On Wed, 2007-04-25 at 22:13 +0200, Willy Tarreau wrote:
> On Wed, Apr 25, 2007 at 04:58:40AM -0700, William Lee Irwin III wrote:
> 
> > Adjustments to the lag computation for arrivals and departures
> > during execution are among the missing pieces. Some algorithmic devices
> > are also needed to account for the varying growth rates of lags of tasks
> > waiting to run, which arise from differing priorities/weights.
> 
> that was the principle of my proposal of sorting tasks by expected completion
> time and using +/- credit to compensate for too large/too short slice used.
> 
> Willy

Yeah, it's a good algorithm. It's a variant of earliest deadline first
(EDF). There are also similar ones in the literature such as earliest
eligible virtual deadline first (EEVDF) and biased virtual finishing
time (BVFT). Based on wli's explanation, I think Ingo's approach would
also fall into this category. With careful design, all such algorithms
that order tasks based on some notion of time can achieve good fairness.
There are some subtle differences. Some algorithms of this type can
achieve a constant lag bound, while others only have a constant positive lag
bound and an O(N) negative lag bound, meaning some tasks could receive
much more CPU time than they would under ideal fairness when the number of
tasks is high.
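
A generic skeleton of that family of algorithms (illustrative only, not
EEVDF, BVFT or Ingo's code): order runnable tasks by a per-task virtual
deadline and advance the deadline inversely to weight whenever the task runs.

    struct vt_task {
            unsigned long long      vdeadline;      /* "virtual time" key, ns */
            unsigned long           weight;
            struct vt_task          *next;
    };

    static struct vt_task *pick_earliest(struct vt_task *runnable)
    {
            struct vt_task *best = runnable;
            struct vt_task *t;

            for (t = runnable; t; t = t->next)
                    if (t->vdeadline < best->vdeadline)
                            best = t;
            return best;
    }

    static void account_slice(struct vt_task *t, unsigned long long slice_ns)
    {
            /* heavier tasks advance their key more slowly => larger CPU share */
            t->vdeadline += slice_ns * 1024 / t->weight;
    }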

On the other hand, the log(N) complexity of this type of algorithms has
been a concern in the research community. This motivated O(1)
round-robin based algorithms such as deficit round-robin (DRR) and
smoothed round-robin (SRR) in networking, and virtual-time round-robin
(VTRR), group ratio round-robin (GP3) and grouped distributed queues
(GDQ) in OS scheduling, as well as the distributed weighted round-robin
(DWRR) one I posted earlier.
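
As one concrete example of the O(1) round-robin family, here is textbook
deficit round-robin in sketch form (illustrative; the "cost" is a packet
length in a router, or could be a requested slice if transplanted to CPU
scheduling):

    struct drr_flow {
            long    quantum;        /* credit added per round, proportional to weight */
            long    deficit;        /* unused credit carried over between rounds */
            long    items[8];       /* costs of pending work items */
            int     head, count;
    };

    /* one visit to a flow within a round; returns number of items served */
    static int drr_serve(struct drr_flow *f)
    {
            int served = 0;

            f->deficit += f->quantum;
            while (f->count && f->items[f->head] <= f->deficit) {
                    f->deficit -= f->items[f->head];
                    f->head = (f->head + 1) % 8;
                    f->count--;
                    served++;
            }
            if (!f->count)
                    f->deficit = 0; /* classic DRR: empty flows forfeit leftover credit */
            return served;
    }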

  tong

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-26 17:57                                   ` Li, Tong N
@ 2007-04-26 19:18                                     ` Willy Tarreau
  2007-04-28 15:12                                       ` Bernd Eckenfels
  2007-04-26 23:26                                     ` William Lee Irwin III
  1 sibling, 1 reply; 149+ messages in thread
From: Willy Tarreau @ 2007-04-26 19:18 UTC (permalink / raw)
  To: Li, Tong N
  Cc: William Lee Irwin III, Ingo Molnar, Jeremy Fitzhardinge,
	Linus Torvalds, Nick Piggin, Juliusz Chroboczek, Con Kolivas,
	ck list, Bill Davidsen, linux-kernel, Andrew Morton,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Gene Heskett, Siddha, Suresh B, Barnes,
	Jesse

On Thu, Apr 26, 2007 at 10:57:48AM -0700, Li, Tong N wrote:
> On Wed, 2007-04-25 at 22:13 +0200, Willy Tarreau wrote:
> > On Wed, Apr 25, 2007 at 04:58:40AM -0700, William Lee Irwin III wrote:
> > 
> > > Adjustments to the lag computation for arrivals and departures
> > > during execution are among the missing pieces. Some algorithmic devices
> > > are also needed to account for the varying growth rates of lags of tasks
> > > waiting to run, which arise from differing priorities/weights.
> > 
> > that was the principle of my proposal of sorting tasks by expected completion
> > time and using +/- credit to compensate for too large/too short slice used.
> > 
> > Willy
> 
> Yeah, it's a good algorithm. It's a variant of earliest deadline first
> (EDF). There are also similar ones in the literature such as earliest
> eligible virtual deadline first (EEVDF) and biased virtual finishing
> time (BVFT). Based on wli's explanation, I think Ingo's approach would
> also fall into this category. With careful design, all such algorithms
> that order tasks based on some notion of time can achieve good fairness.
> There are some subtle differences. Some algorithms of this type can
> achieve a constant lag bound, while others only have a constant positive lag
> bound and an O(N) negative lag bound,

Anyway, we're working in discrete time, not continuous time. Lag is
unavoidable. At best it can be bounded and compensated for. The first time
I thought about this algorithm, I was looking at a line drawn using the
Bresenham algorithm. That algorithm is all about compensating the error
against a perfect expectation. Too high, too far, too high, too far... I
thought that the line could represent a task's progress as a function of
time, and the pixels the periods the task spends on the CPU. On short
intervals, you lose. On large ones, you're very close to the ideal case.

> meaning some tasks could receive
> much more CPU time than they would under ideal fairness when the number of
> tasks is high.

It's not a problem that a task receives much more CPU than it should,
provided that:

  a) it may do so only for a short and bounded time, typically less than
     the maximum acceptable latency for other tasks

  b) the excess CPU it received is accounted for, so that it is deducted
     from subsequent passes.
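
A toy version of that error-compensation idea (illustrative only, not an
exact rendering of the earlier proposal): accrue credit at the task's ideal
rate, charge it for the CPU it actually used, and only hand out a slice when
enough credit has built up.

    struct credit_task {
            long    weight;
            long    credit_ns;      /* positive: CPU owed, negative: overdrawn */
    };

    /* called every accounting tick of length tick_ns */
    static void accrue(struct credit_task *t, long tick_ns, long total_weight)
    {
            t->credit_ns += tick_ns * t->weight / total_weight;
    }

    /* called after the task actually ran for ran_ns (maybe more than planned) */
    static void charge(struct credit_task *t, long ran_ns)
    {
            t->credit_ns -= ran_ns; /* an oversized slice leaves negative credit,
                                       paid back on later passes */
    }

    static int deserves_slice(const struct credit_task *t, long slice_ns)
    {
            return t->credit_ns >= slice_ns;
    }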


> On the other hand, the log(N) complexity of this type of algorithms has
> been a concern in the research community. This motivated O(1)

I was in favor of O(1) algorithms for a long time, but seeing how dumb the
things you can do in O(1) are compared to O(logN), I definitely changed my
mind. Also, with O(logN) algorithms you often still have a lot of common
operations in O(1). E.g. you insert into a time-ordered tree in O(logN), but
you read from it in O(1). And you still have the ability to change its
contents in O(logN) if you need to.
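
One common way to get that O(1) read (a generic sketch, not CFS's rbtree
code): keep the container sorted by the time key and cache a pointer to the
smallest element, updating the cache on insert.

    struct toy_node {
            unsigned long long      key;            /* e.g. expected completion time */
            struct toy_node         *left, *right;
    };

    struct toy_queue {
            struct toy_node         *root;
            struct toy_node         *leftmost;      /* cached smallest key */
    };

    static struct toy_node *peek_next(const struct toy_queue *q)
    {
            return q->leftmost;                     /* the O(1) read */
    }

    static void toy_insert(struct toy_queue *q, struct toy_node *n)
    {
            struct toy_node **link = &q->root;

            n->left = n->right = NULL;
            while (*link)
                    link = (n->key < (*link)->key) ? &(*link)->left : &(*link)->right;
            *link = n;              /* O(log N) on a balanced tree; this toy tree
                                       is unbalanced, so O(N) worst case */
            if (!q->leftmost || n->key < q->leftmost->key)
                    q->leftmost = n;                /* keep the O(1) read valid */
    }

    /* removing the leftmost would re-walk to the new minimum (omitted) */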

Last but not least, spending one hundred cycles a few thousand times
a second is nothing compared to electing the wrong task for a full
time-slice.

Willy


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-26 17:57                                   ` Li, Tong N
  2007-04-26 19:18                                     ` Willy Tarreau
@ 2007-04-26 23:26                                     ` William Lee Irwin III
  1 sibling, 0 replies; 149+ messages in thread
From: William Lee Irwin III @ 2007-04-26 23:26 UTC (permalink / raw)
  To: Li, Tong N
  Cc: Willy Tarreau, Ingo Molnar, Jeremy Fitzhardinge, Linus Torvalds,
	Nick Piggin, Juliusz Chroboczek, Con Kolivas, ck list,
	Bill Davidsen, linux-kernel, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett, Siddha, Suresh B, Barnes, Jesse

On Wed, Apr 25, 2007 at 04:58:40AM -0700, William Lee Irwin III wrote:
>>> Adjustments to the lag computation for arrivals and departures
>>> during execution are among the missing pieces. Some algorithmic devices
>>> are also needed to account for the varying growth rates of lags of tasks
>>> waiting to run, which arise from differing priorities/weights.

On Wed, 2007-04-25 at 22:13 +0200, Willy Tarreau wrote:
>> that was the principle of my proposal of sorting tasks by expected completion
>> time and using +/- credit to compensate for too large/too short slice used.

On Thu, Apr 26, 2007 at 10:57:48AM -0700, Li, Tong N wrote:
> Yeah, it's a good algorithm. It's a variant of earliest deadline first
> (EDF). There are also similar ones in the literature such as earliest
> eligible virtual deadline first (EEVDF) and biased virtual finishing
> time (BVFT). Based on wli's explanation, I think Ingo's approach would
> also fall into this category. With careful design, all such algorithms
> that order tasks based on some notion of time can achieve good fairness.
> There are some subtle differences. Some algorithms of this type can
> achieve a constant lag bound, while others only have a constant positive lag
> bound and an O(N) negative lag bound, meaning some tasks could receive
> much more CPU time than they would under ideal fairness when the number of
> tasks is high.

The algorithm is in a bit of flux, but the virtual deadline computation
is rather readable. You may be able to tell whether cfs is affected by
the negative lag issue better than I. For the most part all I can smoke
out is that it's not apparent to me whether load balancing is done the
way it needs to be.


On Thu, Apr 26, 2007 at 10:57:48AM -0700, Li, Tong N wrote:
> On the other hand, the log(N) complexity of this type of algorithms has
> been a concern in the research community. This motivated O(1)
> round-robin based algorithms such as deficit round-robin (DRR) and
> smoothed round-robin (SRR) in networking, and virtual-time round-robin
> (VTRR), group ratio round-robin (GP3) and grouped distributed queues
> (GDQ) in OS scheduling, as well as the distributed weighted round-robin
> (DWRR) one I posted earlier.

I'm going to make a bold statement: I don't think O(lg(n)) is bad at
all. In real systems there are constraints related to per-task memory
footprints that severely restrict the domain of the performance metric,
rendering O(lg(n)) bounded by a rather reasonable constant.
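
For a rough sense of scale (assuming the usual 8 KiB kernel stacks): a
million runnable tasks gives lg(n) of only about 20, and their kernel stacks
alone would already cost roughly 8 GB of RAM, so realistic task counts keep
the depth well below even that.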

A larger concern to me is whether this affair actually achieves its
design goals and, to a lesser extent, in what contexts those design
goals are truly crucial or dominant as opposed to others, such as,
say, interactivity. It is clear, regardless of general applicability,
that the predictability of behavior with regard to strict fairness
is going to be useful in certain contexts.

Another concern which is in favor of the virtual deadline design is
that virtual deadlines can very effectively emulate a broad spectrum
of algorithms. For instance, the mainline "O(1) scheduler" can be
emulated using such a queueing mechanism. Even if the particular
policy cfs now implements is dumped, radically different policies
can be expressed with its queueing mechanism. This has maintenance
implications which are quite beneficial. That said, it's far from
an unqualified endorsement. I'd still like to see much done differently.


-- wli

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
  2007-04-26 19:18                                     ` Willy Tarreau
@ 2007-04-28 15:12                                       ` Bernd Eckenfels
  0 siblings, 0 replies; 149+ messages in thread
From: Bernd Eckenfels @ 2007-04-28 15:12 UTC (permalink / raw)
  To: linux-kernel

In article <20070426191835.GA12740@1wt.eu> you wrote:
>  a) it may do so for a short and bound time, typically less than the
>     maximum acceptable latency for other tasks

If you have n threads in the runqueue and each of them can have m<d (d = max
latency deadline) overhead, you will have to account for slices of d/n. This
is typically not possible for a larger number of ready threads.

Therefore another approach would be to make sure the next thread gets a
smaller slice, but then you have to move that debit around and distribute it
fairly, which is the whole problem we face here.

(Besides, it is not clear to me whether fair scheduling gets the best results;
see the X problem, or compare threads vs. processes vs. subsystems.)

Gruss
Bernd

PS: sorry for the Cc trimming, I need to get rid of my mail2news gateway;
however, I will make sure to copy important info to all concerned parties -
don't think that's needed for my ramblings :)



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [REPORT] cfs-v4 vs sd-0.44
@ 2007-04-22  4:38 Al Boldi
  0 siblings, 0 replies; 149+ messages in thread
From: Al Boldi @ 2007-04-22  4:38 UTC (permalink / raw)
  To: linux-kernel

Con Kolivas wrote:
> On Sunday 22 April 2007 02:00, Ingo Molnar wrote:
> > * Con Kolivas <kernel@kolivas.org> wrote:
> > > >   Feels even better, mouse movements are very smooth even under high
> > > >   load. I noticed that X gets reniced to -19 with this scheduler.
> > > >   I've not looked at the code yet but this looked suspicious to me.
> > > >   I've reniced it to 0 and it did not change any behaviour. Still
> > > >   very good.
> > >
> > > Looks like this code does it:
> > >
> > > +int sysctl_sched_privileged_nice_level __read_mostly = -19;
> >
> > correct.
>
> Oh I definitely was not advocating against renicing X, I just suspect that
> virtually all the users who gave glowing reports to CFS comparing it to SD
> had no idea it had reniced X to -19 behind their back and that they were
> comparing it to SD running X at nice 0. I think had they been comparing
> CFS with X nice -19 to SD running nice -10 in this interactivity soft and
> squishy comparison land their thoughts might have been different. I missed
> it in the announcement and had to go looking in the code since Willy just
> kinda tripped over it unwittingly as well.

I tried this with the vesa driver of X, and reflect from the mesa-demos 
heavily starves new window creation on cfs-v4 with X reniced to -19.  X 
reniced to 0 removes this starvation.  On SD, X reniced to -10 works great.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 149+ messages in thread

end of thread, other threads:[~2007-04-28 15:12 UTC | newest]

Thread overview: 149+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-20 14:04 [patch] CFS scheduler, v4 Ingo Molnar
2007-04-20 21:37 ` Gene Heskett
2007-04-21 20:47   ` S.Çağlar Onur
2007-04-22  1:22     ` Gene Heskett
2007-04-20 21:39 ` mdew .
2007-04-21  6:47   ` Ingo Molnar
2007-04-21  7:55     ` [patch] CFS scheduler, v4, for v2.6.20.7 Ingo Molnar
2007-04-21 12:12 ` [REPORT] cfs-v4 vs sd-0.44 Willy Tarreau
2007-04-21 12:40   ` Con Kolivas
2007-04-21 13:02     ` Willy Tarreau
2007-04-21 15:46   ` Ingo Molnar
2007-04-21 16:18     ` Willy Tarreau
2007-04-21 16:34       ` Linus Torvalds
2007-04-21 16:42         ` William Lee Irwin III
2007-04-21 18:55           ` Kyle Moffett
2007-04-21 19:49             ` Ulrich Drepper
2007-04-21 23:17               ` William Lee Irwin III
2007-04-21 23:35               ` Linus Torvalds
2007-04-22  1:46                 ` Ulrich Drepper
2007-04-22  7:02                   ` William Lee Irwin III
2007-04-22  7:17                     ` Ulrich Drepper
2007-04-22  8:48                       ` William Lee Irwin III
2007-04-22 16:16                         ` Ulrich Drepper
2007-04-23  0:07                           ` Rusty Russell
2007-04-21 16:53         ` Willy Tarreau
2007-04-21 16:53         ` Ingo Molnar
2007-04-21 16:57           ` Willy Tarreau
2007-04-21 18:09           ` Ulrich Drepper
2007-04-21 17:03       ` Geert Bosch
2007-04-21 15:55   ` Con Kolivas
2007-04-21 16:00     ` Ingo Molnar
2007-04-21 16:12       ` Willy Tarreau
2007-04-21 16:39       ` William Lee Irwin III
2007-04-21 17:15       ` Jan Engelhardt
2007-04-21 19:00         ` Ingo Molnar
2007-04-22 13:18           ` Mark Lord
2007-04-22 13:27             ` Ingo Molnar
2007-04-22 13:30               ` Mark Lord
2007-04-25  8:16               ` Pavel Machek
2007-04-25  8:22                 ` Ingo Molnar
2007-04-25 10:19                 ` Alan Cox
2007-04-21 22:54       ` Denis Vlasenko
2007-04-22  0:08         ` Con Kolivas
2007-04-22  4:58           ` Mike Galbraith
2007-04-21 23:59       ` Con Kolivas
2007-04-22 13:04         ` Juliusz Chroboczek
2007-04-22 23:24           ` Linus Torvalds
2007-04-23  1:34             ` Nick Piggin
2007-04-23 15:56               ` Linus Torvalds
2007-04-23 19:11                 ` Ingo Molnar
2007-04-23 19:52                   ` Linus Torvalds
2007-04-23 20:33                     ` Ingo Molnar
2007-04-23 20:44                       ` Ingo Molnar
2007-04-23 21:03                         ` Ingo Molnar
2007-04-23 21:53                       ` Guillaume Chazarain
2007-04-24  7:04                       ` Rogan Dawes
2007-04-24  7:31                         ` Ingo Molnar
2007-04-24  8:25                           ` Rogan Dawes
2007-04-24 15:03                             ` Chris Friesen
2007-04-24 15:07                               ` Rogan Dawes
2007-04-24 15:15                                 ` Chris Friesen
2007-04-24 23:55                                 ` Peter Williams
2007-04-25  9:29                                 ` Ingo Molnar
2007-04-23 22:48                     ` Jeremy Fitzhardinge
2007-04-24  0:59                       ` Li, Tong N
2007-04-24  1:57                         ` Bill Huey
2007-04-24 18:01                           ` Li, Tong N
2007-04-24 21:27                         ` William Lee Irwin III
2007-04-24 22:18                           ` Bernd Eckenfels
2007-04-25  1:22                           ` Li, Tong N
2007-04-25  6:05                             ` William Lee Irwin III
2007-04-25  9:44                             ` Ingo Molnar
2007-04-25 11:58                               ` William Lee Irwin III
2007-04-25 20:13                                 ` Willy Tarreau
2007-04-26 17:57                                   ` Li, Tong N
2007-04-26 19:18                                     ` Willy Tarreau
2007-04-28 15:12                                       ` Bernd Eckenfels
2007-04-26 23:26                                     ` William Lee Irwin III
2007-04-24  3:46                     ` Peter Williams
2007-04-24  4:52                       ` Arjan van de Ven
2007-04-24  6:21                         ` Peter Williams
2007-04-24  6:36                           ` Ingo Molnar
2007-04-24  7:00                             ` Gene Heskett
2007-04-24  7:08                               ` Ingo Molnar
2007-04-24  6:45                                 ` David Lang
2007-04-24  7:24                                   ` Ingo Molnar
2007-04-24 14:38                                     ` Gene Heskett
2007-04-24 17:44                                       ` Willy Tarreau
2007-04-25  0:30                                         ` Gene Heskett
2007-04-25  0:32                                         ` Gene Heskett
2007-04-24  7:12                                 ` Gene Heskett
2007-04-24  7:14                                   ` Ingo Molnar
2007-04-24 14:36                                     ` Gene Heskett
2007-04-24  7:25                                 ` Ingo Molnar
2007-04-24 14:39                                   ` Gene Heskett
2007-04-24 14:42                                   ` Gene Heskett
2007-04-24  7:33                                 ` Ingo Molnar
2007-04-26  0:51                             ` SD renice recommendation was: " Con Kolivas
2007-04-24 15:08                     ` Ray Lee
2007-04-25  9:32                       ` Ingo Molnar
2007-04-23 20:05                   ` Willy Tarreau
2007-04-24 21:05                   ` 'Scheduler Economy' prototype patch for CFS Ingo Molnar
2007-04-23  2:42             ` [report] renicing X, cfs-v5 vs sd-0.46 Ingo Molnar
2007-04-23 15:09               ` Linus Torvalds
2007-04-23 17:19                 ` Gene Heskett
2007-04-23 17:19                 ` Gene Heskett
2007-04-23 19:48                 ` Ingo Molnar
2007-04-23 20:56                   ` Michael K. Edwards
2007-04-22 13:23         ` [REPORT] cfs-v4 vs sd-0.44 Mark Lord
2007-04-21 18:17   ` Gene Heskett
2007-04-22  1:26     ` Con Kolivas
2007-04-22  2:07       ` Gene Heskett
2007-04-22  8:07     ` William Lee Irwin III
2007-04-22 11:11       ` Gene Heskett
2007-04-22  1:51   ` Con Kolivas
2007-04-21 20:35 ` [patch] CFS scheduler, v4 S.Çağlar Onur
2007-04-22  8:30 ` Michael Gerdau
2007-04-23 22:47   ` Ingo Molnar
2007-04-23  1:12 ` [patch] CFS scheduler, -v5 Ingo Molnar
2007-04-23  1:25   ` Nick Piggin
2007-04-23  2:39     ` Gene Heskett
2007-04-23  3:08       ` Ingo Molnar
2007-04-23  2:55     ` Ingo Molnar
2007-04-23  3:22       ` Nick Piggin
2007-04-23  3:43         ` Ingo Molnar
2007-04-23  4:06           ` Nick Piggin
2007-04-23  7:10             ` Ingo Molnar
2007-04-23  7:25               ` Nick Piggin
2007-04-23  7:35                 ` Ingo Molnar
2007-04-23  9:25             ` Ingo Molnar
2007-04-23  3:19   ` [patch] CFS scheduler, -v5 (build problem - make headers_check fails) Zach Carter
2007-04-23 10:03     ` Ingo Molnar
2007-04-23  5:16   ` [patch] CFS scheduler, -v5 Markus Trippelsdorf
2007-04-23  5:27     ` Markus Trippelsdorf
2007-04-23  6:21       ` Ingo Molnar
2007-04-25 11:43         ` Srivatsa Vaddagiri
2007-04-25 12:51           ` Ingo Molnar
2007-04-23 12:20   ` Guillaume Chazarain
2007-04-23 12:36     ` Ingo Molnar
2007-04-24 16:54   ` Christian Hesse
2007-04-25  9:25     ` Ingo Molnar
2007-04-25 10:51       ` Christian Hesse
2007-04-25 10:56         ` Ingo Molnar
2007-04-23  9:28 ` crash with CFS v4 and qemu/kvm (was: [patch] CFS scheduler, v4) Christian Hesse
2007-04-23 10:18   ` Ingo Molnar
2007-04-23 10:18     ` Ingo Molnar
2007-04-24 10:54     ` Christian Hesse
2007-04-24 10:54       ` Christian Hesse
2007-04-22  4:38 [REPORT] cfs-v4 vs sd-0.44 Al Boldi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.