linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
@ 2005-12-27 18:09 Paolo Ornati
  2005-12-27 21:48 ` Paolo Ornati
  0 siblings, 1 reply; 58+ messages in thread
From: Paolo Ornati @ 2005-12-27 18:09 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Hello,

I've found an easy-to-reproduce-for-me test case that shows a totally
wrong priority calculation: basically a CPU-intensitive process gets
better priority than a disk-intensitive one (dd if=bigfile
of=/dev/null ...).

Seems impossible, isn't it?

---- THE NUMBERS with 2.6.15-rc7 -----

The test-case is the Xvid encoding of dvd-ripped track with transcode
(using "dvd::rip" interface). The copied-and-pasted command line is
this:

mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' &&
cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob,null
-i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972
--a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 1 -y xvid4,null
-o /dev/null --print_status 20 && echo DVDRIP_SUCCESS mkdir -m 0775 -p
'/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec
transcode -H 10 -a 2 -x vob -i /home/paolo/tmp/test/vob/003 -w 1198,50
-b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 2 -y
xvid4 -o /home/paolo/tmp/test/avi/003/test-003.avi --print_status 20 &&
echo DVDRIP_SUCCESS


Here there is a TOP snapshot while running it:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5721 paolo     16   0  115m  18m 2428 R 84.4  3.7   0:15.11 transcode
 5736 paolo     25   0 50352 4516 1912 R  8.4  0.9   0:01.53 tcdecode
 5725 paolo     15   0  115m  18m 2428 S  4.6  3.7   0:00.84 transcode
 5738 paolo     18   0  115m  18m 2428 S  0.8  3.7   0:00.15 transcode
 5734 paolo     25   0 20356 1140  920 S  0.6  0.2   0:00.12 tcdemux
 5731 paolo     25   0 47312 2540 1996 R  0.4  0.5   0:00.08 tcdecode
 5319 root      15   0  166m  16m 2584 S  0.2  3.2   0:25.06 X
 5444 paolo     16   0 87116  22m  15m R  0.2  4.6   0:04.05 konsole
 5716 paolo     16   0 10424 1160  876 R  0.2  0.2   0:00.06 top
 5735 paolo     25   0 22364 1436  932 S  0.2  0.3   0:00.01 tcextract


DD running alone:

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real    0m4.052s
user    0m0.000s
sys     0m0.209s

DD while transcoding:

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real    0m26.121s
user    0m0.001s
sys     0m0.255s

---------------------------------------

I've tried older kernels finding that 2.6.11 is the first affected.

Going on with testing...

        2.6.11-rc[1-5]:
2.6.11-rc3 bad
2.6.11-rc1 bad

        2.6.10-bk[1-14]
2.6.10-bk7 good
2.6.10-bk11 good
2.6.10-bk13 bad
2.6.10-bk12 bad

So the problem was introduced with:
	>> 2.6.10-bk12 09-Jan-2005 <<

The exact behaviour is different with 2.6.11/12/13/14... for example:
with 2.6.11 the priority of "transcode" is initially set to ~25 and go
down to 17/18 when running DD.

The problem doesn't seem 100% reproducible with every kernel, sometimes
a "BAD" kernel looks "GOOD"... or maybe it was me confused by too
much compile/install/reboot/test work ;)

Other INFO:
- I'm on x86_64
- preemption ON/OFF doesn't make any differences


Can anyone reproduce this?
IOW: is this affecting only my machine?

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-gf89f5948 on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-27 18:09 [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) Paolo Ornati
@ 2005-12-27 21:48 ` Paolo Ornati
  2005-12-27 23:26   ` Con Kolivas
  2005-12-27 23:59   ` [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) Peter Williams
  0 siblings, 2 replies; 58+ messages in thread
From: Paolo Ornati @ 2005-12-27 21:48 UTC (permalink / raw)
  To: Paolo Ornati; +Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar

On Tue, 27 Dec 2005 19:09:18 +0100
Paolo Ornati <ornati@fastwebnet.it> wrote:

> Hello,
> 
> I've found an easy-to-reproduce-for-me test case that shows a totally
> wrong priority calculation: basically a CPU-intensitive process gets
> better priority than a disk-intensitive one (dd if=bigfile
> of=/dev/null ...).
> 
> Seems impossible, isn't it?
> 
> ---- THE NUMBERS with 2.6.15-rc7 -----
> 
> The test-case is the Xvid encoding of dvd-ripped track with transcode
> (using "dvd::rip" interface). The copied-and-pasted command line is
> this:
> 
> mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' &&
> cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob,null
> -i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972
> --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 1 -y xvid4,null
> -o /dev/null --print_status 20 && echo DVDRIP_SUCCESS mkdir -m 0775 -p
> '/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec
> transcode -H 10 -a 2 -x vob -i /home/paolo/tmp/test/vob/003 -w 1198,50
> -b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 2 -y
> xvid4 -o /home/paolo/tmp/test/avi/003/test-003.avi --print_status 20 &&
> echo DVDRIP_SUCCESS
> 
> 
> Here there is a TOP snapshot while running it:
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5721 paolo     16   0  115m  18m 2428 R 84.4  3.7   0:15.11 transcode
>  5736 paolo     25   0 50352 4516 1912 R  8.4  0.9   0:01.53 tcdecode
>  5725 paolo     15   0  115m  18m 2428 S  4.6  3.7   0:00.84 transcode
>  5738 paolo     18   0  115m  18m 2428 S  0.8  3.7   0:00.15 transcode
>  5734 paolo     25   0 20356 1140  920 S  0.6  0.2   0:00.12 tcdemux
>  5731 paolo     25   0 47312 2540 1996 R  0.4  0.5   0:00.08 tcdecode
>  5319 root      15   0  166m  16m 2584 S  0.2  3.2   0:25.06 X
>  5444 paolo     16   0 87116  22m  15m R  0.2  4.6   0:04.05 konsole
>  5716 paolo     16   0 10424 1160  876 R  0.2  0.2   0:00.06 top
>  5735 paolo     25   0 22364 1436  932 S  0.2  0.3   0:00.01 tcextract
> 
> 
> DD running alone:
> 
> paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
> 128+0 records in
> 128+0 records out
> 
> real    0m4.052s
> user    0m0.000s
> sys     0m0.209s
> 
> DD while transcoding:
> 
> paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
> 128+0 records in
> 128+0 records out
> 
> real    0m26.121s
> user    0m0.001s
> sys     0m0.255s
> 
> ---------------------------------------
> 
> I've tried older kernels finding that 2.6.11 is the first affected.
> 
> Going on with testing...
> 
>         2.6.11-rc[1-5]:
> 2.6.11-rc3 bad
> 2.6.11-rc1 bad
> 
>         2.6.10-bk[1-14]
> 2.6.10-bk7 good
> 2.6.10-bk11 good
> 2.6.10-bk13 bad
> 2.6.10-bk12 bad
> 
> So the problem was introduced with:
> 	>> 2.6.10-bk12 09-Jan-2005 <<
> 
> The exact behaviour is different with 2.6.11/12/13/14... for example:
> with 2.6.11 the priority of "transcode" is initially set to ~25 and go
> down to 17/18 when running DD.
> 
> The problem doesn't seem 100% reproducible with every kernel, sometimes
> a "BAD" kernel looks "GOOD"... or maybe it was me confused by too
> much compile/install/reboot/test work ;)
> 
> Other INFO:
> - I'm on x86_64
> - preemption ON/OFF doesn't make any differences
> 
> 
> Can anyone reproduce this?
> IOW: is this affecting only my machine?
> 

Hello Con and Ingo... I've found that the above problem goes away
by reverting this:

http://linux.bkbits.net:8080/linux-2.6/cset@41e054c6pwNQXzErMxvfh4IpLPXA5A?nav=index.html|src/|src/include|src/include/linux|related/include/linux/sched.h

--------------------------------------------------

[PATCH] sched: remove_interactive_credit

Special casing tasks by interactive credit was helpful for preventing fully
cpu bound tasks from easily rising to interactive status.

However it did not select out tasks that had periods of being fully cpu
bound and then sleeping while waiting on pipes, signals etc.  This led to a
more disproportionate share of cpu time.

Backing this out will no longer special case only fully cpu bound tasks,
and prevents the variable behaviour that occurs at startup before tasks
declare themseleves interactive or not, and speeds up application startup
slightly under certain circumstances.  It does cost in interactivity
slightly as load rises but it is worth it for the fairness gains.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

--------------------------------------------------


Maybe this change has revealed a scheduler weakness ?

I'm glad to test any patch or give more data :)

Bye,

-- 
	Paolo Ornati
	Linux 2.6.10-bk12-int_credit on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-27 21:48 ` Paolo Ornati
@ 2005-12-27 23:26   ` Con Kolivas
  2005-12-28 11:01     ` Paolo Ornati
  2005-12-30 13:52     ` [SCHED] wrong priority calc - SIMPLE test case Paolo Ornati
  2005-12-27 23:59   ` [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) Peter Williams
  1 sibling, 2 replies; 58+ messages in thread
From: Con Kolivas @ 2005-12-27 23:26 UTC (permalink / raw)
  To: Paolo Ornati; +Cc: Linux Kernel Mailing List, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 3885 bytes --]

On Wednesday 28 December 2005 08:48, Paolo Ornati wrote:
> On Tue, 27 Dec 2005 19:09:18 +0100
>
> Paolo Ornati <ornati@fastwebnet.it> wrote:
> > Hello,
> >
> > I've found an easy-to-reproduce-for-me test case that shows a totally
> > wrong priority calculation: basically a CPU-intensitive process gets
> > better priority than a disk-intensitive one (dd if=bigfile
> > of=/dev/null ...).
> >
> > Seems impossible, isn't it?
> >
> > ---- THE NUMBERS with 2.6.15-rc7 -----
> >
> > The test-case is the Xvid encoding of dvd-ripped track with transcode
> > (using "dvd::rip" interface). The copied-and-pasted command line is
> > this:
> >
> > mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' &&
> > cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob,null
> > -i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972
> > --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 1 -y xvid4,null
> > -o /dev/null --print_status 20 && echo DVDRIP_SUCCESS mkdir -m 0775 -p
> > '/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec
> > transcode -H 10 -a 2 -x vob -i /home/paolo/tmp/test/vob/003 -w 1198,50
> > -b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 2 -y
> > xvid4 -o /home/paolo/tmp/test/avi/003/test-003.avi --print_status 20 &&
> > echo DVDRIP_SUCCESS
> >
> >
> > Here there is a TOP snapshot while running it:
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >  5721 paolo     16   0  115m  18m 2428 R 84.4  3.7   0:15.11 transcode
> >  5736 paolo     25   0 50352 4516 1912 R  8.4  0.9   0:01.53 tcdecode
> >  5725 paolo     15   0  115m  18m 2428 S  4.6  3.7   0:00.84 transcode
> >  5738 paolo     18   0  115m  18m 2428 S  0.8  3.7   0:00.15 transcode
> >  5734 paolo     25   0 20356 1140  920 S  0.6  0.2   0:00.12 tcdemux
> >  5731 paolo     25   0 47312 2540 1996 R  0.4  0.5   0:00.08 tcdecode
> >  5319 root      15   0  166m  16m 2584 S  0.2  3.2   0:25.06 X
> >  5444 paolo     16   0 87116  22m  15m R  0.2  4.6   0:04.05 konsole
> >  5716 paolo     16   0 10424 1160  876 R  0.2  0.2   0:00.06 top
> >  5735 paolo     25   0 22364 1436  932 S  0.2  0.3   0:00.01 tcextract
> >
> >
> > DD running alone:
> >
> > paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null
> > bs=1M count=128; umount space/ 128+0 records in
> > 128+0 records out
> >
> > real    0m4.052s
> > user    0m0.000s
> > sys     0m0.209s
> >
> > DD while transcoding:
> >
> > paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null
> > bs=1M count=128; umount space/ 128+0 records in
> > 128+0 records out
> >
> > real    0m26.121s
> > user    0m0.001s
> > sys     0m0.255s

Looking at your top output I see that transcode command generates 7 processes 
all likely to be using cpu, and your DD slowdown is almost 7 times... I 
assume it all goes away if you nice the transcode up by 3 or more.

> Hello Con and Ingo... I've found that the above problem goes away
> by reverting this:
>
> http://linux.bkbits.net:8080/linux-2.6/cset@41e054c6pwNQXzErMxvfh4IpLPXA5A?
>nav=index.html|src/|src/include|src/include/linux|related/include/linux/sche
>d.h
>
> --------------------------------------------------
>
> [PATCH] sched: remove_interactive_credit

The issue is that the scheduler interactivity estimator is a state machine and 
can be fooled to some degree, and a cpu intensive task that just happens to 
sleep a little bit gets significantly better priority than one that is fully 
cpu bound all the time. Reverting that change is not a solution because it 
can still be fooled by the same process sleeping lots for a few seconds or so 
at startup and then changing to the cpu mostly-sleeping slightly behaviour. 
This "fluctuating" behaviour is in my opinion worse which is why I removed 
it.

Cheers,
Con

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-27 21:48 ` Paolo Ornati
  2005-12-27 23:26   ` Con Kolivas
@ 2005-12-27 23:59   ` Peter Williams
  2005-12-28 10:20     ` Paolo Ornati
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Williams @ 2005-12-27 23:59 UTC (permalink / raw)
  To: Paolo Ornati; +Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar

Paolo Ornati wrote:
> On Tue, 27 Dec 2005 19:09:18 +0100
> Paolo Ornati <ornati@fastwebnet.it> wrote:
> 
> 
>>Hello,
>>
>>I've found an easy-to-reproduce-for-me test case that shows a totally
>>wrong priority calculation: basically a CPU-intensitive process gets
>>better priority than a disk-intensitive one (dd if=bigfile
>>of=/dev/null ...).
>>
>>Seems impossible, isn't it?
>>
>>---- THE NUMBERS with 2.6.15-rc7 -----
>>
>>The test-case is the Xvid encoding of dvd-ripped track with transcode
>>(using "dvd::rip" interface). The copied-and-pasted command line is
>>this:
>>
>>mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' &&
>>cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob,null
>>-i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972
>>--a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 1 -y xvid4,null
>>-o /dev/null --print_status 20 && echo DVDRIP_SUCCESS mkdir -m 0775 -p
>>'/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec
>>transcode -H 10 -a 2 -x vob -i /home/paolo/tmp/test/vob/003 -w 1198,50
>>-b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 2 -y
>>xvid4 -o /home/paolo/tmp/test/avi/003/test-003.avi --print_status 20 &&
>>echo DVDRIP_SUCCESS
>>
>>
>>Here there is a TOP snapshot while running it:
>>
>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 5721 paolo     16   0  115m  18m 2428 R 84.4  3.7   0:15.11 transcode
>> 5736 paolo     25   0 50352 4516 1912 R  8.4  0.9   0:01.53 tcdecode
>> 5725 paolo     15   0  115m  18m 2428 S  4.6  3.7   0:00.84 transcode
>> 5738 paolo     18   0  115m  18m 2428 S  0.8  3.7   0:00.15 transcode
>> 5734 paolo     25   0 20356 1140  920 S  0.6  0.2   0:00.12 tcdemux
>> 5731 paolo     25   0 47312 2540 1996 R  0.4  0.5   0:00.08 tcdecode
>> 5319 root      15   0  166m  16m 2584 S  0.2  3.2   0:25.06 X
>> 5444 paolo     16   0 87116  22m  15m R  0.2  4.6   0:04.05 konsole
>> 5716 paolo     16   0 10424 1160  876 R  0.2  0.2   0:00.06 top
>> 5735 paolo     25   0 22364 1436  932 S  0.2  0.3   0:00.01 tcextract
>>
>>
>>DD running alone:
>>
>>paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
>>128+0 records in
>>128+0 records out
>>
>>real    0m4.052s
>>user    0m0.000s
>>sys     0m0.209s
>>
>>DD while transcoding:
>>
>>paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
>>128+0 records in
>>128+0 records out
>>
>>real    0m26.121s
>>user    0m0.001s
>>sys     0m0.255s
>>
>>---------------------------------------
>>
>>I've tried older kernels finding that 2.6.11 is the first affected.
>>
>>Going on with testing...
>>
>>        2.6.11-rc[1-5]:
>>2.6.11-rc3 bad
>>2.6.11-rc1 bad
>>
>>        2.6.10-bk[1-14]
>>2.6.10-bk7 good
>>2.6.10-bk11 good
>>2.6.10-bk13 bad
>>2.6.10-bk12 bad
>>
>>So the problem was introduced with:
>>	>> 2.6.10-bk12 09-Jan-2005 <<
>>
>>The exact behaviour is different with 2.6.11/12/13/14... for example:
>>with 2.6.11 the priority of "transcode" is initially set to ~25 and go
>>down to 17/18 when running DD.
>>
>>The problem doesn't seem 100% reproducible with every kernel, sometimes
>>a "BAD" kernel looks "GOOD"... or maybe it was me confused by too
>>much compile/install/reboot/test work ;)
>>
>>Other INFO:
>>- I'm on x86_64
>>- preemption ON/OFF doesn't make any differences
>>
>>
>>Can anyone reproduce this?
>>IOW: is this affecting only my machine?
>>
> 
> 
> Hello Con and Ingo... I've found that the above problem goes away
> by reverting this:
> 
> http://linux.bkbits.net:8080/linux-2.6/cset@41e054c6pwNQXzErMxvfh4IpLPXA5A?nav=index.html|src/|src/include|src/include/linux|related/include/linux/sched.h
> 
> --------------------------------------------------
> 
> [PATCH] sched: remove_interactive_credit
> 
> Special casing tasks by interactive credit was helpful for preventing fully
> cpu bound tasks from easily rising to interactive status.
> 
> However it did not select out tasks that had periods of being fully cpu
> bound and then sleeping while waiting on pipes, signals etc.  This led to a
> more disproportionate share of cpu time.
> 
> Backing this out will no longer special case only fully cpu bound tasks,
> and prevents the variable behaviour that occurs at startup before tasks
> declare themseleves interactive or not, and speeds up application startup
> slightly under certain circumstances.  It does cost in interactivity
> slightly as load rises but it is worth it for the fairness gains.
> 
> Signed-off-by: Con Kolivas <kernel@kolivas.org>
> Acked-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Andrew Morton <akpm@osdl.org>
> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> 
> --------------------------------------------------
> 
> 
> Maybe this change has revealed a scheduler weakness ?
> 
> I'm glad to test any patch or give more data :)
> 
> Bye,
> 

Any chance of you applying the PlugSched patches and seeing how the 
other schedulers that it contains handle this situation?

The patch at:

<http://prdownloads.sourceforge.net/cpuse/plugsched-6.1.6-for-2.6.15-rc5.patch?download>

should apply without problems to the 2.6.15-rc7 kernel.

Very Brief Documentation:

You can select a default scheduler at kernel build time.  If you wish to
boot with a scheduler other than the default it can be selected at boot
time by adding:

cpusched=<scheduler>

to the boot command line where <scheduler> is one of: ingosched,
nicksched, staircase, spa_no_frills, spa_ws, spa_svr or zaphod.  If you
don't change the default when you build the kernel the default scheduler
will be ingosched (which is the normal scheduler).

The scheduler in force on a running system can be determined by the
contents of:

/proc/scheduler

Control parameters for the scheduler can be read/set via files in:

/sys/cpusched/<scheduler>/

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-27 23:59   ` [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) Peter Williams
@ 2005-12-28 10:20     ` Paolo Ornati
  2005-12-28 13:38       ` Peter Williams
  0 siblings, 1 reply; 58+ messages in thread
From: Paolo Ornati @ 2005-12-28 10:20 UTC (permalink / raw)
  To: Peter Williams; +Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar

On Wed, 28 Dec 2005 10:59:13 +1100
Peter Williams <pwil3058@bigpond.net.au> wrote:

> Any chance of you applying the PlugSched patches and seeing how the 
> other schedulers that it contains handle this situation?
> 
> The patch at:
> 
> <http://prdownloads.sourceforge.net/cpuse/plugsched-6.1.6-for-2.6.15-rc5.patch?download>
> 
> should apply without problems to the 2.6.15-rc7 kernel.
> 
> Very Brief Documentation:
> 
> You can select a default scheduler at kernel build time.  If you wish to
> boot with a scheduler other than the default it can be selected at boot
> time by adding:
> 
> cpusched=<scheduler>
> 
> to the boot command line where <scheduler> is one of: ingosched,
> nicksched, staircase, spa_no_frills, spa_ws, spa_svr or zaphod.  If you
> don't change the default when you build the kernel the default scheduler
> will be ingosched (which is the normal scheduler).


First of all, this is the "pstree" structure of transcode an friends:

     |-kdesktop---perl---sh---transcode-+-2*[sh-+-tccat]
     |                                  |       |-tcdecode]
     |                                  |       |-tcdemux]
     |                                  |       `-tcextract]
     |                                  `-transcode---5*[transcode]


Results with various schedulers:

------------------------------------------------------------------------

	1) nicksched: perfect! This is the behaviour I want.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5562 paolo     40   0  115m  18m 2428 R 82.2  3.7   0:22.16 transcode
 5576 paolo     26   0 50348 4516 1912 S  9.5  0.9   0:02.43 tcdecode
 5566 paolo     23   0  115m  18m 2428 S  4.6  3.7   0:01.24 transcode
 5573 paolo     21   0  115m  18m 2428 S  0.9  3.7   0:00.22 transcode
 5577 paolo     27   0 20356 1140  920 S  0.9  0.2   0:00.21 tcdemux
 5295 root      20   0  167m  17m 3624 S  0.6  3.5   0:11.02 X
 5579 paolo     20   0 47308 2540 1996 S  0.5  0.5   0:00.14 tcdecode
 5574 paolo     20   0 20356 1144  920 S  0.4  0.2   0:00.11 tcdemux
...

transcode get recognized for what it is, and I/O bounded processes
don't even notice that it is running :)


	2) staircase: bad, as you can see:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5582 paolo     26   0  115m  18m 2428 R 82.7  3.7   0:47.63 transcode
 5599 paolo     39   0 50352 4516 1912 R  9.6  0.9   0:05.21 tcdecode
 5586 paolo      0   0  115m  18m 2428 S  4.5  3.7   0:02.61 transcode
 5622 paolo     39   0  4948 1520  412 R  1.1  0.3   0:00.15 dd
 5591 paolo      0   0  115m  18m 2428 S  0.6  3.7   0:00.36 transcode
 5575 paolo      0   0 98476  37m 9392 S  0.4  7.5   0:01.44 perl
 5597 paolo     27   0 20356 1144  920 S  0.4  0.2   0:00.21 tcdemux
 5475 paolo      0   0 86556  22m  15m S  0.2  4.5   0:01.24 konsole
 5388 root       0   0  167m  17m 3208 S  0.1  3.4   0:03.16 X
 5587 paolo      0   0  115m  18m 2428 S  0.1  3.7   0:00.03 transcode
 5595 paolo     20   0 47312 2540 1996 S  0.1  0.5   0:00.14 tcdecode
 5596 paolo     26   0 22672 1268 1020 S  0.1  0.2   0:00.03 tccat
 5598 paolo     28   0 22364 1436  932 S  0.1  0.3   0:00.04 tcextract


And "DD" is affected badly:

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
of=/dev/null bs=1M count=128; umount space/ 128+0 records in
128+0 records out

real    0m6.341s
user    0m0.002s
sys     0m0.229s

While transcoding:

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
of=/dev/null bs=1M count=256; umount space/ 256+0 records in
256+0 records out

real    0m15.793s
user    0m0.001s
sys     0m0.374s


	3) spa_no_frills: bad, but this is OK since it is Round Robin :)

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5356 paolo     20   0  115m  18m 2428 R 81.1  3.7   0:27.61 transcode
 5371 paolo     20   0 50348 4516 1912 R  8.9  0.9   0:02.97 tcdecode
 5360 paolo     20   0  115m  18m 2428 S  4.1  3.7   0:01.54 transcode
 5378 paolo     20   0  4948 1520  412 D  1.4  0.3   0:00.29 dd
 5364 paolo     20   0 20352 1144  920 S  0.9  0.2   0:00.20 tcdemux
 5373 paolo     20   0  115m  18m 2428 S  0.7  3.7   0:00.32 transcode
 5369 paolo     20   0 20356 1144  920 S  0.5  0.2   0:00.14 tcdemux
 5205 root      20   0  165m  15m 2584 R  0.2  3.2   0:01.86 X


	4) spa_ws: bad

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5334 paolo     32   0  115m  18m 2428 R 82.7  3.7   0:18.77 transcode
 5349 paolo     32   0 50348 4516 1912 R  8.9  0.9   0:02.00 tcdecode
 5338 paolo     21   0  115m  18m 2428 S  4.6  3.7   0:01.08 transcode
 5356 paolo     32   0  4948 1520  412 D  1.1  0.3   0:00.12 dd
 5351 paolo     32   0  115m  18m 2428 S  1.0  3.7   0:00.20 transcode
 5199 root      21   0  165m  15m 2584 S  0.4  3.2   0:01.68 X
 5347 paolo     32   0 20356 1140  920 S  0.4  0.2   0:00.08 tcdemux
 5296 paolo     22   0 98472  37m 9392 S  0.2  7.5   0:01.47 perl
 5299 paolo     21   0 86556  22m  15m S  0.2  4.4   0:00.75 konsole
 5344 paolo     32   0 47308 2540 1996 S  0.2  0.5   0:00.07 tcdecode
 5339 paolo     21   0  115m  18m 2428 S  0.1  3.7   0:00.01 transcode

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
of=/dev/null bs=1M count=256; umount space/ 256+0 records in
256+0 records out

real    0m8.112s
user    0m0.001s
sys     0m0.444s

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
of=/dev/null bs=1M count=256; umount space/ 256+0 records in
256+0 records out

real    0m29.222s
user    0m0.000s
sys     0m0.400s


	5) spa_svr: surprise, surprise! Not all that bad. At least DD
gets better priority than transcode... and DD real time is only a bit
affected (8s --> ~9s).


  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5334 paolo     33   0  115m  18m 2428 R 78.1  3.7   0:22.70 transcode
 5349 paolo     28   0 50352 4516 1912 S  9.0  0.9   0:02.41 tcdecode
 5338 paolo     25   0  115m  18m 2428 S  4.7  3.7   0:01.29 transcode
 5363 paolo     27   0  4952 1520  412 R  4.7  0.3   0:00.25 dd
 5342 paolo     33   0 20352 1140  920 S  1.6  0.2   0:00.21 tcdemux
 5351 paolo     25   0  115m  18m 2428 S  0.8  3.7   0:00.23 transcode
 5144 root      22   0  166m  16m 3120 S  0.4  3.3   0:01.85 X
 5344 paolo     23   0 47308 2540 1996 S  0.4  0.5   0:00.13 tcdecode
 5347 paolo     27   0 20356 1144  920 S  0.4  0.2   0:00.10 tcdemux
 5231 paolo     22   0 86660  22m  15m S  0.2  4.5   0:00.95 konsole
 5271 paolo     25   0 98476  37m 9396 S  0.2  7.5   0:01.54 perl
 5341 paolo     23   0 22672 1268 1020 S  0.2  0.2   0:00.02 tccat


	6) zaphod: more or less like spa_svr

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5308 paolo     34   0  115m  18m 2428 R 52.1  3.7   0:49.77 transcode
 5323 paolo     32   0 50352 4516 1912 S  6.0  0.9   0:05.61 tcdecode
 5356 paolo     28   0  4952 1520  412 D  3.5  0.3   0:00.28 dd
 5312 paolo     28   0  115m  18m 2428 S  2.6  3.7   0:02.71 transcode
 5325 paolo     31   0  115m  18m 2428 S  0.7  3.7   0:00.55 transcode
 5316 paolo     37   0 20352 1140  920 S  0.4  0.2   0:00.33 tcdemux
 5202 root      23   0  165m  15m 2584 S  0.2  3.1   0:01.57 X
 5318 paolo     31   0 47312 2540 1996 S  0.2  0.5   0:00.28 tcdecode
 5321 paolo     33   0 20356 1144  920 S  0.2  0.2   0:00.26 tcdemux
 4760 messageb  25   0 13248 1068  848 S  0.1  0.2   0:00.07
dbus-daemon-1 5264 paolo     24   0 93920  17m  10m S  0.1  3.5
0:00.38 kded 5282 paolo     23   0 92712  19m  12m S  0.1  3.9
0:00.36 kdesktop


	7) ingosched: bad, as already said in the original post

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5209 paolo     16   0  115m  18m 2428 R 72.0  3.7   0:22.13 transcode
 5224 paolo     22   0 50348 4516 1912 R  8.4  0.9   0:02.44 tcdecode
 5213 paolo     15   0  115m  18m 2428 S  4.2  3.7   0:01.24 transcode
 5243 paolo     18   0  4948 1520  412 R  1.8  0.3   0:00.14 dd
 5217 paolo     19   0 20356 1144  920 R  0.8  0.2   0:00.19 tcdemux
 5108 root      15   0  165m  15m 2584 S  0.6  3.1   0:01.44 X
 5226 paolo     15   0  115m  18m 2428 S  0.6  3.7   0:00.20 transcode
 5216 paolo     18   0 22676 1268 1020 S  0.4  0.2   0:00.03 tccat
 5219 paolo     18   0 47312 2540 1996 R  0.4  0.5   0:00.12 tcdecode
 5222 paolo     18   0 20356 1144  920 S  0.4  0.2   0:00.10 tcdemux
 5195 paolo     16   0 98488  37m 9392 S  0.2  7.5   0:01.41 perl
 5198 paolo     16   0 86552  22m  15m R  0.2  4.4   0:00.66 konsole

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real    0m23.393s	(instead of 8s)
user    0m0.001s
sys     0m0.418s

------------------------------------------------------------------------


So the winner for manifest superiority is "nicksched", it looks to me
even better than 2.6.10-bk12 (ingosched) with
"remove_interactive_credit" reverted.

:)

-- 
	Paolo Ornati
	Linux 2.6.15-rc5-plugsched on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-27 23:26   ` Con Kolivas
@ 2005-12-28 11:01     ` Paolo Ornati
  2005-12-28 11:19       ` Con Kolivas
  2005-12-30 13:52     ` [SCHED] wrong priority calc - SIMPLE test case Paolo Ornati
  1 sibling, 1 reply; 58+ messages in thread
From: Paolo Ornati @ 2005-12-28 11:01 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Linux Kernel Mailing List, Ingo Molnar

On Wed, 28 Dec 2005 10:26:58 +1100
Con Kolivas <kernel@kolivas.org> wrote:

> Looking at your top output I see that transcode command generates 7 processes 
> all likely to be using cpu, and your DD slowdown is almost 7 times... I 
> assume it all goes away if you nice the transcode up by 3 or more.

Yes, if I nice everything to 3 or more (nice -n 3 dvdrip ...) it works,
but I would prefer a less weak scheduler (see the other email, with
results for various schedulers).

Another thing that I've noticed is that things tends to get worse when
"transcode" is running for long time (some hours). It happened to me
some times (with ingosched and also with staircase if I remember
correctly):
	after some hours of running transcode (with me away from the
machine) I've found a totally UNUSABLE system. Transcode was the king
of the machine and everything else get almost no CPU time. Switching to
a Text-Console takes something like 10s (or something like that). When
I was finally logged in as root I've reniced transcode and companyto +19
and the system was usable again ;)

To get things even more STRANGE: another time that this happened I've
done the same thing except that I've reniced them to "0" (the same nice
level they were running) ---> And the system became usable again (with
the usual slow down but still usable).

This is what I remember. Now I think we can agree that there is
something wrong... no?

:)

-- 
	Paolo Ornati
	Linux 2.6.15-rc5-plugsched on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-28 11:01     ` Paolo Ornati
@ 2005-12-28 11:19       ` Con Kolivas
  2005-12-28 11:35         ` Paolo Ornati
  0 siblings, 1 reply; 58+ messages in thread
From: Con Kolivas @ 2005-12-28 11:19 UTC (permalink / raw)
  To: Paolo Ornati; +Cc: Linux Kernel Mailing List, Ingo Molnar

On Wednesday 28 December 2005 22:01, Paolo Ornati wrote:
> 	after some hours of running transcode (with me away from the
> machine) I've found a totally UNUSABLE system. Transcode was the king
> of the machine and everything else get almost no CPU time. Switching to
> a Text-Console takes something like 10s (or something like that). When
> I was finally logged in as root I've reniced transcode and companyto +19
> and the system was usable again ;)
>
> To get things even more STRANGE: another time that this happened I've
> done the same thing except that I've reniced them to "0" (the same nice
> level they were running) ---> And the system became usable again (with
> the usual slow down but still usable).
>
> This is what I remember. Now I think we can agree that there is
> something wrong... no?

This latter thing sounds more like your transcode job pushed everything out to 
swap... You need to instrument this case better.

Con

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-28 11:19       ` Con Kolivas
@ 2005-12-28 11:35         ` Paolo Ornati
  2005-12-28 17:23           ` Paolo Ornati
  0 siblings, 1 reply; 58+ messages in thread
From: Paolo Ornati @ 2005-12-28 11:35 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Linux Kernel Mailing List, Ingo Molnar

On Wed, 28 Dec 2005 22:19:23 +1100
Con Kolivas <kernel@kolivas.org> wrote:

> This latter thing sounds more like your transcode job pushed everything out to 
> swap... You need to instrument this case better.
> 

I don't know. The combination Swapped Out Programs + "normal" priority
strangeness can potentially result in a total disaster... but why
renicing transcode to "0" gets the system usable again?

Next time I'll grab "/proc/meminfo"... and what other info can help to
understand?

-- 
	Paolo Ornati
	Linux 2.6.15-rc5-plugsched on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-28 10:20     ` Paolo Ornati
@ 2005-12-28 13:38       ` Peter Williams
  2005-12-28 19:45         ` Paolo Ornati
  2005-12-29  3:13         ` Nick Piggin
  0 siblings, 2 replies; 58+ messages in thread
From: Peter Williams @ 2005-12-28 13:38 UTC (permalink / raw)
  To: Paolo Ornati; +Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar

Paolo Ornati wrote:
> On Wed, 28 Dec 2005 10:59:13 +1100
> Peter Williams <pwil3058@bigpond.net.au> wrote:
> 
> 
>>Any chance of you applying the PlugSched patches and seeing how the 
>>other schedulers that it contains handle this situation?
>>
>>The patch at:
>>
>><http://prdownloads.sourceforge.net/cpuse/plugsched-6.1.6-for-2.6.15-rc5.patch?download>
>>
>>should apply without problems to the 2.6.15-rc7 kernel.
>>
>>Very Brief Documentation:
>>
>>You can select a default scheduler at kernel build time.  If you wish to
>>boot with a scheduler other than the default it can be selected at boot
>>time by adding:
>>
>>cpusched=<scheduler>
>>
>>to the boot command line where <scheduler> is one of: ingosched,
>>nicksched, staircase, spa_no_frills, spa_ws, spa_svr or zaphod.  If you
>>don't change the default when you build the kernel the default scheduler
>>will be ingosched (which is the normal scheduler).
> 
> 
> 
> First of all, this is the "pstree" structure of transcode an friends:
> 
>      |-kdesktop---perl---sh---transcode-+-2*[sh-+-tccat]
>      |                                  |       |-tcdecode]
>      |                                  |       |-tcdemux]
>      |                                  |       `-tcextract]
>      |                                  `-transcode---5*[transcode]
> 
> 
> Results with various schedulers:

First, thanks for doing this.

> 
> ------------------------------------------------------------------------
> 
> 	1) nicksched: perfect! This is the behaviour I want.
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5562 paolo     40   0  115m  18m 2428 R 82.2  3.7   0:22.16 transcode
>  5576 paolo     26   0 50348 4516 1912 S  9.5  0.9   0:02.43 tcdecode
>  5566 paolo     23   0  115m  18m 2428 S  4.6  3.7   0:01.24 transcode
>  5573 paolo     21   0  115m  18m 2428 S  0.9  3.7   0:00.22 transcode
>  5577 paolo     27   0 20356 1140  920 S  0.9  0.2   0:00.21 tcdemux
>  5295 root      20   0  167m  17m 3624 S  0.6  3.5   0:11.02 X
>  5579 paolo     20   0 47308 2540 1996 S  0.5  0.5   0:00.14 tcdecode
>  5574 paolo     20   0 20356 1144  920 S  0.4  0.2   0:00.11 tcdemux
> ...
> 
> transcode get recognized for what it is, and I/O bounded processes
> don't even notice that it is running :)

Interesting.  This one's more or less a dead scheduler and hasn't had 
any development work done on it for some time.  I just keep porting the 
original version to new kernels.

> 
> 
> 	2) staircase: bad, as you can see:
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5582 paolo     26   0  115m  18m 2428 R 82.7  3.7   0:47.63 transcode
>  5599 paolo     39   0 50352 4516 1912 R  9.6  0.9   0:05.21 tcdecode
>  5586 paolo      0   0  115m  18m 2428 S  4.5  3.7   0:02.61 transcode
>  5622 paolo     39   0  4948 1520  412 R  1.1  0.3   0:00.15 dd
>  5591 paolo      0   0  115m  18m 2428 S  0.6  3.7   0:00.36 transcode
>  5575 paolo      0   0 98476  37m 9392 S  0.4  7.5   0:01.44 perl
>  5597 paolo     27   0 20356 1144  920 S  0.4  0.2   0:00.21 tcdemux
>  5475 paolo      0   0 86556  22m  15m S  0.2  4.5   0:01.24 konsole
>  5388 root       0   0  167m  17m 3208 S  0.1  3.4   0:03.16 X
>  5587 paolo      0   0  115m  18m 2428 S  0.1  3.7   0:00.03 transcode
>  5595 paolo     20   0 47312 2540 1996 S  0.1  0.5   0:00.14 tcdecode
>  5596 paolo     26   0 22672 1268 1020 S  0.1  0.2   0:00.03 tccat
>  5598 paolo     28   0 22364 1436  932 S  0.1  0.3   0:00.04 tcextract
> 
> 
> And "DD" is affected badly:
> 
> paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
> of=/dev/null bs=1M count=128; umount space/ 128+0 records in
> 128+0 records out
> 
> real    0m6.341s
> user    0m0.002s
> sys     0m0.229s
> 
> While transcoding:
> 
> paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
> of=/dev/null bs=1M count=256; umount space/ 256+0 records in
> 256+0 records out
> 
> real    0m15.793s
> user    0m0.001s
> sys     0m0.374s
> 
> 
> 	3) spa_no_frills: bad, but this is OK since it is Round Robin :)
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5356 paolo     20   0  115m  18m 2428 R 81.1  3.7   0:27.61 transcode
>  5371 paolo     20   0 50348 4516 1912 R  8.9  0.9   0:02.97 tcdecode
>  5360 paolo     20   0  115m  18m 2428 S  4.1  3.7   0:01.54 transcode
>  5378 paolo     20   0  4948 1520  412 D  1.4  0.3   0:00.29 dd
>  5364 paolo     20   0 20352 1144  920 S  0.9  0.2   0:00.20 tcdemux
>  5373 paolo     20   0  115m  18m 2428 S  0.7  3.7   0:00.32 transcode
>  5369 paolo     20   0 20356 1144  920 S  0.5  0.2   0:00.14 tcdemux
>  5205 root      20   0  165m  15m 2584 R  0.2  3.2   0:01.86 X
> 

Yes, no surprises there.

> 
> 	4) spa_ws: bad
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5334 paolo     32   0  115m  18m 2428 R 82.7  3.7   0:18.77 transcode
>  5349 paolo     32   0 50348 4516 1912 R  8.9  0.9   0:02.00 tcdecode
>  5338 paolo     21   0  115m  18m 2428 S  4.6  3.7   0:01.08 transcode
>  5356 paolo     32   0  4948 1520  412 D  1.1  0.3   0:00.12 dd
>  5351 paolo     32   0  115m  18m 2428 S  1.0  3.7   0:00.20 transcode
>  5199 root      21   0  165m  15m 2584 S  0.4  3.2   0:01.68 X
>  5347 paolo     32   0 20356 1140  920 S  0.4  0.2   0:00.08 tcdemux
>  5296 paolo     22   0 98472  37m 9392 S  0.2  7.5   0:01.47 perl
>  5299 paolo     21   0 86556  22m  15m S  0.2  4.4   0:00.75 konsole
>  5344 paolo     32   0 47308 2540 1996 S  0.2  0.5   0:00.07 tcdecode
>  5339 paolo     21   0  115m  18m 2428 S  0.1  3.7   0:00.01 transcode
> 
> paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
> of=/dev/null bs=1M count=256; umount space/ 256+0 records in
> 256+0 records out
> 
> real    0m8.112s
> user    0m0.001s
> sys     0m0.444s
> 
> paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
> of=/dev/null bs=1M count=256; umount space/ 256+0 records in
> 256+0 records out
> 
> real    0m29.222s
> user    0m0.000s
> sys     0m0.400s

This one is aimed purely at good interactive responsiveness (i.e. 
keyboard, mouse, X server and media players such as rythmbox/xmms) so no 
real surprises here either.

> 
> 
> 	5) spa_svr: surprise, surprise! Not all that bad. At least DD
> gets better priority than transcode... and DD real time is only a bit
> affected (8s --> ~9s).
>

This will be the "throughput bonus" in action.  It's overall aim is to 
reduce the time tasks spend on the runqueue waiting for CPU access 
a.k.a. delay.  It does this by using the system load and the average 
amount of CPU time that the task uses each scheduling cycle to estimate 
the expected delay for the task and gives it a bonus if the actual 
average delays being experienced are bigger than this value.

It's intended for server systems not interactive systems as reducing 
overall delay isn't necessarily good for interactive systems where the 
aim is to quell the user's impatience by giving good latency to the 
interactive tasks.  These aims aren't always compatible.

> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5334 paolo     33   0  115m  18m 2428 R 78.1  3.7   0:22.70 transcode
>  5349 paolo     28   0 50352 4516 1912 S  9.0  0.9   0:02.41 tcdecode
>  5338 paolo     25   0  115m  18m 2428 S  4.7  3.7   0:01.29 transcode
>  5363 paolo     27   0  4952 1520  412 R  4.7  0.3   0:00.25 dd
>  5342 paolo     33   0 20352 1140  920 S  1.6  0.2   0:00.21 tcdemux
>  5351 paolo     25   0  115m  18m 2428 S  0.8  3.7   0:00.23 transcode
>  5144 root      22   0  166m  16m 3120 S  0.4  3.3   0:01.85 X
>  5344 paolo     23   0 47308 2540 1996 S  0.4  0.5   0:00.13 tcdecode
>  5347 paolo     27   0 20356 1144  920 S  0.4  0.2   0:00.10 tcdemux
>  5231 paolo     22   0 86660  22m  15m S  0.2  4.5   0:00.95 konsole
>  5271 paolo     25   0 98476  37m 9396 S  0.2  7.5   0:01.54 perl
>  5341 paolo     23   0 22672 1268 1020 S  0.2  0.2   0:00.02 tccat
> 
> 
> 	6) zaphod: more or less like spa_svr

Zaphod includes the throughput bonus in its armoury which why it is 
similar in performance to spa_svr.

> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5308 paolo     34   0  115m  18m 2428 R 52.1  3.7   0:49.77 transcode
>  5323 paolo     32   0 50352 4516 1912 S  6.0  0.9   0:05.61 tcdecode
>  5356 paolo     28   0  4952 1520  412 D  3.5  0.3   0:00.28 dd
>  5312 paolo     28   0  115m  18m 2428 S  2.6  3.7   0:02.71 transcode
>  5325 paolo     31   0  115m  18m 2428 S  0.7  3.7   0:00.55 transcode
>  5316 paolo     37   0 20352 1140  920 S  0.4  0.2   0:00.33 tcdemux
>  5202 root      23   0  165m  15m 2584 S  0.2  3.1   0:01.57 X
>  5318 paolo     31   0 47312 2540 1996 S  0.2  0.5   0:00.28 tcdecode
>  5321 paolo     33   0 20356 1144  920 S  0.2  0.2   0:00.26 tcdemux
>  4760 messageb  25   0 13248 1068  848 S  0.1  0.2   0:00.07
> dbus-daemon-1 5264 paolo     24   0 93920  17m  10m S  0.1  3.5
> 0:00.38 kded 5282 paolo     23   0 92712  19m  12m S  0.1  3.9
> 0:00.36 kdesktop
> 
> 
> 	7) ingosched: bad, as already said in the original post
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5209 paolo     16   0  115m  18m 2428 R 72.0  3.7   0:22.13 transcode
>  5224 paolo     22   0 50348 4516 1912 R  8.4  0.9   0:02.44 tcdecode
>  5213 paolo     15   0  115m  18m 2428 S  4.2  3.7   0:01.24 transcode
>  5243 paolo     18   0  4948 1520  412 R  1.8  0.3   0:00.14 dd
>  5217 paolo     19   0 20356 1144  920 R  0.8  0.2   0:00.19 tcdemux
>  5108 root      15   0  165m  15m 2584 S  0.6  3.1   0:01.44 X
>  5226 paolo     15   0  115m  18m 2428 S  0.6  3.7   0:00.20 transcode
>  5216 paolo     18   0 22676 1268 1020 S  0.4  0.2   0:00.03 tccat
>  5219 paolo     18   0 47312 2540 1996 R  0.4  0.5   0:00.12 tcdecode
>  5222 paolo     18   0 20356 1144  920 S  0.4  0.2   0:00.10 tcdemux
>  5195 paolo     16   0 98488  37m 9392 S  0.2  7.5   0:01.41 perl
>  5198 paolo     16   0 86552  22m  15m R  0.2  4.4   0:00.66 konsole
> 
> paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
> 256+0 records in
> 256+0 records out
> 
> real    0m23.393s	(instead of 8s)
> user    0m0.001s
> sys     0m0.418s
> 
> ------------------------------------------------------------------------
> 
> 
> So the winner for manifest superiority is "nicksched", it looks to me
> even better than 2.6.10-bk12 (ingosched) with
> "remove_interactive_credit" reverted.

Thanks for this data.  It will enable me to make some mods to the 
spa_xxx and zaphod schedulers.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-28 11:35         ` Paolo Ornati
@ 2005-12-28 17:23           ` Paolo Ornati
  2005-12-28 17:39             ` Paolo Ornati
  0 siblings, 1 reply; 58+ messages in thread
From: Paolo Ornati @ 2005-12-28 17:23 UTC (permalink / raw)
  To: Paolo Ornati; +Cc: Con Kolivas, Linux Kernel Mailing List, Ingo Molnar

On Wed, 28 Dec 2005 12:35:47 +0100
Paolo Ornati <ornati@fastwebnet.it> wrote:

> > This latter thing sounds more like your transcode job pushed everything out to 
> > swap... You need to instrument this case better.
> > 
> 
> I don't know. The combination Swapped Out Programs + "normal" priority
> strangeness can potentially result in a total disaster... but why
> renicing transcode to "0" gets the system usable again?
> 
> Next time I'll grab "/proc/meminfo"... and what other info can help to
> understand?

I'm doing some tests with "ingosched" and a bit longer trancoding
session and I've found a strange thing.

DD running time can change a LOT between one run and another (always
while transcode is running):

----------------------------------------------------------------------
paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
of=/dev/null bs=1M count=256; umount space/ 256+0 records in
256+0 records out

real    0m15.754s
user    0m0.000s
sys     0m0.500s

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
of=/dev/null bs=1M count=256; umount space/ 256+0 records in
256+0 records out

real    0m52.966s
user    0m0.000s
sys     0m0.504s

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
of=/dev/null bs=1M count=256; umount space/ 256+0 records in
256+0 records out

real    0m48.928s
user    0m0.004s
sys     0m0.524s
---------------------------------------------------------------------

Looking at top while running these I've seen that this is related to
how transcode's (the one that eats a lot of CPU) priority changes.

When only transcoding his priority is 16, but when also DD test is
running then that value fluctuate between 16 and 18.

DD priority is always 18 instead.

Usually transcode's prio go to 17/18 and DD runs in 15/20s, but
sometimes it doesn't fluctuate staying stuck to 16 and DD runs in ~50s.


PS:
I'm not yet able to reproduce the "totally unusable system" I was
talking about.

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-plugsched on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-28 17:23           ` Paolo Ornati
@ 2005-12-28 17:39             ` Paolo Ornati
  0 siblings, 0 replies; 58+ messages in thread
From: Paolo Ornati @ 2005-12-28 17:39 UTC (permalink / raw)
  To: Paolo Ornati; +Cc: Con Kolivas, Linux Kernel Mailing List, Ingo Molnar

On Wed, 28 Dec 2005 18:23:23 +0100
Paolo Ornati <ornati@fastwebnet.it> wrote:

> Usually transcode's prio go to 17/18 and DD runs in 15/20s, but
> sometimes it doesn't fluctuate staying stuck to 16 and DD runs in ~50s.

And now I've noticed that when that prority stops fluctuating, it stops
forever. Running the DD test again and again doesn't change anything!

For some reasons (running time?) the trancode priority is stuck to 16
and DD always performs very badly: ~50s (normally it should be 8s).

When I've noticed this the real running time of the trancode test is
about 35/40 min.

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-plugsched on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-28 13:38       ` Peter Williams
@ 2005-12-28 19:45         ` Paolo Ornati
  2005-12-29  3:13         ` Nick Piggin
  1 sibling, 0 replies; 58+ messages in thread
From: Paolo Ornati @ 2005-12-28 19:45 UTC (permalink / raw)
  To: Peter Williams; +Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar

On Thu, 29 Dec 2005 00:38:08 +1100
Peter Williams <pwil3058@bigpond.net.au> wrote:

> Thanks for this data.  It will enable me to make some mods to the 
> spa_xxx and zaphod schedulers.

Ok, but keep in mind that these numbers are just "snapshots". With
almost all schedulers the priority of the CPU-eater transcode and other
processes fluctuate a bit (an exception here is nicksched, that gives
priority 40 to transcode and never change it).

It seems also that small variation of the priority can affect seriously
my DD test case running time (expecially, I think, with schedulers that
give to "transcode" a better-or-equal priority than "dd" -->
ingosched/staircaise).

This is another mail in witch you weren't in CC that explains it better:

http://www.ussg.iu.edu/hypermail/linux/kernel/0512.3/0647.html

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-plugsched on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-28 13:38       ` Peter Williams
  2005-12-28 19:45         ` Paolo Ornati
@ 2005-12-29  3:13         ` Nick Piggin
  2005-12-29  3:35           ` Peter Williams
  1 sibling, 1 reply; 58+ messages in thread
From: Nick Piggin @ 2005-12-29  3:13 UTC (permalink / raw)
  To: Peter Williams
  Cc: Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar

Peter Williams wrote:
> Paolo Ornati wrote:

>>     1) nicksched: perfect! This is the behaviour I want.
>>
...
>>
>> transcode get recognized for what it is, and I/O bounded processes
>> don't even notice that it is running :)
> 
> 
> Interesting.  This one's more or less a dead scheduler and hasn't had 
> any development work done on it for some time.  I just keep porting the 
> original version to new kernels.
> 

It isn't a dead scheduler any more than any of the other out of tree
schedulers are (which isn't saying much, unfortunately).

I've probably got a small number of cleanups and microoptimisations
relative to what you have (I can't remember exactly what you sucked up)
... but other than that there hasn't been much development work done for
some time because there is not much wrong with it.

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-29  3:13         ` Nick Piggin
@ 2005-12-29  3:35           ` Peter Williams
  2005-12-29  8:11             ` Nick Piggin
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Williams @ 2005-12-29  3:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar

Nick Piggin wrote:
> Peter Williams wrote:
> 
>> Paolo Ornati wrote:
> 
> 
>>>     1) nicksched: perfect! This is the behaviour I want.
>>>
> ...
> 
>>>
>>> transcode get recognized for what it is, and I/O bounded processes
>>> don't even notice that it is running :)
>>
>>
>>
>> Interesting.  This one's more or less a dead scheduler and hasn't had 
>> any development work done on it for some time.  I just keep porting 
>> the original version to new kernels.
>>
> 
> It isn't a dead scheduler any more than any of the other out of tree
> schedulers are (which isn't saying much, unfortunately).

Ingosched, staircase and my SPA schedulers are all evolving slowly.
  Are there any out there that I don't have in PlugSched that you think 
should be?

> 
> I've probably got a small number of cleanups and microoptimisations
> relative to what you have (I can't remember exactly what you sucked up)
> ... but other than that there hasn't been much development work done for
> some time because there is not much wrong with it.
> 

I was starting to think that you'd lost interest in this which is why I 
said it was more or less dead.  Sorry.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12)
  2005-12-29  3:35           ` Peter Williams
@ 2005-12-29  8:11             ` Nick Piggin
  0 siblings, 0 replies; 58+ messages in thread
From: Nick Piggin @ 2005-12-29  8:11 UTC (permalink / raw)
  To: Peter Williams
  Cc: Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar

Peter Williams wrote:
> Nick Piggin wrote:

>> It isn't a dead scheduler any more than any of the other out of tree
>> schedulers are (which isn't saying much, unfortunately).
> 
> 
> Ingosched, staircase and my SPA schedulers are all evolving slowly.
>  Are there any out there that I don't have in PlugSched that you think 
> should be?
> 

Not that I know of...

>>
>> I've probably got a small number of cleanups and microoptimisations
>> relative to what you have (I can't remember exactly what you sucked up)
>> ... but other than that there hasn't been much development work done for
>> some time because there is not much wrong with it.
>>
> 
> I was starting to think that you'd lost interest in this which is why I 
> said it was more or less dead.  Sorry.
> 

No worries. I haven't lost interest so much as people seem to be fairly
happy with the current scheduler and least aren't busting my door down
for updates to nicksched ;)

I'll do a resynch for 2.6.15 though.

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [SCHED] wrong priority calc - SIMPLE test case
  2005-12-27 23:26   ` Con Kolivas
  2005-12-28 11:01     ` Paolo Ornati
@ 2005-12-30 13:52     ` Paolo Ornati
  2005-12-31  2:06       ` Peter Williams
                         ` (2 more replies)
  1 sibling, 3 replies; 58+ messages in thread
From: Paolo Ornati @ 2005-12-30 13:52 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams

WAS: [SCHED] Totally WRONG prority calculation with specific test-case
(since 2.6.10-bk12)
http://lkml.org/lkml/2005/12/27/114/index.html

On Wed, 28 Dec 2005 10:26:58 +1100
Con Kolivas <kernel@kolivas.org> wrote:

> The issue is that the scheduler interactivity estimator is a state machine and 
> can be fooled to some degree, and a cpu intensive task that just happens to 
> sleep a little bit gets significantly better priority than one that is fully 
> cpu bound all the time. Reverting that change is not a solution because it 
> can still be fooled by the same process sleeping lots for a few seconds or so 
> at startup and then changing to the cpu mostly-sleeping slightly behaviour. 
> This "fluctuating" behaviour is in my opinion worse which is why I removed 
> it.

Trying to find a "as simple as possible" test case for this problem
(that I consider a BUG in priority calculation) I've come up with this
very simple program:

------ sched_fooler.c -------------------------------
#include <stdlib.h>
#include <unistd.h>

static void burn_cpu(unsigned int x)
{
	static char buf[1024];
	int i;
	
	for (i=0; i < x; ++i)
		buf[i%sizeof(buf)] = (x-i)*3;
}

int main(int argc, char **argv)
{
	unsigned long burn;
	if (argc != 2)
		return 1;
	burn = (unsigned long)atoi(argv[1]);
	while(1) {
		burn_cpu(burn*1000);
		usleep(1);
	}
	return 0;
}
----------------------------------------------

paolo@tux ~/tmp/sched_fooler $ gcc sched_fooler.c
paolo@tux ~/tmp/sched_fooler $ ./a.out 5000

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5633 paolo     15   0  2392  320  252 S 77.7  0.1   0:18.84 a.out
 5225 root      15   0  169m  19m 2956 S  2.0  3.9   0:13.17 X
 5307 paolo     15   0  100m  22m  13m S  2.0  4.4   0:04.32 kicker


paolo@tux ~/tmp/sched_fooler $ ./a.out 10000

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5637 paolo     16   0  2396  320  252 R 87.4  0.1   0:13.38 a.out
 5312 paolo     16   0 86636  22m  15m R  0.1  4.5   0:02.02 konsole
    1 root      16   0  2552  560  468 S  0.0  0.1   0:00.71 init


If I only run 2 of these together the system becomes everything but
interactive ;)

./a.out 10000 & ./a.out 4000

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5714 paolo     15   0  2392  320  252 S 59.2  0.1   0:12.28 a.out
 5713 paolo     16   0  2396  320  252 S 37.1  0.1   0:07.63 a.out


DD TEST: the usual DD test (to read 128 MB from a non-cached file =
disk bounded) says everything, numbers with 2.6.15-rc7:

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real    0m4.052s
user    0m0.004s
sys     0m0.180s

START 2 OF THEM "./a.out 10000 & ./a.out 4000"

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real    2m4.578s
user    0m0.000s
sys     0m0.244s


This does't surprise me anymore, since DD gets priority 18 and these
two CPU eaters get 15/16.

As usual "nicksched" is NOT affected at all, IOW it gives to these CPU
eaters the priority that they deserve.

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-g3603bc8d on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-30 13:52     ` [SCHED] wrong priority calc - SIMPLE test case Paolo Ornati
@ 2005-12-31  2:06       ` Peter Williams
  2005-12-31 10:34         ` Paolo Ornati
  2005-12-31  8:13       ` Mike Galbraith
  2006-01-13  1:13       ` Con Kolivas
  2 siblings, 1 reply; 58+ messages in thread
From: Peter Williams @ 2005-12-31  2:06 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin

[-- Attachment #1: Type: text/plain, Size: 3997 bytes --]

Paolo Ornati wrote:
> WAS: [SCHED] Totally WRONG prority calculation with specific test-case
> (since 2.6.10-bk12)
> http://lkml.org/lkml/2005/12/27/114/index.html
> 
> On Wed, 28 Dec 2005 10:26:58 +1100
> Con Kolivas <kernel@kolivas.org> wrote:
> 
> 
>>The issue is that the scheduler interactivity estimator is a state machine and 
>>can be fooled to some degree, and a cpu intensive task that just happens to 
>>sleep a little bit gets significantly better priority than one that is fully 
>>cpu bound all the time. Reverting that change is not a solution because it 
>>can still be fooled by the same process sleeping lots for a few seconds or so 
>>at startup and then changing to the cpu mostly-sleeping slightly behaviour. 
>>This "fluctuating" behaviour is in my opinion worse which is why I removed 
>>it.
> 
> 
> Trying to find a "as simple as possible" test case for this problem
> (that I consider a BUG in priority calculation) I've come up with this
> very simple program:
> 
> ------ sched_fooler.c -------------------------------
> #include <stdlib.h>
> #include <unistd.h>
> 
> static void burn_cpu(unsigned int x)
> {
> 	static char buf[1024];
> 	int i;
> 	
> 	for (i=0; i < x; ++i)
> 		buf[i%sizeof(buf)] = (x-i)*3;
> }
> 
> int main(int argc, char **argv)
> {
> 	unsigned long burn;
> 	if (argc != 2)
> 		return 1;
> 	burn = (unsigned long)atoi(argv[1]);
> 	while(1) {
> 		burn_cpu(burn*1000);
> 		usleep(1);
> 	}
> 	return 0;
> }
> ----------------------------------------------
> 
> paolo@tux ~/tmp/sched_fooler $ gcc sched_fooler.c
> paolo@tux ~/tmp/sched_fooler $ ./a.out 5000
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5633 paolo     15   0  2392  320  252 S 77.7  0.1   0:18.84 a.out
>  5225 root      15   0  169m  19m 2956 S  2.0  3.9   0:13.17 X
>  5307 paolo     15   0  100m  22m  13m S  2.0  4.4   0:04.32 kicker
> 
> 
> paolo@tux ~/tmp/sched_fooler $ ./a.out 10000
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5637 paolo     16   0  2396  320  252 R 87.4  0.1   0:13.38 a.out
>  5312 paolo     16   0 86636  22m  15m R  0.1  4.5   0:02.02 konsole
>     1 root      16   0  2552  560  468 S  0.0  0.1   0:00.71 init
> 
> 
> If I only run 2 of these together the system becomes everything but
> interactive ;)
> 
> ./a.out 10000 & ./a.out 4000
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5714 paolo     15   0  2392  320  252 S 59.2  0.1   0:12.28 a.out
>  5713 paolo     16   0  2396  320  252 S 37.1  0.1   0:07.63 a.out
> 
> 
> DD TEST: the usual DD test (to read 128 MB from a non-cached file =
> disk bounded) says everything, numbers with 2.6.15-rc7:
> 
> paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
> 128+0 records in
> 128+0 records out
> 
> real    0m4.052s
> user    0m0.004s
> sys     0m0.180s
> 
> START 2 OF THEM "./a.out 10000 & ./a.out 4000"
> 
> paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
> 128+0 records in
> 128+0 records out
> 
> real    2m4.578s
> user    0m0.000s
> sys     0m0.244s
> 
> 
> This does't surprise me anymore, since DD gets priority 18 and these
> two CPU eaters get 15/16.
> 
> As usual "nicksched" is NOT affected at all, IOW it gives to these CPU
> eaters the priority that they deserve.
> 

Attached is a testing version of a patch for modifying scheduler bonus 
calculations that I'm working on.  Although these changes aren't 
targetted at the problem you are experiencing I believe that they may 
help.  My testing shows that sched_fooler instances don't get any 
bonuses but I would appreciate it if you could try it out.

It is a patch against the 2.6.15-rc7 kernel and includes some other 
scheduling patches from the -mm kernels.

Thanks
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

[-- Attachment #2: lial-test.patch --]
[-- Type: text/x-patch, Size: 35320 bytes --]

Index: GIT_lial/fs/proc/array.c
===================================================================
--- GIT_lial.orig/fs/proc/array.c	2005-12-29 12:05:42.000000000 +1100
+++ GIT_lial/fs/proc/array.c	2005-12-31 12:29:47.000000000 +1100
@@ -165,7 +165,7 @@ static inline char * task_state(struct t
 	read_lock(&tasklist_lock);
 	buffer += sprintf(buffer,
 		"State:\t%s\n"
-		"SleepAVG:\t%lu%%\n"
+		"LatencyAVG:\t%llu nsec\n"
 		"Tgid:\t%d\n"
 		"Pid:\t%d\n"
 		"PPid:\t%d\n"
@@ -173,7 +173,7 @@ static inline char * task_state(struct t
 		"Uid:\t%d\t%d\t%d\t%d\n"
 		"Gid:\t%d\t%d\t%d\t%d\n",
 		get_task_state(p),
-		(p->sleep_avg/1024)*100/(1020000000/1024),
+		SCHED_AVG_RND(p->avg_latency),
 	       	p->tgid,
 		p->pid, pid_alive(p) ? p->group_leader->real_parent->tgid : 0,
 		pid_alive(p) && p->ptrace ? p->parent->pid : 0,
Index: GIT_lial/include/linux/sched.h
===================================================================
--- GIT_lial.orig/include/linux/sched.h	2005-12-29 12:05:42.000000000 +1100
+++ GIT_lial/include/linux/sched.h	2005-12-31 12:29:47.000000000 +1100
@@ -158,6 +158,7 @@ extern unsigned long nr_iowait(void);
 #define SCHED_NORMAL		0
 #define SCHED_FIFO		1
 #define SCHED_RR		2
+#define SCHED_BATCH		3
 
 struct sched_param {
 	int sched_priority;
@@ -473,9 +474,9 @@ struct signal_struct {
 
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
- * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL tasks are
- * in the range MAX_RT_PRIO..MAX_PRIO-1. Priority values
- * are inverted: lower p->prio value means higher priority.
+ * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
+ * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
+ * values are inverted: lower p->prio value means higher priority.
  *
  * The MAX_USER_RT_PRIO value allows the actual maximum
  * RT priority to be separate from the value exported to
@@ -683,6 +684,19 @@ static inline void prefetch_stack(struct
 struct audit_context;		/* See audit.c */
 struct mempolicy;
 
+/*
+ * Fixed denominator rational numbers for use estimating task's average
+ * latencies and cpu usage per run
+ */
+#define SCHED_AVG_OFFSET 4
+/*
+ * Get the rounded integer value of a scheduling statistic average field
+ * Needs to be public for printing average interactive latency in
+ * fs/proc/array.c
+ */
+#define SCHED_AVG_RND(x) \
+	(((x) + (1 << (SCHED_AVG_OFFSET - 1))) >> (SCHED_AVG_OFFSET))
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -695,16 +709,19 @@ struct task_struct {
 #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
 	int oncpu;
 #endif
-	int prio, static_prio;
+	int prio, static_prio, latency_bonus;
+#ifdef CONFIG_SMP
+	int bias_prio;
+#endif
 	struct list_head run_list;
 	prio_array_t *array;
 
 	unsigned short ioprio;
 
-	unsigned long sleep_avg;
+	unsigned long long avg_latency;
 	unsigned long long timestamp, last_ran;
+	unsigned long long avg_cpu_run;
 	unsigned long long sched_time; /* sched_clock time spent running */
-	int activated;
 
 	unsigned long policy;
 	cpumask_t cpus_allowed;
@@ -908,6 +925,8 @@ do { if (atomic_dec_and_test(&(tsk)->usa
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
 #define PF_RANDOMIZE	0x00800000	/* randomize virtual address space */
+#define PF_JUST_WOKEN   0x01000000      /* just woken */
+#define PF_IA_WAKE_UP   0x02000000      /* just woken from interactive sleep */
 
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
Index: GIT_lial/init/main.c
===================================================================
--- GIT_lial.orig/init/main.c	2005-12-31 12:28:25.000000000 +1100
+++ GIT_lial/init/main.c	2005-12-31 12:29:41.000000000 +1100
@@ -603,6 +603,8 @@ static void __init do_initcalls(void)
  *
  * Now we can finally start doing some real work..
  */
+extern void sched_sysfs_init(void);
+
 static void __init do_basic_setup(void)
 {
 	/* drivers will send hotplug events */
@@ -613,6 +615,7 @@ static void __init do_basic_setup(void)
 #ifdef CONFIG_SYSCTL
 	sysctl_init();
 #endif
+	sched_sysfs_init();
 
 	/* Networking initialization needs a process context */ 
 	sock_init();
Index: GIT_lial/kernel/exit.c
===================================================================
--- GIT_lial.orig/kernel/exit.c	2005-12-28 12:09:27.000000000 +1100
+++ GIT_lial/kernel/exit.c	2005-12-29 12:16:49.000000000 +1100
@@ -243,7 +243,9 @@ static inline void reparent_to_init(void
 	/* Set the exit signal to SIGCHLD so we signal init on exit */
 	current->exit_signal = SIGCHLD;
 
-	if ((current->policy == SCHED_NORMAL) && (task_nice(current) < 0))
+	if ((current->policy == SCHED_NORMAL ||
+			current->policy == SCHED_BATCH)
+				&& (task_nice(current) < 0))
 		set_user_nice(current, 0);
 	/* cpus_allowed? */
 	/* rt_priority? */
Index: GIT_lial/kernel/sched.c
===================================================================
--- GIT_lial.orig/kernel/sched.c	2005-12-29 12:05:42.000000000 +1100
+++ GIT_lial/kernel/sched.c	2005-12-31 12:30:51.000000000 +1100
@@ -60,6 +60,16 @@
 #define PRIO_TO_NICE(prio)	((prio) - MAX_RT_PRIO - 20)
 #define TASK_NICE(p)		PRIO_TO_NICE((p)->static_prio)
 
+#ifdef CONFIG_SMP
+/*
+ * Priority bias for load balancing ranges from 1 (nice==19) to 139 (RT
+ * priority of 100).
+ */
+#define NICE_TO_BIAS_PRIO(nice)	(20 - (nice))
+#define PRIO_TO_BIAS_PRIO(prio)	NICE_TO_BIAS_PRIO(PRIO_TO_NICE(prio))
+#define RTPRIO_TO_BIAS_PRIO(rp)	(40 + (rp))
+#endif
+
 /*
  * 'User priority' is the nice value converted to something we
  * can work with better when scaling various scheduler parameters,
@@ -84,16 +94,10 @@
  */
 #define MIN_TIMESLICE		max(5 * HZ / 1000, 1)
 #define DEF_TIMESLICE		(100 * HZ / 1000)
-#define ON_RUNQUEUE_WEIGHT	 30
-#define CHILD_PENALTY		 95
-#define PARENT_PENALTY		100
-#define EXIT_WEIGHT		  3
 #define PRIO_BONUS_RATIO	 25
 #define MAX_BONUS		(MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
 #define INTERACTIVE_DELTA	  2
-#define MAX_SLEEP_AVG		(DEF_TIMESLICE * MAX_BONUS)
-#define STARVATION_LIMIT	(MAX_SLEEP_AVG)
-#define NS_MAX_SLEEP_AVG	(JIFFIES_TO_NS(MAX_SLEEP_AVG))
+#define STARVATION_LIMIT       	(DEF_TIMESLICE * MAX_BONUS)
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -123,9 +127,8 @@
  * too hard.
  */
 
-#define CURRENT_BONUS(p) \
-	(NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
-		MAX_SLEEP_AVG)
+#define CURRENT_BONUS(p) (just_woken_from_ia_sleep(p) ? \
+			  (p)->latency_bonus + 1 : (p)->latency_bonus)
 
 #define GRANULARITY	(10 * HZ / 1000 ? : 1)
 
@@ -147,10 +150,6 @@
 #define TASK_INTERACTIVE(p) \
 	((p)->prio <= (p)->static_prio - DELTA(p))
 
-#define INTERACTIVE_SLEEP(p) \
-	(JIFFIES_TO_NS(MAX_SLEEP_AVG * \
-		(MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
-
 #define TASK_PREEMPTS_CURR(p, rq) \
 	((p)->prio < (rq)->curr->prio)
 
@@ -261,6 +260,91 @@ struct runqueue {
 
 static DEFINE_PER_CPU(struct runqueue, runqueues);
 
+#define SCHED_AVG_REAL(a) ((a) << SCHED_AVG_OFFSET)
+#define SCHED_AVG_ALPHA ((1 << SCHED_AVG_OFFSET) - 1)
+
+/* The range of acceptable interactive latencies in nanosecs */
+unsigned long unacceptable_ia_latency = 150000UL;
+unsigned int acceptability_factor = 8;
+#define ACCEPTABLE(l) ((l) >> acceptability_factor)
+#define UNACCEPTABLE_IA_LATENCY unacceptable_ia_latency
+#define ACCEPTABLE_IA_LATENCY ACCEPTABLE(UNACCEPTABLE_IA_LATENCY)
+
+static inline void incr_latency_bonus(task_t *p)
+{
+	/*
+	 * one bonus point is reserved for allocation to all  interactive
+	 * wake ups
+	 */
+	if (p->latency_bonus < (MAX_BONUS - 1))
+		++p->latency_bonus;
+}
+
+static inline void decr_latency_bonus(task_t *p)
+{
+	if (p->latency_bonus > 0)
+		--p->latency_bonus;
+}
+
+static inline int just_woken(task_t *p)
+{
+	return p->flags & PF_JUST_WOKEN;
+}
+
+static inline int just_woken_from_ia_sleep(task_t *p)
+{
+	return p->flags & PF_IA_WAKE_UP;
+}
+
+static inline void decay_avg_value(unsigned long long *val)
+{
+	*val *= SCHED_AVG_ALPHA;
+	*val >>= SCHED_AVG_OFFSET;
+}
+
+static void update_latency_bonus(task_t *p, runqueue_t *rq, unsigned long long now)
+{
+	long long delta = now - p->timestamp;
+
+	/* make allowance for sched_clock() not being monotonic */
+	if (unlikely(delta < 0))
+		delta = 0;
+
+	decay_avg_value(&p->avg_latency);
+	p->avg_latency += delta;
+
+	/* do this now rather than earlier so that average latency is available
+	 * for didplay for all tasks not just SCHED_NORMAL.
+	 */
+	if (rt_task(p) || p->policy == SCHED_BATCH)
+		goto out;
+
+	if (just_woken_from_ia_sleep(p)) {
+		if (delta > UNACCEPTABLE_IA_LATENCY)
+			incr_latency_bonus(p);
+		else if (delta < ACCEPTABLE_IA_LATENCY)
+			decr_latency_bonus(p);
+	} else {
+		unsigned long long ual = UNACCEPTABLE_IA_LATENCY;
+
+		/*
+		 * The more tasks runnable the greater the acceptable non
+		 * interactive delay.  In the interests of fairness, tasks that
+		 * use short CPU runs are smaller acceptable latencies.
+		 */
+		if (likely(rq->nr_running > 0))
+			ual += SCHED_AVG_RND(p->avg_cpu_run) *
+				(rq->nr_running - 1);
+
+		if (delta > ual)
+			incr_latency_bonus(p);
+		else if (delta < ACCEPTABLE(ual))
+			decr_latency_bonus(p);
+	}
+out:
+	p->flags &= ~(PF_IA_WAKE_UP|PF_JUST_WOKEN);
+}
+
 /*
  * The domain tree (rq->sd) is protected by RCU's quiescent state transition.
  * See detach_destroy_domains: synchronize_sched for details.
@@ -633,9 +717,6 @@ static inline void enqueue_task_head(str
  * effective_prio - return the priority that is based on the static
  * priority but is modified by bonuses/penalties.
  *
- * We scale the actual sleep average [0 .... MAX_SLEEP_AVG]
- * into the -5 ... 0 ... +5 bonus/penalty range.
- *
  * We use 25% of the full 0...39 priority range so that:
  *
  * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs.
@@ -661,46 +742,53 @@ static int effective_prio(task_t *p)
 }
 
 #ifdef CONFIG_SMP
-static inline void inc_prio_bias(runqueue_t *rq, int prio)
+static inline void set_bias_prio(task_t *p)
 {
-	rq->prio_bias += MAX_PRIO - prio;
+	if (rt_task(p)) {
+		if (p == task_rq(p)->migration_thread)
+			/*
+			 * The migration thread does the actual balancing. Do
+			 * not bias by its priority as the ultra high priority
+			 * will skew balancing adversely.
+			 */
+			p->bias_prio = 0;
+		else
+			p->bias_prio = RTPRIO_TO_BIAS_PRIO(p->rt_priority);
+	} else
+		p->bias_prio = PRIO_TO_BIAS_PRIO(p->static_prio);
+}
+
+static inline void inc_prio_bias(runqueue_t *rq, const task_t *p)
+{
+	rq->prio_bias += p->bias_prio;
 }
 
-static inline void dec_prio_bias(runqueue_t *rq, int prio)
+static inline void dec_prio_bias(runqueue_t *rq, const task_t *p)
 {
-	rq->prio_bias -= MAX_PRIO - prio;
+	rq->prio_bias -= p->bias_prio;
 }
 
 static inline void inc_nr_running(task_t *p, runqueue_t *rq)
 {
 	rq->nr_running++;
-	if (rt_task(p)) {
-		if (p != rq->migration_thread)
-			/*
-			 * The migration thread does the actual balancing. Do
-			 * not bias by its priority as the ultra high priority
-			 * will skew balancing adversely.
-			 */
-			inc_prio_bias(rq, p->prio);
-	} else
-		inc_prio_bias(rq, p->static_prio);
+	inc_prio_bias(rq, p);
 }
 
 static inline void dec_nr_running(task_t *p, runqueue_t *rq)
 {
 	rq->nr_running--;
-	if (rt_task(p)) {
-		if (p != rq->migration_thread)
-			dec_prio_bias(rq, p->prio);
-	} else
-		dec_prio_bias(rq, p->static_prio);
+	dec_prio_bias(rq, p);
 }
 #else
-static inline void inc_prio_bias(runqueue_t *rq, int prio)
+static inline void set_bias_prio(task_t *p)
+{
+}
+
+static inline void inc_prio_bias(runqueue_t *rq, const task_t *p)
 {
 }
 
-static inline void dec_prio_bias(runqueue_t *rq, int prio)
+static inline void dec_prio_bias(runqueue_t *rq, const task_t *p)
 {
 }
 
@@ -733,68 +821,6 @@ static inline void __activate_idle_task(
 	inc_nr_running(p, rq);
 }
 
-static int recalc_task_prio(task_t *p, unsigned long long now)
-{
-	/* Caller must always ensure 'now >= p->timestamp' */
-	unsigned long long __sleep_time = now - p->timestamp;
-	unsigned long sleep_time;
-
-	if (__sleep_time > NS_MAX_SLEEP_AVG)
-		sleep_time = NS_MAX_SLEEP_AVG;
-	else
-		sleep_time = (unsigned long)__sleep_time;
-
-	if (likely(sleep_time > 0)) {
-		/*
-		 * User tasks that sleep a long time are categorised as
-		 * idle and will get just interactive status to stay active &
-		 * prevent them suddenly becoming cpu hogs and starving
-		 * other processes.
-		 */
-		if (p->mm && p->activated != -1 &&
-			sleep_time > INTERACTIVE_SLEEP(p)) {
-				p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG -
-						DEF_TIMESLICE);
-		} else {
-			/*
-			 * The lower the sleep avg a task has the more
-			 * rapidly it will rise with sleep time.
-			 */
-			sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
-
-			/*
-			 * Tasks waking from uninterruptible sleep are
-			 * limited in their sleep_avg rise as they
-			 * are likely to be waiting on I/O
-			 */
-			if (p->activated == -1 && p->mm) {
-				if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
-					sleep_time = 0;
-				else if (p->sleep_avg + sleep_time >=
-						INTERACTIVE_SLEEP(p)) {
-					p->sleep_avg = INTERACTIVE_SLEEP(p);
-					sleep_time = 0;
-				}
-			}
-
-			/*
-			 * This code gives a bonus to interactive tasks.
-			 *
-			 * The boost works by updating the 'average sleep time'
-			 * value here, based on ->timestamp. The more time a
-			 * task spends sleeping, the higher the average gets -
-			 * and the higher the priority boost gets as well.
-			 */
-			p->sleep_avg += sleep_time;
-
-			if (p->sleep_avg > NS_MAX_SLEEP_AVG)
-				p->sleep_avg = NS_MAX_SLEEP_AVG;
-		}
-	}
-
-	return effective_prio(p);
-}
-
 /*
  * activate_task - move a task to the runqueue and do priority recalculation
  *
@@ -816,30 +842,8 @@ static void activate_task(task_t *p, run
 #endif
 
 	if (!rt_task(p))
-		p->prio = recalc_task_prio(p, now);
+		p->prio = effective_prio(p);
 
-	/*
-	 * This checks to make sure it's not an uninterruptible task
-	 * that is now waking up.
-	 */
-	if (!p->activated) {
-		/*
-		 * Tasks which were woken up by interrupts (ie. hw events)
-		 * are most likely of interactive nature. So we give them
-		 * the credit of extending their sleep time to the period
-		 * of time they spend on the runqueue, waiting for execution
-		 * on a CPU, first time around:
-		 */
-		if (in_interrupt())
-			p->activated = 2;
-		else {
-			/*
-			 * Normal first-time wakeups get a credit too for
-			 * on-runqueue time, but it will be weighted down:
-			 */
-			p->activated = 1;
-		}
-	}
 	p->timestamp = now;
 
 	__activate_task(p, rq);
@@ -994,61 +998,29 @@ void kick_process(task_t *p)
  * We want to under-estimate the load of migration sources, to
  * balance conservatively.
  */
-static inline unsigned long __source_load(int cpu, int type, enum idle_type idle)
+static inline unsigned long source_load(int cpu, int type)
 {
 	runqueue_t *rq = cpu_rq(cpu);
-	unsigned long running = rq->nr_running;
-	unsigned long source_load, cpu_load = rq->cpu_load[type-1],
-		load_now = running * SCHED_LOAD_SCALE;
+	unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE;
 
 	if (type == 0)
-		source_load = load_now;
-	else
-		source_load = min(cpu_load, load_now);
+		return load_now;
 
-	if (running > 1 || (idle == NOT_IDLE && running))
-		/*
-		 * If we are busy rebalancing the load is biased by
-		 * priority to create 'nice' support across cpus. When
-		 * idle rebalancing we should only bias the source_load if
-		 * there is more than one task running on that queue to
-		 * prevent idle rebalance from trying to pull tasks from a
-		 * queue with only one running task.
-		 */
-		source_load = source_load * rq->prio_bias / running;
-
-	return source_load;
-}
-
-static inline unsigned long source_load(int cpu, int type)
-{
-	return __source_load(cpu, type, NOT_IDLE);
+	return min(rq->cpu_load[type-1], load_now);
 }
 
 /*
  * Return a high guess at the load of a migration-target cpu
  */
-static inline unsigned long __target_load(int cpu, int type, enum idle_type idle)
+static inline unsigned long target_load(int cpu, int type)
 {
 	runqueue_t *rq = cpu_rq(cpu);
-	unsigned long running = rq->nr_running;
-	unsigned long target_load, cpu_load = rq->cpu_load[type-1],
-		load_now = running * SCHED_LOAD_SCALE;
+	unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE;
 
 	if (type == 0)
-		target_load = load_now;
-	else
-		target_load = max(cpu_load, load_now);
-
-	if (running > 1 || (idle == NOT_IDLE && running))
-		target_load = target_load * rq->prio_bias / running;
+		return load_now;
 
-	return target_load;
-}
-
-static inline unsigned long target_load(int cpu, int type)
-{
-	return __target_load(cpu, type, NOT_IDLE);
+	return max(rq->cpu_load[type-1], load_now);
 }
 
 /*
@@ -1306,7 +1278,7 @@ static int try_to_wake_up(task_t *p, uns
 			 * of the current CPU:
 			 */
 			if (sync)
-				tl -= SCHED_LOAD_SCALE;
+				tl -= p->bias_prio * SCHED_LOAD_SCALE;
 
 			if ((tl <= load &&
 				tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
@@ -1353,25 +1325,21 @@ out_set_cpu:
 
 out_activate:
 #endif /* CONFIG_SMP */
-	if (old_state == TASK_UNINTERRUPTIBLE) {
+	if (old_state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible--;
-		/*
-		 * Tasks on involuntary sleep don't earn
-		 * sleep_avg beyond just interactive state.
-		 */
-		p->activated = -1;
-	}
-
 	/*
-	 * Tasks that have marked their sleep as noninteractive get
-	 * woken up without updating their sleep average. (i.e. their
-	 * sleep is handled in a priority-neutral manner, no priority
-	 * boost and no penalty.)
+	 * uninterruptible sleeps are assumed to be non interactive.
+	 * interruptible sleeps are assumed to be interactive unless
+	 * tagged with the TASK_NONINTERACTIVE flag.
 	 */
-	if (old_state & TASK_NONINTERACTIVE)
-		__activate_task(p, rq);
+	if (old_state == TASK_INTERRUPTIBLE)
+		p->flags |= PF_IA_WAKE_UP;
 	else
-		activate_task(p, rq, cpu == this_cpu);
+		p->flags &= ~PF_IA_WAKE_UP;
+
+	p->flags |= PF_JUST_WOKEN;
+
+	activate_task(p, rq, cpu == this_cpu);
 	/*
 	 * Sync wakeups (i.e. those types of wakeups where the waker
 	 * has indicated that it will leave the CPU in short order)
@@ -1421,6 +1389,15 @@ void fastcall sched_fork(task_t *p, int 
 	set_task_cpu(p, cpu);
 
 	/*
+	 * Leave the latency bonus the same as the parent's.
+	 * This helps new tasks launched by media to get off to a good start
+	 * when the system is under load. If they don't warrant it they'll soon
+	 * lose it.
+	 */
+	p->avg_latency = 0;
+	p->avg_cpu_run = 0;
+
+	/*
 	 * We mark the process as running here, but have not actually
 	 * inserted it onto the runqueue yet. This guarantees that
 	 * nobody will actually run it, and a signal or other external
@@ -1477,22 +1454,13 @@ void fastcall wake_up_new_task(task_t *p
 {
 	unsigned long flags;
 	int this_cpu, cpu;
-	runqueue_t *rq, *this_rq;
+	runqueue_t *rq;
 
 	rq = task_rq_lock(p, &flags);
 	BUG_ON(p->state != TASK_RUNNING);
 	this_cpu = smp_processor_id();
 	cpu = task_cpu(p);
 
-	/*
-	 * We decrease the sleep average of forking parents
-	 * and children as well, to keep max-interactive tasks
-	 * from forking tasks that are max-interactive. The parent
-	 * (current) is done further down, under its lock.
-	 */
-	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
-		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
 	p->prio = effective_prio(p);
 
 	if (likely(cpu == this_cpu)) {
@@ -1515,15 +1483,8 @@ void fastcall wake_up_new_task(task_t *p
 		} else
 			/* Run child last */
 			__activate_task(p, rq);
-		/*
-		 * We skip the following code due to cpu == this_cpu
-	 	 *
-		 *   task_rq_unlock(rq, &flags);
-		 *   this_rq = task_rq_lock(current, &flags);
-		 */
-		this_rq = rq;
 	} else {
-		this_rq = cpu_rq(this_cpu);
+		runqueue_t *this_rq = cpu_rq(this_cpu);
 
 		/*
 		 * Not the local CPU - must adjust timestamp. This should
@@ -1534,17 +1495,8 @@ void fastcall wake_up_new_task(task_t *p
 		__activate_task(p, rq);
 		if (TASK_PREEMPTS_CURR(p, rq))
 			resched_task(rq->curr);
-
-		/*
-		 * Parent and child are on different CPUs, now get the
-		 * parent runqueue to update the parent's ->sleep_avg:
-		 */
-		task_rq_unlock(rq, &flags);
-		this_rq = task_rq_lock(current, &flags);
 	}
-	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
-		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-	task_rq_unlock(this_rq, &flags);
+	task_rq_unlock(rq, &flags);
 }
 
 /*
@@ -1571,10 +1523,6 @@ void fastcall sched_exit(task_t *p)
 		if (unlikely(p->parent->time_slice > task_timeslice(p)))
 			p->parent->time_slice = task_timeslice(p);
 	}
-	if (p->sleep_avg < p->parent->sleep_avg)
-		p->parent->sleep_avg = p->parent->sleep_avg /
-		(EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg /
-		(EXIT_WEIGHT + 1);
 	task_rq_unlock(rq, &flags);
 }
 
@@ -1911,15 +1859,16 @@ int can_migrate_task(task_t *p, runqueue
  * Called with both runqueues locked.
  */
 static int move_tasks(runqueue_t *this_rq, int this_cpu, runqueue_t *busiest,
-		      unsigned long max_nr_move, struct sched_domain *sd,
-		      enum idle_type idle, int *all_pinned)
+		      unsigned long max_nr_move, long max_bias_move,
+		      struct sched_domain *sd, enum idle_type idle,
+		      int *all_pinned)
 {
 	prio_array_t *array, *dst_array;
 	struct list_head *head, *curr;
 	int idx, pulled = 0, pinned = 0;
 	task_t *tmp;
 
-	if (max_nr_move == 0)
+	if (max_nr_move == 0 || max_bias_move == 0)
 		goto out;
 
 	pinned = 1;
@@ -1962,7 +1911,8 @@ skip_queue:
 
 	curr = curr->prev;
 
-	if (!can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) {
+	if (tmp->bias_prio > max_bias_move ||
+	    !can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) {
 		if (curr != head)
 			goto skip_queue;
 		idx++;
@@ -1976,9 +1926,13 @@ skip_queue:
 
 	pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
 	pulled++;
+	max_bias_move -= tmp->bias_prio;
 
-	/* We only want to steal up to the prescribed number of tasks. */
-	if (pulled < max_nr_move) {
+	/*
+	 * We only want to steal up to the prescribed number of tasks
+	 * and the prescribed amount of biased load.
+	 */
+	if (pulled < max_nr_move && max_bias_move > 0) {
 		if (curr != head)
 			goto skip_queue;
 		idx++;
@@ -1999,7 +1953,7 @@ out:
 
 /*
  * find_busiest_group finds and returns the busiest CPU group within the
- * domain. It calculates and returns the number of tasks which should be
+ * domain. It calculates and returns the amount of biased load which should be
  * moved to restore balance via the imbalance parameter.
  */
 static struct sched_group *
@@ -2035,9 +1989,9 @@ find_busiest_group(struct sched_domain *
 
 			/* Bias balancing toward cpus of our domain */
 			if (local_group)
-				load = __target_load(i, load_idx, idle);
+				load = target_load(i, load_idx);
 			else
-				load = __source_load(i, load_idx, idle);
+				load = source_load(i, load_idx);
 
 			avg_load += load;
 		}
@@ -2092,7 +2046,7 @@ find_busiest_group(struct sched_domain *
 		unsigned long tmp;
 
 		if (max_load - this_load >= SCHED_LOAD_SCALE*2) {
-			*imbalance = 1;
+			*imbalance = NICE_TO_BIAS_PRIO(0);
 			return busiest;
 		}
 
@@ -2125,7 +2079,7 @@ find_busiest_group(struct sched_domain *
 		if (pwr_move <= pwr_now)
 			goto out_balanced;
 
-		*imbalance = 1;
+		*imbalance = NICE_TO_BIAS_PRIO(0);
 		return busiest;
 	}
 
@@ -2142,15 +2096,14 @@ out_balanced:
 /*
  * find_busiest_queue - find the busiest runqueue among the cpus in group.
  */
-static runqueue_t *find_busiest_queue(struct sched_group *group,
-	enum idle_type idle)
+static runqueue_t *find_busiest_queue(struct sched_group *group)
 {
 	unsigned long load, max_load = 0;
 	runqueue_t *busiest = NULL;
 	int i;
 
 	for_each_cpu_mask(i, group->cpumask) {
-		load = __source_load(i, 0, idle);
+		load = source_load(i, 0);
 
 		if (load > max_load) {
 			max_load = load;
@@ -2167,6 +2120,7 @@ static runqueue_t *find_busiest_queue(st
  */
 #define MAX_PINNED_INTERVAL	512
 
+#define minus_1_or_zero(n) ((n) > 0 ? (n) - 1 : 0)
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -2194,7 +2148,7 @@ static int load_balance(int this_cpu, ru
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, idle);
+	busiest = find_busiest_queue(group);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
@@ -2214,6 +2168,7 @@ static int load_balance(int this_cpu, ru
 		 */
 		double_rq_lock(this_rq, busiest);
 		nr_moved = move_tasks(this_rq, this_cpu, busiest,
+					minus_1_or_zero(busiest->nr_running),
 					imbalance, sd, idle, &all_pinned);
 		double_rq_unlock(this_rq, busiest);
 
@@ -2317,7 +2272,7 @@ static int load_balance_newidle(int this
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, NEWLY_IDLE);
+	busiest = find_busiest_queue(group);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
 		goto out_balanced;
@@ -2332,6 +2287,7 @@ static int load_balance_newidle(int this
 		/* Attempt to move tasks */
 		double_lock_balance(this_rq, busiest);
 		nr_moved = move_tasks(this_rq, this_cpu, busiest,
+					minus_1_or_zero(busiest->nr_running),
 					imbalance, sd, NEWLY_IDLE, NULL);
 		spin_unlock(&busiest->lock);
 	}
@@ -2412,7 +2368,8 @@ static void active_load_balance(runqueue
 
 	schedstat_inc(sd, alb_cnt);
 
-	if (move_tasks(target_rq, target_cpu, busiest_rq, 1, sd, SCHED_IDLE, NULL))
+	if (move_tasks(target_rq, target_cpu, busiest_rq, 1,
+			RTPRIO_TO_BIAS_PRIO(100), sd, SCHED_IDLE, NULL))
 		schedstat_inc(sd, alb_pushed);
 	else
 		schedstat_inc(sd, alb_failed);
@@ -2440,7 +2397,7 @@ static void rebalance_tick(int this_cpu,
 	struct sched_domain *sd;
 	int i;
 
-	this_load = this_rq->nr_running * SCHED_LOAD_SCALE;
+	this_load = this_rq->prio_bias * SCHED_LOAD_SCALE;
 	/* Update our load */
 	for (i = 0; i < 3; i++) {
 		unsigned long new_load = this_load;
@@ -2689,6 +2646,10 @@ void scheduler_tick(void)
 	if (!--p->time_slice) {
 		dequeue_task(p, rq->active);
 		set_tsk_need_resched(p);
+		/* make sure that tasks that obtain an latency_bonus but then
+		 * become CPU bound eventually lose the bonus.
+		 */
+		decr_latency_bonus(p);
 		p->prio = effective_prio(p);
 		p->time_slice = task_timeslice(p);
 		p->first_time_slice = 0;
@@ -2949,8 +2910,7 @@ asmlinkage void __sched schedule(void)
 	prio_array_t *array;
 	struct list_head *queue;
 	unsigned long long now;
-	unsigned long run_time;
-	int cpu, idx, new_prio;
+	int cpu, idx;
 
 	/*
 	 * Test if we are atomic.  Since do_exit() needs to call into
@@ -2985,21 +2945,12 @@ need_resched_nonpreemptible:
 
 	schedstat_inc(rq, sched_cnt);
 	now = sched_clock();
-	if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {
-		run_time = now - prev->timestamp;
-		if (unlikely((long long)(now - prev->timestamp) < 0))
-			run_time = 0;
-	} else
-		run_time = NS_MAX_SLEEP_AVG;
-
-	/*
-	 * Tasks charged proportionately less run_time at high sleep_avg to
-	 * delay them losing their interactive status
-	 */
-	run_time /= (CURRENT_BONUS(prev) ? : 1);
 
 	spin_lock_irq(&rq->lock);
 
+	if (likely(now > prev->timestamp))
+		prev->avg_cpu_run += now - prev->timestamp;
+
 	if (unlikely(prev->flags & PF_DEAD))
 		prev->state = EXIT_DEAD;
 
@@ -3063,25 +3014,6 @@ go_idle:
 	queue = array->queue + idx;
 	next = list_entry(queue->next, task_t, run_list);
 
-	if (!rt_task(next) && next->activated > 0) {
-		unsigned long long delta = now - next->timestamp;
-		if (unlikely((long long)(now - next->timestamp) < 0))
-			delta = 0;
-
-		if (next->activated == 1)
-			delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
-
-		array = next->array;
-		new_prio = recalc_task_prio(next, next->timestamp + delta);
-
-		if (unlikely(next->prio != new_prio)) {
-			dequeue_task(next, array);
-			next->prio = new_prio;
-			enqueue_task(next, array);
-		} else
-			requeue_task(next, array);
-	}
-	next->activated = 0;
 switch_tasks:
 	if (next == rq->idle)
 		schedstat_inc(rq, sched_goidle);
@@ -3091,14 +3023,13 @@ switch_tasks:
 	rcu_qsctr_inc(task_cpu(prev));
 
 	update_cpu_clock(prev, rq, now);
-
-	prev->sleep_avg -= run_time;
-	if ((long)prev->sleep_avg <= 0)
-		prev->sleep_avg = 0;
 	prev->timestamp = prev->last_ran = now;
 
 	sched_info_switch(prev, next);
 	if (likely(prev != next)) {
+		decay_avg_value(&prev->avg_cpu_run);
+		if (just_woken(next))
+			update_latency_bonus(next, rq, now);
 		next->timestamp = now;
 		rq->nr_switches++;
 		rq->curr = next;
@@ -3543,7 +3474,7 @@ void set_user_nice(task_t *p, long nice)
 	 * The RT priorities are set via sched_setscheduler(), but we still
 	 * allow the 'normal' nice value to be set - but as expected
 	 * it wont have any effect on scheduling until the task is
-	 * not SCHED_NORMAL:
+	 * not SCHED_NORMAL/SCHED_BATCH:
 	 */
 	if (rt_task(p)) {
 		p->static_prio = NICE_TO_PRIO(nice);
@@ -3552,18 +3483,19 @@ void set_user_nice(task_t *p, long nice)
 	array = p->array;
 	if (array) {
 		dequeue_task(p, array);
-		dec_prio_bias(rq, p->static_prio);
+		dec_prio_bias(rq, p);
 	}
 
 	old_prio = p->prio;
 	new_prio = NICE_TO_PRIO(nice);
 	delta = new_prio - old_prio;
 	p->static_prio = NICE_TO_PRIO(nice);
+	set_bias_prio(p);
 	p->prio += delta;
 
 	if (array) {
 		enqueue_task(p, array);
-		inc_prio_bias(rq, p->static_prio);
+		inc_prio_bias(rq, p);
 		/*
 		 * If the task increased its priority or is running and
 		 * lowered its priority, then reschedule its CPU:
@@ -3689,10 +3621,17 @@ static void __setscheduler(struct task_s
 	BUG_ON(p->array);
 	p->policy = policy;
 	p->rt_priority = prio;
-	if (policy != SCHED_NORMAL)
+	if (policy != SCHED_NORMAL && policy != SCHED_BATCH) {
 		p->prio = MAX_RT_PRIO-1 - p->rt_priority;
-	else
+	} else {
 		p->prio = p->static_prio;
+		/*
+		 * SCHED_BATCH tasks are treated as perpetual CPU hogs:
+		 */
+		if (policy == SCHED_BATCH)
+			p->latency_bonus = 0;
+	}
+	set_bias_prio(p);
 }
 
 /**
@@ -3716,29 +3655,35 @@ recheck:
 	if (policy < 0)
 		policy = oldpolicy = p->policy;
 	else if (policy != SCHED_FIFO && policy != SCHED_RR &&
-				policy != SCHED_NORMAL)
+				policy != SCHED_NORMAL && policy != SCHED_BATCH)
 			return -EINVAL;
 	/*
 	 * Valid priorities for SCHED_FIFO and SCHED_RR are
-	 * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL is 0.
+	 * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL and
+	 * SCHED_BATCH is 0.
 	 */
 	if (param->sched_priority < 0 ||
 	    (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
 	    (!p->mm && param->sched_priority > MAX_RT_PRIO-1))
 		return -EINVAL;
-	if ((policy == SCHED_NORMAL) != (param->sched_priority == 0))
+	if ((policy == SCHED_NORMAL || policy == SCHED_BATCH)
+					!= (param->sched_priority == 0))
 		return -EINVAL;
 
 	/*
 	 * Allow unprivileged RT tasks to decrease priority:
 	 */
 	if (!capable(CAP_SYS_NICE)) {
-		/* can't change policy */
-		if (policy != p->policy &&
-			!p->signal->rlim[RLIMIT_RTPRIO].rlim_cur)
+		/*
+		 * can't change policy, except between SCHED_NORMAL
+		 * and SCHED_BATCH:
+		 */
+		if (((policy != SCHED_NORMAL && p->policy != SCHED_BATCH) &&
+			(policy != SCHED_BATCH && p->policy != SCHED_NORMAL)) &&
+				!p->signal->rlim[RLIMIT_RTPRIO].rlim_cur)
 			return -EPERM;
 		/* can't increase priority */
-		if (policy != SCHED_NORMAL &&
+		if ((policy != SCHED_NORMAL && policy != SCHED_BATCH) &&
 		    param->sched_priority > p->rt_priority &&
 		    param->sched_priority >
 				p->signal->rlim[RLIMIT_RTPRIO].rlim_cur)
@@ -4216,6 +4161,7 @@ asmlinkage long sys_sched_get_priority_m
 		ret = MAX_USER_RT_PRIO-1;
 		break;
 	case SCHED_NORMAL:
+	case SCHED_BATCH:
 		ret = 0;
 		break;
 	}
@@ -4239,6 +4185,7 @@ asmlinkage long sys_sched_get_priority_m
 		ret = 1;
 		break;
 	case SCHED_NORMAL:
+	case SCHED_BATCH:
 		ret = 0;
 	}
 	return ret;
@@ -4394,7 +4341,9 @@ void __devinit init_idle(task_t *idle, i
 	runqueue_t *rq = cpu_rq(cpu);
 	unsigned long flags;
 
-	idle->sleep_avg = 0;
+	idle->avg_latency = 0;
+	idle->avg_cpu_run = 0;
+	idle->latency_bonus = 0;
 	idle->array = NULL;
 	idle->prio = MAX_PRIO;
 	idle->state = TASK_RUNNING;
@@ -5634,6 +5583,7 @@ void __init sched_init(void)
 		}
 	}
 
+	set_bias_prio(&init_task);
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
 	 */
@@ -5745,3 +5695,128 @@ void set_curr_task(int cpu, task_t *p)
 }
 
 #endif
+
+/*
+ * Place scheduler attributes in sysfs
+ */
+
+#include <linux/sysfs.h>
+
+static ssize_t show_unacceptable_ia_latency(char *page)
+{
+	return sprintf(page, "%ld\n", unacceptable_ia_latency);
+}
+
+static ssize_t store_unacceptable_ia_latency(const char *page, size_t count)
+{
+	unsigned long long val;
+	char *end = NULL;
+
+	val = simple_strtoull(page, &end, 10);
+	if ((end == page) || ((*end != '\0') && (*end != '\n')))
+		return -EINVAL;
+	if (val > ULONG_MAX)
+		unacceptable_ia_latency = ULONG_MAX;
+	else
+		unacceptable_ia_latency = val;
+
+	return count;
+}
+
+static ssize_t show_acceptability_factor(char *page)
+{
+	return sprintf(page, "%d\n", acceptability_factor);
+}
+
+static ssize_t store_acceptability_factor(const char *page, size_t count)
+{
+	unsigned long long val;
+	char *end = NULL;
+
+	val = simple_strtoull(page, &end, 10);
+	if ((end == page) || ((*end != '\0') && (*end != '\n')))
+		return -EINVAL;
+	if (val > UINT_MAX)
+		acceptability_factor = UINT_MAX;
+	else
+		acceptability_factor = val;
+
+	return count;
+}
+
+struct sched_sysfs_entry {
+	struct attribute attr;
+	ssize_t (*show)(char *);
+	ssize_t (*store)(const char *, size_t);
+};
+
+#define to_sched_sysfs_entry(a) \
+	container_of((a), struct sched_sysfs_entry, attr)
+
+static ssize_t show_attribute(struct kobject *kobj, struct attribute *attr,
+			      char *page)
+{
+	struct sched_sysfs_entry *e = to_sched_sysfs_entry(attr);
+
+	if (!e->show)
+		return 0;
+
+	return e->show(page);
+}
+
+static ssize_t store_attribute(struct kobject *kobj, struct attribute *attr,
+			       const char *page, size_t length)
+{
+	struct sched_sysfs_entry *e = to_sched_sysfs_entry(attr);
+
+	if (!e->show)
+		return -EBADF;
+
+	return e->store(page, length);
+}
+
+struct sysfs_ops sched_sysfs_ops = {
+	.show = show_attribute,
+	.store = store_attribute,
+};
+
+static struct sched_sysfs_entry unacceptable_ia_latency_sdse = {
+	.attr = { .name = "unacceptable_ia_latency",
+		  .mode = S_IRUGO | S_IWUSR },
+	.show = show_unacceptable_ia_latency,
+	.store = store_unacceptable_ia_latency,
+};
+
+static struct sched_sysfs_entry acceptability_factor_sdse = {
+	.attr = { .name = "acceptability_factor",
+		  .mode = S_IRUGO | S_IWUSR },
+	.show = show_acceptability_factor,
+	.store = store_acceptability_factor,
+};
+
+static struct attribute *sched_attrs[] = {
+	&unacceptable_ia_latency_sdse.attr,
+	&acceptability_factor_sdse.attr,
+	NULL,
+};
+
+static struct kobj_type sched_ktype = {
+	.sysfs_ops = &sched_sysfs_ops,
+	.default_attrs = sched_attrs,
+};
+
+static struct kobject sched_kobj = {
+	.ktype = &sched_ktype
+};
+
+decl_subsys(cpusched, NULL, NULL);
+
+void __init sched_sysfs_init(void)
+{
+	if (subsystem_register(&cpusched_subsys) == 0) {
+		sched_kobj.kset = &cpusched_subsys.kset;
+		strncpy(sched_kobj.name, "attrs", KOBJ_NAME_LEN);
+		sched_kobj.kset = &cpusched_subsys.kset;
+		(void)kobject_register(&sched_kobj);
+ 	}
+}

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-30 13:52     ` [SCHED] wrong priority calc - SIMPLE test case Paolo Ornati
  2005-12-31  2:06       ` Peter Williams
@ 2005-12-31  8:13       ` Mike Galbraith
  2005-12-31 11:00         ` Paolo Ornati
  2005-12-31 15:11         ` Paolo Ornati
  2006-01-13  1:13       ` Con Kolivas
  2 siblings, 2 replies; 58+ messages in thread
From: Mike Galbraith @ 2005-12-31  8:13 UTC (permalink / raw)
  To: Paolo Ornati, Linux Kernel Mailing List
  Cc: Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams

At 02:52 PM 12/30/2005 +0100, Paolo Ornati wrote:
>WAS: [SCHED] Totally WRONG prority calculation with specific test-case
>(since 2.6.10-bk12)
>http://lkml.org/lkml/2005/12/27/114/index.html
>
>On Wed, 28 Dec 2005 10:26:58 +1100
>Con Kolivas <kernel@kolivas.org> wrote:
>
> > The issue is that the scheduler interactivity estimator is a state 
> machine and
> > can be fooled to some degree, and a cpu intensive task that just 
> happens to
> > sleep a little bit gets significantly better priority than one that is 
> fully
> > cpu bound all the time. Reverting that change is not a solution because it
> > can still be fooled by the same process sleeping lots for a few seconds 
> or so
> > at startup and then changing to the cpu mostly-sleeping slightly 
> behaviour.
> > This "fluctuating" behaviour is in my opinion worse which is why I removed
> > it.
>
>Trying to find a "as simple as possible" test case for this problem
>(that I consider a BUG in priority calculation) I've come up with this
>very simple program:
>
>------ sched_fooler.c -------------------------------

Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your 
little proggy.  Taking a quick peek at the rt scheduler changes, nothing 
poked me in the eye, but by golly, I can't get this kernel to act up, 
whereas 2.6.14-virgin does.

         -Mike  (off to stare harder rt patch) 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-31  2:06       ` Peter Williams
@ 2005-12-31 10:34         ` Paolo Ornati
  2005-12-31 10:52           ` Paolo Ornati
  0 siblings, 1 reply; 58+ messages in thread
From: Paolo Ornati @ 2005-12-31 10:34 UTC (permalink / raw)
  To: Peter Williams
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin

On Sat, 31 Dec 2005 13:06:04 +1100
Peter Williams <pwil3058@bigpond.net.au> wrote:

> Attached is a testing version of a patch for modifying scheduler bonus 
> calculations that I'm working on.  Although these changes aren't 
> targetted at the problem you are experiencing I believe that they may 
> help.  My testing shows that sched_fooler instances don't get any 
> bonuses but I would appreciate it if you could try it out.
> 
> It is a patch against the 2.6.15-rc7 kernel and includes some other 
> scheduling patches from the -mm kernels.

Yes, this fixes both my test-case (transcode & little program), they
get priority 25 instead of ~16.

But the priority of DD is now ~23 and so it still suffers a bit:

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real    0m8.549s	(instead of 4s)
user    0m0.000s
sys     0m0.209s

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-lial on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-31 10:34         ` Paolo Ornati
@ 2005-12-31 10:52           ` Paolo Ornati
  2005-12-31 11:12             ` Con Kolivas
  2005-12-31 13:44             ` Peter Williams
  0 siblings, 2 replies; 58+ messages in thread
From: Paolo Ornati @ 2005-12-31 10:52 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Peter Williams, Linux Kernel Mailing List, Con Kolivas,
	Ingo Molnar, Nick Piggin

On Sat, 31 Dec 2005 11:34:46 +0100
Paolo Ornati <ornati@fastwebnet.it> wrote:

> > It is a patch against the 2.6.15-rc7 kernel and includes some other 
> > scheduling patches from the -mm kernels.
> 
> Yes, this fixes both my test-case (transcode & little program), they
> get priority 25 instead of ~16.
> 
> But the priority of DD is now ~23 and so it still suffers a bit:

I forgot to mention that even the others "interactive" processes
don't get a good priority too.

Xorg for example, while only moving the cursor around, gets priority
23/24. And when cpu-eaters are running (at priority 25) it isn't happy
at all, the cursor begins to move in jerks and so on...

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-lial on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-31  8:13       ` Mike Galbraith
@ 2005-12-31 11:00         ` Paolo Ornati
  2005-12-31 15:11         ` Paolo Ornati
  1 sibling, 0 replies; 58+ messages in thread
From: Paolo Ornati @ 2005-12-31 11:00 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

On Sat, 31 Dec 2005 09:13:24 +0100
Mike Galbraith <efault@gmx.de> wrote:

> Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your 
> little proggy.  Taking a quick peek at the rt scheduler changes, nothing 
> poked me in the eye, but by golly, I can't get this kernel to act up, 
> whereas 2.6.14-virgin does.

Mmm... I get an infinite list of init segfaults trying to boot it. I've
tried disabling CONFIG_CC_OPTIMIZE_FOR_SIZE but it didn't help.

I'll try later with a simplier ".config".

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-plugsched on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-31 10:52           ` Paolo Ornati
@ 2005-12-31 11:12             ` Con Kolivas
  2005-12-31 13:44             ` Peter Williams
  1 sibling, 0 replies; 58+ messages in thread
From: Con Kolivas @ 2005-12-31 11:12 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Peter Williams, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin

On Saturday 31 December 2005 21:52, Paolo Ornati wrote:
> On Sat, 31 Dec 2005 11:34:46 +0100
>
> Paolo Ornati <ornati@fastwebnet.it> wrote:
> > > It is a patch against the 2.6.15-rc7 kernel and includes some other
> > > scheduling patches from the -mm kernels.
> >
> > Yes, this fixes both my test-case (transcode & little program), they
> > get priority 25 instead of ~16.
> >
> > But the priority of DD is now ~23 and so it still suffers a bit:
>
> I forgot to mention that even the others "interactive" processes
> don't get a good priority too.
>
> Xorg for example, while only moving the cursor around, gets priority
> 23/24. And when cpu-eaters are running (at priority 25) it isn't happy
> at all, the cursor begins to move in jerks and so on...

This is why Ingo, Nick and myself think that a tweak to the heavily field 
tested current cpu scheduler is best for 2.6 rather than any gutting and 
replacement of the interactivity estimator (which even though this scheme is 
simple and easy to understand, it clearly is an example of). Given that we 
have a "2.6 forever" policy, it also means any significant cpu scheduler 
rewrite, even just of the interactivity estimator, is nigh on impossible to 
implement.

Cheers,
Con

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-31 10:52           ` Paolo Ornati
  2005-12-31 11:12             ` Con Kolivas
@ 2005-12-31 13:44             ` Peter Williams
  2005-12-31 16:31               ` Paolo Ornati
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Williams @ 2005-12-31 13:44 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin

Paolo Ornati wrote:
> On Sat, 31 Dec 2005 11:34:46 +0100
> Paolo Ornati <ornati@fastwebnet.it> wrote:
> 
> 
>>>It is a patch against the 2.6.15-rc7 kernel and includes some other 
>>>scheduling patches from the -mm kernels.
>>
>>Yes, this fixes both my test-case (transcode & little program), they
>>get priority 25 instead of ~16.
>>
>>But the priority of DD is now ~23 and so it still suffers a bit:
> 
> 
> I forgot to mention that even the others "interactive" processes
> don't get a good priority too.
> 
> Xorg for example, while only moving the cursor around, gets priority
> 23/24. And when cpu-eaters are running (at priority 25) it isn't happy
> at all, the cursor begins to move in jerks and so on...
> 

OK.  This probably means that the parameters that control the mechanism 
need tweaking.

There should be a file /sys/cpusched/attrs/unacceptable_ia_latency which 
contains the latency (in nanoseconds) that the scheduler considers 
unacceptable for interactive programs.  Try changing that value and see 
if things improve?  Making it smaller should help but if you make it too 
small all the interactive tasks will end up with the same priority and 
this could cause them to get in each other's way.

Thanks,
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-31  8:13       ` Mike Galbraith
  2005-12-31 11:00         ` Paolo Ornati
@ 2005-12-31 15:11         ` Paolo Ornati
  2005-12-31 16:37           ` Mike Galbraith
  1 sibling, 1 reply; 58+ messages in thread
From: Paolo Ornati @ 2005-12-31 15:11 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

On Sat, 31 Dec 2005 09:13:24 +0100
Mike Galbraith <efault@gmx.de> wrote:

> Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your 
> little proggy.  Taking a quick peek at the rt scheduler changes, nothing 
> poked me in the eye, but by golly, I can't get this kernel to act up, 
> whereas 2.6.14-virgin does.

Ok, I've sucessfully booted 2.6.15-rc7-rt1 (I think that I was
having troubles with Thread Softirqs and/or Thread Hardirqs).

First thing: I've preemption disabled, but it shouldn't matter too much
since we are talking about priority calculation...

1) My program isn't defeated at all. If I start it with the same args
of the previous examples it "seems" defeated, but it isn't.

Lowering the "cpu burn argument" I can reproduce the problem again:

"./a.out 200 & ./a.out 333"

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5607 paolo     15   0  2396  320  252 R 56.1  0.1   0:06.79 a.out
 5606 paolo     15   0  2396  324  252 R 38.7  0.1   0:04.55 a.out
    1 root      16   0  2556  552  468 S  0.0  0.1   0:00.28 init


2) Priority fluctuation - very interesting: playing with the only arg
my program has I've found this:

./a.out 200
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5628 paolo     15   0  2392  320  252 R 48.5  0.1   0:18.34 a.out

./a.out 300
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5633 paolo     15   0  2392  324  252 S 50.1  0.1   0:09.42 a.out

./a.out 400
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5634 paolo     15   0  2392  320  252 S 66.7  0.1   0:06.31 a.out

./a.out 500
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5638 paolo     25   0  2396  320  252 R 67.7  0.1   0:14.78 a.out

./a.out 700
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5640 paolo     15   0  2392  320  252 S 80.1  0.1   0:25.88 a.out

./a.out 800
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5644 paolo     17   0  2396  320  252 R 79.6  0.1   0:26.54 a.out


In the "./a.out 500" case, the priority starts at something like 16 and
then slowly go up to 25 _BUT_ if I start my DD test my cpu-eater
priority goes quickly to 16!

The real world test case (transcode) is a bit harder to describe: its
priority usually goes up to 25, sometimes I've seen it fluctuating a
bit (like go to 19 and then back to 25).

When I start my DD test I've seen basically 2 different behaviours:

A) good
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5788 paolo     25   0  114m  18m 2440 R 82.2  3.7   0:10.16 transcode
 5804 paolo     15   0 49860 4500 1896 S  8.5  0.9   0:00.99 tcdecode
 5808 paolo     18   0  4952 1520  412 D  5.0  0.3   0:00.36 dd

B) bad
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5743 paolo     18   0  114m  18m 2440 R 75.0  3.7   0:26.79 transcode
 5759 paolo     15   0 49864 4500 1896 S  7.8  0.9   0:02.71 tcdecode
 5750 paolo     16   0 19840 1136  916 S  1.5  0.2   0:00.23 tcdemux
 5201 root      15   0  167m  17m 3336 S  0.8  3.5   0:19.38 X
 5764 paolo     18   0  4948 1520  412 R  0.7  0.3   0:00.04 dd

Sometimes happens A and sometimes happens B...

PS: probably all these numbers aren't 100% reproducible... this is what
happens on my PC.

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-rt1 on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-31 13:44             ` Peter Williams
@ 2005-12-31 16:31               ` Paolo Ornati
  2005-12-31 22:04                 ` Peter Williams
  0 siblings, 1 reply; 58+ messages in thread
From: Paolo Ornati @ 2005-12-31 16:31 UTC (permalink / raw)
  To: Peter Williams
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin

On Sun, 01 Jan 2006 00:44:10 +1100
Peter Williams <pwil3058@bigpond.net.au> wrote:

> OK.  This probably means that the parameters that control the mechanism 
> need tweaking.
> 
> There should be a file /sys/cpusched/attrs/unacceptable_ia_latency which 
> contains the latency (in nanoseconds) that the scheduler considers 
> unacceptable for interactive programs.  Try changing that value and see 
> if things improve?  Making it smaller should help but if you make it too 
> small all the interactive tasks will end up with the same priority and 
> this could cause them to get in each other's way.

I've tried different values and sometimes I've got a good feeling BUT
the behaviour is too strange to say something.

Sometimes I get what I want (dd priority ~17 and CPU eaters prio
25), sometimes I get a total disaster (dd priority 17 and CPU eaters
prio 15/16) and sometimes I get something like DD prio 22 and CPU
eaters 23/24.

All this is not well related to "unacceptable_ia_latency" values.

What I think is that the priority calculation in ingosched and other
schedulers is in general too weak, while in other schedulers is rock
solid (read: nicksched).

Maybe is just that the smarter a scheduler want to be, the more fragile
it will be.

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-lial on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-31 15:11         ` Paolo Ornati
@ 2005-12-31 16:37           ` Mike Galbraith
  2005-12-31 17:24             ` Paolo Ornati
  2006-01-01 11:39             ` Paolo Ornati
  0 siblings, 2 replies; 58+ messages in thread
From: Mike Galbraith @ 2005-12-31 16:37 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

At 04:11 PM 12/31/2005 +0100, Paolo Ornati wrote:
>On Sat, 31 Dec 2005 09:13:24 +0100
>Mike Galbraith <efault@gmx.de> wrote:
>
> > Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your
> > little proggy.  Taking a quick peek at the rt scheduler changes, nothing
> > poked me in the eye, but by golly, I can't get this kernel to act up,
> > whereas 2.6.14-virgin does.
>
>Ok, I've sucessfully booted 2.6.15-rc7-rt1 (I think that I was
>having troubles with Thread Softirqs and/or Thread Hardirqs).
>
>First thing: I've preemption disabled, but it shouldn't matter too much
>since we are talking about priority calculation...

Mine is fully preemptible.

>1) My program isn't defeated at all. If I start it with the same args
>of the previous examples it "seems" defeated, but it isn't.
>
>Lowering the "cpu burn argument" I can reproduce the problem again:
>
>"./a.out 200 & ./a.out 333"
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5607 paolo     15   0  2396  320  252 R 56.1  0.1   0:06.79 a.out
>  5606 paolo     15   0  2396  324  252 R 38.7  0.1   0:04.55 a.out
>     1 root      16   0  2556  552  468 S  0.0  0.1   0:00.28 init

Strange.  Using the exact same arguments, I do see some odd bouncing up to 
high priorities, but they spend the vast majority of their time down at 25.

         -Mike 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-31 16:37           ` Mike Galbraith
@ 2005-12-31 17:24             ` Paolo Ornati
  2005-12-31 17:42               ` Paolo Ornati
  2006-01-01 11:39             ` Paolo Ornati
  1 sibling, 1 reply; 58+ messages in thread
From: Paolo Ornati @ 2005-12-31 17:24 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

On Sat, 31 Dec 2005 17:37:11 +0100
Mike Galbraith <efault@gmx.de> wrote:

> >"./a.out 200 & ./a.out 333"
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >  5607 paolo     15   0  2396  320  252 R 56.1  0.1   0:06.79 a.out
> >  5606 paolo     15   0  2396  324  252 R 38.7  0.1   0:04.55 a.out
> >     1 root      16   0  2556  552  468 S  0.0  0.1   0:00.28 init
> 
> Strange.  Using the exact same arguments, I do see some odd bouncing up to 
> high priorities, but they spend the vast majority of their time down at 25.

You shouldn't use "the same exact numbers", you should try different
args and see if you can reproduce the problem. Or maybe preemption
make some difference... I'll try with PREEMPT enabled and see.

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-plugsched on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-31 17:24             ` Paolo Ornati
@ 2005-12-31 17:42               ` Paolo Ornati
  0 siblings, 0 replies; 58+ messages in thread
From: Paolo Ornati @ 2005-12-31 17:42 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Mike Galbraith, Linux Kernel Mailing List, Con Kolivas,
	Ingo Molnar, Nick Piggin, Peter Williams

On Sat, 31 Dec 2005 18:24:40 +0100
Paolo Ornati <ornati@fastwebnet.it> wrote:

> You shouldn't use "the same exact numbers", you should try different
> args and see if you can reproduce the problem. Or maybe preemption
> make some difference... I'll try with PREEMPT enabled and see.

Ok, just tried with Complete Preemption: I can easly reproduce the
problem.

For example:

"./a.out 1000 & ./a.out 1239"

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5433 paolo     16   0  2396  324  252 R 50.3  0.1   0:34.67 a.out
 5434 paolo     16   0  2392  320  252 R 47.4  0.1   0:30.76 a.out
  265 root     -48  -5     0    0    0 S  0.6  0.0   0:00.68 IRQ 12
 5261 root      15   0  166m  16m 2844 S  0.6  3.3   0:04.81 X
 5349 paolo     15   0 86640  22m  15m S  0.6  4.5   0:01.36 konsole
 5344 paolo     15   0 98.8m  20m  12m S  0.3  4.1   0:01.64 kicker
 5444 paolo     18   0  4948 1520  412 R  0.3  0.3   0:00.05 dd

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-rt1 on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-31 16:31               ` Paolo Ornati
@ 2005-12-31 22:04                 ` Peter Williams
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Williams @ 2005-12-31 22:04 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin

Paolo Ornati wrote:
> On Sun, 01 Jan 2006 00:44:10 +1100
> Peter Williams <pwil3058@bigpond.net.au> wrote:
> 
> 
>>OK.  This probably means that the parameters that control the mechanism 
>>need tweaking.
>>
>>There should be a file /sys/cpusched/attrs/unacceptable_ia_latency which 
>>contains the latency (in nanoseconds) that the scheduler considers 
>>unacceptable for interactive programs.  Try changing that value and see 
>>if things improve?  Making it smaller should help but if you make it too 
>>small all the interactive tasks will end up with the same priority and 
>>this could cause them to get in each other's way.
> 
> 
> I've tried different values and sometimes I've got a good feeling BUT
> the behaviour is too strange to say something.
> 
> Sometimes I get what I want (dd priority ~17 and CPU eaters prio
> 25), sometimes I get a total disaster (dd priority 17 and CPU eaters
> prio 15/16) and sometimes I get something like DD prio 22 and CPU
> eaters 23/24.
> 
> All this is not well related to "unacceptable_ia_latency" values.

OK. Thanks for trying it.

The feedback will be helpful in trying to improve the mechanisms.

> 
> What I think is that the priority calculation in ingosched and other
> schedulers is in general too weak, while in other schedulers is rock
> solid (read: nicksched).
> 
> Maybe is just that the smarter a scheduler want to be, the more fragile
> it will be.
> 

Probably but this one is fairly simple.

I think the remaining problems with interactive responsiveness is that 
bonuses increase too slowly when a latency problem is detected.  I.e. a 
task just gets one extra bonus point when an unacceptable latency is 
detected regardless of how big the latency is.  This means that it may 
take several cycles for the bonus to be big enough to solve the problem. 
  I'm going to try making the bonus increment proportional to the size 
of the latency w.r.t. the limit.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-31 16:37           ` Mike Galbraith
  2005-12-31 17:24             ` Paolo Ornati
@ 2006-01-01 11:39             ` Paolo Ornati
  2006-01-02  9:15               ` Mike Galbraith
  1 sibling, 1 reply; 58+ messages in thread
From: Paolo Ornati @ 2006-01-01 11:39 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

On Sat, 31 Dec 2005 17:37:11 +0100
Mike Galbraith <efault@gmx.de> wrote:

> Strange.  Using the exact same arguments, I do see some odd bouncing up to 
> high priorities, but they spend the vast majority of their time down at 25.

Mmmm... to make it more easly reproducible I've enlarged the sleep time
(1 microsecond is likely to be rounded too much and give different
results on different hardware/kernel/config...).

Compile this _without_ optimizations and try again:

------------------------------------------------
#include <stdlib.h>
#include <unistd.h>

char buf[1024];

static void burn_cpu(unsigned int x)
{
	int i;
	
	for (i=0; i < x; ++i)
		buf[i%sizeof(buf)] = (x-i)*3;
}

int main(int argc, char **argv)
{
	unsigned long burn;
	if (argc != 2)
		return 1;
	burn = (unsigned long)atoi(argv[1]);
	if (!burn)
		return;
	while(1) {
		burn_cpu(burn*1000);
		usleep(10000);
	}
	return 0;
}
-----------------------------------------


With "./a.out 3000" (and 2.6.15-rc7-rt1) I get this:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5485 paolo     15   0  2396  320  252 R 62.7  0.1   0:09.77 a.out


Try different values: 1000, 2000, 3000 ... are you able to reproduce it
now?


If yes, try to start 2 of them with something like this:

"./a.out 3000 & ./a.out 3161"

so they are NOT syncronized and they use almost all the CPU time:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5582 paolo     16   0  2396  320  252 S 45.7  0.1   0:05.52 a.out
 5583 paolo     15   0  2392  320  252 S 45.7  0.1   0:05.49 a.out

This is the bad situation I hate: some cpu-eaters that eat all the CPU
time BUT have a really good priority only because they sleeps a bit.

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-rt1 on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-01 11:39             ` Paolo Ornati
@ 2006-01-02  9:15               ` Mike Galbraith
  2006-01-02  9:50                 ` Paolo Ornati
  2006-01-09 11:11                 ` Mike Galbraith
  0 siblings, 2 replies; 58+ messages in thread
From: Mike Galbraith @ 2006-01-02  9:15 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

At 12:39 PM 1/1/2006 +0100, Paolo Ornati wrote:
>On Sat, 31 Dec 2005 17:37:11 +0100
>Mike Galbraith <efault@gmx.de> wrote:
>
> > Strange.  Using the exact same arguments, I do see some odd bouncing up to
> > high priorities, but they spend the vast majority of their time down at 25.
>
>Mmmm... to make it more easly reproducible I've enlarged the sleep time
>(1 microsecond is likely to be rounded too much and give different
>results on different hardware/kernel/config...).
>
>Compile this _without_ optimizations and try again:

<snip>

>Try different values: 1000, 2000, 3000 ... are you able to reproduce it
>now?

Yeah.  One instance running has to sustain roughly _95%_ cpu before it's 
classified as a cpu piggy.  Not good.

>If yes, try to start 2 of them with something like this:
>
>"./a.out 3000 & ./a.out 3161"
>
>so they are NOT syncronized and they use almost all the CPU time:
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5582 paolo     16   0  2396  320  252 S 45.7  0.1   0:05.52 a.out
>  5583 paolo     15   0  2392  320  252 S 45.7  0.1   0:05.49 a.out
>
>This is the bad situation I hate: some cpu-eaters that eat all the CPU
>time BUT have a really good priority only because they sleeps a bit.

Yup, your proggy fools the interactivity estimator quite well.  This 
problem was addressed a long time ago, and thought to be more or less 
cured.  Guess not.

         -Mike 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-02  9:15               ` Mike Galbraith
@ 2006-01-02  9:50                 ` Paolo Ornati
  2006-01-09 11:11                 ` Mike Galbraith
  1 sibling, 0 replies; 58+ messages in thread
From: Paolo Ornati @ 2006-01-02  9:50 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

On Mon, 02 Jan 2006 10:15:43 +0100
Mike Galbraith <efault@gmx.de> wrote:

> >This is the bad situation I hate: some cpu-eaters that eat all the CPU
> >time BUT have a really good priority only because they sleeps a bit.
> 
> Yup, your proggy fools the interactivity estimator quite well.  This 
> problem was addressed a long time ago, and thought to be more or less 
> cured.  Guess not.

In my original real-life test case (transcode) I found that the problem
started with the removing of "interactive_credit":

http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/6aa5c93c379ae9e1/a9a83db6446edaf7?lnk=st&q=insubject%3Asched+author%3APaolo+author%3AOrnati&rnum=1&hl=en#a9a83db6446edaf7

This is not actually true... in fact that change only unhidden the
problem for that particular test-case.

With my little proggy I'm now able to reproduce the problem even with
"interactive_credit" applied (for example with a 2.6.10 kernel).

Said this, and since "nicksched" doesn't have this problem at all, it
is an ingosched (and others as well) problem.

-- 
	Paolo Ornati
	Linux 2.6.15-rc7-plugsched on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-02  9:15               ` Mike Galbraith
  2006-01-02  9:50                 ` Paolo Ornati
@ 2006-01-09 11:11                 ` Mike Galbraith
  2006-01-09 15:52                   ` Mike Galbraith
  1 sibling, 1 reply; 58+ messages in thread
From: Mike Galbraith @ 2006-01-09 11:11 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

[-- Attachment #1: Type: text/plain, Size: 2290 bytes --]

At 10:15 AM 1/2/2006 +0100, Mike Galbraith wrote:
>At 12:39 PM 1/1/2006 +0100, Paolo Ornati wrote:
>>On Sat, 31 Dec 2005 17:37:11 +0100
>>Mike Galbraith <efault@gmx.de> wrote:
>>
>> > Strange.  Using the exact same arguments, I do see some odd bouncing up to
>> > high priorities, but they spend the vast majority of their time down 
>> at 25.
>>
>>Mmmm... to make it more easly reproducible I've enlarged the sleep time
>>(1 microsecond is likely to be rounded too much and give different
>>results on different hardware/kernel/config...).
>>
>>Compile this _without_ optimizations and try again:
>
><snip>
>
>>Try different values: 1000, 2000, 3000 ... are you able to reproduce it
>>now?
>
>Yeah.  One instance running has to sustain roughly _95%_ cpu before it's 
>classified as a cpu piggy.  Not good.
>
>>If yes, try to start 2 of them with something like this:
>>
>>"./a.out 3000 & ./a.out 3161"
>>
>>so they are NOT syncronized and they use almost all the CPU time:
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  5582 paolo     16   0  2396  320  252 S 45.7  0.1   0:05.52 a.out
>>  5583 paolo     15   0  2392  320  252 S 45.7  0.1   0:05.49 a.out
>>
>>This is the bad situation I hate: some cpu-eaters that eat all the CPU
>>time BUT have a really good priority only because they sleeps a bit.
>
>Yup, your proggy fools the interactivity estimator quite well.  This 
>problem was addressed a long time ago, and thought to be more or less 
>cured.  Guess not.

Care to try an experiment?  I'd be very interested in knowing if the 
attached patch cures the real-life problem you were investigating.

It attempts to catch tasks which the interactivity logic has misidentified, 
and "pull their plug".  It maintains a running plausibility check 
(slice_avg) against sleep_avg, and if a sustained disparity appears, cuts 
off a cpu burning task's supply of bonus points such that it has to "run on 
battery" until the disparity decreases to within acceptable limits.

Obviously, anything that affects fairness _will_ affect interactivity to 
some degree. This simple bolt-on throttle has delayed initiation and 
accelerated release in the hopes of keeping it's impact acceptable.  After 
some initial testing, It doesn't _seem_ to suck.

         -Mike 

[-- Attachment #2: sched_throttle --]
[-- Type: application/octet-stream, Size: 3629 bytes --]

--- include/linux/sched.h.org	Tue Jan  3 09:26:50 2006
+++ include/linux/sched.h	Sat Jan  7 14:45:37 2006
@@ -701,8 +701,8 @@
 
 	unsigned short ioprio;
 
-	unsigned long sleep_avg;
-	unsigned long long timestamp, last_ran;
+	unsigned long sleep_avg, slice_avg;
+	unsigned long long timestamp, last_ran, last_slice;
 	unsigned long long sched_time; /* sched_clock time spent running */
 	int activated;
 
--- linux-2.6.15/kernel/sched.c.org	Sat Jan  7 16:22:13 2006
+++ linux-2.6.15/kernel/sched.c	Mon Jan  9 11:50:40 2006
@@ -47,6 +47,7 @@
 #include <linux/syscalls.h>
 #include <linux/times.h>
 #include <linux/acct.h>
+#include <linux/jiffies.h>
 #include <asm/tlb.h>
 
 #include <asm/unistd.h>
@@ -1353,7 +1354,7 @@
 
 out_activate:
 #endif /* CONFIG_SMP */
-	if (old_state == TASK_UNINTERRUPTIBLE) {
+	if (old_state & TASK_UNINTERRUPTIBLE) {
 		rq->nr_uninterruptible--;
 		/*
 		 * Tasks on involuntary sleep don't earn
@@ -1492,6 +1493,8 @@
 	 */
 	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
 		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
+	p->slice_avg = NS_MAX_SLEEP_AVG;
+	p->last_slice = sched_clock();
 
 	p->prio = effective_prio(p);
 
@@ -2646,6 +2649,12 @@
 	runqueue_t *rq = this_rq();
 	task_t *p = current;
 	unsigned long long now = sched_clock();
+#if 1
+	static unsigned long printme = 0;
+
+	if (unlikely(!printme))
+		printme = jiffies;
+#endif
 
 	update_cpu_clock(p, rq, now);
 
@@ -2679,6 +2688,7 @@
 		if ((p->policy == SCHED_RR) && !--p->time_slice) {
 			p->time_slice = task_timeslice(p);
 			p->first_time_slice = 0;
+			p->last_slice = now;
 			set_tsk_need_resched(p);
 
 			/* put it at the end of the queue: */
@@ -2687,12 +2697,40 @@
 		goto out_unlock;
 	}
 	if (!--p->time_slice) {
+		unsigned long long nsecs = now - p->last_slice;
+		unsigned long idle, ticks;
+		int w = 10;
+
 		dequeue_task(p, rq->active);
 		set_tsk_need_resched(p);
 		p->prio = effective_prio(p);
 		p->time_slice = task_timeslice(p);
 		p->first_time_slice = 0;
 
+		if (nsecs > ~0UL)
+			nsecs = ~0UL;
+		ticks = NS_TO_JIFFIES((unsigned long) nsecs);
+		if (ticks < p->time_slice)
+			ticks = p->time_slice;
+		idle = 100 - (100 * p->time_slice / ticks);
+		p->slice_avg /= NS_MAX_SLEEP_AVG / 100;
+		/*
+		 * If we're improving our behaviour, speed up the improvement's
+		 * effect so we don't over throttle.
+		 */
+		if (idle > p->slice_avg + 10)
+			w -= (100 * p->slice_avg / idle) / 10;
+		p->slice_avg = (w * p->slice_avg + idle) / (w + 1);
+		p->slice_avg *= NS_MAX_SLEEP_AVG / 100;
+		p->last_slice = now;
+#if 1
+		if (p->mm && time_after(jiffies, printme + HZ)) {
+			printk(KERN_DEBUG"%s pid:%d sle:%ld sli:%ld tic:%ld idle:%ld w:%d\n",
+				p->comm,p->pid,p->sleep_avg,p->slice_avg,ticks,idle,w);
+			printme =  jiffies + HZ;
+	}
+#endif
+
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
 		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
@@ -3010,7 +3048,7 @@
 				unlikely(signal_pending(prev))))
 			prev->state = TASK_RUNNING;
 		else {
-			if (prev->state == TASK_UNINTERRUPTIBLE)
+			if (prev->state & TASK_UNINTERRUPTIBLE)
 				rq->nr_uninterruptible++;
 			deactivate_task(prev, rq);
 		}
@@ -3095,6 +3133,13 @@
 	prev->sleep_avg -= run_time;
 	if ((long)prev->sleep_avg <= 0)
 		prev->sleep_avg = 0;
+	if (prev->state & (TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE) &&
+			prev->sleep_avg > prev->slice_avg + (NS_MAX_SLEEP_AVG/10) &&
+			!rt_task(prev))
+		prev->state |= TASK_NONINTERACTIVE;
+	if (!rt_task(next) && !(next->time_slice % DEF_TIMESLICE))
+		next->last_slice = now;
+
 	prev->timestamp = prev->last_ran = now;
 
 	sched_info_switch(prev, next);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-09 11:11                 ` Mike Galbraith
@ 2006-01-09 15:52                   ` Mike Galbraith
  2006-01-09 16:08                     ` Con Kolivas
  2006-01-09 20:00                     ` Paolo Ornati
  0 siblings, 2 replies; 58+ messages in thread
From: Mike Galbraith @ 2006-01-09 15:52 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

[-- Attachment #1: Type: text/plain, Size: 975 bytes --]

At 12:11 PM 1/9/2006 +0100, Mike Galbraith wrote:

>Care to try an experiment?...

Oops.  I guess I should send one that's not mixed p1 and p0.  Sorry about 
that :-/

Anyway, if anyone wants to see a functional demonstration, just try 
this.  Remove the TASK_NONINTERACTIVE in fs/pipe.c in both the stock kernel 
and this modified one so Davide Libenzi's excellent sleep pattern exploit 
(irman2) can work [1], and do the below all at the same time ...

make -j4 bzImage
irman2
thud 3

With the stock kernel, I got bored after a half an hour, and stopped the 
kernel build.  It had produced 40 .o files.  The modified kernel finished 
in 20 minutes vs the 8 minutes it took to produce the same 504 .o files if 
not under load.

         -Mike

1.  it just so happens that Davide wrote irman2 using pipes... he could 
have done something else.  if anyone doesn't think this is a fair test, 
just use Paolo's much simpler exploit instead.  the result will be about 
the same. 

[-- Attachment #2: sched_throttle --]
[-- Type: application/octet-stream, Size: 3655 bytes --]

--- linux-2.6.15/include/linux/sched.h.org	Tue Jan  3 09:26:50 2006
+++ linux-2.6.15/include/linux/sched.h	Sat Jan  7 14:45:37 2006
@@ -701,8 +701,8 @@
 
 	unsigned short ioprio;
 
-	unsigned long sleep_avg;
-	unsigned long long timestamp, last_ran;
+	unsigned long sleep_avg, slice_avg;
+	unsigned long long timestamp, last_ran, last_slice;
 	unsigned long long sched_time; /* sched_clock time spent running */
 	int activated;
 
--- linux-2.6.15/kernel/sched.c.org	Sat Jan  7 16:22:13 2006
+++ linux-2.6.15/kernel/sched.c	Mon Jan  9 11:50:40 2006
@@ -47,6 +47,7 @@
 #include <linux/syscalls.h>
 #include <linux/times.h>
 #include <linux/acct.h>
+#include <linux/jiffies.h>
 #include <asm/tlb.h>
 
 #include <asm/unistd.h>
@@ -1353,7 +1354,7 @@
 
 out_activate:
 #endif /* CONFIG_SMP */
-	if (old_state == TASK_UNINTERRUPTIBLE) {
+	if (old_state & TASK_UNINTERRUPTIBLE) {
 		rq->nr_uninterruptible--;
 		/*
 		 * Tasks on involuntary sleep don't earn
@@ -1492,6 +1493,8 @@
 	 */
 	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
 		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
+	p->slice_avg = NS_MAX_SLEEP_AVG;
+	p->last_slice = sched_clock();
 
 	p->prio = effective_prio(p);
 
@@ -2646,6 +2649,12 @@
 	runqueue_t *rq = this_rq();
 	task_t *p = current;
 	unsigned long long now = sched_clock();
+#if 1
+	static unsigned long printme = 0;
+
+	if (unlikely(!printme))
+		printme = jiffies;
+#endif
 
 	update_cpu_clock(p, rq, now);
 
@@ -2679,6 +2688,7 @@
 		if ((p->policy == SCHED_RR) && !--p->time_slice) {
 			p->time_slice = task_timeslice(p);
 			p->first_time_slice = 0;
+			p->last_slice = now;
 			set_tsk_need_resched(p);
 
 			/* put it at the end of the queue: */
@@ -2687,12 +2697,40 @@
 		goto out_unlock;
 	}
 	if (!--p->time_slice) {
+		unsigned long long nsecs = now - p->last_slice;
+		unsigned long idle, ticks;
+		int w = 10;
+
 		dequeue_task(p, rq->active);
 		set_tsk_need_resched(p);
 		p->prio = effective_prio(p);
 		p->time_slice = task_timeslice(p);
 		p->first_time_slice = 0;
 
+		if (nsecs > ~0UL)
+			nsecs = ~0UL;
+		ticks = NS_TO_JIFFIES((unsigned long) nsecs);
+		if (ticks < p->time_slice)
+			ticks = p->time_slice;
+		idle = 100 - (100 * p->time_slice / ticks);
+		p->slice_avg /= NS_MAX_SLEEP_AVG / 100;
+		/*
+		 * If we're improving our behaviour, speed up the improvement's
+		 * effect so we don't over throttle.
+		 */
+		if (idle > p->slice_avg + 10)
+			w -= (100 * p->slice_avg / idle) / 10;
+		p->slice_avg = (w * p->slice_avg + idle) / (w + 1);
+		p->slice_avg *= NS_MAX_SLEEP_AVG / 100;
+		p->last_slice = now;
+#if 1
+		if (p->mm && time_after(jiffies, printme + HZ)) {
+			printk(KERN_DEBUG"%s pid:%d sle:%ld sli:%ld tic:%ld idle:%ld w:%d\n",
+				p->comm,p->pid,p->sleep_avg,p->slice_avg,ticks,idle,w);
+			printme =  jiffies + HZ;
+	}
+#endif
+
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
 		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
@@ -3010,7 +3048,7 @@
 				unlikely(signal_pending(prev))))
 			prev->state = TASK_RUNNING;
 		else {
-			if (prev->state == TASK_UNINTERRUPTIBLE)
+			if (prev->state & TASK_UNINTERRUPTIBLE)
 				rq->nr_uninterruptible++;
 			deactivate_task(prev, rq);
 		}
@@ -3095,6 +3133,13 @@
 	prev->sleep_avg -= run_time;
 	if ((long)prev->sleep_avg <= 0)
 		prev->sleep_avg = 0;
+	if (prev->state & (TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE) &&
+			prev->sleep_avg > prev->slice_avg + (NS_MAX_SLEEP_AVG/10) &&
+			!rt_task(prev))
+		prev->state |= TASK_NONINTERACTIVE;
+	if (!rt_task(next) && !(next->time_slice % DEF_TIMESLICE))
+		next->last_slice = now;
+
 	prev->timestamp = prev->last_ran = now;
 
 	sched_info_switch(prev, next);

[-- Attachment #3: irman2.c --]
[-- Type: text/plain, Size: 4107 bytes --]

/*
 *  irman by Davide Libenzi ( irman load generator )
 *  Copyright (C) 2003  Davide Libenzi
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 2 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with this program; if not, write to the Free Software
 *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 *
 *  Davide Libenzi <davidel@xmailserver.org>
 *
 */

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <sys/signal.h>
#include <sys/resource.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>


#define BUFSIZE (1024 * 32)


static int *pipes, *child;
static int num_pipes, num_active, num_child;
static unsigned long burn_ms;
static char buf1[BUFSIZE], buf2[BUFSIZE];
static volatile sig_atomic_t run = 1;
pid_t parent;

static void signal_all(int signum) {
	if (getpid() == parent) {
		while (num_child >= 0) {
			kill(child[num_child], SIGKILL);
			num_child--;
		}
		exit(0);
	} else if (signum == SIGKILL || getppid() == 1)
		run = 0;
}

unsigned long long getustime(void) {
	struct timeval tm;

	gettimeofday(&tm, NULL);
	return (unsigned long long) tm.tv_sec * 1000ULL + (unsigned long long) tm.tv_usec / 1000ULL;
}


int burn_ms_cpu(unsigned long ms) {
	int i, cmp = 0;
	unsigned long long ts;

	ts = getustime();
	do {
		for (i = 0; i < 4; i++)
			cmp += memcmp(buf1, buf2, BUFSIZE);
	} while (ts + ms > getustime());
	return cmp;
}


pid_t hog_process(void) {
	pid_t pid;

	if (!(pid = fork())) {
		while (run) {
			printf("HOG running %u\n", time(NULL));
			burn_ms_cpu(burn_ms);
		}
		exit(0);
	}
	return pid;
}


pid_t irman_process(int n) {
	int nn;
	pid_t pid;
	u_char ch;

	if (!(pid = fork())) {
		if ((nn = n + num_active) >= num_pipes)
			nn -= num_pipes;
		while (run) {
			printf("reading %u\n", n);
			read(pipes[2 * n], &ch, 1);
			burn_ms_cpu(burn_ms);
			printf("writing %u\n", nn);
			write(pipes[2 * nn + 1], "s", 1);
		}
		exit(0);
	}
	return pid;
}

int main (int argc, char **argv) {
	struct rlimit rl;
	int i, c;
	long work;
	int *cp, run_secs = 0;
	extern char *optarg;
	struct sigaction action;

	parent = getpid();
	num_pipes = 40;
	num_active = 1;
	burn_ms = 300;
	while ((c = getopt(argc, argv, "n:b:a:s:")) != -1) {
		switch (c) {
		case 'n':
			num_pipes = atoi(optarg);
			break;
		case 'b':
			burn_ms = atoi(optarg);
			break;
		case 'a':
			num_active = atoi(optarg);
			break;
		case 's':
			run_secs = 1 + atoi(optarg);
			break;
		default:
			fprintf(stderr, "Illegal argument \"%c\"\n", c);
			exit(1);
		}
	}

	rl.rlim_cur = rl.rlim_max = num_pipes * 2 + 50;
	if (setrlimit(RLIMIT_NOFILE, &rl) == -1) {
		perror("setrlimit"); 
		exit(1);
	}

	pipes = calloc(num_pipes * 2, sizeof(int));
	if (pipes == NULL) {
		perror("malloc");
		exit(1);
	}

	child = calloc(num_pipes, sizeof(int));
	if (child == NULL) {
		perror("malloc");
		exit(1);
	}

	for (cp = pipes, i = 0; i < num_pipes; i++, cp += 2) {
		if (pipe(cp) == -1) {
			perror("pipe");
			exit(1);
		}
	}

	memset(buf1, 'f', sizeof(buf1));
	memset(buf2, 'f', sizeof(buf2));

	sigemptyset(&action.sa_mask);
	/* establish termination handler */
	action.sa_handler = signal_all;
	action.sa_flags = SA_NODEFER;
	if (sigaction(SIGTERM, &action, NULL) == -1) {
		perror("Could not install signal handler");
		exit(1);
	}

	for (i = 0; i < num_pipes; i++)
		child[i] = irman_process(i);

	child[i] = hog_process();
	num_child = i;

	for (i = 0; i < num_active; i++)
		write(pipes[2 * i + 1], "s", 1);

	while (--run_secs)
		sleep(1);
	signal_all(SIGKILL);
	exit(0);
}


[-- Attachment #4: thud.c --]
[-- Type: text/plain, Size: 1279 bytes --]

/* thud.c */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <time.h>
#include <sched.h>

/* These are used as strings so that strcmp can be used as a delay loop */
char *s1, *s2;

/* 20000 is fine on my 333MHz Celeron. Adjust so that system time is not
   excessive */
#define DELAY 20000

void busy_wait(long sec, long usec)
{
  struct timeval tv;
  long long end_usec;
  gettimeofday(&tv,0);
  end_usec=(long long)(sec+tv.tv_sec)*1000000 + tv.tv_usec+usec;
  while (((long long)tv.tv_sec*1000000 + tv.tv_usec) < end_usec)
  {
    gettimeofday(&tv,0);
#if 1 /* MIKEDIDIT */
    strcmp(s1,s2); /* yuck */
#else
    sched_yield();
#endif
  }
}

int main(int argc, char**argv)
{
  struct timespec st={10,50000000};
  int n=DELAY;
  int parent=1;

  if (argc<2) {fprintf(stderr,"Syntax: thud <children>\n"); return 0; }

  s1=malloc(n);
  s2=malloc(n);
  memset(s1,33,n);
  memset(s2,33,n);
  s1[n-1]=0;
  s2[n-1]=0;

  n=atoi(argv[1]);
  fprintf(stderr,"starting %d children\n",n);
  for (; n>0; n--)
    if (fork()==0) { sched_yield(); parent=0; break; }
  while (1)
  {
    nanosleep(&st, 0);
    if (parent) printf("running...");
    if (parent) fflush(stdout);
    busy_wait(6,0);
    if (parent) printf("done\n");
  }
  return 0;
}

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-09 15:52                   ` Mike Galbraith
@ 2006-01-09 16:08                     ` Con Kolivas
  2006-01-09 18:14                       ` Mike Galbraith
  2006-01-09 20:00                     ` Paolo Ornati
  1 sibling, 1 reply; 58+ messages in thread
From: Con Kolivas @ 2006-01-09 16:08 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar,
	Nick Piggin, Peter Williams

On Tuesday 10 January 2006 02:52, Mike Galbraith wrote:
> At 12:11 PM 1/9/2006 +0100, Mike Galbraith wrote:
> >Care to try an experiment?...
>
> Oops.  I guess I should send one that's not mixed p1 and p0.  Sorry about
> that :-/
>
> Anyway, if anyone wants to see a functional demonstration, just try
> this.  Remove the TASK_NONINTERACTIVE in fs/pipe.c in both the stock kernel
> and this modified one so Davide Libenzi's excellent sleep pattern exploit
> (irman2) can work [1], and do the below all at the same time ...
>
> make -j4 bzImage
> irman2
> thud 3
>
> With the stock kernel, I got bored after a half an hour, and stopped the
> kernel build.  It had produced 40 .o files.  The modified kernel finished
> in 20 minutes vs the 8 minutes it took to produce the same 504 .o files if
> not under load.
>
>          -Mike
>
> 1.  it just so happens that Davide wrote irman2 using pipes... he could
> have done something else.  if anyone doesn't think this is a fair test,
> just use Paolo's much simpler exploit instead.  the result will be about
> the same.

Want to try with a few other schedulers using plugsched?

Con

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-09 16:08                     ` Con Kolivas
@ 2006-01-09 18:14                       ` Mike Galbraith
  0 siblings, 0 replies; 58+ messages in thread
From: Mike Galbraith @ 2006-01-09 18:14 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar,
	Nick Piggin, Peter Williams

At 03:08 AM 1/10/2006 +1100, Con Kolivas wrote:
>On Tuesday 10 January 2006 02:52, Mike Galbraith wrote:
> >
> > Anyway, if anyone wants to see a functional demonstration, just try
> > this.  Remove the TASK_NONINTERACTIVE in fs/pipe.c in both the stock kernel
> > and this modified one so Davide Libenzi's excellent sleep pattern exploit
> > (irman2) can work [1], and do the below all at the same time ...
> >
> > make -j4 bzImage
> > irman2
> > thud 3
> >...
>
>
>Want to try with a few other schedulers using plugsched?

Not really.  These exploits were specifically aimed at the current 
scheduler's Achilles Heel.  Besides, that carries implications of replacing 
this one, and I'm pretty fond of it.

         -Mike 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-09 15:52                   ` Mike Galbraith
  2006-01-09 16:08                     ` Con Kolivas
@ 2006-01-09 20:00                     ` Paolo Ornati
  2006-01-09 20:23                       ` Paolo Ornati
  2006-01-10  7:08                       ` Mike Galbraith
  1 sibling, 2 replies; 58+ messages in thread
From: Paolo Ornati @ 2006-01-09 20:00 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

On Mon, 09 Jan 2006 16:52:17 +0100
Mike Galbraith <efault@gmx.de> wrote:

> >Care to try an experiment?...

Yes.

With my simple proggy things improve a bit:

./a.out 7000 & ./a.out 6537 &
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5579 paolo     19   0  2392  288  228 R 51.9  0.1   0:22.38 a.out
 5578 paolo     20   0  2392  288  228 R 43.9  0.1   0:22.50 a.out

As you can see they don't get priority 15/16 anymore :)

DD test: 256 MB, ~8.5s (instead of 8)

In pratice the more CPU they use the more their priority is penalized...


BUT if I start more of them (3/4) I'm able to fool it.

"./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873"

2 TOP's snapshots:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5625 paolo     17   0  2392  288  228 R 31.6  0.1   0:10.74 a.out
 5626 paolo     17   0  2392  288  228 R 28.8  0.1   0:09.16 a.out
 5627 paolo     17   0  2392  288  228 R 22.2  0.1   0:07.59 a.out
 5624 paolo     17   0  2392  288  228 R 17.4  0.1   0:08.67 a.out

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5626 paolo     16   0  2392  288  228 R 30.1  0.1   0:39.95 a.out
 5627 paolo     16   0  2392  288  228 R 24.1  0.1   0:34.93 a.out
 5625 paolo     18   0  2392  288  228 R 23.5  0.1   0:37.53 a.out
 5624 paolo     18   0  2392  288  228 R 21.9  0.1   0:37.60 a.out
 5193 root      15   0  167m  17m 2916 S  0.2  3.5   0:09.67 X
 5638 paolo     18   0  4952 1468  372 R  0.2  0.3   0:00.15 dd

DD test (256MB): real    3m37.122s  (instead of 8s)



REAL LIFE TEST (transcode)

While running only transcode it gets priority 25:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5857 paolo     25   0  114m  18m 2424 R 90.9  3.7   0:14.28 transcode
 5873 paolo     19   0 49860 4452 1860 S  8.6  0.9   0:01.40 tcdecode
 5308 paolo     16   0 86796  22m  15m R  0.2  4.4   0:06.26 konsole
 5687 paolo     16   0 98648  37m 9348 S  0.2  7.5   0:02.11 perl
 5872 paolo     24   0 21864 1064  600 S  0.2  0.2   0:00.01 tcextract


But if I run also the DD test, "transcode" priority start fluctuating
and can go down to 18/19 (from time to time) interfering with DD:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5694 paolo     19   0  114m  18m 2424 R 75.1  3.7   0:42.29 transcode
 5710 paolo     25   0 49856 4452 1860 R  8.0  0.9   0:04.36 tcdecode
 5726 paolo     18   0  4952 1468  372 R  4.0  0.3   0:00.77 dd


This seems to happen because also transcode is reading (not directly but
through pipes) from disk so the massive disk usage of DD interferes
with it, this leads to transcode using less CPU and getting better
priority.

The exact behaviour changes time to time... but seems to confirm my
teory.

I don't know how can "nicksched" keep transcode priority always to 40
even when I'm running the DD test... I should retry and see.


PS: yes, transcode is reading from disk, but SLOWLY... i think that a
good read-ahead should fullfill his needs even when doing the HD
stressing DD test, no?

-- 
	Paolo Ornati
	Linux 2.6.15-sched_trottle on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-09 20:00                     ` Paolo Ornati
@ 2006-01-09 20:23                       ` Paolo Ornati
  2006-01-10  7:08                       ` Mike Galbraith
  1 sibling, 0 replies; 58+ messages in thread
From: Paolo Ornati @ 2006-01-09 20:23 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Mike Galbraith, Linux Kernel Mailing List, Con Kolivas,
	Ingo Molnar, Nick Piggin, Peter Williams

On Mon, 9 Jan 2006 21:00:35 +0100
Paolo Ornati <ornati@fastwebnet.it> wrote:

> I don't know how can "nicksched" keep transcode priority always to 40
> even when I'm running the DD test... I should retry and see.

I was wrong... the interfering effect is here also with nicksched.

The problem is that Linux is smarter than me --> sometimes I forget the
effect of the disk-cache ;)

For the DD test I always umount/(re)mount the partition where the
BIGFILE is stored to discard the associated cache... but I've forgotten
to do the same for the file that transcode is reading from ;)

-- 
	Paolo Ornati
	Linux 2.6.15-plugsched on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-09 20:00                     ` Paolo Ornati
  2006-01-09 20:23                       ` Paolo Ornati
@ 2006-01-10  7:08                       ` Mike Galbraith
  2006-01-10 12:07                         ` Mike Galbraith
  1 sibling, 1 reply; 58+ messages in thread
From: Mike Galbraith @ 2006-01-10  7:08 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

At 09:00 PM 1/9/2006 +0100, Paolo Ornati wrote:
>On Mon, 09 Jan 2006 16:52:17 +0100
>Mike Galbraith <efault@gmx.de> wrote:
>
> > >Care to try an experiment?...
>
>Yes.

Thanks.

>With my simple proggy things improve a bit:

<snip rest of good news>

>BUT if I start more of them (3/4) I'm able to fool it.
>
>"./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873"
>
>2 TOP's snapshots:
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5625 paolo     17   0  2392  288  228 R 31.6  0.1   0:10.74 a.out
>  5626 paolo     17   0  2392  288  228 R 28.8  0.1   0:09.16 a.out
>  5627 paolo     17   0  2392  288  228 R 22.2  0.1   0:07.59 a.out
>  5624 paolo     17   0  2392  288  228 R 17.4  0.1   0:08.67 a.out
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5626 paolo     16   0  2392  288  228 R 30.1  0.1   0:39.95 a.out
>  5627 paolo     16   0  2392  288  228 R 24.1  0.1   0:34.93 a.out
>  5625 paolo     18   0  2392  288  228 R 23.5  0.1   0:37.53 a.out
>  5624 paolo     18   0  2392  288  228 R 21.9  0.1   0:37.60 a.out
>  5193 root      15   0  167m  17m 2916 S  0.2  3.5   0:09.67 X
>  5638 paolo     18   0  4952 1468  372 R  0.2  0.3   0:00.15 dd
>
>DD test (256MB): real    3m37.122s  (instead of 8s)

Ok, I'll  take another look.  Those should be being throttled.

>REAL LIFE TEST (transcode)
>
>While running only transcode it gets priority 25:
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5857 paolo     25   0  114m  18m 2424 R 90.9  3.7   0:14.28 transcode
>  5873 paolo     19   0 49860 4452 1860 S  8.6  0.9   0:01.40 tcdecode
>  5308 paolo     16   0 86796  22m  15m R  0.2  4.4   0:06.26 konsole
>  5687 paolo     16   0 98648  37m 9348 S  0.2  7.5   0:02.11 perl
>  5872 paolo     24   0 21864 1064  600 S  0.2  0.2   0:00.01 tcextract
>
>
>But if I run also the DD test, "transcode" priority start fluctuating
>and can go down to 18/19 (from time to time) interfering with DD:

19 shouldn't interfere from the cpu side, but 18 will.

>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5694 paolo     19   0  114m  18m 2424 R 75.1  3.7   0:42.29 transcode
>  5710 paolo     25   0 49856 4452 1860 R  8.0  0.9   0:04.36 tcdecode
>  5726 paolo     18   0  4952 1468  372 R  4.0  0.3   0:00.77 dd
>
>
>This seems to happen because also transcode is reading (not directly but
>through pipes) from disk so the massive disk usage of DD interferes
>with it, this leads to transcode using less CPU and getting better
>priority.

It can't be pipe waits, they're disabled in the kernel.  Most likely the 
credit we get for being activated without having yet been selected.

>The exact behaviour changes time to time... but seems to confirm my
>teory.
>
>I don't know how can "nicksched" keep transcode priority always to 40
>even when I'm running the DD test... I should retry and see.
>
>
>PS: yes, transcode is reading from disk, but SLOWLY... i think that a
>good read-ahead should fullfill his needs even when doing the HD
>stressing DD test, no?

Dunno.

         -Mike 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-10  7:08                       ` Mike Galbraith
@ 2006-01-10 12:07                         ` Mike Galbraith
  2006-01-10 12:56                           ` Paolo Ornati
  0 siblings, 1 reply; 58+ messages in thread
From: Mike Galbraith @ 2006-01-10 12:07 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

At 08:08 AM 1/10/2006 +0100, Mike Galbraith wrote:

>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  5626 paolo     16   0  2392  288  228 R 30.1  0.1   0:39.95 a.out
>>  5627 paolo     16   0  2392  288  228 R 24.1  0.1   0:34.93 a.out
>>  5625 paolo     18   0  2392  288  228 R 23.5  0.1   0:37.53 a.out
>>  5624 paolo     18   0  2392  288  228 R 21.9  0.1   0:37.60 a.out
>>  5193 root      15   0  167m  17m 2916 S  0.2  3.5   0:09.67 X
>>  5638 paolo     18   0  4952 1468  372 R  0.2  0.3   0:00.15 dd
>>
>>DD test (256MB): real    3m37.122s  (instead of 8s)
>
>Ok, I'll  take another look.  Those should be being throttled.

Can you please try this version?  It tries harder to correct any 
imbalance... hopefully not too hard.  In case you didn't notice, you need 
to let your exploits run for a bit before the throttling will take 
effect.  I intentionally start tasks off at 0 cpu usage, so it takes a bit 
to come up to it's real usage.

         -Mike 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-10 12:07                         ` Mike Galbraith
@ 2006-01-10 12:56                           ` Paolo Ornati
  2006-01-10 13:01                             ` Mike Galbraith
  0 siblings, 1 reply; 58+ messages in thread
From: Paolo Ornati @ 2006-01-10 12:56 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

On Tue, 10 Jan 2006 13:07:33 +0100
Mike Galbraith <efault@gmx.de> wrote:

> Can you please try this version?  It tries harder to correct any 

It seems that you have forgotten the to attach the patch...

> imbalance... hopefully not too hard.  In case you didn't notice, you need 
> to let your exploits run for a bit before the throttling will take 
> effect.  I intentionally start tasks off at 0 cpu usage, so it takes a bit 
> to come up to it's real usage.

Yes, I've seen it.

-- 
	Paolo Ornati
	Linux 2.6.15-plugsched on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-10 12:56                           ` Paolo Ornati
@ 2006-01-10 13:01                             ` Mike Galbraith
  2006-01-10 13:53                               ` Paolo Ornati
  0 siblings, 1 reply; 58+ messages in thread
From: Mike Galbraith @ 2006-01-10 13:01 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

[-- Attachment #1: Type: text/plain, Size: 365 bytes --]

At 01:56 PM 1/10/2006 +0100, Paolo Ornati wrote:
>On Tue, 10 Jan 2006 13:07:33 +0100
>Mike Galbraith <efault@gmx.de> wrote:
>
> > Can you please try this version?  It tries harder to correct any
>
>It seems that you have forgotten the to attach the patch...

Drat.  At least I'm not the first to ever do so :)

(double checks)  It's now attached.

         -Mike   

[-- Attachment #2: sched_throttle2 --]
[-- Type: application/octet-stream, Size: 4422 bytes --]

--- linux-2.6.15/include/linux/sched.h.org	Tue Jan  3 09:26:50 2006
+++ linux-2.6.15/include/linux/sched.h	Sat Jan  7 14:45:37 2006
@@ -701,8 +701,8 @@
 
 	unsigned short ioprio;
 
-	unsigned long sleep_avg;
-	unsigned long long timestamp, last_ran;
+	unsigned long sleep_avg, slice_avg;
+	unsigned long long timestamp, last_ran, last_slice;
 	unsigned long long sched_time; /* sched_clock time spent running */
 	int activated;
 
--- linux-2.6.15/kernel/sched.c.org	Sat Jan  7 16:22:13 2006
+++ linux-2.6.15/kernel/sched.c	Tue Jan 10 13:59:44 2006
@@ -127,6 +127,20 @@
 	(NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
 		MAX_SLEEP_AVG)
 
+/*
+ * If a task's sleep_avg strays too far from it's slice_avg, this
+ * task is using more cpu than it's sleep_avg would indicate. When
+ * such a disparity is detected, prevent additional sleep time from
+ * being added to the existing imbalance, and increase the rate at
+ * which sleep_average is consumed.
+ */
+#define SLEEP_AVG_IMBALANCE(p) \
+	((p)->sleep_avg > (p)->slice_avg + (NS_MAX_SLEEP_AVG/10))
+
+#define CPU_PENALTY(p) \
+	(NS_TO_JIFFIES(min((p)->sleep_avg, (p)->slice_avg)) * MAX_BONUS / \
+		MAX_SLEEP_AVG)
+
 #define GRANULARITY	(10 * HZ / 1000 ? : 1)
 
 #ifdef CONFIG_SMP
@@ -744,7 +758,7 @@
 	else
 		sleep_time = (unsigned long)__sleep_time;
 
-	if (likely(sleep_time > 0)) {
+	if (likely(sleep_time > 0 && !SLEEP_AVG_IMBALANCE(p))) {
 		/*
 		 * User tasks that sleep a long time are categorised as
 		 * idle and will get just interactive status to stay active &
@@ -1353,7 +1367,7 @@
 
 out_activate:
 #endif /* CONFIG_SMP */
-	if (old_state == TASK_UNINTERRUPTIBLE) {
+	if (old_state & TASK_UNINTERRUPTIBLE) {
 		rq->nr_uninterruptible--;
 		/*
 		 * Tasks on involuntary sleep don't earn
@@ -1368,7 +1382,7 @@
 	 * sleep is handled in a priority-neutral manner, no priority
 	 * boost and no penalty.)
 	 */
-	if (old_state & TASK_NONINTERACTIVE)
+	if (old_state & TASK_NONINTERACTIVE || SLEEP_AVG_IMBALANCE(p))
 		__activate_task(p, rq);
 	else
 		activate_task(p, rq, cpu == this_cpu);
@@ -1492,6 +1506,8 @@
 	 */
 	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
 		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
+	p->slice_avg = NS_MAX_SLEEP_AVG;
+	p->last_slice = sched_clock();
 
 	p->prio = effective_prio(p);
 
@@ -2679,6 +2695,7 @@
 		if ((p->policy == SCHED_RR) && !--p->time_slice) {
 			p->time_slice = task_timeslice(p);
 			p->first_time_slice = 0;
+			p->last_slice = now;
 			set_tsk_need_resched(p);
 
 			/* put it at the end of the queue: */
@@ -2687,12 +2704,33 @@
 		goto out_unlock;
 	}
 	if (!--p->time_slice) {
+		unsigned long long nsecs = now - p->last_slice;
+		unsigned long idle, ticks;
+		int w = 10;
+
 		dequeue_task(p, rq->active);
 		set_tsk_need_resched(p);
 		p->prio = effective_prio(p);
 		p->time_slice = task_timeslice(p);
 		p->first_time_slice = 0;
 
+		if (nsecs > ~0UL)
+			nsecs = ~0UL;
+		ticks = NS_TO_JIFFIES((unsigned long) nsecs);
+		if (ticks < p->time_slice)
+			ticks = p->time_slice;
+		idle = 100 - (100 * p->time_slice / ticks);
+		p->slice_avg /= NS_MAX_SLEEP_AVG / 100;
+		/*
+		 * If the task is lowering it's cpu usage, speed up the
+		 * effect on slice_avg so we don't over throttle.
+		 */
+		if (idle > p->slice_avg + 10)
+			w -= (100 * p->slice_avg / idle) / 10;
+		p->slice_avg = (w * p->slice_avg + idle) / (w + 1);
+		p->slice_avg *= NS_MAX_SLEEP_AVG / 100;
+		p->last_slice = now;
+
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
 		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
@@ -2996,7 +3034,7 @@
 	 * Tasks charged proportionately less run_time at high sleep_avg to
 	 * delay them losing their interactive status
 	 */
-	run_time /= (CURRENT_BONUS(prev) ? : 1);
+	run_time /= (CPU_PENALTY(prev) ? : 1);
 
 	spin_lock_irq(&rq->lock);
 
@@ -3010,7 +3048,7 @@
 				unlikely(signal_pending(prev))))
 			prev->state = TASK_RUNNING;
 		else {
-			if (prev->state == TASK_UNINTERRUPTIBLE)
+			if (prev->state & TASK_UNINTERRUPTIBLE)
 				rq->nr_uninterruptible++;
 			deactivate_task(prev, rq);
 		}
@@ -3095,6 +3133,12 @@
 	prev->sleep_avg -= run_time;
 	if ((long)prev->sleep_avg <= 0)
 		prev->sleep_avg = 0;
+	/*
+	 * Enable detection of the beginning of a slice at tick time.
+	 */
+	if (!rt_task(next) && !(next->time_slice % DEF_TIMESLICE))
+		next->last_slice = now;
+
 	prev->timestamp = prev->last_ran = now;
 
 	sched_info_switch(prev, next);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-10 13:01                             ` Mike Galbraith
@ 2006-01-10 13:53                               ` Paolo Ornati
  2006-01-10 15:18                                 ` Mike Galbraith
  0 siblings, 1 reply; 58+ messages in thread
From: Paolo Ornati @ 2006-01-10 13:53 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

On Tue, 10 Jan 2006 14:01:36 +0100
Mike Galbraith <efault@gmx.de> wrote:

> > > Can you please try this version?  It tries harder to correct any
> >
> >It seems that you have forgotten the to attach the patch...
> 
> Drat.  At least I'm not the first to ever do so :)

This version basically works like the the previous, except that it makes
the priority adjustment faster (that is fine).

However I can fool it the same way.

"./a.out 7000"
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5459 paolo     22   0  2392  288  228 S 71.3  0.1   0:09.47 a.out

"./a.out 3000"
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5493 paolo     19   0  2396  292  228 R 49.8  0.1   0:14.42 a.out

"./a.out 1500"
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5495 paolo     18   0  2396  288  228 S 33.4  0.1   0:09.60 a.out


Fooling it:

"./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873 &"
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5502 paolo     19   0  2392  288  228 R 27.0  0.1   0:05.64 a.out
 5503 paolo     19   0  2396  288  228 R 26.0  0.1   0:07.50 a.out
 5505 paolo     19   0  2396  292  228 R 25.6  0.1   0:07.24 a.out
 5504 paolo     18   0  2392  288  228 R 21.0  0.1   0:06.78 a.out

(priorities fluctuate between 18/19)


Again with more of them:

./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873& ./a.out 6245 & ./a.out 5467 &

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5525 paolo     18   0  2396  288  228 R 26.4  0.1   0:07.48 a.out
 5521 paolo     19   0  2396  288  228 R 22.0  0.1   0:09.00 a.out
 5524 paolo     19   0  2392  288  228 R 19.6  0.1   0:07.21 a.out
 5523 paolo     19   0  2392  288  228 R 13.0  0.1   0:10.60 a.out
 5520 paolo     19   0  2392  288  228 R 11.0  0.1   0:08.46 a.out
 5522 paolo     19   0  2396  288  228 R  7.8  0.1   0:07.14 a.out

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5528 paolo     18   0  2392  288  228 R 19.7  0.1   0:18.15 a.out
 5533 paolo     15   0  2396  288  228 S 19.3  0.1   0:19.12 a.out
 5531 paolo     18   0  2396  288  228 R 18.5  0.1   0:19.23 a.out
 5532 paolo     17   0  2392  288  228 R 15.1  0.1   0:18.55 a.out
 5529 paolo     18   0  2396  288  228 R 14.7  0.1   0:13.05 a.out
 5530 paolo     18   0  2392  288  228 R 12.5  0.1   0:20.42 a.out

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5530 paolo     18   0  2392  288  228 R 21.0  0.1   0:25.42 a.out
 5533 paolo     18   0  2396  288  228 R 20.2  0.1   0:24.75 a.out
 5529 paolo     18   0  2396  288  228 R 16.2  0.1   0:17.68 a.out
 5532 paolo     18   0  2392  288  228 R 14.8  0.1   0:23.33 a.out
 5531 paolo     18   0  2396  288  228 R 14.4  0.1   0:23.96 a.out
 5528 paolo     18   0  2392  288  228 R 13.6  0.1   0:23.03 a.out


-- 
	Paolo Ornati
	Linux 2.6.15-sched_trottle2 on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-10 13:53                               ` Paolo Ornati
@ 2006-01-10 15:18                                 ` Mike Galbraith
  0 siblings, 0 replies; 58+ messages in thread
From: Mike Galbraith @ 2006-01-10 15:18 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin,
	Peter Williams

At 02:53 PM 1/10/2006 +0100, Paolo Ornati wrote:
>On Tue, 10 Jan 2006 14:01:36 +0100
>Mike Galbraith <efault@gmx.de> wrote:
>
> > > > Can you please try this version?  It tries harder to correct any
> > >
> > >It seems that you have forgotten the to attach the patch...
> >
> > Drat.  At least I'm not the first to ever do so :)
>
>This version basically works like the the previous, except that it makes
>the priority adjustment faster (that is fine).
>
>However I can fool it the same way.
>
>"./a.out 7000"
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5459 paolo     22   0  2392  288  228 S 71.3  0.1   0:09.47 a.out
>
>"./a.out 3000"
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5493 paolo     19   0  2396  292  228 R 49.8  0.1   0:14.42 a.out
>
>"./a.out 1500"
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5495 paolo     18   0  2396  288  228 S 33.4  0.1   0:09.60 a.out
>
>
>Fooling it:
>
>"./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873 &"
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5502 paolo     19   0  2392  288  228 R 27.0  0.1   0:05.64 a.out
>  5503 paolo     19   0  2396  288  228 R 26.0  0.1   0:07.50 a.out
>  5505 paolo     19   0  2396  292  228 R 25.6  0.1   0:07.24 a.out
>  5504 paolo     18   0  2392  288  228 R 21.0  0.1   0:06.78 a.out
>
>(priorities fluctuate between 18/19)
>
>
>Again with more of them:
>
>./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873& ./a.out 6245 & 
>./a.out 5467 &
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5525 paolo     18   0  2396  288  228 R 26.4  0.1   0:07.48 a.out
>  5521 paolo     19   0  2396  288  228 R 22.0  0.1   0:09.00 a.out
>  5524 paolo     19   0  2392  288  228 R 19.6  0.1   0:07.21 a.out
>  5523 paolo     19   0  2392  288  228 R 13.0  0.1   0:10.60 a.out
>  5520 paolo     19   0  2392  288  228 R 11.0  0.1   0:08.46 a.out
>  5522 paolo     19   0  2396  288  228 R  7.8  0.1   0:07.14 a.out
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5528 paolo     18   0  2392  288  228 R 19.7  0.1   0:18.15 a.out
>  5533 paolo     15   0  2396  288  228 S 19.3  0.1   0:19.12 a.out
>  5531 paolo     18   0  2396  288  228 R 18.5  0.1   0:19.23 a.out
>  5532 paolo     17   0  2392  288  228 R 15.1  0.1   0:18.55 a.out
>  5529 paolo     18   0  2396  288  228 R 14.7  0.1   0:13.05 a.out
>  5530 paolo     18   0  2392  288  228 R 12.5  0.1   0:20.42 a.out
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5530 paolo     18   0  2392  288  228 R 21.0  0.1   0:25.42 a.out
>  5533 paolo     18   0  2396  288  228 R 20.2  0.1   0:24.75 a.out
>  5529 paolo     18   0  2396  288  228 R 16.2  0.1   0:17.68 a.out
>  5532 paolo     18   0  2392  288  228 R 14.8  0.1   0:23.33 a.out
>  5531 paolo     18   0  2396  288  228 R 14.4  0.1   0:23.96 a.out
>  5528 paolo     18   0  2392  288  228 R 13.6  0.1   0:23.03 a.out

I don't think it's being fooled, it's pulling these guys down, it's just 
that the plausible limit goes up the more tasks you're sharing the cpu 
with, so they hit the throttle less.  If you start a bunch of tasks who 
only do a short burn such that their individual cpu usage isn't much, but 
together they use all available cpu, you can still severely starve 
non-sleeping == low priority tasks.  This patch won't help for that... 
that's an entirely different problem, and fortunately one that doesn't seem 
to happen much in real life otherwise folks would be bitching like 
crazy.  To guard against it, there are a lot of different things you can 
do.  All of the ways that I know of are very bad for the desktop user, and 
fortunately for me, beyond the intended scope of this patch :)

         -Mike 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2005-12-30 13:52     ` [SCHED] wrong priority calc - SIMPLE test case Paolo Ornati
  2005-12-31  2:06       ` Peter Williams
  2005-12-31  8:13       ` Mike Galbraith
@ 2006-01-13  1:13       ` Con Kolivas
  2006-01-13  1:32         ` Con Kolivas
  2006-01-13 10:46         ` Paolo Ornati
  2 siblings, 2 replies; 58+ messages in thread
From: Con Kolivas @ 2006-01-13  1:13 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams

[-- Attachment #1: Type: text/plain, Size: 6936 bytes --]

On Saturday 31 December 2005 00:52, Paolo Ornati wrote:
> WAS: [SCHED] Totally WRONG prority calculation with specific test-case
> (since 2.6.10-bk12)
> http://lkml.org/lkml/2005/12/27/114/index.html
>
> On Wed, 28 Dec 2005 10:26:58 +1100
>
> Con Kolivas <kernel@kolivas.org> wrote:
> > The issue is that the scheduler interactivity estimator is a state
> > machine and can be fooled to some degree, and a cpu intensive task that
> > just happens to sleep a little bit gets significantly better priority
> > than one that is fully cpu bound all the time. Reverting that change is
> > not a solution because it can still be fooled by the same process
> > sleeping lots for a few seconds or so at startup and then changing to the
> > cpu mostly-sleeping slightly behaviour. This "fluctuating" behaviour is
> > in my opinion worse which is why I removed it.
>
> Trying to find a "as simple as possible" test case for this problem
> (that I consider a BUG in priority calculation) I've come up with this
> very simple program:

Hi Paolo.

Can you try the following patch on 2.6.15 please? I'm interested in how
adversely this affects interactive performance as well as whether it helps
your test case.

Thanks,
Con



---
 include/linux/sched.h |    9 +++++-
 kernel/sched.c        |   72 ++++++++++++++++++++++----------------------------
 2 files changed, 41 insertions(+), 40 deletions(-)

Index: linux-2.6.15/include/linux/sched.h
===================================================================
--- linux-2.6.15.orig/include/linux/sched.h
+++ linux-2.6.15/include/linux/sched.h
@@ -683,6 +683,13 @@ static inline void prefetch_stack(struct
 struct audit_context;		/* See audit.c */
 struct mempolicy;
 
+enum sleep_type {
+	SLEEP_NORMAL,
+	SLEEP_NONINTERACTIVE,
+	SLEEP_INTERACTIVE,
+	SLEEP_INTERRUPTED,
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -704,7 +711,7 @@ struct task_struct {
 	unsigned long sleep_avg;
 	unsigned long long timestamp, last_ran;
 	unsigned long long sched_time; /* sched_clock time spent running */
-	int activated;
+	enum sleep_type sleep_type;
 
 	unsigned long policy;
 	cpumask_t cpus_allowed;
Index: linux-2.6.15/kernel/sched.c
===================================================================
--- linux-2.6.15.orig/kernel/sched.c
+++ linux-2.6.15/kernel/sched.c
@@ -751,31 +751,22 @@ static int recalc_task_prio(task_t *p, u
 		 * prevent them suddenly becoming cpu hogs and starving
 		 * other processes.
 		 */
-		if (p->mm && p->activated != -1 &&
+		if (p->mm && p->sleep_type != SLEEP_NONINTERACTIVE &&
 			sleep_time > INTERACTIVE_SLEEP(p)) {
 				p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG -
 						DEF_TIMESLICE);
 		} else {
+
 			/*
 			 * The lower the sleep avg a task has the more
-			 * rapidly it will rise with sleep time.
+			 * rapidly it will rise with sleep time. This enables
+			 * tasks to rapidly recover to a low latency priority.
+			 * If a task was sleeping with the noninteractive
+			 * label do not apply this non-linear boost
 			 */
-			sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
-
-			/*
-			 * Tasks waking from uninterruptible sleep are
-			 * limited in their sleep_avg rise as they
-			 * are likely to be waiting on I/O
-			 */
-			if (p->activated == -1 && p->mm) {
-				if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
-					sleep_time = 0;
-				else if (p->sleep_avg + sleep_time >=
-						INTERACTIVE_SLEEP(p)) {
-					p->sleep_avg = INTERACTIVE_SLEEP(p);
-					sleep_time = 0;
-				}
-			}
+			if (p->sleep_type != SLEEP_NONINTERACTIVE || p->mm)
+				sleep_time *=
+					(MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
 
 			/*
 			 * This code gives a bonus to interactive tasks.
@@ -818,11 +809,7 @@ static void activate_task(task_t *p, run
 	if (!rt_task(p))
 		p->prio = recalc_task_prio(p, now);
 
-	/*
-	 * This checks to make sure it's not an uninterruptible task
-	 * that is now waking up.
-	 */
-	if (!p->activated) {
+	if (p->sleep_type != SLEEP_NONINTERACTIVE) {
 		/*
 		 * Tasks which were woken up by interrupts (ie. hw events)
 		 * are most likely of interactive nature. So we give them
@@ -831,13 +818,13 @@ static void activate_task(task_t *p, run
 		 * on a CPU, first time around:
 		 */
 		if (in_interrupt())
-			p->activated = 2;
+			p->sleep_type = SLEEP_INTERRUPTED;
 		else {
 			/*
 			 * Normal first-time wakeups get a credit too for
 			 * on-runqueue time, but it will be weighted down:
 			 */
-			p->activated = 1;
+			p->sleep_type = SLEEP_INTERACTIVE;
 		}
 	}
 	p->timestamp = now;
@@ -1356,22 +1343,23 @@ out_activate:
 	if (old_state == TASK_UNINTERRUPTIBLE) {
 		rq->nr_uninterruptible--;
 		/*
-		 * Tasks on involuntary sleep don't earn
-		 * sleep_avg beyond just interactive state.
+		 * Tasks waking from uninterruptible sleep are likely
+		 * to be sleeping involuntarily on I/O and are otherwise
+		 * cpu bound so label them as noninteractive.
 		 */
-		p->activated = -1;
-	}
+		p->sleep_type = SLEEP_NONINTERACTIVE;
+	} else
 
 	/*
 	 * Tasks that have marked their sleep as noninteractive get
-	 * woken up without updating their sleep average. (i.e. their
-	 * sleep is handled in a priority-neutral manner, no priority
-	 * boost and no penalty.)
+	 * woken up with their sleep average not weighted in an
+	 * interactive way.
 	 */
-	if (old_state & TASK_NONINTERACTIVE)
-		__activate_task(p, rq);
-	else
-		activate_task(p, rq, cpu == this_cpu);
+		if (old_state & TASK_NONINTERACTIVE)
+			p->sleep_type = SLEEP_NONINTERACTIVE;
+
+
+	activate_task(p, rq, cpu == this_cpu);
 	/*
 	 * Sync wakeups (i.e. those types of wakeups where the waker
 	 * has indicated that it will leave the CPU in short order)
@@ -2938,6 +2926,12 @@ EXPORT_SYMBOL(sub_preempt_count);
 
 #endif
 
+static inline int interactive_sleep(enum sleep_type sleep_type)
+{
+	return (sleep_type == SLEEP_INTERACTIVE ||
+		sleep_type == SLEEP_INTERRUPTED);
+}
+
 /*
  * schedule() is the main scheduler function.
  */
@@ -3063,12 +3057,12 @@ go_idle:
 	queue = array->queue + idx;
 	next = list_entry(queue->next, task_t, run_list);
 
-	if (!rt_task(next) && next->activated > 0) {
+	if (!rt_task(next) && interactive_sleep(next->sleep_type)) {
 		unsigned long long delta = now - next->timestamp;
 		if (unlikely((long long)(now - next->timestamp) < 0))
 			delta = 0;
 
-		if (next->activated == 1)
+		if (next->sleep_type == SLEEP_INTERACTIVE)
 			delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
 
 		array = next->array;
@@ -3081,7 +3075,7 @@ go_idle:
 		} else
 			requeue_task(next, array);
 	}
-	next->activated = 0;
+	next->sleep_type = SLEEP_NORMAL;
 switch_tasks:
 	if (next == rq->idle)
 		schedstat_inc(rq, sched_goidle);

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-13  1:13       ` Con Kolivas
@ 2006-01-13  1:32         ` Con Kolivas
  2006-01-13 10:46         ` Paolo Ornati
  1 sibling, 0 replies; 58+ messages in thread
From: Con Kolivas @ 2006-01-13  1:32 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams

On Friday 13 January 2006 12:13, Con Kolivas wrote:
> On Saturday 31 December 2005 00:52, Paolo Ornati wrote:
> > WAS: [SCHED] Totally WRONG prority calculation with specific test-case
> > (since 2.6.10-bk12)
> > http://lkml.org/lkml/2005/12/27/114/index.html
> >
> > On Wed, 28 Dec 2005 10:26:58 +1100
> >
> > Con Kolivas <kernel@kolivas.org> wrote:
> > > The issue is that the scheduler interactivity estimator is a state
> > > machine and can be fooled to some degree, and a cpu intensive task that
> > > just happens to sleep a little bit gets significantly better priority
> > > than one that is fully cpu bound all the time. Reverting that change is
> > > not a solution because it can still be fooled by the same process
> > > sleeping lots for a few seconds or so at startup and then changing to
> > > the cpu mostly-sleeping slightly behaviour. This "fluctuating"
> > > behaviour is in my opinion worse which is why I removed it.
> >
> > Trying to find a "as simple as possible" test case for this problem
> > (that I consider a BUG in priority calculation) I've come up with this
> > very simple program:
>
> Hi Paolo.
>
> Can you try the following patch on 2.6.15 please? I'm interested in how
> adversely this affects interactive performance as well as whether it helps
> your test case.

I should make it clear. This patch _will_ adversely affect interactivity 
because your test case desires that I/O bound tasks get higher priority, and 
this patch will do that. This means that I/O bound tasks will be more 
noticeable now. The question is how much do we trade off one for the other. 
We almost certainly are biased a little too much on the interactive side on 
the mainline kernel at the moment.

Cheers,
Con

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-13  1:13       ` Con Kolivas
  2006-01-13  1:32         ` Con Kolivas
@ 2006-01-13 10:46         ` Paolo Ornati
  2006-01-13 10:51           ` Con Kolivas
  1 sibling, 1 reply; 58+ messages in thread
From: Paolo Ornati @ 2006-01-13 10:46 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams

On Fri, 13 Jan 2006 12:13:11 +1100
Con Kolivas <kernel@kolivas.org> wrote:

> Can you try the following patch on 2.6.15 please? I'm interested in how
> adversely this affects interactive performance as well as whether it helps
> your test case.

"./a.out 5000 & ./a.out 5237 & ./a.out 5331 &"
"mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null
bs=1M count=256; umount space/"

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5445 paolo     16   0  2396  288  228 R 34.8  0.1   0:05.84 a.out
 5446 paolo     15   0  2396  288  228 S 32.8  0.1   0:05.53 a.out
 5444 paolo     16   0  2392  288  228 R 31.3  0.1   0:05.99 a.out
 5443 paolo     16   0 10416 1104  848 R  0.2  0.2   0:00.01 top
 5451 paolo     15   0  4948 1468  372 D  0.2  0.3   0:00.01 dd

DD test takes ~20 s (instead of 8s).

As you can see DD priority is now very good (15) but it still suffers
because also my test programs get good priority (15/16).


Things are BETTER on the real test case (transcode): this is because
transcode usually gets priority 16 and "dd" gets 15... so dd is quite
happy.

BUT what is STRANGE is this: usually transcode is stuck to priority 16
using about 88% of the CPU, but sometimes (don't know how to reproduce
it) its priority grows up to 25 and then stay to 25.

When transcode priority is 25 the DD test is obviously happy: in
particular 2 things can happen (this is expected because I've observed
this thing before):

1) priority of transcode stay to 25 (when the file transcode is
reading from, through pipes, IS cached).

2) CPU usage and priority of transcode go down (the file transcode is
reading from ISN'T cached and DD massive disk usage interferes with
this reading). When DD finish trancode priority go back to 25.

-- 
	Paolo Ornati
	Linux 2.6.15-kolivasPatch on x86_64

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-13 10:46         ` Paolo Ornati
@ 2006-01-13 10:51           ` Con Kolivas
  2006-01-13 13:01             ` Mike Galbraith
  0 siblings, 1 reply; 58+ messages in thread
From: Con Kolivas @ 2006-01-13 10:51 UTC (permalink / raw)
  To: Paolo Ornati
  Cc: Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams

[-- Attachment #1: Type: text/plain, Size: 2657 bytes --]

On Friday 13 January 2006 21:46, Paolo Ornati wrote:
> On Fri, 13 Jan 2006 12:13:11 +1100
>
> Con Kolivas <kernel@kolivas.org> wrote:
> > Can you try the following patch on 2.6.15 please? I'm interested in how
> > adversely this affects interactive performance as well as whether it
> > helps your test case.
>
> "./a.out 5000 & ./a.out 5237 & ./a.out 5331 &"
> "mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null
> bs=1M count=256; umount space/"
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5445 paolo     16   0  2396  288  228 R 34.8  0.1   0:05.84 a.out
>  5446 paolo     15   0  2396  288  228 S 32.8  0.1   0:05.53 a.out
>  5444 paolo     16   0  2392  288  228 R 31.3  0.1   0:05.99 a.out
>  5443 paolo     16   0 10416 1104  848 R  0.2  0.2   0:00.01 top
>  5451 paolo     15   0  4948 1468  372 D  0.2  0.3   0:00.01 dd
>
> DD test takes ~20 s (instead of 8s).
>
> As you can see DD priority is now very good (15) but it still suffers
> because also my test programs get good priority (15/16).
>
>
> Things are BETTER on the real test case (transcode): this is because
> transcode usually gets priority 16 and "dd" gets 15... so dd is quite
> happy.

This seems a reasonable compromise. In your "test app" case you are using 
quirky code to reproduce the worst case scenario. Given that with your quirky 
setup you are using 3 cpu hogs (effectively) then slowing down dd from 8s to 
20s seems an appropriate slowdown (as opposed to the many minutes you were 
getting previously).

See my followup patches that I have posted following "[PATCH 0/5] sched - 
interactivity updates". The first 3 patches are what you tested. These 
patches are being put up for testing hopefully in -mm.

> BUT what is STRANGE is this: usually transcode is stuck to priority 16
> using about 88% of the CPU, but sometimes (don't know how to reproduce
> it) its priority grows up to 25 and then stay to 25.
>
> When transcode priority is 25 the DD test is obviously happy: in
> particular 2 things can happen (this is expected because I've observed
> this thing before):
>
> 1) priority of transcode stay to 25 (when the file transcode is
> reading from, through pipes, IS cached).
>
> 2) CPU usage and priority of transcode go down (the file transcode is
> reading from ISN'T cached and DD massive disk usage interferes with
> this reading). When DD finish trancode priority go back to 25.

I suspect this is entirely dependent on the balance between time spent reading 
on disk, waiting on pipe and so on.

Thanks for your test case and testing!

Cheers,
Con

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-13 10:51           ` Con Kolivas
@ 2006-01-13 13:01             ` Mike Galbraith
  2006-01-13 14:34               ` Con Kolivas
  0 siblings, 1 reply; 58+ messages in thread
From: Mike Galbraith @ 2006-01-13 13:01 UTC (permalink / raw)
  To: Con Kolivas, Paolo Ornati
  Cc: Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams

At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
>On Friday 13 January 2006 21:46, Paolo Ornati wrote:
> > On Fri, 13 Jan 2006 12:13:11 +1100
> >
> > Con Kolivas <kernel@kolivas.org> wrote:
> > > Can you try the following patch on 2.6.15 please? I'm interested in how
> > > adversely this affects interactive performance as well as whether it
> > > helps your test case.
> >
> > "./a.out 5000 & ./a.out 5237 & ./a.out 5331 &"
> > "mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null
> > bs=1M count=256; umount space/"
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >  5445 paolo     16   0  2396  288  228 R 34.8  0.1   0:05.84 a.out
> >  5446 paolo     15   0  2396  288  228 S 32.8  0.1   0:05.53 a.out
> >  5444 paolo     16   0  2392  288  228 R 31.3  0.1   0:05.99 a.out
> >  5443 paolo     16   0 10416 1104  848 R  0.2  0.2   0:00.01 top
> >  5451 paolo     15   0  4948 1468  372 D  0.2  0.3   0:00.01 dd
> >
> > DD test takes ~20 s (instead of 8s).
> >
> > As you can see DD priority is now very good (15) but it still suffers
> > because also my test programs get good priority (15/16).
> >
> >
> > Things are BETTER on the real test case (transcode): this is because
> > transcode usually gets priority 16 and "dd" gets 15... so dd is quite
> > happy.
>
>This seems a reasonable compromise. In your "test app" case you are using
>quirky code to reproduce the worst case scenario. Given that with your quirky
>setup you are using 3 cpu hogs (effectively) then slowing down dd from 8s to
>20s seems an appropriate slowdown (as opposed to the many minutes you were
>getting previously).

I'm sorry, but I heartily disagree.  It's not a quirky setup, it's just 
code that exposes a weakness just like thud, starve, irman, 
irman2.  Selectively bumping dd up into the upper tier won't do the other 
things trivially starved to death one bit of good.  On a more positive 
note, I agree that dd should not be punished for waiting on disk.

>See my followup patches that I have posted following "[PATCH 0/5] sched -
>interactivity updates". The first 3 patches are what you tested. These
>patches are being put up for testing hopefully in -mm.

Then the (buggy) version of my simple throttling patch will need to come 
out.  (which is OK, I have a debugged potent++ version)

> > BUT what is STRANGE is this: usually transcode is stuck to priority 16
> > using about 88% of the CPU, but sometimes (don't know how to reproduce
> > it) its priority grows up to 25 and then stay to 25.
> >
> > When transcode priority is 25 the DD test is obviously happy: in
> > particular 2 things can happen (this is expected because I've observed
> > this thing before):
> >
> > 1) priority of transcode stay to 25 (when the file transcode is
> > reading from, through pipes, IS cached).
> >
> > 2) CPU usage and priority of transcode go down (the file transcode is
> > reading from ISN'T cached and DD massive disk usage interferes with
> > this reading). When DD finish trancode priority go back to 25.
>
>I suspect this is entirely dependent on the balance between time spent 
>reading
>on disk, waiting on pipe and so on.

Grumble.  Pipe sleep.  That's another pet peeve of mine.  Sleep is sleep 
whether it's spelled interruptible_pipe or uninterruptible_semaphore.

         -Mike 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-13 13:01             ` Mike Galbraith
@ 2006-01-13 14:34               ` Con Kolivas
  2006-01-13 16:15                 ` Mike Galbraith
  0 siblings, 1 reply; 58+ messages in thread
From: Con Kolivas @ 2006-01-13 14:34 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar,
	Nick Piggin, Peter Williams

On Saturday 14 January 2006 00:01, Mike Galbraith wrote:
> At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
> >See my followup patches that I have posted following "[PATCH 0/5] sched -
> >interactivity updates". The first 3 patches are what you tested. These
> >patches are being put up for testing hopefully in -mm.
>
> Then the (buggy) version of my simple throttling patch will need to come
> out.  (which is OK, I have a debugged potent++ version)

Your code need not be mutually exclusive with mine. I've simply damped the 
current behaviour. Your sanity throttling is a good idea.

Con

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-13 14:34               ` Con Kolivas
@ 2006-01-13 16:15                 ` Mike Galbraith
  2006-01-14  2:05                   ` Con Kolivas
  0 siblings, 1 reply; 58+ messages in thread
From: Mike Galbraith @ 2006-01-13 16:15 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar,
	Nick Piggin, Peter Williams

At 01:34 AM 1/14/2006 +1100, Con Kolivas wrote:
>On Saturday 14 January 2006 00:01, Mike Galbraith wrote:
> > At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
> > >See my followup patches that I have posted following "[PATCH 0/5] sched -
> > >interactivity updates". The first 3 patches are what you tested. These
> > >patches are being put up for testing hopefully in -mm.
> >
> > Then the (buggy) version of my simple throttling patch will need to come
> > out.  (which is OK, I have a debugged potent++ version)
>
>Your code need not be mutually exclusive with mine. I've simply damped the
>current behaviour. Your sanity throttling is a good idea.

I didn't mean to imply that they're mutually exclusive, and after doing 
some testing, I concluded that it (or something like it) is definitely 
still needed.  The version that's in mm2 _is_ buggy however, so ripping it 
back out wouldn't hurt my delicate little feelings one bit.  In fact, it 
would give me some more time to instrument and test integration with your 
changes.  (Which I think are good btw because they remove what I considered 
to be warts; the pipe and uninterruptible sleep barriers.  Um... try irman2 
now... pure evilness)

         -Mike 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-13 16:15                 ` Mike Galbraith
@ 2006-01-14  2:05                   ` Con Kolivas
  2006-01-14  2:56                     ` Mike Galbraith
  0 siblings, 1 reply; 58+ messages in thread
From: Con Kolivas @ 2006-01-14  2:05 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar,
	Nick Piggin, Peter Williams, Andrew Morton

On Saturday 14 January 2006 03:15, Mike Galbraith wrote:
> At 01:34 AM 1/14/2006 +1100, Con Kolivas wrote:
> >On Saturday 14 January 2006 00:01, Mike Galbraith wrote:
> > > At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
> > > >See my followup patches that I have posted following "[PATCH 0/5]
> > > > sched - interactivity updates". The first 3 patches are what you
> > > > tested. These patches are being put up for testing hopefully in -mm.
> > >
> > > Then the (buggy) version of my simple throttling patch will need to
> > > come out.  (which is OK, I have a debugged potent++ version)
> >
> >Your code need not be mutually exclusive with mine. I've simply damped the
> >current behaviour. Your sanity throttling is a good idea.
>
> I didn't mean to imply that they're mutually exclusive, and after doing
> some testing, I concluded that it (or something like it) is definitely
> still needed.  The version that's in mm2 _is_ buggy however, so ripping it
> back out wouldn't hurt my delicate little feelings one bit.  In fact, it
> would give me some more time to instrument and test integration with your
> changes.  

Ok I've communicated this to Andrew (cc'ed here too) so he should remove your 
patch pending a new version from you.

> (Which I think are good btw because they remove what I considered 
> to be warts; the pipe and uninterruptible sleep barriers.  

Yes I felt your abuse wrt to these in an earlier email...

> Um... try irman2 
> now... pure evilness)

Hrm I've been using staircase which is immune for so long I'd all but 
forgotten about this test case. Looking at your code I assume your changes 
should help this?

Con

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-14  2:05                   ` Con Kolivas
@ 2006-01-14  2:56                     ` Mike Galbraith
  0 siblings, 0 replies; 58+ messages in thread
From: Mike Galbraith @ 2006-01-14  2:56 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar,
	Nick Piggin, Peter Williams, Andrew Morton

At 01:05 PM 1/14/2006 +1100, Con Kolivas wrote:
>On Saturday 14 January 2006 03:15, Mike Galbraith wrote:
>
> > Um... try irman2
> > now... pure evilness)
>
>Hrm I've been using staircase which is immune for so long I'd all but
>forgotten about this test case. Looking at your code I assume your changes
>should help this?

Yes.   How much very much depends on how strictly I try to enforce.  In my 
experimental tree, I have four stages of throttling: 1 threshold to begin 
trying to consume the difference between measured slice_avg and sleep_avg 
(kidd gloves), 2 to begin treating all new sleep as noninteractive (stern 
talking to), 3 to cut off new sleep entirely (you're grounded), and 4 is 
when to start using slice_avg instead of the out of balance sleep_avg for 
the priority calculation (um, bitch-slap?).  Levels 1 and 2 won't stop 
irman2, 3 will, and especially 4 will.

These are all /proc settings at the moment, so I can set set my starvation 
pain threshold from super duper desktop (all off) to just as fair as a 
running slice completion time average can possibly make it (all at 1ns 
differential), and anywhere in between.

         -Mike 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-27 23:18   ` Con Kolivas
  2006-01-28  0:01     ` Peter Williams
@ 2006-01-28  3:43     ` MIke Galbraith
  1 sibling, 0 replies; 58+ messages in thread
From: MIke Galbraith @ 2006-01-28  3:43 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar,
	Nick Piggin, Andrew Morton

On Sat, 2006-01-28 at 10:18 +1100, Con Kolivas wrote:
> On Saturday 28 January 2006 07:06, MIke Galbraith wrote:
> > 
> > The strategy works well enough to take the wind out of irman2's sails,
> > and interactive tasks can still do a nice reasonable burst of activity
> > without being evicted.  Down side to starvation control is that X is
> > sometimes a serious cpu user, and _can_ end up in the expired array (not
> > nice under load).  I personally don't think that's a show stopper
> > though... all you have to do is tell the scheduler that what it already
> > noticed, that X is a piggy, but an OK piggy by renicing it. It becomes
> > immune from active throttling, and all is well.  I know that's not going
> > to be popular, but you just can't let X have free rein without leaving
> > the barn door wide open.  (maybe that switch should stay since the
> > majority of boxen are workstations, and default to off?).
> 
> Sounds good but I have to disagree on the X renice thing. It's not that I have 
> a religious objection to renicing things. The problem is that our mainline 
> scheduler determines latency also by nice level. This means that if you 
> -renice a bursty cpu hog like X, then audio applications will fail unless 
> they too are reniced. Try it on your patch.

True.  If I renice, X can/will stomp all over xmms.  However, I can use
up to 30% cpu sustained before I would need to renice, so in the general
case, this patch actually improves latency because it will pull X out of
prio 16 much much sooner than before.  In this regard, it makes sense to
increase the scope, and throttle kernel threads as well.  What really
_should_ happen from the latency perspective is that as soon as we
detect a prio boundary crossing, we should downgrade the task a level
immediately, and then wait to see how things develope.  That would allow
the truely low latency tasks to preempt. When I compile a kernel for my
PIII/500 over nfs with my P4, knfsd eats about 40% of the PIII's cpu,
and all of it is there along side xmms.

What would be nice for the X case is a scheduling class that only meant
I'm always in the active array.  Open multimedia hardware, and you stay
active, but can be throttled under load without turning the box into a
brick.

Oh, while I'm thinking of it, if this thing makes it past discussion and
evaluation, I think my definition of idle task boost should be used
whether the active measures are applied or not.  The comments on that
code say idle tasks will only be promoted into the very bottom of the
interactive priority range via one long sleep. INTERACTIVE_SLEEP_AVG()
does that, and it helps.  That's not enough however.  What allows me to
defeat [1] irman2 isn't scrubbing to perfectly match how much it sleeps,
it is designed to sleep long between burns and hand over the baton to
the next partner in crime, so if I measured slice_avg purely from issue
time, slice_avg would match sleep_avg, and it would still be deadly.
Detecting the beginning of a new slice and moving the stamp forward to
match beginning of execution is what defeats it... the time he spends
trading the cpu with his evil twins doesn't count.


	-Mike

1.  Defeat is actually too strong a word, it is still one hell of an
unpleasant cpu hog in it's total... as any threaded app may be.



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-27 23:18   ` Con Kolivas
@ 2006-01-28  0:01     ` Peter Williams
  2006-01-28  3:43     ` MIke Galbraith
  1 sibling, 0 replies; 58+ messages in thread
From: Peter Williams @ 2006-01-28  0:01 UTC (permalink / raw)
  To: Con Kolivas
  Cc: MIke Galbraith, Paolo Ornati, Linux Kernel Mailing List,
	Ingo Molnar, Nick Piggin, Andrew Morton

Con Kolivas wrote:
> On Saturday 28 January 2006 07:06, MIke Galbraith wrote:
> 
>>What do you think of the below as an evaluation patch?  It leaves the
>>bits I'd really like to change (INTERACTIVE_SLEEP() for one), so it can
>>be switched on and off for easy comparison and regression testing.
>>
>>I really didn't want to add more to the task struct, but after trying
>>different things, a timeout was the most effective means of keeping the
>>nice burst aspect of the interactivity logic but still make sure that a
>>burst doesn't turn into starvation.
>>
>>The workings are dirt simple just as before.  The goal is to keep
>>sleep_avg and slice_avg balanced.  When an imbalance starts, immediately
>>cut off interactive bonus points.  If the imbalance doesn't correct
>>itself through normal sleep_avg usage, we'll soon hit the (1 dynamic
>>prio) trigger point, which starts a countdown toward active
>>intervention.  The default setting is that a task can run at higher
>>dynamic priority than it's cpu usage can justify for 5 seconds.  After
>>than, we start trying to work off the deficit, and if we don't succeeded
>>within another second (ie it was a big deficit), we demote the offender
>>to the rank his cpu usage indicates.
>>
>>The strategy works well enough to take the wind out of irman2's sails,
>>and interactive tasks can still do a nice reasonable burst of activity
>>without being evicted.  Down side to starvation control is that X is
>>sometimes a serious cpu user, and _can_ end up in the expired array (not
>>nice under load).  I personally don't think that's a show stopper
>>though... all you have to do is tell the scheduler that what it already
>>noticed, that X is a piggy, but an OK piggy by renicing it. It becomes
>>immune from active throttling, and all is well.  I know that's not going
>>to be popular, but you just can't let X have free rein without leaving
>>the barn door wide open.  (maybe that switch should stay since the
>>majority of boxen are workstations, and default to off?).
> 
> 
> Sounds good but I have to disagree on the X renice thing. It's not that I have 
> a religious objection to renicing things. The problem is that our mainline 
> scheduler determines latency also by nice level. This means that if you 
> -renice a bursty cpu hog like X, then audio applications will fail unless 
> they too are reniced. Try it on your patch.

On the other hand, if X were reniced it would be possible to make the 
interactive bonus mechanism more strict on what it considers to be 
interactive.  As I see it the main reason that non interactive sleeps 
get any consideration at all in determining whether a task is 
interactive is to make sure that X is classified interactive.

One of the things that I've noticed about the X server (with 
instrumented systems) is the its proportion of interruptible sleeps to 
total sleeps goes down when it becomes CPU intensive.  This is because 
the high CPU usage is usually related to screen updates (e.g. try "ls -R 
/" in an xterm or similar) and not keyboard or mouse input.  This means 
that during these periods the X server would lose most of its 
interactive bonus and just rely on its niceness allowing genuine 
interactive tasks to pip it (provided that X's nice value is 
sufficiently small e.g. 5).

Of course, for this to work all of the instances where interruptible 
sleeps aren't interactive would need to be labelled TASK_NONINTERACTIVE 
which would be a big job.

Anyway just my 2c worth.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-27 20:06 ` MIke Galbraith
@ 2006-01-27 23:18   ` Con Kolivas
  2006-01-28  0:01     ` Peter Williams
  2006-01-28  3:43     ` MIke Galbraith
  0 siblings, 2 replies; 58+ messages in thread
From: Con Kolivas @ 2006-01-27 23:18 UTC (permalink / raw)
  To: MIke Galbraith
  Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar,
	Nick Piggin, Andrew Morton

On Saturday 28 January 2006 07:06, MIke Galbraith wrote:
> What do you think of the below as an evaluation patch?  It leaves the
> bits I'd really like to change (INTERACTIVE_SLEEP() for one), so it can
> be switched on and off for easy comparison and regression testing.
>
> I really didn't want to add more to the task struct, but after trying
> different things, a timeout was the most effective means of keeping the
> nice burst aspect of the interactivity logic but still make sure that a
> burst doesn't turn into starvation.
>
> The workings are dirt simple just as before.  The goal is to keep
> sleep_avg and slice_avg balanced.  When an imbalance starts, immediately
> cut off interactive bonus points.  If the imbalance doesn't correct
> itself through normal sleep_avg usage, we'll soon hit the (1 dynamic
> prio) trigger point, which starts a countdown toward active
> intervention.  The default setting is that a task can run at higher
> dynamic priority than it's cpu usage can justify for 5 seconds.  After
> than, we start trying to work off the deficit, and if we don't succeeded
> within another second (ie it was a big deficit), we demote the offender
> to the rank his cpu usage indicates.
>
> The strategy works well enough to take the wind out of irman2's sails,
> and interactive tasks can still do a nice reasonable burst of activity
> without being evicted.  Down side to starvation control is that X is
> sometimes a serious cpu user, and _can_ end up in the expired array (not
> nice under load).  I personally don't think that's a show stopper
> though... all you have to do is tell the scheduler that what it already
> noticed, that X is a piggy, but an OK piggy by renicing it. It becomes
> immune from active throttling, and all is well.  I know that's not going
> to be popular, but you just can't let X have free rein without leaving
> the barn door wide open.  (maybe that switch should stay since the
> majority of boxen are workstations, and default to off?).

Sounds good but I have to disagree on the X renice thing. It's not that I have 
a religious objection to renicing things. The problem is that our mainline 
scheduler determines latency also by nice level. This means that if you 
-renice a bursty cpu hog like X, then audio applications will fail unless 
they too are reniced. Try it on your patch.

Con

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
  2006-01-27 16:57 [SCHED] wrong priority calc - SIMPLE test case Con Kolivas
@ 2006-01-27 20:06 ` MIke Galbraith
  2006-01-27 23:18   ` Con Kolivas
  0 siblings, 1 reply; 58+ messages in thread
From: MIke Galbraith @ 2006-01-27 20:06 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar,
	Nick Piggin, Andrew Morton

On Fri, 2006-01-27 at 17:57 +0100, Con Kolivas wrote:
> On Saturday 14 January 2006 03:15, Mike Galbraith wrote:
>  > At 01:34 AM 1/14/2006 +1100, Con Kolivas wrote:
>  > >On Saturday 14 January 2006 00:01, Mike Galbraith wrote:
>  > > > At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
>  > > > >See my followup patches that I have posted following "[PATCH 0/5]
>  > > > > sched - interactivity updates". The first 3 patches are what you
>  > > > > tested. These patches are being put up for testing hopefully in -mm.
>  > > >
>  > > > Then the (buggy) version of my simple throttling patch will need to
>  > > > come out.  (which is OK, I have a debugged potent++ version)
>  > >
>  > >Your code need not be mutually exclusive with mine. I've simply damped the
>  > >current behaviour. Your sanity throttling is a good idea.
>  >
>  > I didn't mean to imply that they're mutually exclusive, and after doing
>  > some testing, I concluded that it (or something like it) is definitely
>  > still needed.  The version that's in mm2 _is_ buggy however, so ripping it
>  > back out wouldn't hurt my delicate little feelings one bit.  In fact, it
>  > would give me some more time to instrument and test integration with your
>  > changes.
> 
> Ok I've communicated this to Andrew (cc'ed here too) so he should remove your
> patch pending a new version from you.

(Ok, it took a while what with setting up a new test box, testing this
and that, RL etc etc.)

What do you think of the below as an evaluation patch?  It leaves the
bits I'd really like to change (INTERACTIVE_SLEEP() for one), so it can
be switched on and off for easy comparison and regression testing.

I really didn't want to add more to the task struct, but after trying
different things, a timeout was the most effective means of keeping the
nice burst aspect of the interactivity logic but still make sure that a
burst doesn't turn into starvation.

The workings are dirt simple just as before.  The goal is to keep
sleep_avg and slice_avg balanced.  When an imbalance starts, immediately
cut off interactive bonus points.  If the imbalance doesn't correct
itself through normal sleep_avg usage, we'll soon hit the (1 dynamic
prio) trigger point, which starts a countdown toward active
intervention.  The default setting is that a task can run at higher
dynamic priority than it's cpu usage can justify for 5 seconds.  After
than, we start trying to work off the deficit, and if we don't succeeded
within another second (ie it was a big deficit), we demote the offender
to the rank his cpu usage indicates.

The strategy works well enough to take the wind out of irman2's sails,
and interactive tasks can still do a nice reasonable burst of activity
without being evicted.  Down side to starvation control is that X is
sometimes a serious cpu user, and _can_ end up in the expired array (not
nice under load).  I personally don't think that's a show stopper
though... all you have to do is tell the scheduler that what it already
noticed, that X is a piggy, but an OK piggy by renicing it. It becomes
immune from active throttling, and all is well.  I know that's not going
to be popular, but you just can't let X have free rein without leaving
the barn door wide open.  (maybe that switch should stay since the
majority of boxen are workstations, and default to off?).

'nuff words, patch follows.

	-Mike

--- 2.6.16-rc1-mm3/include/linux/sched.h.org	2006-01-27 11:28:20.000000000 +0100
+++ 2.6.16-rc1-mm3/include/linux/sched.h	2006-01-27 11:54:48.000000000 +0100
@@ -722,6 +722,7 @@
 	unsigned short ioprio;
 
 	unsigned long sleep_avg;
+	unsigned long slice_avg, last_slice, throttle_stamp;
 	unsigned long long timestamp, last_ran;
 	unsigned long long sched_time; /* sched_clock time spent running */
 	enum sleep_type sleep_type;
--- 2.6.16-rc1-mm3/include/linux/sysctl.h.org	2006-01-27 11:28:20.000000000 +0100
+++ 2.6.16-rc1-mm3/include/linux/sysctl.h	2006-01-27 11:57:13.000000000 +0100
@@ -147,6 +147,8 @@
 	KERN_SETUID_DUMPABLE=69, /* int: behaviour of dumps for setuid core */
 	KERN_SPIN_RETRY=70,	/* int: number of spinlock retries */
 	KERN_ACPI_VIDEO_FLAGS=71, /* int: flags for setting up video after ACPI sleep */
+	KERN_SCHED_THROTTLE1=72,  /* int: sleep_avg throttling enabled */
+	KERN_SCHED_THROTTLE2=73,  /* int: throttling grace period in secs */
 };
 
 
--- 2.6.16-rc1-mm3/kernel/sched.c.org	2006-01-27 11:28:20.000000000 +0100
+++ 2.6.16-rc1-mm3/kernel/sched.c	2006-01-27 12:08:59.000000000 +0100
@@ -138,6 +138,55 @@
 	(NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
 		MAX_SLEEP_AVG)
 
+/*
+ * Interactivity boost can lead to serious starvation problems if the
+ * task being boosted turns out to be a cpu hog.  To combat this, we
+ * compute a running slice_avg, which is the sane upper limit for the
+ * task's sleep_avg.  If an 'interactive' task begins burning cpu, it's
+ * slice_avg will decay, making it visible as a problem so corrective
+ * measures can be applied.  RT tasks and kernel threads are exempt from
+ * throttling - reniced tasks are only subject to mild throttling.
+ *
+ * /proc/sys/kernel tunables.
+ *
+ * sched_throttle: enable throttling logic.
+ * sched_throttle_secs: throttling grace period in seconds.
+ */
+
+int throttle = 1;
+int throttle_secs = 5;
+int throttle_secs_max = 10;
+
+#define THROTTLE_BASE \
+	(10 * (NS_MAX_SLEEP_AVG / 100))
+
+/* Disable interactive task bias logic. */
+#define THROTTLE_BIAS(p) \
+	(throttle && (p)->mm && (p)->sleep_avg > (p)->slice_avg)
+
+/* Begin adjusting sleep_avg consumption to match slice_avg. */
+#define THROTTLE_SOFT(p) \
+	((p)->throttle_stamp && time_after(jiffies, (p)->throttle_stamp) && \
+	(p)->sleep_avg >= (p)->slice_avg + THROTTLE_BASE)
+
+/* Begin adjusting priority to match slice_avg. */
+#define THROTTLE_HARD(p) \
+	((p)->throttle_stamp && time_after(jiffies, (p)->throttle_stamp + HZ) && \
+	(p)->sleep_avg >= (p)->slice_avg + THROTTLE_BASE)
+
+#define THROTTLE_ENGAGE(p) (throttle && !(p)->throttle_stamp && \
+	(p)->mm && PRIO_TO_NICE((p)->static_prio) >= 0 && \
+	(p)->sleep_avg >= (p)->slice_avg + THROTTLE_BASE)
+
+#define THROTTLE_DISENGAGE(p) ((p)->throttle_stamp && \
+	(PRIO_TO_NICE((p)->static_prio) < 0 || !THROTTLE_BIAS((p)) || \
+	time_after(jiffies, (p)->last_slice + (throttle_secs * HZ))))
+
+#define CURRENT_COST(p) \
+	(NS_TO_JIFFIES(THROTTLE_SOFT((p)) ? \
+	min((p)->sleep_avg, (p)->slice_avg) : NS_MAX_SLEEP_AVG) * \
+	MAX_BONUS / MAX_SLEEP_AVG)
+
 #define GRANULARITY	(10 * HZ / 1000 ? : 1)
 
 #ifdef CONFIG_SMP
@@ -162,6 +211,10 @@
 	(JIFFIES_TO_NS(MAX_SLEEP_AVG * \
 		(MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
 
+#define INTERACTIVE_SLEEP_AVG(p) \
+	(JIFFIES_TO_NS(MAX_SLEEP_AVG * \
+		(MAX_BONUS / 2 + DELTA((p))) / MAX_BONUS))
+
 #define TASK_PREEMPTS_CURR(p, rq) \
 	((p)->prio < (rq)->curr->prio)
 
@@ -669,6 +722,8 @@
 		return p->prio;
 
 	bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
+	if (unlikely(THROTTLE_HARD(p)))
+		bonus = CURRENT_COST(p) - MAX_BONUS / 2;
 
 	prio = p->static_prio - bonus;
 	if (prio < MAX_RT_PRIO)
@@ -803,8 +858,15 @@
 		if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
 				unsigned long ceiling;
 
-				ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
-					DEF_TIMESLICE);
+				if (!throttle)
+					ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
+						DEF_TIMESLICE);
+				else {
+					ceiling = INTERACTIVE_SLEEP_AVG(p);
+					ceiling += JIFFIES_TO_NS(DEF_TIMESLICE);
+					if (p->slice_avg < ceiling)
+						ceiling = p->slice_avg;
+				}
 				if (p->sleep_avg < ceiling)
 					p->sleep_avg = ceiling;
 		} else {
@@ -1367,7 +1429,10 @@
 
 out_activate:
 #endif /* CONFIG_SMP */
-	if (old_state == TASK_UNINTERRUPTIBLE) {
+	if (unlikely(THROTTLE_DISENGAGE(p)))
+		p->throttle_stamp = 0;
+
+	if (old_state & TASK_UNINTERRUPTIBLE) {
 		rq->nr_uninterruptible--;
 		/*
 		 * Tasks waking from uninterruptible sleep are likely
@@ -1382,7 +1447,7 @@
 	 * woken up with their sleep average not weighted in an
 	 * interactive way.
 	 */
-		if (old_state & TASK_NONINTERACTIVE)
+		if (old_state & TASK_NONINTERACTIVE || THROTTLE_BIAS(p))
 			p->sleep_type = SLEEP_NONINTERACTIVE;
 
 
@@ -1471,6 +1536,7 @@
 	p->first_time_slice = 1;
 	current->time_slice >>= 1;
 	p->timestamp = sched_clock();
+	p->throttle_stamp = 0;
 	if (unlikely(!current->time_slice)) {
 		/*
 		 * This case is rare, it happens when the parent has only
@@ -1510,6 +1576,9 @@
 	 */
 	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
 		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
+	p->slice_avg = NS_MAX_SLEEP_AVG;
+	p->last_slice = jiffies;
+	p->throttle_stamp = 0;
 
 	p->prio = effective_prio(p);
 
@@ -1584,7 +1653,7 @@
 	 * the sleep_avg of the parent as well.
 	 */
 	rq = task_rq_lock(p->parent, &flags);
-	if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) {
+	if ((int)p->first_time_slice > 0  && task_cpu(p) == task_cpu(p->parent)) {
 		p->parent->time_slice += p->time_slice;
 		if (unlikely(p->parent->time_slice > task_timeslice(p)))
 			p->parent->time_slice = task_timeslice(p);
@@ -2665,6 +2734,37 @@
 }
 
 /*
+ * Calculate a task's average slice usage rate in terms of sleep_avg.
+ * @p: task for which slice_avg should be computed.
+ * @time_slice: task's current timeslice.
+ */
+static unsigned long task_slice_avg(task_t *p, int time_slice)
+{
+	unsigned long slice_avg = p->slice_avg;
+	int idle, ticks = jiffies - p->last_slice;
+	int w = HZ / DEF_TIMESLICE;
+
+	if (ticks < time_slice)
+		ticks = time_slice;
+	idle = 100 - (100 * time_slice / ticks);
+	slice_avg /= NS_MAX_SLEEP_AVG / 100;
+
+	/*
+	 * If the task is lowering it's cpu usage, speed up the
+	 * effect on slice_avg so we don't over-throttle.
+	 */
+	if (idle > slice_avg + 10) {
+		w -= idle / 10;
+		if (!w)
+			w = 1;
+	}
+	slice_avg = (w * (slice_avg ? : 1) + idle) / (w + 1);
+	slice_avg *= NS_MAX_SLEEP_AVG / 100;
+
+	return slice_avg;
+}
+
+/*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
  *
@@ -2710,19 +2810,33 @@
 		if ((p->policy == SCHED_RR) && !--p->time_slice) {
 			p->time_slice = task_timeslice(p);
 			p->first_time_slice = 0;
+			p->slice_avg = task_slice_avg(p, p->time_slice);
+			p->last_slice = jiffies;
 			set_tsk_need_resched(p);
 
 			/* put it at the end of the queue: */
 			requeue_task(p, rq->active);
 		}
+		if (unlikely(p->throttle_stamp))
+			p->throttle_stamp = 0;
 		goto out_unlock;
 	}
 	if (!--p->time_slice) {
 		dequeue_task(p, rq->active);
 		set_tsk_need_resched(p);
-		p->prio = effective_prio(p);
 		p->time_slice = task_timeslice(p);
+		p->slice_avg = task_slice_avg(p, p->time_slice);
+		p->prio = effective_prio(p);
 		p->first_time_slice = 0;
+		p->last_slice = jiffies;
+
+		/*
+		 * Check to see if we should start/stop throttling;
+		 */
+		if (THROTTLE_ENGAGE(p))
+			p->throttle_stamp = jiffies + (throttle_secs * HZ);
+		else if (THROTTLE_DISENGAGE(p))
+			p->throttle_stamp = 0;
 
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
@@ -3033,7 +3147,7 @@
 	 * Tasks charged proportionately less run_time at high sleep_avg to
 	 * delay them losing their interactive status
 	 */
-	run_time /= (CURRENT_BONUS(prev) ? : 1);
+	run_time /= (CURRENT_COST(prev) ? : 1);
 
 	spin_lock_irq(&rq->lock);
 
@@ -3047,7 +3161,7 @@
 				unlikely(signal_pending(prev))))
 			prev->state = TASK_RUNNING;
 		else {
-			if (prev->state == TASK_UNINTERRUPTIBLE)
+			if (prev->state & TASK_UNINTERRUPTIBLE)
 				rq->nr_uninterruptible++;
 			deactivate_task(prev, rq);
 		}
@@ -3134,6 +3248,14 @@
 		prev->sleep_avg = 0;
 	prev->timestamp = prev->last_ran = now;
 
+	/*
+	 * Mark the beginning of a new slice in this mostly unused space.
+	 */
+	if (unlikely(!next->first_time_slice)) {
+		next->first_time_slice = ~0U;
+		next->last_slice = jiffies;
+	}
+
 	sched_info_switch(prev, next);
 	if (likely(prev != next)) {
 		next->timestamp = now;
--- 2.6.16-rc1-mm3/kernel/sysctl.c.org	2006-01-27 11:28:20.000000000 +0100
+++ 2.6.16-rc1-mm3/kernel/sysctl.c	2006-01-27 11:58:30.000000000 +0100
@@ -69,6 +69,9 @@
 extern int pid_max_min, pid_max_max;
 extern int sysctl_drop_caches;
 extern int percpu_pagelist_fraction;
+extern int throttle;
+extern int throttle_secs;
+extern int throttle_secs_max;
 
 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
 int unknown_nmi_panic;
@@ -224,6 +227,12 @@
 	{ .ctl_name = 0 }
 };
 
+/* Constants for minimum and maximum testing in vm_table and
+ * kern_table.  We use these as one-element integer vectors. */
+static int zero;
+static int one = 1;
+static int one_hundred = 100;
+
 static ctl_table kern_table[] = {
 	{
 		.ctl_name	= KERN_OSTYPE,
@@ -666,15 +675,31 @@
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+	{
+		.ctl_name	= KERN_SCHED_THROTTLE1,
+		.procname	= "sched_throttle",
+		.data		= &throttle,
+		.maxlen		= sizeof (int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.ctl_name	= KERN_SCHED_THROTTLE2,
+		.procname	= "sched_throttle_secs",
+		.data		= &throttle_secs,
+		.maxlen		= sizeof (int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &throttle_secs_max,
+	},
 	{ .ctl_name = 0 }
 };
 
-/* Constants for minimum and maximum testing in vm_table.
-   We use these as one-element integer vectors. */
-static int zero;
-static int one_hundred = 100;
-
-
 static ctl_table vm_table[] = {
 	{
 		.ctl_name	= VM_OVERCOMMIT_MEMORY,



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [SCHED] wrong priority calc - SIMPLE test case
@ 2006-01-27 16:57 Con Kolivas
  2006-01-27 20:06 ` MIke Galbraith
  0 siblings, 1 reply; 58+ messages in thread
From: Con Kolivas @ 2006-01-27 16:57 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar,
	Nick Piggin, Andrew Morton

On Saturday 14 January 2006 03:15, Mike Galbraith wrote:
 > At 01:34 AM 1/14/2006 +1100, Con Kolivas wrote:
 > >On Saturday 14 January 2006 00:01, Mike Galbraith wrote:
 > > > At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
 > > > >See my followup patches that I have posted following "[PATCH 0/5]
 > > > > sched - interactivity updates". The first 3 patches are what you
 > > > > tested. These patches are being put up for testing hopefully in -mm.
 > > >
 > > > Then the (buggy) version of my simple throttling patch will need to
 > > > come out.  (which is OK, I have a debugged potent++ version)
 > >
 > >Your code need not be mutually exclusive with mine. I've simply damped the
 > >current behaviour. Your sanity throttling is a good idea.
 >
 > I didn't mean to imply that they're mutually exclusive, and after doing
 > some testing, I concluded that it (or something like it) is definitely
 > still needed.  The version that's in mm2 _is_ buggy however, so ripping it
 > back out wouldn't hurt my delicate little feelings one bit.  In fact, it
 > would give me some more time to instrument and test integration with your
 > changes.

Ok I've communicated this to Andrew (cc'ed here too) so he should remove your
patch pending a new version from you.

 > (Which I think are good btw because they remove what I considered
 > to be warts; the pipe and uninterruptible sleep barriers.

Yes I felt your abuse wrt to these in an earlier email...

 > Um... try irman2
 > now... pure evilness)

Hrm I've been using staircase which is immune for so long I'd all but
forgotten about this test case. Looking at your code I assume your changes
should help this?

Con
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2006-01-28  3:42 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-12-27 18:09 [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) Paolo Ornati
2005-12-27 21:48 ` Paolo Ornati
2005-12-27 23:26   ` Con Kolivas
2005-12-28 11:01     ` Paolo Ornati
2005-12-28 11:19       ` Con Kolivas
2005-12-28 11:35         ` Paolo Ornati
2005-12-28 17:23           ` Paolo Ornati
2005-12-28 17:39             ` Paolo Ornati
2005-12-30 13:52     ` [SCHED] wrong priority calc - SIMPLE test case Paolo Ornati
2005-12-31  2:06       ` Peter Williams
2005-12-31 10:34         ` Paolo Ornati
2005-12-31 10:52           ` Paolo Ornati
2005-12-31 11:12             ` Con Kolivas
2005-12-31 13:44             ` Peter Williams
2005-12-31 16:31               ` Paolo Ornati
2005-12-31 22:04                 ` Peter Williams
2005-12-31  8:13       ` Mike Galbraith
2005-12-31 11:00         ` Paolo Ornati
2005-12-31 15:11         ` Paolo Ornati
2005-12-31 16:37           ` Mike Galbraith
2005-12-31 17:24             ` Paolo Ornati
2005-12-31 17:42               ` Paolo Ornati
2006-01-01 11:39             ` Paolo Ornati
2006-01-02  9:15               ` Mike Galbraith
2006-01-02  9:50                 ` Paolo Ornati
2006-01-09 11:11                 ` Mike Galbraith
2006-01-09 15:52                   ` Mike Galbraith
2006-01-09 16:08                     ` Con Kolivas
2006-01-09 18:14                       ` Mike Galbraith
2006-01-09 20:00                     ` Paolo Ornati
2006-01-09 20:23                       ` Paolo Ornati
2006-01-10  7:08                       ` Mike Galbraith
2006-01-10 12:07                         ` Mike Galbraith
2006-01-10 12:56                           ` Paolo Ornati
2006-01-10 13:01                             ` Mike Galbraith
2006-01-10 13:53                               ` Paolo Ornati
2006-01-10 15:18                                 ` Mike Galbraith
2006-01-13  1:13       ` Con Kolivas
2006-01-13  1:32         ` Con Kolivas
2006-01-13 10:46         ` Paolo Ornati
2006-01-13 10:51           ` Con Kolivas
2006-01-13 13:01             ` Mike Galbraith
2006-01-13 14:34               ` Con Kolivas
2006-01-13 16:15                 ` Mike Galbraith
2006-01-14  2:05                   ` Con Kolivas
2006-01-14  2:56                     ` Mike Galbraith
2005-12-27 23:59   ` [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) Peter Williams
2005-12-28 10:20     ` Paolo Ornati
2005-12-28 13:38       ` Peter Williams
2005-12-28 19:45         ` Paolo Ornati
2005-12-29  3:13         ` Nick Piggin
2005-12-29  3:35           ` Peter Williams
2005-12-29  8:11             ` Nick Piggin
2006-01-27 16:57 [SCHED] wrong priority calc - SIMPLE test case Con Kolivas
2006-01-27 20:06 ` MIke Galbraith
2006-01-27 23:18   ` Con Kolivas
2006-01-28  0:01     ` Peter Williams
2006-01-28  3:43     ` MIke Galbraith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).