* [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) @ 2005-12-27 18:09 Paolo Ornati 2005-12-27 21:48 ` Paolo Ornati 0 siblings, 1 reply; 53+ messages in thread From: Paolo Ornati @ 2005-12-27 18:09 UTC (permalink / raw) To: Linux Kernel Mailing List Hello, I've found an easy-to-reproduce-for-me test case that shows a totally wrong priority calculation: basically a CPU-intensitive process gets better priority than a disk-intensitive one (dd if=bigfile of=/dev/null ...). Seems impossible, isn't it? ---- THE NUMBERS with 2.6.15-rc7 ----- The test-case is the Xvid encoding of dvd-ripped track with transcode (using "dvd::rip" interface). The copied-and-pasted command line is this: mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob,null -i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 1 -y xvid4,null -o /dev/null --print_status 20 && echo DVDRIP_SUCCESS mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob -i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 2 -y xvid4 -o /home/paolo/tmp/test/avi/003/test-003.avi --print_status 20 && echo DVDRIP_SUCCESS Here there is a TOP snapshot while running it: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5721 paolo 16 0 115m 18m 2428 R 84.4 3.7 0:15.11 transcode 5736 paolo 25 0 50352 4516 1912 R 8.4 0.9 0:01.53 tcdecode 5725 paolo 15 0 115m 18m 2428 S 4.6 3.7 0:00.84 transcode 5738 paolo 18 0 115m 18m 2428 S 0.8 3.7 0:00.15 transcode 5734 paolo 25 0 20356 1140 920 S 0.6 0.2 0:00.12 tcdemux 5731 paolo 25 0 47312 2540 1996 R 0.4 0.5 0:00.08 tcdecode 5319 root 15 0 166m 16m 2584 S 0.2 3.2 0:25.06 X 5444 paolo 16 0 87116 22m 15m R 0.2 4.6 0:04.05 konsole 5716 paolo 16 0 10424 1160 876 R 0.2 0.2 0:00.06 top 5735 paolo 25 0 22364 1436 932 S 0.2 0.3 0:00.01 tcextract DD running alone: paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/ 128+0 records in 128+0 records out real 0m4.052s user 0m0.000s sys 0m0.209s DD while transcoding: paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/ 128+0 records in 128+0 records out real 0m26.121s user 0m0.001s sys 0m0.255s --------------------------------------- I've tried older kernels finding that 2.6.11 is the first affected. Going on with testing... 2.6.11-rc[1-5]: 2.6.11-rc3 bad 2.6.11-rc1 bad 2.6.10-bk[1-14] 2.6.10-bk7 good 2.6.10-bk11 good 2.6.10-bk13 bad 2.6.10-bk12 bad So the problem was introduced with: >> 2.6.10-bk12 09-Jan-2005 << The exact behaviour is different with 2.6.11/12/13/14... for example: with 2.6.11 the priority of "transcode" is initially set to ~25 and go down to 17/18 when running DD. The problem doesn't seem 100% reproducible with every kernel, sometimes a "BAD" kernel looks "GOOD"... or maybe it was me confused by too much compile/install/reboot/test work ;) Other INFO: - I'm on x86_64 - preemption ON/OFF doesn't make any differences Can anyone reproduce this? IOW: is this affecting only my machine? -- Paolo Ornati Linux 2.6.15-rc7-gf89f5948 on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-27 18:09 [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) Paolo Ornati @ 2005-12-27 21:48 ` Paolo Ornati 2005-12-27 23:26 ` Con Kolivas 2005-12-27 23:59 ` [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) Peter Williams 0 siblings, 2 replies; 53+ messages in thread From: Paolo Ornati @ 2005-12-27 21:48 UTC (permalink / raw) To: Paolo Ornati; +Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar On Tue, 27 Dec 2005 19:09:18 +0100 Paolo Ornati <ornati@fastwebnet.it> wrote: > Hello, > > I've found an easy-to-reproduce-for-me test case that shows a totally > wrong priority calculation: basically a CPU-intensitive process gets > better priority than a disk-intensitive one (dd if=bigfile > of=/dev/null ...). > > Seems impossible, isn't it? > > ---- THE NUMBERS with 2.6.15-rc7 ----- > > The test-case is the Xvid encoding of dvd-ripped track with transcode > (using "dvd::rip" interface). The copied-and-pasted command line is > this: > > mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' && > cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob,null > -i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972 > --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 1 -y xvid4,null > -o /dev/null --print_status 20 && echo DVDRIP_SUCCESS mkdir -m 0775 -p > '/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec > transcode -H 10 -a 2 -x vob -i /home/paolo/tmp/test/vob/003 -w 1198,50 > -b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 2 -y > xvid4 -o /home/paolo/tmp/test/avi/003/test-003.avi --print_status 20 && > echo DVDRIP_SUCCESS > > > Here there is a TOP snapshot while running it: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5721 paolo 16 0 115m 18m 2428 R 84.4 3.7 0:15.11 transcode > 5736 paolo 25 0 50352 4516 1912 R 8.4 0.9 0:01.53 tcdecode > 5725 paolo 15 0 115m 18m 2428 S 4.6 3.7 0:00.84 transcode > 5738 paolo 18 0 115m 18m 2428 S 0.8 3.7 0:00.15 transcode > 5734 paolo 25 0 20356 1140 920 S 0.6 0.2 0:00.12 tcdemux > 5731 paolo 25 0 47312 2540 1996 R 0.4 0.5 0:00.08 tcdecode > 5319 root 15 0 166m 16m 2584 S 0.2 3.2 0:25.06 X > 5444 paolo 16 0 87116 22m 15m R 0.2 4.6 0:04.05 konsole > 5716 paolo 16 0 10424 1160 876 R 0.2 0.2 0:00.06 top > 5735 paolo 25 0 22364 1436 932 S 0.2 0.3 0:00.01 tcextract > > > DD running alone: > > paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/ > 128+0 records in > 128+0 records out > > real 0m4.052s > user 0m0.000s > sys 0m0.209s > > DD while transcoding: > > paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/ > 128+0 records in > 128+0 records out > > real 0m26.121s > user 0m0.001s > sys 0m0.255s > > --------------------------------------- > > I've tried older kernels finding that 2.6.11 is the first affected. > > Going on with testing... > > 2.6.11-rc[1-5]: > 2.6.11-rc3 bad > 2.6.11-rc1 bad > > 2.6.10-bk[1-14] > 2.6.10-bk7 good > 2.6.10-bk11 good > 2.6.10-bk13 bad > 2.6.10-bk12 bad > > So the problem was introduced with: > >> 2.6.10-bk12 09-Jan-2005 << > > The exact behaviour is different with 2.6.11/12/13/14... for example: > with 2.6.11 the priority of "transcode" is initially set to ~25 and go > down to 17/18 when running DD. > > The problem doesn't seem 100% reproducible with every kernel, sometimes > a "BAD" kernel looks "GOOD"... or maybe it was me confused by too > much compile/install/reboot/test work ;) > > Other INFO: > - I'm on x86_64 > - preemption ON/OFF doesn't make any differences > > > Can anyone reproduce this? > IOW: is this affecting only my machine? > Hello Con and Ingo... I've found that the above problem goes away by reverting this: http://linux.bkbits.net:8080/linux-2.6/cset@41e054c6pwNQXzErMxvfh4IpLPXA5A?nav=index.html|src/|src/include|src/include/linux|related/include/linux/sched.h -------------------------------------------------- [PATCH] sched: remove_interactive_credit Special casing tasks by interactive credit was helpful for preventing fully cpu bound tasks from easily rising to interactive status. However it did not select out tasks that had periods of being fully cpu bound and then sleeping while waiting on pipes, signals etc. This led to a more disproportionate share of cpu time. Backing this out will no longer special case only fully cpu bound tasks, and prevents the variable behaviour that occurs at startup before tasks declare themseleves interactive or not, and speeds up application startup slightly under certain circumstances. It does cost in interactivity slightly as load rises but it is worth it for the fairness gains. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> -------------------------------------------------- Maybe this change has revealed a scheduler weakness ? I'm glad to test any patch or give more data :) Bye, -- Paolo Ornati Linux 2.6.10-bk12-int_credit on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-27 21:48 ` Paolo Ornati @ 2005-12-27 23:26 ` Con Kolivas 2005-12-28 11:01 ` Paolo Ornati 2005-12-30 13:52 ` [SCHED] wrong priority calc - SIMPLE test case Paolo Ornati 2005-12-27 23:59 ` [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) Peter Williams 1 sibling, 2 replies; 53+ messages in thread From: Con Kolivas @ 2005-12-27 23:26 UTC (permalink / raw) To: Paolo Ornati; +Cc: Linux Kernel Mailing List, Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 3885 bytes --] On Wednesday 28 December 2005 08:48, Paolo Ornati wrote: > On Tue, 27 Dec 2005 19:09:18 +0100 > > Paolo Ornati <ornati@fastwebnet.it> wrote: > > Hello, > > > > I've found an easy-to-reproduce-for-me test case that shows a totally > > wrong priority calculation: basically a CPU-intensitive process gets > > better priority than a disk-intensitive one (dd if=bigfile > > of=/dev/null ...). > > > > Seems impossible, isn't it? > > > > ---- THE NUMBERS with 2.6.15-rc7 ----- > > > > The test-case is the Xvid encoding of dvd-ripped track with transcode > > (using "dvd::rip" interface). The copied-and-pasted command line is > > this: > > > > mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' && > > cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob,null > > -i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972 > > --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 1 -y xvid4,null > > -o /dev/null --print_status 20 && echo DVDRIP_SUCCESS mkdir -m 0775 -p > > '/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec > > transcode -H 10 -a 2 -x vob -i /home/paolo/tmp/test/vob/003 -w 1198,50 > > -b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 2 -y > > xvid4 -o /home/paolo/tmp/test/avi/003/test-003.avi --print_status 20 && > > echo DVDRIP_SUCCESS > > > > > > Here there is a TOP snapshot while running it: > > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > > 5721 paolo 16 0 115m 18m 2428 R 84.4 3.7 0:15.11 transcode > > 5736 paolo 25 0 50352 4516 1912 R 8.4 0.9 0:01.53 tcdecode > > 5725 paolo 15 0 115m 18m 2428 S 4.6 3.7 0:00.84 transcode > > 5738 paolo 18 0 115m 18m 2428 S 0.8 3.7 0:00.15 transcode > > 5734 paolo 25 0 20356 1140 920 S 0.6 0.2 0:00.12 tcdemux > > 5731 paolo 25 0 47312 2540 1996 R 0.4 0.5 0:00.08 tcdecode > > 5319 root 15 0 166m 16m 2584 S 0.2 3.2 0:25.06 X > > 5444 paolo 16 0 87116 22m 15m R 0.2 4.6 0:04.05 konsole > > 5716 paolo 16 0 10424 1160 876 R 0.2 0.2 0:00.06 top > > 5735 paolo 25 0 22364 1436 932 S 0.2 0.3 0:00.01 tcextract > > > > > > DD running alone: > > > > paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null > > bs=1M count=128; umount space/ 128+0 records in > > 128+0 records out > > > > real 0m4.052s > > user 0m0.000s > > sys 0m0.209s > > > > DD while transcoding: > > > > paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null > > bs=1M count=128; umount space/ 128+0 records in > > 128+0 records out > > > > real 0m26.121s > > user 0m0.001s > > sys 0m0.255s Looking at your top output I see that transcode command generates 7 processes all likely to be using cpu, and your DD slowdown is almost 7 times... I assume it all goes away if you nice the transcode up by 3 or more. > Hello Con and Ingo... I've found that the above problem goes away > by reverting this: > > http://linux.bkbits.net:8080/linux-2.6/cset@41e054c6pwNQXzErMxvfh4IpLPXA5A? >nav=index.html|src/|src/include|src/include/linux|related/include/linux/sche >d.h > > -------------------------------------------------- > > [PATCH] sched: remove_interactive_credit The issue is that the scheduler interactivity estimator is a state machine and can be fooled to some degree, and a cpu intensive task that just happens to sleep a little bit gets significantly better priority than one that is fully cpu bound all the time. Reverting that change is not a solution because it can still be fooled by the same process sleeping lots for a few seconds or so at startup and then changing to the cpu mostly-sleeping slightly behaviour. This "fluctuating" behaviour is in my opinion worse which is why I removed it. Cheers, Con [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-27 23:26 ` Con Kolivas @ 2005-12-28 11:01 ` Paolo Ornati 2005-12-28 11:19 ` Con Kolivas 2005-12-30 13:52 ` [SCHED] wrong priority calc - SIMPLE test case Paolo Ornati 1 sibling, 1 reply; 53+ messages in thread From: Paolo Ornati @ 2005-12-28 11:01 UTC (permalink / raw) To: Con Kolivas; +Cc: Linux Kernel Mailing List, Ingo Molnar On Wed, 28 Dec 2005 10:26:58 +1100 Con Kolivas <kernel@kolivas.org> wrote: > Looking at your top output I see that transcode command generates 7 processes > all likely to be using cpu, and your DD slowdown is almost 7 times... I > assume it all goes away if you nice the transcode up by 3 or more. Yes, if I nice everything to 3 or more (nice -n 3 dvdrip ...) it works, but I would prefer a less weak scheduler (see the other email, with results for various schedulers). Another thing that I've noticed is that things tends to get worse when "transcode" is running for long time (some hours). It happened to me some times (with ingosched and also with staircase if I remember correctly): after some hours of running transcode (with me away from the machine) I've found a totally UNUSABLE system. Transcode was the king of the machine and everything else get almost no CPU time. Switching to a Text-Console takes something like 10s (or something like that). When I was finally logged in as root I've reniced transcode and companyto +19 and the system was usable again ;) To get things even more STRANGE: another time that this happened I've done the same thing except that I've reniced them to "0" (the same nice level they were running) ---> And the system became usable again (with the usual slow down but still usable). This is what I remember. Now I think we can agree that there is something wrong... no? :) -- Paolo Ornati Linux 2.6.15-rc5-plugsched on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-28 11:01 ` Paolo Ornati @ 2005-12-28 11:19 ` Con Kolivas 2005-12-28 11:35 ` Paolo Ornati 0 siblings, 1 reply; 53+ messages in thread From: Con Kolivas @ 2005-12-28 11:19 UTC (permalink / raw) To: Paolo Ornati; +Cc: Linux Kernel Mailing List, Ingo Molnar On Wednesday 28 December 2005 22:01, Paolo Ornati wrote: > after some hours of running transcode (with me away from the > machine) I've found a totally UNUSABLE system. Transcode was the king > of the machine and everything else get almost no CPU time. Switching to > a Text-Console takes something like 10s (or something like that). When > I was finally logged in as root I've reniced transcode and companyto +19 > and the system was usable again ;) > > To get things even more STRANGE: another time that this happened I've > done the same thing except that I've reniced them to "0" (the same nice > level they were running) ---> And the system became usable again (with > the usual slow down but still usable). > > This is what I remember. Now I think we can agree that there is > something wrong... no? This latter thing sounds more like your transcode job pushed everything out to swap... You need to instrument this case better. Con ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-28 11:19 ` Con Kolivas @ 2005-12-28 11:35 ` Paolo Ornati 2005-12-28 17:23 ` Paolo Ornati 0 siblings, 1 reply; 53+ messages in thread From: Paolo Ornati @ 2005-12-28 11:35 UTC (permalink / raw) To: Con Kolivas; +Cc: Linux Kernel Mailing List, Ingo Molnar On Wed, 28 Dec 2005 22:19:23 +1100 Con Kolivas <kernel@kolivas.org> wrote: > This latter thing sounds more like your transcode job pushed everything out to > swap... You need to instrument this case better. > I don't know. The combination Swapped Out Programs + "normal" priority strangeness can potentially result in a total disaster... but why renicing transcode to "0" gets the system usable again? Next time I'll grab "/proc/meminfo"... and what other info can help to understand? -- Paolo Ornati Linux 2.6.15-rc5-plugsched on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-28 11:35 ` Paolo Ornati @ 2005-12-28 17:23 ` Paolo Ornati 2005-12-28 17:39 ` Paolo Ornati 0 siblings, 1 reply; 53+ messages in thread From: Paolo Ornati @ 2005-12-28 17:23 UTC (permalink / raw) To: Paolo Ornati; +Cc: Con Kolivas, Linux Kernel Mailing List, Ingo Molnar On Wed, 28 Dec 2005 12:35:47 +0100 Paolo Ornati <ornati@fastwebnet.it> wrote: > > This latter thing sounds more like your transcode job pushed everything out to > > swap... You need to instrument this case better. > > > > I don't know. The combination Swapped Out Programs + "normal" priority > strangeness can potentially result in a total disaster... but why > renicing transcode to "0" gets the system usable again? > > Next time I'll grab "/proc/meminfo"... and what other info can help to > understand? I'm doing some tests with "ingosched" and a bit longer trancoding session and I've found a strange thing. DD running time can change a LOT between one run and another (always while transcode is running): ---------------------------------------------------------------------- paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/ 256+0 records in 256+0 records out real 0m15.754s user 0m0.000s sys 0m0.500s paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/ 256+0 records in 256+0 records out real 0m52.966s user 0m0.000s sys 0m0.504s paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/ 256+0 records in 256+0 records out real 0m48.928s user 0m0.004s sys 0m0.524s --------------------------------------------------------------------- Looking at top while running these I've seen that this is related to how transcode's (the one that eats a lot of CPU) priority changes. When only transcoding his priority is 16, but when also DD test is running then that value fluctuate between 16 and 18. DD priority is always 18 instead. Usually transcode's prio go to 17/18 and DD runs in 15/20s, but sometimes it doesn't fluctuate staying stuck to 16 and DD runs in ~50s. PS: I'm not yet able to reproduce the "totally unusable system" I was talking about. -- Paolo Ornati Linux 2.6.15-rc7-plugsched on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-28 17:23 ` Paolo Ornati @ 2005-12-28 17:39 ` Paolo Ornati 0 siblings, 0 replies; 53+ messages in thread From: Paolo Ornati @ 2005-12-28 17:39 UTC (permalink / raw) To: Paolo Ornati; +Cc: Con Kolivas, Linux Kernel Mailing List, Ingo Molnar On Wed, 28 Dec 2005 18:23:23 +0100 Paolo Ornati <ornati@fastwebnet.it> wrote: > Usually transcode's prio go to 17/18 and DD runs in 15/20s, but > sometimes it doesn't fluctuate staying stuck to 16 and DD runs in ~50s. And now I've noticed that when that prority stops fluctuating, it stops forever. Running the DD test again and again doesn't change anything! For some reasons (running time?) the trancode priority is stuck to 16 and DD always performs very badly: ~50s (normally it should be 8s). When I've noticed this the real running time of the trancode test is about 35/40 min. -- Paolo Ornati Linux 2.6.15-rc7-plugsched on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* [SCHED] wrong priority calc - SIMPLE test case 2005-12-27 23:26 ` Con Kolivas 2005-12-28 11:01 ` Paolo Ornati @ 2005-12-30 13:52 ` Paolo Ornati 2005-12-31 2:06 ` Peter Williams ` (2 more replies) 1 sibling, 3 replies; 53+ messages in thread From: Paolo Ornati @ 2005-12-30 13:52 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams WAS: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) http://lkml.org/lkml/2005/12/27/114/index.html On Wed, 28 Dec 2005 10:26:58 +1100 Con Kolivas <kernel@kolivas.org> wrote: > The issue is that the scheduler interactivity estimator is a state machine and > can be fooled to some degree, and a cpu intensive task that just happens to > sleep a little bit gets significantly better priority than one that is fully > cpu bound all the time. Reverting that change is not a solution because it > can still be fooled by the same process sleeping lots for a few seconds or so > at startup and then changing to the cpu mostly-sleeping slightly behaviour. > This "fluctuating" behaviour is in my opinion worse which is why I removed > it. Trying to find a "as simple as possible" test case for this problem (that I consider a BUG in priority calculation) I've come up with this very simple program: ------ sched_fooler.c ------------------------------- #include <stdlib.h> #include <unistd.h> static void burn_cpu(unsigned int x) { static char buf[1024]; int i; for (i=0; i < x; ++i) buf[i%sizeof(buf)] = (x-i)*3; } int main(int argc, char **argv) { unsigned long burn; if (argc != 2) return 1; burn = (unsigned long)atoi(argv[1]); while(1) { burn_cpu(burn*1000); usleep(1); } return 0; } ---------------------------------------------- paolo@tux ~/tmp/sched_fooler $ gcc sched_fooler.c paolo@tux ~/tmp/sched_fooler $ ./a.out 5000 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5633 paolo 15 0 2392 320 252 S 77.7 0.1 0:18.84 a.out 5225 root 15 0 169m 19m 2956 S 2.0 3.9 0:13.17 X 5307 paolo 15 0 100m 22m 13m S 2.0 4.4 0:04.32 kicker paolo@tux ~/tmp/sched_fooler $ ./a.out 10000 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5637 paolo 16 0 2396 320 252 R 87.4 0.1 0:13.38 a.out 5312 paolo 16 0 86636 22m 15m R 0.1 4.5 0:02.02 konsole 1 root 16 0 2552 560 468 S 0.0 0.1 0:00.71 init If I only run 2 of these together the system becomes everything but interactive ;) ./a.out 10000 & ./a.out 4000 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5714 paolo 15 0 2392 320 252 S 59.2 0.1 0:12.28 a.out 5713 paolo 16 0 2396 320 252 S 37.1 0.1 0:07.63 a.out DD TEST: the usual DD test (to read 128 MB from a non-cached file = disk bounded) says everything, numbers with 2.6.15-rc7: paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/ 128+0 records in 128+0 records out real 0m4.052s user 0m0.004s sys 0m0.180s START 2 OF THEM "./a.out 10000 & ./a.out 4000" paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/ 128+0 records in 128+0 records out real 2m4.578s user 0m0.000s sys 0m0.244s This does't surprise me anymore, since DD gets priority 18 and these two CPU eaters get 15/16. As usual "nicksched" is NOT affected at all, IOW it gives to these CPU eaters the priority that they deserve. -- Paolo Ornati Linux 2.6.15-rc7-g3603bc8d on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-30 13:52 ` [SCHED] wrong priority calc - SIMPLE test case Paolo Ornati @ 2005-12-31 2:06 ` Peter Williams 2005-12-31 10:34 ` Paolo Ornati 2005-12-31 8:13 ` Mike Galbraith 2006-01-13 1:13 ` Con Kolivas 2 siblings, 1 reply; 53+ messages in thread From: Peter Williams @ 2005-12-31 2:06 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin [-- Attachment #1: Type: text/plain, Size: 3997 bytes --] Paolo Ornati wrote: > WAS: [SCHED] Totally WRONG prority calculation with specific test-case > (since 2.6.10-bk12) > http://lkml.org/lkml/2005/12/27/114/index.html > > On Wed, 28 Dec 2005 10:26:58 +1100 > Con Kolivas <kernel@kolivas.org> wrote: > > >>The issue is that the scheduler interactivity estimator is a state machine and >>can be fooled to some degree, and a cpu intensive task that just happens to >>sleep a little bit gets significantly better priority than one that is fully >>cpu bound all the time. Reverting that change is not a solution because it >>can still be fooled by the same process sleeping lots for a few seconds or so >>at startup and then changing to the cpu mostly-sleeping slightly behaviour. >>This "fluctuating" behaviour is in my opinion worse which is why I removed >>it. > > > Trying to find a "as simple as possible" test case for this problem > (that I consider a BUG in priority calculation) I've come up with this > very simple program: > > ------ sched_fooler.c ------------------------------- > #include <stdlib.h> > #include <unistd.h> > > static void burn_cpu(unsigned int x) > { > static char buf[1024]; > int i; > > for (i=0; i < x; ++i) > buf[i%sizeof(buf)] = (x-i)*3; > } > > int main(int argc, char **argv) > { > unsigned long burn; > if (argc != 2) > return 1; > burn = (unsigned long)atoi(argv[1]); > while(1) { > burn_cpu(burn*1000); > usleep(1); > } > return 0; > } > ---------------------------------------------- > > paolo@tux ~/tmp/sched_fooler $ gcc sched_fooler.c > paolo@tux ~/tmp/sched_fooler $ ./a.out 5000 > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5633 paolo 15 0 2392 320 252 S 77.7 0.1 0:18.84 a.out > 5225 root 15 0 169m 19m 2956 S 2.0 3.9 0:13.17 X > 5307 paolo 15 0 100m 22m 13m S 2.0 4.4 0:04.32 kicker > > > paolo@tux ~/tmp/sched_fooler $ ./a.out 10000 > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5637 paolo 16 0 2396 320 252 R 87.4 0.1 0:13.38 a.out > 5312 paolo 16 0 86636 22m 15m R 0.1 4.5 0:02.02 konsole > 1 root 16 0 2552 560 468 S 0.0 0.1 0:00.71 init > > > If I only run 2 of these together the system becomes everything but > interactive ;) > > ./a.out 10000 & ./a.out 4000 > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5714 paolo 15 0 2392 320 252 S 59.2 0.1 0:12.28 a.out > 5713 paolo 16 0 2396 320 252 S 37.1 0.1 0:07.63 a.out > > > DD TEST: the usual DD test (to read 128 MB from a non-cached file = > disk bounded) says everything, numbers with 2.6.15-rc7: > > paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/ > 128+0 records in > 128+0 records out > > real 0m4.052s > user 0m0.004s > sys 0m0.180s > > START 2 OF THEM "./a.out 10000 & ./a.out 4000" > > paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/ > 128+0 records in > 128+0 records out > > real 2m4.578s > user 0m0.000s > sys 0m0.244s > > > This does't surprise me anymore, since DD gets priority 18 and these > two CPU eaters get 15/16. > > As usual "nicksched" is NOT affected at all, IOW it gives to these CPU > eaters the priority that they deserve. > Attached is a testing version of a patch for modifying scheduler bonus calculations that I'm working on. Although these changes aren't targetted at the problem you are experiencing I believe that they may help. My testing shows that sched_fooler instances don't get any bonuses but I would appreciate it if you could try it out. It is a patch against the 2.6.15-rc7 kernel and includes some other scheduling patches from the -mm kernels. Thanks Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce [-- Attachment #2: lial-test.patch --] [-- Type: text/x-patch, Size: 35320 bytes --] Index: GIT_lial/fs/proc/array.c =================================================================== --- GIT_lial.orig/fs/proc/array.c 2005-12-29 12:05:42.000000000 +1100 +++ GIT_lial/fs/proc/array.c 2005-12-31 12:29:47.000000000 +1100 @@ -165,7 +165,7 @@ static inline char * task_state(struct t read_lock(&tasklist_lock); buffer += sprintf(buffer, "State:\t%s\n" - "SleepAVG:\t%lu%%\n" + "LatencyAVG:\t%llu nsec\n" "Tgid:\t%d\n" "Pid:\t%d\n" "PPid:\t%d\n" @@ -173,7 +173,7 @@ static inline char * task_state(struct t "Uid:\t%d\t%d\t%d\t%d\n" "Gid:\t%d\t%d\t%d\t%d\n", get_task_state(p), - (p->sleep_avg/1024)*100/(1020000000/1024), + SCHED_AVG_RND(p->avg_latency), p->tgid, p->pid, pid_alive(p) ? p->group_leader->real_parent->tgid : 0, pid_alive(p) && p->ptrace ? p->parent->pid : 0, Index: GIT_lial/include/linux/sched.h =================================================================== --- GIT_lial.orig/include/linux/sched.h 2005-12-29 12:05:42.000000000 +1100 +++ GIT_lial/include/linux/sched.h 2005-12-31 12:29:47.000000000 +1100 @@ -158,6 +158,7 @@ extern unsigned long nr_iowait(void); #define SCHED_NORMAL 0 #define SCHED_FIFO 1 #define SCHED_RR 2 +#define SCHED_BATCH 3 struct sched_param { int sched_priority; @@ -473,9 +474,9 @@ struct signal_struct { /* * Priority of a process goes from 0..MAX_PRIO-1, valid RT - * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL tasks are - * in the range MAX_RT_PRIO..MAX_PRIO-1. Priority values - * are inverted: lower p->prio value means higher priority. + * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH + * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority + * values are inverted: lower p->prio value means higher priority. * * The MAX_USER_RT_PRIO value allows the actual maximum * RT priority to be separate from the value exported to @@ -683,6 +684,19 @@ static inline void prefetch_stack(struct struct audit_context; /* See audit.c */ struct mempolicy; +/* + * Fixed denominator rational numbers for use estimating task's average + * latencies and cpu usage per run + */ +#define SCHED_AVG_OFFSET 4 +/* + * Get the rounded integer value of a scheduling statistic average field + * Needs to be public for printing average interactive latency in + * fs/proc/array.c + */ +#define SCHED_AVG_RND(x) \ + (((x) + (1 << (SCHED_AVG_OFFSET - 1))) >> (SCHED_AVG_OFFSET)) + struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ struct thread_info *thread_info; @@ -695,16 +709,19 @@ struct task_struct { #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW) int oncpu; #endif - int prio, static_prio; + int prio, static_prio, latency_bonus; +#ifdef CONFIG_SMP + int bias_prio; +#endif struct list_head run_list; prio_array_t *array; unsigned short ioprio; - unsigned long sleep_avg; + unsigned long long avg_latency; unsigned long long timestamp, last_ran; + unsigned long long avg_cpu_run; unsigned long long sched_time; /* sched_clock time spent running */ - int activated; unsigned long policy; cpumask_t cpus_allowed; @@ -908,6 +925,8 @@ do { if (atomic_dec_and_test(&(tsk)->usa #define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */ #define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */ #define PF_RANDOMIZE 0x00800000 /* randomize virtual address space */ +#define PF_JUST_WOKEN 0x01000000 /* just woken */ +#define PF_IA_WAKE_UP 0x02000000 /* just woken from interactive sleep */ /* * Only the _current_ task can read/write to tsk->flags, but other Index: GIT_lial/init/main.c =================================================================== --- GIT_lial.orig/init/main.c 2005-12-31 12:28:25.000000000 +1100 +++ GIT_lial/init/main.c 2005-12-31 12:29:41.000000000 +1100 @@ -603,6 +603,8 @@ static void __init do_initcalls(void) * * Now we can finally start doing some real work.. */ +extern void sched_sysfs_init(void); + static void __init do_basic_setup(void) { /* drivers will send hotplug events */ @@ -613,6 +615,7 @@ static void __init do_basic_setup(void) #ifdef CONFIG_SYSCTL sysctl_init(); #endif + sched_sysfs_init(); /* Networking initialization needs a process context */ sock_init(); Index: GIT_lial/kernel/exit.c =================================================================== --- GIT_lial.orig/kernel/exit.c 2005-12-28 12:09:27.000000000 +1100 +++ GIT_lial/kernel/exit.c 2005-12-29 12:16:49.000000000 +1100 @@ -243,7 +243,9 @@ static inline void reparent_to_init(void /* Set the exit signal to SIGCHLD so we signal init on exit */ current->exit_signal = SIGCHLD; - if ((current->policy == SCHED_NORMAL) && (task_nice(current) < 0)) + if ((current->policy == SCHED_NORMAL || + current->policy == SCHED_BATCH) + && (task_nice(current) < 0)) set_user_nice(current, 0); /* cpus_allowed? */ /* rt_priority? */ Index: GIT_lial/kernel/sched.c =================================================================== --- GIT_lial.orig/kernel/sched.c 2005-12-29 12:05:42.000000000 +1100 +++ GIT_lial/kernel/sched.c 2005-12-31 12:30:51.000000000 +1100 @@ -60,6 +60,16 @@ #define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20) #define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio) +#ifdef CONFIG_SMP +/* + * Priority bias for load balancing ranges from 1 (nice==19) to 139 (RT + * priority of 100). + */ +#define NICE_TO_BIAS_PRIO(nice) (20 - (nice)) +#define PRIO_TO_BIAS_PRIO(prio) NICE_TO_BIAS_PRIO(PRIO_TO_NICE(prio)) +#define RTPRIO_TO_BIAS_PRIO(rp) (40 + (rp)) +#endif + /* * 'User priority' is the nice value converted to something we * can work with better when scaling various scheduler parameters, @@ -84,16 +94,10 @@ */ #define MIN_TIMESLICE max(5 * HZ / 1000, 1) #define DEF_TIMESLICE (100 * HZ / 1000) -#define ON_RUNQUEUE_WEIGHT 30 -#define CHILD_PENALTY 95 -#define PARENT_PENALTY 100 -#define EXIT_WEIGHT 3 #define PRIO_BONUS_RATIO 25 #define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100) #define INTERACTIVE_DELTA 2 -#define MAX_SLEEP_AVG (DEF_TIMESLICE * MAX_BONUS) -#define STARVATION_LIMIT (MAX_SLEEP_AVG) -#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG)) +#define STARVATION_LIMIT (DEF_TIMESLICE * MAX_BONUS) /* * If a task is 'interactive' then we reinsert it in the active @@ -123,9 +127,8 @@ * too hard. */ -#define CURRENT_BONUS(p) \ - (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \ - MAX_SLEEP_AVG) +#define CURRENT_BONUS(p) (just_woken_from_ia_sleep(p) ? \ + (p)->latency_bonus + 1 : (p)->latency_bonus) #define GRANULARITY (10 * HZ / 1000 ? : 1) @@ -147,10 +150,6 @@ #define TASK_INTERACTIVE(p) \ ((p)->prio <= (p)->static_prio - DELTA(p)) -#define INTERACTIVE_SLEEP(p) \ - (JIFFIES_TO_NS(MAX_SLEEP_AVG * \ - (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1)) - #define TASK_PREEMPTS_CURR(p, rq) \ ((p)->prio < (rq)->curr->prio) @@ -261,6 +260,91 @@ struct runqueue { static DEFINE_PER_CPU(struct runqueue, runqueues); +#define SCHED_AVG_REAL(a) ((a) << SCHED_AVG_OFFSET) +#define SCHED_AVG_ALPHA ((1 << SCHED_AVG_OFFSET) - 1) + +/* The range of acceptable interactive latencies in nanosecs */ +unsigned long unacceptable_ia_latency = 150000UL; +unsigned int acceptability_factor = 8; +#define ACCEPTABLE(l) ((l) >> acceptability_factor) +#define UNACCEPTABLE_IA_LATENCY unacceptable_ia_latency +#define ACCEPTABLE_IA_LATENCY ACCEPTABLE(UNACCEPTABLE_IA_LATENCY) + +static inline void incr_latency_bonus(task_t *p) +{ + /* + * one bonus point is reserved for allocation to all interactive + * wake ups + */ + if (p->latency_bonus < (MAX_BONUS - 1)) + ++p->latency_bonus; +} + +static inline void decr_latency_bonus(task_t *p) +{ + if (p->latency_bonus > 0) + --p->latency_bonus; +} + +static inline int just_woken(task_t *p) +{ + return p->flags & PF_JUST_WOKEN; +} + +static inline int just_woken_from_ia_sleep(task_t *p) +{ + return p->flags & PF_IA_WAKE_UP; +} + +static inline void decay_avg_value(unsigned long long *val) +{ + *val *= SCHED_AVG_ALPHA; + *val >>= SCHED_AVG_OFFSET; +} + +static void update_latency_bonus(task_t *p, runqueue_t *rq, unsigned long long now) +{ + long long delta = now - p->timestamp; + + /* make allowance for sched_clock() not being monotonic */ + if (unlikely(delta < 0)) + delta = 0; + + decay_avg_value(&p->avg_latency); + p->avg_latency += delta; + + /* do this now rather than earlier so that average latency is available + * for didplay for all tasks not just SCHED_NORMAL. + */ + if (rt_task(p) || p->policy == SCHED_BATCH) + goto out; + + if (just_woken_from_ia_sleep(p)) { + if (delta > UNACCEPTABLE_IA_LATENCY) + incr_latency_bonus(p); + else if (delta < ACCEPTABLE_IA_LATENCY) + decr_latency_bonus(p); + } else { + unsigned long long ual = UNACCEPTABLE_IA_LATENCY; + + /* + * The more tasks runnable the greater the acceptable non + * interactive delay. In the interests of fairness, tasks that + * use short CPU runs are smaller acceptable latencies. + */ + if (likely(rq->nr_running > 0)) + ual += SCHED_AVG_RND(p->avg_cpu_run) * + (rq->nr_running - 1); + + if (delta > ual) + incr_latency_bonus(p); + else if (delta < ACCEPTABLE(ual)) + decr_latency_bonus(p); + } +out: + p->flags &= ~(PF_IA_WAKE_UP|PF_JUST_WOKEN); +} + /* * The domain tree (rq->sd) is protected by RCU's quiescent state transition. * See detach_destroy_domains: synchronize_sched for details. @@ -633,9 +717,6 @@ static inline void enqueue_task_head(str * effective_prio - return the priority that is based on the static * priority but is modified by bonuses/penalties. * - * We scale the actual sleep average [0 .... MAX_SLEEP_AVG] - * into the -5 ... 0 ... +5 bonus/penalty range. - * * We use 25% of the full 0...39 priority range so that: * * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs. @@ -661,46 +742,53 @@ static int effective_prio(task_t *p) } #ifdef CONFIG_SMP -static inline void inc_prio_bias(runqueue_t *rq, int prio) +static inline void set_bias_prio(task_t *p) { - rq->prio_bias += MAX_PRIO - prio; + if (rt_task(p)) { + if (p == task_rq(p)->migration_thread) + /* + * The migration thread does the actual balancing. Do + * not bias by its priority as the ultra high priority + * will skew balancing adversely. + */ + p->bias_prio = 0; + else + p->bias_prio = RTPRIO_TO_BIAS_PRIO(p->rt_priority); + } else + p->bias_prio = PRIO_TO_BIAS_PRIO(p->static_prio); +} + +static inline void inc_prio_bias(runqueue_t *rq, const task_t *p) +{ + rq->prio_bias += p->bias_prio; } -static inline void dec_prio_bias(runqueue_t *rq, int prio) +static inline void dec_prio_bias(runqueue_t *rq, const task_t *p) { - rq->prio_bias -= MAX_PRIO - prio; + rq->prio_bias -= p->bias_prio; } static inline void inc_nr_running(task_t *p, runqueue_t *rq) { rq->nr_running++; - if (rt_task(p)) { - if (p != rq->migration_thread) - /* - * The migration thread does the actual balancing. Do - * not bias by its priority as the ultra high priority - * will skew balancing adversely. - */ - inc_prio_bias(rq, p->prio); - } else - inc_prio_bias(rq, p->static_prio); + inc_prio_bias(rq, p); } static inline void dec_nr_running(task_t *p, runqueue_t *rq) { rq->nr_running--; - if (rt_task(p)) { - if (p != rq->migration_thread) - dec_prio_bias(rq, p->prio); - } else - dec_prio_bias(rq, p->static_prio); + dec_prio_bias(rq, p); } #else -static inline void inc_prio_bias(runqueue_t *rq, int prio) +static inline void set_bias_prio(task_t *p) +{ +} + +static inline void inc_prio_bias(runqueue_t *rq, const task_t *p) { } -static inline void dec_prio_bias(runqueue_t *rq, int prio) +static inline void dec_prio_bias(runqueue_t *rq, const task_t *p) { } @@ -733,68 +821,6 @@ static inline void __activate_idle_task( inc_nr_running(p, rq); } -static int recalc_task_prio(task_t *p, unsigned long long now) -{ - /* Caller must always ensure 'now >= p->timestamp' */ - unsigned long long __sleep_time = now - p->timestamp; - unsigned long sleep_time; - - if (__sleep_time > NS_MAX_SLEEP_AVG) - sleep_time = NS_MAX_SLEEP_AVG; - else - sleep_time = (unsigned long)__sleep_time; - - if (likely(sleep_time > 0)) { - /* - * User tasks that sleep a long time are categorised as - * idle and will get just interactive status to stay active & - * prevent them suddenly becoming cpu hogs and starving - * other processes. - */ - if (p->mm && p->activated != -1 && - sleep_time > INTERACTIVE_SLEEP(p)) { - p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG - - DEF_TIMESLICE); - } else { - /* - * The lower the sleep avg a task has the more - * rapidly it will rise with sleep time. - */ - sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1; - - /* - * Tasks waking from uninterruptible sleep are - * limited in their sleep_avg rise as they - * are likely to be waiting on I/O - */ - if (p->activated == -1 && p->mm) { - if (p->sleep_avg >= INTERACTIVE_SLEEP(p)) - sleep_time = 0; - else if (p->sleep_avg + sleep_time >= - INTERACTIVE_SLEEP(p)) { - p->sleep_avg = INTERACTIVE_SLEEP(p); - sleep_time = 0; - } - } - - /* - * This code gives a bonus to interactive tasks. - * - * The boost works by updating the 'average sleep time' - * value here, based on ->timestamp. The more time a - * task spends sleeping, the higher the average gets - - * and the higher the priority boost gets as well. - */ - p->sleep_avg += sleep_time; - - if (p->sleep_avg > NS_MAX_SLEEP_AVG) - p->sleep_avg = NS_MAX_SLEEP_AVG; - } - } - - return effective_prio(p); -} - /* * activate_task - move a task to the runqueue and do priority recalculation * @@ -816,30 +842,8 @@ static void activate_task(task_t *p, run #endif if (!rt_task(p)) - p->prio = recalc_task_prio(p, now); + p->prio = effective_prio(p); - /* - * This checks to make sure it's not an uninterruptible task - * that is now waking up. - */ - if (!p->activated) { - /* - * Tasks which were woken up by interrupts (ie. hw events) - * are most likely of interactive nature. So we give them - * the credit of extending their sleep time to the period - * of time they spend on the runqueue, waiting for execution - * on a CPU, first time around: - */ - if (in_interrupt()) - p->activated = 2; - else { - /* - * Normal first-time wakeups get a credit too for - * on-runqueue time, but it will be weighted down: - */ - p->activated = 1; - } - } p->timestamp = now; __activate_task(p, rq); @@ -994,61 +998,29 @@ void kick_process(task_t *p) * We want to under-estimate the load of migration sources, to * balance conservatively. */ -static inline unsigned long __source_load(int cpu, int type, enum idle_type idle) +static inline unsigned long source_load(int cpu, int type) { runqueue_t *rq = cpu_rq(cpu); - unsigned long running = rq->nr_running; - unsigned long source_load, cpu_load = rq->cpu_load[type-1], - load_now = running * SCHED_LOAD_SCALE; + unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE; if (type == 0) - source_load = load_now; - else - source_load = min(cpu_load, load_now); + return load_now; - if (running > 1 || (idle == NOT_IDLE && running)) - /* - * If we are busy rebalancing the load is biased by - * priority to create 'nice' support across cpus. When - * idle rebalancing we should only bias the source_load if - * there is more than one task running on that queue to - * prevent idle rebalance from trying to pull tasks from a - * queue with only one running task. - */ - source_load = source_load * rq->prio_bias / running; - - return source_load; -} - -static inline unsigned long source_load(int cpu, int type) -{ - return __source_load(cpu, type, NOT_IDLE); + return min(rq->cpu_load[type-1], load_now); } /* * Return a high guess at the load of a migration-target cpu */ -static inline unsigned long __target_load(int cpu, int type, enum idle_type idle) +static inline unsigned long target_load(int cpu, int type) { runqueue_t *rq = cpu_rq(cpu); - unsigned long running = rq->nr_running; - unsigned long target_load, cpu_load = rq->cpu_load[type-1], - load_now = running * SCHED_LOAD_SCALE; + unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE; if (type == 0) - target_load = load_now; - else - target_load = max(cpu_load, load_now); - - if (running > 1 || (idle == NOT_IDLE && running)) - target_load = target_load * rq->prio_bias / running; + return load_now; - return target_load; -} - -static inline unsigned long target_load(int cpu, int type) -{ - return __target_load(cpu, type, NOT_IDLE); + return max(rq->cpu_load[type-1], load_now); } /* @@ -1306,7 +1278,7 @@ static int try_to_wake_up(task_t *p, uns * of the current CPU: */ if (sync) - tl -= SCHED_LOAD_SCALE; + tl -= p->bias_prio * SCHED_LOAD_SCALE; if ((tl <= load && tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) || @@ -1353,25 +1325,21 @@ out_set_cpu: out_activate: #endif /* CONFIG_SMP */ - if (old_state == TASK_UNINTERRUPTIBLE) { + if (old_state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible--; - /* - * Tasks on involuntary sleep don't earn - * sleep_avg beyond just interactive state. - */ - p->activated = -1; - } - /* - * Tasks that have marked their sleep as noninteractive get - * woken up without updating their sleep average. (i.e. their - * sleep is handled in a priority-neutral manner, no priority - * boost and no penalty.) + * uninterruptible sleeps are assumed to be non interactive. + * interruptible sleeps are assumed to be interactive unless + * tagged with the TASK_NONINTERACTIVE flag. */ - if (old_state & TASK_NONINTERACTIVE) - __activate_task(p, rq); + if (old_state == TASK_INTERRUPTIBLE) + p->flags |= PF_IA_WAKE_UP; else - activate_task(p, rq, cpu == this_cpu); + p->flags &= ~PF_IA_WAKE_UP; + + p->flags |= PF_JUST_WOKEN; + + activate_task(p, rq, cpu == this_cpu); /* * Sync wakeups (i.e. those types of wakeups where the waker * has indicated that it will leave the CPU in short order) @@ -1421,6 +1389,15 @@ void fastcall sched_fork(task_t *p, int set_task_cpu(p, cpu); /* + * Leave the latency bonus the same as the parent's. + * This helps new tasks launched by media to get off to a good start + * when the system is under load. If they don't warrant it they'll soon + * lose it. + */ + p->avg_latency = 0; + p->avg_cpu_run = 0; + + /* * We mark the process as running here, but have not actually * inserted it onto the runqueue yet. This guarantees that * nobody will actually run it, and a signal or other external @@ -1477,22 +1454,13 @@ void fastcall wake_up_new_task(task_t *p { unsigned long flags; int this_cpu, cpu; - runqueue_t *rq, *this_rq; + runqueue_t *rq; rq = task_rq_lock(p, &flags); BUG_ON(p->state != TASK_RUNNING); this_cpu = smp_processor_id(); cpu = task_cpu(p); - /* - * We decrease the sleep average of forking parents - * and children as well, to keep max-interactive tasks - * from forking tasks that are max-interactive. The parent - * (current) is done further down, under its lock. - */ - p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) * - CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); - p->prio = effective_prio(p); if (likely(cpu == this_cpu)) { @@ -1515,15 +1483,8 @@ void fastcall wake_up_new_task(task_t *p } else /* Run child last */ __activate_task(p, rq); - /* - * We skip the following code due to cpu == this_cpu - * - * task_rq_unlock(rq, &flags); - * this_rq = task_rq_lock(current, &flags); - */ - this_rq = rq; } else { - this_rq = cpu_rq(this_cpu); + runqueue_t *this_rq = cpu_rq(this_cpu); /* * Not the local CPU - must adjust timestamp. This should @@ -1534,17 +1495,8 @@ void fastcall wake_up_new_task(task_t *p __activate_task(p, rq); if (TASK_PREEMPTS_CURR(p, rq)) resched_task(rq->curr); - - /* - * Parent and child are on different CPUs, now get the - * parent runqueue to update the parent's ->sleep_avg: - */ - task_rq_unlock(rq, &flags); - this_rq = task_rq_lock(current, &flags); } - current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) * - PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); - task_rq_unlock(this_rq, &flags); + task_rq_unlock(rq, &flags); } /* @@ -1571,10 +1523,6 @@ void fastcall sched_exit(task_t *p) if (unlikely(p->parent->time_slice > task_timeslice(p))) p->parent->time_slice = task_timeslice(p); } - if (p->sleep_avg < p->parent->sleep_avg) - p->parent->sleep_avg = p->parent->sleep_avg / - (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg / - (EXIT_WEIGHT + 1); task_rq_unlock(rq, &flags); } @@ -1911,15 +1859,16 @@ int can_migrate_task(task_t *p, runqueue * Called with both runqueues locked. */ static int move_tasks(runqueue_t *this_rq, int this_cpu, runqueue_t *busiest, - unsigned long max_nr_move, struct sched_domain *sd, - enum idle_type idle, int *all_pinned) + unsigned long max_nr_move, long max_bias_move, + struct sched_domain *sd, enum idle_type idle, + int *all_pinned) { prio_array_t *array, *dst_array; struct list_head *head, *curr; int idx, pulled = 0, pinned = 0; task_t *tmp; - if (max_nr_move == 0) + if (max_nr_move == 0 || max_bias_move == 0) goto out; pinned = 1; @@ -1962,7 +1911,8 @@ skip_queue: curr = curr->prev; - if (!can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) { + if (tmp->bias_prio > max_bias_move || + !can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) { if (curr != head) goto skip_queue; idx++; @@ -1976,9 +1926,13 @@ skip_queue: pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu); pulled++; + max_bias_move -= tmp->bias_prio; - /* We only want to steal up to the prescribed number of tasks. */ - if (pulled < max_nr_move) { + /* + * We only want to steal up to the prescribed number of tasks + * and the prescribed amount of biased load. + */ + if (pulled < max_nr_move && max_bias_move > 0) { if (curr != head) goto skip_queue; idx++; @@ -1999,7 +1953,7 @@ out: /* * find_busiest_group finds and returns the busiest CPU group within the - * domain. It calculates and returns the number of tasks which should be + * domain. It calculates and returns the amount of biased load which should be * moved to restore balance via the imbalance parameter. */ static struct sched_group * @@ -2035,9 +1989,9 @@ find_busiest_group(struct sched_domain * /* Bias balancing toward cpus of our domain */ if (local_group) - load = __target_load(i, load_idx, idle); + load = target_load(i, load_idx); else - load = __source_load(i, load_idx, idle); + load = source_load(i, load_idx); avg_load += load; } @@ -2092,7 +2046,7 @@ find_busiest_group(struct sched_domain * unsigned long tmp; if (max_load - this_load >= SCHED_LOAD_SCALE*2) { - *imbalance = 1; + *imbalance = NICE_TO_BIAS_PRIO(0); return busiest; } @@ -2125,7 +2079,7 @@ find_busiest_group(struct sched_domain * if (pwr_move <= pwr_now) goto out_balanced; - *imbalance = 1; + *imbalance = NICE_TO_BIAS_PRIO(0); return busiest; } @@ -2142,15 +2096,14 @@ out_balanced: /* * find_busiest_queue - find the busiest runqueue among the cpus in group. */ -static runqueue_t *find_busiest_queue(struct sched_group *group, - enum idle_type idle) +static runqueue_t *find_busiest_queue(struct sched_group *group) { unsigned long load, max_load = 0; runqueue_t *busiest = NULL; int i; for_each_cpu_mask(i, group->cpumask) { - load = __source_load(i, 0, idle); + load = source_load(i, 0); if (load > max_load) { max_load = load; @@ -2167,6 +2120,7 @@ static runqueue_t *find_busiest_queue(st */ #define MAX_PINNED_INTERVAL 512 +#define minus_1_or_zero(n) ((n) > 0 ? (n) - 1 : 0) /* * Check this_cpu to ensure it is balanced within domain. Attempt to move * tasks if there is an imbalance. @@ -2194,7 +2148,7 @@ static int load_balance(int this_cpu, ru goto out_balanced; } - busiest = find_busiest_queue(group, idle); + busiest = find_busiest_queue(group); if (!busiest) { schedstat_inc(sd, lb_nobusyq[idle]); goto out_balanced; @@ -2214,6 +2168,7 @@ static int load_balance(int this_cpu, ru */ double_rq_lock(this_rq, busiest); nr_moved = move_tasks(this_rq, this_cpu, busiest, + minus_1_or_zero(busiest->nr_running), imbalance, sd, idle, &all_pinned); double_rq_unlock(this_rq, busiest); @@ -2317,7 +2272,7 @@ static int load_balance_newidle(int this goto out_balanced; } - busiest = find_busiest_queue(group, NEWLY_IDLE); + busiest = find_busiest_queue(group); if (!busiest) { schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]); goto out_balanced; @@ -2332,6 +2287,7 @@ static int load_balance_newidle(int this /* Attempt to move tasks */ double_lock_balance(this_rq, busiest); nr_moved = move_tasks(this_rq, this_cpu, busiest, + minus_1_or_zero(busiest->nr_running), imbalance, sd, NEWLY_IDLE, NULL); spin_unlock(&busiest->lock); } @@ -2412,7 +2368,8 @@ static void active_load_balance(runqueue schedstat_inc(sd, alb_cnt); - if (move_tasks(target_rq, target_cpu, busiest_rq, 1, sd, SCHED_IDLE, NULL)) + if (move_tasks(target_rq, target_cpu, busiest_rq, 1, + RTPRIO_TO_BIAS_PRIO(100), sd, SCHED_IDLE, NULL)) schedstat_inc(sd, alb_pushed); else schedstat_inc(sd, alb_failed); @@ -2440,7 +2397,7 @@ static void rebalance_tick(int this_cpu, struct sched_domain *sd; int i; - this_load = this_rq->nr_running * SCHED_LOAD_SCALE; + this_load = this_rq->prio_bias * SCHED_LOAD_SCALE; /* Update our load */ for (i = 0; i < 3; i++) { unsigned long new_load = this_load; @@ -2689,6 +2646,10 @@ void scheduler_tick(void) if (!--p->time_slice) { dequeue_task(p, rq->active); set_tsk_need_resched(p); + /* make sure that tasks that obtain an latency_bonus but then + * become CPU bound eventually lose the bonus. + */ + decr_latency_bonus(p); p->prio = effective_prio(p); p->time_slice = task_timeslice(p); p->first_time_slice = 0; @@ -2949,8 +2910,7 @@ asmlinkage void __sched schedule(void) prio_array_t *array; struct list_head *queue; unsigned long long now; - unsigned long run_time; - int cpu, idx, new_prio; + int cpu, idx; /* * Test if we are atomic. Since do_exit() needs to call into @@ -2985,21 +2945,12 @@ need_resched_nonpreemptible: schedstat_inc(rq, sched_cnt); now = sched_clock(); - if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) { - run_time = now - prev->timestamp; - if (unlikely((long long)(now - prev->timestamp) < 0)) - run_time = 0; - } else - run_time = NS_MAX_SLEEP_AVG; - - /* - * Tasks charged proportionately less run_time at high sleep_avg to - * delay them losing their interactive status - */ - run_time /= (CURRENT_BONUS(prev) ? : 1); spin_lock_irq(&rq->lock); + if (likely(now > prev->timestamp)) + prev->avg_cpu_run += now - prev->timestamp; + if (unlikely(prev->flags & PF_DEAD)) prev->state = EXIT_DEAD; @@ -3063,25 +3014,6 @@ go_idle: queue = array->queue + idx; next = list_entry(queue->next, task_t, run_list); - if (!rt_task(next) && next->activated > 0) { - unsigned long long delta = now - next->timestamp; - if (unlikely((long long)(now - next->timestamp) < 0)) - delta = 0; - - if (next->activated == 1) - delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; - - array = next->array; - new_prio = recalc_task_prio(next, next->timestamp + delta); - - if (unlikely(next->prio != new_prio)) { - dequeue_task(next, array); - next->prio = new_prio; - enqueue_task(next, array); - } else - requeue_task(next, array); - } - next->activated = 0; switch_tasks: if (next == rq->idle) schedstat_inc(rq, sched_goidle); @@ -3091,14 +3023,13 @@ switch_tasks: rcu_qsctr_inc(task_cpu(prev)); update_cpu_clock(prev, rq, now); - - prev->sleep_avg -= run_time; - if ((long)prev->sleep_avg <= 0) - prev->sleep_avg = 0; prev->timestamp = prev->last_ran = now; sched_info_switch(prev, next); if (likely(prev != next)) { + decay_avg_value(&prev->avg_cpu_run); + if (just_woken(next)) + update_latency_bonus(next, rq, now); next->timestamp = now; rq->nr_switches++; rq->curr = next; @@ -3543,7 +3474,7 @@ void set_user_nice(task_t *p, long nice) * The RT priorities are set via sched_setscheduler(), but we still * allow the 'normal' nice value to be set - but as expected * it wont have any effect on scheduling until the task is - * not SCHED_NORMAL: + * not SCHED_NORMAL/SCHED_BATCH: */ if (rt_task(p)) { p->static_prio = NICE_TO_PRIO(nice); @@ -3552,18 +3483,19 @@ void set_user_nice(task_t *p, long nice) array = p->array; if (array) { dequeue_task(p, array); - dec_prio_bias(rq, p->static_prio); + dec_prio_bias(rq, p); } old_prio = p->prio; new_prio = NICE_TO_PRIO(nice); delta = new_prio - old_prio; p->static_prio = NICE_TO_PRIO(nice); + set_bias_prio(p); p->prio += delta; if (array) { enqueue_task(p, array); - inc_prio_bias(rq, p->static_prio); + inc_prio_bias(rq, p); /* * If the task increased its priority or is running and * lowered its priority, then reschedule its CPU: @@ -3689,10 +3621,17 @@ static void __setscheduler(struct task_s BUG_ON(p->array); p->policy = policy; p->rt_priority = prio; - if (policy != SCHED_NORMAL) + if (policy != SCHED_NORMAL && policy != SCHED_BATCH) { p->prio = MAX_RT_PRIO-1 - p->rt_priority; - else + } else { p->prio = p->static_prio; + /* + * SCHED_BATCH tasks are treated as perpetual CPU hogs: + */ + if (policy == SCHED_BATCH) + p->latency_bonus = 0; + } + set_bias_prio(p); } /** @@ -3716,29 +3655,35 @@ recheck: if (policy < 0) policy = oldpolicy = p->policy; else if (policy != SCHED_FIFO && policy != SCHED_RR && - policy != SCHED_NORMAL) + policy != SCHED_NORMAL && policy != SCHED_BATCH) return -EINVAL; /* * Valid priorities for SCHED_FIFO and SCHED_RR are - * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL is 0. + * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL and + * SCHED_BATCH is 0. */ if (param->sched_priority < 0 || (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) || (!p->mm && param->sched_priority > MAX_RT_PRIO-1)) return -EINVAL; - if ((policy == SCHED_NORMAL) != (param->sched_priority == 0)) + if ((policy == SCHED_NORMAL || policy == SCHED_BATCH) + != (param->sched_priority == 0)) return -EINVAL; /* * Allow unprivileged RT tasks to decrease priority: */ if (!capable(CAP_SYS_NICE)) { - /* can't change policy */ - if (policy != p->policy && - !p->signal->rlim[RLIMIT_RTPRIO].rlim_cur) + /* + * can't change policy, except between SCHED_NORMAL + * and SCHED_BATCH: + */ + if (((policy != SCHED_NORMAL && p->policy != SCHED_BATCH) && + (policy != SCHED_BATCH && p->policy != SCHED_NORMAL)) && + !p->signal->rlim[RLIMIT_RTPRIO].rlim_cur) return -EPERM; /* can't increase priority */ - if (policy != SCHED_NORMAL && + if ((policy != SCHED_NORMAL && policy != SCHED_BATCH) && param->sched_priority > p->rt_priority && param->sched_priority > p->signal->rlim[RLIMIT_RTPRIO].rlim_cur) @@ -4216,6 +4161,7 @@ asmlinkage long sys_sched_get_priority_m ret = MAX_USER_RT_PRIO-1; break; case SCHED_NORMAL: + case SCHED_BATCH: ret = 0; break; } @@ -4239,6 +4185,7 @@ asmlinkage long sys_sched_get_priority_m ret = 1; break; case SCHED_NORMAL: + case SCHED_BATCH: ret = 0; } return ret; @@ -4394,7 +4341,9 @@ void __devinit init_idle(task_t *idle, i runqueue_t *rq = cpu_rq(cpu); unsigned long flags; - idle->sleep_avg = 0; + idle->avg_latency = 0; + idle->avg_cpu_run = 0; + idle->latency_bonus = 0; idle->array = NULL; idle->prio = MAX_PRIO; idle->state = TASK_RUNNING; @@ -5634,6 +5583,7 @@ void __init sched_init(void) } } + set_bias_prio(&init_task); /* * The boot idle thread does lazy MMU switching as well: */ @@ -5745,3 +5695,128 @@ void set_curr_task(int cpu, task_t *p) } #endif + +/* + * Place scheduler attributes in sysfs + */ + +#include <linux/sysfs.h> + +static ssize_t show_unacceptable_ia_latency(char *page) +{ + return sprintf(page, "%ld\n", unacceptable_ia_latency); +} + +static ssize_t store_unacceptable_ia_latency(const char *page, size_t count) +{ + unsigned long long val; + char *end = NULL; + + val = simple_strtoull(page, &end, 10); + if ((end == page) || ((*end != '\0') && (*end != '\n'))) + return -EINVAL; + if (val > ULONG_MAX) + unacceptable_ia_latency = ULONG_MAX; + else + unacceptable_ia_latency = val; + + return count; +} + +static ssize_t show_acceptability_factor(char *page) +{ + return sprintf(page, "%d\n", acceptability_factor); +} + +static ssize_t store_acceptability_factor(const char *page, size_t count) +{ + unsigned long long val; + char *end = NULL; + + val = simple_strtoull(page, &end, 10); + if ((end == page) || ((*end != '\0') && (*end != '\n'))) + return -EINVAL; + if (val > UINT_MAX) + acceptability_factor = UINT_MAX; + else + acceptability_factor = val; + + return count; +} + +struct sched_sysfs_entry { + struct attribute attr; + ssize_t (*show)(char *); + ssize_t (*store)(const char *, size_t); +}; + +#define to_sched_sysfs_entry(a) \ + container_of((a), struct sched_sysfs_entry, attr) + +static ssize_t show_attribute(struct kobject *kobj, struct attribute *attr, + char *page) +{ + struct sched_sysfs_entry *e = to_sched_sysfs_entry(attr); + + if (!e->show) + return 0; + + return e->show(page); +} + +static ssize_t store_attribute(struct kobject *kobj, struct attribute *attr, + const char *page, size_t length) +{ + struct sched_sysfs_entry *e = to_sched_sysfs_entry(attr); + + if (!e->show) + return -EBADF; + + return e->store(page, length); +} + +struct sysfs_ops sched_sysfs_ops = { + .show = show_attribute, + .store = store_attribute, +}; + +static struct sched_sysfs_entry unacceptable_ia_latency_sdse = { + .attr = { .name = "unacceptable_ia_latency", + .mode = S_IRUGO | S_IWUSR }, + .show = show_unacceptable_ia_latency, + .store = store_unacceptable_ia_latency, +}; + +static struct sched_sysfs_entry acceptability_factor_sdse = { + .attr = { .name = "acceptability_factor", + .mode = S_IRUGO | S_IWUSR }, + .show = show_acceptability_factor, + .store = store_acceptability_factor, +}; + +static struct attribute *sched_attrs[] = { + &unacceptable_ia_latency_sdse.attr, + &acceptability_factor_sdse.attr, + NULL, +}; + +static struct kobj_type sched_ktype = { + .sysfs_ops = &sched_sysfs_ops, + .default_attrs = sched_attrs, +}; + +static struct kobject sched_kobj = { + .ktype = &sched_ktype +}; + +decl_subsys(cpusched, NULL, NULL); + +void __init sched_sysfs_init(void) +{ + if (subsystem_register(&cpusched_subsys) == 0) { + sched_kobj.kset = &cpusched_subsys.kset; + strncpy(sched_kobj.name, "attrs", KOBJ_NAME_LEN); + sched_kobj.kset = &cpusched_subsys.kset; + (void)kobject_register(&sched_kobj); + } +} ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-31 2:06 ` Peter Williams @ 2005-12-31 10:34 ` Paolo Ornati 2005-12-31 10:52 ` Paolo Ornati 0 siblings, 1 reply; 53+ messages in thread From: Paolo Ornati @ 2005-12-31 10:34 UTC (permalink / raw) To: Peter Williams Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin On Sat, 31 Dec 2005 13:06:04 +1100 Peter Williams <pwil3058@bigpond.net.au> wrote: > Attached is a testing version of a patch for modifying scheduler bonus > calculations that I'm working on. Although these changes aren't > targetted at the problem you are experiencing I believe that they may > help. My testing shows that sched_fooler instances don't get any > bonuses but I would appreciate it if you could try it out. > > It is a patch against the 2.6.15-rc7 kernel and includes some other > scheduling patches from the -mm kernels. Yes, this fixes both my test-case (transcode & little program), they get priority 25 instead of ~16. But the priority of DD is now ~23 and so it still suffers a bit: paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/ 128+0 records in 128+0 records out real 0m8.549s (instead of 4s) user 0m0.000s sys 0m0.209s -- Paolo Ornati Linux 2.6.15-rc7-lial on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-31 10:34 ` Paolo Ornati @ 2005-12-31 10:52 ` Paolo Ornati 2005-12-31 11:12 ` Con Kolivas 2005-12-31 13:44 ` Peter Williams 0 siblings, 2 replies; 53+ messages in thread From: Paolo Ornati @ 2005-12-31 10:52 UTC (permalink / raw) To: Paolo Ornati Cc: Peter Williams, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin On Sat, 31 Dec 2005 11:34:46 +0100 Paolo Ornati <ornati@fastwebnet.it> wrote: > > It is a patch against the 2.6.15-rc7 kernel and includes some other > > scheduling patches from the -mm kernels. > > Yes, this fixes both my test-case (transcode & little program), they > get priority 25 instead of ~16. > > But the priority of DD is now ~23 and so it still suffers a bit: I forgot to mention that even the others "interactive" processes don't get a good priority too. Xorg for example, while only moving the cursor around, gets priority 23/24. And when cpu-eaters are running (at priority 25) it isn't happy at all, the cursor begins to move in jerks and so on... -- Paolo Ornati Linux 2.6.15-rc7-lial on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-31 10:52 ` Paolo Ornati @ 2005-12-31 11:12 ` Con Kolivas 2005-12-31 13:44 ` Peter Williams 1 sibling, 0 replies; 53+ messages in thread From: Con Kolivas @ 2005-12-31 11:12 UTC (permalink / raw) To: Paolo Ornati Cc: Peter Williams, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin On Saturday 31 December 2005 21:52, Paolo Ornati wrote: > On Sat, 31 Dec 2005 11:34:46 +0100 > > Paolo Ornati <ornati@fastwebnet.it> wrote: > > > It is a patch against the 2.6.15-rc7 kernel and includes some other > > > scheduling patches from the -mm kernels. > > > > Yes, this fixes both my test-case (transcode & little program), they > > get priority 25 instead of ~16. > > > > But the priority of DD is now ~23 and so it still suffers a bit: > > I forgot to mention that even the others "interactive" processes > don't get a good priority too. > > Xorg for example, while only moving the cursor around, gets priority > 23/24. And when cpu-eaters are running (at priority 25) it isn't happy > at all, the cursor begins to move in jerks and so on... This is why Ingo, Nick and myself think that a tweak to the heavily field tested current cpu scheduler is best for 2.6 rather than any gutting and replacement of the interactivity estimator (which even though this scheme is simple and easy to understand, it clearly is an example of). Given that we have a "2.6 forever" policy, it also means any significant cpu scheduler rewrite, even just of the interactivity estimator, is nigh on impossible to implement. Cheers, Con ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-31 10:52 ` Paolo Ornati 2005-12-31 11:12 ` Con Kolivas @ 2005-12-31 13:44 ` Peter Williams 2005-12-31 16:31 ` Paolo Ornati 1 sibling, 1 reply; 53+ messages in thread From: Peter Williams @ 2005-12-31 13:44 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin Paolo Ornati wrote: > On Sat, 31 Dec 2005 11:34:46 +0100 > Paolo Ornati <ornati@fastwebnet.it> wrote: > > >>>It is a patch against the 2.6.15-rc7 kernel and includes some other >>>scheduling patches from the -mm kernels. >> >>Yes, this fixes both my test-case (transcode & little program), they >>get priority 25 instead of ~16. >> >>But the priority of DD is now ~23 and so it still suffers a bit: > > > I forgot to mention that even the others "interactive" processes > don't get a good priority too. > > Xorg for example, while only moving the cursor around, gets priority > 23/24. And when cpu-eaters are running (at priority 25) it isn't happy > at all, the cursor begins to move in jerks and so on... > OK. This probably means that the parameters that control the mechanism need tweaking. There should be a file /sys/cpusched/attrs/unacceptable_ia_latency which contains the latency (in nanoseconds) that the scheduler considers unacceptable for interactive programs. Try changing that value and see if things improve? Making it smaller should help but if you make it too small all the interactive tasks will end up with the same priority and this could cause them to get in each other's way. Thanks, Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-31 13:44 ` Peter Williams @ 2005-12-31 16:31 ` Paolo Ornati 2005-12-31 22:04 ` Peter Williams 0 siblings, 1 reply; 53+ messages in thread From: Paolo Ornati @ 2005-12-31 16:31 UTC (permalink / raw) To: Peter Williams Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin On Sun, 01 Jan 2006 00:44:10 +1100 Peter Williams <pwil3058@bigpond.net.au> wrote: > OK. This probably means that the parameters that control the mechanism > need tweaking. > > There should be a file /sys/cpusched/attrs/unacceptable_ia_latency which > contains the latency (in nanoseconds) that the scheduler considers > unacceptable for interactive programs. Try changing that value and see > if things improve? Making it smaller should help but if you make it too > small all the interactive tasks will end up with the same priority and > this could cause them to get in each other's way. I've tried different values and sometimes I've got a good feeling BUT the behaviour is too strange to say something. Sometimes I get what I want (dd priority ~17 and CPU eaters prio 25), sometimes I get a total disaster (dd priority 17 and CPU eaters prio 15/16) and sometimes I get something like DD prio 22 and CPU eaters 23/24. All this is not well related to "unacceptable_ia_latency" values. What I think is that the priority calculation in ingosched and other schedulers is in general too weak, while in other schedulers is rock solid (read: nicksched). Maybe is just that the smarter a scheduler want to be, the more fragile it will be. -- Paolo Ornati Linux 2.6.15-rc7-lial on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-31 16:31 ` Paolo Ornati @ 2005-12-31 22:04 ` Peter Williams 0 siblings, 0 replies; 53+ messages in thread From: Peter Williams @ 2005-12-31 22:04 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin Paolo Ornati wrote: > On Sun, 01 Jan 2006 00:44:10 +1100 > Peter Williams <pwil3058@bigpond.net.au> wrote: > > >>OK. This probably means that the parameters that control the mechanism >>need tweaking. >> >>There should be a file /sys/cpusched/attrs/unacceptable_ia_latency which >>contains the latency (in nanoseconds) that the scheduler considers >>unacceptable for interactive programs. Try changing that value and see >>if things improve? Making it smaller should help but if you make it too >>small all the interactive tasks will end up with the same priority and >>this could cause them to get in each other's way. > > > I've tried different values and sometimes I've got a good feeling BUT > the behaviour is too strange to say something. > > Sometimes I get what I want (dd priority ~17 and CPU eaters prio > 25), sometimes I get a total disaster (dd priority 17 and CPU eaters > prio 15/16) and sometimes I get something like DD prio 22 and CPU > eaters 23/24. > > All this is not well related to "unacceptable_ia_latency" values. OK. Thanks for trying it. The feedback will be helpful in trying to improve the mechanisms. > > What I think is that the priority calculation in ingosched and other > schedulers is in general too weak, while in other schedulers is rock > solid (read: nicksched). > > Maybe is just that the smarter a scheduler want to be, the more fragile > it will be. > Probably but this one is fairly simple. I think the remaining problems with interactive responsiveness is that bonuses increase too slowly when a latency problem is detected. I.e. a task just gets one extra bonus point when an unacceptable latency is detected regardless of how big the latency is. This means that it may take several cycles for the bonus to be big enough to solve the problem. I'm going to try making the bonus increment proportional to the size of the latency w.r.t. the limit. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-30 13:52 ` [SCHED] wrong priority calc - SIMPLE test case Paolo Ornati 2005-12-31 2:06 ` Peter Williams @ 2005-12-31 8:13 ` Mike Galbraith 2005-12-31 11:00 ` Paolo Ornati 2005-12-31 15:11 ` Paolo Ornati 2006-01-13 1:13 ` Con Kolivas 2 siblings, 2 replies; 53+ messages in thread From: Mike Galbraith @ 2005-12-31 8:13 UTC (permalink / raw) To: Paolo Ornati, Linux Kernel Mailing List Cc: Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams At 02:52 PM 12/30/2005 +0100, Paolo Ornati wrote: >WAS: [SCHED] Totally WRONG prority calculation with specific test-case >(since 2.6.10-bk12) >http://lkml.org/lkml/2005/12/27/114/index.html > >On Wed, 28 Dec 2005 10:26:58 +1100 >Con Kolivas <kernel@kolivas.org> wrote: > > > The issue is that the scheduler interactivity estimator is a state > machine and > > can be fooled to some degree, and a cpu intensive task that just > happens to > > sleep a little bit gets significantly better priority than one that is > fully > > cpu bound all the time. Reverting that change is not a solution because it > > can still be fooled by the same process sleeping lots for a few seconds > or so > > at startup and then changing to the cpu mostly-sleeping slightly > behaviour. > > This "fluctuating" behaviour is in my opinion worse which is why I removed > > it. > >Trying to find a "as simple as possible" test case for this problem >(that I consider a BUG in priority calculation) I've come up with this >very simple program: > >------ sched_fooler.c ------------------------------- Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your little proggy. Taking a quick peek at the rt scheduler changes, nothing poked me in the eye, but by golly, I can't get this kernel to act up, whereas 2.6.14-virgin does. -Mike (off to stare harder rt patch) ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-31 8:13 ` Mike Galbraith @ 2005-12-31 11:00 ` Paolo Ornati 2005-12-31 15:11 ` Paolo Ornati 1 sibling, 0 replies; 53+ messages in thread From: Paolo Ornati @ 2005-12-31 11:00 UTC (permalink / raw) To: Mike Galbraith Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams On Sat, 31 Dec 2005 09:13:24 +0100 Mike Galbraith <efault@gmx.de> wrote: > Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your > little proggy. Taking a quick peek at the rt scheduler changes, nothing > poked me in the eye, but by golly, I can't get this kernel to act up, > whereas 2.6.14-virgin does. Mmm... I get an infinite list of init segfaults trying to boot it. I've tried disabling CONFIG_CC_OPTIMIZE_FOR_SIZE but it didn't help. I'll try later with a simplier ".config". -- Paolo Ornati Linux 2.6.15-rc7-plugsched on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-31 8:13 ` Mike Galbraith 2005-12-31 11:00 ` Paolo Ornati @ 2005-12-31 15:11 ` Paolo Ornati 2005-12-31 16:37 ` Mike Galbraith 1 sibling, 1 reply; 53+ messages in thread From: Paolo Ornati @ 2005-12-31 15:11 UTC (permalink / raw) To: Mike Galbraith Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams On Sat, 31 Dec 2005 09:13:24 +0100 Mike Galbraith <efault@gmx.de> wrote: > Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your > little proggy. Taking a quick peek at the rt scheduler changes, nothing > poked me in the eye, but by golly, I can't get this kernel to act up, > whereas 2.6.14-virgin does. Ok, I've sucessfully booted 2.6.15-rc7-rt1 (I think that I was having troubles with Thread Softirqs and/or Thread Hardirqs). First thing: I've preemption disabled, but it shouldn't matter too much since we are talking about priority calculation... 1) My program isn't defeated at all. If I start it with the same args of the previous examples it "seems" defeated, but it isn't. Lowering the "cpu burn argument" I can reproduce the problem again: "./a.out 200 & ./a.out 333" PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5607 paolo 15 0 2396 320 252 R 56.1 0.1 0:06.79 a.out 5606 paolo 15 0 2396 324 252 R 38.7 0.1 0:04.55 a.out 1 root 16 0 2556 552 468 S 0.0 0.1 0:00.28 init 2) Priority fluctuation - very interesting: playing with the only arg my program has I've found this: ./a.out 200 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5628 paolo 15 0 2392 320 252 R 48.5 0.1 0:18.34 a.out ./a.out 300 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5633 paolo 15 0 2392 324 252 S 50.1 0.1 0:09.42 a.out ./a.out 400 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5634 paolo 15 0 2392 320 252 S 66.7 0.1 0:06.31 a.out ./a.out 500 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5638 paolo 25 0 2396 320 252 R 67.7 0.1 0:14.78 a.out ./a.out 700 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5640 paolo 15 0 2392 320 252 S 80.1 0.1 0:25.88 a.out ./a.out 800 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5644 paolo 17 0 2396 320 252 R 79.6 0.1 0:26.54 a.out In the "./a.out 500" case, the priority starts at something like 16 and then slowly go up to 25 _BUT_ if I start my DD test my cpu-eater priority goes quickly to 16! The real world test case (transcode) is a bit harder to describe: its priority usually goes up to 25, sometimes I've seen it fluctuating a bit (like go to 19 and then back to 25). When I start my DD test I've seen basically 2 different behaviours: A) good PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5788 paolo 25 0 114m 18m 2440 R 82.2 3.7 0:10.16 transcode 5804 paolo 15 0 49860 4500 1896 S 8.5 0.9 0:00.99 tcdecode 5808 paolo 18 0 4952 1520 412 D 5.0 0.3 0:00.36 dd B) bad PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5743 paolo 18 0 114m 18m 2440 R 75.0 3.7 0:26.79 transcode 5759 paolo 15 0 49864 4500 1896 S 7.8 0.9 0:02.71 tcdecode 5750 paolo 16 0 19840 1136 916 S 1.5 0.2 0:00.23 tcdemux 5201 root 15 0 167m 17m 3336 S 0.8 3.5 0:19.38 X 5764 paolo 18 0 4948 1520 412 R 0.7 0.3 0:00.04 dd Sometimes happens A and sometimes happens B... PS: probably all these numbers aren't 100% reproducible... this is what happens on my PC. -- Paolo Ornati Linux 2.6.15-rc7-rt1 on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-31 15:11 ` Paolo Ornati @ 2005-12-31 16:37 ` Mike Galbraith 2005-12-31 17:24 ` Paolo Ornati 2006-01-01 11:39 ` Paolo Ornati 0 siblings, 2 replies; 53+ messages in thread From: Mike Galbraith @ 2005-12-31 16:37 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams At 04:11 PM 12/31/2005 +0100, Paolo Ornati wrote: >On Sat, 31 Dec 2005 09:13:24 +0100 >Mike Galbraith <efault@gmx.de> wrote: > > > Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your > > little proggy. Taking a quick peek at the rt scheduler changes, nothing > > poked me in the eye, but by golly, I can't get this kernel to act up, > > whereas 2.6.14-virgin does. > >Ok, I've sucessfully booted 2.6.15-rc7-rt1 (I think that I was >having troubles with Thread Softirqs and/or Thread Hardirqs). > >First thing: I've preemption disabled, but it shouldn't matter too much >since we are talking about priority calculation... Mine is fully preemptible. >1) My program isn't defeated at all. If I start it with the same args >of the previous examples it "seems" defeated, but it isn't. > >Lowering the "cpu burn argument" I can reproduce the problem again: > >"./a.out 200 & ./a.out 333" > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5607 paolo 15 0 2396 320 252 R 56.1 0.1 0:06.79 a.out > 5606 paolo 15 0 2396 324 252 R 38.7 0.1 0:04.55 a.out > 1 root 16 0 2556 552 468 S 0.0 0.1 0:00.28 init Strange. Using the exact same arguments, I do see some odd bouncing up to high priorities, but they spend the vast majority of their time down at 25. -Mike ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-31 16:37 ` Mike Galbraith @ 2005-12-31 17:24 ` Paolo Ornati 2005-12-31 17:42 ` Paolo Ornati 2006-01-01 11:39 ` Paolo Ornati 1 sibling, 1 reply; 53+ messages in thread From: Paolo Ornati @ 2005-12-31 17:24 UTC (permalink / raw) To: Mike Galbraith Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams On Sat, 31 Dec 2005 17:37:11 +0100 Mike Galbraith <efault@gmx.de> wrote: > >"./a.out 200 & ./a.out 333" > > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > > 5607 paolo 15 0 2396 320 252 R 56.1 0.1 0:06.79 a.out > > 5606 paolo 15 0 2396 324 252 R 38.7 0.1 0:04.55 a.out > > 1 root 16 0 2556 552 468 S 0.0 0.1 0:00.28 init > > Strange. Using the exact same arguments, I do see some odd bouncing up to > high priorities, but they spend the vast majority of their time down at 25. You shouldn't use "the same exact numbers", you should try different args and see if you can reproduce the problem. Or maybe preemption make some difference... I'll try with PREEMPT enabled and see. -- Paolo Ornati Linux 2.6.15-rc7-plugsched on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-31 17:24 ` Paolo Ornati @ 2005-12-31 17:42 ` Paolo Ornati 0 siblings, 0 replies; 53+ messages in thread From: Paolo Ornati @ 2005-12-31 17:42 UTC (permalink / raw) To: Paolo Ornati Cc: Mike Galbraith, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams On Sat, 31 Dec 2005 18:24:40 +0100 Paolo Ornati <ornati@fastwebnet.it> wrote: > You shouldn't use "the same exact numbers", you should try different > args and see if you can reproduce the problem. Or maybe preemption > make some difference... I'll try with PREEMPT enabled and see. Ok, just tried with Complete Preemption: I can easly reproduce the problem. For example: "./a.out 1000 & ./a.out 1239" PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5433 paolo 16 0 2396 324 252 R 50.3 0.1 0:34.67 a.out 5434 paolo 16 0 2392 320 252 R 47.4 0.1 0:30.76 a.out 265 root -48 -5 0 0 0 S 0.6 0.0 0:00.68 IRQ 12 5261 root 15 0 166m 16m 2844 S 0.6 3.3 0:04.81 X 5349 paolo 15 0 86640 22m 15m S 0.6 4.5 0:01.36 konsole 5344 paolo 15 0 98.8m 20m 12m S 0.3 4.1 0:01.64 kicker 5444 paolo 18 0 4948 1520 412 R 0.3 0.3 0:00.05 dd -- Paolo Ornati Linux 2.6.15-rc7-rt1 on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-31 16:37 ` Mike Galbraith 2005-12-31 17:24 ` Paolo Ornati @ 2006-01-01 11:39 ` Paolo Ornati 2006-01-02 9:15 ` Mike Galbraith 1 sibling, 1 reply; 53+ messages in thread From: Paolo Ornati @ 2006-01-01 11:39 UTC (permalink / raw) To: Mike Galbraith Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams On Sat, 31 Dec 2005 17:37:11 +0100 Mike Galbraith <efault@gmx.de> wrote: > Strange. Using the exact same arguments, I do see some odd bouncing up to > high priorities, but they spend the vast majority of their time down at 25. Mmmm... to make it more easly reproducible I've enlarged the sleep time (1 microsecond is likely to be rounded too much and give different results on different hardware/kernel/config...). Compile this _without_ optimizations and try again: ------------------------------------------------ #include <stdlib.h> #include <unistd.h> char buf[1024]; static void burn_cpu(unsigned int x) { int i; for (i=0; i < x; ++i) buf[i%sizeof(buf)] = (x-i)*3; } int main(int argc, char **argv) { unsigned long burn; if (argc != 2) return 1; burn = (unsigned long)atoi(argv[1]); if (!burn) return; while(1) { burn_cpu(burn*1000); usleep(10000); } return 0; } ----------------------------------------- With "./a.out 3000" (and 2.6.15-rc7-rt1) I get this: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5485 paolo 15 0 2396 320 252 R 62.7 0.1 0:09.77 a.out Try different values: 1000, 2000, 3000 ... are you able to reproduce it now? If yes, try to start 2 of them with something like this: "./a.out 3000 & ./a.out 3161" so they are NOT syncronized and they use almost all the CPU time: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5582 paolo 16 0 2396 320 252 S 45.7 0.1 0:05.52 a.out 5583 paolo 15 0 2392 320 252 S 45.7 0.1 0:05.49 a.out This is the bad situation I hate: some cpu-eaters that eat all the CPU time BUT have a really good priority only because they sleeps a bit. -- Paolo Ornati Linux 2.6.15-rc7-rt1 on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-01 11:39 ` Paolo Ornati @ 2006-01-02 9:15 ` Mike Galbraith 2006-01-02 9:50 ` Paolo Ornati 2006-01-09 11:11 ` Mike Galbraith 0 siblings, 2 replies; 53+ messages in thread From: Mike Galbraith @ 2006-01-02 9:15 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams At 12:39 PM 1/1/2006 +0100, Paolo Ornati wrote: >On Sat, 31 Dec 2005 17:37:11 +0100 >Mike Galbraith <efault@gmx.de> wrote: > > > Strange. Using the exact same arguments, I do see some odd bouncing up to > > high priorities, but they spend the vast majority of their time down at 25. > >Mmmm... to make it more easly reproducible I've enlarged the sleep time >(1 microsecond is likely to be rounded too much and give different >results on different hardware/kernel/config...). > >Compile this _without_ optimizations and try again: <snip> >Try different values: 1000, 2000, 3000 ... are you able to reproduce it >now? Yeah. One instance running has to sustain roughly _95%_ cpu before it's classified as a cpu piggy. Not good. >If yes, try to start 2 of them with something like this: > >"./a.out 3000 & ./a.out 3161" > >so they are NOT syncronized and they use almost all the CPU time: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5582 paolo 16 0 2396 320 252 S 45.7 0.1 0:05.52 a.out > 5583 paolo 15 0 2392 320 252 S 45.7 0.1 0:05.49 a.out > >This is the bad situation I hate: some cpu-eaters that eat all the CPU >time BUT have a really good priority only because they sleeps a bit. Yup, your proggy fools the interactivity estimator quite well. This problem was addressed a long time ago, and thought to be more or less cured. Guess not. -Mike ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-02 9:15 ` Mike Galbraith @ 2006-01-02 9:50 ` Paolo Ornati 2006-01-09 11:11 ` Mike Galbraith 1 sibling, 0 replies; 53+ messages in thread From: Paolo Ornati @ 2006-01-02 9:50 UTC (permalink / raw) To: Mike Galbraith Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams On Mon, 02 Jan 2006 10:15:43 +0100 Mike Galbraith <efault@gmx.de> wrote: > >This is the bad situation I hate: some cpu-eaters that eat all the CPU > >time BUT have a really good priority only because they sleeps a bit. > > Yup, your proggy fools the interactivity estimator quite well. This > problem was addressed a long time ago, and thought to be more or less > cured. Guess not. In my original real-life test case (transcode) I found that the problem started with the removing of "interactive_credit": http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/6aa5c93c379ae9e1/a9a83db6446edaf7?lnk=st&q=insubject%3Asched+author%3APaolo+author%3AOrnati&rnum=1&hl=en#a9a83db6446edaf7 This is not actually true... in fact that change only unhidden the problem for that particular test-case. With my little proggy I'm now able to reproduce the problem even with "interactive_credit" applied (for example with a 2.6.10 kernel). Said this, and since "nicksched" doesn't have this problem at all, it is an ingosched (and others as well) problem. -- Paolo Ornati Linux 2.6.15-rc7-plugsched on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-02 9:15 ` Mike Galbraith 2006-01-02 9:50 ` Paolo Ornati @ 2006-01-09 11:11 ` Mike Galbraith 2006-01-09 15:52 ` Mike Galbraith 1 sibling, 1 reply; 53+ messages in thread From: Mike Galbraith @ 2006-01-09 11:11 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams [-- Attachment #1: Type: text/plain, Size: 2290 bytes --] At 10:15 AM 1/2/2006 +0100, Mike Galbraith wrote: >At 12:39 PM 1/1/2006 +0100, Paolo Ornati wrote: >>On Sat, 31 Dec 2005 17:37:11 +0100 >>Mike Galbraith <efault@gmx.de> wrote: >> >> > Strange. Using the exact same arguments, I do see some odd bouncing up to >> > high priorities, but they spend the vast majority of their time down >> at 25. >> >>Mmmm... to make it more easly reproducible I've enlarged the sleep time >>(1 microsecond is likely to be rounded too much and give different >>results on different hardware/kernel/config...). >> >>Compile this _without_ optimizations and try again: > ><snip> > >>Try different values: 1000, 2000, 3000 ... are you able to reproduce it >>now? > >Yeah. One instance running has to sustain roughly _95%_ cpu before it's >classified as a cpu piggy. Not good. > >>If yes, try to start 2 of them with something like this: >> >>"./a.out 3000 & ./a.out 3161" >> >>so they are NOT syncronized and they use almost all the CPU time: >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 5582 paolo 16 0 2396 320 252 S 45.7 0.1 0:05.52 a.out >> 5583 paolo 15 0 2392 320 252 S 45.7 0.1 0:05.49 a.out >> >>This is the bad situation I hate: some cpu-eaters that eat all the CPU >>time BUT have a really good priority only because they sleeps a bit. > >Yup, your proggy fools the interactivity estimator quite well. This >problem was addressed a long time ago, and thought to be more or less >cured. Guess not. Care to try an experiment? I'd be very interested in knowing if the attached patch cures the real-life problem you were investigating. It attempts to catch tasks which the interactivity logic has misidentified, and "pull their plug". It maintains a running plausibility check (slice_avg) against sleep_avg, and if a sustained disparity appears, cuts off a cpu burning task's supply of bonus points such that it has to "run on battery" until the disparity decreases to within acceptable limits. Obviously, anything that affects fairness _will_ affect interactivity to some degree. This simple bolt-on throttle has delayed initiation and accelerated release in the hopes of keeping it's impact acceptable. After some initial testing, It doesn't _seem_ to suck. -Mike [-- Attachment #2: sched_throttle --] [-- Type: application/octet-stream, Size: 3629 bytes --] --- include/linux/sched.h.org Tue Jan 3 09:26:50 2006 +++ include/linux/sched.h Sat Jan 7 14:45:37 2006 @@ -701,8 +701,8 @@ unsigned short ioprio; - unsigned long sleep_avg; - unsigned long long timestamp, last_ran; + unsigned long sleep_avg, slice_avg; + unsigned long long timestamp, last_ran, last_slice; unsigned long long sched_time; /* sched_clock time spent running */ int activated; --- linux-2.6.15/kernel/sched.c.org Sat Jan 7 16:22:13 2006 +++ linux-2.6.15/kernel/sched.c Mon Jan 9 11:50:40 2006 @@ -47,6 +47,7 @@ #include <linux/syscalls.h> #include <linux/times.h> #include <linux/acct.h> +#include <linux/jiffies.h> #include <asm/tlb.h> #include <asm/unistd.h> @@ -1353,7 +1354,7 @@ out_activate: #endif /* CONFIG_SMP */ - if (old_state == TASK_UNINTERRUPTIBLE) { + if (old_state & TASK_UNINTERRUPTIBLE) { rq->nr_uninterruptible--; /* * Tasks on involuntary sleep don't earn @@ -1492,6 +1493,8 @@ */ p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) * CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); + p->slice_avg = NS_MAX_SLEEP_AVG; + p->last_slice = sched_clock(); p->prio = effective_prio(p); @@ -2646,6 +2649,12 @@ runqueue_t *rq = this_rq(); task_t *p = current; unsigned long long now = sched_clock(); +#if 1 + static unsigned long printme = 0; + + if (unlikely(!printme)) + printme = jiffies; +#endif update_cpu_clock(p, rq, now); @@ -2679,6 +2688,7 @@ if ((p->policy == SCHED_RR) && !--p->time_slice) { p->time_slice = task_timeslice(p); p->first_time_slice = 0; + p->last_slice = now; set_tsk_need_resched(p); /* put it at the end of the queue: */ @@ -2687,12 +2697,40 @@ goto out_unlock; } if (!--p->time_slice) { + unsigned long long nsecs = now - p->last_slice; + unsigned long idle, ticks; + int w = 10; + dequeue_task(p, rq->active); set_tsk_need_resched(p); p->prio = effective_prio(p); p->time_slice = task_timeslice(p); p->first_time_slice = 0; + if (nsecs > ~0UL) + nsecs = ~0UL; + ticks = NS_TO_JIFFIES((unsigned long) nsecs); + if (ticks < p->time_slice) + ticks = p->time_slice; + idle = 100 - (100 * p->time_slice / ticks); + p->slice_avg /= NS_MAX_SLEEP_AVG / 100; + /* + * If we're improving our behaviour, speed up the improvement's + * effect so we don't over throttle. + */ + if (idle > p->slice_avg + 10) + w -= (100 * p->slice_avg / idle) / 10; + p->slice_avg = (w * p->slice_avg + idle) / (w + 1); + p->slice_avg *= NS_MAX_SLEEP_AVG / 100; + p->last_slice = now; +#if 1 + if (p->mm && time_after(jiffies, printme + HZ)) { + printk(KERN_DEBUG"%s pid:%d sle:%ld sli:%ld tic:%ld idle:%ld w:%d\n", + p->comm,p->pid,p->sleep_avg,p->slice_avg,ticks,idle,w); + printme = jiffies + HZ; + } +#endif + if (!rq->expired_timestamp) rq->expired_timestamp = jiffies; if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { @@ -3010,7 +3048,7 @@ unlikely(signal_pending(prev)))) prev->state = TASK_RUNNING; else { - if (prev->state == TASK_UNINTERRUPTIBLE) + if (prev->state & TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible++; deactivate_task(prev, rq); } @@ -3095,6 +3133,13 @@ prev->sleep_avg -= run_time; if ((long)prev->sleep_avg <= 0) prev->sleep_avg = 0; + if (prev->state & (TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE) && + prev->sleep_avg > prev->slice_avg + (NS_MAX_SLEEP_AVG/10) && + !rt_task(prev)) + prev->state |= TASK_NONINTERACTIVE; + if (!rt_task(next) && !(next->time_slice % DEF_TIMESLICE)) + next->last_slice = now; + prev->timestamp = prev->last_ran = now; sched_info_switch(prev, next); ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-09 11:11 ` Mike Galbraith @ 2006-01-09 15:52 ` Mike Galbraith 2006-01-09 16:08 ` Con Kolivas 2006-01-09 20:00 ` Paolo Ornati 0 siblings, 2 replies; 53+ messages in thread From: Mike Galbraith @ 2006-01-09 15:52 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams [-- Attachment #1: Type: text/plain, Size: 975 bytes --] At 12:11 PM 1/9/2006 +0100, Mike Galbraith wrote: >Care to try an experiment?... Oops. I guess I should send one that's not mixed p1 and p0. Sorry about that :-/ Anyway, if anyone wants to see a functional demonstration, just try this. Remove the TASK_NONINTERACTIVE in fs/pipe.c in both the stock kernel and this modified one so Davide Libenzi's excellent sleep pattern exploit (irman2) can work [1], and do the below all at the same time ... make -j4 bzImage irman2 thud 3 With the stock kernel, I got bored after a half an hour, and stopped the kernel build. It had produced 40 .o files. The modified kernel finished in 20 minutes vs the 8 minutes it took to produce the same 504 .o files if not under load. -Mike 1. it just so happens that Davide wrote irman2 using pipes... he could have done something else. if anyone doesn't think this is a fair test, just use Paolo's much simpler exploit instead. the result will be about the same. [-- Attachment #2: sched_throttle --] [-- Type: application/octet-stream, Size: 3655 bytes --] --- linux-2.6.15/include/linux/sched.h.org Tue Jan 3 09:26:50 2006 +++ linux-2.6.15/include/linux/sched.h Sat Jan 7 14:45:37 2006 @@ -701,8 +701,8 @@ unsigned short ioprio; - unsigned long sleep_avg; - unsigned long long timestamp, last_ran; + unsigned long sleep_avg, slice_avg; + unsigned long long timestamp, last_ran, last_slice; unsigned long long sched_time; /* sched_clock time spent running */ int activated; --- linux-2.6.15/kernel/sched.c.org Sat Jan 7 16:22:13 2006 +++ linux-2.6.15/kernel/sched.c Mon Jan 9 11:50:40 2006 @@ -47,6 +47,7 @@ #include <linux/syscalls.h> #include <linux/times.h> #include <linux/acct.h> +#include <linux/jiffies.h> #include <asm/tlb.h> #include <asm/unistd.h> @@ -1353,7 +1354,7 @@ out_activate: #endif /* CONFIG_SMP */ - if (old_state == TASK_UNINTERRUPTIBLE) { + if (old_state & TASK_UNINTERRUPTIBLE) { rq->nr_uninterruptible--; /* * Tasks on involuntary sleep don't earn @@ -1492,6 +1493,8 @@ */ p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) * CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); + p->slice_avg = NS_MAX_SLEEP_AVG; + p->last_slice = sched_clock(); p->prio = effective_prio(p); @@ -2646,6 +2649,12 @@ runqueue_t *rq = this_rq(); task_t *p = current; unsigned long long now = sched_clock(); +#if 1 + static unsigned long printme = 0; + + if (unlikely(!printme)) + printme = jiffies; +#endif update_cpu_clock(p, rq, now); @@ -2679,6 +2688,7 @@ if ((p->policy == SCHED_RR) && !--p->time_slice) { p->time_slice = task_timeslice(p); p->first_time_slice = 0; + p->last_slice = now; set_tsk_need_resched(p); /* put it at the end of the queue: */ @@ -2687,12 +2697,40 @@ goto out_unlock; } if (!--p->time_slice) { + unsigned long long nsecs = now - p->last_slice; + unsigned long idle, ticks; + int w = 10; + dequeue_task(p, rq->active); set_tsk_need_resched(p); p->prio = effective_prio(p); p->time_slice = task_timeslice(p); p->first_time_slice = 0; + if (nsecs > ~0UL) + nsecs = ~0UL; + ticks = NS_TO_JIFFIES((unsigned long) nsecs); + if (ticks < p->time_slice) + ticks = p->time_slice; + idle = 100 - (100 * p->time_slice / ticks); + p->slice_avg /= NS_MAX_SLEEP_AVG / 100; + /* + * If we're improving our behaviour, speed up the improvement's + * effect so we don't over throttle. + */ + if (idle > p->slice_avg + 10) + w -= (100 * p->slice_avg / idle) / 10; + p->slice_avg = (w * p->slice_avg + idle) / (w + 1); + p->slice_avg *= NS_MAX_SLEEP_AVG / 100; + p->last_slice = now; +#if 1 + if (p->mm && time_after(jiffies, printme + HZ)) { + printk(KERN_DEBUG"%s pid:%d sle:%ld sli:%ld tic:%ld idle:%ld w:%d\n", + p->comm,p->pid,p->sleep_avg,p->slice_avg,ticks,idle,w); + printme = jiffies + HZ; + } +#endif + if (!rq->expired_timestamp) rq->expired_timestamp = jiffies; if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { @@ -3010,7 +3048,7 @@ unlikely(signal_pending(prev)))) prev->state = TASK_RUNNING; else { - if (prev->state == TASK_UNINTERRUPTIBLE) + if (prev->state & TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible++; deactivate_task(prev, rq); } @@ -3095,6 +3133,13 @@ prev->sleep_avg -= run_time; if ((long)prev->sleep_avg <= 0) prev->sleep_avg = 0; + if (prev->state & (TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE) && + prev->sleep_avg > prev->slice_avg + (NS_MAX_SLEEP_AVG/10) && + !rt_task(prev)) + prev->state |= TASK_NONINTERACTIVE; + if (!rt_task(next) && !(next->time_slice % DEF_TIMESLICE)) + next->last_slice = now; + prev->timestamp = prev->last_ran = now; sched_info_switch(prev, next); [-- Attachment #3: irman2.c --] [-- Type: text/plain, Size: 4107 bytes --] /* * irman by Davide Libenzi ( irman load generator ) * Copyright (C) 2003 Davide Libenzi * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA * * Davide Libenzi <davidel@xmailserver.org> * */ #include <sys/types.h> #include <sys/stat.h> #include <sys/time.h> #include <sys/socket.h> #include <sys/signal.h> #include <sys/resource.h> #include <fcntl.h> #include <stdlib.h> #include <stdio.h> #include <string.h> #include <unistd.h> #include <errno.h> #define BUFSIZE (1024 * 32) static int *pipes, *child; static int num_pipes, num_active, num_child; static unsigned long burn_ms; static char buf1[BUFSIZE], buf2[BUFSIZE]; static volatile sig_atomic_t run = 1; pid_t parent; static void signal_all(int signum) { if (getpid() == parent) { while (num_child >= 0) { kill(child[num_child], SIGKILL); num_child--; } exit(0); } else if (signum == SIGKILL || getppid() == 1) run = 0; } unsigned long long getustime(void) { struct timeval tm; gettimeofday(&tm, NULL); return (unsigned long long) tm.tv_sec * 1000ULL + (unsigned long long) tm.tv_usec / 1000ULL; } int burn_ms_cpu(unsigned long ms) { int i, cmp = 0; unsigned long long ts; ts = getustime(); do { for (i = 0; i < 4; i++) cmp += memcmp(buf1, buf2, BUFSIZE); } while (ts + ms > getustime()); return cmp; } pid_t hog_process(void) { pid_t pid; if (!(pid = fork())) { while (run) { printf("HOG running %u\n", time(NULL)); burn_ms_cpu(burn_ms); } exit(0); } return pid; } pid_t irman_process(int n) { int nn; pid_t pid; u_char ch; if (!(pid = fork())) { if ((nn = n + num_active) >= num_pipes) nn -= num_pipes; while (run) { printf("reading %u\n", n); read(pipes[2 * n], &ch, 1); burn_ms_cpu(burn_ms); printf("writing %u\n", nn); write(pipes[2 * nn + 1], "s", 1); } exit(0); } return pid; } int main (int argc, char **argv) { struct rlimit rl; int i, c; long work; int *cp, run_secs = 0; extern char *optarg; struct sigaction action; parent = getpid(); num_pipes = 40; num_active = 1; burn_ms = 300; while ((c = getopt(argc, argv, "n:b:a:s:")) != -1) { switch (c) { case 'n': num_pipes = atoi(optarg); break; case 'b': burn_ms = atoi(optarg); break; case 'a': num_active = atoi(optarg); break; case 's': run_secs = 1 + atoi(optarg); break; default: fprintf(stderr, "Illegal argument \"%c\"\n", c); exit(1); } } rl.rlim_cur = rl.rlim_max = num_pipes * 2 + 50; if (setrlimit(RLIMIT_NOFILE, &rl) == -1) { perror("setrlimit"); exit(1); } pipes = calloc(num_pipes * 2, sizeof(int)); if (pipes == NULL) { perror("malloc"); exit(1); } child = calloc(num_pipes, sizeof(int)); if (child == NULL) { perror("malloc"); exit(1); } for (cp = pipes, i = 0; i < num_pipes; i++, cp += 2) { if (pipe(cp) == -1) { perror("pipe"); exit(1); } } memset(buf1, 'f', sizeof(buf1)); memset(buf2, 'f', sizeof(buf2)); sigemptyset(&action.sa_mask); /* establish termination handler */ action.sa_handler = signal_all; action.sa_flags = SA_NODEFER; if (sigaction(SIGTERM, &action, NULL) == -1) { perror("Could not install signal handler"); exit(1); } for (i = 0; i < num_pipes; i++) child[i] = irman_process(i); child[i] = hog_process(); num_child = i; for (i = 0; i < num_active; i++) write(pipes[2 * i + 1], "s", 1); while (--run_secs) sleep(1); signal_all(SIGKILL); exit(0); } [-- Attachment #4: thud.c --] [-- Type: text/plain, Size: 1279 bytes --] /* thud.c */ #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/time.h> #include <time.h> #include <sched.h> /* These are used as strings so that strcmp can be used as a delay loop */ char *s1, *s2; /* 20000 is fine on my 333MHz Celeron. Adjust so that system time is not excessive */ #define DELAY 20000 void busy_wait(long sec, long usec) { struct timeval tv; long long end_usec; gettimeofday(&tv,0); end_usec=(long long)(sec+tv.tv_sec)*1000000 + tv.tv_usec+usec; while (((long long)tv.tv_sec*1000000 + tv.tv_usec) < end_usec) { gettimeofday(&tv,0); #if 1 /* MIKEDIDIT */ strcmp(s1,s2); /* yuck */ #else sched_yield(); #endif } } int main(int argc, char**argv) { struct timespec st={10,50000000}; int n=DELAY; int parent=1; if (argc<2) {fprintf(stderr,"Syntax: thud <children>\n"); return 0; } s1=malloc(n); s2=malloc(n); memset(s1,33,n); memset(s2,33,n); s1[n-1]=0; s2[n-1]=0; n=atoi(argv[1]); fprintf(stderr,"starting %d children\n",n); for (; n>0; n--) if (fork()==0) { sched_yield(); parent=0; break; } while (1) { nanosleep(&st, 0); if (parent) printf("running..."); if (parent) fflush(stdout); busy_wait(6,0); if (parent) printf("done\n"); } return 0; } ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-09 15:52 ` Mike Galbraith @ 2006-01-09 16:08 ` Con Kolivas 2006-01-09 18:14 ` Mike Galbraith 2006-01-09 20:00 ` Paolo Ornati 1 sibling, 1 reply; 53+ messages in thread From: Con Kolivas @ 2006-01-09 16:08 UTC (permalink / raw) To: Mike Galbraith Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams On Tuesday 10 January 2006 02:52, Mike Galbraith wrote: > At 12:11 PM 1/9/2006 +0100, Mike Galbraith wrote: > >Care to try an experiment?... > > Oops. I guess I should send one that's not mixed p1 and p0. Sorry about > that :-/ > > Anyway, if anyone wants to see a functional demonstration, just try > this. Remove the TASK_NONINTERACTIVE in fs/pipe.c in both the stock kernel > and this modified one so Davide Libenzi's excellent sleep pattern exploit > (irman2) can work [1], and do the below all at the same time ... > > make -j4 bzImage > irman2 > thud 3 > > With the stock kernel, I got bored after a half an hour, and stopped the > kernel build. It had produced 40 .o files. The modified kernel finished > in 20 minutes vs the 8 minutes it took to produce the same 504 .o files if > not under load. > > -Mike > > 1. it just so happens that Davide wrote irman2 using pipes... he could > have done something else. if anyone doesn't think this is a fair test, > just use Paolo's much simpler exploit instead. the result will be about > the same. Want to try with a few other schedulers using plugsched? Con ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-09 16:08 ` Con Kolivas @ 2006-01-09 18:14 ` Mike Galbraith 0 siblings, 0 replies; 53+ messages in thread From: Mike Galbraith @ 2006-01-09 18:14 UTC (permalink / raw) To: Con Kolivas Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams At 03:08 AM 1/10/2006 +1100, Con Kolivas wrote: >On Tuesday 10 January 2006 02:52, Mike Galbraith wrote: > > > > Anyway, if anyone wants to see a functional demonstration, just try > > this. Remove the TASK_NONINTERACTIVE in fs/pipe.c in both the stock kernel > > and this modified one so Davide Libenzi's excellent sleep pattern exploit > > (irman2) can work [1], and do the below all at the same time ... > > > > make -j4 bzImage > > irman2 > > thud 3 > >... > > >Want to try with a few other schedulers using plugsched? Not really. These exploits were specifically aimed at the current scheduler's Achilles Heel. Besides, that carries implications of replacing this one, and I'm pretty fond of it. -Mike ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-09 15:52 ` Mike Galbraith 2006-01-09 16:08 ` Con Kolivas @ 2006-01-09 20:00 ` Paolo Ornati 2006-01-09 20:23 ` Paolo Ornati 2006-01-10 7:08 ` Mike Galbraith 1 sibling, 2 replies; 53+ messages in thread From: Paolo Ornati @ 2006-01-09 20:00 UTC (permalink / raw) To: Mike Galbraith Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams On Mon, 09 Jan 2006 16:52:17 +0100 Mike Galbraith <efault@gmx.de> wrote: > >Care to try an experiment?... Yes. With my simple proggy things improve a bit: ./a.out 7000 & ./a.out 6537 & PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5579 paolo 19 0 2392 288 228 R 51.9 0.1 0:22.38 a.out 5578 paolo 20 0 2392 288 228 R 43.9 0.1 0:22.50 a.out As you can see they don't get priority 15/16 anymore :) DD test: 256 MB, ~8.5s (instead of 8) In pratice the more CPU they use the more their priority is penalized... BUT if I start more of them (3/4) I'm able to fool it. "./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873" 2 TOP's snapshots: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5625 paolo 17 0 2392 288 228 R 31.6 0.1 0:10.74 a.out 5626 paolo 17 0 2392 288 228 R 28.8 0.1 0:09.16 a.out 5627 paolo 17 0 2392 288 228 R 22.2 0.1 0:07.59 a.out 5624 paolo 17 0 2392 288 228 R 17.4 0.1 0:08.67 a.out PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5626 paolo 16 0 2392 288 228 R 30.1 0.1 0:39.95 a.out 5627 paolo 16 0 2392 288 228 R 24.1 0.1 0:34.93 a.out 5625 paolo 18 0 2392 288 228 R 23.5 0.1 0:37.53 a.out 5624 paolo 18 0 2392 288 228 R 21.9 0.1 0:37.60 a.out 5193 root 15 0 167m 17m 2916 S 0.2 3.5 0:09.67 X 5638 paolo 18 0 4952 1468 372 R 0.2 0.3 0:00.15 dd DD test (256MB): real 3m37.122s (instead of 8s) REAL LIFE TEST (transcode) While running only transcode it gets priority 25: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5857 paolo 25 0 114m 18m 2424 R 90.9 3.7 0:14.28 transcode 5873 paolo 19 0 49860 4452 1860 S 8.6 0.9 0:01.40 tcdecode 5308 paolo 16 0 86796 22m 15m R 0.2 4.4 0:06.26 konsole 5687 paolo 16 0 98648 37m 9348 S 0.2 7.5 0:02.11 perl 5872 paolo 24 0 21864 1064 600 S 0.2 0.2 0:00.01 tcextract But if I run also the DD test, "transcode" priority start fluctuating and can go down to 18/19 (from time to time) interfering with DD: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5694 paolo 19 0 114m 18m 2424 R 75.1 3.7 0:42.29 transcode 5710 paolo 25 0 49856 4452 1860 R 8.0 0.9 0:04.36 tcdecode 5726 paolo 18 0 4952 1468 372 R 4.0 0.3 0:00.77 dd This seems to happen because also transcode is reading (not directly but through pipes) from disk so the massive disk usage of DD interferes with it, this leads to transcode using less CPU and getting better priority. The exact behaviour changes time to time... but seems to confirm my teory. I don't know how can "nicksched" keep transcode priority always to 40 even when I'm running the DD test... I should retry and see. PS: yes, transcode is reading from disk, but SLOWLY... i think that a good read-ahead should fullfill his needs even when doing the HD stressing DD test, no? -- Paolo Ornati Linux 2.6.15-sched_trottle on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-09 20:00 ` Paolo Ornati @ 2006-01-09 20:23 ` Paolo Ornati 2006-01-10 7:08 ` Mike Galbraith 1 sibling, 0 replies; 53+ messages in thread From: Paolo Ornati @ 2006-01-09 20:23 UTC (permalink / raw) To: Paolo Ornati Cc: Mike Galbraith, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams On Mon, 9 Jan 2006 21:00:35 +0100 Paolo Ornati <ornati@fastwebnet.it> wrote: > I don't know how can "nicksched" keep transcode priority always to 40 > even when I'm running the DD test... I should retry and see. I was wrong... the interfering effect is here also with nicksched. The problem is that Linux is smarter than me --> sometimes I forget the effect of the disk-cache ;) For the DD test I always umount/(re)mount the partition where the BIGFILE is stored to discard the associated cache... but I've forgotten to do the same for the file that transcode is reading from ;) -- Paolo Ornati Linux 2.6.15-plugsched on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-09 20:00 ` Paolo Ornati 2006-01-09 20:23 ` Paolo Ornati @ 2006-01-10 7:08 ` Mike Galbraith 2006-01-10 12:07 ` Mike Galbraith 1 sibling, 1 reply; 53+ messages in thread From: Mike Galbraith @ 2006-01-10 7:08 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams At 09:00 PM 1/9/2006 +0100, Paolo Ornati wrote: >On Mon, 09 Jan 2006 16:52:17 +0100 >Mike Galbraith <efault@gmx.de> wrote: > > > >Care to try an experiment?... > >Yes. Thanks. >With my simple proggy things improve a bit: <snip rest of good news> >BUT if I start more of them (3/4) I'm able to fool it. > >"./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873" > >2 TOP's snapshots: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5625 paolo 17 0 2392 288 228 R 31.6 0.1 0:10.74 a.out > 5626 paolo 17 0 2392 288 228 R 28.8 0.1 0:09.16 a.out > 5627 paolo 17 0 2392 288 228 R 22.2 0.1 0:07.59 a.out > 5624 paolo 17 0 2392 288 228 R 17.4 0.1 0:08.67 a.out > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5626 paolo 16 0 2392 288 228 R 30.1 0.1 0:39.95 a.out > 5627 paolo 16 0 2392 288 228 R 24.1 0.1 0:34.93 a.out > 5625 paolo 18 0 2392 288 228 R 23.5 0.1 0:37.53 a.out > 5624 paolo 18 0 2392 288 228 R 21.9 0.1 0:37.60 a.out > 5193 root 15 0 167m 17m 2916 S 0.2 3.5 0:09.67 X > 5638 paolo 18 0 4952 1468 372 R 0.2 0.3 0:00.15 dd > >DD test (256MB): real 3m37.122s (instead of 8s) Ok, I'll take another look. Those should be being throttled. >REAL LIFE TEST (transcode) > >While running only transcode it gets priority 25: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5857 paolo 25 0 114m 18m 2424 R 90.9 3.7 0:14.28 transcode > 5873 paolo 19 0 49860 4452 1860 S 8.6 0.9 0:01.40 tcdecode > 5308 paolo 16 0 86796 22m 15m R 0.2 4.4 0:06.26 konsole > 5687 paolo 16 0 98648 37m 9348 S 0.2 7.5 0:02.11 perl > 5872 paolo 24 0 21864 1064 600 S 0.2 0.2 0:00.01 tcextract > > >But if I run also the DD test, "transcode" priority start fluctuating >and can go down to 18/19 (from time to time) interfering with DD: 19 shouldn't interfere from the cpu side, but 18 will. > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5694 paolo 19 0 114m 18m 2424 R 75.1 3.7 0:42.29 transcode > 5710 paolo 25 0 49856 4452 1860 R 8.0 0.9 0:04.36 tcdecode > 5726 paolo 18 0 4952 1468 372 R 4.0 0.3 0:00.77 dd > > >This seems to happen because also transcode is reading (not directly but >through pipes) from disk so the massive disk usage of DD interferes >with it, this leads to transcode using less CPU and getting better >priority. It can't be pipe waits, they're disabled in the kernel. Most likely the credit we get for being activated without having yet been selected. >The exact behaviour changes time to time... but seems to confirm my >teory. > >I don't know how can "nicksched" keep transcode priority always to 40 >even when I'm running the DD test... I should retry and see. > > >PS: yes, transcode is reading from disk, but SLOWLY... i think that a >good read-ahead should fullfill his needs even when doing the HD >stressing DD test, no? Dunno. -Mike ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-10 7:08 ` Mike Galbraith @ 2006-01-10 12:07 ` Mike Galbraith 2006-01-10 12:56 ` Paolo Ornati 0 siblings, 1 reply; 53+ messages in thread From: Mike Galbraith @ 2006-01-10 12:07 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams At 08:08 AM 1/10/2006 +0100, Mike Galbraith wrote: >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 5626 paolo 16 0 2392 288 228 R 30.1 0.1 0:39.95 a.out >> 5627 paolo 16 0 2392 288 228 R 24.1 0.1 0:34.93 a.out >> 5625 paolo 18 0 2392 288 228 R 23.5 0.1 0:37.53 a.out >> 5624 paolo 18 0 2392 288 228 R 21.9 0.1 0:37.60 a.out >> 5193 root 15 0 167m 17m 2916 S 0.2 3.5 0:09.67 X >> 5638 paolo 18 0 4952 1468 372 R 0.2 0.3 0:00.15 dd >> >>DD test (256MB): real 3m37.122s (instead of 8s) > >Ok, I'll take another look. Those should be being throttled. Can you please try this version? It tries harder to correct any imbalance... hopefully not too hard. In case you didn't notice, you need to let your exploits run for a bit before the throttling will take effect. I intentionally start tasks off at 0 cpu usage, so it takes a bit to come up to it's real usage. -Mike ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-10 12:07 ` Mike Galbraith @ 2006-01-10 12:56 ` Paolo Ornati 2006-01-10 13:01 ` Mike Galbraith 0 siblings, 1 reply; 53+ messages in thread From: Paolo Ornati @ 2006-01-10 12:56 UTC (permalink / raw) To: Mike Galbraith Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams On Tue, 10 Jan 2006 13:07:33 +0100 Mike Galbraith <efault@gmx.de> wrote: > Can you please try this version? It tries harder to correct any It seems that you have forgotten the to attach the patch... > imbalance... hopefully not too hard. In case you didn't notice, you need > to let your exploits run for a bit before the throttling will take > effect. I intentionally start tasks off at 0 cpu usage, so it takes a bit > to come up to it's real usage. Yes, I've seen it. -- Paolo Ornati Linux 2.6.15-plugsched on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-10 12:56 ` Paolo Ornati @ 2006-01-10 13:01 ` Mike Galbraith 2006-01-10 13:53 ` Paolo Ornati 0 siblings, 1 reply; 53+ messages in thread From: Mike Galbraith @ 2006-01-10 13:01 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams [-- Attachment #1: Type: text/plain, Size: 365 bytes --] At 01:56 PM 1/10/2006 +0100, Paolo Ornati wrote: >On Tue, 10 Jan 2006 13:07:33 +0100 >Mike Galbraith <efault@gmx.de> wrote: > > > Can you please try this version? It tries harder to correct any > >It seems that you have forgotten the to attach the patch... Drat. At least I'm not the first to ever do so :) (double checks) It's now attached. -Mike [-- Attachment #2: sched_throttle2 --] [-- Type: application/octet-stream, Size: 4422 bytes --] --- linux-2.6.15/include/linux/sched.h.org Tue Jan 3 09:26:50 2006 +++ linux-2.6.15/include/linux/sched.h Sat Jan 7 14:45:37 2006 @@ -701,8 +701,8 @@ unsigned short ioprio; - unsigned long sleep_avg; - unsigned long long timestamp, last_ran; + unsigned long sleep_avg, slice_avg; + unsigned long long timestamp, last_ran, last_slice; unsigned long long sched_time; /* sched_clock time spent running */ int activated; --- linux-2.6.15/kernel/sched.c.org Sat Jan 7 16:22:13 2006 +++ linux-2.6.15/kernel/sched.c Tue Jan 10 13:59:44 2006 @@ -127,6 +127,20 @@ (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \ MAX_SLEEP_AVG) +/* + * If a task's sleep_avg strays too far from it's slice_avg, this + * task is using more cpu than it's sleep_avg would indicate. When + * such a disparity is detected, prevent additional sleep time from + * being added to the existing imbalance, and increase the rate at + * which sleep_average is consumed. + */ +#define SLEEP_AVG_IMBALANCE(p) \ + ((p)->sleep_avg > (p)->slice_avg + (NS_MAX_SLEEP_AVG/10)) + +#define CPU_PENALTY(p) \ + (NS_TO_JIFFIES(min((p)->sleep_avg, (p)->slice_avg)) * MAX_BONUS / \ + MAX_SLEEP_AVG) + #define GRANULARITY (10 * HZ / 1000 ? : 1) #ifdef CONFIG_SMP @@ -744,7 +758,7 @@ else sleep_time = (unsigned long)__sleep_time; - if (likely(sleep_time > 0)) { + if (likely(sleep_time > 0 && !SLEEP_AVG_IMBALANCE(p))) { /* * User tasks that sleep a long time are categorised as * idle and will get just interactive status to stay active & @@ -1353,7 +1367,7 @@ out_activate: #endif /* CONFIG_SMP */ - if (old_state == TASK_UNINTERRUPTIBLE) { + if (old_state & TASK_UNINTERRUPTIBLE) { rq->nr_uninterruptible--; /* * Tasks on involuntary sleep don't earn @@ -1368,7 +1382,7 @@ * sleep is handled in a priority-neutral manner, no priority * boost and no penalty.) */ - if (old_state & TASK_NONINTERACTIVE) + if (old_state & TASK_NONINTERACTIVE || SLEEP_AVG_IMBALANCE(p)) __activate_task(p, rq); else activate_task(p, rq, cpu == this_cpu); @@ -1492,6 +1506,8 @@ */ p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) * CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); + p->slice_avg = NS_MAX_SLEEP_AVG; + p->last_slice = sched_clock(); p->prio = effective_prio(p); @@ -2679,6 +2695,7 @@ if ((p->policy == SCHED_RR) && !--p->time_slice) { p->time_slice = task_timeslice(p); p->first_time_slice = 0; + p->last_slice = now; set_tsk_need_resched(p); /* put it at the end of the queue: */ @@ -2687,12 +2704,33 @@ goto out_unlock; } if (!--p->time_slice) { + unsigned long long nsecs = now - p->last_slice; + unsigned long idle, ticks; + int w = 10; + dequeue_task(p, rq->active); set_tsk_need_resched(p); p->prio = effective_prio(p); p->time_slice = task_timeslice(p); p->first_time_slice = 0; + if (nsecs > ~0UL) + nsecs = ~0UL; + ticks = NS_TO_JIFFIES((unsigned long) nsecs); + if (ticks < p->time_slice) + ticks = p->time_slice; + idle = 100 - (100 * p->time_slice / ticks); + p->slice_avg /= NS_MAX_SLEEP_AVG / 100; + /* + * If the task is lowering it's cpu usage, speed up the + * effect on slice_avg so we don't over throttle. + */ + if (idle > p->slice_avg + 10) + w -= (100 * p->slice_avg / idle) / 10; + p->slice_avg = (w * p->slice_avg + idle) / (w + 1); + p->slice_avg *= NS_MAX_SLEEP_AVG / 100; + p->last_slice = now; + if (!rq->expired_timestamp) rq->expired_timestamp = jiffies; if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { @@ -2996,7 +3034,7 @@ * Tasks charged proportionately less run_time at high sleep_avg to * delay them losing their interactive status */ - run_time /= (CURRENT_BONUS(prev) ? : 1); + run_time /= (CPU_PENALTY(prev) ? : 1); spin_lock_irq(&rq->lock); @@ -3010,7 +3048,7 @@ unlikely(signal_pending(prev)))) prev->state = TASK_RUNNING; else { - if (prev->state == TASK_UNINTERRUPTIBLE) + if (prev->state & TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible++; deactivate_task(prev, rq); } @@ -3095,6 +3133,12 @@ prev->sleep_avg -= run_time; if ((long)prev->sleep_avg <= 0) prev->sleep_avg = 0; + /* + * Enable detection of the beginning of a slice at tick time. + */ + if (!rt_task(next) && !(next->time_slice % DEF_TIMESLICE)) + next->last_slice = now; + prev->timestamp = prev->last_ran = now; sched_info_switch(prev, next); ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-10 13:01 ` Mike Galbraith @ 2006-01-10 13:53 ` Paolo Ornati 2006-01-10 15:18 ` Mike Galbraith 0 siblings, 1 reply; 53+ messages in thread From: Paolo Ornati @ 2006-01-10 13:53 UTC (permalink / raw) To: Mike Galbraith Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams On Tue, 10 Jan 2006 14:01:36 +0100 Mike Galbraith <efault@gmx.de> wrote: > > > Can you please try this version? It tries harder to correct any > > > >It seems that you have forgotten the to attach the patch... > > Drat. At least I'm not the first to ever do so :) This version basically works like the the previous, except that it makes the priority adjustment faster (that is fine). However I can fool it the same way. "./a.out 7000" PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5459 paolo 22 0 2392 288 228 S 71.3 0.1 0:09.47 a.out "./a.out 3000" PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5493 paolo 19 0 2396 292 228 R 49.8 0.1 0:14.42 a.out "./a.out 1500" PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5495 paolo 18 0 2396 288 228 S 33.4 0.1 0:09.60 a.out Fooling it: "./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873 &" PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5502 paolo 19 0 2392 288 228 R 27.0 0.1 0:05.64 a.out 5503 paolo 19 0 2396 288 228 R 26.0 0.1 0:07.50 a.out 5505 paolo 19 0 2396 292 228 R 25.6 0.1 0:07.24 a.out 5504 paolo 18 0 2392 288 228 R 21.0 0.1 0:06.78 a.out (priorities fluctuate between 18/19) Again with more of them: ./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873& ./a.out 6245 & ./a.out 5467 & PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5525 paolo 18 0 2396 288 228 R 26.4 0.1 0:07.48 a.out 5521 paolo 19 0 2396 288 228 R 22.0 0.1 0:09.00 a.out 5524 paolo 19 0 2392 288 228 R 19.6 0.1 0:07.21 a.out 5523 paolo 19 0 2392 288 228 R 13.0 0.1 0:10.60 a.out 5520 paolo 19 0 2392 288 228 R 11.0 0.1 0:08.46 a.out 5522 paolo 19 0 2396 288 228 R 7.8 0.1 0:07.14 a.out PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5528 paolo 18 0 2392 288 228 R 19.7 0.1 0:18.15 a.out 5533 paolo 15 0 2396 288 228 S 19.3 0.1 0:19.12 a.out 5531 paolo 18 0 2396 288 228 R 18.5 0.1 0:19.23 a.out 5532 paolo 17 0 2392 288 228 R 15.1 0.1 0:18.55 a.out 5529 paolo 18 0 2396 288 228 R 14.7 0.1 0:13.05 a.out 5530 paolo 18 0 2392 288 228 R 12.5 0.1 0:20.42 a.out PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5530 paolo 18 0 2392 288 228 R 21.0 0.1 0:25.42 a.out 5533 paolo 18 0 2396 288 228 R 20.2 0.1 0:24.75 a.out 5529 paolo 18 0 2396 288 228 R 16.2 0.1 0:17.68 a.out 5532 paolo 18 0 2392 288 228 R 14.8 0.1 0:23.33 a.out 5531 paolo 18 0 2396 288 228 R 14.4 0.1 0:23.96 a.out 5528 paolo 18 0 2392 288 228 R 13.6 0.1 0:23.03 a.out -- Paolo Ornati Linux 2.6.15-sched_trottle2 on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-10 13:53 ` Paolo Ornati @ 2006-01-10 15:18 ` Mike Galbraith 0 siblings, 0 replies; 53+ messages in thread From: Mike Galbraith @ 2006-01-10 15:18 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams At 02:53 PM 1/10/2006 +0100, Paolo Ornati wrote: >On Tue, 10 Jan 2006 14:01:36 +0100 >Mike Galbraith <efault@gmx.de> wrote: > > > > > Can you please try this version? It tries harder to correct any > > > > > >It seems that you have forgotten the to attach the patch... > > > > Drat. At least I'm not the first to ever do so :) > >This version basically works like the the previous, except that it makes >the priority adjustment faster (that is fine). > >However I can fool it the same way. > >"./a.out 7000" > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5459 paolo 22 0 2392 288 228 S 71.3 0.1 0:09.47 a.out > >"./a.out 3000" > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5493 paolo 19 0 2396 292 228 R 49.8 0.1 0:14.42 a.out > >"./a.out 1500" > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5495 paolo 18 0 2396 288 228 S 33.4 0.1 0:09.60 a.out > > >Fooling it: > >"./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873 &" > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5502 paolo 19 0 2392 288 228 R 27.0 0.1 0:05.64 a.out > 5503 paolo 19 0 2396 288 228 R 26.0 0.1 0:07.50 a.out > 5505 paolo 19 0 2396 292 228 R 25.6 0.1 0:07.24 a.out > 5504 paolo 18 0 2392 288 228 R 21.0 0.1 0:06.78 a.out > >(priorities fluctuate between 18/19) > > >Again with more of them: > >./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873& ./a.out 6245 & >./a.out 5467 & > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5525 paolo 18 0 2396 288 228 R 26.4 0.1 0:07.48 a.out > 5521 paolo 19 0 2396 288 228 R 22.0 0.1 0:09.00 a.out > 5524 paolo 19 0 2392 288 228 R 19.6 0.1 0:07.21 a.out > 5523 paolo 19 0 2392 288 228 R 13.0 0.1 0:10.60 a.out > 5520 paolo 19 0 2392 288 228 R 11.0 0.1 0:08.46 a.out > 5522 paolo 19 0 2396 288 228 R 7.8 0.1 0:07.14 a.out > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5528 paolo 18 0 2392 288 228 R 19.7 0.1 0:18.15 a.out > 5533 paolo 15 0 2396 288 228 S 19.3 0.1 0:19.12 a.out > 5531 paolo 18 0 2396 288 228 R 18.5 0.1 0:19.23 a.out > 5532 paolo 17 0 2392 288 228 R 15.1 0.1 0:18.55 a.out > 5529 paolo 18 0 2396 288 228 R 14.7 0.1 0:13.05 a.out > 5530 paolo 18 0 2392 288 228 R 12.5 0.1 0:20.42 a.out > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5530 paolo 18 0 2392 288 228 R 21.0 0.1 0:25.42 a.out > 5533 paolo 18 0 2396 288 228 R 20.2 0.1 0:24.75 a.out > 5529 paolo 18 0 2396 288 228 R 16.2 0.1 0:17.68 a.out > 5532 paolo 18 0 2392 288 228 R 14.8 0.1 0:23.33 a.out > 5531 paolo 18 0 2396 288 228 R 14.4 0.1 0:23.96 a.out > 5528 paolo 18 0 2392 288 228 R 13.6 0.1 0:23.03 a.out I don't think it's being fooled, it's pulling these guys down, it's just that the plausible limit goes up the more tasks you're sharing the cpu with, so they hit the throttle less. If you start a bunch of tasks who only do a short burn such that their individual cpu usage isn't much, but together they use all available cpu, you can still severely starve non-sleeping == low priority tasks. This patch won't help for that... that's an entirely different problem, and fortunately one that doesn't seem to happen much in real life otherwise folks would be bitching like crazy. To guard against it, there are a lot of different things you can do. All of the ways that I know of are very bad for the desktop user, and fortunately for me, beyond the intended scope of this patch :) -Mike ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2005-12-30 13:52 ` [SCHED] wrong priority calc - SIMPLE test case Paolo Ornati 2005-12-31 2:06 ` Peter Williams 2005-12-31 8:13 ` Mike Galbraith @ 2006-01-13 1:13 ` Con Kolivas 2006-01-13 1:32 ` Con Kolivas 2006-01-13 10:46 ` Paolo Ornati 2 siblings, 2 replies; 53+ messages in thread From: Con Kolivas @ 2006-01-13 1:13 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams [-- Attachment #1: Type: text/plain, Size: 6936 bytes --] On Saturday 31 December 2005 00:52, Paolo Ornati wrote: > WAS: [SCHED] Totally WRONG prority calculation with specific test-case > (since 2.6.10-bk12) > http://lkml.org/lkml/2005/12/27/114/index.html > > On Wed, 28 Dec 2005 10:26:58 +1100 > > Con Kolivas <kernel@kolivas.org> wrote: > > The issue is that the scheduler interactivity estimator is a state > > machine and can be fooled to some degree, and a cpu intensive task that > > just happens to sleep a little bit gets significantly better priority > > than one that is fully cpu bound all the time. Reverting that change is > > not a solution because it can still be fooled by the same process > > sleeping lots for a few seconds or so at startup and then changing to the > > cpu mostly-sleeping slightly behaviour. This "fluctuating" behaviour is > > in my opinion worse which is why I removed it. > > Trying to find a "as simple as possible" test case for this problem > (that I consider a BUG in priority calculation) I've come up with this > very simple program: Hi Paolo. Can you try the following patch on 2.6.15 please? I'm interested in how adversely this affects interactive performance as well as whether it helps your test case. Thanks, Con --- include/linux/sched.h | 9 +++++- kernel/sched.c | 72 ++++++++++++++++++++++---------------------------- 2 files changed, 41 insertions(+), 40 deletions(-) Index: linux-2.6.15/include/linux/sched.h =================================================================== --- linux-2.6.15.orig/include/linux/sched.h +++ linux-2.6.15/include/linux/sched.h @@ -683,6 +683,13 @@ static inline void prefetch_stack(struct struct audit_context; /* See audit.c */ struct mempolicy; +enum sleep_type { + SLEEP_NORMAL, + SLEEP_NONINTERACTIVE, + SLEEP_INTERACTIVE, + SLEEP_INTERRUPTED, +}; + struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ struct thread_info *thread_info; @@ -704,7 +711,7 @@ struct task_struct { unsigned long sleep_avg; unsigned long long timestamp, last_ran; unsigned long long sched_time; /* sched_clock time spent running */ - int activated; + enum sleep_type sleep_type; unsigned long policy; cpumask_t cpus_allowed; Index: linux-2.6.15/kernel/sched.c =================================================================== --- linux-2.6.15.orig/kernel/sched.c +++ linux-2.6.15/kernel/sched.c @@ -751,31 +751,22 @@ static int recalc_task_prio(task_t *p, u * prevent them suddenly becoming cpu hogs and starving * other processes. */ - if (p->mm && p->activated != -1 && + if (p->mm && p->sleep_type != SLEEP_NONINTERACTIVE && sleep_time > INTERACTIVE_SLEEP(p)) { p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG - DEF_TIMESLICE); } else { + /* * The lower the sleep avg a task has the more - * rapidly it will rise with sleep time. + * rapidly it will rise with sleep time. This enables + * tasks to rapidly recover to a low latency priority. + * If a task was sleeping with the noninteractive + * label do not apply this non-linear boost */ - sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1; - - /* - * Tasks waking from uninterruptible sleep are - * limited in their sleep_avg rise as they - * are likely to be waiting on I/O - */ - if (p->activated == -1 && p->mm) { - if (p->sleep_avg >= INTERACTIVE_SLEEP(p)) - sleep_time = 0; - else if (p->sleep_avg + sleep_time >= - INTERACTIVE_SLEEP(p)) { - p->sleep_avg = INTERACTIVE_SLEEP(p); - sleep_time = 0; - } - } + if (p->sleep_type != SLEEP_NONINTERACTIVE || p->mm) + sleep_time *= + (MAX_BONUS - CURRENT_BONUS(p)) ? : 1; /* * This code gives a bonus to interactive tasks. @@ -818,11 +809,7 @@ static void activate_task(task_t *p, run if (!rt_task(p)) p->prio = recalc_task_prio(p, now); - /* - * This checks to make sure it's not an uninterruptible task - * that is now waking up. - */ - if (!p->activated) { + if (p->sleep_type != SLEEP_NONINTERACTIVE) { /* * Tasks which were woken up by interrupts (ie. hw events) * are most likely of interactive nature. So we give them @@ -831,13 +818,13 @@ static void activate_task(task_t *p, run * on a CPU, first time around: */ if (in_interrupt()) - p->activated = 2; + p->sleep_type = SLEEP_INTERRUPTED; else { /* * Normal first-time wakeups get a credit too for * on-runqueue time, but it will be weighted down: */ - p->activated = 1; + p->sleep_type = SLEEP_INTERACTIVE; } } p->timestamp = now; @@ -1356,22 +1343,23 @@ out_activate: if (old_state == TASK_UNINTERRUPTIBLE) { rq->nr_uninterruptible--; /* - * Tasks on involuntary sleep don't earn - * sleep_avg beyond just interactive state. + * Tasks waking from uninterruptible sleep are likely + * to be sleeping involuntarily on I/O and are otherwise + * cpu bound so label them as noninteractive. */ - p->activated = -1; - } + p->sleep_type = SLEEP_NONINTERACTIVE; + } else /* * Tasks that have marked their sleep as noninteractive get - * woken up without updating their sleep average. (i.e. their - * sleep is handled in a priority-neutral manner, no priority - * boost and no penalty.) + * woken up with their sleep average not weighted in an + * interactive way. */ - if (old_state & TASK_NONINTERACTIVE) - __activate_task(p, rq); - else - activate_task(p, rq, cpu == this_cpu); + if (old_state & TASK_NONINTERACTIVE) + p->sleep_type = SLEEP_NONINTERACTIVE; + + + activate_task(p, rq, cpu == this_cpu); /* * Sync wakeups (i.e. those types of wakeups where the waker * has indicated that it will leave the CPU in short order) @@ -2938,6 +2926,12 @@ EXPORT_SYMBOL(sub_preempt_count); #endif +static inline int interactive_sleep(enum sleep_type sleep_type) +{ + return (sleep_type == SLEEP_INTERACTIVE || + sleep_type == SLEEP_INTERRUPTED); +} + /* * schedule() is the main scheduler function. */ @@ -3063,12 +3057,12 @@ go_idle: queue = array->queue + idx; next = list_entry(queue->next, task_t, run_list); - if (!rt_task(next) && next->activated > 0) { + if (!rt_task(next) && interactive_sleep(next->sleep_type)) { unsigned long long delta = now - next->timestamp; if (unlikely((long long)(now - next->timestamp) < 0)) delta = 0; - if (next->activated == 1) + if (next->sleep_type == SLEEP_INTERACTIVE) delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; array = next->array; @@ -3081,7 +3075,7 @@ go_idle: } else requeue_task(next, array); } - next->activated = 0; + next->sleep_type = SLEEP_NORMAL; switch_tasks: if (next == rq->idle) schedstat_inc(rq, sched_goidle); [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-13 1:13 ` Con Kolivas @ 2006-01-13 1:32 ` Con Kolivas 2006-01-13 10:46 ` Paolo Ornati 1 sibling, 0 replies; 53+ messages in thread From: Con Kolivas @ 2006-01-13 1:32 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams On Friday 13 January 2006 12:13, Con Kolivas wrote: > On Saturday 31 December 2005 00:52, Paolo Ornati wrote: > > WAS: [SCHED] Totally WRONG prority calculation with specific test-case > > (since 2.6.10-bk12) > > http://lkml.org/lkml/2005/12/27/114/index.html > > > > On Wed, 28 Dec 2005 10:26:58 +1100 > > > > Con Kolivas <kernel@kolivas.org> wrote: > > > The issue is that the scheduler interactivity estimator is a state > > > machine and can be fooled to some degree, and a cpu intensive task that > > > just happens to sleep a little bit gets significantly better priority > > > than one that is fully cpu bound all the time. Reverting that change is > > > not a solution because it can still be fooled by the same process > > > sleeping lots for a few seconds or so at startup and then changing to > > > the cpu mostly-sleeping slightly behaviour. This "fluctuating" > > > behaviour is in my opinion worse which is why I removed it. > > > > Trying to find a "as simple as possible" test case for this problem > > (that I consider a BUG in priority calculation) I've come up with this > > very simple program: > > Hi Paolo. > > Can you try the following patch on 2.6.15 please? I'm interested in how > adversely this affects interactive performance as well as whether it helps > your test case. I should make it clear. This patch _will_ adversely affect interactivity because your test case desires that I/O bound tasks get higher priority, and this patch will do that. This means that I/O bound tasks will be more noticeable now. The question is how much do we trade off one for the other. We almost certainly are biased a little too much on the interactive side on the mainline kernel at the moment. Cheers, Con ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-13 1:13 ` Con Kolivas 2006-01-13 1:32 ` Con Kolivas @ 2006-01-13 10:46 ` Paolo Ornati 2006-01-13 10:51 ` Con Kolivas 1 sibling, 1 reply; 53+ messages in thread From: Paolo Ornati @ 2006-01-13 10:46 UTC (permalink / raw) To: Con Kolivas Cc: Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams On Fri, 13 Jan 2006 12:13:11 +1100 Con Kolivas <kernel@kolivas.org> wrote: > Can you try the following patch on 2.6.15 please? I'm interested in how > adversely this affects interactive performance as well as whether it helps > your test case. "./a.out 5000 & ./a.out 5237 & ./a.out 5331 &" "mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/" PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5445 paolo 16 0 2396 288 228 R 34.8 0.1 0:05.84 a.out 5446 paolo 15 0 2396 288 228 S 32.8 0.1 0:05.53 a.out 5444 paolo 16 0 2392 288 228 R 31.3 0.1 0:05.99 a.out 5443 paolo 16 0 10416 1104 848 R 0.2 0.2 0:00.01 top 5451 paolo 15 0 4948 1468 372 D 0.2 0.3 0:00.01 dd DD test takes ~20 s (instead of 8s). As you can see DD priority is now very good (15) but it still suffers because also my test programs get good priority (15/16). Things are BETTER on the real test case (transcode): this is because transcode usually gets priority 16 and "dd" gets 15... so dd is quite happy. BUT what is STRANGE is this: usually transcode is stuck to priority 16 using about 88% of the CPU, but sometimes (don't know how to reproduce it) its priority grows up to 25 and then stay to 25. When transcode priority is 25 the DD test is obviously happy: in particular 2 things can happen (this is expected because I've observed this thing before): 1) priority of transcode stay to 25 (when the file transcode is reading from, through pipes, IS cached). 2) CPU usage and priority of transcode go down (the file transcode is reading from ISN'T cached and DD massive disk usage interferes with this reading). When DD finish trancode priority go back to 25. -- Paolo Ornati Linux 2.6.15-kolivasPatch on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-13 10:46 ` Paolo Ornati @ 2006-01-13 10:51 ` Con Kolivas 2006-01-13 13:01 ` Mike Galbraith 0 siblings, 1 reply; 53+ messages in thread From: Con Kolivas @ 2006-01-13 10:51 UTC (permalink / raw) To: Paolo Ornati Cc: Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams [-- Attachment #1: Type: text/plain, Size: 2657 bytes --] On Friday 13 January 2006 21:46, Paolo Ornati wrote: > On Fri, 13 Jan 2006 12:13:11 +1100 > > Con Kolivas <kernel@kolivas.org> wrote: > > Can you try the following patch on 2.6.15 please? I'm interested in how > > adversely this affects interactive performance as well as whether it > > helps your test case. > > "./a.out 5000 & ./a.out 5237 & ./a.out 5331 &" > "mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null > bs=1M count=256; umount space/" > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5445 paolo 16 0 2396 288 228 R 34.8 0.1 0:05.84 a.out > 5446 paolo 15 0 2396 288 228 S 32.8 0.1 0:05.53 a.out > 5444 paolo 16 0 2392 288 228 R 31.3 0.1 0:05.99 a.out > 5443 paolo 16 0 10416 1104 848 R 0.2 0.2 0:00.01 top > 5451 paolo 15 0 4948 1468 372 D 0.2 0.3 0:00.01 dd > > DD test takes ~20 s (instead of 8s). > > As you can see DD priority is now very good (15) but it still suffers > because also my test programs get good priority (15/16). > > > Things are BETTER on the real test case (transcode): this is because > transcode usually gets priority 16 and "dd" gets 15... so dd is quite > happy. This seems a reasonable compromise. In your "test app" case you are using quirky code to reproduce the worst case scenario. Given that with your quirky setup you are using 3 cpu hogs (effectively) then slowing down dd from 8s to 20s seems an appropriate slowdown (as opposed to the many minutes you were getting previously). See my followup patches that I have posted following "[PATCH 0/5] sched - interactivity updates". The first 3 patches are what you tested. These patches are being put up for testing hopefully in -mm. > BUT what is STRANGE is this: usually transcode is stuck to priority 16 > using about 88% of the CPU, but sometimes (don't know how to reproduce > it) its priority grows up to 25 and then stay to 25. > > When transcode priority is 25 the DD test is obviously happy: in > particular 2 things can happen (this is expected because I've observed > this thing before): > > 1) priority of transcode stay to 25 (when the file transcode is > reading from, through pipes, IS cached). > > 2) CPU usage and priority of transcode go down (the file transcode is > reading from ISN'T cached and DD massive disk usage interferes with > this reading). When DD finish trancode priority go back to 25. I suspect this is entirely dependent on the balance between time spent reading on disk, waiting on pipe and so on. Thanks for your test case and testing! Cheers, Con [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-13 10:51 ` Con Kolivas @ 2006-01-13 13:01 ` Mike Galbraith 2006-01-13 14:34 ` Con Kolivas 0 siblings, 1 reply; 53+ messages in thread From: Mike Galbraith @ 2006-01-13 13:01 UTC (permalink / raw) To: Con Kolivas, Paolo Ornati Cc: Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote: >On Friday 13 January 2006 21:46, Paolo Ornati wrote: > > On Fri, 13 Jan 2006 12:13:11 +1100 > > > > Con Kolivas <kernel@kolivas.org> wrote: > > > Can you try the following patch on 2.6.15 please? I'm interested in how > > > adversely this affects interactive performance as well as whether it > > > helps your test case. > > > > "./a.out 5000 & ./a.out 5237 & ./a.out 5331 &" > > "mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null > > bs=1M count=256; umount space/" > > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > > 5445 paolo 16 0 2396 288 228 R 34.8 0.1 0:05.84 a.out > > 5446 paolo 15 0 2396 288 228 S 32.8 0.1 0:05.53 a.out > > 5444 paolo 16 0 2392 288 228 R 31.3 0.1 0:05.99 a.out > > 5443 paolo 16 0 10416 1104 848 R 0.2 0.2 0:00.01 top > > 5451 paolo 15 0 4948 1468 372 D 0.2 0.3 0:00.01 dd > > > > DD test takes ~20 s (instead of 8s). > > > > As you can see DD priority is now very good (15) but it still suffers > > because also my test programs get good priority (15/16). > > > > > > Things are BETTER on the real test case (transcode): this is because > > transcode usually gets priority 16 and "dd" gets 15... so dd is quite > > happy. > >This seems a reasonable compromise. In your "test app" case you are using >quirky code to reproduce the worst case scenario. Given that with your quirky >setup you are using 3 cpu hogs (effectively) then slowing down dd from 8s to >20s seems an appropriate slowdown (as opposed to the many minutes you were >getting previously). I'm sorry, but I heartily disagree. It's not a quirky setup, it's just code that exposes a weakness just like thud, starve, irman, irman2. Selectively bumping dd up into the upper tier won't do the other things trivially starved to death one bit of good. On a more positive note, I agree that dd should not be punished for waiting on disk. >See my followup patches that I have posted following "[PATCH 0/5] sched - >interactivity updates". The first 3 patches are what you tested. These >patches are being put up for testing hopefully in -mm. Then the (buggy) version of my simple throttling patch will need to come out. (which is OK, I have a debugged potent++ version) > > BUT what is STRANGE is this: usually transcode is stuck to priority 16 > > using about 88% of the CPU, but sometimes (don't know how to reproduce > > it) its priority grows up to 25 and then stay to 25. > > > > When transcode priority is 25 the DD test is obviously happy: in > > particular 2 things can happen (this is expected because I've observed > > this thing before): > > > > 1) priority of transcode stay to 25 (when the file transcode is > > reading from, through pipes, IS cached). > > > > 2) CPU usage and priority of transcode go down (the file transcode is > > reading from ISN'T cached and DD massive disk usage interferes with > > this reading). When DD finish trancode priority go back to 25. > >I suspect this is entirely dependent on the balance between time spent >reading >on disk, waiting on pipe and so on. Grumble. Pipe sleep. That's another pet peeve of mine. Sleep is sleep whether it's spelled interruptible_pipe or uninterruptible_semaphore. -Mike ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-13 13:01 ` Mike Galbraith @ 2006-01-13 14:34 ` Con Kolivas 2006-01-13 16:15 ` Mike Galbraith 0 siblings, 1 reply; 53+ messages in thread From: Con Kolivas @ 2006-01-13 14:34 UTC (permalink / raw) To: Mike Galbraith Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams On Saturday 14 January 2006 00:01, Mike Galbraith wrote: > At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote: > >See my followup patches that I have posted following "[PATCH 0/5] sched - > >interactivity updates". The first 3 patches are what you tested. These > >patches are being put up for testing hopefully in -mm. > > Then the (buggy) version of my simple throttling patch will need to come > out. (which is OK, I have a debugged potent++ version) Your code need not be mutually exclusive with mine. I've simply damped the current behaviour. Your sanity throttling is a good idea. Con ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-13 14:34 ` Con Kolivas @ 2006-01-13 16:15 ` Mike Galbraith 2006-01-14 2:05 ` Con Kolivas 0 siblings, 1 reply; 53+ messages in thread From: Mike Galbraith @ 2006-01-13 16:15 UTC (permalink / raw) To: Con Kolivas Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams At 01:34 AM 1/14/2006 +1100, Con Kolivas wrote: >On Saturday 14 January 2006 00:01, Mike Galbraith wrote: > > At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote: > > >See my followup patches that I have posted following "[PATCH 0/5] sched - > > >interactivity updates". The first 3 patches are what you tested. These > > >patches are being put up for testing hopefully in -mm. > > > > Then the (buggy) version of my simple throttling patch will need to come > > out. (which is OK, I have a debugged potent++ version) > >Your code need not be mutually exclusive with mine. I've simply damped the >current behaviour. Your sanity throttling is a good idea. I didn't mean to imply that they're mutually exclusive, and after doing some testing, I concluded that it (or something like it) is definitely still needed. The version that's in mm2 _is_ buggy however, so ripping it back out wouldn't hurt my delicate little feelings one bit. In fact, it would give me some more time to instrument and test integration with your changes. (Which I think are good btw because they remove what I considered to be warts; the pipe and uninterruptible sleep barriers. Um... try irman2 now... pure evilness) -Mike ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-13 16:15 ` Mike Galbraith @ 2006-01-14 2:05 ` Con Kolivas 2006-01-14 2:56 ` Mike Galbraith 0 siblings, 1 reply; 53+ messages in thread From: Con Kolivas @ 2006-01-14 2:05 UTC (permalink / raw) To: Mike Galbraith Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams, Andrew Morton On Saturday 14 January 2006 03:15, Mike Galbraith wrote: > At 01:34 AM 1/14/2006 +1100, Con Kolivas wrote: > >On Saturday 14 January 2006 00:01, Mike Galbraith wrote: > > > At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote: > > > >See my followup patches that I have posted following "[PATCH 0/5] > > > > sched - interactivity updates". The first 3 patches are what you > > > > tested. These patches are being put up for testing hopefully in -mm. > > > > > > Then the (buggy) version of my simple throttling patch will need to > > > come out. (which is OK, I have a debugged potent++ version) > > > >Your code need not be mutually exclusive with mine. I've simply damped the > >current behaviour. Your sanity throttling is a good idea. > > I didn't mean to imply that they're mutually exclusive, and after doing > some testing, I concluded that it (or something like it) is definitely > still needed. The version that's in mm2 _is_ buggy however, so ripping it > back out wouldn't hurt my delicate little feelings one bit. In fact, it > would give me some more time to instrument and test integration with your > changes. Ok I've communicated this to Andrew (cc'ed here too) so he should remove your patch pending a new version from you. > (Which I think are good btw because they remove what I considered > to be warts; the pipe and uninterruptible sleep barriers. Yes I felt your abuse wrt to these in an earlier email... > Um... try irman2 > now... pure evilness) Hrm I've been using staircase which is immune for so long I'd all but forgotten about this test case. Looking at your code I assume your changes should help this? Con ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] wrong priority calc - SIMPLE test case 2006-01-14 2:05 ` Con Kolivas @ 2006-01-14 2:56 ` Mike Galbraith 0 siblings, 0 replies; 53+ messages in thread From: Mike Galbraith @ 2006-01-14 2:56 UTC (permalink / raw) To: Con Kolivas Cc: Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams, Andrew Morton At 01:05 PM 1/14/2006 +1100, Con Kolivas wrote: >On Saturday 14 January 2006 03:15, Mike Galbraith wrote: > > > Um... try irman2 > > now... pure evilness) > >Hrm I've been using staircase which is immune for so long I'd all but >forgotten about this test case. Looking at your code I assume your changes >should help this? Yes. How much very much depends on how strictly I try to enforce. In my experimental tree, I have four stages of throttling: 1 threshold to begin trying to consume the difference between measured slice_avg and sleep_avg (kidd gloves), 2 to begin treating all new sleep as noninteractive (stern talking to), 3 to cut off new sleep entirely (you're grounded), and 4 is when to start using slice_avg instead of the out of balance sleep_avg for the priority calculation (um, bitch-slap?). Levels 1 and 2 won't stop irman2, 3 will, and especially 4 will. These are all /proc settings at the moment, so I can set set my starvation pain threshold from super duper desktop (all off) to just as fair as a running slice completion time average can possibly make it (all at 1ns differential), and anywhere in between. -Mike ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-27 21:48 ` Paolo Ornati 2005-12-27 23:26 ` Con Kolivas @ 2005-12-27 23:59 ` Peter Williams 2005-12-28 10:20 ` Paolo Ornati 1 sibling, 1 reply; 53+ messages in thread From: Peter Williams @ 2005-12-27 23:59 UTC (permalink / raw) To: Paolo Ornati; +Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar Paolo Ornati wrote: > On Tue, 27 Dec 2005 19:09:18 +0100 > Paolo Ornati <ornati@fastwebnet.it> wrote: > > >>Hello, >> >>I've found an easy-to-reproduce-for-me test case that shows a totally >>wrong priority calculation: basically a CPU-intensitive process gets >>better priority than a disk-intensitive one (dd if=bigfile >>of=/dev/null ...). >> >>Seems impossible, isn't it? >> >>---- THE NUMBERS with 2.6.15-rc7 ----- >> >>The test-case is the Xvid encoding of dvd-ripped track with transcode >>(using "dvd::rip" interface). The copied-and-pasted command line is >>this: >> >>mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' && >>cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob,null >>-i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972 >>--a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 1 -y xvid4,null >>-o /dev/null --print_status 20 && echo DVDRIP_SUCCESS mkdir -m 0775 -p >>'/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec >>transcode -H 10 -a 2 -x vob -i /home/paolo/tmp/test/vob/003 -w 1198,50 >>-b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 2 -y >>xvid4 -o /home/paolo/tmp/test/avi/003/test-003.avi --print_status 20 && >>echo DVDRIP_SUCCESS >> >> >>Here there is a TOP snapshot while running it: >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 5721 paolo 16 0 115m 18m 2428 R 84.4 3.7 0:15.11 transcode >> 5736 paolo 25 0 50352 4516 1912 R 8.4 0.9 0:01.53 tcdecode >> 5725 paolo 15 0 115m 18m 2428 S 4.6 3.7 0:00.84 transcode >> 5738 paolo 18 0 115m 18m 2428 S 0.8 3.7 0:00.15 transcode >> 5734 paolo 25 0 20356 1140 920 S 0.6 0.2 0:00.12 tcdemux >> 5731 paolo 25 0 47312 2540 1996 R 0.4 0.5 0:00.08 tcdecode >> 5319 root 15 0 166m 16m 2584 S 0.2 3.2 0:25.06 X >> 5444 paolo 16 0 87116 22m 15m R 0.2 4.6 0:04.05 konsole >> 5716 paolo 16 0 10424 1160 876 R 0.2 0.2 0:00.06 top >> 5735 paolo 25 0 22364 1436 932 S 0.2 0.3 0:00.01 tcextract >> >> >>DD running alone: >> >>paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/ >>128+0 records in >>128+0 records out >> >>real 0m4.052s >>user 0m0.000s >>sys 0m0.209s >> >>DD while transcoding: >> >>paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/ >>128+0 records in >>128+0 records out >> >>real 0m26.121s >>user 0m0.001s >>sys 0m0.255s >> >>--------------------------------------- >> >>I've tried older kernels finding that 2.6.11 is the first affected. >> >>Going on with testing... >> >> 2.6.11-rc[1-5]: >>2.6.11-rc3 bad >>2.6.11-rc1 bad >> >> 2.6.10-bk[1-14] >>2.6.10-bk7 good >>2.6.10-bk11 good >>2.6.10-bk13 bad >>2.6.10-bk12 bad >> >>So the problem was introduced with: >> >> 2.6.10-bk12 09-Jan-2005 << >> >>The exact behaviour is different with 2.6.11/12/13/14... for example: >>with 2.6.11 the priority of "transcode" is initially set to ~25 and go >>down to 17/18 when running DD. >> >>The problem doesn't seem 100% reproducible with every kernel, sometimes >>a "BAD" kernel looks "GOOD"... or maybe it was me confused by too >>much compile/install/reboot/test work ;) >> >>Other INFO: >>- I'm on x86_64 >>- preemption ON/OFF doesn't make any differences >> >> >>Can anyone reproduce this? >>IOW: is this affecting only my machine? >> > > > Hello Con and Ingo... I've found that the above problem goes away > by reverting this: > > http://linux.bkbits.net:8080/linux-2.6/cset@41e054c6pwNQXzErMxvfh4IpLPXA5A?nav=index.html|src/|src/include|src/include/linux|related/include/linux/sched.h > > -------------------------------------------------- > > [PATCH] sched: remove_interactive_credit > > Special casing tasks by interactive credit was helpful for preventing fully > cpu bound tasks from easily rising to interactive status. > > However it did not select out tasks that had periods of being fully cpu > bound and then sleeping while waiting on pipes, signals etc. This led to a > more disproportionate share of cpu time. > > Backing this out will no longer special case only fully cpu bound tasks, > and prevents the variable behaviour that occurs at startup before tasks > declare themseleves interactive or not, and speeds up application startup > slightly under certain circumstances. It does cost in interactivity > slightly as load rises but it is worth it for the fairness gains. > > Signed-off-by: Con Kolivas <kernel@kolivas.org> > Acked-by: Ingo Molnar <mingo@elte.hu> > Signed-off-by: Andrew Morton <akpm@osdl.org> > Signed-off-by: Linus Torvalds <torvalds@osdl.org> > > -------------------------------------------------- > > > Maybe this change has revealed a scheduler weakness ? > > I'm glad to test any patch or give more data :) > > Bye, > Any chance of you applying the PlugSched patches and seeing how the other schedulers that it contains handle this situation? The patch at: <http://prdownloads.sourceforge.net/cpuse/plugsched-6.1.6-for-2.6.15-rc5.patch?download> should apply without problems to the 2.6.15-rc7 kernel. Very Brief Documentation: You can select a default scheduler at kernel build time. If you wish to boot with a scheduler other than the default it can be selected at boot time by adding: cpusched=<scheduler> to the boot command line where <scheduler> is one of: ingosched, nicksched, staircase, spa_no_frills, spa_ws, spa_svr or zaphod. If you don't change the default when you build the kernel the default scheduler will be ingosched (which is the normal scheduler). The scheduler in force on a running system can be determined by the contents of: /proc/scheduler Control parameters for the scheduler can be read/set via files in: /sys/cpusched/<scheduler>/ Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-27 23:59 ` [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) Peter Williams @ 2005-12-28 10:20 ` Paolo Ornati 2005-12-28 13:38 ` Peter Williams 0 siblings, 1 reply; 53+ messages in thread From: Paolo Ornati @ 2005-12-28 10:20 UTC (permalink / raw) To: Peter Williams; +Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar On Wed, 28 Dec 2005 10:59:13 +1100 Peter Williams <pwil3058@bigpond.net.au> wrote: > Any chance of you applying the PlugSched patches and seeing how the > other schedulers that it contains handle this situation? > > The patch at: > > <http://prdownloads.sourceforge.net/cpuse/plugsched-6.1.6-for-2.6.15-rc5.patch?download> > > should apply without problems to the 2.6.15-rc7 kernel. > > Very Brief Documentation: > > You can select a default scheduler at kernel build time. If you wish to > boot with a scheduler other than the default it can be selected at boot > time by adding: > > cpusched=<scheduler> > > to the boot command line where <scheduler> is one of: ingosched, > nicksched, staircase, spa_no_frills, spa_ws, spa_svr or zaphod. If you > don't change the default when you build the kernel the default scheduler > will be ingosched (which is the normal scheduler). First of all, this is the "pstree" structure of transcode an friends: |-kdesktop---perl---sh---transcode-+-2*[sh-+-tccat] | | |-tcdecode] | | |-tcdemux] | | `-tcextract] | `-transcode---5*[transcode] Results with various schedulers: ------------------------------------------------------------------------ 1) nicksched: perfect! This is the behaviour I want. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5562 paolo 40 0 115m 18m 2428 R 82.2 3.7 0:22.16 transcode 5576 paolo 26 0 50348 4516 1912 S 9.5 0.9 0:02.43 tcdecode 5566 paolo 23 0 115m 18m 2428 S 4.6 3.7 0:01.24 transcode 5573 paolo 21 0 115m 18m 2428 S 0.9 3.7 0:00.22 transcode 5577 paolo 27 0 20356 1140 920 S 0.9 0.2 0:00.21 tcdemux 5295 root 20 0 167m 17m 3624 S 0.6 3.5 0:11.02 X 5579 paolo 20 0 47308 2540 1996 S 0.5 0.5 0:00.14 tcdecode 5574 paolo 20 0 20356 1144 920 S 0.4 0.2 0:00.11 tcdemux ... transcode get recognized for what it is, and I/O bounded processes don't even notice that it is running :) 2) staircase: bad, as you can see: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5582 paolo 26 0 115m 18m 2428 R 82.7 3.7 0:47.63 transcode 5599 paolo 39 0 50352 4516 1912 R 9.6 0.9 0:05.21 tcdecode 5586 paolo 0 0 115m 18m 2428 S 4.5 3.7 0:02.61 transcode 5622 paolo 39 0 4948 1520 412 R 1.1 0.3 0:00.15 dd 5591 paolo 0 0 115m 18m 2428 S 0.6 3.7 0:00.36 transcode 5575 paolo 0 0 98476 37m 9392 S 0.4 7.5 0:01.44 perl 5597 paolo 27 0 20356 1144 920 S 0.4 0.2 0:00.21 tcdemux 5475 paolo 0 0 86556 22m 15m S 0.2 4.5 0:01.24 konsole 5388 root 0 0 167m 17m 3208 S 0.1 3.4 0:03.16 X 5587 paolo 0 0 115m 18m 2428 S 0.1 3.7 0:00.03 transcode 5595 paolo 20 0 47312 2540 1996 S 0.1 0.5 0:00.14 tcdecode 5596 paolo 26 0 22672 1268 1020 S 0.1 0.2 0:00.03 tccat 5598 paolo 28 0 22364 1436 932 S 0.1 0.3 0:00.04 tcextract And "DD" is affected badly: paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/ 128+0 records in 128+0 records out real 0m6.341s user 0m0.002s sys 0m0.229s While transcoding: paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/ 256+0 records in 256+0 records out real 0m15.793s user 0m0.001s sys 0m0.374s 3) spa_no_frills: bad, but this is OK since it is Round Robin :) PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5356 paolo 20 0 115m 18m 2428 R 81.1 3.7 0:27.61 transcode 5371 paolo 20 0 50348 4516 1912 R 8.9 0.9 0:02.97 tcdecode 5360 paolo 20 0 115m 18m 2428 S 4.1 3.7 0:01.54 transcode 5378 paolo 20 0 4948 1520 412 D 1.4 0.3 0:00.29 dd 5364 paolo 20 0 20352 1144 920 S 0.9 0.2 0:00.20 tcdemux 5373 paolo 20 0 115m 18m 2428 S 0.7 3.7 0:00.32 transcode 5369 paolo 20 0 20356 1144 920 S 0.5 0.2 0:00.14 tcdemux 5205 root 20 0 165m 15m 2584 R 0.2 3.2 0:01.86 X 4) spa_ws: bad PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5334 paolo 32 0 115m 18m 2428 R 82.7 3.7 0:18.77 transcode 5349 paolo 32 0 50348 4516 1912 R 8.9 0.9 0:02.00 tcdecode 5338 paolo 21 0 115m 18m 2428 S 4.6 3.7 0:01.08 transcode 5356 paolo 32 0 4948 1520 412 D 1.1 0.3 0:00.12 dd 5351 paolo 32 0 115m 18m 2428 S 1.0 3.7 0:00.20 transcode 5199 root 21 0 165m 15m 2584 S 0.4 3.2 0:01.68 X 5347 paolo 32 0 20356 1140 920 S 0.4 0.2 0:00.08 tcdemux 5296 paolo 22 0 98472 37m 9392 S 0.2 7.5 0:01.47 perl 5299 paolo 21 0 86556 22m 15m S 0.2 4.4 0:00.75 konsole 5344 paolo 32 0 47308 2540 1996 S 0.2 0.5 0:00.07 tcdecode 5339 paolo 21 0 115m 18m 2428 S 0.1 3.7 0:00.01 transcode paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/ 256+0 records in 256+0 records out real 0m8.112s user 0m0.001s sys 0m0.444s paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/ 256+0 records in 256+0 records out real 0m29.222s user 0m0.000s sys 0m0.400s 5) spa_svr: surprise, surprise! Not all that bad. At least DD gets better priority than transcode... and DD real time is only a bit affected (8s --> ~9s). PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5334 paolo 33 0 115m 18m 2428 R 78.1 3.7 0:22.70 transcode 5349 paolo 28 0 50352 4516 1912 S 9.0 0.9 0:02.41 tcdecode 5338 paolo 25 0 115m 18m 2428 S 4.7 3.7 0:01.29 transcode 5363 paolo 27 0 4952 1520 412 R 4.7 0.3 0:00.25 dd 5342 paolo 33 0 20352 1140 920 S 1.6 0.2 0:00.21 tcdemux 5351 paolo 25 0 115m 18m 2428 S 0.8 3.7 0:00.23 transcode 5144 root 22 0 166m 16m 3120 S 0.4 3.3 0:01.85 X 5344 paolo 23 0 47308 2540 1996 S 0.4 0.5 0:00.13 tcdecode 5347 paolo 27 0 20356 1144 920 S 0.4 0.2 0:00.10 tcdemux 5231 paolo 22 0 86660 22m 15m S 0.2 4.5 0:00.95 konsole 5271 paolo 25 0 98476 37m 9396 S 0.2 7.5 0:01.54 perl 5341 paolo 23 0 22672 1268 1020 S 0.2 0.2 0:00.02 tccat 6) zaphod: more or less like spa_svr PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5308 paolo 34 0 115m 18m 2428 R 52.1 3.7 0:49.77 transcode 5323 paolo 32 0 50352 4516 1912 S 6.0 0.9 0:05.61 tcdecode 5356 paolo 28 0 4952 1520 412 D 3.5 0.3 0:00.28 dd 5312 paolo 28 0 115m 18m 2428 S 2.6 3.7 0:02.71 transcode 5325 paolo 31 0 115m 18m 2428 S 0.7 3.7 0:00.55 transcode 5316 paolo 37 0 20352 1140 920 S 0.4 0.2 0:00.33 tcdemux 5202 root 23 0 165m 15m 2584 S 0.2 3.1 0:01.57 X 5318 paolo 31 0 47312 2540 1996 S 0.2 0.5 0:00.28 tcdecode 5321 paolo 33 0 20356 1144 920 S 0.2 0.2 0:00.26 tcdemux 4760 messageb 25 0 13248 1068 848 S 0.1 0.2 0:00.07 dbus-daemon-1 5264 paolo 24 0 93920 17m 10m S 0.1 3.5 0:00.38 kded 5282 paolo 23 0 92712 19m 12m S 0.1 3.9 0:00.36 kdesktop 7) ingosched: bad, as already said in the original post PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5209 paolo 16 0 115m 18m 2428 R 72.0 3.7 0:22.13 transcode 5224 paolo 22 0 50348 4516 1912 R 8.4 0.9 0:02.44 tcdecode 5213 paolo 15 0 115m 18m 2428 S 4.2 3.7 0:01.24 transcode 5243 paolo 18 0 4948 1520 412 R 1.8 0.3 0:00.14 dd 5217 paolo 19 0 20356 1144 920 R 0.8 0.2 0:00.19 tcdemux 5108 root 15 0 165m 15m 2584 S 0.6 3.1 0:01.44 X 5226 paolo 15 0 115m 18m 2428 S 0.6 3.7 0:00.20 transcode 5216 paolo 18 0 22676 1268 1020 S 0.4 0.2 0:00.03 tccat 5219 paolo 18 0 47312 2540 1996 R 0.4 0.5 0:00.12 tcdecode 5222 paolo 18 0 20356 1144 920 S 0.4 0.2 0:00.10 tcdemux 5195 paolo 16 0 98488 37m 9392 S 0.2 7.5 0:01.41 perl 5198 paolo 16 0 86552 22m 15m R 0.2 4.4 0:00.66 konsole paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/ 256+0 records in 256+0 records out real 0m23.393s (instead of 8s) user 0m0.001s sys 0m0.418s ------------------------------------------------------------------------ So the winner for manifest superiority is "nicksched", it looks to me even better than 2.6.10-bk12 (ingosched) with "remove_interactive_credit" reverted. :) -- Paolo Ornati Linux 2.6.15-rc5-plugsched on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-28 10:20 ` Paolo Ornati @ 2005-12-28 13:38 ` Peter Williams 2005-12-28 19:45 ` Paolo Ornati 2005-12-29 3:13 ` Nick Piggin 0 siblings, 2 replies; 53+ messages in thread From: Peter Williams @ 2005-12-28 13:38 UTC (permalink / raw) To: Paolo Ornati; +Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar Paolo Ornati wrote: > On Wed, 28 Dec 2005 10:59:13 +1100 > Peter Williams <pwil3058@bigpond.net.au> wrote: > > >>Any chance of you applying the PlugSched patches and seeing how the >>other schedulers that it contains handle this situation? >> >>The patch at: >> >><http://prdownloads.sourceforge.net/cpuse/plugsched-6.1.6-for-2.6.15-rc5.patch?download> >> >>should apply without problems to the 2.6.15-rc7 kernel. >> >>Very Brief Documentation: >> >>You can select a default scheduler at kernel build time. If you wish to >>boot with a scheduler other than the default it can be selected at boot >>time by adding: >> >>cpusched=<scheduler> >> >>to the boot command line where <scheduler> is one of: ingosched, >>nicksched, staircase, spa_no_frills, spa_ws, spa_svr or zaphod. If you >>don't change the default when you build the kernel the default scheduler >>will be ingosched (which is the normal scheduler). > > > > First of all, this is the "pstree" structure of transcode an friends: > > |-kdesktop---perl---sh---transcode-+-2*[sh-+-tccat] > | | |-tcdecode] > | | |-tcdemux] > | | `-tcextract] > | `-transcode---5*[transcode] > > > Results with various schedulers: First, thanks for doing this. > > ------------------------------------------------------------------------ > > 1) nicksched: perfect! This is the behaviour I want. > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5562 paolo 40 0 115m 18m 2428 R 82.2 3.7 0:22.16 transcode > 5576 paolo 26 0 50348 4516 1912 S 9.5 0.9 0:02.43 tcdecode > 5566 paolo 23 0 115m 18m 2428 S 4.6 3.7 0:01.24 transcode > 5573 paolo 21 0 115m 18m 2428 S 0.9 3.7 0:00.22 transcode > 5577 paolo 27 0 20356 1140 920 S 0.9 0.2 0:00.21 tcdemux > 5295 root 20 0 167m 17m 3624 S 0.6 3.5 0:11.02 X > 5579 paolo 20 0 47308 2540 1996 S 0.5 0.5 0:00.14 tcdecode > 5574 paolo 20 0 20356 1144 920 S 0.4 0.2 0:00.11 tcdemux > ... > > transcode get recognized for what it is, and I/O bounded processes > don't even notice that it is running :) Interesting. This one's more or less a dead scheduler and hasn't had any development work done on it for some time. I just keep porting the original version to new kernels. > > > 2) staircase: bad, as you can see: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5582 paolo 26 0 115m 18m 2428 R 82.7 3.7 0:47.63 transcode > 5599 paolo 39 0 50352 4516 1912 R 9.6 0.9 0:05.21 tcdecode > 5586 paolo 0 0 115m 18m 2428 S 4.5 3.7 0:02.61 transcode > 5622 paolo 39 0 4948 1520 412 R 1.1 0.3 0:00.15 dd > 5591 paolo 0 0 115m 18m 2428 S 0.6 3.7 0:00.36 transcode > 5575 paolo 0 0 98476 37m 9392 S 0.4 7.5 0:01.44 perl > 5597 paolo 27 0 20356 1144 920 S 0.4 0.2 0:00.21 tcdemux > 5475 paolo 0 0 86556 22m 15m S 0.2 4.5 0:01.24 konsole > 5388 root 0 0 167m 17m 3208 S 0.1 3.4 0:03.16 X > 5587 paolo 0 0 115m 18m 2428 S 0.1 3.7 0:00.03 transcode > 5595 paolo 20 0 47312 2540 1996 S 0.1 0.5 0:00.14 tcdecode > 5596 paolo 26 0 22672 1268 1020 S 0.1 0.2 0:00.03 tccat > 5598 paolo 28 0 22364 1436 932 S 0.1 0.3 0:00.04 tcextract > > > And "DD" is affected badly: > > paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile > of=/dev/null bs=1M count=128; umount space/ 128+0 records in > 128+0 records out > > real 0m6.341s > user 0m0.002s > sys 0m0.229s > > While transcoding: > > paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile > of=/dev/null bs=1M count=256; umount space/ 256+0 records in > 256+0 records out > > real 0m15.793s > user 0m0.001s > sys 0m0.374s > > > 3) spa_no_frills: bad, but this is OK since it is Round Robin :) > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5356 paolo 20 0 115m 18m 2428 R 81.1 3.7 0:27.61 transcode > 5371 paolo 20 0 50348 4516 1912 R 8.9 0.9 0:02.97 tcdecode > 5360 paolo 20 0 115m 18m 2428 S 4.1 3.7 0:01.54 transcode > 5378 paolo 20 0 4948 1520 412 D 1.4 0.3 0:00.29 dd > 5364 paolo 20 0 20352 1144 920 S 0.9 0.2 0:00.20 tcdemux > 5373 paolo 20 0 115m 18m 2428 S 0.7 3.7 0:00.32 transcode > 5369 paolo 20 0 20356 1144 920 S 0.5 0.2 0:00.14 tcdemux > 5205 root 20 0 165m 15m 2584 R 0.2 3.2 0:01.86 X > Yes, no surprises there. > > 4) spa_ws: bad > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5334 paolo 32 0 115m 18m 2428 R 82.7 3.7 0:18.77 transcode > 5349 paolo 32 0 50348 4516 1912 R 8.9 0.9 0:02.00 tcdecode > 5338 paolo 21 0 115m 18m 2428 S 4.6 3.7 0:01.08 transcode > 5356 paolo 32 0 4948 1520 412 D 1.1 0.3 0:00.12 dd > 5351 paolo 32 0 115m 18m 2428 S 1.0 3.7 0:00.20 transcode > 5199 root 21 0 165m 15m 2584 S 0.4 3.2 0:01.68 X > 5347 paolo 32 0 20356 1140 920 S 0.4 0.2 0:00.08 tcdemux > 5296 paolo 22 0 98472 37m 9392 S 0.2 7.5 0:01.47 perl > 5299 paolo 21 0 86556 22m 15m S 0.2 4.4 0:00.75 konsole > 5344 paolo 32 0 47308 2540 1996 S 0.2 0.5 0:00.07 tcdecode > 5339 paolo 21 0 115m 18m 2428 S 0.1 3.7 0:00.01 transcode > > paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile > of=/dev/null bs=1M count=256; umount space/ 256+0 records in > 256+0 records out > > real 0m8.112s > user 0m0.001s > sys 0m0.444s > > paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile > of=/dev/null bs=1M count=256; umount space/ 256+0 records in > 256+0 records out > > real 0m29.222s > user 0m0.000s > sys 0m0.400s This one is aimed purely at good interactive responsiveness (i.e. keyboard, mouse, X server and media players such as rythmbox/xmms) so no real surprises here either. > > > 5) spa_svr: surprise, surprise! Not all that bad. At least DD > gets better priority than transcode... and DD real time is only a bit > affected (8s --> ~9s). > This will be the "throughput bonus" in action. It's overall aim is to reduce the time tasks spend on the runqueue waiting for CPU access a.k.a. delay. It does this by using the system load and the average amount of CPU time that the task uses each scheduling cycle to estimate the expected delay for the task and gives it a bonus if the actual average delays being experienced are bigger than this value. It's intended for server systems not interactive systems as reducing overall delay isn't necessarily good for interactive systems where the aim is to quell the user's impatience by giving good latency to the interactive tasks. These aims aren't always compatible. > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5334 paolo 33 0 115m 18m 2428 R 78.1 3.7 0:22.70 transcode > 5349 paolo 28 0 50352 4516 1912 S 9.0 0.9 0:02.41 tcdecode > 5338 paolo 25 0 115m 18m 2428 S 4.7 3.7 0:01.29 transcode > 5363 paolo 27 0 4952 1520 412 R 4.7 0.3 0:00.25 dd > 5342 paolo 33 0 20352 1140 920 S 1.6 0.2 0:00.21 tcdemux > 5351 paolo 25 0 115m 18m 2428 S 0.8 3.7 0:00.23 transcode > 5144 root 22 0 166m 16m 3120 S 0.4 3.3 0:01.85 X > 5344 paolo 23 0 47308 2540 1996 S 0.4 0.5 0:00.13 tcdecode > 5347 paolo 27 0 20356 1144 920 S 0.4 0.2 0:00.10 tcdemux > 5231 paolo 22 0 86660 22m 15m S 0.2 4.5 0:00.95 konsole > 5271 paolo 25 0 98476 37m 9396 S 0.2 7.5 0:01.54 perl > 5341 paolo 23 0 22672 1268 1020 S 0.2 0.2 0:00.02 tccat > > > 6) zaphod: more or less like spa_svr Zaphod includes the throughput bonus in its armoury which why it is similar in performance to spa_svr. > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5308 paolo 34 0 115m 18m 2428 R 52.1 3.7 0:49.77 transcode > 5323 paolo 32 0 50352 4516 1912 S 6.0 0.9 0:05.61 tcdecode > 5356 paolo 28 0 4952 1520 412 D 3.5 0.3 0:00.28 dd > 5312 paolo 28 0 115m 18m 2428 S 2.6 3.7 0:02.71 transcode > 5325 paolo 31 0 115m 18m 2428 S 0.7 3.7 0:00.55 transcode > 5316 paolo 37 0 20352 1140 920 S 0.4 0.2 0:00.33 tcdemux > 5202 root 23 0 165m 15m 2584 S 0.2 3.1 0:01.57 X > 5318 paolo 31 0 47312 2540 1996 S 0.2 0.5 0:00.28 tcdecode > 5321 paolo 33 0 20356 1144 920 S 0.2 0.2 0:00.26 tcdemux > 4760 messageb 25 0 13248 1068 848 S 0.1 0.2 0:00.07 > dbus-daemon-1 5264 paolo 24 0 93920 17m 10m S 0.1 3.5 > 0:00.38 kded 5282 paolo 23 0 92712 19m 12m S 0.1 3.9 > 0:00.36 kdesktop > > > 7) ingosched: bad, as already said in the original post > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5209 paolo 16 0 115m 18m 2428 R 72.0 3.7 0:22.13 transcode > 5224 paolo 22 0 50348 4516 1912 R 8.4 0.9 0:02.44 tcdecode > 5213 paolo 15 0 115m 18m 2428 S 4.2 3.7 0:01.24 transcode > 5243 paolo 18 0 4948 1520 412 R 1.8 0.3 0:00.14 dd > 5217 paolo 19 0 20356 1144 920 R 0.8 0.2 0:00.19 tcdemux > 5108 root 15 0 165m 15m 2584 S 0.6 3.1 0:01.44 X > 5226 paolo 15 0 115m 18m 2428 S 0.6 3.7 0:00.20 transcode > 5216 paolo 18 0 22676 1268 1020 S 0.4 0.2 0:00.03 tccat > 5219 paolo 18 0 47312 2540 1996 R 0.4 0.5 0:00.12 tcdecode > 5222 paolo 18 0 20356 1144 920 S 0.4 0.2 0:00.10 tcdemux > 5195 paolo 16 0 98488 37m 9392 S 0.2 7.5 0:01.41 perl > 5198 paolo 16 0 86552 22m 15m R 0.2 4.4 0:00.66 konsole > > paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/ > 256+0 records in > 256+0 records out > > real 0m23.393s (instead of 8s) > user 0m0.001s > sys 0m0.418s > > ------------------------------------------------------------------------ > > > So the winner for manifest superiority is "nicksched", it looks to me > even better than 2.6.10-bk12 (ingosched) with > "remove_interactive_credit" reverted. Thanks for this data. It will enable me to make some mods to the spa_xxx and zaphod schedulers. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-28 13:38 ` Peter Williams @ 2005-12-28 19:45 ` Paolo Ornati 2005-12-29 3:13 ` Nick Piggin 1 sibling, 0 replies; 53+ messages in thread From: Paolo Ornati @ 2005-12-28 19:45 UTC (permalink / raw) To: Peter Williams; +Cc: Linux Kernel Mailing List, Con Kolivas, Ingo Molnar On Thu, 29 Dec 2005 00:38:08 +1100 Peter Williams <pwil3058@bigpond.net.au> wrote: > Thanks for this data. It will enable me to make some mods to the > spa_xxx and zaphod schedulers. Ok, but keep in mind that these numbers are just "snapshots". With almost all schedulers the priority of the CPU-eater transcode and other processes fluctuate a bit (an exception here is nicksched, that gives priority 40 to transcode and never change it). It seems also that small variation of the priority can affect seriously my DD test case running time (expecially, I think, with schedulers that give to "transcode" a better-or-equal priority than "dd" --> ingosched/staircaise). This is another mail in witch you weren't in CC that explains it better: http://www.ussg.iu.edu/hypermail/linux/kernel/0512.3/0647.html -- Paolo Ornati Linux 2.6.15-rc7-plugsched on x86_64 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-28 13:38 ` Peter Williams 2005-12-28 19:45 ` Paolo Ornati @ 2005-12-29 3:13 ` Nick Piggin 2005-12-29 3:35 ` Peter Williams 1 sibling, 1 reply; 53+ messages in thread From: Nick Piggin @ 2005-12-29 3:13 UTC (permalink / raw) To: Peter Williams Cc: Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar Peter Williams wrote: > Paolo Ornati wrote: >> 1) nicksched: perfect! This is the behaviour I want. >> ... >> >> transcode get recognized for what it is, and I/O bounded processes >> don't even notice that it is running :) > > > Interesting. This one's more or less a dead scheduler and hasn't had > any development work done on it for some time. I just keep porting the > original version to new kernels. > It isn't a dead scheduler any more than any of the other out of tree schedulers are (which isn't saying much, unfortunately). I've probably got a small number of cleanups and microoptimisations relative to what you have (I can't remember exactly what you sucked up) ... but other than that there hasn't been much development work done for some time because there is not much wrong with it. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-29 3:13 ` Nick Piggin @ 2005-12-29 3:35 ` Peter Williams 2005-12-29 8:11 ` Nick Piggin 0 siblings, 1 reply; 53+ messages in thread From: Peter Williams @ 2005-12-29 3:35 UTC (permalink / raw) To: Nick Piggin Cc: Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar Nick Piggin wrote: > Peter Williams wrote: > >> Paolo Ornati wrote: > > >>> 1) nicksched: perfect! This is the behaviour I want. >>> > ... > >>> >>> transcode get recognized for what it is, and I/O bounded processes >>> don't even notice that it is running :) >> >> >> >> Interesting. This one's more or less a dead scheduler and hasn't had >> any development work done on it for some time. I just keep porting >> the original version to new kernels. >> > > It isn't a dead scheduler any more than any of the other out of tree > schedulers are (which isn't saying much, unfortunately). Ingosched, staircase and my SPA schedulers are all evolving slowly. Are there any out there that I don't have in PlugSched that you think should be? > > I've probably got a small number of cleanups and microoptimisations > relative to what you have (I can't remember exactly what you sucked up) > ... but other than that there hasn't been much development work done for > some time because there is not much wrong with it. > I was starting to think that you'd lost interest in this which is why I said it was more or less dead. Sorry. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) 2005-12-29 3:35 ` Peter Williams @ 2005-12-29 8:11 ` Nick Piggin 0 siblings, 0 replies; 53+ messages in thread From: Nick Piggin @ 2005-12-29 8:11 UTC (permalink / raw) To: Peter Williams Cc: Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar Peter Williams wrote: > Nick Piggin wrote: >> It isn't a dead scheduler any more than any of the other out of tree >> schedulers are (which isn't saying much, unfortunately). > > > Ingosched, staircase and my SPA schedulers are all evolving slowly. > Are there any out there that I don't have in PlugSched that you think > should be? > Not that I know of... >> >> I've probably got a small number of cleanups and microoptimisations >> relative to what you have (I can't remember exactly what you sucked up) >> ... but other than that there hasn't been much development work done for >> some time because there is not much wrong with it. >> > > I was starting to think that you'd lost interest in this which is why I > said it was more or less dead. Sorry. > No worries. I haven't lost interest so much as people seem to be fairly happy with the current scheduler and least aren't busting my door down for updates to nicksched ;) I'll do a resynch for 2.6.15 though. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 53+ messages in thread
end of thread, other threads:[~2006-01-14 2:56 UTC | newest] Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2005-12-27 18:09 [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) Paolo Ornati 2005-12-27 21:48 ` Paolo Ornati 2005-12-27 23:26 ` Con Kolivas 2005-12-28 11:01 ` Paolo Ornati 2005-12-28 11:19 ` Con Kolivas 2005-12-28 11:35 ` Paolo Ornati 2005-12-28 17:23 ` Paolo Ornati 2005-12-28 17:39 ` Paolo Ornati 2005-12-30 13:52 ` [SCHED] wrong priority calc - SIMPLE test case Paolo Ornati 2005-12-31 2:06 ` Peter Williams 2005-12-31 10:34 ` Paolo Ornati 2005-12-31 10:52 ` Paolo Ornati 2005-12-31 11:12 ` Con Kolivas 2005-12-31 13:44 ` Peter Williams 2005-12-31 16:31 ` Paolo Ornati 2005-12-31 22:04 ` Peter Williams 2005-12-31 8:13 ` Mike Galbraith 2005-12-31 11:00 ` Paolo Ornati 2005-12-31 15:11 ` Paolo Ornati 2005-12-31 16:37 ` Mike Galbraith 2005-12-31 17:24 ` Paolo Ornati 2005-12-31 17:42 ` Paolo Ornati 2006-01-01 11:39 ` Paolo Ornati 2006-01-02 9:15 ` Mike Galbraith 2006-01-02 9:50 ` Paolo Ornati 2006-01-09 11:11 ` Mike Galbraith 2006-01-09 15:52 ` Mike Galbraith 2006-01-09 16:08 ` Con Kolivas 2006-01-09 18:14 ` Mike Galbraith 2006-01-09 20:00 ` Paolo Ornati 2006-01-09 20:23 ` Paolo Ornati 2006-01-10 7:08 ` Mike Galbraith 2006-01-10 12:07 ` Mike Galbraith 2006-01-10 12:56 ` Paolo Ornati 2006-01-10 13:01 ` Mike Galbraith 2006-01-10 13:53 ` Paolo Ornati 2006-01-10 15:18 ` Mike Galbraith 2006-01-13 1:13 ` Con Kolivas 2006-01-13 1:32 ` Con Kolivas 2006-01-13 10:46 ` Paolo Ornati 2006-01-13 10:51 ` Con Kolivas 2006-01-13 13:01 ` Mike Galbraith 2006-01-13 14:34 ` Con Kolivas 2006-01-13 16:15 ` Mike Galbraith 2006-01-14 2:05 ` Con Kolivas 2006-01-14 2:56 ` Mike Galbraith 2005-12-27 23:59 ` [SCHED] Totally WRONG prority calculation with specific test-case (since 2.6.10-bk12) Peter Williams 2005-12-28 10:20 ` Paolo Ornati 2005-12-28 13:38 ` Peter Williams 2005-12-28 19:45 ` Paolo Ornati 2005-12-29 3:13 ` Nick Piggin 2005-12-29 3:35 ` Peter Williams 2005-12-29 8:11 ` Nick Piggin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).