* CFS scheduler unfairly prefers pinned tasks
From: paul.szabo @ 2015-10-05 21:48 UTC
To: linux-kernel

The Linux CFS scheduler prefers pinned tasks and unfairly gives more
CPU time to tasks that have set CPU affinity. This effect is observed
with or without CGROUP controls.

To demonstrate: on an otherwise idle machine, as some user, run one
pinned process per CPU (as many as there are CPUs in the system),
e.g. for a quad-core non-HyperThreaded machine:

  taskset -c 0 perl -e 'while(1){1}' &
  taskset -c 1 perl -e 'while(1){1}' &
  taskset -c 2 perl -e 'while(1){1}' &
  taskset -c 3 perl -e 'while(1){1}' &

and (as that same or some other user) run some without pinning:

  perl -e 'while(1){1}' &
  perl -e 'while(1){1}' &

and use e.g. top to observe that the pinned processes get more CPU
time than "fair".

Fairness is obtained when either:
 - there are as many un-pinned processes as CPUs; or
 - with CGROUP controls and the two kinds of processes run by
   different users, there is just one un-pinned process; or
 - the pinning is turned off for these processes (or they are
   started without).

Any insight is welcome!

---

I would appreciate replies direct to me as I am not subscribed to the
linux-kernel mailing list (but will try to watch the archives). This
bug is also reported to Debian, please see http://bugs.debian.org/800945
I use Debian with the 3.16 kernel, and have not yet tried 4.* kernels.

Thanks, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia

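For reference, below is a minimal C sketch of what taskset sets up in
the demonstration above: pin the calling process with
sched_setaffinity(2), then spin. The hard-coded CPU number and the bare
busy loop are illustrative assumptions, not part of the original report.

	/* Rough equivalent of: taskset -c 0 perl -e 'while(1){1}' */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	int main(void)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(0, &set);	/* pin self to CPU 0 */
		if (sched_setaffinity(0, sizeof(set), &set)) {
			perror("sched_setaffinity");
			return 1;
		}
		for (;;)
			;		/* burn CPU, like the perl loop */
	}
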
* Re: CFS scheduler unfairly prefers pinned tasks
From: Mike Galbraith @ 2015-10-06 02:45 UTC
To: paul.szabo; Cc: linux-kernel

On Tue, 2015-10-06 at 08:48 +1100, paul.szabo@sydney.edu.au wrote:
> The Linux CFS scheduler prefers pinned tasks and unfairly gives more
> CPU time to tasks that have set CPU affinity. This effect is observed
> with or without CGROUP controls.
[...]
> Any insight is welcome!

If they can all migrate, load balancing can move any of them to try to
fix the permanent imbalance, so they'll all bounce about sharing a CPU
with some other hog, and it all kinda sorta works out.

When most are pinned, to make it work out long term you'd have to be
short term unfair, walking the unpinned minority around the box in a
carefully orchestrated dance... and have omniscient powers that assure
that none of the tasks you're trying to equalize is gonna do something
rude like leave, sleep, fork or whatever, and muck up the grand plan.

	-Mike

* Re: CFS scheduler unfairly prefers pinned tasks
From: paul.szabo @ 2015-10-06 10:06 UTC
To: umgwanakikbuti; Cc: linux-kernel

Dear Mike,

>> ... CFS ... unfairly gives more CPU time to [pinned] tasks ...
>
> If they can all migrate, load balancing can move any of them to try to
> fix the permanent imbalance, so they'll all bounce about sharing a CPU
> with some other hog, and it all kinda sorta works out.
>
> When most are pinned, to make it work out long term you'd have to be
> short term unfair, walking the unpinned minority around the box in a
> carefully orchestrated dance... and have omniscient powers that assure
> that none of the tasks you're trying to equalize is gonna do something
> rude like leave, sleep, fork or whatever, and muck up the grand plan.

Could not your argument be turned around: for a pinned task it is
harder to find an idle CPU, so it should get less time?

But really... those pinned tasks do not hog the CPU forever. Whatever
kicks them off: could not that be done just a little earlier?

And further... CFS is meant to be fair, using things like vruntime to
preempt, and throttling. Why are those pinned tasks not preempted or
throttled?

Thanks, Paul

* Re: CFS scheduler unfairly prefers pinned tasks
From: Mike Galbraith @ 2015-10-06 12:17 UTC
To: paul.szabo; Cc: linux-kernel

On Tue, 2015-10-06 at 21:06 +1100, paul.szabo@sydney.edu.au wrote:
> And further... CFS is meant to be fair, using things like vruntime to
> preempt, and throttling. Why are those pinned tasks not preempted or
> throttled?

Imagine you own an 8192 CPU box for a moment, all CPUs having one pinned
task, plus one extra unpinned task, and ponder what would have to happen
in order to meet your utilization expectation.

<time passes>

Right. What you're seeing is not a bug. No task can occupy more than
one CPU at a time, making space reservation on multiple CPUs a very bad
idea.

	-Mike

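To make the thought experiment concrete, here is a back-of-envelope
sketch; the arithmetic simply follows the scenario described above and
is an illustration, not a measurement:

	/* 8192 CPUs, one pinned hog per CPU, plus one unpinned task. */
	#include <stdio.h>

	int main(void)
	{
		double cpus = 8192.0;
		double tasks = cpus + 1.0;

		/* A perfectly fair share is cpus/tasks of a CPU per task... */
		printf("fair share per task: %.6f CPUs\n", cpus / tasks);

		/* ...but while the unpinned task sits on any one CPU it
		 * halves that CPU's pinned hog. To equalize, the balancer
		 * would have to walk it across all 8192 CPUs, reclaiming
		 * only 1/8193 of a CPU from each pinned hog. */
		printf("per-hog time to reclaim: %.6f CPUs\n", 1.0 / tasks);
		return 0;
	}
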
* Re: CFS scheduler unfairly prefers pinned tasks
From: paul.szabo @ 2015-10-06 20:44 UTC
To: umgwanakikbuti; Cc: linux-kernel

Dear Mike,

>> ... CFS is meant to be fair, using things like vruntime to preempt,
>> and throttling. Why are those pinned tasks not preempted or throttled?
>
> Imagine you own an 8192 CPU box for a moment, all CPUs having one pinned
> task, plus one extra unpinned task, and ponder what would have to happen
> in order to meet your utilization expectation. ...

Sorry, but the kernel contradicts that. As per my original report,
things are "fair" in the case of:
 - with CGROUP controls and the two kinds of processes run by
   different users, when there is just one un-pinned process
and that is so on my quad-core i5-3470 baby and on my 32-core
4*E5-4627v2 server (and everywhere that I tested). The kernel is smart
and gets it right for one un-pinned process: why not for two?

Now re-testing further (on some machines with CGROUP): on the i5-3470
things are still fair with one un-pinned (become un-fair with two); on
the 4*E5-4627v2 they are still fair with 4 un-pinned (become un-fair
with 5). Does this suggest that the kernel does things right within
each physical CPU, but breaks across several (or the exact contrary)?
Maybe not: on a 2*E5530 machine, things are fair with just one
un-pinned and un-fair with 2 already.

> What you're seeing is not a bug. No task can occupy more than one CPU
> at a time, making space reservation on multiple CPUs a very bad idea.

I agree that pinning may be bad... should not the kernel penalize the
badly pinned processes?

Cheers, Paul

* Re: CFS scheduler unfairly prefers pinned tasks
From: Mike Galbraith @ 2015-10-07 01:28 UTC
To: paul.szabo; Cc: linux-kernel

On Wed, 2015-10-07 at 07:44 +1100, paul.szabo@sydney.edu.au wrote:
> I agree that pinning may be bad... should not the kernel penalize the
> badly pinned processes?

I didn't say pinning is bad, I said what you're seeing is not a bug.

	-Mike

* Re: CFS scheduler unfairly prefers pinned tasks
From: Mike Galbraith @ 2015-10-08 08:19 UTC
To: paul.szabo, Peter Zijlstra; Cc: linux-kernel

On Tue, 2015-10-06 at 04:45 +0200, Mike Galbraith wrote:
> On Tue, 2015-10-06 at 08:48 +1100, paul.szabo@sydney.edu.au wrote:
> > The Linux CFS scheduler prefers pinned tasks and unfairly gives more
> > CPU time to tasks that have set CPU affinity. This effect is
> > observed with or without CGROUP controls.
[...]
> > and use e.g. top to observe that the pinned processes get more CPU
> > time than "fair".

I see a fairness issue with pinned tasks and group scheduling, but one
opposite to your complaint. Two task groups, one with 8 hogs (oink),
one with 1 (pert), all pinned:

  PID USER PR NI VIRT RES SHR S  %CPU  %MEM   TIME+ P COMMAND
 3269 root 20  0 4060 724 648 R 100.0 0.004 1:00.02 1 oink
 3270 root 20  0 4060 652 576 R 100.0 0.004 0:59.84 2 oink
 3271 root 20  0 4060 692 616 R 100.0 0.004 0:59.95 3 oink
 3274 root 20  0 4060 608 532 R 100.0 0.004 1:00.01 6 oink
 3273 root 20  0 4060 728 652 R 99.90 0.005 0:59.98 5 oink
 3272 root 20  0 4060 644 568 R 99.51 0.004 0:59.80 4 oink
 3268 root 20  0 4060 612 536 R 99.41 0.004 0:59.67 0 oink
 3279 root 20  0 8312 804 708 R 88.83 0.005 0:53.06 7 pert
 3275 root 20  0 4060 656 580 R 11.07 0.004 0:06.98 7 oink

That group share math would make a huge compute group with progress
checkpoints, sharing an SGI monster with one other hog, amusing to
watch.

	-Mike

* Re: CFS scheduler unfairly prefers pinned tasks
From: paul.szabo @ 2015-10-08 10:54 UTC
To: peterz, umgwanakikbuti; Cc: linux-kernel

Dear Mike,

> I see a fairness issue ... but one opposite to your complaint.

Why is that opposite? I think it would be fair for the one pert process
to get 100% CPU; the many oink processes can get everything else. That
one oink is a lowly 10% (when the others are at 100%) is of no
consequence.

What happens when you un-pin pert: does it get 100%? What if you run
two perts? Have you reproduced my observations?

---

Good to see that you agree on the fairness issue... it MUST be fixed!
CFS might be wrong or wasteful, but never unfair.

Cheers, Paul

* Re: CFS scheduler unfairly prefers pinned tasks
From: Peter Zijlstra @ 2015-10-08 11:19 UTC
To: paul.szabo; Cc: umgwanakikbuti, linux-kernel

On Thu, Oct 08, 2015 at 09:54:21PM +1100, paul.szabo@sydney.edu.au wrote:
> Good to see that you agree on the fairness issue... it MUST be fixed!
> CFS might be wrong or wasteful, but never unfair.

I've not yet had time to look at the case at hand, but there are what
are called 'infeasible weight' scenarios, for which it is impossible
to be fair.

Also, CFS must remain a practical scheduler, which places bounds on the
number of weird cases we can deal with.

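A concrete instance of an infeasible-weight scenario (the example below
is illustrative, not taken from the thread): three equal-weight hogs on
two CPUs, two pinned to CPU0 and one pinned to CPU1, cannot all receive
the fair 2/3 of a CPU, no matter what the scheduler does.

	/* Infeasible weights: 2 CPUs; tasks A and B pinned to CPU0,
	 * task C pinned to CPU1, all of equal weight. */
	#include <stdio.h>

	int main(void)
	{
		/* Equal split of 2 CPUs over 3 tasks. */
		printf("fair share per task: %.3f CPUs\n", 2.0 / 3.0);

		/* A and B must split CPU0, so each gets at most 0.5 CPU,
		 * below the fair share; CPU1's spare capacity cannot be
		 * moved to them, so no schedule satisfies the weights. */
		printf("best possible for A or B: 0.500 CPUs\n");
		return 0;
	}
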
* [patch] sched: disable task group re-weighting on the desktop
From: Mike Galbraith @ 2015-10-10 13:22 UTC
To: Peter Zijlstra; Cc: paul.szabo, linux-kernel

On Thu, 2015-10-08 at 13:19 +0200, Peter Zijlstra wrote:
> On Thu, Oct 08, 2015 at 09:54:21PM +1100, paul.szabo@sydney.edu.au wrote:
> > Good to see that you agree on the fairness issue... it MUST be fixed!
> > CFS might be wrong or wasteful, but never unfair.
>
> I've not yet had time to look at the case at hand, but there are what
> are called 'infeasible weight' scenarios, for which it is impossible
> to be fair.

And sometimes, group wide fairness ain't all that wonderful anyway.

> Also, CFS must remain a practical scheduler, which places bounds on
> the number of weird cases we can deal with.

Yup, and on a practical note...

master, 1 group of 8 (oink) vs 8 groups of 1 (pert)
  PID USER PR NI VIRT RES SHR S  %CPU  %MEM   TIME+ P COMMAND
 5618 root 20  0 8312 840 744 R 90.46 0.005 1:40.48 0 pert
 5630 root 20  0 8312 720 624 R 90.46 0.004 1:38.40 4 pert
 5615 root 20  0 8312 768 672 R 89.48 0.005 1:39.25 6 pert
 5621 root 20  0 8312 792 696 R 89.34 0.005 1:38.49 2 pert
 5627 root 20  0 8312 760 664 R 89.06 0.005 1:36.53 5 pert
 5645 root 20  0 8312 804 708 R 89.06 0.005 1:34.69 1 pert
 5624 root 20  0 8312 716 620 R 88.64 0.004 1:38.45 7 pert
 5612 root 20  0 8312 716 620 R 83.03 0.004 1:40.11 3 pert
 5633 root 20  0 8312 792 696 R 10.94 0.005 0:11.59 4 oink
 5635 root 20  0 8312 804 708 R 10.80 0.005 0:11.74 2 oink
 5637 root 20  0 8312 796 700 R 10.80 0.005 0:11.34 5 oink
 5639 root 20  0 8312 836 740 R 10.80 0.005 0:11.71 2 oink
 5634 root 20  0 8312 840 744 R 10.66 0.005 0:11.36 7 oink
 5636 root 20  0 8312 756 660 R 10.66 0.005 0:11.68 1 oink
 5640 root 20  0 8312 752 656 R 10.10 0.005 0:11.41 7 oink
 5638 root 20  0 8312 804 708 R 9.818 0.005 0:11.99 7 oink

Avg 98.2s per group vs 92.8s for the 8 task group. Not _perfect_, but
ok.

Before reading further, now would be a good time for readers to chant
the "perfect is the enemy of good" mantra, pretending my not so
scientific measurements had actually shown perfect group wide
distribution. You're gonna see good, and it doesn't resemble perfect..
which is good ;-)

master+, 1 group of 8 (oink) vs 8 groups of 1 (pert)
  PID USER PR NI VIRT RES SHR S  %CPU  %MEM   TIME+ P COMMAND
19269 root 20  0 8312 716 620 R 77.25 0.004 1:39.43 2 pert
19263 root 20  0 8312 752 656 R 76.65 0.005 1:43.70 7 pert
19257 root 20  0 8312 760 664 R 72.85 0.005 1:37.08 5 pert
19260 root 20  0 8312 804 704 R 71.86 0.005 1:40.42 1 pert
19273 root 20  0 8312 748 652 R 71.26 0.005 1:41.98 6 pert
19266 root 20  0 8312 752 656 R 67.47 0.005 1:41.69 4 pert
19254 root 20  0 8312 744 648 R 61.28 0.005 1:42.88 4 pert
19277 root 20  0 8312 836 740 R 56.29 0.005 0:46.16 5 oink
19281 root 20  0 8312 768 672 R 55.89 0.005 0:42.05 0 oink
19283 root 20  0 8312 840 744 R 44.91 0.005 0:53.05 3 oink
19282 root 20  0 8312 800 704 R 30.74 0.005 0:41.70 3 oink
19284 root 20  0 8312 724 628 R 28.14 0.004 0:42.08 3 oink
19278 root 20  0 8312 752 656 R 25.15 0.005 0:42.26 3 oink
19280 root 20  0 8312 756 660 R 24.35 0.005 0:40.39 3 oink
19279 root 20  0 8312 836 740 R 23.95 0.005 0:45.71 3 oink

Avg 101.6s per pert group vs 353.4s for the 8 task oink group. Not
remotely fair total group utilization wise.

Ah, but now onward to interactivity...

master, 8 groups of 1 (pert) vs desktop (mplayer BigBuckBunny-DivXPlusHD.mkv)
  PID USER PR NI    VIRT    RES   SHR S  %CPU  %MEM   TIME+ P COMMAND
 4068 root 20  0    8312    724   628 R 99.64 0.004 1:04.32 6 pert
 4065 root 20  0    8312    744   648 R 99.45 0.005 1:04.92 5 pert
 4071 root 20  0    8312    748   652 R 99.27 0.005 1:03.12 7 pert
 4077 root 20  0    8312    840   744 R 98.72 0.005 1:01.46 3 pert
 4074 root 20  0    8312    796   700 R 98.18 0.005 1:03.38 1 pert
 4079 root 20  0    8312    720   624 R 97.99 0.004 1:01.45 4 pert
 4062 root 20  0    8312    836   740 R 96.72 0.005 1:03.44 0 pert
 4059 root 20  0    8312    720   624 R 94.16 0.004 1:04.92 2 pert
 4082 root 20  0 1094400 154324 33592 S 4.197 0.954 0:02.69 0 mplayer
 1029 root 20  0  465332 151540 40816 R 3.285 0.937 0:24.59 2 Xorg
 1773 root 20  0  662592  73308 42012 S 2.007 0.453 0:12.84 5 konsole
  771 root 20  0   11416   1964  1824 S 0.730 0.012 0:10.45 0 rngd
 1722 root 20  0 2866772  65224 51152 S 0.365 0.403 0:03.44 2 kwin
 1769 root 20  0  711684  54212 38020 S 0.182 0.335 0:00.39 1 kmix

That is NOT good. Mplayer and friends need more than that.
Interactivity is _horrible_, and buck is an unwatchable mess (no biggy,
I know every frame).

master+, 8 groups of 1 (pert) vs desktop (mplayer BigBuckBunny-DivXPlusHD.mkv)
  PID USER PR  NI    VIRT    RES   SHR S  %CPU  %MEM   TIME+ P COMMAND
 4346 root 20   0    8312    756   660 R 99.20 0.005 0:59.89 5 pert
 4349 root 20   0    8312    748   652 R 98.80 0.005 1:00.77 6 pert
 4343 root 20   0    8312    720   624 R 94.81 0.004 1:02.11 2 pert
 4331 root 20   0    8312    724   628 R 91.22 0.004 1:01.16 3 pert
 4340 root 20   0    8312    720   624 R 91.22 0.004 1:01.06 7 pert
 4328 root 20   0    8312    836   740 R 90.42 0.005 1:00.07 4 pert
 4334 root 20   0    8312    756   660 R 87.82 0.005 0:59.84 1 pert
 4337 root 20   0    8312    824   728 R 76.85 0.005 0:52.20 0 pert
 4352 root 20   0 1058812 123876 33388 S 29.34 0.766 0:25.01 3 mplayer
 1029 root 20   0  471168 156748 40316 R 22.36 0.969 0:42.23 3 Xorg
 1773 root 20   0  663080  74176 42012 S 4.192 0.459 0:17.98 1 konsole
  771 root 20   0   11416   1964  1824 R 1.198 0.012 0:13.45 0 rngd
 1722 root 20   0 2866880  65340 51152 R 0.599 0.404 0:04.87 3 kwin
 1788 root  9 -11  516744  11932  8536 S 0.599 0.074 0:01.01 0 pulseaudio
 1733 root 20   0 3369480 141564 71776 S 0.200 0.875 0:05.51 1 plasma-desktop

That's good. Interactivity is fine; I can't even tell the pert groups
exist by watching buck kick squirrel butt for the 10387th time.

With master, and one 8 hog group vs desktop/mplayer, I can see the hog
group interfere with mplayer. Add another hog group, and mplayer
lurches quite badly. I can feel even one group while using the mouse
wheel to scroll through mail. With master+, I see/feel none of that
unpleasantness.

Conclusion: task group re-weighting is the mortal enemy of a good
desktop.

sched: disable task group re-weighting on the desktop

Task group wide utilization based weight may work well for servers,
but it is horrible on the desktop. 8 groups of 1 hog demolishes
interactivity, 1 group of 8 hogs has noticeable impact, 2 such groups
is very very noticeable. Turn it off if autogroup is enabled, and add
a feature to let people set the definition of fair to what serves them
best. For the desktop, fixed group weight wins hands down, no
contest...

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 kernel/sched/fair.c     | 10 ++++++----
 kernel/sched/features.h | 14 ++++++++++++++
 2 files changed, 20 insertions(+), 4 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2372,6 +2372,8 @@ static long calc_cfs_shares(struct cfs_r
 {
 	long tg_weight, load, shares;
 
+	if (!sched_feat(SMP_FAIR_GROUPS))
+		return tg->shares;
 	tg_weight = calc_tg_weight(tg, cfs_rq);
 	load = cfs_rq_load_avg(cfs_rq);
 
@@ -2420,10 +2422,10 @@ static void update_cfs_shares(struct cfs
 	se = tg->se[cpu_of(rq_of(cfs_rq))];
 	if (!se || throttled_hierarchy(cfs_rq))
 		return;
-#ifndef CONFIG_SMP
-	if (likely(se->load.weight == tg->shares))
-		return;
-#endif
+	if (!IS_ENABLED(CONFIG_SMP) || !sched_feat(SMP_FAIR_GROUPS)) {
+		if (likely(se->load.weight == tg->shares))
+			return;
+	}
 	shares = calc_cfs_shares(cfs_rq, tg);
 
 	reweight_entity(cfs_rq_of(se), se, shares);
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -88,3 +88,17 @@ SCHED_FEAT(LB_MIN, false)
  */
 SCHED_FEAT(NUMA,	true)
 #endif
+
+#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+/*
+ * With SMP_FAIR_GROUPS set, group wide activity determines the share
+ * of all group members.  This does very bad things to interactivity
+ * when a desktop box is heavily loaded.  Default to off when autogroup
+ * is enabled, and let users set it to what works best for them.
+ */
+#ifndef CONFIG_SCHED_AUTOGROUP
+SCHED_FEAT(SMP_FAIR_GROUPS, true)
+#else
+SCHED_FEAT(SMP_FAIR_GROUPS, false)
+#endif
+#endif

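For completeness, a hedged aside on the mechanism: with
CONFIG_SCHED_DEBUG, sched features are runtime-toggleable through
/sys/kernel/debug/sched_features, so (assuming the patch above is
applied) the new knob could be flipped without a reboot. A minimal
sketch:

	/* Equivalent of:
	 *   echo NO_SMP_FAIR_GROUPS > /sys/kernel/debug/sched_features
	 * The feature name comes from the patch above; the "NO_" prefix
	 * clears a feature. Requires CONFIG_SCHED_DEBUG and a mounted
	 * debugfs. */
	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/sys/kernel/debug/sched_features", "w");

		if (!f) {
			perror("sched_features");
			return 1;
		}
		fputs("NO_SMP_FAIR_GROUPS", f);
		return fclose(f) ? 1 : 0;
	}
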
* Re: [patch] sched: disable task group re-weighting on the desktop
From: kbuild test robot @ 2015-10-10 14:03 UTC
To: Mike Galbraith; Cc: kbuild-all, Peter Zijlstra, paul.szabo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4686 bytes --]

Hi Mike,

[auto build test ERROR on v4.3-rc4 -- if it's inappropriate base, please ignore]

config: mips-allyesconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=mips

All errors (new ones prefixed by >>):

   In file included from kernel/sched/fair.c:36:0:
   kernel/sched/fair.c: In function 'update_cfs_shares':
>> kernel/sched/sched.h:1001:24: error: implicit declaration of function 'static_branch_SMP_FAIR_GROUPS' [-Werror=implicit-function-declaration]
    #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
                           ^
   kernel/sched/fair.c:2425:34: note: in expansion of macro 'sched_feat'
     if (!IS_ENABLED(CONFIG_SMP) || !sched_feat(SMP_FAIR_GROUPS)) {
                                     ^
   kernel/sched/sched.h:1001:59: error: '__SCHED_FEAT_SMP_FAIR_GROUPS' undeclared (first use in this function)
    #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
                                                              ^
   kernel/sched/fair.c:2425:34: note: in expansion of macro 'sched_feat'
     if (!IS_ENABLED(CONFIG_SMP) || !sched_feat(SMP_FAIR_GROUPS)) {
                                     ^
   kernel/sched/sched.h:1001:59: note: each undeclared identifier is reported only once for each function it appears in
    #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
                                                              ^
   kernel/sched/fair.c:2425:34: note: in expansion of macro 'sched_feat'
     if (!IS_ENABLED(CONFIG_SMP) || !sched_feat(SMP_FAIR_GROUPS)) {
                                     ^
   cc1: some warnings being treated as errors

vim +/static_branch_SMP_FAIR_GROUPS +1001 kernel/sched/sched.h

029632fb kernel/sched.h       Peter Zijlstra 2011-10-25   985  };
029632fb kernel/sched.h       Peter Zijlstra 2011-10-25   986
029632fb kernel/sched.h       Peter Zijlstra 2011-10-25   987  #undef SCHED_FEAT
029632fb kernel/sched.h       Peter Zijlstra 2011-10-25   988
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   989  #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   990  #define SCHED_FEAT(name, enabled)	\
c5905afb kernel/sched/sched.h Ingo Molnar    2012-02-24   991  static __always_inline bool static_branch_##name(struct static_key *key) \
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   992  {	\
6e76ea8a kernel/sched/sched.h Jason Baron    2014-07-02   993  	return static_key_##enabled(key);	\
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   994  }
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   995
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   996  #include "features.h"
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   997
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   998  #undef SCHED_FEAT
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06   999
c5905afb kernel/sched/sched.h Ingo Molnar    2012-02-24  1000  extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06 @1001  #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06  1002  #else /* !(SCHED_DEBUG && HAVE_JUMP_LABEL) */
029632fb kernel/sched.h       Peter Zijlstra 2011-10-25  1003  #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
f8b6d1cc kernel/sched/sched.h Peter Zijlstra 2011-07-06  1004  #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
029632fb kernel/sched.h       Peter Zijlstra 2011-10-25  1005
cbee9f88 kernel/sched/sched.h Peter Zijlstra 2012-10-25  1006  #ifdef CONFIG_NUMA_BALANCING
cbee9f88 kernel/sched/sched.h Peter Zijlstra 2012-10-25  1007  #define sched_feat_numa(x) sched_feat(x)
3105b86a kernel/sched/sched.h Mel Gorman     2012-11-23  1008  #ifdef CONFIG_SCHED_DEBUG
3105b86a kernel/sched/sched.h Mel Gorman     2012-11-23  1009  #define numabalancing_enabled sched_feat_numa(NUMA)

:::::: The code at line 1001 was first introduced by commit
:::::: f8b6d1cc7dc15cf3de538b864eefaedad7a84d85 sched: Use jump_labels for sched_feat

:::::: TO: Peter Zijlstra <a.p.zijlstra@chello.nl>
:::::: CC: Ingo Molnar <mingo@elte.hu>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 39228 bytes --]

* Re: [patch] sched: disable task group re-weighting on the desktop
From: Mike Galbraith @ 2015-10-10 14:41 UTC
To: kbuild test robot; Cc: kbuild-all, Peter Zijlstra, paul.szabo, linux-kernel

On Sat, 2015-10-10 at 22:03 +0800, kbuild test robot wrote:
> Hi Mike,

Hi there, pin-the-tail-on-the-donkey bot. Eeee Ahhh :)

sched: disable task group wide utilization based weight on the desktop

Task group wide utilization based weight may work well for servers,
but it is horrible on the desktop. 8 groups of 1 hog demolishes
interactivity, 1 group of 8 hogs has noticeable impact, 2 such groups
is very very noticeable. Turn it off if autogroup is enabled, and add
a feature to let people set the definition of fair to what serves them
best. For the desktop, fixed group weight wins hands down, no
contest...

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 kernel/sched/fair.c     |  5 +++++
 kernel/sched/features.h | 14 ++++++++++++++
 2 files changed, 19 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2372,6 +2372,8 @@ static long calc_cfs_shares(struct cfs_r
 {
 	long tg_weight, load, shares;
 
+	if (!sched_feat(SMP_FAIR_GROUPS))
+		return tg->shares;
 	tg_weight = calc_tg_weight(tg, cfs_rq);
 	load = cfs_rq_load_avg(cfs_rq);
 
@@ -2423,6 +2425,9 @@ static void update_cfs_shares(struct cfs
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
 		return;
+#else
+	if (!sched_feat(SMP_FAIR_GROUPS) && se->load.weight == tg->shares)
+		return;
 #endif
 	shares = calc_cfs_shares(cfs_rq, tg);
 
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -88,3 +88,17 @@ SCHED_FEAT(LB_MIN, false)
  */
 SCHED_FEAT(NUMA,	true)
 #endif
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/*
+ * With SMP_FAIR_GROUPS set, group wide activity determines the share
+ * of all group members.  This does very bad things to interactivity
+ * when a desktop box is heavily loaded.  Default to off when autogroup
+ * is enabled, and let users set it to what works best for them.
+ */
+#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+SCHED_FEAT(SMP_FAIR_GROUPS, true)
+#else
+SCHED_FEAT(SMP_FAIR_GROUPS, false)
+#endif
+#endif

* Re: [patch] sched: disable task group re-weighting on the desktop
From: Peter Zijlstra @ 2015-10-10 17:01 UTC
To: Mike Galbraith; Cc: paul.szabo, linux-kernel

On Sat, Oct 10, 2015 at 03:22:49PM +0200, Mike Galbraith wrote:
> Ah, but now onward to interactivity...
>
> master, 8 groups of 1 (pert) vs desktop (mplayer BigBuckBunny-DivXPlusHD.mkv)
[...]
> That is NOT good. Mplayer and friends need more than that.
> Interactivity is _horrible_, and buck is an unwatchable mess (no
> biggy, I know every frame).
>
> master+, 8 groups of 1 (pert) vs desktop (mplayer BigBuckBunny-DivXPlusHD.mkv)
[...]
> That's good. Interactivity is fine...

But the patch is most horrible.. :/ It completely destroys everything
group scheduling is supposed to be.

What are these oink/pert things? Both spinners, just with amusing names
to distinguish them?

Is the interactivity the same (horrible) at fe32d3cd5e8e (ie, before the
load tracking rewrite from Yuyang)?

* Re: [patch] sched: disable task group re-weighting on the desktop
From: Peter Zijlstra @ 2015-10-10 17:13 UTC
To: Mike Galbraith; Cc: paul.szabo, linux-kernel

On Sat, Oct 10, 2015 at 07:01:42PM +0200, Peter Zijlstra wrote:
> On Sat, Oct 10, 2015 at 03:22:49PM +0200, Mike Galbraith wrote:
> > Ah, but now onward to interactivity...
> >
> > master, 8 groups of 1 (pert) vs desktop (mplayer BigBuckBunny-DivXPlusHD.mkv)
> >   PID USER PR NI    VIRT    RES   SHR S  %CPU  %MEM   TIME+ P COMMAND
> >  4068 root 20  0    8312    724   628 R 99.64 0.004 1:04.32 6 pert
> >  4065 root 20  0    8312    744   648 R 99.45 0.005 1:04.92 5 pert
> >  4071 root 20  0    8312    748   652 R 99.27 0.005 1:03.12 7 pert
> >  4077 root 20  0    8312    840   744 R 98.72 0.005 1:01.46 3 pert
> >  4074 root 20  0    8312    796   700 R 98.18 0.005 1:03.38 1 pert
> >  4079 root 20  0    8312    720   624 R 97.99 0.004 1:01.45 4 pert
> >  4062 root 20  0    8312    836   740 R 96.72 0.005 1:03.44 0 pert
> >  4059 root 20  0    8312    720   624 R 94.16 0.004 1:04.92 2 pert
> >  4082 root 20  0 1094400 154324 33592 S 4.197 0.954 0:02.69 0 mplayer
> >  1029 root 20  0  465332 151540 40816 R 3.285 0.937 0:24.59 2 Xorg
> >  1773 root 20  0  662592  73308 42012 S 2.007 0.453 0:12.84 5 konsole
> >   771 root 20  0   11416   1964  1824 S 0.730 0.012 0:10.45 0 rngd
> >  1722 root 20  0 2866772  65224 51152 S 0.365 0.403 0:03.44 2 kwin
> >  1769 root 20  0  711684  54212 38020 S 0.182 0.335 0:00.39 1 kmix

Ah wait, so you have 8 groups of cycle soakers vs 1 group of desktop?

That means your desktop will get 1/9th of the total time, and that is
almost so:

  2.69 + 24.59 + 12.84 + 10.45 + 3.44 + .39 = 54.4

vs

  (64.32 + 64.92 + 63.12 + 61.46 + 63.38 + 61.45 + 63.44 + 64.92) / 8 = 63.37625

Which isn't too far off. This really appears to be a case where you get
what you ask for.

* Re: [patch] sched: disable task group re-weighting on the desktop
From: Mike Galbraith @ 2015-10-11 02:25 UTC
To: Peter Zijlstra; Cc: paul.szabo, linux-kernel

On Sat, 2015-10-10 at 19:01 +0200, Peter Zijlstra wrote:
> But the patch is most horrible.. :/ It completely destroys everything
> group scheduling is supposed to be.

Yeah, and it works great but...

> What are these oink/pert things? Both spinners, just with amusing
> names to distinguish them?

(yeah)

> Is the interactivity the same (horrible) at fe32d3cd5e8e (ie, before
> the load tracking rewrite from Yuyang)?

...you're right. fe32d3cd5e8e isn't as good as master with a big dent
in its skull, but it is far from the ugly beast I clubbed to death.

	-Mike

* 4.3 group scheduling regression
From: Mike Galbraith @ 2015-10-11 17:42 UTC
To: Peter Zijlstra; Cc: linux-kernel, Yuyang Du

(change subject, CCs)

On Sun, 2015-10-11 at 04:25 +0200, Mike Galbraith wrote:
> > Is the interactivity the same (horrible) at fe32d3cd5e8e (ie, before
> > the load tracking rewrite from Yuyang)?

It is the rewrite, 9d89c257dfb9c51a532d69397f6eed75e5168c35.

Watching 8 single hog groups vs 1 tbench group, master vs 4.2.3, I saw
no big hairy difference, just as with 1 group of 8 hogs vs 8 groups of
1. 8 single hog groups vs the less hungry mplayer otoh is quite
different. 100 second scripted recordings:

(note: "testo" is kde konsole acting as task group launch vehicle)

master
-----------------------------------------------------------------------------------------------------------------
 Task                 |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
-----------------------------------------------------------------------------------------------------------------
 oink:(8)             | 787637.964 ms |    16242 | avg:    0.557 ms | max:   68.993 ms | max at:    239.126118 s
 mplayer:(25)         |   5477.234 ms |     8504 | avg:   16.395 ms | max: 2100.233 ms | max at:    282.850734 s
 Xorg:997             |   1773.218 ms |     4680 | avg:    4.857 ms | max: 1640.194 ms | max at:    285.660210 s
 konsole:1789         |    649.323 ms |     1261 | avg:    6.747 ms | max:  156.282 ms | max at:    265.548523 s
 testo:(9)            |    454.046 ms |     2867 | avg:    5.961 ms | max:  276.371 ms | max at:    245.511282 s
 plasma-desktop:1753  |    223.251 ms |     1582 | avg:    4.220 ms | max:  299.354 ms | max at:    337.242542 s
 kwin:1745            |    156.746 ms |     2879 | avg:    2.398 ms | max:  355.765 ms | max at:    337.242490 s
 pulseaudio:1797      |     60.268 ms |     2573 | avg:    0.695 ms | max:   36.069 ms | max at:    292.318120 s
 threaded-ml:3477     |     47.076 ms |     3878 | avg:    7.083 ms | max: 1898.940 ms | max at:    254.919367 s
 perf:3437            |     28.525 ms |        4 | avg:  129.042 ms | max:  498.816 ms | max at:    336.102154 s

4.2.3
-----------------------------------------------------------------------------------------------------------------
 Task                 |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
-----------------------------------------------------------------------------------------------------------------
 oink:(8)             | 741307.292 ms |    42325 | avg:    1.276 ms | max:   23.598 ms | max at:    192.459790 s
 mplayer:(25)         |  35296.804 ms |    35423 | avg:    1.715 ms | max:   71.972 ms | max at:    128.737783 s
 Xorg:929             |  13257.917 ms |    21583 | avg:    0.091 ms | max:   27.983 ms | max at:    102.272376 s
 testo:(9)            |   2315.080 ms |    13213 | avg:    0.133 ms | max:    6.632 ms | max at:    201.422570 s
 konsole:1747         |    938.939 ms |     1458 | avg:    0.096 ms | max:   15.006 ms | max at:    102.260294 s
 kwin:1703            |    815.384 ms |    17376 | avg:    0.464 ms | max:    9.311 ms | max at:    119.026179 s
 pulseaudio:1762      |    396.168 ms |    14338 | avg:    0.020 ms | max:    6.514 ms | max at:    115.928179 s
 threaded-ml:3477     |    310.132 ms |    23966 | avg:    0.428 ms | max:   27.974 ms | max at:    134.100588 s
 plasma-desktop:1711  |    239.232 ms |     1577 | avg:    0.048 ms | max:    7.072 ms | max at:    102.060279 s
 perf:3434            |     65.705 ms |        2 | avg:    0.054 ms | max:    0.105 ms | max at:    102.011221 s

master, mplayer solo reference
-----------------------------------------------------------------------------------------------------------------
 Task                 |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
-----------------------------------------------------------------------------------------------------------------
 mplayer:(25)         |  32171.732 ms |    18416 | avg:    0.012 ms | max:    4.405 ms | max at:   4911.226038 s
 Xorg:948             |  14271.286 ms |    17396 | avg:    0.016 ms | max:    0.082 ms | max at:   4911.243020 s
 testo:4121           |   3594.784 ms |    11607 | avg:    0.015 ms | max:    0.078 ms | max at:   4981.705240 s
 kwin:1650            |   1209.387 ms |    17562 | avg:    0.012 ms | max:    1.612 ms | max at:   4911.245523 s
 konsole:1728         |    967.914 ms |     1498 | avg:    0.007 ms | max:    0.048 ms | max at:   4997.903759 s
 pulseaudio:1750      |    684.342 ms |    14460 | avg:    0.013 ms | max:    0.552 ms | max at:   4957.743502 s
 threaded-ml:4153     |    641.893 ms |    15748 | avg:    0.016 ms | max:    2.201 ms | max at:   4923.928810 s
 plasma-desktop:1658  |    150.068 ms |      569 | avg:    0.011 ms | max:    0.390 ms | max at:   4911.258650 s
 perf:4126            |     43.854 ms |        3 | avg:    0.022 ms | max:    0.051 ms | max at:   4959.327694 s

* Re: 4.3 group scheduling regression
From: Peter Zijlstra @ 2015-10-12 07:23 UTC
To: Mike Galbraith; Cc: linux-kernel, Yuyang Du

On Sun, Oct 11, 2015 at 07:42:01PM +0200, Mike Galbraith wrote:
> (change subject, CCs)
>
> On Sun, 2015-10-11 at 04:25 +0200, Mike Galbraith wrote:
> > > Is the interactivity the same (horrible) at fe32d3cd5e8e (ie,
> > > before the load tracking rewrite from Yuyang)?
>
> It is the rewrite, 9d89c257dfb9c51a532d69397f6eed75e5168c35.

Just to be sure: so 9d89c257dfb9^1 is good, while 9d89c257dfb9 is bad?

And *groan*, _just_ the thing I need on a monday morning ;-)

* Re: 4.3 group scheduling regression
From: Mike Galbraith @ 2015-10-12 07:44 UTC
To: Peter Zijlstra; Cc: linux-kernel, Yuyang Du

On Mon, 2015-10-12 at 09:23 +0200, Peter Zijlstra wrote:
> On Sun, Oct 11, 2015 at 07:42:01PM +0200, Mike Galbraith wrote:
> > It is the rewrite, 9d89c257dfb9c51a532d69397f6eed75e5168c35.
>
> Just to be sure: so 9d89c257dfb9^1 is good, while 9d89c257dfb9 is bad?

Yeah, I went ahead and bisected.

> And *groan*, _just_ the thing I need on a monday morning ;-)

Sorry 'bout that.

It's odd to me that things look pretty much the same in the good and
bad trees with hogs vs hogs or hogs vs tbench (with top anyway, just
adding up times). Seems Xorg+mplayer more or less playing cross group
ping-pong must be the BadThing trigger.

	-Mike

* Re: 4.3 group scheduling regression
From: Peter Zijlstra @ 2015-10-12 08:04 UTC
To: Mike Galbraith; Cc: linux-kernel, Yuyang Du

On Mon, Oct 12, 2015 at 09:44:57AM +0200, Mike Galbraith wrote:
> It's odd to me that things look pretty much the same in the good and
> bad trees with hogs vs hogs or hogs vs tbench (with top anyway, just
> adding up times). Seems Xorg+mplayer more or less playing cross group
> ping-pong must be the BadThing trigger.

Ohh, wait, Xorg and mplayer are _not_ in the same group? I was assuming
you had your entire user session in 1 (auto) group that was competing
against 8 manual cgroups.

So how exactly are things configured?

* Re: 4.3 group scheduling regression
From: Yuyang Du @ 2015-10-12 00:53 UTC
To: Peter Zijlstra; Cc: Mike Galbraith, linux-kernel

Good morning, Peter.

On Mon, Oct 12, 2015 at 10:04:07AM +0200, Peter Zijlstra wrote:
> Ohh, wait, Xorg and mplayer are _not_ in the same group? I was assuming
> you had your entire user session in 1 (auto) group that was competing
> against 8 manual cgroups.
>
> So how exactly are things configured?

Hmm... my impression is that the naughty boy mplayer (+Xorg) isn't
favored, due to the per CPU group entity share distribution. Let me dig
more. Sorry.

* Re: 4.3 group scheduling regression
From: Peter Zijlstra @ 2015-10-12 09:12 UTC
To: Yuyang Du; Cc: Mike Galbraith, linux-kernel

On Mon, Oct 12, 2015 at 08:53:51AM +0800, Yuyang Du wrote:
> Hmm... my impression is that the naughty boy mplayer (+Xorg) isn't
> favored, due to the per CPU group entity share distribution. Let me
> dig more.

So in the old code we had 'magic' to deal with the case where a cgroup
was consuming less than 1 CPU's worth of runtime. For example, a single
task running in the group.

In that scenario it might be possible that the group entity weight:

	se->weight = (tg->shares * cfs_rq->weight) / tg->weight;

strongly deviates from tg->shares; you want the single task to reflect
the full group shares to the next level, due to the whole distributed
approximation stuff.

I see you've deleted all that code; see the former
__update_group_entity_contrib().

It could be that we need to bring that back. But let me think a little
bit more on this.. I'm having a hard time waking :/

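Plugging toy numbers into that formula (the numbers are assumptions for
illustration, not from the thread) shows both the single-task case
Peter describes and how an average-based cfs_rq contribution can
deflate the entity weight:

	/* se->weight = (tg->shares * cfs_rq->weight) / tg->weight */
	#include <stdio.h>

	int main(void)
	{
		long shares = 1024;	/* default cpu.shares */
		long tg_weight = 1024;	/* group-wide load: one nice-0 task */

		/* Instantaneous weight: the lone task (weight 1024 on
		 * this CPU) carries the full group shares upward. */
		printf("instantaneous: %ld\n", shares * 1024 / tg_weight);

		/* If the per-CPU contribution is a still-ramping average
		 * (say 128 shortly after a wakeup/migrate), the entity
		 * weight collapses well below tg->shares. */
		printf("ramping avg:   %ld\n", shares * 128 / tg_weight);
		return 0;
	}
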
* Re: 4.3 group scheduling regression
From: Yuyang Du @ 2015-10-12 02:12 UTC
To: Peter Zijlstra; Cc: Mike Galbraith, linux-kernel

On Mon, Oct 12, 2015 at 11:12:06AM +0200, Peter Zijlstra wrote:
> So in the old code we had 'magic' to deal with the case where a cgroup
> was consuming less than 1 CPU's worth of runtime. For example, a single
> task running in the group.
>
> In that scenario it might be possible that the group entity weight:
>
> 	se->weight = (tg->shares * cfs_rq->weight) / tg->weight;
>
> strongly deviates from tg->shares; you want the single task to reflect
> the full group shares to the next level, due to the whole distributed
> approximation stuff.

Yeah, I thought so.

> I see you've deleted all that code; see the former
> __update_group_entity_contrib().

Probably not there, it actually was an icky way to adjust things.

> It could be that we need to bring that back. But let me think a little
> bit more on this.. I'm having a hard time waking :/

I am guessing it is in calc_tg_weight(), and naughty boys do make
themselves more favored, what a reality...

Mike, beg you to test the following?

--

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4df37a4..b184da0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2370,7 +2370,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
 	 */
 	tg_weight = atomic_long_read(&tg->load_avg);
 	tg_weight -= cfs_rq->tg_load_avg_contrib;
-	tg_weight += cfs_rq_load_avg(cfs_rq);
+	tg_weight += cfs_rq->load.weight;
 
 	return tg_weight;
 }
@@ -2380,7 +2380,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 	long tg_weight, load, shares;
 
 	tg_weight = calc_tg_weight(tg, cfs_rq);
-	load = cfs_rq_load_avg(cfs_rq);
+	load = cfs_rq->load.weight;
 
 	shares = (tg->shares * load);
 	if (tg_weight)

* Re: 4.3 group scheduling regression
From: Mike Galbraith @ 2015-10-12 10:23 UTC
To: Yuyang Du; Cc: Peter Zijlstra, linux-kernel

On Mon, 2015-10-12 at 10:12 +0800, Yuyang Du wrote:
> I am guessing it is in calc_tg_weight(), and naughty boys do make
> themselves more favored, what a reality...
>
> Mike, beg you to test the following?

Wow, that was quick. Dinky patch made it all better.

-----------------------------------------------------------------------------------------------------------------
 Task                 |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
-----------------------------------------------------------------------------------------------------------------
 oink:(8)             | 739056.970 ms |    27270 | avg:    2.043 ms | max:   29.105 ms | max at:    339.988310 s
 mplayer:(25)         |  36448.997 ms |    44670 | avg:    1.886 ms | max:   72.808 ms | max at:    302.153121 s
 Xorg:988             |  13334.908 ms |    22210 | avg:    0.081 ms | max:   25.005 ms | max at:    269.068666 s
 testo:(9)            |   2558.540 ms |    13703 | avg:    0.124 ms | max:    6.412 ms | max at:    279.235272 s
 konsole:1781         |   1084.316 ms |     1457 | avg:    0.006 ms | max:    1.039 ms | max at:    268.863379 s
 kwin:1734            |    879.645 ms |    17855 | avg:    0.458 ms | max:   15.788 ms | max at:    268.854992 s
 pulseaudio:1808      |    356.334 ms |    15023 | avg:    0.028 ms | max:    6.134 ms | max at:    324.479766 s
 threaded-ml:3483     |    292.782 ms |    25769 | avg:    0.364 ms | max:   40.387 ms | max at:    294.550515 s
 plasma-desktop:1745  |    265.055 ms |     1470 | avg:    0.102 ms | max:   21.886 ms | max at:    267.724902 s
 perf:3439            |     61.677 ms |        2 | avg:    0.117 ms | max:    0.232 ms | max at:    367.043889 s

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..b184da0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2370,7 +2370,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
>  	 */
>  	tg_weight = atomic_long_read(&tg->load_avg);
>  	tg_weight -= cfs_rq->tg_load_avg_contrib;
> -	tg_weight += cfs_rq_load_avg(cfs_rq);
> +	tg_weight += cfs_rq->load.weight;
> 
>  	return tg_weight;
>  }
> @@ -2380,7 +2380,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
>  	long tg_weight, load, shares;
> 
>  	tg_weight = calc_tg_weight(tg, cfs_rq);
> -	load = cfs_rq_load_avg(cfs_rq);
> +	load = cfs_rq->load.weight;
> 
>  	shares = (tg->shares * load);
>  	if (tg_weight)

* Re: 4.3 group scheduling regression
From: Yuyang Du @ 2015-10-12 19:55 UTC
To: Mike Galbraith; Cc: Peter Zijlstra, linux-kernel

On Mon, Oct 12, 2015 at 12:23:31PM +0200, Mike Galbraith wrote:
> Wow, that was quick. Dinky patch made it all better.
[...]

Phew...

I think maybe the real disease is that tg->load_avg is not updated in
time, i.e., after a migrate, the source cfs_rq does not decrease its
contribution to the parent's tg->load_avg fast enough.

--

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4df37a4..3dba883 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2686,12 +2686,13 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
 static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
 	struct sched_avg *sa = &cfs_rq->avg;
-	int decayed;
+	int decayed, updated = 0;
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		updated = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
@@ -2708,7 +2709,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
 
-	return decayed;
+	return decayed | updated;
 }
 
 /* Update task and its cfs_rq load average */

* Re: 4.3 group scheduling regression
From: Mike Galbraith @ 2015-10-13 04:08 UTC
To: Yuyang Du; Cc: Peter Zijlstra, linux-kernel

On Tue, 2015-10-13 at 03:55 +0800, Yuyang Du wrote:
> Phew...
>
> I think maybe the real disease is that tg->load_avg is not updated in
> time, i.e., after a migrate, the source cfs_rq does not decrease its
> contribution to the parent's tg->load_avg fast enough.

It sounded like you wanted me to run the below alone. If so, it's a
nogo.

-----------------------------------------------------------------------------------------------------------------
 Task                 |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
-----------------------------------------------------------------------------------------------------------------
 oink:(8)             | 787001.236 ms |    21641 | avg:    0.377 ms | max:   21.991 ms | max at:     51.504005 s
 mplayer:(25)         |   4256.224 ms |     7264 | avg:   19.698 ms | max: 2087.489 ms | max at:    115.294922 s
 Xorg:1011            |   1507.958 ms |     4081 | avg:    8.349 ms | max: 1652.200 ms | max at:    126.908021 s
 konsole:1752         |    697.806 ms |     1186 | avg:    5.749 ms | max:  160.189 ms | max at:     53.037952 s
 testo:(9)            |    438.164 ms |     2551 | avg:    6.616 ms | max:  215.527 ms | max at:    117.302455 s
 plasma-desktop:1716  |    280.418 ms |     1624 | avg:    3.701 ms | max:  574.806 ms | max at:     53.582261 s
 kwin:1708            |    144.986 ms |     2422 | avg:    3.301 ms | max:  315.707 ms | max at:    116.555721 s

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..3dba883 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2686,12 +2686,13 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
>  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
>  	struct sched_avg *sa = &cfs_rq->avg;
> -	int decayed;
> +	int decayed, updated = 0;
> 
>  	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
>  		long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
>  		sa->load_avg = max_t(long, sa->load_avg - r, 0);
>  		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> +		updated = 1;
>  	}
> 
>  	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
> @@ -2708,7 +2709,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  	cfs_rq->load_last_update_time_copy = sa->last_update_time;
>  #endif
> 
> -	return decayed;
> +	return decayed | updated;
>  }
> 
>  /* Update task and its cfs_rq load average */

* Re: 4.3 group scheduling regression
  2015-10-13  4:08 ` Mike Galbraith
@ 2015-10-12 20:42 ` Yuyang Du
  0 siblings, 0 replies; 48+ messages in thread
From: Yuyang Du @ 2015-10-12 20:42 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Peter Zijlstra, linux-kernel

On Tue, Oct 13, 2015 at 06:08:34AM +0200, Mike Galbraith wrote:
> It sounded like you wanted me to run the below alone. If so, it's a nogo.

Yes, thanks. Then the sad fact is that after the migration, once the removed
load has been recorded in migrate_task_rq_fair(), we don't get a chance to
update the tg fast enough, so at the destination mplayer is not yet weighted
up to the group's share.

> -----------------------------------------------------------------------------------------------------------------
>  Task                 |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at      |
> -----------------------------------------------------------------------------------------------------------------
>  oink:(8)             | 787001.236 ms |    21641 | avg:    0.377 ms | max:   21.991 ms | max at:     51.504005 s
>  mplayer:(25)         |   4256.224 ms |     7264 | avg:   19.698 ms | max: 2087.489 ms | max at:    115.294922 s
>  Xorg:1011            |   1507.958 ms |     4081 | avg:    8.349 ms | max: 1652.200 ms | max at:    126.908021 s
>  konsole:1752         |    697.806 ms |     1186 | avg:    5.749 ms | max:  160.189 ms | max at:     53.037952 s
>  testo:(9)            |    438.164 ms |     2551 | avg:    6.616 ms | max:  215.527 ms | max at:    117.302455 s
>  plasma-desktop:1716  |    280.418 ms |     1624 | avg:    3.701 ms | max:  574.806 ms | max at:     53.582261 s
>  kwin:1708            |    144.986 ms |     2422 | avg:    3.301 ms | max:  315.707 ms | max at:    116.555721 s
>
> > --
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 4df37a4..3dba883 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2686,12 +2686,13 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> >  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >  {
> >  	struct sched_avg *sa = &cfs_rq->avg;
> > -	int decayed;
> > +	int decayed, updated = 0;
> >
> >  	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> >  		long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> >  		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> >  		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> > +		updated = 1;
> >  	}
> >
> >  	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
> > @@ -2708,7 +2709,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >  	cfs_rq->load_last_update_time_copy = sa->last_update_time;
> >  #endif
> >
> > -	return decayed;
> > +	return decayed | updated;

A typo: decayed || updated, but shouldn't make any difference.

> >  }
> >
> >  /* Update task and its cfs_rq load average */
>

^ permalink raw reply	[flat|nested] 48+ messages in thread
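Yuyang's aside about the typo is easy to verify: for flags that only ever hold 0 or 1, bitwise OR and logical OR compute the same value (they differ only in short-circuit evaluation). A throwaway check:

  #include <assert.h>

  int main(void)
  {
      for (int decayed = 0; decayed <= 1; decayed++)
          for (int updated = 0; updated <= 1; updated++)
              assert((decayed | updated) == (decayed || updated));
      return 0;
  }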
* Re: 4.3 group scheduling regression
  2015-10-12 19:55 ` Yuyang Du
  2015-10-13  4:08 ` Mike Galbraith
@ 2015-10-13  8:06 ` Peter Zijlstra
  2015-10-13  0:35   ` Yuyang Du
  2015-10-13  8:10   ` Peter Zijlstra
  1 sibling, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-13 8:06 UTC (permalink / raw)
To: Yuyang Du; +Cc: Mike Galbraith, linux-kernel

On Tue, Oct 13, 2015 at 03:55:17AM +0800, Yuyang Du wrote:

> I think maybe the real disease is the tg->load_avg is not updated in time.
> I.e., it is after migrate, the source cfs_rq does not decrease its contribution
> to the parent's tg->load_avg fast enough.

No, using the load_avg for shares calculation seems wrong; that would
mean we'd first have to ramp up the avg before you react.

You want to react quickly to actual load changes, esp. going up.

We use the avg to guess the global group load, since that's the best
compromise we have, but locally it doesn't make sense to use the avg if
we have the actual values.

^ permalink raw reply	[flat|nested] 48+ messages in thread
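Peter's point is easy to see with numbers. Below is a standalone simplification of the calc_cfs_shares() math with invented values; the clamping mirrors the kernel's shape, but treat the details as approximate. A task's load.weight counts in full the instant it is enqueued, while its load_avg must ramp up over successive periods, so shares computed from the average start out far too small:

  #include <stdio.h>

  #define MIN_SHARES 2

  /* shares = tg_shares * (this runqueue's load) / (group total load),
   * clamped to [MIN_SHARES, tg_shares]. Illustrative only. */
  static long shares(long tg_shares, long local, long total)
  {
      long s = total ? tg_shares * local / total : tg_shares;

      if (s < MIN_SHARES)
          s = MIN_SHARES;
      if (s > tg_shares)
          s = tg_shares;
      return s;
  }

  int main(void)
  {
      long tg_shares = 1024;

      /* instantaneous weight: one runnable task, full weight at once */
      printf("from load.weight: %ld\n", shares(tg_shares, 1024, 1024));

      /* just after a wakeup/migration: this runqueue's average is still
       * tiny while the group total still carries contributions recorded
       * elsewhere */
      printf("from load_avg:    %ld\n", shares(tg_shares, 16, 1024));
      return 0;
  }

The first line prints 1024 (the group entity carries the group's full weight to the next level); the second prints 16, which is the sort of starvation Mike's mplayer group was seeing.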
* Re: 4.3 group scheduling regression
  2015-10-13  8:06 ` Peter Zijlstra
@ 2015-10-13  0:35 ` Yuyang Du
  0 siblings, 0 replies; 48+ messages in thread
From: Yuyang Du @ 2015-10-13 0:35 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel

On Tue, Oct 13, 2015 at 10:06:48AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2015 at 03:55:17AM +0800, Yuyang Du wrote:
>
> > I think maybe the real disease is the tg->load_avg is not updated in time.
> > I.e., it is after migrate, the source cfs_rq does not decrease its contribution
> > to the parent's tg->load_avg fast enough.
>
> No, using the load_avg for shares calculation seems wrong; that would
> mean we'd first have to ramp up the avg before you react.
>
> You want to react quickly to actual load changes, esp. going up.
>
> We use the avg to guess the global group load, since that's the best
> compromise we have, but locally it doesn't make sense to use the avg if
> we have the actual values.

In Mike's case, since the mplayer group has only one active task, after the
task migrates, the source cfs_rq should have zero contribution to the tg, so
at the destination the group entity should get the entire tg's share. It is
just that the zeroing needs to happen that fast.

But yes, in the general case the load_avg (which includes the blocked load)
is likely to lag behind. Using the actual load.weight to accelerate the
process makes sense. It is especially helpful to the less hungry tasks.

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: 4.3 group scheduling regression
  2015-10-13  8:06 ` Peter Zijlstra
  2015-10-13  0:35 ` Yuyang Du
@ 2015-10-13  8:10 ` Peter Zijlstra
  2015-10-13  0:37   ` Yuyang Du
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-13 8:10 UTC (permalink / raw)
To: Yuyang Du; +Cc: Mike Galbraith, linux-kernel

On Tue, Oct 13, 2015 at 10:06:48AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2015 at 03:55:17AM +0800, Yuyang Du wrote:
>
> > I think maybe the real disease is the tg->load_avg is not updated in time.
> > I.e., it is after migrate, the source cfs_rq does not decrease its contribution
> > to the parent's tg->load_avg fast enough.
>
> No, using the load_avg for shares calculation seems wrong; that would
> mean we'd first have to ramp up the avg before you react.
>
> You want to react quickly to actual load changes, esp. going up.
>
> We use the avg to guess the global group load, since that's the best
> compromise we have, but locally it doesn't make sense to use the avg if
> we have the actual values.

That is, can you send the original patch with a Changelog etc.. so that
I can press 'A' :-)

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: 4.3 group scheduling regression
  2015-10-13  8:10 ` Peter Zijlstra
@ 2015-10-13  0:37 ` Yuyang Du
  0 siblings, 0 replies; 48+ messages in thread
From: Yuyang Du @ 2015-10-13 0:37 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel

On Tue, Oct 13, 2015 at 10:10:23AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2015 at 10:06:48AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 13, 2015 at 03:55:17AM +0800, Yuyang Du wrote:
> >
> > > I think maybe the real disease is the tg->load_avg is not updated in time.
> > > I.e., it is after migrate, the source cfs_rq does not decrease its contribution
> > > to the parent's tg->load_avg fast enough.
> >
> > No, using the load_avg for shares calculation seems wrong; that would
> > mean we'd first have to ramp up the avg before you react.
> >
> > You want to react quickly to actual load changes, esp. going up.
> >
> > We use the avg to guess the global group load, since that's the best
> > compromise we have, but locally it doesn't make sense to use the avg if
> > we have the actual values.
>
> That is, can you send the original patch with a Changelog etc.. so that
> I can press 'A' :-)

Sure, in minutes, :)

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: 4.3 group scheduling regression
  2015-10-12  2:12 ` Yuyang Du
  2015-10-12 10:23 ` Mike Galbraith
@ 2015-10-12 11:47 ` Peter Zijlstra
  2015-10-12 19:32   ` Yuyang Du
  2015-10-13  2:22   ` Mike Galbraith
  1 sibling, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-12 11:47 UTC (permalink / raw)
To: Yuyang Du; +Cc: Mike Galbraith, linux-kernel

On Mon, Oct 12, 2015 at 10:12:31AM +0800, Yuyang Du wrote:
> On Mon, Oct 12, 2015 at 11:12:06AM +0200, Peter Zijlstra wrote:
> > So in the old code we had 'magic' to deal with the case where a cgroup
> > was consuming less than 1 cpu's worth of runtime. For example, a single
> > task running in the group.
> >
> > In that scenario it might be possible that the group entity weight:
> >
> >   se->weight = (tg->shares * cfs_rq->weight) / tg->weight;
> >
> > Strongly deviates from the tg->shares; you want the single task reflect
> > the full group shares to the next level; due to the whole distributed
> > approximation stuff.
>
> Yeah, I thought so.
>
> > I see you've deleted all that code; see the former
> > __update_group_entity_contrib().
>
> Probably not there, it actually was an icky way to adjust things.

Yeah, no argument there.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..b184da0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2370,7 +2370,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
>  	 */
>  	tg_weight = atomic_long_read(&tg->load_avg);
>  	tg_weight -= cfs_rq->tg_load_avg_contrib;
> -	tg_weight += cfs_rq_load_avg(cfs_rq);
> +	tg_weight += cfs_rq->load.weight;
>
>  	return tg_weight;
>  }
> @@ -2380,7 +2380,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
>  	long tg_weight, load, shares;
>
>  	tg_weight = calc_tg_weight(tg, cfs_rq);
> -	load = cfs_rq_load_avg(cfs_rq);
> +	load = cfs_rq->load.weight;
>
>  	shares = (tg->shares * load);
>  	if (tg_weight)

Aah, yes very much so. I completely overlooked that :-( When calculating
shares we very much want the current load, not the load average.

Also, should we do the below? At this point se->on_rq is still 0 so
reweight_entity() will not update (dequeue/enqueue) the accounting, but
we'll have just accounted the 'old' load.weight.

Doing it this way around we'll first update the weight and then account
it, which seems more accurate.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 700eb548315f..d2efef565aed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3009,8 +3009,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	enqueue_entity_load_avg(cfs_rq, se);
-	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
+	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);

^ permalink raw reply related	[flat|nested] 48+ messages in thread
* Re: 4.3 group scheduling regression
  2015-10-12 11:47 ` Peter Zijlstra
@ 2015-10-12 19:32 ` Yuyang Du
  2015-10-13  8:07   ` Peter Zijlstra
  1 sibling, 1 reply; 48+ messages in thread
From: Yuyang Du @ 2015-10-12 19:32 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel

On Mon, Oct 12, 2015 at 01:47:23PM +0200, Peter Zijlstra wrote:
>
> Also, should we do the below? At this point se->on_rq is still 0 so
> reweight_entity() will not update (dequeue/enqueue) the accounting, but
> we'll have just accounted the 'old' load.weight.
>
> Doing it this way around we'll first update the weight and then account
> it, which seems more accurate.

I think the original looks ok.

The account_entity_enqueue() adds the child entity's load.weight to the
parent's load:

	update_load_add(&cfs_rq->load, se->load.weight)

Then we recalculate the shares.

Then reweight_entity() resets the parent entity's load.weight.

> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 700eb548315f..d2efef565aed 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3009,8 +3009,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 */
>  	update_curr(cfs_rq);
>  	enqueue_entity_load_avg(cfs_rq, se);
> -	account_entity_enqueue(cfs_rq, se);
>  	update_cfs_shares(cfs_rq);
> +	account_entity_enqueue(cfs_rq, se);
>
>  	if (flags & ENQUEUE_WAKEUP) {
>  		place_entity(cfs_rq, se, 0);

^ permalink raw reply	[flat|nested] 48+ messages in thread
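Yuyang's ordering argument in miniature, as a toy model rather than kernel code (the single global and the one-runqueue group are invented simplifications): the shares recomputation must see a group weight that already includes the incoming task, otherwise a previously empty group computes its share from zero:

  #include <stdio.h>

  static long group_weight;           /* stands in for cfs_rq->load.weight */
  static const long tg_shares = 1024;

  /* One runqueue in the group, so it gets everything, or nothing. */
  static long recompute_shares(void)
  {
      return group_weight ? tg_shares : 0;
  }

  int main(void)
  {
      long task_weight = 1024, se_weight;

      /* Peter's proposed order: recompute before accounting the task. */
      group_weight = 0;
      se_weight = recompute_shares();   /* sees an empty group: 0 */
      group_weight += task_weight;
      printf("shares-then-account: %ld\n", se_weight);

      /* The existing order: account, then recompute, then reweight. */
      group_weight = 0;
      group_weight += task_weight;      /* account_entity_enqueue() */
      se_weight = recompute_shares();   /* update_cfs_shares()      */
      printf("account-then-shares: %ld\n", se_weight);
      return 0;
  }

(In the real code the result would be clamped to MIN_SHARES rather than dropping to zero, but the direction of the error is the same.)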
* Re: 4.3 group scheduling regression
  2015-10-12 19:32 ` Yuyang Du
@ 2015-10-13  8:07 ` Peter Zijlstra
  0 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2015-10-13 8:07 UTC (permalink / raw)
To: Yuyang Du; +Cc: Mike Galbraith, linux-kernel

On Tue, Oct 13, 2015 at 03:32:47AM +0800, Yuyang Du wrote:
> On Mon, Oct 12, 2015 at 01:47:23PM +0200, Peter Zijlstra wrote:
> >
> > Also, should we do the below? At this point se->on_rq is still 0 so
> > reweight_entity() will not update (dequeue/enqueue) the accounting, but
> > we'll have just accounted the 'old' load.weight.
> >
> > Doing it this way around we'll first update the weight and then account
> > it, which seems more accurate.
>
> I think the original looks ok.
>
> The account_entity_enqueue() adds the child entity's load.weight to the
> parent's load:
>
> 	update_load_add(&cfs_rq->load, se->load.weight)
>
> Then we recalculate the shares.
>
> Then reweight_entity() resets the parent entity's load.weight.

Yes, some days I should just not be allowed near a keyboard :)

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: 4.3 group scheduling regression
  2015-10-12 11:47 ` Peter Zijlstra
  2015-10-12 19:32 ` Yuyang Du
@ 2015-10-13  2:22 ` Mike Galbraith
  1 sibling, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-13 2:22 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Yuyang Du, linux-kernel

On Mon, 2015-10-12 at 13:47 +0200, Peter Zijlstra wrote:
> Also, should we do the below?

Ew. Box said "Either you quilt pop/burn, or I boot windows." ;-)

	-Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: 4.3 group scheduling regression
  2015-10-12  8:04 ` Peter Zijlstra
  2015-10-12  0:53 ` Yuyang Du
@ 2015-10-12  8:48 ` Mike Galbraith
  1 sibling, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-12 8:48 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, Yuyang Du

On Mon, 2015-10-12 at 10:04 +0200, Peter Zijlstra wrote:
> On Mon, Oct 12, 2015 at 09:44:57AM +0200, Mike Galbraith wrote:
>
> > It's odd to me that things look pretty much the same good/bad tree with
> > hogs vs hogs or hogs vs tbench (with top anyway, just adding up times).
> > Seems Xorg+mplayer more or less playing cross group ping-pong must be
> > the BadThing trigger.
>
> Ohh, wait, Xorg and mplayer are _not_ in the same group? I was assuming
> you had your entire user session in 1 (auto) group and was competing
> against 8 manual cgroups.
>
> So how exactly are things configured?

I turned autogroup on so as not to have to muck about creating groups, so
Xorg is in its per-session group, and each konsole instance in its own. I
launched the groups via testo (aka konsole) -e <content> in a little
script, to turn them all loose at once to run for 100 seconds and kill
themselves, but that's not necessary 'course. Start 1 hog in 8 konsole
tabs, and mplayer in the 9th, and ickiness follows.

	-Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-10 13:22 ` [patch] sched: disable task group re-weighting on the desktop Mike Galbraith
  2015-10-10 14:03   ` kbuild test robot
  2015-10-10 17:01   ` Peter Zijlstra
@ 2015-10-10 20:14 ` paul.szabo
  2015-10-11  2:38   ` Mike Galbraith
  2015-10-11 19:46 ` paul.szabo
  3 siblings, 1 reply; 48+ messages in thread
From: paul.szabo @ 2015-10-10 20:14 UTC (permalink / raw)
To: peterz, umgwanakikbuti; +Cc: linux-kernel

Dear Mike,

You CCed me on this patch. Is that because you expect this to solve "my"
problem also? You had some measurements of many oinks vs many perts or
vs "desktop", but not many oinks vs 1 or 2 perts as per my "complaint".
You also changed the subject line, so maybe this is all un-related.

Thanks, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-10 20:14 ` [patch] sched: disable task group re-weighting on the desktop paul.szabo
@ 2015-10-11  2:38 ` Mike Galbraith
  2015-10-11  9:25   ` paul.szabo
  0 siblings, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2015-10-11 2:38 UTC (permalink / raw)
To: paul.szabo; +Cc: peterz, linux-kernel

On Sun, 2015-10-11 at 07:14 +1100, paul.szabo@sydney.edu.au wrote:
> Dear Mike,
>
> You CCed me on this patch. Is that because you expect this to solve "my"
> problem also? You had some measurements of many oinks vs many perts or
> vs "desktop", but not many oinks vs 1 or 2 perts as per my "complaint".
> You also changed the subject line, so maybe this is all un-related.

I haven't seen the problem you reported. I did stumble upon a problem,
but turns out that is only present in master, so yes, un-related.

	-Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-11  2:38 ` Mike Galbraith
@ 2015-10-11  9:25 ` paul.szabo
  2015-10-11 12:49   ` Mike Galbraith
  0 siblings, 1 reply; 48+ messages in thread
From: paul.szabo @ 2015-10-11 9:25 UTC (permalink / raw)
To: umgwanakikbuti; +Cc: linux-kernel, peterz

Dear Mike,

> ... so yes, un-related.

Thanks for clarifying.

> I haven't seen the problem you reported. ...

You mean you chose not to reproduce: you persisted in pinning your perts,
whereas the problem was stated with un-pinned perts (and pinned oinks).
But that is OK... others did reproduce, and anyway I believe I have now
fixed my problem. (Solution in that "other" email thread.)

Cheers, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-11  9:25 ` paul.szabo
@ 2015-10-11 12:49 ` Mike Galbraith
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-11 12:49 UTC (permalink / raw)
To: paul.szabo; +Cc: linux-kernel, peterz

On Sun, 2015-10-11 at 20:25 +1100, paul.szabo@sydney.edu.au wrote:
> Dear Mike,
>
> > ... so yes, un-related.
>
> Thanks for clarifying.
>
> > I haven't seen the problem you reported. ...
>
> You mean you chose not to reproduce: you persisted in pinning your
> perts..

There was hard data to the contrary in your mailbox as you wrote that.

	-Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-10 13:22 ` [patch] sched: disable task group re-weighting on the desktop Mike Galbraith
                     ` (2 preceding siblings ...)
  2015-10-10 20:14 ` [patch] sched: disable task group re-weighting on the desktop paul.szabo
@ 2015-10-11 19:46 ` paul.szabo
  2015-10-12  1:59   ` Mike Galbraith
  3 siblings, 1 reply; 48+ messages in thread
From: paul.szabo @ 2015-10-11 19:46 UTC (permalink / raw)
To: peterz, umgwanakikbuti; +Cc: linux-kernel

Dear Mike,

Did you check whether setting min_interval and max_interval e.g. as per

  https://lkml.org/lkml/2015/10/11/34

would help with your issue (instead of your "horrible gs destroying"
patch)?

Cheers, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [patch] sched: disable task group re-weighting on the desktop
  2015-10-11 19:46 ` paul.szabo
@ 2015-10-12  1:59 ` Mike Galbraith
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-12 1:59 UTC (permalink / raw)
To: paul.szabo; +Cc: peterz, linux-kernel

On Mon, 2015-10-12 at 06:46 +1100, paul.szabo@sydney.edu.au wrote:
> Dear Mike,
>
> Did you check whether setting min_interval and max_interval e.g. as per
>
>   https://lkml.org/lkml/2015/10/11/34
>
> would help with your issue (instead of your "horrible gs destroying"
> patch)?

I spent a lot of MY time looking into YOUR problem, only to be accused
of actively avoiding reproduction thereof, and now you toss another cute
little dart my way.

Looking into your problem wasn't a complete waste of my time, as it led
me to something that actually looks interesting. Thanks for that, and
goodbye.

*PLONK*

	-Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-08 10:54 ` paul.szabo
  2015-10-08 11:19   ` Peter Zijlstra
@ 2015-10-08 14:25 ` Mike Galbraith
  2015-10-08 21:55   ` paul.szabo
  1 sibling, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2015-10-08 14:25 UTC (permalink / raw)
To: paul.szabo; +Cc: peterz, linux-kernel

On Thu, 2015-10-08 at 21:54 +1100, paul.szabo@sydney.edu.au wrote:
> Dear Mike,
>
> > I see a fairness issue ... but one opposite to your complaint.
>
> Why is that opposite? I think it would be fair for the one pert process
> to get 100% CPU, the many oink processes can get everything else. That
> one oink is lowly 10% (when others are 100%) is of no consequence.

Well, not exactly opposite, only opposite in that the one pert task also
receives MORE than its fair share when unpinned. Two 100% hogs sharing
one CPU should each get 50% of that CPU. The fact that the oink group
contains 8 tasks vs 1 for the pert group should be irrelevant, but what
that last oinker is getting is 1/9 of a CPU, and there just happen to be
9 runnable tasks total, 1 in group pert, and 8 in group oink.

IFF that ratio were to prove to be a constant, AND the oink group were a
massively parallel and synchronized compute job on a huge box, that
entire compute job would not merely be slowed down by the factor of 2
that a fair distribution would cost it; on, say, a 1000 core box it'd
be.. utterly dead, because you'd put it out of your misery.

vogelweide:~/:[0]# cgexec -g cpu:foo bash
vogelweide:~/:[0]# for i in `seq 0 63`; do taskset -c $i cpuhog& done
[1] 8025
[2] 8026
...
vogelweide:~/:[130]# cgexec -g cpu:bar bash
vogelweide:~/:[130]# taskset -c 63 pert 10
(report every 10 seconds)
2260.91 MHZ CPU
perturbation threshold 0.024 usecs.
pert/s: 255 >2070.76us: 38 min: 0.05 max:4065.46 avg: 93.83 sum/s: 23946us overhead: 2.39%
pert/s: 255 >2070.32us: 37 min: 1.32 max:4039.94 avg: 92.82 sum/s: 23744us overhead: 2.37%
pert/s: 253 >2069.85us: 38 min: 0.05 max:4036.44 avg: 94.89 sum/s: 24054us overhead: 2.41%

Hm, that's a kinda odd looking number from my 64 core box, but whatever,
it's far from fair according to my definition thereof. Poor little oink
plus all other cycles not spent in pert's tight loop add up to ~24ms/s.

> Good to see that you agree on the fairness issue... it MUST be fixed!
> CFS might be wrong or wasteful, but never unfair.

Weeell, we've disagreed on pretty much everything we've talked about so
far, but I can well imagine that what I see in the share update business
_could_ be part of your massive compute job woes.

	-Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread
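The 1/9 Mike measures drops out of how group shares spread over runqueues, and can be reproduced with pencil-and-paper arithmetic. A standalone sketch, assuming the default group weight of 1024 for both groups (the flat one-level model and the numbers are illustrative, not kernel code): oink's eight equal hogs spread the group's 1024 across eight CPUs, so the oink group entity on the contended CPU carries only 1024/8 = 128 against pert's full 1024:

  #include <stdio.h>

  int main(void)
  {
      double tg_shares = 1024.0;

      /* oink: 8 equal hogs on 8 CPUs, so 1/8 of the group's shares each */
      double oink_here = tg_shares / 8.0;
      /* pert: a single task, the whole group's shares land on its CPU  */
      double pert_here = tg_shares;

      printf("pert: %.3f of the CPU\n", pert_here / (pert_here + oink_here));
      printf("oink: %.3f of the CPU\n", oink_here / (pert_here + oink_here));
      return 0;
  }

This prints 0.889 and 0.111, the 8/9 vs 1/9 split above. The same arithmetic for the 64-hog run gives the oink entity 1024/64 = 16 on pert's CPU, i.e. about 1.5%, in the same ballpark as the ~2.4% of non-pert cycles that pert reports.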
* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-08 14:25 ` CFS scheduler unfairly prefers pinned tasks Mike Galbraith
@ 2015-10-08 21:55 ` paul.szabo
  2015-10-09  1:56   ` Mike Galbraith
  2015-10-09  2:40   ` Mike Galbraith
  0 siblings, 2 replies; 48+ messages in thread
From: paul.szabo @ 2015-10-08 21:55 UTC (permalink / raw)
To: umgwanakikbuti; +Cc: linux-kernel, peterz

Dear Mike,

>>> I see a fairness issue ... but one opposite to your complaint.
>> Why is that opposite? ...
>
> Well, not exactly opposite, only opposite in that the one pert task also
> receives MORE than its fair share when unpinned. Two 100% hogs sharing
> one CPU should each get 50% of that CPU. ...

But you are using CGROUPs, grouping all oinks into one group, and the
one pert into another: requesting each group to get the same total CPU.
Since pert has one process only, the most he can get is 100% (not 400%),
and it is quite OK for the oinks together to get 700%.

> IFF ... massively parallel and synchronized ...

You would be making the assumption that you had the machine to yourself:
might be the wrong thing to assume.

>> Good to see that you agree ...
> Weeell, we've disagreed on pretty much everything ...

Sorry I disagree: we do agree on the essence. :-)

Cheers, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-08 21:55 ` paul.szabo
@ 2015-10-09  1:56 ` Mike Galbraith
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-10-09 1:56 UTC (permalink / raw)
To: paul.szabo; +Cc: linux-kernel, peterz

On Fri, 2015-10-09 at 08:55 +1100, paul.szabo@sydney.edu.au wrote:
> Dear Mike,
>
> >>> I see a fairness issue ... but one opposite to your complaint.
> >> Why is that opposite? ...
> >
> > Well, not exactly opposite, only opposite in that the one pert task also
> > receives MORE than its fair share when unpinned. Two 100% hogs sharing
> > one CPU should each get 50% of that CPU. ...
>
> But you are using CGROUPs, grouping all oinks into one group, and the
> one pert into another: requesting each group to get the same total CPU.
> Since pert has one process only, the most he can get is 100% (not 400%),
> and it is quite OK for the oinks together to get 700%.

Well, that of course depends on what you call fair. I realize why and
where it happens. I told weight adjustment to keep its grubby mitts off
of autogroups, and of course the "problem" went away.

Back to the viewpoint thing: with two users, each having been _placed_
in a group, I can well imagine a user who is trying to use all of his
authorized bandwidth raising an eyebrow when he sees one of his tasks
getting 24 whole milliseconds per second with an allegedly fair
scheduler. I can see it both ways. What's going to come out of this is
probably going to be "tough titty, yes, group scheduling has side
effects, and this is one". I already know it does. The question is only
whether the weight adjustment gears are spinning as intended or not.

> > IFF ... massively parallel and synchronized ...
>
> You would be making the assumption that you had the machine to yourself:
> might be the wrong thing to assume.

Yup, it would be a doomed attempt to run a load which cannot thrive in a
shared environment in such an environment. Are any of the compute loads
you're having trouble with.. in the math department.. perhaps doing oh,
say complex math goop that feeds the output of one parallel computation
into the next parallel computation? :)

	-Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-08 21:55 ` paul.szabo
  2015-10-09  1:56 ` Mike Galbraith
@ 2015-10-09  2:40 ` Mike Galbraith
  2015-10-11  9:43   ` paul.szabo
  1 sibling, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2015-10-09 2:40 UTC (permalink / raw)
To: paul.szabo; +Cc: linux-kernel, peterz

On Fri, 2015-10-09 at 08:55 +1100, paul.szabo@sydney.edu.au wrote:
> >> Good to see that you agree ...
> > Weeell, we've disagreed on pretty much everything ...
>
> Sorry I disagree: we do agree on the essence. :-)

P.S. To some extent. If the essence is $subject, nope, we definitely
disagree. If the essence is that _group_ scheduling is not strictly
fair, then we agree. The must-be-fixed bit I also disagree with. Maybe
wants fixing, I can agree with ;-)

	-Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-09  2:40 ` Mike Galbraith
@ 2015-10-11  9:43 ` paul.szabo
  0 siblings, 0 replies; 48+ messages in thread
From: paul.szabo @ 2015-10-11 9:43 UTC (permalink / raw)
To: umgwanakikbuti; +Cc: linux-kernel, peterz, wanpeng.li

I wrote:

  The Linux CFS scheduler prefers pinned tasks and unfairly
  gives more CPU time to tasks that have set CPU affinity.

I believe I have now solved the problem, simply by setting:

  for n in /proc/sys/kernel/sched_domain/cpu*/domain0/min_interval; do echo 0 > $n; done
  for n in /proc/sys/kernel/sched_domain/cpu*/domain0/max_interval; do echo 1 > $n; done

I am not sure what the domain1 values would be for (which I see exist on
my 4*E5-4627v2 server). So far I do not see any negative effects of using
these (extreme?) settings. (Explanation of what these things are meant
for, or pointers to documentation, would be appreciated.)

---

Thanks for the insightful discussion. (Scary, isn't it?)

Thanks, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

^ permalink raw reply	[flat|nested] 48+ messages in thread
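As a rough answer to Paul's request for an explanation (paraphrased from memory of the load_balance() logic in kernels of that era, so treat the details as approximate): each sched domain is rebalanced every balance_interval milliseconds; the interval is reset to min_interval whenever a balance actually moves something, and doubled, up to max_interval, whenever balancing fails, which is precisely what pinned tasks cause over and over. Forcing min_interval to 0 and max_interval to 1 therefore keeps the balancer from ever backing off. A toy model of that backoff:

  #include <stdio.h>

  /* Illustrative stand-in for the sched-domain fields behind the sysctls. */
  struct sd_sketch {
      unsigned long balance_interval;   /* current interval, ms     */
      unsigned long min_interval;       /* .../domainN/min_interval */
      unsigned long max_interval;       /* .../domainN/max_interval */
  };

  /* Roughly what load_balance() does to the interval afterwards. */
  static void after_balance(struct sd_sketch *sd, int moved_something)
  {
      if (moved_something)
          sd->balance_interval = sd->min_interval;    /* stay eager */
      else if (sd->balance_interval < sd->max_interval)
          sd->balance_interval *= 2;                  /* back off   */
  }

  int main(void)
  {
      struct sd_sketch sd = { 8, 8, 128 };

      /* pinned hogs: balancing keeps failing, the interval keeps doubling */
      for (int i = 0; i < 5; i++)
          after_balance(&sd, 0);
      printf("backed off to %lu ms\n", sd.balance_interval);    /* 128 */

      /* Paul's tuning clamps the interval near zero, so the imbalance
       * caused by pinned tasks is re-examined almost continuously. */
      sd.min_interval = 0;
      sd.max_interval = 1;
      after_balance(&sd, 1);
      printf("with the tuning: %lu ms\n", sd.balance_interval); /* 0 */
      return 0;
  }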
* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-08  8:19 ` Mike Galbraith
  2015-10-08 10:54 ` paul.szabo
@ 2015-10-10  3:59 ` Wanpeng Li
  2015-10-10  7:58   ` Wanpeng Li
  1 sibling, 1 reply; 48+ messages in thread
From: Wanpeng Li @ 2015-10-10 3:59 UTC (permalink / raw)
To: paul.szabo, Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel

Hi Paul,

On 10/8/15 4:19 PM, Mike Galbraith wrote:
> On Tue, 2015-10-06 at 04:45 +0200, Mike Galbraith wrote:
>> On Tue, 2015-10-06 at 08:48 +1100, paul.szabo@sydney.edu.au wrote:
>>> The Linux CFS scheduler prefers pinned tasks and unfairly
>>> gives more CPU time to tasks that have set CPU affinity.
>>> This effect is observed with or without CGROUP controls.
>>>
>>> To demonstrate: on an otherwise idle machine, as some user
>>> run several processes pinned to each CPU, one for each CPU
>>> (as many as CPUs present in the system) e.g. for a quad-core
>>> non-HyperThreaded machine:
>>>
>>>   taskset -c 0 perl -e 'while(1){1}' &
>>>   taskset -c 1 perl -e 'while(1){1}' &
>>>   taskset -c 2 perl -e 'while(1){1}' &
>>>   taskset -c 3 perl -e 'while(1){1}' &
>>>
>>> and (as that same or some other user) run some without
>>> pinning:
>>>
>>>   perl -e 'while(1){1}' &
>>>   perl -e 'while(1){1}' &
>>>
>>> and use e.g. top to observe that the pinned processes get
>>> more CPU time than "fair".

Interesting, I can reproduce it with your simple script. However, the
tasks are fair when the number of pinned perl tasks is equal to the
number of unpinned perl tasks. I will dig into it more deeply.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: CFS scheduler unfairly prefers pinned tasks
  2015-10-10  3:59 ` Wanpeng Li
@ 2015-10-10  7:58 ` Wanpeng Li
  0 siblings, 0 replies; 48+ messages in thread
From: Wanpeng Li @ 2015-10-10 7:58 UTC (permalink / raw)
To: paul.szabo, Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel

On 10/10/15 11:59 AM, Wanpeng Li wrote:
> Hi Paul,
> On 10/8/15 4:19 PM, Mike Galbraith wrote:
>> On Tue, 2015-10-06 at 04:45 +0200, Mike Galbraith wrote:
>>> On Tue, 2015-10-06 at 08:48 +1100, paul.szabo@sydney.edu.au wrote:
>>>> The Linux CFS scheduler prefers pinned tasks and unfairly
>>>> gives more CPU time to tasks that have set CPU affinity.
>>>> This effect is observed with or without CGROUP controls.
>>>>
>>>> To demonstrate: on an otherwise idle machine, as some user
>>>> run several processes pinned to each CPU, one for each CPU
>>>> (as many as CPUs present in the system) e.g. for a quad-core
>>>> non-HyperThreaded machine:
>>>>
>>>>   taskset -c 0 perl -e 'while(1){1}' &
>>>>   taskset -c 1 perl -e 'while(1){1}' &
>>>>   taskset -c 2 perl -e 'while(1){1}' &
>>>>   taskset -c 3 perl -e 'while(1){1}' &
>>>>
>>>> and (as that same or some other user) run some without
>>>> pinning:
>>>>
>>>>   perl -e 'while(1){1}' &
>>>>   perl -e 'while(1){1}' &
>>>>
>>>> and use e.g. top to observe that the pinned processes get
>>>> more CPU time than "fair".
>
> Interesting, I can reproduce it with your simple script. However, the
> tasks are fair when the number of pinned perl tasks is equal to the
> number of unpinned perl tasks. I will dig into it more deeply.

For the pinned tasks, if I set their affinity mask to all the available
CPUs instead of one separate CPU each as in your test, then there is
fairness between the pinned and unpinned tasks. So I suspect it is the
overhead associated with the migration stuff.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 48+ messages in thread