* [PATCH 0/2] Reduce stacking and overscheduling
@ 2021-10-21 14:56 Mel Gorman
  2021-10-21 14:56 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman
  2021-10-21 14:56 ` [PATCH 2/2] sched/fair: Increase wakeup_gran if current task has not executed the minimum granularity Mel Gorman
  0 siblings, 2 replies; 26+ messages in thread
From: Mel Gorman @ 2021-10-21 14:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

This series tackles two problems. The first is that heavy wakers can
stack an excessive number of tasks on the same CPU. The second is that
tasks can overschedule when the current task has not reached its minimum
preemption granularity. The patches are independent but were discussed
together in the thread
https://lore.kernel.org/r/20210920142614.4891-1-mgorman@techsingularity.net
so they are presented together.

With both patches on a Zen 3 machine

hackbench-process-pipes
                          5.15.0-rc3             5.15.0-rc3
                             vanilla  sched-scalewakegran-v3r2
Amean     1        0.3667 (   0.00%)      0.3823 (  -4.27%)
Amean     4        0.5343 (   0.00%)      0.4867 (   8.92%)
Amean     7        0.5300 (   0.00%)      0.5053 (   4.65%)
Amean     12       0.5737 (   0.00%)      0.5450 (   5.00%)
Amean     21       0.6727 (   0.00%)      0.6807 (  -1.19%)
Amean     30       0.8583 (   0.00%)      0.7107 *  17.20%*
Amean     48       1.3977 (   0.00%)      1.0447 *  25.26%*
Amean     79       1.9790 (   0.00%)      1.6033 *  18.98%*
Amean     110      2.8020 (   0.00%)      2.0763 *  25.90%*
Amean     141      3.6683 (   0.00%)      2.5313 *  31.00%*
Amean     172      4.6687 (   0.00%)      3.1163 *  33.25%*
Amean     203      5.2183 (   0.00%)      3.5560 *  31.86%*
Amean     234      6.1077 (   0.00%)      3.8913 *  36.29%*
Amean     265      7.1313 (   0.00%)      4.2293 *  40.69%*
Amean     296      7.7557 (   0.00%)      4.5357 *  41.52%*

                       5.15.0-rc3             5.15.0-rc3
                          vanilla  sched-scalewakegran-v3r2
Duration User         2933.05                 2034.17
Duration System      25652.83                17137.08
Duration Elapsed       162.50                  120.25

Mel Gorman (2):
  sched/fair: Couple wakee flips with heavy wakers
  sched/fair: Increase wakeup_gran if current task has not executed the
    minimum granularity

 kernel/sched/fair.c     | 27 ++++++++++++++++++++++++---
 kernel/sched/features.h |  2 ++
 2 files changed, 26 insertions(+), 3 deletions(-)

-- 
2.31.1

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-21 14:56 [PATCH 0/2] Reduce stacking and overscheduling Mel Gorman
@ 2021-10-21 14:56 ` Mel Gorman
  2021-10-22 10:26   ` Mike Galbraith
  2021-11-09 11:56   ` Peter Zijlstra
  2021-10-21 14:56 ` [PATCH 2/2] sched/fair: Increase wakeup_gran if current task has not executed the minimum granularity Mel Gorman
  1 sibling, 2 replies; 26+ messages in thread
From: Mel Gorman @ 2021-10-21 14:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

This patch mitigates a problem where wake_wide() allows a heavy waker
(e.g. X) to stack an excessive number of wakees on the same CPU. This is
due to the cpu_load() check in wake_affine_weight(). As noted by the
original patch author (Mike Galbraith)[1]:

	Between load updates, X, or any other waker of many, can stack
	wakees to a ludicrous depth. Tracing kbuild vs firefox playing a
	youtube clip, I watched X stack 20 of the zillion firefox minions
	while their previous CPUs all had 1 lousy task running but a
	cpu_load() higher than the cpu_load() of X's CPU. Most of those
	prev_cpus were where X had left them when it migrated. Each and
	every crazy depth migration was wake_affine_weight() deciding we
	should pull.

Paraphrasing Mike's test results from the patch: with make -j8 running
alongside firefox with two tabs, one containing youtube's suggestions of
stuff, the other a running clip, if the idle tab is in focus and the
mouse is left alone, flips decay enough for wake_wide() to lose
interest; but just wiggle the mouse and it starts waking wide. Focus on
the running clip and it continuously wakes wide.

The end result is that heavy wakers are less likely to stack tasks and,
depending on the workload, migrations are reduced. From additional tests
on various servers, the impact is machine dependent but generally this
patch improves the situation.
hackbench-process-pipes
                          5.15.0-rc3             5.15.0-rc3
                             vanilla   sched-wakeeflips-v1r1
Amean     1        0.3667 (   0.00%)      0.3890 (  -6.09%)
Amean     4        0.5343 (   0.00%)      0.5217 (   2.37%)
Amean     7        0.5300 (   0.00%)      0.5387 (  -1.64%)
Amean     12       0.5737 (   0.00%)      0.5443 (   5.11%)
Amean     21       0.6727 (   0.00%)      0.6487 (   3.57%)
Amean     30       0.8583 (   0.00%)      0.8033 (   6.41%)
Amean     48       1.3977 (   0.00%)      1.2400 *  11.28%*
Amean     79       1.9790 (   0.00%)      1.8200 *   8.03%*
Amean     110      2.8020 (   0.00%)      2.5820 *   7.85%*
Amean     141      3.6683 (   0.00%)      3.2203 *  12.21%*
Amean     172      4.6687 (   0.00%)      3.8200 *  18.18%*
Amean     203      5.2183 (   0.00%)      4.3357 *  16.91%*
Amean     234      6.1077 (   0.00%)      4.8047 *  21.33%*
Amean     265      7.1313 (   0.00%)      5.1243 *  28.14%*
Amean     296      7.7557 (   0.00%)      5.5940 *  27.87%*

While different machines showed different results, in general there
were far fewer CPU migrations of tasks.

tbench4
                           5.15.0-rc3             5.15.0-rc3
                              vanilla  sched-wakeeflips-v1r1
Hmean     1         824.05 (   0.00%)      802.56 *  -2.61%*
Hmean     2        1578.49 (   0.00%)     1645.11 *   4.22%*
Hmean     4        2959.08 (   0.00%)     2984.75 *   0.87%*
Hmean     8        5080.09 (   0.00%)     5173.35 *   1.84%*
Hmean     16       8276.02 (   0.00%)     9327.17 *  12.70%*
Hmean     32      15501.61 (   0.00%)    15925.55 *   2.73%*
Hmean     64      27313.67 (   0.00%)    24107.81 * -11.74%*
Hmean     128     32928.19 (   0.00%)    36261.75 *  10.12%*
Hmean     256     35434.73 (   0.00%)    38670.61 *   9.13%*
Hmean     512     50098.34 (   0.00%)    53243.75 *   6.28%*
Hmean     1024    69503.69 (   0.00%)    67425.26 *  -2.99%*

Bit of a mixed bag but wins more than it loses.

A new workload was added that runs a kernel build in the background
(-jNR_CPUS) while NR_CPUS pairs of tasks run netperf TCP_RR.
The intent is to see if heavy background tasks disrupt lighter tasks.

multi subtest kernbench
                               5.15.0-rc3            5.15.0-rc3
                                  vanilla  sched-wakeeflips-v1r1
Min       elsp-256     20.80 (   0.00%)       14.89 (  28.41%)
Amean     elsp-256     24.08 (   0.00%)       20.94 (  13.05%)
Stddev    elsp-256      3.32 (   0.00%)        4.68 ( -41.16%)
CoeffVar  elsp-256     13.78 (   0.00%)       22.36 ( -62.33%)
Max       elsp-256     29.11 (   0.00%)       26.49 (   9.00%)

multi subtest netperf-tcp-rr
                           5.15.0-rc3            5.15.0-rc3
                              vanilla  sched-wakeeflips-v1r1
Min       1       48286.26 (   0.00%)    49101.48 (   1.69%)
Hmean     1       62894.82 (   0.00%)    68963.51 *   9.65%*
Stddev    1        7600.56 (   0.00%)     8804.82 ( -15.84%)
Max       1       78975.16 (   0.00%)    87124.67 (  10.32%)

The variability is higher as a result of the patch but both workloads
experienced improved performance.

[1] https://lore.kernel.org/r/02c977d239c312de5e15c77803118dcf1e11f216.camel@gmx.de

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..d00af3b97d8f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
 	}
 
 	if (current->last_wakee != p) {
+		int min = __this_cpu_read(sd_llc_size) << 1;
+		/*
+		 * Couple the wakee flips to the waker for the case where it
+		 * doesn't accrue flips, taking care to not push the wakee
+		 * high enough that the wake_wide() heuristic fails.
+		 */
+		if (current->wakee_flips > p->wakee_flips * min)
+			p->wakee_flips++;
 		current->last_wakee = p;
 		current->wakee_flips++;
 	}
@@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
 	if (master < slave)
 		swap(master, slave);
 
-	if (slave < factor || master < slave * factor)
+	if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
 		return 0;
 	return 1;
 }
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-21 14:56 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman
@ 2021-10-22 10:26   ` Mike Galbraith
  2021-10-22 11:05     ` Mel Gorman
  2021-11-09 11:56   ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2021-10-22 10:26 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Thu, 2021-10-21 at 15:56 +0100, Mel Gorman wrote:
>
> From additional tests on various servers, the impact is machine dependent
> but generally this patch improves the situation.
>
> hackbench-process-pipes
> [...]
>
> While different machines showed different results, in general there
> were far fewer CPU migrations of tasks.

Patchlet helped hackbench?  That's.. unexpected (at least by me).

> tbench4
> [...]
>
> Bit of a mixed bag but wins more than it loses.

Hm.  If patchlet repeatably impacts buddy pairs one way or the other,
it should probably be tossed out the nearest window.

	-Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-22 10:26   ` Mike Galbraith
@ 2021-10-22 11:05     ` Mel Gorman
  2021-10-22 12:00       ` Mike Galbraith
  2021-10-25  6:35       ` Mike Galbraith
  0 siblings, 2 replies; 26+ messages in thread
From: Mel Gorman @ 2021-10-22 11:05 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Fri, Oct 22, 2021 at 12:26:08PM +0200, Mike Galbraith wrote:
> On Thu, 2021-10-21 at 15:56 +0100, Mel Gorman wrote:
> >
> > From additional tests on various servers, the impact is machine dependent
> > but generally this patch improves the situation.
> >
> > hackbench-process-pipes
> > [...]
> >
> > While different machines showed different results, in general there
> > were far fewer CPU migrations of tasks.
>
> Patchlet helped hackbench?  That's.. unexpected (at least by me).
>

I didn't analyse in depth, and other machines do not show as dramatic a
difference, but it's likely due to the timing of tasks getting wakeup
preempted. On a 2-socket Cascade Lake machine the difference was -7.4%
to 7.66% depending on group count. The second biggest loss was -0.71%
and there were more gains than losses. In each case, CPU migrations and
system CPU usage are reduced.

The big difference here is likely because the machine is Zen 3 and has
multiple LLCs per socket, so it suffers more if there are imbalances
between LLCs that wouldn't be visible on most Intel machines with one
LLC per socket.

> > tbench4
> > [...]
> >
> > Bit of a mixed bag but wins more than it loses.
>
> Hm.  If patchlet repeatably impacts buddy pairs one way or the other,
> it should probably be tossed out the nearest window.
>

I don't see how buddy pairing would be impacted, although there are
likely differences in the degree to which tasks get preempted due to
pulling tasks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-22 11:05     ` Mel Gorman
@ 2021-10-22 12:00       ` Mike Galbraith
  2021-10-25  6:35       ` Mike Galbraith
  1 sibling, 0 replies; 26+ messages in thread
From: Mike Galbraith @ 2021-10-22 12:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Fri, 2021-10-22 at 12:05 +0100, Mel Gorman wrote:
> On Fri, Oct 22, 2021 at 12:26:08PM +0200, Mike Galbraith wrote:
> >
> > Hm.  If patchlet repeatably impacts buddy pairs one way or the other,
> > it should probably be tossed out the nearest window.
> >
>
> I don't see how buddy pairing would be impacted, although there are
> likely differences in the degree to which tasks get preempted due to
> pulling tasks.

Hohum, numbers get to say whatever (weirdness) they want to I suppose.

btw, below are some desktop impact numbers I had collected.

box = i7-4790 quad+smt desktop
vs massive_intr 8 9999 (8 x 8ms run/1ms sleep, for 9999 secs.. effectively forever)
perf sched record -a -- su mikeg -c 'firefox https://www.youtube.com/watch?v=aqz-KE-bpKQ'& sleep 300 && killall perf firefox

                          runtime                runtime               sum delay       sum delay       sum delay      switches        desktop
patch/features            total          util    massive_intr   util   total           massive_intr    desktop        total/massive   util
virgin/stock        2267347.921 ms    94.4%    1932675.152 ms  80.5%   158611.016 ms   133309.938 ms   25301.078 ms    594780/441157   13.9%
virgin/-wa_weight   2236871.408 ms    93.2%    1881464.401 ms  78.3%   255785.391 ms   243958.616 ms   11826.775 ms   1525470/1424083  14.8%
                         -1.34%        -1.2%        -2.2%                                             -13.474 s                        +0.9%
wake_wide/stock     2254335.961 ms    93.9%    1917834.157 ms  79.9%   164766.194 ms   141974.540 ms   22791.654 ms    720711/599064   14.0%

While patchlet mitigated the stacking somewhat, it wasn't meaningful to
my test load (BigBuckBunny for the 10001st time). PELT clearly stacks
up the desktop pretty damn badly, but gets away with it.. in this case.

OTOH, killing stacking graveyard dead via NO_WA_WEIGHT would have
certainly dinged up cache quite a bit for the compute load had it been
something other than a synthetic CPU burner, so there's a brownie point
for PELT in the mix to go along with its stacking demerit.

Given there was zero perceptible difference, the only thing patchlet
really did was to give me a warm fuzzy knowing it was in there fighting
the good fight against obscene *looking* stacking (with potential).

	-Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-22 11:05     ` Mel Gorman
  2021-10-22 12:00       ` Mike Galbraith
@ 2021-10-25  6:35       ` Mike Galbraith
  2021-10-26  8:18         ` Mel Gorman
  1 sibling, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2021-10-25  6:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Fri, 2021-10-22 at 12:05 +0100, Mel Gorman wrote:
> On Fri, Oct 22, 2021 at 12:26:08PM +0200, Mike Galbraith wrote:
> >
> > Patchlet helped hackbench?  That's.. unexpected (at least by me).
> >
>
> I didn't analyse in depth, and other machines do not show as dramatic a
> difference, but it's likely due to the timing of tasks getting wakeup
> preempted.

Wakeup tracing made those hackbench numbers less surprising. There's
tons of wake-many going on. At a glance, it appears to already be
bi-directional though, so patchlet helping seemingly means that there's
just not quite enough to tickle the heuristic without a little help.

Question is, is the potential reward of strengthening that heuristic
yet again, keeping in mind that "heuristic" tends to not play well with
"deterministic", worth the risk?

My desktop trace session said distribution improved a bit, but there
was no meaningful latency or throughput improvement, making for a
pretty clear "nope" to the above question. It benefiting NUMA box
hackbench is a valid indicator, but one that is IMO too disconnected
from the real world to carry much weight.

	-Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-25  6:35       ` Mike Galbraith
@ 2021-10-26  8:18         ` Mel Gorman
  2021-10-26 10:15           ` Mike Galbraith
  0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2021-10-26  8:18 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Mon, Oct 25, 2021 at 08:35:52AM +0200, Mike Galbraith wrote:
> On Fri, 2021-10-22 at 12:05 +0100, Mel Gorman wrote:
> > On Fri, Oct 22, 2021 at 12:26:08PM +0200, Mike Galbraith wrote:
> > >
> > > Patchlet helped hackbench?  That's.. unexpected (at least by me).
> > >
> >
> > I didn't analyse in depth, and other machines do not show as dramatic a
> > difference, but it's likely due to the timing of tasks getting wakeup
> > preempted.
>
> Wakeup tracing made those hackbench numbers less surprising. There's
> tons of wake-many going on. At a glance, it appears to already be
> bi-directional though, so patchlet helping seemingly means that there's
> just not quite enough to tickle the heuristic without a little help.

Another possible explanation is that hackbench overloads a machine to
such an extent that the ratio of bi-directional wakeups is not
sufficient to trigger the wake-wide logic.

> Question is, is the potential reward of strengthening that heuristic
> yet again, keeping in mind that "heuristic" tends to not play well with
> "deterministic", worth the risk?
>
> My desktop trace session said distribution improved a bit, but there
> was no meaningful latency or throughput improvement, making for a
> pretty clear "nope" to the above question.

Another interpretation is that it's simply neutral and does no harm.

> It benefiting NUMA box
> hackbench is a valid indicator, but one that is IMO too disconnected
> from the real world to carry much weight.
>

I think if it's not shown to be harmful to a realistic workload but
helps an overloaded example then it should be ok. While excessive
overload is rare in a realistic workload, it does happen. There are a
few workloads I've seen bugs for that were triggered when an excessive
number of worker threads get spawned and compete for CPU access, which
in turn leads to more worker threads being spawned. There are
application workarounds for this corner case but it still triggers bugs.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-26  8:18         ` Mel Gorman
@ 2021-10-26 10:15           ` Mike Galbraith
  2021-10-26 10:41             ` Mike Galbraith
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2021-10-26 10:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Tue, 2021-10-26 at 09:18 +0100, Mel Gorman wrote:
> On Mon, Oct 25, 2021 at 08:35:52AM +0200, Mike Galbraith wrote:
> >
> > My desktop trace session said distribution improved a bit, but there
> > was no meaningful latency or throughput improvement, making for a
> > pretty clear "nope" to the above question.
>
> Another interpretation is that it's simply neutral and does no harm.

Yes, patchlet not completely countermanding PELT is a good sign. Had it
continuously fired and countermanded PELT entirely, that would mean
either that it was busted, or that the desktop load is so damn
over-threaded as to be continuously waking more damn threads than can
possibly have any benefit whatsoever. Neither of those happening is a
good thing. While I really dislike PELT's evil side, its other face is
rather attractive...

> > It benefiting NUMA box
> > hackbench is a valid indicator, but one that is IMO too disconnected
> > from the real world to carry much weight.
> >
>
> I think if it's not shown to be harmful to a realistic workload but
> helps an overloaded example then it should be ok. While excessive
> overload is rare in a realistic workload, it does happen. There are a
> few workloads I've seen bugs for that were triggered when an excessive
> number of worker threads get spawned and compete for CPU access, which
> in turn leads to more worker threads being spawned. There are
> application workarounds for this corner case but it still triggers bugs.
>

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-26 10:15           ` Mike Galbraith
@ 2021-10-26 10:41             ` Mike Galbraith
  2021-10-26 11:57               ` Mel Gorman
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2021-10-26 10:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Tue, 2021-10-26 at 12:15 +0200, Mike Galbraith wrote:
>
> ...

Well now, that interruption didn't go as planned. The briefer edit
would have been preferred, but you get the drift, so moving on...

> > > It benefiting NUMA box
> > > hackbench is a valid indicator, but one that is IMO too disconnected
> > > from the real world to carry much weight.
> > >
> >
> > I think if it's not shown to be harmful to a realistic workload but
> > helps an overloaded example then it should be ok. While excessive
> > overload is rare in a realistic workload, it does happen. There are a
> > few workloads I've seen bugs for that were triggered when an excessive
> > number of worker threads get spawned and compete for CPU access, which
> > in turn leads to more worker threads being spawned. There are
> > application workarounds for this corner case but it still triggers bugs.
>

wake_wide()'s proper test environment is NUMA, not a desktop box, so
patchlet has yet to meet a real world load that qualifies as such.
That it could detect the test load doing nutty stuff like waking a
thread pool three times the size of the box is all well and good, but
not the point.

$.02 WRT poor abused hackbench: if it happens to benefit that's fine,
but I don't think it should ever be considered change validation. It's
a useful tool, but the common massive overload use is just nuts IMO.

	-Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-26 10:41             ` Mike Galbraith
@ 2021-10-26 11:57               ` Mel Gorman
  2021-10-26 12:13                 ` Mike Galbraith
  0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2021-10-26 11:57 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Tue, Oct 26, 2021 at 12:41:36PM +0200, Mike Galbraith wrote:
> On Tue, 2021-10-26 at 12:15 +0200, Mike Galbraith wrote:
> >
> > ...
>
> Well now, that interruption didn't go as planned. The briefer edit
> would have been preferred, but you get the drift, so moving on...
>
> > > > It benefiting NUMA box
> > > > hackbench is a valid indicator, but one that is IMO too disconnected
> > > > from the real world to carry much weight.
> > > >
> > >
> > > I think if it's not shown to be harmful to a realistic workload but
> > > helps an overloaded example then it should be ok. While excessive
> > > overload is rare in a realistic workload, it does happen. There are a
> > > few workloads I've seen bugs for that were triggered when an excessive
> > > number of worker threads get spawned and compete for CPU access, which
> > > in turn leads to more worker threads being spawned. There are
> > > application workarounds for this corner case but it still triggers bugs.
> >
>
> wake_wide()'s proper test environment is NUMA, not a desktop box, so
> patchlet has yet to meet a real world load that qualifies as such.
> That it could detect the test load doing nutty stuff like waking a
> thread pool three times the size of the box is all well and good, but
> not the point.
>
> $.02 WRT poor abused hackbench: if it happens to benefit that's fine,
> but I don't think it should ever be considered change validation. It's
> a useful tool, but the common massive overload use is just nuts IMO.
>

hackbench is nuts and generally only useful for looking at overload
placement, overload balancing and useless search depth for idle CPUs. I
pay it some attention but it's usually not my favourite workload.
Regrettably, it's mostly used because it's popular, not because it
makes sense.

The patch in question was also tested on other workloads on NUMA
machines. For a 2-socket machine (20 cores, HT enabled so 40 CPUs)
running specjbb 2005 with one JVM per NUMA node, the patch also scaled
reasonably well.

specjbb
                         5.15.0-rc3             5.15.0-rc3
                            vanilla  sched-wakeeflips-v1r1
Hmean     tput-1     50044.48 (   0.00%)    53969.00 *   7.84%*
Hmean     tput-2    106050.31 (   0.00%)   113580.78 *   7.10%*
Hmean     tput-3    156701.44 (   0.00%)   164857.00 *   5.20%*
Hmean     tput-4    196538.75 (   0.00%)   218373.42 *  11.11%*
Hmean     tput-5    247566.16 (   0.00%)   267173.09 *   7.92%*
Hmean     tput-6    284981.46 (   0.00%)   311007.14 *   9.13%*
Hmean     tput-7    328882.48 (   0.00%)   359373.89 *   9.27%*
Hmean     tput-8    366941.24 (   0.00%)   393244.37 *   7.17%*
Hmean     tput-9    402386.74 (   0.00%)   433010.43 *   7.61%*
Hmean     tput-10   437551.05 (   0.00%)   475756.08 *   8.73%*
Hmean     tput-11   481349.41 (   0.00%)   519824.54 *   7.99%*
Hmean     tput-12   533148.45 (   0.00%)   565070.21 *   5.99%*
Hmean     tput-13   570563.97 (   0.00%)   609499.06 *   6.82%*
Hmean     tput-14   601117.97 (   0.00%)   647876.05 *   7.78%*
Hmean     tput-15   639096.38 (   0.00%)   690854.46 *   8.10%*
Hmean     tput-16   682644.91 (   0.00%)   722826.06 *   5.89%*
Hmean     tput-17   732248.96 (   0.00%)   758805.17 *   3.63%*
Hmean     tput-18   762771.33 (   0.00%)   791211.66 *   3.73%*
Hmean     tput-19   780582.92 (   0.00%)   819064.19 *   4.93%*
Hmean     tput-20   812183.95 (   0.00%)   836664.87 *   3.01%*
Hmean     tput-21   821415.48 (   0.00%)   833734.23 (   1.50%)
Hmean     tput-22   815457.65 (   0.00%)   844393.98 *   3.55%*
Hmean     tput-23   819263.63 (   0.00%)   846109.07 *   3.28%*
Hmean     tput-24   817962.95 (   0.00%)   839682.92 *   2.66%*
Hmean     tput-25   807814.64 (   0.00%)   841826.52 *   4.21%*
Hmean     tput-26   811755.89 (   0.00%)   838543.08 *   3.30%*
Hmean     tput-27   799341.75 (   0.00%)   833487.26 *   4.27%*
Hmean     tput-28   803434.89 (   0.00%)   829022.50 *   3.18%*
Hmean     tput-29   803233.25 (   0.00%)   826622.37 *   2.91%*
Hmean     tput-30   800465.12 (   0.00%)   824347.42 *   2.98%*
Hmean     tput-31   791284.39 (   0.00%)   791575.67 (   0.04%)
Hmean     tput-32   781930.07 (   0.00%)   805725.80 (   3.04%)
Hmean     tput-33   785194.31 (   0.00%)   804795.44 (   2.50%)
Hmean     tput-34   781325.67 (   0.00%)   800067.53 (   2.40%)
Hmean     tput-35   777715.92 (   0.00%)   753926.32 (  -3.06%)
Hmean     tput-36   770516.85 (   0.00%)   783328.32 (   1.66%)
Hmean     tput-37   758067.26 (   0.00%)   772243.18 *   1.87%*
Hmean     tput-38   764815.45 (   0.00%)   769156.32 (   0.57%)
Hmean     tput-39   757885.41 (   0.00%)   757670.59 (  -0.03%)
Hmean     tput-40   750140.15 (   0.00%)   760739.13 (   1.41%)

The largest regression was within the noise. Most results were outside
the noise.

Some HPC workloads showed little difference but they do not communicate
that heavily. A redis microbenchmark showed mostly neutral results.
schbench (a latency-sensitive Facebook simulator workload) showed a mix
of results, but helped more than it hurt. Even the machine with the
worst results for schbench showed improved wakeup latencies at the 99th
percentile. These were all on NUMA machines.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-26 11:57               ` Mel Gorman
@ 2021-10-26 12:13                 ` Mike Galbraith
  2021-10-27  2:09                   ` Mike Galbraith
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2021-10-26 12:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Tue, 2021-10-26 at 12:57 +0100, Mel Gorman wrote:
>
> The patch in question was also tested on other workloads on NUMA
> machines. For a 2-socket machine (20 cores, HT enabled so 40 CPUs)
> running specjbb 2005 with one JVM per NUMA node, the patch also scaled
> reasonably well.

That's way more interesting. No idea what this thing does under the
hood, thus whether it should be helped or not, but at least it's a real
deal benchmark vs a kernel hacker tool.

> specjbb
> [...]
>
> The largest regression was within the noise. Most results were outside
> the noise.
>
> Some HPC workloads showed little difference but they do not communicate
> that heavily. A redis microbenchmark showed mostly neutral results.
> schbench (a latency-sensitive Facebook simulator workload) showed a mix
> of results, but helped more than it hurt. Even the machine with the
> worst results for schbench showed improved wakeup latencies at the 99th
> percentile. These were all on NUMA machines.
>

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers 2021-10-26 12:13 ` Mike Galbraith @ 2021-10-27 2:09 ` Mike Galbraith 2021-10-27 9:00 ` Mel Gorman 0 siblings, 1 reply; 26+ messages in thread From: Mike Galbraith @ 2021-10-27 2:09 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, 2021-10-26 at 14:13 +0200, Mike Galbraith wrote: > On Tue, 2021-10-26 at 12:57 +0100, Mel Gorman wrote: > > > > The patch in question was also tested on other workloads on NUMA > > machines. For a 2-socket machine (20 cores, HT enabled so 40 CPUs) > > running specjbb 2005 with one JVM per NUMA node, the patch also > > scaled > > reasonably well > > That's way more more interesting. No idea what this thing does under > the hood thus whether it should be helped or not, but at least it's a > real deal benchmark vs a kernel hacker tool. ... Installing test specjbb specjvm-install: Fetching from mirror http://mcp/mmtests-mirror/spec/SPECjbb2005_kitv1.07.tar.gz specjvm-install: Fetching from internet NOT_AVAILABLE/SPECjbb2005_kitv1.07.tar.gz specjvm-install: Fetching from alt internet /SPECjbb2005_kitv1.07.tar.gz FATAL specjvm-install: specjvm-install: Could not download /SPECjbb2005_kitv1.07.tar.gz FATAL specjbb-bench: specjbb install script returned error FATAL: specjbb returned failure, unable to continue FATAL: Installation step failed for specjbb Hohum, so much for trying to take a peek. At any rate, unlike the tbench numbers, these have the look of signal rather than test jig noise, and pretty strong signal at that, so maybe patchlet should fly. At the very least, it appears to be saying that there is significant performance to be had by some means. Bah, fly or die little patchlet. Either way there will be winners and losers, that's just the way it works if you're not shaving cycles. 
> > specjbb > > 5.15.0-rc3 5.15.0-rc3 > > vanilla sched-wakeeflips-v1r1 > > Hmean tput-1 50044.48 ( 0.00%) 53969.00 * 7.84%* > > Hmean tput-2 106050.31 ( 0.00%) 113580.78 * 7.10%* > > Hmean tput-3 156701.44 ( 0.00%) 164857.00 * 5.20%* > > Hmean tput-4 196538.75 ( 0.00%) 218373.42 * 11.11%* > > Hmean tput-5 247566.16 ( 0.00%) 267173.09 * 7.92%* > > Hmean tput-6 284981.46 ( 0.00%) 311007.14 * 9.13%* > > Hmean tput-7 328882.48 ( 0.00%) 359373.89 * 9.27%* > > Hmean tput-8 366941.24 ( 0.00%) 393244.37 * 7.17%* > > Hmean tput-9 402386.74 ( 0.00%) 433010.43 * 7.61%* > > Hmean tput-10 437551.05 ( 0.00%) 475756.08 * 8.73%* > > Hmean tput-11 481349.41 ( 0.00%) 519824.54 * 7.99%* > > Hmean tput-12 533148.45 ( 0.00%) 565070.21 * 5.99%* > > Hmean tput-13 570563.97 ( 0.00%) 609499.06 * 6.82%* > > Hmean tput-14 601117.97 ( 0.00%) 647876.05 * 7.78%* > > Hmean tput-15 639096.38 ( 0.00%) 690854.46 * 8.10%* > > Hmean tput-16 682644.91 ( 0.00%) 722826.06 * 5.89%* > > Hmean tput-17 732248.96 ( 0.00%) 758805.17 * 3.63%* > > Hmean tput-18 762771.33 ( 0.00%) 791211.66 * 3.73%* > > Hmean tput-19 780582.92 ( 0.00%) 819064.19 * 4.93%* > > Hmean tput-20 812183.95 ( 0.00%) 836664.87 * 3.01%* > > Hmean tput-21 821415.48 ( 0.00%) 833734.23 ( 1.50%) > > Hmean tput-22 815457.65 ( 0.00%) 844393.98 * 3.55%* > > Hmean tput-23 819263.63 ( 0.00%) 846109.07 * 3.28%* > > Hmean tput-24 817962.95 ( 0.00%) 839682.92 * 2.66%* > > Hmean tput-25 807814.64 ( 0.00%) 841826.52 * 4.21%* > > Hmean tput-26 811755.89 ( 0.00%) 838543.08 * 3.30%* > > Hmean tput-27 799341.75 ( 0.00%) 833487.26 * 4.27%* > > Hmean tput-28 803434.89 ( 0.00%) 829022.50 * 3.18%* > > Hmean tput-29 803233.25 ( 0.00%) 826622.37 * 2.91%* > > Hmean tput-30 800465.12 ( 0.00%) 824347.42 * 2.98%* > > Hmean tput-31 791284.39 ( 0.00%) 791575.67 ( 0.04%) > > Hmean tput-32 781930.07 ( 0.00%) 805725.80 ( 3.04%) > > Hmean tput-33 785194.31 ( 0.00%) 804795.44 ( 2.50%) > > Hmean tput-34 781325.67 ( 0.00%) 800067.53 ( 2.40%) > > Hmean tput-35 777715.92 ( 
0.00%) 753926.32 ( -3.06%) > > Hmean tput-36 770516.85 ( 0.00%) 783328.32 ( 1.66%) > > Hmean tput-37 758067.26 ( 0.00%) 772243.18 * 1.87%* > > Hmean tput-38 764815.45 ( 0.00%) 769156.32 ( 0.57%) > > Hmean tput-39 757885.41 ( 0.00%) 757670.59 ( -0.03%) > > Hmean tput-40 750140.15 ( 0.00%) 760739.13 ( 1.41%) > > > > The largest regression was within noise. Most results were outside the > > noise. > > > > Some HPC workloads showed little difference but they do not communicate > > that heavily. redis microbenchmark showed mostly neutral results. > > schbench (facebook simulator workload that is latency sensitive) showed a > > mix of results, but helped more than it hurt. Even the machine with the > > worst results for schbench showed improved wakeup latencies at the 99th > > percentile. These were all on NUMA machines. > > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers 2021-10-27 2:09 ` Mike Galbraith @ 2021-10-27 9:00 ` Mel Gorman 2021-10-27 10:18 ` Mike Galbraith 0 siblings, 1 reply; 26+ messages in thread From: Mel Gorman @ 2021-10-27 9:00 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, Oct 27, 2021 at 04:09:12AM +0200, Mike Galbraith wrote: > On Tue, 2021-10-26 at 14:13 +0200, Mike Galbraith wrote: > > On Tue, 2021-10-26 at 12:57 +0100, Mel Gorman wrote: > > > > > > The patch in question was also tested on other workloads on NUMA > > > machines. For a 2-socket machine (20 cores, HT enabled so 40 CPUs) > > > running specjbb 2005 with one JVM per NUMA node, the patch also > > > scaled > > > reasonably well > > > > That's way more more interesting. No idea what this thing does under > > the hood thus whether it should be helped or not, but at least it's a > > real deal benchmark vs a kernel hacker tool. > > ... > Installing test specjbb > specjvm-install: Fetching from mirror > http://mcp/mmtests-mirror/spec/SPECjbb2005_kitv1.07.tar.gz > specjvm-install: Fetching from internet > NOT_AVAILABLE/SPECjbb2005_kitv1.07.tar.gz > specjvm-install: Fetching from alt internet > /SPECjbb2005_kitv1.07.tar.gz > FATAL specjvm-install: specjvm-install: Could not download > /SPECjbb2005_kitv1.07.tar.gz > FATAL specjbb-bench: specjbb install script returned error > FATAL: specjbb returned failure, unable to continue > FATAL: Installation step failed for specjbb > > Hohum, so much for trying to take a peek. > The benchmark is not available for free unfortunately. > At any rate, unlike the tbench numbers, these have the look of signal > rather than test jig noise, and pretty strong signal at that, so maybe > patchlet should fly. At the very least, it appears to be saying that > there is significant performance to be had by some means. 
> > Bah, fly or die little patchlet. Either way there will be winners and > losers, that's just the way it works if you're not shaving cycles. > So, I assume you are ok for patch 1 to take flight to either live or die. I'll handle any bugs that show up in relation to it. How about patch 2? -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers 2021-10-27 9:00 ` Mel Gorman @ 2021-10-27 10:18 ` Mike Galbraith 0 siblings, 0 replies; 26+ messages in thread From: Mike Galbraith @ 2021-10-27 10:18 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, 2021-10-27 at 10:00 +0100, Mel Gorman wrote: > > How about patch 2? This version doesn't slam the door, so no worries wrt kthread/control thread latencies etc. Doing a brief tbench vs hogs mixed load run didn't even show much of a delta. -Mike ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers 2021-10-21 14:56 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman 2021-10-22 10:26 ` Mike Galbraith @ 2021-11-09 11:56 ` Peter Zijlstra 2021-11-09 12:55 ` Mike Galbraith 1 sibling, 1 reply; 26+ messages in thread From: Peter Zijlstra @ 2021-11-09 11:56 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Thu, Oct 21, 2021 at 03:56:02PM +0100, Mel Gorman wrote: > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index ff69f245b939..d00af3b97d8f 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p) > } > > if (current->last_wakee != p) { > + int min = __this_cpu_read(sd_llc_size) << 1; > + /* > + * Couple the wakee flips to the waker for the case where it > + * doesn't accrue flips, taking care to not push the wakee > + * high enough that the wake_wide() heuristic fails. > + */ > + if (current->wakee_flips > p->wakee_flips * min) > + p->wakee_flips++; > current->last_wakee = p; > current->wakee_flips++; > } It's a bit odd that the above uses min for llc_size, while the below: > @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p) > > if (master < slave) > swap(master, slave); > - if (slave < factor || master < slave * factor) > + if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor) > return 0; > return 1; > } has factor. Now: !(slave < factor || master < slave * factor) !(x || y) == !x && !y, gives: slave >= factor && master >= slave * factor subst lhr in rhs: master >= factor * factor your extra term: !((slave < factor && master < (factor*factor)/2) || master < slave * factor) changes that how? AFAICT it's a nop. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers 2021-11-09 11:56 ` Peter Zijlstra @ 2021-11-09 12:55 ` Mike Galbraith 0 siblings, 0 replies; 26+ messages in thread From: Mike Galbraith @ 2021-11-09 12:55 UTC (permalink / raw) To: Peter Zijlstra, Mel Gorman Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, 2021-11-09 at 12:56 +0100, Peter Zijlstra wrote: > > > > @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p) > > > > if (master < slave) > > swap(master, slave); > > - if (slave < factor || master < slave * factor) > > + if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor) > > return 0; > > return 1; > > } > > has factor. > > Now: > > !(slave < factor || master < slave * factor) > > !(x || y) == !x && !y, gives: > > slave >= factor && master >= slave * factor > > subst lhr in rhs: > > master >= factor * factor > > > your extra term: > > !((slave < factor && master < (factor*factor)/2) || master < slave * factor) > > changes that how? AFAICT it's a nop. That can happen when twiddling. The thought was to let volume on the right override individual thread decay on the left to a limited extent. -Mike ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] sched/fair: Increase wakeup_gran if current task has not executed the minimum granularity 2021-10-21 14:56 [PATCH 0/2] Reduce stacking and overscheduling Mel Gorman 2021-10-21 14:56 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman @ 2021-10-21 14:56 ` Mel Gorman 1 sibling, 0 replies; 26+ messages in thread From: Mel Gorman @ 2021-10-21 14:56 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved the kernel.sched_wakeup_granularity_ns sysctl under debugfs. One of the reasons this sysctl may be used is to "optimise for throughput", particularly when the machine is overloaded. The tool TuneD sometimes alters this for two profiles, e.g. "mssql" and "throughput-performance". At least version 2.9 does; in master this changed so that it pokes at debugfs instead. This patch aims to reduce the motivation to tweak sysctl_sched_wakeup_granularity by increasing sched_wakeup_granularity if the running task's runtime has not exceeded sysctl_sched_min_granularity. During task migration or wakeup, a decision is made on whether to preempt the current task or not. To limit over-scheduling, sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms of runtime before preempting. However, when a domain is heavily overloaded (e.g. hackbench), the degree of over-scheduling is still severe. This is problematic as time is wasted rescheduling tasks, time that could instead be used by userspace tasks. However, care must be taken. Even if a system is overloaded, there may be high priority threads that must still be able to run. Mike Galbraith explained the constraints as follows: CFS came about because the O1 scheduler was unfair to the point it had starvation problems.
People pretty much across the board agreed that a fair scheduler was a much better way to go, and CFS was born. It didn't originally have the sleep credit business, but had to grow it to become _short term_ fair. Ingo cut the sleep credit in half because of overscheduling, and that has worked out pretty well all told.. but now you're pushing it more in the unfair direction, all the way to extremely unfair for anything and everything very light. Fairness isn't the holy grail mind you, and at some point, giving up on short term fairness certainly isn't crazy, as proven by your hackbench numbers and other numbers we've seen over the years, but taking bites out of the 'CF' in the CFS that was born to be a corner-case killer is.. worrisome. The other shoe will drop.. it always does :) This patch increases the wakeup granularity if the current task has not reached its minimum preemption granularity. The current task may still be preempted but the difference in runtime must be higher. hackbench-process-pipes 5.15.0-rc3 5.15.0-rc3 sched-wakeeflips-v1r1 sched-scalewakegran-v3r2 Amean 1 0.3890 ( 0.00%) 0.3823 ( 1.71%) Amean 4 0.5217 ( 0.00%) 0.4867 ( 6.71%) Amean 7 0.5387 ( 0.00%) 0.5053 ( 6.19%) Amean 12 0.5443 ( 0.00%) 0.5450 ( -0.12%) Amean 21 0.6487 ( 0.00%) 0.6807 ( -4.93%) Amean 30 0.8033 ( 0.00%) 0.7107 * 11.54%* Amean 48 1.2400 ( 0.00%) 1.0447 * 15.75%* Amean 79 1.8200 ( 0.00%) 1.6033 * 11.90%* Amean 110 2.5820 ( 0.00%) 2.0763 * 19.58%* Amean 141 3.2203 ( 0.00%) 2.5313 * 21.40%* Amean 172 3.8200 ( 0.00%) 3.1163 * 18.42%* Amean 203 4.3357 ( 0.00%) 3.5560 * 17.98%* Amean 234 4.8047 ( 0.00%) 3.8913 * 19.01%* Amean 265 5.1243 ( 0.00%) 4.2293 * 17.47%* Amean 296 5.5940 ( 0.00%) 4.5357 * 18.92%* 5.15.0-rc3 5.15.0-rc3 sched-wakeeflips-v1r1 sched-scalewakegran-v3r2 Duration User 2567.27 2034.17 Duration System 21098.79 17137.08 Duration Elapsed 136.49 120.25 Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- kernel/sched/fair.c | 17 +++++++++++++++--
kernel/sched/features.h | 2 ++ 2 files changed, 17 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d00af3b97d8f..dee108470297 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7052,10 +7052,23 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) } #endif /* CONFIG_SMP */ -static unsigned long wakeup_gran(struct sched_entity *se) +static unsigned long +wakeup_gran(struct sched_entity *curr, struct sched_entity *se) { unsigned long gran = sysctl_sched_wakeup_granularity; + if (sched_feat(SCALE_WAKEUP_GRAN)) { + unsigned long delta_exec; + + /* + * Increase the wakeup granularity if curr's runtime + * is less than the minimum preemption granularity. + */ + delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime; + if (delta_exec < sysctl_sched_min_granularity) + gran += sysctl_sched_min_granularity; + } + /* * Since its curr running now, convert the gran from real-time * to virtual-time in his units. @@ -7094,7 +7107,7 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se) if (vdiff <= 0) return -1; - gran = wakeup_gran(se); + gran = wakeup_gran(curr, se); if (vdiff > gran) return 1; diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 7f8dace0964c..611591355ffd 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -95,3 +95,5 @@ SCHED_FEAT(LATENCY_WARN, false) SCHED_FEAT(ALT_PERIOD, true) SCHED_FEAT(BASE_SLICE, true) + +SCHED_FEAT(SCALE_WAKEUP_GRAN, true) -- 2.31.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v4 0/2] Reduce stacking and overscheduling @ 2021-10-28 9:48 Mel Gorman 2021-10-28 9:48 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman 0 siblings, 1 reply; 26+ messages in thread From: Mel Gorman @ 2021-10-28 9:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman Also available at git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git sched-scalewakegran-v4r1 Changelog since v3 o No code changes, added some additional results to patch 1 This series tackles two problems. The first is that heavy wakers can stack an excessive number of tasks on the same CPU. The second is that tasks can overschedule when the task has not reached its minimum preemption granularity. The patches are independent but were discussed together in the thread https://lore.kernel.org/r/20210920142614.4891-1-mgorman@techsingularity.net so are presented together.
With both patches on a zen3 machine hackbench-process-pipes 5.15.0-rc3 5.15.0-rc3 vanilla sched-scalewakegran-v3r2 Amean 1 0.3667 ( 0.00%) 0.3823 ( -4.27%) Amean 4 0.5343 ( 0.00%) 0.4867 ( 8.92%) Amean 7 0.5300 ( 0.00%) 0.5053 ( 4.65%) Amean 12 0.5737 ( 0.00%) 0.5450 ( 5.00%) Amean 21 0.6727 ( 0.00%) 0.6807 ( -1.19%) Amean 30 0.8583 ( 0.00%) 0.7107 * 17.20%* Amean 48 1.3977 ( 0.00%) 1.0447 * 25.26%* Amean 79 1.9790 ( 0.00%) 1.6033 * 18.98%* Amean 110 2.8020 ( 0.00%) 2.0763 * 25.90%* Amean 141 3.6683 ( 0.00%) 2.5313 * 31.00%* Amean 172 4.6687 ( 0.00%) 3.1163 * 33.25%* Amean 203 5.2183 ( 0.00%) 3.5560 * 31.86%* Amean 234 6.1077 ( 0.00%) 3.8913 * 36.29%* Amean 265 7.1313 ( 0.00%) 4.2293 * 40.69%* Amean 296 7.7557 ( 0.00%) 4.5357 * 41.52%* 5.15.0-rc3 5.15.0-rc3 vanilla sched-scalewakegran-v3r2 Duration User 2933.05 2034.17 Duration System 25652.83 17137.08 Duration Elapsed 162.50 120.25 kernel/sched/fair.c | 27 ++++++++++++++++++++++++--- kernel/sched/features.h | 2 ++ 2 files changed, 26 insertions(+), 3 deletions(-) -- 2.31.1 ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers 2021-10-28 9:48 [PATCH v4 0/2] Reduce stacking and overscheduling Mel Gorman @ 2021-10-28 9:48 ` Mel Gorman 2021-10-28 16:19 ` Tao Zhou 2021-10-29 15:17 ` Vincent Guittot 0 siblings, 2 replies; 26+ messages in thread From: Mel Gorman @ 2021-10-28 9:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman This patch mitigates a problem where wake_wide() allows a heavy waker (e.g. X) to stack an excessive number of wakees on the same CPU. This is due to the cpu_load check in wake_affine_weight. As noted by the original patch author (Mike Galbraith)[1]: Between load updates, X, or any other waker of many, can stack wakees to a ludicrous depth. Tracing kbuild vs firefox playing a youtube clip, I watched X stack 20 of the zillion firefox minions while their previous CPUs all had 1 lousy task running but a cpu_load() higher than the cpu_load() of X's CPU. Most of those prev_cpus were where X had left them when it migrated. Each and every crazy depth migration was wake_affine_weight() deciding we should pull. Paraphrasing Mike's test results from the patch: with make -j8 running along with firefox with two tabs, one containing youtube's suggestions of stuff, the other a running clip, if the idle tab is in focus and the mouse is not moved, flips decay enough for wake_wide() to lose interest, but just wiggle the mouse and it starts waking wide. Focus on the running clip, and it continuously wakes wide. The end result is that heavy wakers are less likely to stack tasks and, depending on the workload, migrations are reduced. From additional tests on various servers, the impact is machine dependent but generally this patch improves the situation.
hackbench-process-pipes 5.15.0-rc3 5.15.0-rc3 vanilla sched-wakeeflips-v1r1 Amean 1 0.3667 ( 0.00%) 0.3890 ( -6.09%) Amean 4 0.5343 ( 0.00%) 0.5217 ( 2.37%) Amean 7 0.5300 ( 0.00%) 0.5387 ( -1.64%) Amean 12 0.5737 ( 0.00%) 0.5443 ( 5.11%) Amean 21 0.6727 ( 0.00%) 0.6487 ( 3.57%) Amean 30 0.8583 ( 0.00%) 0.8033 ( 6.41%) Amean 48 1.3977 ( 0.00%) 1.2400 * 11.28%* Amean 79 1.9790 ( 0.00%) 1.8200 * 8.03%* Amean 110 2.8020 ( 0.00%) 2.5820 * 7.85%* Amean 141 3.6683 ( 0.00%) 3.2203 * 12.21%* Amean 172 4.6687 ( 0.00%) 3.8200 * 18.18%* Amean 203 5.2183 ( 0.00%) 4.3357 * 16.91%* Amean 234 6.1077 ( 0.00%) 4.8047 * 21.33%* Amean 265 7.1313 ( 0.00%) 5.1243 * 28.14%* Amean 296 7.7557 ( 0.00%) 5.5940 * 27.87%* While different machines showed different results, in general there were much less CPU migrations of tasks tbench4 5.15.0-rc3 5.15.0-rc3 vanilla sched-wakeeflips-v1r1 Hmean 1 824.05 ( 0.00%) 802.56 * -2.61%* Hmean 2 1578.49 ( 0.00%) 1645.11 * 4.22%* Hmean 4 2959.08 ( 0.00%) 2984.75 * 0.87%* Hmean 8 5080.09 ( 0.00%) 5173.35 * 1.84%* Hmean 16 8276.02 ( 0.00%) 9327.17 * 12.70%* Hmean 32 15501.61 ( 0.00%) 15925.55 * 2.73%* Hmean 64 27313.67 ( 0.00%) 24107.81 * -11.74%* Hmean 128 32928.19 ( 0.00%) 36261.75 * 10.12%* Hmean 256 35434.73 ( 0.00%) 38670.61 * 9.13%* Hmean 512 50098.34 ( 0.00%) 53243.75 * 6.28%* Hmean 1024 69503.69 ( 0.00%) 67425.26 * -2.99%* Bit of a mixed bag but wins more than it loses. A new workload was added that runs a kernel build in the background -jNR_CPUS while NR_CPUS pairs of tasks run Netperf TCP_RR. 
The intent is to see if heavy background tasks disrupt ligher tasks multi subtest kernbench 5.15.0-rc3 5.15.0-rc3 vanilla sched-wakeeflips-v1r1 Min elsp-256 20.80 ( 0.00%) 14.89 ( 28.41%) Amean elsp-256 24.08 ( 0.00%) 20.94 ( 13.05%) Stddev elsp-256 3.32 ( 0.00%) 4.68 ( -41.16%) CoeffVar elsp-256 13.78 ( 0.00%) 22.36 ( -62.33%) Max elsp-256 29.11 ( 0.00%) 26.49 ( 9.00%) multi subtest netperf-tcp-rr 5.15.0-rc3 5.15.0-rc3 vanilla sched-wakeeflips-v1r1 Min 1 48286.26 ( 0.00%) 49101.48 ( 1.69%) Hmean 1 62894.82 ( 0.00%) 68963.51 * 9.65%* Stddev 1 7600.56 ( 0.00%) 8804.82 ( -15.84%) Max 1 78975.16 ( 0.00%) 87124.67 ( 10.32%) The variability is higher as a result of the patch but both workloads experienced improved performance. SpecJBB 2005 is a slightly more realistic workload with multiple communicating Java threads specjbb 5.15.0-rc3 5.15.0-rc3 vanilla sched-wakeeflips-v1r1 Hmean tput-1 50044.48 ( 0.00%) 53969.00 * 7.84%* Hmean tput-2 106050.31 ( 0.00%) 113580.78 * 7.10%* Hmean tput-3 156701.44 ( 0.00%) 164857.00 * 5.20%* Hmean tput-4 196538.75 ( 0.00%) 218373.42 * 11.11%* Hmean tput-5 247566.16 ( 0.00%) 267173.09 * 7.92%* Hmean tput-6 284981.46 ( 0.00%) 311007.14 * 9.13%* Hmean tput-7 328882.48 ( 0.00%) 359373.89 * 9.27%* Hmean tput-8 366941.24 ( 0.00%) 393244.37 * 7.17%* Hmean tput-9 402386.74 ( 0.00%) 433010.43 * 7.61%* Hmean tput-10 437551.05 ( 0.00%) 475756.08 * 8.73%* Hmean tput-11 481349.41 ( 0.00%) 519824.54 * 7.99%* Hmean tput-12 533148.45 ( 0.00%) 565070.21 * 5.99%* Hmean tput-13 570563.97 ( 0.00%) 609499.06 * 6.82%* Hmean tput-14 601117.97 ( 0.00%) 647876.05 * 7.78%* Hmean tput-15 639096.38 ( 0.00%) 690854.46 * 8.10%* Hmean tput-16 682644.91 ( 0.00%) 722826.06 * 5.89%* Hmean tput-17 732248.96 ( 0.00%) 758805.17 * 3.63%* Hmean tput-18 762771.33 ( 0.00%) 791211.66 * 3.73%* Hmean tput-19 780582.92 ( 0.00%) 819064.19 * 4.93%* Hmean tput-20 812183.95 ( 0.00%) 836664.87 * 3.01%* Hmean tput-21 821415.48 ( 0.00%) 833734.23 ( 1.50%) Hmean tput-22 815457.65 ( 0.00%) 
844393.98 * 3.55%* Hmean tput-23 819263.63 ( 0.00%) 846109.07 * 3.28%* Hmean tput-24 817962.95 ( 0.00%) 839682.92 * 2.66%* Hmean tput-25 807814.64 ( 0.00%) 841826.52 * 4.21%* Hmean tput-26 811755.89 ( 0.00%) 838543.08 * 3.30%* Hmean tput-27 799341.75 ( 0.00%) 833487.26 * 4.27%* Hmean tput-28 803434.89 ( 0.00%) 829022.50 * 3.18%* Hmean tput-29 803233.25 ( 0.00%) 826622.37 * 2.91%* Hmean tput-30 800465.12 ( 0.00%) 824347.42 * 2.98%* Hmean tput-31 791284.39 ( 0.00%) 791575.67 ( 0.04%) Hmean tput-32 781930.07 ( 0.00%) 805725.80 ( 3.04%) Hmean tput-33 785194.31 ( 0.00%) 804795.44 ( 2.50%) Hmean tput-34 781325.67 ( 0.00%) 800067.53 ( 2.40%) Hmean tput-35 777715.92 ( 0.00%) 753926.32 ( -3.06%) Hmean tput-36 770516.85 ( 0.00%) 783328.32 ( 1.66%) Hmean tput-37 758067.26 ( 0.00%) 772243.18 * 1.87%* Hmean tput-38 764815.45 ( 0.00%) 769156.32 ( 0.57%) Hmean tput-39 757885.41 ( 0.00%) 757670.59 ( -0.03%) Hmean tput-40 750140.15 ( 0.00%) 760739.13 ( 1.41%) [1] https://lore.kernel.org/r/02c977d239c312de5e15c77803118dcf1e11f216.camel@gmx.de Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- kernel/sched/fair.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ff69f245b939..d00af3b97d8f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p) } if (current->last_wakee != p) { + int min = __this_cpu_read(sd_llc_size) << 1; + /* + * Couple the wakee flips to the waker for the case where it + * doesn't accrue flips, taking care to not push the wakee + * high enough that the wake_wide() heuristic fails. 
+ */ + if (current->wakee_flips > p->wakee_flips * min) + p->wakee_flips++; current->last_wakee = p; current->wakee_flips++; } @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p) if (master < slave) swap(master, slave); - if (slave < factor || master < slave * factor) + if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor) return 0; return 1; } -- 2.31.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers 2021-10-28 9:48 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman @ 2021-10-28 16:19 ` Tao Zhou 2021-10-29 8:42 ` Mel Gorman 2021-10-29 15:17 ` Vincent Guittot 1 sibling, 1 reply; 26+ messages in thread From: Tao Zhou @ 2021-10-28 16:19 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Tao Zhou Hi Mel, On Thu, Oct 28, 2021 at 10:48:33AM +0100, Mel Gorman wrote: > @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p) > } > > if (current->last_wakee != p) { > + int min = __this_cpu_read(sd_llc_size) << 1; > + /* > + * Couple the wakee flips to the waker for the case where it > + * doesn't accrue flips, taking care to not push the wakee > + * high enough that the wake_wide() heuristic fails. > + */ > + if (current->wakee_flips > p->wakee_flips * min) > + p->wakee_flips++; > current->last_wakee = p; > current->wakee_flips++; > } > @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p) > > if (master < slave) > swap(master, slave); > - if (slave < factor || master < slave * factor) > + if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor) So, the check like this include the above range: if ((slave < factor && master < slave * factor) || master < slave * factor) That "factor>>1" filter some. If "slave < factor" is true and "master < (factor>>1)*factor" is false, then we check "master < slave * factor".(This is one path added by the check "&& master < (factor>>1)*factor"). In the latter check "slave < factor" must be true, the result of this check depend on slave in the range [factor, factor>>1] if there is possibility that "master < slave * factor". 
If slave is in the range [0, factor>>1), the check "master < slave * factor" is absolutely false, and this case could be filtered out if we used a variable to hold the result of master < (factor>>1)*factor. These are just my random inputs; I remain somewhat confused but will keep moving on. Thanks, Tao ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers 2021-10-28 16:19 ` Tao Zhou @ 2021-10-29 8:42 ` Mel Gorman 2021-11-10 9:53 ` Tao Zhou 0 siblings, 1 reply; 26+ messages in thread From: Mel Gorman @ 2021-10-29 8:42 UTC (permalink / raw) To: Tao Zhou Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Fri, Oct 29, 2021 at 12:19:48AM +0800, Tao Zhou wrote: > Hi Mel, > > On Thu, Oct 28, 2021 at 10:48:33AM +0100, Mel Gorman wrote: > > > @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p) > > } > > > > if (current->last_wakee != p) { > > + int min = __this_cpu_read(sd_llc_size) << 1; > > + /* > > + * Couple the wakee flips to the waker for the case where it > > + * doesn't accrue flips, taking care to not push the wakee > > + * high enough that the wake_wide() heuristic fails. > > + */ > > + if (current->wakee_flips > p->wakee_flips * min) > > + p->wakee_flips++; > > current->last_wakee = p; > > current->wakee_flips++; > > } > > @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p) > > > > if (master < slave) > > swap(master, slave); > > - if (slave < factor || master < slave * factor) > > + if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor) > > So, the check like this include the above range: > > if ((slave < factor && master < slave * factor) || > master < slave * factor) > > That "factor>>1" filter some. > > If "slave < factor" is true and "master < (factor>>1)*factor" is false, > then we check "master < slave * factor".(This is one path added by the > check "&& master < (factor>>1)*factor"). > In the latter check "slave < factor" must be true, the result of this > check depend on slave in the range [factor, factor>>1] if there is possibility > that "master < slave * factor". 
If slave in [factor>>1, 0], the check of > "master < slave * factor" is absolutly false and this can be filtered if > we use a variable to load the result of master < (factor>>1)*factor. > > My random random inputs and continue confusing to move on. > I'm not sure what point you're trying to make. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-29  8:42         ` Mel Gorman
@ 2021-11-10  9:53           ` Tao Zhou
  2021-11-10 15:40             ` Mike Galbraith
  0 siblings, 1 reply; 26+ messages in thread
From: Tao Zhou @ 2021-11-10  9:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML,
	Tao Zhou

On Fri, Oct 29, 2021 at 09:42:19AM +0100, Mel Gorman wrote:
> On Fri, Oct 29, 2021 at 12:19:48AM +0800, Tao Zhou wrote:
> > Hi Mel,
> >
> > On Thu, Oct 28, 2021 at 10:48:33AM +0100, Mel Gorman wrote:
> >
> > > @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
> > >  	}
> > >
> > >  	if (current->last_wakee != p) {
> > > +		int min = __this_cpu_read(sd_llc_size) << 1;
> > > +		/*
> > > +		 * Couple the wakee flips to the waker for the case where it
> > > +		 * doesn't accrue flips, taking care to not push the wakee
> > > +		 * high enough that the wake_wide() heuristic fails.
> > > +		 */
> > > +		if (current->wakee_flips > p->wakee_flips * min)
> > > +			p->wakee_flips++;
> > >  		current->last_wakee = p;
> > >  		current->wakee_flips++;
> > >  	}
> > > @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
> > >
> > >  	if (master < slave)
> > >  		swap(master, slave);
> > > -	if (slave < factor || master < slave * factor)
> > > +	if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
> >
> > So, a check like this covers the same range:
> >
> >   if ((slave < factor && master < slave * factor) ||
> >       master < slave * factor)
> >
> > except that the "factor>>1" term filters some cases out.
> >
> > If "slave < factor" is true and "master < (factor>>1)*factor" is false,
> > then we fall through to checking "master < slave * factor" (this is the
> > one path added by the "&& master < (factor>>1)*factor" check).
> > In that latter check "slave < factor" must be true, and the result
> > depends on slave being in the range [factor>>1, factor), because only
> > then is there any possibility that "master < slave * factor".
> > If slave is in [0, factor>>1), the check "master < slave * factor" is
> > absolutely false, and this could be filtered out if we used a variable
> > to hold the result of "master < (factor>>1)*factor".
> >
> > I tried some random inputs and am still confused; moving on.
>
> I'm not sure what point you're trying to make.

Ok, some days later I can no longer understand what I said myself. After
going back and forth with my wrecked head, what I was trying to suggest
is this:

  if ((slave < factor && master < (factor>>1)*factor) ||
      ((slave >= factor>>1) && master < slave * factor))

The "slave >= factor>>1" check filters out the cases where
"master < slave * factor" cannot possibly be true, if I am not wrong.
The only effect, if any, would be avoiding the "master < slave * factor"
comparison some of the time; I'm not sure it matters.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-11-10  9:53           ` Tao Zhou
@ 2021-11-10 15:40             ` Mike Galbraith
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Galbraith @ 2021-11-10 15:40 UTC (permalink / raw)
  To: Tao Zhou, Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, 2021-11-10 at 17:53 +0800, Tao Zhou wrote:
> On Fri, Oct 29, 2021 at 09:42:19AM +0100, Mel Gorman wrote:
> > On Fri, Oct 29, 2021 at 12:19:48AM +0800, Tao Zhou wrote:
> > > Hi Mel,
> > >
> > > On Thu, Oct 28, 2021 at 10:48:33AM +0100, Mel Gorman wrote:
> > >
> > > > @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
> > > >  	}
> > > >
> > > >  	if (current->last_wakee != p) {
> > > > +		int min = __this_cpu_read(sd_llc_size) << 1;
> > > > +		/*
> > > > +		 * Couple the wakee flips to the waker for the case where it
> > > > +		 * doesn't accrue flips, taking care to not push the wakee
> > > > +		 * high enough that the wake_wide() heuristic fails.
> > > > +		 */
> > > > +		if (current->wakee_flips > p->wakee_flips * min)
> > > > +			p->wakee_flips++;
> > > >  		current->last_wakee = p;
> > > >  		current->wakee_flips++;
> > > >  	}
> > > > @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
> > > >
> > > >  	if (master < slave)
> > > >  		swap(master, slave);
> > > > -	if (slave < factor || master < slave * factor)
> > > > +	if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
> > >
> > > So, a check like this covers the same range:
> > >
> > >   if ((slave < factor && master < slave * factor) ||
> > >       master < slave * factor)
> > >
> > > except that the "factor>>1" term filters some cases out.
> > >
> > > If "slave < factor" is true and "master < (factor>>1)*factor" is false,
> > > then we fall through to checking "master < slave * factor" (this is the
> > > one path added by the "&& master < (factor>>1)*factor" check).
> > > In that latter check "slave < factor" must be true, and the result
> > > depends on slave being in the range [factor>>1, factor), because only
> > > then is there any possibility that "master < slave * factor".
> > > If slave is in [0, factor>>1), the check "master < slave * factor" is
> > > absolutely false, and this could be filtered out if we used a variable
> > > to hold the result of "master < (factor>>1)*factor".
> > >
> > > I tried some random inputs and am still confused; moving on.
> >
> > I'm not sure what point you're trying to make.
>
> Ok, some days later I can no longer understand what I said myself. After
> going back and forth with my wrecked head, what I was trying to suggest
> is this:
>
>   if ((slave < factor && master < (factor>>1)*factor) ||
>       ((slave >= factor>>1) && master < slave * factor))
>
> The "slave >= factor>>1" check filters out the cases where
> "master < slave * factor" cannot possibly be true, if I am not wrong.
> The only effect, if any, would be avoiding the "master < slave * factor"
> comparison some of the time; I'm not sure it matters.

Take the original:

	if (slave < factor || master < slave * factor)
		return 0;

That is looking for a waker:wakees ratio of sd_llc_size, and it does it
the way it does because you can create "flips" galore by waking only two
tasks; using the two comparisons together makes it more likely that
you're really waking sd_llc_size tasks.

Take my box's 8-rq LLC: if the wakee's flip count is 8, a multi-waker
having 8 times that suggests 8 wakees, each having been awakened 8 times
by our multi-waker, qualifying the pair to be considered part of a load
too large to restrict to one LLC.

But what happens when our multi-waker isn't always waking a uniformly
growing/shrinking set of workers, it's a bit chaotic, and the flip count
of some wakees decays below our magic 8? The right side can be happy as
a clam, because the multi-waker is flipping madly enough to make
wakee * llc_size nothing remotely resembling a hurdle, but there sits a
deal breaker on the left.. so we should wake these threads affine?

I should have left that alone, or at least picked a big arbitrary
stopper, but I picked half of our magic "I might be waking a herd"
number to say nah: as long as the ratio on the right looks herd-like
AND our multi-waker appears to be waking at least half a herd, wake it
wide.

That not-a-noop probably should die, despite having not (yet) shown an
evil side, because it dings up an already questionable enough heuristic.

	-Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-28  9:48 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman
  2021-10-28 16:19   ` Tao Zhou
@ 2021-10-29 15:17   ` Vincent Guittot
  2021-10-30  3:11     ` Mike Galbraith
  2021-11-01  8:56     ` Mel Gorman
  1 sibling, 2 replies; 26+ messages in thread
From: Vincent Guittot @ 2021-10-29 15:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Thu, 28 Oct 2021 at 11:48, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> This patch mitigates a problem where wake_wide() allows a heavy waker
> (e.g. X) to stack an excessive number of wakees on the same CPU. This
> is due to the cpu_load check in wake_affine_weight. As noted by the
> original patch author (Mike Galbraith)[1]:
>
>	Between load updates, X, or any other waker of many, can stack
>	wakees to a ludicrous depth. Tracing kbuild vs firefox playing a
>	youtube clip, I watched X stack 20 of the zillion firefox minions
>	while their previous CPUs all had 1 lousy task running but a
>	cpu_load() higher than the cpu_load() of X's CPU. Most of those
>	prev_cpus were where X had left them when it migrated. Each and
>	every crazy depth migration was wake_affine_weight() deciding we
>	should pull.
>
> Paraphrasing Mike's test results from the patch:
>
>	With make -j8 running along with firefox with two tabs, one
>	containing youtube's suggestions of stuff, the other a running
>	clip, if the idle tab is in focus and the mouse is not moved,
>	flips decay enough for wake_wide() to lose interest; but just
>	wiggle the mouse, and it starts waking wide. Focus on the
>	running clip, and it continuously wakes wide.
>
> The end result is that heavy wakers are less likely to stack tasks and,
> depending on the workload, reduce migrations.
> From additional tests on various servers, the impact is machine
> dependent but generally this patch improves the situation.
>
> hackbench-process-pipes
>                          5.15.0-rc3             5.15.0-rc3
>                             vanilla  sched-wakeeflips-v1r1
> Amean     1       0.3667 (   0.00%)      0.3890 (  -6.09%)
> Amean     4       0.5343 (   0.00%)      0.5217 (   2.37%)
> Amean     7       0.5300 (   0.00%)      0.5387 (  -1.64%)
> Amean     12      0.5737 (   0.00%)      0.5443 (   5.11%)
> Amean     21      0.6727 (   0.00%)      0.6487 (   3.57%)
> Amean     30      0.8583 (   0.00%)      0.8033 (   6.41%)
> Amean     48      1.3977 (   0.00%)      1.2400 *  11.28%*
> Amean     79      1.9790 (   0.00%)      1.8200 *   8.03%*
> Amean     110     2.8020 (   0.00%)      2.5820 *   7.85%*
> Amean     141     3.6683 (   0.00%)      3.2203 *  12.21%*
> Amean     172     4.6687 (   0.00%)      3.8200 *  18.18%*
> Amean     203     5.2183 (   0.00%)      4.3357 *  16.91%*
> Amean     234     6.1077 (   0.00%)      4.8047 *  21.33%*
> Amean     265     7.1313 (   0.00%)      5.1243 *  28.14%*
> Amean     296     7.7557 (   0.00%)      5.5940 *  27.87%*
>
> While different machines showed different results, in general there
> were far fewer CPU migrations of tasks.
>
> tbench4
>                           5.15.0-rc3             5.15.0-rc3
>                              vanilla  sched-wakeeflips-v1r1
> Hmean     1        824.05 (   0.00%)      802.56 *  -2.61%*
> Hmean     2       1578.49 (   0.00%)     1645.11 *   4.22%*
> Hmean     4       2959.08 (   0.00%)     2984.75 *   0.87%*
> Hmean     8       5080.09 (   0.00%)     5173.35 *   1.84%*
> Hmean     16      8276.02 (   0.00%)     9327.17 *  12.70%*
> Hmean     32     15501.61 (   0.00%)    15925.55 *   2.73%*
> Hmean     64     27313.67 (   0.00%)    24107.81 * -11.74%*
> Hmean     128    32928.19 (   0.00%)    36261.75 *  10.12%*
> Hmean     256    35434.73 (   0.00%)    38670.61 *   9.13%*
> Hmean     512    50098.34 (   0.00%)    53243.75 *   6.28%*
> Hmean     1024   69503.69 (   0.00%)    67425.26 *  -2.99%*
>
> A bit of a mixed bag, but it wins more than it loses.
>
> A new workload was added that runs a kernel build in the background
> -jNR_CPUS while NR_CPUS pairs of tasks run Netperf TCP_RR. The
> intent is to see if heavy background tasks disrupt lighter tasks.
>
> multi subtest kernbench
>                               5.15.0-rc3             5.15.0-rc3
>                                  vanilla  sched-wakeeflips-v1r1
> Min       elsp-256  20.80 (   0.00%)       14.89 (  28.41%)
> Amean     elsp-256  24.08 (   0.00%)       20.94 (  13.05%)
> Stddev    elsp-256   3.32 (   0.00%)        4.68 ( -41.16%)
> CoeffVar  elsp-256  13.78 (   0.00%)       22.36 ( -62.33%)
> Max       elsp-256  29.11 (   0.00%)       26.49 (   9.00%)
>
> multi subtest netperf-tcp-rr
>                           5.15.0-rc3             5.15.0-rc3
>                              vanilla  sched-wakeeflips-v1r1
> Min       1   48286.26 (   0.00%)    49101.48 (   1.69%)
> Hmean     1   62894.82 (   0.00%)    68963.51 *   9.65%*
> Stddev    1    7600.56 (   0.00%)     8804.82 ( -15.84%)
> Max       1   78975.16 (   0.00%)    87124.67 (  10.32%)
>
> The variability is higher as a result of the patch but both workloads
> experienced improved performance.
>
> SpecJBB 2005 is a slightly more realistic workload with multiple
> communicating Java threads.
>
> specjbb
>                            5.15.0-rc3             5.15.0-rc3
>                               vanilla  sched-wakeeflips-v1r1
> Hmean     tput-1    50044.48 (   0.00%)    53969.00 *   7.84%*
> Hmean     tput-2   106050.31 (   0.00%)   113580.78 *   7.10%*
> Hmean     tput-3   156701.44 (   0.00%)   164857.00 *   5.20%*
> Hmean     tput-4   196538.75 (   0.00%)   218373.42 *  11.11%*
> Hmean     tput-5   247566.16 (   0.00%)   267173.09 *   7.92%*
> Hmean     tput-6   284981.46 (   0.00%)   311007.14 *   9.13%*
> Hmean     tput-7   328882.48 (   0.00%)   359373.89 *   9.27%*
> Hmean     tput-8   366941.24 (   0.00%)   393244.37 *   7.17%*
> Hmean     tput-9   402386.74 (   0.00%)   433010.43 *   7.61%*
> Hmean     tput-10  437551.05 (   0.00%)   475756.08 *   8.73%*
> Hmean     tput-11  481349.41 (   0.00%)   519824.54 *   7.99%*
> Hmean     tput-12  533148.45 (   0.00%)   565070.21 *   5.99%*
> Hmean     tput-13  570563.97 (   0.00%)   609499.06 *   6.82%*
> Hmean     tput-14  601117.97 (   0.00%)   647876.05 *   7.78%*
> Hmean     tput-15  639096.38 (   0.00%)   690854.46 *   8.10%*
> Hmean     tput-16  682644.91 (   0.00%)   722826.06 *   5.89%*
> Hmean     tput-17  732248.96 (   0.00%)   758805.17 *   3.63%*
> Hmean     tput-18  762771.33 (   0.00%)   791211.66 *   3.73%*
> Hmean     tput-19  780582.92 (   0.00%)   819064.19 *   4.93%*
> Hmean     tput-20  812183.95 (   0.00%)   836664.87 *   3.01%*
> Hmean     tput-21  821415.48 (   0.00%)   833734.23 (   1.50%)
> Hmean     tput-22  815457.65 (   0.00%)   844393.98 *   3.55%*
> Hmean     tput-23  819263.63 (   0.00%)   846109.07 *   3.28%*
> Hmean     tput-24  817962.95 (   0.00%)   839682.92 *   2.66%*
> Hmean     tput-25  807814.64 (   0.00%)   841826.52 *   4.21%*
> Hmean     tput-26  811755.89 (   0.00%)   838543.08 *   3.30%*
> Hmean     tput-27  799341.75 (   0.00%)   833487.26 *   4.27%*
> Hmean     tput-28  803434.89 (   0.00%)   829022.50 *   3.18%*
> Hmean     tput-29  803233.25 (   0.00%)   826622.37 *   2.91%*
> Hmean     tput-30  800465.12 (   0.00%)   824347.42 *   2.98%*
> Hmean     tput-31  791284.39 (   0.00%)   791575.67 (   0.04%)
> Hmean     tput-32  781930.07 (   0.00%)   805725.80 (   3.04%)
> Hmean     tput-33  785194.31 (   0.00%)   804795.44 (   2.50%)
> Hmean     tput-34  781325.67 (   0.00%)   800067.53 (   2.40%)
> Hmean     tput-35  777715.92 (   0.00%)   753926.32 (  -3.06%)
> Hmean     tput-36  770516.85 (   0.00%)   783328.32 (   1.66%)
> Hmean     tput-37  758067.26 (   0.00%)   772243.18 *   1.87%*
> Hmean     tput-38  764815.45 (   0.00%)   769156.32 (   0.57%)
> Hmean     tput-39  757885.41 (   0.00%)   757670.59 (  -0.03%)
> Hmean     tput-40  750140.15 (   0.00%)   760739.13 (   1.41%)
>
> [1] https://lore.kernel.org/r/02c977d239c312de5e15c77803118dcf1e11f216.camel@gmx.de
>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff69f245b939..d00af3b97d8f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
>  	}
>
>  	if (current->last_wakee != p) {
> +		int min = __this_cpu_read(sd_llc_size) << 1;
> +		/*
> +		 * Couple the wakee flips to the waker for the case where it
> +		 * doesn't accrue flips, taking care to not push the wakee
> +		 * high enough that the wake_wide() heuristic fails.
> +		 */
> +		if (current->wakee_flips > p->wakee_flips * min)
> +			p->wakee_flips++;

I have a hard time understanding the rationale behind these changes and
the one below. Could you provide more details about why p->wakee_flips
is increased here? It would also be good to add such an explanation to
the commit message.

>  		current->last_wakee = p;
>  		current->wakee_flips++;
>  	}
> @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
>
>  	if (master < slave)
>  		swap(master, slave);
> -	if (slave < factor || master < slave * factor)
> +	if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
>  		return 0;
>  	return 1;
>  }
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-29 15:17   ` Vincent Guittot
@ 2021-10-30  3:11     ` Mike Galbraith
  2021-10-30  4:12       ` Mike Galbraith
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2021-10-30  3:11 UTC (permalink / raw)
  To: Vincent Guittot, Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Fri, 2021-10-29 at 17:17 +0200, Vincent Guittot wrote:
>
> I have a hard time understanding the rationale behind these changes
> and the one below. Could you provide more details about why to
> increase p->wakee_flips here ?

The rationale behind it was me discovering wake_affine_weight()'s use of
weight averages causing X, waking a way too big thread pool, to stack
the entire herd of 21 threads onto the waker's CPU while all other CPUs
in my little i7 box had one task each. Preventing stacking is the job of
SIS or wake_wide(), but because I was test driving a patch I had some
fairness concerns about, the box was kept busy.

I was subsequently asked about wake_wide()'s role, and while I don't
think it should have one in a single-LLC box, I looked into it and found
that while X is a candidate, event thread wakees were not. I thought to
myself: what if I loosely couple the wakees that earn zero flips to the
waker? I did so, allowed for a bit of decay wiggle room, and watched it
do its thing in realtime. There you have the rationale.

While it did help, it did not eliminate the aforementioned worst case,
because as desktop behavior changes, decay turns off the heuristic and
stacking follows. I profiled it with a perf that sums delay (a local
mod I find useful), and found that there was no real benefit to the
light desktop test load, at which point, no longer having NUMA boxen at
my disposal where wake_wide() does have a mission, I lost interest. Mel
was interested however, fed it to SUSE's test array, and here we are.
Executive summary: patchlet is a not-so-lovely mitigation of an even
less lovely scheduler behavior. The two deserve each other ;-)

Kidding aside, way better would be wake_wide() becoming obsolete.

	-Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-30  3:11     ` Mike Galbraith
@ 2021-10-30  4:12       ` Mike Galbraith
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Galbraith @ 2021-10-30  4:12 UTC (permalink / raw)
  To: Vincent Guittot, Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Sat, 2021-10-30 at 05:11 +0200, Mike Galbraith wrote:
>
> I profiled it with a perf that sums delay (a local mod I find useful)...

Here are those numbers for completeness. As you can see, patchlet was
fairly meaningless to the desktop load. In fact, even though the wake
affine logic essentially favors the compute load over the heavily
threaded but light desktop load, I noticed zero difference with or
without that being allowed, nor did perf record any noteworthy latency
events, just more wait as the load was kinda sorta de-threaded by being
stacked up.

box = i7-4790 quad+smt desktop
vs massive_intr 8 9999 (8 x 8ms run/1ms sleep, for 9999 secs..
effectively forever)
perf sched record -a -- su mikeg -c 'firefox
https://www.youtube.com/watch?v=aqz-KE-bpKQ'& sleep 300 && killall perf firefox

patch/features     runtime total   util   runtime massive_intr  util   sum delay total  sum delay massive_intr  sum delay desktop  switches total/massive  desktop util
virgin/stock       2267347.921 ms  94.4%  1932675.152 ms        80.5%  158611.016 ms    133309.938 ms           25301.078 ms       594780/441157           13.9%
virgin/-wa_weight  2236871.408 ms  93.2%  1881464.401 ms        78.3%  255785.391 ms    243958.616 ms           11826.775 ms       1525470/1424083         14.8%
                   -1.34%          -1.2%                        -2.2%                   -13.474 s                                                          +0.9%
wake_wide/stock    2254335.961 ms  93.9%  1917834.157 ms        79.9%  164766.194 ms    141974.540 ms           22791.654 ms       720711/599064           14.0%

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-29 15:17   ` Vincent Guittot
  2021-10-30  3:11     ` Mike Galbraith
@ 2021-11-01  8:56     ` Mel Gorman
  1 sibling, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-11-01  8:56 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Oct 29, 2021 at 05:17:38PM +0200, Vincent Guittot wrote:
> > index ff69f245b939..d00af3b97d8f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
> >  	}
> >
> >  	if (current->last_wakee != p) {
> > +		int min = __this_cpu_read(sd_llc_size) << 1;
> > +		/*
> > +		 * Couple the wakee flips to the waker for the case where it
> > +		 * doesn't accrue flips, taking care to not push the wakee
> > +		 * high enough that the wake_wide() heuristic fails.
> > +		 */
> > +		if (current->wakee_flips > p->wakee_flips * min)
> > +			p->wakee_flips++;
>
> I have a hard time understanding the rationale behind these changes
> and the one below. Could you provide more details about why to
> increase p->wakee_flips here ? Also would be good to add such
> explanation in the commit message

The changelog covers it in the first two paragraphs, but would the
following be better as a comment?

	/*
	 * Couple the wakee flips to the waker for the case where the
	 * wakee doesn't accrue any flips during a short interval where
	 * there are many wakeups without cpu load average being updated.
	 * Otherwise, it is possible for wake_wide to not trigger followed
	 * by an affine wake stacking multiple tasks on the same CPU due
	 * to a stale cpu_load() value checked in wake_affine_weight.
	 * This heuristic reduces excessive stacking of tasks while taking
	 * care to not push the wakee high enough that the wake_wide
	 * heuristic fails differently.
	 */

Is that any better? I know this is a heuristic that is a bit on the
fuzzy side as it's trying to clamp the worst of a corner case. Ideally
"wake_wide" would be replaced with a more straightforward heuristic,
but I'm not aware of any alternatives being proposed (and I don't have
one of my own).

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread
end of thread, other threads:[~2021-11-10 15:40 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-21 14:56 [PATCH 0/2] Reduce stacking and overscheduling Mel Gorman
2021-10-21 14:56 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman
2021-10-22 10:26   ` Mike Galbraith
2021-10-22 11:05     ` Mel Gorman
2021-10-22 12:00       ` Mike Galbraith
2021-10-25  6:35         ` Mike Galbraith
2021-10-26  8:18           ` Mel Gorman
2021-10-26 10:15             ` Mike Galbraith
2021-10-26 10:41               ` Mike Galbraith
2021-10-26 11:57                 ` Mel Gorman
2021-10-26 12:13                   ` Mike Galbraith
2021-10-27  2:09                     ` Mike Galbraith
2021-10-27  9:00                       ` Mel Gorman
2021-10-27 10:18                         ` Mike Galbraith
2021-11-09 11:56                           ` Peter Zijlstra
2021-11-09 12:55                             ` Mike Galbraith
2021-10-21 14:56 ` [PATCH 2/2] sched/fair: Increase wakeup_gran if current task has not executed the minimum granularity Mel Gorman
2021-10-28  9:48 [PATCH v4 0/2] Reduce stacking and overscheduling Mel Gorman
2021-10-28  9:48 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman
2021-10-28 16:19   ` Tao Zhou
2021-10-29  8:42     ` Mel Gorman
2021-11-10  9:53       ` Tao Zhou
2021-11-10 15:40         ` Mike Galbraith
2021-10-29 15:17   ` Vincent Guittot
2021-10-30  3:11     ` Mike Galbraith
2021-10-30  4:12       ` Mike Galbraith
2021-11-01  8:56     ` Mel Gorman