* [PATCH 0/2] Scale wakeup granularity relative to nr_running
@ 2021-09-20 14:26 Mel Gorman
  2021-09-20 14:26 ` [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup Mel Gorman
  2021-09-20 14:26 ` [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running Mel Gorman
  0 siblings, 2 replies; 59+ messages in thread

From: Mel Gorman @ 2021-09-20 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
      Barry Song, Srikar Dronamraju, LKML, Mel Gorman

It's unfortunate this is colliding with Plumbers, but part of the
discussion is on select_idle_sibling and hackbench is a common reference
workload for evaluating select_idle_sibling. The downside is that
hackbench is sensitive to other factors much more than the SIS cost. A
major component is the value of kernel.sched_wakeup_granularity_ns and
this series tackles that bit.

Patch 1 is minor, it was spotted while developing patch 2.

Patch 2 scales kernel.sched_wakeup_granularity_ns to allow tasks to
avoid preemption for longer when the machine is heavily overloaded.

-- 
2.31.1

^ permalink raw reply	[flat|nested] 59+ messages in thread
* [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup
  2021-09-20 14:26 [PATCH 0/2] Scale wakeup granularity relative to nr_running Mel Gorman
@ 2021-09-20 14:26 ` Mel Gorman
  2021-09-21  7:21   ` Vincent Guittot
  2021-09-20 14:26 ` [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running Mel Gorman
  1 sibling, 1 reply; 59+ messages in thread

From: Mel Gorman @ 2021-09-20 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
      Barry Song, Srikar Dronamraju, LKML, Mel Gorman

The rq for curr is read during the function preamble, remove the
redundant lookup.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..038edfaaae9e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7190,7 +7190,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	if (cse_is_idle != pse_is_idle)
 		return;
 
-	update_curr(cfs_rq_of(se));
+	update_curr(cfs_rq);
 	if (wakeup_preempt_entity(se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread
* Re: [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup 2021-09-20 14:26 ` [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup Mel Gorman @ 2021-09-21 7:21 ` Vincent Guittot 2021-09-21 7:53 ` Mel Gorman 0 siblings, 1 reply; 59+ messages in thread From: Vincent Guittot @ 2021-09-21 7:21 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote: > > The rq for curr is read during the function preamble, remove the > redundant lookup. > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > --- > kernel/sched/fair.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index ff69f245b939..038edfaaae9e 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7190,7 +7190,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ > if (cse_is_idle != pse_is_idle) > return; > > - update_curr(cfs_rq_of(se)); > + update_curr(cfs_rq); se can have been modified by find_matching_se(&se, &pse) > if (wakeup_preempt_entity(se, pse) == 1) { > /* > * Bias pick_next to pick the sched entity that is > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup 2021-09-21 7:21 ` Vincent Guittot @ 2021-09-21 7:53 ` Mel Gorman 2021-09-21 8:12 ` Vincent Guittot 2021-09-21 8:21 ` Peter Zijlstra 0 siblings, 2 replies; 59+ messages in thread From: Mel Gorman @ 2021-09-21 7:53 UTC (permalink / raw) To: Vincent Guittot Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, Sep 21, 2021 at 09:21:16AM +0200, Vincent Guittot wrote: > On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > The rq for curr is read during the function preamble, remove the > > redundant lookup. > > > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > > --- > > kernel/sched/fair.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index ff69f245b939..038edfaaae9e 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -7190,7 +7190,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ > > if (cse_is_idle != pse_is_idle) > > return; > > > > - update_curr(cfs_rq_of(se)); > > + update_curr(cfs_rq); > > se can have been modified by find_matching_se(&se, &pse) > I still expected the cfs_rq to be the same, particularly given that the context is about preempting the current task on a runqueue. Is that wrong? -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup 2021-09-21 7:53 ` Mel Gorman @ 2021-09-21 8:12 ` Vincent Guittot 2021-09-21 8:21 ` Peter Zijlstra 1 sibling, 0 replies; 59+ messages in thread From: Vincent Guittot @ 2021-09-21 8:12 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, 21 Sept 2021 at 09:53, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Tue, Sep 21, 2021 at 09:21:16AM +0200, Vincent Guittot wrote: > > On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > The rq for curr is read during the function preamble, remove the > > > redundant lookup. > > > > > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > > > --- > > > kernel/sched/fair.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index ff69f245b939..038edfaaae9e 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -7190,7 +7190,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ > > > if (cse_is_idle != pse_is_idle) > > > return; > > > > > > - update_curr(cfs_rq_of(se)); > > > + update_curr(cfs_rq); > > > > se can have been modified by find_matching_se(&se, &pse) > > > > I still expected the cfs_rq to be the same, particularly given that the > context is about preempting the current task on a runqueue. Is that > wrong? As soon as the tasks don't belong to the same cgroup, se can be modified and cfs_rq will not be the same > > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup
  2021-09-21  7:53     ` Mel Gorman
  2021-09-21  8:12       ` Vincent Guittot
@ 2021-09-21  8:21       ` Peter Zijlstra
  2021-09-21 10:03         ` Mel Gorman
  1 sibling, 1 reply; 59+ messages in thread

From: Peter Zijlstra @ 2021-09-21 8:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Valentin Schneider, Aubrey Li,
      Barry Song, Srikar Dronamraju, LKML

On Tue, Sep 21, 2021 at 08:53:09AM +0100, Mel Gorman wrote:
> On Tue, Sep 21, 2021 at 09:21:16AM +0200, Vincent Guittot wrote:
> > On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > The rq for curr is read during the function preamble, remove the
> > > redundant lookup.
> > >
> > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index ff69f245b939..038edfaaae9e 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7190,7 +7190,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> > >	if (cse_is_idle != pse_is_idle)
> > >		return;
> > >
> > > -	update_curr(cfs_rq_of(se));
> > > +	update_curr(cfs_rq);
> >
> > se can have been modified by find_matching_se(&se, &pse)
> >
> 
> I still expected the cfs_rq to be the same, particularly given that the
> context is about preempting the current task on a runqueue. Is that
> wrong?

Yes. There's a cfs_rq for every se. What we do in find_matching_se() is
walk up the hierarchy until both are in the same cfs_rq, otherwise we
cannot compare them.

Fundamentally this means the effective cfs_rq also changes.

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup 2021-09-21 8:21 ` Peter Zijlstra @ 2021-09-21 10:03 ` Mel Gorman 0 siblings, 0 replies; 59+ messages in thread From: Mel Gorman @ 2021-09-21 10:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Vincent Guittot, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, Sep 21, 2021 at 10:21:19AM +0200, Peter Zijlstra wrote: > On Tue, Sep 21, 2021 at 08:53:09AM +0100, Mel Gorman wrote: > > On Tue, Sep 21, 2021 at 09:21:16AM +0200, Vincent Guittot wrote: > > > On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > > > The rq for curr is read during the function preamble, remove the > > > > redundant lookup. > > > > > > > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > > > > --- > > > > kernel/sched/fair.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > > index ff69f245b939..038edfaaae9e 100644 > > > > --- a/kernel/sched/fair.c > > > > +++ b/kernel/sched/fair.c > > > > @@ -7190,7 +7190,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ > > > > if (cse_is_idle != pse_is_idle) > > > > return; > > > > > > > > - update_curr(cfs_rq_of(se)); > > > > + update_curr(cfs_rq); > > > > > > se can have been modified by find_matching_se(&se, &pse) > > > > > > > I still expected the cfs_rq to be the same, particularly given that the > > context is about preempting the current task on a runqueue. Is that > > wrong? > > Yes. There's a cfs_rq for every se. What we do in find_matching_se() is > walk up the hiarachy until both are in the same cfs_rq, otherwse we > cannot compare them. > > Fundamentally this means the effective cfs_rq also changes. Ok, thanks. I'll read into this more but the patch is dead. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-20 14:26 [PATCH 0/2] Scale wakeup granularity relative to nr_running Mel Gorman
  2021-09-20 14:26 ` [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup Mel Gorman
@ 2021-09-20 14:26 ` Mel Gorman
  2021-09-21  3:52   ` Mike Galbraith
  2021-09-21  8:03   ` Vincent Guittot
  1 sibling, 2 replies; 59+ messages in thread

From: Mel Gorman @ 2021-09-20 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
      Barry Song, Srikar Dronamraju, LKML, Mel Gorman

Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
the kernel.sched_wakeup_granularity_ns sysctl under debugfs. One of the
reasons this sysctl may be used is "optimising for throughput",
particularly when overloaded. The TuneD tool sometimes alters this for
two profiles, e.g. "mssql" and "throughput-performance". At least
version 2.9 does, but this changed in master, which will poke at debugfs
instead.

During task migration or wakeup, a decision is made on whether to
preempt the current task or not. To limit over-scheduling,
sysctl_sched_wakeup_granularity delays the preemption to allow at least
1ms of runtime before preempting. However, when a domain is heavily
overloaded (e.g. hackbench), the degree of over-scheduling is still
severe. This is problematic as a lot of time is wasted rescheduling
tasks that could instead be used by userspace tasks.

This patch scales the wakeup granularity based on the number of running
tasks on the CPU up to a max of 8ms by default. The intent is to allow
tasks to run for longer while overloaded so that some tasks may complete
faster and reduce the degree a domain is overloaded. Note that the TuneD
throughput-performance profile allows up to 15ms, but there is no
explanation why such a long period was necessary, so this patch is
conservative and uses the value that check_preempt_wakeup() also takes
into account. An internet search for instances where this parameter is
tuned to high values offers either no explanation or a broken one.

This improved hackbench on a range of machines when communicating via
pipes (sockets show little to no difference). For a 2-socket CascadeLake
machine, the results were

hackbench-process-pipes
                          5.15.0-rc1             5.15.0-rc1
                             vanilla  sched-scalewakegran-v1r4
Amean     1        0.3253 (   0.00%)      0.3337 (  -2.56%)
Amean     4        0.8300 (   0.00%)      0.7983 (   3.82%)
Amean     7        1.1003 (   0.00%)      1.1600 *  -5.42%*
Amean     12       1.7263 (   0.00%)      1.6457 *   4.67%*
Amean     21       3.0063 (   0.00%)      2.7933 *   7.09%*
Amean     30       4.2323 (   0.00%)      3.8010 *  10.19%*
Amean     48       6.5657 (   0.00%)      5.6453 *  14.02%*
Amean     79      10.4867 (   0.00%)      8.5960 *  18.03%*
Amean     110     14.8880 (   0.00%)     11.4173 *  23.31%*
Amean     141     19.2083 (   0.00%)     14.3850 *  25.11%*
Amean     172     23.4847 (   0.00%)     17.1980 *  26.77%*
Amean     203     27.3763 (   0.00%)     20.1677 *  26.33%*
Amean     234     31.3707 (   0.00%)     23.4053 *  25.39%*
Amean     265     35.4663 (   0.00%)     26.3513 *  25.70%*
Amean     296     39.2380 (   0.00%)     29.3670 *  25.16%*

For Zen 3:

hackbench-process-pipes
                          5.15.0-rc1             5.15.0-rc1
                             vanilla  sched-scalewakegran-v1r4
Amean     1        0.3780 (   0.00%)      0.4080 (  -7.94%)
Amean     4        0.5393 (   0.00%)      0.5217 (   3.28%)
Amean     7        0.5480 (   0.00%)      0.5577 (  -1.76%)
Amean     12       0.5803 (   0.00%)      0.5667 (   2.35%)
Amean     21       0.7073 (   0.00%)      0.6543 *   7.49%*
Amean     30       0.8663 (   0.00%)      0.8290 (   4.31%)
Amean     48       1.2720 (   0.00%)      1.1337 *  10.88%*
Amean     79       1.9403 (   0.00%)      1.7247 *  11.11%*
Amean     110      2.6827 (   0.00%)      2.3450 *  12.59%*
Amean     141      3.6863 (   0.00%)      3.0253 *  17.93%*
Amean     172      4.5853 (   0.00%)      3.4987 *  23.70%*
Amean     203      5.4893 (   0.00%)      3.9630 *  27.81%*
Amean     234      6.6017 (   0.00%)      4.4230 *  33.00%*
Amean     265      7.3850 (   0.00%)      4.8317 *  34.57%*
Amean     296      8.5823 (   0.00%)      5.3327 *  37.86%*

For other workloads, the
benefits were marginal as the extreme overloaded case is not hit to the
same extent.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 038edfaaae9e..8e12aeebf4ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4511,7 +4511,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
+wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
+		      struct sched_entity *se);
 
 /*
  * Pick the next process, keeping these things in mind, in this order:
@@ -4550,16 +4551,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 			second = curr;
 	}
 
-	if (second && wakeup_preempt_entity(second, left) < 1)
+	if (second && wakeup_preempt_entity(NULL, second, left) < 1)
 		se = second;
 	}
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
 		/*
 		 * Someone really wants this to run. If it's not unfair, run it.
 		 */
 		se = cfs_rq->next;
-	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
+	} else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
 		/*
 		 * Prefer last buddy, try to return the CPU to a preempted task.
 		 */
@@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
-static unsigned long wakeup_gran(struct sched_entity *se)
+static unsigned long
+wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	unsigned long gran = sysctl_sched_wakeup_granularity;
 
+	/*
+	 * If rq is specified, scale the granularity relative to the number
+	 * of running tasks but no more than 8ms with default
+	 * sysctl_sched_wakeup_granularity settings. The wakeup gran
+	 * reduces over-scheduling but if tasks are stacked then the
+	 * domain is likely overloaded and over-scheduling may
+	 * prolong the overloaded state.
+	 */
+	if (cfs_rq && cfs_rq->nr_running > 1)
+		gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency);
+
 	/*
 	 * Since its curr running now, convert the gran from real-time
 	 * to virtual-time in his units.
@@ -7079,14 +7092,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
  *
  */
 static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
+wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
+		      struct sched_entity *se)
 {
 	s64 gran, vdiff = curr->vruntime - se->vruntime;
 
 	if (vdiff <= 0)
 		return -1;
 
-	gran = wakeup_gran(se);
+	gran = wakeup_gran(cfs_rq, se);
 	if (vdiff > gran)
 		return 1;
 
@@ -7191,7 +7205,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		return;
 
 	update_curr(cfs_rq);
-	if (wakeup_preempt_entity(se, pse) == 1) {
+	if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
 		 * triggering this preemption.
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-20 14:26 ` [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running Mel Gorman @ 2021-09-21 3:52 ` Mike Galbraith 2021-09-21 5:50 ` Mike Galbraith ` (2 more replies) 2021-09-21 8:03 ` Vincent Guittot 1 sibling, 3 replies; 59+ messages in thread From: Mike Galbraith @ 2021-09-21 3:52 UTC (permalink / raw) To: Mel Gorman, Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Mon, 2021-09-20 at 15:26 +0100, Mel Gorman wrote: > > This patch scales the wakeup granularity based on the number of running > tasks on the CPU up to a max of 8ms by default. The intent is to > allow tasks to run for longer while overloaded so that some tasks may > complete faster and reduce the degree a domain is overloaded. Note that > the TuneD throughput-performance profile allows up to 15ms but there > is no explanation why such a long period was necessary so this patch is > conservative and uses the value that check_preempt_wakeup() also takes > into account. An internet search for instances where this parameter are > tuned to high values offer either no explanation or a broken one. > > This improved hackbench on a range of machines when communicating via > pipes (sockets show little to no difference). For a 2-socket CascadeLake > machine, the results were Twiddling wakeup preemption based upon the performance of a fugly fork bomb seems like a pretty bad idea to me. Preemption does rapidly run into diminishing return as load climbs for a lot of loads, but as you know, it's a rather sticky wicket because even when over-committed, preventing light control threads from slicing through (what can be a load's own work crew of) hogs can seriously injure performance. 
<snip> > @@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > } > #endif /* CONFIG_SMP */ > > -static unsigned long wakeup_gran(struct sched_entity *se) > +static unsigned long > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se) > { > unsigned long gran = sysctl_sched_wakeup_granularity; > > + /* > + * If rq is specified, scale the granularity relative to the number > + * of running tasks but no more than 8ms with default > + * sysctl_sched_wakeup_granularity settings. The wakeup gran > + * reduces over-scheduling but if tasks are stacked then the > + * domain is likely overloaded and over-scheduling may > + * prolong the overloaded state. > + */ > + if (cfs_rq && cfs_rq->nr_running > 1) > + gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency); > + Maybe things have changed while I wasn't watching closely, but... The scaled up tweakables on my little quad desktop box: sched_nr_latency = 8 sched_wakeup_granularity = 4ms sched_latency = 24ms Due to the FAIR_SLEEPERS feature, a task can only receive a max of sched_latency/2 sleep credit, ie the delta between waking sleeper and current is clipped to a max of 12 virtual ms, so the instant our preempt threshold reaches 12.000ms, by human booboo or now 3 runnable tasks with this change, wakeup preemption is completely disabled, or? -Mike ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-21 3:52 ` Mike Galbraith @ 2021-09-21 5:50 ` Mike Galbraith 2021-09-21 7:04 ` Mike Galbraith 2021-09-21 10:36 ` Mel Gorman 2 siblings, 0 replies; 59+ messages in thread From: Mike Galbraith @ 2021-09-21 5:50 UTC (permalink / raw) To: Mel Gorman, Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, 2021-09-21 at 05:52 +0200, Mike Galbraith wrote: > On Mon, 2021-09-20 at 15:26 +0100, Mel Gorman wrote: > > > The scaled up tweakables on my little quad desktop box: > sched_nr_latency = 8 > sched_wakeup_granularity = 4ms > sched_latency = 24ms > > Due to the FAIR_SLEEPERS feature, a task can only receive a max of > sched_latency/2 sleep credit, ie the delta between waking sleeper and > current is clipped to a max of 12 virtual ms, so the instant our > preempt threshold reaches 12.000ms, by human booboo or now 3 runnable > tasks with this change, wakeup preemption is completely disabled, or? I just dug up a now ancient artifact testcase allegedly distilled from a real application's control thread sometime back in the dark ages, and granularity >= latency/2 still does turn off wakeup preemption. -Mike ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-21 3:52 ` Mike Galbraith 2021-09-21 5:50 ` Mike Galbraith @ 2021-09-21 7:04 ` Mike Galbraith 2021-09-21 10:36 ` Mel Gorman 2 siblings, 0 replies; 59+ messages in thread From: Mike Galbraith @ 2021-09-21 7:04 UTC (permalink / raw) To: Mel Gorman, Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, 2021-09-21 at 05:52 +0200, Mike Galbraith wrote: > > ...preempt threshold reaches 12.000ms, by human booboo or now 3 > runnable tasks Bah, 6 so not so close to hohum burst level, but still something that maybe warrants a feature. -Mike ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-21 3:52 ` Mike Galbraith 2021-09-21 5:50 ` Mike Galbraith 2021-09-21 7:04 ` Mike Galbraith @ 2021-09-21 10:36 ` Mel Gorman 2021-09-21 12:32 ` Mike Galbraith 2021-09-22 5:22 ` Mike Galbraith 2 siblings, 2 replies; 59+ messages in thread From: Mel Gorman @ 2021-09-21 10:36 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote: > On Mon, 2021-09-20 at 15:26 +0100, Mel Gorman wrote: > > > > This patch scales the wakeup granularity based on the number of running > > tasks on the CPU up to a max of 8ms by default. The intent is to > > allow tasks to run for longer while overloaded so that some tasks may > > complete faster and reduce the degree a domain is overloaded. Note that > > the TuneD throughput-performance profile allows up to 15ms but there > > is no explanation why such a long period was necessary so this patch is > > conservative and uses the value that check_preempt_wakeup() also takes > > into account. An internet search for instances where this parameter are > > tuned to high values offer either no explanation or a broken one. > > > > This improved hackbench on a range of machines when communicating via > > pipes (sockets show little to no difference). For a 2-socket CascadeLake > > machine, the results were > > Twiddling wakeup preemption based upon the performance of a fugly fork > bomb seems like a pretty bad idea to me. > I'm no fan of hackbench but unfortunately it's heavily used for evaluating scheduler changes at least and the notion that over-scheduling can reduce throughput of communicating tasks in the overloaded case is still problematic. 
> Preemption does rapidly run into diminishing return as load climbs for > a lot of loads, but as you know, it's a rather sticky wicket because > even when over-committed, preventing light control threads from slicing > through (what can be a load's own work crew of) hogs can seriously > injure performance. > Turning this into a classic Rob Peter To Pay Paul problem. We don't know if there is a light control thread that needs to run or not that affects overall performance. It all depends on whether that control thread needs to make progress for the overall workload or whether there are a mix of workloads resulting in overloading. > <snip> > > > @@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > > } > > #endif /* CONFIG_SMP */ > > > > -static unsigned long wakeup_gran(struct sched_entity *se) > > +static unsigned long > > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se) > > { > > unsigned long gran = sysctl_sched_wakeup_granularity; > > > > + /* > > + * If rq is specified, scale the granularity relative to the number > > + * of running tasks but no more than 8ms with default > > + * sysctl_sched_wakeup_granularity settings. The wakeup gran > > + * reduces over-scheduling but if tasks are stacked then the > > + * domain is likely overloaded and over-scheduling may > > + * prolong the overloaded state. > > + */ > > + if (cfs_rq && cfs_rq->nr_running > 1) > > + gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency); > > + > > Maybe things have changed while I wasn't watching closely, but... > Yeah, a lot has changed and unfortunately, I still lack a lot of historical context on why things are the way they are. > The scaled up tweakables on my little quad desktop box: > sched_nr_latency = 8 > sched_wakeup_granularity = 4ms > sched_latency = 24ms > > Due to the FAIR_SLEEPERS feature, a task can only receive a max of > sched_latency/2 sleep credit, A task that just became runnable rather than all tasks, right? 
> ie the delta between waking sleeper and
> current is clipped to a max of 12 virtual ms, so the instant our
> preempt threshold reaches 12.000ms, by human booboo or now 3 runnable
> tasks with this change, wakeup preemption is completely disabled, or?
> 

I'm having trouble reconciling this in some sensible fashion.
sched_nr_latency is the threshold where the sched period might be
stretched (sysctl_sched_latency/sysctl_sched_min_granularity), which is
why I picked it - (nr_running > sched_nr_latency) is a tipping point
where the scheduler changes behaviour.

FAIR_SLEEPERS primarily affects tasks that just became runnable and the
new task is trying to fit in without causing too much disruption based
on sysctl_sched_latency.

As sysctl_sched_wakeup_granularity is now debugfs and should not be
tuned, we want to avoid that. On the other hand, we also know from the
results that overloaded tasks may fail to make sufficient progress, so
do we either try to dynamically adjust or do we just ignore the problem?
If we are to dynamically adjust then what should it be? One alternative
could be to base the tipping point on the ratio of sysctl_sched_latency
to gran, take FAIR_SLEEPERS into account, sanity check the result and
stick it behind a feature flag in case it needs to be disabled to debug
a preempt starvation problem.

This on top?

---8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e12aeebf4ce..4c94af6ecb1d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7052,14 +7052,21 @@ wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	/*
 	 * If rq is specified, scale the granularity relative to the number
-	 * of running tasks but no more than 8ms with default
-	 * sysctl_sched_wakeup_granularity settings. The wakeup gran
-	 * reduces over-scheduling but if tasks are stacked then the
-	 * domain is likely overloaded and over-scheduling may
-	 * prolong the overloaded state.
-	 */
-	if (cfs_rq && cfs_rq->nr_running > 1)
-		gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency);
+	 * of running tasks but no more than sysctl_sched_latency.
+	 * The wakeup gran reduces over-scheduling but if tasks are
+	 * stacked then the domain is likely overloaded and over-scheduling
+	 * may prolong the overloaded state.
+	 */
+	if (sched_feat(SCALE_WAKEUP_GRAN) &&
+	    cfs_rq && cfs_rq->nr_running > 1) {
+		int max_scale = sysctl_sched_latency / gran;
+
+		if (sched_feat(GENTLE_FAIR_SLEEPERS))
+			max_scale >>= 1;
+
+		if (max_scale)
+			gran *= min(cfs_rq->nr_running >> 1, max_scale);
+	}
 
 	/*
 	 * Since its curr running now, convert the gran from real-time
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7f8dace0964c..d041d7023029 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -95,3 +95,9 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+/*
+ * Scale sched_wakeup_granularity dynamically based on the number of running
+ * tasks up to a cap of sysctl_sched_latency.
+ */
+SCHED_FEAT(SCALE_WAKEUP_GRAN, true)

^ permalink raw reply related	[flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-21 10:36 ` Mel Gorman @ 2021-09-21 12:32 ` Mike Galbraith 2021-09-21 14:03 ` Mel Gorman 2021-10-05 9:24 ` Peter Zijlstra 2021-09-22 5:22 ` Mike Galbraith 1 sibling, 2 replies; 59+ messages in thread From: Mike Galbraith @ 2021-09-21 12:32 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote: > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote: > > > > > Preemption does rapidly run into diminishing return as load climbs for > > a lot of loads, but as you know, it's a rather sticky wicket because > > even when over-committed, preventing light control threads from slicing > > through (what can be a load's own work crew of) hogs can seriously > > injure performance. > > > > Turning this into a classic Rob Peter To Pay Paul problem. Damn near everything you do in sched-land is rob Peter to pay Paul. > We don't know > if there is a light control thread that needs to run or not that affects > overall performance. It all depends on whether that control thread needs > to make progress for the overall workload or whether there are a mix of > workloads resulting in overloading. Sure.. and operative words "we don't know" cuts both ways. CFS came about because the O(1) scheduler was unfair to the point it had starvation problems. People pretty much across the board agreed that a fair scheduler was a much better way to go, and CFS was born. It didn't originally have the sleep credit business, but had to grow it to become _short term_ fair. Ingo cut the sleep credit in half because of overscheduling, and that has worked out pretty well all told.. but now you're pushing it more in the unfair direction, all the way to extremely unfair for anything and everything very light. 
Fairness isn't the holy grail mind you, and at some point, giving up on
short term fairness certainly isn't crazy, as proven by your hackbench
numbers and other numbers we've seen over the years, but taking bites
out of the 'CF' in the CFS that was born to be a corner-case killer is..
worrisome. The other shoe will drop.. it always does :)

> > <snip>
> >
> > > @@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > >  }
> > >  #endif /* CONFIG_SMP */
> > >
> > > -static unsigned long wakeup_gran(struct sched_entity *se)
> > > +static unsigned long
> > > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > >  {
> > >  	unsigned long gran = sysctl_sched_wakeup_granularity;
> > >
> > > +	/*
> > > +	 * If rq is specified, scale the granularity relative to the number
> > > +	 * of running tasks but no more than 8ms with default
> > > +	 * sysctl_sched_wakeup_granularity settings. The wakeup gran
> > > +	 * reduces over-scheduling but if tasks are stacked then the
> > > +	 * domain is likely overloaded and over-scheduling may
> > > +	 * prolong the overloaded state.
> > > +	 */
> > > +	if (cfs_rq && cfs_rq->nr_running > 1)
> > > +		gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency);
> > > +
> >
> > Maybe things have changed while I wasn't watching closely, but...
> >
>
> Yeah, a lot has changed and unfortunately, I still lack a lot of
> historical context on why things are the way they are.
>
> > The scaled up tweakables on my little quad desktop box:
> > sched_nr_latency = 8
> > sched_wakeup_granularity = 4ms
> > sched_latency = 24ms
> >
> > Due to the FAIR_SLEEPERS feature, a task can only receive a max of
> > sched_latency/2 sleep credit,
>
> A task that just became runnable rather than all tasks, right?

I'm talking about tasks being enqueued during wakeup.
> > ie the delta between waking sleeper and > > current is clipped to a max of 12 virtual ms, so the instant our > > preempt threshold reaches 12.000ms, by human booboo or now 3 runnable > > tasks with this change, wakeup preemption is completely disabled, or? > > > > I'm having trouble reconciling this in some sensible fashion. > sched_nr_latency is the threshold where the sched period might be stretched > (sysctl_sched_latency/sysctl_sched_min_granularity) which is why I picked > it - (nr_running > sched_nr_latency) is a tipping point where the > scheduler changes behaviour. Yeah, an existing branch is as good a place as any. > FAIR_SLEEPERS primarily affects tasks that just became runnable and the > new task is trying to fit in without causing too much disruption based > on sysctl_sched_latency. No, fair sleepers is all about sleeper wakeup preemption, I think you're thinking of fork initial placement. > As sysctl_sched_wakeup_granularity is now debugfs and should not be > tuned, we want to avoid that. On the other hand, we also know from the > results that overloaded tasks may fail to make sufficient progress so, > do we either try to dynamically adjust or do we just ignore the problem? Good question. Anything you do is guaranteed to be wrong for something, which is why CFS was born.. fair doesn't have corner cases. > If we are to dynamically adjust then what should it be? One alternative > could be to base the tipping point on the ratio of sysctl_sched_latency to > gran, take FAIR_SLEEPERS into account, sanity check the result and stick it > behind a feature flag in case it needs to be disabled to debug a preempt > starvation problem. > > This on top? Dunno. Seeing as how it wasn't as aggressive as I thought at first glance, maybe that original rolldown will be more or less fine. I don't like it much, but numbers talk, BS walks. 
I trust boxen on such matters way more than any human, myself included ;-) -Mike ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-21 12:32 ` Mike Galbraith @ 2021-09-21 14:03 ` Mel Gorman 2021-10-05 9:24 ` Peter Zijlstra 1 sibling, 0 replies; 59+ messages in thread From: Mel Gorman @ 2021-09-21 14:03 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, Sep 21, 2021 at 02:32:54PM +0200, Mike Galbraith wrote: > On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote: > > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote: > > > > > > > > Preemption does rapidly run into diminishing return as load climbs for > > > a lot of loads, but as you know, it's a rather sticky wicket because > > > even when over-committed, preventing light control threads from slicing > > > through (what can be a load's own work crew of) hogs can seriously > > > injure performance. > > > > > > > Turning this into a classic Rob Peter To Pay Paul problem. > > Damn near everything you do in sched-land is rob Peter to pay Paul. > True. > > We don't know > > if there is a light control thread that needs to run or not that affects > > overall performance. It all depends on whether that control thread needs > > to make progress for the overall workload or whether there are a mix of > > workloads resulting in overloading. > > Sure.. and operative words "we don't know" cuts both ways. > Yes and I don't believe we know how often a latency critical thread is running in a heavily overloaded domain (I hope not very often). > CFS came about because the O1 scheduler was unfair to the point it had > starvation problems. People pretty much across the board agreed that a > fair scheduler was a much way better way to go, and CFS was born. It > didn't originally have the sleep credit business, but had to grow it to > become _short term_ fair. 
> Ingo cut the sleep credit in half because of
> overscheduling, and that has worked out pretty well all told.. but now
> you're pushing it more in the unfair direction, all the way to
> extremely unfair for anything and everything very light.
>

With the slight caveat that it's pushed in the other direction at the
point where the domain is becoming overloaded. While fairness is still
important, tasks are being starved to some extent.

> Fairness isn't the holy grail mind you, and at some point, giving up on
> short term fairness certainly isn't crazy, as proven by your hackbench
> numbers and other numbers we've seen over the years, but taking bites
> out of the 'CF' in the CFS that was born to be a corner-case killer
> is.. worrisome. The other shoe will drop.. it always does :)
>

Lovely. I don't suppose you know where that shoe is?

> > > <snip>
> > >
> > > > @@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > > >  }
> > > >  #endif /* CONFIG_SMP */
> > > >
> > > > -static unsigned long wakeup_gran(struct sched_entity *se)
> > > > +static unsigned long
> > > > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > > >  {
> > > >  	unsigned long gran = sysctl_sched_wakeup_granularity;
> > > >
> > > > +	/*
> > > > +	 * If rq is specified, scale the granularity relative to the number
> > > > +	 * of running tasks but no more than 8ms with default
> > > > +	 * sysctl_sched_wakeup_granularity settings. The wakeup gran
> > > > +	 * reduces over-scheduling but if tasks are stacked then the
> > > > +	 * domain is likely overloaded and over-scheduling may
> > > > +	 * prolong the overloaded state.
> > > > +	 */
> > > > +	if (cfs_rq && cfs_rq->nr_running > 1)
> > > > +		gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency);
> > > > +
> > >
> > > Maybe things have changed while I wasn't watching closely, but...
> > > > > > > Yeah, a lot has changed and unfortunately, I still lack a lot of > > historical context on why things are the way they are. > > > > > The scaled up tweakables on my little quad desktop box: > > > sched_nr_latency = 8 > > > sched_wakeup_granularity = 4ms > > > sched_latency = 24ms > > > > > > Due to the FAIR_SLEEPERS feature, a task can only receive a max of > > > sched_latency/2 sleep credit, > > > > A task that just became runnable rather than all tasks, right? > > I'm talking about tasks being enqueued during wakeup. > Understood. > > > ie the delta between waking sleeper and > > > current is clipped to a max of 12 virtual ms, so the instant our > > > preempt threshold reaches 12.000ms, by human booboo or now 3 runnable > > > tasks with this change, wakeup preemption is completely disabled, or? > > > > > > > I'm having trouble reconciling this in some sensible fashion. > > sched_nr_latency is the threshold where the sched period might be stretched > > (sysctl_sched_latency/sysctl_sched_min_granularity) which is why I picked > > it - (nr_running > sched_nr_latency) is a tipping point where the > > scheduler changes behaviour. > > Yeah, an existing branch is as good a place as any. > Although possibly overkill. Even if the new patch does not help hackbench to the same extent, it may still be a better tradeoff for workloads that have critical threads running in an overloaded domain. > > FAIR_SLEEPERS primarily affects tasks that just became runnable and the > > new task is trying to fit in without causing too much disruption based > > on sysctl_sched_latency. > > No, fair sleepers is all about sleeper wakeup preemption, I think > you're thinking of fork initial placement. > Understood. > > As sysctl_sched_wakeup_granularity is now debugfs and should not be > > tuned, we want to avoid that. 
> > On the other hand, we also know from the
> > results that overloaded tasks may fail to make sufficient progress so,
> > do we either try to dynamically adjust or do we just ignore the problem?
>
> Good question. Anything you do is guaranteed to be wrong for
> something, which is why CFS was born.. fair doesn't have corner cases.
>

And indeed, this is basically trading some fairness to allow some tasks
to complete faster so that load goes down. It's similar to the
race-to-idle problem in frequency management, which asks "Is it better
to use more power to finish quickly or conserve power and take longer?".
This patch is asking the question "When overloaded, is it better to let
a task run longer so it finishes quicker or pay the cost of context
switching?"

> > If we are to dynamically adjust then what should it be? One alternative
> > could be to base the tipping point on the ratio of sysctl_sched_latency to
> > gran, take FAIR_SLEEPERS into account, sanity check the result and stick it
> > behind a feature flag in case it needs to be disabled to debug a preempt
> > starvation problem.
> >
> > This on top?
>
> Dunno. Seeing as how it wasn't as aggressive as I thought at first
> glance, maybe that original rolldown will be more or less fine. I
> don't like it much, but numbers talk, BS walks. I trust boxen on such
> matters way more than any human, myself included ;-)
>

I've queued v2 on the test grid, let's see what falls out.

Thanks Mike for the review and the historical context.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-21 12:32             ` Mike Galbraith
  2021-09-21 14:03               ` Mel Gorman
@ 2021-10-05  9:24               ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2021-10-05 9:24 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Mel Gorman, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Tue, Sep 21, 2021 at 02:32:54PM +0200, Mike Galbraith wrote:
> On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:

> > FAIR_SLEEPERS primarily affects tasks that just became runnable and the
> > new task is trying to fit in without causing too much disruption based
> > on sysctl_sched_latency.
>
> No, fair sleepers is all about sleeper wakeup preemption, I think
> you're thinking of fork initial placement.

Butting in in the middle of the thread (and I know there's still lots
to read)...

So the FAIR_SLEEPERS thing is about giving tasks that have slept a
while some extra credit to run sooner. The classical example has always
been a task that runs 50% combined with a task that runs 100%, what's
fair? a 1:2 or 1:1 ratio? Strict fair (runnable) scheduling will get
you the 1:2, while intuitively having two tasks with 100% combined CPU
utilization 1:1 would be 'fair'. FAIR_SLEEPERS gets you towards that
1:1, *provided* the period of that 50% is near sched_latency/2.

Another important factor for wakeup preemption has always been desktop
usage; can you still get responsive terminals while building a kernel,
how does firefox scroll during a kernel build etc.. (fwiw, firefox
should start scrolling responsively and then bog down if you keep on
scrolling because it becomes a hog and has exhausted the initial boost)

Also, I think the ChromeOS people have interactivity measures these
days.
All our traditional benchmarks miss out here; they're mostly throughput
oriented, and it is really easy to totally wreck interactivity while
getting great throughput :/

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-21 10:36           ` Mel Gorman
  2021-09-21 12:32             ` Mike Galbraith
@ 2021-09-22  5:22             ` Mike Galbraith
  2021-09-22 13:20               ` Mel Gorman
  2021-10-03  3:07               ` wakeup_affine_weight() is b0rked - was " Mike Galbraith
  1 sibling, 2 replies; 59+ messages in thread
From: Mike Galbraith @ 2021-09-22 5:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:
> On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
> >
> > Preemption does rapidly run into diminishing return as load climbs for
> > a lot of loads, but as you know, it's a rather sticky wicket because
> > even when over-committed, preventing light control threads from slicing
> > through (what can be a load's own work crew of) hogs can seriously
> > injure performance.
> >
>
> Turning this into a classic Rob Peter To Pay Paul problem. We don't know
> if there is a light control thread that needs to run or not that affects
> overall performance. It all depends on whether that control thread needs
> to make progress for the overall workload or whether there are a mix of
> workloads resulting in overloading.

WRT overload, and our good buddies Peter and Paul :)  I added...

	if (gran >= sysctl_sched_latency >> 1)
		trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running);

...to watch, and met the below when I.. logged in.

homer:..debug/tracing # tail -20 trace
               X-2229    [002] d..5.    60.690322: wakeup_gran: runnable:9 preempt disabled
               X-2229    [002] d..5.    60.690325: wakeup_gran: runnable:10 preempt disabled
               X-2229    [002] d..5.    60.690330: wakeup_gran: runnable:11 preempt disabled
               X-2229    [002] d..5.    60.690363: wakeup_gran: runnable:13 preempt disabled
               X-2229    [002] d..5.    60.690377: wakeup_gran: runnable:14 preempt disabled
               X-2229    [002] d..5.    60.690390: wakeup_gran: runnable:15 preempt disabled
               X-2229    [002] d..5.    60.690404: wakeup_gran: runnable:16 preempt disabled
               X-2229    [002] d..5.    60.690425: wakeup_gran: runnable:9 preempt disabled
       ksmserver-2694    [003] d..3.    60.690432: wakeup_gran: runnable:6 preempt disabled
       ksmserver-2694    [003] d..3.    60.690436: wakeup_gran: runnable:7 preempt disabled
               X-2229    [002] d..5.    60.690451: wakeup_gran: runnable:6 preempt disabled
               X-2229    [002] d..5.    60.690465: wakeup_gran: runnable:7 preempt disabled
            kmix-2736    [000] d..3.    60.690491: wakeup_gran: runnable:6 preempt disabled
               X-2229    [004] d..5.    92.889635: wakeup_gran: runnable:6 preempt disabled
               X-2229    [004] d..5.    92.889675: wakeup_gran: runnable:6 preempt disabled
               X-2229    [004] d..5.    92.889863: wakeup_gran: runnable:6 preempt disabled
               X-2229    [004] d..5.    92.889944: wakeup_gran: runnable:6 preempt disabled
               X-2229    [004] d..5.    92.889957: wakeup_gran: runnable:7 preempt disabled
               X-2229    [004] d..5.    92.889968: wakeup_gran: runnable:8 preempt disabled
  QXcbEventQueue-2740    [000] d..4.    92.890025: wakeup_gran: runnable:6 preempt disabled
homer:..debug/tracing

Watching 'while sleep 1; do clear;tail trace; done' with nothing but a
kbuild running is like watching top. There's enough stacking during
routine use of my desktop box that it runs into the tick granularity
wall pretty much continuously, so 'overload' may want redefining.

	-Mike

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22  5:22             ` Mike Galbraith
@ 2021-09-22 13:20               ` Mel Gorman
  2021-09-22 14:04                 ` Mike Galbraith
                                   ` (2 more replies)
  1 sibling, 3 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-22 13:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, Sep 22, 2021 at 07:22:20AM +0200, Mike Galbraith wrote:
> On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:
> > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
> > >
> > > Preemption does rapidly run into diminishing return as load climbs for
> > > a lot of loads, but as you know, it's a rather sticky wicket because
> > > even when over-committed, preventing light control threads from slicing
> > > through (what can be a load's own work crew of) hogs can seriously
> > > injure performance.
> > >
> >
> > Turning this into a classic Rob Peter To Pay Paul problem. We don't know
> > if there is a light control thread that needs to run or not that affects
> > overall performance. It all depends on whether that control thread needs
> > to make progress for the overall workload or whether there are a mix of
> > workloads resulting in overloading.
>
> WRT overload, and our good buddies Peter and Paul :)  I added...
> 	if (gran >= sysctl_sched_latency >> 1)
> 		trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running);
> ...to watch, and met the below when I.. logged in.
>
> homer:..debug/tracing # tail -20 trace
>                X-2229    [002] d..5.    60.690322: wakeup_gran: runnable:9 preempt disabled
>                X-2229    [002] d..5.    60.690325: wakeup_gran: runnable:10 preempt disabled
>                X-2229    [002] d..5.    60.690330: wakeup_gran: runnable:11 preempt disabled
>                X-2229    [002] d..5.    60.690363: wakeup_gran: runnable:13 preempt disabled
>                X-2229    [002] d..5.    60.690377: wakeup_gran: runnable:14 preempt disabled
>                X-2229    [002] d..5.    60.690390: wakeup_gran: runnable:15 preempt disabled
>                X-2229    [002] d..5.    60.690404: wakeup_gran: runnable:16 preempt disabled
>                X-2229    [002] d..5.    60.690425: wakeup_gran: runnable:9 preempt disabled
>        ksmserver-2694    [003] d..3.    60.690432: wakeup_gran: runnable:6 preempt disabled
>        ksmserver-2694    [003] d..3.    60.690436: wakeup_gran: runnable:7 preempt disabled
>                X-2229    [002] d..5.    60.690451: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [002] d..5.    60.690465: wakeup_gran: runnable:7 preempt disabled
>             kmix-2736    [000] d..3.    60.690491: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889635: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889675: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889863: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889944: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889957: wakeup_gran: runnable:7 preempt disabled
>                X-2229    [004] d..5.    92.889968: wakeup_gran: runnable:8 preempt disabled
>   QXcbEventQueue-2740    [000] d..4.    92.890025: wakeup_gran: runnable:6 preempt disabled
> homer:..debug/tracing
>
> Watching 'while sleep 1; do clear;tail trace; done' with nothing but a
> kbuild running is like watching top. There's enough stacking during
> routine use of my desktop box that it runs into the tick granularity
> wall pretty much continuously, so 'overload' may want redefining.
>

Ok, that's pretty convincing. You didn't mention if there were
interactivity glitches but it's possible. This is what I'm currently
testing but have no results for yet. It caps wakeup_gran at
sysctl_sched_latency.

---8<---
sched/fair: Scale wakeup granularity relative to nr_running

Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
the kernel.sched_wakeup_granularity_ns sysctl under debugfs.
One of the reasons this sysctl may be used is to "optimise for
throughput", particularly when overloaded. The tool TuneD sometimes
alters this for two profiles e.g. "mssql" and "throughput-performance".
At least version 2.9 does, but that changed in master, which will also
poke at debugfs instead.

Internal parameters like sysctl_sched_wakeup_granularity are scaled
based on the number of CPUs due to sysctl_sched_tunable_scaling. For
simplicity, the timing figures in this changelog are based on
SCHED_TUNABLESCALING_NONE.

During task migration or wakeup, a decision is made on whether
to preempt the current task or not. To limit over-scheduling,
sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
of runtime before preempting. However, when a domain is heavily overloaded
(e.g. hackbench), the degree of over-scheduling is still severe. This is
problematic as time is wasted rescheduling tasks that could instead be
used by userspace tasks.

However, care must be taken. Even if a system is overloaded, there may
be high priority threads that must still be able to run. Mike Galbraith
explained the constraints as follows;

	CFS came about because the O(1) scheduler was unfair to the
	point it had starvation problems. People pretty much across the
	board agreed that a fair scheduler was a much better way
	to go, and CFS was born. It didn't originally have the sleep
	credit business, but had to grow it to become _short term_ fair.

	Ingo cut the sleep credit in half because of overscheduling, and
	that has worked out pretty well all told.. but now you're pushing
	it more in the unfair direction, all the way to extremely unfair
	for anything and everything very light.

	Fairness isn't the holy grail mind you, and at some point, giving
	up on short term fairness certainly isn't crazy, as proven by your
	hackbench numbers and other numbers we've seen over the years,
	but taking bites out of the 'CF' in the CFS that was born to be
	a corner-case killer is..
	worrisome. The other shoe will drop.. it always does :)

This patch scales the wakeup granularity based on the number of running
tasks on the CPU relative to

	sched_nr_disable_gran = sysctl_sched_latency / sysctl_sched_wakeup_granularity

By default, this will allow the wakeup_gran to scale from
sysctl_sched_wakeup_granularity up to sysctl_sched_latency depending on
the number of running tasks on a cfs_rq. By default, the limit is 6ms.

Note that the TuneD throughput-performance profile allows up to 15ms
for sysctl_sched_latency (ignoring scaling) but there is no explanation
why such a long period was necessary or why sched_latency_ns is also
not adjusted. The intent may have been to disable wakeup preemption
or it might be an oversight. An internet search for instances where
sysctl_sched_wakeup_granularity parameter are tuned to high values offer
either no explanation or a broken one.

TODO: Results positive or negative

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c     | 64 ++++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/features.h |  6 +++++
 2 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..5ec3b12039d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -84,6 +84,14 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
 
+/*
+ * This value is kept at sysctl_sched_latency / sysctl_sched_wakeup_granularity
+ *
+ * This influences the decision on whether a waking task can preempt a running
+ * task.
+ */
+static unsigned int sched_nr_disable_gran = 6;
+
 int sched_thermal_decay_shift;
 static int __init setup_sched_thermal_decay_shift(char *str)
 {
@@ -627,6 +635,9 @@ int sched_update_scaling(void)
 	sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
 					sysctl_sched_min_granularity);
 
+	sched_nr_disable_gran = DIV_ROUND_UP(sysctl_sched_latency,
+					sysctl_sched_wakeup_granularity);
+
 #define WRT_SYSCTL(name) \
 	(normalized_sysctl_##name = sysctl_##name / (factor))
 	WRT_SYSCTL(sched_min_granularity);
@@ -4511,7 +4522,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
+wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
+		      struct sched_entity *se);
 
 /*
  * Pick the next process, keeping these things in mind, in this order:
@@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 				second = curr;
 		}
 
-		if (second && wakeup_preempt_entity(second, left) < 1)
+		if (second && wakeup_preempt_entity(NULL, second, left) < 1)
 			se = second;
 	}
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
 		/*
 		 * Someone really wants this to run. If it's not unfair, run it.
 		 */
 		se = cfs_rq->next;
-	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
+	} else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
 		/*
 		 * Prefer last buddy, try to return the CPU to a preempted task.
 		 */
@@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
-static unsigned long wakeup_gran(struct sched_entity *se)
+static unsigned long
+select_wakeup_gran(struct cfs_rq *cfs_rq)
+{
+	unsigned int nr_running, threshold;
+
+	if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN))
+		return sysctl_sched_wakeup_granularity;
+
+	/* !GENTLE_FAIR_SLEEPERS has one overload threshold. */
+	if (!sched_feat(GENTLE_FAIR_SLEEPERS)) {
+		if (cfs_rq->nr_running <= sched_nr_disable_gran)
+			return sysctl_sched_wakeup_granularity;
+
+		return sysctl_sched_latency;
+	}
+
+	/* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */
+	nr_running = cfs_rq->nr_running;
+	threshold = sched_nr_disable_gran >> 1;
+
+	/* No overload. */
+	if (nr_running <= threshold)
+		return sysctl_sched_wakeup_granularity;
+
+	/* Light overload. */
+	if (nr_running <= sched_nr_disable_gran)
+		return sysctl_sched_latency >> 1;
+
+	/* Heavy overload. */
+	return sysctl_sched_latency;
+}
+
+static unsigned long
+wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	unsigned long gran = sysctl_sched_wakeup_granularity;
+	unsigned long gran = select_wakeup_gran(cfs_rq);
 
 	/*
 	 * Since its curr running now, convert the gran from real-time
@@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
  *
  */
 static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
+wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
+		      struct sched_entity *se)
 {
 	s64 gran, vdiff = curr->vruntime - se->vruntime;
 
 	if (vdiff <= 0)
 		return -1;
 
-	gran = wakeup_gran(se);
+	gran = wakeup_gran(cfs_rq, se);
 	if (vdiff > gran)
 		return 1;
 
@@ -7191,7 +7237,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		return;
 
 	update_curr(cfs_rq_of(se));
-	if (wakeup_preempt_entity(se, pse) == 1) {
+	if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
 		 * triggering this preemption.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7f8dace0964c..d041d7023029 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -95,3 +95,9 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+/*
+ * Scale sched_wakeup_granularity dynamically based on the number of running
+ * tasks up to a cap of sysctl_sched_latency.
+ */
+SCHED_FEAT(SCALE_WAKEUP_GRAN, true)

^ permalink raw reply	related	[flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 13:20               ` Mel Gorman
@ 2021-09-22 14:04                 ` Mike Galbraith
  2021-09-22 14:15                 ` Vincent Guittot
  2021-10-05  9:32                 ` Peter Zijlstra
  2 siblings, 0 replies; 59+ messages in thread
From: Mike Galbraith @ 2021-09-22 14:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, 2021-09-22 at 14:20 +0100, Mel Gorman wrote:
> On Wed, Sep 22, 2021 at 07:22:20AM +0200, Mike Galbraith wrote:
> >
> > Watching 'while sleep 1; do clear;tail trace; done' with nothing but a
> > kbuild running is like watching top. There's enough stacking during
> > routine use of my desktop box that it runs into the tick granularity
> > wall pretty much continuously, so 'overload' may want redefining.
> >
>
> Ok, that's pretty convincing. You didn't mention if there were
> interactivity glitches but it's possible.

No, I didn't notice a thing. That's not surprising given the amount of
scheduling going on. The stack depth surprised me a bit though.

               X-2230    [005] d..5.  2162.123029: wakeup_gran: runnable:6 wakee:QXcbEventQueue:2695 CPU5
               X-2230    [005] d..5.  2162.123035: wakeup_gran: runnable:7 wakee:QXcbEventQueue:2656 CPU5
               X-2230    [005] d..5.  2162.123046: wakeup_gran: runnable:8 wakee:QXcbEventQueue:5876 CPU5
               X-2230    [005] d..5.  2162.123049: wakeup_gran: runnable:9 wakee:QXcbEventQueue:5355 CPU5
               X-2230    [005] d..5.  2162.123083: wakeup_gran: runnable:10 wakee:QXcbEventQueue:2723 CPU5
               X-2230    [005] d..5.  2162.123097: wakeup_gran: runnable:11 wakee:QXcbEventQueue:2630 CPU5
               X-2230    [005] d..5.  2162.123208: wakeup_gran: runnable:12 wakee:QXcbEventQueue:2760 CPU5

That CPU# on the end is task_cpu(). Lord knows why X wakes all those,
but with no SD_BALANCE_WAKE, a busy box and all those having last lived
on waker's CPU, the resulting construct.. probably doesn't very closely
resemble what the programmer had in mind when threading.
-Mike ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 13:20               ` Mel Gorman
  2021-09-22 14:04                 ` Mike Galbraith
@ 2021-09-22 14:15                 ` Vincent Guittot
  2021-09-22 15:04                   ` Mel Gorman
  2021-09-22 15:05                   ` Vincent Guittot
  2021-10-05  9:32                 ` Peter Zijlstra
  2 siblings, 2 replies; 59+ messages in thread
From: Vincent Guittot @ 2021-09-22 14:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, 22 Sept 2021 at 15:20, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Wed, Sep 22, 2021 at 07:22:20AM +0200, Mike Galbraith wrote:
> > On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:
> > > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
> > > >
> > > > Preemption does rapidly run into diminishing return as load climbs for
> > > > a lot of loads, but as you know, it's a rather sticky wicket because
> > > > even when over-committed, preventing light control threads from slicing
> > > > through (what can be a load's own work crew of) hogs can seriously
> > > > injure performance.
> > > >
> > >
> > > Turning this into a classic Rob Peter To Pay Paul problem. We don't know
> > > if there is a light control thread that needs to run or not that affects
> > > overall performance. It all depends on whether that control thread needs
> > > to make progress for the overall workload or whether there are a mix of
> > > workloads resulting in overloading.
> >
> > WRT overload, and our good buddies Peter and Paul :)  I added...
> > if (gran >= sysctl_sched_latency >> 1)
> > trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running);
> > ...to watch, and met the below when I.. logged in.
> >
> > homer:..debug/tracing # tail -20 trace
> > X-2229 [002] d..5. 60.690322: wakeup_gran: runnable:9 preempt disabled
> > X-2229 [002] d..5. 60.690325: wakeup_gran: runnable:10 preempt disabled
> > X-2229 [002] d..5. 60.690330: wakeup_gran: runnable:11 preempt disabled
> > X-2229 [002] d..5. 60.690363: wakeup_gran: runnable:13 preempt disabled
> > X-2229 [002] d..5. 60.690377: wakeup_gran: runnable:14 preempt disabled
> > X-2229 [002] d..5. 60.690390: wakeup_gran: runnable:15 preempt disabled
> > X-2229 [002] d..5. 60.690404: wakeup_gran: runnable:16 preempt disabled
> > X-2229 [002] d..5. 60.690425: wakeup_gran: runnable:9 preempt disabled
> > ksmserver-2694 [003] d..3. 60.690432: wakeup_gran: runnable:6 preempt disabled
> > ksmserver-2694 [003] d..3. 60.690436: wakeup_gran: runnable:7 preempt disabled
> > X-2229 [002] d..5. 60.690451: wakeup_gran: runnable:6 preempt disabled
> > X-2229 [002] d..5. 60.690465: wakeup_gran: runnable:7 preempt disabled
> > kmix-2736 [000] d..3. 60.690491: wakeup_gran: runnable:6 preempt disabled
> > X-2229 [004] d..5. 92.889635: wakeup_gran: runnable:6 preempt disabled
> > X-2229 [004] d..5. 92.889675: wakeup_gran: runnable:6 preempt disabled
> > X-2229 [004] d..5. 92.889863: wakeup_gran: runnable:6 preempt disabled
> > X-2229 [004] d..5. 92.889944: wakeup_gran: runnable:6 preempt disabled
> > X-2229 [004] d..5. 92.889957: wakeup_gran: runnable:7 preempt disabled
> > X-2229 [004] d..5. 92.889968: wakeup_gran: runnable:8 preempt disabled
> > QXcbEventQueue-2740 [000] d..4. 92.890025: wakeup_gran: runnable:6 preempt disabled
> > homer:..debug/tracing
> >
> > Watching 'while sleep 1; do clear;tail trace; done' with nothing but a
> > kbuild running is like watching top. There's enough stacking during
> > routine use of my desktop box that it runs into the tick granularity
> > wall pretty much continuously, so 'overload' may want redefining.
> >
>
> Ok, that's pretty convincing. You didn't mention if there were
> interactivity glitches but it's possible. This is what I'm currently
> testing but have no results for yet. It caps wakeup_gran at
> sysctl_sched_latency.
> > ---8<--- > sched/fair: Scale wakeup granularity relative to nr_running > > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved > the kernel.sched_wakeup_granularity_ns sysctl under debugfs. One of the > reasons this sysctl may be used is for "optimising for throughput", > particularly when overloaded. The tool TuneD sometimes alters this for two > profiles e.g. "mssql" and "throughput-performance". At least version 2.9 > does but it changed in master where it also will poke at debugfs instead. > > Internal parameters like sysctl_sched_wakeup_granularity are scaled > based on the number of CPUs due to sysctl_sched_tunable_scaling. For > simplicity, the timing figures in this changelog are based on > SCHED_TUNABLESCALING_NONE. This is a bit misleading because the platform that you used to highlight the problem has a 7ms sysctl_sched_wakeup_granularity, which is far more than your tick, which should be 1ms > > During task migration or wakeup, a decision is made on whether > to preempt the current task or not. To limit over-scheduling, > sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms > of runtime before preempting. However, when a domain is heavily overloaded > (e.g. hackbench), the degree of over-scheduling is still severe. This is > problematic as time that could instead be used by userspace tasks is > wasted on rescheduling. > > However, care must be taken. Even if a system is overloaded, there may > be high priority threads that must still be able to run. Mike Galbraith > explained the constraints as follows: > > CFS came about because the O(1) scheduler was unfair to the > point it had starvation problems. People pretty much across the > board agreed that a fair scheduler was a much better way > to go, and CFS was born. It didn't originally have the sleep > credit business, but had to grow it to become _short term_ fair.
> Ingo cut the sleep credit in half because of overscheduling, and > that has worked out pretty well all told.. but now you're pushing > it more in the unfair direction, all the way to extremely unfair > for anything and everything very light. > > Fairness isn't the holy grail mind you, and at some point, giving > up on short term fairness certainly isn't crazy, as proven by your > hackbench numbers and other numbers we've seen over the years, > but taking bites out of the 'CF' in the CFS that was born to be a > corner-case killer is.. worrisome. The other shoe will drop.. it > always does :) > > This patch scales the wakeup granularity based on the number of running > tasks on the CPU relative to > > sched_nr_disable_gran = sysctl_sched_latency / sysctl_sched_wakeup_granularity > > By default, this will allow the wakeup_gran to scale from > sysctl_sched_wakeup_granularity up to > sysctl_sched_latency depending on the number of running tasks on a cfs_rq. > By default, the limit is 6ms. > > Note that the TuneD throughput-performance profile allows up to 15ms > for sysctl_sched_latency (ignoring scaling) but there is no explanation > why such a long period was necessary or why sched_latency_ns is also > not adjusted. The intent may have been to disable wakeup preemption > or it might be an oversight. An internet search for instances where > the sysctl_sched_wakeup_granularity parameter is tuned to high values offers > either no explanation or a broken one.
> > TODO: Results positive or negative > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > --- > kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++++------- > kernel/sched/features.h | 6 +++++ > 2 files changed, 61 insertions(+), 9 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index ff69f245b939..5ec3b12039d6 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -84,6 +84,14 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL; > > const_debug unsigned int sysctl_sched_migration_cost = 500000UL; > > +/* > + * This value is kept at sysctl_sched_latency / sysctl_sched_wakeup_granularity > + * > + * This influences the decision on whether a waking task can preempt a running > + * task. > + */ > +static unsigned int sched_nr_disable_gran = 6; > + > int sched_thermal_decay_shift; > static int __init setup_sched_thermal_decay_shift(char *str) > { > @@ -627,6 +635,9 @@ int sched_update_scaling(void) > sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency, > sysctl_sched_min_granularity); > > + sched_nr_disable_gran = DIV_ROUND_UP(sysctl_sched_latency, > + sysctl_sched_wakeup_granularity); > + > #define WRT_SYSCTL(name) \ > (normalized_sysctl_##name = sysctl_##name / (factor)) > WRT_SYSCTL(sched_min_granularity); > @@ -4511,7 +4522,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) > } > > static int > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se); > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr, > + struct sched_entity *se); > > /* > * Pick the next process, keeping these things in mind, in this order: > @@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr) > second = curr; > } > > - if (second && wakeup_preempt_entity(second, left) < 1) > + if (second && wakeup_preempt_entity(NULL, second, left) < 1) Why not applying the same policy here ? 
the tick can also prevent current task to move forward > se = second; > } > > - if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) { > + if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) { > /* > * Someone really wants this to run. If it's not unfair, run it. > */ > se = cfs_rq->next; > - } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) { > + } else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) { > /* > * Prefer last buddy, try to return the CPU to a preempted task. > */ > @@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > } > #endif /* CONFIG_SMP */ > > -static unsigned long wakeup_gran(struct sched_entity *se) > +static unsigned long > +select_wakeup_gran(struct cfs_rq *cfs_rq) > +{ > + unsigned int nr_running, threshold; > + > + if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN)) > + return sysctl_sched_wakeup_granularity; > + > + /* !GENTLE_FAIR_SLEEPERS has one overload threshold. */ > + if (!sched_feat(GENTLE_FAIR_SLEEPERS)) { > + if (cfs_rq->nr_running <= sched_nr_disable_gran) cfs_rq->nr_running reflects the number of sched entities in the cfs but not the number of running tasks which reflected in h_nr_running Also do you want to take into account only tasks in this cfs and its children or on all cfs on this rq ? > + return sysctl_sched_wakeup_granularity; > + > + return sysctl_sched_latency; > + } > + > + /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */ > + nr_running = cfs_rq->nr_running; > + threshold = sched_nr_disable_gran >> 1; > + > + /* No overload. */ > + if (nr_running <= threshold) > + return sysctl_sched_wakeup_granularity; TBH I don't like these "no overload", "light overload" ... They don't have any real meaning apart from that it might work for your platform and your hackbench test. 
We have already had people complaining that a small cfs task does not preempt the curr task fast enough, as an example. There is no explanation of why these values are the correct ones; they are just heuristic results, and we are trying to remove platform heuristics, not add new ones > + > > + /* Light overload. */ > + if (nr_running <= sched_nr_disable_gran) > + return sysctl_sched_latency >> 1; > + > + /* Heavy overload. */ > + return sysctl_sched_latency; Why should a thread without any relationship with curr not preempt it just because there are already a lot of running threads? In your case, you want hackbench threads not to preempt each other because they try to use the same resources, so it's probably better to let the current one move forward, but that's not a universal policy. side question: have you tried changing the nice priority, which also impacts whether a thread can preempt curr? > +} > + > +static unsigned long > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se) > { > - unsigned long gran = sysctl_sched_wakeup_granularity; > + unsigned long gran = select_wakeup_gran(cfs_rq); > > /* > * Since its curr running now, convert the gran from real-time > @@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se) > * > */ > static int > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se) > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr, > + struct sched_entity *se) > { > s64 gran, vdiff = curr->vruntime - se->vruntime; > > if (vdiff <= 0) > return -1; > > - gran = wakeup_gran(se); > + gran = wakeup_gran(cfs_rq, se); > if (vdiff > gran) > return 1; > > @@ -7191,7 +7237,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ > return; > > update_curr(cfs_rq_of(se)); > - if (wakeup_preempt_entity(se, pse) == 1) { > + if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) { Like for update_curr above, cfs_rq can be wrong here because se could have changed > 
/* > * Bias pick_next to pick the sched entity that is > * triggering this preemption. > diff --git a/kernel/sched/features.h b/kernel/sched/features.h > index 7f8dace0964c..d041d7023029 100644 > --- a/kernel/sched/features.h > +++ b/kernel/sched/features.h > @@ -95,3 +95,9 @@ SCHED_FEAT(LATENCY_WARN, false) > > SCHED_FEAT(ALT_PERIOD, true) > SCHED_FEAT(BASE_SLICE, true) > + > +/* > + * Scale sched_wakeup_granularity dynamically based on the number of running > + * tasks up to a cap of sysctl_sched_latency. > + */ > +SCHED_FEAT(SCALE_WAKEUP_GRAN, true) ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-22 14:15 ` Vincent Guittot @ 2021-09-22 15:04 ` Mel Gorman 2021-09-22 16:00 ` Vincent Guittot 2021-10-05 9:41 ` Peter Zijlstra 2021-09-22 15:05 ` Vincent Guittot 1 sibling, 2 replies; 59+ messages in thread From: Mel Gorman @ 2021-09-22 15:04 UTC (permalink / raw) To: Vincent Guittot Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, Sep 22, 2021 at 04:15:27PM +0200, Vincent Guittot wrote: > > ---8<--- > > sched/fair: Scale wakeup granularity relative to nr_running > > > > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved > > the kernel.sched_wakeup_granularity_ns sysctl under debugfs. One of the > > reasons why this sysctl may be used may be for "optimising for throughput", > > particularly when overloaded. The tool TuneD sometimes alters this for two > > profiles e.g. "mssql" and "throughput-performance". At least version 2.9 > > does but it changed in master where it also will poke at debugfs instead. > > > > Internal parameters like sysctl_sched_wakeup_granularity are scaled > > based on the number of CPUs due to sysctl_sched_tunable_scaling. For > > simplicity, the timing figures in this changelog are based on > > SCHED_TUNABLESCALING_NONE. > > This is a bit misleading because the platform that you used to > highlight the problem has a 7ms sysctl_sched_wakeup_granularity. which > is far more than your tick which should be 1ms > Tick on the test machines is 4ms (HZ=250). The reason I used SCHED_TUNABLESCALING_NONE for the changelog is that the exact values depend on the number of CPUs so values are not even the same across the range of machines I'm using. sysctl_sched_latency, sysctl_sched_min_granularity sysctl_sched_wakeup_granularity are all scaled but the ratios remain constant. 
> > <SNIP> > > static int > > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se); > > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr, > > + struct sched_entity *se); > > > > /* > > * Pick the next process, keeping these things in mind, in this order: > > @@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr) > > second = curr; > > } > > > > - if (second && wakeup_preempt_entity(second, left) < 1) > > + if (second && wakeup_preempt_entity(NULL, second, left) < 1) > > Why not applying the same policy here ? the tick can also prevent > current task to move forward > Because it was less clear if it was necessary and what the consequences would be if the skip buddy ran or the next buddy failed to run because preempt failed, how does it interact with yield_to etc. I ended up concluding that they should be separate patches and keep this patch to one topic. > > se = second; > > } > > > > - if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) { > > + if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) { > > /* > > * Someone really wants this to run. If it's not unfair, run it. > > */ > > se = cfs_rq->next; > > - } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) { > > + } else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) { > > /* > > * Prefer last buddy, try to return the CPU to a preempted task. > > */ > > @@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > > } > > #endif /* CONFIG_SMP */ > > > > -static unsigned long wakeup_gran(struct sched_entity *se) > > +static unsigned long > > +select_wakeup_gran(struct cfs_rq *cfs_rq) > > +{ > > + unsigned int nr_running, threshold; > > + > > + if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN)) > > + return sysctl_sched_wakeup_granularity; > > + > > + /* !GENTLE_FAIR_SLEEPERS has one overload threshold. 
*/ > > + if (!sched_feat(GENTLE_FAIR_SLEEPERS)) { > > + if (cfs_rq->nr_running <= sched_nr_disable_gran) > > cfs_rq->nr_running reflects the number of sched entities in the cfs > but not the number of running tasks which reflected in h_nr_running > Then check_preempt_wakeup may also have the same problem as it uses nr_running. > Also do you want to take into account only tasks in this cfs and its > children or on all cfs on this rq ? > Only this cfq I think to limit overhead. > > + return sysctl_sched_wakeup_granularity; > > + > > + return sysctl_sched_latency; > > + } > > + > > + /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */ > > + nr_running = cfs_rq->nr_running; > > + threshold = sched_nr_disable_gran >> 1; > > + > > + /* No overload. */ > > + if (nr_running <= threshold) > > + return sysctl_sched_wakeup_granularity; > > TBH I don't like these "no overload", "light overload" ... They don't > have any real meaning apart from that it might work for your platform > and your hackbench test. They are, at best, a proxy measure for overload but the alternative is scanning a bunch of runqueues similar to what is required when detecting if a domain is fully busy or overloaded. > We already had have people complaining that small cfs task does not > preempt fast enough curr task as an example > Is there a specific test case that demonstrates this? > There is no explanation why these values are the correct ones and not > but are just some random heuristic results and we are trying to remove > platform heuristic and to not add new > They are a heuristic yes, but I'm trying to remove the motivation for users trying to tune sysctl_sched_wakeup_granularity to stupid values because as it stands, tuned will happily poke into debugfs despite the fact they are meant for debugging only and the values are of dubious merit. > > + > > + /* Light overload. */ > > + if (nr_running <= sched_nr_disable_gran) > > + return sysctl_sched_latency >> 1; > > + > > + /* Heavy overload. 
*/ > > + return sysctl_sched_latency; > > Why should a thread without any relationship with the curr, not > preempt it because there are already a lot of running threads? Preemption is not free, without any knowledge of what the thread is doing, we cannot determine if an inappropriate amount of CPU time is being spent dequeueing/enqueuing tasks. > In > your case, you want hackbench threads to not preempt each others > because they tries to use same resources so it's probably better to > let the current one to move forward but that's not a universal policy. > No, but have you a better suggestion? hackbench might be stupid but it's an example of where a workload can excessively preempt itself. While overloading an entire machine is stupid, it could also potentially occurs for applications running within a constrained cpumask. > side question: Have you try to change the nice priority which also > impact whether a thread can preempt curr ? > No, I have not tried. I guess one would be constructed with cyclictest -S running hackbench in the background and measuring if it makes a difference to the amount of jitter cyclictest experiences but I'm not sure if that would cover the case you are concerned with. 
> > +} > > + > > +static unsigned long > > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se) > > { > > - unsigned long gran = sysctl_sched_wakeup_granularity; > > + unsigned long gran = select_wakeup_gran(cfs_rq); > > > > /* > > * Since its curr running now, convert the gran from real-time > > @@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se) > > * > > */ > > static int > > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se) > > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr, > > + struct sched_entity *se) > > { > > s64 gran, vdiff = curr->vruntime - se->vruntime; > > > > if (vdiff <= 0) > > return -1; > > > > - gran = wakeup_gran(se); > > + gran = wakeup_gran(cfs_rq, se); > > if (vdiff > gran) > > return 1; > > > > @@ -7191,7 +7237,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ > > return; > > > > update_curr(cfs_rq_of(se)); > > - if (wakeup_preempt_entity(se, pse) == 1) { > > + if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) { > > like for update_curr above, cfs_rq can be wrong because se could have changed > Crap, that was a stupid mistake based on earlier review. I'll fix it. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-22 15:04 ` Mel Gorman @ 2021-09-22 16:00 ` Vincent Guittot 2021-09-22 17:38 ` Mel Gorman 2021-10-05 9:41 ` Peter Zijlstra 1 sibling, 1 reply; 59+ messages in thread From: Vincent Guittot @ 2021-09-22 16:00 UTC (permalink / raw) To: Mel Gorman Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, 22 Sept 2021 at 17:04, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Wed, Sep 22, 2021 at 04:15:27PM +0200, Vincent Guittot wrote: > > > ---8<--- > > > sched/fair: Scale wakeup granularity relative to nr_running > > > > > > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved > > > the kernel.sched_wakeup_granularity_ns sysctl under debugfs. One of the > > > reasons why this sysctl may be used may be for "optimising for throughput", > > > particularly when overloaded. The tool TuneD sometimes alters this for two > > > profiles e.g. "mssql" and "throughput-performance". At least version 2.9 > > > does but it changed in master where it also will poke at debugfs instead. > > > > > > Internal parameters like sysctl_sched_wakeup_granularity are scaled > > > based on the number of CPUs due to sysctl_sched_tunable_scaling. For > > > simplicity, the timing figures in this changelog are based on > > > SCHED_TUNABLESCALING_NONE. > > > > This is a bit misleading because the platform that you used to > > highlight the problem has a 7ms sysctl_sched_wakeup_granularity. which > > is far more than your tick which should be 1ms > > > > Tick on the test machines is 4ms (HZ=250). > > The reason I used SCHED_TUNABLESCALING_NONE for the changelog is that > the exact values depend on the number of CPUs so values are not even > the same across the range of machines I'm using. sysctl_sched_latency, > sysctl_sched_min_granularity sysctl_sched_wakeup_granularity are all > scaled but the ratios remain constant. 
My point was mainly that sysctl_sched_wakeup_granularity is above the tick period > > > > <SNIP> > > > static int > > > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se); > > > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr, > > > + struct sched_entity *se); > > > > > > /* > > > * Pick the next process, keeping these things in mind, in this order: > > > @@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr) > > > second = curr; > > > } > > > > > > - if (second && wakeup_preempt_entity(second, left) < 1) > > > + if (second && wakeup_preempt_entity(NULL, second, left) < 1) > > > > Why not applying the same policy here ? the tick can also prevent > > current task to move forward > > > > Because it was less clear if it was necessary and what the consequences > would be if the skip buddy ran or the next buddy failed to run because > preempt failed, how does it interact with yield_to etc. > > I ended up concluding that they should be separate patches and keep this > patch to one topic. > > > > > se = second; > > > } > > > > > > - if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) { > > > + if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) { > > > /* > > > * Someone really wants this to run. If it's not unfair, run it. > > > */ > > > se = cfs_rq->next; > > > - } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) { > > > + } else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) { > > > /* > > > * Prefer last buddy, try to return the CPU to a preempted task. 
> > > */ > > > @@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > > > } > > > #endif /* CONFIG_SMP */ > > > > > > -static unsigned long wakeup_gran(struct sched_entity *se) > > > +static unsigned long > > > +select_wakeup_gran(struct cfs_rq *cfs_rq) > > > +{ > > > + unsigned int nr_running, threshold; > > > + > > > + if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN)) > > > + return sysctl_sched_wakeup_granularity; > > > + > > > + /* !GENTLE_FAIR_SLEEPERS has one overload threshold. */ > > > + if (!sched_feat(GENTLE_FAIR_SLEEPERS)) { > > > + if (cfs_rq->nr_running <= sched_nr_disable_gran) > > > > cfs_rq->nr_running reflects the number of sched entities in the cfs > > but not the number of running tasks which reflected in h_nr_running > > > > Then check_preempt_wakeup may also have the same problem as it uses > nr_running. This comment was originally for the use of cfs_rq->nr_running a bit more below > > > Also do you want to take into account only tasks in this cfs and its > > children or on all cfs on this rq ? > > > > Only this cfq I think to limit overhead. > > > > + return sysctl_sched_wakeup_granularity; > > > + > > > + return sysctl_sched_latency; > > > + } > > > + > > > + /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */ > > > + nr_running = cfs_rq->nr_running; > > > + threshold = sched_nr_disable_gran >> 1; > > > + > > > + /* No overload. */ > > > + if (nr_running <= threshold) The comment was originally for this nr_running does not reflect the number of task running on the cpu and the associated overload state If you put 2 hackbench in their own cgroup, nr_running will be 2 but you will have twice more runnable threads > > > + return sysctl_sched_wakeup_granularity; > > > > TBH I don't like these "no overload", "light overload" ... They don't > > have any real meaning apart from that it might work for your platform > > and your hackbench test. 
> > They are, at best, a proxy measure for overload but the alternative is > scanning a bunch of runqueues similar to what is required when detecting > if a domain is fully busy or overloaded. > > > We already had have people complaining that small cfs task does not > > preempt fast enough curr task as an example > > > > Is there a specific test case that demonstrates this? > > > There is no explanation why these values are the correct ones and not > > but are just some random heuristic results and we are trying to remove > > platform heuristic and to not add new > > > > They are a heuristic yes, but I'm trying to remove the motivation for > users trying to tune sysctl_sched_wakeup_granularity to stupid values > because as it stands, tuned will happily poke into debugfs despite the > fact they are meant for debugging only and the values are of dubious merit. > > > > + > > > + /* Light overload. */ > > > + if (nr_running <= sched_nr_disable_gran) > > > + return sysctl_sched_latency >> 1; > > > + > > > + /* Heavy overload. */ > > > + return sysctl_sched_latency; > > > > Why should a thread without any relationship with the curr, not > > preempt it because there are already a lot of running threads? > > Preemption is not free, without any knowledge of what the thread is > doing, we cannot determine if an inappropriate amount of CPU time is > being spent dequeueing/enqueuing tasks. That's exactly my point. The heuristic above doesn't give any clue if the thread should or not preempt the current one. Most of the time there is no relation with the number of the running threads but it is define by the workload itself and its level of interactivity > > > In > > your case, you want hackbench threads to not preempt each others > > because they tries to use same resources so it's probably better to > > let the current one to move forward but that's not a universal policy. > > > > No, but have you a better suggestion? 
hackbench might be stupid but it's > an example of where a workload can excessively preempt itself. While > overloading an entire machine is stupid, it could also potentially occurs > for applications running within a constrained cpumask. But this is property that is specific to each application. Some can have a lot of running threads but few wakes up which have to preempt current threads quickly but others just want the opposite So because it is a application specific property we should define it this way instead of trying to guess > > > side question: Have you try to change the nice priority which also > > impact whether a thread can preempt curr ? > > > > No, I have not tried. I guess one would be constructed with cyclictest > -S running hackbench in the background and measuring if it makes a > difference to the amount of jitter cyclictest experiences but I'm not > sure if that would cover the case you are concerned with. > > > > +} > > > + > > > +static unsigned long > > > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se) > > > { > > > - unsigned long gran = sysctl_sched_wakeup_granularity; > > > + unsigned long gran = select_wakeup_gran(cfs_rq); > > > > > > /* > > > * Since its curr running now, convert the gran from real-time > > > @@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se) > > > * > > > */ > > > static int > > > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se) > > > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr, > > > + struct sched_entity *se) > > > { > > > s64 gran, vdiff = curr->vruntime - se->vruntime; > > > > > > if (vdiff <= 0) > > > return -1; > > > > > > - gran = wakeup_gran(se); > > > + gran = wakeup_gran(cfs_rq, se); > > > if (vdiff > gran) > > > return 1; > > > > > > @@ -7191,7 +7237,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ > > > return; > > > > > > update_curr(cfs_rq_of(se)); > > > - if 
(wakeup_preempt_entity(se, pse) == 1) { > > > + if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) { > > > > like for update_curr above, cfs_rq can be wrong because se could have changed > > > > Crap, that was a stupid mistake based on earlier review. I'll fix it. > > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-22 16:00 ` Vincent Guittot @ 2021-09-22 17:38 ` Mel Gorman 2021-09-22 18:22 ` Vincent Guittot 2021-10-05 10:23 ` Peter Zijlstra 0 siblings, 2 replies; 59+ messages in thread From: Mel Gorman @ 2021-09-22 17:38 UTC (permalink / raw) To: Vincent Guittot Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, Sep 22, 2021 at 06:00:56PM +0200, Vincent Guittot wrote: > On Wed, 22 Sept 2021 at 17:04, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Wed, Sep 22, 2021 at 04:15:27PM +0200, Vincent Guittot wrote: > > > > ---8<--- > > > > sched/fair: Scale wakeup granularity relative to nr_running > > > > > > > > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved > > > > the kernel.sched_wakeup_granularity_ns sysctl under debugfs. One of the > > > > reasons why this sysctl may be used may be for "optimising for throughput", > > > > particularly when overloaded. The tool TuneD sometimes alters this for two > > > > profiles e.g. "mssql" and "throughput-performance". At least version 2.9 > > > > does but it changed in master where it also will poke at debugfs instead. > > > > > > > > Internal parameters like sysctl_sched_wakeup_granularity are scaled > > > > based on the number of CPUs due to sysctl_sched_tunable_scaling. For > > > > simplicity, the timing figures in this changelog are based on > > > > SCHED_TUNABLESCALING_NONE. > > > > > > This is a bit misleading because the platform that you used to > > > highlight the problem has a 7ms sysctl_sched_wakeup_granularity. which > > > is far more than your tick which should be 1ms > > > > > > > Tick on the test machines is 4ms (HZ=250). > > > > The reason I used SCHED_TUNABLESCALING_NONE for the changelog is that > > the exact values depend on the number of CPUs so values are not even > > the same across the range of machines I'm using. 
sysctl_sched_latency, > > sysctl_sched_min_granularity sysctl_sched_wakeup_granularity are all > > scaled but the ratios remain constant. > > My point was mainly that sysctl_sched_wakeup_granularity is above the > tick period > Ok. sysctl_sched_wakeup_granularity is not related to the tick but I'm probably missing something else. > > > Also do you want to take into account only tasks in this cfs and its > > > children or on all cfs on this rq ? > > > > > > > Only this cfq I think to limit overhead. > > > > > > + return sysctl_sched_wakeup_granularity; > > > > + > > > > + return sysctl_sched_latency; > > > > + } > > > > + > > > > + /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */ > > > > + nr_running = cfs_rq->nr_running; > > > > + threshold = sched_nr_disable_gran >> 1; > > > > + > > > > + /* No overload. */ > > > > + if (nr_running <= threshold) > > The comment was originally for this > nr_running does not reflect the number of task running on the cpu and > the associated overload state > > If you put 2 hackbench in their own cgroup, nr_running will be 2 but > you will have twice more runnable threads > Ok, that's understood. FWIW, I had switched to h_nr_running already. > > > > + > > > > + /* Light overload. */ > > > > + if (nr_running <= sched_nr_disable_gran) > > > > + return sysctl_sched_latency >> 1; > > > > + > > > > + /* Heavy overload. */ > > > > + return sysctl_sched_latency; > > > > > > Why should a thread without any relationship with the curr, not > > > preempt it because there are already a lot of running threads? > > > > Preemption is not free, without any knowledge of what the thread is > > doing, we cannot determine if an inappropriate amount of CPU time is > > being spent dequeueing/enqueuing tasks. > > That's exactly my point. The heuristic above doesn't give any clue if > the thread should or not preempt the current one. 
Most of the time > there is no relation with the number of the running threads but it is > define by the workload itself and its level of interactivity > Right, albeit it ignores the possibility that there are multiple workloads overloading the machine, wasting even more time preempting. Whether that is due to a bad application, a bad configuration or a bad actor is anyone's guess. I've seen applications with multiple worker threads which generally behaved ok, except when they didn't and too many worker threads were spawned due to a spike in server load. > > > > > In > > > your case, you want hackbench threads to not preempt each others > > > because they tries to use same resources so it's probably better to > > > let the current one to move forward but that's not a universal policy. > > > > > > > No, but have you a better suggestion? hackbench might be stupid but it's > > an example of where a workload can excessively preempt itself. While > > overloading an entire machine is stupid, it could also potentially occurs > > for applications running within a constrained cpumask. > > But this is property that is specific to each application. Some can > have a lot of running threads but few wakes up which have to preempt > current threads quickly but others just want the opposite > So because it is a application specific property we should define it > this way instead of trying to guess I'm not seeing an alternative suggestion that could be turned into an implementation. The current value for sched_wakeup_granularity was set 12 years ago and was exposed for tuning, which is no longer the case. The intent was to allow some dynamic adjustment between sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce over-scheduling in the worst case without disabling preemption entirely (which the first version did). Should we just ignore this problem and hope it goes away or just let people keep poking silly values into debugfs via tuned? 
-- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-22 17:38 ` Mel Gorman @ 2021-09-22 18:22 ` Vincent Guittot 2021-09-22 18:57 ` Mel Gorman ` (2 more replies) 2021-10-05 10:23 ` Peter Zijlstra 1 sibling, 3 replies; 59+ messages in thread From: Vincent Guittot @ 2021-09-22 18:22 UTC (permalink / raw) To: Mel Gorman Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, 22 Sept 2021 at 19:38, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Wed, Sep 22, 2021 at 06:00:56PM +0200, Vincent Guittot wrote: > > On Wed, 22 Sept 2021 at 17:04, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > On Wed, Sep 22, 2021 at 04:15:27PM +0200, Vincent Guittot wrote: > > > > > ---8<--- > > > > > sched/fair: Scale wakeup granularity relative to nr_running > > > > > > > > > > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved > > > > > the kernel.sched_wakeup_granularity_ns sysctl under debugfs. One of the > > > > > reasons why this sysctl may be used may be for "optimising for throughput", > > > > > particularly when overloaded. The tool TuneD sometimes alters this for two > > > > > profiles e.g. "mssql" and "throughput-performance". At least version 2.9 > > > > > does but it changed in master where it also will poke at debugfs instead. > > > > > > > > > > Internal parameters like sysctl_sched_wakeup_granularity are scaled > > > > > based on the number of CPUs due to sysctl_sched_tunable_scaling. For > > > > > simplicity, the timing figures in this changelog are based on > > > > > SCHED_TUNABLESCALING_NONE. > > > > > > > > This is a bit misleading because the platform that you used to > > > > highlight the problem has a 7ms sysctl_sched_wakeup_granularity. which > > > > is far more than your tick which should be 1ms > > > > > > > > > > Tick on the test machines is 4ms (HZ=250). 
> > > > > > The reason I used SCHED_TUNABLESCALING_NONE for the changelog is that > > > the exact values depend on the number of CPUs so values are not even > > > the same across the range of machines I'm using. sysctl_sched_latency, > > > sysctl_sched_min_granularity sysctl_sched_wakeup_granularity are all > > > scaled but the ratios remain constant. > > > > My point was mainly that sysctl_sched_wakeup_granularity is above the > > tick period > > > > Ok. sysctl_sched_wakeup_granularity is not related to the tick but I'm > probably missing something else. > > > > > Also do you want to take into account only tasks in this cfs and its > > > > children or on all cfs on this rq ? > > > > > > > > > > Only this cfq I think to limit overhead. > > > > > > > > + return sysctl_sched_wakeup_granularity; > > > > > + > > > > > + return sysctl_sched_latency; > > > > > + } > > > > > + > > > > > + /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */ > > > > > + nr_running = cfs_rq->nr_running; > > > > > + threshold = sched_nr_disable_gran >> 1; > > > > > + > > > > > + /* No overload. */ > > > > > + if (nr_running <= threshold) > > > > The comment was originally for this > > nr_running does not reflect the number of task running on the cpu and > > the associated overload state > > > > If you put 2 hackbench in their own cgroup, nr_running will be 2 but > > you will have twice more runnable threads > > > > Ok, that's understood. FWIW, I had switched to h_nr_running already. > > > > > > + > > > > > + /* Light overload. */ > > > > > + if (nr_running <= sched_nr_disable_gran) > > > > > + return sysctl_sched_latency >> 1; > > > > > + > > > > > + /* Heavy overload. */ > > > > > + return sysctl_sched_latency; > > > > > > > > Why should a thread without any relationship with the curr, not > > > > preempt it because there are already a lot of running threads? 
> > > > > > Preemption is not free, without any knowledge of what the thread is > > > doing, we cannot determine if an inappropriate amount of CPU time is > > > being spent dequeueing/enqueuing tasks. > > > > That's exactly my point. The heuristic above doesn't give any clue if > > the thread should or not preempt the current one. Most of the time > > there is no relation with the number of the running threads but it is > > define by the workload itself and its level of interactivity > > > > Right albeit it ignores the possibility that there are multiple workloads > overloading the machine wasting even more time preempting. Whether that > is due to a bad application, a bad configuration or a bad actor is anyones > guess. I've seen applications with multiple worker threads which generally > behaved ok, except when they didn't and too many worker threads were > spawned due to a spike in server load. I'm not > > > > > > > > In > > > > your case, you want hackbench threads to not preempt each others > > > > because they tries to use same resources so it's probably better to > > > > let the current one to move forward but that's not a universal policy. > > > > > > > > > > No, but have you a better suggestion? hackbench might be stupid but it's > > > an example of where a workload can excessively preempt itself. While > > > overloading an entire machine is stupid, it could also potentially occurs > > > for applications running within a constrained cpumask. > > > > But this is property that is specific to each application. Some can > > have a lot of running threads but few wakes up which have to preempt > > current threads quickly but others just want the opposite > > So because it is a application specific property we should define it > > this way instead of trying to guess > > I'm not seeing an alternative suggestion that could be turned into > an implementation. 
The current value for sched_wakeup_granularity > was set 12 years ago was exposed for tuning which is no longer > the case. The intent was to allow some dynamic adjustment between > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce > over-scheduling in the worst case without disabling preemption entirely > (which the first version did). > > Should we just ignore this problem and hope it goes away or just let > people keep poking silly values into debugfs via tuned? We should certainly not add a bandaid because people will continue to poke silly values in the end. And increasing sysctl_sched_wakeup_granularity based on the number of running threads is not the right solution. According to your description of the problem, the current task doesn't get enough time to move forward, so sysctl_sched_min_granularity should be part of the solution. Something like the below will ensure that current gets a chance to move forward: diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9bf540f04c2d..39d4e4827d3d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7102,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ int scale = cfs_rq->nr_running >= sched_nr_latency; int next_buddy_marked = 0; int cse_is_idle, pse_is_idle; + unsigned long delta_exec; if (unlikely(se == pse)) return; @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ return; update_curr(cfs_rq_of(se)); + delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime; + /* + * Ensure that current got a chance to move forward + */ + if (delta_exec < sysctl_sched_min_granularity) + return; + if (wakeup_preempt_entity(se, pse) == 1) { /* * Bias pick_next to pick the sched entity that is > > -- > Mel Gorman > SUSE Labs ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-22 18:22 ` Vincent Guittot @ 2021-09-22 18:57 ` Mel Gorman 2021-09-23 1:47 ` Mike Galbraith 2021-10-05 10:28 ` Peter Zijlstra 2 siblings, 0 replies; 59+ messages in thread From: Mel Gorman @ 2021-09-22 18:57 UTC (permalink / raw) To: Vincent Guittot Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, Sep 22, 2021 at 08:22:43PM +0200, Vincent Guittot wrote: > > > > > In > > > > > your case, you want hackbench threads to not preempt each others > > > > > because they tries to use same resources so it's probably better to > > > > > let the current one to move forward but that's not a universal policy. > > > > > > > > > > > > > No, but have you a better suggestion? hackbench might be stupid but it's > > > > an example of where a workload can excessively preempt itself. While > > > > overloading an entire machine is stupid, it could also potentially occurs > > > > for applications running within a constrained cpumask. > > > > > > But this is property that is specific to each application. Some can > > > have a lot of running threads but few wakes up which have to preempt > > > current threads quickly but others just want the opposite > > > So because it is a application specific property we should define it > > > this way instead of trying to guess > > > > I'm not seeing an alternative suggestion that could be turned into > > an implementation. The current value for sched_wakeup_granularity > > was set 12 years ago was exposed for tuning which is no longer > > the case. The intent was to allow some dynamic adjustment between > > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce > > over-scheduling in the worst case without disabling preemption entirely > > (which the first version did). 
> > > > Should we just ignore this problem and hope it goes away or just let > > people keep poking silly values into debugfs via tuned? > > We should certainly not add a bandaid because people will continue to > poke silly value at the end. And increasing > sysctl_sched_wakeup_granularity based on the number of running threads > is not the right solution. According to the description of your > problem that the current task doesn't get enough time to move forward, > sysctl_sched_min_granularity should be part of the solution. Something > like below will ensure that current got a chance to move forward > That's a very interesting idea! I've queued it up for further testing and as a comparison to the bandaid. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-22 18:22 ` Vincent Guittot 2021-09-22 18:57 ` Mel Gorman @ 2021-09-23 1:47 ` Mike Galbraith 2021-09-23 8:40 ` Vincent Guittot 2021-10-05 10:28 ` Peter Zijlstra 2 siblings, 1 reply; 59+ messages in thread From: Mike Galbraith @ 2021-09-23 1:47 UTC (permalink / raw) To: Vincent Guittot, Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, 2021-09-22 at 20:22 +0200, Vincent Guittot wrote: > On Wed, 22 Sept 2021 at 19:38, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > I'm not seeing an alternative suggestion that could be turned into > > an implementation. The current value for sched_wakeup_granularity > > was set 12 years ago was exposed for tuning which is no longer > > the case. The intent was to allow some dynamic adjustment between > > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce > > over-scheduling in the worst case without disabling preemption entirely > > (which the first version did). I don't think those knobs were ever _intended_ for general purpose tuning, but they did get used that way by some folks. > > > > Should we just ignore this problem and hope it goes away or just let > > people keep poking silly values into debugfs via tuned? > > We should certainly not add a bandaid because people will continue to > poke silly value at the end. And increasing > sysctl_sched_wakeup_granularity based on the number of running threads > is not the right solution. Watching my desktop box stack up large piles of very short running threads, I agree, instantaneous load looks like a non-starter. > According to the description of your > problem that the current task doesn't get enough time to move forward, > sysctl_sched_min_granularity should be part of the solution. 
Something > like below will ensure that current got a chance to move forward Nah, progress is guaranteed, the issue is a zillion very similar short running threads preempting each other with no win to be had, thus spending cycles in the scheduler that are utterly wasted. It's a valid issue, trouble is teaching the scheduler to recognize that situation without mucking up other situations where there IS a win for even very short running threads, say, doing a synchronous handoff; preemption is cheaper than scheduling off if the waker is going to be awakened again in very short order. > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 9bf540f04c2d..39d4e4827d3d 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7102,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq, > struct task_struct *p, int wake_ > int scale = cfs_rq->nr_running >= sched_nr_latency; > int next_buddy_marked = 0; > int cse_is_idle, pse_is_idle; > + unsigned long delta_exec; > > if (unlikely(se == pse)) > return; > @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq, > struct task_struct *p, int wake_ > return; > > update_curr(cfs_rq_of(se)); > + delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime; > + /* > + * Ensure that current got a chance to move forward > + */ > + if (delta_exec < sysctl_sched_min_granularity) > + return; > + > if (wakeup_preempt_entity(se, pse) == 1) { > /* > * Bias pick_next to pick the sched entity that is Yikes! If you do that, you may as well go the extra nanometer and rip wakeup preemption out entirely, same result, impressive diffstat. -Mike ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-23 1:47 ` Mike Galbraith @ 2021-09-23 8:40 ` Vincent Guittot 2021-09-23 9:21 ` Mike Galbraith 2021-09-23 12:24 ` Phil Auld 0 siblings, 2 replies; 59+ messages in thread From: Vincent Guittot @ 2021-09-23 8:40 UTC (permalink / raw) To: Mike Galbraith Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Thu, 23 Sept 2021 at 03:47, Mike Galbraith <efault@gmx.de> wrote: > > On Wed, 2021-09-22 at 20:22 +0200, Vincent Guittot wrote: > > On Wed, 22 Sept 2021 at 19:38, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > > > > I'm not seeing an alternative suggestion that could be turned into > > > an implementation. The current value for sched_wakeup_granularity > > > was set 12 years ago was exposed for tuning which is no longer > > > the case. The intent was to allow some dynamic adjustment between > > > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce > > > over-scheduling in the worst case without disabling preemption entirely > > > (which the first version did). > > I don't think those knobs were ever _intended_ for general purpose > tuning, but they did get used that way by some folks. > > > > > > > Should we just ignore this problem and hope it goes away or just let > > > people keep poking silly values into debugfs via tuned? > > > > We should certainly not add a bandaid because people will continue to > > poke silly value at the end. And increasing > > sysctl_sched_wakeup_granularity based on the number of running threads > > is not the right solution. > > Watching my desktop box stack up large piles of very short running > threads, I agree, instantaneous load looks like a non-starter. > > > According to the description of your > > problem that the current task doesn't get enough time to move forward, > > sysctl_sched_min_granularity should be part of the solution. 
Something > > like below will ensure that current got a chance to move forward > > Nah, progress is guaranteed, the issue is a zillion very similar short > running threads preempting each other with no win to be had, thus > spending cycles in the scheduler that are utterly wasted. It's a valid > issue, trouble is teaching the scheduler to recognize that situation > without mucking up other situations where there IS a win for even very > short running threads say, doing a synchronous handoff; preemption is > cheaper than scheduling off if the waker is going be awakened again in > very short order. > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 9bf540f04c2d..39d4e4827d3d 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -7102,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq, > > struct task_struct *p, int wake_ > > int scale = cfs_rq->nr_running >= sched_nr_latency; > > int next_buddy_marked = 0; > > int cse_is_idle, pse_is_idle; > > + unsigned long delta_exec; > > > > if (unlikely(se == pse)) > > return; > > @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq, > > struct task_struct *p, int wake_ > > return; > > > > update_curr(cfs_rq_of(se)); > > + delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime; > > + /* > > + * Ensure that current got a chance to move forward > > + */ > > + if (delta_exec < sysctl_sched_min_granularity) > > + return; > > + > > if (wakeup_preempt_entity(se, pse) == 1) { > > /* > > * Bias pick_next to pick the sched entity that is > > Yikes! If you do that, you may as well go the extra nanometer and rip > wakeup preemption out entirely, same result, impressive diffstat. This patch is mainly there to show that there are other ways to ensure progress without using some load heuristic. sysctl_sched_min_granularity has the problem of scaling with the number of cpus and this can generate large values. 
At least we should use the normalized_sysctl_sched_min_granularity or even a smaller value, but wakeup preemption still happens with this change. It only ensures that we don't waste time preempting each other without any chance to do actual stuff. A 100us value should even be enough to fix Mel's problem without impacting common wakeup preemption cases. > > -Mike ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-23 8:40 ` Vincent Guittot @ 2021-09-23 9:21 ` Mike Galbraith 2021-09-23 12:41 ` Vincent Guittot 0 siblings, 1 reply; 59+ messages in thread From: Mike Galbraith @ 2021-09-23 9:21 UTC (permalink / raw) To: Vincent Guittot Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote: > > a 100us value should even be enough to fix Mel's problem without > impacting common wakeup preemption cases. It'd be nice if it turned out to be something that simple, but color me skeptical. I've tried various preemption throttling schemes, and while it was trivial to get promising results, my scheme always managed to harm something. Everything I ever tried, I ended up tossing. -Mike ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-23 9:21 ` Mike Galbraith @ 2021-09-23 12:41 ` Vincent Guittot 2021-09-23 13:14 ` Mike Galbraith 2021-09-27 11:17 ` Mel Gorman 0 siblings, 2 replies; 59+ messages in thread From: Vincent Guittot @ 2021-09-23 12:41 UTC (permalink / raw) To: Mike Galbraith Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote: > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote: > > > > a 100us value should even be enough to fix Mel's problem without > > impacting common wakeup preemption cases. > > It'd be nice if it turn out to be something that simple, but color me > skeptical. I've tried various preemption throttling schemes, and while Let's see what the results will show. I tend to agree that this will not be enough to cover all use cases and I don't see any other way to cover all cases than getting some inputs from the threads about their latency fairness, which brings us back to some kind of latency niceness value > it was trivial to get promising results, my scheme always managed to > harm something. Everything I ever tried, I ended up tossing. > > -Mike ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-23 12:41 ` Vincent Guittot @ 2021-09-23 13:14 ` Mike Galbraith 2021-09-27 11:17 ` Mel Gorman 1 sibling, 0 replies; 59+ messages in thread From: Mike Galbraith @ 2021-09-23 13:14 UTC (permalink / raw) To: Vincent Guittot Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Thu, 2021-09-23 at 14:41 +0200, Vincent Guittot wrote: > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote: > > > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote: > > > > > > a 100us value should even be enough to fix Mel's problem without > > > impacting common wakeup preemption cases. > > > > It'd be nice if it turn out to be something that simple, but color me > > skeptical. I've tried various preemption throttling schemes, and while > > Let's see what the results will show. I tend to agree that this will > not be enough to cover all use cases and I don't see any other way to > cover all cases than getting some inputs from the threads about their > latency fairness which bring us back to some kind of latency niceness > value What dooms the dirt simple solution is: tasks that dip lightly but frequently are a thing ;-) Take nfs threads, tell 'em frequent preemption ain't cool, and no matter how diplomatically you say it, they react in the only way they can, by sucking at their job. -Mike ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-23 12:41 ` Vincent Guittot 2021-09-23 13:14 ` Mike Galbraith @ 2021-09-27 11:17 ` Mel Gorman 2021-09-27 14:17 ` Mike Galbraith 2021-09-27 14:19 ` Vincent Guittot 1 sibling, 2 replies; 59+ messages in thread From: Mel Gorman @ 2021-09-27 11:17 UTC (permalink / raw) To: Vincent Guittot Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote: > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote: > > > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote: > > > > > > a 100us value should even be enough to fix Mel's problem without > > > impacting common wakeup preemption cases. > > > > It'd be nice if it turn out to be something that simple, but color me > > skeptical. I've tried various preemption throttling schemes, and while > > Let's see what the results will show. I tend to agree that this will > not be enough to cover all use cases and I don't see any other way to > cover all cases than getting some inputs from the threads about their > latency fairness which bring us back to some kind of latency niceness > value > Unfortunately, I didn't get a complete set of results but enough to work with. The missing tests have been requeued. The figures below are based on a single-socket Skylake machine with 8 CPUs as it had the most set of results and is the basic case. 
The reported kernels are:

vanilla:                  vanilla 5.15-rc1
sched-scalewakegran-v2r4: my patch
sched-moveforward-v1r1:   Vincent's patch

hackbench-process-pipes
                          5.15.0-rc1                5.15.0-rc1                5.15.0-rc1
                             vanilla  sched-scalewakegran-v2r4    sched-moveforward-v1r1
Amean     1        0.3253 (   0.00%)      0.3330 (  -2.36%)      0.3257 (  -0.10%)
Amean     4        0.8300 (   0.00%)      0.7570 (   8.80%)      0.7560 (   8.92%)
Amean     7        1.1003 (   0.00%)      1.1457 *  -4.12%*      1.1163 (  -1.45%)
Amean     12       1.7263 (   0.00%)      1.6393 *   5.04%*      1.5963 *   7.53%*
Amean     21       3.0063 (   0.00%)      2.6590 *  11.55%*      2.4487 *  18.55%*
Amean     30       4.2323 (   0.00%)      3.5657 *  15.75%*      3.3410 *  21.06%*
Amean     48       6.5657 (   0.00%)      5.4180 *  17.48%*      5.0857 *  22.54%*
Amean     79      10.4867 (   0.00%)      8.4357 *  19.56%*      7.9563 *  24.13%*
Amean     110     14.8880 (   0.00%)     11.0423 *  25.83%*     10.7407 *  27.86%*
Amean     141     19.2083 (   0.00%)     14.0820 *  26.69%*     13.3780 *  30.35%*
Amean     172     23.4847 (   0.00%)     16.9880 *  27.66%*     16.4293 *  30.04%*
Amean     203     27.3763 (   0.00%)     20.2480 *  26.04%*     19.6430 *  28.25%*
Amean     234     31.3707 (   0.00%)     23.2477 *  25.89%*     22.8287 *  27.23%*
Amean     265     35.4663 (   0.00%)     26.2483 *  25.99%*     25.8683 *  27.06%*
Amean     296     39.2380 (   0.00%)     29.4237 *  25.01%*     28.8727 *  26.42%*

For hackbench, Vincent's patch and mine have a similar impact.

tbench4
                          5.15.0-rc1                5.15.0-rc1                5.15.0-rc1
                             vanilla  sched-scalewakegran-v2r4    sched-moveforward-v1r1
Hmean     1       598.71 (   0.00%)      608.31 *   1.60%*      586.05 *  -2.11%*
Hmean     2      1096.74 (   0.00%)     1110.07 *   1.22%*     1106.70 *   0.91%*
Hmean     4      1529.35 (   0.00%)     1531.20 *   0.12%*     1551.11 *   1.42%*
Hmean     8      2824.32 (   0.00%)     2847.96 *   0.84%*     2684.21 *  -4.96%*
Hmean     16     2573.30 (   0.00%)     2591.77 *   0.72%*     2445.41 *  -4.97%*
Hmean     32     2518.77 (   0.00%)     2532.70 *   0.55%*     2409.30 *  -4.35%*

For tbench, it's ok for lower thread counts; for 8 threads (machine overloaded), Vincent's patch regresses slightly. With these test runs, I don't have detailed information as to why, but the most likely explanation is that preemption gets disabled prematurely. 
specjbb
                          5.15.0-rc1                5.15.0-rc1                5.15.0-rc1
                             vanilla  sched-scalewakegran-v2r4    sched-moveforward-v1r1
Hmean     tput-1    71199.00 (   0.00%)    69492.00 *  -2.40%*    71126.00 *  -0.10%*
Hmean     tput-2   154478.00 (   0.00%)   146060.00 *  -5.45%*   153073.00 *  -0.91%*
Hmean     tput-3   211889.00 (   0.00%)   209386.00 *  -1.18%*   219434.00 *   3.56%*
Hmean     tput-4   257842.00 (   0.00%)   248012.00 *  -3.81%*   262903.00 *   1.96%*
Hmean     tput-5   253506.00 (   0.00%)   242511.00 *  -4.34%*   250828.00 *  -1.06%*
Hmean     tput-6   246202.00 (   0.00%)   236480.00 *  -3.95%*   244236.00 *  -0.80%*
Hmean     tput-7   241133.00 (   0.00%)   230905.00 *  -4.24%*   237619.00 *  -1.46%*
Hmean     tput-8   237983.00 (   0.00%)   230010.00 *  -3.35%*   235275.00 *  -1.14%*

For specjbb, it's different again: Vincent's patch is better for the overloaded case but both patches show light regressions. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-27 11:17 ` Mel Gorman @ 2021-09-27 14:17 ` Mike Galbraith 2021-10-04 8:05 ` Mel Gorman 2021-09-27 14:19 ` Vincent Guittot 1 sibling, 1 reply; 59+ messages in thread From: Mike Galbraith @ 2021-09-27 14:17 UTC (permalink / raw) To: Mel Gorman, Vincent Guittot Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Mon, 2021-09-27 at 12:17 +0100, Mel Gorman wrote: > On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote: > > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote: > > > > > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote: > > > > > > > > a 100us value should even be enough to fix Mel's problem without > > > > impacting common wakeup preemption cases. > > > > > > It'd be nice if it turn out to be something that simple, but color me > > > skeptical. I've tried various preemption throttling schemes, and while > > > > Let's see what the results will show. I tend to agree that this will > > not be enough to cover all use cases and I don't see any other way to > > cover all cases than getting some inputs from the threads about their > > latency fairness which bring us back to some kind of latency niceness > > value > > > > Unfortunately, I didn't get a complete set of results but enough to work > with. The missing tests have been requeued. The figures below are based > on a single-socket Skylake machine with 8 CPUs as it had the most set of > results and is the basic case. There's something missing, namely how does whatever load you measure perform when facing dissimilar competition. Instead of only scaling loads running solo from underutilized to heavily over-committed, give them competition. eg something switch heavy, say tbench, TCP_RR et al (latency bound load) pairs=CPUS vs something hefty like make -j CPUS or such. 
With your "pick a load number and roll preemption down" approach and
competing (modest) loads running on some headless enterprise test array
box, there will likely be little impact, but do the same on a desktop box
from the GUI and, according to my box, you're likely to see something
entirely different. Seems the "if current hasn't consumed at least 100us,
go away" approach should impact fast movers facing competition both a lot
more and more consistently (given modest load for both methods).

Below is my box nfs mounting itself for no other reason than I was curious
to see what running my repo update script from an nfs mount would look
like (hey, it's still more realistic than hackbench ;). It's sorted by
switches, but those at the top are also where the most cycles landed. I
wouldn't expect throughput of a load that switch-happy to hold up well at
all when faced with an enforced 100us latency. Switch-happy loads don't do
so just to be mean, nor do they matter less than whatever beefier loads
they may encounter in a box.
 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms  | Maximum delay at      |
 -----------------------------------------------------------------------------------------------------------------
  nfsd:2246             |  31297.564 ms |  1514806 | avg:   0.002 ms | max:   555.543 ms | max at:   9130.660504 s
  nfsd:2245             |   9536.364 ms |  1475992 | avg:   0.002 ms | max:   902.341 ms | max at:   9083.907053 s
  kworker/u17:2-x:21933 |   3629.267 ms |   768463 | avg:   0.009 ms | max:  5536.206 ms | max at:   9088.540916 s
  kworker/u17:3:7082    |   3426.631 ms |   701947 | avg:   0.010 ms | max:  5543.659 ms | max at:   9088.540901 s
  git:7100              |  12708.278 ms |   573828 | avg:   0.001 ms | max:     3.520 ms | max at:   9066.125757 s
  git:7704              |  11620.355 ms |   517010 | avg:   0.001 ms | max:     4.070 ms | max at:   9113.894832 s
  kworker/u17:0:7075    |   1812.581 ms |   397601 | avg:   0.002 ms | max:   620.321 ms | max at:   9114.655685 s
  nfsd:2244             |   4930.826 ms |   370473 | avg:   0.008 ms | max:   910.646 ms | max at:   9083.915441 s
  kworker/u16:6:7094    |   2870.424 ms |   335848 | avg:   0.005 ms | max:   580.871 ms | max at:   9114.616479 s
  nfsd:2243             |   3424.996 ms |   257274 | avg:   0.033 ms | max:  3843.339 ms | max at:   9086.848829 s
  kworker/u17:1-x:30183 |   1310.614 ms |   255990 | avg:   0.001 ms | max:     1.817 ms | max at:   9089.173217 s
  kworker/u16:60-:6124  |   2253.058 ms |   225931 | avg:   0.050 ms | max: 10128.140 ms | max at:   9108.375040 s
  kworker/u16:5:7092    |   1831.385 ms |   211923 | avg:   0.007 ms | max:   905.513 ms | max at:   9083.911630 s
  kworker/u16:7:7101    |   1606.258 ms |   194944 | avg:   0.002 ms | max:    11.576 ms | max at:   9082.789700 s
  kworker/u16:4:7090    |   1484.687 ms |   189197 | avg:   0.100 ms | max: 12112.172 ms | max at:   9110.360308 s
  kworker/u16:59-:6123  |   1707.858 ms |   183464 | avg:   0.073 ms | max:  6135.816 ms | max at:   9120.173398 s
  kworker/u16:3:7074    |   1528.375 ms |   173089 | avg:   0.098 ms | max: 15196.567 ms | max at:   9098.202355 s
  kworker/u16:0-r:7009  |   1336.814 ms |   166043 | avg:   0.002 ms | max:    12.381 ms | max at:   9082.839130 s
  nfsd:2242             |   1876.802 ms |   154855 | avg:   0.073 ms | max:  3844.877 ms | max at:   9086.848848 s
  kworker/u16:1:7072    |   1214.642 ms |   151420 | avg:   0.002 ms | max:     6.433 ms | max at:   9075.581713 s
  kworker/u16:2:7073    |   1302.996 ms |   150863 | avg:   0.002 ms | max:    12.119 ms | max at:   9082.839133 s

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-27 14:17 ` Mike Galbraith @ 2021-10-04 8:05 ` Mel Gorman 2021-10-04 16:37 ` Vincent Guittot 0 siblings, 1 reply; 59+ messages in thread From: Mel Gorman @ 2021-10-04 8:05 UTC (permalink / raw) To: Mike Galbraith Cc: Vincent Guittot, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Mon, Sep 27, 2021 at 04:17:25PM +0200, Mike Galbraith wrote: > On Mon, 2021-09-27 at 12:17 +0100, Mel Gorman wrote: > > On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote: > > > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote: > > > > > > > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote: > > > > > > > > > > a 100us value should even be enough to fix Mel's problem without > > > > > impacting common wakeup preemption cases. > > > > > > > > It'd be nice if it turn out to be something that simple, but color me > > > > skeptical. I've tried various preemption throttling schemes, and while > > > > > > Let's see what the results will show. I tend to agree that this will > > > not be enough to cover all use cases and I don't see any other way to > > > cover all cases than getting some inputs from the threads about their > > > latency fairness which bring us back to some kind of latency niceness > > > value > > > > > > > Unfortunately, I didn't get a complete set of results but enough to work > > with. The missing tests have been requeued. The figures below are based > > on a single-socket Skylake machine with 8 CPUs as it had the most set of > > results and is the basic case. > > There's something missing, namely how does whatever load you measure > perform when facing dissimilar competition. Instead of only scaling > loads running solo from underutilized to heavily over-committed, give > them competition. 
eg something switch heavy, say tbench, TCP_RR et al
> (latency bound load) pairs=CPUS vs something hefty like make -j CPUS or
> such.
>

Ok, that's an interesting test. I've been out intermittently and will be
for the next few weeks but I managed to automate something that can test
this. The test runs a kernel compile with -jNR_CPUS and TCP_RR running
NR_CPUS pairs of clients/servers in the background with the default
openSUSE Leap kernel config (CONFIG_PREEMPT_NONE) with the two patches
and no tricks done with task priorities. 5 kernel compilations are run
and TCP_RR is shut down when the compilation finishes.

This can be reproduced with the mmtests config
config-multi-kernbench__netperf-tcp-rr-multipair using xfs as the
filesystem for the kernel compilation.

sched-scalewakegran-v2r5: my patch
sched-moveforward-v1r1: Vincent's patch

multi subtest kernbench
                          5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
                             vanilla sched-scalewakegran-v2r5 sched-moveforward-v1r1
Amean     user-80     1518.87 (   0.00%)     1520.34 (  -0.10%)     1518.93 (  -0.00%)
Amean     syst-80      248.57 (   0.00%)      247.74 (   0.33%)      232.93 *   6.29%*
Amean     elsp-80       48.76 (   0.00%)       48.51 (   0.52%)       48.70 (   0.14%)
Stddev    user-80       10.15 (   0.00%)        9.17 (   9.70%)       10.25 (  -0.93%)
Stddev    syst-80        2.83 (   0.00%)        3.02 (  -6.65%)        3.65 ( -28.83%)
Stddev    elsp-80        3.54 (   0.00%)        3.28 (   7.28%)        2.40 (  32.13%)
CoeffVar  user-80        0.67 (   0.00%)        0.60 (   9.79%)        0.67 (  -0.93%)
CoeffVar  syst-80        1.14 (   0.00%)        1.22 (  -7.01%)        1.57 ( -37.48%)
CoeffVar  elsp-80        7.26 (   0.00%)        6.76 (   6.79%)        4.93 (  32.04%)

With either patch, the time to finish compilations is not affected, with
differences in elapsed time being well within the noise.

Meanwhile, netperf tcp-rr running with NR_CPUS pairs showed the
following

multi subtest netperf-tcp-rr
                          5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
                             vanilla sched-scalewakegran-v2r5 sched-moveforward-v1r1
Min       1      32388.28 (   0.00%)    32208.66 (  -0.55%)    31824.54 (  -1.74%)
Hmean     1      39112.22 (   0.00%)    39364.10 (   0.64%)    39552.30 *   1.13%*
Stddev    1       3471.61 (   0.00%)     3357.28 (   3.29%)     3713.97 (  -6.98%)
CoeffVar  1          8.81 (   0.00%)        8.47 (   3.87%)        9.31 (  -5.67%)
Max       1      53019.93 (   0.00%)    51263.38 (  -3.31%)    51263.04 (  -3.31%)

This shows a slightly different picture, with Vincent's patch having a small
impact on netperf tcp-rr. It's noisy and may be subject to test-to-test
variance, but it's a mild concern. A greater concern is that across
all machines, dbench was heavily affected by Vincent's patch even for
relatively low thread counts, which is surprising.

For the same Cascadelake machine both results are from, dbench reports

                          5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
                             vanilla sched-scalewakegran-v2r5 sched-moveforward-v1r1
Amean     1        15.99 (   0.00%)       16.20 *  -1.27%*       16.18 *  -1.16%*
Amean     2        18.43 (   0.00%)       18.34 *   0.50%*       22.72 * -23.28%*
Amean     4        22.32 (   0.00%)       22.06 *   1.14%*       45.86 *-105.52%*
Amean     8        30.58 (   0.00%)       30.22 *   1.18%*       99.04 *-223.88%*
Amean     16       41.79 (   0.00%)       41.68 *   0.25%*      161.09 *-285.52%*
Amean     32       63.45 (   0.00%)       63.16 *   0.45%*      248.13 *-291.09%*
Amean     64      127.81 (   0.00%)      128.50 *  -0.54%*      402.93 *-215.25%*
Amean     128     330.42 (   0.00%)      336.06 *  -1.71%*      531.35 * -60.81%*

That is an excessive impairment. While it varied across machines, there
was some impact on all of them. For a 1-socket Skylake machine, to rule
out NUMA artifacts, I get

dbench4 Loadfile Execution Time
                          5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
                             vanilla sched-scalewakegran-v2r5 sched-moveforward-v1r1
Amean     1        29.51 (   0.00%)       29.45 *   0.21%*       29.58 *  -0.22%*
Amean     2        37.46 (   0.00%)       37.16 *   0.82%*       39.81 *  -6.26%*
Amean     4        51.31 (   0.00%)       51.34 (  -0.04%)       57.14 * -11.35%*
Amean     8        81.77 (   0.00%)       81.65 (   0.15%)       88.68 *  -8.44%*
Amean     64      406.94 (   0.00%)      408.08 *  -0.28%*      433.64 *  -6.56%*
Stddev    1         1.43 (   0.00%)        1.44 (  -0.79%)        1.54 (  -7.45%)

Not as dramatic, but it indicates that we likely do not want to cut off
wakeup preemption too early.

The test was not profiling the time taken to switch tasks, as the
overhead distorts the results.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-04 8:05 ` Mel Gorman @ 2021-10-04 16:37 ` Vincent Guittot 2021-10-05 7:41 ` Mel Gorman 0 siblings, 1 reply; 59+ messages in thread From: Vincent Guittot @ 2021-10-04 16:37 UTC (permalink / raw) To: Mel Gorman Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Mon, 4 Oct 2021 at 10:05, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Mon, Sep 27, 2021 at 04:17:25PM +0200, Mike Galbraith wrote: > > On Mon, 2021-09-27 at 12:17 +0100, Mel Gorman wrote: > > > On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote: > > > > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote: > > > > > > > > > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote: > > > > > > > > > > > > a 100us value should even be enough to fix Mel's problem without > > > > > > impacting common wakeup preemption cases. > > > > > > > > > > It'd be nice if it turn out to be something that simple, but color me > > > > > skeptical. I've tried various preemption throttling schemes, and while > > > > > > > > Let's see what the results will show. I tend to agree that this will > > > > not be enough to cover all use cases and I don't see any other way to > > > > cover all cases than getting some inputs from the threads about their > > > > latency fairness which bring us back to some kind of latency niceness > > > > value > > > > > > > > > > Unfortunately, I didn't get a complete set of results but enough to work > > > with. The missing tests have been requeued. The figures below are based > > > on a single-socket Skylake machine with 8 CPUs as it had the most set of > > > results and is the basic case. > > > > There's something missing, namely how does whatever load you measure > > perform when facing dissimilar competition. 
Instead of only scaling
> > loads running solo from underutilized to heavily over-committed, give
> > them competition. eg something switch heavy, say tbench, TCP_RR et al
> > (latency bound load) pairs=CPUS vs something hefty like make -j CPUS or
> > such.
> >
>
> Ok, that's an interesting test. I've been out intermittently and will be
> for the next few weeks but I managed to automate something that can test
> this. The test runs a kernel compile with -jNR_CPUS and TCP_RR running
> NR_CPUS pairs of clients/servers in the background with the default
> openSUSE Leap kernel config (CONFIG_PREEMPT_NONE) with the two patches
> and no tricks done with task priorities. 5 kernel compilations are run
> and TCP_RR is shut down when the compilation finishes.
>
> This can be reproduced with the mmtests config
> config-multi-kernbench__netperf-tcp-rr-multipair using xfs as the
> filesystem for the kernel compilation.
>
> sched-scalewakegran-v2r5: my patch
> sched-moveforward-v1r1: Vincent's patch

If I'm not wrong, sched-moveforward-v1r1 refers to the 1st version, which
scales with the number of CPUs. We don't want to scale with the number of
CPUs because this can create some quite large non-preemptible durations.
We want to ensure a fixed, small runtime like the last version with 100us.

>
> multi subtest kernbench
>                           5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                              vanilla sched-scalewakegran-v2r5 sched-moveforward-v1r1
> Amean     user-80     1518.87 (   0.00%)     1520.34 (  -0.10%)     1518.93 (  -0.00%)
> Amean     syst-80      248.57 (   0.00%)      247.74 (   0.33%)      232.93 *   6.29%*
> Amean     elsp-80       48.76 (   0.00%)       48.51 (   0.52%)       48.70 (   0.14%)
> Stddev    user-80       10.15 (   0.00%)        9.17 (   9.70%)       10.25 (  -0.93%)
> Stddev    syst-80        2.83 (   0.00%)        3.02 (  -6.65%)        3.65 ( -28.83%)
> Stddev    elsp-80        3.54 (   0.00%)        3.28 (   7.28%)        2.40 (  32.13%)
> CoeffVar  user-80        0.67 (   0.00%)        0.60 (   9.79%)        0.67 (  -0.93%)
> CoeffVar  syst-80        1.14 (   0.00%)        1.22 (  -7.01%)        1.57 ( -37.48%)
> CoeffVar  elsp-80        7.26 (   0.00%)        6.76 (   6.79%)        4.93 (  32.04%)
>
> With either patch, the time to finish compilations is not affected, with
> differences in elapsed time being well within the noise.
>
> Meanwhile, netperf tcp-rr running with NR_CPUS pairs showed the
> following
>
> multi subtest netperf-tcp-rr
>                           5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                              vanilla sched-scalewakegran-v2r5 sched-moveforward-v1r1
> Min       1      32388.28 (   0.00%)    32208.66 (  -0.55%)    31824.54 (  -1.74%)
> Hmean     1      39112.22 (   0.00%)    39364.10 (   0.64%)    39552.30 *   1.13%*
> Stddev    1       3471.61 (   0.00%)     3357.28 (   3.29%)     3713.97 (  -6.98%)
> CoeffVar  1          8.81 (   0.00%)        8.47 (   3.87%)        9.31 (  -5.67%)
> Max       1      53019.93 (   0.00%)    51263.38 (  -3.31%)    51263.04 (  -3.31%)
>
> This shows a slightly different picture, with Vincent's patch having a small
> impact on netperf tcp-rr. It's noisy and may be subject to test-to-test
> variance, but it's a mild concern. A greater concern is that across
> all machines, dbench was heavily affected by Vincent's patch even for
> relatively low thread counts, which is surprising.
>
> For the same Cascadelake machine both results are from, dbench reports
>
>                           5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                              vanilla sched-scalewakegran-v2r5 sched-moveforward-v1r1
> Amean     1        15.99 (   0.00%)       16.20 *  -1.27%*       16.18 *  -1.16%*
> Amean     2        18.43 (   0.00%)       18.34 *   0.50%*       22.72 * -23.28%*
> Amean     4        22.32 (   0.00%)       22.06 *   1.14%*       45.86 *-105.52%*
> Amean     8        30.58 (   0.00%)       30.22 *   1.18%*       99.04 *-223.88%*
> Amean     16       41.79 (   0.00%)       41.68 *   0.25%*      161.09 *-285.52%*
> Amean     32       63.45 (   0.00%)       63.16 *   0.45%*      248.13 *-291.09%*
> Amean     64      127.81 (   0.00%)      128.50 *  -0.54%*      402.93 *-215.25%*
> Amean     128     330.42 (   0.00%)      336.06 *  -1.71%*      531.35 * -60.81%*
>
> That is an excessive impairment. While it varied across machines, there
> was some impact on all of them. For a 1-socket Skylake machine, to rule
> out NUMA artifacts, I get

As mentioned above, the 1st version, which uses
sysctl_sched_min_granularity, is not the best solution because it scales
the duration with the number of CPUs on the system. The "move forward"
duration should be fixed and small, like the last proposal with 100us.
It would be interesting to see results for this last version instead.

> dbench4 Loadfile Execution Time
>                           5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                              vanilla sched-scalewakegran-v2r5 sched-moveforward-v1r1
> Amean     1        29.51 (   0.00%)       29.45 *   0.21%*       29.58 *  -0.22%*
> Amean     2        37.46 (   0.00%)       37.16 *   0.82%*       39.81 *  -6.26%*
> Amean     4        51.31 (   0.00%)       51.34 (  -0.04%)       57.14 * -11.35%*
> Amean     8        81.77 (   0.00%)       81.65 (   0.15%)       88.68 *  -8.44%*
> Amean     64      406.94 (   0.00%)      408.08 *  -0.28%*      433.64 *  -6.56%*
> Stddev    1         1.43 (   0.00%)        1.44 (  -0.79%)        1.54 (  -7.45%)
>
> Not as dramatic, but it indicates that we likely do not want to cut off
> wakeup preemption too early.
>
> The test was not profiling the time taken to switch tasks, as the
> overhead distorts the results.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-04 16:37 ` Vincent Guittot @ 2021-10-05 7:41 ` Mel Gorman 0 siblings, 0 replies; 59+ messages in thread From: Mel Gorman @ 2021-10-05 7:41 UTC (permalink / raw) To: Vincent Guittot Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Mon, Oct 04, 2021 at 06:37:02PM +0200, Vincent Guittot wrote: > On Mon, 4 Oct 2021 at 10:05, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Mon, Sep 27, 2021 at 04:17:25PM +0200, Mike Galbraith wrote: > > > On Mon, 2021-09-27 at 12:17 +0100, Mel Gorman wrote: > > > > On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote: > > > > > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote: > > > > > > > > > > > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote: > > > > > > > > > > > > > > a 100us value should even be enough to fix Mel's problem without > > > > > > > impacting common wakeup preemption cases. > > > > > > > > > > > > It'd be nice if it turn out to be something that simple, but color me > > > > > > skeptical. I've tried various preemption throttling schemes, and while > > > > > > > > > > Let's see what the results will show. I tend to agree that this will > > > > > not be enough to cover all use cases and I don't see any other way to > > > > > cover all cases than getting some inputs from the threads about their > > > > > latency fairness which bring us back to some kind of latency niceness > > > > > value > > > > > > > > > > > > > Unfortunately, I didn't get a complete set of results but enough to work > > > > with. The missing tests have been requeued. The figures below are based > > > > on a single-socket Skylake machine with 8 CPUs as it had the most set of > > > > results and is the basic case. > > > > > > There's something missing, namely how does whatever load you measure > > > perform when facing dissimilar competition. 
Instead of only scaling
> > > loads running solo from underutilized to heavily over-committed, give
> > > them competition. eg something switch heavy, say tbench, TCP_RR et al
> > > (latency bound load) pairs=CPUS vs something hefty like make -j CPUS or
> > > such.
> > >
> >
> > Ok, that's an interesting test. I've been out intermittently and will be
> > for the next few weeks but I managed to automate something that can test
> > this. The test runs a kernel compile with -jNR_CPUS and TCP_RR running
> > NR_CPUS pairs of clients/servers in the background with the default
> > openSUSE Leap kernel config (CONFIG_PREEMPT_NONE) with the two patches
> > and no tricks done with task priorities. 5 kernel compilations are run
> > and TCP_RR is shut down when the compilation finishes.
> >
> > This can be reproduced with the mmtests config
> > config-multi-kernbench__netperf-tcp-rr-multipair using xfs as the
> > filesystem for the kernel compilation.
> >
> > sched-scalewakegran-v2r5: my patch
> > sched-moveforward-v1r1: Vincent's patch
>
> If I'm not wrong, sched-moveforward-v1r1 refers to the 1st version, which
> scales with the number of CPUs. We don't want to scale with the number of
> CPUs because this can create some quite large non-preemptible durations.
> We want to ensure a fixed, small runtime like the last version with 100us.
>

It was a modified version based on feedback that limited the extent to
which preemption would be disabled. It was still based on h_nr_running as
a basis for comparison.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..964f76a95c04 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -84,6 +84,14 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
 
+/*
+ * This value is kept at sysctl_sched_latency / sysctl_sched_wakeup_granularity
+ *
+ * This influences the decision on whether a waking task can preempt a running
+ * task.
+ */
+static unsigned int sched_nr_disable_gran = 6;
+
 int sched_thermal_decay_shift;
 static int __init setup_sched_thermal_decay_shift(char *str)
 {
@@ -627,6 +635,9 @@ int sched_update_scaling(void)
 	sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
 					sysctl_sched_min_granularity);
 
+	sched_nr_disable_gran = DIV_ROUND_UP(sysctl_sched_latency,
+					sysctl_sched_wakeup_granularity);
+
 #define WRT_SYSCTL(name) \
 	(normalized_sysctl_##name = sysctl_##name / (factor))
 	WRT_SYSCTL(sched_min_granularity);
@@ -4511,7 +4522,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
+wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
+		      struct sched_entity *se);
 
 /*
  * Pick the next process, keeping these things in mind, in this order:
@@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 				second = curr;
 		}
 
-		if (second && wakeup_preempt_entity(second, left) < 1)
+		if (second && wakeup_preempt_entity(NULL, second, left) < 1)
 			se = second;
 	}
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
 		/*
 		 * Someone really wants this to run. If it's not unfair, run it.
 		 */
 		se = cfs_rq->next;
-	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
+	} else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
 		/*
 		 * Prefer last buddy, try to return the CPU to a preempted task.
 		 */
@@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
-static unsigned long wakeup_gran(struct sched_entity *se)
+static unsigned long
+select_wakeup_gran(struct cfs_rq *cfs_rq)
+{
+	unsigned int nr_running, threshold;
+
+	if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN))
+		return sysctl_sched_wakeup_granularity;
+
+	/* !GENTLE_FAIR_SLEEPERS has one overload threshold. */
+	if (!sched_feat(GENTLE_FAIR_SLEEPERS)) {
+		if (cfs_rq->h_nr_running <= sched_nr_disable_gran)
+			return sysctl_sched_wakeup_granularity;
+
+		return sysctl_sched_latency;
+	}
+
+	/* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */
+	nr_running = cfs_rq->h_nr_running;
+	threshold = sched_nr_disable_gran >> 1;
+
+	/* No overload. */
+	if (nr_running <= threshold)
+		return sysctl_sched_wakeup_granularity;
+
+	/* Light overload. */
+	if (nr_running <= sched_nr_disable_gran)
+		return sysctl_sched_latency >> 1;
+
+	/* Heavy overload. */
+	return sysctl_sched_latency;
+}
+
+static unsigned long
+wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	unsigned long gran = sysctl_sched_wakeup_granularity;
+	unsigned long gran = select_wakeup_gran(cfs_rq);
 
 	/*
 	 * Since its curr running now, convert the gran from real-time
@@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
  *
  */
 static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
+wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
+		      struct sched_entity *se)
 {
 	s64 gran, vdiff = curr->vruntime - se->vruntime;
 
 	if (vdiff <= 0)
 		return -1;
 
-	gran = wakeup_gran(se);
+	gran = wakeup_gran(cfs_rq, se);
 	if (vdiff > gran)
 		return 1;
 
@@ -7190,8 +7236,9 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	if (cse_is_idle != pse_is_idle)
 		return;
 
-	update_curr(cfs_rq_of(se));
-	if (wakeup_preempt_entity(se, pse) == 1) {
+	cfs_rq = cfs_rq_of(se);
+	update_curr(cfs_rq);
+	if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
 		 * triggering this preemption.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7f8dace0964c..d041d7023029 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -95,3 +95,9 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+/*
+ * Scale sched_wakeup_granularity dynamically based on the number of running
+ * tasks up to a cap of sysctl_sched_latency.
+ */
+SCHED_FEAT(SCALE_WAKEUP_GRAN, true)

^ permalink raw reply related	[flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-27 11:17 ` Mel Gorman 2021-09-27 14:17 ` Mike Galbraith @ 2021-09-27 14:19 ` Vincent Guittot 2021-09-27 15:02 ` Mel Gorman 1 sibling, 1 reply; 59+ messages in thread From: Vincent Guittot @ 2021-09-27 14:19 UTC (permalink / raw) To: Mel Gorman Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Mon, 27 Sept 2021 at 13:17, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote: > > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote: > > > > > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote: > > > > > > > > a 100us value should even be enough to fix Mel's problem without > > > > impacting common wakeup preemption cases. > > > > > > It'd be nice if it turn out to be something that simple, but color me > > > skeptical. I've tried various preemption throttling schemes, and while > > > > Let's see what the results will show. I tend to agree that this will > > not be enough to cover all use cases and I don't see any other way to > > cover all cases than getting some inputs from the threads about their > > latency fairness which bring us back to some kind of latency niceness > > value > > > > Unfortunately, I didn't get a complete set of results but enough to work > with. The missing tests have been requeued. The figures below are based > on a single-socket Skylake machine with 8 CPUs as it had the most set of > results and is the basic case. 
>
> The reported kernels are
>
> vanilla: vanilla 5.15-rc1
> sched-scalewakegran-v2r4: My patch
> sched-moveforward-v1r1: Vincent's patch

I imagine these are the results for the 1st version, which scales
with the number of CPUs.

>
> hackbench-process-pipes
>                           5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                              vanilla sched-scalewakegran-v2r4 sched-moveforward-v1r1
> Amean     1        0.3253 (   0.00%)      0.3330 (  -2.36%)      0.3257 (  -0.10%)
> Amean     4        0.8300 (   0.00%)      0.7570 (   8.80%)      0.7560 (   8.92%)
> Amean     7        1.1003 (   0.00%)      1.1457 *  -4.12%*      1.1163 (  -1.45%)
> Amean     12       1.7263 (   0.00%)      1.6393 *   5.04%*      1.5963 *   7.53%*
> Amean     21       3.0063 (   0.00%)      2.6590 *  11.55%*      2.4487 *  18.55%*
> Amean     30       4.2323 (   0.00%)      3.5657 *  15.75%*      3.3410 *  21.06%*
> Amean     48       6.5657 (   0.00%)      5.4180 *  17.48%*      5.0857 *  22.54%*
> Amean     79      10.4867 (   0.00%)      8.4357 *  19.56%*      7.9563 *  24.13%*
> Amean     110     14.8880 (   0.00%)     11.0423 *  25.83%*     10.7407 *  27.86%*
> Amean     141     19.2083 (   0.00%)     14.0820 *  26.69%*     13.3780 *  30.35%*
> Amean     172     23.4847 (   0.00%)     16.9880 *  27.66%*     16.4293 *  30.04%*
> Amean     203     27.3763 (   0.00%)     20.2480 *  26.04%*     19.6430 *  28.25%*
> Amean     234     31.3707 (   0.00%)     23.2477 *  25.89%*     22.8287 *  27.23%*
> Amean     265     35.4663 (   0.00%)     26.2483 *  25.99%*     25.8683 *  27.06%*
> Amean     296     39.2380 (   0.00%)     29.4237 *  25.01%*     28.8727 *  26.42%*
>
> For hackbench, either Vincent's patch or mine has a similar impact.
>
> tbench4
>                          5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                             vanilla sched-scalewakegran-v2r4 sched-moveforward-v1r1
> Hmean     1       598.71 (   0.00%)      608.31 *   1.60%*      586.05 *  -2.11%*
> Hmean     2      1096.74 (   0.00%)     1110.07 *   1.22%*     1106.70 *   0.91%*
> Hmean     4      1529.35 (   0.00%)     1531.20 *   0.12%*     1551.11 *   1.42%*
> Hmean     8      2824.32 (   0.00%)     2847.96 *   0.84%*     2684.21 *  -4.96%*
> Hmean     16     2573.30 (   0.00%)     2591.77 *   0.72%*     2445.41 *  -4.97%*
> Hmean     32     2518.77 (   0.00%)     2532.70 *   0.55%*     2409.30 *  -4.35%*
>
> For tbench, it's ok for lower thread counts; for 8 threads (machine
> overloaded), Vincent's patch regresses slightly. With these test runs,
> I don't have detailed information as to why, but the most likely
> explanation is that preemption gets disabled prematurely.
>
> specjbb
>                           5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                              vanilla sched-scalewakegran-v2r4 sched-moveforward-v1r1
> Hmean     tput-1     71199.00 (   0.00%)    69492.00 *  -2.40%*    71126.00 *  -0.10%*
> Hmean     tput-2    154478.00 (   0.00%)   146060.00 *  -5.45%*   153073.00 *  -0.91%*
> Hmean     tput-3    211889.00 (   0.00%)   209386.00 *  -1.18%*   219434.00 *   3.56%*
> Hmean     tput-4    257842.00 (   0.00%)   248012.00 *  -3.81%*   262903.00 *   1.96%*
> Hmean     tput-5    253506.00 (   0.00%)   242511.00 *  -4.34%*   250828.00 *  -1.06%*
> Hmean     tput-6    246202.00 (   0.00%)   236480.00 *  -3.95%*   244236.00 *  -0.80%*
> Hmean     tput-7    241133.00 (   0.00%)   230905.00 *  -4.24%*   237619.00 *  -1.46%*
> Hmean     tput-8    237983.00 (   0.00%)   230010.00 *  -3.35%*   235275.00 *  -1.14%*
>
> For specjbb, the picture is different again: Vincent's patch is better for
> the overloaded case, but both patches show light regressions.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-27 14:19 ` Vincent Guittot @ 2021-09-27 15:02 ` Mel Gorman 0 siblings, 0 replies; 59+ messages in thread From: Mel Gorman @ 2021-09-27 15:02 UTC (permalink / raw) To: Vincent Guittot Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Mon, Sep 27, 2021 at 04:19:00PM +0200, Vincent Guittot wrote: > On Mon, 27 Sept 2021 at 13:17, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote: > > > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote: > > > > > > > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote: > > > > > > > > > > a 100us value should even be enough to fix Mel's problem without > > > > > impacting common wakeup preemption cases. > > > > > > > > It'd be nice if it turn out to be something that simple, but color me > > > > skeptical. I've tried various preemption throttling schemes, and while > > > > > > Let's see what the results will show. I tend to agree that this will > > > not be enough to cover all use cases and I don't see any other way to > > > cover all cases than getting some inputs from the threads about their > > > latency fairness which bring us back to some kind of latency niceness > > > value > > > > > > > Unfortunately, I didn't get a complete set of results but enough to work > > with. The missing tests have been requeued. The figures below are based > > on a single-socket Skylake machine with 8 CPUs as it had the most set of > > results and is the basic case. > > > > The reported kernels are > > > > vanilla: vanilla 5.15-rc1 > > sched-scalewakegran-v2r4: My patch > > sched-moveforward-v1r1: Vincent's patch > > I imagine that this is the results for the 1st version which scales > with the number of CPUs > Yes, the v1r5 results were incomplete and had to be requeued. 
-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-23 8:40 ` Vincent Guittot 2021-09-23 9:21 ` Mike Galbraith @ 2021-09-23 12:24 ` Phil Auld 2021-10-05 10:36 ` Peter Zijlstra 1 sibling, 1 reply; 59+ messages in thread From: Phil Auld @ 2021-09-23 12:24 UTC (permalink / raw) To: Vincent Guittot Cc: Mike Galbraith, Mel Gorman, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Thu, Sep 23, 2021 at 10:40:48AM +0200 Vincent Guittot wrote: > On Thu, 23 Sept 2021 at 03:47, Mike Galbraith <efault@gmx.de> wrote: > > > > On Wed, 2021-09-22 at 20:22 +0200, Vincent Guittot wrote: > > > On Wed, 22 Sept 2021 at 19:38, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > > > > > > > I'm not seeing an alternative suggestion that could be turned into > > > > an implementation. The current value for sched_wakeup_granularity > > > > was set 12 years ago was exposed for tuning which is no longer > > > > the case. The intent was to allow some dynamic adjustment between > > > > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce > > > > over-scheduling in the worst case without disabling preemption entirely > > > > (which the first version did). > > > > I don't think those knobs were ever _intended_ for general purpose > > tuning, but they did get used that way by some folks. > > > > > > > > > > Should we just ignore this problem and hope it goes away or just let > > > > people keep poking silly values into debugfs via tuned? > > > > > > We should certainly not add a bandaid because people will continue to > > > poke silly value at the end. And increasing > > > sysctl_sched_wakeup_granularity based on the number of running threads > > > is not the right solution. > > > > Watching my desktop box stack up large piles of very short running > > threads, I agree, instantaneous load looks like a non-starter. 
> > > > > According to the description of your > > > problem that the current task doesn't get enough time to move forward, > > > sysctl_sched_min_granularity should be part of the solution. Something > > > like below will ensure that current got a chance to move forward > > > > Nah, progress is guaranteed, the issue is a zillion very similar short > > running threads preempting each other with no win to be had, thus > > spending cycles in the scheduler that are utterly wasted. It's a valid > > issue, trouble is teaching the scheduler to recognize that situation > > without mucking up other situations where there IS a win for even very > > short running threads say, doing a synchronous handoff; preemption is > > cheaper than scheduling off if the waker is going be awakened again in > > very short order. > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 9bf540f04c2d..39d4e4827d3d 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -7102,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq, > > > struct task_struct *p, int wake_ > > > int scale = cfs_rq->nr_running >= sched_nr_latency; > > > int next_buddy_marked = 0; > > > int cse_is_idle, pse_is_idle; > > > + unsigned long delta_exec; > > > > > > if (unlikely(se == pse)) > > > return; > > > @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq, > > > struct task_struct *p, int wake_ > > > return; > > > > > > update_curr(cfs_rq_of(se)); > > > + delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime; > > > + /* > > > + * Ensure that current got a chance to move forward > > > + */ > > > + if (delta_exec < sysctl_sched_min_granularity) > > > + return; > > > + > > > if (wakeup_preempt_entity(se, pse) == 1) { > > > /* > > > * Bias pick_next to pick the sched entity that is > > > > Yikes! If you do that, you may as well go the extra nanometer and rip > > wakeup preemption out entirely, same result, impressive diffstat. 
> > This patch is mainly there to show that there are other ways to ensure > progress without using some load heuristic. > sysctl_sched_min_granularity has the problem of scaling with the > number of cpus and this can generate large values. At least we should > use the normalized_sysctl_sched_min_granularity or even a smaller > value but wakeup preemption still happens with this change. It only > ensures that we don't waste time preempting each other without any > chance to do actual stuff. > It's capped at 8 cpus, which is pretty easy to reach these days, so the values don't get too large. That scaling is almost a no-op these days. Cheers, Phil > a 100us value should even be enough to fix Mel's problem without > impacting common wakeup preemption cases. > > > > > > -Mike > -- ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-23 12:24 ` Phil Auld @ 2021-10-05 10:36 ` Peter Zijlstra 2021-10-05 14:12 ` Phil Auld 0 siblings, 1 reply; 59+ messages in thread From: Peter Zijlstra @ 2021-10-05 10:36 UTC (permalink / raw) To: Phil Auld Cc: Vincent Guittot, Mike Galbraith, Mel Gorman, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Thu, Sep 23, 2021 at 08:24:03AM -0400, Phil Auld wrote: > It's capped at 8 cpus, which is pretty easy to reach these days, so the > values don't get too large. That scaling is almost a no-op these days. https://lkml.kernel.org/r/YVwdrh5pg0zSv2/b@hirez.programming.kicks-ass.net Ooh, hey, we already fixed that :-) So the reasoning there is that if the values get too big, interactivity gets *really* bad, but if you go from say 1 to 4 CPUs, interactivity can improve due to being able to run on other CPUs. At 8 CPUs we end up at 6ms*4=24ms, which is already pretty terrible.
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-05 10:36 ` Peter Zijlstra @ 2021-10-05 14:12 ` Phil Auld 2021-10-05 14:32 ` Peter Zijlstra 0 siblings, 1 reply; 59+ messages in thread From: Phil Auld @ 2021-10-05 14:12 UTC (permalink / raw) To: Peter Zijlstra Cc: Vincent Guittot, Mike Galbraith, Mel Gorman, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, Oct 05, 2021 at 12:36:22PM +0200 Peter Zijlstra wrote: > On Thu, Sep 23, 2021 at 08:24:03AM -0400, Phil Auld wrote: > > > It's capped at 8 cpus, which is pretty easy to reach these days, so the > > values don't get too large. That scaling is almost a no-op these days. > > https://lkml.kernel.org/r/YVwdrh5pg0zSv2/b@hirez.programming.kicks-ass.net > > Ooh, hey, we already fixed that :-) > Thanks Peter. I'm always a little behind upstream (nature of the job :) That link leads to a message Id not found. But from what I can see the code that takes the min of online cpus and 8 is still present. > So the reasoning there is that if the values get too big, interactiviy > get *really* bad, but if you go from say 1 to 4 CPUs, interactivity can > improve due to being able to run on other CPUs. > > At 8 CPUs we end up at 6ms*4=24ms, which is already pretty terrible. > And actually you mention the same thing later on. Most systems, even desktops, have 8+ cpus these days so the scaling is mostly not doing anything except multiplying by 4, right? So no-op was not the right way to describe it maybe. But it's not getting bigger with larger numbers of cpus beyond a pretty commonly reached limit. Cheers, Phil -- ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-05 14:12 ` Phil Auld @ 2021-10-05 14:32 ` Peter Zijlstra 0 siblings, 0 replies; 59+ messages in thread From: Peter Zijlstra @ 2021-10-05 14:32 UTC (permalink / raw) To: Phil Auld Cc: Vincent Guittot, Mike Galbraith, Mel Gorman, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, Oct 05, 2021 at 10:12:12AM -0400, Phil Auld wrote: > On Tue, Oct 05, 2021 at 12:36:22PM +0200 Peter Zijlstra wrote: > > On Thu, Sep 23, 2021 at 08:24:03AM -0400, Phil Auld wrote: > > > > > It's capped at 8 cpus, which is pretty easy to reach these days, so the > > > values don't get too large. That scaling is almost a no-op these days. > > > > https://lkml.kernel.org/r/YVwdrh5pg0zSv2/b@hirez.programming.kicks-ass.net > > > > Ooh, hey, we already fixed that :-) > > > > Thanks Peter. > > I'm always a little behind upstream (nature of the job :) > > That link leads to a message Id not found. https://lore.kernel.org/all/YVwblBZ9JBn9vvVr@hirez.programming.kicks-ass.net/T/#u Seems to work, I must've messed up the copy/paste or something. > But from what I can see the code that takes the min of online cpus and > 8 is still present. Yes, and it should be. I was the confused one. I forgot we added it and suggested we should add it again :-) > > So the reasoning there is that if the values get too big, interactiviy > > get *really* bad, but if you go from say 1 to 4 CPUs, interactivity can > > improve due to being able to run on other CPUs. > > > > At 8 CPUs we end up at 6ms*4=24ms, which is already pretty terrible. > > > > And actually you mention the same thing later on. Most systems, even > desktops, have 8+ cpus these days so the scaling is mostly not doing > anything except multiplying by 4, right? So no-op was not the right > way to describe it maybe. But it's not getting bigger with larger > numbers of cpus beyond a pretty commonly reached limit. 
Yeah, the whole scaling thing is of dubious value these days, the whole 1-8 range is for embedded stuff these days, I mean, only low-end phones are maybe even still in that range -- oh and my laptop.. :/
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-22 18:22 ` Vincent Guittot 2021-09-22 18:57 ` Mel Gorman 2021-09-23 1:47 ` Mike Galbraith @ 2021-10-05 10:28 ` Peter Zijlstra 2 siblings, 0 replies; 59+ messages in thread From: Peter Zijlstra @ 2021-10-05 10:28 UTC (permalink / raw) To: Vincent Guittot Cc: Mel Gorman, Mike Galbraith, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, Sep 22, 2021 at 08:22:43PM +0200, Vincent Guittot wrote: > @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq, > struct task_struct *p, int wake_ > return; > > update_curr(cfs_rq_of(se)); > + delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime; > + /* > + * Ensure that current got a chance to move forward > + */ > + if (delta_exec < sysctl_sched_min_granularity) > + return; > + I think we tried that at some point; IIRC the problem with this is that if the interactive task fails to preempt, that preemption is lost. IOW interactivity suffers. Basically if you don't want wake-up preemptions, use SCHED_BATCH, then those tasks will not preempt one another, but the SCHED_NORMAL tasks will preempt them.
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-22 17:38 ` Mel Gorman 2021-09-22 18:22 ` Vincent Guittot @ 2021-10-05 10:23 ` Peter Zijlstra 1 sibling, 0 replies; 59+ messages in thread From: Peter Zijlstra @ 2021-10-05 10:23 UTC (permalink / raw) To: Mel Gorman Cc: Vincent Guittot, Mike Galbraith, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, Sep 22, 2021 at 06:38:53PM +0100, Mel Gorman wrote: > > I'm not seeing an alternative suggestion that could be turned into > an implementation. The current value for sched_wakeup_granularity > was set 12 years ago was exposed for tuning which is no longer (it has always been SCHED_DEBUG) > the case. The intent was to allow some dynamic adjustment between > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce > over-scheduling in the worst case without disabling preemption entirely > (which the first version did). > > Should we just ignore this problem and hope it goes away or just let > people keep poking silly values into debugfs via tuned? People are going to do stupid no matter what (and tuned is very much included there -- poking random numbers in debug interfaces without doing analysis on what the workload does and needs is just wasting everybody's time. RHT really should know better than that). I'm also thinking that adding more heuristics isn't going to improve the situation lots. For the past # of years people have been talking about extending the task model for SCHED_NORMAL, latency_nice was one such proposal. If we really want to fix this proper, and not make a bigger mess of things, we should look at all these various workloads and identify *what* specifically they want and *why*. Once we have this enumerated, we can look at what exactly we can provide and how to structure the interface. The extension must be hint only, we should be free to completely ignore it. 
The things I can think of off the top of my head are:

 - tail latency; prepared to waste time to increase the odds of running sooner. Possible effect: have this task always do a full select_idle_sibling() scan. (there's also the anti case, which I'm not sure how to enumerate, basically they don't want select_idle_sibling(), just place the task wherever)
 - non-interactive; doesn't much care about wakeup latency; can suffer packing?
 - background; (implies non-interactive?) doesn't much care about completion time either, just cares about efficiency
 - interactive; cares much about wakeup-latency; cares less about throughput.
 - (energy) efficient; cares more about energy usage than performance

Now, we already have SCHED_BATCH that completely kills wakeup preemption and SCHED_IDLE lets everything preempt. I know SCHED_IDLE is in active use (because we see patches for that), I don't think people use SCHED_BATCH much, should they?

Now, I think a number of constraints are going to be mutually exclusive, and if we get such tasks combined there's just nothing much we can do about it (and since it's all hints, we're good). We should also not make the load-balance condition too complicated, we don't want it to try and (badly) solve a large set of constraints. Again, if possible respect the hints, otherwise tough luck.
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-22 15:04 ` Mel Gorman 2021-09-22 16:00 ` Vincent Guittot @ 2021-10-05 9:41 ` Peter Zijlstra 1 sibling, 0 replies; 59+ messages in thread From: Peter Zijlstra @ 2021-10-05 9:41 UTC (permalink / raw) To: Mel Gorman Cc: Vincent Guittot, Mike Galbraith, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, Sep 22, 2021 at 04:04:57PM +0100, Mel Gorman wrote: > On Wed, Sep 22, 2021 at 04:15:27PM +0200, Vincent Guittot wrote: > The reason I used SCHED_TUNABLESCALING_NONE for the changelog is that > the exact values depend on the number of CPUs so values are not even > the same across the range of machines I'm using. sysctl_sched_latency, > sysctl_sched_min_granularity sysctl_sched_wakeup_granularity are all > scaled but the ratios remain constant. It might make sense to reconsider the whole scaling thing FWIW. It used to be that desktop systems had a 'small' number of CPUs (<=8) and servers had more. But today it's not uncommon to have 32-64 CPUs in a desktop system. And even LOG scaling gets us into stupid numbers for the latencies.
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-22 14:15 ` Vincent Guittot 2021-09-22 15:04 ` Mel Gorman @ 2021-09-22 15:05 ` Vincent Guittot 1 sibling, 0 replies; 59+ messages in thread From: Vincent Guittot @ 2021-09-22 15:05 UTC (permalink / raw) To: Mel Gorman Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, 22 Sept 2021 at 16:15, Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Wed, 22 Sept 2021 at 15:20, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Wed, Sep 22, 2021 at 07:22:20AM +0200, Mike Galbraith wrote: > > > On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote: > > > > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote: > > > > > > > > > > > > > Preemption does rapidly run into diminishing return as load climbs for > > > > > a lot of loads, but as you know, it's a rather sticky wicket because > > > > > even when over-committed, preventing light control threads from slicing > > > > > through (what can be a load's own work crew of) hogs can seriously > > > > > injure performance. > > > > > > > > > > > > > Turning this into a classic Rob Peter To Pay Paul problem. We don't know > > > > if there is a light control thread that needs to run or not that affects > > > > overall performance. It all depends on whether that control thread needs > > > > to make progress for the overall workload or whether there are a mix of > > > > workloads resulting in overloading. > > > > > > WRT overload, and our good buddies Peter and Paul :) I added... > > > if (gran >= sysctl_sched_latency >> 1) > > > trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running); > > > ...to watch, and met the below when I.. logged in. > > > > > > homer:..debug/tracing # tail -20 trace > > > X-2229 [002] d..5. 60.690322: wakeup_gran: runnable:9 preempt disabled > > > X-2229 [002] d..5. 
60.690325: wakeup_gran: runnable:10 preempt disabled > > > X-2229 [002] d..5. 60.690330: wakeup_gran: runnable:11 preempt disabled > > > X-2229 [002] d..5. 60.690363: wakeup_gran: runnable:13 preempt disabled > > > X-2229 [002] d..5. 60.690377: wakeup_gran: runnable:14 preempt disabled > > > X-2229 [002] d..5. 60.690390: wakeup_gran: runnable:15 preempt disabled > > > X-2229 [002] d..5. 60.690404: wakeup_gran: runnable:16 preempt disabled > > > X-2229 [002] d..5. 60.690425: wakeup_gran: runnable:9 preempt disabled > > > ksmserver-2694 [003] d..3. 60.690432: wakeup_gran: runnable:6 preempt disabled > > > ksmserver-2694 [003] d..3. 60.690436: wakeup_gran: runnable:7 preempt disabled > > > X-2229 [002] d..5. 60.690451: wakeup_gran: runnable:6 preempt disabled > > > X-2229 [002] d..5. 60.690465: wakeup_gran: runnable:7 preempt disabled > > > kmix-2736 [000] d..3. 60.690491: wakeup_gran: runnable:6 preempt disabled > > > X-2229 [004] d..5. 92.889635: wakeup_gran: runnable:6 preempt disabled > > > X-2229 [004] d..5. 92.889675: wakeup_gran: runnable:6 preempt disabled > > > X-2229 [004] d..5. 92.889863: wakeup_gran: runnable:6 preempt disabled > > > X-2229 [004] d..5. 92.889944: wakeup_gran: runnable:6 preempt disabled > > > X-2229 [004] d..5. 92.889957: wakeup_gran: runnable:7 preempt disabled > > > X-2229 [004] d..5. 92.889968: wakeup_gran: runnable:8 preempt disabled > > > QXcbEventQueue-2740 [000] d..4. 92.890025: wakeup_gran: runnable:6 preempt disabled > > > homer:..debug/tracing > > > > > > Watching 'while sleep 1; do clear;tail trace; done' with nothing but a > > > kbuild running is like watching top. There's enough stacking during > > > routine use of my desktop box that it runs into the tick granularity > > > wall pretty much continuously, so 'overload' may want redefining. > > > > > > > Ok, that's pretty convincing. You didn't mention if there were > > interactivity glitches but it's possible. This is what I'm currently > > testing but have no results for yet. 
It caps wakeup_gran at > > sysctl_sched_latency. > > > > ---8<--- > > sched/fair: Scale wakeup granularity relative to nr_running > > > > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved > > the kernel.sched_wakeup_granularity_ns sysctl under debugfs. One of the > > reasons why this sysctl may be used is for "optimising for throughput", > > particularly when overloaded. The tool TuneD sometimes alters this for two > > profiles e.g. "mssql" and "throughput-performance". At least version 2.9 > > does but it changed in master where it also will poke at debugfs instead. > > > > Internal parameters like sysctl_sched_wakeup_granularity are scaled > > based on the number of CPUs due to sysctl_sched_tunable_scaling. For > > simplicity, the timing figures in this changelog are based on > > SCHED_TUNABLESCALING_NONE. > > This is a bit misleading because the platform that you used to > highlight the problem has a 7ms sysctl_sched_wakeup_granularity, which > is far more than your tick, which should be 1ms > > > > > During task migration or wakeup, a decision is made on whether > > to preempt the current task or not. To limit over-scheduling, > > sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms > > of runtime before preempting. However, when a domain is heavily overloaded > > (e.g. hackbench), the degree of over-scheduling is still severe. This is > > problematic as time is wasted rescheduling tasks that could instead be > > used by userspace tasks. > > > > However, care must be taken. Even if a system is overloaded, there may > > be high priority threads that must still be able to run. Mike Galbraith > > explained the constraints as follows; > > > > CFS came about because the O1 scheduler was unfair to the > > point it had starvation problems. People pretty much across the > > board agreed that a fair scheduler was a much better way > > to go, and CFS was born. 
It didn't originally have the sleep > > credit business, but had to grow it to become _short term_ fair. > > Ingo cut the sleep credit in half because of overscheduling, and > > that has worked out pretty well all told.. but now you're pushing > > it more in the unfair direction, all the way to extremely unfair > > for anything and everything very light. > > > > Fairness isn't the holy grail mind you, and at some point, giving > > up on short term fairness certainly isn't crazy, as proven by your > > hackbench numbers and other numbers we've seen over the years, > > but taking bites out of the 'CF' in the CFS that was born to be a > > corner-case killer is.. worrisome. The other shoe will drop.. it > > always does :) > > > > This patch scales the wakeup granularity based on the number of running > > tasks on the CPU relative to > > > > sched_nr_disable_gran = sysctl_sched_latency / sysctl_sched_wakeup_granularity > > > > By default, this will allow the wakeup_gran to scale from > > sysctl_sched_wakeup_granularity up to > > sysctl_sched_latency depending on the number of running tasks on a cfs_rq. > > By default, the limit is 6ms. > > > > Note that the TuneD throughput-performance profile allows up to 15ms > > for sysctl_sched_latency (ignoring scaling) but there is no explanation > > why such a long period was necessary or why sched_latency_ns is also > > not adjusted. The intent may have been to disable wakeup preemption > > or it might be an oversight. An internet search for instances where > > the sysctl_sched_wakeup_granularity parameter is tuned to high values offers > > either no explanation or a broken one. 
> > > > TODO: Results positive or negative > > > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > > --- > > kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++++------- > > kernel/sched/features.h | 6 +++++ > > 2 files changed, 61 insertions(+), 9 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index ff69f245b939..5ec3b12039d6 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -84,6 +84,14 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL; > > > > const_debug unsigned int sysctl_sched_migration_cost = 500000UL; > > > > +/* > > + * This value is kept at sysctl_sched_latency / sysctl_sched_wakeup_granularity > > + * > > + * This influences the decision on whether a waking task can preempt a running > > + * task. > > + */ > > +static unsigned int sched_nr_disable_gran = 6; > > + > > int sched_thermal_decay_shift; > > static int __init setup_sched_thermal_decay_shift(char *str) > > { > > @@ -627,6 +635,9 @@ int sched_update_scaling(void) > > sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency, > > sysctl_sched_min_granularity); > > > > + sched_nr_disable_gran = DIV_ROUND_UP(sysctl_sched_latency, > > + sysctl_sched_wakeup_granularity); > > + > > #define WRT_SYSCTL(name) \ > > (normalized_sysctl_##name = sysctl_##name / (factor)) > > WRT_SYSCTL(sched_min_granularity); > > @@ -4511,7 +4522,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) > > } > > > > static int > > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se); > > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr, > > + struct sched_entity *se); > > > > /* > > * Pick the next process, keeping these things in mind, in this order: > > @@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr) > > second = curr; > > } > > > > - if (second && wakeup_preempt_entity(second, left) < 1) > > + if (second && 
wakeup_preempt_entity(NULL, second, left) < 1) > > Why not applying the same policy here ? the tick can also prevent > current task to move forward > > > se = second; > > } > > > > - if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) { > > + if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) { > > /* > > * Someone really wants this to run. If it's not unfair, run it. > > */ > > se = cfs_rq->next; > > - } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) { > > + } else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) { > > /* > > * Prefer last buddy, try to return the CPU to a preempted task. > > */ > > @@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > > } > > #endif /* CONFIG_SMP */ > > > > -static unsigned long wakeup_gran(struct sched_entity *se) > > +static unsigned long > > +select_wakeup_gran(struct cfs_rq *cfs_rq) > > +{ > > + unsigned int nr_running, threshold; > > + > > + if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN)) > > + return sysctl_sched_wakeup_granularity; > > + > > + /* !GENTLE_FAIR_SLEEPERS has one overload threshold. */ > > + if (!sched_feat(GENTLE_FAIR_SLEEPERS)) { > > + if (cfs_rq->nr_running <= sched_nr_disable_gran) > > cfs_rq->nr_running reflects the number of sched entities in the cfs > but not the number of running tasks which reflected in h_nr_running > > Also do you want to take into account only tasks in this cfs and its > children or on all cfs on this rq ? > > > + return sysctl_sched_wakeup_granularity; > > + > > + return sysctl_sched_latency; > > + } > > + > > + /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */ > > + nr_running = cfs_rq->nr_running; > > + threshold = sched_nr_disable_gran >> 1; > > + > > + /* No overload. */ > > + if (nr_running <= threshold) > > + return sysctl_sched_wakeup_granularity; > > TBH I don't like these "no overload", "light overload" ... 
They don't > have any real meaning apart from that it might work for your platform > and your hackbench test. > We already had have people complaining that small cfs task does not > preempt fast enough curr task as an example > > There is no explanation why these values are the correct ones and not > but are just some random heuristic results and we are trying to remove > platform heuristic and to not add new > > > + > > + /* Light overload. */ > > + if (nr_running <= sched_nr_disable_gran) > > + return sysctl_sched_latency >> 1; > > + > > + /* Heavy overload. */ > > + return sysctl_sched_latency; > > Why should a thread without any relationship with the curr, not > preempt it because there are already a lot of running threads ? In > your case, you want hackbench threads to not preempt each others > because they tries to use same resources so it's probably better to > let the current one to move forward but that's not a universal policy. > > side question: Have you try to change the nice priority which also > impact whether a thread can preempt curr ? There were also some discussions about adding a nice_latency prio so a task/group can be set if they want to be nice with others from a scheduling latency pov. 
One open point was to define what it mean to be nice from a latency pov > > > +} > > + > > +static unsigned long > > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se) > > { > > - unsigned long gran = sysctl_sched_wakeup_granularity; > > + unsigned long gran = select_wakeup_gran(cfs_rq); > > > > /* > > * Since its curr running now, convert the gran from real-time > > @@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se) > > * > > */ > > static int > > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se) > > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr, > > + struct sched_entity *se) > > { > > s64 gran, vdiff = curr->vruntime - se->vruntime; > > > > if (vdiff <= 0) > > return -1; > > > > - gran = wakeup_gran(se); > > + gran = wakeup_gran(cfs_rq, se); > > if (vdiff > gran) > > return 1; > > > > @@ -7191,7 +7237,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ > > return; > > > > update_curr(cfs_rq_of(se)); > > - if (wakeup_preempt_entity(se, pse) == 1) { > > + if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) { > > like for update_curr above, cfs_rq can be wrong because se could have changed > > > /* > > * Bias pick_next to pick the sched entity that is > > * triggering this preemption. > > diff --git a/kernel/sched/features.h b/kernel/sched/features.h > > index 7f8dace0964c..d041d7023029 100644 > > --- a/kernel/sched/features.h > > +++ b/kernel/sched/features.h > > @@ -95,3 +95,9 @@ SCHED_FEAT(LATENCY_WARN, false) > > > > SCHED_FEAT(ALT_PERIOD, true) > > SCHED_FEAT(BASE_SLICE, true) > > + > > +/* > > + * Scale sched_wakeup_granularity dynamically based on the number of running > > + * tasks up to a cap of sysctl_sched_latency. > > + */ > > +SCHED_FEAT(SCALE_WAKEUP_GRAN, true) ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-22 13:20 ` Mel Gorman 2021-09-22 14:04 ` Mike Galbraith 2021-09-22 14:15 ` Vincent Guittot @ 2021-10-05 9:32 ` Peter Zijlstra 2 siblings, 0 replies; 59+ messages in thread From: Peter Zijlstra @ 2021-10-05 9:32 UTC (permalink / raw) To: Mel Gorman Cc: Mike Galbraith, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, Sep 22, 2021 at 02:20:02PM +0100, Mel Gorman wrote: > > Note that the TuneD throughput-performance profile allows up to 15ms > for sysctl_sched_latency (ignoring scaling) but there is no explanation > why such a long period was necessary or why sched_latency_ns is also > not adjusted. The intent may have been to disable wakeup preemption > or it might be an oversight. An internet search for instances where > the sysctl_sched_wakeup_granularity parameter is tuned to high values offers > either no explanation or a broken one. FWIW, if one wants to disable wakeup preemption, we've got SCHED_BATCH for that.
* wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-22 5:22 ` Mike Galbraith 2021-09-22 13:20 ` Mel Gorman @ 2021-10-03 3:07 ` Mike Galbraith 2021-10-03 7:34 ` Barry Song 1 sibling, 1 reply; 59+ messages in thread From: Mike Galbraith @ 2021-10-03 3:07 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Wed, 2021-09-22 at 07:22 +0200, Mike Galbraith wrote: > On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote: > > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote: > > > > > > > Preemption does rapidly run into diminishing return as load climbs for > > > a lot of loads, but as you know, it's a rather sticky wicket because > > > even when over-committed, preventing light control threads from slicing > > > through (what can be a load's own work crew of) hogs can seriously > > > injure performance. > > > > > > > Turning this into a classic Rob Peter To Pay Paul problem. We don't know > > if there is a light control thread that needs to run or not that affects > > overall performance. It all depends on whether that control thread needs > > to make progress for the overall workload or whether there are a mix of > > workloads resulting in overloading. > > WRT overload, and our good buddies Peter and Paul :) I added... > if (gran >= sysctl_sched_latency >> 1) > trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running); > ...to watch, and met the below when I.. logged in. > > homer:..debug/tracing # tail -20 trace > X-2229 [002] d..5. 60.690322: wakeup_gran: runnable:9 preempt disabled > X-2229 [002] d..5. 60.690325: wakeup_gran: runnable:10 preempt disabled > X-2229 [002] d..5. 60.690330: wakeup_gran: runnable:11 preempt disabled > X-2229 [002] d..5. 60.690363: wakeup_gran: runnable:13 preempt disabled > X-2229 [002] d..5. 
60.690377: wakeup_gran: runnable:14 preempt disabled > X-2229 [002] d..5. 60.690390: wakeup_gran: runnable:15 preempt disabled > X-2229 [002] d..5. 60.690404: wakeup_gran: runnable:16 preempt disabled > X-2229 [002] d..5. 60.690425: wakeup_gran: runnable:9 preempt disabled > ksmserver-2694 [003] d..3. 60.690432: wakeup_gran: runnable:6 preempt disabled > ksmserver-2694 [003] d..3. 60.690436: wakeup_gran: runnable:7 preempt disabled > X-2229 [002] d..5. 60.690451: wakeup_gran: runnable:6 preempt disabled > X-2229 [002] d..5. 60.690465: wakeup_gran: runnable:7 preempt disabled > kmix-2736 [000] d..3. 60.690491: wakeup_gran: runnable:6 preempt disabled > X-2229 [004] d..5. 92.889635: wakeup_gran: runnable:6 preempt disabled > X-2229 [004] d..5. 92.889675: wakeup_gran: runnable:6 preempt disabled > X-2229 [004] d..5. 92.889863: wakeup_gran: runnable:6 preempt disabled > X-2229 [004] d..5. 92.889944: wakeup_gran: runnable:6 preempt disabled > X-2229 [004] d..5. 92.889957: wakeup_gran: runnable:7 preempt disabled > X-2229 [004] d..5. 92.889968: wakeup_gran: runnable:8 preempt disabled > QXcbEventQueue-2740 [000] d..4. 92.890025: wakeup_gran: runnable:6 preempt disabled > homer:..debug/tracing > > Watching 'while sleep 1; do clear;tail trace; done' with nothing but a > kbuild running is like watching top. There's enough stacking during > routine use of my desktop box that it runs into the tick granularity > wall pretty much continuously, so 'overload' may want redefining. I looked into that crazy stacking depth... static int wake_affine_weight(struct sched_domain *sd, struct task_struct *p, int this_cpu, int prev_cpu, int sync) { s64 this_eff_load, prev_eff_load; unsigned long task_load; this_eff_load = cpu_load(cpu_rq(this_cpu)); ^^^^^^^^^^^^^^^^^^^^^^^^^^^ the butler didit! That's pretty darn busted as it sits. Between load updates, X, or any other waker of many, can stack wakees to a ludicrous depth. 
Tracing kbuild vs firefox playing a youtube clip, I watched X stack 20 of the zillion firefox minions while their previous CPUs all had 1 lousy task running but a cpu_load() higher than the cpu_load() of X's CPU. Most of those prev_cpus were where X had left them when it migrated. Each and every crazy depth migration was wake_affine_weight() deciding we should pull based on crappy data. As instantaneous load on the waker CPU blew through the roof in my trace snapshot, its cpu_load() did finally budge.. a tiny bit.. downward. No idea where the stack would have topped out, my tracing_off() limit was 20. Hohum, my box grew a WA_INST companion to SIS_MIN_LAT cache cold task distribulator feature ;-) Not particularly lovely, but it knocks over the leaning tower of minions. -Mike ^ permalink raw reply [flat|nested] 59+ messages in thread
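The stacking mechanism Mike describes can be sketched as a toy model. This Python sketch is illustrative only: `should_pull()` is a simplified stand-in for the effective-load comparison in `wake_affine_weight()`, the function names are not the kernel's, and the load numbers merely echo the trace above. The point it demonstrates is that when the `cpu_load()` snapshots go stale between updates, a waker of many can win the comparison for every wakee in a burst while its actual queue depth climbs unchecked:

```python
# Toy model of the stacking described above: the pull decision uses
# cpu_load() snapshots that are NOT refreshed as wakees pile up, while
# the real queue depth (nr_running) on the waker's CPU keeps growing.
def should_pull(stale_load, this_cpu, prev_cpu):
    # Simplified stand-in for wake_affine_weight()'s effective-load
    # comparison: pull the wakee when the waker's (stale) load looks
    # no heavier than the wakee's previous CPU.
    return stale_load[this_cpu] <= stale_load[prev_cpu]

def burst_wakeup(stale_load, waker_cpu, wakee_prev_cpus):
    # One task already running per CPU; the stale snapshot is never
    # updated during the burst.  Return final depth on waker_cpu.
    nr_running = {cpu: 1 for cpu in stale_load}
    for prev in wakee_prev_cpus:
        target = waker_cpu if should_pull(stale_load, waker_cpu, prev) else prev
        nr_running[target] += 1
    return nr_running[waker_cpu]
```

With numbers like the trace's (CPU7 snapshot at 373, previous CPUs at 1029/1528), a 20-task burst stacks every wakee on CPU7 even though each previous CPU has only one runnable task.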
* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-03 3:07 ` wakeup_affine_weight() is b0rked - was " Mike Galbraith @ 2021-10-03 7:34 ` Barry Song 2021-10-03 14:52 ` Mike Galbraith 2021-10-04 4:34 ` Mike Galbraith 0 siblings, 2 replies; 59+ messages in thread From: Barry Song @ 2021-10-03 7:34 UTC (permalink / raw) To: Mike Galbraith Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Sun, Oct 3, 2021 at 4:11 PM Mike Galbraith <efault@gmx.de> wrote: > > On Wed, 2021-09-22 at 07:22 +0200, Mike Galbraith wrote: > > On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote: > > > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote: > > > > > > > > > > Preemption does rapidly run into diminishing return as load climbs for > > > > a lot of loads, but as you know, it's a rather sticky wicket because > > > > even when over-committed, preventing light control threads from slicing > > > > through (what can be a load's own work crew of) hogs can seriously > > > > injure performance. > > > > > > > > > > Turning this into a classic Rob Peter To Pay Paul problem. We don't know > > > if there is a light control thread that needs to run or not that affects > > > overall performance. It all depends on whether that control thread needs > > > to make progress for the overall workload or whether there are a mix of > > > workloads resulting in overloading. > > > > WRT overload, and our good buddies Peter and Paul :) I added... > > if (gran >= sysctl_sched_latency >> 1) > > trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running); > > ...to watch, and met the below when I.. logged in. > > > > homer:..debug/tracing # tail -20 trace > > X-2229 [002] d..5. 60.690322: wakeup_gran: runnable:9 preempt disabled > > X-2229 [002] d..5. 60.690325: wakeup_gran: runnable:10 preempt disabled > > X-2229 [002] d..5. 
60.690330: wakeup_gran: runnable:11 preempt disabled > > X-2229 [002] d..5. 60.690363: wakeup_gran: runnable:13 preempt disabled > > X-2229 [002] d..5. 60.690377: wakeup_gran: runnable:14 preempt disabled > > X-2229 [002] d..5. 60.690390: wakeup_gran: runnable:15 preempt disabled > > X-2229 [002] d..5. 60.690404: wakeup_gran: runnable:16 preempt disabled > > X-2229 [002] d..5. 60.690425: wakeup_gran: runnable:9 preempt disabled > > ksmserver-2694 [003] d..3. 60.690432: wakeup_gran: runnable:6 preempt disabled > > ksmserver-2694 [003] d..3. 60.690436: wakeup_gran: runnable:7 preempt disabled > > X-2229 [002] d..5. 60.690451: wakeup_gran: runnable:6 preempt disabled > > X-2229 [002] d..5. 60.690465: wakeup_gran: runnable:7 preempt disabled > > kmix-2736 [000] d..3. 60.690491: wakeup_gran: runnable:6 preempt disabled > > X-2229 [004] d..5. 92.889635: wakeup_gran: runnable:6 preempt disabled > > X-2229 [004] d..5. 92.889675: wakeup_gran: runnable:6 preempt disabled > > X-2229 [004] d..5. 92.889863: wakeup_gran: runnable:6 preempt disabled > > X-2229 [004] d..5. 92.889944: wakeup_gran: runnable:6 preempt disabled > > X-2229 [004] d..5. 92.889957: wakeup_gran: runnable:7 preempt disabled > > X-2229 [004] d..5. 92.889968: wakeup_gran: runnable:8 preempt disabled > > QXcbEventQueue-2740 [000] d..4. 92.890025: wakeup_gran: runnable:6 preempt disabled > > homer:..debug/tracing > > > > Watching 'while sleep 1; do clear;tail trace; done' with nothing but a > > kbuild running is like watching top. There's enough stacking during > > routine use of my desktop box that it runs into the tick granularity > > wall pretty much continuously, so 'overload' may want redefining. > > I looked into that crazy stacking depth... 
> > static int > wake_affine_weight(struct sched_domain *sd, struct task_struct *p, > int this_cpu, int prev_cpu, int sync) > { > s64 this_eff_load, prev_eff_load; > unsigned long task_load; > > this_eff_load = cpu_load(cpu_rq(this_cpu)); > ^^^^^^^^^^^^^^^^^^^^^^^^^^^ the butler didit! > > That's pretty darn busted as it sits. Between load updates, X, or any > other waker of many, can stack wakees to a ludicrous depth. Tracing > kbuild vs firefox playing a youtube clip, I watched X stack 20 of the > zillion firefox minions while their previous CPUs all had 1 lousy task > running but a cpu_load() higher than the cpu_load() of X's CPU. Most > of those prev_cpus were where X had left them when it migrated. Each > and every crazy depth migration was wake_affine_weight() deciding we > should pull based on crappy data. As instantaneous load on the waker > CPU blew through the roof in my trace snapshot, its cpu_load() did > finally budge.. a tiny bit.. downward. No idea where the stack would > have topped out, my tracing_off() limit was 20. Mike, not quite sure I caught your point. It seems you mean x wakes up many firefoxes within a short period, so it pulls them to the CPU where x is running. Technically those pulling should increase cpu_load of x' CPU. But due to some reason, the cpu_load is not increased in time on x' CPU, So this makes a lot of firefoxes piled on x' CPU, but at that time, the load of the cpu which firefox was running on is still larger than x' cpu with a lot of firefoxes? I am wondering if this should be the responsibility of wake_wide()? > > Hohum, my box grew a WA_INST companion to SIS_MIN_LAT cache cold task > distribulator feature ;-) Not particularly lovely, but it knocks over > the leaning tower of minions. > > -Mike Thanks barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-03 7:34 ` Barry Song @ 2021-10-03 14:52 ` Mike Galbraith 2021-10-03 21:06 ` Barry Song 2021-10-04 4:34 ` Mike Galbraith 1 sibling, 1 reply; 59+ messages in thread From: Mike Galbraith @ 2021-10-03 14:52 UTC (permalink / raw) To: Barry Song Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Sun, 2021-10-03 at 20:34 +1300, Barry Song wrote: > > > > I looked into that crazy stacking depth... > > > > static int > > wake_affine_weight(struct sched_domain *sd, struct task_struct *p, > > int this_cpu, int prev_cpu, int sync) > > { > > s64 this_eff_load, prev_eff_load; > > unsigned long task_load; > > > > this_eff_load = cpu_load(cpu_rq(this_cpu)); > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^ the butler didit! > > > > That's pretty darn busted as it sits. Between load updates, X, or any > > other waker of many, can stack wakees to a ludicrous depth. Tracing > > kbuild vs firefox playing a youtube clip, I watched X stack 20 of the > > zillion firefox minions while their previous CPUs all had 1 lousy task > > running but a cpu_load() higher than the cpu_load() of X's CPU. Most > > of those prev_cpus were where X had left them when it migrated. Each > > and every crazy depth migration was wake_affine_weight() deciding we > > should pull based on crappy data. As instantaneous load on the waker > > CPU blew through the roof in my trace snapshot, its cpu_load() did > > finally budge.. a tiny bit.. downward. No idea where the stack would > > have topped out, my tracing_off() limit was 20. > > Mike, not quite sure I caught your point. It seems you mean x wakes up > many firefoxes within a short period, so it pulls them to the CPU where x > is running. Technically those pulling should increase cpu_load of x' CPU. 
> But due to some reason, the cpu_load is not increased in time on x' CPU, > So this makes a lot of firefoxes piled on x' CPU, but at that time, the load > of the cpu which firefox was running on is still larger than x' cpu with a lot > of firefoxes? It looked like this. X-2211 [007] d...211 2327.810997: select_task_rq_fair: this_run/load:4:373 prev_run/load:4:373 waking firefox:4971 CPU7 ==> CPU7 X-2211 [007] d...211 2327.811004: select_task_rq_fair: this_run/load:5:373 prev_run/load:1:1029 waking QXcbEventQueue:4952 CPU0 ==> CPU7 X-2211 [007] d...211 2327.811010: select_task_rq_fair: this_run/load:6:373 prev_run/load:1:1528 waking QXcbEventQueue:3969 CPU5 ==> CPU7 X-2211 [007] d...211 2327.811015: select_task_rq_fair: this_run/load:7:373 prev_run/load:1:1029 waking evolution-alarm:3833 CPU0 ==> CPU7 X-2211 [007] d...211 2327.811021: select_task_rq_fair: this_run/load:8:373 prev_run/load:1:1528 waking QXcbEventQueue:3860 CPU5 ==> CPU7 X-2211 [007] d...211 2327.811026: select_task_rq_fair: this_run/load:8:373 prev_run/load:1:1528 waking QXcbEventQueue:3800 CPU5 ==> CPU7 X-2211 [007] d...211 2327.811032: select_task_rq_fair: this_run/load:9:373 prev_run/load:1:1528 waking xdg-desktop-por:3341 CPU5 ==> CPU7 X-2211 [007] d...211 2327.811037: select_task_rq_fair: this_run/load:10:373 prev_run/load:1:289 waking at-spi2-registr:3165 CPU4 ==> CPU7 X-2211 [007] d...211 2327.811042: select_task_rq_fair: this_run/load:11:373 prev_run/load:1:1029 waking ibus-ui-gtk3:2865 CPU0 ==> CPU0 X-2211 [007] d...211 2327.811049: select_task_rq_fair: this_run/load:11:373 prev_run/load:1:226 waking ibus-x11:2868 CPU2 ==> CPU2 X-2211 [007] d...211 2327.811054: select_task_rq_fair: this_run/load:11:373 prev_run/load:11:373 waking ibus-extension-:2866 CPU7 ==> CPU7 X-2211 [007] d...211 2327.811059: select_task_rq_fair: this_run/load:12:373 prev_run/load:1:289 waking QXcbEventQueue:2804 CPU4 ==> CPU7 X-2211 [007] d...211 2327.811063: select_task_rq_fair: this_run/load:13:373 
prev_run/load:1:935 waking QXcbEventQueue:2756 CPU1 ==> CPU7 X-2211 [007] d...211 2327.811068: select_task_rq_fair: this_run/load:14:373 prev_run/load:1:1528 waking QXcbEventQueue:2753 CPU5 ==> CPU7 X-2211 [007] d...211 2327.811074: select_task_rq_fair: this_run/load:15:373 prev_run/load:1:1528 waking QXcbEventQueue:2741 CPU5 ==> CPU7 X-2211 [007] d...211 2327.811079: select_task_rq_fair: this_run/load:16:373 prev_run/load:1:1528 waking QXcbEventQueue:2730 CPU5 ==> CPU7 X-2211 [007] d...211 2327.811085: select_task_rq_fair: this_run/load:17:373 prev_run/load:1:5 waking QXcbEventQueue:2724 CPU0 ==> CPU0 X-2211 [007] d...211 2327.811090: select_task_rq_fair: this_run/load:17:373 prev_run/load:1:1010 waking QXcbEventQueue:2721 CPU6 ==> CPU7 X-2211 [007] d...211 2327.811096: select_task_rq_fair: this_run/load:18:373 prev_run/load:1:1528 waking QXcbEventQueue:2720 CPU5 ==> CPU7 X-2211 [007] d...211 2327.811101: select_task_rq_fair: this_run/load:19:373 prev_run/load:1:1528 waking QXcbEventQueue:2704 CPU5 ==> CPU7 X-2211 [007] d...211 2327.811105: select_task_rq_fair: this_run/load:20:373 prev_run/load:0:226 waking QXcbEventQueue:2705 CPU2 ==> CPU2 X-2211 [007] d...211 2327.811110: select_task_rq_fair: this_run/load:19:342 prev_run/load:1:1528 waking QXcbEventQueue:2695 CPU5 ==> CPU7 X-2211 [007] d...211 2327.811115: select_task_rq_fair: this_run/load:20:342 prev_run/load:1:1528 waking QXcbEventQueue:2694 CPU5 ==> CPU7 X-2211 [007] d...211 2327.811120: select_task_rq_fair: this_run/load:21:342 prev_run/load:1:1528 waking QXcbEventQueue:2679 CPU5 ==> CPU7 Legend: foo_run/load:foo->nr_running:cpu_load(foo) Every migration to CPU7 in the above was due to wake_affine_weight() seeing more or less static effective load numbers (the trace was wider, showing which path was taken). > I am wondering if this should be the responsibility of wake_wide()? That's a good point. I'm not so sure that would absolve use of what appears to be stagnant state though. 
If we hadn't gotten there, this stack obviously wouldn't have happened.. but we did get there, and the state that was used did not reflect reality. wake_wide() deflecting this particular gaggle wouldn't have improved state accuracy one whit for a subsequent wakeup, or? -Mike
* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-03 14:52 ` Mike Galbraith @ 2021-10-03 21:06 ` Barry Song 2021-10-04 1:49 ` Mike Galbraith 0 siblings, 1 reply; 59+ messages in thread From: Barry Song @ 2021-10-03 21:06 UTC (permalink / raw) To: Mike Galbraith Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Mon, Oct 4, 2021 at 3:52 AM Mike Galbraith <efault@gmx.de> wrote: > > On Sun, 2021-10-03 at 20:34 +1300, Barry Song wrote: > > > > > > I looked into that crazy stacking depth... > > > > > > static int > > > wake_affine_weight(struct sched_domain *sd, struct task_struct *p, > > > int this_cpu, int prev_cpu, int sync) > > > { > > > s64 this_eff_load, prev_eff_load; > > > unsigned long task_load; > > > > > > this_eff_load = cpu_load(cpu_rq(this_cpu)); > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^ the butler didit! > > > > > > That's pretty darn busted as it sits. Between load updates, X, or any > > > other waker of many, can stack wakees to a ludicrous depth. Tracing > > > kbuild vs firefox playing a youtube clip, I watched X stack 20 of the > > > zillion firefox minions while their previous CPUs all had 1 lousy task > > > running but a cpu_load() higher than the cpu_load() of X's CPU. Most > > > of those prev_cpus were where X had left them when it migrated. Each > > > and every crazy depth migration was wake_affine_weight() deciding we > > > should pull based on crappy data. As instantaneous load on the waker > > > CPU blew through the roof in my trace snapshot, its cpu_load() did > > > finally budge.. a tiny bit.. downward. No idea where the stack would > > > have topped out, my tracing_off() limit was 20. > > > > Mike, not quite sure I caught your point. It seems you mean x wakes up > > many firefoxes within a short period, so it pulls them to the CPU where x > > is running. 
Technically those pulling should increase cpu_load of x' CPU. > > But due to some reason, the cpu_load is not increased in time on x' CPU, > > So this makes a lot of firefoxes piled on x' CPU, but at that time, the load > > of the cpu which firefox was running on is still larger than x' cpu with a lot > > of firefoxes? > > It looked like this. > > X-2211 [007] d...211 2327.810997: select_task_rq_fair: this_run/load:4:373 prev_run/load:4:373 waking firefox:4971 CPU7 ==> CPU7 > X-2211 [007] d...211 2327.811004: select_task_rq_fair: this_run/load:5:373 prev_run/load:1:1029 waking QXcbEventQueue:4952 CPU0 ==> CPU7 > X-2211 [007] d...211 2327.811010: select_task_rq_fair: this_run/load:6:373 prev_run/load:1:1528 waking QXcbEventQueue:3969 CPU5 ==> CPU7 > X-2211 [007] d...211 2327.811015: select_task_rq_fair: this_run/load:7:373 prev_run/load:1:1029 waking evolution-alarm:3833 CPU0 ==> CPU7 > X-2211 [007] d...211 2327.811021: select_task_rq_fair: this_run/load:8:373 prev_run/load:1:1528 waking QXcbEventQueue:3860 CPU5 ==> CPU7 > X-2211 [007] d...211 2327.811026: select_task_rq_fair: this_run/load:8:373 prev_run/load:1:1528 waking QXcbEventQueue:3800 CPU5 ==> CPU7 > X-2211 [007] d...211 2327.811032: select_task_rq_fair: this_run/load:9:373 prev_run/load:1:1528 waking xdg-desktop-por:3341 CPU5 ==> CPU7 > X-2211 [007] d...211 2327.811037: select_task_rq_fair: this_run/load:10:373 prev_run/load:1:289 waking at-spi2-registr:3165 CPU4 ==> CPU7 > X-2211 [007] d...211 2327.811042: select_task_rq_fair: this_run/load:11:373 prev_run/load:1:1029 waking ibus-ui-gtk3:2865 CPU0 ==> CPU0 > X-2211 [007] d...211 2327.811049: select_task_rq_fair: this_run/load:11:373 prev_run/load:1:226 waking ibus-x11:2868 CPU2 ==> CPU2 > X-2211 [007] d...211 2327.811054: select_task_rq_fair: this_run/load:11:373 prev_run/load:11:373 waking ibus-extension-:2866 CPU7 ==> CPU7 > X-2211 [007] d...211 2327.811059: select_task_rq_fair: this_run/load:12:373 prev_run/load:1:289 waking QXcbEventQueue:2804 CPU4 
==> CPU7 > X-2211 [007] d...211 2327.811063: select_task_rq_fair: this_run/load:13:373 prev_run/load:1:935 waking QXcbEventQueue:2756 CPU1 ==> CPU7 > X-2211 [007] d...211 2327.811068: select_task_rq_fair: this_run/load:14:373 prev_run/load:1:1528 waking QXcbEventQueue:2753 CPU5 ==> CPU7 > X-2211 [007] d...211 2327.811074: select_task_rq_fair: this_run/load:15:373 prev_run/load:1:1528 waking QXcbEventQueue:2741 CPU5 ==> CPU7 > X-2211 [007] d...211 2327.811079: select_task_rq_fair: this_run/load:16:373 prev_run/load:1:1528 waking QXcbEventQueue:2730 CPU5 ==> CPU7 > X-2211 [007] d...211 2327.811085: select_task_rq_fair: this_run/load:17:373 prev_run/load:1:5 waking QXcbEventQueue:2724 CPU0 ==> CPU0 > X-2211 [007] d...211 2327.811090: select_task_rq_fair: this_run/load:17:373 prev_run/load:1:1010 waking QXcbEventQueue:2721 CPU6 ==> CPU7 > X-2211 [007] d...211 2327.811096: select_task_rq_fair: this_run/load:18:373 prev_run/load:1:1528 waking QXcbEventQueue:2720 CPU5 ==> CPU7 > X-2211 [007] d...211 2327.811101: select_task_rq_fair: this_run/load:19:373 prev_run/load:1:1528 waking QXcbEventQueue:2704 CPU5 ==> CPU7 > X-2211 [007] d...211 2327.811105: select_task_rq_fair: this_run/load:20:373 prev_run/load:0:226 waking QXcbEventQueue:2705 CPU2 ==> CPU2 > X-2211 [007] d...211 2327.811110: select_task_rq_fair: this_run/load:19:342 prev_run/load:1:1528 waking QXcbEventQueue:2695 CPU5 ==> CPU7 > X-2211 [007] d...211 2327.811115: select_task_rq_fair: this_run/load:20:342 prev_run/load:1:1528 waking QXcbEventQueue:2694 CPU5 ==> CPU7 > X-2211 [007] d...211 2327.811120: select_task_rq_fair: this_run/load:21:342 prev_run/load:1:1528 waking QXcbEventQueue:2679 CPU5 ==> CPU7 What is the topology of your hardware? shouldn't select_idle_sibling find some other idle CPUs in CPU7's LLC domain? Why are you always getting CPU7? one thing bothering me is that we are using the load of a single CPU in wake_affine_weight(), but we are actually scanning the whole LLC afterwards. 
> > Legend: foo_run/load:foo->nr_running:cpu_load(foo) > > Every migration to CPU7 in the above was due to wake_affine_weight() > seeing more or less static effective load numbers (the trace was wider, > showing which path was taken). > > > I am wondering if this should be the responsibility of wake_wide()? > > That's a good point. I'm not so sure that would absolve use of what > appears to be stagnant state though. If we hadn't gotten there, this > stack obviously wouldn't have happened.. but we did get there, and > state that was used did not reflect reality. wake_wide() deflecting > this particular gaggle wouldn't improved state accuracy one whit for a > subsequent wakeup, or? > > -Mike Thanks barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-03 21:06 ` Barry Song @ 2021-10-04 1:49 ` Mike Galbraith 0 siblings, 0 replies; 59+ messages in thread From: Mike Galbraith @ 2021-10-04 1:49 UTC (permalink / raw) To: Barry Song Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Mon, 2021-10-04 at 10:06 +1300, Barry Song wrote: > > What is the topology of your hardware? It's an i4790 quad+smt. > shouldn't select_idle_sibling find some other idle CPUs in CPU7's LLC > domain? Why are you always getting CPU7? The box is busy. It's running a parallel kbuild as well as the desktop, firefox and their various minions (et al.. kthreads). > one thing bothering me is that we are using the load of a single CPU > in wake_affine_weight(), but we are actually scanning the whole LLC > afterwards. The question wake_affine() is answering is one from the bad old days of SMP, of pull the task to toasty data, or make it drag that data across horrible hardware. LLC came into the equation much later. With it, and big boxen with several of them, the wake affine question became mostly about which LLC, but not solely. There's still L2 to consider, and there's even a valid time to stack two communicating tasks like in the bad old SMP days.. say when there would be a man-in-the-middle of the conversation if wakee were to be left on its previous CPU. -Mike ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-03 7:34 ` Barry Song 2021-10-03 14:52 ` Mike Galbraith @ 2021-10-04 4:34 ` Mike Galbraith 2021-10-04 9:06 ` Mike Galbraith 1 sibling, 1 reply; 59+ messages in thread From: Mike Galbraith @ 2021-10-04 4:34 UTC (permalink / raw) To: Barry Song Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Sun, 2021-10-03 at 20:34 +1300, Barry Song wrote: > > I am wondering if this should be the responsibility of wake_wide()? Those event threads we stacked so high (which are kde minions btw), don't generally accrue _any_ wakee_flips, so when X wakes a slew of the things, wake_wide()'s heuristic rejects the lot. So yeah, the blame game for this issue is a target rich environment. Shoot either of 'em (or both), and you'll hit the bad guy. -Mike ^ permalink raw reply [flat|nested] 59+ messages in thread
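Why flip-less minions sail past the heuristic is visible if the mainline `wake_wide()` logic (quoted in the patch hunks below) is transcribed into a small model — Python here for brevity, with `llc_size` standing in for `sd_llc_size`. A wakee that never wakes anyone keeps `wakee_flips` at 0, so the `slave < factor` test rejects the pair regardless of how many flips the waker has accrued:

```python
def wake_wide(waker_flips, wakee_flips, llc_size=8):
    # Mirrors mainline wake_wide(): 'master' is the heavier flipper,
    # 'slave' the lighter one; 'factor' is the LLC size.
    factor = llc_size
    master, slave = waker_flips, wakee_flips
    if master < slave:
        master, slave = slave, master
    # A flip-less wakee (slave == 0) always fails the first test, so
    # the wakeup is deflected back to the affine path.
    if slave < factor or master < slave * factor:
        return 0
    return 1
```

So X at thousands of flips waking a minion with zero flips returns 0 (don't wake wide) every time, handing the decision to wake_affine_weight() and its stale load data.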
* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-04 4:34 ` Mike Galbraith @ 2021-10-04 9:06 ` Mike Galbraith 2021-10-05 7:47 ` Mel Gorman 0 siblings, 1 reply; 59+ messages in thread From: Mike Galbraith @ 2021-10-04 9:06 UTC (permalink / raw) To: Barry Song Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Mon, 2021-10-04 at 06:34 +0200, Mike Galbraith wrote: > On Sun, 2021-10-03 at 20:34 +1300, Barry Song wrote: > > > > I am wondering if this should be the responsibility of wake_wide()? > > Those event threads we stacked so high (which are kde minions btw), > don't generally accrue _any_ wakee_flips, so when X wakes a slew of the > things, wake_wide()'s heuristic rejects the lot. > > So yeah, the blame game for this issue is a target rich environment. > Shoot either of 'em (or both), and you'll hit the bad guy. The mallet below convinced wake_wide() that X waking event threads is something it most definitely should care about. It's not perfect, can get caught with its pants down, because behavior changes a LOT, but I at least have to work at it a bit to stack tasks to the ceiling. With make -j8 running along with firefox with two tabs, one containing youtube's suggestions of stuff you probably don't want to watch, the other a running clip, if I have the idle tab in focus, and don't drive mouse around, flips decay enough for wake_wide() to lose interest, but just wiggle the mouse, and it starts waking wide. Focus on the running clip, and it continuously wakes wide. Hacky, but way better behavior.. at this particular testcase.. in this particular box.. 
at least once :) --- kernel/sched/fair.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_str } if (current->last_wakee != p) { + int min = __this_cpu_read(sd_llc_size) << 1; + /* + * Couple the wakee flips to the waker for the case where it + * doesn't accrue flips, taking care to not push the wakee + * high enough that the wake_wide() heuristic fails. + */ + if (current->wakee_flips > p->wakee_flips * min) + p->wakee_flips++; current->last_wakee = p; current->wakee_flips++; } @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct if (master < slave) swap(master, slave); - if (slave < factor || master < slave * factor) + if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor) return 0; return 1; } ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-04 9:06 ` Mike Galbraith @ 2021-10-05 7:47 ` Mel Gorman 2021-10-05 8:42 ` Mike Galbraith 0 siblings, 1 reply; 59+ messages in thread From: Mel Gorman @ 2021-10-05 7:47 UTC (permalink / raw) To: Mike Galbraith Cc: Barry Song, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Mon, Oct 04, 2021 at 11:06:30AM +0200, Mike Galbraith wrote: > On Mon, 2021-10-04 at 06:34 +0200, Mike Galbraith wrote: > > On Sun, 2021-10-03 at 20:34 +1300, Barry Song wrote: > > > > > > I am wondering if this should be the responsibility of wake_wide()? > > > > Those event threads we stacked so high (which are kde minions btw), > > don't generally accrue _any_ wakee_flips, so when X wakes a slew of the > > things, wake_wide()'s heuristic rejects the lot. > > > > So yeah, the blame game for this issue is a target rich environment. > > Shoot either of 'em (or both), and you'll hit the bad guy. > > The mallet below convinced wake_wide() that X waking event threads is > something it most definitely should care about. It's not perfect, can > get caught with its pants down, because behavior changes a LOT, but I > at least have to work at it a bit to stack tasks to the ceiling. > > With make -j8 running along with firefox with two tabs, one containing > youtube's suggestions of stuff you probably don't want to watch, the > other a running clip, if I have the idle tab in focus, and don't drive > mouse around, flips decay enough for wake_wide() to lose interest, but > just wiggle the mouse, and it starts waking wide. Focus on the running > clip, and it continuously wakes wide. > > Hacky, but way better behavior.. at this particular testcase.. in this > particular box.. at least once :) > Only three machines managed to complete tests overnight. For most workloads test, it was neutral or slight improvements. 
For multi-kernbench__netperf-tcp-rr-multipair (kernel compile + netperf-tcp-rr combined), there was little to no change. For the heavily overloaded cases (hackbench again), it was variable. The worst improvement was a gain of 1-3%. The best improvement (single socket Skylake with 8 logical CPUs, SMT enabled) was 1%-18% depending on the group count. -- Mel Gorman SUSE Labs
* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-05 7:47 ` Mel Gorman @ 2021-10-05 8:42 ` Mike Galbraith 2021-10-05 9:31 ` Mel Gorman 0 siblings, 1 reply; 59+ messages in thread From: Mike Galbraith @ 2021-10-05 8:42 UTC (permalink / raw) To: Mel Gorman Cc: Barry Song, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, 2021-10-05 at 08:47 +0100, Mel Gorman wrote: > On Mon, Oct 04, 2021 at 11:06:30AM +0200, Mike Galbraith wrote: > > > > The mallet below convinced wake_wide() that X waking event threads is > > something it most definitely should care about. It's not perfect, can > > get caught with its pants down, because behavior changes a LOT, but I > > at least have to work at it a bit to stack tasks to the ceiling. > > > > With make -j8 running along with firefox with two tabs, one containing > > youtube's suggestions of stuff you probably don't want to watch, the > > other a running clip, if I have the idle tab in focus, and don't drive > > mouse around, flips decay enough for wake_wide() to lose interest, but > > just wiggle the mouse, and it starts waking wide. Focus on the running > > clip, and it continuously wakes wide. > > > > Hacky, but way better behavior.. at this particular testcase.. in this > > particular box.. at least once :) > > > > Only three machines managed to complete tests overnight. For most > workloads test, it was neutral or slight improvements. For > multi-kernbench__netperf-tcp-rr-multipair (kernel compile + > netperf-tcp-rr combined), there was little to no change. > > For the heavy overloaded cases (hackbench again), it was variable. Worst > improvement was a gain of 1-3%. Best improvement (single socket skylake > with 8 logical CPUs SMT enabled) was 1%-18% depending on the group > count. I wrote up a changelog to remind future me why I bent it up, but I'm not going to submit it. 
I'll leave the twiddling to folks who can be more responsive to possible (spelled probable;) regression reports than I can be. -Mike
* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-05 8:42 ` Mike Galbraith @ 2021-10-05 9:31 ` Mel Gorman 2021-10-06 6:46 ` Mike Galbraith 2021-10-08 5:06 ` Mike Galbraith 0 siblings, 2 replies; 59+ messages in thread From: Mel Gorman @ 2021-10-05 9:31 UTC (permalink / raw) To: Mike Galbraith Cc: Barry Song, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, Oct 05, 2021 at 10:42:07AM +0200, Mike Galbraith wrote: > On Tue, 2021-10-05 at 08:47 +0100, Mel Gorman wrote: > > On Mon, Oct 04, 2021 at 11:06:30AM +0200, Mike Galbraith wrote: > > > > > > The mallet below convinced wake_wide() that X waking event threads is > > > something it most definitely should care about. It's not perfect, can > > > get caught with its pants down, because behavior changes a LOT, but I > > > at least have to work at it a bit to stack tasks to the ceiling. > > > > > > With make -j8 running along with firefox with two tabs, one containing > > > youtube's suggestions of stuff you probably don't want to watch, the > > > other a running clip, if I have the idle tab in focus, and don't drive > > > mouse around, flips decay enough for wake_wide() to lose interest, but > > > just wiggle the mouse, and it starts waking wide. Focus on the running > > > clip, and it continuously wakes wide. > > > > > > Hacky, but way better behavior.. at this particular testcase.. in this > > > particular box.. at least once :) > > > > > > > Only three machines managed to complete tests overnight. For most > > workloads test, it was neutral or slight improvements. For > > multi-kernbench__netperf-tcp-rr-multipair (kernel compile + > > netperf-tcp-rr combined), there was little to no change. > > > > For the heavy overloaded cases (hackbench again), it was variable. Worst > > improvement was a gain of 1-3%. 
Best improvement (single socket skylake > > with 8 logical CPUs SMT enabled) was 1%-18% depending on the group > > count. > > I wrote up a changelog to remind future me why I bent it up, but I'm > not going to submit it. I'll leave the twiddling to folks who can be > more responsive to possible (spelled probable;) regression reports than > I can be. > Thanks Mike. If you write a formal patch with a Signed-off-by, I'll use it as a baseline for the wakegran work or submit it directly. That way, I'll get any LKP reports, user regression reports on lkml, any regression reports via openSUSE etc and deal with them. Based on the results I have so far, I would not be twiddling it further but that might change when a full range of machines have completed all of their tests. Ideally, I would do some tracing to confirm that maximum runqueue depth is really reduced by the path. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
2021-10-05 9:31 ` Mel Gorman
@ 2021-10-06 6:46 ` Mike Galbraith
2021-10-08 5:06 ` Mike Galbraith
1 sibling, 0 replies; 59+ messages in thread
From: Mike Galbraith @ 2021-10-06 6:46 UTC (permalink / raw)
To: Mel Gorman
Cc: Barry Song, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Tue, 2021-10-05 at 10:31 +0100, Mel Gorman wrote:
>
> If you write a formal patch with a Signed-off-by, I'll use it as a baseline
> for the wakegran work or submit it directly. That way, I'll get any LKP
> reports, user regression reports on lkml, any regression reports via
> openSUSE etc and deal with them.
>
> Based on the results I have so far, I would not be twiddling it further
> but that might change when a full range of machines have completed all
> of their tests. Ideally, I would do some tracing to confirm that maximum
> runqueue depth is really reduced by the path.

Ok, I amended it, adding probably way too many words for a dinky
bend-adjust. Feel free to do whatever you like with every bit below.

sched: Make wake_wide() handle wakees with no wakee_flips

While looking into a wakeup time task stacking problem, noticed that
wake_wide() wasn't recognizing X as a waker-of-many despite it regularly
burst waking 24 QXcbEventQueue threads. The reason for this lies in the
heuristic requiring both the multi-waker and its minions to be earning
wakee_flips, ie both wake more than one task. X earns plenty, but the
event threads rarely earn any, allowing the lot to meet
wake_affine_weight(), where its use of slow to update load averages MAY
direct the entire burst toward the waker's CPU, where they WILL stack
if SIS can't deflect them.

To combat this, have the multi-waker (X in this case) trickle enough
flips to its wakees to keep those who don't earn flips barely eligible.
To reduce aging taking too many of the thread pool below the limit due to periods of inactivity, continue to wake wide IFF the waker has a markedly elevated flip frequency. This improved things for the X+QXcbEventQueue burst, but is a bit hacky. We need a better M:N load activity differentiator. Note: given wake_wide()'s mission is to recognize when thread pool activity exceeds sd_llc_size, it logically SHOULD play no role whatsoever in boxen with a single LLC, there being no other LLCs to expand/shrink load distribution to/from. While this patchlet hopefully improves M:N detection a bit in general, it fixes nothing. Signed-off-by: Mike Galbraith <efault@gmx.de> --- kernel/sched/fair.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5869,6 +5869,15 @@ static void record_wakee(struct task_str } if (current->last_wakee != p) { + int min = __this_cpu_read(sd_llc_size) << 1; + /* + * Couple waker flips to the wakee for the case where it + * doesn't accrue any of its own, taking care to not push + * it high enough to break the wake_wide() waker:wakees + * heuristic for those that do accrue their own flips. + */ + if (current->wakee_flips > p->wakee_flips * min) + p->wakee_flips++; current->last_wakee = p; current->wakee_flips++; } @@ -5899,7 +5908,7 @@ static int wake_wide(struct task_struct if (master < slave) swap(master, slave); - if (slave < factor || master < slave * factor) + if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor) return 0; return 1; } ^ permalink raw reply [flat|nested] 59+ messages in thread
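The trickle heuristic in the patch above can be modelled outside the kernel. The Python sketch below is illustrative only: `LLC_SIZE` stands in for `sd_llc_size`, plain dicts stand in for `task_struct`, and the kernel's periodic decay of `wakee_flips` is omitted. It shows why the stock heuristic never fires for the X-plus-event-threads case (the wakees have zero flips), while the trickle plus the "markedly elevated waker" test lets a burst waker go wide.

```python
LLC_SIZE = 8  # stands in for sd_llc_size; an assumed value for illustration

def make_task():
    # minimal stand-in for the wakee-tracking fields of task_struct
    return {"wakee_flips": 0, "last_wakee": None}

def record_wakee(waker, wakee):
    # the kernel also decays wakee_flips roughly once per second; omitted here
    if waker["last_wakee"] is not wakee:
        min_ratio = LLC_SIZE << 1
        # the patch: trickle a flip to wakees that never earn their own,
        # keeping them barely eligible for the wake_wide() test below
        if waker["wakee_flips"] > wakee["wakee_flips"] * min_ratio:
            wakee["wakee_flips"] += 1
        waker["last_wakee"] = wakee
        waker["wakee_flips"] += 1

def wake_wide(waker, wakee):
    factor = LLC_SIZE
    master, slave = waker["wakee_flips"], wakee["wakee_flips"]
    if master < slave:
        master, slave = slave, master
    # patched test: also wake wide when the waker's flip rate is markedly
    # elevated, even if the wakee earned almost none of its own flips
    if (slave < factor and master < (factor >> 1) * factor) or master < slave * factor:
        return 0
    return 1

# model X burst-waking 24 QXcbEventQueue threads, as in the changelog
x = make_task()
threads = [make_task() for _ in range(24)]
for _ in range(10):          # ten wakeup bursts
    for t in threads:
        record_wakee(x, t)
```

After the bursts, `wake_wide(x, t)` returns 1 for the event threads, whereas two tasks with no wakeup history still get 0; with the stock `record_wakee()` the event threads would keep zero flips and the waker:wakees heuristic would never trigger.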
* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-10-05 9:31 ` Mel Gorman 2021-10-06 6:46 ` Mike Galbraith @ 2021-10-08 5:06 ` Mike Galbraith 1 sibling, 0 replies; 59+ messages in thread From: Mike Galbraith @ 2021-10-08 5:06 UTC (permalink / raw) To: Mel Gorman Cc: Barry Song, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, 2021-10-05 at 10:31 +0100, Mel Gorman wrote: > Ideally, I would do some tracing to confirm that maximum runqueue depth > is really reduced by the path. I would expect your worst case to remain unchanged, mine does. The patch mitigates, it does not eradicate. I dug up a late 2016 mitigation patch, wedged it into 2021 and added a BFH that does eradicate my stacking depth woes. I'll probably keep it, at least for a while. Not because I feel anything in my desktop, rather because meeting this again (and it being deeper than I recall) reminded me of measuring impact on NFS etc, making it a tad difficult to ignore. Oh well, I'll forget about it eventually.. BTDT. (standard beloved Granny disclaimer) sched: Add SIS stacking mitigation feature Select the least loaded LLC CPU for cache cold tasks and kthreads. Addendum: renamed feature, and give it a big brother. 
Not-Signed-off-by: Mike Galbraith <efault@gmx.de> --- kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++++---- kernel/sched/features.h | 5 ++++ 2 files changed, 55 insertions(+), 4 deletions(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6261,6 +6261,26 @@ static inline int select_idle_smt(struct #endif /* CONFIG_SCHED_SMT */ +static bool task_is_kthread_or_cold(struct task_struct *p) +{ + s64 cold = sysctl_sched_migration_cost; + + if (p->flags & PF_KTHREAD) + return true; + if (cold <= 0) + return false; + return task_rq(p)->clock_task - p->se.exec_start > cold; +} + +static bool cpu_load_inconsistent(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + + if (rq->cfs.h_nr_running < 4) + return false; + return cpu_load(rq) << 2 < scale_load_down(rq->cfs.load.weight); +} + /* * Scan the LLC domain for idle CPUs; this is dynamically regulated by * comparing the average scan cost (tracked in sd->avg_scan_cost) against the @@ -6269,7 +6289,7 @@ static inline int select_idle_smt(struct static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target) { struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); - int i, cpu, idle_cpu = -1, nr = INT_MAX; + int i, cpu, idle_cpu = -1, nr = INT_MAX, ld = -1; struct rq *this_rq = this_rq(); int this = smp_processor_id(); struct sched_domain *this_sd; @@ -6309,6 +6329,21 @@ static int select_idle_cpu(struct task_s time = cpu_clock(this); } + /* + * Select the least loaded CPU for kthreads and cache cold tasks + * if no idle CPU is found. 
+ */ + if ((sched_feat(SIS_SPOT) && task_is_kthread_or_cold(p)) || + (sched_feat(SIS_REXY) && cpu_load_inconsistent(target))) { + idle_cpu = task_cpu(p); + if (idle_cpu != target && !cpus_share_cache(idle_cpu, target)) + idle_cpu = target; + if (unlikely(!sched_cpu_cookie_match(cpu_rq(idle_cpu), p))) + idle_cpu = -1; + else + ld = scale_load_down(cpu_rq(idle_cpu)->cfs.load.weight); + } + for_each_cpu_wrap(cpu, cpus, target + 1) { if (has_idle_core) { i = select_idle_core(p, cpu, cpus, &idle_cpu); @@ -6317,10 +6352,21 @@ static int select_idle_cpu(struct task_s } else { if (!--nr) - return -1; - idle_cpu = __select_idle_cpu(cpu, p); - if ((unsigned int)idle_cpu < nr_cpumask_bits) + return idle_cpu; + i = __select_idle_cpu(cpu, p); + if ((unsigned int)i < nr_cpumask_bits) { + idle_cpu = i; break; + } + } + if (ld > 0 && sched_cpu_cookie_match(cpu_rq(cpu), p)) { + i = scale_load_down(cpu_rq(cpu)->cfs.load.weight); + if (i < ld) { + idle_cpu = cpu; + if (i == 0) + break; + ld = i; + } } } --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -95,3 +95,8 @@ SCHED_FEAT(LATENCY_WARN, false) SCHED_FEAT(ALT_PERIOD, true) SCHED_FEAT(BASE_SLICE, true) + +/* Mitigate PELT induced stacking. */ +SCHED_FEAT(SIS_SPOT, true) +/* Spot's 12 ton big brother. */ +SCHED_FEAT(SIS_REXY, true) ^ permalink raw reply [flat|nested] 59+ messages in thread
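Stripped of the kernel plumbing, the fallback the SIS_SPOT/SIS_REXY patch adds can be modelled as below. This is an illustrative sketch only: `cpus`, `idle`, `load`, and `task_is_cold` are stand-ins for runqueue state, and the cookie matching and `nr` scan limit of the real `select_idle_cpu()` are omitted. The point is the shape of the scan: an idle CPU always wins, but a cache cold task or kthread that finds no idle CPU falls back to the least loaded CPU seen instead of stacking on the target.

```python
def select_idle_cpu(cpus, idle, load, task_is_cold, prev_cpu):
    """Return an idle CPU if the scan finds one; otherwise, for cache
    cold tasks and kthreads, return the least loaded CPU scanned."""
    best, best_load = -1, None
    if task_is_cold:
        # seed the fallback with the task's previous CPU, as the patch
        # seeds idle_cpu with task_cpu(p)
        best, best_load = prev_cpu, load(prev_cpu)
    for cpu in cpus:
        if idle(cpu):
            return cpu
        if task_is_cold:
            l = load(cpu)
            if l < best_load:
                best, best_load = cpu, l
                if l == 0:
                    break  # cannot do better than an unloaded CPU
    return best  # -1 if the task is warm and nothing was idle
```

With no idle CPU available, a warm task still gets -1 (and would stack as before), which matches Mike's remark that the patch mitigates rather than eradicates the worst case.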
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
2021-09-20 14:26 ` [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running Mel Gorman
2021-09-21 3:52 ` Mike Galbraith
@ 2021-09-21 8:03 ` Vincent Guittot
2021-09-21 10:45 ` Mel Gorman
1 sibling, 1 reply; 59+ messages in thread
From: Vincent Guittot @ 2021-09-21 8:03 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
> the kernel.sched_wakeup_granularity_ns sysctl under debugfs. One of the
> reasons this sysctl may be used is for "optimising for throughput",
> particularly when overloaded. The tool TuneD sometimes alters this for two
> profiles e.g. "mssql" and "throughput-performance". At least version 2.9
> does but it changed in master where it will also poke at debugfs instead.
>
> During task migration or wakeup, a decision is made on whether
> to preempt the current task or not. To limit over-scheduling,
> sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
> of runtime before preempting. However, when a domain is heavily overloaded
> (e.g. hackbench), the degree of over-scheduling is still severe. This is

sysctl_sched_wakeup_granularity = 1 msec * (1 + ilog(ncpus))

AFAIK, a 2-socket CascadeLake has 56 cpus which means that
sysctl_sched_wakeup_granularity is 6ms for your platform

> problematic as a lot of time can be wasted rescheduling tasks that could
> instead be used by userspace tasks.
>
> This patch scales the wakeup granularity based on the number of running
> tasks on the CPU up to a max of 8ms by default. The intent is to

This becomes 8*6=48ms on your platform which is far more than the 15ms
below.
Also 48ms is quite a long time to wait for a newly woken task especially when this task is a bottleneck. > allow tasks to run for longer while overloaded so that some tasks may > complete faster and reduce the degree a domain is overloaded. Note that > the TuneD throughput-performance profile allows up to 15ms but there > is no explanation why such a long period was necessary so this patch is > conservative and uses the value that check_preempt_wakeup() also takes > into account. An internet search for instances where this parameter are > tuned to high values offer either no explanation or a broken one. > > This improved hackbench on a range of machines when communicating via > pipes (sockets show little to no difference). For a 2-socket CascadeLake > machine, the results were > > hackbench-process-pipes > 5.15.0-rc1 5.15.0-rc1 > vanilla sched-scalewakegran-v1r4 > Amean 1 0.3253 ( 0.00%) 0.3337 ( -2.56%) > Amean 4 0.8300 ( 0.00%) 0.7983 ( 3.82%) > Amean 7 1.1003 ( 0.00%) 1.1600 * -5.42%* > Amean 12 1.7263 ( 0.00%) 1.6457 * 4.67%* > Amean 21 3.0063 ( 0.00%) 2.7933 * 7.09%* > Amean 30 4.2323 ( 0.00%) 3.8010 * 10.19%* > Amean 48 6.5657 ( 0.00%) 5.6453 * 14.02%* > Amean 79 10.4867 ( 0.00%) 8.5960 * 18.03%* > Amean 110 14.8880 ( 0.00%) 11.4173 * 23.31%* > Amean 141 19.2083 ( 0.00%) 14.3850 * 25.11%* > Amean 172 23.4847 ( 0.00%) 17.1980 * 26.77%* > Amean 203 27.3763 ( 0.00%) 20.1677 * 26.33%* > Amean 234 31.3707 ( 0.00%) 23.4053 * 25.39%* > Amean 265 35.4663 ( 0.00%) 26.3513 * 25.70%* > Amean 296 39.2380 ( 0.00%) 29.3670 * 25.16%* > > For Zen 3; > > hackbench-process-pipes > 5.15.0-rc1 5.15.0-rc1 > vanillasched-scalewakegran-v1r4 > Amean 1 0.3780 ( 0.00%) 0.4080 ( -7.94%) > Amean 4 0.5393 ( 0.00%) 0.5217 ( 3.28%) > Amean 7 0.5480 ( 0.00%) 0.5577 ( -1.76%) > Amean 12 0.5803 ( 0.00%) 0.5667 ( 2.35%) > Amean 21 0.7073 ( 0.00%) 0.6543 * 7.49%* > Amean 30 0.8663 ( 0.00%) 0.8290 ( 4.31%) > Amean 48 1.2720 ( 0.00%) 1.1337 * 10.88%* > Amean 79 1.9403 ( 0.00%) 1.7247 * 11.11%* > 
Amean 110 2.6827 ( 0.00%) 2.3450 * 12.59%* > Amean 141 3.6863 ( 0.00%) 3.0253 * 17.93%* > Amean 172 4.5853 ( 0.00%) 3.4987 * 23.70%* > Amean 203 5.4893 ( 0.00%) 3.9630 * 27.81%* > Amean 234 6.6017 ( 0.00%) 4.4230 * 33.00%* > Amean 265 7.3850 ( 0.00%) 4.8317 * 34.57%* > Amean 296 8.5823 ( 0.00%) 5.3327 * 37.86%* > > For other workloads, the benefits were marginal as the extreme overloaded > case is not hit to the same extent. > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > --- > kernel/sched/fair.c | 30 ++++++++++++++++++++++-------- > 1 file changed, 22 insertions(+), 8 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 038edfaaae9e..8e12aeebf4ce 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4511,7 +4511,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) > } > > static int > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se); > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr, > + struct sched_entity *se); > > /* > * Pick the next process, keeping these things in mind, in this order: > @@ -4550,16 +4551,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr) > second = curr; > } > > - if (second && wakeup_preempt_entity(second, left) < 1) > + if (second && wakeup_preempt_entity(NULL, second, left) < 1) > se = second; > } > > - if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) { > + if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) { > /* > * Someone really wants this to run. If it's not unfair, run it. > */ > se = cfs_rq->next; > - } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) { > + } else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) { > /* > * Prefer last buddy, try to return the CPU to a preempted task. 
> */ > @@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > } > #endif /* CONFIG_SMP */ > > -static unsigned long wakeup_gran(struct sched_entity *se) > +static unsigned long > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se) > { > unsigned long gran = sysctl_sched_wakeup_granularity; > > + /* > + * If rq is specified, scale the granularity relative to the number > + * of running tasks but no more than 8ms with default > + * sysctl_sched_wakeup_granularity settings. The wakeup gran > + * reduces over-scheduling but if tasks are stacked then the > + * domain is likely overloaded and over-scheduling may > + * prolong the overloaded state. > + */ > + if (cfs_rq && cfs_rq->nr_running > 1) > + gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency); > + > /* > * Since its curr running now, convert the gran from real-time > * to virtual-time in his units. > @@ -7079,14 +7092,15 @@ static unsigned long wakeup_gran(struct sched_entity *se) > * > */ > static int > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se) > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr, > + struct sched_entity *se) > { > s64 gran, vdiff = curr->vruntime - se->vruntime; > > if (vdiff <= 0) > return -1; > > - gran = wakeup_gran(se); > + gran = wakeup_gran(cfs_rq, se); > if (vdiff > gran) > return 1; > > @@ -7191,7 +7205,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ > return; > > update_curr(cfs_rq); > - if (wakeup_preempt_entity(se, pse) == 1) { > + if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) { > /* > * Bias pick_next to pick the sched entity that is > * triggering this preemption. > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 59+ messages in thread
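Vincent's arithmetic can be checked with a small model. The sketch below is illustrative only: it assumes the default SCHED_TUNABLESCALING_LOG boot-time scaling of the 1ms base value and the patch's `min(nr_running >> 1, sched_nr_latency)` multiplier with the default `sched_nr_latency` of 8.

```python
def ilog2(n):
    # integer log2, like the kernel's ilog2()
    return n.bit_length() - 1

def base_wakeup_gran_ns(ncpus, base_ns=1_000_000):
    # default sysctl value after SCHED_TUNABLESCALING_LOG scaling at boot
    return base_ns * (1 + ilog2(ncpus))

def scaled_wakeup_gran_ns(ncpus, nr_running, sched_nr_latency=8):
    # the proposed patch: multiply the granularity by half the runqueue
    # depth, capped at sched_nr_latency
    gran = base_wakeup_gran_ns(ncpus)
    if nr_running > 1:
        gran *= min(nr_running >> 1, sched_nr_latency)
    return gran
```

Under these assumptions, 56 CPUs give a 6ms base granularity and a runqueue depth of 16 or more hits the 48ms worst case Vincent describes; Mel's 7ms figure would correspond to a machine with 64 or more logical CPUs.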
* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running 2021-09-21 8:03 ` Vincent Guittot @ 2021-09-21 10:45 ` Mel Gorman 0 siblings, 0 replies; 59+ messages in thread From: Mel Gorman @ 2021-09-21 10:45 UTC (permalink / raw) To: Vincent Guittot Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML On Tue, Sep 21, 2021 at 10:03:56AM +0200, Vincent Guittot wrote: > On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved > > the kernel.sched_wakeup_granularity_ns sysctl under debugfs. One of the > > reasons why this sysctl may be used may be for "optimising for throughput", > > particularly when overloaded. The tool TuneD sometimes alters this for two > > profiles e.g. "mssql" and "throughput-performance". At least version 2.9 > > does but it changed in master where it also will poke at debugfs instead. > > > > During task migration or wakeup, a decision is made on whether > > to preempt the current task or not. To limit over-scheduled, > > sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms > > of runtime before preempting. However, when a domain is heavily overloaded > > (e.g. hackbench), the degree of over-scheduling is still severe. This is > > sysctl_sched_wakeup_granularity = 1 msec * (1 + ilog(ncpus)) > AFAIK, a 2-socket CascadeLake has 56 cpus which means that > sysctl_sched_wakeup_granularity is 6ms for your platform > On my machine it becomes 7ms but lets assume there were 56 cpus to avoid confusion. > > problematic as a lot of time can be wasted rescheduling tasks that could > > instead be used by userspace tasks. > > > > This patch scales the wakeup granularity based on the number of running > > tasks on the CPU up to a max of 8ms by default. The intent is to > > This becomes 8*6=48ms on your platform which is far more than the 15ms > below. 
Also 48ms is quite a long time to wait for a newly woken task > especially when this task is a bottleneck. > With the patch on top I proposed to Mike to take FAIR_SLEEPERS into account, it becomes ((sysctl_sched_latency / gran) >> 1) by default which becomes 18ms for heavy overloading or potentially 12ms if there is enough load to stack 2 tasks. The patch generates a warning as I didn't even build test it, but hey, it was for illustrative purposes. Is that any better conceptually or should we ignore the problem? My motivation here really is to reduce the motivation of others to "tune" debugfs values or be tempted to revert the move to debugfs. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 59+ messages in thread
Thread overview: 59+ messages
2021-09-20 14:26 [PATCH 0/2] Scale wakeup granularity relative to nr_running Mel Gorman
2021-09-20 14:26 ` [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup Mel Gorman
2021-09-21 7:21 ` Vincent Guittot
2021-09-21 7:53 ` Mel Gorman
2021-09-21 8:12 ` Vincent Guittot
2021-09-21 8:21 ` Peter Zijlstra
2021-09-21 10:03 ` Mel Gorman
2021-09-20 14:26 ` [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running Mel Gorman
2021-09-21 3:52 ` Mike Galbraith
2021-09-21 5:50 ` Mike Galbraith
2021-09-21 7:04 ` Mike Galbraith
2021-09-21 10:36 ` Mel Gorman
2021-09-21 12:32 ` Mike Galbraith
2021-09-21 14:03 ` Mel Gorman
2021-10-05 9:24 ` Peter Zijlstra
2021-09-22 5:22 ` Mike Galbraith
2021-09-22 13:20 ` Mel Gorman
2021-09-22 14:04 ` Mike Galbraith
2021-09-22 14:15 ` Vincent Guittot
2021-09-22 15:04 ` Mel Gorman
2021-09-22 16:00 ` Vincent Guittot
2021-09-22 17:38 ` Mel Gorman
2021-09-22 18:22 ` Vincent Guittot
2021-09-22 18:57 ` Mel Gorman
2021-09-23 1:47 ` Mike Galbraith
2021-09-23 8:40 ` Vincent Guittot
2021-09-23 9:21 ` Mike Galbraith
2021-09-23 12:41 ` Vincent Guittot
2021-09-23 13:14 ` Mike Galbraith
2021-09-27 11:17 ` Mel Gorman
2021-09-27 14:17 ` Mike Galbraith
2021-10-04 8:05 ` Mel Gorman
2021-10-04 16:37 ` Vincent Guittot
2021-10-05 7:41 ` Mel Gorman
2021-09-27 14:19 ` Vincent Guittot
2021-09-27 15:02 ` Mel Gorman
2021-09-23 12:24 ` Phil Auld
2021-10-05 10:36 ` Peter Zijlstra
2021-10-05 14:12 ` Phil Auld
2021-10-05 14:32 ` Peter Zijlstra
2021-10-05 10:28 ` Peter Zijlstra
2021-10-05 10:23 ` Peter Zijlstra
2021-10-05 9:41 ` Peter Zijlstra
2021-09-22 15:05 ` Vincent Guittot
2021-10-05 9:32 ` Peter Zijlstra
2021-10-03 3:07 ` wakeup_affine_weight() is b0rked - was " Mike Galbraith
2021-10-03 7:34 ` Barry Song
2021-10-03 14:52 ` Mike Galbraith
2021-10-03 21:06 ` Barry Song
2021-10-04 1:49 ` Mike Galbraith
2021-10-04 4:34 ` Mike Galbraith
2021-10-04 9:06 ` Mike Galbraith
2021-10-05 7:47 ` Mel Gorman
2021-10-05 8:42 ` Mike Galbraith
2021-10-05 9:31 ` Mel Gorman
2021-10-06 6:46 ` Mike Galbraith
2021-10-08 5:06 ` Mike Galbraith
2021-09-21 8:03 ` Vincent Guittot
2021-09-21 10:45 ` Mel Gorman