* [PATCH 0/2] Scale wakeup granularity relative to nr_running
@ 2021-09-20 14:26 Mel Gorman
  2021-09-20 14:26 ` [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup Mel Gorman
  2021-09-20 14:26 ` [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running Mel Gorman
  0 siblings, 2 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-20 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML, Mel Gorman

It's unfortunate this is colliding with Plumbers, but part of the
discussion is on select_idle_sibling and hackbench is a common
reference workload for evaluating it. The downside is that hackbench
is much more sensitive to other factors than to the SIS cost. A major
component is the value of kernel.sched_wakeup_granularity_ns and this
series tackles that bit.

Patch 1 is minor, it was spotted while developing patch 2.

Patch 2 scales kernel.sched_wakeup_granularity_ns to allow
	tasks to avoid preemption for longer when the machine
	is heavily overloaded.

-- 
2.31.1


* [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup
  2021-09-20 14:26 [PATCH 0/2] Scale wakeup granularity relative to nr_running Mel Gorman
@ 2021-09-20 14:26 ` Mel Gorman
  2021-09-21  7:21   ` Vincent Guittot
  2021-09-20 14:26 ` [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running Mel Gorman
  1 sibling, 1 reply; 59+ messages in thread
From: Mel Gorman @ 2021-09-20 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML, Mel Gorman

The cfs_rq for curr is read during the function preamble; remove the
redundant lookup.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..038edfaaae9e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7190,7 +7190,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	if (cse_is_idle != pse_is_idle)
 		return;
 
-	update_curr(cfs_rq_of(se));
+	update_curr(cfs_rq);
 	if (wakeup_preempt_entity(se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
-- 
2.31.1


* [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-20 14:26 [PATCH 0/2] Scale wakeup granularity relative to nr_running Mel Gorman
  2021-09-20 14:26 ` [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup Mel Gorman
@ 2021-09-20 14:26 ` Mel Gorman
  2021-09-21  3:52   ` Mike Galbraith
  2021-09-21  8:03   ` Vincent Guittot
  1 sibling, 2 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-20 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML, Mel Gorman

Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
the kernel.sched_wakeup_granularity_ns sysctl under debugfs.  One of the
reasons this sysctl may be used is "optimising for throughput",
particularly when overloaded. The TuneD tool alters it for at least two
profiles, "mssql" and "throughput-performance". At least version 2.9
does; in master the behaviour changed and TuneD pokes at debugfs instead.

During task migration or wakeup, a decision is made on whether
to preempt the current task or not. To limit over-scheduling,
sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
of runtime before preempting. However, when a domain is heavily overloaded
(e.g. hackbench), the degree of over-scheduling is still severe. This is
problematic as a lot of time is wasted rescheduling tasks, time that could
instead be spent running userspace work.

This patch scales the wakeup granularity based on the number of running
tasks on the CPU, up to a max of 8ms by default.  The intent is to
allow tasks to run for longer while overloaded so that some tasks may
complete faster and reduce the degree to which a domain is overloaded. Note
that the TuneD throughput-performance profile allows up to 15ms, but there
is no explanation of why such a long period was necessary, so this patch is
conservative and uses the value that check_preempt_wakeup() also takes
into account.  An internet search for instances where this parameter is
tuned to high values offers either no explanation or a broken one.
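
For illustration only (not part of the patch), a stand-alone sketch of
the scaling, assuming a flat 1ms base granularity and sched_nr_latency
of 8:

	/* Effective wakeup granularity under the proposed scaling:
	 * gran *= min(nr_running >> 1, sched_nr_latency)
	 */
	#include <stdio.h>

	#define NSEC_PER_MSEC		1000000UL
	#define SCHED_NR_LATENCY	8UL

	static unsigned long scaled_wakeup_gran(unsigned long base_gran,
						unsigned long nr_running)
	{
		unsigned long factor;

		if (nr_running <= 1)
			return base_gran;

		factor = nr_running >> 1;
		if (factor > SCHED_NR_LATENCY)
			factor = SCHED_NR_LATENCY;

		return base_gran * factor;
	}

	int main(void)
	{
		unsigned long nr;

		/* Grows in 1ms steps every two tasks, caps at 8ms at 16 */
		for (nr = 1; nr <= 20; nr++)
			printf("nr_running=%2lu gran=%lums\n", nr,
			       scaled_wakeup_gran(NSEC_PER_MSEC, nr) / NSEC_PER_MSEC);
		return 0;
	}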

This improved hackbench on a range of machines when communicating via
pipes (sockets show little to no difference). For a 2-socket CascadeLake
machine, the results were

hackbench-process-pipes
                          5.15.0-rc1             5.15.0-rc1
                             vanilla sched-scalewakegran-v1r4
Amean     1        0.3253 (   0.00%)      0.3337 (  -2.56%)
Amean     4        0.8300 (   0.00%)      0.7983 (   3.82%)
Amean     7        1.1003 (   0.00%)      1.1600 *  -5.42%*
Amean     12       1.7263 (   0.00%)      1.6457 *   4.67%*
Amean     21       3.0063 (   0.00%)      2.7933 *   7.09%*
Amean     30       4.2323 (   0.00%)      3.8010 *  10.19%*
Amean     48       6.5657 (   0.00%)      5.6453 *  14.02%*
Amean     79      10.4867 (   0.00%)      8.5960 *  18.03%*
Amean     110     14.8880 (   0.00%)     11.4173 *  23.31%*
Amean     141     19.2083 (   0.00%)     14.3850 *  25.11%*
Amean     172     23.4847 (   0.00%)     17.1980 *  26.77%*
Amean     203     27.3763 (   0.00%)     20.1677 *  26.33%*
Amean     234     31.3707 (   0.00%)     23.4053 *  25.39%*
Amean     265     35.4663 (   0.00%)     26.3513 *  25.70%*
Amean     296     39.2380 (   0.00%)     29.3670 *  25.16%*

For Zen 3;

hackbench-process-pipes
                          5.15.0-rc1             5.15.0-rc1
                             vanilla sched-scalewakegran-v1r4
Amean     1        0.3780 (   0.00%)      0.4080 (  -7.94%)
Amean     4        0.5393 (   0.00%)      0.5217 (   3.28%)
Amean     7        0.5480 (   0.00%)      0.5577 (  -1.76%)
Amean     12       0.5803 (   0.00%)      0.5667 (   2.35%)
Amean     21       0.7073 (   0.00%)      0.6543 *   7.49%*
Amean     30       0.8663 (   0.00%)      0.8290 (   4.31%)
Amean     48       1.2720 (   0.00%)      1.1337 *  10.88%*
Amean     79       1.9403 (   0.00%)      1.7247 *  11.11%*
Amean     110      2.6827 (   0.00%)      2.3450 *  12.59%*
Amean     141      3.6863 (   0.00%)      3.0253 *  17.93%*
Amean     172      4.5853 (   0.00%)      3.4987 *  23.70%*
Amean     203      5.4893 (   0.00%)      3.9630 *  27.81%*
Amean     234      6.6017 (   0.00%)      4.4230 *  33.00%*
Amean     265      7.3850 (   0.00%)      4.8317 *  34.57%*
Amean     296      8.5823 (   0.00%)      5.3327 *  37.86%*

For other workloads, the benefits were marginal as the extreme overload
case is not hit to the same extent.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 038edfaaae9e..8e12aeebf4ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4511,7 +4511,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
+wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
+						struct sched_entity *se);
 
 /*
  * Pick the next process, keeping these things in mind, in this order:
@@ -4550,16 +4551,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 				second = curr;
 		}
 
-		if (second && wakeup_preempt_entity(second, left) < 1)
+		if (second && wakeup_preempt_entity(NULL, second, left) < 1)
 			se = second;
 	}
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
 		/*
 		 * Someone really wants this to run. If it's not unfair, run it.
 		 */
 		se = cfs_rq->next;
-	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
+	} else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
 		/*
 		 * Prefer last buddy, try to return the CPU to a preempted task.
 		 */
@@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
-static unsigned long wakeup_gran(struct sched_entity *se)
+static unsigned long
+wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	unsigned long gran = sysctl_sched_wakeup_granularity;
 
+	/*
+	 * If rq is specified, scale the granularity relative to the number
+	 * of running tasks but no more than 8ms with default
+	 * sysctl_sched_wakeup_granularity settings. The wakeup gran
+	 * reduces over-scheduling but if tasks are stacked then the
+	 * domain is likely overloaded and over-scheduling may
+	 * prolong the overloaded state.
+	 */
+	if (cfs_rq && cfs_rq->nr_running > 1)
+		gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency);
+
 	/*
 	 * Since its curr running now, convert the gran from real-time
 	 * to virtual-time in his units.
@@ -7079,14 +7092,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
  *
  */
 static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
+wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
+						struct sched_entity *se)
 {
 	s64 gran, vdiff = curr->vruntime - se->vruntime;
 
 	if (vdiff <= 0)
 		return -1;
 
-	gran = wakeup_gran(se);
+	gran = wakeup_gran(cfs_rq, se);
 	if (vdiff > gran)
 		return 1;
 
@@ -7191,7 +7205,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		return;
 
 	update_curr(cfs_rq);
-	if (wakeup_preempt_entity(se, pse) == 1) {
+	if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
 		 * triggering this preemption.
-- 
2.31.1


* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-20 14:26 ` [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running Mel Gorman
@ 2021-09-21  3:52   ` Mike Galbraith
  2021-09-21  5:50     ` Mike Galbraith
                       ` (2 more replies)
  2021-09-21  8:03   ` Vincent Guittot
  1 sibling, 3 replies; 59+ messages in thread
From: Mike Galbraith @ 2021-09-21  3:52 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Mon, 2021-09-20 at 15:26 +0100, Mel Gorman wrote:
>
> This patch scales the wakeup granularity based on the number of running
> tasks on the CPU up to a max of 8ms by default.  The intent is to
> allow tasks to run for longer while overloaded so that some tasks may
> complete faster and reduce the degree a domain is overloaded. Note that
> the TuneD throughput-performance profile allows up to 15ms but there
> is no explanation why such a long period was necessary so this patch is
> conservative and uses the value that check_preempt_wakeup() also takes
> into account.  An internet search for instances where this parameter are
> tuned to high values offer either no explanation or a broken one.
>
> This improved hackbench on a range of machines when communicating via
> pipes (sockets show little to no difference). For a 2-socket CascadeLake
> machine, the results were

Twiddling wakeup preemption based upon the performance of a fugly fork
bomb seems like a pretty bad idea to me.

Preemption does rapidly run into diminishing return as load climbs for
a lot of loads, but as you know, it's a rather sticky wicket because
even when over-committed, preventing light control threads from slicing
through (what can be a load's own work crew of) hogs can seriously
injure performance.

<snip>

> @@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  }
>  #endif /* CONFIG_SMP */
>  
> -static unsigned long wakeup_gran(struct sched_entity *se)
> +static unsigned long
> +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         unsigned long gran = sysctl_sched_wakeup_granularity;
>  
> +       /*
> +        * If rq is specified, scale the granularity relative to the number
> +        * of running tasks but no more than 8ms with default
> +        * sysctl_sched_wakeup_granularity settings. The wakeup gran
> +        * reduces over-scheduling but if tasks are stacked then the
> +        * domain is likely overloaded and over-scheduling may
> +        * prolong the overloaded state.
> +        */
> +       if (cfs_rq && cfs_rq->nr_running > 1)
> +               gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency);
> +

Maybe things have changed while I wasn't watching closely, but...

The scaled up tweakables on my little quad desktop box:
sched_nr_latency = 8
sched_wakeup_granularity = 4ms
sched_latency = 24ms

Due to the FAIR_SLEEPERS feature, a task can only receive a max of
sched_latency/2 sleep credit, ie the delta between waking sleeper and
current is clipped to a max of 12 virtual ms, so the instant our
preempt threshold reaches 12.000ms, by human booboo or now 3 runnable
tasks with this change, wakeup preemption is completely disabled, or?

	-Mike

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-21  3:52   ` Mike Galbraith
@ 2021-09-21  5:50     ` Mike Galbraith
  2021-09-21  7:04     ` Mike Galbraith
  2021-09-21 10:36     ` Mel Gorman
  2 siblings, 0 replies; 59+ messages in thread
From: Mike Galbraith @ 2021-09-21  5:50 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Tue, 2021-09-21 at 05:52 +0200, Mike Galbraith wrote:
> On Mon, 2021-09-20 at 15:26 +0100, Mel Gorman wrote:
>
>
> The scaled up tweakables on my little quad desktop box:
> sched_nr_latency = 8
> sched_wakeup_granularity = 4ms
> sched_latency = 24ms
>
> Due to the FAIR_SLEEPERS feature, a task can only receive a max of
> sched_latency/2 sleep credit, ie the delta between waking sleeper and
> current is clipped to a max of 12 virtual ms, so the instant our
> preempt threshold reaches 12.000ms, by human booboo or now 3 runnable
> tasks with this change, wakeup preemption is completely disabled, or?

I just dug up a now ancient artifact testcase allegedly distilled from
a real application's control thread sometime back in the dark ages, and
granularity >= latency/2 still does turn off wakeup preemption.

	-Mike

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-21  3:52   ` Mike Galbraith
  2021-09-21  5:50     ` Mike Galbraith
@ 2021-09-21  7:04     ` Mike Galbraith
  2021-09-21 10:36     ` Mel Gorman
  2 siblings, 0 replies; 59+ messages in thread
From: Mike Galbraith @ 2021-09-21  7:04 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Tue, 2021-09-21 at 05:52 +0200, Mike Galbraith wrote:
>
> ...preempt threshold reaches 12.000ms, by human booboo or now 3
> runnable tasks

Bah, 6 so not so close to hohum burst level, but still something that
maybe warrants a feature.
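
For reference, a stand-alone sketch of where that crossover lands with
the quad-box defaults quoted earlier in the thread (4ms wakeup
granularity, 24ms latency, sleep credit clipped to latency/2 = 12ms).
Illustrative only, not kernel code:

	#include <stdio.h>

	int main(void)
	{
		/* Assumed quad-box defaults from earlier in the thread */
		unsigned long gran_ms = 4, latency_ms = 24;
		unsigned long clip_ms = latency_ms / 2;	/* max sleep credit */
		unsigned long nr;

		/*
		 * Scaled preempt threshold: gran * min(nr_running >> 1, 8).
		 * Once it reaches the 12ms sleeper-credit clip, a waking
		 * sleeper's vruntime lead can no longer exceed it, i.e.
		 * wakeup preemption is effectively off.
		 */
		for (nr = 2; nr <= 8; nr++) {
			unsigned long factor = nr >> 1;
			unsigned long thresh;

			if (factor > 8)
				factor = 8;
			thresh = gran_ms * factor;
			printf("nr_running=%lu thresh=%lums%s\n", nr, thresh,
			       thresh >= clip_ms ? " (no wakeup preemption)" : "");
		}
		return 0;
	}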

	-Mike

* Re: [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup
  2021-09-20 14:26 ` [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup Mel Gorman
@ 2021-09-21  7:21   ` Vincent Guittot
  2021-09-21  7:53     ` Mel Gorman
  0 siblings, 1 reply; 59+ messages in thread
From: Vincent Guittot @ 2021-09-21  7:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> The rq for curr is read during the function preamble, remove the
> redundant lookup.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff69f245b939..038edfaaae9e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7190,7 +7190,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>         if (cse_is_idle != pse_is_idle)
>                 return;
>
> -       update_curr(cfs_rq_of(se));
> +       update_curr(cfs_rq);

se can have been modified by find_matching_se(&se, &pse)

>         if (wakeup_preempt_entity(se, pse) == 1) {
>                 /*
>                  * Bias pick_next to pick the sched entity that is
> --
> 2.31.1
>

* Re: [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup
  2021-09-21  7:21   ` Vincent Guittot
@ 2021-09-21  7:53     ` Mel Gorman
  2021-09-21  8:12       ` Vincent Guittot
  2021-09-21  8:21       ` Peter Zijlstra
  0 siblings, 2 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-21  7:53 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Tue, Sep 21, 2021 at 09:21:16AM +0200, Vincent Guittot wrote:
> On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > The rq for curr is read during the function preamble, remove the
> > redundant lookup.
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ff69f245b939..038edfaaae9e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7190,7 +7190,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> >         if (cse_is_idle != pse_is_idle)
> >                 return;
> >
> > -       update_curr(cfs_rq_of(se));
> > +       update_curr(cfs_rq);
> 
> se can have been modified by find_matching_se(&se, &pse)
> 

I still expected the cfs_rq to be the same, particularly given that the
context is about preempting the current task on a runqueue. Is that
wrong?

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-20 14:26 ` [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running Mel Gorman
  2021-09-21  3:52   ` Mike Galbraith
@ 2021-09-21  8:03   ` Vincent Guittot
  2021-09-21 10:45     ` Mel Gorman
  1 sibling, 1 reply; 59+ messages in thread
From: Vincent Guittot @ 2021-09-21  8:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
> the kernel.sched_wakeup_granularity_ns sysctl under debugfs.  One of the
> reasons why this sysctl may be used may be for "optimising for throughput",
> particularly when overloaded. The tool TuneD sometimes alters this for two
> profiles e.g. "mssql" and "throughput-performance". At least version 2.9
> does but it changed in master where it also will poke at debugfs instead.
>
> During task migration or wakeup, a decision is made on whether
> to preempt the current task or not. To limit over-scheduled,
> sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
> of runtime before preempting. However, when a domain is heavily overloaded
> (e.g. hackbench), the degree of over-scheduling is still severe. This is

sysctl_sched_wakeup_granularity =  1 msec * (1 + ilog(ncpus))
AFAIK, a 2-socket CascadeLake has 56 cpus which means that
sysctl_sched_wakeup_granularity is 6ms for your platform

> problematic as a lot of time can be wasted rescheduling tasks that could
> instead be used by userspace tasks.
>
> This patch scales the wakeup granularity based on the number of running
> tasks on the CPU up to a max of 8ms by default.  The intent is to

This becomes 8*6=48ms on your platform which is far more than the 15ms
below. Also 48ms is quite a long time to wait for a newly woken task
especially when this task is a bottleneck.

> allow tasks to run for longer while overloaded so that some tasks may
> complete faster and reduce the degree a domain is overloaded. Note that
> the TuneD throughput-performance profile allows up to 15ms but there
> is no explanation why such a long period was necessary so this patch is
> conservative and uses the value that check_preempt_wakeup() also takes
> into account.  An internet search for instances where this parameter are
> tuned to high values offer either no explanation or a broken one.
>
> This improved hackbench on a range of machines when communicating via
> pipes (sockets show little to no difference). For a 2-socket CascadeLake
> machine, the results were
>
> hackbench-process-pipes
>                           5.15.0-rc1             5.15.0-rc1
>                              vanilla sched-scalewakegran-v1r4
> Amean     1        0.3253 (   0.00%)      0.3337 (  -2.56%)
> Amean     4        0.8300 (   0.00%)      0.7983 (   3.82%)
> Amean     7        1.1003 (   0.00%)      1.1600 *  -5.42%*
> Amean     12       1.7263 (   0.00%)      1.6457 *   4.67%*
> Amean     21       3.0063 (   0.00%)      2.7933 *   7.09%*
> Amean     30       4.2323 (   0.00%)      3.8010 *  10.19%*
> Amean     48       6.5657 (   0.00%)      5.6453 *  14.02%*
> Amean     79      10.4867 (   0.00%)      8.5960 *  18.03%*
> Amean     110     14.8880 (   0.00%)     11.4173 *  23.31%*
> Amean     141     19.2083 (   0.00%)     14.3850 *  25.11%*
> Amean     172     23.4847 (   0.00%)     17.1980 *  26.77%*
> Amean     203     27.3763 (   0.00%)     20.1677 *  26.33%*
> Amean     234     31.3707 (   0.00%)     23.4053 *  25.39%*
> Amean     265     35.4663 (   0.00%)     26.3513 *  25.70%*
> Amean     296     39.2380 (   0.00%)     29.3670 *  25.16%*
>
> For Zen 3;
>
> hackbench-process-pipes
>                           5.15.0-rc1             5.15.0-rc1
>                              vanillasched-scalewakegran-v1r4
> Amean     1        0.3780 (   0.00%)      0.4080 (  -7.94%)
> Amean     4        0.5393 (   0.00%)      0.5217 (   3.28%)
> Amean     7        0.5480 (   0.00%)      0.5577 (  -1.76%)
> Amean     12       0.5803 (   0.00%)      0.5667 (   2.35%)
> Amean     21       0.7073 (   0.00%)      0.6543 *   7.49%*
> Amean     30       0.8663 (   0.00%)      0.8290 (   4.31%)
> Amean     48       1.2720 (   0.00%)      1.1337 *  10.88%*
> Amean     79       1.9403 (   0.00%)      1.7247 *  11.11%*
> Amean     110      2.6827 (   0.00%)      2.3450 *  12.59%*
> Amean     141      3.6863 (   0.00%)      3.0253 *  17.93%*
> Amean     172      4.5853 (   0.00%)      3.4987 *  23.70%*
> Amean     203      5.4893 (   0.00%)      3.9630 *  27.81%*
> Amean     234      6.6017 (   0.00%)      4.4230 *  33.00%*
> Amean     265      7.3850 (   0.00%)      4.8317 *  34.57%*
> Amean     296      8.5823 (   0.00%)      5.3327 *  37.86%*
>
> For other workloads, the benefits were marginal as the extreme overloaded
> case is not hit to the same extent.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 30 ++++++++++++++++++++++--------
>  1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 038edfaaae9e..8e12aeebf4ce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4511,7 +4511,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  }
>
>  static int
> -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
> +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> +                                               struct sched_entity *se);
>
>  /*
>   * Pick the next process, keeping these things in mind, in this order:
> @@ -4550,16 +4551,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>                                 second = curr;
>                 }
>
> -               if (second && wakeup_preempt_entity(second, left) < 1)
> +               if (second && wakeup_preempt_entity(NULL, second, left) < 1)
>                         se = second;
>         }
>
> -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> +       if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
>                 /*
>                  * Someone really wants this to run. If it's not unfair, run it.
>                  */
>                 se = cfs_rq->next;
> -       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> +       } else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
>                 /*
>                  * Prefer last buddy, try to return the CPU to a preempted task.
>                  */
> @@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  }
>  #endif /* CONFIG_SMP */
>
> -static unsigned long wakeup_gran(struct sched_entity *se)
> +static unsigned long
> +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         unsigned long gran = sysctl_sched_wakeup_granularity;
>
> +       /*
> +        * If rq is specified, scale the granularity relative to the number
> +        * of running tasks but no more than 8ms with default
> +        * sysctl_sched_wakeup_granularity settings. The wakeup gran
> +        * reduces over-scheduling but if tasks are stacked then the
> +        * domain is likely overloaded and over-scheduling may
> +        * prolong the overloaded state.
> +        */
> +       if (cfs_rq && cfs_rq->nr_running > 1)
> +               gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency);
> +
>         /*
>          * Since its curr running now, convert the gran from real-time
>          * to virtual-time in his units.
> @@ -7079,14 +7092,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
>   *
>   */
>  static int
> -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> +                                               struct sched_entity *se)
>  {
>         s64 gran, vdiff = curr->vruntime - se->vruntime;
>
>         if (vdiff <= 0)
>                 return -1;
>
> -       gran = wakeup_gran(se);
> +       gran = wakeup_gran(cfs_rq, se);
>         if (vdiff > gran)
>                 return 1;
>
> @@ -7191,7 +7205,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>                 return;
>
>         update_curr(cfs_rq);
> -       if (wakeup_preempt_entity(se, pse) == 1) {
> +       if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {
>                 /*
>                  * Bias pick_next to pick the sched entity that is
>                  * triggering this preemption.
> --
> 2.31.1
>

* Re: [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup
  2021-09-21  7:53     ` Mel Gorman
@ 2021-09-21  8:12       ` Vincent Guittot
  2021-09-21  8:21       ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Vincent Guittot @ 2021-09-21  8:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Tue, 21 Sept 2021 at 09:53, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Sep 21, 2021 at 09:21:16AM +0200, Vincent Guittot wrote:
> > On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > The rq for curr is read during the function preamble, remove the
> > > redundant lookup.
> > >
> > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index ff69f245b939..038edfaaae9e 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7190,7 +7190,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> > >         if (cse_is_idle != pse_is_idle)
> > >                 return;
> > >
> > > -       update_curr(cfs_rq_of(se));
> > > +       update_curr(cfs_rq);
> >
> > se can have been modified by find_matching_se(&se, &pse)
> >
>
> I still expected the cfs_rq to be the same, particularly given that the
> context is about preempting the current task on a runqueue. Is that
> wrong?

As soon as the tasks don't belong to the same cgroup, se can be
modified and cfs_rq will not be the same

>
> --
> Mel Gorman
> SUSE Labs

* Re: [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup
  2021-09-21  7:53     ` Mel Gorman
  2021-09-21  8:12       ` Vincent Guittot
@ 2021-09-21  8:21       ` Peter Zijlstra
  2021-09-21 10:03         ` Mel Gorman
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2021-09-21  8:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Tue, Sep 21, 2021 at 08:53:09AM +0100, Mel Gorman wrote:
> On Tue, Sep 21, 2021 at 09:21:16AM +0200, Vincent Guittot wrote:
> > On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > The rq for curr is read during the function preamble, remove the
> > > redundant lookup.
> > >
> > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index ff69f245b939..038edfaaae9e 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7190,7 +7190,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> > >         if (cse_is_idle != pse_is_idle)
> > >                 return;
> > >
> > > -       update_curr(cfs_rq_of(se));
> > > +       update_curr(cfs_rq);
> > 
> > se can have been modified by find_matching_se(&se, &pse)
> > 
> 
> I still expected the cfs_rq to be the same, particularly given that the
> context is about preempting the current task on a runqueue. Is that
> wrong?

Yes. There's a cfs_rq for every se. What we do in find_matching_se() is
walk up the hierarchy until both are in the same cfs_rq, otherwise we
cannot compare them.

Fundamentally this means the effective cfs_rq also changes.
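
For reference, a toy model (deliberately not the kernel code) of that
walk: both entities are moved up the group hierarchy until they sit on
the same cfs_rq, because vruntime is only comparable within a single
cfs_rq.

	struct entity {
		struct entity *parent;	/* NULL at the root cfs_rq */
		int depth;
		int cfs_rq_id;		/* stand-in for "which cfs_rq" */
	};

	void find_matching(struct entity **a, struct entity **b)
	{
		/* First bring both entities to the same depth... */
		while ((*a)->depth > (*b)->depth)
			*a = (*a)->parent;
		while ((*b)->depth > (*a)->depth)
			*b = (*b)->parent;

		/* ...then walk both up until they share a cfs_rq. */
		while ((*a)->cfs_rq_id != (*b)->cfs_rq_id) {
			*a = (*a)->parent;
			*b = (*b)->parent;
		}
	}

Which is why, after find_matching_se(), the cfs_rq of se is generally
not the one read in check_preempt_wakeup()'s preamble.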

* Re: [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup
  2021-09-21  8:21       ` Peter Zijlstra
@ 2021-09-21 10:03         ` Mel Gorman
  0 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-21 10:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Tue, Sep 21, 2021 at 10:21:19AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 21, 2021 at 08:53:09AM +0100, Mel Gorman wrote:
> > On Tue, Sep 21, 2021 at 09:21:16AM +0200, Vincent Guittot wrote:
> > > On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote:
> > > >
> > > > The rq for curr is read during the function preamble, remove the
> > > > redundant lookup.
> > > >
> > > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > > > ---
> > > >  kernel/sched/fair.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index ff69f245b939..038edfaaae9e 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7190,7 +7190,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> > > >         if (cse_is_idle != pse_is_idle)
> > > >                 return;
> > > >
> > > > -       update_curr(cfs_rq_of(se));
> > > > +       update_curr(cfs_rq);
> > > 
> > > se can have been modified by find_matching_se(&se, &pse)
> > > 
> > 
> > I still expected the cfs_rq to be the same, particularly given that the
> > context is about preempting the current task on a runqueue. Is that
> > wrong?
> 
> Yes. There's a cfs_rq for every se. What we do in find_matching_se() is
> walk up the hiarachy until both are in the same cfs_rq, otherwse we
> cannot compare them.
> 
> Fundamentally this means the effective cfs_rq also changes.

Ok, thanks. I'll read into this more but the patch is dead.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-21  3:52   ` Mike Galbraith
  2021-09-21  5:50     ` Mike Galbraith
  2021-09-21  7:04     ` Mike Galbraith
@ 2021-09-21 10:36     ` Mel Gorman
  2021-09-21 12:32       ` Mike Galbraith
  2021-09-22  5:22       ` Mike Galbraith
  2 siblings, 2 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-21 10:36 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
> On Mon, 2021-09-20 at 15:26 +0100, Mel Gorman wrote:
> >
> > This patch scales the wakeup granularity based on the number of running
> > tasks on the CPU up to a max of 8ms by default.  The intent is to
> > allow tasks to run for longer while overloaded so that some tasks may
> > complete faster and reduce the degree a domain is overloaded. Note that
> > the TuneD throughput-performance profile allows up to 15ms but there
> > is no explanation why such a long period was necessary so this patch is
> > conservative and uses the value that check_preempt_wakeup() also takes
> > into account.  An internet search for instances where this parameter are
> > tuned to high values offer either no explanation or a broken one.
> >
> > This improved hackbench on a range of machines when communicating via
> > pipes (sockets show little to no difference). For a 2-socket CascadeLake
> > machine, the results were
> 
> Twiddling wakeup preemption based upon the performance of a fugly fork
> bomb seems like a pretty bad idea to me.
> 

I'm no fan of hackbench, but unfortunately it's heavily used for
evaluating scheduler changes, and the notion that over-scheduling can
reduce the throughput of communicating tasks in the overloaded case is
still problematic.

> Preemption does rapidly run into diminishing return as load climbs for
> a lot of loads, but as you know, it's a rather sticky wicket because
> even when over-committed, preventing light control threads from slicing
> through (what can be a load's own work crew of) hogs can seriously
> injure performance.
> 

Turning this into a classic Rob Peter To Pay Paul problem. We don't know
if there is a light control thread that needs to run or not that affects
overall performance. It all depends on whether that control thread needs
to make progress for the overall workload or whether there are a mix of
workloads resulting in overloading.

> <snip>
> 
> > @@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  }
> >  #endif /* CONFIG_SMP */
> >  
> > -static unsigned long wakeup_gran(struct sched_entity *se)
> > +static unsigned long
> > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> >         unsigned long gran = sysctl_sched_wakeup_granularity;
> >  
> > +       /*
> > +        * If rq is specified, scale the granularity relative to the number
> > +        * of running tasks but no more than 8ms with default
> > +        * sysctl_sched_wakeup_granularity settings. The wakeup gran
> > +        * reduces over-scheduling but if tasks are stacked then the
> > +        * domain is likely overloaded and over-scheduling may
> > +        * prolong the overloaded state.
> > +        */
> > +       if (cfs_rq && cfs_rq->nr_running > 1)
> > +               gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency);
> > +
> 
> Maybe things have changed while I wasn't watching closely, but...
> 

Yeah, a lot has changed and unfortunately, I still lack a lot of
historical context on why things are the way they are. 

> The scaled up tweakables on my little quad desktop box:
> sched_nr_latency = 8
> sched_wakeup_granularity = 4ms
> sched_latency = 24ms
> 
> Due to the FAIR_SLEEPERS feature, a task can only receive a max of
> sched_latency/2 sleep credit,

A task that just became runnable rather than all tasks, right?

> ie the delta between waking sleeper and
> current is clipped to a max of 12 virtual ms, so the instant our
> preempt threshold reaches 12.000ms, by human booboo or now 3 runnable
> tasks with this change, wakeup preemption is completely disabled, or?
> 

I'm having trouble reconciling this in some sensible fashion.
sched_nr_latency is the threshold where the sched period might be stretched
(sysctl_sched_latency/sysctl_sched_min_granularity), which is why I picked
it - (nr_running > sched_nr_latency) is a tipping point where the
scheduler changes behaviour.
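
For reference, a stand-alone restatement of that tipping point, assuming
the unscaled 6ms sysctl_sched_latency and 0.75ms
sysctl_sched_min_granularity defaults so that sched_nr_latency = 8.
Illustrative only, not the kernel code:

	#include <stdio.h>

	/* Roughly mirrors __sched_period(): once nr_running exceeds
	 * sched_nr_latency, the period stops being the fixed latency
	 * target and stretches as nr_running * min_granularity instead.
	 */
	static unsigned long long sched_period_ns(unsigned long nr_running)
	{
		const unsigned long long latency = 6000000ULL;	/* 6ms */
		const unsigned long long min_gran = 750000ULL;	/* 0.75ms */
		const unsigned long nr_latency = latency / min_gran; /* 8 */

		if (nr_running > nr_latency)
			return nr_running * min_gran;
		return latency;
	}

	int main(void)
	{
		printf("8 tasks: %lluus, 9 tasks: %lluus\n",
		       sched_period_ns(8) / 1000, sched_period_ns(9) / 1000);
		return 0;
	}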

FAIR_SLEEPERS primarily affects tasks that just became runnable and the
new task is trying to fit in without causing too much disruption based
on sysctl_sched_latency.

As sysctl_sched_wakeup_granularity now lives in debugfs and should not be
tuned, we want to avoid people reaching for it. On the other hand, we also
know from the results that overloaded tasks may fail to make sufficient
progress, so do we either try to dynamically adjust or do we just ignore
the problem?

If we are to dynamically adjust then what should it be? One alternative
could be to base the tipping point on the ratio of sysctl_sched_latency to
gran, take FAIR_SLEEPERS into account, sanity check the result and stick it
behind a feature flag in case it needs to be disabled to debug a preempt
starvation problem.

This on top?

---8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e12aeebf4ce..4c94af6ecb1d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7052,14 +7052,21 @@ wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	/*
 	 * If rq is specified, scale the granularity relative to the number
-	 * of running tasks but no more than 8ms with default
-	 * sysctl_sched_wakeup_granularity settings. The wakeup gran
-	 * reduces over-scheduling but if tasks are stacked then the
-	 * domain is likely overloaded and over-scheduling may
-	 * prolong the overloaded state.
-	 */
-	if (cfs_rq && cfs_rq->nr_running > 1)
-		gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency);
+	 * of running tasks but no more than sysctl_sched_latency.
+	 * The wakeup gran reduces over-scheduling but if tasks are
+	 * stacked then the domain is likely overloaded and over-scheduling
+	 * may prolong the overloaded state.
+	 */
+	if (sched_feat(SCALE_WAKEUP_GRAN) &&
+	    cfs_rq && cfs_rq->nr_running > 1) {
+		int max_scale = sysctl_sched_latency / gran;
+
+		if (sched_feat(GENTLE_FAIR_SLEEPERS))
+			max_scale >>= 1;
+
+		if (max_scale)
+			gran *= min(cfs_rq->nr_running >> 1, max_scale);
+	}
 
 	/*
 	 * Since its curr running now, convert the gran from real-time
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7f8dace0964c..d041d7023029 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -95,3 +95,9 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+/*
+ * Scale sched_wakeup_granularity dynamically based on the number of running
+ * tasks up to a cap of sysctl_sched_latency.
+ */
+SCHED_FEAT(SCALE_WAKEUP_GRAN, true)

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-21  8:03   ` Vincent Guittot
@ 2021-09-21 10:45     ` Mel Gorman
  0 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-21 10:45 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Tue, Sep 21, 2021 at 10:03:56AM +0200, Vincent Guittot wrote:
> On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
> > the kernel.sched_wakeup_granularity_ns sysctl under debugfs.  One of the
> > reasons why this sysctl may be used may be for "optimising for throughput",
> > particularly when overloaded. The tool TuneD sometimes alters this for two
> > profiles e.g. "mssql" and "throughput-performance". At least version 2.9
> > does but it changed in master where it also will poke at debugfs instead.
> >
> > During task migration or wakeup, a decision is made on whether
> > to preempt the current task or not. To limit over-scheduled,
> > sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
> > of runtime before preempting. However, when a domain is heavily overloaded
> > (e.g. hackbench), the degree of over-scheduling is still severe. This is
> 
> sysctl_sched_wakeup_granularity =  1 msec * (1 + ilog(ncpus))
> AFAIK, a 2-socket CascadeLake has 56 cpus which means that
> sysctl_sched_wakeup_granularity is 6ms for your platform
> 

On my machine it becomes 7ms but let's assume there are 56 cpus to avoid
confusion.

> > problematic as a lot of time can be wasted rescheduling tasks that could
> > instead be used by userspace tasks.
> >
> > This patch scales the wakeup granularity based on the number of running
> > tasks on the CPU up to a max of 8ms by default.  The intent is to
> 
> This becomes 8*6=48ms on your platform which is far more than the 15ms
> below. Also 48ms is quite a long time to wait for a newly woken task
> especially when this task is a bottleneck.
> 

With the patch on top that I proposed to Mike, which takes FAIR_SLEEPERS
into account, the multiplier is capped at ((sysctl_sched_latency / gran) >> 1)
by default, so the effective granularity becomes at most 18ms for heavy
overloading, or potentially 12ms if there is enough load to stack 2 tasks.
The patch generates a warning as I didn't even build test it, but hey, it
was for illustrative purposes.
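
For the curious, a quick stand-alone check of those numbers, assuming
the 6ms granularity from Vincent's figures and a correspondingly scaled
36ms sysctl_sched_latency. Illustrative only:

	#include <stdio.h>

	int main(void)
	{
		unsigned long gran_ms = 6, latency_ms = 36;	/* assumed */
		/* Multiplier cap from the follow-up patch, halved for
		 * GENTLE_FAIR_SLEEPERS. */
		unsigned long max_scale = (latency_ms / gran_ms) >> 1;

		printf("max_scale=%lu, capped gran=%lums\n",
		       max_scale, gran_ms * max_scale);	/* 3 and 18ms */
		return 0;
	}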

Is that any better conceptually or should we ignore the problem? My
motivation here really is to reduce the temptation for others to "tune"
debugfs values or to revert the move to debugfs.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-21 10:36     ` Mel Gorman
@ 2021-09-21 12:32       ` Mike Galbraith
  2021-09-21 14:03         ` Mel Gorman
  2021-10-05  9:24         ` Peter Zijlstra
  2021-09-22  5:22       ` Mike Galbraith
  1 sibling, 2 replies; 59+ messages in thread
From: Mike Galbraith @ 2021-09-21 12:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:
> On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
> >
>
> > Preemption does rapidly run into diminishing return as load climbs for
> > a lot of loads, but as you know, it's a rather sticky wicket because
> > even when over-committed, preventing light control threads from slicing
> > through (what can be a load's own work crew of) hogs can seriously
> > injure performance.
> >
>
> Turning this into a classic Rob Peter To Pay Paul problem.

Damn near everything you do in sched-land is rob Peter to pay Paul.

>  We don't know
> if there is a light control thread that needs to run or not that affects
> overall performance. It all depends on whether that control thread needs
> to make progress for the overall workload or whether there are a mix of
> workloads resulting in overloading.

Sure.. and the operative words "we don't know" cut both ways.

CFS came about because the O(1) scheduler was unfair to the point it had
starvation problems. People pretty much across the board agreed that a
fair scheduler was a much better way to go, and CFS was born.  It
didn't originally have the sleep credit business, but had to grow it to
become _short term_ fair.  Ingo cut the sleep credit in half because of
overscheduling, and that has worked out pretty well all told.. but now
you're pushing it more in the unfair direction, all the way to
extremely unfair for anything and everything very light.

Fairness isn't the holy grail mind you, and at some point, giving up on
short term fairness certainly isn't crazy, as proven by your hackbench
numbers and other numbers we've seen over the years, but taking bites
out of the 'CF' in the CFS that was born to be a corner-case killer
is.. worrisome.  The other shoe will drop.. it always does :)

> > <snip>
> >
> > > @@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > >  }
> > >  #endif /* CONFIG_SMP */
> > >  
> > > -static unsigned long wakeup_gran(struct sched_entity *se)
> > > +static unsigned long
> > > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > >  {
> > >         unsigned long gran = sysctl_sched_wakeup_granularity;
> > >  
> > > +       /*
> > > +        * If rq is specified, scale the granularity relative to the number
> > > +        * of running tasks but no more than 8ms with default
> > > +        * sysctl_sched_wakeup_granularity settings. The wakeup gran
> > > +        * reduces over-scheduling but if tasks are stacked then the
> > > +        * domain is likely overloaded and over-scheduling may
> > > +        * prolong the overloaded state.
> > > +        */
> > > +       if (cfs_rq && cfs_rq->nr_running > 1)
> > > +               gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency);
> > > +
> >
> > Maybe things have changed while I wasn't watching closely, but...
> >
>
> Yeah, a lot has changed and unfortunately, I still lack a lot of
> historical context on why things are the way they are.
>
> > The scaled up tweakables on my little quad desktop box:
> > sched_nr_latency = 8
> > sched_wakeup_granularity = 4ms
> > sched_latency = 24ms
> >
> > Due to the FAIR_SLEEPERS feature, a task can only receive a max of
> > sched_latency/2 sleep credit,
>
> A task that just became runnable rather than all tasks, right?

I'm talking about tasks being enqueued during wakeup.

> > ie the delta between waking sleeper and
> > current is clipped to a max of 12 virtual ms, so the instant our
> > preempt threshold reaches 12.000ms, by human booboo or now 3 runnable
> > tasks with this change, wakeup preemption is completely disabled, or?
> >
>
> I'm having trouble reconciling this in some sensible fashion.
> sched_nr_latency is the threshold where the sched period might be stretched
> (sysctl_sched_latency/sysctl_sched_min_granularity) which is why I picked
> it - (nr_running > sched_nr_latency) is a tipping point where the
> scheduler changes behaviour.

Yeah, an existing branch is as good a place as any.

> FAIR_SLEEPERS primarily affects tasks that just became runnable and the
> new task is trying to fit in without causing too much disruption based
> on sysctl_sched_latency.

No, fair sleepers is all about sleeper wakeup preemption, I think
you're thinking of fork initial placement.

> As sysctl_sched_wakeup_granularity is now debugfs and should not be
> tuned, we want to avoid that. On the other hand, we also know from the
> results that overloaded tasks may fail to make sufficient progress so,
> do we either try to dynamically adjust or do we just ignore the problem?

Good question.  Anything you do is guaranteed to be wrong for
something, which is why CFS was born.. fair doesn't have corner cases.

> If we are to dynamically adjust then what should it be? One alternative
> could be to base the tipping point on the ratio of sysctl_sched_latency to
> gran, take FAIR_SLEEPERS into account, sanity check the result and stick it
> behind a feature flag in case it needs to be disabled to debug a preempt
> starvation problem.
>
> This on top?

Dunno.  Seeing as how it wasn't as aggressive as I thought at first
glance, maybe that original rolldown will be more or less fine.  I
don't like it much, but numbers talk, BS walks.  I trust boxen on such
matters way more than any human, myself included ;-)

	-Mike


* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-21 12:32       ` Mike Galbraith
@ 2021-09-21 14:03         ` Mel Gorman
  2021-10-05  9:24         ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-21 14:03 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Tue, Sep 21, 2021 at 02:32:54PM +0200, Mike Galbraith wrote:
> On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:
> > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
> > >
> >
> > > Preemption does rapidly run into diminishing return as load climbs for
> > > a lot of loads, but as you know, it's a rather sticky wicket because
> > > even when over-committed, preventing light control threads from slicing
> > > through (what can be a load's own work crew of) hogs can seriously
> > > injure performance.
> > >
> >
> > Turning this into a classic Rob Peter To Pay Paul problem.
> 
> Damn near everything you do in sched-land is rob Peter to pay Paul.
> 

True.

> >  We don't know
> > if there is a light control thread that needs to run or not that affects
> > overall performance. It all depends on whether that control thread needs
> > to make progress for the overall workload or whether there are a mix of
> > workloads resulting in overloading.
> 
> Sure.. and operative words "we don't know" cuts both ways.
> 

Yes and I don't believe we know how often a latency critical thread is
running in a heavily overloaded domain (I hope not very often).

> CFS came about because the O1 scheduler was unfair to the point it had
> starvation problems. People pretty much across the board agreed that a
> fair scheduler was a much way better way to go, and CFS was born.  It
> didn't originally have the sleep credit business, but had to grow it to
> become _short term_ fair.  Ingo cut the sleep credit in half because of
> overscheduling, and that has worked out pretty well all told.. but now
> you're pushing it more in the unfair direction, all the way to
> extremely unfair for anything and everything very light.
> 

With the slight caveat that it's pushed in the other direction at the
point where the domain is becoming overloaded. While fairness is still
important, tasks are being starved to some extent.

> Fairness isn't the holy grail mind you, and at some point, giving up on
> short term fairness certainly isn't crazy, as proven by your hackbench
> numbers and other numbers we've seen over the years, but taking bites
> out of the 'CF' in the CFS that was born to be a corner-case killer
> is.. worrisome.  The other shoe will drop.. it always does :)
> 

Lovely. I don't suppose you know where that shoe is?

> > > <snip>
> > >
> > > > @@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > > >  }
> > > >  #endif /* CONFIG_SMP */
> > > >  
> > > > -static unsigned long wakeup_gran(struct sched_entity *se)
> > > > +static unsigned long
> > > > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > > >  {
> > > >         unsigned long gran = sysctl_sched_wakeup_granularity;
> > > >  
> > > > +       /*
> > > > +        * If rq is specified, scale the granularity relative to the number
> > > > +        * of running tasks but no more than 8ms with default
> > > > +        * sysctl_sched_wakeup_granularity settings. The wakeup gran
> > > > +        * reduces over-scheduling but if tasks are stacked then the
> > > > +        * domain is likely overloaded and over-scheduling may
> > > > +        * prolong the overloaded state.
> > > > +        */
> > > > +       if (cfs_rq && cfs_rq->nr_running > 1)
> > > > +               gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency);
> > > > +
> > >
> > > Maybe things have changed while I wasn't watching closely, but...
> > >
> >
> > Yeah, a lot has changed and unfortunately, I still lack a lot of
> > historical context on why things are the way they are.
> >
> > > The scaled up tweakables on my little quad desktop box:
> > > sched_nr_latency = 8
> > > sched_wakeup_granularity = 4ms
> > > sched_latency = 24ms
> > >
> > > Due to the FAIR_SLEEPERS feature, a task can only receive a max of
> > > sched_latency/2 sleep credit,
> >
> > A task that just became runnable rather than all tasks, right?
> 
> I'm talking about tasks being enqueued during wakeup.
> 

Understood.

> > > ie the delta between waking sleeper and
> > > current is clipped to a max of 12 virtual ms, so the instant our
> > > preempt threshold reaches 12.000ms, by human booboo or now 3 runnable
> > > tasks with this change, wakeup preemption is completely disabled, or?
> > >
> >
> > I'm having trouble reconciling this in some sensible fashion.
> > sched_nr_latency is the threshold where the sched period might be stretched
> > (sysctl_sched_latency/sysctl_sched_min_granularity) which is why I picked
> > it - (nr_running > sched_nr_latency) is a tipping point where the
> > scheduler changes behaviour.
> 
> Yeah, an existing branch is as good a place as any.
> 

Although possibly overkill. Even if the new patch does not help
hackbench to the same extent, it may still be a better tradeoff for
workloads that have critical threads running in an overloaded domain.

> > FAIR_SLEEPERS primarily affects tasks that just became runnable and the
> > new task is trying to fit in without causing too much disruption based
> > on sysctl_sched_latency.
> 
> No, fair sleepers is all about sleeper wakeup preemption, I think
> you're thinking of fork initial placement.
> 

Understood.

> > As sysctl_sched_wakeup_granularity is now debugfs and should not be
> > tuned, we want to avoid that. On the other hand, we also know from the
> > results that overloaded tasks may fail to make sufficient progress so,
> > do we either try to dynamically adjust or do we just ignore the problem?
> 
> Good question.  Anything you do is guaranteed to be wrong for
> something, which is why CFS was born.. fair doesn't have corner cases.
> 

And indeed, this is basically trading some fairness to allow some tasks to
complete faster so that load goes down. It's similar to the race-to-idle
problem in frequency management, which asks "Is it better to use more
power to finish quickly or conserve power and take longer?". This patch
is asking the question "When overloaded, is it better to let a task run
longer so it finishes quicker or pay the cost of context switching?"

> > If we are to dynamically adjust then what should it be? One alternative
> > could be to base the tipping point on the ratio of sysctl_sched_latency to
> > gran, take FAIR_SLEEPERS into account, sanity check the result and stick it
> > behind a feature flag in case it needs to be disabled to debug a preempt
> > starvation problem.
> >
> > This on top?
> 
> Dunno.  Seeing as how it wasn't as aggressive as I thought at first
> glance, maybe that original rolldown will be more or less fine.  I
> don't like it much, but numbers talk, BS walks.  I trust boxen on such
> matters way more than any human, myself included ;-)
> 

I've queued v2 on the test grid, let's see what falls out.

Thanks Mike for the review and the historical context.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-21 10:36     ` Mel Gorman
  2021-09-21 12:32       ` Mike Galbraith
@ 2021-09-22  5:22       ` Mike Galbraith
  2021-09-22 13:20         ` Mel Gorman
  2021-10-03  3:07         ` wakeup_affine_weight() is b0rked - was " Mike Galbraith
  1 sibling, 2 replies; 59+ messages in thread
From: Mike Galbraith @ 2021-09-22  5:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:
> On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
>
>
> > Preemption does rapidly run into diminishing return as load climbs for
> > a lot of loads, but as you know, it's a rather sticky wicket because
> > even when over-committed, preventing light control threads from slicing
> > through (what can be a load's own work crew of) hogs can seriously
> > injure performance.
> >
>
> Turning this into a classic Rob Peter To Pay Paul problem. We don't know
> if there is a light control thread that needs to run or not that affects
> overall performance. It all depends on whether that control thread needs
> to make progress for the overall workload or whether there are a mix of
> workloads resulting in overloading.

WRT overload, and our good buddies Peter and Paul :) I added...
	if (gran >= sysctl_sched_latency >> 1)
		trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running);
...to watch, and met the below when I.. logged in.

homer:..debug/tracing # tail -20 trace
               X-2229    [002] d..5.    60.690322: wakeup_gran: runnable:9 preempt disabled
               X-2229    [002] d..5.    60.690325: wakeup_gran: runnable:10 preempt disabled
               X-2229    [002] d..5.    60.690330: wakeup_gran: runnable:11 preempt disabled
               X-2229    [002] d..5.    60.690363: wakeup_gran: runnable:13 preempt disabled
               X-2229    [002] d..5.    60.690377: wakeup_gran: runnable:14 preempt disabled
               X-2229    [002] d..5.    60.690390: wakeup_gran: runnable:15 preempt disabled
               X-2229    [002] d..5.    60.690404: wakeup_gran: runnable:16 preempt disabled
               X-2229    [002] d..5.    60.690425: wakeup_gran: runnable:9 preempt disabled
       ksmserver-2694    [003] d..3.    60.690432: wakeup_gran: runnable:6 preempt disabled
       ksmserver-2694    [003] d..3.    60.690436: wakeup_gran: runnable:7 preempt disabled
               X-2229    [002] d..5.    60.690451: wakeup_gran: runnable:6 preempt disabled
               X-2229    [002] d..5.    60.690465: wakeup_gran: runnable:7 preempt disabled
            kmix-2736    [000] d..3.    60.690491: wakeup_gran: runnable:6 preempt disabled
               X-2229    [004] d..5.    92.889635: wakeup_gran: runnable:6 preempt disabled
               X-2229    [004] d..5.    92.889675: wakeup_gran: runnable:6 preempt disabled
               X-2229    [004] d..5.    92.889863: wakeup_gran: runnable:6 preempt disabled
               X-2229    [004] d..5.    92.889944: wakeup_gran: runnable:6 preempt disabled
               X-2229    [004] d..5.    92.889957: wakeup_gran: runnable:7 preempt disabled
               X-2229    [004] d..5.    92.889968: wakeup_gran: runnable:8 preempt disabled
  QXcbEventQueue-2740    [000] d..4.    92.890025: wakeup_gran: runnable:6 preempt disabled
homer:..debug/tracing

Watching 'while sleep 1; do clear;tail trace; done' with nothing but a
kbuild running is like watching top.  There's enough stacking during
routine use of my desktop box that it runs into the tick granularity
wall pretty much continuously, so 'overload' may want redefining.
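
For reference, a minimal sketch of where such a probe can sit, assuming a
scaled wakeup_gran() that has the cfs_rq to hand (illustrative only; the
NULL check matters because pick_next_entity() can pass a NULL cfs_rq):

	static unsigned long
	wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
	{
		unsigned long gran = select_wakeup_gran(cfs_rq);

		/* debug probe: flag when the scaled gran reaches half of
		 * sysctl_sched_latency, i.e. preemption is effectively off */
		if (cfs_rq && gran >= sysctl_sched_latency >> 1)
			trace_printk("runnable:%d preempt disabled\n",
				     cfs_rq->nr_running);

		/* unchanged: convert gran from real-time to virtual time */
		return calc_delta_fair(gran, se);
	}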

	-Mike

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22  5:22       ` Mike Galbraith
@ 2021-09-22 13:20         ` Mel Gorman
  2021-09-22 14:04           ` Mike Galbraith
                             ` (2 more replies)
  2021-10-03  3:07         ` wakeup_affine_weight() is b0rked - was " Mike Galbraith
  1 sibling, 3 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-22 13:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, Sep 22, 2021 at 07:22:20AM +0200, Mike Galbraith wrote:
> On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:
> > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
> >
> >
> > > Preemption does rapidly run into diminishing return as load climbs for
> > > a lot of loads, but as you know, it's a rather sticky wicket because
> > > even when over-committed, preventing light control threads from slicing
> > > through (what can be a load's own work crew of) hogs can seriously
> > > injure performance.
> > >
> >
> > Turning this into a classic Rob Peter To Pay Paul problem. We don't know
> > if there is a light control thread that needs to run or not that affects
> > overall performance. It all depends on whether that control thread needs
> > to make progress for the overall workload or whether there are a mix of
> > workloads resulting in overloading.
> 
> WRT overload, and our good buddies Peter and Paul :) I added...
> 	if (gran >= sysctl_sched_latency >> 1)
> 		trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running);
> ...to watch, and met the below when I.. logged in.
> 
> homer:..debug/tracing # tail -20 trace
>                X-2229    [002] d..5.    60.690322: wakeup_gran: runnable:9 preempt disabled
>                X-2229    [002] d..5.    60.690325: wakeup_gran: runnable:10 preempt disabled
>                X-2229    [002] d..5.    60.690330: wakeup_gran: runnable:11 preempt disabled
>                X-2229    [002] d..5.    60.690363: wakeup_gran: runnable:13 preempt disabled
>                X-2229    [002] d..5.    60.690377: wakeup_gran: runnable:14 preempt disabled
>                X-2229    [002] d..5.    60.690390: wakeup_gran: runnable:15 preempt disabled
>                X-2229    [002] d..5.    60.690404: wakeup_gran: runnable:16 preempt disabled
>                X-2229    [002] d..5.    60.690425: wakeup_gran: runnable:9 preempt disabled
>        ksmserver-2694    [003] d..3.    60.690432: wakeup_gran: runnable:6 preempt disabled
>        ksmserver-2694    [003] d..3.    60.690436: wakeup_gran: runnable:7 preempt disabled
>                X-2229    [002] d..5.    60.690451: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [002] d..5.    60.690465: wakeup_gran: runnable:7 preempt disabled
>             kmix-2736    [000] d..3.    60.690491: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889635: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889675: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889863: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889944: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889957: wakeup_gran: runnable:7 preempt disabled
>                X-2229    [004] d..5.    92.889968: wakeup_gran: runnable:8 preempt disabled
>   QXcbEventQueue-2740    [000] d..4.    92.890025: wakeup_gran: runnable:6 preempt disabled
> homer:..debug/tracing
> 
> Watching 'while sleep 1; do clear;tail trace; done' with nothing but a
> kbuild running is like watching top.  There's enough stacking during
> routine use of my desktop box that it runs into the tick granularity
> wall pretty much continuously, so 'overload' may want redefining.
> 

Ok, that's pretty convincing. You didn't mention whether there were
interactivity glitches, but it's possible. This is what I'm currently
testing but have no results for yet. It caps wakeup_gran at
sysctl_sched_latency.

---8<---
sched/fair: Scale wakeup granularity relative to nr_running

Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
the kernel.sched_wakeup_granularity_ns sysctl under debugfs.  One of the
reasons this sysctl may be used is "optimising for throughput",
particularly when overloaded. The tool TuneD sometimes alters this for two
profiles, e.g. "mssql" and "throughput-performance". At least version 2.9
does; in master it has changed to poke at debugfs instead.

Internal parameters like sysctl_sched_wakeup_granularity are scaled
based on the number of CPUs due to sysctl_sched_tunable_scaling. For
simplicity, the timing figures in this changelog are based on
SCHED_TUNABLESCALING_NONE.

During task migration or wakeup, a decision is made on whether
to preempt the current task or not. To limit over-scheduling,
sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
of runtime before preempting. However, when a domain is heavily overloaded
(e.g. hackbench), the degree of over-scheduling is still severe. This is
problematic as time that could instead be used by userspace tasks is
wasted rescheduling.

However, care must be taken. Even if a system is overloaded, there may
be high priority threads that must still be able to run. Mike Galbraith
explained the constraints as follows:

	CFS came about because the O1 scheduler was unfair to the
	point it had starvation problems. People pretty much across the
	board agreed that a fair scheduler was a much way better way
	to go, and CFS was born.  It didn't originally have the sleep
	credit business, but had to grow it to become _short term_ fair.
	Ingo cut the sleep credit in half because of overscheduling, and
	that has worked out pretty well all told.. but now you're pushing
	it more in the unfair direction, all the way to extremely unfair
	for anything and everything very light.

	Fairness isn't the holy grail mind you, and at some point, giving
	up on short term fairness certainly isn't crazy, as proven by your
	hackbench numbers and other numbers we've seen over the years,
	but taking bites out of the 'CF' in the CFS that was born to be a
	corner-case killer is.. worrisome.  The other shoe will drop.. it
	always does :)

This patch scales the wakeup granularity based on the number of running
tasks on the CPU relative to

	sched_nr_disable_gran = sysctl_sched_latency / sysctl_sched_wakeup_granularity

By default, this will allow the wakeup_gran to scale from
sysctl_sched_wakeup_granularity up to sysctl_sched_latency depending on
the number of running tasks on a cfs_rq. By default, the limit is 6ms.
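
As an illustration, with SCHED_TUNABLESCALING_NONE and the current
defaults (sysctl_sched_latency = 6ms, sysctl_sched_wakeup_granularity =
1ms, hence sched_nr_disable_gran = 6), the GENTLE_FAIR_SLEEPERS
thresholds work out roughly as:

	cfs_rq->nr_running <= 3		wakeup_gran = 1ms (no overload)
	cfs_rq->nr_running 4..6		wakeup_gran = 3ms (light overload)
	cfs_rq->nr_running  > 6		wakeup_gran = 6ms (heavy overload)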

Note that the TuneD throughput-performance profile allows up to 15ms
for sysctl_sched_wakeup_granularity (ignoring scaling) but there is no
explanation of why such a long period is necessary or why sched_latency_ns
is not also adjusted. The intent may have been to disable wakeup preemption
or it might be an oversight.  An internet search for instances where the
sysctl_sched_wakeup_granularity parameter is tuned to high values offers
either no explanation or a broken one.

TODO: Results positive or negative

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c     | 64 ++++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/features.h |  6 +++++
 2 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..5ec3b12039d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -84,6 +84,14 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
 
+/*
+ * This value is kept at sysctl_sched_latency / sysctl_sched_wakeup_granularity
+ *
+ * This influences the decision on whether a waking task can preempt a running
+ * task.
+ */
+static unsigned int sched_nr_disable_gran = 6;
+
 int sched_thermal_decay_shift;
 static int __init setup_sched_thermal_decay_shift(char *str)
 {
@@ -627,6 +635,9 @@ int sched_update_scaling(void)
 	sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
 					sysctl_sched_min_granularity);
 
+	sched_nr_disable_gran = DIV_ROUND_UP(sysctl_sched_latency,
+					sysctl_sched_wakeup_granularity);
+
 #define WRT_SYSCTL(name) \
 	(normalized_sysctl_##name = sysctl_##name / (factor))
 	WRT_SYSCTL(sched_min_granularity);
@@ -4511,7 +4522,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
+wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
+						struct sched_entity *se);
 
 /*
  * Pick the next process, keeping these things in mind, in this order:
@@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 				second = curr;
 		}
 
-		if (second && wakeup_preempt_entity(second, left) < 1)
+		if (second && wakeup_preempt_entity(NULL, second, left) < 1)
 			se = second;
 	}
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
 		/*
 		 * Someone really wants this to run. If it's not unfair, run it.
 		 */
 		se = cfs_rq->next;
-	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
+	} else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
 		/*
 		 * Prefer last buddy, try to return the CPU to a preempted task.
 		 */
@@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
-static unsigned long wakeup_gran(struct sched_entity *se)
+static unsigned long
+select_wakeup_gran(struct cfs_rq *cfs_rq)
+{
+	unsigned int nr_running, threshold;
+
+	if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN))
+		return sysctl_sched_wakeup_granularity;
+
+	/* !GENTLE_FAIR_SLEEPERS has one overload threshold. */
+	if (!sched_feat(GENTLE_FAIR_SLEEPERS)) {
+		if (cfs_rq->nr_running <= sched_nr_disable_gran)
+			return sysctl_sched_wakeup_granularity;
+
+		return sysctl_sched_latency;
+	}
+
+	/* GENTLE_FAIR_SLEEPERS has two overload thresholds. */
+	nr_running = cfs_rq->nr_running;
+	threshold = sched_nr_disable_gran >> 1;
+
+	/* No overload. */
+	if (nr_running <= threshold)
+		return sysctl_sched_wakeup_granularity;
+
+	/* Light overload. */
+	if (nr_running <= sched_nr_disable_gran)
+		return sysctl_sched_latency >> 1;
+
+	/* Heavy overload. */
+	return sysctl_sched_latency;
+}
+
+static unsigned long
+wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	unsigned long gran = sysctl_sched_wakeup_granularity;
+	unsigned long gran = select_wakeup_gran(cfs_rq);
 
 	/*
 	 * Since its curr running now, convert the gran from real-time
@@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
  *
  */
 static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
+wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
+						struct sched_entity *se)
 {
 	s64 gran, vdiff = curr->vruntime - se->vruntime;
 
 	if (vdiff <= 0)
 		return -1;
 
-	gran = wakeup_gran(se);
+	gran = wakeup_gran(cfs_rq, se);
 	if (vdiff > gran)
 		return 1;
 
@@ -7191,7 +7237,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		return;
 
 	update_curr(cfs_rq_of(se));
-	if (wakeup_preempt_entity(se, pse) == 1) {
+	if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
 		 * triggering this preemption.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7f8dace0964c..d041d7023029 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -95,3 +95,9 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+/*
+ * Scale sched_wakeup_granularity dynamically based on the number of running
+ * tasks up to a cap of sysctl_sched_latency.
+ */
+SCHED_FEAT(SCALE_WAKEUP_GRAN, true)
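
As an aside, and assuming the usual sched_feat() plumbing that the hunk
above relies on, the behaviour should be comparable at runtime by writing
NO_SCALE_WAKEUP_GRAN to the sched features file (under
/sys/kernel/debug/sched/ since the debugfs move) rather than rebuilding.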

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 13:20         ` Mel Gorman
@ 2021-09-22 14:04           ` Mike Galbraith
  2021-09-22 14:15           ` Vincent Guittot
  2021-10-05  9:32           ` Peter Zijlstra
  2 siblings, 0 replies; 59+ messages in thread
From: Mike Galbraith @ 2021-09-22 14:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, 2021-09-22 at 14:20 +0100, Mel Gorman wrote:
> On Wed, Sep 22, 2021 at 07:22:20AM +0200, Mike Galbraith wrote:
> >
> > Watching 'while sleep 1; do clear;tail trace; done' with nothing but a
> > kbuild running is like watching top.  There's enough stacking during
> > routine use of my desktop box that it runs into the tick granularity
> > wall pretty much continuously, so 'overload' may want redefining.
> >
>
> Ok, that's pretty convincing. You didn't mention if there were
> interactivity glitches but it's possible.

No, I didn't notice a thing.  That's not surprising given the amount of
scheduling going on.  The stack depth surprised me a bit though.

X-2230    [005] d..5.  2162.123029: wakeup_gran: runnable:6 wakee:QXcbEventQueue:2695 CPU5
X-2230    [005] d..5.  2162.123035: wakeup_gran: runnable:7 wakee:QXcbEventQueue:2656 CPU5
X-2230    [005] d..5.  2162.123046: wakeup_gran: runnable:8 wakee:QXcbEventQueue:5876 CPU5
X-2230    [005] d..5.  2162.123049: wakeup_gran: runnable:9 wakee:QXcbEventQueue:5355 CPU5
X-2230    [005] d..5.  2162.123083: wakeup_gran: runnable:10 wakee:QXcbEventQueue:2723 CPU5
X-2230    [005] d..5.  2162.123097: wakeup_gran: runnable:11 wakee:QXcbEventQueue:2630 CPU5
X-2230    [005] d..5.  2162.123208: wakeup_gran: runnable:12 wakee:QXcbEventQueue:2760 CPU5

That CPU# on the end is task_cpu(). Lord knows why X wakes all those,
but with no SD_BALANCE_WAKE, a busy box and all those having last lived
on waker's CPU, the resulting construct.. probably doesn't very closely
resemble what the programmer had in mind when threading.

	-Mike

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 13:20         ` Mel Gorman
  2021-09-22 14:04           ` Mike Galbraith
@ 2021-09-22 14:15           ` Vincent Guittot
  2021-09-22 15:04             ` Mel Gorman
  2021-09-22 15:05             ` Vincent Guittot
  2021-10-05  9:32           ` Peter Zijlstra
  2 siblings, 2 replies; 59+ messages in thread
From: Vincent Guittot @ 2021-09-22 14:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, 22 Sept 2021 at 15:20, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Wed, Sep 22, 2021 at 07:22:20AM +0200, Mike Galbraith wrote:
> > On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:
> > > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
> > >
> > >
> > > > Preemption does rapidly run into diminishing return as load climbs for
> > > > a lot of loads, but as you know, it's a rather sticky wicket because
> > > > even when over-committed, preventing light control threads from slicing
> > > > through (what can be a load's own work crew of) hogs can seriously
> > > > injure performance.
> > > >
> > >
> > > Turning this into a classic Rob Peter To Pay Paul problem. We don't know
> > > if there is a light control thread that needs to run or not that affects
> > > overall performance. It all depends on whether that control thread needs
> > > to make progress for the overall workload or whether there are a mix of
> > > workloads resulting in overloading.
> >
> > WRT overload, and our good buddies Peter and Paul :) I added...
> >       if (gran >= sysctl_sched_latency >> 1)
> >               trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running);
> > ...to watch, and met the below when I.. logged in.
> >
> > homer:..debug/tracing # tail -20 trace
> >                X-2229    [002] d..5.    60.690322: wakeup_gran: runnable:9 preempt disabled
> >                X-2229    [002] d..5.    60.690325: wakeup_gran: runnable:10 preempt disabled
> >                X-2229    [002] d..5.    60.690330: wakeup_gran: runnable:11 preempt disabled
> >                X-2229    [002] d..5.    60.690363: wakeup_gran: runnable:13 preempt disabled
> >                X-2229    [002] d..5.    60.690377: wakeup_gran: runnable:14 preempt disabled
> >                X-2229    [002] d..5.    60.690390: wakeup_gran: runnable:15 preempt disabled
> >                X-2229    [002] d..5.    60.690404: wakeup_gran: runnable:16 preempt disabled
> >                X-2229    [002] d..5.    60.690425: wakeup_gran: runnable:9 preempt disabled
> >        ksmserver-2694    [003] d..3.    60.690432: wakeup_gran: runnable:6 preempt disabled
> >        ksmserver-2694    [003] d..3.    60.690436: wakeup_gran: runnable:7 preempt disabled
> >                X-2229    [002] d..5.    60.690451: wakeup_gran: runnable:6 preempt disabled
> >                X-2229    [002] d..5.    60.690465: wakeup_gran: runnable:7 preempt disabled
> >             kmix-2736    [000] d..3.    60.690491: wakeup_gran: runnable:6 preempt disabled
> >                X-2229    [004] d..5.    92.889635: wakeup_gran: runnable:6 preempt disabled
> >                X-2229    [004] d..5.    92.889675: wakeup_gran: runnable:6 preempt disabled
> >                X-2229    [004] d..5.    92.889863: wakeup_gran: runnable:6 preempt disabled
> >                X-2229    [004] d..5.    92.889944: wakeup_gran: runnable:6 preempt disabled
> >                X-2229    [004] d..5.    92.889957: wakeup_gran: runnable:7 preempt disabled
> >                X-2229    [004] d..5.    92.889968: wakeup_gran: runnable:8 preempt disabled
> >   QXcbEventQueue-2740    [000] d..4.    92.890025: wakeup_gran: runnable:6 preempt disabled
> > homer:..debug/tracing
> >
> > Watching 'while sleep 1; do clear;tail trace; done' with nothing but a
> > kbuild running is like watching top.  There's enough stacking during
> > routine use of my desktop box that it runs into the tick granularity
> > wall pretty much continuously, so 'overload' may want redefining.
> >
>
> Ok, that's pretty convincing. You didn't mention if there were
> interactivity glitches but it's possible. This is what I'm currently
> testing but have no results for yet. It caps wakeup_gran at
> sysctl_sched_latency.
>
> ---8<---
> sched/fair: Scale wakeup granularity relative to nr_running
>
> Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
> the kernel.sched_wakeup_granularity_ns sysctl under debugfs.  One of the
> reasons why this sysctl may be used may be for "optimising for throughput",
> particularly when overloaded. The tool TuneD sometimes alters this for two
> profiles e.g. "mssql" and "throughput-performance". At least version 2.9
> does but it changed in master where it also will poke at debugfs instead.
>
> Internal parameters like sysctl_sched_wakeup_granularity are scaled
> based on the number of CPUs due to sysctl_sched_tunable_scaling. For
> simplicity, the timing figures in this changelog are based on
> SCHED_TUNABLESCALING_NONE.

This is a bit misleading because the platform that you used to
highlight the problem has a 7ms sysctl_sched_wakeup_granularity, which
is far more than your tick, which should be 1ms

>
> During task migration or wakeup, a decision is made on whether
> to preempt the current task or not. To limit over-scheduled,
> sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
> of runtime before preempting. However, when a domain is heavily overloaded
> (e.g. hackbench), the degree of over-scheduling is still severe. This is
> problematic as time is wasted rescheduling tasks that could instead be
> used by userspace tasks.
>
> However, care must be taken. Even if a system is overloaded, there may
> be high priority threads that must still be able to run. Mike Galbraith
> explained the contraints as follows;
>
>         CFS came about because the O1 scheduler was unfair to the
>         point it had starvation problems. People pretty much across the
>         board agreed that a fair scheduler was a much way better way
>         to go, and CFS was born.  It didn't originally have the sleep
>         credit business, but had to grow it to become _short term_ fair.
>         Ingo cut the sleep credit in half because of overscheduling, and
>         that has worked out pretty well all told.. but now you're pushing
>         it more in the unfair direction, all the way to extremely unfair
>         for anything and everything very light.
>
>         Fairness isn't the holy grail mind you, and at some point, giving
>         up on short term fairness certainly isn't crazy, as proven by your
>         hackbench numbers and other numbers we've seen over the years,
>         but taking bites out of the 'CF' in the CFS that was born to be a
>         corner-case killer is.. worrisome.  The other shoe will drop.. it
>         always does :)
>
> This patch scales the wakeup granularity based on the number of running
> tasks on the CPU relative to
>
>         sched_nr_disable_gran = sysctl_sched_latency / sysctl_sched_wakeup_granularity
>
> By default, this will allow the wakeup_gran to scale from
> sysctl_sched_wakeup_granularity up to sysctl_sched_wakeup_granularity up to
> sysctl_sched_latency depending on the number of running tasks on a cfs_rq.
> By default, the limit is 6ms.
>
> Note that the TuneD throughput-performance profile allows up to 15ms
> for sysctl_sched_latency (ignoring scaling) but there is no explanation
> why such a long period was necessary or why sched_latency_ns is also
> not adjusted. The intent may have been to disable wakeup preemption
> or it might be an oversight.  An internet search for instances where
> sysctl_sched_wakeup_granularity parameter are tuned to high values offer
> either no explanation or a broken one.
>
> TODO: Results positive or negative
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c     | 64 ++++++++++++++++++++++++++++++++++++++++++-------
>  kernel/sched/features.h |  6 +++++
>  2 files changed, 61 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff69f245b939..5ec3b12039d6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -84,6 +84,14 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity       = 1000000UL;
>
>  const_debug unsigned int sysctl_sched_migration_cost   = 500000UL;
>
> +/*
> + * This value is kept at sysctl_sched_latency / sysctl_sched_wakeup_granularity
> + *
> + * This influences the decision on whether a waking task can preempt a running
> + * task.
> + */
> +static unsigned int sched_nr_disable_gran = 6;
> +
>  int sched_thermal_decay_shift;
>  static int __init setup_sched_thermal_decay_shift(char *str)
>  {
> @@ -627,6 +635,9 @@ int sched_update_scaling(void)
>         sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
>                                         sysctl_sched_min_granularity);
>
> +       sched_nr_disable_gran = DIV_ROUND_UP(sysctl_sched_latency,
> +                                       sysctl_sched_wakeup_granularity);
> +
>  #define WRT_SYSCTL(name) \
>         (normalized_sysctl_##name = sysctl_##name / (factor))
>         WRT_SYSCTL(sched_min_granularity);
> @@ -4511,7 +4522,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  }
>
>  static int
> -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
> +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> +                                               struct sched_entity *se);
>
>  /*
>   * Pick the next process, keeping these things in mind, in this order:
> @@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>                                 second = curr;
>                 }
>
> -               if (second && wakeup_preempt_entity(second, left) < 1)
> +               if (second && wakeup_preempt_entity(NULL, second, left) < 1)

Why not apply the same policy here? The tick can also prevent the
current task from moving forward

>                         se = second;
>         }
>
> -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> +       if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
>                 /*
>                  * Someone really wants this to run. If it's not unfair, run it.
>                  */
>                 se = cfs_rq->next;
> -       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> +       } else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
>                 /*
>                  * Prefer last buddy, try to return the CPU to a preempted task.
>                  */
> @@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  }
>  #endif /* CONFIG_SMP */
>
> -static unsigned long wakeup_gran(struct sched_entity *se)
> +static unsigned long
> +select_wakeup_gran(struct cfs_rq *cfs_rq)
> +{
> +       unsigned int nr_running, threshold;
> +
> +       if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN))
> +               return sysctl_sched_wakeup_granularity;
> +
> +       /* !GENTLE_FAIR_SLEEPERS has one overload threshold. */
> +       if (!sched_feat(GENTLE_FAIR_SLEEPERS)) {
> +               if (cfs_rq->nr_running <= sched_nr_disable_gran)

cfs_rq->nr_running reflects the number of sched entities in the cfs
but not the number of running tasks, which is reflected in h_nr_running.

Also, do you want to take into account only the tasks in this cfs and
its children, or all cfs on this rq?

> +                       return sysctl_sched_wakeup_granularity;
> +
> +               return sysctl_sched_latency;
> +       }
> +
> +       /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */
> +       nr_running = cfs_rq->nr_running;
> +       threshold = sched_nr_disable_gran >> 1;
> +
> +       /* No overload. */
> +       if (nr_running <= threshold)
> +               return sysctl_sched_wakeup_granularity;

TBH I don't like these "no overload", "light overload" ...  They don't
have any real meaning beyond the fact that they might work for your
platform and your hackbench test.
As an example, we have already had people complaining that a small cfs
task does not preempt the curr task fast enough.

There is no explanation of why these values are the correct ones; they
are just heuristic results, and we are trying to remove platform
heuristics, not add new ones.

> +
> +       /* Light overload. */
> +       if (nr_running <= sched_nr_disable_gran)
> +               return sysctl_sched_latency >> 1;
> +
> +       /* Heavy overload. */
> +       return sysctl_sched_latency;

Why should a thread without any relationship to curr not preempt it
just because there are already a lot of running threads? In your case,
you want hackbench threads not to preempt each other because they try
to use the same resources, so it's probably better to let the current
one move forward, but that's not a universal policy.

Side question: have you tried changing the nice priority, which also
impacts whether a thread can preempt curr?

> +}
> +
> +static unsigned long
> +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       unsigned long gran = sysctl_sched_wakeup_granularity;
> +       unsigned long gran = select_wakeup_gran(cfs_rq);
>
>         /*
>          * Since its curr running now, convert the gran from real-time
> @@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
>   *
>   */
>  static int
> -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> +                                               struct sched_entity *se)
>  {
>         s64 gran, vdiff = curr->vruntime - se->vruntime;
>
>         if (vdiff <= 0)
>                 return -1;
>
> -       gran = wakeup_gran(se);
> +       gran = wakeup_gran(cfs_rq, se);
>         if (vdiff > gran)
>                 return 1;
>
> @@ -7191,7 +7237,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>                 return;
>
>         update_curr(cfs_rq_of(se));
> -       if (wakeup_preempt_entity(se, pse) == 1) {
> +       if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {

Like for update_curr() above, cfs_rq can be wrong here because se could have changed

>                 /*
>                  * Bias pick_next to pick the sched entity that is
>                  * triggering this preemption.
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 7f8dace0964c..d041d7023029 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -95,3 +95,9 @@ SCHED_FEAT(LATENCY_WARN, false)
>
>  SCHED_FEAT(ALT_PERIOD, true)
>  SCHED_FEAT(BASE_SLICE, true)
> +
> +/*
> + * Scale sched_wakeup_granularity dynamically based on the number of running
> + * tasks up to a cap of sysctl_sched_latency.
> + */
> +SCHED_FEAT(SCALE_WAKEUP_GRAN, true)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 14:15           ` Vincent Guittot
@ 2021-09-22 15:04             ` Mel Gorman
  2021-09-22 16:00               ` Vincent Guittot
  2021-10-05  9:41               ` Peter Zijlstra
  2021-09-22 15:05             ` Vincent Guittot
  1 sibling, 2 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-22 15:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, Sep 22, 2021 at 04:15:27PM +0200, Vincent Guittot wrote:
> > ---8<---
> > sched/fair: Scale wakeup granularity relative to nr_running
> >
> > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
> > the kernel.sched_wakeup_granularity_ns sysctl under debugfs.  One of the
> > reasons why this sysctl may be used may be for "optimising for throughput",
> > particularly when overloaded. The tool TuneD sometimes alters this for two
> > profiles e.g. "mssql" and "throughput-performance". At least version 2.9
> > does but it changed in master where it also will poke at debugfs instead.
> >
> > Internal parameters like sysctl_sched_wakeup_granularity are scaled
> > based on the number of CPUs due to sysctl_sched_tunable_scaling. For
> > simplicity, the timing figures in this changelog are based on
> > SCHED_TUNABLESCALING_NONE.
> 
> This is a bit misleading because the platform that you used to
> highlight the problem has a 7ms sysctl_sched_wakeup_granularity. which
> is far more than your tick which should be 1ms
> 

Tick on the test machines is 4ms (HZ=250).

The reason I used SCHED_TUNABLESCALING_NONE for the changelog is that
the exact values depend on the number of CPUs, so the values are not even
the same across the range of machines I'm using. sysctl_sched_latency,
sysctl_sched_min_granularity and sysctl_sched_wakeup_granularity are all
scaled, but the ratios remain constant.
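
As a worked example of why the ratio is what matters: the scaled-up
values Mike quoted earlier in the thread (sched_latency = 24ms,
sched_wakeup_granularity = 4ms on his desktop) still give
sysctl_sched_latency / sysctl_sched_wakeup_granularity = 6, so
sched_nr_disable_gran and the resulting thresholds are unchanged.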

> > <SNIP>
> >  static int
> > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
> > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> > +                                               struct sched_entity *se);
> >
> >  /*
> >   * Pick the next process, keeping these things in mind, in this order:
> > @@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >                                 second = curr;
> >                 }
> >
> > -               if (second && wakeup_preempt_entity(second, left) < 1)
> > +               if (second && wakeup_preempt_entity(NULL, second, left) < 1)
> 
> Why not applying the same policy here ? the tick can also prevent
> current task to move forward
> 

Because it was less clear whether it was necessary and what the
consequences would be if the skip buddy ran or the next buddy failed to
run because preemption failed, or how it would interact with yield_to etc.

I ended up concluding that those should be separate patches and that
this patch should stick to one topic.


> >                         se = second;
> >         }
> >
> > -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > +       if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
> >                 /*
> >                  * Someone really wants this to run. If it's not unfair, run it.
> >                  */
> >                 se = cfs_rq->next;
> > -       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> > +       } else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
> >                 /*
> >                  * Prefer last buddy, try to return the CPU to a preempted task.
> >                  */
> > @@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  }
> >  #endif /* CONFIG_SMP */
> >
> > -static unsigned long wakeup_gran(struct sched_entity *se)
> > +static unsigned long
> > +select_wakeup_gran(struct cfs_rq *cfs_rq)
> > +{
> > +       unsigned int nr_running, threshold;
> > +
> > +       if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN))
> > +               return sysctl_sched_wakeup_granularity;
> > +
> > +       /* !GENTLE_FAIR_SLEEPERS has one overload threshold. */
> > +       if (!sched_feat(GENTLE_FAIR_SLEEPERS)) {
> > +               if (cfs_rq->nr_running <= sched_nr_disable_gran)
> 
> cfs_rq->nr_running reflects the number of sched entities in the cfs
> but not the number of running tasks which reflected in h_nr_running
> 

Then check_preempt_wakeup may also have the same problem as it uses
nr_running.

> Also do you want to take into account only tasks in this cfs and its
> children or on all cfs on this rq ?
> 

Only this cfs, I think, to limit overhead.

> > +                       return sysctl_sched_wakeup_granularity;
> > +
> > +               return sysctl_sched_latency;
> > +       }
> > +
> > +       /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */
> > +       nr_running = cfs_rq->nr_running;
> > +       threshold = sched_nr_disable_gran >> 1;
> > +
> > +       /* No overload. */
> > +       if (nr_running <= threshold)
> > +               return sysctl_sched_wakeup_granularity;
> 
> TBH I don't like these "no overload", "light overload" ...  They don't
> have any real meaning apart from that it might work for your platform
> and your hackbench test.

They are, at best, a proxy measure for overload, but the alternative is
scanning a bunch of runqueues, similar to what is required when detecting
whether a domain is fully busy or overloaded.

> We already had have people complaining that small cfs task does not
> preempt fast enough curr task as an example
> 

Is there a specific test case that demonstrates this?

> There is no explanation why these values are the correct ones and not
> but are just some random heuristic results and we are trying to remove
> platform heuristic and to not add new
> 

They are a heuristic, yes, but I'm trying to remove the motivation for
users to tune sysctl_sched_wakeup_granularity to stupid values because,
as it stands, tuned will happily poke into debugfs despite the fact that
those parameters are meant for debugging only and the values being set
are of dubious merit.

> > +
> > +       /* Light overload. */
> > +       if (nr_running <= sched_nr_disable_gran)
> > +               return sysctl_sched_latency >> 1;
> > +
> > +       /* Heavy overload. */
> > +       return sysctl_sched_latency;
> 
> Why should a thread without any relationship with the curr,  not
> preempt it because there are already a lot of running threads?

Preemption is not free; without any knowledge of what the thread is
doing, we cannot determine whether an inappropriate amount of CPU time is
being spent dequeueing/enqueuing tasks.

> In
> your case, you want hackbench threads to not preempt each others
> because they tries to use same resources so it's probably better to
> let the current one to move forward but that's not a universal policy.
> 

No, but do you have a better suggestion? hackbench might be stupid, but
it's an example of how a workload can excessively preempt itself. While
overloading an entire machine is stupid, the same could potentially occur
for applications running within a constrained cpumask.

> side question: Have you try to change the nice priority which also
> impact whether a thread can preempt curr ?
> 

No, I have not tried. I guess one could be constructed by running
cyclictest -S with hackbench in the background and measuring whether it
makes a difference to the amount of jitter cyclictest experiences, but
I'm not sure that would cover the case you are concerned with.
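
(If it helps, something like "cyclictest --smp -p 95" run with and
without a "hackbench -g 20 -l 100000" load in the background is roughly
what I have in mind, though those parameters are a guess rather than a
tested recipe.)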

> > +}
> > +
> > +static unsigned long
> > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > -       unsigned long gran = sysctl_sched_wakeup_granularity;
> > +       unsigned long gran = select_wakeup_gran(cfs_rq);
> >
> >         /*
> >          * Since its curr running now, convert the gran from real-time
> > @@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
> >   *
> >   */
> >  static int
> > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> > +                                               struct sched_entity *se)
> >  {
> >         s64 gran, vdiff = curr->vruntime - se->vruntime;
> >
> >         if (vdiff <= 0)
> >                 return -1;
> >
> > -       gran = wakeup_gran(se);
> > +       gran = wakeup_gran(cfs_rq, se);
> >         if (vdiff > gran)
> >                 return 1;
> >
> > @@ -7191,7 +7237,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> >                 return;
> >
> >         update_curr(cfs_rq_of(se));
> > -       if (wakeup_preempt_entity(se, pse) == 1) {
> > +       if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {
> 
> like for update_curr above, cfs_rq can be wrong because se could have changed
> 

Crap, that was a stupid mistake based on earlier review. I'll fix it.
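
Presumably the fix is just to re-derive the runqueue after
find_matching_se() has walked se/pse up the hierarchy, something like
(sketch only, untested):

	update_curr(cfs_rq_of(se));
	if (wakeup_preempt_entity(cfs_rq_of(se), se, pse) == 1) {

rather than reusing the cfs_rq read in the function preamble.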

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 14:15           ` Vincent Guittot
  2021-09-22 15:04             ` Mel Gorman
@ 2021-09-22 15:05             ` Vincent Guittot
  1 sibling, 0 replies; 59+ messages in thread
From: Vincent Guittot @ 2021-09-22 15:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, 22 Sept 2021 at 16:15, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Wed, 22 Sept 2021 at 15:20, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Wed, Sep 22, 2021 at 07:22:20AM +0200, Mike Galbraith wrote:
> > > On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:
> > > > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
> > > >
> > > >
> > > > > Preemption does rapidly run into diminishing return as load climbs for
> > > > > a lot of loads, but as you know, it's a rather sticky wicket because
> > > > > even when over-committed, preventing light control threads from slicing
> > > > > through (what can be a load's own work crew of) hogs can seriously
> > > > > injure performance.
> > > > >
> > > >
> > > > Turning this into a classic Rob Peter To Pay Paul problem. We don't know
> > > > if there is a light control thread that needs to run or not that affects
> > > > overall performance. It all depends on whether that control thread needs
> > > > to make progress for the overall workload or whether there are a mix of
> > > > workloads resulting in overloading.
> > >
> > > WRT overload, and our good buddies Peter and Paul :) I added...
> > >       if (gran >= sysctl_sched_latency >> 1)
> > >               trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running);
> > > ...to watch, and met the below when I.. logged in.
> > >
> > > homer:..debug/tracing # tail -20 trace
> > >                X-2229    [002] d..5.    60.690322: wakeup_gran: runnable:9 preempt disabled
> > >                X-2229    [002] d..5.    60.690325: wakeup_gran: runnable:10 preempt disabled
> > >                X-2229    [002] d..5.    60.690330: wakeup_gran: runnable:11 preempt disabled
> > >                X-2229    [002] d..5.    60.690363: wakeup_gran: runnable:13 preempt disabled
> > >                X-2229    [002] d..5.    60.690377: wakeup_gran: runnable:14 preempt disabled
> > >                X-2229    [002] d..5.    60.690390: wakeup_gran: runnable:15 preempt disabled
> > >                X-2229    [002] d..5.    60.690404: wakeup_gran: runnable:16 preempt disabled
> > >                X-2229    [002] d..5.    60.690425: wakeup_gran: runnable:9 preempt disabled
> > >        ksmserver-2694    [003] d..3.    60.690432: wakeup_gran: runnable:6 preempt disabled
> > >        ksmserver-2694    [003] d..3.    60.690436: wakeup_gran: runnable:7 preempt disabled
> > >                X-2229    [002] d..5.    60.690451: wakeup_gran: runnable:6 preempt disabled
> > >                X-2229    [002] d..5.    60.690465: wakeup_gran: runnable:7 preempt disabled
> > >             kmix-2736    [000] d..3.    60.690491: wakeup_gran: runnable:6 preempt disabled
> > >                X-2229    [004] d..5.    92.889635: wakeup_gran: runnable:6 preempt disabled
> > >                X-2229    [004] d..5.    92.889675: wakeup_gran: runnable:6 preempt disabled
> > >                X-2229    [004] d..5.    92.889863: wakeup_gran: runnable:6 preempt disabled
> > >                X-2229    [004] d..5.    92.889944: wakeup_gran: runnable:6 preempt disabled
> > >                X-2229    [004] d..5.    92.889957: wakeup_gran: runnable:7 preempt disabled
> > >                X-2229    [004] d..5.    92.889968: wakeup_gran: runnable:8 preempt disabled
> > >   QXcbEventQueue-2740    [000] d..4.    92.890025: wakeup_gran: runnable:6 preempt disabled
> > > homer:..debug/tracing
> > >
> > > Watching 'while sleep 1; do clear;tail trace; done' with nothing but a
> > > kbuild running is like watching top.  There's enough stacking during
> > > routine use of my desktop box that it runs into the tick granularity
> > > wall pretty much continuously, so 'overload' may want redefining.
> > >
> >
> > Ok, that's pretty convincing. You didn't mention if there were
> > interactivity glitches but it's possible. This is what I'm currently
> > testing but have no results for yet. It caps wakeup_gran at
> > sysctl_sched_latency.
> >
> > ---8<---
> > sched/fair: Scale wakeup granularity relative to nr_running
> >
> > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
> > the kernel.sched_wakeup_granularity_ns sysctl under debugfs.  One of the
> > reasons why this sysctl may be used may be for "optimising for throughput",
> > particularly when overloaded. The tool TuneD sometimes alters this for two
> > profiles e.g. "mssql" and "throughput-performance". At least version 2.9
> > does but it changed in master where it also will poke at debugfs instead.
> >
> > Internal parameters like sysctl_sched_wakeup_granularity are scaled
> > based on the number of CPUs due to sysctl_sched_tunable_scaling. For
> > simplicity, the timing figures in this changelog are based on
> > SCHED_TUNABLESCALING_NONE.
>
> This is a bit misleading because the platform that you used to
> highlight the problem has a 7ms sysctl_sched_wakeup_granularity. which
> is far more than your tick which should be 1ms
>
> >
> > During task migration or wakeup, a decision is made on whether
> > to preempt the current task or not. To limit over-scheduled,
> > sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
> > of runtime before preempting. However, when a domain is heavily overloaded
> > (e.g. hackbench), the degree of over-scheduling is still severe. This is
> > problematic as time is wasted rescheduling tasks that could instead be
> > used by userspace tasks.
> >
> > However, care must be taken. Even if a system is overloaded, there may
> > be high priority threads that must still be able to run. Mike Galbraith
> > explained the contraints as follows;
> >
> >         CFS came about because the O1 scheduler was unfair to the
> >         point it had starvation problems. People pretty much across the
> >         board agreed that a fair scheduler was a much way better way
> >         to go, and CFS was born.  It didn't originally have the sleep
> >         credit business, but had to grow it to become _short term_ fair.
> >         Ingo cut the sleep credit in half because of overscheduling, and
> >         that has worked out pretty well all told.. but now you're pushing
> >         it more in the unfair direction, all the way to extremely unfair
> >         for anything and everything very light.
> >
> >         Fairness isn't the holy grail mind you, and at some point, giving
> >         up on short term fairness certainly isn't crazy, as proven by your
> >         hackbench numbers and other numbers we've seen over the years,
> >         but taking bites out of the 'CF' in the CFS that was born to be a
> >         corner-case killer is.. worrisome.  The other shoe will drop.. it
> >         always does :)
> >
> > This patch scales the wakeup granularity based on the number of running
> > tasks on the CPU relative to
> >
> >         sched_nr_disable_gran = sysctl_sched_latency / sysctl_sched_wakeup_granularity
> >
> > By default, this will allow the wakeup_gran to scale from
> > sysctl_sched_wakeup_granularity up to sysctl_sched_wakeup_granularity up to
> > sysctl_sched_latency depending on the number of running tasks on a cfs_rq.
> > By default, the limit is 6ms.
> >
> > Note that the TuneD throughput-performance profile allows up to 15ms
> > for sysctl_sched_latency (ignoring scaling) but there is no explanation
> > why such a long period was necessary or why sched_latency_ns is also
> > not adjusted. The intent may have been to disable wakeup preemption
> > or it might be an oversight.  An internet search for instances where
> > sysctl_sched_wakeup_granularity parameter are tuned to high values offer
> > either no explanation or a broken one.
> >
> > TODO: Results positive or negative
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  kernel/sched/fair.c     | 64 ++++++++++++++++++++++++++++++++++++++++++-------
> >  kernel/sched/features.h |  6 +++++
> >  2 files changed, 61 insertions(+), 9 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ff69f245b939..5ec3b12039d6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -84,6 +84,14 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity       = 1000000UL;
> >
> >  const_debug unsigned int sysctl_sched_migration_cost   = 500000UL;
> >
> > +/*
> > + * This value is kept at sysctl_sched_latency / sysctl_sched_wakeup_granularity
> > + *
> > + * This influences the decision on whether a waking task can preempt a running
> > + * task.
> > + */
> > +static unsigned int sched_nr_disable_gran = 6;
> > +
> >  int sched_thermal_decay_shift;
> >  static int __init setup_sched_thermal_decay_shift(char *str)
> >  {
> > @@ -627,6 +635,9 @@ int sched_update_scaling(void)
> >         sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
> >                                         sysctl_sched_min_granularity);
> >
> > +       sched_nr_disable_gran = DIV_ROUND_UP(sysctl_sched_latency,
> > +                                       sysctl_sched_wakeup_granularity);
> > +
> >  #define WRT_SYSCTL(name) \
> >         (normalized_sysctl_##name = sysctl_##name / (factor))
> >         WRT_SYSCTL(sched_min_granularity);
> > @@ -4511,7 +4522,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  }
> >
> >  static int
> > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
> > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> > +                                               struct sched_entity *se);
> >
> >  /*
> >   * Pick the next process, keeping these things in mind, in this order:
> > @@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >                                 second = curr;
> >                 }
> >
> > -               if (second && wakeup_preempt_entity(second, left) < 1)
> > +               if (second && wakeup_preempt_entity(NULL, second, left) < 1)
>
> Why not applying the same policy here ? the tick can also prevent
> current task to move forward
>
> >                         se = second;
> >         }
> >
> > -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > +       if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
> >                 /*
> >                  * Someone really wants this to run. If it's not unfair, run it.
> >                  */
> >                 se = cfs_rq->next;
> > -       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> > +       } else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
> >                 /*
> >                  * Prefer last buddy, try to return the CPU to a preempted task.
> >                  */
> > @@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  }
> >  #endif /* CONFIG_SMP */
> >
> > -static unsigned long wakeup_gran(struct sched_entity *se)
> > +static unsigned long
> > +select_wakeup_gran(struct cfs_rq *cfs_rq)
> > +{
> > +       unsigned int nr_running, threshold;
> > +
> > +       if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN))
> > +               return sysctl_sched_wakeup_granularity;
> > +
> > +       /* !GENTLE_FAIR_SLEEPERS has one overload threshold. */
> > +       if (!sched_feat(GENTLE_FAIR_SLEEPERS)) {
> > +               if (cfs_rq->nr_running <= sched_nr_disable_gran)
>
> cfs_rq->nr_running reflects the number of sched entities in the cfs
> but not the number of running tasks which reflected in h_nr_running
>
> Also do you want to take into account only tasks in this cfs and its
> children or on all cfs on this rq ?
>
> > +                       return sysctl_sched_wakeup_granularity;
> > +
> > +               return sysctl_sched_latency;
> > +       }
> > +
> > +       /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */
> > +       nr_running = cfs_rq->nr_running;
> > +       threshold = sched_nr_disable_gran >> 1;
> > +
> > +       /* No overload. */
> > +       if (nr_running <= threshold)
> > +               return sysctl_sched_wakeup_granularity;
>
> TBH I don't like these "no overload", "light overload" ...  They don't
> have any real meaning apart from that it might work for your platform
> and your hackbench test.
> We already had have people complaining that small cfs task does not
> preempt fast enough curr task as an example
>
> There is no explanation why these values are the correct ones and not
> but are just some random heuristic results and we are trying to remove
> platform heuristic and to not add new
>
> > +
> > +       /* Light overload. */
> > +       if (nr_running <= sched_nr_disable_gran)
> > +               return sysctl_sched_latency >> 1;
> > +
> > +       /* Heavy overload. */
> > +       return sysctl_sched_latency;
>
> Why should a thread without any relationship with the curr,  not
> preempt it because there are already a lot of running threads ? In
> your case, you want hackbench threads to not preempt each others
> because they tries to use same resources so it's probably better to
> let the current one to move forward but that's not a universal policy.
>
> side question: Have you try to change the nice priority which also
> impact whether a thread can preempt curr ?

There have also been some discussions about adding a nice_latency prio
so a task/group can be marked as wanting to be nice to others from a
scheduling latency PoV. One open point was defining what it means to
be nice from a latency PoV
>
> > +}
> > +
> > +static unsigned long
> > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > -       unsigned long gran = sysctl_sched_wakeup_granularity;
> > +       unsigned long gran = select_wakeup_gran(cfs_rq);
> >
> >         /*
> >          * Since its curr running now, convert the gran from real-time
> > @@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
> >   *
> >   */
> >  static int
> > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> > +                                               struct sched_entity *se)
> >  {
> >         s64 gran, vdiff = curr->vruntime - se->vruntime;
> >
> >         if (vdiff <= 0)
> >                 return -1;
> >
> > -       gran = wakeup_gran(se);
> > +       gran = wakeup_gran(cfs_rq, se);
> >         if (vdiff > gran)
> >                 return 1;
> >
> > @@ -7191,7 +7237,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> >                 return;
> >
> >         update_curr(cfs_rq_of(se));
> > -       if (wakeup_preempt_entity(se, pse) == 1) {
> > +       if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {
>
> like for update_curr above, cfs_rq can be wrong because se could have changed
>
> >                 /*
> >                  * Bias pick_next to pick the sched entity that is
> >                  * triggering this preemption.
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index 7f8dace0964c..d041d7023029 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -95,3 +95,9 @@ SCHED_FEAT(LATENCY_WARN, false)
> >
> >  SCHED_FEAT(ALT_PERIOD, true)
> >  SCHED_FEAT(BASE_SLICE, true)
> > +
> > +/*
> > + * Scale sched_wakeup_granularity dynamically based on the number of running
> > + * tasks up to a cap of sysctl_sched_latency.
> > + */
> > +SCHED_FEAT(SCALE_WAKEUP_GRAN, true)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 15:04             ` Mel Gorman
@ 2021-09-22 16:00               ` Vincent Guittot
  2021-09-22 17:38                 ` Mel Gorman
  2021-10-05  9:41               ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Vincent Guittot @ 2021-09-22 16:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, 22 Sept 2021 at 17:04, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Wed, Sep 22, 2021 at 04:15:27PM +0200, Vincent Guittot wrote:
> > > ---8<---
> > > sched/fair: Scale wakeup granularity relative to nr_running
> > >
> > > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
> > > the kernel.sched_wakeup_granularity_ns sysctl under debugfs.  One of the
> > > reasons why this sysctl may be used may be for "optimising for throughput",
> > > particularly when overloaded. The tool TuneD sometimes alters this for two
> > > profiles e.g. "mssql" and "throughput-performance". At least version 2.9
> > > does but it changed in master where it also will poke at debugfs instead.
> > >
> > > Internal parameters like sysctl_sched_wakeup_granularity are scaled
> > > based on the number of CPUs due to sysctl_sched_tunable_scaling. For
> > > simplicity, the timing figures in this changelog are based on
> > > SCHED_TUNABLESCALING_NONE.
> >
> > This is a bit misleading because the platform that you used to
> > highlight the problem has a 7ms sysctl_sched_wakeup_granularity. which
> > is far more than your tick which should be 1ms
> >
>
> Tick on the test machines is 4ms (HZ=250).
>
> The reason I used SCHED_TUNABLESCALING_NONE for the changelog is that
> the exact values depend on the number of CPUs so values are not even
> the same across the range of machines I'm using. sysctl_sched_latency,
> sysctl_sched_min_granularity sysctl_sched_wakeup_granularity are all
> scaled but the ratios remain constant.

My point was mainly that sysctl_sched_wakeup_granularity is above the
tick period

>
> > > <SNIP>
> > >  static int
> > > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
> > > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> > > +                                               struct sched_entity *se);
> > >
> > >  /*
> > >   * Pick the next process, keeping these things in mind, in this order:
> > > @@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > >                                 second = curr;
> > >                 }
> > >
> > > -               if (second && wakeup_preempt_entity(second, left) < 1)
> > > +               if (second && wakeup_preempt_entity(NULL, second, left) < 1)
> >
> > Why not applying the same policy here ? the tick can also prevent
> > current task to move forward
> >
>
> Because it was less clear if it was necessary and what the consequences
> would be if the skip buddy ran or the next buddy failed to run because
> preempt failed, how does it interact with yield_to etc.
>
> I ended up concluding that they should be separate patches and keep this
> patch to one topic.
>
>
> > >                         se = second;
> > >         }
> > >
> > > -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > > +       if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
> > >                 /*
> > >                  * Someone really wants this to run. If it's not unfair, run it.
> > >                  */
> > >                 se = cfs_rq->next;
> > > -       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> > > +       } else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
> > >                 /*
> > >                  * Prefer last buddy, try to return the CPU to a preempted task.
> > >                  */
> > > @@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > >  }
> > >  #endif /* CONFIG_SMP */
> > >
> > > -static unsigned long wakeup_gran(struct sched_entity *se)
> > > +static unsigned long
> > > +select_wakeup_gran(struct cfs_rq *cfs_rq)
> > > +{
> > > +       unsigned int nr_running, threshold;
> > > +
> > > +       if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN))
> > > +               return sysctl_sched_wakeup_granularity;
> > > +
> > > +       /* !GENTLE_FAIR_SLEEPERS has one overload threshold. */
> > > +       if (!sched_feat(GENTLE_FAIR_SLEEPERS)) {
> > > +               if (cfs_rq->nr_running <= sched_nr_disable_gran)
> >
> > cfs_rq->nr_running reflects the number of sched entities in the cfs
> > but not the number of running tasks which reflected in h_nr_running
> >
>
> Then check_preempt_wakeup may also have the same problem as it uses
> nr_running.

This comment was originally about the use of cfs_rq->nr_running a bit further below

>
> > Also do you want to take into account only tasks in this cfs and its
> > children or on all cfs on this rq ?
> >
>
> Only this cfq I think to limit overhead.
>
> > > +                       return sysctl_sched_wakeup_granularity;
> > > +
> > > +               return sysctl_sched_latency;
> > > +       }
> > > +
> > > +       /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */
> > > +       nr_running = cfs_rq->nr_running;
> > > +       threshold = sched_nr_disable_gran >> 1;
> > > +
> > > +       /* No overload. */
> > > +       if (nr_running <= threshold)

The comment was originally about this:
nr_running does not reflect the number of tasks running on the CPU, and
hence not the associated overload state

If you put 2 hackbench instances in their own cgroups, nr_running will
be 2 but you will have far more runnable threads
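
For this, something along these lines (sketch only) would at least count
the tasks queued below group entities too:

	/*
	 * h_nr_running counts every runnable task in the hierarchy under
	 * this cfs_rq, whereas nr_running counts immediate sched entities,
	 * so a cgroup full of hackbench threads only counts as 1.
	 */
	nr_running = cfs_rq->h_nr_running;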

> > > +               return sysctl_sched_wakeup_granularity;
> >
> > TBH I don't like these "no overload", "light overload" ...  They don't
> > have any real meaning apart from that it might work for your platform
> > and your hackbench test.
>
> They are, at best, a proxy measure for overload but the alternative is
> scanning a bunch of runqueues similar to what is required when detecting
> if a domain is fully busy or overloaded.
>
> > We already had have people complaining that small cfs task does not
> > preempt fast enough curr task as an example
> >
>
> Is there a specific test case that demonstrates this?
>
> > There is no explanation why these values are the correct ones and not
> > but are just some random heuristic results and we are trying to remove
> > platform heuristic and to not add new
> >
>
> They are a heuristic yes, but I'm trying to remove the motivation for
> users trying to tune sysctl_sched_wakeup_granularity to stupid values
> because as it stands, tuned will happily poke into debugfs despite the
> fact they are meant for debugging only and the values are of dubious merit.
>
> > > +
> > > +       /* Light overload. */
> > > +       if (nr_running <= sched_nr_disable_gran)
> > > +               return sysctl_sched_latency >> 1;
> > > +
> > > +       /* Heavy overload. */
> > > +       return sysctl_sched_latency;
> >
> > Why should a thread without any relationship with the curr,  not
> > preempt it because there are already a lot of running threads?
>
> Preemption is not free, without any knowledge of what the thread is
> doing, we cannot determine if an inappropriate amount of CPU time is
> being spent dequeueing/enqueuing tasks.

That's exactly my point. The heuristic above doesn't give any clue as
to whether the thread should preempt the current one or not. Most of
the time there is no relation to the number of running threads; it is
defined by the workload itself and its level of interactivity

>
> > In
> > your case, you want hackbench threads to not preempt each others
> > because they tries to use same resources so it's probably better to
> > let the current one to move forward but that's not a universal policy.
> >
>
> No, but have you a better suggestion? hackbench might be stupid but it's
> an example of where a workload can excessively preempt itself. While
> overloading an entire machine is stupid, it could also potentially occurs
> for applications running within a constrained cpumask.

But this is a property that is specific to each application. Some can
have a lot of running threads but few wakeups, which have to preempt
current threads quickly, while others want just the opposite.
So because it is an application-specific property, we should define it
this way instead of trying to guess
>
> > side question: Have you try to change the nice priority which also
> > impact whether a thread can preempt curr ?
> >
>
> No, I have not tried. I guess one would be constructed with cyclictest
> -S running hackbench in the background and measuring if it makes a
> difference to the amount of jitter cyclictest experiences but I'm not
> sure if that would cover the case you are concerned with.
>
> > > +}
> > > +
> > > +static unsigned long
> > > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > >  {
> > > -       unsigned long gran = sysctl_sched_wakeup_granularity;
> > > +       unsigned long gran = select_wakeup_gran(cfs_rq);
> > >
> > >         /*
> > >          * Since its curr running now, convert the gran from real-time
> > > @@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
> > >   *
> > >   */
> > >  static int
> > > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> > > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> > > +                                               struct sched_entity *se)
> > >  {
> > >         s64 gran, vdiff = curr->vruntime - se->vruntime;
> > >
> > >         if (vdiff <= 0)
> > >                 return -1;
> > >
> > > -       gran = wakeup_gran(se);
> > > +       gran = wakeup_gran(cfs_rq, se);
> > >         if (vdiff > gran)
> > >                 return 1;
> > >
> > > @@ -7191,7 +7237,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> > >                 return;
> > >
> > >         update_curr(cfs_rq_of(se));
> > > -       if (wakeup_preempt_entity(se, pse) == 1) {
> > > +       if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {
> >
> > like for update_curr above, cfs_rq can be wrong because se could have changed
> >
>
> Crap, that was a stupid mistake based on earlier review. I'll fix it.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 16:00               ` Vincent Guittot
@ 2021-09-22 17:38                 ` Mel Gorman
  2021-09-22 18:22                   ` Vincent Guittot
  2021-10-05 10:23                   ` Peter Zijlstra
  0 siblings, 2 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-22 17:38 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, Sep 22, 2021 at 06:00:56PM +0200, Vincent Guittot wrote:
> On Wed, 22 Sept 2021 at 17:04, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Wed, Sep 22, 2021 at 04:15:27PM +0200, Vincent Guittot wrote:
> > > > ---8<---
> > > > sched/fair: Scale wakeup granularity relative to nr_running
> > > >
> > > > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
> > > > the kernel.sched_wakeup_granularity_ns sysctl under debugfs.  One of the
> > > > reasons why this sysctl may be used may be for "optimising for throughput",
> > > > particularly when overloaded. The tool TuneD sometimes alters this for two
> > > > profiles e.g. "mssql" and "throughput-performance". At least version 2.9
> > > > does but it changed in master where it also will poke at debugfs instead.
> > > >
> > > > Internal parameters like sysctl_sched_wakeup_granularity are scaled
> > > > based on the number of CPUs due to sysctl_sched_tunable_scaling. For
> > > > simplicity, the timing figures in this changelog are based on
> > > > SCHED_TUNABLESCALING_NONE.
> > >
> > > This is a bit misleading because the platform that you used to
> > > highlight the problem has a 7ms sysctl_sched_wakeup_granularity. which
> > > is far more than your tick which should be 1ms
> > >
> >
> > Tick on the test machines is 4ms (HZ=250).
> >
> > The reason I used SCHED_TUNABLESCALING_NONE for the changelog is that
> > the exact values depend on the number of CPUs so values are not even
> > the same across the range of machines I'm using. sysctl_sched_latency,
> > sysctl_sched_min_granularity sysctl_sched_wakeup_granularity are all
> > scaled but the ratios remain constant.
> 
> My point was mainly that sysctl_sched_wakeup_granularity is above the
> tick period
> 

Ok. sysctl_sched_wakeup_granularity is not related to the tick but I'm
probably missing something else.

> > > Also do you want to take into account only tasks in this cfs and its
> > > children or on all cfs on this rq ?
> > >
> >
> > Only this cfq I think to limit overhead.
> >
> > > > +                       return sysctl_sched_wakeup_granularity;
> > > > +
> > > > +               return sysctl_sched_latency;
> > > > +       }
> > > > +
> > > > +       /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */
> > > > +       nr_running = cfs_rq->nr_running;
> > > > +       threshold = sched_nr_disable_gran >> 1;
> > > > +
> > > > +       /* No overload. */
> > > > +       if (nr_running <= threshold)
> 
> The comment was originally for this
> nr_running does not reflect the number of task running on the cpu and
> the associated overload state
> 
> If you put 2 hackbench in their own cgroup, nr_running will be 2 but
> you will have twice more runnable threads
> 

Ok, that's understood. FWIW, I had switched to h_nr_running already.

> > > > +
> > > > +       /* Light overload. */
> > > > +       if (nr_running <= sched_nr_disable_gran)
> > > > +               return sysctl_sched_latency >> 1;
> > > > +
> > > > +       /* Heavy overload. */
> > > > +       return sysctl_sched_latency;
> > >
> > > Why should a thread without any relationship with the curr,  not
> > > preempt it because there are already a lot of running threads?
> >
> > Preemption is not free, without any knowledge of what the thread is
> > doing, we cannot determine if an inappropriate amount of CPU time is
> > being spent dequeueing/enqueuing tasks.
> 
> That's exactly my point. The heuristic above doesn't give any clue if
> the thread should or not preempt the current one. Most of the time
> there is no relation with the number of the running threads but it is
> define by the workload itself and its level of interactivity
> 

Right, although it ignores the possibility that there are multiple workloads
overloading the machine, wasting even more time preempting.  Whether that
is due to a bad application, a bad configuration or a bad actor is anyone's
guess. I've seen applications with multiple worker threads which generally
behaved ok, except when they didn't and too many worker threads were
spawned due to a spike in server load. I'm not

> >
> > > In
> > > your case, you want hackbench threads to not preempt each others
> > > because they tries to use same resources so it's probably better to
> > > let the current one to move forward but that's not a universal policy.
> > >
> >
> > No, but have you a better suggestion? hackbench might be stupid but it's
> > an example of where a workload can excessively preempt itself. While
> > overloading an entire machine is stupid, it could also potentially occurs
> > for applications running within a constrained cpumask.
> 
> But this is property that is specific to each application. Some can
> have a lot of running threads but few wakes up which have to preempt
> current threads quickly but others just want the opposite
> So because it is a application specific property we should define it
> this way instead of trying to guess

I'm not seeing an alternative suggestion that could be turned into
an implementation. The current value for sched_wakeup_granularity
was set 12 years ago when it was exposed for tuning, which is no longer
the case. The intent was to allow some dynamic adjustment between
sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce
over-scheduling in the worst case without disabling preemption entirely
(which the first version did).

Should we just ignore this problem and hope it goes away or just let
people keep poking silly values into debugfs via tuned?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 17:38                 ` Mel Gorman
@ 2021-09-22 18:22                   ` Vincent Guittot
  2021-09-22 18:57                     ` Mel Gorman
                                       ` (2 more replies)
  2021-10-05 10:23                   ` Peter Zijlstra
  1 sibling, 3 replies; 59+ messages in thread
From: Vincent Guittot @ 2021-09-22 18:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, 22 Sept 2021 at 19:38, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Wed, Sep 22, 2021 at 06:00:56PM +0200, Vincent Guittot wrote:
> > On Wed, 22 Sept 2021 at 17:04, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > On Wed, Sep 22, 2021 at 04:15:27PM +0200, Vincent Guittot wrote:
> > > > > ---8<---
> > > > > sched/fair: Scale wakeup granularity relative to nr_running
> > > > >
> > > > > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
> > > > > the kernel.sched_wakeup_granularity_ns sysctl under debugfs.  One of the
> > > > > reasons why this sysctl may be used may be for "optimising for throughput",
> > > > > particularly when overloaded. The tool TuneD sometimes alters this for two
> > > > > profiles e.g. "mssql" and "throughput-performance". At least version 2.9
> > > > > does but it changed in master where it also will poke at debugfs instead.
> > > > >
> > > > > Internal parameters like sysctl_sched_wakeup_granularity are scaled
> > > > > based on the number of CPUs due to sysctl_sched_tunable_scaling. For
> > > > > simplicity, the timing figures in this changelog are based on
> > > > > SCHED_TUNABLESCALING_NONE.
> > > >
> > > > This is a bit misleading because the platform that you used to
> > > > highlight the problem has a 7ms sysctl_sched_wakeup_granularity. which
> > > > is far more than your tick which should be 1ms
> > > >
> > >
> > > Tick on the test machines is 4ms (HZ=250).
> > >
> > > The reason I used SCHED_TUNABLESCALING_NONE for the changelog is that
> > > the exact values depend on the number of CPUs so values are not even
> > > the same across the range of machines I'm using. sysctl_sched_latency,
> > > sysctl_sched_min_granularity sysctl_sched_wakeup_granularity are all
> > > scaled but the ratios remain constant.
> >
> > My point was mainly that sysctl_sched_wakeup_granularity is above the
> > tick period
> >
>
> Ok. sysctl_sched_wakeup_granularity is not related to the tick but I'm
> probably missing something else.
>
> > > > Also do you want to take into account only tasks in this cfs and its
> > > > children or on all cfs on this rq ?
> > > >
> > >
> > > Only this cfq I think to limit overhead.
> > >
> > > > > +                       return sysctl_sched_wakeup_granularity;
> > > > > +
> > > > > +               return sysctl_sched_latency;
> > > > > +       }
> > > > > +
> > > > > +       /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */
> > > > > +       nr_running = cfs_rq->nr_running;
> > > > > +       threshold = sched_nr_disable_gran >> 1;
> > > > > +
> > > > > +       /* No overload. */
> > > > > +       if (nr_running <= threshold)
> >
> > The comment was originally for this
> > nr_running does not reflect the number of task running on the cpu and
> > the associated overload state
> >
> > If you put 2 hackbench in their own cgroup, nr_running will be 2 but
> > you will have twice more runnable threads
> >
>
> Ok, that's understood. FWIW, I had switched to h_nr_running already.
>
> > > > > +
> > > > > +       /* Light overload. */
> > > > > +       if (nr_running <= sched_nr_disable_gran)
> > > > > +               return sysctl_sched_latency >> 1;
> > > > > +
> > > > > +       /* Heavy overload. */
> > > > > +       return sysctl_sched_latency;
> > > >
> > > > Why should a thread without any relationship with the curr,  not
> > > > preempt it because there are already a lot of running threads?
> > >
> > > Preemption is not free, without any knowledge of what the thread is
> > > doing, we cannot determine if an inappropriate amount of CPU time is
> > > being spent dequeueing/enqueuing tasks.
> >
> > That's exactly my point. The heuristic above doesn't give any clue if
> > the thread should or not preempt the current one. Most of the time
> > there is no relation with the number of the running threads but it is
> > define by the workload itself and its level of interactivity
> >
>
> Right albeit it ignores the possibility that there are multiple workloads
> overloading the machine wasting even more time preempting.  Whether that
> is due to a bad application, a bad configuration or a bad actor is anyones
> guess. I've seen applications with multiple worker threads which generally
> behaved ok, except when they didn't and too many worker threads were
> spawned due to a spike in server load. I'm not
>
> > >
> > > > In
> > > > your case, you want hackbench threads to not preempt each others
> > > > because they tries to use same resources so it's probably better to
> > > > let the current one to move forward but that's not a universal policy.
> > > >
> > >
> > > No, but have you a better suggestion? hackbench might be stupid but it's
> > > an example of where a workload can excessively preempt itself. While
> > > overloading an entire machine is stupid, it could also potentially occurs
> > > for applications running within a constrained cpumask.
> >
> > But this is property that is specific to each application. Some can
> > have a lot of running threads but few wakes up which have to preempt
> > current threads quickly but others just want the opposite
> > So because it is a application specific property we should define it
> > this way instead of trying to guess
>
> I'm not seeing an alternative suggestion that could be turned into
> an implementation. The current value for sched_wakeup_granularity
> was set 12 years ago was exposed for tuning which is no longer
> the case. The intent was to allow some dynamic adjustment between
> sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce
> over-scheduling in the worst case without disabling preemption entirely
> (which the first version did).
>
> Should we just ignore this problem and hope it goes away or just let
> people keep poking silly values into debugfs via tuned?

We should certainly not add a bandaid, because people will continue to
poke silly values in the end. And increasing
sysctl_sched_wakeup_granularity based on the number of running threads
is not the right solution. According to the description of your
problem, that the current task doesn't get enough time to move forward,
sysctl_sched_min_granularity should be part of the solution. Something
like the below will ensure that current gets a chance to move forward:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9bf540f04c2d..39d4e4827d3d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7102,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
        int scale = cfs_rq->nr_running >= sched_nr_latency;
        int next_buddy_marked = 0;
        int cse_is_idle, pse_is_idle;
+       unsigned long delta_exec;

        if (unlikely(se == pse))
                return;
@@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
                return;

        update_curr(cfs_rq_of(se));
+       delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+       /*
+        * Ensure that current got a chance to move forward
+        */
+       if (delta_exec < sysctl_sched_min_granularity)
+               return;
+
        if (wakeup_preempt_entity(se, pse) == 1) {
                /*
                 * Bias pick_next to pick the sched entity that is


>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 18:22                   ` Vincent Guittot
@ 2021-09-22 18:57                     ` Mel Gorman
  2021-09-23  1:47                     ` Mike Galbraith
  2021-10-05 10:28                     ` Peter Zijlstra
  2 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-22 18:57 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, Sep 22, 2021 at 08:22:43PM +0200, Vincent Guittot wrote:
> > > > > In
> > > > > your case, you want hackbench threads to not preempt each others
> > > > > because they tries to use same resources so it's probably better to
> > > > > let the current one to move forward but that's not a universal policy.
> > > > >
> > > >
> > > > No, but have you a better suggestion? hackbench might be stupid but it's
> > > > an example of where a workload can excessively preempt itself. While
> > > > overloading an entire machine is stupid, it could also potentially occurs
> > > > for applications running within a constrained cpumask.
> > >
> > > But this is property that is specific to each application. Some can
> > > have a lot of running threads but few wakes up which have to preempt
> > > current threads quickly but others just want the opposite
> > > So because it is a application specific property we should define it
> > > this way instead of trying to guess
> >
> > I'm not seeing an alternative suggestion that could be turned into
> > an implementation. The current value for sched_wakeup_granularity
> > was set 12 years ago was exposed for tuning which is no longer
> > the case. The intent was to allow some dynamic adjustment between
> > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce
> > over-scheduling in the worst case without disabling preemption entirely
> > (which the first version did).
> >
> > Should we just ignore this problem and hope it goes away or just let
> > people keep poking silly values into debugfs via tuned?
> 
> We should certainly not add a bandaid because people will continue to
> poke silly value at the end. And increasing
> sysctl_sched_wakeup_granularity based on the number of running threads
> is not the right solution. According to the description of your
> problem that the current task doesn't get enough time to move forward,
> sysctl_sched_min_granularity should be part of the solution. Something
> like below will ensure that current got a chance to move forward
> 

That's a very interesting idea! I've queued it up for further testing
and as a comparison to the bandaid.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 18:22                   ` Vincent Guittot
  2021-09-22 18:57                     ` Mel Gorman
@ 2021-09-23  1:47                     ` Mike Galbraith
  2021-09-23  8:40                       ` Vincent Guittot
  2021-10-05 10:28                     ` Peter Zijlstra
  2 siblings, 1 reply; 59+ messages in thread
From: Mike Galbraith @ 2021-09-23  1:47 UTC (permalink / raw)
  To: Vincent Guittot, Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Wed, 2021-09-22 at 20:22 +0200, Vincent Guittot wrote:
> On Wed, 22 Sept 2021 at 19:38, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> >
> > I'm not seeing an alternative suggestion that could be turned into
> > an implementation. The current value for sched_wakeup_granularity
> > was set 12 years ago was exposed for tuning which is no longer
> > the case. The intent was to allow some dynamic adjustment between
> > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce
> > over-scheduling in the worst case without disabling preemption entirely
> > (which the first version did).

I don't think those knobs were ever _intended_ for general purpose
tuning, but they did get used that way by some folks.

> >
> > Should we just ignore this problem and hope it goes away or just let
> > people keep poking silly values into debugfs via tuned?
>
> We should certainly not add a bandaid because people will continue to
> poke silly value at the end. And increasing
> sysctl_sched_wakeup_granularity based on the number of running threads
> is not the right solution.

Watching my desktop box stack up large piles of very short running
threads, I agree, instantaneous load looks like a non-starter.

>  According to the description of your
> problem that the current task doesn't get enough time to move forward,
> sysctl_sched_min_granularity should be part of the solution. Something
> like below will ensure that current got a chance to move forward

Nah, progress is guaranteed, the issue is a zillion very similar short
running threads preempting each other with no win to be had, thus
spending cycles in the scheduler that are utterly wasted.  It's a valid
issue, the trouble is teaching the scheduler to recognize that situation
without mucking up other situations where there IS a win for even very
short running threads, say, doing a synchronous handoff; preemption is
cheaper than scheduling off if the waker is going to be awakened again in
very short order.
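
The degenerate example of the latter being a plain pipe ping-pong,
something like the below (sketch): each side does a tiny bit of "work"
and immediately wakes its partner, so what wakeup preemption decides
dominates the round trip.

	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/wait.h>

	int main(void)
	{
		int ab[2], ba[2], i, iters = 1000000;
		char byte = 0;

		if (pipe(ab) || pipe(ba))
			exit(1);

		if (fork() == 0) {
			/* child: wait for ping, reply with pong */
			for (i = 0; i < iters; i++) {
				if (read(ab[0], &byte, 1) != 1 ||
				    write(ba[1], &byte, 1) != 1)
					exit(1);
			}
			exit(0);
		}

		/* parent: ping, wait for pong */
		for (i = 0; i < iters; i++) {
			if (write(ab[1], &byte, 1) != 1 ||
			    read(ba[0], &byte, 1) != 1)
				exit(1);
		}
		wait(NULL);
		return 0;
	}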

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9bf540f04c2d..39d4e4827d3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7102,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq,
> struct task_struct *p, int wake_
>         int scale = cfs_rq->nr_running >= sched_nr_latency;
>         int next_buddy_marked = 0;
>         int cse_is_idle, pse_is_idle;
> +       unsigned long delta_exec;
>
>         if (unlikely(se == pse))
>                 return;
> @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq,
> struct task_struct *p, int wake_
>                 return;
>
>         update_curr(cfs_rq_of(se));
> +       delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> +       /*
> +        * Ensure that current got a chance to move forward
> +        */
> +       if (delta_exec < sysctl_sched_min_granularity)
> +               return;
> +
>         if (wakeup_preempt_entity(se, pse) == 1) {
>                 /*
>                  * Bias pick_next to pick the sched entity that is

Yikes!  If you do that, you may as well go the extra nanometer and rip
wakeup preemption out entirely, same result, impressive diffstat.

	-Mike

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-23  1:47                     ` Mike Galbraith
@ 2021-09-23  8:40                       ` Vincent Guittot
  2021-09-23  9:21                         ` Mike Galbraith
  2021-09-23 12:24                         ` Phil Auld
  0 siblings, 2 replies; 59+ messages in thread
From: Vincent Guittot @ 2021-09-23  8:40 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Thu, 23 Sept 2021 at 03:47, Mike Galbraith <efault@gmx.de> wrote:
>
> On Wed, 2021-09-22 at 20:22 +0200, Vincent Guittot wrote:
> > On Wed, 22 Sept 2021 at 19:38, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > >
> > > I'm not seeing an alternative suggestion that could be turned into
> > > an implementation. The current value for sched_wakeup_granularity
> > > was set 12 years ago was exposed for tuning which is no longer
> > > the case. The intent was to allow some dynamic adjustment between
> > > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce
> > > over-scheduling in the worst case without disabling preemption entirely
> > > (which the first version did).
>
> I don't think those knobs were ever _intended_ for general purpose
> tuning, but they did get used that way by some folks.
>
> > >
> > > Should we just ignore this problem and hope it goes away or just let
> > > people keep poking silly values into debugfs via tuned?
> >
> > We should certainly not add a bandaid because people will continue to
> > poke silly value at the end. And increasing
> > sysctl_sched_wakeup_granularity based on the number of running threads
> > is not the right solution.
>
> Watching my desktop box stack up large piles of very short running
> threads, I agree, instantaneous load looks like a non-starter.
>
> >  According to the description of your
> > problem that the current task doesn't get enough time to move forward,
> > sysctl_sched_min_granularity should be part of the solution. Something
> > like below will ensure that current got a chance to move forward
>
> Nah, progress is guaranteed, the issue is a zillion very similar short
> running threads preempting each other with no win to be had, thus
> spending cycles in the scheduler that are utterly wasted.  It's a valid
> issue, trouble is teaching the scheduler to recognize that situation
> without mucking up other situations where there IS a win for even very
> short running threads say, doing a synchronous handoff; preemption is
> cheaper than scheduling off if the waker is going be awakened again in
> very short order.
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9bf540f04c2d..39d4e4827d3d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7102,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq,
> > struct task_struct *p, int wake_
> >         int scale = cfs_rq->nr_running >= sched_nr_latency;
> >         int next_buddy_marked = 0;
> >         int cse_is_idle, pse_is_idle;
> > +       unsigned long delta_exec;
> >
> >         if (unlikely(se == pse))
> >                 return;
> > @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq,
> > struct task_struct *p, int wake_
> >                 return;
> >
> >         update_curr(cfs_rq_of(se));
> > +       delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > +       /*
> > +        * Ensure that current got a chance to move forward
> > +        */
> > +       if (delta_exec < sysctl_sched_min_granularity)
> > +               return;
> > +
> >         if (wakeup_preempt_entity(se, pse) == 1) {
> >                 /*
> >                  * Bias pick_next to pick the sched entity that is
>
> Yikes!  If you do that, you may as well go the extra nanometer and rip
> wakeup preemption out entirely, same result, impressive diffstat.

This patch is mainly there to show that there are other ways to ensure
progress without using some load heuristic.
sysctl_sched_min_granularity has the problem of scaling with the
number of CPUs, which can generate large values. At least we should
use normalized_sysctl_sched_min_granularity or an even smaller
value, but wakeup preemption still happens with this change. It only
ensures that we don't waste time preempting each other without any
chance to do actual work.

a 100us value should even be enough to fix Mel's problem without
impacting common wakeup preemption cases.
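
i.e. the same check as in the patch quoted above but with a fixed floor
instead of the scaled tunable, something like (untested sketch; the
constant is only illustrative):

	update_curr(cfs_rq_of(se));
	/*
	 * Let current move forward for at least ~100us before it can be
	 * wakeup-preempted, independently of the CPU-scaled tunables.
	 */
	if (se->sum_exec_runtime - se->prev_sum_exec_runtime <
	    100 * NSEC_PER_USEC)
		return;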


>
>         -Mike

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-23  8:40                       ` Vincent Guittot
@ 2021-09-23  9:21                         ` Mike Galbraith
  2021-09-23 12:41                           ` Vincent Guittot
  2021-09-23 12:24                         ` Phil Auld
  1 sibling, 1 reply; 59+ messages in thread
From: Mike Galbraith @ 2021-09-23  9:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote:
>
> a 100us value should even be enough to fix Mel's problem without
> impacting common wakeup preemption cases.

It'd be nice if it turns out to be something that simple, but color me
skeptical.  I've tried various preemption throttling schemes, and while
it was trivial to get promising results, my scheme always managed to
harm something.  Everything I ever tried, I ended up tossing.

	-Mike

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-23  8:40                       ` Vincent Guittot
  2021-09-23  9:21                         ` Mike Galbraith
@ 2021-09-23 12:24                         ` Phil Auld
  2021-10-05 10:36                           ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Phil Auld @ 2021-09-23 12:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mike Galbraith, Mel Gorman, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Thu, Sep 23, 2021 at 10:40:48AM +0200 Vincent Guittot wrote:
> On Thu, 23 Sept 2021 at 03:47, Mike Galbraith <efault@gmx.de> wrote:
> >
> > On Wed, 2021-09-22 at 20:22 +0200, Vincent Guittot wrote:
> > > On Wed, 22 Sept 2021 at 19:38, Mel Gorman <mgorman@techsingularity.net> wrote:
> > > >
> > > >
> > > > I'm not seeing an alternative suggestion that could be turned into
> > > > an implementation. The current value for sched_wakeup_granularity
> > > > was set 12 years ago was exposed for tuning which is no longer
> > > > the case. The intent was to allow some dynamic adjustment between
> > > > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce
> > > > over-scheduling in the worst case without disabling preemption entirely
> > > > (which the first version did).
> >
> > I don't think those knobs were ever _intended_ for general purpose
> > tuning, but they did get used that way by some folks.
> >
> > > >
> > > > Should we just ignore this problem and hope it goes away or just let
> > > > people keep poking silly values into debugfs via tuned?
> > >
> > > We should certainly not add a bandaid because people will continue to
> > > poke silly value at the end. And increasing
> > > sysctl_sched_wakeup_granularity based on the number of running threads
> > > is not the right solution.
> >
> > Watching my desktop box stack up large piles of very short running
> > threads, I agree, instantaneous load looks like a non-starter.
> >
> > >  According to the description of your
> > > problem that the current task doesn't get enough time to move forward,
> > > sysctl_sched_min_granularity should be part of the solution. Something
> > > like below will ensure that current got a chance to move forward
> >
> > Nah, progress is guaranteed, the issue is a zillion very similar short
> > running threads preempting each other with no win to be had, thus
> > spending cycles in the scheduler that are utterly wasted.  It's a valid
> > issue, trouble is teaching the scheduler to recognize that situation
> > without mucking up other situations where there IS a win for even very
> > short running threads say, doing a synchronous handoff; preemption is
> > cheaper than scheduling off if the waker is going be awakened again in
> > very short order.
> >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 9bf540f04c2d..39d4e4827d3d 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7102,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq,
> > > struct task_struct *p, int wake_
> > >         int scale = cfs_rq->nr_running >= sched_nr_latency;
> > >         int next_buddy_marked = 0;
> > >         int cse_is_idle, pse_is_idle;
> > > +       unsigned long delta_exec;
> > >
> > >         if (unlikely(se == pse))
> > >                 return;
> > > @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq,
> > > struct task_struct *p, int wake_
> > >                 return;
> > >
> > >         update_curr(cfs_rq_of(se));
> > > +       delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > > +       /*
> > > +        * Ensure that current got a chance to move forward
> > > +        */
> > > +       if (delta_exec < sysctl_sched_min_granularity)
> > > +               return;
> > > +
> > >         if (wakeup_preempt_entity(se, pse) == 1) {
> > >                 /*
> > >                  * Bias pick_next to pick the sched entity that is
> >
> > Yikes!  If you do that, you may as well go the extra nanometer and rip
> > wakeup preemption out entirely, same result, impressive diffstat.
> 
> This patch is mainly there to show that there are other ways to ensure
> progress without using some load heuristic.
> sysctl_sched_min_granularity has the problem of scaling with the
> number of cpus and this can generate large values. At least we should
> use the normalized_sysctl_sched_min_granularity or even a smaller
> value but wakeup preemption still happens with this change. It only
> ensures that we don't waste time preempting each other without any
> chance to do actual stuff.
>

It's capped at 8 CPUs, which is pretty easy to reach these days, so the
values don't get too large.  That scaling is almost a no-op now.
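
From memory, the factor computation is roughly the below (see
get_update_sysctl_factor()), so with the default SCHED_TUNABLESCALING_LOG
anything with 8 or more CPUs ends up with factor 1 + ilog2(8) = 4:

	static unsigned int get_update_sysctl_factor(void)
	{
		unsigned int cpus = min_t(unsigned int, num_online_cpus(), 8);
		unsigned int factor;

		switch (sysctl_sched_tunable_scaling) {
		case SCHED_TUNABLESCALING_NONE:
			factor = 1;
			break;
		case SCHED_TUNABLESCALING_LINEAR:
			factor = cpus;
			break;
		case SCHED_TUNABLESCALING_LOG:
		default:
			factor = 1 + ilog2(cpus);
			break;
		}

		return factor;
	}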


Cheers,
Phil


> a 100us value should even be enough to fix Mel's problem without
> impacting common wakeup preemption cases.
> 
> 
> >
> >         -Mike
> 

-- 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-23  9:21                         ` Mike Galbraith
@ 2021-09-23 12:41                           ` Vincent Guittot
  2021-09-23 13:14                             ` Mike Galbraith
  2021-09-27 11:17                             ` Mel Gorman
  0 siblings, 2 replies; 59+ messages in thread
From: Vincent Guittot @ 2021-09-23 12:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote:
>
> On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote:
> >
> > a 100us value should even be enough to fix Mel's problem without
> > impacting common wakeup preemption cases.
>
> It'd be nice if it turn out to be something that simple, but color me
> skeptical.  I've tried various preemption throttling schemes, and while

Let's see what the results will show. I tend to agree that this will
not be enough to cover all use cases, and I don't see any other way to
cover all cases than getting some input from the threads about their
latency fairness, which brings us back to some kind of latency niceness
value

> it was trivial to get promising results, my scheme always managed to
> harm something.  Everything I ever tried, I ended up tossing.
>
>         -Mike

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-23 12:41                           ` Vincent Guittot
@ 2021-09-23 13:14                             ` Mike Galbraith
  2021-09-27 11:17                             ` Mel Gorman
  1 sibling, 0 replies; 59+ messages in thread
From: Mike Galbraith @ 2021-09-23 13:14 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Thu, 2021-09-23 at 14:41 +0200, Vincent Guittot wrote:
> On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote:
> >
> > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote:
> > >
> > > a 100us value should even be enough to fix Mel's problem without
> > > impacting common wakeup preemption cases.
> >
> > It'd be nice if it turn out to be something that simple, but color me
> > skeptical.  I've tried various preemption throttling schemes, and while
>
> Let's see what the results will show. I tend to agree that this will
> not be enough to cover all use cases and I don't see any other way to
> cover all cases than getting some inputs from the threads about their
> latency fairness which bring us back to some kind of latency niceness
> value

What dooms the dirt simple solution is: tasks that dip lightly but
frequently are a thing ;-) Take nfs threads, tell 'em frequent
preemption ain't cool, and no matter how diplomatically you say it,
they react in the only way they can, by sucking at their job.

	-Mike

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-23 12:41                           ` Vincent Guittot
  2021-09-23 13:14                             ` Mike Galbraith
@ 2021-09-27 11:17                             ` Mel Gorman
  2021-09-27 14:17                               ` Mike Galbraith
  2021-09-27 14:19                               ` Vincent Guittot
  1 sibling, 2 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-27 11:17 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote:
> On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote:
> >
> > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote:
> > >
> > > a 100us value should even be enough to fix Mel's problem without
> > > impacting common wakeup preemption cases.
> >
> > It'd be nice if it turn out to be something that simple, but color me
> > skeptical.  I've tried various preemption throttling schemes, and while
> 
> Let's see what the results will show. I tend to agree that this will
> not be enough to cover all use cases and I don't see any other way to
> cover all cases than getting some inputs from the threads about their
> latency fairness which bring us back to some kind of latency niceness
> value
> 

Unfortunately, I didn't get a complete set of results but enough to work
with. The missing tests have been requeued. The figures below are based
on a single-socket Skylake machine with 8 CPUs as it had the most complete
set of results and is the basic case.

The reported kernels are

vanilla:			vanilla 5.15-rc1
sched-scalewakegran-v2r4:	My patch
sched-moveforward-v1r1:		Vincent's patch



hackbench-process-pipes
                          5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
                             vanilla sched-scalewakegran-v2r4 sched-moveforward-v1r1
Amean     1        0.3253 (   0.00%)      0.3330 (  -2.36%)      0.3257 (  -0.10%)
Amean     4        0.8300 (   0.00%)      0.7570 (   8.80%)      0.7560 (   8.92%)
Amean     7        1.1003 (   0.00%)      1.1457 *  -4.12%*      1.1163 (  -1.45%)
Amean     12       1.7263 (   0.00%)      1.6393 *   5.04%*      1.5963 *   7.53%*
Amean     21       3.0063 (   0.00%)      2.6590 *  11.55%*      2.4487 *  18.55%*
Amean     30       4.2323 (   0.00%)      3.5657 *  15.75%*      3.3410 *  21.06%*
Amean     48       6.5657 (   0.00%)      5.4180 *  17.48%*      5.0857 *  22.54%*
Amean     79      10.4867 (   0.00%)      8.4357 *  19.56%*      7.9563 *  24.13%*
Amean     110     14.8880 (   0.00%)     11.0423 *  25.83%*     10.7407 *  27.86%*
Amean     141     19.2083 (   0.00%)     14.0820 *  26.69%*     13.3780 *  30.35%*
Amean     172     23.4847 (   0.00%)     16.9880 *  27.66%*     16.4293 *  30.04%*
Amean     203     27.3763 (   0.00%)     20.2480 *  26.04%*     19.6430 *  28.25%*
Amean     234     31.3707 (   0.00%)     23.2477 *  25.89%*     22.8287 *  27.23%*
Amean     265     35.4663 (   0.00%)     26.2483 *  25.99%*     25.8683 *  27.06%*
Amean     296     39.2380 (   0.00%)     29.4237 *  25.01%*     28.8727 *  26.42%*

For hackbench, Vincent's patch and mine have a similar impact.

tbench4
                         5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
                             vanilla sched-scalewakegran-v2r4 sched-moveforward-v1r1
Hmean     1       598.71 (   0.00%)      608.31 *   1.60%*      586.05 *  -2.11%*
Hmean     2      1096.74 (   0.00%)     1110.07 *   1.22%*     1106.70 *   0.91%*
Hmean     4      1529.35 (   0.00%)     1531.20 *   0.12%*     1551.11 *   1.42%*
Hmean     8      2824.32 (   0.00%)     2847.96 *   0.84%*     2684.21 *  -4.96%*
Hmean     16     2573.30 (   0.00%)     2591.77 *   0.72%*     2445.41 *  -4.97%*
Hmean     32     2518.77 (   0.00%)     2532.70 *   0.55%*     2409.30 *  -4.35%*

For tbench, it's ok for lower thread counts but at 8 threads (machine
overloaded), Vincent's patch regresses slightly. With these test runs,
I don't have detailed information as to why, but the most likely explanation
is that preemption gets disabled prematurely.

specjbb
                             5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
                                vanilla sched-scalewakegran-v2r4 sched-moveforward-v1r1
Hmean     tput-1    71199.00 (   0.00%)    69492.00 *  -2.40%*    71126.00 *  -0.10%*
Hmean     tput-2   154478.00 (   0.00%)   146060.00 *  -5.45%*   153073.00 *  -0.91%*
Hmean     tput-3   211889.00 (   0.00%)   209386.00 *  -1.18%*   219434.00 *   3.56%*
Hmean     tput-4   257842.00 (   0.00%)   248012.00 *  -3.81%*   262903.00 *   1.96%*
Hmean     tput-5   253506.00 (   0.00%)   242511.00 *  -4.34%*   250828.00 *  -1.06%*
Hmean     tput-6   246202.00 (   0.00%)   236480.00 *  -3.95%*   244236.00 *  -0.80%*
Hmean     tput-7   241133.00 (   0.00%)   230905.00 *  -4.24%*   237619.00 *  -1.46%*
Hmean     tput-8   237983.00 (   0.00%)   230010.00 *  -3.35%*   235275.00 *  -1.14%*

For specjbb, it's different again: Vincent's patch is better for the
overloaded case, but both patches show slight regressions.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-27 11:17                             ` Mel Gorman
@ 2021-09-27 14:17                               ` Mike Galbraith
  2021-10-04  8:05                                 ` Mel Gorman
  2021-09-27 14:19                               ` Vincent Guittot
  1 sibling, 1 reply; 59+ messages in thread
From: Mike Galbraith @ 2021-09-27 14:17 UTC (permalink / raw)
  To: Mel Gorman, Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Mon, 2021-09-27 at 12:17 +0100, Mel Gorman wrote:
> On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote:
> > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote:
> > >
> > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote:
> > > >
> > > > a 100us value should even be enough to fix Mel's problem without
> > > > impacting common wakeup preemption cases.
> > >
> > > It'd be nice if it turn out to be something that simple, but color me
> > > skeptical.  I've tried various preemption throttling schemes, and while
> >
> > Let's see what the results will show. I tend to agree that this will
> > not be enough to cover all use cases and I don't see any other way to
> > cover all cases than getting some inputs from the threads about their
> > latency fairness which bring us back to some kind of latency niceness
> > value
> >
>
> Unfortunately, I didn't get a complete set of results but enough to work
> with. The missing tests have been requeued. The figures below are based
> on a single-socket Skylake machine with 8 CPUs as it had the most set of
> results and is the basic case.

There's something missing, namely how whatever load you measure
performs when facing dissimilar competition. Instead of only scaling
loads running solo from underutilized to heavily over-committed, give
them competition, e.g. something switch-heavy, say tbench, TCP_RR et al
(latency-bound load) at pairs=CPUS vs something hefty like make -j CPUS
or such.

With your "pick a load number and roll preemption down" approach and
competing (modest) loads running on some headless enterprise test array
box, there will likely be little impact, but do the same on a desktop
box from the GUI and, according to my box, you're likely to see something
entirely different.

Seems the "if current hasn't consumed at least 100us, go away" approach
should impact fast movers facing competition both a lot more and more
consistently (given modest load for both methods).

Below is my box nfs-mounting itself for no other reason than I was
curious to see what running my repo update script from an nfs mount
would look like (hey, it's still more realistic than hackbench ;).  It's
sorted by switches, but those at the top are also where the most cycles
landed.  I wouldn't expect the throughput of a load that switch-happy to
hold up well at all when faced with an enforced 100us latency.

Switch happy loads don't do so just to be mean, nor do they matter less
than whatever beefier loads they may encounter in a box.

 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  nfsd:2246             |  31297.564 ms |  1514806 | avg:    0.002 ms | max:  555.543 ms | max at:   9130.660504 s
  nfsd:2245             |   9536.364 ms |  1475992 | avg:    0.002 ms | max:  902.341 ms | max at:   9083.907053 s
  kworker/u17:2-x:21933 |   3629.267 ms |   768463 | avg:    0.009 ms | max: 5536.206 ms | max at:   9088.540916 s
  kworker/u17:3:7082    |   3426.631 ms |   701947 | avg:    0.010 ms | max: 5543.659 ms | max at:   9088.540901 s
  git:7100              |  12708.278 ms |   573828 | avg:    0.001 ms | max:    3.520 ms | max at:   9066.125757 s
  git:7704              |  11620.355 ms |   517010 | avg:    0.001 ms | max:    4.070 ms | max at:   9113.894832 s
  kworker/u17:0:7075    |   1812.581 ms |   397601 | avg:    0.002 ms | max:  620.321 ms | max at:   9114.655685 s
  nfsd:2244             |   4930.826 ms |   370473 | avg:    0.008 ms | max:  910.646 ms | max at:   9083.915441 s
  kworker/u16:6:7094    |   2870.424 ms |   335848 | avg:    0.005 ms | max:  580.871 ms | max at:   9114.616479 s
  nfsd:2243             |   3424.996 ms |   257274 | avg:    0.033 ms | max: 3843.339 ms | max at:   9086.848829 s
  kworker/u17:1-x:30183 |   1310.614 ms |   255990 | avg:    0.001 ms | max:    1.817 ms | max at:   9089.173217 s
  kworker/u16:60-:6124  |   2253.058 ms |   225931 | avg:    0.050 ms | max:10128.140 ms | max at:   9108.375040 s
  kworker/u16:5:7092    |   1831.385 ms |   211923 | avg:    0.007 ms | max:  905.513 ms | max at:   9083.911630 s
  kworker/u16:7:7101    |   1606.258 ms |   194944 | avg:    0.002 ms | max:   11.576 ms | max at:   9082.789700 s
  kworker/u16:4:7090    |   1484.687 ms |   189197 | avg:    0.100 ms | max:12112.172 ms | max at:   9110.360308 s
  kworker/u16:59-:6123  |   1707.858 ms |   183464 | avg:    0.073 ms | max: 6135.816 ms | max at:   9120.173398 s
  kworker/u16:3:7074    |   1528.375 ms |   173089 | avg:    0.098 ms | max:15196.567 ms | max at:   9098.202355 s
  kworker/u16:0-r:7009  |   1336.814 ms |   166043 | avg:    0.002 ms | max:   12.381 ms | max at:   9082.839130 s
  nfsd:2242             |   1876.802 ms |   154855 | avg:    0.073 ms | max: 3844.877 ms | max at:   9086.848848 s
  kworker/u16:1:7072    |   1214.642 ms |   151420 | avg:    0.002 ms | max:    6.433 ms | max at:   9075.581713 s
  kworker/u16:2:7073    |   1302.996 ms |   150863 | avg:    0.002 ms | max:   12.119 ms | max at:   9082.839133 s


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-27 11:17                             ` Mel Gorman
  2021-09-27 14:17                               ` Mike Galbraith
@ 2021-09-27 14:19                               ` Vincent Guittot
  2021-09-27 15:02                                 ` Mel Gorman
  1 sibling, 1 reply; 59+ messages in thread
From: Vincent Guittot @ 2021-09-27 14:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Mon, 27 Sept 2021 at 13:17, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote:
> > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote:
> > >
> > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote:
> > > >
> > > > a 100us value should even be enough to fix Mel's problem without
> > > > impacting common wakeup preemption cases.
> > >
> > > It'd be nice if it turn out to be something that simple, but color me
> > > skeptical.  I've tried various preemption throttling schemes, and while
> >
> > Let's see what the results will show. I tend to agree that this will
> > not be enough to cover all use cases and I don't see any other way to
> > cover all cases than getting some inputs from the threads about their
> > latency fairness which bring us back to some kind of latency niceness
> > value
> >
>
> Unfortunately, I didn't get a complete set of results but enough to work
> with. The missing tests have been requeued. The figures below are based
> on a single-socket Skylake machine with 8 CPUs as it had the most set of
> results and is the basic case.
>
> The reported kernels are
>
> vanilla:                        vanilla 5.15-rc1
> sched-scalewakegran-v2r4:       My patch
> sched-moveforward-v1r1:         Vincent's patch

I imagine that these are the results for the 1st version, which scales
with the number of CPUs


>
>
>
> hackbench-process-pipes
>                           5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                              vanilla sched-scalewakegran-v2r4 sched-moveforward-v1r1
> Amean     1        0.3253 (   0.00%)      0.3330 (  -2.36%)      0.3257 (  -0.10%)
> Amean     4        0.8300 (   0.00%)      0.7570 (   8.80%)      0.7560 (   8.92%)
> Amean     7        1.1003 (   0.00%)      1.1457 *  -4.12%*      1.1163 (  -1.45%)
> Amean     12       1.7263 (   0.00%)      1.6393 *   5.04%*      1.5963 *   7.53%*
> Amean     21       3.0063 (   0.00%)      2.6590 *  11.55%*      2.4487 *  18.55%*
> Amean     30       4.2323 (   0.00%)      3.5657 *  15.75%*      3.3410 *  21.06%*
> Amean     48       6.5657 (   0.00%)      5.4180 *  17.48%*      5.0857 *  22.54%*
> Amean     79      10.4867 (   0.00%)      8.4357 *  19.56%*      7.9563 *  24.13%*
> Amean     110     14.8880 (   0.00%)     11.0423 *  25.83%*     10.7407 *  27.86%*
> Amean     141     19.2083 (   0.00%)     14.0820 *  26.69%*     13.3780 *  30.35%*
> Amean     172     23.4847 (   0.00%)     16.9880 *  27.66%*     16.4293 *  30.04%*
> Amean     203     27.3763 (   0.00%)     20.2480 *  26.04%*     19.6430 *  28.25%*
> Amean     234     31.3707 (   0.00%)     23.2477 *  25.89%*     22.8287 *  27.23%*
> Amean     265     35.4663 (   0.00%)     26.2483 *  25.99%*     25.8683 *  27.06%*
> Amean     296     39.2380 (   0.00%)     29.4237 *  25.01%*     28.8727 *  26.42%*
>
> For hackbench, either Vincent or my patch has a similar impact.
>
> tbench4
>                          5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                             vanillasched-scalewakegran-v2r4 sched-moveforward-v1r1
> Hmean     1       598.71 (   0.00%)      608.31 *   1.60%*      586.05 *  -2.11%*
> Hmean     2      1096.74 (   0.00%)     1110.07 *   1.22%*     1106.70 *   0.91%*
> Hmean     4      1529.35 (   0.00%)     1531.20 *   0.12%*     1551.11 *   1.42%*
> Hmean     8      2824.32 (   0.00%)     2847.96 *   0.84%*     2684.21 *  -4.96%*
> Hmean     16     2573.30 (   0.00%)     2591.77 *   0.72%*     2445.41 *  -4.97%*
> Hmean     32     2518.77 (   0.00%)     2532.70 *   0.55%*     2409.30 *  -4.35%*
>
> For tbench, it's ok for lower thread counts for 8 threads (machine
> overloaded), Vincent's patch regresses slightly. With these test runs,
> I don't have detailed information as to why but the most likely solution
> is that preemption gets disabled prematurely.
>
> specjbb
>                              5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                                 vanillasched-scalewakegran-v2r4 sched-moveforward-v1r1
> Hmean     tput-1    71199.00 (   0.00%)    69492.00 *  -2.40%*    71126.00 *  -0.10%*
> Hmean     tput-2   154478.00 (   0.00%)   146060.00 *  -5.45%*   153073.00 *  -0.91%*
> Hmean     tput-3   211889.00 (   0.00%)   209386.00 *  -1.18%*   219434.00 *   3.56%*
> Hmean     tput-4   257842.00 (   0.00%)   248012.00 *  -3.81%*   262903.00 *   1.96%*
> Hmean     tput-5   253506.00 (   0.00%)   242511.00 *  -4.34%*   250828.00 *  -1.06%*
> Hmean     tput-6   246202.00 (   0.00%)   236480.00 *  -3.95%*   244236.00 *  -0.80%*
> Hmean     tput-7   241133.00 (   0.00%)   230905.00 *  -4.24%*   237619.00 *  -1.46%*
> Hmean     tput-8   237983.00 (   0.00%)   230010.00 *  -3.35%*   235275.00 *  -1.14%*
>
> For specjbb, it's different again, Vincent's patch is better for the
> overloaded case but both patches show light regressions.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-27 14:19                               ` Vincent Guittot
@ 2021-09-27 15:02                                 ` Mel Gorman
  0 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2021-09-27 15:02 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Mon, Sep 27, 2021 at 04:19:00PM +0200, Vincent Guittot wrote:
> On Mon, 27 Sept 2021 at 13:17, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote:
> > > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote:
> > > >
> > > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote:
> > > > >
> > > > > a 100us value should even be enough to fix Mel's problem without
> > > > > impacting common wakeup preemption cases.
> > > >
> > > > It'd be nice if it turn out to be something that simple, but color me
> > > > skeptical.  I've tried various preemption throttling schemes, and while
> > >
> > > Let's see what the results will show. I tend to agree that this will
> > > not be enough to cover all use cases and I don't see any other way to
> > > cover all cases than getting some inputs from the threads about their
> > > latency fairness which bring us back to some kind of latency niceness
> > > value
> > >
> >
> > Unfortunately, I didn't get a complete set of results but enough to work
> > with. The missing tests have been requeued. The figures below are based
> > on a single-socket Skylake machine with 8 CPUs as it had the most set of
> > results and is the basic case.
> >
> > The reported kernels are
> >
> > vanilla:                        vanilla 5.15-rc1
> > sched-scalewakegran-v2r4:       My patch
> > sched-moveforward-v1r1:         Vincent's patch
> 
> I imagine that this is the results for the 1st version which scales
> with the number of CPUs
> 

Yes, the v1r5 results were incomplete and had to be requeued.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22  5:22       ` Mike Galbraith
  2021-09-22 13:20         ` Mel Gorman
@ 2021-10-03  3:07         ` Mike Galbraith
  2021-10-03  7:34           ` Barry Song
  1 sibling, 1 reply; 59+ messages in thread
From: Mike Galbraith @ 2021-10-03  3:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, 2021-09-22 at 07:22 +0200, Mike Galbraith wrote:
> On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:
> > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
> >
> >
> > > Preemption does rapidly run into diminishing return as load climbs for
> > > a lot of loads, but as you know, it's a rather sticky wicket because
> > > even when over-committed, preventing light control threads from slicing
> > > through (what can be a load's own work crew of) hogs can seriously
> > > injure performance.
> > >
> >
> > Turning this into a classic Rob Peter To Pay Paul problem. We don't know
> > if there is a light control thread that needs to run or not that affects
> > overall performance. It all depends on whether that control thread needs
> > to make progress for the overall workload or whether there are a mix of
> > workloads resulting in overloading.
>
> WRT overload, and our good buddies Peter and Paul :) I added...
>         if (gran >= sysctl_sched_latency >> 1)
>                 trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running);
> ...to watch, and met the below when I.. logged in. 
>
> homer:..debug/tracing # tail -20 trace
>                X-2229    [002] d..5.    60.690322: wakeup_gran: runnable:9 preempt disabled
>                X-2229    [002] d..5.    60.690325: wakeup_gran: runnable:10 preempt disabled
>                X-2229    [002] d..5.    60.690330: wakeup_gran: runnable:11 preempt disabled
>                X-2229    [002] d..5.    60.690363: wakeup_gran: runnable:13 preempt disabled
>                X-2229    [002] d..5.    60.690377: wakeup_gran: runnable:14 preempt disabled
>                X-2229    [002] d..5.    60.690390: wakeup_gran: runnable:15 preempt disabled
>                X-2229    [002] d..5.    60.690404: wakeup_gran: runnable:16 preempt disabled
>                X-2229    [002] d..5.    60.690425: wakeup_gran: runnable:9 preempt disabled
>        ksmserver-2694    [003] d..3.    60.690432: wakeup_gran: runnable:6 preempt disabled
>        ksmserver-2694    [003] d..3.    60.690436: wakeup_gran: runnable:7 preempt disabled
>                X-2229    [002] d..5.    60.690451: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [002] d..5.    60.690465: wakeup_gran: runnable:7 preempt disabled
>             kmix-2736    [000] d..3.    60.690491: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889635: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889675: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889863: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889944: wakeup_gran: runnable:6 preempt disabled
>                X-2229    [004] d..5.    92.889957: wakeup_gran: runnable:7 preempt disabled
>                X-2229    [004] d..5.    92.889968: wakeup_gran: runnable:8 preempt disabled
>   QXcbEventQueue-2740    [000] d..4.    92.890025: wakeup_gran: runnable:6 preempt disabled
> homer:..debug/tracing
>
> Watching 'while sleep 1; do clear;tail trace; done' with nothing but a
> kbuild running is like watching top.  There's enough stacking during
> routine use of my desktop box that it runs into the tick granularity
> wall pretty much continuously, so 'overload' may want redefining.

I looked into that crazy stacking depth...

static int
wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
                   int this_cpu, int prev_cpu, int sync)
{
        s64 this_eff_load, prev_eff_load;
        unsigned long task_load;

        this_eff_load = cpu_load(cpu_rq(this_cpu));
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^ the butler didit!

That's pretty darn busted as it sits.  Between load updates, X, or any
other waker of many, can stack wakees to a ludicrous depth.  Tracing
kbuild vs firefox playing a youtube clip, I watched X stack 20 of the
zillion firefox minions while their previous CPUs all had 1 lousy task
running but a cpu_load() higher than the cpu_load() of X's CPU.  Most
of those prev_cpus were where X had left them when it migrated. Each
and every crazy depth migration was wake_affine_weight() deciding we
should pull based on crappy data.  As instantaneous load on the waker
CPU blew through the roof in my trace snapshot, its cpu_load() did
finally budge.. a tiny bit.. downward.  No idea where the stack would
have topped out, my tracing_off() limit was 20.
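
(For reference, cpu_load() here is roughly the below in 5.15 -- a PELT
load average rather than an instantaneous count, which is why it barely
reacts while a burst of pulls is piling tasks onto the waker's CPU:)

	static unsigned long cpu_load(struct rq *rq)
	{
		/* PELT load average of the root cfs_rq; it lags far behind
		 * rq->nr_running, so a string of back-to-back wakeups all
		 * see much the same number. */
		return cfs_rq_load_avg(&rq->cfs);	/* rq->cfs.avg.load_avg */
	}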

Hohum, my box grew a WA_INST companion to SIS_MIN_LAT cache cold task
distribulator feature ;-)  Not particularly lovely, but it knocks over
the leaning tower of minions.

	-Mike

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-03  3:07         ` wakeup_affine_weight() is b0rked - was " Mike Galbraith
@ 2021-10-03  7:34           ` Barry Song
  2021-10-03 14:52             ` Mike Galbraith
  2021-10-04  4:34             ` Mike Galbraith
  0 siblings, 2 replies; 59+ messages in thread
From: Barry Song @ 2021-10-03  7:34 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Sun, Oct 3, 2021 at 4:11 PM Mike Galbraith <efault@gmx.de> wrote:
>
> On Wed, 2021-09-22 at 07:22 +0200, Mike Galbraith wrote:
> > On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:
> > > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
> > >
> > >
> > > > Preemption does rapidly run into diminishing return as load climbs for
> > > > a lot of loads, but as you know, it's a rather sticky wicket because
> > > > even when over-committed, preventing light control threads from slicing
> > > > through (what can be a load's own work crew of) hogs can seriously
> > > > injure performance.
> > > >
> > >
> > > Turning this into a classic Rob Peter To Pay Paul problem. We don't know
> > > if there is a light control thread that needs to run or not that affects
> > > overall performance. It all depends on whether that control thread needs
> > > to make progress for the overall workload or whether there are a mix of
> > > workloads resulting in overloading.
> >
> > WRT overload, and our good buddies Peter and Paul :) I added...
> >         if (gran >= sysctl_sched_latency >> 1)
> >                 trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running);
> > ...to watch, and met the below when I.. logged in.
> >
> > homer:..debug/tracing # tail -20 trace
> >                X-2229    [002] d..5.    60.690322: wakeup_gran: runnable:9 preempt disabled
> >                X-2229    [002] d..5.    60.690325: wakeup_gran: runnable:10 preempt disabled
> >                X-2229    [002] d..5.    60.690330: wakeup_gran: runnable:11 preempt disabled
> >                X-2229    [002] d..5.    60.690363: wakeup_gran: runnable:13 preempt disabled
> >                X-2229    [002] d..5.    60.690377: wakeup_gran: runnable:14 preempt disabled
> >                X-2229    [002] d..5.    60.690390: wakeup_gran: runnable:15 preempt disabled
> >                X-2229    [002] d..5.    60.690404: wakeup_gran: runnable:16 preempt disabled
> >                X-2229    [002] d..5.    60.690425: wakeup_gran: runnable:9 preempt disabled
> >        ksmserver-2694    [003] d..3.    60.690432: wakeup_gran: runnable:6 preempt disabled
> >        ksmserver-2694    [003] d..3.    60.690436: wakeup_gran: runnable:7 preempt disabled
> >                X-2229    [002] d..5.    60.690451: wakeup_gran: runnable:6 preempt disabled
> >                X-2229    [002] d..5.    60.690465: wakeup_gran: runnable:7 preempt disabled
> >             kmix-2736    [000] d..3.    60.690491: wakeup_gran: runnable:6 preempt disabled
> >                X-2229    [004] d..5.    92.889635: wakeup_gran: runnable:6 preempt disabled
> >                X-2229    [004] d..5.    92.889675: wakeup_gran: runnable:6 preempt disabled
> >                X-2229    [004] d..5.    92.889863: wakeup_gran: runnable:6 preempt disabled
> >                X-2229    [004] d..5.    92.889944: wakeup_gran: runnable:6 preempt disabled
> >                X-2229    [004] d..5.    92.889957: wakeup_gran: runnable:7 preempt disabled
> >                X-2229    [004] d..5.    92.889968: wakeup_gran: runnable:8 preempt disabled
> >   QXcbEventQueue-2740    [000] d..4.    92.890025: wakeup_gran: runnable:6 preempt disabled
> > homer:..debug/tracing
> >
> > Watching 'while sleep 1; do clear;tail trace; done' with nothing but a
> > kbuild running is like watching top.  There's enough stacking during
> > routine use of my desktop box that it runs into the tick granularity
> > wall pretty much continuously, so 'overload' may want redefining.
>
> I looked into that crazy stacking depth...
>
> static int
> wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
>                    int this_cpu, int prev_cpu, int sync)
> {
>         s64 this_eff_load, prev_eff_load;
>         unsigned long task_load;
>
>         this_eff_load = cpu_load(cpu_rq(this_cpu));
>                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^ the butler didit!
>
> That's pretty darn busted as it sits.  Between load updates, X, or any
> other waker of many, can stack wakees to a ludicrous depth.  Tracing
> kbuild vs firefox playing a youtube clip, I watched X stack 20 of the
> zillion firefox minions while their previous CPUs all had 1 lousy task
> running but a cpu_load() higher than the cpu_load() of X's CPU.  Most
> of those prev_cpus were where X had left them when it migrated. Each
> and every crazy depth migration was wake_affine_weight() deciding we
> should pull based on crappy data.  As instantaneous load on the waker
> CPU blew through the roof in my trace snapshot, its cpu_load() did
> finally budge.. a tiny bit.. downward.  No idea where the stack would
> have topped out, my tracing_off() limit was 20.

Mike, I'm not quite sure I caught your point. It seems you mean X wakes
up many firefoxes within a short period, so it pulls them to the CPU
where X is running. Technically those pulls should increase the cpu_load
of X's CPU, but for some reason the cpu_load is not increased in time on
X's CPU, so a lot of firefoxes get piled onto X's CPU while the load of
the CPUs the firefoxes were previously running on still looks larger
than that of X's CPU with all those firefoxes?

I am wondering if this should be the responsibility of wake_wide()?

>
> Hohum, my box grew a WA_INST companion to SIS_MIN_LAT cache cold task
> distribulator feature ;-)  Not particularly lovely, but it knocks over
> the leaning tower of minions.
>
>         -Mike

Thanks
barry

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-03  7:34           ` Barry Song
@ 2021-10-03 14:52             ` Mike Galbraith
  2021-10-03 21:06               ` Barry Song
  2021-10-04  4:34             ` Mike Galbraith
  1 sibling, 1 reply; 59+ messages in thread
From: Mike Galbraith @ 2021-10-03 14:52 UTC (permalink / raw)
  To: Barry Song
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Sun, 2021-10-03 at 20:34 +1300, Barry Song wrote:
> >
> > I looked into that crazy stacking depth...
> >
> > static int
> > wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> >                    int this_cpu, int prev_cpu, int sync)
> > {
> >         s64 this_eff_load, prev_eff_load;
> >         unsigned long task_load;
> >
> >         this_eff_load = cpu_load(cpu_rq(this_cpu));
> >                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^ the butler didit!
> >
> > That's pretty darn busted as it sits.  Between load updates, X, or any
> > other waker of many, can stack wakees to a ludicrous depth.  Tracing
> > kbuild vs firefox playing a youtube clip, I watched X stack 20 of the
> > zillion firefox minions while their previous CPUs all had 1 lousy task
> > running but a cpu_load() higher than the cpu_load() of X's CPU.  Most
> > of those prev_cpus were where X had left them when it migrated. Each
> > and every crazy depth migration was wake_affine_weight() deciding we
> > should pull based on crappy data.  As instantaneous load on the waker
> > CPU blew through the roof in my trace snapshot, its cpu_load() did
> > finally budge.. a tiny bit.. downward.  No idea where the stack would
> > have topped out, my tracing_off() limit was 20.
>
> Mike, not quite sure I caught your point. It seems you mean x wakes up
> many firefoxes within a short period, so it pulls them to the CPU where x
> is running. Technically those pulling should increase cpu_load of x' CPU.
> But due to some reason, the cpu_load is not increased in time on x' CPU,
> So this makes a lot of firefoxes piled on x' CPU, but at that time,  the load
> of the cpu which firefox was running on is still larger than x' cpu with a lot
> of firefoxes?

It looked like this.

X-2211    [007] d...211  2327.810997: select_task_rq_fair: this_run/load:4:373 prev_run/load:4:373 waking firefox:4971 CPU7 ==> CPU7
X-2211    [007] d...211  2327.811004: select_task_rq_fair: this_run/load:5:373 prev_run/load:1:1029 waking QXcbEventQueue:4952 CPU0 ==> CPU7
X-2211    [007] d...211  2327.811010: select_task_rq_fair: this_run/load:6:373 prev_run/load:1:1528 waking QXcbEventQueue:3969 CPU5 ==> CPU7
X-2211    [007] d...211  2327.811015: select_task_rq_fair: this_run/load:7:373 prev_run/load:1:1029 waking evolution-alarm:3833 CPU0 ==> CPU7
X-2211    [007] d...211  2327.811021: select_task_rq_fair: this_run/load:8:373 prev_run/load:1:1528 waking QXcbEventQueue:3860 CPU5 ==> CPU7
X-2211    [007] d...211  2327.811026: select_task_rq_fair: this_run/load:8:373 prev_run/load:1:1528 waking QXcbEventQueue:3800 CPU5 ==> CPU7
X-2211    [007] d...211  2327.811032: select_task_rq_fair: this_run/load:9:373 prev_run/load:1:1528 waking xdg-desktop-por:3341 CPU5 ==> CPU7
X-2211    [007] d...211  2327.811037: select_task_rq_fair: this_run/load:10:373 prev_run/load:1:289 waking at-spi2-registr:3165 CPU4 ==> CPU7
X-2211    [007] d...211  2327.811042: select_task_rq_fair: this_run/load:11:373 prev_run/load:1:1029 waking ibus-ui-gtk3:2865 CPU0 ==> CPU0
X-2211    [007] d...211  2327.811049: select_task_rq_fair: this_run/load:11:373 prev_run/load:1:226 waking ibus-x11:2868 CPU2 ==> CPU2
X-2211    [007] d...211  2327.811054: select_task_rq_fair: this_run/load:11:373 prev_run/load:11:373 waking ibus-extension-:2866 CPU7 ==> CPU7
X-2211    [007] d...211  2327.811059: select_task_rq_fair: this_run/load:12:373 prev_run/load:1:289 waking QXcbEventQueue:2804 CPU4 ==> CPU7
X-2211    [007] d...211  2327.811063: select_task_rq_fair: this_run/load:13:373 prev_run/load:1:935 waking QXcbEventQueue:2756 CPU1 ==> CPU7
X-2211    [007] d...211  2327.811068: select_task_rq_fair: this_run/load:14:373 prev_run/load:1:1528 waking QXcbEventQueue:2753 CPU5 ==> CPU7
X-2211    [007] d...211  2327.811074: select_task_rq_fair: this_run/load:15:373 prev_run/load:1:1528 waking QXcbEventQueue:2741 CPU5 ==> CPU7
X-2211    [007] d...211  2327.811079: select_task_rq_fair: this_run/load:16:373 prev_run/load:1:1528 waking QXcbEventQueue:2730 CPU5 ==> CPU7
X-2211    [007] d...211  2327.811085: select_task_rq_fair: this_run/load:17:373 prev_run/load:1:5 waking QXcbEventQueue:2724 CPU0 ==> CPU0
X-2211    [007] d...211  2327.811090: select_task_rq_fair: this_run/load:17:373 prev_run/load:1:1010 waking QXcbEventQueue:2721 CPU6 ==> CPU7
X-2211    [007] d...211  2327.811096: select_task_rq_fair: this_run/load:18:373 prev_run/load:1:1528 waking QXcbEventQueue:2720 CPU5 ==> CPU7
X-2211    [007] d...211  2327.811101: select_task_rq_fair: this_run/load:19:373 prev_run/load:1:1528 waking QXcbEventQueue:2704 CPU5 ==> CPU7
X-2211    [007] d...211  2327.811105: select_task_rq_fair: this_run/load:20:373 prev_run/load:0:226 waking QXcbEventQueue:2705 CPU2 ==> CPU2
X-2211    [007] d...211  2327.811110: select_task_rq_fair: this_run/load:19:342 prev_run/load:1:1528 waking QXcbEventQueue:2695 CPU5 ==> CPU7
X-2211    [007] d...211  2327.811115: select_task_rq_fair: this_run/load:20:342 prev_run/load:1:1528 waking QXcbEventQueue:2694 CPU5 ==> CPU7
X-2211    [007] d...211  2327.811120: select_task_rq_fair: this_run/load:21:342 prev_run/load:1:1528 waking QXcbEventQueue:2679 CPU5 ==> CPU7

Legend: foo_run/load:foo->nr_running:cpu_load(foo)

Every migration to CPU7 in the above was due to wake_affine_weight()
seeing more or less static effective load numbers (the trace was wider,
showing which path was taken).
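
(The trace lines above can be reproduced with a local trace_printk()
hack in select_task_rq_fair(); a sketch -- exact placement and fields
are approximate, this is not the actual debug patch:)

	/* At the end of select_task_rq_fair(): cpu is the waker's CPU,
	 * prev_cpu the wakee's previous CPU, new_cpu the chosen CPU. */
	trace_printk("this_run/load:%u:%lu prev_run/load:%u:%lu waking %s:%d CPU%d ==> CPU%d\n",
		     cpu_rq(cpu)->nr_running, cpu_load(cpu_rq(cpu)),
		     cpu_rq(prev_cpu)->nr_running, cpu_load(cpu_rq(prev_cpu)),
		     p->comm, p->pid, prev_cpu, new_cpu);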

> I am wondering if this should be the responsibility of wake_wide()?

That's a good point.  I'm not so sure that would absolve use of what
appears to be stagnant state though.  If we hadn't gotten there, this
stack obviously wouldn't have happened.. but we did get there, and
state that was used did not reflect reality.  wake_wide() deflecting
this particular gaggle wouldn't have improved state accuracy one whit for a
subsequent wakeup, or?

	-Mike

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-03 14:52             ` Mike Galbraith
@ 2021-10-03 21:06               ` Barry Song
  2021-10-04  1:49                 ` Mike Galbraith
  0 siblings, 1 reply; 59+ messages in thread
From: Barry Song @ 2021-10-03 21:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Mon, Oct 4, 2021 at 3:52 AM Mike Galbraith <efault@gmx.de> wrote:
>
> On Sun, 2021-10-03 at 20:34 +1300, Barry Song wrote:
> > >
> > > I looked into that crazy stacking depth...
> > >
> > > static int
> > > wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> > >                    int this_cpu, int prev_cpu, int sync)
> > > {
> > >         s64 this_eff_load, prev_eff_load;
> > >         unsigned long task_load;
> > >
> > >         this_eff_load = cpu_load(cpu_rq(this_cpu));
> > >                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^ the butler didit!
> > >
> > > That's pretty darn busted as it sits.  Between load updates, X, or any
> > > other waker of many, can stack wakees to a ludicrous depth.  Tracing
> > > kbuild vs firefox playing a youtube clip, I watched X stack 20 of the
> > > zillion firefox minions while their previous CPUs all had 1 lousy task
> > > running but a cpu_load() higher than the cpu_load() of X's CPU.  Most
> > > of those prev_cpus were where X had left them when it migrated. Each
> > > and every crazy depth migration was wake_affine_weight() deciding we
> > > should pull based on crappy data.  As instantaneous load on the waker
> > > CPU blew through the roof in my trace snapshot, its cpu_load() did
> > > finally budge.. a tiny bit.. downward.  No idea where the stack would
> > > have topped out, my tracing_off() limit was 20.
> >
> > Mike, not quite sure I caught your point. It seems you mean x wakes up
> > many firefoxes within a short period, so it pulls them to the CPU where x
> > is running. Technically those pulling should increase cpu_load of x' CPU.
> > But due to some reason, the cpu_load is not increased in time on x' CPU,
> > So this makes a lot of firefoxes piled on x' CPU, but at that time,  the load
> > of the cpu which firefox was running on is still larger than x' cpu with a lot
> > of firefoxes?
>
> It looked like this.
>
> X-2211    [007] d...211  2327.810997: select_task_rq_fair: this_run/load:4:373 prev_run/load:4:373 waking firefox:4971 CPU7 ==> CPU7
> X-2211    [007] d...211  2327.811004: select_task_rq_fair: this_run/load:5:373 prev_run/load:1:1029 waking QXcbEventQueue:4952 CPU0 ==> CPU7
> X-2211    [007] d...211  2327.811010: select_task_rq_fair: this_run/load:6:373 prev_run/load:1:1528 waking QXcbEventQueue:3969 CPU5 ==> CPU7
> X-2211    [007] d...211  2327.811015: select_task_rq_fair: this_run/load:7:373 prev_run/load:1:1029 waking evolution-alarm:3833 CPU0 ==> CPU7
> X-2211    [007] d...211  2327.811021: select_task_rq_fair: this_run/load:8:373 prev_run/load:1:1528 waking QXcbEventQueue:3860 CPU5 ==> CPU7
> X-2211    [007] d...211  2327.811026: select_task_rq_fair: this_run/load:8:373 prev_run/load:1:1528 waking QXcbEventQueue:3800 CPU5 ==> CPU7
> X-2211    [007] d...211  2327.811032: select_task_rq_fair: this_run/load:9:373 prev_run/load:1:1528 waking xdg-desktop-por:3341 CPU5 ==> CPU7
> X-2211    [007] d...211  2327.811037: select_task_rq_fair: this_run/load:10:373 prev_run/load:1:289 waking at-spi2-registr:3165 CPU4 ==> CPU7
> X-2211    [007] d...211  2327.811042: select_task_rq_fair: this_run/load:11:373 prev_run/load:1:1029 waking ibus-ui-gtk3:2865 CPU0 ==> CPU0
> X-2211    [007] d...211  2327.811049: select_task_rq_fair: this_run/load:11:373 prev_run/load:1:226 waking ibus-x11:2868 CPU2 ==> CPU2
> X-2211    [007] d...211  2327.811054: select_task_rq_fair: this_run/load:11:373 prev_run/load:11:373 waking ibus-extension-:2866 CPU7 ==> CPU7
> X-2211    [007] d...211  2327.811059: select_task_rq_fair: this_run/load:12:373 prev_run/load:1:289 waking QXcbEventQueue:2804 CPU4 ==> CPU7
> X-2211    [007] d...211  2327.811063: select_task_rq_fair: this_run/load:13:373 prev_run/load:1:935 waking QXcbEventQueue:2756 CPU1 ==> CPU7
> X-2211    [007] d...211  2327.811068: select_task_rq_fair: this_run/load:14:373 prev_run/load:1:1528 waking QXcbEventQueue:2753 CPU5 ==> CPU7
> X-2211    [007] d...211  2327.811074: select_task_rq_fair: this_run/load:15:373 prev_run/load:1:1528 waking QXcbEventQueue:2741 CPU5 ==> CPU7
> X-2211    [007] d...211  2327.811079: select_task_rq_fair: this_run/load:16:373 prev_run/load:1:1528 waking QXcbEventQueue:2730 CPU5 ==> CPU7
> X-2211    [007] d...211  2327.811085: select_task_rq_fair: this_run/load:17:373 prev_run/load:1:5 waking QXcbEventQueue:2724 CPU0 ==> CPU0
> X-2211    [007] d...211  2327.811090: select_task_rq_fair: this_run/load:17:373 prev_run/load:1:1010 waking QXcbEventQueue:2721 CPU6 ==> CPU7
> X-2211    [007] d...211  2327.811096: select_task_rq_fair: this_run/load:18:373 prev_run/load:1:1528 waking QXcbEventQueue:2720 CPU5 ==> CPU7
> X-2211    [007] d...211  2327.811101: select_task_rq_fair: this_run/load:19:373 prev_run/load:1:1528 waking QXcbEventQueue:2704 CPU5 ==> CPU7
> X-2211    [007] d...211  2327.811105: select_task_rq_fair: this_run/load:20:373 prev_run/load:0:226 waking QXcbEventQueue:2705 CPU2 ==> CPU2
> X-2211    [007] d...211  2327.811110: select_task_rq_fair: this_run/load:19:342 prev_run/load:1:1528 waking QXcbEventQueue:2695 CPU5 ==> CPU7
> X-2211    [007] d...211  2327.811115: select_task_rq_fair: this_run/load:20:342 prev_run/load:1:1528 waking QXcbEventQueue:2694 CPU5 ==> CPU7
> X-2211    [007] d...211  2327.811120: select_task_rq_fair: this_run/load:21:342 prev_run/load:1:1528 waking QXcbEventQueue:2679 CPU5 ==> CPU7

What is the topology of your hardware?  Shouldn't select_idle_sibling
find some other idle CPUs in CPU7's LLC domain?
Why are you always getting CPU7?

One thing bothering me is that we are using the load of a single CPU
in wake_affine_weight(), but we are actually scanning the whole LLC
afterwards.
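
(Paraphrasing the 5.15 wakeup path to make that concrete -- a sketch,
not verbatim code: the affine decision compares exactly two runqueues,
and its winner is then only the starting point for the LLC-wide idle
search.)

	if (want_affine && cpu != prev_cpu)
		new_cpu = wake_affine(sd, p, cpu, prev_cpu, sync);	/* 2-rq load compare */

	new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);		/* scans new_cpu's LLC */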

>
> Legend: foo_run/load:foo->nr_running:cpu_load(foo)
>
> Every migration to CPU7 in the above was due to wake_affine_weight()
> seeing more or less static effective load numbers (the trace was wider,
> showing which path was taken).
>
> > I am wondering if this should be the responsibility of wake_wide()?
>
> That's a good point.  I'm not so sure that would absolve use of what
> appears to be stagnant state though.  If we hadn't gotten there, this
> stack obviously wouldn't have happened.. but we did get there, and
> state that was used did not reflect reality.  wake_wide() deflecting
> this particular gaggle wouldn't improved state accuracy one whit for a
> subsequent wakeup, or?
>
>         -Mike

Thanks
barry

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-03 21:06               ` Barry Song
@ 2021-10-04  1:49                 ` Mike Galbraith
  0 siblings, 0 replies; 59+ messages in thread
From: Mike Galbraith @ 2021-10-04  1:49 UTC (permalink / raw)
  To: Barry Song
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Mon, 2021-10-04 at 10:06 +1300, Barry Song wrote:
>
> What is the topology of your hardware?

It's an i4790 quad+smt.

> shouldn't select_idle_sibling find some other idle CPUs in CPU7's LLC
> domain? Why are you always getting CPU7?

The box is busy.  It's running a parallel kbuild as well as the
desktop, firefox and their various minions (et al.. kthreads).

> one thing bothering me is that we are using the load of a single CPU
> in wake_affine_weight(), but we are actually scanning the whole LLC
> afterwards.

The question wake_affine() is answering is one from the bad old days of
SMP: pull the task to its toasty data, or make it drag that data across
horrible hardware. LLC came into the equation much later.  With it, and
big boxen with several of them, the wake affine question became mostly
about which LLC, but not solely.  There's still L2 to consider, and
there's even a valid time to stack two communicating tasks like in the
bad old SMP days.. say when there would be a man-in-the-middle of the
conversation if wakee were to be left on its previous CPU.

	-Mike

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-03  7:34           ` Barry Song
  2021-10-03 14:52             ` Mike Galbraith
@ 2021-10-04  4:34             ` Mike Galbraith
  2021-10-04  9:06               ` Mike Galbraith
  1 sibling, 1 reply; 59+ messages in thread
From: Mike Galbraith @ 2021-10-04  4:34 UTC (permalink / raw)
  To: Barry Song
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Sun, 2021-10-03 at 20:34 +1300, Barry Song wrote:
>
> I am wondering if this should be the responsibility of wake_wide()?

Those event threads we stacked so high (which are kde minions btw)
don't generally accrue _any_ wakee_flips, so when X wakes a slew of the
things, wake_wide()'s heuristic rejects the lot.
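
(For reference, the heuristic in question, roughly as it sits in
mainline: record_wakee() only bumps the waker's flip count when it
switches wakees, so a wakee that never wakes anyone keeps wakee_flips
near zero and the slave < factor test below rejects going wide.)

	static void record_wakee(struct task_struct *p)
	{
		/* decay the flip count about once per jiffy */
		if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
			current->wakee_flips >>= 1;
			current->wakee_flip_decay_ts = jiffies;
		}

		if (current->last_wakee != p) {
			current->last_wakee = p;
			current->wakee_flips++;
		}
	}

	static int wake_wide(struct task_struct *p)
	{
		unsigned int master = current->wakee_flips;
		unsigned int slave = p->wakee_flips;
		int factor = __this_cpu_read(sd_llc_size);

		if (master < slave)
			swap(master, slave);
		/* slave stays ~0 for flip-less event threads => stay affine */
		if (slave < factor || master < slave * factor)
			return 0;
		return 1;
	}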

So yeah, the blame game for this issue is a target rich environment.
Shoot either of 'em (or both), and you'll hit the bad guy.

	-Mike

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-27 14:17                               ` Mike Galbraith
@ 2021-10-04  8:05                                 ` Mel Gorman
  2021-10-04 16:37                                   ` Vincent Guittot
  0 siblings, 1 reply; 59+ messages in thread
From: Mel Gorman @ 2021-10-04  8:05 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Vincent Guittot, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Mon, Sep 27, 2021 at 04:17:25PM +0200, Mike Galbraith wrote:
> On Mon, 2021-09-27 at 12:17 +0100, Mel Gorman wrote:
> > On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote:
> > > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote:
> > > >
> > > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote:
> > > > >
> > > > > a 100us value should even be enough to fix Mel's problem without
> > > > > impacting common wakeup preemption cases.
> > > >
> > > > It'd be nice if it turn out to be something that simple, but color me
> > > > skeptical.  I've tried various preemption throttling schemes, and while
> > >
> > > Let's see what the results will show. I tend to agree that this will
> > > not be enough to cover all use cases and I don't see any other way to
> > > cover all cases than getting some inputs from the threads about their
> > > latency fairness which bring us back to some kind of latency niceness
> > > value
> > >
> >
> > Unfortunately, I didn't get a complete set of results but enough to work
> > with. The missing tests have been requeued. The figures below are based
> > on a single-socket Skylake machine with 8 CPUs as it had the most set of
> > results and is the basic case.
> 
> There's something missing, namely how does whatever load you measure
> perform when facing dissimilar competition. Instead of only scaling
> loads running solo from underutilized to heavily over-committed, give
> them competition. eg something switch heavy, say tbench, TCP_RR et al
> (latency bound load) pairs=CPUS vs something hefty like make -j CPUS or
> such.
> 

Ok, that's an interesting test. I've been out intermittently and will be
for the next few weeks but I managed to automate something that can test
this. The test runs a kernel compile with -jNR_CPUS and TCP_RR running
NR_CPUS pairs of clients/servers in the background with the default
openSUSE Leap kernel config (CONFIG_PREEMPT_NONE) with the two patches
and no tricks done with task priorities.  Five kernel compilations are
run and TCP_RR is shut down when the compilation finishes.

This can be reproduced with the mmtests config
config-multi-kernbench__netperf-tcp-rr-multipair using xfs as the
filesystem for the kernel compilation.

sched-scalewakegran-v2r5: my patch
sched-moveforward-v1r1: Vincent's patch


multi subtest kernbench
                              5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
                                 vanillasched-scalewakegran-v2r5 sched-moveforward-v1r1
Amean     user-80     1518.87 (   0.00%)     1520.34 (  -0.10%)     1518.93 (  -0.00%)
Amean     syst-80      248.57 (   0.00%)      247.74 (   0.33%)      232.93 *   6.29%*
Amean     elsp-80       48.76 (   0.00%)       48.51 (   0.52%)       48.70 (   0.14%)
Stddev    user-80       10.15 (   0.00%)        9.17 (   9.70%)       10.25 (  -0.93%)
Stddev    syst-80        2.83 (   0.00%)        3.02 (  -6.65%)        3.65 ( -28.83%)
Stddev    elsp-80        3.54 (   0.00%)        3.28 (   7.28%)        2.40 (  32.13%)
CoeffVar  user-80        0.67 (   0.00%)        0.60 (   9.79%)        0.67 (  -0.93%)
CoeffVar  syst-80        1.14 (   0.00%)        1.22 (  -7.01%)        1.57 ( -37.48%)
CoeffVar  elsp-80        7.26 (   0.00%)        6.76 (   6.79%)        4.93 (  32.04%)

With either patch, time to finish compilations is not affected, with
differences in elapsed time being well within the noise.

Meanwhile, netperf tcp-rr running with NR_CPUS pairs showed the
following

multi subtest netperf-tcp-rr
                        5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
                           vanilla sched-scalewakegran-v2r5 sched-moveforward-v1r1
Min       1    32388.28 (   0.00%)    32208.66 (  -0.55%)    31824.54 (  -1.74%)
Hmean     1    39112.22 (   0.00%)    39364.10 (   0.64%)    39552.30 *   1.13%*
Stddev    1     3471.61 (   0.00%)     3357.28 (   3.29%)     3713.97 (  -6.98%)
CoeffVar  1        8.81 (   0.00%)        8.47 (   3.87%)        9.31 (  -5.67%)
Max       1    53019.93 (   0.00%)    51263.38 (  -3.31%)    51263.04 (  -3.31%)

This shows a slightly different picture, with Vincent's patch having a
small impact on netperf tcp-rr. It's noisy and may be subject to
test-to-test variance, but it's a mild concern. A greater concern is
that across all machines, dbench was heavily affected by Vincent's
patch even for relatively low thread counts, which is surprising.

For the same Cascadelake machine both results are from, dbench reports

                          5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
                             vanillasched-scalewakegran-v2r5 sched-moveforward-v1r1
Amean     1         15.99 (   0.00%)       16.20 *  -1.27%*       16.18 *  -1.16%*
Amean     2         18.43 (   0.00%)       18.34 *   0.50%*       22.72 * -23.28%*
Amean     4         22.32 (   0.00%)       22.06 *   1.14%*       45.86 *-105.52%*
Amean     8         30.58 (   0.00%)       30.22 *   1.18%*       99.04 *-223.88%*
Amean     16        41.79 (   0.00%)       41.68 *   0.25%*      161.09 *-285.52%*
Amean     32        63.45 (   0.00%)       63.16 *   0.45%*      248.13 *-291.09%*
Amean     64       127.81 (   0.00%)      128.50 *  -0.54%*      402.93 *-215.25%*
Amean     128      330.42 (   0.00%)      336.06 *  -1.71%*      531.35 * -60.81%*

That is an excessive impairment. While it varied across machines, there
was some impact on all of them. For a 1-socket Skylake machine (to rule
out NUMA artifacts), I get

dbench4 Loadfile Execution Time
                         5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
                            vanillasched-scalewakegran-v2r5 sched-moveforward-v1r1
Amean     1        29.51 (   0.00%)       29.45 *   0.21%*       29.58 *  -0.22%*
Amean     2        37.46 (   0.00%)       37.16 *   0.82%*       39.81 *  -6.26%*
Amean     4        51.31 (   0.00%)       51.34 (  -0.04%)       57.14 * -11.35%*
Amean     8        81.77 (   0.00%)       81.65 (   0.15%)       88.68 *  -8.44%*
Amean     64      406.94 (   0.00%)      408.08 *  -0.28%*      433.64 *  -6.56%*
Stddev    1         1.43 (   0.00%)        1.44 (  -0.79%)        1.54 (  -7.45%)

Not as dramatic, but it indicates that we likely do not want to cut
off wakeup preemption too early.

The test was not profiling time to switch tasks as the overhead
distorts results.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-04  4:34             ` Mike Galbraith
@ 2021-10-04  9:06               ` Mike Galbraith
  2021-10-05  7:47                 ` Mel Gorman
  0 siblings, 1 reply; 59+ messages in thread
From: Mike Galbraith @ 2021-10-04  9:06 UTC (permalink / raw)
  To: Barry Song
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Mon, 2021-10-04 at 06:34 +0200, Mike Galbraith wrote:
> On Sun, 2021-10-03 at 20:34 +1300, Barry Song wrote:
> >
> > I am wondering if this should be the responsibility of wake_wide()?
>
> Those event threads we stacked so high (which are kde minions btw),
> don't generally accrue _any_ wakee_flips, so when X wakes a slew of the
> things, wake_wide()'s heuristic rejects the lot.
>
> So yeah, the blame game for this issue is a target rich environment.
> Shoot either of 'em (or both), and you'll hit the bad guy.

The mallet below convinced wake_wide() that X waking event threads is
something it most definitely should care about.  It's not perfect, can
get caught with its pants down, because behavior changes a LOT, but I
at least have to work at it a bit to stack tasks to the ceiling.

With make -j8 running along with firefox with two tabs, one containing
youtube's suggestions of stuff you probably don't want to watch, the
other a running clip, if I have the idle tab in focus, and don't drive
mouse around, flips decay enough for wake_wide() to lose interest, but
just wiggle the mouse, and it starts waking wide. Focus on the running
clip, and it continuously wakes wide.  

Hacky, but way better behavior.. at this particular testcase.. in this
particular box.. at least once :)

---
 kernel/sched/fair.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5865,6 +5865,14 @@ static void record_wakee(struct task_str
 	}

 	if (current->last_wakee != p) {
+		int min = __this_cpu_read(sd_llc_size) << 1;
+		/*
+		 * Couple the wakee flips to the waker for the case where it
+		 * doesn't accrue flips, taking care to not push the wakee
+		 * high enough that the wake_wide() heuristic fails.
+		 */
+		if (current->wakee_flips > p->wakee_flips * min)
+			p->wakee_flips++;
 		current->last_wakee = p;
 		current->wakee_flips++;
 	}
@@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct

 	if (master < slave)
 		swap(master, slave);
-	if (slave < factor || master < slave * factor)
+	if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
 		return 0;
 	return 1;
 }


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-04  8:05                                 ` Mel Gorman
@ 2021-10-04 16:37                                   ` Vincent Guittot
  2021-10-05  7:41                                     ` Mel Gorman
  0 siblings, 1 reply; 59+ messages in thread
From: Vincent Guittot @ 2021-10-04 16:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Mon, 4 Oct 2021 at 10:05, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Mon, Sep 27, 2021 at 04:17:25PM +0200, Mike Galbraith wrote:
> > On Mon, 2021-09-27 at 12:17 +0100, Mel Gorman wrote:
> > > On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote:
> > > > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote:
> > > > >
> > > > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote:
> > > > > >
> > > > > > a 100us value should even be enough to fix Mel's problem without
> > > > > > impacting common wakeup preemption cases.
> > > > >
> > > > > It'd be nice if it turn out to be something that simple, but color me
> > > > > skeptical.  I've tried various preemption throttling schemes, and while
> > > >
> > > > Let's see what the results will show. I tend to agree that this will
> > > > not be enough to cover all use cases and I don't see any other way to
> > > > cover all cases than getting some inputs from the threads about their
> > > > latency fairness which bring us back to some kind of latency niceness
> > > > value
> > > >
> > >
> > > Unfortunately, I didn't get a complete set of results but enough to work
> > > with. The missing tests have been requeued. The figures below are based
> > > on a single-socket Skylake machine with 8 CPUs as it had the most set of
> > > results and is the basic case.
> >
> > There's something missing, namely how does whatever load you measure
> > perform when facing dissimilar competition. Instead of only scaling
> > loads running solo from underutilized to heavily over-committed, give
> > them competition. eg something switch heavy, say tbench, TCP_RR et al
> > (latency bound load) pairs=CPUS vs something hefty like make -j CPUS or
> > such.
> >
>
> Ok, that's an interesting test. I've been out intermittently and will be
> for the next few weeks but I managed to automate something that can test
> this. The test runs a kernel compile with -jNR_CPUS and TCP_RR running
> NR_CPUS pairs of clients/servers in the background with the default
> openSUSE Leap kernel config (CONFIG_PREEMPT_NONE) with the two patches
> and no tricks done with task priorities.  5 kernel compilations are run
> and TCP_RR is shutdown when the compilation finishes.
>
> This can be reproduced with the mmtests config
> config-multi-kernbench__netperf-tcp-rr-multipair using xfs as the
> filesystem for the kernel compilation.
>
> sched-scalewakegran-v2r5: my patch
> sched-moveforward-v1r1: Vincent's patch

If I'm not wrong, sched-moveforward-v1r1 refers to the 1st version,
which scales with the number of CPUs. We don't want to scale with the
number of CPUs because this can create quite large non-preemptible
durations. We want to ensure a fixed small runtime like the last
version with 100us

>
>
> multi subtest kernbench
>                               5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                                  vanillasched-scalewakegran-v2r5 sched-moveforward-v1r1
> Amean     user-80     1518.87 (   0.00%)     1520.34 (  -0.10%)     1518.93 (  -0.00%)
> Amean     syst-80      248.57 (   0.00%)      247.74 (   0.33%)      232.93 *   6.29%*
> Amean     elsp-80       48.76 (   0.00%)       48.51 (   0.52%)       48.70 (   0.14%)
> Stddev    user-80       10.15 (   0.00%)        9.17 (   9.70%)       10.25 (  -0.93%)
> Stddev    syst-80        2.83 (   0.00%)        3.02 (  -6.65%)        3.65 ( -28.83%)
> Stddev    elsp-80        3.54 (   0.00%)        3.28 (   7.28%)        2.40 (  32.13%)
> CoeffVar  user-80        0.67 (   0.00%)        0.60 (   9.79%)        0.67 (  -0.93%)
> CoeffVar  syst-80        1.14 (   0.00%)        1.22 (  -7.01%)        1.57 ( -37.48%)
> CoeffVar  elsp-80        7.26 (   0.00%)        6.76 (   6.79%)        4.93 (  32.04%)
>
> With either patch,  time to finish compilations is not affected with
> differences in elapsed time being well within the noise
>
> Meanwhile, netperf tcp-rr running with NR_CPUS pairs showed the
> following
>
> multi subtest netperf-tcp-rr
>                         5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                            vanilla sched-scalewakegran-v2r5 sched-moveforward-v1r1
> Min       1    32388.28 (   0.00%)    32208.66 (  -0.55%)    31824.54 (  -1.74%)
> Hmean     1    39112.22 (   0.00%)    39364.10 (   0.64%)    39552.30 *   1.13%*
> Stddev    1     3471.61 (   0.00%)     3357.28 (   3.29%)     3713.97 (  -6.98%)
> CoeffVar  1        8.81 (   0.00%)        8.47 (   3.87%)        9.31 (  -5.67%)
> Max       1    53019.93 (   0.00%)    51263.38 (  -3.31%)    51263.04 (  -3.31%)
>
> This shows a slightly different picture with Vincent's patch having a small
> impact on netperf tcp-rr. It's noisy and may be subject to test-to-test
> variances but it's a mild concern. A greater concern is that across
> all machines, dbench was heavily affected by Vincent's patch even for
> relatively low thread counts which is surprising.
>
> For the same Cascadelake machine both resulst are from, dbench reports
>
>                           5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                              vanillasched-scalewakegran-v2r5 sched-moveforward-v1r1
> Amean     1         15.99 (   0.00%)       16.20 *  -1.27%*       16.18 *  -1.16%*
> Amean     2         18.43 (   0.00%)       18.34 *   0.50%*       22.72 * -23.28%*
> Amean     4         22.32 (   0.00%)       22.06 *   1.14%*       45.86 *-105.52%*
> Amean     8         30.58 (   0.00%)       30.22 *   1.18%*       99.04 *-223.88%*
> Amean     16        41.79 (   0.00%)       41.68 *   0.25%*      161.09 *-285.52%*
> Amean     32        63.45 (   0.00%)       63.16 *   0.45%*      248.13 *-291.09%*
> Amean     64       127.81 (   0.00%)      128.50 *  -0.54%*      402.93 *-215.25%*
> Amean     128      330.42 (   0.00%)      336.06 *  -1.71%*      531.35 * -60.81%*
>
> That is an excessive impairment. While it varied across machines, there
> was some impact on all of them. For a 1-socket skylake machine to rule
> out NUMA artifacts, I get

As mentioned above, the 1st version, which uses
sysctl_sched_min_granularity, is not the best solution because it
scales the duration with the number of CPUs on the system. The
move-forward duration should be fixed and small like the last proposal
with 100us. It would be interesting to see results for this last
version instead

>
> dbench4 Loadfile Execution Time
>                          5.15.0-rc1             5.15.0-rc1             5.15.0-rc1
>                             vanillasched-scalewakegran-v2r5 sched-moveforward-v1r1
> Amean     1        29.51 (   0.00%)       29.45 *   0.21%*       29.58 *  -0.22%*
> Amean     2        37.46 (   0.00%)       37.16 *   0.82%*       39.81 *  -6.26%*
> Amean     4        51.31 (   0.00%)       51.34 (  -0.04%)       57.14 * -11.35%*
> Amean     8        81.77 (   0.00%)       81.65 (   0.15%)       88.68 *  -8.44%*
> Amean     64      406.94 (   0.00%)      408.08 *  -0.28%*      433.64 *  -6.56%*
> Stddev    1         1.43 (   0.00%)        1.44 (  -0.79%)        1.54 (  -7.45%)
>
> Not as dramatic but indicates that we likely do not want to cut off
> wakeup_preempt too early a problem.
>
> The test was not profiling times to switch tasks as the overhead
> distorts resules.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-04 16:37                                   ` Vincent Guittot
@ 2021-10-05  7:41                                     ` Mel Gorman
  0 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2021-10-05  7:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Mon, Oct 04, 2021 at 06:37:02PM +0200, Vincent Guittot wrote:
> On Mon, 4 Oct 2021 at 10:05, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Mon, Sep 27, 2021 at 04:17:25PM +0200, Mike Galbraith wrote:
> > > On Mon, 2021-09-27 at 12:17 +0100, Mel Gorman wrote:
> > > > On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote:
> > > > > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@gmx.de> wrote:
> > > > > >
> > > > > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote:
> > > > > > >
> > > > > > > a 100us value should even be enough to fix Mel's problem without
> > > > > > > impacting common wakeup preemption cases.
> > > > > >
> > > > > > It'd be nice if it turn out to be something that simple, but color me
> > > > > > skeptical.  I've tried various preemption throttling schemes, and while
> > > > >
> > > > > Let's see what the results will show. I tend to agree that this will
> > > > > not be enough to cover all use cases and I don't see any other way to
> > > > > cover all cases than getting some inputs from the threads about their
> > > > > latency fairness which bring us back to some kind of latency niceness
> > > > > value
> > > > >
> > > >
> > > > Unfortunately, I didn't get a complete set of results but enough to work
> > > > with. The missing tests have been requeued. The figures below are based
> > > > on a single-socket Skylake machine with 8 CPUs as it had the most set of
> > > > results and is the basic case.
> > >
> > > There's something missing, namely how does whatever load you measure
> > > perform when facing dissimilar competition. Instead of only scaling
> > > loads running solo from underutilized to heavily over-committed, give
> > > them competition. eg something switch heavy, say tbench, TCP_RR et al
> > > (latency bound load) pairs=CPUS vs something hefty like make -j CPUS or
> > > such.
> > >
> >
> > Ok, that's an interesting test. I've been out intermittently and will be
> > for the next few weeks but I managed to automate something that can test
> > this. The test runs a kernel compile with -jNR_CPUS and TCP_RR running
> > NR_CPUS pairs of clients/servers in the background with the default
> > openSUSE Leap kernel config (CONFIG_PREEMPT_NONE) with the two patches
> > and no tricks done with task priorities.  5 kernel compilations are run
> > and TCP_RR is shutdown when the compilation finishes.
> >
> > This can be reproduced with the mmtests config
> > config-multi-kernbench__netperf-tcp-rr-multipair using xfs as the
> > filesystem for the kernel compilation.
> >
> > sched-scalewakegran-v2r5: my patch
> > sched-moveforward-v1r1: Vincent's patch
> 
> If I'm not wrong, you refer to the 1st version which scales with the
> number of cpu by sched-moveforward-v1r1. We don't want to scale with
> the number of cpu because this can create some quite large non
> preemptable duration. We want to ensure a fix small runtime like the
> last version with 100us
> 

It was a modified version, based on feedback, that limited the extent to
which preemption would be disabled. It was still based on h_nr_running as
the basis for comparison:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..964f76a95c04 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -84,6 +84,14 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
 
+/*
+ * This value is kept at sysctl_sched_latency / sysctl_sched_wakeup_granularity
+ *
+ * This influences the decision on whether a waking task can preempt a running
+ * task.
+ */
+static unsigned int sched_nr_disable_gran = 6;
+
 int sched_thermal_decay_shift;
 static int __init setup_sched_thermal_decay_shift(char *str)
 {
@@ -627,6 +635,9 @@ int sched_update_scaling(void)
 	sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
 					sysctl_sched_min_granularity);
 
+	sched_nr_disable_gran = DIV_ROUND_UP(sysctl_sched_latency,
+					sysctl_sched_wakeup_granularity);
+
 #define WRT_SYSCTL(name) \
 	(normalized_sysctl_##name = sysctl_##name / (factor))
 	WRT_SYSCTL(sched_min_granularity);
@@ -4511,7 +4522,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
+wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
+						struct sched_entity *se);
 
 /*
  * Pick the next process, keeping these things in mind, in this order:
@@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 				second = curr;
 		}
 
-		if (second && wakeup_preempt_entity(second, left) < 1)
+		if (second && wakeup_preempt_entity(NULL, second, left) < 1)
 			se = second;
 	}
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
 		/*
 		 * Someone really wants this to run. If it's not unfair, run it.
 		 */
 		se = cfs_rq->next;
-	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
+	} else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
 		/*
 		 * Prefer last buddy, try to return the CPU to a preempted task.
 		 */
@@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
-static unsigned long wakeup_gran(struct sched_entity *se)
+static unsigned long
+select_wakeup_gran(struct cfs_rq *cfs_rq)
+{
+	unsigned int nr_running, threshold;
+
+	if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN))
+		return sysctl_sched_wakeup_granularity;
+
+	/* !GENTLE_FAIR_SLEEPERS has one overload threshold. */
+	if (!sched_feat(GENTLE_FAIR_SLEEPERS)) {
+		if (cfs_rq->h_nr_running <= sched_nr_disable_gran)
+			return sysctl_sched_wakeup_granularity;
+
+		return sysctl_sched_latency;
+	}
+
+	/* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */
+	nr_running = cfs_rq->h_nr_running;
+	threshold = sched_nr_disable_gran >> 1;
+
+	/* No overload. */
+	if (nr_running <= threshold)
+		return sysctl_sched_wakeup_granularity;
+
+	/* Light overload. */
+	if (nr_running <= sched_nr_disable_gran)
+		return sysctl_sched_latency >> 1;
+
+	/* Heavy overload. */
+	return sysctl_sched_latency;
+}
+
+static unsigned long
+wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	unsigned long gran = sysctl_sched_wakeup_granularity;
+	unsigned long gran = select_wakeup_gran(cfs_rq);
 
 	/*
 	 * Since its curr running now, convert the gran from real-time
@@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
  *
  */
 static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
+wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
+						struct sched_entity *se)
 {
 	s64 gran, vdiff = curr->vruntime - se->vruntime;
 
 	if (vdiff <= 0)
 		return -1;
 
-	gran = wakeup_gran(se);
+	gran = wakeup_gran(cfs_rq, se);
 	if (vdiff > gran)
 		return 1;
 
@@ -7190,8 +7236,9 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	if (cse_is_idle != pse_is_idle)
 		return;
 
-	update_curr(cfs_rq_of(se));
-	if (wakeup_preempt_entity(se, pse) == 1) {
+	cfs_rq = cfs_rq_of(se);
+	update_curr(cfs_rq);
+	if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
 		 * triggering this preemption.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7f8dace0964c..d041d7023029 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -95,3 +95,9 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+/*
+ * Scale sched_wakeup_granularity dynamically based on the number of running
+ * tasks up to a cap of sysctl_sched_latency.
+ */
+SCHED_FEAT(SCALE_WAKEUP_GRAN, true)
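
As a rough illustration of how those thresholds behave (not part of the
patch; the 24ms/4ms figures are assumed values for a machine with 8 or
more CPUs using the default LOG scaling, and only the GENTLE_FAIR_SLEEPERS
path is mirrored), a standalone userspace sketch:

#include <stdio.h>

static const unsigned long sched_latency = 24000000UL;	/* ns, assumed */
static const unsigned long wakeup_gran   =  4000000UL;	/* ns, assumed */

int main(void)
{
	/* DIV_ROUND_UP(sched_latency, wakeup_gran) == 6 */
	unsigned int nr_disable_gran =
		(sched_latency + wakeup_gran - 1) / wakeup_gran;
	unsigned int threshold = nr_disable_gran >> 1;	/* 3 */
	unsigned int nr;

	for (nr = 1; nr <= 8; nr++) {
		unsigned long gran;

		if (nr <= threshold)
			gran = wakeup_gran;		/* no overload */
		else if (nr <= nr_disable_gran)
			gran = sched_latency >> 1;	/* light overload */
		else
			gran = sched_latency;		/* heavy overload */

		printf("h_nr_running=%u gran=%lums\n", nr, gran / 1000000UL);
	}
	return 0;
}

With those assumed values the effective wakeup granularity steps from 4ms
(up to 3 runnable tasks) to 12ms (4-6 tasks) to 24ms (7 or more), i.e. it
scales towards sysctl_sched_latency as the runqueue becomes overloaded.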

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-04  9:06               ` Mike Galbraith
@ 2021-10-05  7:47                 ` Mel Gorman
  2021-10-05  8:42                   ` Mike Galbraith
  0 siblings, 1 reply; 59+ messages in thread
From: Mel Gorman @ 2021-10-05  7:47 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Barry Song, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Mon, Oct 04, 2021 at 11:06:30AM +0200, Mike Galbraith wrote:
> On Mon, 2021-10-04 at 06:34 +0200, Mike Galbraith wrote:
> > On Sun, 2021-10-03 at 20:34 +1300, Barry Song wrote:
> > >
> > > I am wondering if this should be the responsibility of wake_wide()?
> >
> > Those event threads we stacked so high (which are kde minions btw),
> > don't generally accrue _any_ wakee_flips, so when X wakes a slew of the
> > things, wake_wide()'s heuristic rejects the lot.
> >
> > So yeah, the blame game for this issue is a target rich environment.
> > Shoot either of 'em (or both), and you'll hit the bad guy.
> 
> The mallet below convinced wake_wide() that X waking event threads is
> something it most definitely should care about.  It's not perfect, can
> get caught with its pants down, because behavior changes a LOT, but I
> at least have to work at it a bit to stack tasks to the ceiling.
> 
> With make -j8 running along with firefox with two tabs, one containing
> youtube's suggestions of stuff you probably don't want to watch, the
> other a running clip, if I have the idle tab in focus, and don't drive
> mouse around, flips decay enough for wake_wide() to lose interest, but
> just wiggle the mouse, and it starts waking wide. Focus on the running
> clip, and it continuously wakes wide.  
> 
> Hacky, but way better behavior.. at this particular testcase.. in this
> particular box.. at least once :)
> 

Only three machines managed to complete tests overnight. For most
workloads tested, it was neutral or a slight improvement. For
multi-kernbench__netperf-tcp-rr-multipair (kernel compile +
netperf-tcp-rr combined), there was little to no change.

For the heavily overloaded cases (hackbench again), it was variable. The
worst case was still a gain of 1-3%. The best improvement (single-socket
Skylake with 8 logical CPUs, SMT enabled) was 1%-18% depending on the
group count.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-05  7:47                 ` Mel Gorman
@ 2021-10-05  8:42                   ` Mike Galbraith
  2021-10-05  9:31                     ` Mel Gorman
  0 siblings, 1 reply; 59+ messages in thread
From: Mike Galbraith @ 2021-10-05  8:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Barry Song, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Tue, 2021-10-05 at 08:47 +0100, Mel Gorman wrote:
> On Mon, Oct 04, 2021 at 11:06:30AM +0200, Mike Galbraith wrote:
> >
> > The mallet below convinced wake_wide() that X waking event threads is
> > something it most definitely should care about.  It's not perfect, can
> > get caught with its pants down, because behavior changes a LOT, but I
> > at least have to work at it a bit to stack tasks to the ceiling.
> >
> > With make -j8 running along with firefox with two tabs, one containing
> > youtube's suggestions of stuff you probably don't want to watch, the
> > other a running clip, if I have the idle tab in focus, and don't drive
> > mouse around, flips decay enough for wake_wide() to lose interest, but
> > just wiggle the mouse, and it starts waking wide. Focus on the running
> > clip, and it continuously wakes wide.  
> >
> > Hacky, but way better behavior.. at this particular testcase.. in this
> > particular box.. at least once :)
> >
>
> Only three machines managed to complete tests overnight. For most
> workloads test, it was neutral or slight improvements. For
> multi-kernbench__netperf-tcp-rr-multipair (kernel compile +
> netperf-tcp-rr combined), there was little to no change.
>
> For the heavy overloaded cases (hackbench again), it was variable. Worst
> improvement was a gain of 1-3%. Best improvement (single socket skylake
> with 8 logical CPUs SMT enabled) was 1%-18% depending on the group
> count.

I wrote up a changelog to remind future me why I bent it up, but I'm
not going to submit it. I'll leave the twiddling to folks who can be
more responsive to possible (spelled probable;) regression reports than
I can be.

	-Mike


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-21 12:32       ` Mike Galbraith
  2021-09-21 14:03         ` Mel Gorman
@ 2021-10-05  9:24         ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2021-10-05  9:24 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Mel Gorman, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Tue, Sep 21, 2021 at 02:32:54PM +0200, Mike Galbraith wrote:
> On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:

> > FAIR_SLEEPERS primarily affects tasks that just became runnable and the
> > new task is trying to fit in without causing too much disruption based
> > on sysctl_sched_latency.
> 
> No, fair sleepers is all about sleeper wakeup preemption, I think
> you're thinking of fork initial placement.

Butting in in the middle of the thread (and I know there's still lots to
read)...

So the FAIR_SLEEPERS thing is about giving tasks that have slept a while
some extra credit to run sooner.

The classical example has always been a task that runs 50% combined with
a task that runs 100%: what's fair, a 1:2 or a 1:1 ratio? Strict fair
(runnable) scheduling will get you the 1:2, while intuitively, with two
tasks at 100% combined CPU utilization, 1:1 would be 'fair'.

FAIR_SLEEPERS gets you towards that 1:1, *provided* the period of that
50% is near sched_latency/2.
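
A quick worked version of that ratio (my own numbers, purely illustrative:
the 50% task runs a burst of length r then sleeps for r, the other task is
always runnable):

  strict runnable fairness: while both are runnable they split the CPU, so
  the burst takes 2r of wall time, followed by r of sleep; over the 3r
  cycle the CPU split is r : 2r = 1:2.

  with sleeper credit: the woken task preempts and finishes its burst in
  r, then sleeps r; over the 2r cycle the split is r : r = 1:1.

That 1:1 outcome only holds if r stays small relative to sched_latency,
hence the proviso above.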

Another important factor for wakeup preemption has always been desktop
usage; can you still get responsive terminals while building a kernel,
how does firefox scroll during a kernel build etc..

(fwiw, firefox should start scrolling responsively and then bog down if
you keep on scrolling because it becomes a hog and has exhausted the
initial boost)

Also, I think the ChromeOS people have interactivity measures these
days.

All our traditional benchmarks miss out here; they're mostly throughput
oriented, and it is really easy to totally wreck interactivity while
getting great throughput :/

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-05  8:42                   ` Mike Galbraith
@ 2021-10-05  9:31                     ` Mel Gorman
  2021-10-06  6:46                       ` Mike Galbraith
  2021-10-08  5:06                       ` Mike Galbraith
  0 siblings, 2 replies; 59+ messages in thread
From: Mel Gorman @ 2021-10-05  9:31 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Barry Song, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Tue, Oct 05, 2021 at 10:42:07AM +0200, Mike Galbraith wrote:
> On Tue, 2021-10-05 at 08:47 +0100, Mel Gorman wrote:
> > On Mon, Oct 04, 2021 at 11:06:30AM +0200, Mike Galbraith wrote:
> > >
> > > The mallet below convinced wake_wide() that X waking event threads is
> > > something it most definitely should care about.  It's not perfect, can
> > > get caught with its pants down, because behavior changes a LOT, but I
> > > at least have to work at it a bit to stack tasks to the ceiling.
> > >
> > > With make -j8 running along with firefox with two tabs, one containing
> > > youtube's suggestions of stuff you probably don't want to watch, the
> > > other a running clip, if I have the idle tab in focus, and don't drive
> > > mouse around, flips decay enough for wake_wide() to lose interest, but
> > > just wiggle the mouse, and it starts waking wide. Focus on the running
> > > clip, and it continuously wakes wide.  
> > >
> > > Hacky, but way better behavior.. at this particular testcase.. in this
> > > particular box.. at least once :)
> > >
> >
> > Only three machines managed to complete tests overnight. For most
> > workloads test, it was neutral or slight improvements. For
> > multi-kernbench__netperf-tcp-rr-multipair (kernel compile +
> > netperf-tcp-rr combined), there was little to no change.
> >
> > For the heavy overloaded cases (hackbench again), it was variable. Worst
> > improvement was a gain of 1-3%. Best improvement (single socket skylake
> > with 8 logical CPUs SMT enabled) was 1%-18% depending on the group
> > count.
> 
> I wrote up a changelog to remind future me why I bent it up, but I'm
> not going to submit it. I'll leave the twiddling to folks who can be
> more responsive to possible (spelled probable;) regression reports than
> I can be.
> 

Thanks Mike.

If you write a formal patch with a Signed-off-by, I'll use it as a baseline
for the wakegran work or submit it directly. That way, I'll get any LKP
reports, user regression reports on lkml, any regression reports via
openSUSE etc and deal with them.

Based on the results I have so far, I would not be twiddling it further
but that might change when a full range of machines have completed all
of their tests. Ideally, I would do some tracing to confirm that maximum
runqueue depth is really reduced by the patch.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 13:20         ` Mel Gorman
  2021-09-22 14:04           ` Mike Galbraith
  2021-09-22 14:15           ` Vincent Guittot
@ 2021-10-05  9:32           ` Peter Zijlstra
  2 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2021-10-05  9:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mike Galbraith, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, Sep 22, 2021 at 02:20:02PM +0100, Mel Gorman wrote:
> 
> Note that the TuneD throughput-performance profile allows up to 15ms
> for sysctl_sched_wakeup_granularity (ignoring scaling) but there is no
> explanation why such a long period was necessary or why sched_latency_ns
> is also not adjusted. The intent may have been to disable wakeup
> preemption or it might be an oversight. An internet search for instances
> where the sysctl_sched_wakeup_granularity parameter is tuned to high
> values offers either no explanation or a broken one.

FWIW, if one wants to disable wakeup preemption, we've got SCHED_BATCH
for that.
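
For completeness, a minimal sketch of opting a process into SCHED_BATCH
from userspace (illustrative only, with minimal error handling; the exec'd
command is just a placeholder):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* sched_priority must be 0 for SCHED_BATCH */
	struct sched_param sp = { .sched_priority = 0 };

	if (sched_setscheduler(0, SCHED_BATCH, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}

	/* run the batch workload under the new policy */
	if (argc > 1)
		execvp(argv[1], &argv[1]);
	return 0;
}

The same thing is available without writing code via util-linux, e.g.
"chrt -b 0 <command>".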

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 15:04             ` Mel Gorman
  2021-09-22 16:00               ` Vincent Guittot
@ 2021-10-05  9:41               ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2021-10-05  9:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Mike Galbraith, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, Sep 22, 2021 at 04:04:57PM +0100, Mel Gorman wrote:
> On Wed, Sep 22, 2021 at 04:15:27PM +0200, Vincent Guittot wrote:

> The reason I used SCHED_TUNABLESCALING_NONE for the changelog is that
> the exact values depend on the number of CPUs so values are not even
> the same across the range of machines I'm using. sysctl_sched_latency,
> sysctl_sched_min_granularity and sysctl_sched_wakeup_granularity are all
> scaled but the ratios remain constant.

It might make sense to reconsider the whole scaling thing, FWIW. It used
to be that desktop systems had a 'small' number of CPUs (<=8) and servers
had more.

But today it's not uncommon to have 32-64 CPUs in a desktop system. And
even LOG scaling gets us into stupid numbers for the latencies.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 17:38                 ` Mel Gorman
  2021-09-22 18:22                   ` Vincent Guittot
@ 2021-10-05 10:23                   ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2021-10-05 10:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Mike Galbraith, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, Sep 22, 2021 at 06:38:53PM +0100, Mel Gorman wrote:

> 
> I'm not seeing an alternative suggestion that could be turned into
> an implementation. The current value for sched_wakeup_granularity
> was set 12 years ago and was exposed for tuning, which is no longer

(it has always been SCHED_DEBUG)

> the case. The intent was to allow some dynamic adjustment between
> sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce
> over-scheduling in the worst case without disabling preemption entirely
> (which the first version did).
> 
> Should we just ignore this problem and hope it goes away or just let
> people keep poking silly values into debugfs via tuned?

People are going to do stupid things no matter what (and tuned is very much
included there -- poking random numbers into debug interfaces without
doing analysis on what the workload does and needs is just wasting
everybody's time. RHT really should know better than that).

I'm also thinking that adding more heuristics isn't going to improve the
situation lots.

For the past number of years people have been talking about extending the
task model for SCHED_NORMAL; latency_nice was one such proposal.

If we really want to fix this properly, and not make a bigger mess of
things, we should look at all these various workloads and identify
*what* specifically they want and *why*.

Once we have this enumerated, we can look at what exactly we can provide
and how to structure the interface.

The extension must be hint only; we should be free to completely ignore
it.

The things I can think of off the top of my head are:

- tail latency; prepared to waste time to increase the odds of running
  sooner. Possible effect: have this task always do a full
  select_idle_sibling() scan.

  (there's also the anti case, which I'm not sure how to enumerate,
  basically they don't want select_idle_sibling(), just place the task
  wherever)

- non-interactive; doesn't much care about wakeup latency; can suffer
  packing?

- background; (implies non-interactive?) doesn't much care about
  completion time either, just cares about efficiency

- interactive; cares much about wakeup-latency; cares less about
  throughput.

- (energy) efficient; cares more about energy usage than performance


Now, we already have SCHED_BATCH, which completely kills wakeup preemption,
and SCHED_IDLE, which lets everything preempt it. I know SCHED_IDLE is in
active use (because we see patches for that); I don't think people use
SCHED_BATCH much, should they?


Now, I think a number of constraints are going to be mutually exclusive,
and if we get such tasks combined there's just nothing much we can do
about it (and since it's all hints, we good).

We should also not make the load-balance condition too complicated, we
don't want it to try and (badly) solve a large set of constraints.
Again, if possible respect the hints, otherwise tough luck.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-22 18:22                   ` Vincent Guittot
  2021-09-22 18:57                     ` Mel Gorman
  2021-09-23  1:47                     ` Mike Galbraith
@ 2021-10-05 10:28                     ` Peter Zijlstra
  2 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2021-10-05 10:28 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Mike Galbraith, Ingo Molnar, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, Sep 22, 2021 at 08:22:43PM +0200, Vincent Guittot wrote:

> @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq,
> struct task_struct *p, int wake_
>                 return;
> 
>         update_curr(cfs_rq_of(se));
> +       delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> +       /*
> +        * Ensure that current got a chance to move forward
> +        */
> +       if (delta_exec < sysctl_sched_min_granularity)
> +               return;
> +

I think we tried that at some point; IIRC the problem with this is that
if the interactive task fails to preempt, that preemption is lost. IOW
interactivity suffers.

Basically if you don't want wake-up preemptions, use SCHED_BATCH, then
those tasks will not preempt one another, but the SCHED_NORMAL tasks
will preempt them.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-09-23 12:24                         ` Phil Auld
@ 2021-10-05 10:36                           ` Peter Zijlstra
  2021-10-05 14:12                             ` Phil Auld
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2021-10-05 10:36 UTC (permalink / raw)
  To: Phil Auld
  Cc: Vincent Guittot, Mike Galbraith, Mel Gorman, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Thu, Sep 23, 2021 at 08:24:03AM -0400, Phil Auld wrote:

> It's capped at 8 cpus, which is pretty easy to reach these days, so the
> values don't get too large.  That scaling is almost a no-op these days.

  https://lkml.kernel.org/r/YVwdrh5pg0zSv2/b@hirez.programming.kicks-ass.net

Ooh, hey, we already fixed that :-)

So the reasoning there is that if the values get too big, interactivity
gets *really* bad, but if you go from say 1 to 4 CPUs, interactivity can
improve due to being able to run on other CPUs.

At 8 CPUs we end up at 6ms*4=24ms, which is already pretty terrible.
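
For anyone following along, a small sketch of that arithmetic (illustrative
only; it assumes the default base sched_latency of 6ms, the
SCHED_TUNABLESCALING_LOG policy and the online-CPU count clamped to 8
mentioned above):

#include <stdio.h>

static unsigned int ilog2_u(unsigned int x)
{
	unsigned int r = 0;

	while (x >>= 1)
		r++;
	return r;
}

int main(void)
{
	const unsigned int base_latency_ms = 6;	/* assumed default */
	unsigned int cpus;

	for (cpus = 1; cpus <= 64; cpus <<= 1) {
		unsigned int clamped = cpus < 8 ? cpus : 8;
		unsigned int factor = 1 + ilog2_u(clamped);

		printf("%2u CPUs: factor %u -> sched_latency %ums\n",
		       cpus, factor, base_latency_ms * factor);
	}
	return 0;
}

which saturates at a factor of 4 (24ms) from 8 CPUs upwards, so beyond a
small desktop the scaling stops changing anything.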

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-05 10:36                           ` Peter Zijlstra
@ 2021-10-05 14:12                             ` Phil Auld
  2021-10-05 14:32                               ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Phil Auld @ 2021-10-05 14:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Mike Galbraith, Mel Gorman, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Tue, Oct 05, 2021 at 12:36:22PM +0200 Peter Zijlstra wrote:
> On Thu, Sep 23, 2021 at 08:24:03AM -0400, Phil Auld wrote:
> 
> > It's capped at 8 cpus, which is pretty easy to reach these days, so the
> > values don't get too large.  That scaling is almost a no-op these days.
> 
>   https://lkml.kernel.org/r/YVwdrh5pg0zSv2/b@hirez.programming.kicks-ass.net
> 
> Ooh, hey, we already fixed that :-)
>

Thanks Peter.

I'm always a little behind upstream (nature of the job :)

That link leads to a "Message-ID not found" error. But from what I can see,
the code that takes the min of online cpus and 8 is still present.


> So the reasoning there is that if the values get too big, interactiviy
> get *really* bad, but if you go from say 1 to 4 CPUs, interactivity can
> improve due to being able to run on other CPUs.
> 
> At 8 CPUs we end up at 6ms*4=24ms, which is already pretty terrible.
> 

And actually you mention the same thing later on.  Most systems, even
desktops, have 8+ cpus these days so the scaling is mostly not doing
anything except multiplying by 4, right? So no-op was not the right
way to describe it maybe. But it's not getting bigger with larger
numbers of cpus beyond a pretty commonly reached limit.


Cheers,
Phil

-- 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-05 14:12                             ` Phil Auld
@ 2021-10-05 14:32                               ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2021-10-05 14:32 UTC (permalink / raw)
  To: Phil Auld
  Cc: Vincent Guittot, Mike Galbraith, Mel Gorman, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Tue, Oct 05, 2021 at 10:12:12AM -0400, Phil Auld wrote:
> On Tue, Oct 05, 2021 at 12:36:22PM +0200 Peter Zijlstra wrote:
> > On Thu, Sep 23, 2021 at 08:24:03AM -0400, Phil Auld wrote:
> > 
> > > It's capped at 8 cpus, which is pretty easy to reach these days, so the
> > > values don't get too large.  That scaling is almost a no-op these days.
> > 
> >   https://lkml.kernel.org/r/YVwdrh5pg0zSv2/b@hirez.programming.kicks-ass.net
> > 
> > Ooh, hey, we already fixed that :-)
> >
> 
> Thanks Peter.
> 
> I'm always a little behind upstream (nature of the job :)
> 
> That link leads to a message Id not found.

https://lore.kernel.org/all/YVwblBZ9JBn9vvVr@hirez.programming.kicks-ass.net/T/#u

Seems to work, I must've messed up the copy/paste or something.

> But from what I can see the code that takes the min of online cpus and
> 8 is still present. 

Yes, and it should be. I was the confused one. I forgot we added it and
suggested we should add it again :-)

> > So the reasoning there is that if the values get too big, interactiviy
> > get *really* bad, but if you go from say 1 to 4 CPUs, interactivity can
> > improve due to being able to run on other CPUs.
> > 
> > At 8 CPUs we end up at 6ms*4=24ms, which is already pretty terrible.
> > 
> 
> And actually you mention the same thing later on.  Most systems, even
> desktops, have 8+ cpus these days so the scaling is mostly not doing
> anything except multiplying by 4, right? So no-op was not the right
> way to describe it maybe. But it's not getting bigger with larger
> numbers of cpus beyond a pretty commonly reached limit.

Yeah, the whole scaling thing is of dubious value these days; the 1-8
range is really embedded territory now. I mean, only low-end phones are
maybe even still in that range -- oh and my laptop.. :/



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-05  9:31                     ` Mel Gorman
@ 2021-10-06  6:46                       ` Mike Galbraith
  2021-10-08  5:06                       ` Mike Galbraith
  1 sibling, 0 replies; 59+ messages in thread
From: Mike Galbraith @ 2021-10-06  6:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Barry Song, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Tue, 2021-10-05 at 10:31 +0100, Mel Gorman wrote:
>
> If you write a formal patch with a Signed-off-by, I'll use it as a baseline
> for the wakegran work or submit it directly. That way, I'll get any LKP
> reports, user regression reports on lkml, any regression reports via
> openSUSE etc and deal with them.
>
> Based on the results I have so far, I would not be twiddling it further
> but that might change when a full range of machines have completed all
> of their tests. Ideally, I would do some tracing to confirm that maximum
> runqueue depth is really reduced by the path.

Ok, I amended it, adding probably way too many words for a dinky bend-
adjust.  Feel free to do whatever you like with every bit below.

sched: Make wake_wide() handle wakees with no wakee_flips

While looking into a wakeup-time task stacking problem, I noticed
that wake_wide() wasn't recognizing X as a waker-of-many despite
it regularly burst-waking 24 QXcbEventQueue threads.  The reason
for this lies in the heuristic requiring both the multi-waker and
its minions to be earning wakee_flips, i.e. both wake more than one
task. X earns plenty, but the event threads rarely earn any,
allowing the lot to reach wake_affine_weight(), where its use of
slow-to-update load averages MAY direct the entire burst toward
the waker's CPU, where they WILL stack if SIS can't deflect them.

To combat this, have the multi-waker (X in this case) trickle
enough flips to its wakees to keep those who don't earn flips
barely eligible.  To keep aging from dropping too many of the thread
pool below the limit during periods of inactivity, continue to
wake wide IFF the waker has a markedly elevated flip frequency.

This improved things for the X+QXcbEventQueue burst, but is a
bit hacky. We need a better M:N load activity differentiator.

Note: given wake_wide()'s mission is to recognize when thread
pool activity exceeds sd_llc_size, it logically SHOULD play no
role whatsoever in boxen with a single LLC, there being no other
LLCs to expand/shrink load distribution to/from.  While this
patchlet hopefully improves M:N detection a bit in general, it
fixes nothing.

Signed-off-by: Mike Galbraith <efault@gmx.de>
---
 kernel/sched/fair.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,15 @@ static void record_wakee(struct task_str
 	}

 	if (current->last_wakee != p) {
+		int min = __this_cpu_read(sd_llc_size) << 1;
+		/*
+		 * Couple waker flips to the wakee for the case where it
+		 * doesn't accrue any of its own, taking care to not push
+		 * it high enough to break the wake_wide() waker:wakees
+		 * heuristic for those that do accrue their own flips.
+		 */
+		if (current->wakee_flips > p->wakee_flips * min)
+			p->wakee_flips++;
 		current->last_wakee = p;
 		current->wakee_flips++;
 	}
@@ -5899,7 +5908,7 @@ static int wake_wide(struct task_struct

 	if (master < slave)
 		swap(master, slave);
-	if (slave < factor || master < slave * factor)
+	if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
 		return 0;
 	return 1;
 }
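
To make the effect of the amended condition concrete, a standalone sketch
(the numbers are made up for illustration: an 8-CPU LLC, an X-like waker
with ~200 recent flips, an event thread with 2):

#include <stdbool.h>
#include <stdio.h>

/* mirrors the amended wake_wide() test from the patch above */
static bool wakes_wide(unsigned int master, unsigned int slave,
		       unsigned int factor)
{
	if (master < slave) {
		unsigned int tmp = master;

		master = slave;
		slave = tmp;
	}
	if ((slave < factor && master < (factor >> 1) * factor) ||
	    master < slave * factor)
		return false;
	return true;
}

int main(void)
{
	/* the old "slave < factor" test alone rejects this pairing outright */
	printf("X vs event thread: %s\n",
	       wakes_wide(200, 2, 8) ? "wakes wide" : "stays affine");
	return 0;
}

With the old heuristic the 2-flip event thread disqualified the pair
immediately; with the extra "master < (factor>>1)*factor" requirement, a
waker with a markedly elevated flip count still gets to wake wide.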


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: wakeup_affine_weight() is b0rked - was Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
  2021-10-05  9:31                     ` Mel Gorman
  2021-10-06  6:46                       ` Mike Galbraith
@ 2021-10-08  5:06                       ` Mike Galbraith
  1 sibling, 0 replies; 59+ messages in thread
From: Mike Galbraith @ 2021-10-08  5:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Barry Song, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju,
	LKML

On Tue, 2021-10-05 at 10:31 +0100, Mel Gorman wrote:
> Ideally, I would do some tracing to confirm that maximum runqueue depth
> is really reduced by the path.

I would expect your worst case to remain unchanged, mine does.  The
patch mitigates, it does not eradicate.

I dug up a late 2016 mitigation patch, wedged it into 2021 and added a
BFH that does eradicate my stacking depth woes.  I'll probably keep it,
at least for a while. Not because I feel anything in my desktop, rather
because meeting this again (and it being deeper than I recall) reminded
me of measuring impact on NFS etc, making it a tad difficult to ignore.
Oh well, I'll forget about it eventually.. BTDT.

(standard beloved Granny disclaimer)

sched: Add SIS stacking mitigation feature

Select the least loaded LLC CPU for cache cold tasks and kthreads.

Addendum: renamed feature, and give it a big brother.

Not-Signed-off-by: Mike Galbraith <efault@gmx.de>
---
 kernel/sched/fair.c     |   54 ++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/features.h |    5 ++++
 2 files changed, 55 insertions(+), 4 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6261,6 +6261,26 @@ static inline int select_idle_smt(struct

 #endif /* CONFIG_SCHED_SMT */

+static bool task_is_kthread_or_cold(struct task_struct *p)
+{
+	s64 cold = sysctl_sched_migration_cost;
+
+	if (p->flags & PF_KTHREAD)
+		return true;
+	if (cold <= 0)
+		return false;
+	return task_rq(p)->clock_task - p->se.exec_start > cold;
+}
+
+static bool cpu_load_inconsistent(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	if (rq->cfs.h_nr_running < 4)
+		return false;
+	return cpu_load(rq) << 2 < scale_load_down(rq->cfs.load.weight);
+}
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6269,7 +6289,7 @@ static inline int select_idle_smt(struct
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	int i, cpu, idle_cpu = -1, nr = INT_MAX, ld = -1;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
@@ -6309,6 +6329,21 @@ static int select_idle_cpu(struct task_s
 		time = cpu_clock(this);
 	}

+	/*
+	 * Select the least loaded CPU for kthreads and cache cold tasks
+	 * if no idle CPU is found.
+	 */
+	if ((sched_feat(SIS_SPOT) && task_is_kthread_or_cold(p)) ||
+	    (sched_feat(SIS_REXY) && cpu_load_inconsistent(target))) {
+		idle_cpu = task_cpu(p);
+		if (idle_cpu != target && !cpus_share_cache(idle_cpu, target))
+			idle_cpu = target;
+		if (unlikely(!sched_cpu_cookie_match(cpu_rq(idle_cpu), p)))
+			idle_cpu = -1;
+		else
+			ld = scale_load_down(cpu_rq(idle_cpu)->cfs.load.weight);
+	}
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -6317,10 +6352,21 @@ static int select_idle_cpu(struct task_s

 		} else {
 			if (!--nr)
-				return -1;
-			idle_cpu = __select_idle_cpu(cpu, p);
-			if ((unsigned int)idle_cpu < nr_cpumask_bits)
+				return idle_cpu;
+			i = __select_idle_cpu(cpu, p);
+			if ((unsigned int)i < nr_cpumask_bits) {
+				idle_cpu = i;
 				break;
+			}
+		}
+		if (ld > 0 && sched_cpu_cookie_match(cpu_rq(cpu), p)) {
+			i = scale_load_down(cpu_rq(cpu)->cfs.load.weight);
+			if (i < ld) {
+				idle_cpu = cpu;
+				if (i == 0)
+					break;
+				ld = i;
+			}
 		}
 	}

--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -95,3 +95,8 @@ SCHED_FEAT(LATENCY_WARN, false)

 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+/* Mitigate PELT induced stacking.  */
+SCHED_FEAT(SIS_SPOT, true)
+/* Spot's 12 ton big brother. */
+SCHED_FEAT(SIS_REXY, true)
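
(For anyone wanting to experiment with a feature patch like this: with
CONFIG_SCHED_DEBUG the SCHED_FEAT bits can be flipped at runtime through
the sched_features debugfs file, e.g. by writing NO_SIS_SPOT or SIS_REXY
to it; on a 5.15 kernel that file should live under
/sys/kernel/debug/sched/features, though the exact path depends on the
kernel version.)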


^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2021-10-08  5:10 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-20 14:26 [PATCH 0/2] Scale wakeup granularity relative to nr_running Mel Gorman
2021-09-20 14:26 ` [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup Mel Gorman
2021-09-21  7:21   ` Vincent Guittot
2021-09-21  7:53     ` Mel Gorman
2021-09-21  8:12       ` Vincent Guittot
2021-09-21  8:21       ` Peter Zijlstra
2021-09-21 10:03         ` Mel Gorman
2021-09-20 14:26 ` [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running Mel Gorman
2021-09-21  3:52   ` Mike Galbraith
2021-09-21  5:50     ` Mike Galbraith
2021-09-21  7:04     ` Mike Galbraith
2021-09-21 10:36     ` Mel Gorman
2021-09-21 12:32       ` Mike Galbraith
2021-09-21 14:03         ` Mel Gorman
2021-10-05  9:24         ` Peter Zijlstra
2021-09-22  5:22       ` Mike Galbraith
2021-09-22 13:20         ` Mel Gorman
2021-09-22 14:04           ` Mike Galbraith
2021-09-22 14:15           ` Vincent Guittot
2021-09-22 15:04             ` Mel Gorman
2021-09-22 16:00               ` Vincent Guittot
2021-09-22 17:38                 ` Mel Gorman
2021-09-22 18:22                   ` Vincent Guittot
2021-09-22 18:57                     ` Mel Gorman
2021-09-23  1:47                     ` Mike Galbraith
2021-09-23  8:40                       ` Vincent Guittot
2021-09-23  9:21                         ` Mike Galbraith
2021-09-23 12:41                           ` Vincent Guittot
2021-09-23 13:14                             ` Mike Galbraith
2021-09-27 11:17                             ` Mel Gorman
2021-09-27 14:17                               ` Mike Galbraith
2021-10-04  8:05                                 ` Mel Gorman
2021-10-04 16:37                                   ` Vincent Guittot
2021-10-05  7:41                                     ` Mel Gorman
2021-09-27 14:19                               ` Vincent Guittot
2021-09-27 15:02                                 ` Mel Gorman
2021-09-23 12:24                         ` Phil Auld
2021-10-05 10:36                           ` Peter Zijlstra
2021-10-05 14:12                             ` Phil Auld
2021-10-05 14:32                               ` Peter Zijlstra
2021-10-05 10:28                     ` Peter Zijlstra
2021-10-05 10:23                   ` Peter Zijlstra
2021-10-05  9:41               ` Peter Zijlstra
2021-09-22 15:05             ` Vincent Guittot
2021-10-05  9:32           ` Peter Zijlstra
2021-10-03  3:07         ` wakeup_affine_weight() is b0rked - was " Mike Galbraith
2021-10-03  7:34           ` Barry Song
2021-10-03 14:52             ` Mike Galbraith
2021-10-03 21:06               ` Barry Song
2021-10-04  1:49                 ` Mike Galbraith
2021-10-04  4:34             ` Mike Galbraith
2021-10-04  9:06               ` Mike Galbraith
2021-10-05  7:47                 ` Mel Gorman
2021-10-05  8:42                   ` Mike Galbraith
2021-10-05  9:31                     ` Mel Gorman
2021-10-06  6:46                       ` Mike Galbraith
2021-10-08  5:06                       ` Mike Galbraith
2021-09-21  8:03   ` Vincent Guittot
2021-09-21 10:45     ` Mel Gorman
