* [PATCH v4 0/2] Reduce stacking and overscheduling
@ 2021-10-28  9:48 Mel Gorman
  2021-10-28  9:48 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman
  2021-10-28  9:48 ` [PATCH 2/2] sched/fair: Increase wakeup_gran if current task has not executed the minimum granularity Mel Gorman
  0 siblings, 2 replies; 13+ messages in thread
From: Mel Gorman @ 2021-10-28  9:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

Also available at
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git sched-scalewakegran-v4r1

Changelog since v3
o No code changes, added some additional results to patch 1


This series tackles two problems. The first is that heavy wakers
can stack an excessive number of tasks on the same CPU. The
second is that tasks can over-schedule, preempting a running task
before it has reached its minimum preemption granularity.

The patches are independent but were discussed together in the thread
https://lore.kernel.org/r/20210920142614.4891-1-mgorman@techsingularity.net
so are presented together.

With both patches applied on a Zen3 machine:

hackbench-process-pipes
                          5.15.0-rc3             5.15.0-rc3
                             vanilla sched-scalewakegran-v3r2
Amean     1        0.3667 (   0.00%)      0.3823 (  -4.27%)
Amean     4        0.5343 (   0.00%)      0.4867 (   8.92%)
Amean     7        0.5300 (   0.00%)      0.5053 (   4.65%)
Amean     12       0.5737 (   0.00%)      0.5450 (   5.00%)
Amean     21       0.6727 (   0.00%)      0.6807 (  -1.19%)
Amean     30       0.8583 (   0.00%)      0.7107 *  17.20%*
Amean     48       1.3977 (   0.00%)      1.0447 *  25.26%*
Amean     79       1.9790 (   0.00%)      1.6033 *  18.98%*
Amean     110      2.8020 (   0.00%)      2.0763 *  25.90%*
Amean     141      3.6683 (   0.00%)      2.5313 *  31.00%*
Amean     172      4.6687 (   0.00%)      3.1163 *  33.25%*
Amean     203      5.2183 (   0.00%)      3.5560 *  31.86%*
Amean     234      6.1077 (   0.00%)      3.8913 *  36.29%*
Amean     265      7.1313 (   0.00%)      4.2293 *  40.69%*
Amean     296      7.7557 (   0.00%)      4.5357 *  41.52%*

                  5.15.0-rc3  5.15.0-rc3
                     vanilla sched-scalewakegran-v3r2
Duration User        2933.05     2034.17
Duration System     25652.83    17137.08
Duration Elapsed      162.50      120.25

 kernel/sched/fair.c     | 27 ++++++++++++++++++++++++---
 kernel/sched/features.h |  2 ++
 2 files changed, 26 insertions(+), 3 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-28  9:48 [PATCH v4 0/2] Reduce stacking and overscheduling Mel Gorman
@ 2021-10-28  9:48 ` Mel Gorman
  2021-10-28 16:19   ` Tao Zhou
  2021-10-29 15:17   ` Vincent Guittot
  2021-10-28  9:48 ` [PATCH 2/2] sched/fair: Increase wakeup_gran if current task has not executed the minimum granularity Mel Gorman
  1 sibling, 2 replies; 13+ messages in thread
From: Mel Gorman @ 2021-10-28  9:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

This patch mitigates a problem where wake_wide() allows a heavy waker
(e.g. X) to stack an excessive number of wakees on the same CPU. This
is due to the cpu_load check in wake_affine_weight. As noted by the
original patch author (Mike Galbraith)[1];

	Between load updates, X, or any other waker of many, can stack
	wakees to a ludicrous depth.  Tracing kbuild vs firefox playing a
	youtube clip, I watched X stack 20 of the zillion firefox minions
	while their previous CPUs all had 1 lousy task running but a
	cpu_load() higher than the cpu_load() of X's CPU.  Most of those
	prev_cpus were where X had left them when it migrated. Each and
	every crazy depth migration was wake_affine_weight() deciding we
	should pull.

Paraphrasing Mike's test results from the patch:

	With make -j8 running along with firefox with two tabs, one
	containing youtube's suggestions of stuff, the other a running
	clip, if the idle tab is in focus and the mouse is left alone,
	flips decay enough for wake_wide() to lose interest, but just
	wiggle the mouse, and it starts waking wide. Focus on the running
	clip, and it continuously wakes wide.

The end result is that heavy wakers are less likely to stack tasks
and, depending on the workload, migrations are reduced.

From additional tests on various servers, the impact is machine dependent
but generally this patch improves the situation.
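
As an illustration of the coupling, a toy standalone model follows. It
is not kernel code: sd_llc_size = 8 and the flip counts are assumed
numbers, and the last_wakee bookkeeping is elided.

	#include <stdio.h>

	struct task { unsigned int wakee_flips; };

	/* Toy version of the record_wakee() coupling rule. */
	static void record_wakee(struct task *waker, struct task *wakee,
				 unsigned int llc_size)
	{
		unsigned int min = llc_size << 1;

		/* A sufficiently heavy waker drags the wakee's flip
		 * count upward so wake_wide() can notice the herd. */
		if (waker->wakee_flips > wakee->wakee_flips * min)
			wakee->wakee_flips++;
		waker->wakee_flips++;
	}

	int main(void)
	{
		struct task x = { .wakee_flips = 320 };	/* heavy waker */
		struct task minion = { .wakee_flips = 0 };
		int i;

		for (i = 0; i < 16; i++)
			record_wakee(&x, &minion, 8);

		printf("minion flips: %u\n", minion.wakee_flips); /* 16 */
		return 0;
	}

Once the wakee's flips are dragged up like this, a decayed flip count no
longer leaves wake_wide() blind to a waker that is clearly fanning out.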

hackbench-process-pipes
                          5.15.0-rc3             5.15.0-rc3
                             vanilla  sched-wakeeflips-v1r1
Amean     1        0.3667 (   0.00%)      0.3890 (  -6.09%)
Amean     4        0.5343 (   0.00%)      0.5217 (   2.37%)
Amean     7        0.5300 (   0.00%)      0.5387 (  -1.64%)
Amean     12       0.5737 (   0.00%)      0.5443 (   5.11%)
Amean     21       0.6727 (   0.00%)      0.6487 (   3.57%)
Amean     30       0.8583 (   0.00%)      0.8033 (   6.41%)
Amean     48       1.3977 (   0.00%)      1.2400 *  11.28%*
Amean     79       1.9790 (   0.00%)      1.8200 *   8.03%*
Amean     110      2.8020 (   0.00%)      2.5820 *   7.85%*
Amean     141      3.6683 (   0.00%)      3.2203 *  12.21%*
Amean     172      4.6687 (   0.00%)      3.8200 *  18.18%*
Amean     203      5.2183 (   0.00%)      4.3357 *  16.91%*
Amean     234      6.1077 (   0.00%)      4.8047 *  21.33%*
Amean     265      7.1313 (   0.00%)      5.1243 *  28.14%*
Amean     296      7.7557 (   0.00%)      5.5940 *  27.87%*

While different machines showed different results, in general
there were far fewer CPU migrations of tasks.

tbench4
                           5.15.0-rc3             5.15.0-rc3
                              vanilla  sched-wakeeflips-v1r1
Hmean     1         824.05 (   0.00%)      802.56 *  -2.61%*
Hmean     2        1578.49 (   0.00%)     1645.11 *   4.22%*
Hmean     4        2959.08 (   0.00%)     2984.75 *   0.87%*
Hmean     8        5080.09 (   0.00%)     5173.35 *   1.84%*
Hmean     16       8276.02 (   0.00%)     9327.17 *  12.70%*
Hmean     32      15501.61 (   0.00%)    15925.55 *   2.73%*
Hmean     64      27313.67 (   0.00%)    24107.81 * -11.74%*
Hmean     128     32928.19 (   0.00%)    36261.75 *  10.12%*
Hmean     256     35434.73 (   0.00%)    38670.61 *   9.13%*
Hmean     512     50098.34 (   0.00%)    53243.75 *   6.28%*
Hmean     1024    69503.69 (   0.00%)    67425.26 *  -2.99%*

Bit of a mixed bag but wins more than it loses.

A new workload was added that runs a kernel build in the background
with -jNR_CPUS while NR_CPUS pairs of tasks run Netperf TCP_RR. The
intent is to see if heavy background tasks disrupt lighter tasks.

multi subtest kernbench
                               5.15.0-rc3             5.15.0-rc3
                                  vanilla  sched-wakeeflips-v1r1
Min       elsp-256       20.80 (   0.00%)       14.89 (  28.41%)
Amean     elsp-256       24.08 (   0.00%)       20.94 (  13.05%)
Stddev    elsp-256        3.32 (   0.00%)        4.68 ( -41.16%)
CoeffVar  elsp-256       13.78 (   0.00%)       22.36 ( -62.33%)
Max       elsp-256       29.11 (   0.00%)       26.49 (   9.00%)

multi subtest netperf-tcp-rr
                        5.15.0-rc3             5.15.0-rc3
                           vanilla  sched-wakeeflips-v1r1
Min       1    48286.26 (   0.00%)    49101.48 (   1.69%)
Hmean     1    62894.82 (   0.00%)    68963.51 *   9.65%*
Stddev    1     7600.56 (   0.00%)     8804.82 ( -15.84%)
Max       1    78975.16 (   0.00%)    87124.67 (  10.32%)

The variability is higher as a result of the patch but both workloads
experienced improved performance.

SpecJBB 2005 is a slightly more realistic workload with multiple
communicating Java threads

specjbb
                              5.15.0-rc3             5.15.0-rc3
                                 vanilla  sched-wakeeflips-v1r1
Hmean     tput-1     50044.48 (   0.00%)    53969.00 *   7.84%*
Hmean     tput-2    106050.31 (   0.00%)   113580.78 *   7.10%*
Hmean     tput-3    156701.44 (   0.00%)   164857.00 *   5.20%*
Hmean     tput-4    196538.75 (   0.00%)   218373.42 *  11.11%*
Hmean     tput-5    247566.16 (   0.00%)   267173.09 *   7.92%*
Hmean     tput-6    284981.46 (   0.00%)   311007.14 *   9.13%*
Hmean     tput-7    328882.48 (   0.00%)   359373.89 *   9.27%*
Hmean     tput-8    366941.24 (   0.00%)   393244.37 *   7.17%*
Hmean     tput-9    402386.74 (   0.00%)   433010.43 *   7.61%*
Hmean     tput-10   437551.05 (   0.00%)   475756.08 *   8.73%*
Hmean     tput-11   481349.41 (   0.00%)   519824.54 *   7.99%*
Hmean     tput-12   533148.45 (   0.00%)   565070.21 *   5.99%*
Hmean     tput-13   570563.97 (   0.00%)   609499.06 *   6.82%*
Hmean     tput-14   601117.97 (   0.00%)   647876.05 *   7.78%*
Hmean     tput-15   639096.38 (   0.00%)   690854.46 *   8.10%*
Hmean     tput-16   682644.91 (   0.00%)   722826.06 *   5.89%*
Hmean     tput-17   732248.96 (   0.00%)   758805.17 *   3.63%*
Hmean     tput-18   762771.33 (   0.00%)   791211.66 *   3.73%*
Hmean     tput-19   780582.92 (   0.00%)   819064.19 *   4.93%*
Hmean     tput-20   812183.95 (   0.00%)   836664.87 *   3.01%*
Hmean     tput-21   821415.48 (   0.00%)   833734.23 (   1.50%)
Hmean     tput-22   815457.65 (   0.00%)   844393.98 *   3.55%*
Hmean     tput-23   819263.63 (   0.00%)   846109.07 *   3.28%*
Hmean     tput-24   817962.95 (   0.00%)   839682.92 *   2.66%*
Hmean     tput-25   807814.64 (   0.00%)   841826.52 *   4.21%*
Hmean     tput-26   811755.89 (   0.00%)   838543.08 *   3.30%*
Hmean     tput-27   799341.75 (   0.00%)   833487.26 *   4.27%*
Hmean     tput-28   803434.89 (   0.00%)   829022.50 *   3.18%*
Hmean     tput-29   803233.25 (   0.00%)   826622.37 *   2.91%*
Hmean     tput-30   800465.12 (   0.00%)   824347.42 *   2.98%*
Hmean     tput-31   791284.39 (   0.00%)   791575.67 (   0.04%)
Hmean     tput-32   781930.07 (   0.00%)   805725.80 (   3.04%)
Hmean     tput-33   785194.31 (   0.00%)   804795.44 (   2.50%)
Hmean     tput-34   781325.67 (   0.00%)   800067.53 (   2.40%)
Hmean     tput-35   777715.92 (   0.00%)   753926.32 (  -3.06%)
Hmean     tput-36   770516.85 (   0.00%)   783328.32 (   1.66%)
Hmean     tput-37   758067.26 (   0.00%)   772243.18 *   1.87%*
Hmean     tput-38   764815.45 (   0.00%)   769156.32 (   0.57%)
Hmean     tput-39   757885.41 (   0.00%)   757670.59 (  -0.03%)
Hmean     tput-40   750140.15 (   0.00%)   760739.13 (   1.41%)

[1] https://lore.kernel.org/r/02c977d239c312de5e15c77803118dcf1e11f216.camel@gmx.de

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..d00af3b97d8f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
 	}
 
 	if (current->last_wakee != p) {
+		int min = __this_cpu_read(sd_llc_size) << 1;
+		/*
+		 * Couple the wakee flips to the waker for the case where it
+		 * doesn't accrue flips, taking care to not push the wakee
+		 * high enough that the wake_wide() heuristic fails.
+		 */
+		if (current->wakee_flips > p->wakee_flips * min)
+			p->wakee_flips++;
 		current->last_wakee = p;
 		current->wakee_flips++;
 	}
@@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
 
 	if (master < slave)
 		swap(master, slave);
-	if (slave < factor || master < slave * factor)
+	if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
 		return 0;
 	return 1;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] sched/fair: Increase wakeup_gran if current task has not executed the minimum granularity
  2021-10-28  9:48 [PATCH v4 0/2] Reduce stacking and overscheduling Mel Gorman
  2021-10-28  9:48 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman
@ 2021-10-28  9:48 ` Mel Gorman
  2021-10-29 16:07   ` Vincent Guittot
  1 sibling, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2021-10-28  9:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs")
moved the kernel.sched_wakeup_granularity_ns sysctl under debugfs.
One reason this sysctl is used is to "optimise for throughput",
particularly when a machine is overloaded. The TuneD tool alters it for
profiles such as "mssql" and "throughput-performance"; at least version
2.9 does, and its master branch now pokes at the debugfs location
instead. This patch aims to reduce the motivation to tweak
sysctl_sched_wakeup_granularity by increasing sched_wakeup_granularity
when the running task's runtime has not yet exceeded
sysctl_sched_min_granularity.

During task migration or wakeup, a decision is made on whether
to preempt the current task or not. To limit over-scheduling,
sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
of runtime before preempting. However, when a domain is heavily overloaded
(e.g. hackbench), the degree of over-scheduling is still severe. This is
problematic as time that could be used by userspace tasks is instead
wasted rescheduling.

However, care must be taken. Even if a system is overloaded, there may
be high priority threads that must still be able to run. Mike Galbraith
explained the constraints as follows;

        CFS came about because the O1 scheduler was unfair to the
        point it had starvation problems. People pretty much across the
        board agreed that a fair scheduler was a much better way
        to go, and CFS was born.  It didn't originally have the sleep
        credit business, but had to grow it to become _short term_ fair.
        Ingo cut the sleep credit in half because of overscheduling, and
        that has worked out pretty well all told.. but now you're pushing
        it more in the unfair direction, all the way to extremely unfair
        for anything and everything very light.

        Fairness isn't the holy grail mind you, and at some point, giving
        up on short term fairness certainly isn't crazy, as proven by your
        hackbench numbers and other numbers we've seen over the years,
        but taking bites out of the 'CF' in the CFS that was born to be a
        corner-case killer is.. worrisome.  The other shoe will drop.. it
        always does :)

This patch increases the wakeup granularity if the current task has not
reached its minimum preemption granularity. The current task may still
be preempted, but the required difference in vruntime is larger.
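
As a rough illustration of the effect, a minimal standalone sketch
follows. The granularity values are assumptions (roughly the defaults
before CPU scaling), and the vruntime weighting the kernel applies is
ignored.

	#include <stdio.h>

	#define MIN_GRAN_NS	 750000ULL	/* assumed sysctl_sched_min_granularity */
	#define WAKEUP_GRAN_NS	1000000ULL	/* assumed sysctl_sched_wakeup_granularity */

	/* Sketch of the SCALE_WAKEUP_GRAN rule. */
	static unsigned long long scaled_wakeup_gran(unsigned long long delta_exec)
	{
		unsigned long long gran = WAKEUP_GRAN_NS;

		/* curr has not run its minimum slice yet, so demand a
		 * larger lead before allowing preemption. */
		if (delta_exec < MIN_GRAN_NS)
			gran += MIN_GRAN_NS;
		return gran;
	}

	int main(void)
	{
		/* curr ran 0.5ms since being scheduled: prints 1750000 */
		printf("%llu\n", scaled_wakeup_gran(500000ULL));
		/* curr ran 2ms: prints 1000000, unchanged behaviour */
		printf("%llu\n", scaled_wakeup_gran(2000000ULL));
		return 0;
	}

So a freshly scheduled task is harder to preempt until it has had its
minimum slice, while a task that has already run that long is treated
exactly as before.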

hackbench-process-pipes
                          5.15.0-rc3             5.15.0-rc3
               sched-wakeeflips-v1r1 sched-scalewakegran-v3r2
Amean     1        0.3890 (   0.00%)      0.3823 (   1.71%)
Amean     4        0.5217 (   0.00%)      0.4867 (   6.71%)
Amean     7        0.5387 (   0.00%)      0.5053 (   6.19%)
Amean     12       0.5443 (   0.00%)      0.5450 (  -0.12%)
Amean     21       0.6487 (   0.00%)      0.6807 (  -4.93%)
Amean     30       0.8033 (   0.00%)      0.7107 *  11.54%*
Amean     48       1.2400 (   0.00%)      1.0447 *  15.75%*
Amean     79       1.8200 (   0.00%)      1.6033 *  11.90%*
Amean     110      2.5820 (   0.00%)      2.0763 *  19.58%*
Amean     141      3.2203 (   0.00%)      2.5313 *  21.40%*
Amean     172      3.8200 (   0.00%)      3.1163 *  18.42%*
Amean     203      4.3357 (   0.00%)      3.5560 *  17.98%*
Amean     234      4.8047 (   0.00%)      3.8913 *  19.01%*
Amean     265      5.1243 (   0.00%)      4.2293 *  17.47%*
Amean     296      5.5940 (   0.00%)      4.5357 *  18.92%*

                  5.15.0-rc3  5.15.0-rc3
         sched-wakeeflips-v1r1 sched-scalewakegran-v3r2
Duration User        2567.27     2034.17
Duration System     21098.79    17137.08
Duration Elapsed      136.49      120.25

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c     | 17 +++++++++++++++--
 kernel/sched/features.h |  2 ++
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d00af3b97d8f..dee108470297 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7052,10 +7052,23 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
-static unsigned long wakeup_gran(struct sched_entity *se)
+static unsigned long
+wakeup_gran(struct sched_entity *curr, struct sched_entity *se)
 {
 	unsigned long gran = sysctl_sched_wakeup_granularity;
 
+	if (sched_feat(SCALE_WAKEUP_GRAN)) {
+		unsigned long delta_exec;
+
+		/*
+		 * Increase the wakeup granularity if curr's runtime
+		 * is less than the minimum preemption granularity.
+		 */
+		delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+		if (delta_exec < sysctl_sched_min_granularity)
+			gran += sysctl_sched_min_granularity;
+	}
+
 	/*
 	 * Since its curr running now, convert the gran from real-time
 	 * to virtual-time in his units.
@@ -7094,7 +7107,7 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
 	if (vdiff <= 0)
 		return -1;
 
-	gran = wakeup_gran(se);
+	gran = wakeup_gran(curr, se);
 	if (vdiff > gran)
 		return 1;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7f8dace0964c..611591355ffd 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -95,3 +95,5 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(SCALE_WAKEUP_GRAN, true)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-28  9:48 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman
@ 2021-10-28 16:19   ` Tao Zhou
  2021-10-29  8:42     ` Mel Gorman
  2021-10-29 15:17   ` Vincent Guittot
  1 sibling, 1 reply; 13+ messages in thread
From: Tao Zhou @ 2021-10-28 16:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML,
	Tao Zhou

Hi Mel,

On Thu, Oct 28, 2021 at 10:48:33AM +0100, Mel Gorman wrote:

> @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
>  	}
>  
>  	if (current->last_wakee != p) {
> +		int min = __this_cpu_read(sd_llc_size) << 1;
> +		/*
> +		 * Couple the wakee flips to the waker for the case where it
> +		 * doesn't accrue flips, taking care to not push the wakee
> +		 * high enough that the wake_wide() heuristic fails.
> +		 */
> +		if (current->wakee_flips > p->wakee_flips * min)
> +			p->wakee_flips++;
>  		current->last_wakee = p;
>  		current->wakee_flips++;
>  	}
> @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
>  
>  	if (master < slave)
>  		swap(master, slave);
> -	if (slave < factor || master < slave * factor)
> +	if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)

A check like the following would include the above range:

  if ((slave < factor && master < slave * factor) ||
       master < slave * factor)

The "factor>>1" filters some of those cases out.

If "slave < factor" is true and "master < (factor>>1)*factor" is false,
we still check "master < slave * factor" (this is the path added by the
"master < (factor>>1)*factor" check). In that latter check "slave <
factor" must be true, so the result depends on slave being in the range
[factor>>1, factor] for "master < slave * factor" to possibly hold. If
slave is in [0, factor>>1], "master < slave * factor" is absolutely
false, and that case could be filtered out if a variable held the
result of "master < (factor>>1)*factor".

Just my random input; I am still confused but will keep looking at it.
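
For what it's worth, a standalone sketch of the short-circuit being
described; wake_wide_check() is a hypothetical model of the v4
condition, not kernel code:

	/*
	 * Model of:
	 *   (slave < factor && master < (factor>>1)*factor) ||
	 *    master < slave * factor
	 * with a "slave >= factor>>1" guard added. When slave < factor>>1,
	 * the second clause can only be reached with
	 * master >= (factor>>1)*factor, where "master < slave * factor"
	 * cannot hold, so the guard skips a multiply without changing
	 * the result.
	 */
	int wake_wide_check(unsigned int master, unsigned int slave,
			    unsigned int factor)
	{
		if (slave < factor && master < (factor >> 1) * factor)
			return 0;	/* wake affine */
		if (slave >= (factor >> 1) && master < slave * factor)
			return 0;	/* wake affine */
		return 1;		/* wake wide */
	}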



Thanks,
Tao

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-28 16:19   ` Tao Zhou
@ 2021-10-29  8:42     ` Mel Gorman
  2021-11-10  9:53       ` Tao Zhou
  0 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2021-10-29  8:42 UTC (permalink / raw)
  To: Tao Zhou
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Oct 29, 2021 at 12:19:48AM +0800, Tao Zhou wrote:
> Hi Mel,
> 
> On Thu, Oct 28, 2021 at 10:48:33AM +0100, Mel Gorman wrote:
> 
> > @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
> >  	}
> >  
> >  	if (current->last_wakee != p) {
> > +		int min = __this_cpu_read(sd_llc_size) << 1;
> > +		/*
> > +		 * Couple the wakee flips to the waker for the case where it
> > +		 * doesn't accrue flips, taking care to not push the wakee
> > +		 * high enough that the wake_wide() heuristic fails.
> > +		 */
> > +		if (current->wakee_flips > p->wakee_flips * min)
> > +			p->wakee_flips++;
> >  		current->last_wakee = p;
> >  		current->wakee_flips++;
> >  	}
> > @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
> >  
> >  	if (master < slave)
> >  		swap(master, slave);
> > -	if (slave < factor || master < slave * factor)
> > +	if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
> 
> A check like the following would include the above range:
> 
>   if ((slave < factor && master < slave * factor) ||
>        master < slave * factor)
> 
> The "factor>>1" filters some of those cases out.
> 
> If "slave < factor" is true and "master < (factor>>1)*factor" is false,
> we still check "master < slave * factor" (this is the path added by the
> "master < (factor>>1)*factor" check). In that latter check "slave <
> factor" must be true, so the result depends on slave being in the range
> [factor>>1, factor] for "master < slave * factor" to possibly hold. If
> slave is in [0, factor>>1], "master < slave * factor" is absolutely
> false, and that case could be filtered out if a variable held the
> result of "master < (factor>>1)*factor".
> 
> Just my random input; I am still confused but will keep looking at it.
> 

I'm not sure what point you're trying to make.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-28  9:48 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman
  2021-10-28 16:19   ` Tao Zhou
@ 2021-10-29 15:17   ` Vincent Guittot
  2021-10-30  3:11     ` Mike Galbraith
  2021-11-01  8:56     ` Mel Gorman
  1 sibling, 2 replies; 13+ messages in thread
From: Vincent Guittot @ 2021-10-29 15:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Thu, 28 Oct 2021 at 11:48, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> This patch mitigates a problem where wake_wide() allows a heavy waker
> (e.g. X) to stack an excessive number of wakees on the same CPU. This
> is due to the cpu_load check in wake_affine_weight. As noted by the
> original patch author (Mike Galbraith)[1];
>
>         Between load updates, X, or any other waker of many, can stack
>         wakees to a ludicrous depth.  Tracing kbuild vs firefox playing a
>         youtube clip, I watched X stack 20 of the zillion firefox minions
>         while their previous CPUs all had 1 lousy task running but a
>         cpu_load() higher than the cpu_load() of X's CPU.  Most of those
>         prev_cpus were where X had left them when it migrated. Each and
>         every crazy depth migration was wake_affine_weight() deciding we
>         should pull.
>
> Paraphrasing Mike's test results from the patch:
>
>         With make -j8 running along with firefox with two tabs, one
>         containing youtube's suggestions of stuff, the other a running
>         clip, if the idle tab is in focus and the mouse is left alone,
>         flips decay enough for wake_wide() to lose interest, but just
>         wiggle the mouse, and it starts waking wide. Focus on the running
>         clip, and it continuously wakes wide.
>
> The end result is that heavy wakers are less likely to stack tasks
> and, depending on the workload, migrations are reduced.
>
> From additional tests on various servers, the impact is machine dependent
> but generally this patch improves the situation.
>
> hackbench-process-pipes
>                           5.15.0-rc3             5.15.0-rc3
>                              vanilla  sched-wakeeflips-v1r1
> Amean     1        0.3667 (   0.00%)      0.3890 (  -6.09%)
> Amean     4        0.5343 (   0.00%)      0.5217 (   2.37%)
> Amean     7        0.5300 (   0.00%)      0.5387 (  -1.64%)
> Amean     12       0.5737 (   0.00%)      0.5443 (   5.11%)
> Amean     21       0.6727 (   0.00%)      0.6487 (   3.57%)
> Amean     30       0.8583 (   0.00%)      0.8033 (   6.41%)
> Amean     48       1.3977 (   0.00%)      1.2400 *  11.28%*
> Amean     79       1.9790 (   0.00%)      1.8200 *   8.03%*
> Amean     110      2.8020 (   0.00%)      2.5820 *   7.85%*
> Amean     141      3.6683 (   0.00%)      3.2203 *  12.21%*
> Amean     172      4.6687 (   0.00%)      3.8200 *  18.18%*
> Amean     203      5.2183 (   0.00%)      4.3357 *  16.91%*
> Amean     234      6.1077 (   0.00%)      4.8047 *  21.33%*
> Amean     265      7.1313 (   0.00%)      5.1243 *  28.14%*
> Amean     296      7.7557 (   0.00%)      5.5940 *  27.87%*
>
> While different machines showed different results, in general
> there were far fewer CPU migrations of tasks.
>
> tbench4
>                            5.15.0-rc3             5.15.0-rc3
>                               vanilla  sched-wakeeflips-v1r1
> Hmean     1         824.05 (   0.00%)      802.56 *  -2.61%*
> Hmean     2        1578.49 (   0.00%)     1645.11 *   4.22%*
> Hmean     4        2959.08 (   0.00%)     2984.75 *   0.87%*
> Hmean     8        5080.09 (   0.00%)     5173.35 *   1.84%*
> Hmean     16       8276.02 (   0.00%)     9327.17 *  12.70%*
> Hmean     32      15501.61 (   0.00%)    15925.55 *   2.73%*
> Hmean     64      27313.67 (   0.00%)    24107.81 * -11.74%*
> Hmean     128     32928.19 (   0.00%)    36261.75 *  10.12%*
> Hmean     256     35434.73 (   0.00%)    38670.61 *   9.13%*
> Hmean     512     50098.34 (   0.00%)    53243.75 *   6.28%*
> Hmean     1024    69503.69 (   0.00%)    67425.26 *  -2.99%*
>
> Bit of a mixed bag but wins more than it loses.
>
> A new workload was added that runs a kernel build in the background
> with -jNR_CPUS while NR_CPUS pairs of tasks run Netperf TCP_RR. The
> intent is to see if heavy background tasks disrupt lighter tasks.
>
> multi subtest kernbench
>                                5.15.0-rc3             5.15.0-rc3
>                                   vanilla  sched-wakeeflips-v1r1
> Min       elsp-256       20.80 (   0.00%)       14.89 (  28.41%)
> Amean     elsp-256       24.08 (   0.00%)       20.94 (  13.05%)
> Stddev    elsp-256        3.32 (   0.00%)        4.68 ( -41.16%)
> CoeffVar  elsp-256       13.78 (   0.00%)       22.36 ( -62.33%)
> Max       elsp-256       29.11 (   0.00%)       26.49 (   9.00%)
>
> multi subtest netperf-tcp-rr
>                         5.15.0-rc3             5.15.0-rc3
>                            vanilla  sched-wakeeflips-v1r1
> Min       1    48286.26 (   0.00%)    49101.48 (   1.69%)
> Hmean     1    62894.82 (   0.00%)    68963.51 *   9.65%*
> Stddev    1     7600.56 (   0.00%)     8804.82 ( -15.84%)
> Max       1    78975.16 (   0.00%)    87124.67 (  10.32%)
>
> The variability is higher as a result of the patch but both workloads
> experienced improved performance.
>
> SpecJBB 2005 is a slightly more realistic workload with multiple
> communicating Java threads
>
> specjbb
>                               5.15.0-rc3             5.15.0-rc3
>                                  vanilla  sched-wakeeflips-v1r1
> Hmean     tput-1     50044.48 (   0.00%)    53969.00 *   7.84%*
> Hmean     tput-2    106050.31 (   0.00%)   113580.78 *   7.10%*
> Hmean     tput-3    156701.44 (   0.00%)   164857.00 *   5.20%*
> Hmean     tput-4    196538.75 (   0.00%)   218373.42 *  11.11%*
> Hmean     tput-5    247566.16 (   0.00%)   267173.09 *   7.92%*
> Hmean     tput-6    284981.46 (   0.00%)   311007.14 *   9.13%*
> Hmean     tput-7    328882.48 (   0.00%)   359373.89 *   9.27%*
> Hmean     tput-8    366941.24 (   0.00%)   393244.37 *   7.17%*
> Hmean     tput-9    402386.74 (   0.00%)   433010.43 *   7.61%*
> Hmean     tput-10   437551.05 (   0.00%)   475756.08 *   8.73%*
> Hmean     tput-11   481349.41 (   0.00%)   519824.54 *   7.99%*
> Hmean     tput-12   533148.45 (   0.00%)   565070.21 *   5.99%*
> Hmean     tput-13   570563.97 (   0.00%)   609499.06 *   6.82%*
> Hmean     tput-14   601117.97 (   0.00%)   647876.05 *   7.78%*
> Hmean     tput-15   639096.38 (   0.00%)   690854.46 *   8.10%*
> Hmean     tput-16   682644.91 (   0.00%)   722826.06 *   5.89%*
> Hmean     tput-17   732248.96 (   0.00%)   758805.17 *   3.63%*
> Hmean     tput-18   762771.33 (   0.00%)   791211.66 *   3.73%*
> Hmean     tput-19   780582.92 (   0.00%)   819064.19 *   4.93%*
> Hmean     tput-20   812183.95 (   0.00%)   836664.87 *   3.01%*
> Hmean     tput-21   821415.48 (   0.00%)   833734.23 (   1.50%)
> Hmean     tput-22   815457.65 (   0.00%)   844393.98 *   3.55%*
> Hmean     tput-23   819263.63 (   0.00%)   846109.07 *   3.28%*
> Hmean     tput-24   817962.95 (   0.00%)   839682.92 *   2.66%*
> Hmean     tput-25   807814.64 (   0.00%)   841826.52 *   4.21%*
> Hmean     tput-26   811755.89 (   0.00%)   838543.08 *   3.30%*
> Hmean     tput-27   799341.75 (   0.00%)   833487.26 *   4.27%*
> Hmean     tput-28   803434.89 (   0.00%)   829022.50 *   3.18%*
> Hmean     tput-29   803233.25 (   0.00%)   826622.37 *   2.91%*
> Hmean     tput-30   800465.12 (   0.00%)   824347.42 *   2.98%*
> Hmean     tput-31   791284.39 (   0.00%)   791575.67 (   0.04%)
> Hmean     tput-32   781930.07 (   0.00%)   805725.80 (   3.04%)
> Hmean     tput-33   785194.31 (   0.00%)   804795.44 (   2.50%)
> Hmean     tput-34   781325.67 (   0.00%)   800067.53 (   2.40%)
> Hmean     tput-35   777715.92 (   0.00%)   753926.32 (  -3.06%)
> Hmean     tput-36   770516.85 (   0.00%)   783328.32 (   1.66%)
> Hmean     tput-37   758067.26 (   0.00%)   772243.18 *   1.87%*
> Hmean     tput-38   764815.45 (   0.00%)   769156.32 (   0.57%)
> Hmean     tput-39   757885.41 (   0.00%)   757670.59 (  -0.03%)
> Hmean     tput-40   750140.15 (   0.00%)   760739.13 (   1.41%)
>
> [1] https://lore.kernel.org/r/02c977d239c312de5e15c77803118dcf1e11f216.camel@gmx.de
>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff69f245b939..d00af3b97d8f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
>         }
>
>         if (current->last_wakee != p) {
> +               int min = __this_cpu_read(sd_llc_size) << 1;
> +               /*
> +                * Couple the wakee flips to the waker for the case where it
> +                * doesn't accrue flips, taking care to not push the wakee
> +                * high enough that the wake_wide() heuristic fails.
> +                */
> +               if (current->wakee_flips > p->wakee_flips * min)
> +                       p->wakee_flips++;

I have a hard time understanding the rationale behind these changes
and the one below. Could you provide more details about why
p->wakee_flips is increased here? It would also be good to add that
explanation to the commit message.
>                 current->last_wakee = p;
>                 current->wakee_flips++;
>         }
> @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
>
>         if (master < slave)
>                 swap(master, slave);
> -       if (slave < factor || master < slave * factor)
> +       if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
>                 return 0;
>         return 1;
>  }
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] sched/fair: Increase wakeup_gran if current task has not executed the minimum granularity
  2021-10-28  9:48 ` [PATCH 2/2] sched/fair: Increase wakeup_gran if current task has not executed the minimum granularity Mel Gorman
@ 2021-10-29 16:07   ` Vincent Guittot
  0 siblings, 0 replies; 13+ messages in thread
From: Vincent Guittot @ 2021-10-29 16:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Thu, 28 Oct 2021 at 11:49, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs")
> moved the kernel.sched_wakeup_granularity_ns sysctl under debugfs.
> One reason this sysctl is used is to "optimise for throughput",
> particularly when a machine is overloaded. The TuneD tool alters it for
> profiles such as "mssql" and "throughput-performance"; at least version
> 2.9 does, and its master branch now pokes at the debugfs location
> instead. This patch aims to reduce the motivation to tweak
> sysctl_sched_wakeup_granularity by increasing sched_wakeup_granularity
> when the running task's runtime has not yet exceeded
> sysctl_sched_min_granularity.
>
> During task migration or wakeup, a decision is made on whether
> to preempt the current task or not. To limit over-scheduling,
> sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
> of runtime before preempting. However, when a domain is heavily overloaded
> (e.g. hackbench), the degree of over-scheduling is still severe. This is
> problematic as time that could be used by userspace tasks is instead
> wasted rescheduling.
>
> However, care must be taken. Even if a system is overloaded, there may
> be high priority threads that must still be able to run. Mike Galbraith
> explained the constraints as follows;
>
>         CFS came about because the O1 scheduler was unfair to the
>         point it had starvation problems. People pretty much across the
>         board agreed that a fair scheduler was a much better way
>         to go, and CFS was born.  It didn't originally have the sleep
>         credit business, but had to grow it to become _short term_ fair.
>         Ingo cut the sleep credit in half because of overscheduling, and
>         that has worked out pretty well all told.. but now you're pushing
>         it more in the unfair direction, all the way to extremely unfair
>         for anything and everything very light.
>
>         Fairness isn't the holy grail mind you, and at some point, giving
>         up on short term fairness certainly isn't crazy, as proven by your
>         hackbench numbers and other numbers we've seen over the years,
>         but taking bites out of the 'CF' in the CFS that was born to be a
>         corner-case killer is.. worrisome.  The other shoe will drop.. it
>         always does :)
>
> This patch increases the wakeup granularity if the current task has not
> reached its minimum preemption granularity. The current task may still
> be preempted, but the required difference in vruntime is larger.
>
> hackbench-process-pipes
>                           5.15.0-rc3             5.15.0-rc3
>                sched-wakeeflips-v1r1 sched-scalewakegran-v3r2
> Amean     1        0.3890 (   0.00%)      0.3823 (   1.71%)
> Amean     4        0.5217 (   0.00%)      0.4867 (   6.71%)
> Amean     7        0.5387 (   0.00%)      0.5053 (   6.19%)
> Amean     12       0.5443 (   0.00%)      0.5450 (  -0.12%)
> Amean     21       0.6487 (   0.00%)      0.6807 (  -4.93%)
> Amean     30       0.8033 (   0.00%)      0.7107 *  11.54%*
> Amean     48       1.2400 (   0.00%)      1.0447 *  15.75%*
> Amean     79       1.8200 (   0.00%)      1.6033 *  11.90%*
> Amean     110      2.5820 (   0.00%)      2.0763 *  19.58%*
> Amean     141      3.2203 (   0.00%)      2.5313 *  21.40%*
> Amean     172      3.8200 (   0.00%)      3.1163 *  18.42%*
> Amean     203      4.3357 (   0.00%)      3.5560 *  17.98%*
> Amean     234      4.8047 (   0.00%)      3.8913 *  19.01%*
> Amean     265      5.1243 (   0.00%)      4.2293 *  17.47%*
> Amean     296      5.5940 (   0.00%)      4.5357 *  18.92%*
>
>                   5.15.0-rc3  5.15.0-rc3
>          sched-wakeeflips-v1r1 sched-scalewakegran-v3r2
> Duration User        2567.27     2034.17
> Duration System     21098.79    17137.08
> Duration Elapsed      136.49      120.25
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c     | 17 +++++++++++++++--
>  kernel/sched/features.h |  2 ++
>  2 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d00af3b97d8f..dee108470297 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7052,10 +7052,23 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  }
>  #endif /* CONFIG_SMP */
>
> -static unsigned long wakeup_gran(struct sched_entity *se)
> +static unsigned long
> +wakeup_gran(struct sched_entity *curr, struct sched_entity *se)
>  {
>         unsigned long gran = sysctl_sched_wakeup_granularity;
>
> +       if (sched_feat(SCALE_WAKEUP_GRAN)) {
> +               unsigned long delta_exec;
> +
> +               /*
> +                * Increase the wakeup granularity if curr's runtime
> +                * is less than the minimum preemption granularity.
> +                */
> +               delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> +               if (delta_exec < sysctl_sched_min_granularity)
> +                       gran += sysctl_sched_min_granularity;

I need to think a bit more about corner cases but this change looks
much better than the previous one.

> +       }
> +
>         /*
>          * Since its curr running now, convert the gran from real-time
>          * to virtual-time in his units.
> @@ -7094,7 +7107,7 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
>         if (vdiff <= 0)
>                 return -1;
>
> -       gran = wakeup_gran(se);
> +       gran = wakeup_gran(curr, se);
>         if (vdiff > gran)
>                 return 1;
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 7f8dace0964c..611591355ffd 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -95,3 +95,5 @@ SCHED_FEAT(LATENCY_WARN, false)
>
>  SCHED_FEAT(ALT_PERIOD, true)
>  SCHED_FEAT(BASE_SLICE, true)
> +
> +SCHED_FEAT(SCALE_WAKEUP_GRAN, true)
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-29 15:17   ` Vincent Guittot
@ 2021-10-30  3:11     ` Mike Galbraith
  2021-10-30  4:12       ` Mike Galbraith
  2021-11-01  8:56     ` Mel Gorman
  1 sibling, 1 reply; 13+ messages in thread
From: Mike Galbraith @ 2021-10-30  3:11 UTC (permalink / raw)
  To: Vincent Guittot, Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Fri, 2021-10-29 at 17:17 +0200, Vincent Guittot wrote:
>
> I have a hard time understanding the rationale behind these changes
> and the one below. Could you provide more details about why
> p->wakee_flips is increased here?

The rationale behind it was me discovering that wake_affine_weight()'s
use of weight averages let X, waking a way too big thread pool, stack
the entire herd of 21 threads onto the waker's CPU while all other CPUs
in my little i7 box had one task each.  Preventing stacking is the job
of SIS or wake_wide(), but because I was test driving a patch I had
some fairness concerns about, the box was kept busy.  I was
subsequently asked about wake_wide()'s role, and while I don't think it
should have one in a single LLC box, I looked into it and found that
while X is a candidate, its event thread wakees were not. I thought to
myself: what if I loosely couple the wakees that earn zero flips? I did
so, then allowed for a bit of decay wiggle room while watching it do
its thing in realtime.

There you have the rationale.

While it did help, it did not eliminate the aforementioned worst case
because as desktop behavior changes, decay turns off the heuristic,
stacking follows.  I profiled it with a perf that sums delay (local mod
I find useful), and found that there was no real benefit to the light
desktop test load, at which point, no longer having NUMA boxen at my
disposal where wake_wide() does have a mission, I lost interest.  Mel
was interested however, fed it to SUSE's test array, and here we are.

Executive summary: patchlet is a not-so-lovely mitigation of an even
more not-so-lovely scheduler behavior. The two deserve each other ;-)

Kidding aside, way better would be wake_wide() becoming obsolete.

	-Mike

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-30  3:11     ` Mike Galbraith
@ 2021-10-30  4:12       ` Mike Galbraith
  0 siblings, 0 replies; 13+ messages in thread
From: Mike Galbraith @ 2021-10-30  4:12 UTC (permalink / raw)
  To: Vincent Guittot, Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Srikar Dronamraju, LKML

On Sat, 2021-10-30 at 05:11 +0200, Mike Galbraith wrote:
>
> I profiled it with a perf that sums delay (local mod I find useful)...

Here are those numbers for completeness.  As you can see, patchlet was
fairly meaningless to the desktop load.  In fact, even though the wake
affine logic essentially favors the compute load over the heavily
threaded but light desktop load, I noticed zero difference with or
without that being allowed, nor did perf record any noteworthy latency
events, just more wait as load was kinda sorta de-threaded by being
stacked up.

box = i7-4790 quad+smt
desktop vs massive_intr 8 9999 (8 x 8ms run/1ms sleep, for 9999 secs.. effectively forever)
perf sched record -a -- su mikeg -c 'firefox https://www.youtube.com/watch?v=aqz-KE-bpKQ'& sleep 300 && killall perf firefox
                     runtime              runtime               sum delay      sum delay        sum delay       switches      desktop
patch/features       total          util  massive_intr   util   total          massive_intr     desktop        total/massive  util
virgin/stock         2267347.921 ms 94.4% 1932675.152 ms 80.5%  158611.016 ms  133309.938 ms    25301.078 ms  594780/441157   13.9%
virgin/-wa_weight    2236871.408 ms 93.2% 1881464.401 ms 78.3%  255785.391 ms  243958.616 ms    11826.775 ms 1525470/1424083  14.8%
                          -1.34%    -1.2%                -2.2%                                    -13.474 s                   +0.9%
wake_wide/stock      2254335.961 ms 93.9% 1917834.157 ms 79.9%  164766.194 ms  141974.540 ms    22791.654 ms  720711/599064   14.0%


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-29 15:17   ` Vincent Guittot
  2021-10-30  3:11     ` Mike Galbraith
@ 2021-11-01  8:56     ` Mel Gorman
  1 sibling, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2021-11-01  8:56 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Oct 29, 2021 at 05:17:38PM +0200, Vincent Guittot wrote:
> > index ff69f245b939..d00af3b97d8f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
> >         }
> >
> >         if (current->last_wakee != p) {
> > +               int min = __this_cpu_read(sd_llc_size) << 1;
> > +               /*
> > +                * Couple the wakee flips to the waker for the case where it
> > +                * doesn't accrue flips, taking care to not push the wakee
> > +                * high enough that the wake_wide() heuristic fails.
> > +                */
> > +               if (current->wakee_flips > p->wakee_flips * min)
> > +                       p->wakee_flips++;
> 
> I have a hard time understanding the rationale behind these changes
> and the one below. Could you provide more details about why
> p->wakee_flips is increased here? It would also be good to add that
> explanation to the commit message.


The changelog covers it in the first two paragraphs but would the
following be better as a comment?

/*
 * Couple the wakee flips to the waker for the case where the
 * wakee doesn't accrue any flips during a short interval where
 * there are many wakeups without cpu load average being updated.
 * Otherwise, it is possible for wake_wide to not trigger followed
 * by an affine wake stacking multiple tasks on the same CPU due
 * to a stale cpu_load() value checked in wake_affine_weight.
 * This heuristic reduces excessive stacking of tasks while taking
 * care to not push the wakee high enough that the wake_wide
 * heuristic fails differently.
 */

Is that any better? I know this is a heuristic that is a bit on the
fuzzy side as it's trying to clamp the worst of a corner case. Ideally
"wake_wide" would be replaced with a more straight-forward heuristic but
I'm not aware of any alternatives being proposed (and I don't have one
of my own).

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-10-29  8:42     ` Mel Gorman
@ 2021-11-10  9:53       ` Tao Zhou
  2021-11-10 15:40         ` Mike Galbraith
  0 siblings, 1 reply; 13+ messages in thread
From: Tao Zhou @ 2021-11-10  9:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML,
	Tao Zhou

On Fri, Oct 29, 2021 at 09:42:19AM +0100, Mel Gorman wrote:
> On Fri, Oct 29, 2021 at 12:19:48AM +0800, Tao Zhou wrote:
> > Hi Mel,
> > 
> > On Thu, Oct 28, 2021 at 10:48:33AM +0100, Mel Gorman wrote:
> > 
> > > @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
> > >  	}
> > >  
> > >  	if (current->last_wakee != p) {
> > > +		int min = __this_cpu_read(sd_llc_size) << 1;
> > > +		/*
> > > +		 * Couple the wakee flips to the waker for the case where it
> > > +		 * doesn't accrue flips, taking care to not push the wakee
> > > +		 * high enough that the wake_wide() heuristic fails.
> > > +		 */
> > > +		if (current->wakee_flips > p->wakee_flips * min)
> > > +			p->wakee_flips++;
> > >  		current->last_wakee = p;
> > >  		current->wakee_flips++;
> > >  	}
> > > @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
> > >  
> > >  	if (master < slave)
> > >  		swap(master, slave);
> > > -	if (slave < factor || master < slave * factor)
> > > +	if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
> > 
> > A check like the following would include the above range:
> > 
> >   if ((slave < factor && master < slave * factor) ||
> >        master < slave * factor)
> > 
> > The "factor>>1" filters some of those cases out.
> > 
> > If "slave < factor" is true and "master < (factor>>1)*factor" is false,
> > we still check "master < slave * factor" (this is the path added by the
> > "master < (factor>>1)*factor" check). In that latter check "slave <
> > factor" must be true, so the result depends on slave being in the range
> > [factor>>1, factor] for "master < slave * factor" to possibly hold. If
> > slave is in [0, factor>>1], "master < slave * factor" is absolutely
> > false, and that case could be filtered out if a variable held the
> > result of "master < (factor>>1)*factor".
> > 
> > Just my random input; I am still confused but will keep looking at it.
> > 
> 
> I'm not sure what point you're trying to make.

Ok, some days later I cannot even understand what I was saying myself.
After going back and forth in my head, what I am trying to suggest is
this:

if ((slave < factor && master < (factor>>1)*factor) || (slave >= factor>>1) && master < slave * factor)

The "slave >= factor>>1" check filters out the cases where "master <
slave * factor" cannot be true, if I am not wrong. The small benefit,
if any, would be skipping the "master < slave * factor" calculation
some of the time, though I am not sure.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
  2021-11-10  9:53       ` Tao Zhou
@ 2021-11-10 15:40         ` Mike Galbraith
  0 siblings, 0 replies; 13+ messages in thread
From: Mike Galbraith @ 2021-11-10 15:40 UTC (permalink / raw)
  To: Tao Zhou, Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Srikar Dronamraju, LKML

On Wed, 2021-11-10 at 17:53 +0800, Tao Zhou wrote:
> On Fri, Oct 29, 2021 at 09:42:19AM +0100, Mel Gorman wrote:
> > On Fri, Oct 29, 2021 at 12:19:48AM +0800, Tao Zhou wrote:
> > > Hi Mel,
> > >
> > > On Thu, Oct 28, 2021 at 10:48:33AM +0100, Mel Gorman wrote:
> > >
> > > > @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
> > > >         }
> > > >  
> > > >         if (current->last_wakee != p) {
> > > > +               int min = __this_cpu_read(sd_llc_size) << 1;
> > > > +               /*
> > > > +                * Couple the wakee flips to the waker for the case where it
> > > > +                * doesn't accrue flips, taking care to not push the wakee
> > > > +                * high enough that the wake_wide() heuristic fails.
> > > > +                */
> > > > +               if (current->wakee_flips > p->wakee_flips * min)
> > > > +                       p->wakee_flips++;
> > > >                 current->last_wakee = p;
> > > >                 current->wakee_flips++;
> > > >         }
> > > > @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
> > > >  
> > > >         if (master < slave)
> > > >                 swap(master, slave);
> > > > -       if (slave < factor || master < slave * factor)
> > > > +       if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
> > >
> > > A check like the following would include the above range:
> > >
> > >   if ((slave < factor && master < slave * factor) ||
> > >        master < slave * factor)
> > >
> > > The "factor>>1" filters some of those cases out.
> > >
> > > If "slave < factor" is true and "master < (factor>>1)*factor" is false,
> > > we still check "master < slave * factor" (this is the path added by the
> > > "master < (factor>>1)*factor" check). In that latter check "slave <
> > > factor" must be true, so the result depends on slave being in the range
> > > [factor>>1, factor] for "master < slave * factor" to possibly hold. If
> > > slave is in [0, factor>>1], "master < slave * factor" is absolutely
> > > false, and that case could be filtered out if a variable held the
> > > result of "master < (factor>>1)*factor".
> > >
> > > Just my random input; I am still confused but will keep looking at it.
> > >
> >
> > I'm not sure what point you're trying to make.
>
> Ok, some days later I cannot even understand what I was saying myself.
> After going back and forth in my head, what I am trying to suggest is
> this:
>
> if ((slave < factor && master < (factor>>1)*factor) || (slave >= factor>>1) && master < slave * factor)
>
> The "slave >= factor>>1" check filters out the cases where "master <
> slave * factor" cannot be true, if I am not wrong. The small benefit,
> if any, would be skipping the "master < slave * factor" calculation
> some of the time, though I am not sure.

Take the original:

if (slave < factor || master < slave * factor)
	return 0;

That is looking for a waker:wakees ratio of sd_llc_size, and does it
the way it does because you can create "flips" galore by waking only
two tasks, but using the two comparisons together makes it more likely
that you're waking sd_llc_size tasks.  Take my box's LLC servicing 8
runqueues: if the wakee's flips are 8, the multi-waker being at 8 times
that suggests 8 wakees, each having been awakened 8 times by our
multi-waker, qualifying the pair to be considered part of a load too
large to restrict to one LLC.
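
Plugging those numbers into the original check (an illustrative
walk-through, with factor = sd_llc_size = 8):

	slave  = 8  ->  "slave < factor"          : 8 < 8   is false
	master = 64 ->  "master < slave * factor" : 64 < 64 is false

	Neither clause holds, so wake_wide() returns 1 and the wakeup
	goes wide rather than affine.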

But what happens when our multi-waker isn't always waking a uniformly
growing/shrinking set of workers, it's a bit chaotic, and the flip
count of some wakees decays below our magic 8?  The right side can be
happy as a clam because the multi-waker is flipping madly enough to
make wakee * llc_size nothing remotely resembling a hurdle, but there
sits a deal breaker on the left.. so we should wake these threads
affine?  I should have left that alone, or at least picked a big
arbitrary stopper, but picked half of our magic "I might be waking a
herd" number to say nah, as long as the ratio on the right looks
herd-like AND our multi-waker appears to be waking at least half a
herd, wake it wide.
That not-a-noop probably should die despite having not (yet) shown an
evil side because it dings up an already questionable enough heuristic.

	-Mike

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/2] sched/fair: Increase wakeup_gran if current task has not executed the minimum granularity
  2021-10-21 14:56 [PATCH 0/2] Reduce stacking and overscheduling Mel Gorman
@ 2021-10-21 14:56 ` Mel Gorman
  0 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2021-10-21 14:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs")
moved the kernel.sched_wakeup_granularity_ns sysctl under debugfs.
One reason this sysctl is used is to "optimise for throughput",
particularly when a machine is overloaded. The TuneD tool alters it for
profiles such as "mssql" and "throughput-performance"; at least version
2.9 does, and its master branch now pokes at the debugfs location
instead. This patch aims to reduce the motivation to tweak
sysctl_sched_wakeup_granularity by increasing sched_wakeup_granularity
when the running task's runtime has not yet exceeded
sysctl_sched_min_granularity.

During task migration or wakeup, a decision is made on whether
to preempt the current task or not. To limit over-scheduling,
sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
of runtime before preempting. However, when a domain is heavily overloaded
(e.g. hackbench), the degree of over-scheduling is still severe. This is
problematic as time that could be used by userspace tasks is instead
wasted rescheduling.

However, care must be taken. Even if a system is overloaded, there may
be high priority threads that must still be able to run. Mike Galbraith
explained the constraints as follows;

        CFS came about because the O1 scheduler was unfair to the
        point it had starvation problems. People pretty much across the
        board agreed that a fair scheduler was a much better way
        to go, and CFS was born.  It didn't originally have the sleep
        credit business, but had to grow it to become _short term_ fair.
        Ingo cut the sleep credit in half because of overscheduling, and
        that has worked out pretty well all told.. but now you're pushing
        it more in the unfair direction, all the way to extremely unfair
        for anything and everything very light.

        Fairness isn't the holy grail mind you, and at some point, giving
        up on short term fairness certainly isn't crazy, as proven by your
        hackbench numbers and other numbers we've seen over the years,
        but taking bites out of the 'CF' in the CFS that was born to be a
        corner-case killer is.. worrisome.  The other shoe will drop.. it
        always does :)

This patch increases the wakeup granularity if the current task has not
reached its minimum preemption granularity. The current task may still
be preempted, but the required difference in vruntime is larger.

hackbench-process-pipes
                          5.15.0-rc3             5.15.0-rc3
               sched-wakeeflips-v1r1 sched-scalewakegran-v3r2
Amean     1        0.3890 (   0.00%)      0.3823 (   1.71%)
Amean     4        0.5217 (   0.00%)      0.4867 (   6.71%)
Amean     7        0.5387 (   0.00%)      0.5053 (   6.19%)
Amean     12       0.5443 (   0.00%)      0.5450 (  -0.12%)
Amean     21       0.6487 (   0.00%)      0.6807 (  -4.93%)
Amean     30       0.8033 (   0.00%)      0.7107 *  11.54%*
Amean     48       1.2400 (   0.00%)      1.0447 *  15.75%*
Amean     79       1.8200 (   0.00%)      1.6033 *  11.90%*
Amean     110      2.5820 (   0.00%)      2.0763 *  19.58%*
Amean     141      3.2203 (   0.00%)      2.5313 *  21.40%*
Amean     172      3.8200 (   0.00%)      3.1163 *  18.42%*
Amean     203      4.3357 (   0.00%)      3.5560 *  17.98%*
Amean     234      4.8047 (   0.00%)      3.8913 *  19.01%*
Amean     265      5.1243 (   0.00%)      4.2293 *  17.47%*
Amean     296      5.5940 (   0.00%)      4.5357 *  18.92%*

                  5.15.0-rc3  5.15.0-rc3
         sched-wakeeflips-v1r1 sched-scalewakegran-v3r2
Duration User        2567.27     2034.17
Duration System     21098.79    17137.08
Duration Elapsed      136.49      120.25

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c     | 17 +++++++++++++++--
 kernel/sched/features.h |  2 ++
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d00af3b97d8f..dee108470297 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7052,10 +7052,23 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
-static unsigned long wakeup_gran(struct sched_entity *se)
+static unsigned long
+wakeup_gran(struct sched_entity *curr, struct sched_entity *se)
 {
 	unsigned long gran = sysctl_sched_wakeup_granularity;
 
+	if (sched_feat(SCALE_WAKEUP_GRAN)) {
+		unsigned long delta_exec;
+
+		/*
+		 * Increase the wakeup granularity if curr's runtime
+		 * is less than the minimum preemption granularity.
+		 */
+		delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+		if (delta_exec < sysctl_sched_min_granularity)
+			gran += sysctl_sched_min_granularity;
+	}
+
 	/*
 	 * Since its curr running now, convert the gran from real-time
 	 * to virtual-time in his units.
@@ -7094,7 +7107,7 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
 	if (vdiff <= 0)
 		return -1;
 
-	gran = wakeup_gran(se);
+	gran = wakeup_gran(curr, se);
 	if (vdiff > gran)
 		return 1;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7f8dace0964c..611591355ffd 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -95,3 +95,5 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(SCALE_WAKEUP_GRAN, true)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-11-10 15:40 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-28  9:48 [PATCH v4 0/2] Reduce stacking and overscheduling Mel Gorman
2021-10-28  9:48 ` [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers Mel Gorman
2021-10-28 16:19   ` Tao Zhou
2021-10-29  8:42     ` Mel Gorman
2021-11-10  9:53       ` Tao Zhou
2021-11-10 15:40         ` Mike Galbraith
2021-10-29 15:17   ` Vincent Guittot
2021-10-30  3:11     ` Mike Galbraith
2021-10-30  4:12       ` Mike Galbraith
2021-11-01  8:56     ` Mel Gorman
2021-10-28  9:48 ` [PATCH 2/2] sched/fair: Increase wakeup_gran if current task has not executed the minimum granularity Mel Gorman
2021-10-29 16:07   ` Vincent Guittot
  -- strict thread matches above, loose matches on Subject: below --
2021-10-21 14:56 [PATCH 0/2] Reduce stacking and overscheduling Mel Gorman
2021-10-21 14:56 ` [PATCH 2/2] sched/fair: Increase wakeup_gran if current task has not executed the minimum granularity Mel Gorman
