From: K Prateek Nayak <kprateek.nayak@amd.com>
To: linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org
Cc: Tejun Heo <tj@kernel.org>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	x86@kernel.org, Gautham Shenoy <gautham.shenoy@amd.com>
Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()
Date: Thu, 1 Jun 2023 15:03:39 +0530	[thread overview]
Message-ID: <3de5c24f-6437-f21b-ed61-76b86a199e8c@amd.com> (raw)
In-Reply-To: <168553468754.404.2298362895524875073.tip-bot2@tip-bot2>

Hello Peter, 

Sharing some initial benchmark results with the patch below.

tl;dr

- Hackbench starts off well but performance drops as the number of groups
  increases.

- schbench (old), tbench, netperf see improvement but there is a band of
  outlier results when the system is fully loaded or slightly overloaded.

- Stream and ycsb-mongodb don't seem to mind the extra search.

- SPECjbb (with default scheduler tunables) and DeathStarBench are not
  very happy.

On 5/31/2023 5:34 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Gitweb:        https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Author:        Peter Zijlstra <peterz@infradead.org>
> AuthorDate:    Tue, 30 May 2023 13:20:46 +02:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
> 
> sched/fair: Multi-LLC select_idle_sibling()
> 
> Tejun reported that when he targets workqueues towards a specific LLC
> on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
> significant idle time.
> 
> This is, of course, because of how select_idle_sibling() will not
> consider anything outside of the local LLC, and since all these tasks
> are short running the periodic idle load balancer is ineffective.
> 
> And while it is good to keep work cache local, it is better to not
> have significant idle time. Therefore, have select_idle_sibling() try
> other LLCs inside the same node when the local one comes up empty.

Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
running in NPS1 mode. Following is the simplified machine topology:

    NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    DIE0: 0-63,   128-191
        MC0: 0-7, 128-135
           SMT0: 0,128
           SMT1: 1,129
           ...
           SMT7: 7,135
        MC1: 8-15, 136-143
           SMT8: 8,136
           SMT9: 9,137
           ...
           SMT15: 15,143
        ...
        MC7: 56-63, 184-191
           SMT56: 56,184
           SMT57: 57,185
           ...
           SMT63: 63,191

    DIE1: 64-127, 192-255
        MC8: 64-71, 192-199
           SMT64: 64,192
           SMT65: 65,193
           ...
           SMT71: 71,199
        MC9: 72-79, 200-207
           SMT72: 72,200
           SMT73: 73,201
           ...
           SMT79: 79,207
        ...
        MC15: 120-127, 248-255
           SMT120: 120,248
           SMT121: 121,249
           ...
           SMT127: 127,255
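
The hierarchy above can be cross-checked at runtime via the scheduler
debugfs interface (requires CONFIG_SCHED_DEBUG and a mounted debugfs).
Below is a small sketch that walks the per-CPU domain directories; the
path layout matches recent kernels but may differ on older ones:

```python
from pathlib import Path

def sched_domains(cpu=0, root="/sys/kernel/debug/sched/domains"):
    """Return (dir, name, flags) tuples for one CPU, innermost level first."""
    out = []
    # Note: lexicographic sort is fine for the handful of levels seen here.
    for dom in sorted(Path(root, f"cpu{cpu}").glob("domain*")):
        name = (dom / "name").read_text().strip()    # e.g. SMT, MC, DIE, NUMA
        flags = (dom / "flags").read_text().strip()  # SD_* flags for the level
        out.append((dom.name, name, flags))
    return out

if __name__ == "__main__":
    for entry in sched_domains(0):
        print(*entry)
```

On the machine above, cpu0 should report the SMT, MC, DIE and NUMA levels
in that order.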

Since the patch extends the idle CPU search to one domain above MC in
case of an unsuccessful search, for the above topology, the DIE
domain becomes the wake domain with a potential 128 CPUs to be searched.
Following are the benchmark results:

o Kernel Versions

- tip              - tip:sched/core at commit e2a1f85bf9f5 ("sched/psi:
                     Avoid resetting the min update period when it is
                     unnecessary")

- peter-next-level - tip:sched/core + this patch

o Benchmark Results

Note: Benchmarks were run with boost enabled and C2 disabled to minimize
the impact of other external factors.

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:			tip		   peter-next-level
 1-groups:	   3.92 (0.00 pct)	   4.05 (-3.31 pct)
 2-groups:	   4.58 (0.00 pct)	   3.84 (16.15 pct)
 4-groups:	   4.99 (0.00 pct)	   3.98 (20.24 pct)
 8-groups:	   5.67 (0.00 pct)	   6.05 (-6.70 pct)	* Overloaded
16-groups:	   7.88 (0.00 pct)	  10.56 (-34.01 pct)	* Overloaded
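
A note on the numbers throughout: the parenthesized values are percent
change versus tip, with the sign arranged so that positive always means an
improvement (for hackbench, lower runtimes are better). A small helper
reproducing that convention, reconstructed from the tables rather than
taken from any actual reporting script:

```python
import math

def pct_change(tip, new, lower_is_better=False):
    """Percent change of `new` vs `tip`; positive always means improvement."""
    diff = (tip - new) if lower_is_better else (new - tip)
    raw = diff / tip * 100.0
    # The tables appear to truncate (not round) to two decimal places.
    return math.trunc(raw * 100) / 100.0
```

For example, `pct_change(3.92, 4.05, lower_is_better=True)` reproduces the
`-3.31 pct` of the 1-groups row above.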

~~~~~~~~~~~~~~~~~~
~ schbench (Old) ~
~~~~~~~~~~~~~~~~~~

o NPS1

#workers:	tip		  peter-next-level
  1:	  26.00 (0.00 pct)	  24.00 (7.69 pct)
  2:	  27.00 (0.00 pct)	  24.00 (11.11 pct)
  4:	  31.00 (0.00 pct)	  28.00 (9.67 pct)
  8:	  36.00 (0.00 pct)	  33.00 (8.33 pct)
 16:	  49.00 (0.00 pct)	  47.00 (4.08 pct)
 32:	  80.00 (0.00 pct)	  81.00 (-1.25 pct)
 64:	 169.00 (0.00 pct)	 169.00 (0.00 pct)
128:	 343.00 (0.00 pct)	 365.00 (-6.41 pct)	* Fully Loaded
256:	 42048.00 (0.00 pct)	 35392.00 (15.82 pct)
512:	 95104.00 (0.00 pct)	 88704.00 (6.72 pct)

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:	tip		 peter-next-level
    1	 452.49 (0.00 pct)	 457.94 (1.20 pct)
    2	 862.44 (0.00 pct)	 879.99 (2.03 pct)
    4	 1604.27 (0.00 pct)	 1618.87 (0.91 pct)
    8	 2966.77 (0.00 pct)	 3040.90 (2.49 pct)
   16	 5176.70 (0.00 pct)	 5292.29 (2.23 pct)
   32	 8205.24 (0.00 pct)	 8949.12 (9.06 pct)
   64	 13956.71 (0.00 pct)	 14461.42 (3.61 pct)
  128	 24005.50 (0.00 pct)	 26052.75 (8.52 pct)
  256	 32457.61 (0.00 pct)	 21999.41 (-32.22 pct)	* Overloaded
  512	 34345.24 (0.00 pct)	 41166.39 (19.86 pct)
 1024	 33432.92 (0.00 pct)	 40900.84 (22.33 pct)

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

- 10 Runs:

Test:			tip	   peter-next-level
 Copy:	 271317.35 (0.00 pct)	 292440.22 (7.78 pct)
Scale:	 205533.77 (0.00 pct)	 203362.60 (-1.05 pct)
  Add:	 221624.62 (0.00 pct)	 225850.83 (1.90 pct)
Triad:	 228500.68 (0.00 pct)	 225885.25 (-1.14 pct)

- 100 Runs:

Test:			tip	   peter-next-level
 Copy:	 317381.65 (0.00 pct)	 318827.08 (0.45 pct)
Scale:	 214145.00 (0.00 pct)	 206213.69 (-3.70 pct)
  Add:	 239243.29 (0.00 pct)	 229791.67 (-3.95 pct)
Triad:	 249477.76 (0.00 pct)	 236843.06 (-5.06 pct)

~~~~~~~~~~~~~~~~~~~~
~ netperf - TCP_RR ~
~~~~~~~~~~~~~~~~~~~~

o NPS1

Test:			tip		   peter-next-level
 1-clients:	 102839.97 (0.00 pct)	 103540.33 (0.68 pct)
 2-clients:	 98428.08 (0.00 pct)	 100431.67 (2.03 pct)
 4-clients:	 92298.45 (0.00 pct)	 94800.51 (2.71 pct)
 8-clients:	 85618.41 (0.00 pct)	 89130.14 (4.10 pct)
16-clients:	 78722.18 (0.00 pct)	 79715.38 (1.26 pct)
32-clients:	 73610.75 (0.00 pct)	 72801.41 (-1.09 pct)
64-clients:	 55285.07 (0.00 pct)	 56184.38 (1.62 pct)
128-clients:	 31176.92 (0.00 pct)	 32830.06 (5.30 pct)
256-clients:	 20011.44 (0.00 pct)	 15135.39 (-24.36 pct)	* Overloaded

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

						  tip			peter-next-level
Hmean     unixbench-dhry2reg-1   	  41322625.19 (   0.00%)   41224388.33 (  -0.24%)
Hmean     unixbench-dhry2reg-512	6252491108.60 (   0.00%)  6240160851.68 (  -0.20%)
Amean     unixbench-syscall-1    	   2501398.27 (   0.00%)    2577323.43 *  -3.04%*
Amean     unixbench-syscall-512  	   8120524.00 (   0.00%)    7512955.87 *   7.48%*
Hmean     unixbench-pipe-1    		   2359346.02 (   0.00%)    2392308.62 *   1.40%*
Hmean     unixbench-pipe-512		 338790322.61 (   0.00%)  337711432.92 (  -0.32%)
Hmean     unixbench-spawn-1       	      4261.52 (   0.00%)       4164.90 (  -2.27%)
Hmean     unixbench-spawn-512    	     64328.93 (   0.00%)      62257.64 *  -3.22%*
Hmean     unixbench-execl-1       	      3677.73 (   0.00%)       3652.08 (  -0.70%)
Hmean     unixbench-execl-512    	     11984.83 (   0.00%)      13585.65 *  13.36%*

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1

tip:			131070.33 (var: 2.84%)
peter-next-level:	131070.33 (var: 2.84%) (0.00%)

~~~~~~~~~~~~~~~~~~~~~~~
~ SPECjbb - Multi-JVM ~
~~~~~~~~~~~~~~~~~~~~~~~

o NPS1

- Default Scheduler Tunables

kernel			max-jOPS		critical-jOPS
tip			100.00%			100.00%
peter-next-level	 94.45% (-5.55%)	 98.25% (-1.75%)

- Modified Scheduler Tunables

kernel			max-jOPS		critical-jOPS
tip			100.00%			100.00%
peter-next-level	100.00% (0.00%)		102.41% (2.41%)

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

Pinning   Scaling	tip		peter-next-level
1 CCD     1             100.00%      	100.30% (%diff:  0.30%)
2 CCD     2             100.00%      	100.17% (%diff:  0.17%)
4 CCD     4             100.00%      	 99.60% (%diff: -0.40%)
8 CCD     8             100.00%      	 92.05% (%diff: -7.95%)	*

---

Based on the above data, the results seem to be mostly positive for
the microbenchmarks but not so much for SPECjbb and DeathStarBench,
which have high utilization. There is also a band of outliers when the
system is fully loaded or overloaded (~2 tasks per rq) for some of
the microbenchmarks.

I wonder if extending SIS_UTIL for SIS_NODE would help some of these
cases but I've not tried tinkering with it yet. I'll continue
testing on other NPS modes which would decrease the search scope.
I'll also try running the same bunch of workloads on an even larger
4th Generation EPYC server to see if the behavior there is similar.
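
For anyone wanting to reproduce the comparison, SIS_NODE is a regular
scheduler feature bit, so it can also be flipped at runtime through the
debugfs features file instead of rebuilding (again requires
CONFIG_SCHED_DEBUG; writing "NO_<FEAT>" clears a bit). A minimal sketch,
with the path parameterized only so it can be exercised without root:

```python
def set_sched_feat(feat, enable=True,
                   path="/sys/kernel/debug/sched/features"):
    """Enable or disable a scheduler feature bit, e.g. SIS_NODE."""
    token = feat if enable else f"NO_{feat}"
    with open(path, "w") as f:  # needs root on a real system
        f.write(token)
    return token

# e.g. set_sched_feat("SIS_NODE", enable=False)  # fall back to LLC-only search
```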

Let me know if you need any data from my test system for any
specific workload. I'll be more than happy to get them for you :)

> 
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c     | 38 ++++++++++++++++++++++++++++++++++++++
>  kernel/sched/features.h |  1 +
>  2 files changed, 39 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0c..0172458 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool 
>  }
>  
>  /*
> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> + * local LLC comes up empty.
> + */
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +	struct sched_domain *parent = sd->parent;
> +	struct sched_group *sg;
> +
> +	/* Make sure to not cross nodes. */
> +	if (!parent || parent->flags & SD_NUMA)
> +		return -1;
> +
> +	sg = parent->groups;
> +	do {
> +		int cpu = cpumask_first(sched_group_span(sg));
> +		struct sched_domain *sd_child;
> +
> +		sd_child = per_cpu(sd_llc, cpu);
> +		if (sd_child != sd) {
> +			int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> +			if ((unsigned)i < nr_cpumask_bits)
> +				return i;
> +		}
> +
> +		sg = sg->next;
> +	} while (sg != parent->groups);
> +
> +	return -1;
> +}
> +
> +/*
>   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
>   * the task fits. If no CPU is big enough, but there are idle ones, try to
>   * maximize capacity.
> @@ -7199,6 +7231,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	if ((unsigned)i < nr_cpumask_bits)
>  		return i;
>  
> +	if (sched_feat(SIS_NODE)) {
> +		i = select_idle_node(p, sd, target);
> +		if ((unsigned)i < nr_cpumask_bits)
> +			return i;
> +	}
> +
>  	return target;
>  }
>  
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c..9e390eb 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   */
>  SCHED_FEAT(SIS_PROP, false)
>  SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_NODE, true)
>  
>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls

--
Thanks and Regards,
Prateek
