From: Chen Yu <yu.c.chen@intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>,
	<linux-kernel@vger.kernel.org>,
	<linux-tip-commits@vger.kernel.org>, Tejun Heo <tj@kernel.org>,
	<x86@kernel.org>, Gautham Shenoy <gautham.shenoy@amd.com>,
	Tim Chen <tim.c.chen@intel.com>
Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()
Date: Wed, 21 Jun 2023 15:16:15 +0800
Message-ID: <ZJKjvx/NxooM5z1Y@chenyu5-mobl2.ccr.corp.intel.com>
In-Reply-To: <20230614151348.GM1639749@hirez.programming.kicks-ass.net>

On 2023-06-14 at 17:13:48 +0200, Peter Zijlstra wrote:
> On Wed, Jun 14, 2023 at 10:58:20PM +0800, Chen Yu wrote:
> > On 2023-06-14 at 10:17:57 +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 13, 2023 at 04:00:39PM +0530, K Prateek Nayak wrote:
> > > 
> > > > >> - SIS_NODE_TOPOEXT - tip:sched/core + this patch
> > > > >>                      + new sched domain (Multi-Multi-Core or MMC)
> > > > >> 		     (https://lore.kernel.org/all/20230601153522.GB559993@hirez.programming.kicks-ass.net/)
> > > > >> 		     MMC domain groups 2 nearby CCX.
> > > > > 
> > > > > OK, so you managed to get the NPS4 topology in NPS1 mode?
> > > > 
> > > > Yup! But it is a hack. I'll leave the patch at the end.
> > > 
> > > Chen Yu, could we do the reverse? Instead of building a bigger LLC
> > > domain, can we split our LLC based on SNC (sub-numa-cluster) topologies?
> > >
> > Hi Peter,
> > Do you mean that with SNC enabled, the LLC domain gets smaller?
> > According to the test, the answer seems to be yes.
> 
> No, I mean to build smaller LLC domains even with SNC disabled, as-if
> SNC were active.
> 
>
Per lstopo, a Sapphire Rapids package has 4 memory controllers, and
with SNC disabled the LLC slices have slightly different distances to
those 4 memory controllers. Unfortunately there is no interface for the
OS to query this partitioning. I used a hack to split the LLC into 4
smaller ones with SNC disabled, following the topology exposed in SNC4
mode. Then I tested this platform with/without the LLC split, in both
cases with SIS_NODE enabled and with this issue fixed[1] by skipping
target's own group when iterating the groups in select_idle_node():

if (cpumask_test_cpu(target, sched_group_span(sg)))
	continue;
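
For reference, a minimal sketch of how that check sits in the group
iteration (modeled on the SIS_NODE prototype's select_idle_node(); the
test is inverted here so the group pointer still advances, and this is
a sketch rather than the exact patched code):

static int
select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
{
	struct sched_domain *parent = sd->parent;
	struct sched_group *sg;

	/* Make sure to not cross node boundaries. */
	if (!parent || parent->flags & SD_NUMA)
		return -1;

	sg = parent->groups;
	do {
		/* Skip target's own group; its LLC was already scanned. */
		if (!cpumask_test_cpu(target, sched_group_span(sg))) {
			int cpu = cpumask_first(sched_group_span(sg));
			int i = select_idle_cpu(p, per_cpu(sd_llc, cpu),
						test_idle_cores(cpu), cpu);

			if ((unsigned int)i < nr_cpumask_bits)
				return i;
		}
		sg = sg->next;
	} while (sg != parent->groups);

	return -1;
}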

SIS_NODE by itself should have no impact on the non-LLC-split version
of Sapphire Rapids (without the split, the LLC domain's parent is the
NUMA domain, so select_idle_node() returns early), so the baseline is
vanilla+SIS_NODE.

In summary, a huge improvement was observed in netperf, but hackbench
and schbench regressed at some load points. I'll collect some
schedstats to check the scan depth in the problematic cases.


With SNC disabled and the hack LLC-split patch applied, a new DIE
domain is generated and the LLC is divided into 4 sub-LLC groups:

grep . domain*/{name,flags}
domain0/name:SMT
domain1/name:MC
domain2/name:DIE
domain3/name:NUMA
domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
domain3/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA

cat /proc/schedstat | grep cpu0 -A 4
cpu0 0 0 0 0 0 0 15968391465 3630455022 18084
domain0 00000000,00000000,00000000,00010000,00000000,00000000,00000001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,3fff0000,00000000,00000000,00003fff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,000000ff,ffffffff,ffff0000,00000000,00ffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
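
To read cpu0's masks above (my interpretation): the MC (sub-LLC) span
is 0x3fff twice, i.e. 28 CPUs (14 cores plus their SMT siblings), a
quarter of the 112-CPU package; the DIE span is the full package and
the NUMA span is all 224 CPUs. SD_SHARE_PKG_RESOURCES now stops at the
28-CPU sub-LLC level, so DIE is the parent domain whose 4 sub-LLC
groups select_idle_node() iterates.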


hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            1-groups         1.00 (  3.81)  -100.18 (  0.19)
process-pipe            2-groups         1.00 ( 10.74)  -59.21 (  0.91)
process-pipe            4-groups         1.00 (  5.37)  -56.37 (  0.56)
process-pipe            8-groups         1.00 (  0.36)  +17.11 (  0.82)
process-sockets         1-groups         1.00 (  0.09)  -26.53 (  1.45)
process-sockets         2-groups         1.00 (  0.82)  -26.45 (  0.40)
process-sockets         4-groups         1.00 (  0.21)   -4.09 (  0.19)
process-sockets         8-groups         1.00 (  0.13)   -5.31 (  0.36)
threads-pipe            1-groups         1.00 (  2.14)  -62.87 (  1.11)
threads-pipe            2-groups         1.00 (  3.18)  -55.82 (  1.14)
threads-pipe            4-groups         1.00 (  4.68)  -54.92 (  0.34)
threads-pipe            8-groups         1.00 (  5.08)  +15.81 (  3.08)
threads-sockets         1-groups         1.00 (  2.60)  -18.28 (  6.03)
threads-sockets         2-groups         1.00 (  0.83)  -30.17 (  0.60)
threads-sockets         4-groups         1.00 (  0.16)   -4.15 (  0.27)
threads-sockets         8-groups         1.00 (  0.36)   -5.92 (  0.94)

The 1-group, 2-group, and 4-group cases suffered.

netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  56-threads       1.00 (  2.75)  +10.49 ( 10.88)
TCP_RR                  112-threads      1.00 (  2.39)   -1.88 (  2.82)
TCP_RR                  168-threads      1.00 (  2.05)   +8.31 (  9.73)
TCP_RR                  224-threads      1.00 (  2.32)  +788.25 (  1.94)
TCP_RR                  280-threads      1.00 ( 59.77)  +83.07 ( 12.38)
TCP_RR                  336-threads      1.00 ( 21.61)   -0.22 ( 28.72)
TCP_RR                  392-threads      1.00 ( 31.26)   -0.13 ( 36.11)
TCP_RR                  448-threads      1.00 ( 39.93)   -0.14 ( 45.71)
UDP_RR                  56-threads       1.00 (  5.57)   +2.38 (  7.41)
UDP_RR                  112-threads      1.00 ( 24.53)   +1.51 (  8.43)
UDP_RR                  168-threads      1.00 ( 11.83)   +7.34 ( 20.20)
UDP_RR                  224-threads      1.00 ( 10.55)  +163.81 ( 20.64)
UDP_RR                  280-threads      1.00 ( 11.32)  +176.04 ( 21.83)
UDP_RR                  336-threads      1.00 ( 31.79)  +12.87 ( 37.23)
UDP_RR                  392-threads      1.00 ( 34.06)  +15.64 ( 44.62)
UDP_RR                  448-threads      1.00 ( 59.09)  +14.00 ( 52.93)

The 224-thread and 280-thread cases show good improvement.

tbench
======
case                    load            baseline(std%)  compare%( std%)
loopback                56-threads       1.00 (  0.83)   +1.38 (  1.56)
loopback                112-threads      1.00 (  0.19)   -4.25 (  0.90)
loopback                168-threads      1.00 ( 56.43)  -31.12 (  0.37)
loopback                224-threads      1.00 (  0.28)   -2.50 (  0.44)
loopback                280-threads      1.00 (  0.10)   -1.64 (  0.81)
loopback                336-threads      1.00 (  0.19)   -2.10 (  0.10)
loopback                392-threads      1.00 (  0.13)   -2.15 (  0.39)
loopback                448-threads      1.00 (  0.45)   -2.14 (  0.43)

tbench might be unaffected (the 168-thread result is unstable and can
be ignored).

schbench
========
case                    load            baseline(std%)  compare%( std%)
normal                  1-mthreads       1.00 (  0.42)   -0.59 (  0.72)
normal                  2-mthreads       1.00 (  2.72)   +1.76 (  0.42)
normal                  4-mthreads       1.00 (  0.75)   -1.22 (  1.86)
normal                  8-mthreads       1.00 (  6.44)  -14.56 (  5.64)

The 8-mthread case regressed for schbench.


diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 352f0ce1ece4..ffc44639447e 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -511,6 +511,35 @@ static const struct x86_cpu_id intel_cod_cpu[] = {
 	{}
 };
 
+static unsigned int sub_llc_nr;
+
+static int __init parse_sub_llc(char *str)
+{
+	get_option(&str, &sub_llc_nr);
+
+	return 0;
+}
+early_param("sub_llc_nr", parse_sub_llc);
+
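+/*
+ * sub_llc_nr is the number of APIC IDs per sub-LLC group: CPUs whose
+ * APIC IDs fall into the same block of sub_llc_nr consecutive IDs are
+ * treated as sharing an LLC. The default of 0 keeps the real topology.
+ */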
+static bool
+topology_same_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	int idx1, idx2;
+
+	if (!sub_llc_nr)
+		return true;
+
+	idx1 = c->apicid / sub_llc_nr;
+	idx2 = o->apicid / sub_llc_nr;
+
+	return idx1 == idx2;
+}
+
 static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
 	const struct x86_cpu_id *id = x86_match_cpu(intel_cod_cpu);
@@ -530,7 +554,7 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 	 * means 'c' does not share the LLC of 'o'. This will be
 	 * reflected to userspace.
 	 */
-	if (match_pkg(c, o) && !topology_same_node(c, o) && intel_snc)
+	if (match_pkg(c, o) && (!topology_same_node(c, o) || !topology_same_llc(c, o)) && intel_snc)
 		return false;
 
 	return topology_sane(c, o, "llc");
-- 
2.25.1
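
For completeness, the split is controlled by the sub_llc_nr= early boot
parameter above, which gives the number of APIC IDs per sub-LLC group;
the value has to be chosen to match the package's APIC ID layout so
that each group corresponds to one SNC4 node. Booting with
sub_llc_nr=0, or omitting the parameter, keeps the real LLC topology.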



[1] https://lore.kernel.org/lkml/5903fc0a-787e-9471-0256-77ff66f0bdef@bytedance.com/


