From: Chen Yu <yu.c.chen@intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>,
	<linux-kernel@vger.kernel.org>,
	<linux-tip-commits@vger.kernel.org>, Tejun Heo <tj@kernel.org>,
	<x86@kernel.org>, Gautham Shenoy <gautham.shenoy@amd.com>,
	Tim Chen <tim.c.chen@intel.com>
Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()
Date: Wed, 21 Jun 2023 15:16:15 +0800
Message-ID: <ZJKjvx/NxooM5z1Y@chenyu5-mobl2.ccr.corp.intel.com>
In-Reply-To: <20230614151348.GM1639749@hirez.programming.kicks-ass.net>

On 2023-06-14 at 17:13:48 +0200, Peter Zijlstra wrote:
> On Wed, Jun 14, 2023 at 10:58:20PM +0800, Chen Yu wrote:
> > On 2023-06-14 at 10:17:57 +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 13, 2023 at 04:00:39PM +0530, K Prateek Nayak wrote:
> > > 
> > > > >> - SIS_NODE_TOPOEXT - tip:sched/core + this patch
> > > > >>                      + new sched domain (Multi-Multi-Core or MMC)
> > > > >> 		     (https://lore.kernel.org/all/20230601153522.GB559993@hirez.programming.kicks-ass.net/)
> > > > >> 		     MMC domain groups 2 nearby CCX.
> > > > > 
> > > > > OK, so you managed to get the NPS4 topology in NPS1 mode?
> > > > 
> > > > Yup! But it is a hack. I'll leave the patch at the end.
> > > 
> > > Chen Yu, could we do the reverse? Instead of building a bigger LLC
> > > domain, can we split our LLC based on SNC (sub-numa-cluster) topologies?
> > >
> > Hi Peter,
> > Do you mean that with SNC enabled, the LLC domain gets smaller?
> > According to the test, the answer seems to be yes.
> 
> No, I mean to build smaller LLC domains even with SNC disabled, as-if
> SNC were active.
> 
>
Per lstopo, a Sapphire Rapids package has 4 memory controllers, and
with SNC disabled the LLC slices have slightly different distances to
those 4 memory controllers. Unfortunately there is no interface for the
OS to query this partitioning. I used a hack to split the LLC into 4
smaller ones with SNC disabled, following the topology exposed in SNC4
mode. Then I tested this platform with/without the LLC split, in both
cases with SIS_NODE enabled and with this issue fixed[1] by skipping
target's own group when iterating the groups in select_idle_node():

if (cpumask_test_cpu(target, sched_group_span(sg)))
	continue;
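
For reference, a minimal sketch of how that check sits in the group
iteration (modeled on the SIS_NODE prototype's select_idle_node(); the
test is inverted here so the group pointer still advances, and this is
a sketch rather than the exact patched code):

static int
select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
{
	struct sched_domain *parent = sd->parent;
	struct sched_group *sg;

	/* Make sure to not cross node boundaries. */
	if (!parent || parent->flags & SD_NUMA)
		return -1;

	sg = parent->groups;
	do {
		/* Skip target's own group; its LLC was already scanned. */
		if (!cpumask_test_cpu(target, sched_group_span(sg))) {
			int cpu = cpumask_first(sched_group_span(sg));
			int i = select_idle_cpu(p, per_cpu(sd_llc, cpu),
						test_idle_cores(cpu), cpu);

			if ((unsigned int)i < nr_cpumask_bits)
				return i;
		}
		sg = sg->next;
	} while (sg != parent->groups);

	return -1;
}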

SIS_NODE by itself should have no impact on the non-LLC-split version
of Sapphire Rapids (without the split, the LLC domain's parent is the
NUMA domain, so select_idle_node() returns early), so the baseline is
vanilla+SIS_NODE.

In summary, a huge improvement was observed in netperf, but hackbench
and schbench regressed at some load points. I'll collect some
schedstats to check the scan depth in the problematic cases.


With SNC disabled and the hack LLC-split patch applied, a new DIE
domain is generated and the LLC is divided into 4 sub-LLC groups:

grep . domain*/{name,flags}
domain0/name:SMT
domain1/name:MC
domain2/name:DIE
domain3/name:NUMA
domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
domain3/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA

cat /proc/schedstat | grep cpu0 -A 4
cpu0 0 0 0 0 0 0 15968391465 3630455022 18084
domain0 00000000,00000000,00000000,00010000,00000000,00000000,00000001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,3fff0000,00000000,00000000,00003fff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,000000ff,ffffffff,ffff0000,00000000,00ffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
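
To read cpu0's masks above (my interpretation): the MC (sub-LLC) span
is 0x3fff twice, i.e. 28 CPUs (14 cores plus their SMT siblings), a
quarter of the 112-CPU package; the DIE span is the full package and
the NUMA span is all 224 CPUs. SD_SHARE_PKG_RESOURCES now stops at the
28-CPU sub-LLC level, so DIE is the parent domain whose 4 sub-LLC
groups select_idle_node() iterates.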


hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            1-groups         1.00 (  3.81)  -100.18 (  0.19)
process-pipe            2-groups         1.00 ( 10.74)  -59.21 (  0.91)
process-pipe            4-groups         1.00 (  5.37)  -56.37 (  0.56)
process-pipe            8-groups         1.00 (  0.36)  +17.11 (  0.82)
process-sockets         1-groups         1.00 (  0.09)  -26.53 (  1.45)
process-sockets         2-groups         1.00 (  0.82)  -26.45 (  0.40)
process-sockets         4-groups         1.00 (  0.21)   -4.09 (  0.19)
process-sockets         8-groups         1.00 (  0.13)   -5.31 (  0.36)
threads-pipe            1-groups         1.00 (  2.14)  -62.87 (  1.11)
threads-pipe            2-groups         1.00 (  3.18)  -55.82 (  1.14)
threads-pipe            4-groups         1.00 (  4.68)  -54.92 (  0.34)
threads-pipe            8-groups         1.00 (  5.08)  +15.81 (  3.08)
threads-sockets         1-groups         1.00 (  2.60)  -18.28 (  6.03)
threads-sockets         2-groups         1.00 (  0.83)  -30.17 (  0.60)
threads-sockets         4-groups         1.00 (  0.16)   -4.15 (  0.27)
threads-sockets         8-groups         1.00 (  0.36)   -5.92 (  0.94)

The 1-group, 2-group, and 4-group cases suffered.

netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  56-threads       1.00 (  2.75)  +10.49 ( 10.88)
TCP_RR                  112-threads      1.00 (  2.39)   -1.88 (  2.82)
TCP_RR                  168-threads      1.00 (  2.05)   +8.31 (  9.73)
TCP_RR                  224-threads      1.00 (  2.32)  +788.25 (  1.94)
TCP_RR                  280-threads      1.00 ( 59.77)  +83.07 ( 12.38)
TCP_RR                  336-threads      1.00 ( 21.61)   -0.22 ( 28.72)
TCP_RR                  392-threads      1.00 ( 31.26)   -0.13 ( 36.11)
TCP_RR                  448-threads      1.00 ( 39.93)   -0.14 ( 45.71)
UDP_RR                  56-threads       1.00 (  5.57)   +2.38 (  7.41)
UDP_RR                  112-threads      1.00 ( 24.53)   +1.51 (  8.43)
UDP_RR                  168-threads      1.00 ( 11.83)   +7.34 ( 20.20)
UDP_RR                  224-threads      1.00 ( 10.55)  +163.81 ( 20.64)
UDP_RR                  280-threads      1.00 ( 11.32)  +176.04 ( 21.83)
UDP_RR                  336-threads      1.00 ( 31.79)  +12.87 ( 37.23)
UDP_RR                  392-threads      1.00 ( 34.06)  +15.64 ( 44.62)
UDP_RR                  448-threads      1.00 ( 59.09)  +14.00 ( 52.93)

The 224-thread and 280-thread cases show good improvement.

tbench
======
case                    load            baseline(std%)  compare%( std%)
loopback                56-threads       1.00 (  0.83)   +1.38 (  1.56)
loopback                112-threads      1.00 (  0.19)   -4.25 (  0.90)
loopback                168-threads      1.00 ( 56.43)  -31.12 (  0.37)
loopback                224-threads      1.00 (  0.28)   -2.50 (  0.44)
loopback                280-threads      1.00 (  0.10)   -1.64 (  0.81)
loopback                336-threads      1.00 (  0.19)   -2.10 (  0.10)
loopback                392-threads      1.00 (  0.13)   -2.15 (  0.39)
loopback                448-threads      1.00 (  0.45)   -2.14 (  0.43)

tbench might be unaffected (the 168-thread result is unstable and can
be ignored).

schbench
========
case                    load            baseline(std%)  compare%( std%)
normal                  1-mthreads       1.00 (  0.42)   -0.59 (  0.72)
normal                  2-mthreads       1.00 (  2.72)   +1.76 (  0.42)
normal                  4-mthreads       1.00 (  0.75)   -1.22 (  1.86)
normal                  8-mthreads       1.00 (  6.44)  -14.56 (  5.64)

The 8-mthread case regressed for schbench.


diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 352f0ce1ece4..ffc44639447e 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -511,6 +511,35 @@ static const struct x86_cpu_id intel_cod_cpu[] = {
 	{}
 };
 
+static unsigned int sub_llc_nr;
+
+static int __init parse_sub_llc(char *str)
+{
+	get_option(&str, &sub_llc_nr);
+
+	return 0;
+}
+early_param("sub_llc_nr", parse_sub_llc);
+
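+/*
+ * sub_llc_nr is the number of APIC IDs per sub-LLC group: CPUs whose
+ * APIC IDs fall into the same block of sub_llc_nr consecutive IDs are
+ * treated as sharing an LLC. The default of 0 keeps the real topology.
+ */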
+static bool
+topology_same_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	int idx1, idx2;
+
+	if (!sub_llc_nr)
+		return true;
+
+	idx1 = c->apicid / sub_llc_nr;
+	idx2 = o->apicid / sub_llc_nr;
+
+	return idx1 == idx2;
+}
+
 static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
 	const struct x86_cpu_id *id = x86_match_cpu(intel_cod_cpu);
@@ -530,7 +554,7 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 	 * means 'c' does not share the LLC of 'o'. This will be
 	 * reflected to userspace.
 	 */
-	if (match_pkg(c, o) && !topology_same_node(c, o) && intel_snc)
+	if (match_pkg(c, o) && (!topology_same_node(c, o) || !topology_same_llc(c, o)) && intel_snc)
 		return false;
 
 	return topology_sane(c, o, "llc");
-- 
2.25.1
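
For completeness, the split is controlled by the sub_llc_nr= early boot
parameter above, which gives the number of APIC IDs per sub-LLC group;
the value has to be chosen to match the package's APIC ID layout so
that each group corresponds to one SNC4 node. Booting with
sub_llc_nr=0, or omitting the parameter, keeps the real LLC topology.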



[1] https://lore.kernel.org/lkml/5903fc0a-787e-9471-0256-77ff66f0bdef@bytedance.com/


