* [RFC PATCH 0/2] sched/topology: Get rid of overlapping groups
From: Valentin Schneider @ 2021-02-03 15:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: vincent.guittot, mgorman, mingo, peterz, dietmar.eggemann,
	morten.rasmussen, linuxarm, xuwei5, liguozhu, tiantao6,
	wanghuiqiang, prime.zeng, jonathan.cameron, guodong.xu,
	Barry Song, Meelis Roos

Hi folks,

My keyboard still hasn't been taken away from me, so I went down and did what I
was rambling about in [1].

The preemptive degeneration check is the worst wart, but I think I'll have to
sleep on it before I can figure out a better way.

Something tells me the increase in the number of groups isn't going to make me
any friends, so I've included 3/2, which is Barry's approach (with the subset
check condensed into a single spot).

Cheers,
Valentin

[1]: http://lore.kernel.org/r/jhj4kiu4hz8.mognet@arm.com

Valentin Schneider (2):
  sched/topology: Get rid of NUMA overlapping groups
  Revert "sched/topology: Warn when NUMA diameter > 2"

 kernel/sched/topology.c | 77 +++++++++++++++++++++++------------------
 1 file changed, 43 insertions(+), 34 deletions(-)

--
2.27.0



* [RFC PATCH 1/2] sched/topology: Get rid of NUMA overlapping groups
From: Valentin Schneider @ 2021-02-03 15:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: vincent.guittot, mgorman, mingo, peterz, dietmar.eggemann,
	morten.rasmussen, linuxarm, xuwei5, liguozhu, tiantao6,
	wanghuiqiang, prime.zeng, jonathan.cameron, guodong.xu,
	Barry Song, Meelis Roos

As pointed out in commit

  b5b217346de8 ("sched/topology: Warn when NUMA diameter > 2")

overlapping groups result in broken topology data structures whenever the
underlying system has a NUMA diameter greater than 2. This stems from
overlapping groups being built from sibling domains' spans, yielding bogus
transitivity relations of the form:

  distance(A, B) <= 30 && distance(B, C) <= 20
    =>
  distance(A, C) <= 30
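
Taking the smallest diameter=3 topology from the above commit as an example
(a 0 --- 1 --- 2 --- 3 chain with distances 10/20/30/40): node0's
distance <= 30 domain spans 0-2, yet one of its groups is built from node2's
distance <= 20 child span, i.e. {1-3}, which drags in node3 even though
distance(0, 3) = 40.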

As discussed with Barry, a feasible approach is to catch bogus overlapping
groups and fix them after the fact [1].

A more proactive approach would be to prevent the aforementioned bogus
relations from being built altogether, which implies departing from the
"group span is sibling domain child's span" strategy. Said strategy only
works for diameter <= 2, which fortunately or unfortunately is currently
the most common case.

The chosen approach is, for NUMA domains:
a) have the local group be the child domain's span, as before
b) have all remote groups span only their respective node

This boils down to getting rid of overlapping groups.
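
(In the Epyc example below, this means CPU0's top NUMA domain ends up with
groups { 0-3 }, { 4 }, { 5 }, { 6 }, { 7 } rather than { 0-3 }, { 4-7 }.)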

Note that b) requires introducing cross sched_domain_topology_level
references for sched_group_capacity. This is a somewhat prickly matter as
we need to ensure whichever group we hook into won't see its domain
degenerated (which was never an issue when such references were bounded
within a single topology level).

This lifts the NUMA diameter restriction, although it yields more groups in
the NUMA domains. As an example, here is the distance matrix for
an AMD Epyc:

  node   0   1   2   3   4   5   6   7
    0:  10  16  16  16  32  32  32  32
    1:  16  10  16  16  32  32  32  32
    2:  16  16  10  16  32  32  32  32
    3:  16  16  16  10  32  32  32  32
    4:  32  32  32  32  10  16  16  16
    5:  32  32  32  32  16  10  16  16
    6:  32  32  32  32  16  16  10  16
    7:  32  32  32  32  16  16  16  10
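
One way to reproduce such a distance matrix under QEMU is along these lines
(illustrative invocation only, not necessarily the exact one used; memory
sizing is arbitrary and the remaining node-pair distances are set the same
way):

  qemu-system-x86_64 -smp 8 -m 8G \
    -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1 \
    -numa node,cpus=2,nodeid=2 -numa node,cpus=3,nodeid=3 \
    -numa node,cpus=4,nodeid=4 -numa node,cpus=5,nodeid=5 \
    -numa node,cpus=6,nodeid=6 -numa node,cpus=7,nodeid=7 \
    -numa dist,src=0,dst=1,val=16 -numa dist,src=0,dst=4,val=32 \
    ...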

Emulating this on QEMU yields, before the patch:
  [    0.386745] CPU0 attaching sched-domain(s):
  [    0.386969]  domain-0: span=0-3 level=NUMA
  [    0.387708]   groups: 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1007 }, 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 }
  [    0.388505]   domain-1: span=0-7 level=NUMA
  [    0.388700]    groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
  [    0.389861] CPU1 attaching sched-domain(s):
  [    0.390020]  domain-0: span=0-3 level=NUMA
  [    0.390200]   groups: 1:{ span=1 cap=1007 }, 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 }
  [    0.390701]   domain-1: span=0-7 level=NUMA
  [    0.390874]    groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
  [    0.391460] CPU2 attaching sched-domain(s):
  [    0.391664]  domain-0: span=0-3 level=NUMA
  [    0.392750]   groups: 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1007 }
  [    0.393672]   domain-1: span=0-7 level=NUMA
  [    0.393961]    groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
  [    0.394645] CPU3 attaching sched-domain(s):
  [    0.394792]  domain-0: span=0-3 level=NUMA
  [    0.394961]   groups: 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1007 }, 2:{ span=2 cap=1007 }
  [    0.395749]   domain-1: span=0-7 level=NUMA
  [    0.396098]    groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
  [    0.396455] CPU4 attaching sched-domain(s):
  [    0.396603]  domain-0: span=4-7 level=NUMA
  [    0.396771]   groups: 4:{ span=4 cap=1001 }, 5:{ span=5 cap=1004 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=1006 }
  [    0.397274]   domain-1: span=0-7 level=NUMA
  [    0.397454]    groups: 4:{ span=4-7 cap=4014 }, 0:{ span=0-3 cap=4020 }
  [    0.397801] CPU5 attaching sched-domain(s):
  [    0.397945]  domain-0: span=4-7 level=NUMA
  [    0.398110]   groups: 5:{ span=5 cap=1004 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=1006 }, 4:{ span=4 cap=1001 }
  [    0.398605]   domain-1: span=0-7 level=NUMA
  [    0.398773]    groups: 4:{ span=4-7 cap=4014 }, 0:{ span=0-3 cap=4020 }
  [    0.399109] CPU6 attaching sched-domain(s):
  [    0.399253]  domain-0: span=4-7 level=NUMA
  [    0.399418]   groups: 6:{ span=6 cap=1003 }, 7:{ span=7 cap=1006 }, 4:{ span=4 cap=1001 }, 5:{ span=5 cap=1004 }
  [    0.400562]   domain-1: span=0-7 level=NUMA
  [    0.400741]    groups: 4:{ span=4-7 cap=4020 }, 0:{ span=0-3 cap=4020 }
  [    0.401083] CPU7 attaching sched-domain(s):
  [    0.401231]  domain-0: span=4-7 level=NUMA
  [    0.401395]   groups: 7:{ span=7 cap=1006 }, 4:{ span=4 cap=1004 }, 5:{ span=5 cap=1007 }, 6:{ span=6 cap=1003 }
  [    0.401906]   domain-1: span=0-7 level=NUMA
  [    0.402076]    groups: 4:{ span=4-7 cap=4020 }, 0:{ span=0-3 cap=4020 }
  [    0.402437] root domain span: 0-7 (max cpu_capacity = 1024)

with the patch:
  [    0.367436] CPU0 attaching sched-domain(s):
  [    0.368064]  domain-0: span=0-3 level=NUMA
  [    0.368614]   groups: 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1002 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }
  [    0.369490]   domain-1: span=0-7 level=NUMA
  [    0.369682]    groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
  [    0.371132] CPU1 attaching sched-domain(s):
  [    0.371290]  domain-0: span=0-3 level=NUMA
  [    0.371462]   groups: 1:{ span=1 cap=1002 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }, 0:{ span=0 cap=1012 }
  [    0.372720]   domain-1: span=0-7 level=NUMA
  [    0.372906]    groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
  [    0.373678] CPU2 attaching sched-domain(s):
  [    0.373833]  domain-0: span=0-3 level=NUMA
  [    0.374006]   groups: 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1002 }
  [    0.374516]   domain-1: span=0-7 level=NUMA
  [    0.374689]    groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
  [    0.375337] CPU3 attaching sched-domain(s):
  [    0.375491]  domain-0: span=0-3 level=NUMA
  [    0.375666]   groups: 3:{ span=3 cap=993 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1002 }, 2:{ span=2 cap=1000 }
  [    0.376639]   domain-1: span=0-7 level=NUMA
  [    0.376818]    groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
  [    0.377465] CPU4 attaching sched-domain(s):
  [    0.377616]  domain-0: span=4-7 level=NUMA
  [    0.377844]   groups: 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
  [    0.378445]   domain-1: span=0-7 level=NUMA
  [    0.378622]    groups: 4:{ span=4-7 cap=3995 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }
  [    0.379296] CPU5 attaching sched-domain(s):
  [    0.379453]  domain-0: span=4-7 level=NUMA
  [    0.379629]   groups: 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }, 4:{ span=4 cap=991 }
  [    0.380499]   domain-1: span=0-7 level=NUMA
  [    0.380800]    groups: 4:{ span=4-7 cap=4001 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=998 }
  [    0.381475] CPU6 attaching sched-domain(s):
  [    0.381641]  domain-0: span=4-7 level=NUMA
  [    0.381882]   groups: 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }, 4:{ span=4 cap=997 }, 5:{ span=5 cap=1003 }
  [    0.382419]   domain-1: span=0-7 level=NUMA
  [    0.382594]    groups: 4:{ span=4-7 cap=4001 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1004 }, 3:{ span=3 cap=998 }
  [    0.383253] CPU7 attaching sched-domain(s):
  [    0.383407]  domain-0: span=4-7 level=NUMA
  [    0.383584]   groups: 7:{ span=7 cap=998 }, 4:{ span=4 cap=997 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }
  [    0.384089]   domain-1: span=0-7 level=NUMA
  [    0.384516]    groups: 4:{ span=4-7 cap=4001 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1004 }, 3:{ span=3 cap=998 }
  [    0.385503] root domain span: 0-7 (max cpu_capacity = 1024)

IOW, this does more than double the number of groups at higher domains, the
impact of which has yet to be measured.

XXX: this only changes the group generation; actually removing SD_OVERLAP
and all the related fluff is yet to be done.

[1]: http://lore.kernel.org/r/20210201033830.15040-1-song.bao.hua@hisilicon.com

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/topology.c | 44 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 09d35044bd88..a8f69f234258 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -982,6 +982,41 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
 }
 
+static struct sched_domain *find_node_domain(struct sched_domain *sd)
+{
+	struct sched_domain *parent;
+
+	BUG_ON(!(sd->flags & SD_NUMA));
+
+	/* Get to the level above NODE */
+	while (sd && sd->child) {
+		parent = sd;
+		sd = sd->child;
+
+		if (!(sd->flags & SD_NUMA))
+			break;
+	}
+	/*
+	 * We're going to create cross topology level sched_group_capacity
+	 * references. This can only work if the domains resulting from said
+	 * levels won't be degenerated, as we need said sgc to be periodically
+	 * updated: it needs to be attached to the local group of a domain
+	 * that didn't get degenerated.
+	 *
+	 * Of course, groups aren't available yet, so we can't call the usual
+	 * sd_degenerate(). Checking domain spans is the closest we get.
+	 * Start from NODE's parent, and keep going up until we get a domain
+	 * we're sure won't be degenerated.
+	 */
+	while (sd->parent &&
+	       cpumask_equal(sched_domain_span(sd), sched_domain_span(parent))) {
+		sd = parent;
+		parent = sd->parent;
+	}
+
+	return parent;
+}
+
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
@@ -1015,6 +1050,13 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 
+		/*
+		 * Local group is child domain's span, as is tradition.
+		 * Non-local groups will only span remote nodes.
+		 */
+		if (first)
+			sibling = find_node_domain(sibling);
+
 		sg = build_group_from_child_sched_domain(sibling, cpu);
 		if (!sg)
 			goto fail;
@@ -1022,7 +1064,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg);
+		init_overlap_sched_group(sibling, sg);
 
 		if (!first)
 			first = sg;
-- 
2.27.0



* [RFC PATCH 2/2] Revert "sched/topology: Warn when NUMA diameter > 2"
From: Valentin Schneider @ 2021-02-03 15:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: vincent.guittot, mgorman, mingo, peterz, dietmar.eggemann,
	morten.rasmussen, linuxarm, xuwei5, liguozhu, tiantao6,
	wanghuiqiang, prime.zeng, jonathan.cameron, guodong.xu,
	Barry Song, Meelis Roos

The scheduler topology code can now figure out what to do with such
topologies.

This reverts commit b5b217346de85ed1b03fdecd5c5076b34fbb2f0b.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/topology.c | 33 ---------------------------------
 1 file changed, 33 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a8f69f234258..0fa41aab74e0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -688,7 +688,6 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	struct sched_domain *tmp;
-	int numa_distance = 0;
 
 	/* Remove the sched domains which do not contribute to scheduling. */
 	for (tmp = sd; tmp; ) {
@@ -720,38 +719,6 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 			sd->child = NULL;
 	}
 
-	for (tmp = sd; tmp; tmp = tmp->parent)
-		numa_distance += !!(tmp->flags & SD_NUMA);
-
-	/*
-	 * FIXME: Diameter >=3 is misrepresented.
-	 *
-	 * Smallest diameter=3 topology is:
-	 *
-	 *   node   0   1   2   3
-	 *     0:  10  20  30  40
-	 *     1:  20  10  20  30
-	 *     2:  30  20  10  20
-	 *     3:  40  30  20  10
-	 *
-	 *   0 --- 1 --- 2 --- 3
-	 *
-	 * NUMA-3	0-3		N/A		N/A		0-3
-	 *  groups:	{0-2},{1-3}					{1-3},{0-2}
-	 *
-	 * NUMA-2	0-2		0-3		0-3		1-3
-	 *  groups:	{0-1},{1-3}	{0-2},{2-3}	{1-3},{0-1}	{2-3},{0-2}
-	 *
-	 * NUMA-1	0-1		0-2		1-3		2-3
-	 *  groups:	{0},{1}		{1},{2},{0}	{2},{3},{1}	{3},{2}
-	 *
-	 * NUMA-0	0		1		2		3
-	 *
-	 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
-	 * group span isn't a subset of the domain span.
-	 */
-	WARN_ONCE(numa_distance > 2, "Shortest NUMA path spans too many nodes\n");
-
 	sched_domain_debug(sd, cpu);
 
 	rq_attach_root(rq, rd);
-- 
2.27.0



* [RFC PATCH 3/2] sched/topology: Alternative diameter >= 2 fixup
From: Valentin Schneider @ 2021-02-03 15:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: vincent.guittot, mgorman, mingo, peterz, dietmar.eggemann,
	morten.rasmussen, linuxarm, xuwei5, liguozhu, tiantao6,
	wanghuiqiang, prime.zeng, jonathan.cameron, guodong.xu,
	Barry Song, Meelis Roos

This follows Barry's approach: walk down the sibling domain until its span is
a subset of the current domain's span. It yields fewer groups in the higher
domains than the approach in 1/2.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/topology.c | 42 +++++++++++++++++++----------------------
 1 file changed, 19 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 0fa41aab74e0..e1bb97c06f53 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -949,20 +949,21 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
 }
 
-static struct sched_domain *find_node_domain(struct sched_domain *sd)
+static struct sched_domain *
+find_sibling_domain(struct sched_domain *sd, struct sched_domain *sibling)
 {
-	struct sched_domain *parent;
-
 	BUG_ON(!(sd->flags & SD_NUMA));
 
-	/* Get to the level above NODE */
-	while (sd && sd->child) {
-		parent = sd;
-		sd = sd->child;
+	/* Find a subset sibling */
+	while (sibling->child &&
+	       !cpumask_subset(sched_domain_span(sibling->child),
+			       sched_domain_span(sd)))
+		sibling = sibling->child;
+
+	/* If the above loop was a no-op, we're done */
+	if (sd->private == sibling->private)
+		return sibling;
 
-		if (!(sd->flags & SD_NUMA))
-			break;
-	}
 	/*
 	 * We're going to create cross topology level sched_group_capacity
 	 * references. This can only work if the domains resulting from said
@@ -972,16 +973,14 @@ static struct sched_domain *find_node_domain(struct sched_domain *sd)
 	 *
 	 * Of course, groups aren't available yet, so we can't call the usual
 	 * sd_degenerate(). Checking domain spans is the closest we get.
-	 * Start from NODE's parent, and keep going up until we get a domain
-	 * we're sure won't be degenerated.
+	 * We can't go up as per the above subset check, so keep going down
+	 * until we get a domain we're sure won't be degenerated.
 	 */
-	while (sd->parent &&
-	       cpumask_equal(sched_domain_span(sd), sched_domain_span(parent))) {
-		sd = parent;
-		parent = sd->parent;
-	}
+	while (sibling->child &&
+	       cpumask_equal(sched_domain_span(sibling->child), sched_domain_span(sibling)))
+		sibling = sibling->child;
 
-	return parent;
+	return sibling;
 }
 
 static int
@@ -1017,12 +1016,9 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 
-		/*
-		 * Local group is child domain's span, as is tradition.
-		 * Non-local groups will only span remote nodes.
-		 */
+		/* Ensure sibling domain's span is a subset of this domain */
 		if (first)
-			sibling = find_node_domain(sibling);
+			sibling = find_sibling_domain(sd, sibling);
 
 		sg = build_group_from_child_sched_domain(sibling, cpu);
 		if (!sg)
-- 
2.27.0



* RE: [RFC PATCH 1/2] sched/topology: Get rid of NUMA overlapping groups
From: Song Bao Hua (Barry Song) @ 2021-02-08 10:04 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: vincent.guittot, mgorman, mingo, peterz, dietmar.eggemann,
	morten.rasmussen, linuxarm, xuwei (O), Liguozhu (Kenneth),
	tiantao (H), wanghuiqiang, Zengtao (B),
	Jonathan Cameron, guodong.xu, Meelis Roos



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Thursday, February 4, 2021 4:55 AM
> To: linux-kernel@vger.kernel.org
> Cc: vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org;
> peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com;
> linuxarm@openeuler.org; xuwei (O) <xuwei5@huawei.com>; Liguozhu (Kenneth)
> <liguozhu@hisilicon.com>; tiantao (H) <tiantao6@hisilicon.com>; wanghuiqiang
> <wanghuiqiang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Jonathan
> Cameron <jonathan.cameron@huawei.com>; guodong.xu@linaro.org; Song Bao Hua
> (Barry Song) <song.bao.hua@hisilicon.com>; Meelis Roos <mroos@linux.ee>
> Subject: [RFC PATCH 1/2] sched/topology: Get rid of NUMA overlapping groups
> 
> As pointed out in commit
> 
>   b5b217346de8 ("sched/topology: Warn when NUMA diameter > 2")
> 
> overlapping groups result in broken topology data structures whenever the
> underlying system has a NUMA diameter greater than 2. This stems from
> overlapping groups being built from sibling domains' spans, yielding bogus
> transitivity relations of the form:
> 
>   distance(A, B) <= 30 && distance(B, C) <= 20
>     =>
>   distance(A, C) <= 30
> 
> As discussed with Barry, a feasible approach is to catch bogus overlapping
> groups and fix them after the fact [1].
> 
> A more proactive approach would be to prevent the aforementioned bogus
> relations from being built altogether, which implies departing from the
> "group span is sibling domain child's span" strategy. Said strategy only
> works for diameter <= 2, which fortunately or unfortunately is currently
> the most common case.
> 
> The chosen approach is, for NUMA domains:
> a) have the local group be the child domain's span, as before
> b) have all remote groups span only their respective node
> 
> This boils down to getting rid of overlapping groups.
> 

Hi Valentin,

While I like your approach, it will require more time to evaluate its
possible influence, as it also affects machines without the 3-hops issue.
So x86 platforms need to be tested and benchmarks are required.

What about we first finish the review of the "grandchild" approach v2 and
get a solution for kunpeng920 and Sun Fire X4600-M2, without impacting the
machines which don't have the 3-hops issue?

I would appreciate very much if you could comment on v2:
https://lore.kernel.org/lkml/20210203111201.20720-1-song.bao.hua@hisilicon.com/


> Note that b) requires introducing cross sched_domain_topology_level
> references for sched_group_capacity. This is a somewhat prickly matter as
> we need to ensure whichever group we hook into won't see its domain
> degenerated (which was never an issue when such references were bounded
> within a single topology level).
> 
> This lifts the NUMA diameter restriction, although it yields more groups in
> the NUMA domains. As an example, here is the distance matrix for
> an AMD Epyc:
> 
>   node   0   1   2   3   4   5   6   7
>     0:  10  16  16  16  32  32  32  32
>     1:  16  10  16  16  32  32  32  32
>     2:  16  16  10  16  32  32  32  32
>     3:  16  16  16  10  32  32  32  32
>     4:  32  32  32  32  10  16  16  16
>     5:  32  32  32  32  16  10  16  16
>     6:  32  32  32  32  16  16  10  16
>     7:  32  32  32  32  16  16  16  10
> 
> [...]
> 
> IOW, this does more than double the number of groups at higher domains, the
> impact of which has yet to be measured.
> 
> XXX: this only changes the group generation; actually removing SD_OVERLAP
> and all the related fluff is yet to be done.
> 
> [1]:
> http://lore.kernel.org/r/20210201033830.15040-1-song.bao.hua@hisilicon.com
> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/topology.c | 44 ++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 43 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 09d35044bd88..a8f69f234258 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -982,6 +982,41 @@ static void init_overlap_sched_group(struct sched_domain
> *sd,
>  	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
>  }
> 
> +static struct sched_domain *find_node_domain(struct sched_domain *sd)
> +{
> +	struct sched_domain *parent;
> +
> +	BUG_ON(!(sd->flags & SD_NUMA));
> +
> +	/* Get to the level above NODE */
> +	while (sd && sd->child) {
> +		parent = sd;
> +		sd = sd->child;
> +
> +		if (!(sd->flags & SD_NUMA))
> +			break;
> +	}
> +	/*
> +	 * We're going to create cross topology level sched_group_capacity
> +	 * references. This can only work if the domains resulting from said
> +	 * levels won't be degenerated, as we need said sgc to be periodically
> +	 * updated: it needs to be attached to the local group of a domain
> +	 * that didn't get degenerated.
> +	 *
> +	 * Of course, groups aren't available yet, so we can't call the usual
> +	 * sd_degenerate(). Checking domain spans is the closest we get.
> +	 * Start from NODE's parent, and keep going up until we get a domain
> +	 * we're sure won't be degenerated.
> +	 */
> +	while (sd->parent &&
> +	       cpumask_equal(sched_domain_span(sd), sched_domain_span(parent))) {
> +		sd = parent;
> +		parent = sd->parent;
> +	}

So this is because a sched_domain which doesn't contribute to scheduling
will be destroyed during cpu_attach_domain(), since sd and parent span
the same mask?

> +
> +	return parent;
> +}
> +
>  static int
>  build_overlap_sched_groups(struct sched_domain *sd, int cpu)
>  {
> @@ -1015,6 +1050,13 @@ build_overlap_sched_groups(struct sched_domain *sd, int
> cpu)
>  		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
>  			continue;
> 
> +		/*
> +		 * Local group is child domain's span, as is tradition.
> +		 * Non-local groups will only span remote nodes.
> +		 */
> +		if (first)
> +			sibling = find_node_domain(sibling);
> +
>  		sg = build_group_from_child_sched_domain(sibling, cpu);
>  		if (!sg)
>  			goto fail;
> @@ -1022,7 +1064,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int
> cpu)
>  		sg_span = sched_group_span(sg);
>  		cpumask_or(covered, covered, sg_span);
> 
> -		init_overlap_sched_group(sd, sg);
> +		init_overlap_sched_group(sibling, sg);
> 
>  		if (!first)
>  			first = sg;
> --
> 2.27.0

Thanks
Barry



* RE: [RFC PATCH 2/2] Revert "sched/topology: Warn when NUMA diameter > 2"
From: Song Bao Hua (Barry Song) @ 2021-02-08 10:27 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: vincent.guittot, mgorman, mingo, peterz, dietmar.eggemann,
	morten.rasmussen, linuxarm, xuwei (O), Liguozhu (Kenneth),
	tiantao (H), wanghuiqiang, Zengtao (B),
	Jonathan Cameron, guodong.xu, Meelis Roos



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Thursday, February 4, 2021 4:55 AM
> To: linux-kernel@vger.kernel.org
> Cc: vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org;
> peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com;
> linuxarm@openeuler.org; xuwei (O) <xuwei5@huawei.com>; Liguozhu (Kenneth)
> <liguozhu@hisilicon.com>; tiantao (H) <tiantao6@hisilicon.com>; wanghuiqiang
> <wanghuiqiang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Jonathan
> Cameron <jonathan.cameron@huawei.com>; guodong.xu@linaro.org; Song Bao Hua
> (Barry Song) <song.bao.hua@hisilicon.com>; Meelis Roos <mroos@linux.ee>
> Subject: [RFC PATCH 2/2] Revert "sched/topology: Warn when NUMA diameter > 2"
> 
> The scheduler topology code can now figure out what to do with such
> topologies.
> 
> This reverts commit b5b217346de85ed1b03fdecd5c5076b34fbb2f0b.
> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Yes, this is fine. I have actually seen some other problems we need
to consider.

The current code is probably well consolidated for machines with
2 hops or less. Thus, even after we fix the 3-hops span issue, I
can still see some other issues.

For example, if we change the sd flags and remove the SD_BALANCE
flags for the last hops in sd_init(), we are able to see a large
score increase in unixbench:

		if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance ||
			is_3rd_hops_domain(...)) {
			sd->flags &= ~(SD_BALANCE_EXEC |
				       SD_BALANCE_FORK |
				       SD_WAKE_AFFINE);
		}

So I guess something needs to be tuned for machines with 3 hops or more.

But we need a kernel which has the fix for the 3-hops issue before we can
do more work.

> ---
>  kernel/sched/topology.c | 33 ---------------------------------
>  1 file changed, 33 deletions(-)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index a8f69f234258..0fa41aab74e0 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -688,7 +688,6 @@ cpu_attach_domain(struct sched_domain *sd, struct
> root_domain *rd, int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
>  	struct sched_domain *tmp;
> -	int numa_distance = 0;
> 
>  	/* Remove the sched domains which do not contribute to scheduling. */
>  	for (tmp = sd; tmp; ) {
> @@ -720,38 +719,6 @@ cpu_attach_domain(struct sched_domain *sd, struct
> root_domain *rd, int cpu)
>  			sd->child = NULL;
>  	}
> 
> -	for (tmp = sd; tmp; tmp = tmp->parent)
> -		numa_distance += !!(tmp->flags & SD_NUMA);
> -
> -	/*
> -	 * FIXME: Diameter >=3 is misrepresented.
> -	 *
> -	 * Smallest diameter=3 topology is:
> -	 *
> -	 *   node   0   1   2   3
> -	 *     0:  10  20  30  40
> -	 *     1:  20  10  20  30
> -	 *     2:  30  20  10  20
> -	 *     3:  40  30  20  10
> -	 *
> -	 *   0 --- 1 --- 2 --- 3
> -	 *
> -	 * NUMA-3	0-3		N/A		N/A		0-3
> -	 *  groups:	{0-2},{1-3}					{1-3},{0-2}
> -	 *
> -	 * NUMA-2	0-2		0-3		0-3		1-3
> -	 *  groups:	{0-1},{1-3}	{0-2},{2-3}	{1-3},{0-1}	{2-3},{0-2}
> -	 *
> -	 * NUMA-1	0-1		0-2		1-3		2-3
> -	 *  groups:	{0},{1}		{1},{2},{0}	{2},{3},{1}	{3},{2}
> -	 *
> -	 * NUMA-0	0		1		2		3
> -	 *
> -	 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
> -	 * group span isn't a subset of the domain span.
> -	 */
> -	WARN_ONCE(numa_distance > 2, "Shortest NUMA path spans too many nodes\n");
> -
>  	sched_domain_debug(sd, cpu);
> 
>  	rq_attach_root(rq, rd);
> --
> 2.27.0

Thanks
Barry



* RE: [RFC PATCH 1/2] sched/topology: Get rid of NUMA overlapping groups
From: Valentin Schneider @ 2021-02-08 11:47 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song), linux-kernel
  Cc: vincent.guittot, mgorman, mingo, peterz, dietmar.eggemann,
	morten.rasmussen, linuxarm, xuwei (O), Liguozhu (Kenneth),
	tiantao (H), wanghuiqiang, Zengtao (B),
	Jonathan Cameron, guodong.xu, Meelis Roos

Hi Barry,

On 08/02/21 10:04, Song Bao Hua (Barry Song) wrote:
>> -----Original Message-----
>> From: Valentin Schneider [mailto:valentin.schneider@arm.com]

>
> Hi Valentin,
>
> While I like your approach, it will require more time to evaluate its
> possible influence, as it also affects machines without the 3-hops issue.
> So x86 platforms need to be tested and benchmarks are required.
>
> What about we first finish the review of the "grandchild" approach v2 and
> get a solution for kunpeng920 and Sun Fire X4600-M2, without impacting the
> machines which don't have the 3-hops issue?
>

I figured I'd toss this out while the iron was hot (and I had the topology
crud paged in), but I ultimately agree that it's better to first go with
something that fixes the diameter > 2 topologies and leaves the other ones
untouched, which is exactly what you have.

> I would appreciate very much if you could comment on v2:
> https://lore.kernel.org/lkml/20210203111201.20720-1-song.bao.hua@hisilicon.com/
>

See my comment below on domain degeneration; with that taken care of I
would say it's good to go. Have a look at what patch1+patch3 squashed
together looks like; passing the right sd to init_overlap_sched_group()
looks a bit neater IMO.

>> +static struct sched_domain *find_node_domain(struct sched_domain *sd)
>> +{
>> +	struct sched_domain *parent;
>> +
>> +	BUG_ON(!(sd->flags & SD_NUMA));
>> +
>> +	/* Get to the level above NODE */
>> +	while (sd && sd->child) {
>> +		parent = sd;
>> +		sd = sd->child;
>> +
>> +		if (!(sd->flags & SD_NUMA))
>> +			break;
>> +	}
>> +	/*
>> +	 * We're going to create cross topology level sched_group_capacity
>> +	 * references. This can only work if the domains resulting from said
>> +	 * levels won't be degenerated, as we need said sgc to be periodically
>> +	 * updated: it needs to be attached to the local group of a domain
>> +	 * that didn't get degenerated.
>> +	 *
>> +	 * Of course, groups aren't available yet, so we can't call the usual
>> +	 * sd_degenerate(). Checking domain spans is the closest we get.
>> +	 * Start from NODE's parent, and keep going up until we get a domain
>> +	 * we're sure won't be degenerated.
>> +	 */
>> +	while (sd->parent &&
>> +	       cpumask_equal(sched_domain_span(sd), sched_domain_span(parent))) {
>> +		sd = parent;
>> +		parent = sd->parent;
>> +	}
>
> So this is because a sched_domain which doesn't contribute to scheduling
> will be destroyed during cpu_attach_domain(), since sd and parent span
> the same mask?
>

Yes; let's take your topology for instance:

node   0   1   2   3
    0:  10  12  20  22
    1:  12  10  22  24
    2:  20  22  10  12
    3:  22  24  12  10

      2       10      2
  0 <---> 1 <---> 2 <---> 3


Domains for node1 will look like (before any fixes are applied):

NUMA<=10: span=1   groups=(1)
NUMA<=12: span=0-1 groups=(1)->(0)
NUMA<=20: span=0-1 groups=(0,1)
NUMA<=22: span=0-2 groups=(0,1)->(0,2-3)
NUMA<=24: span=0-3 groups=(0-2)->(0,2-3)

As you can see, the domain representing distance <= 20 will be degenerated
(it has a single group). If we were to e.g. add some more nodes to the left
of node0, then we would trigger the "grandchildren logic" for node1 and
would end up creating a reference to node1 NUMA<=20's sgc, which is a
mistake: that domain will be degenerated, and that sgc will never be
updated. The right thing to do here would be to reference node1 NUMA<=12's
sgc, which the above snippet does.

>> +
>> +	return parent;
>> +}
>> +


* RE: [RFC PATCH 1/2] sched/topology: Get rid of NUMA overlapping groups
From: Song Bao Hua (Barry Song) @ 2021-02-09  0:12 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: vincent.guittot, mgorman, mingo, peterz, dietmar.eggemann,
	morten.rasmussen, linuxarm, xuwei (O), Liguozhu (Kenneth),
	tiantao (H), wanghuiqiang, Zengtao (B),
	Jonathan Cameron, guodong.xu, Meelis Roos



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Tuesday, February 9, 2021 12:48 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> linux-kernel@vger.kernel.org
> Cc: vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org;
> peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com;
> linuxarm@openeuler.org; xuwei (O) <xuwei5@huawei.com>; Liguozhu (Kenneth)
> <liguozhu@hisilicon.com>; tiantao (H) <tiantao6@hisilicon.com>; wanghuiqiang
> <wanghuiqiang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Jonathan
> Cameron <jonathan.cameron@huawei.com>; guodong.xu@linaro.org; Meelis Roos
> <mroos@linux.ee>
> Subject: RE: [RFC PATCH 1/2] sched/topology: Get rid of NUMA overlapping groups
> 
> Hi Barry,
> 
> On 08/02/21 10:04, Song Bao Hua (Barry Song) wrote:
> >> -----Original Message-----
> >> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> 
> >
> > Hi Valentin,
> >
> > While I like your approach, it will require more time to evaluate its
> > possible influence, as it also affects machines without the 3-hops issue.
> > So x86 platforms need to be tested and benchmarks are required.
> >
> > What about we first finish the review of the "grandchild" approach v2 and
> > get a solution for kunpeng920 and Sun Fire X4600-M2, without impacting the
> > machines which don't have the 3-hops issue?
> >
> 
> I figured I'd toss this out while the iron was hot (and I had the topology
> crud paged in), but I ultimately agree that it's better to first go with
> something that fixes the diameter > 2 topologies and leaves the other ones
> untouched, which is exactly what you have.
> 
> > I would appreciate very much if you could comment on v2:
> >
> https://lore.kernel.org/lkml/20210203111201.20720-1-song.bao.hua@hisilicon
> .com/
> >
> 
> See my comment below on domain degeneration; with that taken care of I
> would say it's good to go. Have a look at what patch1+patch3 squashed
> together looks like, passing the right sd to init_overlap_sched_group()
> looks a bit neater IMO.
> 
> >> +static struct sched_domain *find_node_domain(struct sched_domain *sd)
> >> +{
> >> +	struct sched_domain *parent;
> >> +
> >> +	BUG_ON(!(sd->flags & SD_NUMA));
> >> +
> >> +	/* Get to the level above NODE */
> >> +	while (sd && sd->child) {
> >> +		parent = sd;
> >> +		sd = sd->child;
> >> +
> >> +		if (!(sd->flags & SD_NUMA))
> >> +			break;
> >> +	}
> >> +	/*
> >> +	 * We're going to create cross topology level sched_group_capacity
> >> +	 * references. This can only work if the domains resulting from said
> >> +	 * levels won't be degenerated, as we need said sgc to be periodically
> >> +	 * updated: it needs to be attached to the local group of a domain
> >> +	 * that didn't get degenerated.
> >> +	 *
> >> +	 * Of course, groups aren't available yet, so we can't call the usual
> >> +	 * sd_degenerate(). Checking domain spans is the closest we get.
> >> +	 * Start from NODE's parent, and keep going up until we get a domain
> >> +	 * we're sure won't be degenerated.
> >> +	 */
> >> +	while (sd->parent &&
> >> +	       cpumask_equal(sched_domain_span(sd), sched_domain_span(parent)))
> {
> >> +		sd = parent;
> >> +		parent = sd->parent;
> >> +	}
> >
> > So this is because a sched_domain which doesn't contribute to scheduling
> > will be destroyed during cpu_attach_domain(), since sd and parent span
> > the same mask?
> >
> 
> Yes; let's take your topology for instance:
> 
> node   0   1   2   3
>     0:  10  12  20  22
>     1:  12  10  22  24
>     2:  20  22  10  12
>     3:  22  24  12  10
> 
>       2       10      2
>   0 <---> 1 <---> 2 <---> 3

Guess you actually mean
       2       10      2
   1 <---> 0 <---> 2 <---> 3

> 
> 
> Domains for node1 will look like (before any fixes are applied):
> 
> NUMA<=10: span=1   groups=(1)
> NUMA<=12: span=0-1 groups=(1)->(0)
> NUMA<=20: span=0-1 groups=(0,1)
> NUMA<=22: span=0-2 groups=(0,1)->(0,2-3)
> NUMA<=24: span=0-3 groups=(0-2)->(0,2-3)
> 
> As you can see, the domain representing distance <= 20 will be degenerated
> (it has a single group). If we were to e.g. add some more nodes to the left
> of node0, then we would trigger the "grandchildren logic" for node1 and
> would end up creating a reference to node1 NUMA<=20's sgc, which is a
> mistake: that domain will be degenerated, and that sgc will never be
> updated. The right thing to do here would be to reference node1 NUMA<=12's
> sgc, which the above snippet does.

Guess I got your point even though the diagram is not correct :-)

If the topology is as below (adding a node to the left of node1 rather than
node0):

    9       2       10      2
A <---> 1 <---> 0 <---> 2 <---> 3

For nodeA,
NUMA<=10: span=A   groups=(A)
NUMA<=12: span= A groups= (A)
NUMA<=19: span=A-1 groups=(A),(1)
NUMA<=20: span=A-1 groups=(A,1)
*1 NUMA<=21: span=A-1-0 groups=(A,1), node1's numa<=20

For node0,
NUMA<=10: span=0   groups=(0)
#3 NUMA<=12: span=0-1 groups=(0)->(1)
#2 NUMA<=19: span=0-1 groups=(0,1)
#1 NUMA<=20: span=0-1-2 groups=(0,1),....

*1 will firstly try #1, and it finds 2 is outside the A-1-0,
then it will try #2. Finally #2 will be degenerated, so we
should actually use #3. Amazing!

> 
> >> +
> >> +	return parent;
> >> +}
> >> +

Thanks
Barry


* RE: [RFC PATCH 1/2] sched/topology: Get rid of NUMA overlapping groups
From: Valentin Schneider @ 2021-02-09 10:41 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song), linux-kernel
  Cc: vincent.guittot, mgorman, mingo, peterz, dietmar.eggemann,
	morten.rasmussen, linuxarm, xuwei (O), Liguozhu (Kenneth),
	tiantao (H), wanghuiqiang, Zengtao (B),
	Jonathan Cameron, guodong.xu, Meelis Roos

On 09/02/21 00:12, Song Bao Hua (Barry Song) wrote:
>> -----Original Message-----
>> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
>>
>> Yes; let's take your topology for instance:
>>
>> node   0   1   2   3
>>     0:  10  12  20  22
>>     1:  12  10  22  24
>>     2:  20  22  10  12
>>     3:  22  24  12  10
>>
>>       2       10      2
>>   0 <---> 1 <---> 2 <---> 3
>
> Guess you actually mean
>        2       10      2
>    1 <---> 0 <---> 2 <---> 3
>

Yeah, you're right, sorry about that!

>>
>>
>> Domains for node1 will look like (before any fixes are applied):
>>
>> NUMA<=10: span=1   groups=(1)
>> NUMA<=12: span=0-1 groups=(1)->(0)
>> NUMA<=20: span=0-1 groups=(0,1)
>> NUMA<=22: span=0-2 groups=(0,1)->(0,2-3)
>> NUMA<=24: span=0-3 groups=(0-2)->(0,2-3)
>>
>> As you can see, the domain representing distance <= 20 will be degenerated
>> (it has a single group). If we were to e.g. add some more nodes to the left
>> of node0, then we would trigger the "grandchildren logic" for node1 and
>> would end up creating a reference to node1 NUMA<=20's sgc, which is a
>> mistake: that domain will be degenerated, and that sgc will never be
>> updated. The right thing to do here would be to reference node1 NUMA<=12's
>> sgc, which the above snippet does.
>
> Guess I got your point even though the diagram is not correct :-)
>

Good!

> If the topology is as below (adding a node to the left of node1 rather than
> node0):
>
>     9       2       10      2
> A <---> 1 <---> 0 <---> 2 <---> 3
>
> For nodeA,
> NUMA<=10: span=A   groups=(A)
> NUMA<=12: span= A groups= (A)
> NUMA<=19: span=A-1 groups=(A),(1)
> NUMA<=20: span=A-1 groups=(A,1)
> *1 NUMA<=21: span=A-1-0 groups=(A,1), node1's numa<=20
>
> For node0,
> NUMA<=10: span=0   groups=(0)
> #3 NUMA<=12: span=0-1 groups=(0)->(1)
> #2 NUMA<=19: span=0-1 groups=(0,1)
> #1 NUMA<=20: span=0-1-2 groups=(0,1),....
>
> *1 will firstly try #1, and it finds 2 is outside the A-1-0,
> then it will try #2. Finally #2 will be degenerated, so we
> should actually use #3. Amazing!
>

Bingo!

>>
>> >> +
>> >> +	return parent;
>> >> +}
>> >> +
>
> Thanks
> Barry

