* [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
From: Barry Song @ 2021-02-01  3:38 UTC (permalink / raw)
  To: valentin.schneider, vincent.guittot, mgorman, mingo, peterz,
	dietmar.eggemann, morten.rasmussen, linux-kernel
  Cc: linuxarm, xuwei5, liguozhu, tiantao6, wanghuiqiang, prime.zeng,
	jonathan.cameron, guodong.xu, Barry Song, Meelis Roos

As long as the NUMA diameter is > 2, building the sched_domain by the
sibling's child domain will definitely create a sched_domain with a
sched_group that spans outside the sched_domain:

               +------+         +------+        +-------+       +------+
               | node |  12     |node  | 20     | node  |  12   |node  |
               |  0   +---------+1     +--------+ 2     +-------+3     |
               +------+         +------+        +-------+       +------+

domain0        node0            node1            node2          node3

domain1        node0+1          node0+1          node2+3        node2+3
                                                 +
domain2        node0+1+2                         |
             group: node0+1                      |
               group:node2+3 <-------------------+

When node2 is added into the domain2 of node0, the kernel builds the
group from the child domain of node2's domain2, which is domain1
(node2+3). Node3 is outside the span of the domain, which includes
node0+1+2 only.
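
For illustration, the bogus group is exactly a failure of the kernel's
cpumask subset test (a hand-written sketch reusing the names from
build_overlap_sched_groups(); not a literal excerpt):

        /* i is a CPU in node2, sd is node0's domain2 (node0+1+2) */
        struct sched_domain *sibling = *per_cpu_ptr(sdd->sd, i);

        /* sibling->child is node2's domain1 and spans node2+3 */
        WARN_ON(!cpumask_subset(sched_domain_span(sibling->child),
                                sched_domain_span(sd))); /* would fire */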

This makes load_balance() run based on the avg_load of a sched_group
spanning outside the sched_domain, and it also makes
select_task_rq_fair() pick an idle CPU outside the sched_domain.

Real servers that suffer from this problem include at least Kunpeng920
and the 8-node Sun Fire X4600-M2.

Here we move to using the *child* domain of the *child* domain of
node2's domain2, i.e. the grandchild, to build the sched_group:

               +------+         +------+        +-------+       +------+
               | node |  12     |node  | 20     | node  |  12   |node  |
               |  0   +---------+1     +--------+ 2     +-------+3     |
               +------+         +------+        +-------+       +------+

domain0        node0            node1          +- node2          node3
                                               |
domain1        node0+1          node0+1        | node2+3        node2+3
                                               |
domain2        node0+1+2                       |
             group: node0+1                    |
               group:node2 <-------------------+

A tricky thing is that we shouldn't use the sgc of the 1st CPU of node2
for the sched_group generated by the grandchild. Otherwise, when this
CPU becomes the balance_cpu of another sched_group of CPUs other than
node0, our sched_group generated by the grandchild will share the same
sgc with the sched_group generated by the child domain of another CPU.

So in init_overlap_sched_group(), the sgc's capacity would be overwritten:
        build_balance_mask(sd, sg, mask);
        cpu = cpumask_first_and(sched_group_span(sg), mask);

        sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);

And WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask)) will
also be triggered:
static void init_overlap_sched_group(struct sched_domain *sd,
                                     struct sched_group *sg)
{
        if (atomic_inc_return(&sg->sgc->ref) == 1)
                cpumask_copy(group_balance_mask(sg), mask);
        else
                WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));
}
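
To make the aliasing concrete: each CPU owns exactly one
sched_group_capacity per topology level via sdd->sgc, so any two groups
that resolve cpumask_first_and() to the same CPU end up sharing one sgc
(an illustrative sketch, not kernel code):

        /* group built from the child domain of some other CPU */
        sg_child->sgc      = *per_cpu_ptr(sdd->sgc, cpu);
        /* group built from the grandchild for node0's domain2 */
        sg_grandchild->sgc = *per_cpu_ptr(sdd->sgc, cpu); /* same slot */

Since the two groups carry different balance masks, the second
atomic_inc_return() path hits the WARN_ON_ONCE() above.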

So here we move to use the sgc of the 2nd CPU. For the corner case where
a NUMA node has only one CPU, we will still trigger this WARN_ON_ONCE.
But it is really unlikely for a real machine to have a NUMA node with
only one CPU.

Tested with the below topology:
qemu-system-aarch64  -M virt -nographic \
 -smp cpus=8 \
 -numa node,cpus=0-1,nodeid=0 \
 -numa node,cpus=2-3,nodeid=1 \
 -numa node,cpus=4-5,nodeid=2 \
 -numa node,cpus=6-7,nodeid=3 \
 -numa dist,src=0,dst=1,val=12 \
 -numa dist,src=0,dst=2,val=20 \
 -numa dist,src=0,dst=3,val=22 \
 -numa dist,src=1,dst=2,val=22 \
 -numa dist,src=2,dst=3,val=12 \
 -numa dist,src=1,dst=3,val=24 \
 -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image
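
(To see the "CPU attaching sched-domain(s)" dumps below, the guest
kernel is assumed to be built with CONFIG_SCHED_DEBUG=y and booted with
the sched_debug command-line parameter, e.g. by additionally passing
 -append "sched_debug"
to the qemu command above.)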

w/o patch, we get lots of "groups don't span domain->span":
[    0.802139] CPU0 attaching sched-domain(s):
[    0.802193]  domain-0: span=0-1 level=MC
[    0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[    0.802693]   domain-1: span=0-3 level=NUMA
[    0.802731]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[    0.802811]    domain-2: span=0-5 level=NUMA
[    0.802829]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[    0.802881] ERROR: groups don't span domain->span
[    0.803058]     domain-3: span=0-7 level=NUMA
[    0.803080]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[    0.804055] CPU1 attaching sched-domain(s):
[    0.804072]  domain-0: span=0-1 level=MC
[    0.804096]   groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
[    0.804152]   domain-1: span=0-3 level=NUMA
[    0.804170]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[    0.804219]    domain-2: span=0-5 level=NUMA
[    0.804236]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[    0.804302] ERROR: groups don't span domain->span
[    0.804520]     domain-3: span=0-7 level=NUMA
[    0.804546]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[    0.804677] CPU2 attaching sched-domain(s):
[    0.804687]  domain-0: span=2-3 level=MC
[    0.804705]   groups: 2:{ span=2 cap=934 }, 3:{ span=3 cap=1009 }
[    0.804754]   domain-1: span=0-3 level=NUMA
[    0.804772]    groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
[    0.804820]    domain-2: span=0-5 level=NUMA
[    0.804836]     groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
[    0.804944] ERROR: groups don't span domain->span
[    0.805108]     domain-3: span=0-7 level=NUMA
[    0.805134]      groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
[    0.805223] CPU3 attaching sched-domain(s):
[    0.805232]  domain-0: span=2-3 level=MC
[    0.805249]   groups: 3:{ span=3 cap=1009 }, 2:{ span=2 cap=934 }
[    0.805319]   domain-1: span=0-3 level=NUMA
[    0.805336]    groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
[    0.805383]    domain-2: span=0-5 level=NUMA
[    0.805399]     groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
[    0.805458] ERROR: groups don't span domain->span
[    0.805605]     domain-3: span=0-7 level=NUMA
[    0.805626]      groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
[    0.805712] CPU4 attaching sched-domain(s):
[    0.805721]  domain-0: span=4-5 level=MC
[    0.805738]   groups: 4:{ span=4 cap=984 }, 5:{ span=5 cap=924 }
[    0.805787]   domain-1: span=4-7 level=NUMA
[    0.805803]    groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
[    0.805851]    domain-2: span=0-1,4-7 level=NUMA
[    0.805867]     groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
[    0.805915] ERROR: groups don't span domain->span
[    0.806108]     domain-3: span=0-7 level=NUMA
[    0.806130]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
[    0.806214] CPU5 attaching sched-domain(s):
[    0.806222]  domain-0: span=4-5 level=MC
[    0.806240]   groups: 5:{ span=5 cap=924 }, 4:{ span=4 cap=984 }
[    0.806841]   domain-1: span=4-7 level=NUMA
[    0.806866]    groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
[    0.806934]    domain-2: span=0-1,4-7 level=NUMA
[    0.806953]     groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
[    0.807004] ERROR: groups don't span domain->span
[    0.807312]     domain-3: span=0-7 level=NUMA
[    0.807386]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
[    0.807686] CPU6 attaching sched-domain(s):
[    0.807710]  domain-0: span=6-7 level=MC
[    0.807750]   groups: 6:{ span=6 cap=1017 }, 7:{ span=7 cap=1012 }
[    0.807840]   domain-1: span=4-7 level=NUMA
[    0.807870]    groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
[    0.807952]    domain-2: span=0-1,4-7 level=NUMA
[    0.807985]     groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
[    0.808045] ERROR: groups don't span domain->span
[    0.808257]     domain-3: span=0-7 level=NUMA
[    0.808571]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=6125 }, 2:{ span=0-5 mask=2-3 cap=5899 }
[    0.808848] CPU7 attaching sched-domain(s):
[    0.808860]  domain-0: span=6-7 level=MC
[    0.808880]   groups: 7:{ span=7 cap=1012 }, 6:{ span=6 cap=1017 }
[    0.808953]   domain-1: span=4-7 level=NUMA
[    0.808974]    groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
[    0.809034]    domain-2: span=0-1,4-7 level=NUMA
[    0.809055]     groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
[    0.809128] ERROR: groups don't span domain->span
[    0.810361]     domain-3: span=0-7 level=NUMA
[    0.810400]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=5961 }, 2:{ span=0-5 mask=2-3 cap=5903 }

w/ patch, we don't get "groups don't span domain->span" any more:
[    0.868907] CPU0 attaching sched-domain(s):
[    0.868962]  domain-0: span=0-1 level=MC
[    0.869179]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=983 }
[    0.869405]   domain-1: span=0-3 level=NUMA
[    0.869438]    groups: 0:{ span=0-1 cap=1996 }, 2:{ span=2-3 cap=2006 }
[    0.869542]    domain-2: span=0-5 level=NUMA
[    0.869559]     groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
[    0.869603]     domain-3: span=0-7 level=NUMA
[    0.869618]      groups: 0:{ span=0-5 mask=0-1 cap=5980 }, 6:{ span=4-7 mask=6-7 cap=4016 }
[    0.870303] CPU1 attaching sched-domain(s):
[    0.870314]  domain-0: span=0-1 level=MC
[    0.870334]   groups: 1:{ span=1 cap=983 }, 0:{ span=0 cap=1013 }
[    0.870381]   domain-1: span=0-3 level=NUMA
[    0.870396]    groups: 0:{ span=0-1 cap=1996 }, 2:{ span=2-3 cap=2006 }
[    0.870440]    domain-2: span=0-5 level=NUMA
[    0.870454]     groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
[    0.870507]     domain-3: span=0-7 level=NUMA
[    0.870530]      groups: 0:{ span=0-5 mask=0-1 cap=5980 }, 6:{ span=4-7 mask=6-7 cap=4016 }
[    0.870611] CPU2 attaching sched-domain(s):
[    0.870619]  domain-0: span=2-3 level=MC
[    0.870634]   groups: 2:{ span=2 cap=1007 }, 3:{ span=3 cap=999 }
[    0.870677]   domain-1: span=0-3 level=NUMA
[    0.870691]    groups: 2:{ span=2-3 cap=2006 }, 0:{ span=0-1 cap=1996 }
[    0.870734]    domain-2: span=0-5 level=NUMA
[    0.870748]     groups: 2:{ span=0-3 mask=2-3 cap=4054 }, 5:{ span=4-5 cap=2048 }
[    0.870795]     domain-3: span=0-7 level=NUMA
[    0.870809]      groups: 2:{ span=0-5 mask=2-3 cap=6032 }, 6:{ span=0-1,4-7 mask=6-7 cap=6064 }
[    0.870913] CPU3 attaching sched-domain(s):
[    0.870921]  domain-0: span=2-3 level=MC
[    0.870936]   groups: 3:{ span=3 cap=999 }, 2:{ span=2 cap=1007 }
[    0.870979]   domain-1: span=0-3 level=NUMA
[    0.870993]    groups: 2:{ span=2-3 cap=2006 }, 0:{ span=0-1 cap=1996 }
[    0.871035]    domain-2: span=0-5 level=NUMA
[    0.871049]     groups: 2:{ span=0-3 mask=2-3 cap=4054 }, 5:{ span=4-5 cap=2048 }
[    0.871096]     domain-3: span=0-7 level=NUMA
[    0.871110]      groups: 2:{ span=0-5 mask=2-3 cap=6032 }, 6:{ span=0-1,4-7 mask=6-7 cap=6064 }
[    0.871177] CPU4 attaching sched-domain(s):
[    0.871185]  domain-0: span=4-5 level=MC
[    0.871200]   groups: 4:{ span=4 cap=977 }, 5:{ span=5 cap=1001 }
[    0.871243]   domain-1: span=4-7 level=NUMA
[    0.871257]    groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
[    0.871300]    domain-2: span=0-1,4-7 level=NUMA
[    0.871314]     groups: 4:{ span=4-7 cap=3946 }, 1:{ span=0-1 cap=2048 }
[    0.871356]     domain-3: span=0-7 level=NUMA
[    0.871370]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=5994 }, 2:{ span=0-3 mask=2-3 cap=4054 }
[    0.871436] CPU5 attaching sched-domain(s):
[    0.871443]  domain-0: span=4-5 level=MC
[    0.871457]   groups: 5:{ span=5 cap=1001 }, 4:{ span=4 cap=977 }
[    0.871512]   domain-1: span=4-7 level=NUMA
[    0.871893]    groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
[    0.871949]    domain-2: span=0-1,4-7 level=NUMA
[    0.871966]     groups: 4:{ span=4-7 cap=3946 }, 1:{ span=0-1 cap=2048 }
[    0.872010]     domain-3: span=0-7 level=NUMA
[    0.872025]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=5994 }, 2:{ span=0-3 mask=2-3 cap=4054 }
[    0.872115] CPU6 attaching sched-domain(s):
[    0.872123]  domain-0: span=6-7 level=MC
[    0.872139]   groups: 6:{ span=6 cap=993 }, 7:{ span=7 cap=975 }
[    0.872186]   domain-1: span=4-7 level=NUMA
[    0.872202]    groups: 6:{ span=6-7 cap=1968 }, 4:{ span=4-5 cap=1978 }
[    0.872246]    domain-2: span=0-1,4-7 level=NUMA
[    0.872260]     groups: 6:{ span=4-7 mask=6-7 cap=4016 }, 1:{ span=0-1 cap=2048 }
[    0.872309]     domain-3: span=0-7 level=NUMA
[    0.872323]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=6064 }, 2:{ span=0-5 mask=2-3 cap=6032 }
[    0.872392] CPU7 attaching sched-domain(s):
[    0.872399]  domain-0: span=6-7 level=MC
[    0.872414]   groups: 7:{ span=7 cap=975 }, 6:{ span=6 cap=993 }
[    0.872458]   domain-1: span=4-7 level=NUMA
[    0.872472]    groups: 6:{ span=6-7 cap=1968 }, 4:{ span=4-5 cap=1978 }
[    0.872662]    domain-2: span=0-1,4-7 level=NUMA
[    0.872685]     groups: 6:{ span=4-7 mask=6-7 cap=4016 }, 1:{ span=0-1 cap=2048 }
[    0.872737]     domain-3: span=0-7 level=NUMA
[    0.872752]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=6064 }, 2:{ span=0-5 mask=2-3 cap=6032 }

Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 Differences from RFC v2:
 * added Tested-by from Meelis Roos for the fixed 8-node Sun Fire X4600-M2
 * removed the hacking code in balance_mask and should_we_balance()
 * removed the redundant "from_grandchild" field from sched_group

 The patch is based on 5.11-rc6.

 kernel/sched/topology.c | 83 ++++++++++++++++++++++++++---------------
 1 file changed, 52 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5d3675c7a76b..100feb2fd8a0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -723,35 +723,6 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	for (tmp = sd; tmp; tmp = tmp->parent)
 		numa_distance += !!(tmp->flags & SD_NUMA);
 
-	/*
-	 * FIXME: Diameter >=3 is misrepresented.
-	 *
-	 * Smallest diameter=3 topology is:
-	 *
-	 *   node   0   1   2   3
-	 *     0:  10  20  30  40
-	 *     1:  20  10  20  30
-	 *     2:  30  20  10  20
-	 *     3:  40  30  20  10
-	 *
-	 *   0 --- 1 --- 2 --- 3
-	 *
-	 * NUMA-3	0-3		N/A		N/A		0-3
-	 *  groups:	{0-2},{1-3}					{1-3},{0-2}
-	 *
-	 * NUMA-2	0-2		0-3		0-3		1-3
-	 *  groups:	{0-1},{1-3}	{0-2},{2-3}	{1-3},{0-1}	{2-3},{0-2}
-	 *
-	 * NUMA-1	0-1		0-2		1-3		2-3
-	 *  groups:	{0},{1}		{1},{2},{0}	{2},{3},{1}	{3},{2}
-	 *
-	 * NUMA-0	0		1		2		3
-	 *
-	 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
-	 * group span isn't a subset of the domain span.
-	 */
-	WARN_ONCE(numa_distance > 2, "Shortest NUMA path spans too many nodes\n");
-
 	sched_domain_debug(sd, cpu);
 
 	rq_attach_root(rq, rd);
@@ -916,6 +887,11 @@ build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpuma
 		if (!sibling->child)
 			continue;
 
+		while (sibling->child &&
+			!cpumask_subset(sched_domain_span(sibling->child),
+					sched_domain_span(sd)))
+			sibling = sibling->child;
+
 		/* If we would not end up here, we can't continue from here */
 		if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
 			continue;
@@ -955,7 +931,8 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 }
 
 static void init_overlap_sched_group(struct sched_domain *sd,
-				     struct sched_group *sg)
+				     struct sched_group *sg,
+				     int from_grandchild)
 {
 	struct cpumask *mask = sched_domains_tmpmask2;
 	struct sd_data *sdd = sd->private;
@@ -964,6 +941,12 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 
 	build_balance_mask(sd, sg, mask);
 	cpu = cpumask_first_and(sched_group_span(sg), mask);
+	/*
+	 * for the group generated by grandchild, use the sgc of 2nd cpu
+	 * because the 1st cpu might be used by another sched_group
+	 */
+	if (from_grandchild && cpumask_weight(mask) > 1)
+		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
@@ -996,6 +979,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 
 	for_each_cpu_wrap(i, span, cpu) {
 		struct cpumask *sg_span;
+		int from_grandchild = 0;
 
 		if (cpumask_test_cpu(i, covered))
 			continue;
@@ -1015,6 +999,43 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 
+		/*
+		 * For NUMA diameter >= 3, build the sched_group from the
+		 * sibling's grandchild domain to prevent the sched_group
+		 * from spanning outside the sched_domain.
+		 * If we don't do this, diameter >= 3 is misrepresented:
+		 *
+		 * Smallest diameter=3 topology is:
+		 *
+		 *   node   0   1   2   3
+		 *     0:  10  20  30  40
+		 *     1:  20  10  20  30
+		 *     2:  30  20  10  20
+		 *     3:  40  30  20  10
+		 *
+		 *   0 --- 1 --- 2 --- 3
+		 *
+		 * NUMA-3       0-3             N/A             N/A             0-3
+		 *  groups:     {0-2},{1-3}                                     {1-3},{0-2}
+		 *
+		 * NUMA-2       0-2             0-3             0-3             1-3
+		 *  groups:     {0-1},{1-3}     {0-2},{2-3}     {1-3},{0-1}     {2-3},{0-2}
+		 *
+		 * NUMA-1       0-1             0-2             1-3             2-3
+		 *  groups:     {0},{1}         {1},{2},{0}     {2},{3},{1}     {3},{2}
+		 *
+		 * NUMA-0       0               1               2               3
+		 *
+		 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
+		 * group span isn't a subset of the domain span.
+		 */
+		while (sibling->child &&
+		       !cpumask_subset(sched_domain_span(sibling->child),
+				       span)) {
+			sibling = sibling->child;
+			from_grandchild = 1;
+		}
+
 		sg = build_group_from_child_sched_domain(sibling, cpu);
 		if (!sg)
 			goto fail;
@@ -1022,7 +1043,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg);
+		init_overlap_sched_group(sd, sg, from_grandchild);
 
 		if (!first)
 			first = sg;
-- 
2.25.1


* Re: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
From: kernel test robot @ 2021-02-01  5:36 UTC (permalink / raw)
  To: Barry Song, valentin.schneider, vincent.guittot, mgorman, mingo,
	peterz, dietmar.eggemann, morten.rasmussen, linux-kernel
  Cc: kbuild-all, linuxarm, xuwei5

Hi Barry,

I love your patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on next-20210125]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Barry-Song/sched-topology-fix-the-issue-groups-don-t-span-domain-span-for-NUMA-diameter-2/20210201-114807
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 7a976f77bb962ce9486e09eb839aa135619b54f3
config: i386-randconfig-s001-20210201 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-215-g0fb77bb6-dirty
        # https://github.com/0day-ci/linux/commit/a8f7eae6bc5ac37efd8cbe15e0a388127619f1b0
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Barry-Song/sched-topology-fix-the-issue-groups-don-t-span-domain-span-for-NUMA-diameter-2/20210201-114807
        git checkout a8f7eae6bc5ac37efd8cbe15e0a388127619f1b0
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


"sparse warnings: (new ones prefixed by >>)"
   kernel/sched/topology.c:106:56: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:106:56: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:106:56: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:125:60: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:125:60: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:125:60: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:148:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:148:20: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:148:20: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:431:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct perf_domain *[assigned] tmp @@     got struct perf_domain [noderef] __rcu *pd @@
   kernel/sched/topology.c:431:13: sparse:     expected struct perf_domain *[assigned] tmp
   kernel/sched/topology.c:431:13: sparse:     got struct perf_domain [noderef] __rcu *pd
   kernel/sched/topology.c:440:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct perf_domain *[assigned] tmp @@     got struct perf_domain [noderef] __rcu *pd @@
   kernel/sched/topology.c:440:13: sparse:     expected struct perf_domain *[assigned] tmp
   kernel/sched/topology.c:440:13: sparse:     got struct perf_domain [noderef] __rcu *pd
   kernel/sched/topology.c:461:19: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct perf_domain *[assigned] pd @@     got struct perf_domain [noderef] __rcu *pd @@
   kernel/sched/topology.c:461:19: sparse:     expected struct perf_domain *[assigned] pd
   kernel/sched/topology.c:461:19: sparse:     got struct perf_domain [noderef] __rcu *pd
   kernel/sched/topology.c:623:49: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:623:49: sparse:     expected struct sched_domain *parent
   kernel/sched/topology.c:623:49: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:695:50: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:695:50: sparse:     expected struct sched_domain *parent
   kernel/sched/topology.c:695:50: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:702:55: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain [noderef] __rcu *[noderef] __rcu child @@     got struct sched_domain *[assigned] tmp @@
   kernel/sched/topology.c:702:55: sparse:     expected struct sched_domain [noderef] __rcu *[noderef] __rcu child
   kernel/sched/topology.c:702:55: sparse:     got struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:712:29: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:712:29: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:712:29: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:717:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:717:20: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:717:20: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:723:33: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:723:33: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:723:33: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:729:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *sd @@
   kernel/sched/topology.c:729:13: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:729:13: sparse:     got struct sched_domain [noderef] __rcu *sd
   kernel/sched/topology.c:891:66: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:891:66: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:891:66: sparse:     got struct sched_domain [noderef] __rcu *child
>> kernel/sched/topology.c:893:33: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sibling @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:893:33: sparse:     expected struct sched_domain *[assigned] sibling
   kernel/sched/topology.c:893:33: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:896:70: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:896:70: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:896:70: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:925:59: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:925:59: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:925:59: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1033:65: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:1033:65: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:1033:65: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1035:33: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sibling @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:1035:33: sparse:     expected struct sched_domain *[assigned] sibling
   kernel/sched/topology.c:1035:33: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1140:40: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *child @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:1140:40: sparse:     expected struct sched_domain *child
   kernel/sched/topology.c:1140:40: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1441:43: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain [noderef] __rcu *child @@     got struct sched_domain *child @@
   kernel/sched/topology.c:1441:43: sparse:     expected struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1441:43: sparse:     got struct sched_domain *child
   kernel/sched/topology.c:1926:31: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain [noderef] __rcu *parent @@     got struct sched_domain *sd @@
   kernel/sched/topology.c:1926:31: sparse:     expected struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:1926:31: sparse:     got struct sched_domain *sd
   kernel/sched/topology.c:2094:57: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2094:57: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/topology.c:2094:57: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:2111:57: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2111:57: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/topology.c:2111:57: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:59:25: sparse: sparse: dereference of noderef expression
   kernel/sched/topology.c:64:25: sparse: sparse: dereference of noderef expression
   kernel/sched/topology.c: note: in included file:
   kernel/sched/sched.h:1455:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/sched.h:1455:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/sched.h:1455:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/sched.h:1468:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/sched.h:1468:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/sched.h:1468:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/sched.h:1455:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/sched.h:1455:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/sched.h:1455:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/sched.h:1468:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/sched.h:1468:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/sched.h:1468:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:1456:19: sparse: sparse: dereference of noderef expression

vim +893 kernel/sched/topology.c

   762	
   763	
   764	/*
   765	 * NUMA topology (first read the regular topology blurb below)
   766	 *
   767	 * Given a node-distance table, for example:
   768	 *
   769	 *   node   0   1   2   3
   770	 *     0:  10  20  30  20
   771	 *     1:  20  10  20  30
   772	 *     2:  30  20  10  20
   773	 *     3:  20  30  20  10
   774	 *
   775	 * which represents a 4 node ring topology like:
   776	 *
   777	 *   0 ----- 1
   778	 *   |       |
   779	 *   |       |
   780	 *   |       |
   781	 *   3 ----- 2
   782	 *
   783	 * We want to construct domains and groups to represent this. The way we go
   784	 * about doing this is to build the domains on 'hops'. For each NUMA level we
   785	 * construct the mask of all nodes reachable in @level hops.
   786	 *
   787	 * For the above NUMA topology that gives 3 levels:
   788	 *
   789	 * NUMA-2	0-3		0-3		0-3		0-3
   790	 *  groups:	{0-1,3},{1-3}	{0-2},{0,2-3}	{1-3},{0-1,3}	{0,2-3},{0-2}
   791	 *
   792	 * NUMA-1	0-1,3		0-2		1-3		0,2-3
   793	 *  groups:	{0},{1},{3}	{0},{1},{2}	{1},{2},{3}	{0},{2},{3}
   794	 *
   795	 * NUMA-0	0		1		2		3
   796	 *
   797	 *
   798	 * As can be seen; things don't nicely line up as with the regular topology.
   799	 * When we iterate a domain in child domain chunks some nodes can be
   800	 * represented multiple times -- hence the "overlap" naming for this part of
   801	 * the topology.
   802	 *
   803	 * In order to minimize this overlap, we only build enough groups to cover the
   804	 * domain. For instance Node-0 NUMA-2 would only get groups: 0-1,3 and 1-3.
   805	 *
   806	 * Because:
   807	 *
   808	 *  - the first group of each domain is its child domain; this
   809	 *    gets us the first 0-1,3
   810	 *  - the only uncovered node is 2, who's child domain is 1-3.
   811	 *
   812	 * However, because of the overlap, computing a unique CPU for each group is
   813	 * more complicated. Consider for instance the groups of NODE-1 NUMA-2, both
   814	 * groups include the CPUs of Node-0, while those CPUs would not in fact ever
   815	 * end up at those groups (they would end up in group: 0-1,3).
   816	 *
   817	 * To correct this we have to introduce the group balance mask. This mask
   818	 * will contain those CPUs in the group that can reach this group given the
   819	 * (child) domain tree.
   820	 *
   821	 * With this we can once again compute balance_cpu and sched_group_capacity
   822	 * relations.
   823	 *
   824	 * XXX include words on how balance_cpu is unique and therefore can be
   825	 * used for sched_group_capacity links.
   826	 *
   827	 *
   828	 * Another 'interesting' topology is:
   829	 *
   830	 *   node   0   1   2   3
   831	 *     0:  10  20  20  30
   832	 *     1:  20  10  20  20
   833	 *     2:  20  20  10  20
   834	 *     3:  30  20  20  10
   835	 *
   836	 * Which looks a little like:
   837	 *
   838	 *   0 ----- 1
   839	 *   |     / |
   840	 *   |   /   |
   841	 *   | /     |
   842	 *   2 ----- 3
   843	 *
   844	 * This topology is asymmetric, nodes 1,2 are fully connected, but nodes 0,3
   845	 * are not.
   846	 *
   847	 * This leads to a few particularly weird cases where the sched_domain's are
   848	 * not of the same number for each CPU. Consider:
   849	 *
   850	 * NUMA-2	0-3						0-3
   851	 *  groups:	{0-2},{1-3}					{1-3},{0-2}
   852	 *
   853	 * NUMA-1	0-2		0-3		0-3		1-3
   854	 *
   855	 * NUMA-0	0		1		2		3
   856	 *
   857	 */
   858	
   859	
   860	/*
   861	 * Build the balance mask; it contains only those CPUs that can arrive at this
   862	 * group and should be considered to continue balancing.
   863	 *
   864	 * We do this during the group creation pass, therefore the group information
   865	 * isn't complete yet, however since each group represents a (child) domain we
   866	 * can fully construct this using the sched_domain bits (which are already
   867	 * complete).
   868	 */
   869	static void
   870	build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
   871	{
   872		const struct cpumask *sg_span = sched_group_span(sg);
   873		struct sd_data *sdd = sd->private;
   874		struct sched_domain *sibling;
   875		int i;
   876	
   877		cpumask_clear(mask);
   878	
   879		for_each_cpu(i, sg_span) {
   880			sibling = *per_cpu_ptr(sdd->sd, i);
   881	
   882			/*
   883			 * Can happen in the asymmetric case, where these siblings are
   884			 * unused. The mask will not be empty because those CPUs that
   885			 * do have the top domain _should_ span the domain.
   886			 */
   887			if (!sibling->child)
   888				continue;
   889	
   890			while (sibling->child &&
   891				!cpumask_subset(sched_domain_span(sibling->child),
   892						sched_domain_span(sd)))
 > 893				sibling = sibling->child;
   894	
   895			/* If we would not end up here, we can't continue from here */
   896			if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
   897				continue;
   898	
   899			cpumask_set_cpu(i, mask);
   900		}
   901	
   902		/* We must not have empty masks here */
   903		WARN_ON_ONCE(cpumask_empty(mask));
   904	}
   905	
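
The new warning at line 893 follows the same pattern as the pre-existing
ones: sched_domain::child is annotated __rcu, and the added loop assigns
it to a plain struct sched_domain * pointer. A minimal sketch of a
sparse-clean access (a hypothetical illustration, not necessarily how it
was fixed):

        /* assuming the caller holds an RCU read-side critical section */
        struct sched_domain *child = rcu_dereference(sibling->child);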

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org


* Re: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
From: Valentin Schneider @ 2021-02-01 18:11 UTC (permalink / raw)
  To: Barry Song, vincent.guittot, mgorman, mingo, peterz,
	dietmar.eggemann, morten.rasmussen, linux-kernel
  Cc: linuxarm, xuwei5, liguozhu, tiantao6, wanghuiqiang, prime.zeng,
	jonathan.cameron, guodong.xu, Barry Song, Meelis Roos


Hi,

On 01/02/21 16:38, Barry Song wrote:
> A tricky thing is that we shouldn't use the sgc of the 1st CPU of node2
> for the sched_group generated by the grandchild. Otherwise, when this
> CPU becomes the balance_cpu of another sched_group of CPUs other than
> node0, our sched_group generated by the grandchild will share the same
> sgc with the sched_group generated by the child domain of another CPU.
>
> So in init_overlap_sched_group(), the sgc's capacity would be overwritten:
>         build_balance_mask(sd, sg, mask);
>         cpu = cpumask_first_and(sched_group_span(sg), mask);
>
>         sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
>
> And WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask)) will
> also be triggered:
> static void init_overlap_sched_group(struct sched_domain *sd,
>                                      struct sched_group *sg)
> {
>         if (atomic_inc_return(&sg->sgc->ref) == 1)
>                 cpumask_copy(group_balance_mask(sg), mask);
>         else
>                 WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));
> }
>
> So here we move to use the sgc of the 2nd CPU. For the corner case where
> a NUMA node has only one CPU, we will still trigger this WARN_ON_ONCE.
> But it is really unlikely for a real machine to have a NUMA node with
> only one CPU.
>

Well, it's trivial to boot this with QEMU, and it's actually the example
the comment atop that WARN_ONCE() is based on. Also, you could end up with
a single CPU on a node during hotplug operations...

I am not entirely sure whether having more than one CPU per node is a
sufficient condition. I'm starting to *think* it is, but I'm not entirely
convinced yet - and now I need a new notebook.


* RE: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
From: Song Bao Hua (Barry Song) @ 2021-02-01 21:49 UTC (permalink / raw)
  To: Valentin Schneider, vincent.guittot, mgorman, mingo, peterz,
	dietmar.eggemann, morten.rasmussen, linux-kernel
  Cc: linuxarm, xuwei (O), Liguozhu (Kenneth), tiantao (H),
	wanghuiqiang, Zengtao (B),
	Jonathan Cameron, guodong.xu, Meelis Roos



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Tuesday, February 2, 2021 7:11 AM
> Subject: Re: [PATCH] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
>
> Hi,
> 
> On 01/02/21 16:38, Barry Song wrote:
> > A tricky thing is that we shouldn't use the sgc of the 1st CPU of node2
> > for the sched_group generated by the grandchild. Otherwise, when this
> > CPU becomes the balance_cpu of another sched_group of CPUs other than
> > node0, our sched_group generated by the grandchild will share the same
> > sgc with the sched_group generated by the child domain of another CPU.
> >
> > So in init_overlap_sched_group(), the sgc's capacity would be overwritten:
> >         build_balance_mask(sd, sg, mask);
> >         cpu = cpumask_first_and(sched_group_span(sg), mask);
> >
> >         sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
> >
> > And WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask)) will
> > also be triggered:
> > static void init_overlap_sched_group(struct sched_domain *sd,
> >                                      struct sched_group *sg)
> > {
> >         if (atomic_inc_return(&sg->sgc->ref) == 1)
> >                 cpumask_copy(group_balance_mask(sg), mask);
> >         else
> >                 WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));
> > }
> >
> > So here we move to use the sgc of the 2nd CPU. For the corner case where
> > a NUMA node has only one CPU, we will still trigger this WARN_ON_ONCE.
> > But it is really unlikely for a real machine to have a NUMA node with
> > only one CPU.
> >
> 
> Well, it's trivial to boot this with QEMU, and it's actually the example
> the comment atop that WARN_ONCE() is based on. Also, you could end up with
> a single CPU on a node during hotplug operations...

Hi Valentin,

The QEMU topology is just a reflection of the real Kunpeng920 case, and
please also note that Meelis has also tested on other real hardware, an
"8-node Sun Fire X4600-M2", and gave his Tested-by.

It might not be a perfect fix, but it is the simplest fix for the moment
and for the real cases. A "perfect" fix would require major refactoring
of topology.c.

I don't think hotplug is very relevant: even if some CPUs are unplugged
and only one CPU is left in the sched_group of the sched_domain, the
related domain and group still get the right settings.

On the other hand, the corner case could be fixed, but that would
involve some very ugly code. I mean, two sched_groups can end up using
the same sgc:
1. the sched_group generated by the grandchild with only one NUMA node
2. the sched_group generated by the child with more than one NUMA node

Right now I'm moving to the 2nd CPU for sched_group 1; if we also moved
to the 2nd CPU for sched_group 2, then having only one CPU in a NUMA
node wouldn't be a problem anymore. But the code would be very ugly, so
I would prefer to keep this assumption and just ignore the unrealistic
corner case.

> 
> I am not entirely sure whether having more than one CPU per node is a
> sufficient condition. I'm starting to *think* it is, but I'm not entirely
> convinced yet - and now I need a new notebook.

Me too. Some extremely complicated topology might break the assumption.
I'd really need a new notebook to draw the kind of complicated topology
that would break it :-)

But it is sufficient for the existing real cases that need fixing. If
someday a real case in which each NUMA node has more than one CPU wakes
up the below warning:
WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask))
that might be the right time to consider major refactoring of
topology.c.

Thanks
Barry


* Re: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
From: Valentin Schneider @ 2021-02-02 15:17 UTC (permalink / raw)
  To: Barry Song, vincent.guittot, mgorman, mingo, peterz,
	dietmar.eggemann, morten.rasmussen, linux-kernel
  Cc: linuxarm, xuwei5, liguozhu, tiantao6, wanghuiqiang, prime.zeng,
	jonathan.cameron, guodong.xu, Barry Song, Meelis Roos

On 01/02/21 16:38, Barry Song wrote:
> @@ -964,6 +941,12 @@ static void init_overlap_sched_group(struct sched_domain *sd,
>
>       build_balance_mask(sd, sg, mask);
>       cpu = cpumask_first_and(sched_group_span(sg), mask);
> +	/*
> +	 * for the group generated by grandchild, use the sgc of 2nd cpu
> +	 * because the 1st cpu might be used by another sched_group
> +	 */
> +	if (from_grandchild && cpumask_weight(mask) > 1)
> +		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
>
>       sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);

So you are getting a (hopefully) unique ID for this group span at this
given topology level (i.e. sd->private) but as I had stated in that list of
issues, this creates an sgc that isn't attached to the local group of any
sched_domain, and thus won't get its capacity values updated.

This can actually be seen via the capacity values you're getting at build
time:

> [    0.868907] CPU0 attaching sched-domain(s):
...
> [    0.869542]    domain-2: span=0-5 level=NUMA
> [    0.869559]     groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
                                                          ^^^^^^^^^^^^^^^^
> [    0.871177] CPU4 attaching sched-domain(s):
...
> [    0.871200]   groups: 4:{ span=4 cap=977 }, 5:{ span=5 cap=1001 }
> [    0.871243]   domain-1: span=4-7 level=NUMA
> [    0.871257]    groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
                                ^^^^^^^^^^^^^^^^

IMO what we want to do here is to hook this CPU0-domain-2-group5 to the sgc
of CPU4-domain1-group4. I've done that in the below diff - this gives us
groups with sgc's owned at lower topology levels, but this will only ever
be true for non-local groups. This has the added benefit of working with
single-CPU nodes. Briefly tested on your topology and the sunfire's (via
QEMU), and I didn't get screamed at.
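
Concretely, with the spans from the boot log above (a paraphrase of the
diff below, not extra code): for CPU0's domain-2 group spanning 4-5, we
take sdd from the descended sibling, i.e. CPU4's domain-1 level, so

        sg->sgc = *per_cpu_ptr(sdd->sgc, 4);

now points at the sgc owned by CPU4-domain1's local group, which does
get its capacity refreshed.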

Before the fun police comes and impounds my keyboard, I'd like to point out
that we could leverage this cross-level sgc referencing hack to further
change the NUMA domains and pretty much get rid of overlapping groups
(that's what I was fumbling with in [1]).

[1]: http://lore.kernel.org/r/jhjwnw11ak2.mognet@arm.com

That is, rather than building overlapping groups and fixing them whenever
that breaks (distance > 2), we could have:
- the local group being the child domain's span (as always)
- all non-local NUMA groups spanning a single node each, with the right sgc
  cross-referencing.

Thoughts?

--->8---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b748999c9e11..ef43abb6b1fb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -932,21 +932,15 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 
 static void init_overlap_sched_group(struct sched_domain *sd,
 				     struct sched_group *sg,
-				     int from_grandchild)
+				     struct sched_domain *grandchild)
 {
 	struct cpumask *mask = sched_domains_tmpmask2;
-	struct sd_data *sdd = sd->private;
+	struct sd_data *sdd = grandchild ? grandchild->private : sd->private;
 	struct cpumask *sg_span;
 	int cpu;
 
 	build_balance_mask(sd, sg, mask);
 	cpu = cpumask_first_and(sched_group_span(sg), mask);
-	/*
-	 * for the group generated by grandchild, use the sgc of 2nd cpu
-	 * because the 1st cpu might be used by another sched_group
-	 */
-	if (from_grandchild && cpumask_weight(mask) > 1)
-		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
@@ -979,7 +973,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 
 	for_each_cpu_wrap(i, span, cpu) {
 		struct cpumask *sg_span;
-		int from_grandchild = 0;
+		bool from_grandchild = false;
 
 		if (cpumask_test_cpu(i, covered))
 			continue;
@@ -1033,7 +1027,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		       !cpumask_subset(sched_domain_span(sibling->child),
 				       span)) {
 			sibling = sibling->child;
-			from_grandchild = 1;
+			from_grandchild = true;
 		}
 
 		sg = build_group_from_child_sched_domain(sibling, cpu);
@@ -1043,7 +1037,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg, from_grandchild);
+		init_overlap_sched_group(sd, sg, from_grandchild ? sibling : NULL);
 
 		if (!first)
 			first = sg;



* Re: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
From: Valentin Schneider @ 2021-02-02 16:48 UTC (permalink / raw)
  To: Barry Song, vincent.guittot, mgorman, mingo, peterz,
	dietmar.eggemann, morten.rasmussen, linux-kernel
  Cc: linuxarm, xuwei5, liguozhu, tiantao6, wanghuiqiang, prime.zeng,
	jonathan.cameron, guodong.xu, Barry Song, Meelis Roos

On 02/02/21 15:17, Valentin Schneider wrote:
> That is, rather than building overlapping groups and fixing them whenever
> that breaks (distance > 2), we could have:
> - the local group being the child domain's span (as always)
> - all non-local NUMA groups spanning a single node each, with the right sgc
>   cross-referencing.
>
> Thoughts?
>

Hmph, I'm thinking this can be broken by domain degeneration, as nothing
will come along to fix up the ->sgc link and we'd be back to having a
reference to a group that doesn't get updated...


* RE: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
From: Song Bao Hua (Barry Song) @ 2021-02-03 10:23 UTC (permalink / raw)
  To: Valentin Schneider, vincent.guittot, mgorman, mingo, peterz,
	dietmar.eggemann, morten.rasmussen, linux-kernel
  Cc: linuxarm, xuwei (O), Liguozhu (Kenneth), tiantao (H),
	wanghuiqiang, Zengtao (B),
	Jonathan Cameron, guodong.xu, Meelis Roos



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Wednesday, February 3, 2021 4:17 AM
> Subject: Re: [PATCH] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
>
> On 01/02/21 16:38, Barry Song wrote:
> > @@ -964,6 +941,12 @@ static void init_overlap_sched_group(struct sched_domain
> *sd,
> >
> >       build_balance_mask(sd, sg, mask);
> >       cpu = cpumask_first_and(sched_group_span(sg), mask);
> > +	/*
> > +	 * for the group generated by grandchild, use the sgc of 2nd cpu
> > +	 * because the 1st cpu might be used by another sched_group
> > +	 */
> > +	if (from_grandchild && cpumask_weight(mask) > 1)
> > +		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
> >
> >       sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
> 
> So you are getting a (hopefully) unique ID for this group span at this
> given topology level (i.e. sd->private) but as I had stated in that list of
> issues, this creates an sgc that isn't attached to the local group of any
> sched_domain, and thus won't get its capacity values updated.
> 
> This can actually be seen via the capacity values you're getting at build
> time:
> 
> > [    0.868907] CPU0 attaching sched-domain(s):
> ...
> > [    0.869542]    domain-2: span=0-5 level=NUMA
> > [    0.869559]     groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
>                                                           ^^^^^^^^^^^^^^^^
> > [    0.871177] CPU4 attaching sched-domain(s):
> ...
> > [    0.871200]   groups: 4:{ span=4 cap=977 }, 5:{ span=5 cap=1001 }
> > [    0.871243]   domain-1: span=4-7 level=NUMA
> > [    0.871257]    groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
>                                 ^^^^^^^^^^^^^^^^
> 

Yes, I can see this issue. We could hack update_group_capacity() to let
it scan both the local_group and the sched_group generated by the
grandchild, but your edit seems much better.

> IMO what we want to do here is to hook this CPU0-domain-2-group5 to the sgc
> of CPU4-domain1-group4. I've done that in the below diff - this gives us
> groups with sgc's owned at lower topology levels, but this will only ever
> be true for non-local groups. This has the added benefit of working with
> single-CPU nodes. Briefly tested on your topology and the sunfire's (via
> QEMU), and I didn't get screamed at.
> 
> Before the fun police comes and impounds my keyboard, I'd like to point out
> that we could leverage this cross-level sgc referencing hack to further
> change the NUMA domains and pretty much get rid of overlapping groups
> (that's what I was fumbling with in [1]).
> 
> [1]: http://lore.kernel.org/r/jhjwnw11ak2.mognet@arm.com
> 
> That is, rather than building overlapping groups and fixing them whenever
> that breaks (distance > 2), we could have:
> - the local group being the child domain's span (as always)
> - all non-local NUMA groups spanning a single node each, with the right sgc
>   cross-referencing.
> 
> Thoughts?

I guess the original purpose of overlapping groups is to create as few
groups as possible. If we totally remove overlapping groups, it seems we
will create many more groups?
For example, when node0 begins to build the sched_domain for distance 20,
it will add node2; since the distance between node2 and node3 is 15, once
node2 is added, node3 is also covered, as node2's lower domain spans
node3. So we need only two groups for node0's sched_domain at distance
level 20:
 +-------+        15        +--------+
 | node0 +------------------+ node1  |
 +---+---+                XX+--------+
     |                  XX
     |                XX
  20 |          15  XX
     |            XX
     |          XX
 +---+-----+  XX             +-------+
 | node2   +-----------------+ node3 |
 +---------+        15       +-------+

If we remove overlapping groups, we instead add one group for node2 and
another for node3, so we end up with three groups.

I am not sure it is always positive for performance.
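
To make the counting concrete, a toy user-space snippet (plain C,
nothing kernel about it) that just enumerates the two layouts for the
topology above:

#include <stdio.h>

/* Groups in node0's distance-20 sched_domain, for the diagram above. */
int main(void)
{
	/* overlapping: the non-local group reuses node2's child domain
	 * span, which already covers node3 */
	const char *overlapping[] = { "node0+1 (local)", "node2+3" };

	/* single-node non-local groups: one group per remote node */
	const char *single_node[] = { "node0+1 (local)", "node2", "node3" };

	printf("overlapping: %zu groups\n",
	       sizeof(overlapping) / sizeof(overlapping[0]));
	printf("single-node: %zu groups\n",
	       sizeof(single_node) / sizeof(single_node[0]));
	return 0;
}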

> 
> --->8---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b748999c9e11..ef43abb6b1fb 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -932,21 +932,15 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
> 
>  static void init_overlap_sched_group(struct sched_domain *sd,
>  				     struct sched_group *sg,
> -				     int from_grandchild)
> +				     struct sched_domain *grandchild)
>  {
>  	struct cpumask *mask = sched_domains_tmpmask2;
> -	struct sd_data *sdd = sd->private;
> +	struct sd_data *sdd = grandchild ? grandchild->private : sd->private;
>  	struct cpumask *sg_span;
>  	int cpu;
> 
>  	build_balance_mask(sd, sg, mask);
>  	cpu = cpumask_first_and(sched_group_span(sg), mask);
> -	/*
> -	 * for the group generated by grandchild, use the sgc of 2nd cpu
> -	 * because the 1st cpu might be used by another sched_group
> -	 */
> -	if (from_grandchild && cpumask_weight(mask) > 1)
> -		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
> 
>  	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
>  	if (atomic_inc_return(&sg->sgc->ref) == 1)
> @@ -979,7 +973,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
> 
>  	for_each_cpu_wrap(i, span, cpu) {
>  		struct cpumask *sg_span;
> -		int from_grandchild = 0;
> +		bool from_grandchild = false;
> 
>  		if (cpumask_test_cpu(i, covered))
>  			continue;
> @@ -1033,7 +1027,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
>  		       !cpumask_subset(sched_domain_span(sibling->child),
>  				       span)) {
>  			sibling = sibling->child;
> -			from_grandchild = 1;
> +			from_grandchild = true;
>  		}
> 
>  		sg = build_group_from_child_sched_domain(sibling, cpu);
> @@ -1043,7 +1037,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
>  		sg_span = sched_group_span(sg);
>  		cpumask_or(covered, covered, sg_span);
> 
> -		init_overlap_sched_group(sd, sg, from_grandchild);
> +		init_overlap_sched_group(sd, sg, from_grandchild ? sibling : NULL);
> 
Nice edit! I will merge it into v1 and send a v2.

>  		if (!first)
>  			first = sg;
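
To double-check I read the hunks right, with your diff applied
init_overlap_sched_group() would read as below; the tail with the /0
guard defaults is reconstructed from my reading of the current source,
so take it as a sketch rather than the exact file:

static void init_overlap_sched_group(struct sched_domain *sd,
				     struct sched_group *sg,
				     struct sched_domain *grandchild)
{
	struct cpumask *mask = sched_domains_tmpmask2;
	/* borrow the sgc from the grandchild's level, where it is owned
	 * by a local group and thus keeps getting its capacity updated */
	struct sd_data *sdd = grandchild ? grandchild->private : sd->private;
	struct cpumask *sg_span;
	int cpu;

	build_balance_mask(sd, sg, mask);
	cpu = cpumask_first_and(sched_group_span(sg), mask);

	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
	if (atomic_inc_return(&sg->sgc->ref) == 1)
		cpumask_copy(group_balance_mask(sg), mask);
	else
		WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));

	/*
	 * Initialize sgc->capacity such that even if we mess up the
	 * domains we won't die on a /0 trap.
	 */
	sg_span = sched_group_span(sg);
	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
}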

Thanks
Barry


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
  2021-02-02 15:17 ` Valentin Schneider
  2021-02-02 16:48   ` Valentin Schneider
  2021-02-03 10:23   ` Song Bao Hua (Barry Song)
@ 2021-02-03 10:27   ` Song Bao Hua (Barry Song)
  2 siblings, 0 replies; 9+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-02-03 10:27 UTC (permalink / raw)
  To: Valentin Schneider, vincent.guittot, mgorman, mingo, peterz,
	dietmar.eggemann, morten.rasmussen, linux-kernel
  Cc: linuxarm, xuwei (O), Liguozhu (Kenneth), tiantao (H),
	wanghuiqiang, Zengtao (B),
	Jonathan Cameron, guodong.xu, Meelis Roos



> > Thoughts?
> 
> I guess the original purpose of overlapping groups was to create as few
> groups as possible. If we remove overlapping groups entirely, it seems
> we will create many more groups. For example, when node0 builds its
> sched_domain for distance 20, it adds node2; and since the distance
> between node2 and node3 is 15, node3 is pulled in as well because
> node2's lower domain already covers it. So we need only two groups for
> node0's sched_domain at distance level 20.
>  +-------+                  +--------+
>  |       |      15          |        |
>  |  node0+----------------+ | node1  |
>  |       |                  |        |
>  +----+--+                XXX--------+
>       |                 XXX
>       |                XX
> 20    |         15   XX
>       |            XXX
>       |       X XXX
>  +----+----XXX               +-------+
>  |         |     15          |  node3|
>  | node2   +-----------------+       |
>  |         |                 +-------+
>  +---------+
> 

Sorry for missing a line:

node0-node3: 20

                             20
                 X XX X  X X  X  X  X
             XXXX                       X   X  X XX
           XX                                     XXX
         XX                                          X
       XX                                              XX
 +-----X-+                  +--------+                  XX
 |       |      15          |        |                   X
 |  node0+----------------+ | node1  |                   X
 |       |                  |        |                   X
 +----+--+                XXX--------+                   X
      |                 XXX                             XX
      |                XX                              XX
20    |         15   XX                             XXXX
      |            XXX                         XXXX
      |       X XXX                        XXXX
 +----+----XXX               +-------+ XXXX
 |         |     15          |  node3|XX
 | node2   +-----------------+       |
 |         |                 +-------+
 +---------+

> If we remove overlapping groups, we instead add one group for node2 and
> another for node3, so we end up with three groups.
> 
> I am not sure it is always positive for performance.
> 
Thanks
Barry

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
  2021-02-03 10:23   ` Song Bao Hua (Barry Song)
@ 2021-02-03 11:42     ` Valentin Schneider
  0 siblings, 0 replies; 9+ messages in thread
From: Valentin Schneider @ 2021-02-03 11:42 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song),
	vincent.guittot, mgorman, mingo, peterz, dietmar.eggemann,
	morten.rasmussen, linux-kernel
  Cc: linuxarm, xuwei (O), Liguozhu (Kenneth), tiantao (H),
	wanghuiqiang, Zengtao (B),
	Jonathan Cameron, guodong.xu, Meelis Roos

On 03/02/21 10:23, Song Bao Hua (Barry Song) wrote:
>> -----Original Message-----
>> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
>> Thoughts?
>
> I guess the original purpose of overlapping groups was to create as few
> groups as possible. If we remove overlapping groups entirely, it seems
> we will create many more groups. For example, when node0 builds its
> sched_domain for distance 20, it adds node2; and since the distance
> between node2 and node3 is 15, node3 is pulled in as well because
> node2's lower domain already covers it. So we need only two groups for
> node0's sched_domain at distance level 20.
>  +-------+                  +--------+
>  |       |      15          |        |
>  |  node0+----------------+ | node1  |
>  |       |                  |        |
>  +----+--+                XXX--------+
>       |                 XXX
>       |                XX
> 20    |         15   XX
>       |            XXX
>       |       X XXX
>  +----+----XXX               +-------+
>  |         |     15          |  node3|
>  | node2   +-----------------+       |
>  |         |                 +-------+
>  +---------+
>
> If we remove overlapping groups, we instead add one group for node2 and
> another for node3, so we end up with three groups.
>
> I am not sure it is always positive for performance.
>

Neither am I! At the same time our strategy for generating groups is pretty
much flawed for anything with distance > 2, so I'd like to have a saner
setup that doesn't involve fixing groups "after the fact".

I have a sort-of-working hack; I'll make it into a patch and toss it out
for discussion.
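
For reference, the rough shape of it, with every helper below invented
and the refcounting and balance masks handwaved:

/*
 * Handwaved sketch: at a NUMA level, keep the local group as the child
 * domain's span, and give every other node in the domain's span its
 * own single-node group, borrowing the sgc from the level where that
 * node *is* the local group so its capacity keeps getting updated.
 */
static void build_single_node_groups(struct sched_domain *sd, int cpu)
{
	struct sched_group *sg;
	int n;

	for_each_node(n) {
		if (!cpumask_intersects(cpumask_of_node(n),
					sched_domain_span(sd)))
			continue;
		if (cpumask_subset(cpumask_of_node(n),
				   sched_domain_span(sd->child)))
			continue;	/* already covered by the local group */

		sg = build_group_spanning_node(n, cpu);	/* invented */
		sg->sgc = find_sgc_local_to_node(n);	/* invented */
		link_numa_group(sd, sg);		/* invented */
	}
}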

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-02-03 11:44 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-01  3:38 [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2 Barry Song
2021-02-01  5:36 ` kernel test robot
2021-02-01 18:11 ` Valentin Schneider
2021-02-01 21:49   ` Song Bao Hua (Barry Song)
2021-02-02 15:17 ` Valentin Schneider
2021-02-02 16:48   ` Valentin Schneider
2021-02-03 10:23   ` Song Bao Hua (Barry Song)
2021-02-03 11:42     ` Valentin Schneider
2021-02-03 10:27   ` Song Bao Hua (Barry Song)
