From: Valentin Schneider <valentin.schneider@arm.com>
To: Barry Song <song.bao.hua@hisilicon.com>,
	vincent.guittot@linaro.org, mgorman@suse.de, mingo@kernel.org,
	peterz@infradead.org, dietmar.eggemann@arm.com,
	morten.rasmussen@arm.com, linux-kernel@vger.kernel.org
Cc: linuxarm@openeuler.org, xuwei5@huawei.com,
	liguozhu@hisilicon.com, tiantao6@hisilicon.com,
	wanghuiqiang@huawei.com, prime.zeng@hisilicon.com,
	jonathan.cameron@huawei.com, guodong.xu@linaro.org,
	Barry Song <song.bao.hua@hisilicon.com>,
	Meelis Roos <mroos@linux.ee>
Subject: Re: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
Date: Tue, 02 Feb 2021 15:17:15 +0000
Message-ID: <jhj4kiu4hz8.mognet@arm.com>
In-Reply-To: <20210201033830.15040-1-song.bao.hua@hisilicon.com>

On 01/02/21 16:38, Barry Song wrote:
> @@ -964,6 +941,12 @@ static void init_overlap_sched_group(struct sched_domain *sd,
>
>       build_balance_mask(sd, sg, mask);
>       cpu = cpumask_first_and(sched_group_span(sg), mask);
> +	/*
> +	 * for the group generated by grandchild, use the sgc of 2nd cpu
> +	 * because the 1st cpu might be used by another sched_group
> +	 */
> +	if (from_grandchild && cpumask_weight(mask) > 1)
> +		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
>
>       sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);

So you are getting a (hopefully) unique ID for this group span at this
given topology level (i.e. sd->private), but as I stated in that list of
issues, this creates an sgc that isn't attached to the local group of any
sched_domain, and thus won't get its capacity values updated.
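
As a userspace toy of that ownership rule (not kernel code; the structs
and update_group_capacity() are boiled down to the bare minimum): the
periodic capacity refresh only ever writes through sd->groups, i.e. the
local group, so an sgc that is nobody's local-group sgc keeps its
build-time value forever.

#include <stdio.h>

struct sched_group_capacity { unsigned long capacity; };
struct sched_group { struct sched_group_capacity *sgc; };
struct sched_domain { struct sched_group *groups; /* local group first */ };

/* stand-in for update_group_capacity(): only touches the local group's sgc */
static void update_group_capacity(struct sched_domain *sd, unsigned long cap)
{
	sd->groups->sgc->capacity = cap;
}

int main(void)
{
	/* both start at the build-time default */
	struct sched_group_capacity owned  = { 2048 };
	struct sched_group_capacity orphan = { 2048 };
	struct sched_group local = { &owned };
	struct sched_domain sd = { &local };

	/* a remote group references 'orphan', but no sd->groups does */
	update_group_capacity(&sd, 1978);

	printf("owned sgc: %lu, orphan sgc (stale): %lu\n",
	       owned.capacity, orphan.capacity);
	return 0;
}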

This can actually be seen in the capacity values you're getting at build
time: the group spanning 4-5 at CPU0's domain-2 still carries the default
2 * 1024 = 2048, while the sgc owned by CPU4's domain-1 local group has
been properly summed up (977 + 1001 = 1978):

> [    0.868907] CPU0 attaching sched-domain(s):
...
> [    0.869542]    domain-2: span=0-5 level=NUMA
> [    0.869559]     groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
                                                          ^^^^^^^^^^^^^^^^
> [    0.871177] CPU4 attaching sched-domain(s):
...
> [    0.871200]   groups: 4:{ span=4 cap=977 }, 5:{ span=5 cap=1001 }
> [    0.871243]   domain-1: span=4-7 level=NUMA
> [    0.871257]    groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
                                ^^^^^^^^^^^^^^^^

IMO what we want to do here is to hook this CPU0-domain-2-group5 to the sgc
of CPU4-domain-1-group4. I've done that in the diff below - this gives us
groups with sgcs owned at lower topology levels, but this will only ever
be true for non-local groups. This has the added benefit of working with
single-CPU nodes. Briefly tested on your topology and the sunfire's (via
QEMU), and I didn't get screamed at.

Before the fun police come and impound my keyboard, I'd like to point out
that we could leverage this cross-level sgc referencing hack to further
change the NUMA domains and pretty much get rid of overlapping groups
(that's what I was fumbling with in [1]).

[1]: http://lore.kernel.org/r/jhjwnw11ak2.mognet@arm.com

That is, rather than building overlapping groups and fixing them whenever
that breaks (distance > 2), we could have:
- the local group being the child domain's span (as always)
- all non-local NUMA groups spanning a single node each, with the right sgc
  cross-referencing (toy sketch below).
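
As a toy sketch of that construction (plain userspace C, hardcoded to the
2-CPUs-per-node topology from your dmesg above; nothing here is kernel
API): the local group covers the child domain's span, then we emit one
single-node group per remote node in the domain's span.

#include <stdio.h>

/* toy topology: node n owns CPUs 2n and 2n+1 */
static void print_node_span(int node)
{
	printf("{ %d-%d }", 2 * node, 2 * node + 1);
}

/*
 * Hypothetical NUMA-level group list: one local group covering the
 * child domain's span, then one single-node group per remote node.
 */
static void build_groups(int first_node, int nr_nodes,
			 int child_first, int child_last)
{
	printf("local: { %d-%d }", child_first, child_last);
	for (int n = first_node; n < first_node + nr_nodes; n++) {
		if (2 * n >= child_first && 2 * n + 1 <= child_last)
			continue; /* already covered by the local group */
		printf(", remote: ");
		print_node_span(n);
	}
	printf("\n");
}

int main(void)
{
	/* CPU0's domain-2 from the dmesg above: span=0-5, child span=0-3 */
	build_groups(0, 3, 0, 3);
	return 0;
}

For this particular level the output ({ 0-3 } plus { 4-5 }) coincides with
what the current code builds; the two schemes would only diverge where
overlapping groups span several nodes.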

Thoughts?

--->8---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b748999c9e11..ef43abb6b1fb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -932,21 +932,15 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 
 static void init_overlap_sched_group(struct sched_domain *sd,
 				     struct sched_group *sg,
-				     int from_grandchild)
+				     struct sched_domain *grandchild)
 {
 	struct cpumask *mask = sched_domains_tmpmask2;
-	struct sd_data *sdd = sd->private;
+	struct sd_data *sdd = grandchild ? grandchild->private : sd->private;
 	struct cpumask *sg_span;
 	int cpu;
 
 	build_balance_mask(sd, sg, mask);
 	cpu = cpumask_first_and(sched_group_span(sg), mask);
-	/*
-	 * for the group generated by grandchild, use the sgc of 2nd cpu
-	 * because the 1st cpu might be used by another sched_group
-	 */
-	if (from_grandchild && cpumask_weight(mask) > 1)
-		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
@@ -979,7 +973,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 
 	for_each_cpu_wrap(i, span, cpu) {
 		struct cpumask *sg_span;
-		int from_grandchild = 0;
+		bool from_grandchild = false;
 
 		if (cpumask_test_cpu(i, covered))
 			continue;
@@ -1033,7 +1027,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		       !cpumask_subset(sched_domain_span(sibling->child),
 				       span)) {
 			sibling = sibling->child;
-			from_grandchild = 1;
+			from_grandchild = true;
 		}
 
 		sg = build_group_from_child_sched_domain(sibling, cpu);
@@ -1043,7 +1037,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg, from_grandchild);
+		init_overlap_sched_group(sd, sg, from_grandchild ? sibling : NULL);
 
 		if (!first)
 			first = sg;


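To spell out what the last hunk buys us, here's a userspace toy of the
aliasing (sd_data reduced to a plain array; the real thing sits behind
*per_cpu_ptr(sdd->sgc, cpu)): indexing the grandchild's sd_data makes the
remote group's sgc the very object that level's local group owns and keeps
updated, rather than a parent-level orphan.

#include <stdio.h>

#define NR_CPUS 8

struct sched_group_capacity { unsigned long capacity; };

/* toy stand-in for sd_data: one sgc per CPU at a given topology level */
struct sd_data { struct sched_group_capacity sgc[NR_CPUS]; };

int main(void)
{
	struct sd_data parent_sdd = { 0 };
	struct sd_data grandchild_sdd = { 0 };

	/* build-time default for a 2-CPU span */
	parent_sdd.sgc[5].capacity = 2048;
	/* CPU4's domain-1 local group owns this sgc and keeps it updated */
	grandchild_sdd.sgc[4].capacity = 1978;

	/* sd->private (parent level): an sgc no local group owns -> stale */
	struct sched_group_capacity *orphan = &parent_sdd.sgc[5];
	/* grandchild->private: the sgc CPU4's local group already owns */
	struct sched_group_capacity *shared = &grandchild_sdd.sgc[4];

	printf("orphan: %lu, shared: %lu\n",
	       orphan->capacity, shared->capacity);
	return 0;
}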
