* [PATCH 0/1] sched/topology: NUMA distance deduplication
@ 2021-01-22 12:39 Valentin Schneider
  2021-01-22 12:39 ` [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort Valentin Schneider
  0 siblings, 1 reply; 15+ messages in thread
From: Valentin Schneider @ 2021-01-22 12:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, mgorman, song.bao.hua

Hi folks,

As pointed out by Barry in [1], there are topologies out there that struggle to
go through the NUMA distance deduplicating sort. The included patch is something
I wrote back when I started untangling this distance > 2 mess.

It's only been lightly tested on an array of QEMU-powered topologies I keep
around for this sort of thing. I *think* this works out fine with the NODE
topology level, but I wouldn't be surprised if I (re)introduced an off-by-one
error in there.

@Barry feel free to test / edit this as you see fit.

[1]: http://lore.kernel.org/r/9d6c6d3ba6ac4272bf844034da4653fe@hisilicon.com

HTH.


Valentin Schneider (1):
  sched/topology: Make sched_init_numa() use a set for the deduplicating
    sort

 include/linux/topology.h |  1 +
 kernel/sched/topology.c  | 99 +++++++++++++++++++---------------------
 2 files changed, 49 insertions(+), 51 deletions(-)

--
2.27.0



* [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort
  2021-01-22 12:39 [PATCH 0/1] sched/topology: NUMA distance deduplication Valentin Schneider
@ 2021-01-22 12:39 ` Valentin Schneider
  2021-01-25  2:23   ` Song Bao Hua (Barry Song)
  2021-02-01  9:53   ` Dietmar Eggemann
  0 siblings, 2 replies; 15+ messages in thread
From: Valentin Schneider @ 2021-01-22 12:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, mgorman, song.bao.hua

The deduplicating sort in sched_init_numa() assumes that the first line in
the distance table contains all unique values in the entire table. I've
been trying to pen what this exactly means for the topology, but it's not
straightforward. For instance, topology.c uses this example:

  node   0   1   2   3
    0:  10  20  20  30
    1:  20  10  20  20
    2:  20  20  10  20
    3:  30  20  20  10

  0 ----- 1
  |     / |
  |   /   |
  | /     |
  2 ----- 3

Which works out just fine. However, if we swap nodes 0 and 1:

  1 ----- 0
  |     / |
  |   /   |
  | /     |
  2 ----- 3

we get this distance table:

  node   0  1  2  3
    0:  10 20 20 20
    1:  20 10 20 30
    2:  20 20 10 20
    3:  20 30 20 10

Which breaks the deduplicating sort (non-representative first line). In
this case this would just be a renumbering exercise, but it so happens that
we can have a deduplicating sort that goes through the whole table in O(n²)
at the extra cost of a temporary memory allocation (i.e. any form of set).

The ACPI spec (SLIT) mentions distances are encoded in 8 bits. Following
this, implement the set as a 256-bit bitmap. Should this not be
satisfactory (i.e. we want to support 32-bit values), then we'll have to go
for some other sparse set implementation.
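
For illustration only (not part of this patch), here is a minimal
userspace-style sketch of the bitmap-as-set idea; the node count and the
distance table below are made up (they mirror the swapped example above):

#include <stdbool.h>
#include <stdio.h>

#define NR_DISTANCE_VALUES 256	/* SLIT distances fit in 8 bits */

/* made-up stand-ins for nr_node_ids and node_distance() */
static const int nr_nodes = 4;
static const int distances[4][4] = {
	{ 10, 20, 20, 20 },
	{ 20, 10, 20, 30 },
	{ 20, 20, 10, 20 },
	{ 20, 30, 20, 10 },
};

int main(void)
{
	bool seen[NR_DISTANCE_VALUES] = { false };	/* the "set" */
	int nr_levels = 0;
	int i, j, d;

	/* O(nr_nodes^2): mark every distance found anywhere in the table */
	for (i = 0; i < nr_nodes; i++)
		for (j = 0; j < nr_nodes; j++)
			seen[distances[i][j]] = true;

	/* walking the set in order yields the sorted, deduplicated distances */
	for (d = 0; d < NR_DISTANCE_VALUES; d++)
		if (seen[d])
			printf("level %d: distance %d\n", nr_levels++, d);

	return 0;
}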

This has the added benefit of letting us allocate just the right amount of
memory for sched_domains_numa_distance[], rather than an arbitrary
(nr_node_ids + 1).

Note: DT binding equivalent (distance-map) decodes distances as 32-bit
values.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 include/linux/topology.h |  1 +
 kernel/sched/topology.c  | 99 +++++++++++++++++++---------------------
 2 files changed, 49 insertions(+), 51 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index ad03df1cc266..7634cd737061 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -48,6 +48,7 @@ int arch_update_cpu_topology(void);
 /* Conform to ACPI 2.0 SLIT distance definitions */
 #define LOCAL_DISTANCE		10
 #define REMOTE_DISTANCE		20
+#define DISTANCE_BITS           8
 #ifndef node_distance
 #define node_distance(from,to)	((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5d3675c7a76b..bf5c9bd10bfe 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1596,66 +1596,58 @@ static void init_numa_topology_type(void)
 	}
 }
 
+
+#define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
+
 void sched_init_numa(void)
 {
-	int next_distance, curr_distance = node_distance(0, 0);
 	struct sched_domain_topology_level *tl;
-	int level = 0;
-	int i, j, k;
-
-	sched_domains_numa_distance = kzalloc(sizeof(int) * (nr_node_ids + 1), GFP_KERNEL);
-	if (!sched_domains_numa_distance)
-		return;
-
-	/* Includes NUMA identity node at level 0. */
-	sched_domains_numa_distance[level++] = curr_distance;
-	sched_domains_numa_levels = level;
+	unsigned long *distance_map;
+	int nr_levels = 0;
+	int i, j;
 
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
 	 * unique distances in the node_distance() table.
-	 *
-	 * Assumes node_distance(0,j) includes all distances in
-	 * node_distance(i,j) in order to avoid cubic time.
 	 */
-	next_distance = curr_distance;
+	distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
+	if (!distance_map)
+		return;
+
+	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
 	for (i = 0; i < nr_node_ids; i++) {
 		for (j = 0; j < nr_node_ids; j++) {
-			for (k = 0; k < nr_node_ids; k++) {
-				int distance = node_distance(i, k);
-
-				if (distance > curr_distance &&
-				    (distance < next_distance ||
-				     next_distance == curr_distance))
-					next_distance = distance;
-
-				/*
-				 * While not a strong assumption it would be nice to know
-				 * about cases where if node A is connected to B, B is not
-				 * equally connected to A.
-				 */
-				if (sched_debug() && node_distance(k, i) != distance)
-					sched_numa_warn("Node-distance not symmetric");
+			int distance = node_distance(i, j);
 
-				if (sched_debug() && i && !find_numa_distance(distance))
-					sched_numa_warn("Node-0 not representative");
+			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
+				sched_numa_warn("Invalid distance value range");
+				return;
 			}
-			if (next_distance != curr_distance) {
-				sched_domains_numa_distance[level++] = next_distance;
-				sched_domains_numa_levels = level;
-				curr_distance = next_distance;
-			} else break;
+
+			bitmap_set(distance_map, distance, 1);
 		}
+	}
+	/*
+	 * We can now figure out how many unique distance values there are and
+	 * allocate memory accordingly.
+	 */
+	nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);
 
-		/*
-		 * In case of sched_debug() we verify the above assumption.
-		 */
-		if (!sched_debug())
-			break;
+	sched_domains_numa_distance = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
+	if (!sched_domains_numa_distance) {
+		bitmap_free(distance_map);
+		return;
+	}
+
+	for (i = 0, j = 0; i < nr_levels; i++, j++) {
+		j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
+		sched_domains_numa_distance[i] = j;
 	}
 
+	bitmap_free(distance_map);
+
 	/*
-	 * 'level' contains the number of unique distances
+	 * 'nr_levels' contains the number of unique distances
 	 *
 	 * The sched_domains_numa_distance[] array includes the actual distance
 	 * numbers.
@@ -1664,15 +1656,15 @@ void sched_init_numa(void)
 	/*
 	 * Here, we should temporarily reset sched_domains_numa_levels to 0.
 	 * If it fails to allocate memory for array sched_domains_numa_masks[][],
-	 * the array will contain less then 'level' members. This could be
+	 * the array will contain less then 'nr_levels' members. This could be
 	 * dangerous when we use it to iterate array sched_domains_numa_masks[][]
 	 * in other functions.
 	 *
-	 * We reset it to 'level' at the end of this function.
+	 * We reset it to 'nr_levels' at the end of this function.
 	 */
 	sched_domains_numa_levels = 0;
 
-	sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
+	sched_domains_numa_masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
 	if (!sched_domains_numa_masks)
 		return;
 
@@ -1680,7 +1672,7 @@ void sched_init_numa(void)
 	 * Now for each level, construct a mask per node which contains all
 	 * CPUs of nodes that are that many hops away from us.
 	 */
-	for (i = 0; i < level; i++) {
+	for (i = 0; i < nr_levels; i++) {
 		sched_domains_numa_masks[i] =
 			kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
 		if (!sched_domains_numa_masks[i])
@@ -1688,12 +1680,17 @@ void sched_init_numa(void)
 
 		for (j = 0; j < nr_node_ids; j++) {
 			struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
+			int k;
+
 			if (!mask)
 				return;
 
 			sched_domains_numa_masks[i][j] = mask;
 
 			for_each_node(k) {
+				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
+					sched_numa_warn("Node-distance not symmetric");
+
 				if (node_distance(j, k) > sched_domains_numa_distance[i])
 					continue;
 
@@ -1705,7 +1702,7 @@ void sched_init_numa(void)
 	/* Compute default topology size */
 	for (i = 0; sched_domain_topology[i].mask; i++);
 
-	tl = kzalloc((i + level + 1) *
+	tl = kzalloc((i + nr_levels) *
 			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
 	if (!tl)
 		return;
@@ -1728,7 +1725,7 @@ void sched_init_numa(void)
 	/*
 	 * .. and append 'j' levels of NUMA goodness.
 	 */
-	for (j = 1; j < level; i++, j++) {
+	for (j = 1; j < nr_levels; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
 			.mask = sd_numa_mask,
 			.sd_flags = cpu_numa_flags,
@@ -1740,8 +1737,8 @@ void sched_init_numa(void)
 
 	sched_domain_topology = tl;
 
-	sched_domains_numa_levels = level;
-	sched_max_numa_distance = sched_domains_numa_distance[level - 1];
+	sched_domains_numa_levels = nr_levels;
+	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
 
 	init_numa_topology_type();
 }
-- 
2.27.0



* RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort
  2021-01-22 12:39 ` [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort Valentin Schneider
@ 2021-01-25  2:23   ` Song Bao Hua (Barry Song)
  2021-01-25  9:26     ` Valentin Schneider
  2021-02-01  9:53   ` Dietmar Eggemann
  1 sibling, 1 reply; 15+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-01-25  2:23 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, mgorman



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Saturday, January 23, 2021 1:40 AM
> To: linux-kernel@vger.kernel.org
> Cc: mingo@kernel.org; peterz@infradead.org; vincent.guittot@linaro.org;
> dietmar.eggemann@arm.com; morten.rasmussen@arm.com; mgorman@suse.de; Song Bao
> Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Subject: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the
> deduplicating sort
> 
> The deduplicating sort in sched_init_numa() assumes that the first line in
> the distance table contains all unique values in the entire table. I've
> been trying to pen what this exactly means for the topology, but it's not
> straightforward. For instance, topology.c uses this example:
> 
>   node   0   1   2   3
>     0:  10  20  20  30
>     1:  20  10  20  20
>     2:  20  20  10  20
>     3:  30  20  20  10
> 
>   0 ----- 1
>   |     / |
>   |   /   |
>   | /     |
>   2 ----- 3
> 
> Which works out just fine. However, if we swap nodes 0 and 1:
> 
>   1 ----- 0
>   |     / |
>   |   /   |
>   | /     |
>   2 ----- 3
> 
> we get this distance table:
> 
>   node   0  1  2  3
>     0:  10 20 20 20
>     1:  20 10 20 30
>     2:  20 20 10 20
>     3:  20 30 20 10
> 
> Which breaks the deduplicating sort (non-representative first line). In
> this case this would just be a renumbering exercise, but it so happens that
> we can have a deduplicating sort that goes through the whole table in O(n²)
> at the extra cost of a temporary memory allocation (i.e. any form of set).
> 
> The ACPI spec (SLIT) mentions distances are encoded on 8 bits. Following
> this, implement the set as a 256-bits bitmap. Should this not be
> satisfactory (i.e. we want to support 32-bit values), then we'll have to go
> for some other sparse set implementation.
> 
> This has the added benefit of letting us allocate just the right amount of
> memory for sched_domains_numa_distance[], rather than an arbitrary
> (nr_node_ids + 1).
> 
> Note: DT binding equivalent (distance-map) decodes distances as 32-bit
> values.
> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Hi Valentin, thanks.
It seems it didn't resolve my problem. The simplest code to work around
my problem is:
void sched_init_numa(void)
{
	...
	next_distance = curr_distance;
	for (i = 0; i < nr_node_ids; i++) {
		for (j = 0; j < nr_node_ids; j++) {
			for (k = 0; k < nr_node_ids; k++) {
+				int ii = (i + 1) % nr_node_ids;
-				int distance = node_distance(i, k);
+				int distance = node_distance(ii, k);

				if (distance > curr_distance &&
				    (distance < next_distance ||
				     next_distance == curr_distance))
					next_distance = distance;

				/*

On the other hand, the patch made the kernel panic during boot:

[    1.596789] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000417c9000
[    1.598406] [ffff80002e8cc001] pgd=000000013ffff003,
p4d=000000013ffff003, pud=000000013fffe003, pmd=0000000000000000
[    1.600258] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[    1.600730] Modules linked in:
[    1.601072] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W
  5.11.0-rc1-00079-g8da796ff6f58-dirty #114
[    1.601652] Hardware name: linux,dummy-virt (DT)
[    1.601917] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[    1.602202] pc : build_sched_domains+0x3e4/0x1498
[    1.604050] lr : build_sched_domains+0x3c0/0x1498
[    1.604512] sp : ffff800011f1bce0
[    1.604654] x29: ffff800011f1bce0 x28: ffff800011b7a410
[    1.604924] x27: ffff800011b799c0 x26: ffff00000263b300
[    1.605227] x25: 00000000ffffffff x24: ffff00000263b3a0
[    1.605470] x23: 0000000000000000 x22: ffff800011b799c0
[    1.605813] x21: 0000000000000000 x20: ffff00000261db00
[    1.606140] x19: ffff0000025c7600 x18: 0000000000000010
[    1.606472] x17: 000000001472bb62 x16: 000000006e92504d
[    1.606885] x15: ffff000002600508 x14: 0000000000000116
[    1.607408] x13: ffff000002600508 x12: 00000000ffffffea
[    1.607681] x11: ffff800011c02fe8 x10: ffff800011beafa8
[    1.607931] x9 : ffff800011beb000 x8 : 0000000000017fe8
[    1.608225] x7 : 00000000000002a8 x6 : 00000000000000ff
[    1.608480] x5 : 0000000000000000 x4 : 0000000000000000
[    1.608690] x3 : 0000000000000000 x2 : ffff80002e8cc000
[    1.609048] x1 : 0000000000000001 x0 : 0000000000000001
[    1.609425] Call trace:
[    1.609550]  build_sched_domains+0x3e4/0x1498
[    1.609777]  sched_init_domains+0x88/0xb8
[    1.609937]  sched_init_smp+0x30/0x80
[    1.610170]  kernel_init_freeable+0xf4/0x238
[    1.610388]  kernel_init+0x14/0x118
[    1.610559]  ret_from_fork+0x10/0x30
[    1.611234] Code: b4000201 93407eb7 aa0103e0 f8777ac2 (f8626800)
[    1.613107] ---[ end trace 01540465b01c8e3b ]---
[    1.614871] Kernel panic - not syncing: Attempted to kill init!
exitcode=0x0000000b
[    1.615433] SMP: stopping secondary CPUs
[    1.616415] ---[ end Kernel panic - not syncing: Attempted to kill
init! exitcode=0x0000000b ]---

with the below topology:
qemu-system-aarch64 -M virt -nographic \
 -smp cpus=8 \
 -numa node,cpus=0-1,nodeid=0 \
 -numa node,cpus=2-3,nodeid=1 \
 -numa node,cpus=4-5,nodeid=2 \
 -numa node,cpus=6-7,nodeid=3 \
 -numa dist,src=0,dst=1,val=12 \
 -numa dist,src=0,dst=2,val=20 \
 -numa dist,src=0,dst=3,val=22 \
 -numa dist,src=1,dst=2,val=22 \
 -numa dist,src=2,dst=3,val=12 \
 -numa dist,src=1,dst=3,val=24 \
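
For reference, the distance table this produces (derived from the values
above, with symmetric distances and 10 on the diagonal) would be:

  node   0   1   2   3
    0:  10  12  20  22
    1:  12  10  22  24
    2:  20  22  10  12
    3:  22  24  12  10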


The panic address is *1294:

                        if (sdd->sd) {
    1280:       f9400e61        ldr     x1, [x19, #24]
    1284:       b4000201        cbz     x1, 12c4 <build_sched_domains+0x414>
                                sd = *per_cpu_ptr(sdd->sd, j);
    1288:       93407eb7        sxtw    x23, w21
    128c:       aa0103e0        mov     x0, x1
    1290:       f8777ac2        ldr     x2, [x22, x23, lsl #3]
    *1294:       f8626800        ldr     x0, [x0, x2]
                                if (sd && (sd->flags & SD_OVERLAP))
    1298:       b4000120        cbz     x0, 12bc <build_sched_domains+0x40c>
    129c:       b9403803        ldr     w3, [x0, #56]
    12a0:       365800e3        tbz     w3, #11, 12bc
<build_sched_domains+0x40c>
                                        free_sched_groups(sd->groups, 0);
    12a4:       f9400800        ldr     x0, [x0, #16]
        if (!sg)

Thanks
Barry

> ---
>  include/linux/topology.h |  1 +
>  kernel/sched/topology.c  | 99 +++++++++++++++++++---------------------
>  2 files changed, 49 insertions(+), 51 deletions(-)
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index ad03df1cc266..7634cd737061 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -48,6 +48,7 @@ int arch_update_cpu_topology(void);
>  /* Conform to ACPI 2.0 SLIT distance definitions */
>  #define LOCAL_DISTANCE		10
>  #define REMOTE_DISTANCE		20
> +#define DISTANCE_BITS           8
>  #ifndef node_distance
>  #define node_distance(from,to)	((from) == (to) ? LOCAL_DISTANCE :
> REMOTE_DISTANCE)
>  #endif
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5d3675c7a76b..bf5c9bd10bfe 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1596,66 +1596,58 @@ static void init_numa_topology_type(void)
>  	}
>  }
> 
> +
> +#define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
> +
>  void sched_init_numa(void)
>  {
> -	int next_distance, curr_distance = node_distance(0, 0);
>  	struct sched_domain_topology_level *tl;
> -	int level = 0;
> -	int i, j, k;
> -
> -	sched_domains_numa_distance = kzalloc(sizeof(int) * (nr_node_ids + 1),
> GFP_KERNEL);
> -	if (!sched_domains_numa_distance)
> -		return;
> -
> -	/* Includes NUMA identity node at level 0. */
> -	sched_domains_numa_distance[level++] = curr_distance;
> -	sched_domains_numa_levels = level;
> +	unsigned long *distance_map;
> +	int nr_levels = 0;
> +	int i, j;
> 
>  	/*
>  	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
>  	 * unique distances in the node_distance() table.
> -	 *
> -	 * Assumes node_distance(0,j) includes all distances in
> -	 * node_distance(i,j) in order to avoid cubic time.
>  	 */
> -	next_distance = curr_distance;
> +	distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
> +	if (!distance_map)
> +		return;
> +
> +	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
>  	for (i = 0; i < nr_node_ids; i++) {
>  		for (j = 0; j < nr_node_ids; j++) {
> -			for (k = 0; k < nr_node_ids; k++) {
> -				int distance = node_distance(i, k);
> -
> -				if (distance > curr_distance &&
> -				    (distance < next_distance ||
> -				     next_distance == curr_distance))
> -					next_distance = distance;
> -
> -				/*
> -				 * While not a strong assumption it would be nice to know
> -				 * about cases where if node A is connected to B, B is not
> -				 * equally connected to A.
> -				 */
> -				if (sched_debug() && node_distance(k, i) != distance)
> -					sched_numa_warn("Node-distance not symmetric");
> +			int distance = node_distance(i, j);
> 
> -				if (sched_debug() && i && !find_numa_distance(distance))
> -					sched_numa_warn("Node-0 not representative");
> +			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
> +				sched_numa_warn("Invalid distance value range");
> +				return;
>  			}
> -			if (next_distance != curr_distance) {
> -				sched_domains_numa_distance[level++] = next_distance;
> -				sched_domains_numa_levels = level;
> -				curr_distance = next_distance;
> -			} else break;
> +
> +			bitmap_set(distance_map, distance, 1);
>  		}
> +	}
> +	/*
> +	 * We can now figure out how many unique distance values there are and
> +	 * allocate memory accordingly.
> +	 */
> +	nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);
> 
> -		/*
> -		 * In case of sched_debug() we verify the above assumption.
> -		 */
> -		if (!sched_debug())
> -			break;
> +	sched_domains_numa_distance = kcalloc(nr_levels, sizeof(int),
> GFP_KERNEL);
> +	if (!sched_domains_numa_distance) {
> +		bitmap_free(distance_map);
> +		return;
> +	}
> +
> +	for (i = 0, j = 0; i < nr_levels; i++, j++) {
> +		j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
> +		sched_domains_numa_distance[i] = j;
>  	}
> 
> +	bitmap_free(distance_map);
> +
>  	/*
> -	 * 'level' contains the number of unique distances
> +	 * 'nr_levels' contains the number of unique distances
>  	 *
>  	 * The sched_domains_numa_distance[] array includes the actual distance
>  	 * numbers.
> @@ -1664,15 +1656,15 @@ void sched_init_numa(void)
>  	/*
>  	 * Here, we should temporarily reset sched_domains_numa_levels to 0.
>  	 * If it fails to allocate memory for array sched_domains_numa_masks[][],
> -	 * the array will contain less then 'level' members. This could be
> +	 * the array will contain less then 'nr_levels' members. This could be
>  	 * dangerous when we use it to iterate array sched_domains_numa_masks[][]
>  	 * in other functions.
>  	 *
> -	 * We reset it to 'level' at the end of this function.
> +	 * We reset it to 'nr_levels' at the end of this function.
>  	 */
>  	sched_domains_numa_levels = 0;
> 
> -	sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
> +	sched_domains_numa_masks = kzalloc(sizeof(void *) * nr_levels,
> GFP_KERNEL);
>  	if (!sched_domains_numa_masks)
>  		return;
> 
> @@ -1680,7 +1672,7 @@ void sched_init_numa(void)
>  	 * Now for each level, construct a mask per node which contains all
>  	 * CPUs of nodes that are that many hops away from us.
>  	 */
> -	for (i = 0; i < level; i++) {
> +	for (i = 0; i < nr_levels; i++) {
>  		sched_domains_numa_masks[i] =
>  			kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
>  		if (!sched_domains_numa_masks[i])
> @@ -1688,12 +1680,17 @@ void sched_init_numa(void)
> 
>  		for (j = 0; j < nr_node_ids; j++) {
>  			struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
> +			int k;
> +
>  			if (!mask)
>  				return;
> 
>  			sched_domains_numa_masks[i][j] = mask;
> 
>  			for_each_node(k) {
> +				if (sched_debug() && (node_distance(j, k) != node_distance(k,
> j)))
> +					sched_numa_warn("Node-distance not symmetric");
> +
>  				if (node_distance(j, k) > sched_domains_numa_distance[i])
>  					continue;
> 
> @@ -1705,7 +1702,7 @@ void sched_init_numa(void)
>  	/* Compute default topology size */
>  	for (i = 0; sched_domain_topology[i].mask; i++);
> 
> -	tl = kzalloc((i + level + 1) *
> +	tl = kzalloc((i + nr_levels) *
>  			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
>  	if (!tl)
>  		return;
> @@ -1728,7 +1725,7 @@ void sched_init_numa(void)
>  	/*
>  	 * .. and append 'j' levels of NUMA goodness.
>  	 */
> -	for (j = 1; j < level; i++, j++) {
> +	for (j = 1; j < nr_levels; i++, j++) {
>  		tl[i] = (struct sched_domain_topology_level){
>  			.mask = sd_numa_mask,
>  			.sd_flags = cpu_numa_flags,
> @@ -1740,8 +1737,8 @@ void sched_init_numa(void)
> 
>  	sched_domain_topology = tl;
> 
> -	sched_domains_numa_levels = level;
> -	sched_max_numa_distance = sched_domains_numa_distance[level - 1];
> +	sched_domains_numa_levels = nr_levels;
> +	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
> 
>  	init_numa_topology_type();
>  }
> --
> 2.27.0



* RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort
  2021-01-25  2:23   ` Song Bao Hua (Barry Song)
@ 2021-01-25  9:26     ` Valentin Schneider
  2021-01-25 16:45       ` Valentin Schneider
  0 siblings, 1 reply; 15+ messages in thread
From: Valentin Schneider @ 2021-01-25  9:26 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song), linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, mgorman

On 25/01/21 02:23, Song Bao Hua (Barry Song) wrote:

> with the below topology:
> qemu-system-aarch64 -M virt -nographic \
>  -smp cpus=8 \
>  -numa node,cpus=0-1,nodeid=0 \
>  -numa node,cpus=2-3,nodeid=1 \
>  -numa node,cpus=4-5,nodeid=2 \
>  -numa node,cpus=6-7,nodeid=3 \
>  -numa dist,src=0,dst=1,val=12 \
>  -numa dist,src=0,dst=2,val=20 \
>  -numa dist,src=0,dst=3,val=22 \
>  -numa dist,src=1,dst=2,val=22 \
>  -numa dist,src=2,dst=3,val=12 \
>  -numa dist,src=1,dst=3,val=24 \
>
>
> The panic address is *1294:
>
>                         if (sdd->sd) {
>     1280:       f9400e61        ldr     x1, [x19, #24]
>     1284:       b4000201        cbz     x1, 12c4 <build_sched_domains+0x414>
>                                 sd = *per_cpu_ptr(sdd->sd, j);
>     1288:       93407eb7        sxtw    x23, w21
>     128c:       aa0103e0        mov     x0, x1
>     1290:       f8777ac2        ldr     x2, [x22, x23, lsl #3]
>     *1294:       f8626800        ldr     x0, [x0, x2]
>                                 if (sd && (sd->flags & SD_OVERLAP))
>     1298:       b4000120        cbz     x0, 12bc <build_sched_domains+0x40c>
>     129c:       b9403803        ldr     w3, [x0, #56]
>     12a0:       365800e3        tbz     w3, #11, 12bc
> <build_sched_domains+0x40c>
>                                         free_sched_groups(sd->groups, 0);
>     12a4:       f9400800        ldr     x0, [x0, #16]
>         if (!sg)
>

Thanks for giving it a shot, let me run that with your topology and see
where I end up.

> Thanks
> Barry
>
>> ---
>>  include/linux/topology.h |  1 +
>>  kernel/sched/topology.c  | 99 +++++++++++++++++++---------------------
>>  2 files changed, 49 insertions(+), 51 deletions(-)
>>
>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>> index ad03df1cc266..7634cd737061 100644
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -48,6 +48,7 @@ int arch_update_cpu_topology(void);
>>  /* Conform to ACPI 2.0 SLIT distance definitions */
>>  #define LOCAL_DISTANCE		10
>>  #define REMOTE_DISTANCE		20
>> +#define DISTANCE_BITS           8
>>  #ifndef node_distance
>>  #define node_distance(from,to)	((from) == (to) ? LOCAL_DISTANCE :
>> REMOTE_DISTANCE)
>>  #endif
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 5d3675c7a76b..bf5c9bd10bfe 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1596,66 +1596,58 @@ static void init_numa_topology_type(void)
>>      }
>>  }
>>
>> +
>> +#define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
>> +
>>  void sched_init_numa(void)
>>  {
>> -	int next_distance, curr_distance = node_distance(0, 0);
>>      struct sched_domain_topology_level *tl;
>> -	int level = 0;
>> -	int i, j, k;
>> -
>> -	sched_domains_numa_distance = kzalloc(sizeof(int) * (nr_node_ids + 1),
>> GFP_KERNEL);
>> -	if (!sched_domains_numa_distance)
>> -		return;
>> -
>> -	/* Includes NUMA identity node at level 0. */
>> -	sched_domains_numa_distance[level++] = curr_distance;
>> -	sched_domains_numa_levels = level;
>> +	unsigned long *distance_map;
>> +	int nr_levels = 0;
>> +	int i, j;
>>
>>      /*
>>       * O(nr_nodes^2) deduplicating selection sort -- in order to find the
>>       * unique distances in the node_distance() table.
>> -	 *
>> -	 * Assumes node_distance(0,j) includes all distances in
>> -	 * node_distance(i,j) in order to avoid cubic time.
>>       */
>> -	next_distance = curr_distance;
>> +	distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
>> +	if (!distance_map)
>> +		return;
>> +
>> +	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
>>      for (i = 0; i < nr_node_ids; i++) {
>>              for (j = 0; j < nr_node_ids; j++) {
>> -			for (k = 0; k < nr_node_ids; k++) {
>> -				int distance = node_distance(i, k);
>> -
>> -				if (distance > curr_distance &&
>> -				    (distance < next_distance ||
>> -				     next_distance == curr_distance))
>> -					next_distance = distance;
>> -
>> -				/*
>> -				 * While not a strong assumption it would be nice to know
>> -				 * about cases where if node A is connected to B, B is not
>> -				 * equally connected to A.
>> -				 */
>> -				if (sched_debug() && node_distance(k, i) != distance)
>> -					sched_numa_warn("Node-distance not symmetric");
>> +			int distance = node_distance(i, j);
>>
>> -				if (sched_debug() && i && !find_numa_distance(distance))
>> -					sched_numa_warn("Node-0 not representative");
>> +			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
>> +				sched_numa_warn("Invalid distance value range");
>> +				return;
>>                      }
>> -			if (next_distance != curr_distance) {
>> -				sched_domains_numa_distance[level++] = next_distance;
>> -				sched_domains_numa_levels = level;
>> -				curr_distance = next_distance;
>> -			} else break;
>> +
>> +			bitmap_set(distance_map, distance, 1);
>>              }
>> +	}
>> +	/*
>> +	 * We can now figure out how many unique distance values there are and
>> +	 * allocate memory accordingly.
>> +	 */
>> +	nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);
>>
>> -		/*
>> -		 * In case of sched_debug() we verify the above assumption.
>> -		 */
>> -		if (!sched_debug())
>> -			break;
>> +	sched_domains_numa_distance = kcalloc(nr_levels, sizeof(int),
>> GFP_KERNEL);
>> +	if (!sched_domains_numa_distance) {
>> +		bitmap_free(distance_map);
>> +		return;
>> +	}
>> +
>> +	for (i = 0, j = 0; i < nr_levels; i++, j++) {
>> +		j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
>> +		sched_domains_numa_distance[i] = j;
>>      }
>>
>> +	bitmap_free(distance_map);
>> +
>>      /*
>> -	 * 'level' contains the number of unique distances
>> +	 * 'nr_levels' contains the number of unique distances
>>       *
>>       * The sched_domains_numa_distance[] array includes the actual distance
>>       * numbers.
>> @@ -1664,15 +1656,15 @@ void sched_init_numa(void)
>>      /*
>>       * Here, we should temporarily reset sched_domains_numa_levels to 0.
>>       * If it fails to allocate memory for array sched_domains_numa_masks[][],
>> -	 * the array will contain less then 'level' members. This could be
>> +	 * the array will contain less then 'nr_levels' members. This could be
>>       * dangerous when we use it to iterate array sched_domains_numa_masks[][]
>>       * in other functions.
>>       *
>> -	 * We reset it to 'level' at the end of this function.
>> +	 * We reset it to 'nr_levels' at the end of this function.
>>       */
>>      sched_domains_numa_levels = 0;
>>
>> -	sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
>> +	sched_domains_numa_masks = kzalloc(sizeof(void *) * nr_levels,
>> GFP_KERNEL);
>>      if (!sched_domains_numa_masks)
>>              return;
>>
>> @@ -1680,7 +1672,7 @@ void sched_init_numa(void)
>>       * Now for each level, construct a mask per node which contains all
>>       * CPUs of nodes that are that many hops away from us.
>>       */
>> -	for (i = 0; i < level; i++) {
>> +	for (i = 0; i < nr_levels; i++) {
>>              sched_domains_numa_masks[i] =
>>                      kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
>>              if (!sched_domains_numa_masks[i])
>> @@ -1688,12 +1680,17 @@ void sched_init_numa(void)
>>
>>              for (j = 0; j < nr_node_ids; j++) {
>>                      struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
>> +			int k;
>> +
>>                      if (!mask)
>>                              return;
>>
>>                      sched_domains_numa_masks[i][j] = mask;
>>
>>                      for_each_node(k) {
>> +				if (sched_debug() && (node_distance(j, k) != node_distance(k,
>> j)))
>> +					sched_numa_warn("Node-distance not symmetric");
>> +
>>                              if (node_distance(j, k) > sched_domains_numa_distance[i])
>>                                      continue;
>>
>> @@ -1705,7 +1702,7 @@ void sched_init_numa(void)
>>      /* Compute default topology size */
>>      for (i = 0; sched_domain_topology[i].mask; i++);
>>
>> -	tl = kzalloc((i + level + 1) *
>> +	tl = kzalloc((i + nr_levels) *
>>                      sizeof(struct sched_domain_topology_level), GFP_KERNEL);
>>      if (!tl)
>>              return;
>> @@ -1728,7 +1725,7 @@ void sched_init_numa(void)
>>      /*
>>       * .. and append 'j' levels of NUMA goodness.
>>       */
>> -	for (j = 1; j < level; i++, j++) {
>> +	for (j = 1; j < nr_levels; i++, j++) {
>>              tl[i] = (struct sched_domain_topology_level){
>>                      .mask = sd_numa_mask,
>>                      .sd_flags = cpu_numa_flags,
>> @@ -1740,8 +1737,8 @@ void sched_init_numa(void)
>>
>>      sched_domain_topology = tl;
>>
>> -	sched_domains_numa_levels = level;
>> -	sched_max_numa_distance = sched_domains_numa_distance[level - 1];
>> +	sched_domains_numa_levels = nr_levels;
>> +	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
>>
>>      init_numa_topology_type();
>>  }
>> --
>> 2.27.0


* RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort
  2021-01-25  9:26     ` Valentin Schneider
@ 2021-01-25 16:45       ` Valentin Schneider
  2021-01-25 21:35         ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 15+ messages in thread
From: Valentin Schneider @ 2021-01-25 16:45 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song), linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, mgorman

On 25/01/21 09:26, Valentin Schneider wrote:
> On 25/01/21 02:23, Song Bao Hua (Barry Song) wrote:
>
>> with the below topology:
>> qemu-system-aarch64 -M virt -nographic \
>>  -smp cpus=8 \
>>  -numa node,cpus=0-1,nodeid=0 \
>>  -numa node,cpus=2-3,nodeid=1 \
>>  -numa node,cpus=4-5,nodeid=2 \
>>  -numa node,cpus=6-7,nodeid=3 \
>>  -numa dist,src=0,dst=1,val=12 \
>>  -numa dist,src=0,dst=2,val=20 \
>>  -numa dist,src=0,dst=3,val=22 \
>>  -numa dist,src=1,dst=2,val=22 \
>>  -numa dist,src=2,dst=3,val=12 \
>>  -numa dist,src=1,dst=3,val=24 \
>>
>>
>> The panic address is *1294:
>>
>>                         if (sdd->sd) {
>>     1280:       f9400e61        ldr     x1, [x19, #24]
>>     1284:       b4000201        cbz     x1, 12c4 <build_sched_domains+0x414>
>>                                 sd = *per_cpu_ptr(sdd->sd, j);
>>     1288:       93407eb7        sxtw    x23, w21
>>     128c:       aa0103e0        mov     x0, x1
>>     1290:       f8777ac2        ldr     x2, [x22, x23, lsl #3]
>>     *1294:       f8626800        ldr     x0, [x0, x2]
>>                                 if (sd && (sd->flags & SD_OVERLAP))
>>     1298:       b4000120        cbz     x0, 12bc <build_sched_domains+0x40c>
>>     129c:       b9403803        ldr     w3, [x0, #56]
>>     12a0:       365800e3        tbz     w3, #11, 12bc
>> <build_sched_domains+0x40c>
>>                                         free_sched_groups(sd->groups, 0);
>>     12a4:       f9400800        ldr     x0, [x0, #16]
>>         if (!sg)
>>
>
> Thanks for giving it a shot, let me run that with your topology and see
> where I end up.
>

I can't seem to reproduce this - your topology is actually among the ones
I tested this against.

Applying this patch obviously doesn't get rid of the group span issue, but
it does remove this warning:

[    0.354806] ERROR: Node-0 not representative
[    0.354806]
[    0.355223]   10 12 20 22
[    0.355475]   12 10 22 24
[    0.355648]   20 22 10 12
[    0.355814]   22 24 12 10

I'm running this based on tip/sched/core:

  65bcf072e20e ("sched: Use task_current() instead of 'rq->curr == p'")


* RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort
  2021-01-25 16:45       ` Valentin Schneider
@ 2021-01-25 21:35         ` Song Bao Hua (Barry Song)
  2021-01-28 14:47           ` Valentin Schneider
  0 siblings, 1 reply; 15+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-01-25 21:35 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, mgorman



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Tuesday, January 26, 2021 5:46 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> linux-kernel@vger.kernel.org
> Cc: mingo@kernel.org; peterz@infradead.org; vincent.guittot@linaro.org;
> dietmar.eggemann@arm.com; morten.rasmussen@arm.com; mgorman@suse.de
> Subject: RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for
> the deduplicating sort
> 
> On 25/01/21 09:26, Valentin Schneider wrote:
> > On 25/01/21 02:23, Song Bao Hua (Barry Song) wrote:
> >
> >> with the below topology:
> >> qemu-system-aarch64 -M virt -nographic \
> >>  -smp cpus=8 \
> >>  -numa node,cpus=0-1,nodeid=0 \
> >>  -numa node,cpus=2-3,nodeid=1 \
> >>  -numa node,cpus=4-5,nodeid=2 \
> >>  -numa node,cpus=6-7,nodeid=3 \
> >>  -numa dist,src=0,dst=1,val=12 \
> >>  -numa dist,src=0,dst=2,val=20 \
> >>  -numa dist,src=0,dst=3,val=22 \
> >>  -numa dist,src=1,dst=2,val=22 \
> >>  -numa dist,src=2,dst=3,val=12 \
> >>  -numa dist,src=1,dst=3,val=24 \
> >>
> >>
> >> The panic address is *1294:
> >>
> >>                         if (sdd->sd) {
> >>     1280:       f9400e61        ldr     x1, [x19, #24]
> >>     1284:       b4000201        cbz     x1, 12c4 <build_sched_domains+0x414>
> >>                                 sd = *per_cpu_ptr(sdd->sd, j);
> >>     1288:       93407eb7        sxtw    x23, w21
> >>     128c:       aa0103e0        mov     x0, x1
> >>     1290:       f8777ac2        ldr     x2, [x22, x23, lsl #3]
> >>     *1294:       f8626800        ldr     x0, [x0, x2]
> >>                                 if (sd && (sd->flags & SD_OVERLAP))
> >>     1298:       b4000120        cbz     x0, 12bc <build_sched_domains+0x40c>
> >>     129c:       b9403803        ldr     w3, [x0, #56]
> >>     12a0:       365800e3        tbz     w3, #11, 12bc
> >> <build_sched_domains+0x40c>
> >>                                         free_sched_groups(sd->groups, 0);
> >>     12a4:       f9400800        ldr     x0, [x0, #16]
> >>         if (!sg)
> >>
> >
> > Thanks for giving it a shot, let me run that with your topology and see
> > where I end up.
> >
> 
> I can't seem to reproduce this - your topology is actually among the ones
> I tested this against.
> 
> Applying this patch obviously doesn't get rid of the group span issue, but
> it does remove this warning:
> 
> [    0.354806] ERROR: Node-0 not representative
> [    0.354806]
> [    0.355223]   10 12 20 22
> [    0.355475]   12 10 22 24
> [    0.355648]   20 22 10 12
> [    0.355814]   22 24 12 10
> 
> I'm running this based on tip/sched/core:
> 
>   65bcf072e20e ("sched: Use task_current() instead of 'rq->curr == p'")

I was using 5.11-rc1. One thing I'd like to mention is that:

For the below topology:
+-------+          +-----+
| node1 |  20      |node2|
|       +----------+     |
+---+---+          +-----+
    |                  |12
12  |                  |
+---+---+          +---+-+
|       |          |node3|
| node0 |          |     |
+-------+          +-----+

with node0-node2 as 22, node0-node3 as 24, node1-node3 as 22.

I will get the below sched_domains_numa_distance[]:
10, 12, 22, 24
As you can see there is *no* 20, so node1 and node2 will only get a
two-level NUMA sched_domain.
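
For reference, the full distance table for this topology (derived from the
distances above, assuming symmetry) would be:

  node   0   1   2   3
    0:  10  12  22  24
    1:  12  10  20  22
    2:  22  20  10  12
    3:  24  22  12  10

Note how 20 only appears outside of node 0's row.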


But for the below topology:
+-------+          +-----+
| node0 |  20      |node2|
|       +----------+     |
+---+---+          +-----+
    |                  |12
12  |                  |
+---+---+          +---+-+
|       |          |node3|
| node1 |          |     |
+-------+          +-----+

with node1-node2 as 22, node1-node3 as 24,node0-node3 as 22.

I will get the below sched_domains_numa_distance[]:
10, 12, 20, 22, 24

What I have seen is that performance is better if we drop the 20, as we
get a sched_domain hierarchy with fewer levels, and the two intermediate
nodes won't have the group span issue.

Thanks
Barry



* RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort
  2021-01-25 21:35         ` Song Bao Hua (Barry Song)
@ 2021-01-28 14:47           ` Valentin Schneider
  2021-01-29  2:02             ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 15+ messages in thread
From: Valentin Schneider @ 2021-01-28 14:47 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song), linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, mgorman

On 25/01/21 21:35, Song Bao Hua (Barry Song) wrote:
> I was using 5.11-rc1. One thing I'd like to mention is that:
>
> For the below topology:
> +-------+          +-----+
> | node1 |  20      |node2|
> |       +----------+     |
> +---+---+          +-----+
>     |                  |12
> 12  |                  |
> +---+---+          +---+-+
> |       |          |node3|
> | node0 |          |     |
> +-------+          +-----+
>
> with node0-node2 as 22, node0-node3 as 24, node1-node3 as 22.
>
> I will get the below sched_domains_numa_distance[]:
> 10, 12, 22, 24
> As you can see there is *no* 20. So the node1 and node2 will
> only get two-level numa sched_domain:
>


So that's

    -numa node,cpus=0-1,nodeid=0 -numa node,cpus=2-3,nodeid=1, \
    -numa node,cpus=4-5,nodeid=2, -numa node,cpus=6-7,nodeid=3, \
    -numa dist,src=0,dst=1,val=12, \
    -numa dist,src=0,dst=2,val=22, \
    -numa dist,src=0,dst=3,val=24, \
    -numa dist,src=1,dst=2,val=20, \
    -numa dist,src=1,dst=3,val=22, \
    -numa dist,src=2,dst=3,val=12

but running this still doesn't get me a splat. Debugging
sched_domains_numa_distance[] still gives me
{10, 12, 20, 22, 24}

>
> But for the below topology:
> +-------+          +-----+
> | node0 |  20      |node2|
> |       +----------+     |
> +---+---+          +-----+
>     |                  |12
> 12  |                  |
> +---+---+          +---+-+
> |       |          |node3|
> | node1 |          |     |
> +-------+          +-----+
>
> with node1-node2 as 22, node1-node3 as 24,node0-node3 as 22.
>
> I will get the below sched_domains_numa_distance[]:
> 10, 12, 20, 22, 24
>
> What I have seen is the performance will be better if we
> drop the 20 as we will get a sched_domain hierarchy with less
> levels, and two intermediate nodes won't have the group span
> issue.
>

That is another thing that's worth considering. Morten was arguing that if
the distance between two nodes is so tiny, it might not be worth
representing it at all in the scheduler topology.

> Thanks
> Barry


* RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort
  2021-01-28 14:47           ` Valentin Schneider
@ 2021-01-29  2:02             ` Song Bao Hua (Barry Song)
  2021-02-01 12:03               ` Valentin Schneider
  0 siblings, 1 reply; 15+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-01-29  2:02 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, mgorman



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Friday, January 29, 2021 3:47 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> linux-kernel@vger.kernel.org
> Cc: mingo@kernel.org; peterz@infradead.org; vincent.guittot@linaro.org;
> dietmar.eggemann@arm.com; morten.rasmussen@arm.com; mgorman@suse.de
> Subject: RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set
> for the deduplicating sort
> 
> On 25/01/21 21:35, Song Bao Hua (Barry Song) wrote:
> > I was using 5.11-rc1. One thing I'd like to mention is that:
> >
> > For the below topology:
> > +-------+          +-----+
> > | node1 |  20      |node2|
> > |       +----------+     |
> > +---+---+          +-----+
> >     |                  |12
> > 12  |                  |
> > +---+---+          +---+-+
> > |       |          |node3|
> > | node0 |          |     |
> > +-------+          +-----+
> >
> > with node0-node2 as 22, node0-node3 as 24, node1-node3 as 22.
> >
> > I will get the below sched_domains_numa_distance[]:
> > 10, 12, 22, 24
> > As you can see there is *no* 20. So the node1 and node2 will
> > only get two-level numa sched_domain:
> >
> 
> 
> So that's
> 
>     -numa node,cpus=0-1,nodeid=0 -numa node,cpus=2-3,nodeid=1, \
>     -numa node,cpus=4-5,nodeid=2, -numa node,cpus=6-7,nodeid=3, \
>     -numa dist,src=0,dst=1,val=12, \
>     -numa dist,src=0,dst=2,val=22, \
>     -numa dist,src=0,dst=3,val=24, \
>     -numa dist,src=1,dst=2,val=20, \
>     -numa dist,src=1,dst=3,val=22, \
>     -numa dist,src=2,dst=3,val=12
> 
> but running this still doesn't get me a splat. Debugging
> sched_domains_numa_distance[] still gives me
> {10, 12, 20, 22, 24}
> 
> >
> > But for the below topology:
> > +-------+          +-----+
> > | node0 |  20      |node2|
> > |       +----------+     |
> > +---+---+          +-----+
> >     |                  |12
> > 12  |                  |
> > +---+---+          +---+-+
> > |       |          |node3|
> > | node1 |          |     |
> > +-------+          +-----+
> >
> > with node1-node2 as 22, node1-node3 as 24,node0-node3 as 22.
> >
> > I will get the below sched_domains_numa_distance[]:
> > 10, 12, 20, 22, 24
> >
> > What I have seen is the performance will be better if we
> > drop the 20 as we will get a sched_domain hierarchy with less
> > levels, and two intermediate nodes won't have the group span
> > issue.
> >
> 
> That is another thing that's worth considering. Morten was arguing that if
> the distance between two nodes is so tiny, it might not be worth
> representing it at all in the scheduler topology.

Yes, I agree it is a different thing. Anyway, I saw your patch has been
merged into the sched tree. One side effect of your patch is that one more
sched_domain level is introduced for this topology:

  +-------+   12    +-------+   20    +-------+   12    +-------+
  | node0 +---------+ node1 +---------+ node2 +---------+ node3 |
  +-------+         +-------+         +-------+         +-------+

  with node0-node2 as 22 and node0-node3 as 24.

Without the patch, Linux will use 10,12,22,24 to build sched_domains;
with your patch, Linux will use 10,12,20,22,24.

So one more layer is added. What I have seen is that:

For node0, the sched_domain for <=12 and the sched_domain for <=20 span the
same range (node0, node1), so one of them is redundant. Then, in
cpu_attach_domain(), the redundant one is dropped due to "remove the sched
domains which do not contribute to scheduling".
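
Reading the distances off the diagram above, node0's per-level spans would
roughly be (identity level included):

  <= 10 : { node0 }
  <= 12 : { node0, node1 }
  <= 20 : { node0, node1 }            <- same span as <=12, degenerates
  <= 22 : { node0, node1, node2 }
  <= 24 : { node0, node1, node2, node3 }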

For node1 and node2, the original code had no "20", and thus built one
fewer sched_domain level.

What is really interesting is that removing the 20 actually gives better
benchmark results in SPEC CPU :-)


> 
> > Thanks
> > Barry

Thanks
Barry



* Re: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort
  2021-01-22 12:39 ` [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort Valentin Schneider
  2021-01-25  2:23   ` Song Bao Hua (Barry Song)
@ 2021-02-01  9:53   ` Dietmar Eggemann
  2021-02-01 10:19     ` Vincent Guittot
                       ` (4 more replies)
  1 sibling, 5 replies; 15+ messages in thread
From: Dietmar Eggemann @ 2021-02-01  9:53 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: mingo, peterz, vincent.guittot, morten.rasmussen, mgorman, song.bao.hua

On 22/01/2021 13:39, Valentin Schneider wrote:

[...]

> @@ -1705,7 +1702,7 @@ void sched_init_numa(void)
>  	/* Compute default topology size */
>  	for (i = 0; sched_domain_topology[i].mask; i++);
>  
> -	tl = kzalloc((i + level + 1) *
> +	tl = kzalloc((i + nr_levels) *
>  			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
>  	if (!tl)
>  		return;

This hunk creates issues during startup on my Arm64 juno board on tip/sched/core.

---8<---

From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Mon, 1 Feb 2021 09:58:04 +0100
Subject: [PATCH] sched/topology: Fix sched_domain_topology_level alloc in
 sched_init_numa

Commit "sched/topology: Make sched_init_numa() use a set for the
deduplicating sort" allocates 'i + nr_levels (level)' instead of
'i + nr_levels + 1' sched_domain_topology_level.

This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):

sched_init_domains
  build_sched_domains()
    __free_domain_allocs()
      __sdt_free() {
	...
        for_each_sd_topology(tl)
	  ...
          sd = *per_cpu_ptr(sdd->sd, j); <--
	  ...
      }
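
For context, a minimal standalone sketch of why the '+ 1' matters; the
names below are illustrative, not the kernel's. The topology table is
walked until a zeroed terminator entry (NULL .mask) is found, so the
allocation must include one extra zeroed slot beyond the populated entries:

#include <stdio.h>
#include <stdlib.h>

/* stand-in for struct sched_domain_topology_level; names are illustrative */
struct topo_level {
	const char *(*mask)(void);	/* NULL marks the end of the table */
	const char *name;
};

static const char *dummy_mask(void) { return "cpus"; }

int main(void)
{
	int nr_populated = 3;
	int i;
	/*
	 * '+ 1' leaves one zeroed slot at the end; calloc() zeroes it here,
	 * just as kzalloc() does in the kernel.
	 */
	struct topo_level *tl = calloc(nr_populated + 1, sizeof(*tl));

	if (!tl)
		return 1;

	for (i = 0; i < nr_populated; i++) {
		tl[i].mask = dummy_mask;
		tl[i].name = "LEVEL";
	}

	/*
	 * The walk relies on the NULL .mask terminator; without the extra
	 * slot it would read past the end of the array.
	 */
	for (i = 0; tl[i].mask; i++)
		printf("level %d: %s\n", i, tl[i].name);

	free(tl);
	return 0;
}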

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---

./scripts/decode_stacktrace.sh ./vmlinux < input.log

[    0.495693] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[    0.501280] Modules linked in:
[    0.504342] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.11.0-rc2-00025-g7a976f77bb96 #95
[    0.512455] Hardware name: ARM Juno development board (r0) (DT)
[    0.518386] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[    0.524409] pc : build_sched_domains (kernel/sched/topology.c:1872 kernel/sched/topology.c:1288 kernel/sched/topology.c:2120)
[    0.529132] lr : build_sched_domains (kernel/sched/topology.c:1868 kernel/sched/topology.c:1288 kernel/sched/topology.c:2120)
[    0.533847] sp : ffff800011efbd00
[    0.537165] x29: ffff800011efbd00 x28: ffff800011b69a38
[    0.542496] x27: 0000000000000000 x26: 0000000000000000
[    0.547827] x25: ffff000800191600 x24: 0000000000000000
[    0.553157] x23: ffff800011b69a40 x22: 0000000000000001
[    0.558487] x21: ffff000800056e00 x20: ffff00080016f380
[    0.563818] x19: ffff800011b6a410 x18: 0000000000000000
[    0.569149] x17: 0000000000000000 x16: 0000000000000000
[    0.574478] x15: 0000000000000030 x14: ffffffffffffffff
[    0.579809] x13: ffff800011b82d38 x12: 0000000000000189
[    0.585139] x11: 0000000000000083 x10: ffff800011bdad38
[    0.590469] x9 : 00000000fffff000 x8 : ffff800011b82d38
[    0.595800] x7 : 00000000000000e0 x6 : 000000000000003f
[    0.601130] x5 : 0000000000000000 x4 : 0000000000000000
[    0.606460] x3 : 0000000000000000 x2 : ffff80096d857000
[    0.611790] x1 : 0000000000000001 x0 : 0000000000000001
[    0.617120] Call trace:
[    0.619566] build_sched_domains (kernel/sched/topology.c:1872 kernel/sched/topology.c:1288 kernel/sched/topology.c:2120)
[    0.623934] sched_init_domains (kernel/sched/topology.c:2194)
[    0.627954] sched_init_smp (kernel/sched/core.c:7727)
[    0.631629] kernel_init_freeable (init/main.c:1528)
[    0.635914] kernel_init (init/main.c:1417)
[    0.639416] ret_from_fork (arch/arm64/kernel/entry.S:959)
[ 0.643008] Code: b4000360 93407f1b aa0003e1 f87b7ae2 (f8626821)
All code
========
   0:	60                   	(bad)
   1:	03 00                	add    (%rax),%eax
   3:	b4 1b                	mov    $0x1b,%ah
   5:	7f 40                	jg     0x47
   7:	93                   	xchg   %eax,%ebx
   8:	e1 03                	loope  0xd
   a:	00 aa e2 7a 7b f8    	add    %ch,-0x784851e(%rdx)
  10:*	21 68 62             	and    %ebp,0x62(%rax)		<-- trapping instruction
  13:	f8                   	clc

Code starting with the faulting instruction
===========================================
   0:	21 68 62             	and    %ebp,0x62(%rax)
   3:	f8                   	clc
[    0.649129] ---[ end trace 4453f742a7781011 ]---
[    0.653772] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    0.661449] SMP: stopping secondary CPUs
[    0.665389] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bf5c9bd10bfe..09d35044bd88 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1702,7 +1702,7 @@ void sched_init_numa(void)
 	/* Compute default topology size */
 	for (i = 0; sched_domain_topology[i].mask; i++);
 
-	tl = kzalloc((i + nr_levels) *
+	tl = kzalloc((i + nr_levels + 1) *
 			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
 	if (!tl)
 		return;
-- 
2.25.1


* Re: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort
  2021-02-01  9:53   ` Dietmar Eggemann
@ 2021-02-01 10:19     ` Vincent Guittot
  2021-02-01 10:35     ` Song Bao Hua (Barry Song)
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Vincent Guittot @ 2021-02-01 10:19 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Valentin Schneider, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Morten Rasmussen, Mel Gorman, Song Bao Hua (Barry Song)

On Mon, 1 Feb 2021 at 10:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 22/01/2021 13:39, Valentin Schneider wrote:
>
> [...]
>
> > @@ -1705,7 +1702,7 @@ void sched_init_numa(void)
> >       /* Compute default topology size */
> >       for (i = 0; sched_domain_topology[i].mask; i++);
> >
> > -     tl = kzalloc((i + level + 1) *
> > +     tl = kzalloc((i + nr_levels) *
> >                       sizeof(struct sched_domain_topology_level), GFP_KERNEL);
> >       if (!tl)
> >               return;
>
> This hunk creates issues during startup on my Arm64 juno board on tip/sched/core.
>
> ---8<---
>
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Date: Mon, 1 Feb 2021 09:58:04 +0100
> Subject: [PATCH] sched/topology: Fix sched_domain_topology_level alloc in
>  sched_init_numa
>
> Commit "sched/topology: Make sched_init_numa() use a set for the
> deduplicating sort" allocates 'i + nr_levels (level)' instead of
> 'i + nr_levels + 1' sched_domain_topology_level.
>
> This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):
>
> sched_init_domains
>   build_sched_domains()
>     __free_domain_allocs()
>       __sdt_free() {
>         ...
>         for_each_sd_topology(tl)
>           ...
>           sd = *per_cpu_ptr(sdd->sd, j); <--
>           ...
>       }
>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

The same crash happens for me on hikey and db845c. This fixes the problem.

Tested-by: Vincent Guittot <vincent.guittot@linaro.org>


> ---
>
> ./scripts/decode_stacktrace.sh ./vmlinux < input.log
>
> [    0.495693] Internal error: Oops: 96000005 [#1] PREEMPT SMP
> [    0.501280] Modules linked in:
> [    0.504342] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.11.0-rc2-00025-g7a976f77bb96 #95
> [    0.512455] Hardware name: ARM Juno development board (r0) (DT)
> [    0.518386] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
> [    0.524409] pc : build_sched_domains (kernel/sched/topology.c:1872 kernel/sched/topology.c:1288 kernel/sched/topology.c:2120)
> [    0.529132] lr : build_sched_domains (kernel/sched/topology.c:1868 kernel/sched/topology.c:1288 kernel/sched/topology.c:2120)
> [    0.533847] sp : ffff800011efbd00
> [    0.537165] x29: ffff800011efbd00 x28: ffff800011b69a38
> [    0.542496] x27: 0000000000000000 x26: 0000000000000000
> [    0.547827] x25: ffff000800191600 x24: 0000000000000000
> [    0.553157] x23: ffff800011b69a40 x22: 0000000000000001
> [    0.558487] x21: ffff000800056e00 x20: ffff00080016f380
> [    0.563818] x19: ffff800011b6a410 x18: 0000000000000000
> [    0.569149] x17: 0000000000000000 x16: 0000000000000000
> [    0.574478] x15: 0000000000000030 x14: ffffffffffffffff
> [    0.579809] x13: ffff800011b82d38 x12: 0000000000000189
> [    0.585139] x11: 0000000000000083 x10: ffff800011bdad38
> [    0.590469] x9 : 00000000fffff000 x8 : ffff800011b82d38
> [    0.595800] x7 : 00000000000000e0 x6 : 000000000000003f
> [    0.601130] x5 : 0000000000000000 x4 : 0000000000000000
> [    0.606460] x3 : 0000000000000000 x2 : ffff80096d857000
> [    0.611790] x1 : 0000000000000001 x0 : 0000000000000001
> [    0.617120] Call trace:
> [    0.619566] build_sched_domains (kernel/sched/topology.c:1872 kernel/sched/topology.c:1288 kernel/sched/topology.c:2120)
> [    0.623934] sched_init_domains (kernel/sched/topology.c:2194)
> [    0.627954] sched_init_smp (kernel/sched/core.c:7727)
> [    0.631629] kernel_init_freeable (init/main.c:1528)
> [    0.635914] kernel_init (init/main.c:1417)
> [    0.639416] ret_from_fork (arch/arm64/kernel/entry.S:959)
> [ 0.643008] Code: b4000360 93407f1b aa0003e1 f87b7ae2 (f8626821)
> All code
> ========
>    0:   60                      (bad)
>    1:   03 00                   add    (%rax),%eax
>    3:   b4 1b                   mov    $0x1b,%ah
>    5:   7f 40                   jg     0x47
>    7:   93                      xchg   %eax,%ebx
>    8:   e1 03                   loope  0xd
>    a:   00 aa e2 7a 7b f8       add    %ch,-0x784851e(%rdx)
>   10:*  21 68 62                and    %ebp,0x62(%rax)          <-- trapping instruction
>   13:   f8                      clc
>
> Code starting with the faulting instruction
> ===========================================
>    0:   21 68 62                and    %ebp,0x62(%rax)
>    3:   f8                      clc
> [    0.649129] ---[ end trace 4453f742a7781011 ]---
> [    0.653772] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [    0.661449] SMP: stopping secondary CPUs
> [    0.665389] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
>
>  kernel/sched/topology.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index bf5c9bd10bfe..09d35044bd88 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1702,7 +1702,7 @@ void sched_init_numa(void)
>         /* Compute default topology size */
>         for (i = 0; sched_domain_topology[i].mask; i++);
>
> -       tl = kzalloc((i + nr_levels) *
> +       tl = kzalloc((i + nr_levels + 1) *
>                         sizeof(struct sched_domain_topology_level), GFP_KERNEL);
>         if (!tl)
>                 return;
> --
> 2.25.1

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort
  2021-02-01  9:53   ` Dietmar Eggemann
  2021-02-01 10:19     ` Vincent Guittot
@ 2021-02-01 10:35     ` Song Bao Hua (Barry Song)
  2021-02-01 11:55     ` Valentin Schneider
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-02-01 10:35 UTC (permalink / raw)
  To: Dietmar Eggemann, Valentin Schneider, linux-kernel
  Cc: mingo, peterz, vincent.guittot, morten.rasmussen, mgorman



> -----Original Message-----
> From: Dietmar Eggemann [mailto:dietmar.eggemann@arm.com]
> Sent: Monday, February 1, 2021 10:54 PM
> To: Valentin Schneider <valentin.schneider@arm.com>;
> linux-kernel@vger.kernel.org
> Cc: mingo@kernel.org; peterz@infradead.org; vincent.guittot@linaro.org;
> morten.rasmussen@arm.com; mgorman@suse.de; Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com>
> Subject: Re: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for
> the deduplicating sort
> 
> On 22/01/2021 13:39, Valentin Schneider wrote:
> 
> [...]
> 
> > @@ -1705,7 +1702,7 @@ void sched_init_numa(void)
> >  	/* Compute default topology size */
> >  	for (i = 0; sched_domain_topology[i].mask; i++);
> >
> > -	tl = kzalloc((i + level + 1) *
> > +	tl = kzalloc((i + nr_levels) *
> >  			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
> >  	if (!tl)
> >  		return;
> 
> This hunk creates issues during startup on my Arm64 juno board on tip/sched/core.

I also reported this kernel panic here:
https://lore.kernel.org/lkml/bfb703294b234e1e926a68fcb73dbee3@hisilicon.com/#t

> 
> ---8<---
> 
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Date: Mon, 1 Feb 2021 09:58:04 +0100
> Subject: [PATCH] sched/topology: Fix sched_domain_topology_level alloc in
>  sched_init_numa
> 
> Commit "sched/topology: Make sched_init_numa() use a set for the
> deduplicating sort" allocates 'i + nr_levels (level)' instead of
> 'i + nr_levels + 1' sched_domain_topology_level.
> 
> This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):
> 
> sched_init_domains
>   build_sched_domains()
>     __free_domain_allocs()
>       __sdt_free() {
> 	...
>         for_each_sd_topology(tl)
> 	  ...
>           sd = *per_cpu_ptr(sdd->sd, j); <--
> 	  ...
>       }
> 
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---

This patch also resolved my panic. So:

Tested-by: Barry Song <song.bao.hua@hisilicon.com>

Thanks
Barry


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort
  2021-02-01  9:53   ` Dietmar Eggemann
  2021-02-01 10:19     ` Vincent Guittot
  2021-02-01 10:35     ` Song Bao Hua (Barry Song)
@ 2021-02-01 11:55     ` Valentin Schneider
  2021-02-02 10:03     ` [tip: sched/core] sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa() tip-bot2 for Dietmar Eggemann
  2021-02-17 13:17     ` tip-bot2 for Dietmar Eggemann
  4 siblings, 0 replies; 15+ messages in thread
From: Valentin Schneider @ 2021-02-01 11:55 UTC (permalink / raw)
  To: Dietmar Eggemann, linux-kernel
  Cc: mingo, peterz, vincent.guittot, morten.rasmussen, mgorman, song.bao.hua

On 01/02/21 10:53, Dietmar Eggemann wrote:
> On 22/01/2021 13:39, Valentin Schneider wrote:
>
> [...]
>
>> @@ -1705,7 +1702,7 @@ void sched_init_numa(void)
>>      /* Compute default topology size */
>>      for (i = 0; sched_domain_topology[i].mask; i++);
>>
>> -	tl = kzalloc((i + level + 1) *
>> +	tl = kzalloc((i + nr_levels) *
>>                      sizeof(struct sched_domain_topology_level), GFP_KERNEL);
>>      if (!tl)
>>              return;
>
> This hunk creates issues during startup on my Arm64 juno board on tip/sched/core.
>
> ---8<---
>
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Date: Mon, 1 Feb 2021 09:58:04 +0100
> Subject: [PATCH] sched/topology: Fix sched_domain_topology_level alloc in
>  sched_init_numa
>
> Commit "sched/topology: Make sched_init_numa() use a set for the
> deduplicating sort" allocates 'i + nr_levels (level)' instead of
> 'i + nr_levels + 1' sched_domain_topology_level.
>
> This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):
>
> sched_init_domains
>   build_sched_domains()
>     __free_domain_allocs()
>       __sdt_free() {
>       ...
>         for_each_sd_topology(tl)
>         ...
>           sd = *per_cpu_ptr(sdd->sd, j); <--
>         ...
>       }
>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Damn, I forgot the topology level stuff must terminate in a NULL'd
sentinel! Vincent fixed the same thing a few years ago...

  c515db8cd311 ("sched/numa: Fix initialization of sched_domain_topology for NUMA")

Thanks for fixing up my mistake, I ought to have tested !NUMA setups.
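
For reference, a minimal sketch of why that sentinel matters (simplified
from the kernel definitions; only the field that matters here is kept, and
the exact layout/location of the real structures may differ):

	/*
	 * The topology table is walked until an entry with a NULL ->mask
	 * is found, so the array must end with a zeroed sentinel entry.
	 * kzalloc()'ing (i + nr_levels + 1) entries keeps that sentinel;
	 * allocating only (i + nr_levels) lets walkers such as
	 * for_each_sd_topology() run off the end of the array, which is
	 * what the Oops above shows.
	 */
	struct sched_domain_topology_level {
		const struct cpumask *(*mask)(int cpu); /* NULL => sentinel */
		/* ... other members elided ... */
	};

	extern struct sched_domain_topology_level *sched_domain_topology;

	#define for_each_sd_topology(tl) \
		for (tl = sched_domain_topology; tl->mask; tl++)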

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort
  2021-01-29  2:02             ` Song Bao Hua (Barry Song)
@ 2021-02-01 12:03               ` Valentin Schneider
  0 siblings, 0 replies; 15+ messages in thread
From: Valentin Schneider @ 2021-02-01 12:03 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song), linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, mgorman

On 29/01/21 02:02, Song Bao Hua (Barry Song) wrote:
>> -----Original Message-----
>> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
>> That is another thing that's worth considering. Morten was arguing that if
>> the distance between two nodes is so tiny, it might not be worth
>> representing it at all in the scheduler topology.
>
> Yes. I agree it is a different thing. Anyway, I saw your patch has been
> merged into the sched tree. One side effect of your patch is that one more
> sched_domain level is introduced for this topology:
>
>                             24
>                       X X XXX X X  X X X X XXX
>              XX XX X                          XXXXX
>          XXX                                        X
>        XX                                             XXX
>      XX                                 22              XXX
>      X                           XXXXXXX                   XX
>     X                        XXXXX      XXXXXXXXX           XXXX
>    XX                      XXX                    XX X XX X    XX
> +--------+           +---------+          +---------+      XX+---------+
> | 0      |   12      | 1       | 20       | 2       |   12   |3        |
> |        +-----------+         +----------+         +--------+         |
> +---X----+           +---------+          +--X------+        +---------+
>     X                                        X
>     XX                                      X
>      X                                     XX
>       XX                                  XX
>        XX                                X
>         X XXX                         XXX
>              X XXXXXX XX XX X X X XXXX
>                        22
> Without the patch, Linux will use 10,12,22,24 to build sched_domain levels;
> with your patch, Linux will use 10,12,20,22,24 to build sched_domain levels.
>
> So one more layer is added. What I have seen is that:
>
> For node0, the sched_domains for distance <=12 and <=20 span the same range
> (node0, node1), so one of them is redundant. Then, in cpu_attach_domain(),
> the redundant one is dropped due to "remove the sched domains which
> do not contribute to scheduling".
>
> For nodes 1 & 2, the original code had no "20", and thus built one less
> sched_domain level.
>

Right, that domain degeneration should get you to the same result. We do
want to make sure we're handling every distance value in the table; the
gist is to "stupidly" build every domain / level we can, and if some are
redundant we can remove them after the fact.
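
Concretely, the redundancy boils down to something like the sketch below
(the helper name is made up for illustration; the real degeneration logic
runs when the domains are attached and also compares domain flags and group
counts, not just the spans):

	/*
	 * A parent domain that spans exactly the same CPUs as its child
	 * adds no scheduling information, so it can be folded away after
	 * the levels have been built. In your example, the <=12 and <=20
	 * levels of node0 both span {node0, node1}, so one of them goes.
	 */
	static inline bool numa_level_is_redundant(struct sched_domain *sd)
	{
		return sd->parent &&
		       cpumask_equal(sched_domain_span(sd),
				     sched_domain_span(sd->parent));
	}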

> What is really interesting is that removing 20 actually gives better
> benchmark results in speccpu :-)
>
>
>>
>> > Thanks
>> > Barry
>
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [tip: sched/core] sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()
  2021-02-01  9:53   ` Dietmar Eggemann
                       ` (2 preceding siblings ...)
  2021-02-01 11:55     ` Valentin Schneider
@ 2021-02-02 10:03     ` tip-bot2 for Dietmar Eggemann
  2021-02-17 13:17     ` tip-bot2 for Dietmar Eggemann
  4 siblings, 0 replies; 15+ messages in thread
From: tip-bot2 for Dietmar Eggemann @ 2021-02-02 10:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dietmar Eggemann, Peter Zijlstra (Intel),
	Vincent Guittot, Barry Song, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e972d92d52a1f691498add14feb2ee5902d02404
Gitweb:        https://git.kernel.org/tip/e972d92d52a1f691498add14feb2ee5902d02404
Author:        Dietmar Eggemann <dietmar.eggemann@arm.com>
AuthorDate:    Mon, 01 Feb 2021 10:53:53 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 01 Feb 2021 15:31:39 +01:00

sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()

Commit "sched/topology: Make sched_init_numa() use a set for the
deduplicating sort" allocates 'i + nr_levels (level)' instead of
'i + nr_levels + 1' sched_domain_topology_level.

This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):

sched_init_domains
  build_sched_domains()
    __free_domain_allocs()
      __sdt_free() {
	...
        for_each_sd_topology(tl)
	  ...
          sd = *per_cpu_ptr(sdd->sd, j); <--
	  ...
      }

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Barry Song <song.bao.hua@hisilicon.com>
Link: https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a9bf@arm.com
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bf5c9bd..09d3504 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1702,7 +1702,7 @@ void sched_init_numa(void)
 	/* Compute default topology size */
 	for (i = 0; sched_domain_topology[i].mask; i++);
 
-	tl = kzalloc((i + nr_levels) *
+	tl = kzalloc((i + nr_levels + 1) *
 			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
 	if (!tl)
 		return;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [tip: sched/core] sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()
  2021-02-01  9:53   ` Dietmar Eggemann
                       ` (3 preceding siblings ...)
  2021-02-02 10:03     ` [tip: sched/core] sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa() tip-bot2 for Dietmar Eggemann
@ 2021-02-17 13:17     ` tip-bot2 for Dietmar Eggemann
  4 siblings, 0 replies; 15+ messages in thread
From: tip-bot2 for Dietmar Eggemann @ 2021-02-17 13:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dietmar Eggemann, Peter Zijlstra (Intel),
	Ingo Molnar, Vincent Guittot, Barry Song, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     71e5f6644fb2f3304fcb310145ded234a37e7cc1
Gitweb:        https://git.kernel.org/tip/71e5f6644fb2f3304fcb310145ded234a37e7cc1
Author:        Dietmar Eggemann <dietmar.eggemann@arm.com>
AuthorDate:    Mon, 01 Feb 2021 10:53:53 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 17 Feb 2021 14:08:05 +01:00

sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()

Commit "sched/topology: Make sched_init_numa() use a set for the
deduplicating sort" allocates 'i + nr_levels (level)' instead of
'i + nr_levels + 1' sched_domain_topology_level.

This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):

sched_init_domains
  build_sched_domains()
    __free_domain_allocs()
      __sdt_free() {
	...
        for_each_sd_topology(tl)
	  ...
          sd = *per_cpu_ptr(sdd->sd, j); <--
	  ...
      }

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Barry Song <song.bao.hua@hisilicon.com>
Link: https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a9bf@arm.com
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bf5c9bd..09d3504 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1702,7 +1702,7 @@ void sched_init_numa(void)
 	/* Compute default topology size */
 	for (i = 0; sched_domain_topology[i].mask; i++);
 
-	tl = kzalloc((i + nr_levels) *
+	tl = kzalloc((i + nr_levels + 1) *
 			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
 	if (!tl)
 		return;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-02-17 13:25 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-22 12:39 [PATCH 0/1] sched/topology: NUMA distance deduplication Valentin Schneider
2021-01-22 12:39 ` [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort Valentin Schneider
2021-01-25  2:23   ` Song Bao Hua (Barry Song)
2021-01-25  9:26     ` Valentin Schneider
2021-01-25 16:45       ` Valentin Schneider
2021-01-25 21:35         ` Song Bao Hua (Barry Song)
2021-01-28 14:47           ` Valentin Schneider
2021-01-29  2:02             ` Song Bao Hua (Barry Song)
2021-02-01 12:03               ` Valentin Schneider
2021-02-01  9:53   ` Dietmar Eggemann
2021-02-01 10:19     ` Vincent Guittot
2021-02-01 10:35     ` Song Bao Hua (Barry Song)
2021-02-01 11:55     ` Valentin Schneider
2021-02-02 10:03     ` [tip: sched/core] sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa() tip-bot2 for Dietmar Eggemann
2021-02-17 13:17     ` tip-bot2 for Dietmar Eggemann
