linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Skip numa distance for offline nodes
@ 2021-05-20 15:44 Srikar Dronamraju
  2021-05-20 15:44 ` [PATCH 1/3] sched/topology: Allow archs to populate distance map Srikar Dronamraju
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2021-05-20 15:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman,
	Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani

Geetika reported yet another trace while doing a dlpar CPU add
operation. This was true even on top of a recent commit
6980d13f0dd1 ("powerpc/smp: Set numa node before updating mask") which fixed
a similar trace.

WARNING: CPU: 40 PID: 2954 at kernel/sched/topology.c:2088 build_sched_domains+0x6e8/0x1540
Modules linked in: nft_counter nft_compat rpadlpar_io rpaphp mptcp_diag
xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag
netlink_diag bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables
nfnetlink dm_multipath pseries_rng xts vmx_crypto binfmt_misc ip_tables xfs
libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror
dm_region_hash dm_log dm_mod fuse
CPU: 40 PID: 2954 Comm: kworker/40:0 Not tainted 5.13.0-rc1+ #19
Workqueue: events cpuset_hotplug_workfn
NIP:  c0000000001de588 LR: c0000000001de584 CTR: 00000000006cd36c
REGS: c00000002772b250 TRAP: 0700   Not tainted  (5.12.0-rc5-master+)
MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28828422  XER: 0000000d
CFAR: c00000000020c2f8 IRQMASK: 0 #012GPR00: c0000000001de584 c00000002772b4f0
c000000001f55400 0000000000000036 #012GPR04: c0000063c6368010 c0000063c63f0a00
0000000000000027 c0000063c6368018 #012GPR08: 0000000000000023 c0000063c636ef48
00000063c4de0000 c0000063bfe9ffe8 #012GPR12: 0000000028828424 c0000063fe68fe80
0000000000000000 0000000000000417 #012GPR16: 0000000000000028 c00000000740dcd8
c00000000205db68 c000000001a3a4a0 #012GPR20: c000000091ed7d20 c000000091ed8520
0000000000000001 0000000000000000 #012GPR24: c0000000113a9600 0000000000000190
0000000000000028 c0000000010e3ac0 #012GPR28: 0000000000000000 c00000000740dd00
c0000000317b5900 0000000000000190 
NIP [c0000000001de588] build_sched_domains+0x6e8/0x1540
LR [c0000000001de584] build_sched_domains+0x6e4/0x1540
Call Trace:
[c00000002772b4f0] [c0000000001de584] build_sched_domains+0x6e4/0x1540 (unreliable)
[c00000002772b640] [c0000000001e08dc] partition_sched_domains_locked+0x3ec/0x530
[c00000002772b6e0] [c0000000002a2144] rebuild_sched_domains_locked+0x524/0xbf0
[c00000002772b7e0] [c0000000002a5620] rebuild_sched_domains+0x40/0x70
[c00000002772b810] [c0000000002a58e4] cpuset_hotplug_workfn+0x294/0xe20
[c00000002772bc30] [c000000000187510] process_one_work+0x300/0x670
[c00000002772bd10] [c0000000001878f8] worker_thread+0x78/0x520
[c00000002772bda0] [c0000000001937f0] kthread+0x1a0/0x1b0
[c00000002772be10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
7ee5bb78 7f0ac378 7f29cb78 7f68db78 7f46d378 7f84e378 f8610068 3c62ff19
fbe10060 3863e558 4802dd31 60000000 <0fe00000> 3920fff4 f9210080 e86100b0

Detailed analysis of the failing scenario showed that the span in
question belongs to NODE domain and further the cpumasks for some
cpus in NODE overlapped. There are two possible reasons how we ended
up here:

(1) The numa node was offline or blank with no CPUs or memory. Hence
the sched_max_numa_distance could not be set correctly, or the
sched_domains_numa_distance happened to be partially populated.

(2) Depending on a bogus node_distance of an offline node to populate
cpumasks is the issue.  On POWER platform the node_distance is
correctly available only for an online node which has some CPU or
memory resource associated with it.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However the same system when only two nodes are online at boot, then the
numa topology will look like
node distances:
node   0   1
  0:  10  20
  1:  20  10

This series tries to fix both these problems.
Note: These problems are now visible, thanks to 
Commit ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't
(partially) overlap")


Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>

Srikar Dronamraju (3):
  sched/topology: Allow archs to populate distance map
  powerpc/numa: Populate distance map correctly
  sched/topology: Skip updating masks for non-online nodes

 arch/powerpc/include/asm/topology.h |  3 +++
 arch/powerpc/mm/numa.c              | 19 +++++++++++++++
 kernel/sched/topology.c             | 38 +++++++++++++++++++++--------
 3 files changed, 50 insertions(+), 10 deletions(-)


base-commit: 1699949d3314e5d1956fb082e4cd4798bf6149fc
-- 
2.27.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/3] sched/topology: Allow archs to populate distance map
  2021-05-20 15:44 [PATCH 0/3] Skip numa distance for offline nodes Srikar Dronamraju
@ 2021-05-20 15:44 ` Srikar Dronamraju
  2021-05-20 18:56   ` Peter Zijlstra
  2021-05-20 15:44 ` [PATCH 2/3] powerpc/numa: Populate distance map correctly Srikar Dronamraju
  2021-05-20 15:44 ` [PATCH 3/3] sched/topology: Skip updating masks for non-online nodes Srikar Dronamraju
  2 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2021-05-20 15:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman,
	Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani

Currently scheduler populates the distance map by looking at distance
of each node from all other nodes. This should work for most
architectures and platforms.

However there are some architectures like POWER that may not expose
the distance of nodes that are not yet onlined because those resources
are not yet allocated to the OS instance. Such architectures have
other means to provide valid distance data for the current platform.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However the same system when only two nodes are online at boot, then the
numa topology will look like
node distances:
node   0   1
  0:  10  20
  1:  20  10

It may be implementation dependent on what node_distance(0,3) where
node 0 is online and node 3 is offline. In POWER case, it returns
LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
distance between nodes is 20. However that would not be true.

When Nodes are onlined and CPUs from those nodes are hotplugged,
the max node distance would be 40.

To handle such scenarios, let scheduler allow architectures to populate
the distance map. Architectures that like to populate the distance map
can overload arch_populate_distance_map().

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/topology.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 053115b55f89..ccb9aff59add 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1630,6 +1630,26 @@ static void init_numa_topology_type(void)
 
 #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
 
+#ifndef arch_populate_distance_map
+static int arch_populate_distance_map(unsigned long *distance_map)
+{
+	int i, j;
+
+	for (i = 0; i < nr_node_ids; i++) {
+		for (j = 0; j < nr_node_ids; j++) {
+			int distance = node_distance(i, j);
+
+			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
+				sched_numa_warn("Invalid distance value range");
+				return -1;
+			}
+			bitmap_set(distance_map, distance, 1);
+		}
+	}
+	return 0;
+}
+#endif
+
 void sched_init_numa(void)
 {
 	struct sched_domain_topology_level *tl;
@@ -1646,18 +1666,10 @@ void sched_init_numa(void)
 		return;
 
 	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
-	for (i = 0; i < nr_node_ids; i++) {
-		for (j = 0; j < nr_node_ids; j++) {
-			int distance = node_distance(i, j);
 
-			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
-				sched_numa_warn("Invalid distance value range");
-				return;
-			}
+	if (arch_populate_distance_map(distance_map))
+		return;
 
-			bitmap_set(distance_map, distance, 1);
-		}
-	}
 	/*
 	 * We can now figure out how many unique distance values there are and
 	 * allocate memory accordingly.
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/3] powerpc/numa: Populate distance map correctly
  2021-05-20 15:44 [PATCH 0/3] Skip numa distance for offline nodes Srikar Dronamraju
  2021-05-20 15:44 ` [PATCH 1/3] sched/topology: Allow archs to populate distance map Srikar Dronamraju
@ 2021-05-20 15:44 ` Srikar Dronamraju
  2021-05-24 14:16   ` Valentin Schneider
  2021-05-20 15:44 ` [PATCH 3/3] sched/topology: Skip updating masks for non-online nodes Srikar Dronamraju
  2 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2021-05-20 15:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman,
	Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani

As per PAPR that defines the OS to hypervisor interface on POWER,
there is no way to calculate the node_distance between 2 nodes, when
either of the nodes are offline. However scheduler needs the distance
map to be populated at boot time. On POWER, this information is
provided within the distance_ref_points_depth array, which needs to be
parsed to extract the potential node distances.

To handle this scenario, lets overload arch_populate_distance_map(),
to provide all the distances that are possible in the current
platform.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/topology.h |  3 +++
 arch/powerpc/mm/numa.c              | 19 +++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..d7605d833b8d 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -22,6 +22,9 @@ struct drmem_lmb;
 			       cpu_all_mask :				\
 			       node_to_cpumask_map[node])
 
+#define arch_populate_distance_map arch_populate_distance_map
+extern int arch_populate_distance_map(unsigned long *distance_map);
+
 struct pci_bus;
 #ifdef CONFIG_PCI
 extern int pcibus_to_node(struct pci_bus *bus);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..9a225b29814a 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -221,6 +221,25 @@ static void initialize_distance_lookup_table(int nid,
 	}
 }
 
+int arch_populate_distance_map(unsigned long *distance_map)
+{
+	int i;
+	int distance = LOCAL_DISTANCE;
+
+	bitmap_set(distance_map, distance, 1);
+
+	if (!form1_affinity) {
+		bitmap_set(distance_map, REMOTE_DISTANCE, 1);
+		return 0;
+	}
+
+	for (i = 0; i < distance_ref_points_depth; i++) {
+		distance *= 2;
+		bitmap_set(distance_map, distance, 1);
+	}
+	return 0;
+}
+
 /*
  * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
  * info is found.
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/3] sched/topology: Skip updating masks for non-online nodes
  2021-05-20 15:44 [PATCH 0/3] Skip numa distance for offline nodes Srikar Dronamraju
  2021-05-20 15:44 ` [PATCH 1/3] sched/topology: Allow archs to populate distance map Srikar Dronamraju
  2021-05-20 15:44 ` [PATCH 2/3] powerpc/numa: Populate distance map correctly Srikar Dronamraju
@ 2021-05-20 15:44 ` Srikar Dronamraju
  2 siblings, 0 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2021-05-20 15:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman,
	Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani

Currently scheduler doesn't check if node is online before adding CPUs
to the node mask. However on some architectures, node distance is only
available for nodes that are online. Its not sure how much to rely on
the node distance, when one of the nodes is offline.

If said node distance is fake (since one of the nodes is offline) and
the actual node distance is different, then the cpumask of such nodes
when the nodes become becomes online will be wrong.

This can cause topology_span_sane to throw up a warning message and the
rest of the topology being not updated properly.

Resolve this by skipping update of cpumask for nodes that are not
online.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/topology.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ccb9aff59add..ba0555e83548 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1731,6 +1731,9 @@ void sched_init_numa(void)
 			sched_domains_numa_masks[i][j] = mask;
 
 			for_each_node(k) {
+				if (!node_online(j))
+					continue;
+
 				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");
 
@@ -1793,6 +1796,9 @@ void sched_domains_numa_masks_set(unsigned int cpu)
 
 	for (i = 0; i < sched_domains_numa_levels; i++) {
 		for (j = 0; j < nr_node_ids; j++) {
+			if (!node_online(j))
+				continue;
+
 			if (node_distance(j, node) <= sched_domains_numa_distance[i])
 				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
 		}
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
  2021-05-20 15:44 ` [PATCH 1/3] sched/topology: Allow archs to populate distance map Srikar Dronamraju
@ 2021-05-20 18:56   ` Peter Zijlstra
  2021-05-21  2:38     ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2021-05-20 18:56 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha,
	Gautham R Shenoy, Geetika Moolchandani

On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> Currently scheduler populates the distance map by looking at distance
> of each node from all other nodes. This should work for most
> architectures and platforms.
> 
> However there are some architectures like POWER that may not expose
> the distance of nodes that are not yet onlined because those resources
> are not yet allocated to the OS instance. Such architectures have
> other means to provide valid distance data for the current platform.
> 
> For example distance info from numactl from a fully populated 8 node
> system at boot may look like this.
> 
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  20  40  40  40  40  40  40
>   1:  20  10  40  40  40  40  40  40
>   2:  40  40  10  20  40  40  40  40
>   3:  40  40  20  10  40  40  40  40
>   4:  40  40  40  40  10  20  40  40
>   5:  40  40  40  40  20  10  40  40
>   6:  40  40  40  40  40  40  10  20
>   7:  40  40  40  40  40  40  20  10
> 
> However the same system when only two nodes are online at boot, then the
> numa topology will look like
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
> 
> It may be implementation dependent on what node_distance(0,3) where
> node 0 is online and node 3 is offline. In POWER case, it returns
> LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> distance between nodes is 20. However that would not be true.
> 
> When Nodes are onlined and CPUs from those nodes are hotplugged,
> the max node distance would be 40.
> 
> To handle such scenarios, let scheduler allow architectures to populate
> the distance map. Architectures that like to populate the distance map
> can overload arch_populate_distance_map().

Why? Why can't your node_distance() DTRT? The arch interface is
nr_node_ids and node_distance(), I don't see why we need something new
and then replace one special use of it.

By virtue of you being able to actually implement this new hook, you
supposedly can actually do node_distance() right too.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
  2021-05-20 18:56   ` Peter Zijlstra
@ 2021-05-21  2:38     ` Srikar Dronamraju
  2021-05-21  8:14       ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2021-05-21  2:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha,
	Gautham R Shenoy, Geetika Moolchandani

* Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]:

> On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> > Currently scheduler populates the distance map by looking at distance
> > of each node from all other nodes. This should work for most
> > architectures and platforms.
> > 
> > However there are some architectures like POWER that may not expose
> > the distance of nodes that are not yet onlined because those resources
> > are not yet allocated to the OS instance. Such architectures have
> > other means to provide valid distance data for the current platform.
> > 
> > For example distance info from numactl from a fully populated 8 node
> > system at boot may look like this.
> > 
> > node distances:
> > node   0   1   2   3   4   5   6   7
> >   0:  10  20  40  40  40  40  40  40
> >   1:  20  10  40  40  40  40  40  40
> >   2:  40  40  10  20  40  40  40  40
> >   3:  40  40  20  10  40  40  40  40
> >   4:  40  40  40  40  10  20  40  40
> >   5:  40  40  40  40  20  10  40  40
> >   6:  40  40  40  40  40  40  10  20
> >   7:  40  40  40  40  40  40  20  10
> > 
> > However the same system when only two nodes are online at boot, then the
> > numa topology will look like
> > node distances:
> > node   0   1
> >   0:  10  20
> >   1:  20  10
> > 
> > It may be implementation dependent on what node_distance(0,3) where
> > node 0 is online and node 3 is offline. In POWER case, it returns
> > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> > distance between nodes is 20. However that would not be true.
> > 
> > When Nodes are onlined and CPUs from those nodes are hotplugged,
> > the max node distance would be 40.
> > 
> > To handle such scenarios, let scheduler allow architectures to populate
> > the distance map. Architectures that like to populate the distance map
> > can overload arch_populate_distance_map().
> 
> Why? Why can't your node_distance() DTRT? The arch interface is
> nr_node_ids and node_distance(), I don't see why we need something new
> and then replace one special use of it.
> 
> By virtue of you being able to actually implement this new hook, you
> supposedly can actually do node_distance() right too.

Since for an offline node, arch interface code doesn't have the info.
As far as I know/understand, in POWER, unless there is an active memory or
CPU that's getting onlined, arch can't fetch the correct node distance.

Taking the above example: node 3 is offline, then node_distance of (3,X)
where X is anything other than 3, is not reliable. The moment node 3 is
onlined, the node distance is reliable.

This problem will not happen even on POWER if all the nodes have either
memory or CPUs active at the time of boot.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
  2021-05-21  2:38     ` Srikar Dronamraju
@ 2021-05-21  8:14       ` Peter Zijlstra
  2021-05-21  9:28         ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2021-05-21  8:14 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha,
	Gautham R Shenoy, Geetika Moolchandani

On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]:
> 
> > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> > > Currently scheduler populates the distance map by looking at distance
> > > of each node from all other nodes. This should work for most
> > > architectures and platforms.
> > > 
> > > However there are some architectures like POWER that may not expose
> > > the distance of nodes that are not yet onlined because those resources
> > > are not yet allocated to the OS instance. Such architectures have
> > > other means to provide valid distance data for the current platform.
> > > 
> > > For example distance info from numactl from a fully populated 8 node
> > > system at boot may look like this.
> > > 
> > > node distances:
> > > node   0   1   2   3   4   5   6   7
> > >   0:  10  20  40  40  40  40  40  40
> > >   1:  20  10  40  40  40  40  40  40
> > >   2:  40  40  10  20  40  40  40  40
> > >   3:  40  40  20  10  40  40  40  40
> > >   4:  40  40  40  40  10  20  40  40
> > >   5:  40  40  40  40  20  10  40  40
> > >   6:  40  40  40  40  40  40  10  20
> > >   7:  40  40  40  40  40  40  20  10
> > > 
> > > However the same system when only two nodes are online at boot, then the
> > > numa topology will look like
> > > node distances:
> > > node   0   1
> > >   0:  10  20
> > >   1:  20  10
> > > 
> > > It may be implementation dependent on what node_distance(0,3) where
> > > node 0 is online and node 3 is offline. In POWER case, it returns
> > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> > > distance between nodes is 20. However that would not be true.
> > > 
> > > When Nodes are onlined and CPUs from those nodes are hotplugged,
> > > the max node distance would be 40.
> > > 
> > > To handle such scenarios, let scheduler allow architectures to populate
> > > the distance map. Architectures that like to populate the distance map
> > > can overload arch_populate_distance_map().
> > 
> > Why? Why can't your node_distance() DTRT? The arch interface is
> > nr_node_ids and node_distance(), I don't see why we need something new
> > and then replace one special use of it.
> > 
> > By virtue of you being able to actually implement this new hook, you
> > supposedly can actually do node_distance() right too.
> 
> Since for an offline node, arch interface code doesn't have the info.
> As far as I know/understand, in POWER, unless there is an active memory or
> CPU that's getting onlined, arch can't fetch the correct node distance.
> 
> Taking the above example: node 3 is offline, then node_distance of (3,X)
> where X is anything other than 3, is not reliable. The moment node 3 is
> onlined, the node distance is reliable.
> 
> This problem will not happen even on POWER if all the nodes have either
> memory or CPUs active at the time of boot.

But then how can you implement this new hook? Going by the fact that
both nr_node_ids and distance_ref_points_depth are fixed, how many
possible __node_distance() configurations are there left?

The example provided above does not suggest there's much room for
alternatives, and hence for actual need of this new interface.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
  2021-05-21  8:14       ` Peter Zijlstra
@ 2021-05-21  9:28         ` Srikar Dronamraju
  2021-05-24 14:16           ` Valentin Schneider
  0 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2021-05-21  9:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha,
	Gautham R Shenoy, Geetika Moolchandani

* Peter Zijlstra <peterz@infradead.org> [2021-05-21 10:14:10]:

> On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]:
> > 
> > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> > > > Currently scheduler populates the distance map by looking at distance
> > > > of each node from all other nodes. This should work for most
> > > > architectures and platforms.
> > > > 
> > > > However there are some architectures like POWER that may not expose
> > > > the distance of nodes that are not yet onlined because those resources
> > > > are not yet allocated to the OS instance. Such architectures have
> > > > other means to provide valid distance data for the current platform.
> > > > 
> > > > For example distance info from numactl from a fully populated 8 node
> > > > system at boot may look like this.
> > > > 
> > > > node distances:
> > > > node   0   1   2   3   4   5   6   7
> > > >   0:  10  20  40  40  40  40  40  40
> > > >   1:  20  10  40  40  40  40  40  40
> > > >   2:  40  40  10  20  40  40  40  40
> > > >   3:  40  40  20  10  40  40  40  40
> > > >   4:  40  40  40  40  10  20  40  40
> > > >   5:  40  40  40  40  20  10  40  40
> > > >   6:  40  40  40  40  40  40  10  20
> > > >   7:  40  40  40  40  40  40  20  10
> > > > 
> > > > However the same system when only two nodes are online at boot, then the
> > > > numa topology will look like
> > > > node distances:
> > > > node   0   1
> > > >   0:  10  20
> > > >   1:  20  10
> > > > 
> > > > It may be implementation dependent on what node_distance(0,3) where
> > > > node 0 is online and node 3 is offline. In POWER case, it returns
> > > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> > > > distance between nodes is 20. However that would not be true.
> > > > 
> > > > When Nodes are onlined and CPUs from those nodes are hotplugged,
> > > > the max node distance would be 40.
> > > > 
> > > > To handle such scenarios, let scheduler allow architectures to populate
> > > > the distance map. Architectures that like to populate the distance map
> > > > can overload arch_populate_distance_map().
> > > 
> > > Why? Why can't your node_distance() DTRT? The arch interface is
> > > nr_node_ids and node_distance(), I don't see why we need something new
> > > and then replace one special use of it.
> > > 
> > > By virtue of you being able to actually implement this new hook, you
> > > supposedly can actually do node_distance() right too.
> > 
> > Since for an offline node, arch interface code doesn't have the info.
> > As far as I know/understand, in POWER, unless there is an active memory or
> > CPU that's getting onlined, arch can't fetch the correct node distance.
> > 
> > Taking the above example: node 3 is offline, then node_distance of (3,X)
> > where X is anything other than 3, is not reliable. The moment node 3 is
> > onlined, the node distance is reliable.
> > 
> > This problem will not happen even on POWER if all the nodes have either
> > memory or CPUs active at the time of boot.
> 
> But then how can you implement this new hook? Going by the fact that
> both nr_node_ids and distance_ref_points_depth are fixed, how many
> possible __node_distance() configurations are there left?
> 

distance_ref_point_depth is provided as a different property and is readily
available at boot. The new api will use just use that. So based on the
distance_ref_point_depth, we know all possible node distances for that
platform.

For an offline node, we don't have that specific nodes distance_lookup_table
array entries. Each array would be of distance_ref_point_depth entries.
Without the distance_lookup_table for an array populated, we will not be
able to tell how far the node is with respect to other nodes.

We can lookup the correct distance_lookup_table for a node based on memory
or the CPUs attached to that node. Since in an offline node, both of them
would not be around, the distance_lookup_table will have stale values.

> The example provided above does not suggest there's much room for
> alternatives, and hence for actual need of this new interface.
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
  2021-05-21  9:28         ` Srikar Dronamraju
@ 2021-05-24 14:16           ` Valentin Schneider
  2021-05-24 16:18             ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2021-05-24 14:16 UTC (permalink / raw)
  To: Srikar Dronamraju, Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
	Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch,
	Michael Ellerman, Scott Cheloha, Gautham R Shenoy,
	Geetika Moolchandani

On 21/05/21 14:58, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2021-05-21 10:14:10]:
>
>> On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
>> > * Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]:
>> >
>> > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
>> > > > Currently scheduler populates the distance map by looking at distance
>> > > > of each node from all other nodes. This should work for most
>> > > > architectures and platforms.
>> > > >
>> > > > However there are some architectures like POWER that may not expose
>> > > > the distance of nodes that are not yet onlined because those resources
>> > > > are not yet allocated to the OS instance. Such architectures have
>> > > > other means to provide valid distance data for the current platform.
>> > > >
>> > > > For example distance info from numactl from a fully populated 8 node
>> > > > system at boot may look like this.
>> > > >
>> > > > node distances:
>> > > > node   0   1   2   3   4   5   6   7
>> > > >   0:  10  20  40  40  40  40  40  40
>> > > >   1:  20  10  40  40  40  40  40  40
>> > > >   2:  40  40  10  20  40  40  40  40
>> > > >   3:  40  40  20  10  40  40  40  40
>> > > >   4:  40  40  40  40  10  20  40  40
>> > > >   5:  40  40  40  40  20  10  40  40
>> > > >   6:  40  40  40  40  40  40  10  20
>> > > >   7:  40  40  40  40  40  40  20  10
>> > > >
>> > > > However the same system when only two nodes are online at boot, then the
>> > > > numa topology will look like
>> > > > node distances:
>> > > > node   0   1
>> > > >   0:  10  20
>> > > >   1:  20  10
>> > > >
>> > > > It may be implementation dependent on what node_distance(0,3) where
>> > > > node 0 is online and node 3 is offline. In POWER case, it returns
>> > > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
>> > > > distance between nodes is 20. However that would not be true.
>> > > >
>> > > > When Nodes are onlined and CPUs from those nodes are hotplugged,
>> > > > the max node distance would be 40.
>> > > >
>> > > > To handle such scenarios, let scheduler allow architectures to populate
>> > > > the distance map. Architectures that like to populate the distance map
>> > > > can overload arch_populate_distance_map().
>> > >
>> > > Why? Why can't your node_distance() DTRT? The arch interface is
>> > > nr_node_ids and node_distance(), I don't see why we need something new
>> > > and then replace one special use of it.
>> > >
>> > > By virtue of you being able to actually implement this new hook, you
>> > > supposedly can actually do node_distance() right too.
>> >
>> > Since for an offline node, arch interface code doesn't have the info.
>> > As far as I know/understand, in POWER, unless there is an active memory or
>> > CPU that's getting onlined, arch can't fetch the correct node distance.
>> >
>> > Taking the above example: node 3 is offline, then node_distance of (3,X)
>> > where X is anything other than 3, is not reliable. The moment node 3 is
>> > onlined, the node distance is reliable.
>> >
>> > This problem will not happen even on POWER if all the nodes have either
>> > memory or CPUs active at the time of boot.
>>
>> But then how can you implement this new hook? Going by the fact that
>> both nr_node_ids and distance_ref_points_depth are fixed, how many
>> possible __node_distance() configurations are there left?
>>
>
> distance_ref_point_depth is provided as a different property and is readily
> available at boot. The new api will use just use that. So based on the
> distance_ref_point_depth, we know all possible node distances for that
> platform.
>
> For an offline node, we don't have that specific nodes distance_lookup_table
> array entries. Each array would be of distance_ref_point_depth entries.
> Without the distance_lookup_table for an array populated, we will not be
> able to tell how far the node is with respect to other nodes.
>
> We can lookup the correct distance_lookup_table for a node based on memory
> or the CPUs attached to that node. Since in an offline node, both of them
> would not be around, the distance_lookup_table will have stale values.
>

Ok so from your arch you can figure out the *size* of the set of unique
distances, but not the individual node_distance(a, b)... That's quite
unfortunate.

I suppose one way to avoid the hook would be to write some "fake" distance
values into your distance_lookup_table[] for offline nodes using your
distance_ref_point_depth thing, i.e. ensure an iteration of
node_distance(a, b) covers all distance values [1]. You can then keep patch
3 around, and that should roughly be it.


>> The example provided above does not suggest there's much room for
>> alternatives, and hence for actual need of this new interface.
>>
>
> --
> Thanks and Regards
> Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] powerpc/numa: Populate distance map correctly
  2021-05-20 15:44 ` [PATCH 2/3] powerpc/numa: Populate distance map correctly Srikar Dronamraju
@ 2021-05-24 14:16   ` Valentin Schneider
  2021-05-24 14:50     ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2021-05-24 14:16 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev,
	Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy,
	Geetika Moolchandani

On 20/05/21 21:14, Srikar Dronamraju wrote:
> +int arch_populate_distance_map(unsigned long *distance_map)
> +{
> +	int i;
> +	int distance = LOCAL_DISTANCE;
> +
> +	bitmap_set(distance_map, distance, 1);
> +
> +	if (!form1_affinity) {
> +		bitmap_set(distance_map, REMOTE_DISTANCE, 1);
> +		return 0;
> +	}
> +
> +	for (i = 0; i < distance_ref_points_depth; i++) {
> +		distance *= 2;
> +		bitmap_set(distance_map, distance, 1);

Do you have guarantees your distance values will always be in the form of

  LOCAL_DISTANCE * 2^i

because that certainly isn't true for x86/arm64.

> +	}
> +	return 0;
> +}
> +
>  /*
>   * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
>   * info is found.
> --
> 2.27.0

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] powerpc/numa: Populate distance map correctly
  2021-05-24 14:16   ` Valentin Schneider
@ 2021-05-24 14:50     ` Srikar Dronamraju
  0 siblings, 0 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2021-05-24 14:50 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev,
	Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy,
	Geetika Moolchandani

* Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:22]:

> On 20/05/21 21:14, Srikar Dronamraju wrote:
> > +int arch_populate_distance_map(unsigned long *distance_map)
> > +{
> > +	int i;
> > +	int distance = LOCAL_DISTANCE;
> > +
> > +	bitmap_set(distance_map, distance, 1);
> > +
> > +	if (!form1_affinity) {
> > +		bitmap_set(distance_map, REMOTE_DISTANCE, 1);
> > +		return 0;
> > +	}
> > +
> > +	for (i = 0; i < distance_ref_points_depth; i++) {
> > +		distance *= 2;
> > +		bitmap_set(distance_map, distance, 1);
> 
> Do you have guarantees your distance values will always be in the form of
> 
>   LOCAL_DISTANCE * 2^i
> 
> because that certainly isn't true for x86/arm64.
> 

This is true till now. It don't think that's going to change anytime soon, but
we never know what lies ahead.

For all practical purposes, (unless a newer, shinier property is proposed,)
distance_ref_points_depth is going to give us the unique distances.

> > +	}
> > +	return 0;
> > +}
> > +
> >  /*
> >   * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> >   * info is found.
> > --
> > 2.27.0

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
  2021-05-24 14:16           ` Valentin Schneider
@ 2021-05-24 16:18             ` Srikar Dronamraju
  2021-05-25 10:21               ` Valentin Schneider
  2021-05-28  8:43               ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2021-05-24 16:18 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev,
	Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy,
	Geetika Moolchandani

* Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]:

> On 21/05/21 14:58, Srikar Dronamraju wrote:
> > * Peter Zijlstra <peterz@infradead.org> [2021-05-21 10:14:10]:
> >
> >> On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
> >> > * Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]:
> >> >
> >> > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> >> > > > Currently scheduler populates the distance map by looking at distance
> >> > > > of each node from all other nodes. This should work for most
> >> > > > architectures and platforms.
> >> > > >
> >> > > > However there are some architectures like POWER that may not expose
> >> > > > the distance of nodes that are not yet onlined because those resources
> >> > > > are not yet allocated to the OS instance. Such architectures have
> >> > > > other means to provide valid distance data for the current platform.
> >> > > >
> >> > > > For example distance info from numactl from a fully populated 8 node
> >> > > > system at boot may look like this.
> >> > > >
> >> > > > node distances:
> >> > > > node   0   1   2   3   4   5   6   7
> >> > > >   0:  10  20  40  40  40  40  40  40
> >> > > >   1:  20  10  40  40  40  40  40  40
> >> > > >   2:  40  40  10  20  40  40  40  40
> >> > > >   3:  40  40  20  10  40  40  40  40
> >> > > >   4:  40  40  40  40  10  20  40  40
> >> > > >   5:  40  40  40  40  20  10  40  40
> >> > > >   6:  40  40  40  40  40  40  10  20
> >> > > >   7:  40  40  40  40  40  40  20  10
> >> > > >
> >> > > > However the same system when only two nodes are online at boot, then the
> >> > > > numa topology will look like
> >> > > > node distances:
> >> > > > node   0   1
> >> > > >   0:  10  20
> >> > > >   1:  20  10
> >> > > >
> >> > > > It may be implementation dependent on what node_distance(0,3) where
> >> > > > node 0 is online and node 3 is offline. In POWER case, it returns
> >> > > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> >> > > > distance between nodes is 20. However that would not be true.
> >> > > >
> >> > > > When Nodes are onlined and CPUs from those nodes are hotplugged,
> >> > > > the max node distance would be 40.
> >> > > >
> >> > > > To handle such scenarios, let scheduler allow architectures to populate
> >> > > > the distance map. Architectures that like to populate the distance map
> >> > > > can overload arch_populate_distance_map().
> >> > >
> >> > > Why? Why can't your node_distance() DTRT? The arch interface is
> >> > > nr_node_ids and node_distance(), I don't see why we need something new
> >> > > and then replace one special use of it.
> >> > >
> >> > > By virtue of you being able to actually implement this new hook, you
> >> > > supposedly can actually do node_distance() right too.
> >> >
> >> > Since for an offline node, arch interface code doesn't have the info.
> >> > As far as I know/understand, in POWER, unless there is an active memory or
> >> > CPU that's getting onlined, arch can't fetch the correct node distance.
> >> >
> >> > Taking the above example: node 3 is offline, then node_distance of (3,X)
> >> > where X is anything other than 3, is not reliable. The moment node 3 is
> >> > onlined, the node distance is reliable.
> >> >
> >> > This problem will not happen even on POWER if all the nodes have either
> >> > memory or CPUs active at the time of boot.
> >>
> >> But then how can you implement this new hook? Going by the fact that
> >> both nr_node_ids and distance_ref_points_depth are fixed, how many
> >> possible __node_distance() configurations are there left?
> >>
> >
> > distance_ref_point_depth is provided as a different property and is readily
> > available at boot. The new api will use just use that. So based on the
> > distance_ref_point_depth, we know all possible node distances for that
> > platform.
> >
> > For an offline node, we don't have that specific nodes distance_lookup_table
> > array entries. Each array would be of distance_ref_point_depth entries.
> > Without the distance_lookup_table for an array populated, we will not be
> > able to tell how far the node is with respect to other nodes.
> >
> > We can lookup the correct distance_lookup_table for a node based on memory
> > or the CPUs attached to that node. Since in an offline node, both of them
> > would not be around, the distance_lookup_table will have stale values.
> >
> 
> Ok so from your arch you can figure out the *size* of the set of unique
> distances, but not the individual node_distance(a, b)... That's quite
> unfortunate.

Yes, thats true.

> 
> I suppose one way to avoid the hook would be to write some "fake" distance
> values into your distance_lookup_table[] for offline nodes using your
> distance_ref_point_depth thing, i.e. ensure an iteration of
> node_distance(a, b) covers all distance values [1]. You can then keep patch
> 3 around, and that should roughly be it.
> 

Yes, this would suffice but to me its not very clean.
static int found[distance_ref_point_depth];

for_each_node(node){
	int i, nd, distance = LOCAL_DISTANCE;
		goto out;

	nd = node_distance(node, first_online_node)
	for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
		if (node_online) {
			if (distance != nd)
				continue;
			found[i] ++;
			break;
		}
		if (found[i])
			continue;
		distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i];
		found[i] ++;
		break;
	}
}

But do note: We are setting a precedent for node distance between two nodes
to change.


> 
> >> The example provided above does not suggest there's much room for
> >> alternatives, and hence for actual need of this new interface.
> >>
> >
> > --
> > Thanks and Regards
> > Srikar Dronamraju

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
  2021-05-24 16:18             ` Srikar Dronamraju
@ 2021-05-25 10:21               ` Valentin Schneider
  2021-05-25 11:32                 ` Srikar Dronamraju
  2021-05-28  5:21                 ` Srikar Dronamraju
  2021-05-28  8:43               ` Peter Zijlstra
  1 sibling, 2 replies; 17+ messages in thread
From: Valentin Schneider @ 2021-05-25 10:21 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev,
	Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy,
	Geetika Moolchandani

On 24/05/21 21:48, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]:
>> Ok so from your arch you can figure out the *size* of the set of unique
>> distances, but not the individual node_distance(a, b)... That's quite
>> unfortunate.
>
> Yes, thats true.
>
>>
>> I suppose one way to avoid the hook would be to write some "fake" distance
>> values into your distance_lookup_table[] for offline nodes using your
>> distance_ref_point_depth thing, i.e. ensure an iteration of
>> node_distance(a, b) covers all distance values [1]. You can then keep patch
>> 3 around, and that should roughly be it.
>>
>
> Yes, this would suffice but to me its not very clean.
> static int found[distance_ref_point_depth];
>
> for_each_node(node){
>       int i, nd, distance = LOCAL_DISTANCE;
>               goto out;
>
>       nd = node_distance(node, first_online_node)
>       for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
>               if (node_online) {
>                       if (distance != nd)
>                               continue;
>                       found[i] ++;
>                       break;
>               }
>               if (found[i])
>                       continue;
>               distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i];
>               found[i] ++;
>               break;
>       }
> }
>
> But do note: We are setting a precedent for node distance between two nodes
> to change.
>

Indeed. AFAICT it's that or the unique-distance-values hook :/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
  2021-05-25 10:21               ` Valentin Schneider
@ 2021-05-25 11:32                 ` Srikar Dronamraju
  2021-05-28  5:21                 ` Srikar Dronamraju
  1 sibling, 0 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2021-05-25 11:32 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev,
	Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy,
	Geetika Moolchandani

* Valentin Schneider <valentin.schneider@arm.com> [2021-05-25 11:21:02]:

> On 24/05/21 21:48, Srikar Dronamraju wrote:
> > * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]:
> >> Ok so from your arch you can figure out the *size* of the set of unique
> >> distances, but not the individual node_distance(a, b)... That's quite
> >> unfortunate.
> >
> > Yes, thats true.
> >
> >>
> >> I suppose one way to avoid the hook would be to write some "fake" distance
> >> values into your distance_lookup_table[] for offline nodes using your
> >> distance_ref_point_depth thing, i.e. ensure an iteration of
> >> node_distance(a, b) covers all distance values [1]. You can then keep patch
> >> 3 around, and that should roughly be it.
> >>
> >
> > Yes, this would suffice but to me its not very clean.
> > static int found[distance_ref_point_depth];
> >
> > for_each_node(node){
> >       int i, nd, distance = LOCAL_DISTANCE;
> >               goto out;
> >
> >       nd = node_distance(node, first_online_node)
> >       for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
> >               if (node_online) {
> >                       if (distance != nd)
> >                               continue;
> >                       found[i] ++;
> >                       break;
> >               }
> >               if (found[i])
> >                       continue;
> >               distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i];
> >               found[i] ++;
> >               break;
> >       }
> > }
> >
> > But do note: We are setting a precedent for node distance between two nodes
> > to change.
> >
> 
> Indeed. AFAICT it's that or the unique-distance-values hook :/

Peter,

Please let me know which approach would you prefer.
I am open to try any other approach too.

In my humble opinion, unique-distance-values hook is more cleaner.
Do you still have any concerns with the unique-distance-values hook?

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
  2021-05-25 10:21               ` Valentin Schneider
  2021-05-25 11:32                 ` Srikar Dronamraju
@ 2021-05-28  5:21                 ` Srikar Dronamraju
  1 sibling, 0 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2021-05-28  5:21 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev,
	Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy,
	Geetika Moolchandani

* Valentin Schneider <valentin.schneider@arm.com> [2021-05-25 11:21:02]:

> On 24/05/21 21:48, Srikar Dronamraju wrote:
> > * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]:
> >> Ok so from your arch you can figure out the *size* of the set of unique
> >> distances, but not the individual node_distance(a, b)... That's quite
> >> unfortunate.
> >
> > Yes, thats true.
> >
> >>
> >> I suppose one way to avoid the hook would be to write some "fake" distance
> >> values into your distance_lookup_table[] for offline nodes using your
> >> distance_ref_point_depth thing, i.e. ensure an iteration of
> >> node_distance(a, b) covers all distance values [1]. You can then keep patch
> >> 3 around, and that should roughly be it.
> >>
> >
> > Yes, this would suffice but to me its not very clean.
> > static int found[distance_ref_point_depth];
> >
> > for_each_node(node){
> >       int i, nd, distance = LOCAL_DISTANCE;
> >               goto out;
> >
> >       nd = node_distance(node, first_online_node)
> >       for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
> >               if (node_online) {
> >                       if (distance != nd)
> >                               continue;
> >                       found[i] ++;
> >                       break;
> >               }
> >               if (found[i])
> >                       continue;
> >               distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i];
> >               found[i] ++;
> >               break;
> >       }
> > }
> >
> > But do note: We are setting a precedent for node distance between two nodes
> > to change.
> >
> 
> Indeed. AFAICT it's that or the unique-distance-values hook :/

Peter, Valentin, Michael,

Can you please let me know which approach you would want me to follow.

Or do let me know any other alternative solutions that you would want me to
try.


-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
  2021-05-24 16:18             ` Srikar Dronamraju
  2021-05-25 10:21               ` Valentin Schneider
@ 2021-05-28  8:43               ` Peter Zijlstra
  2021-05-28 10:24                 ` Srikar Dronamraju
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2021-05-28  8:43 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Valentin Schneider, Ingo Molnar, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev,
	Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy,
	Geetika Moolchandani

On Mon, May 24, 2021 at 09:48:29PM +0530, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]:

> > I suppose one way to avoid the hook would be to write some "fake" distance
> > values into your distance_lookup_table[] for offline nodes using your
> > distance_ref_point_depth thing, i.e. ensure an iteration of
> > node_distance(a, b) covers all distance values [1]. You can then keep patch
> > 3 around, and that should roughly be it.
> > 
> 
> Yes, this would suffice but to me its not very clean.
> static int found[distance_ref_point_depth];
> 
> for_each_node(node){
> 	int i, nd, distance = LOCAL_DISTANCE;
> 		goto out;
> 
> 	nd = node_distance(node, first_online_node)
> 	for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
> 		if (node_online) {
> 			if (distance != nd)
> 				continue;
> 			found[i] ++;
> 			break;
> 		}
> 		if (found[i])
> 			continue;
> 		distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i];
> 		found[i] ++;
> 		break;
> 	}
> }
> 
> But do note: We are setting a precedent for node distance between two nodes
> to change.

Not really; or rather not more than already is the case AFAICT. Because
currently your distance table will have *something* in it
(LOCAL_DISTANCE afaict) for nodes that have never been online, which is
what triggered the whole problem to begin with.

Only after the node has come online for the first time, will it contain
the right value.

So both before and after this proposal the actual distance value changes
after the first time a node goes online.

Yes that's unfortunate, but I don't see a problem with pre-filling it
with something useful in order to avoid aditional arch hooks.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
  2021-05-28  8:43               ` Peter Zijlstra
@ 2021-05-28 10:24                 ` Srikar Dronamraju
  0 siblings, 0 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2021-05-28 10:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Ingo Molnar, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev,
	Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy,
	Geetika Moolchandani

* Peter Zijlstra <peterz@infradead.org> [2021-05-28 10:43:23]:

> On Mon, May 24, 2021 at 09:48:29PM +0530, Srikar Dronamraju wrote:
> > * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]:
> 
> > > I suppose one way to avoid the hook would be to write some "fake" distance
> > > values into your distance_lookup_table[] for offline nodes using your
> > > distance_ref_point_depth thing, i.e. ensure an iteration of
> > > node_distance(a, b) covers all distance values [1]. You can then keep patch
> > > 3 around, and that should roughly be it.
> > > 
> > 
> > Yes, this would suffice but to me its not very clean.
> > static int found[distance_ref_point_depth];
> > 
> > for_each_node(node){
> > 	int i, nd, distance = LOCAL_DISTANCE;
> > 		goto out;
> > 
> > 	nd = node_distance(node, first_online_node)
> > 	for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
> > 		if (node_online) {
> > 			if (distance != nd)
> > 				continue;
> > 			found[i] ++;
> > 			break;
> > 		}
> > 		if (found[i])
> > 			continue;
> > 		distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i];
> > 		found[i] ++;
> > 		break;
> > 	}
> > }
> > 
> > But do note: We are setting a precedent for node distance between two nodes
> > to change.
> 
> Not really; or rather not more than already is the case AFAICT. Because
> currently your distance table will have *something* in it
> (LOCAL_DISTANCE afaict) for nodes that have never been online, which is
> what triggered the whole problem to begin with.
> 
> Only after the node has come online for the first time, will it contain
> the right value.
> 
> So both before and after this proposal the actual distance value changes
> after the first time a node goes online.
> 
> Yes that's unfortunate, but I don't see a problem with pre-filling it
> with something useful in order to avoid aditional arch hooks.
> 
> 

Okay,

Will post a v2 with prefilling.
Thanks for the update.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-05-28 10:25 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-20 15:44 [PATCH 0/3] Skip numa distance for offline nodes Srikar Dronamraju
2021-05-20 15:44 ` [PATCH 1/3] sched/topology: Allow archs to populate distance map Srikar Dronamraju
2021-05-20 18:56   ` Peter Zijlstra
2021-05-21  2:38     ` Srikar Dronamraju
2021-05-21  8:14       ` Peter Zijlstra
2021-05-21  9:28         ` Srikar Dronamraju
2021-05-24 14:16           ` Valentin Schneider
2021-05-24 16:18             ` Srikar Dronamraju
2021-05-25 10:21               ` Valentin Schneider
2021-05-25 11:32                 ` Srikar Dronamraju
2021-05-28  5:21                 ` Srikar Dronamraju
2021-05-28  8:43               ` Peter Zijlstra
2021-05-28 10:24                 ` Srikar Dronamraju
2021-05-20 15:44 ` [PATCH 2/3] powerpc/numa: Populate distance map correctly Srikar Dronamraju
2021-05-24 14:16   ` Valentin Schneider
2021-05-24 14:50     ` Srikar Dronamraju
2021-05-20 15:44 ` [PATCH 3/3] sched/topology: Skip updating masks for non-online nodes Srikar Dronamraju

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).