All of lore.kernel.org
* [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline
@ 2021-08-21 10:25 Srikar Dronamraju
  2021-08-21 10:25 ` [PATCH v2 1/3] powerpc/numa: Print debug statements only when required Srikar Dronamraju
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2021-08-21 10:25 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Geetika Moolchandani,
	Ingo Molnar, Laurent Dufour, linuxppc-dev, Valentin Schneider

The scheduler expects the set of unique node distances to be available
at boot. It uses the node distance table to compute this set. On Power
servers, node distances for offline nodes are not available. However,
Power servers already know all possible unique node distances. Fake the
offline nodes' distance_lookup_table entries so that all possible node
distances are represented.

For example, the distance info from numactl on a fully populated 8-node
system at boot may look like this:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However, when the same system has only two nodes online at boot, the
distance info from numactl will look like:
node distances:
node   0   1
  0:  10  20
  1:  20  10

With the faked NUMA distances at boot, the node distance table will look
like:
node   0   1   2
  0:  10  20  40
  1:  20  10  40
  2:  40  40  10

The actual distances will be populated once the nodes are onlined.

Also, when simultaneously running CPU online/offline and CPU add/remove
in a loop, we see WARNING messages:

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898 build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag
raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink
pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs
libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth
dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c0000000001caac8 LR: c0000000001caac4 CTR: 00000000007088ec
REGS: c00000005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48828222  XER: 00000009
CFAR: c0000000001ea698 IRQMASK: 0
GPR00: c0000000001caac4 c00000005596f4c0 c000000001c4a400 0000000000000036
GPR04: 00000000fffdffff c00000005596f1d0 0000000000000027 c0000018cfd07f90
GPR08: 0000000000000023 0000000000000001 0000000000000027 c0000018fe68ffe8
GPR12: 0000000000008000 c00000001e9d1880 c00000013a047200 0000000000000800
GPR16: c000000001d3c7d0 0000000000000240 0000000000000048 c000000010aacd18
GPR20: 0000000000000001 c000000010aacc18 c00000013a047c00 c000000139ec2400
GPR24: 0000000000000280 c000000139ec2520 c000000136c1b400 c000000001c93060
GPR28: c00000013a047c20 c000000001d3c6c0 c000000001c978a0 000000000000000d
NIP [c0000000001caac8] build_sched_domains+0xd48/0x1720
LR [c0000000001caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c00000005596f4c0] [c0000000001caac4] build_sched_domains+0xd44/0x1720 (unreliable)
[c00000005596f670] [c0000000001cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c00000005596f710] [c0000000002804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c00000005596f810] [c000000000283e60] rebuild_sched_domains+0x40/0x70
[c00000005596f840] [c000000000284124] cpuset_hotplug_workfn+0x294/0xf10
[c00000005596fc60] [c000000000175040] process_one_work+0x290/0x590
[c00000005596fd00] [c0000000001753c8] worker_thread+0x88/0x620
[c00000005596fda0] [c000000000181704] kthread+0x194/0x1a0
[c00000005596fe10] [c00000000000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 60000000 2fa30800 409e0028 80fe0000 e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 60000000 <0fe00000> 39000000 38e00000 38c00000

This was because cpu_cpu_mask() was only updated on CPU add/remove, not
on CPU online/offline. Other cpumasks are updated both on CPU
online/offline and on add/remove.
Update cpu_cpu_mask() on CPU online/offline too.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>

Srikar Dronamraju (3):
  powerpc/numa: Print debug statements only when required
  powerpc/numa: Update cpu_cpu_map on CPU online/offline
  powerpc/numa: Fill distance_lookup_table for offline nodes

 arch/powerpc/include/asm/topology.h | 12 ++++
 arch/powerpc/kernel/smp.c           |  3 +
 arch/powerpc/mm/numa.c              | 88 +++++++++++++++++++++++++----
 3 files changed, 92 insertions(+), 11 deletions(-)

-- 
2.18.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 1/3] powerpc/numa: Print debug statements only when required
  2021-08-21 10:25 [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline Srikar Dronamraju
@ 2021-08-21 10:25 ` Srikar Dronamraju
  2021-08-23  9:21   ` Laurent Dufour
  2021-08-21 10:25 ` [PATCH v2 2/3] powerpc/numa: Update cpu_cpu_map on CPU online/offline Srikar Dronamraju
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2021-08-21 10:25 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Geetika Moolchandani,
	Ingo Molnar, Laurent Dufour, linuxppc-dev, Valentin Schneider

Currently, a debug message gets printed on every attempt to add (remove)
a CPU to (from) a node. However, this is redundant if the CPU has
already been added to (removed from) the node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..fbe03f6840e0 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -141,10 +141,11 @@ static void map_cpu_to_node(int cpu, int node)
 {
 	update_numa_cpu_lookup_table(cpu, node);
 
-	dbg("adding cpu %d to node %d\n", cpu, node);
 
-	if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
+	if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node]))) {
+		dbg("adding cpu %d to node %d\n", cpu, node);
 		cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
+	}
 }
 
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
@@ -152,13 +153,11 @@ static void unmap_cpu_from_node(unsigned long cpu)
 {
 	int node = numa_cpu_lookup_table[cpu];
 
-	dbg("removing cpu %lu from node %d\n", cpu, node);
-
 	if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
 		cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
+		dbg("removing cpu %lu from node %d\n", cpu, node);
 	} else {
-		printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
-		       cpu, node);
+		pr_err("WARNING: cpu %lu not found in node %d\n", cpu, node);
 	}
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
-- 
2.18.2



* [PATCH v2 2/3] powerpc/numa: Update cpu_cpu_map on CPU online/offline
  2021-08-21 10:25 [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline Srikar Dronamraju
  2021-08-21 10:25 ` [PATCH v2 1/3] powerpc/numa: Print debug statements only when required Srikar Dronamraju
@ 2021-08-21 10:25 ` Srikar Dronamraju
  2021-08-21 10:25 ` [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes Srikar Dronamraju
  2021-08-23  8:33 ` [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline Peter Zijlstra
  3 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2021-08-21 10:25 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Geetika Moolchandani,
	Ingo Molnar, Laurent Dufour, linuxppc-dev, Valentin Schneider

cpu_cpu_map holds all the CPUs in the DIE. However, on PowerPC, this
mask doesn't get updated when CPUs are onlined/offlined; it is only
updated when CPUs are added/removed. So when CPU online/offline and CPU
add/remove operations run simultaneously, the cpumasks end up broken.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag
udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag
bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq
uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg
ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash
dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c0000000001caac8 LR: c0000000001caac4 CTR: 00000000007088ec
REGS: c00000005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48828222  XER:
00000009
CFAR: c0000000001ea698 IRQMASK: 0
GPR00: c0000000001caac4 c00000005596f4c0 c000000001c4a400 0000000000000036
GPR04: 00000000fffdffff c00000005596f1d0 0000000000000027 c0000018cfd07f90
GPR08: 0000000000000023 0000000000000001 0000000000000027 c0000018fe68ffe8
GPR12: 0000000000008000 c00000001e9d1880 c00000013a047200 0000000000000800
GPR16: c000000001d3c7d0 0000000000000240 0000000000000048 c000000010aacd18
GPR20: 0000000000000001 c000000010aacc18 c00000013a047c00 c000000139ec2400
GPR24: 0000000000000280 c000000139ec2520 c000000136c1b400 c000000001c93060
GPR28: c00000013a047c20 c000000001d3c6c0 c000000001c978a0 000000000000000d
NIP [c0000000001caac8] build_sched_domains+0xd48/0x1720
LR [c0000000001caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c00000005596f4c0] [c0000000001caac4] build_sched_domains+0xd44/0x1720 (unreliable)
[c00000005596f670] [c0000000001cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c00000005596f710] [c0000000002804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c00000005596f810] [c000000000283e60] rebuild_sched_domains+0x40/0x70
[c00000005596f840] [c000000000284124] cpuset_hotplug_workfn+0x294/0xf10
[c00000005596fc60] [c000000000175040] process_one_work+0x290/0x590
[c00000005596fd00] [c0000000001753c8] worker_thread+0x88/0x620
[c00000005596fda0] [c000000000181704] kthread+0x194/0x1a0
[c00000005596fe10] [c00000000000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 60000000 2fa30800 409e0028 80fe0000 e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 60000000 <0fe00000> 39000000 38e00000 38c00000

Fix this by updating cpu_cpu_map, aka cpumask_of_node(), on every CPU
online/offline.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/topology.h | 12 ++++++++++++
 arch/powerpc/kernel/smp.c           |  3 +++
 arch/powerpc/mm/numa.c              |  7 ++-----
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..2f0a4d7b95f6 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -65,6 +65,11 @@ static inline int early_cpu_to_node(int cpu)
 
 int of_drconf_to_nid_single(struct drmem_lmb *lmb);
 
+extern void map_cpu_to_node(int cpu, int node);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void unmap_cpu_from_node(unsigned long cpu);
+#endif /* CONFIG_HOTPLUG_CPU */
+
 #else
 
 static inline int early_cpu_to_node(int cpu) { return 0; }
@@ -93,6 +98,13 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 	return first_online_node;
 }
 
+#ifdef CONFIG_SMP
+static inline void map_cpu_to_node(int cpu, int node) {}
+#ifdef CONFIG_HOTPLUG_CPU
+static inline void unmap_cpu_from_node(unsigned long cpu) {}
+#endif /* CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_SMP */
+
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 47b15f31cc29..5ede4b1c7473 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1407,6 +1407,8 @@ static void remove_cpu_from_masks(int cpu)
 	struct cpumask *(*mask_fn)(int) = cpu_sibling_mask;
 	int i;
 
+	unmap_cpu_from_node(cpu);
+
 	if (shared_caches)
 		mask_fn = cpu_l2_cache_mask;
 
@@ -1491,6 +1493,7 @@ static void add_cpu_to_masks(int cpu)
 	 * This CPU will not be in the online mask yet so we need to manually
 	 * add it to it's own thread sibling mask.
 	 */
+	map_cpu_to_node(cpu, cpu_to_node(cpu));
 	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
 	cpumask_set_cpu(cpu, cpu_core_mask(cpu));
 
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index fbe03f6840e0..3c124928a16d 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -137,7 +137,7 @@ static void reset_numa_cpu_lookup_table(void)
 		numa_cpu_lookup_table[cpu] = -1;
 }
 
-static void map_cpu_to_node(int cpu, int node)
+void map_cpu_to_node(int cpu, int node)
 {
 	update_numa_cpu_lookup_table(cpu, node);
 
@@ -149,7 +149,7 @@ static void map_cpu_to_node(int cpu, int node)
 }
 
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
-static void unmap_cpu_from_node(unsigned long cpu)
+void unmap_cpu_from_node(unsigned long cpu)
 {
 	int node = numa_cpu_lookup_table[cpu];
 
@@ -597,9 +597,6 @@ static int ppc_numa_cpu_prepare(unsigned int cpu)
 
 static int ppc_numa_cpu_dead(unsigned int cpu)
 {
-#ifdef CONFIG_HOTPLUG_CPU
-	unmap_cpu_from_node(cpu);
-#endif
 	return 0;
 }
 
-- 
2.18.2



* [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
  2021-08-21 10:25 [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline Srikar Dronamraju
  2021-08-21 10:25 ` [PATCH v2 1/3] powerpc/numa: Print debug statements only when required Srikar Dronamraju
  2021-08-21 10:25 ` [PATCH v2 2/3] powerpc/numa: Update cpu_cpu_map on CPU online/offline Srikar Dronamraju
@ 2021-08-21 10:25 ` Srikar Dronamraju
  2021-08-26 13:36   ` Michael Ellerman
  2021-08-23  8:33 ` [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline Peter Zijlstra
  3 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2021-08-21 10:25 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Geetika Moolchandani,
	Ingo Molnar, Laurent Dufour, linuxppc-dev, Valentin Schneider,
	kernel test robot

The scheduler expects the set of unique node distances to be available
at boot. It uses the node distance table to compute this set. On POWER,
node distances for offline nodes are not available. However, POWER
already knows all possible unique node distances. Fake the offline
nodes' distance_lookup_table entries so that all possible node distances
are updated.

However, this only needs to be done if the number of unique node
distances that can be computed from online nodes is less than the
number of possible unique node distances, as represented by
distance_ref_points_depth. When a node is actually onlined,
distance_lookup_table will be updated with the actual entries.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 70 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

Changelog:
v1: https://lore.kernel.org/linuxppc-dev/20210701041552.112072-3-srikar@linux.vnet.ibm.com/t/#u
[ Fixed a missing prototype warning. Reported-by: kernel test robot <lkp@intel.com> ]

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3c124928a16d..0ee79a08c9e1 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -856,6 +856,75 @@ void __init dump_numa_cpu_topology(void)
 	}
 }
 
+/*
+ * Scheduler expects unique number of node distances to be available at
+ * boot. It uses node distance to calculate this unique node distances. On
+ * POWER, node distances for offline nodes is not available. However, POWER
+ * already knows unique possible node distances. Fake the offline node's
+ * distance_lookup_table entries so that all possible node distances are
+ * updated.
+ */
+static void __init fake_update_distance_lookup_table(void)
+{
+	unsigned long distance_map;
+	int i, nr_levels, nr_depth, node;
+
+	if (!numa_enabled)
+		return;
+
+	if (!form1_affinity)
+		return;
+
+	/*
+	 * distance_ref_points_depth lists the unique numa domains
+	 * available. However, it ignores LOCAL_DISTANCE. So add +1
+	 * to get the actual number of unique distances.
+	 */
+	nr_depth = distance_ref_points_depth + 1;
+
+	WARN_ON(nr_depth > sizeof(distance_map));
+
+	bitmap_zero(&distance_map, nr_depth);
+	bitmap_set(&distance_map, 0, 1);
+
+	for_each_online_node(node) {
+		int nd, distance = LOCAL_DISTANCE;
+
+		if (node == first_online_node)
+			continue;
+
+		nd = __node_distance(node, first_online_node);
+		for (i = 0; i < nr_depth; i++, distance *= 2) {
+			if (distance == nd) {
+				bitmap_set(&distance_map, i, 1);
+				break;
+			}
+		}
+		nr_levels = bitmap_weight(&distance_map, nr_depth);
+		if (nr_levels == nr_depth)
+			return;
+	}
+
+	for_each_node(node) {
+		if (node_online(node))
+			continue;
+
+		i = find_first_zero_bit(&distance_map, nr_depth);
+		if (i >= nr_depth || i == 0) {
+			pr_warn("Levels(%d) not matching levels(%d)", nr_levels, nr_depth);
+			return;
+		}
+
+		bitmap_set(&distance_map, i, 1);
+		while (i--)
+			distance_lookup_table[node][i] = node;
+
+		nr_levels = bitmap_weight(&distance_map, nr_depth);
+		if (nr_levels == nr_depth)
+			return;
+	}
+}
+
 /* Initialize NODE_DATA for a node on the local memory */
 static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 {
@@ -971,6 +1040,7 @@ void __init mem_topology_setup(void)
 		 */
 		numa_setup_cpu(cpu);
 	}
+	fake_update_distance_lookup_table();
 }
 
 void __init initmem_init(void)
-- 
2.18.2



* Re: [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline
  2021-08-21 10:25 [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2021-08-21 10:25 ` [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes Srikar Dronamraju
@ 2021-08-23  8:33 ` Peter Zijlstra
  2021-08-23  9:34   ` Srikar Dronamraju
  3 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2021-08-23  8:33 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Geetika Moolchandani, Valentin Schneider, Laurent Dufour,
	linuxppc-dev, Ingo Molnar

On Sat, Aug 21, 2021 at 03:55:32PM +0530, Srikar Dronamraju wrote:
> Scheduler expects unique number of node distances to be available
> at boot. It uses node distance to calculate this unique node
> distances. On Power Servers, node distances for offline nodes is not
> available. However, Power Servers already knows unique possible node
> distances. Fake the offline node's distance_lookup_table entries so
> that all possible node distances are updated.
> 
> For example distance info from numactl from a fully populated 8 node
> system at boot may look like this.
> 
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  20  40  40  40  40  40  40
>   1:  20  10  40  40  40  40  40  40
>   2:  40  40  10  20  40  40  40  40
>   3:  40  40  20  10  40  40  40  40
>   4:  40  40  40  40  10  20  40  40
>   5:  40  40  40  40  20  10  40  40
>   6:  40  40  40  40  40  40  10  20
>   7:  40  40  40  40  40  40  20  10
> 
> However the same system when only two nodes are online at boot, then
> distance info from numactl will look like
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
> 
> With the faked numa distance at boot, the node distance table will look
> like
> node   0   1   2
>   0:  10  20  40
>   1:  20  10  40
>   2:  40  40  10
> 
> The actual distance will be populated once the nodes are onlined.

How did you want all this merged? I picked up Valentin's patch, do you
want me to pick up these PowerPC patches in the same tree, or do you
want to route them seperately?


* Re: [PATCH v2 1/3] powerpc/numa: Print debug statements only when required
  2021-08-21 10:25 ` [PATCH v2 1/3] powerpc/numa: Print debug statements only when required Srikar Dronamraju
@ 2021-08-23  9:21   ` Laurent Dufour
  2021-08-23  9:38     ` Srikar Dronamraju
  0 siblings, 1 reply; 16+ messages in thread
From: Laurent Dufour @ 2021-08-23  9:21 UTC (permalink / raw)
  To: Srikar Dronamraju, Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot, Peter Zijlstra,
	Geetika Moolchandani, Valentin Schneider, linuxppc-dev,
	Ingo Molnar

On 21/08/2021 at 12:25, Srikar Dronamraju wrote:
> Currently, a debug message gets printed every time an attempt to
> add(remove) a CPU. However this is redundant if the CPU is already added
> (removed) from the node.
> 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> Cc: Laurent Dufour <ldufour@linux.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>   arch/powerpc/mm/numa.c | 11 +++++------
>   1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index f2bf98bdcea2..fbe03f6840e0 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -141,10 +141,11 @@ static void map_cpu_to_node(int cpu, int node)
>   {
>   	update_numa_cpu_lookup_table(cpu, node);
>   
> -	dbg("adding cpu %d to node %d\n", cpu, node);
>   
> -	if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
> +	if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node]))) {
> +		dbg("adding cpu %d to node %d\n", cpu, node);
>   		cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
> +	}
>   }
>   
>   #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
> @@ -152,13 +153,11 @@ static void unmap_cpu_from_node(unsigned long cpu)
>   {
>   	int node = numa_cpu_lookup_table[cpu];
>   
> -	dbg("removing cpu %lu from node %d\n", cpu, node);
> -
>   	if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
>   		cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
> +		dbg("removing cpu %lu from node %d\n", cpu, node);
>   	} else {
> -		printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
> -		       cpu, node);
> +		pr_err("WARNING: cpu %lu not found in node %d\n", cpu, node);

Would pr_warn() be more appropriate here (or removing the "WARNING" statement)?

>   	}
>   }
>   #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
> 



* Re: [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline
  2021-08-23  8:33 ` [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline Peter Zijlstra
@ 2021-08-23  9:34   ` Srikar Dronamraju
  2021-08-23  9:37     ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2021-08-23  9:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Geetika Moolchandani, Valentin Schneider, Laurent Dufour,
	linuxppc-dev, Ingo Molnar

* Peter Zijlstra <peterz@infradead.org> [2021-08-23 10:33:30]:

> On Sat, Aug 21, 2021 at 03:55:32PM +0530, Srikar Dronamraju wrote:
> > Scheduler expects unique number of node distances to be available
> > at boot. It uses node distance to calculate this unique node
> > distances. On Power Servers, node distances for offline nodes is not
> > available. However, Power Servers already knows unique possible node
> > distances. Fake the offline node's distance_lookup_table entries so
> > that all possible node distances are updated.
> > 
> > For example distance info from numactl from a fully populated 8 node
> > system at boot may look like this.
> > 
> > node distances:
> > node   0   1   2   3   4   5   6   7
> >   0:  10  20  40  40  40  40  40  40
> >   1:  20  10  40  40  40  40  40  40
> >   2:  40  40  10  20  40  40  40  40
> >   3:  40  40  20  10  40  40  40  40
> >   4:  40  40  40  40  10  20  40  40
> >   5:  40  40  40  40  20  10  40  40
> >   6:  40  40  40  40  40  40  10  20
> >   7:  40  40  40  40  40  40  20  10
> > 
> > However the same system when only two nodes are online at boot, then
> > distance info from numactl will look like
> > node distances:
> > node   0   1
> >   0:  10  20
> >   1:  20  10
> > 
> > With the faked numa distance at boot, the node distance table will look
> > like
> > node   0   1   2
> >   0:  10  20  40
> >   1:  20  10  40
> >   2:  40  40  10
> > 
> > The actual distance will be populated once the nodes are onlined.
> 
> How did you want all this merged? I picked up Valentin's patch, do you
> want me to pick up these PowerPC patches in the same tree, or do you
> want to route them seperately?

While both (the patch you accepted and this series) together help solve the
problem, I think there is no hard dependency between the two. Hence I would
think it should be okay to go through the powerpc tree.


-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline
  2021-08-23  9:34   ` Srikar Dronamraju
@ 2021-08-23  9:37     ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2021-08-23  9:37 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Geetika Moolchandani, Valentin Schneider, Laurent Dufour,
	linuxppc-dev, Ingo Molnar

On Mon, Aug 23, 2021 at 03:04:37PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2021-08-23 10:33:30]:
> 
> > On Sat, Aug 21, 2021 at 03:55:32PM +0530, Srikar Dronamraju wrote:
> > > Scheduler expects unique number of node distances to be available
> > > at boot. It uses node distance to calculate this unique node
> > > distances. On Power Servers, node distances for offline nodes is not
> > > available. However, Power Servers already knows unique possible node
> > > distances. Fake the offline node's distance_lookup_table entries so
> > > that all possible node distances are updated.
> > > 
> > > For example distance info from numactl from a fully populated 8 node
> > > system at boot may look like this.
> > > 
> > > node distances:
> > > node   0   1   2   3   4   5   6   7
> > >   0:  10  20  40  40  40  40  40  40
> > >   1:  20  10  40  40  40  40  40  40
> > >   2:  40  40  10  20  40  40  40  40
> > >   3:  40  40  20  10  40  40  40  40
> > >   4:  40  40  40  40  10  20  40  40
> > >   5:  40  40  40  40  20  10  40  40
> > >   6:  40  40  40  40  40  40  10  20
> > >   7:  40  40  40  40  40  40  20  10
> > > 
> > > However the same system when only two nodes are online at boot, then
> > > distance info from numactl will look like
> > > node distances:
> > > node   0   1
> > >   0:  10  20
> > >   1:  20  10
> > > 
> > > With the faked numa distance at boot, the node distance table will look
> > > like
> > > node   0   1   2
> > >   0:  10  20  40
> > >   1:  20  10  40
> > >   2:  40  40  10
> > > 
> > > The actual distance will be populated once the nodes are onlined.
> > 
> > How did you want all this merged? I picked up Valentin's patch, do you
> > want me to pick up these PowerPC patches in the same tree, or do you
> > want to route them seperately?
> 
> While both (the patch you accepted and this series) together help solve the
> problem, I think there is no hard dependency between the two. Hence I would
> think it should be okay to go through the powerpc tree.
> 

OK, works for me, thanks!


* Re: [PATCH v2 1/3] powerpc/numa: Print debug statements only when required
  2021-08-23  9:21   ` Laurent Dufour
@ 2021-08-23  9:38     ` Srikar Dronamraju
  2021-08-25 13:01       ` Michael Ellerman
  0 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2021-08-23  9:38 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot, Peter Zijlstra,
	Geetika Moolchandani, Valentin Schneider, linuxppc-dev,
	Ingo Molnar

* Laurent Dufour <ldufour@linux.ibm.com> [2021-08-23 11:21:33]:

> On 21/08/2021 at 12:25, Srikar Dronamraju wrote:
> > Currently, a debug message gets printed every time an attempt to
> > add(remove) a CPU. However this is redundant if the CPU is already added
> > (removed) from the node.
> > 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> > Cc: Laurent Dufour <ldufour@linux.ibm.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >   arch/powerpc/mm/numa.c | 11 +++++------
> >   1 file changed, 5 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index f2bf98bdcea2..fbe03f6840e0 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -141,10 +141,11 @@ static void map_cpu_to_node(int cpu, int node)
> >   {
> >   	update_numa_cpu_lookup_table(cpu, node);
> > -	dbg("adding cpu %d to node %d\n", cpu, node);
> > -	if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
> > +	if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node]))) {
> > +		dbg("adding cpu %d to node %d\n", cpu, node);
> >   		cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
> > +	}
> >   }
> >   #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
> > @@ -152,13 +153,11 @@ static void unmap_cpu_from_node(unsigned long cpu)
> >   {
> >   	int node = numa_cpu_lookup_table[cpu];
> > -	dbg("removing cpu %lu from node %d\n", cpu, node);
> > -
> >   	if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
> >   		cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
> > +		dbg("removing cpu %lu from node %d\n", cpu, node);
> >   	} else {
> > -		printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
> > -		       cpu, node);
> > +		pr_err("WARNING: cpu %lu not found in node %d\n", cpu, node);
> 
> Would pr_warn() be more appropriate here (or removing the "WARNING" statement)?

It's a fair point.

Michael,

Do you want me to resend this patch with s/pr_err/pr_warn for the above
line?

> 
> >   	}
> >   }
> >   #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
> > 
> 

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 1/3] powerpc/numa: Print debug statements only when required
  2021-08-23  9:38     ` Srikar Dronamraju
@ 2021-08-25 13:01       ` Michael Ellerman
  2021-08-26  4:47         ` Srikar Dronamraju
  0 siblings, 1 reply; 16+ messages in thread
From: Michael Ellerman @ 2021-08-25 13:01 UTC (permalink / raw)
  To: Srikar Dronamraju, Laurent Dufour
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot, Peter Zijlstra,
	Geetika Moolchandani, Valentin Schneider, linuxppc-dev,
	Ingo Molnar

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Laurent Dufour <ldufour@linux.ibm.com> [2021-08-23 11:21:33]:
>> Le 21/08/2021 à 12:25, Srikar Dronamraju a écrit :
>> > Currently, a debug message gets printed every time an attempt to
>> > add(remove) a CPU. However this is redundant if the CPU is already added
>> > (removed) from the node.
>> > 
>> > Cc: linuxppc-dev@lists.ozlabs.org
>> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
>> > Cc: Michael Ellerman <mpe@ellerman.id.au>
>> > Cc: Ingo Molnar <mingo@kernel.org>
>> > Cc: Peter Zijlstra <peterz@infradead.org>
>> > Cc: Valentin Schneider <valentin.schneider@arm.com>
>> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
>> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> > Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
>> > Cc: Laurent Dufour <ldufour@linux.ibm.com>
>> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>> > ---
>> >   arch/powerpc/mm/numa.c | 11 +++++------
>> >   1 file changed, 5 insertions(+), 6 deletions(-)
>> > 
>> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> > index f2bf98bdcea2..fbe03f6840e0 100644
>> > --- a/arch/powerpc/mm/numa.c
>> > +++ b/arch/powerpc/mm/numa.c
>> > @@ -141,10 +141,11 @@ static void map_cpu_to_node(int cpu, int node)
>> >   {
>> >   	update_numa_cpu_lookup_table(cpu, node);
>> > -	dbg("adding cpu %d to node %d\n", cpu, node);
>> > -	if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
>> > +	if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node]))) {
>> > +		dbg("adding cpu %d to node %d\n", cpu, node);
>> >   		cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
>> > +	}
>> >   }
>> >   #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
>> > @@ -152,13 +153,11 @@ static void unmap_cpu_from_node(unsigned long cpu)
>> >   {
>> >   	int node = numa_cpu_lookup_table[cpu];
>> > -	dbg("removing cpu %lu from node %d\n", cpu, node);
>> > -
>> >   	if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
>> >   		cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
>> > +		dbg("removing cpu %lu from node %d\n", cpu, node);
>> >   	} else {
>> > -		printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
>> > -		       cpu, node);
>> > +		pr_err("WARNING: cpu %lu not found in node %d\n", cpu, node);
>> 
>> Would pr_warn() be more appropriate here (or removing the "WARNING" statement)?
>
> It's a fair point.
>
> Michael,
>
> Do you want me to resend this patch with s/pr_err/pr_warn for the above
> line?

I think what I'd prefer is if we stopped using this custom dbg() stuff
in numa.c, and cleaned up all the messages to use pr_xxx().

Those debug statements only appear if you boot with numa=debug, which is
not documented anywhere and I had completely forgotten existed TBH.

These days there's CONFIG_DYNAMIC_DEBUG for turning on/off messages,
which is much more flexible.

So can we drop the numa=debug bits, and convert all the dbg()s to
pr_debug().

And then do a pass converting all the printk("NUMA: ") to pr_xxx() which
will get "numa:" from pr_fmt().

cheers


* Re: [PATCH v2 1/3] powerpc/numa: Print debug statements only when required
  2021-08-25 13:01       ` Michael Ellerman
@ 2021-08-26  4:47         ` Srikar Dronamraju
  0 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2021-08-26  4:47 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot, Peter Zijlstra,
	Geetika Moolchandani, Valentin Schneider, Laurent Dufour,
	linuxppc-dev, Ingo Molnar

* Michael Ellerman <mpe@ellerman.id.au> [2021-08-25 23:01:42]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > * Laurent Dufour <ldufour@linux.ibm.com> [2021-08-23 11:21:33]:
> >> Le 21/08/2021 à 12:25, Srikar Dronamraju a écrit :
> >> > Currently, a debug message gets printed every time an attempt to
> >> > add(remove) a CPU. However this is redundant if the CPU is already added
> >> > (removed) from the node.
> >> > 
> >
> > It's a fair point.
> >
> > Michael,
> >
> > Do you want me to resend this patch with s/pr_err/pr_warn for the above
> > line?
> 
> I think what I'd prefer is if we stopped using this custom dbg() stuff
> in numa.c, and cleaned up all the messages to use pr_xxx().
> 
> Those debug statements only appear if you boot with numa=debug, which is
> not documented anywhere and I had completely forgotten existed TBH.
> 
> These days there's CONFIG_DYNAMIC_DEBUG for turning on/off messages,
> which is much more flexible.
> 
> So can we drop the numa=debug bits, and convert all the dbg()s to
> pr_debug().
> 
> And then do a pass converting all the printk("NUMA: ") to pr_xxx() which
> will get "numa:" from pr_fmt().
> 
> cheers

Okay, will do the needful.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
  2021-08-21 10:25 ` [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes Srikar Dronamraju
@ 2021-08-26 13:36   ` Michael Ellerman
  2021-09-01 10:22     ` Srikar Dronamraju
  0 siblings, 1 reply; 16+ messages in thread
From: Michael Ellerman @ 2021-08-26 13:36 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Geetika Moolchandani,
	Ingo Molnar, Laurent Dufour, linuxppc-dev, Valentin Schneider,
	kernel test robot

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> Scheduler expects unique number of node distances to be available at
> boot.

I think it needs "the number of unique node distances" ?

> It uses node distance to calculate this unique node distances.

It iterates over all pairs of nodes and records node_distance() for that
pair, and then calculates the set of unique distances.

> On POWER, node distances for offline nodes is not available. However,
> POWER already knows unique possible node distances.

I think it would be more accurate to say PAPR rather than POWER there.
It's PAPR that defines the way we determine distances and imposes that
limitation.

> Fake the offline node's distance_lookup_table entries so that all
> possible node distances are updated.

Does this work if we have a single node offline at boot?

Say we start with:

node distances:
node   0   1
  0:  10  20
  1:  20  10

And node 2 is offline at boot. We can only initialise that nodes entries
in the distance_lookup_table:

		while (i--)
			distance_lookup_table[node][i] = node;

By filling them all with 2 that causes node_distance(2, X) to return the
maximum distance for all other nodes X, because we won't break out of
the loop in __node_distance():

	for (i = 0; i < distance_ref_points_depth; i++) {
		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
			break;

		/* Double the distance for each NUMA level */
		distance *= 2;
	}

If distance_ref_points_depth was 4 we'd return 160.

That'd leave us with 3 unique distances at boot, 10, 20, 160.

But when node 2 comes online it might introduce more than 1 new distance
value, eg. it could be that the actual distances are:

node distances:
node   0   1   2
  0:  10  20  40
  1:  20  10  80
  2:  40  80  10

ie. we now have 4 distances, 10, 20, 40, 80.

What am I missing?

> However this only needs to be done if the number of unique node
> distances that can be computed for online nodes is less than the
> number of possible unique node distances as represented by
> distance_ref_points_depth.

Looking at a few machines they all have distance_ref_points_depth = 2.

So maybe that explains it, in practice we only see 10, 20, 40.


> When the node is actually onlined, distance_lookup_table will be
> updated with actual entries.

> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> Cc: Laurent Dufour <ldufour@linux.ibm.com>
> Cc: kernel test robot <lkp@intel.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/numa.c | 70 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
>
> Changelog:
> v1: https://lore.kernel.org/linuxppc-dev/20210701041552.112072-3-srikar@linux.vnet.ibm.com/t/#u
> [ Fixed a missing prototype warning Reported-by: kernel test robot <lkp@intel.com>]
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 3c124928a16d..0ee79a08c9e1 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -856,6 +856,75 @@ void __init dump_numa_cpu_topology(void)
>  	}
>  }
>  
> +/*
> + * Scheduler expects unique number of node distances to be available at
> + * boot. It uses node distance to calculate this unique node distances. On
> + * POWER, node distances for offline nodes is not available. However, POWER
> + * already knows unique possible node distances. Fake the offline node's
> + * distance_lookup_table entries so that all possible node distances are
> + * updated.
> + */

> +static void __init fake_update_distance_lookup_table(void)
> +{
> +	unsigned long distance_map;
> +	int i, nr_levels, nr_depth, node;

Are they distances, depths, or levels? :)

Bit more consistency in the variable names might make the code easier to
follow.

> +
> +	if (!numa_enabled)
> +		return;
> +
> +	if (!form1_affinity)
> +		return;

That doesn't exist since Aneesh's FORM2 series, so that will need a
rebase, and possibly some more rework to interact with that series.

> +	/*
> +	 * distance_ref_points_depth lists the unique numa domains
> +	 * available. However it ignore LOCAL_DISTANCE. So add +1
> +	 * to get the actual number of unique distances.
> +	 */
> +	nr_depth = distance_ref_points_depth + 1;

num_depths would be a better name IMHO.

> +
> +	WARN_ON(nr_depth > sizeof(distance_map));

Warn but then continue, and corrupt something on the stack? Seems like a
bad idea :)

I guess it's too early to use bitmap_alloc(). But can we at least return
if nr_depth is too big.

> +
> +	bitmap_zero(&distance_map, nr_depth);
> +	bitmap_set(&distance_map, 0, 1);
> +
> +	for_each_online_node(node) {
> +		int nd, distance = LOCAL_DISTANCE;
> +
> +		if (node == first_online_node)
> +			continue;
> +
> +		nd = __node_distance(node, first_online_node);
> +		for (i = 0; i < nr_depth; i++, distance *= 2) {

		for (i = 0, distance = LOCAL_DISTANCE; i < nr_depth; i++, distance *= 2) {

Could make it clearer what the for loop is doing I think.

> +			if (distance == nd) {
> +				bitmap_set(&distance_map, i, 1);
> +				break;
> +			}
> +		}
> +		nr_levels = bitmap_weight(&distance_map, nr_depth);
> +		if (nr_levels == nr_depth)
> +			return;
> +	}
> +
> +	for_each_node(node) {
> +		if (node_online(node))
> +			continue;
> +
> +		i = find_first_zero_bit(&distance_map, nr_depth);
> +		if (i >= nr_depth || i == 0) {

Neither of those can happen can they?

We checked the bitmap weight in the previous for loop, or at the bottom
of this one, and returned if we'd filled the map already.

And we set bit zero explicitly with bitmap_set().

> +			pr_warn("Levels(%d) not matching levels(%d)", nr_levels, nr_depth);
> +			return;
> +		}
> +
> +		bitmap_set(&distance_map, i, 1);
> +		while (i--)
> +			distance_lookup_table[node][i] = node;

That leaves distance_lookup_table[node][i+1] and so on uninitialised, or
initialised to zero because it's static, is that OK?

> +		nr_levels = bitmap_weight(&distance_map, nr_depth);
> +		if (nr_levels == nr_depth)
> +			return;
> +	}
> +}
> +
>  /* Initialize NODE_DATA for a node on the local memory */
>  static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
>  {
> @@ -971,6 +1040,7 @@ void __init mem_topology_setup(void)
>  		 */
>  		numa_setup_cpu(cpu);
>  	}
> +	fake_update_distance_lookup_table();
>  }
>  
>  void __init initmem_init(void)


cheers


* Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
  2021-08-26 13:36   ` Michael Ellerman
@ 2021-09-01 10:22     ` Srikar Dronamraju
  2021-09-23 11:17       ` Michael Ellerman
  0 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2021-09-01 10:22 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	kernel test robot, Peter Zijlstra, Geetika Moolchandani,
	Valentin Schneider, Laurent Dufour, linuxppc-dev, Ingo Molnar

* Michael Ellerman <mpe@ellerman.id.au> [2021-08-26 23:36:53]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > Scheduler expects unique number of node distances to be available at
> > boot.
> 
> I think it needs "the number of unique node distances" ?
> 
> > It uses node distance to calculate this unique node distances.
> 
> It iterates over all pairs of nodes and records node_distance() for that
> pair, and then calculates the set of unique distances.
> 
> > On POWER, node distances for offline nodes is not available. However,
> > POWER already knows unique possible node distances.
> 
> I think it would be more accurate to say PAPR rather than POWER there.
> It's PAPR that defines the way we determine distances and imposes that
> limitation.
> 

Okay, will do all the necessary modifications as suggested above.

> > Fake the offline node's distance_lookup_table entries so that all
> > possible node distances are updated.
> 
> Does this work if we have a single node offline at boot?
> 

It should.

> Say we start with:
> 
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
> 
> And node 2 is offline at boot. We can only initialise that nodes entries
> in the distance_lookup_table:
> 
> 		while (i--)
> 			distance_lookup_table[node][i] = node;
> 
> By filling them all with 2 that causes node_distance(2, X) to return the
> maximum distance for all other nodes X, because we won't break out of
> the loop in __node_distance():
> 
> 	for (i = 0; i < distance_ref_points_depth; i++) {
> 		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
> 			break;
> 
> 		/* Double the distance for each NUMA level */
> 		distance *= 2;
> 	}
> 
> If distance_ref_points_depth was 4 we'd return 160.

As you already know, the distances 10, 20, ... are defined by powerpc Form1
affinity. PAPR doesn't define actual distances; it only provides us the
associativity. If distance_ref_points_depth is 4
(distance_ref_points_depth doesn't take local distance into consideration),
the distances are 10, 20, 40, 80, 160.

> 
> That'd leave us with 3 unique distances at boot, 10, 20, 160.
> 

So if there are unique distances, then as per the current code they have to
be 10, 20, 40, 80... I don't see a way in which we can have a break in the
series, like having 160 without 80.

> But when node 2 comes online it might introduce more than 1 new distance
> value, eg. it could be that the actual distances are:
> 
> node distances:
> node   0   1   2
>   0:  10  20  40
>   1:  20  10  80
>   2:  40  80  10
> 
> ie. we now have 4 distances, 10, 20, 40, 80.
> 
> What am I missing?
> 

As I said above, I am not sure how we can have a break in the series.
If distance_ref_points_depth is 3, the distances have to be 10, 20, 40, 80,
at least for Form1 affinity.

> > However this only needs to be done if the number of unique node
> > distances that can be computed for online nodes is less than the
> > number of possible unique node distances as represented by
> > distance_ref_points_depth.
> 
> Looking at a few machines they all have distance_ref_points_depth = 2.
> 
> So maybe that explains it, in practice we only see 10, 20, 40.
> 
> > When the node is actually onlined, distance_lookup_table will be
> > updated with actual entries.
> 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> > Cc: Laurent Dufour <ldufour@linux.ibm.com>
> > Cc: kernel test robot <lkp@intel.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/mm/numa.c | 70 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 70 insertions(+)
> >
> > Changelog:
> > v1: https://lore.kernel.org/linuxppc-dev/20210701041552.112072-3-srikar@linux.vnet.ibm.com/t/#u
> > [ Fixed a missing prototype warning Reported-by: kernel test robot <lkp@intel.com>]
> >
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 3c124928a16d..0ee79a08c9e1 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -856,6 +856,75 @@ void __init dump_numa_cpu_topology(void)
> >  	}
> >  }
> >  
> > +/*
> > + * Scheduler expects unique number of node distances to be available at
> > + * boot. It uses node distance to calculate this unique node distances. On
> > + * POWER, node distances for offline nodes is not available. However, POWER
> > + * already knows unique possible node distances. Fake the offline node's
> > + * distance_lookup_table entries so that all possible node distances are
> > + * updated.
> > + */
> 
> > +static void __init fake_update_distance_lookup_table(void)
> > +{
> > +	unsigned long distance_map;
> > +	int i, nr_levels, nr_depth, node;
> 
> Are they distances, depths, or levels? :)
> 
> Bit more consistency in the variable names might make the code easier to
> follow.
> 
> > +
> > +	if (!numa_enabled)
> > +		return;
> > +
> > +	if (!form1_affinity)
> > +		return;
> 
> That doesn't exist since Aneesh's FORM2 series, so that will need a
> rebase, and possibly some more rework to interact with that series.
> 

We only have to handle Form1, so it should be easier.

> > +	/*
> > +	 * distance_ref_points_depth lists the unique numa domains
> > +	 * available. However it ignore LOCAL_DISTANCE. So add +1
> > +	 * to get the actual number of unique distances.
> > +	 */
> > +	nr_depth = distance_ref_points_depth + 1;
> 
> num_depths would be a better name IMHO.

Okay,
s/nr_depth/num_depths/g
s/nr_level/depth/g

> 
> > +
> > +	WARN_ON(nr_depth > sizeof(distance_map));
> 
> Warn but then continue, and corrupt something on the stack? Seems like a
> bad idea :)
> 
> I guess it's too early to use bitmap_alloc(). But can we at least return
> if nr_depth is too big.

Yes, we can't use bitmap_alloc() here.
Now, should we continue if nr_depth is greater than sizeof(distance_map)?

If we don't, and return immediately, then we can end up not creating enough
scheduler domains, which may later lead to a build_sched_domains() oops when
we online nodes.

However, if we don't return, the chance of surviving when the nodes are
actually onlined is better.
We could probably reset nr_depth to be the same as sizeof(distance_map).

That said, I think we are too far away from nr_depth being anywhere close
to sizeof(long). So I am okay either way.

> 
> > +
> > +	bitmap_zero(&distance_map, nr_depth);
> > +	bitmap_set(&distance_map, 0, 1);
> > +
> > +	for_each_online_node(node) {
> > +		int nd, distance = LOCAL_DISTANCE;
> > +
> > +		if (node == first_online_node)
> > +			continue;
> > +
> > +		nd = __node_distance(node, first_online_node);
> > +		for (i = 0; i < nr_depth; i++, distance *= 2) {
> 
> 		for (i = 0, distance = LOCAL_DISTANCE; i < nr_depth; i++, distance *= 2) {
> 
> Could make it clearer what the for loop is doing I think.
> 
> > +			if (distance == nd) {
> > +				bitmap_set(&distance_map, i, 1);
> > +				break;
> > +			}
> > +		}
> > +		nr_levels = bitmap_weight(&distance_map, nr_depth);
> > +		if (nr_levels == nr_depth)
> > +			return;
> > +	}
> > +
> > +	for_each_node(node) {
> > +		if (node_online(node))
> > +			continue;
> > +
> > +		i = find_first_zero_bit(&distance_map, nr_depth);
> > +		if (i >= nr_depth || i == 0) {
> 
> Neither of those can happen can they?
> 
> We checked the bitmap weight in the previous for loop, or at the bottom
> of this one, and returned if we'd filled the map already.
> 
> And we set bit zero explicitly with bitmap_set().
> 

Agree, I can drop the hunk.

> > +			pr_warn("Levels(%d) not matching levels(%d)", nr_levels, nr_depth);
> > +			return;
> > +		}


> > +
> > +		bitmap_set(&distance_map, i, 1);
> > +		while (i--)
> > +			distance_lookup_table[node][i] = node;
> 
> That leaves distance_lookup_table[node][i+1] and so on uninitialised, or
> initialised to zero because it's static, is that OK?

Yes, this should be fine, because we are only interested in finding the
number of unique NUMA distances. By the time the actual distances come and
overwrite these entries, we will no longer be using the fake distances.

But if you are more comfortable with updating all the depths, I can do that
too.

> 
> > +		nr_levels = bitmap_weight(&distance_map, nr_depth);
> > +		if (nr_levels == nr_depth)
> > +			return;
> > +	}
> > +}
> > +
> >  /* Initialize NODE_DATA for a node on the local memory */
> >  static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
> >  {
> > @@ -971,6 +1040,7 @@ void __init mem_topology_setup(void)
> >  		 */
> >  		numa_setup_cpu(cpu);
> >  	}
> > +	fake_update_distance_lookup_table();
> >  }
> >  
> >  void __init initmem_init(void)
> 
> 
> cheers

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
  2021-09-01 10:22     ` Srikar Dronamraju
@ 2021-09-23 11:17       ` Michael Ellerman
  2021-09-23 17:57         ` Srikar Dronamraju
  0 siblings, 1 reply; 16+ messages in thread
From: Michael Ellerman @ 2021-09-23 11:17 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	kernel test robot, Peter Zijlstra, Geetika Moolchandani,
	Valentin Schneider, Laurent Dufour, linuxppc-dev, Ingo Molnar

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2021-08-26 23:36:53]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > Scheduler expects unique number of node distances to be available at
>> > boot.
...
>
>> > Fake the offline node's distance_lookup_table entries so that all
>> > possible node distances are updated.
>> 
>> Does this work if we have a single node offline at boot?
>> 
>
> It should.
>
>> Say we start with:
>> 
>> node distances:
>> node   0   1
>>   0:  10  20
>>   1:  20  10
>> 
>> And node 2 is offline at boot. We can only initialise that nodes entries
>> in the distance_lookup_table:
>> 
>> 		while (i--)
>> 			distance_lookup_table[node][i] = node;
>> 
>> By filling them all with 2 that causes node_distance(2, X) to return the
>> maximum distance for all other nodes X, because we won't break out of
>> the loop in __node_distance():
>> 
>> 	for (i = 0; i < distance_ref_points_depth; i++) {
>> 		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
>> 			break;
>> 
>> 		/* Double the distance for each NUMA level */
>> 		distance *= 2;
>> 	}
>> 
>> If distance_ref_points_depth was 4 we'd return 160.
>
> As you already know, distance 10, 20, .. are defined by Powerpc, form1
> affinity. PAPR doesn't define actual distances, it only provides us the
> associativity. If there are distance_ref_points_depth is 4,
> (distance_ref_points_depth doesn't take local distance into consideration)
> 10, 20, 40, 80, 160.
>
>> 
>> That'd leave us with 3 unique distances at boot, 10, 20, 160.
>> 
>
> So if there are unique distances, then the distances as per the current
> code has to be 10, 20, 40, 80.. I dont see a way in which we have a break in
> the series. like having 160 without 80.

I'm confused what you mean there.

If we have a node that's offline at boot then we get 160 for that node,
that's just the result of having no info for it, so we never break out
of the for loop.

So if we have two nodes, one hop apart, and then an offline node we get
10, 20, 160.

Or if you're using depth = 3 then it's 10, 20, 80.

>> But when node 2 comes online it might introduce more than 1 new distance
>> value, eg. it could be that the actual distances are:
>> 
>> node distances:
>> node   0   1   2
>>   0:  10  20  40
>>   1:  20  10  80
>>   2:  40  80  10
>> 
>> ie. we now have 4 distances, 10, 20, 40, 80.
>> 
>> What am I missing?
>
> As I said above, I am not sure how we can have a break in the series.
> If distance_ref_points_depth is 3, the distances has to be 10,20,40,80 as
> atleast for form1 affinity.

I agree for depth 3 we have to see 10, 20, 40, 80. But nothing
guarantees we see each value (other than 10).

We can have two nodes one hop apart, so we have 10 and 20, then a third
node is added 3 hops away, so we get 10, 20, 80.

The real problem is that the third node could be 3 hops from node 0
and 2 hops from node 1, and so the addition of the third node causes
two new distance values (40 & 80) to be required.

I think maybe what you're saying is that in practice we don't see setups
like that. But I don't know if I'm happy with a solution that doesn't
work in the general case, and relies on the particular properties of our
current set of systems.

Possibly we just need to detect that case and WARN about it. The only
problem is we won't know until the system is already up and running, ie.
we can't know at boot that the onlining of the third node will cause 2
new distance values to be added.

cheers


* Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
  2021-09-23 11:17       ` Michael Ellerman
@ 2021-09-23 17:57         ` Srikar Dronamraju
  2021-10-11 11:45           ` Michael Ellerman
  0 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2021-09-23 17:57 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	kernel test robot, Peter Zijlstra, Geetika Moolchandani,
	Valentin Schneider, Laurent Dufour, linuxppc-dev, Ingo Molnar

* Michael Ellerman <mpe@ellerman.id.au> [2021-09-23 21:17:25]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > * Michael Ellerman <mpe@ellerman.id.au> [2021-08-26 23:36:53]:
> >
> >> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> >> > Scheduler expects unique number of node distances to be available at
> >> > boot.
> ...
> >
> >> > Fake the offline node's distance_lookup_table entries so that all
> >> > possible node distances are updated.
> >>
> >> Does this work if we have a single node offline at boot?
> >>
> >
> > It should.
> >
> >> Say we start with:
> >>
> >> node distances:
> >> node   0   1
> >>   0:  10  20
> >>   1:  20  10
> >>
> >> And node 2 is offline at boot. We can only initialise that nodes entries
> >> in the distance_lookup_table:
> >>
> >> 		while (i--)
> >> 			distance_lookup_table[node][i] = node;
> >>
> >> By filling them all with 2 that causes node_distance(2, X) to return the
> >> maximum distance for all other nodes X, because we won't break out of
> >> the loop in __node_distance():
> >>
> >> 	for (i = 0; i < distance_ref_points_depth; i++) {
> >> 		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
> >> 			break;
> >>
> >> 		/* Double the distance for each NUMA level */
> >> 		distance *= 2;
> >> 	}
> >>
> >> If distance_ref_points_depth was 4 we'd return 160.
> >
> > As you already know, distance 10, 20, .. are defined by Powerpc, form1
> > affinity. PAPR doesn't define actual distances, it only provides us the
> > associativity. If there are distance_ref_points_depth is 4,
> > (distance_ref_points_depth doesn't take local distance into consideration)
> > 10, 20, 40, 80, 160.
> >
> >>
> >> That'd leave us with 3 unique distances at boot, 10, 20, 160.
> >>
> >
> > So if there are unique distances, then the distances as per the current
> > code has to be 10, 20, 40, 80.. I dont see a way in which we have a break in
> > the series. like having 160 without 80.
>
> I'm confused what you mean there.
>

At the outset, if there is a better possible solution, do let me know; I am
willing to try that too.

> If we have a node that's offline at boot then we get 160 for that node,
> that's just the result of having no info for it, so we never break out
> of the for loop.
>
> So if we have two nodes, one hop apart, and then an offline node we get
> 10, 20, 160.
>
> Or if you're using depth = 3 then it's 10, 20, 80.
>

My understanding is as below:

device-tree provides the max hops by way of
ibm,associativity-reference-points. This is mapped to
distance_ref_points_depth in Linux-powerpc.

Now Linux-powerpc encodes the hop distances (disregarding local distance) as
20, 40, 80, 160, 320, ...
So if distance_ref_points_depth is 3, then the hop distances are 20, 40, 80.

Do you disagree?


> >> But when node 2 comes online it might introduce more than 1 new distance
> >> value, eg. it could be that the actual distances are:
> >>
> >> node distances:
> >> node   0   1   2
> >>   0:  10  20  40
> >>   1:  20  10  80
> >>   2:  40  80  10
> >>
> >> ie. we now have 4 distances, 10, 20, 40, 80.
> >>
> >> What am I missing?
> >
> > As I said above, I am not sure how we can have a break in the series.
> > If distance_ref_points_depth is 3, the distances has to be 10,20,40,80 as
> > atleast for form1 affinity.
>
> I agree for depth 3 we have to see 10, 20, 40, 80. But nothing
> guarantees we see each value (other than 10).

The hop distances are not from the device tree; the device tree only gives
us the maximum number of hops possible. Linux-powerpc actually hard-codes
the distances, with each hop distance being 2x the previous.
So we may not see any nodes at a particular hop, but we know the maximum
number of hops. And if distance_ref_points_depth is 3, then the hop
distances are 20, 40, 80 only.

>
> We can have two nodes one hop apart, so we have 10 and 20, then a third
> node is added 3 hops away, so we get 10, 20, 80.
>

> The real problem is that the third node could be 3 hops from node 0
> and 2 hops from node 1, and so the addition of the third node causes
> two new distance values (40 & 80) to be required.

So here the maximum number of hops as given by the device-tree is 3, so we
know that we are looking for a maximum distance of 80 by way of
distance_ref_points_depth.

Even if the 3rd node were at 4 hops, we would already know the maximum
distance of 160, by way of distance_ref_points_depth. However, in the
unlikely scenario where the number of possible nodes is less than
distance_ref_points_depth (aka the maximum number of hops) and there are
CPU-less/memoryless nodes, we may not have initialized all the right
distances.

>
> I think maybe what you're saying is that in practice we don't see setups
> like that. But I don't know if I'm happy with a solution that doesn't
> work in the general case, and relies on the particular properties of our
> current set of systems.
>

But our current set of systems is having a problem (systems can likely
crash on adding a CPU to a node). The only other way I can think of is the
previous approach, where we add a scheduler hook which tells us how many
unique node distances are possible. But that was struck down because we
didn't want to add a hook just for one arch.

However, isn't this much better than the current situation we are in?
i.e. this is not going to cause any regression for the other setups.

> Possibly we just need to detect that case and WARN about it. The only
> problem is we won't know until the system is already up and running, ie.
> we can't know at boot that the onlining of the third node will cause 2
> new distance values to be added.
>

Yes, we should be able to detect this very easily.
At the end of the function (v2 or v3), if cur_depth != max_depth then we
haven't initialized all possible node distances.

> cheers

--
Thanks and Regards
Srikar Dronamraju

* Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
  2021-09-23 17:57         ` Srikar Dronamraju
@ 2021-10-11 11:45           ` Michael Ellerman
  0 siblings, 0 replies; 16+ messages in thread
From: Michael Ellerman @ 2021-10-11 11:45 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	kernel test robot, Peter Zijlstra, Geetika Moolchandani,
	Valentin Schneider, Laurent Dufour, linuxppc-dev, Ingo Molnar

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2021-09-23 21:17:25]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > * Michael Ellerman <mpe@ellerman.id.au> [2021-08-26 23:36:53]:
>> >
>> >> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> >> > Scheduler expects unique number of node distances to be available at
>> >> > boot.
>> ...
>> >
>> >> > Fake the offline node's distance_lookup_table entries so that all
>> >> > possible node distances are updated.
>> >>
>> >> Does this work if we have a single node offline at boot?
>> >>
>> >
>> > It should.
>> >
>> >> Say we start with:
>> >>
>> >> node distances:
>> >> node   0   1
>> >>   0:  10  20
>> >>   1:  20  10
>> >>
>> >> And node 2 is offline at boot. We can only initialise that nodes entries
>> >> in the distance_lookup_table:
>> >>
>> >> 		while (i--)
>> >> 			distance_lookup_table[node][i] = node;
>> >>
>> >> By filling them all with 2 that causes node_distance(2, X) to return the
>> >> maximum distance for all other nodes X, because we won't break out of
>> >> the loop in __node_distance():
>> >>
>> >> 	for (i = 0; i < distance_ref_points_depth; i++) {
>> >> 		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
>> >> 			break;
>> >>
>> >> 		/* Double the distance for each NUMA level */
>> >> 		distance *= 2;
>> >> 	}
>> >>
>> >> If distance_ref_points_depth was 4 we'd return 160.
>> >
>> > As you already know, distance 10, 20, .. are defined by Powerpc, form1
>> > affinity. PAPR doesn't define actual distances, it only provides us the
>> > associativity. If distance_ref_points_depth is 4
>> > (distance_ref_points_depth doesn't take local distance into consideration),
>> > the distances are 10, 20, 40, 80, 160.
>> >
>> >>
>> >> That'd leave us with 3 unique distances at boot, 10, 20, 160.
>> >>
>> >
>> > So if there are unique distances, then the distances as per the current
>> > code have to be 10, 20, 40, 80. I don't see a way in which we can have a
>> > break in the series, like having 160 without 80.
>>
>> I'm confused what you mean there.
>
> At the outset, if we have a better possible solution, do let me know; I am
> willing to try that too.

I don't have one in mind, no; I'm just trying to satisfy myself that this
solution will work in all cases we're likely to encounter.

>> If we have a node that's offline at boot then we get 160 for that node,
>> that's just the result of having no info for it, so we never break out
>> of the for loop.
>>
>> So if we have two nodes, one hop apart, and then an offline node we get
>> 10, 20, 160.
>>
>> Or if you're using depth = 3 then it's 10, 20, 80.
>
> My understanding is as below:
>
> device-tree provides the max hops by way of
> ibm,associativity-reference-points. This is mapped to
> distance_ref_points_depth in Linux-powerpc.
>
> Now Linux-powerpc encodes hop distances as (disregarding the local distance)
> 20, 40, 80, 160, 320 ...
> So if distance_ref_points_depth is 3, then the hop distances are 20, 40, 80.
>
> Do you disagree?

I'm not sure. You didn't really address my point.

You said that we can't have 160 without 80 (for depth = 4).

I gave an example where we could see a gap in the used distance values,
ie. 10, 20, 80 for a depth of 3.

Which is not to say that distance 40 doesn't exist in that scenario,
rather that it's not used by any node.


>> >> But when node 2 comes online it might introduce more than 1 new distance
>> >> value, eg. it could be that the actual distances are:
>> >>
>> >> node distances:
>> >> node   0   1   2
>> >>   0:  10  20  40
>> >>   1:  20  10  80
>> >>   2:  40  80  10
>> >>
>> >> ie. we now have 4 distances, 10, 20, 40, 80.
>> >>
>> >> What am I missing?
>> >
>> > As I said above, I am not sure how we can have a break in the series.
>> > If distance_ref_points_depth is 3, the distances have to be 10, 20, 40, 80,
>> > at least for form1 affinity.
>>
>> I agree for depth 3 we have to see 10, 20, 40, 80. But nothing
>> guarantees we see each value (other than 10).
>
> The hop distances are not from the device-tree; the device-tree only gives
> us the maximum number of hops possible. Linux-powerpc is actually
> hard-coding the distances, with each hop distance being 2x the previous.

Yes. I guess I was sloppy to say "see each value", I didn't mean we see
those values directly in the device-tree.

> So we may not see any nodes at a particular hop, but we know the maximum
> number of hops. And if distance_ref_points_depth is 3, then the hop
> distances are 20, 40, 80 only.

OK, so we agree that "we may not see any nodes at a particular hop".

Which is what I was trying to say above.

>> We can have two nodes one hop apart, so we have 10 and 20, then a third
>> node is added 3 hops away, so we get 10, 20, 80.
>>
>
>> The real problem is that the third node could be 3 hops from node 0
>> and 2 hops from node 1, and so the addition of the third node causes
>> two new distance values (40 & 80) to be required.
>
> So here the maximum number of hops as given by the device-tree is 3, so we
> know that we are looking for a maximum distance of 80 by way of
> distance_ref_points_depth.
>
> Even if the 3rd node were at 4 hops, we would already know the maximum
> distance of 160, by way of distance_ref_points_depth.

I agree we know what the max value is, and therefore the total number of
possible distance values.

But I think there are topologies where we cannot represent all the
possible distances in the distance table.

> However, in the unlikely scenario where the number of possible nodes is
> less than distance_ref_points_depth (aka the maximum number of hops) and
> there are CPU-less/memoryless nodes, we may not have initialized all the
> right distances.

OK, so I think you're saying you agree that there are situations where
we might not be able to represent all the distances.

But you say that's an "unlikely scenario", why is it unlikely?

If you can convince me it's 100% unlikely then maybe we can forget about
it :)

>> I think maybe what you're saying is that in practice we don't see setups
>> like that. But I don't know if I'm happy with a solution that doesn't
>> work in the general case, and relies on the particular properties of our
>> current set of systems.
>
> But our current set of systems is having a problem (systems can likely
> crash on adding a CPU to a node).

Yes I agree that's bad, but also we don't want to merge a solution (and
presumably backport it everywhere) if it doesn't work for some cases.

> The only other way I can think of is the previous approach, where we add a
> scheduler hook which tells us how many unique node distances are possible.
> But that was struck down because we didn't want to add a hook just for one
> arch.

OK.

> However, isn't this much better than the current situation we are in?
> i.e. this is not going to cause any regression for the other setups.

Yes that's true. But equally if we can find a 100% solution that would
save us having to fix the same issue again in future.

>> Possibly we just need to detect that case and WARN about it. The only
>> problem is we won't know until the system is already up and running, ie.
>> we can't know at boot that the onlining of the third node will cause 2
>> new distance values to be added.
>
> Yes, we should be able to detect this very easily.
> At the end of the function (v2 or v3), if cur_depth != max_depth then we
> haven't initialized all possible node distances.

The only issue is what do we do when we detect that case?

We can't BUG/panic, because we don't know for sure that the offline
nodes will cause new distance values to be needed. So the system might
be completely fine, all we know is it might not be. If we print a scary
warning we'll end up with bugs filed for that.

cheers

Thread overview: 16+ messages
2021-08-21 10:25 [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline Srikar Dronamraju
2021-08-21 10:25 ` [PATCH v2 1/3] powerpc/numa: Print debug statements only when required Srikar Dronamraju
2021-08-23  9:21   ` Laurent Dufour
2021-08-23  9:38     ` Srikar Dronamraju
2021-08-25 13:01       ` Michael Ellerman
2021-08-26  4:47         ` Srikar Dronamraju
2021-08-21 10:25 ` [PATCH v2 2/3] powerpc/numa: Update cpu_cpu_map on CPU online/offline Srikar Dronamraju
2021-08-21 10:25 ` [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes Srikar Dronamraju
2021-08-26 13:36   ` Michael Ellerman
2021-09-01 10:22     ` Srikar Dronamraju
2021-09-23 11:17       ` Michael Ellerman
2021-09-23 17:57         ` Srikar Dronamraju
2021-10-11 11:45           ` Michael Ellerman
2021-08-23  8:33 ` [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline Peter Zijlstra
2021-08-23  9:34   ` Srikar Dronamraju
2021-08-23  9:37     ` Peter Zijlstra
