linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/2]  Skip numa distance for offline nodes
@ 2021-07-01  4:15 Srikar Dronamraju
  2021-07-01  4:15 ` [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes Srikar Dronamraju
  2021-07-01  4:15 ` [PATCH v2 2/2] powerpc/numa: Fill distance_lookup_table for offline nodes Srikar Dronamraju
  0 siblings, 2 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2021-07-01  4:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Michael Ellerman
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Gautham R Shenoy,
	Geetika Moolchandani, Laurent Dufour

Changelog v1->v2:
v1: http://lore.kernel.org/lkml/20210520154427.1041031-1-srikar@linux.vnet.ibm.com/t/#u
- Update the numa masks, whenever 1st CPU is added to cpuless node
- Populate all possible nodes distances in boot in a
powerpc specific function

Geetika reported yet another trace while doing a dlpar CPU add
operation. This was true even on top of a recent commit
6980d13f0dd1 ("powerpc/smp: Set numa node before updating mask") which fixed
a similar trace.

WARNING: CPU: 40 PID: 2954 at kernel/sched/topology.c:2088 build_sched_domains+0x6e8/0x1540
Modules linked in: nft_counter nft_compat rpadlpar_io rpaphp mptcp_diag
xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag
netlink_diag bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables
nfnetlink dm_multipath pseries_rng xts vmx_crypto binfmt_misc ip_tables xfs
libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror
dm_region_hash dm_log dm_mod fuse
CPU: 40 PID: 2954 Comm: kworker/40:0 Not tainted 5.13.0-rc1+ #19
Workqueue: events cpuset_hotplug_workfn
NIP:  c0000000001de588 LR: c0000000001de584 CTR: 00000000006cd36c
REGS: c00000002772b250 TRAP: 0700   Not tainted  (5.12.0-rc5-master+)
MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28828422  XER: 0000000d
CFAR: c00000000020c2f8 IRQMASK: 0 #012GPR00: c0000000001de584 c00000002772b4f0
c000000001f55400 0000000000000036 #012GPR04: c0000063c6368010 c0000063c63f0a00
0000000000000027 c0000063c6368018 #012GPR08: 0000000000000023 c0000063c636ef48
00000063c4de0000 c0000063bfe9ffe8 #012GPR12: 0000000028828424 c0000063fe68fe80
0000000000000000 0000000000000417 #012GPR16: 0000000000000028 c00000000740dcd8
c00000000205db68 c000000001a3a4a0 #012GPR20: c000000091ed7d20 c000000091ed8520
0000000000000001 0000000000000000 #012GPR24: c0000000113a9600 0000000000000190
0000000000000028 c0000000010e3ac0 #012GPR28: 0000000000000000 c00000000740dd00
c0000000317b5900 0000000000000190
NIP [c0000000001de588] build_sched_domains+0x6e8/0x1540
LR [c0000000001de584] build_sched_domains+0x6e4/0x1540
Call Trace:
[c00000002772b4f0] [c0000000001de584] build_sched_domains+0x6e4/0x1540 (unreliable)
[c00000002772b640] [c0000000001e08dc] partition_sched_domains_locked+0x3ec/0x530
[c00000002772b6e0] [c0000000002a2144] rebuild_sched_domains_locked+0x524/0xbf0
[c00000002772b7e0] [c0000000002a5620] rebuild_sched_domains+0x40/0x70
[c00000002772b810] [c0000000002a58e4] cpuset_hotplug_workfn+0x294/0xe20
[c00000002772bc30] [c000000000187510] process_one_work+0x300/0x670
[c00000002772bd10] [c0000000001878f8] worker_thread+0x78/0x520
[c00000002772bda0] [c0000000001937f0] kthread+0x1a0/0x1b0
[c00000002772be10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
7ee5bb78 7f0ac378 7f29cb78 7f68db78 7f46d378 7f84e378 f8610068 3c62ff19
fbe10060 3863e558 4802dd31 60000000 <0fe00000> 3920fff4 f9210080 e86100b0

Detailed analysis of the failing scenario showed that the span in
question belongs to NODE domain and further the cpumasks for some
cpus in NODE overlapped. There are two possible reasons how we ended
up here:

(1) The numa node was offline or blank with no CPUs or memory. Hence
the sched_max_numa_distance could not be set correctly, or the
sched_domains_numa_distance happened to be partially populated.

(2) Depending on a bogus node_distance of an offline node to populate
cpumasks is the issue.  On POWER platform the node_distance is
correctly available only for an online node which has some CPU or
memory resource associated with it.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However the same system when only two nodes are online at boot, then the
numa topology will look like
node distances:
node   0   1
  0:  10  20
  1:  20  10

This series tries to fix both these problems.
Note: These problems are now visible, thanks to
Commit ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't
(partially) overlap")

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>

Srikar Dronamraju (2):
  sched/topology: Skip updating masks for non-online nodes
  powerpc/numa: Fill distance_lookup_table for offline nodes

 arch/powerpc/mm/numa.c  | 70 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/topology.c | 25 +++++++++++++--
 2 files changed, 93 insertions(+), 2 deletions(-)


base-commit: 031e3bd8986fffe31e1ddbf5264cccfe30c9abd7
-- 
2.27.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
  2021-07-01  4:15 [PATCH v2 0/2] Skip numa distance for offline nodes Srikar Dronamraju
@ 2021-07-01  4:15 ` Srikar Dronamraju
  2021-07-01 14:28   ` Valentin Schneider
  2021-07-01  4:15 ` [PATCH v2 2/2] powerpc/numa: Fill distance_lookup_table for offline nodes Srikar Dronamraju
  1 sibling, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2021-07-01  4:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Michael Ellerman
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Gautham R Shenoy,
	Geetika Moolchandani, Laurent Dufour

Currently scheduler doesn't check if node is online before adding CPUs
to the node mask. However on some architectures, node distance is only
available for nodes that are online. Its not sure how much to rely on
the node distance, when one of the nodes is offline.

If said node distance is fake (since one of the nodes is offline) and
the actual node distance is different, then the cpumask of such nodes
when the nodes become becomes online will be wrong.

This can cause topology_span_sane to throw up a warning message and the
rest of the topology being not updated properly.

Resolve this by skipping update of cpumask for nodes that are not
online.

However by skipping, relevant CPUs may not be set when nodes are
onlined. i.e when coming up with NUMA masks at a certain NUMA distance,
CPUs that are part of other nodes, which are already online will not be
part of the NUMA mask. Hence the first time, a CPU is added to the newly
onlined node, add the other CPUs to the numa_mask.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1->v2:
v1 link: http://lore.kernel.org/lkml/20210520154427.1041031-4-srikar@linux.vnet.ibm.com/t/#u
Update the NUMA masks, whenever 1st CPU is added to cpuless node

 kernel/sched/topology.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..f25dbcab4fd2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1833,6 +1833,9 @@ void sched_init_numa(void)
 			sched_domains_numa_masks[i][j] = mask;
 
 			for_each_node(k) {
+				if (!node_online(j))
+					continue;
+
 				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");
 
@@ -1891,12 +1894,30 @@ void sched_init_numa(void)
 void sched_domains_numa_masks_set(unsigned int cpu)
 {
 	int node = cpu_to_node(cpu);
-	int i, j;
+	int i, j, empty;
 
+	empty = cpumask_empty(sched_domains_numa_masks[0][node]);
 	for (i = 0; i < sched_domains_numa_levels; i++) {
 		for (j = 0; j < nr_node_ids; j++) {
-			if (node_distance(j, node) <= sched_domains_numa_distance[i])
+			if (!node_online(j))
+				continue;
+
+			if (node_distance(j, node) <= sched_domains_numa_distance[i]) {
 				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
+
+				/*
+				 * We skip updating numa_masks for offline
+				 * nodes. However now that the node is
+				 * finally online, CPUs that were added
+				 * earlier, should now be accommodated into
+				 * newly oneline node's numa mask.
+				 */
+				if (node != j && empty) {
+					cpumask_or(sched_domains_numa_masks[i][node],
+							sched_domains_numa_masks[i][node],
+							sched_domains_numa_masks[0][j]);
+				}
+			}
 		}
 	}
 }
-- 
2.27.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 2/2] powerpc/numa: Fill distance_lookup_table for offline nodes
  2021-07-01  4:15 [PATCH v2 0/2] Skip numa distance for offline nodes Srikar Dronamraju
  2021-07-01  4:15 ` [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes Srikar Dronamraju
@ 2021-07-01  4:15 ` Srikar Dronamraju
  2021-07-01  9:36   ` kernel test robot
  2021-07-01 10:20   ` kernel test robot
  1 sibling, 2 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2021-07-01  4:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Michael Ellerman
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Gautham R Shenoy,
	Geetika Moolchandani, Laurent Dufour

Currently scheduler populates the distance map by looking at distance
of each node from all other nodes. This should work for most
architectures and platforms.

Scheduler expects unique number of node distances to be available at
boot. It uses node distance to calculate this unique node distances.
On Power Servers, node distances for offline nodes is not available.
However, Power Servers already knows unique possible node distances.
Fake the offline node's distance_lookup_table entries so that all
possible node distances are updated.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However the same system when only two nodes are online at boot, then
distance info from numactl will look like
node distances:
node   0   1
  0:  10  20
  1:  20  10

It may be implementation dependent on what node_distance(0,3) where
node 0 is online and node 3 is offline. In Power Servers case, it returns
LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
distance between nodes is 20. However that would not be true.

When Nodes are onlined and CPUs from those nodes are hotplugged,
the max node distance would be 40.

However this only needs to be done if the number of unique node
distances that can be computed for online nodes is less than the
number of possible unique node distances as represented by
distance_ref_points_depth. When the node is actually onlined,
distance_lookup_table will be updated with actual entries.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1->v2:
Move to a Powerpc specific solution as suggested by Peter and Valentin

 arch/powerpc/mm/numa.c | 70 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..6d0d89127190 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -860,6 +860,75 @@ void __init dump_numa_cpu_topology(void)
 	}
 }
 
+/*
+ * Scheduler expects unique number of node distances to be available at
+ * boot. It uses node distance to calculate this unique node distances. On
+ * POWER, node distances for offline nodes is not available. However, POWER
+ * already knows unique possible node distances. Fake the offline node's
+ * distance_lookup_table entries so that all possible node distances are
+ * updated.
+ */
+void __init fake_update_distance_lookup_table(void)
+{
+	unsigned long distance_map;
+	int i, nr_levels, nr_depth, node;
+
+	if (!numa_enabled)
+		return;
+
+	if (!form1_affinity)
+		return;
+
+	/*
+	 * distance_ref_points_depth lists the unique numa domains
+	 * available. However it ignore LOCAL_DISTANCE. So add +1
+	 * to get the actual number of unique distances.
+	 */
+	nr_depth = distance_ref_points_depth + 1;
+
+	WARN_ON(nr_depth > sizeof(distance_map));
+
+	bitmap_zero(&distance_map, nr_depth);
+	bitmap_set(&distance_map, 0, 1);
+
+	for_each_online_node(node) {
+		int nd, distance = LOCAL_DISTANCE;
+
+		if (node == first_online_node)
+			continue;
+
+		nd = __node_distance(node, first_online_node);
+		for (i = 0; i < nr_depth; i++, distance *= 2) {
+			if (distance == nd) {
+				bitmap_set(&distance_map, i, 1);
+				break;
+			}
+		}
+		nr_levels = bitmap_weight(&distance_map, nr_depth);
+		if (nr_levels == nr_depth)
+			return;
+	}
+
+	for_each_node(node) {
+		if (node_online(node))
+			continue;
+
+		i = find_first_zero_bit(&distance_map, nr_depth);
+		if (i >= nr_depth || i == 0) {
+			pr_warn("Levels(%d) not matching levels(%d)", nr_levels, nr_depth);
+			return;
+		}
+
+		bitmap_set(&distance_map, i, 1);
+		while (i--)
+			distance_lookup_table[node][i] = node;
+
+		nr_levels = bitmap_weight(&distance_map, nr_depth);
+		if (nr_levels == nr_depth)
+			return;
+	}
+}
+
 /* Initialize NODE_DATA for a node on the local memory */
 static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 {
@@ -975,6 +1044,7 @@ void __init mem_topology_setup(void)
 		 */
 		numa_setup_cpu(cpu);
 	}
+	fake_update_distance_lookup_table();
 }
 
 void __init initmem_init(void)
-- 
2.27.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/2] powerpc/numa: Fill distance_lookup_table for offline nodes
  2021-07-01  4:15 ` [PATCH v2 2/2] powerpc/numa: Fill distance_lookup_table for offline nodes Srikar Dronamraju
@ 2021-07-01  9:36   ` kernel test robot
  2021-07-01 10:20   ` kernel test robot
  1 sibling, 0 replies; 17+ messages in thread
From: kernel test robot @ 2021-07-01  9:36 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra, Michael Ellerman
  Cc: kbuild-all, LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot

[-- Attachment #1: Type: text/plain, Size: 3856 bytes --]

Hi Srikar,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on 031e3bd8986fffe31e1ddbf5264cccfe30c9abd7]

url:    https://github.com/0day-ci/linux/commits/Srikar-Dronamraju/Skip-numa-distance-for-offline-nodes/20210701-121809
base:   031e3bd8986fffe31e1ddbf5264cccfe30c9abd7
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/715881c25ce171cc9d097d4faeb0dce60bb3e71c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Srikar-Dronamraju/Skip-numa-distance-for-offline-nodes/20210701-121809
        git checkout 715881c25ce171cc9d097d4faeb0dce60bb3e71c
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> arch/powerpc/mm/numa.c:871:13: warning: no previous prototype for 'fake_update_distance_lookup_table' [-Wmissing-prototypes]
     871 | void __init fake_update_distance_lookup_table(void)
         |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/fake_update_distance_lookup_table +871 arch/powerpc/mm/numa.c

   862	
   863	/*
   864	 * Scheduler expects unique number of node distances to be available at
   865	 * boot. It uses node distance to calculate this unique node distances. On
   866	 * POWER, node distances for offline nodes is not available. However, POWER
   867	 * already knows unique possible node distances. Fake the offline node's
   868	 * distance_lookup_table entries so that all possible node distances are
   869	 * updated.
   870	 */
 > 871	void __init fake_update_distance_lookup_table(void)
   872	{
   873		unsigned long distance_map;
   874		int i, nr_levels, nr_depth, node;
   875	
   876		if (!numa_enabled)
   877			return;
   878	
   879		if (!form1_affinity)
   880			return;
   881	
   882		/*
   883		 * distance_ref_points_depth lists the unique numa domains
   884		 * available. However it ignore LOCAL_DISTANCE. So add +1
   885		 * to get the actual number of unique distances.
   886		 */
   887		nr_depth = distance_ref_points_depth + 1;
   888	
   889		WARN_ON(nr_depth > sizeof(distance_map));
   890	
   891		bitmap_zero(&distance_map, nr_depth);
   892		bitmap_set(&distance_map, 0, 1);
   893	
   894		for_each_online_node(node) {
   895			int nd, distance = LOCAL_DISTANCE;
   896	
   897			if (node == first_online_node)
   898				continue;
   899	
   900			nd = __node_distance(node, first_online_node);
   901			for (i = 0; i < nr_depth; i++, distance *= 2) {
   902				if (distance == nd) {
   903					bitmap_set(&distance_map, i, 1);
   904					break;
   905				}
   906			}
   907			nr_levels = bitmap_weight(&distance_map, nr_depth);
   908			if (nr_levels == nr_depth)
   909				return;
   910		}
   911	
   912		for_each_node(node) {
   913			if (node_online(node))
   914				continue;
   915	
   916			i = find_first_zero_bit(&distance_map, nr_depth);
   917			if (i >= nr_depth || i == 0) {
   918				pr_warn("Levels(%d) not matching levels(%d)", nr_levels, nr_depth);
   919				return;
   920			}
   921	
   922			bitmap_set(&distance_map, i, 1);
   923			while (i--)
   924				distance_lookup_table[node][i] = node;
   925	
   926			nr_levels = bitmap_weight(&distance_map, nr_depth);
   927			if (nr_levels == nr_depth)
   928				return;
   929		}
   930	}
   931	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 73418 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/2] powerpc/numa: Fill distance_lookup_table for offline nodes
  2021-07-01  4:15 ` [PATCH v2 2/2] powerpc/numa: Fill distance_lookup_table for offline nodes Srikar Dronamraju
  2021-07-01  9:36   ` kernel test robot
@ 2021-07-01 10:20   ` kernel test robot
  1 sibling, 0 replies; 17+ messages in thread
From: kernel test robot @ 2021-07-01 10:20 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra, Michael Ellerman
  Cc: kbuild-all, LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot

[-- Attachment #1: Type: text/plain, Size: 3915 bytes --]

Hi Srikar,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on 031e3bd8986fffe31e1ddbf5264cccfe30c9abd7]

url:    https://github.com/0day-ci/linux/commits/Srikar-Dronamraju/Skip-numa-distance-for-offline-nodes/20210701-121809
base:   031e3bd8986fffe31e1ddbf5264cccfe30c9abd7
config: powerpc64-randconfig-r001-20210630 (attached as .config)
compiler: powerpc64le-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/715881c25ce171cc9d097d4faeb0dce60bb3e71c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Srikar-Dronamraju/Skip-numa-distance-for-offline-nodes/20210701-121809
        git checkout 715881c25ce171cc9d097d4faeb0dce60bb3e71c
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> arch/powerpc/mm/numa.c:871:13: error: no previous prototype for 'fake_update_distance_lookup_table' [-Werror=missing-prototypes]
     871 | void __init fake_update_distance_lookup_table(void)
         |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: all warnings being treated as errors


vim +/fake_update_distance_lookup_table +871 arch/powerpc/mm/numa.c

   862	
   863	/*
   864	 * Scheduler expects unique number of node distances to be available at
   865	 * boot. It uses node distance to calculate this unique node distances. On
   866	 * POWER, node distances for offline nodes is not available. However, POWER
   867	 * already knows unique possible node distances. Fake the offline node's
   868	 * distance_lookup_table entries so that all possible node distances are
   869	 * updated.
   870	 */
 > 871	void __init fake_update_distance_lookup_table(void)
   872	{
   873		unsigned long distance_map;
   874		int i, nr_levels, nr_depth, node;
   875	
   876		if (!numa_enabled)
   877			return;
   878	
   879		if (!form1_affinity)
   880			return;
   881	
   882		/*
   883		 * distance_ref_points_depth lists the unique numa domains
   884		 * available. However it ignore LOCAL_DISTANCE. So add +1
   885		 * to get the actual number of unique distances.
   886		 */
   887		nr_depth = distance_ref_points_depth + 1;
   888	
   889		WARN_ON(nr_depth > sizeof(distance_map));
   890	
   891		bitmap_zero(&distance_map, nr_depth);
   892		bitmap_set(&distance_map, 0, 1);
   893	
   894		for_each_online_node(node) {
   895			int nd, distance = LOCAL_DISTANCE;
   896	
   897			if (node == first_online_node)
   898				continue;
   899	
   900			nd = __node_distance(node, first_online_node);
   901			for (i = 0; i < nr_depth; i++, distance *= 2) {
   902				if (distance == nd) {
   903					bitmap_set(&distance_map, i, 1);
   904					break;
   905				}
   906			}
   907			nr_levels = bitmap_weight(&distance_map, nr_depth);
   908			if (nr_levels == nr_depth)
   909				return;
   910		}
   911	
   912		for_each_node(node) {
   913			if (node_online(node))
   914				continue;
   915	
   916			i = find_first_zero_bit(&distance_map, nr_depth);
   917			if (i >= nr_depth || i == 0) {
   918				pr_warn("Levels(%d) not matching levels(%d)", nr_levels, nr_depth);
   919				return;
   920			}
   921	
   922			bitmap_set(&distance_map, i, 1);
   923			while (i--)
   924				distance_lookup_table[node][i] = node;
   925	
   926			nr_levels = bitmap_weight(&distance_map, nr_depth);
   927			if (nr_levels == nr_depth)
   928				return;
   929		}
   930	}
   931	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 32778 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
  2021-07-01  4:15 ` [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes Srikar Dronamraju
@ 2021-07-01 14:28   ` Valentin Schneider
  2021-07-12 12:48     ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2021-07-01 14:28 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra, Michael Ellerman
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev,
	Nathan Lynch, Gautham R Shenoy, Geetika Moolchandani,
	Laurent Dufour

On 01/07/21 09:45, Srikar Dronamraju wrote:
> @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
>  void sched_domains_numa_masks_set(unsigned int cpu)
>  {
>       int node = cpu_to_node(cpu);
> -	int i, j;
> +	int i, j, empty;
>
> +	empty = cpumask_empty(sched_domains_numa_masks[0][node]);
>       for (i = 0; i < sched_domains_numa_levels; i++) {
>               for (j = 0; j < nr_node_ids; j++) {
> -			if (node_distance(j, node) <= sched_domains_numa_distance[i])
> +			if (!node_online(j))
> +				continue;
> +
> +			if (node_distance(j, node) <= sched_domains_numa_distance[i]) {
>                               cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
> +
> +				/*
> +				 * We skip updating numa_masks for offline
> +				 * nodes. However now that the node is
> +				 * finally online, CPUs that were added
> +				 * earlier, should now be accommodated into
> +				 * newly oneline node's numa mask.
> +				 */
> +				if (node != j && empty) {
> +					cpumask_or(sched_domains_numa_masks[i][node],
> +							sched_domains_numa_masks[i][node],
> +							sched_domains_numa_masks[0][j]);
> +				}
> +			}

Hmph, so we're playing games with masks of offline nodes - is that really
necessary? Your modification of sched_init_numa() still scans all of the
nodes (regardless of their online status) to build the distance map, and
that is never updated (sched_init_numa() is pretty much an __init
function).

So AFAICT this is all to cope with topology_span_sane() not applying
'cpu_map' to its masks. That seemed fine to me back when I wrote it, but in
light of having bogus distance values for offline nodes, not so much...

What about the below instead?

---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..c2d9caad4aa6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2075,6 +2075,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 static bool topology_span_sane(struct sched_domain_topology_level *tl,
 			      const struct cpumask *cpu_map, int cpu)
 {
+	struct cpumask *intersect = sched_domains_tmpmask;
 	int i;
 
 	/* NUMA levels are allowed to overlap */
@@ -2090,14 +2091,17 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
 	for_each_cpu(i, cpu_map) {
 		if (i == cpu)
 			continue;
+
 		/*
-		 * We should 'and' all those masks with 'cpu_map' to exactly
-		 * match the topology we're about to build, but that can only
-		 * remove CPUs, which only lessens our ability to detect
-		 * overlaps
+		 * We shouldn't have to bother with cpu_map here, unfortunately
+		 * some architectures (powerpc says hello) have to deal with
+		 * offline NUMA nodes reporting bogus distance values. This can
+		 * lead to funky NODE domain spans, but since those are offline
+		 * we can mask them out.
 		 */
+		cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
 		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
-		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
+		    cpumask_intersects(intersect, cpu_map))
 			return false;
 	}
 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
  2021-07-01 14:28   ` Valentin Schneider
@ 2021-07-12 12:48     ` Srikar Dronamraju
  2021-07-13 16:32       ` Valentin Schneider
  0 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2021-07-12 12:48 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Michael Ellerman, LKML, Mel Gorman,
	Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	linuxppc-dev, Nathan Lynch, Gautham R Shenoy,
	Geetika Moolchandani, Laurent Dufour

Hi Valentin,

> On 01/07/21 09:45, Srikar Dronamraju wrote:
> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
> >  void sched_domains_numa_masks_set(unsigned int cpu)
> >  {
> 
> Hmph, so we're playing games with masks of offline nodes - is that really
> necessary? Your modification of sched_init_numa() still scans all of the
> nodes (regardless of their online status) to build the distance map, and
> that is never updated (sched_init_numa() is pretty much an __init
> function).
> 
> So AFAICT this is all to cope with topology_span_sane() not applying
> 'cpu_map' to its masks. That seemed fine to me back when I wrote it, but in
> light of having bogus distance values for offline nodes, not so much...
> 
> What about the below instead?
> 
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..c2d9caad4aa6 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2075,6 +2075,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
>  static bool topology_span_sane(struct sched_domain_topology_level *tl,
>  			      const struct cpumask *cpu_map, int cpu)
>  {
> +	struct cpumask *intersect = sched_domains_tmpmask;
>  	int i;
> 
>  	/* NUMA levels are allowed to overlap */
> @@ -2090,14 +2091,17 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
>  	for_each_cpu(i, cpu_map) {
>  		if (i == cpu)
>  			continue;
> +
>  		/*
> -		 * We should 'and' all those masks with 'cpu_map' to exactly
> -		 * match the topology we're about to build, but that can only
> -		 * remove CPUs, which only lessens our ability to detect
> -		 * overlaps
> +		 * We shouldn't have to bother with cpu_map here, unfortunately
> +		 * some architectures (powerpc says hello) have to deal with
> +		 * offline NUMA nodes reporting bogus distance values. This can
> +		 * lead to funky NODE domain spans, but since those are offline
> +		 * we can mask them out.
>  		 */
> +		cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
>  		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> -		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
> +		    cpumask_intersects(intersect, cpu_map))
>  			return false;
>  	}
> 

Unfortunately this is not helping.
I tried this patch alone and also with 2/2 patch of this series where
we update/fill fake topology numbers. However both cases are still failing.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
  2021-07-12 12:48     ` Srikar Dronamraju
@ 2021-07-13 16:32       ` Valentin Schneider
  2021-07-23 14:39         ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2021-07-13 16:32 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, Michael Ellerman, LKML, Mel Gorman,
	Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	linuxppc-dev, Nathan Lynch, Gautham R Shenoy,
	Geetika Moolchandani, Laurent Dufour

On 12/07/21 18:18, Srikar Dronamraju wrote:
> Hi Valentin,
>
>> On 01/07/21 09:45, Srikar Dronamraju wrote:
>> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
>> >  void sched_domains_numa_masks_set(unsigned int cpu)
>> >  {
>>
>> Hmph, so we're playing games with masks of offline nodes - is that really
>> necessary? Your modification of sched_init_numa() still scans all of the
>> nodes (regardless of their online status) to build the distance map, and
>> that is never updated (sched_init_numa() is pretty much an __init
>> function).
>>
>> So AFAICT this is all to cope with topology_span_sane() not applying
>> 'cpu_map' to its masks. That seemed fine to me back when I wrote it, but in
>> light of having bogus distance values for offline nodes, not so much...
>>
>> What about the below instead?
>>
>> ---
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index b77ad49dc14f..c2d9caad4aa6 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2075,6 +2075,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
>>  static bool topology_span_sane(struct sched_domain_topology_level *tl,
>>                            const struct cpumask *cpu_map, int cpu)
>>  {
>> +	struct cpumask *intersect = sched_domains_tmpmask;
>>      int i;
>>
>>      /* NUMA levels are allowed to overlap */
>> @@ -2090,14 +2091,17 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
>>      for_each_cpu(i, cpu_map) {
>>              if (i == cpu)
>>                      continue;
>> +
>>              /*
>> -		 * We should 'and' all those masks with 'cpu_map' to exactly
>> -		 * match the topology we're about to build, but that can only
>> -		 * remove CPUs, which only lessens our ability to detect
>> -		 * overlaps
>> +		 * We shouldn't have to bother with cpu_map here, unfortunately
>> +		 * some architectures (powerpc says hello) have to deal with
>> +		 * offline NUMA nodes reporting bogus distance values. This can
>> +		 * lead to funky NODE domain spans, but since those are offline
>> +		 * we can mask them out.
>>               */
>> +		cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
>>              if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
>> -		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
>> +		    cpumask_intersects(intersect, cpu_map))
>>                      return false;
>>      }
>>
>
> Unfortunately this is not helping.
> I tried this patch alone and also with 2/2 patch of this series where
> we update/fill fake topology numbers. However both cases are still failing.
>

Thanks for testing it.


Now, let's take examples from your cover letter:

  node distances:
  node   0   1   2   3   4   5   6   7
    0:  10  20  40  40  40  40  40  40
    1:  20  10  40  40  40  40  40  40
    2:  40  40  10  20  40  40  40  40
    3:  40  40  20  10  40  40  40  40
    4:  40  40  40  40  10  20  40  40
    5:  40  40  40  40  20  10  40  40
    6:  40  40  40  40  40  40  10  20
    7:  40  40  40  40  40  40  20  10

But the system boots with just nodes 0 and 1, thus only this distance
matrix is valid:

  node   0   1
    0:  10  20
    1:  20  10

topology_span_sane() is going to use tl->mask(cpu), and as you reported the
NODE topology level should cause issues. Let's assume all offline nodes say
they're 10 distance away from everyone else, and that we have one CPU per
node. This would give us:

  NODE->mask(0) == 0,2-7
  NODE->mask(1) == 1-7

The intersection is 2-7, we'll trigger the WARN_ON().
Now, with the above snippet, we'll check if that intersection covers any
online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
end up with an empty intersection and we shouldn't warn - that's the theory
at least.

Looking at sd_numa_mask(), I think there's a bug with topology_span_sane():
it doesn't run in the right place wrt where sched_domains_curr_level is
updated. Could you try the below on top of the previous snippet?

If that doesn't help, could you share the node distances / topology masks
that lead to the WARN_ON()? Thanks.

---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..cda69dfa4065 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1516,13 +1516,6 @@ sd_init(struct sched_domain_topology_level *tl,
 	int sd_id, sd_weight, sd_flags = 0;
 	struct cpumask *sd_span;
 
-#ifdef CONFIG_NUMA
-	/*
-	 * Ugly hack to pass state to sd_numa_mask()...
-	 */
-	sched_domains_curr_level = tl->numa_level;
-#endif
-
 	sd_weight = cpumask_weight(tl->mask(cpu));
 
 	if (tl->sd_flags)
@@ -2131,7 +2124,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 
 		sd = NULL;
 		for_each_sd_topology(tl) {
-
+#ifdef CONFIG_NUMA
+			/*
+			 * Ugly hack to pass state to sd_numa_mask()...
+			 */
+			sched_domains_curr_level = tl->numa_level;
+#endif
 			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
 				goto error;
 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
  2021-07-13 16:32       ` Valentin Schneider
@ 2021-07-23 14:39         ` Srikar Dronamraju
  2021-08-04 10:01           ` Srikar Dronamraju
  2021-08-08 15:56           ` Valentin Schneider
  0 siblings, 2 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2021-07-23 14:39 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Michael Ellerman, LKML, Mel Gorman,
	Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	linuxppc-dev, Nathan Lynch, Gautham R Shenoy,
	Geetika Moolchandani, Laurent Dufour

* Valentin Schneider <valentin.schneider@arm.com> [2021-07-13 17:32:14]:

> On 12/07/21 18:18, Srikar Dronamraju wrote:
> > Hi Valentin,
> >
> >> On 01/07/21 09:45, Srikar Dronamraju wrote:
> >> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
> >> >  void sched_domains_numa_masks_set(unsigned int cpu)
> >> >  {
> >
> > Unfortunately this is not helping.
> > I tried this patch alone and also with 2/2 patch of this series where
> > we update/fill fake topology numbers. However both cases are still failing.
> >
> 
> Thanks for testing it.
> 
> 
> Now, let's take examples from your cover letter:
> 
>   node distances:
>   node   0   1   2   3   4   5   6   7
>     0:  10  20  40  40  40  40  40  40
>     1:  20  10  40  40  40  40  40  40
>     2:  40  40  10  20  40  40  40  40
>     3:  40  40  20  10  40  40  40  40
>     4:  40  40  40  40  10  20  40  40
>     5:  40  40  40  40  20  10  40  40
>     6:  40  40  40  40  40  40  10  20
>     7:  40  40  40  40  40  40  20  10
> 
> But the system boots with just nodes 0 and 1, thus only this distance
> matrix is valid:
> 
>   node   0   1
>     0:  10  20
>     1:  20  10
> 
> topology_span_sane() is going to use tl->mask(cpu), and as you reported the
> NODE topology level should cause issues. Let's assume all offline nodes say
> they're 10 distance away from everyone else, and that we have one CPU per
> node. This would give us:
> 

No,
All offline nodes would be at a distance of 10 from node 0 only.
So here node distance of all offline nodes from node 1 would be 20.

>   NODE->mask(0) == 0,2-7
>   NODE->mask(1) == 1-7

so 

NODE->mask(0) == 0,2-7
NODE->mask(1) should be 1
and NODE->mask(2-7) == 0,2-7

> 
> The intersection is 2-7, we'll trigger the WARN_ON().
> Now, with the above snippet, we'll check if that intersection covers any
> online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
> end up with an empty intersection and we shouldn't warn - that's the theory
> at least.

Now lets say we onlined CPU 3 and node 3 which was at a actual distance
of 20 from node 0.

(If we only consider online CPUs, and since scheduler masks like
sched_domains_numa_masks arent updated with offline CPUs,)
then

NODE->mask(0) == 0
NODE->mask(1) == 1
NODE->mask(3) == 0,3

cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) && cpumask_intersects(intersect, cpu_map))

cpu_map is 0,1,3
intersect is 0

From above NODE->mask(0) is !equal to NODE->mask(1) and
cpumask_intersects(intersect, cpu_map) is also true.

I picked Node 3 since if Node 1 is online, we would have faked distance
for Node 2 to be at distance of 40.

Any node from 3 to 7, we would have faced the same problem.

> 
> Looking at sd_numa_mask(), I think there's a bug with topology_span_sane():
> it doesn't run in the right place wrt where sched_domains_curr_level is
> updated. Could you try the below on top of the previous snippet?
> 
> If that doesn't help, could you share the node distances / topology masks
> that lead to the WARN_ON()? Thanks.
> 
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..cda69dfa4065 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1516,13 +1516,6 @@ sd_init(struct sched_domain_topology_level *tl,
>  	int sd_id, sd_weight, sd_flags = 0;
>  	struct cpumask *sd_span;
> 
> -#ifdef CONFIG_NUMA
> -	/*
> -	 * Ugly hack to pass state to sd_numa_mask()...
> -	 */
> -	sched_domains_curr_level = tl->numa_level;
> -#endif
> -
>  	sd_weight = cpumask_weight(tl->mask(cpu));
> 
>  	if (tl->sd_flags)
> @@ -2131,7 +2124,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> 
>  		sd = NULL;
>  		for_each_sd_topology(tl) {
> -
> +#ifdef CONFIG_NUMA
> +			/*
> +			 * Ugly hack to pass state to sd_numa_mask()...
> +			 */
> +			sched_domains_curr_level = tl->numa_level;
> +#endif
>  			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
>  				goto error;
> 
> 

I tested with the above patch too. However it still not helping.

Here is the log from my testing.

At Boot.

(Do remember to arrive at sched_max_numa_levels we faked the
numa_distance of node 1 to be at 20 from node 0. All other offline
nodes are at a distance of 10 from node 0.)

numactl -H
available: 2 nodes (0,5)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 0 MB
node 0 free: 0 MB
node 5 cpus:
node 5 size: 32038 MB
node 5 free: 29367 MB
node distances:
node   0   5
  0:  10  40
  5:  40  10
------------------------------------------------------------------
grep -r . /sys/kernel/debug/sched/domains/cpu0/domain{0,1,2,3,4}/{name,flags}
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
/sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
/sys/kernel/debug/sched/domains/cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
------------------------------------------------------------------
awk '/domain/{print $1, $2}' /proc/schedstat | sort -u | sed -e 's/00000000,//g'
domain0 00000055
domain0 000000aa
domain1 000000ff
==================================================================

(After performing smt mode switch to 1 and adding 2 additional small cores.
(We always add 2 small cores at a time.))

numactl -H
available: 2 nodes (0,5)
node 0 cpus: 0
node 0 size: 0 MB
node 0 free: 0 MB
node 5 cpus: 8 9 10 11 12 13 14 15
node 5 size: 32038 MB
node 5 free: 29389 MB
node distances:
node   0   5
  0:  10  40
  5:  40  10
------------------------------------------------------------------
grep -r . /sys/kernel/debug/sched/domains/cpu0/domain{0,1,2,3,4}/{name,flags}
/sys/kernel/debug/sched/domains/cpu0/domain0/name:NUMA
/sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_SERIALIZE SD_OVERLAP SD_NUMA
------------------------------------------------------------------
awk '/domain/{print $1, $2}' /proc/schedstat | sort -u | sed -e 's/00000000,//g'
domain0 0000ff00
domain0 0000ff01
domain1 0000ff01
==================================================================

<snipped for brevity>
(Penultimate successful smt mode switch + add of a core)

numactl -H
available: 2 nodes (0,5)
node 0 cpus: 0 1 2 3
node 0 size: 0 MB
node 0 free: 0 MB
node 5 cpus: 8 9 10 11 16 17 18 19 24 25 26 27 32 33 34 35 40 41 42 43 48 49 50 51 56 57 58 59 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 5 size: 32038 MB
node 5 free: 29106 MB
node distances:
node   0   5
  0:  10  40
  5:  40  10
------------------------------------------------------------------
grep -r . /sys/kernel/debug/sched/domains/cpu0/domain{0,1,2,3,4}/{name,flags}
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
/sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
/sys/kernel/debug/sched/domains/cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
/sys/kernel/debug/sched/domains/cpu0/domain2/name:NUMA
/sys/kernel/debug/sched/domains/cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_SERIALIZE SD_OVERLAP SD_NUMA
------------------------------------------------------------------
awk '/domain/{print $1, $2}' /proc/schedstat | sort -u | sed -e 's/00000000,//g'
domain0 00000005
domain0 0000000a
domain0 00000f00
domain0 000f0000
domain0 0f000000
domain0 0000000f,00000000
domain0 00000f00,00000000
domain0 000f0000,00000000
domain0 8f000000,00000000
domain0 000000ff,00000000
domain0 0000ff00,00000000
domain1 0000000f
domain1 0000ffff,8f0f0f0f,0f0f0f00
domain2 0000ffff,8f0f0f0f,0f0f0f0f
==================================================================

(Before Last successful smt mode switch and 2 small core additions.
Till now all CPU additions have been from nodes which are online.)

numactl -H
available: 2 nodes (0,5)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 0 MB
node 0 free: 0 MB
node 5 cpus: 8 9 10 11 16 17 18 19 24 25 26 27 32 33 34 35 40 41 42 43 48 49 50 51 56 57 58 59 64 65 66 67 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
node 5 size: 32038 MB
node 5 free: 29099 MB
node distances:
node   0   5
  0:  10  40
  5:  40  10
------------------------------------------------------------------
grep -r . /sys/kernel/debug/sched/domains/cpu0/domain{0,1,2,3,4}/{name,flags}
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
/sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
/sys/kernel/debug/sched/domains/cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
/sys/kernel/debug/sched/domains/cpu0/domain2/name:NUMA
/sys/kernel/debug/sched/domains/cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_SERIALIZE SD_OVERLAP SD_NUMA
------------------------------------------------------------------
awk '/domain/{print $1, $2}' /proc/schedstat | sort -u | sed -e 's/00000000,//g'
domain0 00000015
domain0 0000002a
domain0 00000f00
domain0 000f0000
domain0 0f000000
domain0 0000000f,00000000
domain0 00000f00,00000000
domain0 000f0000,00000000
domain0 0f000000,00000000
domain0 0000000f,00000000
domain0 0000ff00,00000000
domain0 00ff0000,00000000
domain1 0000003f
domain1 00ffff0f,0f0f0f0f,0f0f0f00
domain2 00ffff0f,0f0f0f0f,0f0f0f3f
==================================================================

( First addition of a CPU to a non-online node esp node whose node
distance was not faked.)

numactl -H
available: 3 nodes (0,5,7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 0 MB
node 0 free: 0 MB
node 5 cpus: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 32 33 34 35 40 41 42 43 48 49 50 51 56 57 58 59 64 65 66 67 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
node 5 size: 32038 MB
node 5 free: 29024 MB
node 7 cpus: 88 89 90 91 92 93 94 95
node 7 size: 0 MB
node 7 free: 0 MB
node distances:
node   0   5   7
  0:  10  40  40
  5:  40  10  20
  7:  40  20  10
------------------------------------------------------------------
grep -r . /sys/kernel/debug/sched/domains/cpu0/domain{0,1,2,3,4}/{name,flags}
------------------------------------------------------------------
awk '/domain/{print $1, $2}' /proc/schedstat | sort -u | sed -e 's/00000000,//g'
==================================================================

I had added a debug patch to dump some variables that may help to
understand the problem
------------------->8--------------------------------------------8<----------
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5e1abd9a8cc5..146f59381bcc 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2096,7 +2096,8 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
 		cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
 		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
 		    cpumask_intersects(intersect, cpu_map)) {
-			pr_err("name=%s mask(%d/%d)=%*pbl mask(%d/%d)=%*pbl", tl->name, cpu, cpu_to_node(cpu), cpumask_pr_args(tl->mask(cpu)), i, cpu_to_node(i), cpumask_pr_args(tl->mask(i)));
+			pr_err("name=%s mask(%d/%d)=%*pbl mask(%d/%d)=%*pbl numa-level=%d curr_level=%d", tl->name, cpu, cpu_to_node(cpu), cpumask_pr_args(tl->mask(cpu)), i, cpu_to_node(i), cpumask_pr_args(tl->mask(i)), tl->numa_level, sched_domains_curr_level);
+			pr_err("intersect=%*pbl cpu_map=%*pbl", cpumask_pr_args(intersect), cpumask_pr_args(cpu_map));
 			return false;
 		}
 	}
------------------->8--------------------------------------------8<----------

From dmesg:

[   99.076892] WARNING: workqueue cpumask: online intersect > possible intersect
[  167.256079] Built 2 zonelists, mobility grouping on.  Total pages: 508394
[  167.256108] Policy zone: Normal
[  167.626915] name=NODE mask(0/0)=0-7 mask(88/7)=0-7,88 numa-level=0 curr_level=0    <-- hunk above
[  167.626925] intersect=0-7 cpu_map=0-19,24-27,32-35,40-43,48-51,56-59,64-67,72-88
[  167.626951] ------------[ cut here ]------------
[  167.626959] WARNING: CPU: 88 PID: 6285 at kernel/sched/topology.c:2143 build_sched_domains+0xacc/0x1780
[  167.626975] Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
[  167.627068] CPU: 88 PID: 6285 Comm: kworker/88:0 Tainted: G        W         5.13.0-rc6+ #60
[  167.627075] Workqueue: events cpuset_hotplug_workfn
[  167.627085] NIP:  c0000000001caf3c LR: c0000000001caf38 CTR: 00000000007088ec
[  167.627091] REGS: c0000000aa253260 TRAP: 0700   Tainted: G        W          (5.13.0-rc6+)
[  167.627095] MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48848222  XER: 00000004
[  167.627108] CFAR: c0000000001eac38 IRQMASK: 0 
               GPR00: c0000000001caf38 c0000000aa253500 c000000001c4a500 0000000000000044 
               GPR04: 00000000fff7ffff c0000000aa253210 0000000000000027 c0000007d9807f90 
               GPR08: 0000000000000023 0000000000000001 0000000000000027 c0000007d2ffffe8 
               GPR12: 0000000000008000 c00000001e9aa480 c000000001c93060 c00000006b834e80 
               GPR16: c000000001d3c6b0 0000000000000000 c000000001775860 c000000001775868 
               GPR20: 0000000000000000 c00000000c064900 0000000000000280 c0000000010bf838 
               GPR24: 0000000000000280 c000000001d3c6c0 0000000000000007 0000000000000058 
               GPR28: 0000000000000000 c00000000c446cc0 c000000001c978a0 c0000000b0a36f00 
[  167.627161] NIP [c0000000001caf3c] build_sched_domains+0xacc/0x1780
[  167.627166] LR [c0000000001caf38] build_sched_domains+0xac8/0x1780
[  167.627172] Call Trace:
[  167.627174] [c0000000aa253500] [c0000000001caf38] build_sched_domains+0xac8/0x1780 (unreliable)
[  167.627182] [c0000000aa253680] [c0000000001ccf2c] partition_sched_domains_locked+0x3ac/0x4c0
[  167.627188] [c0000000aa253710] [c000000000280a84] rebuild_sched_domains_locked+0x404/0x9e0
[  167.627194] [c0000000aa253810] [c000000000284400] rebuild_sched_domains+0x40/0x70
[  167.627201] [c0000000aa253840] [c0000000002846c4] cpuset_hotplug_workfn+0x294/0xf10
[  167.627208] [c0000000aa253c60] [c000000000175140] process_one_work+0x290/0x590
[  167.627217] [c0000000aa253d00] [c0000000001754c8] worker_thread+0x88/0x620
[  167.627224] [c0000000aa253da0] [c000000000181804] kthread+0x194/0x1a0
[  167.627230] [c0000000aa253e10] [c00000000000ccec] ret_from_kernel_thread+0x5c/0x70
[  167.627238] Instruction dump:
[  167.627242] 38635150 f9610070 4801fcdd 60000000 80de0000 3c62ff47 7f25cb78 7fe7fb78 
[  167.627252] 386351a0 7cc43378 4801fcbd 60000000 <0fe00000> 3920fff4 f92100c0 e86100e0 
[  167.627262] ---[ end trace 870f890d7c623d18 ]---
[  168.026621] name=NODE mask(0/0)=0-7 mask(88/7)=0-7,88-89 numa-level=0 curr_level=0
[  168.026626] intersect=0-7 cpu_map=0-19,24-27,32-35,40-43,48-51,56-59,64-67,72-89
[  168.026637] ------------[ cut here ]------------
[  168.026643] WARNING: CPU: 89 PID: 6298 at kernel/sched/topology.c:2143 build_sched_domains+0xacc/0x1780
[  168.026650] Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
[  168.026707] CPU: 89 PID: 6298 Comm: kworker/89:0 Tainted: G        W         5.13.0-rc6+ #60
[  168.026713] Workqueue: events cpuset_hotplug_workfn
[  168.026719] NIP:  c0000000001caf3c LR: c0000000001caf38 CTR: 00000000007088ec
[  168.026723] REGS: c0000000ae6d7260 TRAP: 0700   Tainted: G        W          (5.13.0-rc6+)
[  168.026728] MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48848222  XER: 00000007
[  168.026738] CFAR: c0000000001eac38 IRQMASK: 0 
               GPR00: c0000000001caf38 c0000000ae6d7500 c000000001c4a500 0000000000000044 
               GPR04: 00000000fff7ffff c0000000ae6d7210 0000000000000027 c0000007d9907f90 
               GPR08: 0000000000000023 0000000000000001 0000000000000027 c0000007d2ffffe8 
               GPR12: 0000000000008000 c00000001e9a9400 c000000001c93060 c00000006b836a80 
               GPR16: c000000001d3c6b0 0000000000000000 c000000001775860 c000000001775868 
               GPR20: 0000000000000000 c00000000c064900 0000000000000280 c0000000010bf838 
               GPR24: 0000000000000280 c000000001d3c6c0 0000000000000007 0000000000000058 
               GPR28: 0000000000000000 c00000000c446cc0 c000000001c978a0 c0000000b1c13b00 
[  168.026792] NIP [c0000000001caf3c] build_sched_domains+0xacc/0x1780
[  168.026796] LR [c0000000001caf38] build_sched_domains+0xac8/0x1780
[  168.026801] Call Trace:
[  168.026804] [c0000000ae6d7500] [c0000000001caf38] build_sched_domains+0xac8/0x1780 (unreliable)
[  168.026811] [c0000000ae6d7680] [c0000000001ccf2c] partition_sched_domains_locked+0x3ac/0x4c0
[  168.026817] [c0000000ae6d7710] [c000000000280a84] rebuild_sched_domains_locked+0x404/0x9e0
[  168.026823] [c0000000ae6d7810] [c000000000284400] rebuild_sched_domains+0x40/0x70
[  168.026829] [c0000000ae6d7840] [c0000000002846c4] cpuset_hotplug_workfn+0x294/0xf10
[  168.026835] [c0000000ae6d7c60] [c000000000175140] process_one_work+0x290/0x590
[  168.026841] [c0000000ae6d7d00] [c0000000001754c8] worker_thread+0x88/0x620
[  168.026848] [c0000000ae6d7da0] [c000000000181804] kthread+0x194/0x1a0
[  168.026853] [c0000000ae6d7e10] [c00000000000ccec] ret_from_kernel_thread+0x5c/0x70
[  168.026859] Instruction dump:
[  168.026863] 38635150 f9610070 4801fcdd 60000000 80de0000 3c62ff47 7f25cb78 7fe7fb78 
[  168.026872] 386351a0 7cc43378 4801fcbd 60000000 <0fe00000> 3920fff4 f92100c0 e86100e0 
[  168.026883] ---[ end trace 870f890d7c623d19 ]---

Now this keeps repeating.

I know I have mentioned this before. (So sorry for repeating)
Generally on Power node distance is not populated for offline nodes.
However to arrive at sched_max_numa_levels, we thought of faking few
node distances. In the above case, we faked distance of node 1 as 20
(from node 0) node 5 was already at distance of 40 from node 0.

So when sched_domains_numa_masks_set is called to update sd_numa_mask or
sched_domains_numa_masks, all CPUs under node 0 get updated for node 2
too. (since node 2 is shown as at a local distance from node 0). Do
look at the node mask of CPU 88 in the dmesg. It should have been 88,
however its 0-7,88 where 0-7 are coming from node 0.

Even if we skip updation of sched_domains_numa_masks for offline nodes,
on online of a node (i.e when we get the correct node distance), we have
to update the sched_domains_numa_masks to ensure CPUs that were already
present within a certain distance and skipped are added back. And this
was what I tried to do in my patch.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
  2021-07-23 14:39         ` Srikar Dronamraju
@ 2021-08-04 10:01           ` Srikar Dronamraju
  2021-08-04 10:20             ` Valentin Schneider
  2021-08-08 15:56           ` Valentin Schneider
  1 sibling, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2021-08-04 10:01 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Michael Ellerman, LKML, Mel Gorman,
	Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	linuxppc-dev, Nathan Lynch, Gautham R Shenoy,
	Geetika Moolchandani, Laurent Dufour

* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2021-07-23 20:09:14]:

> * Valentin Schneider <valentin.schneider@arm.com> [2021-07-13 17:32:14]:
> 
> > On 12/07/21 18:18, Srikar Dronamraju wrote:
> > > Hi Valentin,
> > >
> > >> On 01/07/21 09:45, Srikar Dronamraju wrote:
> > >> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
> > >> >  void sched_domains_numa_masks_set(unsigned int cpu)
> > >> >  {
> > >

Hey Valentin / Peter

Did you get a chance to look at this?

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
  2021-08-04 10:01           ` Srikar Dronamraju
@ 2021-08-04 10:20             ` Valentin Schneider
  0 siblings, 0 replies; 17+ messages in thread
From: Valentin Schneider @ 2021-08-04 10:20 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, Michael Ellerman, LKML, Mel Gorman,
	Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	linuxppc-dev, Nathan Lynch, Gautham R Shenoy,
	Geetika Moolchandani, Laurent Dufour

On 04/08/21 15:31, Srikar Dronamraju wrote:
> * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2021-07-23 20:09:14]:
>
>> * Valentin Schneider <valentin.schneider@arm.com> [2021-07-13 17:32:14]:
>>
>> > On 12/07/21 18:18, Srikar Dronamraju wrote:
>> > > Hi Valentin,
>> > >
>> > >> On 01/07/21 09:45, Srikar Dronamraju wrote:
>> > >> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
>> > >> >  void sched_domains_numa_masks_set(unsigned int cpu)
>> > >> >  {
>> > >
>
> Hey Valentin / Peter
>
> Did you get a chance to look at this?
>

Barely, I wanted to set some time aside to stare at this and have been
failing miserably. Let me bump it up my todolist, I'll get to it before the
end of the week.

> --
> Thanks and Regards
> Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
  2021-07-23 14:39         ` Srikar Dronamraju
  2021-08-04 10:01           ` Srikar Dronamraju
@ 2021-08-08 15:56           ` Valentin Schneider
  2021-08-09  6:52             ` Srikar Dronamraju
  1 sibling, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2021-08-08 15:56 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, Michael Ellerman, LKML, Mel Gorman,
	Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	linuxppc-dev, Nathan Lynch, Gautham R Shenoy,
	Geetika Moolchandani, Laurent Dufour


A bit late, but technically the week isn't over yet! :D

On 23/07/21 20:09, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2021-07-13 17:32:14]:
>> Now, let's take examples from your cover letter:
>>
>>   node distances:
>>   node   0   1   2   3   4   5   6   7
>>     0:  10  20  40  40  40  40  40  40
>>     1:  20  10  40  40  40  40  40  40
>>     2:  40  40  10  20  40  40  40  40
>>     3:  40  40  20  10  40  40  40  40
>>     4:  40  40  40  40  10  20  40  40
>>     5:  40  40  40  40  20  10  40  40
>>     6:  40  40  40  40  40  40  10  20
>>     7:  40  40  40  40  40  40  20  10
>>
>> But the system boots with just nodes 0 and 1, thus only this distance
>> matrix is valid:
>>
>>   node   0   1
>>     0:  10  20
>>     1:  20  10
>>
>> topology_span_sane() is going to use tl->mask(cpu), and as you reported the
>> NODE topology level should cause issues. Let's assume all offline nodes say
>> they're 10 distance away from everyone else, and that we have one CPU per
>> node. This would give us:
>>
>
> No,
> All offline nodes would be at a distance of 10 from node 0 only.
> So here node distance of all offline nodes from node 1 would be 20.
>
>>   NODE->mask(0) == 0,2-7
>>   NODE->mask(1) == 1-7
>
> so
>
> NODE->mask(0) == 0,2-7
> NODE->mask(1) should be 1
> and NODE->mask(2-7) == 0,2-7
>

Ok, so that shouldn't trigger the warning.

>>
>> The intersection is 2-7, we'll trigger the WARN_ON().
>> Now, with the above snippet, we'll check if that intersection covers any
>> online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
>> end up with an empty intersection and we shouldn't warn - that's the theory
>> at least.
>
> Now lets say we onlined CPU 3 and node 3 which was at a actual distance
> of 20 from node 0.
>
> (If we only consider online CPUs, and since scheduler masks like
> sched_domains_numa_masks arent updated with offline CPUs,)
> then
>
> NODE->mask(0) == 0
> NODE->mask(1) == 1
> NODE->mask(3) == 0,3
>

Wait, doesn't the distance matrix (without any offline node) say

  distance(0, 3) == 40

? We should have at the very least:

  node   0   1   2   3
    0:  10  20  ??  40
    1:  20  20  ??  40
    2:  ??  ??  ??  ??
    3:  40  40  ??  10

Regardless, NODE->mask(x) is sched_domains_numa_masks[0][x], if

  distance(0,3) > LOCAL_DISTANCE

then

  node0 ∉ NODE->mask(3)

> cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
> if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) && cpumask_intersects(intersect, cpu_map))
>
> cpu_map is 0,1,3
> intersect is 0
>
> From above NODE->mask(0) is !equal to NODE->mask(1) and
> cpumask_intersects(intersect, cpu_map) is also true.
>
> I picked Node 3 since if Node 1 is online, we would have faked distance
> for Node 2 to be at distance of 40.
>
> Any node from 3 to 7, we would have faced the same problem.
>
>>
>> Looking at sd_numa_mask(), I think there's a bug with topology_span_sane():
>> it doesn't run in the right place wrt where sched_domains_curr_level is
>> updated. Could you try the below on top of the previous snippet?
>>
>> If that doesn't help, could you share the node distances / topology masks
>> that lead to the WARN_ON()? Thanks.
>>
>> ---
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index b77ad49dc14f..cda69dfa4065 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1516,13 +1516,6 @@ sd_init(struct sched_domain_topology_level *tl,
>>      int sd_id, sd_weight, sd_flags = 0;
>>      struct cpumask *sd_span;
>>
>> -#ifdef CONFIG_NUMA
>> -	/*
>> -	 * Ugly hack to pass state to sd_numa_mask()...
>> -	 */
>> -	sched_domains_curr_level = tl->numa_level;
>> -#endif
>> -
>>      sd_weight = cpumask_weight(tl->mask(cpu));
>>
>>      if (tl->sd_flags)
>> @@ -2131,7 +2124,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>
>>              sd = NULL;
>>              for_each_sd_topology(tl) {
>> -
>> +#ifdef CONFIG_NUMA
>> +			/*
>> +			 * Ugly hack to pass state to sd_numa_mask()...
>> +			 */
>> +			sched_domains_curr_level = tl->numa_level;
>> +#endif
>>                      if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
>>                              goto error;
>>
>>
>
> I tested with the above patch too. However it still not helping.
>
> Here is the log from my testing.
>
> At Boot.
>
> (Do remember to arrive at sched_max_numa_levels we faked the
> numa_distance of node 1 to be at 20 from node 0. All other offline
> nodes are at a distance of 10 from node 0.)
>

[...]

> ( First addition of a CPU to a non-online node esp node whose node
> distance was not faked.)
>
> numactl -H
> available: 3 nodes (0,5,7)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 5 cpus: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 32 33 34 35 40 41 42 43 48 49 50 51 56 57 58 59 64 65 66 67 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
> node 5 size: 32038 MB
> node 5 free: 29024 MB
> node 7 cpus: 88 89 90 91 92 93 94 95
> node 7 size: 0 MB
> node 7 free: 0 MB
> node distances:
> node   0   5   7
>   0:  10  40  40
>   5:  40  10  20
>   7:  40  20  10
> ------------------------------------------------------------------
> grep -r . /sys/kernel/debug/sched/domains/cpu0/domain{0,1,2,3,4}/{name,flags}
> ------------------------------------------------------------------
> awk '/domain/{print $1, $2}' /proc/schedstat | sort -u | sed -e 's/00000000,//g'
> ==================================================================
>
> I had added a debug patch to dump some variables that may help to
> understand the problem
> ------------------->8--------------------------------------------8<----------
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5e1abd9a8cc5..146f59381bcc 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2096,7 +2096,8 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
>               cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
>               if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
>                   cpumask_intersects(intersect, cpu_map)) {
> -			pr_err("name=%s mask(%d/%d)=%*pbl mask(%d/%d)=%*pbl", tl->name, cpu, cpu_to_node(cpu), cpumask_pr_args(tl->mask(cpu)), i, cpu_to_node(i), cpumask_pr_args(tl->mask(i)));
> +			pr_err("name=%s mask(%d/%d)=%*pbl mask(%d/%d)=%*pbl numa-level=%d curr_level=%d", tl->name, cpu, cpu_to_node(cpu), cpumask_pr_args(tl->mask(cpu)), i, cpu_to_node(i), cpumask_pr_args(tl->mask(i)), tl->numa_level, sched_domains_curr_level);
> +			pr_err("intersect=%*pbl cpu_map=%*pbl", cpumask_pr_args(intersect), cpumask_pr_args(cpu_map));
>                       return false;
>               }
>       }
> ------------------->8--------------------------------------------8<----------
>
> From dmesg:
>
> [  167.626915] name=NODE mask(0/0)=0-7 mask(88/7)=0-7,88 numa-level=0 curr_level=0    <-- hunk above
> [  167.626925] intersect=0-7 cpu_map=0-19,24-27,32-35,40-43,48-51,56-59,64-67,72-88

> [  168.026621] name=NODE mask(0/0)=0-7 mask(88/7)=0-7,88-89 numa-level=0 curr_level=0
> [  168.026626] intersect=0-7 cpu_map=0-19,24-27,32-35,40-43,48-51,56-59,64-67,72-89
>

Ok so to condense the info, we have:

  node   0   5   7
    0:  10  40  40
    5:  40  10  20
    7:  40  20  10

  node0: 0-7
  node5: 8-29, 32-35, 40-43, 48-51, 56-59, 64-67, 72-87
  node7: 88-95

With the above distance map, we should have

  NODE->mask(CPU0) == 0-7
  NODE->mask(CPU8) == 8-29, 32-35, 40-43, 48-51, 56-59, 64-67, 72-87
  NODE->mask(CPU88) == 88-95

(this is sched_domains_numa_masks[0][CPUx], and
sched_domains_numa_distance[0] == LOCAL_DISTANCE, thus the mask of CPUs
LOCAL_DISTANCE away from CPUx).

For some reason you end up with node0 being part of node7's NODE
mask. Neither nodes are offline, and per the above distance table that
shouldn't happen.

> Now this keeps repeating.
>
> I know I have mentioned this before. (So sorry for repeating)

It can't hurt to reformulate ;)

> Generally on Power node distance is not populated for offline nodes.
> However to arrive at sched_max_numa_levels, we thought of faking few
> node distances. In the above case, we faked distance of node 1 as 20
> (from node 0) node 5 was already at distance of 40 from node 0.
>

Right, again that gives us the right set of unique distances (10, 20, 40).

> So when sched_domains_numa_masks_set is called to update sd_numa_mask or
> sched_domains_numa_masks, all CPUs under node 0 get updated for node 2
> too. (since node 2 is shown as at a local distance from node 0). Do
> look at the node mask of CPU 88 in the dmesg. It should have been 88,
> however its 0-7,88 where 0-7 are coming from node 0.
>
> Even if we skip updation of sched_domains_numa_masks for offline nodes,
> on online of a node (i.e when we get the correct node distance), we have
> to update the sched_domains_numa_masks to ensure CPUs that were already
> present within a certain distance and skipped are added back. And this
> was what I tried to do in my patch.
>

Ok, so it looks like we really can't do without that part - even if we get
"sensible" distance values for the online nodes, we can't divine values for
the offline ones.

> --
> Thanks and Regards
> Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
  2021-08-08 15:56           ` Valentin Schneider
@ 2021-08-09  6:52             ` Srikar Dronamraju
  2021-08-09 12:52               ` Valentin Schneider
  0 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2021-08-09  6:52 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Michael Ellerman, LKML, Mel Gorman,
	Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	linuxppc-dev, Nathan Lynch, Gautham R Shenoy,
	Geetika Moolchandani, Laurent Dufour

* Valentin Schneider <valentin.schneider@arm.com> [2021-08-08 16:56:47]:

>
> A bit late, but technically the week isn't over yet! :D
>
> On 23/07/21 20:09, Srikar Dronamraju wrote:
> > * Valentin Schneider <valentin.schneider@arm.com> [2021-07-13 17:32:14]:
> >> Now, let's take examples from your cover letter:
> >>
> >>   node distances:
> >>   node   0   1   2   3   4   5   6   7
> >>     0:  10  20  40  40  40  40  40  40
> >>     1:  20  10  40  40  40  40  40  40
> >>     2:  40  40  10  20  40  40  40  40
> >>     3:  40  40  20  10  40  40  40  40
> >>     4:  40  40  40  40  10  20  40  40
> >>     5:  40  40  40  40  20  10  40  40
> >>     6:  40  40  40  40  40  40  10  20
> >>     7:  40  40  40  40  40  40  20  10
> >>
> >> But the system boots with just nodes 0 and 1, thus only this distance
> >> matrix is valid:
> >>
> >>   node   0   1
> >>     0:  10  20
> >>     1:  20  10
> >>
> >> topology_span_sane() is going to use tl->mask(cpu), and as you reported the
> >> NODE topology level should cause issues. Let's assume all offline nodes say
> >> they're 10 distance away from everyone else, and that we have one CPU per
> >> node. This would give us:
> >>
> >
> > No,
> > All offline nodes would be at a distance of 10 from node 0 only.
> > So here node distance of all offline nodes from node 1 would be 20.
> >
> >>   NODE->mask(0) == 0,2-7
> >>   NODE->mask(1) == 1-7
> >
> > so
> >
> > NODE->mask(0) == 0,2-7
> > NODE->mask(1) should be 1
> > and NODE->mask(2-7) == 0,2-7
> >
>
> Ok, so that shouldn't trigger the warning.

Yes not at this point, but later on when we online a node.

>
> >>
> >> The intersection is 2-7, we'll trigger the WARN_ON().
> >> Now, with the above snippet, we'll check if that intersection covers any
> >> online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
> >> end up with an empty intersection and we shouldn't warn - that's the theory
> >> at least.
> >
> > Now lets say we onlined CPU 3 and node 3 which was at a actual distance
> > of 20 from node 0.
> >
> > (If we only consider online CPUs, and since scheduler masks like
> > sched_domains_numa_masks arent updated with offline CPUs,)
> > then
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(3) == 0,3
> >
>
> Wait, doesn't the distance matrix (without any offline node) say
>
>   distance(0, 3) == 40
>
> ? We should have at the very least:
>
>   node   0   1   2   3
>     0:  10  20  ??  40
>     1:  20  20  ??  40
>     2:  ??  ??  ??  ??
>     3:  40  40  ??  10
>

Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
Note: Node 2-7 and CPU 2-7 are still offline.

node   0   1   2   3
  0:  10  20  40  10
  1:  20  20  40  10
  2:  40  40  10  10
  3:  10  10  10  10

NODE->mask(0) == 0
NODE->mask(1) == 1
NODE->mask(2) == 0
NODE->mask(3) == 0

Note: This is with updating Node 2's distance as 40 for figuring out
the number of numa levels. Since we have all possible distances, we
dont update Node 3 distance, so it will be as if its local to node 0.

Now when Node 3 and CPU 3 are onlined
Note: Node 2, 3-7 and CPU 2, 3-7 are still offline.

node   0   1   2   3
  0:  10  20  40  40
  1:  20  20  40  40
  2:  40  40  10  40
  3:  40  40  40  10

NODE->mask(0) == 0
NODE->mask(1) == 1
NODE->mask(2) == 0
NODE->mask(3) == 0,3

CPU 0 continues to be part of Node->mask(3) because when we online and
we find the right distance, there is no API to reset the numa mask of
3 to remove CPU 0 from the numa masks.

If we had an API to clear/set sched_domains_numa_masks[node][] when
the node state changes, we could probably plug-in to clear/set the
node masks whenever node state changes.


> Regardless, NODE->mask(x) is sched_domains_numa_masks[0][x], if
>
>   distance(0,3) > LOCAL_DISTANCE
>
> then
>
>   node0 ??? NODE->mask(3)
>
> > cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
> > if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) && cpumask_intersects(intersect, cpu_map))
> >
> > cpu_map is 0,1,3
> > intersect is 0
> >
> > From above NODE->mask(0) is !equal to NODE->mask(1) and
> > cpumask_intersects(intersect, cpu_map) is also true.
> >
> > I picked Node 3 since if Node 1 is online, we would have faked distance
> > for Node 2 to be at distance of 40.
> >
> > Any node from 3 to 7, we would have faced the same problem.
> >
> >>
> >> Looking at sd_numa_mask(), I think there's a bug with topology_span_sane():
> >> it doesn't run in the right place wrt where sched_domains_curr_level is
> >> updated. Could you try the below on top of the previous snippet?
> >>
> >> If that doesn't help, could you share the node distances / topology masks
> >> that lead to the WARN_ON()? Thanks.
> >>
> >> ---
> >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >> index b77ad49dc14f..cda69dfa4065 100644
> >> --- a/kernel/sched/topology.c
> >> +++ b/kernel/sched/topology.c
> >> @@ -1516,13 +1516,6 @@ sd_init(struct sched_domain_topology_level *tl,
> >>      int sd_id, sd_weight, sd_flags = 0;
> >>      struct cpumask *sd_span;
> >>
> >> -#ifdef CONFIG_NUMA
> >> -	/*
> >> -	 * Ugly hack to pass state to sd_numa_mask()...
> >> -	 */
> >> -	sched_domains_curr_level = tl->numa_level;
> >> -#endif
> >> -
> >>      sd_weight = cpumask_weight(tl->mask(cpu));
> >>
> >>      if (tl->sd_flags)
> >> @@ -2131,7 +2124,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >>
> >>              sd = NULL;
> >>              for_each_sd_topology(tl) {
> >> -
> >> +#ifdef CONFIG_NUMA
> >> +			/*
> >> +			 * Ugly hack to pass state to sd_numa_mask()...
> >> +			 */
> >> +			sched_domains_curr_level = tl->numa_level;
> >> +#endif
> >>                      if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
> >>                              goto error;
> >>
> >>
> >
> > I tested with the above patch too. However it still not helping.
> >
> > Here is the log from my testing.
> >
> > At Boot.
> >
> > (Do remember to arrive at sched_max_numa_levels we faked the
> > numa_distance of node 1 to be at 20 from node 0. All other offline
> > nodes are at a distance of 10 from node 0.)
> >
>
> [...]
>
> > ( First addition of a CPU to a non-online node esp node whose node
> > distance was not faked.)
> >
> > numactl -H
> > available: 3 nodes (0,5,7)
> > node 0 cpus: 0 1 2 3 4 5 6 7
> > node 0 size: 0 MB
> > node 0 free: 0 MB
> > node 5 cpus: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 32 33 34 35 40 41 42 43 48 49 50 51 56 57 58 59 64 65 66 67 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
> > node 5 size: 32038 MB
> > node 5 free: 29024 MB
> > node 7 cpus: 88 89 90 91 92 93 94 95
> > node 7 size: 0 MB
> > node 7 free: 0 MB
> > node distances:
> > node   0   5   7
> >   0:  10  40  40
> >   5:  40  10  20
> >   7:  40  20  10
> > ------------------------------------------------------------------
> > grep -r . /sys/kernel/debug/sched/domains/cpu0/domain{0,1,2,3,4}/{name,flags}
> > ------------------------------------------------------------------
> > awk '/domain/{print $1, $2}' /proc/schedstat | sort -u | sed -e 's/00000000,//g'
> > ==================================================================
> >
> > I had added a debug patch to dump some variables that may help to
> > understand the problem
> > ------------------->8--------------------------------------------8<----------
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 5e1abd9a8cc5..146f59381bcc 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2096,7 +2096,8 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
> >               cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
> >               if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> >                   cpumask_intersects(intersect, cpu_map)) {
> > -			pr_err("name=%s mask(%d/%d)=%*pbl mask(%d/%d)=%*pbl", tl->name, cpu, cpu_to_node(cpu), cpumask_pr_args(tl->mask(cpu)), i, cpu_to_node(i), cpumask_pr_args(tl->mask(i)));
> > +			pr_err("name=%s mask(%d/%d)=%*pbl mask(%d/%d)=%*pbl numa-level=%d curr_level=%d", tl->name, cpu, cpu_to_node(cpu), cpumask_pr_args(tl->mask(cpu)), i, cpu_to_node(i), cpumask_pr_args(tl->mask(i)), tl->numa_level, sched_domains_curr_level);
> > +			pr_err("intersect=%*pbl cpu_map=%*pbl", cpumask_pr_args(intersect), cpumask_pr_args(cpu_map));
> >                       return false;
> >               }
> >       }
> > ------------------->8--------------------------------------------8<----------
> >
> > From dmesg:
> >
> > [  167.626915] name=NODE mask(0/0)=0-7 mask(88/7)=0-7,88 numa-level=0 curr_level=0    <-- hunk above
> > [  167.626925] intersect=0-7 cpu_map=0-19,24-27,32-35,40-43,48-51,56-59,64-67,72-88
>
> > [  168.026621] name=NODE mask(0/0)=0-7 mask(88/7)=0-7,88-89 numa-level=0 curr_level=0
> > [  168.026626] intersect=0-7 cpu_map=0-19,24-27,32-35,40-43,48-51,56-59,64-67,72-89
> >
>
> Ok so to condense the info, we have:
>
>   node   0   5   7
>     0:  10  40  40
>     5:  40  10  20
>     7:  40  20  10
>
>   node0: 0-7
>   node5: 8-29, 32-35, 40-43, 48-51, 56-59, 64-67, 72-87
>   node7: 88-95
>
> With the above distance map, we should have
>
>   node->mask(cpu0) == 0-7
>   node->mask(cpu8) == 8-29, 32-35, 40-43, 48-51, 56-59, 64-67, 72-87
>   node->mask(cpu88) == 88-95
>

Yes. this is what we should have and

node->mask(cpu0) == 0-7
node->mask(cpu8) == 8-29, 32-35, 40-43, 48-51, 56-59, 64-67, 72-87
node->mask(cpu88) == 0-7, 88-95

this is what we get.


> (this is sched_domains_numa_masks[0][CPUx], and
> sched_domains_numa_distance[0] == LOCAL_DISTANCE, thus the mask of CPUs
> LOCAL_DISTANCE away from CPUx).
>
> For some reason you end up with node0 being part of node7's NODE
> mask. Neither nodes are offline, and per the above distance table that
> shouldn't happen.
>
> > Now this keeps repeating.
> >
> > I know I have mentioned this before. (So sorry for repeating)
>
> It can't hurt to reformulate ;)
>
> > Generally on Power node distance is not populated for offline nodes.
> > However to arrive at sched_max_numa_levels, we thought of faking few
> > node distances. In the above case, we faked distance of node 1 as 20
> > (from node 0) node 5 was already at distance of 40 from node 0.
> >
>
> Right, again that gives us the right set of unique distances (10, 20, 40).
>
> > So when sched_domains_numa_masks_set is called to update sd_numa_mask or
> > sched_domains_numa_masks, all CPUs under node 0 get updated for node 2
> > too. (since node 2 is shown as at a local distance from node 0). Do
> > look at the node mask of CPU 88 in the dmesg. It should have been 88,
> > however its 0-7,88 where 0-7 are coming from node 0.
> >
> > Even if we skip updation of sched_domains_numa_masks for offline nodes,
> > on online of a node (i.e when we get the correct node distance), we have
> > to update the sched_domains_numa_masks to ensure CPUs that were already
> > present within a certain distance and skipped are added back. And this
> > was what I tried to do in my patch.
> >
>
> Ok, so it looks like we really can't do without that part - even if we get
> "sensible" distance values for the online nodes, we can't divine values for
> the offline ones.
>

Yes

--
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
  2021-08-09  6:52             ` Srikar Dronamraju
@ 2021-08-09 12:52               ` Valentin Schneider
  2021-08-10 11:47                 ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2021-08-09 12:52 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, Michael Ellerman, LKML, Mel Gorman,
	Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	linuxppc-dev, Nathan Lynch, Gautham R Shenoy,
	Geetika Moolchandani, Laurent Dufour

On 09/08/21 12:22, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2021-08-08 16:56:47]:
>> Wait, doesn't the distance matrix (without any offline node) say
>>
>>   distance(0, 3) == 40
>>
>> ? We should have at the very least:
>>
>>   node   0   1   2   3
>>     0:  10  20  ??  40
>>     1:  20  20  ??  40
>>     2:  ??  ??  ??  ??
>>     3:  40  40  ??  10
>>
>
> Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
> Note: Node 2-7 and CPU 2-7 are still offline.
>
> node   0   1   2   3
>   0:  10  20  40  10
>   1:  20  20  40  10
>   2:  40  40  10  10
>   3:  10  10  10  10
>
> NODE->mask(0) == 0
> NODE->mask(1) == 1
> NODE->mask(2) == 0
> NODE->mask(3) == 0
>
> Note: This is with updating Node 2's distance as 40 for figuring out
> the number of numa levels. Since we have all possible distances, we
> dont update Node 3 distance, so it will be as if its local to node 0.
>
> Now when Node 3 and CPU 3 are onlined
> Note: Node 2, 3-7 and CPU 2, 3-7 are still offline.
>
> node   0   1   2   3
>   0:  10  20  40  40
>   1:  20  20  40  40
>   2:  40  40  10  40
>   3:  40  40  40  10
>
> NODE->mask(0) == 0
> NODE->mask(1) == 1
> NODE->mask(2) == 0
> NODE->mask(3) == 0,3
>
> CPU 0 continues to be part of Node->mask(3) because when we online and
> we find the right distance, there is no API to reset the numa mask of
> 3 to remove CPU 0 from the numa masks.
>
> If we had an API to clear/set sched_domains_numa_masks[node][] when
> the node state changes, we could probably plug-in to clear/set the
> node masks whenever node state changes.
>

Gotcha, this is now coming back to me...

[...]

>> Ok, so it looks like we really can't do without that part - even if we get
>> "sensible" distance values for the online nodes, we can't divine values for
>> the offline ones.
>>
>
> Yes
>

Argh, while your approach does take care of the masks, it leaves
sched_numa_topology_type unchanged. You *can* force an update of it, but
yuck :(

I got to the below...

---
From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Date: Thu, 1 Jul 2021 09:45:51 +0530
Subject: [PATCH 1/1] sched/topology: Skip updating masks for non-online nodes

The scheduler currently expects NUMA node distances to be stable from init
onwards, and as a consequence builds the related data structures
once-and-for-all at init (see sched_init_numa()).

Unfortunately, on some architectures node distance is unreliable for
offline nodes and may very well change upon onlining.

Skip over offline nodes during sched_init_numa(). Track nodes that have
been onlined at least once, and trigger a build of a node's NUMA masks when
it is first onlined post-init.

Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/topology.c | 65 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..cba95793a9b7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1482,6 +1482,8 @@ int				sched_max_numa_distance;
 static int			*sched_domains_numa_distance;
 static struct cpumask		***sched_domains_numa_masks;
 int __read_mostly		node_reclaim_distance = RECLAIM_DISTANCE;
+
+static unsigned long __read_mostly *sched_numa_onlined_nodes;
 #endif
 
 /*
@@ -1833,6 +1835,16 @@ void sched_init_numa(void)
 			sched_domains_numa_masks[i][j] = mask;
 
 			for_each_node(k) {
+				/*
+				 * Distance information can be unreliable for
+				 * offline nodes, defer building the node
+				 * masks to its bringup.
+				 * This relies on all unique distance values
+				 * still being visible at init time.
+				 */
+				if (!node_online(j))
+					continue;
+
 				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");
 
@@ -1886,6 +1898,53 @@ void sched_init_numa(void)
 	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
 
 	init_numa_topology_type();
+
+	sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
+	if (!sched_numa_onlined_nodes)
+		return;
+
+	bitmap_zero(sched_numa_onlined_nodes, nr_node_ids);
+	for_each_online_node(i)
+		bitmap_set(sched_numa_onlined_nodes, i, 1);
+}
+
+void __sched_domains_numa_masks_set(unsigned int node)
+{
+	int i, j;
+
+	/*
+	 * NUMA masks are not built for offline nodes in sched_init_numa().
+	 * Thus, when a CPU of a never-onlined-before node gets plugged in,
+	 * adding that new CPU to the right NUMA masks is not sufficient: the
+	 * masks of that CPU's node must also be updated.
+	 */
+	if (test_bit(node, sched_numa_onlined_nodes))
+		return;
+
+	bitmap_set(sched_numa_onlined_nodes, node, 1);
+
+	for (i = 0; i < sched_domains_numa_levels; i++) {
+		for (j = 0; j < nr_node_ids; j++) {
+			if (!node_online(j) || node == j)
+				continue;
+
+			if (node_distance(j, node) > sched_domains_numa_distance[i])
+				continue;
+
+			/* Add remote nodes in our masks */
+			cpumask_or(sched_domains_numa_masks[i][node],
+				   sched_domains_numa_masks[i][node],
+				   sched_domains_numa_masks[0][j]);
+		}
+	}
+
+	/*
+	 * A new node has been brought up, potentially changing the topology
+	 * classification.
+	 *
+	 * Note that this is racy vs any use of sched_numa_topology_type :/
+	 */
+	init_numa_topology_type();
 }
 
 void sched_domains_numa_masks_set(unsigned int cpu)
@@ -1893,8 +1952,14 @@ void sched_domains_numa_masks_set(unsigned int cpu)
 	int node = cpu_to_node(cpu);
 	int i, j;
 
+	__sched_domains_numa_masks_set(node);
+
 	for (i = 0; i < sched_domains_numa_levels; i++) {
 		for (j = 0; j < nr_node_ids; j++) {
+			if (!node_online(j))
+				continue;
+
+			/* Set ourselves in the remote node's masks */
 			if (node_distance(j, node) <= sched_domains_numa_distance[i])
 				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
 		}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
  2021-08-09 12:52               ` Valentin Schneider
@ 2021-08-10 11:47                 ` Srikar Dronamraju
  2021-08-16 10:33                   ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2021-08-10 11:47 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Michael Ellerman, LKML, Mel Gorman,
	Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	linuxppc-dev, Nathan Lynch, Gautham R Shenoy,
	Geetika Moolchandani, Laurent Dufour

* Valentin Schneider <valentin.schneider@arm.com> [2021-08-09 13:52:38]:

> On 09/08/21 12:22, Srikar Dronamraju wrote:
> > * Valentin Schneider <valentin.schneider@arm.com> [2021-08-08 16:56:47]:
> >> Wait, doesn't the distance matrix (without any offline node) say
> >>
> >>   distance(0, 3) == 40
> >>
> >> ? We should have at the very least:
> >>
> >>   node   0   1   2   3
> >>     0:  10  20  ??  40
> >>     1:  20  20  ??  40
> >>     2:  ??  ??  ??  ??
> >>     3:  40  40  ??  10
> >>
> >
> > Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
> > Note: Node 2-7 and CPU 2-7 are still offline.
> >
> > node   0   1   2   3
> >   0:  10  20  40  10
> >   1:  20  20  40  10
> >   2:  40  40  10  10
> >   3:  10  10  10  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0
> >
> > Note: This is with updating Node 2's distance as 40 for figuring out
> > the number of numa levels. Since we have all possible distances, we
> > dont update Node 3 distance, so it will be as if its local to node 0.
> >
> > Now when Node 3 and CPU 3 are onlined
> > Note: Node 2, 3-7 and CPU 2, 3-7 are still offline.
> >
> > node   0   1   2   3
> >   0:  10  20  40  40
> >   1:  20  20  40  40
> >   2:  40  40  10  40
> >   3:  40  40  40  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0,3
> >
> > CPU 0 continues to be part of Node->mask(3) because when we online and
> > we find the right distance, there is no API to reset the numa mask of
> > 3 to remove CPU 0 from the numa masks.
> >
> > If we had an API to clear/set sched_domains_numa_masks[node][] when
> > the node state changes, we could probably plug-in to clear/set the
> > node masks whenever node state changes.
> >
> 
> Gotcha, this is now coming back to me...
> 
> [...]
> 
> >> Ok, so it looks like we really can't do without that part - even if we get
> >> "sensible" distance values for the online nodes, we can't divine values for
> >> the offline ones.
> >>
> >
> > Yes
> >
> 
> Argh, while your approach does take care of the masks, it leaves
> sched_numa_topology_type unchanged. You *can* force an update of it, but
> yuck :(
> 
> I got to the below...
> 

Yes, I completely missed that we should update sched_numa_topology_type.


> ---
> From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Date: Thu, 1 Jul 2021 09:45:51 +0530
> Subject: [PATCH 1/1] sched/topology: Skip updating masks for non-online nodes
> 
> The scheduler currently expects NUMA node distances to be stable from init
> onwards, and as a consequence builds the related data structures
> once-and-for-all at init (see sched_init_numa()).
> 
> Unfortunately, on some architectures node distance is unreliable for
> offline nodes and may very well change upon onlining.
> 
> Skip over offline nodes during sched_init_numa(). Track nodes that have
> been onlined at least once, and trigger a build of a node's NUMA masks when
> it is first onlined post-init.
> 

Your version is much much better than mine.
And I have verified that it works as expected.


> Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/topology.c | 65 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..cba95793a9b7 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1482,6 +1482,8 @@ int				sched_max_numa_distance;
>  static int			*sched_domains_numa_distance;
>  static struct cpumask		***sched_domains_numa_masks;
>  int __read_mostly		node_reclaim_distance = RECLAIM_DISTANCE;
> +
> +static unsigned long __read_mostly *sched_numa_onlined_nodes;
>  #endif
> 
>  /*
> @@ -1833,6 +1835,16 @@ void sched_init_numa(void)
>  			sched_domains_numa_masks[i][j] = mask;
> 
>  			for_each_node(k) {
> +				/*
> +				 * Distance information can be unreliable for
> +				 * offline nodes, defer building the node
> +				 * masks to its bringup.
> +				 * This relies on all unique distance values
> +				 * still being visible at init time.
> +				 */
> +				if (!node_online(j))
> +					continue;
> +
>  				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
>  					sched_numa_warn("Node-distance not symmetric");
> 
> @@ -1886,6 +1898,53 @@ void sched_init_numa(void)
>  	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
> 
>  	init_numa_topology_type();
> +
> +	sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
> +	if (!sched_numa_onlined_nodes)
> +		return;
> +
> +	bitmap_zero(sched_numa_onlined_nodes, nr_node_ids);
> +	for_each_online_node(i)
> +		bitmap_set(sched_numa_onlined_nodes, i, 1);
> +}
> +
> +void __sched_domains_numa_masks_set(unsigned int node)
> +{
> +	int i, j;
> +
> +	/*
> +	 * NUMA masks are not built for offline nodes in sched_init_numa().
> +	 * Thus, when a CPU of a never-onlined-before node gets plugged in,
> +	 * adding that new CPU to the right NUMA masks is not sufficient: the
> +	 * masks of that CPU's node must also be updated.
> +	 */
> +	if (test_bit(node, sched_numa_onlined_nodes))
> +		return;
> +
> +	bitmap_set(sched_numa_onlined_nodes, node, 1);
> +
> +	for (i = 0; i < sched_domains_numa_levels; i++) {
> +		for (j = 0; j < nr_node_ids; j++) {
> +			if (!node_online(j) || node == j)
> +				continue;
> +
> +			if (node_distance(j, node) > sched_domains_numa_distance[i])
> +				continue;
> +
> +			/* Add remote nodes in our masks */
> +			cpumask_or(sched_domains_numa_masks[i][node],
> +				   sched_domains_numa_masks[i][node],
> +				   sched_domains_numa_masks[0][j]);
> +		}
> +	}
> +
> +	/*
> +	 * A new node has been brought up, potentially changing the topology
> +	 * classification.
> +	 *
> +	 * Note that this is racy vs any use of sched_numa_topology_type :/
> +	 */
> +	init_numa_topology_type();
>  }
> 
>  void sched_domains_numa_masks_set(unsigned int cpu)
> @@ -1893,8 +1952,14 @@ void sched_domains_numa_masks_set(unsigned int cpu)
>  	int node = cpu_to_node(cpu);
>  	int i, j;
> 
> +	__sched_domains_numa_masks_set(node);
> +
>  	for (i = 0; i < sched_domains_numa_levels; i++) {
>  		for (j = 0; j < nr_node_ids; j++) {
> +			if (!node_online(j))
> +				continue;
> +
> +			/* Set ourselves in the remote node's masks */
>  			if (node_distance(j, node) <= sched_domains_numa_distance[i])
>  				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
>  		}
> -- 
> 2.25.1
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
  2021-08-10 11:47                 ` Srikar Dronamraju
@ 2021-08-16 10:33                   ` Srikar Dronamraju
  2021-08-17  0:01                     ` Valentin Schneider
  0 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2021-08-16 10:33 UTC (permalink / raw)
  To: Valentin Schneider, Peter Zijlstra
  Cc: Ingo Molnar, Michael Ellerman, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev,
	Nathan Lynch, Gautham R Shenoy, Geetika Moolchandani,
	Laurent Dufour

> 
> Your version is much much better than mine.
> And I have verified that it works as expected.
> 
> 

Hey Peter/Valentin

Are we waiting for any more feedback/testing for this?


-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
  2021-08-16 10:33                   ` Srikar Dronamraju
@ 2021-08-17  0:01                     ` Valentin Schneider
  0 siblings, 0 replies; 17+ messages in thread
From: Valentin Schneider @ 2021-08-17  0:01 UTC (permalink / raw)
  To: Srikar Dronamraju, Peter Zijlstra
  Cc: Ingo Molnar, Michael Ellerman, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev,
	Nathan Lynch, Gautham R Shenoy, Geetika Moolchandani,
	Laurent Dufour

On 16/08/21 16:03, Srikar Dronamraju wrote:
>>
>> Your version is much much better than mine.
>> And I have verified that it works as expected.
>>
>>
>
> Hey Peter/Valentin
>
> Are we waiting for any more feedback/testing for this?
>

I'm not overly fond of that last one, but AFAICT the only alternative is
doing a full-fledged NUMA topology rebuild on new-node onlining (i.e. make
calling sched_init_numa() more than once work). It's a lot more work for a
very particular usecase.

>
> --
> Thanks and Regards
> Srikar Dronamraju

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-08-17  0:01 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-01  4:15 [PATCH v2 0/2] Skip numa distance for offline nodes Srikar Dronamraju
2021-07-01  4:15 ` [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes Srikar Dronamraju
2021-07-01 14:28   ` Valentin Schneider
2021-07-12 12:48     ` Srikar Dronamraju
2021-07-13 16:32       ` Valentin Schneider
2021-07-23 14:39         ` Srikar Dronamraju
2021-08-04 10:01           ` Srikar Dronamraju
2021-08-04 10:20             ` Valentin Schneider
2021-08-08 15:56           ` Valentin Schneider
2021-08-09  6:52             ` Srikar Dronamraju
2021-08-09 12:52               ` Valentin Schneider
2021-08-10 11:47                 ` Srikar Dronamraju
2021-08-16 10:33                   ` Srikar Dronamraju
2021-08-17  0:01                     ` Valentin Schneider
2021-07-01  4:15 ` [PATCH v2 2/2] powerpc/numa: Fill distance_lookup_table for offline nodes Srikar Dronamraju
2021-07-01  9:36   ` kernel test robot
2021-07-01 10:20   ` kernel test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).