linux-kernel.vger.kernel.org archive mirror
* [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes
@ 2022-02-14 12:15 Huang Ying
  2022-02-14 12:15 ` [PATCH -V3 2/2] NUMA balancing: avoid to migrate task to CPU-less node Huang Ying
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Huang Ying @ 2022-02-14 12:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Huang Ying, Valentin Schneider, Ingo Molnar,
	Mel Gorman, Rik van Riel, Srikar Dronamraju

The NUMA topology parameters (sched_numa_topology_type,
sched_domains_numa_levels, and sched_max_numa_distance, etc.)
identified by the scheduler may be wrong for systems with CPU-less nodes.

For example, the ACPI SLIT of a system with CPU-less persistent
memory (Intel Optane DCPMM) nodes is as follows,

[000h 0000   4]                    Signature : "SLIT"    [System Locality Information Table]
[004h 0004   4]                 Table Length : 0000042C
[008h 0008   1]                     Revision : 01
[009h 0009   1]                     Checksum : 59
[00Ah 0010   6]                       Oem ID : "XXXX"
[010h 0016   8]                 Oem Table ID : "XXXXXXX"
[018h 0024   4]                 Oem Revision : 00000001
[01Ch 0028   4]              Asl Compiler ID : "INTL"
[020h 0032   4]        Asl Compiler Revision : 20091013

[024h 0036   8]                   Localities : 0000000000000004
[02Ch 0044   4]                 Locality   0 : 0A 15 11 1C
[030h 0048   4]                 Locality   1 : 15 0A 1C 11
[034h 0052   4]                 Locality   2 : 11 1C 0A 1C
[038h 0056   4]                 Locality   3 : 1C 11 1C 0A

While the `numactl -H` output is as follows,

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 64136 MB
node 0 free: 5981 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 64466 MB
node 1 free: 10415 MB
node 2 cpus:
node 2 size: 253952 MB
node 2 free: 253920 MB
node 3 cpus:
node 3 size: 253952 MB
node 3 free: 253951 MB
node distances:
node   0   1   2   3
  0:  10  21  17  28
  1:  21  10  28  17
  2:  17  28  10  28
  3:  28  17  28  10

In this system, there are only 2 sockets.  In each memory controller,
both DRAM and PMEM DIMMs are installed.  Although the physical NUMA
topology is simple, the logical NUMA topology becomes a little
complex.  Because both distance(0, 1) and distance(1, 3) are less
than distance(0, 3), it appears that node 1 sits between node 0 and
node 3, and the whole system appears to be of the glueless mesh NUMA
topology type.  But it definitely is not: there is not even a CPU in node 3.
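
For illustration, the intermediary-node check performed by
init_numa_topology_type() can be replayed on the distance table above with
the following simplified user-space sketch (illustrative only, not kernel
code; the hard-coded values come from the table above),

/* Replay the glueless-mesh check on the example distance table. */
#include <stdio.h>

static int dist[4][4] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 28, 17 },
	{ 17, 28, 10, 28 },
	{ 28, 17, 28, 10 },
};

int main(void)
{
	int max = 28;	/* max distance when the PMEM nodes are included */
	int a, b, c;

	for (a = 0; a < 4; a++) {
		for (b = 0; b < 4; b++) {
			if (dist[a][b] < max)
				continue;	/* only the furthest node pairs */
			for (c = 0; c < 4; c++) {
				/* An intermediary closer to both ends? */
				if (dist[a][c] < max && dist[b][c] < max) {
					printf("node %d between %d and %d: glueless mesh\n",
					       c, a, b);
					return 0;
				}
			}
			printf("no intermediary for %d <-> %d: backplane\n", a, b);
			return 0;
		}
	}
	printf("direct\n");
	return 0;
}

It prints "node 1 between 0 and 3: glueless mesh", which is exactly the
wrong classification described above.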

This isn't a practical problem yet, because the PMEM nodes (node 2
and node 3 in the example system) are offlined by default during
system boot, so init_numa_topology_type() called during system boot
will ignore them and set sched_numa_topology_type to NUMA_DIRECT.
And init_numa_topology_type() is only called at runtime when a CPU of
a never-onlined-before node gets plugged in, and there's no CPU in
the PMEM nodes.  But it appears better to fix this to make the code
more robust.

To test the potential problem, we used a debug patch to call
init_numa_topology_type() when the PMEM node is onlined (in
__set_migration_target_nodes()).  With that, the NUMA parameters
identified by the scheduler are as follows,

sched_numa_topology_type:	NUMA_GLUELESS_MESH
sched_domains_numa_levels:	4
sched_max_numa_distance:	28

To fix the issue, the CPU-less nodes are ignored when the NUMA topology
parameters are identified.  Because a node may become CPU-less or gain
CPUs at run time because of CPU hotplug, the NUMA topology parameters
need to be re-initialized at runtime for CPU hotplug too.

With the patch, the NUMA parameters identified for the example system
above are as follows,

sched_numa_topology_type:	NUMA_DIRECT
sched_domains_numa_levels:	2
sched_max_numa_distance:	21

v3:

- Fix error processing in offline code.
- Rename sched_reinit_numa() to sched_update_numa().
- Fix several build bugs; thanks to the 0-Day team.

v2:

- Fix scheduler NUMA topology more thoroughly, as suggested and shown
  by Peter Zijlstra.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/core.c     |   5 +-
 kernel/sched/fair.c     |   9 +-
 kernel/sched/sched.h    |   6 +-
 kernel/sched/topology.c | 204 +++++++++++++++++++++++-----------------
 4 files changed, 134 insertions(+), 90 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fcf0c180617c..9f3f14216a61 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9045,6 +9045,7 @@ int sched_cpu_activate(unsigned int cpu)
 	set_cpu_active(cpu, true);
 
 	if (sched_smp_initialized) {
+		sched_update_numa(cpu, true);
 		sched_domains_numa_masks_set(cpu);
 		cpuset_cpu_active();
 	}
@@ -9123,10 +9124,12 @@ int sched_cpu_deactivate(unsigned int cpu)
 	if (!sched_smp_initialized)
 		return 0;
 
+	sched_update_numa(cpu, false);
 	ret = cpuset_cpu_inactive(cpu);
 	if (ret) {
 		balance_push_set(cpu, false);
 		set_cpu_active(cpu, true);
+		sched_update_numa(cpu, true);
 		return ret;
 	}
 	sched_domains_numa_masks_clear(cpu);
@@ -9229,7 +9232,7 @@ int sched_cpu_dying(unsigned int cpu)
 
 void __init sched_init_smp(void)
 {
-	sched_init_numa();
+	sched_init_numa(NUMA_NO_NODE);
 
 	/*
 	 * There's no userspace yet to cause hotplug operations; hence all the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5146163bfabb..04968f3f9b6d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1263,6 +1263,7 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 {
 	unsigned long score = 0;
 	int node;
+	int sys_max_dist;
 
 	/*
 	 * All nodes are directly connected, and the same distance
@@ -1271,6 +1272,8 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 	if (sched_numa_topology_type == NUMA_DIRECT)
 		return 0;
 
+	/* sched_max_numa_distance may be changed in parallel. */
+	sys_max_dist = READ_ONCE(sched_max_numa_distance);
 	/*
 	 * This code is called for each node, introducing N^2 complexity,
 	 * which should be ok given the number of nodes rarely exceeds 8.
@@ -1283,7 +1286,7 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 		 * The furthest away nodes in the system are not interesting
 		 * for placement; nid was already counted.
 		 */
-		if (dist == sched_max_numa_distance || node == nid)
+		if (dist >= sys_max_dist || node == nid)
 			continue;
 
 		/*
@@ -1312,8 +1315,8 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 		 * This seems to result in good task placement.
 		 */
 		if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
-			faults *= (sched_max_numa_distance - dist);
-			faults /= (sched_max_numa_distance - LOCAL_DISTANCE);
+			faults *= (sys_max_dist - dist);
+			faults /= (sys_max_dist - LOCAL_DISTANCE);
 		}
 
 		score += faults;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de53be905739..a1ef365407da 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1662,12 +1662,14 @@ enum numa_topology_type {
 extern enum numa_topology_type sched_numa_topology_type;
 extern int sched_max_numa_distance;
 extern bool find_numa_distance(int distance);
-extern void sched_init_numa(void);
+extern void sched_init_numa(int offline_node);
+extern void sched_update_numa(int cpu, bool online);
 extern void sched_domains_numa_masks_set(unsigned int cpu);
 extern void sched_domains_numa_masks_clear(unsigned int cpu);
 extern int sched_numa_find_closest(const struct cpumask *cpus, int cpu);
 #else
-static inline void sched_init_numa(void) { }
+static inline void sched_init_numa(int offline_node) { }
+static inline void sched_update_numa(int cpu, bool online) { }
 static inline void sched_domains_numa_masks_set(unsigned int cpu) { }
 static inline void sched_domains_numa_masks_clear(unsigned int cpu) { }
 static inline int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..2bb6722733fd 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1492,8 +1492,6 @@ static int			sched_domains_curr_level;
 int				sched_max_numa_distance;
 static int			*sched_domains_numa_distance;
 static struct cpumask		***sched_domains_numa_masks;
-
-static unsigned long __read_mostly *sched_numa_onlined_nodes;
 #endif
 
 /*
@@ -1651,6 +1649,7 @@ static struct sched_domain_topology_level default_topology[] = {
 
 static struct sched_domain_topology_level *sched_domain_topology =
 	default_topology;
+static struct sched_domain_topology_level *sched_domain_topology_saved;
 
 #define for_each_sd_topology(tl)			\
 	for (tl = sched_domain_topology; tl->mask; tl++)
@@ -1661,6 +1660,7 @@ void set_sched_topology(struct sched_domain_topology_level *tl)
 		return;
 
 	sched_domain_topology = tl;
+	sched_domain_topology_saved = NULL;
 }
 
 #ifdef CONFIG_NUMA
@@ -1684,8 +1684,12 @@ static void sched_numa_warn(const char *str)
 
 	for (i = 0; i < nr_node_ids; i++) {
 		printk(KERN_WARNING "  ");
-		for (j = 0; j < nr_node_ids; j++)
-			printk(KERN_CONT "%02d ", node_distance(i,j));
+		for (j = 0; j < nr_node_ids; j++) {
+			if (!node_state(i, N_CPU) || !node_state(j, N_CPU))
+				printk(KERN_CONT "(%02d) ", node_distance(i,j));
+			else
+				printk(KERN_CONT " %02d  ", node_distance(i,j));
+		}
 		printk(KERN_CONT "\n");
 	}
 	printk(KERN_WARNING "\n");
@@ -1693,19 +1697,34 @@ static void sched_numa_warn(const char *str)
 
 bool find_numa_distance(int distance)
 {
-	int i;
+	bool found = false;
+	int i, *distances;
 
 	if (distance == node_distance(0, 0))
 		return true;
 
+	rcu_read_lock();
+	distances = rcu_dereference(sched_domains_numa_distance);
+	if (!distances)
+		goto unlock;
 	for (i = 0; i < sched_domains_numa_levels; i++) {
-		if (sched_domains_numa_distance[i] == distance)
-			return true;
+		if (distances[i] == distance) {
+			found = true;
+			break;
+		}
 	}
+unlock:
+	rcu_read_unlock();
 
-	return false;
+	return found;
 }
 
+#define for_each_cpu_node_but(n, nbut)		\
+	for_each_node_state(n, N_CPU)		\
+		if (n == nbut)			\
+			continue;		\
+		else
+
 /*
  * A system can have three types of NUMA topology:
  * NUMA_DIRECT: all nodes are directly connected, or not a NUMA system
@@ -1725,7 +1744,7 @@ bool find_numa_distance(int distance)
  *   there is an intermediary node C, which is < N hops away from both
  *   nodes A and B, the system is a glueless mesh.
  */
-static void init_numa_topology_type(void)
+static void init_numa_topology_type(int offline_node)
 {
 	int a, b, c, n;
 
@@ -1736,14 +1755,14 @@ static void init_numa_topology_type(void)
 		return;
 	}
 
-	for_each_online_node(a) {
-		for_each_online_node(b) {
+	for_each_cpu_node_but(a, offline_node) {
+		for_each_cpu_node_but(b, offline_node) {
 			/* Find two nodes furthest removed from each other. */
 			if (node_distance(a, b) < n)
 				continue;
 
 			/* Is there an intermediary node between a and b? */
-			for_each_online_node(c) {
+			for_each_cpu_node_but(c, offline_node) {
 				if (node_distance(a, c) < n &&
 				    node_distance(b, c) < n) {
 					sched_numa_topology_type =
@@ -1756,17 +1775,22 @@ static void init_numa_topology_type(void)
 			return;
 		}
 	}
+
+	pr_err("Failed to find a NUMA topology type, defaulting to DIRECT\n");
+	sched_numa_topology_type = NUMA_DIRECT;
 }
 
 
 #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
 
-void sched_init_numa(void)
+void sched_init_numa(int offline_node)
 {
 	struct sched_domain_topology_level *tl;
 	unsigned long *distance_map;
 	int nr_levels = 0;
 	int i, j;
+	int *distances;
+	struct cpumask ***masks;
 
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
@@ -1777,12 +1801,13 @@ void sched_init_numa(void)
 		return;
 
 	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
-	for (i = 0; i < nr_node_ids; i++) {
-		for (j = 0; j < nr_node_ids; j++) {
+	for_each_cpu_node_but(i, offline_node) {
+		for_each_cpu_node_but(j, offline_node) {
 			int distance = node_distance(i, j);
 
 			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
 				sched_numa_warn("Invalid distance value range");
+				bitmap_free(distance_map);
 				return;
 			}
 
@@ -1795,16 +1820,17 @@ void sched_init_numa(void)
 	 */
 	nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);
 
-	sched_domains_numa_distance = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
-	if (!sched_domains_numa_distance) {
+	distances = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
+	if (!distances) {
 		bitmap_free(distance_map);
 		return;
 	}
 
 	for (i = 0, j = 0; i < nr_levels; i++, j++) {
 		j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
-		sched_domains_numa_distance[i] = j;
+		distances[i] = j;
 	}
+	rcu_assign_pointer(sched_domains_numa_distance, distances);
 
 	bitmap_free(distance_map);
 
@@ -1826,8 +1852,8 @@ void sched_init_numa(void)
 	 */
 	sched_domains_numa_levels = 0;
 
-	sched_domains_numa_masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
-	if (!sched_domains_numa_masks)
+	masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
+	if (!masks)
 		return;
 
 	/*
@@ -1835,31 +1861,20 @@ void sched_init_numa(void)
 	 * CPUs of nodes that are that many hops away from us.
 	 */
 	for (i = 0; i < nr_levels; i++) {
-		sched_domains_numa_masks[i] =
-			kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
-		if (!sched_domains_numa_masks[i])
+		masks[i] = kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
+		if (!masks[i])
 			return;
 
-		for (j = 0; j < nr_node_ids; j++) {
+		for_each_cpu_node_but(j, offline_node) {
 			struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
 			int k;
 
 			if (!mask)
 				return;
 
-			sched_domains_numa_masks[i][j] = mask;
-
-			for_each_node(k) {
-				/*
-				 * Distance information can be unreliable for
-				 * offline nodes, defer building the node
-				 * masks to its bringup.
-				 * This relies on all unique distance values
-				 * still being visible at init time.
-				 */
-				if (!node_online(j))
-					continue;
+			masks[i][j] = mask;
 
+			for_each_cpu_node_but(k, offline_node) {
 				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");
 
@@ -1870,6 +1885,7 @@ void sched_init_numa(void)
 			}
 		}
 	}
+	rcu_assign_pointer(sched_domains_numa_masks, masks);
 
 	/* Compute default topology size */
 	for (i = 0; sched_domain_topology[i].mask; i++);
@@ -1907,59 +1923,67 @@ void sched_init_numa(void)
 		};
 	}
 
+	sched_domain_topology_saved = sched_domain_topology;
 	sched_domain_topology = tl;
 
 	sched_domains_numa_levels = nr_levels;
 	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
 
-	init_numa_topology_type();
-
-	sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
-	if (!sched_numa_onlined_nodes)
-		return;
-
-	bitmap_zero(sched_numa_onlined_nodes, nr_node_ids);
-	for_each_online_node(i)
-		bitmap_set(sched_numa_onlined_nodes, i, 1);
+	init_numa_topology_type(offline_node);
 }
 
-static void __sched_domains_numa_masks_set(unsigned int node)
-{
-	int i, j;
-
-	/*
-	 * NUMA masks are not built for offline nodes in sched_init_numa().
-	 * Thus, when a CPU of a never-onlined-before node gets plugged in,
-	 * adding that new CPU to the right NUMA masks is not sufficient: the
-	 * masks of that CPU's node must also be updated.
-	 */
-	if (test_bit(node, sched_numa_onlined_nodes))
-		return;
 
-	bitmap_set(sched_numa_onlined_nodes, node, 1);
-
-	for (i = 0; i < sched_domains_numa_levels; i++) {
-		for (j = 0; j < nr_node_ids; j++) {
-			if (!node_online(j) || node == j)
-				continue;
+void sched_reset_numa(void)
+{
+	int nr_levels, *distances;
+	struct cpumask ***masks;
 
-			if (node_distance(j, node) > sched_domains_numa_distance[i])
+	nr_levels = sched_domains_numa_levels;
+	sched_domains_numa_levels = 0;
+	sched_max_numa_distance = 0;
+	sched_numa_topology_type = NUMA_DIRECT;
+	distances = sched_domains_numa_distance;
+	rcu_assign_pointer(sched_domains_numa_distance, NULL);
+	masks = sched_domains_numa_masks;
+	rcu_assign_pointer(sched_domains_numa_masks, NULL);
+	if (distances || masks) {
+		int i, j;
+
+		synchronize_rcu();
+		kfree(distances);
+		for (i = 0; i < nr_levels && masks; i++) {
+			if (!masks[i])
 				continue;
-
-			/* Add remote nodes in our masks */
-			cpumask_or(sched_domains_numa_masks[i][node],
-				   sched_domains_numa_masks[i][node],
-				   sched_domains_numa_masks[0][j]);
+			for_each_node(j)
+				kfree(masks[i][j]);
+			kfree(masks[i]);
 		}
+		kfree(masks);
 	}
+	if (sched_domain_topology_saved) {
+		kfree(sched_domain_topology);
+		sched_domain_topology = sched_domain_topology_saved;
+		sched_domain_topology_saved = NULL;
+	}
+}
+
+/*
+ * Call with hotplug lock held
+ */
+void sched_update_numa(int cpu, bool online)
+{
+	int node;
 
+	node = cpu_to_node(cpu);
 	/*
-	 * A new node has been brought up, potentially changing the topology
-	 * classification.
-	 *
-	 * Note that this is racy vs any use of sched_numa_topology_type :/
+	 * Scheduler NUMA topology is updated when the first CPU of a
+	 * node is onlined or the last CPU of a node is offlined.
 	 */
-	init_numa_topology_type();
+	if (cpumask_weight(cpumask_of_node(node)) != 1)
+		return;
+
+	sched_reset_numa();
+	sched_init_numa(online ? NUMA_NO_NODE : node);
 }
 
 void sched_domains_numa_masks_set(unsigned int cpu)
@@ -1967,11 +1991,9 @@ void sched_domains_numa_masks_set(unsigned int cpu)
 	int node = cpu_to_node(cpu);
 	int i, j;
 
-	__sched_domains_numa_masks_set(node);
-
 	for (i = 0; i < sched_domains_numa_levels; i++) {
 		for (j = 0; j < nr_node_ids; j++) {
-			if (!node_online(j))
+			if (!node_state(j, N_CPU))
 				continue;
 
 			/* Set ourselves in the remote node's masks */
@@ -1986,8 +2008,10 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
 	int i, j;
 
 	for (i = 0; i < sched_domains_numa_levels; i++) {
-		for (j = 0; j < nr_node_ids; j++)
-			cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
+		for (j = 0; j < nr_node_ids; j++) {
+			if (sched_domains_numa_masks[i][j])
+				cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
+		}
 	}
 }
 
@@ -2001,14 +2025,26 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
  */
 int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 {
-	int i, j = cpu_to_node(cpu);
+	int i, j = cpu_to_node(cpu), found = nr_cpu_ids;
+	struct cpumask ***masks;
 
+	rcu_read_lock();
+	masks = rcu_dereference(sched_domains_numa_masks);
+	if (!masks)
+		goto unlock;
 	for (i = 0; i < sched_domains_numa_levels; i++) {
-		cpu = cpumask_any_and(cpus, sched_domains_numa_masks[i][j]);
-		if (cpu < nr_cpu_ids)
-			return cpu;
+		if (!masks[i][j])
+			break;
+		cpu = cpumask_any_and(cpus, masks[i][j]);
+		if (cpu < nr_cpu_ids) {
+			found = cpu;
+			break;
+		}
 	}
-	return nr_cpu_ids;
+unlock:
+	rcu_read_unlock();
+
+	return found;
 }
 
 #endif /* CONFIG_NUMA */
-- 
2.30.2



* [PATCH -V3 2/2] NUMA balancing: avoid to migrate task to CPU-less node
  2022-02-14 12:15 [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes Huang Ying
@ 2022-02-14 12:15 ` Huang Ying
  2022-02-17 18:56   ` [tip: sched/core] sched/numa: Avoid migrating " tip-bot2 for Huang Ying
  2022-03-08  2:05   ` [PATCH -V3 2/2 UPDATE] NUMA balancing: avoid to migrate " Huang, Ying
  2022-02-14 15:05 ` [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes Peter Zijlstra
  2022-02-17 18:56 ` [tip: sched/core] sched/numa: Fix " tip-bot2 for Huang Ying
  2 siblings, 2 replies; 15+ messages in thread
From: Huang Ying @ 2022-02-14 12:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Huang Ying, Valentin Schneider, Ingo Molnar,
	Mel Gorman, Rik van Riel, Srikar Dronamraju

In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
nodes.  But if the number of hint page faults on a PMEM node is the
maximum for a task, the current NUMA balancing policy may try to place
the task on the PMEM node instead of a DRAM node.  This is unreasonable,
because there's no CPU in PMEM NUMA nodes.  To fix this, this patch
ignores CPU-less nodes when searching for the migration target node for
a task.
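
As an illustration of the fallback added to task_numa_placement() below,
the following user-space sketch (illustrative only; the distances and the
CPU-less nodes are the ones of the example system in patch 1/2) redirects
a preferred PMEM node to the nearest node that has CPUs,

#include <limits.h>
#include <stdio.h>

static int dist[4][4] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 28, 17 },
	{ 17, 28, 10, 28 },
	{ 28, 17, 28, 10 },
};
static int has_cpu[4] = { 1, 1, 0, 0 };	/* nodes 2 and 3 are CPU-less */

int main(void)
{
	int max_nid = 2;	/* most hint faults on PMEM node 2 */
	int nid, near_nid = max_nid, near_distance = INT_MAX;

	if (!has_cpu[max_nid]) {
		for (nid = 0; nid < 4; nid++) {
			if (!has_cpu[nid])
				continue;
			if (dist[max_nid][nid] < near_distance) {
				near_nid = nid;
				near_distance = dist[max_nid][nid];
			}
		}
		max_nid = near_nid;
	}
	/* Prints 0: node 0 (distance 17) is preferred over node 1 (28). */
	printf("preferred node: %d\n", max_nid);
	return 0;
}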

To test the patch, we ran a workload that accesses more memory in the
PMEM node than memory in the DRAM node.  Without the patch, the PMEM
node is chosen as the preferred node in task_numa_placement(); with the
patch, the DRAM node is chosen instead.

Known issue: I don't have systems to test complex NUMA topology types,
for example, NUMA_BACKPLANE or NUMA_GLUELESS_MESH.

v3:

- Fix several missed places so that CPU-less nodes are not used as the
  migration target.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04968f3f9b6d..a3f0ea216ccb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1988,7 +1988,7 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	ng = deref_curr_numa_group(p);
 	if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) {
-		for_each_online_node(nid) {
+		for_each_node_state(nid, N_CPU) {
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
 
@@ -2086,13 +2086,13 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
 	unsigned long faults, max_faults = 0;
 	int nid, active_nodes = 0;
 
-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults > max_faults)
 			max_faults = faults;
 	}
 
-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults * ACTIVE_NODE_FRACTION > max_faults)
 			active_nodes++;
@@ -2246,7 +2246,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 
 		dist = sched_max_numa_distance;
 
-		for_each_online_node(node) {
+		for_each_node_state(node, N_CPU) {
 			score = group_weight(p, node, dist);
 			if (score > max_score) {
 				max_score = score;
@@ -2265,7 +2265,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 	 * inside the highest scoring group of nodes. The nodemask tricks
 	 * keep the complexity of the search down.
 	 */
-	nodes = node_online_map;
+	nodes = node_states[N_CPU];
 	for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) {
 		unsigned long max_faults = 0;
 		nodemask_t max_group = NODE_MASK_NONE;
@@ -2404,6 +2404,21 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
+	/* Cannot migrate task to CPU-less node */
+	if (!node_state(max_nid, N_CPU)) {
+		int near_nid = max_nid;
+		int distance, near_distance = INT_MAX;
+
+		for_each_node_state(nid, N_CPU) {
+			distance = node_distance(max_nid, nid);
+			if (distance < near_distance) {
+				near_nid = nid;
+				near_distance = distance;
+			}
+		}
+		max_nid = near_nid;
+	}
+
 	if (ng) {
 		numa_group_count_active_nodes(ng);
 		spin_unlock_irq(group_lock);
-- 
2.30.2



* Re: [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes
  2022-02-14 12:15 [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes Huang Ying
  2022-02-14 12:15 ` [PATCH -V3 2/2] NUMA balancing: avoid to migrate task to CPU-less node Huang Ying
@ 2022-02-14 15:05 ` Peter Zijlstra
  2022-02-15  1:29   ` Huang, Ying
  2022-02-17 18:56 ` [tip: sched/core] sched/numa: Fix " tip-bot2 for Huang Ying
  2 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2022-02-14 15:05 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-kernel, Valentin Schneider, Ingo Molnar, Mel Gorman,
	Rik van Riel, Srikar Dronamraju

On Mon, Feb 14, 2022 at 08:15:52PM +0800, Huang Ying wrote:

> This isn't a practical problem yet, because the PMEM nodes (node 2
> and node 3 in the example system) are offlined by default during
> system boot, so init_numa_topology_type() called during system boot
> will ignore them and set sched_numa_topology_type to NUMA_DIRECT.
> And init_numa_topology_type() is only called at runtime when a CPU of
> a never-onlined-before node gets plugged in, and there's no CPU in
> the PMEM nodes.  But it appears better to fix this to make the code
> more robust.

IIRC there are pre-existing issues with this; namely the distance_map is
created for all nodes, online or not, therefore the levels and
max_distance include the pmem stuff.

At the same time, init_numa_topology_type() uses those values, and the
only reason it 'worked' is because the combination of arguments fails to
hit any of the existing types and exits without setting a type,
defaulting to NUMA_DIRECT by 'accident' of that being type 0 and
bss/data being 0 initialized.
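
To spell that 'accident' out: a file-scope variable like
sched_numa_topology_type lives in .bss and is zero-initialized, and
NUMA_DIRECT happens to be enumerator 0, as this toy user-space snippet
(illustrative only) shows by printing 1,

#include <stdio.h>

enum numa_topology_type { NUMA_DIRECT, NUMA_GLUELESS_MESH, NUMA_BACKPLANE };

static enum numa_topology_type sched_numa_topology_type;	/* .bss => 0 */

int main(void)
{
	printf("%d\n", sched_numa_topology_type == NUMA_DIRECT);
	return 0;
}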

Also, Power (and possibly other architectures) already have CPU-less
nodes and are similarly suffering issues.

Anyway, aside from this the patches look like they should do.

There's a few niggles, like using READ_ONCE() on sched_max_numa_distance
without using WRITE_ONCE() (see below) and having
sched_domains_numa_distance and sched_domains_numa_masks separate RCU
variables (that could go side-ways if there were a function using both,
afaict there isn't and I couldn't be bothered changing that, but it's
something to keep in mind).

I'll go queue these, thanks!

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1259,11 +1259,10 @@ static bool numa_is_active_node(int nid,
 
 /* Handle placement on systems where not all nodes are directly connected. */
 static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
-					int maxdist, bool task)
+					int lim_dist, bool task)
 {
 	unsigned long score = 0;
-	int node;
-	int sys_max_dist;
+	int node, max_dist;
 
 	/*
 	 * All nodes are directly connected, and the same distance
@@ -1273,7 +1272,7 @@ static unsigned long score_nearby_nodes(
 		return 0;
 
 	/* sched_max_numa_distance may be changed in parallel. */
-	sys_max_dist = READ_ONCE(sched_max_numa_distance);
+	max_dist = READ_ONCE(sched_max_numa_distance);
 	/*
 	 * This code is called for each node, introducing N^2 complexity,
 	 * which should be ok given the number of nodes rarely exceeds 8.
@@ -1286,7 +1285,7 @@ static unsigned long score_nearby_nodes(
 		 * The furthest away nodes in the system are not interesting
 		 * for placement; nid was already counted.
 		 */
-		if (dist >= sys_max_dist || node == nid)
+		if (dist >= max_dist || node == nid)
 			continue;
 
 		/*
@@ -1296,8 +1295,7 @@ static unsigned long score_nearby_nodes(
 		 * "hoplimit", only nodes closer by than "hoplimit" are part
 		 * of each group. Skip other nodes.
 		 */
-		if (sched_numa_topology_type == NUMA_BACKPLANE &&
-					dist >= maxdist)
+		if (sched_numa_topology_type == NUMA_BACKPLANE && dist >= lim_dist)
 			continue;
 
 		/* Add up the faults from nearby nodes. */
@@ -1315,8 +1313,8 @@ static unsigned long score_nearby_nodes(
 		 * This seems to result in good task placement.
 		 */
 		if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
-			faults *= (sys_max_dist - dist);
-			faults /= (sys_max_dist - LOCAL_DISTANCE);
+			faults *= (max_dist - dist);
+			faults /= (max_dist - LOCAL_DISTANCE);
 		}
 
 		score += faults;
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1927,7 +1927,7 @@ void sched_init_numa(int offline_node)
 	sched_domain_topology = tl;
 
 	sched_domains_numa_levels = nr_levels;
-	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
+	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
 
 	init_numa_topology_type(offline_node);
 }


* Re: [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes
  2022-02-14 15:05 ` [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes Peter Zijlstra
@ 2022-02-15  1:29   ` Huang, Ying
  0 siblings, 0 replies; 15+ messages in thread
From: Huang, Ying @ 2022-02-15  1:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Valentin Schneider, Ingo Molnar, Mel Gorman,
	Rik van Riel, Srikar Dronamraju

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Feb 14, 2022 at 08:15:52PM +0800, Huang Ying wrote:
>
>> This isn't a practical problem yet, because the PMEM nodes (node 2
>> and node 3 in the example system) are offlined by default during
>> system boot, so init_numa_topology_type() called during system boot
>> will ignore them and set sched_numa_topology_type to NUMA_DIRECT.
>> And init_numa_topology_type() is only called at runtime when a CPU of
>> a never-onlined-before node gets plugged in, and there's no CPU in
>> the PMEM nodes.  But it appears better to fix this to make the code
>> more robust.
>
> IIRC there are pre-existing issues with this; namely the distance_map is
> created for all nodes, online or not, therefore the levels and
> max_distance include the pmem stuff.
>
> At the same time, init_numa_topology_type() uses those values, and the
> only reason it 'worked' is because the combination of arguments fails to
> hit any of the existing types and exits without setting a type,
> defaulting to NUMA_DIRECT by 'accident' of that being type 0 and
> bss/data being 0 initialized.
>
> Also, Power (and possibly other architectures) already have CPU-less
> nodes and are similarly suffering issues.
>
> Anyway, aside from this the patches look like they should do.
>
> There's a few niggles, like using READ_ONCE() on sched_max_numa_distance
> without using WRITE_ONCE() (see below) and having
> sched_domains_numa_distance and sched_domains_numa_masks separate RCU
> variables (that could go side-ways if there were a function using both,
> afaict there isn't and I couldn't be bothered changing that, but it's
> something to keep in mind).

If you think that's necessary, I can work on it with a follow up patch.

> I'll go queue these, thanks!

Thanks!

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1259,11 +1259,10 @@ static bool numa_is_active_node(int nid,
>  
>  /* Handle placement on systems where not all nodes are directly connected. */
>  static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
> -					int maxdist, bool task)
> +					int lim_dist, bool task)
>  {
>  	unsigned long score = 0;
> -	int node;
> -	int sys_max_dist;
> +	int node, max_dist;
>  
>  	/*
>  	 * All nodes are directly connected, and the same distance
> @@ -1273,7 +1272,7 @@ static unsigned long score_nearby_nodes(
>  		return 0;
>  
>  	/* sched_max_numa_distance may be changed in parallel. */
> -	sys_max_dist = READ_ONCE(sched_max_numa_distance);
> +	max_dist = READ_ONCE(sched_max_numa_distance);
>  	/*
>  	 * This code is called for each node, introducing N^2 complexity,
>  	 * which should be ok given the number of nodes rarely exceeds 8.
> @@ -1286,7 +1285,7 @@ static unsigned long score_nearby_nodes(
>  		 * The furthest away nodes in the system are not interesting
>  		 * for placement; nid was already counted.
>  		 */
> -		if (dist >= sys_max_dist || node == nid)
> +		if (dist >= max_dist || node == nid)
>  			continue;
>  
>  		/*
> @@ -1296,8 +1295,7 @@ static unsigned long score_nearby_nodes(
>  		 * "hoplimit", only nodes closer by than "hoplimit" are part
>  		 * of each group. Skip other nodes.
>  		 */
> -		if (sched_numa_topology_type == NUMA_BACKPLANE &&
> -					dist >= maxdist)
> +		if (sched_numa_topology_type == NUMA_BACKPLANE && dist >= lim_dist)
>  			continue;
>  
>  		/* Add up the faults from nearby nodes. */
> @@ -1315,8 +1313,8 @@ static unsigned long score_nearby_nodes(
>  		 * This seems to result in good task placement.
>  		 */
>  		if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
> -			faults *= (sys_max_dist - dist);
> -			faults /= (sys_max_dist - LOCAL_DISTANCE);
> +			faults *= (max_dist - dist);
> +			faults /= (max_dist - LOCAL_DISTANCE);
>  		}
>  
>  		score += faults;
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1927,7 +1927,7 @@ void sched_init_numa(int offline_node)
>  	sched_domain_topology = tl;
>  
>  	sched_domains_numa_levels = nr_levels;
> -	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
> +	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
>  
>  	init_numa_topology_type(offline_node);
>  }

Sorry, the 0-Day build robot reported another small issue; can you fold
in the following patch too?

Best Regards,
Huang, Ying

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 490acd4b835d..d72128151e56 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1947,7 +1947,7 @@ void sched_init_numa(int offline_node)
 		sched_max_numa_distance);
 }
 
-void sched_reset_numa(void)
+static void sched_reset_numa(void)
 {
 	int nr_levels, *distances;
 	struct cpumask ***masks;
-- 
2.30.2



* [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node
  2022-02-14 12:15 ` [PATCH -V3 2/2] NUMA balancing: avoid to migrate task to CPU-less node Huang Ying
@ 2022-02-17 18:56   ` tip-bot2 for Huang Ying
  2022-03-01 20:54     ` Qian Cai
  2022-03-08  2:05   ` [PATCH -V3 2/2 UPDATE] NUMA balancing: avoid to migrate " Huang, Ying
  1 sibling, 1 reply; 15+ messages in thread
From: tip-bot2 for Huang Ying @ 2022-02-17 18:56 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Huang, Ying, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5c7b1aaf139dab5072311853bacc40fc3457d1f9
Gitweb:        https://git.kernel.org/tip/5c7b1aaf139dab5072311853bacc40fc3457d1f9
Author:        Huang Ying <ying.huang@intel.com>
AuthorDate:    Mon, 14 Feb 2022 20:15:53 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 16 Feb 2022 15:57:53 +01:00

sched/numa: Avoid migrating task to CPU-less node

In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
nodes.  But if the number of hint page faults on a PMEM node is the
maximum for a task, the current NUMA balancing policy may try to place
the task on the PMEM node instead of a DRAM node.  This is unreasonable,
because there's no CPU in PMEM NUMA nodes.  To fix this, this patch
ignores CPU-less nodes when searching for the migration target node for
a task.

To test the patch, we ran a workload that accesses more memory in the
PMEM node than memory in the DRAM node.  Without the patch, the PMEM
node is chosen as the preferred node in task_numa_placement(); with the
patch, the DRAM node is chosen instead.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com
---
 kernel/sched/fair.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da3230b..11a72e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1989,7 +1989,7 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	ng = deref_curr_numa_group(p);
 	if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) {
-		for_each_online_node(nid) {
+		for_each_node_state(nid, N_CPU) {
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
 
@@ -2087,13 +2087,13 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
 	unsigned long faults, max_faults = 0;
 	int nid, active_nodes = 0;
 
-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults > max_faults)
 			max_faults = faults;
 	}
 
-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults * ACTIVE_NODE_FRACTION > max_faults)
 			active_nodes++;
@@ -2247,7 +2247,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 
 		dist = sched_max_numa_distance;
 
-		for_each_online_node(node) {
+		for_each_node_state(node, N_CPU) {
 			score = group_weight(p, node, dist);
 			if (score > max_score) {
 				max_score = score;
@@ -2266,7 +2266,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 	 * inside the highest scoring group of nodes. The nodemask tricks
 	 * keep the complexity of the search down.
 	 */
-	nodes = node_online_map;
+	nodes = node_states[N_CPU];
 	for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) {
 		unsigned long max_faults = 0;
 		nodemask_t max_group = NODE_MASK_NONE;
@@ -2405,6 +2405,21 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
+	/* Cannot migrate task to CPU-less node */
+	if (!node_state(max_nid, N_CPU)) {
+		int near_nid = max_nid;
+		int distance, near_distance = INT_MAX;
+
+		for_each_node_state(nid, N_CPU) {
+			distance = node_distance(max_nid, nid);
+			if (distance < near_distance) {
+				near_nid = nid;
+				near_distance = distance;
+			}
+		}
+		max_nid = near_nid;
+	}
+
 	if (ng) {
 		numa_group_count_active_nodes(ng);
 		spin_unlock_irq(group_lock);


* [tip: sched/core] sched/numa: Fix NUMA topology for systems with CPU-less nodes
  2022-02-14 12:15 [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes Huang Ying
  2022-02-14 12:15 ` [PATCH -V3 2/2] NUMA balancing: avoid to migrate task to CPU-less node Huang Ying
  2022-02-14 15:05 ` [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes Peter Zijlstra
@ 2022-02-17 18:56 ` tip-bot2 for Huang Ying
  2 siblings, 0 replies; 15+ messages in thread
From: tip-bot2 for Huang Ying @ 2022-02-17 18:56 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra, Huang, Ying, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     0fb3978b0aac3a5c08637aed03cc2d65f793508f
Gitweb:        https://git.kernel.org/tip/0fb3978b0aac3a5c08637aed03cc2d65f793508f
Author:        Huang Ying <ying.huang@intel.com>
AuthorDate:    Mon, 14 Feb 2022 20:15:52 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 16 Feb 2022 15:57:53 +01:00

sched/numa: Fix NUMA topology for systems with CPU-less nodes

The NUMA topology parameters (sched_numa_topology_type,
sched_domains_numa_levels, and sched_max_numa_distance, etc.)
identified by the scheduler may be wrong for systems with CPU-less nodes.

For example, the ACPI SLIT of a system with CPU-less persistent
memory (Intel Optane DCPMM) nodes is as follows,

[000h 0000   4]                    Signature : "SLIT"    [System Locality Information Table]
[004h 0004   4]                 Table Length : 0000042C
[008h 0008   1]                     Revision : 01
[009h 0009   1]                     Checksum : 59
[00Ah 0010   6]                       Oem ID : "XXXX"
[010h 0016   8]                 Oem Table ID : "XXXXXXX"
[018h 0024   4]                 Oem Revision : 00000001
[01Ch 0028   4]              Asl Compiler ID : "INTL"
[020h 0032   4]        Asl Compiler Revision : 20091013

[024h 0036   8]                   Localities : 0000000000000004
[02Ch 0044   4]                 Locality   0 : 0A 15 11 1C
[030h 0048   4]                 Locality   1 : 15 0A 1C 11
[034h 0052   4]                 Locality   2 : 11 1C 0A 1C
[038h 0056   4]                 Locality   3 : 1C 11 1C 0A

While the `numactl -H` output is as follows,

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 64136 MB
node 0 free: 5981 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 64466 MB
node 1 free: 10415 MB
node 2 cpus:
node 2 size: 253952 MB
node 2 free: 253920 MB
node 3 cpus:
node 3 size: 253952 MB
node 3 free: 253951 MB
node distances:
node   0   1   2   3
  0:  10  21  17  28
  1:  21  10  28  17
  2:  17  28  10  28
  3:  28  17  28  10

In this system, there are only 2 sockets.  In each memory controller,
both DRAM and PMEM DIMMs are installed.  Although the physical NUMA
topology is simple, the logical NUMA topology becomes a little
complex.  Because both distance(0, 1) and distance(1, 3) are less
than distance(0, 3), it appears that node 1 sits between node 0 and
node 3, and the whole system appears to be of the glueless mesh NUMA
topology type.  But it definitely is not: there is not even a CPU in node 3.

This isn't a practical problem yet, because the PMEM nodes (node 2
and node 3 in the example system) are offlined by default during
system boot, so init_numa_topology_type() called during system boot
will ignore them and set sched_numa_topology_type to NUMA_DIRECT.
And init_numa_topology_type() is only called at runtime when a CPU of
a never-onlined-before node gets plugged in, and there's no CPU in
the PMEM nodes.  But it appears better to fix this to make the code
more robust.

To test the potential problem, we used a debug patch to call
init_numa_topology_type() when the PMEM node is onlined (in
__set_migration_target_nodes()).  With that, the NUMA parameters
identified by the scheduler are as follows,

sched_numa_topology_type:	NUMA_GLUELESS_MESH
sched_domains_numa_levels:	4
sched_max_numa_distance:	28

To fix the issue, the CPU-less nodes are ignored when the NUMA topology
parameters are identified.  Because a node may become CPU-less or gain
CPUs at run time because of CPU hotplug, the NUMA topology parameters
need to be re-initialized at runtime for CPU hotplug too.

With the patch, the NUMA parameters identified for the example system
above are as follows,

sched_numa_topology_type:	NUMA_DIRECT
sched_domains_numa_levels:	2
sched_max_numa_distance:	21

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com
---
 kernel/sched/core.c     |   5 +-
 kernel/sched/fair.c     |  15 +--
 kernel/sched/sched.h    |   6 +-
 kernel/sched/topology.c | 206 ++++++++++++++++++++++-----------------
 4 files changed, 137 insertions(+), 95 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c620aab..b222692 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9052,6 +9052,7 @@ int sched_cpu_activate(unsigned int cpu)
 	set_cpu_active(cpu, true);
 
 	if (sched_smp_initialized) {
+		sched_update_numa(cpu, true);
 		sched_domains_numa_masks_set(cpu);
 		cpuset_cpu_active();
 	}
@@ -9130,10 +9131,12 @@ int sched_cpu_deactivate(unsigned int cpu)
 	if (!sched_smp_initialized)
 		return 0;
 
+	sched_update_numa(cpu, false);
 	ret = cpuset_cpu_inactive(cpu);
 	if (ret) {
 		balance_push_set(cpu, false);
 		set_cpu_active(cpu, true);
+		sched_update_numa(cpu, true);
 		return ret;
 	}
 	sched_domains_numa_masks_clear(cpu);
@@ -9236,7 +9239,7 @@ int sched_cpu_dying(unsigned int cpu)
 
 void __init sched_init_smp(void)
 {
-	sched_init_numa();
+	sched_init_numa(NUMA_NO_NODE);
 
 	/*
 	 * There's no userspace yet to cause hotplug operations; hence all the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c4bfff..da3230b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1259,10 +1259,10 @@ static bool numa_is_active_node(int nid, struct numa_group *ng)
 
 /* Handle placement on systems where not all nodes are directly connected. */
 static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
-					int maxdist, bool task)
+					int lim_dist, bool task)
 {
 	unsigned long score = 0;
-	int node;
+	int node, max_dist;
 
 	/*
 	 * All nodes are directly connected, and the same distance
@@ -1271,6 +1271,8 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 	if (sched_numa_topology_type == NUMA_DIRECT)
 		return 0;
 
+	/* sched_max_numa_distance may be changed in parallel. */
+	max_dist = READ_ONCE(sched_max_numa_distance);
 	/*
 	 * This code is called for each node, introducing N^2 complexity,
 	 * which should be ok given the number of nodes rarely exceeds 8.
@@ -1283,7 +1285,7 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 		 * The furthest away nodes in the system are not interesting
 		 * for placement; nid was already counted.
 		 */
-		if (dist == sched_max_numa_distance || node == nid)
+		if (dist >= max_dist || node == nid)
 			continue;
 
 		/*
@@ -1293,8 +1295,7 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 		 * "hoplimit", only nodes closer by than "hoplimit" are part
 		 * of each group. Skip other nodes.
 		 */
-		if (sched_numa_topology_type == NUMA_BACKPLANE &&
-					dist >= maxdist)
+		if (sched_numa_topology_type == NUMA_BACKPLANE && dist >= lim_dist)
 			continue;
 
 		/* Add up the faults from nearby nodes. */
@@ -1312,8 +1313,8 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 		 * This seems to result in good task placement.
 		 */
 		if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
-			faults *= (sched_max_numa_distance - dist);
-			faults /= (sched_max_numa_distance - LOCAL_DISTANCE);
+			faults *= (max_dist - dist);
+			faults /= (max_dist - LOCAL_DISTANCE);
 		}
 
 		score += faults;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9b33ba9..3da5718 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1662,12 +1662,14 @@ enum numa_topology_type {
 extern enum numa_topology_type sched_numa_topology_type;
 extern int sched_max_numa_distance;
 extern bool find_numa_distance(int distance);
-extern void sched_init_numa(void);
+extern void sched_init_numa(int offline_node);
+extern void sched_update_numa(int cpu, bool online);
 extern void sched_domains_numa_masks_set(unsigned int cpu);
 extern void sched_domains_numa_masks_clear(unsigned int cpu);
 extern int sched_numa_find_closest(const struct cpumask *cpus, int cpu);
 #else
-static inline void sched_init_numa(void) { }
+static inline void sched_init_numa(int offline_node) { }
+static inline void sched_update_numa(int cpu, bool online) { }
 static inline void sched_domains_numa_masks_set(unsigned int cpu) { }
 static inline void sched_domains_numa_masks_clear(unsigned int cpu) { }
 static inline int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1c84b48..5db322c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1492,8 +1492,6 @@ static int			sched_domains_curr_level;
 int				sched_max_numa_distance;
 static int			*sched_domains_numa_distance;
 static struct cpumask		***sched_domains_numa_masks;
-
-static unsigned long __read_mostly *sched_numa_onlined_nodes;
 #endif
 
 /*
@@ -1651,6 +1649,7 @@ static struct sched_domain_topology_level default_topology[] = {
 
 static struct sched_domain_topology_level *sched_domain_topology =
 	default_topology;
+static struct sched_domain_topology_level *sched_domain_topology_saved;
 
 #define for_each_sd_topology(tl)			\
 	for (tl = sched_domain_topology; tl->mask; tl++)
@@ -1661,6 +1660,7 @@ void set_sched_topology(struct sched_domain_topology_level *tl)
 		return;
 
 	sched_domain_topology = tl;
+	sched_domain_topology_saved = NULL;
 }
 
 #ifdef CONFIG_NUMA
@@ -1684,8 +1684,12 @@ static void sched_numa_warn(const char *str)
 
 	for (i = 0; i < nr_node_ids; i++) {
 		printk(KERN_WARNING "  ");
-		for (j = 0; j < nr_node_ids; j++)
-			printk(KERN_CONT "%02d ", node_distance(i,j));
+		for (j = 0; j < nr_node_ids; j++) {
+			if (!node_state(i, N_CPU) || !node_state(j, N_CPU))
+				printk(KERN_CONT "(%02d) ", node_distance(i,j));
+			else
+				printk(KERN_CONT " %02d  ", node_distance(i,j));
+		}
 		printk(KERN_CONT "\n");
 	}
 	printk(KERN_WARNING "\n");
@@ -1693,19 +1697,34 @@ static void sched_numa_warn(const char *str)
 
 bool find_numa_distance(int distance)
 {
-	int i;
+	bool found = false;
+	int i, *distances;
 
 	if (distance == node_distance(0, 0))
 		return true;
 
+	rcu_read_lock();
+	distances = rcu_dereference(sched_domains_numa_distance);
+	if (!distances)
+		goto unlock;
 	for (i = 0; i < sched_domains_numa_levels; i++) {
-		if (sched_domains_numa_distance[i] == distance)
-			return true;
+		if (distances[i] == distance) {
+			found = true;
+			break;
+		}
 	}
+unlock:
+	rcu_read_unlock();
 
-	return false;
+	return found;
 }
 
+#define for_each_cpu_node_but(n, nbut)		\
+	for_each_node_state(n, N_CPU)		\
+		if (n == nbut)			\
+			continue;		\
+		else
+
 /*
  * A system can have three types of NUMA topology:
  * NUMA_DIRECT: all nodes are directly connected, or not a NUMA system
@@ -1725,7 +1744,7 @@ bool find_numa_distance(int distance)
  *   there is an intermediary node C, which is < N hops away from both
  *   nodes A and B, the system is a glueless mesh.
  */
-static void init_numa_topology_type(void)
+static void init_numa_topology_type(int offline_node)
 {
 	int a, b, c, n;
 
@@ -1736,14 +1755,14 @@ static void init_numa_topology_type(void)
 		return;
 	}
 
-	for_each_online_node(a) {
-		for_each_online_node(b) {
+	for_each_cpu_node_but(a, offline_node) {
+		for_each_cpu_node_but(b, offline_node) {
 			/* Find two nodes furthest removed from each other. */
 			if (node_distance(a, b) < n)
 				continue;
 
 			/* Is there an intermediary node between a and b? */
-			for_each_online_node(c) {
+			for_each_cpu_node_but(c, offline_node) {
 				if (node_distance(a, c) < n &&
 				    node_distance(b, c) < n) {
 					sched_numa_topology_type =
@@ -1756,17 +1775,22 @@ static void init_numa_topology_type(void)
 			return;
 		}
 	}
+
+	pr_err("Failed to find a NUMA topology type, defaulting to DIRECT\n");
+	sched_numa_topology_type = NUMA_DIRECT;
 }
 
 
 #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
 
-void sched_init_numa(void)
+void sched_init_numa(int offline_node)
 {
 	struct sched_domain_topology_level *tl;
 	unsigned long *distance_map;
 	int nr_levels = 0;
 	int i, j;
+	int *distances;
+	struct cpumask ***masks;
 
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
@@ -1777,12 +1801,13 @@ void sched_init_numa(void)
 		return;
 
 	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
-	for (i = 0; i < nr_node_ids; i++) {
-		for (j = 0; j < nr_node_ids; j++) {
+	for_each_cpu_node_but(i, offline_node) {
+		for_each_cpu_node_but(j, offline_node) {
 			int distance = node_distance(i, j);
 
 			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
 				sched_numa_warn("Invalid distance value range");
+				bitmap_free(distance_map);
 				return;
 			}
 
@@ -1795,16 +1820,17 @@ void sched_init_numa(void)
 	 */
 	nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);
 
-	sched_domains_numa_distance = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
-	if (!sched_domains_numa_distance) {
+	distances = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
+	if (!distances) {
 		bitmap_free(distance_map);
 		return;
 	}
 
 	for (i = 0, j = 0; i < nr_levels; i++, j++) {
 		j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
-		sched_domains_numa_distance[i] = j;
+		distances[i] = j;
 	}
+	rcu_assign_pointer(sched_domains_numa_distance, distances);
 
 	bitmap_free(distance_map);
 
@@ -1826,8 +1852,8 @@ void sched_init_numa(void)
 	 */
 	sched_domains_numa_levels = 0;
 
-	sched_domains_numa_masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
-	if (!sched_domains_numa_masks)
+	masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
+	if (!masks)
 		return;
 
 	/*
@@ -1835,31 +1861,20 @@ void sched_init_numa(void)
 	 * CPUs of nodes that are that many hops away from us.
 	 */
 	for (i = 0; i < nr_levels; i++) {
-		sched_domains_numa_masks[i] =
-			kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
-		if (!sched_domains_numa_masks[i])
+		masks[i] = kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
+		if (!masks[i])
 			return;
 
-		for (j = 0; j < nr_node_ids; j++) {
+		for_each_cpu_node_but(j, offline_node) {
 			struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
 			int k;
 
 			if (!mask)
 				return;
 
-			sched_domains_numa_masks[i][j] = mask;
-
-			for_each_node(k) {
-				/*
-				 * Distance information can be unreliable for
-				 * offline nodes, defer building the node
-				 * masks to its bringup.
-				 * This relies on all unique distance values
-				 * still being visible at init time.
-				 */
-				if (!node_online(j))
-					continue;
+			masks[i][j] = mask;
 
+			for_each_cpu_node_but(k, offline_node) {
 				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");
 
@@ -1870,6 +1885,7 @@ void sched_init_numa(void)
 			}
 		}
 	}
+	rcu_assign_pointer(sched_domains_numa_masks, masks);
 
 	/* Compute default topology size */
 	for (i = 0; sched_domain_topology[i].mask; i++);
@@ -1907,59 +1923,67 @@ void sched_init_numa(void)
 		};
 	}
 
+	sched_domain_topology_saved = sched_domain_topology;
 	sched_domain_topology = tl;
 
 	sched_domains_numa_levels = nr_levels;
-	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
-
-	init_numa_topology_type();
-
-	sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
-	if (!sched_numa_onlined_nodes)
-		return;
+	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
 
-	bitmap_zero(sched_numa_onlined_nodes, nr_node_ids);
-	for_each_online_node(i)
-		bitmap_set(sched_numa_onlined_nodes, i, 1);
+	init_numa_topology_type(offline_node);
 }
 
-static void __sched_domains_numa_masks_set(unsigned int node)
-{
-	int i, j;
-
-	/*
-	 * NUMA masks are not built for offline nodes in sched_init_numa().
-	 * Thus, when a CPU of a never-onlined-before node gets plugged in,
-	 * adding that new CPU to the right NUMA masks is not sufficient: the
-	 * masks of that CPU's node must also be updated.
-	 */
-	if (test_bit(node, sched_numa_onlined_nodes))
-		return;
 
-	bitmap_set(sched_numa_onlined_nodes, node, 1);
-
-	for (i = 0; i < sched_domains_numa_levels; i++) {
-		for (j = 0; j < nr_node_ids; j++) {
-			if (!node_online(j) || node == j)
-				continue;
+static void sched_reset_numa(void)
+{
+	int nr_levels, *distances;
+	struct cpumask ***masks;
 
-			if (node_distance(j, node) > sched_domains_numa_distance[i])
+	nr_levels = sched_domains_numa_levels;
+	sched_domains_numa_levels = 0;
+	sched_max_numa_distance = 0;
+	sched_numa_topology_type = NUMA_DIRECT;
+	distances = sched_domains_numa_distance;
+	rcu_assign_pointer(sched_domains_numa_distance, NULL);
+	masks = sched_domains_numa_masks;
+	rcu_assign_pointer(sched_domains_numa_masks, NULL);
+	if (distances || masks) {
+		int i, j;
+
+		synchronize_rcu();
+		kfree(distances);
+		for (i = 0; i < nr_levels && masks; i++) {
+			if (!masks[i])
 				continue;
-
-			/* Add remote nodes in our masks */
-			cpumask_or(sched_domains_numa_masks[i][node],
-				   sched_domains_numa_masks[i][node],
-				   sched_domains_numa_masks[0][j]);
+			for_each_node(j)
+				kfree(masks[i][j]);
+			kfree(masks[i]);
 		}
+		kfree(masks);
 	}
+	if (sched_domain_topology_saved) {
+		kfree(sched_domain_topology);
+		sched_domain_topology = sched_domain_topology_saved;
+		sched_domain_topology_saved = NULL;
+	}
+}
+
+/*
+ * Call with hotplug lock held
+ */
+void sched_update_numa(int cpu, bool online)
+{
+	int node;
 
+	node = cpu_to_node(cpu);
 	/*
-	 * A new node has been brought up, potentially changing the topology
-	 * classification.
-	 *
-	 * Note that this is racy vs any use of sched_numa_topology_type :/
+	 * Scheduler NUMA topology is updated when the first CPU of a
+	 * node is onlined or the last CPU of a node is offlined.
 	 */
-	init_numa_topology_type();
+	if (cpumask_weight(cpumask_of_node(node)) != 1)
+		return;
+
+	sched_reset_numa();
+	sched_init_numa(online ? NUMA_NO_NODE : node);
 }
 
 void sched_domains_numa_masks_set(unsigned int cpu)
@@ -1967,11 +1991,9 @@ void sched_domains_numa_masks_set(unsigned int cpu)
 	int node = cpu_to_node(cpu);
 	int i, j;
 
-	__sched_domains_numa_masks_set(node);
-
 	for (i = 0; i < sched_domains_numa_levels; i++) {
 		for (j = 0; j < nr_node_ids; j++) {
-			if (!node_online(j))
+			if (!node_state(j, N_CPU))
 				continue;
 
 			/* Set ourselves in the remote node's masks */
@@ -1986,8 +2008,10 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
 	int i, j;
 
 	for (i = 0; i < sched_domains_numa_levels; i++) {
-		for (j = 0; j < nr_node_ids; j++)
-			cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
+		for (j = 0; j < nr_node_ids; j++) {
+			if (sched_domains_numa_masks[i][j])
+				cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
+		}
 	}
 }
 
@@ -2001,14 +2025,26 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
  */
 int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 {
-	int i, j = cpu_to_node(cpu);
+	int i, j = cpu_to_node(cpu), found = nr_cpu_ids;
+	struct cpumask ***masks;
 
+	rcu_read_lock();
+	masks = rcu_dereference(sched_domains_numa_masks);
+	if (!masks)
+		goto unlock;
 	for (i = 0; i < sched_domains_numa_levels; i++) {
-		cpu = cpumask_any_and(cpus, sched_domains_numa_masks[i][j]);
-		if (cpu < nr_cpu_ids)
-			return cpu;
+		if (!masks[i][j])
+			break;
+		cpu = cpumask_any_and(cpus, masks[i][j]);
+		if (cpu < nr_cpu_ids) {
+			found = cpu;
+			break;
+		}
 	}
-	return nr_cpu_ids;
+unlock:
+	rcu_read_unlock();
+
+	return found;
 }
 
 #endif /* CONFIG_NUMA */
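
The hunks above turn sched_domains_numa_masks and
sched_domains_numa_distance into RCU-protected pointers: the updater
publishes the new copy with rcu_assign_pointer(), waits for pre-existing
readers with synchronize_rcu(), and only then frees the old copy, while
lockless readers such as sched_numa_find_closest() access the data under
rcu_read_lock()/rcu_dereference().  A minimal, generic sketch of that
pattern is below; the names and the data structure are illustrative only,
not the actual scheduler code.

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct numa_info {
	int nr_levels;
	int max_distance;
};

static struct numa_info __rcu *cur_info;

/* Reader side: sees either the old copy, the new copy, or NULL. */
static int read_max_distance(void)
{
	struct numa_info *info;
	int ret = 0;

	rcu_read_lock();
	info = rcu_dereference(cur_info);
	if (info)
		ret = info->max_distance;
	rcu_read_unlock();

	return ret;
}

/* Updater side: callers are serialized, e.g. by the hotplug lock. */
static void replace_info(struct numa_info *new_info)
{
	struct numa_info *old = rcu_dereference_raw(cur_info);

	rcu_assign_pointer(cur_info, new_info);
	synchronize_rcu();	/* wait for all pre-existing readers */
	kfree(old);		/* no reader can still hold the old copy */
}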

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node
  2022-02-17 18:56   ` [tip: sched/core] sched/numa: Avoid migrating " tip-bot2 for Huang Ying
@ 2022-03-01 20:54     ` Qian Cai
  2022-03-02  0:59       ` Huang, Ying
  0 siblings, 1 reply; 15+ messages in thread
From: Qian Cai @ 2022-03-01 20:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-tip-commits, Huang, Ying, Peter Zijlstra (Intel),
	x86, linux-arm-kernel

On Thu, Feb 17, 2022 at 06:56:52PM -0000, tip-bot2 for Huang Ying wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     5c7b1aaf139dab5072311853bacc40fc3457d1f9
> Gitweb:        https://git.kernel.org/tip/5c7b1aaf139dab5072311853bacc40fc3457d1f9
> Author:        Huang Ying <ying.huang@intel.com>
> AuthorDate:    Mon, 14 Feb 2022 20:15:53 +08:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Wed, 16 Feb 2022 15:57:53 +01:00
> 
> sched/numa: Avoid migrating task to CPU-less node
> 
> In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
> nodes.  But if the number of the hint page faults on a PMEM node is
> the max for a task, The current NUMA balancing policy may try to place
> the task on the PMEM node instead of DRAM node.  This is unreasonable,
> because there's no CPU in PMEM NUMA nodes.  To fix this, CPU-less
> nodes are ignored when searching the migration target node for a task
> in this patch.
> 
> To test the patch, we run a workload that accesses more memory in PMEM
> node than memory in DRAM node.  Without the patch, the PMEM node will
> be chosen as preferred node in task_numa_placement().  While the DRAM
> node will be chosen instead with the patch.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com

Reverting this commit on top of today's linux-next fixed a boot crash
on arm64 NUMA systems.

 Unable to handle kernel paging request at virtual address ffff7a6601694aec
 KASAN: maybe wild-memory-access in range [0xffffd3300b4a5760-0xffffd3300b4a5767]
 Mem abort info:
   ESR = 0x96000005
   EC = 0x25: DABT (current EL), IL = 32 bits
 mlx5_core 0007:02:00.0: enabling device (0100 -> 0102)
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x05: level 1 translation fault
 Data abort info:
   ISV = 0, ISS = 0x00000005
   CM = 0, WnR = 0
 swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000400b3d6c6000
 [ffff7a6601694aec] pgd=0000403fc007f003, p4d=0000403fc007f003, pud=0000000000000000
 Internal error: Oops: 96000005 [#1] PREEMPT SMP
 Modules linked in: nouveau(+) drm_ttm_helper ttm nvme(+) drm_dp_helper drm_kms_helper mlx5_core(+) mpt3sas(+) xhci_pci(+) nvme_core raid_class xhci_pci_renesas drm
 CPU: 85 PID: 1308 Comm: udevadm Not tainted 5.17.0-rc6-next-20220301 #1
 pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : task_numa_placement
 lr : task_numa_placement
 sp : ffff800031047760
 x29: ffff800031047760 x28: ffff3fffab916c00 x27: 0000000000000020
 x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000000

 x23: ffff07ffe5289a80 x22: ffffd3300b4a5760 x21: 000000000000003f
 x20: ffffd32feb4a5768 x19: 0000000000000000 x18: ffff07ffe528ad88
 x17: ffffd32fe5693a1c x16: 0000000000000000 x15: ffff8000310478e0

 x14: ffff07ffe528ad90 x13: 0000000000000002 x12: dfff80000000000d
 x11: 0000000000000001 x10: 000000000000b6be x9 : 0000000000000000
 x8 : 00000000ffffffff x7 : ffffd32feb4a5780 x6 : 0000000000000000
 x5 : 0000000000000000 x4 : 0000000000000000 x3 : 1ffffa6601694aec
 x2 : 0000000000000000 x1 : dfff800000000000 x0 : 000000001ffffff8
 Call trace:
  task_numa_placement
  arch_test_bit at include/asm-generic/bitops/non-atomic.h:118
  (inlined by) node_state at include/linux/nodemask.h:416
  (inlined by) task_numa_placement at kernel/sched/fair.c:2439
  task_numa_fault
  do_numa_page
  handle_pte_fault
  __handle_mm_fault
  handle_mm_fault
  do_page_fault
  do_translation_fault
  do_mem_abort
  el0_da
  el0t_64_sync_handler
  el0t_64_sync
 Code: 8b000296 d2d00001 f2fbffe1 d343fec3 (38e16861)
 ---[ end trace 0000000000000000 ]---
 Kernel panic - not syncing: Oops: Fatal exception
 SMP: stopping secondary CPUs
 Kernel Offset: 0x532fdcf70000 from 0xffff800008000000
 PHYS_OFFSET: 0x80000000
 CPU features: 0x00,00042c0c,19801c82
 Memory Limit: none
 ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

> ---
>  kernel/sched/fair.c | 25 ++++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da3230b..11a72e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1989,7 +1989,7 @@ static int task_numa_migrate(struct task_struct *p)
>  	 */
>  	ng = deref_curr_numa_group(p);
>  	if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) {
> -		for_each_online_node(nid) {
> +		for_each_node_state(nid, N_CPU) {
>  			if (nid == env.src_nid || nid == p->numa_preferred_nid)
>  				continue;
>  
> @@ -2087,13 +2087,13 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
>  	unsigned long faults, max_faults = 0;
>  	int nid, active_nodes = 0;
>  
> -	for_each_online_node(nid) {
> +	for_each_node_state(nid, N_CPU) {
>  		faults = group_faults_cpu(numa_group, nid);
>  		if (faults > max_faults)
>  			max_faults = faults;
>  	}
>  
> -	for_each_online_node(nid) {
> +	for_each_node_state(nid, N_CPU) {
>  		faults = group_faults_cpu(numa_group, nid);
>  		if (faults * ACTIVE_NODE_FRACTION > max_faults)
>  			active_nodes++;
> @@ -2247,7 +2247,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
>  
>  		dist = sched_max_numa_distance;
>  
> -		for_each_online_node(node) {
> +		for_each_node_state(node, N_CPU) {
>  			score = group_weight(p, node, dist);
>  			if (score > max_score) {
>  				max_score = score;
> @@ -2266,7 +2266,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
>  	 * inside the highest scoring group of nodes. The nodemask tricks
>  	 * keep the complexity of the search down.
>  	 */
> -	nodes = node_online_map;
> +	nodes = node_states[N_CPU];
>  	for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) {
>  		unsigned long max_faults = 0;
>  		nodemask_t max_group = NODE_MASK_NONE;
> @@ -2405,6 +2405,21 @@ static void task_numa_placement(struct task_struct *p)
>  		}
>  	}
>  
> +	/* Cannot migrate task to CPU-less node */
> +	if (!node_state(max_nid, N_CPU)) {
> +		int near_nid = max_nid;
> +		int distance, near_distance = INT_MAX;
> +
> +		for_each_node_state(nid, N_CPU) {
> +			distance = node_distance(max_nid, nid);
> +			if (distance < near_distance) {
> +				near_nid = nid;
> +				near_distance = distance;
> +			}
> +		}
> +		max_nid = near_nid;
> +	}
> +
>  	if (ng) {
>  		numa_group_count_active_nodes(ng);
>  		spin_unlock_irq(group_lock);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node
  2022-03-01 20:54     ` Qian Cai
@ 2022-03-02  0:59       ` Huang, Ying
  2022-03-02 12:37         ` Qian Cai
  2022-03-07  5:51         ` Huang, Ying
  0 siblings, 2 replies; 15+ messages in thread
From: Huang, Ying @ 2022-03-02  0:59 UTC (permalink / raw)
  To: Qian Cai
  Cc: linux-kernel, linux-tip-commits, Peter Zijlstra (Intel),
	x86, linux-arm-kernel

Qian Cai <quic_qiancai@quicinc.com> writes:

> On Thu, Feb 17, 2022 at 06:56:52PM -0000, tip-bot2 for Huang Ying wrote:
>> The following commit has been merged into the sched/core branch of tip:
>> 
>> Commit-ID:     5c7b1aaf139dab5072311853bacc40fc3457d1f9
>> Gitweb:        https://git.kernel.org/tip/5c7b1aaf139dab5072311853bacc40fc3457d1f9
>> Author:        Huang Ying <ying.huang@intel.com>
>> AuthorDate:    Mon, 14 Feb 2022 20:15:53 +08:00
>> Committer:     Peter Zijlstra <peterz@infradead.org>
>> CommitterDate: Wed, 16 Feb 2022 15:57:53 +01:00
>> 
>> sched/numa: Avoid migrating task to CPU-less node
>> 
>> In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
>> nodes.  But if the number of the hint page faults on a PMEM node is
>> the max for a task, The current NUMA balancing policy may try to place
>> the task on the PMEM node instead of DRAM node.  This is unreasonable,
>> because there's no CPU in PMEM NUMA nodes.  To fix this, CPU-less
>> nodes are ignored when searching the migration target node for a task
>> in this patch.
>> 
>> To test the patch, we run a workload that accesses more memory in PMEM
>> node than memory in DRAM node.  Without the patch, the PMEM node will
>> be chosen as preferred node in task_numa_placement().  While the DRAM
>> node will be chosen instead with the patch.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com
>
> Reverting this commit on the top of today's linux-next fixed a boot crash
> on arm64 NUMA systems.
>
>  Unable to handle kernel paging request at virtual address ffff7a6601694aec
>  KASAN: maybe wild-memory-access in range [0xffffd3300b4a5760-0xffffd3300b4a5767]
>  Mem abort info:
>    ESR = 0x96000005
>    EC = 0x25: DABT (current EL), IL = 32 bits
>  mlx5_core 0007:02:00.0: enabling device (0100 -> 0102)
>    SET = 0, FnV = 0
>    EA = 0, S1PTW = 0
>    FSC = 0x05: level 1 translation fault
>  Data abort info:
>    ISV = 0, ISS = 0x00000005
>    CM = 0, WnR = 0
>  swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000400b3d6c6000
>  [ffff7a6601694aec] pgd=0000403fc007f003, p4d=0000403fc007f003, pud=0000000000000000
>  Internal error: Oops: 96000005 [#1] PREEMPT SMP
>  Modules linked in: nouveau(+) drm_ttm_helper ttm nvme(+) drm_dp_helper drm_kms_helper mlx5_core(+) mpt3sas(+) xhci_pci(+) nvme_core raid_class xhci_pci_renesas drm
>  CPU: 85 PID: 1308 Comm: udevadm Not tainted 5.17.0-rc6-next-20220301 #1
>  pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>  pc : task_numa_placement
>  lr : task_numa_placement
>  sp : ffff800031047760
>  x29: ffff800031047760 x28: ffff3fffab916c00 x27: 0000000000000020
>  x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000000
>
>  x23: ffff07ffe5289a80 x22: ffffd3300b4a5760 x21: 000000000000003f
>  x20: ffffd32feb4a5768 x19: 0000000000000000 x18: ffff07ffe528ad88
>  x17: ffffd32fe5693a1c x16: 0000000000000000 x15: ffff8000310478e0
>
>  x14: ffff07ffe528ad90 x13: 0000000000000002 x12: dfff80000000000d
>  x11: 0000000000000001 x10: 000000000000b6be x9 : 0000000000000000
>  x8 : 00000000ffffffff x7 : ffffd32feb4a5780 x6 : 0000000000000000
>  x5 : 0000000000000000 x4 : 0000000000000000 x3 : 1ffffa6601694aec
>  x2 : 0000000000000000 x1 : dfff800000000000 x0 : 000000001ffffff8
>  Call trace:
>   task_numa_placement
>   arch_test_bit at include/asm-generic/bitops/non-atomic.h:118
>   (inlined by) node_state at include/linux/nodemask.h:416
>   (inlined by) task_numa_placement at kernel/sched/fair.c:2439
>   task_numa_fault
>   do_numa_page
>   handle_pte_fault
>   __handle_mm_fault
>   handle_mm_fault
>   do_page_fault
>   do_translation_fault
>   do_mem_abort
>   el0_da
>   el0t_64_sync_handler
>   el0t_64_sync
>  Code: 8b000296 d2d00001 f2fbffe1 d343fec3 (38e16861)
>  ---[ end trace 0000000000000000 ]---
>  Kernel panic - not syncing: Oops: Fatal exception
>  SMP: stopping secondary CPUs
>  Kernel Offset: 0x532fdcf70000 from 0xffff800008000000
>  PHYS_OFFSET: 0x80000000
>  CPU features: 0x00,00042c0c,19801c82
>  Memory Limit: none
>  ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

Thanks for reporting!  It looks like task_numa_placement() can reach the
new check with max_nid still set to NUMA_NO_NODE (-1) when no NUMA hint
faults have been recorded for the task yet, so node_state() ends up
testing a bit at a negative index.  Can you try whether the following
debug patch fixes the issue?

Best Regards,
Huang, Ying

----------------------------8<-------------------------------------------
From 176d185426730111e763eb386d0210561f021dbc Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Wed, 2 Mar 2022 08:54:01 +0800
Subject: [PATCH] dbg KASAN error

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a3f0ea216ccb..1fe7a4510cca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2405,7 +2405,7 @@ static void task_numa_placement(struct task_struct *p)
 	}
 
 	/* Cannot migrate task to CPU-less node */
-	if (!node_state(max_nid, N_CPU)) {
+	if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
 		int near_nid = max_nid;
 		int distance, near_distance = INT_MAX;
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node
  2022-03-02  0:59       ` Huang, Ying
@ 2022-03-02 12:37         ` Qian Cai
  2022-03-07  5:51         ` Huang, Ying
  1 sibling, 0 replies; 15+ messages in thread
From: Qian Cai @ 2022-03-02 12:37 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-tip-commits, Peter Zijlstra (Intel),
	x86, linux-arm-kernel

On Wed, Mar 02, 2022 at 08:59:55AM +0800, Huang, Ying wrote:
> Thanks for reporting!  Can you try whether the following debug patch can fix the issue?

Yes, it prevents the crash.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node
  2022-03-02  0:59       ` Huang, Ying
  2022-03-02 12:37         ` Qian Cai
@ 2022-03-07  5:51         ` Huang, Ying
  2022-03-07 13:53           ` Qian Cai
  1 sibling, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2022-03-07  5:51 UTC (permalink / raw)
  To: Qian Cai
  Cc: linux-kernel, linux-tip-commits, Peter Zijlstra (Intel),
	x86, linux-arm-kernel

Hi, Qian,

"Huang, Ying" <ying.huang@intel.com> writes:

> Qian Cai <quic_qiancai@quicinc.com> writes:
>
>> On Thu, Feb 17, 2022 at 06:56:52PM -0000, tip-bot2 for Huang Ying wrote:
>>> The following commit has been merged into the sched/core branch of tip:
>>> 
>>> Commit-ID:     5c7b1aaf139dab5072311853bacc40fc3457d1f9
>>> Gitweb:        https://git.kernel.org/tip/5c7b1aaf139dab5072311853bacc40fc3457d1f9
>>> Author:        Huang Ying <ying.huang@intel.com>
>>> AuthorDate:    Mon, 14 Feb 2022 20:15:53 +08:00
>>> Committer:     Peter Zijlstra <peterz@infradead.org>
>>> CommitterDate: Wed, 16 Feb 2022 15:57:53 +01:00
>>> 
>>> sched/numa: Avoid migrating task to CPU-less node
>>> 
>>> In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
>>> nodes.  But if the number of the hint page faults on a PMEM node is
>>> the max for a task, The current NUMA balancing policy may try to place
>>> the task on the PMEM node instead of DRAM node.  This is unreasonable,
>>> because there's no CPU in PMEM NUMA nodes.  To fix this, CPU-less
>>> nodes are ignored when searching the migration target node for a task
>>> in this patch.
>>> 
>>> To test the patch, we run a workload that accesses more memory in PMEM
>>> node than memory in DRAM node.  Without the patch, the PMEM node will
>>> be chosen as preferred node in task_numa_placement().  While the DRAM
>>> node will be chosen instead with the patch.
>>> 
>>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com
>>
>> Reverting this commit on the top of today's linux-next fixed a boot crash
>> on arm64 NUMA systems.
>>
>>  [ boot crash log snipped; identical to the log quoted earlier in this thread ]
>
> Thanks for reporting!  Can you try whether the following debug patch can fix the issue?
>
> Best Regards,
> Huang, Ying
>
> ----------------------------8<-------------------------------------------
> From 176d185426730111e763eb386d0210561f021dbc Mon Sep 17 00:00:00 2001
> From: Huang Ying <ying.huang@intel.com>
> Date: Wed, 2 Mar 2022 08:54:01 +0800
> Subject: [PATCH] dbg KASAN error
>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a3f0ea216ccb..1fe7a4510cca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2405,7 +2405,7 @@ static void task_numa_placement(struct task_struct *p)
>  	}
>  
>  	/* Cannot migrate task to CPU-less node */
> -	if (!node_state(max_nid, N_CPU)) {
> +	if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
>  		int near_nid = max_nid;
>  		int distance, near_distance = INT_MAX;

Do you have time to give this patch a try?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node
  2022-03-07  5:51         ` Huang, Ying
@ 2022-03-07 13:53           ` Qian Cai
  2022-03-08  0:40             ` Huang, Ying
  0 siblings, 1 reply; 15+ messages in thread
From: Qian Cai @ 2022-03-07 13:53 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-tip-commits, Peter Zijlstra (Intel),
	x86, linux-arm-kernel

On Mon, Mar 07, 2022 at 01:51:55PM +0800, Huang, Ying wrote:
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a3f0ea216ccb..1fe7a4510cca 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2405,7 +2405,7 @@ static void task_numa_placement(struct task_struct *p)
> >  	}
> >  
> >  	/* Cannot migrate task to CPU-less node */
> > -	if (!node_state(max_nid, N_CPU)) {
> > +	if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
> >  		int near_nid = max_nid;
> >  		int distance, near_distance = INT_MAX;
> 
> Do you have time to give this patch a try?

Ah, I thought I had already replied to this a while ago. Anyway, it works fine.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node
  2022-03-07 13:53           ` Qian Cai
@ 2022-03-08  0:40             ` Huang, Ying
  0 siblings, 0 replies; 15+ messages in thread
From: Huang, Ying @ 2022-03-08  0:40 UTC (permalink / raw)
  To: Qian Cai
  Cc: linux-kernel, linux-tip-commits, Peter Zijlstra (Intel),
	x86, linux-arm-kernel

Qian Cai <quic_qiancai@quicinc.com> writes:

> On Mon, Mar 07, 2022 at 01:51:55PM +0800, Huang, Ying wrote:
>> > ---
>> >  kernel/sched/fair.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index a3f0ea216ccb..1fe7a4510cca 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -2405,7 +2405,7 @@ static void task_numa_placement(struct task_struct *p)
>> >  	}
>> >  
>> >  	/* Cannot migrate task to CPU-less node */
>> > -	if (!node_state(max_nid, N_CPU)) {
>> > +	if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
>> >  		int near_nid = max_nid;
>> >  		int distance, near_distance = INT_MAX;
>> 
>> Do you have time to give this patch a try?
>
> Ah, I thought I has already replied it a while ago. Anyway, it works fine.

Thanks!

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH -V3 2/2 UPDATE] NUMA balancing: avoid to migrate task to CPU-less node
  2022-02-14 12:15 ` [PATCH -V3 2/2] NUMA balancing: avoid to migrate task to CPU-less node Huang Ying
  2022-02-17 18:56   ` [tip: sched/core] sched/numa: Avoid migrating " tip-bot2 for Huang Ying
@ 2022-03-08  2:05   ` Huang, Ying
  2022-03-08  2:11     ` Huang, Ying
  1 sibling, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2022-03-08  2:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Valentin Schneider, Ingo Molnar, Mel Gorman,
	Rik van Riel, Srikar Dronamraju

In a typical memory tiering system, there's no CPU in the slow (PMEM)
NUMA nodes.  But if a PMEM node has the maximum number of NUMA hint
page faults for a task, the current NUMA balancing policy may try to
place the task on that PMEM node instead of a DRAM node.  This is
unreasonable, because there's no CPU in PMEM NUMA nodes.  To fix this,
this patch ignores CPU-less nodes when searching for the migration
target node of a task.

To test the patch, we run a workload that accesses more memory in the
PMEM node than in the DRAM node.  Without the patch, the PMEM node is
chosen as the preferred node in task_numa_placement(); with the patch,
the DRAM node is chosen instead.
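
A minimal sketch of such a workload is below.  It is for illustration
only and assumes libnuma is available, NUMA balancing is enabled, and,
purely as an example, that node 0 is a DRAM node and node 2 is a PMEM
node; those node numbers are assumptions, not part of the patch.

/*
 * Hypothetical reproducer sketch: bind a large buffer to the PMEM node
 * and a smaller one to the DRAM node, then keep touching both so most
 * NUMA hint faults land on the PMEM node.  Build with:
 * gcc -O2 -o repro repro.c -lnuma.  Node numbers are assumptions.
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	size_t pmem_sz = 4UL << 30;	/* 4 GiB on the PMEM node (node 2) */
	size_t dram_sz = 1UL << 30;	/* 1 GiB on the DRAM node (node 0) */
	char *pmem, *dram;
	int i;

	if (numa_available() < 0) {
		fprintf(stderr, "libnuma: NUMA not supported\n");
		return 1;
	}

	pmem = numa_alloc_onnode(pmem_sz, 2);
	dram = numa_alloc_onnode(dram_sz, 0);
	if (!pmem || !dram) {
		fprintf(stderr, "allocation failed\n");
		return 1;
	}

	/* Touch the buffers repeatedly to keep generating hint faults. */
	for (i = 0; i < 100; i++) {
		memset(pmem, i, pmem_sz);
		memset(dram, i, dram_sz);
	}

	numa_free(pmem, pmem_sz);
	numa_free(dram, dram_sz);
	return 0;
}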

Known issue: I don't have systems to test complex NUMA topology types,
for example, NUMA_BACKPLANE or NUMA_GLUELESS_MESH.

v3:

- Fix a boot crash caused by a corner case that wasn't covered.  Thanks
  to Qian Cai for reporting and testing the bug!

- Fix several places that were missed, where CPU-less nodes could still
  be used as migration targets.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-and-tested-by: Qian Cai <quic_qiancai@quicinc.com> # boot crash
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04968f3f9b6d..1fe7a4510cca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1988,7 +1988,7 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	ng = deref_curr_numa_group(p);
 	if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) {
-		for_each_online_node(nid) {
+		for_each_node_state(nid, N_CPU) {
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
 
@@ -2086,13 +2086,13 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
 	unsigned long faults, max_faults = 0;
 	int nid, active_nodes = 0;
 
-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults > max_faults)
 			max_faults = faults;
 	}
 
-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults * ACTIVE_NODE_FRACTION > max_faults)
 			active_nodes++;
@@ -2246,7 +2246,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 
 		dist = sched_max_numa_distance;
 
-		for_each_online_node(node) {
+		for_each_node_state(node, N_CPU) {
 			score = group_weight(p, node, dist);
 			if (score > max_score) {
 				max_score = score;
@@ -2265,7 +2265,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 	 * inside the highest scoring group of nodes. The nodemask tricks
 	 * keep the complexity of the search down.
 	 */
-	nodes = node_online_map;
+	nodes = node_states[N_CPU];
 	for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) {
 		unsigned long max_faults = 0;
 		nodemask_t max_group = NODE_MASK_NONE;
@@ -2404,6 +2404,21 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
+	/* Cannot migrate task to CPU-less node */
+	if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
+		int near_nid = max_nid;
+		int distance, near_distance = INT_MAX;
+
+		for_each_node_state(nid, N_CPU) {
+			distance = node_distance(max_nid, nid);
+			if (distance < near_distance) {
+				near_nid = nid;
+				near_distance = distance;
+			}
+		}
+		max_nid = near_nid;
+	}
+
 	if (ng) {
 		numa_group_count_active_nodes(ng);
 		spin_unlock_irq(group_lock);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH -V3 2/2 UPDATE] NUMA balancing: avoid to migrate task to CPU-less node
  2022-03-08  2:05   ` [PATCH -V3 2/2 UPDATE] NUMA balancing: avoid to migrate " Huang, Ying
@ 2022-03-08  2:11     ` Huang, Ying
  2022-03-16  0:37       ` Huang, Ying
  0 siblings, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2022-03-08  2:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Valentin Schneider, Ingo Molnar, Mel Gorman,
	Rik van Riel, Srikar Dronamraju

Hi, Peter,

"Huang, Ying" <ying.huang@intel.com> writes:

> In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
> nodes.  But if the number of the hint page faults on a PMEM node is
> the max for a task, The current NUMA balancing policy may try to place
> the task on the PMEM node instead of DRAM node.  This is unreasonable,
> because there's no CPU in PMEM NUMA nodes.  To fix this, CPU-less
> nodes are ignored when searching the migration target node for a task
> in this patch.
>
> To test the patch, we run a workload that accesses more memory in PMEM
> node than memory in DRAM node.  Without the patch, the PMEM node will
> be chosen as preferred node in task_numa_placement().  While the DRAM
> node will be chosen instead with the patch.
>
> Known issue: I don't have systems to test complex NUMA topology type,
> for example, NUMA_BACKPLANE or NUMA_GLUELESS_MESH.
>
> v3:
>
> - Fix a boot crash for some uncovered marginal condition.  Thanks Qian
>   Cai for reporting and testing the bug!
>
> - Fix several missing places to use CPU-less nodes as migrating
>   target.
>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Reported-and-tested-by: Qian Cai <quic_qiancai@quicinc.com> # boot crash
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Can you update the patch to fix the bug?  Or would you prefer an
incremental patch?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -V3 2/2 UPDATE] NUMA balancing: avoid to migrate task to CPU-less node
  2022-03-08  2:11     ` Huang, Ying
@ 2022-03-16  0:37       ` Huang, Ying
  0 siblings, 0 replies; 15+ messages in thread
From: Huang, Ying @ 2022-03-16  0:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Valentin Schneider, Ingo Molnar, Mel Gorman,
	Rik van Riel, Srikar Dronamraju

Hi, Peter,

"Huang, Ying" <ying.huang@intel.com> writes:

> Hi, Peter,
>
> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
>> nodes.  But if the number of the hint page faults on a PMEM node is
>> the max for a task, The current NUMA balancing policy may try to place
>> the task on the PMEM node instead of DRAM node.  This is unreasonable,
>> because there's no CPU in PMEM NUMA nodes.  To fix this, CPU-less
>> nodes are ignored when searching the migration target node for a task
>> in this patch.
>>
>> To test the patch, we run a workload that accesses more memory in PMEM
>> node than memory in DRAM node.  Without the patch, the PMEM node will
>> be chosen as preferred node in task_numa_placement().  While the DRAM
>> node will be chosen instead with the patch.
>>
>> Known issue: I don't have systems to test complex NUMA topology type,
>> for example, NUMA_BACKPLANE or NUMA_GLUELESS_MESH.
>>
>> v3:
>>
>> - Fix a boot crash for some uncovered marginal condition.  Thanks Qian
>>   Cai for reporting and testing the bug!
>>
>> - Fix several missing places to use CPU-less nodes as migrating
>>   target.
>>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Reported-and-tested-by: Qian Cai <quic_qiancai@quicinc.com> # boot crash
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Valentin Schneider <valentin.schneider@arm.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Rik van Riel <riel@surriel.com>
>> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>
> Can you update the patch to fix the bug?  Or you prefer the incremental
> patch?

Can you take a look at this?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-03-16  0:38 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-14 12:15 [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes Huang Ying
2022-02-14 12:15 ` [PATCH -V3 2/2] NUMA balancing: avoid to migrate task to CPU-less node Huang Ying
2022-02-17 18:56   ` [tip: sched/core] sched/numa: Avoid migrating " tip-bot2 for Huang Ying
2022-03-01 20:54     ` Qian Cai
2022-03-02  0:59       ` Huang, Ying
2022-03-02 12:37         ` Qian Cai
2022-03-07  5:51         ` Huang, Ying
2022-03-07 13:53           ` Qian Cai
2022-03-08  0:40             ` Huang, Ying
2022-03-08  2:05   ` [PATCH -V3 2/2 UPDATE] NUMA balancing: avoid to migrate " Huang, Ying
2022-03-08  2:11     ` Huang, Ying
2022-03-16  0:37       ` Huang, Ying
2022-02-14 15:05 ` [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes Peter Zijlstra
2022-02-15  1:29   ` Huang, Ying
2022-02-17 18:56 ` [tip: sched/core] sched/numa: Fix " tip-bot2 for Huang Ying

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).