linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] sched,numa: task placement for complex NUMA topologies
@ 2014-05-08 17:23 riel
  2014-05-08 17:23 ` [PATCH 1/4] numa,x86: store maximum numa node distance riel
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: riel @ 2014-05-08 17:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, mgorman, chegu_vinod

This patch series adds code for placement of tasks on a NUMA system
with complex NUMA topology. The code is fairly well isolated, and
does not impact things on systems with directly connected NUMA
topology.

The strategy is to adjust the score of each node by the scores of
nearby NUMA nodes, weighted by the NUMA distance between each nearby
node and the node being scored.

One of the main reasons for choosing this strategy is that it allows
the current code to stay the same, and simply add one extra call,
which does nothing on systems with simple NUMA topologies.

There is a tradeoff between performance and complexity, which is why
performance tests were done both with patches 1-2 only and with the
whole series (patches 1-4).
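
The strategy above can be sketched in a standalone illustration (not
the kernel code; the 4-node distance table, fault counts, and helper
name are made up for the example):

```c
#include <assert.h>

#define LOCAL_DISTANCE 10
#define NR_NODES 4

/* Hypothetical 4-node topology: two "islands" of two tightly coupled nodes. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 16, 22, 22 },
	{ 16, 10, 22, 22 },
	{ 22, 22, 10, 16 },
	{ 22, 22, 16, 10 },
};
static const int max_distance = 22;

/*
 * Score a node by its own faults, then add the faults on nearby nodes
 * in proportion to (max_distance - d) / (max_distance - LOCAL_DISTANCE).
 * Nodes at max_distance contribute nothing.
 */
static long adjusted_score(const long faults[], int nid)
{
	long score = 1000 * faults[nid];
	int node;

	for (node = 0; node < NR_NODES; node++) {
		int d = distance[node][nid];

		if (node == nid || d == max_distance)
			continue;

		score += 1000 * faults[node] * (max_distance - d) /
			 (max_distance - LOCAL_DISTANCE);
	}
	return score;
}
```

With faults {100, 50, 50, 0}, node 0 scores 125000 (its island
neighbor, node 1, contributes half of its faults) while node 2 scores
only 50000, so the workload is pulled toward the 0-1 island.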

SPECjbb2005 throughput on an 8 node system:

		average bops	standard deviation

vanilla		857814		53528
patches 1-2	926053		26590
patches 1-4	931791		13874


Full test results: 10 runs of two 4-node wide instances, showing the
throughput and the distribution of each JVM across memory nodes.

1) kernel without patches:
                                                                    Nodes
                                                                    -----
spec1.txt:           throughput =     901782.00 SPECjbb2005 bops   0,1,4,5
spec2.txt:           throughput =     861272.25 SPECjbb2005 bops   2,3,6,7

spec1.txt:           throughput =     828895.46 SPECjbb2005 bops   3,5,6,7
spec2.txt:           throughput =     834569.91 SPECjbb2005 bops   0,1,2,4

spec1.txt:           throughput =     798582.97 SPECjbb2005 bops   2,4,5,6
spec2.txt:           throughput =     856616.41 SPECjbb2005 bops   0,1,3,7

spec1.txt:           throughput =     805044.62 SPECjbb2005 bops   0,2,4,5
spec2.txt:           throughput =     827402.54 SPECjbb2005 bops   1,3,6,7

spec1.txt:           throughput =     836735.76 SPECjbb2005 bops   2,4,6,7
spec2.txt:           throughput =     830288.17 SPECjbb2005 bops   0,1,3,5

spec1.txt:           throughput =     856434.61 SPECjbb2005 bops   3,4,5,6
spec2.txt:           throughput =     831602.46 SPECjbb2005 bops   0,1,2,7

spec1.txt:           throughput =     796366.70 SPECjbb2005 bops   0,4,6,7
spec2.txt:           throughput =     829059.05 SPECjbb2005 bops   1,2,3,5

spec1.txt:           throughput =     820001.99 SPECjbb2005 bops   1,4,5,7
spec2.txt:           throughput =     836664.23 SPECjbb2005 bops   0,2,3,6

spec1.txt:           throughput =     950736.30 SPECjbb2005 bops   4,5,6,7
spec2.txt:           throughput =     931937.24 SPECjbb2005 bops   0,1,2,3

spec1.txt:           throughput =     956298.83 SPECjbb2005 bops   4,5,6,7
spec2.txt:           throughput =     965984.19 SPECjbb2005 bops   0,1,2,3


2) kernel with patches 1 - 2:
                                                                    Nodes
                                                                    -----
spec1.txt:           throughput =     923364.71 SPECjbb2005 bops  0,1,2,3
spec2.txt:           throughput =     912929.23 SPECjbb2005 bops  4,5,6,7

spec1.txt:           throughput =     889254.77 SPECjbb2005 bops  0,1,6,7
spec2.txt:           throughput =     870275.31 SPECjbb2005 bops  2,3,4,5

spec1.txt:           throughput =     950790.19 SPECjbb2005 bops  4,5,6,7
spec2.txt:           throughput =     940626.50 SPECjbb2005 bops  0,1,2,3

spec1.txt:           throughput =     915422.01 SPECjbb2005 bops  4,5,6,7
spec2.txt:           throughput =     934301.50 SPECjbb2005 bops  0,1,2,3

spec1.txt:           throughput =     934467.57 SPECjbb2005 bops  4,5,6,7
spec2.txt:           throughput =     923753.88 SPECjbb2005 bops  0,1,2,3

spec1.txt:           throughput =     951261.79 SPECjbb2005 bops  0,1,2,3
spec2.txt:           throughput =     950683.73 SPECjbb2005 bops  4,5,6,7

spec1.txt:           throughput =     954495.68 SPECjbb2005 bops  0,1,2,3
spec2.txt:           throughput =     943342.78 SPECjbb2005 bops  4,5,6,7

spec1.txt:           throughput =     943429.43 SPECjbb2005 bops  0,1,2,3
spec2.txt:           throughput =     924885.84 SPECjbb2005 bops  4,5,6,7

spec1.txt:           throughput =     962104.96 SPECjbb2005 bops  4,5,6,7
spec2.txt:           throughput =     924263.56 SPECjbb2005 bops  0,1,2,3

spec1.txt:           throughput =     901460.65 SPECjbb2005 bops  2,3,6,7
spec2.txt:           throughput =     869961.50 SPECjbb2005 bops  0,1,4,5


3) kernel with patches 1 - 4:
                                                                    Nodes
                                                                    -----
spec1.txt:           throughput =     929011.11 SPECjbb2005 bops   4,5,6,7
spec2.txt:           throughput =     955879.30 SPECjbb2005 bops   0,1,2,3

spec1.txt:           throughput =     902287.44 SPECjbb2005 bops   4,5,6,7
spec2.txt:           throughput =     932429.31 SPECjbb2005 bops   0,1,2,3

spec1.txt:           throughput =     909671.34 SPECjbb2005 bops   0,1,2,3
spec2.txt:           throughput =     934057.16 SPECjbb2005 bops   4,5,6,7

spec1.txt:           throughput =     940457.57 SPECjbb2005 bops   0,1,2,3
spec2.txt:           throughput =     951272.07 SPECjbb2005 bops   4,5,6,7

spec1.txt:           throughput =     936920.56 SPECjbb2005 bops   0,1,2,3
spec2.txt:           throughput =     947703.47 SPECjbb2005 bops   4,5,6,7

spec1.txt:           throughput =     924643.52 SPECjbb2005 bops   4,5,6,7
spec2.txt:           throughput =     939721.24 SPECjbb2005 bops   0,1,2,3

spec1.txt:           throughput =     935592.32 SPECjbb2005 bops   0,1,2,3
spec2.txt:           throughput =     930367.11 SPECjbb2005 bops   4,5,6,7

spec1.txt:           throughput =     949498.38 SPECjbb2005 bops   0,1,2,3
spec2.txt:           throughput =     933686.74 SPECjbb2005 bops   4,5,6,7

spec1.txt:           throughput =     924315.44 SPECjbb2005 bops   0,1,2,3
spec2.txt:           throughput =     919825.21 SPECjbb2005 bops   4,5,6,7

spec1.txt:           throughput =     917190.04 SPECjbb2005 bops   0,1,2,3
spec2.txt:           throughput =     921301.56 SPECjbb2005 bops   4,5,6,7


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/4] numa,x86: store maximum numa node distance
  2014-05-08 17:23 [PATCH 0/4] sched,numa: task placement for complex NUMA topologies riel
@ 2014-05-08 17:23 ` riel
  2014-05-09  9:45   ` Peter Zijlstra
  2014-05-08 17:23 ` [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies riel
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: riel @ 2014-05-08 17:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, mgorman, chegu_vinod

From: Rik van Riel <riel@redhat.com>

Store the maximum node distance, so the numa placement code can do
better placement on systems with complex numa topology.

The function max_node_distance() will return LOCAL_DISTANCE if the
system has a simple NUMA topology, with only a single level of
remote distance.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
---
 arch/x86/include/asm/topology.h |  3 +++
 arch/x86/mm/numa.c              | 25 +++++++++++++++++++++++++
 include/linux/topology.h        |  3 +++
 3 files changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 0e8f04f..3ed2464 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -95,6 +95,9 @@ extern void setup_node_to_cpumask_map(void);
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
+extern int max_node_distance(void);
+#define max_node_distance max_node_distance
+
 #else /* !CONFIG_NUMA */
 
 static inline int numa_node_id(void)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1d045f9..525cde9 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -34,6 +34,8 @@ __initdata
 
 static int numa_distance_cnt;
 static u8 *numa_distance;
+static int numa_max_distance;
+static bool numa_complex_topology;
 
 static __init int numa_setup(char *opt)
 {
@@ -357,6 +359,19 @@ void __init numa_reset_distance(void)
 		memblock_free(__pa(numa_distance), size);
 	numa_distance_cnt = 0;
 	numa_distance = NULL;	/* enable table creation */
+	numa_max_distance = LOCAL_DISTANCE;
+	numa_complex_topology = false;
+}
+
+static void __init numa_set_max_distance(int distance)
+{
+	/* Remember the maximum node distance seen. */
+	if (distance > numa_max_distance)
+		numa_max_distance = distance;
+
+	/* Do we see any values in-between local distance and the maximum? */
+	if (distance != LOCAL_DISTANCE && distance != numa_max_distance)
+		numa_complex_topology = true;
 }
 
 static int __init numa_alloc_distance(void)
@@ -437,6 +452,16 @@ void __init numa_set_distance(int from, int to, int distance)
 	}
 
 	numa_distance[from * numa_distance_cnt + to] = distance;
+
+	numa_set_max_distance(distance);
+}
+
+int max_node_distance(void)
+{
+	if (!numa_complex_topology)
+		return LOCAL_DISTANCE;
+
+	return numa_max_distance;
 }
 
 int __node_distance(int from, int to)
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7062330..6781b8d 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -54,6 +54,9 @@ int arch_update_cpu_topology(void);
 #ifndef node_distance
 #define node_distance(from,to)	((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
 #endif
+#ifndef max_node_distance
+#define max_node_distance() (LOCAL_DISTANCE)
+#endif
 #ifndef RECLAIM_DISTANCE
 /*
  * If the distance between nodes in a system is larger than RECLAIM_DISTANCE
-- 
1.8.5.3



* [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies
  2014-05-08 17:23 [PATCH 0/4] sched,numa: task placement for complex NUMA topologies riel
  2014-05-08 17:23 ` [PATCH 1/4] numa,x86: store maximum numa node distance riel
@ 2014-05-08 17:23 ` riel
  2014-05-09  9:53   ` Peter Zijlstra
                     ` (4 more replies)
  2014-05-08 17:23 ` [PATCH 3/4] sched,numa: store numa_group's preferred nid riel
  2014-05-08 17:23 ` [PATCH 4/4] sched,numa: pull workloads towards their preferred nodes riel
  3 siblings, 5 replies; 16+ messages in thread
From: riel @ 2014-05-08 17:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, mgorman, chegu_vinod

From: Rik van Riel <riel@redhat.com>

Workloads that span multiple NUMA nodes benefit greatly from being placed
on nearby nodes. There are two common configurations on 8 node NUMA systems.
One has four "islands" of 2 tightly coupled nodes, another has two "islands"
of 4 tightly coupled nodes.

When dealing with two 4-node wide workloads on such a system, the current
NUMA code relies on luck. When a workload is placed on adjacent nodes,
performance is great. When a workload is spread across far-away nodes,
performance sucks.

This patch adjusts the numa scores of each node by adding the scores
of nearby nodes, weighted by the NUMA distance. This results in workloads
being pulled together on nearby nodes, and improved performance for
workloads that span multiple nodes, on larger NUMA systems.

This patch does nothing on machines with simple NUMA topologies.

On an 8 node system, this patch manages to correctly place two 4-node wide
SPECjbb2005 instances on adjacent nodes around 80% of the time, improving
average performance from 857814 to 926054 bops.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
---
 kernel/sched/fair.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 78 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 302facf..5925667 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -923,6 +923,63 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
 }
 
 /*
+ * On systems with complex NUMA topology, we deal with the local node,
+ * far-away nodes, and nodes that are somewhere in-between. The goal is
+ * to pull workloads that span multiple nodes to nodes that are near each
+ * other. This is accomplished by adding a score for nearby nodes to
+ * the local node score, based on the faults on those nearby nodes and
+ * the proximity of those nodes.
+ */
+static inline unsigned long nearby_nodes_score(struct task_struct *p, int nid,
+						bool task)
+{
+	int max_distance = max_node_distance();
+	unsigned long score = 0;
+	int node;
+
+	/*
+	 * No need to calculate a score if the system has a simple NUMA
+	 * topology, with no node distances between "local" and "far away".
+	 */
+	if (max_distance == LOCAL_DISTANCE)
+		return 0;
+
+	for_each_online_node(node) {
+		int distance;
+		unsigned long faults;
+
+		/* Already scored by the calling function. */
+		if (node == nid)
+			continue;
+
+		/* Skip far-away nodes. We only care about nearby ones. */
+		distance = node_distance(node, nid);
+		if (distance == max_distance)
+			continue;
+
+		/*
+		 * For nodes with distances in-between LOCAL_DISTANCE
+		 * and max_distance, we count the faults on those nodes
+		 * in proportion to their distance, using this formula:
+		 *
+		 * max_distance - node_distance
+		 * -----------------------------
+		 * max_distance - LOCAL_DISTANCE
+		 */
+		if (task)
+			faults = task_faults(p, node);
+		else
+			faults = group_faults(p, node);
+
+		score += 1000 * faults *
+				(max_distance - distance) /
+				(max_distance - LOCAL_DISTANCE);
+	}
+
+	return score;
+}
+
+/*
  * These return the fraction of accesses done by a particular task, or
  * task group, on a particular numa node.  The group weight is given a
  * larger multiplier, in order to group tasks together that are almost
@@ -930,7 +987,7 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
  */
 static inline unsigned long task_weight(struct task_struct *p, int nid)
 {
-	unsigned long total_faults;
+	unsigned long total_faults, score;
 
 	if (!p->numa_faults_memory)
 		return 0;
@@ -940,15 +997,32 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
 	if (!total_faults)
 		return 0;
 
-	return 1000 * task_faults(p, nid) / total_faults;
+	score = 1000 * task_faults(p, nid);
+	score += nearby_nodes_score(p, nid, true);
+
+	score /= total_faults;
+
+	return score;
 }
 
 static inline unsigned long group_weight(struct task_struct *p, int nid)
 {
-	if (!p->numa_group || !p->numa_group->total_faults)
+	unsigned long total_faults, score;
+
+	if (!p->numa_group)
+		return 0;
+
+	total_faults = p->numa_group->total_faults;
+
+	if (!total_faults)
 		return 0;
 
-	return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
+	score = 1000 * group_faults(p, nid);
+	score += nearby_nodes_score(p, nid, false);
+
+	score /= total_faults;
+
+	return score;
 }
 
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
-- 
1.8.5.3



* [PATCH 3/4] sched,numa: store numa_group's preferred nid
  2014-05-08 17:23 [PATCH 0/4] sched,numa: task placement for complex NUMA topologies riel
  2014-05-08 17:23 ` [PATCH 1/4] numa,x86: store maximum numa node distance riel
  2014-05-08 17:23 ` [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies riel
@ 2014-05-08 17:23 ` riel
  2014-05-08 17:23 ` [PATCH 4/4] sched,numa: pull workloads towards their preferred nodes riel
  3 siblings, 0 replies; 16+ messages in thread
From: riel @ 2014-05-08 17:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, mgorman, chegu_vinod

From: Rik van Riel <riel@redhat.com>

Store a numa_group's preferred nid. Used by the next patch to pull
workloads towards their preferred nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5925667..99cc829 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -864,6 +864,7 @@ struct numa_group {
 
 	spinlock_t lock; /* nr_tasks, tasks */
 	int nr_tasks;
+	int preferred_nid;
 	pid_t gid;
 	struct list_head task_list;
 
@@ -1642,6 +1643,7 @@ static void task_numa_placement(struct task_struct *p)
 
 	if (p->numa_group) {
 		update_numa_active_node_mask(p->numa_group);
+		p->numa_group->preferred_nid = max_group_nid;
 		/*
 		 * If the preferred task and group nids are different,
 		 * iterate over the nodes again to find the best place.
@@ -1701,6 +1703,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 		spin_lock_init(&grp->lock);
 		INIT_LIST_HEAD(&grp->task_list);
 		grp->gid = p->pid;
+		grp->preferred_nid = -1;
 		/* Second half of the array tracks nids where faults happen */
 		grp->faults_cpu = grp->faults + NR_NUMA_HINT_FAULT_TYPES *
 						nr_node_ids;
-- 
1.8.5.3



* [PATCH 4/4] sched,numa: pull workloads towards their preferred nodes
  2014-05-08 17:23 [PATCH 0/4] sched,numa: task placement for complex NUMA topologies riel
                   ` (2 preceding siblings ...)
  2014-05-08 17:23 ` [PATCH 3/4] sched,numa: store numa_group's preferred nid riel
@ 2014-05-08 17:23 ` riel
  3 siblings, 0 replies; 16+ messages in thread
From: riel @ 2014-05-08 17:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, mgorman, chegu_vinod

From: Rik van Riel <riel@redhat.com>

Give a bonus to nodes near a workload's preferred node. This will pull
workloads towards their preferred node.

For workloads that span multiple NUMA nodes, pseudo-interleaving will
even out the memory use between nodes over time, causing the preferred
node to move around over time.

Over time, this movement causes the preferred nodes to end up on
opposite sides of the system, untangling workloads that were spread
all over the system and moving them onto adjacent nodes.

The perturbation introduced by this patch enables the kernel to
reliably untangle two 4-node wide SPECjbb2005 instances on an 8 node
system, improving average performance from 857814 to 931792 bops.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
---
 kernel/sched/fair.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99cc829..cffa829 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -932,7 +932,7 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
  * the proximity of those nodes.
  */
 static inline unsigned long nearby_nodes_score(struct task_struct *p, int nid,
-						bool task)
+						bool task, bool *preferred_nid)
 {
 	int max_distance = max_node_distance();
 	unsigned long score = 0;
@@ -949,6 +949,15 @@ static inline unsigned long nearby_nodes_score(struct task_struct *p, int nid,
 		int distance;
 		unsigned long faults;
 
+		/*
+		 * Pseudo-interleaving balances out the memory use between the
+		 * nodes where a workload runs, so the preferred node should
+		 * change over time. This helps separate two workloads onto
+		 * separate sides of the system.
+		 */
+		if (p->numa_group && node == p->numa_group->preferred_nid)
+			*preferred_nid = true;
+
 		/* Already scored by the calling function. */
 		if (node == nid)
 			continue;
@@ -989,6 +998,7 @@ static inline unsigned long nearby_nodes_score(struct task_struct *p, int nid,
 static inline unsigned long task_weight(struct task_struct *p, int nid)
 {
 	unsigned long total_faults, score;
+	bool near_preferred_nid = false;
 
 	if (!p->numa_faults_memory)
 		return 0;
@@ -999,7 +1009,7 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
 		return 0;
 
 	score = 1000 * task_faults(p, nid);
-	score += nearby_nodes_score(p, nid, true);
+	score += nearby_nodes_score(p, nid, true, &near_preferred_nid);
 
 	score /= total_faults;
 
@@ -1009,6 +1019,7 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
 static inline unsigned long group_weight(struct task_struct *p, int nid)
 {
 	unsigned long total_faults, score;
+	bool near_preferred_nid = false;
 
 	if (!p->numa_group)
 		return 0;
@@ -1019,7 +1030,15 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
 		return 0;
 
 	score = 1000 * group_faults(p, nid);
-	score += nearby_nodes_score(p, nid, false);
+	score += nearby_nodes_score(p, nid, false, &near_preferred_nid);
+
+	/*
+	 * Pull workloads towards their preferred node, with the minimum
+	 * multiplier required to be a tie-breaker when two groups of nodes
+	 * have the same amount of memory.
+	 */
+	if (near_preferred_nid)
+		score *= (max_node_distance() - LOCAL_DISTANCE);
 
 	score /= total_faults;
 
-- 
1.8.5.3


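
The tie-breaker bonus from this patch can be sketched in isolation
(the helper name and all numbers below are invented for the example):

```c
#include <assert.h>

#define LOCAL_DISTANCE 10

/*
 * A node near the numa_group's preferred nid gets its score multiplied
 * by (max_node_distance() - LOCAL_DISTANCE): per the changelog, the
 * minimum multiplier that acts as a tie-breaker when two groups of
 * nodes hold the same amount of memory.
 */
static unsigned long apply_preferred_bonus(unsigned long score,
					   int max_distance,
					   int near_preferred_nid)
{
	if (near_preferred_nid)
		score *= max_distance - LOCAL_DISTANCE;
	return score;
}
```

With a made-up max_distance of 22, a score of 500 becomes 6000 near
the preferred nid and stays 500 elsewhere, so otherwise-equal node
groups resolve toward the preferred side.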

* Re: [PATCH 1/4] numa,x86: store maximum numa node distance
  2014-05-08 17:23 ` [PATCH 1/4] numa,x86: store maximum numa node distance riel
@ 2014-05-09  9:45   ` Peter Zijlstra
  2014-05-09 15:08     ` Rik van Riel
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2014-05-09  9:45 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mingo, mgorman, chegu_vinod

On Thu, May 08, 2014 at 01:23:28PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> Store the maximum node distance, so the numa placement code can do
> better placement on systems with complex numa topology.
> 
> The function max_node_distance will return LOCAL_DISTANCE if the
> system has simple NUMA topology, with only a single level of
> remote distance.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Tested-by: Chegu Vinod <chegu_vinod@hp.com>
> ---
>  arch/x86/include/asm/topology.h |  3 +++
>  arch/x86/mm/numa.c              | 25 +++++++++++++++++++++++++
>  include/linux/topology.h        |  3 +++
>  3 files changed, 31 insertions(+)
> 

Why are you doing this in arch code? I would've expected some extra code
to sched_init_numa() which is generic code that analyses the distance
table and reconstructs the actual topology from it.


* Re: [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies
  2014-05-08 17:23 ` [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies riel
@ 2014-05-09  9:53   ` Peter Zijlstra
  2014-05-09 15:14     ` Rik van Riel
  2014-05-09  9:54   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2014-05-09  9:53 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mingo, mgorman, chegu_vinod

On Thu, May 08, 2014 at 01:23:29PM -0400, riel@redhat.com wrote:
> +		/*
> +		 * For nodes with distances in-between LOCAL_DISTANCE
> +		 * and max_distance, we count the faults on those nodes
> +		 * in proportion to their distance, using this formula:
> +		 *
> +		 * max_distance - node_distance
> +		 * -----------------------------
> +		 * max_distance - LOCAL_DISTANCE
> +		 */
> +		if (task)
> +			faults = task_faults(p, node);
> +		else
> +			faults = group_faults(p, node);
> +
> +		score += 1000 * faults *
> +				(max_distance - distance) /
> +				(max_distance - LOCAL_DISTANCE);

OK that makes sense, except I would suggest you use a power-of-two scale
factor :-)
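
One reading of the power-of-two suggestion (an assumption, not
spelled out in the message) is to replace the 1000 scale with 1024 so
the scaling multiply compiles to a shift; the helper below is invented
for the illustration:

```c
#include <assert.h>

/*
 * Hypothetical variant of the patch's fault-weighting expression with
 * a 1024 (power-of-two) scale factor: faults << 10 instead of
 * faults * 1000. The division by the distance range is unchanged.
 */
static unsigned long scaled_faults(unsigned long faults, int distance,
				   int max_distance, int local_distance)
{
	return (faults << 10) * (max_distance - distance) /
	       (max_distance - local_distance);
}
```

For example, 50 faults at distance 16 on a 10..22 distance range scale
to 25600, against 25000 with the 1000 factor; consumers comparing
scores are unaffected as long as one factor is used consistently.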


* Re: [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies
  2014-05-08 17:23 ` [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies riel
  2014-05-09  9:53   ` Peter Zijlstra
@ 2014-05-09  9:54   ` Peter Zijlstra
  2014-05-09 10:03   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2014-05-09  9:54 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mingo, mgorman, chegu_vinod

On Thu, May 08, 2014 at 01:23:29PM -0400, riel@redhat.com wrote:
> +	/*
> +	 * No need to calculate a score if the system has a simple NUMA
> +	 * topology, with no node distances between "local" and "far away".
> +	 */
> +	if (max_distance == LOCAL_DISTANCE)
> +		return 0;

That's basically !numa and we shouldn't be getting here, so I think you
can make that a WARN_ON_ONCE().


* Re: [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies
  2014-05-08 17:23 ` [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies riel
  2014-05-09  9:53   ` Peter Zijlstra
  2014-05-09  9:54   ` Peter Zijlstra
@ 2014-05-09 10:03   ` Peter Zijlstra
  2014-05-09 15:16     ` Rik van Riel
  2014-05-09 10:11   ` Peter Zijlstra
  2014-05-09 10:13   ` Peter Zijlstra
  4 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2014-05-09 10:03 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mingo, mgorman, chegu_vinod

On Thu, May 08, 2014 at 01:23:29PM -0400, riel@redhat.com wrote:
>  static inline unsigned long task_weight(struct task_struct *p, int nid)
>  {
> -	unsigned long total_faults;
> +	unsigned long total_faults, score;
>  
>  	if (!p->numa_faults_memory)
>  		return 0;
> @@ -940,15 +997,32 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
>  	if (!total_faults)
>  		return 0;
>  
> -	return 1000 * task_faults(p, nid) / total_faults;
> +	score = 1000 * task_faults(p, nid);
> +	score += nearby_nodes_score(p, nid, true);
> +
> +	score /= total_faults;
> +
> +	return score;
>  }

So you add an O(nr_nodes) loop to task_weight(), but that in itself is
already called from O(nr_nodes) loops, yielding a total complexity of
O(nr_nodes^2).

This might be fine, but algorithmic complexity should very much be a
part of the changelog I think.
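
The blow-up being described can be modeled with a toy example
(functions and weights invented purely to count loop iterations):

```c
#include <assert.h>

/*
 * Toy model of the cost: the placement loop calls a weight function
 * for every node, and with this series the weight function itself
 * walks every node, so inner-loop work is O(nr_nodes^2).
 */
static int inner_iterations;

static int toy_weight(int nr_nodes, int nid)
{
	int node, weight = 0;

	for (node = 0; node < nr_nodes; node++) {
		inner_iterations++;	/* one step of the nearby-nodes scan */
		if (node != nid)
			weight++;
	}
	return weight;
}

static int toy_placement(int nr_nodes)
{
	int nid, max_nid = 0, max_weight = -1;

	for (nid = 0; nid < nr_nodes; nid++) {
		int weight = toy_weight(nr_nodes, nid);

		if (weight > max_weight) {
			max_weight = weight;
			max_nid = nid;
		}
	}
	return max_nid;
}
```

On an 8-node model this executes 64 inner iterations per placement
pass, where the pre-series code did 8.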


* Re: [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies
  2014-05-08 17:23 ` [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies riel
                     ` (2 preceding siblings ...)
  2014-05-09 10:03   ` Peter Zijlstra
@ 2014-05-09 10:11   ` Peter Zijlstra
  2014-05-09 15:11     ` Rik van Riel
  2014-05-09 10:13   ` Peter Zijlstra
  4 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2014-05-09 10:11 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mingo, mgorman, chegu_vinod

[-- Attachment #1: Type: text/plain, Size: 2027 bytes --]

On Thu, May 08, 2014 at 01:23:29PM -0400, riel@redhat.com wrote:
> @@ -930,7 +987,7 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
>   */
>  static inline unsigned long task_weight(struct task_struct *p, int nid)
>  {
> -	unsigned long total_faults;
> +	unsigned long total_faults, score;
>  
>  	if (!p->numa_faults_memory)
>  		return 0;
> @@ -940,15 +997,32 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
>  	if (!total_faults)
>  		return 0;
>  
> -	return 1000 * task_faults(p, nid) / total_faults;
> +	score = 1000 * task_faults(p, nid);
> +	score += nearby_nodes_score(p, nid, true);
> +
> +	score /= total_faults;
> +
> +	return score;
>  }
>  
>  static inline unsigned long group_weight(struct task_struct *p, int nid)
>  {
> -	if (!p->numa_group || !p->numa_group->total_faults)
> +	unsigned long total_faults, score;
> +
> +	if (!p->numa_group)
> +		return 0;
> +
> +	total_faults = p->numa_group->total_faults;
> +
> +	if (!total_faults)
>  		return 0;
>  
> -	return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
> +	score = 1000 * group_faults(p, nid);
> +	score += nearby_nodes_score(p, nid, false);
> +
> +	score /= total_faults;
> +
> +	return score;
>  }

OK, and that's just sad..

See task_numa_placement(), which does:

	for_each_online_node(nid) {
		weight = task_weight(p, nid) + group_weight(p, nid);
		if (weight > max_weight) {
			max_weight = weight;
			max_nid = nid;
		}
	}

So not only is that loop now O(nr_nodes^2), the inner loops doubly
iterates all nodes.

Also, {task,group}_weight() functions were like cheap-ish (/me mumbles
something about people using !2^n scaling factors for no sane reason).
And they're used all over with that in mind.

But look what you did to migrate_improves_locality(), that will now
iterate all nodes _4_ times, and its called for every single task we try
and migrate during load balance, while holding rq->lock.




* Re: [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies
  2014-05-08 17:23 ` [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies riel
                     ` (3 preceding siblings ...)
  2014-05-09 10:11   ` Peter Zijlstra
@ 2014-05-09 10:13   ` Peter Zijlstra
  2014-05-09 15:03     ` Rik van Riel
  4 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2014-05-09 10:13 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mingo, mgorman, chegu_vinod

On Thu, May 08, 2014 at 01:23:29PM -0400, riel@redhat.com wrote:
> This patch does nothing on machines with simple NUMA topologies.

Was this:

> +	/*
> +	 * No need to calculate a score if the system has a simple NUMA
> +	 * topology, with no node distances between "local" and "far away".
> +	 */
> +	if (max_distance == LOCAL_DISTANCE)
> +		return 0;

Supposed to make that true?

It doesn't. That test is a !numa test, not a fully connected test.


* Re: [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies
  2014-05-09 10:13   ` Peter Zijlstra
@ 2014-05-09 15:03     ` Rik van Riel
  0 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2014-05-09 15:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, mgorman, chegu_vinod

On 05/09/2014 06:13 AM, Peter Zijlstra wrote:
> On Thu, May 08, 2014 at 01:23:29PM -0400, riel@redhat.com wrote:
>> This patch does nothing on machines with simple NUMA topologies.
>
> Was this:
>
>> +	/*
>> +	 * No need to calculate a score if the system has a simple NUMA
>> +	 * topology, with no node distances between "local" and "far away".
>> +	 */
>> +	if (max_distance == LOCAL_DISTANCE)
>> +		return 0;
>
> Supposed to make that true?
>
> It doesn't. That test is a !numa test, not a fully connected test.
>

Look at patch 1/4.  I only set max_distance to !LOCAL_DISTANCE
if the system has multiple different distances in the SLIT
table.

I guess that needs to be cleaned up :)
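
The behavior being described can be reproduced in a standalone sketch
of the patch 1 logic (distance values made up; LOCAL_DISTANCE is 10 as
on x86):

```c
#include <assert.h>
#include <stdbool.h>

#define LOCAL_DISTANCE 10

/*
 * numa_complex_topology only becomes true when some distance falls
 * strictly between LOCAL_DISTANCE and the maximum seen so far, so a
 * SLIT with a single remote distance keeps max_node_distance() at
 * LOCAL_DISTANCE.
 */
static int numa_max_distance = LOCAL_DISTANCE;
static bool numa_complex_topology;

static void numa_set_max_distance(int distance)
{
	if (distance > numa_max_distance)
		numa_max_distance = distance;

	if (distance != LOCAL_DISTANCE && distance != numa_max_distance)
		numa_complex_topology = true;
}

static int max_node_distance(void)
{
	return numa_complex_topology ? numa_max_distance : LOCAL_DISTANCE;
}
```

Feeding only {10, 16} leaves max_node_distance() at 10; feeding 16
again after a 22 flags the topology complex and returns 22. Note the
flag is only set when an intermediate distance is seen after the
maximum; since a full symmetric SLIT table repeats every distance,
this works out over a whole table.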


* Re: [PATCH 1/4] numa,x86: store maximum numa node distance
  2014-05-09  9:45   ` Peter Zijlstra
@ 2014-05-09 15:08     ` Rik van Riel
  0 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2014-05-09 15:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, mgorman, chegu_vinod

On 05/09/2014 05:45 AM, Peter Zijlstra wrote:
> On Thu, May 08, 2014 at 01:23:28PM -0400, riel@redhat.com wrote:
>> From: Rik van Riel <riel@redhat.com>
>>
>> Store the maximum node distance, so the numa placement code can do
>> better placement on systems with complex numa topology.
>>
>> The function max_node_distance will return LOCAL_DISTANCE if the
>> system has simple NUMA topology, with only a single level of
>> remote distance.
>>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> Tested-by: Chegu Vinod <chegu_vinod@hp.com>
>> ---
>>   arch/x86/include/asm/topology.h |  3 +++
>>   arch/x86/mm/numa.c              | 25 +++++++++++++++++++++++++
>>   include/linux/topology.h        |  3 +++
>>   3 files changed, 31 insertions(+)
>>
>
> Why are you doing this in arch code? I would've expected some extra code
> to sched_init_numa() which is generic code that analyses the distance
> table and reconstructs the actual topology from it.

The only reason it is in the arch code is that node_distance()
is in arch code today.

If there is no good reason for node_distance and the node
distances array to be in arch code, I could take a stab at
making it generic...



* Re: [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies
  2014-05-09 10:11   ` Peter Zijlstra
@ 2014-05-09 15:11     ` Rik van Riel
  0 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2014-05-09 15:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, mgorman, chegu_vinod

On 05/09/2014 06:11 AM, Peter Zijlstra wrote:
> On Thu, May 08, 2014 at 01:23:29PM -0400, riel@redhat.com wrote:
>> @@ -930,7 +987,7 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
>>    */
>>   static inline unsigned long task_weight(struct task_struct *p, int nid)
>>   {
>> -	unsigned long total_faults;
>> +	unsigned long total_faults, score;
>>
>>   	if (!p->numa_faults_memory)
>>   		return 0;
>> @@ -940,15 +997,32 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
>>   	if (!total_faults)
>>   		return 0;
>>
>> -	return 1000 * task_faults(p, nid) / total_faults;
>> +	score = 1000 * task_faults(p, nid);
>> +	score += nearby_nodes_score(p, nid, true);
>> +
>> +	score /= total_faults;
>> +
>> +	return score;
>>   }
>>
>>   static inline unsigned long group_weight(struct task_struct *p, int nid)
>>   {
>> -	if (!p->numa_group || !p->numa_group->total_faults)
>> +	unsigned long total_faults, score;
>> +
>> +	if (!p->numa_group)
>> +		return 0;
>> +
>> +	total_faults = p->numa_group->total_faults;
>> +
>> +	if (!total_faults)
>>   		return 0;
>>
>> -	return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
>> +	score = 1000 * group_faults(p, nid);
>> +	score += nearby_nodes_score(p, nid, false);
>> +
>> +	score /= total_faults;
>> +
>> +	return score;
>>   }
>
> OK, and that's just sad..
>
> See task_numa_placement(), which does:
>
> 	for_each_online_node(nid) {
> 		weight = task_weight(p, nid) + group_weight(p, nid);
> 		if (weight > max_weight) {
> 			max_weight = weight;
> 			max_nid = nid;
> 		}
> 	}
>
> So not only is that loop now O(nr_nodes^2), the inner loop doubly
> iterates over all nodes.

I am not too worried about task_numa_placement, but you are
right that this may well be much too expensive for more
frequently called code like migrate_improves_locality.

Having said that, grouping related tasks together on nearby
nodes does seem to bring significant performance gains.

Do you have any ideas on other ways we can achieve that
grouping?

> Also, {task,group}_weight() functions were like cheap-ish (/me mumbles
> something about people using !2^n scaling factors for no sane reason).
> And they're used all over with that in mind.
>
> But look what you did to migrate_improves_locality(), that will now
> iterate all nodes _4_ times, and its called for every single task we try
> and migrate during load balance, while holding rq->lock.
>
>



* Re: [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies
  2014-05-09  9:53   ` Peter Zijlstra
@ 2014-05-09 15:14     ` Rik van Riel
  0 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2014-05-09 15:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, mgorman, chegu_vinod

On 05/09/2014 05:53 AM, Peter Zijlstra wrote:
> On Thu, May 08, 2014 at 01:23:29PM -0400, riel@redhat.com wrote:
>> +		/*
>> +		 * For nodes with distances in-between LOCAL_DISTANCE
>> +		 * and max_distance, we count the faults on those nodes
>> +		 * in proportion to their distance, using this formula:
>> +		 *
>> +		 * max_distance - node_distance
>> +		 * -----------------------------
>> +		 * max_distance - LOCAL_DISTANCE
>> +		 */
>> +		if (task)
>> +			faults = task_faults(p, node);
>> +		else
>> +			faults = group_faults(p, node);
>> +
>> +		score += 1000 * faults *
>> +				(max_distance - distance) /
>> +				(max_distance - LOCAL_DISTANCE);
>
> OK that makes sense, except I would suggest you use a power-of-two scale
> factor :-)

I guess we could build a NUMA distance table that
counts the number of hops, and use that.

That is likely to result in better/easier values
for grouping than the (somewhat arbitrary) distances
in the SLIT table, anyway...
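
[Editor's note: the proportional weighting quoted above can be tried out in isolation. This is plain userspace arithmetic using the same 1000 scale factor as {task,group}_weight(); the function name is illustrative, not the kernel code itself.]

```c
#define LOCAL_DISTANCE 10

/*
 * Sketch of the quoted formula: a nearby node's faults count in
 * proportion to (max_distance - distance) / (max_distance - LOCAL_DISTANCE),
 * scaled by 1000 as in the patch.
 */
long nearby_score(long faults, int distance, int max_distance)
{
	if (distance >= max_distance)
		return 0;	/* the furthest nodes contribute nothing */

	return 1000 * faults * (max_distance - distance)
			/ (max_distance - LOCAL_DISTANCE);
}
```

With max_distance 40, a node at distance 10 contributes the full 1000 * faults, one at distance 20 contributes two thirds of that, and nodes at distance 40 contribute nothing. Making the denominator a power of two, as Peter suggests, would turn the divide into a shift; a hop-count table would naturally give such round values.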


* Re: [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies
  2014-05-09 10:03   ` Peter Zijlstra
@ 2014-05-09 15:16     ` Rik van Riel
  0 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2014-05-09 15:16 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, mgorman, chegu_vinod

On 05/09/2014 06:03 AM, Peter Zijlstra wrote:
> On Thu, May 08, 2014 at 01:23:29PM -0400, riel@redhat.com wrote:
>>   static inline unsigned long task_weight(struct task_struct *p, int nid)
>>   {
>> -	unsigned long total_faults;
>> +	unsigned long total_faults, score;
>>
>>   	if (!p->numa_faults_memory)
>>   		return 0;
>> @@ -940,15 +997,32 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
>>   	if (!total_faults)
>>   		return 0;
>>
>> -	return 1000 * task_faults(p, nid) / total_faults;
>> +	score = 1000 * task_faults(p, nid);
>> +	score += nearby_nodes_score(p, nid, true);
>> +
>> +	score /= total_faults;
>> +
>> +	return score;
>>   }
>
> So you add an O(nr_nodes) loop to task_weight(), but that in itself is
> already called from O(nr_nodes) loops, yielding a total complexity of
> O(nr_nodes^2).

However, it only does actual calculations for nodes that
are closer than the most distant nodes in the system.

Hopefully, even on the largest systems, that will mean an
"island" of a handful of nearby nodes, with everything else
at the same large distance.

> This might be fine, but algorithmic complexity should very much be a
> part of the changelog I think.

Agreed, I do need to document this kind of thing better,
if only because it gives people a chance to verify my
assumptions.



end of thread, other threads:[~2014-05-09 15:19 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-08 17:23 [PATCH 0/4] sched,numa: task placement for complex NUMA topologies riel
2014-05-08 17:23 ` [PATCH 1/4] numa,x86: store maximum numa node distance riel
2014-05-09  9:45   ` Peter Zijlstra
2014-05-09 15:08     ` Rik van Riel
2014-05-08 17:23 ` [PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies riel
2014-05-09  9:53   ` Peter Zijlstra
2014-05-09 15:14     ` Rik van Riel
2014-05-09  9:54   ` Peter Zijlstra
2014-05-09 10:03   ` Peter Zijlstra
2014-05-09 15:16     ` Rik van Riel
2014-05-09 10:11   ` Peter Zijlstra
2014-05-09 15:11     ` Rik van Riel
2014-05-09 10:13   ` Peter Zijlstra
2014-05-09 15:03     ` Rik van Riel
2014-05-08 17:23 ` [PATCH 3/4] sched,numa: store numa_group's preferred nid riel
2014-05-08 17:23 ` [PATCH 4/4] sched,numa: pull workloads towards their preferred nodes riel
