[RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

From: Lauro Ramos Venancio <lvenanci@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: lwang@redhat.com, riel@redhat.com, Mike Galbraith <efault@gmx.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@kernel.org>,
	Lauro Ramos Venancio <lvenanci@redhat.com>
Subject: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
Date: Thu, 13 Apr 2017 10:56:08 -0300
Message-ID: <1492091769-19879-3-git-send-email-lvenanci@redhat.com>
In-Reply-To: <1492091769-19879-1-git-send-email-lvenanci@redhat.com>

Currently, on a 4-node NUMA machine with ring topology, two sched
groups are generated for the last NUMA sched domain. One group has the
CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
1, 2 and 3. As the CPUs from nodes 1 and 3 belong to both groups, the
scheduler is unable to move tasks directly between these nodes. In the
worst case, when a set of tasks is bound to nodes 1 and 3, performance
is severely impacted because only one node is used while the other
remains idle.

This problem also affects machines with more NUMA nodes. For instance,
the scheduler is currently unable to move tasks directly between some
pairs of nodes two hops apart on an 8-node machine with mesh topology.

This bug was reported in the paper [1] as "The Scheduling Group
Construction bug".

This patch constructs the sched groups from each CPU's perspective. So,
on a 4-node machine with ring topology, nodes 0 and 2 keep the same
groups as before [(3, 0, 1)(1, 2, 3)], while nodes 1 and 3 get new
groups [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any two
nodes two hops apart.
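
The difference between the two constructions can be sketched in a small
user-space model. This is an illustration only, not kernel code: it
assumes one CPU per node, a 4-node ring, and plain bitmasks instead of
cpumasks. The names child_span(), build_groups() and separates() are
hypothetical helpers, and separates() is only a simplified proxy for
"the balancer can tell the two nodes apart at this level".

```c
#include <assert.h>
#include <stdio.h>

#define NR_NODES 4	/* assumption: 4-node ring, one CPU per node */

/* Bitmask of the nodes within one hop of node n on the ring (n included). */
unsigned int child_span(int n)
{
	int prev = (n + NR_NODES - 1) % NR_NODES;
	int next = (n + 1) % NR_NODES;

	return (1u << prev) | (1u << n) | (1u << next);
}

/*
 * Build the groups of the last NUMA domain as seen from 'node'.
 * per_cpu == 0 models the old behaviour: start with nothing covered and
 * walk the span from node 0, so every node ends up with the same groups.
 * per_cpu != 0 models the patched behaviour: the first group is the
 * child span of 'node' itself, then the uncovered nodes are walked.
 * Returns the number of groups stored in groups[].
 */
int build_groups(int node, int per_cpu, unsigned int groups[NR_NODES])
{
	unsigned int covered = 0;
	int nr = 0;

	assert(node >= 0 && node < NR_NODES);

	if (per_cpu) {
		groups[nr] = child_span(node);
		covered = groups[nr++];
	}

	for (int i = 0; i < NR_NODES; i++) {
		if (covered & (1u << i))
			continue;	/* node i already belongs to a group */
		groups[nr] = child_span(i);
		covered |= groups[nr++];
	}

	return nr;
}

/* Nonzero if at least one group contains exactly one of nodes a and b. */
int separates(const unsigned int *groups, int nr, int a, int b)
{
	for (int g = 0; g < nr; g++) {
		int has_a = !!(groups[g] & (1u << a));
		int has_b = !!(groups[g] & (1u << b));

		if (has_a != has_b)
			return 1;
	}
	return 0;
}

/* Print each group as a list of node numbers in ascending order. */
void print_groups(const unsigned int *groups, int nr)
{
	for (int g = 0; g < nr; g++) {
		printf("(");
		for (int n = 0; n < NR_NODES; n++)
			if (groups[g] & (1u << n))
				printf(" %d", n);
		printf(" )");
	}
	printf("\n");
}
```

Calling build_groups(1, 0, g) models the old construction for node 1 and
build_groups(1, 1, g) the patched one; separates(g, nr, 1, 3) then
reports whether any group distinguishes node 1 from node 3. The real
loop in build_overlap_sched_groups() performs further checks that this
model leaves out.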

SPECjbb2005 results on an 8-node NUMA machine with mesh topology:

Threads        before               after          change
           mean    stddev       mean    stddev
   1      22801      1950      27059      1367      +19%
   8     146008     50782     209193       826      +43%
  32     351030    105111     522445      9051      +49%
  48     365835    116571     594905      3314      +63%

[1] http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
---
 kernel/sched/topology.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d786d45..d0302ad 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -557,14 +557,24 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
-	struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
+	struct sched_group *last = NULL, *sg;
 	const struct cpumask *span = sched_domain_span(sd);
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
 
-	cpumask_clear(covered);
+	sg = build_group_from_child_sched_domain(sd, cpu);
+	if (!sg)
+		return -ENOMEM;
+
+	init_overlap_sched_group(sd, sg, cpu);
+
+	sd->groups = sg;
+	last = sg;
+	sg->next = sg;
+
+	cpumask_copy(covered, sched_group_cpus(sg));
 
 	for_each_cpu(i, span) {
 		struct cpumask *sg_span;
@@ -587,28 +597,15 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 
 		init_overlap_sched_group(sd, sg, i);
 
-		/*
-		 * Make sure the first group of this domain contains the
-		 * canonical balance CPU. Otherwise the sched_domain iteration
-		 * breaks. See update_sg_lb_stats().
-		 */
-		if ((!groups && cpumask_test_cpu(cpu, sg_span)) ||
-		    group_balance_cpu(sg) == cpu)
-			groups = sg;
-
-		if (!first)
-			first = sg;
-		if (last)
-			last->next = sg;
+		last->next = sg;
 		last = sg;
-		last->next = first;
+		sg->next = sd->groups;
 	}
-	sd->groups = groups;
 
 	return 0;
 
 fail:
-	free_sched_groups(first, 0);
+	free_sched_groups(sd->groups, 0);
 
 	return -ENOMEM;
 }
-- 
1.8.3.1