LKML Archive on lore.kernel.org
 help / color / Atom feed
From: Lauro Ramos Venancio <lvenanci@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: lwang@redhat.com, riel@redhat.com, Mike Galbraith <efault@gmx.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@kernel.org>,
	Lauro Ramos Venancio <lvenanci@redhat.com>
Subject: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
Date: Thu, 13 Apr 2017 10:56:08 -0300
Message-ID: <1492091769-19879-3-git-send-email-lvenanci@redhat.com> (raw)
In-Reply-To: <1492091769-19879-1-git-send-email-lvenanci@redhat.com>

Currently, on a 4 nodes NUMA machine with ring topology, two sched
groups are generated for the last NUMA sched domain. One group has the
CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
1, 2 and 3. As CPUs from nodes 1 and 3 belongs to both groups, the
scheduler is unable to directly move tasks between these nodes. In the
worst scenario, when a set of tasks are bound to nodes 1 and 3, the
performance is severely impacted because just one node is used while the
other node remains idle.

This problem also affects machines with more NUMA nodes. For instance,
currently, the scheduler is unable to directly move tasks between some
node pairs 2-hops apart on an 8 nodes machine with mesh topology.

This bug was reported in the paper [1] as "The Scheduling Group
Construction bug".

This patch constructs the sched groups from each CPU perspective. So, on
a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
[(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
apart.

SPECjbb2005 results on an 8 NUMA nodes machine with mesh topology

Threads       before              after          %
           mean   stddev      mean    stddev
  1       22801   1950        27059   1367     +19%
  8       146008  50782       209193  826      +43%
  32      351030  105111      522445  9051     +49%
  48      365835  116571      594905  3314     +63%

[1] http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
---
 kernel/sched/topology.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d786d45..d0302ad 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -557,14 +557,24 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
-	struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
+	struct sched_group *last = NULL, *sg;
 	const struct cpumask *span = sched_domain_span(sd);
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
 
-	cpumask_clear(covered);
+	sg = build_group_from_child_sched_domain(sd, cpu);
+	if (!sg)
+		return -ENOMEM;
+
+	init_overlap_sched_group(sd, sg, cpu);
+
+	sd->groups = sg;
+	last = sg;
+	sg->next = sg;
+
+	cpumask_copy(covered, sched_group_cpus(sg));
 
 	for_each_cpu(i, span) {
 		struct cpumask *sg_span;
@@ -587,28 +597,15 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 
 		init_overlap_sched_group(sd, sg, i);
 
-		/*
-		 * Make sure the first group of this domain contains the
-		 * canonical balance CPU. Otherwise the sched_domain iteration
-		 * breaks. See update_sg_lb_stats().
-		 */
-		if ((!groups && cpumask_test_cpu(cpu, sg_span)) ||
-		    group_balance_cpu(sg) == cpu)
-			groups = sg;
-
-		if (!first)
-			first = sg;
-		if (last)
-			last->next = sg;
+		last->next = sg;
 		last = sg;
-		last->next = first;
+		sg->next = sd->groups;
 	}
-	sd->groups = groups;
 
 	return 0;
 
 fail:
-	free_sched_groups(first, 0);
+	free_sched_groups(sd->groups, 0);
 
 	return -ENOMEM;
 }
-- 
1.8.3.1

  parent reply index

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-13 13:56 [RFC 0/3] " Lauro Ramos Venancio
2017-04-13 13:56 ` [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() Lauro Ramos Venancio
2017-04-13 14:50   ` Rik van Riel
2017-05-15  9:02   ` [tip:sched/core] " tip-bot for Lauro Ramos Venancio
2017-04-13 13:56 ` Lauro Ramos Venancio [this message]
2017-04-13 15:16   ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Rik van Riel
2017-04-13 15:48   ` Peter Zijlstra
2017-04-13 20:21     ` Lauro Venancio
2017-04-13 21:06       ` Lauro Venancio
2017-04-13 23:38         ` Rik van Riel
2017-04-14 10:48           ` Peter Zijlstra
2017-04-14 11:38   ` Peter Zijlstra
2017-04-14 12:20     ` Peter Zijlstra
2017-05-15  9:03       ` [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() tip-bot for Peter Zijlstra
2017-05-17 10:53         ` hackbench vs select_idle_sibling; was: " Peter Zijlstra
2017-05-17 12:46           ` Matt Fleming
2017-05-17 14:49           ` Chris Mason
2017-05-19 15:00           ` Matt Fleming
2017-06-05 13:00             ` Matt Fleming
2017-06-06  9:21               ` Peter Zijlstra
2017-06-09 17:52                 ` Chris Mason
2017-06-08  9:22           ` [tip:sched/core] sched/core: Implement new approach to scale select_idle_cpu() tip-bot for Peter Zijlstra
2017-04-14 16:58     ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Peter Zijlstra
2017-04-17 14:40       ` Lauro Venancio
2017-04-13 13:56 ` [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu Lauro Ramos Venancio
2017-04-13 15:27   ` Rik van Riel
2017-04-14 16:49   ` Peter Zijlstra
2017-04-17 15:34     ` Lauro Venancio
2017-04-18 12:32       ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1492091769-19879-3-git-send-email-lvenanci@redhat.com \
    --to=lvenanci@redhat.com \
    --cc=efault@gmx.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lwang@redhat.com \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git
	git clone --mirror https://lore.kernel.org/lkml/9 lkml/git/9.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git