From: Lauro Ramos Venancio <lvenanci@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: lwang@redhat.com, riel@redhat.com, Mike Galbraith, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Lauro Ramos Venancio
Subject: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
Date: Thu, 13 Apr 2017 10:56:08 -0300
Message-Id: <1492091769-19879-3-git-send-email-lvenanci@redhat.com>
In-Reply-To: <1492091769-19879-1-git-send-email-lvenanci@redhat.com>
References: <1492091769-19879-1-git-send-email-lvenanci@redhat.com>

Currently, on a 4-node NUMA machine with ring topology, two sched groups
are generated for the last NUMA sched domain. One group has the CPUs from
NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes 1, 2 and 3.
As the CPUs from nodes 1 and 3 belong to both groups, the scheduler is
unable to directly move tasks between these nodes. In the worst case, when
a set of tasks is bound to nodes 1 and 3, performance is severely impacted
because just one node is used while the other remains idle.

This problem also affects machines with more NUMA nodes. For instance, the
scheduler is currently unable to directly move tasks between some node
pairs 2 hops apart on an 8-node machine with mesh topology. This bug was
reported in the paper [1] as "The Scheduling Group Construction bug".

This patch constructs the sched groups from each CPU's perspective. So, on
a 4-node machine with ring topology, while nodes 0 and 2 keep the same
groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 get new groups
[(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any pair of nodes
2 hops apart.
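To make the construction easier to follow outside the kernel, here is a
minimal userspace model of the covering walk. This is only a sketch:
ring_dist(), child_span() and build_groups() are helpers invented for the
illustration, and details such as the balance-CPU selection and the group
masks are ignored. It contrasts the old walk, which scans the domain span
from node 0 for every CPU, with the patched walk, which places the local
CPU's group first:

/* sketch.c - toy model of sched group construction on a 4-node ring */
#include <stdio.h>

#define N 4				/* NUMA nodes in the ring */

/* Hop distance between nodes i and j on an N-node ring. */
static int ring_dist(int i, int j)
{
	int d = (i - j + N) % N;

	return d < N - d ? d : N - d;
}

/* Span of node i's child NUMA domain: i plus its 1-hop neighbours. */
static void child_span(int i, int span[N])
{
	for (int j = 0; j < N; j++)
		span[j] = ring_dist(i, j) <= 1;
}

/*
 * Print the groups of node cpu's last NUMA sched domain. When
 * local_first is set (the patched behaviour), the covering walk
 * starts from cpu's own child span; otherwise (the old behaviour)
 * it always scans the domain span from node 0.
 */
static void build_groups(int cpu, int local_first)
{
	int covered[N] = { 0 };
	int order[N + 1], n = 0;

	if (local_first)
		order[n++] = cpu;
	for (int k = 0; k < N; k++)
		order[n++] = k;

	printf("node %d:", cpu);
	for (int k = 0; k < n; k++) {
		int i = order[k], span[N];

		if (covered[i])
			continue;
		child_span(i, span);
		printf(" (");
		for (int j = 0; j < N; j++) {
			if (!span[j])
				continue;
			covered[j] = 1;
			printf(" %d", j);
		}
		printf(" )");
	}
	printf("\n");
}

int main(void)
{
	printf("old construction:\n");
	for (int cpu = 0; cpu < N; cpu++)
		build_groups(cpu, 0);

	printf("patched construction:\n");
	for (int cpu = 0; cpu < N; cpu++)
		build_groups(cpu, 1);

	return 0;
}

Compiled with gcc -std=c99, the old walk prints the same overlapping pair
of groups, {3, 0, 1} and {1, 2, 3}, for every node, while the patched walk
gives nodes 1 and 3 the groups {0, 1, 2} and {2, 3, 0}, so every pair of
nodes 2 hops apart ends up sharing a group.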
SPECjbb2005 results on an 8-node NUMA machine with mesh topology:

Threads        before                after
             mean    stddev      mean    stddev        %
   1         22801      1950     27059      1367    +19%
   8        146008     50782    209193       826    +43%
  32        351030    105111    522445      9051    +49%
  48        365835    116571    594905      3314    +63%

[1] http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
---
 kernel/sched/topology.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d786d45..d0302ad 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -557,14 +557,24 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
-	struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
+	struct sched_group *last = NULL, *sg;
 	const struct cpumask *span = sched_domain_span(sd);
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
 
-	cpumask_clear(covered);
+	sg = build_group_from_child_sched_domain(sd, cpu);
+	if (!sg)
+		return -ENOMEM;
+
+	init_overlap_sched_group(sd, sg, cpu);
+
+	sd->groups = sg;
+	last = sg;
+	sg->next = sg;
+
+	cpumask_copy(covered, sched_group_cpus(sg));
 
 	for_each_cpu(i, span) {
 		struct cpumask *sg_span;
@@ -587,28 +597,15 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 
 		init_overlap_sched_group(sd, sg, i);
 
-		/*
-		 * Make sure the first group of this domain contains the
-		 * canonical balance CPU. Otherwise the sched_domain iteration
-		 * breaks. See update_sg_lb_stats().
-		 */
-		if ((!groups && cpumask_test_cpu(cpu, sg_span)) ||
-		    group_balance_cpu(sg) == cpu)
-			groups = sg;
-
-		if (!first)
-			first = sg;
-		if (last)
-			last->next = sg;
+		last->next = sg;
 		last = sg;
-		last->next = first;
+		sg->next = sd->groups;
 	}
-	sd->groups = groups;
 
 	return 0;
 
 fail:
-	free_sched_groups(first, 0);
+	free_sched_groups(sd->groups, 0);
 
 	return -ENOMEM;
 }
-- 
1.8.3.1