* [RFC 0/3] sched/topology: fix sched groups on NUMA machines with mesh topology
@ 2017-04-13 13:56 Lauro Ramos Venancio
  2017-04-13 13:56 ` [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() Lauro Ramos Venancio
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Lauro Ramos Venancio @ 2017-04-13 13:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: lwang, riel, Mike Galbraith, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Lauro Ramos Venancio

Currently, the scheduler is not able to directly move tasks between some NUMA
nodes 2 hops apart on machines with mesh topology. This occurs because some
NUMA nodes belong to all sched groups. For more details, see the commit log of
patch 2.

This bug was reported in the paper [1] as "The Scheduling Group Construction
bug".

This patchset constructs the sched groups from each CPU's perspective. So each
NUMA node can have different groups in the last NUMA sched domain level.

SPECjbb2005 results show up to a 63% performance improvement and a large drop
in standard deviation on a machine with 8 NUMA nodes and mesh topology.

Patch 1 - just prepares the code for patch 2
Patch 2 - changes the sched group construction
Patch 3 - fixes an issue with different groups starting with the same CPU

[1] http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

Regards,
Lauro

Lauro Ramos Venancio (3):
  sched/topology: Refactor function build_overlap_sched_groups()
  sched/topology: fix sched groups on NUMA machines with mesh topology
  sched/topology: Different sched groups must not have the same balance
    cpu

 kernel/sched/topology.c | 165 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 117 insertions(+), 48 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups()
  2017-04-13 13:56 [RFC 0/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
@ 2017-04-13 13:56 ` Lauro Ramos Venancio
  2017-04-13 14:50   ` Rik van Riel
  2017-05-15  9:02   ` [tip:sched/core] " tip-bot for Lauro Ramos Venancio
  2017-04-13 13:56 ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
  2017-04-13 13:56 ` [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu Lauro Ramos Venancio
  2 siblings, 2 replies; 29+ messages in thread
From: Lauro Ramos Venancio @ 2017-04-13 13:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: lwang, riel, Mike Galbraith, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Lauro Ramos Venancio

Create functions build_group_from_child_sched_domain() and
init_overlap_sched_group(). No functional change.

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
---
 kernel/sched/topology.c | 62 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 43 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1b0b4fb..d786d45 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -513,6 +513,47 @@ int group_balance_cpu(struct sched_group *sg)
 	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
 }
 
+static struct sched_group *
+build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
+{
+	struct sched_group *sg;
+	struct cpumask *sg_span;
+
+	sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
+			GFP_KERNEL, cpu_to_node(cpu));
+
+	if (!sg)
+		return NULL;
+
+	sg_span = sched_group_cpus(sg);
+	if (sd->child)
+		cpumask_copy(sg_span, sched_domain_span(sd->child));
+	else
+		cpumask_copy(sg_span, sched_domain_span(sd));
+
+	return sg;
+}
+
+static void init_overlap_sched_group(struct sched_domain *sd,
+				     struct sched_group *sg, int cpu)
+{
+	struct sd_data *sdd = sd->private;
+	struct cpumask *sg_span;
+
+	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
+	if (atomic_inc_return(&sg->sgc->ref) == 1)
+		build_group_mask(sd, sg);
+
+	/*
+	 * Initialize sgc->capacity such that even if we mess up the
+	 * domains and no possible iteration will get us here, we won't
+	 * die on a /0 trap.
+	 */
+	sg_span = sched_group_cpus(sg);
+	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+}
+
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
@@ -537,31 +578,14 @@ int group_balance_cpu(struct sched_group *sg)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 
-		sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
-				GFP_KERNEL, cpu_to_node(cpu));
-
+		sg = build_group_from_child_sched_domain(sibling, cpu);
 		if (!sg)
 			goto fail;
 
 		sg_span = sched_group_cpus(sg);
-		if (sibling->child)
-			cpumask_copy(sg_span, sched_domain_span(sibling->child));
-		else
-			cpumask_set_cpu(i, sg_span);
-
 		cpumask_or(covered, covered, sg_span);
 
-		sg->sgc = *per_cpu_ptr(sdd->sgc, i);
-		if (atomic_inc_return(&sg->sgc->ref) == 1)
-			build_group_mask(sd, sg);
-
-		/*
-		 * Initialize sgc->capacity such that even if we mess up the
-		 * domains and no possible iteration will get us here, we won't
-		 * die on a /0 trap.
-		 */
-		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
-		sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+		init_overlap_sched_group(sd, sg, i);
 
 		/*
 		 * Make sure the first group of this domain contains the
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 13:56 [RFC 0/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
  2017-04-13 13:56 ` [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() Lauro Ramos Venancio
@ 2017-04-13 13:56 ` Lauro Ramos Venancio
  2017-04-13 15:16   ` Rik van Riel
                     ` (2 more replies)
  2017-04-13 13:56 ` [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu Lauro Ramos Venancio
  2 siblings, 3 replies; 29+ messages in thread
From: Lauro Ramos Venancio @ 2017-04-13 13:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: lwang, riel, Mike Galbraith, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Lauro Ramos Venancio

Currently, on a 4-node NUMA machine with ring topology, two sched
groups are generated for the last NUMA sched domain. One group has the
CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
1, 2 and 3. As the CPUs from nodes 1 and 3 belong to both groups, the
scheduler is unable to directly move tasks between these nodes. In the
worst case, when a set of tasks is bound to nodes 1 and 3, performance
is severely impacted because just one node is used while the other node
remains idle.

This problem also affects machines with more NUMA nodes. For instance,
currently, the scheduler is unable to directly move tasks between some
node pairs 2 hops apart on an 8-node machine with mesh topology.

This bug was reported in the paper [1] as "The Scheduling Group
Construction bug".

This patch constructs the sched groups from each CPU's perspective. So, on
a 4-node machine with ring topology, while nodes 0 and 2 keep the same
groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
[(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any two nodes
2 hops apart.

SPECjbb2005 results on an 8 NUMA nodes machine with mesh topology

Threads       before              after          %
           mean   stddev      mean    stddev
  1       22801   1950        27059   1367     +19%
  8       146008  50782       209193  826      +43%
  32      351030  105111      522445  9051     +49%
  48      365835  116571      594905  3314     +63%

[1] http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
---
 kernel/sched/topology.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d786d45..d0302ad 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -557,14 +557,24 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
-	struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
+	struct sched_group *last = NULL, *sg;
 	const struct cpumask *span = sched_domain_span(sd);
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
 
-	cpumask_clear(covered);
+	sg = build_group_from_child_sched_domain(sd, cpu);
+	if (!sg)
+		return -ENOMEM;
+
+	init_overlap_sched_group(sd, sg, cpu);
+
+	sd->groups = sg;
+	last = sg;
+	sg->next = sg;
+
+	cpumask_copy(covered, sched_group_cpus(sg));
 
 	for_each_cpu(i, span) {
 		struct cpumask *sg_span;
@@ -587,28 +597,15 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 
 		init_overlap_sched_group(sd, sg, i);
 
-		/*
-		 * Make sure the first group of this domain contains the
-		 * canonical balance CPU. Otherwise the sched_domain iteration
-		 * breaks. See update_sg_lb_stats().
-		 */
-		if ((!groups && cpumask_test_cpu(cpu, sg_span)) ||
-		    group_balance_cpu(sg) == cpu)
-			groups = sg;
-
-		if (!first)
-			first = sg;
-		if (last)
-			last->next = sg;
+		last->next = sg;
 		last = sg;
-		last->next = first;
+		sg->next = sd->groups;
 	}
-	sd->groups = groups;
 
 	return 0;
 
 fail:
-	free_sched_groups(first, 0);
+	free_sched_groups(sd->groups, 0);
 
 	return -ENOMEM;
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu
  2017-04-13 13:56 [RFC 0/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
  2017-04-13 13:56 ` [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() Lauro Ramos Venancio
  2017-04-13 13:56 ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
@ 2017-04-13 13:56 ` Lauro Ramos Venancio
  2017-04-13 15:27   ` Rik van Riel
  2017-04-14 16:49   ` Peter Zijlstra
  2 siblings, 2 replies; 29+ messages in thread
From: Lauro Ramos Venancio @ 2017-04-13 13:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: lwang, riel, Mike Galbraith, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Lauro Ramos Venancio

Currently, the group balance cpu is the group's first CPU. But with
overlapping groups, two different groups can have the same first CPU.

This patch uses the group mask to mark all the CPUs that have a
particular group as their main sched group. The group balance cpu is the
first CPU of the group that is also set in the mask.

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
---
 kernel/sched/topology.c | 76 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 62 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d0302ad..7920bbb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -477,27 +477,31 @@ enum s_alloc {
 };
 
 /*
- * Build an iteration mask that can exclude certain CPUs from the upwards
- * domain traversal.
+ * An overlap sched group may not be present in all CPUs that compose the
+ * group. So build the mask, marking all the group CPUs where it is present.
  *
  * Asymmetric node setups can result in situations where the domain tree is of
  * unequal depth, make sure to skip domains that already cover the entire
  * range.
- *
- * In that case build_sched_domains() will have terminated the iteration early
- * and our sibling sd spans will be empty. Domains should always include the
- * CPU they're built on, so check that.
  */
 static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
 {
-	const struct cpumask *span = sched_domain_span(sd);
+	const struct cpumask *sg_span = sched_group_cpus(sg);
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
 
-	for_each_cpu(i, span) {
+	for_each_cpu(i, sg_span) {
 		sibling = *per_cpu_ptr(sdd->sd, i);
-		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
+
+		/*
+		 * Asymmetric node setups: skip domains that are already
+		 * done.
+		 */
+		if (!sibling->groups)
+			continue;
+
+		if (!cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
 			continue;
 
 		cpumask_set_cpu(i, sched_group_mask(sg));
@@ -513,6 +517,28 @@ int group_balance_cpu(struct sched_group *sg)
 	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
 }
 
+/*
+ * Find the group balance cpu when the group mask is not available yet.
+ */
+static int find_group_balance_cpu(struct sched_domain *sd,
+				  struct sched_group *sg)
+{
+	const struct cpumask *sg_span = sched_group_cpus(sg);
+	struct sd_data *sdd = sd->private;
+	struct sched_domain *sibling;
+	int i;
+
+	for_each_cpu(i, sg_span) {
+		sibling = *per_cpu_ptr(sdd->sd, i);
+		if (cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
+			return i;
+	}
+
+	WARN(1, "group balance cpu not found.");
+	return 0;
+}
+
+
 static struct sched_group *
 build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 {
@@ -554,6 +580,19 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
 }
 
+static void init_overlap_sched_groups(struct sched_domain *sd)
+{
+	struct sched_group *sg = sd->groups;
+	int cpu;
+
+	do {
+		cpu = find_group_balance_cpu(sd, sg);
+		init_overlap_sched_group(sd, sg, cpu);
+
+		sg = sg->next;
+	} while (sg != sd->groups);
+}
+
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
@@ -568,8 +607,6 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 	if (!sg)
 		return -ENOMEM;
 
-	init_overlap_sched_group(sd, sg, cpu);
-
 	sd->groups = sg;
 	last = sg;
 	sg->next = sg;
@@ -584,7 +621,12 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 
 		sibling = *per_cpu_ptr(sdd->sd, i);
 
-		/* See the comment near build_group_mask(). */
+		/*
+		 * In Asymmetric node setups, build_sched_domains() will have
+		 * terminated the iteration early and our sibling sd spans will
+		 * be empty. Domains should always include the CPU they're
+		 * built on, so check that.
+		 */
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 
@@ -595,8 +637,6 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 		sg_span = sched_group_cpus(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg, i);
-
 		last->next = sg;
 		last = sg;
 		sg->next = sd->groups;
@@ -1449,6 +1489,14 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		}
 	}
 
+	/* Init overlap groups */
+	for_each_cpu(i, cpu_map) {
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			if (sd->flags & SD_OVERLAP)
+				init_overlap_sched_groups(sd);
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups()
  2017-04-13 13:56 ` [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() Lauro Ramos Venancio
@ 2017-04-13 14:50   ` Rik van Riel
  2017-05-15  9:02   ` [tip:sched/core] " tip-bot for Lauro Ramos Venancio
  1 sibling, 0 replies; 29+ messages in thread
From: Rik van Riel @ 2017-04-13 14:50 UTC (permalink / raw)
  To: Lauro Ramos Venancio, linux-kernel
  Cc: lwang, Mike Galbraith, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

On Thu, 2017-04-13 at 10:56 -0300, Lauro Ramos Venancio wrote:
> Create functions build_group_from_child_sched_domain() and
> init_overlap_sched_group(). No functional change.
> 
> Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
> 
Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 13:56 ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
@ 2017-04-13 15:16   ` Rik van Riel
  2017-04-13 15:48   ` Peter Zijlstra
  2017-04-14 11:38   ` Peter Zijlstra
  2 siblings, 0 replies; 29+ messages in thread
From: Rik van Riel @ 2017-04-13 15:16 UTC (permalink / raw)
  To: Lauro Ramos Venancio, linux-kernel
  Cc: lwang, Mike Galbraith, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

On Thu, 2017-04-13 at 10:56 -0300, Lauro Ramos Venancio wrote:
> 
> SPECjbb2005 results on an 8 NUMA nodes machine with mesh topology
> 
> Threads       before              after          %
>            mean   stddev      mean    stddev
>   1       22801   1950        27059   1367     +19%
>   8       146008  50782       209193  826      +43%
>   32      351030  105111      522445  9051     +49%
>   48      365835  116571      594905  3314     +63%

Impressive!

> [1] http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
> 
> Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu
  2017-04-13 13:56 ` [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu Lauro Ramos Venancio
@ 2017-04-13 15:27   ` Rik van Riel
  2017-04-14 16:49   ` Peter Zijlstra
  1 sibling, 0 replies; 29+ messages in thread
From: Rik van Riel @ 2017-04-13 15:27 UTC (permalink / raw)
  To: Lauro Ramos Venancio, linux-kernel
  Cc: lwang, Mike Galbraith, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

On Thu, 2017-04-13 at 10:56 -0300, Lauro Ramos Venancio wrote:
> Currently, the group balance cpu is the groups's first CPU. But with
> overlapping groups, two different groups can have the same first CPU.
> 
> This patch uses the group mask to mark all the CPUs that have a
> particular group as its main sched group. The group balance cpu is
> the
> first group CPU that is also in the mask.
> 

This is not your fault, but this code is really hard
to understand.

Your comments tell me what the code does, but not
really why. 

> +++ b/kernel/sched/topology.c
> @@ -477,27 +477,31 @@ enum s_alloc {
>  };
>  
>  /*
> - * Build an iteration mask that can exclude certain CPUs from the
> upwards
> - * domain traversal.
> + * An overlap sched group may not be present in all CPUs that
> compose the
> + * group. So build the mask, marking all the group CPUs where it is
> present.
>   *
>   * Asymmetric node setups can result in situations where the domain
> tree is of
>   * unequal depth, make sure to skip domains that already cover the
> entire
>   * range.
> - *
> - * In that case build_sched_domains() will have terminated the
> iteration early
> - * and our sibling sd spans will be empty. Domains should always
> include the
> - * CPU they're built on, so check that.
>   */

Why are we doing this?

Could the comment explain why things need to be
this way?


> -	for_each_cpu(i, span) {
> +	for_each_cpu(i, sg_span) {
>  		sibling = *per_cpu_ptr(sdd->sd, i);
> -		if (!cpumask_test_cpu(i,
> sched_domain_span(sibling)))
> +
> +		/*
> +		 * Asymmetric node setups: skip domains that are
> already
> +		 * done.
> +		 */
> +		if (!sibling->groups)
> +			continue;
> +

What does this mean?

I would really like it if this code was a little
better documented.  This is not something I can
put on you for most of the code, but I can at least
ask it for the code you are adding :)

The same goes for the rest of this patch.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 13:56 ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
  2017-04-13 15:16   ` Rik van Riel
@ 2017-04-13 15:48   ` Peter Zijlstra
  2017-04-13 20:21     ` Lauro Venancio
  2017-04-14 11:38   ` Peter Zijlstra
  2 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2017-04-13 15:48 UTC (permalink / raw)
  To: Lauro Ramos Venancio
  Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
> Currently, on a 4 nodes NUMA machine with ring topology, two sched
> groups are generated for the last NUMA sched domain. One group has the
> CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
> 1, 2 and 3. As CPUs from nodes 1 and 3 belongs to both groups, the
> scheduler is unable to directly move tasks between these nodes. In the
> worst scenario, when a set of tasks are bound to nodes 1 and 3, the
> performance is severely impacted because just one node is used while the
> other node remains idle.

I feel a picture would be ever so much clearer.

> This patch constructs the sched groups from each CPU perspective. So, on
> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
> apart.

So I still have no idea what specifically goes wrong and how this fixes
it. Changelog is impenetrable.

"From each CPU's persepective" doesn't really help, there already is a
for_each_cpu() in.

Also, since I'm not sure what happened to the 4-node system, I cannot
begin to imagine what would happen on the 8-node one.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 15:48   ` Peter Zijlstra
@ 2017-04-13 20:21     ` Lauro Venancio
  2017-04-13 21:06       ` Lauro Venancio
  0 siblings, 1 reply; 29+ messages in thread
From: Lauro Venancio @ 2017-04-13 20:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On 04/13/2017 12:48 PM, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
>> Currently, on a 4 nodes NUMA machine with ring topology, two sched
>> groups are generated for the last NUMA sched domain. One group has the
>> CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
>> 1, 2 and 3. As CPUs from nodes 1 and 3 belongs to both groups, the
>> scheduler is unable to directly move tasks between these nodes. In the
>> worst scenario, when a set of tasks are bound to nodes 1 and 3, the
>> performance is severely impacted because just one node is used while the
>> other node remains idle.
> I feel a picture would be ever so much clearer.
>
>> This patch constructs the sched groups from each CPU perspective. So, on
>> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
>> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
>> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
>> apart.
> So I still have no idea what specifically goes wrong and how this fixes
> it. Changelog is impenetrable.
On a 4-node machine with ring topology, the last sched domain level
contains groups with 3 NUMA nodes each. So we have four possible groups:
(0, 1, 2) (1, 2, 3) (2, 3, 0) (3, 0, 1). As we need just two groups to
fill the sched domain, currently the groups (3, 0, 1) and (1, 2, 3) are
used for all CPUs. The problem with this is that nodes 1 and 3 belong to
both groups, making it impossible to move tasks between these two nodes.

This patch uses different groups depending on the CPU where they are
installed. So the CPUs of nodes 0 and 2 keep the same groups as before:
(3, 0, 1) and (1, 2, 3). The CPUs of nodes 1 and 3 use the new groups:
(0, 1, 2) and (2, 3, 0). So the first pair of groups allows movement
between nodes 0 and 2; the second pair allows movement between nodes 1 and 3.

I will improve the changelog.

> "From each CPU's persepective" doesn't really help, there already is a
> for_each_cpu() in.
The for_each_cpu() is used to iterate across all sched domain CPUs. It
doesn't consider the CPU where the groups are being installed (the parameter
cpu in build_overlap_sched_groups()). Currently, the parameter cpu is
used just for memory allocation and for ordering the groups; it doesn't
change which groups are chosen. This patch uses the parameter cpu to
choose the first group, which, as a consequence, also changes the second group.
>
> Also, since I'm not sure what happend to the 4 node system, I cannot
> begin to imagine what would happen on the 8 node one.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 20:21     ` Lauro Venancio
@ 2017-04-13 21:06       ` Lauro Venancio
  2017-04-13 23:38         ` Rik van Riel
  0 siblings, 1 reply; 29+ messages in thread
From: Lauro Venancio @ 2017-04-13 21:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On 04/13/2017 05:21 PM, Lauro Venancio wrote:
> On 04/13/2017 12:48 PM, Peter Zijlstra wrote:
>> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
>>> Currently, on a 4 nodes NUMA machine with ring topology, two sched
>>> groups are generated for the last NUMA sched domain. One group has the
>>> CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
>>> 1, 2 and 3. As CPUs from nodes 1 and 3 belongs to both groups, the
>>> scheduler is unable to directly move tasks between these nodes. In the
>>> worst scenario, when a set of tasks are bound to nodes 1 and 3, the
>>> performance is severely impacted because just one node is used while the
>>> other node remains idle.
>> I feel a picture would be ever so much clearer.
>>
>>> This patch constructs the sched groups from each CPU perspective. So, on
>>> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
>>> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
>>> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
>>> apart.
>> So I still have no idea what specifically goes wrong and how this fixes
>> it. Changelog is impenetrable.
> On a 4 nodes machine with ring topology, the last sched domain level
> contains groups with 3 numa nodes each. So we have four possible groups:
> (0, 1, 2) (1, 2, 3) (2, 3, 0)(3, 0, 1). As we need just two groups to
> fill the sched domain, currently, the groups (3, 0, 1) and (1, 2, 3) are
> used for all CPUs. The problem with it is that nodes 1 and 3 belongs to
> both groups, becoming impossible to move tasks between these two nodes.
>
> This patch uses different groups depending on the CPU they are
> installed. So nodes 0 and 2 CPUs keep the same group as before: (3, 0,
> 1) and (1, 2, 3). Nodes 1 and 3 CPUs use the new groups: (0, 1, 2) and
> (2, 3, 0). So the first pair of groups allows movement between nodes 0
> and 2; and the second pair of groups allows movement between nodes 1 and 3.
>
> I will improve the changelog.
>
>> "From each CPU's persepective" doesn't really help, there already is a
>> for_each_cpu() in.
> The for_each_cpu() is used to iterate across all sched domain cpus. It
> doesn't consider the CPU where the groups are being installed (parameter
> cpu in build_overlap_sched_groups()). Currently, the parameter cpu is
> used just for memory allocation and for ordering the groups, it doesn't
> change the groups that are chosen. This patch uses the parameter cpu to
> choose the first group, changing also, as consequence, the second group.
>> Also, since I'm not sure what happend to the 4 node system, I cannot
>> begin to imagine what would happen on the 8 node one.

Just for clarification, I am sending the node distance tables for the
two most common topologies affected by this issue.

4 nodes, ring topology
node distances:
node   0   1   2   3
  0:  10  20  30  20
  1:  20  10  20  30
  2:  30  20  10  20
  3:  20  30  20  10

8 node, mesh topology
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  16  22  22  16  22  16
  2:  16  16  10  16  16  16  16  22
  3:  22  22  16  10  16  16  22  16
  4:  16  22  16  16  10  16  16  16
  5:  22  16  16  16  16  10  22  22
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  16  22  16  10
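
To make the effect concrete, here is a minimal userspace sketch (not kernel
code; it assumes one CPU per node and treats a node's "child span" simply as
the set of nodes at distance <= 20) that derives the top-level groups from the
4-node ring table above. Walking the nodes from 0 for everybody mimics the old
construction; starting the walk at the node the groups are built for mimics
the patched one:

#include <stdio.h>

#define N 4

/* 4-node ring distance table from above */
static const int dist[N][N] = {
	{ 10, 20, 30, 20 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 20, 30, 20, 10 },
};

/* set of nodes at distance <= 20 from @node, as a bitmask */
static unsigned int child_span(int node)
{
	unsigned int span = 0;
	int i;

	for (i = 0; i < N; i++)
		if (dist[node][i] <= 20)
			span |= 1u << i;
	return span;
}

/* emit the child span of every not-yet-covered node, walking from @start */
static void build_groups(int node, int start)
{
	unsigned int covered = 0;
	int k, j;

	printf("node %d:", node);
	for (k = 0; k < N; k++) {
		int i = (start + k) % N;	/* wrap-around walk */
		unsigned int span;

		if (covered & (1u << i))
			continue;

		span = child_span(i);
		covered |= span;

		printf(" (");
		for (j = 0; j < N; j++)
			if (span & (1u << j))
				printf(" %d", j);
		printf(" )");
	}
	printf("\n");
}

int main(void)
{
	int node;

	printf("old: the walk always starts at node 0\n");
	for (node = 0; node < N; node++)
		build_groups(node, 0);		/* every node gets (0 1 3)(1 2 3) */

	printf("new: the walk starts at the node itself\n");
	for (node = 0; node < N; node++)
		build_groups(node, node);	/* nodes 1 and 3 now get (0 1 2) and (0 2 3) */

	return 0;
}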

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 21:06       ` Lauro Venancio
@ 2017-04-13 23:38         ` Rik van Riel
  2017-04-14 10:48           ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Rik van Riel @ 2017-04-13 23:38 UTC (permalink / raw)
  To: lvenanci, Peter Zijlstra
  Cc: linux-kernel, lwang, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On Thu, 2017-04-13 at 18:06 -0300, Lauro Venancio wrote:

> Just for clarification, I am sending the nodes distance table for the
> two most common typologies affected by this issue.

What do the sched groups look like for these topologies,
before and after your patch series?

> 4 nodes, ring topology
> node distances:
> node   0   1   2   3
>   0:  10  20  30  20
>   1:  20  10  20  30
>   2:  30  20  10  20
>   3:  20  30  20  10
> 
> 8 node, mesh topology
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  22  16  22  16  22
>   1:  16  10  16  22  22  16  22  16
>   2:  16  16  10  16  16  16  16  22
>   3:  22  22  16  10  16  16  22  16
>   4:  16  22  16  16  10  16  16  16
>   5:  22  16  16  16  16  10  22  22
>   6:  16  22  16  22  16  22  10  16
>   7:  22  16  22  16  16  22  16  10

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 23:38         ` Rik van Riel
@ 2017-04-14 10:48           ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2017-04-14 10:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: lvenanci, linux-kernel, lwang, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar

On Thu, Apr 13, 2017 at 07:38:05PM -0400, Rik van Riel wrote:

> What do the sched groups look like for these topologies,
> before and after your patch series?
> 
> > 4 nodes, ring topology
> > node distances:
> > node   0   1   2   3
> >   0:  10  20  30  20
> >   1:  20  10  20  30
> >   2:  30  20  10  20
> >   3:  20  30  20  10

kvm -smp 4 -m 4G -display none -monitor null -serial stdio -kernel
defconfig-build/arch/x86/boot/bzImage -append "sched_debug debug
ignore_loglevel earlyprintk=serial,ttyS0,115200,keep
numa=fake=4:10,20,30,20,20,10,20,30,30,20,10,20,20,30,20,10,0"

(FWIW, that's defconfig+kvmconfig+SCHED_DEBUG=y+NUMA_EMU=y)

Gives me:

[    0.075004] smpboot: Total of 4 processors activated (22345.79 BogoMIPS)
[    0.076767] CPU0 attaching sched-domain:
[    0.077003]  domain 0: span 0-1,3 level NUMA
[    0.078002]   groups: 0 1 3
[    0.079002]   domain 1: span 0-3 level NUMA
[    0.080002]    groups: 0-1,3 (cpu_capacity = 3072) 1-3 (cpu_capacity = 3072)
[    0.081005] CPU1 attaching sched-domain:
[    0.082003]  domain 0: span 0-2 level NUMA
[    0.083002]   groups: 1 2 0
[    0.084002]   domain 1: span 0-3 level NUMA
[    0.085002]    groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)
[    0.086004] CPU2 attaching sched-domain:
[    0.087002]  domain 0: span 1-3 level NUMA
[    0.088002]   groups: 2 3 1
[    0.089002]   domain 1: span 0-3 level NUMA
[    0.090002]    groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)
[    0.091004] CPU3 attaching sched-domain:
[    0.092002]  domain 0: span 0,2-3 level NUMA
[    0.093002]   groups: 3 0 2
[    0.094002]   domain 1: span 0-3 level NUMA
[    0.095002]    groups: 0-1,3 (cpu_capacity = 3072) 1-3 (cpu_capacity = 3072)
[    0.096004] span: 0-3 (max cpu_capacity = 1024)


With patches it looks like:

[    0.080006] smpboot: Total of 4 processors activated (22345.79 BogoMIPS)
[    0.082545] CPU0 attaching sched-domain:
[    0.083007]  domain 0: span 0-1,3 level NUMA
[    0.084004]   groups: 0 1 3
[    0.085004]   domain 1: span 0-3 level NUMA
[    0.086004]    groups: 0-1,3 (cpu_capacity = 3072) 1-3 (cpu_capacity = 3072)
[    0.087007] CPU1 attaching sched-domain:
[    0.088004]  domain 0: span 0-2 level NUMA
[    0.089004]   groups: 1 0 2
[    0.090004]   domain 1: span 0-3 level NUMA
[    0.091003]    groups: 0-2 (cpu_capacity = 3072) 0,2-3 (cpu_capacity = 3072)
[    0.092008] CPU2 attaching sched-domain:
[    0.093004]  domain 0: span 1-3 level NUMA
[    0.094004]   groups: 2 1 3
[    0.095004]   domain 1: span 0-3 level NUMA
[    0.096004]    groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)
[    0.097007] CPU3 attaching sched-domain:
[    0.098004]  domain 0: span 0,2-3 level NUMA
[    0.099003]   groups: 3 0 2
[    0.100004]   domain 1: span 0-3 level NUMA
[    0.101003]    groups: 0,2-3 (cpu_capacity = 3072) 0-2 (cpu_capacity = 3072)
[    0.102007] span: 0-3 (max cpu_capacity = 1024)


Now let me try and reverse engineer those patches ..

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 13:56 ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
  2017-04-13 15:16   ` Rik van Riel
  2017-04-13 15:48   ` Peter Zijlstra
@ 2017-04-14 11:38   ` Peter Zijlstra
  2017-04-14 12:20     ` Peter Zijlstra
  2017-04-14 16:58     ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Peter Zijlstra
  2 siblings, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2017-04-14 11:38 UTC (permalink / raw)
  To: Lauro Ramos Venancio
  Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
> This patch constructs the sched groups from each CPU perspective. So, on
> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
> apart.

Ah,.. so after drawing pictures I see what went wrong; duh :-(

An equivalent patch would be (if for_each_cpu_wrap() were exposed):

@@ -521,11 +588,11 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
-	int i;
+	int i, wrap;
 
 	cpumask_clear(covered);
 
-	for_each_cpu(i, span) {
+	for_each_cpu_wrap(i, span, cpu, wrap) {
 		struct cpumask *sg_span;
 
 		if (cpumask_test_cpu(i, covered))


We need to start iterating at @cpu, not start at 0 every time.
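
As a rough userspace analogue (plain bitmask, not the kernel cpumask API), the
only difference is the visit order: the walk covers the same set bits but
begins at @cpu and wraps around, so different CPUs can pick different first
groups:

#include <stdio.h>

#define NBITS 8

/* visit every set bit of @mask, starting at @start and wrapping around */
static void visit_wrap(unsigned long mask, int start)
{
	int k;

	printf("start %d:", start);
	for (k = 0; k < NBITS; k++) {
		int bit = (start + k) % NBITS;

		if (mask & (1ul << bit))
			printf(" %d", bit);
	}
	printf("\n");
}

int main(void)
{
	visit_wrap(0x0f, 0);	/* prints: start 0: 0 1 2 3 */
	visit_wrap(0x0f, 2);	/* prints: start 2: 2 3 0 1 */
	return 0;
}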

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-14 11:38   ` Peter Zijlstra
@ 2017-04-14 12:20     ` Peter Zijlstra
  2017-05-15  9:03       ` [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() tip-bot for Peter Zijlstra
  2017-04-14 16:58     ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2017-04-14 12:20 UTC (permalink / raw)
  To: Lauro Ramos Venancio
  Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On Fri, Apr 14, 2017 at 01:38:13PM +0200, Peter Zijlstra wrote:

> An equivalent patch would be (if for_each_cpu_wrap() were exposed):

---
Subject: sched,cpumask: Export for_each_cpu_wrap()

More users for for_each_cpu_wrap() have appeared. Promote the construct
to the generic cpumask interface.

The implementation is slightly modified to reduce arguments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/cpumask.h | 16 ++++++++++++++++
 kernel/sched/fair.c     | 45 ++++-----------------------------------------
 lib/cpumask.c           | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+), 41 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 96f1e88..4b87b7b 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -236,6 +236,22 @@ unsigned int cpumask_local_spread(unsigned int i, int node);
 		(cpu) = cpumask_next_zero((cpu), (mask)),	\
 		(cpu) < nr_cpu_ids;)
 
+
+/**
+ * for_each_cpu_wrap - iterate over every cpu in a mask, starting at a specified location
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask: the cpumask poiter
+ * @start: the start location
+ *
+ * The implementation does not assume any bit in @mask is set (including @start).
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_wrap(cpu, mask, start)					\
+	for ((cpu) = cpumask_next_wrap((start)-1, (mask), (start), false);	\
+	     (cpu) < nr_cpumask_bits;						\
+	     (cpu) = cpumask_next_wrap((cpu), (mask), (start), true))
+
 /**
  * for_each_cpu_and - iterate over every cpu in both masks
  * @cpu: the (optionally unsigned) integer iterator
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a903276..d89d700 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5640,43 +5640,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
 }
 
-/*
- * Implement a for_each_cpu() variant that starts the scan at a given cpu
- * (@start), and wraps around.
- *
- * This is used to scan for idle CPUs; such that not all CPUs looking for an
- * idle CPU find the same CPU. The down-side is that tasks tend to cycle
- * through the LLC domain.
- *
- * Especially tbench is found sensitive to this.
- */
-
-static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
-{
-	int next;
-
-again:
-	next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
-
-	if (*wrapped) {
-		if (next >= start)
-			return nr_cpumask_bits;
-	} else {
-		if (next >= nr_cpumask_bits) {
-			*wrapped = 1;
-			n = -1;
-			goto again;
-		}
-	}
-
-	return next;
-}
-
-#define for_each_cpu_wrap(cpu, mask, start, wrap)				\
-	for ((wrap) = 0, (cpu) = (start)-1;					\
-		(cpu) = cpumask_next_wrap((cpu), (mask), (start), &(wrap)),	\
-		(cpu) < nr_cpumask_bits; )
-
 #ifdef CONFIG_SCHED_SMT
 
 static inline void set_idle_cores(int cpu, int val)
@@ -5736,7 +5699,7 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu, wrap;
+	int core, cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -5746,7 +5709,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 
 	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
 
-	for_each_cpu_wrap(core, cpus, target, wrap) {
+	for_each_cpu_wrap(core, cpus, target) {
 		bool idle = true;
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -5812,7 +5775,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle = this_rq()->avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, wrap;
+	int cpu;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -5829,7 +5792,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
 		if (idle_cpu(cpu))
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 81dedaa..4731a08 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -43,6 +43,38 @@ int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
 }
 EXPORT_SYMBOL(cpumask_any_but);
 
+/**
+ * cpumask_next_wrap - helper to implement for_each_cpu_wrap
+ * @n: the cpu prior to the place to search
+ * @mask: the cpumask pointer
+ * @start: the start point of the iteration
+ * @wrap: assume @n crossing @start terminates the iteration
+ *
+ * Returns >= nr_cpu_ids on completion
+ *
+ * Note: the @wrap argument is required for the start condition when
+ * we cannot assume @start is set in @mask.
+ */
+int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap)
+{
+	int next;
+
+again:
+	next = cpumask_next(n, mask);
+
+	if (wrap && n < start && next >= start) {
+		return nr_cpumask_bits;
+
+	} else if (next >= nr_cpumask_bits) {
+		wrap = true;
+		n = -1;
+		goto again;
+	}
+
+	return next;
+}
+EXPORT_SYMBOL(cpumask_next_wrap);
+
 /* These are not inline because of header tangles. */
 #ifdef CONFIG_CPUMASK_OFFSTACK
 /**

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu
  2017-04-13 13:56 ` [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu Lauro Ramos Venancio
  2017-04-13 15:27   ` Rik van Riel
@ 2017-04-14 16:49   ` Peter Zijlstra
  2017-04-17 15:34     ` Lauro Venancio
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2017-04-14 16:49 UTC (permalink / raw)
  To: Lauro Ramos Venancio
  Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On Thu, Apr 13, 2017 at 10:56:09AM -0300, Lauro Ramos Venancio wrote:
> Currently, the group balance cpu is the groups's first CPU. But with
> overlapping groups, two different groups can have the same first CPU.
> 
> This patch uses the group mask to mark all the CPUs that have a
> particular group as its main sched group. The group balance cpu is the
> first group CPU that is also in the mask.

Please give a NUMA configuration and CPU number where this goes wrong.

Because only the first group of a domain matters, and with the other
thing fixed, I'm not immediately seeing where we go wobbly.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-14 11:38   ` Peter Zijlstra
  2017-04-14 12:20     ` Peter Zijlstra
@ 2017-04-14 16:58     ` Peter Zijlstra
  2017-04-17 14:40       ` Lauro Venancio
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2017-04-14 16:58 UTC (permalink / raw)
  To: Lauro Ramos Venancio
  Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On Fri, Apr 14, 2017 at 01:38:13PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
> > This patch constructs the sched groups from each CPU perspective. So, on
> > a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
> > groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
> > [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
> > apart.
> 
> Ah,.. so after drawing pictures I see what went wrong; duh :-(
> 
> An equivalent patch would be (if for_each_cpu_wrap() were exposed):
> 
> @@ -521,11 +588,11 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
>  	struct cpumask *covered = sched_domains_tmpmask;
>  	struct sd_data *sdd = sd->private;
>  	struct sched_domain *sibling;
> -	int i;
> +	int i, wrap;
>  
>  	cpumask_clear(covered);
>  
> -	for_each_cpu(i, span) {
> +	for_each_cpu_wrap(i, span, cpu, wrap) {
>  		struct cpumask *sg_span;
>  
>  		if (cpumask_test_cpu(i, covered))
> 
> 
> We need to start iterating at @cpu, not start at 0 every time.
> 
> 

OK, please have a look here:

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-14 16:58     ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Peter Zijlstra
@ 2017-04-17 14:40       ` Lauro Venancio
  0 siblings, 0 replies; 29+ messages in thread
From: Lauro Venancio @ 2017-04-17 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On 04/14/2017 01:58 PM, Peter Zijlstra wrote:
> On Fri, Apr 14, 2017 at 01:38:13PM +0200, Peter Zijlstra wrote:
>> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
>>> This patch constructs the sched groups from each CPU perspective. So, on
>>> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
>>> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
>>> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
>>> apart.
>> Ah,.. so after drawing pictures I see what went wrong; duh :-(
>>
>> An equivalent patch would be (if for_each_cpu_wrap() were exposed):
>>
>> @@ -521,11 +588,11 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
>>  	struct cpumask *covered = sched_domains_tmpmask;
>>  	struct sd_data *sdd = sd->private;
>>  	struct sched_domain *sibling;
>> -	int i;
>> +	int i, wrap;
>>  
>>  	cpumask_clear(covered);
>>  
>> -	for_each_cpu(i, span) {
>> +	for_each_cpu_wrap(i, span, cpu, wrap) {
>>  		struct cpumask *sg_span;
>>  
>>  		if (cpumask_test_cpu(i, covered))
>>
>>
>> We need to start iterating at @cpu, not start at 0 every time.
>>
>>
> OK, please have a look here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core

Looks good, but please hold these patches until patch 3 is applied.
Without it, the sched_group_capacity (sg->sgc) instance is not selected
correctly and we have a significant performance regression on all NUMA
machines.

I will continue this discussion in the other thread.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu
  2017-04-14 16:49   ` Peter Zijlstra
@ 2017-04-17 15:34     ` Lauro Venancio
  2017-04-18 12:32       ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Lauro Venancio @ 2017-04-17 15:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On 04/14/2017 01:49 PM, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:56:09AM -0300, Lauro Ramos Venancio wrote:
>> Currently, the group balance cpu is the groups's first CPU. But with
>> overlapping groups, two different groups can have the same first CPU.
>>
>> This patch uses the group mask to mark all the CPUs that have a
>> particular group as its main sched group. The group balance cpu is the
>> first group CPU that is also in the mask.
> Please give a NUMA configuration and CPU number where this goes wrong.
On a 4-node machine with ring topology, the groups (0-1,3 [cpu 0]), (0-2 [cpu
1]) and (0,2-3 [cpu 3]) share the same sched_group_capacity instance
when the group's first CPU is used to select the sgc.
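
A tiny userspace illustration (group spans written as bitmasks, one CPU per
node, ffs()-1 standing in for cpumask_first()) shows why: all three spans have
CPU 0 as their first CPU, so keying the sgc on the group's first CPU hands the
same per-CPU instance to all of them:

#include <stdio.h>
#include <strings.h>	/* ffs() */

int main(void)
{
	/* spans of the groups (0-1,3), (0-2) and (0,2-3) as bitmasks */
	unsigned int spans[] = { 0x0b, 0x07, 0x0d };
	int i;

	for (i = 0; i < 3; i++)
		printf("span 0x%02x -> first cpu %d\n",
		       spans[i], ffs(spans[i]) - 1);

	/* all three lines report "first cpu 0" */
	return 0;
}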
>
> Because only the first group of a domain matters, and with the other
> thing fixed, I'm not immediately seeing where we go wobbly.

Before patch 2, the group balance cpu was implicitly used to select the
sched_group_capacity instance. When two different groups had the same
balance cpu, they shared the same sched_group_capacity instance.

After patch 2, a distinct sched_group_capacity instance is assigned
to each group instance.


This patch ensures three things:

1) different instances of the same group share the same
sched_group_capacity instance.

2) instances of different groups don't share the same
sched_group_capacity instance.

3) the group balance cpu must be one of the CPUs where the group is
installed.


I am rebasing this patch on top of your patches.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu
  2017-04-17 15:34     ` Lauro Venancio
@ 2017-04-18 12:32       ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2017-04-18 12:32 UTC (permalink / raw)
  To: Lauro Venancio
  Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On Mon, Apr 17, 2017 at 12:34:05PM -0300, Lauro Venancio wrote:
> This patch ensures tree things:
> 
> 1) different instances of the same group share the same
> sched_group_capacity instance.
> 
> 2) instances of different groups don't share the same
> sched_group_capacity instance.
> 
> 3) the group balance cpu must be one of the cpus where the group is
> installed.
> 
> 
> I am rebasing this patch on top of your patches.

Well, I would rather have 3 patches, each with a comprehensible
changelog.

I had already rebased the patch (trivial) so that's not the problem.
Explaining what and why it does things is.

And as per the usual rules, if it does 3 things it should be 3 patches.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [tip:sched/core] sched/topology: Refactor function build_overlap_sched_groups()
  2017-04-13 13:56 ` [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() Lauro Ramos Venancio
  2017-04-13 14:50   ` Rik van Riel
@ 2017-05-15  9:02   ` tip-bot for Lauro Ramos Venancio
  1 sibling, 0 replies; 29+ messages in thread
From: tip-bot for Lauro Ramos Venancio @ 2017-05-15  9:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: riel, lvenanci, torvalds, mingo, efault, tglx, hpa, linux-kernel, peterz

Commit-ID:  8c0334697dc37eb3d6d7632304d3a3662248daac
Gitweb:     http://git.kernel.org/tip/8c0334697dc37eb3d6d7632304d3a3662248daac
Author:     Lauro Ramos Venancio <lvenanci@redhat.com>
AuthorDate: Thu, 13 Apr 2017 10:56:07 -0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 15 May 2017 10:15:22 +0200

sched/topology: Refactor function build_overlap_sched_groups()

Create functions build_group_from_child_sched_domain() and
init_overlap_sched_group(). No functional change.

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1492091769-19879-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/topology.c | 62 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 43 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1b0b4fb..d786d45 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -513,6 +513,47 @@ int group_balance_cpu(struct sched_group *sg)
 	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
 }
 
+static struct sched_group *
+build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
+{
+	struct sched_group *sg;
+	struct cpumask *sg_span;
+
+	sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
+			GFP_KERNEL, cpu_to_node(cpu));
+
+	if (!sg)
+		return NULL;
+
+	sg_span = sched_group_cpus(sg);
+	if (sd->child)
+		cpumask_copy(sg_span, sched_domain_span(sd->child));
+	else
+		cpumask_copy(sg_span, sched_domain_span(sd));
+
+	return sg;
+}
+
+static void init_overlap_sched_group(struct sched_domain *sd,
+				     struct sched_group *sg, int cpu)
+{
+	struct sd_data *sdd = sd->private;
+	struct cpumask *sg_span;
+
+	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
+	if (atomic_inc_return(&sg->sgc->ref) == 1)
+		build_group_mask(sd, sg);
+
+	/*
+	 * Initialize sgc->capacity such that even if we mess up the
+	 * domains and no possible iteration will get us here, we won't
+	 * die on a /0 trap.
+	 */
+	sg_span = sched_group_cpus(sg);
+	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+}
+
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
@@ -537,31 +578,14 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 
-		sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
-				GFP_KERNEL, cpu_to_node(cpu));
-
+		sg = build_group_from_child_sched_domain(sibling, cpu);
 		if (!sg)
 			goto fail;
 
 		sg_span = sched_group_cpus(sg);
-		if (sibling->child)
-			cpumask_copy(sg_span, sched_domain_span(sibling->child));
-		else
-			cpumask_set_cpu(i, sg_span);
-
 		cpumask_or(covered, covered, sg_span);
 
-		sg->sgc = *per_cpu_ptr(sdd->sgc, i);
-		if (atomic_inc_return(&sg->sgc->ref) == 1)
-			build_group_mask(sd, sg);
-
-		/*
-		 * Initialize sgc->capacity such that even if we mess up the
-		 * domains and no possible iteration will get us here, we won't
-		 * die on a /0 trap.
-		 */
-		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
-		sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+		init_overlap_sched_group(sd, sg, i);
 
 		/*
 		 * Make sure the first group of this domain contains the

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
  2017-04-14 12:20     ` Peter Zijlstra
@ 2017-05-15  9:03       ` tip-bot for Peter Zijlstra
  2017-05-17 10:53         ` hackbench vs select_idle_sibling; was: " Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-05-15  9:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, tglx, peterz, riel, efault, hpa, torvalds, lvenanci, linux-kernel

Commit-ID:  c743f0a5c50f2fcbc628526279cfa24f3dabe182
Gitweb:     http://git.kernel.org/tip/c743f0a5c50f2fcbc628526279cfa24f3dabe182
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 14 Apr 2017 14:20:05 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 15 May 2017 10:15:23 +0200

sched/fair, cpumask: Export for_each_cpu_wrap()

More users for for_each_cpu_wrap() have appeared. Promote the construct
to the generic cpumask interface.

The implementation is slightly modified to reduce arguments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Lauro Ramos Venancio <lvenanci@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Link: http://lkml.kernel.org/r/20170414122005.o35me2h5nowqkxbv@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/cpumask.h | 17 +++++++++++++++++
 kernel/sched/fair.c     | 45 ++++-----------------------------------------
 lib/cpumask.c           | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 41 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 2404ad2..a21b1fb 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -236,6 +236,23 @@ unsigned int cpumask_local_spread(unsigned int i, int node);
 		(cpu) = cpumask_next_zero((cpu), (mask)),	\
 		(cpu) < nr_cpu_ids;)
 
+extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap);
+
+/**
+ * for_each_cpu_wrap - iterate over every cpu in a mask, starting at a specified location
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask: the cpumask poiter
+ * @start: the start location
+ *
+ * The implementation does not assume any bit in @mask is set (including @start).
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_wrap(cpu, mask, start)					\
+	for ((cpu) = cpumask_next_wrap((start)-1, (mask), (start), false);	\
+	     (cpu) < nr_cpumask_bits;						\
+	     (cpu) = cpumask_next_wrap((cpu), (mask), (start), true))
+
 /**
  * for_each_cpu_and - iterate over every cpu in both masks
  * @cpu: the (optionally unsigned) integer iterator
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f1825d..f80c825 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5640,43 +5640,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
 }
 
-/*
- * Implement a for_each_cpu() variant that starts the scan at a given cpu
- * (@start), and wraps around.
- *
- * This is used to scan for idle CPUs; such that not all CPUs looking for an
- * idle CPU find the same CPU. The down-side is that tasks tend to cycle
- * through the LLC domain.
- *
- * Especially tbench is found sensitive to this.
- */
-
-static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
-{
-	int next;
-
-again:
-	next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
-
-	if (*wrapped) {
-		if (next >= start)
-			return nr_cpumask_bits;
-	} else {
-		if (next >= nr_cpumask_bits) {
-			*wrapped = 1;
-			n = -1;
-			goto again;
-		}
-	}
-
-	return next;
-}
-
-#define for_each_cpu_wrap(cpu, mask, start, wrap)				\
-	for ((wrap) = 0, (cpu) = (start)-1;					\
-		(cpu) = cpumask_next_wrap((cpu), (mask), (start), &(wrap)),	\
-		(cpu) < nr_cpumask_bits; )
-
 #ifdef CONFIG_SCHED_SMT
 
 static inline void set_idle_cores(int cpu, int val)
@@ -5736,7 +5699,7 @@ unlock:
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu, wrap;
+	int core, cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -5746,7 +5709,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 
 	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
 
-	for_each_cpu_wrap(core, cpus, target, wrap) {
+	for_each_cpu_wrap(core, cpus, target) {
 		bool idle = true;
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -5812,7 +5775,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle = this_rq()->avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, wrap;
+	int cpu;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -5829,7 +5792,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
 		if (idle_cpu(cpu))
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 81dedaa..4731a08 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -43,6 +43,38 @@ int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
 }
 EXPORT_SYMBOL(cpumask_any_but);
 
+/**
+ * cpumask_next_wrap - helper to implement for_each_cpu_wrap
+ * @n: the cpu prior to the place to search
+ * @mask: the cpumask pointer
+ * @start: the start point of the iteration
+ * @wrap: assume @n crossing @start terminates the iteration
+ *
+ * Returns >= nr_cpu_ids on completion
+ *
+ * Note: the @wrap argument is required for the start condition when
+ * we cannot assume @start is set in @mask.
+ */
+int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap)
+{
+	int next;
+
+again:
+	next = cpumask_next(n, mask);
+
+	if (wrap && n < start && next >= start) {
+		return nr_cpumask_bits;
+
+	} else if (next >= nr_cpumask_bits) {
+		wrap = true;
+		n = -1;
+		goto again;
+	}
+
+	return next;
+}
+EXPORT_SYMBOL(cpumask_next_wrap);
+
 /* These are not inline because of header tangles. */
 #ifdef CONFIG_CPUMASK_OFFSTACK
 /**

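(Aside, not part of the commit: a minimal user-space model of the
iteration order the new helper produces. NBITS, next_bit() and
next_wrap() are made-up stand-ins for nr_cpumask_bits, cpumask_next()
and cpumask_next_wrap(); the wrap logic mirrors the function added
above.)

#include <stdio.h>

#define NBITS 8

/* stand-in for cpumask_next(): first set bit at or after @from, or NBITS */
static int next_bit(unsigned int mask, int from)
{
	for (int i = from; i < NBITS; i++)
		if (mask & (1u << i))
			return i;
	return NBITS;
}

/* same structure as the cpumask_next_wrap() added above */
static int next_wrap(int n, unsigned int mask, int start, int wrap)
{
	int next;
again:
	next = next_bit(mask, n + 1);
	if (wrap && n < start && next >= start)
		return NBITS;          /* scan crossed @start again: done */
	if (next >= NBITS) {
		wrap = 1;              /* fell off the top: restart from bit 0 */
		n = -1;
		goto again;
	}
	return next;
}

int main(void)
{
	unsigned int mask = 0x0f;      /* "CPUs" 0-3 all present */
	int start = 2;

	for (int cpu = next_wrap(start - 1, mask, start, 0);
	     cpu < NBITS;
	     cpu = next_wrap(cpu, mask, start, 1))
		printf("%d ", cpu);    /* prints: 2 3 0 1 */
	printf("\n");
	return 0;
}

For start = 2 on a contiguous 4-bit mask the visit order is 2, 3, 0, 1;
the loop terminates once the scan would cross @start again.
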
^ permalink raw reply related	[flat|nested] 29+ messages in thread

* hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
  2017-05-15  9:03       ` [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() tip-bot for Peter Zijlstra
@ 2017-05-17 10:53         ` Peter Zijlstra
  2017-05-17 12:46           ` Matt Fleming
                             ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Peter Zijlstra @ 2017-05-17 10:53 UTC (permalink / raw)
  To: mingo, tglx, riel, hpa, efault, linux-kernel, torvalds, lvenanci
  Cc: xiaolong.ye, kitsunyan, clm, matt

On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:
> sched/fair, cpumask: Export for_each_cpu_wrap()

> -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
> -{

> -	next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);

> -}

OK, so this patch fixed an actual bug in the for_each_cpu_wrap()
implementation. The above 'n+1' should be 'n', and the effect is that
it'll skip over CPUs, potentially resulting in an iteration that only
sees every other CPU (for a fully contiguous mask).

This in turn causes hackbench to further suffer from the regression
introduced by commit:

  4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")

So it's well past time to fix this.

Where the old scheme was a cliff-edge throttle on idle scanning, this
introduces a more gradual approach. Instead of stopping the scan
entirely, we limit how many CPUs we scan.
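
(Aside for illustration, not part of the patch: a standalone user-space
sketch of the scan-limit arithmetic added below. The function name and
the sample numbers are made up.)

#include <stdio.h>
#include <stdint.h>

/* Sketch of the SIS_PROP limit: how many CPUs select_idle_cpu() may scan. */
static uint64_t sis_prop_scan_limit(uint64_t span_weight,   /* CPUs in the LLC domain */
				    uint64_t rq_avg_idle,    /* this_rq()->avg_idle, ns */
				    uint64_t avg_scan_cost)  /* sd->avg_scan_cost, ns */
{
	uint64_t avg_idle = rq_avg_idle / 512;   /* large fuzz factor, as in the patch */
	uint64_t avg_cost = avg_scan_cost + 1;   /* +1 avoids dividing by zero */
	uint64_t span_avg = span_weight * avg_idle;

	if (span_avg > 4 * avg_cost)
		return span_avg / avg_cost;      /* proportional limit */
	return 4;                                /* but always scan at least 4 CPUs */
}

int main(void)
{
	/* e.g. a 40-CPU LLC, 200us average idle time, 400ns average scan cost */
	printf("scan at most %llu CPUs\n",
	       (unsigned long long)sis_prop_scan_limit(40, 200000, 400));
	return 0;
}

With these made-up numbers the limit works out to 38 of the 40 CPUs; as
avg_scan_cost grows or avg_idle shrinks, the limit drops towards the
floor of 4 rather than cutting the scan off entirely.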

Initial benchmarks show that it mostly recovers hackbench while not
hurting anything else, except Mason's schbench, but not as bad as the
old thing.

It also appears to recover the tbench high-end, which also suffered like
hackbench.

I'm also hoping it will fix/preserve kitsunyan's interactivity issue.

Please test..

Cc: xiaolong.ye@intel.com
Cc: kitsunyan <kitsunyan@inbox.ru>
Cc: Chris Mason <clm@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
---
 kernel/sched/fair.c     | 21 ++++++++++++++++-----
 kernel/sched/features.h |  1 +
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 219fe58..8c1afa0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5793,27 +5793,38 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct sched_domain *this_sd;
-	u64 avg_cost, avg_idle = this_rq()->avg_idle;
+	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu;
+	int cpu, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
 		return -1;
 
-	avg_cost = this_sd->avg_scan_cost;
-
 	/*
 	 * Due to large variance we need a large fuzz factor; hackbench in
 	 * particularly is sensitive here.
 	 */
-	if (sched_feat(SIS_AVG_CPU) && (avg_idle / 512) < avg_cost)
+	avg_idle = this_rq()->avg_idle / 512;
+	avg_cost = this_sd->avg_scan_cost + 1;
+
+	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
 		return -1;
 
+	if (sched_feat(SIS_PROP)) {
+		u64 span_avg = sd->span_weight * avg_idle;
+		if (span_avg > 4*avg_cost)
+			nr = div_u64(span_avg, avg_cost);
+		else
+			nr = 4;
+	}
+
 	time = local_clock();
 
 	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+		if (!--nr)
+			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
 		if (idle_cpu(cpu))
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index dc4d148..d3fb155 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -55,6 +55,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
+SCHED_FEAT(SIS_PROP, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
  2017-05-17 10:53         ` hackbench vs select_idle_sibling; was: " Peter Zijlstra
@ 2017-05-17 12:46           ` Matt Fleming
  2017-05-17 14:49           ` Chris Mason
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Matt Fleming @ 2017-05-17 12:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, riel, hpa, efault, linux-kernel, torvalds, lvenanci,
	xiaolong.ye, kitsunyan, clm

On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:
> > sched/fair, cpumask: Export for_each_cpu_wrap()
> 
> > -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
> > -{
> 
> > -	next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
> 
> > -}
> 
> OK, so this patch fixed an actual bug in the for_each_cpu_wrap()
> implementation. The above 'n+1' should be 'n', and the effect is that
> it'll skip over CPUs, potentially resulting in an iteration that only
> sees every other CPU (for a fully contiguous mask).
> 
> This in turn causes hackbench to further suffer from the regression
> introduced by commit:
> 
>   4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")
> 
> So it's well past time to fix this.
> 
> Where the old scheme was a cliff-edge throttle on idle scanning, this
> introduces a more gradual approach. Instead of stopping the scan
> entirely, we limit how many CPUs we scan.
> 
> Initial benchmarks show that it mostly recovers hackbench while not
> hurting anything else, except Mason's schbench, but not as bad as the
> old thing.
> 
> It also appears to recover the tbench high-end, which also suffered like
> hackbench.
> 
> I'm also hoping it will fix/preserve kitsunyan's interactivity issue.
> 
> Please test..

Tests queued up at SUSE. I'll let you know the results as soon as
they're ready -- it'll be at least a couple of days.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
  2017-05-17 10:53         ` hackbench vs select_idle_sibling; was: " Peter Zijlstra
  2017-05-17 12:46           ` Matt Fleming
@ 2017-05-17 14:49           ` Chris Mason
  2017-05-19 15:00           ` Matt Fleming
  2017-06-08  9:22           ` [tip:sched/core] sched/core: Implement new approach to scale select_idle_cpu() tip-bot for Peter Zijlstra
  3 siblings, 0 replies; 29+ messages in thread
From: Chris Mason @ 2017-05-17 14:49 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, tglx, riel, hpa, efault, linux-kernel,
	torvalds, lvenanci
  Cc: xiaolong.ye, kitsunyan, matt

On 05/17/2017 06:53 AM, Peter Zijlstra wrote:
> On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:
>> sched/fair, cpumask: Export for_each_cpu_wrap()
>
>> -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
>> -{
>
>> -	next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
>
>> -}
>
> OK, so this patch fixed an actual bug in the for_each_cpu_wrap()
> implementation. The above 'n+1' should be 'n', and the effect is that
> it'll skip over CPUs, potentially resulting in an iteration that only
> sees every other CPU (for a fully contiguous mask).
>
> This in turn causes hackbench to further suffer from the regression
> introduced by commit:
>
>   4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")
>
> So it's well past time to fix this.
>
> Where the old scheme was a cliff-edge throttle on idle scanning, this
> introduces a more gradual approach. Instead of stopping the scan
> entirely, we limit how many CPUs we scan.
>
> Initial benchmarks show that it mostly recovers hackbench while not
> hurting anything else, except Mason's schbench, but not as bad as the
> old thing.
>
> It also appears to recover the tbench high-end, which also suffered like
> hackbench.
>
> I'm also hoping it will fix/preserve kitsunyan's interactivity issue.
>
> Please test..

We'll get some tests going here too.

-chris

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
  2017-05-17 10:53         ` hackbench vs select_idle_sibling; was: " Peter Zijlstra
  2017-05-17 12:46           ` Matt Fleming
  2017-05-17 14:49           ` Chris Mason
@ 2017-05-19 15:00           ` Matt Fleming
  2017-06-05 13:00             ` Matt Fleming
  2017-06-08  9:22           ` [tip:sched/core] sched/core: Implement new approach to scale select_idle_cpu() tip-bot for Peter Zijlstra
  3 siblings, 1 reply; 29+ messages in thread
From: Matt Fleming @ 2017-05-19 15:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, riel, hpa, efault, linux-kernel, torvalds, lvenanci,
	xiaolong.ye, kitsunyan, clm

On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> 
> Please test..

Results are still coming in but things do look better with your patch
applied.

It does look like there's a regression when running hackbench in
process mode and when the CPUs are not fully utilised, e.g. check this
out:

hackbench-process-pipes
                            4.4.68                     4.4.68                4.4.68                4.4.68
                        sles12-sp3 select-idle-cpu-aggressive for-each-cpu-wrap-fix  latest-hackbench-fix
Amean    1        0.8853 (  0.00%)           1.2160 (-37.35%)      1.0350 (-16.91%)      1.1853 (-33.89%)

This machine has 80 CPUs and that's a 40 process workload.

Here's the key:

select-idle-cpu-aggressive: 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")
for-each-cpu-wrap-fix: c743f0a5c50f ("sched/fair, cpumask: Export for_each_cpu_wrap()")
latest-hackbench-fix: this patch

But those results definitely look to be an exception. Here's the same
machine running the same number of tasks but with pthreads,

hackbench-thread-pipes
                            4.4.68                     4.4.68                4.4.68                4.4.68
                        sles12-sp3 select-idle-cpu-aggressive for-each-cpu-wrap-fix  latest-hackbench-fix
Amean    1        0.7427 (  0.00%)           0.9760 (-31.42%)      1.1907 (-60.32%)      0.7643 ( -2.92%)

Nice win.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
  2017-05-19 15:00           ` Matt Fleming
@ 2017-06-05 13:00             ` Matt Fleming
  2017-06-06  9:21               ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Matt Fleming @ 2017-06-05 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, riel, hpa, efault, linux-kernel, torvalds, lvenanci,
	xiaolong.ye, kitsunyan, clm

On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:
> On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> > 
> > Please test..
> 
> Results are still coming in but things do look better with your patch
> applied.
> 
> It does look like there's a regression when running hackbench in
> process mode and when the CPUs are not fully utilised, e.g. check this
> out:

This turned out to be a false positive; your patch improves things as
far as I can see.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
  2017-06-05 13:00             ` Matt Fleming
@ 2017-06-06  9:21               ` Peter Zijlstra
  2017-06-09 17:52                 ` Chris Mason
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2017-06-06  9:21 UTC (permalink / raw)
  To: Matt Fleming
  Cc: mingo, tglx, riel, hpa, efault, linux-kernel, torvalds, lvenanci,
	xiaolong.ye, kitsunyan, clm

On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote:
> On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:
> > On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> > > 
> > > Please test..
> > 
> > Results are still coming in but things do look better with your patch
> > applied.
> > 
> > It does look like there's a regression when running hackbench in
> > process mode and when the CPUs are not fully utilised, e.g. check this
> > out:
> 
> This turned out to be a false positive; your patch improves things as
> far as I can see.

Hooray, I'll move it to a part of the queue intended for merging.

Thanks!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [tip:sched/core] sched/core: Implement new approach to scale select_idle_cpu()
  2017-05-17 10:53         ` hackbench vs select_idle_sibling; was: " Peter Zijlstra
                             ` (2 preceding siblings ...)
  2017-05-19 15:00           ` Matt Fleming
@ 2017-06-08  9:22           ` tip-bot for Peter Zijlstra
  3 siblings, 0 replies; 29+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-06-08  9:22 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: clm, torvalds, efault, mingo, kitsunyan, peterz, linux-kernel,
	tglx, matt, hpa

Commit-ID:  1ad3aaf3fcd2444406628a19a9b9e0922b95e2d4
Gitweb:     http://git.kernel.org/tip/1ad3aaf3fcd2444406628a19a9b9e0922b95e2d4
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 17 May 2017 12:53:50 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 8 Jun 2017 10:25:17 +0200

sched/core: Implement new approach to scale select_idle_cpu()

Hackbench recently suffered a bunch of pain, first by commit:

  4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")

and then by commit:

  c743f0a5c50f ("sched/fair, cpumask: Export for_each_cpu_wrap()")

which fixed a bug in the initial for_each_cpu_wrap() implementation
that made select_idle_cpu() even more expensive. The bug was that it
would skip over CPUs when bits were consecutive in the bitmask.

This however gave me an idea to fix select_idle_cpu(); where the old
scheme was a cliff-edge throttle on idle scanning, this introduces a
more gradual approach. Instead of stopping the scan entirely, we limit
how many CPUs we scan.

Initial benchmarks show that it mostly recovers hackbench while not
hurting anything else, except Mason's schbench, but not as bad as the
old thing.

It also appears to recover the tbench high-end, which also suffered like
hackbench.

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: kitsunyan <kitsunyan@inbox.ru>
Cc: linux-kernel@vger.kernel.org
Cc: lvenanci@redhat.com
Cc: riel@redhat.com
Cc: xiaolong.ye@intel.com
Link: http://lkml.kernel.org/r/20170517105350.hk5m4h4jb6dfr65a@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c     | 21 ++++++++++++++++-----
 kernel/sched/features.h |  1 +
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 47a0c55..396bca9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5794,27 +5794,38 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct sched_domain *this_sd;
-	u64 avg_cost, avg_idle = this_rq()->avg_idle;
+	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu;
+	int cpu, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
 		return -1;
 
-	avg_cost = this_sd->avg_scan_cost;
-
 	/*
 	 * Due to large variance we need a large fuzz factor; hackbench in
 	 * particularly is sensitive here.
 	 */
-	if (sched_feat(SIS_AVG_CPU) && (avg_idle / 512) < avg_cost)
+	avg_idle = this_rq()->avg_idle / 512;
+	avg_cost = this_sd->avg_scan_cost + 1;
+
+	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
 		return -1;
 
+	if (sched_feat(SIS_PROP)) {
+		u64 span_avg = sd->span_weight * avg_idle;
+		if (span_avg > 4*avg_cost)
+			nr = div_u64(span_avg, avg_cost);
+		else
+			nr = 4;
+	}
+
 	time = local_clock();
 
 	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+		if (!--nr)
+			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
 		if (idle_cpu(cpu))
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index dc4d148..d3fb155 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -55,6 +55,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
+SCHED_FEAT(SIS_PROP, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
  2017-06-06  9:21               ` Peter Zijlstra
@ 2017-06-09 17:52                 ` Chris Mason
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Mason @ 2017-06-09 17:52 UTC (permalink / raw)
  To: Peter Zijlstra, Matt Fleming
  Cc: mingo, tglx, riel, hpa, efault, linux-kernel, torvalds, lvenanci,
	xiaolong.ye, kitsunyan

On 06/06/2017 05:21 AM, Peter Zijlstra wrote:
> On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote:
>> On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:
>>> On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
>>>>
>>>> Please test..
>>>
>>> Results are still coming in but things do look better with your patch
>>> applied.
>>>
>>> It does look like there's a regression when running hackbench in
>>> process mode and when the CPUs are not fully utilised, e.g. check this
>>> out:
>>
>> This turned out to be a false positive; your patch improves things as
>> far as I can see.
>
> Hooray, I'll move it to a part of the queue intended for merging.

It's a little late, but Roman Gushchin helped get some runs of this with 
our production workload.  The patch is ever so slightly better.

Thanks!

-chris

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread

Thread overview: 29+ messages
2017-04-13 13:56 [RFC 0/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
2017-04-13 13:56 ` [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() Lauro Ramos Venancio
2017-04-13 14:50   ` Rik van Riel
2017-05-15  9:02   ` [tip:sched/core] " tip-bot for Lauro Ramos Venancio
2017-04-13 13:56 ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
2017-04-13 15:16   ` Rik van Riel
2017-04-13 15:48   ` Peter Zijlstra
2017-04-13 20:21     ` Lauro Venancio
2017-04-13 21:06       ` Lauro Venancio
2017-04-13 23:38         ` Rik van Riel
2017-04-14 10:48           ` Peter Zijlstra
2017-04-14 11:38   ` Peter Zijlstra
2017-04-14 12:20     ` Peter Zijlstra
2017-05-15  9:03       ` [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() tip-bot for Peter Zijlstra
2017-05-17 10:53         ` hackbench vs select_idle_sibling; was: " Peter Zijlstra
2017-05-17 12:46           ` Matt Fleming
2017-05-17 14:49           ` Chris Mason
2017-05-19 15:00           ` Matt Fleming
2017-06-05 13:00             ` Matt Fleming
2017-06-06  9:21               ` Peter Zijlstra
2017-06-09 17:52                 ` Chris Mason
2017-06-08  9:22           ` [tip:sched/core] sched/core: Implement new approach to scale select_idle_cpu() tip-bot for Peter Zijlstra
2017-04-14 16:58     ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Peter Zijlstra
2017-04-17 14:40       ` Lauro Venancio
2017-04-13 13:56 ` [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu Lauro Ramos Venancio
2017-04-13 15:27   ` Rik van Riel
2017-04-14 16:49   ` Peter Zijlstra
2017-04-17 15:34     ` Lauro Venancio
2017-04-18 12:32       ` Peter Zijlstra
