* [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu
@ 2017-04-20 19:51 Lauro Ramos Venancio
  2017-04-20 19:51 ` [PATCH 1/4] sched/topology: optimize build_group_mask() Lauro Ramos Venancio
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Lauro Ramos Venancio @ 2017-04-20 19:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	linux-kernel, Lauro Ramos Venancio

This patchset is on top of Peter Zijlstra's sched/core tree[1]. It is equivalent
to patch 3 from my previous patchset[2].

This patchset ensures:

 1) different instances of the same sched group share the same
    sched_group_capacity instance.
 2) instances of different groups don't share the same sched_group_capacity
    instance.
 3) the group balance cpu is one of the cpus where the group is installed.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core
[2] https://lkml.org/lkml/2017/4/13/355

Lauro Ramos Venancio (4):
  sched/topology: optimize build_group_mask()
  sched/topology: all instances of a sched group must use the same
    sched_group_capacity
  sched/topology: move comment about asymmetric node setups
  sched/topology: the group balance cpu must be a cpu where the group is
    installed

 kernel/sched/topology.c | 83 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 67 insertions(+), 16 deletions(-)

-- 
1.8.3.1


* [PATCH 1/4] sched/topology: optimize build_group_mask()
  2017-04-20 19:51 [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu Lauro Ramos Venancio
@ 2017-04-20 19:51 ` Lauro Ramos Venancio
  2017-05-15  9:06   ` [tip:sched/core] sched/topology: Optimize build_group_mask() tip-bot for Lauro Ramos Venancio
  2017-04-20 19:51 ` [PATCH 2/4] sched/topology: all instances of a sched group must use the same sched_group_capacity Lauro Ramos Venancio
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Lauro Ramos Venancio @ 2017-04-20 19:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	linux-kernel, Lauro Ramos Venancio

The group mask is always used in intersection with the group cpus. So,
when building the group mask, we don't have to care about cpus that are
not part of the group.
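
For example, group_balance_cpu() already intersects the mask with the
group span:

    int group_balance_cpu(struct sched_group *sg)
    {
            return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
    }

so bits set outside sched_group_cpus(sg) can never be selected anyway.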

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
---
 kernel/sched/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9d4566c..f8b53b3 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -506,12 +506,12 @@ enum s_alloc {
  */
 static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
 {
-	const struct cpumask *span = sched_domain_span(sd);
+	const struct cpumask *sg_span = sched_group_cpus(sg);
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
 
-	for_each_cpu(i, span) {
+	for_each_cpu(i, sg_span) {
 		sibling = *per_cpu_ptr(sdd->sd, i);
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
-- 
1.8.3.1


* [PATCH 2/4] sched/topology: all instances of a sched group must use the same sched_group_capacity
  2017-04-20 19:51 [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu Lauro Ramos Venancio
  2017-04-20 19:51 ` [PATCH 1/4] sched/topology: optimize build_group_mask() Lauro Ramos Venancio
@ 2017-04-20 19:51 ` Lauro Ramos Venancio
  2017-04-20 19:51 ` [PATCH 3/4] sched/topology: move comment about asymmetric node setups Lauro Ramos Venancio
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Lauro Ramos Venancio @ 2017-04-20 19:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	linux-kernel, Lauro Ramos Venancio

Use the group balance cpu to select the same sched_group_capacity
instance for all instances of a sched group.

As the group mask is stored in struct sched_group_capacity,
group_balance_cpu() cannot be used before the mask is built. So this
patch adds a function to find the group balance cpu when the mask is
not yet available.
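
In other words, every instance of a group now resolves its
sched_group_capacity through the same cpu:

    sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);  /* cpu == group balance cpu */

so instances that span the same cpus end up sharing one sgc instance.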

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
---
 kernel/sched/topology.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f8b53b3..55bbaf7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -529,6 +529,27 @@ int group_balance_cpu(struct sched_group *sg)
 	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
 }
 
+/*
+ * Find the group balance cpu when the group mask is not available yet.
+ */
+static int find_group_balance_cpu(struct sched_domain *sd,
+				  struct sched_group *sg)
+{
+	const struct cpumask *sg_span = sched_group_cpus(sg);
+	struct sd_data *sdd = sd->private;
+	struct sched_domain *sibling;
+	int i;
+
+	for_each_cpu(i, sg_span) {
+		sibling = *per_cpu_ptr(sdd->sd, i);
+		if (cpumask_test_cpu(i, sched_domain_span(sibling)))
+			return i;
+	}
+
+	WARN(1, "group balance cpu not found.");
+	return 0;
+}
+
 static struct sched_group *
 build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 {
@@ -551,10 +572,11 @@ int group_balance_cpu(struct sched_group *sg)
 }
 
 static void init_overlap_sched_group(struct sched_domain *sd,
-				     struct sched_group *sg, int cpu)
+				     struct sched_group *sg)
 {
 	struct sd_data *sdd = sd->private;
 	struct cpumask *sg_span;
+	int cpu = find_group_balance_cpu(sd, sg);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
@@ -601,7 +623,7 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 		sg_span = sched_group_cpus(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg, i);
+		init_overlap_sched_group(sd, sg);
 
 		if (!first)
 			first = sg;
-- 
1.8.3.1


* [PATCH 3/4] sched/topology: move comment about asymmetric node setups
  2017-04-20 19:51 [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu Lauro Ramos Venancio
  2017-04-20 19:51 ` [PATCH 1/4] sched/topology: optimize build_group_mask() Lauro Ramos Venancio
  2017-04-20 19:51 ` [PATCH 2/4] sched/topology: all instances of a sched group must use the same sched_group_capacity Lauro Ramos Venancio
@ 2017-04-20 19:51 ` Lauro Ramos Venancio
  2017-04-21 16:31   ` Peter Zijlstra
  2017-05-15  9:06   ` [tip:sched/core] sched/topology: Move " tip-bot for Lauro Ramos Venancio
  2017-04-20 19:51 ` [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed Lauro Ramos Venancio
  2017-04-26 16:31 ` [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu Peter Zijlstra
  4 siblings, 2 replies; 28+ messages in thread
From: Lauro Ramos Venancio @ 2017-04-20 19:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	linux-kernel, Lauro Ramos Venancio

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
---
 kernel/sched/topology.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 55bbaf7..e77c93a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -495,14 +495,6 @@ enum s_alloc {
 /*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
- *
- * Asymmetric node setups can result in situations where the domain tree is of
- * unequal depth, make sure to skip domains that already cover the entire
- * range.
- *
- * In that case build_sched_domains() will have terminated the iteration early
- * and our sibling sd spans will be empty. Domains should always include the
- * CPU they're built on, so check that.
  */
 static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
 {
@@ -612,7 +604,16 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 
 		sibling = *per_cpu_ptr(sdd->sd, i);
 
-		/* See the comment near build_group_mask(). */
+		/*
+		 * Asymmetric node setups can result in situations where the
+		 * domain tree is of unequal depth, make sure to skip domains
+		 * that already cover the entire range.
+		 *
+		 * In that case build_sched_domains() will have terminated the
+		 * iteration early and our sibling sd spans will be empty.
+		 * Domains should always include the CPU they're built on, so
+		 * check that.
+		 */
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 
-- 
1.8.3.1


* [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-20 19:51 [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu Lauro Ramos Venancio
                   ` (2 preceding siblings ...)
  2017-04-20 19:51 ` [PATCH 3/4] sched/topology: move comment about asymmetric node setups Lauro Ramos Venancio
@ 2017-04-20 19:51 ` Lauro Ramos Venancio
  2017-04-24 13:03   ` Peter Zijlstra
  2017-04-26 16:31 ` [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu Peter Zijlstra
  4 siblings, 1 reply; 28+ messages in thread
From: Lauro Ramos Venancio @ 2017-04-20 19:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	linux-kernel, Lauro Ramos Venancio

An overlap sched group may not be installed in all cpus that compose the
group. Currently, the group balance cpu may be a cpu where the group is
not installed, causing two problems:

1) Two groups may have the same group balance cpu and, as a consequence,
share the same sched_group_capacity.
2) should_we_balance() in fair.c may never return true.

This patch changes the meaning of the group mask: it now marks all the
cpus where the group is installed.
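
To illustrate problem 2, the decisive part of should_we_balance() looks
roughly like this (condensed, not the verbatim fair.c code; the variable
names are only illustrative):

    /* fall back to the group balance cpu if no idle cpu was found */
    if (balance_cpu == -1)
            balance_cpu = group_balance_cpu(env->sd->groups);

    /* only that cpu may run the balance pass at this domain */
    return balance_cpu == env->dst_cpu;

If the balance cpu is a cpu where this group instance is not installed,
none of the cpus that actually own this domain instance can ever match
it, so load balancing at this level is always skipped.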

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
---
 kernel/sched/topology.c | 40 ++++++++++++++++++++++++++++++++++------
 1 file changed, 34 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e77c93a..694e799 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -493,8 +493,10 @@ enum s_alloc {
 };
 
 /*
- * Build an iteration mask that can exclude certain CPUs from the upwards
- * domain traversal.
+ * An overlap sched group may not be installed in all CPUs that compose the
+ * group. So build the mask, marking all the CPUs where the group is installed.
+ *
+ * This function can only be used when all the groups are already built.
  */
 static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
 {
@@ -505,7 +507,11 @@ static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
 
 	for_each_cpu(i, sg_span) {
 		sibling = *per_cpu_ptr(sdd->sd, i);
-		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
+
+		if (!sibling->groups)
+			continue;
+
+		if (!cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
 			continue;
 
 		cpumask_set_cpu(i, sched_group_mask(sg));
@@ -523,6 +529,7 @@ int group_balance_cpu(struct sched_group *sg)
 
 /*
  * Find the group balance cpu when the group mask is not available yet.
+ * This function can only be used when all the groups are already built.
  */
 static int find_group_balance_cpu(struct sched_domain *sd,
 				  struct sched_group *sg)
@@ -534,7 +541,11 @@ static int find_group_balance_cpu(struct sched_domain *sd,
 
 	for_each_cpu(i, sg_span) {
 		sibling = *per_cpu_ptr(sdd->sd, i);
-		if (cpumask_test_cpu(i, sched_domain_span(sibling)))
+
+		if (!sibling->groups)
+			continue;
+
+		if (cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
 			return i;
 	}
 
@@ -584,6 +595,17 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
 }
 
+static void init_overlap_sched_groups(struct sched_domain *sd)
+{
+	struct sched_group *sg = sd->groups;
+
+	do {
+		init_overlap_sched_group(sd, sg);
+
+		sg = sg->next;
+	} while (sg != sd->groups);
+}
+
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
@@ -624,8 +646,6 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 		sg_span = sched_group_cpus(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg);
-
 		if (!first)
 			first = sg;
 		if (last)
@@ -1482,6 +1502,14 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		}
 	}
 
+	/* Init overlap groups */
+	for_each_cpu(i, cpu_map) {
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			if (sd->flags & SD_OVERLAP)
+				init_overlap_sched_groups(sd);
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))
-- 
1.8.3.1


* Re: [PATCH 3/4] sched/topology: move comment about asymmetric node setups
  2017-04-20 19:51 ` [PATCH 3/4] sched/topology: move comment about asymmetric node setups Lauro Ramos Venancio
@ 2017-04-21 16:31   ` Peter Zijlstra
  2017-05-15  9:06   ` [tip:sched/core] sched/topology: Move " tip-bot for Lauro Ramos Venancio
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-21 16:31 UTC (permalink / raw)
  To: Lauro Ramos Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Thu, Apr 20, 2017 at 04:51:42PM -0300, Lauro Ramos Venancio wrote:
> @@ -612,7 +604,16 @@ static void init_overlap_sched_group(struct sched_domain *sd,
>  
>  		sibling = *per_cpu_ptr(sdd->sd, i);
>  
> -		/* See the comment near build_group_mask(). */
> +		/*
> +		 * Asymmetric node setups can result in situations where the
> +		 * domain tree is of unequal depth, make sure to skip domains
> +		 * that already cover the entire range.
> +		 *
> +		 * In that case build_sched_domains() will have terminated the
> +		 * iteration early and our sibling sd spans will be empty.
> +		 * Domains should always include the CPU they're built on, so
> +		 * check that.
> +		 */
>  		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
>  			continue;
>  

FWIW, the topology that spawned all that is:

  10,20,20,30
  20,10,20,20
  20,20,10,20
  30,20,20,10


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-20 19:51 ` [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed Lauro Ramos Venancio
@ 2017-04-24 13:03   ` Peter Zijlstra
  2017-04-24 14:19     ` Peter Zijlstra
  2017-04-24 15:11     ` Lauro Venancio
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-24 13:03 UTC (permalink / raw)
  To: Lauro Ramos Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Thu, Apr 20, 2017 at 04:51:43PM -0300, Lauro Ramos Venancio wrote:

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index e77c93a..694e799 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c

> @@ -505,7 +507,11 @@ static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
>  
>  	for_each_cpu(i, sg_span) {
>  		sibling = *per_cpu_ptr(sdd->sd, i);
> -		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
> +
> +		if (!sibling->groups)
> +			continue;

How can this happen?

> +
> +		if (!cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
>  			continue;
>  
>  		cpumask_set_cpu(i, sched_group_mask(sg));


> @@ -1482,6 +1502,14 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
>  		}
>  	}
>  
> +	/* Init overlap groups */
> +	for_each_cpu(i, cpu_map) {
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			if (sd->flags & SD_OVERLAP)
> +				init_overlap_sched_groups(sd);
> +		}
> +	}

Why does this have to be a whole new loop? This is because in
build_group_mask() we could encounter @sibling that were not constructed
yet?

So this is the primary fix?

> +
>  	/* Calculate CPU capacity for physical packages and nodes */
>  	for (i = nr_cpumask_bits-1; i >= 0; i--) {
>  		if (!cpumask_test_cpu(i, cpu_map))


Also, would it not make sense to re-order patch 2 to come after this,
such that we _do_ have the group_mask available and don't have to jump
through hoops in order to link up the sgc? Afaict we don't actually use
the sgc until the above (reverse) loop computing the CPU capacities.


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-24 13:03   ` Peter Zijlstra
@ 2017-04-24 14:19     ` Peter Zijlstra
  2017-04-24 14:27       ` Peter Zijlstra
  2017-04-24 15:11     ` Lauro Venancio
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-24 14:19 UTC (permalink / raw)
  To: Lauro Ramos Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Mon, Apr 24, 2017 at 03:03:26PM +0200, Peter Zijlstra wrote:

> Also, would it not make sense to re-order patch 2 to come after this,
> such that we _do_ have the group_mask available and don't have to jump
> through hoops in order to link up the sgc? Afaict we don't actually use
> the sgc until the above (reverse) loop computing the CPU capacities.

That is, if I force 4 on without 2, then doesn't something like the
below also do the right thing? (without duplicating part of the magic
already contained in build_group_mask)

---
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -498,13 +498,16 @@ enum s_alloc {
  *
  * This function can only be used when all the groups are already built.
  */
-static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
+static void
+build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
 {
 	const struct cpumask *sg_span = sched_group_cpus(sg);
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
 
+	cpumask_clear(mask);
+
 	for_each_cpu(i, sg_span) {
 		sibling = *per_cpu_ptr(sdd->sd, i);
 
@@ -514,7 +517,7 @@ static void build_group_mask(struct sche
 		if (!cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
 			continue;
 
-		cpumask_set_cpu(i, sched_group_mask(sg));
+		cpumask_set_cpu(i, mask);
 	}
 }
 
@@ -549,14 +552,19 @@ build_group_from_child_sched_domain(stru
 }
 
 static void init_overlap_sched_group(struct sched_domain *sd,
-				     struct sched_group *sg, int cpu)
+				     struct sched_group *sg)
 {
+	struct cpumask *mask = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct cpumask *sg_span;
+	int cpu;
+
+	build_group_mask(sd, sg, mask);
+	cpu = cpumask_first_and(sched_group_mask(sg), mask); /* balance cpu */
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
-		build_group_mask(sd, sg);
+		cpumask_copy(sched_group_mask(sg), mask);
 
 	/*
 	 * Initialize sgc->capacity such that even if we mess up the


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-24 14:19     ` Peter Zijlstra
@ 2017-04-24 14:27       ` Peter Zijlstra
  2017-04-24 15:19         ` Lauro Venancio
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-24 14:27 UTC (permalink / raw)
  To: Lauro Ramos Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Mon, Apr 24, 2017 at 04:19:44PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 24, 2017 at 03:03:26PM +0200, Peter Zijlstra wrote:
> 
> > Also, would it not make sense to re-order patch 2 to come after this,
> > such that we _do_ have the group_mask available and don't have to jump
> > through hoops in order to link up the sgc? Afaict we don't actually use
> > the sgc until the above (reverse) loop computing the CPU capacities.
> 
> That is, if I force 4 on without 2, then doesn't something like the
> below also do the right thing? (without duplicating part of the magic
> already contained in build_group_mask)
> 
> ---
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -498,13 +498,16 @@ enum s_alloc {
>   *
>   * This function can only be used when all the groups are already built.
>   */
> -static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
> +static void
> +build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
>  {
>  	const struct cpumask *sg_span = sched_group_cpus(sg);
>  	struct sd_data *sdd = sd->private;
>  	struct sched_domain *sibling;
>  	int i;
>  
> +	cpumask_clear(mask);
> +
>  	for_each_cpu(i, sg_span) {
>  		sibling = *per_cpu_ptr(sdd->sd, i);
>  
> @@ -514,7 +517,7 @@ static void build_group_mask(struct sche
>  		if (!cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
>  			continue;
>  
> -		cpumask_set_cpu(i, sched_group_mask(sg));
> +		cpumask_set_cpu(i, mask);
>  	}
>  }
>  
> @@ -549,14 +552,19 @@ build_group_from_child_sched_domain(stru
>  }
>  
>  static void init_overlap_sched_group(struct sched_domain *sd,
> -				     struct sched_group *sg, int cpu)
> +				     struct sched_group *sg)
>  {
> +	struct cpumask *mask = sched_domains_tmpmask;
>  	struct sd_data *sdd = sd->private;
>  	struct cpumask *sg_span;
> +	int cpu;
> +
> +	build_group_mask(sd, sg, mask);
> +	cpu = cpumask_first_and(sched_group_mask(sg), mask); /* balance cpu */

s/group_mask/group_span/

>  
>  	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
>  	if (atomic_inc_return(&sg->sgc->ref) == 1)
> -		build_group_mask(sd, sg);
> +		cpumask_copy(sched_group_mask(sg), mask);
>  
>  	/*
>  	 * Initialize sgc->capacity such that even if we mess up the


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-24 13:03   ` Peter Zijlstra
  2017-04-24 14:19     ` Peter Zijlstra
@ 2017-04-24 15:11     ` Lauro Venancio
  2017-04-24 22:15       ` Peter Zijlstra
  2017-04-25 12:17       ` Peter Zijlstra
  1 sibling, 2 replies; 28+ messages in thread
From: Lauro Venancio @ 2017-04-24 15:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On 04/24/2017 10:03 AM, Peter Zijlstra wrote:
> On Thu, Apr 20, 2017 at 04:51:43PM -0300, Lauro Ramos Venancio wrote:
>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index e77c93a..694e799 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -505,7 +507,11 @@ static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
>>  
>>  	for_each_cpu(i, sg_span) {
>>  		sibling = *per_cpu_ptr(sdd->sd, i);
>> -		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
>> +
>> +		if (!sibling->groups)
>> +			continue;
> How can this happen?
This happens on machines with asymmetric topologies. For more details see:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c1174876874dcf8986806e4dad3d7d07af20b439
>
>> +
>> +		if (!cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
>>  			continue;
>>  
>>  		cpumask_set_cpu(i, sched_group_mask(sg));
>
>> @@ -1482,6 +1502,14 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
>>  		}
>>  	}
>>  
>> +	/* Init overlap groups */
>> +	for_each_cpu(i, cpu_map) {
>> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>> +			if (sd->flags & SD_OVERLAP)
>> +				init_overlap_sched_groups(sd);
>> +		}
>> +	}
> Why does this have to be a whole new loop? This is because in
> build_group_mask() we could encounter @sibling that were not constructed
> yet?
That is right. We can only build the group mask when all siblings are
constructed.

>
> So this is the primary fix?
Yes.
>
>> +
>>  	/* Calculate CPU capacity for physical packages and nodes */
>>  	for (i = nr_cpumask_bits-1; i >= 0; i--) {
>>  		if (!cpumask_test_cpu(i, cpu_map))
>
> Also, would it not make sense to re-order patch 2 to come after this,
> such that we _do_ have the group_mask available and don't have to jump
> through hoops in order to link up the sgc? Afaict we don't actually use
> the sgc until the above (reverse) loop computing the CPU capacities.


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-24 14:27       ` Peter Zijlstra
@ 2017-04-24 15:19         ` Lauro Venancio
  2017-04-24 22:19           ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Lauro Venancio @ 2017-04-24 15:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On 04/24/2017 11:27 AM, Peter Zijlstra wrote:
> On Mon, Apr 24, 2017 at 04:19:44PM +0200, Peter Zijlstra wrote:
>> On Mon, Apr 24, 2017 at 03:03:26PM +0200, Peter Zijlstra wrote:
>>
>>> Also, would it not make sense to re-order patch 2 to come after this,
>>> such that we _do_ have the group_mask available and don't have to jump
>>> through hoops in order to link up the sgc? Afaict we don't actually use
>>> the sgc until the above (reverse) loop computing the CPU capacities.
>> That is, if I force 4 on without 2, then doesn't something like the
>> below also do the right thing? (without duplicating part of the magic
>> already contained in build_group_mask)
Yes, it has the same result. I duplicated the build_group_mask magic to
avoid building the complete mask for all instances of a group.
Currently, the mask is built just once per group.
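
I.e. with the current ordering the expensive part stays behind the ref
guard in init_overlap_sched_group():

    sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
    if (atomic_inc_return(&sg->sgc->ref) == 1)
            build_group_mask(sd, sg);       /* first instance only */

whereas your version recomputes the full mask for every instance.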
>>
>> ---
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -498,13 +498,16 @@ enum s_alloc {
>>   *
>>   * This function can only be used when all the groups are already built.
>>   */
>> -static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
>> +static void
>> +build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
>>  {
>>  	const struct cpumask *sg_span = sched_group_cpus(sg);
>>  	struct sd_data *sdd = sd->private;
>>  	struct sched_domain *sibling;
>>  	int i;
>>  
>> +	cpumask_clear(mask);
>> +
>>  	for_each_cpu(i, sg_span) {
>>  		sibling = *per_cpu_ptr(sdd->sd, i);
>>  
>> @@ -514,7 +517,7 @@ static void build_group_mask(struct sche
>>  		if (!cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
>>  			continue;
>>  
>> -		cpumask_set_cpu(i, sched_group_mask(sg));
>> +		cpumask_set_cpu(i, mask);
>>  	}
>>  }
>>  
>> @@ -549,14 +552,19 @@ build_group_from_child_sched_domain(stru
>>  }
>>  
>>  static void init_overlap_sched_group(struct sched_domain *sd,
>> -				     struct sched_group *sg, int cpu)
>> +				     struct sched_group *sg)
>>  {
>> +	struct cpumask *mask = sched_domains_tmpmask;
>>  	struct sd_data *sdd = sd->private;
>>  	struct cpumask *sg_span;
>> +	int cpu;
>> +
>> +	build_group_mask(sd, sg, mask);
>> +	cpu = cpumask_first_and(sched_group_mask(sg), mask); /* balance cpu */
> s/group_mask/group_span/
>
>>  
>>  	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
>>  	if (atomic_inc_return(&sg->sgc->ref) == 1)
>> -		build_group_mask(sd, sg);
>> +		cpumask_copy(sched_group_mask(sg), mask);
>>  
>>  	/*
>>  	 * Initialize sgc->capacity such that even if we mess up the


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-24 15:11     ` Lauro Venancio
@ 2017-04-24 22:15       ` Peter Zijlstra
  2017-04-25 12:17       ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-24 22:15 UTC (permalink / raw)
  To: Lauro Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Mon, Apr 24, 2017 at 12:11:59PM -0300, Lauro Venancio wrote:
> On 04/24/2017 10:03 AM, Peter Zijlstra wrote:
> > On Thu, Apr 20, 2017 at 04:51:43PM -0300, Lauro Ramos Venancio wrote:
> >
> >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >> index e77c93a..694e799 100644
> >> --- a/kernel/sched/topology.c
> >> +++ b/kernel/sched/topology.c
> >> @@ -505,7 +507,11 @@ static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
> >>  
> >>  	for_each_cpu(i, sg_span) {
> >>  		sibling = *per_cpu_ptr(sdd->sd, i);
> >> -		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
> >> +
> >> +		if (!sibling->groups)
> >> +			continue;
> > How can this happen?
> This happens on machines with asymmetric topologies. For more details see:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c1174876874dcf8986806e4dad3d7d07af20b439

Ah, ok. I'll put a comment with.


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-24 15:19         ` Lauro Venancio
@ 2017-04-24 22:19           ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-24 22:19 UTC (permalink / raw)
  To: Lauro Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Mon, Apr 24, 2017 at 12:19:47PM -0300, Lauro Venancio wrote:
> On 04/24/2017 11:27 AM, Peter Zijlstra wrote:

> >>> Also, would it not make sense to re-order patch 2 to come after this,
> >>> such that we _do_ have the group_mask available and don't have to jump
> >>> through hoops in order to link up the sgc? Afaict we don't actually use
> >>> the sgc until the above (reverse) loop computing the CPU capacities.

> Yes, it has the same result. I duplicated the build_group_mask magic to
> avoid building the complete mask for all instances of a group.
> Currently, the mask is built just once per group.

OK, but it was subtly different (using sched_domain_span(sibling) vs
sched_group_cpus(sibling->groups)). And if anything, this code could do
with less confusion.

I see your point about wasted computation, but this is very slow path
code, nobody should care too much. And I much prefer simpler code.

I'll see if I can come up with some coherent comments as well. This code
certainly can use some.


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-24 15:11     ` Lauro Venancio
  2017-04-24 22:15       ` Peter Zijlstra
@ 2017-04-25 12:17       ` Peter Zijlstra
  2017-04-25 14:33         ` Lauro Venancio
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-25 12:17 UTC (permalink / raw)
  To: Lauro Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Mon, Apr 24, 2017 at 12:11:59PM -0300, Lauro Venancio wrote:
> On 04/24/2017 10:03 AM, Peter Zijlstra wrote:
> > On Thu, Apr 20, 2017 at 04:51:43PM -0300, Lauro Ramos Venancio wrote:
> >
> >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >> index e77c93a..694e799 100644
> >> --- a/kernel/sched/topology.c
> >> +++ b/kernel/sched/topology.c
> >> @@ -505,7 +507,11 @@ static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
> >>  
> >>  	for_each_cpu(i, sg_span) {
> >>  		sibling = *per_cpu_ptr(sdd->sd, i);

> >> -		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))

> >> +		if (!cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
> >>  			continue;

Hmm _this_ is what requires us to move the thing to a whole separate
iteration. Because when we build the groups, the domains are already
constructed, so that was right.

So the moving crud around wasn't the primary fix, this is.

With the fact that sched_group_cpus(sd->groups) ==
sched_domain_span(sibling->child) (if child exists) established in the
previous patches, could we not simplify this like the below?

---
Subject: sched/topology: Fix overlapping sched_group_mask
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Apr 25 14:00:49 CEST 2017

The point of sched_group_mask is to select those CPUs from
sched_group_cpus that can actually arrive at this balance domain.

The current code gets it wrong, as can be readily demonstrated with a
topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

Where (for example) domain 1 on CPU1 ends up with a mask that includes
CPU0:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

This causes sched_balance_cpu() to compute the wrong CPU and
consequently should_we_balance() will terminate early resulting in
missed load-balance opportunities.

The fixed topology looks like:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/topology.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -495,6 +495,9 @@ enum s_alloc {
 /*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
+ *
+ * Only CPUs that can arrive at this group should be considered to continue
+ * balancing.
  */
 static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
 {
@@ -505,7 +508,13 @@ static void build_group_mask(struct sche
 
 	for_each_cpu(i, sg_span) {
 		sibling = *per_cpu_ptr(sdd->sd, i);
-		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
+
+		/* overlap should have children; except for FORCE_SD_OVERLAP */
+		if (WARN_ON_ONCE(!sibling->child))
+			continue;
+
+		/* If we would not end up here, we can't continue from here */
+		if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
 			continue;
 
 		cpumask_set_cpu(i, sched_group_mask(sg));


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-25 12:17       ` Peter Zijlstra
@ 2017-04-25 14:33         ` Lauro Venancio
  2017-04-25 15:12           ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Lauro Venancio @ 2017-04-25 14:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On 04/25/2017 09:17 AM, Peter Zijlstra wrote:
> On Mon, Apr 24, 2017 at 12:11:59PM -0300, Lauro Venancio wrote:
>> On 04/24/2017 10:03 AM, Peter Zijlstra wrote:
>>> On Thu, Apr 20, 2017 at 04:51:43PM -0300, Lauro Ramos Venancio wrote:
>>>
>>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>>> index e77c93a..694e799 100644
>>>> --- a/kernel/sched/topology.c
>>>> +++ b/kernel/sched/topology.c
>>>> @@ -505,7 +507,11 @@ static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
>>>>  
>>>>  	for_each_cpu(i, sg_span) {
>>>>  		sibling = *per_cpu_ptr(sdd->sd, i);
>>>> -		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
>>>> +		if (!cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
>>>>  			continue;
> Hmm _this_ is what requires us to move the thing to a whole separate
> iteration. Because when we build the groups, the domains are already
> constructed, so that was right.
>
> So the moving crud around wasn't the primary fix, this is.
>
> With the fact that sched_group_cpus(sd->groups) ==
> sched_domain_span(sibling->child) (if child exists) established in the
> previous patches, could we not simplify this like the below?
We can. We just need to better handle the case when there is no child or
we will have empty masks.
We have to replicate the build_group_from_child_sched_domain() behavior:

if (sd->child)
    cpumask_copy(sg_span, sched_domain_span(sd->child));
else
    cpumask_copy(sg_span, sched_domain_span(sd));


So we need something like:


if (sibling->child)
    gsd = sibling->child;
else
    gsd = sibling;

if (!cpumask_equal(sg_span, sched_domain_span(gsd)))
	continue;


>
> ---
> Subject: sched/topology: Fix overlapping sched_group_mask
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Apr 25 14:00:49 CEST 2017
>
> The point of sched_group_mask is to select those CPUs from
> sched_group_cpus that can actually arrive at this balance domain.
>
> The current code gets it wrong, as can be readily demonstrated with a
> topology like:
>
>   node   0   1   2   3
>     0:  10  20  30  20
>     1:  20  10  20  30
>     2:  30  20  10  20
>     3:  20  30  20  10
>
> Where (for example) domain 1 on CPU1 ends up with a mask that includes
> CPU0:
>
>   [] CPU1 attaching sched-domain:
>   []  domain 0: span 0-2 level NUMA
>   []   groups: 1 (mask: 1), 2, 0
>   []   domain 1: span 0-3 level NUMA
>   []    groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
>
> This causes sched_balance_cpu() to compute the wrong CPU and
> consequently should_we_balance() will terminate early resulting in
> missed load-balance opportunities.
>
> The fixed topology looks like:
>
>   [] CPU1 attaching sched-domain:
>   []  domain 0: span 0-2 level NUMA
>   []   groups: 1 (mask: 1), 2, 0
>   []   domain 1: span 0-3 level NUMA
>   []    groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
>
> Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/topology.c |   11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -495,6 +495,9 @@ enum s_alloc {
>  /*
>   * Build an iteration mask that can exclude certain CPUs from the upwards
>   * domain traversal.
> + *
> + * Only CPUs that can arrive at this group should be considered to continue
> + * balancing.
>   */
>  static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
>  {
> @@ -505,7 +508,13 @@ static void build_group_mask(struct sche
>  
>  	for_each_cpu(i, sg_span) {
>  		sibling = *per_cpu_ptr(sdd->sd, i);
> -		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
> +
> +		/* overlap should have children; except for FORCE_SD_OVERLAP */
> +		if (WARN_ON_ONCE(!sibling->child))
> +			continue;
> +
> +		/* If we would not end up here, we can't continue from here */
> +		if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
>  			continue;
>  
>  		cpumask_set_cpu(i, sched_group_mask(sg));
>


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-25 14:33         ` Lauro Venancio
@ 2017-04-25 15:12           ` Peter Zijlstra
  2017-04-25 15:22             ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-25 15:12 UTC (permalink / raw)
  To: Lauro Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Tue, Apr 25, 2017 at 11:33:51AM -0300, Lauro Venancio wrote:
> On 04/25/2017 09:17 AM, Peter Zijlstra wrote:

> > With the fact that sched_group_cpus(sd->groups) ==
> > sched_domain_span(sibling->child) (if child exists) established in the
> > previous patches, could we not simplify this like the below?

> We can. We just need to better handle the case when there is no child or
> we will have empty masks.
> We have to replicate the build_group_from_child_sched_domain() behavior:
> 
> if (sd->child)
>     cpumask_copy(sg_span, sched_domain_span(sd->child));
> else
>     cpumask_copy(sg_span, sched_domain_span(sd));
> 
> 
> So we need something like:
> 
> 
> if (sibling->child)
>     gsd = sibling->child;
> else
>     gsd = sibling;
> 
> if (!cpumask_equal(sg_span, sched_domain_span(gsd)))
> 
> 	continue;

Right, ran into that already. My truncated topologies (single cpu per
node) insta triggered it. But somehow removing the WARN was sufficient
and the mask didn't end up empty.

This is the ring topo:

[    0.086772] CPU0 attaching sched-domain:
[    0.087005]  domain 0: span 0-1,3 level NUMA
[    0.088002]   groups: 0 (mask: 0), 1, 3
[    0.089002]   domain 1: span 0-3 level NUMA
[    0.090002]    groups: 0-1,3 (mask: 0) (cpu_capacity: 3072), 1-3 (cpu_capacity: 3072)
[    0.091005] CPU1 attaching sched-domain:
[    0.092003]  domain 0: span 0-2 level NUMA
[    0.093002]   groups: 1 (mask: 1), 2, 0
[    0.094002]   domain 1: span 0-3 level NUMA
[    0.095002]    groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
[    0.096005] CPU2 attaching sched-domain:
[    0.097002]  domain 0: span 1-3 level NUMA
[    0.098002]   groups: 2 (mask: 2), 3, 1
[    0.099002]   domain 1: span 0-3 level NUMA
[    0.100002]    groups: 1-3 (mask: 2) (cpu_capacity: 3072), 0-1,3 (cpu_capacity: 3072)
[    0.101004] CPU3 attaching sched-domain:
[    0.102002]  domain 0: span 0,2-3 level NUMA
[    0.103002]   groups: 3 (mask: 3), 0, 2
[    0.104002]   domain 1: span 0-3 level NUMA
[    0.105002]    groups: 0,2-3 (mask: 3) (cpu_capacity: 3072), 0-2 (cpu_capacity: 3072)

See how the domain-0 mask isn't empty.


That said, when !child, ->groups ends up being a single cpu.

So I was thinking:

	struct cpumask *i_span;

	if (sibling->child)
		i_span = sched_domain_span(sibling->child);
	else
		i_span = cpumask_of(i);

	if (!cpumask_equal(sg_span, i_span))
		continue;


But I'll first try and figure out why I'm not having empty masks.


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-25 15:12           ` Peter Zijlstra
@ 2017-04-25 15:22             ` Peter Zijlstra
  2017-04-25 15:27               ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-25 15:22 UTC (permalink / raw)
  To: Lauro Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Tue, Apr 25, 2017 at 05:12:00PM +0200, Peter Zijlstra wrote:
> But I'll first try and figure out why I'm not having empty masks.

Ah, so this is before all the degenerate stuff, so there's a bunch of
redundant domains below that make it work -- and there always will be,
unless FORCE_SD_OVERLAP.

Now I wonder what triggered it.. let me put it back.


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-25 15:22             ` Peter Zijlstra
@ 2017-04-25 15:27               ` Peter Zijlstra
  2017-04-25 15:39                 ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-25 15:27 UTC (permalink / raw)
  To: Lauro Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Tue, Apr 25, 2017 at 05:22:36PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 25, 2017 at 05:12:00PM +0200, Peter Zijlstra wrote:
> > But I'll first try and figure out why I'm not having empty masks.
> 
> Ah, so this is before all the degenerate stuff, so there's a bunch of
> redundant domains below that make it work -- and there always will be,
> unless FORCE_SD_OVERLAP.
> 
> Now I wonder what triggered it.. let me put it back.

Ah! the asymmetric setup, where @sibling is entirely uninitialized for
the top domain.


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-25 15:27               ` Peter Zijlstra
@ 2017-04-25 15:39                 ` Peter Zijlstra
  2017-04-25 15:52                   ` Peter Zijlstra
  2017-04-25 15:56                   ` Lauro Venancio
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-25 15:39 UTC (permalink / raw)
  To: Lauro Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Tue, Apr 25, 2017 at 05:27:03PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 25, 2017 at 05:22:36PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 25, 2017 at 05:12:00PM +0200, Peter Zijlstra wrote:
> > > But I'll first try and figure out why I'm not having empty masks.
> > 
> > Ah, so this is before all the degenerate stuff, so there's a bunch of
> > redundant domains below that make it work -- and there always will be,
> > unless FORCE_SD_OVERLAP.
> > 
> > Now I wonder what triggered it.. let me put it back.
> 
> Ah! the asymmetric setup, where @sibling is entirely uninitialized for
> the top domain.
> 
And it still works correctly too:


[    0.078756] XXX 1 NUMA 
[    0.079005] XXX 2 NUMA 
[    0.080003] XXY 0-2:0
[    0.081007] XXX 1 NUMA 
[    0.082005] XXX 2 NUMA 
[    0.083003] XXY 1-3:3
[    0.084032] XXX 1 NUMA 
[    0.085003] XXX 2 NUMA 
[    0.086003] XXY 1-3:3
[    0.087015] XXX 1 NUMA 
[    0.088003] XXX 2 NUMA 
[    0.089002] XXY 0-2:0


[    0.090007] CPU0 attaching sched-domain:
[    0.091002]  domain 0: span 0-2 level NUMA
[    0.092002]   groups: 0 (mask: 0), 1, 2
[    0.093002]   domain 1: span 0-3 level NUMA
[    0.094002]    groups: 0-2 (mask: 0) (cpu_capacity: 3072), 1-3 (cpu_capacity: 3072)
[    0.095005] CPU1 attaching sched-domain:
[    0.096003]  domain 0: span 0-3 level NUMA
[    0.097002]   groups: 1 (mask: 1), 2, 3, 0
[    0.098004] CPU2 attaching sched-domain:
[    0.099002]  domain 0: span 0-3 level NUMA
[    0.100002]   groups: 2 (mask: 2), 3, 0, 1
[    0.101004] CPU3 attaching sched-domain:
[    0.102002]  domain 0: span 1-3 level NUMA
[    0.103002]   groups: 3 (mask: 3), 1, 2
[    0.104002]   domain 1: span 0-3 level NUMA
[    0.105002]    groups: 1-3 (mask: 3) (cpu_capacity: 3072), 0-2 (cpu_capacity: 3072)


static void
build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
{
        const struct cpumask *sg_span = sched_group_cpus(sg);
        struct sd_data *sdd = sd->private;
        struct sched_domain *sibling;
        int i, funny = 0;

        cpumask_clear(mask);

        for_each_cpu(i, sg_span) {
                sibling = *per_cpu_ptr(sdd->sd, i);

                if (!sibling->child) {
                        funny = 1;
                        printk("XXX %d %s %*pbl\n", i, sd->name, cpumask_pr_args(sched_domain_span(sibling)));
                        continue;
                }

                /* If we would not end up here, we can't continue from here */
                if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
                        continue;

                cpumask_set_cpu(i, mask);
        }

        if (funny) {
                printk("XXY %*pbl:%*pbl\n",
                                cpumask_pr_args(sg_span),
                                cpumask_pr_args(mask));
        }
}


So that will still get the right balance cpu and thus sgc.

Another thing I've been thinking about; I think we can do away with the
kzalloc() in build_group_from_child_sched_domain() and use the sdd->sg
storage.

I just didn't want to move too much code around again, and ideally put
more assertions in place to catch bad stuff; I just haven't had a good
time thinking of good assertions :/


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-25 15:39                 ` Peter Zijlstra
@ 2017-04-25 15:52                   ` Peter Zijlstra
  2017-04-25 15:56                   ` Lauro Venancio
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-25 15:52 UTC (permalink / raw)
  To: Lauro Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Tue, Apr 25, 2017 at 05:39:37PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 25, 2017 at 05:27:03PM +0200, Peter Zijlstra wrote:

> > Ah! the asymmetric setup, where @sibling is entirely uninitialized for
> > the top domain.

Like so then...

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -509,6 +509,11 @@ static void build_group_mask(struct sche
 	for_each_cpu(i, sg_span) {
 		sibling = *per_cpu_ptr(sdd->sd, i);
 
+		/*
+		 * Can happen in the asymmetric case, where these siblings are
+		 * unused. The mask will not be empty because those CPUs that
+		 * do have the top domain _should_ span the domain.
+		 */
 		if (!sibling->child)
 			continue;
 
@@ -518,6 +523,9 @@ static void build_group_mask(struct sche
 
 		cpumask_set_cpu(i, sched_group_mask(sg));
 	}
+
+	/* We must not have empty masks here */
+	WARN_ON_ONCE(cpumask_empty(sched_group_mask(sg)));
 }
 
 /*


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-25 15:39                 ` Peter Zijlstra
  2017-04-25 15:52                   ` Peter Zijlstra
@ 2017-04-25 15:56                   ` Lauro Venancio
  2017-04-25 16:26                     ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Lauro Venancio @ 2017-04-25 15:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On 04/25/2017 12:39 PM, Peter Zijlstra wrote:
> On Tue, Apr 25, 2017 at 05:27:03PM +0200, Peter Zijlstra wrote:
>> On Tue, Apr 25, 2017 at 05:22:36PM +0200, Peter Zijlstra wrote:
>>> On Tue, Apr 25, 2017 at 05:12:00PM +0200, Peter Zijlstra wrote:
>>>> But I'll first try and figure out why I'm not having empty masks.
>>> Ah, so this is before all the degenerate stuff, so there's a bunch of
>>> redundant domains below that make it work -- and there always will be,
>>> unless FORCE_SD_OVERLAP.
>>>
>>> Now I wonder what triggered it.. let me put it back.
>> Ah! the asymmetric setup, where @sibling is entirely uninitialized for
>> the top domain.
>>
> And it still works correctly too:
>
>
> [    0.078756] XXX 1 NUMA 
> [    0.079005] XXX 2 NUMA 
> [    0.080003] XXY 0-2:0
> [    0.081007] XXX 1 NUMA 
> [    0.082005] XXX 2 NUMA 
> [    0.083003] XXY 1-3:3
> [    0.084032] XXX 1 NUMA 
> [    0.085003] XXX 2 NUMA 
> [    0.086003] XXY 1-3:3
> [    0.087015] XXX 1 NUMA 
> [    0.088003] XXX 2 NUMA 
> [    0.089002] XXY 0-2:0
>
>
> [    0.090007] CPU0 attaching sched-domain:
> [    0.091002]  domain 0: span 0-2 level NUMA
> [    0.092002]   groups: 0 (mask: 0), 1, 2
> [    0.093002]   domain 1: span 0-3 level NUMA
> [    0.094002]    groups: 0-2 (mask: 0) (cpu_capacity: 3072), 1-3 (cpu_capacity: 3072)
> [    0.095005] CPU1 attaching sched-domain:
> [    0.096003]  domain 0: span 0-3 level NUMA
> [    0.097002]   groups: 1 (mask: 1), 2, 3, 0
> [    0.098004] CPU2 attaching sched-domain:
> [    0.099002]  domain 0: span 0-3 level NUMA
> [    0.100002]   groups: 2 (mask: 2), 3, 0, 1
> [    0.101004] CPU3 attaching sched-domain:
> [    0.102002]  domain 0: span 1-3 level NUMA
> [    0.103002]   groups: 3 (mask: 3), 1, 2
> [    0.104002]   domain 1: span 0-3 level NUMA
> [    0.105002]    groups: 1-3 (mask: 3) (cpu_capacity: 3072), 0-2 (cpu_capacity: 3072)
>
>
> static void
> build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
> {
>         const struct cpumask *sg_span = sched_group_cpus(sg);
>         struct sd_data *sdd = sd->private;
>         struct sched_domain *sibling;
>         int i, funny = 0;
>
>         cpumask_clear(mask);
>
>         for_each_cpu(i, sg_span) {
>                 sibling = *per_cpu_ptr(sdd->sd, i);
>
>                 if (!sibling->child) {
>                         funny = 1;
>                         printk("XXX %d %s %*pbl\n", i, sd->name, cpumask_pr_args(sched_domain_span(sibling)));
>                         continue;
>                 }
>
>                 /* If we would not end up here, we can't continue from here */
>                 if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
>                         continue;
>
>                 cpumask_set_cpu(i, mask);
>         }
>
>         if (funny) {
>                 printk("XXY %*pbl:%*pbl\n",
>                                 cpumask_pr_args(sg_span),
>                                 cpumask_pr_args(mask));
>         }
> }
>
>
> So that will still get the right balance cpu and thus sgc.
>
> Another thing I've been thinking about; I think we can do away with the
> kzalloc() in build_group_from_child_sched_domain() and use the sdd->sg
> storage.
I considered this too. I decided not to change this because I was not
sure whether the kzalloc() was there for performance reasons. Currently,
all groups are allocated on the NUMA node where they are used.
If we use the sdd->sg storage, we may have groups allocated on one NUMA
node being used on another node.
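
For reference, the allocation in question (removed by your diff) is node
local:

    sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
                    GFP_KERNEL, cpu_to_node(cpu));

so today each group lands on the node of the cpu it is built for.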
>
> I just didn't want to move too much code around again, and ideally put
> more assertions in place to catch bad stuff; I just haven't had a good
> time thinking of good assertions :/


* Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed
  2017-04-25 15:56                   ` Lauro Venancio
@ 2017-04-25 16:26                     ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-25 16:26 UTC (permalink / raw)
  To: Lauro Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Tue, Apr 25, 2017 at 12:56:23PM -0300, Lauro Venancio wrote:

> > Another thing I've been thinking about; I think we can do away with the
> > kzalloc() in build_group_from_child_sched_domain() and use the sdd->sg
> > storage.
> I considered this too. I decided to do not change this because I was not
> sure if the kzalloc() was there for performance reasons. Currently, all
> groups are allocated in the NUMA node they are used.
> If we use sdd->sg storage, we may have groups allocated in one NUMA node
> being used in another node.

Right.. I cannot remember :/

/me once again kicks himself for not writing more comments

It does save a few lines.. and I suspect that if we do this, we could
actually completely get rid of sched_group_capacity, since it's now
always the same as the group (again), which should remove more lines
still.

But I'll shelve this patch for now.. we've got enough changes as is.

I still need to write a changelog for the new #2, which has become ugly
again, because it needs a second sched_domains_tmpmask.

(compile tested only)

---
 kernel/sched/topology.c |   76 ++++++++++++++++++------------------------------
 1 file changed, 29 insertions(+), 47 deletions(-)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -501,10 +501,8 @@ enum s_alloc {
  * balancing.
  */
 static void
-build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
+build_group_mask(struct sd_data *sdd, struct cpumask *sg_span, struct cpumask *mask)
 {
-	const struct cpumask *sg_span = sched_group_cpus(sg);
-	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
 
@@ -542,49 +540,34 @@ int group_balance_cpu(struct sched_group
 }
 
 static struct sched_group *
-build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
+get_overlap_group(struct sd_data *sdd, int cpu)
 {
-	struct sched_group *sg;
-	struct cpumask *sg_span;
+	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
+	struct sched_domain *child = sd->child;
+	struct sched_group *group;
+	struct cpumask *mask = sched_domains_tmpmask2;
 
-	sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
-			GFP_KERNEL, cpu_to_node(cpu));
+	/*
+	 * Overlap must have !overlap children.
+	 * This is before degenerate throws them out.
+	 */
+	BUG_ON(!sd->child);
 
-	if (!sg)
-		return NULL;
+	build_group_mask(sdd, sched_domain_span(child), mask);
+	cpu = cpumask_first_and(sched_domain_span(child), mask);
 
-	sg_span = sched_group_cpus(sg);
-	if (sd->child)
-		cpumask_copy(sg_span, sched_domain_span(sd->child));
-	else
-		cpumask_copy(sg_span, sched_domain_span(sd));
+	BUG_ON(cpu >= nr_cpu_ids);
 
-	return sg;
-}
+	group = *per_cpu_ptr(sdd->sg, cpu);
+	group->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 
-static void init_overlap_sched_group(struct sched_domain *sd,
-				     struct sched_group *sg)
-{
-	struct cpumask *mask = sched_domains_tmpmask2;
-	struct sd_data *sdd = sd->private;
-	struct cpumask *sg_span;
-	int cpu;
+	atomic_inc(&group->ref);
+	atomic_inc(&group->sgc->ref);
 
-	build_group_mask(sd, sg, mask);
-	cpu = cpumask_first_and(sched_group_cpus(sg), mask);
+	cpumask_copy(sched_group_cpus(group), sched_domain_span(child));
+	cpumask_copy(sched_group_mask(group), mask);
 
-	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
-	if (atomic_inc_return(&sg->sgc->ref) == 1)
-		cpumask_copy(sched_group_mask(sg), mask);
-
-	/*
-	 * Initialize sgc->capacity such that even if we mess up the
-	 * domains and no possible iteration will get us here, we won't
-	 * die on a /0 trap.
-	 */
-	sg_span = sched_group_cpus(sg);
-	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
-	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+	return group;
 }
 
 static int
@@ -620,14 +603,18 @@ build_overlap_sched_groups(struct sched_
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 
-		sg = build_group_from_child_sched_domain(sibling, cpu);
-		if (!sg)
-			goto fail;
+		sg = get_overlap_group(sdd, i);
 
 		sg_span = sched_group_cpus(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg);
+		/*
+		 * Initialize sgc->capacity such that even if we mess up the
+		 * domains and no possible iteration will get us here, we won't
+		 * die on a /0 trap.
+		 */
+		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+		sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
 
 		if (!first)
 			first = sg;
@@ -639,11 +626,6 @@ build_overlap_sched_groups(struct sched_
 	sd->groups = first;
 
 	return 0;
-
-fail:
-	free_sched_groups(first, 0);
-
-	return -ENOMEM;
 }
 
 static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu
  2017-04-20 19:51 [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu Lauro Ramos Venancio
                   ` (3 preceding siblings ...)
  2017-04-20 19:51 ` [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed Lauro Ramos Venancio
@ 2017-04-26 16:31 ` Peter Zijlstra
  2017-04-26 17:59   ` Lauro Venancio
  4 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-26 16:31 UTC (permalink / raw)
  To: Lauro Ramos Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel



Please have a look here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/wip

I think that has them all.

I'm still thinking about how best to create a test that asserts the SGC
condition, but I don't seem to get much further than brute-force
matching of all group masks. At least we now display the stuff (albeit I
don't like the format; suggestions?).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu
  2017-04-26 16:31 ` [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu Peter Zijlstra
@ 2017-04-26 17:59   ` Lauro Venancio
  2017-04-26 22:43     ` Peter Zijlstra
  2017-04-28 10:33     ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Lauro Venancio @ 2017-04-26 17:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On 04/26/2017 01:31 PM, Peter Zijlstra wrote:
>
> Please have a look here:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/wip
>
> I think that has them all.
Yes. I agree.
>
> I'm still thinking about how best to create a test that asserts the SGC
> condition, but I don't seem to get much further than brute-force
> matching of all group masks. At least we now display the stuff (albeit
> I don't like the format; suggestions?).

I cannot think of anything other than brute force either.

Maybe, to make the debug messages clear, you could print "balance_cpu:
%d" instead of "capacity(%d): " and assert (sgc id == balance cpu). You
would need to restart the sgc id count on every topology level.
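
Something along these lines is the sort of check I have in mind for the
balance-cpu side -- purely a hypothetical sketch (check_group_balance_cpu()
is a made-up name), reusing the existing group_balance_cpu() and
sched_group_cpus() helpers:

/*
 * Hypothetical debug check: the group balance cpu must be one of the
 * cpus the group is installed on, i.e. part of the group's span.
 */
static void check_group_balance_cpu(struct sched_group *sg)
{
	int cpu = group_balance_cpu(sg);

	/* group_balance_cpu() returns >= nr_cpu_ids if the masks don't intersect */
	WARN_ON_ONCE(cpu >= nr_cpu_ids);
	WARN_ON_ONCE(!cpumask_test_cpu(cpu, sched_group_cpus(sg)));
}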

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu
  2017-04-26 17:59   ` Lauro Venancio
@ 2017-04-26 22:43     ` Peter Zijlstra
  2017-04-28 10:33     ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-26 22:43 UTC (permalink / raw)
  To: Lauro Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Wed, Apr 26, 2017 at 02:59:09PM -0300, Lauro Venancio wrote:
> On 04/26/2017 01:31 PM, Peter Zijlstra wrote:
> >
> > Please have a look here:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/wip
> >
> > I think that has them all.
> Yes. I agree.

FWIW, this is how far I got with writing comments...

I'll try and finish on Friday (holiday tomorrow).

---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9e276f5a83a4..e6ba3f0793ff 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -494,6 +494,68 @@ enum s_alloc {
 };
 
 /*
+ * Return the canonical balance CPU for this group, this is the first CPU
+ * of this group that's also in the iteration mask.
+ */
+int group_balance_cpu(struct sched_group *sg)
+{
+	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
+}
+
+
+/*
+ * NUMA topology (first read the regular topology blurb below)
+ *
+ * Given a node-distance table, for example:
+ *
+ *   node   0   1   2   3
+ *     0:  10  20  30  20
+ *     1:  20  10  20  30
+ *     2:  30  20  10  20
+ *     3:  20  30  20  10
+ *
+ * which represents a 4 node ring topology like:
+ *
+ *   0 ----- 1
+ *   |       |
+ *   |       |
+ *   |       |
+ *   3 ----- 2
+ *
+ * We want to construct domains and groups to represent this. The way we go about
+ * doing this is to build the domains on 'hops'. For each NUMA level we construct
+ * the mask of all nodes reachable in @level hops.
+ *
+ * For the above NUMA topology that gives 3 levels:
+ *
+ * NUMA-2	0-3		0-3		0-3		0-3
+ *
+ * NUMA-1	0-1,3		0-2		1-3		0,2-3
+ *
+ * NUMA-0	0		1		2		3
+ *
+ * 
+ * As can be seen; things don't nicely line up as with the regular topology.
+ * When we iterate a domain in child domain chunks some nodes can be
+ * represented multiple times -- hence the "overlap" naming for this part of
+ * the topology.
+ *
+ * In order to minimize this overlap, we only build enough groups to cover the
+ * domain. For instance NODE-0 NUMA-2 would only get groups: 0-1,3 and 1-3.
+ *
+ * Because:
+ *
+ *  - the first group of each domain is its child domain; this
+ *    gets us the first 0-1,3
+ *  - the only uncovered node is 2, whose child domain is 1-3.
+ *
+ * XXX words on sched_group_capacity and balance_cpu
+ *
+ */
+
+
+
+/*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
  *
@@ -532,15 +594,6 @@ build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask
 	WARN_ON_ONCE(cpumask_empty(mask));
 }
 
-/*
- * Return the canonical balance CPU for this group, this is the first CPU
- * of this group that's also in the iteration mask.
- */
-int group_balance_cpu(struct sched_group *sg)
-{
-	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
-}
-
 static struct sched_group *
 build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 {
@@ -646,6 +699,59 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 	return -ENOMEM;
 }
 
+
+/*
+ * Package topology (also see the load-balance blurb in fair.c)
+ *
+ * The scheduler builds a tree structure to represent a number of important
+ * topology features. By default (default_topology[]) these include:
+ *
+ *  - Simultaneous multithreading (SMT)
+ *  - Multi-Core Cache (MC)
+ *  - Package (DIE)
+ *
+ * Where the last one more or less denotes everything up to a NUMA node.
+ *
+ * The tree consists of 3 primary data structures:
+ *
+ * 	sched_domain -> sched_group -> sched_group_capacity
+ *
+ * The sched_domains are per-cpu and have a two way link (parent & child) and
+ * denote the ever growing mask of CPUs belonging to that level of topology.
+ *
+ * Each sched_domain has a circular (double) linked list of sched_group's, each
+ * denoting the domains of the level below (or individual CPUs in case of the
+ * first domain level). The sched_group linked by a sched_domain includes the
+ * CPU of that sched_domain [*].
+ *
+ * Take for instance a 2 threaded, 2 core, 2 cache cluster part:
+ *
+ * CPU   0   1   2   3   4   5   6   7
+ *
+ * DIE  [                             ]
+ * MC   [             ] [             ]
+ * SMT  [     ] [     ] [     ] [     ]
+ *
+ *  - or -
+ *
+ * DIE  0-7 0-7 0-7 0-7 0-7 0-7 0-7 0-7
+ * MC	0-3 0-3 0-3 0-3 4-7 4-7 4-7 4-7
+ * SMT  0-1 0-1 2-3 2-3 4-5 4-5 6-7 6-7
+ *
+ * CPU   0   1   2   3   4   5   6   7
+ *
+ * One way to think about it is: sched_domain moves you up and down among these
+ * topology levels, while sched_group moves you sideways through it, at child
+ * domain granularity.
+ *
+ * sched_group_capacity ensures each unique sched_group has shared storage.
+ *
+ * XXX words on how these levels don't overlap
+ * XXX words on balance CPU
+ *
+ * [*] in other words, the first group of each domain is its child domain.
+ */
+
 static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
 {
 	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu
  2017-04-26 17:59   ` Lauro Venancio
  2017-04-26 22:43     ` Peter Zijlstra
@ 2017-04-28 10:33     ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2017-04-28 10:33 UTC (permalink / raw)
  To: Lauro Venancio
  Cc: lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Wed, Apr 26, 2017 at 02:59:09PM -0300, Lauro Venancio wrote:

> Maybe, to make the debug messages clear, you could print "balance_cpu:
> %d" instead of "capacity(%d): " and assert (sgc id == balance cpu). You
> would need to restart the sgc id count on every topology level.

Right, that's a good suggestion. I've also completely revamped printing
the groups, because they got really long and it was very hard to
visually separate the individual groups on a line.

I've ended up with the following:

  [] CPU0 attaching sched-domain(s):
  []  domain-0: span=0,4 level=DIE
  []   groups: 0:{ span=0 }, 4:{ span=4 }
  []   domain-1: span=0-1,3-5,7 level=NUMA
  []    groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }
  [] CPU1 attaching sched-domain(s):
  []  domain-0: span=1,5 level=DIE
  []   groups: 1:{ span=1 }, 5:{ span=5 }
  []   domain-1: span=0-2,4-6 level=NUMA
  []    groups: 1:{ span=1,5 mask=1,5 cap=2048 }, 2:{ span=2,6 mask=2,6 cap=2048 }, 4:{ span=0,4 mask=0,4 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 1:{ span=0-2,4-6 mask=1,5 cap=6144 }, 3:{ span=0,2-4,6-7 mask=3,7 cap=6144 }

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [tip:sched/core] sched/topology: Optimize build_group_mask()
  2017-04-20 19:51 ` [PATCH 1/4] sched/topology: optimize build_group_mask() Lauro Ramos Venancio
@ 2017-05-15  9:06   ` tip-bot for Lauro Ramos Venancio
  0 siblings, 0 replies; 28+ messages in thread
From: tip-bot for Lauro Ramos Venancio @ 2017-05-15  9:06 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, tglx, mingo, linux-kernel, lvenanci, peterz, hpa, efault

Commit-ID:  f32d782e31bf079f600dcec126ed117b0577e85c
Gitweb:     http://git.kernel.org/tip/f32d782e31bf079f600dcec126ed117b0577e85c
Author:     Lauro Ramos Venancio <lvenanci@redhat.com>
AuthorDate: Thu, 20 Apr 2017 16:51:40 -0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 15 May 2017 10:15:26 +0200

sched/topology: Optimize build_group_mask()

The group mask is always used in intersection with the group CPUs. So,
when building the group mask, we don't have to care about CPUs that are
not part of the group.

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 81c8203..5a4d9ae 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -506,12 +506,12 @@ enum s_alloc {
  */
 static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
 {
-	const struct cpumask *span = sched_domain_span(sd);
+	const struct cpumask *sg_span = sched_group_cpus(sg);
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
 
-	for_each_cpu(i, span) {
+	for_each_cpu(i, sg_span) {
 		sibling = *per_cpu_ptr(sdd->sd, i);
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [tip:sched/core] sched/topology: Move comment about asymmetric node setups
  2017-04-20 19:51 ` [PATCH 3/4] sched/topology: move comment about asymmetric node setups Lauro Ramos Venancio
  2017-04-21 16:31   ` Peter Zijlstra
@ 2017-05-15  9:06   ` tip-bot for Lauro Ramos Venancio
  1 sibling, 0 replies; 28+ messages in thread
From: tip-bot for Lauro Ramos Venancio @ 2017-05-15  9:06 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, hpa, lvenanci, tglx, torvalds, efault, linux-kernel, peterz

Commit-ID:  c20e1ea4b61c3d99a354d912f2d74822fd2a001d
Gitweb:     http://git.kernel.org/tip/c20e1ea4b61c3d99a354d912f2d74822fd2a001d
Author:     Lauro Ramos Venancio <lvenanci@redhat.com>
AuthorDate: Thu, 20 Apr 2017 16:51:42 -0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 15 May 2017 10:15:27 +0200

sched/topology: Move comment about asymmetric node setups

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-4-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/topology.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5a4d9ae..c10f44a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -495,14 +495,6 @@ enum s_alloc {
 /*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
- *
- * Asymmetric node setups can result in situations where the domain tree is of
- * unequal depth, make sure to skip domains that already cover the entire
- * range.
- *
- * In that case build_sched_domains() will have terminated the iteration early
- * and our sibling sd spans will be empty. Domains should always include the
- * CPU they're built on, so check that.
  */
 static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
 {
@@ -590,7 +582,16 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 
 		sibling = *per_cpu_ptr(sdd->sd, i);
 
-		/* See the comment near build_group_mask(). */
+		/*
+		 * Asymmetric node setups can result in situations where the
+		 * domain tree is of unequal depth, make sure to skip domains
+		 * that already cover the entire range.
+		 *
+		 * In that case build_sched_domains() will have terminated the
+		 * iteration early and our sibling sd spans will be empty.
+		 * Domains should always include the CPU they're built on, so
+		 * check that.
+		 */
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2017-05-15  9:10 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-20 19:51 [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu Lauro Ramos Venancio
2017-04-20 19:51 ` [PATCH 1/4] sched/topology: optimize build_group_mask() Lauro Ramos Venancio
2017-05-15  9:06   ` [tip:sched/core] sched/topology: Optimize build_group_mask() tip-bot for Lauro Ramos Venancio
2017-04-20 19:51 ` [PATCH 2/4] sched/topology: all instances of a sched group must use the same sched_group_capacity Lauro Ramos Venancio
2017-04-20 19:51 ` [PATCH 3/4] sched/topology: move comment about asymmetric node setups Lauro Ramos Venancio
2017-04-21 16:31   ` Peter Zijlstra
2017-05-15  9:06   ` [tip:sched/core] sched/topology: Move " tip-bot for Lauro Ramos Venancio
2017-04-20 19:51 ` [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed Lauro Ramos Venancio
2017-04-24 13:03   ` Peter Zijlstra
2017-04-24 14:19     ` Peter Zijlstra
2017-04-24 14:27       ` Peter Zijlstra
2017-04-24 15:19         ` Lauro Venancio
2017-04-24 22:19           ` Peter Zijlstra
2017-04-24 15:11     ` Lauro Venancio
2017-04-24 22:15       ` Peter Zijlstra
2017-04-25 12:17       ` Peter Zijlstra
2017-04-25 14:33         ` Lauro Venancio
2017-04-25 15:12           ` Peter Zijlstra
2017-04-25 15:22             ` Peter Zijlstra
2017-04-25 15:27               ` Peter Zijlstra
2017-04-25 15:39                 ` Peter Zijlstra
2017-04-25 15:52                   ` Peter Zijlstra
2017-04-25 15:56                   ` Lauro Venancio
2017-04-25 16:26                     ` Peter Zijlstra
2017-04-26 16:31 ` [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu Peter Zijlstra
2017-04-26 17:59   ` Lauro Venancio
2017-04-26 22:43     ` Peter Zijlstra
2017-04-28 10:33     ` Peter Zijlstra
