* [PATCH 00/14] sched/topology fixes
@ 2017-04-28 13:19 Peter Zijlstra
  2017-04-28 13:19 ` [PATCH 01/14] sched/topology: Refactor function build_overlap_sched_groups() Peter Zijlstra
                   ` (14 more replies)
  0 siblings, 15 replies; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:19 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz

Hi,

These patches are based upon the hard work of Lauro. He put in the time and
effort to understand and debug the code.

So while I didn't take many of his actual patches; I want to thank him for
doing the work. Hopefully the "Debugged-by:" tag conveys some of that.

In any case, please have a look. I think these should about cover things.

Rik, Lauro, could you guys in particular look at the final patch that adds a
few comments. I attempted to document the intent and understanding there. But
given I've been staring at this stuff too long I could've missed the obvious.

Comments and or suggestions welcome.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 01/14] sched/topology: Refactor function build_overlap_sched_groups()
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
@ 2017-04-28 13:19 ` Peter Zijlstra
  2017-04-28 13:20 ` [PATCH 02/14] sched,cpumask: Export for_each_cpu_wrap() Peter Zijlstra
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:19 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz

[-- Attachment #1: lauro_ramos_venancio-sched_topology-refactor_function_build_overlap_sched_groups.patch --]
[-- Type: text/plain, Size: 3126 bytes --]

Create functions build_group_from_child_sched_domain() and
init_overlap_sched_group(). No functional change.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1492091769-19879-2-git-send-email-lvenanci@redhat.com
---
 kernel/sched/topology.c | 62 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 43 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1b0b4fb..d786d45 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -513,6 +513,47 @@ int group_balance_cpu(struct sched_group *sg)
 	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
 }
 
+static struct sched_group *
+build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
+{
+	struct sched_group *sg;
+	struct cpumask *sg_span;
+
+	sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
+			GFP_KERNEL, cpu_to_node(cpu));
+
+	if (!sg)
+		return NULL;
+
+	sg_span = sched_group_cpus(sg);
+	if (sd->child)
+		cpumask_copy(sg_span, sched_domain_span(sd->child));
+	else
+		cpumask_copy(sg_span, sched_domain_span(sd));
+
+	return sg;
+}
+
+static void init_overlap_sched_group(struct sched_domain *sd,
+				     struct sched_group *sg, int cpu)
+{
+	struct sd_data *sdd = sd->private;
+	struct cpumask *sg_span;
+
+	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
+	if (atomic_inc_return(&sg->sgc->ref) == 1)
+		build_group_mask(sd, sg);
+
+	/*
+	 * Initialize sgc->capacity such that even if we mess up the
+	 * domains and no possible iteration will get us here, we won't
+	 * die on a /0 trap.
+	 */
+	sg_span = sched_group_cpus(sg);
+	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+}
+
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
@@ -537,31 +578,14 @@ int group_balance_cpu(struct sched_group *sg)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 
-		sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
-				GFP_KERNEL, cpu_to_node(cpu));
-
+		sg = build_group_from_child_sched_domain(sibling, cpu);
 		if (!sg)
 			goto fail;
 
 		sg_span = sched_group_cpus(sg);
-		if (sibling->child)
-			cpumask_copy(sg_span, sched_domain_span(sibling->child));
-		else
-			cpumask_set_cpu(i, sg_span);
-
 		cpumask_or(covered, covered, sg_span);
 
-		sg->sgc = *per_cpu_ptr(sdd->sgc, i);
-		if (atomic_inc_return(&sg->sgc->ref) == 1)
-			build_group_mask(sd, sg);
-
-		/*
-		 * Initialize sgc->capacity such that even if we mess up the
-		 * domains and no possible iteration will get us here, we won't
-		 * die on a /0 trap.
-		 */
-		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
-		sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+		init_overlap_sched_group(sd, sg, i);
 
 		/*
 		 * Make sure the first group of this domain contains the
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 02/14] sched,cpumask: Export for_each_cpu_wrap()
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
  2017-04-28 13:19 ` [PATCH 01/14] sched/topology: Refactor function build_overlap_sched_groups() Peter Zijlstra
@ 2017-04-28 13:20 ` Peter Zijlstra
  2017-05-01 19:56   ` Rik van Riel
  2017-04-28 13:20 ` [PATCH 03/14] sched/topology: Fix building of overlapping sched-groups Peter Zijlstra
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:20 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz

[-- Attachment #1: peter_zijlstra-sched_topology-fix_sched_groups_on_numa_machines_with_mesh_topology.patch --]
[-- Type: text/plain, Size: 5100 bytes --]

More users of for_each_cpu_wrap() have appeared. Promote the construct
to the generic cpumask interface.

The implementation is slightly modified to reduce arguments.
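
As an illustration, here is a minimal usage sketch of the now-generic iterator
(a hypothetical helper, not part of this patch; it assumes the existing
sched_domain_span() and idle_cpu() helpers):

  /* Hypothetical helper, illustration only: scan the CPUs of @sd starting at
   * @target and wrapping around, so that concurrent scanners starting from
   * different CPUs do not all converge on the same low-numbered CPUs. */
  static int find_idle_cpu_near(struct sched_domain *sd, int target)
  {
  	int cpu;

  	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
  		if (idle_cpu(cpu))
  			return cpu;
  	}

  	return -1; /* the loop terminates with cpu >= nr_cpumask_bits */
  }

Because the wrap state is now carried by cpumask_next_wrap()'s @wrap argument,
callers no longer need the extra 'wrap' variable the old fair.c-private macro
required.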

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Lauro Ramos Venancio <lvenanci@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170414122005.o35me2h5nowqkxbv@hirez.programming.kicks-ass.net
---
 include/linux/cpumask.h |   17 +++++++++++++++++
 kernel/sched/fair.c     |   45 ++++-----------------------------------------
 lib/cpumask.c           |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 41 deletions(-)

--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -236,6 +236,23 @@ unsigned int cpumask_local_spread(unsign
 		(cpu) = cpumask_next_zero((cpu), (mask)),	\
 		(cpu) < nr_cpu_ids;)
 
+extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap);
+
+/**
+ * for_each_cpu_wrap - iterate over every cpu in a mask, starting at a specified location
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask: the cpumask pointer
+ * @start: the start location
+ *
+ * The implementation does not assume any bit in @mask is set (including @start).
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_wrap(cpu, mask, start)					\
+	for ((cpu) = cpumask_next_wrap((start)-1, (mask), (start), false);	\
+	     (cpu) < nr_cpumask_bits;						\
+	     (cpu) = cpumask_next_wrap((cpu), (mask), (start), true))
+
 /**
  * for_each_cpu_and - iterate over every cpu in both masks
  * @cpu: the (optionally unsigned) integer iterator
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5691,43 +5691,6 @@ find_idlest_cpu(struct sched_group *grou
 	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
 }
 
-/*
- * Implement a for_each_cpu() variant that starts the scan at a given cpu
- * (@start), and wraps around.
- *
- * This is used to scan for idle CPUs; such that not all CPUs looking for an
- * idle CPU find the same CPU. The down-side is that tasks tend to cycle
- * through the LLC domain.
- *
- * Especially tbench is found sensitive to this.
- */
-
-static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
-{
-	int next;
-
-again:
-	next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
-
-	if (*wrapped) {
-		if (next >= start)
-			return nr_cpumask_bits;
-	} else {
-		if (next >= nr_cpumask_bits) {
-			*wrapped = 1;
-			n = -1;
-			goto again;
-		}
-	}
-
-	return next;
-}
-
-#define for_each_cpu_wrap(cpu, mask, start, wrap)				\
-	for ((wrap) = 0, (cpu) = (start)-1;					\
-		(cpu) = cpumask_next_wrap((cpu), (mask), (start), &(wrap)),	\
-		(cpu) < nr_cpumask_bits; )
-
 #ifdef CONFIG_SCHED_SMT
 
 static inline void set_idle_cores(int cpu, int val)
@@ -5787,7 +5750,7 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu, wrap;
+	int core, cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -5797,7 +5760,7 @@ static int select_idle_core(struct task_
 
 	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
 
-	for_each_cpu_wrap(core, cpus, target, wrap) {
+	for_each_cpu_wrap(core, cpus, target) {
 		bool idle = true;
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -5863,7 +5826,7 @@ static int select_idle_cpu(struct task_s
 	u64 avg_cost, avg_idle = this_rq()->avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, wrap;
+	int cpu;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -5880,7 +5843,7 @@ static int select_idle_cpu(struct task_s
 
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
 		if (idle_cpu(cpu))
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -43,6 +43,38 @@ int cpumask_any_but(const struct cpumask
 }
 EXPORT_SYMBOL(cpumask_any_but);
 
+/**
+ * cpumask_next_wrap - helper to implement for_each_cpu_wrap
+ * @n: the cpu prior to the place to search
+ * @mask: the cpumask pointer
+ * @start: the start point of the iteration
+ * @wrap: assume @n crossing @start terminates the iteration
+ *
+ * Returns >= nr_cpu_ids on completion
+ *
+ * Note: the @wrap argument is required for the start condition when
+ * we cannot assume @start is set in @mask.
+ */
+int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap)
+{
+	int next;
+
+again:
+	next = cpumask_next(n, mask);
+
+	if (wrap && n < start && next >= start) {
+		return nr_cpumask_bits;
+
+	} else if (next >= nr_cpumask_bits) {
+		wrap = true;
+		n = -1;
+		goto again;
+	}
+
+	return next;
+}
+EXPORT_SYMBOL(cpumask_next_wrap);
+
 /* These are not inline because of header tangles. */
 #ifdef CONFIG_CPUMASK_OFFSTACK
 /**

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 03/14] sched/topology: Fix building of overlapping sched-groups
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
  2017-04-28 13:19 ` [PATCH 01/14] sched/topology: Refactor function build_overlap_sched_groups() Peter Zijlstra
  2017-04-28 13:20 ` [PATCH 02/14] sched,cpumask: Export for_each_cpu_wrap() Peter Zijlstra
@ 2017-04-28 13:20 ` Peter Zijlstra
  2017-05-01 20:08   ` Rik van Riel
  2017-04-28 13:20 ` [PATCH 04/14] sched/topology: Simplify build_overlap_sched_groups() Peter Zijlstra
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:20 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz, stable

[-- Attachment #1: peterz-sched-fix-build-overlapping-groups.patch --]
[-- Type: text/plain, Size: 1472 bytes --]

When building the overlapping groups, we very obviously should start
with the previous domain of _this_ @cpu, not CPU-0.

This can be readily demonstrated with a topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

Where (for example) CPU1 ends up generating the following nonsensical groups:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 2 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)

Where the fact that domain 1 doesn't include a group with span 0-2 is
the obvious fail.

With patch this looks like:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 0 2
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (cpu_capacity = 3072) 0,2-3 (cpu_capacity = 3072)
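
As an aside, the effect of the iteration order can be seen with a toy
userspace model (a sketch only, not kernel code and not part of this patch)
that mimics for_each_cpu_wrap() on a small bitmask:

  #include <stdio.h>

  /* Toy model, illustration only: visit the set bits of @mask starting at
   * @start and wrapping around, mimicking for_each_cpu_wrap() semantics. */
  static void walk_wrap(unsigned int mask, int nbits, int start)
  {
  	int n;

  	for (n = 0; n < nbits; n++) {
  		int cpu = (start + n) % nbits;

  		if (mask & (1u << cpu))
  			printf("%d ", cpu);
  	}
  	printf("\n");
  }

  int main(void)
  {
  	walk_wrap(0xf, 4, 0);	/* old scan order for CPU1: 0 1 2 3 */
  	walk_wrap(0xf, 4, 1);	/* new scan order for CPU1: 1 2 3 0 */
  	return 0;
  }

With for_each_cpu(i, span) the scan for CPU1 starts at CPU0, so the first
group is built from CPU0's sibling domain; with for_each_cpu_wrap(i, span,
cpu) it starts at CPU1 itself, so the first group is CPU1's own child domain
(span 0-2 above), which is exactly the group domain 1 was missing.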

Cc: stable@vger.kernel.org
Fixes: e3589f6c81e4 ("sched: Allow for overlapping sched_domain spans")
Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/topology.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -525,7 +525,7 @@ build_overlap_sched_groups(struct sched_
 
 	cpumask_clear(covered);
 
-	for_each_cpu(i, span) {
+	for_each_cpu_wrap(i, span, cpu) {
 		struct cpumask *sg_span;
 
 		if (cpumask_test_cpu(i, covered))

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 04/14] sched/topology: Simplify build_overlap_sched_groups()
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
                   ` (2 preceding siblings ...)
  2017-04-28 13:20 ` [PATCH 03/14] sched/topology: Fix building of overlapping sched-groups Peter Zijlstra
@ 2017-04-28 13:20 ` Peter Zijlstra
  2017-05-01 21:04   ` Rik van Riel
  2017-04-28 13:20 ` [PATCH 05/14] sched/topology,debug: Print the group mask Peter Zijlstra
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:20 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz

[-- Attachment #1: peterz-sched-simplify-overlapping-groups.patch --]
[-- Type: text/plain, Size: 1417 bytes --]

Now that the first group will always be the previous domain of this
@cpu, this can be simplified.

In fact, writing the code now removed should've been a big clue I was
doing it wrong :/

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/topology.c |   13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -557,7 +557,7 @@ static void init_overlap_sched_group(str
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
-	struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
+	struct sched_group *first = NULL, *last = NULL, *sg;
 	const struct cpumask *span = sched_domain_span(sd);
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
@@ -587,15 +587,6 @@ build_overlap_sched_groups(struct sched_
 
 		init_overlap_sched_group(sd, sg, i);
 
-		/*
-		 * Make sure the first group of this domain contains the
-		 * canonical balance CPU. Otherwise the sched_domain iteration
-		 * breaks. See update_sg_lb_stats().
-		 */
-		if ((!groups && cpumask_test_cpu(cpu, sg_span)) ||
-		    group_balance_cpu(sg) == cpu)
-			groups = sg;
-
 		if (!first)
 			first = sg;
 		if (last)
@@ -603,7 +594,7 @@ build_overlap_sched_groups(struct sched_
 		last = sg;
 		last->next = first;
 	}
-	sd->groups = groups;
+	sd->groups = first;
 
 	return 0;
 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 05/14] sched/topology,debug: Print the group mask
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
                   ` (3 preceding siblings ...)
  2017-04-28 13:20 ` [PATCH 04/14] sched/topology: Simplify build_overlap_sched_groups() Peter Zijlstra
@ 2017-04-28 13:20 ` Peter Zijlstra
  2017-05-01 21:13   ` Rik van Riel
  2017-04-28 13:20 ` [PATCH 06/14] sched/topology,debug: Verify the first group matches the child domain Peter Zijlstra
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:20 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz

[-- Attachment #1: peterz-sched-debug-topo-print-mask.patch --]
[-- Type: text/plain, Size: 985 bytes --]

In order to determine the balance_cpu (for should_we_balance()) we need
the sched_group_mask() for overlapping domains, so print it in the
sched_domain debug output.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/topology.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -82,12 +82,22 @@ static int sched_domain_debug_one(struct
 
 		printk(KERN_CONT " %*pbl",
 		       cpumask_pr_args(sched_group_cpus(group)));
+
+		if ((sd->flags & SD_OVERLAP) && !cpumask_full(sched_group_mask(group))) {
+			printk(KERN_CONT " (mask: %*pbl)",
+				cpumask_pr_args(sched_group_mask(group)));
+		}
+
 		if (group->sgc->capacity != SCHED_CAPACITY_SCALE) {
-			printk(KERN_CONT " (cpu_capacity = %lu)",
+			printk(KERN_CONT " (cpu_capacity: %lu)",
 				group->sgc->capacity);
 		}
 
 		group = group->next;
+
+		if (group != sd->groups)
+			printk(KERN_CONT ",");
+
 	} while (group != sd->groups);
 	printk(KERN_CONT "\n");
 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 06/14] sched/topology,debug: Verify the first group matches the child domain
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
                   ` (4 preceding siblings ...)
  2017-04-28 13:20 ` [PATCH 05/14] sched/topology,debug: Print the group mask Peter Zijlstra
@ 2017-04-28 13:20 ` Peter Zijlstra
  2017-05-01 21:13   ` Rik van Riel
  2017-04-28 13:20 ` [PATCH 07/14] sched/topology: Optimize build_group_mask() Peter Zijlstra
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:20 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz

[-- Attachment #1: peterz-sched-topo-more-debug.patch --]
[-- Type: text/plain, Size: 563 bytes --]


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/topology.c |    6 ++++++
 1 file changed, 6 insertions(+)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -93,6 +93,12 @@ static int sched_domain_debug_one(struct
 				group->sgc->capacity);
 		}
 
+		if (group == sd->groups && sd->child &&
+		    !cpumask_equal(sched_domain_span(sd->child),
+				   sched_group_cpus(group))) {
+			printk(KERN_ERR "ERROR: domain->groups does not match domain->child\n");
+		}
+
 		group = group->next;
 
 		if (group != sd->groups)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 07/14] sched/topology: Optimize build_group_mask()
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
                   ` (5 preceding siblings ...)
  2017-04-28 13:20 ` [PATCH 06/14] sched/topology,debug: Verify the first group matches the child domain Peter Zijlstra
@ 2017-04-28 13:20 ` Peter Zijlstra
  2017-05-02  0:57   ` Rik van Riel
  2017-04-28 13:20 ` [PATCH 08/14] sched/topology: Move comment about asymmetric node setups Peter Zijlstra
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:20 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz

[-- Attachment #1: lauro_ramos_venancio-sched_topology-optimize_build_group_mask.patch --]
[-- Type: text/plain, Size: 1131 bytes --]

The group mask is always used in intersection with the group cpus. So,
when building the group mask, we don't have to care about cpus that are
not part of the group.
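
For reference, the consumers of the mask always intersect it with the group
CPUs; e.g. group_balance_cpu(), visible as context in patch 1 of this series:

  int group_balance_cpu(struct sched_group *sg)
  {
  	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
  }

so any mask bit outside the group's own span can never affect the result,
which is why build_group_mask() only needs to iterate the group span.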

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: riel@redhat.com
Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
---
 kernel/sched/topology.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -506,12 +506,12 @@ enum s_alloc {
  */
 static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
 {
-	const struct cpumask *span = sched_domain_span(sd);
+	const struct cpumask *sg_span = sched_group_cpus(sg);
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
 
-	for_each_cpu(i, span) {
+	for_each_cpu(i, sg_span) {
 		sibling = *per_cpu_ptr(sdd->sd, i);
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 08/14] sched/topology: Move comment about asymmetric node setups
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
                   ` (6 preceding siblings ...)
  2017-04-28 13:20 ` [PATCH 07/14] sched/topology: Optimize build_group_mask() Peter Zijlstra
@ 2017-04-28 13:20 ` Peter Zijlstra
  2017-05-02  1:08   ` Rik van Riel
  2017-04-28 13:20 ` [PATCH 09/14] sched/topology: Remove FORCE_SD_OVERLAP Peter Zijlstra
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:20 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz

[-- Attachment #1: lauro_ramos_venancio-sched_topology-move_comment_about_asymmetric_node_setups.patch --]
[-- Type: text/plain, Size: 1750 bytes --]


Cc: lwang@redhat.com
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: riel@redhat.com
Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1492717903-5195-4-git-send-email-lvenanci@redhat.com
---
 kernel/sched/topology.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -495,14 +495,6 @@ enum s_alloc {
 /*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
- *
- * Asymmetric node setups can result in situations where the domain tree is of
- * unequal depth, make sure to skip domains that already cover the entire
- * range.
- *
- * In that case build_sched_domains() will have terminated the iteration early
- * and our sibling sd spans will be empty. Domains should always include the
- * CPU they're built on, so check that.
  */
 static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
 {
@@ -590,7 +582,16 @@ build_overlap_sched_groups(struct sched_
 
 		sibling = *per_cpu_ptr(sdd->sd, i);
 
-		/* See the comment near build_group_mask(). */
+		/*
+		 * Asymmetric node setups can result in situations where the
+		 * domain tree is of unequal depth, make sure to skip domains
+		 * that already cover the entire range.
+		 *
+		 * In that case build_sched_domains() will have terminated the
+		 * iteration early and our sibling sd spans will be empty.
+		 * Domains should always include the CPU they're built on, so
+		 * check that.
+		 */
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 09/14] sched/topology: Remove FORCE_SD_OVERLAP
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
                   ` (7 preceding siblings ...)
  2017-04-28 13:20 ` [PATCH 08/14] sched/topology: Move comment about asymmetric node setups Peter Zijlstra
@ 2017-04-28 13:20 ` Peter Zijlstra
  2017-05-02  1:09   ` Rik van Riel
  2017-04-28 13:20 ` [PATCH 10/14] sched/topology: Fix overlapping sched_group_mask Peter Zijlstra
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:20 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz

[-- Attachment #1: peterz-sched-topo-kill-force-overlap.patch --]
[-- Type: text/plain, Size: 1167 bytes --]

It's an obsolete debug mechanism and future code wants to rely on
properties this undermines.

Namely, it would be good to assume that SD_OVERLAP domains have
children, but if we build the entire hierarchy with SD_OVERLAP this is
obviously false.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/features.h |    1 -
 kernel/sched/topology.c |    2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -76,7 +76,6 @@ SCHED_FEAT(WARN_DOUBLE_CLOCK, false)
 SCHED_FEAT(RT_PUSH_IPI, true)
 #endif
 
-SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1439,7 +1439,7 @@ build_sched_domains(const struct cpumask
 			sd = build_sched_domain(tl, cpu_map, attr, sd, i);
 			if (tl == sched_domain_topology)
 				*per_cpu_ptr(d.sd, i) = sd;
-			if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
+			if (tl->flags & SDTL_OVERLAP)
 				sd->flags |= SD_OVERLAP;
 			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
 				break;

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 10/14] sched/topology: Fix overlapping sched_group_mask
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
                   ` (8 preceding siblings ...)
  2017-04-28 13:20 ` [PATCH 09/14] sched/topology: Remove FORCE_SD_OVERLAP Peter Zijlstra
@ 2017-04-28 13:20 ` Peter Zijlstra
  2017-04-28 13:20 ` [PATCH 11/14] sched/topology: Small cleanup Peter Zijlstra
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:20 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz, stable

[-- Attachment #1: peterz-sched-topo-sched_group_mask.patch --]
[-- Type: text/plain, Size: 2649 bytes --]

The point of sched_group_mask is to select those CPUs from
sched_group_cpus that can actually arrive at this balance domain.

The current code gets it wrong, as can be readily demonstrated with a
topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

Where (for example) domain 1 on CPU1 ends up with a mask that includes
CPU0:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

This causes group_balance_cpu() to compute the wrong CPU and consequently
should_we_balance() will terminate early, resulting in missed load-balance
opportunities.

The fixed topology looks like:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

(note: this relies on OVERLAP domains to always have children, this is
 true because the regular topology domains are still here -- this is
 before degenerate trimming)

Cc: stable@vger.kernel.org
Fixes: e3589f6c81e4 ("sched: Allow for overlapping sched_domain spans")
Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/topology.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -495,6 +495,9 @@ enum s_alloc {
 /*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
+ *
+ * Only CPUs that can arrive at this group should be considered to continue
+ * balancing.
  */
 static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
 {
@@ -505,11 +508,24 @@ static void build_group_mask(struct sche
 
 	for_each_cpu(i, sg_span) {
 		sibling = *per_cpu_ptr(sdd->sd, i);
-		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
+
+		/*
+		 * Can happen in the asymmetric case, where these siblings are
+		 * unused. The mask will not be empty because those CPUs that
+		 * do have the top domain _should_ span the domain.
+		 */
+		if (!sibling->child)
+			continue;
+
+		/* If we would not end up here, we can't continue from here */
+		if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
 			continue;
 
 		cpumask_set_cpu(i, sched_group_mask(sg));
 	}
+
+	/* We must not have empty masks here */
+	WARN_ON_ONCE(cpumask_empty(sched_group_mask(sg)));
 }
 
 /*

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 11/14] sched/topology: Small cleanup
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
                   ` (9 preceding siblings ...)
  2017-04-28 13:20 ` [PATCH 10/14] sched/topology: Fix overlapping sched_group_mask Peter Zijlstra
@ 2017-04-28 13:20 ` Peter Zijlstra
  2017-04-28 13:20 ` [PATCH 12/14] sched/topology,debug: Add sched_group_capacity debugging Peter Zijlstra
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:20 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz

[-- Attachment #1: peterz-sched-init-domains.patch --]
[-- Type: text/plain, Size: 2683 bytes --]

Move the allocation of topology specific cpumasks into the topology
code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c     |    4 +---
 kernel/sched/sched.h    |    4 +---
 kernel/sched/topology.c |    7 +++++--
 3 files changed, 7 insertions(+), 8 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5961,7 +5961,6 @@ void __init sched_init_smp(void)
 	cpumask_var_t non_isolated_cpus;
 
 	alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL);
-	alloc_cpumask_var(&fallback_doms, GFP_KERNEL);
 
 	sched_init_numa();
 
@@ -5971,7 +5970,7 @@ void __init sched_init_smp(void)
 	 * happen.
 	 */
 	mutex_lock(&sched_domains_mutex);
-	init_sched_domains(cpu_active_mask);
+	sched_init_domains(cpu_active_mask);
 	cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map);
 	if (cpumask_empty(non_isolated_cpus))
 		cpumask_set_cpu(smp_processor_id(), non_isolated_cpus);
@@ -6185,7 +6184,6 @@ void __init sched_init(void)
 	calc_load_update = jiffies + LOAD_FREQ;
 
 #ifdef CONFIG_SMP
-	zalloc_cpumask_var(&sched_domains_tmpmask, GFP_NOWAIT);
 	/* May be allocated at isolcpus cmdline parse time */
 	if (cpu_isolated_map == NULL)
 		zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -606,11 +606,9 @@ struct root_domain {
 
 extern struct root_domain def_root_domain;
 extern struct mutex sched_domains_mutex;
-extern cpumask_var_t fallback_doms;
-extern cpumask_var_t sched_domains_tmpmask;
 
 extern void init_defrootdomain(void);
-extern int init_sched_domains(const struct cpumask *cpu_map);
+extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
 
 #endif /* CONFIG_SMP */
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1519,7 +1519,7 @@ static struct sched_domain_attr		*dattr_
  * cpumask) fails, then fallback to a single sched domain,
  * as determined by the single cpumask fallback_doms.
  */
-cpumask_var_t				fallback_doms;
+static cpumask_var_t			fallback_doms;
 
 /*
  * arch_update_cpu_topology lets virtualized architectures update the
@@ -1561,10 +1561,13 @@ void free_sched_domains(cpumask_var_t do
  * For now this just excludes isolated CPUs, but could be used to
  * exclude other special cases in the future.
  */
-int init_sched_domains(const struct cpumask *cpu_map)
+int sched_init_domains(const struct cpumask *cpu_map)
 {
 	int err;
 
+	zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
+	zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
+
 	arch_update_cpu_topology();
 	ndoms_cur = 1;
 	doms_cur = alloc_sched_domains(ndoms_cur);

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 12/14] sched/topology,debug: Add sched_group_capacity debugging
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
                   ` (10 preceding siblings ...)
  2017-04-28 13:20 ` [PATCH 11/14] sched/topology: Small cleanup Peter Zijlstra
@ 2017-04-28 13:20 ` Peter Zijlstra
  2017-04-28 13:20 ` [PATCH 13/14] sched/topology: Fix overlapping sched_group_capacity Peter Zijlstra
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:20 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz

[-- Attachment #1: peterz-sched-topo-print-sgc.patch --]
[-- Type: text/plain, Size: 3614 bytes --]

Add sgc::id to make it easier to spot domain construction issues.

Take the opportunity to slightly rework the group printing, because
adding more "(id: %d)" strings makes the entire thing very hard to
read. Also the individual groups are very hard to separate, so add
explicit visual grouping, which allows replacing all the "(%s: %d)"
format things with shorter "%s=%d" variants.

Then fix up some inconsistencies in surrounding prints for domains.

The end result looks like:

  [] CPU0 attaching sched-domain(s):
  []  domain-0: span=0,4 level=DIE
  []   groups: 0:{ span=0 }, 4:{ span=4 }
  []   domain-1: span=0-1,3-5,7 level=NUMA
  []    groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/sched.h    |    4 ++++
 kernel/sched/topology.c |   25 +++++++++++++++----------
 2 files changed, 19 insertions(+), 10 deletions(-)

--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1023,6 +1023,10 @@ struct sched_group_capacity {
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 
+#ifdef CONFIG_SCHED_DEBUG
+	int id;
+#endif
+
 	unsigned long cpumask[0]; /* iteration mask */
 };
 
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -35,7 +35,7 @@ static int sched_domain_debug_one(struct
 
 	cpumask_clear(groupmask);
 
-	printk(KERN_DEBUG "%*s domain %d: ", level, "", level);
+	printk(KERN_DEBUG "%*s domain-%d: ", level, "", level);
 
 	if (!(sd->flags & SD_LOAD_BALANCE)) {
 		printk("does not load-balance\n");
@@ -45,7 +45,7 @@ static int sched_domain_debug_one(struct
 		return -1;
 	}
 
-	printk(KERN_CONT "span %*pbl level %s\n",
+	printk(KERN_CONT "span=%*pbl level=%s\n",
 	       cpumask_pr_args(sched_domain_span(sd)), sd->name);
 
 	if (!cpumask_test_cpu(cpu, sched_domain_span(sd))) {
@@ -80,18 +80,17 @@ static int sched_domain_debug_one(struct
 
 		cpumask_or(groupmask, groupmask, sched_group_cpus(group));
 
-		printk(KERN_CONT " %*pbl",
-		       cpumask_pr_args(sched_group_cpus(group)));
+		printk(KERN_CONT " %d:{ span=%*pbl",
+				group->sgc->id,
+				cpumask_pr_args(sched_group_cpus(group)));
 
 		if ((sd->flags & SD_OVERLAP) && !cpumask_full(sched_group_mask(group))) {
-			printk(KERN_CONT " (mask: %*pbl)",
+			printk(KERN_CONT " mask=%*pbl",
 				cpumask_pr_args(sched_group_mask(group)));
 		}
 
-		if (group->sgc->capacity != SCHED_CAPACITY_SCALE) {
-			printk(KERN_CONT " (cpu_capacity: %lu)",
-				group->sgc->capacity);
-		}
+		if (group->sgc->capacity != SCHED_CAPACITY_SCALE)
+			printk(KERN_CONT " cap=%lu", group->sgc->capacity);
 
 		if (group == sd->groups && sd->child &&
 		    !cpumask_equal(sched_domain_span(sd->child),
@@ -99,6 +98,8 @@ static int sched_domain_debug_one(struct
 			printk(KERN_ERR "ERROR: domain->groups does not match domain->child\n");
 		}
 
+		printk(KERN_CONT " }");
+
 		group = group->next;
 
 		if (group != sd->groups)
@@ -129,7 +130,7 @@ static void sched_domain_debug(struct sc
 		return;
 	}
 
-	printk(KERN_DEBUG "CPU%d attaching sched-domain:\n", cpu);
+	printk(KERN_DEBUG "CPU%d attaching sched-domain(s):\n", cpu);
 
 	for (;;) {
 		if (sched_domain_debug_one(sd, cpu, level, sched_domains_tmpmask))
@@ -1356,6 +1357,10 @@ static int __sdt_alloc(const struct cpum
 			if (!sgc)
 				return -ENOMEM;
 
+#ifdef CONFIG_SCHED_DEBUG
+			sgc->id = j;
+#endif
+
 			*per_cpu_ptr(sdd->sgc, j) = sgc;
 		}
 	}

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 13/14] sched/topology: Fix overlapping sched_group_capacity
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
                   ` (11 preceding siblings ...)
  2017-04-28 13:20 ` [PATCH 12/14] sched/topology,debug: Add sched_group_capacity debugging Peter Zijlstra
@ 2017-04-28 13:20 ` Peter Zijlstra
  2017-04-28 13:20 ` [PATCH 14/14] sched/topology: Add a few comments Peter Zijlstra
  2017-04-28 13:53 ` [PATCH 00/14] sched/topology fixes Peter Zijlstra
  14 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:20 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz

[-- Attachment #1: peterz-sched-topo-sched-group-capacity.patch --]
[-- Type: text/plain, Size: 5235 bytes --]

When building the overlapping groups we need to attach a consistent
sched_group_capacity structure. That is, all 'identical' sched_group's
should have the _same_ sched_group_capacity.

This can (once again) be demonstrated with a topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

But we need at least 2 CPUs per node for this to show up; after all,
if there is only one CPU per node, our CPU @i is by definition a
unique CPU that reaches this domain (aka balance-cpu).

Given the above NUMA topo and 2 CPUs per node:

  [] CPU0 attaching sched-domain(s):
  []  domain-0: span=0,4 level=DIE
  []   groups: 0:{ span=0 }, 4:{ span=4 }
  []   domain-1: span=0-1,3-5,7 level=NUMA
  []    groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }
  [] CPU1 attaching sched-domain(s):
  []  domain-0: span=1,5 level=DIE
  []   groups: 1:{ span=1 }, 5:{ span=5 }
  []   domain-1: span=0-2,4-6 level=NUMA
  []    groups: 1:{ span=1,5 mask=1,5 cap=2048 }, 2:{ span=2,6 mask=2,6 cap=2048 }, 4:{ span=0,4 mask=0,4 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 1:{ span=0-2,4-6 mask=1,5 cap=6144 }, 3:{ span=0,2-4,6-7 mask=3,7 cap=6144 }

Observe how CPU0-domain1-group0 and CPU1-domain1-group4 are the
'same' but have a different id (0 vs 4).

To fix this, use the group balance CPU to select the SGC. This means
we have to compute the full mask for each CPU and require a second
temporary mask to store the group mask in (it otherwise lives in the
SGC).

The fixed topology looks like:

  [] CPU0 attaching sched-domain(s):
  []  domain-0: span=0,4 level=DIE
  []   groups: 0:{ span=0 }, 4:{ span=4 }
  []   domain-1: span=0-1,3-5,7 level=NUMA
  []    groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }
  [] CPU1 attaching sched-domain(s):
  []  domain-0: span=1,5 level=DIE
  []   groups: 1:{ span=1 }, 5:{ span=5 }
  []   domain-1: span=0-2,4-6 level=NUMA
  []    groups: 1:{ span=1,5 mask=1,5 cap=2048 }, 2:{ span=2,6 mask=2,6 cap=2048 }, 0:{ span=0,4 mask=0,4 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 1:{ span=0-2,4-6 mask=1,5 cap=6144 }, 3:{ span=0,2-4,6-7 mask=3,7 cap=6144 }

Fixes: e3589f6c81e4 ("sched: Allow for overlapping sched_domain spans")
Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/topology.c |   22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -10,6 +10,7 @@ DEFINE_MUTEX(sched_domains_mutex);
 
 /* Protected by sched_domains_mutex: */
 cpumask_var_t sched_domains_tmpmask;
+cpumask_var_t sched_domains_tmpmask2;
 
 #ifdef CONFIG_SCHED_DEBUG
 
@@ -500,13 +501,16 @@ enum s_alloc {
  * Only CPUs that can arrive at this group should be considered to continue
  * balancing.
  */
-static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
+static void
+build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
 {
 	const struct cpumask *sg_span = sched_group_cpus(sg);
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
 
+	cpumask_clear(mask);
+
 	for_each_cpu(i, sg_span) {
 		sibling = *per_cpu_ptr(sdd->sd, i);
 
@@ -522,11 +526,11 @@ static void build_group_mask(struct sche
 		if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
 			continue;
 
-		cpumask_set_cpu(i, sched_group_mask(sg));
+		cpumask_set_cpu(i, mask);
 	}
 
 	/* We must not have empty masks here */
-	WARN_ON_ONCE(cpumask_empty(sched_group_mask(sg)));
+	WARN_ON_ONCE(cpumask_empty(mask));
 }
 
 /*
@@ -560,14 +564,19 @@ build_group_from_child_sched_domain(stru
 }
 
 static void init_overlap_sched_group(struct sched_domain *sd,
-				     struct sched_group *sg, int cpu)
+				     struct sched_group *sg)
 {
+	struct cpumask *mask = sched_domains_tmpmask2;
 	struct sd_data *sdd = sd->private;
 	struct cpumask *sg_span;
+	int cpu;
+
+	build_group_mask(sd, sg, mask);
+	cpu = cpumask_first_and(sched_group_cpus(sg), mask);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
-		build_group_mask(sd, sg);
+		cpumask_copy(sched_group_mask(sg), mask);
 
 	/*
 	 * Initialize sgc->capacity such that even if we mess up the
@@ -619,7 +628,7 @@ build_overlap_sched_groups(struct sched_
 		sg_span = sched_group_cpus(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg, i);
+		init_overlap_sched_group(sd, sg);
 
 		if (!first)
 			first = sg;
@@ -1578,6 +1587,7 @@ int sched_init_domains(const struct cpum
 	int err;
 
 	zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
+	zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
 	zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
 
 	arch_update_cpu_topology();

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 14/14] sched/topology: Add a few comments
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
                   ` (12 preceding siblings ...)
  2017-04-28 13:20 ` [PATCH 13/14] sched/topology: Fix overlapping sched_group_capacity Peter Zijlstra
@ 2017-04-28 13:20 ` Peter Zijlstra
  2017-05-01  8:35   ` Peter Zijlstra
  2017-04-28 13:53 ` [PATCH 00/14] sched/topology fixes Peter Zijlstra
  14 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:20 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel, peterz

[-- Attachment #1: peterz-sched-topo-comments.patch --]
[-- Type: text/plain, Size: 7459 bytes --]

Try and describe what this code is about..

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/topology.c |  169 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 162 insertions(+), 7 deletions(-)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -495,11 +495,96 @@ enum s_alloc {
 };
 
 /*
+ * Return the canonical balance CPU for this group, this is the first CPU
+ * of this group that's also in the iteration mask.
+ *
+ * The iteration mask are all those CPUs that could actually end up at this
+ * group. See build_group_mask().
+ *
+ * Also see should_we_balance().
+ */
+int group_balance_cpu(struct sched_group *sg)
+{
+	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
+}
+
+
+/*
+ * NUMA topology (first read the regular topology blurb below)
+ *
+ * Given a node-distance table, for example:
+ *
+ *   node   0   1   2   3
+ *     0:  10  20  30  20
+ *     1:  20  10  20  30
+ *     2:  30  20  10  20
+ *     3:  20  30  20  10
+ *
+ * which represents a 4 node ring topology like:
+ *
+ *   0 ----- 1
+ *   |       |
+ *   |       |
+ *   |       |
+ *   3 ----- 2
+ *
+ * We want to construct domains and groups to represent this. The way we go
+ * about doing this is to build the domains on 'hops'. For each NUMA level we
+ * construct the mask of all nodes reachable in @level hops.
+ *
+ * For the above NUMA topology that gives 3 levels:
+ *
+ * NUMA-2	0-3		0-3		0-3		0-3
+ *  groups:	{0-1,3},{1-3}	{0-2},{0,2-3}	{1-3},{0-1,3}	{0,2-3},{0-2}
+ *
+ * NUMA-1	0-1,3		0-2		1-3		0,2-3
+ *  groups:	{0},{1},{3}	{0},{1},{2}	{1},{2},{3}	{0},{2},{3}
+ *
+ * NUMA-0	0		1		2		3
+ *
+ *
+ * As can be seen; things don't nicely line up as with the regular topology.
+ * When we iterate a domain in child domain chunks some nodes can be
+ * represented multiple times -- hence the "overlap" naming for this part of
+ * the topology.
+ *
+ * In order to minimize this overlap, we only build enough groups to cover the
+ * domain. For instance Node-0 NUMA-2 would only get groups: 0-1,3 and 1-3.
+ *
+ * Because:
+ *
+ *  - the first group of each domain is its child domain; this
+ *    gets us the first 0-1,3
+ *  - the only uncovered node is 2, whose child domain is 1-3.
+ *
+ * However, because of the overlap, computing a unique CPU for each group is
+ * more complicated. Consider for instance the groups of NODE-1 NUMA-2, both
+ * groups include the CPUs of Node-0, while those CPUs would not in fact ever
+ * end up at those groups (they would end up in group: 0-1,3).
+ *
+ * To correct this we have to introduce the group iteration mask. This mask
+ * will contain those CPUs in the group that can reach this group given the
+ * (child) domain tree.
+ *
+ * With this we can once again compute balance_cpu and sched_group_capacity
+ * relations.
+ *
+ * XXX include words on how balance_cpu is unique and therefore can be
+ * used for sched_group_capacity links.
+ */
+
+
+/*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
  *
  * Only CPUs that can arrive at this group should be considered to continue
  * balancing.
+ *
+ * We do this during the group creation pass, therefore the group information
+ * isn't complete yet, however since each group represents a (child) domain we
+ * can fully construct this using the sched_domain bits (which are already
+ * complete).
  */
 static void
 build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
@@ -534,14 +619,10 @@ build_group_mask(struct sched_domain *sd
 }
 
 /*
- * Return the canonical balance CPU for this group, this is the first CPU
- * of this group that's also in the iteration mask.
+ * XXX: This creates per-node group entries; since the load-balancer will
+ * immediately access remote memory to construct this group's load-balance
+ * statistics, having the groups node-local is of dubious benefit.
  */
-int group_balance_cpu(struct sched_group *sg)
-{
-	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
-}
-
 static struct sched_group *
 build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 {
@@ -577,6 +658,8 @@ static void init_overlap_sched_group(str
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
 		cpumask_copy(sched_group_mask(sg), mask);
+	else
+		WARN_ON_ONCE(!cpumask_equal(sched_group_mask(sg), mask));
 
 	/*
 	 * Initialize sgc->capacity such that even if we mess up the
@@ -647,6 +730,78 @@ build_overlap_sched_groups(struct sched_
 	return -ENOMEM;
 }
 
+
+/*
+ * Package topology (also see the load-balance blurb in fair.c)
+ *
+ * The scheduler builds a tree structure to represent a number of important
+ * topology features. By default (default_topology[]) these include:
+ *
+ *  - Simultaneous multithreading (SMT)
+ *  - Multi-Core Cache (MC)
+ *  - Package (DIE)
+ *
+ * Where the last one more or less denotes everything up to a NUMA node.
+ *
+ * The tree consists of 3 primary data structures:
+ *
+ *	sched_domain -> sched_group -> sched_group_capacity
+ *	    ^ ^             ^ ^
+ *          `-'             `-'
+ *
+ * The sched_domains are per-cpu and have a two way link (parent & child) and
+ * denote the ever growing mask of CPUs belonging to that level of topology.
+ *
+ * Each sched_domain has a circular (double) linked list of sched_group's, each
+ * denoting the domains of the level below (or individual CPUs in case of the
+ * first domain level). The sched_group linked by a sched_domain includes the
+ * CPU of that sched_domain [*].
+ *
+ * Take for instance a 2 threaded, 2 core, 2 cache cluster part:
+ *
+ * CPU   0   1   2   3   4   5   6   7
+ *
+ * DIE  [                             ]
+ * MC   [             ] [             ]
+ * SMT  [     ] [     ] [     ] [     ]
+ *
+ *  - or -
+ *
+ * DIE  0-7 0-7 0-7 0-7 0-7 0-7 0-7 0-7
+ * MC	0-3 0-3 0-3 0-3 4-7 4-7 4-7 4-7
+ * SMT  0-1 0-1 2-3 2-3 4-5 4-5 6-7 6-7
+ *
+ * CPU   0   1   2   3   4   5   6   7
+ *
+ * One way to think about it is: sched_domain moves you up and down among these
+ * topology levels, while sched_group moves you sideways through it, at child
+ * domain granularity.
+ *
+ * sched_group_capacity ensures each unique sched_group has shared storage.
+ *
+ * There are two related construction problems; both require a CPU that
+ * uniquely identifies each group (for a given domain):
+ *
+ *  - The first is the balance_cpu (see should_we_balance() and the
+ *    load-balance blurb in fair.c); for each group we only want 1 CPU to
+ *    continue balancing at a higher domain.
+ *
+ *  - The second is the sched_group_capacity; we want all identical groups
+ *    to share a single sched_group_capacity.
+ *
+ * Since these topologies are exclusive by construction, i.e. it is
+ * impossible for an SMT thread to belong to multiple cores, and for cores to
+ * be part of multiple caches, there is a very clear and unique location
+ * for each CPU in the hierarchy.
+ *
+ * Therefore computing a unique CPU for each group is trivial (the iteration
+ * mask is redundant and set to all 1s; all CPUs in a group will end up at
+ * _that_ group); we can simply pick the first CPU in each group.
+ *
+ *
+ * [*] in other words, the first group of each domain is its child domain.
+ */
+
 static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
 {
 	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 00/14] sched/topology fixes
  2017-04-28 13:19 [PATCH 00/14] sched/topology fixes Peter Zijlstra
                   ` (13 preceding siblings ...)
  2017-04-28 13:20 ` [PATCH 14/14] sched/topology: Add a few comments Peter Zijlstra
@ 2017-04-28 13:53 ` Peter Zijlstra
  2017-04-28 23:30   ` Lauro Venancio
  2017-05-02 14:43   ` Peter Zijlstra
  14 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-28 13:53 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel

On Fri, Apr 28, 2017 at 03:19:58PM +0200, Peter Zijlstra wrote:
> Hi,
> 
> These patches are based upon the hard work of Lauro. He put in the time and
> effort to understand and debug the code.
> 
> So while I didn't take many of his actual patches; I want to thank him for
> doing the work. Hopefully the "Debugged-by:" tag conveys some of that.
> 
> In any case, please have a look. I think these should about cover things.
> 
> Rik, Lauro, could you guys in particular look at the final patch that adds a
> few comments. I attempted to document the intent and understanding there. But
> given I've been staring at this stuff too long I could've missed the obvious.
> 
> Comments and or suggestions welcome.
> 

Also, the following occurred to me:

  sg_span & sg_mask == sg_mask

Therefore, we don't need to do the whole "sg_span &" business.

Hmm?

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7996,7 +7996,7 @@ static int active_load_balance_cpu_stop(
 static int should_we_balance(struct lb_env *env)
 {
 	struct sched_group *sg = env->sd->groups;
-	struct cpumask *sg_cpus, *sg_mask;
+	struct cpumask *sg_mask;
 	int cpu, balance_cpu = -1;
 
 	/*
@@ -8006,11 +8006,10 @@ static int should_we_balance(struct lb_e
 	if (env->idle == CPU_NEWLY_IDLE)
 		return 1;
 
-	sg_cpus = sched_group_cpus(sg);
 	sg_mask = sched_group_mask(sg);
 	/* Try to find first idle cpu */
-	for_each_cpu_and(cpu, sg_cpus, env->cpus) {
-		if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
+	for_each_cpu_and(cpu, sg_mask, env->cpus) {
+		if (!idle_cpu(cpu))
 			continue;
 
 		balance_cpu = cpu;
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -85,7 +85,8 @@ static int sched_domain_debug_one(struct
 				group->sgc->id,
 				cpumask_pr_args(sched_group_cpus(group)));
 
-		if ((sd->flags & SD_OVERLAP) && !cpumask_full(sched_group_mask(group))) {
+		if ((sd->flags & SD_OVERLAP) &&
+		    !cpumask_equal(sched_group_mask(group), sched_group_cpus(group))) {
 			printk(KERN_CONT " mask=%*pbl",
 				cpumask_pr_args(sched_group_mask(group)));
 		}
@@ -505,7 +506,7 @@ enum s_alloc {
  */
 int group_balance_cpu(struct sched_group *sg)
 {
-	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
+	return cpumask_first(sched_group_mask(sg));
 }
 
 
@@ -856,7 +857,7 @@ build_sched_groups(struct sched_domain *
 			continue;
 
 		group = get_group(i, sdd, &sg);
-		cpumask_setall(sched_group_mask(sg));
+		cpumask_copy(sched_group_mask(sg), sched_group_cpus(sg));
 
 		for_each_cpu(j, span) {
 			if (get_group(j, sdd, NULL) != group)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 00/14] sched/topology fixes
  2017-04-28 13:53 ` [PATCH 00/14] sched/topology fixes Peter Zijlstra
@ 2017-04-28 23:30   ` Lauro Venancio
  2017-05-01  8:57     ` Peter Zijlstra
  2017-05-02 14:43   ` Peter Zijlstra
  1 sibling, 1 reply; 31+ messages in thread
From: Lauro Venancio @ 2017-04-28 23:30 UTC (permalink / raw)
  To: Peter Zijlstra, mingo; +Cc: lwang, riel, efault, tglx, linux-kernel

On 04/28/2017 10:53 AM, Peter Zijlstra wrote:
> On Fri, Apr 28, 2017 at 03:19:58PM +0200, Peter Zijlstra wrote:
>> Hi,
>>
>> These patches are based upon the hard work of Lauro. He put in the time and
>> effort to understand and debug the code.
>>
>> So while I didn't take many of his actual patches; I want to thank him for
>> doing the work. Hopefully the "Debugged-by:" tag conveys some of that.
>>
>> In any case, please have a look. I think these should about cover things.
>>
>> Rik, Lauro, could you guys in particular look at the final patch that adds a
>> few comments. I attempted to document the intent and understanding there. But
>> given I've been staring at this stuff too long I could've missed the obvious.
>>
>> Comments and or suggestions welcome.
>>
> Also, the following occurred to me:
>
>   sg_span & sg_mask == sg_mask
>
> Therefore, we don't need to do the whole "sg_span &" business.
>
> Hmm?
Agreed. Maybe we have to rename "mask" to something else. Maybe
"group_reached_by_cpus" or "group_installed_on_cpus".

>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7996,7 +7996,7 @@ static int active_load_balance_cpu_stop(
>  static int should_we_balance(struct lb_env *env)
>  {
>  	struct sched_group *sg = env->sd->groups;
> -	struct cpumask *sg_cpus, *sg_mask;
> +	struct cpumask *sg_mask;
>  	int cpu, balance_cpu = -1;
>  
>  	/*
> @@ -8006,11 +8006,10 @@ static int should_we_balance(struct lb_e
>  	if (env->idle == CPU_NEWLY_IDLE)
>  		return 1;
>  
> -	sg_cpus = sched_group_cpus(sg);
>  	sg_mask = sched_group_mask(sg);
>  	/* Try to find first idle cpu */
> -	for_each_cpu_and(cpu, sg_cpus, env->cpus) {
> -		if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
> +	for_each_cpu_and(cpu, sg_mask, env->cpus) {
> +		if (!idle_cpu(cpu))
>  			continue;
>  
>  		balance_cpu = cpu;
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -85,7 +85,8 @@ static int sched_domain_debug_one(struct
>  				group->sgc->id,
>  				cpumask_pr_args(sched_group_cpus(group)));
>  
> -		if ((sd->flags & SD_OVERLAP) && !cpumask_full(sched_group_mask(group))) {
> +		if ((sd->flags & SD_OVERLAP) &&
> +		    !cpumask_equal(sched_group_mask(group), sched_group_cpus(group))) {
>  			printk(KERN_CONT " mask=%*pbl",
>  				cpumask_pr_args(sched_group_mask(group)));
>  		}
> @@ -505,7 +506,7 @@ enum s_alloc {
>   */
>  int group_balance_cpu(struct sched_group *sg)
>  {
> -	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
> +	return cpumask_first(sched_group_mask(sg));
>  }
>  
>  
> @@ -856,7 +857,7 @@ build_sched_groups(struct sched_domain *
>  			continue;
>  
>  		group = get_group(i, sdd, &sg);
> -		cpumask_setall(sched_group_mask(sg));
> +		cpumask_copy(sched_group_mask(sg), sched_group_cpus(sg));
>  
>  		for_each_cpu(j, span) {
>  			if (get_group(j, sdd, NULL) != group)

^ permalink raw reply	[flat|nested] 31+ messages in thread
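
A tiny userspace sketch (not kernel code, and not part of the thread) of the
invariant relied on above: the balance mask is built as a subset of the group
span, so AND-ing with the span before picking the first CPU changes nothing.
The bitmap helpers below are stand-ins for the cpumask API.

	#include <assert.h>
	#include <stdio.h>

	/* Toy 8-CPU masks as plain bitmaps; names are illustrative only. */
	typedef unsigned long mask_t;

	static int first_cpu(mask_t m)
	{
		return m ? __builtin_ctzl(m) : -1;
	}

	static int first_cpu_and(mask_t a, mask_t b)
	{
		return first_cpu(a & b);
	}

	int main(void)
	{
		mask_t sg_span = 0x0b;	/* group spans CPUs {0,1,3} */
		mask_t sg_mask = 0x08;	/* balance mask {3}, a subset by construction */

		/* sg_span & sg_mask == sg_mask ... */
		assert((sg_span & sg_mask) == sg_mask);

		/* ... so the "sg_span &" step in group_balance_cpu() is redundant. */
		assert(first_cpu_and(sg_span, sg_mask) == first_cpu(sg_mask));

		printf("balance cpu = %d\n", first_cpu(sg_mask));
		return 0;
	}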

* Re: [PATCH 14/14] sched/topology: Add a few comments
  2017-04-28 13:20 ` [PATCH 14/14] sched/topology: Add a few comments Peter Zijlstra
@ 2017-05-01  8:35   ` Peter Zijlstra
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2017-05-01  8:35 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel

On Fri, Apr 28, 2017 at 03:20:12PM +0200, Peter Zijlstra wrote:
> +/*
> + * NUMA topology (first read the regular topology blurb below)
> + *
> + * Given a node-distance table, for example:
> + *
> + *   node   0   1   2   3
> + *     0:  10  20  30  20
> + *     1:  20  10  20  30
> + *     2:  30  20  10  20
> + *     3:  20  30  20  10
> + *
> + * which represents a 4 node ring topology like:
> + *
> + *   0 ----- 1
> + *   |       |
> + *   |       |
> + *   |       |
> + *   3 ----- 2
> + *
> + * We want to construct domains and groups to represent this. The way we go
> + * about doing this is to build the domains on 'hops'. For each NUMA level we
> + * construct the mask of all nodes reachable in @level hops.
> + *
> + * For the above NUMA topology that gives 3 levels:
> + *
> + * NUMA-2	0-3		0-3		0-3		0-3
> + *  groups:	{0-1,3},{1-3}	{0-2},{0,2-3}	{1-3},{0-1,3}	{0,2-3},{0-2}
> + *
> + * NUMA-1	0-1,3		0-2		1-3		0,2-3
> + *  groups:	{0},{1},{3}	{0},{1},{2}	{1},{2},{3}	{0},{2},{3}
> + *
> + * NUMA-0	0		1		2		3
> + *
> + *
> + * As can be seen, things don't line up as nicely as with the regular topology.
> + * When we iterate a domain in child domain chunks some nodes can be
> + * represented multiple times -- hence the "overlap" naming for this part of
> + * the topology.
> + *
> + * In order to minimize this overlap, we only build enough groups to cover the
> + * domain. For instance Node-0 NUMA-2 would only get groups: 0-1,3 and 1-3.
> + *
> + * Because:
> + *
> + *  - the first group of each domain is its child domain; this
> + *    gets us the first 0-1,3
> + *  - the only uncovered node is 2, whose child domain is 1-3.
> + *
> + * However, because of the overlap, computing a unique CPU for each group is
> + * more complicated. Consider for instance the groups of Node-1 NUMA-2: both
> + * groups include the CPUs of Node-0, even though those CPUs would in fact never
> + * end up at those groups (they would end up in group: 0-1,3).
> + *
> + * To correct this we have to introduce the group iteration mask. This mask
> + * will contain those CPUs in the group that can reach this group given the
> + * (child) domain tree.
> + *
> + * With this we can once again compute balance_cpu and sched_group_capacity
> + * relations.
> + *
> + * XXX include words on how balance_cpu is unique and therefore can be
> + * used for sched_group_capacity links.
> + */
> +

I added the below to clarify some of the asymmetric comments we have.


--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -571,6 +571,37 @@ int group_balance_cpu(struct sched_group
  *
  * XXX include words on how balance_cpu is unique and therefore can be
  * used for sched_group_capacity links.
+ *
+ *
+ * Another 'interesting' topology is:
+ *
+ *   node   0   1   2   3
+ *     0:  10  20  20  30
+ *     1:  20  10  20  20
+ *     2:  20  20  10  20
+ *     3:  30  20  20  10
+ *
+ * Which looks a little like:
+ *
+ *   0 ----- 1
+ *   |     / |
+ *   |   /   |
+ *   | /     |
+ *   2 ----- 3
+ *
+ * This topology is asymmetric, nodes 1,2 are fully connected, but nodes 0,3
+ * are not.
+ *
> + * This leads to a few particularly weird cases where the number of sched_domains
> + * is not the same for each CPU. Consider:
+ *
+ * NUMA-2	0-3						0-3
+ *  groups:	{0-2},{1-3}					{1-3},{0-2}
+ *
+ * NUMA-1	0-2		0-3		0-3		1-3
+ *
+ * NUMA-0	0		1		2		3
+ *
  */
 
 

^ permalink raw reply	[flat|nested] 31+ messages in thread
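
A small userspace sketch (not part of the patch) of the "mask of all nodes
reachable in @level hops" construction for the ring table in the comment
above. The mapping of the distance values 10/20/30 onto levels 0/1/2 is an
assumption made purely for illustration.

	#include <stdio.h>

	#define N 4

	/* Node-distance table for the 4-node ring described above. */
	static const int dist[N][N] = {
		{ 10, 20, 30, 20 },
		{ 20, 10, 20, 30 },
		{ 30, 20, 10, 20 },
		{ 20, 30, 20, 10 },
	};

	int main(void)
	{
		static const int level_dist[] = { 10, 20, 30 };

		for (int lvl = 0; lvl < 3; lvl++) {
			printf("NUMA-%d:", lvl);
			for (int i = 0; i < N; i++) {
				unsigned int span = 0;

				/* All nodes within level_dist[lvl] of node i. */
				for (int j = 0; j < N; j++)
					if (dist[i][j] <= level_dist[lvl])
						span |= 1u << j;
				printf("  node%d=%#x", i, span);
			}
			printf("\n");
		}
		return 0;
	}

For the ring this prints the spans from the comment: at NUMA-1 node0 covers
{0,1,3}, node1 {0,1,2}, node2 {1,2,3}, node3 {0,2,3}, and at NUMA-2 every node
covers all of 0-3.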

* Re: [PATCH 00/14] sched/topology fixes
  2017-04-28 23:30   ` Lauro Venancio
@ 2017-05-01  8:57     ` Peter Zijlstra
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2017-05-01  8:57 UTC (permalink / raw)
  To: Lauro Venancio; +Cc: mingo, lwang, riel, efault, tglx, linux-kernel

On Fri, Apr 28, 2017 at 08:30:05PM -0300, Lauro Venancio wrote:
> On 04/28/2017 10:53 AM, Peter Zijlstra wrote:
> > Also, the following occurred to me:
> >
> >   sg_span & sg_mask == sg_mask
> >
> > Therefore, we don't need to do the whole "sg_span &" business.
> >
> > Hmm?

> Agreed. Maybe we have to rename "mask" to something else. Maybe
> "group_reached_by_cpus" or "group_installed_on_cpus".

I went with group_balance_mask() to match the existing
group_balance_cpu().

And now the temptation is very great to also fix up the
sched_group_cpus() vs sched_domain_span() thing...

---
Subject: sched/topology: Rename sched_group_mask()
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon May 1 10:47:02 CEST 2017

Since sched_group_mask() is now an independent cpumask (it no longer
masks sched_group_cpus()), rename the thing.

Suggested-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |    4 +---
 kernel/sched/sched.h    |    7 +++----
 kernel/sched/topology.c |   31 ++++++++++++++-----------------
 3 files changed, 18 insertions(+), 24 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7996,7 +7996,6 @@ static int active_load_balance_cpu_stop(
 static int should_we_balance(struct lb_env *env)
 {
 	struct sched_group *sg = env->sd->groups;
-	struct cpumask *sg_mask;
 	int cpu, balance_cpu = -1;
 
 	/*
@@ -8006,9 +8005,8 @@ static int should_we_balance(struct lb_e
 	if (env->idle == CPU_NEWLY_IDLE)
 		return 1;
 
-	sg_mask = sched_group_mask(sg);
 	/* Try to find first idle cpu */
-	for_each_cpu_and(cpu, sg_mask, env->cpus) {
+	for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
 		if (!idle_cpu(cpu))
 			continue;
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1027,7 +1027,7 @@ struct sched_group_capacity {
 	int id;
 #endif
 
-	unsigned long cpumask[0]; /* iteration mask */
+	unsigned long cpumask[0]; /* balance mask */
 };
 
 struct sched_group {
@@ -1054,10 +1054,9 @@ static inline struct cpumask *sched_grou
 }
 
 /*
- * cpumask masking which cpus in the group are allowed to iterate up the domain
- * tree.
+ * See build_balance_mask().
  */
-static inline struct cpumask *sched_group_mask(struct sched_group *sg)
+static inline struct cpumask *group_balance_mask(struct sched_group *sg)
 {
 	return to_cpumask(sg->sgc->cpumask);
 }
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -86,9 +86,9 @@ static int sched_domain_debug_one(struct
 				cpumask_pr_args(sched_group_cpus(group)));
 
 		if ((sd->flags & SD_OVERLAP) &&
-		    !cpumask_equal(sched_group_mask(group), sched_group_cpus(group))) {
+		    !cpumask_equal(group_balance_mask(group), sched_group_cpus(group))) {
 			printk(KERN_CONT " mask=%*pbl",
-				cpumask_pr_args(sched_group_mask(group)));
+				cpumask_pr_args(group_balance_mask(group)));
 		}
 
 		if (group->sgc->capacity != SCHED_CAPACITY_SCALE)
@@ -497,16 +497,16 @@ enum s_alloc {
 
 /*
  * Return the canonical balance CPU for this group, this is the first CPU
- * of this group that's also in the iteration mask.
+ * of this group that's also in the balance mask.
  *
- * The iteration mask are all those CPUs that could actually end up at this
- * group. See build_group_mask().
+ * The balance mask are all those CPUs that could actually end up at this
+ * group. See build_balance_mask().
  *
  * Also see should_we_balance().
  */
 int group_balance_cpu(struct sched_group *sg)
 {
-	return cpumask_first(sched_group_mask(sg));
+	return cpumask_first(group_balance_mask(sg));
 }
 
 
@@ -563,7 +563,7 @@ int group_balance_cpu(struct sched_group
  * groups include the CPUs of Node-0, while those CPUs would not in fact ever
  * end up at those groups (they would end up in group: 0-1,3).
  *
- * To correct this we have to introduce the group iteration mask. This mask
+ * To correct this we have to introduce the group balance mask. This mask
  * will contain those CPUs in the group that can reach this group given the
  * (child) domain tree.
  *
@@ -607,11 +607,8 @@ int group_balance_cpu(struct sched_group
 
 
 /*
- * Build an iteration mask that can exclude certain CPUs from the upwards
- * domain traversal.
- *
- * Only CPUs that can arrive at this group should be considered to continue
- * balancing.
+ * Build the balance mask; it contains only those CPUs that can arrive at this
+ * group and should be considered to continue balancing.
  *
  * We do this during the group creation pass, therefore the group information
  * isn't complete yet, however since each group represents a (child) domain we
@@ -619,7 +616,7 @@ int group_balance_cpu(struct sched_group
  * complete).
  */
 static void
-build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
+build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
 {
 	const struct cpumask *sg_span = sched_group_cpus(sg);
 	struct sd_data *sdd = sd->private;
@@ -684,14 +681,14 @@ static void init_overlap_sched_group(str
 	struct cpumask *sg_span;
 	int cpu;
 
-	build_group_mask(sd, sg, mask);
+	build_balance_mask(sd, sg, mask);
 	cpu = cpumask_first_and(sched_group_cpus(sg), mask);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
-		cpumask_copy(sched_group_mask(sg), mask);
+		cpumask_copy(group_balance_mask(sg), mask);
 	else
-		WARN_ON_ONCE(!cpumask_equal(sched_group_mask(sg), mask));
+		WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));
 
 	/*
 	 * Initialize sgc->capacity such that even if we mess up the
@@ -888,7 +885,7 @@ build_sched_groups(struct sched_domain *
 			continue;
 
 		group = get_group(i, sdd, &sg);
-		cpumask_copy(sched_group_mask(sg), sched_group_cpus(sg));
+		cpumask_copy(group_balance_mask(sg), sched_group_cpus(sg));
 
 		for_each_cpu(j, span) {
 			if (get_group(j, sdd, NULL) != group)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/14] sched,cpumask: Export for_each_cpu_wrap()
  2017-04-28 13:20 ` [PATCH 02/14] sched,cpumask: Export for_each_cpu_wrap() Peter Zijlstra
@ 2017-05-01 19:56   ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2017-05-01 19:56 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, lvenanci; +Cc: lwang, efault, tglx, linux-kernel

On Fri, 2017-04-28 at 15:20 +0200, Peter Zijlstra wrote:
> More users for for_each_cpu_wrap() have appeared. Promote the construct
> to the generic cpumask interface.
> 
> The implementation is slightly modified to reduce arguments.
> 
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Lauro Ramos Venancio <lvenanci@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/20170414122005.o35me2h5nowqkxbv@hirez.programming.kicks-ass.net
> 

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread
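
For readers who haven't met the iterator: a minimal sketch, not from the
thread, of how the exported for_each_cpu_wrap() is used -- the same
(cpu, mask, start) form that the build_sched_groups() rework later in this
thread adopts. 'mask' and 'this_cpu' are placeholder variables and kernel
context is assumed.

	int cpu;

	/*
	 * Visit every CPU in 'mask' exactly once, but start the walk at
	 * 'this_cpu' and wrap around, so CPUs close to the caller are
	 * considered first.
	 */
	for_each_cpu_wrap(cpu, mask, this_cpu) {
		if (idle_cpu(cpu))
			break;	/* first idle CPU, biased towards this_cpu */
	}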

* Re: [PATCH 03/14] sched/topology: Fix building of overlapping sched-groups
  2017-04-28 13:20 ` [PATCH 03/14] sched/topology: Fix building of overlapping sched-groups Peter Zijlstra
@ 2017-05-01 20:08   ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2017-05-01 20:08 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, lvenanci; +Cc: lwang, efault, tglx, linux-kernel, stable

On Fri, 2017-04-28 at 15:20 +0200, Peter Zijlstra wrote:
> When building the overlapping groups, we very obviously should start
> with the previous domain of _this_ @cpu, not CPU-0.

> With patch this looks like:
> 
>   [] CPU1 attaching sched-domain:
>   []  domain 0: span 0-2 level NUMA
>   []   groups: 1 0 2
>   []   domain 1: span 0-3 level NUMA
>   []    groups: 0-2 (cpu_capacity = 3072) 0,2-3 (cpu_capacity = 3072)
> 
> Cc: stable@vger.kernel.org
> Fixes: e3589f6c81e4 ("sched: Allow for overlapping sched_domain
> spans")
> Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread
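
The fixed hunk isn't quoted in this ack; presumably "start with the previous
domain of _this_ @cpu" amounts to walking the span starting at @cpu and
wrapping, roughly along the lines of the sketch below (an assumption, not the
actual patch).

	/* Walk the domain span starting at @cpu instead of at its first CPU. */
	for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
		/* build the overlap group for the sibling domain covering i */
	}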

* Re: [PATCH 04/14] sched/topology: Simplify build_overlap_sched_groups()
  2017-04-28 13:20 ` [PATCH 04/14] sched/topology: Simplify build_overlap_sched_groups() Peter Zijlstra
@ 2017-05-01 21:04   ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2017-05-01 21:04 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, lvenanci; +Cc: lwang, efault, tglx, linux-kernel

On Fri, 2017-04-28 at 15:20 +0200, Peter Zijlstra wrote:
> Now that the first group will always be the previous domain of this
> @cpu this can be simplified.
> 
> In fact, writing the code now removed should've been a big clue I was
> doing it wrong :/
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 05/14] sched/topology,debug: Print the group mask
  2017-04-28 13:20 ` [PATCH 05/14] sched/topology,debug: Print the group mask Peter Zijlstra
@ 2017-05-01 21:13   ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2017-05-01 21:13 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, lvenanci; +Cc: lwang, efault, tglx, linux-kernel

On Fri, 2017-04-28 at 15:20 +0200, Peter Zijlstra wrote:
> In order to determine the balance_cpu (for should_we_balance()) we need
> the sched_group_mask() for overlapping domains.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 06/14] sched/topology,debug: Verify the first group matches the child domain
  2017-04-28 13:20 ` [PATCH 06/14] sched/topology,debug: Verify the first group matches the child domain Peter Zijlstra
@ 2017-05-01 21:13   ` Rik van Riel
  2017-05-02 14:52     ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2017-05-01 21:13 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, lvenanci; +Cc: lwang, efault, tglx, linux-kernel

On Fri, 2017-04-28 at 15:20 +0200, Peter Zijlstra wrote:
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

This could use a changelog.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 07/14] sched/topology: Optimize build_group_mask()
  2017-04-28 13:20 ` [PATCH 07/14] sched/topology: Optimize build_group_mask() Peter Zijlstra
@ 2017-05-02  0:57   ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2017-05-02  0:57 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, lvenanci; +Cc: lwang, efault, tglx, linux-kernel

On Fri, 2017-04-28 at 15:20 +0200, Peter Zijlstra wrote:
> The group mask is always used in intersection with the group cpus. So,
> when building the group mask, we don't have to care about cpus that are
> not part of the group.
> 
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: riel@redhat.com
> Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
> 

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread
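
The hunk itself isn't quoted in this ack; what the changelog describes is that
only CPUs inside the group span can ever appear in the mask, so the scan that
builds it can be restricted to sg_span. A sketch of that shape (an assumption,
not the actual patch):

	const struct cpumask *sg_span = sched_group_cpus(sg);

	/* Only group members can possibly reach this group; skip the rest. */
	for_each_cpu(i, sg_span) {
		/* test whether CPU i's child domain can reach this group */
	}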

* Re: [PATCH 08/14] sched/topology: Move comment about asymmetric node setups
  2017-04-28 13:20 ` [PATCH 08/14] sched/topology: Move comment about asymmetric node setups Peter Zijlstra
@ 2017-05-02  1:08   ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2017-05-02  1:08 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, lvenanci; +Cc: lwang, efault, tglx, linux-kernel

On Fri, 2017-04-28 at 15:20 +0200, Peter Zijlstra wrote:
> Cc: lwang@redhat.com
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: riel@redhat.com
> Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1492717903-5195-4-git-send-email-lvenanci@redhat.com

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 09/14] sched/topology: Remove FORCE_SD_OVERLAP
  2017-04-28 13:20 ` [PATCH 09/14] sched/topology: Remove FORCE_SD_OVERLAP Peter Zijlstra
@ 2017-05-02  1:09   ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2017-05-02  1:09 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, lvenanci; +Cc: lwang, efault, tglx, linux-kernel

On Fri, 2017-04-28 at 15:20 +0200, Peter Zijlstra wrote:
> It's an obsolete debug mechanism and future code wants to rely on
> properties this undermines.
> 
> Namely, it would be good to assume that SD_OVERLAP domains have
> children, but if we build the entire hierarchy with SD_OVERLAP this is
> obviously false.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 00/14] sched/topology fixes
  2017-04-28 13:53 ` [PATCH 00/14] sched/topology fixes Peter Zijlstra
  2017-04-28 23:30   ` Lauro Venancio
@ 2017-05-02 14:43   ` Peter Zijlstra
  2017-05-02 14:54     ` Lauro Venancio
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-05-02 14:43 UTC (permalink / raw)
  To: mingo, lvenanci; +Cc: lwang, riel, efault, tglx, linux-kernel

On Fri, Apr 28, 2017 at 03:53:39PM +0200, Peter Zijlstra wrote:
> Also, the following occurred to me:
> 
>   sg_span & sg_mask == sg_mask
> 
> Therefore, we don't need to do the whole "sg_span &" business.
> 
> Hmm?

> @@ -856,7 +857,7 @@ build_sched_groups(struct sched_domain *
>  			continue;
>  
>  		group = get_group(i, sdd, &sg);
> -		cpumask_setall(sched_group_mask(sg));
> +		cpumask_copy(sched_group_mask(sg), sched_group_cpus(sg));
>  
>  		for_each_cpu(j, span) {
>  			if (get_group(j, sdd, NULL) != group)

OK, so this explodes mightily.

That code also hurt my brain bad, so I had to fix that a little.

The below seems to boot.

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7996,7 +7996,7 @@ static int active_load_balance_cpu_stop(
 static int should_we_balance(struct lb_env *env)
 {
 	struct sched_group *sg = env->sd->groups;
-	struct cpumask *sg_cpus, *sg_mask;
+	struct cpumask *sg_mask;
 	int cpu, balance_cpu = -1;
 
 	/*
@@ -8006,11 +8006,10 @@ static int should_we_balance(struct lb_e
 	if (env->idle == CPU_NEWLY_IDLE)
 		return 1;
 
-	sg_cpus = sched_group_cpus(sg);
 	sg_mask = sched_group_mask(sg);
 	/* Try to find first idle cpu */
-	for_each_cpu_and(cpu, sg_cpus, env->cpus) {
-		if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
+	for_each_cpu_and(cpu, sg_mask, env->cpus) {
+		if (!idle_cpu(cpu))
 			continue;
 
 		balance_cpu = cpu;
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -85,7 +85,8 @@ static int sched_domain_debug_one(struct
 				group->sgc->id,
 				cpumask_pr_args(sched_group_cpus(group)));
 
-		if ((sd->flags & SD_OVERLAP) && !cpumask_full(sched_group_mask(group))) {
+		if ((sd->flags & SD_OVERLAP) &&
+		    !cpumask_equal(sched_group_mask(group), sched_group_cpus(group))) {
 			printk(KERN_CONT " mask=%*pbl",
 				cpumask_pr_args(sched_group_mask(group)));
 		}
@@ -505,7 +506,7 @@ enum s_alloc {
  */
 int group_balance_cpu(struct sched_group *sg)
 {
-	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
+	return cpumask_first(sched_group_mask(sg));
 }
 
 
@@ -833,23 +834,34 @@ build_overlap_sched_groups(struct sched_
  * [*] in other words, the first group of each domain is its child domain.
  */
 
-static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
+static struct sched_group *get_group(int cpu, struct sd_data *sdd)
 {
 	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
 	struct sched_domain *child = sd->child;
+	struct sched_group *sg;
 
 	if (child)
 		cpu = cpumask_first(sched_domain_span(child));
 
-	if (sg) {
-		*sg = *per_cpu_ptr(sdd->sg, cpu);
-		(*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu);
+	sg = *per_cpu_ptr(sdd->sg, cpu);
+	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 
-		/* For claim_allocations: */
-		atomic_set(&(*sg)->sgc->ref, 1);
+	/* For claim_allocations: */
+	atomic_inc(&sg->ref);
+	atomic_inc(&sg->sgc->ref);
+
+	if (child) {
+		cpumask_copy(sched_group_cpus(sg), sched_domain_span(child));
+		cpumask_copy(sched_group_mask(sg), sched_group_cpus(sg));
+	} else {
+		cpumask_set_cpu(cpu, sched_group_cpus(sg));
+		cpumask_set_cpu(cpu, sched_group_cpus(sg));
 	}
 
-	return cpu;
+	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_cpus(sg));
+	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+
+	return sg;
 }
 
 /*
@@ -868,34 +880,20 @@ build_sched_groups(struct sched_domain *
 	struct cpumask *covered;
 	int i;
 
-	get_group(cpu, sdd, &sd->groups);
-	atomic_inc(&sd->groups->ref);
-
-	if (cpu != cpumask_first(span))
-		return 0;
-
 	lockdep_assert_held(&sched_domains_mutex);
 	covered = sched_domains_tmpmask;
 
 	cpumask_clear(covered);
 
-	for_each_cpu(i, span) {
+	for_each_cpu_wrap(i, span, cpu) {
 		struct sched_group *sg;
-		int group, j;
 
 		if (cpumask_test_cpu(i, covered))
 			continue;
 
-		group = get_group(i, sdd, &sg);
-		cpumask_setall(sched_group_mask(sg));
+		sg = get_group(i, sdd);
 
-		for_each_cpu(j, span) {
-			if (get_group(j, sdd, NULL) != group)
-				continue;
-
-			cpumask_set_cpu(j, covered);
-			cpumask_set_cpu(j, sched_group_cpus(sg));
-		}
+		cpumask_or(covered, covered, sched_group_cpus(sg));
 
 		if (!first)
 			first = sg;
@@ -904,6 +902,7 @@ build_sched_groups(struct sched_domain *
 		last = sg;
 	}
 	last->next = first;
+	sd->groups = first;
 
 	return 0;
 }

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 06/14] sched/topology,debug: Verify the first group matches the child domain
  2017-05-01 21:13   ` Rik van Riel
@ 2017-05-02 14:52     ` Peter Zijlstra
  2017-05-03 15:00       ` Rik van Riel
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-05-02 14:52 UTC (permalink / raw)
  To: Rik van Riel; +Cc: mingo, lvenanci, lwang, efault, tglx, linux-kernel

On Mon, May 01, 2017 at 05:13:26PM -0400, Rik van Riel wrote:
> On Fri, 2017-04-28 at 15:20 +0200, Peter Zijlstra wrote:
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> This could use a changelog.

Yes indeed... I put off writing one because $hard, and clearly I forgot
entirely :-/

How's this?

---
Subject: sched/topology,debug: Verify the first group matches the child domain
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Apr 14 18:20:48 CEST 2017

We want sched_groups to be sibling child domains (or individual CPUs
when there are no child domains). Furthermore, since the first group
of a domain should include the CPU of that domain, the first group of
each domain should match the child domain.

Verify this is indeed so.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/topology.c |    6 ++++++
 1 file changed, 6 insertions(+)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -93,6 +93,12 @@ static int sched_domain_debug_one(struct
 				group->sgc->capacity);
 		}
 
+		if (group == sd->groups && sd->child &&
+		    !cpumask_equal(sched_domain_span(sd->child),
+				   sched_group_cpus(group))) {
+			printk(KERN_ERR "ERROR: domain->groups does not match domain->child\n");
+		}
+
 		group = group->next;
 
 		if (group != sd->groups)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 00/14] sched/topology fixes
  2017-05-02 14:43   ` Peter Zijlstra
@ 2017-05-02 14:54     ` Lauro Venancio
  0 siblings, 0 replies; 31+ messages in thread
From: Lauro Venancio @ 2017-05-02 14:54 UTC (permalink / raw)
  To: Peter Zijlstra, mingo; +Cc: lwang, riel, efault, tglx, linux-kernel

On 05/02/2017 11:43 AM, Peter Zijlstra wrote:
> On Fri, Apr 28, 2017 at 03:53:39PM +0200, Peter Zijlstra wrote:
>> Also, the following occurred to me:
>>
>>   sg_span & sg_mask == sg_mask
>>
>> Therefore, we don't need to do the whole "sg_span &" business.
>>
>> Hmm?
>> @@ -856,7 +857,7 @@ build_sched_groups(struct sched_domain *
>>  			continue;
>>  
>>  		group = get_group(i, sdd, &sg);
>> -		cpumask_setall(sched_group_mask(sg));
>> +		cpumask_copy(sched_group_mask(sg), sched_group_cpus(sg));
>>  
>>  		for_each_cpu(j, span) {
>>  			if (get_group(j, sdd, NULL) != group)
> OK, so this explodes mightily.
>
> That code also hurt my brain bad, so I had to fix that a little.
>
> The below seems to boot.
>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7996,7 +7996,7 @@ static int active_load_balance_cpu_stop(
>  static int should_we_balance(struct lb_env *env)
>  {
>  	struct sched_group *sg = env->sd->groups;
> -	struct cpumask *sg_cpus, *sg_mask;
> +	struct cpumask *sg_mask;
>  	int cpu, balance_cpu = -1;
>  
>  	/*
> @@ -8006,11 +8006,10 @@ static int should_we_balance(struct lb_e
>  	if (env->idle == CPU_NEWLY_IDLE)
>  		return 1;
>  
> -	sg_cpus = sched_group_cpus(sg);
>  	sg_mask = sched_group_mask(sg);
>  	/* Try to find first idle cpu */
> -	for_each_cpu_and(cpu, sg_cpus, env->cpus) {
> -		if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
> +	for_each_cpu_and(cpu, sg_mask, env->cpus) {
> +		if (!idle_cpu(cpu))
>  			continue;
>  
>  		balance_cpu = cpu;
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -85,7 +85,8 @@ static int sched_domain_debug_one(struct
>  				group->sgc->id,
>  				cpumask_pr_args(sched_group_cpus(group)));
>  
> -		if ((sd->flags & SD_OVERLAP) && !cpumask_full(sched_group_mask(group))) {
> +		if ((sd->flags & SD_OVERLAP) &&
> +		    !cpumask_equal(sched_group_mask(group), sched_group_cpus(group))) {
>  			printk(KERN_CONT " mask=%*pbl",
>  				cpumask_pr_args(sched_group_mask(group)));
>  		}
> @@ -505,7 +506,7 @@ enum s_alloc {
>   */
>  int group_balance_cpu(struct sched_group *sg)
>  {
> -	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
> +	return cpumask_first(sched_group_mask(sg));
>  }
>  
>  
> @@ -833,23 +834,34 @@ build_overlap_sched_groups(struct sched_
>   * [*] in other words, the first group of each domain is its child domain.
>   */
>  
> -static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
> +static struct sched_group *get_group(int cpu, struct sd_data *sdd)
>  {
>  	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
>  	struct sched_domain *child = sd->child;
> +	struct sched_group *sg;
>  
>  	if (child)
>  		cpu = cpumask_first(sched_domain_span(child));
>  
> -	if (sg) {
> -		*sg = *per_cpu_ptr(sdd->sg, cpu);
> -		(*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu);
> +	sg = *per_cpu_ptr(sdd->sg, cpu);
> +	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
>  
> -		/* For claim_allocations: */
> -		atomic_set(&(*sg)->sgc->ref, 1);
> +	/* For claim_allocations: */
> +	atomic_inc(&sg->ref);
> +	atomic_inc(&sg->sgc->ref);
> +
> +	if (child) {
> +		cpumask_copy(sched_group_cpus(sg), sched_domain_span(child));
> +		cpumask_copy(sched_group_mask(sg), sched_group_cpus(sg));
> +	} else {
> +		cpumask_set_cpu(cpu, sched_group_cpus(sg));
> +		cpumask_set_cpu(cpu, sched_group_cpus(sg));
Typo here. The mask is not being set in the else clause.

>  	}
>  
> -	return cpu;
> +	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_cpus(sg));
> +	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
> +
> +	return sg;
>  }
>  
>  /*
> @@ -868,34 +880,20 @@ build_sched_groups(struct sched_domain *
>  	struct cpumask *covered;
>  	int i;
>  
> -	get_group(cpu, sdd, &sd->groups);
> -	atomic_inc(&sd->groups->ref);
> -
> -	if (cpu != cpumask_first(span))
> -		return 0;
> -
>  	lockdep_assert_held(&sched_domains_mutex);
>  	covered = sched_domains_tmpmask;
>  
>  	cpumask_clear(covered);
>  
> -	for_each_cpu(i, span) {
> +	for_each_cpu_wrap(i, span, cpu) {
>  		struct sched_group *sg;
> -		int group, j;
>  
>  		if (cpumask_test_cpu(i, covered))
>  			continue;
>  
> -		group = get_group(i, sdd, &sg);
> -		cpumask_setall(sched_group_mask(sg));
> +		sg = get_group(i, sdd);
>  
> -		for_each_cpu(j, span) {
> -			if (get_group(j, sdd, NULL) != group)
> -				continue;
> -
> -			cpumask_set_cpu(j, covered);
> -			cpumask_set_cpu(j, sched_group_cpus(sg));
> -		}
> +		cpumask_or(covered, covered, sched_group_cpus(sg));
>  
>  		if (!first)
>  			first = sg;
> @@ -904,6 +902,7 @@ build_sched_groups(struct sched_domain *
>  		last = sg;
>  	}
>  	last->next = first;
> +	sd->groups = first;
>  
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 31+ messages in thread
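
For reference, a sketch of what the else branch presumably intends (assuming,
as Lauro's remark suggests, that the second store was meant to hit the group
mask rather than setting the group cpus twice):

	} else {
		cpumask_set_cpu(cpu, sched_group_cpus(sg));
		cpumask_set_cpu(cpu, sched_group_mask(sg));	/* not sched_group_cpus() twice */
	}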

* Re: [PATCH 06/14] sched/topology,debug: Verify the first group matches the child domain
  2017-05-02 14:52     ` Peter Zijlstra
@ 2017-05-03 15:00       ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2017-05-03 15:00 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, lvenanci, lwang, efault, tglx, linux-kernel

On Tue, 2017-05-02 at 16:52 +0200, Peter Zijlstra wrote:
> On Mon, May 01, 2017 at 05:13:26PM -0400, Rik van Riel wrote:
> > On Fri, 2017-04-28 at 15:20 +0200, Peter Zijlstra wrote:
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > 
> > This could use a changelog.
> 
> Yes indeed... I put off writing one because $hard, and clearly I
> forgot
> entirely :-/
> 
> How's this?

Looks good to me.

> ---
> Subject: sched/topology,debug: Verify the first group matches the
> child domain
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Apr 14 18:20:48 CEST 2017
> 
> We want sched_groups to be sibling child domains (or individual CPUs
> when there are no child domains). Furthermore, since the first group
> of a domain should include the CPU of that domain, the first group of
> each domain should match the child domain.
> 
> Verify this is indeed so.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 31+ messages in thread
