* [RFC][PATCH 0/7] sched: select_idle_siblings rewrite
@ 2016-05-09 10:48 Peter Zijlstra
  2016-05-09 10:48 ` [RFC][PATCH 1/7] sched: Remove unused @cpu argument from destroy_sched_domain*() Peter Zijlstra
                   ` (10 more replies)
  0 siblings, 11 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-09 10:48 UTC (permalink / raw)
  To: mingo, linux-kernel; +Cc: clm, matt, mgalbraith, tglx, fweisbec, peterz

Hai,

here be a semi coherent patch series for the recent select_idle_siblings()
tinkering. Happy benchmarking..

---
 include/linux/sched.h    |  10 ++
 kernel/sched/core.c      |  94 +++++++++++----
 kernel/sched/fair.c      | 298 ++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/idle_task.c |   2 +-
 kernel/sched/sched.h     |  23 +++-
 kernel/time/tick-sched.c |  10 +-
 6 files changed, 348 insertions(+), 89 deletions(-)


* [RFC][PATCH 1/7] sched: Remove unused @cpu argument from destroy_sched_domain*()
  2016-05-09 10:48 [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Peter Zijlstra
@ 2016-05-09 10:48 ` Peter Zijlstra
  2016-05-09 10:48 ` [RFC][PATCH 2/7] sched: Restructure destroy_sched_domain() Peter Zijlstra
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-09 10:48 UTC (permalink / raw)
  To: mingo, linux-kernel; +Cc: clm, matt, mgalbraith, tglx, fweisbec, peterz

[-- Attachment #1: peterz-sched-kill-sd_busy.patch --]
[-- Type: text/plain, Size: 1502 bytes --]

Small cleanup; nothing uses the @cpu argument so make it go away.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5841,15 +5841,15 @@ static void free_sched_domain(struct rcu
 	kfree(sd);
 }
 
-static void destroy_sched_domain(struct sched_domain *sd, int cpu)
+static void destroy_sched_domain(struct sched_domain *sd)
 {
 	call_rcu(&sd->rcu, free_sched_domain);
 }
 
-static void destroy_sched_domains(struct sched_domain *sd, int cpu)
+static void destroy_sched_domains(struct sched_domain *sd)
 {
 	for (; sd; sd = sd->parent)
-		destroy_sched_domain(sd, cpu);
+		destroy_sched_domain(sd);
 }
 
 /*
@@ -5921,7 +5921,7 @@ cpu_attach_domain(struct sched_domain *s
 			 */
 			if (parent->flags & SD_PREFER_SIBLING)
 				tmp->flags |= SD_PREFER_SIBLING;
-			destroy_sched_domain(parent, cpu);
+			destroy_sched_domain(parent);
 		} else
 			tmp = tmp->parent;
 	}
@@ -5929,7 +5929,7 @@ cpu_attach_domain(struct sched_domain *s
 	if (sd && sd_degenerate(sd)) {
 		tmp = sd;
 		sd = sd->parent;
-		destroy_sched_domain(tmp, cpu);
+		destroy_sched_domain(tmp);
 		if (sd)
 			sd->child = NULL;
 	}
@@ -5939,7 +5939,7 @@ cpu_attach_domain(struct sched_domain *s
 	rq_attach_root(rq, rd);
 	tmp = rq->sd;
 	rcu_assign_pointer(rq->sd, sd);
-	destroy_sched_domains(tmp, cpu);
+	destroy_sched_domains(tmp);
 
 	update_top_cache_domain(cpu);
 }


* [RFC][PATCH 2/7] sched: Restructure destroy_sched_domain()
  2016-05-09 10:48 [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Peter Zijlstra
  2016-05-09 10:48 ` [RFC][PATCH 1/7] sched: Remove unused @cpu argument from destroy_sched_domain*() Peter Zijlstra
@ 2016-05-09 10:48 ` Peter Zijlstra
  2016-05-09 14:46   ` Peter Zijlstra
  2016-05-09 10:48 ` [RFC][PATCH 3/7] sched: Introduce struct sched_domain_shared Peter Zijlstra
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-09 10:48 UTC (permalink / raw)
  To: mingo, linux-kernel; +Cc: clm, matt, mgalbraith, tglx, fweisbec, peterz

[-- Attachment #1: peterz-sched-kill-sd_busy-1.patch --]
[-- Type: text/plain, Size: 1619 bytes --]

There is no point in doing a call_rcu() for each domain; only do a
callback for the root sched domain and clean up the entire set in one
go.

Also make the entire call chain be called destroy_sched_domain*() to
remove confusion with the free_sched_domains() call, which does an
entirely different thing.

Both cpu_attach_domain() callers of destroy_sched_domain() can live
without the call_rcu() because at those points the sched_domain hasn't
been published yet.
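
In other words (a sketch only; the real change is in the diff below),
the teardown goes from one RCU callback per level:

	for (; sd; sd = sd->parent)
		call_rcu(&sd->rcu, free_sched_domain);

to a single callback on the root, whose callback then walks the
sd->parent chain itself:

	call_rcu(&sd->rcu, destroy_sched_domains_rcu);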

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5824,10 +5824,8 @@ static void free_sched_groups(struct sch
 	} while (sg != first);
 }
 
-static void free_sched_domain(struct rcu_head *rcu)
+static void destroy_sched_domain(struct sched_domain *sd)
 {
-	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
-
 	/*
 	 * If its an overlapping domain it has private groups, iterate and
 	 * nuke them all.
@@ -5841,15 +5839,17 @@ static void free_sched_domain(struct rcu
 	kfree(sd);
 }
 
-static void destroy_sched_domain(struct sched_domain *sd)
+static void destroy_sched_domains_rcu(struct rcu_head *rcu)
 {
-	call_rcu(&sd->rcu, free_sched_domain);
+	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
+
+	for (; sd; sd = sd->parent)
+		destroy_sched_domain(sd);
 }
 
 static void destroy_sched_domains(struct sched_domain *sd)
 {
-	for (; sd; sd = sd->parent)
-		destroy_sched_domain(sd);
+	call_rcu(&sd->rcu, destroy_sched_domains_rcu);
 }
 
 /*


* [RFC][PATCH 3/7] sched: Introduce struct sched_domain_shared
  2016-05-09 10:48 [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Peter Zijlstra
  2016-05-09 10:48 ` [RFC][PATCH 1/7] sched: Remove unused @cpu argument from destroy_sched_domain*() Peter Zijlstra
  2016-05-09 10:48 ` [RFC][PATCH 2/7] sched: Restructure destroy_sched_domain() Peter Zijlstra
@ 2016-05-09 10:48 ` Peter Zijlstra
  2016-05-09 10:48 ` [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared Peter Zijlstra
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-09 10:48 UTC (permalink / raw)
  To: mingo, linux-kernel; +Cc: clm, matt, mgalbraith, tglx, fweisbec, peterz

[-- Attachment #1: peterz-sched-kill-sd_busy-2.patch --]
[-- Type: text/plain, Size: 5187 bytes --]

Since struct sched_domain is strictly per-CPU, introduce a structure
that is shared between all 'identical' sched_domains.

Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it
for shared cache state; if another use comes up later we can easily
relax this.

While sched_groups are normally shared between CPUs, they are not a
natural place for shared state at the domain level, since using them
would require the domain to have a parent, which is not a given.
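
For illustration only (assuming a 4-CPU LLC, e.g. the MC level on a
typical x86 box), every CPU's domain ends up pointing at the single
instance allocated for the first CPU of its span, with ref counting the
users:

	CPU0 sd(MC) --\
	CPU1 sd(MC) ---+--> sched_domain_shared { .ref = 4 }
	CPU2 sd(MC) ---/    (*per_cpu_ptr(sdd->sds, sd_id), sd_id == 0)
	CPU3 sd(MC) --/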

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    6 ++++++
 kernel/sched/core.c   |   40 ++++++++++++++++++++++++++++++++++------
 2 files changed, 40 insertions(+), 6 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1057,6 +1057,10 @@ extern int sched_domain_level_max;
 
 struct sched_group;
 
+struct sched_domain_shared {
+	atomic_t	ref;
+};
+
 struct sched_domain {
 	/* These fields must be setup */
 	struct sched_domain *parent;	/* top domain must be null terminated */
@@ -1125,6 +1129,7 @@ struct sched_domain {
 		void *private;		/* used during construction */
 		struct rcu_head rcu;	/* used during destruction */
 	};
+	struct sched_domain_shared *shared;
 
 	unsigned int span_weight;
 	/*
@@ -1158,6 +1163,7 @@ typedef int (*sched_domain_flags_f)(void
 
 struct sd_data {
 	struct sched_domain **__percpu sd;
+	struct sched_domain_shared **__percpu sds;
 	struct sched_group **__percpu sg;
 	struct sched_group_capacity **__percpu sgc;
 };
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5836,6 +5836,8 @@ static void free_sched_domain(struct sch
 		kfree(sd->groups->sgc);
 		kfree(sd->groups);
 	}
+	if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
+		kfree(sd->shared);
 	kfree(sd);
 }
 
@@ -6270,6 +6272,9 @@ static void claim_allocations(int cpu, s
 	WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
 	*per_cpu_ptr(sdd->sd, cpu) = NULL;
 
+	if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
+		*per_cpu_ptr(sdd->sds, cpu) = NULL;
+
 	if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
 		*per_cpu_ptr(sdd->sg, cpu) = NULL;
 
@@ -6305,10 +6310,12 @@ static int sched_domains_curr_level;
 	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
-sd_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl,
+	const struct cpumask *cpu_map, int cpu)
 {
-	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
-	int sd_weight, sd_flags = 0;
+	struct sd_data *sdd = &tl->data;
+	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
+	int sd_id, sd_weight, sd_flags = 0;
 
 #ifdef CONFIG_NUMA
 	/*
@@ -6362,6 +6369,9 @@ sd_init(struct sched_domain_topology_lev
 #endif
 	};
 
+	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
+	sd_id = cpumask_first(sched_domain_span(sd));
+
 	/*
 	 * Convert topological properties into behaviour.
 	 */
@@ -6376,6 +6386,9 @@ sd_init(struct sched_domain_topology_lev
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
 
+		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+		atomic_inc(&sd->shared->ref);
+
 #ifdef CONFIG_NUMA
 	} else if (sd->flags & SD_NUMA) {
 		sd->cache_nice_tries = 2;
@@ -6397,7 +6410,7 @@ sd_init(struct sched_domain_topology_lev
 		sd->idle_idx = 1;
 	}
 
-	sd->private = &tl->data;
+	sd->private = sdd;
 
 	return sd;
 }
@@ -6704,6 +6717,10 @@ static int __sdt_alloc(const struct cpum
 		if (!sdd->sd)
 			return -ENOMEM;
 
+		sdd->sds = alloc_percpu(struct sched_domain_shared *);
+		if (!sdd->sds)
+			return -ENOMEM;
+
 		sdd->sg = alloc_percpu(struct sched_group *);
 		if (!sdd->sg)
 			return -ENOMEM;
@@ -6714,6 +6731,7 @@ static int __sdt_alloc(const struct cpum
 
 		for_each_cpu(j, cpu_map) {
 			struct sched_domain *sd;
+			struct sched_domain_shared *sds;
 			struct sched_group *sg;
 			struct sched_group_capacity *sgc;
 
@@ -6724,6 +6742,13 @@ static int __sdt_alloc(const struct cpum
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
+			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+					GFP_KERNEL, cpu_to_node(j));
+			if (!sds)
+				return -ENOMEM;
+
+			*per_cpu_ptr(sdd->sds, j) = sds;
+
 			sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sg)
@@ -6763,6 +6788,8 @@ static void __sdt_free(const struct cpum
 				kfree(*per_cpu_ptr(sdd->sd, j));
 			}
 
+			if (sdd->sds)
+				kfree(*per_cpu_ptr(sdd->sds, j));
 			if (sdd->sg)
 				kfree(*per_cpu_ptr(sdd->sg, j));
 			if (sdd->sgc)
@@ -6770,6 +6797,8 @@ static void __sdt_free(const struct cpum
 		}
 		free_percpu(sdd->sd);
 		sdd->sd = NULL;
+		free_percpu(sdd->sds);
+		sdd->sds = NULL;
 		free_percpu(sdd->sg);
 		sdd->sg = NULL;
 		free_percpu(sdd->sgc);
@@ -6781,11 +6810,10 @@ struct sched_domain *build_sched_domain(
 		const struct cpumask *cpu_map, struct sched_domain_attr *attr,
 		struct sched_domain *child, int cpu)
 {
-	struct sched_domain *sd = sd_init(tl, cpu);
+	struct sched_domain *sd = sd_init(tl, cpu_map, cpu);
 	if (!sd)
 		return child;
 
-	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
 	if (child) {
 		sd->level = child->level + 1;
 		sched_domain_level_max = max(sched_domain_level_max, sd->level);


* [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-09 10:48 [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Peter Zijlstra
                   ` (2 preceding siblings ...)
  2016-05-09 10:48 ` [RFC][PATCH 3/7] sched: Introduce struct sched_domain_shared Peter Zijlstra
@ 2016-05-09 10:48 ` Peter Zijlstra
  2016-05-11 11:55   ` Matt Fleming
  2016-05-16 15:31   ` Dietmar Eggemann
  2016-05-09 10:48 ` [RFC][PATCH 5/7] sched: Rewrite select_idle_siblings() Peter Zijlstra
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-09 10:48 UTC (permalink / raw)
  To: mingo, linux-kernel; +Cc: clm, matt, mgalbraith, tglx, fweisbec, peterz

[-- Attachment #1: peterz-sched-kill-sd_busy-3.patch --]
[-- Type: text/plain, Size: 5444 bytes --]

Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
location into the much more natural sched_domain_shared location.
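
In other words (a sketch only; see the diff below), the accounting site
changes from:

	sd = rcu_dereference(per_cpu(sd_busy, cpu));	/* == sd_llc->parent */
	atomic_inc(&sd->groups->sgc->nr_busy_cpus);

to:

	sd = rcu_dereference(per_cpu(sd_llc, cpu));
	atomic_inc(&sd->shared->nr_busy_cpus);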

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h    |    1 +
 kernel/sched/core.c      |   10 +++++-----
 kernel/sched/fair.c      |   22 ++++++++++++----------
 kernel/sched/sched.h     |    6 +-----
 kernel/time/tick-sched.c |   10 +++++-----
 5 files changed, 24 insertions(+), 25 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1059,6 +1059,7 @@ struct sched_group;
 
 struct sched_domain_shared {
 	atomic_t	ref;
+	atomic_t	nr_busy_cpus;
 };
 
 struct sched_domain {
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5866,14 +5866,14 @@ static void destroy_sched_domains(struct
 DEFINE_PER_CPU(struct sched_domain *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
-DEFINE_PER_CPU(struct sched_domain *, sd_busy);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
 
 static void update_top_cache_domain(int cpu)
 {
+	struct sched_domain_shared *sds = NULL;
 	struct sched_domain *sd;
-	struct sched_domain *busy_sd = NULL;
 	int id = cpu;
 	int size = 1;
 
@@ -5881,13 +5881,13 @@ static void update_top_cache_domain(int
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
-		busy_sd = sd->parent; /* sd_busy */
+		sds = sd->shared;
 	}
-	rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
+	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_NUMA);
 	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
@@ -6184,7 +6184,6 @@ static void init_sched_groups_capacity(i
 		return;
 
 	update_group_capacity(sd, cpu);
-	atomic_set(&sg->sgc->nr_busy_cpus, sg->group_weight);
 }
 
 /*
@@ -6388,6 +6387,7 @@ sd_init(struct sched_domain_topology_lev
 
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
+		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
 
 #ifdef CONFIG_NUMA
 	} else if (sd->flags & SD_NUMA) {
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7842,13 +7842,13 @@ static inline void set_cpu_sd_state_busy
 	int cpu = smp_processor_id();
 
 	rcu_read_lock();
-	sd = rcu_dereference(per_cpu(sd_busy, cpu));
+	sd = rcu_dereference(per_cpu(sd_llc, cpu));
 
 	if (!sd || !sd->nohz_idle)
 		goto unlock;
 	sd->nohz_idle = 0;
 
-	atomic_inc(&sd->groups->sgc->nr_busy_cpus);
+	atomic_inc(&sd->shared->nr_busy_cpus);
 unlock:
 	rcu_read_unlock();
 }
@@ -7859,13 +7859,13 @@ void set_cpu_sd_state_idle(void)
 	int cpu = smp_processor_id();
 
 	rcu_read_lock();
-	sd = rcu_dereference(per_cpu(sd_busy, cpu));
+	sd = rcu_dereference(per_cpu(sd_llc, cpu));
 
 	if (!sd || sd->nohz_idle)
 		goto unlock;
 	sd->nohz_idle = 1;
 
-	atomic_dec(&sd->groups->sgc->nr_busy_cpus);
+	atomic_dec(&sd->shared->nr_busy_cpus);
 unlock:
 	rcu_read_unlock();
 }
@@ -8092,8 +8092,8 @@ static void nohz_idle_balance(struct rq
 static inline bool nohz_kick_needed(struct rq *rq)
 {
 	unsigned long now = jiffies;
+	struct sched_domain_shared *sds;
 	struct sched_domain *sd;
-	struct sched_group_capacity *sgc;
 	int nr_busy, cpu = rq->cpu;
 	bool kick = false;
 
@@ -8121,11 +8121,13 @@ static inline bool nohz_kick_needed(stru
 		return true;
 
 	rcu_read_lock();
-	sd = rcu_dereference(per_cpu(sd_busy, cpu));
-	if (sd) {
-		sgc = sd->groups->sgc;
-		nr_busy = atomic_read(&sgc->nr_busy_cpus);
-
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		/*
+		 * XXX: write a coherent comment on why we do this.
+		 * See also: http:lkml.kernel.org/r/20111202010832.602203411@sbsiddha-desk.sc.intel.com
+		 */
+		nr_busy = atomic_read(&sds->nr_busy_cpus);
 		if (nr_busy > 1) {
 			kick = true;
 			goto unlock;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -856,8 +856,8 @@ static inline struct sched_domain *lowes
 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
-DECLARE_PER_CPU(struct sched_domain *, sd_busy);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
 
 struct sched_group_capacity {
@@ -869,10 +869,6 @@ struct sched_group_capacity {
 	unsigned int capacity;
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
-	/*
-	 * Number of busy cpus in this group.
-	 */
-	atomic_t nr_busy_cpus;
 
 	unsigned long cpumask[0]; /* iteration mask */
 };
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -933,11 +933,11 @@ void tick_nohz_idle_enter(void)
 	WARN_ON_ONCE(irqs_disabled());
 
 	/*
- 	 * Update the idle state in the scheduler domain hierarchy
- 	 * when tick_nohz_stop_sched_tick() is called from the idle loop.
- 	 * State will be updated to busy during the first busy tick after
- 	 * exiting idle.
- 	 */
+	 * Update the idle state in the scheduler domain hierarchy
+	 * when tick_nohz_stop_sched_tick() is called from the idle loop.
+	 * State will be updated to busy during the first busy tick after
+	 * exiting idle.
+	 */
 	set_cpu_sd_state_idle();
 
 	local_irq_disable();


* [RFC][PATCH 5/7] sched: Rewrite select_idle_siblings()
  2016-05-09 10:48 [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Peter Zijlstra
                   ` (3 preceding siblings ...)
  2016-05-09 10:48 ` [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared Peter Zijlstra
@ 2016-05-09 10:48 ` Peter Zijlstra
  2016-05-10 21:05   ` Yuyang Du
  2016-05-09 10:48 ` [RFC][PATCH 6/7] sched: Optimize SCHED_SMT Peter Zijlstra
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-09 10:48 UTC (permalink / raw)
  To: mingo, linux-kernel; +Cc: clm, matt, mgalbraith, tglx, fweisbec, peterz

[-- Attachment #1: peterz-select_idle_siblings-rewrite-1.patch --]
[-- Type: text/plain, Size: 11919 bytes --]

select_idle_siblings() is a known pain point for a number of
workloads; it either does too much or not enough, and sometimes it just
gets it plain wrong.

This rewrite attempts to address a number of issues (but sadly not
all).

The current code does an unconditional sched_domain iteration with the
intent of finding an idle core (on SMT hardware). The problems this
patch tries to address are:

 - it's pointless to look for idle cores if the machine is really busy;
   at that point you're just wasting cycles.

 - its behaviour is inconsistent between SMT and !SMT hardware in
   that !SMT hardware ends up doing a scan for any idle CPU in the LLC
   domain, while SMT hardware does a scan for idle cores and, if that
   fails, falls back to a scan for idle threads on the 'target' core.

The new code replaces the sched_domain scan with 3 explicit scans:

 1) search for an idle core in the LLC
 2) search for an idle CPU in the LLC
 3) search for an idle thread in the 'target' core

where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
heuristics to skip the step.

Step 1) is conditional on sd_llc_shared->has_idle_cores; when a CPU
goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
siblings of the CPU going idle. Similarly, we clear
sd_llc_shared->has_idle_cores when we fail to find an idle core.

Step 2) tracks the average cost of the scan and compares this to the
average idle time guesstimate for the CPU doing the wakeup. There is a
significant fudge factor involved to deal with the variability of the
averages; hackbench in particular was sensitive to this.
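
As a purely illustrative example of that cut-off (the code below does
"if ((avg_idle / 512) < avg_cost) return -1;"): with an average scan
cost of ~1us the LLC scan only runs once this rq's avg_idle exceeds
roughly 512us; at avg_idle = 400us we get 400000/512 ~= 781ns < 1000ns
and skip the scan, at 600us we get ~1172ns and scan.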

Step 3) is unconditional; we assume (also per step 1) that scanning
all SMT siblings in a core is 'cheap'.

With this, SMT systems gain step 2, which cures a few benchmarks --
notably one from Facebook.

One 'feature' of the sched_domain iteration, which we preserve in the
new code, is that it starts scanning from the 'target' CPU instead of
scanning the cpumask in cpu id order. This reduces the chance that
multiple CPUs in the LLC, all scanning for an idle CPU at the same
time, gang up and find the same one. The downside is that tasks can end
up hopping across the LLC for no apparent reason.
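
To illustrate the wrapped scan order (numbers purely as an example):
with an LLC spanning CPUs 0-7 and target == 5, for_each_cpu_wrap()
visits 5, 6, 7, 0, 1, 2, 3, 4, so concurrent wakeups with different
targets start their scans in different places.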

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h    |    3 
 kernel/sched/core.c      |    3 
 kernel/sched/fair.c      |  270 ++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/idle_task.c |    4 
 4 files changed, 233 insertions(+), 47 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1060,6 +1060,7 @@ struct sched_group;
 struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
+	int		has_idle_cores;
 };
 
 struct sched_domain {
@@ -1092,6 +1093,8 @@ struct sched_domain {
 	u64 max_newidle_lb_cost;
 	unsigned long next_decay_max_lb_cost;
 
+	u64 avg_scan_cost;		/* select_idle_sibling */
+
 #ifdef CONFIG_SCHEDSTATS
 	/* load_balance() stats */
 	unsigned int lb_count[CPU_MAX_IDLE_TYPES];
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7329,6 +7329,7 @@ static struct kmem_cache *task_group_cac
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
+DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
 
 void __init sched_init(void)
 {
@@ -7365,6 +7366,8 @@ void __init sched_init(void)
 	for_each_possible_cpu(i) {
 		per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
+		per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
+			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1501,8 +1501,15 @@ static void task_numa_compare(struct tas
 	 * One idle CPU per node is evaluated for a task numa move.
 	 * Call select_idle_sibling to maybe find a better one.
 	 */
-	if (!cur)
+	if (!cur) {
+		/*
+		 * select_idle_siblings() uses an per-cpu cpumask that
+		 * can be used from IRQ context.
+		 */
+		local_irq_disable();
 		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+		local_irq_enable();
+	}
 
 assign:
 	assigned = true;
@@ -4499,6 +4506,11 @@ static void dequeue_task_fair(struct rq
 }
 
 #ifdef CONFIG_SMP
+
+/* Working cpumask for: load_balance, load_balance_newidle. */
+DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
+DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * per rq 'load' arrray crap; XXX kill this.
@@ -5172,64 +5184,233 @@ find_idlest_cpu(struct sched_group *grou
 }
 
 /*
- * Try and locate an idle CPU in the sched_domain.
+ * Implement a for_each_cpu() variant that starts the scan at a given cpu
+ * (@start), and wraps around.
+ *
+ * This is used to scan for idle CPUs; such that not all CPUs looking for an
+ * idle CPU find the same CPU. The down-side is that tasks tend to cycle
+ * through the LLC domain.
+ *
+ * Especially tbench is found sensitive to this.
+ */
+
+static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
+{
+	int next;
+
+again:
+	next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
+
+	if (*wrapped) {
+		if (next >= start)
+			return nr_cpumask_bits;
+	} else {
+		if (next >= nr_cpumask_bits) {
+			*wrapped = 1;
+			n = -1;
+			goto again;
+		}
+	}
+
+	return next;
+}
+
+#define for_each_cpu_wrap(cpu, mask, start, wrap)				\
+	for ((wrap) = 0, (cpu) = (start)-1;					\
+		(cpu) = cpumask_next_wrap((cpu), (mask), (start), &(wrap)),	\
+		(cpu) < nr_cpumask_bits; )
+
+#ifdef CONFIG_SCHED_SMT
+
+static inline void set_idle_cores(int cpu, int val)
+{
+	struct sched_domain_shared *sds;
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds)
+		WRITE_ONCE(sds->has_idle_cores, val);
+}
+
+static inline bool test_idle_cores(int cpu, bool def)
+{
+	struct sched_domain_shared *sds;
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds)
+		return READ_ONCE(sds->has_idle_cores);
+
+	return def;
+}
+
+/*
+ * Scans the local SMT mask to see if the entire core is idle, and records this
+ * information in sd_llc_shared->has_idle_cores.
+ *
+ * Since SMT siblings share all cache levels, inspecting this limited remote
+ * state should be fairly cheap.
+ */
+void update_idle_core(struct rq *rq)
+{
+	int core = cpu_of(rq);
+	int cpu;
+
+	rcu_read_lock();
+	if (test_idle_cores(core, true))
+		goto unlock;
+
+	for_each_cpu(cpu, cpu_smt_mask(core)) {
+		if (cpu == core)
+			continue;
+
+		if (!idle_cpu(cpu))
+			goto unlock;
+	}
+
+	set_idle_cores(core, 1);
+unlock:
+	rcu_read_unlock();
+}
+
+/*
+ * Scan the entire LLC domain for idle cores; this dynamically switches off if there are
+ * no idle cores left in the system; tracked through sd_llc->shared->has_idle_cores
+ * and enabled through update_idle_core() above.
+ */
+static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	int core, cpu, wrap;
+
+	if (!test_idle_cores(target, false))
+		return -1;
+
+	cpumask_and(cpus, sched_domain_span(sd), tsk_cpus_allowed(p));
+
+	for_each_cpu_wrap(core, cpus, target, wrap) {
+		bool idle = true;
+
+		for_each_cpu(cpu, cpu_smt_mask(core)) {
+			cpumask_clear_cpu(cpu, cpus);
+			if (!idle_cpu(cpu))
+				idle = false;
+		}
+
+		if (idle)
+			return core;
+	}
+
+	/*
+	 * Failed to find an idle core; stop looking for one.
+	 */
+	set_idle_cores(target, 0);
+
+	return -1;
+}
+
+/*
+ * Scan the local SMT mask for idle CPUs.
+ */
+static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	int cpu;
+
+	for_each_cpu(cpu, cpu_smt_mask(target)) {
+		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+			continue;
+		if (idle_cpu(cpu))
+			return cpu;
+	}
+
+	return -1;
+}
+
+#else /* CONFIG_SCHED_SMT */
+
+void update_idle_core(struct rq *rq) { }
+
+static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	return -1;
+}
+
+static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	return -1;
+}
+
+#endif /* CONFIG_SCHED_SMT */
+
+/*
+ * Scan the LLC domain for idle CPUs; this is dynamically regulated by
+ * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
+ * average idle time for this rq (as found in rq->avg_idle).
+ */
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+	u64 avg_idle = this_rq()->avg_idle;
+	u64 avg_cost = this_sd->avg_scan_cost;
+	u64 time, cost;
+	s64 delta;
+	int cpu, wrap;
+
+	/*
+	 * Due to large variance we need a large fuzz factor; hackbench in
+	 * particularly is sensitive here.
+	 */
+	if ((avg_idle / 512) < avg_cost)
+		return -1;
+
+	time = local_clock();
+
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
+		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+			continue;
+		if (idle_cpu(cpu))
+			break;
+	}
+
+	time = local_clock() - time;
+	cost = this_sd->avg_scan_cost;
+	delta = (s64)(time - cost) / 8;
+	this_sd->avg_scan_cost += delta;
+
+	return cpu;
+}
+
+/*
+ * Try and locate an idle core/thread in the LLC cache domain.
  */
 static int select_idle_sibling(struct task_struct *p, int target)
 {
 	struct sched_domain *sd;
-	struct sched_group *sg;
 	int i = task_cpu(p);
 
 	if (idle_cpu(target))
 		return target;
 
 	/*
-	 * If the prevous cpu is cache affine and idle, don't be stupid.
+	 * If the previous cpu is cache affine and idle, don't be stupid.
 	 */
 	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
 		return i;
 
-	/*
-	 * Otherwise, iterate the domains and find an eligible idle cpu.
-	 *
-	 * A completely idle sched group at higher domains is more
-	 * desirable than an idle group at a lower level, because lower
-	 * domains have smaller groups and usually share hardware
-	 * resources which causes tasks to contend on them, e.g. x86
-	 * hyperthread siblings in the lowest domain (SMT) can contend
-	 * on the shared cpu pipeline.
-	 *
-	 * However, while we prefer idle groups at higher domains
-	 * finding an idle cpu at the lowest domain is still better than
-	 * returning 'target', which we've already established, isn't
-	 * idle.
-	 */
 	sd = rcu_dereference(per_cpu(sd_llc, target));
-	for_each_lower_domain(sd) {
-		sg = sd->groups;
-		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
-
-			/* Ensure the entire group is idle */
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (i == target || !idle_cpu(i))
-					goto next;
-			}
+	if (!sd)
+		return target;
+
+	i = select_idle_core(p, sd, target);
+	if ((unsigned)i < nr_cpumask_bits)
+		return i;
+
+	i = select_idle_cpu(p, sd, target);
+	if ((unsigned)i < nr_cpumask_bits)
+		return i;
+
+	i = select_idle_smt(p, sd, target);
+	if ((unsigned)i < nr_cpumask_bits)
+		return i;
 
-			/*
-			 * It doesn't matter which cpu we pick, the
-			 * whole group is idle.
-			 */
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
-	}
-done:
 	return target;
 }
 
@@ -7232,9 +7413,6 @@ static struct rq *find_busiest_queue(str
  */
 #define MAX_PINNED_INTERVAL	512
 
-/* Working cpumask for load_balance and load_balance_newidle. */
-DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
-
 static int need_active_balance(struct lb_env *env)
 {
 	struct sched_domain *sd = env->sd;
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -23,11 +23,13 @@ static void check_preempt_curr_idle(stru
 	resched_curr(rq);
 }
 
+extern void update_idle_core(struct rq *rq);
+
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
 {
 	put_prev_task(rq, prev);
-
+	update_idle_core(rq);
 	schedstat_inc(rq, sched_goidle);
 	return rq->idle;
 }


* [RFC][PATCH 6/7] sched: Optimize SCHED_SMT
  2016-05-09 10:48 [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Peter Zijlstra
                   ` (4 preceding siblings ...)
  2016-05-09 10:48 ` [RFC][PATCH 5/7] sched: Rewrite select_idle_siblings() Peter Zijlstra
@ 2016-05-09 10:48 ` Peter Zijlstra
  2016-05-09 10:48 ` [RFC][PATCH 7/7] sched: debug muck -- not for merging Peter Zijlstra
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-09 10:48 UTC (permalink / raw)
  To: mingo, linux-kernel; +Cc: clm, matt, mgalbraith, tglx, fweisbec, peterz

[-- Attachment #1: peterz-select_idle_siblings-rewrite-3.patch --]
[-- Type: text/plain, Size: 2797 bytes --]

Avoid pointless SCHED_SMT code when running on !SMT hardware.
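
A sketch of the intent (the mechanics are in the diff below):
sched_smt_present is a static key that gets enabled once we know some
CPU has SMT siblings, so on !SMT hardware update_idle_core() becomes a
no-op behind a patched-out branch and select_idle_core() /
select_idle_smt() bail out immediately.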

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      |   19 +++++++++++++++++++
 kernel/sched/fair.c      |    8 +++++++-
 kernel/sched/idle_task.c |    2 --
 kernel/sched/sched.h     |   17 +++++++++++++++++
 4 files changed, 43 insertions(+), 3 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7263,6 +7263,22 @@ int sched_cpu_dying(unsigned int cpu)
 }
 #endif
 
+#ifdef CONFIG_SCHED_SMT
+DEFINE_STATIC_KEY_FALSE(sched_smt_present);
+
+static void sched_init_smt(void)
+{
+	/*
+	 * We've enumerated all CPUs and will assume that if any CPU
+	 * has SMT siblings, CPU0 will too.
+	 */
+	if (cpumask_weight(cpu_smt_mask(0)) > 1)
+		static_branch_enable(&sched_smt_present);
+}
+#else
+static inline void sched_init_smt(void) { }
+#endif
+
 void __init sched_init_smp(void)
 {
 	cpumask_var_t non_isolated_cpus;
@@ -7292,6 +7308,9 @@ void __init sched_init_smp(void)
 
 	init_sched_rt_class();
 	init_sched_dl_class();
+
+	sched_init_smt();
+
 	sched_smp_initialized = true;
 }
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5249,7 +5249,7 @@ static inline bool test_idle_cores(int c
  * Since SMT siblings share all cache levels, inspecting this limited remote
  * state should be fairly cheap.
  */
-void update_idle_core(struct rq *rq)
+void __update_idle_core(struct rq *rq)
 {
 	int core = cpu_of(rq);
 	int cpu;
@@ -5281,6 +5281,9 @@ static int select_idle_core(struct task_
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int core, cpu, wrap;
 
+	if (!static_branch_likely(&sched_smt_present))
+		return -1;
+
 	if (!test_idle_cores(target, false))
 		return -1;
 
@@ -5314,6 +5317,9 @@ static int select_idle_smt(struct task_s
 {
 	int cpu;
 
+	if (!static_branch_likely(&sched_smt_present))
+		return -1;
+
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
 		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
 			continue;
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -23,8 +23,6 @@ static void check_preempt_curr_idle(stru
 	resched_curr(rq);
 }
 
-extern void update_idle_core(struct rq *rq);
-
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
 {
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1811,3 +1811,20 @@ static inline void account_reset_rq(stru
 	rq->prev_steal_time_rq = 0;
 #endif
 }
+
+
+#ifdef CONFIG_SCHED_SMT
+
+extern struct static_key_false sched_smt_present;
+
+extern void __update_idle_core(struct rq *rq);
+
+static inline void update_idle_core(struct rq *rq)
+{
+	if (static_branch_unlikely(&sched_smt_present))
+		__update_idle_core(rq);
+}
+
+#else
+static inline void update_idle_core(struct rq *rq) { }
+#endif


* [RFC][PATCH 7/7] sched: debug muck -- not for merging
  2016-05-09 10:48 [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Peter Zijlstra
                   ` (5 preceding siblings ...)
  2016-05-09 10:48 ` [RFC][PATCH 6/7] sched: Optimize SCHED_SMT Peter Zijlstra
@ 2016-05-09 10:48 ` Peter Zijlstra
  2016-05-10  0:50 ` [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Chris Mason
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-09 10:48 UTC (permalink / raw)
  To: mingo, linux-kernel; +Cc: clm, matt, mgalbraith, tglx, fweisbec

[-- Attachment #1: peterz-select_idle_siblings-rewrite-2.patch --]
[-- Type: text/plain, Size: 5110 bytes --]

Add a few knobs to poke while playing with the new code.
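
For anyone playing along (assuming CONFIG_SCHED_DEBUG so the knobs are
writable), the features can be flipped at runtime via e.g.
"echo NO_AVG_CPU > /sys/kernel/debug/sched_features" and the fudge
factor via "echo 10 > /proc/sys/kernel/sched_shift".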

Not-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched/sysctl.h |    1 
 kernel/sched/fair.c          |   86 ++++++++++++++++++++++++++++++++++---------
 kernel/sched/features.h      |   10 +++++
 kernel/sysctl.c              |    7 +++
 4 files changed, 86 insertions(+), 18 deletions(-)

--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -37,6 +37,7 @@ extern unsigned int sysctl_sched_migrati
 extern unsigned int sysctl_sched_nr_migrate;
 extern unsigned int sysctl_sched_time_avg;
 extern unsigned int sysctl_sched_shares_window;
+extern unsigned int sysctl_sched_shift;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -114,6 +114,8 @@ unsigned int __read_mostly sysctl_sched_
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
 #endif
 
+const_debug unsigned int sysctl_sched_shift = 9;
+
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
 	lw->weight += inc;
@@ -5354,18 +5356,24 @@ static inline int select_idle_smt(struct
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
-	u64 avg_idle = this_rq()->avg_idle;
-	u64 avg_cost = this_sd->avg_scan_cost;
 	u64 time, cost;
 	s64 delta;
 	int cpu, wrap;
 
-	/*
-	 * Due to large variance we need a large fuzz factor; hackbench in
-	 * particularly is sensitive here.
-	 */
-	if ((avg_idle / 512) < avg_cost)
-		return -1;
+	if (sched_feat(AVG_CPU)) {
+		u64 avg_idle = this_rq()->avg_idle;
+		u64 avg_cost = this_sd->avg_scan_cost;
+
+		if (sched_feat(PRINT_AVG))
+			trace_printk("idle: %Ld cost: %Ld\n", avg_idle, avg_cost);
+
+		/*
+		 * Due to large variance we need a large fuzz factor; hackbench in
+		 * particularly is sensitive here.
+		 */
+		if ((avg_idle >> sysctl_sched_shift) < avg_cost)
+			return -1;
+	}
 
 	time = local_clock();
 
@@ -5379,6 +5387,7 @@ static int select_idle_cpu(struct task_s
 	time = local_clock() - time;
 	cost = this_sd->avg_scan_cost;
 	delta = (s64)(time - cost) / 8;
+	/* trace_printk("time: %Ld cost: %Ld delta: %Ld\n", time, cost, delta); */
 	this_sd->avg_scan_cost += delta;
 
 	return cpu;
@@ -5390,7 +5399,7 @@ static int select_idle_cpu(struct task_s
 static int select_idle_sibling(struct task_struct *p, int target)
 {
 	struct sched_domain *sd;
-	int i = task_cpu(p);
+	int start, i = task_cpu(p);
 
 	if (idle_cpu(target))
 		return target;
@@ -5401,21 +5410,62 @@ static int select_idle_sibling(struct ta
 	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
 		return i;
 
+	start = target;
+	if (sched_feat(ORDER_IDLE))
+		start = per_cpu(sd_llc_id, target); /* first cpu in llc domain */
+
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	if (!sd)
 		return target;
 
-	i = select_idle_core(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
+	if (sched_feat(OLD_IDLE)) {
+		struct sched_group *sg;
 
-	i = select_idle_cpu(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
+		for_each_lower_domain(sd) {
+			sg = sd->groups;
+			do {
+				if (!cpumask_intersects(sched_group_cpus(sg),
+							tsk_cpus_allowed(p)))
+					goto next;
+
+				/* Ensure the entire group is idle */
+				for_each_cpu(i, sched_group_cpus(sg)) {
+					if (i == target || !idle_cpu(i))
+						goto next;
+				}
 
-	i = select_idle_smt(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
+				/*
+				 * It doesn't matter which cpu we pick, the
+				 * whole group is idle.
+				 */
+				target = cpumask_first_and(sched_group_cpus(sg),
+						tsk_cpus_allowed(p));
+				goto done;
+next:
+				sg = sg->next;
+			} while (sg != sd->groups);
+		}
+done:
+		return target;
+	}
+
+	if (sched_feat(IDLE_CORE)) {
+		i = select_idle_core(p, sd, start);
+		if ((unsigned)i < nr_cpumask_bits)
+			return i;
+	}
+
+	if (sched_feat(IDLE_CPU)) {
+		i = select_idle_cpu(p, sd, start);
+		if ((unsigned)i < nr_cpumask_bits)
+			return i;
+	}
+
+	if (sched_feat(IDLE_SMT)) {
+		i = select_idle_smt(p, sd, start);
+		if ((unsigned)i < nr_cpumask_bits)
+			return i;
+	}
 
 	return target;
 }
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -69,3 +69,13 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
+SCHED_FEAT(OLD_IDLE, false)
+SCHED_FEAT(ORDER_IDLE, false)
+
+SCHED_FEAT(IDLE_CORE, true)
+SCHED_FEAT(IDLE_CPU, true)
+SCHED_FEAT(AVG_CPU, true)
+SCHED_FEAT(PRINT_AVG, false)
+
+SCHED_FEAT(IDLE_SMT, true)
+
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -334,6 +334,13 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "sched_shift",
+		.data		= &sysctl_sched_shift,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "sched_nr_migrate",
 		.data		= &sysctl_sched_nr_migrate,
 		.maxlen		= sizeof(unsigned int),


* Re: [RFC][PATCH 2/7] sched: Restructure destroy_sched_domain()
  2016-05-09 10:48 ` [RFC][PATCH 2/7] sched: Restructure destroy_sched_domain() Peter Zijlstra
@ 2016-05-09 14:46   ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-09 14:46 UTC (permalink / raw)
  To: mingo, linux-kernel; +Cc: clm, matt, mgalbraith, tglx, fweisbec

On Mon, May 09, 2016 at 12:48:09PM +0200, Peter Zijlstra wrote:
> There is no point in doing a call_rcu() for each domain; only do a
> callback for the root sched domain and clean up the entire set in one
> go.
> 
> Also make the entire call chain be called destroy_sched_domain*() to
> remove confusion with the free_sched_domains() call, which does an
> entirely different thing.
> 
> Both cpu_attach_domain() callers of destroy_sched_domain() can live
> without the call_rcu() because at those points the sched_domain hasn't
> been published yet.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---

This one seems to work much better; so much for last minute 'cleanups'.

---
 kernel/sched/core.c |   18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5824,10 +5824,8 @@ static void free_sched_groups(struct sch
 	} while (sg != first);
 }
 
-static void free_sched_domain(struct rcu_head *rcu)
+static void destroy_sched_domain(struct sched_domain *sd)
 {
-	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
-
 	/*
 	 * If its an overlapping domain it has private groups, iterate and
 	 * nuke them all.
@@ -5841,15 +5839,21 @@ static void free_sched_domain(struct rcu
 	kfree(sd);
 }
 
-static void destroy_sched_domain(struct sched_domain *sd)
+static void destroy_sched_domains_rcu(struct rcu_head *rcu)
 {
-	call_rcu(&sd->rcu, free_sched_domain);
+	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
+
+	while (sd) {
+		struct sched_domain *parent = sd->parent;
+		destroy_sched_domain(sd);
+		sd = parent;
+	}
 }
 
 static void destroy_sched_domains(struct sched_domain *sd)
 {
-	for (; sd; sd = sd->parent)
-		destroy_sched_domain(sd);
+	if (sd)
+		call_rcu(&sd->rcu, destroy_sched_domains_rcu);
 }
 
 /*


* Re: [RFC][PATCH 0/7] sched: select_idle_siblings rewrite
  2016-05-09 10:48 [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Peter Zijlstra
                   ` (6 preceding siblings ...)
  2016-05-09 10:48 ` [RFC][PATCH 7/7] sched: debug muck -- not for merging Peter Zijlstra
@ 2016-05-10  0:50 ` Chris Mason
  2016-05-11 14:19 ` Chris Mason
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Chris Mason @ 2016-05-10  0:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, matt, mgalbraith, tglx, fweisbec

On Mon, May 09, 2016 at 12:48:07PM +0200, Peter Zijlstra wrote:
> Hai,
> 
> here be a semi coherent patch series for the recent select_idle_siblings()
> tinkering. Happy benchmarking..

Thanks Peter,

I'll have some production numbers tomorrow, but based on schbench I'm
hoping it'll score better than my original.

You win on the pipe test (1.9MB/s vs 2.1MB/s) when a thread sibling is
pegged, and even when I double the think time on schbench, it holds up
well:

# Peter
# ./schbench -t 23 -m 2 -c 60000 -s 30000
Latency percentiles (usec)
	50.0000th: 50
	75.0000th: 60
	90.0000th: 69
	95.0000th: 73
	*99.0000th: 85
	99.5000th: 135
	99.9000th: 4012
	min=0, max=10873

# Mason
# ./schbench -t 23 -m 2 -c 60000 -s 30000
Latency percentiles (usec)
        50.0000th: 50
        75.0000th: 60
        90.0000th: 70
        95.0000th: 74
        *99.0000th: 83
        99.5000th: 88
        99.9000th: 118
        min=0, max=14770

# Mainline
# ./schbench -t 23 -m 2 -c 60000 -s 30000
Latency percentiles (usec)
        50.0000th: 47
        75.0000th: 60
        90.0000th: 70
        95.0000th: 79
        *99.0000th: 5400
        99.5000th: 10352
        99.9000th: 10992
        min=0, max=19642


* Re: [RFC][PATCH 5/7] sched: Rewrite select_idle_siblings()
  2016-05-09 10:48 ` [RFC][PATCH 5/7] sched: Rewrite select_idle_siblings() Peter Zijlstra
@ 2016-05-10 21:05   ` Yuyang Du
  2016-05-11  7:00     ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Yuyang Du @ 2016-05-10 21:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, clm, matt, mgalbraith, tglx, fweisbec

On Mon, May 09, 2016 at 12:48:12PM +0200, Peter Zijlstra wrote:
> +/*
> + * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> + * comparing the average scan cost (tracked in sd->avg_scan_cost) against the

				       tracked in this_sd->avg_scan_cost

> + * average idle time for this rq (as found in rq->avg_idle).
> + */
> +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +	struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> +	u64 avg_idle = this_rq()->avg_idle;
> +	u64 avg_cost = this_sd->avg_scan_cost;
> +	u64 time, cost;
> +	s64 delta;
> +	int cpu, wrap;
> +
> +	/*
> +	 * Due to large variance we need a large fuzz factor; hackbench in
> +	 * particularly is sensitive here.
> +	 */
> +	if ((avg_idle / 512) < avg_cost)
> +		return -1;
> +
> +	time = local_clock();
> +
> +	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
> +		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> +			continue;
> +		if (idle_cpu(cpu))
> +			break;
> +	}
> +
> +	time = local_clock() - time;
> +	cost = this_sd->avg_scan_cost;
> +	delta = (s64)(time - cost) / 8;
> +	this_sd->avg_scan_cost += delta;
> +
> +	return cpu;
> +}

[snip]

> +
> +	i = select_idle_core(p, sd, target);
> +	if ((unsigned)i < nr_cpumask_bits)
> +		return i;
> +
> +	i = select_idle_cpu(p, sd, target);
> +	if ((unsigned)i < nr_cpumask_bits)
> +		return i;
> +
> +	i = select_idle_smt(p, sd, target);
> +	if ((unsigned)i < nr_cpumask_bits)
> +		return i;

First, on smt, these three scans have a lot of overlap, but they also have an
opportunistic-spinning effect, if that was intended, which is good. Anyway,
maybe you should write something about it, and about why only the cpu scan,
not the core scan, considers the cost.

Then, I am still considering combining them a bit, like the following patch.
And if you want, more might be combined.

P.S., The idle_smt_cpu may not be the first idle, but one more i++ can make
it.

---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25bd5b0..648a15c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5281,7 +5281,7 @@ unlock:
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu, wrap;
+	int core, cpu, wrap, i = 0, idle_smt_cpu = -1;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -5298,8 +5298,14 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 			cpumask_clear_cpu(cpu, cpus);
 			if (!idle_cpu(cpu))
 				idle = false;
+			/*
+			 * First smt must be target's smt, and any cpu here is allowed
+			 */
+			else if (i == 0)
+				idle_smt_cpu = cpu;
 		}
 
+		i++;
 		if (idle)
 			return core;
 	}


* Re: [RFC][PATCH 5/7] sched: Rewrite select_idle_siblings()
  2016-05-11  7:00     ` Peter Zijlstra
@ 2016-05-10 23:42       ` Yuyang Du
  2016-05-11  7:43         ` Mike Galbraith
  0 siblings, 1 reply; 39+ messages in thread
From: Yuyang Du @ 2016-05-10 23:42 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, clm, matt, mgalbraith, tglx, fweisbec

On Wed, May 11, 2016 at 09:00:29AM +0200, Peter Zijlstra wrote:
> On Wed, May 11, 2016 at 05:05:50AM +0800, Yuyang Du wrote:
> > On Mon, May 09, 2016 at 12:48:12PM +0200, Peter Zijlstra wrote:
> > > +	i = select_idle_core(p, sd, target);
> > > +	if ((unsigned)i < nr_cpumask_bits)
> > > +		return i;
> > > +
> > > +	i = select_idle_cpu(p, sd, target);
> > > +	if ((unsigned)i < nr_cpumask_bits)
> > > +		return i;
> > > +
> > > +	i = select_idle_smt(p, sd, target);
> > > +	if ((unsigned)i < nr_cpumask_bits)
> > > +		return i;
> > 
> > First, on smt, these three scans have a lot of overlap,
> 
> Limited, we stop doing the idle_core scan the moment we fail to find
> one. And on a busy system it is unlikely to come back.
> 
> Which leaves us with the straight idle_cpu scan. Sure, the one time we
> fail to find an idle core and fall through we'll scan double, but that
> should be rare.

Yes, I was sorta well educated about how important finding an idle CPU is
("one stack, game over"), which therefore is surely worth the effort.

Looks to me like it is just about how much opportunistic-spinning should be
applied here. Great.

Do you have any suggestions about the other parts of select_task_rq_fair?


* Re: [RFC][PATCH 5/7] sched: Rewrite select_idle_siblings()
  2016-05-10 21:05   ` Yuyang Du
@ 2016-05-11  7:00     ` Peter Zijlstra
  2016-05-10 23:42       ` Yuyang Du
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-11  7:00 UTC (permalink / raw)
  To: Yuyang Du; +Cc: mingo, linux-kernel, clm, matt, mgalbraith, tglx, fweisbec

On Wed, May 11, 2016 at 05:05:50AM +0800, Yuyang Du wrote:
> On Mon, May 09, 2016 at 12:48:12PM +0200, Peter Zijlstra wrote:
> > +	i = select_idle_core(p, sd, target);
> > +	if ((unsigned)i < nr_cpumask_bits)
> > +		return i;
> > +
> > +	i = select_idle_cpu(p, sd, target);
> > +	if ((unsigned)i < nr_cpumask_bits)
> > +		return i;
> > +
> > +	i = select_idle_smt(p, sd, target);
> > +	if ((unsigned)i < nr_cpumask_bits)
> > +		return i;
> 
> First, on smt, these three scans have a lot of overlap,

Limited, we stop doing the idle_core scan the moment we fail to find
one. And on a busy system it is unlikely to come back.

Which leaves us with the straight idle_cpu scan. Sure, the one time we
fail to find an idle core and fall through we'll scan double, but that
should be rare.

> Then, I am still considering combining them a bit, like the following patch.
> And if you want, more might be combined.

I tried that (and you really want the last idle it finds, to minimize
the time between testing for idle and returning it), but it didn't
matter for anything I could find.


* Re: [RFC][PATCH 5/7] sched: Rewrite select_idle_siblings()
  2016-05-10 23:42       ` Yuyang Du
@ 2016-05-11  7:43         ` Mike Galbraith
  0 siblings, 0 replies; 39+ messages in thread
From: Mike Galbraith @ 2016-05-11  7:43 UTC (permalink / raw)
  To: Yuyang Du, Peter Zijlstra; +Cc: mingo, linux-kernel, clm, matt, tglx, fweisbec

On Wed, 2016-05-11 at 07:42 +0800, Yuyang Du wrote:

> Do you have any suggestions about the other parts of
> select_task_rq_fair?

Conjure up a decent metric for placing tasks in tasty hot L2?  That
would be really lovely to have (/me pondering page faults..).

We used to have an avg_overlap heuristic, which would let us stack high
speed ping-pong like things such that they could really smoke.  It was
too fragile though, so had to go away.

	-Mike


* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-09 10:48 ` [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared Peter Zijlstra
@ 2016-05-11 11:55   ` Matt Fleming
  2016-05-11 12:33     ` Peter Zijlstra
  2016-05-11 17:37     ` Peter Zijlstra
  2016-05-16 15:31   ` Dietmar Eggemann
  1 sibling, 2 replies; 39+ messages in thread
From: Matt Fleming @ 2016-05-11 11:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, clm, mgalbraith, tglx, fweisbec

On Mon, 09 May, at 12:48:11PM, Peter Zijlstra wrote:
> Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
> location into the much more natural sched_domain_shared location.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched.h    |    1 +
>  kernel/sched/core.c      |   10 +++++-----
>  kernel/sched/fair.c      |   22 ++++++++++++----------
>  kernel/sched/sched.h     |    6 +-----
>  kernel/time/tick-sched.c |   10 +++++-----
>  5 files changed, 24 insertions(+), 25 deletions(-)
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1059,6 +1059,7 @@ struct sched_group;
>  
>  struct sched_domain_shared {
>  	atomic_t	ref;
> +	atomic_t	nr_busy_cpus;
>  };
>  
>  struct sched_domain {

[...]

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7842,13 +7842,13 @@ static inline void set_cpu_sd_state_busy
>  	int cpu = smp_processor_id();
>  
>  	rcu_read_lock();
> -	sd = rcu_dereference(per_cpu(sd_busy, cpu));
> +	sd = rcu_dereference(per_cpu(sd_llc, cpu));
>  
>  	if (!sd || !sd->nohz_idle)
>  		goto unlock;
>  	sd->nohz_idle = 0;
>  
> -	atomic_inc(&sd->groups->sgc->nr_busy_cpus);
> +	atomic_inc(&sd->shared->nr_busy_cpus);
>  unlock:
>  	rcu_read_unlock();
>  }

This breaks my POWER7 box which presumably doesn't have SD_SHARE_PKG_RESOURCES,

  NIP [c00000000012de68] .set_cpu_sd_state_idle+0x58/0x80
  LR [c00000000017ded4] .tick_nohz_idle_enter+0x24/0x90
  Call Trace:
  [c0000007774b7cf0] [c0000007774b4080] 0xc0000007774b4080 (unreliable)
  [c0000007774b7d60] [c00000000017ded4] .tick_nohz_idle_enter+0x24/0x90
  [c0000007774b7dd0] [c000000000137200] .cpu_startup_entry+0xe0/0x440
  [c0000007774b7ee0] [c00000000004739c] .start_secondary+0x35c/0x3a0
  [c0000007774b7f90] [c000000000008bfc] start_secondary_prolog+0x10/0x14

The following fixes it,

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 978b3ef2d87e..d27153adee4d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7920,7 +7920,8 @@ static inline void set_cpu_sd_state_busy(void)
 		goto unlock;
 	sd->nohz_idle = 0;
 
-	atomic_inc(&sd->shared->nr_busy_cpus);
+	if (sd->shared)
+		atomic_inc(&sd->shared->nr_busy_cpus);
 unlock:
 	rcu_read_unlock();
 }
@@ -7937,7 +7938,8 @@ void set_cpu_sd_state_idle(void)
 		goto unlock;
 	sd->nohz_idle = 1;
 
-	atomic_dec(&sd->shared->nr_busy_cpus);
+	if (sd->shared)
+		atomic_dec(&sd->shared->nr_busy_cpus);
 unlock:
 	rcu_read_unlock();
 }


* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-11 11:55   ` Matt Fleming
@ 2016-05-11 12:33     ` Peter Zijlstra
  2016-05-11 18:11       ` Peter Zijlstra
  2016-05-11 18:24       ` Peter Zijlstra
  2016-05-11 17:37     ` Peter Zijlstra
  1 sibling, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-11 12:33 UTC (permalink / raw)
  To: Matt Fleming
  Cc: mingo, linux-kernel, clm, mgalbraith, tglx, fweisbec, srikar,
	mikey, anton

On Wed, May 11, 2016 at 12:55:56PM +0100, Matt Fleming wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7842,13 +7842,13 @@ static inline void set_cpu_sd_state_busy
> >  	int cpu = smp_processor_id();
> >  
> >  	rcu_read_lock();
> > -	sd = rcu_dereference(per_cpu(sd_busy, cpu));
> > +	sd = rcu_dereference(per_cpu(sd_llc, cpu));
> >  
> >  	if (!sd || !sd->nohz_idle)
> >  		goto unlock;
> >  	sd->nohz_idle = 0;
> >  
> > -	atomic_inc(&sd->groups->sgc->nr_busy_cpus);
> > +	atomic_inc(&sd->shared->nr_busy_cpus);
> >  unlock:
> >  	rcu_read_unlock();
> >  }
> 
> This breaks my POWER7 box which presumably doesn't have SD_SHARE_PKG_RESOURCES,
> 

Hmm, PPC folks; what does your topology look like?

Currently your sched_domain_topology, as per arch/powerpc/kernel/smp.c
seems to suggest your cores do not share cache at all.

https://en.wikipedia.org/wiki/POWER7 seems to agree and states

  "4 MB L3 cache per C1 core"

And http://www-03.ibm.com/systems/resources/systems_power_software_i_perfmgmt_underthehood.pdf
also explicitly draws pictures with the L3 per core.

_however_, that same document describes L3 inter-core fill and lateral
cast-out, which sounds like the L3s work together to form a node-wide
caching system.

Do we want to model this co-operative L3 slices thing as a sort of
node-wide LLC for the purpose of the scheduler?

While we should definitely fix the assumption that an LLC exists (and I
need to look at why it isn't set to the core domain instead as well),
the scheduler does try and scale things by 'assuming' LLC := node.

It does this for NOHZ, and these here patches under discussion would be
doing the same for idle-core state.

Would this make sense for power, or should we somehow think of something
else?
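
(Background for anyone following along: the per-cpu LLC pointer is derived
from the topology flags, roughly as in the sketch below -- paraphrased from
update_top_cache_domain() in kernel/sched/core.c, not a literal quote -- so
a topology whose only SD_SHARE_PKG_RESOURCES level is SMT caps the "LLC" at
the core.)

	static void update_top_cache_domain(int cpu)
	{
		struct sched_domain *sd;
		int id = cpu, size = 1;

		/* The highest domain that still shares cache becomes the LLC. */
		sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
		if (sd) {
			id = cpumask_first(sched_domain_span(sd));
			size = cpumask_weight(sched_domain_span(sd));
		}

		rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
		per_cpu(sd_llc_size, cpu) = size;
		per_cpu(sd_llc_id, cpu) = id;
	}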

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 0/7] sched: select_idle_siblings rewrite
  2016-05-09 10:48 [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Peter Zijlstra
                   ` (7 preceding siblings ...)
  2016-05-10  0:50 ` [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Chris Mason
@ 2016-05-11 14:19 ` Chris Mason
  2016-05-18  5:51 ` [RFC][PATCH 8/7] sched/fair: Use utilization distance to filter affine sync wakeups Mike Galbraith
  2016-05-25 14:51 ` [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Chris Mason
  10 siblings, 0 replies; 39+ messages in thread
From: Chris Mason @ 2016-05-11 14:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, matt, mgalbraith, tglx, fweisbec

On Mon, May 09, 2016 at 12:48:07PM +0200, Peter Zijlstra wrote:
> Hai,
> 
> here be a semi coherent patch series for the recent select_idle_siblings()
> tinkering. Happy benchmarking..

I ran a few more rounds of the production benchmarks, and NO_AVG_CPU is
consistently faster by about 5% than AVG_CPU.  I think what's happening
here is the production runs are ramping up load in a finer grained setup
than schbench, and production is able to see the AVG_CPU calculations
back off the scan too soon (for us anyway).

I'm going to play around with schbench to try and model this better, but
so far this is a clear win over unpatched v4.6.

-chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-11 11:55   ` Matt Fleming
  2016-05-11 12:33     ` Peter Zijlstra
@ 2016-05-11 17:37     ` Peter Zijlstra
  2016-05-11 18:04       ` Matt Fleming
  1 sibling, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-11 17:37 UTC (permalink / raw)
  To: Matt Fleming; +Cc: mingo, linux-kernel, clm, mgalbraith, tglx, fweisbec

On Wed, May 11, 2016 at 12:55:56PM +0100, Matt Fleming wrote:

> This breaks my POWER7 box which presumably doesn't have SD_SHARE_PKG_RESOURCES,

> index 978b3ef2d87e..d27153adee4d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7920,7 +7920,8 @@ static inline void set_cpu_sd_state_busy(void)
>  		goto unlock;
>  	sd->nohz_idle = 0;
>  
> -	atomic_inc(&sd->shared->nr_busy_cpus);
> +	if (sd->shared)
> +		atomic_inc(&sd->shared->nr_busy_cpus);
>  unlock:
>  	rcu_read_unlock();
>  }


Ah, no, the problem is that while it does have SHARE_PKG_RESOURCES (in
its SMT domain -- SMT threads share all cache after all), I failed to
connect the sched_domain_shared structure for it.

Does something like this also work?

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6389,10 +6389,6 @@ sd_init(struct sched_domain_topology_lev
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
 
-		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
-		atomic_inc(&sd->shared->ref);
-		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-
 #ifdef CONFIG_NUMA
 	} else if (sd->flags & SD_NUMA) {
 		sd->cache_nice_tries = 2;
@@ -6414,6 +6410,16 @@ sd_init(struct sched_domain_topology_lev
 		sd->idle_idx = 1;
 	}
 
+	/*
+	 * For all levels sharing cache, connect a sched_domain_shared
+	 * instance.
+	 */
+	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+		atomic_inc(&sd->shared->ref);
+		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+	}
+
 	sd->private = sdd;
 
 	return sd;

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-11 17:37     ` Peter Zijlstra
@ 2016-05-11 18:04       ` Matt Fleming
  0 siblings, 0 replies; 39+ messages in thread
From: Matt Fleming @ 2016-05-11 18:04 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, clm, mgalbraith, tglx, fweisbec

On Wed, 11 May, at 07:37:52PM, Peter Zijlstra wrote:
> On Wed, May 11, 2016 at 12:55:56PM +0100, Matt Fleming wrote:
> 
> > This breaks my POWER7 box which presumably doesn't have SD_SHARE_PKG_RESOURCES,
> 
> > index 978b3ef2d87e..d27153adee4d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7920,7 +7920,8 @@ static inline void set_cpu_sd_state_busy(void)
> >  		goto unlock;
> >  	sd->nohz_idle = 0;
> >  
> > -	atomic_inc(&sd->shared->nr_busy_cpus);
> > +	if (sd->shared)
> > +		atomic_inc(&sd->shared->nr_busy_cpus);
> >  unlock:
> >  	rcu_read_unlock();
> >  }
> 
> 
> Ah, no, the problem is that while it does have SHARE_PKG_RESOURCES (in
> its SMT domain -- SMT threads share all cache after all), I failed to
> connect the sched_domain_shared structure for it.
> 
> Does something like this also work?

Yep, it does.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-11 12:33     ` Peter Zijlstra
@ 2016-05-11 18:11       ` Peter Zijlstra
  2016-05-11 18:24       ` Peter Zijlstra
  1 sibling, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-11 18:11 UTC (permalink / raw)
  To: Matt Fleming
  Cc: mingo, linux-kernel, clm, mgalbraith, tglx, fweisbec, srikar,
	mikey, anton

On Wed, May 11, 2016 at 02:33:45PM +0200, Peter Zijlstra wrote:
> Do we want to model this co-operative L3 slices thing as a sort of
> node-wide LLC for the purpose of the scheduler ?
> 
> the scheduler does try and scale things by 'assuming' LLC := node.

So this whole series is about selecting idle CPUs to run stuff on. With
the current PPC setup we would never consider placing a task outside of
its core.

If we were to add a node wide cache domain, the scheduler would look to
place a task across cores if there is an idle core to be had.

Would something like that be beneficial?
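
Concretely, once an LLC domain spans the package, the rewrite can do a walk
along the lines of the (much simplified) sketch below; the helper name and
the lack of wrapping/cost cut-offs are mine, not the actual patch:

	/*
	 * Simplified sketch only: scan the LLC span for a core whose SMT
	 * siblings are all idle.  The real code wraps the iteration around
	 * the target CPU and bounds the amount of scanning.
	 */
	static int find_idle_core_sketch(struct task_struct *p, struct sched_domain *llc)
	{
		int core, cpu;

		for_each_cpu_and(core, sched_domain_span(llc), tsk_cpus_allowed(p)) {
			bool idle = true;

			for_each_cpu(cpu, cpu_smt_mask(core)) {
				if (!idle_cpu(cpu))
					idle = false;
			}

			if (idle)
				return core;
		}

		return -1;	/* no fully idle core in this LLC */
	}

Without a cache level above SMT, sched_domain_span(llc) never covers more
than the one core, so the walk above degenerates to "is my own core idle".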

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-11 12:33     ` Peter Zijlstra
  2016-05-11 18:11       ` Peter Zijlstra
@ 2016-05-11 18:24       ` Peter Zijlstra
  2016-05-12  2:05         ` Michael Neuling
  1 sibling, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-11 18:24 UTC (permalink / raw)
  To: Matt Fleming
  Cc: mingo, linux-kernel, clm, mgalbraith, tglx, fweisbec, srikar,
	mikey, anton

On Wed, May 11, 2016 at 02:33:45PM +0200, Peter Zijlstra wrote:
> Hmm, PPC folks; what does your topology look like?
> 
> Currently your sched_domain_topology, as per arch/powerpc/kernel/smp.c
> seems to suggest your cores do not share cache at all.
> 
> https://en.wikipedia.org/wiki/POWER7 seems to agree and states
> 
>   "4 MB L3 cache per C1 core"
> 
> And http://www-03.ibm.com/systems/resources/systems_power_software_i_perfmgmt_underthehood.pdf
> also explicitly draws pictures with the L3 per core.
> 
> _however_, that same document describes L3 inter-core fill and lateral
> cast-out, which sounds like the L3s work together to form a node wide
> caching system.
> 
> Do we want to model this co-operative L3 slices thing as a sort of
> node-wide LLC for the purpose of the scheduler ?

Going back a generation; Power6 seems to have a shared L3 (off package)
between the two cores on the package. The current topology does not
reflect that at all.

And going forward a generation; Power8 seems to share the per-core
(chiplet) L3 amongst all cores (chiplets), and it has the centaur (memory
controller) 16M L4.

So it seems the current topology setup is not describing these chips
very well. Also note that the arch topology code can runtime select a
topology, so you could make that topo setup micro-arch specific.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-11 18:24       ` Peter Zijlstra
@ 2016-05-12  2:05         ` Michael Neuling
  2016-05-12  5:07           ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Michael Neuling @ 2016-05-12  2:05 UTC (permalink / raw)
  To: Peter Zijlstra, Matt Fleming
  Cc: mingo, linux-kernel, clm, mgalbraith, tglx, fweisbec, srikar,
	anton, oliver, Shreyas B. Prabhu

On Wed, 2016-05-11 at 20:24 +0200, Peter Zijlstra wrote:
> On Wed, May 11, 2016 at 02:33:45PM +0200, Peter Zijlstra wrote:
> > 
> > Hmm, PPC folks; what does your topology look like?
> > 
> > Currently your sched_domain_topology, as per arch/powerpc/kernel/smp.c
> > seems to suggest your cores do not share cache at all.
> > 
> > https://en.wikipedia.org/wiki/POWER7 seems to agree and states
> > 
> >   "4 MB L3 cache per C1 core"
> > 
> > And http://www-03.ibm.com/systems/resources/systems_power_software_i_pe
> > rfmgmt_underthehood.pdf
> > also explicitly draws pictures with the L3 per core.
> > 
> > _however_, that same document describes L3 inter-core fill and lateral
> > cast-out, which sounds like the L3s work together to form a node wide
> > caching system.
> > 
> > Do we want to model this co-operative L3 slices thing as a sort of
> > node-wide LLC for the purpose of the scheduler ?
> Going back a generation; Power6 seems to have a shared L3 (off package)
> between the two cores on the package. The current topology does not
> reflect that at all.
> 
> And going forward a generation; Power8 seems to share the per-core
> (chiplet) L3 amonst all cores (chiplets) + is has the centaur (memory
> controller) 16M L4.

Yep, L1/L2/L3 is per core on POWER8 and POWER7.  POWER6 and POWER5 (both
dual core chips) had a shared off-chip cache.

The POWER8 L4 is really a bit different as it's out in the memory
controller.  It's more of a memory DIMM buffer as it can only cache data
associated with the physical addresses on those DIMMs.

> So it seems the current topology setup is not describing these chips
> very well. Also note that the arch topology code can runtime select a
> topology, so you could make that topo setup micro-arch specific.

We are planning on making some topology changes for the upcoming P9 which
will share L2/L3 amongst pairs of cores (24 cores per chip).

FWIW our P9 upstreaming is still in its infancy since P9 is not released
yet.

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-12  2:05         ` Michael Neuling
@ 2016-05-12  5:07           ` Peter Zijlstra
  2016-05-12 11:07             ` Michael Neuling
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-12  5:07 UTC (permalink / raw)
  To: Michael Neuling
  Cc: Matt Fleming, mingo, linux-kernel, clm, mgalbraith, tglx,
	fweisbec, srikar, anton, oliver, Shreyas B. Prabhu

On Thu, May 12, 2016 at 12:05:37PM +1000, Michael Neuling wrote:
> On Wed, 2016-05-11 at 20:24 +0200, Peter Zijlstra wrote:
> > On Wed, May 11, 2016 at 02:33:45PM +0200, Peter Zijlstra wrote:
> > > 
> > > Hmm, PPC folks; what does your topology look like?
> > > 
> > > Currently your sched_domain_topology, as per arch/powerpc/kernel/smp.c
> > > seems to suggest your cores do not share cache at all.
> > > 
> > > https://en.wikipedia.org/wiki/POWER7 seems to agree and states
> > > 
> > >   "4 MB L3 cache per C1 core"
> > > 
> > > And http://www-03.ibm.com/systems/resources/systems_power_software_i_pe
> > > rfmgmt_underthehood.pdf
> > > also explicitly draws pictures with the L3 per core.
> > > 
> > > _however_, that same document describes L3 inter-core fill and lateral
> > > cast-out, which sounds like the L3s work together to form a node wide
> > > caching system.
> > > 
> > > Do we want to model this co-operative L3 slices thing as a sort of
> > > node-wide LLC for the purpose of the scheduler ?
> > Going back a generation; Power6 seems to have a shared L3 (off package)
> > between the two cores on the package. The current topology does not
> > reflect that at all.
> > 
> > And going forward a generation; Power8 seems to share the per-core
> > (chiplet) L3 amonst all cores (chiplets) + is has the centaur (memory
> > controller) 16M L4.
> 
> Yep, L1/L2/L3 is per core on POWER8 and POWER7.  POWER6 and POWER5 (both
> dual core chips) had a shared off chip cache

But as per the above, Power7 and Power8 have explicit logic to share the
per-core L3 with the other cores.

How effective is that? From some of the slides/documents I've looked at,
the L3s are connected with a high-speed fabric, suggesting that the
cross-core sharing should be fairly efficient.

In which case it would make sense to treat/model the combined L3 as a
single large LLC covering all cores.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-12  5:07           ` Peter Zijlstra
@ 2016-05-12 11:07             ` Michael Neuling
  2016-05-12 11:33               ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Michael Neuling @ 2016-05-12 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matt Fleming, mingo, linux-kernel, clm, mgalbraith, tglx,
	fweisbec, srikar, anton, oliver, Shreyas B. Prabhu

On Thu, 2016-05-12 at 07:07 +0200, Peter Zijlstra wrote:
> On Thu, May 12, 2016 at 12:05:37PM +1000, Michael Neuling wrote:
> > 
> > On Wed, 2016-05-11 at 20:24 +0200, Peter Zijlstra wrote:
> > > 
> > > On Wed, May 11, 2016 at 02:33:45PM +0200, Peter Zijlstra wrote:
> > > > 
> > > > 
> > > > Hmm, PPC folks; what does your topology look like?
> > > > 
> > > > Currently your sched_domain_topology, as per
> > > > arch/powerpc/kernel/smp.c
> > > > seems to suggest your cores do not share cache at all.
> > > > 
> > > > https://en.wikipedia.org/wiki/POWER7 seems to agree and states
> > > > 
> > > >   "4 MB L3 cache per C1 core"
> > > > 
> > > > And http://www-03.ibm.com/systems/resources/systems_power_software_
> > > > i_pe
> > > > rfmgmt_underthehood.pdf
> > > > also explicitly draws pictures with the L3 per core.
> > > > 
> > > > _however_, that same document describes L3 inter-core fill and
> > > > lateral
> > > > cast-out, which sounds like the L3s work together to form a node
> > > > wide
> > > > caching system.
> > > > 
> > > > Do we want to model this co-operative L3 slices thing as a sort of
> > > > node-wide LLC for the purpose of the scheduler ?
> > > Going back a generation; Power6 seems to have a shared L3 (off
> > > package)
> > > between the two cores on the package. The current topology does not
> > > reflect that at all.
> > > 
> > > And going forward a generation; Power8 seems to share the per-core
> > > (chiplet) L3 amonst all cores (chiplets) + is has the centaur (memory
> > > controller) 16M L4.
> > Yep, L1/L2/L3 is per core on POWER8 and POWER7.  POWER6 and POWER5
> > (both
> > dual core chips) had a shared off chip cache
> But as per the above, Power7 and Power8 have explicit logic to share the
> per-core L3 with the other cores.
> 
> How effective is that? From some of the slides/documents i've looked at
> the L3s are connected with a high-speed fabric. Suggesting that the
> cross-core sharing should be fairly efficient.

I'm not sure.  I thought it was mostly private but if another core was
sleeping or not experiencing much cache pressure, another core could use it
for some things. But I'm fuzzy on the exact properties, sorry.

> In which case it would make sense to treat/model the combined L3 as a
> single large LLC covering all cores.

Are you thinking it would be much cheaper to migrate a task to another core
inside this chip, than to off chip?

Mikey

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-12 11:07             ` Michael Neuling
@ 2016-05-12 11:33               ` Peter Zijlstra
  2016-05-13  0:12                 ` Michael Neuling
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-12 11:33 UTC (permalink / raw)
  To: Michael Neuling
  Cc: Matt Fleming, mingo, linux-kernel, clm, mgalbraith, tglx,
	fweisbec, srikar, anton, oliver, Shreyas B. Prabhu

On Thu, May 12, 2016 at 09:07:52PM +1000, Michael Neuling wrote:
> On Thu, 2016-05-12 at 07:07 +0200, Peter Zijlstra wrote:

> > But as per the above, Power7 and Power8 have explicit logic to share the
> > per-core L3 with the other cores.
> > 
> > How effective is that? From some of the slides/documents i've looked at
> > the L3s are connected with a high-speed fabric. Suggesting that the
> > cross-core sharing should be fairly efficient.
> 
> I'm not sure.  I thought it was mostly private but if another core was
> sleeping or not experiencing much cache pressure, another core could use it
> for some things. But I'm fuzzy on the the exact properties, sorry.

Right; I'm going by bits and pieces found on the tubes, so I'm just
guessing ;-)

But it sounds like these L3s are nowhere close to what Intel does with
their L3, where each core has an L3 slice, and slices are connected on a
ring to form a unified/shared cache across all cores.

http://www.realworldtech.com/sandy-bridge/8/

> > In which case it would make sense to treat/model the combined L3 as a
> > single large LLC covering all cores.
> 
> Are you thinking it would be much cheaper to migrate a task to another core
> inside this chip, than to off chip?

Basically; and if so, if it's cheap enough to shoot a task to an idle
core to avoid queueing. Assuming there still is some cache residency on
the old core, the inter-core fill should be much cheaper than fetching
it off package (either remote cache or dram).

Or at least; so goes my reasoning based on my google results.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-12 11:33               ` Peter Zijlstra
@ 2016-05-13  0:12                 ` Michael Neuling
  2016-05-16 14:00                   ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Michael Neuling @ 2016-05-13  0:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matt Fleming, mingo, linux-kernel, clm, mgalbraith, tglx,
	fweisbec, srikar, anton, oliver, Shreyas B. Prabhu

On Thu, 2016-05-12 at 13:33 +0200, Peter Zijlstra wrote:
> On Thu, May 12, 2016 at 09:07:52PM +1000, Michael Neuling wrote:
> > 
> > On Thu, 2016-05-12 at 07:07 +0200, Peter Zijlstra wrote:
> > 
> > > 
> > > But as per the above, Power7 and Power8 have explicit logic to share
> > > the
> > > per-core L3 with the other cores.
> > > 
> > > How effective is that? From some of the slides/documents i've looked
> > > at
> > > the L3s are connected with a high-speed fabric. Suggesting that the
> > > cross-core sharing should be fairly efficient.
> > I'm not sure.  I thought it was mostly private but if another core was
> > sleeping or not experiencing much cache pressure, another core could
> > use it
> > for some things. But I'm fuzzy on the the exact properties, sorry.
> Right; I'm going by bits and pieces found on the tubes, so I'm just
> guessing ;-)
> 
> But it sounds like these L3s are nowhere close to what Intel does with
> their L3, where each core has an L3 slice, and slices are connected on a
> ring to form a unified/shared cache across all cores.
> 
> http://www.realworldtech.com/sandy-bridge/8/

The POWER8 user manual is what you want to look at:

https://www.setphaserstostun.org/power8/POWER8_UM_v1.3_16MAR2016_pub.pdf

There is a section 10, "L3 Cache Overview", starting on page 128.  In there
it talks about L3.0, which is using the local core's L3, and L3.1, which is
using some other core's L3.

Once the L3.0 is full, we can cast out to an L3.1 (ie. the cache on another
core).  L3.1 can also provide data for reads.

ECO mode (section 10.4) is what I was talking about for sleeping/unused
cores.  That's more of a boot-time firmware option than something we can
dynamically play with at runtime (I believe), so I don't think it's
relevant here.

> > 
> > > 
> > > In which case it would make sense to treat/model the combined L3 as a
> > > single large LLC covering all cores.
> > Are you thinking it would be much cheaper to migrate a task to another
> > core
> > inside this chip, than to off chip?
> Basically; and if so, if its cheap enough to shoot a task to an idle
> core to avoid queueing. Assuming there still is some cache residency on
> the old core, the inter-core fill should be much cheaper than fetching
> it off package (either remote cache or dram).

So I think that will apply on POWER8.

In 10.4.2 it says "The L3.1 ECO Caches will be snooped and provide
intervention data similar to the L2 and L3.0 caches on the
chip".  That should be much faster than going to another chip or DIMM.

So migrating to another core on the same chip should be faster than off
chip.

Mikey




> Or at least; so goes my reasoning based on my google results.
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-13  0:12                 ` Michael Neuling
@ 2016-05-16 14:00                   ` Peter Zijlstra
  2016-05-17 10:20                     ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-16 14:00 UTC (permalink / raw)
  To: Michael Neuling
  Cc: Matt Fleming, mingo, linux-kernel, clm, mgalbraith, tglx,
	fweisbec, srikar, anton, oliver, Shreyas B. Prabhu

On Fri, May 13, 2016 at 10:12:26AM +1000, Michael Neuling wrote:

> > Basically; and if so, if its cheap enough to shoot a task to an idle
> > core to avoid queueing. Assuming there still is some cache residency on
> > the old core, the inter-core fill should be much cheaper than fetching
> > it off package (either remote cache or dram).
> 
> So I think that will apply on POWER8.
> 
> In 10.4.2 it says "The L3.1 ECO Caches will be snooped and provide
> intervention data similar to the L2 and L3.0 caches on the
> chip"  That should be much faster than going to another chip or DIMM.
> 
> So migrating to another core on the same chip should be faster than off
> chip.

OK; so something like the below might be what you want to play with.

---
 arch/powerpc/kernel/smp.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 55c924b65f71..1a54fa8a3323 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -782,6 +782,23 @@ static struct sched_domain_topology_level powerpc_topology[] = {
 	{ NULL, },
 };
 
+static struct sched_domain_topology_level powerpc8_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_MC
+	/*
+	 * Model the L3.1 cache and set the LLC as the whole package.
+	 *
+	 * This also ensures we try and move woken tasks to idle cores inside
+	 * the package to avoid queueing.
+	 */
+	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	cpumask_var_t old_mask;
@@ -806,7 +823,10 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 	dump_numa_cpu_topology();
 
-	set_sched_topology(powerpc_topology);
+	if (cpu_has_feature(CPU_FTRS_POWER8))
+		set_sched_topology(powerpc8_topology);
+	else
+		set_sched_topology(powerpc_topology);
 
 }
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-09 10:48 ` [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared Peter Zijlstra
  2016-05-11 11:55   ` Matt Fleming
@ 2016-05-16 15:31   ` Dietmar Eggemann
  2016-05-16 17:02     ` Peter Zijlstra
  1 sibling, 1 reply; 39+ messages in thread
From: Dietmar Eggemann @ 2016-05-16 15:31 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, linux-kernel; +Cc: clm, matt, mgalbraith, tglx, fweisbec

On 09/05/16 11:48, Peter Zijlstra wrote:

Couldn't you just always access sd->shared via
sd = rcu_dereference(per_cpu(sd_llc, cpu)) for
updating nr_busy_cpus?

The call_rcu() thing is on the sd anyway.

@@ -5879,7 +5879,6 @@ static void destroy_sched_domains(struct sched_domain *sd)
 DEFINE_PER_CPU(struct sched_domain *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
-DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
 
@@ -5900,7 +5899,6 @@ static void update_top_cache_domain(int cpu)
        rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
        per_cpu(sd_llc_size, cpu) = size;
        per_cpu(sd_llc_id, cpu) = id;
-       rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
        sd = lowest_flag_domain(cpu, SD_NUMA);
        rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d8dad2972b6..5aed6089dae8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8136,7 +8136,6 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 static inline bool nohz_kick_needed(struct rq *rq)
 {
        unsigned long now = jiffies;
-       struct sched_domain_shared *sds;
        struct sched_domain *sd;
        int nr_busy, cpu = rq->cpu;
        bool kick = false;
@@ -8165,13 +8164,13 @@ static inline bool nohz_kick_needed(struct rq *rq)
                return true;
 
        rcu_read_lock();
-       sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-       if (sds) {
+       sd = rcu_dereference(per_cpu(sd_llc, cpu));
+       if (sd) {
                /*
                 * XXX: write a coherent comment on why we do this.
                 * See also: http:lkml.kernel.org/r/20111202010832.602203411@sbsiddha-desk.sc.intel.com
                 */
-               nr_busy = atomic_read(&sds->nr_busy_cpus);
+               nr_busy = atomic_read(&sd->shared->nr_busy_cpus);
                if (nr_busy > 1) {
                        kick = true;
                        goto unlock;

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-16 15:31   ` Dietmar Eggemann
@ 2016-05-16 17:02     ` Peter Zijlstra
  2016-05-16 17:26       ` Dietmar Eggemann
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-16 17:02 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, linux-kernel, clm, matt, mgalbraith, tglx, fweisbec

On Mon, May 16, 2016 at 04:31:08PM +0100, Dietmar Eggemann wrote:
> On 09/05/16 11:48, Peter Zijlstra wrote:
> 
> Couldn't you just always access sd->shared via
> sd = rcu_dereference(per_cpu(sd_llc, cpu)) for
> updating nr_busy_cpus?

Sure; but why would I want to add that extra dereference? Note that in
the next patch I add more users of sd_llc_shared.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-16 17:02     ` Peter Zijlstra
@ 2016-05-16 17:26       ` Dietmar Eggemann
  0 siblings, 0 replies; 39+ messages in thread
From: Dietmar Eggemann @ 2016-05-16 17:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, clm, matt, mgalbraith, tglx, fweisbec

On 16/05/16 18:02, Peter Zijlstra wrote:
> On Mon, May 16, 2016 at 04:31:08PM +0100, Dietmar Eggemann wrote:
>> On 09/05/16 11:48, Peter Zijlstra wrote:
>>
>> Couldn't you just always access sd->shared via
>> sd = rcu_dereference(per_cpu(sd_llc, cpu)) for
>> updating nr_busy_cpus?
> 
> Sure; but why would I want to add that extra dereference? Note that in
> the next patch I add more users of sd_llc_shared.
> 

I see ... I suggested it because you already do this in
set_cpu_sd_state_[busy|idle]. But there you need the sd reference anyway,
to set sd->nohz_idle.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-16 14:00                   ` Peter Zijlstra
@ 2016-05-17 10:20                     ` Peter Zijlstra
  2016-05-17 10:52                       ` Srikar Dronamraju
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-17 10:20 UTC (permalink / raw)
  To: Michael Neuling
  Cc: Matt Fleming, mingo, linux-kernel, clm, mgalbraith, tglx,
	fweisbec, srikar, anton, oliver, Shreyas B. Prabhu

On Mon, May 16, 2016 at 04:00:32PM +0200, Peter Zijlstra wrote:
> On Fri, May 13, 2016 at 10:12:26AM +1000, Michael Neuling wrote:
> 
> > > Basically; and if so, if its cheap enough to shoot a task to an idle
> > > core to avoid queueing. Assuming there still is some cache residency on
> > > the old core, the inter-core fill should be much cheaper than fetching
> > > it off package (either remote cache or dram).
> > 
> > So I think that will apply on POWER8.
> > 
> > In 10.4.2 it says "The L3.1 ECO Caches will be snooped and provide
> > intervention data similar to the L2 and L3.0 caches on the
> > chip"  That should be much faster than going to another chip or DIMM.
> > 
> > So migrating to another core on the same chip should be faster than off
> > chip.
> 
> OK; so something like the below might be what you want to play with.
> 
> ---

Maybe even something like so; which would make Power <= 6 use the
default topology and result in a shared LLC between the on package cores
etc..

Power7 is then special for not having a shared L3 but having the
asymmetric SMT thing and Power8 again gains the shared L3 while
retaining the asymmetric SMT stuff.

---
 arch/powerpc/kernel/smp.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 55c924b65f71..09ce422fc29c 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -782,6 +782,23 @@ static struct sched_domain_topology_level powerpc_topology[] = {
 	{ NULL, },
 };
 
+static struct sched_domain_topology_level powerpc8_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_MC
+	/*
+	 * Model the L3.1 cache and set the LLC as the whole package.
+	 *
+	 * This also ensures we try and move woken tasks to idle cores inside
+	 * the package to avoid queueing.
+	 */
+	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	cpumask_var_t old_mask;
@@ -806,7 +823,22 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 	dump_numa_cpu_topology();
 
-	set_sched_topology(powerpc_topology);
+	if (cpu_has_feature(CPU_FTRS_POWER7)) {
+		/*
+		 * Power7 topology is special because it doesn't have
+		 * a shared L3 between cores and has the ASYM(metric)
+		 * SMT thing.
+		 */
+		set_sched_topology(powerpc_topology);
+	}
+
+	if (cpu_has_feature(CPU_FTRS_POWER8)) {
+		/*
+		 * Power8 differs from Power7 in that it does sort-of
+		 * have a shared L3 again.
+		 */
+		set_sched_topology(powerpc8_topology);
+	}
 
 }
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-17 10:20                     ` Peter Zijlstra
@ 2016-05-17 10:52                       ` Srikar Dronamraju
  2016-05-17 11:15                         ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Srikar Dronamraju @ 2016-05-17 10:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Neuling, Matt Fleming, mingo, linux-kernel, clm,
	mgalbraith, tglx, fweisbec, anton, oliver, Shreyas B. Prabhu

> 
> Maybe even something like so; which would make Power <= 6 use the
> default topology and result in a shared LLC between the on package cores
> etc..
> 
> Power7 is then special for not having a shared L3 but having the
> asymmetric SMT thing and Power8 again gains the shared L3 while
> retaining the asymmetric SMT stuff.


Asymmetric SMT is only for Power 7. It's not enabled for Power 8.
On Power8, if there are only 2 active threads (say thread 0 and thread
4 while running in SMT8 mode), the core would internally run as if it
were in SMT2 mode without any scheduler tweaks. (This is unlike Power 7.)

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-05-17 10:52                       ` Srikar Dronamraju
@ 2016-05-17 11:15                         ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-17 11:15 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Michael Neuling, Matt Fleming, mingo, linux-kernel, clm,
	mgalbraith, tglx, fweisbec, anton, oliver, Shreyas B. Prabhu

On Tue, May 17, 2016 at 04:22:04PM +0530, Srikar Dronamraju wrote:
> > 
> > Maybe even something like so; which would make Power <= 6 use the
> > default topology and result in a shared LLC between the on package cores
> > etc..
> > 
> > Power7 is then special for not having a shared L3 but having the
> > asymmetric SMT thing and Power8 again gains the shared L3 while
> > retaining the asymmetric SMT stuff.
> 
> 
> Asymmetric SMT is only for Power 7. Its not enabled for Power 8.
> On Power8, if there are only 2 active threads assume thread 0 and thread
> 4 running in smt 8 mode, it would internally run as if running in smt 2
> mode without any scheduler tweaks. (This is unlike power 7).

Ah, good to know. In that case, you can use the generic topology for
everything except power7. IOW, just make that existing
set_sched_topology() call depend on CPU_FTRS_POWER7 or so (if that is
the correct magic incantation).
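
Something like the following, I suppose (placement in smp_cpus_done()
assumed, and whether CPU_FTRS_POWER7 is the right test is exactly the
incantation I'm unsure about):

	/* Only Power7 keeps the special no-shared-L3 + ASYM SMT topology. */
	if (cpu_has_feature(CPU_FTRS_POWER7))
		set_sched_topology(powerpc_topology);
	/* everybody else falls back to the default (cache sharing) topology */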

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC][PATCH 8/7] sched/fair: Use utilization distance to filter affine sync wakeups
  2016-05-09 10:48 [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Peter Zijlstra
                   ` (8 preceding siblings ...)
  2016-05-11 14:19 ` Chris Mason
@ 2016-05-18  5:51 ` Mike Galbraith
  2016-05-19 21:43   ` Rik van Riel
  2016-05-25 14:51 ` [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Chris Mason
  10 siblings, 1 reply; 39+ messages in thread
From: Mike Galbraith @ 2016-05-18  5:51 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, linux-kernel; +Cc: clm, matt, tglx, fweisbec

On Mon, 2016-05-09 at 12:48 +0200, Peter Zijlstra wrote:
> Hai,

(got some of the frozen variety handy?:)

> here be a semi coherent patch series for the recent select_idle_siblings()
> tinkering. Happy benchmarking..

And tinkering on top of your rewrite series...

sched/fair: Use utilization distance to filter affine sync wakeups

Identifying truly synchronous tasks accurately is annoyingly fragile,
which led to the demise of the old avg_overlap heuristic, and which means
we now schedule high frequency localhost communicating buddies to L3 vs
L2, causing them to take painful cache misses needlessly.

To combat this, track average utilization distance, and when both
waker/wakee are short duration tasks cycling at the ~same frequency
(ie can't have any appreciable reclaimable overlap), and the sync
hint has been passed, take that as a cue that pulling the wakee
to hot L2 is very likely to be a win.  Changes in behavior, such
as taking a long nap, bursts of other activity, or sharing the rq
with tasks that are not cycling rapidly will quickly encourage the
pair to search for a new home, where they can again find each other.

This only helps really fast movers, but that's ok (if we can get
away with it at all), as these are the ones that need some help.

It's dirt simple, cheap, and seems to work pretty well.  It does
help fast movers, does not wreck lmbench AF_UNIX/TCP throughput
gains that select_idle_sibling() provided, and didn't change pgbench
numbers one bit on my desktop box, ie tight discrimination criteria
seems to work out ok in light testing, so _maybe_ not completely
useless...
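
To make the filter concrete (numbers below are purely illustrative, not
measured): with waker util_dist_us = 4 and wakee = 5, max + min = 9, under
the 10us cap, and max - min = 1, under the 2us cap, so select_idle_sync()
short-circuits the idle search and the wakee gets pulled to the waker's
hot L2.  If the wakee then naps for, say, 50ms, a single update_avg() call
drags its average out to roughly 6000us, the sum blows way past the 10us
cutoff, and subsequent wakeups fall back to the normal idle search until
the pair settles back into lockstep.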
 
4 x E7-8890 tbench
Throughput 598.158 MB/sec  1 clients  1 procs  max_latency=0.287 ms         1.000
Throughput 1166.26 MB/sec  2 clients  2 procs  max_latency=0.076 ms         1.000
Throughput 2214.55 MB/sec  4 clients  4 procs  max_latency=0.087 ms         1.000
Throughput 4264.44 MB/sec  8 clients  8 procs  max_latency=0.164 ms         1.000
Throughput 7780.58 MB/sec  16 clients  16 procs  max_latency=0.109 ms       1.000
Throughput 15199.3 MB/sec  32 clients  32 procs  max_latency=0.293 ms       1.000
Throughput 21714.8 MB/sec  64 clients  64 procs  max_latency=0.872 ms       1.000
Throughput 44916.1 MB/sec  128 clients  128 procs  max_latency=4.821 ms     1.000
Throughput 76294.5 MB/sec  256 clients  256 procs  max_latency=7.375 ms     1.000

+IDLE_SYNC
Throughput 737.781 MB/sec  1 clients  1 procs  max_latency=0.248 ms         1.233
Throughput 1478.49 MB/sec  2 clients  2 procs  max_latency=0.321 ms         1.267
Throughput 2506.98 MB/sec  4 clients  4 procs  max_latency=0.413 ms         1.132
Throughput 4359.15 MB/sec  8 clients  8 procs  max_latency=0.306 ms         1.022
Throughput 9025.05 MB/sec  16 clients  16 procs  max_latency=0.349 ms       1.159
Throughput 18703.1 MB/sec  32 clients  32 procs  max_latency=0.290 ms       1.230
Throughput 33600.8 MB/sec  64 clients  64 procs  max_latency=6.469 ms       1.547
Throughput 59084.3 MB/sec  128 clients  128 procs  max_latency=5.031 ms     1.315
Throughput 75705.8 MB/sec  256 clients  256 procs  max_latency=24.113 ms    0.992

1 x i4790 lmbench3
*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
IDLE_CORE+IDLE_CPU+IDLE_SMT
homer     4.6.0-masterx 6027 14.K 9773 8905.2  15.2K  10.1K 6775.0 15.K 10.0K
homer     4.6.0-masterx 5962 14.K 9881 8900.7  15.0K  10.1K 6785.2 15.K 10.0K
homer     4.6.0-masterx 5935 14.K 9917 8946.2  15.0K  10.1K 6761.8 15.K 9826.
+IDLE_SYNC
homer     4.6.0-masterx 8865 14.K 9807 8880.6  14.7K  10.1K 6777.9 15.K 9966.
homer     4.6.0-masterx 8855 13.K 9856 8844.5  15.2K  10.1K 6752.1 15.K 10.0K
homer     4.6.0-masterx 8896 14.K 9836 8880.1  15.0K  10.2K 6771.6 15.K 9941.
                        ^++  ^+-  ^+-
select_idle_sibling() completely disabled
homer     4.6.0-masterx 8810 9807 7109 8982.8  15.4K  10.2K 6831.7 15.K 10.1K
homer     4.6.0-masterx 8877 9757 6864 8970.1  15.3K  10.2K 6826.6 15.K 10.1K
homer     4.6.0-masterx 8779 9736 10.K 8975.6  15.4K  10.1K 6830.2 15.K 10.1K
                        ^++  ^--  ^--

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 include/linux/sched.h   |    2 -
 kernel/sched/core.c     |    6 -----
 kernel/sched/fair.c     |   49 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/features.h |    1 
 kernel/sched/sched.h    |    7 ++++++
 5 files changed, 51 insertions(+), 14 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1302,7 +1302,7 @@ struct load_weight {
  * issues.
  */
 struct sched_avg {
-	u64 last_update_time, load_sum;
+	u64 last_update_time, load_sum, util_dist_us;
 	u32 util_sum, period_contrib;
 	unsigned long load_avg, util_avg;
 };
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1706,12 +1706,6 @@ int select_task_rq(struct task_struct *p
 	return cpu;
 }
 
-static void update_avg(u64 *avg, u64 sample)
-{
-	s64 diff = sample - *avg;
-	*avg += diff >> 3;
-}
-
 #else
 
 static inline int __set_cpus_allowed_ptr(struct task_struct *p,
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -658,7 +658,7 @@ static u64 sched_vslice(struct cfs_rq *c
 }
 
 #ifdef CONFIG_SMP
-static int select_idle_sibling(struct task_struct *p, int cpu);
+static int select_idle_sibling(struct task_struct *p, int cpu, int sync);
 static unsigned long task_h_load(struct task_struct *p);
 
 /*
@@ -689,6 +689,7 @@ void init_entity_runnable_average(struct
 	 */
 	sa->util_avg = 0;
 	sa->util_sum = 0;
+	sa->util_dist_us = USEC_PER_SEC;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
 
@@ -1509,7 +1510,7 @@ static void task_numa_compare(struct tas
 		 * can be used from IRQ context.
 		 */
 		local_irq_disable();
-		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu, 0);
 		local_irq_enable();
 	}
 
@@ -2734,6 +2735,13 @@ __update_load_avg(u64 now, int cpu, stru
 		return 0;
 	sa->last_update_time = now;
 
+	/*
+	 * Track avg utilization distance for select_idle_sibling() to try
+	 * to identify short running synchronous tasks that will perform
+	 * much better when migrated to tasty hot L2 data.
+	 */
+	update_avg(&sa->util_dist_us, min(delta, (u64)USEC_PER_SEC));
+
 	scale_freq = arch_scale_freq_capacity(NULL, cpu);
 	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
 
@@ -5408,17 +5416,43 @@ static int select_idle_cpu(struct task_s
 	return cpu;
 }
 
+static int select_idle_sync(struct task_struct *p, int target)
+{
+	u64 max = current->se.avg.util_dist_us;
+	u64 min = p->se.avg.util_dist_us;
+
+	if (min > max)
+		swap(min, max);
+
+	/*
+	 * If waker/wakee are both short duration and very similar,
+	 * prefer L2.  If either spends much time waiting or changes
+	 * behavior, utilization distances should quickly increase,
+	 * rendering them ineligible.
+	 */
+	if (max + min > 10 || max - min > 2)
+		return -1;
+	return target;
+}
+
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int target, int sync)
 {
 	struct sched_domain *sd;
-	int start, i = task_cpu(p);
+	int start, i;
 
 	if (idle_cpu(target))
 		return target;
 
+	if (sched_feat(IDLE_SYNC) && sync) {
+		i = select_idle_sync(p, target);
+		if (i != -1)
+			return i;
+	}
+
+	i = task_cpu(p);
 	/*
 	 * If the previous cpu is cache affine and idle, don't be stupid.
 	 */
@@ -5573,9 +5607,10 @@ select_task_rq_fair(struct task_struct *
 	}
 
 	if (!sd) {
-		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
-			new_cpu = select_idle_sibling(p, new_cpu);
-
+		if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
+			sync &= want_affine;
+			new_cpu = select_idle_sibling(p, new_cpu, sync);
+		}
 	} else while (sd) {
 		struct sched_group *group;
 		int weight;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -78,4 +78,5 @@ SCHED_FEAT(AVG_CPU, true)
 SCHED_FEAT(PRINT_AVG, false)
 
 SCHED_FEAT(IDLE_SMT, true)
+SCHED_FEAT(IDLE_SYNC, true)
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1848,3 +1848,10 @@ static inline void update_idle_core(stru
 #else
 static inline void update_idle_core(struct rq *rq) { }
 #endif
+
+static inline void update_avg(u64 *avg, u64 sample)
+{
+	s64 diff = sample - *avg;
+	*avg += diff >> 3;
+}
+
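
As with the other feature bits, this should be flippable at runtime on a
CONFIG_SCHED_DEBUG kernel (standard sched_features interface, nothing
patch-specific assumed here):

  echo NO_IDLE_SYNC > /sys/kernel/debug/sched_features
  echo IDLE_SYNC > /sys/kernel/debug/sched_features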

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 8/7] sched/fair: Use utilization distance to filter affine sync wakeups
  2016-05-18  5:51 ` [RFC][PATCH 8/7] sched/fair: Use utilization distance to filter affine sync wakeups Mike Galbraith
@ 2016-05-19 21:43   ` Rik van Riel
  2016-05-20  2:52     ` Mike Galbraith
  0 siblings, 1 reply; 39+ messages in thread
From: Rik van Riel @ 2016-05-19 21:43 UTC (permalink / raw)
  To: Mike Galbraith, Peter Zijlstra, mingo, linux-kernel
  Cc: clm, matt, tglx, fweisbec

[-- Attachment #1: Type: text/plain, Size: 556 bytes --]

On Wed, 2016-05-18 at 07:51 +0200, Mike Galbraith wrote:
> On Mon, 2016-05-09 at 12:48 +0200, Peter Zijlstra wrote:
> > Hai,
> 
> (got some of the frozen variety handy?:)
> 
> > here be a semi coherent patch series for the recent
> > select_idle_siblings()
> > tinkering. Happy benchmarking..
> 
> And tinkering on top of your rewrite series...
> 
> sched/fair: Use utilization distance to filter affine sync wakeups
> 

Nice. This looks like it could be a lot more robust
than the stuff that was there before.

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 8/7] sched/fair: Use utilization distance to filter affine sync wakeups
  2016-05-19 21:43   ` Rik van Riel
@ 2016-05-20  2:52     ` Mike Galbraith
  0 siblings, 0 replies; 39+ messages in thread
From: Mike Galbraith @ 2016-05-20  2:52 UTC (permalink / raw)
  To: Rik van Riel, Peter Zijlstra, mingo, linux-kernel
  Cc: clm, matt, tglx, fweisbec

On Thu, 2016-05-19 at 17:43 -0400, Rik van Riel wrote:
> On Wed, 2016-05-18 at 07:51 +0200, Mike Galbraith wrote:
> > On Mon, 2016-05-09 at 12:48 +0200, Peter Zijlstra wrote:
> > > Hai,
> > 
> > (got some of the frozen variety handy?:)
> > 
> > > here be a semi coherent patch series for the recent
> > > select_idle_siblings()
> > > tinkering. Happy benchmarking..
> > 
> > And tinkering on top of your rewrite series...
> > 
> > sched/fair: Use utilization distance to filter affine sync wakeups
> > 
> 
> Nice. This looks like it could be a lot more robust
> than the stuff that was there before.

That's the hope.  I kinda doubt it'll make any big difference in the
real world, but should help some.. wild theory being that the
microbenchmarks it definitely does help were based at least in part
upon some aspect of real world loads.

	-Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 0/7] sched: select_idle_siblings rewrite
  2016-05-09 10:48 [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Peter Zijlstra
                   ` (9 preceding siblings ...)
  2016-05-18  5:51 ` [RFC][PATCH 8/7] sched/fair: Use utilization distance to filter affine sync wakeups Mike Galbraith
@ 2016-05-25 14:51 ` Chris Mason
  2016-05-25 16:24   ` Peter Zijlstra
  10 siblings, 1 reply; 39+ messages in thread
From: Chris Mason @ 2016-05-25 14:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, matt, mgalbraith, tglx, fweisbec

[-- Attachment #1: Type: text/plain, Size: 5753 bytes --]

On Mon, May 09, 2016 at 12:48:07PM +0200, Peter Zijlstra wrote:
> Hai,
> 
> here be a semi coherent patch series for the recent select_idle_siblings()
> tinkering. Happy benchmarking..

This took a while, mostly because my original schbench showed your
patches were just as fast as our internal patch, but our production
benchmark showed a 5-10% regression.  My theory was just that schbench
wasn't fine grained enough, and the load would stop before the
regression kicked in.

So, I added a -R (--rps) mode to schbench.  The goal was a single
command line that ran through an increasing workload and captured the 5%
performance regression from Peter's patch.

The new code is in schbench git:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git

In RPS mode each messenger thread queues up requests for the workers.
It records the latency from when the request was queued to when
it was completed.  Since the request is just spinning for a specific
amount of wall time, the spintime is subtracted before printing.

This means the printed number is our overhead.  It could be scheduling
delay, or a context switch in the middle of processing the request.

To help normalize things, I also calculate and print this latency
divided by the spin time.  This lets me figure out when to stop the auto
mode, and I found it a faster way to compare results from two runs.
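
In other words, each request ends up contributing roughly this (names are
illustrative, not schbench's actual variables):

	/* Illustrative sketch only -- not schbench source. */
	u64 overhead_usec = complete_time - queue_time - spin_time;
	/*
	 * The table prints the p95/p99 of overhead_usec, plus that same
	 * percentile expressed as a percentage of spin_time (the
	 * "p95/cputime" and "p99/cputime" columns).
	 */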

Longer runs give very consistent results, except when you first turn
AVG_CPU back on.  It took me a while to figure out Peter's math was
screwing with me, and each run would be a little slower than the last
until it stabilized.

Ok, now for results.  My test boxes have two sockets, 24 cores per
socket.  So I'm running with two messenger threads and 23 workers per
messenger.  -R is the requests per second rate, -R 1 means: guess a good
starting point based on the cputime.

You can pass -R <number>, and you can use -R without -a if you want to
focus on a specific spot.
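
For example (rate borrowed from the sweet-spot line further down, otherwise
arbitrary), parking on a single load point instead of sweeping would be:

# ./schbench -t 23 -R 1345 -m 2 -r 180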

# NO_AVG_CPU
# ./schbench -t 23 -R 1 -m 2 -r 180 -a
autobench rps 897
rps: 897.08 p95 (usec) 50 p99 (usec) 56 p95/cputime 0.17% p99/cputime 0.19%
rps: 926.88 p95 (usec) 50 p99 (usec) 56 p95/cputime 0.17% p99/cputime 0.19%
rps: 956.77 p95 (usec) 50 p99 (usec) 55 p95/cputime 0.17% p99/cputime 0.18%
rps: 986.80 p95 (usec) 45 p99 (usec) 51 p95/cputime 0.15% p99/cputime 0.17%
rps: 1047.09 p95 (usec) 1362 p99 (usec) 1386 p95/cputime 4.54% p99/cputime 4.62%
rps: 1076.93 p95 (usec) 2156 p99 (usec) 2180 p95/cputime 7.19% p99/cputime 7.27%
rps: 1106.56 p95 (usec) 2916 p99 (usec) 3324 p95/cputime 9.72% p99/cputime 11.08%
rps: 1136.57 p95 (usec) 3620 p99 (usec) 3644 p95/cputime 12.07% p99/cputime 12.15%
rps: 1196.22 p95 (usec) 4936 p99 (usec) 4952 p95/cputime 16.45% p99/cputime 16.51%
rps: 1226.04 p95 (usec) 5544 p99 (usec) 5576 p95/cputime 18.48% p99/cputime 18.59%
rps: 1255.90 p95 (usec) 6120 p99 (usec) 6152 p95/cputime 20.40% p99/cputime 20.51%
rps: 1315.50 p95 (usec) 7208 p99 (usec) 7240 p95/cputime 24.03% p99/cputime 24.13%
rps: 1345.27 p95 (usec) 7720 p99 (usec) 7736 p95/cputime 25.73% p99/cputime 25.79%
rps: 1404.84 p95 (usec) 8656 p99 (usec) 8688 p95/cputime 28.85% p99/cputime 28.96%
rps: 1434.59 p95 (usec) 9104 p99 (usec) 9136 p95/cputime 30.35% p99/cputime 30.45%
rps: 1464.29 p95 (usec) 9520 p99 (usec) 9552 p95/cputime 31.73% p99/cputime 31.84%
rps: 1515.09 p95 (usec) 15152 p99 (usec) 16336 p95/cputime 50.51% p99/cputime 54.45%
rps: 1532.98 p95 (usec) 1722368 p99 (usec) 1800192 p95/cputime 5741.23% p99/cputime 6000.64%

# AVG_CPU
# ./schbench -t 23 -R 1 -m 2 -r 180 -a
autobench rps 897
rps: 897.15 p95 (usec) 49 p99 (usec) 55 p95/cputime 0.16% p99/cputime 0.18%
rps: 926.94 p95 (usec) 50 p99 (usec) 56 p95/cputime 0.17% p99/cputime 0.19%
rps: 956.78 p95 (usec) 50 p99 (usec) 55 p95/cputime 0.17% p99/cputime 0.18%
rps: 986.79 p95 (usec) 45 p99 (usec) 51 p95/cputime 0.15% p99/cputime 0.17%
rps: 1044.92 p95 (usec) 1374 p99 (usec) 1998 p95/cputime 4.58% p99/cputime 6.66%
rps: 1074.93 p95 (usec) 2164 p99 (usec) 2636 p95/cputime 7.21% p99/cputime 8.79%
rps: 1103.85 p95 (usec) 2916 p99 (usec) 3404 p95/cputime 9.72% p99/cputime 11.35%
rps: 1133.10 p95 (usec) 3628 p99 (usec) 4136 p95/cputime 12.09% p99/cputime 13.79%
rps: 1193.01 p95 (usec) 4936 p99 (usec) 5320 p95/cputime 16.45% p99/cputime 17.73%
rps: 1222.18 p95 (usec) 5544 p99 (usec) 5976 p95/cputime 18.48% p99/cputime 19.92%
rps: 1251.14 p95 (usec) 6136 p99 (usec) 6504 p95/cputime 20.45% p99/cputime 21.68%
rps: 1309.63 p95 (usec) 7224 p99 (usec) 7752 p95/cputime 24.08% p99/cputime 25.84%
rps: 1336.49 p95 (usec) 7720 p99 (usec) 8496 p95/cputime 25.73% p99/cputime 28.32%
rps: 1398.47 p95 (usec) 8656 p99 (usec) 9200 p95/cputime 28.85% p99/cputime 30.67%
rps: 1429.66 p95 (usec) 9104 p99 (usec) 9488 p95/cputime 30.35% p99/cputime 31.63%
rps: 1459.88 p95 (usec) 9520 p99 (usec) 9968 p95/cputime 31.73% p99/cputime 33.23%
rps: 1487.82 p95 (usec) 9936 p99 (usec) 11024 p95/cputime 33.12% p99/cputime 36.75%
rps: 1532.85 p95 (usec) 1550336 p99 (usec) 1628160 p95/cputime 5167.79% p99/cputime 5427.20%


Jumping into the sweet spot in the run:
NO_AVG_CPU rps: 1345.27 p95 (usec) 7720 p99 (usec) 7736 p95/cputime 25.73% p99/cputime 25.79%
   AVG_CPU rps: 1336.49 p95 (usec) 7720 p99 (usec) 8496 p95/cputime 25.73% p99/cputime 28.32%

p95 times are fine, but p99 shows a ~9% difference.  Running the
same workloads on my patch gives the same numbers as NO_AVG_CPU.  I've
attached a png to help show the difference between NO_AVG_CPU and
AVG_CPU.

Auto mode ends when the p99 ratios go crazy.  I wouldn't put too much
value in the last few lines, the machine is so saturated that anything
could be happening.

Hope this helps, I'm going to keep tweaking schbench as we compare it
with the production runs here.

-chris

[-- Attachment #2: p99.png --]
[-- Type: application/octet-stream, Size: 13699 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 0/7] sched: select_idle_siblings rewrite
  2016-05-25 14:51 ` [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Chris Mason
@ 2016-05-25 16:24   ` Peter Zijlstra
  2016-05-25 17:11     ` Chris Mason
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-25 16:24 UTC (permalink / raw)
  To: Chris Mason, mingo, linux-kernel, matt, mgalbraith, tglx, fweisbec

On Wed, May 25, 2016 at 10:51:20AM -0400, Chris Mason wrote:
> On Mon, May 09, 2016 at 12:48:07PM +0200, Peter Zijlstra wrote:
> > Hai,
> > 
> > here be a semi coherent patch series for the recent select_idle_siblings()
> > tinkering. Happy benchmarking..
> 
> This took a while, mostly because my original schbench showed your
> patches were just as fast as our internal patch, but our production
> benchmark showed a 5-10% regression.  My theory was just that schbench
> wasn't fine grained enough, and the load would stop before the
> regression kicked in.

Just to be sure; regression as per your production kernel, not vs.
mainline, right?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 0/7] sched: select_idle_siblings rewrite
  2016-05-25 16:24   ` Peter Zijlstra
@ 2016-05-25 17:11     ` Chris Mason
  0 siblings, 0 replies; 39+ messages in thread
From: Chris Mason @ 2016-05-25 17:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, matt, mgalbraith, tglx, fweisbec

On Wed, May 25, 2016 at 06:24:49PM +0200, Peter Zijlstra wrote:
>On Wed, May 25, 2016 at 10:51:20AM -0400, Chris Mason wrote:
>> On Mon, May 09, 2016 at 12:48:07PM +0200, Peter Zijlstra wrote:
>> > Hai,
>> >
>> > here be a semi coherent patch series for the recent select_idle_siblings()
>> > tinkering. Happy benchmarking..
>>
>> This took a while, mostly because my original schbench showed your
>> patches were just as fast as our internal patch, but our production
>> benchmark showed a 5-10% regression.  My theory was just that schbench
>> wasn't fine grained enough, and the load would stop before the
>> regression kicked in.
>
>Just to be sure; regression as per your production kernel, not vs.
>mainline, right?

Sorry yes, that's correct.

-chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread

Thread overview: 39+ messages
2016-05-09 10:48 [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Peter Zijlstra
2016-05-09 10:48 ` [RFC][PATCH 1/7] sched: Remove unused @cpu argument from destroy_sched_domain*() Peter Zijlstra
2016-05-09 10:48 ` [RFC][PATCH 2/7] sched: Restructure destroy_sched_domain() Peter Zijlstra
2016-05-09 14:46   ` Peter Zijlstra
2016-05-09 10:48 ` [RFC][PATCH 3/7] sched: Introduce struct sched_domain_shared Peter Zijlstra
2016-05-09 10:48 ` [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared Peter Zijlstra
2016-05-11 11:55   ` Matt Fleming
2016-05-11 12:33     ` Peter Zijlstra
2016-05-11 18:11       ` Peter Zijlstra
2016-05-11 18:24       ` Peter Zijlstra
2016-05-12  2:05         ` Michael Neuling
2016-05-12  5:07           ` Peter Zijlstra
2016-05-12 11:07             ` Michael Neuling
2016-05-12 11:33               ` Peter Zijlstra
2016-05-13  0:12                 ` Michael Neuling
2016-05-16 14:00                   ` Peter Zijlstra
2016-05-17 10:20                     ` Peter Zijlstra
2016-05-17 10:52                       ` Srikar Dronamraju
2016-05-17 11:15                         ` Peter Zijlstra
2016-05-11 17:37     ` Peter Zijlstra
2016-05-11 18:04       ` Matt Fleming
2016-05-16 15:31   ` Dietmar Eggemann
2016-05-16 17:02     ` Peter Zijlstra
2016-05-16 17:26       ` Dietmar Eggemann
2016-05-09 10:48 ` [RFC][PATCH 5/7] sched: Rewrite select_idle_siblings() Peter Zijlstra
2016-05-10 21:05   ` Yuyang Du
2016-05-11  7:00     ` Peter Zijlstra
2016-05-10 23:42       ` Yuyang Du
2016-05-11  7:43         ` Mike Galbraith
2016-05-09 10:48 ` [RFC][PATCH 6/7] sched: Optimize SCHED_SMT Peter Zijlstra
2016-05-09 10:48 ` [RFC][PATCH 7/7] sched: debug muck -- not for merging Peter Zijlstra
2016-05-10  0:50 ` [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Chris Mason
2016-05-11 14:19 ` Chris Mason
2016-05-18  5:51 ` [RFC][PATCH 8/7] sched/fair: Use utilization distance to filter affine sync wakeups Mike Galbraith
2016-05-19 21:43   ` Rik van Riel
2016-05-20  2:52     ` Mike Galbraith
2016-05-25 14:51 ` [RFC][PATCH 0/7] sched: select_idle_siblings rewrite Chris Mason
2016-05-25 16:24   ` Peter Zijlstra
2016-05-25 17:11     ` Chris Mason
