linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 00/11] Refactor select_task_rq_fair()
@ 2016-06-16  1:49 Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 01/11] sched: Remove unused @cpu argument from destroy_sched_domain*() Yuyang Du
                   ` (10 more replies)
  0 siblings, 11 replies; 16+ messages in thread
From: Yuyang Du @ 2016-06-16  1:49 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, Yuyang Du

Hi all,

As a follow-up to my recent proposal to remove SD_WAKE_AFFINE, here is an
initial, unfinished draft of a refactoring of select_task_rq_fair().

The basic idea is to form a comprehensive view of the topology to guide the
selection.

Much of what I want to do has not been done yet, so I decided to post this
initial draft early to gather your comments and experience (I have no
experience with production multi-socket machines and no real experience with
heterogeneous CPUs). In general, though, I believe a better topology view
should greatly help across the full spectrum of machines, from big to small.

It has only been lightly tested; it just about works. :)

My patches are built on top of Peter's select_idle_sibling() patches (so all of
them may be slightly outdated), and I borrowed his scan-cost regulation from them.

Thanks a lot.

Yuyang

--

Peter Zijlstra (6):
  sched: Remove unused @cpu argument from destroy_sched_domain*()
  sched: Restructure destroy_sched_domain()
  sched: Introduce struct sched_domain_shared
  sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  sched: Rewrite select_idle_siblings()
  sched: Optimize SCHED_SMT

Yuyang Du (5):
  sched: Clean up SD_BALANCE_WAKE flags in sched domain build-up
  sched: Remove SD_WAKE_AFFINE flag and replace it with SD_BALANCE_WAKE
  sched: Add per CPU variable sd_socket_id to specify the CPU's socket
  sched: Add sched_llc_complete static key to specify whether the LLC
    covers all CPUs
  sched/fair: Refactor select_task_rq_fair()

 include/linux/sched.h    |   17 +-
 kernel/sched/core.c      |  134 ++++++++---
 kernel/sched/deadline.c  |    2 +-
 kernel/sched/fair.c      |  589 +++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/features.h  |    1 +
 kernel/sched/idle_task.c |    2 +-
 kernel/sched/rt.c        |    2 +-
 kernel/sched/sched.h     |   28 ++-
 kernel/time/tick-sched.c |   10 +-
 9 files changed, 680 insertions(+), 105 deletions(-)

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC PATCH 01/11] sched: Remove unused @cpu argument from destroy_sched_domain*()
  2016-06-16  1:49 [RFC PATCH 00/11] Refactor select_task_rq_fair() Yuyang Du
@ 2016-06-16  1:49 ` Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 02/11] sched: Restructure destroy_sched_domain() Yuyang Du
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Yuyang Du @ 2016-06-16  1:49 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt

From: Peter Zijlstra <peterz@infradead.org>

Small cleanup; nothing uses the @cpu argument so make it go away.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 404c078..ab6e6cf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5850,15 +5850,15 @@ static void free_sched_domain(struct rcu_head *rcu)
 	kfree(sd);
 }
 
-static void destroy_sched_domain(struct sched_domain *sd, int cpu)
+static void destroy_sched_domain(struct sched_domain *sd)
 {
 	call_rcu(&sd->rcu, free_sched_domain);
 }
 
-static void destroy_sched_domains(struct sched_domain *sd, int cpu)
+static void destroy_sched_domains(struct sched_domain *sd)
 {
 	for (; sd; sd = sd->parent)
-		destroy_sched_domain(sd, cpu);
+		destroy_sched_domain(sd);
 }
 
 /*
@@ -5930,7 +5930,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 			 */
 			if (parent->flags & SD_PREFER_SIBLING)
 				tmp->flags |= SD_PREFER_SIBLING;
-			destroy_sched_domain(parent, cpu);
+			destroy_sched_domain(parent);
 		} else
 			tmp = tmp->parent;
 	}
@@ -5938,7 +5938,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	if (sd && sd_degenerate(sd)) {
 		tmp = sd;
 		sd = sd->parent;
-		destroy_sched_domain(tmp, cpu);
+		destroy_sched_domain(tmp);
 		if (sd)
 			sd->child = NULL;
 	}
@@ -5948,7 +5948,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	rq_attach_root(rq, rd);
 	tmp = rq->sd;
 	rcu_assign_pointer(rq->sd, sd);
-	destroy_sched_domains(tmp, cpu);
+	destroy_sched_domains(tmp);
 
 	update_top_cache_domain(cpu);
 }
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 02/11] sched: Restructure destroy_sched_domain()
  2016-06-16  1:49 [RFC PATCH 00/11] Refactor select_task_rq_fair() Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 01/11] sched: Remove unused @cpu argument from destroy_sched_domain*() Yuyang Du
@ 2016-06-16  1:49 ` Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 03/11] sched: Introduce struct sched_domain_shared Yuyang Du
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Yuyang Du @ 2016-06-16  1:49 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt

From: Peter Zijlstra <peterz@infradead.org>

There is no point in doing a call_rcu() for each domain; only do a
callback for the root sched domain and clean up the entire set in one
go.

Also rename the entire call chain to destroy_sched_domain*() to
remove confusion with the free_sched_domains() call, which does an
entirely different thing.

Both cpu_attach_domain() callers of destroy_sched_domain() can live
without the call_rcu() because at those points the sched_domain hasn't
been published yet.

This one seems to work much better; so much for last minute 'cleanups'.
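
For reference, a minimal sketch (not part of the patch) of the publish-then-retire
ordering in cpu_attach_domain() that makes a single grace period sufficient: the
new hierarchy is published first, and only the root's rcu_head is needed because
the callback walks ->parent itself.

  struct sched_domain *old = rq->sd;

  rcu_assign_pointer(rq->sd, sd);   /* readers now see the new hierarchy */
  destroy_sched_domains(old);       /* one call_rcu() retires old and all its parents */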
---
 kernel/sched/core.c |   18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ab6e6cf..c4596e0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5833,10 +5833,8 @@ static void free_sched_groups(struct sched_group *sg, int free_sgc)
 	} while (sg != first);
 }
 
-static void free_sched_domain(struct rcu_head *rcu)
+static void destroy_sched_domain(struct sched_domain *sd)
 {
-	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
-
 	/*
 	 * If its an overlapping domain it has private groups, iterate and
 	 * nuke them all.
@@ -5850,15 +5848,21 @@ static void free_sched_domain(struct rcu_head *rcu)
 	kfree(sd);
 }
 
-static void destroy_sched_domain(struct sched_domain *sd)
+static void destroy_sched_domains_rcu(struct rcu_head *rcu)
 {
-	call_rcu(&sd->rcu, free_sched_domain);
+	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
+
+	while (sd) {
+		struct sched_domain *parent = sd->parent;
+		destroy_sched_domain(sd);
+		sd = parent;
+	}
 }
 
 static void destroy_sched_domains(struct sched_domain *sd)
 {
-	for (; sd; sd = sd->parent)
-		destroy_sched_domain(sd);
+	if (sd)
+		call_rcu(&sd->rcu, destroy_sched_domains_rcu);
 }
 
 /*
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 03/11] sched: Introduce struct sched_domain_shared
  2016-06-16  1:49 [RFC PATCH 00/11] Refactor select_task_rq_fair() Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 01/11] sched: Remove unused @cpu argument from destroy_sched_domain*() Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 02/11] sched: Restructure destroy_sched_domain() Yuyang Du
@ 2016-06-16  1:49 ` Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 04/11] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared Yuyang Du
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Yuyang Du @ 2016-06-16  1:49 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt

From: Peter Zijlstra <peterz@infradead.org>

Since struct sched_domain is strictly per CPU, introduce a structure
that is shared between all 'identical' sched_domains.

Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it
for shared cache state; if another use comes up later we can easily
relax this.

While the sched_groups are normally shared between CPUs, they are not
natural to use when we need some shared state on a domain level --
since that would require the domain to have a parent, which is not a
given.
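
To make the ownership rule explicit, here is the reference-counting pattern,
condensed from the diff below; one sched_domain_shared object exists per
SD_SHARE_PKG_RESOURCES span, keyed by the first CPU of that span (sd_id):

  /* attach (sd_init()): every CPU's LLC-level domain points at the same object */
  sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
  atomic_inc(&sd->shared->ref);

  /* detach (destroy_sched_domain()): the last reference frees the shared state */
  if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
          kfree(sd->shared);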

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    6 ++++++
 kernel/sched/core.c   |   40 ++++++++++++++++++++++++++++++++++------
 2 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1b43b45..f1233f54a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1057,6 +1057,10 @@ extern int sched_domain_level_max;
 
 struct sched_group;
 
+struct sched_domain_shared {
+	atomic_t	ref;
+};
+
 struct sched_domain {
 	/* These fields must be setup */
 	struct sched_domain *parent;	/* top domain must be null terminated */
@@ -1125,6 +1129,7 @@ struct sched_domain {
 		void *private;		/* used during construction */
 		struct rcu_head rcu;	/* used during destruction */
 	};
+	struct sched_domain_shared *shared;
 
 	unsigned int span_weight;
 	/*
@@ -1158,6 +1163,7 @@ typedef int (*sched_domain_flags_f)(void);
 
 struct sd_data {
 	struct sched_domain **__percpu sd;
+	struct sched_domain_shared **__percpu sds;
 	struct sched_group **__percpu sg;
 	struct sched_group_capacity **__percpu sgc;
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c4596e0..59f8bf1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5845,6 +5845,8 @@ static void destroy_sched_domain(struct sched_domain *sd)
 		kfree(sd->groups->sgc);
 		kfree(sd->groups);
 	}
+	if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
+		kfree(sd->shared);
 	kfree(sd);
 }
 
@@ -6283,6 +6285,9 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
 	WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
 	*per_cpu_ptr(sdd->sd, cpu) = NULL;
 
+	if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
+		*per_cpu_ptr(sdd->sds, cpu) = NULL;
+
 	if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
 		*per_cpu_ptr(sdd->sg, cpu) = NULL;
 
@@ -6318,10 +6323,12 @@ static int sched_domains_curr_level;
 	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
-sd_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl,
+	const struct cpumask *cpu_map, int cpu)
 {
-	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
-	int sd_weight, sd_flags = 0;
+	struct sd_data *sdd = &tl->data;
+	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
+	int sd_id, sd_weight, sd_flags = 0;
 
 #ifdef CONFIG_NUMA
 	/*
@@ -6375,6 +6382,9 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 #endif
 	};
 
+	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
+	sd_id = cpumask_first(sched_domain_span(sd));
+
 	/*
 	 * Convert topological properties into behaviour.
 	 */
@@ -6389,6 +6399,9 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
 
+		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+		atomic_inc(&sd->shared->ref);
+
 #ifdef CONFIG_NUMA
 	} else if (sd->flags & SD_NUMA) {
 		sd->cache_nice_tries = 2;
@@ -6410,7 +6423,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		sd->idle_idx = 1;
 	}
 
-	sd->private = &tl->data;
+	sd->private = sdd;
 
 	return sd;
 }
@@ -6717,6 +6730,10 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 		if (!sdd->sd)
 			return -ENOMEM;
 
+		sdd->sds = alloc_percpu(struct sched_domain_shared *);
+		if (!sdd->sds)
+			return -ENOMEM;
+
 		sdd->sg = alloc_percpu(struct sched_group *);
 		if (!sdd->sg)
 			return -ENOMEM;
@@ -6727,6 +6744,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 		for_each_cpu(j, cpu_map) {
 			struct sched_domain *sd;
+			struct sched_domain_shared *sds;
 			struct sched_group *sg;
 			struct sched_group_capacity *sgc;
 
@@ -6737,6 +6755,13 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
+			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+					GFP_KERNEL, cpu_to_node(j));
+			if (!sds)
+				return -ENOMEM;
+
+			*per_cpu_ptr(sdd->sds, j) = sds;
+
 			sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sg)
@@ -6776,6 +6801,8 @@ static void __sdt_free(const struct cpumask *cpu_map)
 				kfree(*per_cpu_ptr(sdd->sd, j));
 			}
 
+			if (sdd->sds)
+				kfree(*per_cpu_ptr(sdd->sds, j));
 			if (sdd->sg)
 				kfree(*per_cpu_ptr(sdd->sg, j));
 			if (sdd->sgc)
@@ -6783,6 +6810,8 @@ static void __sdt_free(const struct cpumask *cpu_map)
 		}
 		free_percpu(sdd->sd);
 		sdd->sd = NULL;
+		free_percpu(sdd->sds);
+		sdd->sds = NULL;
 		free_percpu(sdd->sg);
 		sdd->sg = NULL;
 		free_percpu(sdd->sgc);
@@ -6794,11 +6823,10 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		const struct cpumask *cpu_map, struct sched_domain_attr *attr,
 		struct sched_domain *child, int cpu)
 {
-	struct sched_domain *sd = sd_init(tl, cpu);
+	struct sched_domain *sd = sd_init(tl, cpu_map, cpu);
 	if (!sd)
 		return child;
 
-	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
 	if (child) {
 		sd->level = child->level + 1;
 		sched_domain_level_max = max(sched_domain_level_max, sd->level);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 04/11] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  2016-06-16  1:49 [RFC PATCH 00/11] Refactor select_task_rq_fair() Yuyang Du
                   ` (2 preceding siblings ...)
  2016-06-16  1:49 ` [RFC PATCH 03/11] sched: Introduce struct sched_domain_shared Yuyang Du
@ 2016-06-16  1:49 ` Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 05/11] sched: Rewrite select_idle_siblings() Yuyang Du
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Yuyang Du @ 2016-06-16  1:49 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt

From: Peter Zijlstra <peterz@infradead.org>

Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
location into the much more natural sched_domain_shared location.
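
Condensed from the nohz_kick_needed() hunk below, consumers now reach the
busy-CPU count through the new sd_llc_shared per-cpu pointer instead of
chasing sd->parent->groups->sgc:

  struct sched_domain_shared *sds;
  int nr_busy;

  rcu_read_lock();
  sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
  if (sds)
          nr_busy = atomic_read(&sds->nr_busy_cpus);
  rcu_read_unlock();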

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h    |    1 +
 kernel/sched/core.c      |   10 +++++-----
 kernel/sched/fair.c      |   22 ++++++++++++----------
 kernel/sched/sched.h     |    6 +-----
 kernel/time/tick-sched.c |   10 +++++-----
 5 files changed, 24 insertions(+), 25 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f1233f54a..96b371c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1059,6 +1059,7 @@ struct sched_group;
 
 struct sched_domain_shared {
 	atomic_t	ref;
+	atomic_t	nr_busy_cpus;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 59f8bf1..bd313b2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5879,14 +5879,14 @@ static void destroy_sched_domains(struct sched_domain *sd)
 DEFINE_PER_CPU(struct sched_domain *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
-DEFINE_PER_CPU(struct sched_domain *, sd_busy);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
 
 static void update_top_cache_domain(int cpu)
 {
+	struct sched_domain_shared *sds = NULL;
 	struct sched_domain *sd;
-	struct sched_domain *busy_sd = NULL;
 	int id = cpu;
 	int size = 1;
 
@@ -5894,13 +5894,13 @@ static void update_top_cache_domain(int cpu)
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
-		busy_sd = sd->parent; /* sd_busy */
+		sds = sd->shared;
 	}
-	rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
+	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_NUMA);
 	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
@@ -6197,7 +6197,6 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
 		return;
 
 	update_group_capacity(sd, cpu);
-	atomic_set(&sg->sgc->nr_busy_cpus, sg->group_weight);
 }
 
 /*
@@ -6401,6 +6400,7 @@ sd_init(struct sched_domain_topology_level *tl,
 
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
+		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
 
 #ifdef CONFIG_NUMA
 	} else if (sd->flags & SD_NUMA) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 218f8e8..0d8dad2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7886,13 +7886,13 @@ static inline void set_cpu_sd_state_busy(void)
 	int cpu = smp_processor_id();
 
 	rcu_read_lock();
-	sd = rcu_dereference(per_cpu(sd_busy, cpu));
+	sd = rcu_dereference(per_cpu(sd_llc, cpu));
 
 	if (!sd || !sd->nohz_idle)
 		goto unlock;
 	sd->nohz_idle = 0;
 
-	atomic_inc(&sd->groups->sgc->nr_busy_cpus);
+	atomic_inc(&sd->shared->nr_busy_cpus);
 unlock:
 	rcu_read_unlock();
 }
@@ -7903,13 +7903,13 @@ void set_cpu_sd_state_idle(void)
 	int cpu = smp_processor_id();
 
 	rcu_read_lock();
-	sd = rcu_dereference(per_cpu(sd_busy, cpu));
+	sd = rcu_dereference(per_cpu(sd_llc, cpu));
 
 	if (!sd || sd->nohz_idle)
 		goto unlock;
 	sd->nohz_idle = 1;
 
-	atomic_dec(&sd->groups->sgc->nr_busy_cpus);
+	atomic_dec(&sd->shared->nr_busy_cpus);
 unlock:
 	rcu_read_unlock();
 }
@@ -8136,8 +8136,8 @@ end:
 static inline bool nohz_kick_needed(struct rq *rq)
 {
 	unsigned long now = jiffies;
+	struct sched_domain_shared *sds;
 	struct sched_domain *sd;
-	struct sched_group_capacity *sgc;
 	int nr_busy, cpu = rq->cpu;
 	bool kick = false;
 
@@ -8165,11 +8165,13 @@ static inline bool nohz_kick_needed(struct rq *rq)
 		return true;
 
 	rcu_read_lock();
-	sd = rcu_dereference(per_cpu(sd_busy, cpu));
-	if (sd) {
-		sgc = sd->groups->sgc;
-		nr_busy = atomic_read(&sgc->nr_busy_cpus);
-
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		/*
+		 * XXX: write a coherent comment on why we do this.
+		 * See also: http:lkml.kernel.org/r/20111202010832.602203411@sbsiddha-desk.sc.intel.com
+		 */
+		nr_busy = atomic_read(&sds->nr_busy_cpus);
 		if (nr_busy > 1) {
 			kick = true;
 			goto unlock;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e51145e..cdc63d9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -856,8 +856,8 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
-DECLARE_PER_CPU(struct sched_domain *, sd_busy);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
 
 struct sched_group_capacity {
@@ -869,10 +869,6 @@ struct sched_group_capacity {
 	unsigned int capacity;
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
-	/*
-	 * Number of busy cpus in this group.
-	 */
-	atomic_t nr_busy_cpus;
 
 	unsigned long cpumask[0]; /* iteration mask */
 };
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 536ada8..d6535fd 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -933,11 +933,11 @@ void tick_nohz_idle_enter(void)
 	WARN_ON_ONCE(irqs_disabled());
 
 	/*
- 	 * Update the idle state in the scheduler domain hierarchy
- 	 * when tick_nohz_stop_sched_tick() is called from the idle loop.
- 	 * State will be updated to busy during the first busy tick after
- 	 * exiting idle.
- 	 */
+	 * Update the idle state in the scheduler domain hierarchy
+	 * when tick_nohz_stop_sched_tick() is called from the idle loop.
+	 * State will be updated to busy during the first busy tick after
+	 * exiting idle.
+	 */
 	set_cpu_sd_state_idle();
 
 	local_irq_disable();
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 05/11] sched: Rewrite select_idle_siblings()
  2016-06-16  1:49 [RFC PATCH 00/11] Refactor select_task_rq_fair() Yuyang Du
                   ` (3 preceding siblings ...)
  2016-06-16  1:49 ` [RFC PATCH 04/11] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared Yuyang Du
@ 2016-06-16  1:49 ` Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 06/11] sched: Optimize SCHED_SMT Yuyang Du
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Yuyang Du @ 2016-06-16  1:49 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt

From: Peter Zijlstra <peterz@infradead.org>

select_idle_siblings() is a known pain point for a number of
workloads; it either does too much or not enough and sometimes just
gets it plain wrong.

This rewrite attempts to address a number of issues (but sadly not
all).

The current code does an unconditional sched_domain iteration, with
the intent of finding an idle core (on SMT hardware). The problems
this patch tries to address are:

 - it's pointless to look for idle cores if the machine is really busy;
   at that point you're just wasting cycles.

 - its behaviour is inconsistent between SMT and !SMT hardware, in
   that !SMT hardware ends up doing a scan for any idle CPU in the LLC
   domain, while SMT hardware does a scan for idle cores and, if that
   fails, falls back to a scan for idle threads on the 'target' core.

The new code replaces the sched_domain scan with 3 explicit scans:

 1) search for an idle core in the LLC
 2) search for an idle CPU in the LLC
 3) search for an idle thread in the 'target' core

where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
heuristics to skip the step.

Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
siblings of the CPU going idle. Similarly, we clear
sd_llc_shared->has_idle_cores when we fail to find an idle core.

Step 2) tracks the average cost of the scan and compares this to the
average idle time guesstimate for the CPU doing the wakeup. There is a
significant fudge factor involved to deal with the variability of the
averages; hackbench in particular was sensitive to this.

Step 3) is unconditional; we assume (also per step 1) that scanning
all SMT siblings in a core is 'cheap'.

With this, SMT systems gain step 2, which cures a few benchmarks --
notably one from Facebook.

One 'feature' of the sched_domain iteration, which we preserve in the
new code, is that it would start scanning from the 'target' CPU,
instead of scanning the cpumask in cpu id order. This keeps multiple
CPUs in the LLC that are scanning for an idle CPU from ganging up and
finding the same CPU quite so often. The downside is that tasks can
end up hopping across the LLC for no apparent reason.
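
Condensed from the diff below, the resulting fast path (after the existing
idle-target and idle-prev shortcuts) reduces to:

  sd = rcu_dereference(per_cpu(sd_llc, target));
  if (!sd)
          return target;

  i = select_idle_core(p, sd, target);    /* 1) idle core in the LLC (SMT only) */
  if ((unsigned)i < nr_cpumask_bits)
          return i;

  i = select_idle_cpu(p, sd, target);     /* 2) idle CPU in the LLC (cost-regulated) */
  if ((unsigned)i < nr_cpumask_bits)
          return i;

  i = select_idle_smt(p, sd, target);     /* 3) idle sibling of the target core */
  if ((unsigned)i < nr_cpumask_bits)
          return i;

  return target;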

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h    |    3 +
 kernel/sched/core.c      |    3 +
 kernel/sched/fair.c      |  270 ++++++++++++++++++++++++++++++++++++++--------
 kernel/sched/idle_task.c |    4 +-
 4 files changed, 233 insertions(+), 47 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 96b371c..d74e757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1060,6 +1060,7 @@ struct sched_group;
 struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
+	int		has_idle_cores;
 };
 
 struct sched_domain {
@@ -1092,6 +1093,8 @@ struct sched_domain {
 	u64 max_newidle_lb_cost;
 	unsigned long next_decay_max_lb_cost;
 
+	u64 avg_scan_cost;		/* select_idle_sibling */
+
 #ifdef CONFIG_SCHEDSTATS
 	/* load_balance() stats */
 	unsigned int lb_count[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bd313b2..e224581 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7342,6 +7342,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
+DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
 
 void __init sched_init(void)
 {
@@ -7378,6 +7379,8 @@ void __init sched_init(void)
 	for_each_possible_cpu(i) {
 		per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
+		per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
+			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d8dad2..8335ed5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1501,8 +1501,15 @@ balance:
 	 * One idle CPU per node is evaluated for a task numa move.
 	 * Call select_idle_sibling to maybe find a better one.
 	 */
-	if (!cur)
+	if (!cur) {
+		/*
+		 * select_idle_siblings() uses an per-cpu cpumask that
+		 * can be used from IRQ context.
+		 */
+		local_irq_disable();
 		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+		local_irq_enable();
+	}
 
 assign:
 	assigned = true;
@@ -4532,6 +4539,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 }
 
 #ifdef CONFIG_SMP
+
+/* Working cpumask for: load_balance, load_balance_newidle. */
+DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
+DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * per rq 'load' arrray crap; XXX kill this.
@@ -5187,64 +5199,233 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 }
 
 /*
- * Try and locate an idle CPU in the sched_domain.
+ * Implement a for_each_cpu() variant that starts the scan at a given cpu
+ * (@start), and wraps around.
+ *
+ * This is used to scan for idle CPUs; such that not all CPUs looking for an
+ * idle CPU find the same CPU. The down-side is that tasks tend to cycle
+ * through the LLC domain.
+ *
+ * Especially tbench is found sensitive to this.
+ */
+
+static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
+{
+	int next;
+
+again:
+	next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
+
+	if (*wrapped) {
+		if (next >= start)
+			return nr_cpumask_bits;
+	} else {
+		if (next >= nr_cpumask_bits) {
+			*wrapped = 1;
+			n = -1;
+			goto again;
+		}
+	}
+
+	return next;
+}
+
+#define for_each_cpu_wrap(cpu, mask, start, wrap)				\
+	for ((wrap) = 0, (cpu) = (start)-1;					\
+		(cpu) = cpumask_next_wrap((cpu), (mask), (start), &(wrap)),	\
+		(cpu) < nr_cpumask_bits; )
+
+#ifdef CONFIG_SCHED_SMT
+
+static inline void set_idle_cores(int cpu, int val)
+{
+	struct sched_domain_shared *sds;
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds)
+		WRITE_ONCE(sds->has_idle_cores, val);
+}
+
+static inline bool test_idle_cores(int cpu, bool def)
+{
+	struct sched_domain_shared *sds;
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds)
+		return READ_ONCE(sds->has_idle_cores);
+
+	return def;
+}
+
+/*
+ * Scans the local SMT mask to see if the entire core is idle, and records this
+ * information in sd_llc_shared->has_idle_cores.
+ *
+ * Since SMT siblings share all cache levels, inspecting this limited remote
+ * state should be fairly cheap.
+ */
+void update_idle_core(struct rq *rq)
+{
+	int core = cpu_of(rq);
+	int cpu;
+
+	rcu_read_lock();
+	if (test_idle_cores(core, true))
+		goto unlock;
+
+	for_each_cpu(cpu, cpu_smt_mask(core)) {
+		if (cpu == core)
+			continue;
+
+		if (!idle_cpu(cpu))
+			goto unlock;
+	}
+
+	set_idle_cores(core, 1);
+unlock:
+	rcu_read_unlock();
+}
+
+/*
+ * Scan the entire LLC domain for idle cores; this dynamically switches off if there are
+ * no idle cores left in the system; tracked through sd_llc->shared->has_idle_cores
+ * and enabled through update_idle_core() above.
+ */
+static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	int core, cpu, wrap;
+
+	if (!test_idle_cores(target, false))
+		return -1;
+
+	cpumask_and(cpus, sched_domain_span(sd), tsk_cpus_allowed(p));
+
+	for_each_cpu_wrap(core, cpus, target, wrap) {
+		bool idle = true;
+
+		for_each_cpu(cpu, cpu_smt_mask(core)) {
+			cpumask_clear_cpu(cpu, cpus);
+			if (!idle_cpu(cpu))
+				idle = false;
+		}
+
+		if (idle)
+			return core;
+	}
+
+	/*
+	 * Failed to find an idle core; stop looking for one.
+	 */
+	set_idle_cores(target, 0);
+
+	return -1;
+}
+
+/*
+ * Scan the local SMT mask for idle CPUs.
+ */
+static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	int cpu;
+
+	for_each_cpu(cpu, cpu_smt_mask(target)) {
+		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+			continue;
+		if (idle_cpu(cpu))
+			return cpu;
+	}
+
+	return -1;
+}
+
+#else /* CONFIG_SCHED_SMT */
+
+void update_idle_core(struct rq *rq) { }
+
+static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	return -1;
+}
+
+static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	return -1;
+}
+
+#endif /* CONFIG_SCHED_SMT */
+
+/*
+ * Scan the LLC domain for idle CPUs; this is dynamically regulated by
+ * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
+ * average idle time for this rq (as found in rq->avg_idle).
+ */
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+	u64 avg_idle = this_rq()->avg_idle;
+	u64 avg_cost = this_sd->avg_scan_cost;
+	u64 time, cost;
+	s64 delta;
+	int cpu, wrap;
+
+	/*
+	 * Due to large variance we need a large fuzz factor; hackbench in
+	 * particularly is sensitive here.
+	 */
+	if ((avg_idle / 512) < avg_cost)
+		return -1;
+
+	time = local_clock();
+
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
+		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+			continue;
+		if (idle_cpu(cpu))
+			break;
+	}
+
+	time = local_clock() - time;
+	cost = this_sd->avg_scan_cost;
+	delta = (s64)(time - cost) / 8;
+	this_sd->avg_scan_cost += delta;
+
+	return cpu;
+}
+
+/*
+ * Try and locate an idle core/thread in the LLC cache domain.
  */
 static int select_idle_sibling(struct task_struct *p, int target)
 {
 	struct sched_domain *sd;
-	struct sched_group *sg;
 	int i = task_cpu(p);
 
 	if (idle_cpu(target))
 		return target;
 
 	/*
-	 * If the prevous cpu is cache affine and idle, don't be stupid.
+	 * If the previous cpu is cache affine and idle, don't be stupid.
 	 */
 	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
 		return i;
 
-	/*
-	 * Otherwise, iterate the domains and find an eligible idle cpu.
-	 *
-	 * A completely idle sched group at higher domains is more
-	 * desirable than an idle group at a lower level, because lower
-	 * domains have smaller groups and usually share hardware
-	 * resources which causes tasks to contend on them, e.g. x86
-	 * hyperthread siblings in the lowest domain (SMT) can contend
-	 * on the shared cpu pipeline.
-	 *
-	 * However, while we prefer idle groups at higher domains
-	 * finding an idle cpu at the lowest domain is still better than
-	 * returning 'target', which we've already established, isn't
-	 * idle.
-	 */
 	sd = rcu_dereference(per_cpu(sd_llc, target));
-	for_each_lower_domain(sd) {
-		sg = sd->groups;
-		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
-
-			/* Ensure the entire group is idle */
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (i == target || !idle_cpu(i))
-					goto next;
-			}
+	if (!sd)
+		return target;
+
+	i = select_idle_core(p, sd, target);
+	if ((unsigned)i < nr_cpumask_bits)
+		return i;
+
+	i = select_idle_cpu(p, sd, target);
+	if ((unsigned)i < nr_cpumask_bits)
+		return i;
+
+	i = select_idle_smt(p, sd, target);
+	if ((unsigned)i < nr_cpumask_bits)
+		return i;
 
-			/*
-			 * It doesn't matter which cpu we pick, the
-			 * whole group is idle.
-			 */
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
-	}
-done:
 	return target;
 }
 
@@ -7276,9 +7457,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
  */
 #define MAX_PINNED_INTERVAL	512
 
-/* Working cpumask for load_balance and load_balance_newidle. */
-DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
-
 static int need_active_balance(struct lb_env *env)
 {
 	struct sched_domain *sd = env->sd;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 2ce5458..5baf75c 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -23,11 +23,13 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 	resched_curr(rq);
 }
 
+extern void update_idle_core(struct rq *rq);
+
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
 {
 	put_prev_task(rq, prev);
-
+	update_idle_core(rq);
 	schedstat_inc(rq, sched_goidle);
 	return rq->idle;
 }
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 06/11] sched: Optimize SCHED_SMT
  2016-06-16  1:49 [RFC PATCH 00/11] Refactor select_task_rq_fair() Yuyang Du
                   ` (4 preceding siblings ...)
  2016-06-16  1:49 ` [RFC PATCH 05/11] sched: Rewrite select_idle_siblings() Yuyang Du
@ 2016-06-16  1:49 ` Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 07/11] sched: Clean up SD_BALANCE_WAKE flags in sched domain build-up Yuyang Du
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Yuyang Du @ 2016-06-16  1:49 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt

From: Peter Zijlstra <peterz@infradead.org>

Avoid pointless SCHED_SMT code when running on !SMT hardware.
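
The pattern, condensed from the diff below: a static key is enabled once at
boot if CPU0 has SMT siblings, and the SMT-only work hides behind a static
branch so !SMT machines pay (almost) nothing:

  DEFINE_STATIC_KEY_FALSE(sched_smt_present);

  static inline void update_idle_core(struct rq *rq)
  {
          if (static_branch_unlikely(&sched_smt_present))
                  __update_idle_core(rq);
  }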

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      |   19 +++++++++++++++++++
 kernel/sched/fair.c      |    8 +++++++-
 kernel/sched/idle_task.c |    2 --
 kernel/sched/sched.h     |   17 +++++++++++++++++
 4 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e224581..b41059d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7276,6 +7276,22 @@ int sched_cpu_dying(unsigned int cpu)
 }
 #endif
 
+#ifdef CONFIG_SCHED_SMT
+DEFINE_STATIC_KEY_FALSE(sched_smt_present);
+
+static void sched_init_smt(void)
+{
+	/*
+	 * We've enumerated all CPUs and will assume that if any CPU
+	 * has SMT siblings, CPU0 will too.
+	 */
+	if (cpumask_weight(cpu_smt_mask(0)) > 1)
+		static_branch_enable(&sched_smt_present);
+}
+#else
+static inline void sched_init_smt(void) { }
+#endif
+
 void __init sched_init_smp(void)
 {
 	cpumask_var_t non_isolated_cpus;
@@ -7305,6 +7321,9 @@ void __init sched_init_smp(void)
 
 	init_sched_rt_class();
 	init_sched_dl_class();
+
+	sched_init_smt();
+
 	sched_smp_initialized = true;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8335ed5..d048203 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5264,7 +5264,7 @@ static inline bool test_idle_cores(int cpu, bool def)
  * Since SMT siblings share all cache levels, inspecting this limited remote
  * state should be fairly cheap.
  */
-void update_idle_core(struct rq *rq)
+void __update_idle_core(struct rq *rq)
 {
 	int core = cpu_of(rq);
 	int cpu;
@@ -5296,6 +5296,9 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int core, cpu, wrap;
 
+	if (!static_branch_likely(&sched_smt_present))
+		return -1;
+
 	if (!test_idle_cores(target, false))
 		return -1;
 
@@ -5329,6 +5332,9 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 {
 	int cpu;
 
+	if (!static_branch_likely(&sched_smt_present))
+		return -1;
+
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
 		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
 			continue;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 5baf75c..73c39cb 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -23,8 +23,6 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 	resched_curr(rq);
 }
 
-extern void update_idle_core(struct rq *rq);
-
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
 {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cdc63d9..df27200 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1810,3 +1810,20 @@ static inline void account_reset_rq(struct rq *rq)
 	rq->prev_steal_time_rq = 0;
 #endif
 }
+
+
+#ifdef CONFIG_SCHED_SMT
+
+extern struct static_key_false sched_smt_present;
+
+extern void __update_idle_core(struct rq *rq);
+
+static inline void update_idle_core(struct rq *rq)
+{
+	if (static_branch_unlikely(&sched_smt_present))
+		__update_idle_core(rq);
+}
+
+#else
+static inline void update_idle_core(struct rq *rq) { }
+#endif
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 07/11] sched: Clean up SD_BALANCE_WAKE flags in sched domain build-up
  2016-06-16  1:49 [RFC PATCH 00/11] Refactor select_task_rq_fair() Yuyang Du
                   ` (5 preceding siblings ...)
  2016-06-16  1:49 ` [RFC PATCH 06/11] sched: Optimize SCHED_SMT Yuyang Du
@ 2016-06-16  1:49 ` Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 08/11] sched: Remove SD_WAKE_AFFINE flag and replace it with SD_BALANCE_WAKE Yuyang Du
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Yuyang Du @ 2016-06-16  1:49 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, Yuyang Du

According to the comment "turn off/on idle balance on this domain",
SD_BALANCE_WAKE has nothing to do with idle balancing, so clean it
up.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b41059d..7ef6385 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6230,10 +6230,10 @@ static void set_domain_attribute(struct sched_domain *sd,
 		request = attr->relax_domain_level;
 	if (request < sd->level) {
 		/* turn off idle balance on this domain */
-		sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
+		sd->flags &= ~(SD_BALANCE_NEWIDLE);
 	} else {
 		/* turn on idle balance on this domain */
-		sd->flags |= (SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
+		sd->flags |= (SD_BALANCE_NEWIDLE);
 	}
 }
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 08/11] sched: Remove SD_WAKE_AFFINE flag and replace it with SD_BALANCE_WAKE
  2016-06-16  1:49 [RFC PATCH 00/11] Refactor select_task_rq_fair() Yuyang Du
                   ` (6 preceding siblings ...)
  2016-06-16  1:49 ` [RFC PATCH 07/11] sched: Clean up SD_BALANCE_WAKE flags in sched domain build-up Yuyang Du
@ 2016-06-16  1:49 ` Yuyang Du
  2016-06-23 13:04   ` Matt Fleming
  2016-06-16  1:49 ` [RFC PATCH 09/11] sched: Add per CPU variable sd_socket_id to specify the CPU's socket Yuyang Du
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 16+ messages in thread
From: Yuyang Du @ 2016-06-16  1:49 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, Yuyang Du

SD_BALANCE_{FORK|EXEC|WAKE} flags are for select_task_rq() to select a
CPU to run a new task or a waking task. SD_WAKE_AFFINE is a flag to
try selecting the waker CPU to run the waking task.

SD_BALANCE_WAKE is not set in the default sched_domain flags, but
SD_WAKE_AFFINE is. Conceptually, SD_BALANCE_WAKE should be set just
like the other two, so we first enable SD_BALANCE_WAKE in the
sched_domain flags.

Moreover, the semantics of SD_WAKE_AFFINE are included in the semantics
of SD_BALANCE_WAKE. When doing wakeup balancing, it is natural to try
the waker CPU if the waker CPU is allowed; in that sense, we don't
need a separate flag to specify it, not to mention that SD_WAKE_AFFINE
is enabled in almost every sched_domain.

With the above combined, there is no need for SD_WAKE_AFFINE, so we
remove it and replace it with SD_BALANCE_WAKE. This can be
accomplished without any functional change.
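
The equivalence is easiest to see at the wake-affine candidate test in
select_task_rq_fair(), which now simply keys off the new flag (condensed
from the fair.c hunk below):

  if (want_affine && (tmp->flags & SD_BALANCE_WAKE) &&
      cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
          affine_sd = tmp;
          break;
  }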

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h   |    1 -
 kernel/sched/core.c     |    7 +++----
 kernel/sched/deadline.c |    2 +-
 kernel/sched/fair.c     |    9 ++++-----
 kernel/sched/rt.c       |    2 +-
 5 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d74e757..0803abd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1014,7 +1014,6 @@ extern void wake_up_q(struct wake_q_head *head);
 #define SD_BALANCE_EXEC		0x0004	/* Balance on exec */
 #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
 #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
-#define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
 #define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu power */
 #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7ef6385..56ac8f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5666,7 +5666,7 @@ static int sd_degenerate(struct sched_domain *sd)
 	}
 
 	/* Following flags don't use groups */
-	if (sd->flags & (SD_WAKE_AFFINE))
+	if (sd->flags & (SD_BALANCE_WAKE))
 		return 0;
 
 	return 1;
@@ -6361,8 +6361,7 @@ sd_init(struct sched_domain_topology_level *tl,
 					| 1*SD_BALANCE_NEWIDLE
 					| 1*SD_BALANCE_EXEC
 					| 1*SD_BALANCE_FORK
-					| 0*SD_BALANCE_WAKE
-					| 1*SD_WAKE_AFFINE
+					| 1*SD_BALANCE_WAKE
 					| 0*SD_SHARE_CPUCAPACITY
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 0*SD_SERIALIZE
@@ -6412,7 +6411,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
-				       SD_WAKE_AFFINE);
+				       SD_BALANCE_WAKE);
 		}
 
 #endif
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fcb7f02..037ab0f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1323,7 +1323,7 @@ static int find_later_rq(struct task_struct *task)
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
-		if (sd->flags & SD_WAKE_AFFINE) {
+		if (sd->flags & SD_BALANCE_WAKE) {
 
 			/*
 			 * If possible, preempting this_cpu is
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d048203..f15461f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5474,8 +5474,7 @@ static int cpu_util(int cpu)
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
  * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
  *
- * Balances load by selecting the idlest cpu in the idlest group, or under
- * certain conditions an idle sibling cpu if the domain has SD_WAKE_AFFINE set.
+ * Balances load by selecting the idlest cpu in the idlest group.
  *
  * Returns the target cpu number.
  *
@@ -5502,9 +5501,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 		/*
 		 * If both cpu and prev_cpu are part of this domain,
-		 * cpu is a valid SD_WAKE_AFFINE target.
+		 * cpu is a valid SD_BALANCE_WAKE target.
 		 */
-		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
+		if (want_affine && (tmp->flags & SD_BALANCE_WAKE) &&
 		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
 			affine_sd = tmp;
 			break;
@@ -5517,7 +5516,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 
 	if (affine_sd) {
-		sd = NULL; /* Prefer wake_affine over balance flags */
+		sd = NULL; /* Prefer SD_BALANCE_WAKE over other balance flags */
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			new_cpu = cpu;
 	}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d5690b7..d1c8f41 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1655,7 +1655,7 @@ static int find_lowest_rq(struct task_struct *task)
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
-		if (sd->flags & SD_WAKE_AFFINE) {
+		if (sd->flags & SD_BALANCE_WAKE) {
 			int best_cpu;
 
 			/*
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 09/11] sched: Add per CPU variable sd_socket_id to specify the CPU's socket
  2016-06-16  1:49 [RFC PATCH 00/11] Refactor select_task_rq_fair() Yuyang Du
                   ` (7 preceding siblings ...)
  2016-06-16  1:49 ` [RFC PATCH 08/11] sched: Remove SD_WAKE_AFFINE flag and replace it with SD_BALANCE_WAKE Yuyang Du
@ 2016-06-16  1:49 ` Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 10/11] sched: Add sched_llc_complete static key to specify whether the LLC covers all CPUs Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 11/11] sched/fair: Refactor select_task_rq_fair() Yuyang Du
  10 siblings, 0 replies; 16+ messages in thread
From: Yuyang Du @ 2016-06-16  1:49 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, Yuyang Du

Add a unique ID per socket (we use the first CPU number in the
cpumask of the socket). This allows us to quickly tell whether two CPUs
are in the same socket; see cpus_share_socket().
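
A hypothetical caller (not part of this patch) would use it to cheaply
classify a waker/wakee pair, e.g.:

  /* hypothetical helper: is the wakee's previous CPU on the waker's socket? */
  static bool wakee_on_waker_socket(struct task_struct *p)
  {
          return cpus_share_socket(smp_processor_id(), task_cpu(p));
  }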

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |    5 +++++
 kernel/sched/core.c   |   16 ++++++++++++++--
 kernel/sched/sched.h  |    1 +
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0803abd..4b2a0fa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1158,6 +1158,7 @@ cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
 bool cpus_share_cache(int this_cpu, int that_cpu);
+bool cpus_share_socket(int this_cpu, int that_cpu);
 
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 typedef int (*sched_domain_flags_f)(void);
@@ -1206,6 +1207,10 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
 
+static inline bool cpus_share_socket(int this_cpu, int that_cpu)
+{
+	return true;
+}
 #endif	/* !CONFIG_SMP */
 
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 56ac8f1..0a332ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1859,6 +1859,11 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 {
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
+
+bool cpus_share_socket(int this_cpu, int that_cpu)
+{
+	return per_cpu(sd_socket_id, this_cpu) == per_cpu(sd_socket_id, that_cpu);
+}
 #endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -5875,10 +5880,15 @@ static void destroy_sched_domains(struct sched_domain *sd)
  * Also keep a unique ID per domain (we use the first cpu number in
  * the cpumask of the domain), this allows us to quickly tell if
  * two cpus are in the same cache domain, see cpus_share_cache().
+
+ * And also keep a unique ID per socket (we use the first cpu number
+ * in the cpumask of the socket), this allows us to quickly tell if
+ * two cpus are in the same socket, see cpus_share_socket().
  */
 DEFINE_PER_CPU(struct sched_domain *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_socket_id);
 DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
@@ -5887,8 +5897,7 @@ static void update_top_cache_domain(int cpu)
 {
 	struct sched_domain_shared *sds = NULL;
 	struct sched_domain *sd;
-	int id = cpu;
-	int size = 1;
+	int id = cpu, size = 1, socket_id = 0;
 
 	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
 	if (sd) {
@@ -5904,6 +5913,9 @@ static void update_top_cache_domain(int cpu)
 
 	sd = lowest_flag_domain(cpu, SD_NUMA);
 	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
+	if (sd)
+		socket_id = cpumask_first(sched_group_cpus(sd->groups));
+	per_cpu(sd_socket_id, cpu) = socket_id;
 
 	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
 	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index df27200..654bc65 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -856,6 +856,7 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(int, sd_socket_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 10/11] sched: Add sched_llc_complete static key to specify whether the LLC covers all CPUs
  2016-06-16  1:49 [RFC PATCH 00/11] Refactor select_task_rq_fair() Yuyang Du
                   ` (8 preceding siblings ...)
  2016-06-16  1:49 ` [RFC PATCH 09/11] sched: Add per CPU variable sd_socket_id to specify the CPU's socket Yuyang Du
@ 2016-06-16  1:49 ` Yuyang Du
  2016-06-16  1:49 ` [RFC PATCH 11/11] sched/fair: Refactor select_task_rq_fair() Yuyang Du
  10 siblings, 0 replies; 16+ messages in thread
From: Yuyang Du @ 2016-06-16  1:49 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, Yuyang Du

This static key tells whether all CPUs share the last level cache,
which is common in single-socket x86 boxes. It gives
select_task_rq_fair() better information for deciding when it can fast
select an idle sibling.
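
The intended check is a single static branch (patch 11 uses it this way in
waker_wakee_topology()); the sketch below shows the shape, with the branch
bodies left as comments:

  if (static_branch_likely(&sched_llc_complete)) {
          /* all CPUs share the LLC: an idle-sibling scan covers the whole box */
  } else {
          /* multiple LLC domains: the socket/NUMA topology must be considered */
  }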

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/core.c  |    9 +++++++++
 kernel/sched/sched.h |    4 ++++
 2 files changed, 13 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0a332ed..3f26fea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7303,6 +7303,14 @@ static void sched_init_smt(void)
 static inline void sched_init_smt(void) { }
 #endif
 
+DEFINE_STATIC_KEY_TRUE(sched_llc_complete);
+
+static void sched_init_llc_complete(void)
+{
+	if (cpumask_weight(cpu_active_mask) > per_cpu(sd_llc_size, 0))
+		static_branch_disable(&sched_llc_complete);
+}
+
 void __init sched_init_smp(void)
 {
 	cpumask_var_t non_isolated_cpus;
@@ -7334,6 +7342,7 @@ void __init sched_init_smp(void)
 	init_sched_dl_class();
 
 	sched_init_smt();
+	sched_init_llc_complete();
 
 	sched_smp_initialized = true;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 654bc65..f11c5dd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1828,3 +1828,7 @@ static inline void update_idle_core(struct rq *rq)
 #else
 static inline void update_idle_core(struct rq *rq) { }
 #endif
+
+#ifdef CONFIG_SCHED_MC
+extern struct static_key_true sched_llc_complete;
+#endif
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 11/11] sched/fair: Refactor select_task_rq_fair()
  2016-06-16  1:49 [RFC PATCH 00/11] Refactor select_task_rq_fair() Yuyang Du
                   ` (9 preceding siblings ...)
  2016-06-16  1:49 ` [RFC PATCH 10/11] sched: Add sched_llc_complete static key to specify whether the LLC covers all CPUs Yuyang Du
@ 2016-06-16  1:49 ` Yuyang Du
  2016-06-16  1:57   ` Yuyang Du
  10 siblings, 1 reply; 16+ messages in thread
From: Yuyang Du @ 2016-06-16  1:49 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, Yuyang Du

This refactoring attempts to achieve:

 - Decouple the waker/wakee relation from the three kinds of wakeup SD_* flags.
 - Form a complete topology view to guide the selection.
 - Determine fast idle selection vs. slow average-based selection with more
   information.

To enable this refactoring:

echo NEW_SELECT > sched_features
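
With the feature off, the old path is taken unchanged; condensed from the
select_idle_sibling() hunk below:

  if (sched_feat(NEW_SELECT))
          i = select_idle_cpu2(p, sd, target);    /* no avg_scan_cost throttling */
  else
          i = select_idle_cpu(p, sd, target);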

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h   |    1 +
 kernel/sched/fair.c     |  284 ++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/features.h |    1 +
 3 files changed, 285 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4b2a0fa..c4b4b90 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1475,6 +1475,7 @@ struct task_struct {
 	struct llist_node wake_entry;
 	int on_cpu;
 	unsigned int wakee_flips;
+	unsigned int wakee_count;
 	unsigned long wakee_flip_decay_ts;
 	struct task_struct *last_wakee;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f15461f..1ab41b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4986,12 +4986,14 @@ static void record_wakee(struct task_struct *p)
 	 */
 	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
 		current->wakee_flips >>= 1;
+		current->wakee_count >>= 1;
 		current->wakee_flip_decay_ts = jiffies;
 	}
 
 	if (current->last_wakee != p) {
 		current->last_wakee = p;
 		current->wakee_flips++;
+		current->wakee_count++;
 	}
 }
 
@@ -5025,6 +5027,45 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
+#define TP_IDENT	0x1	/* waker is wakee */
+#define TP_SMT		0x2	/* SMT sibling */
+#define TP_LLC		0x4	/* cpus_share_cache(waker, wakee) */
+#define TP_SHARE_CACHE	0x7	/* waker and wakee share cache */
+#define TP_LLC_SOCK	0x10	/* all CPUs have complete LLC coverage */
+#define TP_NOLLC_SOCK	0x20	/* !TP_LLC_SOCK */
+#define TP_NOLLC_XSOCK	0x40	/* cross socket */
+
+static int wake_wide2(int topology, struct task_struct *p)
+{
+	unsigned int master = current->wakee_count;
+	unsigned int slave = p->wakee_count;
+	unsigned int master_flips = current->wakee_flips;
+	unsigned int slave_flips = p->wakee_flips;
+	int factor = this_cpu_read(sd_llc_size);
+
+	if (master_flips)
+		master /= master_flips;
+	else
+		master = !!master;
+
+	if (slave_flips)
+		slave /= slave_flips;
+	else
+		slave = !!slave;
+
+	if (master < slave)
+		swap(master, slave);
+
+	/*
+	 * The system is either TP_NOLLC_SOCK or TP_NOLLC_XSOCK, the waker
+	 * and wakee may still share some cache though.
+	 */
+	if (topology & TP_SHARE_CACHE)
+		return master > factor;
+
+	return slave > factor;
+}
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
@@ -5399,6 +5440,20 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	return cpu;
 }
 
+static int select_idle_cpu2(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	int cpu, wrap;
+
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
+		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+			continue;
+		if (idle_cpu(cpu))
+			break;
+	}
+
+	return cpu;
+}
+
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
@@ -5424,7 +5479,10 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
-	i = select_idle_cpu(p, sd, target);
+	if (sched_feat(NEW_SELECT))
+		i = select_idle_cpu2(p, sd, target);
+	else
+		i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
@@ -5470,6 +5528,227 @@ static int cpu_util(int cpu)
 }
 
 /*
+ * XXX just copied from waker_affine().
+ * use util other than load for some topologes?
+ * real-time vs. avg
+ */
+static int
+wake_waker(int topology, struct sched_domain *sd, struct task_struct *p, int sync)
+{
+	s64 this_load, load;
+	s64 this_eff_load, prev_eff_load;
+	int idx, this_cpu, prev_cpu;
+	struct task_group *tg;
+	unsigned long weight;
+	int balanced;
+
+	idx	  = sd->wake_idx;
+	this_cpu  = smp_processor_id();
+	prev_cpu  = task_cpu(p);
+	load	  = source_load(prev_cpu, idx);
+	this_load = target_load(this_cpu, idx);
+
+	/*
+	 * If sync wakeup then subtract the (maximum possible)
+	 * effect of the currently running task from the load
+	 * of the current CPU:
+	 */
+	if (sync) {
+		tg = task_group(current);
+		weight = current->se.avg.load_avg;
+
+		this_load += effective_load(tg, this_cpu, -weight, -weight);
+		load += effective_load(tg, prev_cpu, 0, -weight);
+	}
+
+	tg = task_group(p);
+	weight = p->se.avg.load_avg;
+
+	/*
+	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
+	 * due to the sync cause above having dropped this_load to 0, we'll
+	 * always have an imbalance, but there's really nothing you can do
+	 * about that, so that's good too.
+	 *
+	 * Otherwise check if either cpus are near enough in load to allow this
+	 * task to be woken on this_cpu.
+	 */
+	this_eff_load = 100;
+	this_eff_load *= capacity_of(prev_cpu);
+
+	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+	prev_eff_load *= capacity_of(this_cpu);
+
+	if (this_load > 0) {
+		this_eff_load *= this_load +
+			effective_load(tg, this_cpu, weight, weight);
+
+		prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+	}
+
+	balanced = this_eff_load <= prev_eff_load;
+
+	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
+
+	if (!balanced)
+		return 0;
+
+	schedstat_inc(sd, ttwu_move_affine);
+	schedstat_inc(p, se.statistics.nr_wakeups_affine);
+
+	return 1;
+}
+static int wake_idle(struct sched_domain *sd)
+{
+	u64 avg_cost = sd->avg_scan_cost, avg_idle = this_rq()->avg_idle;
+
+	/*
+	 * Due to large variance we need a large fuzz factor; hackbench in
+	 * particularly is sensitive here.
+	 */
+	if ((avg_idle / 512) < avg_cost)
+		return 0;
+
+	return 1;
+}
+
+static inline int
+waker_wakee_topology(int waker, int wakee, int *allowed, struct task_struct *p)
+{
+	int topology = 0;
+
+	if (static_branch_likely(&sched_llc_complete))
+		topology = TP_LLC_SOCK;
+
+	if (waker == wakee)
+		return TP_IDENT | topology;
+
+	*allowed = cpumask_test_cpu(waker, tsk_cpus_allowed(p));
+
+#ifdef CONFIG_SCHED_SMT
+	if (static_branch_likely(&sched_smt_present) &&
+	    cpumask_test_cpu(waker, cpu_smt_mask(wakee)))
+		return TP_SMT | topology;
+#endif
+	if (cpus_share_cache(wakee, waker))
+		return TP_LLC | topology;
+
+	/* We are here without TP_LLC_SOCK for sure */
+	if (unlikely(cpus_share_socket(wakee, waker)))
+		return TP_NOLLC_SOCK;
+
+	return TP_NOLLC_XSOCK;
+}
+
+/*
+ * Notes:
+ *  - how one-time locally suboptimal selections approach a globally optimal result over time
+ *  - certain randomness
+ *  - spread vs. consolidate
+ *  - load vs. util, real-time vs. avg
+ *  - use topology more
+ */
+static int
+select_task_rq_fair2(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
+{
+	struct sched_domain *tmp, *sd = NULL, *this_sd_llc = NULL;
+	int waker_allowed = 1, select_idle = 0;
+	int cpu = smp_processor_id(), new_cpu = prev_cpu;
+	int sync = wake_flags & WF_SYNC;
+	int topology = waker_wakee_topology(cpu, prev_cpu, &waker_allowed, p);
+
+	record_wakee(p);
+
+	rcu_read_lock();
+
+	for_each_domain(cpu, tmp) {
+		/* Stop if the domain flags say no */
+		if (!(tmp->flags & SD_LOAD_BALANCE) || !(tmp->flags & sd_flag))
+			break;
+
+		sd = tmp;
+	}
+
+	if (!sd)
+		goto out_unlock;
+
+	this_sd_llc = rcu_dereference(*this_cpu_ptr(&sd_llc));
+	if (!this_sd_llc || (sd->span_weight < this_sd_llc->span_weight))
+		goto select_avg;
+
+	/* Now we can attempt to fast select an idle CPU */
+	if (topology & TP_LLC_SOCK) {
+
+		if (wake_idle(this_sd_llc))
+			select_idle = 1;
+
+	} else if (!wake_wide2(topology, p))
+		select_idle = 1;
+
+	if (topology > TP_IDENT && waker_allowed &&
+	    wake_waker(topology, this_sd_llc, p, sync))
+		new_cpu = cpu;
+
+	if (select_idle) {
+		/*
+		 * Scan the LLC domain for idle CPUs; this is dynamically
+		 * regulated by comparing the average scan cost (tracked in
+		 * this_sd_llc->avg_scan_cost) against the average idle time
+		 * for this rq (as found in this_rq->avg_idle).
+		 */
+		u64 time = local_clock();
+
+		new_cpu = select_idle_sibling(p, new_cpu);
+		time = local_clock() - time;
+		this_sd_llc->avg_scan_cost +=
+			(s64)(time - this_sd_llc->avg_scan_cost) / 8;
+
+		goto out_unlock;
+	}
+
+select_avg:
+	while (sd) {
+		struct sched_group *group;
+		int weight;
+
+		if (!(sd->flags & sd_flag)) {
+			sd = sd->child;
+			continue;
+		}
+
+		group = find_idlest_group(sd, p, cpu, sd_flag);
+		if (!group) {
+			sd = sd->child;
+			continue;
+		}
+
+		new_cpu = find_idlest_cpu(group, p, cpu);
+		if (new_cpu == -1 || new_cpu == cpu) {
+			/* Now try balancing at a lower domain level of cpu */
+			sd = sd->child;
+			continue;
+		}
+
+		/* Now try balancing at a lower domain level of new_cpu */
+		cpu = new_cpu;
+		weight = sd->span_weight;
+		sd = NULL;
+		for_each_domain(cpu, tmp) {
+			if (weight <= tmp->span_weight)
+				break;
+			if (tmp->flags & sd_flag)
+				sd = tmp;
+		}
+		/* while loop will break here if sd == NULL */
+	}
+
+out_unlock:
+	rcu_read_unlock();
+
+	return new_cpu;
+}
+
+/*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
  * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
@@ -5489,6 +5768,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
+	if (sched_feat(NEW_SELECT))
+		return select_task_rq_fair2(p, prev_cpu, sd_flag, wake_flags);
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
 		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 69631fa..ea41750 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -69,3 +69,4 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
+SCHED_FEAT(NEW_SELECT, false)
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 11/11] sched/fair: Refactor select_task_rq_fair()
  2016-06-16  1:49 ` [RFC PATCH 11/11] sched/fair: Refactor select_task_rq_fair() Yuyang Du
@ 2016-06-16  1:57   ` Yuyang Du
  0 siblings, 0 replies; 16+ messages in thread
From: Yuyang Du @ 2016-06-16  1:57 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt

On Thu, Jun 16, 2016 at 09:49:35AM +0800, Yuyang Du wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f15461f..1ab41b8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4986,12 +4986,14 @@ static void record_wakee(struct task_struct *p)
>  	 */
>  	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
>  		current->wakee_flips >>= 1;
> +		current->wakee_count >>= 1;
>  		current->wakee_flip_decay_ts = jiffies;
>  	}
>  
>  	if (current->last_wakee != p) {
>  		current->last_wakee = p;
>  		current->wakee_flips++;
>  	}

Sorry, I made a mistake here:

		current->wakee_count++;

should be moved out of the if statement, so that every wakeup is counted,
not only wakee switches.
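
For clarity, a sketch of what record_wakee() would look like with that
fix applied (illustration only, not a resend):

static void record_wakee(struct task_struct *p)
{
	/*
	 * Decay both counters roughly once per second so the
	 * count/flips ratio reflects only recent wakeup behavior.
	 */
	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
		current->wakee_flips >>= 1;
		current->wakee_count >>= 1;
		current->wakee_flip_decay_ts = jiffies;
	}

	if (current->last_wakee != p) {
		current->last_wakee = p;
		current->wakee_flips++;
	}

	/* Count every wakeup, not only wakee switches. */
	current->wakee_count++;
}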

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 08/11] sched: Remove SD_WAKE_AFFINE flag and replace it with SD_BALANCE_WAKE
  2016-06-16  1:49 ` [RFC PATCH 08/11] sched: Remove SD_WAKE_AFFINE flag and replace it with SD_BALANCE_WAKE Yuyang Du
@ 2016-06-23 13:04   ` Matt Fleming
  2016-06-23 13:54     ` Peter Zijlstra
  2016-06-23 14:06     ` Mike Galbraith
  0 siblings, 2 replies; 16+ messages in thread
From: Matt Fleming @ 2016-06-23 13:04 UTC (permalink / raw)
  To: Yuyang Du
  Cc: peterz, mingo, linux-kernel, umgwanakikbuti, bsegall, pjt,
	morten.rasmussen, vincent.guittot, dietmar.eggemann

On Thu, 16 Jun, at 09:49:32AM, Yuyang Du wrote:
> SD_BALANCE_{FORK|EXEC|WAKE} flags are for select_task_rq() to select a
> CPU to run a new task or a waking task. SD_WAKE_AFFINE is a flag to
> try selecting the waker CPU to run the waking task.
> 
> SD_BALANCE_WAKE is not a sched_domain flag, but SD_WAKE_AFFINE is.
> Conceptually, SD_BALANCE_WAKE should be a sched_domain flag just like
> the other two, so we first make SD_BALANCE_WAKE a sched_domain flag.
> 
> Moreover, the semantics of SD_WAKE_AFFINE are included in the semantics
> of SD_BALANCE_WAKE. When doing wakeup balancing, it is natural to try
> the waker CPU if it is allowed; in that sense, we don't need a separate
> flag to specify it, not to mention that SD_WAKE_AFFINE is enabled in
> almost every sched_domain.
> 
> With the above combined, there is no need for SD_WAKE_AFFINE, so we
> remove it and replace it with SD_BALANCE_WAKE. This can be done without
> any functional change.
> 
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
>  include/linux/sched.h   |    1 -
>  kernel/sched/core.c     |    7 +++----
>  kernel/sched/deadline.c |    2 +-
>  kernel/sched/fair.c     |    9 ++++-----
>  kernel/sched/rt.c       |    2 +-
>  5 files changed, 9 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d74e757..0803abd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1014,7 +1014,6 @@ extern void wake_up_q(struct wake_q_head *head);
>  #define SD_BALANCE_EXEC		0x0004	/* Balance on exec */
>  #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
>  #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
> -#define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
>  #define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu power */
>  #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
>  #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */

I'm curious - doesn't this break userspace ABI? These flags are
exported via procfs, so I would have assumed removing or changing the
value of any of these constants would be forbidden.
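
For illustration, the sort of out-of-tree decoder that would notice the
change (hypothetical tool and procfs path; the flag values are the ones
quoted above):

#include <stdio.h>
#include <stdlib.h>

/* Old layout, as in include/linux/sched.h before this patch. */
#define SD_BALANCE_WAKE		0x0010
#define SD_WAKE_AFFINE		0x0020	/* removed by this patch */

int main(int argc, char **argv)
{
	unsigned long flags;

	if (argc < 2)
		return 1;

	/* e.g. a value read from /proc/sys/kernel/sched_domain/cpu0/domain0/flags */
	flags = strtoul(argv[1], NULL, 0);

	/*
	 * After the patch, bit 0x0020 is never set any more, while
	 * SD_BALANCE_WAKE starts appearing in domains that previously
	 * only had SD_WAKE_AFFINE, so this decode silently changes meaning.
	 */
	printf("wake_affine:  %s\n", (flags & SD_WAKE_AFFINE) ? "yes" : "no");
	printf("balance_wake: %s\n", (flags & SD_BALANCE_WAKE) ? "yes" : "no");

	return 0;
}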

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 08/11] sched: Remove SD_WAKE_AFFINE flag and replace it with SD_BALANCE_WAKE
  2016-06-23 13:04   ` Matt Fleming
@ 2016-06-23 13:54     ` Peter Zijlstra
  2016-06-23 14:06     ` Mike Galbraith
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2016-06-23 13:54 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Yuyang Du, mingo, linux-kernel, umgwanakikbuti, bsegall, pjt,
	morten.rasmussen, vincent.guittot, dietmar.eggemann

On Thu, Jun 23, 2016 at 02:04:33PM +0100, Matt Fleming wrote:
> On Thu, 16 Jun, at 09:49:32AM, Yuyang Du wrote:
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1014,7 +1014,6 @@ extern void wake_up_q(struct wake_q_head *head);
> >  #define SD_BALANCE_EXEC		0x0004	/* Balance on exec */
> >  #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
> >  #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
> > -#define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
> >  #define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu power */
> >  #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
> >  #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
> 
> I'm curious - doesn't this break userspace ABI? These flags are
> exported via procfs, so I would have assumed removing or changing the
> value of any of these constants would be forbidden.

Generally I ignore this little issue. Also, I suppose we should move
that whole sched_domain cruft into /debug/sched/ or so.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 08/11] sched: Remove SD_WAKE_AFFINE flag and replace it with SD_BALANCE_WAKE
  2016-06-23 13:04   ` Matt Fleming
  2016-06-23 13:54     ` Peter Zijlstra
@ 2016-06-23 14:06     ` Mike Galbraith
  1 sibling, 0 replies; 16+ messages in thread
From: Mike Galbraith @ 2016-06-23 14:06 UTC (permalink / raw)
  To: Matt Fleming, Yuyang Du
  Cc: peterz, mingo, linux-kernel, bsegall, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann

On Thu, 2016-06-23 at 14:04 +0100, Matt Fleming wrote:

> I'm curious - doesn't this break userspace ABI? These flags are
> exported via procfs, so I would have assumed removing or changing the
> value of any of these constants would be forbidden.

Nope, if those change, you get to fix up your toys.  Hopping in the way
way back machine...

@@ -460,10 +460,11 @@ enum idle_type
 #define SD_LOAD_BALANCE		1	/* Do load balancing on this domain. */
 #define SD_BALANCE_NEWIDLE	2	/* Balance when about to become idle */
 #define SD_BALANCE_EXEC		4	/* Balance on exec */
-#define SD_WAKE_IDLE		8	/* Wake to idle CPU on task wakeup */
-#define SD_WAKE_AFFINE		16	/* Wake task to waking CPU */
-#define SD_WAKE_BALANCE		32	/* Perform balancing at task wakeup */
-#define SD_SHARE_CPUPOWER	64	/* Domain members share cpu power */
+#define SD_BALANCE_FORK		8	/* Balance on fork, clone */
+#define SD_WAKE_IDLE		16	/* Wake to idle CPU on task wakeup */
+#define SD_WAKE_AFFINE		32	/* Wake task to waking CPU */
+#define SD_WAKE_BALANCE		64	/* Perform balancing at task wakeup */
+#define SD_SHARE_CPUPOWER	128	/* Domain members share cpu power */

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-06-23 14:06 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-16  1:49 [RFC PATCH 00/11] Refactor select_task_rq_fair() Yuyang Du
2016-06-16  1:49 ` [RFC PATCH 01/11] sched: Remove unused @cpu argument from destroy_sched_domain*() Yuyang Du
2016-06-16  1:49 ` [RFC PATCH 02/11] sched: Restructure destroy_sched_domain() Yuyang Du
2016-06-16  1:49 ` [RFC PATCH 03/11] sched: Introduce struct sched_domain_shared Yuyang Du
2016-06-16  1:49 ` [RFC PATCH 04/11] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared Yuyang Du
2016-06-16  1:49 ` [RFC PATCH 05/11] sched: Rewrite select_idle_siblings() Yuyang Du
2016-06-16  1:49 ` [RFC PATCH 06/11] sched: Optimize SCHED_SMT Yuyang Du
2016-06-16  1:49 ` [RFC PATCH 07/11] sched: Clean up SD_BALANCE_WAKE flags in sched domain build-up Yuyang Du
2016-06-16  1:49 ` [RFC PATCH 08/11] sched: Remove SD_WAKE_AFFINE flag and replace it with SD_BALANCE_WAKE Yuyang Du
2016-06-23 13:04   ` Matt Fleming
2016-06-23 13:54     ` Peter Zijlstra
2016-06-23 14:06     ` Mike Galbraith
2016-06-16  1:49 ` [RFC PATCH 09/11] sched: Add per CPU variable sd_socket_id to specify the CPU's socket Yuyang Du
2016-06-16  1:49 ` [RFC PATCH 10/11] sched: Add sched_llc_complete static key to specify whether the LLC covers all CPUs Yuyang Du
2016-06-16  1:49 ` [RFC PATCH 11/11] sched/fair: Refactor select_task_rq_fair() Yuyang Du
2016-06-16  1:57   ` Yuyang Du
