* [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
@ 2014-05-30  6:35 Yuyang Du
  2014-05-30  6:35 ` [RFC PATCH 01/16 v3] Remove update_rq_runnable_avg Yuyang Du
                   ` (17 more replies)
  0 siblings, 18 replies; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:35 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

Hi Ingo, PeterZ, Rafael, and others,

The current scheduler’s load balancing is completely work-conserving. For some
workloads that have generally low CPU utilization but are interspersed with
bursts of transient tasks, migrating tasks to engage all available CPUs for
the sake of work-conserving can incur significant overhead: loss of cache
locality, idle/active hardware state transition latency and power, shallower
idle states, etc. This is inefficient in both power and performance,
especially on today’s low-power mobile processors.

This RFC introduces a degree of idleness-conserving into work-conserving (by
all means, we really don’t want to be extreme in only one direction). But how
much idleness-conserving should there be, bearing in mind that we don’t want
to sacrifice performance? To answer that, we first need a load/idleness
indicator.

Thanks to CFS’s goal to “model an ideal, precise multi-tasking CPU”, the tasks
in the runqueue can be seen as running concurrently. So it is natural to use
task concurrency as the load indicator. To compute it, we do two things:

1) Divide continuous time into periods, and average the task concurrency
within each period, to tolerate transient bursts:
a = sum(concurrency * time) / period
2) Exponentially decay past periods and sum them all up, for hysteresis to
load drops or resilience to load rises (let f be the decay factor, and a_x
the xth period average since period 0):
s = a_n + f^1 * a_(n-1) + f^2 * a_(n-2) + ... + f^(n-1) * a_1 + f^n * a_0

We name this load indicator CPU ConCurrency (CC): task concurrency
determines how many CPUs need to be running concurrently.
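
For illustration, below is a minimal user-space model of these two steps
(floating point, with decay factor f = 0.5, i.e. a half-life of one period;
the in-kernel implementation later in this series uses fixed-point shifts and
handles partial periods differently, so treat this purely as a sketch of the
math):

#include <stdio.h>

#define DECAY_F		0.5	/* f: decay factor, 0.5 == half-life of one period */

struct cc_model {
	double period_sum;	/* sum(concurrency * time) in the current period */
	double elapsed;		/* time accounted in the current period */
	double s;		/* decayed sum over all closed periods */
};

/* account 'dt' time units spent at 'concurrency' running tasks */
static void cc_account(struct cc_model *cc, double concurrency, double dt,
		       double period)
{
	while (dt > 0) {
		double room = period - cc->elapsed;
		double d = dt < room ? dt : room;

		cc->period_sum += concurrency * d;
		cc->elapsed += d;
		dt -= d;

		if (cc->elapsed >= period) {		/* close this period */
			double a = cc->period_sum / period;

			cc->s = a + DECAY_F * cc->s;	/* s = a_n + f * s_old */
			cc->period_sum = 0;
			cc->elapsed = 0;
		}
	}
}

int main(void)
{
	struct cc_model cc = { 0 };

	/* 2 tasks for half a period, idle for 1.5 periods, 1 task for 1 period */
	cc_account(&cc, 2, 0.5, 1.0);
	cc_account(&cc, 0, 1.5, 1.0);
	cc_account(&cc, 1, 1.0, 1.0);
	printf("CC = %f\n", cc.s);	/* a_2 + f*a_1 + f^2*a_0 = 1 + 0 + 0.25 = 1.25 */
	return 0;
}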

Two other ways to interpret CC:

1) The current work-conserving load balance also uses CC, but the
instantaneous CC.

2) CC vs. CPU utilization: CC is runqueue-length-weighted CPU utilization. If
we change "a = sum(concurrency * time) / period" to "a' = sum(1 * time) /
period", then a' is just the CPU utilization. And the way we weight the
runqueue length here is the simplest one (excluding the exponential decay; you
may have other ways).
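
For example, if 2 tasks run concurrently for half of a period and the CPU is
idle for the other half, then a' = 0.5 (50% utilization), while
a = (2 * 0.5 + 0 * 0.5) / 1 = 1.0: CC also captures how many tasks were
queued, not merely that the CPU was busy.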

To track CC, we hook into the scheduler at 1) enqueue, 2) dequeue, 3) the
scheduler tick, and 4) idle enter/exit.

On top of CC, the consolidation part 1) attaches to the CPU topology so that
it is adaptive beyond our experimental platforms, and 2) intercepts the
current load balance to contain load and load balancing.

Currently, CC is per CPU. To consolidate, the formula is based on a heuristic.
Suppose we have 2 CPUs, whose task concurrency over time looks like this
('-' means no task, 'x' means having tasks):

1)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---------xxxx---- (CC[1])

2)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---xxxx---------- (CC[1])

If we consolidate CPU0 and CPU1, the consolidated CC will be CC' = CC[0] +
CC[1] for case 1 and CC'' = (CC[0] + CC[1]) * 2 for case 2. For the cases in
between 1 and 2, in terms of how the 'x' runs overlap, the consolidated CC
lies between CC' and CC''. So we uniformly consolidate m CPUs onto n CPUs
(m > n) when this condition holds:

(CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) <= (1 * n) * n *
consolidating_coefficient

The consolidating_coefficient could be 100% or more or less.
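
A condensed sketch of that check (loosely mirroring the __can_consolidate_cc()
helper added later in this series; the kernel code works on fixed-point,
cpu_power-scaled CC sums and uses ilog2(), so this is only illustrative):

#include <math.h>
#include <stdio.h>

/*
 * Can the aggregate concurrency of m CPUs (sum_cc, where 1.0 means one
 * fully busy task) be consolidated onto n CPUs?  coeff is the
 * consolidating_coefficient in percent.
 */
static int can_consolidate(double sum_cc, int m, int n, unsigned int coeff)
{
	if (m <= n)
		return 0;

	return sum_cc * (n + log2(m - n)) <= 1.0 * n * n * coeff / 100.0;
}

int main(void)
{
	/* 4 CPUs with an aggregate CC of 0.9: OK to consolidate onto 2? */
	printf("%d\n", can_consolidate(0.9, 4, 2, 100));	/* 0.9 * 3 <= 4 -> yes */
	return 0;
}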

Using CC, we implemented a Workload Consolidation (WC) patch on two Intel
mobile platforms (a quad-core composed of two dual-core modules): load and
load balancing are contained in the first dual-core module when the aggregated
CC is low, and spread over the full quad-core otherwise. Results show power
savings and no substantial performance regression (even gains for some
workloads). The workloads we used to evaluate Workload Consolidation include
1) 50+ perf/ux benchmarks (almost all of the magazine ones), and 2) ~10 power
workloads; these are of course the easiest ones, such as browsing, audio,
video, recording, imaging, etc. The current half-life is 1 period, and the
period was 32ms; it is now 64ms (2^26 ns) for more aggressive consolidation.

Usage:
CPU CC and WC were originally designed for Intel mobile platforms, but they can
also be used on bigger machines. For example, I have an Intel Core(TM) i7-3770K
@ 3.50GHz, which is quad-core with 8 threads. Its CPU topology has the Sibling
and MC domains. The flags without CPU CC and WC are:

kernel.sched_domain.cpu0.domain0.flags = 687
kernel.sched_domain.cpu0.domain1.flags = 559
kernel.sched_domain.cpu1.domain0.flags = 687
kernel.sched_domain.cpu1.domain1.flags = 559
kernel.sched_domain.cpu2.domain0.flags = 687
kernel.sched_domain.cpu2.domain1.flags = 559
kernel.sched_domain.cpu3.domain0.flags = 687
kernel.sched_domain.cpu3.domain1.flags = 559
kernel.sched_domain.cpu4.domain0.flags = 687
kernel.sched_domain.cpu4.domain1.flags = 559
kernel.sched_domain.cpu5.domain0.flags = 687
kernel.sched_domain.cpu5.domain1.flags = 559
kernel.sched_domain.cpu6.domain0.flags = 687
kernel.sched_domain.cpu6.domain1.flags = 559
kernel.sched_domain.cpu7.domain0.flags = 687
kernel.sched_domain.cpu7.domain1.flags = 559

To enable CPU WC at the MC domain (SD_WORKLOAD_CONSOLIDATION=0x8000; this
patchset enables WC at the MC and CPU domains by default):

sysctl -w kernel.sched_cc_wakeup_threshold=80
sysctl -w kernel.sched_domain.cpu0.domain1.flags=33327
sysctl -w kernel.sched_domain.cpu1.domain1.flags=33327
sysctl -w kernel.sched_domain.cpu2.domain1.flags=33327
sysctl -w kernel.sched_domain.cpu3.domain1.flags=33327
sysctl -w kernel.sched_domain.cpu4.domain1.flags=33327
sysctl -w kernel.sched_domain.cpu5.domain1.flags=33327
sysctl -w kernel.sched_domain.cpu6.domain1.flags=33327
sysctl -w kernel.sched_domain.cpu7.domain1.flags=33327
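
(For reference, 33327 is just the original flags value with the WC bit set:
559 + 0x8000 = 0x22F | 0x8000 = 0x822F = 33327.)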

To disable CPU WC at MC domain:

sysctl -w kernel.sched_cc_wakeup_threshold=0
sysctl -w kernel.sched_domain.cpu0.domain1.flags=559
sysctl -w kernel.sched_domain.cpu1.domain1.flags=559
sysctl -w kernel.sched_domain.cpu2.domain1.flags=559
sysctl -w kernel.sched_domain.cpu3.domain1.flags=559
sysctl -w kernel.sched_domain.cpu4.domain1.flags=559
sysctl -w kernel.sched_domain.cpu5.domain1.flags=559
sysctl -w kernel.sched_domain.cpu6.domain1.flags=559
sysctl -w kernel.sched_domain.cpu7.domain1.flags=559

In addition, I will send a PnP report shortly.

v3:
- Remove rq->avg first (patch 01), and base the rest of this series on that
- Removed all CONFIG_CPU_CONCURRENCY and CONFIG_WORKLOAD_CONSOLIDATION
- CPU CC is now updated unconditionally
- CPU WC can be enabled/disabled by flags per domain level on the fly
- CPU CC and WC are entirely a fair-scheduler thing; RT is not touched anymore
 
v2:
- Data type defined in formation


Yuyang Du (16):
  Remove update_rq_runnable_avg
  Define and initialize CPU ConCurrency in struct rq
  How CC accrues with run queue change and time
  CPU CC update period is changeable via sysctl
  Update CPU CC in fair
  Add Workload Consolidation fields in struct sched_domain
  Init Workload Consolidation flags in sched_domain
  Write CPU topology info for Workload Consolidation fields in
    sched_domain
  Define and allocate a per CPU local cpumask for Workload
    Consolidation
  Workload Consolidation APIs
  Make wakeup bias threshold changeable via sysctl
  Bias select wakee than waker in WAKE_AFFINE
  Intercept wakeup/fork/exec load balancing
  Intercept idle balancing
  Intercept periodic nohz idle balancing
  Intercept periodic load balancing

 include/linux/sched.h        |    6 +
 include/linux/sched/sysctl.h |    5 +
 include/linux/topology.h     |    6 +
 kernel/sched/core.c          |   34 +-
 kernel/sched/debug.c         |    8 -
 kernel/sched/fair.c          |  924 ++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h         |   20 +-
 kernel/sysctl.c              |   16 +
 8 files changed, 972 insertions(+), 47 deletions(-)

-- 
1.7.9.5



* [RFC PATCH 01/16 v3] Remove update_rq_runnable_avg
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
@ 2014-05-30  6:35 ` Yuyang Du
  2014-05-30  6:35 ` [RFC PATCH 02/16 v3] Define and initialize CPU ConCurrency in struct rq Yuyang Du
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:35 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

Since rq->avg is not made use of anywhere (I really can't find any user),
and the code is in the fair scheduler's critical path, remove it.

Sorry if anybody still wants to use it; this removes it at least temporarily
for now.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/debug.c |    8 --------
 kernel/sched/fair.c  |   24 ++++--------------------
 kernel/sched/sched.h |    2 --
 3 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 695f977..4b864c7 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -68,14 +68,6 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #define PN(F) \
 	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))
 
-	if (!se) {
-		struct sched_avg *avg = &cpu_rq(cpu)->avg;
-		P(avg->runnable_avg_sum);
-		P(avg->runnable_avg_period);
-		return;
-	}
-
-
 	PN(se->exec_start);
 	PN(se->vruntime);
 	PN(se->sum_exec_runtime);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7570dd9..e7a0d91 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2378,18 +2378,12 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 	}
 }
 
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
-{
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
-	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
-}
 #else /* CONFIG_FAIR_GROUP_SCHED */
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 						 int force_update) {}
 static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 						  struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline void __update_task_entity_contrib(struct sched_entity *se)
@@ -2562,7 +2556,6 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
  */
 void idle_enter_fair(struct rq *this_rq)
 {
-	update_rq_runnable_avg(this_rq, 1);
 }
 
 /*
@@ -2572,7 +2565,6 @@ void idle_enter_fair(struct rq *this_rq)
  */
 void idle_exit_fair(struct rq *this_rq)
 {
-	update_rq_runnable_avg(this_rq, 0);
 }
 
 static int idle_balance(struct rq *this_rq);
@@ -2581,7 +2573,6 @@ static int idle_balance(struct rq *this_rq);
 
 static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 					   struct sched_entity *se,
 					   int wakeup) {}
@@ -3882,10 +3873,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_entity_load_avg(se, 1);
 	}
 
-	if (!se) {
-		update_rq_runnable_avg(rq, rq->nr_running);
+	if (!se)
 		inc_nr_running(rq);
-	}
+
 	hrtick_update(rq);
 }
 
@@ -3943,10 +3933,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_entity_load_avg(se, 1);
 	}
 
-	if (!se) {
+	if (!se)
 		dec_nr_running(rq);
-		update_rq_runnable_avg(rq, 1);
-	}
+
 	hrtick_update(rq);
 }
 
@@ -5364,9 +5353,6 @@ static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
 		 */
 		if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
 			list_del_leaf_cfs_rq(cfs_rq);
-	} else {
-		struct rq *rq = rq_of(cfs_rq);
-		update_rq_runnable_avg(rq, rq->nr_running);
 	}
 }
 
@@ -7243,8 +7229,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (numabalancing_enabled)
 		task_tick_numa(rq, curr);
-
-	update_rq_runnable_avg(rq, 1);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 456e492..5a66776 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -552,8 +552,6 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
-
-	struct sched_avg avg;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 	/*
-- 
1.7.9.5



* [RFC PATCH 02/16 v3] Define and initialize CPU ConCurrency in struct rq
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
  2014-05-30  6:35 ` [RFC PATCH 01/16 v3] Remove update_rq_runnable_avg Yuyang Du
@ 2014-05-30  6:35 ` Yuyang Du
  2014-05-30  6:35 ` [RFC PATCH 03/16 v3] How CC accrues with run queue change and time Yuyang Du
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:35 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

This struct lives in each CPU's rq and is updated with rq->lock held.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/core.c  |    1 +
 kernel/sched/fair.c  |   22 ++++++++++++++++++++++
 kernel/sched/sched.h |   18 ++++++++++++++++++
 3 files changed, 41 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..69623f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6881,6 +6881,7 @@ void __init sched_init(void)
 #ifdef CONFIG_NO_HZ_FULL
 		rq->last_sched_tick = 0;
 #endif
+		init_cpu_concurrency(rq);
 #endif
 		init_rq_hrtick(rq);
 		atomic_set(&rq->nr_iowait, 0);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7a0d91..2c1f453 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7688,3 +7688,25 @@ __init void init_sched_fair_class(void)
 #endif /* SMP */
 
 }
+
+#ifdef CONFIG_SMP
+
+/*
+ * CPU ConCurrency (CC) measures the CPU load by averaging
+ * the number of running tasks. Using CC, the scheduler can
+ * evaluate the load of CPUs to improve load balance for power
+ * efficiency without sacrificing performance.
+ */
+
+void init_cpu_concurrency(struct rq *rq)
+{
+	rq->concurrency.sum = 0;
+	rq->concurrency.sum_now = 0;
+	rq->concurrency.contrib = 0;
+	rq->concurrency.nr_running = 0;
+	rq->concurrency.sum_timestamp = ULLONG_MAX;
+	rq->concurrency.contrib_timestamp = ULLONG_MAX;
+	rq->concurrency.unload = 0;
+}
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5a66776..898127b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -506,6 +506,18 @@ struct root_domain {
 
 extern struct root_domain def_root_domain;
 
+struct cpu_concurrency_t {
+	u64 sum;
+	u64 sum_now;
+	u64 contrib;
+	u64 sum_timestamp;
+	u64 contrib_timestamp;
+	unsigned int nr_running;
+	int unload;
+	int dst_cpu;
+	struct cpu_stop_work unload_work;
+};
+
 #endif /* CONFIG_SMP */
 
 /*
@@ -640,6 +652,8 @@ struct rq {
 
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
+
+	struct cpu_concurrency_t concurrency;
 #endif
 };
 
@@ -1182,11 +1196,15 @@ extern void trigger_load_balance(struct rq *rq);
 extern void idle_enter_fair(struct rq *this_rq);
 extern void idle_exit_fair(struct rq *this_rq);
 
+extern void init_cpu_concurrency(struct rq *rq);
+
 #else
 
 static inline void idle_enter_fair(struct rq *rq) { }
 static inline void idle_exit_fair(struct rq *rq) { }
 
+static inline void init_cpu_concurrency(struct rq *rq) {}
+
 #endif
 
 extern void sysrq_sched_debug_show(void);
-- 
1.7.9.5



* [RFC PATCH 03/16 v3] How CC accrues with run queue change and time
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
  2014-05-30  6:35 ` [RFC PATCH 01/16 v3] Remove update_rq_runnable_avg Yuyang Du
  2014-05-30  6:35 ` [RFC PATCH 02/16 v3] Define and initialize CPU ConCurrency in struct rq Yuyang Du
@ 2014-05-30  6:35 ` Yuyang Du
  2014-06-03 12:12   ` Peter Zijlstra
  2014-05-30  6:36 ` [RFC PATCH 04/16 v3] CPU CC update period is changeable via sysctl Yuyang Du
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:35 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

It is natural to use task concurrency (the number of running tasks in the rq)
as the load indicator. We compute CC for task concurrency in two steps:

1) Divide continuous time into periods, and average the task concurrency
within each period, to tolerate transient bursts:

a = sum(concurrency * time) / period

2) Exponentially decay past periods and sum them all up, for hysteresis to
load drops or resilience to load rises (let f be the decay factor, and a_x
the xth period average since period 0):

s = a_n + f^1 * a_(n-1) + f^2 * a_(n-2) + ... + f^(n-1) * a_1 + f^n * a_0

To sum up, CPU CC is 1) the decayed average run queue length, or 2) the
run-queue-length-weighted CPU utilization.
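
As a sanity check on the cc_decayed_sum[] constants below (assuming the fixed
resolution of 1024 and the half-life of one period), the compensation for k
fully missed periods at a constant concurrency of 1 is the geometric series:

  sum(i = 1..k) 1024 * (1/2)^i = 1024 * (1 - 2^-k)

which evaluates to 512, 768, 896, 960, 992, 1008, 1016, 1020, 1022, 1023 for
k = 1..10, matching the table.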

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |  255 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 255 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2c1f453..f7910cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2549,6 +2549,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 	} /* migrations, e.g. sleep=0 leave decay_count == 0 */
 }
 
+static inline void update_cpu_concurrency(struct rq *rq);
+
 /*
  * Update the rq's load with the elapsed running time before entering
  * idle. if the last scheduled task is not a CFS task, idle_enter will
@@ -2582,6 +2584,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
 					      int force_update) {}
 
+static inline void update_cpu_concurrency(struct rq *rq) {}
+
 static inline int idle_balance(struct rq *rq)
 {
 	return 0;
@@ -7698,6 +7702,238 @@ __init void init_sched_fair_class(void)
  * efficiency without sacrificing performance.
  */
 
+/*
+ * the sum period of time is 2^26 ns (~64) by default
+ */
+unsigned int sysctl_sched_cc_sum_period = 26UL;
+
+/*
+ * the number of sum periods, after which the original
+ * will be reduced/decayed to half. we now make it
+ * unchangeable.
+ */
+static const unsigned int cc_decay_rate = 1UL;
+
+/*
+ * the contrib period of time is 2^10 (~1us) by default,
+ * us has better precision than ms, and
+ * 1024 makes use of faster shift than div
+ */
+static unsigned int cc_contrib_period = 10UL;
+
+/*
+ * the concurrency is scaled up for decaying,
+ * thus, concurrency 1 is effectively 2^cc_resolution (1024),
+ * which can be halved by 10 half-life periods
+ */
+static unsigned int cc_resolution = 10UL;
+
+/*
+ * after this number of half-life periods, even
+ * (1>>32)-1 (which is sufficiently large) is less than 1
+ */
+static unsigned int cc_decay_max_pds = 32UL;
+
+static inline u32 cc_scale_up(unsigned int c)
+{
+	return c << cc_resolution;
+}
+
+static inline u32 cc_scale_down(unsigned int c)
+{
+	return c >> cc_resolution;
+}
+
+/* from nanoseconds to sum periods */
+static inline u64 cc_sum_pds(u64 n)
+{
+	return n >> sysctl_sched_cc_sum_period;
+}
+
+/* from sum period to timestamp in ns */
+static inline u64 cc_timestamp(u64 p)
+{
+	return p << sysctl_sched_cc_sum_period;
+}
+
+/*
+ * from nanoseconds to contrib periods, because
+ * ns so risky that can overflow cc->contrib
+ */
+static inline u64 cc_contrib_pds(u64 n)
+{
+	return n >> cc_contrib_period;
+}
+
+/*
+ * cc_decay_factor only works for 32bit integer,
+ * cc_decay_factor_x, x indicates the number of periods
+ * as half-life (cc_decay_rate)
+ */
+static const u32 cc_decay_factor[] = {
+	0xFFFFFFFF,
+};
+
+/*
+ * cc_decayed_sum depends on cc_resolution (fixed 10),
+ * and cc_decay_rate (half-life)
+ */
+static const u32 cc_decayed_sum[] = {
+	0, 512, 768, 896, 960, 992,
+	1008, 1016, 1020, 1022, 1023,
+};
+
+/*
+ * the last index of cc_decayed_sum array
+ */
+static u32 cc_decayed_sum_len =
+	sizeof(cc_decayed_sum) / sizeof(cc_decayed_sum[0]) - 1;
+
+/*
+ * decay concurrency at some decay rate
+ */
+static inline u64 decay_cc(u64 cc, u64 periods)
+{
+	u32 periods_l;
+
+	if (periods <= 0)
+		return cc;
+
+	if (unlikely(periods >= cc_decay_max_pds))
+		return 0;
+
+	/* now period is not too large */
+	periods_l = (u32)periods;
+	if (periods_l >= cc_decay_rate) {
+		cc >>= periods_l / cc_decay_rate;
+		periods_l %= cc_decay_rate;
+	}
+
+	if (!periods_l)
+		return cc;
+
+	/* since cc_decay_rate is 1, we never get here */
+	cc *= cc_decay_factor[periods_l];
+
+	return cc >> 32;
+}
+
+/*
+ * add missed periods by predefined constants
+ */
+static inline u64 cc_missed_pds(u64 periods)
+{
+	if (periods <= 0)
+		return 0;
+
+	if (periods > cc_decayed_sum_len)
+		periods = cc_decayed_sum_len;
+
+	return cc_decayed_sum[periods];
+}
+
+/*
+ * scale up nr_running, because we decay
+ */
+static inline u32 cc_weight(unsigned int nr_running)
+{
+	/*
+	 * scaling factor, this should be tunable
+	 */
+	return cc_scale_up(nr_running);
+}
+
+static inline void
+__update_concurrency(struct rq *rq, u64 now, struct cpu_concurrency_t *cc)
+{
+	u64 sum_pds, sum_pds_s, sum_pds_e;
+	u64 contrib_pds, ts_contrib, contrib_pds_one;
+	u64 sum_now = 0;
+	u32 weight;
+	int updated = 0;
+
+	/*
+	 * guarantee contrib_timestamp always >= sum_timestamp,
+	 * and sum_timestamp is at period boundary
+	 */
+	if (now <= cc->sum_timestamp) {
+		cc->sum_timestamp = cc_timestamp(cc_sum_pds(now));
+		cc->contrib_timestamp = now;
+		return;
+	}
+
+	weight = cc_weight(cc->nr_running);
+
+	/* start and end of sum periods */
+	sum_pds_s = cc_sum_pds(cc->sum_timestamp);
+	sum_pds_e = cc_sum_pds(now);
+	sum_pds = sum_pds_e - sum_pds_s;
+	/* number of contrib periods in one sum period */
+	contrib_pds_one = cc_contrib_pds(cc_timestamp(1));
+
+	/*
+	 * if we have passed at least one period,
+	 * we need to do four things:
+	 */
+	if (sum_pds) {
+		/* 1) complete the last period */
+		ts_contrib = cc_timestamp(sum_pds_s + 1);
+		contrib_pds = cc_contrib_pds(ts_contrib);
+		contrib_pds -= cc_contrib_pds(cc->contrib_timestamp);
+
+		if (likely(contrib_pds))
+			cc->contrib += weight * contrib_pds;
+
+		cc->contrib = div64_u64(cc->contrib, contrib_pds_one);
+
+		cc->sum += cc->contrib;
+		cc->contrib = 0;
+
+		/* 2) update/decay them */
+		cc->sum = decay_cc(cc->sum, sum_pds);
+		sum_now = decay_cc(cc->sum, sum_pds - 1);
+
+		/* 3) compensate missed periods if any */
+		sum_pds -= 1;
+		cc->sum += cc->nr_running * cc_missed_pds(sum_pds);
+		sum_now += cc->nr_running * cc_missed_pds(sum_pds - 1);
+		updated = 1;
+
+		/* 4) update contrib timestamp to period boundary */
+		ts_contrib = cc_timestamp(sum_pds_e);
+
+		cc->sum_timestamp = ts_contrib;
+		cc->contrib_timestamp = ts_contrib;
+	}
+
+	/* current period */
+	contrib_pds = cc_contrib_pds(now);
+	contrib_pds -= cc_contrib_pds(cc->contrib_timestamp);
+
+	if (likely(contrib_pds))
+		cc->contrib += weight * contrib_pds;
+
+	/* new nr_running for next update */
+	cc->nr_running = rq->nr_running;
+
+	/*
+	 * we need to account for the current sum period,
+	 * if now has passed 1/2 of sum period, we contribute,
+	 * otherwise, we use the last complete sum period
+	 */
+	contrib_pds = cc_contrib_pds(now - cc->sum_timestamp);
+
+	if (contrib_pds > contrib_pds_one / 2) {
+		sum_now = div64_u64(cc->contrib, contrib_pds);
+		sum_now += cc->sum;
+		updated = 1;
+	}
+
+	if (updated == 1)
+		cc->sum_now = sum_now;
+	cc->contrib_timestamp = now;
+}
+
 void init_cpu_concurrency(struct rq *rq)
 {
 	rq->concurrency.sum = 0;
@@ -7709,4 +7945,23 @@ void init_cpu_concurrency(struct rq *rq)
 	rq->concurrency.unload = 0;
 }
 
+/*
+ * we update cpu concurrency at:
+ * 1) enqueue task, which increases concurrency
+ * 2) dequeue task, which decreases concurrency
+ * 3) periodic scheduler tick, in case no en/dequeue for long
+ * 4) enter and exit idle
+ * 5) update_blocked_averages
+ */
+static void update_cpu_concurrency(struct rq *rq)
+{
+	/*
+	 * protected under rq->lock
+	 */
+	struct cpu_concurrency_t *cc = &rq->concurrency;
+	u64 now = rq->clock;
+
+	__update_concurrency(rq, now, cc);
+}
+
 #endif /* CONFIG_SMP */
-- 
1.7.9.5



* [RFC PATCH 04/16 v3] CPU CC update period is changeable via sysctl
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (2 preceding siblings ...)
  2014-05-30  6:35 ` [RFC PATCH 03/16 v3] How CC accrues with run queue change and time Yuyang Du
@ 2014-05-30  6:36 ` Yuyang Du
  2014-05-30  6:36 ` [RFC PATCH 05/16 v3] Update CPU CC in fair Yuyang Du
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

sysctl_sched_cc_sum_period is the CC update period (the value is the base-2
exponent of the period length in nanoseconds). Make it changeable via the
sysctl tool. In general, the longer this period, the more stable CC is, but
the slower it responds to task concurrency changes on this CPU.
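
For example, the default value of 26 means a period of 2^26 ns (about 67 ms).
Doubling the period is then:

sysctl -w kernel.sched_cc_sum_period=27

which gives a period of 2^27 ns (about 134 ms), i.e. a more stable but slower
reacting CC.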

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched/sysctl.h |    4 ++++
 kernel/sysctl.c              |    9 +++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 8045a55..f8a3e0a 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -36,6 +36,10 @@ extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_child_runs_first;
 
+#ifdef CONFIG_SMP
+extern unsigned int sysctl_sched_cc_sum_period;
+#endif
+
 enum sched_tunable_scaling {
 	SCHED_TUNABLESCALING_NONE,
 	SCHED_TUNABLESCALING_LOG,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 74f5b58..13aea95 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1090,6 +1090,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_SMP
+	{
+		.procname	= "sched_cc_sum_period",
+		.data		= &sysctl_sched_cc_sum_period,
+		.maxlen		= sizeof(sysctl_sched_cc_sum_period),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 	{ }
 };
 
-- 
1.7.9.5



* [RFC PATCH 05/16 v3] Update CPU CC in fair
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (3 preceding siblings ...)
  2014-05-30  6:36 ` [RFC PATCH 04/16 v3] CPU CC update period is changeable via sysctl Yuyang Du
@ 2014-05-30  6:36 ` Yuyang Du
  2014-05-30  6:36 ` [RFC PATCH 06/16 v3] Add Workload Consolidation fields in struct sched_domain Yuyang Du
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

CC is only modified when we enqueue to or dequeue from the CPU rq. We also
update it on the scheduler tick, during load balancing, and on idle enter/exit,
in case there is no enqueue or dequeue for a long time.

Therefore, we update/track CC in, and only in, these points:

we update cpu concurrency at:
1) enqueue task, which increases concurrency
2) dequeue task, which decreases concurrency
3) periodic scheduler tick, in case no en/dequeue for long
4) enter and exit idle
5) update_blocked_averages

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f7910cf..96d6f64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2558,6 +2558,7 @@ static inline void update_cpu_concurrency(struct rq *rq);
  */
 void idle_enter_fair(struct rq *this_rq)
 {
+	update_cpu_concurrency(this_rq);
 }
 
 /*
@@ -2567,6 +2568,7 @@ void idle_enter_fair(struct rq *this_rq)
  */
 void idle_exit_fair(struct rq *this_rq)
 {
+	update_cpu_concurrency(this_rq);
 }
 
 static int idle_balance(struct rq *this_rq);
@@ -3877,8 +3879,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_entity_load_avg(se, 1);
 	}
 
-	if (!se)
+	if (!se) {
 		inc_nr_running(rq);
+		update_cpu_concurrency(rq);
+	}
 
 	hrtick_update(rq);
 }
@@ -3937,8 +3941,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_entity_load_avg(se, 1);
 	}
 
-	if (!se)
+	if (!se) {
 		dec_nr_running(rq);
+		update_cpu_concurrency(rq);
+	}
 
 	hrtick_update(rq);
 }
@@ -5381,6 +5387,8 @@ static void update_blocked_averages(int cpu)
 		__update_blocked_averages_cpu(cfs_rq->tg, rq->cpu);
 	}
 
+	update_cpu_concurrency(rq);
+
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
@@ -7233,6 +7241,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (numabalancing_enabled)
 		task_tick_numa(rq, curr);
+
+	update_cpu_concurrency(rq);
 }
 
 /*
-- 
1.7.9.5



* [RFC PATCH 06/16 v3] Add Workload Consolidation fields in struct sched_domain
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (4 preceding siblings ...)
  2014-05-30  6:36 ` [RFC PATCH 05/16 v3] Update CPU CC in fair Yuyang Du
@ 2014-05-30  6:36 ` Yuyang Du
  2014-05-30  6:36 ` [RFC PATCH 07/16 v3] Init Workload Consolidation flags in sched_domain Yuyang Du
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

Workload Consolidation is completely CPU topology and policy driven. To support
this, we define SD_WORKLOAD_CONSOLIDATION and add some fields to struct
sched_domain:

1) total_groups is the total number of groups in this domain
2) group_number is this CPU's group sequence number
3) consolidating_coeff is the coefficient for consolidating CPUs, changeable
   via the sysctl tool to make consolidation more or less aggressive
4) first_group is a pointer to this domain's first group, ordered by CPU number
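
For illustration (assuming the usual SMT sibling numbering {0,4} {1,5} {2,6}
{3,7} on the i7-3770K mentioned in the cover letter): in CPU2's MC domain,
which spans all 8 CPUs with one group per core, total_groups is 4,
group_number is 2 (two groups start at a lower CPU number than CPU2's own
group {2,6}), and first_group points to the {0,4} group.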

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |    6 ++++++
 kernel/sched/core.c   |    6 ++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 25f54c7..c6aac65 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -876,6 +876,7 @@ enum cpu_idle_type {
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 #define SD_NUMA			0x4000	/* cross-node balancing */
+#define SD_WORKLOAD_CONSOLIDATION	0x8000	/* consolidate CPU workload */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
@@ -960,6 +961,11 @@ struct sched_domain {
 		struct rcu_head rcu;	/* used during destruction */
 	};
 
+	unsigned int total_groups;			/* total group number */
+	unsigned int group_number;			/* this CPU's group sequence */
+	unsigned int consolidating_coeff;	/* consolidating coefficient */
+	struct sched_group *first_group;	/* ordered by CPU number */
+
 	unsigned int span_weight;
 	/*
 	 * Span of all CPUs in this domain.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 69623f1..1cb7402 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4840,7 +4840,7 @@ set_table_entry(struct ctl_table *entry,
 static struct ctl_table *
 sd_alloc_ctl_domain_table(struct sched_domain *sd)
 {
-	struct ctl_table *table = sd_alloc_ctl_entry(14);
+	struct ctl_table *table = sd_alloc_ctl_entry(15);
 
 	if (table == NULL)
 		return NULL;
@@ -4873,7 +4873,9 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
 		sizeof(long), 0644, proc_doulongvec_minmax, false);
 	set_table_entry(&table[12], "name", sd->name,
 		CORENAME_MAX_SIZE, 0444, proc_dostring, false);
-	/* &table[13] is terminator */
+	set_table_entry(&table[13], "consolidating_coeff", &sd->consolidating_coeff,
+		sizeof(int), 0644, proc_dointvec, false);
+	/* &table[14] is terminator */
 
 	return table;
 }
-- 
1.7.9.5



* [RFC PATCH 07/16 v3] Init Workload Consolidation flags in sched_domain
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (5 preceding siblings ...)
  2014-05-30  6:36 ` [RFC PATCH 06/16 v3] Add Workload Consolidation fields in struct sched_domain Yuyang Du
@ 2014-05-30  6:36 ` Yuyang Du
  2014-06-03 12:14   ` Peter Zijlstra
  2014-05-30  6:36 ` [RFC PATCH 08/16 v3] Write CPU topology info for Workload Consolidation fields " Yuyang Du
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

Workload Consolidation can be enabled/disabled on the fly. This patchset
enables WC at the MC and CPU domains by default.

To enable CPU WC (SD_WORKLOAD_CONSOLIDATION=0x8000), set the domain's flags to
its current value plus 0x8000:

sysctl -w kernel.sched_domain.cpuX.domainY.flags=<flags + 0x8000>

To disable CPU WC, clear the bit again:

sysctl -w kernel.sched_domain.cpuX.domainY.flags=<flags - 0x8000>

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/topology.h |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7062330..ebc339c3 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -102,12 +102,14 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SERIALIZE			\
 				| 0*SD_PREFER_SIBLING			\
 				| arch_sd_sibling_asym_packing()	\
+				| 0*SD_WORKLOAD_CONSOLIDATION	\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
 	.smt_gain		= 1178,	/* 15% */			\
 	.max_newidle_lb_cost	= 0,					\
 	.next_decay_max_lb_cost	= jiffies,				\
+	.consolidating_coeff = 0,					\
 }
 #endif
 #endif /* CONFIG_SCHED_SMT */
@@ -134,11 +136,13 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
+				| 1*SD_WORKLOAD_CONSOLIDATION	\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
 	.max_newidle_lb_cost	= 0,					\
 	.next_decay_max_lb_cost	= jiffies,				\
+	.consolidating_coeff = 180,					\
 }
 #endif
 #endif /* CONFIG_SCHED_MC */
@@ -167,11 +171,13 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
 				| 1*SD_PREFER_SIBLING			\
+				| 1*SD_WORKLOAD_CONSOLIDATION	\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
 	.max_newidle_lb_cost	= 0,					\
 	.next_decay_max_lb_cost	= jiffies,				\
+	.consolidating_coeff = 180,					\
 }
 #endif
 
-- 
1.7.9.5



* [RFC PATCH 08/16 v3] Write CPU topology info for Workload Consolidation fields in sched_domain
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (6 preceding siblings ...)
  2014-05-30  6:36 ` [RFC PATCH 07/16 v3] Init Workload Consolidation flags in sched_domain Yuyang Du
@ 2014-05-30  6:36 ` Yuyang Du
  2014-05-30  6:36 ` [RFC PATCH 09/16 v3] Define and allocate a per CPU local cpumask for Workload Consolidation Yuyang Du
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

Record additional CPU topology info in sched_domain for our use; this is done
in cpu_attach_domain().

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/core.c |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1cb7402..9df01d5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5496,6 +5496,31 @@ static void update_top_cache_domain(int cpu)
 	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
 }
 
+static void update_cpu_position_info(struct sched_domain *sd)
+{
+	while (sd) {
+		int i = 0, j = 0, first, min = INT_MAX;
+		struct sched_group *group;
+
+		group = sd->groups;
+		first = group_first_cpu(group);
+		do {
+			int k = group_first_cpu(group);
+			i += 1;
+			if (k < first)
+				j += 1;
+			if (k < min) {
+				sd->first_group = group;
+				min = k;
+			}
+		} while (group = group->next, group != sd->groups);
+
+		sd->total_groups = i;
+		sd->group_number = j;
+		sd = sd->parent;
+	}
+}
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
  * hold the hotplug lock.
@@ -5544,6 +5569,8 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	destroy_sched_domains(tmp, cpu);
 
 	update_top_cache_domain(cpu);
+
+	update_cpu_position_info(sd);
 }
 
 /* cpus with isolated domains */
-- 
1.7.9.5



* [RFC PATCH 09/16 v3] Define and allocate a per CPU local cpumask for Workload Consolidation
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (7 preceding siblings ...)
  2014-05-30  6:36 ` [RFC PATCH 08/16 v3] Write CPU topology info for Workload Consolidation fields " Yuyang Du
@ 2014-05-30  6:36 ` Yuyang Du
  2014-06-03 12:15   ` Peter Zijlstra
  2014-05-30  6:36 ` [RFC PATCH 10/16 v3] Workload Consolidation APIs Yuyang Du
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

We need these cpumasks to aid in consolidated load balancing.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 96d6f64..5755746 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6638,6 +6638,8 @@ out:
 	return ld_moved;
 }
 
+static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask);
+
 /*
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
@@ -7692,6 +7694,12 @@ void print_cfs_stats(struct seq_file *m, int cpu)
 __init void init_sched_fair_class(void)
 {
 #ifdef CONFIG_SMP
+	unsigned int i;
+	for_each_possible_cpu(i) {
+		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask, i),
+					GFP_KERNEL, cpu_to_node(i));
+	}
+
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
 
 #ifdef CONFIG_NO_HZ_COMMON
-- 
1.7.9.5



* [RFC PATCH 10/16 v3] Workload Consolidation APIs
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (8 preceding siblings ...)
  2014-05-30  6:36 ` [RFC PATCH 09/16 v3] Define and allocate a per CPU local cpumask for Workload Consolidation Yuyang Du
@ 2014-05-30  6:36 ` Yuyang Du
  2014-06-03 12:22   ` Peter Zijlstra
  2014-05-30  6:36 ` [RFC PATCH 11/16 v3] Make wakeup bias threshold changeable via sysctl Yuyang Du
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

Currently, CPU CC is per CPU. To consolidate, the formula is based on a heuristic.
Suppose we have 2 CPUs, whose task concurrency over time looks like this
('-' means no task, 'x' means having tasks):

1)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---------xxxx---- (CC[1])

2)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---xxxx---------- (CC[1])

If we consolidate CPU0 and CPU1, the consolidated CC will be CC' = CC[0] +
CC[1] for case 1 and CC'' = (CC[0] + CC[1]) * 2 for case 2. For the cases in
between 1 and 2, in terms of how the 'x' runs overlap, the consolidated CC
lies between CC' and CC''. So we uniformly consolidate m CPUs onto n CPUs
(m > n) when this condition holds:

(CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) <= (1 * n) * n *
consolidating_coefficient

The consolidating_coefficient could be 100% or more or less.

TODO:
1) need scheduler statistics
2) whether or not to consolidate is decided every time we need it; not efficient
3) we really want this for multi-socket machines, but the consolidation decision
   becomes time-consuming as the CPU count increases significantly; need to
   remedy this
4) make use of existing load balancing utility functions, like move_one_task

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |  468 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 468 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5755746..0c188df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2550,6 +2550,13 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 }
 
 static inline void update_cpu_concurrency(struct rq *rq);
+static struct sched_group *wc_find_group(struct sched_domain *sd,
+	struct task_struct *p, int this_cpu);
+static void wc_unload(struct cpumask *nonshielded, struct sched_domain *sd);
+static void wc_nonshielded_mask(int cpu, struct sched_domain *sd,
+	struct cpumask *mask);
+static int cpu_cc_capable(int cpu);
+static inline struct sched_domain *top_flag_domain(int cpu, int flag);
 
 /*
  * Update the rq's load with the elapsed running time before entering
@@ -7726,6 +7733,12 @@ __init void init_sched_fair_class(void)
 unsigned int sysctl_sched_cc_sum_period = 26UL;
 
 /*
+ * concurrency lower than this threshold percentage of cc 1
+ * is capable of running wakee task, otherwise make it 0
+ */
+unsigned int sysctl_sched_cc_wakeup_threshold = 80UL;
+
+/*
  * the number of sum periods, after which the original
  * will be reduced/decayed to half. we now make it
  * unchangeable.
@@ -7740,6 +7753,11 @@ static const unsigned int cc_decay_rate = 1UL;
 static unsigned int cc_contrib_period = 10UL;
 
 /*
+ * aggressively push the task even it is hot
+ */
+static unsigned int wc_push_hot_task = 1UL;
+
+/*
  * the concurrency is scaled up for decaying,
  * thus, concurrency 1 is effectively 2^cc_resolution (1024),
  * which can be halved by 10 half-life periods
@@ -7982,4 +8000,454 @@ static void update_cpu_concurrency(struct rq *rq)
 	__update_concurrency(rq, now, cc);
 }
 
+/*
+ * whether cpu is capable of having more concurrency
+ */
+static int cpu_cc_capable(int cpu)
+{
+	u64 sum = cpu_rq(cpu)->concurrency.sum_now;
+	u64 threshold = cc_weight(1);
+
+	sum *= 100;
+	sum *= cpu_rq(cpu)->cpu_power;
+
+	threshold *= sysctl_sched_cc_wakeup_threshold;
+	threshold <<= SCHED_POWER_SHIFT;
+
+	if (sum <= threshold)
+		return 1;
+
+	return 0;
+}
+
+static inline u64 sched_group_cc(struct sched_group *sg)
+{
+	u64 sg_cc = 0;
+	int i;
+
+	for_each_cpu(i, sched_group_cpus(sg))
+		sg_cc += cpu_rq(i)->concurrency.sum_now *
+			cpu_rq(i)->cpu_power;
+
+	return sg_cc;
+}
+
+static inline u64 sched_domain_cc(struct sched_domain *sd)
+{
+	struct sched_group *sg = sd->groups;
+	u64 sd_cc = 0;
+
+	do {
+		sd_cc += sched_group_cc(sg);
+		sg = sg->next;
+	} while (sg != sd->groups);
+
+	return sd_cc;
+}
+
+static inline struct sched_group *
+find_lowest_cc_group(struct sched_group *sg, int span)
+{
+	u64 grp_cc, min = ULLONG_MAX;
+	struct sched_group *lowest = NULL;
+	int i;
+
+	for (i = 0; i < span; ++i) {
+		grp_cc = sched_group_cc(sg);
+
+		if (grp_cc < min) {
+			min = grp_cc;
+			lowest = sg;
+		}
+
+		sg = sg->next;
+	}
+
+	return lowest;
+}
+
+static inline u64 __calc_cc_thr(int cpus, unsigned int asym_cc)
+{
+	u64 thr = cpus;
+
+	thr *= cc_weight(1);
+	thr *= asym_cc;
+	thr <<= SCHED_POWER_SHIFT;
+
+	return thr;
+}
+
+/*
+ * can @src_cc of @src_nr cpus be consolidated
+ * to @dst_cc of @dst_nr cpus
+ */
+static inline int
+__can_consolidate_cc(u64 src_cc, int src_nr, u64 dst_cc, int dst_nr)
+{
+	dst_cc *= dst_nr;
+	src_nr -= dst_nr;
+
+	if (unlikely(src_nr <= 0))
+		return 0;
+
+	src_nr = ilog2(src_nr);
+	src_nr += dst_nr;
+	src_cc *= src_nr;
+
+	if (src_cc > dst_cc)
+		return 0;
+
+	return 1;
+}
+
+/*
+ * find the group for asymmetric concurrency
+ * problem to address: traverse sd from top to down
+ */
+static struct sched_group * wc_find_group(struct sched_domain *sd,
+	struct task_struct *p, int this_cpu)
+{
+	int half, sg_weight, ns_half = 0;
+	struct sched_group *sg;
+	u64 sd_cc;
+
+	half = DIV_ROUND_CLOSEST(sd->total_groups, 2);
+	sg_weight = sd->groups->group_weight;
+
+	sd_cc = sched_domain_cc(sd);
+	sd_cc *= 100;
+
+	while (half) {
+		int allowed = 0, i;
+		int cpus = sg_weight * half;
+		u64 threshold = __calc_cc_thr(cpus,
+			sd->consolidating_coeff);
+
+		/*
+		 * we did not consider the added cc by this
+		 * wakeup (mostly from fork/exec)
+		 */
+		if (!__can_consolidate_cc(sd_cc, sd->span_weight,
+			threshold, cpus))
+			break;
+
+		sg = sd->first_group;
+		for (i = 0; i < half; ++i) {
+			/* if it has no cpus allowed */
+			if (!cpumask_intersects(sched_group_cpus(sg),
+					tsk_cpus_allowed(p)))
+				continue;
+
+			allowed = 1;
+			sg = sg->next;
+		}
+
+		if (!allowed)
+			break;
+
+		ns_half = half;
+		half /= 2;
+	}
+
+	if (!ns_half)
+		return NULL;
+
+	if (ns_half == 1)
+		return sd->first_group;
+
+	return find_lowest_cc_group(sd->first_group, ns_half);
+}
+
+/*
+ * top_flag_domain - return top sched_domain containing flag.
+ * @cpu:	the cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	the flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * returns the highest sched_domain of a cpu which contains the given flag.
+ * different from highest_flag_domain in that along the domain upward chain
+ * domain may or may not contain the flag.
+ */
+static inline struct sched_domain *top_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd, *hsd = NULL;
+
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & flag))
+			continue;
+		hsd = sd;
+	}
+
+	return hsd;
+}
+
+/*
+ * as of now, we have the following assumption
+ * 1) every sched_group has the same weight
+ * 2) every CPU has the same computing power
+ */
+static inline int __nonshielded_groups(struct sched_domain *sd)
+{
+	int half, sg_weight, ret = 0;
+	u64 sd_cc;
+
+	half = DIV_ROUND_CLOSEST(sd->total_groups, 2);
+	sg_weight = sd->groups->group_weight;
+
+	sd_cc = sched_domain_cc(sd);
+	sd_cc *= 100;
+
+	while (half) {
+		int cpus = sg_weight * half;
+		u64 threshold = __calc_cc_thr(cpus,
+			sd->consolidating_coeff);
+
+		if (!__can_consolidate_cc(sd_cc, sd->span_weight,
+			threshold, cpus))
+			return ret;
+
+		ret = half;
+		half /= 2;
+	}
+
+	return ret;
+}
+
+/*
+ * wc_nonshielded_mask - return the nonshielded cpus in the @mask,
+ * which is unmasked by the shielded cpus
+ *
+ * traverse downward the sched_domain tree when the sched_domain contains
+ * flag SD_WORKLOAD_CONSOLIDATION, each sd may have more than two groups
+ */
+static void
+wc_nonshielded_mask(int cpu, struct sched_domain *sd, struct cpumask *mask)
+{
+	struct cpumask *nonshielded_cpus = __get_cpu_var(local_cpu_mask);
+	int i;
+
+	while (sd) {
+		struct sched_group *sg;
+		int this_sg_nr, ns_half;
+
+		if (!(sd->flags & SD_WORKLOAD_CONSOLIDATION)) {
+			sd = sd->child;
+			continue;
+		}
+
+		ns_half = __nonshielded_groups(sd);
+
+		if (!ns_half)
+			break;
+
+		cpumask_clear(nonshielded_cpus);
+		sg = sd->first_group;
+
+		for (i = 0; i < ns_half; ++i) {
+			cpumask_or(nonshielded_cpus, nonshielded_cpus,
+				sched_group_cpus(sg));
+			sg = sg->next;
+		}
+
+		cpumask_and(mask, mask, nonshielded_cpus);
+
+		this_sg_nr = sd->group_number;
+		if (this_sg_nr)
+			break;
+
+		sd = sd->child;
+	}
+}
+
+static int cpu_task_hot(struct task_struct *p, u64 now)
+{
+	s64 delta;
+
+	if (p->sched_class != &fair_sched_class)
+		return 0;
+
+	if (unlikely(p->policy == SCHED_IDLE))
+		return 0;
+
+	if (wc_push_hot_task)
+		return 0;
+
+	/*
+	 * Buddy candidates are cache hot:
+	 */
+	if (sched_feat(CACHE_HOT_BUDDY) && this_rq()->nr_running &&
+			(&p->se == cfs_rq_of(&p->se)->next ||
+			 &p->se == cfs_rq_of(&p->se)->last))
+		return 1;
+
+	if (sysctl_sched_migration_cost == -1)
+		return 1;
+	if (sysctl_sched_migration_cost == 0)
+		return 0;
+
+	delta = now - p->se.exec_start;
+
+	return delta < (s64)sysctl_sched_migration_cost;
+}
+
+static int
+cpu_move_task(struct task_struct *p, struct rq *src_rq, struct rq *dst_rq)
+{
+	/*
+	 * we do not migrate tasks that are:
+	 * 1) running (obviously), or
+	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
+	 * 3) are cache-hot on their current CPU.
+	 */
+	if (!cpumask_test_cpu(dst_rq->cpu, tsk_cpus_allowed(p)))
+		return 0;
+
+	if (task_running(src_rq, p))
+		return 0;
+
+	/*
+	 * aggressive migration if task is cache cold
+	 */
+	if (!cpu_task_hot(p, src_rq->clock_task)) {
+		/*
+		 * move a task
+		 */
+		deactivate_task(src_rq, p, 0);
+		set_task_cpu(p, dst_rq->cpu);
+		activate_task(dst_rq, p, 0);
+		check_preempt_curr(dst_rq, p, 0);
+		return 1;
+	}
+
+	return 0;
+}
+
+/*
+ * __unload_cpu_work is run by src cpu stopper, which pushes running
+ * tasks off src cpu onto dst cpu
+ */
+static int __unload_cpu_work(void *data)
+{
+	struct rq *src_rq = data;
+	int src_cpu = cpu_of(src_rq);
+	struct cpu_concurrency_t *cc = &src_rq->concurrency;
+	struct rq *dst_rq = cpu_rq(cc->dst_cpu);
+
+	struct list_head *tasks = &src_rq->cfs_tasks;
+	struct task_struct *p, *n;
+	int pushed = 0;
+	int nr_migrate_break = 1;
+
+	raw_spin_lock_irq(&src_rq->lock);
+
+	/* make sure the requested cpu hasn't gone down in the meantime */
+	if (unlikely(src_cpu != smp_processor_id() || !cc->unload))
+		goto out_unlock;
+
+	/* Is there any task to move? */
+	if (src_rq->nr_running <= 1)
+		goto out_unlock;
+
+	double_lock_balance(src_rq, dst_rq);
+
+	list_for_each_entry_safe(p, n, tasks, se.group_node) {
+
+		if (!cpu_move_task(p, src_rq, dst_rq))
+			continue;
+
+		pushed++;
+
+		if (pushed >= nr_migrate_break)
+			break;
+	}
+
+	double_unlock_balance(src_rq, dst_rq);
+out_unlock:
+	cc->unload = 0;
+	raw_spin_unlock_irq(&src_rq->lock);
+
+	return 0;
+}
+
+/*
+ * unload src_cpu to dst_cpu
+ */
+static void unload_cpu(int src_cpu, int dst_cpu)
+{
+	unsigned long flags;
+	struct rq *src_rq = cpu_rq(src_cpu);
+	struct cpu_concurrency_t *cc = &src_rq->concurrency;
+	int unload = 0;
+
+	raw_spin_lock_irqsave(&src_rq->lock, flags);
+
+	if (!cc->unload) {
+		cc->unload = 1;
+		cc->dst_cpu = dst_cpu;
+		unload = 1;
+	}
+
+	raw_spin_unlock_irqrestore(&src_rq->lock, flags);
+
+	if (unload)
+		stop_one_cpu_nowait(src_cpu, __unload_cpu_work, src_rq,
+			&cc->unload_work);
+}
+
+static inline int find_lowest_cc_cpu(struct cpumask *mask)
+{
+	u64 cpu_cc, min = ULLONG_MAX;
+	int i, lowest = nr_cpu_ids;
+	struct rq *rq;
+
+	for_each_cpu(i, mask) {
+		rq = cpu_rq(i);
+		cpu_cc = rq->concurrency.sum_now * rq->cpu_power;
+
+		if (cpu_cc < min) {
+			min = cpu_cc;
+			lowest = i;
+		}
+	}
+
+	return lowest;
+}
+
+/*
+ * find the lowest cc cpu in shielded and nonshielded cpus,
+ * aggressively unload the shielded to the nonshielded
+ */
+static void wc_unload(struct cpumask *nonshielded, struct sched_domain *sd)
+{
+	int src_cpu = nr_cpu_ids, dst_cpu, cpu = smp_processor_id();
+	struct cpumask *shielded_cpus = __get_cpu_var(local_cpu_mask);
+	u64 cpu_cc, min = ULLONG_MAX;
+	struct rq *rq;
+
+	cpumask_andnot(shielded_cpus, sched_domain_span(sd), nonshielded);
+
+	for_each_cpu(cpu, shielded_cpus) {
+		rq = cpu_rq(cpu);
+		if (rq->nr_running <= 0)
+			continue;
+
+		cpu_cc = rq->concurrency.sum_now * rq->cpu_power;
+		if (cpu_cc < min) {
+			min = cpu_cc;
+			src_cpu = cpu;
+		}
+	}
+
+	if (src_cpu >= nr_cpu_ids)
+		return;
+
+	dst_cpu = find_lowest_cc_cpu(nonshielded);
+	if (dst_cpu >= nr_cpu_ids)
+		return;
+
+	if (src_cpu != dst_cpu)
+		unload_cpu(src_cpu, dst_cpu);
+}
+
 #endif /* CONFIG_SMP */
-- 
1.7.9.5



* [RFC PATCH 11/16 v3] Make wakeup bias threshold changeable via sysctl
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (9 preceding siblings ...)
  2014-05-30  6:36 ` [RFC PATCH 10/16 v3] Workload Consolidation APIs Yuyang Du
@ 2014-05-30  6:36 ` Yuyang Du
  2014-06-03 12:23   ` Peter Zijlstra
  2014-05-30  6:36 ` [RFC PATCH 12/16 v3] Bias select wakee than waker in WAKE_AFFINE Yuyang Du
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

In wakeup balancing, we bias toward the wakee CPU and then the waker CPU (in
this order) if it is capable of handling the wakee task.

sysctl_sched_cc_wakeup_threshold is the threshold that determines whether the
CPU is capable, and it can be changed via the sysctl tool.
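
For example, with the default of 80, a CPU whose (power-scaled) CC is at or
below 80% of one fully busy task is considered capable:

sysctl -w kernel.sched_cc_wakeup_threshold=80

Setting the threshold to 0 disables this bias, in which case the next patch
falls back to plain idle-CPU selection.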

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched/sysctl.h |    1 +
 kernel/sysctl.c              |    7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index f8a3e0a..f1e90c7 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -38,6 +38,7 @@ extern unsigned int sysctl_sched_child_runs_first;
 
 #ifdef CONFIG_SMP
 extern unsigned int sysctl_sched_cc_sum_period;
+extern unsigned int sysctl_sched_cc_wakeup_threshold;
 #endif
 
 enum sched_tunable_scaling {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 13aea95..77a5aa5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1098,6 +1098,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "sched_cc_wakeup_threshold",
+		.data		= &sysctl_sched_cc_wakeup_threshold,
+		.maxlen		= sizeof(sysctl_sched_cc_wakeup_threshold),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #endif
 	{ }
 };
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH 12/16 v3] Bias select wakee than waker in WAKE_AFFINE
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (10 preceding siblings ...)
  2014-05-30  6:36 ` [RFC PATCH 11/16 v3] Make wakeup bias threshold changeable via sysctl Yuyang Du
@ 2014-05-30  6:36 ` Yuyang Du
  2014-06-03 12:24   ` Peter Zijlstra
  2014-05-30  6:36 ` [RFC PATCH 13/16 v3] Intercept wakeup/fork/exec load balancing Yuyang Du
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

In WAKE_AFFINE, we do not simply select an idle CPU; instead we bias toward
the wakee over the waker if the CC of the wakee's or the waker's CPU (in this
order) shows it is capable of handling the wakee task.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c188df..d40ec9e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4370,7 +4370,18 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_group *sg;
 	int i = task_cpu(p);
 
-	if (idle_cpu(target))
+	/*
+	 * we do not select idle, if the cc of the wakee and
+	 * waker (in this order) is capable of handling the wakee task
+	 */
+	if (sysctl_sched_cc_wakeup_threshold) {
+		if (idle_cpu(i) || cpu_cc_capable(i))
+			return i;
+
+		if (i != target && (idle_cpu(target) || cpu_cc_capable(target)))
+			return target;
+	}
+	else if (idle_cpu(target))
 		return target;
 
 	/*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH 13/16 v3] Intercept wakeup/fork/exec load balancing
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (11 preceding siblings ...)
  2014-05-30  6:36 ` [RFC PATCH 12/16 v3] Bias select wakee than waker in WAKE_AFFINE Yuyang Du
@ 2014-05-30  6:36 ` Yuyang Du
  2014-06-03 12:27   ` Peter Zijlstra
  2014-05-30  6:36 ` [RFC PATCH 14/16 v3] Intercept idle balancing Yuyang Du
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

We intercept load balancing to contain the load and load balancing in
the consolidated CPUs according to our consolidating mechanism.

In wakeup/fork/exec load balancing, when looking for the idlest sched_group,
we first try to find the consolidated group.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d40ec9e..1c9ac08 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4475,7 +4475,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 
 	while (sd) {
-		struct sched_group *group;
+		struct sched_group *group = NULL;
 		int weight;
 
 		if (!(sd->flags & sd_flag)) {
@@ -4483,7 +4483,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			continue;
 		}
 
-		group = find_idlest_group(sd, p, cpu, sd_flag);
+		if (sd->flags & SD_WORKLOAD_CONSOLIDATION)
+			group = wc_find_group(sd, p, cpu);
+
+		if (!group)
+			group = find_idlest_group(sd, p, cpu, sd_flag);
+
 		if (!group) {
 			sd = sd->child;
 			continue;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH 14/16 v3] Intercept idle balancing
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (12 preceding siblings ...)
  2014-05-30  6:36 ` [RFC PATCH 13/16 v3] Intercept wakeup/fork/exec load balancing Yuyang Du
@ 2014-05-30  6:36 ` Yuyang Du
  2014-05-30  6:36 ` [RFC PATCH 15/16 v3] Intercept periodic nohz " Yuyang Du
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

We intercept load balancing to contain the load and load balancing in
the consolidated CPUs according to our consolidating mechanism.

In idle balancing, we do two things:

1) Skip pulling tasks to the idle non-consolidated CPUs.

2) For consolidated idle CPUs, we aggressively pull tasks from the
   non-consolidated CPUs.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c9ac08..220773f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6692,6 +6692,22 @@ static int idle_balance(struct rq *this_rq)
 
 	update_blocked_averages(this_cpu);
 	rcu_read_lock();
+
+	sd = top_flag_domain(this_cpu, SD_WORKLOAD_CONSOLIDATION);
+	if (sd) {
+		struct cpumask *nonshielded_cpus = __get_cpu_var(load_balance_mask);
+
+		cpumask_copy(nonshielded_cpus, cpu_active_mask);
+
+		/*
+		 * if we encounter shielded cpus here, don't do balance on them
+		 */
+		wc_nonshielded_mask(this_cpu, sd, nonshielded_cpus);
+		if (!cpumask_test_cpu(this_cpu, nonshielded_cpus))
+			goto unlock;
+		wc_unload(nonshielded_cpus, sd);
+	}
+
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
 		int continue_balancing = 1;
@@ -6724,6 +6740,7 @@ static int idle_balance(struct rq *this_rq)
 		if (pulled_task)
 			break;
 	}
+unlock:
 	rcu_read_unlock();
 
 	raw_spin_lock(&this_rq->lock);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH 15/16 v3] Intercept periodic nohz idle balancing
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (13 preceding siblings ...)
  2014-05-30  6:36 ` [RFC PATCH 14/16 v3] Intercept idle balancing Yuyang Du
@ 2014-05-30  6:36 ` Yuyang Du
  2014-05-30  6:36 ` [RFC PATCH 16/16 v3] Intercept periodic load balancing Yuyang Du
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

We intercept load balancing to contain the load and load balancing in
the consolidated CPUs according to our consolidating mechanism.

In periodic nohz idle balancing, we exclude the idle but non-consolidated
CPUs from load balancing.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   57 ++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 47 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 220773f..1b8dd45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6869,10 +6869,47 @@ static struct {
 
 static inline int find_new_ilb(void)
 {
-	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	int ilb;
 
-	if (ilb < nr_cpu_ids && idle_cpu(ilb))
-		return ilb;
+	/*
+	 * Optimize for the case when we have no idle CPUs or only one
+	 * idle CPU. Don't walk the sched_domain hierarchy in such cases
+	 */
+	if (cpumask_weight(nohz.idle_cpus_mask) < 2)
+		return nr_cpu_ids;
+
+	ilb = cpumask_first(nohz.idle_cpus_mask);
+
+	if (ilb < nr_cpu_ids && idle_cpu(ilb)) {
+		struct sched_domain *sd;
+		int this_cpu = smp_processor_id();
+
+		rcu_read_lock();
+		sd = top_flag_domain(this_cpu, SD_WORKLOAD_CONSOLIDATION);
+		if (sd) {
+			struct cpumask *nonshielded_cpus = __get_cpu_var(load_balance_mask);
+
+			cpumask_copy(nonshielded_cpus, nohz.idle_cpus_mask);
+
+			wc_nonshielded_mask(this_cpu, sd, nonshielded_cpus);
+			rcu_read_unlock();
+
+			if (cpumask_weight(nonshielded_cpus) < 2)
+				return nr_cpu_ids;
+
+			/*
+			 * get idle load balancer again
+			 */
+			ilb = cpumask_first(nonshielded_cpus);
+
+			if (ilb < nr_cpu_ids && idle_cpu(ilb))
+				return ilb;
+		}
+		else {
+			rcu_read_unlock();
+			return ilb;
+		}
+	}
 
 	return nr_cpu_ids;
 }
@@ -7109,7 +7146,7 @@ out:
  * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
  * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle, struct cpumask *mask)
 {
 	int this_cpu = this_rq->cpu;
 	struct rq *rq;
@@ -7119,7 +7156,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 	    !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
 		goto end;
 
-	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
+	for_each_cpu(balance_cpu, mask) {
 		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
 			continue;
 
@@ -7167,10 +7204,10 @@ static inline int nohz_kick_needed(struct rq *rq)
 	if (unlikely(rq->idle_balance))
 		return 0;
 
-       /*
-	* We may be recently in ticked or tickless idle mode. At the first
-	* busy tick after returning from idle, we will update the busy stats.
-	*/
+	/*
+	 * We may be recently in ticked or tickless idle mode. At the first
+	 * busy tick after returning from idle, we will update the busy stats.
+	 */
 	set_cpu_sd_state_busy();
 	nohz_balance_exit_idle(cpu);
 
@@ -7213,7 +7250,7 @@ need_kick:
 	return 1;
 }
 #else
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
+static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle, struct cpumask *mask) { }
 #endif
 
 /*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH 16/16 v3] Intercept periodic load balancing
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (14 preceding siblings ...)
  2014-05-30  6:36 ` [RFC PATCH 15/16 v3] Intercept periodic nohz " Yuyang Du
@ 2014-05-30  6:36 ` Yuyang Du
  2014-06-09 17:30 ` [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Morten Rasmussen
       [not found] ` <20140609164848.GB29593@e103034-lin>
  17 siblings, 0 replies; 32+ messages in thread
From: Yuyang Du @ 2014-05-30  6:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

We intercept load balancing to contain the load and load balancing in
the consolidated CPUs according to our consolidating mechanism.

In periodic load balancing, we do two things:

1) Skip pulling tasks to the non-consolidated CPUs.

2) For consolidated idle CPUs, we aggressively pull tasks from the
   non-consolidated CPUs.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   51 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 44 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1b8dd45..d22ac87 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7260,17 +7260,54 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle, struc
 static void run_rebalance_domains(struct softirq_action *h)
 {
 	struct rq *this_rq = this_rq();
+	int this_cpu = cpu_of(this_rq);
+	struct sched_domain *sd;
 	enum cpu_idle_type idle = this_rq->idle_balance ?
 						CPU_IDLE : CPU_NOT_IDLE;
 
-	rebalance_domains(this_rq, idle);
+	rcu_read_lock();
+	sd = top_flag_domain(this_cpu, SD_WORKLOAD_CONSOLIDATION);
+	if (sd) {
+		struct cpumask *nonshielded_cpus = __get_cpu_var(load_balance_mask);
 
-	/*
-	 * If this cpu has a pending nohz_balance_kick, then do the
-	 * balancing on behalf of the other idle cpus whose ticks are
-	 * stopped.
-	 */
-	nohz_idle_balance(this_rq, idle);
+		/*
+		 * if we encounter shielded cpus here, don't do balance on them
+		 */
+		cpumask_copy(nonshielded_cpus, cpu_active_mask);
+
+		wc_nonshielded_mask(this_cpu, sd, nonshielded_cpus);
+
+		/*
+		 * aggressively unload the shielded cpus to unshielded cpus
+		 */
+		wc_unload(nonshielded_cpus, sd);
+		rcu_read_unlock();
+
+		if (cpumask_test_cpu(this_cpu, nonshielded_cpus)) {
+			struct cpumask *idle_cpus = __get_cpu_var(local_cpu_mask);
+			cpumask_and(idle_cpus, nonshielded_cpus, nohz.idle_cpus_mask);
+
+			rebalance_domains(this_rq, idle);
+
+			/*
+			 * If this cpu has a pending nohz_balance_kick, then do the
+			 * balancing on behalf of the other idle cpus whose ticks are
+			 * stopped.
+			 */
+			nohz_idle_balance(this_rq, idle, idle_cpus);
+		}
+	}
+	else {
+		rcu_read_unlock();
+		rebalance_domains(this_rq, idle);
+
+		/*
+		 * If this cpu has a pending nohz_balance_kick, then do the
+		 * balancing on behalf of the other idle cpus whose ticks are
+		 * stopped.
+		 */
+		nohz_idle_balance(this_rq, idle, nohz.idle_cpus_mask);
+	}
 }
 
 /*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 03/16 v3] How CC accrues with run queue change and time
  2014-05-30  6:35 ` [RFC PATCH 03/16 v3] How CC accrues with run queue change and time Yuyang Du
@ 2014-06-03 12:12   ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2014-06-03 12:12 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm,
	arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 1222 bytes --]

On Fri, May 30, 2014 at 02:35:59PM +0800, Yuyang Du wrote:
> It is natural to use task concurrency (running tasks in the rq) as load
> indicator. We calculate CC for task concurrency by two steps:
> 
> 1) Divide continuous time into periods of time, and average task concurrency
> in period, for tolerating the transient bursts:
> 
> a = sum(concurrency * time) / period
> 
> 2) Exponentially decay past periods, and synthesize them all, for hysteresis
> to load drops or resilience to load rises (let f be decaying factor, and a_x
> the xth period average since period 0):
> 
> s = a_n + f^1 * a_n-1 + f^2 * a_n-2 +, ..., + f^(n-1) * a_1 + f^n * a_0
> 
> To sum up, CPU CC is 1) decayed average run queue length, or 2) run-queue-lengh-
> weighted CPU utilization.

why!? what benefit does it have over utilization.

Also, not a mention of dvfs or how that would impact things.

> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
>  kernel/sched/fair.c |  255 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 255 insertions(+)

And not a single word on why you cannot use the existing per-task
accounting infrastructure and not a single attempt to merge the two.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 07/16 v3] Init Workload Consolidation flags in sched_domain
  2014-05-30  6:36 ` [RFC PATCH 07/16 v3] Init Workload Consolidation flags in sched_domain Yuyang Du
@ 2014-06-03 12:14   ` Peter Zijlstra
  2014-06-09 17:56     ` Dietmar Eggemann
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2014-06-03 12:14 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm,
	arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 2208 bytes --]

On Fri, May 30, 2014 at 02:36:03PM +0800, Yuyang Du wrote:
> Workload Consolidation can be enabled/disabled on the fly. This patchset
> enables MC and CPU domain WC by default.
> 
> To enable CPU WC (SD_WORKLOAD_CONSOLIDATION=0x8000):
> 
> sysctl -w kernel.sched_domain.cpuX.domainY.flags += 0x8000
> 
> To disable CPU WC:
> 
> sysctl -w kernel.sched_domain.cpuX.domainY.flags -= 0x8000
> 
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
>  include/linux/topology.h |    6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 7062330..ebc339c3 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -102,12 +102,14 @@ int arch_update_cpu_topology(void);
>  				| 0*SD_SERIALIZE			\
>  				| 0*SD_PREFER_SIBLING			\
>  				| arch_sd_sibling_asym_packing()	\
> +				| 0*SD_WORKLOAD_CONSOLIDATION	\
>  				,					\
>  	.last_balance		= jiffies,				\
>  	.balance_interval	= 1,					\
>  	.smt_gain		= 1178,	/* 15% */			\
>  	.max_newidle_lb_cost	= 0,					\
>  	.next_decay_max_lb_cost	= jiffies,				\
> +	.consolidating_coeff = 0,					\
>  }
>  #endif
>  #endif /* CONFIG_SCHED_SMT */
> @@ -134,11 +136,13 @@ int arch_update_cpu_topology(void);
>  				| 0*SD_SHARE_CPUPOWER			\
>  				| 1*SD_SHARE_PKG_RESOURCES		\
>  				| 0*SD_SERIALIZE			\
> +				| 1*SD_WORKLOAD_CONSOLIDATION	\
>  				,					\
>  	.last_balance		= jiffies,				\
>  	.balance_interval	= 1,					\
>  	.max_newidle_lb_cost	= 0,					\
>  	.next_decay_max_lb_cost	= jiffies,				\
> +	.consolidating_coeff = 180,					\
>  }
>  #endif
>  #endif /* CONFIG_SCHED_MC */
> @@ -167,11 +171,13 @@ int arch_update_cpu_topology(void);
>  				| 0*SD_SHARE_PKG_RESOURCES		\
>  				| 0*SD_SERIALIZE			\
>  				| 1*SD_PREFER_SIBLING			\
> +				| 1*SD_WORKLOAD_CONSOLIDATION	\
>  				,					\
>  	.last_balance		= jiffies,				\
>  	.balance_interval	= 1,					\
>  	.max_newidle_lb_cost	= 0,					\
>  	.next_decay_max_lb_cost	= jiffies,				\
> +	.consolidating_coeff = 180,					\
>  }
>  #endif

What tree are you working against, non of that exists anymore. Also, you
cannot unconditionally set this.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 09/16 v3] Define and allocate a per CPU local cpumask for Workload Consolidation
  2014-05-30  6:36 ` [RFC PATCH 09/16 v3] Define and allocate a per CPU local cpumask for Workload Consolidation Yuyang Du
@ 2014-06-03 12:15   ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2014-06-03 12:15 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm,
	arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 205 bytes --]

On Fri, May 30, 2014 at 02:36:05PM +0800, Yuyang Du wrote:
> We need these cpumasks to aid in consolidated load balancing
> 

That's an entirely unsatisfactory changelog. Why would you need these?



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 10/16 v3] Workload Consolidation APIs
  2014-05-30  6:36 ` [RFC PATCH 10/16 v3] Workload Consolidation APIs Yuyang Du
@ 2014-06-03 12:22   ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2014-06-03 12:22 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm,
	arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 2375 bytes --]

On Fri, May 30, 2014 at 02:36:06PM +0800, Yuyang Du wrote:
> Currently, CPU CC is per CPU. To consolidate, the formula is based on a heuristic.
> Suppose we have 2 CPUs, their task concurrency over time is ('-' means no
> task, 'x' having tasks):
> 
> 1)
> CPU0: ---xxxx---------- (CC[0])
> CPU1: ---------xxxx---- (CC[1])
> 
> 2)
> CPU0: ---xxxx---------- (CC[0])
> CPU1: ---xxxx---------- (CC[1])
> 
> If we consolidate CPU0 and CPU1, the consolidated CC will be: CC' = CC[0] +
> CC[1] for case 1 and CC'' = (CC[0] + CC[1]) * 2 for case 2. For the cases in
> between case 1 and 2 in terms of how xxx overlaps, the CC should be between
> CC' and CC''. So, we uniformly use this condition for consolidation (suppose
> we consolidate m CPUs to n CPUs, m > n):
> 
> (CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) >=<? (1 * n) * n *
> consolidating_coefficient
> 
> The consolidating_coefficient could be like 100% or more or less.
> 

I'm still struggling to match that to the code presented.

> +/*
> + * as of now, we have the following assumption
> + * 1) every sched_group has the same weight
> + * 2) every CPU has the same computing power
> + */

Those are complete non starters.

> +/*
> + * wc_nonshielded_mask - return the nonshielded cpus in the @mask,
> + * which is unmasked by the shielded cpus
> + *
> + * traverse downward the sched_domain tree when the sched_domain contains
> + * flag SD_WORKLOAD_CONSOLIDATION, each sd may have more than two groups

WTF is a shielded/nonshielded cpu?

> +static int cpu_task_hot(struct task_struct *p, u64 now)
> +{
> +	s64 delta;
> +
> +	if (p->sched_class != &fair_sched_class)
> +		return 0;
> +
> +	if (unlikely(p->policy == SCHED_IDLE))
> +		return 0;
> +
> +	if (wc_push_hot_task)
> +		return 0;
> +
> +	/*
> +	 * Buddy candidates are cache hot:
> +	 */
> +	if (sched_feat(CACHE_HOT_BUDDY) && this_rq()->nr_running &&
> +			(&p->se == cfs_rq_of(&p->se)->next ||
> +			 &p->se == cfs_rq_of(&p->se)->last))
> +		return 1;
> +
> +	if (sysctl_sched_migration_cost == -1)
> +		return 1;
> +	if (sysctl_sched_migration_cost == 0)
> +		return 0;
> +
> +	delta = now - p->se.exec_start;
> +
> +	return delta < (s64)sysctl_sched_migration_cost;
> +}

In what universe to we need an exact copy of task_hot() ?

and on and on it goes.. :-(

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 11/16 v3] Make wakeup bias threshold changeable via sysctl
  2014-05-30  6:36 ` [RFC PATCH 11/16 v3] Make wakeup bias threshold changeable via sysctl Yuyang Du
@ 2014-06-03 12:23   ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2014-06-03 12:23 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm,
	arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 358 bytes --]

On Fri, May 30, 2014 at 02:36:07PM +0800, Yuyang Du wrote:
> In wakeup balance, we bias wakee and waker (in this order) if it is capable
> of handling the wakee task.
> 
> sysctl_sched_cc_wakeup_threshold is the threshold to see whether the CPU
> is capable, and can be changed by sysctl tool

This must strive for the most useless changelog ever.. 

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 12/16 v3] Bias select wakee than waker in WAKE_AFFINE
  2014-05-30  6:36 ` [RFC PATCH 12/16 v3] Bias select wakee than waker in WAKE_AFFINE Yuyang Du
@ 2014-06-03 12:24   ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2014-06-03 12:24 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm,
	arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 1278 bytes --]

On Fri, May 30, 2014 at 02:36:08PM +0800, Yuyang Du wrote:
> In WAKE_AFFINE, we do not simply select idle, but bias wakee than waker
> if the cc of the wakee and waker (in this order) is capable of handling
> the wakee task
> 
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
>  kernel/sched/fair.c |   13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0c188df..d40ec9e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4370,7 +4370,18 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  	struct sched_group *sg;
>  	int i = task_cpu(p);
>  
> -	if (idle_cpu(target))
> +	/*
> +	 * we do not select idle, if the cc of the wakee and
> +	 * waker (in this order) is capable of handling the wakee task
> +	 */
> +	if (sysctl_sched_cc_wakeup_threshold) {
> +		if (idle_cpu(i) || cpu_cc_capable(i))
> +			return i;
> +
> +		if (i != target && (idle_cpu(target) || cpu_cc_capable(target)))
> +			return target;
> +	}
> +	else if (idle_cpu(target))
>  		return target;
>  

So now you make a function called: select_idle_sibling() explicitly not
pick an idle cpu, and you don't think there's anything wrong with that?

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 13/16 v3] Intercept wakeup/fork/exec load balancing
  2014-05-30  6:36 ` [RFC PATCH 13/16 v3] Intercept wakeup/fork/exec load balancing Yuyang Du
@ 2014-06-03 12:27   ` Peter Zijlstra
  2014-06-03 23:46     ` Yuyang Du
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2014-06-03 12:27 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm,
	arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 798 bytes --]

On Fri, May 30, 2014 at 02:36:09PM +0800, Yuyang Du wrote:
> We intercept load balancing to contain the load and load balancing in
> the consolidated CPUs according to our consolidating mechanism.
> 
> In wakeup/fork/exec load balaning, when to find the idlest sched_group,
> we first try to find the consolidated group

Anything with intercept in is a complete non-starter. You still fully
duplicate the logic.

You really didn't get anything I said, did you?

Please as to go back to square 1 and read again.

So take a step back and try and explain what and why you're doing
things, also try and look at what other people are doing. If I see
another patch from you within two weeks I'll simply delete it, there's
no way you can read up and fix everything in such a short time.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 13/16 v3] Intercept wakeup/fork/exec load balancing
  2014-06-03 12:27   ` Peter Zijlstra
@ 2014-06-03 23:46     ` Yuyang Du
  0 siblings, 0 replies; 32+ messages in thread
From: Yuyang Du @ 2014-06-03 23:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm,
	arjan.van.de.ven, len.brown, alan.cox, mark.gross, pjt, bsegall,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu

On Tue, Jun 03, 2014 at 02:27:02PM +0200, Peter Zijlstra wrote:
> On Fri, May 30, 2014 at 02:36:09PM +0800, Yuyang Du wrote:
> > We intercept load balancing to contain the load and load balancing in
> > the consolidated CPUs according to our consolidating mechanism.
> > 
> > In wakeup/fork/exec load balaning, when to find the idlest sched_group,
> > we first try to find the consolidated group
> 
> Anything with intercept in is a complete non-starter. You still fully
> duplicate the logic.
> 
> You really didn't get anything I said, did you?
> 
> Please as to go back to square 1 and read again.
> 
> So take a step back and try and explain what and why you're doing
> things, also try and look at what other people are doing. If I see
> another patch from you within two weeks I'll simply delete it, there's
> no way you can read up and fix everything in such a short time.

Hi Peter,

Thanks for your reply, it hurts though, :(

I was concerned about what you said earlier, which should be this one:

PeterZ: Fourthly, I'm _never_ going to merge anything that hijacks the load balancer
and does some random other thing. There's going to be a single load-balancer
full stop.

But let me give some explanation of this interception/hijack. It is driven by
a sched policy (SD_WORKLOAD_CONSOLIDATION) and is the resulting effect of that
policy when enabled, while still being part of the load balancer. Can't we
do/call it that way?

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
  2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (15 preceding siblings ...)
  2014-05-30  6:36 ` [RFC PATCH 16/16 v3] Intercept periodic load balancing Yuyang Du
@ 2014-06-09 17:30 ` Morten Rasmussen
       [not found] ` <20140609164848.GB29593@e103034-lin>
  17 siblings, 0 replies; 32+ messages in thread
From: Morten Rasmussen @ 2014-06-09 17:30 UTC (permalink / raw)
  To: linux-kernel

Resend... The first attempt didn't reach LKML for some reason.

On Fri, May 30, 2014 at 07:35:56AM +0100, Yuyang Du wrote:
> Thanks to CFS’s “model an ideal, precise multi-tasking CPU”, tasks can be seen
> as concurrently running (the tasks in the runqueue). So it is natural to use
> task concurrency as load indicator. Having said that, we do two things:
> 
> 1) Divide continuous time into periods of time, and average task concurrency
> in period, for tolerating the transient bursts:
> a = sum(concurrency * time) / period
> 2) Exponentially decay past periods, and synthesize them all, for hysteresis
> to load drops or resilience to load rises (let f be decaying factor, and a_x
> the xth period average since period 0):
> s = a_n + f^1 * a_n-1 + f^2 * a_n-2 +, ..., + f^(n-1) * a_1 + f^n * a_0
> 
> We name this load indicator as CPU ConCurrency (CC): task concurrency
> determines how many CPUs are needed to be running concurrently.
> 
> Another two ways of how to interpret CC:
> 
> 1) the current work-conserving load balance also uses CC, but instantaneous
> CC.
> 
> 2) CC vs. CPU utilization. CC is runqueue-length-weighted CPU utilization. If
> we change: "a = sum(concurrency * time) / period" to "a' = sum(1 * time) /
> period". Then a' is just about the CPU utilization. And the way we weight
> runqueue-length is the simplest one (excluding the exponential decays, and you
> may have other ways).

Isn't a' exactly the rq runnable_avg_{sum, period} that you remove in
patch 1? In that case it seems more obvious to repurpose them by
multiplying the contributions to the rq runnable_avg_sum by
nr_running. AFAICT, that should give you the same metric.
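
Something along these lines, purely as an illustration (the struct and
function names here are made up, not from any tree):

	/*
	 * Illustrative sketch: accumulate a concurrency-weighted runnable
	 * sum alongside the elapsed period, mirroring the existing
	 * runnable_avg_{sum, period} pair but weighting each busy interval
	 * by the number of runnable tasks.
	 */
	struct cc_avg {
		u64 sum;	/* sum of (time * nr_running), decayed */
		u64 period;	/* sum of time, decayed */
	};

	static inline void cc_avg_add(struct cc_avg *cc, u64 delta,
				      unsigned int nr_running)
	{
		cc->sum += delta * nr_running;
		cc->period += delta;
	}

The periodic decay handling could stay as in the existing code.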

On the other hand, I don't see what this new metric gives us that can't
be inferred from the existing load tracking metrics or through slight
modifications of those. It came up recently in a different thread that
the tracked task load is heavily influenced by other tasks running on
the same cpu if they happen to overlap in time. IIUC, that is exactly the
information you are after. I think you could implement the same task
packing behaviour based on an unweighted version of
cfs.runnable_load_avg instead, which should be fairly straightforward to
introduce (I think it will become useful for other purposes as well). I
sort of hinted at that in that thread already:

https://lkml.org/lkml/2014/6/4/54

> 
> To track CC, we intercept the scheduler in 1) enqueue, 2) dequeue, 3)
> scheduler tick, and 4) enter/exit idle.
> 
> After CC, in the consolidation part, we do 1) attach the CPU topology to be
> adaptive beyond our experimental platforms, and 2) intercept the current load
> balance for load and load balancing containment.
> 
> Currently, CC is per CPU. To consolidate, the formula is based on a heuristic.
> Suppose we have 2 CPUs, their task concurrency over time is ('-' means no
> task, 'x' having tasks):
> 
> 1)
> CPU0: ---xxxx---------- (CC[0])
> CPU1: ---------xxxx---- (CC[1])
> 
> 2)
> CPU0: ---xxxx---------- (CC[0])
> CPU1: ---xxxx---------- (CC[1])
> 
> If we consolidate CPU0 and CPU1, the consolidated CC will be: CC' = CC[0] +
> CC[1] for case 1 and CC'' = (CC[0] + CC[1]) * 2 for case 2. For the cases in
> between case 1 and 2 in terms of how xxx overlaps, the CC should be between
> CC' and CC''.

The potential worst case consolidated CC sum is:

	n * \sum_{i=0}^{n-1} CC[i]

So, the range in which the true consolidated CC lies grows
proportionally to the number of cpus. We can't really say anything about
how things will pan out if we consolidate on fewer cpus. However, if it
turns out to be a bad mix of tasks, the task runnable_avg_sum will go up,
and if we use cfs.runnable_load_avg as the indication of compute
capacity requirements, we would eventually spread the load again.
Clearly, we don't want an unstable situation, so it might be better to do
the consolidation partially and see where things are going.

> So, we uniformly use this condition for consolidation (suppose
> we consolidate m CPUs to n CPUs, m > n):
> 
> (CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) >=<? (1 * n) * n *
> consolidating_coefficient

Do you have a rationale behind this heuristic? It seems to be more and
more pessimistic about how much load we can put on 'n' cpus as 'm'
increases, basically trying to factor in some of the error that may be in
the consolidated CC. Why the '(1 * n) * n...' and not just 'n * n * con...'?

Overall, IIUC, the aim of this patch set seems quite similar to the
previous proposals for task packing.

Morten

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 07/16 v3] Init Workload Consolidation flags in sched_domain
  2014-06-03 12:14   ` Peter Zijlstra
@ 2014-06-09 17:56     ` Dietmar Eggemann
  2014-06-09 21:18       ` Yuyang Du
  0 siblings, 1 reply; 32+ messages in thread
From: Dietmar Eggemann @ 2014-06-09 17:56 UTC (permalink / raw)
  To: Peter Zijlstra, Yuyang Du; +Cc: linux-kernel, linux-pm

... turned out that probably the cc list was too big for lkml. Dropping
all the individual email addresses on CC.

... it seems that this message hasn't made it to the list. Apologies to
everyone on To: and Cc: receiving it again.

On 03/06/14 13:14, Peter Zijlstra wrote:
> On Fri, May 30, 2014 at 02:36:03PM +0800, Yuyang Du wrote:
>> Workload Consolidation can be enabled/disabled on the fly. This patchset
>> enables MC and CPU domain WC by default.
>>
>> To enable CPU WC (SD_WORKLOAD_CONSOLIDATION=0x8000):
>>
>> sysctl -w kernel.sched_domain.cpuX.domainY.flags += 0x8000
>>
>> To disable CPU WC:
>>
>> sysctl -w kernel.sched_domain.cpuX.domainY.flags -= 0x8000
>>
>> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
>> ---
>>  include/linux/topology.h |    6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>> index 7062330..ebc339c3 100644
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -102,12 +102,14 @@ int arch_update_cpu_topology(void);
>>  				| 0*SD_SERIALIZE			\
>>  				| 0*SD_PREFER_SIBLING			\
>>  				| arch_sd_sibling_asym_packing()	\
>> +				| 0*SD_WORKLOAD_CONSOLIDATION	\
>>  				,					\
>>  	.last_balance		= jiffies,				\
>>  	.balance_interval	= 1,					\
>>  	.smt_gain		= 1178,	/* 15% */			\
>>  	.max_newidle_lb_cost	= 0,					\
>>  	.next_decay_max_lb_cost	= jiffies,				\
>> +	.consolidating_coeff = 0,					\
>>  }
>>  #endif
>>  #endif /* CONFIG_SCHED_SMT */
>> @@ -134,11 +136,13 @@ int arch_update_cpu_topology(void);
>>  				| 0*SD_SHARE_CPUPOWER			\
>>  				| 1*SD_SHARE_PKG_RESOURCES		\
>>  				| 0*SD_SERIALIZE			\
>> +				| 1*SD_WORKLOAD_CONSOLIDATION	\
>>  				,					\
>>  	.last_balance		= jiffies,				\
>>  	.balance_interval	= 1,					\
>>  	.max_newidle_lb_cost	= 0,					\
>>  	.next_decay_max_lb_cost	= jiffies,				\
>> +	.consolidating_coeff = 180,					\
>>  }
>>  #endif
>>  #endif /* CONFIG_SCHED_MC */
>> @@ -167,11 +171,13 @@ int arch_update_cpu_topology(void);
>>  				| 0*SD_SHARE_PKG_RESOURCES		\
>>  				| 0*SD_SERIALIZE			\
>>  				| 1*SD_PREFER_SIBLING			\
>> +				| 1*SD_WORKLOAD_CONSOLIDATION	\
>>  				,					\
>>  	.last_balance		= jiffies,				\
>>  	.balance_interval	= 1,					\
>>  	.max_newidle_lb_cost	= 0,					\
>>  	.next_decay_max_lb_cost	= jiffies,				\
>> +	.consolidating_coeff = 180,					\
>>  }
>>  #endif
> 
> What tree are you working against, non of that exists anymore. Also, you
> cannot unconditionally set this.
> 

Hi Yuyang,

I'm running these patches on my ARM TC2 on top of
kernel/git/torvalds/linux.git (v3.15-rc7-79-gfe45736f4134). There are
considerable changes in the area of sched domain setup since Vincent's
patchset 'rework sched_domain topology description' (destined for v3.16),
which you can find in kernel/git/tip/tip.git.

Why did you make SD_WORKLOAD_CONSOLIDATION controllable via sysctl? All
the other SD flags are set during setup. Your top_flag_domain() function
takes care of figuring out what the highest sd level is on which this flag
is set during load balancing, but I can't find any good reason to do it
this way other than for testing purposes?

Setting SD_WORKLOAD_CONSOLIDATION (which is probably a behavioural flag
rather than a topology description related one) on a certain sd level
requires you to also think about its implications in sd_init() and in the
sd degenerate functionality.

-- Dietmar




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 07/16 v3] Init Workload Consolidation flags in sched_domain
  2014-06-09 17:56     ` Dietmar Eggemann
@ 2014-06-09 21:18       ` Yuyang Du
  2014-06-10 11:52         ` Dietmar Eggemann
  0 siblings, 1 reply; 32+ messages in thread
From: Yuyang Du @ 2014-06-09 21:18 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: Peter Zijlstra, linux-kernel, linux-pm

On Mon, Jun 09, 2014 at 06:56:17PM +0100, Dietmar Eggemann wrote:

Thanks, Dietmar.

> I'm running these patches on my ARM TC2 on top of
> kernel/git/torvalds/linux.git (v3.15-rc7-79-gfe45736f4134). There're
> considerable changes in the area of sched domain setup since Vincent's
> patchset 'rework sched_domain topology description' (destined for v3.16)
> which you can find on kernel/git/tip/tip.git .
> 

Yeah, PeterZ pointed it out. It was on top of mainline not tip.

> Why did you make SD_WORKLOAD_CONSOLIDATION controllable via sysctl? All
> the other SD flags are set during setup.
> 

I don't understand. Any flag or parameter in sched_domain can be modified
on-the-fly after booting via sysctl. The SD_XXX_INIT is a template to make
the sched_domain initialization easier, IIUC.

Yes, I should not unconditionally enable SD_WORKLOAD_CONSOLIDATION in the MC
and CPU domains (as pointed out by PeterZ), but I did so for the purpose of
testing this patchset at this moment. Eventually, this flag should not be
turned on for any domain by default, for many reasons, not to mention that
CPU topology is getting more diverse/complex.

I just checked Vincent's "rework sched_domain topology description". The
general picture for sched_domain init does not change. If you work on top
of the tip tree, you can simply skip this patch (0007), and after booting
enable SD_WORKLOAD_CONSOLIDATION by:

sysctl -w kernel.sched_domain.cpuX.domainY.flags += 0x8000
sysctl -w kernel.sched_domain.cpu0.domain1.consolidating_coeff=180
sysctl -w kernel.sched_cc_wakeup_threshold=80

> Your top_flag_domain() function
> takes care of figuring out what is the highest sd level this is set on
> during load-balance but I can't find any good reason to do it this way
> other then for testing purposes?

Any flag is tested for whether it is set or not when it is encountered,
including the flags in sched_domain used for load balancing; this is why a
flag is called a flag. Is my flag any exception?

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
       [not found] ` <20140609164848.GB29593@e103034-lin>
@ 2014-06-09 21:23   ` Yuyang Du
  0 siblings, 0 replies; 32+ messages in thread
From: Yuyang Du @ 2014-06-09 21:23 UTC (permalink / raw)
  To: Morten Rasmussen; +Cc: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm

On Mon, Jun 09, 2014 at 05:48:48PM +0100, Morten Rasmussen wrote:

Thanks, Morten.

> > 2) CC vs. CPU utilization. CC is runqueue-length-weighted CPU utilization. If
> > we change: "a = sum(concurrency * time) / period" to "a' = sum(1 * time) /
> > period". Then a' is just about the CPU utilization. And the way we weight
> > runqueue-length is the simplest one (excluding the exponential decays, and you
> > may have other ways).
> 
> Isn't a' exactly to the rq runnable_avg_{sum, period} that you remove in
> patch 1? In that case it seems more obvious to repurpose them by
> multiplying the the contributions to the rq runnable_avg_sum by
> nr_running. AFAICT, that should give you the same metric.
> 
Yes, essentially it is. I removed it simply because rq runnable_avg_X is not
used. And yes, by repurposing it I can get CC; in that sense, what I do is
replace it, not just remove it, :)

> On the other hand, I don't see what this new metric gives us that can't
> be inferred from the existing load tracking metrics or through slight
> modifications of those. It came up recently in a different thread that
> the tracked task load is heavily influenced by other tasks running on
> the same cpu if they happen to overlap in time. IIUC, that exactly the
> information you are after. I think you could implement the same task
> packing behaviour based on an unweighted version of
> cfs.runnable_load_avg instead, which should be fairly straightforward to
> introduce (I think it will become useful for other purposes as well). I
> sort of hinted that in that thread already:
> 
> https://lkml.org/lkml/2014/6/4/54
> 

Yes, it seems an unweighted cfs.runnable_load_avg should be similar to what CC is.
I have been thinking about and working on this.

My work in this regard is still in progress. One of my concerns is how sum and
period accrue with time, and how contrib is calculated (for both entity and rq
runnable). As a result, the period is "always" around 48000, and it takes the
sum a long time to reflect the latest activity (IIUC, you also pointed this
out). For balancing, this might not be a problem, but for consolidating, we
need much more sensitivity.
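
(For reference: with the existing per-entity tracking, the decay factor y is
chosen so that y^32 = 0.5 and contributions are accumulated in ~1ms (1024us)
steps, so a continuously busy period converges to roughly

	1024 * (1 + y + y^2 + ...) = 1024 / (1 - y) ~= LOAD_AVG_MAX = 47742,

which is where the ~48000 above comes from, and why the sum is slow to
reflect recent changes.)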

I don't know, but anyway, I will solve this/give a good reason (as is also
required by PeterZ).

> 
> The potential worst case consolidated CC sum is:
> 
> 	n * \sum{cpus}^{n} CC[n]
> 
> So, the range in which the true consolidated CC lies grows
> proportionally to the number of cpus. We can't really say anything about
> how things will pan out if we consolidate on fewer cpus. However, if it
> turns out to be a bad mix of tasks the task runnable_avg_sum will go up
> and if we use cfs.runnable_load_avg as the indication of compute
> capacity requirements, we would eventually spread the load again.
> Clearly, we don't want an unstable situation, so it might be better to
> the consolidation partially and see where things are going.
> 

No. The current load balancing is all done by pulling, and Workload
Consolidation will prevent that pulling when consolidated; that is to say, the
current load balancing (effectively) cannot/will not act in the opposite
direction of Workload Consolidation at the same time.

Is that what you are concerned about?

> > So, we uniformly use this condition for consolidation (suppose
> > we consolidate m CPUs to n CPUs, m > n):
> > 
> > (CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) >=<? (1 * n) * n *
> > consolidating_coefficient
> 
> Do you have a rationale behind this heuristic? It seems to be more and
> more pessimistic about how much load we can put on 'n' cpus as 'm'
> increases. Basically trying to factor in some of the error that be in
> the consolidated CC. Why the '(1 * n) * n...' and not just 'n * n * con...'?
> 

The rationale is: the more CPUs there are, the less likely they are to be
running concurrently (coscheduled), especially when the load is not high and
is transient (which is when we want to consolidate). So we become more
optimistic for consolidating a large m down to a small n, scaling only
logarithmically (by the log(m - n) term).
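
A minimal sketch of that condition, purely for illustration (names, types and
the scale of CC are assumptions here, e.g. CC kept per-cpu on a 0..100 scale
and coeff being a percentage like the 180 used in this thread; an integer
log2 stands in for the log above):

	static int cc_consolidate_ok(const u64 *cc, int m, int n,
				     unsigned int coeff)
	{
		u64 sum = 0;
		unsigned int lg = 0, d = m - n;	/* m > n */
		int i;

		/* total concurrency of the m source cpus */
		for (i = 0; i < m; i++)
			sum += cc[i];

		/* integer log2(m - n) */
		while (d >>= 1)
			lg++;

		/* sum(CC) * (n + log(m - n)) <= n * n * coeff ? */
		return sum * (n + lg) <= (u64)n * n * coeff;
	}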

> Overall, IIUC, the aim of this patch set seems quite similar to the
> previous proposals for task packing.
> 

OK. So I did not do something weird, which is good, :) Please help me work on
it together, since we all want it.

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 07/16 v3] Init Workload Consolidation flags in sched_domain
  2014-06-09 21:18       ` Yuyang Du
@ 2014-06-10 11:52         ` Dietmar Eggemann
  2014-06-10 18:09           ` Yuyang Du
  0 siblings, 1 reply; 32+ messages in thread
From: Dietmar Eggemann @ 2014-06-10 11:52 UTC (permalink / raw)
  To: Yuyang Du; +Cc: Peter Zijlstra, linux-kernel, linux-pm

On 09/06/14 22:18, Yuyang Du wrote:
> On Mon, Jun 09, 2014 at 06:56:17PM +0100, Dietmar Eggemann wrote:
> 
> Thanks, Dietmar.
> 
>> I'm running these patches on my ARM TC2 on top of
>> kernel/git/torvalds/linux.git (v3.15-rc7-79-gfe45736f4134). There're
>> considerable changes in the area of sched domain setup since Vincent's
>> patchset 'rework sched_domain topology description' (destined for v3.16)
>> which you can find on kernel/git/tip/tip.git .
>>
> 
> Yeah, PeterZ pointed it out. It was on top of mainline not tip.
> 
>> Why did you make SD_WORKLOAD_CONSOLIDATION controllable via sysctl? All
>> the other SD flags are set during setup.
>>
> 
> I don't understand. Any flag or parameter in sched_domain can be modified
> on-the-fly after booting via sysctl. The SD_XXX_INIT is a template to make
> the sched_domain initialization easier, IIUC.

Technically true, but since the sysctl stuff is per-cpu and you want to
change per-domain data, you have to be extremely careful that each cpu
still sees the same data.

Another counterexample: if I delete the SD_SHARE_PKG_RESOURCES flag on
my ARM TC2 system for all CPUs on domain0 (MC level) via sysctl, the
scheduler still has sd_llc assigned to the struct sched_domain for the
MC level of the CPU.

> 
> Yes, I should not unconditionally enable SD_WORKLOAD_CONSOLIDATION in MC
> and CPU domain (pointed out by PeterZ), but I did so for the purpose of
> testing this patchset at this moment. Eventually, this flag should not be
> turned on for any domain by default for many reasons, not to mention CPU
> topology is getting more diverse/complex.

But isn't this the point to show how and under which conditions you
would set this flag in the existing code? Since I guess it's a scheduler
behavioural (not a topology related one) flag, it has to be integrated
nicely into sd_init() etc.

> 
> I just checked Vincent's "rework sched_domain topology description". The
> general picture for init sched_domain does not change. If you work on top
> of tip tree, you can simply skip this patch (0007), and after booting
> enable SD_WORKLOAD_CONSOLIDATION by:
> 
> sysctl -w kernel.sched_domain.cpuX.domainY.flags += 0x8000
> sysctl -w kernel.sched_domain.cpu0.domain1.consolidating_coeff=180
> sysctl -w kernel.sched_cc_wakeup_threshold=80
> 
>> Your top_flag_domain() function
>> takes care of figuring out what is the highest sd level this is set on
>> during load-balance but I can't find any good reason to do it this way
>> other then for testing purposes?
> 
> Any flag is used for testing whether it is set on or not when encountering
> it, including the flags in sched_domain for load balancing, this is why flag
> is called flag. My flag is any excpetion?

Not in this sense, but there is no functionality in the scheduler right
now to constantly check whether an sd flag has been set/unset via sysctl.
IMHO, there are only sd_init() and (highest/lowest)_flag_domain() to cache
pointers to special sd's, and both are called during start-up or cpu
hotplug:

  init/partition_sched_domains() -> build_sched_domains() ->
    { build_sched_domain() -> sd_init(),
      cpu_attach_domain() -> update_top_cache_domain() }

-- Dietmar

> 
> Thanks,
> Yuyang
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 07/16 v3] Init Workload Consolidation flags in sched_domain
  2014-06-10 11:52         ` Dietmar Eggemann
@ 2014-06-10 18:09           ` Yuyang Du
  2014-06-11  9:27             ` Dietmar Eggemann
  0 siblings, 1 reply; 32+ messages in thread
From: Yuyang Du @ 2014-06-10 18:09 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: Peter Zijlstra, linux-kernel, linux-pm

On Tue, Jun 10, 2014 at 12:52:06PM +0100, Dietmar Eggemann wrote:

Hi Dietmar,

> Not in this sense but there is no functionality in the scheduler right
> now to check constantly if an sd flag has been set/unset via sysctl.

Sorry, I still don't understand. There are many "if (sd->flags & SD_XXX)"
in fair.c. What does it mean to you?

Probably you mean the SD_XX should be fixed in init and never changed via sysctl
thereafter. Ah... I don't know about this...

Overall, I think I should come up with a better way to implement the
SD_WORKLOAD_CONSOLIDATION policy (enabled or disabled) in load balancing (as
was also pointed out by PeterZ). But I just don't see how the current
implementation is any different from any other SD_XXX flag.

Have you tried it on your platform?

Thanks a lot,
Yuyang

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 07/16 v3] Init Workload Consolidation flags in sched_domain
  2014-06-10 18:09           ` Yuyang Du
@ 2014-06-11  9:27             ` Dietmar Eggemann
  0 siblings, 0 replies; 32+ messages in thread
From: Dietmar Eggemann @ 2014-06-11  9:27 UTC (permalink / raw)
  To: Yuyang Du; +Cc: Peter Zijlstra, linux-kernel, linux-pm

On 10/06/14 19:09, Yuyang Du wrote:
> On Tue, Jun 10, 2014 at 12:52:06PM +0100, Dietmar Eggemann wrote:
> 
> Hi Dietmar,
> 
>> Not in this sense but there is no functionality in the scheduler right
>> now to check constantly if an sd flag has been set/unset via sysctl.
> 
> Sorry, I still don't understand. There are many "if (sd->flags & SD_XXX)"
> in fair.c. What does it mean to you?
> 
> Probably you mean the SD_XX should be fixed in init and never changed via sysctl
> thereafter. Ah... I don't know about this...

Yes :-) I'm referring to your top_flag_domain() function, which you need in
order to check what the highest sd level is on which your flag is set.
Existing code relies only on flags set up during startup and after cpu
hotplug, or on cached per-cpu sd pointers like sd_llc.

> 
> Overall, I think I should come up with a better way to implement the SD_WORKLOAD_CONSOLIDATION
> policy (enabled or disabled) in load balancing (as is also pointed out by PeterZ).
> But I just don't see the current implementation is any particular different than
> any other SD_XX's.
> 
> Have you tried it on your platform?

I'm running these patches on my ARM TC2 (2 clusters with 2 CPUs and 3 CPUs)
on top of kernel/git/torvalds/linux.git (v3.15-rc7-79-gfe45736f4134). By
default, on this platform CC is enabled at the MC and CPU levels. Overall,
workloads show very different behaviour (with CC enabled on MC and CPU level
as well as only on MC level) compared to test runs without CC, but I do not
have the time to analyse it further. BTW, I hot-plugged out the 3rd CPU on
the 2nd cluster (there is this comment on top of __nonshielded_groups():
'every sched_group has the same weight').

-- Dietmar

> 
> Thanks a lot,
> Yuyang
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2014-06-11  9:27 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-30  6:35 [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
2014-05-30  6:35 ` [RFC PATCH 01/16 v3] Remove update_rq_runnable_avg Yuyang Du
2014-05-30  6:35 ` [RFC PATCH 02/16 v3] Define and initialize CPU ConCurrency in struct rq Yuyang Du
2014-05-30  6:35 ` [RFC PATCH 03/16 v3] How CC accrues with run queue change and time Yuyang Du
2014-06-03 12:12   ` Peter Zijlstra
2014-05-30  6:36 ` [RFC PATCH 04/16 v3] CPU CC update period is changeable via sysctl Yuyang Du
2014-05-30  6:36 ` [RFC PATCH 05/16 v3] Update CPU CC in fair Yuyang Du
2014-05-30  6:36 ` [RFC PATCH 06/16 v3] Add Workload Consolidation fields in struct sched_domain Yuyang Du
2014-05-30  6:36 ` [RFC PATCH 07/16 v3] Init Workload Consolidation flags in sched_domain Yuyang Du
2014-06-03 12:14   ` Peter Zijlstra
2014-06-09 17:56     ` Dietmar Eggemann
2014-06-09 21:18       ` Yuyang Du
2014-06-10 11:52         ` Dietmar Eggemann
2014-06-10 18:09           ` Yuyang Du
2014-06-11  9:27             ` Dietmar Eggemann
2014-05-30  6:36 ` [RFC PATCH 08/16 v3] Write CPU topology info for Workload Consolidation fields " Yuyang Du
2014-05-30  6:36 ` [RFC PATCH 09/16 v3] Define and allocate a per CPU local cpumask for Workload Consolidation Yuyang Du
2014-06-03 12:15   ` Peter Zijlstra
2014-05-30  6:36 ` [RFC PATCH 10/16 v3] Workload Consolidation APIs Yuyang Du
2014-06-03 12:22   ` Peter Zijlstra
2014-05-30  6:36 ` [RFC PATCH 11/16 v3] Make wakeup bias threshold changeable via sysctl Yuyang Du
2014-06-03 12:23   ` Peter Zijlstra
2014-05-30  6:36 ` [RFC PATCH 12/16 v3] Bias select wakee than waker in WAKE_AFFINE Yuyang Du
2014-06-03 12:24   ` Peter Zijlstra
2014-05-30  6:36 ` [RFC PATCH 13/16 v3] Intercept wakeup/fork/exec load balancing Yuyang Du
2014-06-03 12:27   ` Peter Zijlstra
2014-06-03 23:46     ` Yuyang Du
2014-05-30  6:36 ` [RFC PATCH 14/16 v3] Intercept idle balancing Yuyang Du
2014-05-30  6:36 ` [RFC PATCH 15/16 v3] Intercept periodic nohz " Yuyang Du
2014-05-30  6:36 ` [RFC PATCH 16/16 v3] Intercept periodic load balancing Yuyang Du
2014-06-09 17:30 ` [RFC PATCH 00/16 v3] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Morten Rasmussen
     [not found] ` <20140609164848.GB29593@e103034-lin>
2014-06-09 21:23   ` Yuyang Du
