* [RFC PATCH 0/9 v4] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
From: Yuyang Du @ 2014-06-25  0:35 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, dietmar.eggemann,
	rajeev.d.muralidhar, vishwesh.m.rudramuni, nicole.chalhoub,
	ajaya.durg, harinarayanan.seshadri, jacob.jun.pan, Yuyang Du

The current scheduler's load balancing is completely work-conserving. For some
workloads, with generally low CPU utilization but interspersed with bursts of
transient tasks, migrating tasks to engage all available CPUs for the sake of
work conservation can incur significant overhead: cache locality loss,
idle/active hardware state transition latency and power, shallower idle
states, etc. This is inefficient in both power and performance, especially for
today's low-power mobile processors.

This RFC introduces a degree of idleness-conserving into the work-conserving
scheduler (by all means, we do not want to be extreme in only one direction).
But how much idleness-conserving is appropriate, bearing in mind that we do
not want to sacrifice performance? To answer that, we first need a
load/idleness indicator.

Thanks to CFS's "model an ideal, precise multi-tasking CPU", the tasks in the
runqueue can be seen as running concurrently. So it is natural to use task
concurrency as the load indicator. To build one, we do two things:

1) Divide continuous time into periods, and average the task concurrency
within each period, to tolerate transient bursts:

a = sum(concurrency * time) / period

2) Exponentially decay past periods, and sum them all up, to provide
hysteresis against load drops and resilience to load rises (let f be the decay
factor, and a_x the x-th period average since period 0):

s = a_n + f^1 * a_(n-1) + f^2 * a_(n-2) + ... + f^(n-1) * a_1 + f^n * a_0

We name this load indicator CPU ConCurrency (CC): task concurrency determines
how many CPUs need to run concurrently.
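
For illustration only (not part of the patches, which reuse
__update_entity_runnable_avg with fixed-point arithmetic), a minimal
user-space sketch of the two steps could look like the following; the period
length and decay factor are illustrative, with f chosen close to CFS's y
(y^32 = 0.5):

#include <stdio.h>

#define PERIOD_US	1024	/* length of one averaging period */
#define DECAY_F		0.9785	/* illustrative decay factor f, ~ CFS's y */

static double cc;	/* the synthesized value "s" above */

/*
 * Feed one finished period in which nr_running tasks were runnable
 * concurrently for run_us microseconds (simplified to one concurrency
 * level per period).
 */
static void cc_add_period(unsigned int nr_running, unsigned int run_us)
{
	double a = (double)nr_running * run_us / PERIOD_US;	/* step 1 */

	cc = a + DECAY_F * cc;		/* step 2: s_n = a_n + f * s_(n-1) */
}

int main(void)
{
	/*
	 * Two tasks run concurrently for half a period: a = 1.0, whereas
	 * the plain CPU utilization of that period is only 0.5.
	 */
	cc_add_period(2, PERIOD_US / 2);
	cc_add_period(0, 0);		/* an idle period only decays the history */
	printf("CC = %.2f\n", cc);	/* ~0.98 */
	return 0;
}

In the patches, CC is kept in fixed point in rq->avg.load_avg_contrib, scaled
by NICE_0_SHIFT (see patch 3/9).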

Two more ways to interpret CC:

1) The current work-conserving load balance also uses CC, but the
instantaneous CC.

2) CC vs. CPU utilization. CC is runqueue-length-weighted CPU utilization: if
we change "a = sum(concurrency * time) / period" to "a' = sum(1 * time) /
period", then a' is just the CPU utilization. The runqueue-length weighting we
use is the simplest one (apart from the exponential decay; other weightings
are possible).

To track CC, we hook into the scheduler at 1) enqueue, 2) dequeue, 3) the
scheduler tick, and 4) idle enter/exit.

On top of CC, in the consolidation part, we 1) attach to the CPU topology so
the policy is adaptive beyond our experimental platforms, and 2) intercept the
current load-balance paths to contain load and load balancing.

Currently, CC is a per-CPU metric. The decision to consolidate is based on a
heuristic. Suppose we have 2 CPUs whose task concurrency over time is
('-' means no task, 'x' means having tasks):

1)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---------xxxx---- (CC[1])

2)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---xxxx---------- (CC[1])

If we consolidate CPU0 and CPU1, the consolidated CC will be CC' = CC[0] +
CC[1] for case 1 and CC'' = (CC[0] + CC[1]) * 2 for case 2. For the cases in
between case 1 and case 2, in terms of how the 'x' regions overlap, the
consolidated CC falls between CC' and CC''. So we uniformly consolidate when
(suppose we consolidate m CPUs onto n CPUs, m > n):

(CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) <=
(1 * n) * n * consolidating_coefficient

where '1' is the CC of one CPU fully occupied by one task.

The consolidating_coefficient can be 100% or more or less, making
consolidation more or less aggressive.
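
As a cross-check of the condition above, a minimal floating-point sketch (the
actual code in patch 5/9 uses fixed-point arithmetic, ilog2() and CPU capacity
scaling instead) could look like this; cc[] is normalized so that 1.0 means
one CPU fully occupied by one task, and coeff is the consolidating_coefficient
expressed as a fraction (1.0 for 100%):

#include <math.h>

/*
 * Can the load of m CPUs (per-CPU concurrency in cc[]) be consolidated
 * onto n of them?  Mirrors the heuristic condition above.
 */
static int can_consolidate(const double cc[], int m, int n, double coeff)
{
	double sum = 0.0;
	int i;

	if (m <= n)
		return 0;

	for (i = 0; i < m; i++)
		sum += cc[i];

	/* (CC[0] + ... + CC[m-1]) * (n + log(m - n)) <= (1 * n) * n * coeff */
	return sum * (n + log2(m - n)) <= (double)n * n * coeff;
}

For the dual-module quad-core evaluated below (m = 4, n = 2) and a coefficient
of 100%, this consolidates as long as the aggregate CC stays below roughly 4/3.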

Using CC, we implemented a Workload Consolidation (WC) patch on two Intel
mobile platforms (each a quad-core composed of two dual-core modules): load
and load balancing are contained in the first dual-core module when the
aggregated CC is low, and spread over the full quad-core otherwise. Results
show power savings and no substantial performance regression (even gains for
some workloads). The workloads we used to evaluate Workload Consolidation
include 1) 50+ perf/UX benchmarks (almost all of the magazine ones), and
2) ~10 power workloads; these are of course the easiest ones, such as
browsing, audio, video, recording, imaging, etc.

v4:
- Reuse per task load average to calculate CC
- Enable SD_WORKLOAD_CONSOLIDATION in sched_domain initialization
- Reuse active_load_balance_cpu_stop

v3:
- Removed rq->avg first, and based this patchset on that
- Removed all CONFIG_CPU_CONCURRENCY and CONFIG_WORKLOAD_CONSOLIDATION
- CPU CC is now updated unconditionally
- CPU WC can be enabled/disabled by flags per domain level on the fly
- CPU CC and WC are entirely fair-scheduler things; RT is not touched anymore
 
v2:
- Data type defined in formation

Patchset against linux-next v3.16-rc2.

Yuyang Du (9):
  sched: Remove update_rq_runnable_avg
  sched: Precise accumulated time and account runnable number in
    update_entity_runnable_avg
  How CPU ConCurrency (CC) accrues with runqueue change and time
  Define SD_WORKLOAD_CONSOLIDATION and attach to sched_domain
  Workload Consolidation: Consolidating workload to a subset of CPUs if
    possible
  Implement Workload Consolidation in wakeup/fork/exec
  Implement Workload Consolidation in idle_balance
  Implement Workload Consolidation in nohz_idle_balance
  Implement Workload Consolidation in periodic load balance

 include/linux/sched.h        |    9 +-
 include/linux/sched/sysctl.h |    4 +
 kernel/sched/core.c          |   46 +++-
 kernel/sched/debug.c         |    8 -
 kernel/sched/fair.c          |  576 +++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h         |   17 +-
 kernel/sysctl.c              |    9 +
 7 files changed, 612 insertions(+), 57 deletions(-)

-- 
1.7.9.5



* [RFC PATCH 1/9 v4] sched: Remove update_rq_runnable_avg
From: Yuyang Du @ 2014-06-25  0:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, dietmar.eggemann,
	rajeev.d.muralidhar, vishwesh.m.rudramuni, nicole.chalhoub,
	ajaya.durg, harinarayanan.seshadri, jacob.jun.pan, Yuyang Du

The current rq->avg is not used anywhere, and its update is in the fair
scheduler's critical path, so remove it.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/debug.c |    8 --------
 kernel/sched/fair.c  |   24 ++++--------------------
 kernel/sched/sched.h |    2 --
 3 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 695f977..4b864c7 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -68,14 +68,6 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #define PN(F) \
 	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))
 
-	if (!se) {
-		struct sched_avg *avg = &cpu_rq(cpu)->avg;
-		P(avg->runnable_avg_sum);
-		P(avg->runnable_avg_period);
-		return;
-	}
-
-
 	PN(se->exec_start);
 	PN(se->vruntime);
 	PN(se->sum_exec_runtime);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..1a2d04f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2430,18 +2430,12 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 	}
 }
 
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
-{
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
-	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
-}
 #else /* CONFIG_FAIR_GROUP_SCHED */
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 						 int force_update) {}
 static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 						  struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline void __update_task_entity_contrib(struct sched_entity *se)
@@ -2614,7 +2608,6 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
  */
 void idle_enter_fair(struct rq *this_rq)
 {
-	update_rq_runnable_avg(this_rq, 1);
 }
 
 /*
@@ -2624,7 +2617,6 @@ void idle_enter_fair(struct rq *this_rq)
  */
 void idle_exit_fair(struct rq *this_rq)
 {
-	update_rq_runnable_avg(this_rq, 0);
 }
 
 static int idle_balance(struct rq *this_rq);
@@ -2633,7 +2625,6 @@ static int idle_balance(struct rq *this_rq);
 
 static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 					   struct sched_entity *se,
 					   int wakeup) {}
@@ -3936,10 +3927,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_entity_load_avg(se, 1);
 	}
 
-	if (!se) {
-		update_rq_runnable_avg(rq, rq->nr_running);
+	if (!se)
 		add_nr_running(rq, 1);
-	}
+
 	hrtick_update(rq);
 }
 
@@ -3997,10 +3987,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_entity_load_avg(se, 1);
 	}
 
-	if (!se) {
+	if (!se)
 		sub_nr_running(rq, 1);
-		update_rq_runnable_avg(rq, 1);
-	}
+
 	hrtick_update(rq);
 }
 
@@ -5437,9 +5426,6 @@ static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
 		 */
 		if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
 			list_del_leaf_cfs_rq(cfs_rq);
-	} else {
-		struct rq *rq = rq_of(cfs_rq);
-		update_rq_runnable_avg(rq, rq->nr_running);
 	}
 }
 
@@ -7352,8 +7338,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (numabalancing_enabled)
 		task_tick_numa(rq, curr);
-
-	update_rq_runnable_avg(rq, 1);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 31cc02e..a147571 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -542,8 +542,6 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
-
-	struct sched_avg avg;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 	/*
-- 
1.7.9.5



* [RFC PATCH 2/9 v4] sched: Precise accumulated time and account runnable number in update_entity_runnable_avg
From: Yuyang Du @ 2014-06-25  0:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, dietmar.eggemann,
	rajeev.d.muralidhar, vishwesh.m.rudramuni, nicole.chalhoub,
	ajaya.durg, harinarayanan.seshadri, jacob.jun.pan, Yuyang Du

The amount of time already accumulated against the next period is currently
determined by:

	delta_w = sa->runnable_avg_period % 1024

Since runnable_avg_period is the (decayed) sum of an infinite series, this is
imprecise, even though the error is no more than 1024 (~1ms). Accounting for
it precisely is not hard: just use a variable, period_contrib, to record the
time accrued in the current period.

The current runnable argument is either 1 or 0, indicating whether this entity
is runnable or not. In order to also account for the number of runnable tasks,
change how runnable_avg_sum is accumulated.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/fair.c   |   12 ++++++++----
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 306f4f0..1b1997d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1074,6 +1074,7 @@ struct sched_avg {
 	 * choices of y < 1-2^(-32)*1024.
 	 */
 	u32 runnable_avg_sum, runnable_avg_period;
+	u32 period_contrib;
 	u64 last_runnable_update;
 	s64 decay_count;
 	unsigned long load_avg_contrib;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a2d04f..e914e32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2294,10 +2294,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	sa->last_runnable_update = now;
 
 	/* delta_w is the amount already accumulated against our next period */
-	delta_w = sa->runnable_avg_period % 1024;
+	delta_w = sa->period_contrib;
 	if (delta + delta_w >= 1024) {
 		/* period roll-over */
 		decayed = 1;
+		/* how much left for next period will start over, we don't know yet */
+		sa->period_contrib = 0;
 
 		/*
 		 * Now that we know we're crossing a period boundary, figure
@@ -2306,7 +2308,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		 */
 		delta_w = 1024 - delta_w;
 		if (runnable)
-			sa->runnable_avg_sum += delta_w;
+			sa->runnable_avg_sum += runnable * delta_w;
 		sa->runnable_avg_period += delta_w;
 
 		delta -= delta_w;
@@ -2323,15 +2325,17 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		runnable_contrib = __compute_runnable_contrib(periods);
 		if (runnable)
-			sa->runnable_avg_sum += runnable_contrib;
+			sa->runnable_avg_sum += runnable * runnable_contrib;
 		sa->runnable_avg_period += runnable_contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
 	if (runnable)
-		sa->runnable_avg_sum += delta;
+		sa->runnable_avg_sum += runnable * delta;
 	sa->runnable_avg_period += delta;
 
+	sa->period_contrib += delta;
+
 	return decayed;
 }
 
-- 
1.7.9.5



* [RFC PATCH 3/9 v4] How CPU ConCurrency (CC) accrues with runqueue change and time
From: Yuyang Du @ 2014-06-25  0:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, dietmar.eggemann,
	rajeev.d.muralidhar, vishwesh.m.rudramuni, nicole.chalhoub,
	ajaya.durg, harinarayanan.seshadri, jacob.jun.pan, Yuyang Du

CC can be seen as either the decayed average runqueue length, or the
run-queue-length-weighted CPU utilization. CC is calculated in two steps:

1) Divide continuous time into periods, and average the task concurrency
within each period, to tolerate transient bursts:

a = sum(concurrency * time) / period

2) Exponentially decay past periods, and sum them all up, to provide
hysteresis against load drops and resilience to load rises (let f be the decay
factor, and a_x the x-th period average since period 0):

s = a_n + f^1 * a_(n-1) + f^2 * a_(n-2) + ... + f^(n-1) * a_1 + f^n * a_0

We reuse __update_entity_runnable_avg to calculate CC.

CC can only change on enqueue to and dequeue from the CPU runqueue. We also
update it in the scheduler tick, in load balancing, and on idle enter/exit, in
case there is no enqueue or dequeue for a long time.

Therefore, we update/track CC at, and only at, the following points:
1) enqueue task, which increases concurrency
2) dequeue task, which decreases concurrency
3) periodic scheduler tick, in case no en/dequeue for long
4) enter and exit idle
5) update_blocked_averages
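
For reference, with this representation (rq->avg.load_avg_contrib holds CC
scaled by NICE_0_SHIFT, see update_cpu_concurrency() below), the decayed
average runqueue length of a CPU could be read back with a hypothetical helper
like the following (illustration only, not part of this patch):

/*
 * Hypothetical helper, not in this series: report a CPU's decayed
 * average number of runnable tasks, scaled by 100 to stay integral.
 */
static inline unsigned long cpu_avg_nr_running_x100(int cpu)
{
	/* load_avg_contrib ~= avg(nr_running) << NICE_0_SHIFT */
	return (cpu_rq(cpu)->avg.load_avg_contrib * 100) >> NICE_0_SHIFT;
}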

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c  |   45 +++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |    2 ++
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e914e32..c4270cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2605,6 +2605,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 	} /* migrations, e.g. sleep=0 leave decay_count == 0 */
 }
 
+static inline void update_cpu_concurrency(struct rq *rq);
+
 /*
  * Update the rq's load with the elapsed running time before entering
  * idle. if the last scheduled task is not a CFS task, idle_enter will
@@ -2612,6 +2614,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
  */
 void idle_enter_fair(struct rq *this_rq)
 {
+	update_cpu_concurrency(this_rq);
 }
 
 /*
@@ -2621,6 +2624,7 @@ void idle_enter_fair(struct rq *this_rq)
  */
 void idle_exit_fair(struct rq *this_rq)
 {
+	update_cpu_concurrency(this_rq);
 }
 
 static int idle_balance(struct rq *this_rq);
@@ -2638,6 +2642,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
 					      int force_update) {}
 
+static inline void update_cpu_concurrency(struct rq *rq) {}
+
 static inline int idle_balance(struct rq *rq)
 {
 	return 0;
@@ -3931,8 +3937,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_entity_load_avg(se, 1);
 	}
 
-	if (!se)
+	if (!se) {
+		update_cpu_concurrency(rq);
 		add_nr_running(rq, 1);
+	}
 
 	hrtick_update(rq);
 }
@@ -3991,8 +3999,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_entity_load_avg(se, 1);
 	}
 
-	if (!se)
+	if (!se) {
+		update_cpu_concurrency(rq);
 		sub_nr_running(rq, 1);
+	}
 
 	hrtick_update(rq);
 }
@@ -5454,6 +5464,8 @@ static void update_blocked_averages(int cpu)
 		__update_blocked_averages_cpu(cfs_rq->tg, rq->cpu);
 	}
 
+	update_cpu_concurrency(rq);
+
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
@@ -7342,6 +7354,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (numabalancing_enabled)
 		task_tick_numa(rq, curr);
+
+	update_cpu_concurrency(rq);
 }
 
 /*
@@ -7801,3 +7815,30 @@ __init void init_sched_fair_class(void)
 #endif /* SMP */
 
 }
+
+#ifdef CONFIG_SMP
+
+/*
+ * CPU ConCurrency (CC) measures the CPU load by averaging
+ * the number of running tasks. Using CC, the scheduler can
+ * evaluate the load of CPUs to improve load balance for power
+ * efficiency without sacrificing performance.
+ */
+
+/*
+ * we update cpu concurrency at:
+ * 1) enqueue task, which increases concurrency
+ * 2) dequeue task, which decreases concurrency
+ * 3) periodic scheduler tick, in case no en/dequeue for long
+ * 4) enter and exit idle
+ * 5) update_blocked_averages
+ */
+static void update_cpu_concurrency(struct rq *rq)
+{
+	struct sched_avg *sa = &rq->avg;
+	if (__update_entity_runnable_avg(rq->clock, sa, rq->nr_running)) {
+		sa->load_avg_contrib = sa->runnable_avg_sum << NICE_0_SHIFT;
+		sa->load_avg_contrib /= (sa->runnable_avg_period + 1);
+	}
+}
+#endif /* CONFIG_SMP */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a147571..eb47ce2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -579,6 +579,8 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+	struct sched_avg avg;
+
 	u64 rt_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
-- 
1.7.9.5



* [RFC PATCH 4/9 v4] Define SD_WORKLOAD_CONSOLIDATION and attach to sched_domain
From: Yuyang Du @ 2014-06-25  0:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, dietmar.eggemann,
	rajeev.d.muralidhar, vishwesh.m.rudramuni, nicole.chalhoub,
	ajaya.durg, harinarayanan.seshadri, jacob.jun.pan, Yuyang Du

Workload Consolidation is completely CPU topology and policy driven. To
support it, we define SD_WORKLOAD_CONSOLIDATION and add some fields to struct
sched_domain:

1) total_groups is the total number of groups in this domain
2) group_number is this CPU's group sequence number
3) consolidating_coeff is the coefficient for consolidating CPUs, changeable
   via sysctl to make consolidation more or less aggressive
4) first_group is a pointer to this domain's first group, ordered by CPU number

This patchset enables SD_WORKLOAD_CONSOLIDATION in the MC domain by default,
but we still need a better way to determine on which architectures this flag
should be enabled. Thanks to PeterZ and Dietmar for pointing this out and
helping me finally understand it.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |    8 +++++++-
 kernel/sched/core.c   |   46 ++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h  |   13 ++++++++++---
 3 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1b1997d..a339467 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -870,6 +870,7 @@ enum cpu_idle_type {
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 #define SD_NUMA			0x4000	/* cross-node balancing */
+#define SD_WORKLOAD_CONSOLIDATION  0x8000  /* consolidate CPU workload */
 
 #ifdef CONFIG_SCHED_SMT
 static inline const int cpu_smt_flags(void)
@@ -881,7 +882,7 @@ static inline const int cpu_smt_flags(void)
 #ifdef CONFIG_SCHED_MC
 static inline const int cpu_core_flags(void)
 {
-	return SD_SHARE_PKG_RESOURCES;
+	return SD_SHARE_PKG_RESOURCES | SD_WORKLOAD_CONSOLIDATION;
 }
 #endif
 
@@ -973,6 +974,11 @@ struct sched_domain {
 		struct rcu_head rcu;	/* used during destruction */
 	};
 
+	unsigned int total_groups;			/* total group number */
+	unsigned int group_number;			/* this CPU's group sequence */
+	unsigned int consolidating_coeff;	/* consolidating coefficient */
+	struct sched_group *first_group;	/* ordered by CPU number */
+
 	unsigned int span_weight;
 	/*
 	 * Span of all CPUs in this domain.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3bdf01b..da3cd74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4941,7 +4941,7 @@ set_table_entry(struct ctl_table *entry,
 static struct ctl_table *
 sd_alloc_ctl_domain_table(struct sched_domain *sd)
 {
-	struct ctl_table *table = sd_alloc_ctl_entry(14);
+	struct ctl_table *table = sd_alloc_ctl_entry(15);
 
 	if (table == NULL)
 		return NULL;
@@ -4974,7 +4974,9 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
 		sizeof(long), 0644, proc_doulongvec_minmax, false);
 	set_table_entry(&table[12], "name", sd->name,
 		CORENAME_MAX_SIZE, 0444, proc_dostring, false);
-	/* &table[13] is terminator */
+	set_table_entry(&table[13], "consolidating_coeff", &sd->consolidating_coeff,
+		sizeof(int), 0644, proc_dointvec, false);
+	/* &table[14] is terminator */
 
 	return table;
 }
@@ -5586,7 +5588,7 @@ static void update_top_cache_domain(int cpu)
 	int id = cpu;
 	int size = 1;
 
-	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
+	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES, 1);
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
@@ -5601,10 +5603,41 @@ static void update_top_cache_domain(int cpu)
 	sd = lowest_flag_domain(cpu, SD_NUMA);
 	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
 
-	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
+	sd = highest_flag_domain(cpu, SD_ASYM_PACKING, 1);
 	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
 }
 
+
+DEFINE_PER_CPU(struct sched_domain *, sd_wc);
+
+static void update_wc_domain(struct sched_domain *sd, int cpu)
+{
+	while (sd) {
+		int i = 0, j = 0, first, min = INT_MAX;
+		struct sched_group *group;
+
+		group = sd->groups;
+		first = group_first_cpu(group);
+		do {
+			int k = group_first_cpu(group);
+			i += 1;
+			if (k < first)
+				j += 1;
+			if (k < min) {
+				sd->first_group = group;
+				min = k;
+			}
+		} while (group = group->next, group != sd->groups);
+
+		sd->total_groups = i;
+		sd->group_number = j;
+		sd = sd->parent;
+	}
+
+	sd = highest_flag_domain(cpu, SD_WORKLOAD_CONSOLIDATION, 0);
+	rcu_assign_pointer(per_cpu(sd_wc, cpu), sd);
+}
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
  * hold the hotplug lock.
@@ -5653,6 +5686,8 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	destroy_sched_domains(tmp, cpu);
 
 	update_top_cache_domain(cpu);
+
+	update_wc_domain(sd, cpu);
 }
 
 /* cpus with isolated domains */
@@ -6069,6 +6104,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 #ifdef CONFIG_SCHED_DEBUG
 		.name			= tl->name,
 #endif
+		.consolidating_coeff = 0,
 	};
 
 	/*
@@ -6098,6 +6134,8 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		}
 
 #endif
+	} else if (sd->flags & SD_WORKLOAD_CONSOLIDATION) {
+		sd->consolidating_coeff = 160;
 	} else {
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->cache_nice_tries = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eb47ce2..a2a7230 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -695,16 +695,22 @@ extern void sched_ttwu_pending(void);
  *		be returned.
  * @flag:	The flag to check for the highest sched_domain
  *		for the given cpu.
+ * @all: The flag is contained by all sched_domains from the hightest down
  *
  * Returns the highest sched_domain of a cpu which contains the given flag.
  */
-static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+static inline struct
+sched_domain *highest_flag_domain(int cpu, int flag, int all)
 {
 	struct sched_domain *sd, *hsd = NULL;
 
 	for_each_domain(cpu, sd) {
-		if (!(sd->flags & flag))
-			break;
+		if (!(sd->flags & flag)) {
+			if (all)
+				break;
+			else
+				continue;
+		}
 		hsd = sd;
 	}
 
@@ -729,6 +735,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_busy);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(struct sched_domain *, sd_wc);
 
 struct sched_group_capacity {
 	atomic_t ref;
-- 
1.7.9.5



* [RFC PATCH 5/9 v4] Workload Consolidation: Consolidating workload to a subset of CPUs if possible
From: Yuyang Du @ 2014-06-25  0:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, dietmar.eggemann,
	rajeev.d.muralidhar, vishwesh.m.rudramuni, nicole.chalhoub,
	ajaya.durg, harinarayanan.seshadri, jacob.jun.pan, Yuyang Du

CPU CC is a per-CPU metric. The decision whether to consolidate or not is
based on a heuristic. Suppose we have 2 CPUs whose task concurrency over time
is ('-' means no task, 'x' means having tasks):

1)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---------xxxx---- (CC[1])

2)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---xxxx---------- (CC[1])

If we consolidate CPU0 and CPU1, the consolidated CC will be CC' = CC[0] +
CC[1] for case 1 and CC'' = (CC[0] + CC[1]) * 2 for case 2. For the cases in
between case 1 and case 2, in terms of how the 'x' regions overlap, the
consolidated CC falls between CC' and CC''. So we uniformly consolidate when
(suppose we consolidate m CPUs onto n CPUs, m > n):

(CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) <=
(1 * n) * n * consolidating_coefficient

The consolidating_coefficient can be 100 (%) or more or less.

TODO:
1) need scheduler statistics
2) whether or not to consolidate is decided every time it is needed, which is
   not efficient
3) this is really meant to be used on multi-socket machines as well, but the
   consolidation decision becomes time-consuming as the CPU count grows
   significantly; this needs a remedy

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |  332 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 332 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c4270cf..7f80058 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6749,6 +6749,8 @@ update_next_balance(struct sched_domain *sd, int cpu_busy, unsigned long *next_b
 		*next_balance = next;
 }
 
+static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask);
+
 /*
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
@@ -7805,6 +7807,12 @@ void print_cfs_stats(struct seq_file *m, int cpu)
 __init void init_sched_fair_class(void)
 {
 #ifdef CONFIG_SMP
+	unsigned int i;
+	for_each_possible_cpu(i) {
+		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask, i),
+					GFP_KERNEL, cpu_to_node(i));
+	}
+
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
 
 #ifdef CONFIG_NO_HZ_COMMON
@@ -7841,4 +7849,328 @@ static void update_cpu_concurrency(struct rq *rq)
 		sa->load_avg_contrib /= (sa->runnable_avg_period + 1);
 	}
 }
+
+static inline u32 cc_weight(unsigned int nr_running)
+{
+   return nr_running << NICE_0_SHIFT;
+}
+
+static inline unsigned long get_cpu_concurrency(int cpu)
+{
+	return cpu_rq(cpu)->avg.load_avg_contrib;
+}
+
+static inline u64 sched_group_cc(struct sched_group *sg)
+{
+	u64 sg_cc = 0;
+	int i;
+
+	for_each_cpu(i, sched_group_cpus(sg))
+		sg_cc += get_cpu_concurrency(i) * capacity_of(i);
+
+	return sg_cc;
+}
+
+static inline u64 sched_domain_cc(struct sched_domain *sd)
+{
+	struct sched_group *sg = sd->groups;
+	u64 sd_cc = 0;
+
+	do {
+		sd_cc += sched_group_cc(sg);
+		sg = sg->next;
+	} while (sg != sd->groups);
+
+	return sd_cc;
+}
+
+static inline struct sched_group *
+find_lowest_cc_group(struct sched_group *sg, int span)
+{
+	u64 grp_cc, min = ULLONG_MAX;
+	struct sched_group *lowest = NULL;
+	int i;
+
+	for (i = 0; i < span; ++i) {
+		grp_cc = sched_group_cc(sg);
+
+		if (grp_cc < min) {
+			min = grp_cc;
+			lowest = sg;
+		}
+
+		sg = sg->next;
+	}
+
+	return lowest;
+}
+
+static inline u64 __calc_cc_thr(int cpus, unsigned int asym_cc)
+{
+	u64 thr = cpus;
+
+	thr *= cc_weight(1);
+	thr *= asym_cc;
+	thr <<= SCHED_CAPACITY_SHIFT;
+
+	return thr;
+}
+
+/*
+ * Can @src_cc of @src_nr CPUs be consolidated to @dst_cc of @dst_nr CPUs
+ *
+ * CC is a per-CPU average. The consolidated CC depends on the overlapped
+ * running of the CPUs before consolidation. Suppose we have 2 CPUs,
+ * their CC over time is ('-' means idle, 'x' means running):
+ *
+ * Case 1
+ * CPU0: ---xxxx---------- (CC[0])
+ * CPU1: ---------xxxx---- (CC[1])
+ *
+ * Case 2
+ * CPU0: ---xxxx---------- (CC[0])
+ * CPU1: ---xxxx---------- (CC[1])
+ *
+ * If we consolidate CPU0 and CPU1, the consolidated CC will be: CC' =
+ * CC[0] + CC[1] for case 1 and CC'' = (CC[0] + CC[1]) * 2 for case 2.
+ * For the cases in between case 1 and 2 in terms of how xxx overlaps,
+ * the CC should be between CC' and CC''. So, we use this heuristic to
+ * calc consolidated CC (suppose we consolidate m CPUs to n CPUs, m > n):
+ *
+ * (CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) >=<? (1 *
+ * n) * n * consolidating_coefficient
+ *
+ * The consolidating_coefficient could be like 100% or more or less.
+ */
+static inline int
+__can_consolidate_cc(u64 src_cc, int src_nr, u64 dst_cc, int dst_nr)
+{
+	dst_cc *= dst_nr;
+	src_nr -= dst_nr;
+
+	if (unlikely(src_nr <= 0))
+		return 0;
+
+	src_nr = ilog2(src_nr);
+	src_nr += dst_nr;
+	src_cc *= src_nr;
+
+	if (src_cc > dst_cc)
+		return 0;
+
+	return 1;
+}
+
+/*
+ * find the consolidated group according to the CC of this domain's CPUs
+ */
+static struct sched_group * wc_find_group(struct sched_domain *sd,
+	struct task_struct *p, int this_cpu)
+{
+	int half, sg_weight, ns_half = 0;
+	struct sched_group *sg;
+	u64 sd_cc;
+
+	half = DIV_ROUND_CLOSEST(sd->total_groups, 2);
+	sg_weight = sd->groups->group_weight;
+
+	sd_cc = sched_domain_cc(sd);
+	sd_cc *= 100;
+
+	while (half) {
+		int allowed = 0, i;
+		int cpus = sg_weight * half;
+		u64 threshold = __calc_cc_thr(cpus,
+			sd->consolidating_coeff);
+
+		/*
+		 * we did not consider the added cc by this
+		 * wakeup (mostly from fork/exec)
+		 */
+		if (!__can_consolidate_cc(sd_cc, sd->span_weight,
+			threshold, cpus))
+			break;
+
+		sg = sd->first_group;
+		for (i = 0; i < half; ++i) {
+			/* if it has no cpus allowed */
+			if (!cpumask_intersects(sched_group_cpus(sg),
+					tsk_cpus_allowed(p)))
+				continue;
+
+			allowed = 1;
+			sg = sg->next;
+		}
+
+		if (!allowed)
+			break;
+
+		ns_half = half;
+		half /= 2;
+	}
+
+	if (!ns_half)
+		return NULL;
+
+	if (ns_half == 1)
+		return sd->first_group;
+
+	return find_lowest_cc_group(sd->first_group, ns_half);
+}
+
+/*
+ * For the majority of architectures, we have the following assumption:
+ * 1) every sched_group has the same weight
+ * 2) every CPU has the same computing power
+ */
+static inline int __nonshielded_groups(struct sched_domain *sd)
+{
+	int half, sg_weight, ret = 0;
+	u64 sd_cc;
+
+	half = DIV_ROUND_CLOSEST(sd->total_groups, 2);
+	sg_weight = sd->groups->group_weight;
+
+	sd_cc = sched_domain_cc(sd);
+	sd_cc *= 100;
+
+	while (half) {
+		int cpus = sg_weight * half;
+		u64 threshold = __calc_cc_thr(cpus,
+			sd->consolidating_coeff);
+
+		if (!__can_consolidate_cc(sd_cc, sd->span_weight,
+			threshold, cpus))
+			return ret;
+
+		ret = half;
+		half /= 2;
+	}
+
+	return ret;
+}
+
+/*
+ * if we decide to move workload from CPUx to CPUy (consolidating workload
+ * to CPUy), then we call CPUx shielded and CPUy nonshielded in the following.
+ *
+ * wc_nonshielded_mask - return the nonshielded CPUs in the @mask.
+ *
+ * traverse downward the sched_domain tree when the sched_domain contains
+ * flag SD_WORKLOAD_CONSOLIDATION, each sd may have more than two groups
+ */
+static void
+wc_nonshielded_mask(int cpu, struct sched_domain *sd, struct cpumask *mask)
+{
+	struct cpumask *nonshielded_cpus = __get_cpu_var(local_cpu_mask);
+	int i;
+
+	while (sd) {
+		struct sched_group *sg;
+		int this_sg_nr, ns_half;
+
+		if (!(sd->flags & SD_WORKLOAD_CONSOLIDATION)) {
+			sd = sd->child;
+			continue;
+		}
+
+		ns_half = __nonshielded_groups(sd);
+
+		if (!ns_half)
+			break;
+
+		cpumask_clear(nonshielded_cpus);
+		sg = sd->first_group;
+
+		for (i = 0; i < ns_half; ++i) {
+			cpumask_or(nonshielded_cpus, nonshielded_cpus,
+				sched_group_cpus(sg));
+			sg = sg->next;
+		}
+
+		cpumask_and(mask, mask, nonshielded_cpus);
+
+		this_sg_nr = sd->group_number;
+		if (this_sg_nr)
+			break;
+
+		sd = sd->child;
+	}
+}
+
+/*
+ * unload src_cpu to dst_cpu
+ */
+static void unload_cpu(int src_cpu, int dst_cpu)
+{
+	unsigned long flags;
+	struct rq *src_rq = cpu_rq(src_cpu);
+	int unload = 0;
+
+	raw_spin_lock_irqsave(&src_rq->lock, flags);
+
+	if (!src_rq->active_balance) {
+		src_rq->active_balance = 1;
+		src_rq->push_cpu = dst_cpu;
+		unload = 1;
+	}
+
+	raw_spin_unlock_irqrestore(&src_rq->lock, flags);
+
+	if (unload)
+		stop_one_cpu_nowait(src_cpu, active_load_balance_cpu_stop,
+			src_rq, &src_rq->active_balance_work);
+}
+
+static inline int find_lowest_cc_cpu(struct cpumask *mask)
+{
+	u64 cpu_cc, min = ULLONG_MAX;
+	int i, lowest = nr_cpu_ids;
+
+	for_each_cpu(i, mask) {
+		cpu_cc = get_cpu_concurrency(i) * capacity_of(i);
+
+		if (cpu_cc < min) {
+			min = cpu_cc;
+			lowest = i;
+		}
+	}
+
+	return lowest;
+}
+
+/*
+ * find the lowest cc cpu in shielded and nonshielded cpus,
+ * aggressively unload the shielded to the nonshielded
+ */
+static void wc_unload(struct cpumask *nonshielded, struct sched_domain *sd)
+{
+	int src_cpu = nr_cpu_ids, dst_cpu, cpu = smp_processor_id();
+	struct cpumask *shielded_cpus = __get_cpu_var(local_cpu_mask);
+	u64 cpu_cc, min = ULLONG_MAX;
+
+	cpumask_andnot(shielded_cpus, sched_domain_span(sd), nonshielded);
+
+	for_each_cpu(cpu, shielded_cpus) {
+		if (cpu_rq(cpu)->nr_running <= 0)
+			continue;
+
+		cpu_cc = get_cpu_concurrency(cpu) * capacity_of(cpu);
+		if (cpu_cc < min) {
+			min = cpu_cc;
+			src_cpu = cpu;
+		}
+	}
+
+	if (src_cpu >= nr_cpu_ids)
+		return;
+
+	dst_cpu = find_lowest_cc_cpu(nonshielded);
+	if (dst_cpu >= nr_cpu_ids)
+		return;
+
+	if (src_cpu != dst_cpu)
+		unload_cpu(src_cpu, dst_cpu);
+}
+
 #endif /* CONFIG_SMP */
-- 
1.7.9.5



* [RFC PATCH 6/9 v4] Implement Workload Consolidation in wakeup/fork/exec
From: Yuyang Du @ 2014-06-25  0:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, dietmar.eggemann,
	rajeev.d.muralidhar, vishwesh.m.rudramuni, nicole.chalhoub,
	ajaya.durg, harinarayanan.seshadri, jacob.jun.pan, Yuyang Du

In WAKE_AFFINE, if the target CPU (checked in wakee then waker order) is not
idle but is capable of handling the wakee task according to CC, we select it
anyway.

When looking for the idlest sched_group, we first try to find the consolidated
group.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched/sysctl.h |    4 ++++
 kernel/sched/fair.c          |   52 +++++++++++++++++++++++++++++++++++++++---
 kernel/sysctl.c              |    9 ++++++++
 3 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 596a0e0..78acbd7 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -40,6 +40,10 @@ extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_child_runs_first;
 
+#ifdef CONFIG_SMP
+extern unsigned int sysctl_sched_cc_wakeup_threshold;
+#endif
+
 enum sched_tunable_scaling {
 	SCHED_TUNABLESCALING_NONE,
 	SCHED_TUNABLESCALING_LOG,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f80058..008cbc9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2606,6 +2606,9 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 }
 
 static inline void update_cpu_concurrency(struct rq *rq);
+static struct sched_group *wc_find_group(struct sched_domain *sd,
+	struct task_struct *p, int this_cpu);
+static int cpu_cc_capable(int cpu);
 
 /*
  * Update the rq's load with the elapsed running time before entering
@@ -4421,7 +4424,19 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_group *sg;
 	int i = task_cpu(p);
 
-	if (idle_cpu(target))
+	/*
+	 * We prefer wakee to waker CPU. For each of them, if it is idle, then
+	 * select it, but if not, we lower down the bar to use a threshold of CC
+	 * to determine whether it is capable of handling the wakee task
+	 */
+	if (sysctl_sched_cc_wakeup_threshold) {
+		if (idle_cpu(i) || cpu_cc_capable(i))
+			return i;
+
+		if (i != target && (idle_cpu(target) || cpu_cc_capable(target)))
+			return target;
+	}
+	else if (idle_cpu(target))
 		return target;
 
 	/*
@@ -4515,7 +4530,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 
 	while (sd) {
-		struct sched_group *group;
+		struct sched_group *group = NULL;
 		int weight;
 
 		if (!(sd->flags & sd_flag)) {
@@ -4523,7 +4538,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			continue;
 		}
 
-		group = find_idlest_group(sd, p, cpu, sd_flag);
+		if (sd->flags & SD_WORKLOAD_CONSOLIDATION)
+			group = wc_find_group(sd, p, cpu);
+
+		if (!group)
+			group = find_idlest_group(sd, p, cpu, sd_flag);
+
 		if (!group) {
 			sd = sd->child;
 			continue;
@@ -7834,6 +7854,12 @@ __init void init_sched_fair_class(void)
  */
 
 /*
+ * concurrency lower than this threshold percentage of cc 1
+ * is capable of running wakee task, otherwise make it 0
+ */
+unsigned int sysctl_sched_cc_wakeup_threshold = 60UL;
+
+/*
  * we update cpu concurrency at:
  * 1) enqueue task, which increases concurrency
  * 2) dequeue task, which decreases concurrency
@@ -7860,6 +7886,26 @@ static inline unsigned long get_cpu_concurrency(int cpu)
 	return cpu_rq(cpu)->avg.load_avg_contrib;
 }
 
+/*
+ * whether cpu is capable of having more concurrency
+ */
+static int cpu_cc_capable(int cpu)
+{
+	u64 cpu_cc = get_cpu_concurrency(cpu);
+	u64 threshold = cc_weight(1);
+
+	cpu_cc *= 100;
+	cpu_cc *= capacity_of(cpu);
+
+	threshold *= sysctl_sched_cc_wakeup_threshold;
+	threshold <<= SCHED_CAPACITY_SHIFT;
+
+	if (cpu_cc <= threshold)
+		return 1;
+
+	return 0;
+}
+
 static inline u64 sched_group_cc(struct sched_group *sg)
 {
 	u64 sg_cc = 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7de6555..987557b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1102,6 +1102,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_SMP
+	{
+		.procname	= "sched_cc_wakeup_threshold",
+		.data		= &sysctl_sched_cc_wakeup_threshold,
+		.maxlen		= sizeof(sysctl_sched_cc_wakeup_threshold),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 	{ }
 };
 
-- 
1.7.9.5



* [RFC PATCH 7/9 v4] Implement Workload Consolidation in idle_balance
From: Yuyang Du @ 2014-06-25  0:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, dietmar.eggemann,
	rajeev.d.muralidhar, vishwesh.m.rudramuni, nicole.chalhoub,
	ajaya.durg, harinarayanan.seshadri, jacob.jun.pan, Yuyang Du

1) Skip pulling tasks to idle non-consolidated CPUs.

2) In addition, consolidated idle CPUs aggressively pull tasks from
   non-consolidated CPUs.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 008cbc9..bf65fde 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2608,6 +2608,9 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 static inline void update_cpu_concurrency(struct rq *rq);
 static struct sched_group *wc_find_group(struct sched_domain *sd,
 	struct task_struct *p, int this_cpu);
+static void wc_unload(struct cpumask *nonshielded, struct sched_domain *sd);
+static void wc_nonshielded_mask(int cpu, struct sched_domain *sd,
+	struct cpumask *mask);
 static int cpu_cc_capable(int cpu);
 
 /*
@@ -6808,6 +6811,22 @@ static int idle_balance(struct rq *this_rq)
 
 	update_blocked_averages(this_cpu);
 	rcu_read_lock();
+
+	sd = per_cpu(sd_wc, this_cpu);
+	if (sd) {
+		struct cpumask *nonshielded_cpus = __get_cpu_var(load_balance_mask);
+
+		cpumask_copy(nonshielded_cpus, cpu_active_mask);
+
+		/*
+		 * if we encounter shielded cpus here, don't do balance on them
+		 */
+		wc_nonshielded_mask(this_cpu, sd, nonshielded_cpus);
+		if (!cpumask_test_cpu(this_cpu, nonshielded_cpus))
+			goto unlock;
+		wc_unload(nonshielded_cpus, sd);
+	}
+
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
 		u64 t0, domain_cost;
@@ -6843,6 +6862,7 @@ static int idle_balance(struct rq *this_rq)
 		if (pulled_task || this_rq->nr_running > 0)
 			break;
 	}
+unlock:
 	rcu_read_unlock();
 
 	raw_spin_lock(&this_rq->lock);
-- 
1.7.9.5



* [RFC PATCH 8/9 v4] Implement Workload Consolidation in nohz_idle_balance
From: Yuyang Du @ 2014-06-25  0:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, dietmar.eggemann,
	rajeev.d.muralidhar, vishwesh.m.rudramuni, nicole.chalhoub,
	ajaya.durg, harinarayanan.seshadri, jacob.jun.pan, Yuyang Du

In periodic nohz idle balance, we skip kicking CPUs that are idle but
non-consolidated, and nohz_idle_balance() now takes a CPU mask to restrict
which idle CPUs it balances on behalf of.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   55 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 45 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf65fde..549f6e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6983,10 +6983,45 @@ static struct {
 
 static inline int find_new_ilb(void)
 {
-	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	int ilb;
 
-	if (ilb < nr_cpu_ids && idle_cpu(ilb))
-		return ilb;
+	/*
+	 * Optimize for the case when we have no idle CPUs or only one
+	 * idle CPU. Don't walk the sched_domain hierarchy in such cases
+	 */
+	if (cpumask_weight(nohz.idle_cpus_mask) < 2)
+		return nr_cpu_ids;
+
+	ilb = cpumask_first(nohz.idle_cpus_mask);
+
+	if (ilb < nr_cpu_ids && idle_cpu(ilb)) {
+		struct sched_domain *sd;
+		int this_cpu = smp_processor_id();
+
+		sd = per_cpu(sd_wc, this_cpu);
+		if (sd) {
+			struct cpumask *nonshielded_cpus = __get_cpu_var(load_balance_mask);
+
+			cpumask_copy(nonshielded_cpus, nohz.idle_cpus_mask);
+
+			rcu_read_lock();
+			wc_nonshielded_mask(this_cpu, sd, nonshielded_cpus);
+			rcu_read_unlock();
+
+			if (cpumask_weight(nonshielded_cpus) < 2)
+				return nr_cpu_ids;
+
+			/*
+			 * get idle load balancer again
+			 */
+			ilb = cpumask_first(nonshielded_cpus);
+
+		    if (ilb < nr_cpu_ids && idle_cpu(ilb))
+				return ilb;
+		}
+		else
+			return ilb;
+	}
 
 	return nr_cpu_ids;
 }
@@ -7217,7 +7252,7 @@ out:
  * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
  * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle, struct cpumask *mask)
 {
 	int this_cpu = this_rq->cpu;
 	struct rq *rq;
@@ -7227,7 +7262,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 	    !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
 		goto end;
 
-	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
+	for_each_cpu(balance_cpu, mask) {
 		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
 			continue;
 
@@ -7280,10 +7315,10 @@ static inline int nohz_kick_needed(struct rq *rq)
 	if (unlikely(rq->idle_balance))
 		return 0;
 
-       /*
-	* We may be recently in ticked or tickless idle mode. At the first
-	* busy tick after returning from idle, we will update the busy stats.
-	*/
+	/*
+	 * We may be recently in ticked or tickless idle mode. At the first
+	 * busy tick after returning from idle, we will update the busy stats.
+	 */
 	set_cpu_sd_state_busy();
 	nohz_balance_exit_idle(cpu);
 
@@ -7326,7 +7361,7 @@ need_kick:
 	return 1;
 }
 #else
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
+static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle, struct cpumask *mask) { }
 #endif
 
 /*
-- 
1.7.9.5



* [RFC PATCH 9/9 v4] Implement Workload Consolidation in periodic load balance
From: Yuyang Du @ 2014-06-25  0:36 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, dietmar.eggemann,
	rajeev.d.muralidhar, vishwesh.m.rudramuni, nicole.chalhoub,
	ajaya.durg, harinarayanan.seshadri, jacob.jun.pan, Yuyang Du

1) Skip pulling tasks to non-consolidated CPUs.

2) In addition, consolidated idle CPUs aggressively pull tasks from
   non-consolidated CPUs.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   50 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 43 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 549f6e0..7fcd2d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7371,17 +7371,53 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle, struc
 static void run_rebalance_domains(struct softirq_action *h)
 {
 	struct rq *this_rq = this_rq();
+	int this_cpu = cpu_of(this_rq);
+	struct sched_domain *sd;
 	enum cpu_idle_type idle = this_rq->idle_balance ?
 						CPU_IDLE : CPU_NOT_IDLE;
 
-	rebalance_domains(this_rq, idle);
+	sd = per_cpu(sd_wc, this_cpu);
+	if (sd) {
+		struct cpumask *nonshielded_cpus = __get_cpu_var(load_balance_mask);
 
-	/*
-	 * If this cpu has a pending nohz_balance_kick, then do the
-	 * balancing on behalf of the other idle cpus whose ticks are
-	 * stopped.
-	 */
-	nohz_idle_balance(this_rq, idle);
+		/*
+		 * if we encounter shielded cpus here, don't do balance on them
+		 */
+		cpumask_copy(nonshielded_cpus, cpu_active_mask);
+
+		rcu_read_lock();
+		wc_nonshielded_mask(this_cpu, sd, nonshielded_cpus);
+		rcu_read_unlock();
+
+		/*
+		 * aggressively unload the shielded cpus to unshielded cpus
+		 */
+		wc_unload(nonshielded_cpus, sd);
+
+		if (cpumask_test_cpu(this_cpu, nonshielded_cpus)) {
+			struct cpumask *idle_cpus = __get_cpu_var(local_cpu_mask);
+			cpumask_and(idle_cpus, nonshielded_cpus, nohz.idle_cpus_mask);
+
+			rebalance_domains(this_rq, idle);
+
+			/*
+			 * If this cpu has a pending nohz_balance_kick, then do the
+			 * balancing on behalf of the other idle cpus whose ticks are
+			 * stopped.
+			 */
+			nohz_idle_balance(this_rq, idle, idle_cpus);
+		}
+	}
+	else {
+		rebalance_domains(this_rq, idle);
+
+		/*
+		 * If this cpu has a pending nohz_balance_kick, then do the
+		 * balancing on behalf of the other idle cpus whose ticks are
+		 * stopped.
+		 */
+		nohz_idle_balance(this_rq, idle, nohz.idle_cpus_mask);
+	}
 }
 
 /*
-- 
1.7.9.5

