* [RFC 0/2] CPU frequency scaled from a task's load on an idle wakeup
@ 2014-11-10  5:45 Shilpasri G Bhat
  2014-11-10  5:45 ` [RFC 1/2] sched/fair: Add cumulative average of load_avg_contrib to a task Shilpasri G Bhat
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Shilpasri G Bhat @ 2014-11-10  5:45 UTC
  To: linux-kernel
  Cc: linux-pm, mturquette, amit.kucheria, vincent.guittot,
	daniel.lezcano, Morten.Rasmussen, efault, nicolas.pitre,
	dietmar.eggemann, pjt, bsegall, peterz, mingo, linaro-kernel,
	Shilpasri G Bhat

This patch set aims to solve a problem in the cpufreq governor's CPU
load calculation logic when a CPU wakes up after an idle period.
In the current logic, when a CPU wakes up from an idle state, the
'previous load' of the CPU is used as its current load on alternate
wakeups.

A latency-sensitive bursty task benefits from this logic if it wakes
up on the CPU on which it was previously running, with the CPU's
'previous load' intact, i.e., the 'previous load' still holds the last
CPU load calculated before the task went to sleep. In such a case, the
cpufreq governor sees the high previous CPU load and decides to run at
a high frequency.

The problem with this logic is that the 'previous load', which is
meant to help latency-sensitive bursty tasks, can be consumed by some
small periodic task (like a kernel daemon) if that task happens to
wake up first on the CPU. This deprives the latency-sensitive bursty
task of running at high frequency until the cpufreq governor notices
the 100% CPU utilization. If this pattern repeats over the course of
the bursty task's execution, we land back on the very problem that
'prev_load' originally set out to solve.

We could probably reduce these inefficiencies if the cpufreq governor
were aware of the task's nature while calculating the load during an
idle-wakeup scenario. So instead of using the CPU's previous load, the
load can be deduced from the incoming task's load.

In this patch set we use a metric built on top of 'load_avg_contrib'.
The 'load_avg_contrib' of a task's sched entity describes the nature
of the task in terms of its CPU utilization. Its ability to
encapsulate the CPU utilization of a task makes it a potential
candidate for scaling CPU frequency. However, by the nature of its
design, 'load_avg_contrib' cannot pick up a task's load rapidly after
a wakeup. Since we are trying to solve a problem in the idle-wakeup
case, we cannot use this metric's value as-is to scale the frequency.
So we track the cumulative moving average of 'load_avg_contrib'.

The cumulative average of 'load_avg_contrib' at a given point is the
average of all the values of 'load_avg_contrib' up to that point. The
updated average after a new 'load_avg_contrib' value arrives is:

                          x(n+1) + Cumulative_average(n) * n
Cumulative_average(n+1) = -----------------------------------
                                         n + 1
where,
Cumulative_average(n+1) is the updated cumulative average
x(n+1) is the latest 'load_avg_contrib' value
Cumulative_average(n) is the previous cumulative average
n+1 is the number of 'load_avg_contrib' values so far

The cumulative average of 'load_avg_contrib' smooths out the
short-term fluctuations and highlights the long-term trend of the
'load_avg_contrib' metric, so it depicts the nature of the task more
faithfully. We can therefore scale the CPU frequency based on the
task's cumulative average, making a calculated decision to decrease or
increase the frequency depending on the nature of the task.
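
For illustration, here is a minimal userspace sketch of the recurrence
above (names and sample values are ours, not part of the patches):

#include <stdio.h>

/*
 * Cumulative moving average, as in the formula above: x is the
 * latest 'load_avg_contrib' sample, n the number of samples
 * averaged so far. Integer division truncates, as in the kernel.
 */
static unsigned long cma_update(unsigned long avg, unsigned long n,
				unsigned long x)
{
	return (x + avg * n) / (n + 1);
}

int main(void)
{
	unsigned long samples[] = { 800, 0, 100 };	/* hypothetical samples */
	unsigned long avg = 0, n = 0;

	for (int i = 0; i < 3; i++) {
		avg = cma_update(avg, n, samples[i]);
		n++;
		printf("n=%lu avg=%lu\n", n, avg);	/* 800, 400, 300 */
	}
	return 0;
}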

Shilpasri G Bhat (2):
  sched/fair: Add cumulative average of load_avg_contrib to a task
  cpufreq: governor: CPU frequency scaled from task's cumulative-load on
    an idle wakeup

 drivers/cpufreq/cpufreq_governor.c | 39 +++++++++++++++-----------------------
 drivers/cpufreq/cpufreq_governor.h |  9 ++-------
 include/linux/sched.h              |  4 ++++
 kernel/sched/core.c                | 35 ++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c                |  6 +++++-
 kernel/sched/sched.h               |  2 +-
 6 files changed, 62 insertions(+), 33 deletions(-)

-- 
1.9.3



* [RFC 1/2] sched/fair: Add cumulative average of load_avg_contrib to a task
  2014-11-10  5:45 [RFC 0/2] CPU frequency scaled from a task's load on an idle wakeup Shilpasri G Bhat
@ 2014-11-10  5:45 ` Shilpasri G Bhat
  2014-11-10 13:49   ` Peter Zijlstra
  2014-11-10  5:45 ` [RFC 2/2] cpufreq: governor: CPU frequency scaled from task's cumulative-load on an idle wakeup Shilpasri G Bhat
  2014-11-10  9:19 ` [RFC 0/2] CPU frequency scaled from a task's load " Shilpasri G Bhat
  2 siblings, 1 reply; 8+ messages in thread
From: Shilpasri G Bhat @ 2014-11-10  5:45 UTC
  To: linux-kernel
  Cc: linux-pm, mturquette, amit.kucheria, vincent.guittot,
	daniel.lezcano, Morten.Rasmussen, efault, nicolas.pitre,
	dietmar.eggemann, pjt, bsegall, peterz, mingo, linaro-kernel,
	Shilpasri G Bhat

This patch aims to identify a suitable metric that can classify the
nature of a task as CPU intensive or non-intensive. The metric's value
will be used to improve the logic that scales CPU frequency during an
idle-wakeup scenario, i.e., to make the cpufreq governor scale the
frequency in a way that suits the workload.

Given its properties, the 'load_avg_contrib' of a task's sched entity
is a potential candidate for frequency scaling. However, by virtue of
its design this metric picks up the load only slowly after an idle
wakeup. So at the instant of wakeup we cannot rely on the latest value
of this metric, as it is unreliable. This can be seen in the test
results below.

I ran a modified version of ebizzy which sleeps in between its
execution, so that its utilization can be set to a user-defined value.
The trace below was observed for a single-thread ebizzy run at 40%
utilization.

T1 ebizzy   309.613862: .__update_entity_load_avg_contrib: ebizzy load_avg_contrib=1022
T2 ebizzy   309.613864: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8
T3 <idle>   310.062932: .__update_entity_load_avg_contrib: ebizzy load_avg_contrib=0
T4 <idle>   310.062936: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy
T5 ebizzy   310.063104: .__update_entity_load_avg_contrib: ebizzy load_avg_contrib=67
T6 ebizzy   310.063106: .__update_entity_load_avg_contrib: kworker/8:1 load_avg_contrib=0
T7 ebizzy   310.063108: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:1

At 309.613862 the 'load_avg_contrib' value of ebizzy is 1022, which
indicates its high utilization. It goes to sleep at 309.613864. After
a long idle period ebizzy wakes up at 310.062932. Once a CPU wakes up
from an idle state, all the timers that were deferred on that CPU are
fired. The cpufreq governor's timer is one such deferred timer, and it
fires shortly after ebizzy wakes up. On the next context switch at
310.063104 we can see that the 'load_avg_contrib' value gets updated
to 67. Then at 310.063108 the cpufreq governor is switched in to
calculate the load; if it were to consider ebizzy's 'load_avg_contrib'
value now, we would not benefit, as the recent value is far lower than
the value ebizzy had before going to sleep.

We can hide these fluctuations of 'load_avg_contrib' by averaging all
of its values. The cumulative average of 'load_avg_contrib' preserves
the long-term behavior of the task, so by using it we can scale the
CPU frequency to what best suits the task.

'load_avg_contrib' of ebizzy is updated at T1, T3 and T5. So over
the period T1-T7 ebizzy has the following values:
	load_avg_contrib 	cumulative_average
T1		1022		1022/1        = 1022
T3		0		(1022+0)/2    = 511
T5		67		(1022+0+67)/3 = 363

At T5 the cumulative_average is 363, which reflects the nature of the
task far better than the load_avg_contrib value of 67. Thus we can use
the cumulative_average to scale the CPU frequency during an idle
wakeup.
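
For completeness, the same numbers fall out of the patch's integer
update sequence below; a minimal userspace sketch (illustration only,
and it assumes the average starts at the T1 sample, whereas the patch
actually initializes it to the task's load weight):

#include <stdio.h>

/*
 * Replay the patch's update sequence (cumulative_avg *= count;
 * cumulative_avg += contrib; cumulative_avg /= ++count) on the
 * trace values above.
 */
int main(void)
{
	unsigned long contribs[] = { 0, 67 };		/* T3 and T5 */
	unsigned long cumulative_avg = 1022;		/* after T1 */
	unsigned long count = 1;

	for (int i = 0; i < 2; i++) {
		cumulative_avg *= count;
		cumulative_avg += contribs[i];
		cumulative_avg /= ++count;
		printf("cumulative_avg = %lu\n", cumulative_avg);
	}
	return 0;	/* prints 511, then 363, matching the table */
}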

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Suggested-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 include/linux/sched.h |  4 ++++
 kernel/sched/core.c   | 35 +++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   |  6 +++++-
 kernel/sched/sched.h  |  2 +-
 4 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..212a0a7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1081,6 +1081,8 @@ struct sched_avg {
 	u64 last_runnable_update;
 	s64 decay_count;
 	unsigned long load_avg_contrib;
+	unsigned long cumulative_avg;
+	unsigned long cumulative_avg_count;
 };
 
 #ifdef CONFIG_SCHEDSTATS
@@ -3032,3 +3034,5 @@ static inline unsigned long rlimit_max(unsigned int limit)
 }
 
 #endif
+
+extern unsigned int task_cumulative_load(int cpu);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4499950..b3d0d5a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2853,6 +2853,7 @@ need_resched:
 	if (likely(prev != next)) {
 		rq->nr_switches++;
 		rq->curr = next;
+		rq->prev = prev;
 		++*switch_count;
 
 		context_switch(rq, prev, next); /* unlocks the rq */
@@ -8219,3 +8220,37 @@ void dump_cpu_task(int cpu)
 	pr_info("Task dump for CPU %d:\n", cpu);
 	sched_show_task(cpu_curr(cpu));
 }
+
+/**
+ * task_cumulative_load - return the cumulative load of
+ * the previous task if cpu is the current cpu OR the
+ * cumulative load of current task on the cpu. If cpu
+ * is idle then return 0.
+ *
+ * Invoked by the cpufreq governor to calculate the
+ * load when the CPU is woken from an idle state.
+ *
+ */
+unsigned int task_cumulative_load(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *p;
+
+	if (cpu == smp_processor_id()) {
+		if (rq->prev == rq->idle)
+			goto idle;
+		p = rq->prev;
+	} else {
+		if (rq->curr == rq->idle)
+			goto idle;
+		p = rq->curr;
+	}
+	/*
+	 * Removing the priority as we are interested in CPU
+	 * utilization of the task
+	 */
+	return (100 * p->se.avg.cumulative_avg / p->se.load.weight);
+idle:
+	return 0;
+}
+EXPORT_SYMBOL(task_cumulative_load);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b069bf..58c27e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,6 +680,8 @@ void init_task_runnable_average(struct task_struct *p)
 	slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
 	p->se.avg.runnable_avg_sum = slice;
 	p->se.avg.runnable_avg_period = slice;
+	p->se.avg.cumulative_avg_count = 1;
+	p->se.avg.cumulative_avg = p->se.load.weight;
 	__update_task_entity_contrib(&p->se);
 }
 #else
@@ -2476,11 +2478,13 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 static inline void __update_task_entity_contrib(struct sched_entity *se)
 {
 	u32 contrib;
-
 	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
 	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
 	contrib /= (se->avg.runnable_avg_period + 1);
 	se->avg.load_avg_contrib = scale_load(contrib);
+	se->avg.cumulative_avg *= se->avg.cumulative_avg_count;
+	se->avg.cumulative_avg += se->avg.load_avg_contrib;
+	se->avg.cumulative_avg /= ++se->avg.cumulative_avg_count;
 }
 
 /* Compute the current contribution to load_avg by se, return any delta */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24156c84..064d6b1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -565,7 +565,7 @@ struct rq {
 	 */
 	unsigned long nr_uninterruptible;
 
-	struct task_struct *curr, *idle, *stop;
+	struct task_struct *curr, *idle, *stop, *prev;
 	unsigned long next_balance;
 	struct mm_struct *prev_mm;
 
-- 
1.9.3



* [RFC 2/2] cpufreq: governor: CPU frequency scaled from task's cumulative-load on an idle wakeup
  2014-11-10  5:45 [RFC 0/2] CPU frequency scaled from a task's load on an idle wakeup Shilpasri G Bhat
  2014-11-10  5:45 ` [RFC 1/2] sched/fair: Add cumulative average of load_avg_contrib to a task Shilpasri G Bhat
@ 2014-11-10  5:45 ` Shilpasri G Bhat
  2014-11-10  9:19 ` [RFC 0/2] CPU frequency scaled from a task's load " Shilpasri G Bhat
  2 siblings, 0 replies; 8+ messages in thread
From: Shilpasri G Bhat @ 2014-11-10  5:45 UTC
  To: linux-kernel
  Cc: linux-pm, mturquette, amit.kucheria, vincent.guittot,
	daniel.lezcano, Morten.Rasmussen, efault, nicolas.pitre,
	dietmar.eggemann, pjt, bsegall, peterz, mingo, linaro-kernel,
	Shilpasri G Bhat

"18b46a cpufreq: governor: Be friendly towards latency-sensitive
bursty workloads" patch by Srivatsa tried to solve a problem in
cpufreq governor's CPU load calculation logic. The fix is to detect
wakeup from cpuidle scenario and to start the workload at reasonably
high frequency and thus avoiding the redundant step to bring down
the frequency when a task wakes up on an idle CPU.

That logic reused the 'previous load' as the CPU load on alternate
wakeups from cpuidle instead of freshly calculating the load. The
issue is that the 'previous load', which is meant to help certain
latency-sensitive tasks, can sometimes be consumed by a kworker or
similar task if it happens to wake up first on the CPU. This deprives
the important tasks of a high CPU frequency, replicating the very
issue we set out to solve. Also, if the bursty workloads keep
migrating between CPUs, they are deprived of running at high frequency
in every wakeup-on-idle-CPU scenario. This is explained below:

   Time(ms)		Events on CPU

   100		Task A is running on CPU

   110		Governor's timer fires and calculates CPU load:
   		load=100 prev_load=100. It marks the high load
   		and increases the CPU frequency

   110.5	Task A is running

   118		Task A went to sleep and CPU became idle

   140		Task B starts to run

   141		Governor's deferred timer fires and calculates CPU load.
   		It notices the long idle period and uses the prev_load
		as the current CPU load. load=100 prev_load=0. It
		increases the CPU frequency as it sees the high load.

   141.5	Task B is running

   142		Task B went to sleep and CPU became idle

   174		Task A woke up and started running again

   175		Governor's deferred timer fires and calculates CPU load.
   		load=3 prev_load=3. Even though it is a long idle
		period the CPU load is freshly calculated as the
		prev_load is copied for alternate wake_ups. The
		governor decides to decrease the frequency as it
		notices low load.

At 175ms the governor decreases the CPU frequency because it freshly
calculates the load. So Task A is deprived of running at high
frequency for the next 10ms of its execution, until the governor's
timer fires again to compute the load. This happened because Task B
woke up first on the CPU, consuming the prev_load.

Considering latency-sensitive bursty workloads like Task A and short
bursty housekeeping kernel tasks like Task B, Task A will keep missing
its chance to run at high frequency on wakeup if Task B continues to
pop up in this fashion.

If the governor were aware of the history of the incoming task, it
could make a more appropriate decision when scaling the CPU frequency.
So instead of the CPU load, use the task's cumulative load to
calculate the load on every wakeup from an idle state. The cumulative
load of a task captures the nature of its CPU utilization. However, we
cannot map the cumulative load directly onto a CPU frequency, because
the current cpufreq governor logic is built around CPU load. So this
logic simply increases the CPU frequency when the task's cumulative
load is above a threshold limit.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Suggested-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 drivers/cpufreq/cpufreq_governor.c | 39 +++++++++++++++-----------------------
 drivers/cpufreq/cpufreq_governor.h |  9 ++-------
 2 files changed, 17 insertions(+), 31 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 1b44496..de4896c 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -19,6 +19,7 @@
 #include <linux/export.h>
 #include <linux/kernel_stat.h>
 #include <linux/slab.h>
+#include <linux/sched.h>
 
 #include "cpufreq_governor.h"
 
@@ -119,37 +120,29 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
 		 * actually is. This is undesirable for latency-sensitive bursty
 		 * workloads.
 		 *
-		 * To avoid this, we reuse the 'load' from the previous
-		 * time-window and give this task a chance to start with a
-		 * reasonably high CPU frequency. (However, we shouldn't over-do
-		 * this copy, lest we get stuck at a high load (high frequency)
-		 * for too long, even when the current system load has actually
-		 * dropped down. So we perform the copy only once, upon the
-		 * first wake-up from idle.)
+		 * To avoid this, we use the task's cumulative load which woke
+		 * up the idle CPU to suitably scale the CPU frequency. For a
+		 * sibling CPU we use the cumulative load of the current task on
+		 * that CPU.
+		 *
+		 * A task is considered to be CPU intensive if its cumulative
+		 * load is greater than 10. Hence it is desirable to increase
+		 * the CPU frequency to frequency_max.
 		 *
 		 * Detecting this situation is easy: the governor's deferrable
 		 * timer would not have fired during CPU-idle periods. Hence
 		 * an unusually large 'wall_time' (as compared to the sampling
 		 * rate) indicates this scenario.
-		 *
-		 * prev_load can be zero in two cases and we must recalculate it
-		 * for both cases:
-		 * - during long idle intervals
-		 * - explicitly set to zero
 		 */
-		if (unlikely(wall_time > (2 * sampling_rate) &&
-			     j_cdbs->prev_load)) {
-			load = j_cdbs->prev_load;
 
-			/*
-			 * Perform a destructive copy, to ensure that we copy
-			 * the previous load only once, upon the first wake-up
-			 * from idle.
-			 */
-			j_cdbs->prev_load = 0;
+		if (unlikely(wall_time > (2 * sampling_rate))) {
+			load = task_cumulative_load(j);
+			if (load > 10) {
+				max_load = MAX_LOAD;
+				break;
+			}
 		} else {
 			load = 100 * (wall_time - idle_time) / wall_time;
-			j_cdbs->prev_load = load;
 		}
 
 		if (load > max_load)
@@ -381,8 +374,6 @@ int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 
 			prev_load = (unsigned int)
 				(j_cdbs->prev_cpu_wall - j_cdbs->prev_cpu_idle);
-			j_cdbs->prev_load = 100 * prev_load /
-					(unsigned int) j_cdbs->prev_cpu_wall;
 
 			if (ignore_nice)
 				j_cdbs->prev_cpu_nice =
diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h
index cc401d1..bfae4fe 100644
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -36,6 +36,8 @@
 #define MIN_LATENCY_MULTIPLIER			(20)
 #define TRANSITION_LATENCY_LIMIT		(10 * 1000 * 1000)
 
+#define MAX_LOAD				(99)
+
 /* Ondemand Sampling types */
 enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
 
@@ -134,13 +136,6 @@ struct cpu_dbs_common_info {
 	u64 prev_cpu_idle;
 	u64 prev_cpu_wall;
 	u64 prev_cpu_nice;
-	/*
-	 * Used to keep track of load in the previous interval. However, when
-	 * explicitly set to zero, it is used as a flag to ensure that we copy
-	 * the previous load to the current interval only once, upon the first
-	 * wake-up from idle.
-	 */
-	unsigned int prev_load;
 	struct cpufreq_policy *cur_policy;
 	struct delayed_work work;
 	/*
-- 
1.9.3



* Re: [RFC 0/2] CPU frequency scaled from a task's load on an idle wakeup
  2014-11-10  5:45 [RFC 0/2] CPU frequency scaled from a task's load on an idle wakeup Shilpasri G Bhat
  2014-11-10  5:45 ` [RFC 1/2] sched/fair: Add cumulative average of load_avg_contrib to a task Shilpasri G Bhat
  2014-11-10  5:45 ` [RFC 2/2] cpufreq: governor: CPU frequency scaled from task's cumulative-load on an idle wakeup Shilpasri G Bhat
@ 2014-11-10  9:19 ` Shilpasri G Bhat
  2 siblings, 0 replies; 8+ messages in thread
From: Shilpasri G Bhat @ 2014-11-10  9:19 UTC
  To: linux-kernel
  Cc: linux-pm, mturquette, amit.kucheria, vincent.guittot,
	daniel.lezcano, Morten.Rasmussen, efault, nicolas.pitre,
	dietmar.eggemann, pjt, bsegall, peterz, mingo, linaro-kernel,
	Preeti U Murthy

Experimental Results:

Tested on a PowerPC machine with 16 cores; the following results were
obtained with the patch set.

I ran a modified version of ebizzy called sleeping-ebizzy, which runs
ebizzy at various levels of utilization. The following results were
found by running ebizzy with 1 thread for 30s.

  Utilization(%)	Difference(%) in records/s with patch
  --------------        -------------------------------------
	10			-0.5516335445
	20			+0.0196049675
	30			+0.2222333684
	40			+0.3205441843
	50			-0.0103332452
	60			-0.3525380134
	70			+0.428654342
	80			+0.1527132862
	90			+0.0758061406

Thanks and Regards,
Shilpa


* Re: [RFC 1/2] sched/fair: Add cumulative average of load_avg_contrib to a task
  2014-11-10  5:45 ` [RFC 1/2] sched/fair: Add cumulative average of load_avg_contrib to a task Shilpasri G Bhat
@ 2014-11-10 13:49   ` Peter Zijlstra
  2014-11-11 14:52     ` Shilpasri G Bhat
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2014-11-10 13:49 UTC
  To: Shilpasri G Bhat
  Cc: linux-kernel, linux-pm, mturquette, amit.kucheria,
	vincent.guittot, daniel.lezcano, Morten.Rasmussen, efault,
	nicolas.pitre, dietmar.eggemann, pjt, bsegall, mingo,
	linaro-kernel

On Mon, Nov 10, 2014 at 11:15:57AM +0530, Shilpasri G Bhat wrote:
> +/**
> + * task_cumulative_load - return the cumulative load of
> + * the previous task if cpu is the current cpu OR the
> + * cumulative load of current task on the cpu. If cpu
> + * is idle then return 0.
> + *
> + * Invoked by the cpufreq governor to calculate the
> + * load when the CPU is woken from an idle state.
> + *
> + */
> +unsigned int task_cumulative_load(int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +	struct task_struct *p;
> +
> +	if (cpu == smp_processor_id()) {
> +		if (rq->prev == rq->idle)
> +			goto idle;
> +		p = rq->prev;
> +	} else {
> +		if (rq->curr == rq->idle)
> +			goto idle;
> +		p = rq->curr;
> +	}
> +	/*
> +	 * Removing the priority as we are interested in CPU
> +	 * utilization of the task
> +	 */
> +	return (100 * p->se.avg.cumulative_avg / p->se.load.weight);
> +idle:
> +	return 0;
> +}
> +EXPORT_SYMBOL(task_cumulative_load);

This doesn't make any sense, it's wrong, and it's also broken.

This doesn't make any sense, because the load of a cpu is unrelated to
whatever task ran there last. You want to look at the task that is
waking now.

It is wrong because dividing out the load.weight doesn't quite work
right. Also there are patches that introduce unweighted stats like you
want; you could have used those.

It is broken because who says rq->prev still exists?

> @@ -2476,11 +2478,13 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
>  static inline void __update_task_entity_contrib(struct sched_entity *se)
>  {
>  	u32 contrib;
> -
>  	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
>  	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
>  	contrib /= (se->avg.runnable_avg_period + 1);
>  	se->avg.load_avg_contrib = scale_load(contrib);
> +	se->avg.cumulative_avg *= se->avg.cumulative_avg_count;
> +	se->avg.cumulative_avg += se->avg.load_avg_contrib;
> +	se->avg.cumulative_avg /= ++se->avg.cumulative_avg_count;
>  }

This is not a numerically stable algorithm. Look what happens when
cumulative_avg_count gets large. Also, whatever made you choose an
absolute decay filter like that?
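
To make this concrete, here is a minimal userspace sketch (an
illustration, not part of the patch) of how the average stops
responding once the count is large:

#include <stdio.h>

/*
 * Each new sample carries weight 1/(n+1), so once
 * cumulative_avg_count is large even a sustained burst of full
 * load barely moves the average.
 */
int main(void)
{
	unsigned long avg = 0;		/* long-lived, mostly idle task */
	unsigned long count = 1000000;

	/* 100 consecutive samples at full load (1024) */
	for (int i = 0; i < 100; i++)
		avg = (avg * count + 1024) / ++count;

	/* still prints 0: the burst is drowned by the task's history */
	printf("avg after burst: %lu\n", avg);
	return 0;
}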


* Re: [RFC 1/2] sched/fair: Add cumulative average of load_avg_contrib to a task
  2014-11-10 13:49   ` Peter Zijlstra
@ 2014-11-11 14:52     ` Shilpasri G Bhat
  2014-11-11 17:29       ` Peter Zijlstra
  2014-11-11 17:30       ` Peter Zijlstra
  0 siblings, 2 replies; 8+ messages in thread
From: Shilpasri G Bhat @ 2014-11-11 14:52 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, mturquette, amit.kucheria,
	vincent.guittot, daniel.lezcano, Morten.Rasmussen, efault,
	nicolas.pitre, dietmar.eggemann, pjt, bsegall, mingo,
	linaro-kernel, Preeti U Murthy, Shilpasri G Bhat

On 11/10/2014 07:19 PM, Peter Zijlstra wrote:
> On Mon, Nov 10, 2014 at 11:15:57AM +0530, Shilpasri G Bhat wrote:
>> +/**
>> + * task_cumulative_load - return the cumulative load of
>> + * the previous task if cpu is the current cpu OR the
>> + * cumulative load of current task on the cpu. If cpu
>> + * is idle then return 0.
>> + *
>> + * Invoked by the cpufreq governor to calculate the
>> + * load when the CPU is woken from an idle state.
>> + *
>> + */
>> +unsigned int task_cumulative_load(int cpu)
>> +{
>> +	struct rq *rq = cpu_rq(cpu);
>> +	struct task_struct *p;
>> +
>> +	if (cpu == smp_processor_id()) {
>> +		if (rq->prev == rq->idle)
>> +			goto idle;
>> +		p = rq->prev;
>> +	} else {
>> +		if (rq->curr == rq->idle)
>> +			goto idle;
>> +		p = rq->curr;
>> +	}
>> +	/*
>> +	 * Removing the priority as we are interested in CPU
>> +	 * utilization of the task
>> +	 */
>> +	return (100 * p->se.avg.cumulative_avg / p->se.load.weight);
>> +idle:
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(task_cumulative_load);
> 
> This doesn't make any sense, it's wrong, and it's also broken.
> 
> This doesn't make any sense, because the load of a cpu is unrelated to
> whatever task ran there last. You want to look at the task that is
> waking now.

Yes, I agree that the task in consideration should be the current
task on the CPU and not the last task. But the above code runs in the
cpufreq governor's kworker. So the task in context while executing
'task_cumulative_load()' is the kworker, and we are not interested in
the kworker's load. The load we want to account for is that of the
kworker's predecessor, i.e., the task that woke up on the idle CPU.

> 
> > It is wrong because dividing out the load.weight doesn't quite work
> > right. Also there are patches that introduce unweighted stats like you
> > want; you could have used those.

Okay.
> 
> It is broken because who says rq->prev still exists?

'task_cumulative_load()' runs in the cpufreq governor's kworker, and
the kworker is queued only if there is a task running on the CPU,
which guarantees the existence of rq->prev in a running state.

> 
>> @@ -2476,11 +2478,13 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
>>  static inline void __update_task_entity_contrib(struct sched_entity *se)
>>  {
>>  	u32 contrib;
>> -
>>  	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
>>  	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
>>  	contrib /= (se->avg.runnable_avg_period + 1);
>>  	se->avg.load_avg_contrib = scale_load(contrib);
>> +	se->avg.cumulative_avg *= se->avg.cumulative_avg_count;
>> +	se->avg.cumulative_avg += se->avg.load_avg_contrib;
>> +	se->avg.cumulative_avg /= ++se->avg.cumulative_avg_count;
>>  }
> 
> This is not a numerically stable algorithm. Look what happens when
> cumulative_avg_count gets large. Also, whatever made you choose an
> absolute decay filter like that?
> 

Yes I agree.

The problem I am trying to solve with this metric is: if the cpufreq
governor's timer is handled very early after a task wakes up to run on
an idle CPU, then the task's 'load_avg_contrib' value is still very
low when the governor's kworker is switched in. This happens because
the task has not run long enough to update its 'load_avg_contrib' to
show a considerable amount of load.

Cumulative load was our initial attempt at overcoming this. It is true
that this is not the best solution. We wanted the community's suggestion
on how to get around this issue. What do you propose?

Thanks and Regards,
Shilpa



* Re: [RFC 1/2] sched/fair: Add cumulative average of load_avg_contrib to a task
  2014-11-11 14:52     ` Shilpasri G Bhat
@ 2014-11-11 17:29       ` Peter Zijlstra
  2014-11-11 17:30       ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2014-11-11 17:29 UTC
  To: Shilpasri G Bhat
  Cc: linux-kernel, linux-pm, mturquette, amit.kucheria,
	vincent.guittot, daniel.lezcano, Morten.Rasmussen, efault,
	nicolas.pitre, dietmar.eggemann, pjt, bsegall, mingo,
	linaro-kernel, Preeti U Murthy

On Tue, Nov 11, 2014 at 08:22:55PM +0530, Shilpasri G Bhat wrote:
> Cumulative load was our initial attempt at overcoming this. It is true
> that this is not the best solution. We wanted the community's suggestion
> on how to get around this issue. What do you propose?

Shoot cpufreq in the head basically :-) We're looking at completely
reworking the way we do things.

We basically want to integrate cpufreq into the scheduler and effect
frequency changes from the scheduler context where possible, only
reverting to schedulable context (kworker) where there is no other
possibility.


* Re: [RFC 1/2] sched/fair: Add cumulative average of load_avg_contrib to a task
  2014-11-11 14:52     ` Shilpasri G Bhat
  2014-11-11 17:29       ` Peter Zijlstra
@ 2014-11-11 17:30       ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2014-11-11 17:30 UTC
  To: Shilpasri G Bhat
  Cc: linux-kernel, linux-pm, mturquette, amit.kucheria,
	vincent.guittot, daniel.lezcano, Morten.Rasmussen, efault,
	nicolas.pitre, dietmar.eggemann, pjt, bsegall, mingo,
	linaro-kernel, Preeti U Murthy

On Tue, Nov 11, 2014 at 08:22:55PM +0530, Shilpasri G Bhat wrote:
> On 11/10/2014 07:19 PM, Peter Zijlstra wrote:
> > It is broken because who says rq->prev still exists?
> 
> 'task_cumulative_load()' runs in the cpufreq governor's kworker, and
> the kworker is queued only if there is a task running on the CPU,
> which guarantees the existence of rq->prev in a running state.

Not so, there is no guarantee it will still be running by the time the
kworker actually comes around to running.

