linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] Improve estimated utilization of preempted FAIR tasks
@ 2018-06-04 16:05 Patrick Bellasi
  2018-06-04 16:05 ` [PATCH 1/2] sched/fair: pelt: use u32 for util_avg Patrick Bellasi
  2018-06-04 16:06 ` [PATCH 2/2] sched/fair: util_est: add running_sum tracking Patrick Bellasi
  0 siblings, 2 replies; 19+ messages in thread
From: Patrick Bellasi @ 2018-06-04 16:05 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Joel Fernandes, Steve Muckle

Here is a small series to improve the estimated utilization of FAIR tasks
which are preempted, either by another FAIR task or by a task of a higher
priority scheduling class (i.e. RT or DL).

It can certainly be improved, but it has already been functionally
tested. Given the discussion going on on IRC and in this other
thread:

   https://lkml.org/lkml/2018/5/25/410
   1527253951-22709-1-git-send-email-vincent.guittot@linaro.org

I wanted to anticipate a possible simple alternative/additional approach
to improve CPU utilization tracking for CFS tasks. Thus, every comment
and suggestion is more than welcome!

The main reasons and the overall idea are explained in the second patch of
this series, while the first patch is just a first step towards a more
memory- and run-time-efficient implementation.

The ultimate benefit of this series is to make the CFS class more robust
in its integration with the schedutil governor. Indeed, by better
estimating the CPU bandwidth requirements of CFS tasks we can assess
their bandwidth demands more accurately and be less sensitive to side
effects on PELT generated by higher priority classes preempting CFS
tasks, such as the decay of the CFS util_avg.

Cheers,
Patrick

Patrick Bellasi (2):
  sched/fair: pelt: use u32 for util_avg
  sched/fair: util_est: add running_sum tracking

 include/linux/sched.h |  3 ++-
 kernel/sched/debug.c  |  2 +-
 kernel/sched/fair.c   | 33 ++++++++++++++++++++++++---------
 3 files changed, 27 insertions(+), 11 deletions(-)

-- 
2.15.1


* [PATCH 1/2] sched/fair: pelt: use u32 for util_avg
  2018-06-04 16:05 [PATCH 0/2] Improve estimated utilization of preempted FAIR tasks Patrick Bellasi
@ 2018-06-04 16:05 ` Patrick Bellasi
  2018-06-05  1:30   ` kbuild test robot
  2018-06-05  1:34   ` kbuild test robot
  2018-06-04 16:06 ` [PATCH 2/2] sched/fair: util_est: add running_sum tracking Patrick Bellasi
  1 sibling, 2 replies; 19+ messages in thread
From: Patrick Bellasi @ 2018-06-04 16:05 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Joel Fernandes, Steve Muckle, Todd Kjos

The util_avg signal is used to track the utilization (i.e. RUNNING time)
of SEs and RQs. Its values are computed according to the PELT algorithm
and thus, for SEs, they are bounded by an (internal) representation which
uses 20 bits. For RQs, instead, they are technically unbounded, since
when tasks are migrated across RQs we sum their utilization into the
destination RQ.

We currently use an unsigned long to track util_avg, which maps to a
64-bit storage on 64-bit systems. However, 32 bits should be good enough
for all practical usages. Indeed, even for RQs, the remaining 12 bits
allow tracking up to 4K tasks at 100% utilization concurrently RUNNABLE
on a single CPU.
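
For reference, the headroom above comes down to simple arithmetic (an
illustration only, not part of the original changelog): with a per-SE
bound of 2^20, a u32 RQ-level value leaves

   2^32 / 2^20 = 2^12 = 4096

i.e. up to ~4K fully utilized SEs can be summed without overflowing.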

Since the sched_avg data structure already completely fits within a 64B
cache line, let's get back 4B by using a u32 to track util_avg. The
recovered space can then be used to fit other load-tracking-related
metrics into the same cache line.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 include/linux/sched.h |  2 +-
 kernel/sched/debug.c  |  2 +-
 kernel/sched/fair.c   | 17 ++++++++++-------
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 28ff3ca9f752..9d8732dab264 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -402,7 +402,7 @@ struct sched_avg {
 	u32				period_contrib;
 	unsigned long			load_avg;
 	unsigned long			runnable_load_avg;
-	unsigned long			util_avg;
+	u32				util_avg;
 	struct util_est			util_est;
 } ____cacheline_aligned;
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 15b10e210a6b..a985789eeb9c 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -541,7 +541,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->avg.load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_load_avg",
 			cfs_rq->avg.runnable_load_avg);
-	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
+	SEQ_printf(m, "  .%-30s: %u\n", "util_avg",
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %u\n", "util_est_enqueued",
 			cfs_rq->avg.util_est.enqueued);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05aab7f..f74441be3f44 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -750,19 +750,22 @@ static void attach_entity_cfs_rq(struct sched_entity *se);
 void post_init_entity_util_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	struct sched_avg *sa = &se->avg;
 	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
 
 	if (cap > 0) {
-		if (cfs_rq->avg.util_avg != 0) {
-			sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
-			sa->util_avg /= (cfs_rq->avg.load_avg + 1);
+		struct sched_avg *sa = &se->avg;
+		u64 util_avg = READ_ONCE(sa->util_avg);
 
-			if (sa->util_avg > cap)
-				sa->util_avg = cap;
+		if (cfs_rq->avg.util_avg != 0) {
+			util_avg  =  cfs_rq->avg.util_avg * se->load.weight;
+			util_avg /= (cfs_rq->avg.load_avg + 1);
+			if (util_avg > cap)
+				util_avg = cap;
 		} else {
-			sa->util_avg = cap;
+			util_avg = cap;
 		}
+
+		WRITE_ONCE(sa->util_avg, util_avg);
 	}
 
 	if (entity_is_task(se)) {
-- 
2.15.1


* [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-04 16:05 [PATCH 0/2] Improve estimated utilization of preempted FAIR tasks Patrick Bellasi
  2018-06-04 16:05 ` [PATCH 1/2] sched/fair: pelt: use u32 for util_avg Patrick Bellasi
@ 2018-06-04 16:06 ` Patrick Bellasi
  2018-06-04 17:46   ` Joel Fernandes
                     ` (3 more replies)
  1 sibling, 4 replies; 19+ messages in thread
From: Patrick Bellasi @ 2018-06-04 16:06 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Joel Fernandes, Steve Muckle, Todd Kjos

The estimated utilization of a task is affected by the task being
preempted, either by another FAIR task or by a task of a higher priority
class (i.e. RT or DL). Indeed, when a preemption happens, the PELT
utilization of the preempted task is decayed a bit. That's actually
correct for utilization, whose goal is to measure the actual CPU
bandwidth consumed by a task.

However, the above behavior does not allow knowing exactly what
utilization a task "would have used" had it been running without being
preempted. This reduces the effectiveness of util_est for a task, because
it does not always allow predicting how much CPU the task is likely to
require.

Let's improve the estimated utilization by adding a new "sort-of" PELT
signal, explicitly only for SEs, which has the following behavior:
 a) at each enqueue time of a task, its value is the (already decayed)
    util_avg of the task being enqueued
 b) it's updated at each update_load_avg()
 c) it can only increase, whenever the task is actually RUNNING on a
    CPU, while it's kept stable while the task is RUNNABLE but not
    actively consuming CPU bandwidth

Such a signal is exactly equivalent to util_avg for a task running alone
on a CPU while, in case the task is preempted, it allows knowing at
dequeue time how much the task utilization would have been had it been
running alone on that CPU.

This new signal is named "running_avg", since it tracks the actual
RUNNING time of a task, ignoring any form of preemption.

From an implementation standpoint, since the sched_avg should fit into a
single cache line, we save space by tracking only a new running sum:
   p->se.avg.running_sum
while the conversion into a running_avg is done on demand whenever we
need it, which is at task dequeue time, when a new util_est sample has to
be collected.

The conversion from "running_sum" to "running_avg" is done by performing
a single division by LOAD_AVG_MAX, which introduces a small error since
in the division we do not consider the (sa->period_contrib - 1024)
compensation factor used in ___update_load_avg(). However:
 a) this error is expected to be limited (~2-3%)
 b) it can be safely ignored, since the estimated utilization is the only
    consumer and it is already subject to small estimation errors

The corresponding benefit is that, at run-time, we pay the cost of an
additional sum and multiply, while the more expensive division is
required only at dequeue time.
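
To illustrate the on-demand conversion described above, a helper along
these lines would be enough (the helper name and its placement are
hypothetical, for illustration only; they are not part of this patch):

	/*
	 * Sketch only: convert the non-decayed running_sum into a
	 * utilization value with a single division by LOAD_AVG_MAX,
	 * deliberately ignoring the (sa->period_contrib - 1024)
	 * compensation used by ___update_load_avg().
	 */
	static inline unsigned long task_running_avg(struct task_struct *p)
	{
		return p->se.avg.running_sum / LOAD_AVG_MAX;
	}

This is exactly the expression used at dequeue time in the hunk below.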

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 16 ++++++++++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9d8732dab264..2bd5f1c68da9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -399,6 +399,7 @@ struct sched_avg {
 	u64				load_sum;
 	u64				runnable_load_sum;
 	u32				util_sum;
+	u32				running_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
 	unsigned long			runnable_load_avg;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f74441be3f44..5d54d6a4c31f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 		sa->runnable_load_sum =
 			decay_load(sa->runnable_load_sum, periods);
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
+		if (running)
+			sa->running_sum = decay_load(sa->running_sum, periods);
 
 		/*
 		 * Step 2
@@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 		sa->load_sum += load * contrib;
 	if (runnable)
 		sa->runnable_load_sum += runnable * contrib;
-	if (running)
+	if (running) {
 		sa->util_sum += contrib * scale_cpu;
+		sa->running_sum += contrib * scale_cpu;
+	}
 
 	return periods;
 }
@@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
 	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
 }
 
+static inline void util_est_enqueue_running(struct task_struct *p)
+{
+	/* Initialize the (non-preempted) utilization */
+	p->se.avg.running_sum = p->se.avg.util_sum;
+}
+
 /*
  * Check if a (signed) value is within a specified (unsigned) margin,
  * based on the observation that:
@@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 	 * Skip update of task's estimated utilization when its EWMA is
 	 * already ~1% close to its last activation value.
 	 */
-	ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
+	ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
 	last_ewma_diff = ue.enqueued - ue.ewma;
 	if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
 		return;
@@ -5437,6 +5447,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se)
 		add_nr_running(rq, 1);
 
+	util_est_enqueue_running(p);
+
 	hrtick_update(rq);
 }
 
-- 
2.15.1


* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-04 16:06 ` [PATCH 2/2] sched/fair: util_est: add running_sum tracking Patrick Bellasi
@ 2018-06-04 17:46   ` Joel Fernandes
  2018-06-05 15:21     ` Patrick Bellasi
  2018-06-05  1:29   ` kbuild test robot
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Joel Fernandes @ 2018-06-04 17:46 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Joel Fernandes,
	Steve Muckle, Todd Kjos

Hi Patrick,

On Mon, Jun 04, 2018 at 05:06:00PM +0100, Patrick Bellasi wrote:
> The estimated utilization of a task is affected by the task being
> preempted, either by another FAIR task of by a task of an higher
> priority class (i.e. RT or DL). Indeed, when a preemption happens, the
> PELT utilization of the preempted task is going to be decayed a bit.
> That's actually correct for utilization, which goal is to measure the
> actual CPU bandwidth consumed by a task.
> 
> However, the above behavior does not allow to know exactly what is the
> utilization a task "would have used" if it was running without
> being preempted. Thus, this reduces the effectiveness of util_est for a
> task because it does not always allow to predict how much CPU a task is
> likely to require.
> 
> Let's improve the estimated utilization by adding a new "sort-of" PELT
> signal, explicitly only for SE which has the following behavior:
>  a) at each enqueue time of a task, its value is the (already decayed)
>     util_avg of the task being enqueued
>  b) it's updated at each update_load_avg
>  c) it can just increase, whenever the task is actually RUNNING on a
>     CPU, while it's kept stable while the task is RUNNANBLE but not
>     actively consuming CPU bandwidth
> 
> Such a defined signal is exactly equivalent to the util_avg for a task
> running alone on a CPU while, in case the task is preempted, it allows
> to know at dequeue time how much would have been the task utilization if
> it was running alone on that CPU.
> 
> This new signal is named "running_avg", since it tracks the actual
> RUNNING time of a task by ignoring any form of preemption.
> 
> From an implementation standpoint, since the sched_avg should fit into a
> single cache line, we save space by tracking only a new runnable sum:
>    p->se.avg.running_sum
> while the conversion into a running_avg is done on demand whenever we
> need it, which is at task dequeue time when a new util_est sample has to
> be collected.
> 
> The conversion from "running_sum" to "running_avg" is done by performing
> a single division by LOAD_AVG_MAX, which introduces a small error since
> in the division we do not consider the (sa->period_contrib - 1024)
> compensation factor used in ___update_load_avg(). However:
>  a) this error is expected to be limited (~2-3%)
>  b) it can be safely ignored since the estimated utilization is the only
>     consumer which is already subject to small estimation errors
> 
> The additional corresponding benefit is that, at run-time, we pay the
> cost for a additional sum and multiply, while the more expensive
> division is required only at dequeue time.
> 
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Todd Kjos <tkjos@google.com>
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Steve Muckle <smuckle@google.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-pm@vger.kernel.org
> ---
>  include/linux/sched.h |  1 +
>  kernel/sched/fair.c   | 16 ++++++++++++++--
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9d8732dab264..2bd5f1c68da9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -399,6 +399,7 @@ struct sched_avg {
>  	u64				load_sum;
>  	u64				runnable_load_sum;
>  	u32				util_sum;
> +	u32				running_sum;
>  	u32				period_contrib;
>  	unsigned long			load_avg;
>  	unsigned long			runnable_load_avg;

Should update the documentation comments above the struct too?

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f74441be3f44..5d54d6a4c31f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
>  		sa->runnable_load_sum =
>  			decay_load(sa->runnable_load_sum, periods);
>  		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> +		if (running)
> +			sa->running_sum = decay_load(sa->running_sum, periods);
>  
>  		/*
>  		 * Step 2
> @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
>  		sa->load_sum += load * contrib;
>  	if (runnable)
>  		sa->runnable_load_sum += runnable * contrib;
> -	if (running)
> +	if (running) {
>  		sa->util_sum += contrib * scale_cpu;
> +		sa->running_sum += contrib * scale_cpu;
> +	}
>  
>  	return periods;
>  }
> @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
>  	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
>  }

The PELT changes look nice and make sense :)

> +static inline void util_est_enqueue_running(struct task_struct *p)
> +{
> +	/* Initilize the (non-preempted) utilization */
> +	p->se.avg.running_sum = p->se.avg.util_sum;
> +}
> +
>  /*
>   * Check if a (signed) value is within a specified (unsigned) margin,
>   * based on the observation that:
> @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
>  	 * Skip update of task's estimated utilization when its EWMA is
>  	 * already ~1% close to its last activation value.
>  	 */
> -	ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> +	ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;

I guess we are doing an extra division here, which adds some cost. Does
performance look OK with the change?

thanks,

 - Joel
 


* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-04 16:06 ` [PATCH 2/2] sched/fair: util_est: add running_sum tracking Patrick Bellasi
  2018-06-04 17:46   ` Joel Fernandes
@ 2018-06-05  1:29   ` kbuild test robot
  2018-06-05  6:57   ` Vincent Guittot
  2018-06-05 10:46   ` kbuild test robot
  3 siblings, 0 replies; 19+ messages in thread
From: kbuild test robot @ 2018-06-05  1:29 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: kbuild-all, linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Joel Fernandes,
	Steve Muckle, Todd Kjos

[-- Attachment #1: Type: text/plain, Size: 3083 bytes --]

Hi Patrick,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on next-20180604]
[cannot apply to v4.17]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Patrick-Bellasi/sched-fair-pelt-use-u32-for-util_avg/20180605-082640
config: i386-randconfig-a1-06041847 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   kernel/sched/fair.c: In function 'enqueue_task_fair':
>> kernel/sched/fair.c:5450:2: error: implicit declaration of function 'util_est_enqueue_running' [-Werror=implicit-function-declaration]
     util_est_enqueue_running(p);
     ^
   cc1: some warnings being treated as errors

vim +/util_est_enqueue_running +5450 kernel/sched/fair.c

  5389	
  5390	/*
  5391	 * The enqueue_task method is called before nr_running is
  5392	 * increased. Here we update the fair scheduling stats and
  5393	 * then put the task into the rbtree:
  5394	 */
  5395	static void
  5396	enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
  5397	{
  5398		struct cfs_rq *cfs_rq;
  5399		struct sched_entity *se = &p->se;
  5400	
  5401		/*
  5402		 * The code below (indirectly) updates schedutil which looks at
  5403		 * the cfs_rq utilization to select a frequency.
  5404		 * Let's add the task's estimated utilization to the cfs_rq's
  5405		 * estimated utilization, before we update schedutil.
  5406		 */
  5407		util_est_enqueue(&rq->cfs, p);
  5408	
  5409		/*
  5410		 * If in_iowait is set, the code below may not trigger any cpufreq
  5411		 * utilization updates, so do it here explicitly with the IOWAIT flag
  5412		 * passed.
  5413		 */
  5414		if (p->in_iowait)
  5415			cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
  5416	
  5417		for_each_sched_entity(se) {
  5418			if (se->on_rq)
  5419				break;
  5420			cfs_rq = cfs_rq_of(se);
  5421			enqueue_entity(cfs_rq, se, flags);
  5422	
  5423			/*
  5424			 * end evaluation on encountering a throttled cfs_rq
  5425			 *
  5426			 * note: in the case of encountering a throttled cfs_rq we will
  5427			 * post the final h_nr_running increment below.
  5428			 */
  5429			if (cfs_rq_throttled(cfs_rq))
  5430				break;
  5431			cfs_rq->h_nr_running++;
  5432	
  5433			flags = ENQUEUE_WAKEUP;
  5434		}
  5435	
  5436		for_each_sched_entity(se) {
  5437			cfs_rq = cfs_rq_of(se);
  5438			cfs_rq->h_nr_running++;
  5439	
  5440			if (cfs_rq_throttled(cfs_rq))
  5441				break;
  5442	
  5443			update_load_avg(cfs_rq, se, UPDATE_TG);
  5444			update_cfs_group(se);
  5445		}
  5446	
  5447		if (!se)
  5448			add_nr_running(rq, 1);
  5449	
> 5450		util_est_enqueue_running(p);
  5451	
  5452		hrtick_update(rq);
  5453	}
  5454	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26074 bytes --]


* Re: [PATCH 1/2] sched/fair: pelt: use u32 for util_avg
  2018-06-04 16:05 ` [PATCH 1/2] sched/fair: pelt: use u32 for util_avg Patrick Bellasi
@ 2018-06-05  1:30   ` kbuild test robot
  2018-06-05  1:34   ` kbuild test robot
  1 sibling, 0 replies; 19+ messages in thread
From: kbuild test robot @ 2018-06-05  1:30 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: kbuild-all, linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Joel Fernandes,
	Steve Muckle, Todd Kjos

[-- Attachment #1: Type: text/plain, Size: 875 bytes --]

Hi Patrick,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.17 next-20180604]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Patrick-Bellasi/sched-fair-pelt-use-u32-for-util_avg/20180605-082640
config: i386-defconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   kernel/sched/fair.o: In function `post_init_entity_util_avg':
>> fair.c:(.text+0xa057): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26279 bytes --]


* Re: [PATCH 1/2] sched/fair: pelt: use u32 for util_avg
  2018-06-04 16:05 ` [PATCH 1/2] sched/fair: pelt: use u32 for util_avg Patrick Bellasi
  2018-06-05  1:30   ` kbuild test robot
@ 2018-06-05  1:34   ` kbuild test robot
  1 sibling, 0 replies; 19+ messages in thread
From: kbuild test robot @ 2018-06-05  1:34 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: kbuild-all, linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Joel Fernandes,
	Steve Muckle, Todd Kjos

[-- Attachment #1: Type: text/plain, Size: 3283 bytes --]

Hi Patrick,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.17 next-20180604]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Patrick-Bellasi/sched-fair-pelt-use-u32-for-util_avg/20180605-082640
config: i386-randconfig-s0-201822 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   kernel/sched/fair.o: In function `post_init_entity_util_avg':
>> kernel/sched/fair.c:761: undefined reference to `__udivdi3'

vim +761 kernel/sched/fair.c

   724	
   725	/*
   726	 * With new tasks being created, their initial util_avgs are extrapolated
   727	 * based on the cfs_rq's current util_avg:
   728	 *
   729	 *   util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
   730	 *
   731	 * However, in many cases, the above util_avg does not give a desired
   732	 * value. Moreover, the sum of the util_avgs may be divergent, such
   733	 * as when the series is a harmonic series.
   734	 *
   735	 * To solve this problem, we also cap the util_avg of successive tasks to
   736	 * only 1/2 of the left utilization budget:
   737	 *
   738	 *   util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
   739	 *
   740	 * where n denotes the nth task.
   741	 *
   742	 * For example, a simplest series from the beginning would be like:
   743	 *
   744	 *  task  util_avg: 512, 256, 128,  64,  32,   16,    8, ...
   745	 * cfs_rq util_avg: 512, 768, 896, 960, 992, 1008, 1016, ...
   746	 *
   747	 * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap)
   748	 * if util_avg > util_avg_cap.
   749	 */
   750	void post_init_entity_util_avg(struct sched_entity *se)
   751	{
   752		struct cfs_rq *cfs_rq = cfs_rq_of(se);
   753		long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
   754	
   755		if (cap > 0) {
   756			struct sched_avg *sa = &se->avg;
   757			u64 util_avg = READ_ONCE(sa->util_avg);
   758	
   759			if (cfs_rq->avg.util_avg != 0) {
   760				util_avg  =  cfs_rq->avg.util_avg * se->load.weight;
 > 761				util_avg /= (cfs_rq->avg.load_avg + 1);
   762				if (util_avg > cap)
   763					util_avg = cap;
   764			} else {
   765				util_avg = cap;
   766			}
   767	
   768			WRITE_ONCE(sa->util_avg, util_avg);
   769		}
   770	
   771		if (entity_is_task(se)) {
   772			struct task_struct *p = task_of(se);
   773			if (p->sched_class != &fair_sched_class) {
   774				/*
   775				 * For !fair tasks do:
   776				 *
   777				update_cfs_rq_load_avg(now, cfs_rq);
   778				attach_entity_load_avg(cfs_rq, se, 0);
   779				switched_from_fair(rq, p);
   780				 *
   781				 * such that the next switched_to_fair() has the
   782				 * expected state.
   783				 */
   784				se->avg.last_update_time = cfs_rq_clock_task(cfs_rq);
   785				return;
   786			}
   787		}
   788	
   789		attach_entity_cfs_rq(se);
   790	}
   791	
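
For context: on 32-bit x86 a plain C division with a u64 dividend is
emitted as a call to the libgcc helper __udivdi3, which the kernel does
not provide. A typical way to address this (a sketch of a possible fix,
not necessarily the one adopted later) is the div_u64() helper from
<linux/math64.h>:

	#include <linux/math64.h>

	/* instead of: util_avg /= (cfs_rq->avg.load_avg + 1); */
	util_avg = div_u64(util_avg, cfs_rq->avg.load_avg + 1);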

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28035 bytes --]


* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-04 16:06 ` [PATCH 2/2] sched/fair: util_est: add running_sum tracking Patrick Bellasi
  2018-06-04 17:46   ` Joel Fernandes
  2018-06-05  1:29   ` kbuild test robot
@ 2018-06-05  6:57   ` Vincent Guittot
  2018-06-05 15:11     ` Patrick Bellasi
  2018-06-05 10:46   ` kbuild test robot
  3 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2018-06-05  6:57 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, open list:THERMAL, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Joel Fernandes, Steve Muckle,
	Todd Kjos

On 4 June 2018 at 18:06, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> The estimated utilization of a task is affected by the task being
> preempted, either by another FAIR task of by a task of an higher
> priority class (i.e. RT or DL). Indeed, when a preemption happens, the
> PELT utilization of the preempted task is going to be decayed a bit.
> That's actually correct for utilization, which goal is to measure the
> actual CPU bandwidth consumed by a task.
>
> However, the above behavior does not allow to know exactly what is the
> utilization a task "would have used" if it was running without
> being preempted. Thus, this reduces the effectiveness of util_est for a
> task because it does not always allow to predict how much CPU a task is
> likely to require.
>
> Let's improve the estimated utilization by adding a new "sort-of" PELT
> signal, explicitly only for SE which has the following behavior:
>  a) at each enqueue time of a task, its value is the (already decayed)
>     util_avg of the task being enqueued
>  b) it's updated at each update_load_avg
>  c) it can just increase, whenever the task is actually RUNNING on a
>     CPU, while it's kept stable while the task is RUNNANBLE but not
>     actively consuming CPU bandwidth
>
> Such a defined signal is exactly equivalent to the util_avg for a task
> running alone on a CPU while, in case the task is preempted, it allows
> to know at dequeue time how much would have been the task utilization if
> it was running alone on that CPU.

I don't agree with this statement above.
Let's say that you have 2 periodic tasks which want to run 4ms in every
period of 10ms and wake up at the same time.
One task will execute first and the other will wait, but at the end they
have the same period and running time and, as a result, the same
util_avg, which is exactly what they need.

>
> This new signal is named "running_avg", since it tracks the actual
> RUNNING time of a task by ignoring any form of preemption.

In fact, you still take this preemption into account, as you remove
some time whereas it's only a shift of the execution.

>
> From an implementation standpoint, since the sched_avg should fit into a
> single cache line, we save space by tracking only a new runnable sum:
>    p->se.avg.running_sum
> while the conversion into a running_avg is done on demand whenever we
> need it, which is at task dequeue time when a new util_est sample has to
> be collected.
>
> The conversion from "running_sum" to "running_avg" is done by performing
> a single division by LOAD_AVG_MAX, which introduces a small error since
> in the division we do not consider the (sa->period_contrib - 1024)
> compensation factor used in ___update_load_avg(). However:
>  a) this error is expected to be limited (~2-3%)
>  b) it can be safely ignored since the estimated utilization is the only
>     consumer which is already subject to small estimation errors
>
> The additional corresponding benefit is that, at run-time, we pay the
> cost for a additional sum and multiply, while the more expensive
> division is required only at dequeue time.
>
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Todd Kjos <tkjos@google.com>
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Steve Muckle <smuckle@google.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-pm@vger.kernel.org
> ---
>  include/linux/sched.h |  1 +
>  kernel/sched/fair.c   | 16 ++++++++++++++--
>  2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9d8732dab264..2bd5f1c68da9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -399,6 +399,7 @@ struct sched_avg {
>         u64                             load_sum;
>         u64                             runnable_load_sum;
>         u32                             util_sum;
> +       u32                             running_sum;
>         u32                             period_contrib;
>         unsigned long                   load_avg;
>         unsigned long                   runnable_load_avg;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f74441be3f44..5d54d6a4c31f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
>                 sa->runnable_load_sum =
>                         decay_load(sa->runnable_load_sum, periods);
>                 sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> +               if (running)
> +                       sa->running_sum = decay_load(sa->running_sum, periods);

so you make some time disappear from the equation, as the signal will
not be decayed, and make the period look shorter than it really is.
With the example I mentioned above, the 2nd task will be seen as if
its period became 6ms instead of 10ms, which is not true, and the
utilization of the task is overestimated without any good reason.
Furthermore, this new signal doesn't have a clear meaning because it's
not utilization, but it's also not the waiting time of the task.

>
>                 /*
>                  * Step 2
> @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
>                 sa->load_sum += load * contrib;
>         if (runnable)
>                 sa->runnable_load_sum += runnable * contrib;
> -       if (running)
> +       if (running) {
>                 sa->util_sum += contrib * scale_cpu;
> +               sa->running_sum += contrib * scale_cpu;
> +       }
>
>         return periods;
>  }
> @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
>         WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
>  }
>
> +static inline void util_est_enqueue_running(struct task_struct *p)
> +{
> +       /* Initilize the (non-preempted) utilization */
> +       p->se.avg.running_sum = p->se.avg.util_sum;
> +}
> +
>  /*
>   * Check if a (signed) value is within a specified (unsigned) margin,
>   * based on the observation that:
> @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
>          * Skip update of task's estimated utilization when its EWMA is
>          * already ~1% close to its last activation value.
>          */
> -       ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> +       ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
>         last_ewma_diff = ue.enqueued - ue.ewma;
>         if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
>                 return;
> @@ -5437,6 +5447,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>         if (!se)
>                 add_nr_running(rq, 1);
>
> +       util_est_enqueue_running(p);
> +
>         hrtick_update(rq);
>  }
>
> --
> 2.15.1
>


* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-04 16:06 ` [PATCH 2/2] sched/fair: util_est: add running_sum tracking Patrick Bellasi
                     ` (2 preceding siblings ...)
  2018-06-05  6:57   ` Vincent Guittot
@ 2018-06-05 10:46   ` kbuild test robot
  3 siblings, 0 replies; 19+ messages in thread
From: kbuild test robot @ 2018-06-05 10:46 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: kbuild-all, linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Joel Fernandes,
	Steve Muckle, Todd Kjos

[-- Attachment #1: Type: text/plain, Size: 9599 bytes --]

Hi Patrick,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on next-20180604]
[cannot apply to v4.17]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Patrick-Bellasi/sched-fair-pelt-use-u32-for-util_avg/20180605-082640
config: x86_64-randconfig-x011-201822 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   kernel/sched/fair.c: In function 'enqueue_task_fair':
>> kernel/sched/fair.c:5450:2: error: implicit declaration of function 'util_est_enqueue_running'; did you mean 'util_est_enqueue'? [-Werror=implicit-function-declaration]
     util_est_enqueue_running(p);
     ^~~~~~~~~~~~~~~~~~~~~~~~
     util_est_enqueue
   Cyclomatic Complexity 5 include/linux/compiler.h:__read_once_size
   Cyclomatic Complexity 1 include/linux/kasan-checks.h:kasan_check_write
   Cyclomatic Complexity 1 arch/x86/include/asm/bitops.h:constant_test_bit
   Cyclomatic Complexity 1 arch/x86/include/asm/bitops.h:variable_test_bit
   Cyclomatic Complexity 1 arch/x86/include/asm/bitops.h:fls
   Cyclomatic Complexity 1 include/linux/log2.h:__ilog2_u32
   Cyclomatic Complexity 1 arch/x86/include/asm/current.h:get_current
   Cyclomatic Complexity 1 arch/x86/include/asm/atomic64_64.h:arch_atomic64_add
   Cyclomatic Complexity 1 include/asm-generic/atomic-instrumented.h:atomic64_add
   Cyclomatic Complexity 2 arch/x86/include/asm/jump_label.h:arch_static_branch
   Cyclomatic Complexity 1 include/linux/jump_label.h:static_key_false
   Cyclomatic Complexity 1 include/linux/math64.h:mul_u64_u32_shr
   Cyclomatic Complexity 2 include/linux/thread_info.h:test_ti_thread_flag
   Cyclomatic Complexity 5 arch/x86/include/asm/preempt.h:__preempt_count_add
   Cyclomatic Complexity 1 arch/x86/include/asm/preempt.h:__preempt_count_dec_and_test
   Cyclomatic Complexity 1 include/linux/lockdep.h:lock_is_held
   Cyclomatic Complexity 1 include/linux/rcupdate.h:rcu_lock_acquire
   Cyclomatic Complexity 1 include/linux/rcupdate.h:rcu_lock_release
   Cyclomatic Complexity 1 include/linux/rcupdate.h:rcu_read_lock
   Cyclomatic Complexity 1 include/linux/rcupdate.h:rcu_read_unlock
   Cyclomatic Complexity 1 include/linux/rcupdate.h:rcu_read_lock_sched_notrace
   Cyclomatic Complexity 2 include/linux/rcupdate.h:rcu_read_unlock_sched_notrace
   Cyclomatic Complexity 1 include/linux/rbtree.h:rb_link_node
   Cyclomatic Complexity 1 include/linux/sched.h:task_thread_info
   Cyclomatic Complexity 1 include/linux/sched.h:test_tsk_thread_flag
   Cyclomatic Complexity 1 include/linux/sched.h:test_tsk_need_resched
   Cyclomatic Complexity 1 include/linux/sched.h:task_cpu
   Cyclomatic Complexity 1 include/linux/sched/cputime.h:get_running_cputimer
   Cyclomatic Complexity 2 include/linux/sched/cputime.h:account_group_exec_runtime
   Cyclomatic Complexity 1 include/linux/cgroup.h:task_css_set
   Cyclomatic Complexity 1 include/linux/cgroup.h:task_dfl_cgroup
   Cyclomatic Complexity 3 include/linux/cgroup.h:cgroup_parent
   Cyclomatic Complexity 1 include/linux/cgroup.h:cpuacct_charge
   Cyclomatic Complexity 2 include/linux/cgroup.h:cgroup_account_cputime
   Cyclomatic Complexity 1 kernel/sched/sched.h:cpu_of
   Cyclomatic Complexity 1 kernel/sched/sched.h:assert_clock_updated
   Cyclomatic Complexity 4 kernel/sched/sched.h:rq_clock
   Cyclomatic Complexity 4 kernel/sched/sched.h:rq_clock_task
   Cyclomatic Complexity 4 kernel/sched/sched.h:rq_clock_skip_update
   Cyclomatic Complexity 1 kernel/sched/sched.h:rq_pin_lock
   Cyclomatic Complexity 1 kernel/sched/sched.h:rq_unpin_lock
   Cyclomatic Complexity 1 kernel/sched/sched.h:task_on_rq_queued
   Cyclomatic Complexity 1 kernel/sched/sched.h:put_prev_task
   Cyclomatic Complexity 1 kernel/sched/sched.h:sched_update_tick_dependency
   Cyclomatic Complexity 2 kernel/sched/sched.h:add_nr_running
   Cyclomatic Complexity 1 kernel/sched/sched.h:sub_nr_running
   Cyclomatic Complexity 1 kernel/sched/sched.h:hrtick_enabled
   Cyclomatic Complexity 1 kernel/sched/sched.h:rq_lock
   Cyclomatic Complexity 1 kernel/sched/sched.h:rq_unlock
   Cyclomatic Complexity 2 kernel/sched/sched.h:cpufreq_update_util
   Cyclomatic Complexity 4 include/trace/events/sched.h:trace_sched_stat_runtime
   Cyclomatic Complexity 1 kernel/sched/fair.c:update_load_add
   Cyclomatic Complexity 1 kernel/sched/fair.c:update_load_sub
   Cyclomatic Complexity 1 kernel/sched/fair.c:update_load_set
   Cyclomatic Complexity 2 kernel/sched/fair.c:task_of
   Cyclomatic Complexity 2 kernel/sched/fair.c:rq_of
   Cyclomatic Complexity 1 kernel/sched/fair.c:task_cfs_rq
   Cyclomatic Complexity 1 kernel/sched/fair.c:cfs_rq_of
   Cyclomatic Complexity 1 kernel/sched/fair.c:group_cfs_rq
   Cyclomatic Complexity 1 kernel/sched/fair.c:list_add_leaf_cfs_rq
   Cyclomatic Complexity 1 kernel/sched/fair.c:parent_entity
   Cyclomatic Complexity 1 kernel/sched/fair.c:find_matching_se
   Cyclomatic Complexity 2 kernel/sched/fair.c:max_vruntime
   Cyclomatic Complexity 2 kernel/sched/fair.c:min_vruntime
   Cyclomatic Complexity 1 kernel/sched/fair.c:entity_before
   Cyclomatic Complexity 2 kernel/sched/fair.c:calc_delta_fair
   Cyclomatic Complexity 1 kernel/sched/fair.c:update_tg_load_avg
   Cyclomatic Complexity 1 kernel/sched/fair.c:update_stats_wait_start
   Cyclomatic Complexity 1 kernel/sched/fair.c:update_stats_wait_end
   Cyclomatic Complexity 1 kernel/sched/fair.c:update_stats_enqueue
   Cyclomatic Complexity 1 kernel/sched/fair.c:update_stats_dequeue
   Cyclomatic Complexity 1 kernel/sched/fair.c:update_stats_curr_start
   Cyclomatic Complexity 1 kernel/sched/fair.c:task_tick_numa
   Cyclomatic Complexity 2 kernel/sched/fair.c:account_entity_enqueue
   Cyclomatic Complexity 2 kernel/sched/fair.c:account_entity_dequeue
   Cyclomatic Complexity 1 kernel/sched/fair.c:enqueue_runnable_load_avg
   Cyclomatic Complexity 1 kernel/sched/fair.c:dequeue_runnable_load_avg
   Cyclomatic Complexity 1 kernel/sched/fair.c:enqueue_load_avg
   Cyclomatic Complexity 1 kernel/sched/fair.c:dequeue_load_avg
   Cyclomatic Complexity 1 kernel/sched/fair.c:update_cfs_group
   Cyclomatic Complexity 3 kernel/sched/fair.c:cfs_rq_util_change
   Cyclomatic Complexity 1 kernel/sched/fair.c:update_load_avg
   Cyclomatic Complexity 1 kernel/sched/fair.c:attach_entity_load_avg
   Cyclomatic Complexity 1 kernel/sched/fair.c:detach_entity_load_avg
   Cyclomatic Complexity 1 kernel/sched/fair.c:idle_balance
   Cyclomatic Complexity 1 kernel/sched/fair.c:util_est_enqueue
   Cyclomatic Complexity 1 kernel/sched/fair.c:util_est_dequeue
   Cyclomatic Complexity 1 kernel/sched/fair.c:check_spread
   Cyclomatic Complexity 1 kernel/sched/fair.c:check_schedstat_required
   Cyclomatic Complexity 3 kernel/sched/fair.c:__clear_buddies_last
   Cyclomatic Complexity 3 kernel/sched/fair.c:__clear_buddies_next
   Cyclomatic Complexity 3 kernel/sched/fair.c:__clear_buddies_skip
   Cyclomatic Complexity 4 kernel/sched/fair.c:clear_buddies
   Cyclomatic Complexity 1 kernel/sched/fair.c:account_cfs_rq_runtime
   Cyclomatic Complexity 1 kernel/sched/fair.c:check_cfs_rq_runtime
   Cyclomatic Complexity 1 kernel/sched/fair.c:check_enqueue_throttle
   Cyclomatic Complexity 1 kernel/sched/fair.c:return_cfs_rq_runtime

vim +5450 kernel/sched/fair.c

  5389	
  5390	/*
  5391	 * The enqueue_task method is called before nr_running is
  5392	 * increased. Here we update the fair scheduling stats and
  5393	 * then put the task into the rbtree:
  5394	 */
  5395	static void
  5396	enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
  5397	{
  5398		struct cfs_rq *cfs_rq;
  5399		struct sched_entity *se = &p->se;
  5400	
  5401		/*
  5402		 * The code below (indirectly) updates schedutil which looks at
  5403		 * the cfs_rq utilization to select a frequency.
  5404		 * Let's add the task's estimated utilization to the cfs_rq's
  5405		 * estimated utilization, before we update schedutil.
  5406		 */
  5407		util_est_enqueue(&rq->cfs, p);
  5408	
  5409		/*
  5410		 * If in_iowait is set, the code below may not trigger any cpufreq
  5411		 * utilization updates, so do it here explicitly with the IOWAIT flag
  5412		 * passed.
  5413		 */
  5414		if (p->in_iowait)
  5415			cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
  5416	
  5417		for_each_sched_entity(se) {
  5418			if (se->on_rq)
  5419				break;
  5420			cfs_rq = cfs_rq_of(se);
  5421			enqueue_entity(cfs_rq, se, flags);
  5422	
  5423			/*
  5424			 * end evaluation on encountering a throttled cfs_rq
  5425			 *
  5426			 * note: in the case of encountering a throttled cfs_rq we will
  5427			 * post the final h_nr_running increment below.
  5428			 */
  5429			if (cfs_rq_throttled(cfs_rq))
  5430				break;
  5431			cfs_rq->h_nr_running++;
  5432	
  5433			flags = ENQUEUE_WAKEUP;
  5434		}
  5435	
  5436		for_each_sched_entity(se) {
  5437			cfs_rq = cfs_rq_of(se);
  5438			cfs_rq->h_nr_running++;
  5439	
  5440			if (cfs_rq_throttled(cfs_rq))
  5441				break;
  5442	
  5443			update_load_avg(cfs_rq, se, UPDATE_TG);
  5444			update_cfs_group(se);
  5445		}
  5446	
  5447		if (!se)
  5448			add_nr_running(rq, 1);
  5449	
> 5450		util_est_enqueue_running(p);
  5451	
  5452		hrtick_update(rq);
  5453	}
  5454	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33993 bytes --]


* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-05  6:57   ` Vincent Guittot
@ 2018-06-05 15:11     ` Patrick Bellasi
  2018-06-05 15:31       ` Juri Lelli
  2018-06-06  8:26       ` Vincent Guittot
  0 siblings, 2 replies; 19+ messages in thread
From: Patrick Bellasi @ 2018-06-05 15:11 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, open list:THERMAL, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Joel Fernandes, Steve Muckle,
	Todd Kjos

Hi Vincent,

On 05-Jun 08:57, Vincent Guittot wrote:
> On 4 June 2018 at 18:06, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

[...]

> > Let's improve the estimated utilization by adding a new "sort-of" PELT
> > signal, explicitly only for SE which has the following behavior:
> >  a) at each enqueue time of a task, its value is the (already decayed)
> >     util_avg of the task being enqueued
> >  b) it's updated at each update_load_avg
> >  c) it can just increase, whenever the task is actually RUNNING on a
> >     CPU, while it's kept stable while the task is RUNNANBLE but not
> >     actively consuming CPU bandwidth
> >
> > Such a defined signal is exactly equivalent to the util_avg for a task
> > running alone on a CPU while, in case the task is preempted, it allows
> > to know at dequeue time how much would have been the task utilization if
> > it was running alone on that CPU.
> 
> I don't agree with this statement above
> Let say that you have 2 periodic tasks which wants to run 4ms in every
> period of 10ms and wakes up at the same time.
> One task will execute 1st and the other will wait but at the end they
> have the same period and running time and as a result the same
> util_avg which is exactly what they need.

In your example above you say that both tasks will end up with a 40%
util_avg, and that's exactly the value reported by the new
"running_avg" metric as well.

Both tasks, if they were running alone, would have generated a
40% utilization, and above I'm saying:

        it allows knowing at dequeue time
   how much the task utilization would have been
      if it was running alone on that CPU

I don't see where that is incorrect; maybe I should explain it better?

> > This new signal is named "running_avg", since it tracks the actual
> > RUNNING time of a task by ignoring any form of preemption.
> 
> In fact, you still take into account this preemption as you remove
> some time whereas it's only a shift of the excution

When the task is enqueued we cache the (already decayed) util_avg, and
from this time on the running_avg can only increase. It increases only
for the portion of CPU time the task is running and it's never decayed
while the task is preempted.

In your example above, if we look at the second task running, the one
"delayed", you will have:

 @t1 wakeup time:    running_avg@t1 = util_avg@t1
                        since we initialize it with the
                        "sleep decayed" util_avg

  NOTE: this initialization is the important part, more on that on your
  next comment below related to the period being changed.

 @t2 switch_in time: running_avg@t2 = running_avg@t1
                        since it's not decayed while RUNNABLE but !RUNNING

                     while, meanwhile:

                     util_avg@t2 < util_avg@t1
                        since it's decayed while RUNNABLE but !RUNNING

 @t3 dequeue time:   running_avg@t3 > running_avg@t2


When you say "it's only a shift of the execution" I agree, and indeed
the running_avg is not affected at all.

So, we can say that I'm "accounting for preemption time" but really,
the way I do it is just by _not decaying_ PELT when the task is:

                 RUNNABLE but !RUNNING

and that's why I say that running_avg gives you the same behavior as
util_avg _if_ the task were running alone, because in that case you never
have the condition above, and thus util_avg is never decayed.

[...]

> > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> >                 sa->runnable_load_sum =
> >                         decay_load(sa->runnable_load_sum, periods);
> >                 sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> > +               if (running)
> > +                       sa->running_sum = decay_load(sa->running_sum, periods);
> 
> so you make some time disappearing from the equation as the signal
> will not be decayed and make the period looking shorter than reality.

Since at enqueue time we always initialize running_avg to whatever
util_avg is, I don't think we are messing up the period at all.

The util_avg signal properly accounts for the period.
Thus, the combined effect of:

  - initializing running_avg at enqueue time with the value of
    util_avg, already decayed to properly account for the task period
  - not decaying running_avg when the task is RUNNABLE but !RUNNING

should just result in "virtually" considering the two tasks of your
example "as if" they were running concurrently on two different CPUs.

Isn't that the case?

> With the example I mentioned above, the 2nd task will be seen as if
> its period becomes 6ms instead of 10ms which is not true and the
> utilization of the task is overestimated without any good reason

I don't see that overestimation... and I cannot even measure it.

If I run an experiment with your example above, while using the
performance governor to rule out any possible scale invariance
difference, here is what I measure:

   Task1 (40ms delayed by the following Task2):
                               mean          std     max
      running_avg        455.387449    22.940168   492.0
      util_avg           433.233288    17.395477   458.0

   Task2 (waking up at same time of Task1 and running before):
                               mean          std     max
      running_avg        430.281834    22.405175   455.0
      util_avg           421.745331    22.098873   456.0

and if I compare Task1 above with another experiment where Task1 is
running alone:

   Task1 (running alone):
                               mean          std     min
      running_avg        460.257895    22.103704   460.0
      util_avg           435.119737    17.647556   461.0


Thus, there are some variations, but I think they are mostly due to
the rounding errors we have.
Task1 is still seen as having an expected utilization of ~455, which is
not the ~683 (i.e. 4/6 of the 1024 scale) it would be if it were
accounted as a task running 4ms every 6ms, as you suggest above.

> Furthermore, this new signal doesn't have clear meaning because it's
> not utilization but it's also not the waiting time of the task.

The meaning is: the utilization of the task _if_ it were running
alone on that CPU. The numbers above show exactly that, since:
    455 (Task1 delayed) ~= 460 (Task1 running alone)

If the task is running alone it is exactly util_avg (minus rounding
errors), but if it's preempted it is a better representation of
the task's required CPU bandwidth.

Unless we really need to track the "waiting time of a task", I would
say that the signal above should make util_est much more accurate at
knowing, at wakeup time, how much CPU a task will likely need...
independently of the preemption sources.

Here are some other experiments / workloads I'm testing:

   https://gist.github.com/derkling/fe01e5ddf639e18b5d85a3b1d2e84b7d

> >
> >                 /*
> >                  * Step 2
> > @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> >                 sa->load_sum += load * contrib;
> >         if (runnable)
> >                 sa->runnable_load_sum += runnable * contrib;
> > -       if (running)
> > +       if (running) {
> >                 sa->util_sum += contrib * scale_cpu;
> > +               sa->running_sum += contrib * scale_cpu;
> > +       }
> >
> >         return periods;
> >  }
> > @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> >         WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> >  }
> >
> > +static inline void util_est_enqueue_running(struct task_struct *p)
> > +{
> > +       /* Initilize the (non-preempted) utilization */
> > +       p->se.avg.running_sum = p->se.avg.util_sum;

This line above is what should ensure we do not mess up the task
period.

> > +}
> > +
> >  /*
> >   * Check if a (signed) value is within a specified (unsigned) margin,
> >   * based on the observation that:
> > @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> >          * Skip update of task's estimated utilization when its EWMA is
> >          * already ~1% close to its last activation value.
> >          */
> > -       ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> > +       ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
> >         last_ewma_diff = ue.enqueued - ue.ewma;
> >         if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> >                 return;
> > @@ -5437,6 +5447,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >         if (!se)
> >                 add_nr_running(rq, 1);
> >
> > +       util_est_enqueue_running(p);
> > +
> >         hrtick_update(rq);
> >  }
> >
> > --
> > 2.15.1
> >

-- 
#include <best/regards.h>

Patrick Bellasi


* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-04 17:46   ` Joel Fernandes
@ 2018-06-05 15:21     ` Patrick Bellasi
  2018-06-05 19:33       ` Joel Fernandes
  0 siblings, 1 reply; 19+ messages in thread
From: Patrick Bellasi @ 2018-06-05 15:21 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Joel Fernandes,
	Steve Muckle, Todd Kjos

On 04-Jun 10:46, Joel Fernandes wrote:
> Hi Patrick,
> 
> On Mon, Jun 04, 2018 at 05:06:00PM +0100, Patrick Bellasi wrote:
> > The estimated utilization of a task is affected by the task being
> > preempted, either by another FAIR task of by a task of an higher
> > priority class (i.e. RT or DL). Indeed, when a preemption happens, the
> > PELT utilization of the preempted task is going to be decayed a bit.
> > That's actually correct for utilization, which goal is to measure the
> > actual CPU bandwidth consumed by a task.
> > 
> > However, the above behavior does not allow to know exactly what is the
> > utilization a task "would have used" if it was running without
> > being preempted. Thus, this reduces the effectiveness of util_est for a
> > task because it does not always allow to predict how much CPU a task is
> > likely to require.
> > 
> > Let's improve the estimated utilization by adding a new "sort-of" PELT
> > signal, explicitly only for SE which has the following behavior:
> >  a) at each enqueue time of a task, its value is the (already decayed)
> >     util_avg of the task being enqueued
> >  b) it's updated at each update_load_avg
> >  c) it can just increase, whenever the task is actually RUNNING on a
> >     CPU, while it's kept stable while the task is RUNNANBLE but not
> >     actively consuming CPU bandwidth
> > 
> > Such a defined signal is exactly equivalent to the util_avg for a task
> > running alone on a CPU while, in case the task is preempted, it allows
> > to know at dequeue time how much would have been the task utilization if
> > it was running alone on that CPU.
> > 
> > This new signal is named "running_avg", since it tracks the actual
> > RUNNING time of a task by ignoring any form of preemption.
> > 
> > From an implementation standpoint, since the sched_avg should fit into a
> > single cache line, we save space by tracking only a new runnable sum:
> >    p->se.avg.running_sum
> > while the conversion into a running_avg is done on demand whenever we
> > need it, which is at task dequeue time when a new util_est sample has to
> > be collected.
> > 
> > The conversion from "running_sum" to "running_avg" is done by performing
> > a single division by LOAD_AVG_MAX, which introduces a small error since
> > in the division we do not consider the (sa->period_contrib - 1024)
> > compensation factor used in ___update_load_avg(). However:
> >  a) this error is expected to be limited (~2-3%)
> >  b) it can be safely ignored since the estimated utilization is the only
> >     consumer which is already subject to small estimation errors
> > 
> > The additional corresponding benefit is that, at run-time, we pay the
> > cost for a additional sum and multiply, while the more expensive
> > division is required only at dequeue time.
> > 
> > Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Todd Kjos <tkjos@google.com>
> > Cc: Joel Fernandes <joelaf@google.com>
> > Cc: Steve Muckle <smuckle@google.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> > Cc: linux-kernel@vger.kernel.org
> > Cc: linux-pm@vger.kernel.org
> > ---
> >  include/linux/sched.h |  1 +
> >  kernel/sched/fair.c   | 16 ++++++++++++++--
> >  2 files changed, 15 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 9d8732dab264..2bd5f1c68da9 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -399,6 +399,7 @@ struct sched_avg {
> >  	u64				load_sum;
> >  	u64				runnable_load_sum;
> >  	u32				util_sum;
> > +	u32				running_sum;
> >  	u32				period_contrib;
> >  	unsigned long			load_avg;
> >  	unsigned long			runnable_load_avg;
> 
> Should update the documentation comments above the struct too?
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f74441be3f44..5d54d6a4c31f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> >  		sa->runnable_load_sum =
> >  			decay_load(sa->runnable_load_sum, periods);
> >  		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> > +		if (running)
> > +			sa->running_sum = decay_load(sa->running_sum, periods);
> >  
> >  		/*
> >  		 * Step 2
> > @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> >  		sa->load_sum += load * contrib;
> >  	if (runnable)
> >  		sa->runnable_load_sum += runnable * contrib;
> > -	if (running)
> > +	if (running) {
> >  		sa->util_sum += contrib * scale_cpu;
> > +		sa->running_sum += contrib * scale_cpu;
> > +	}
> >  
> >  	return periods;
> >  }
> > @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> >  	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> >  }
> 
> PELT changes look nice and makes sense :)

That's not strictly speaking a PELT change... the idea is still to work
"on top of PELT" to make it more effective at measuring a task's
expected CPU bandwidth requirement.

> > +static inline void util_est_enqueue_running(struct task_struct *p)
> > +{
> > +	/* Initilize the (non-preempted) utilization */
> > +	p->se.avg.running_sum = p->se.avg.util_sum;
> > +}
> > +
> >  /*
> >   * Check if a (signed) value is within a specified (unsigned) margin,
> >   * based on the observation that:
> > @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> >  	 * Skip update of task's estimated utilization when its EWMA is
> >  	 * already ~1% close to its last activation value.
> >  	 */
> > -	ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> > +	ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
> 
> I guess we are doing extra division here which adds some cost. Does
> performance look Ok with the change?

This extra division is there and done only at dequeue time instead of
doing it at each update_load_avg.

To be more precise, at each ___update_load_avg we should really update
running_avg by:

   u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
   sa->running_avg = sa->running_sum / divider;

but, this would imply tracking an additional signal in sched_avg and
doing an additional division at ___update_load_avg() time.

Morten suggested that, if we accept the rounding errors due to
considering

      divider ~= LOAD_AVG_MAX

thus discarding the (sa->period_contrib - 1024) correction, then we
can completely skip the tracking of running_avg (thus saving space in
sched_avg) and approximate it at dequeue time as per the code line,
just to compute the new util_est sample to accumulate.
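
To give a rough feeling for the size of that error, here is a trivial
userspace sketch (not kernel code: the LOAD_AVG_MAX value and the
sample running_sum below are assumptions made just for illustration):

/*
 * Standalone sketch of the rounding error introduced by dividing
 * running_sum by LOAD_AVG_MAX instead of by the exact divider
 * (LOAD_AVG_MAX - 1024 + period_contrib).
 */
#include <stdio.h>

#define LOAD_AVG_MAX	47742ULL

int main(void)
{
	unsigned long long running_sum = 20000000ULL;	/* arbitrary sample */
	unsigned int contrib;

	/* check the two extreme values of period_contrib: 0 and 1023 */
	for (contrib = 0; contrib <= 1023; contrib += 1023) {
		unsigned long long divider = LOAD_AVG_MAX - 1024 + contrib;
		unsigned long long exact  = running_sum / divider;
		unsigned long long approx = running_sum / LOAD_AVG_MAX;

		printf("period_contrib=%4u exact=%llu approx=%llu err=%.2f%%\n",
		       contrib, exact, approx,
		       100.0 * (double)(exact - approx) / (double)exact);
	}
	return 0;
}

Since the exact divider is always smaller than LOAD_AVG_MAX, the
approximation always rounds the estimate slightly down, with a worst
case of about 1024/47742 ~= 2%.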

Does that make sense now?

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-05 15:11     ` Patrick Bellasi
@ 2018-06-05 15:31       ` Juri Lelli
  2018-06-05 16:54         ` Patrick Bellasi
  2018-06-06  8:26       ` Vincent Guittot
  1 sibling, 1 reply; 19+ messages in thread
From: Juri Lelli @ 2018-06-05 15:31 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Vincent Guittot, linux-kernel, open list:THERMAL, Ingo Molnar,
	Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Dietmar Eggemann, Morten Rasmussen, Joel Fernandes, Steve Muckle,
	Todd Kjos

On 05/06/18 16:11, Patrick Bellasi wrote:

[...]

> If I run an experiment with your example above, while using the
> performance governor to rule out any possible scale invariance
> difference, here is what I measure:
> 
>    Task1 (40ms delayed by the following Task2):
>                                mean          std     max
>       running_avg        455.387449    22.940168   492.0
>       util_avg           433.233288    17.395477   458.0
> 
>    Task2 (waking up at same time of Task1 and running before):
>                                mean          std     max
>       running_avg        430.281834    22.405175   455.0
>       util_avg           421.745331    22.098873   456.0
> 
> and if I compare Task1 above with another experiment where Task1 is
> running alone:
> 
>    Task1 (running alone):
>                                mean          std     min
>       running_avg        460.257895    22.103704   460.0
>       util_avg           435.119737    17.647556   461.0

Wait, why again in this last case running_avg != util_avg? :)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-05 15:31       ` Juri Lelli
@ 2018-06-05 16:54         ` Patrick Bellasi
  2018-06-05 20:46           ` Joel Fernandes
  0 siblings, 1 reply; 19+ messages in thread
From: Patrick Bellasi @ 2018-06-05 16:54 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Vincent Guittot, linux-kernel, open list:THERMAL, Ingo Molnar,
	Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Dietmar Eggemann, Morten Rasmussen, Joel Fernandes, Steve Muckle,
	Todd Kjos

On 05-Jun 17:31, Juri Lelli wrote:
> On 05/06/18 16:11, Patrick Bellasi wrote:
> 
> [...]
> 
> > If I run an experiment with your example above, while using the
> > performance governor to rule out any possible scale invariance
> > difference, here is what I measure:
> > 
> >    Task1 (40ms delayed by the following Task2):
> >                                mean          std     max
> >       running_avg        455.387449    22.940168   492.0
> >       util_avg           433.233288    17.395477   458.0
> > 
> >    Task2 (waking up at same time of Task1 and running before):
> >                                mean          std     max
> >       running_avg        430.281834    22.405175   455.0
> >       util_avg           421.745331    22.098873   456.0
> > 
> > and if I compare Task1 above with another experiment where Task1 is
> > running alone:
> > 
> >    Task1 (running alone):
> >                                mean          std     min
> >       running_avg        460.257895    22.103704   460.0
> >       util_avg           435.119737    17.647556   461.0
> 
> Wait, why again in this last case running_avg != util_avg? :)

I _think_ it's mostly due to the rounding errors we have because of the
reasons I've explained in the reply to Joel:

   https://lkml.org/lkml/2018/6/5/559
   20180605152156.GD32302@e110439-lin

at the end, while commenting about the division overhead.

I should try the above examples while tracking the full signal at
___update_load_avg() time.

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-05 15:21     ` Patrick Bellasi
@ 2018-06-05 19:33       ` Joel Fernandes
  2018-06-05 19:43         ` Joel Fernandes
  0 siblings, 1 reply; 19+ messages in thread
From: Joel Fernandes @ 2018-06-05 19:33 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Joel Fernandes,
	Steve Muckle, Todd Kjos

On Tue, Jun 05, 2018 at 04:21:56PM +0100, Patrick Bellasi wrote:
[..]
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index f74441be3f44..5d54d6a4c31f 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > >  		sa->runnable_load_sum =
> > >  			decay_load(sa->runnable_load_sum, periods);
> > >  		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> > > +		if (running)
> > > +			sa->running_sum = decay_load(sa->running_sum, periods);
> > >  
> > >  		/*
> > >  		 * Step 2
> > > @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > >  		sa->load_sum += load * contrib;
> > >  	if (runnable)
> > >  		sa->runnable_load_sum += runnable * contrib;
> > > -	if (running)
> > > +	if (running) {
> > >  		sa->util_sum += contrib * scale_cpu;
> > > +		sa->running_sum += contrib * scale_cpu;
> > > +	}
> > >  
> > >  	return periods;
> > >  }
> > > @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > >  	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > >  }
> > 
> > PELT changes look nice and makes sense :)
> 
> That's not strictly speaking a PELT change... the idea is still to work
> "on top of PELT" to make it more effective at measuring a task's
> expected CPU bandwidth requirement.

I meant "PELT change" as in change to the code that calculates PELT signals..

> > > +static inline void util_est_enqueue_running(struct task_struct *p)
> > > +{
> > > +	/* Initilize the (non-preempted) utilization */
> > > +	p->se.avg.running_sum = p->se.avg.util_sum;
> > > +}
> > > +
> > >  /*
> > >   * Check if a (signed) value is within a specified (unsigned) margin,
> > >   * based on the observation that:
> > > @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> > >  	 * Skip update of task's estimated utilization when its EWMA is
> > >  	 * already ~1% close to its last activation value.
> > >  	 */
> > > -	ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> > > +	ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
> > 
> > I guess we are doing extra division here which adds some cost. Does
> > performance look Ok with the change?
> 
> This extra division is there and done only at dequeue time instead of
> doing it at each update_load_avg.

I know. :)

> To be more precise, at each ___update_load_avg we should really update
> running_avg by:
> 
>    u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
>    sa->running_avg = sa->running_sum / divider;
> 
> but, this would imply tracking an additional signal in sched_avg and
> doing an additional division at ___update_load_avg() time.
> 
> Morten suggested that, if we accept the rounding errors due to
> considering
> 
>       divider ~= LOAD_AVG_MAX
> 
> thus discarding the (sa->period_contrib - 1024) correction, then we
> can completely skip the tracking of running_avg (thus saving space in
> sched_avg) and approximate it at dequeue time as per the code line,
> just to compute the new util_est sample to accumulate.
> 
> Does that make sense now?

The patch always made sense to me... I was just pointing out the extra
division this patch adds. I agree that since it's done on dequeue only, it's
probably OK to do.

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-05 19:33       ` Joel Fernandes
@ 2018-06-05 19:43         ` Joel Fernandes
  0 siblings, 0 replies; 19+ messages in thread
From: Joel Fernandes @ 2018-06-05 19:43 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Joel Fernandes,
	Steve Muckle, Todd Kjos

On Tue, Jun 05, 2018 at 12:33:17PM -0700, Joel Fernandes wrote:
> On Tue, Jun 05, 2018 at 04:21:56PM +0100, Patrick Bellasi wrote:
[..]
> > To be more precise, at each ___update_load_avg we should really update
> > running_avg by:
> > 
> >    u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
> >    sa->running_avg = sa->running_sum / divider;
> > 
> > but, this would imply tracking an additional signal in sched_avg and
> > doing an additional division at ___update_load_avg() time.
> > 
> > Morten suggested that, if we accept the rounding errors due to
> > considering
> > 
> >       divider ~= LOAD_AVG_MAX
> > 
> > thus discarding the (sa->period_contrib - 1024) correction, then we
> > can completely skip the tracking of running_avg (thus saving space in
> > sched_avg) and approximate it at dequeue time as per the code line,
> > just to compute the new util_est sample to accumulate.
> > 
> > Does that make sense now?
> 
> The patch always made sense to me... I was just pointing out the extra
> division this patch adds. I agree that since it's done on dequeue only, it's
> probably OK to do.
> 

One thing to note about this error is that, as I remember, not compensating
for it would make the utilization reduce... which is kind of weird. But yeah,
if we can find a way to compensate for the error and also keep the overhead
low, that's super. I know this is probably low priority considering the other
concerns Vincent brought up, which are being discussed in the other thread...
but just saying. :)

 - Joel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-05 16:54         ` Patrick Bellasi
@ 2018-06-05 20:46           ` Joel Fernandes
  2018-06-05 23:15             ` Saravana Kannan
  0 siblings, 1 reply; 19+ messages in thread
From: Joel Fernandes @ 2018-06-05 20:46 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Juri Lelli, Vincent Guittot, linux-kernel, open list:THERMAL,
	Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Dietmar Eggemann, Morten Rasmussen, connoro, Joel Fernandes,
	Steve Muckle, Todd Kjos

On Tue, Jun 05, 2018 at 05:54:31PM +0100, Patrick Bellasi wrote:
> On 05-Jun 17:31, Juri Lelli wrote:
> > On 05/06/18 16:11, Patrick Bellasi wrote:
> > 
> > [...]
> > 
> > > If I run an experiment with your example above, while using the
> > > performance governor to rule out any possible scale invariance
> > > difference, here is what I measure:
> > > 
> > >    Task1 (40ms delayed by the following Task2):
> > >                                mean          std     max
> > >       running_avg        455.387449    22.940168   492.0
> > >       util_avg           433.233288    17.395477   458.0
> > > 
> > >    Task2 (waking up at same time of Task1 and running before):
> > >                                mean          std     max
> > >       running_avg        430.281834    22.405175   455.0
> > >       util_avg           421.745331    22.098873   456.0
> > > 
> > > and if I compare Task1 above with another experiment where Task1 is
> > > running alone:
> > > 
> > >    Task1 (running alone):
> > >                                mean          std     min
> > >       running_avg        460.257895    22.103704   460.0
> > >       util_avg           435.119737    17.647556   461.0
> > 
> > Wait, why again in this last case running_avg != util_avg? :)
> 
> I _think_ it's mostly due to the rounding errors we have because of the
> reasons I've explained in the reply to Joel:
> 
>    https://lkml.org/lkml/2018/6/5/559
>    20180605152156.GD32302@e110439-lin
> 
> at the end, while commenting about the division overhead.
> 
> I should try the above examples while tracking the full signal at
> ___update_load_avg() time.

Is that the only issue? I think if a CFS task is blocked by another CFS task
due to preemption, then with your patch we would account the CFS blocked time
as well into the blocked task's running utilization, which seems incorrect.
Or did I miss something?

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-05 20:46           ` Joel Fernandes
@ 2018-06-05 23:15             ` Saravana Kannan
  0 siblings, 0 replies; 19+ messages in thread
From: Saravana Kannan @ 2018-06-05 23:15 UTC (permalink / raw)
  To: Joel Fernandes, Patrick Bellasi
  Cc: Juri Lelli, Vincent Guittot, linux-kernel, open list:THERMAL,
	Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Dietmar Eggemann, Morten Rasmussen, connoro, Joel Fernandes,
	Steve Muckle, Todd Kjos

On 06/05/2018 01:46 PM, Joel Fernandes wrote:
> On Tue, Jun 05, 2018 at 05:54:31PM +0100, Patrick Bellasi wrote:
>> On 05-Jun 17:31, Juri Lelli wrote:
>>> On 05/06/18 16:11, Patrick Bellasi wrote:
>>>
>>> [...]
>>>
>>>> If I run an experiment with your example above, while using the
>>>> performance governor to rule out any possible scale invariance
>>>> difference, here is what I measure:
>>>>
>>>>     Task1 (40ms delayed by the following Task2):
>>>>                                 mean          std     max
>>>>        running_avg        455.387449    22.940168   492.0
>>>>        util_avg           433.233288    17.395477   458.0
>>>>
>>>>     Task2 (waking up at same time of Task1 and running before):
>>>>                                 mean          std     max
>>>>        running_avg        430.281834    22.405175   455.0
>>>>        util_avg           421.745331    22.098873   456.0
>>>>
>>>> and if I compare Task1 above with another experiment where Task1 is
>>>> running alone:
>>>>
>>>>     Task1 (running alone):
>>>>                                 mean          std     min
>>>>        running_avg        460.257895    22.103704   460.0
>>>>        util_avg           435.119737    17.647556   461.0
>>> Wait, why again in this last case running_avg != util_avg? :)
>> I _think_ it's mostly due to the rounding errors we have because of the
>> reasons I've explained in the reply to Joel:
>>
>>     https://lkml.org/lkml/2018/6/5/559
>>     20180605152156.GD32302@e110439-lin
>>
>> at the end, while commenting about the division overhead.
>>
>> I should try the above examples while tracking the full signal at
>> ___update_load_avg() time.
> Is that the only issue? I think if a CFS task is blocked by another CFS task
> due to preemption, then with your patch we would account the CFS blocked time
> as well into the blocked task's running utilization, which seems incorrect.
> Or did I miss something?
This is my concern too. This will negatively affect any task packing 
because more tasks are going to be runnable but not running, and that 
is going to increase the overall frequency (I'm assuming you want to use 
this for frequency guidance eventually?).

-Saravana

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-05 15:11     ` Patrick Bellasi
  2018-06-05 15:31       ` Juri Lelli
@ 2018-06-06  8:26       ` Vincent Guittot
  2018-06-06 10:38         ` Patrick Bellasi
  1 sibling, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2018-06-06  8:26 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, open list:THERMAL, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Joel Fernandes, Steve Muckle,
	Todd Kjos

Le Tuesday 05 Jun 2018 à 16:11:29 (+0100), Patrick Bellasi a écrit :
> Hi Vincent,
> 
> On 05-Jun 08:57, Vincent Guittot wrote:
> > On 4 June 2018 at 18:06, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> 
> [...]
> 
> > > Let's improve the estimated utilization by adding a new "sort-of" PELT
> > > signal, explicitly only for SE which has the following behavior:
> > >  a) at each enqueue time of a task, its value is the (already decayed)
> > >     util_avg of the task being enqueued
> > >  b) it's updated at each update_load_avg
> > >  c) it can just increase, whenever the task is actually RUNNING on a
> > >     CPU, while it's kept stable while the task is RUNNANBLE but not
> > >     actively consuming CPU bandwidth
> > >
> > > Such a defined signal is exactly equivalent to the util_avg for a task
> > > running alone on a CPU while, in case the task is preempted, it allows
> > > to know at dequeue time how much would have been the task utilization if
> > > it was running alone on that CPU.
> > 
> > I don't agree with this statement above
> > Let say that you have 2 periodic tasks which wants to run 4ms in every
> > period of 10ms and wakes up at the same time.
> > One task will execute 1st and the other will wait but at the end they
> > have the same period and running time and as a result the same
> > util_avg which is exactly what they need.
> 
> In your example above you say that both tasks will end up with a 40%
> util_avg, and that's exactly also the value reported by the new
> "running_avg" metric.

It's not really possible, because the util_avg of the 2 tasks already gives
exactly the right value. So running_avg, which removes some decay for the 2nd
task, can't give the same results.

I add more details below

>
> 
> Both tasks, if they where running alone, they would have generated a
> 40% utilization, and above I'm saying:
> 
>         it allows to know at dequeue time
>    how much would have been the task utilization
>       if it it was running alone on that CPU
> 
> Don't see where is not correct, maybe I should explain it better?
> 
> > > This new signal is named "running_avg", since it tracks the actual
> > > RUNNING time of a task by ignoring any form of preemption.
> > 
> > In fact, you still take into account this preemption as you remove
> > some time whereas it's only a shift of the excution
> 
> When the task is enqueued we cache the (already decayed) util_avg, and
> from this time on the running_avg can only increase. It increases only
> for the portion of CPU time the task is running and it's never decayed
> while the task is preempted.
> 
> In your example above, if we look at the second task running, the one
> "delayed", you will have:
> 
>  @t1 wakeup time:    running_avg@t1 = util_avg@t1
>                         since we initialize it with the
>                         "sleep decayed" util_avg
> 
>   NOTE: this initialization is the important part, more on that on your
>   next comment below related to the period being changed.
> 
>  @t2 switch_in time: running_avg@t2 = running_avg@t1
>                         since it's not decayed while RUNNUBLE but !RUNNING
> 
>                      while, meanwhile:
> 
>                      util_avg@t2 < util_avg@t1
>                         since it's decayed while RUNNABLE but !RUNNING
> 
>  @t3 dequeue time:   running_avg@t3 > running_avg@t2
> 
> 
> When you say "it's only a shift of the execution" I agree, and indeed
> the running_avg is not affected at all.
> 
> So, we can say that I'm "accounting for preemption time" but really,
> the way I do it, is just by _not decaying_ PELT when the task is:
> 
>                  RUNNABLE but !RUNNING
> 
> and that's why I say that running_avg gives you the same behavior of
> util_avg _if_ the task was running alone, because in that case you never
> have the condition above, and thus util_avg is never decayed.
>

For the 2 tasks of the example above we have the pattern:

Task 1
state       RRRRSSSSSSERRRRSSSSSSERRRRSSSSSS
util_avg    AAAADDDDDD AAAADDDDDD AAAADDDDDD

Task 2
state       WWWWRRRRSSEWWWWRRRRSSEWWWWRRRRSS
util_avg    DDDDAAAADD DDDDAAAADD DDDDAAAADD
running_avg     AAAADDC    AAAADDC    AAAADD

R : Running 1ms, S: Sleep 1ms , W: Wait 1ms, E: Enqueue event 
A: Accumulate 1ms, D: Decay 1ms, C: Copy util_avg value

the util_avg of T1 and T2 have the same pattern which is:
  AAAADDDDDDAAAADDDDDDAAAADDDDDD
and as a result, the same value which represents their utilization

For the running_avg of T2, there are only 2 decays after the last running
phase and before the next enqueue,
so the pattern will be AAAADDAAAA
instead of the         AAAADDDDDDAAAA

the running_avg will report a higher value than reality.
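
If it helps, here is a rough userspace model of this pattern (all of it
an approximation, not kernel code: 1ms steps, y = 0.5^(1/32), no
period_contrib handling, and running_sum copied from util_sum at each
enqueue as in your patch):

/*
 * Toy model of Task2 above: 4ms preempted (W), 4ms running (R), 2ms
 * sleeping (S) in a 10ms period. util_sum decays every ms, while
 * running_sum skips the decay while the task is waiting and is copied
 * from util_sum at each enqueue.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);		/* PELT decay per 1ms */
	const double scale = 1024.0;
	const double max_sum = scale / (1.0 - y);	/* roughly LOAD_AVG_MAX */
	double util_sum = 0.0, running_sum = 0.0;
	double util_dq = 0.0, running_dq = 0.0;
	int ms;

	for (ms = 0; ms < 1000; ms++) {
		int phase = ms % 10;
		int waiting = phase < 4;		/* W: preempted by Task1 */
		int running = phase >= 4 && phase < 8;	/* R: actually running */

		if (phase == 0)				/* E + C: enqueue, copy util_sum */
			running_sum = util_sum;

		util_sum *= y;				/* decays every ms */
		if (!waiting)
			running_sum *= y;		/* no decay while waiting */
		if (running) {
			util_sum += scale;
			running_sum += scale;
			if (phase == 7) {		/* sample at dequeue time */
				util_dq = util_sum;
				running_dq = running_sum;
			}
		}
	}
	printf("util_avg at dequeue    ~ %.0f\n", util_dq / max_sum * scale);
	printf("running_avg at dequeue ~ %.0f\n", running_dq / max_sum * scale);
	return 0;
}

With these numbers running_avg at dequeue settles several percent above
util_avg, and the gap grows quickly with longer periods such as
40ms/100ms.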

>
> 
> [...]
> 
> > > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > >                 sa->runnable_load_sum =
> > >                         decay_load(sa->runnable_load_sum, periods);
> > >                 sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> > > +               if (running)
> > > +                       sa->running_sum = decay_load(sa->running_sum, periods);
> > 
> > so you make some time disappearing from the equation as the signal
> > will not be decayed and make the period looking shorter than reality.
> 
> Since at enqueue time we always initialize running_avg to whatever is
> util_avg, I don't think we are messing up with the period at all.
> 
> The util_avg signal properly accounts for the period.
> Thus, the combined effect of:
> 
>   - initializing running_avg at enqueue time with the value of
>     util_avg, already decayed to properly account for the task period
>   - not decaying running_avg when the task is RUNNABLE but !RUNNING
> 
> should just result into "virtually" considering the two tasks of your
> example "as if" they was running concurrently on two different CPUs.
> 
> Isn't it?

No, because part of the "sleep" phase happens during the waiting time and
you don't take this sleep time into account.

see my explanation above

> 
> > With the example I mentioned above, the 2nd task will be seen as if
> > its period becomes 6ms instead of 10ms which is not true and the
> > utilization of the task is overestimated without any good reason
> 
> I don't see that overestimation... and I cannot even measure it.
> 
> If I run an experiment with your example above, while using the
> performance governor to rule out any possible scale invariance
> difference, here is what I measure:
> 
>    Task1 (40ms delayed by the following Task2):

Is the 40ms above a typo? You mean 4ms, don't you?

>                                mean          std     max
>       running_avg        455.387449    22.940168   492.0

The 492 is the overestimation generated by running_avg and not a rounding
effect. With 4ms and 10ms the number of missing decay steps is small, but if
you use 40ms/100ms instead you will see a greater difference in the max number.

>       util_avg           433.233288    17.395477   458.0
> 
>    Task2 (waking up at same time of Task1 and running before):
>                                mean          std     max
>       running_avg        430.281834    22.405175   455.0
>       util_avg           421.745331    22.098873   456.0
> 
> and if I compare Task1 above with another experiment where Task1 is
> running alone:
> 
>    Task1 (running alone):
>                                mean          std     min
>       running_avg        460.257895    22.103704   460.0
>       util_avg           435.119737    17.647556   461.0
> 
> 
> Thus, there are some variations ok, but I think they are more due to
> the rounding errors we have.
> Task1 is still seen as having a ~455 expected utilization, which is not
> the 4/6 ~= 660 you say above it should show if it were accounted as a
> task running 4ms every 6ms.
> 
> > Furthermore, this new signal doesn't have clear meaning because it's
> > not utilization but it's also not the waiting time of the task.
> 
> The meaning is: the utilization of the task _if_ it were running
> alone on that CPU. The numbers above show that, since
>     455 (Task1 delayed) ~= 460 (Task1 running alone)

You should look at the max and not the mean; that's what util_est is
interested in, IIUC.

Vincent

> 
> If it's running alone it would be exactly util_avg (minus rounding
> errors), but if it's preempted it will be a better representation of
> the task's needed CPU bandwidth.
> 
> Unless we really need to track the "waiting time of a task", I would
> say that the signal above should make util_est much more accurate in
> predicting, at wakeup time, how much CPU a task will likely need...
> independently of preemption sources.
> 
> Here are some other experiments / workloads I'm testing:
> 
>    https://gist.github.com/derkling/fe01e5ddf639e18b5d85a3b1d2e84b7d
> 
> > >
> > >                 /*
> > >                  * Step 2
> > > @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > >                 sa->load_sum += load * contrib;
> > >         if (runnable)
> > >                 sa->runnable_load_sum += runnable * contrib;
> > > -       if (running)
> > > +       if (running) {
> > >                 sa->util_sum += contrib * scale_cpu;
> > > +               sa->running_sum += contrib * scale_cpu;
> > > +       }
> > >
> > >         return periods;
> > >  }
> > > @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > >         WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > >  }
> > >
> > > +static inline void util_est_enqueue_running(struct task_struct *p)
> > > +{
> > > +       /* Initilize the (non-preempted) utilization */
> > > +       p->se.avg.running_sum = p->se.avg.util_sum;
> 
> This line above is what should ensure we do not mess up with the task
> period.
> 
> > > +}
> > > +
> > >  /*
> > >   * Check if a (signed) value is within a specified (unsigned) margin,
> > >   * based on the observation that:
> > > @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> > >          * Skip update of task's estimated utilization when its EWMA is
> > >          * already ~1% close to its last activation value.
> > >          */
> > > -       ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> > > +       ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
> > >         last_ewma_diff = ue.enqueued - ue.ewma;
> > >         if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> > >                 return;
> > > @@ -5437,6 +5447,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >         if (!se)
> > >                 add_nr_running(rq, 1);
> > >
> > > +       util_est_enqueue_running(p);
> > > +
> > >         hrtick_update(rq);
> > >  }
> > >
> > > --
> > > 2.15.1
> > >
> 
> -- 
> #include <best/regards.h>
> 
> Patrick Bellasi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
  2018-06-06  8:26       ` Vincent Guittot
@ 2018-06-06 10:38         ` Patrick Bellasi
  0 siblings, 0 replies; 19+ messages in thread
From: Patrick Bellasi @ 2018-06-06 10:38 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, open list:THERMAL, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Joel Fernandes, Steve Muckle,
	Todd Kjos

Hi Vincent,

On 06-Jun 10:26, Vincent Guittot wrote:

[...]

> For the 2 tasks of the example above we have the pattern:
> 
> Task 1
> state       RRRRSSSSSSERRRRSSSSSSERRRRSSSSSS
> util_avg    AAAADDDDDD AAAADDDDDD AAAADDDDDD
> 
> Task 2
> state       WWWWRRRRSSEWWWWRRRRSSEWWWWRRRRSS
> util_avg    DDDDAAAADD DDDDAAAADD DDDDAAAADD
> running_avg     AAAADDC    AAAADDC    AAAADD
> 
> R : Running 1ms, S: Sleep 1ms , W: Wait 1ms, E: Enqueue event 
> A: Accumulate 1ms, D: Decay 1ms, C: Copy util_avg value
> 
> the util_avg of T1 and T2 have the same pattern which is:
>   AAAADDDDDDAAAADDDDDDAAAADDDDDD
> and as a result, the same value which represents their utilization
> 
> For the running_avg of T2, there are only 2 decays after the last running
> phase and before the next enqueue,
> so the pattern will be AAAADDAAAA
> instead of the         AAAADDDDDDAAAA
> 
> the running_avg will report a higher value than reality.

Right!... Your example above is really useful, thanks.

Reasoning along the same lines, we can easily see that a 50% CFS task
co-scheduled with a 50% RT task, which delays the CFS one and has the
same period, will make the CFS task appear as a 100% task.

Which is definitely not what we want to sample as estimated
utilization.
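
For completeness, a quick toy model of that case (again just an
approximation: 1ms steps, y = 0.5^(1/32), the RT task always taking the
first 5ms of each 10ms period, and no sleep at all for the CFS task)
shows exactly that:

/*
 * 50% CFS task preempted by a 50% RT task with the same period: the
 * CFS task never sleeps, so running_sum is only ever decayed while the
 * task is running and there is no enqueue copy to pull it back down.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);
	const double max_sum = 1024.0 / (1.0 - y);
	double util_sum = 0.0, running_sum = 0.0;
	int ms;

	for (ms = 0; ms < 1000; ms++) {
		int running = (ms % 10) >= 5;	/* RT runs 5ms, then CFS runs 5ms */

		util_sum *= y;			/* util decays every ms */
		if (running) {
			/* running_sum decays and accumulates only while running */
			running_sum = running_sum * y + 1024.0;
			util_sum += 1024.0;
		}
	}
	printf("util_avg    ~ %.0f\n", util_sum / max_sum * 1024.0);
	printf("running_avg ~ %.0f\n", running_sum / max_sum * 1024.0);
	return 0;
}

That is, running_avg saturates at ~100% (1024) while util_avg stays
around the 50% mark (a bit above 512 here, since it samples the
end-of-run peak).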

The good news, though, is that util_avg is already correctly
reporting 50% at the end of each activation and thus, when we collect
util_est samples, we already have the best utilization value we can
collect for a task.

The only time we collect "wrong" estimation samples is when there
is no idle time, so eventually util_est should be improved by
discarding samples in those cases... but I'm not entirely sure if and
how we can detect them.

Or just ensure we have idle time... as you are proposing in
the other thread.

Thanks again for pointing out the issue above.

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-06-06 10:38 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-04 16:05 [PATCH 0/2] Improve estimated utilization of preempted FAIR tasks Patrick Bellasi
2018-06-04 16:05 ` [PATCH 1/2] sched/fair: pelt: use u32 for util_avg Patrick Bellasi
2018-06-05  1:30   ` kbuild test robot
2018-06-05  1:34   ` kbuild test robot
2018-06-04 16:06 ` [PATCH 2/2] sched/fair: util_est: add running_sum tracking Patrick Bellasi
2018-06-04 17:46   ` Joel Fernandes
2018-06-05 15:21     ` Patrick Bellasi
2018-06-05 19:33       ` Joel Fernandes
2018-06-05 19:43         ` Joel Fernandes
2018-06-05  1:29   ` kbuild test robot
2018-06-05  6:57   ` Vincent Guittot
2018-06-05 15:11     ` Patrick Bellasi
2018-06-05 15:31       ` Juri Lelli
2018-06-05 16:54         ` Patrick Bellasi
2018-06-05 20:46           ` Joel Fernandes
2018-06-05 23:15             ` Saravana Kannan
2018-06-06  8:26       ` Vincent Guittot
2018-06-06 10:38         ` Patrick Bellasi
2018-06-05 10:46   ` kbuild test robot
