* [PATCH 0/2] track rt rq utilization
@ 2017-05-24  9:00 Vincent Guittot
  2017-05-24  9:00 ` [RFC PATCH 1/2] sched/rt: add utilization tracking Vincent Guittot
  2017-05-24  9:00 ` [PATCH 2/2] cpufreq/schedutil: add rt " Vincent Guittot
  0 siblings, 2 replies; 11+ messages in thread
From: Vincent Guittot @ 2017-05-24  9:00 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen, Vincent Guittot

When both cfs and rt tasks compete to run on a CPU, we can see some frequency
drops with the schedutil governor. In such a case, the cfs_rq's utilization no
longer reflects the utilization of cfs tasks but only the remaining part that
is not used by rt tasks. We should monitor the stolen utilization and take it
into account when selecting the OPP.
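
A rough illustration with made-up numbers: on a CPU of capacity 1024 where
rt tasks consume about 40% of the time, cfs tasks that would need ~700 of
capacity can only run in the remaining ~600, so the cfs_rq's util_avg
converges towards ~600 and schedutil requests an OPP sized for ~600 even
though the CPU is in fact overloaded.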

Patch 1 tracks the utilization of the rt_rq.
Patch 2 adds the rt_rq's utilization when selecting the OPP for cfs tasks.

This patchset doesn't change the OPP selection policy for RT tasks.

Vincent Guittot (2):
  sched/rt: add utilization tracking
  cpufreq/schedutil: add rt utilization tracking

 kernel/sched/cpufreq_schedutil.c |  2 +-
 kernel/sched/fair.c              | 21 +++++++++++++++++++++
 kernel/sched/rt.c                |  9 +++++++++
 kernel/sched/sched.h             |  3 +++
 4 files changed, 34 insertions(+), 1 deletion(-)

-- 
2.7.4


* [RFC PATCH 1/2] sched/rt: add utilization tracking
  2017-05-24  9:00 [PATCH 0/2] track rt rq utilization Vincent Guittot
@ 2017-05-24  9:00 ` Vincent Guittot
  2017-05-30 15:50   ` Morten Rasmussen
  2017-05-31  9:40   ` Peter Zijlstra
  2017-05-24  9:00 ` [PATCH 2/2] cpufreq/schedutil: add rt " Vincent Guittot
  1 sibling, 2 replies; 11+ messages in thread
From: Vincent Guittot @ 2017-05-24  9:00 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen, Vincent Guittot

The schedutil governor relies on the cfs_rq's util_avg to choose the OPP
when cfs tasks are running. When the CPU is overloaded by cfs and rt
tasks, cfs tasks are preempted by rt tasks and in this case util_avg
reflects the remaining capacity that is used by cfs tasks, not what cfs
tasks want to use. In such a case, schedutil can select a lower OPP when
cfs tasks run even though the CPU is overloaded. In order to have a more
accurate view of the utilization of the CPU, we track the utilization
that is used by RT tasks.
DL tasks are not taken into account as they have their own utilization
tracking mechanism.

We don't use rt_avg, which doesn't have the same dynamics as PELT and
which can include IRQ time that is also accounted in cfs task utilization.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

If the changes are reasonable, it might be worth moving the PELT functions
into a dedicated pelt.c file and the ugly
extern int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running);
into a pelt.h header.
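
For clarity, the intended calling convention is sketched below, based on the
calls added by this patch (___update_load_avg() is the existing PELT helper;
the weight argument is 0 because only running time is tracked, not
per-entity load):

	/* accrue rt utilization while an rt task is running on the CPU */
	update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), &rq->rt, 1);

	/* decay rt utilization while the CPU runs cfs tasks or is idle */
	update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), &rq->rt, 0);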


 kernel/sched/fair.c  | 21 +++++++++++++++++++++
 kernel/sched/rt.c    |  9 +++++++++
 kernel/sched/sched.h |  3 +++
 3 files changed, 33 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 47a0c55..0d7698b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2950,6 +2950,17 @@ __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->curr != NULL, cfs_rq);
 }
 
+int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running)
+{
+	int ret;
+
+	ret = ___update_load_avg(now, cpu, &rt_rq->avg, 0, running, NULL);
+
+
+	return ret;
+}
+
+
 /*
  * Signed add and clamp on underflow.
  *
@@ -3478,6 +3489,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	return 0;
 }
 
+int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running)
+{
+	return 0;
+}
+
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
 
@@ -7008,6 +7024,10 @@ static void update_blocked_averages(int cpu)
 		if (cfs_rq_is_decayed(cfs_rq))
 			list_del_leaf_cfs_rq(cfs_rq);
 	}
+
+	update_rt_rq_load_avg(rq_clock_task(rq), cpu, &rq->rt, 0);
+
+
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -7067,6 +7087,7 @@ static inline void update_blocked_averages(int cpu)
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true);
+	update_rt_rq_load_avg(rq_clock_task(rq), cpu, &rq->rt, 0);
 	rq_unlock_irqrestore(rq, &rf);
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 581d5c7..09293fa 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1534,6 +1534,8 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	return p;
 }
 
+extern int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running);
+
 static struct task_struct *
 pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -1579,6 +1581,10 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	queue_push_tasks(rq);
 
+	if (p)
+		update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), rt_rq,
+				rq->curr->sched_class == &rt_sched_class);
+
 	return p;
 }
 
@@ -1586,6 +1592,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
 	update_curr_rt(rq);
 
+	update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), &rq->rt, 1);
+
 	/*
 	 * The previous task needs to be made eligible for pushing
 	 * if it is still active
@@ -2368,6 +2376,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 	struct sched_rt_entity *rt_se = &p->rt;
 
 	update_curr_rt(rq);
+	update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), &rq->rt, 1);
 
 	watchdog(rq, p);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f8cf1d8..75b17ff 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -502,6 +502,9 @@ struct rt_rq {
 	unsigned long rt_nr_total;
 	int overloaded;
 	struct plist_head pushable_tasks;
+
+	struct sched_avg avg;
+
 #ifdef HAVE_RT_PUSH_IPI
 	int push_flags;
 	int push_cpu;
-- 
2.7.4


* [PATCH 2/2] cpufreq/schedutil: add rt utilization tracking
  2017-05-24  9:00 [PATCH 0/2] track rt rq utilization Vincent Guittot
  2017-05-24  9:00 ` [RFC PATCH 1/2] sched/rt: add utilization tracking Vincent Guittot
@ 2017-05-24  9:00 ` Vincent Guittot
  1 sibling, 0 replies; 11+ messages in thread
From: Vincent Guittot @ 2017-05-24  9:00 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen, Vincent Guittot

Add both the cfs_rq's and the rt_rq's utilization when selecting an OPP
for cfs tasks, as rt tasks can preempt cfs tasks and steal their running
time.
This prevents frequency drops when rt tasks steal running time from cfs
tasks, which makes the cfs utilization appear lower than it really is.
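
A rough example with made-up numbers: with rq->cfs.avg.util_avg = 300 and
rq->rt.avg.util_avg = 400 on a CPU whose capacity (cfs_max) is 1024,
schedutil now sizes the OPP for min(300 + 400, 1024) = 700 instead of 300.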

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/cpufreq_schedutil.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 622eed1..bc292b92 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -164,7 +164,7 @@ static void sugov_get_util(unsigned long *util, unsigned long *max)
 
 	cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
 
-	*util = min(rq->cfs.avg.util_avg, cfs_max);
+	*util = min(rq->cfs.avg.util_avg + rq->rt.avg.util_avg, cfs_max);
 	*max = cfs_max;
 }
 
-- 
2.7.4


* Re: [RFC PATCH 1/2] sched/rt: add utilization tracking
  2017-05-24  9:00 ` [RFC PATCH 1/2] sched/rt: add utilization tracking Vincent Guittot
@ 2017-05-30 15:50   ` Morten Rasmussen
  2017-05-30 16:24     ` Vincent Guittot
  2017-05-31  9:40   ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Morten Rasmussen @ 2017-05-30 15:50 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, rjw, juri.lelli, dietmar.eggemann

On Wed, May 24, 2017 at 11:00:51AM +0200, Vincent Guittot wrote:
> The schedutil governor relies on the cfs_rq's util_avg to choose the OPP
> when cfs tasks are running. When the CPU is overloaded by cfs and rt
> tasks, cfs tasks are preempted by rt tasks and in this case util_avg
> reflects the remaining capacity that is used by cfs tasks, not what cfs
> tasks want to use. In such a case, schedutil can select a lower OPP when
> cfs tasks run even though the CPU is overloaded. In order to have a more
> accurate view of the utilization of the CPU, we track the utilization
> that is used by RT tasks.
> DL tasks are not taken into account as they have their own utilization
> tracking mechanism.
> 
> We don't use rt_avg, which doesn't have the same dynamics as PELT and
> which can include IRQ time that is also accounted in cfs task utilization.

If I get it correctly, the fundamental issue is that we need utilization
metrics for all sched_classes that are sufficiently similar that adding
them up is a useful input to schedutil? rt_avg, which already exists
today, is too slow to be fit for purpose, so let's try PELT.

It isn't a full PELT implementation in this patch. In fact, it isn't
per-entity :-) However, it might be sufficient to improve things without
causing too much overhead.

> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> 
> If the changes are reasonable, it might be worth moving the PELT functions
> into a dedicated pelt.c file and the ugly
> extern int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running);
> into a pelt.h header.
> 
> 
>  kernel/sched/fair.c  | 21 +++++++++++++++++++++
>  kernel/sched/rt.c    |  9 +++++++++
>  kernel/sched/sched.h |  3 +++
>  3 files changed, 33 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 47a0c55..0d7698b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2950,6 +2950,17 @@ __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
>  			cfs_rq->curr != NULL, cfs_rq);
>  }
>  
> +int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running)
> +{
> +	int ret;
> +
> +	ret = ___update_load_avg(now, cpu, &rt_rq->avg, 0, running, NULL);
> +
> +
> +	return ret;
> +}
> +
> +
>  /*
>   * Signed add and clamp on underflow.
>   *
> @@ -3478,6 +3489,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
>  	return 0;
>  }
>  
> +int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running)
> +{
> +	return 0;
> +}
> +
>  #define UPDATE_TG	0x0
>  #define SKIP_AGE_LOAD	0x0
>  
> @@ -7008,6 +7024,10 @@ static void update_blocked_averages(int cpu)
>  		if (cfs_rq_is_decayed(cfs_rq))
>  			list_del_leaf_cfs_rq(cfs_rq);
>  	}
> +
> +	update_rt_rq_load_avg(rq_clock_task(rq), cpu, &rq->rt, 0);
> +
> +
>  	rq_unlock_irqrestore(rq, &rf);
>  }

This would be easier if we had a place to stick remote updates of
various metrics for NOHZ idle cpus?

>  
> @@ -7067,6 +7087,7 @@ static inline void update_blocked_averages(int cpu)
>  	rq_lock_irqsave(rq, &rf);
>  	update_rq_clock(rq);
>  	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true);
> +	update_rt_rq_load_avg(rq_clock_task(rq), cpu, &rq->rt, 0);
>  	rq_unlock_irqrestore(rq, &rf);
>  }
>  
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 581d5c7..09293fa 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1534,6 +1534,8 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
>  	return p;
>  }
>  
> +extern int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running);
> +
>  static struct task_struct *
>  pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
> @@ -1579,6 +1581,10 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  
>  	queue_push_tasks(rq);
>  
> +	if (p)
> +		update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), rt_rq,
> +				rq->curr->sched_class == &rt_sched_class);
> +
>  	return p;
>  }

I'm quite likely missing something, but why is this update needed here?

IIUC, put_prev_task() was called just a couple of lines further up which
also does an update (below). The only difference is that this update
depends on the current task being an RT task, which isn't always the
case. Is it supposed to decay the RT PELT average when we pick an RT
task and the previous task !RT? In that case, the "running" argument
needs a ! in front.

Morten


* Re: [RFC PATCH 1/2] sched/rt: add utilization tracking
  2017-05-30 15:50   ` Morten Rasmussen
@ 2017-05-30 16:24     ` Vincent Guittot
  2017-05-31  7:57       ` Morten Rasmussen
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2017-05-30 16:24 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, rjw, Juri Lelli,
	Dietmar Eggemann

On 30 May 2017 at 17:50, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, May 24, 2017 at 11:00:51AM +0200, Vincent Guittot wrote:
>> The schedutil governor relies on the cfs_rq's util_avg to choose the OPP
>> when cfs tasks are running. When the CPU is overloaded by cfs and rt
>> tasks, cfs tasks are preempted by rt tasks and in this case util_avg
>> reflects the remaining capacity that is used by cfs tasks, not what cfs
>> tasks want to use. In such a case, schedutil can select a lower OPP when
>> cfs tasks run even though the CPU is overloaded. In order to have a more
>> accurate view of the utilization of the CPU, we track the utilization
>> that is used by RT tasks.
>> DL tasks are not taken into account as they have their own utilization
>> tracking mechanism.
>>
>> We don't use rt_avg, which doesn't have the same dynamics as PELT and
>> which can include IRQ time that is also accounted in cfs task utilization.
>
> If I get it correctly, the fundamental issue is that we need utilization
> metrics for all sched_classes that are sufficiently similar that adding
> them up is a useful input to schedutil? rt_avg, which already exists
> today, is too slow to be fit for purpose, so let's try PELT.

Yes, and in addition rt_avg can include irq handler time, which is
already accounted in cfs PELT.

>
> It isn't a full PELT implementation in this patch. In fact, it isn't
> per-entity :-) However, it might be sufficient to improve things without
> causing too much overhead.

Yes, full PELT is not needed; only per-rt_rq utilization is useful.

>
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>
>> If the changes are reasonable, it might be worth moving the PELT functions
>> into a dedicated pelt.c file and the ugly
>> extern int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running);
>> into a pelt.h header.
>>
>>
>>  kernel/sched/fair.c  | 21 +++++++++++++++++++++
>>  kernel/sched/rt.c    |  9 +++++++++
>>  kernel/sched/sched.h |  3 +++
>>  3 files changed, 33 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 47a0c55..0d7698b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2950,6 +2950,17 @@ __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
>>                       cfs_rq->curr != NULL, cfs_rq);
>>  }
>>
>> +int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running)
>> +{
>> +     int ret;
>> +
>> +     ret = ___update_load_avg(now, cpu, &rt_rq->avg, 0, running, NULL);
>> +
>> +
>> +     return ret;
>> +}
>> +
>> +
>>  /*
>>   * Signed add and clamp on underflow.
>>   *
>> @@ -3478,6 +3489,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
>>       return 0;
>>  }
>>
>> +int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running)
>> +{
>> +     return 0;
>> +}
>> +
>>  #define UPDATE_TG    0x0
>>  #define SKIP_AGE_LOAD        0x0
>>
>> @@ -7008,6 +7024,10 @@ static void update_blocked_averages(int cpu)
>>               if (cfs_rq_is_decayed(cfs_rq))
>>                       list_del_leaf_cfs_rq(cfs_rq);
>>       }
>> +
>> +     update_rt_rq_load_avg(rq_clock_task(rq), cpu, &rq->rt, 0);
>> +
>> +
>>       rq_unlock_irqrestore(rq, &rf);
>>  }
>
> This would be easier if we had a place to stick remote updates of
> various metrics for NOHZ idle cpus?

Yes probably.


>
>>
>> @@ -7067,6 +7087,7 @@ static inline void update_blocked_averages(int cpu)
>>       rq_lock_irqsave(rq, &rf);
>>       update_rq_clock(rq);
>>       update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true);
>> +     update_rt_rq_load_avg(rq_clock_task(rq), cpu, &rq->rt, 0);
>>       rq_unlock_irqrestore(rq, &rf);
>>  }
>>
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 581d5c7..09293fa 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -1534,6 +1534,8 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
>>       return p;
>>  }
>>
>> +extern int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running);
>> +
>>  static struct task_struct *
>>  pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>  {
>> @@ -1579,6 +1581,10 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>
>>       queue_push_tasks(rq);
>>
>> +     if (p)
>> +             update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), rt_rq,
>> +                             rq->curr->sched_class == &rt_sched_class);
>> +
>>       return p;
>>  }
>
> I'm quite likely missing something, but why is this update needed here?
>
> IIUC, put_prev_task() was called just a couple of lines further up which
> also does an update (below). The only difference is that this update
> depends on the current task being an RT task, which isn't always the
> case. Is it supposed to decay the RT PELT average when we pick an RT
> task and the previous task !RT? In that case, the "running" argument
> needs a ! in front.

If the prev task is not an rt task but a cfs task, put_prev_task() will
apply to the cfs class and cfs PELT but not to the rt one.
So when the next task is an rt task, we have to update the rt utilization:
either to decay the rt utilization if the curr task (prev task) is not rt,
or to add utilization if the curr task is an rt task.
We have something similar with put_prev_task_rt().
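
A minimal sketch annotating the call from the patch (the comments are mine,
not part of the diff):

	/* in pick_next_task_rt(), once an rt task p has been picked:
	 *  - prev (still rq->curr at this point) is rt -> running == 1 -> accrue
	 *  - prev is cfs or idle                        -> running == 0 -> decay
	 * so the rt signal is up to date before p starts running.
	 */
	update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), rt_rq,
			rq->curr->sched_class == &rt_sched_class);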

>
> Morten


* Re: [RFC PATCH 1/2] sched/rt: add utilization tracking
  2017-05-30 16:24     ` Vincent Guittot
@ 2017-05-31  7:57       ` Morten Rasmussen
  0 siblings, 0 replies; 11+ messages in thread
From: Morten Rasmussen @ 2017-05-31  7:57 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, rjw, Juri Lelli,
	Dietmar Eggemann

On Tue, May 30, 2017 at 06:24:27PM +0200, Vincent Guittot wrote:
> On 30 May 2017 at 17:50, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Wed, May 24, 2017 at 11:00:51AM +0200, Vincent Guittot wrote:
> >> @@ -1534,6 +1534,8 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
> >>       return p;
> >>  }
> >>
> >> +extern int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running);
> >> +
> >>  static struct task_struct *
> >>  pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >>  {
> >> @@ -1579,6 +1581,10 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >>
> >>       queue_push_tasks(rq);
> >>
> >> +     if (p)
> >> +             update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), rt_rq,
> >> +                             rq->curr->sched_class == &rt_sched_class);
> >> +
> >>       return p;
> >>  }
> >
> > I'm quite likely missing something, but why is this update needed here?
> >
> > IIUC, put_prev_task() was called just a couple of lines further up which
> > also does an update (below). The only difference is that this update
> > depends on the current task being an RT task, which isn't always the
> > case. Is it supposed to decay the RT PELT average when we pick an RT
> > task and the previous task !RT? In that case, the "running" argument
> > needs a ! in front.
> 
> If the prev task is not an rt task but a cfs task, put_prev_task() will
> apply to the cfs class and cfs PELT but not to the rt one.
> So when the next task is an rt task, we have to update the rt utilization:
> either to decay the rt utilization if the curr task (prev task) is not rt,
> or to add utilization if the curr task is an rt task.
> We have something similar with put_prev_task_rt().

Yes. It was just me getting the logic wrong. rq->curr->sched_class ==
&rt_sched_class would of course be false if the previous task is CFS,
meaning "running" is false and hence we will decay the rt_rq PELT signal
as we correctly should.


* Re: [RFC PATCH 1/2] sched/rt: add utilization tracking
  2017-05-24  9:00 ` [RFC PATCH 1/2] sched/rt: add utilization tracking Vincent Guittot
  2017-05-30 15:50   ` Morten Rasmussen
@ 2017-05-31  9:40   ` Peter Zijlstra
  2017-05-31 10:30     ` Peter Zijlstra
  2017-05-31 11:24     ` Vincent Guittot
  1 sibling, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2017-05-31  9:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen

On Wed, May 24, 2017 at 11:00:51AM +0200, Vincent Guittot wrote:
> The schedutil governor relies on the cfs_rq's util_avg to choose the OPP
> when cfs tasks are running. When the CPU is overloaded by cfs and rt
> tasks, cfs tasks are preempted by rt tasks and in this case util_avg
> reflects the remaining capacity that is used by cfs tasks, not what cfs
> tasks want to use. In such a case, schedutil can select a lower OPP when
> cfs tasks run even though the CPU is overloaded. In order to have a more
> accurate view of the utilization of the CPU, we track the utilization
> that is used by RT tasks.
> DL tasks are not taken into account as they have their own utilization
> tracking mechanism.

Well, the DL tracking is fairly pessimistic; it assumes all DL tasks
will consume their total budget, which will rarely, if ever, happen.

So I suspect it might well be worth it to also track DL activity for the
purpose of compensating CFS.

In fact, I don't think you particularly care about RT here, as anything
!CFS that preempts it, including those interrupts you mentioned. Which
gets us back to what rt_avg is.

> We don't use rt_avg, which doesn't have the same dynamics as PELT and
> which can include IRQ time that is also accounted in cfs task utilization.

Well, if rt_avg includes IRQ time, then that IRQ time is not part of
the task clock.

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> 
> If the changes are reasonable, it might be worth moving the PELT functions
> into a dedicated pelt.c file and the ugly
> extern int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running);
> into a pelt.h header.
> 
> 
>  kernel/sched/fair.c  | 21 +++++++++++++++++++++
>  kernel/sched/rt.c    |  9 +++++++++
>  kernel/sched/sched.h |  3 +++
>  3 files changed, 33 insertions(+)

Also, and I didn't check this, it is important that the windows are
aligned if you want to sum the values.


* Re: [RFC PATCH 1/2] sched/rt: add utilization tracking
  2017-05-31  9:40   ` Peter Zijlstra
@ 2017-05-31 10:30     ` Peter Zijlstra
  2017-05-31 10:41       ` Juri Lelli
  2017-05-31 11:24     ` Vincent Guittot
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2017-05-31 10:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen

On Wed, May 31, 2017 at 11:40:47AM +0200, Peter Zijlstra wrote:
> On Wed, May 24, 2017 at 11:00:51AM +0200, Vincent Guittot wrote:
> > The schedutil governor relies on the cfs_rq's util_avg to choose the OPP
> > when cfs tasks are running. When the CPU is overloaded by cfs and rt
> > tasks, cfs tasks are preempted by rt tasks and in this case util_avg
> > reflects the remaining capacity that is used by cfs tasks, not what cfs
> > tasks want to use. In such a case, schedutil can select a lower OPP when
> > cfs tasks run even though the CPU is overloaded. In order to have a more
> > accurate view of the utilization of the CPU, we track the utilization
> > that is used by RT tasks.
> > DL tasks are not taken into account as they have their own utilization
> > tracking mechanism.
> 
> Well, the DL tracking is fairly pessimistic; it assumes all DL tasks
> will consume their total budget, which will rarely, if ever, happen.
> 
> So I suspect it might well be worth it to also track DL activity for the
> purpose of compensating CFS.

Again, it seems I have this CPPC/HWP crud firmly stuck in my brain.
Because I was thinking:

	min_freq = dl_util
	avg_freq = dl_avg + rt_avg + cfs_util


But given we don't actually have that split... meh.

> In fact, I don't think you particularly care about RT here, as anything
> !CFS that preempts it, including those interrupts you mentioned. Which
> gets us back to what rt_avg is.
> 
> > We don't use rt_avg, which doesn't have the same dynamics as PELT and
> > which can include IRQ time that is also accounted in cfs task utilization.
> 
> Well, if rt_avg includes IRQ time, then that IRQ time is not part of
> the task clock.


* Re: [RFC PATCH 1/2] sched/rt: add utilization tracking
  2017-05-31 10:30     ` Peter Zijlstra
@ 2017-05-31 10:41       ` Juri Lelli
  2017-05-31 11:39         ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Juri Lelli @ 2017-05-31 10:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, rjw, dietmar.eggemann,
	Morten.Rasmussen

On 31/05/17 12:30, Peter Zijlstra wrote:
> On Wed, May 31, 2017 at 11:40:47AM +0200, Peter Zijlstra wrote:
> > On Wed, May 24, 2017 at 11:00:51AM +0200, Vincent Guittot wrote:
> > > The schedutil governor relies on the cfs_rq's util_avg to choose the OPP
> > > when cfs tasks are running. When the CPU is overloaded by cfs and rt
> > > tasks, cfs tasks are preempted by rt tasks and in this case util_avg
> > > reflects the remaining capacity that is used by cfs tasks, not what cfs
> > > tasks want to use. In such a case, schedutil can select a lower OPP when
> > > cfs tasks run even though the CPU is overloaded. In order to have a more
> > > accurate view of the utilization of the CPU, we track the utilization
> > > that is used by RT tasks.
> > > DL tasks are not taken into account as they have their own utilization
> > > tracking mechanism.
> > 
> > Well, the DL tracking is fairly pessimistic; it assumes all DL tasks
> > will consume their total budget, which will rarely, if ever, happen.
> > 
> > So I suspect it might well be worth it to also track DL activity for the
> > purpose of compensating CFS.
> 
> Again, it seems I have this CPPC/HWP crud firmly stuck in my brain.
> Because I was thinking:
> 
> 	min_freq = dl_util
> 	avg_freq = dl_avg + rt_avg + cfs_util
> 
> 
> But given we don't actually have that split... meh.
> 

Right, interesting. So, I guess the question is: should we have it? :)

IMHO, it makes sense and seems to benefit mobile use-cases I'm looking
at.

rt_avg, though, also seems to build up very slowly (at least with
default configs). I'm experimenting with Vincent's proposal and it looks
better (w.r.t. using rt_avg). Also, summing up signals that behave
similarly doesn't seem the wrong thing to do.


* Re: [RFC PATCH 1/2] sched/rt: add utilization tracking
  2017-05-31  9:40   ` Peter Zijlstra
  2017-05-31 10:30     ` Peter Zijlstra
@ 2017-05-31 11:24     ` Vincent Guittot
  1 sibling, 0 replies; 11+ messages in thread
From: Vincent Guittot @ 2017-05-31 11:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, rjw, Juri Lelli, Dietmar Eggemann,
	Morten Rasmussen

On 31 May 2017 at 11:40, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, May 24, 2017 at 11:00:51AM +0200, Vincent Guittot wrote:
>> The schedutil governor relies on the cfs_rq's util_avg to choose the OPP
>> when cfs tasks are running. When the CPU is overloaded by cfs and rt
>> tasks, cfs tasks are preempted by rt tasks and in this case util_avg
>> reflects the remaining capacity that is used by cfs tasks, not what cfs
>> tasks want to use. In such a case, schedutil can select a lower OPP when
>> cfs tasks run even though the CPU is overloaded. In order to have a more
>> accurate view of the utilization of the CPU, we track the utilization
>> that is used by RT tasks.
>> DL tasks are not taken into account as they have their own utilization
>> tracking mechanism.
>
> Well, the DL tracking is fairly pessimistic; it assumes all DL tasks
> will consume their total budget, which will rarely, if ever, happen.
>
> So I suspect it might well be worth it to also track DL activity for the
> purpose of compensating CFS.
>
> In fact, I don't think you particularly care about RT here, as anything
> !CFS that preempts it, including those interrupts you mentioned. Which
> gets us back to what rt_avg is.
>
>> We don't use rt_avg, which doesn't have the same dynamics as PELT and
>> which can include IRQ time that is also accounted in cfs task utilization.
>
> Well, if rt_avg includes IRQ time, then that IRQ time is not part of
> the task clock.

Ah yes, you're right.
I hadn't noticed that irq time was removed from the clock used for accounting PELT.

>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>
>> If the changes are reasonable, it might be worth moving the PELT functions
>> into a dedicated pelt.c file and the ugly
>> extern int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running);
>> into a pelt.h header.
>>
>>
>>  kernel/sched/fair.c  | 21 +++++++++++++++++++++
>>  kernel/sched/rt.c    |  9 +++++++++
>>  kernel/sched/sched.h |  3 +++
>>  3 files changed, 33 insertions(+)
>
> Also, and I didn't check this, it is important that the windows are
> aligned if you want to sum the values.

Yes, good point.


* Re: [RFC PATCH 1/2] sched/rt: add utilization tracking
  2017-05-31 10:41       ` Juri Lelli
@ 2017-05-31 11:39         ` Peter Zijlstra
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2017-05-31 11:39 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Vincent Guittot, mingo, linux-kernel, rjw, dietmar.eggemann,
	Morten.Rasmussen

On Wed, May 31, 2017 at 11:41:11AM +0100, Juri Lelli wrote:
> On 31/05/17 12:30, Peter Zijlstra wrote:

> > Again, it seems I have this CPPC/HWP crud firmly stuck in my brain.
> > Because I was thinking:
> > 
> > 	min_freq = dl_util
> > 	avg_freq = dl_avg + rt_avg + cfs_util
> > 
> > 
> > But given we don't actually have that split... meh.
> > 
> 
> Right, interesting. So, I guess the question is: should we have it? :)

I think cpufreq should certainly look at moving in that direction, but
that's up to Rafael of course.


end of thread, other threads:[~2017-05-31 11:39 UTC | newest]

Thread overview: 11+ messages
2017-05-24  9:00 [PATCH 0/2] track rt rq utilization Vincent Guittot
2017-05-24  9:00 ` [RFC PATCH 1/2] sched/rt: add utilization tracking Vincent Guittot
2017-05-30 15:50   ` Morten Rasmussen
2017-05-30 16:24     ` Vincent Guittot
2017-05-31  7:57       ` Morten Rasmussen
2017-05-31  9:40   ` Peter Zijlstra
2017-05-31 10:30     ` Peter Zijlstra
2017-05-31 10:41       ` Juri Lelli
2017-05-31 11:39         ` Peter Zijlstra
2017-05-31 11:24     ` Vincent Guittot
2017-05-24  9:00 ` [PATCH 2/2] cpufreq/schedutil: add rt " Vincent Guittot
