* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-09 10:46 [PATCH v2] sched: rt: Make RT capacity aware Qais Yousef
@ 2019-10-23 12:34 ` Qais Yousef
2019-10-28 14:37 ` Peter Zijlstra
` (5 subsequent siblings)
6 siblings, 0 replies; 42+ messages in thread
From: Qais Yousef @ 2019-10-23 12:34 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Steven Rostedt
Cc: Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, linux-kernel, qperret, balsini
Adding some Android folks who might be interested.
Steven/Peter, in case this has dropped off your queue, it'd be great to get
some feedback when you get a chance to look at it.
Thanks
--
Qais Yousef
On 10/09/19 11:46, Qais Yousef wrote:
> Capacity Awareness refers to the fact that on heterogeneous systems
> (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> when placing tasks we need to be aware of this difference of CPU
> capacities.
>
> In such scenarios we want to ensure that the selected CPU has enough
> capacity to meet the requirement of the running task. Enough capacity
> means here that capacity_orig_of(cpu) >= task.requirement.
>
> The definition of task.requirement is dependent on the scheduling class.
>
> For CFS, utilization is used to select a CPU that has a capacity
> value >= cfs_task.util.
>
> capacity_orig_of(cpu) >= cfs_task.util
>
> DL isn't capacity aware at the moment but can make use of the bandwidth
> reservation to implement that in a similar manner to how CFS uses utilization.
> The following patchset implements that:
>
> https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
>
> capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
>
> For RT we don't have a per task utilization signal and we lack any
> information in general about what performance requirement the RT task
> needs. But with the introduction of uclamp, RT tasks can now control
> that by setting uclamp_min to guarantee a minimum performance point.
>
> ATM the uclamp values are only used for frequency selection; but on
> heterogeneous systems this is not enough and we need to ensure that the
> capacity of the CPU is >= uclamp_min, which is what is implemented here.
>
> capacity_orig_of(cpu) >= rt_task.uclamp_min
>
> Note that by default uclamp.min is 1024, which means that RT tasks will
> always be biased towards the big CPUs, which makes for a better, more
> predictable behavior for the default case.
>
> I must stress that the bias acts as a hint rather than a definite
> placement strategy. For example, if all big cores are busy executing
> other RT tasks we can't guarantee that a new RT task will be placed
> there.
>
> On non-heterogeneous systems the original behavior of RT should be
> retained. Similarly if uclamp is not selected in the config.
>
> Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> ---
>
> Changes in v2:
> - Use cpupri_find() to check the fitness of the task instead of
> sprinkling find_lowest_rq() with several checks of
> rt_task_fits_capacity().
>
> The selected implementation opted to pass the fitness function as an
> argument rather than call rt_task_fits_capacity() directly, which is
> cleaner for keeping the logical separation of the 2 modules; but it
> means the compiler has less room to optimize rt_task_fits_capacity()
> out when it's a constant value.
>
> The logic is not perfect. For example, if a 'small' task is occupying a big
> CPU and another big task wakes up, we won't force-migrate the small task to
> clear the big CPU for the big task that woke up.
>
> IOW, the logic is best effort and can't give hard guarantees. But it improves
> the current situation where a task can randomly end up on any CPU regardless
> of what it needs. I.e.: without this patch an RT task can wake up on a big or
> small CPU, but with this it will always wake up on a big CPU (assuming the
> big CPUs aren't overloaded) - hence providing consistent performance.
>
> I'm looking at ways to improve this best effort, but this patch should be
> a good start to discuss our Capacity Awareness requirement. There's a trade-off
> of complexity to be made here and I'd like to keep things as simple as
> possible and build on top as needed.
>
>
> kernel/sched/cpupri.c | 23 ++++++++++--
> kernel/sched/cpupri.h | 4 ++-
> kernel/sched/rt.c | 81 +++++++++++++++++++++++++++++++++++--------
> 3 files changed, 91 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index b7abca987d94..799791c01d60 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -57,7 +57,8 @@ static int convert_prio(int prio)
> * Return: (int)bool - CPUs were found
> */
> int cpupri_find(struct cpupri *cp, struct task_struct *p,
> - struct cpumask *lowest_mask)
> + struct cpumask *lowest_mask,
> + bool (*fitness_fn)(struct task_struct *p, int cpu))
> {
> int idx = 0;
> int task_pri = convert_prio(p->prio);
> @@ -98,6 +99,8 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
> continue;
>
> if (lowest_mask) {
> + int cpu;
> +
> cpumask_and(lowest_mask, p->cpus_ptr, vec->mask);
>
> /*
> @@ -108,7 +111,23 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
> * condition, simply act as though we never hit this
> * priority level and continue on.
> */
> - if (cpumask_any(lowest_mask) >= nr_cpu_ids)
> + if (cpumask_empty(lowest_mask))
> + continue;
> +
> + if (!fitness_fn)
> + return 1;
> +
> + /* Ensure the capacity of the CPUs fit the task */
> + for_each_cpu(cpu, lowest_mask) {
> + if (!fitness_fn(p, cpu))
> + cpumask_clear_cpu(cpu, lowest_mask);
> + }
> +
> + /*
> + * If no CPU at the current priority can fit the task
> + * continue looking
> + */
> + if (cpumask_empty(lowest_mask))
> continue;
> }
>
> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index 7dc20a3232e7..32dd520db11f 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -18,7 +18,9 @@ struct cpupri {
> };
>
> #ifdef CONFIG_SMP
> -int cpupri_find(struct cpupri *cp, struct task_struct *p, struct cpumask *lowest_mask);
> +int cpupri_find(struct cpupri *cp, struct task_struct *p,
> + struct cpumask *lowest_mask,
> + bool (*fitness_fn)(struct task_struct *p, int cpu));
> void cpupri_set(struct cpupri *cp, int cpu, int pri);
> int cpupri_init(struct cpupri *cp);
> void cpupri_cleanup(struct cpupri *cp);
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ebaa4e619684..3a68054e15b3 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -437,6 +437,45 @@ static inline int on_rt_rq(struct sched_rt_entity *rt_se)
> return rt_se->on_rq;
> }
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/*
> + * Verify the fitness of task @p to run on @cpu taking into account the uclamp
> + * settings.
> + *
> + * This check is only important for heterogeneous systems where uclamp_min value
> + * is higher than the capacity of a @cpu. For non-heterogeneous system this
> + * function will always return true.
> + *
> + * The function will return true if the capacity of the @cpu is >= the
> + * uclamp_min and false otherwise.
> + *
> + * Note that uclamp_min will be clamped to uclamp_max if uclamp_min
> + * > uclamp_max.
> + */
> +inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> + unsigned int min_cap;
> + unsigned int max_cap;
> + unsigned int cpu_cap;
> +
> + /* Only heterogeneous systems can benefit from this check */
> + if (!static_branch_unlikely(&sched_asym_cpucapacity))
> + return true;
> +
> + min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> + max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> +
> + cpu_cap = capacity_orig_of(cpu);
> +
> + return cpu_cap >= min(min_cap, max_cap);
> +}
> +#else
> +static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> + return true;
> +}
> +#endif
> +
> #ifdef CONFIG_RT_GROUP_SCHED
>
> static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
> @@ -1391,6 +1430,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> {
> struct task_struct *curr;
> struct rq *rq;
> + bool test;
>
> /* For anything but wake ups, just return the task_cpu */
> if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
> @@ -1422,10 +1462,16 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> *
> * This test is optimistic, if we get it wrong the load-balancer
> * will have to sort it out.
> + *
> + * We take into account the capacity of the cpu to ensure it fits the
> + * requirement of the task - which is only important on heterogeneous
> + * systems like big.LITTLE.
> */
> - if (curr && unlikely(rt_task(curr)) &&
> - (curr->nr_cpus_allowed < 2 ||
> - curr->prio <= p->prio)) {
> + test = curr &&
> + unlikely(rt_task(curr)) &&
> + (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
> +
> + if (test || !rt_task_fits_capacity(p, cpu)) {
> int target = find_lowest_rq(p);
>
> /*
> @@ -1449,7 +1495,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
> * let's hope p can move out.
> */
> if (rq->curr->nr_cpus_allowed == 1 ||
> - !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
> + !cpupri_find(&rq->rd->cpupri, rq->curr, NULL, NULL))
> return;
>
> /*
> @@ -1457,7 +1503,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
> * see if it is pushed or pulled somewhere else.
> */
> if (p->nr_cpus_allowed != 1
> - && cpupri_find(&rq->rd->cpupri, p, NULL))
> + && cpupri_find(&rq->rd->cpupri, p, NULL, NULL))
> return;
>
> /*
> @@ -1600,7 +1646,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_fla
> static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
> {
> if (!task_running(rq, p) &&
> - cpumask_test_cpu(cpu, p->cpus_ptr))
> + cpumask_test_cpu(cpu, p->cpus_ptr) &&
> + rt_task_fits_capacity(p, cpu))
> return 1;
>
> return 0;
> @@ -1642,7 +1689,8 @@ static int find_lowest_rq(struct task_struct *task)
> if (task->nr_cpus_allowed == 1)
> return -1; /* No other targets possible */
>
> - if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
> + if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask,
> + rt_task_fits_capacity))
> return -1; /* No targets found */
>
> /*
> @@ -2146,12 +2194,14 @@ static void pull_rt_task(struct rq *this_rq)
> */
> static void task_woken_rt(struct rq *rq, struct task_struct *p)
> {
> - if (!task_running(rq, p) &&
> - !test_tsk_need_resched(rq->curr) &&
> - p->nr_cpus_allowed > 1 &&
> - (dl_task(rq->curr) || rt_task(rq->curr)) &&
> - (rq->curr->nr_cpus_allowed < 2 ||
> - rq->curr->prio <= p->prio))
> + bool need_to_push = !task_running(rq, p) &&
> + !test_tsk_need_resched(rq->curr) &&
> + p->nr_cpus_allowed > 1 &&
> + (dl_task(rq->curr) || rt_task(rq->curr)) &&
> + (rq->curr->nr_cpus_allowed < 2 ||
> + rq->curr->prio <= p->prio);
> +
> + if (need_to_push || !rt_task_fits_capacity(p, cpu_of(rq)))
> push_rt_tasks(rq);
> }
>
> @@ -2223,7 +2273,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
> */
> if (task_on_rq_queued(p) && rq->curr != p) {
> #ifdef CONFIG_SMP
> - if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
> + bool need_to_push = rq->rt.overloaded ||
> + !rt_task_fits_capacity(p, cpu_of(rq));
> +
> + if (p->nr_cpus_allowed > 1 && need_to_push)
> rt_queue_push_tasks(rq);
> #endif /* CONFIG_SMP */
> if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
> --
> 2.17.1
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-09 10:46 [PATCH v2] sched: rt: Make RT capacity aware Qais Yousef
2019-10-23 12:34 ` Qais Yousef
@ 2019-10-28 14:37 ` Peter Zijlstra
2019-10-28 18:01 ` Steven Rostedt
2019-10-29 8:13 ` Vincent Guittot
` (4 subsequent siblings)
6 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2019-10-28 14:37 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Steven Rostedt, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On Wed, Oct 09, 2019 at 11:46:11AM +0100, Qais Yousef wrote:
> Capacity Awareness refers to the fact that on heterogeneous systems
> (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> when placing tasks we need to be aware of this difference of CPU
> capacities.
>
> [...]
>
> Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Works for me; Steve you OK with this?
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-28 14:37 ` Peter Zijlstra
@ 2019-10-28 18:01 ` Steven Rostedt
2019-10-28 20:50 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 42+ messages in thread
From: Steven Rostedt @ 2019-10-28 18:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On Mon, 28 Oct 2019 15:37:49 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Oct 09, 2019 at 11:46:11AM +0100, Qais Yousef wrote:
> > Capacity Awareness refers to the fact that on heterogeneous systems
> > (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> > when placing tasks we need to be aware of this difference of CPU
> > capacities.
> >
> > [...]
> >
> > Signed-off-by: Qais Yousef <qais.yousef@arm.com>
>
> Works for me; Steve you OK with this?
Nothing against it, but I want to take a deeper look before we accept
it. Are you OK with waiting a week? I'm currently at Open Source Summit
and still have two more talks to write (giving them Thursday). I won't
have time to look till next week.
-- Steve
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-28 18:01 ` Steven Rostedt
@ 2019-10-28 20:50 ` Peter Zijlstra
2019-12-20 16:01 ` Qais Yousef
2019-11-07 9:15 ` Qais Yousef
2019-11-18 15:43 ` Qais Yousef
2 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2019-10-28 20:50 UTC (permalink / raw)
To: Steven Rostedt
Cc: Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On Mon, Oct 28, 2019 at 02:01:47PM -0400, Steven Rostedt wrote:
> On Mon, 28 Oct 2019 15:37:49 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> > Works for me; Steve you OK with this?
>
> Nothing against it, but I want to take a deeper look before we accept
> it. Are you OK in waiting a week? I'm currently at Open Source Summit
> and still have two more talks to write (giving them Thursday). I wont
> have time to look till next week.
Sure, I'll keep it in my queue, but will make sure it doesn't hit -tip
until you've had time.
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-28 20:50 ` Peter Zijlstra
@ 2019-12-20 16:01 ` Qais Yousef
2019-12-20 17:18 ` Peter Zijlstra
0 siblings, 1 reply; 42+ messages in thread
From: Qais Yousef @ 2019-12-20 16:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
Hi Peter
On 10/28/19 21:50, Peter Zijlstra wrote:
> On Mon, Oct 28, 2019 at 02:01:47PM -0400, Steven Rostedt wrote:
> > On Mon, 28 Oct 2019 15:37:49 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Works for me; Steve you OK with this?
> >
> > Nothing against it, but I want to take a deeper look before we accept
> > it. Are you OK in waiting a week? I'm currently at Open Source Summit
> > and still have two more talks to write (giving them Thursday). I wont
> > have time to look till next week.
>
> Sure, I'll keep it in my queue, but will make sure it doesn't hit -tip
> until you've had time.
Reviewers are happy with this now. It'd be nice if you could pick it up again
for the next round to -tip.
Thanks!
--
Qais Yousef
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-12-20 16:01 ` Qais Yousef
@ 2019-12-20 17:18 ` Peter Zijlstra
2019-12-20 17:36 ` Qais Yousef
0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2019-12-20 17:18 UTC (permalink / raw)
To: Qais Yousef
Cc: Steven Rostedt, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On Fri, Dec 20, 2019 at 04:01:49PM +0000, Qais Yousef wrote:
> On 10/28/19 21:50, Peter Zijlstra wrote:
> > On Mon, Oct 28, 2019 at 02:01:47PM -0400, Steven Rostedt wrote:
> > > On Mon, 28 Oct 2019 15:37:49 +0100
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > > Works for me; Steve you OK with this?
> > >
> > > Nothing against it, but I want to take a deeper look before we accept
> > > it. Are you OK in waiting a week? I'm currently at Open Source Summit
> > > and still have two more talks to write (giving them Thursday). I wont
> > > have time to look till next week.
> >
> > Sure, I'll keep it in my queue, but will make sure it doesn't hit -tip
> > until you've had time.
>
> Reviewers are happy with this now. It'd be nice if you can pick it up again for
> the next round to -tip.
>
Sorry, I missed Steve's and Dietmar's replies. It should shortly appear
in queue.git and I'll try to push to -tip over the weekend (provided
the robots don't come up with something fishy).
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-12-20 17:18 ` Peter Zijlstra
@ 2019-12-20 17:36 ` Qais Yousef
0 siblings, 0 replies; 42+ messages in thread
From: Qais Yousef @ 2019-12-20 17:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On 12/20/19 18:18, Peter Zijlstra wrote:
> On Fri, Dec 20, 2019 at 04:01:49PM +0000, Qais Yousef wrote:
> > On 10/28/19 21:50, Peter Zijlstra wrote:
> > > On Mon, Oct 28, 2019 at 02:01:47PM -0400, Steven Rostedt wrote:
> > > > On Mon, 28 Oct 2019 15:37:49 +0100
> > > > Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > > > Works for me; Steve you OK with this?
> > > >
> > > > Nothing against it, but I want to take a deeper look before we accept
> > > > it. Are you OK in waiting a week? I'm currently at Open Source Summit
> > > > and still have two more talks to write (giving them Thursday). I wont
> > > > have time to look till next week.
> > >
> > > Sure, I'll keep it in my queue, but will make sure it doesn't hit -tip
> > > until you've had time.
> >
> > Reviewers are happy with this now. It'd be nice if you can pick it up again for
> > the next round to -tip.
> >
>
> Sorry, I missed Steve's and Dietmar's replies. It should shorty appear
> in queue.git and I'll try and push to -tip over the weekend (provided
> the robots don't come up with something fishy).
No worries, thanks! It missed the 5.5 merge window anyway.
We had 2 reports from the buildbot last time; luckily I kept the fixups at the
top of my local branch.
Happy to apply locally again, or would you prefer I send v3?
Cheers
---
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 799791c01d60..bdfb802d4c12 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -46,6 +46,8 @@ static int convert_prio(int prio)
* @cp: The cpupri context
* @p: The task
* @lowest_mask: A mask to fill in with selected CPUs (or NULL)
+ * @fitness_fn: A pointer to a function to do custom checks whether the CPU
+ * fits a specific criteria so that we only return those CPUs.
*
* Note: This function returns the recommended CPUs as calculated during the
* current invocation. By the time the call returns, the CPUs may have in
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3a68054e15b3..6afecb5557db 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -452,7 +452,7 @@ static inline int on_rt_rq(struct sched_rt_entity *rt_se)
* Note that uclamp_min will be clamped to uclamp_max if uclamp_min
* > uclamp_max.
*/
-inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
+static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
{
unsigned int min_cap;
unsigned int max_cap;
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-28 18:01 ` Steven Rostedt
2019-10-28 20:50 ` Peter Zijlstra
@ 2019-11-07 9:15 ` Qais Yousef
2019-11-18 15:43 ` Qais Yousef
2 siblings, 0 replies; 42+ messages in thread
From: Qais Yousef @ 2019-11-07 9:15 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
Hi Steve
On 10/28/19 14:01, Steven Rostedt wrote:
> On Mon, 28 Oct 2019 15:37:49 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Wed, Oct 09, 2019 at 11:46:11AM +0100, Qais Yousef wrote:
> > > Capacity Awareness refers to the fact that on heterogeneous systems
> > > (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> > > when placing tasks we need to be aware of this difference of CPU
> > > capacities.
> > >
> > > [...]
> > >
> > > Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> >
> > Works for me; Steve you OK with this?
>
> Nothing against it, but I want to take a deeper look before we accept
> it. Are you OK in waiting a week? I'm currently at Open Source Summit
> and still have two more talks to write (giving them Thursday). I wont
> have time to look till next week.
A gentle reminder if you can spare some time to look at this. It'd be nice if
it can make it to the 5.5 merge window if there are no major concerns about it.
Thanks!
--
Qais Yousef
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-28 18:01 ` Steven Rostedt
2019-10-28 20:50 ` Peter Zijlstra
2019-11-07 9:15 ` Qais Yousef
@ 2019-11-18 15:43 ` Qais Yousef
2019-11-18 15:53 ` Steven Rostedt
2 siblings, 1 reply; 42+ messages in thread
From: Qais Yousef @ 2019-11-18 15:43 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
Hi Steve
On 10/28/19 14:01, Steven Rostedt wrote:
> On Mon, 28 Oct 2019 15:37:49 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Wed, Oct 09, 2019 at 11:46:11AM +0100, Qais Yousef wrote:
> > > Capacity Awareness refers to the fact that on heterogeneous systems
> > > (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> > > when placing tasks we need to be aware of this difference of CPU
> > > capacities.
> > >
> > > [...]
> > >
> > > Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> >
> > Works for me; Steve you OK with this?
>
> Nothing against it, but I want to take a deeper look before we accept
> it. Are you OK in waiting a week? I'm currently at Open Source Summit
> and still have two more talks to write (giving them Thursday). I wont
> have time to look till next week.
Apologies if I am being too pushy. But I'm not sure whether this is waiting for
its turn in the queue or has slipped through the cracks, hence another gentle
reminder in case it's the latter :-)
Many thanks
--
Qais Yousef
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-11-18 15:43 ` Qais Yousef
@ 2019-11-18 15:53 ` Steven Rostedt
2019-11-18 16:12 ` Qais Yousef
0 siblings, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2019-11-18 15:53 UTC (permalink / raw)
To: Qais Yousef
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On Mon, 18 Nov 2019 15:43:35 +0000
Qais Yousef <qais.yousef@arm.com> wrote:
>
> >
> > Nothing against it, but I want to take a deeper look before we accept
> > it. Are you OK in waiting a week? I'm currently at Open Source Summit
> > and still have two more talks to write (giving them Thursday). I wont
> > have time to look till next week.
>
> Apologies if I am being too pushy. But not sure whether this is waiting for its
> turn in the queue or slipped through the cracks, hence another gentle reminder
> in case it's the latter :-)
No you are not being too pushy, my apologies to you, it's been quite
hectic lately (both from a business and personal stand point). I'll
look at it now.
Thanks, and sorry for the delay :-(
-- Steve
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-11-18 15:53 ` Steven Rostedt
@ 2019-11-18 16:12 ` Qais Yousef
0 siblings, 0 replies; 42+ messages in thread
From: Qais Yousef @ 2019-11-18 16:12 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On 11/18/19 10:53, Steven Rostedt wrote:
> On Mon, 18 Nov 2019 15:43:35 +0000
> Qais Yousef <qais.yousef@arm.com> wrote:
>
> >
> > >
> > > Nothing against it, but I want to take a deeper look before we accept
> > > it. Are you OK in waiting a week? I'm currently at Open Source Summit
> > > and still have two more talks to write (giving them Thursday). I won't
> > > have time to look till next week.
> >
> > Apologies if I am being too pushy. But I'm not sure whether this is waiting for
> > its turn in the queue or has slipped through the cracks, hence another gentle
> > reminder in case it's the latter :-)
>
> No, you are not being too pushy; my apologies to you, it's been quite
> hectic lately (both from a business and a personal standpoint). I'll
> look at it now.
>
> Thanks, and sorry for the delay :-(
No worries! I appreciate that you have to juggle a lot of things at once. As
long as it's in the queue and isn't accidentally dropped, then it is what it
is.
Many thanks
--
Qais Yousef
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-09 10:46 [PATCH v2] sched: rt: Make RT capacity aware Qais Yousef
2019-10-23 12:34 ` Qais Yousef
2019-10-28 14:37 ` Peter Zijlstra
@ 2019-10-29 8:13 ` Vincent Guittot
2019-10-29 11:02 ` Qais Yousef
2019-10-30 11:57 ` Dietmar Eggemann
` (3 subsequent siblings)
6 siblings, 1 reply; 42+ messages in thread
From: Vincent Guittot @ 2019-10-29 8:13 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Juri Lelli,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On Wed, 9 Oct 2019 at 12:46, Qais Yousef <qais.yousef@arm.com> wrote:
>
> Capacity Awareness refers to the fact that on heterogeneous systems
> (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> when placing tasks we need to be aware of this difference of CPU
> capacities.
>
> In such scenarios we want to ensure that the selected CPU has enough
> capacity to meet the requirement of the running task. Enough capacity
> means here that capacity_orig_of(cpu) >= task.requirement.
>
> The definition of task.requirement is dependent on the scheduling class.
>
> For CFS, utilization is used to select a CPU that has a capacity value >=
> cfs_task.util.
>
> capacity_orig_of(cpu) >= cfs_task.util
>
> DL isn't capacity aware at the moment but can make use of the bandwidth
> reservation to implement that in a similar manner to how CFS uses utilization.
> The following patchset implements that:
>
> https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
>
> capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
>
> For RT we don't have a per task utilization signal and we lack any
> information in general about what performance requirement the RT task
> needs. But with the introduction of uclamp, RT tasks can now control
> that by setting uclamp_min to guarantee a minimum performance point.
>
> ATM the uclamp values are only used for frequency selection; but on
> heterogeneous systems this is not enough and we need to ensure that the
> capacity of the CPU is >= uclamp_min. Which is what is implemented here.
>
> capacity_orig_of(cpu) >= rt_task.uclamp_min
>
> Note that by default uclamp.min is 1024, which means that RT tasks will
> always be biased towards the big CPUs, which makes for a better, more
> predictable behavior for the default case.
hmm... big cores are not always the best choice for rt tasks; they
generally take more time to wake up or to switch context because of
the pipeline depth and other branch-prediction effects
>
> Must stress that the bias acts as a hint rather than a definite
> placement strategy. For example, if all big cores are busy executing
> other RT tasks we can't guarantee that a new RT task will be placed
> there.
>
> On non-heterogeneous systems the original behavior of RT should be
> retained. Similarly if uclamp is not selected in the config.
>
> Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> ---
>
> Changes in v2:
> - Use cpupri_find() to check the fitness of the task instead of
> sprinkling find_lowest_rq() with several checks of
> rt_task_fits_capacity().
>
> The selected implementation opted to pass the fitness function as an
> argument rather than call rt_task_fits_capacity() directly, which is
> cleaner to keep the logical separation of the 2 modules; but it
> means the compiler has less room to optimize rt_task_fits_capacity()
> out when it's a constant value.
>
> The logic is not perfect. For example, if a 'small' task is occupying a big CPU
> and another big task wakes up, we won't force-migrate the small task to clear
> the big cpu for the big task that woke up.
>
> IOW, the logic is best effort and can't give hard guarantees. But it improves the
> current situation where a task can randomly end up on any CPU regardless of
> what it needs. ie: without this patch an RT task can wake up on a big or small
> CPU, but with this it will always wake up on a big CPU (assuming the big CPUs
> aren't overloaded) - hence providing consistent performance.
>
> I'm looking at ways to improve this best effort, but this patch should be
> a good start to discuss our Capacity Awareness requirement. There's a trade-off
> of complexity to be made here and I'd like to keep things as simple as
> possible and build on top as needed.
>
>
> kernel/sched/cpupri.c | 23 ++++++++++--
> kernel/sched/cpupri.h | 4 ++-
> kernel/sched/rt.c | 81 +++++++++++++++++++++++++++++++++++--------
> 3 files changed, 91 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index b7abca987d94..799791c01d60 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -57,7 +57,8 @@ static int convert_prio(int prio)
> * Return: (int)bool - CPUs were found
> */
> int cpupri_find(struct cpupri *cp, struct task_struct *p,
> - struct cpumask *lowest_mask)
> + struct cpumask *lowest_mask,
> + bool (*fitness_fn)(struct task_struct *p, int cpu))
> {
> int idx = 0;
> int task_pri = convert_prio(p->prio);
> @@ -98,6 +99,8 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
> continue;
>
> if (lowest_mask) {
> + int cpu;
> +
> cpumask_and(lowest_mask, p->cpus_ptr, vec->mask);
>
> /*
> @@ -108,7 +111,23 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
> * condition, simply act as though we never hit this
> * priority level and continue on.
> */
> - if (cpumask_any(lowest_mask) >= nr_cpu_ids)
> + if (cpumask_empty(lowest_mask))
> + continue;
> +
> + if (!fitness_fn)
> + return 1;
> +
> + /* Ensure the capacity of the CPUs fit the task */
> + for_each_cpu(cpu, lowest_mask) {
> + if (!fitness_fn(p, cpu))
> + cpumask_clear_cpu(cpu, lowest_mask);
> + }
> +
> + /*
> + * If no CPU at the current priority can fit the task
> + * continue looking
> + */
> + if (cpumask_empty(lowest_mask))
> continue;
> }
>
> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index 7dc20a3232e7..32dd520db11f 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -18,7 +18,9 @@ struct cpupri {
> };
>
> #ifdef CONFIG_SMP
> -int cpupri_find(struct cpupri *cp, struct task_struct *p, struct cpumask *lowest_mask);
> +int cpupri_find(struct cpupri *cp, struct task_struct *p,
> + struct cpumask *lowest_mask,
> + bool (*fitness_fn)(struct task_struct *p, int cpu));
> void cpupri_set(struct cpupri *cp, int cpu, int pri);
> int cpupri_init(struct cpupri *cp);
> void cpupri_cleanup(struct cpupri *cp);
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ebaa4e619684..3a68054e15b3 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -437,6 +437,45 @@ static inline int on_rt_rq(struct sched_rt_entity *rt_se)
> return rt_se->on_rq;
> }
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/*
> + * Verify the fitness of task @p to run on @cpu taking into account the uclamp
> + * settings.
> + *
> + * This check is only important for heterogeneous systems where uclamp_min value
> + * is higher than the capacity of a @cpu. For non-heterogeneous system this
> + * function will always return true.
> + *
> + * The function will return true if the capacity of the @cpu is >= the
> + * uclamp_min and false otherwise.
> + *
> + * Note that uclamp_min will be clamped to uclamp_max if uclamp_min
> + * > uclamp_max.
> + */
> +inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> + unsigned int min_cap;
> + unsigned int max_cap;
> + unsigned int cpu_cap;
> +
> + /* Only heterogeneous systems can benefit from this check */
> + if (!static_branch_unlikely(&sched_asym_cpucapacity))
> + return true;
> +
> + min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> + max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> +
> + cpu_cap = capacity_orig_of(cpu);
> +
> + return cpu_cap >= min(min_cap, max_cap);
> +}
> +#else
> +static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> + return true;
> +}
> +#endif
> +
> #ifdef CONFIG_RT_GROUP_SCHED
>
> static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
> @@ -1391,6 +1430,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> {
> struct task_struct *curr;
> struct rq *rq;
> + bool test;
>
> /* For anything but wake ups, just return the task_cpu */
> if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
> @@ -1422,10 +1462,16 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> *
> * This test is optimistic, if we get it wrong the load-balancer
> * will have to sort it out.
> + *
> + * We take into account the capacity of the cpu to ensure it fits the
> + * requirement of the task - which is only important on heterogeneous
> + * systems like big.LITTLE.
> */
> - if (curr && unlikely(rt_task(curr)) &&
> - (curr->nr_cpus_allowed < 2 ||
> - curr->prio <= p->prio)) {
> + test = curr &&
> + unlikely(rt_task(curr)) &&
> + (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
> +
> + if (test || !rt_task_fits_capacity(p, cpu)) {
> int target = find_lowest_rq(p);
>
> /*
> @@ -1449,7 +1495,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
> * let's hope p can move out.
> */
> if (rq->curr->nr_cpus_allowed == 1 ||
> - !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
> + !cpupri_find(&rq->rd->cpupri, rq->curr, NULL, NULL))
> return;
>
> /*
> @@ -1457,7 +1503,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
> * see if it is pushed or pulled somewhere else.
> */
> if (p->nr_cpus_allowed != 1
> - && cpupri_find(&rq->rd->cpupri, p, NULL))
> + && cpupri_find(&rq->rd->cpupri, p, NULL, NULL))
> return;
>
> /*
> @@ -1600,7 +1646,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_fla
> static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
> {
> if (!task_running(rq, p) &&
> - cpumask_test_cpu(cpu, p->cpus_ptr))
> + cpumask_test_cpu(cpu, p->cpus_ptr) &&
> + rt_task_fits_capacity(p, cpu))
> return 1;
>
> return 0;
> @@ -1642,7 +1689,8 @@ static int find_lowest_rq(struct task_struct *task)
> if (task->nr_cpus_allowed == 1)
> return -1; /* No other targets possible */
>
> - if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
> + if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask,
> + rt_task_fits_capacity))
> return -1; /* No targets found */
>
> /*
> @@ -2146,12 +2194,14 @@ static void pull_rt_task(struct rq *this_rq)
> */
> static void task_woken_rt(struct rq *rq, struct task_struct *p)
> {
> - if (!task_running(rq, p) &&
> - !test_tsk_need_resched(rq->curr) &&
> - p->nr_cpus_allowed > 1 &&
> - (dl_task(rq->curr) || rt_task(rq->curr)) &&
> - (rq->curr->nr_cpus_allowed < 2 ||
> - rq->curr->prio <= p->prio))
> + bool need_to_push = !task_running(rq, p) &&
> + !test_tsk_need_resched(rq->curr) &&
> + p->nr_cpus_allowed > 1 &&
> + (dl_task(rq->curr) || rt_task(rq->curr)) &&
> + (rq->curr->nr_cpus_allowed < 2 ||
> + rq->curr->prio <= p->prio);
> +
> + if (need_to_push || !rt_task_fits_capacity(p, cpu_of(rq)))
> push_rt_tasks(rq);
> }
>
> @@ -2223,7 +2273,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
> */
> if (task_on_rq_queued(p) && rq->curr != p) {
> #ifdef CONFIG_SMP
> - if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
> + bool need_to_push = rq->rt.overloaded ||
> + !rt_task_fits_capacity(p, cpu_of(rq));
> +
> + if (p->nr_cpus_allowed > 1 && need_to_push)
> rt_queue_push_tasks(rq);
> #endif /* CONFIG_SMP */
> if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
> --
> 2.17.1
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-29 8:13 ` Vincent Guittot
@ 2019-10-29 11:02 ` Qais Yousef
2019-10-29 11:17 ` Vincent Guittot
0 siblings, 1 reply; 42+ messages in thread
From: Qais Yousef @ 2019-10-29 11:02 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Juri Lelli,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On 10/29/19 09:13, Vincent Guittot wrote:
> On Wed, 9 Oct 2019 at 12:46, Qais Yousef <qais.yousef@arm.com> wrote:
> >
> > Capacity Awareness refers to the fact that on heterogeneous systems
> > (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> > when placing tasks we need to be aware of this difference of CPU
> > capacities.
> >
> > In such scenarios we want to ensure that the selected CPU has enough
> > capacity to meet the requirement of the running task. Enough capacity
> > means here that capacity_orig_of(cpu) >= task.requirement.
> >
> > The definition of task.requirement is dependent on the scheduling class.
> >
> > For CFS, utilization is used to select a CPU that has a capacity value >=
> > cfs_task.util.
> >
> > capacity_orig_of(cpu) >= cfs_task.util
> >
> > DL isn't capacity aware at the moment but can make use of the bandwidth
> > reservation to implement that in a similar manner to how CFS uses utilization.
> > The following patchset implements that:
> >
> > https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
> >
> > capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
> >
> > For RT we don't have a per task utilization signal and we lack any
> > information in general about what performance requirement the RT task
> > needs. But with the introduction of uclamp, RT tasks can now control
> > that by setting uclamp_min to guarantee a minimum performance point.
> >
> > ATM the uclamp values are only used for frequency selection; but on
> > heterogeneous systems this is not enough and we need to ensure that the
> > capacity of the CPU is >= uclamp_min. Which is what is implemented here.
> >
> > capacity_orig_of(cpu) >= rt_task.uclamp_min
> >
> > Note that by default uclamp.min is 1024, which means that RT tasks will
> > always be biased towards the big CPUs, which makes for a better, more
> > predictable behavior for the default case.
>
> hmm... big cores are not always the best choice for rt tasks; they
> generally take more time to wake up or to switch context because of
> the pipeline depth and other branch-prediction effects
Can you quantify this into a number? I suspect this latency should be in the
200-300us range. And the difference between little and big should be much
smaller than that, no? We can't give guarantees of that order in Linux in
general, and serious real-time users have to do extra tweaks like
disabling power management, which can introduce latency and hinder determinism.
Besides enabling PREEMPT_RT.
For generic systems a few ms is the best we can give, and we can easily fall out
of this without any tweaks.
The choice of going to the maximum performance point in the system for RT tasks
by default goes beyond this patch anyway. I'm just making it consistent here
since we have different performance levels and RT didn't understand this
before.
So what I'm doing here is just making things consistent rather than changing the
default.
What do you suggest?
Thanks
--
Qais Yousef
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-29 11:02 ` Qais Yousef
@ 2019-10-29 11:17 ` Vincent Guittot
2019-10-29 11:48 ` Qais Yousef
0 siblings, 1 reply; 42+ messages in thread
From: Vincent Guittot @ 2019-10-29 11:17 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Juri Lelli,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On Tue, 29 Oct 2019 at 12:02, Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 10/29/19 09:13, Vincent Guittot wrote:
> > On Wed, 9 Oct 2019 at 12:46, Qais Yousef <qais.yousef@arm.com> wrote:
> > >
> > > Capacity Awareness refers to the fact that on heterogeneous systems
> > > (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> > > when placing tasks we need to be aware of this difference of CPU
> > > capacities.
> > >
> > > In such scenarios we want to ensure that the selected CPU has enough
> > > capacity to meet the requirement of the running task. Enough capacity
> > > means here that capacity_orig_of(cpu) >= task.requirement.
> > >
> > > The definition of task.requirement is dependent on the scheduling class.
> > >
> > > For CFS, utilization is used to select a CPU that has a capacity value >=
> > > cfs_task.util.
> > >
> > > capacity_orig_of(cpu) >= cfs_task.util
> > >
> > > DL isn't capacity aware at the moment but can make use of the bandwidth
> > > reservation to implement that in a similar manner to how CFS uses utilization.
> > > The following patchset implements that:
> > >
> > > https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
> > >
> > > capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
> > >
> > > For RT we don't have a per task utilization signal and we lack any
> > > information in general about what performance requirement the RT task
> > > needs. But with the introduction of uclamp, RT tasks can now control
> > > that by setting uclamp_min to guarantee a minimum performance point.
> > >
> > > ATM the uclamp values are only used for frequency selection; but on
> > > heterogeneous systems this is not enough and we need to ensure that the
> > > capacity of the CPU is >= uclamp_min. Which is what is implemented here.
> > >
> > > capacity_orig_of(cpu) >= rt_task.uclamp_min
> > >
> > > Note that by default uclamp.min is 1024, which means that RT tasks will
> > > always be biased towards the big CPUs, which makes for a better, more
> > > predictable behavior for the default case.
> >
> > hmm... big cores are not always the best choice for rt tasks; they
> > generally take more time to wake up or to switch context because of
> > the pipeline depth and other branch-prediction effects
>
> Can you quantify this into a number? I suspect this latency should be in the
As a general rule, we pinned IRQs on little cores because of such a
responsiveness difference. I don't have numbers in mind as the tests
were run at the beginning of b.L systems, a few years ago.
Then, if you look at some idle state definitions in DT, you will see
that the exit latency of the cluster-down state of a big core on hikey960 is
2900us vs 1600us for a little one
> 200-300us range. And the difference between little and big should be much
> smaller than that, no? We can't give guarantees of that order in Linux in
> general, and serious real-time users have to do extra tweaks like
> disabling power management, which can introduce latency and hinder determinism.
> Besides enabling PREEMPT_RT.
>
> For generic systems a few ms is the best we can give, and we can easily fall out
> of this without any tweaks.
>
> The choice of going to the maximum performance point in the system for RT tasks
> by default goes beyond this patch anyway. I'm just making it consistent here
> since we have different performance levels and RT didn't understand this
> before.
>
> So what I'm doing here is just making things consistent rather than changing the
> default.
>
> What do you suggest?
Making big cores the default CPUs for all RT tasks is not a minor
change and IMO locality should stay the default behavior when there is
no uclamp constraint
>
> Thanks
>
> --
> Qais Yousef
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-29 11:17 ` Vincent Guittot
@ 2019-10-29 11:48 ` Qais Yousef
2019-10-29 12:20 ` Vincent Guittot
0 siblings, 1 reply; 42+ messages in thread
From: Qais Yousef @ 2019-10-29 11:48 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Juri Lelli,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On 10/29/19 12:17, Vincent Guittot wrote:
> On Tue, 29 Oct 2019 at 12:02, Qais Yousef <qais.yousef@arm.com> wrote:
> >
> > On 10/29/19 09:13, Vincent Guittot wrote:
> > > On Wed, 9 Oct 2019 at 12:46, Qais Yousef <qais.yousef@arm.com> wrote:
> > > >
> > > > Capacity Awareness refers to the fact that on heterogeneous systems
> > > > (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> > > > when placing tasks we need to be aware of this difference of CPU
> > > > capacities.
> > > >
> > > > In such scenarios we want to ensure that the selected CPU has enough
> > > > capacity to meet the requirement of the running task. Enough capacity
> > > > means here that capacity_orig_of(cpu) >= task.requirement.
> > > >
> > > > The definition of task.requirement is dependent on the scheduling class.
> > > >
> > > > For CFS, utilization is used to select a CPU that has a capacity value >=
> > > > cfs_task.util.
> > > >
> > > > capacity_orig_of(cpu) >= cfs_task.util
> > > >
> > > > DL isn't capacity aware at the moment but can make use of the bandwidth
> > > > reservation to implement that in a similar manner to how CFS uses utilization.
> > > > The following patchset implements that:
> > > >
> > > > https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
> > > >
> > > > capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
> > > >
> > > > For RT we don't have a per task utilization signal and we lack any
> > > > information in general about what performance requirement the RT task
> > > > needs. But with the introduction of uclamp, RT tasks can now control
> > > > that by setting uclamp_min to guarantee a minimum performance point.
> > > >
> > > > ATM the uclamp values are only used for frequency selection; but on
> > > > heterogeneous systems this is not enough and we need to ensure that the
> > > > capacity of the CPU is >= uclamp_min. Which is what is implemented here.
> > > >
> > > > capacity_orig_of(cpu) >= rt_task.uclamp_min
> > > >
> > > > Note that by default uclamp.min is 1024, which means that RT tasks will
> > > > always be biased towards the big CPUs, which makes for a better, more
> > > > predictable behavior for the default case.
> > >
> > > hmm... big cores are not always the best choice for rt tasks; they
> > > generally take more time to wake up or to switch context because of
> > > the pipeline depth and other branch-prediction effects
> >
> > Can you quantify this into a number? I suspect this latency should be in the
>
> As a general rule, we pinned IRQs on little cores because of such a
> responsiveness difference. I don't have numbers in mind as the tests
> were run at the beginning of b.L systems, a few years ago.
> Then, if you look at some idle state definitions in DT, you will see
> that the exit latency of the cluster-down state of a big core on hikey960 is
> 2900us vs 1600us for a little one
I don't think hikey960 is a good system to use as a reference. SD845 shows more
sensible numbers
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/qcom/sdm845.dtsi?h=v5.4-rc5#n407
>
> > 200-300us range. And the difference between little and big should be much
> > smaller than that, no? We can't give guarantees of that order in Linux in
> > general, and serious real-time users have to do extra tweaks like
> > disabling power management, which can introduce latency and hinder determinism.
> > Besides enabling PREEMPT_RT.
> >
> > For generic systems a few ms is the best we can give, and we can easily fall out
> > of this without any tweaks.
> >
> > The choice of going to the maximum performance point in the system for RT tasks
> > by default goes beyond this patch anyway. I'm just making it consistent here
> > since we have different performance levels and RT didn't understand this
> > before.
> >
> > So what I'm doing here is just making things consistent rather than changing the
> > default.
> >
> > What do you suggest?
>
> Making big cores the default CPUs for all RT tasks is not a minor
> change and IMO locality should stay the default behavior when there is
> no uclamp constraint
How is this affecting locality? The task will always go to the big core, so it
should be local.
And before introducing uclamp the default was going to the maximum frequency
anyway - which is the highest performance point. So what this does is basically
make sure that if we asked for a 1024 capacity, we get that.
Besides, the decision is taken by the setup of the system-wide uclamp.min. We
can change this to be something smaller, but I don't think we can come up with
a better value by default. Admins should tune this to something smaller if the
performance of their little cores is good for their needs.
What this patch says is: if I want the uclamp.min of my RT task to be 1024, then we
give better guarantees that it'll get the 1024 performance it asked for. And the
default of 1024 is consistent with what Linux has always done for RT out of the
box.
--
Qais Yousef
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-29 11:48 ` Qais Yousef
@ 2019-10-29 12:20 ` Vincent Guittot
2019-10-29 12:46 ` Qais Yousef
0 siblings, 1 reply; 42+ messages in thread
From: Vincent Guittot @ 2019-10-29 12:20 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Juri Lelli,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On Tue, 29 Oct 2019 at 12:48, Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 10/29/19 12:17, Vincent Guittot wrote:
> > On Tue, 29 Oct 2019 at 12:02, Qais Yousef <qais.yousef@arm.com> wrote:
> > >
> > > On 10/29/19 09:13, Vincent Guittot wrote:
> > > > On Wed, 9 Oct 2019 at 12:46, Qais Yousef <qais.yousef@arm.com> wrote:
> > > > >
> > > > > Capacity Awareness refers to the fact that on heterogeneous systems
> > > > > (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> > > > > when placing tasks we need to be aware of this difference of CPU
> > > > > capacities.
> > > > >
> > > > > In such scenarios we want to ensure that the selected CPU has enough
> > > > > capacity to meet the requirement of the running task. Enough capacity
> > > > > means here that capacity_orig_of(cpu) >= task.requirement.
> > > > >
> > > > > The definition of task.requirement is dependent on the scheduling class.
> > > > >
> > > > > For CFS, utilization is used to select a CPU that has a capacity value >=
> > > > > cfs_task.util.
> > > > >
> > > > > capacity_orig_of(cpu) >= cfs_task.util
> > > > >
> > > > > DL isn't capacity aware at the moment but can make use of the bandwidth
> > > > > reservation to implement that in a similar manner to how CFS uses utilization.
> > > > > The following patchset implements that:
> > > > >
> > > > > https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
> > > > >
> > > > > capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
> > > > >
> > > > > For RT we don't have a per task utilization signal and we lack any
> > > > > information in general about what performance requirement the RT task
> > > > > needs. But with the introduction of uclamp, RT tasks can now control
> > > > > that by setting uclamp_min to guarantee a minimum performance point.
> > > > >
> > > > > ATM the uclamp values are only used for frequency selection; but on
> > > > > heterogeneous systems this is not enough and we need to ensure that the
> > > > > capacity of the CPU is >= uclamp_min. Which is what is implemented here.
> > > > >
> > > > > capacity_orig_of(cpu) >= rt_task.uclamp_min
> > > > >
> > > > > Note that by default uclamp.min is 1024, which means that RT tasks will
> > > > > always be biased towards the big CPUs, which makes for a better, more
> > > > > predictable behavior for the default case.
> > > >
> > > > hmm... big cores are not always the best choice for rt tasks; they
> > > > generally take more time to wake up or to switch context because of
> > > > the pipeline depth and other branch-prediction effects
> > >
> > > Can you quantify this into a number? I suspect this latency should be in the
> >
> > As a general rule, we pinned IRQs on little cores because of such a
> > responsiveness difference. I don't have numbers in mind as the tests
> > were run at the beginning of b.L systems, a few years ago.
> > Then, if you look at some idle state definitions in DT, you will see
> > that the exit latency of the cluster-down state of a big core on hikey960 is
> > 2900us vs 1600us for a little one
>
> I don't think hikey960 is a good system to use as a reference. SD845 shows more
> sensible numbers
It is not worse than any other
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/qcom/sdm845.dtsi?h=v5.4-rc5#n407
>
> >
> > > 200-300us range. And the difference between little and big should be much
> > > smaller than that, no? We can't give guarantees of that order in Linux in
> > > general, and serious real-time users have to do extra tweaks like
> > > disabling power management, which can introduce latency and hinder determinism.
> > > Besides enabling PREEMPT_RT.
> > >
> > > For generic systems a few ms is the best we can give, and we can easily fall out
> > > of this without any tweaks.
> > >
> > > The choice of going to the maximum performance point in the system for RT tasks
> > > by default goes beyond this patch anyway. I'm just making it consistent here
> > > since we have different performance levels and RT didn't understand this
> > > before.
> > >
> > > So what I'm doing here is just making things consistent rather than changing the
> > > default.
> > >
> > > What do you suggest?
> >
> > Making big cores the default CPUs for all RT tasks is not a minor
> > change and IMO locality should stay the default behavior when there is
> > no uclamp constraint
>
> How is this affecting locality? The task will always go to the big core, so it
> should be local.
local with the waker.
You will force the rt task to run on the big cluster although the waker, data and
interrupts can be on the little one.
So making the big core the default is far from always being the best choice
>
> And before introducing uclamp the default was going to the maximum frequency
> anyway - which is the highest performance point. So what this does is basically
> make sure that if we asked for a 1024 capacity, we get that.
>
> Besides, the decision is taken by the setup of the system-wide uclamp.min. We
> can change this to be something smaller, but I don't think we can come up with
> a better value by default. Admins should tune this to something smaller if the
> performance of their little cores is good for their needs.
>
> What this patch says is: if I want the uclamp.min of my RT task to be 1024, then we
> give better guarantees that it'll get the 1024 performance it asked for. And the
> default of 1024 is consistent with what Linux has always done for RT out of the
> box.
>
> --
> Qais Yousef
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-29 12:20 ` Vincent Guittot
@ 2019-10-29 12:46 ` Qais Yousef
2019-10-29 12:54 ` Vincent Guittot
0 siblings, 1 reply; 42+ messages in thread
From: Qais Yousef @ 2019-10-29 12:46 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Juri Lelli,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On 10/29/19 13:20, Vincent Guittot wrote:
> > > Making big cores the default CPUs for all RT tasks is not a minor
> > > change and IMO locality should stay the default behavior when there is
> > > no uclamp constraint
> >
> > How this is affecting locality? The task will always go to the big core, so it
> > should be local.
>
> local with the waker
> You will force rt task to run on big cluster although waker, data and
> interrupts can be on little one.
> So making big core as default is far from always being the best choice
This is loaded with assumptions IMO. AFAICT we don't know what the best
choice is.
First, the value of uclamp.min is outside the scope of this patch. Unless
what you're saying is that when uclamp.min is 1024 we should NOT choose a
big CPU, there's no disagreement about what this patch does. If that's what
you're objecting to, please be more specific about how you see this working
instead.
If your objection is purely based on the choice of uclamp.min: I agree that
on modern systems the little cores are good enough for the majority of RT
tasks in average Android systems, but I don't feel confident reaching this
conclusion on low-end systems, where the little cores often don't have
enough grunt. So I see the current default as adequate; the responsibility
for further tweaking lies in the hands of the system admin.
--
Qais Yousef
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-29 12:46 ` Qais Yousef
@ 2019-10-29 12:54 ` Vincent Guittot
2019-10-29 13:02 ` Peter Zijlstra
2019-10-29 20:36 ` Patrick Bellasi
0 siblings, 2 replies; 42+ messages in thread
From: Vincent Guittot @ 2019-10-29 12:54 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Juri Lelli,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On Tue, 29 Oct 2019 at 13:46, Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 10/29/19 13:20, Vincent Guittot wrote:
> > > > Making big cores the default CPUs for all RT tasks is not a minor
> > > > change and IMO locality should stay the default behavior when there is
> > > > no uclamp constraint
> > >
> > > How this is affecting locality? The task will always go to the big core, so it
> > > should be local.
> >
> > local with the waker
> > You will force rt task to run on big cluster although waker, data and
> > interrupts can be on little one.
> > So making big core as default is far from always being the best choice
>
> This is loaded with assumptions IMO. AFAICT we don't know what's the best
> choice.
>
> First, the value of uclamp.min is outside of the scope of this patch. Unless
> what you're saying is that when uclamp.min is 1024 then we should NOT choose a
> big cpu then there's no disagreement about what this patch do. If that's what
> you're objecting to please be more specific about how do you see this working
> instead.
My point is that this patch makes the big cores the default CPUs for
RT tasks, which is far from being a minor change and far from being an
obviously good default choice.
>
> If your objection is purely based on the choice of uclamp.min then while
> I agree that on modern systems the little cores are good enough for the
> majority of RT tasks in average Android systems. But I don't feel confident to
> reach this conclusion on low end systems where the little core doesn't have
> enough grunt in many cases. So I see the current default is adequate and the
> responsibility of further tweaking lies within the hands of the system admin.
>
> --
> Qais Yousef
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-29 12:54 ` Vincent Guittot
@ 2019-10-29 13:02 ` Peter Zijlstra
2019-10-29 20:36 ` Patrick Bellasi
1 sibling, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2019-10-29 13:02 UTC (permalink / raw)
To: Vincent Guittot
Cc: Qais Yousef, Ingo Molnar, Steven Rostedt, Juri Lelli,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On Tue, Oct 29, 2019 at 01:54:24PM +0100, Vincent Guittot wrote:
> On Tue, 29 Oct 2019 at 13:46, Qais Yousef <qais.yousef@arm.com> wrote:
> >
> > On 10/29/19 13:20, Vincent Guittot wrote:
> > > > > Making big cores the default CPUs for all RT tasks is not a minor
> > > > > change and IMO locality should stay the default behavior when there is
> > > > > no uclamp constraint
> > > >
> > > > How this is affecting locality? The task will always go to the big core, so it
> > > > should be local.
> > >
> > > local with the waker
> > > You will force rt task to run on big cluster although waker, data and
> > > interrupts can be on little one.
> > > So making big core as default is far from always being the best choice
> >
> > This is loaded with assumptions IMO. AFAICT we don't know what's the best
> > choice.
> >
> > First, the value of uclamp.min is outside of the scope of this patch. Unless
> > what you're saying is that when uclamp.min is 1024 then we should NOT choose a
> > big cpu then there's no disagreement about what this patch do. If that's what
> > you're objecting to please be more specific about how do you see this working
> > instead.
>
> My point is that this patch makes the big cores the default CPUs for
> RT tasks which is far from being a minor change and far from being an
> obvious default good choice
FIFO/RR tasks don't have a bandwidth specification (barring uclamp),
therefore we must assume the worst. This is the same principle that has
them select max_freq all the time.
I think it is a very natural extension of that very principle to place
(otherwise unconstrained) RT tasks on big cores.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-29 12:54 ` Vincent Guittot
2019-10-29 13:02 ` Peter Zijlstra
@ 2019-10-29 20:36 ` Patrick Bellasi
2019-10-30 8:04 ` Vincent Guittot
1 sibling, 1 reply; 42+ messages in thread
From: Patrick Bellasi @ 2019-10-29 20:36 UTC (permalink / raw)
To: Vincent Guittot
Cc: Qais Yousef, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
linux-kernel
Hi Vincent, Qais,
On 29-Oct 13:54, Vincent Guittot wrote:
> On Tue, 29 Oct 2019 at 13:46, Qais Yousef <qais.yousef@arm.com> wrote:
> > On 10/29/19 13:20, Vincent Guittot wrote:
> > > > > Making big cores the default CPUs for all RT tasks is not a minor
> > > > > change and IMO locality should stay the default behavior when there is
> > > > > no uclamp constraint
The default value for the system-wide uclamp.min is 1024 because by
default _we want_ RT tasks running at MAX_OPP. This means that by
default RT tasks are running _under constraints_.
The "no uclamp constraints" case you mention requires that you set
uclamp.min=0 for that task. In that case, this patch will do exactly
what you ask for: locality is preserved.
> > > > How this is affecting locality? The task will always go to the big core, so it
> > > > should be local.
> > >
> > > local with the waker
> > > You will force rt task to run on big cluster although waker, data and
> > > interrupts can be on little one.
> > > So making big core as default is far from always being the best choice
> >
> > This is loaded with assumptions IMO. AFAICT we don't know what's the best
> > choice.
Agree... more on that after...
> > First, the value of uclamp.min is outside of the scope of this patch. Unless
> > what you're saying is that when uclamp.min is 1024 then we should NOT choose a
> > big cpu then there's no disagreement about what this patch do. If that's what
> > you're objecting to please be more specific about how do you see this working
> > instead.
>
> My point is that this patch makes the big cores the default CPUs for
> RT tasks which is far from being a minor change and far from being
> an obvious default good choice
Some time ago we agreed that going to MAX_OPP for RT tasks was
"mandatory". That was definitely a big change, likely much more
impacting than the one proposed by this patch.
On many mobile devices we ended up pinning RT tasks on LITTLE cores
(mainly) to save quite a lot of energy by avoiding the case of big
CPUs randomly spiking to MAX_OPP just because of a small RT task
waking up on them. We also added some heuristic in schedutil as a
"band aid" for the effects of the aforementioned choice.
By running RT on LITTLEs there could be also some wakeup latency
improvement? Yes, maybe... would be interesting to have some real
HW *and* SW use-case on hand to compare.
However, we know that RT is all about "latency", but what is a bit
more fuzzy is the definition of "latency":
A) wakeup-latency
From a scheduler standpoint it's quite often considered as the
time it takes to "wakeup" a task and actually start executing its
instructions.
B) completion-time
From an app standpoint, what's quite often important is the time to
complete the task activation and go back to sleep.
Running at MAX_OPP looks much more related to the need to complete
fast than waking up fast, especially considering that that decision
was taken looking mainly (perhaps only) to SMP systems.
On heterogeneous systems, "wakeup-latency" and "completion-time" are
two metrics which *maybe* can be better served by different cores.
However, it's very difficult to argue whether one metric is more
important than the other. It's even more difficult to quantify it
because of the multitude of HW and SW combinations.
Thus, what's proposed here can be far from being an "obvious good
choice", but it's also quite difficult to argue and prove that it's
definitely _not_ a good choice. It's just a default which:
1) keeps things simple for RT tasks by using the same default
policy for both frequency and CPUs selection
we always run (when possible) at the highest available capacity
2) it's based on a per-task/system-wide tunable mechanism
Is that not enough to justify it as a default?
Best,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-29 20:36 ` Patrick Bellasi
@ 2019-10-30 8:04 ` Vincent Guittot
2019-10-30 9:26 ` Qais Yousef
2019-10-30 12:11 ` Quentin Perret
0 siblings, 2 replies; 42+ messages in thread
From: Vincent Guittot @ 2019-10-30 8:04 UTC (permalink / raw)
To: Patrick Bellasi
Cc: Qais Yousef, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
linux-kernel
On Tue, 29 Oct 2019 at 21:36, Patrick Bellasi
<patrick.bellasi@matbug.net> wrote:
>
> Hi Vincent, Qais,
>
> On 29-Oct 13:54, Vincent Guittot wrote:
> > On Tue, 29 Oct 2019 at 13:46, Qais Yousef <qais.yousef@arm.com> wrote:
> > > On 10/29/19 13:20, Vincent Guittot wrote:
> > > > > > Making big cores the default CPUs for all RT tasks is not a minor
> > > > > > change and IMO locality should stay the default behavior when there is
> > > > > > no uclamp constraint
>
> The default value for system-wide's uclamp.min is 1024 because by
> default _we want_ RT tasks running at MAX_OPP. This means that by
> default RT tasks are running _under constraints_.
>
> The "no uclamp constraints" case you mention requires that you set
> uclamp.min=0 for that task. In that case, this patch will do exactly
> what you ask for: locality is preserved.
>
> > > > > How this is affecting locality? The task will always go to the big core, so it
> > > > > should be local.
> > > >
> > > > local with the waker
> > > > You will force rt task to run on big cluster although waker, data and
> > > > interrupts can be on little one.
> > > > So making big core as default is far from always being the best choice
> > >
> > > This is loaded with assumptions IMO. AFAICT we don't know what's the best
> > > choice.
>
> Agree... more on that after...
>
> > > First, the value of uclamp.min is outside of the scope of this patch. Unless
> > > what you're saying is that when uclamp.min is 1024 then we should NOT choose a
> > > big cpu then there's no disagreement about what this patch do. If that's what
> > > you're objecting to please be more specific about how do you see this working
> > > instead.
> >
> > My point is that this patch makes the big cores the default CPUs for
> > RT tasks which is far from being a minor change and far from being
> > an obvious default good choice
>
> Some time ago we agreed that going to MAX_OPP for RT tasks was
> "mandatory". That was defenitively a big change, likely much more
> impacting than the one proposed by this patch.
>
> On many mobile devices we ended up pinning RT tasks on LITTLE cores
> (mainly) to save quite a lot of energy by avoiding the case of big
> CPUs randomly spiking to MAX_OPP just because of a small RT task
> waking up on them. We also added some heuristic in schedutil has a
> "band aid" for the effects of the aforementioned choice.
>
> By running RT on LITTLEs there could be also some wakeup latency
> improvement? Yes, maybe... would be interesting to have some real
> HW *and* SW use-case on hand to compare.
>
> However, we know that RT is all about "latency", but what is a bit
> more fuzzy is the definition of "latency":
>
> A) wakeup-latency
> From a scheduler standpoint it's quite often considered as the the
> time it takes to "wakeup" a task and actually start executing its
> instructions.
>
> B) completion-time
> From an app standpoint, it's quite often important the time to
> complete the task activation and go back to sleep.
>
> Running at MAX_OPP looks much more related to the need to complete
> fast than waking up fast, especially considering that that decision
You will wake up faster as well when running at MAX_OPP, because
instructions will run faster, or at least as fast. That being said,
running twice as fast doesn't at all mean waking up twice as fast, but
it will certainly be faster, even though the gain can be really small.
Whereas running on a big core with more capacity doesn't mean that you
will wake up faster, because of uarch differences.
I agree that a "long" running RT task will most probably benefit from
big cores to complete earlier, but that's no longer obvious for short ones.
> was taken looking mainly (perhaps only) to SMP systems.
>
> On heterogeneous systems, "wakeup-latency" and "completion-time" are
> two metrics which *maybe* can be better served by different cores.
> However, it's very difficult to argument if one metric is more
> important than the other. It's even more difficult to quantify it
> because of the multitide of HW and SW combinations.
That's the point of my comment: choosing big cores as the default and
always-best choice is far from obvious.
And this patch changes the default behavior without studying the
impact, apart from stating that this should be OK.
>
> Thus, what's proposed here can be far from being an "obvious good
> chooce", but it's also quite difficult to argue and proof that's
> defenitively _not_ a good choice. It's just a default which:
>
> 1) keeps things simple for RT tasks by using the same default
> policy for both frequency and CPUs selection
> we always run (when possible) at the highest available capacity
>
> 2) it's based on a per-task/system-wide tunable mechanism
>
> Is that not enought to justfy it as a default?
>
> Best,
> Patrick
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-30 8:04 ` Vincent Guittot
@ 2019-10-30 9:26 ` Qais Yousef
2019-10-30 12:11 ` Quentin Perret
1 sibling, 0 replies; 42+ messages in thread
From: Qais Yousef @ 2019-10-30 9:26 UTC (permalink / raw)
To: Vincent Guittot
Cc: Patrick Bellasi, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
linux-kernel
On 10/30/19 09:04, Vincent Guittot wrote:
> On Tue, 29 Oct 2019 at 21:36, Patrick Bellasi
> <patrick.bellasi@matbug.net> wrote:
> > Some time ago we agreed that going to MAX_OPP for RT tasks was
> > "mandatory". That was defenitively a big change, likely much more
> > impacting than the one proposed by this patch.
> >
> > On many mobile devices we ended up pinning RT tasks on LITTLE cores
> > (mainly) to save quite a lot of energy by avoiding the case of big
> > CPUs randomly spiking to MAX_OPP just because of a small RT task
> > waking up on them. We also added some heuristic in schedutil has a
> > "band aid" for the effects of the aforementioned choice.
> >
> > By running RT on LITTLEs there could be also some wakeup latency
> > improvement? Yes, maybe... would be interesting to have some real
> > HW *and* SW use-case on hand to compare.
> >
> > However, we know that RT is all about "latency", but what is a bit
> > more fuzzy is the definition of "latency":
> >
> > A) wakeup-latency
> > From a scheduler standpoint it's quite often considered as the the
> > time it takes to "wakeup" a task and actually start executing its
> > instructions.
> >
> > B) completion-time
> > From an app standpoint, it's quite often important the time to
> > complete the task activation and go back to sleep.
> >
> > Running at MAX_OPP looks much more related to the need to complete
> > fast than waking up fast, especially considering that that decision
>
> You will wake up faster as well when running at MAX_OPP because
> instructions will run faster or at least as fast. That being said,
> running twice faster doesn't mean at all waking up twice faster but
> for sure it will be faster although the gain can be really short.
> Whereas running on a big core with more capacity doesn't mean that you
> will wake up faster because of uarch difference.
> I agree that "long" running rt task will most probably benefit from
> big cores to complete earlier but that no more obvious for short one.
Idle states and other power management features are a known source of latency.
This latency changes across hardware all the time and RT people are accustomed
to testing against this.
Android has a wakelock which AFAIR disables deep sleep, because in some sections
the wakeup latency can hinder throughput for some apps. So it's a known problem
outside the RT universe too.
>
> > was taken looking mainly (perhaps only) to SMP systems.
> >
> > On heterogeneous systems, "wakeup-latency" and "completion-time" are
> > two metrics which *maybe* can be better served by different cores.
> > However, it's very difficult to argument if one metric is more
> > important than the other. It's even more difficult to quantify it
> > because of the multitide of HW and SW combinations.
>
> That's the point of my comment, choosing big cores as default and
> always best choice is far from being obvious.
It's consistent and deterministic, unlike the current situation where it depends
on your luck. What you get across boots/runs is completely random, and that is
worse than what this patch offers.
The default for Linux has always been putting the system at the highest
performance point by default. And this translates to the biggest CPU at
the highest frequency. It's not ideal but consistent. This doesn't prevent
people from tweaking their systems to get what they want.
> And this patch changes the default behavior without study of the
> impact apart from stating that this should be ok
Without this patch there's no way for an RT task to guarantee a minimum
performance requirement.
I don't think there's a change of the default behavior because without this
patch we could still end up on a big CPU.
And you're stating that the difference between wakeup time on big cores and
little cores is a problem. As I stated several times, this is a known source
of latency that changes across systems, and if somebody cares about idle-state
latencies then they are probably looking at something beyond generic systems and
need to tune it to guarantee deterministic low-latency behavior.
I still don't see any proposed alternative to what should be the default
behavior.
--
Qais Yousef
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-30 8:04 ` Vincent Guittot
2019-10-30 9:26 ` Qais Yousef
@ 2019-10-30 12:11 ` Quentin Perret
1 sibling, 0 replies; 42+ messages in thread
From: Quentin Perret @ 2019-10-30 12:11 UTC (permalink / raw)
To: Vincent Guittot
Cc: Patrick Bellasi, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Steven Rostedt, Juri Lelli, Dietmar Eggemann, Ben Segall,
Mel Gorman, linux-kernel
On Wednesday 30 Oct 2019 at 09:04:44 (+0100), Vincent Guittot wrote:
> That's the point of my comment, choosing big cores as default and
> always best choice is far from being obvious.
> And this patch changes the default behavior without study of the
> impact apart from stating that this should be ok
The current behaviour is totally bogus on big.LITTLE TBH, and nobody
uses it as-is. Vendors hack RT a lot for that exact reason.
Right now an RT task gets a random CPU, which on big.LITTLE is
effectively equivalent to selecting a random OPP on SMP. And since RT is
all about predictability, I'd argue sticking to the big CPUs for
consistency with the frequency selection policy is the only sensible
default.
So +1 from me for Qais' patch :)
Thanks,
Quentin
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-09 10:46 [PATCH v2] sched: rt: Make RT capacity aware Qais Yousef
` (2 preceding siblings ...)
2019-10-29 8:13 ` Vincent Guittot
@ 2019-10-30 11:57 ` Dietmar Eggemann
2019-10-30 17:43 ` Qais Yousef
2019-11-25 21:36 ` Steven Rostedt
` (2 subsequent siblings)
6 siblings, 1 reply; 42+ messages in thread
From: Dietmar Eggemann @ 2019-10-30 11:57 UTC (permalink / raw)
To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Steven Rostedt
Cc: Juri Lelli, Vincent Guittot, Ben Segall, Mel Gorman, linux-kernel
On 09.10.19 12:46, Qais Yousef wrote:
[...]
> Changes in v2:
> - Use cpupri_find() to check the fitness of the task instead of
> sprinkling find_lowest_rq() with several checks of
> rt_task_fits_capacity().
>
> The selected implementation opted to pass the fitness function as an
> argument rather than call rt_task_fits_capacity() directly, which is
> cleaner as it keeps the logical separation of the 2 modules; but it
> means the compiler has less room to optimize rt_task_fits_capacity()
> out when it's a constant value.
I would prefer exporting rt_task_fits_capacity() sched-internally via
kernel/sched/sched.h. Less code changes and the indication whether
rt_task_fits_capacity() has to be used in cpupri_find() is already given
by lowest_mask being !NULL or NULL.
[...]
> +inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> + unsigned int min_cap;
> + unsigned int max_cap;
> + unsigned int cpu_cap;
Nit picking. Since we discussed it already,
I found this "Also please try to aggregate variables of the same type
into a single line. There is no point in wasting screen space::" ;-)
https://lore.kernel.org/r/20181107171149.165693799@linutronix.de
[...]
> @@ -2223,7 +2273,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
> */
> if (task_on_rq_queued(p) && rq->curr != p) {
> #ifdef CONFIG_SMP
> - if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
> + bool need_to_push = rq->rt.overloaded ||
> + !rt_task_fits_capacity(p, cpu_of(rq));
> +
> + if (p->nr_cpus_allowed > 1 && need_to_push)
> rt_queue_push_tasks(rq);
> #endif /* CONFIG_SMP */
> if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
What happens to an always running CFS task which switches to RT? Luca
introduced a special migrate callback (next to push and pull)
specifically to deal with this scenario. A lot of new infrastructure for
this one use case, but still, do we care for it in RT as well?
https://lore.kernel.org/r/20190506044836.2914-4-luca.abeni@santannapisa.it
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-30 11:57 ` Dietmar Eggemann
@ 2019-10-30 17:43 ` Qais Yousef
2019-11-28 13:59 ` Dietmar Eggemann
0 siblings, 1 reply; 42+ messages in thread
From: Qais Yousef @ 2019-10-30 17:43 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Juri Lelli,
Vincent Guittot, Ben Segall, Mel Gorman, linux-kernel
On 10/30/19 12:57, Dietmar Eggemann wrote:
> On 09.10.19 12:46, Qais Yousef wrote:
>
> [...]
>
> > Changes in v2:
> > - Use cpupri_find() to check the fitness of the task instead of
> > sprinkling find_lowest_rq() with several checks of
> > rt_task_fits_capacity().
> >
> > The selected implementation opted to pass the fitness function as an
> > argument rather than call rt_task_fits_capacity() capacity which is
> > a cleaner to keep the logical separation of the 2 modules; but it
> > means the compiler has less room to optimize rt_task_fits_capacity()
> > out when it's a constant value.
>
> I would prefer exporting rt_task_fits_capacity() sched-internally via
> kernel/sched/sched.h. Less code changes and the indication whether
> rt_task_fits_capacity() has to be used in cpupri_find() is already given
> by lowest_mask being !NULL or NULL.
>
I don't mind if the maintainers agree too. The reason I did it that way is that
it keeps the implementation of cpupri generic and self contained.
rt_task_fits_capacity() at the moment is a static function in rt.c
To use it in cpupri_find() I either need to make it public somewhere or have an
extern bool rt_task_fits_capacity(...);
in cpupri.c. Neither of which appealed to me personally.
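The design choice described here can be sketched as follows. This is a hypothetical user-space model, not the kernel's actual cpupri_find(): cpupri_filter(), fitness_fn_t, the bool-array mask and NR_CPUS are all stand-ins, meant only to show how taking the fitness test as a callback keeps cpupri generic and self contained.

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

#define NR_CPUS 4

struct task;	/* opaque here; the kernel passes struct task_struct */

/* The callback type: cpupri never needs to know what "fits" means,
 * so the logical separation between cpupri and rt.c is preserved. */
typedef bool (*fitness_fn_t)(struct task *p, int cpu);

/* Filter a candidate CPU mask (one flag per CPU) through the fitness
 * callback; a NULL callback preserves the old behaviour unchanged.
 * Returns 1 if any CPU survives the filtering, 0 otherwise. */
static int cpupri_filter(struct task *p, bool *lowest_mask,
			 fitness_fn_t fitness_fn)
{
	int cpu, found = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!lowest_mask[cpu])
			continue;
		if (fitness_fn && !fitness_fn(p, cpu)) {
			lowest_mask[cpu] = false;
			continue;
		}
		found = 1;
	}
	return found;
}
```

The trade-off noted in the changelog shows up here too: because the callback is an indirect call, the compiler cannot trivially optimise the check away when it would be constant.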
> [...]
>
> > +inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > +{
> > + unsigned int min_cap;
> > + unsigned int max_cap;
> > + unsigned int cpu_cap;
>
> Nit picking. Since we discussed it already,
>
> I found this "Also please try to aggregate variables of the same type
> into a single line. There is no point in wasting screen space::" ;-)
>
> https://lore.kernel.org/r/20181107171149.165693799@linutronix.de
That wasn't merged at the end AFAICT :)
It's not my preferred style in this case if I have the choice - but I promise
to change it if I ended up having to spin another version anyway :)
>
> [...]
>
> > @@ -2223,7 +2273,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
> > */
> > if (task_on_rq_queued(p) && rq->curr != p) {
> > #ifdef CONFIG_SMP
> > - if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
> > + bool need_to_push = rq->rt.overloaded ||
> > + !rt_task_fits_capacity(p, cpu_of(rq));
> > +
> > + if (p->nr_cpus_allowed > 1 && need_to_push)
> > rt_queue_push_tasks(rq);
> > #endif /* CONFIG_SMP */
> > if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
> What happens to a always running CFS task which switches to RT? Luca
> introduced a special migrate callback (next to push and pull)
> specifically to deal with this scenario. A lot of new infrastructure for
> this one use case, but still, do we care for it in RT as well?
>
> https://lore.kernel.org/r/20190506044836.2914-4-luca.abeni@santannapisa.it
>
Good question. This scenario and the one where uclamp values are changed while
an RT task is running are similar.
In both cases the migration will happen on the next activation/wakeup AFAICS.
I am not sure an always running rt/deadline task is something conceivable in
practice that we want to cater for. It is certainly something we can push a fix
for in the future on top of this. IMHO we're better off trying to keep the
complexity low until we have a real scenario/use case that justifies it.
As it stands when the system is overloaded or when there are more RT tasks than
big cores we're stuffed too. And there are probably more ways where we can
shoot ourselves in the foot. Using RT and deadline is hard and that's one of
their unavoidable plagues I suppose.
Thanks
--
Qais Yousef
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-30 17:43 ` Qais Yousef
@ 2019-11-28 13:59 ` Dietmar Eggemann
0 siblings, 0 replies; 42+ messages in thread
From: Dietmar Eggemann @ 2019-11-28 13:59 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Juri Lelli,
Vincent Guittot, Ben Segall, Mel Gorman, linux-kernel
On 30/10/2019 18:43, Qais Yousef wrote:
> On 10/30/19 12:57, Dietmar Eggemann wrote:
>> On 09.10.19 12:46, Qais Yousef wrote:
>>
>> [...]
>>
>>> Changes in v2:
>>> - Use cpupri_find() to check the fitness of the task instead of
>>> sprinkling find_lowest_rq() with several checks of
>>> rt_task_fits_capacity().
>>>
>>> The selected implementation opted to pass the fitness function as an
>>> argument rather than call rt_task_fits_capacity() capacity which is
>>> a cleaner to keep the logical separation of the 2 modules; but it
>>> means the compiler has less room to optimize rt_task_fits_capacity()
>>> out when it's a constant value.
>>
>> I would prefer exporting rt_task_fits_capacity() sched-internally via
>> kernel/sched/sched.h. Less code changes and the indication whether
>> rt_task_fits_capacity() has to be used in cpupri_find() is already given
>> by lowest_mask being !NULL or NULL.
>>
>
> I don't mind if the maintainers agree too. The reason I did that way is because
> it keeps the implementation of cpupri generic and self contained.
>
> rt_task_fits_capacity() at the moment is a static function in rt.c
> To use it in cpupri_find() I either need to make it public somewhere or have an
>
> extern bool rt_task_fits_capacity(...);
>
> in cpupri.c. Neither of which appealed to me personally.
>
>> [...]
>>
>>> +inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
>>> +{
>>> + unsigned int min_cap;
>>> + unsigned int max_cap;
>>> + unsigned int cpu_cap;
>>
>> Nit picking. Since we discussed it already,
>>
>> I found this "Also please try to aggregate variables of the same type
>> into a single line. There is no point in wasting screen space::" ;-)
>>
>> https://lore.kernel.org/r/20181107171149.165693799@linutronix.de
>
> That wasn't merged at the end AFAICT :)
>
> It's not my preferred style in this case if I have the choice - but I promise
> to change it if I ended up having to spin another version anyway :)
>
>>
>> [...]
>>
>>> @@ -2223,7 +2273,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
>>> */
>>> if (task_on_rq_queued(p) && rq->curr != p) {
>>> #ifdef CONFIG_SMP
>>> - if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
>>> + bool need_to_push = rq->rt.overloaded ||
>>> + !rt_task_fits_capacity(p, cpu_of(rq));
>>> +
>>> + if (p->nr_cpus_allowed > 1 && need_to_push)
>>> rt_queue_push_tasks(rq);
>>> #endif /* CONFIG_SMP */
>>> if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
>> What happens to a always running CFS task which switches to RT? Luca
>> introduced a special migrate callback (next to push and pull)
>> specifically to deal with this scenario. A lot of new infrastructure for
>> this one use case, but still, do we care for it in RT as well?
>>
>> https://lore.kernel.org/r/20190506044836.2914-4-luca.abeni@santannapisa.it
>>
>
> Good question. This scenario and the one where uclamp values are changed while
> an RT task is running are similar.
>
> In both cases the migration will happen on the next activation/wakeup AFAICS.
>
> I am not sure an always running rt/deadline task is something conceivable in
> practice and we want to cater for. It is certainly something we can push a fix
> for in the future on top of this. IMHO we're better off trying to keep the
> complexity low until we have a real scenario/use case that justifies it.
>
> As it stands when the system is overloaded or when there are more RT tasks than
> big cores we're stuffed too. And there are probably more ways where we can
> shoot ourselves in the foot. Using RT and deadline is hard and that's one of
> their unavoidable plagues I suppose.
I'm OK with that. I just wanted to make sure we will apply the same
requirements to the upcoming DL capacity awareness implementation. Not
catering for this use case means that we can skip the migrate callback
from Luca's initial DL capacity awareness patch-set.
I still would prefer to export rt_task_fits_capacity() via
kernel/sched/sched.h but I won't insist on that detail:
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-09 10:46 [PATCH v2] sched: rt: Make RT capacity aware Qais Yousef
` (3 preceding siblings ...)
2019-10-30 11:57 ` Dietmar Eggemann
@ 2019-11-25 21:36 ` Steven Rostedt
2019-11-26 9:39 ` Qais Yousef
2019-12-25 10:38 ` [tip: sched/core] sched/rt: Make RT capacity-aware tip-bot2 for Qais Yousef
2020-01-31 10:06 ` [PATCH v2] sched: rt: Make RT capacity aware Pavan Kondeti
6 siblings, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2019-11-25 21:36 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
Sorry for the very late response...
On Wed, 9 Oct 2019 11:46:11 +0100
Qais Yousef <qais.yousef@arm.com> wrote:
> ATM the uclamp values are only used for frequency selection; but on
> heterogeneous systems this is not enough and we need to ensure that the
> capacity of the CPU is >= uclamp_min, which is what is implemented here.
Is it possible that the capacity can be fixed, where the process can
just have a cpu mask of CPUs that have the capacity for it?
Not that this will affect this patch now, but just something for the
future.
>
> capacity_orig_of(cpu) >= rt_task.uclamp_min
>
> Note that by default uclamp.min is 1024, which means that RT tasks will
> always be biased towards the big CPUs, which makes for a better, more
> predictable behavior for the default case.
>
> Must stress that the bias acts as a hint rather than a definite
> placement strategy. For example, if all big cores are busy executing
> other RT tasks we can't guarantee that a new RT task will be placed
> there.
>
> On non-heterogeneous systems the original behavior of RT should be
> retained. Similarly if uclamp is not selected in the config.
>
> Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-- Steve
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-11-25 21:36 ` Steven Rostedt
@ 2019-11-26 9:39 ` Qais Yousef
0 siblings, 0 replies; 42+ messages in thread
From: Qais Yousef @ 2019-11-26 9:39 UTC (permalink / raw)
To: Steven Rostedt
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel
On 11/25/19 16:36, Steven Rostedt wrote:
> Sorry for the very late response...
No worries and thanks for taking the time to look at this!
>
> On Wed, 9 Oct 2019 11:46:11 +0100
> Qais Yousef <qais.yousef@arm.com> wrote:
>
> > ATM the uclamp values are only used for frequency selection; but on
> > heterogeneous systems this is not enough and we need to ensure that the
> > capacity of the CPU is >= uclamp_min, which is what is implemented here.
>
> Is it possible that the capacity can be fixed, where the process can
> just have a cpu mask of CPUs that has the capacity for it?
It is possible and I did consider it. But it didn't feel right because:
1. The same can be achieved with regular affinities.
2. On some systems, especially medium and lower-end devices, the number
of big cores might be small (2 bigs and 6 littles is common).
I couldn't justify that pinning to bigs is always better than
letting the system try to balance itself in case it gets overloaded.
The thing with RT is that generally we offer few guarantees if the system is
not designed properly, and the user can easily shoot themselves in the foot.
This implementation offers the simplest best effort while still not being too
restrictive.
But the idea is valid and worth looking at in the future.
>
> Not that this will affect this patch now, but just something for the
> future.
>
> >
> > capacity_orig_of(cpu) >= rt_task.uclamp_min
> >
> > Note that by default uclamp.min is 1024, which means that RT tasks will
> > always be biased towards the big CPUs, which makes for a better, more
> > predictable behavior for the default case.
> >
> > Must stress that the bias acts as a hint rather than a definite
> > placement strategy. For example, if all big cores are busy executing
> > other RT tasks we can't guarantee that a new RT task will be placed
> > there.
> >
> > On non-heterogeneous systems the original behavior of RT should be
> > retained. Similarly if uclamp is not selected in the config.
> >
> > Signed-off-by: Qais Yousef <qais.yousef@arm.com>
>
> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Thanks for your time :-)
--
Qais Yousef
* [tip: sched/core] sched/rt: Make RT capacity-aware
2019-10-09 10:46 [PATCH v2] sched: rt: Make RT capacity aware Qais Yousef
` (4 preceding siblings ...)
2019-11-25 21:36 ` Steven Rostedt
@ 2019-12-25 10:38 ` tip-bot2 for Qais Yousef
2020-01-31 10:06 ` [PATCH v2] sched: rt: Make RT capacity aware Pavan Kondeti
6 siblings, 0 replies; 42+ messages in thread
From: tip-bot2 for Qais Yousef @ 2019-12-25 10:38 UTC (permalink / raw)
To: linux-tip-commits
Cc: Qais Yousef, Peter Zijlstra (Intel),
Dietmar Eggemann, Steven Rostedt (VMware),
Linus Torvalds, Thomas Gleixner, Ingo Molnar, x86, LKML
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 804d402fb6f6487b825aae8cf42fda6426c62867
Gitweb: https://git.kernel.org/tip/804d402fb6f6487b825aae8cf42fda6426c62867
Author: Qais Yousef <qais.yousef@arm.com>
AuthorDate: Wed, 09 Oct 2019 11:46:11 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 25 Dec 2019 10:42:10 +01:00
sched/rt: Make RT capacity-aware
Capacity Awareness refers to the fact that on heterogeneous systems
(like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
when placing tasks we need to be aware of this difference of CPU
capacities.
In such scenarios we want to ensure that the selected CPU has enough
capacity to meet the requirement of the running task. Enough capacity
means here that capacity_orig_of(cpu) >= task.requirement.
The definition of task.requirement is dependent on the scheduling class.
For CFS, utilization is used to select a CPU that has a capacity value
>= the cfs_task.util.
capacity_orig_of(cpu) >= cfs_task.util
DL isn't capacity aware at the moment but can make use of the bandwidth
reservation to implement that in a manner similar to how CFS uses utilization.
The following patchset implements that:
https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
For RT we don't have a per task utilization signal and we lack any
information in general about what performance requirement the RT task
needs. But with the introduction of uclamp, RT tasks can now control
that by setting uclamp_min to guarantee a minimum performance point.
ATM the uclamp values are only used for frequency selection; but on
heterogeneous systems this is not enough and we need to ensure that the
capacity of the CPU is >= uclamp_min, which is what is implemented here.
capacity_orig_of(cpu) >= rt_task.uclamp_min
Note that by default uclamp.min is 1024, which means that RT tasks will
always be biased towards the big CPUs, which makes for a better, more
predictable behavior for the default case.
Must stress that the bias acts as a hint rather than a definite
placement strategy. For example, if all big cores are busy executing
other RT tasks we can't guarantee that a new RT task will be placed
there.
On non-heterogeneous systems the original behavior of RT should be
retained. Similarly if uclamp is not selected in the config.
[ mingo: Minor edits to comments. ]
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191009104611.15363-1-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/cpupri.c | 25 +++++++++++--
kernel/sched/cpupri.h | 4 +-
kernel/sched/rt.c | 83 ++++++++++++++++++++++++++++++++++--------
3 files changed, 94 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index b7abca9..1a2719e 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -46,6 +46,8 @@ static int convert_prio(int prio)
* @cp: The cpupri context
* @p: The task
* @lowest_mask: A mask to fill in with selected CPUs (or NULL)
+ * @fitness_fn: A pointer to a function to do custom checks whether the CPU
+ * fits a specific criteria so that we only return those CPUs.
*
* Note: This function returns the recommended CPUs as calculated during the
* current invocation. By the time the call returns, the CPUs may have in
@@ -57,7 +59,8 @@ static int convert_prio(int prio)
* Return: (int)bool - CPUs were found
*/
int cpupri_find(struct cpupri *cp, struct task_struct *p,
- struct cpumask *lowest_mask)
+ struct cpumask *lowest_mask,
+ bool (*fitness_fn)(struct task_struct *p, int cpu))
{
int idx = 0;
int task_pri = convert_prio(p->prio);
@@ -98,6 +101,8 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
continue;
if (lowest_mask) {
+ int cpu;
+
cpumask_and(lowest_mask, p->cpus_ptr, vec->mask);
/*
@@ -108,7 +113,23 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
* condition, simply act as though we never hit this
* priority level and continue on.
*/
- if (cpumask_any(lowest_mask) >= nr_cpu_ids)
+ if (cpumask_empty(lowest_mask))
+ continue;
+
+ if (!fitness_fn)
+ return 1;
+
+ /* Ensure the capacity of the CPUs fit the task */
+ for_each_cpu(cpu, lowest_mask) {
+ if (!fitness_fn(p, cpu))
+ cpumask_clear_cpu(cpu, lowest_mask);
+ }
+
+ /*
+ * If no CPU at the current priority can fit the task
+ * continue looking
+ */
+ if (cpumask_empty(lowest_mask))
continue;
}
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index 7dc20a3..32dd520 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -18,7 +18,9 @@ struct cpupri {
};
#ifdef CONFIG_SMP
-int cpupri_find(struct cpupri *cp, struct task_struct *p, struct cpumask *lowest_mask);
+int cpupri_find(struct cpupri *cp, struct task_struct *p,
+ struct cpumask *lowest_mask,
+ bool (*fitness_fn)(struct task_struct *p, int cpu));
void cpupri_set(struct cpupri *cp, int cpu, int pri);
int cpupri_init(struct cpupri *cp);
void cpupri_cleanup(struct cpupri *cp);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e591d40..4043abe 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -437,6 +437,45 @@ static inline int on_rt_rq(struct sched_rt_entity *rt_se)
return rt_se->on_rq;
}
+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * Verify the fitness of task @p to run on @cpu taking into account the uclamp
+ * settings.
+ *
+ * This check is only important for heterogeneous systems where uclamp_min value
+ * is higher than the capacity of a @cpu. For non-heterogeneous system this
+ * function will always return true.
+ *
+ * The function will return true if the capacity of the @cpu is >= the
+ * uclamp_min and false otherwise.
+ *
+ * Note that uclamp_min will be clamped to uclamp_max if uclamp_min
+ * > uclamp_max.
+ */
+static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
+{
+ unsigned int min_cap;
+ unsigned int max_cap;
+ unsigned int cpu_cap;
+
+ /* Only heterogeneous systems can benefit from this check */
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return true;
+
+ min_cap = uclamp_eff_value(p, UCLAMP_MIN);
+ max_cap = uclamp_eff_value(p, UCLAMP_MAX);
+
+ cpu_cap = capacity_orig_of(cpu);
+
+ return cpu_cap >= min(min_cap, max_cap);
+}
+#else
+static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
+{
+ return true;
+}
+#endif
+
#ifdef CONFIG_RT_GROUP_SCHED
static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
@@ -1391,6 +1430,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
{
struct task_struct *curr;
struct rq *rq;
+ bool test;
/* For anything but wake ups, just return the task_cpu */
if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
@@ -1422,10 +1462,16 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
*
* This test is optimistic, if we get it wrong the load-balancer
* will have to sort it out.
+ *
+ * We take into account the capacity of the CPU to ensure it fits the
+ * requirement of the task - which is only important on heterogeneous
+ * systems like big.LITTLE.
*/
- if (curr && unlikely(rt_task(curr)) &&
- (curr->nr_cpus_allowed < 2 ||
- curr->prio <= p->prio)) {
+ test = curr &&
+ unlikely(rt_task(curr)) &&
+ (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+
+ if (test || !rt_task_fits_capacity(p, cpu)) {
int target = find_lowest_rq(p);
/*
@@ -1449,15 +1495,15 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
* let's hope p can move out.
*/
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+ !cpupri_find(&rq->rd->cpupri, rq->curr, NULL, NULL))
return;
/*
* p is migratable, so let's not schedule it and
* see if it is pushed or pulled somewhere else.
*/
- if (p->nr_cpus_allowed != 1
- && cpupri_find(&rq->rd->cpupri, p, NULL))
+ if (p->nr_cpus_allowed != 1 &&
+ cpupri_find(&rq->rd->cpupri, p, NULL, NULL))
return;
/*
@@ -1601,7 +1647,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
{
if (!task_running(rq, p) &&
- cpumask_test_cpu(cpu, p->cpus_ptr))
+ cpumask_test_cpu(cpu, p->cpus_ptr) &&
+ rt_task_fits_capacity(p, cpu))
return 1;
return 0;
@@ -1643,7 +1690,8 @@ static int find_lowest_rq(struct task_struct *task)
if (task->nr_cpus_allowed == 1)
return -1; /* No other targets possible */
- if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
+ if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask,
+ rt_task_fits_capacity))
return -1; /* No targets found */
/*
@@ -2147,12 +2195,14 @@ skip:
*/
static void task_woken_rt(struct rq *rq, struct task_struct *p)
{
- if (!task_running(rq, p) &&
- !test_tsk_need_resched(rq->curr) &&
- p->nr_cpus_allowed > 1 &&
- (dl_task(rq->curr) || rt_task(rq->curr)) &&
- (rq->curr->nr_cpus_allowed < 2 ||
- rq->curr->prio <= p->prio))
+ bool need_to_push = !task_running(rq, p) &&
+ !test_tsk_need_resched(rq->curr) &&
+ p->nr_cpus_allowed > 1 &&
+ (dl_task(rq->curr) || rt_task(rq->curr)) &&
+ (rq->curr->nr_cpus_allowed < 2 ||
+ rq->curr->prio <= p->prio);
+
+ if (need_to_push || !rt_task_fits_capacity(p, cpu_of(rq)))
push_rt_tasks(rq);
}
@@ -2224,7 +2274,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
*/
if (task_on_rq_queued(p) && rq->curr != p) {
#ifdef CONFIG_SMP
- if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
+ bool need_to_push = rq->rt.overloaded ||
+ !rt_task_fits_capacity(p, cpu_of(rq));
+
+ if (p->nr_cpus_allowed > 1 && need_to_push)
rt_queue_push_tasks(rq);
#endif /* CONFIG_SMP */
if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2019-10-09 10:46 [PATCH v2] sched: rt: Make RT capacity aware Qais Yousef
` (5 preceding siblings ...)
2019-12-25 10:38 ` [tip: sched/core] sched/rt: Make RT capacity-aware tip-bot2 for Qais Yousef
@ 2020-01-31 10:06 ` Pavan Kondeti
2020-01-31 15:34 ` Qais Yousef
6 siblings, 1 reply; 42+ messages in thread
From: Pavan Kondeti @ 2020-01-31 10:06 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
linux-kernel
Hi Qais,
On Wed, Oct 09, 2019 at 11:46:11AM +0100, Qais Yousef wrote:
> Capacity Awareness refers to the fact that on heterogeneous systems
> (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> when placing tasks we need to be aware of this difference of CPU
> capacities.
>
> In such scenarios we want to ensure that the selected CPU has enough
> capacity to meet the requirement of the running task. Enough capacity
> means here that capacity_orig_of(cpu) >= task.requirement.
>
> The definition of task.requirement is dependent on the scheduling class.
>
> For CFS, utilization is used to select a CPU that has >= capacity value
> than the cfs_task.util.
>
> capacity_orig_of(cpu) >= cfs_task.util
>
> DL isn't capacity aware at the moment but can make use of the bandwidth
> reservation to implement that in a similar manner CFS uses utilization.
> The following patchset implements that:
>
> https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
>
> capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
>
> For RT we don't have a per task utilization signal and we lack any
> information in general about what performance requirement the RT task
> needs. But with the introduction of uclamp, RT tasks can now control
> that by setting uclamp_min to guarantee a minimum performance point.
>
> ATM the uclamp values are only used for frequency selection; but on
> heterogeneous systems this is not enough and we need to ensure that the
> capacity of the CPU is >= uclamp_min, which is what is implemented here.
>
> capacity_orig_of(cpu) >= rt_task.uclamp_min
>
> Note that by default uclamp.min is 1024, which means that RT tasks will
> always be biased towards the big CPUs, which makes for a better, more
> predictable behavior for the default case.
>
> Must stress that the bias acts as a hint rather than a definite
> placement strategy. For example, if all big cores are busy executing
> other RT tasks we can't guarantee that a new RT task will be placed
> there.
>
> On non-heterogeneous systems the original behavior of RT should be
> retained. Similarly if uclamp is not selected in the config.
>
> Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> ---
>
> Changes in v2:
> - Use cpupri_find() to check the fitness of the task instead of
> sprinkling find_lowest_rq() with several checks of
> rt_task_fits_capacity().
>
> The selected implementation opted to pass the fitness function as an
> argument rather than call rt_task_fits_capacity() directly, which is
> cleaner to keep the logical separation of the 2 modules; but it
> means the compiler has less room to optimize rt_task_fits_capacity()
> out when it's a constant value.
>
> The logic is not perfect. For example if a 'small' task is occupying a big CPU
> and another big task wakes up; we won't force migrate the small task to clear
> the big cpu for the big task that woke up.
>
> IOW, the logic is best effort and can't give hard guarantees. But improves the
> current situation where a task can randomly end up on any CPU regardless of
> what it needs. ie: without this patch an RT task can wake up on a big or small
> CPU, but with this it will always wake up on a big CPU (assuming the big CPUs
> aren't overloaded) - hence provide a consistent performance.
>
> I'm looking at ways to improve this best effort, but this patch should be
> a good start to discuss our Capacity Awareness requirement. There's a trade-off
> of complexity to be made here and I'd like to keep things as simple as
> possible and build on top as needed.
>
>
> kernel/sched/cpupri.c | 23 ++++++++++--
> kernel/sched/cpupri.h | 4 ++-
> kernel/sched/rt.c | 81 +++++++++++++++++++++++++++++++++++--------
> 3 files changed, 91 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index b7abca987d94..799791c01d60 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -57,7 +57,8 @@ static int convert_prio(int prio)
> * Return: (int)bool - CPUs were found
> */
> int cpupri_find(struct cpupri *cp, struct task_struct *p,
> - struct cpumask *lowest_mask)
> + struct cpumask *lowest_mask,
> + bool (*fitness_fn)(struct task_struct *p, int cpu))
> {
> int idx = 0;
> int task_pri = convert_prio(p->prio);
> @@ -98,6 +99,8 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
> continue;
>
> if (lowest_mask) {
> + int cpu;
> +
> cpumask_and(lowest_mask, p->cpus_ptr, vec->mask);
>
> /*
> @@ -108,7 +111,23 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
> * condition, simply act as though we never hit this
> * priority level and continue on.
> */
> - if (cpumask_any(lowest_mask) >= nr_cpu_ids)
> + if (cpumask_empty(lowest_mask))
> + continue;
> +
> + if (!fitness_fn)
> + return 1;
> +
> + /* Ensure the capacity of the CPUs fit the task */
> + for_each_cpu(cpu, lowest_mask) {
> + if (!fitness_fn(p, cpu))
> + cpumask_clear_cpu(cpu, lowest_mask);
> + }
> +
> + /*
> + * If no CPU at the current priority can fit the task
> + * continue looking
> + */
> + if (cpumask_empty(lowest_mask))
> continue;
> }
>
I understand that RT tasks run on BIG cores by default when uclamp is enabled.
Can you tell what happens when we have more runnable RT tasks than BIG
CPUs? Do they get packed on the BIG CPUs, or do silver CPUs eventually pull
those tasks? Since rt_task_fits_capacity() is considered during wakeup, push and
pull, the tasks may get packed on BIG forever. Is my understanding correct?
Also what happens in the case where RT tasks are pinned to silver but with
the default uclamp value, i.e. p.uclamp.min=1024? They may all get queued on a
single silver and other silvers may not help since the task does not fit
there. In practice, we may not use this setup. Just wanted to know if this
behavior is intentional or not.
Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
* Re: [PATCH v2] sched: rt: Make RT capacity aware
2020-01-31 10:06 ` [PATCH v2] sched: rt: Make RT capacity aware Pavan Kondeti
@ 2020-01-31 15:34 ` Qais Yousef
[not found] ` <CAEU1=PnYryM26F-tNAT0JVUoFcygRgE374JiBeJPQeTEoZpANg@mail.gmail.com>
0 siblings, 1 reply; 42+ messages in thread
From: Qais Yousef @ 2020-01-31 15:34 UTC (permalink / raw)
To: Pavan Kondeti
Cc: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
linux-kernel
Hi Pavan
On 01/31/20 15:36, Pavan Kondeti wrote:
> Hi Qais,
>
> On Wed, Oct 09, 2019 at 11:46:11AM +0100, Qais Yousef wrote:
[...]
> >
> > For RT we don't have a per task utilization signal and we lack any
> > information in general about what performance requirement the RT task
> > needs. But with the introduction of uclamp, RT tasks can now control
> > that by setting uclamp_min to guarantee a minimum performance point.
[...]
> > ---
> >
> > Changes in v2:
> > - Use cpupri_find() to check the fitness of the task instead of
> > sprinkling find_lowest_rq() with several checks of
> > rt_task_fits_capacity().
> >
> > The selected implementation opted to pass the fitness function as an
> > argument rather than call rt_task_fits_capacity() directly, which is
> > cleaner to keep the logical separation of the 2 modules; but it
> > means the compiler has less room to optimize rt_task_fits_capacity()
> > out when it's a constant value.
> >
> > The logic is not perfect. For example if a 'small' task is occupying a big CPU
> > and another big task wakes up; we won't force migrate the small task to clear
> > the big cpu for the big task that woke up.
> >
> > IOW, the logic is best effort and can't give hard guarantees. But improves the
> > current situation where a task can randomly end up on any CPU regardless of
> > what it needs. ie: without this patch an RT task can wake up on a big or small
> > CPU, but with this it will always wake up on a big CPU (assuming the big CPUs
> > aren't overloaded) - hence provide a consistent performance.
[...]
> I understand that RT tasks run on BIG cores by default when uclamp is enabled.
> Can you tell what happens when we have more runnable RT tasks than the BIG
> CPUs? Do they get packed on the BIG CPUs or eventually silver CPUs pull those
> tasks? Since rt_task_fits_capacity() is considered during wakeup, push and
> pull, the tasks may get packed on BIG forever. Is my understanding correct?
I left the relevant parts of the commit message and my 'cover-letter' quoted
above; they should contain answers to your question.
In short, the logic is best effort and isn't a hard guarantee. When the system
is overloaded we'll still spread, and a task that needs a big core might end up
on a little one. But AFAIU with RT, if you really want guarantees you need to
do some planning; otherwise there are no guarantees in general that your task
will get what it needs.
But I understand your question is for the general purpose case. I've hacked my
notebook to run a few tests for you
https://gist.github.com/qais-yousef/cfe7487e3b43c3c06a152da31ae09101
Look at the diagrams in "Test {1, 2, 3} Results". I spawned 6 tasks to match
the 6 cores on the Juno I ran on, based on Linus' master from a couple of days
ago. Note that on Juno cores 1 and 2 are the big cores. 'b_*' and 'l_*' are the
task names, which are remnants from my previous testing where I spawned
different numbers of big and small tasks.
I repeat the same tests 3 times to demonstrate the repeatability. The logic
causes 2 tasks to run on a big CPU, but there's spreading. IMO on a general
purpose system this is good behavior. On a real-time system that needs stronger
guarantees there's no alternative to doing proper RT planning.
In the last test I just spawn 2 tasks which end up on the right CPUs, 1 and 2.
On systems like Android my observation has been that there are very few
concurrent RT tasks active at the same time. So if there are some tasks in the
system that do want to be on the big CPUs, they are most likely to get that
guarantee. Without this patch what you get is completely random.
>
> Also what happens for the case where RT tasks are pinned to silver but with
> default uclamp value i.e p.uclamp.min=1024 ? They may all get queued on a
> single silver and other silvers may not help since the task does not fit
> there. In practice, we may not use this setup. Just wanted to know if this
> behavior is intentional or not.
I'm not sure I understand your question.
If the RT tasks are affined to a set of CPUs, then we'll only search in these
CPUs. I expect the logic not to change with this patch. If you have a use case
that you think breaks with this patch, can you please share the details
so I can reproduce it?
I just ran several tests spawning 4 tasks affined to the little cores and
I indeed see them spreading on the littles.
Cheers
--
Qais Yousef