linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched: rt: Make RT capacity aware
@ 2019-09-03 10:33 Qais Yousef
  2019-09-04 11:25 ` Steven Rostedt
  0 siblings, 1 reply; 9+ messages in thread
From: Qais Yousef @ 2019-09-03 10:33 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Steven Rostedt
  Cc: Dietmar Eggemann, Vincent Guittot, Alessio Balsini, linux-kernel,
	Qais Yousef

Capacity Awareness refers to the fact that on heterogeneous systems
(like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
when placing tasks we need to be aware of this difference of CPU
capacities.

In such scenarios we want to ensure that the selected CPU has enough
capacity to meet the requirement of the running task. Enough capacity
means here that capacity_orig_of(cpu) >= task.requirement.

The definition of task.requirement is dependent on the scheduling class.

For CFS, utilization is used to select a CPU that has a capacity
value >= cfs_task.util.

	capacity_orig_of(cpu) >= cfs_task.util

DL isn't capacity aware at the moment but can make use of the bandwidth
reservation to implement that in a similar manner to how CFS uses
utilization. The following patchset implements that:

https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/

	capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime

For RT we don't have a per task utilization signal and we lack any
information in general about what performance requirement the RT task
needs. But with the introduction of uclamp, RT tasks can now control
that by setting uclamp_min to guarantee a minimum performance point.

ATM the uclamp values are only used for frequency selection; but on
heterogeneous systems this is not enough and we need to ensure that the
capacity of the CPU is >= uclamp_min, which is what is implemented here.

	capacity_orig_of(cpu) >= rt_task.uclamp_min

Note that by default uclamp.min is 1024, which means that RT tasks will
always be biased towards the big CPUs, which makes for a better, more
predictable behavior for the default case.

I must stress that the bias acts as a hint rather than a definite
placement strategy. For example, if all big cores are busy executing
other RT tasks, we can't guarantee that a new RT task will be placed
there.

On non-heterogeneous systems the original behavior of RT should be
retained. The same applies if uclamp is not selected in the config.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
---

The logic is not perfect. For example, if a 'small' task is occupying a big
CPU and another big task wakes up, we won't force-migrate the small task to
clear the big CPU for the big task that woke up.

IOW, the logic is best effort and can't give hard guarantees. But it improves
the current situation where a task can randomly end up on any CPU regardless
of what it needs. I.e.: without this patch an RT task can wake up on a big or
a small CPU, but with it the task will always wake up on a big CPU (assuming
the big CPUs aren't overloaded) - hence providing consistent performance.

I'm looking at ways to improve this best effort, but this patch should be
a good start to discuss our Capacity Awareness requirement. There's a trade-off
of complexity to be made here and I'd like to keep things as simple as
possible and build on top as needed.
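
For reference, this is roughly how an RT task could request a minimum
performance point from userspace. It is an illustrative sketch only (not part
of this patch), assuming a kernel built with CONFIG_UCLAMP_TASK and the
sched_setattr() uclamp extension; the struct layout and flag value are
duplicated from the UAPI headers purely to keep the snippet self-contained:

	#include <sched.h>		/* SCHED_FIFO */
	#include <stdint.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Mirrors include/uapi/linux/sched*.h (v5.3) */
	#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20

	struct sched_attr {
		uint32_t size;
		uint32_t sched_policy;
		uint64_t sched_flags;
		int32_t  sched_nice;
		uint32_t sched_priority;
		uint64_t sched_runtime;
		uint64_t sched_deadline;
		uint64_t sched_period;
		uint32_t sched_util_min;
		uint32_t sched_util_max;
	};

	/* Make @pid an RT (FIFO) task asking for at least @util_min capacity */
	static int set_rt_uclamp_min(pid_t pid, uint32_t util_min)
	{
		struct sched_attr attr = {
			.size		= sizeof(attr),
			.sched_policy	= SCHED_FIFO,
			.sched_priority	= 10,
			.sched_flags	= SCHED_FLAG_UTIL_CLAMP_MIN,
			.sched_util_min	= util_min,	/* 0..1024 */
		};

		/* No glibc wrapper for sched_setattr() at the time of writing */
		return syscall(SYS_sched_setattr, pid, &attr, 0);
	}

With this patch, a task that did e.g. set_rt_uclamp_min(0, 800) would be
biased towards CPUs whose capacity_orig_of() is >= 800.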


 kernel/sched/rt.c | 112 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 92 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a532558a5176..7c3bcbef692b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -436,6 +436,45 @@ static inline int on_rt_rq(struct sched_rt_entity *rt_se)
 	return rt_se->on_rq;
 }
 
+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * Verify the fitness of task @p to run on @cpu taking into account the uclamp
+ * settings.
+ *
+ * This check is only important for heterogeneous systems where uclamp_min value
+ * is higher than the capacity of a @cpu. For non-heterogeneous system this
+ * function will always return true.
+ *
+ * The function will return true if the capacity of the @cpu is >= the
+ * uclamp_min and false otherwise.
+ *
+ * Note that uclamp_min will be clamped to uclamp_max if uclamp_min
+ * > uclamp_max.
+ */
+static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
+{
+	unsigned int min_cap;
+	unsigned int max_cap;
+	unsigned int cpu_cap;
+
+	/* Only heterogeneous systems can benefit from this check */
+	if (!static_branch_unlikely(&sched_asym_cpucapacity))
+		return true;
+
+	min_cap = uclamp_eff_value(p, UCLAMP_MIN);
+	max_cap = uclamp_eff_value(p, UCLAMP_MAX);
+
+	cpu_cap = capacity_orig_of(cpu);
+
+	return cpu_cap >= min(min_cap, max_cap);
+}
+#else
+static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
+{
+	return true;
+}
+#endif
+
 #ifdef CONFIG_RT_GROUP_SCHED
 
 static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
@@ -1390,6 +1429,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
 	struct task_struct *curr;
 	struct rq *rq;
+	bool test;
 
 	/* For anything but wake ups, just return the task_cpu */
 	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
@@ -1421,10 +1461,16 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
 	 *
 	 * This test is optimistic, if we get it wrong the load-balancer
 	 * will have to sort it out.
+	 *
+	 * We take into account the capacity of the cpu to ensure it fits the
+	 * requirement of the task - which is only important on heterogeneous
+	 * systems like big.LITTLE.
 	 */
-	if (curr && unlikely(rt_task(curr)) &&
-	    (curr->nr_cpus_allowed < 2 ||
-	     curr->prio <= p->prio)) {
+	test = curr &&
+	       unlikely(rt_task(curr)) &&
+	       (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+
+	if (test || !rt_task_fits_capacity(p, cpu)) {
 		int target = find_lowest_rq(p);
 
 		/*
@@ -1614,7 +1660,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
 {
 	if (!task_running(rq, p) &&
-	    cpumask_test_cpu(cpu, p->cpus_ptr))
+	    cpumask_test_cpu(cpu, p->cpus_ptr) &&
+	    rt_task_fits_capacity(p, cpu))
 		return 1;
 
 	return 0;
@@ -1648,6 +1695,7 @@ static int find_lowest_rq(struct task_struct *task)
 	struct cpumask *lowest_mask = this_cpu_cpumask_var_ptr(local_cpu_mask);
 	int this_cpu = smp_processor_id();
 	int cpu      = task_cpu(task);
+	bool test;
 
 	/* Make sure the mask is initialized first */
 	if (unlikely(!lowest_mask))
@@ -1666,16 +1714,23 @@ static int find_lowest_rq(struct task_struct *task)
 	 *
 	 * We prioritize the last CPU that the task executed on since
 	 * it is most likely cache-hot in that location.
+	 *
+	 * Assuming the task still fits the capacity of the last CPU.
 	 */
-	if (cpumask_test_cpu(cpu, lowest_mask))
+	if (cpumask_test_cpu(cpu, lowest_mask) &&
+	    rt_task_fits_capacity(task, cpu)) {
 		return cpu;
+	}
 
 	/*
 	 * Otherwise, we consult the sched_domains span maps to figure
 	 * out which CPU is logically closest to our hot cache data.
 	 */
-	if (!cpumask_test_cpu(this_cpu, lowest_mask))
-		this_cpu = -1; /* Skip this_cpu opt if not among lowest */
+	if (!cpumask_test_cpu(this_cpu, lowest_mask) ||
+	    !rt_task_fits_capacity(task, this_cpu)) {
+		/* Skip this_cpu opt if not among lowest or doesn't fit */
+		this_cpu = -1;
+	}
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
@@ -1692,11 +1747,15 @@ static int find_lowest_rq(struct task_struct *task)
 				return this_cpu;
 			}
 
-			best_cpu = cpumask_first_and(lowest_mask,
-						     sched_domain_span(sd));
-			if (best_cpu < nr_cpu_ids) {
-				rcu_read_unlock();
-				return best_cpu;
+			for_each_cpu_and(best_cpu, lowest_mask,
+					 sched_domain_span(sd)) {
+				if (best_cpu >= nr_cpu_ids)
+					break;
+
+				if (rt_task_fits_capacity(task, best_cpu)) {
+					rcu_read_unlock();
+					return best_cpu;
+				}
 			}
 		}
 	}
@@ -1711,7 +1770,15 @@ static int find_lowest_rq(struct task_struct *task)
 		return this_cpu;
 
 	cpu = cpumask_any(lowest_mask);
-	if (cpu < nr_cpu_ids)
+
+	/*
+	 * Make sure that the fitness on @cpu doesn't change compared to the
+	 * cpu we're currently running on.
+	 */
+	test = rt_task_fits_capacity(task, cpu) ==
+	       rt_task_fits_capacity(task, task_cpu(task));
+
+	if (cpu < nr_cpu_ids && test)
 		return cpu;
 
 	return -1;
@@ -2160,12 +2227,14 @@ static void pull_rt_task(struct rq *this_rq)
  */
 static void task_woken_rt(struct rq *rq, struct task_struct *p)
 {
-	if (!task_running(rq, p) &&
-	    !test_tsk_need_resched(rq->curr) &&
-	    p->nr_cpus_allowed > 1 &&
-	    (dl_task(rq->curr) || rt_task(rq->curr)) &&
-	    (rq->curr->nr_cpus_allowed < 2 ||
-	     rq->curr->prio <= p->prio))
+	bool need_to_push = !task_running(rq, p) &&
+			    !test_tsk_need_resched(rq->curr) &&
+			    p->nr_cpus_allowed > 1 &&
+			    (dl_task(rq->curr) || rt_task(rq->curr)) &&
+			    (rq->curr->nr_cpus_allowed < 2 ||
+			     rq->curr->prio <= p->prio);
+
+	if (need_to_push || !rt_task_fits_capacity(p, cpu_of(rq)))
 		push_rt_tasks(rq);
 }
 
@@ -2237,7 +2306,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
 	 */
 	if (task_on_rq_queued(p) && rq->curr != p) {
 #ifdef CONFIG_SMP
-		if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
+		bool need_to_push = rq->rt.overloaded ||
+				    !rt_task_fits_capacity(p, cpu_of(rq));
+
+		if (p->nr_cpus_allowed > 1 && need_to_push)
 			rt_queue_push_tasks(rq);
 #endif /* CONFIG_SMP */
 		if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
-- 
2.17.1



* Re: [PATCH] sched: rt: Make RT capacity aware
  2019-09-03 10:33 [PATCH] sched: rt: Make RT capacity aware Qais Yousef
@ 2019-09-04 11:25 ` Steven Rostedt
  2019-09-04 15:40   ` Qais Yousef
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2019-09-04 11:25 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Dietmar Eggemann, Vincent Guittot,
	Alessio Balsini, linux-kernel

On Tue,  3 Sep 2019 11:33:29 +0100
Qais Yousef <qais.yousef@arm.com> wrote:

> Capacity Awareness refers to the fact that on heterogeneous systems
> (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> when placing tasks we need to be aware of this difference of CPU
> capacities.
> 
> In such scenarios we want to ensure that the selected CPU has enough
> capacity to meet the requirement of the running task. Enough capacity
> means here that capacity_orig_of(cpu) >= task.requirement.
> 
> The definition of task.requirement is dependent on the scheduling class.
> 
> For CFS, utilization is used to select a CPU that has >= capacity value
> than the cfs_task.util.
> 
> 	capacity_orig_of(cpu) >= cfs_task.util
> 
> DL isn't capacity aware at the moment but can make use of the bandwidth
> reservation to implement that in a similar manner CFS uses utilization.
> The following patchset implements that:
> 
> https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
> 
> 	capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
> 
> For RT we don't have a per task utilization signal and we lack any
> information in general about what performance requirement the RT task
> needs. But with the introduction of uclamp, RT tasks can now control
> that by setting uclamp_min to guarantee a minimum performance point.
> 
> ATM the uclamp value are only used for frequency selection; but on
> heterogeneous systems this is not enough and we need to ensure that the
> capacity of the CPU is >= uclamp_min. Which is what implemented here.
> 
> 	capacity_orig_of(cpu) >= rt_task.uclamp_min
> 
> Note that by default uclamp.min is 1024, which means that RT tasks will
> always be biased towards the big CPUs, which make for a better more
> predictable behavior for the default case.
> 
> Must stress that the bias acts as a hint rather than a definite
> placement strategy. For example, if all big cores are busy executing
> other RT tasks we can't guarantee that a new RT task will be placed
> there.
> 
> On non-heterogeneous systems the original behavior of RT should be
> retained. Similarly if uclamp is not selected in the config.

Nice change log. It clearly describes what you are trying to do with
the patch, and gives an idea of what still needs to be done.

> 
> Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> ---
> 
> The logic is not perfect. For example if a 'small' task is occupying a big CPU
> and another big task wakes up; we won't force migrate the small task to clear
> the big cpu for the big task that woke up.
> 
> IOW, the logic is best effort and can't give hard guarantees. But improves the
> current situation where a task can randomly end up on any CPU regardless of
> what it needs. ie: without this patch an RT task can wake up on a big or small
> CPU, but with this it will always wake up on a big CPU (assuming the big CPUs
> aren't overloaded) - hence provide a consistent performance.
> 
> I'm looking at ways to improve this best effort, but this patch should be
> a good start to discuss our Capacity Awareness requirement. There's a trade-off
> of complexity to be made here and I'd like to keep things as simple as
> possible and build on top as needed.
> 
> 
>  kernel/sched/rt.c | 112 +++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 92 insertions(+), 20 deletions(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index a532558a5176..7c3bcbef692b 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -436,6 +436,45 @@ static inline int on_rt_rq(struct sched_rt_entity *rt_se)
>  	return rt_se->on_rq;
>  }
>  
> +#ifdef CONFIG_UCLAMP_TASK
> +/*
> + * Verify the fitness of task @p to run on @cpu taking into account the uclamp
> + * settings.
> + *
> + * This check is only important for heterogeneous systems where uclamp_min value
> + * is higher than the capacity of a @cpu. For non-heterogeneous system this
> + * function will always return true.
> + *
> + * The function will return true if the capacity of the @cpu is >= the
> + * uclamp_min and false otherwise.
> + *
> + * Note that uclamp_min will be clamped to uclamp_max if uclamp_min
> + * > uclamp_max.
> + */
> +static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> +	unsigned int min_cap;
> +	unsigned int max_cap;
> +	unsigned int cpu_cap;
> +
> +	/* Only heterogeneous systems can benefit from this check */
> +	if (!static_branch_unlikely(&sched_asym_cpucapacity))

OK, so this is optimized for the case where UCLAMP_TASK is compiled in
but we are running on a homogeneous system.

> +		return true;
> +
> +	min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> +	max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> +
> +	cpu_cap = capacity_orig_of(cpu);
> +
> +	return cpu_cap >= min(min_cap, max_cap);
> +}
> +#else
> +static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> +	return true;
> +}
> +#endif
> +
>  #ifdef CONFIG_RT_GROUP_SCHED
>  
>  static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
> @@ -1390,6 +1429,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
>  {
>  	struct task_struct *curr;
>  	struct rq *rq;
> +	bool test;
>  
>  	/* For anything but wake ups, just return the task_cpu */
>  	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
> @@ -1421,10 +1461,16 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
>  	 *
>  	 * This test is optimistic, if we get it wrong the load-balancer
>  	 * will have to sort it out.
> +	 *
> +	 * We take into account the capacity of the cpu to ensure it fits the
> +	 * requirement of the task - which is only important on heterogeneous
> +	 * systems like big.LITTLE.
>  	 */
> -	if (curr && unlikely(rt_task(curr)) &&
> -	    (curr->nr_cpus_allowed < 2 ||
> -	     curr->prio <= p->prio)) {
> +	test = curr &&
> +	       unlikely(rt_task(curr)) &&
> +	       (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
> +
> +	if (test || !rt_task_fits_capacity(p, cpu)) {
>  		int target = find_lowest_rq(p);
>  
>  		/*
> @@ -1614,7 +1660,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
>  static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
>  {
>  	if (!task_running(rq, p) &&
> -	    cpumask_test_cpu(cpu, p->cpus_ptr))
> +	    cpumask_test_cpu(cpu, p->cpus_ptr) &&
> +	    rt_task_fits_capacity(p, cpu))

Hmm, so if a CPU goes idle and looks for CPUs with more than one RT
task queued (overloaded), it will skip pulling RT tasks that it doesn't
have enough capacity for. Is that the desired effect? I think we could
end up with small idle CPUs while RT tasks are waiting to run.


>  		return 1;
>  
>  	return 0;
> @@ -1648,6 +1695,7 @@ static int find_lowest_rq(struct task_struct *task)
>  	struct cpumask *lowest_mask = this_cpu_cpumask_var_ptr(local_cpu_mask);
>  	int this_cpu = smp_processor_id();
>  	int cpu      = task_cpu(task);
> +	bool test;
>  
>  	/* Make sure the mask is initialized first */
>  	if (unlikely(!lowest_mask))
> @@ -1666,16 +1714,23 @@ static int find_lowest_rq(struct task_struct *task)
>  	 *
>  	 * We prioritize the last CPU that the task executed on since
>  	 * it is most likely cache-hot in that location.
> +	 *
> +	 * Assuming the task still fits the capacity of the last CPU.
>  	 */
> -	if (cpumask_test_cpu(cpu, lowest_mask))
> +	if (cpumask_test_cpu(cpu, lowest_mask) &&
> +	    rt_task_fits_capacity(task, cpu)) {
>  		return cpu;
> +	}
>  
>  	/*
>  	 * Otherwise, we consult the sched_domains span maps to figure
>  	 * out which CPU is logically closest to our hot cache data.
>  	 */
> -	if (!cpumask_test_cpu(this_cpu, lowest_mask))
> -		this_cpu = -1; /* Skip this_cpu opt if not among lowest */
> +	if (!cpumask_test_cpu(this_cpu, lowest_mask) ||
> +	    !rt_task_fits_capacity(task, this_cpu)) {
> +		/* Skip this_cpu opt if not among lowest or doesn't fit */
> +		this_cpu = -1;
> +	}
>  
>  	rcu_read_lock();
>  	for_each_domain(cpu, sd) {
> @@ -1692,11 +1747,15 @@ static int find_lowest_rq(struct task_struct *task)
>  				return this_cpu;
>  			}
>  
> -			best_cpu = cpumask_first_and(lowest_mask,
> -						     sched_domain_span(sd));
> -			if (best_cpu < nr_cpu_ids) {
> -				rcu_read_unlock();
> -				return best_cpu;
> +			for_each_cpu_and(best_cpu, lowest_mask,
> +					 sched_domain_span(sd)) {
> +				if (best_cpu >= nr_cpu_ids)

Can that happen in this loop?

> +					break;
> +
> +				if (rt_task_fits_capacity(task, best_cpu)) {
> +					rcu_read_unlock();
> +					return best_cpu;
> +				}
>  			}
>  		}
>  	}
> @@ -1711,7 +1770,15 @@ static int find_lowest_rq(struct task_struct *task)
>  		return this_cpu;
>  
>  	cpu = cpumask_any(lowest_mask);
> -	if (cpu < nr_cpu_ids)
> +
> +	/*
> +	 * Make sure that the fitness on @cpu doesn't change compared to the
> +	 * cpu we're currently running on.
> +	 */
> +	test = rt_task_fits_capacity(task, cpu) ==
> +	       rt_task_fits_capacity(task, task_cpu(task));

But what if the new CPU is a better fit, and we are running on a CPU
that's not fit?

-- Steve


> +
> +	if (cpu < nr_cpu_ids && test)
>  		return cpu;
>  
>  	return -1;
> @@ -2160,12 +2227,14 @@ static void pull_rt_task(struct rq *this_rq)
>   */
>  static void task_woken_rt(struct rq *rq, struct task_struct *p)
>  {
> -	if (!task_running(rq, p) &&
> -	    !test_tsk_need_resched(rq->curr) &&
> -	    p->nr_cpus_allowed > 1 &&
> -	    (dl_task(rq->curr) || rt_task(rq->curr)) &&
> -	    (rq->curr->nr_cpus_allowed < 2 ||
> -	     rq->curr->prio <= p->prio))
> +	bool need_to_push = !task_running(rq, p) &&
> +			    !test_tsk_need_resched(rq->curr) &&
> +			    p->nr_cpus_allowed > 1 &&
> +			    (dl_task(rq->curr) || rt_task(rq->curr)) &&
> +			    (rq->curr->nr_cpus_allowed < 2 ||
> +			     rq->curr->prio <= p->prio);
> +
> +	if (need_to_push || !rt_task_fits_capacity(p, cpu_of(rq)))
>  		push_rt_tasks(rq);
>  }
>  
> @@ -2237,7 +2306,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
>  	 */
>  	if (task_on_rq_queued(p) && rq->curr != p) {
>  #ifdef CONFIG_SMP
> -		if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
> +		bool need_to_push = rq->rt.overloaded ||
> +				    !rt_task_fits_capacity(p, cpu_of(rq));
> +
> +		if (p->nr_cpus_allowed > 1 && need_to_push)
>  			rt_queue_push_tasks(rq);
>  #endif /* CONFIG_SMP */
>  		if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))



* Re: [PATCH] sched: rt: Make RT capacity aware
  2019-09-04 11:25 ` Steven Rostedt
@ 2019-09-04 15:40   ` Qais Yousef
  2019-09-13 13:30     ` Dietmar Eggemann
  0 siblings, 1 reply; 9+ messages in thread
From: Qais Yousef @ 2019-09-04 15:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Peter Zijlstra, Dietmar Eggemann, Vincent Guittot,
	Alessio Balsini, linux-kernel

On 09/04/19 07:25, Steven Rostedt wrote:
> On Tue,  3 Sep 2019 11:33:29 +0100
> Qais Yousef <qais.yousef@arm.com> wrote:
> 
> > Capacity Awareness refers to the fact that on heterogeneous systems
> > (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> > when placing tasks we need to be aware of this difference of CPU
> > capacities.
> > 
> > In such scenarios we want to ensure that the selected CPU has enough
> > capacity to meet the requirement of the running task. Enough capacity
> > means here that capacity_orig_of(cpu) >= task.requirement.
> > 
> > The definition of task.requirement is dependent on the scheduling class.
> > 
> > For CFS, utilization is used to select a CPU that has >= capacity value
> > than the cfs_task.util.
> > 
> > 	capacity_orig_of(cpu) >= cfs_task.util
> > 
> > DL isn't capacity aware at the moment but can make use of the bandwidth
> > reservation to implement that in a similar manner CFS uses utilization.
> > The following patchset implements that:
> > 
> > https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
> > 
> > 	capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
> > 
> > For RT we don't have a per task utilization signal and we lack any
> > information in general about what performance requirement the RT task
> > needs. But with the introduction of uclamp, RT tasks can now control
> > that by setting uclamp_min to guarantee a minimum performance point.
> > 
> > ATM the uclamp value are only used for frequency selection; but on
> > heterogeneous systems this is not enough and we need to ensure that the
> > capacity of the CPU is >= uclamp_min. Which is what implemented here.
> > 
> > 	capacity_orig_of(cpu) >= rt_task.uclamp_min
> > 
> > Note that by default uclamp.min is 1024, which means that RT tasks will
> > always be biased towards the big CPUs, which make for a better more
> > predictable behavior for the default case.
> > 
> > Must stress that the bias acts as a hint rather than a definite
> > placement strategy. For example, if all big cores are busy executing
> > other RT tasks we can't guarantee that a new RT task will be placed
> > there.
> > 
> > On non-heterogeneous systems the original behavior of RT should be
> > retained. Similarly if uclamp is not selected in the config.
> 
> Nice change log. It clearly describes what you are trying to do with
> the patch, and gives an idea of what is still needed to be done.

Thanks! Kudos to my colleagues who helped refine it to ensure clarity.

> 
> > 
> > Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> > ---
> > 
> > The logic is not perfect. For example if a 'small' task is occupying a big CPU
> > and another big task wakes up; we won't force migrate the small task to clear
> > the big cpu for the big task that woke up.
> > 
> > IOW, the logic is best effort and can't give hard guarantees. But improves the
> > current situation where a task can randomly end up on any CPU regardless of
> > what it needs. ie: without this patch an RT task can wake up on a big or small
> > CPU, but with this it will always wake up on a big CPU (assuming the big CPUs
> > aren't overloaded) - hence provide a consistent performance.
> > 
> > I'm looking at ways to improve this best effort, but this patch should be
> > a good start to discuss our Capacity Awareness requirement. There's a trade-off
> > of complexity to be made here and I'd like to keep things as simple as
> > possible and build on top as needed.
> > 
> > 
> >  kernel/sched/rt.c | 112 +++++++++++++++++++++++++++++++++++++---------
> >  1 file changed, 92 insertions(+), 20 deletions(-)
> > 
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index a532558a5176..7c3bcbef692b 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -436,6 +436,45 @@ static inline int on_rt_rq(struct sched_rt_entity *rt_se)
> >  	return rt_se->on_rq;
> >  }
> >  
> > +#ifdef CONFIG_UCLAMP_TASK
> > +/*
> > + * Verify the fitness of task @p to run on @cpu taking into account the uclamp
> > + * settings.
> > + *
> > + * This check is only important for heterogeneous systems where uclamp_min value
> > + * is higher than the capacity of a @cpu. For non-heterogeneous system this
> > + * function will always return true.
> > + *
> > + * The function will return true if the capacity of the @cpu is >= the
> > + * uclamp_min and false otherwise.
> > + *
> > + * Note that uclamp_min will be clamped to uclamp_max if uclamp_min
> > + * > uclamp_max.
> > + */
> > +static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > +{
> > +	unsigned int min_cap;
> > +	unsigned int max_cap;
> > +	unsigned int cpu_cap;
> > +
> > +	/* Only heterogeneous systems can benefit from this check */
> > +	if (!static_branch_unlikely(&sched_asym_cpucapacity))
> 
> OK, so this is optimized for the case that UCLAMP_TASK is compiled in
> but for an homogeneous system.

Correct. sched_asym_cpucapacity will be enabled when building the
topology/sched_domains at boot if we detect we are running on such a system.

> 
> > +		return true;
> > +
> > +	min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> > +	max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> > +
> > +	cpu_cap = capacity_orig_of(cpu);
> > +
> > +	return cpu_cap >= min(min_cap, max_cap);
> > +}
> > +#else
> > +static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > +{
> > +	return true;
> > +}
> > +#endif
> > +
> >  #ifdef CONFIG_RT_GROUP_SCHED
> >  
> >  static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
> > @@ -1390,6 +1429,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> >  {
> >  	struct task_struct *curr;
> >  	struct rq *rq;
> > +	bool test;
> >  
> >  	/* For anything but wake ups, just return the task_cpu */
> >  	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
> > @@ -1421,10 +1461,16 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> >  	 *
> >  	 * This test is optimistic, if we get it wrong the load-balancer
> >  	 * will have to sort it out.
> > +	 *
> > +	 * We take into account the capacity of the cpu to ensure it fits the
> > +	 * requirement of the task - which is only important on heterogeneous
> > +	 * systems like big.LITTLE.
> >  	 */
> > -	if (curr && unlikely(rt_task(curr)) &&
> > -	    (curr->nr_cpus_allowed < 2 ||
> > -	     curr->prio <= p->prio)) {
> > +	test = curr &&
> > +	       unlikely(rt_task(curr)) &&
> > +	       (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
> > +
> > +	if (test || !rt_task_fits_capacity(p, cpu)) {
> >  		int target = find_lowest_rq(p);
> >  
> >  		/*
> > @@ -1614,7 +1660,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
> >  static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
> >  {
> >  	if (!task_running(rq, p) &&
> > -	    cpumask_test_cpu(cpu, p->cpus_ptr))
> > +	    cpumask_test_cpu(cpu, p->cpus_ptr) &&
> > +	    rt_task_fits_capacity(p, cpu))
> 
> Hmm, so if a CPU goes idle, and looks for CPUS with more than one RT
> task queued (overloaded), it will skip pulling RT tasks if they are
> below capacity. Is that the desired effect? I think we could end up
> with a small idle CPUs with RT tasks waiting to run.

The intention was not to pull a task that doesn't fit, but also not to abort
the whole pull operation. pick_highest_pushable_task() should still iterate
through the remaining tasks, or did I miss something?
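
For reference, pick_highest_pushable_task() (paraphrased here from
kernel/sched/rt.c, with a comment added) is just a walk over the pushable
tasks plist, so a task rejected by the stricter pick_rt_task() check is
skipped rather than terminating the search:

	static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
	{
		struct plist_head *head = &rq->rt.pushable_tasks;
		struct task_struct *p;

		if (!has_pushable_tasks(rq))
			return NULL;

		/*
		 * The plist is sorted by priority; if pick_rt_task() refuses
		 * a task (e.g. it doesn't fit @cpu), keep walking rather than
		 * giving up on the whole pull.
		 */
		plist_for_each_entry(p, head, pushable_tasks) {
			if (pick_rt_task(rq, p, cpu))
				return p;
		}

		return NULL;
	}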

> 
> 
> >  		return 1;
> >  
> >  	return 0;
> > @@ -1648,6 +1695,7 @@ static int find_lowest_rq(struct task_struct *task)
> >  	struct cpumask *lowest_mask = this_cpu_cpumask_var_ptr(local_cpu_mask);
> >  	int this_cpu = smp_processor_id();
> >  	int cpu      = task_cpu(task);
> > +	bool test;
> >  
> >  	/* Make sure the mask is initialized first */
> >  	if (unlikely(!lowest_mask))
> > @@ -1666,16 +1714,23 @@ static int find_lowest_rq(struct task_struct *task)
> >  	 *
> >  	 * We prioritize the last CPU that the task executed on since
> >  	 * it is most likely cache-hot in that location.
> > +	 *
> > +	 * Assuming the task still fits the capacity of the last CPU.
> >  	 */
> > -	if (cpumask_test_cpu(cpu, lowest_mask))
> > +	if (cpumask_test_cpu(cpu, lowest_mask) &&
> > +	    rt_task_fits_capacity(task, cpu)) {
> >  		return cpu;
> > +	}
> >  
> >  	/*
> >  	 * Otherwise, we consult the sched_domains span maps to figure
> >  	 * out which CPU is logically closest to our hot cache data.
> >  	 */
> > -	if (!cpumask_test_cpu(this_cpu, lowest_mask))
> > -		this_cpu = -1; /* Skip this_cpu opt if not among lowest */
> > +	if (!cpumask_test_cpu(this_cpu, lowest_mask) ||
> > +	    !rt_task_fits_capacity(task, this_cpu)) {
> > +		/* Skip this_cpu opt if not among lowest or doesn't fit */
> > +		this_cpu = -1;
> > +	}
> >  
> >  	rcu_read_lock();
> >  	for_each_domain(cpu, sd) {
> > @@ -1692,11 +1747,15 @@ static int find_lowest_rq(struct task_struct *task)
> >  				return this_cpu;
> >  			}
> >  
> > -			best_cpu = cpumask_first_and(lowest_mask,
> > -						     sched_domain_span(sd));
> > -			if (best_cpu < nr_cpu_ids) {
> > -				rcu_read_unlock();
> > -				return best_cpu;
> > +			for_each_cpu_and(best_cpu, lowest_mask,
> > +					 sched_domain_span(sd)) {
> > +				if (best_cpu >= nr_cpu_ids)
> 
> Can that happen in this loop?

I kept the condition that was originally here but inverted the logic so we
don't mindlessly iterate through the rest of the CPUs. IOW, I tried to retain
the behavior of the original `if (best_cpu < nr_cpu_ids)` logic.

Whether we can remove this check I don't know, to be honest. A similar check
exists below and I did wonder under what conditions this could happen, but
I didn't try to follow the thread.

The only case I can think of is if we set nr_cpu_ids through the command line
to a lower value. Then, if cpu_possible_mask and family aren't updated
accordingly, this check will protect against scheduling on CPUs the user said
they don't want to use? Just guessing.

> 
> > +					break;
> > +
> > +				if (rt_task_fits_capacity(task, best_cpu)) {
> > +					rcu_read_unlock();
> > +					return best_cpu;
> > +				}
> >  			}
> >  		}
> >  	}
> > @@ -1711,7 +1770,15 @@ static int find_lowest_rq(struct task_struct *task)
> >  		return this_cpu;
> >  
> >  	cpu = cpumask_any(lowest_mask);
> > -	if (cpu < nr_cpu_ids)
> > +
> > +	/*
> > +	 * Make sure that the fitness on @cpu doesn't change compared to the
> > +	 * cpu we're currently running on.
> > +	 */
> > +	test = rt_task_fits_capacity(task, cpu) ==
> > +	       rt_task_fits_capacity(task, task_cpu(task));
> 
> But what if the new CPU is a better fit, and we are running on a CPU
> that's not fit?

Then we shouldn't reach this fallback code and we should have a valid best_cpu.
The fact that we are here means that a better fitting CPU doesn't exist.

Maybe I can improve this condition. What we really want to protect against is
that, if we are on the right CPU, we don't migrate to the wrong one.

In my head that seemed the simplest condition, but maybe I overthought it.

	if (!rt_task_fits_cpu(task, task_cpu(task)))
		// pick any cpu

Should give what I want too.

Thanks!

--
Qais Yousef

> 
> -- Steve
> 
> 
> > +
> > +	if (cpu < nr_cpu_ids && test)
> >  		return cpu;
> >  
> >  	return -1;
> > @@ -2160,12 +2227,14 @@ static void pull_rt_task(struct rq *this_rq)
> >   */
> >  static void task_woken_rt(struct rq *rq, struct task_struct *p)
> >  {
> > -	if (!task_running(rq, p) &&
> > -	    !test_tsk_need_resched(rq->curr) &&
> > -	    p->nr_cpus_allowed > 1 &&
> > -	    (dl_task(rq->curr) || rt_task(rq->curr)) &&
> > -	    (rq->curr->nr_cpus_allowed < 2 ||
> > -	     rq->curr->prio <= p->prio))
> > +	bool need_to_push = !task_running(rq, p) &&
> > +			    !test_tsk_need_resched(rq->curr) &&
> > +			    p->nr_cpus_allowed > 1 &&
> > +			    (dl_task(rq->curr) || rt_task(rq->curr)) &&
> > +			    (rq->curr->nr_cpus_allowed < 2 ||
> > +			     rq->curr->prio <= p->prio);
> > +
> > +	if (need_to_push || !rt_task_fits_capacity(p, cpu_of(rq)))
> >  		push_rt_tasks(rq);
> >  }
> >  
> > @@ -2237,7 +2306,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
> >  	 */
> >  	if (task_on_rq_queued(p) && rq->curr != p) {
> >  #ifdef CONFIG_SMP
> > -		if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
> > +		bool need_to_push = rq->rt.overloaded ||
> > +				    !rt_task_fits_capacity(p, cpu_of(rq));
> > +
> > +		if (p->nr_cpus_allowed > 1 && need_to_push)
> >  			rt_queue_push_tasks(rq);
> >  #endif /* CONFIG_SMP */
> >  		if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
> 


* Re: [PATCH] sched: rt: Make RT capacity aware
  2019-09-04 15:40   ` Qais Yousef
@ 2019-09-13 13:30     ` Dietmar Eggemann
  2019-09-18 14:52       ` Qais Yousef
  0 siblings, 1 reply; 9+ messages in thread
From: Dietmar Eggemann @ 2019-09-13 13:30 UTC (permalink / raw)
  To: Qais Yousef, Steven Rostedt
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Alessio Balsini,
	linux-kernel

On 9/4/19 4:40 PM, Qais Yousef wrote:
> On 09/04/19 07:25, Steven Rostedt wrote:
>> On Tue,  3 Sep 2019 11:33:29 +0100
>> Qais Yousef <qais.yousef@arm.com> wrote:

[...]

>>> @@ -1614,7 +1660,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
>>>  static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
>>>  {
>>>  	if (!task_running(rq, p) &&
>>> -	    cpumask_test_cpu(cpu, p->cpus_ptr))
>>> +	    cpumask_test_cpu(cpu, p->cpus_ptr) &&
>>> +	    rt_task_fits_capacity(p, cpu))
>>
>> Hmm, so if a CPU goes idle, and looks for CPUS with more than one RT
>> task queued (overloaded), it will skip pulling RT tasks if they are
>> below capacity. Is that the desired effect? I think we could end up
>> with a small idle CPUs with RT tasks waiting to run.
> 
> The intention was not to pull this task that doesn't fit. But not to abort the
> whole pull operation. pick_highest_pushable_task() should still iterate through
> the remaining tasks, or did I miss something?

On a big.LITTLE system (6 CPUs with [446 1024 1024 446 466 466] CPU
capacity vector) I try to trace the handling of the 3rd big task
(big2-2, util_min: 800, util_max: 1024) of an rt-app workload running 3
of them.

rt_task_fits_capacity() call in:

tag 1: select_task_rq_rt(), 3-7: 1st till 5th in find_lowest_rq()

[   37.505325] rt_task_fits_capacity: CPU3 tag=1 [big2-2 285] ret=0
[   37.505882] find_lowest_rq: CPU3 [big2-2 285] tag=1 find_lowest_rq
[   37.506509] CPU3 [big2-2 285] lowest_mask=0,3-5
[   37.507971] rt_task_fits_capacity: CPU3 tag=3 [big2-2 285] ret=0
[   37.508200] rt_task_fits_capacity: CPU3 tag=4 [big2-2 285] ret=0
[   37.509486] rt_task_fits_capacity: CPU0 tag=5 [big2-2 285] ret=0
[   37.510191] rt_task_fits_capacity: CPU3 tag=5 [big2-2 285] ret=0
[   37.511334] rt_task_fits_capacity: CPU4 tag=5 [big2-2 285] ret=0
[   37.512194] rt_task_fits_capacity: CPU5 tag=5 [big2-2 285] ret=0
[   37.513210] rt_task_fits_capacity: CPU0 tag=6 [big2-2 285] ret=0
[   37.514085] rt_task_fits_capacity: CPU3 tag=7 [big2-2 285] ret=0
[   37.514732] --> select_task_rq_rt: CPU3 [big2-2 285] cpu=0

Since CPUs 0,3-5 can't run big2-2, the search only ends at the final test that
the fitness hasn't changed. If a capacity-aware cpupri_find() (with a fallback
CPU) had returned a suitable lowest_mask, less work would have been needed.

The snippet is repeating itself for the whole run of the workload since
all the rt-tasks run for the same time and I'm only faking big.LITTLE on
qemu.

[...]

>>>  	rcu_read_lock();
>>>  	for_each_domain(cpu, sd) {
>>> @@ -1692,11 +1747,15 @@ static int find_lowest_rq(struct task_struct *task)
>>>  				return this_cpu;
>>>  			}
>>>  
>>> -			best_cpu = cpumask_first_and(lowest_mask,
>>> -						     sched_domain_span(sd));
>>> -			if (best_cpu < nr_cpu_ids) {
>>> -				rcu_read_unlock();
>>> -				return best_cpu;
>>> +			for_each_cpu_and(best_cpu, lowest_mask,
>>> +					 sched_domain_span(sd)) {
>>> +				if (best_cpu >= nr_cpu_ids)
>>
>> Can that happen in this loop?
> 
> I kept the condition that was originally here but inverted the logic so we
> don't mindlessly iterate through the rest of the CPUs. IOW, tried to retain the
> original behavior of `if (best_cpu < nr_cpu_ids)` logic.
> 
> Whether we can remove this check I don't know to be honest. Similar check exist
> below and I did wonder under what conditions this could happen but didn't try
> to follow the thread.
> 
> The only case I can think about is if we set nr_cpu_ids through command line
> to a lower value. Then if cpu_possible_mask and family aren't updated
> accordingly then yeah this check will protect against scheduling on cpus the
> users said they don't want to use? Just guessing.

Why don't you build the capacity awareness into cpupri_find(...,
lowest_mask)?
You would have to add a fallback strategy in case p doesn't fit on any of the
CPUs cpupri_find() returns today as lowest_mask (return the CPU with the max
capacity).

Luca proposed something like this for SCHED_DEADLINE in "[RFC PATCH 0/6]
Capacity awareness for SCHED_DEADLINE" (patch 2/6 and 5/6)

https://lkml.kernel.org/r/20190506044836.2914-1-luca.abeni@santannapisa.it

In this case you could get rid of all the newly introduced
rt_task_fits_capacity() logic in find_lowest_rq().

---

I can't see the reason for the

for_each_domain(cpu, sd) {
	for_each_cpu_and(best_cpu, lowest_mask, sched_domain_span(sd)) {
                ...
	}
}

other than that we want to first pick a CPU from the lowest_mask which is
closer to task_cpu(task).

Qemu doesn't give me clusters:

# cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
MC

[...]


* Re: [PATCH] sched: rt: Make RT capacity aware
  2019-09-13 13:30     ` Dietmar Eggemann
@ 2019-09-18 14:52       ` Qais Yousef
  2019-09-20 12:52         ` Dietmar Eggemann
  0 siblings, 1 reply; 9+ messages in thread
From: Qais Yousef @ 2019-09-18 14:52 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Alessio Balsini, linux-kernel

On 09/13/19 14:30, Dietmar Eggemann wrote:
> On 9/4/19 4:40 PM, Qais Yousef wrote:
> > On 09/04/19 07:25, Steven Rostedt wrote:
> >> On Tue,  3 Sep 2019 11:33:29 +0100
> >> Qais Yousef <qais.yousef@arm.com> wrote:
> 
> [...]
> 
> >>> @@ -1614,7 +1660,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
> >>>  static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
> >>>  {
> >>>  	if (!task_running(rq, p) &&
> >>> -	    cpumask_test_cpu(cpu, p->cpus_ptr))
> >>> +	    cpumask_test_cpu(cpu, p->cpus_ptr) &&
> >>> +	    rt_task_fits_capacity(p, cpu))
> >>
> >> Hmm, so if a CPU goes idle, and looks for CPUS with more than one RT
> >> task queued (overloaded), it will skip pulling RT tasks if they are
> >> below capacity. Is that the desired effect? I think we could end up
> >> with a small idle CPUs with RT tasks waiting to run.
> > 
> > The intention was not to pull this task that doesn't fit. But not to abort the
> > whole pull operation. pick_highest_pushable_task() should still iterate through
> > the remaining tasks, or did I miss something?
> 
> On a big.LITTLE system (6 CPUs with [446 1024 1024 446 466 466] CPU
> capacity vector) I try to trace the handling of the 3rd big task
> (big2-2, util_min: 800, util_max: 1024) of an rt-app workload running 3
> of them.
> 
> rt_task_fits_capacity() call in:
> 
> tag 1: select_task_rq_rt(), 3-7: 1st till 5th in find_lowest_rq()
> 
> [   37.505325] rt_task_fits_capacity: CPU3 tag=1 [big2-2 285] ret=0
> [   37.505882] find_lowest_rq: CPU3 [big2-2 285] tag=1 find_lowest_rq
> [   37.506509] CPU3 [big2-2 285] lowest_mask=0,3-5
> [   37.507971] rt_task_fits_capacity: CPU3 tag=3 [big2-2 285] ret=0
> [   37.508200] rt_task_fits_capacity: CPU3 tag=4 [big2-2 285] ret=0
> [   37.509486] rt_task_fits_capacity: CPU0 tag=5 [big2-2 285] ret=0
> [   37.510191] rt_task_fits_capacity: CPU3 tag=5 [big2-2 285] ret=0
> [   37.511334] rt_task_fits_capacity: CPU4 tag=5 [big2-2 285] ret=0
> [   37.512194] rt_task_fits_capacity: CPU5 tag=5 [big2-2 285] ret=0
> [   37.513210] rt_task_fits_capacity: CPU0 tag=6 [big2-2 285] ret=0
> [   37.514085] rt_task_fits_capacity: CPU3 tag=7 [big2-2 285] ret=0
> [   37.514732] --> select_task_rq_rt: CPU3 [big2-2 285] cpu=0
> 
> Since CPU 0,3-5 can't run big2-2, it takes up to the test that the
> fitness hasn't changed. In case a capacity-aware (with fallback CPU)
> cpupri_find() would have returned a suitable lowest_mask, less work
> would have been needed.

rt_task_fits_capacity() is inlined and I'd expect all the data it accesses to
be cache hot, so it should be fast.

I had 2 concerns with using cpupri_find()

	1. find_lowest_rq() is not the only user

	2. The fallback mechanism means we either have to call cpupri_find()
	   twice: once to find the filtered lowest_mask and once to return the
	   non-filtered version.

	   ie:

		cpupri_find(..., check_fitness = true);
		// returns lowest_mask that is filtered for cap fitness

		fallback:
			cpupri_find(..., check_fitness = false);
			// returns lowest_mask without check for cap fitness

	   Or we can pass 2 masks to cpupri_find() so that we fill the filtered
	   and non-filtered results in one go and use the one that is
	   relevant.

	   ie:

	   	cpupri_find(... , struct cpumask *lowest_mask,
	   			  struct cpumask *lowest_mask_filtered)
		// Use lowest_mask_filtered

		fallback:
			// Use lowest_mask

So the way I see it, it's not actually making things neater and in my eyes it
looks more involved/complex.

Happy to explore this path further if maintainers advocate for it too though.

> 
> The snippet is repeating itself for the whole run of the workload since
> all the rt-tasks run for the same time and I'm only faking big.LITTLE on
> qemu.
> 
> [...]
> 
> >>>  	rcu_read_lock();
> >>>  	for_each_domain(cpu, sd) {
> >>> @@ -1692,11 +1747,15 @@ static int find_lowest_rq(struct task_struct *task)
> >>>  				return this_cpu;
> >>>  			}
> >>>  
> >>> -			best_cpu = cpumask_first_and(lowest_mask,
> >>> -						     sched_domain_span(sd));
> >>> -			if (best_cpu < nr_cpu_ids) {
> >>> -				rcu_read_unlock();
> >>> -				return best_cpu;
> >>> +			for_each_cpu_and(best_cpu, lowest_mask,
> >>> +					 sched_domain_span(sd)) {
> >>> +				if (best_cpu >= nr_cpu_ids)
> >>
> >> Can that happen in this loop?
> > 
> > I kept the condition that was originally here but inverted the logic so we
> > don't mindlessly iterate through the rest of the CPUs. IOW, tried to retain the
> > original behavior of `if (best_cpu < nr_cpu_ids)` logic.
> > 
> > Whether we can remove this check I don't know to be honest. Similar check exist
> > below and I did wonder under what conditions this could happen but didn't try
> > to follow the thread.
> > 
> > The only case I can think about is if we set nr_cpu_ids through command line
> > to a lower value. Then if cpu_possible_mask and family aren't updated
> > accordingly then yeah this check will protect against scheduling on cpus the
> > users said they don't want to use? Just guessing.
> 
> Why don't you build the capacity awareness into cpupri_find(...,
> lowest_mask)?

Hopefully my answer above covered this.

> You would have to add a fallback strategy in case p doesn't fit on any
> CPUs cpupri_find() returns today as lowest_mask. (return the CPU with
> the max capacity).
> 
> Luca proposed something like this for SCHED_DEADLINE in "[RFC PATCH 0/6]
> Capacity awareness for SCHED_DEADLINE" (patch 2/6 and 5/6)
> 
> https://lkml.kernel.org/r/20190506044836.2914-1-luca.abeni@santannapisa.it

It would be nice to keep the implementations similar. I haven't looked closely
at the deadline code, so I'm not sure whether it has similar complexities to RT
or not. I'll have a closer look to see if I can borrow ideas from there.

> 
> In this case you could get rid of all the newly introduced
> rt_task_fits_capacity() logic in find_lowest_rq().
> 
> ---
> 
> I can't see the
> 
> for_each_domain(cpu, sd) {
> 	for_each_cpu_and(best_cpu, lowest_mask, sched_domain_span(sd)) {
>                 ...
> 	}
> }
> 
> other than we want to pick a CPU from the lowest_mask first which is
> closer to task_cpu(task).
> 
> Qemu doesn't give me cluster:
> 
> # cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
> MC

I'm not sure I get what you're trying to get at here. Can you please elaborate
more?

Thanks!

--
Qais Yousef


* Re: [PATCH] sched: rt: Make RT capacity aware
  2019-09-18 14:52       ` Qais Yousef
@ 2019-09-20 12:52         ` Dietmar Eggemann
  2019-09-23 11:52           ` Qais Yousef
  0 siblings, 1 reply; 9+ messages in thread
From: Dietmar Eggemann @ 2019-09-20 12:52 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Alessio Balsini, linux-kernel

On 9/18/19 4:52 PM, Qais Yousef wrote:
> On 09/13/19 14:30, Dietmar Eggemann wrote:
>> On 9/4/19 4:40 PM, Qais Yousef wrote:
>>> On 09/04/19 07:25, Steven Rostedt wrote:
>>>> On Tue,  3 Sep 2019 11:33:29 +0100
>>>> Qais Yousef <qais.yousef@arm.com> wrote:

[...]

>> On a big.LITTLE system (6 CPUs with [446 1024 1024 446 466 466] CPU
>> capacity vector) I try to trace the handling of the 3rd big task
>> (big2-2, util_min: 800, util_max: 1024) of an rt-app workload running 3
>> of them.
>>
>> rt_task_fits_capacity() call in:
>>
>> tag 1: select_task_rq_rt(), 3-7: 1st till 5th in find_lowest_rq()
>>
>> [   37.505325] rt_task_fits_capacity: CPU3 tag=1 [big2-2 285] ret=0
>> [   37.505882] find_lowest_rq: CPU3 [big2-2 285] tag=1 find_lowest_rq
>> [   37.506509] CPU3 [big2-2 285] lowest_mask=0,3-5
>> [   37.507971] rt_task_fits_capacity: CPU3 tag=3 [big2-2 285] ret=0
>> [   37.508200] rt_task_fits_capacity: CPU3 tag=4 [big2-2 285] ret=0
>> [   37.509486] rt_task_fits_capacity: CPU0 tag=5 [big2-2 285] ret=0
>> [   37.510191] rt_task_fits_capacity: CPU3 tag=5 [big2-2 285] ret=0
>> [   37.511334] rt_task_fits_capacity: CPU4 tag=5 [big2-2 285] ret=0
>> [   37.512194] rt_task_fits_capacity: CPU5 tag=5 [big2-2 285] ret=0
>> [   37.513210] rt_task_fits_capacity: CPU0 tag=6 [big2-2 285] ret=0
>> [   37.514085] rt_task_fits_capacity: CPU3 tag=7 [big2-2 285] ret=0
>> [   37.514732] --> select_task_rq_rt: CPU3 [big2-2 285] cpu=0
>>
>> Since CPU 0,3-5 can't run big2-2, it takes up to the test that the
>> fitness hasn't changed. In case a capacity-aware (with fallback CPU)
>> cpupri_find() would have returned a suitable lowest_mask, less work
>> would have been needed.
> 
> rt_task_fits_capacity() is inlined and I'd expect all the data it accesses to
> be cache hot, so it should be fast.

My concern is not so much runtime but maintenance of the new code. We
want to integrate a feature ('Asymmetric CPU capacity support') into the
RT and DL sched classes, which have the same overall design (DL was
copied from RT). IMHO, the new feature has to be integrated the same way
(if possible) to let people understand the related code easily.

> I had 2 concerns with using cpupri_find()
> 
> 	1. find_lowest_rq() is not the only user

But find_lowest_rq() is the only one calling it with lowest_mask != NULL
(just like for DL, where find_later_rq() is the only one calling cpudl_find()
with later_mask != NULL).

> 	2. The fallback mechanism means we either have to call cpupri_find()
> 	   twice once to find filtered lowest_rq and the other to return the
> 	   none filtered version.

This is what I have in mind. (Only compile tested! ... and the 'if
(cpumask_any(lowest_mask) >= nr_cpu_ids)' condition has to be considered
as well):

@@ -98,8 +103,26 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
                        continue;

                if (lowest_mask) {
+                       int cpu, max_cap_cpu = -1;
+                       unsigned long max_cap = 0;
+
                        cpumask_and(lowest_mask, p->cpus_ptr, vec->mask);

+                       for_each_cpu(cpu, lowest_mask) {
+                               unsigned long cap = arch_scale_cpu_capacity(cpu);
+
+                               if (!rt_task_fits_capacity(p, cpu))
+                                       cpumask_clear_cpu(cpu, lowest_mask);
+
+                               if (cap > max_cap) {
+                                       max_cap = cap;
+                                       max_cap_cpu = cpu;
+                               }
+                       }
+
+                       if (cpumask_empty(lowest_mask) && max_cap)
+                               cpumask_set_cpu(max_cap_cpu, lowest_mask);
+

[...]

>> I can't see the
>>
>> for_each_domain(cpu, sd) {
>> 	for_each_cpu_and(best_cpu, lowest_mask, sched_domain_span(sd)) {
>>                 ...
>> 	}
>> }
>>
>> other than we want to pick a CPU from the lowest_mask first which is
>> closer to task_cpu(task).
>>
>> Qemu doesn't give me cluster:
>>
>> # cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
>> MC
> 
> I'm not sure I get what you're trying to get at here. Can you please elaborate
> more?

I was referring to this for each scheduler domain snippet:

for_each_domain(cpu, sd) {

}

In the original code this would first look for a best_cpu at MC and then at
DIE level. I remember that Steven asked about the for_each_cpu_and() you
introduced here.
But in case you can test the idea of introducing the changes into
cpupri_find(), you won't have to worry about this anymore.

[...]


* Re: [PATCH] sched: rt: Make RT capacity aware
  2019-09-20 12:52         ` Dietmar Eggemann
@ 2019-09-23 11:52           ` Qais Yousef
  2019-10-07  9:14             ` Dietmar Eggemann
  0 siblings, 1 reply; 9+ messages in thread
From: Qais Yousef @ 2019-09-23 11:52 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Alessio Balsini, linux-kernel

On 09/20/19 14:52, Dietmar Eggemann wrote:
> > 	2. The fallback mechanism means we either have to call cpupri_find()
> > 	   twice once to find filtered lowest_rq and the other to return the
> > 	   none filtered version.
> 
> This is what I have in mind. (Only compile tested! ... and the 'if
> (cpumask_any(lowest_mask) >= nr_cpu_ids)' condition has to be considered
> as well):
> 
> @@ -98,8 +103,26 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
>                         continue;
> 
>                 if (lowest_mask) {
> +                       int cpu, max_cap_cpu = -1;
> +                       unsigned long max_cap = 0;
> +
>                         cpumask_and(lowest_mask, p->cpus_ptr, vec->mask);
> 
> +                       for_each_cpu(cpu, lowest_mask) {
> +                               unsigned long cap = arch_scale_cpu_capacity(cpu);
> +
> +                               if (!rt_task_fits_capacity(p, cpu))
> +                                       cpumask_clear_cpu(cpu, lowest_mask);
> +
> +                               if (cap > max_cap) {
> +                                       max_cap = cap;
> +                                       max_cap_cpu = cpu;
> +                               }
> +                       }
> +
> +                       if (cpumask_empty(lowest_mask) && max_cap)
> +                               cpumask_set_cpu(max_cap_cpu, lowest_mask);

I had a patch that I was testing, but what I did was to continue rather than
return a max_cap_cpu.

e.g:

	if no cpu at current priority fits the task:
		continue;
	else:
		return the lowest_mask which contains fitting cpus only

	if no fitting cpu was found:
		return 0;


Or we can tweak your approach to be

	if no cpu at current priority fits the task:
		if the cpu the task is currently running on doesn't fit it:
			return lowest_mask with max_cap_cpu set;

So we either:

	1. Continue the search until we find a fitting CPU; bail out otherwise.

	2. Or we attempt to return a CPU only if the CPU the task is currently
	   running on doesn't fit it. We don't want to migrate the task from a
	   fitting to a non-fitting one.

We can also do something hybrid like:

	3. Remember the outcome of 2 but don't return immediately and attempt
	   to find a fitting CPU at a different priority level.


Personally I see 1 is the simplest and good enough solution. What do you think?

I think this 'continue the search' behavior makes doing it in cpupri_find()
more robust than having to deal with whatever mask we first found in
find_lowest_rq() - so I'm starting to like this approach better. Thanks for
bringing it up.
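
To make option 1 concrete, here is a rough and untested sketch of the idea.
It ignores the vec->count/smp_rmb() handshake and other details of the real
cpupri_find(), and assumes rt_task_fits_capacity() is made visible to
kernel/sched/cpupri.c - so treat it as an illustration of the 'continue'
behavior rather than a proposed implementation:

	int cpupri_find(struct cpupri *cp, struct task_struct *p,
			struct cpumask *lowest_mask)
	{
		int task_pri = convert_prio(p->prio);
		int idx, cpu;

		for (idx = 0; idx < task_pri; idx++) {
			struct cpupri_vec *vec = &cp->pri_to_cpu[idx];

			if (cpumask_any_and(p->cpus_ptr, vec->mask) >= nr_cpu_ids)
				continue;

			if (lowest_mask) {
				cpumask_and(lowest_mask, p->cpus_ptr, vec->mask);

				/*
				 * Drop the CPUs that can't satisfy the task's
				 * uclamp_min. If nothing is left at this
				 * priority level, keep searching the higher
				 * levels instead of returning a non-fitting
				 * mask.
				 */
				for_each_cpu(cpu, lowest_mask) {
					if (!rt_task_fits_capacity(p, cpu))
						cpumask_clear_cpu(cpu, lowest_mask);
				}

				if (cpumask_empty(lowest_mask))
					continue;
			}

			return 1;
		}

		/* No fitting CPU at any priority level: bail out */
		return 0;
	}

find_lowest_rq() would then see a lowest_mask that already respects the
capacity fitness and wouldn't need most of the extra checks this patch adds
there.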


Cheers

--
Qais Yousef


* Re: [PATCH] sched: rt: Make RT capacity aware
  2019-09-23 11:52           ` Qais Yousef
@ 2019-10-07  9:14             ` Dietmar Eggemann
  2019-10-08  6:13               ` Qais Yousef
  0 siblings, 1 reply; 9+ messages in thread
From: Dietmar Eggemann @ 2019-10-07  9:14 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Alessio Balsini, linux-kernel

On 23/09/2019 13:52, Qais Yousef wrote:
> On 09/20/19 14:52, Dietmar Eggemann wrote:
>>> 	2. The fallback mechanism means we either have to call cpupri_find()
>>> 	   twice once to find filtered lowest_rq and the other to return the
>>> 	   none filtered version.
>>
>> This is what I have in mind. (Only compile tested! ... and the 'if
>> (cpumask_any(lowest_mask) >= nr_cpu_ids)' condition has to be considered
>> as well):
>>
>> @@ -98,8 +103,26 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
>>                         continue;
>>
>>                 if (lowest_mask) {
>> +                       int cpu, max_cap_cpu = -1;
>> +                       unsigned long max_cap = 0;
>> +
>>                         cpumask_and(lowest_mask, p->cpus_ptr, vec->mask);
>>
>> +                       for_each_cpu(cpu, lowest_mask) {
>> +                               unsigned long cap = arch_scale_cpu_capacity(cpu);
>> +
>> +                               if (!rt_task_fits_capacity(p, cpu))
>> +                                       cpumask_clear_cpu(cpu, lowest_mask);
>> +
>> +                               if (cap > max_cap) {
>> +                                       max_cap = cap;
>> +                                       max_cap_cpu = cpu;
>> +                               }
>> +                       }
>> +
>> +                       if (cpumask_empty(lowest_mask) && max_cap)
>> +                               cpumask_set_cpu(max_cap_cpu, lowest_mask);
> 
> I had a patch that I was testing but what I did is to continue rather than
> return a max_cap_cpu.

Continuing is the correct thing to do here. I just tried to illustrate
the idea.

> e.g:
> 
> 	if no cpu at current priority fits the task:
> 		continue;
> 	else:
> 		return the lowest_mask which contains fitting cpus only
> 
> 	if no fitting cpu was find:
> 		return 0;

I guess this is what we want to achieve here. It's unavoidable that we
will run sooner (compared to an SMP system) into a situation in which we
have to go higher in the rd->cpupri->pri_to_cpu[] array or in which we
can't return a lower mask at all.

> Or we can tweak your approach to be
> 
> 	if no cpu at current priority fits the task:
> 		if the cpu the task is currently running on doesn't fit it:
> 			return lowest_mask with max_cap_cpu set;

I wasn't aware of the pri_to_cpu[] array and how cpupri_find(...,
lowest_mask) tries to return the lowest_mask of the lowest priority
(pri_to_cpu[] index).

> So we either:
> 
> 	1. Continue the search until we find a fitting CPU; bail out otherwise.

If this describes the solution in which we concentrate the
capacity-awareness in cpupri_find(), then I'm OK with it.
find_lowest_rq() already favours task_cpu(task), this_cpu and finally
cpus in sched_groups (from the viewpoint of task_cpu(task)).

> 	2. Or we attempt to return a CPU only if the CPU the task is currently
> 	   running on doesn't fit it. We don't want to migrate the task from a
> 	   fitting to a non-fitting one.

I would prefer 1., keeping the necessary changes confined in
cpupri_find() if possible.

> We can also do something hybrid like:
> 
> 	3. Remember the outcome of 2 but don't return immediately and attempt
> 	   to find a fitting CPU at a different priority level.
> 
> 
> Personally I see 1 is the simplest and good enough solution. What do you think?

Agreed. We would potentially need a fast lookup for p -> uclamp_cpumask
though?

> I think this is 'continue' to search makes doing it at cpupri_find() more
> robust than having to deal with whatever mask we first found in
> find_lowest_rq() - so I'm starting to like this approach better. Thanks for
> bringing it up.

My main concern is that having rt_task_fits_capacity() added to almost
every condition in the code makes it hard to understand what capacity
awareness in RT wants to achieve.

[...]


* Re: [PATCH] sched: rt: Make RT capacity aware
  2019-10-07  9:14             ` Dietmar Eggemann
@ 2019-10-08  6:13               ` Qais Yousef
  0 siblings, 0 replies; 9+ messages in thread
From: Qais Yousef @ 2019-10-08  6:13 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Alessio Balsini, linux-kernel

On 10/07/19 11:14, Dietmar Eggemann wrote:
> On 23/09/2019 13:52, Qais Yousef wrote:
> > On 09/20/19 14:52, Dietmar Eggemann wrote:
> >>> 	2. The fallback mechanism means we either have to call cpupri_find()
> >>> 	   twice once to find filtered lowest_rq and the other to return the
> >>> 	   none filtered version.
> >>
> >> This is what I have in mind. (Only compile tested! ... and the 'if
> >> (cpumask_any(lowest_mask) >= nr_cpu_ids)' condition has to be considered
> >> as well):
> >>
> >> @@ -98,8 +103,26 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
> >>                         continue;
> >>
> >>                 if (lowest_mask) {
> >> +                       int cpu, max_cap_cpu = -1;
> >> +                       unsigned long max_cap = 0;
> >> +
> >>                         cpumask_and(lowest_mask, p->cpus_ptr, vec->mask);
> >>
> >> +                       for_each_cpu(cpu, lowest_mask) {
> >> +                               unsigned long cap = arch_scale_cpu_capacity(cpu);
> >> +
> >> +                               if (!rt_task_fits_capacity(p, cpu))
> >> +                                       cpumask_clear_cpu(cpu, lowest_mask);
> >> +
> >> +                               if (cap > max_cap) {
> >> +                                       max_cap = cap;
> >> +                                       max_cap_cpu = cpu;
> >> +                               }
> >> +                       }
> >> +
> >> +                       if (cpumask_empty(lowest_mask) && max_cap)
> >> +                               cpumask_set_cpu(max_cap_cpu, lowest_mask);
> > 
> > I had a patch that I was testing but what I did is to continue rather than
> > return a max_cap_cpu.
> 
> Continuing is the correct thing to do here. I just tried to illustrate
> the idea.
> 
> > e.g:
> > 
> > 	if no cpu at current priority fits the task:
> > 		continue;
> > 	else:
> > 		return the lowest_mask which contains fitting cpus only
> > 
> > 	if no fitting cpu was find:
> > 		return 0;
> 
> I guess this is what we want to achieve here. It's unavoidable that we
> will run sooner (compared to an SMP system) into a situation in which we
> have to go higher in the rd->cpupri->pri_to_cpu[] array or in which we
> can't return a lower mask at all.
> 
> > Or we can tweak your approach to be
> > 
> > 	if no cpu at current priority fits the task:
> > 		if the cpu the task is currently running on doesn't fit it:
> > 			return lowest_mask with max_cap_cpu set;
> 
> I wasn't aware of the pri_to_cpu[] array and how cpupri_find(,
> lowest_mask) tries to return the lowest_mask of the lowest priority
> (pri_to_cpu[] index).
> 
> > So we either:
> > 
> > 	1. Continue the search until we find a fitting CPU; bail out otherwise.
> 
> If this describes the solution in which we concentrate the
> capacity-awareness in cpupri_find(), then I'm OK with it.
> find_lowest_rq() already favours task_cpu(task), this_cpu and finally
> cpus in sched_groups (from the viewpoint of task_cpu(task)).
> 
> > 	2. Or we attempt to return a CPU only if the CPU the task is currently
> > 	   running on doesn't fit it. We don't want to migrate the task from a
> > 	   fitting to a non-fitting one.
> 
> I would prefer 1., keeping the necessary changes confined in
> cpupri_find() if possible.

We are in agreement then.

> 
> > We can also do something hybrid like:
> > 
> > 	3. Remember the outcome of 2 but don't return immediately and attempt
> > 	   to find a fitting CPU at a different priority level.
> > 
> > 
> > Personally I see 1 is the simplest and good enough solution. What do you think?
> 
> Agreed. We would potentially need a fast lookup for p -> uclamp_cpumask
> though?

We can extend task_struct to store a cpumask of the CPUs that fit the uclamp
settings and keep it up-to-date whenever the uclamp values change. I did
consider that, but it seemed better to keep the implementation confined.
I could have been too conservative - so I'd be happy to look at that.

Thanks

--
Qais Yousef

> 
> > I think this is 'continue' to search makes doing it at cpupri_find() more
> > robust than having to deal with whatever mask we first found in
> > find_lowest_rq() - so I'm starting to like this approach better. Thanks for
> > bringing it up.
> 
> My main concern is that having rt_task_fits_capacity() added to almost
> every condition in the code makes it hard to understand what capacity
> awareness in RT wants to achieve.
> 
> [...]



Thread overview: 9+ messages
2019-09-03 10:33 [PATCH] sched: rt: Make RT capacity aware Qais Yousef
2019-09-04 11:25 ` Steven Rostedt
2019-09-04 15:40   ` Qais Yousef
2019-09-13 13:30     ` Dietmar Eggemann
2019-09-18 14:52       ` Qais Yousef
2019-09-20 12:52         ` Dietmar Eggemann
2019-09-23 11:52           ` Qais Yousef
2019-10-07  9:14             ` Dietmar Eggemann
2019-10-08  6:13               ` Qais Yousef
