* [PATCH v2 0/6] Capacity awareness for SCHED_DEADLINE
@ 2020-04-27  8:37 Dietmar Eggemann
  2020-04-27  8:37 ` [PATCH v2 1/6] sched/topology: Store root domain CPU capacity sum Dietmar Eggemann
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Dietmar Eggemann @ 2020-04-27  8:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli
  Cc: Vincent Guittot, Steven Rostedt, Luca Abeni,
	Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

The SCHED_DEADLINE (DL) admission control does not work correctly on
heterogeneous (asymmetric CPU capacity) systems such as Arm big.LITTLE
or DynamIQ.

Let's fix this by explicitly considering CPU capacity in DL admission
control and task migration.

The DL sched class now attempts to avoid missing task deadlines caused
by a smaller CPU (CPU capacity < 1024) not being capable enough to
finish a task in time. It does so by trying to place a task so that
its CPU capacity scaled deadline is not smaller than its runtime.

This patch-set only supports capacity awareness in idle scenarios
(cpudl::free_cpus not empty). Capacity awareness for non-idle cases
should be added in a later series.

Changes v1 [1] -> v2:

Discussion about capacity awareness in idle and non-idle scenarios
indicated that the current patch-set only supports the former.

Per-patch changes:

(1) Use rq->cpu_capacity_orig or capacity_orig_of() instead of
    arch_scale_cpu_capacity() [patch 1,6/6]

(2) Optimize dl_bw_cpus(), i.e. return the weight of rd->span if
    rd->span is a subset of the cpu_active_mask [patch 2/6]

(3) Replace rd_capacity() with dl_bw_capacity(). [patch 3/6]

Changes RFC [2] -> v1:

Only use static values for CPU bandwidth (sched_dl_entity::dl_runtime,
::dl_deadline) and CPU capacity (arch_scale_cpu_capacity()) to fix DL
admission control.

Dynamic values for CPU bandwidth (sched_dl_entity::runtime, ::deadline)
and CPU capacity (capacity_of()) are considered to be more related to
energy trade-off calculations which could be later introduced using the
Energy Model.

Since the designs of the DL and RT sched classes are very similar, the
implementation follows the overall design of RT capacity awareness
(commit 804d402fb6f6 ("sched/rt: Make RT capacity-aware")).

Per-patch changes:

(1) Store CPU capacity sum in the root domain during
    build_sched_domains() [patch 1/4]

(2) Adjust to RT capacity awareness design [patch 3/4]

(3) Remove CPU capacity aware placement in switched_to_dl()
    (dl_migrate callback) [RFC patch 3/6]

    Balance callbacks (push, pull) run only in schedule_tail(),
    __schedule(), rt_mutex_setprio() or __sched_setscheduler().
    DL throttling leads to a call to __dequeue_task_dl() which is not a
    full task dequeue. The task is still enqueued and is only removed
    from the rq.
    So a queue_balance_callback() call in update_curr_dl()->
    __dequeue_task_dl() will not be followed by a balance_callback()
    call in one of the 4 functions mentioned above.

(4) Remove 'dynamic CPU bandwidth' consideration and only support
    'static CPU bandwidth' (ratio between sched_dl_entity::dl_runtime
    and ::dl_deadline) [RFC patch 4/6]

(5) Remove modification to migration logic which tried to schedule
    small tasks on LITTLE CPUs [RFC patch 6/6]

[1] https://lore.kernel.org/r/20200408095012.3819-1-dietmar.eggemann@arm.com
[2] https://lore.kernel.org/r/20190506044836.2914-1-luca.abeni@santannapisa.it

The following rt-app testcase, tailored to the Arm64 Hikey960:

root@h960:~# cat /sys/devices/system/cpu/cpu*/cpu_capacity
462
462
462
462
1024
1024
1024
1024

shows the expected behavior.

According to the following condition in dl_task_fits_capacity()

    cap_scale(dl_deadline, arch_scale_cpu_capacity(cpu)) >= dl_runtime

thread0-[0-3] are each placed on a big CPU whereas thread1-[0-3] each
run on a LITTLE CPU.
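
With cap_scale(v, s) = v * s >> SCHED_CAPACITY_SHIFT (i.e. v * s /
1024) this means:

    thread0: cap_scale(16000,  462) ~=  7218 <  11000 -> no fit on LITTLE
             cap_scale(16000, 1024)  = 16000 >= 11000 -> fit on big

    thread1: cap_scale(16000,  462) ~=  7218 >=  6500 -> fit on LITTLE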

...
"tasks" : {
 "thread0" : {
  "policy" : "SCHED_DEADLINE",
  "instance" : 4,
  "timer" : { "ref" : "unique0", "period" : 16000, "mode" : "absolute" },
  "run" : 10000,
  "dl-runtime" : 11000,
  "dl-period" : 16000,
  "dl-deadline" : 16000
},
 "thread1" : {
  "policy" : "SCHED_DEADLINE",
  "instance" : 4,
  "delay" : 1000,
  "timer" : { "ref" : "unique1", "period" : 16000, "mode" : "absolute" },
  "run" : 5500,
  "dl-runtime" : 6500			
  "dl-period" : 16000,
  "dl-deadline" : 16000
}
...

Tests were run with the Performance CPUfreq governor so that the
Schedutil CPUfreq governor DL threads (sugov:[0,4]), necessary on a
slow-switching platform like Hikey960, do not interfere with the
rt-app test tasks. Using Schedutil would require lowering the number
of tasks to 3 instances each.

Dietmar Eggemann (3):
  sched/topology: Store root domain CPU capacity sum
  sched/deadline: Optimize dl_bw_cpus()
  sched/deadline: Add dl_bw_capacity()

Luca Abeni (3):
  sched/deadline: Improve admission control for asymmetric CPU
    capacities
  sched/deadline: Make DL capacity-aware
  sched/deadline: Implement fallback mechanism for !fit case

 kernel/sched/cpudeadline.c | 23 +++++++++++
 kernel/sched/deadline.c    | 80 +++++++++++++++++++++++++++++---------
 kernel/sched/sched.h       | 22 +++++++++--
 kernel/sched/topology.c    | 15 +++++--
 4 files changed, 115 insertions(+), 25 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 1/6] sched/topology: Store root domain CPU capacity sum
  2020-04-27  8:37 [PATCH v2 0/6] Capacity awareness for SCHED_DEADLINE Dietmar Eggemann
@ 2020-04-27  8:37 ` Dietmar Eggemann
  2020-04-27  8:37 ` [PATCH v2 2/6] sched/deadline: Optimize dl_bw_cpus() Dietmar Eggemann
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Dietmar Eggemann @ 2020-04-27  8:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli
  Cc: Vincent Guittot, Steven Rostedt, Luca Abeni,
	Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

Add the sum of the (original) CPU capacities of all member CPUs to the
root domain.

This is needed for capacity-aware SCHED_DEADLINE admission control.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c | 15 +++++++++++----
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..58e1d3903ab9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -803,6 +803,7 @@ struct root_domain {
 	cpumask_var_t		rto_mask;
 	struct cpupri		cpupri;
 
+	unsigned long           sum_cpu_capacity;
 	unsigned long		max_cpu_capacity;
 
 	/*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8344757bba6e..fbb20b7a80c0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2052,12 +2052,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	/* Attach the domains */
 	rcu_read_lock();
 	for_each_cpu(i, cpu_map) {
+		unsigned long cap;
+
 		rq = cpu_rq(i);
 		sd = *per_cpu_ptr(d.sd, i);
+		cap = rq->cpu_capacity_orig;
 
 		/* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
-		if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
-			WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
+		if (cap > READ_ONCE(d.rd->max_cpu_capacity))
+			WRITE_ONCE(d.rd->max_cpu_capacity, cap);
+
+		WRITE_ONCE(d.rd->sum_cpu_capacity,
+			   READ_ONCE(d.rd->sum_cpu_capacity) + cap);
 
 		cpu_attach_domain(sd, d.rd, i);
 	}
@@ -2067,8 +2073,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
 
 	if (rq && sched_debug_enabled) {
-		pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
-			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
+		pr_info("root domain span: %*pbl (capacity = %lu max cpu_capacity = %lu)\n",
+			cpumask_pr_args(cpu_map), rq->rd->sum_cpu_capacity,
+			rq->rd->max_cpu_capacity);
 	}
 
 	ret = 0;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 2/6] sched/deadline: Optimize dl_bw_cpus()
  2020-04-27  8:37 [PATCH v2 0/6] Capacity awareness for SCHED_DEADLINE Dietmar Eggemann
  2020-04-27  8:37 ` [PATCH v2 1/6] sched/topology: Store root domain CPU capacity sum Dietmar Eggemann
@ 2020-04-27  8:37 ` Dietmar Eggemann
  2020-04-30 10:55   ` Pavan Kondeti
  2020-04-27  8:37 ` [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity() Dietmar Eggemann
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2020-04-27  8:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli
  Cc: Vincent Guittot, Steven Rostedt, Luca Abeni,
	Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

Return the weight of the rd (root domain) span in case it is a subset
of the cpu_active_mask.

Continue to compute the number of CPUs over the intersection of the rd
span and the cpu_active_mask during hotplug.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/deadline.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 504d2f51b0d6..4ae22bfc37ae 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -54,10 +54,16 @@ static inline struct dl_bw *dl_bw_of(int i)
 static inline int dl_bw_cpus(int i)
 {
 	struct root_domain *rd = cpu_rq(i)->rd;
-	int cpus = 0;
+	int cpus;
 
 	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
 			 "sched RCU must be held");
+
+	if (cpumask_subset(rd->span, cpu_active_mask))
+		return cpumask_weight(rd->span);
+
+	cpus = 0;
+
 	for_each_cpu_and(i, rd->span, cpu_active_mask)
 		cpus++;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()
  2020-04-27  8:37 [PATCH v2 0/6] Capacity awareness for SCHED_DEADLINE Dietmar Eggemann
  2020-04-27  8:37 ` [PATCH v2 1/6] sched/topology: Store root domain CPU capacity sum Dietmar Eggemann
  2020-04-27  8:37 ` [PATCH v2 2/6] sched/deadline: Optimize dl_bw_cpus() Dietmar Eggemann
@ 2020-04-27  8:37 ` Dietmar Eggemann
  2020-05-06 10:54   ` Dietmar Eggemann
  2020-04-27  8:37 ` [PATCH v2 4/6] sched/deadline: Improve admission control for asymmetric CPU capacities Dietmar Eggemann
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2020-04-27  8:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli
  Cc: Vincent Guittot, Steven Rostedt, Luca Abeni,
	Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

Capacity-aware SCHED_DEADLINE admission control needs the rd (root
domain) CPU capacity sum.

The design of this function follows that of dl_bw_cpus().

That is, return the rd CPU capacity sum in case the rd span is a
subset of the cpu_active_mask.

Compute the CPU capacity sum over the intersection of the rd span and
the cpu_active_mask during hotplug.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/deadline.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4ae22bfc37ae..eb23e6921d94 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -69,6 +69,25 @@ static inline int dl_bw_cpus(int i)
 
 	return cpus;
 }
+
+static inline unsigned long dl_bw_capacity(int i)
+{
+	struct root_domain *rd = cpu_rq(i)->rd;
+	unsigned long cap;
+
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+			 "sched RCU must be held");
+
+	if (cpumask_subset(rd->span, cpu_active_mask))
+		return rd->sum_cpu_capacity;
+
+	cap = 0;
+
+	for_each_cpu_and(i, rd->span, cpu_active_mask)
+		cap += capacity_orig_of(i);
+
+	return cap;
+}
 #else
 static inline struct dl_bw *dl_bw_of(int i)
 {
@@ -79,6 +98,11 @@ static inline int dl_bw_cpus(int i)
 {
 	return 1;
 }
+
+static inline unsigned long dl_bw_capacity(int i)
+{
+	return SCHED_CAPACITY_SCALE;
+}
 #endif
 
 static inline
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 4/6] sched/deadline: Improve admission control for asymmetric CPU capacities
  2020-04-27  8:37 [PATCH v2 0/6] Capacity awareness for SCHED_DEADLINE Dietmar Eggemann
                   ` (2 preceding siblings ...)
  2020-04-27  8:37 ` [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity() Dietmar Eggemann
@ 2020-04-27  8:37 ` Dietmar Eggemann
  2020-04-27  8:37 ` [PATCH v2 5/6] sched/deadline: Make DL capacity-aware Dietmar Eggemann
  2020-04-27  8:37 ` [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case Dietmar Eggemann
  5 siblings, 0 replies; 24+ messages in thread
From: Dietmar Eggemann @ 2020-04-27  8:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli
  Cc: Vincent Guittot, Steven Rostedt, Luca Abeni,
	Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

From: Luca Abeni <luca.abeni@santannapisa.it>

The current SCHED_DEADLINE (DL) admission control ensures that

    sum of reserved CPU bandwidth < x * M

where

    x = sched_rt_runtime_us / sched_rt_period_us
        (cf. /proc/sys/kernel/sched_rt_{runtime,period}_us)
    M = # CPUs in root domain.

DL admission control works well for homogeneous systems where the
capacities of all CPUs are equal (1024). I.e. bounded tardiness for DL
and non-starvation of non-DL tasks are guaranteed.

But on heterogeneous systems, where the capacities of the CPUs differ,
it can fail by over-allocating CPU time on smaller-capacity CPUs.

On an Arm big.LITTLE/DynamIQ system, DL tasks can easily starve other
tasks, making the system unusable.

Fix this by explicitly considering the CPU capacity in the DL admission
test by replacing M with the root domain CPU capacity sum.
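
As an illustration, with the Hikey960 capacities from the cover letter
(4 CPUs w/ capacity 462, 4 CPUs w/ capacity 1024) and the default
x = 0.95, the admitted bandwidth bound changes from

    0.95 * 8 = 7.6

to

    0.95 * (4 * 462 + 4 * 1024) / 1024 = 0.95 * 5944 / 1024 ~= 5.51

i.e. roughly the CPU bandwidth the system can actually serve.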

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/deadline.c | 30 +++++++++++++++++-------------
 kernel/sched/sched.h    |  6 +++---
 2 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index eb23e6921d94..08ab28e1cefc 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2581,11 +2581,12 @@ void sched_dl_do_global(void)
 int sched_dl_overflow(struct task_struct *p, int policy,
 		      const struct sched_attr *attr)
 {
-	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 	u64 period = attr->sched_period ?: attr->sched_deadline;
 	u64 runtime = attr->sched_runtime;
 	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
-	int cpus, err = -1;
+	int cpus, err = -1, cpu = task_cpu(p);
+	struct dl_bw *dl_b = dl_bw_of(cpu);
+	unsigned long cap;
 
 	if (attr->sched_flags & SCHED_FLAG_SUGOV)
 		return 0;
@@ -2600,15 +2601,17 @@ int sched_dl_overflow(struct task_struct *p, int policy,
 	 * allocated bandwidth of the container.
 	 */
 	raw_spin_lock(&dl_b->lock);
-	cpus = dl_bw_cpus(task_cpu(p));
+	cpus = dl_bw_cpus(cpu);
+	cap = dl_bw_capacity(cpu);
+
 	if (dl_policy(policy) && !task_has_dl_policy(p) &&
-	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+	    !__dl_overflow(dl_b, cap, 0, new_bw)) {
 		if (hrtimer_active(&p->dl.inactive_timer))
 			__dl_sub(dl_b, p->dl.dl_bw, cpus);
 		__dl_add(dl_b, new_bw, cpus);
 		err = 0;
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
-		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+		   !__dl_overflow(dl_b, cap, p->dl.dl_bw, new_bw)) {
 		/*
 		 * XXX this is slightly incorrect: when the task
 		 * utilization decreases, we should delay the total
@@ -2744,19 +2747,19 @@ bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
 #ifdef CONFIG_SMP
 int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed)
 {
+	unsigned long flags, cap;
 	unsigned int dest_cpu;
 	struct dl_bw *dl_b;
 	bool overflow;
-	int cpus, ret;
-	unsigned long flags;
+	int ret;
 
 	dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed);
 
 	rcu_read_lock_sched();
 	dl_b = dl_bw_of(dest_cpu);
 	raw_spin_lock_irqsave(&dl_b->lock, flags);
-	cpus = dl_bw_cpus(dest_cpu);
-	overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw);
+	cap = dl_bw_capacity(dest_cpu);
+	overflow = __dl_overflow(dl_b, cap, 0, p->dl.dl_bw);
 	if (overflow) {
 		ret = -EBUSY;
 	} else {
@@ -2766,6 +2769,8 @@ int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allo
 		 * We will free resources in the source root_domain
 		 * later on (see set_cpus_allowed_dl()).
 		 */
+		int cpus = dl_bw_cpus(dest_cpu);
+
 		__dl_add(dl_b, p->dl.dl_bw, cpus);
 		ret = 0;
 	}
@@ -2798,16 +2803,15 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur,
 
 bool dl_cpu_busy(unsigned int cpu)
 {
-	unsigned long flags;
+	unsigned long flags, cap;
 	struct dl_bw *dl_b;
 	bool overflow;
-	int cpus;
 
 	rcu_read_lock_sched();
 	dl_b = dl_bw_of(cpu);
 	raw_spin_lock_irqsave(&dl_b->lock, flags);
-	cpus = dl_bw_cpus(cpu);
-	overflow = __dl_overflow(dl_b, cpus, 0, 0);
+	cap = dl_bw_capacity(cpu);
+	overflow = __dl_overflow(dl_b, cap, 0, 0);
 	raw_spin_unlock_irqrestore(&dl_b->lock, flags);
 	rcu_read_unlock_sched();
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 58e1d3903ab9..511edacc2282 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -310,11 +310,11 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
 	__dl_update(dl_b, -((s32)tsk_bw / cpus));
 }
 
-static inline
-bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
+				 u64 old_bw, u64 new_bw)
 {
 	return dl_b->bw != -1 &&
-	       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+	       cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
 }
 
 extern void init_dl_bw(struct dl_bw *dl_b);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 5/6] sched/deadline: Make DL capacity-aware
  2020-04-27  8:37 [PATCH v2 0/6] Capacity awareness for SCHED_DEADLINE Dietmar Eggemann
                   ` (3 preceding siblings ...)
  2020-04-27  8:37 ` [PATCH v2 4/6] sched/deadline: Improve admission control for asymmetric CPU capacities Dietmar Eggemann
@ 2020-04-27  8:37 ` Dietmar Eggemann
  2020-04-30 13:10   ` Pavan Kondeti
  2020-04-27  8:37 ` [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case Dietmar Eggemann
  5 siblings, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2020-04-27  8:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli
  Cc: Vincent Guittot, Steven Rostedt, Luca Abeni,
	Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

From: Luca Abeni <luca.abeni@santannapisa.it>

The current SCHED_DEADLINE (DL) scheduler uses a global EDF scheduling
algorithm w/o considering CPU capacity or task utilization.
This works well on homogeneous systems, where DL tasks are guaranteed
bounded tardiness, but presents issues on heterogeneous systems.

A DL task can migrate to a CPU which does not have enough CPU capacity
to correctly serve the task (e.g. a task w/ 70ms runtime and 100ms
period on a CPU w/ 512 capacity).

Add the DL fitness function dl_task_fits_capacity() for DL admission
control on heterogeneous systems. A task fits onto a CPU if:

    CPU original capacity / 1024 >= task runtime / task deadline
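
For the example above: 512 / 1024 = 0.5 < 70ms / 100ms = 0.7, so the
task does not fit onto the capacity-512 CPU.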

Use this function on heterogeneous systems to try to find a CPU which
meets this criterion during task wakeup, push and offline migration.

On homogeneous systems the original behavior of the DL admission
control should be retained.

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/cpudeadline.c | 14 +++++++++++++-
 kernel/sched/deadline.c    | 18 ++++++++++++++----
 kernel/sched/sched.h       | 15 +++++++++++++++
 3 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 5cc4012572ec..8630f2a40a3f 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -121,7 +121,19 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
 
 	if (later_mask &&
 	    cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
-		return 1;
+		int cpu;
+
+		if (!static_branch_unlikely(&sched_asym_cpucapacity))
+			return 1;
+
+		/* Ensure the capacity of the CPUs fits the task. */
+		for_each_cpu(cpu, later_mask) {
+			if (!dl_task_fits_capacity(p, cpu))
+				cpumask_clear_cpu(cpu, later_mask);
+		}
+
+		if (!cpumask_empty(later_mask))
+			return 1;
 	} else {
 		int best_cpu = cpudl_maximum(cp);
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 08ab28e1cefc..575b7d88d839 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1634,6 +1634,7 @@ static int
 select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
 	struct task_struct *curr;
+	bool select_rq;
 	struct rq *rq;
 
 	if (sd_flag != SD_BALANCE_WAKE)
@@ -1653,10 +1654,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
 	 * other hand, if it has a shorter deadline, we
 	 * try to make it stay here, it might be important.
 	 */
-	if (unlikely(dl_task(curr)) &&
-	    (curr->nr_cpus_allowed < 2 ||
-	     !dl_entity_preempt(&p->dl, &curr->dl)) &&
-	    (p->nr_cpus_allowed > 1)) {
+	select_rq = unlikely(dl_task(curr)) &&
+		    (curr->nr_cpus_allowed < 2 ||
+		     !dl_entity_preempt(&p->dl, &curr->dl)) &&
+		    p->nr_cpus_allowed > 1;
+
+	/*
+	 * Take the capacity of the CPU into account to
+	 * ensure it fits the requirement of the task.
+	 */
+	if (static_branch_unlikely(&sched_asym_cpucapacity))
+		select_rq |= !dl_task_fits_capacity(p, cpu);
+
+	if (select_rq) {
 		int target = find_later_rq(p);
 
 		if (target != -1 &&
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 511edacc2282..ec0efd99497b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -317,6 +317,21 @@ static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
 	       cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
 }
 
+/*
+ * Verify the fitness of task @p to run on @cpu taking into account the
+ * CPU original capacity and the runtime/deadline ratio of the task.
+ *
+ * The function will return true if the CPU original capacity of the
+ * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
+ * task and false otherwise.
+ */
+static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
+{
+	unsigned long cap = arch_scale_cpu_capacity(cpu);
+
+	return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
+}
+
 extern void init_dl_bw(struct dl_bw *dl_b);
 extern int  sched_dl_global_validate(void);
 extern void sched_dl_do_global(void);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case
  2020-04-27  8:37 [PATCH v2 0/6] Capacity awareness for SCHED_DEADLINE Dietmar Eggemann
                   ` (4 preceding siblings ...)
  2020-04-27  8:37 ` [PATCH v2 5/6] sched/deadline: Make DL capacity-aware Dietmar Eggemann
@ 2020-04-27  8:37 ` Dietmar Eggemann
  2020-04-27 13:34   ` Juri Lelli
  5 siblings, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2020-04-27  8:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli
  Cc: Vincent Guittot, Steven Rostedt, Luca Abeni,
	Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

From: Luca Abeni <luca.abeni@santannapisa.it>

When a task has a runtime that cannot be served within the scheduling
deadline by any of the idle CPUs (later_mask), the task is doomed to
miss its deadline.

This can happen since the SCHED_DEADLINE admission control guarantees
only bounded tardiness and not the hard respect of all deadlines.
In this case, try to select the idle CPU with the largest CPU capacity
to minimize tardiness.

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/cpudeadline.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 8630f2a40a3f..b6c7a0bc0880 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -121,19 +121,30 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
 
 	if (later_mask &&
 	    cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
-		int cpu;
+		unsigned long cap, max_cap = 0;
+		int cpu, max_cpu = -1;
 
 		if (!static_branch_unlikely(&sched_asym_cpucapacity))
 			return 1;
 
 		/* Ensure the capacity of the CPUs fits the task. */
 		for_each_cpu(cpu, later_mask) {
-			if (!dl_task_fits_capacity(p, cpu))
+			if (!dl_task_fits_capacity(p, cpu)) {
 				cpumask_clear_cpu(cpu, later_mask);
+
+				cap = capacity_orig_of(cpu);
+
+				if (cap > max_cap) {
+					max_cap = cap;
+					max_cpu = cpu;
+				}
+			}
 		}
 
-		if (!cpumask_empty(later_mask))
-			return 1;
+		if (cpumask_empty(later_mask))
+			cpumask_set_cpu(max_cpu, later_mask);
+
+		return 1;
 	} else {
 		int best_cpu = cpudl_maximum(cp);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case
  2020-04-27  8:37 ` [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case Dietmar Eggemann
@ 2020-04-27 13:34   ` Juri Lelli
  2020-04-27 14:17     ` luca abeni
  0 siblings, 1 reply; 24+ messages in thread
From: Juri Lelli @ 2020-04-27 13:34 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Steven Rostedt,
	Luca Abeni, Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

Hi,

On 27/04/20 10:37, Dietmar Eggemann wrote:
> From: Luca Abeni <luca.abeni@santannapisa.it>
> 
> When a task has a runtime that cannot be served within the scheduling
> deadline by any of the idle CPU (later_mask) the task is doomed to miss
> its deadline.
> 
> This can happen since the SCHED_DEADLINE admission control guarantees
> only bounded tardiness and not the hard respect of all deadlines.
> In this case try to select the idle CPU with the largest CPU capacity
> to minimize tardiness.
> 
> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/cpudeadline.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 8630f2a40a3f..b6c7a0bc0880 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -121,19 +121,30 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
>  
>  	if (later_mask &&
>  	    cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
> -		int cpu;
> +		unsigned long cap, max_cap = 0;
> +		int cpu, max_cpu = -1;
>  
>  		if (!static_branch_unlikely(&sched_asym_cpucapacity))
>  			return 1;
>  
>  		/* Ensure the capacity of the CPUs fits the task. */
>  		for_each_cpu(cpu, later_mask) {
> -			if (!dl_task_fits_capacity(p, cpu))
> +			if (!dl_task_fits_capacity(p, cpu)) {
>  				cpumask_clear_cpu(cpu, later_mask);
> +
> +				cap = capacity_orig_of(cpu);
> +
> +				if (cap > max_cap) {
> +					max_cap = cap;
> +					max_cpu = cpu;
> +				}
> +			}
>  		}
>  
> -		if (!cpumask_empty(later_mask))
> -			return 1;
> +		if (cpumask_empty(later_mask))
> +			cpumask_set_cpu(max_cpu, later_mask);

Think we touched upon this during v1 review, but I'm (still?) wondering
if we can do a little better, still considering only free cpus.

Can't we get into a situation that some of the (once free) big cpus have
been occupied by small tasks and now a big task enters the system and it
only finds small cpus available, where it could have fit into bigs if
small tasks were put onto small cpus?

I.e., shouldn't we always try to best fit among free cpus?

Thanks,

Juri


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case
  2020-04-27 13:34   ` Juri Lelli
@ 2020-04-27 14:17     ` luca abeni
  2020-04-29 17:39       ` Dietmar Eggemann
  0 siblings, 1 reply; 24+ messages in thread
From: luca abeni @ 2020-04-27 14:17 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Steven Rostedt, Daniel Bristot de Oliveira, Wei Wang,
	Quentin Perret, Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

Hi Juri,

On Mon, 27 Apr 2020 15:34:38 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> Hi,
> 
> On 27/04/20 10:37, Dietmar Eggemann wrote:
> > From: Luca Abeni <luca.abeni@santannapisa.it>
> > 
> > When a task has a runtime that cannot be served within the
> > scheduling deadline by any of the idle CPU (later_mask) the task is
> > doomed to miss its deadline.
> > 
> > This can happen since the SCHED_DEADLINE admission control
> > guarantees only bounded tardiness and not the hard respect of all
> > deadlines. In this case try to select the idle CPU with the largest
> > CPU capacity to minimize tardiness.
> > 
> > Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[...]
> > -		if (!cpumask_empty(later_mask))
> > -			return 1;
> > +		if (cpumask_empty(later_mask))
> > +			cpumask_set_cpu(max_cpu, later_mask);  
> 
> Think we touched upon this during v1 review, but I'm (still?)
> wondering if we can do a little better, still considering only free
> cpus.
> 
> Can't we get into a situation that some of the (once free) big cpus
> have been occupied by small tasks and now a big task enters the
> system and it only finds small cpus available, were it could have fit
> into bigs if small tasks were put onto small cpus?
> 
> I.e., shouldn't we always try to best fit among free cpus?

Yes; there was an additional patch that tried to schedule each task on
the slowest core where it can fit, to address this issue.
But I think it will go in a second round of patches.



			Luca

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case
  2020-04-27 14:17     ` luca abeni
@ 2020-04-29 17:39       ` Dietmar Eggemann
  2020-04-30 11:00         ` Pavan Kondeti
  0 siblings, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2020-04-29 17:39 UTC (permalink / raw)
  To: luca abeni, Juri Lelli
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Steven Rostedt,
	Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On 27/04/2020 16:17, luca abeni wrote:
> Hi Juri,
> 
> On Mon, 27 Apr 2020 15:34:38 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
>> Hi,
>>
>> On 27/04/20 10:37, Dietmar Eggemann wrote:
>>> From: Luca Abeni <luca.abeni@santannapisa.it>
>>>
>>> When a task has a runtime that cannot be served within the
>>> scheduling deadline by any of the idle CPU (later_mask) the task is
>>> doomed to miss its deadline.
>>>
>>> This can happen since the SCHED_DEADLINE admission control
>>> guarantees only bounded tardiness and not the hard respect of all
>>> deadlines. In this case try to select the idle CPU with the largest
>>> CPU capacity to minimize tardiness.
>>>
>>> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
>>> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> [...]
>>> -		if (!cpumask_empty(later_mask))
>>> -			return 1;
>>> +		if (cpumask_empty(later_mask))
>>> +			cpumask_set_cpu(max_cpu, later_mask);  
>>
>> Think we touched upon this during v1 review, but I'm (still?)
>> wondering if we can do a little better, still considering only free
>> cpus.
>>
>> Can't we get into a situation that some of the (once free) big cpus
>> have been occupied by small tasks and now a big task enters the
>> system and it only finds small cpus available, were it could have fit
>> into bigs if small tasks were put onto small cpus?
>>
>> I.e., shouldn't we always try to best fit among free cpus?
> 
> Yes; there was an additional patch that tried schedule each task on the
> slowest core where it can fit, to address this issue.
> But I think it will go in a second round of patches.

Yes, we can run into this situation in DL, but also in CFS or RT.

IMHO, this patch is aligned with the Capacity Awareness implementation
in CFS and RT.

Capacity Awareness so far is 'find a CPU which fits the requirement of
the task (Req)'. It's not (yet) 'find the best CPU'.

CFS - select_idle_capacity() -> task_fits_capacity()

      Req: util(p) * 1.25 < capacity_of(cpu)

RT  - select_task_rq_rt(), cpupri_find_fitness() ->
      rt_task_fits_capacity()

      Req: uclamp_eff_value(p) <= capacity_orig_of(cpu)

DL  - select_task_rq_dl(), cpudl_find() -> dl_task_fits_capacity()

      Req: dl_runtime(p)/dl_deadline(p) * 1024  <= capacity_orig_of(cpu)


There has to be an "idle" (from the viewpoint of the task) CPU available
with a fitting capacity. Otherwise a fallback mechanism applies.

CFS - best capacity handling in select_idle_capacity().

RT  - Non-fitting lowest mask

DL  - This patch
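
For RT and DL the common pattern is roughly (a condensed sketch, not
the exact kernel code):

    /* keep only CPUs which fit the task's requirement */
    for_each_cpu(cpu, candidate_mask) {
            if (!<class>_task_fits_capacity(p, cpu))
                    cpumask_clear_cpu(cpu, candidate_mask);
    }

    if (cpumask_empty(candidate_mask))
            <apply the class specific fallback>;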

You did spot the rt-app 'delay' for the small tasks in the test case ;-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/6] sched/deadline: Optimize dl_bw_cpus()
  2020-04-27  8:37 ` [PATCH v2 2/6] sched/deadline: Optimize dl_bw_cpus() Dietmar Eggemann
@ 2020-04-30 10:55   ` Pavan Kondeti
  2020-05-01 16:12     ` Dietmar Eggemann
  0 siblings, 1 reply; 24+ messages in thread
From: Pavan Kondeti @ 2020-04-30 10:55 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Luca Abeni, Daniel Bristot de Oliveira, Wei Wang,
	Quentin Perret, Alessio Balsini, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On Mon, Apr 27, 2020 at 10:37:05AM +0200, Dietmar Eggemann wrote:
> Return the weight of the rd (root domain) span in case it is a subset
> of the cpu_active_mask.
> 
> Continue to compute the number of CPUs over rd span and cpu_active_mask
> when in hotplug.
> 
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/deadline.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 504d2f51b0d6..4ae22bfc37ae 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -54,10 +54,16 @@ static inline struct dl_bw *dl_bw_of(int i)
>  static inline int dl_bw_cpus(int i)
>  {
>  	struct root_domain *rd = cpu_rq(i)->rd;
> -	int cpus = 0;
> +	int cpus;
>  
>  	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
>  			 "sched RCU must be held");
> +
> +	if (cpumask_subset(rd->span, cpu_active_mask))
> +		return cpumask_weight(rd->span);
> +

Looks good to me. This is a nice optimization.

> +	cpus = 0;
> +
>  	for_each_cpu_and(i, rd->span, cpu_active_mask)
>  		cpus++;
>  
Do you know why this check is in place? Is it only to cover
the case of cpuset_cpu_inactive()->dl_cpu_busy()?

Thanks,
Pavan
> -- 
> 2.17.1
> 

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case
  2020-04-29 17:39       ` Dietmar Eggemann
@ 2020-04-30 11:00         ` Pavan Kondeti
  2020-05-01 16:12           ` Dietmar Eggemann
  0 siblings, 1 reply; 24+ messages in thread
From: Pavan Kondeti @ 2020-04-30 11:00 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: luca abeni, Juri Lelli, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Steven Rostedt, Daniel Bristot de Oliveira,
	Wei Wang, Quentin Perret, Alessio Balsini, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On Wed, Apr 29, 2020 at 07:39:50PM +0200, Dietmar Eggemann wrote:
> On 27/04/2020 16:17, luca abeni wrote:
> > Hi Juri,
> > 
> > On Mon, 27 Apr 2020 15:34:38 +0200
> > Juri Lelli <juri.lelli@redhat.com> wrote:
> > 
> >> Hi,
> >>
> >> On 27/04/20 10:37, Dietmar Eggemann wrote:
> >>> From: Luca Abeni <luca.abeni@santannapisa.it>
> >>>
> >>> When a task has a runtime that cannot be served within the
> >>> scheduling deadline by any of the idle CPU (later_mask) the task is
> >>> doomed to miss its deadline.
> >>>
> >>> This can happen since the SCHED_DEADLINE admission control
> >>> guarantees only bounded tardiness and not the hard respect of all
> >>> deadlines. In this case try to select the idle CPU with the largest
> >>> CPU capacity to minimize tardiness.
> >>>
> >>> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
> >>> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > [...]
> >>> -		if (!cpumask_empty(later_mask))
> >>> -			return 1;
> >>> +		if (cpumask_empty(later_mask))
> >>> +			cpumask_set_cpu(max_cpu, later_mask);  
> >>
> >> Think we touched upon this during v1 review, but I'm (still?)
> >> wondering if we can do a little better, still considering only free
> >> cpus.
> >>
> >> Can't we get into a situation that some of the (once free) big cpus
> >> have been occupied by small tasks and now a big task enters the
> >> system and it only finds small cpus available, were it could have fit
> >> into bigs if small tasks were put onto small cpus?
> >>
> >> I.e., shouldn't we always try to best fit among free cpus?
> > 
> > Yes; there was an additional patch that tried schedule each task on the
> > slowest core where it can fit, to address this issue.
> > But I think it will go in a second round of patches.
> 
> Yes, we can run into this situation in DL, but also in CFS or RT.
> 
In the CFS case, the misfit task handling in the load balancer should
help pull a BIG task running on a little CPU. I get your point that we
can run into the same scenario with other scheduling class tasks.

> IMHO, this patch is aligned with the Capacity Awareness implementation
> in CFS and RT.
> 
> Capacity Awareness so far is 'find a CPU which fits the requirement of
> the task (Req)'. It's not (yet) find the best CPU.
> 
> CFS - select_idle_capacity() -> task_fits_capacity()
> 
>       Req: util(p) * 1.25 < capacity_of(cpu)
> 
> RT  - select_task_rq_rt(), cpupri_find_fitness() ->
>       rt_task_fits_capacity()
> 
>       Req: uclamp_eff_value(p) <= capacity_orig_of(cpu)
> 
> DL  - select_task_rq_dl(), cpudl_find() -> dl_task_fits_capacity()
> 
>       Req: dl_runtime(p)/dl_deadline(p) * 1024  <= capacity_orig_of(cpu)
> 
> 
> There has to be an "idle" (from the viewpoint of the task) CPU available
> with a fitting capacity. Otherwise a fallback mechanism applies.
> 
> CFS - best capacity handling in select_idle_capacity().
> 
> RT  - Non-fitting lowest mask
> 
> DL  - This patch
> 
> You did spot the rt-app 'delay' for the small tasks in the test case ;-)

Thanks for the hint. It was not clear to me why a 1 msec delay is given for
the small tasks in the rt-app json description in the cover letter.
I get it now :-)

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/6] sched/deadline: Make DL capacity-aware
  2020-04-27  8:37 ` [PATCH v2 5/6] sched/deadline: Make DL capacity-aware Dietmar Eggemann
@ 2020-04-30 13:10   ` Pavan Kondeti
  2020-05-01 16:12     ` Dietmar Eggemann
  0 siblings, 1 reply; 24+ messages in thread
From: Pavan Kondeti @ 2020-04-30 13:10 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Luca Abeni, Daniel Bristot de Oliveira, Wei Wang,
	Quentin Perret, Alessio Balsini, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On Mon, Apr 27, 2020 at 10:37:08AM +0200, Dietmar Eggemann wrote:
> From: Luca Abeni <luca.abeni@santannapisa.it>
> 
> The current SCHED_DEADLINE (DL) scheduler uses a global EDF scheduling
> algorithm w/o considering CPU capacity or task utilization.
> This works well on homogeneous systems where DL tasks are guaranteed
> to have a bounded tardiness but presents issues on heterogeneous
> systems.
> 
> A DL task can migrate to a CPU which does not have enough CPU capacity
> to correctly serve the task (e.g. a task w/ 70ms runtime and 100ms
> period on a CPU w/ 512 capacity).
> 
> Add the DL fitness function dl_task_fits_capacity() for DL admission
> control on heterogeneous systems. A task fits onto a CPU if:
> 
>     CPU original capacity / 1024 >= task runtime / task deadline
> 
> Use this function on heterogeneous systems to try to find a CPU which
> meets this criterion during task wakeup, push and offline migration.
> 
> On homogeneous systems the original behavior of the DL admission
> control should be retained.
> 
> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/cpudeadline.c | 14 +++++++++++++-
>  kernel/sched/deadline.c    | 18 ++++++++++++++----
>  kernel/sched/sched.h       | 15 +++++++++++++++
>  3 files changed, 42 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 5cc4012572ec..8630f2a40a3f 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -121,7 +121,19 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
>  
>  	if (later_mask &&
>  	    cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
> -		return 1;
> +		int cpu;
> +
> +		if (!static_branch_unlikely(&sched_asym_cpucapacity))
> +			return 1;
> +
> +		/* Ensure the capacity of the CPUs fits the task. */
> +		for_each_cpu(cpu, later_mask) {
> +			if (!dl_task_fits_capacity(p, cpu))
> +				cpumask_clear_cpu(cpu, later_mask);
> +		}
> +
> +		if (!cpumask_empty(later_mask))
> +			return 1;
>  	} else {
>  		int best_cpu = cpudl_maximum(cp);
>  
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 08ab28e1cefc..575b7d88d839 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1634,6 +1634,7 @@ static int
>  select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>  {
>  	struct task_struct *curr;
> +	bool select_rq;
>  	struct rq *rq;
>  
>  	if (sd_flag != SD_BALANCE_WAKE)
> @@ -1653,10 +1654,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>  	 * other hand, if it has a shorter deadline, we
>  	 * try to make it stay here, it might be important.
>  	 */
> -	if (unlikely(dl_task(curr)) &&
> -	    (curr->nr_cpus_allowed < 2 ||
> -	     !dl_entity_preempt(&p->dl, &curr->dl)) &&
> -	    (p->nr_cpus_allowed > 1)) {
> +	select_rq = unlikely(dl_task(curr)) &&
> +		    (curr->nr_cpus_allowed < 2 ||
> +		     !dl_entity_preempt(&p->dl, &curr->dl)) &&
> +		    p->nr_cpus_allowed > 1;
> +
> +	/*
> +	 * Take the capacity of the CPU into account to
> +	 * ensure it fits the requirement of the task.
> +	 */
> +	if (static_branch_unlikely(&sched_asym_cpucapacity))
> +		select_rq |= !dl_task_fits_capacity(p, cpu);
> +
> +	if (select_rq) {
>  		int target = find_later_rq(p);

I see that find_later_rq() checks if the previous CPU is part of
later_mask and returns it immediately. So we don't migrate the
task in the case where the previous CPU can't fit the task and
there are no idle CPUs on which the task can fit. LGTM.

>  
>  		if (target != -1 &&
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 511edacc2282..ec0efd99497b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -317,6 +317,21 @@ static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
>  	       cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
>  }
>  
> +/*
> + * Verify the fitness of task @p to run on @cpu taking into account the
> + * CPU original capacity and the runtime/deadline ratio of the task.
> + *
> + * The function will return true if the CPU original capacity of the
> + * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
> + * task and false otherwise.
> + */
> +static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> +	unsigned long cap = arch_scale_cpu_capacity(cpu);
> +
> +	return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
> +}
> +

This is the same as

return p->dl.dl_bw >> (BW_SHIFT - SCHED_CAPACITY_SHIFT) <= cap

Correct?  If yes, would it be better to use this?

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/6] sched/deadline: Optimize dl_bw_cpus()
  2020-04-30 10:55   ` Pavan Kondeti
@ 2020-05-01 16:12     ` Dietmar Eggemann
  0 siblings, 0 replies; 24+ messages in thread
From: Dietmar Eggemann @ 2020-05-01 16:12 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Luca Abeni, Daniel Bristot de Oliveira, Wei Wang,
	Quentin Perret, Alessio Balsini, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On 30/04/2020 12:55, Pavan Kondeti wrote:
> On Mon, Apr 27, 2020 at 10:37:05AM +0200, Dietmar Eggemann wrote:

[..]

>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 504d2f51b0d6..4ae22bfc37ae 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -54,10 +54,16 @@ static inline struct dl_bw *dl_bw_of(int i)
>>  static inline int dl_bw_cpus(int i)
>>  {
>>  	struct root_domain *rd = cpu_rq(i)->rd;
>> -	int cpus = 0;
>> +	int cpus;
>>  
>>  	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
>>  			 "sched RCU must be held");
>> +
>> +	if (cpumask_subset(rd->span, cpu_active_mask))
>> +		return cpumask_weight(rd->span);
>> +
> 
> Looks good to me. This is a nice optimization.
> 
>> +	cpus = 0;
>> +
>>  	for_each_cpu_and(i, rd->span, cpu_active_mask)
>>  		cpus++;
>>  
> Do you know why this check is in place? Is it only to cover
> the case of cpuset_cpu_inactive()->dl_cpu_busy()?

It should cover:

(1) Preventing CPU hp when DL detects a possible overflow w/o the CPU:

    sched_cpu_deactivate() -> cpuset_cpu_inactive() -> dl_cpu_busy() ->
    dl_bw_cpus() [now replaced by dl_bw_capacity()].

(2) DL Admission Control in CPU HP:

    __sched_setscheduler() -> sched_dl_overflow() -> dl_bw_cpus()
                                           [now + -> dl_bw_capacity()]

(3) In create/destroy exclusive cpuset scenarios (comment in
    set_cpus_allowed_dl()), although I wasn't able to provoke this so
    far:

    do_set_cpus_allowed() -> p->sched_class->set_cpus_allowed() was
    never called when I ran a DL testcase and created/destroyed
    exclusive cpusets at the same time.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/6] sched/deadline: Make DL capacity-aware
  2020-04-30 13:10   ` Pavan Kondeti
@ 2020-05-01 16:12     ` Dietmar Eggemann
  2020-05-04  3:58       ` Pavan Kondeti
  0 siblings, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2020-05-01 16:12 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Luca Abeni, Daniel Bristot de Oliveira, Wei Wang,
	Quentin Perret, Alessio Balsini, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On 30/04/2020 15:10, Pavan Kondeti wrote:
> On Mon, Apr 27, 2020 at 10:37:08AM +0200, Dietmar Eggemann wrote:
>> From: Luca Abeni <luca.abeni@santannapisa.it>

[...]

>> @@ -1653,10 +1654,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>>  	 * other hand, if it has a shorter deadline, we
>>  	 * try to make it stay here, it might be important.
>>  	 */
>> -	if (unlikely(dl_task(curr)) &&
>> -	    (curr->nr_cpus_allowed < 2 ||
>> -	     !dl_entity_preempt(&p->dl, &curr->dl)) &&
>> -	    (p->nr_cpus_allowed > 1)) {
>> +	select_rq = unlikely(dl_task(curr)) &&
>> +		    (curr->nr_cpus_allowed < 2 ||
>> +		     !dl_entity_preempt(&p->dl, &curr->dl)) &&
>> +		    p->nr_cpus_allowed > 1;
>> +
>> +	/*
>> +	 * Take the capacity of the CPU into account to
>> +	 * ensure it fits the requirement of the task.
>> +	 */
>> +	if (static_branch_unlikely(&sched_asym_cpucapacity))
>> +		select_rq |= !dl_task_fits_capacity(p, cpu);
>> +
>> +	if (select_rq) {
>>  		int target = find_later_rq(p);
> 
> I see that find_later_rq() checks if the previous CPU is part of
> later_mask and returns it immediately. So we don't migrate the
> task in the case where there previous CPU can't fit the task and
> there are no idle CPUs on which the task can fit. LGTM.

Hope I understand you here. I don't think that [patch 6/6] provides this
already.

In case 'later_mask' has no fitting CPUs, 'max_cpu' is set in the
otherwise empty 'later_mask'. But 'max_cpu' is not necessarily task_cpu(p).

Example on Juno [L b b L L L] with thread0-0 (big task)

     cpudl_find [thread0-0 2117] orig later_mask=0,3-4 later_mask=0
  find_later_rq [thread0-0 2117] task_cpu=2 later_mask=0

A tweak could be added to favor task_cpu(p) in case it is amongst the
CPUs with the maximum capacity in cpudl_find() for the !fit case, as
sketched below.
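
Roughly (a hypothetical sketch on top of the cpudl_find() hunk in
patch 6/6, not tested):

	cap = capacity_orig_of(cpu);

	/* favor task_cpu(p) amongst the max capacity CPUs */
	if (cap > max_cap ||
	    (cpu == task_cpu(p) && cap == max_cap)) {
		max_cap = cap;
		max_cpu = cpu;
	}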

[...]

>> +/*
>> + * Verify the fitness of task @p to run on @cpu taking into account the
>> + * CPU original capacity and the runtime/deadline ratio of the task.
>> + *
>> + * The function will return true if the CPU original capacity of the
>> + * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
>> + * task and false otherwise.
>> + */
>> +static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
>> +{
>> +	unsigned long cap = arch_scale_cpu_capacity(cpu);
>> +
>> +	return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
>> +}
>> +
> 
> This is same as
> 
> return p->dl.dl_bw >> (BW_SHIFT - SCHED_CAPACITY_SHIFT) <= cap
> 
> Correct?  If yes, would it be better to use this?

We could use sched_dl_entity::dl_density (dl_runtime / dl_deadline) but
then I would have to export BW_SHIFT.
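
Something like this (a sketch, assuming dl_density keeps being set to
to_ratio(dl_deadline, dl_runtime), i.e. (dl_runtime << BW_SHIFT) /
dl_deadline):

	static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
	{
		unsigned long cap = arch_scale_cpu_capacity(cpu);

		/* runtime/deadline ratio scaled to SCHED_CAPACITY_SCALE */
		return p->dl.dl_density >> (BW_SHIFT - SCHED_CAPACITY_SHIFT) <= cap;
	}

which should be equivalent to the cap_scale() based check modulo
rounding.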

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case
  2020-04-30 11:00         ` Pavan Kondeti
@ 2020-05-01 16:12           ` Dietmar Eggemann
  0 siblings, 0 replies; 24+ messages in thread
From: Dietmar Eggemann @ 2020-05-01 16:12 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: luca abeni, Juri Lelli, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Steven Rostedt, Daniel Bristot de Oliveira,
	Wei Wang, Quentin Perret, Alessio Balsini, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On 30/04/2020 13:00, Pavan Kondeti wrote:
> On Wed, Apr 29, 2020 at 07:39:50PM +0200, Dietmar Eggemann wrote:
>> On 27/04/2020 16:17, luca abeni wrote:

[...]

>>> On Mon, 27 Apr 2020 15:34:38 +0200
>>> Juri Lelli <juri.lelli@redhat.com> wrote:

[...]

>>>> On 27/04/20 10:37, Dietmar Eggemann wrote:
>>>>> From: Luca Abeni <luca.abeni@santannapisa.it>

[...]

>>>>> -		if (!cpumask_empty(later_mask))
>>>>> -			return 1;
>>>>> +		if (cpumask_empty(later_mask))
>>>>> +			cpumask_set_cpu(max_cpu, later_mask);  
>>>>
>>>> Think we touched upon this during v1 review, but I'm (still?)
>>>> wondering if we can do a little better, still considering only free
>>>> cpus.
>>>>
>>>> Can't we get into a situation that some of the (once free) big cpus
>>>> have been occupied by small tasks and now a big task enters the
>>>> system and it only finds small cpus available, were it could have fit
>>>> into bigs if small tasks were put onto small cpus?
>>>>
>>>> I.e., shouldn't we always try to best fit among free cpus?
>>>
>>> Yes; there was an additional patch that tried schedule each task on the
>>> slowest core where it can fit, to address this issue.
>>> But I think it will go in a second round of patches.
>>
>> Yes, we can run into this situation in DL, but also in CFS or RT.
>>
> In CFS case, the misfit task handling in load balancer should help pulling
> the BIG task running on the little CPUs. I get your point that we can run
> into the same scenario with other scheduling class tasks.

Yes, the CPU stopper (i.e. CFS's active load balance) can help here.
IMHO, using the CPU stopper in RT/DL for moving the running task (in
addition to using best fit rather than just a fitting CPU) is
considered future work.
AFAICS, push/pull is not designed for migration of running tasks.

[...]

>> You did spot the rt-app 'delay' for the small tasks in the test case ;-)
> 
> Thanks for the hint. It was not clear to me why 1 msec delay is given for
> the small tasks in the rt-app json description in the cover letter.
> I get it now :-)

So far, capacity awareness in RT/DL means that as long as there are CPUs
available which fit the task, use one of them. This is already
beneficial for a lot of use cases on CPU asymmetric systems since it
offers more predictable behavior.

I'll add a note to the cover letter in the next version about the reason
for the rt-app 'delay'.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/6] sched/deadline: Make DL capacity-aware
  2020-05-01 16:12     ` Dietmar Eggemann
@ 2020-05-04  3:58       ` Pavan Kondeti
  2020-05-05 18:02         ` Dietmar Eggemann
  0 siblings, 1 reply; 24+ messages in thread
From: Pavan Kondeti @ 2020-05-04  3:58 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Luca Abeni, Daniel Bristot de Oliveira, Wei Wang,
	Quentin Perret, Alessio Balsini, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On Fri, May 01, 2020 at 06:12:07PM +0200, Dietmar Eggemann wrote:
> On 30/04/2020 15:10, Pavan Kondeti wrote:
> > On Mon, Apr 27, 2020 at 10:37:08AM +0200, Dietmar Eggemann wrote:
> >> From: Luca Abeni <luca.abeni@santannapisa.it>
> 
> [...]
> 
> >> @@ -1653,10 +1654,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> >>  	 * other hand, if it has a shorter deadline, we
> >>  	 * try to make it stay here, it might be important.
> >>  	 */
> >> -	if (unlikely(dl_task(curr)) &&
> >> -	    (curr->nr_cpus_allowed < 2 ||
> >> -	     !dl_entity_preempt(&p->dl, &curr->dl)) &&
> >> -	    (p->nr_cpus_allowed > 1)) {
> >> +	select_rq = unlikely(dl_task(curr)) &&
> >> +		    (curr->nr_cpus_allowed < 2 ||
> >> +		     !dl_entity_preempt(&p->dl, &curr->dl)) &&
> >> +		    p->nr_cpus_allowed > 1;
> >> +
> >> +	/*
> >> +	 * Take the capacity of the CPU into account to
> >> +	 * ensure it fits the requirement of the task.
> >> +	 */
> >> +	if (static_branch_unlikely(&sched_asym_cpucapacity))
> >> +		select_rq |= !dl_task_fits_capacity(p, cpu);
> >> +
> >> +	if (select_rq) {
> >>  		int target = find_later_rq(p);
> > 
> > I see that find_later_rq() checks if the previous CPU is part of
> > later_mask and returns it immediately. So we don't migrate the
> > task in the case where there previous CPU can't fit the task and
> > there are no idle CPUs on which the task can fit. LGTM.
> 
> Hope I understand you here. I don't think that [patch 6/6] provides this
> already.
> 
> In case 'later_mask' has no fitting CPUs, 'max_cpu' is set in the
> otherwise empty 'later_mask'. But 'max_cpu' is not necessary task_cpu(p).
> 
> Example on Juno [L b b L L L] with thread0-0 (big task)
> 
>      cpudl_find [thread0-0 2117] orig later_mask=0,3-4 later_mask=0
>   find_later_rq [thread0-0 2117] task_cpu=2 later_mask=0
> 
> A tweak could be added to favor task_cpu(p) in case it is amongst the CPUs
> with the maximum capacity in cpudl_find() for the !fit case.
> 

You are right. max_cpu can be other than task_cpu(p), in which case we
migrate the task even though it won't fit on the new CPU. While
introducing capacity awareness in RT, Qais made the below change to
avoid the migration. We can do something similar here also.

commit b28bc1e002c2 (sched/rt: Re-instate old behavior in select_task_rq_rt())
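
For DL, the analogous guard could look roughly like this (a hypothetical
sketch mirroring the idea of that commit, not its actual diff; the real
select_task_rq_dl() flow has more conditions):

	if (select_rq) {
		int target = find_later_rq(p);

		/* Only migrate if the target CPU actually fits the
		 * task, otherwise stay on the previous CPU. */
		if (target != -1 && dl_task_fits_capacity(p, target))
			cpu = target;
	}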

> [...]
> 
> >> +/*
> >> + * Verify the fitness of task @p to run on @cpu taking into account the
> >> + * CPU original capacity and the runtime/deadline ratio of the task.
> >> + *
> >> + * The function will return true if the CPU original capacity of the
> >> + * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
> >> + * task and false otherwise.
> >> + */
> >> +static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
> >> +{
> >> +	unsigned long cap = arch_scale_cpu_capacity(cpu);
> >> +
> >> +	return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
> >> +}
> >> +
> > 
> > This is same as
> > 
> > return p->dl.dl_bw >> (BW_SHIFT - SCHED_CAPACITY_SHIFT) <= cap
> > 
> > Correct?  If yes, would it be better to use this?
> 
> We could use sched_dl_entity::dl_density (dl_runtime / dl_deadline) but
> then I would have to export BW_SHIFT.

Yeah, I meant dl_density, not dl_bw. Thanks for correcting me. I see
that BW_SHIFT is defined in kernel/sched/sched.h.
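
For the record, the two forms agree up to integer rounding. With
cap_scale(x, cap) = (x * cap) >> SCHED_CAPACITY_SHIFT and
dl_density = (dl_runtime << BW_SHIFT) / dl_deadline:

	cap_scale(dl_deadline, cap) >= dl_runtime
	<=> (dl_deadline * cap) >> SCHED_CAPACITY_SHIFT >= dl_runtime
	<=> cap >= (dl_runtime << SCHED_CAPACITY_SHIFT) / dl_deadline
	<=> cap >= dl_density >> (BW_SHIFT - SCHED_CAPACITY_SHIFT)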

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/6] sched/deadline: Make DL capacity-aware
  2020-05-04  3:58       ` Pavan Kondeti
@ 2020-05-05 18:02         ` Dietmar Eggemann
  0 siblings, 0 replies; 24+ messages in thread
From: Dietmar Eggemann @ 2020-05-05 18:02 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Luca Abeni, Daniel Bristot de Oliveira, Wei Wang,
	Quentin Perret, Alessio Balsini, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On 04/05/2020 05:58, Pavan Kondeti wrote:
> On Fri, May 01, 2020 at 06:12:07PM +0200, Dietmar Eggemann wrote:
>> On 30/04/2020 15:10, Pavan Kondeti wrote:
>>> On Mon, Apr 27, 2020 at 10:37:08AM +0200, Dietmar Eggemann wrote:
>>>> From: Luca Abeni <luca.abeni@santannapisa.it>
>>
>> [...]
>>
>>>> @@ -1653,10 +1654,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>>>>  	 * other hand, if it has a shorter deadline, we
>>>>  	 * try to make it stay here, it might be important.
>>>>  	 */
>>>> -	if (unlikely(dl_task(curr)) &&
>>>> -	    (curr->nr_cpus_allowed < 2 ||
>>>> -	     !dl_entity_preempt(&p->dl, &curr->dl)) &&
>>>> -	    (p->nr_cpus_allowed > 1)) {
>>>> +	select_rq = unlikely(dl_task(curr)) &&
>>>> +		    (curr->nr_cpus_allowed < 2 ||
>>>> +		     !dl_entity_preempt(&p->dl, &curr->dl)) &&
>>>> +		    p->nr_cpus_allowed > 1;
>>>> +
>>>> +	/*
>>>> +	 * Take the capacity of the CPU into account to
>>>> +	 * ensure it fits the requirement of the task.
>>>> +	 */
>>>> +	if (static_branch_unlikely(&sched_asym_cpucapacity))
>>>> +		select_rq |= !dl_task_fits_capacity(p, cpu);
>>>> +
>>>> +	if (select_rq) {
>>>>  		int target = find_later_rq(p);
>>>
>>> I see that find_later_rq() checks if the previous CPU is part of
>>> later_mask and returns it immediately. So we don't migrate the
>>> task in the case where the previous CPU can't fit the task and
>>> there are no idle CPUs on which the task can fit. LGTM.
>>
>> Hope I understand you here. I don't think that [patch 6/6] provides this
>> already.
>>
>> In case 'later_mask' has no fitting CPUs, 'max_cpu' is set in the
>> otherwise empty 'later_mask'. But 'max_cpu' is not necessarily task_cpu(p).
>>
>> Example on Juno [L b b L L L] with thread0-0 (big task)
>>
>>      cpudl_find [thread0-0 2117] orig later_mask=0,3-4 later_mask=0
>>   find_later_rq [thread0-0 2117] task_cpu=2 later_mask=0
>>
>> A tweak could be added to favor task_cpu(p) in case it is amongst the CPUs
>> with the maximum capacity in cpudl_find() for the !fit case.
>>
> 
> You are right. max_cpu can be other than task_cpu(p), in which case we
> migrate the task even though it won't fit on the new CPU. While
> introducing capacity awareness in RT, Qais made the below change to
> avoid the migration. We can do something similar here also.
> 
> commit b28bc1e002c2 (sched/rt: Re-instate old behavior in select_task_rq_rt())

I'm not sure something like this is necessary here.

With DL capacity awareness, we reduce the later_mask returned by
cpudl_find() in find_later_rq() to those idle CPUs which can handle p,
or, in case there is none, to a 'non-fitting CPU w/ max capacity'.

We just have to favor task_cpu(p) in [patch 6/6] in case it is part
of the initial later_mask and among these 'non-fitting CPUs w/ max
capacity'.

This will make sure that it gets chosen so find_later_rq() returns it
before the 'for_each_domain()' loop.

And I guess we still want to migrate if there is a non-fitting CPU w/ a
higher CPU capacity than task_cpu() (tri-gear).

@@ -137,7 +137,8 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
 
                                cap = capacity_orig_of(cpu);
 
-                               if (cap > max_cap) {
+                               if (cap > max_cap ||
+                                   (cpu == task_cpu(p) && cap == max_cap)) {
                                        max_cap = cap;
                                        max_cpu = cpu;

In case task_cpu() is not part of later_mask, the 'max CPU capacity CPU'
is returned as 'best_cpu'.

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()
  2020-04-27  8:37 ` [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity() Dietmar Eggemann
@ 2020-05-06 10:54   ` Dietmar Eggemann
  2020-05-06 12:37     ` Juri Lelli
  0 siblings, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2020-05-06 10:54 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli
  Cc: Vincent Guittot, Steven Rostedt, Luca Abeni,
	Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On 27/04/2020 10:37, Dietmar Eggemann wrote:

[...]

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 4ae22bfc37ae..eb23e6921d94 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -69,6 +69,25 @@ static inline int dl_bw_cpus(int i)
>  
>  	return cpus;
>  }
> +
> +static inline unsigned long dl_bw_capacity(int i)
> +{
> +	struct root_domain *rd = cpu_rq(i)->rd;
> +	unsigned long cap;
> +
> +	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> +			 "sched RCU must be held");
> +
> +	if (cpumask_subset(rd->span, cpu_active_mask))
> +		return rd->sum_cpu_capacity;
> +
> +	cap = 0;
> +
> +	for_each_cpu_and(i, rd->span, cpu_active_mask)
> +		cap += capacity_orig_of(i);
> +
> +	return cap;
> +}

There is an issue w/ excl. cpusets and cpuset.sched_load_balance=0. The
latter is needed to demonstrate the problem since DL task affinity can't
be altered.

A CPU in such a cpuset has its rq attached to def_root_domain which does
not have its 'sum_cpu_capacity' properly set.

root@juno:~# bash
root@juno:~# ps -eo comm,pid,class | grep bash
bash             1661 TS
bash             2040 TS
bash             2176 TS <--

root@juno:~# echo 2176 > /sys/fs/cgroup/cpuset/B/tasks 

root@juno:~# chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 2176
chrt: failed to set pid 2176's policy: Device or resource busy

...
sched_dl_overflow: [bash 2176] task_cpu=4 cpus_ptr=2,4-5
dl_bw_capacity() CPU4 dflt_rd->sum_cpu_capacity=0 <-- !!! dflt_rd->span=2,4-5 cpu_active_mask=0-5
...

OTOH, rd->span is properly set due to 'cpumask_clear_cpu(rq->cpu,
old_rd->span)' and 'cpumask_set_cpu(rq->cpu, rd->span)' in
rq_attach_root().
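
For illustration, the relevant part of rq_attach_root() (a heavily
simplified sketch; locking and rq online/offline handling are omitted):

	void rq_attach_root(struct rq *rq, struct root_domain *rd)
	{
		struct root_domain *old_rd = rq->rd;

		/* Hand the CPU over from the old rd's span to the new one. */
		if (old_rd)
			cpumask_clear_cpu(rq->cpu, old_rd->span);

		rq->rd = rd;
		cpumask_set_cpu(rq->cpu, rd->span);
	}

There is no equivalent hand-over for a capacity sum, which is why
def_root_domain's 'sum_cpu_capacity' can be stale.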

It's not possible to treat 'rd->sum_cpu_capacity' like 'rd->span' since
the former changes between sched domain teardown/bringup w/ asymmetric
CPU capacity.

What could be done is to return 'dl_bw_cpus(i) << SCHED_CAPACITY_SHIFT'
w/ symmetric CPU capacity (of 1024) and to loop over rd->span otherwise.
The latter includes symmetric cpusets w/ only little CPUs.

---8<---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 575b7d88d839..6d17748cb7a1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -70,24 +70,28 @@ static inline int dl_bw_cpus(int i)
 	return cpus;
 }
 
-static inline unsigned long dl_bw_capacity(int i)
-{
+static inline unsigned long __dl_bw_capacity(int i) {
 	struct root_domain *rd = cpu_rq(i)->rd;
-	unsigned long cap;
+	unsigned long cap = 0;
 
 	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
 			 "sched RCU must be held");
 
-	if (cpumask_subset(rd->span, cpu_active_mask))
-		return rd->sum_cpu_capacity;
-
-	cap = 0;
-
 	for_each_cpu_and(i, rd->span, cpu_active_mask)
 		cap += capacity_orig_of(i);
 
 	return cap;
 }
+
+static inline unsigned long dl_bw_capacity(int i)
+{
+	if (!static_branch_unlikely(&sched_asym_cpucapacity) &&
+	    capacity_orig_of(i) == SCHED_CAPACITY_SCALE) {
+		return dl_bw_cpus(i) << SCHED_CAPACITY_SHIFT;
+	} else {
+		return __dl_bw_capacity(i);
+	}
+}
 #else
 static inline struct dl_bw *dl_bw_of(int i)
 {
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()
  2020-05-06 10:54   ` Dietmar Eggemann
@ 2020-05-06 12:37     ` Juri Lelli
  2020-05-06 15:09       ` Dietmar Eggemann
  0 siblings, 1 reply; 24+ messages in thread
From: Juri Lelli @ 2020-05-06 12:37 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Steven Rostedt,
	Luca Abeni, Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On 06/05/20 12:54, Dietmar Eggemann wrote:
> On 27/04/2020 10:37, Dietmar Eggemann wrote:
> 
> [...]
> 
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 4ae22bfc37ae..eb23e6921d94 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -69,6 +69,25 @@ static inline int dl_bw_cpus(int i)
> >  
> >  	return cpus;
> >  }
> > +
> > +static inline unsigned long dl_bw_capacity(int i)
> > +{
> > +	struct root_domain *rd = cpu_rq(i)->rd;
> > +	unsigned long cap;
> > +
> > +	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> > +			 "sched RCU must be held");
> > +
> > +	if (cpumask_subset(rd->span, cpu_active_mask))
> > +		return rd->sum_cpu_capacity;
> > +
> > +	cap = 0;
> > +
> > +	for_each_cpu_and(i, rd->span, cpu_active_mask)
> > +		cap += capacity_orig_of(i);
> > +
> > +	return cap;
> > +}
> 
> There is an issue w/ excl. cpusets and cpuset.sched_load_balance=0. The
> latter is needed to demonstrate the problem since DL task affinity can't
> be altered.
> 
> A CPU in such a cpuset has its rq attached to def_root_domain which does
> not have its 'sum_cpu_capacity' properly set.

Hummm, but if sched_load_balance is disabled, it means that we've now got
a subset of CPUs which (from a DL AC PoV) are partitioned. So, I'd tend
to say that we actually want to check new tasks' bw requirements against
the available bandwidth of the particular CPU they happen to be running
on (and will continue to run on) when setscheduler is called.

If load balance is then enabled again, the AC check we did above should
still be valid for all tasks admitted in the meantime, no?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()
  2020-05-06 12:37     ` Juri Lelli
@ 2020-05-06 15:09       ` Dietmar Eggemann
  2020-05-11  8:01         ` Juri Lelli
  0 siblings, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2020-05-06 15:09 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Steven Rostedt,
	Luca Abeni, Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On 06/05/2020 14:37, Juri Lelli wrote:
> On 06/05/20 12:54, Dietmar Eggemann wrote:
>> On 27/04/2020 10:37, Dietmar Eggemann wrote:

[...]

>> There is an issue w/ excl. cpusets and cpuset.sched_load_balance=0. The
>> latter is needed to demonstrate the problem since DL task affinity can't
>> be altered.
>>
>> A CPU in such a cpuset has its rq attached to def_root_domain which does
>> not have its 'sum_cpu_capacity' properly set.
> 
> Hummm, but if sched_load_balance is disabled, it means that we've now got
> a subset of CPUs which (from a DL AC PoV) are partitioned. So, I'd tend

Yes, the CPUs of the cpuset w/ cpuset.sched_load_balance=0 (cpuset B in
the example).

> to say that we actually want to check new tasks' bw requirements against
> the available bandwidth of the particular CPU they happen to be running
> on (and will continue to run on) when setscheduler is called.

By 'available bandwidth of the particular CPU' you refer to
'\Sum_{cpu_rq(i)->rd->span} CPU capacity', right?

This is what this fix tries to achieve. Regardless of whether
cpu_rq(i)->rd is a 'real' rd or the def_root_domain, dl_bw_capacity()
will now always return '\Sum_{cpu_rq(i)->rd->span} CPU capacity'.

> If load balance is then enabled again, the AC check we did above should
> still be valid for all tasks admitted in the meantime, no?
 
Example (w/ this fix) on Juno [L b b L L L], capacity_orig_of(L)=446:

mkdir /sys/fs/cgroup/cpuset/A
echo 0 > /sys/fs/cgroup/cpuset/A/cpuset.mems
echo 1 > /sys/fs/cgroup/cpuset/A/cpuset.cpu_exclusive
echo 0-2 > /sys/fs/cgroup/cpuset/A/cpuset.cpus

mkdir /sys/fs/cgroup/cpuset/B
echo 0 > /sys/fs/cgroup/cpuset/B/cpuset.mems
echo 1 > /sys/fs/cgroup/cpuset/B/cpuset.cpu_exclusive
echo 3-5 > /sys/fs/cgroup/cpuset/B/cpuset.cpus

echo 0 > /sys/fs/cgroup/cpuset/B/cpuset.sched_load_balance
echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

echo $$ > /sys/fs/cgroup/cpuset/B/tasks
chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 $$

...
[  144.920102] __dl_bw_capacity CPU3 rd->span=3-5 return 1338
[  144.925607] sched_dl_overflow: [bash 1999] task_cpu(p)=3 cap=1338 cpus_ptr=3-5
[  144.932841] __dl_bw_capacity CPU3 rd->span=3-5 return 1338
...

echo 1 > /sys/fs/cgroup/cpuset/B/cpuset.sched_load_balance

echo $$ > /sys/fs/cgroup/cpuset/B/tasks
chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 $$

...
[  254.367982] __dl_bw_capacity CPU5 rd->span=3-5 return 1338
[  254.373487] sched_dl_overflow: [bash 2052] task_cpu(p)=5 cap=1338 cpus_ptr=3-5
[  254.380721] __dl_bw_capacity CPU5 rd->span=3-5 return 1338
...

Regardless of 'B/cpuset.sched_load_balance'
'\Sum_{cpu_rq(i)->rd->span} CPU capacity' stays 1338 (3*446)

So IMHO, DL AC check stays intact.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()
  2020-05-06 15:09       ` Dietmar Eggemann
@ 2020-05-11  8:01         ` Juri Lelli
  2020-05-12 12:39           ` Dietmar Eggemann
  0 siblings, 1 reply; 24+ messages in thread
From: Juri Lelli @ 2020-05-11  8:01 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Steven Rostedt,
	Luca Abeni, Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On 06/05/20 17:09, Dietmar Eggemann wrote:
> On 06/05/2020 14:37, Juri Lelli wrote:
> > On 06/05/20 12:54, Dietmar Eggemann wrote:
> >> On 27/04/2020 10:37, Dietmar Eggemann wrote:
> 
> [...]
> 
> >> There is an issue w/ excl. cpusets and cpuset.sched_load_balance=0. The
> >> latter is needed to demonstrate the problem since DL task affinity can't
> >> be altered.
> >>
> >> A CPU in such a cpuset has its rq attached to def_root_domain which does
> >> not have its 'sum_cpu_capacity' properly set.
> > 
> > Hummm, but if sched_load_balance is disabled, it means that we've now got
> > a subset of CPUs which (from a DL AC PoV) are partitioned. So, I'd tend
> 
> Yes, the CPUs of the cpuset w/ cpuset.sched_load_balance=0 (cpuset B in
> the example).
> 
> > to say that we actually want to check new tasks' bw requirements against
> > the available bandwidth of the particular CPU they happen to be running
> > on (and will continue to run on) when setscheduler is called.
> 
> By 'available bandwidth of the particular CPU' you refer to
> '\Sum_{cpu_rq(i)->rd->span} CPU capacity', right?

No. I was referring to the single CPU capacity. The capacity of the CPU
where a task is running when setscheduler is called for it (and DL AC
performed). See below, maybe it's clearer why I wondered about this case.

> This is what this fix tries to achieve. Regardless of whether
> cpu_rq(i)->rd is a 'real' rd or the def_root_domain, dl_bw_capacity()
> will now always return '\Sum_{cpu_rq(i)->rd->span} CPU capacity'.
> 
> > If load balance is then enabled again, the AC check we did above should
> > still be valid for all tasks admitted in the meantime, no?
>  
> Example (w/ this fix) on Juno [L b b L L L], capacity_orig_of(L)=446:
> 
> mkdir /sys/fs/cgroup/cpuset/A
> echo 0 > /sys/fs/cgroup/cpuset/A/cpuset.mems
> echo 1 > /sys/fs/cgroup/cpuset/A/cpuset.cpu_exclusive
> echo 0-2 > /sys/fs/cgroup/cpuset/A/cpuset.cpus
> 
> mkdir /sys/fs/cgroup/cpuset/B
> echo 0 > /sys/fs/cgroup/cpuset/B/cpuset.mems
> echo 1 > /sys/fs/cgroup/cpuset/B/cpuset.cpu_exclusive
> echo 3-5 > /sys/fs/cgroup/cpuset/B/cpuset.cpus
> 
> echo 0 > /sys/fs/cgroup/cpuset/B/cpuset.sched_load_balance
> echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
> 
> echo $$ > /sys/fs/cgroup/cpuset/B/tasks
> chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 $$
> 
> ...
> [  144.920102] __dl_bw_capacity CPU3 rd->span=3-5 return 1338
> [  144.925607] sched_dl_overflow: [bash 1999] task_cpu(p)=3 cap=1338 cpus_ptr=3-5

So, here you are checking the new task's bw against 1338, which is 3*L
capacity. However, since load balance is disabled at this point for 3-5,
once admitted the task will only be able to run on CPU 3. Now, if more
tasks on CPU 3 are admitted the same way (up to 1338), I believe they
will start to experience deadline misses because only 446 will actually
be available to them until load balance is enabled below and they are
then free to migrate to CPUs 4 and 5.
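
(To put made-up numbers on it: a task with dl_runtime=4ms and
dl_deadline=16ms claims 1024/4 = 256 capacity units. Five such tasks
sum to 1280 <= 1338 and would all pass AC, yet CPU 3 alone only offers
446 units, i.e. it would be roughly 2.9x oversubscribed.)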

Does it make sense?

> [  144.932841] __dl_bw_capacity CPU3 rd->span=3-5 return 1338
> ...
> 
> echo 1 > /sys/fs/cgroup/cpuset/B/cpuset.sched_load_balance
> 
> echo $$ > /sys/fs/cgroup/cpuset/B/tasks
> chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 $$
> 
> ...
> [  254.367982] __dl_bw_capacity CPU5 rd->span=3-5 return 1338
> [  254.373487] sched_dl_overflow: [bash 2052] task_cpu(p)=5 cap=1338 cpus_ptr=3-5
> [  254.380721] __dl_bw_capacity CPU5 rd->span=3-5 return 1338
> ...
> 
> Regardless of 'B/cpuset.sched_load_balance'
> '\Sum_{cpu_rq(i)->rd->span} CPU capacity' stays 1338 (3*446)
> 
> So IMHO, DL AC check stays intact.
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()
  2020-05-11  8:01         ` Juri Lelli
@ 2020-05-12 12:39           ` Dietmar Eggemann
  2020-05-15 12:26             ` Juri Lelli
  0 siblings, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2020-05-12 12:39 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Steven Rostedt,
	Luca Abeni, Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On 11/05/2020 10:01, Juri Lelli wrote:
> On 06/05/20 17:09, Dietmar Eggemann wrote:
>> On 06/05/2020 14:37, Juri Lelli wrote:
>>> On 06/05/20 12:54, Dietmar Eggemann wrote:
>>>> On 27/04/2020 10:37, Dietmar Eggemann wrote:

[...]

>>> to say that we actually want to check new tasks' bw requirements against
>>> the available bandwidth of the particular CPU they happen to be running
>>> on (and will continue to run on) when setscheduler is called.
>>
>> By 'available bandwidth of the particular CPU' you refer to
>> '\Sum_{cpu_rq(i)->rd->span} CPU capacity', right?
> 
> No. I was referring to the single CPU capacity. The capacity of the CPU
> where a task is running when setscheduler is called for it (and DL AC
> performed). See below, maybe it's clearer why I wondered about this case.

OK, got it! I was just confused since I don't think that this patch
introduced the issue.

Before the patch, 'int cpus = dl_bw_cpus(task_cpu(p))' was used, which
returns the number of CPUs in the (default) rd (n). So each CPU was
implicitly assumed to have a capacity of 1024, i.e. n*1024 in total.
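
To put numbers on it with the earlier Juno example: with
def_root_domain->span=2,4-5 the old check assumed 3*1024 = 3072
capacity units, while the actual sum is 1024 + 446 + 446 = 1916.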

I wonder if a fix for that should be part of this patch-set?

[...]

>> ...
>> [  144.920102] __dl_bw_capacity CPU3 rd->span=3-5 return 1338
>> [  144.925607] sched_dl_overflow: [bash 1999] task_cpu(p)=3 cap=1338 cpus_ptr=3-5
> 
> So, here you are checking new task bw against 1338 which is 3*L
> capacity. However, since load balance is disabled at this point for 3-5,
> once admitted the task will only be able to run on CPU 3. Now, if more
> tasks on CPU 3 are admitted the same way (up to 1338) I believe they
> will start to experience deadline misses because only 446 will be
> actually available to them until load balance is enabled below and they
> are then free to migrate to CPUs 4 and 5.
> 
> Does it make sense?

Yes, it does.

So my first idea was to only consider the CPU (i.e. its CPU capacity) in
case we detect 'cpu_rq(cpu)->rd == def_root_domain'?
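
Something like this (a hypothetical sketch only, on top of the fix
above):

	static inline unsigned long dl_bw_capacity(int i)
	{
		/* No load balancing within the span: only CPU i counts. */
		if (cpu_rq(i)->rd == &def_root_domain)
			return capacity_orig_of(i);

		return __dl_bw_capacity(i);
	}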

In case I re-enable load-balancing on cpuset '/', we can't make a task
in cpuset 'B' DL since we hit this in __sched_setscheduler():

4931           /*
4932            * Don't allow tasks with an affinity mask smaller than
4933            * the entire root_domain to become SCHED_DEADLINE.
...
4935            */
4936            if (!cpumask_subset(span, p->cpus_ptr) || ...

root@juno:~# echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
root@juno:~# echo $$ > /sys/fs/cgroup/cpuset/B/tasks
root@juno:~# chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 $$
chrt: failed to set pid 2316's policy: Operation not permitted

So this task has to leave 'B' first I assume.

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()
  2020-05-12 12:39           ` Dietmar Eggemann
@ 2020-05-15 12:26             ` Juri Lelli
  0 siblings, 0 replies; 24+ messages in thread
From: Juri Lelli @ 2020-05-15 12:26 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Steven Rostedt,
	Luca Abeni, Daniel Bristot de Oliveira, Wei Wang, Quentin Perret,
	Alessio Balsini, Pavan Kondeti, Patrick Bellasi,
	Morten Rasmussen, Valentin Schneider, Qais Yousef, linux-kernel

On 12/05/20 14:39, Dietmar Eggemann wrote:
> On 11/05/2020 10:01, Juri Lelli wrote:
> > On 06/05/20 17:09, Dietmar Eggemann wrote:
> >> On 06/05/2020 14:37, Juri Lelli wrote:
> >>> On 06/05/20 12:54, Dietmar Eggemann wrote:
> >>>> On 27/04/2020 10:37, Dietmar Eggemann wrote:
> 
> [...]
> 
> >>> to say that we actually want to check new tasks' bw requirements against
> >>> the available bandwidth of the particular CPU they happen to be running
> >>> on (and will continue to run on) when setscheduler is called.
> >>
> >> By 'available bandwidth of the particular CPU' you refer to
> >> '\Sum_{cpu_rq(i)->rd->span} CPU capacity', right?
> > 
> > No. I was referring to the single CPU capacity. The capacity of the CPU
> > where a task is running when setscheduler is called for it (and DL AC
> > performed). See below, maybe it's clearer why I wondered about this case.
> 
> OK, got it! I was just confused since I don't think that this patch
> introduced the issue.
> 
> Before the patch, 'int cpus = dl_bw_cpus(task_cpu(p))' was used, which
> returns the number of CPUs in the (default) rd (n). So each CPU was
> implicitly assumed to have a capacity of 1024, i.e. n*1024 in total.
> 
> I wonder if a fix for that should be part of this patch-set?

Not really, I guess. As you said, the issue was there already. We can
fix both situations with a subsequent patch. I just realized while
reviewing this set that we have a problem, but it's not this set's job
to fix it.

While you are changing this part anyway, it might be good to put a
comment (XXX fix this, or something) about the issue, so that we don't
forget.
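
Something as simple as this would do (just a sketch of the comment):

	/*
	 * XXX Fix: if the rq is attached to def_root_domain (e.g. an
	 * exclusive cpuset w/ sched_load_balance=0), tasks can't migrate
	 * within rd->span, so checking against the rd capacity sum is
	 * too optimistic.
	 */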

> [...]
> 
> >> ...
> >> [  144.920102] __dl_bw_capacity CPU3 rd->span=3-5 return 1338
> >> [  144.925607] sched_dl_overflow: [bash 1999] task_cpu(p)=3 cap=1338 cpus_ptr=3-5
> > 
> > So, here you are checking the new task's bw against 1338, which is 3*L
> > capacity. However, since load balance is disabled at this point for 3-5,
> > once admitted the task will only be able to run on CPU 3. Now, if more
> > tasks on CPU 3 are admitted the same way (up to 1338), I believe they
> > will start to experience deadline misses because only 446 will actually
> > be available to them until load balance is enabled below and they are
> > then free to migrate to CPUs 4 and 5.
> > 
> > Does it make sense?
> 
> Yes, it does.
> 
> So my first idea was to only consider the CPU (i.e. its CPU capacity) in
> case we detect 'cpu_rq(cpu)->rd == def_root_domain'?
> 
> In case I re-enable load-balancing on cpuset '/', we can't make a task
> in cpuset 'B' DL since we hit this in __sched_setscheduler():
> 
> 4931           /*
> 4932            * Don't allow tasks with an affinity mask smaller than
> 4933            * the entire root_domain to become SCHED_DEADLINE.
> ...
> 4935            */
> 4936            if (!cpumask_subset(span, p->cpus_ptr) || ...
> 
> root@juno:~# echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
> root@juno:~# echo $$ > /sys/fs/cgroup/cpuset/B/tasks
> root@juno:~# chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 $$
> chrt: failed to set pid 2316's policy: Operation not permitted
> 
> So this task has to leave 'B' first I assume.

Right, because the span is back to containing all CPUs (load balancing
enabled at root level), but tasks in 'B' still have their affinity set
to a subset of them.


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread

Thread overview: 24+ messages
2020-04-27  8:37 [PATCH v2 0/6] Capacity awareness for SCHED_DEADLINE Dietmar Eggemann
2020-04-27  8:37 ` [PATCH v2 1/6] sched/topology: Store root domain CPU capacity sum Dietmar Eggemann
2020-04-27  8:37 ` [PATCH v2 2/6] sched/deadline: Optimize dl_bw_cpus() Dietmar Eggemann
2020-04-30 10:55   ` Pavan Kondeti
2020-05-01 16:12     ` Dietmar Eggemann
2020-04-27  8:37 ` [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity() Dietmar Eggemann
2020-05-06 10:54   ` Dietmar Eggemann
2020-05-06 12:37     ` Juri Lelli
2020-05-06 15:09       ` Dietmar Eggemann
2020-05-11  8:01         ` Juri Lelli
2020-05-12 12:39           ` Dietmar Eggemann
2020-05-15 12:26             ` Juri Lelli
2020-04-27  8:37 ` [PATCH v2 4/6] sched/deadline: Improve admission control for asymmetric CPU capacities Dietmar Eggemann
2020-04-27  8:37 ` [PATCH v2 5/6] sched/deadline: Make DL capacity-aware Dietmar Eggemann
2020-04-30 13:10   ` Pavan Kondeti
2020-05-01 16:12     ` Dietmar Eggemann
2020-05-04  3:58       ` Pavan Kondeti
2020-05-05 18:02         ` Dietmar Eggemann
2020-04-27  8:37 ` [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case Dietmar Eggemann
2020-04-27 13:34   ` Juri Lelli
2020-04-27 14:17     ` luca abeni
2020-04-29 17:39       ` Dietmar Eggemann
2020-04-30 11:00         ` Pavan Kondeti
2020-05-01 16:12           ` Dietmar Eggemann
