LKML Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/6] sched/deadline: cpuset task acceptance review
@ 2021-01-12 15:53 Daniel Bristot de Oliveira
  2021-01-12 15:53 ` [PATCH 1/6] sched/deadline: Consolidate the SCHED_DL task_can_attach() check on its own function Daniel Bristot de Oliveira
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-01-12 15:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Marco Perronet, Daniel Bristot de Oliveira, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Li Zefan, Tejun Heo,
	Johannes Weiner, Valentin Schneider, cgroups

While surveying the properties of the SCHED_DEADLINE, Marco Perronet found some
inconsistencies in the acceptance of DL threads on cpuset. More precisely,
regarding the acceptance of treads with arbitrary affinity. He contacted me,
and while doing the investigation, we found yet other potential issues
addressed in this patch series.

Each patch has a more in-depth explanation, including ways to reproduce
the problem. 

Daniel Bristot de Oliveira (6):
  sched/deadline: Consolidate the SCHED_DL task_can_attach() check on
    its own function
  sched/deadline: Inform dl_task_can_attach() if the cpuset is exclusive
  sched/deadline: Allow DL tasks on empty (cgroup v2) cpusets
  sched/deadline: Block DL tasks on non-exclusive cpuset if bandwitdh
    control is enable
  sched/deadline: Add helpers to get the correct root domain/span/dl_bw
  sched/deadline: Fixes cpu/rd/dl_bw references for suspended tasks

 include/linux/sched.h   |  2 +-
 kernel/cgroup/cpuset.c  |  5 ++++-
 kernel/sched/core.c     | 13 ++++++------
 kernel/sched/deadline.c | 28 ++++++++++++++++++++++---
 kernel/sched/sched.h    | 45 ++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 80 insertions(+), 13 deletions(-)

-- 
2.29.2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/6] sched/deadline: Consolidate the SCHED_DL task_can_attach() check on its own function
  2021-01-12 15:53 [PATCH 0/6] sched/deadline: cpuset task acceptance review Daniel Bristot de Oliveira
@ 2021-01-12 15:53 ` Daniel Bristot de Oliveira
  2021-01-12 15:53 ` [PATCH 2/6] sched/deadline: Inform dl_task_can_attach() if the cpuset is exclusive Daniel Bristot de Oliveira
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-01-12 15:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Marco Perronet, Daniel Bristot de Oliveira, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Li Zefan, Tejun Heo,
	Johannes Weiner, Valentin Schneider, cgroups

It is easier to track the restrictions imposed on SCHED_DEADLINE
tasks if we keep them all together in a single function.

No function changes.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 kernel/sched/core.c     | 3 +--
 kernel/sched/deadline.c | 7 +++++++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7af80c3fce12..f4aede34449c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7142,8 +7142,7 @@ int task_can_attach(struct task_struct *p,
 		goto out;
 	}
 
-	if (dl_task(p) && !cpumask_intersects(task_rq(p)->rd->span,
-					      cs_cpus_allowed))
+	if (dl_task(p))
 		ret = dl_task_can_attach(p, cs_cpus_allowed);
 
 out:
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 75686c6d4436..c97d2c707b98 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2870,6 +2870,13 @@ int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allo
 	bool overflow;
 	int ret;
 
+	/*
+	 * The task is not moving to another root domain, so it is
+	 * already accounted.
+	 */
+	if (cpumask_intersects(task_rq(p)->rd->span, cs_cpus_allowed))
+		return 0;
+
 	dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed);
 
 	rcu_read_lock_sched();
-- 
2.29.2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 2/6] sched/deadline: Inform dl_task_can_attach() if the cpuset is exclusive
  2021-01-12 15:53 [PATCH 0/6] sched/deadline: cpuset task acceptance review Daniel Bristot de Oliveira
  2021-01-12 15:53 ` [PATCH 1/6] sched/deadline: Consolidate the SCHED_DL task_can_attach() check on its own function Daniel Bristot de Oliveira
@ 2021-01-12 15:53 ` Daniel Bristot de Oliveira
  2021-01-12 15:53 ` [PATCH 3/6] sched/deadline: Allow DL tasks on empty (cgroup v2) cpusets Daniel Bristot de Oliveira
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-01-12 15:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Marco Perronet, Daniel Bristot de Oliveira, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Li Zefan, Tejun Heo,
	Johannes Weiner, Valentin Schneider, cgroups

In preparation for blocking SCHED_DEADLINE tasks on non-exclusive cpuset
if bandwidth control is enabled.

No functional changes.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 include/linux/sched.h   | 2 +-
 kernel/cgroup/cpuset.c  | 5 ++++-
 kernel/sched/core.c     | 4 ++--
 kernel/sched/deadline.c | 3 ++-
 kernel/sched/sched.h    | 2 +-
 5 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 942b87f80cc7..9525fbe032c9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1654,7 +1654,7 @@ current_restore_flags(unsigned long orig_flags, unsigned long flags)
 }
 
 extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
-extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed);
+extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed, int exclusive);
 #ifdef CONFIG_SMP
 extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
 extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 53c70c470a38..c6513dfabd3d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2132,6 +2132,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
 	struct task_struct *task;
+	int exclusive;
 	int ret;
 
 	/* used later by cpuset_attach() */
@@ -2146,8 +2147,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	    (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
 		goto out_unlock;
 
+	exclusive = is_cpu_exclusive(cs);
+
 	cgroup_taskset_for_each(task, css, tset) {
-		ret = task_can_attach(task, cs->cpus_allowed);
+		ret = task_can_attach(task, cs->cpus_allowed, exclusive);
 		if (ret)
 			goto out_unlock;
 		ret = security_task_setscheduler(task);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f4aede34449c..5961a97541c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7124,7 +7124,7 @@ int cpuset_cpumask_can_shrink(const struct cpumask *cur,
 }
 
 int task_can_attach(struct task_struct *p,
-		    const struct cpumask *cs_cpus_allowed)
+		    const struct cpumask *cs_cpus_allowed, int exclusive)
 {
 	int ret = 0;
 
@@ -7143,7 +7143,7 @@ int task_can_attach(struct task_struct *p,
 	}
 
 	if (dl_task(p))
-		ret = dl_task_can_attach(p, cs_cpus_allowed);
+		ret = dl_task_can_attach(p, cs_cpus_allowed, exclusive);
 
 out:
 	return ret;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c97d2c707b98..943aa32cc1bc 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2862,7 +2862,8 @@ bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
 }
 
 #ifdef CONFIG_SMP
-int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed)
+int dl_task_can_attach(struct task_struct *p,
+		const struct cpumask *cs_cpus_allowed, int exclusive)
 {
 	unsigned long flags, cap;
 	unsigned int dest_cpu;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f5acb6c5ce49..54881d99cebd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -337,7 +337,7 @@ extern void __setparam_dl(struct task_struct *p, const struct sched_attr *attr);
 extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr);
 extern bool __checkparam_dl(const struct sched_attr *attr);
 extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr);
-extern int  dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed);
+extern int  dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed, int exclusive);
 extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern bool dl_cpu_busy(unsigned int cpu);
 
-- 
2.29.2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 3/6] sched/deadline: Allow DL tasks on empty (cgroup v2) cpusets
  2021-01-12 15:53 [PATCH 0/6] sched/deadline: cpuset task acceptance review Daniel Bristot de Oliveira
  2021-01-12 15:53 ` [PATCH 1/6] sched/deadline: Consolidate the SCHED_DL task_can_attach() check on its own function Daniel Bristot de Oliveira
  2021-01-12 15:53 ` [PATCH 2/6] sched/deadline: Inform dl_task_can_attach() if the cpuset is exclusive Daniel Bristot de Oliveira
@ 2021-01-12 15:53 ` Daniel Bristot de Oliveira
  2021-01-14 12:12   ` Dietmar Eggemann
  2021-01-22  8:08   ` Juri Lelli
  2021-01-12 15:53 ` [PATCH 4/6] sched/deadline: Block DL tasks on non-exclusive cpuset if bandwitdh control is enable Daniel Bristot de Oliveira
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 15+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-01-12 15:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Marco Perronet, Daniel Bristot de Oliveira, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Li Zefan, Tejun Heo,
	Johannes Weiner, Valentin Schneider, cgroups

cgroups v2 allows the cpuset controller to be enabled/disabled on
demand. On Fedora 32, cpuset is disabled by default. To enable it,
a user needs to:

  # cd /sys/fs/cgroup/
  # echo +cpuset > cgroup.subtree_control

Existing cgroups will expose the cpuset interface (e.g., cpuset.cpus
file). By default, cpuset.cpus has no CPU assigned, which means that
existing tasks will move to a cpuset without cpus.

With that in mind, look what happens if a SCHED_DEADLINE task exists
on any cgroup (user.slice by default on Fedora):

----- %< -----
  # chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 100 &
  # cd /sys/fs/cgroup/
  # echo '+cpuset' > cgroup.subtree_control
  [   65.384041] BUG: unable to handle page fault for address: ffffffffb720f7e0
  [   65.384551] #PF: supervisor read access in kernel mode
  [   65.384923] #PF: error_code(0x0000) - not-present page
  [   65.385298] PGD 61a15067 P4D 61a15067 PUD 61a16063 PMD 800fffff9ddff062
  [   65.385781] Oops: 0000 [#1] SMP PTI
  [   65.386042] CPU: 0 PID: 799 Comm: sh Not tainted 5.10.0-rc3 #1
  [   65.386461] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  [   65.387077] RIP: 0010:dl_task_can_attach+0x40/0x250
  [   65.387429] Code: 54 55 53 48 83 ec 18 48 89 3c 24 bf ff ff ff ff e8 05 a2 52 00
                       4c 63 f0 48 c7 c5 00 9e 02 00 4a 8b 04 f5 00 09 47 b6 48 89 ea
                       <4c> 8b a4 10 e0 09 00 00 49 8d 44 24 40 48 89 c7 48 89 44 24
                       08 e8
  [   65.388768] RSP: 0018:ffffaee8c056fcd8 EFLAGS: 00010283
  [   65.389148] RAX: ffffffffb71e5000 RBX: ffffaee8c056fdd0 RCX: 0000000000000040
  [   65.389661] RDX: 0000000000029e00 RSI: ffff9db202534e48 RDI: ffffffffb6d3a3e0
  [   65.390174] RBP: 0000000000029e00 R08: 0000000000000000 R09: 0000000000000004
  [   65.390686] R10: 0000000000000001 R11: 00000000ffa6fbff R12: ffffaee8c056fbf0
  [   65.391196] R13: ffff9db2024e1400 R14: 0000000000000004 R15: ffff9db20ebb31e0
  [   65.391710] FS:  00007f6df41b1740(0000) GS:ffff9db377c00000(0000) knlGS:0000000000000000
  [   65.392289] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [   65.392705] CR2: ffffffffb720f7e0 CR3: 000000010680a003 CR4: 0000000000370ef0
  [   65.393220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [   65.393732] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [   65.394244] Call Trace:
  [   65.394437]  cpuset_can_attach+0x8b/0x110
  [   65.394732]  cgroup_migrate_execute+0x70/0x430
  [   65.395057]  cgroup_update_dfl_csses+0x222/0x230
  [   65.395392]  cgroup_subtree_control_write+0x2c6/0x3c0
  [   65.395759]  kernfs_fop_write+0xce/0x1b0
  [   65.396048]  vfs_write+0xc2/0x230
  [   65.396291]  ksys_write+0x4f/0xc0
  [   65.396533]  do_syscall_64+0x33/0x40
  [   65.396797]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [   65.397166] RIP: 0033:0x7f6df42a6537
  [   65.397428] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f
                       1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05
                       <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89
                       74 24
  [   65.398766] RSP: 002b:00007ffee4128018 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
  [   65.399838] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f6df42a6537
  [   65.400923] RDX: 0000000000000008 RSI: 000055b3f7e549e0 RDI: 0000000000000001
  [   65.402003] RBP: 000055b3f7e549e0 R08: 000000000000000a R09: 0000000000000007
  [   65.403082] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000000008
  [   65.404156] R13: 00007f6df4378500 R14: 0000000000000008 R15: 00007f6df4378700
  [   65.405218] Modules linked in: <lots of modules>
  [   65.414172] CR2: ffffffffb720f7e0
  [   65.415117] ---[ end trace 2dbff1a688549e65 ]---
----- >% -----

That happens because on dl_task_can_attach():
        dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed);

returns a non active cpu.

Initially, I thought about returning an error and blocking the
operation. However, that is indeed not needed. The cpuset without
CPUs assigned will be a non-root cpuset, hence its cpu mask will
be the same as the root one. So, the bandwidth was already accounted,
and the task can proceed.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 kernel/sched/deadline.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 943aa32cc1bc..788a391657a5 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2871,6 +2871,13 @@ int dl_task_can_attach(struct task_struct *p,
 	bool overflow;
 	int ret;
 
+	/*
+	 * The cpuset has no cpus assigned, so the thread will not
+	 * change its affinity.
+	 */
+	if (cpumask_empty(cs_cpus_allowed))
+		return 0;
+
 	/*
 	 * The task is not moving to another root domain, so it is
 	 * already accounted.
-- 
2.29.2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 4/6] sched/deadline: Block DL tasks on non-exclusive cpuset if bandwitdh control is enable
  2021-01-12 15:53 [PATCH 0/6] sched/deadline: cpuset task acceptance review Daniel Bristot de Oliveira
                   ` (2 preceding siblings ...)
  2021-01-12 15:53 ` [PATCH 3/6] sched/deadline: Allow DL tasks on empty (cgroup v2) cpusets Daniel Bristot de Oliveira
@ 2021-01-12 15:53 ` Daniel Bristot de Oliveira
  2021-01-14 15:51   ` Dietmar Eggemann
  2021-01-12 15:53 ` [PATCH 5/6] sched/deadline: Add helpers to get the correct root domain/span/dl_bw Daniel Bristot de Oliveira
  2021-01-12 15:53 ` [PATCH 6/6] sched/deadline: Fixes cpu/rd/dl_bw references for suspended tasks Daniel Bristot de Oliveira
  5 siblings, 1 reply; 15+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-01-12 15:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Marco Perronet, Daniel Bristot de Oliveira, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Li Zefan, Tejun Heo,
	Johannes Weiner, Valentin Schneider, cgroups

The current SCHED_DEADLINE design supports only global scheduler,
or variants of it, i.e., clustered and partitioned, via cpuset config.
To enable the partitioning of a system with clusters of CPUs, the
documentation advises the usage of exclusive cpusets, creating an
exclusive root_domain for the cpuset.

Attempts to change the cpu affinity of a thread to a cpu mask different
from the root domain results in an error. For instance:

----- %< -----
[root@x1 linux]# chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 10000 &
[1] 69020
[root@x1 linux]# taskset -p -c 0 69020
pid 69020's current affinity list: 0-7
taskset: failed to set pid 69020's affinity: Device or resource busy
----- >% -----

However, such restriction can be bypassed by disabling the
SCHED_DEADLINE admission test, under the assumption that
the user is aware of the implications of such a decision.

However, Marco Perronet noticed that it was possible to
by-pass this mechanism because no restriction is currently
imposed by the cpuset mechanism.

For instance, this script:
----- %< -----
#!/bin/bash

# Enter on the cgroup directory
cd /sys/fs/cgroup/

# Check it if is cgroup v2 and enable cpuset
if [ -e cgroup.subtree_control ]; then
	# Enable cpuset controller on cgroup v2
	echo +cpuset > cgroup.subtree_control
fi

echo LOG: create a cpuset and assigned the CPU 0 to it
# Create cpuset groups
rmdir dl-group &> /dev/null
mkdir dl-group

# Restrict the task to the CPU 0
echo 0 > dl-group/cpuset.mems
echo 0 > dl-group/cpuset.cpus

# Place a task in the root cgroup
echo LOG: dispatching the first DL task
chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 100 &
ROOT_PID="$!"
ROOT_ALLOWED=`cat /proc/$ROOT_PID/status | grep Cpus_allowed_list | awk '{print $2}'`

# Disapatch another task in the root cgroup, to move it later.
echo LOG: dispatching the second DL task
chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 100 &
CPUSET_PID="$!"

# let them settle down
sleep 1

# Assign the second task to the cgroup
echo LOG: moving the second DL task to the cpuset
echo "$CPUSET_PID" > dl-group/cgroup.procs 2> /dev/null
ACCEPTED=$?
CPUSET_ALLOWED=`cat /proc/$CPUSET_PID/status | grep Cpus_allowed_list | awk '{print $2}'`

if [ $ACCEPTED == 0 ]; then
	echo FAIL: a DL task was accepted on a non-exclusive cpuset
else
	echo PASS: DL task was rejected on a non-exclusive cpuset
fi

if [ $ROOT_ALLOWED == $CPUSET_ALLOWED ]; then
	echo PASS: the affinity did not change: $CPUSET_ALLOWED == $ROOT_ALLOWED
else
	echo FAIL: the cpu affinity is different: $CPUSET_ALLOWED == $ROOT_ALLOWED
fi

# Just ignore the clean up
exec > /dev/null 2>&1
kill -9 $CPUSET_PID
kill -9 $ROOT_PID
rmdir dl-group
----- >% -----

Shows these results:
----- %< -----
LOG: create a cpuset and assigned the CPU 0 to it
LOG: dispatching the first DL task
LOG: dispatching the second DL task
LOG: moving the second DL task to the cpuset
FAIL: a DL task was accepted on a non-exclusive cpuset
FAIL: the cpu affinity is different: 0 == 0-3
----- >% -----

This result is a problem because the two tasks have a different
cpu mask, but they end up sharing the cpu 0, which is something
not supported in the current SCHED_DEADLINE designed (APA - Arbitrary
Processor Affinities).

To avoid such scenario, the correct action to be taken is rejecting
the attach of SCHED_DEADLINE thread to a non-exclusive cpuset.

With the proposed patch in place, the script above returns:

----- %< -----
LOG: create a cpuset and assigned the CPU 0 to it
LOG: dispatching the first DL task
LOG: dispatching the second DL task
LOG: moving the second DL task to the cpuset
PASS: DL task was rejected on a non-exclusive cpuset
PASS: the affinity did not change: 0-3 == 0-3
----- >% -----

Still, likewise for taskset, this restriction can be bypassed by
disabling the admission test, i.e.:

# sysctl -w kernel.sched_rt_runtime_us=-1

and work at their own risk.

Reported-by: Marco Perronet <perronet@mpi-sws.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 kernel/sched/deadline.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 788a391657a5..c221e14d5b86 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2878,6 +2878,13 @@ int dl_task_can_attach(struct task_struct *p,
 	if (cpumask_empty(cs_cpus_allowed))
 		return 0;
 
+	/*
+	 * Do not allow moving tasks to non-exclusive cpusets
+	 * if bandwidth control is enabled.
+	 */
+	if (dl_bandwidth_enabled() && !exclusive)
+		return -EBUSY;
+
 	/*
 	 * The task is not moving to another root domain, so it is
 	 * already accounted.
-- 
2.29.2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 5/6] sched/deadline: Add helpers to get the correct root domain/span/dl_bw
  2021-01-12 15:53 [PATCH 0/6] sched/deadline: cpuset task acceptance review Daniel Bristot de Oliveira
                   ` (3 preceding siblings ...)
  2021-01-12 15:53 ` [PATCH 4/6] sched/deadline: Block DL tasks on non-exclusive cpuset if bandwitdh control is enable Daniel Bristot de Oliveira
@ 2021-01-12 15:53 ` Daniel Bristot de Oliveira
  2021-01-12 15:53 ` [PATCH 6/6] sched/deadline: Fixes cpu/rd/dl_bw references for suspended tasks Daniel Bristot de Oliveira
  5 siblings, 0 replies; 15+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-01-12 15:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Marco Perronet, Daniel Bristot de Oliveira, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Li Zefan, Tejun Heo,
	Johannes Weiner, Valentin Schneider, cgroups

task_cpu() is often used on SCHED_DEADLINE to access the root_domain
and associated information. However, the task_cpu() does not always
reflect the correct CPU. The reason being is shown in the code:

From kernel/sched/core:
----- %< -----
 * p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }:
 *
 *   is set by activate_task() and cleared by deactivate_task(), under
 *   rq->lock. Non-zero indicates the task is runnable, the special
 *   ON_RQ_MIGRATING state is used for migration without holding both
 *   rq->locks. It indicates task_cpu() is not stable, see task_rq_lock().
[...]
 * task_cpu(p): is changed by set_task_cpu(), the rules are:
 *
 *  - Don't call set_task_cpu() on a blocked task:
 *
 *    We don't care what CPU we're not running on, this simplifies hotplug,
 *    the CPU assignment of blocked tasks isn't required to be valid.
----- >% -----

So, a sleeping task will not have its task_cpu() stable, and this is
causing problems on SCHED_DEADLINE.

In preparation for fixing this problems, we need to add helper functions
that return the dl_task_cpu, root_domain, and "root" dl_bw that
reflects the current placement/affinity of the task.

Note that these functions are only required on the code path that
can happen when the task is not queued.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 kernel/sched/sched.h | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 54881d99cebd..5f3f3cb5a6ff 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2393,6 +2393,37 @@ void __dl_update(struct dl_bw *dl_b, s64 bw)
 		rq->dl.extra_bw += bw;
 	}
 }
+
+static inline
+int dl_task_cpu(struct task_struct *p)
+{
+	/*
+	 * We can only rely on task_cpu() of runnable tasks.
+	 */
+	if (p->on_rq)
+		return task_cpu(p);
+
+	/*
+	 * The task_cpu() of non-runnable task migth be pointing a CPU
+	 * it cannot run anymore (see set_task_cpu()). Hence, let's
+	 * get a possible cpu from the current cpu_mask.
+	 */
+	return cpumask_any(p->cpus_ptr);
+
+}
+
+static inline
+struct root_domain *dl_task_rd(struct task_struct *p)
+{
+	int cpu = dl_task_cpu(p);
+	return cpu_rq(cpu)->rd;
+}
+
+static inline
+struct dl_bw *dl_task_root_bw(struct task_struct *p)
+{
+	return &dl_task_rd(p)->dl_bw;
+}
 #else
 static inline
 void __dl_update(struct dl_bw *dl_b, s64 bw)
@@ -2401,6 +2432,18 @@ void __dl_update(struct dl_bw *dl_b, s64 bw)
 
 	dl->extra_bw += bw;
 }
+
+static inline
+int dl_task_cpu(struct task_struct *p)
+{
+	return 0;
+}
+
+static inline
+struct dl_bw *dl_task_root_bw(struct task_struct *p)
+{
+	return &task_rq(p)->dl.dl_bw;
+}
 #endif
 
 
-- 
2.29.2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 6/6] sched/deadline: Fixes cpu/rd/dl_bw references for suspended tasks
  2021-01-12 15:53 [PATCH 0/6] sched/deadline: cpuset task acceptance review Daniel Bristot de Oliveira
                   ` (4 preceding siblings ...)
  2021-01-12 15:53 ` [PATCH 5/6] sched/deadline: Add helpers to get the correct root domain/span/dl_bw Daniel Bristot de Oliveira
@ 2021-01-12 15:53 ` Daniel Bristot de Oliveira
  2021-01-15 14:36   ` Dietmar Eggemann
  5 siblings, 1 reply; 15+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-01-12 15:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Marco Perronet, Daniel Bristot de Oliveira, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Li Zefan, Tejun Heo,
	Johannes Weiner, Valentin Schneider, cgroups

__set_cpus_allowed_ptr() migrates running or runnable, setting the
task's cpu accordingly. The CPU is not set when the task is not
runnable because of complications on the hotplug code. The
task cpu will be updated in the next wakeup anyway.

However, this creates a problem for the usage of task_cpu(p), which
might point the task to a CPU in which it cannot run, or worse,
a runqueue/root_domain it does not belong to, causing some
odd errors. For example, the script below shows that a sleeping
task cannot become SCHED_DEADLINE if they moved to another (exclusive)
cpuset:

----- %< -----
  #!/bin/bash
  # Enter on the cgroup directory
  cd /sys/fs/cgroup/

  # Check it if is cgroup v2 and enable cpuset
  if [ -e cgroup.subtree_control ]; then
  	# Enable cpuset controller on cgroup v2
  	echo +cpuset > cgroup.subtree_control
  fi

  echo LOG: create an exclusive cpuset and assigned the CPU 0 to it
  # Create cpuset groups
  rmdir dl-group &> /dev/null
  mkdir dl-group

  # Restrict the task to the CPU 0
  echo 0 > dl-group/cpuset.mems
  echo 0 > dl-group/cpuset.cpus
  echo root >  dl-group/cpuset.cpus.partition

  echo LOG: dispatching a regular task
  sleep 100 &
  CPUSET_PID="$!"

  # let it settle down
  sleep 1

  # Assign the second task to the cgroup
  echo LOG: moving the second DL task to the cpuset
  echo "$CPUSET_PID" > dl-group/cgroup.procs 2> /dev/null

  CPUSET_ALLOWED=`cat /proc/$CPUSET_PID/status | grep Cpus_allowed_list | awk '{print $2}'`

  chrt -p -d --sched-period 1000000000 --sched-runtime 100000000 0 $CPUSET_PID
  ACCEPTED=$?

  if [ $ACCEPTED == 0 ]; then
  	echo PASS: the task became DL
  else
  	echo FAIL: the task was rejected as DL
  fi

  # Just ignore the clean up
  exec > /dev/null 2>&1
  kill -9 $CPUSET_PID
  kill -9 $ROOT_PID
  rmdir dl-group
----- >% -----

Long story short: the sleep task is (not runnable) on a cpu != 0. After
moving to a cpuset with only the CPU 0, task_cpu() returns a cpu that
does not belong to the cpuset the task is in, and the task is rejected
in this if:

----- %< -----
__sched_setscheduler():
[...]
        rq = task_rq_lock(p, &rf); <-- uses task_cpu(), that points to
				   <-- the old cpu.
[...]
                if (dl_bandwidth_enabled() && dl_policy(policy) &&
                                !(attr->sched_flags & SCHED_FLAG_SUGOV)) {
                        cpumask_t *span = rq->rd->span;        <--- wrong rd!

                        /*
                         * Don't allow tasks with an affinity mask smaller than
                         * the entire root_domain to become SCHED_DEADLINE. We
                         * will also fail if there's no bandwidth available.
                         */
                        if (!cpumask_subset(span, p->cpus_ptr) ||
                            rq->rd->dl_bw.bw == 0) {
                                retval = -EPERM;   <--- returns here.
                                goto unlock;
                        }
                }
----- >% -----

Because the rq, and so the root domain, corresponding to the ones of
the CPU in which the sleep command went to... sleep, not the ones
it will run in the next wakeup because of its affinity.

To avoid this problem, use the dl_task* helpers that return the task
cpu, root domain, and the "root" dl_bw, aware of the status of
task->cpu.

Reported-by: Marco Perronet <perronet@mpi-sws.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 kernel/sched/core.c     | 6 +++---
 kernel/sched/deadline.c | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5961a97541c2..3c2775e6869f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5905,15 +5905,15 @@ static int __sched_setscheduler(struct task_struct *p,
 #ifdef CONFIG_SMP
 		if (dl_bandwidth_enabled() && dl_policy(policy) &&
 				!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
-			cpumask_t *span = rq->rd->span;
+			struct root_domain *rd = dl_task_rd(p);
 
 			/*
 			 * Don't allow tasks with an affinity mask smaller than
 			 * the entire root_domain to become SCHED_DEADLINE. We
 			 * will also fail if there's no bandwidth available.
 			 */
-			if (!cpumask_subset(span, p->cpus_ptr) ||
-			    rq->rd->dl_bw.bw == 0) {
+			if (!cpumask_subset(rd->span, p->cpus_ptr) ||
+			    rd->dl_bw.bw == 0) {
 				retval = -EPERM;
 				goto unlock;
 			}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c221e14d5b86..1f6264cb8867 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2678,8 +2678,8 @@ int sched_dl_overflow(struct task_struct *p, int policy,
 	u64 period = attr->sched_period ?: attr->sched_deadline;
 	u64 runtime = attr->sched_runtime;
 	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
-	int cpus, err = -1, cpu = task_cpu(p);
-	struct dl_bw *dl_b = dl_bw_of(cpu);
+	int cpus, err = -1, cpu = dl_task_cpu(p);
+	struct dl_bw *dl_b = dl_task_root_bw(p);
 	unsigned long cap;
 
 	if (attr->sched_flags & SCHED_FLAG_SUGOV)
-- 
2.29.2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/6] sched/deadline: Allow DL tasks on empty (cgroup v2) cpusets
  2021-01-12 15:53 ` [PATCH 3/6] sched/deadline: Allow DL tasks on empty (cgroup v2) cpusets Daniel Bristot de Oliveira
@ 2021-01-14 12:12   ` Dietmar Eggemann
  2021-01-18 12:51     ` Daniel Bristot de Oliveira
  2021-01-22  8:08   ` Juri Lelli
  1 sibling, 1 reply; 15+ messages in thread
From: Dietmar Eggemann @ 2021-01-14 12:12 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, linux-kernel
  Cc: Marco Perronet, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Li Zefan, Tejun Heo, Johannes Weiner, Valentin Schneider,
	cgroups

On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote:
> cgroups v2 allows the cpuset controller to be enabled/disabled on
> demand. On Fedora 32, cpuset is disabled by default. To enable it,
> a user needs to:
> 
>   # cd /sys/fs/cgroup/
>   # echo +cpuset > cgroup.subtree_control
> 
> Existing cgroups will expose the cpuset interface (e.g., cpuset.cpus
> file). By default, cpuset.cpus has no CPU assigned, which means that
> existing tasks will move to a cpuset without cpus.
> 
> With that in mind, look what happens if a SCHED_DEADLINE task exists
> on any cgroup (user.slice by default on Fedora):
> 
> ----- %< -----
>   # chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 100 &

Like you mentioned above, to see the issue the DL task has to be moved
into the cgroup (e.g. user.slice) here:

echo $PID > /sys/fs/cgroup/user.slice/cgroup.procs

>   # cd /sys/fs/cgroup/
>   # echo '+cpuset' > cgroup.subtree_control
>   [   65.384041] BUG: unable to handle page fault for address: ffffffffb720f7e0
>   [   65.384551] #PF: supervisor read access in kernel mode
>   [   65.384923] #PF: error_code(0x0000) - not-present page
>   [   65.385298] PGD 61a15067 P4D 61a15067 PUD 61a16063 PMD 800fffff9ddff062
>   [   65.385781] Oops: 0000 [#1] SMP PTI
>   [   65.386042] CPU: 0 PID: 799 Comm: sh Not tainted 5.10.0-rc3 #1
>   [   65.386461] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
>   [   65.387077] RIP: 0010:dl_task_can_attach+0x40/0x250
>   [   65.387429] Code: 54 55 53 48 83 ec 18 48 89 3c 24 bf ff ff ff ff e8 05 a2 52 00
>                        4c 63 f0 48 c7 c5 00 9e 02 00 4a 8b 04 f5 00 09 47 b6 48 89 ea
>                        <4c> 8b a4 10 e0 09 00 00 49 8d 44 24 40 48 89 c7 48 89 44 24
>                        08 e8
>   [   65.388768] RSP: 0018:ffffaee8c056fcd8 EFLAGS: 00010283
>   [   65.389148] RAX: ffffffffb71e5000 RBX: ffffaee8c056fdd0 RCX: 0000000000000040
>   [   65.389661] RDX: 0000000000029e00 RSI: ffff9db202534e48 RDI: ffffffffb6d3a3e0
>   [   65.390174] RBP: 0000000000029e00 R08: 0000000000000000 R09: 0000000000000004
>   [   65.390686] R10: 0000000000000001 R11: 00000000ffa6fbff R12: ffffaee8c056fbf0
>   [   65.391196] R13: ffff9db2024e1400 R14: 0000000000000004 R15: ffff9db20ebb31e0
>   [   65.391710] FS:  00007f6df41b1740(0000) GS:ffff9db377c00000(0000) knlGS:0000000000000000
>   [   65.392289] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [   65.392705] CR2: ffffffffb720f7e0 CR3: 000000010680a003 CR4: 0000000000370ef0
>   [   65.393220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   [   65.393732] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>   [   65.394244] Call Trace:
>   [   65.394437]  cpuset_can_attach+0x8b/0x110
>   [   65.394732]  cgroup_migrate_execute+0x70/0x430
>   [   65.395057]  cgroup_update_dfl_csses+0x222/0x230
>   [   65.395392]  cgroup_subtree_control_write+0x2c6/0x3c0
>   [   65.395759]  kernfs_fop_write+0xce/0x1b0
>   [   65.396048]  vfs_write+0xc2/0x230
>   [   65.396291]  ksys_write+0x4f/0xc0
>   [   65.396533]  do_syscall_64+0x33/0x40
>   [   65.396797]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>   [   65.397166] RIP: 0033:0x7f6df42a6537
>   [   65.397428] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f
>                        1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05
>                        <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89
>                        74 24
>   [   65.398766] RSP: 002b:00007ffee4128018 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>   [   65.399838] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f6df42a6537
>   [   65.400923] RDX: 0000000000000008 RSI: 000055b3f7e549e0 RDI: 0000000000000001
>   [   65.402003] RBP: 000055b3f7e549e0 R08: 000000000000000a R09: 0000000000000007
>   [   65.403082] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000000008
>   [   65.404156] R13: 00007f6df4378500 R14: 0000000000000008 R15: 00007f6df4378700
>   [   65.405218] Modules linked in: <lots of modules>
>   [   65.414172] CR2: ffffffffb720f7e0
>   [   65.415117] ---[ end trace 2dbff1a688549e65 ]---
> ----- >% -----
> 
> That happens because on dl_task_can_attach():
>         dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed);
> 
> returns a non active cpu.

Since cs_cpus_allowed is empty dest_cpu should be an invalid CPU (>=
nr_cpu_ids) here.

> Initially, I thought about returning an error and blocking the
> operation. However, that is indeed not needed. The cpuset without
> CPUs assigned will be a non-root cpuset, hence its cpu mask will
> be the same as the root one. So, the bandwidth was already accounted,
> and the task can proceed.

LGTM.

After the '/sys/fs/cgroup# echo '+cpuset' > cgroup.subtree_control':

root's cpuset.cpus.effective == user.slice's cpuset.cpus.effective

> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> ---
>  kernel/sched/deadline.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 943aa32cc1bc..788a391657a5 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2871,6 +2871,13 @@ int dl_task_can_attach(struct task_struct *p,
>  	bool overflow;
>  	int ret;
>  
> +	/*
> +	 * The cpuset has no cpus assigned, so the thread will not
> +	 * change its affinity.
> +	 */
> +	if (cpumask_empty(cs_cpus_allowed))
> +		return 0;
> +
>  	/*
>  	 * The task is not moving to another root domain, so it is
>  	 * already accounted.
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/6] sched/deadline: Block DL tasks on non-exclusive cpuset if bandwitdh control is enable
  2021-01-12 15:53 ` [PATCH 4/6] sched/deadline: Block DL tasks on non-exclusive cpuset if bandwitdh control is enable Daniel Bristot de Oliveira
@ 2021-01-14 15:51   ` Dietmar Eggemann
  2021-01-19  9:41     ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 15+ messages in thread
From: Dietmar Eggemann @ 2021-01-14 15:51 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, linux-kernel
  Cc: Marco Perronet, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Li Zefan, Tejun Heo, Johannes Weiner, Valentin Schneider,
	cgroups

On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote:
> The current SCHED_DEADLINE design supports only global scheduler,
> or variants of it, i.e., clustered and partitioned, via cpuset config.
> To enable the partitioning of a system with clusters of CPUs, the
> documentation advises the usage of exclusive cpusets, creating an
> exclusive root_domain for the cpuset.
> 
> Attempts to change the cpu affinity of a thread to a cpu mask different
> from the root domain results in an error. For instance:

[...]

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 788a391657a5..c221e14d5b86 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2878,6 +2878,13 @@ int dl_task_can_attach(struct task_struct *p,
>  	if (cpumask_empty(cs_cpus_allowed))
>  		return 0;
>  
> +	/*
> +	 * Do not allow moving tasks to non-exclusive cpusets
> +	 * if bandwidth control is enabled.
> +	 */
> +	if (dl_bandwidth_enabled() && !exclusive)
> +		return -EBUSY;
> +
>  	/*
>  	 * The task is not moving to another root domain, so it is
>  	 * already accounted.
> 

But doesn't this mean you only have to set this cgroup exclusive/root to
run into the same issue:

with this patch:

cgroupv1:

root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 1000000000
--sched-runtime 100000000 0 sleep 500 &
[1] 1668
root@juno:/sys/fs/cgroup/cpuset# PID1=$!

root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 1000000000
--sched-runtime 100000000 0 sleep 500 &
[2] 1669
root@juno:/sys/fs/cgroup/cpuset# PID2=$!

root@juno:/sys/fs/cgroup/cpuset# mkdir A

root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.mems
root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.cpus

root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs
-bash: echo: write error: Device or resource busy

root@juno:/sys/fs/cgroup/cpuset# echo 1 > ./A/cpuset.cpu_exclusive

root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs

root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID1/status | grep
Cpus_allowed_list | awk '{print $2}'
0-5
root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID2/status | grep
Cpus_allowed_list | awk '{print $2}'
0

cgroupv2:

root@juno:/sys/fs/cgroup# echo +cpuset > cgroup.subtree_control

root@juno:/sys/fs/cgroup# chrt -d --sched-period 1000000000
--sched-runtime 100000000 0 sleep 500 &
[1] 1687
root@juno:/sys/fs/cgroup# PID1=$!

root@juno:/sys/fs/cgroup# chrt -d --sched-period 1000000000
--sched-runtime 100000000 0 sleep 500 &
[2] 1688
root@juno:/sys/fs/cgroup# PID2=$!

root@juno:/sys/fs/cgroup# mkdir A

root@juno:/sys/fs/cgroup# echo 0 > ./A/cpuset.mems
root@juno:/sys/fs/cgroup# echo 0 > ./A/cpuset.cpus

root@juno:/sys/fs/cgroup# echo $PID2 > ./A/cgroup.procs
-bash: echo: write error: Device or resource busy

root@juno:/sys/fs/cgroup# echo root > ./A/cpuset.cpus.partition

root@juno:/sys/fs/cgroup# echo $PID2 > ./A/cgroup.procs

root@juno:/sys/fs/cgroup# cat /proc/$PID1/status | grep
Cpus_allowed_list | awk '{print $2}'
0-5
root@juno:/sys/fs/cgroup# cat /proc/$PID2/status | grep
Cpus_allowed_list | awk '{print $2}'
0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 6/6] sched/deadline: Fixes cpu/rd/dl_bw references for suspended tasks
  2021-01-12 15:53 ` [PATCH 6/6] sched/deadline: Fixes cpu/rd/dl_bw references for suspended tasks Daniel Bristot de Oliveira
@ 2021-01-15 14:36   ` Dietmar Eggemann
  2021-01-18 13:17     ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 15+ messages in thread
From: Dietmar Eggemann @ 2021-01-15 14:36 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, linux-kernel
  Cc: Marco Perronet, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Li Zefan, Tejun Heo, Johannes Weiner, Valentin Schneider,
	cgroups

On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote:

[...]

> ----- %< -----
>   #!/bin/bash
>   # Enter on the cgroup directory
>   cd /sys/fs/cgroup/
> 
>   # Check it if is cgroup v2 and enable cpuset
>   if [ -e cgroup.subtree_control ]; then
>   	# Enable cpuset controller on cgroup v2
>   	echo +cpuset > cgroup.subtree_control
>   fi
> 
>   echo LOG: create an exclusive cpuset and assigned the CPU 0 to it
>   # Create cpuset groups
>   rmdir dl-group &> /dev/null
>   mkdir dl-group
> 
>   # Restrict the task to the CPU 0
>   echo 0 > dl-group/cpuset.mems
>   echo 0 > dl-group/cpuset.cpus
>   echo root >  dl-group/cpuset.cpus.partition
> 
>   echo LOG: dispatching a regular task
>   sleep 100 &
>   CPUSET_PID="$!"
> 
>   # let it settle down
>   sleep 1
> 
>   # Assign the second task to the cgroup

There is only one task 'CPUSET_PID' involved here?

>   echo LOG: moving the second DL task to the cpuset
>   echo "$CPUSET_PID" > dl-group/cgroup.procs 2> /dev/null
> 
>   CPUSET_ALLOWED=`cat /proc/$CPUSET_PID/status | grep Cpus_allowed_list | awk '{print $2}'`
> 
>   chrt -p -d --sched-period 1000000000 --sched-runtime 100000000 0 $CPUSET_PID
>   ACCEPTED=$?
> 
>   if [ $ACCEPTED == 0 ]; then
>   	echo PASS: the task became DL
>   else
>   	echo FAIL: the task was rejected as DL
>   fi

[...]

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5961a97541c2..3c2775e6869f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5905,15 +5905,15 @@ static int __sched_setscheduler(struct task_struct *p,
>  #ifdef CONFIG_SMP
>  		if (dl_bandwidth_enabled() && dl_policy(policy) &&
>  				!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
> -			cpumask_t *span = rq->rd->span;
> +			struct root_domain *rd = dl_task_rd(p);
>  
>  			/*
>  			 * Don't allow tasks with an affinity mask smaller than
>  			 * the entire root_domain to become SCHED_DEADLINE. We
>  			 * will also fail if there's no bandwidth available.
>  			 */
> -			if (!cpumask_subset(span, p->cpus_ptr) ||
> -			    rq->rd->dl_bw.bw == 0) {
> +			if (!cpumask_subset(rd->span, p->cpus_ptr) ||
> +			    rd->dl_bw.bw == 0) {
>  				retval = -EPERM;
>  				goto unlock;
>  			}
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index c221e14d5b86..1f6264cb8867 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2678,8 +2678,8 @@ int sched_dl_overflow(struct task_struct *p, int policy,
>  	u64 period = attr->sched_period ?: attr->sched_deadline;
>  	u64 runtime = attr->sched_runtime;
>  	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
> -	int cpus, err = -1, cpu = task_cpu(p);
> -	struct dl_bw *dl_b = dl_bw_of(cpu);
> +	int cpus, err = -1, cpu = dl_task_cpu(p);
> +	struct dl_bw *dl_b = dl_task_root_bw(p);
>  	unsigned long cap;
>  
>  	if (attr->sched_flags & SCHED_FLAG_SUGOV)
> 

Wouldn't it be sufficient to just introduce dl_task_cpu() and use the
correct cpu to get rd->span or struct dl_bw in __sched_setscheduler()
-> sched_dl_overflow()?

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5961a97541c2..0573f676696a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5905,7 +5905,8 @@ static int __sched_setscheduler(struct task_struct *p,
 #ifdef CONFIG_SMP
                if (dl_bandwidth_enabled() && dl_policy(policy) &&
                                !(attr->sched_flags & SCHED_FLAG_SUGOV)) {
-                       cpumask_t *span = rq->rd->span;
+                       int cpu = dl_task_cpu(p);
+                       cpumask_t *span = cpu_rq(cpu)->rd->span;
 
                        /*
                         * Don't allow tasks with an affinity mask smaller than
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c221e14d5b86..308ecaaf3d28 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2678,7 +2678,7 @@ int sched_dl_overflow(struct task_struct *p, int policy,
        u64 period = attr->sched_period ?: attr->sched_deadline;
        u64 runtime = attr->sched_runtime;
        u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
-       int cpus, err = -1, cpu = task_cpu(p);
+       int cpus, err = -1, cpu = dl_task_cpu(p);
        struct dl_bw *dl_b = dl_bw_of(cpu);
        unsigned long cap;

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/6] sched/deadline: Allow DL tasks on empty (cgroup v2) cpusets
  2021-01-14 12:12   ` Dietmar Eggemann
@ 2021-01-18 12:51     ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 15+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-01-18 12:51 UTC (permalink / raw)
  To: Dietmar Eggemann, linux-kernel
  Cc: Marco Perronet, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Li Zefan, Tejun Heo, Johannes Weiner, Valentin Schneider,
	cgroups

On 1/14/21 1:12 PM, Dietmar Eggemann wrote:
> On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote:
>> cgroups v2 allows the cpuset controller to be enabled/disabled on
>> demand. On Fedora 32, cpuset is disabled by default. To enable it,
>> a user needs to:
>>
>>   # cd /sys/fs/cgroup/
>>   # echo +cpuset > cgroup.subtree_control
>>
>> Existing cgroups will expose the cpuset interface (e.g., cpuset.cpus
>> file). By default, cpuset.cpus has no CPU assigned, which means that
>> existing tasks will move to a cpuset without cpus.
>>
>> With that in mind, look what happens if a SCHED_DEADLINE task exists
>> on any cgroup (user.slice by default on Fedora):
>>
>> ----- %< -----
>>   # chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 100 &
> 
> Like you mentioned above, to see the issue the DL task has to be moved
> into the cgroup (e.g. user.slice) here:
> 
> echo $PID > /sys/fs/cgroup/user.slice/cgroup.procs

Ops, I "forgot" to add this step because Fedora/systemd does it by default.

I will add this to the log.

>>   # cd /sys/fs/cgroup/
>>   # echo '+cpuset' > cgroup.subtree_control
>>   [   65.384041] BUG: unable to handle page fault for address: ffffffffb720f7e0
>>   [   65.384551] #PF: supervisor read access in kernel mode
>>   [   65.384923] #PF: error_code(0x0000) - not-present page
>>   [   65.385298] PGD 61a15067 P4D 61a15067 PUD 61a16063 PMD 800fffff9ddff062
>>   [   65.385781] Oops: 0000 [#1] SMP PTI
>>   [   65.386042] CPU: 0 PID: 799 Comm: sh Not tainted 5.10.0-rc3 #1
>>   [   65.386461] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
>>   [   65.387077] RIP: 0010:dl_task_can_attach+0x40/0x250
>>   [   65.387429] Code: 54 55 53 48 83 ec 18 48 89 3c 24 bf ff ff ff ff e8 05 a2 52 00
>>                        4c 63 f0 48 c7 c5 00 9e 02 00 4a 8b 04 f5 00 09 47 b6 48 89 ea
>>                        <4c> 8b a4 10 e0 09 00 00 49 8d 44 24 40 48 89 c7 48 89 44 24
>>                        08 e8
>>   [   65.388768] RSP: 0018:ffffaee8c056fcd8 EFLAGS: 00010283
>>   [   65.389148] RAX: ffffffffb71e5000 RBX: ffffaee8c056fdd0 RCX: 0000000000000040
>>   [   65.389661] RDX: 0000000000029e00 RSI: ffff9db202534e48 RDI: ffffffffb6d3a3e0
>>   [   65.390174] RBP: 0000000000029e00 R08: 0000000000000000 R09: 0000000000000004
>>   [   65.390686] R10: 0000000000000001 R11: 00000000ffa6fbff R12: ffffaee8c056fbf0
>>   [   65.391196] R13: ffff9db2024e1400 R14: 0000000000000004 R15: ffff9db20ebb31e0
>>   [   65.391710] FS:  00007f6df41b1740(0000) GS:ffff9db377c00000(0000) knlGS:0000000000000000
>>   [   65.392289] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   [   65.392705] CR2: ffffffffb720f7e0 CR3: 000000010680a003 CR4: 0000000000370ef0
>>   [   65.393220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>   [   65.393732] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>   [   65.394244] Call Trace:
>>   [   65.394437]  cpuset_can_attach+0x8b/0x110
>>   [   65.394732]  cgroup_migrate_execute+0x70/0x430
>>   [   65.395057]  cgroup_update_dfl_csses+0x222/0x230
>>   [   65.395392]  cgroup_subtree_control_write+0x2c6/0x3c0
>>   [   65.395759]  kernfs_fop_write+0xce/0x1b0
>>   [   65.396048]  vfs_write+0xc2/0x230
>>   [   65.396291]  ksys_write+0x4f/0xc0
>>   [   65.396533]  do_syscall_64+0x33/0x40
>>   [   65.396797]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>   [   65.397166] RIP: 0033:0x7f6df42a6537
>>   [   65.397428] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f
>>                        1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05
>>                        <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89
>>                        74 24
>>   [   65.398766] RSP: 002b:00007ffee4128018 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>   [   65.399838] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f6df42a6537
>>   [   65.400923] RDX: 0000000000000008 RSI: 000055b3f7e549e0 RDI: 0000000000000001
>>   [   65.402003] RBP: 000055b3f7e549e0 R08: 000000000000000a R09: 0000000000000007
>>   [   65.403082] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000000008
>>   [   65.404156] R13: 00007f6df4378500 R14: 0000000000000008 R15: 00007f6df4378700
>>   [   65.405218] Modules linked in: <lots of modules>
>>   [   65.414172] CR2: ffffffffb720f7e0
>>   [   65.415117] ---[ end trace 2dbff1a688549e65 ]---
>> ----- >% -----
>>
>> That happens because on dl_task_can_attach():
>>         dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed);
>>
>> returns a non active cpu.
> 
> Since cs_cpus_allowed is empty dest_cpu should be an invalid CPU (>=
> nr_cpu_ids) here.

Correct, I will change the "non active" to "an invalid CPU (>= > nr_cpu_ids)"

>> Initially, I thought about returning an error and blocking the
>> operation. However, that is indeed not needed. The cpuset without
>> CPUs assigned will be a non-root cpuset, hence its cpu mask will
>> be the same as the root one. So, the bandwidth was already accounted,
>> and the task can proceed.
> 
> LGTM.
> 
> After the '/sys/fs/cgroup# echo '+cpuset' > cgroup.subtree_control':
> 
> root's cpuset.cpus.effective == user.slice's cpuset.cpus.effective

Good, thanks!

-- Daniel

>> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ben Segall <bsegall@google.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
>> Cc: Li Zefan <lizefan@huawei.com>
>> Cc: Tejun Heo <tj@kernel.org>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Valentin Schneider <valentin.schneider@arm.com>
>> Cc: linux-kernel@vger.kernel.org
>> Cc: cgroups@vger.kernel.org
>> ---
>>  kernel/sched/deadline.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 943aa32cc1bc..788a391657a5 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -2871,6 +2871,13 @@ int dl_task_can_attach(struct task_struct *p,
>>  	bool overflow;
>>  	int ret;
>>  
>> +	/*
>> +	 * The cpuset has no cpus assigned, so the thread will not
>> +	 * change its affinity.
>> +	 */
>> +	if (cpumask_empty(cs_cpus_allowed))
>> +		return 0;
>> +
>>  	/*
>>  	 * The task is not moving to another root domain, so it is
>>  	 * already accounted.
>>
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 6/6] sched/deadline: Fixes cpu/rd/dl_bw references for suspended tasks
  2021-01-15 14:36   ` Dietmar Eggemann
@ 2021-01-18 13:17     ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 15+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-01-18 13:17 UTC (permalink / raw)
  To: Dietmar Eggemann, linux-kernel
  Cc: Marco Perronet, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Li Zefan, Tejun Heo, Johannes Weiner, Valentin Schneider,
	cgroups

On 1/15/21 3:36 PM, Dietmar Eggemann wrote:
> On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote:
> 
> [...]
> 
>> ----- %< -----
>>   #!/bin/bash
>>   # Enter on the cgroup directory
>>   cd /sys/fs/cgroup/
>>
>>   # Check it if is cgroup v2 and enable cpuset
>>   if [ -e cgroup.subtree_control ]; then
>>   	# Enable cpuset controller on cgroup v2
>>   	echo +cpuset > cgroup.subtree_control
>>   fi
>>
>>   echo LOG: create an exclusive cpuset and assigned the CPU 0 to it
>>   # Create cpuset groups
>>   rmdir dl-group &> /dev/null
>>   mkdir dl-group
>>
>>   # Restrict the task to the CPU 0
>>   echo 0 > dl-group/cpuset.mems
>>   echo 0 > dl-group/cpuset.cpus
>>   echo root >  dl-group/cpuset.cpus.partition
>>
>>   echo LOG: dispatching a regular task
>>   sleep 100 &
>>   CPUSET_PID="$!"
>>
>>   # let it settle down
>>   sleep 1
>>
>>   # Assign the second task to the cgroup
> 
> There is only one task 'CPUSET_PID' involved here?

Ooops, yep, I will remove the "second" part.

>>   echo LOG: moving the second DL task to the cpuset
>>   echo "$CPUSET_PID" > dl-group/cgroup.procs 2> /dev/null
>>
>>   CPUSET_ALLOWED=`cat /proc/$CPUSET_PID/status | grep Cpus_allowed_list | awk '{print $2}'`
>>
>>   chrt -p -d --sched-period 1000000000 --sched-runtime 100000000 0 $CPUSET_PID
>>   ACCEPTED=$?
>>
>>   if [ $ACCEPTED == 0 ]; then
>>   	echo PASS: the task became DL
>>   else
>>   	echo FAIL: the task was rejected as DL
>>   fi
> 
> [...]
> 
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 5961a97541c2..3c2775e6869f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5905,15 +5905,15 @@ static int __sched_setscheduler(struct task_struct *p,
>>  #ifdef CONFIG_SMP
>>  		if (dl_bandwidth_enabled() && dl_policy(policy) &&
>>  				!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
>> -			cpumask_t *span = rq->rd->span;
>> +			struct root_domain *rd = dl_task_rd(p);
>>  
>>  			/*
>>  			 * Don't allow tasks with an affinity mask smaller than
>>  			 * the entire root_domain to become SCHED_DEADLINE. We
>>  			 * will also fail if there's no bandwidth available.
>>  			 */
>> -			if (!cpumask_subset(span, p->cpus_ptr) ||
>> -			    rq->rd->dl_bw.bw == 0) {
>> +			if (!cpumask_subset(rd->span, p->cpus_ptr) ||
>> +			    rd->dl_bw.bw == 0) {
>>  				retval = -EPERM;
>>  				goto unlock;
>>  			}
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index c221e14d5b86..1f6264cb8867 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -2678,8 +2678,8 @@ int sched_dl_overflow(struct task_struct *p, int policy,
>>  	u64 period = attr->sched_period ?: attr->sched_deadline;
>>  	u64 runtime = attr->sched_runtime;
>>  	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
>> -	int cpus, err = -1, cpu = task_cpu(p);
>> -	struct dl_bw *dl_b = dl_bw_of(cpu);
>> +	int cpus, err = -1, cpu = dl_task_cpu(p);
>> +	struct dl_bw *dl_b = dl_task_root_bw(p);
>>  	unsigned long cap;
>>  
>>  	if (attr->sched_flags & SCHED_FLAG_SUGOV)
>>
> 
> Wouldn't it be sufficient to just introduce dl_task_cpu() and use the
> correct cpu to get rd->span or struct dl_bw in __sched_setscheduler()
> -> sched_dl_overflow()?

While I think that having the dl_task_rd() makes the code more intuitive, I have
no problem on not adding it...

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5961a97541c2..0573f676696a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5905,7 +5905,8 @@ static int __sched_setscheduler(struct task_struct *p,
>  #ifdef CONFIG_SMP
>                 if (dl_bandwidth_enabled() && dl_policy(policy) &&
>                                 !(attr->sched_flags & SCHED_FLAG_SUGOV)) {
> -                       cpumask_t *span = rq->rd->span;
> +                       int cpu = dl_task_cpu(p);
> +                       cpumask_t *span = cpu_rq(cpu)->rd->span;
>  
>                         /*
>                          * Don't allow tasks with an affinity mask smaller than
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index c221e14d5b86..308ecaaf3d28 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2678,7 +2678,7 @@ int sched_dl_overflow(struct task_struct *p, int policy,
>         u64 period = attr->sched_period ?: attr->sched_deadline;
>         u64 runtime = attr->sched_runtime;
>         u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
> -       int cpus, err = -1, cpu = task_cpu(p);
> +       int cpus, err = -1, cpu = dl_task_cpu(p);
>         struct dl_bw *dl_b = dl_bw_of(cpu);
>         unsigned long cap;
> 

This way works for me too.

Thanks!

-- Daniel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/6] sched/deadline: Block DL tasks on non-exclusive cpuset if bandwitdh control is enable
  2021-01-14 15:51   ` Dietmar Eggemann
@ 2021-01-19  9:41     ` Daniel Bristot de Oliveira
  2021-01-19 15:37       ` Dietmar Eggemann
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-01-19  9:41 UTC (permalink / raw)
  To: Dietmar Eggemann, linux-kernel
  Cc: Marco Perronet, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Li Zefan, Tejun Heo, Johannes Weiner, Valentin Schneider,
	cgroups

On 1/14/21 4:51 PM, Dietmar Eggemann wrote:
> On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote:
>> The current SCHED_DEADLINE design supports only global scheduler,
>> or variants of it, i.e., clustered and partitioned, via cpuset config.
>> To enable the partitioning of a system with clusters of CPUs, the
>> documentation advises the usage of exclusive cpusets, creating an
>> exclusive root_domain for the cpuset.
>>
>> Attempts to change the cpu affinity of a thread to a cpu mask different
>> from the root domain results in an error. For instance:
> 
> [...]
> 
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 788a391657a5..c221e14d5b86 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -2878,6 +2878,13 @@ int dl_task_can_attach(struct task_struct *p,
>>  	if (cpumask_empty(cs_cpus_allowed))
>>  		return 0;
>>  
>> +	/*
>> +	 * Do not allow moving tasks to non-exclusive cpusets
>> +	 * if bandwidth control is enabled.
>> +	 */
>> +	if (dl_bandwidth_enabled() && !exclusive)
>> +		return -EBUSY;
>> +
>>  	/*
>>  	 * The task is not moving to another root domain, so it is
>>  	 * already accounted.
>>
> 
> But doesn't this mean you only have to set this cgroup exclusive/root to
> run into the same issue:
> 
> with this patch:
> 
> cgroupv1:
> 
> root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 1000000000
> --sched-runtime 100000000 0 sleep 500 &
> [1] 1668
> root@juno:/sys/fs/cgroup/cpuset# PID1=$!
> 
> root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 1000000000
> --sched-runtime 100000000 0 sleep 500 &
> [2] 1669
> root@juno:/sys/fs/cgroup/cpuset# PID2=$!
> 
> root@juno:/sys/fs/cgroup/cpuset# mkdir A
> 
> root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.mems
> root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.cpus
> 
> root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs
> -bash: echo: write error: Device or resource busy
> 
> root@juno:/sys/fs/cgroup/cpuset# echo 1 > ./A/cpuset.cpu_exclusive
> 
> root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs
> 
> root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID1/status | grep
> Cpus_allowed_list | awk '{print $2}'
> 0-5
> root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID2/status | grep
> Cpus_allowed_list | awk '{print $2}'
> 0

On CPU v1 we also need to disable the load balance to create a root domain, right?

> cgroupv2:

Yeah, I see your point. I was seeing a different output because of Fedora
default's behavior of adding the tasks to the system.slice/user.slice...

doing:

> root@juno:/sys/fs/cgroup# echo +cpuset > cgroup.subtree_control

# echo $$ > cgroup.procs

> root@juno:/sys/fs/cgroup# chrt -d --sched-period 1000000000
> --sched-runtime 100000000 0 sleep 500 &
> [1] 1687
> root@juno:/sys/fs/cgroup# PID1=$!
> 
> root@juno:/sys/fs/cgroup# chrt -d --sched-period 1000000000
> --sched-runtime 100000000 0 sleep 500 &
> [2] 1688
> root@juno:/sys/fs/cgroup# PID2=$!
> 
> root@juno:/sys/fs/cgroup# mkdir A
> 
> root@juno:/sys/fs/cgroup# echo 0 > ./A/cpuset.mems
> root@juno:/sys/fs/cgroup# echo 0 > ./A/cpuset.cpus
> 
> root@juno:/sys/fs/cgroup# echo $PID2 > ./A/cgroup.procs
> -bash: echo: write error: Device or resource busy
> 
> root@juno:/sys/fs/cgroup# echo root > ./A/cpuset.cpus.partition
> 
> root@juno:/sys/fs/cgroup# echo $PID2 > ./A/cgroup.procs
> 
> root@juno:/sys/fs/cgroup# cat /proc/$PID1/status | grep
> Cpus_allowed_list | awk '{print $2}'
> 0-5
> root@juno:/sys/fs/cgroup# cat /proc/$PID2/status | grep
> Cpus_allowed_list | awk '{print $2}'
> 0

makes me see the same behavior. This will require further analysis.

So, my plan now is to split this patch set into two, one with patches 1,3,5, and
6, which fixes the most "easy" part of the problems, and another with 2 and 4
that will require further investigation (discussed this with Dietmar already).

Thoughts?

-- Daniel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/6] sched/deadline: Block DL tasks on non-exclusive cpuset if bandwitdh control is enable
  2021-01-19  9:41     ` Daniel Bristot de Oliveira
@ 2021-01-19 15:37       ` Dietmar Eggemann
  0 siblings, 0 replies; 15+ messages in thread
From: Dietmar Eggemann @ 2021-01-19 15:37 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, linux-kernel
  Cc: Marco Perronet, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Li Zefan, Tejun Heo, Johannes Weiner, Valentin Schneider,
	cgroups

On 19/01/2021 10:41, Daniel Bristot de Oliveira wrote:
> On 1/14/21 4:51 PM, Dietmar Eggemann wrote:
>> On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote:

[...]

>> with this patch:
>>
>> cgroupv1:
>>
>> root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 1000000000
>> --sched-runtime 100000000 0 sleep 500 &
>> [1] 1668
>> root@juno:/sys/fs/cgroup/cpuset# PID1=$!
>>
>> root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 1000000000
>> --sched-runtime 100000000 0 sleep 500 &
>> [2] 1669
>> root@juno:/sys/fs/cgroup/cpuset# PID2=$!
>>
>> root@juno:/sys/fs/cgroup/cpuset# mkdir A
>>
>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.mems
>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.cpus
>>
>> root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs
>> -bash: echo: write error: Device or resource busy
>>
>> root@juno:/sys/fs/cgroup/cpuset# echo 1 > ./A/cpuset.cpu_exclusive
>>
>> root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs
>>
>> root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID1/status | grep
>> Cpus_allowed_list | awk '{print $2}'
>> 0-5
>> root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID2/status | grep
>> Cpus_allowed_list | awk '{print $2}'
>> 0
> 
> On CPU v1 we also need to disable the load balance to create a root domain, right?

IMHO, that's not necessary for this example. But yes, if we create 2
exclusive cpusets A and B we want to turn off load-balancing on root
level. It also doesn't hurt doing this in this example. But we end up
with no sched domain since load-balance is disabled at root and A only
contains CPU0.

root@juno:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance

ls /proc/sys/kernel/sched_domain/cpu*/ doesn't show any (sched) domains.

>> cgroupv2:
> 
> Yeah, I see your point. I was seeing a different output because of Fedora
> default's behavior of adding the tasks to the system.slice/user.slice...
> 
> doing:
> 
>> root@juno:/sys/fs/cgroup# echo +cpuset > cgroup.subtree_control
> 
> # echo $$ > cgroup.procs

The current shell should be already in the root cgroup?

root@juno:/sys/fs/cgroup# echo $$
1644
root@juno:/sys/fs/cgroup# cat cgroup.procs | grep $$
1644

[...]


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/6] sched/deadline: Allow DL tasks on empty (cgroup v2) cpusets
  2021-01-12 15:53 ` [PATCH 3/6] sched/deadline: Allow DL tasks on empty (cgroup v2) cpusets Daniel Bristot de Oliveira
  2021-01-14 12:12   ` Dietmar Eggemann
@ 2021-01-22  8:08   ` Juri Lelli
  1 sibling, 0 replies; 15+ messages in thread
From: Juri Lelli @ 2021-01-22  8:08 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, Marco Perronet, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Li Zefan, Tejun Heo, Johannes Weiner,
	Valentin Schneider, cgroups

Hi,

On 12/01/21 16:53, Daniel Bristot de Oliveira wrote:
> cgroups v2 allows the cpuset controller to be enabled/disabled on
> demand. On Fedora 32, cpuset is disabled by default. To enable it,
> a user needs to:
> 
>   # cd /sys/fs/cgroup/
>   # echo +cpuset > cgroup.subtree_control
> 
> Existing cgroups will expose the cpuset interface (e.g., cpuset.cpus
> file). By default, cpuset.cpus has no CPU assigned, which means that
> existing tasks will move to a cpuset without cpus.

This is kind of confusing, though. Isn't it?

> Initially, I thought about returning an error and blocking the
> operation. However, that is indeed not needed. The cpuset without
> CPUs assigned will be a non-root cpuset, hence its cpu mask will
> be the same as the root one. So, the bandwidth was already accounted,
> and the task can proceed.
> 
> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> ---
>  kernel/sched/deadline.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 943aa32cc1bc..788a391657a5 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2871,6 +2871,13 @@ int dl_task_can_attach(struct task_struct *p,
>  	bool overflow;
>  	int ret;
>  
> +	/*
> +	 * The cpuset has no cpus assigned, so the thread will not
> +	 * change its affinity.

Is this always the case also in the presence of deeper hierarchies?

Thanks,
Juri


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, back to index

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-12 15:53 [PATCH 0/6] sched/deadline: cpuset task acceptance review Daniel Bristot de Oliveira
2021-01-12 15:53 ` [PATCH 1/6] sched/deadline: Consolidate the SCHED_DL task_can_attach() check on its own function Daniel Bristot de Oliveira
2021-01-12 15:53 ` [PATCH 2/6] sched/deadline: Inform dl_task_can_attach() if the cpuset is exclusive Daniel Bristot de Oliveira
2021-01-12 15:53 ` [PATCH 3/6] sched/deadline: Allow DL tasks on empty (cgroup v2) cpusets Daniel Bristot de Oliveira
2021-01-14 12:12   ` Dietmar Eggemann
2021-01-18 12:51     ` Daniel Bristot de Oliveira
2021-01-22  8:08   ` Juri Lelli
2021-01-12 15:53 ` [PATCH 4/6] sched/deadline: Block DL tasks on non-exclusive cpuset if bandwitdh control is enable Daniel Bristot de Oliveira
2021-01-14 15:51   ` Dietmar Eggemann
2021-01-19  9:41     ` Daniel Bristot de Oliveira
2021-01-19 15:37       ` Dietmar Eggemann
2021-01-12 15:53 ` [PATCH 5/6] sched/deadline: Add helpers to get the correct root domain/span/dl_bw Daniel Bristot de Oliveira
2021-01-12 15:53 ` [PATCH 6/6] sched/deadline: Fixes cpu/rd/dl_bw references for suspended tasks Daniel Bristot de Oliveira
2021-01-15 14:36   ` Dietmar Eggemann
2021-01-18 13:17     ` Daniel Bristot de Oliveira

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git
	git clone --mirror https://lore.kernel.org/lkml/9 lkml/git/9.git
	git clone --mirror https://lore.kernel.org/lkml/10 lkml/git/10.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git