While surveying the properties of SCHED_DEADLINE, Marco Perronet found some inconsistencies in the acceptance of DL threads on cpusets, more precisely, in the acceptance of threads with arbitrary affinity. He contacted me and, while doing the investigation, we found yet more potential issues, which are addressed in this patch series. Each patch has a more in-depth explanation, including ways to reproduce the problem. Daniel Bristot de Oliveira (6): sched/deadline: Consolidate the SCHED_DL task_can_attach() check on its own function sched/deadline: Inform dl_task_can_attach() if the cpuset is exclusive sched/deadline: Allow DL tasks on empty (cgroup v2) cpusets sched/deadline: Block DL tasks on non-exclusive cpusets if bandwidth control is enabled sched/deadline: Add helpers to get the correct root domain/span/dl_bw sched/deadline: Fix cpu/rd/dl_bw references for suspended tasks include/linux/sched.h | 2 +- kernel/cgroup/cpuset.c | 5 ++++- kernel/sched/core.c | 13 ++++++------ kernel/sched/deadline.c | 28 ++++++++++++++++++++++--- kernel/sched/sched.h | 45 ++++++++++++++++++++++++++++++++++++++++- 5 files changed, 80 insertions(+), 13 deletions(-) -- 2.29.2
It is easier to track the restrictions imposed on SCHED_DEADLINE tasks if we keep them all together in a single function. No functional changes. Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: linux-kernel@vger.kernel.org Cc: cgroups@vger.kernel.org --- kernel/sched/core.c | 3 +-- kernel/sched/deadline.c | 7 +++++++ 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7af80c3fce12..f4aede34449c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7142,8 +7142,7 @@ int task_can_attach(struct task_struct *p, goto out; } - if (dl_task(p) && !cpumask_intersects(task_rq(p)->rd->span, - cs_cpus_allowed)) + if (dl_task(p)) ret = dl_task_can_attach(p, cs_cpus_allowed); out: diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 75686c6d4436..c97d2c707b98 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2870,6 +2870,13 @@ int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allo bool overflow; int ret; + /* + * The task is not moving to another root domain, so it is + * already accounted. + */ + if (cpumask_intersects(task_rq(p)->rd->span, cs_cpus_allowed)) + return 0; + dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed); rcu_read_lock_sched(); -- 2.29.2
In preparation for blocking SCHED_DEADLINE tasks on non-exclusive cpusets when bandwidth control is enabled. No functional changes. Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: linux-kernel@vger.kernel.org Cc: cgroups@vger.kernel.org --- include/linux/sched.h | 2 +- kernel/cgroup/cpuset.c | 5 ++++- kernel/sched/core.c | 4 ++-- kernel/sched/deadline.c | 3 ++- kernel/sched/sched.h | 2 +- 5 files changed, 10 insertions(+), 6 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 942b87f80cc7..9525fbe032c9 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1654,7 +1654,7 @@ current_restore_flags(unsigned long orig_flags, unsigned long flags) } extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial); -extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed); +extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed, int exclusive); #ifdef CONFIG_SMP extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask); extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask); diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 53c70c470a38..c6513dfabd3d 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2132,6 +2132,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset) struct cgroup_subsys_state *css; 
struct cpuset *cs; struct task_struct *task; + int exclusive; int ret; /* used later by cpuset_attach() */ @@ -2146,8 +2147,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset) (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))) goto out_unlock; + exclusive = is_cpu_exclusive(cs); + cgroup_taskset_for_each(task, css, tset) { - ret = task_can_attach(task, cs->cpus_allowed); + ret = task_can_attach(task, cs->cpus_allowed, exclusive); if (ret) goto out_unlock; ret = security_task_setscheduler(task); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f4aede34449c..5961a97541c2 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7124,7 +7124,7 @@ int cpuset_cpumask_can_shrink(const struct cpumask *cur, } int task_can_attach(struct task_struct *p, - const struct cpumask *cs_cpus_allowed) + const struct cpumask *cs_cpus_allowed, int exclusive) { int ret = 0; @@ -7143,7 +7143,7 @@ int task_can_attach(struct task_struct *p, } if (dl_task(p)) - ret = dl_task_can_attach(p, cs_cpus_allowed); + ret = dl_task_can_attach(p, cs_cpus_allowed, exclusive); out: return ret; diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index c97d2c707b98..943aa32cc1bc 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2862,7 +2862,8 @@ bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr) } #ifdef CONFIG_SMP -int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed) +int dl_task_can_attach(struct task_struct *p, + const struct cpumask *cs_cpus_allowed, int exclusive) { unsigned long flags, cap; unsigned int dest_cpu; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index f5acb6c5ce49..54881d99cebd 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -337,7 +337,7 @@ extern void __setparam_dl(struct task_struct *p, const struct sched_attr *attr); extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr); extern bool 
__checkparam_dl(const struct sched_attr *attr); extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr); -extern int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed); +extern int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed, int exclusive); extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial); extern bool dl_cpu_busy(unsigned int cpu); -- 2.29.2
cgroups v2 allows the cpuset controller to be enabled/disabled on demand. On Fedora 32, cpuset is disabled by default. To enable it, a user needs to: # cd /sys/fs/cgroup/ # echo +cpuset > cgroup.subtree_control Existing cgroups will expose the cpuset interface (e.g., cpuset.cpus file). By default, cpuset.cpus has no CPU assigned, which means that existing tasks will move to a cpuset without cpus. With that in mind, look what happens if a SCHED_DEADLINE task exists on any cgroup (user.slice by default on Fedora): ----- %< ----- # chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 100 & # cd /sys/fs/cgroup/ # echo '+cpuset' > cgroup.subtree_control [ 65.384041] BUG: unable to handle page fault for address: ffffffffb720f7e0 [ 65.384551] #PF: supervisor read access in kernel mode [ 65.384923] #PF: error_code(0x0000) - not-present page [ 65.385298] PGD 61a15067 P4D 61a15067 PUD 61a16063 PMD 800fffff9ddff062 [ 65.385781] Oops: 0000 [#1] SMP PTI [ 65.386042] CPU: 0 PID: 799 Comm: sh Not tainted 5.10.0-rc3 #1 [ 65.386461] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 [ 65.387077] RIP: 0010:dl_task_can_attach+0x40/0x250 [ 65.387429] Code: 54 55 53 48 83 ec 18 48 89 3c 24 bf ff ff ff ff e8 05 a2 52 00 4c 63 f0 48 c7 c5 00 9e 02 00 4a 8b 04 f5 00 09 47 b6 48 89 ea <4c> 8b a4 10 e0 09 00 00 49 8d 44 24 40 48 89 c7 48 89 44 24 08 e8 [ 65.388768] RSP: 0018:ffffaee8c056fcd8 EFLAGS: 00010283 [ 65.389148] RAX: ffffffffb71e5000 RBX: ffffaee8c056fdd0 RCX: 0000000000000040 [ 65.389661] RDX: 0000000000029e00 RSI: ffff9db202534e48 RDI: ffffffffb6d3a3e0 [ 65.390174] RBP: 0000000000029e00 R08: 0000000000000000 R09: 0000000000000004 [ 65.390686] R10: 0000000000000001 R11: 00000000ffa6fbff R12: ffffaee8c056fbf0 [ 65.391196] R13: ffff9db2024e1400 R14: 0000000000000004 R15: ffff9db20ebb31e0 [ 65.391710] FS: 00007f6df41b1740(0000) GS:ffff9db377c00000(0000) knlGS:0000000000000000 [ 65.392289] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 
65.392705] CR2: ffffffffb720f7e0 CR3: 000000010680a003 CR4: 0000000000370ef0 [ 65.393220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 65.393732] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 65.394244] Call Trace: [ 65.394437] cpuset_can_attach+0x8b/0x110 [ 65.394732] cgroup_migrate_execute+0x70/0x430 [ 65.395057] cgroup_update_dfl_csses+0x222/0x230 [ 65.395392] cgroup_subtree_control_write+0x2c6/0x3c0 [ 65.395759] kernfs_fop_write+0xce/0x1b0 [ 65.396048] vfs_write+0xc2/0x230 [ 65.396291] ksys_write+0x4f/0xc0 [ 65.396533] do_syscall_64+0x33/0x40 [ 65.396797] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 65.397166] RIP: 0033:0x7f6df42a6537 [ 65.397428] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 [ 65.398766] RSP: 002b:00007ffee4128018 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 65.399838] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f6df42a6537 [ 65.400923] RDX: 0000000000000008 RSI: 000055b3f7e549e0 RDI: 0000000000000001 [ 65.402003] RBP: 000055b3f7e549e0 R08: 000000000000000a R09: 0000000000000007 [ 65.403082] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000000008 [ 65.404156] R13: 00007f6df4378500 R14: 0000000000000008 R15: 00007f6df4378700 [ 65.405218] Modules linked in: <lots of modules> [ 65.414172] CR2: ffffffffb720f7e0 [ 65.415117] ---[ end trace 2dbff1a688549e65 ]--- ----- >% ----- That happens because on dl_task_can_attach(): dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed); returns a non active cpu. Initially, I thought about returning an error and blocking the operation. However, that is indeed not needed. The cpuset without CPUs assigned will be a non-root cpuset, hence its cpu mask will be the same as the root one. So, the bandwidth was already accounted, and the task can proceed. 
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: linux-kernel@vger.kernel.org Cc: cgroups@vger.kernel.org --- kernel/sched/deadline.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 943aa32cc1bc..788a391657a5 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2871,6 +2871,13 @@ int dl_task_can_attach(struct task_struct *p, bool overflow; int ret; + /* + * The cpuset has no cpus assigned, so the thread will not + * change its affinity. + */ + if (cpumask_empty(cs_cpus_allowed)) + return 0; + /* * The task is not moving to another root domain, so it is * already accounted. -- 2.29.2
The current SCHED_DEADLINE design supports only the global scheduler, or variants of it, i.e., clustered and partitioned, via cpuset config. To enable the partitioning of a system with clusters of CPUs, the documentation advises the usage of exclusive cpusets, creating an exclusive root_domain for the cpuset. Attempts to change the cpu affinity of a thread to a cpu mask different from the root domain result in an error. For instance: ----- %< ----- [root@x1 linux]# chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 10000 & [1] 69020 [root@x1 linux]# taskset -p -c 0 69020 pid 69020's current affinity list: 0-7 taskset: failed to set pid 69020's affinity: Device or resource busy ----- >% ----- Such a restriction can be deliberately bypassed by disabling the SCHED_DEADLINE admission test, under the assumption that the user is aware of the implications of that decision. However, Marco Perronet noticed that the restriction can also be bypassed unintentionally, because no check is currently imposed by the cpuset mechanism. For instance, this script: ----- %< ----- #!/bin/bash # Enter the cgroup directory cd /sys/fs/cgroup/ # Check if it is cgroup v2 and enable cpuset if [ -e cgroup.subtree_control ]; then # Enable cpuset controller on cgroup v2 echo +cpuset > cgroup.subtree_control fi echo LOG: create a cpuset and assigned the CPU 0 to it # Create cpuset groups rmdir dl-group &> /dev/null mkdir dl-group # Restrict the task to the CPU 0 echo 0 > dl-group/cpuset.mems echo 0 > dl-group/cpuset.cpus # Place a task in the root cgroup echo LOG: dispatching the first DL task chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 100 & ROOT_PID="$!" ROOT_ALLOWED=`cat /proc/$ROOT_PID/status | grep Cpus_allowed_list | awk '{print $2}'` # Dispatch another task in the root cgroup, to move it later. echo LOG: dispatching the second DL task chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 100 & CPUSET_PID="$!" 
# let them settle down sleep 1 # Assign the second task to the cgroup echo LOG: moving the second DL task to the cpuset echo "$CPUSET_PID" > dl-group/cgroup.procs 2> /dev/null ACCEPTED=$? CPUSET_ALLOWED=`cat /proc/$CPUSET_PID/status | grep Cpus_allowed_list | awk '{print $2}'` if [ $ACCEPTED == 0 ]; then echo FAIL: a DL task was accepted on a non-exclusive cpuset else echo PASS: DL task was rejected on a non-exclusive cpuset fi if [ $ROOT_ALLOWED == $CPUSET_ALLOWED ]; then echo PASS: the affinity did not change: $CPUSET_ALLOWED == $ROOT_ALLOWED else echo FAIL: the cpu affinity is different: $CPUSET_ALLOWED == $ROOT_ALLOWED fi # Just ignore the clean up exec > /dev/null 2>&1 kill -9 $CPUSET_PID kill -9 $ROOT_PID rmdir dl-group ----- >% ----- Shows these results: ----- %< ----- LOG: create a cpuset and assigned the CPU 0 to it LOG: dispatching the first DL task LOG: dispatching the second DL task LOG: moving the second DL task to the cpuset FAIL: a DL task was accepted on a non-exclusive cpuset FAIL: the cpu affinity is different: 0 == 0-3 ----- >% ----- This result is a problem because the two tasks have different cpu masks, but they end up sharing CPU 0, which is something not supported by the current SCHED_DEADLINE design (APA - Arbitrary Processor Affinities). To avoid such a scenario, the correct action is to reject the attach of a SCHED_DEADLINE thread to a non-exclusive cpuset. With the proposed patch in place, the script above returns: ----- %< ----- LOG: create a cpuset and assigned the CPU 0 to it LOG: dispatching the first DL task LOG: dispatching the second DL task LOG: moving the second DL task to the cpuset PASS: DL task was rejected on a non-exclusive cpuset PASS: the affinity did not change: 0-3 == 0-3 ----- >% ----- Still, as with taskset, this restriction can be bypassed by disabling the admission test, i.e.: # sysctl -w kernel.sched_rt_runtime_us=-1 so users can work at their own risk. 
Reported-by: Marco Perronet <perronet@mpi-sws.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: linux-kernel@vger.kernel.org Cc: cgroups@vger.kernel.org --- kernel/sched/deadline.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 788a391657a5..c221e14d5b86 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2878,6 +2878,13 @@ int dl_task_can_attach(struct task_struct *p, if (cpumask_empty(cs_cpus_allowed)) return 0; + /* + * Do not allow moving tasks to non-exclusive cpusets + * if bandwidth control is enabled. + */ + if (dl_bandwidth_enabled() && !exclusive) + return -EBUSY; + /* * The task is not moving to another root domain, so it is * already accounted. -- 2.29.2
task_cpu() is often used in SCHED_DEADLINE to access the root_domain and associated information. However, task_cpu() does not always reflect the correct CPU. The reason is explained in the code: From kernel/sched/core.c: ----- %< ----- * p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }: * * is set by activate_task() and cleared by deactivate_task(), under * rq->lock. Non-zero indicates the task is runnable, the special * ON_RQ_MIGRATING state is used for migration without holding both * rq->locks. It indicates task_cpu() is not stable, see task_rq_lock(). [...] * task_cpu(p): is changed by set_task_cpu(), the rules are: * * - Don't call set_task_cpu() on a blocked task: * * We don't care what CPU we're not running on, this simplifies hotplug, * the CPU assignment of blocked tasks isn't required to be valid. ----- >% ----- So, a sleeping task will not have a stable task_cpu(), and this is causing problems for SCHED_DEADLINE. In preparation for fixing these problems, we need to add helper functions that return the dl_task_cpu, root_domain, and "root" dl_bw that reflect the current placement/affinity of the task. Note that these functions are only required on code paths that can run while the task is not queued. 
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: linux-kernel@vger.kernel.org Cc: cgroups@vger.kernel.org --- kernel/sched/sched.h | 43 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 54881d99cebd..5f3f3cb5a6ff 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2393,6 +2393,37 @@ void __dl_update(struct dl_bw *dl_b, s64 bw) rq->dl.extra_bw += bw; } } + +static inline +int dl_task_cpu(struct task_struct *p) +{ + /* + * We can only rely on task_cpu() of runnable tasks. + */ + if (p->on_rq) + return task_cpu(p); + + /* + * The task_cpu() of a non-runnable task might be pointing to a CPU + * it cannot run on anymore (see set_task_cpu()). Hence, let's + * get a possible cpu from the current cpu_mask. + */ + return cpumask_any(p->cpus_ptr); + +} + +static inline +struct root_domain *dl_task_rd(struct task_struct *p) +{ + int cpu = dl_task_cpu(p); + return cpu_rq(cpu)->rd; +} + +static inline +struct dl_bw *dl_task_root_bw(struct task_struct *p) +{ + return &dl_task_rd(p)->dl_bw; +} #else static inline void __dl_update(struct dl_bw *dl_b, s64 bw) @@ -2401,6 +2432,18 @@ void __dl_update(struct dl_bw *dl_b, s64 bw) dl->extra_bw += bw; } + +static inline +int dl_task_cpu(struct task_struct *p) +{ + return 0; +} + +static inline +struct dl_bw *dl_task_root_bw(struct task_struct *p) +{ + return &task_rq(p)->dl.dl_bw; +} #endif -- 2.29.2
__set_cpus_allowed_ptr() migrates running or runnable tasks, setting the task's cpu accordingly. The CPU is not set when the task is not runnable because of complications in the hotplug code; the task's cpu will be updated at the next wakeup anyway. However, this creates a problem for the usage of task_cpu(p), which might point to a CPU on which the task cannot run or, worse, to a runqueue/root_domain it does not belong to, causing some odd errors. For example, the script below shows that a sleeping task cannot become SCHED_DEADLINE if it was moved to another (exclusive) cpuset: ----- %< ----- #!/bin/bash # Enter the cgroup directory cd /sys/fs/cgroup/ # Check if it is cgroup v2 and enable cpuset if [ -e cgroup.subtree_control ]; then # Enable cpuset controller on cgroup v2 echo +cpuset > cgroup.subtree_control fi echo LOG: create an exclusive cpuset and assign CPU 0 to it # Create cpuset groups rmdir dl-group &> /dev/null mkdir dl-group # Restrict the task to the CPU 0 echo 0 > dl-group/cpuset.mems echo 0 > dl-group/cpuset.cpus echo root > dl-group/cpuset.cpus.partition echo LOG: dispatching a regular task sleep 100 & CPUSET_PID="$!" # let it settle down sleep 1 # Assign the task to the cgroup echo LOG: moving the task to the cpuset echo "$CPUSET_PID" > dl-group/cgroup.procs 2> /dev/null CPUSET_ALLOWED=`cat /proc/$CPUSET_PID/status | grep Cpus_allowed_list | awk '{print $2}'` chrt -p -d --sched-period 1000000000 --sched-runtime 100000000 0 $CPUSET_PID ACCEPTED=$? if [ $ACCEPTED == 0 ]; then echo PASS: the task became DL else echo FAIL: the task was rejected as DL fi # Just ignore the clean up exec > /dev/null 2>&1 kill -9 $CPUSET_PID rmdir dl-group ----- >% ----- Long story short: the sleep task is sleeping (not runnable) on a CPU != 0. After moving it to a cpuset with only CPU 0, task_cpu() returns a cpu that does not belong to the cpuset the task is in, and the task is rejected in this if: ----- %< ----- __sched_setscheduler(): [...] 
rq = task_rq_lock(p, &rf); <-- uses task_cpu(), that points to <-- the old cpu. [...] if (dl_bandwidth_enabled() && dl_policy(policy) && !(attr->sched_flags & SCHED_FLAG_SUGOV)) { cpumask_t *span = rq->rd->span; <--- wrong rd! /* * Don't allow tasks with an affinity mask smaller than * the entire root_domain to become SCHED_DEADLINE. We * will also fail if there's no bandwidth available. */ if (!cpumask_subset(span, p->cpus_ptr) || rq->rd->dl_bw.bw == 0) { retval = -EPERM; <--- returns here. goto unlock; } } ----- >% ----- This happens because the rq, and thus the root domain, correspond to the CPU on which the sleep command went to... sleep, not the one on which it will run at the next wakeup, given its affinity. To avoid this problem, use the dl_task*() helpers, which return the task cpu, root domain, and the "root" dl_bw while being aware of the status of task->cpu. Reported-by: Marco Perronet <perronet@mpi-sws.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: linux-kernel@vger.kernel.org Cc: cgroups@vger.kernel.org --- kernel/sched/core.c | 6 +++--- kernel/sched/deadline.c | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 5961a97541c2..3c2775e6869f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5905,15 +5905,15 @@ static int __sched_setscheduler(struct task_struct *p, #ifdef CONFIG_SMP if (dl_bandwidth_enabled() && dl_policy(policy) && 
!(attr->sched_flags & SCHED_FLAG_SUGOV)) { - cpumask_t *span = rq->rd->span; + struct root_domain *rd = dl_task_rd(p); /* * Don't allow tasks with an affinity mask smaller than * the entire root_domain to become SCHED_DEADLINE. We * will also fail if there's no bandwidth available. */ - if (!cpumask_subset(span, p->cpus_ptr) || - rq->rd->dl_bw.bw == 0) { + if (!cpumask_subset(rd->span, p->cpus_ptr) || + rd->dl_bw.bw == 0) { retval = -EPERM; goto unlock; } diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index c221e14d5b86..1f6264cb8867 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2678,8 +2678,8 @@ int sched_dl_overflow(struct task_struct *p, int policy, u64 period = attr->sched_period ?: attr->sched_deadline; u64 runtime = attr->sched_runtime; u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0; - int cpus, err = -1, cpu = task_cpu(p); - struct dl_bw *dl_b = dl_bw_of(cpu); + int cpus, err = -1, cpu = dl_task_cpu(p); + struct dl_bw *dl_b = dl_task_root_bw(p); unsigned long cap; if (attr->sched_flags & SCHED_FLAG_SUGOV) -- 2.29.2
On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote: > cgroups v2 allows the cpuset controller to be enabled/disabled on > demand. On Fedora 32, cpuset is disabled by default. To enable it, > a user needs to: > > # cd /sys/fs/cgroup/ > # echo +cpuset > cgroup.subtree_control > > Existing cgroups will expose the cpuset interface (e.g., cpuset.cpus > file). By default, cpuset.cpus has no CPU assigned, which means that > existing tasks will move to a cpuset without cpus. > > With that in mind, look what happens if a SCHED_DEADLINE task exists > on any cgroup (user.slice by default on Fedora): > > ----- %< ----- > # chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 100 & Like you mentioned above, to see the issue the DL task has to be moved into the cgroup (e.g. user.slice) here: echo $PID > /sys/fs/cgroup/user.slice/cgroup.procs > # cd /sys/fs/cgroup/ > # echo '+cpuset' > cgroup.subtree_control > [ 65.384041] BUG: unable to handle page fault for address: ffffffffb720f7e0 > [ 65.384551] #PF: supervisor read access in kernel mode > [ 65.384923] #PF: error_code(0x0000) - not-present page > [ 65.385298] PGD 61a15067 P4D 61a15067 PUD 61a16063 PMD 800fffff9ddff062 > [ 65.385781] Oops: 0000 [#1] SMP PTI > [ 65.386042] CPU: 0 PID: 799 Comm: sh Not tainted 5.10.0-rc3 #1 > [ 65.386461] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 > [ 65.387077] RIP: 0010:dl_task_can_attach+0x40/0x250 > [ 65.387429] Code: 54 55 53 48 83 ec 18 48 89 3c 24 bf ff ff ff ff e8 05 a2 52 00 > 4c 63 f0 48 c7 c5 00 9e 02 00 4a 8b 04 f5 00 09 47 b6 48 89 ea > <4c> 8b a4 10 e0 09 00 00 49 8d 44 24 40 48 89 c7 48 89 44 24 > 08 e8 > [ 65.388768] RSP: 0018:ffffaee8c056fcd8 EFLAGS: 00010283 > [ 65.389148] RAX: ffffffffb71e5000 RBX: ffffaee8c056fdd0 RCX: 0000000000000040 > [ 65.389661] RDX: 0000000000029e00 RSI: ffff9db202534e48 RDI: ffffffffb6d3a3e0 > [ 65.390174] RBP: 0000000000029e00 R08: 0000000000000000 R09: 0000000000000004 > [ 65.390686] R10: 
0000000000000001 R11: 00000000ffa6fbff R12: ffffaee8c056fbf0 > [ 65.391196] R13: ffff9db2024e1400 R14: 0000000000000004 R15: ffff9db20ebb31e0 > [ 65.391710] FS: 00007f6df41b1740(0000) GS:ffff9db377c00000(0000) knlGS:0000000000000000 > [ 65.392289] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 65.392705] CR2: ffffffffb720f7e0 CR3: 000000010680a003 CR4: 0000000000370ef0 > [ 65.393220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 65.393732] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [ 65.394244] Call Trace: > [ 65.394437] cpuset_can_attach+0x8b/0x110 > [ 65.394732] cgroup_migrate_execute+0x70/0x430 > [ 65.395057] cgroup_update_dfl_csses+0x222/0x230 > [ 65.395392] cgroup_subtree_control_write+0x2c6/0x3c0 > [ 65.395759] kernfs_fop_write+0xce/0x1b0 > [ 65.396048] vfs_write+0xc2/0x230 > [ 65.396291] ksys_write+0x4f/0xc0 > [ 65.396533] do_syscall_64+0x33/0x40 > [ 65.396797] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 65.397166] RIP: 0033:0x7f6df42a6537 > [ 65.397428] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f > 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 > <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 > 74 24 > [ 65.398766] RSP: 002b:00007ffee4128018 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 > [ 65.399838] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f6df42a6537 > [ 65.400923] RDX: 0000000000000008 RSI: 000055b3f7e549e0 RDI: 0000000000000001 > [ 65.402003] RBP: 000055b3f7e549e0 R08: 000000000000000a R09: 0000000000000007 > [ 65.403082] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000000008 > [ 65.404156] R13: 00007f6df4378500 R14: 0000000000000008 R15: 00007f6df4378700 > [ 65.405218] Modules linked in: <lots of modules> > [ 65.414172] CR2: ffffffffb720f7e0 > [ 65.415117] ---[ end trace 2dbff1a688549e65 ]--- > ----- >% ----- > > That happens because on dl_task_can_attach(): > dest_cpu = cpumask_any_and(cpu_active_mask, 
cs_cpus_allowed); > > returns a non active cpu. Since cs_cpus_allowed is empty dest_cpu should be an invalid CPU (>= nr_cpu_ids) here. > Initially, I thought about returning an error and blocking the > operation. However, that is indeed not needed. The cpuset without > CPUs assigned will be a non-root cpuset, hence its cpu mask will > be the same as the root one. So, the bandwidth was already accounted, > and the task can proceed. LGTM. After the '/sys/fs/cgroup# echo '+cpuset' > cgroup.subtree_control': root's cpuset.cpus.effective == user.slice's cpuset.cpus.effective > Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Juri Lelli <juri.lelli@redhat.com> > Cc: Vincent Guittot <vincent.guittot@linaro.org> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Ben Segall <bsegall@google.com> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com> > Cc: Li Zefan <lizefan@huawei.com> > Cc: Tejun Heo <tj@kernel.org> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Valentin Schneider <valentin.schneider@arm.com> > Cc: linux-kernel@vger.kernel.org > Cc: cgroups@vger.kernel.org > --- > kernel/sched/deadline.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c > index 943aa32cc1bc..788a391657a5 100644 > --- a/kernel/sched/deadline.c > +++ b/kernel/sched/deadline.c > @@ -2871,6 +2871,13 @@ int dl_task_can_attach(struct task_struct *p, > bool overflow; > int ret; > > + /* > + * The cpuset has no cpus assigned, so the thread will not > + * change its affinity. > + */ > + if (cpumask_empty(cs_cpus_allowed)) > + return 0; > + > /* > * The task is not moving to another root domain, so it is > * already accounted. >
On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote:
> The current SCHED_DEADLINE design supports only global scheduler,
> or variants of it, i.e., clustered and partitioned, via cpuset config.
> To enable the partitioning of a system with clusters of CPUs, the
> documentation advises the usage of exclusive cpusets, creating an
> exclusive root_domain for the cpuset.
>
> Attempts to change the cpu affinity of a thread to a cpu mask different
> from the root domain results in an error. For instance:

[...]

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 788a391657a5..c221e14d5b86 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2878,6 +2878,13 @@ int dl_task_can_attach(struct task_struct *p,
> 	if (cpumask_empty(cs_cpus_allowed))
> 		return 0;
>
> +	/*
> +	 * Do not allow moving tasks to non-exclusive cpusets
> +	 * if bandwidth control is enabled.
> +	 */
> +	if (dl_bandwidth_enabled() && !exclusive)
> +		return -EBUSY;
> +
> 	/*
> 	 * The task is not moving to another root domain, so it is
> 	 * already accounted.

But doesn't this mean you only have to set this cgroup exclusive/root to
run into the same issue:

with this patch:

cgroupv1:

root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 1000000000
--sched-runtime 100000000 0 sleep 500 &
[1] 1668
root@juno:/sys/fs/cgroup/cpuset# PID1=$!

root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 1000000000
--sched-runtime 100000000 0 sleep 500 &
[2] 1669
root@juno:/sys/fs/cgroup/cpuset# PID2=$!

root@juno:/sys/fs/cgroup/cpuset# mkdir A

root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.mems
root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.cpus

root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs
-bash: echo: write error: Device or resource busy

root@juno:/sys/fs/cgroup/cpuset# echo 1 > ./A/cpuset.cpu_exclusive

root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs

root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID1/status | grep Cpus_allowed_list | awk '{print $2}'
0-5
root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID2/status | grep Cpus_allowed_list | awk '{print $2}'
0

cgroupv2:

root@juno:/sys/fs/cgroup# echo +cpuset > cgroup.subtree_control

root@juno:/sys/fs/cgroup# chrt -d --sched-period 1000000000
--sched-runtime 100000000 0 sleep 500 &
[1] 1687
root@juno:/sys/fs/cgroup# PID1=$!

root@juno:/sys/fs/cgroup# chrt -d --sched-period 1000000000
--sched-runtime 100000000 0 sleep 500 &
[2] 1688
root@juno:/sys/fs/cgroup# PID2=$!

root@juno:/sys/fs/cgroup# mkdir A

root@juno:/sys/fs/cgroup# echo 0 > ./A/cpuset.mems
root@juno:/sys/fs/cgroup# echo 0 > ./A/cpuset.cpus

root@juno:/sys/fs/cgroup# echo $PID2 > ./A/cgroup.procs
-bash: echo: write error: Device or resource busy

root@juno:/sys/fs/cgroup# echo root > ./A/cpuset.cpus.partition

root@juno:/sys/fs/cgroup# echo $PID2 > ./A/cgroup.procs

root@juno:/sys/fs/cgroup# cat /proc/$PID1/status | grep Cpus_allowed_list | awk '{print $2}'
0-5
root@juno:/sys/fs/cgroup# cat /proc/$PID2/status | grep Cpus_allowed_list | awk '{print $2}'
0
On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote: [...] > ----- %< ----- > #!/bin/bash > # Enter on the cgroup directory > cd /sys/fs/cgroup/ > > # Check it if is cgroup v2 and enable cpuset > if [ -e cgroup.subtree_control ]; then > # Enable cpuset controller on cgroup v2 > echo +cpuset > cgroup.subtree_control > fi > > echo LOG: create an exclusive cpuset and assigned the CPU 0 to it > # Create cpuset groups > rmdir dl-group &> /dev/null > mkdir dl-group > > # Restrict the task to the CPU 0 > echo 0 > dl-group/cpuset.mems > echo 0 > dl-group/cpuset.cpus > echo root > dl-group/cpuset.cpus.partition > > echo LOG: dispatching a regular task > sleep 100 & > CPUSET_PID="$!" > > # let it settle down > sleep 1 > > # Assign the second task to the cgroup There is only one task 'CPUSET_PID' involved here? > echo LOG: moving the second DL task to the cpuset > echo "$CPUSET_PID" > dl-group/cgroup.procs 2> /dev/null > > CPUSET_ALLOWED=`cat /proc/$CPUSET_PID/status | grep Cpus_allowed_list | awk '{print $2}'` > > chrt -p -d --sched-period 1000000000 --sched-runtime 100000000 0 $CPUSET_PID > ACCEPTED=$? > > if [ $ACCEPTED == 0 ]; then > echo PASS: the task became DL > else > echo FAIL: the task was rejected as DL > fi [...] > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 5961a97541c2..3c2775e6869f 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -5905,15 +5905,15 @@ static int __sched_setscheduler(struct task_struct *p, > #ifdef CONFIG_SMP > if (dl_bandwidth_enabled() && dl_policy(policy) && > !(attr->sched_flags & SCHED_FLAG_SUGOV)) { > - cpumask_t *span = rq->rd->span; > + struct root_domain *rd = dl_task_rd(p); > > /* > * Don't allow tasks with an affinity mask smaller than > * the entire root_domain to become SCHED_DEADLINE. We > * will also fail if there's no bandwidth available. 
> */ > - if (!cpumask_subset(span, p->cpus_ptr) || > - rq->rd->dl_bw.bw == 0) { > + if (!cpumask_subset(rd->span, p->cpus_ptr) || > + rd->dl_bw.bw == 0) { > retval = -EPERM; > goto unlock; > } > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c > index c221e14d5b86..1f6264cb8867 100644 > --- a/kernel/sched/deadline.c > +++ b/kernel/sched/deadline.c > @@ -2678,8 +2678,8 @@ int sched_dl_overflow(struct task_struct *p, int policy, > u64 period = attr->sched_period ?: attr->sched_deadline; > u64 runtime = attr->sched_runtime; > u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0; > - int cpus, err = -1, cpu = task_cpu(p); > - struct dl_bw *dl_b = dl_bw_of(cpu); > + int cpus, err = -1, cpu = dl_task_cpu(p); > + struct dl_bw *dl_b = dl_task_root_bw(p); > unsigned long cap; > > if (attr->sched_flags & SCHED_FLAG_SUGOV) > Wouldn't it be sufficient to just introduce dl_task_cpu() and use the correct cpu to get rd->span or struct dl_bw in __sched_setscheduler() -> sched_dl_overflow()? diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 5961a97541c2..0573f676696a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5905,7 +5905,8 @@ static int __sched_setscheduler(struct task_struct *p, #ifdef CONFIG_SMP if (dl_bandwidth_enabled() && dl_policy(policy) && !(attr->sched_flags & SCHED_FLAG_SUGOV)) { - cpumask_t *span = rq->rd->span; + int cpu = dl_task_cpu(p); + cpumask_t *span = cpu_rq(cpu)->rd->span; /* * Don't allow tasks with an affinity mask smaller than diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index c221e14d5b86..308ecaaf3d28 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2678,7 +2678,7 @@ int sched_dl_overflow(struct task_struct *p, int policy, u64 period = attr->sched_period ?: attr->sched_deadline; u64 runtime = attr->sched_runtime; u64 new_bw = dl_policy(policy) ? 
to_ratio(period, runtime) : 0; - int cpus, err = -1, cpu = task_cpu(p); + int cpus, err = -1, cpu = dl_task_cpu(p); struct dl_bw *dl_b = dl_bw_of(cpu); unsigned long cap;
On 1/14/21 1:12 PM, Dietmar Eggemann wrote:
> On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote:
>> cgroups v2 allows the cpuset controller to be enabled/disabled on
>> demand. On Fedora 32, cpuset is disabled by default. To enable it,
>> a user needs to:
>>
>> # cd /sys/fs/cgroup/
>> # echo +cpuset > cgroup.subtree_control
>>
>> Existing cgroups will expose the cpuset interface (e.g., cpuset.cpus
>> file). By default, cpuset.cpus has no CPU assigned, which means that
>> existing tasks will move to a cpuset without cpus.
>>
>> With that in mind, look what happens if a SCHED_DEADLINE task exists
>> on any cgroup (user.slice by default on Fedora):
>>
>> ----- %< -----
>> # chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 100 &
>
> Like you mentioned above, to see the issue the DL task has to be moved
> into the cgroup (e.g. user.slice) here:
>
> echo $PID > /sys/fs/cgroup/user.slice/cgroup.procs

Oops, I "forgot" to add this step because Fedora/systemd does it by
default. I will add this to the log.
>> # cd /sys/fs/cgroup/ >> # echo '+cpuset' > cgroup.subtree_control >> [ 65.384041] BUG: unable to handle page fault for address: ffffffffb720f7e0 >> [ 65.384551] #PF: supervisor read access in kernel mode >> [ 65.384923] #PF: error_code(0x0000) - not-present page >> [ 65.385298] PGD 61a15067 P4D 61a15067 PUD 61a16063 PMD 800fffff9ddff062 >> [ 65.385781] Oops: 0000 [#1] SMP PTI >> [ 65.386042] CPU: 0 PID: 799 Comm: sh Not tainted 5.10.0-rc3 #1 >> [ 65.386461] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 >> [ 65.387077] RIP: 0010:dl_task_can_attach+0x40/0x250 >> [ 65.387429] Code: 54 55 53 48 83 ec 18 48 89 3c 24 bf ff ff ff ff e8 05 a2 52 00 >> 4c 63 f0 48 c7 c5 00 9e 02 00 4a 8b 04 f5 00 09 47 b6 48 89 ea >> <4c> 8b a4 10 e0 09 00 00 49 8d 44 24 40 48 89 c7 48 89 44 24 >> 08 e8 >> [ 65.388768] RSP: 0018:ffffaee8c056fcd8 EFLAGS: 00010283 >> [ 65.389148] RAX: ffffffffb71e5000 RBX: ffffaee8c056fdd0 RCX: 0000000000000040 >> [ 65.389661] RDX: 0000000000029e00 RSI: ffff9db202534e48 RDI: ffffffffb6d3a3e0 >> [ 65.390174] RBP: 0000000000029e00 R08: 0000000000000000 R09: 0000000000000004 >> [ 65.390686] R10: 0000000000000001 R11: 00000000ffa6fbff R12: ffffaee8c056fbf0 >> [ 65.391196] R13: ffff9db2024e1400 R14: 0000000000000004 R15: ffff9db20ebb31e0 >> [ 65.391710] FS: 00007f6df41b1740(0000) GS:ffff9db377c00000(0000) knlGS:0000000000000000 >> [ 65.392289] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 65.392705] CR2: ffffffffb720f7e0 CR3: 000000010680a003 CR4: 0000000000370ef0 >> [ 65.393220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> [ 65.393732] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> [ 65.394244] Call Trace: >> [ 65.394437] cpuset_can_attach+0x8b/0x110 >> [ 65.394732] cgroup_migrate_execute+0x70/0x430 >> [ 65.395057] cgroup_update_dfl_csses+0x222/0x230 >> [ 65.395392] cgroup_subtree_control_write+0x2c6/0x3c0 >> [ 65.395759] kernfs_fop_write+0xce/0x1b0 >> [ 65.396048] 
vfs_write+0xc2/0x230
>> [ 65.396291] ksys_write+0x4f/0xc0
>> [ 65.396533] do_syscall_64+0x33/0x40
>> [ 65.396797] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [ 65.397166] RIP: 0033:0x7f6df42a6537
>> [ 65.397428] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f
>> 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05
>> <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89
>> 74 24
>> [ 65.398766] RSP: 002b:00007ffee4128018 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> [ 65.399838] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f6df42a6537
>> [ 65.400923] RDX: 0000000000000008 RSI: 000055b3f7e549e0 RDI: 0000000000000001
>> [ 65.402003] RBP: 000055b3f7e549e0 R08: 000000000000000a R09: 0000000000000007
>> [ 65.403082] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000000008
>> [ 65.404156] R13: 00007f6df4378500 R14: 0000000000000008 R15: 00007f6df4378700
>> [ 65.405218] Modules linked in: <lots of modules>
>> [ 65.414172] CR2: ffffffffb720f7e0
>> [ 65.415117] ---[ end trace 2dbff1a688549e65 ]---
>> ----- >% -----
>>
>> That happens because on dl_task_can_attach():
>> dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed);
>>
>> returns a non active cpu.
>
> Since cs_cpus_allowed is empty dest_cpu should be an invalid CPU (>=
> nr_cpu_ids) here.

Correct, I will change the "non active" to "an invalid CPU (>= nr_cpu_ids)"

>> Initially, I thought about returning an error and blocking the
>> operation. However, that is indeed not needed. The cpuset without
>> CPUs assigned will be a non-root cpuset, hence its cpu mask will
>> be the same as the root one. So, the bandwidth was already accounted,
>> and the task can proceed.
>
> LGTM.
>
> After the '/sys/fs/cgroup# echo '+cpuset' > cgroup.subtree_control':
>
> root's cpuset.cpus.effective == user.slice's cpuset.cpus.effective

Good, thanks!
-- Daniel >> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> >> Cc: Ingo Molnar <mingo@redhat.com> >> Cc: Peter Zijlstra <peterz@infradead.org> >> Cc: Juri Lelli <juri.lelli@redhat.com> >> Cc: Vincent Guittot <vincent.guittot@linaro.org> >> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> >> Cc: Steven Rostedt <rostedt@goodmis.org> >> Cc: Ben Segall <bsegall@google.com> >> Cc: Mel Gorman <mgorman@suse.de> >> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> >> Cc: Li Zefan <lizefan@huawei.com> >> Cc: Tejun Heo <tj@kernel.org> >> Cc: Johannes Weiner <hannes@cmpxchg.org> >> Cc: Valentin Schneider <valentin.schneider@arm.com> >> Cc: linux-kernel@vger.kernel.org >> Cc: cgroups@vger.kernel.org >> --- >> kernel/sched/deadline.c | 7 +++++++ >> 1 file changed, 7 insertions(+) >> >> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c >> index 943aa32cc1bc..788a391657a5 100644 >> --- a/kernel/sched/deadline.c >> +++ b/kernel/sched/deadline.c >> @@ -2871,6 +2871,13 @@ int dl_task_can_attach(struct task_struct *p, >> bool overflow; >> int ret; >> >> + /* >> + * The cpuset has no cpus assigned, so the thread will not >> + * change its affinity. >> + */ >> + if (cpumask_empty(cs_cpus_allowed)) >> + return 0; >> + >> /* >> * The task is not moving to another root domain, so it is >> * already accounted. >> >
On 1/15/21 3:36 PM, Dietmar Eggemann wrote: > On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote: > > [...] > >> ----- %< ----- >> #!/bin/bash >> # Enter on the cgroup directory >> cd /sys/fs/cgroup/ >> >> # Check it if is cgroup v2 and enable cpuset >> if [ -e cgroup.subtree_control ]; then >> # Enable cpuset controller on cgroup v2 >> echo +cpuset > cgroup.subtree_control >> fi >> >> echo LOG: create an exclusive cpuset and assigned the CPU 0 to it >> # Create cpuset groups >> rmdir dl-group &> /dev/null >> mkdir dl-group >> >> # Restrict the task to the CPU 0 >> echo 0 > dl-group/cpuset.mems >> echo 0 > dl-group/cpuset.cpus >> echo root > dl-group/cpuset.cpus.partition >> >> echo LOG: dispatching a regular task >> sleep 100 & >> CPUSET_PID="$!" >> >> # let it settle down >> sleep 1 >> >> # Assign the second task to the cgroup > > There is only one task 'CPUSET_PID' involved here? Ooops, yep, I will remove the "second" part. >> echo LOG: moving the second DL task to the cpuset >> echo "$CPUSET_PID" > dl-group/cgroup.procs 2> /dev/null >> >> CPUSET_ALLOWED=`cat /proc/$CPUSET_PID/status | grep Cpus_allowed_list | awk '{print $2}'` >> >> chrt -p -d --sched-period 1000000000 --sched-runtime 100000000 0 $CPUSET_PID >> ACCEPTED=$? >> >> if [ $ACCEPTED == 0 ]; then >> echo PASS: the task became DL >> else >> echo FAIL: the task was rejected as DL >> fi > > [...] > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index 5961a97541c2..3c2775e6869f 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -5905,15 +5905,15 @@ static int __sched_setscheduler(struct task_struct *p, >> #ifdef CONFIG_SMP >> if (dl_bandwidth_enabled() && dl_policy(policy) && >> !(attr->sched_flags & SCHED_FLAG_SUGOV)) { >> - cpumask_t *span = rq->rd->span; >> + struct root_domain *rd = dl_task_rd(p); >> >> /* >> * Don't allow tasks with an affinity mask smaller than >> * the entire root_domain to become SCHED_DEADLINE. 
We >> * will also fail if there's no bandwidth available. >> */ >> - if (!cpumask_subset(span, p->cpus_ptr) || >> - rq->rd->dl_bw.bw == 0) { >> + if (!cpumask_subset(rd->span, p->cpus_ptr) || >> + rd->dl_bw.bw == 0) { >> retval = -EPERM; >> goto unlock; >> } >> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c >> index c221e14d5b86..1f6264cb8867 100644 >> --- a/kernel/sched/deadline.c >> +++ b/kernel/sched/deadline.c >> @@ -2678,8 +2678,8 @@ int sched_dl_overflow(struct task_struct *p, int policy, >> u64 period = attr->sched_period ?: attr->sched_deadline; >> u64 runtime = attr->sched_runtime; >> u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0; >> - int cpus, err = -1, cpu = task_cpu(p); >> - struct dl_bw *dl_b = dl_bw_of(cpu); >> + int cpus, err = -1, cpu = dl_task_cpu(p); >> + struct dl_bw *dl_b = dl_task_root_bw(p); >> unsigned long cap; >> >> if (attr->sched_flags & SCHED_FLAG_SUGOV) >> > > Wouldn't it be sufficient to just introduce dl_task_cpu() and use the > correct cpu to get rd->span or struct dl_bw in __sched_setscheduler() > -> sched_dl_overflow()? While I think that having the dl_task_rd() makes the code more intuitive, I have no problem on not adding it... 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 5961a97541c2..0573f676696a 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -5905,7 +5905,8 @@ static int __sched_setscheduler(struct task_struct *p, > #ifdef CONFIG_SMP > if (dl_bandwidth_enabled() && dl_policy(policy) && > !(attr->sched_flags & SCHED_FLAG_SUGOV)) { > - cpumask_t *span = rq->rd->span; > + int cpu = dl_task_cpu(p); > + cpumask_t *span = cpu_rq(cpu)->rd->span; > > /* > * Don't allow tasks with an affinity mask smaller than > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c > index c221e14d5b86..308ecaaf3d28 100644 > --- a/kernel/sched/deadline.c > +++ b/kernel/sched/deadline.c > @@ -2678,7 +2678,7 @@ int sched_dl_overflow(struct task_struct *p, int policy, > u64 period = attr->sched_period ?: attr->sched_deadline; > u64 runtime = attr->sched_runtime; > u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0; > - int cpus, err = -1, cpu = task_cpu(p); > + int cpus, err = -1, cpu = dl_task_cpu(p); > struct dl_bw *dl_b = dl_bw_of(cpu); > unsigned long cap; > This way works for me too. Thanks! -- Daniel
On 1/14/21 4:51 PM, Dietmar Eggemann wrote:
> On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote:
>> The current SCHED_DEADLINE design supports only global scheduler,
>> or variants of it, i.e., clustered and partitioned, via cpuset config.
>> To enable the partitioning of a system with clusters of CPUs, the
>> documentation advises the usage of exclusive cpusets, creating an
>> exclusive root_domain for the cpuset.
>>
>> Attempts to change the cpu affinity of a thread to a cpu mask different
>> from the root domain results in an error. For instance:
>
> [...]
>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 788a391657a5..c221e14d5b86 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -2878,6 +2878,13 @@ int dl_task_can_attach(struct task_struct *p,
>> 	if (cpumask_empty(cs_cpus_allowed))
>> 		return 0;
>>
>> +	/*
>> +	 * Do not allow moving tasks to non-exclusive cpusets
>> +	 * if bandwidth control is enabled.
>> +	 */
>> +	if (dl_bandwidth_enabled() && !exclusive)
>> +		return -EBUSY;
>> +
>> 	/*
>> 	 * The task is not moving to another root domain, so it is
>> 	 * already accounted.
>>
>
> But doesn't this mean you only have to set this cgroup exclusive/root to
> run into the same issue:
>
> with this patch:
>
> cgroupv1:
>
> root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 1000000000
> --sched-runtime 100000000 0 sleep 500 &
> [1] 1668
> root@juno:/sys/fs/cgroup/cpuset# PID1=$!
>
> root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 1000000000
> --sched-runtime 100000000 0 sleep 500 &
> [2] 1669
> root@juno:/sys/fs/cgroup/cpuset# PID2=$!
>
> root@juno:/sys/fs/cgroup/cpuset# mkdir A
>
> root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.mems
> root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.cpus
>
> root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs
> -bash: echo: write error: Device or resource busy
>
> root@juno:/sys/fs/cgroup/cpuset# echo 1 > ./A/cpuset.cpu_exclusive
>
> root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs
>
> root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID1/status | grep
> Cpus_allowed_list | awk '{print $2}'
> 0-5
> root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID2/status | grep
> Cpus_allowed_list | awk '{print $2}'
> 0

On cgroup v1 we also need to disable load balancing to create a root
domain, right?

> cgroupv2:

Yeah, I see your point. I was seeing a different output because of
Fedora's default behavior of adding tasks to system.slice/user.slice...

doing:

> root@juno:/sys/fs/cgroup# echo +cpuset > cgroup.subtree_control

# echo $$ > cgroup.procs

> root@juno:/sys/fs/cgroup# chrt -d --sched-period 1000000000
> --sched-runtime 100000000 0 sleep 500 &
> [1] 1687
> root@juno:/sys/fs/cgroup# PID1=$!
>
> root@juno:/sys/fs/cgroup# chrt -d --sched-period 1000000000
> --sched-runtime 100000000 0 sleep 500 &
> [2] 1688
> root@juno:/sys/fs/cgroup# PID2=$!
>
> root@juno:/sys/fs/cgroup# mkdir A
>
> root@juno:/sys/fs/cgroup# echo 0 > ./A/cpuset.mems
> root@juno:/sys/fs/cgroup# echo 0 > ./A/cpuset.cpus
>
> root@juno:/sys/fs/cgroup# echo $PID2 > ./A/cgroup.procs
> -bash: echo: write error: Device or resource busy
>
> root@juno:/sys/fs/cgroup# echo root > ./A/cpuset.cpus.partition
>
> root@juno:/sys/fs/cgroup# echo $PID2 > ./A/cgroup.procs
>
> root@juno:/sys/fs/cgroup# cat /proc/$PID1/status | grep
> Cpus_allowed_list | awk '{print $2}'
> 0-5
> root@juno:/sys/fs/cgroup# cat /proc/$PID2/status | grep
> Cpus_allowed_list | awk '{print $2}'
> 0

makes me see the same behavior. This will require further analysis.
So, my plan now is to split this patch set into two: one with patches
1, 3, 5, and 6, which fix the easiest part of the problems, and another
with 2 and 4, which will require further investigation (I discussed
this with Dietmar already).

Thoughts?

--
Daniel
On 19/01/2021 10:41, Daniel Bristot de Oliveira wrote: > On 1/14/21 4:51 PM, Dietmar Eggemann wrote: >> On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote: [...] >> with this patch: >> >> cgroupv1: >> >> root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 1000000000 >> --sched-runtime 100000000 0 sleep 500 & >> [1] 1668 >> root@juno:/sys/fs/cgroup/cpuset# PID1=$! >> >> root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 1000000000 >> --sched-runtime 100000000 0 sleep 500 & >> [2] 1669 >> root@juno:/sys/fs/cgroup/cpuset# PID2=$! >> >> root@juno:/sys/fs/cgroup/cpuset# mkdir A >> >> root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.mems >> root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.cpus >> >> root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs >> -bash: echo: write error: Device or resource busy >> >> root@juno:/sys/fs/cgroup/cpuset# echo 1 > ./A/cpuset.cpu_exclusive >> >> root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs >> >> root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID1/status | grep >> Cpus_allowed_list | awk '{print $2}' >> 0-5 >> root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID2/status | grep >> Cpus_allowed_list | awk '{print $2}' >> 0 > > On CPU v1 we also need to disable the load balance to create a root domain, right? IMHO, that's not necessary for this example. But yes, if we create 2 exclusive cpusets A and B we want to turn off load-balancing on root level. It also doesn't hurt doing this in this example. But we end up with no sched domain since load-balance is disabled at root and A only contains CPU0. root@juno:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance ls /proc/sys/kernel/sched_domain/cpu*/ doesn't show any (sched) domains. >> cgroupv2: > > Yeah, I see your point. I was seeing a different output because of Fedora > default's behavior of adding the tasks to the system.slice/user.slice... 
>
> doing:
>
>> root@juno:/sys/fs/cgroup# echo +cpuset > cgroup.subtree_control
>
> # echo $$ > cgroup.procs

The current shell should already be in the root cgroup?

root@juno:/sys/fs/cgroup# echo $$
1644
root@juno:/sys/fs/cgroup# cat cgroup.procs | grep $$
1644

[...]
Hi, On 12/01/21 16:53, Daniel Bristot de Oliveira wrote: > cgroups v2 allows the cpuset controller to be enabled/disabled on > demand. On Fedora 32, cpuset is disabled by default. To enable it, > a user needs to: > > # cd /sys/fs/cgroup/ > # echo +cpuset > cgroup.subtree_control > > Existing cgroups will expose the cpuset interface (e.g., cpuset.cpus > file). By default, cpuset.cpus has no CPU assigned, which means that > existing tasks will move to a cpuset without cpus. This is kind of confusing, though. Isn't it? > Initially, I thought about returning an error and blocking the > operation. However, that is indeed not needed. The cpuset without > CPUs assigned will be a non-root cpuset, hence its cpu mask will > be the same as the root one. So, the bandwidth was already accounted, > and the task can proceed. > > Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Juri Lelli <juri.lelli@redhat.com> > Cc: Vincent Guittot <vincent.guittot@linaro.org> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Ben Segall <bsegall@google.com> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com> > Cc: Li Zefan <lizefan@huawei.com> > Cc: Tejun Heo <tj@kernel.org> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Valentin Schneider <valentin.schneider@arm.com> > Cc: linux-kernel@vger.kernel.org > Cc: cgroups@vger.kernel.org > --- > kernel/sched/deadline.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c > index 943aa32cc1bc..788a391657a5 100644 > --- a/kernel/sched/deadline.c > +++ b/kernel/sched/deadline.c > @@ -2871,6 +2871,13 @@ int dl_task_can_attach(struct task_struct *p, > bool overflow; > int ret; > > + /* > + * The cpuset has no cpus assigned, so the thread will not > + * change its affinity. 
Is this always the case also in the presence of deeper hierarchies?

Thanks,
Juri