* [resend RFC 0/3] core scheduling: add PR_SCHED_CORE_SHARE
@ 2022-01-24 10:52 Christian Brauner
  2022-01-24 10:52 ` [resend RFC 1/3] pid: introduce task_by_pid() Christian Brauner
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Christian Brauner @ 2022-01-24 10:52 UTC (permalink / raw)
  To: Joel Fernandes, Chris Hyser, Daniel Bristot de Oliveira,
	Peter Zijlstra, linux-kernel
  Cc: Peter Collingbourne, Dietmar Eggemann, Thomas Gleixner,
	Mel Gorman, Vincent Guittot, Juri Lelli, Catalin Marinas,
	Ingo Molnar, Steven Rostedt, Ben Segall,
	Sebastian Andrzej Siewior, Balbir Singh, Christian Brauner

Hey everyone,

This adds the new PR_SCHED_CORE prctl() command PR_SCHED_CORE_SHARE to
allow a third process to pull a core scheduling domain from one task and
push it to another task.

The core scheduling uapi is exposed via the PR_SCHED_CORE option of the
prctl() system call. Two commands can be used to alter the core
scheduling domain of a task:

1. PR_SCHED_CORE_SHARE_TO
   This command takes the cookie for the caller's core scheduling domain
   and applies it to a target task identified by passing a pid.

2. PR_SCHED_CORE_SHARE_FROM
   This command takes the cookie for a task's core scheduling domain and
   applies it to the calling task.
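
For reference, a minimal userspace sketch of how these two existing
commands are used, with the definitions from <linux/prctl.h>
(target_pid and source_pid are placeholder names, error handling
omitted):

/* Push the caller's core scheduling cookie to target_pid's thread group. */
prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, target_pid,
      PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0);

/* Pull source_pid's cookie and apply it to the calling thread only. */
prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, source_pid,
      PR_SCHED_CORE_SCOPE_THREAD, 0);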

While these commands cover nearly all use-cases, they are rather
inconvenient for some common ones. A vm/container manager often
supervises a large number of vms/containers:

                               vm/container manager

vm-supervisor-1    container-supervisor-1    vm-supervisor-2    container-supervisor-2

None of the vms/containers are its immediate children.

For container managers each container often has a separate supervising
process which is the parent of the container's workload. In the example
below the supervising process is "[lxc monitor]" and the workload is
"/sbin/init" and all its descendant processes:

├─[lxc monitor] /var/lib/lxd/containers imp1
│   └─systemd
│       ├─agetty -o -p -- \\u --noclear --keep-baud console 115200,38400,9600 linux
│       ├─cron -f -P
│       ├─dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
│       ├─networkd-dispat /usr/bin/networkd-dispatcher --run-startup-triggers
│       ├─rsyslogd -n -iNONE
│       │   ├─{rsyslogd}
│       │   └─{rsyslogd}
│       ├─systemd-journal
│       ├─systemd-logind
│       ├─systemd-network
│       ├─systemd-resolve
│       └─systemd-udevd

Similar in spirit but different in layout, a vm often has a supervising
process and multiple threads, one for each vcpu:

├─qemu-system-x86 -S -name f2-vm [...]
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   └─{qemu-system-x86}

So ultimately an approximation of that layout would be:

                               vm/container manager

vm-supervisor-1    container-supervisor-1    vm-supervisor-2    container-supervisor-2
       |                      |                    |                      |
     vcpus                 workload              vcpus                 workload
                          (/sbin/init)                               (/sbin/init)

For containers a new core scheduling domain is allocated for the init
process. Any descendant processes and threads init spawns will
automatically inherit the correct core scheduling domain.

For vms a new core scheduling domain is allocated and each vcpu thread
will be made to join the new core scheduling domain.

Whenever the tool or library that we use to run containers or vms
exposes an option to automatically create a new core scheduling domain
we will make use of it. However, that is not always the case. In such
cases the vm/container manager will need to allocate and set the core
scheduling domain for the relevant processes or threads.

Neither the vm/container manager nor the individual vm/container
supervisors are supposed to run in a core scheduling domain shared
with the respective vcpus/workloads.

So in order to create a new core scheduling domain we need to fork()
off a helper process which allocates the core scheduling domain and
then pushes its cookie to the relevant vcpus/workloads.
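
A minimal sketch of such a helper for the vm case, assuming the manager
already knows the vcpu thread ids (vcpu_tids/nr_vcpus are illustrative
names; error handling and the required ptrace access are omitted):

helper = fork();
if (helper == 0) {
        /* Allocate a new cookie for this helper thread only. */
        prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
              PR_SCHED_CORE_SCOPE_THREAD, 0);

        /* Push the helper's cookie to each vcpu thread. For the
         * container case the same push would target the workload's
         * thread group instead. */
        for (i = 0; i < nr_vcpus; i++)
                prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, vcpu_tids[i],
                      PR_SCHED_CORE_SCOPE_THREAD, 0);

        _exit(0);
}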

This works but things get rather tricky, especially for containers, when
a new process is supposed to be spawned into a running container.
Creating a new process inside a running container involves a few
important steps:

- getting a handle on the container's init process (pid or nowadays
  often a pidfd)
- getting a handle on the container's namespaces (namespace file
  descriptors reachable via /proc/<init-pid>/ns/<ns-type> or nowadays
  often a pidfd)
- calling setns() either on each namespace file descriptor individually
  or on the pidfd of the init process (a minimal sketch follows below)
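
A minimal sketch of these steps using a pidfd (this assumes a kernel
where setns() accepts a pidfd together with a set of namespace flags;
container_init_pid is an illustrative name, error handling omitted):

pidfd = syscall(SYS_pidfd_open, container_init_pid, 0);

/* Attach to several of the container's namespaces in one call. Note
 * that CLONE_NEWPID only takes effect for children created after the
 * setns() call, not for the caller itself. */
setns(pidfd, CLONE_NEWPID | CLONE_NEWNS);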

An important sub-step here is to attach to the container's pid namespace
via setns(). After attaching to the container's pid namespace any
process created via fork()-like system calls will be a full member of
the container's pid namespace.

So attaching often involves two child processes. The first child simply
attaches to the namespaces of the container, including the container's
pid namespace, and then fork()s a second child which ultimately exec()s,
thereby guaranteeing that the newly created process is a full member of
the container's pid namespace:

first_child = fork();
if (first_child == 0) {
        setns(CLONE_NEWPID);

        second_child = fork();
        if (second_child == 0) {
                execlp();
        }
}

As part of this we also need to make sure that the second child - the
one ultimately exec()ing the relevant program in an already running
container - joins the core scheduling domain of the container. When the
container runs in a new pid namespace this can usually be done by
calling:

first_child = fork();
if (first_child == 0) {
        setns(CLONE_NEWPID);

        second_child = fork();
        if (second_child == 0) {
                prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM,
                      1, PR_SCHED_CORE_SCOPE_THREAD, 0);

                execlp();
        }
}

from the second child since we know that pid 1 in a container running
inside of a separate pid namespace is the correct process to get the
core scheduling domain from.

However, this doesn't work when the container does not run in a separate
pid namespace or when it shares the pid namespace with another
container. In these scenarios we can't simply call
PR_SCHED_CORE_SHARE_FROM from the second child since we don't know the
correct pid number to call it on in the pid namespace.

(Note it is of course possible to learn the pid of the process in the
relevant pid namespace, but it is rather complex, involving three
separate processes and an AF_UNIX domain socket over which to send a
message including a struct ucred from which to learn the relevant pid.
That doesn't work in all cases anyway since it requires privileges to
translate arbitrary pids, and it is not an option for performance
reasons alone. However, I do also have a separate patchset in [1]
allowing translation of pids between pid namespaces which will help
with that in the future - something I had discussed with Joel a while
back but haven't pushed for since implementing it in early 2020. Both
patchsets are useful independently of one another.)

Additionally, we ideally always want to manage the core scheduling
domain from the first child since the first child knows the pids for the
relevant processes in its current pid namespace. The first child knows
the pid of the init process in the current pid namespace from which to
pull the core scheduling domain and it knows the pid of the second child
it created to which to apply the core scheduling domain.

The core scheduling domain of the first child needs to be unaffected as
it might run sensitive codepaths that should not be exposed to SMT attacks.

The new PR_SCHED_CORE_SHARE command for the PR_SCHED_CORE prctl() option
supports this and other use-cases by making it possible for a third
managing task to pull the core scheduling domain from a task identified
via its pid and push it to another task identified via its pid:

prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE,
      <pid-to-which-to-apply-coresched-domain>,
      PR_SCHED_CORE_SCOPE_{THREAD,THREAD_GROUP,PROCESS_GROUP},
      <pid-from-which-to-take-coresched-domain>)

In order to use PR_SCHED_CORE_SHARE the caller must have
ptrace_may_access() rights to both the task from which to take the core
scheduling domain and to the task to which to apply the core scheduling
domain. If the caller passes zero as the 5th argument then its own core
scheduling domain is applied to the target, making the option adhere to
regular prctl() semantics.
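
Concretely, in the container scenario above the first child could use
the new command like this (init_pid and second_child are the pids as
seen in the first child's current pid namespace; the names are
illustrative, error handling omitted):

prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE,
      second_child,                 /* task the cookie is applied to */
      PR_SCHED_CORE_SCOPE_THREAD,   /* scope on the target side */
      init_pid);                    /* task the cookie is taken from */

The first child's own core scheduling domain is left untouched.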

[1]: https://git.kernel.org/brauner/h/ioctl_ns_get_init_pid
     https://git.kernel.org/brauner/c/1ad81fd698dd7e6511c3db422eba42dec3ce1b08

Thanks!
Christian

Christian Brauner (3):
  pid: introduce task_by_pid()
  sched/prctl: add PR_SCHED_CORE_SHARE command
  tests: add new PR_SCHED_CORE_SHARE test

 arch/mips/kernel/mips-mt-fpaff.c              | 14 +-----
 arch/x86/kernel/cpu/resctrl/rdtgroup.c        | 19 +++-----
 block/ioprio.c                                | 10 +----
 include/linux/sched.h                         |  9 +++-
 include/uapi/linux/prctl.h                    |  3 +-
 kernel/cgroup/cgroup.c                        | 12 ++---
 kernel/events/core.c                          |  5 +--
 kernel/futex/syscalls.c                       | 20 +++------
 kernel/pid.c                                  |  5 +++
 kernel/sched/core.c                           | 27 ++++--------
 kernel/sched/core_sched.c                     | 44 ++++++++++++++-----
 kernel/sys.c                                  | 12 ++---
 mm/mempolicy.c                                |  2 +-
 tools/testing/selftests/sched/cs_prctl_test.c | 23 ++++++++++
 14 files changed, 105 insertions(+), 100 deletions(-)


base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
-- 
2.32.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [resend RFC 1/3] pid: introduce task_by_pid()
  2022-01-24 10:52 [resend RFC 0/3] core scheduling: add PR_SCHED_CORE_SHARE Christian Brauner
@ 2022-01-24 10:52 ` Christian Brauner
  2022-01-26 16:56   ` Tejun Heo
  2022-01-24 10:52 ` [resend RFC 2/3] sched/prctl: add PR_SCHED_CORE_SHARE command Christian Brauner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2022-01-24 10:52 UTC (permalink / raw)
  To: Joel Fernandes, Chris Hyser, Daniel Bristot de Oliveira,
	Peter Zijlstra, linux-kernel
  Cc: Peter Collingbourne, Dietmar Eggemann, Thomas Gleixner,
	Mel Gorman, Vincent Guittot, Juri Lelli, Catalin Marinas,
	Ingo Molnar, Steven Rostedt, Ben Segall,
	Sebastian Andrzej Siewior, Balbir Singh, Christian Brauner,
	Jens Axboe, Tejun Heo

We have a lot of places that open code

if (who)
        p = find_task_by_vpid(who);
else
        p = current;

Introduce a simpler helper which can be used instead.
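
The helper keeps the calling convention of find_task_by_vpid(): the
lookup happens under rcu_read_lock() and the returned task has no
reference taken, so callers that use the task after dropping the lock
grab a reference first. A sketch of the typical pattern (mirroring the
conversions below):

rcu_read_lock();
p = task_by_pid(pid);   /* pid == 0 means current */
if (p)
        get_task_struct(p);     /* needed if used after the RCU section */
rcu_read_unlock();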

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 arch/mips/kernel/mips-mt-fpaff.c       | 14 ++-----------
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 19 +++++++-----------
 block/ioprio.c                         | 10 ++--------
 include/linux/sched.h                  |  7 +++++++
 kernel/cgroup/cgroup.c                 | 12 ++++--------
 kernel/events/core.c                   |  5 +----
 kernel/futex/syscalls.c                | 20 ++++++-------------
 kernel/pid.c                           |  5 +++++
 kernel/sched/core.c                    | 27 ++++++++------------------
 kernel/sched/core_sched.c              | 12 ++++--------
 kernel/sys.c                           | 12 +++---------
 mm/mempolicy.c                         |  2 +-
 12 files changed, 50 insertions(+), 95 deletions(-)

diff --git a/arch/mips/kernel/mips-mt-fpaff.c b/arch/mips/kernel/mips-mt-fpaff.c
index 67e130d3f038..53c8a56815ea 100644
--- a/arch/mips/kernel/mips-mt-fpaff.c
+++ b/arch/mips/kernel/mips-mt-fpaff.c
@@ -33,16 +33,6 @@ unsigned long mt_fpemul_threshold;
  * updated when kernel/sched/core.c changes.
  */
 
-/*
- * find_process_by_pid - find a process with a matching PID value.
- * used in sys_sched_set/getaffinity() in kernel/sched/core.c, so
- * cloned here.
- */
-static inline struct task_struct *find_process_by_pid(pid_t pid)
-{
-	return pid ? find_task_by_vpid(pid) : current;
-}
-
 /*
  * check the target process has a UID that matches the current process's
  */
@@ -79,7 +69,7 @@ asmlinkage long mipsmt_sys_sched_setaffinity(pid_t pid, unsigned int len,
 	cpus_read_lock();
 	rcu_read_lock();
 
-	p = find_process_by_pid(pid);
+	p = task_by_pid(pid);
 	if (!p) {
 		rcu_read_unlock();
 		cpus_read_unlock();
@@ -170,7 +160,7 @@ asmlinkage long mipsmt_sys_sched_getaffinity(pid_t pid, unsigned int len,
 	rcu_read_lock();
 
 	retval = -ESRCH;
-	p = find_process_by_pid(pid);
+	p = task_by_pid(pid);
 	if (!p)
 		goto out_unlock;
 	retval = security_task_getscheduler(p);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b57b3db9a6a7..577d0ffebb9d 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -660,19 +660,14 @@ static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp,
 	int ret;
 
 	rcu_read_lock();
-	if (pid) {
-		tsk = find_task_by_vpid(pid);
-		if (!tsk) {
-			rcu_read_unlock();
-			rdt_last_cmd_printf("No task %d\n", pid);
-			return -ESRCH;
-		}
-	} else {
-		tsk = current;
-	}
-
-	get_task_struct(tsk);
+	tsk = task_by_pid(pid);
+	if (tsk)
+		get_task_struct(tsk);
 	rcu_read_unlock();
+	if (!tsk) {
+		rdt_last_cmd_printf("No task %d\n", pid);
+		return -ESRCH;
+	}
 
 	ret = rdtgroup_task_write_permission(tsk, of);
 	if (!ret)
diff --git a/block/ioprio.c b/block/ioprio.c
index 2fe068fcaad5..934e96cd495b 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -81,10 +81,7 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
 	rcu_read_lock();
 	switch (which) {
 		case IOPRIO_WHO_PROCESS:
-			if (!who)
-				p = current;
-			else
-				p = find_task_by_vpid(who);
+			p = task_by_pid(who);
 			if (p)
 				ret = set_task_ioprio(p, ioprio);
 			break;
@@ -176,10 +173,7 @@ SYSCALL_DEFINE2(ioprio_get, int, which, int, who)
 	rcu_read_lock();
 	switch (which) {
 		case IOPRIO_WHO_PROCESS:
-			if (!who)
-				p = current;
-			else
-				p = find_task_by_vpid(who);
+			p = task_by_pid(who);
 			if (p)
 				ret = get_task_ioprio(p);
 			break;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 508b91d57470..0408372594dd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1912,6 +1912,13 @@ extern unsigned long init_stack[THREAD_SIZE / sizeof(unsigned long)];
  */
 
 extern struct task_struct *find_task_by_vpid(pid_t nr);
+/**
+ * task_by_pid - find a process with a matching PID value.
+ * @pid: the pid in question.
+ *
+ * The task of @pid, if found. %NULL otherwise.
+ */
+extern struct task_struct *task_by_pid(pid_t nr);
 extern struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns);
 
 /*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b31e1465868a..3fddd5003a2b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2839,14 +2839,10 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	}
 
 	rcu_read_lock();
-	if (pid) {
-		tsk = find_task_by_vpid(pid);
-		if (!tsk) {
-			tsk = ERR_PTR(-ESRCH);
-			goto out_unlock_threadgroup;
-		}
-	} else {
-		tsk = current;
+	tsk = task_by_pid(pid);
+	if (!tsk) {
+		tsk = ERR_PTR(-ESRCH);
+		goto out_unlock_threadgroup;
 	}
 
 	if (threadgroup)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index fc18664f49b0..9f9ea469f1d1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4608,10 +4608,7 @@ find_lively_task_by_vpid(pid_t vpid)
 	struct task_struct *task;
 
 	rcu_read_lock();
-	if (!vpid)
-		task = current;
-	else
-		task = find_task_by_vpid(vpid);
+	task = task_by_pid(vpid);
 	if (task)
 		get_task_struct(task);
 	rcu_read_unlock();
diff --git a/kernel/futex/syscalls.c b/kernel/futex/syscalls.c
index 086a22d1adb7..76b5c5389214 100644
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -57,13 +57,9 @@ SYSCALL_DEFINE3(get_robust_list, int, pid,
 	rcu_read_lock();
 
 	ret = -ESRCH;
-	if (!pid)
-		p = current;
-	else {
-		p = find_task_by_vpid(pid);
-		if (!p)
-			goto err_unlock;
-	}
+	p = task_by_pid(pid);
+	if (!p)
+		goto err_unlock;
 
 	ret = -EPERM;
 	if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
@@ -326,13 +322,9 @@ COMPAT_SYSCALL_DEFINE3(get_robust_list, int, pid,
 	rcu_read_lock();
 
 	ret = -ESRCH;
-	if (!pid)
-		p = current;
-	else {
-		p = find_task_by_vpid(pid);
-		if (!p)
-			goto err_unlock;
-	}
+	p = task_by_pid(pid);
+	if (!p)
+		goto err_unlock;
 
 	ret = -EPERM;
 	if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
diff --git a/kernel/pid.c b/kernel/pid.c
index 2fc0a16ec77b..1cd82fa58273 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -422,6 +422,11 @@ struct task_struct *find_task_by_vpid(pid_t vnr)
 	return find_task_by_pid_ns(vnr, task_active_pid_ns(current));
 }
 
+struct task_struct *task_by_pid(pid_t nr)
+{
+	return nr ? find_task_by_vpid(nr) : current;
+}
+
 struct task_struct *find_get_task_by_vpid(pid_t nr)
 {
 	struct task_struct *task;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2e4ae00e52d1..196543f0c39a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7174,17 +7174,6 @@ unsigned long sched_cpu_util(int cpu, unsigned long max)
 }
 #endif /* CONFIG_SMP */
 
-/**
- * find_process_by_pid - find a process with a matching PID value.
- * @pid: the pid in question.
- *
- * The task of @pid, if found. %NULL otherwise.
- */
-static struct task_struct *find_process_by_pid(pid_t pid)
-{
-	return pid ? find_task_by_vpid(pid) : current;
-}
-
 /*
  * sched_setparam() passes in -1 for its policy, to let the functions
  * it calls know not to change it.
@@ -7627,7 +7616,7 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 
 	rcu_read_lock();
 	retval = -ESRCH;
-	p = find_process_by_pid(pid);
+	p = task_by_pid(pid);
 	if (likely(p))
 		get_task_struct(p);
 	rcu_read_unlock();
@@ -7750,7 +7739,7 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
 
 	rcu_read_lock();
 	retval = -ESRCH;
-	p = find_process_by_pid(pid);
+	p = task_by_pid(pid);
 	if (likely(p))
 		get_task_struct(p);
 	rcu_read_unlock();
@@ -7782,7 +7771,7 @@ SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid)
 
 	retval = -ESRCH;
 	rcu_read_lock();
-	p = find_process_by_pid(pid);
+	p = task_by_pid(pid);
 	if (p) {
 		retval = security_task_getscheduler(p);
 		if (!retval)
@@ -7811,7 +7800,7 @@ SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param)
 		return -EINVAL;
 
 	rcu_read_lock();
-	p = find_process_by_pid(pid);
+	p = task_by_pid(pid);
 	retval = -ESRCH;
 	if (!p)
 		goto out_unlock;
@@ -7894,7 +7883,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 		return -EINVAL;
 
 	rcu_read_lock();
-	p = find_process_by_pid(pid);
+	p = task_by_pid(pid);
 	retval = -ESRCH;
 	if (!p)
 		goto out_unlock;
@@ -8003,7 +7992,7 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 
 	rcu_read_lock();
 
-	p = find_process_by_pid(pid);
+	p = task_by_pid(pid);
 	if (!p) {
 		rcu_read_unlock();
 		return -ESRCH;
@@ -8082,7 +8071,7 @@ long sched_getaffinity(pid_t pid, struct cpumask *mask)
 	rcu_read_lock();
 
 	retval = -ESRCH;
-	p = find_process_by_pid(pid);
+	p = task_by_pid(pid);
 	if (!p)
 		goto out_unlock;
 
@@ -8482,7 +8471,7 @@ static int sched_rr_get_interval(pid_t pid, struct timespec64 *t)
 
 	retval = -ESRCH;
 	rcu_read_lock();
-	p = find_process_by_pid(pid);
+	p = task_by_pid(pid);
 	if (!p)
 		goto out_unlock;
 
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index 1fb45672ec85..0c40445337c5 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -148,14 +148,10 @@ int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
 		return -EINVAL;
 
 	rcu_read_lock();
-	if (pid == 0) {
-		task = current;
-	} else {
-		task = find_task_by_vpid(pid);
-		if (!task) {
-			rcu_read_unlock();
-			return -ESRCH;
-		}
+	task = task_by_pid(pid);
+	if (!task) {
+		rcu_read_unlock();
+		return -ESRCH;
 	}
 	get_task_struct(task);
 	rcu_read_unlock();
diff --git a/kernel/sys.c b/kernel/sys.c
index ecc4cf019242..9460e2eefaad 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -222,10 +222,7 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 	rcu_read_lock();
 	switch (which) {
 	case PRIO_PROCESS:
-		if (who)
-			p = find_task_by_vpid(who);
-		else
-			p = current;
+		p = task_by_pid(who);
 		if (p)
 			error = set_one_prio(p, niceval, error);
 		break;
@@ -285,10 +282,7 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
 	rcu_read_lock();
 	switch (which) {
 	case PRIO_PROCESS:
-		if (who)
-			p = find_task_by_vpid(who);
-		else
-			p = current;
+		p = task_by_pid(who);
 		if (p) {
 			niceval = nice_to_rlimit(task_nice(p));
 			if (niceval > retval)
@@ -1659,7 +1653,7 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
 	}
 
 	rcu_read_lock();
-	tsk = pid ? find_task_by_vpid(pid) : current;
+	tsk = task_by_pid(pid);
 	if (!tsk) {
 		rcu_read_unlock();
 		return -ESRCH;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 028e8dd82b44..c113e274204a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1613,7 +1613,7 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
 
 	/* Find the mm_struct */
 	rcu_read_lock();
-	task = pid ? find_task_by_vpid(pid) : current;
+	task = task_by_pid(pid);
 	if (!task) {
 		rcu_read_unlock();
 		err = -ESRCH;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [resend RFC 2/3] sched/prctl: add PR_SCHED_CORE_SHARE command
  2022-01-24 10:52 [resend RFC 0/3] core scheduling: add PR_SCHED_CORE_SHARE Christian Brauner
  2022-01-24 10:52 ` [resend RFC 1/3] pid: introduce task_by_pid() Christian Brauner
@ 2022-01-24 10:52 ` Christian Brauner
  2022-01-24 17:25   ` Joel Fernandes
  2022-01-25  0:31   ` Josh Don
  2022-01-24 10:52 ` [resend RFC 3/3] tests: add new PR_SCHED_CORE_SHARE test Christian Brauner
  2022-01-24 17:25 ` [resend RFC 0/3] core scheduling: add PR_SCHED_CORE_SHARE Joel Fernandes
  3 siblings, 2 replies; 9+ messages in thread
From: Christian Brauner @ 2022-01-24 10:52 UTC (permalink / raw)
  To: Joel Fernandes, Chris Hyser, Daniel Bristot de Oliveira,
	Peter Zijlstra, linux-kernel
  Cc: Peter Collingbourne, Dietmar Eggemann, Thomas Gleixner,
	Mel Gorman, Vincent Guittot, Juri Lelli, Catalin Marinas,
	Ingo Molnar, Steven Rostedt, Ben Segall,
	Sebastian Andrzej Siewior, Balbir Singh, Christian Brauner

This adds the new PR_SCHED_CORE prctl() command PR_SCHED_CORE_SHARE to
allow a third process to pull a core scheduling domain from one task and
push it to another task.

The core scheduling uapi is exposed via the PR_SCHED_CORE option of the
prctl() system call. Two commands can be used to alter the core
scheduling domain of a task:

1. PR_SCHED_CORE_SHARE_TO
   This command takes the cookie for the caller's core scheduling domain
   and applies it to a target task identified by passing a pid.

2. PR_SCHED_CORE_SHARE_FROM
   This command takes the cookie for a task's core scheduling domain and
   applies it to the calling task.

While these commands cover nearly all use-cases, they are rather
inconvenient for some common ones. A vm/container manager often
supervises a large number of vms/containers:

                               vm/container manager

vm-supervisor-1    container-supervisor-1    vm-supervisor-2    container-supervisor-2

None of the vms/containers are its immediate children.

For container managers each container often has a separate supervising
process which is the parent of the container's workload. In the example
below the supervising process is "[lxc monitor]" and the workload is
"/sbin/init" and all its descendant processes:

├─[lxc monitor] /var/lib/lxd/containers imp1
│   └─systemd
│       ├─agetty -o -p -- \\u --noclear --keep-baud console 115200,38400,9600 linux
│       ├─cron -f -P
│       ├─dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
│       ├─networkd-dispat /usr/bin/networkd-dispatcher --run-startup-triggers
│       ├─rsyslogd -n -iNONE
│       │   ├─{rsyslogd}
│       │   └─{rsyslogd}
│       ├─systemd-journal
│       ├─systemd-logind
│       ├─systemd-network
│       ├─systemd-resolve
│       └─systemd-udevd

Similar in spirit but different in layout, a vm often has a supervising
process and multiple threads, one for each vcpu:

├─qemu-system-x86 -S -name f2-vm [...]
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   ├─{qemu-system-x86}
│   └─{qemu-system-x86}

So ultimately an approximation of that layout would be:

                               vm/container manager

vm-supervisor-1    container-supervisor-1    vm-supervisor-2    container-supervisor-2
       |                      |                    |                      |
     vcpus                 workload              vcpus                 workload
                          (/sbin/init)                               (/sbin/init)

For containers a new core scheduling domain is allocated for the init
process. Any descendant processes and threads init spawns will
automatically inherit the correct core scheduling domain.

For vms a new core scheduling domain is allocated and each vcpu thread
will be made to join the new core scheduling domain.

Whenever the tool or library that we use to run containers or vms
exposes an option to automatically create a new core scheduling domain
we will make use of it. However, that is not always the case. In such
cases the vm/container manager will need to allocate and set the core
scheduling domain for the relevant processes or threads.

Neither the vm/container manager nor the individual vm/container
supervisors are supposed to run in a core scheduling domain shared
with the respective vcpus/workloads.

So in order to create a new core scheduling domain we need to fork()
off a helper process which allocates the core scheduling domain and
then pushes its cookie to the relevant vcpus/workloads.

This works but things get rather tricky, especially for containers, when
a new process is supposed to be spawned into a running container.
Creating a new process inside a running container involves a few
important steps:

- getting a handle on the container's init process (pid or nowadays
  often a pidfd)
- getting a handle on the container's namespaces (namespace file
  descriptors reachable via /proc/<init-pid>/ns/<ns-type> or nowadays
  often a pidfd)
- calling setns() either on each namespace file descriptor individually
  or on the pidfd of the init process

An important sub-step here is to attach to the container's pid namespace
via setns(). After attaching to the container's pid namespace any
process created via fork()-like system calls will be a full member of
the container's pid namespace.

So attaching often involves two child processes. The first child simply
attaches to the namespaces of the container, including the container's
pid namespace, and then fork()s a second child which ultimately exec()s,
thereby guaranteeing that the newly created process is a full member of
the container's pid namespace:

first_child = fork();
if (first_child == 0) {
        setns(CLONE_NEWPID);

        second_child = fork();
        if (second_child == 0) {
                execlp();
        }
}

As part of this we also need to make sure that the second child - the
one ultimately exec()ing the relevant program in an already running
container - joins the core scheduling domain of the container. When the
container runs in a new pid namespace this can usually be done by
calling:

first_child = fork();
if (first_child == 0) {
        setns(CLONE_NEWPID);

        second_child = fork();
        if (second_child == 0) {
                prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM,
                      1, PR_SCHED_CORE_SCOPE_THREAD, 0);

                execlp();
        }
}

from the second child since we know that pid 1 in a container running
inside of a separate pid namespace is the correct process to get the
core scheduling domain from.

However, this doesn't work when the container does not run in a separate
pid namespace or when it shares the pid namespace with another
container. In these scenarios we can't simply call
PR_SCHED_CORE_SHARE_FROM from the second child since we don't know the
correct pid number to call it on in the pid namespace.

(Note it is of course possible to learn the pid of the process in the
relevant pid namespace, but it is rather complex, involving three
separate processes and an AF_UNIX domain socket over which to send a
message including a struct ucred from which to learn the relevant pid.
That doesn't work in all cases anyway since it requires privileges to
translate arbitrary pids, and it is not an option for performance
reasons alone. However, I do also have a separate patchset in [1]
allowing translation of pids between pid namespaces which will help
with that in the future - something I had discussed with Joel a while
back but haven't pushed for since implementing it in early 2020. Both
patchsets are useful independently of one another.)

Additionally, we ideally always want to manage the core scheduling
domain from the first child since the first child knows the pids for the
relevant processes in its current pid namespace. The first child knows
the pid of the init process in the current pid namespace from which to
pull the core scheduling domain and it knows the pid of the second child
it created to which to apply the core scheduling domain.

The core scheduling domain of the first child needs to be unaffected as
it might run sensitive codepaths that should not be exposed to SMT attacks.

The new PR_SCHED_CORE_SHARE command for the PR_SCHED_CORE prctl() option
supports this and other use-cases by making it possible for a third
managing task to pull the core scheduling domain from a task identified
via its pid and push it to another task identified via its pid:

prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE,
      <pid-to-which-to-apply-coresched-domain>,
      PR_SCHED_CORE_SCOPE_{THREAD,THREAD_GROUP,PROCESS_GROUP},
      <pid-from-which-to-take-coresched-domain>)

In order to use PR_SCHED_CORE_SHARE the caller must have
ptrace_may_access() rights to both the task from which to take the core
scheduling domain and to the task to which to apply the core scheduling
domain. If the caller passes zero as the 5th argument then its own core
scheduling domain is applied to the target, making the option adhere to
regular prctl() semantics.

[1]: https://git.kernel.org/brauner/h/ioctl_ns_get_init_pid
     https://git.kernel.org/brauner/c/1ad81fd698dd7e6511c3db422eba42dec3ce1b08
Cc: Peter Collingbourne <pcc@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Balbir Singh <sblbir@amazon.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/sched.h      |  2 +-
 include/uapi/linux/prctl.h |  3 ++-
 kernel/sched/core_sched.c  | 32 +++++++++++++++++++++++++++++---
 3 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0408372594dd..2eeac7a341ad 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2341,7 +2341,7 @@ const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 extern void sched_core_free(struct task_struct *tsk);
 extern void sched_core_fork(struct task_struct *p);
 extern int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
-				unsigned long uaddr);
+				unsigned long arg);
 #else
 static inline void sched_core_free(struct task_struct *tsk) { }
 static inline void sched_core_fork(struct task_struct *p) { }
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index e998764f0262..e53945dadede 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -267,7 +267,8 @@ struct prctl_mm_map {
 # define PR_SCHED_CORE_CREATE		1 /* create unique core_sched cookie */
 # define PR_SCHED_CORE_SHARE_TO		2 /* push core_sched cookie to pid */
 # define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
-# define PR_SCHED_CORE_MAX		4
+# define PR_SCHED_CORE_SHARE		4
+# define PR_SCHED_CORE_MAX		5
 # define PR_SCHED_CORE_SCOPE_THREAD		0
 # define PR_SCHED_CORE_SCOPE_THREAD_GROUP	1
 # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP	2
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index 0c40445337c5..241bb38f5e55 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -129,9 +129,10 @@ static void __sched_core_set(struct task_struct *p, unsigned long cookie)
 
 /* Called from prctl interface: PR_SCHED_CORE */
 int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
-			 unsigned long uaddr)
+			 unsigned long arg)
 {
-	unsigned long cookie = 0, id = 0;
+	unsigned long cookie = 0, id = 0, uaddr = 0;
+	pid_t pid_share = -1;
 	struct task_struct *task, *p;
 	struct pid *grp;
 	int err = 0;
@@ -144,9 +145,20 @@ int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
 	BUILD_BUG_ON(PR_SCHED_CORE_SCOPE_PROCESS_GROUP != PIDTYPE_PGID);
 
 	if (type > PIDTYPE_PGID || cmd >= PR_SCHED_CORE_MAX || pid < 0 ||
-	    (cmd != PR_SCHED_CORE_GET && uaddr))
+	    (cmd != PR_SCHED_CORE_GET && cmd != PR_SCHED_CORE_SHARE && arg))
 		return -EINVAL;
 
+	switch (cmd) {
+	case PR_SCHED_CORE_GET:
+		uaddr = arg;
+		break;
+	case PR_SCHED_CORE_SHARE:
+		pid_share = arg;
+		if (pid_share < 0)
+			return -EINVAL;
+		break;
+	}
+
 	rcu_read_lock();
 	task = task_by_pid(pid);
 	if (!task) {
@@ -200,6 +212,20 @@ int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
 		__sched_core_set(current, cookie);
 		goto out;
 
+	case PR_SCHED_CORE_SHARE:
+		rcu_read_lock();
+		p = task_by_pid(pid_share);
+		if (!p)
+			err = -ESRCH;
+		else if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
+			err = -EPERM;
+		if (!err)
+			cookie = sched_core_clone_cookie(p);
+		rcu_read_unlock();
+		if (err)
+			goto out;
+		break;
+
 	default:
 		err = -EINVAL;
 		goto out;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [resend RFC 3/3] tests: add new PR_SCHED_CORE_SHARE test
  2022-01-24 10:52 [resend RFC 0/3] core scheduling: add PR_SCHED_CORE_SHARE Christian Brauner
  2022-01-24 10:52 ` [resend RFC 1/3] pid: introduce task_by_pid() Christian Brauner
  2022-01-24 10:52 ` [resend RFC 2/3] sched/prctl: add PR_SCHED_CORE_SHARE command Christian Brauner
@ 2022-01-24 10:52 ` Christian Brauner
  2022-01-24 17:25 ` [resend RFC 0/3] core scheduling: add PR_SCHED_CORE_SHARE Joel Fernandes
  3 siblings, 0 replies; 9+ messages in thread
From: Christian Brauner @ 2022-01-24 10:52 UTC (permalink / raw)
  To: Joel Fernandes, Chris Hyser, Daniel Bristot de Oliveira,
	Peter Zijlstra, linux-kernel
  Cc: Peter Collingbourne, Dietmar Eggemann, Thomas Gleixner,
	Mel Gorman, Vincent Guittot, Juri Lelli, Catalin Marinas,
	Ingo Molnar, Steven Rostedt, Ben Segall,
	Sebastian Andrzej Siewior, Balbir Singh, Christian Brauner,
	Shuah Khan, linux-kselftest

Add tests for the new PR_SCHED_CORE_SHARE command.

Cc: Peter Collingbourne <pcc@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Balbir Singh <sblbir@amazon.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/sched/cs_prctl_test.c | 23 +++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/tools/testing/selftests/sched/cs_prctl_test.c b/tools/testing/selftests/sched/cs_prctl_test.c
index 8109b17dc764..985b83fe7221 100644
--- a/tools/testing/selftests/sched/cs_prctl_test.c
+++ b/tools/testing/selftests/sched/cs_prctl_test.c
@@ -229,6 +229,7 @@ int main(int argc, char *argv[])
 	int pidx;
 	int pid;
 	int opt;
+	int i;
 
 	while ((opt = getopt(argc, argv, ":hkT:P:d:")) != -1) {
 		switch (opt) {
@@ -325,6 +326,28 @@ int main(int argc, char *argv[])
 	validate(get_cs_cookie(pid) != 0);
 	validate(get_cs_cookie(pid) == get_cs_cookie(procs[pidx].thr_tids[0]));
 
+	printf("\n## Set a new cookie on a single thread/PR_SCHED_CORE_SCOPE_THREAD [%d]\n", pid);
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, pid, PR_SCHED_CORE_SCOPE_THREAD, 0) < 0)
+		handle_error("core_sched create failed -- PR_SCHED_CORE_SCOPE_THREAD");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(pid) != get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Copy cookie from a thread [%d] to [%d] as PR_SCHED_CORE_SCOPE_THREAD\n", pid, procs[pidx].thr_tids[0]);
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE, procs[pidx].thr_tids[0], PR_SCHED_CORE_SCOPE_THREAD, pid) < 0)
+		handle_error("core_sched share cookie from and to thread failed -- PR_SCHED_CORE_SCOPE_THREAD");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(pid) == get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Copy cookie from a thread [%d] to [%d] as PR_SCHED_CORE_SCOPE_THREAD_GROUP\n", pid, pid);
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE, pid, PR_SCHED_CORE_SCOPE_THREAD_GROUP, pid) < 0)
+		handle_error("core_sched share cookie from and to thread-group failed -- PR_SCHED_CORE_SCOPE_THREAD_GROUP");
+	disp_processes(num_processes, procs);
+
+	for (i = 0; i < procs[pidx].num_threads; ++i)
+		validate(get_cs_cookie(pid) == get_cs_cookie(procs[pidx].thr_tids[i]));
+
 	if (errors) {
 		printf("TESTS FAILED. errors: %d\n", errors);
 		res = 10;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [resend RFC 2/3] sched/prctl: add PR_SCHED_CORE_SHARE command
  2022-01-24 10:52 ` [resend RFC 2/3] sched/prctl: add PR_SCHED_CORE_SHARE command Christian Brauner
@ 2022-01-24 17:25   ` Joel Fernandes
  2022-01-25  0:31   ` Josh Don
  1 sibling, 0 replies; 9+ messages in thread
From: Joel Fernandes @ 2022-01-24 17:25 UTC (permalink / raw)
  To: Christian Brauner, Josh Don
  Cc: Chris Hyser, Daniel Bristot de Oliveira, Peter Zijlstra, LKML,
	Peter Collingbourne, Dietmar Eggemann, Thomas Gleixner,
	Mel Gorman, Vincent Guittot, Juri Lelli, Catalin Marinas,
	Ingo Molnar, Steven Rostedt, Ben Segall,
	Sebastian Andrzej Siewior, Balbir Singh

+Josh Don, who was involved with upstream development of the interface.

On Mon, Jan 24, 2022 at 5:53 AM Christian Brauner <brauner@kernel.org> wrote:
>
> This adds the new PR_CORE_SCHED prctl() command PR_SCHED_CORE_SHARE to
> allow a third process to pull a core scheduling domain from one task and
> push it to another task.
>
> The core scheduling uapi is exposed via the PR_SCHED_CORE option of the
> prctl() system call. Two commands can be used to alter the core
> scheduling domain of a task:
>
> 1. PR_SCHED_CORE_SHARE_TO
>    This command takes the cookie for the caller's core scheduling domain
>    and applies it to a target task identified by passing a pid.
>
> 2. PR_SCHED_CORE_SHARE_FROM
>    This command takes the cookie for a task's core scheduling domain and
>    applies it to the calling task.
>
> While these options cover nearly all use-cases they are rather
> inconvient for some common use-cases. A vm/container manager often
> supervises a large number of vms/containers:
>
>                                vm/container manager
>
> vm-supervisor-1    container-supervisor-1    vm-supervisor-2    container-supervisor-2
>
> Where none of the vms/container are its immediate children.
>
> For container managers each container often has a separate supervising
> process and the workload is the parent of the container. In the example
> below the supervising process is "[lxc monitor]" and the workload is
> "/sbin/init" and all descendant processes:
>
> ├─[lxc monitor] /var/lib/lxd/containers imp1
> │   └─systemd
> │       ├─agetty -o -p -- \\u --noclear --keep-baud console 115200,38400,9600 linux
> │       ├─cron -f -P
> │       ├─dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
> │       ├─networkd-dispat /usr/bin/networkd-dispatcher --run-startup-triggers
> │       ├─rsyslogd -n -iNONE
> │       │   ├─{rsyslogd}
> │       │   └─{rsyslogd}
> │       ├─systemd-journal
> │       ├─systemd-logind
> │       ├─systemd-network
> │       ├─systemd-resolve
> │       └─systemd-udevd
>
> Similiar in spirit but different in layout a vm often has a supervising
> process and multiple threads for each vcpu:
>
> ├─qemu-system-x86 -S -name f2-vm [...]
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   └─{qemu-system-x86}
>
> So ultimately an approximation of that layout would be:
>
>                                vm/container manager
>
> vm-supervisor-1    container-supervisor-1    vm-supervisor-2    container-supervisor-2
>        |                      |                    |                      |
>      vcpus                 workload              vcpus                 workload
>                           (/sbin/init)                               (/sbin/init)
>
> For containers a new core scheduling domain is allocated for the init
> process. Any descendant processes and threads init spawns will
> automatically inherit the correct core scheduling domain.
>
> For vms a new core scheduling domain is allocated and each vcpu thread
> will be made to join the new core scheduling domain.
>
> Whenever the tool or library that we use to run containers or vms
> exposes an option to automatically create a new core scheduling domain
> we will make use of it. However that is not always the case. In such
> cases the vm/container manager will need to allocate and set the core
> scheduling domain for the relevant processes or threads.
>
> Neither the vm/container mananger nor the indivial vm/container
> supervisors are supposed to run in any or the same core scheduling
> domain as the respective vcpus/workloads.
>
> So in order to create to create a new core scheduling domain we need to
> fork() off a new helper process which allocates a core scheduling domain
> and then pushes the cookie for the core scheduling domain to the
> relevant vcpus/workloads.
>
> This works but things get rather tricky, especially for containers, when
> a new process is supposed to be spawned into a running container.
> An important step in creating a new process inside a running container
> involves:
>
> - getting a handle on the container's init process (pid or nowadays
>   often a pidfd)
> - getting a handle on the container's namespaces (namespace file
>   descriptors reachable via /proc/<init-pid>/ns/<ns-typ> or nowadays
>   often a pidfd)
> - calling setns() either on each namespace file descriptor individually
>   or on the pidfd of the init process
>
> An important sub-step here is to attach to the container's pid namespace
> via setns(). After attaching to the container's pid namespace any
> process created via a fork()-like system calls will be a full member of
> the container's pid namespace.
>
> So attaching often involves two child processes. The first child simply
> attaches to the namespaces of the container including the container's
> pid namespace. The second child fork()s and ultimately exec()s thereby
> guaranteeing that the newly created process is a full member of the
> container's pid namespace:
>
> first_child = fork();
> if (first_child == 0) {
>         setns(CLONE_NEWPID);
>
>         second_child = fork();
>         if (second_child == 0) {
>                 execlp();
>         }
> }
>
> As part of this we also need to make sure that the second child - the
> one ultimately exec()ing the relevant programm in an already running
> container - joins the core scheduling domain of the container. When the
> container runs in a new pid namespace this can usually be done by
> calling:
>
> first_child = fork();
> if (first_child == 0) {
>         setns(CLONE_NEWPID);
>
>         second_child = fork();
>         if (second_child == 0) {
>                 prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM,
>                       1, PR_SCHED_CORE_SCOPE_THREAD, 0);
>
>                 execlp();
>         }
> }
>
> from the second child since we know that pid 1 in a container running
> inside of a separate pid namespace is the correct process to get the
> core scheduling domain from.
>
> However, this doesn't work when the container does not run in a separate
> pid namespace or when it shares the pid namespace with another
> container. In these scenarios we can't simply call
> PR_SCHED_CORE_SHARE_FROM from the second child since we don't know the
> correct pid number to call it on in the pid namespace.
>
> (Note it is of course possible to learn the pid of the process in the
> relevant pid namespace but it is rather complex involving three separate
> processes and an AF_UNIX domain socket over which to send a message
> including struct ucred from which to learn the relevant pid. But that
> doesn't work in all cases since it requires privileges to translate
> arbitrary pids. In any case, this is not an option for performance
> reasons alone. However, I do also have a separate patchset in [1]
> allowing translation of pids between pid namespaces which will help with
> that in the future - something which I had discussed with Joel a while
> back but haven't pushed for yet since implementing it early 2020. Both
> patches are useful independent of one another.)
>
> Additionally, we ideally always want to manage the core scheduling
> domain from the first child since the first child knows the pids for the
> relevant processes in its current pid namespace. The first child knows
> the pid of the init process in the current pid namespace from which to
> pull the core scheduling domain and it knows the pid of the second child
> it created to which to apply the core scheduling domain.
>
> The core scheduling domain of the first child needs to be unaffected as
> it might run sensitive codepaths that should not be exposed in smt attacks.
>
> The new PR_CORE_SCHED_SHARE command for the PR_SCHED_CORE prctl() option
> allows to support this and other use-cases by making it possible to pull
> the core scheduling domain from a task identified via its pid and push
> it to another task identified via its pid from a third managing task:
>
> prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE,
>       <pid-to-which-to-apply-coresched-domain>,
>       PR_SCHED_CORE_SCOPE_{THREAD,THREAD_GROUP,PROCESS_GROUP},
>       <pid-from-which-to-take-coresched-domain>)
>
> In order to use PR_SCHED_CORE_SHARE the caller must have
> ptrace_may_access() rights to both the task from which to take the core
> scheduling domain and to the task to which to apply the core scheduling
> domain. If the caller passes zero as the 5th argument then its own core
> scheduling domain is applied to the target making the option adhere to
> regular prctl() semantics.
>
> [1]: https://git.kernel.org/brauner/h/ioctl_ns_get_init_pid
>      https://git.kernel.org/brauner/c/1ad81fd698dd7e6511c3db422eba42dec3ce1b08
> Cc: Peter Collingbourne <pcc@google.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Chris Hyser <chris.hyser@oracle.com>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Balbir Singh <sblbir@amazon.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  include/linux/sched.h      |  2 +-
>  include/uapi/linux/prctl.h |  3 ++-
>  kernel/sched/core_sched.c  | 32 +++++++++++++++++++++++++++++---
>  3 files changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0408372594dd..2eeac7a341ad 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2341,7 +2341,7 @@ const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
>  extern void sched_core_free(struct task_struct *tsk);
>  extern void sched_core_fork(struct task_struct *p);
>  extern int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
> -                               unsigned long uaddr);
> +                               unsigned long arg);
>  #else
>  static inline void sched_core_free(struct task_struct *tsk) { }
>  static inline void sched_core_fork(struct task_struct *p) { }
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index e998764f0262..e53945dadede 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -267,7 +267,8 @@ struct prctl_mm_map {
>  # define PR_SCHED_CORE_CREATE          1 /* create unique core_sched cookie */
>  # define PR_SCHED_CORE_SHARE_TO                2 /* push core_sched cookie to pid */
>  # define PR_SCHED_CORE_SHARE_FROM      3 /* pull core_sched cookie to pid */
> -# define PR_SCHED_CORE_MAX             4
> +# define PR_SCHED_CORE_SHARE           4
> +# define PR_SCHED_CORE_MAX             5
>  # define PR_SCHED_CORE_SCOPE_THREAD            0
>  # define PR_SCHED_CORE_SCOPE_THREAD_GROUP      1
>  # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP     2
> diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
> index 0c40445337c5..241bb38f5e55 100644
> --- a/kernel/sched/core_sched.c
> +++ b/kernel/sched/core_sched.c
> @@ -129,9 +129,10 @@ static void __sched_core_set(struct task_struct *p, unsigned long cookie)
>
>  /* Called from prctl interface: PR_SCHED_CORE */
>  int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
> -                        unsigned long uaddr)
> +                        unsigned long arg)
>  {
> -       unsigned long cookie = 0, id = 0;
> +       unsigned long cookie = 0, id = 0, uaddr = 0;
> +       pid_t pid_share = -1;
>         struct task_struct *task, *p;
>         struct pid *grp;
>         int err = 0;
> @@ -144,9 +145,20 @@ int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
>         BUILD_BUG_ON(PR_SCHED_CORE_SCOPE_PROCESS_GROUP != PIDTYPE_PGID);
>
>         if (type > PIDTYPE_PGID || cmd >= PR_SCHED_CORE_MAX || pid < 0 ||
> -           (cmd != PR_SCHED_CORE_GET && uaddr))
> +           (cmd != PR_SCHED_CORE_GET && cmd != PR_SCHED_CORE_SHARE && arg))
>                 return -EINVAL;
>
> +       switch (cmd) {
> +       case PR_SCHED_CORE_GET:
> +               uaddr = arg;
> +               break;
> +       case PR_SCHED_CORE_SHARE:
> +               pid_share = arg;
> +               if (pid_share < 0)
> +                       return -EINVAL;
> +               break;
> +       }
> +
>         rcu_read_lock();
>         task = task_by_pid(pid);
>         if (!task) {
> @@ -200,6 +212,20 @@ int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
>                 __sched_core_set(current, cookie);
>                 goto out;
>
> +       case PR_SCHED_CORE_SHARE:
> +               rcu_read_lock();
> +               p = task_by_pid(pid_share);
> +               if (!p)
> +                       err = -ESRCH;
> +               else if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
> +                       err = -EPERM;
> +               if (!err)
> +                       cookie = sched_core_clone_cookie(p);
> +               rcu_read_unlock();
> +               if (err)
> +                       goto out;
> +               break;
> +
>         default:
>                 err = -EINVAL;
>                 goto out;
> --
> 2.32.0
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [resend RFC 0/3] core scheduling: add PR_SCHED_CORE_SHARE
  2022-01-24 10:52 [resend RFC 0/3] core scheduling: add PR_SCHED_CORE_SHARE Christian Brauner
                   ` (2 preceding siblings ...)
  2022-01-24 10:52 ` [resend RFC 3/3] tests: add new PR_SCHED_CORE_SHARE test Christian Brauner
@ 2022-01-24 17:25 ` Joel Fernandes
  3 siblings, 0 replies; 9+ messages in thread
From: Joel Fernandes @ 2022-01-24 17:25 UTC (permalink / raw)
  To: Christian Brauner, Josh Don
  Cc: Chris Hyser, Daniel Bristot de Oliveira, Peter Zijlstra, LKML,
	Peter Collingbourne, Dietmar Eggemann, Thomas Gleixner,
	Mel Gorman, Vincent Guittot, Juri Lelli, Catalin Marinas,
	Ingo Molnar, Steven Rostedt, Ben Segall,
	Sebastian Andrzej Siewior, Balbir Singh

+Josh Don, who was involved with upstream development of the interface.

On Mon, Jan 24, 2022 at 5:53 AM Christian Brauner <brauner@kernel.org> wrote:
>
> Hey everyone,
>
> This adds the new PR_CORE_SCHED prctl() command PR_SCHED_CORE_SHARE to
> allow a third process to pull a core scheduling domain from one task and
> push it to another task.
>
> The core scheduling uapi is exposed via the PR_SCHED_CORE option of the
> prctl() system call. Two commands can be used to alter the core
> scheduling domain of a task:
>
> 1. PR_SCHED_CORE_SHARE_TO
>    This command takes the cookie for the caller's core scheduling domain
>    and applies it to a target task identified by passing a pid.
>
> 2. PR_SCHED_CORE_SHARE_FROM
>    This command takes the cookie for a task's core scheduling domain and
>    applies it to the calling task.
>
> While these options cover nearly all use-cases they are rather
> inconvient for some common use-cases. A vm/container manager often
> supervises a large number of vms/containers:
>
>                                vm/container manager
>
> vm-supervisor-1    container-supervisor-1    vm-supervisor-2    container-supervisor-2
>
> Where none of the vms/container are its immediate children.
>
> For container managers each container often has a separate supervising
> process and the workload is the parent of the container. In the example
> below the supervising process is "[lxc monitor]" and the workload is
> "/sbin/init" and all descendant processes:
>
> ├─[lxc monitor] /var/lib/lxd/containers imp1
> │   └─systemd
> │       ├─agetty -o -p -- \\u --noclear --keep-baud console 115200,38400,9600 linux
> │       ├─cron -f -P
> │       ├─dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
> │       ├─networkd-dispat /usr/bin/networkd-dispatcher --run-startup-triggers
> │       ├─rsyslogd -n -iNONE
> │       │   ├─{rsyslogd}
> │       │   └─{rsyslogd}
> │       ├─systemd-journal
> │       ├─systemd-logind
> │       ├─systemd-network
> │       ├─systemd-resolve
> │       └─systemd-udevd
>
> Similiar in spirit but different in layout a vm often has a supervising
> process and multiple threads for each vcpu:
>
> ├─qemu-system-x86 -S -name f2-vm [...]
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   ├─{qemu-system-x86}
> │   └─{qemu-system-x86}
>
> So ultimately an approximation of that layout would be:
>
>                                vm/container manager
>
> vm-supervisor-1    container-supervisor-1    vm-supervisor-2    container-supervisor-2
>        |                      |                    |                      |
>      vcpus                 workload              vcpus                 workload
>                           (/sbin/init)                               (/sbin/init)
>
> For containers a new core scheduling domain is allocated for the init
> process. Any descendant processes and threads init spawns will
> automatically inherit the correct core scheduling domain.
>
> For vms a new core scheduling domain is allocated and each vcpu thread
> will be made to join the new core scheduling domain.
>
> Whenever the tool or library that we use to run containers or vms
> exposes an option to automatically create a new core scheduling domain
> we will make use of it. However, that is not always the case. In such
> cases the vm/container manager will need to allocate and set the core
> scheduling domain for the relevant processes or threads.
>
> Neither the vm/container manager nor the individual vm/container
> supervisors are supposed to run in any core scheduling domain shared
> with the respective vcpus/workloads.
>
> So in order to create a new core scheduling domain we need to fork()
> off a new helper process which allocates a core scheduling domain and
> then pushes the cookie for the core scheduling domain to the relevant
> vcpus/workloads.
>
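> Roughly, and leaving error handling and the reaping of the helper
> aside, such a helper boils down to the existing PR_SCHED_CORE_CREATE
> and PR_SCHED_CORE_SHARE_TO commands (workload_pid standing in for the
> pid of the workload or of a vcpu thread):
>
> pid_t helper = fork();
> if (helper == 0) {
>         /* Allocate a new core scheduling domain for the helper. */
>         prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
>               PR_SCHED_CORE_SCOPE_THREAD, 0);
>
>         /* Push the helper's cookie to the workload/vcpu. */
>         prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, workload_pid,
>               PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0);
>
>         _exit(0);
> }
>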
> This works but things get rather tricky, especially for containers, when
> a new process is supposed to be spawned into a running container.
> Creating a new process inside a running container involves the
> following steps:
>
> - getting a handle on the container's init process (pid or nowadays
>   often a pidfd)
> - getting a handle on the container's namespaces (namespace file
>   descriptors reachable via /proc/<init-pid>/ns/<ns-type> or nowadays
>   often a pidfd)
> - calling setns() either on each namespace file descriptor individually
>   or on the pidfd of the init process
>
> An important sub-step here is to attach to the container's pid namespace
> via setns(). After attaching to the container's pid namespace any
> process created via a fork()-like system call will be a full member of
> the container's pid namespace.
>
> So attaching often involves two child processes. The first child simply
> attaches to the namespaces of the container including the container's
> pid namespace and then fork()s a second child which ultimately exec()s,
> thereby guaranteeing that the newly created process is a full member of
> the container's pid namespace:
>
> pid_t first_child = fork();
> if (first_child == 0) {
>         /* pidfd refers to the container's init process. */
>         setns(pidfd, CLONE_NEWPID);
>
>         pid_t second_child = fork();
>         if (second_child == 0) {
>                 /* exec() the desired binary as a full member of the
>                  * container's pid namespace. */
>                 execlp("prog", "prog", (char *)NULL);
>         }
> }
>
> As part of this we also need to make sure that the second child - the
> one ultimately exec()ing the relevant program in an already running
> container - joins the core scheduling domain of the container. When the
> container runs in a new pid namespace this can usually be done by
> calling:
>
> pid_t first_child = fork();
> if (first_child == 0) {
>         /* pidfd refers to the container's init process. */
>         setns(pidfd, CLONE_NEWPID);
>
>         pid_t second_child = fork();
>         if (second_child == 0) {
>                 /* Pull the core scheduling domain from pid 1, i.e. the
>                  * container's init process in this pid namespace. */
>                 prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM,
>                       1, PR_SCHED_CORE_SCOPE_THREAD, 0);
>
>                 execlp("prog", "prog", (char *)NULL);
>         }
> }
>
> from the second child since we know that pid 1 in a container running
> inside of a separate pid namespace is the correct process to get the
> core scheduling domain from.
>
> However, this doesn't work when the container does not run in a separate
> pid namespace or when it shares the pid namespace with another
> container. In these scenarios we can't simply call
> PR_SCHED_CORE_SHARE_FROM from the second child since we don't know the
> correct pid number to call it on in the pid namespace.
>
> (Note it is of course possible to learn the pid of the process in the
> relevant pid namespace, but it is rather complex: it involves three
> separate processes and an AF_UNIX socket over which to send a message
> carrying a struct ucred from which to learn the relevant pid. That
> doesn't work in all cases since it requires privileges to translate
> arbitrary pids, and it is not an option for performance reasons alone.
> However, I do also have a separate patchset in [1] allowing translation
> of pids between pid namespaces which will help with that in the future -
> something I had discussed with Joel a while back but haven't pushed for
> since implementing it in early 2020. Both patches are useful independent
> of one another.)
>
> Additionally, we ideally always want to manage the core scheduling
> domain from the first child since the first child knows the pids for the
> relevant processes in its current pid namespace. The first child knows
> the pid of the init process in the current pid namespace from which to
> pull the core scheduling domain and it knows the pid of the second child
> it created to which to apply the core scheduling domain.
>
> The core scheduling domain of the first child needs to be unaffected as
> it might run sensitive codepaths that should not be exposed to SMT-based
> attacks.
>
> The new PR_SCHED_CORE_SHARE command for the PR_SCHED_CORE prctl() option
> supports this and other use-cases by making it possible to pull the
> core scheduling domain from a task identified via its pid and push it
> to another task identified via its pid, all from a third managing task:
>
> prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE,
>       <pid-to-which-to-apply-coresched-domain>,
>       PR_SCHED_CORE_SCOPE_{THREAD,THREAD_GROUP,PROCESS_GROUP},
>       <pid-from-which-to-take-coresched-domain>)
>
> In order to use PR_SCHED_CORE_SHARE the caller must have
> ptrace_may_access() rights to both the task from which to take the core
> scheduling domain and to the task to which to apply the core scheduling
> domain. If the caller passes zero as the 5th argument then its own core
> scheduling domain is applied to the target making the option adhere to
> regular prctl() semantics.
>
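> For the container attach case above that means the first child can
> simply do something like this (init_pid and second_child being the pids
> valid in the first child's current pid namespace):
>
> prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE, second_child,
>       PR_SCHED_CORE_SCOPE_THREAD, init_pid);
>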
> Thanks!
> Christian
>
> Christian Brauner (3):
>   pid: introduce task_by_pid()
>   sched/prctl: add PR_SCHED_CORE_SHARE command
>   tests: add new PR_SCHED_CORE_SHARE test
>
>  arch/mips/kernel/mips-mt-fpaff.c              | 14 +-----
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c        | 19 +++-----
>  block/ioprio.c                                | 10 +----
>  include/linux/sched.h                         |  9 +++-
>  include/uapi/linux/prctl.h                    |  3 +-
>  kernel/cgroup/cgroup.c                        | 12 ++---
>  kernel/events/core.c                          |  5 +--
>  kernel/futex/syscalls.c                       | 20 +++------
>  kernel/pid.c                                  |  5 +++
>  kernel/sched/core.c                           | 27 ++++--------
>  kernel/sched/core_sched.c                     | 44 ++++++++++++++-----
>  kernel/sys.c                                  | 12 ++---
>  mm/mempolicy.c                                |  2 +-
>  tools/testing/selftests/sched/cs_prctl_test.c | 23 ++++++++++
>  14 files changed, 105 insertions(+), 100 deletions(-)
>
>
> base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
> --
> 2.32.0
>

* Re: [resend RFC 2/3] sched/prctl: add PR_SCHED_CORE_SHARE command
  2022-01-24 10:52 ` [resend RFC 2/3] sched/prctl: add PR_SCHED_CORE_SHARE command Christian Brauner
  2022-01-24 17:25   ` Joel Fernandes
@ 2022-01-25  0:31   ` Josh Don
  2022-01-25 12:15     ` Christian Brauner
  1 sibling, 1 reply; 9+ messages in thread
From: Josh Don @ 2022-01-25  0:31 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Joel Fernandes, Chris Hyser, Daniel Bristot de Oliveira,
	Peter Zijlstra, linux-kernel, Peter Collingbourne,
	Dietmar Eggemann, Thomas Gleixner, Mel Gorman, Vincent Guittot,
	Juri Lelli, Catalin Marinas, Ingo Molnar, Steven Rostedt,
	Ben Segall, Sebastian Andrzej Siewior, Balbir Singh

Hey Christian,

This seems like a reasonable extension of the interface to me.

> @@ -200,6 +212,20 @@ int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
>                 __sched_core_set(current, cookie);
>                 goto out;
>
> +       case PR_SCHED_CORE_SHARE:
> +               rcu_read_lock();
> +               p = task_by_pid(pid_share);
> +               if (!p)
> +                       err = -ESRCH;
> +               else if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
> +                       err = -EPERM;
> +               if (!err)
> +                       cookie = sched_core_clone_cookie(p);
> +               rcu_read_unlock();
> +               if (err)
> +                       goto out;
> +               break;
> +

Did you consider folding this into SCHED_CORE_SHARE_TO? SHARE_TO isn't
using the last arg right now; it could use it as an override for the
task we copy the cookie from instead of always choosing 'current'.
Since the code currently rejects any SCHED_CORE prctl calls with a
non-zero last arg for commands other than SCHED_CORE_GET, this would
be a safe change for userspace.
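
Roughly, the folded-in variant would then look like this from userspace
(just a sketch of the suggestion, with dst_pid/src_pid as placeholder
pids):

prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, dst_pid,
      PR_SCHED_CORE_SCOPE_THREAD_GROUP, src_pid);

i.e. a non-zero last argument selects the task whose cookie is copied
instead of defaulting to 'current'.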

* Re: [resend RFC 2/3] sched/prctl: add PR_SCHED_CORE_SHARE command
  2022-01-25  0:31   ` Josh Don
@ 2022-01-25 12:15     ` Christian Brauner
  0 siblings, 0 replies; 9+ messages in thread
From: Christian Brauner @ 2022-01-25 12:15 UTC (permalink / raw)
  To: Josh Don
  Cc: Joel Fernandes, Chris Hyser, Daniel Bristot de Oliveira,
	Peter Zijlstra, linux-kernel, Peter Collingbourne,
	Dietmar Eggemann, Thomas Gleixner, Mel Gorman, Vincent Guittot,
	Juri Lelli, Catalin Marinas, Ingo Molnar, Steven Rostedt,
	Ben Segall, Sebastian Andrzej Siewior, Balbir Singh

On Mon, Jan 24, 2022 at 04:31:05PM -0800, Josh Don wrote:
> Hey Christian,
> 
> This seems like a reasonable extension of the interface to me.
> 
> > @@ -200,6 +212,20 @@ int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
> >                 __sched_core_set(current, cookie);
> >                 goto out;
> >
> > +       case PR_SCHED_CORE_SHARE:
> > +               rcu_read_lock();
> > +               p = task_by_pid(pid_share);
> > +               if (!p)
> > +                       err = -ESRCH;
> > +               else if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
> > +                       err = -EPERM;
> > +               if (!err)
> > +                       cookie = sched_core_clone_cookie(p);
> > +               rcu_read_unlock();
> > +               if (err)
> > +                       goto out;
> > +               break;
> > +
> 
> Did you consider folding this into SCHED_CORE_SHARE_TO? SHARE_TO isn't
> using the last arg right now; it could use it as an override for the
> task we copy the cookie from instead of always choosing 'current'.
> Since the code currently rejects any SCHED_CORE prctl calls with a
> non-zero last arg for commands other than SCHED_CORE_GET, this would
> be a safe change for userspace.

Yeah, that sounds good to me too. I can certainly rework the patch to do
that!

* Re: [resend RFC 1/3] pid: introduce task_by_pid()
  2022-01-24 10:52 ` [resend RFC 1/3] pid: introduce task_by_pid() Christian Brauner
@ 2022-01-26 16:56   ` Tejun Heo
  0 siblings, 0 replies; 9+ messages in thread
From: Tejun Heo @ 2022-01-26 16:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Joel Fernandes, Chris Hyser, Daniel Bristot de Oliveira,
	Peter Zijlstra, linux-kernel, Peter Collingbourne,
	Dietmar Eggemann, Thomas Gleixner, Mel Gorman, Vincent Guittot,
	Juri Lelli, Catalin Marinas, Ingo Molnar, Steven Rostedt,
	Ben Segall, Sebastian Andrzej Siewior, Balbir Singh, Jens Axboe

On Mon, Jan 24, 2022 at 11:52:45AM +0100, Christian Brauner wrote:
> We have a lot of places that open code
> 
> if (who)
>         p = find_task_by_vpid(who);
> else
>         p = current;
> 
> Introduce a simpler helper which can be used instead.
> 
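> Something along these lines, with callers expected to hold
> rcu_read_lock() where needed (a sketch of the helper as described, not
> necessarily the exact implementation):
>
> struct task_struct *task_by_pid(pid_t pid)
> {
>         if (!pid)
>                 return current;
>         return find_task_by_vpid(pid);
> }
>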
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>

For cgroup part:

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun
