linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/9] sched: Core scheduling interfaces
@ 2021-04-01 13:10 Peter Zijlstra
  2021-04-01 13:10 ` [PATCH 1/9] sched: Allow sched_core_put() from atomic context Peter Zijlstra
                   ` (9 more replies)
  0 siblings, 10 replies; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-01 13:10 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tj, tglx

Hi,

This is a rewrite of the core sched interface bits, and mostly replaces patches
2-5 from this set here:

  https://lkml.kernel.org/r/20210324214020.34142-1-joel@joelfernandes.org

The task interface is extended to include PR_SCHED_CORE_GET, because the
selftest. Otherwise the task interface is much the same, except completely new
code.

The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It is
however changed such that you can have nested tags. The for any given task, the
first parent with a cookie is the effective one. The rationale is that this way
you can delegate subtrees and still allow them some control over grouping.

The cgroup thing also '(ab)uses' cgroup_mutex for serialization because it
needs to ensure continuity between ss->can_attach() and ss->attach() for the
memory allocation. If the prctl() were allowed to interleave it might steal the
memory.

Using cgroup_mutex feels icky, but is not without precedent,
kernel/bpf/cgroup.c does the same thing afaict.

TJ, can you please have a look at this?

The last patch implements the prctl() / cgroup interaction, up until that point
each task carries the cookie set last between either interface, which is not
desirable. It really isn't the nicest thing ever, but it does keep the
scheduling core from having to consider multiple cookies.

Also, I still hate the kernel/sched/core_sched.c filename, but short of using
gibberish names to make tab-completion easier I simply cannot come up with
a remotely sane alternative :/

The code seems to not insta crash, and I can run the prctl() selftest while in
a cgroup and have it pass, not leak any references etc.. But it's otherwise
lightly tested code. Please read carefully etc..

Also of note; I didn't seem to need the css_offline and css_exit handlers the
other set added.

FWIW, I have a 4 day weekend ahead :-)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/9] sched: Allow sched_core_put() from atomic context
  2021-04-01 13:10 [PATCH 0/9] sched: Core scheduling interfaces Peter Zijlstra
@ 2021-04-01 13:10 ` Peter Zijlstra
  2021-04-01 13:10 ` [PATCH 2/9] sched: Implement core-sched assertions Peter Zijlstra
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-01 13:10 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tj, tglx

Stuff the meat of sched_core_put() into a work such that we can use
sched_core_put() from atomic context.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -223,7 +223,7 @@ static struct task_struct *sched_core_ne
  */
 
 static DEFINE_MUTEX(sched_core_mutex);
-static int sched_core_count;
+static atomic_t sched_core_count;
 static struct cpumask sched_core_mask;
 
 static void __sched_core_flip(bool enabled)
@@ -286,18 +286,39 @@ static void __sched_core_disable(void)
 
 void sched_core_get(void)
 {
+	if (atomic_inc_not_zero(&sched_core_count))
+		return;
+
 	mutex_lock(&sched_core_mutex);
-	if (!sched_core_count++)
+	if (!atomic_read(&sched_core_count))
 		__sched_core_enable();
+
+	smp_mb__before_atomic();
+	atomic_inc(&sched_core_count);
 	mutex_unlock(&sched_core_mutex);
 }
 
-void sched_core_put(void)
+static void __sched_core_put(struct work_struct *work)
 {
-	mutex_lock(&sched_core_mutex);
-	if (!--sched_core_count)
+	if (atomic_dec_and_mutex_lock(&sched_core_count, &sched_core_mutex)) {
 		__sched_core_disable();
-	mutex_unlock(&sched_core_mutex);
+		mutex_unlock(&sched_core_mutex);
+	}
+}
+
+void sched_core_put(void)
+{
+	static DECLARE_WORK(_work, __sched_core_put);
+
+	/*
+	 * "There can only be one"
+	 *
+	 * Either this is the last one, or we don't actually need to do any
+	 * 'work'. If it is the last *again*, we rely on
+	 * WORK_STRUCT_PENDING_BIT.
+	 */
+	if (!atomic_add_unless(&sched_core_count, -1, 1))
+		schedule_work(&_work);
 }
 
 #else /* !CONFIG_SCHED_CORE */



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 2/9] sched: Implement core-sched assertions
  2021-04-01 13:10 [PATCH 0/9] sched: Core scheduling interfaces Peter Zijlstra
  2021-04-01 13:10 ` [PATCH 1/9] sched: Allow sched_core_put() from atomic context Peter Zijlstra
@ 2021-04-01 13:10 ` Peter Zijlstra
  2021-04-01 13:10 ` [PATCH 3/9] sched: Trivial core scheduling cookie management Peter Zijlstra
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-01 13:10 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tj, tglx


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -268,18 +268,24 @@ static void __sched_core_flip(bool enabl
 	cpus_read_unlock();
 }
 
-static void __sched_core_enable(void)
+static void sched_core_assert_empty(void)
 {
-	// XXX verify there are no cookie tasks (yet)
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		WARN_ON_ONCE(!RB_EMPTY_ROOT(&cpu_rq(cpu)->core_tree));
+}
 
+static void __sched_core_enable(void)
+{
 	static_branch_enable(&__sched_core_enabled);
 	__sched_core_flip(true);
+	sched_core_assert_empty();
 }
 
 static void __sched_core_disable(void)
 {
-	// XXX verify there are no cookie tasks (left)
-
+	sched_core_assert_empty();
 	__sched_core_flip(false);
 	static_branch_disable(&__sched_core_enabled);
 }



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 3/9] sched: Trivial core scheduling cookie management
  2021-04-01 13:10 [PATCH 0/9] sched: Core scheduling interfaces Peter Zijlstra
  2021-04-01 13:10 ` [PATCH 1/9] sched: Allow sched_core_put() from atomic context Peter Zijlstra
  2021-04-01 13:10 ` [PATCH 2/9] sched: Implement core-sched assertions Peter Zijlstra
@ 2021-04-01 13:10 ` Peter Zijlstra
  2021-04-01 20:04   ` Josh Don
  2021-04-01 13:10 ` [PATCH 4/9] sched: Default core-sched policy Peter Zijlstra
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-01 13:10 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tj, tglx

In order to not have to use pid_struct, create a new, smaller,
structure to manage task cookies for core scheduling.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h     |    6 ++
 kernel/fork.c             |    1 
 kernel/sched/Makefile     |    1 
 kernel/sched/core.c       |    7 +--
 kernel/sched/core_sched.c |  106 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h      |   16 ++++++
 6 files changed, 134 insertions(+), 3 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2169,4 +2169,10 @@ int sched_trace_rq_nr_running(struct rq
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+extern void sched_core_free(struct task_struct *tsk);
+#else
+static inline void sched_core_free(struct task_struct *tsk) { }
+#endif
+
 #endif
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -737,6 +737,7 @@ void __put_task_struct(struct task_struc
 	exit_creds(tsk);
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
+	sched_core_free(tsk);
 
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) +=
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += core_sched.o
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -157,7 +157,7 @@ static inline int rb_sched_core_cmp(cons
 	return 0;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
@@ -167,14 +167,15 @@ static void sched_core_enqueue(struct rq
 	rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
 }
 
-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
+	if (!sched_core_enqueued(p))
 		return;
 
 	rb_erase(&p->core_node, &rq->core_tree);
+	RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
--- /dev/null
+++ b/kernel/sched/core_sched.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "sched.h"
+
+/*
+ * A simple wrapper around refcount. An allocated sched_core_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_cookie {
+	refcount_t refcnt;
+};
+
+unsigned long sched_core_alloc_cookie(void)
+{
+	struct sched_core_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL);
+	if (!ck)
+		return 0;
+
+	refcount_set(&ck->refcnt, 1);
+	sched_core_get();
+
+	return (unsigned long)ck;
+}
+
+void sched_core_put_cookie(unsigned long cookie)
+{
+	struct sched_core_cookie *ptr = (void *)cookie;
+
+	if (ptr && refcount_dec_and_test(&ptr->refcnt)) {
+		kfree(ptr);
+		sched_core_put();
+	}
+}
+
+unsigned long sched_core_get_cookie(unsigned long cookie)
+{
+	struct sched_core_cookie *ptr = (void *)cookie;
+
+	if (ptr)
+		refcount_inc(&ptr->refcnt);
+
+	return cookie;
+}
+
+/*
+ * sched_core_update_cookie - Common helper to update a task's core cookie. This
+ * updates the selected cookie field.
+ * @p: The task whose cookie should be updated.
+ * @cookie: The new cookie.
+ * @cookie_type: The cookie field to which the cookie corresponds.
+ */
+unsigned long sched_core_update_cookie(struct task_struct *p, unsigned long cookie)
+{
+	unsigned long old_cookie;
+	struct rq_flags rf;
+	struct rq *rq;
+	bool enqueued;
+
+	rq = task_rq_lock(p, &rf);
+
+	/*
+	 * Since creating a cookie implies sched_core_get(), and we cannot set
+	 * a cookie until after we've created it, similarly, we cannot destroy
+	 * a cookie until after we've removed it, we must have core scheduling
+	 * enabled here.
+	 */
+	SCHED_WARN_ON((p->core_cookie || cookie) && !sched_core_enabled(rq));
+
+	enqueued = sched_core_enqueued(p);
+	if (enqueued)
+		sched_core_dequeue(rq, p);
+
+	old_cookie = p->core_cookie;
+	p->core_cookie = cookie;
+
+	if (enqueued)
+		sched_core_enqueue(rq, p);
+
+	/*
+	 * If task is currently running , it may not be compatible anymore after
+	 * the cookie change, so enter the scheduler on its CPU to schedule it
+	 * away.
+	 */
+	if (task_running(rq, p))
+		resched_curr(rq);
+
+	task_rq_unlock(rq, p, &rf);
+
+	return old_cookie;
+}
+
+static unsigned long sched_core_clone_cookie(struct task_struct *p)
+{
+	unsigned long cookie, flags;
+
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	cookie = sched_core_get_cookie(p->core_cookie);
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+
+	return cookie;
+}
+
+void sched_core_free(struct task_struct *p)
+{
+	sched_core_put_cookie(p->core_cookie);
+}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1222,6 +1222,22 @@ static inline bool sched_group_cookie_ma
 
 extern void queue_core_balance(struct rq *rq);
 
+static inline bool sched_core_enqueued(struct task_struct *p)
+{
+	return !RB_EMPTY_NODE(&p->core_node);
+}
+
+extern void sched_core_enqueue(struct rq *rq, struct task_struct *p);
+extern void sched_core_dequeue(struct rq *rq, struct task_struct *p);
+
+extern void sched_core_get(void);
+extern void sched_core_put(void);
+
+extern unsigned long sched_core_alloc_cookie(void);
+extern void sched_core_put_cookie(unsigned long cookie);
+extern unsigned long sched_core_get_cookie(unsigned long cookie);
+extern unsigned long sched_core_update_cookie(struct task_struct *p, unsigned long cookie);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 4/9] sched: Default core-sched policy
  2021-04-01 13:10 [PATCH 0/9] sched: Core scheduling interfaces Peter Zijlstra
                   ` (2 preceding siblings ...)
  2021-04-01 13:10 ` [PATCH 3/9] sched: Trivial core scheduling cookie management Peter Zijlstra
@ 2021-04-01 13:10 ` Peter Zijlstra
  2021-04-21 13:33   ` Peter Zijlstra
  2021-04-01 13:10 ` [PATCH 5/9] sched: prctl() core-scheduling interface Peter Zijlstra
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-01 13:10 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tj, tglx

Implement default core scheduling policy.

 - fork() / clone(): inherit cookie from parent
 - exec(): if cookie then new cookie

Did that exec() thing want to change cookie on suid instead, just like
perf_event_exit_task() ?

Note that sched_core_fork() is called from under tasklist_lock, and
not from sched_fork() earlier. This avoids a few races later.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/exec.c                  |    4 +++-
 include/linux/sched.h      |    4 ++++
 include/linux/sched/task.h |    4 ++--
 kernel/fork.c              |    3 +++
 kernel/sched/core.c        |   11 +++++++++--
 kernel/sched/core_sched.c  |   21 +++++++++++++++++++++
 6 files changed, 42 insertions(+), 5 deletions(-)

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1807,7 +1807,9 @@ static int bprm_execve(struct linux_binp
 	if (IS_ERR(file))
 		goto out_unmark;
 
-	sched_exec();
+	retval = sched_exec();
+	if (retval)
+		goto out;
 
 	bprm->file = file;
 	/*
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2171,8 +2171,12 @@ const struct cpumask *sched_trace_rd_spa
 
 #ifdef CONFIG_SCHED_CORE
 extern void sched_core_free(struct task_struct *tsk);
+extern int sched_core_exec(void);
+extern void sched_core_fork(struct task_struct *p);
 #else
 static inline void sched_core_free(struct task_struct *tsk) { }
+static inline int sched_core_exec(void) { return 0; }
+static inline void sched_core_fork(struct task_struct *p) { }
 #endif
 
 #endif
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -94,9 +94,9 @@ extern void free_task(struct task_struct
 
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
-extern void sched_exec(void);
+extern int sched_exec(void);
 #else
-#define sched_exec()   {}
+static inline int sched_exec(void) { return 0; }
 #endif
 
 static inline struct task_struct *get_task_struct(struct task_struct *t)
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2242,6 +2242,8 @@ static __latent_entropy struct task_stru
 
 	klp_copy_process(p);
 
+	sched_core_fork(p);
+
 	spin_lock(&current->sighand->siglock);
 
 	/*
@@ -2329,6 +2331,7 @@ static __latent_entropy struct task_stru
 	return p;
 
 bad_fork_cancel_cgroup:
+	sched_core_free(p);
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	cgroup_cancel_fork(p, args);
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4762,11 +4762,17 @@ unsigned long nr_iowait(void)
  * sched_exec - execve() is a valuable balancing opportunity, because at
  * this point the task has the smallest effective memory and cache footprint.
  */
-void sched_exec(void)
+int sched_exec(void)
 {
 	struct task_struct *p = current;
 	unsigned long flags;
 	int dest_cpu;
+	int ret;
+
+	/* this may change what tasks current can share a core with */
+	ret = sched_core_exec();
+	if (ret)
+		return ret;
 
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
@@ -4778,10 +4784,11 @@ void sched_exec(void)
 
 		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 		stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
-		return;
+		return 0;
 	}
 unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	return 0;
 }
 
 #endif
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -100,7 +100,28 @@ static unsigned long sched_core_clone_co
 	return cookie;
 }
 
+void sched_core_fork(struct task_struct *p)
+{
+	RB_CLEAR_NODE(&p->core_node);
+	p->core_cookie = sched_core_clone_cookie(current);
+}
+
 void sched_core_free(struct task_struct *p)
 {
 	sched_core_put_cookie(p->core_cookie);
 }
+
+int sched_core_exec(void)
+{
+	/* absent a policy mech, if task had a cookie, give it a new one */
+	if (current->core_cookie) {
+		unsigned long cookie = sched_core_alloc_cookie();
+		if (!cookie)
+			return -ENOMEM;
+		cookie = sched_core_update_cookie(current, cookie);
+		sched_core_put_cookie(cookie);
+	}
+
+	return 0;
+}
+



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 5/9] sched: prctl() core-scheduling interface
  2021-04-01 13:10 [PATCH 0/9] sched: Core scheduling interfaces Peter Zijlstra
                   ` (3 preceding siblings ...)
  2021-04-01 13:10 ` [PATCH 4/9] sched: Default core-sched policy Peter Zijlstra
@ 2021-04-01 13:10 ` Peter Zijlstra
  2021-04-07 17:00   ` Peter Zijlstra
  2021-04-01 13:10 ` [PATCH 6/9] kselftest: Add test for core sched prctl interface Peter Zijlstra
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-01 13:10 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tj, tglx

From: Chris Hyser <chris.hyser@oracle.com>

This patch provides support for setting, clearing and copying core
scheduling 'task cookies' between threads (PID), processes (TGID), and
process groups (PGID).

The value of core scheduling isn't that tasks don't share a core,
'nosmt' can do that. The value lies in exploiting all the sharing
opportunities that exist to recover possible lost performance and that
requires a degree of flexibility in the API.

>From a security perspective (and there are others), the thread,
process and process group distinction is an existent hierarchal
categorization of tasks that reflects many of the security concerns
about 'data sharing'. For example, protecting against cache-snooping
by a thread that can just read the memory directly isn't all that
useful.

With this in mind, subcommands to CLEAR/CREATE/SHARE (TO/FROM) provide
a mechanism to create, clear and share cookies. CLEAR/CREATE/SHARE_TO
specify a target pid with enum pidtype used to specify the scope of
the targeted tasks. For example, PIDTYPE_TGID will share the cookie
with the process and all of it's threads as typically desired in a
security scenario.

API:

  prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_CLEAR, tgtpid, pidtype, NULL)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)

where 'tgtpid/srcpid == 0' implies the current process and pidtype is
kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.
PIDTYPE_SID, sharing a cookie with an entire session, was considered
less useful given the choice to create a new cookie on task exec().

For return values, EINVAL, ENOMEM are what they say. ESRCH means the
tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission
access to tgtpid/srcpid. EACCES indicates that a task in the target
pidtype group was not updated due to permission.

Current hard-coded policies are:

 - a user can clear the cookie of any process they can set a cookie for.
   Lack of a cookie *might* be a security issue if cookies are being used
   for that.

[peterz: complete rewrite]
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210324214020.34142-4-joel@joelfernandes.org
---
 include/linux/sched.h            |    2 
 include/uapi/linux/prctl.h       |    9 +++
 kernel/sched/core_sched.c        |  117 +++++++++++++++++++++++++++++++++++++++
 kernel/sys.c                     |    5 +
 tools/include/uapi/linux/prctl.h |    9 +++
 5 files changed, 142 insertions(+)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2173,6 +2173,8 @@ const struct cpumask *sched_trace_rd_spa
 extern void sched_core_free(struct task_struct *tsk);
 extern int sched_core_exec(void);
 extern void sched_core_fork(struct task_struct *p);
+extern int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
+				unsigned long uaddr);
 #else
 static inline void sched_core_free(struct task_struct *tsk) { }
 static inline int sched_core_exec(void) { return 0; }
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -255,4 +255,13 @@ struct prctl_mm_map {
 # define SYSCALL_DISPATCH_FILTER_ALLOW	0
 # define SYSCALL_DISPATCH_FILTER_BLOCK	1
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE			60
+# define PR_SCHED_CORE_GET		0
+# define PR_SCHED_CORE_CLEAR		1 /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_CREATE		2 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_TO		3 /* push core_sched cookie to pid */
+# define PR_SCHED_CORE_SHARE_FROM	4 /* pull core_sched cookie to pid */
+# define PR_SCHED_CORE_MAX		5
+
 #endif /* _LINUX_PRCTL_H */
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
+#include <linux/prctl.h>
 #include "sched.h"
 
 /*
@@ -125,3 +126,119 @@ int sched_core_exec(void)
 	return 0;
 }
 
+static void __sched_core_set(struct task_struct *p, unsigned long cookie)
+{
+	cookie = sched_core_get_cookie(cookie);
+	cookie = sched_core_update_cookie(p, cookie);
+	sched_core_put_cookie(cookie);
+}
+
+/* Called from prctl interface: PR_SCHED_CORE */
+int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
+			 unsigned long uaddr)
+{
+	unsigned long cookie = 0, id = 0;
+	struct task_struct *task, *p;
+	struct pid *grp;
+	int err = 0;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -ENODEV;
+
+	if (type > PIDTYPE_PGID || cmd >= PR_SCHED_CORE_MAX || pid < 0 ||
+	    (cmd != PR_SCHED_CORE_GET && uaddr))
+		return -EINVAL;
+
+	rcu_read_lock();
+	if (pid == 0) {
+		task = current;
+	} else {
+		task = find_task_by_vpid(pid);
+		if (!task) {
+			rcu_read_unlock();
+			return -ESRCH;
+		}
+	}
+	get_task_struct(task);
+	rcu_read_unlock();
+
+	/*
+	 * Check if this process has the right to modify the specified
+	 * process. Use the regular "ptrace_may_access()" checks.
+	 */
+	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	switch (cmd) {
+	case PR_SCHED_CORE_GET:
+		if (type != PIDTYPE_PID || uaddr & 7) {
+			err = -EINVAL;
+			goto out;
+		}
+		cookie = sched_core_clone_cookie(task);
+		if (cookie) {
+			/* XXX improve ? */
+			ptr_to_hashval((void *)cookie, &id);
+		}
+		err = put_user(id, (u64 __user *)uaddr);
+		goto out;
+
+	case PR_SCHED_CORE_CLEAR:
+		cookie = 0;
+		break;
+
+	case PR_SCHED_CORE_CREATE:
+		cookie = sched_core_alloc_cookie();
+		if (!cookie) {
+			err = -ENOMEM;
+			goto out;
+		}
+		break;
+
+	case PR_SCHED_CORE_SHARE_TO:
+		cookie = sched_core_clone_cookie(current);
+		break;
+
+	case PR_SCHED_CORE_SHARE_FROM:
+		if (type != PIDTYPE_PID) {
+			err = -EINVAL;
+			goto out;
+		}
+		cookie = sched_core_clone_cookie(task);
+		__sched_core_set(current, cookie);
+		goto out;
+
+	default:
+		err = -EINVAL;
+		goto out;
+	};
+
+	if (type == PIDTYPE_PID) {
+		__sched_core_set(task, cookie);
+		goto out;
+	}
+
+	read_lock(&tasklist_lock);
+	grp = task_pid_type(task, type);
+
+	do_each_pid_thread(grp, type, p) {
+		if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS)) {
+			err = -EPERM;
+			goto out_tasklist;
+		}
+	} while_each_pid_thread(grp, type, p);
+
+	do_each_pid_thread(grp, type, p) {
+		__sched_core_set(p, cookie);
+	} while_each_pid_thread(grp, type, p);
+out_tasklist:
+	read_unlock(&tasklist_lock);
+
+out:
+	sched_core_put_cookie(cookie);
+	put_task_struct(task);
+	return err;
+}
+
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2534,6 +2534,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 		error = set_syscall_user_dispatch(arg2, arg3, arg4,
 						  (char __user *) arg5);
 		break;
+#ifdef CONFIG_SCHED_CORE
+	case PR_SCHED_CORE:
+		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -255,4 +255,13 @@ struct prctl_mm_map {
 # define SYSCALL_DISPATCH_FILTER_ALLOW	0
 # define SYSCALL_DISPATCH_FILTER_BLOCK	1
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE			60
+# define PR_SCHED_CORE_GET		0
+# define PR_SCHED_CORE_CLEAR		1 /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_CREATE		2 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_TO		3 /* push core_sched cookie to pid */
+# define PR_SCHED_CORE_SHARE_FROM	4 /* pull core_sched cookie to pid */
+# define PR_SCHED_CORE_MAX		5
+
 #endif /* _LINUX_PRCTL_H */



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 6/9] kselftest: Add test for core sched prctl interface
  2021-04-01 13:10 [PATCH 0/9] sched: Core scheduling interfaces Peter Zijlstra
                   ` (4 preceding siblings ...)
  2021-04-01 13:10 ` [PATCH 5/9] sched: prctl() core-scheduling interface Peter Zijlstra
@ 2021-04-01 13:10 ` Peter Zijlstra
  2021-04-01 13:10 ` [PATCH 7/9] sched: Cgroup core-scheduling interface Peter Zijlstra
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-01 13:10 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tj, tglx

From: Chris Hyser <chris.hyser@oracle.com>

Provides a selftest and examples of using the interface.

[peterz: updated to not use sched_debug]
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210324214020.34142-5-joel@joelfernandes.org
---
 tools/testing/selftests/sched/.gitignore      |    1 
 tools/testing/selftests/sched/Makefile        |   14 +
 tools/testing/selftests/sched/config          |    1 
 tools/testing/selftests/sched/cs_prctl_test.c |  356 ++++++++++++++++++++++++++
 4 files changed, 372 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/cs_prctl_test.c

--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+cs_prctl_test
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+	  $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := cs_prctl_test
+TEST_PROGS := cs_prctl_test
+
+include ../lib.mk
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
--- /dev/null
+++ b/tools/testing/selftests/sched/cs_prctl_test.c
@@ -0,0 +1,356 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Use the core scheduling prctl() to test core scheduling cookies control.
+ *
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ * Author: Chris Hyser <chris.hyser@oracle.com>
+ *
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public License
+ * along with this library; if not, see <http://www.gnu.org/licenses>.
+ */
+
+#define _GNU_SOURCE
+#include <sys/eventfd.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sched.h>
+#include <sys/prctl.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <time.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#if __GLIBC_PREREQ(2, 30) == 0
+#include <sys/syscall.h>
+static pid_t gettid(void)
+{
+	return syscall(SYS_gettid);
+}
+#endif
+
+#ifndef PR_SCHED_CORE
+#define PR_SCHED_CORE			60
+# define PR_SCHED_CORE_GET		0
+# define PR_SCHED_CORE_CLEAR		1 /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_CREATE		2 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_TO		3 /* push core_sched cookie to pid */
+# define PR_SCHED_CORE_SHARE_FROM	4 /* pull core_sched cookie to pid */
+# define PR_SCHED_CORE_MAX		5
+#endif
+
+#define MAX_PROCESSES 128
+#define MAX_THREADS   128
+
+static const char USAGE[] = "cs_prctl_test [options]\n"
+"    options:\n"
+"	-P  : number of processes to create.\n"
+"	-T  : number of threads per process to create.\n"
+"	-d  : delay time to keep tasks alive.\n"
+"	-k  : keep tasks alive until keypress.\n";
+
+enum pid_type {PIDTYPE_PID = 0, PIDTYPE_TGID, PIDTYPE_PGID};
+
+const int THREAD_CLONE_FLAGS = CLONE_THREAD | CLONE_SIGHAND | CLONE_FS | CLONE_VM | CLONE_FILES;
+
+static int _prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4,
+		  unsigned long arg5)
+{
+	int res;
+
+	res = prctl(option, arg2, arg3, arg4, arg5);
+	printf("%d = prctl(%d, %ld, %ld, %ld, %lx)\n", res, option, (long)arg2, (long)arg3,
+	       (long)arg4, arg5);
+	return res;
+}
+
+#define STACK_SIZE (1024 * 1024)
+
+#define handle_error(msg) __handle_error(__FILE__, __LINE__, msg)
+static void __handle_error(char *fn, int ln, char *msg)
+{
+	printf("(%s:%d) - ", fn, ln);
+	perror(msg);
+	exit(EXIT_FAILURE);
+}
+
+static void handle_usage(int rc, char *msg)
+{
+	puts(USAGE);
+	puts(msg);
+	putchar('\n');
+	exit(rc);
+}
+
+static unsigned long get_cs_cookie(int pid)
+{
+	unsigned long long cookie;
+	int ret;
+
+	ret = prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, pid, PIDTYPE_PID,
+		    (unsigned long)&cookie);
+	if (ret) {
+		printf("Not a core sched system\n");
+		return -1UL;
+	}
+
+	return cookie;
+}
+
+struct child_args {
+	int num_threads;
+	int pfd[2];
+	int cpid;
+	int thr_tids[MAX_THREADS];
+};
+
+static int child_func_thread(void __attribute__((unused))*arg)
+{
+	while (1)
+		usleep(20000);
+	return 0;
+}
+
+static void create_threads(int num_threads, int thr_tids[])
+{
+	void *child_stack;
+	pid_t tid;
+	int i;
+
+	for (i = 0; i < num_threads; ++i) {
+		child_stack = malloc(STACK_SIZE);
+		if (!child_stack)
+			handle_error("child stack allocate");
+
+		tid = clone(child_func_thread, child_stack + STACK_SIZE, THREAD_CLONE_FLAGS, NULL);
+		if (tid == -1)
+			handle_error("clone thread");
+		thr_tids[i] = tid;
+	}
+}
+
+static int child_func_process(void *arg)
+{
+	struct child_args *ca = (struct child_args *)arg;
+
+	close(ca->pfd[0]);
+
+	create_threads(ca->num_threads, ca->thr_tids);
+
+	write(ca->pfd[1], &ca->thr_tids, sizeof(int) * ca->num_threads);
+	close(ca->pfd[1]);
+
+	while (1)
+		usleep(20000);
+	return 0;
+}
+
+static unsigned char child_func_process_stack[STACK_SIZE];
+
+void create_processes(int num_processes, int num_threads, struct child_args proc[])
+{
+	pid_t cpid;
+	int i;
+
+	for (i = 0; i < num_processes; ++i) {
+		proc[i].num_threads = num_threads;
+
+		if (pipe(proc[i].pfd) == -1)
+			handle_error("pipe() failed");
+
+		cpid = clone(child_func_process, child_func_process_stack + STACK_SIZE,
+			     SIGCHLD, &proc[i]);
+		proc[i].cpid = cpid;
+		close(proc[i].pfd[1]);
+	}
+
+	for (i = 0; i < num_processes; ++i) {
+		read(proc[i].pfd[0], &proc[i].thr_tids, sizeof(int) * proc[i].num_threads);
+		close(proc[i].pfd[0]);
+	}
+}
+
+void disp_processes(int num_processes, struct child_args proc[])
+{
+	int i, j;
+
+	printf("tid=%d, / tgid=%d / pgid=%d: %lx\n", gettid(), getpid(), getpgid(0),
+	       get_cs_cookie(getpid()));
+
+	for (i = 0; i < num_processes; ++i) {
+		printf("    tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].cpid, proc[i].cpid,
+		       getpgid(proc[i].cpid), get_cs_cookie(proc[i].cpid));
+		for (j = 0; j < proc[i].num_threads; ++j) {
+			printf("        tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].thr_tids[j],
+			       proc[i].cpid, getpgid(0), get_cs_cookie(proc[i].thr_tids[j]));
+		}
+	}
+	puts("\n");
+}
+
+static int errors;
+
+#define validate(v) _validate(__LINE__, v, #v)
+void _validate(int line, int val, char *msg)
+{
+	if (!val) {
+		++errors;
+		printf("(%d) FAILED: %s\n", line, msg);
+	} else {
+		printf("(%d) PASSED: %s\n", line, msg);
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	struct child_args procs[MAX_PROCESSES];
+
+	int keypress = 0;
+	int num_processes = 2;
+	int num_threads = 3;
+	int delay = 0;
+	int res = 0;
+	int pidx;
+	int pid;
+	int opt;
+
+	while ((opt = getopt(argc, argv, ":hkT:P:d:")) != -1) {
+		switch (opt) {
+		case 'P':
+			num_processes = (int)strtol(optarg, NULL, 10);
+			break;
+		case 'T':
+			num_threads = (int)strtoul(optarg, NULL, 10);
+			break;
+		case 'd':
+			delay = (int)strtol(optarg, NULL, 10);
+			break;
+		case 'k':
+			keypress = 1;
+			break;
+		case 'h':
+			printf(USAGE);
+			exit(EXIT_SUCCESS);
+		default:
+			handle_usage(20, "unknown option");
+		}
+	}
+
+	if (num_processes < 1 || num_processes > MAX_PROCESSES)
+		handle_usage(1, "Bad processes value");
+
+	if (num_threads < 1 || num_threads > MAX_THREADS)
+		handle_usage(2, "Bad thread value");
+
+	if (keypress)
+		delay = -1;
+
+	srand(time(NULL));
+
+	/* put into separate process group */
+	if (setpgid(0, 0) != 0)
+		handle_error("process group");
+
+	printf("\n## Create a thread/process/process group hiearchy\n");
+	create_processes(num_processes, num_threads, procs);
+	disp_processes(num_processes, procs);
+	validate(get_cs_cookie(0) == 0);
+
+	printf("\n## Set a cookie on entire process group\n");
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_PGID, 0) < 0)
+		handle_error("core_sched create failed -- PGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) != 0);
+
+	/* get a random process pid */
+	pidx = rand() % num_processes;
+	pid = procs[pidx].cpid;
+
+	validate(get_cs_cookie(0) == get_cs_cookie(pid));
+	validate(get_cs_cookie(0) == get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Set a new cookie on entire process/TGID [%d]\n", pid);
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, pid, PIDTYPE_TGID, 0) < 0)
+		handle_error("core_sched create failed -- TGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) != get_cs_cookie(pid));
+	validate(get_cs_cookie(pid) != 0);
+	validate(get_cs_cookie(pid) == get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Copy the cookie of current/PGID[%d], to pid [%d] as PIDTYPE_PID\n",
+	       getpid(), pid);
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, pid, PIDTYPE_PID, 0) < 0)
+		handle_error("core_sched share to itself failed -- PID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == get_cs_cookie(pid));
+	validate(get_cs_cookie(pid) != 0);
+	validate(get_cs_cookie(pid) != get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Copy cookie from a thread [%d] to current/PGID [%d] as PIDTYPE_PID\n",
+	       procs[pidx].thr_tids[0], getpid());
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, procs[pidx].thr_tids[0],
+		   PIDTYPE_PID, 0) < 0)
+		handle_error("core_sched share from thread failed -- PID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == get_cs_cookie(procs[pidx].thr_tids[0]));
+	validate(get_cs_cookie(pid) != get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Clear a cookie on a single task [%d]\n", pid);
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_CLEAR, pid, PIDTYPE_PID, 0) < 0)
+		handle_error("core_sched clear failed -- PID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(pid) == 0);
+	validate(get_cs_cookie(procs[pidx].thr_tids[0]) != 0);
+
+	printf("\n## Copy cookie from current [%d] to current as pidtype PGID\n", getpid());
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, 0, PIDTYPE_PGID, 0) < 0)
+		handle_error("core_sched share to self failed -- PGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == get_cs_cookie(pid));
+	validate(get_cs_cookie(pid) != 0);
+	validate(get_cs_cookie(pid) == get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Clear cookies on the entire process group\n");
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_CLEAR, 0, PIDTYPE_PGID, 0) < 0)
+		handle_error("core_sched clear failed -- PGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == 0);
+	validate(get_cs_cookie(pid) == 0);
+	validate(get_cs_cookie(procs[pidx].thr_tids[0]) == 0);
+
+	if (errors) {
+		printf("TESTS FAILED. errors: %d\n", errors);
+		res = 10;
+	} else {
+		printf("SUCCESS !!!\n");
+	}
+
+	if (keypress)
+		getchar();
+	else
+		sleep(delay);
+
+	for (pidx = 0; pidx < num_processes; ++pidx)
+		kill(procs[pidx].cpid, 15);
+
+	return res;
+}



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 7/9] sched: Cgroup core-scheduling interface
  2021-04-01 13:10 [PATCH 0/9] sched: Core scheduling interfaces Peter Zijlstra
                   ` (5 preceding siblings ...)
  2021-04-01 13:10 ` [PATCH 6/9] kselftest: Add test for core sched prctl interface Peter Zijlstra
@ 2021-04-01 13:10 ` Peter Zijlstra
  2021-04-02  0:34   ` Josh Don
  2021-04-01 13:10 ` [PATCH 8/9] rbtree: Remove const from the rb_find_add() comparator Peter Zijlstra
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-01 13:10 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tj, tglx

Implement a basic cgroup core-scheduling interface.

A new cpu.core_sched file is added which takes the values 0,1. When
set, the cgroup and all it's descendants will be granted the same
cookie and thus allowed to share a core with each-other, but not with
system tasks or tasks of other subtrees that might have another
cookie.

The file is hierarchical, and a subtree can again set it to 1, in
which case that subtree will get a different cookie and will no longer
share with the parent tree.

For each task, the nearest core_sched parent 'wins'.

Interaction with the prctl() interface is non-existent and left for a
future patch.

Noteably; this patch somewhat abuses cgroup_mutex. By holding
cgroup_mutex over the write() operation, which sets the cookie, the
cookie is stable in any cgroup callback (that is called with
cgroup_mutex held). A future patch relies on ss->can_attach() and
ss->attach() being 'atomic', which is hard to do without cgroup_mutex.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |  150 +++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    7 ++
 2 files changed, 157 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5688,10 +5688,53 @@ static inline void sched_core_cpu_starti
 		}
 	}
 }
+
+void sched_core_cgroup_online(struct task_group *parent, struct task_group *tg)
+{
+	lockdep_assert_held(&cgroup_mutex);
+
+	if (parent->core_parent) {
+		WARN_ON_ONCE(parent->core_cookie);
+		WARN_ON_ONCE(!parent->core_parent->core_cookie);
+		tg->core_parent = parent->core_parent;
+
+	} else if (parent->core_cookie) {
+		WARN_ON_ONCE(parent->core_parent);
+		tg->core_parent = parent;
+	}
+}
+
+void sched_core_cgroup_free(struct task_group *tg)
+{
+	sched_core_put_cookie(tg->core_cookie);
+}
+
+unsigned long sched_core_cgroup_cookie(struct task_group *tg)
+{
+	unsigned long cookie = 0;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	if (tg->core_cookie)
+		cookie = tg->core_cookie;
+	else if (tg->core_parent)
+		cookie = tg->core_parent->core_cookie;
+
+	return sched_core_get_cookie(cookie);
+}
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline void sched_core_cpu_starting(unsigned int cpu) {}
 
+static inline void sched_core_cgroup_free(struct task_group *tg) { }
+static inline void sched_core_cgroup_online(struct task_group *parent, struct task_group tg) { }
+
+static inline unsigned long sched_core_cgroup_cookie(struct task_group *tg)
+{
+	return 0;
+}
+
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -9310,6 +9353,7 @@ static void sched_free_group(struct task
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
+	sched_core_cgroup_free(tg);
 	kmem_cache_free(task_group_cache, tg);
 }
 
@@ -9353,6 +9397,8 @@ void sched_online_group(struct task_grou
 	spin_unlock_irqrestore(&task_group_lock, flags);
 
 	online_fair_sched_group(tg);
+
+	sched_core_cgroup_online(parent, tg);
 }
 
 /* rcu callback to free various structures associated with a task group */
@@ -9414,6 +9460,7 @@ void sched_move_task(struct task_struct
 {
 	int queued, running, queue_flags =
 		DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+	unsigned long cookie;
 	struct rq_flags rf;
 	struct rq *rq;
 
@@ -9443,6 +9490,10 @@ void sched_move_task(struct task_struct
 	}
 
 	task_rq_unlock(rq, tsk, &rf);
+
+	cookie = sched_core_cgroup_cookie(tsk->sched_task_group);
+	cookie = sched_core_update_cookie(tsk, cookie);
+	sched_core_put_cookie(cookie);
 }
 
 static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
@@ -10050,6 +10101,89 @@ static u64 cpu_rt_period_read_uint(struc
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_CORE
+u64 cpu_sched_core_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	return !!css_tg(css)->core_cookie;
+}
+
+int cpu_sched_core_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	unsigned long cookie = 0, old_cookie = 0;
+	struct task_group *tg = css_tg(css);
+	struct cgroup_subsys_state *cssi;
+	struct task_group *parent = NULL;
+	int ret = 0;
+
+	if (val > 1)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -ENODEV;
+
+	mutex_lock(&cgroup_mutex);
+	if (!!val == !!tg->core_cookie)
+		goto unlock;
+
+	old_cookie = tg->core_cookie;
+	if (val) {
+		cookie = sched_core_alloc_cookie();
+		if (!cookie) {
+			ret = -ENOMEM;
+			goto unlock;
+		}
+		WARN_ON_ONCE(old_cookie);
+
+	} else if (tg->parent) {
+		if (tg->parent->core_parent)
+			parent = tg->parent->core_parent;
+		else if (tg->parent->core_cookie)
+			parent = tg->parent;
+	}
+
+	WARN_ON_ONCE(cookie && parent);
+
+	tg->core_cookie = sched_core_get_cookie(cookie);
+	tg->core_parent = parent;
+
+	if (cookie)
+		parent = tg;
+	else if (parent)
+		cookie = sched_core_get_cookie(parent->core_cookie);
+
+	css_for_each_descendant_pre(cssi, css) {
+		struct task_group *tgi = css_tg(cssi);
+		struct css_task_iter it;
+		struct task_struct *p;
+
+		if (tgi != tg) {
+			if (tgi->core_cookie || (tgi->core_parent && tgi->core_parent != tg))
+				continue;
+
+			tgi->core_parent = parent;
+			tgi->core_cookie = 0;
+		}
+
+		css_task_iter_start(cssi, 0, &it);
+		while ((p = css_task_iter_next(&it))) {
+			unsigned long p_cookie;
+
+			cookie = sched_core_get_cookie(cookie);
+			p_cookie = sched_core_update_cookie(p, cookie);
+			sched_core_put_cookie(p_cookie);
+		}
+		css_task_iter_end(&it);
+	}
+
+unlock:
+	mutex_unlock(&cgroup_mutex);
+
+	sched_core_put_cookie(cookie);
+	sched_core_put_cookie(old_cookie);
+	return ret;
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -10100,6 +10234,14 @@ static struct cftype cpu_legacy_files[]
 		.write = cpu_uclamp_max_write,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "core_sched",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_sched_core_read_u64,
+		.write_u64 = cpu_sched_core_write_u64,
+	},
+#endif
 	{ }	/* Terminate */
 };
 
@@ -10281,6 +10423,14 @@ static struct cftype cpu_files[] = {
 		.write = cpu_uclamp_max_write,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "core_sched",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_sched_core_read_u64,
+		.write_u64 = cpu_sched_core_write_u64,
+	},
+#endif
 	{ }	/* terminate */
 };
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -431,6 +431,10 @@ struct task_group {
 	struct uclamp_se	uclamp[UCLAMP_CNT];
 #endif
 
+#ifdef CONFIG_SCHED_CORE
+	struct task_group	*core_parent;
+	unsigned long		core_cookie;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1130,6 +1134,9 @@ static inline bool is_migration_disabled
 
 struct sched_group;
 #ifdef CONFIG_SCHED_CORE
+
+extern struct mutex cgroup_mutex; // XXX
+
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
 static inline struct cpumask *sched_group_span(struct sched_group *sg);
 



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 8/9] rbtree: Remove const from the rb_find_add() comparator
  2021-04-01 13:10 [PATCH 0/9] sched: Core scheduling interfaces Peter Zijlstra
                   ` (6 preceding siblings ...)
  2021-04-01 13:10 ` [PATCH 7/9] sched: Cgroup core-scheduling interface Peter Zijlstra
@ 2021-04-01 13:10 ` Peter Zijlstra
  2021-04-01 13:10 ` [PATCH 9/9] sched: prctl() and cgroup interaction Peter Zijlstra
  2021-04-04 23:39 ` [PATCH 0/9] sched: Core scheduling interfaces Tejun Heo
  9 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-01 13:10 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tj, tglx

This allows the rb_find_add() comparator to modify existing entries.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rbtree.h       |    2 +-
 kernel/events/uprobes.c      |    2 +-
 tools/include/linux/rbtree.h |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -248,7 +248,7 @@ rb_add(struct rb_node *node, struct rb_r
  */
 static __always_inline struct rb_node *
 rb_find_add(struct rb_node *node, struct rb_root *tree,
-	    int (*cmp)(struct rb_node *, const struct rb_node *))
+	    int (*cmp)(struct rb_node *, struct rb_node *))
 {
 	struct rb_node **link = &tree->rb_node;
 	struct rb_node *parent = NULL;
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -646,7 +646,7 @@ static inline int __uprobe_cmp_key(const
 	return uprobe_cmp(a->inode, a->offset, __node_2_uprobe(b));
 }
 
-static inline int __uprobe_cmp(struct rb_node *a, const struct rb_node *b)
+static inline int __uprobe_cmp(struct rb_node *a, struct rb_node *b)
 {
 	struct uprobe *u = __node_2_uprobe(a);
 	return uprobe_cmp(u->inode, u->offset, __node_2_uprobe(b));
--- a/tools/include/linux/rbtree.h
+++ b/tools/include/linux/rbtree.h
@@ -232,7 +232,7 @@ rb_add(struct rb_node *node, struct rb_r
  */
 static __always_inline struct rb_node *
 rb_find_add(struct rb_node *node, struct rb_root *tree,
-	    int (*cmp)(struct rb_node *, const struct rb_node *))
+	    int (*cmp)(struct rb_node *, struct rb_node *))
 {
 	struct rb_node **link = &tree->rb_node;
 	struct rb_node *parent = NULL;



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 9/9] sched: prctl() and cgroup interaction
  2021-04-01 13:10 [PATCH 0/9] sched: Core scheduling interfaces Peter Zijlstra
                   ` (7 preceding siblings ...)
  2021-04-01 13:10 ` [PATCH 8/9] rbtree: Remove const from the rb_find_add() comparator Peter Zijlstra
@ 2021-04-01 13:10 ` Peter Zijlstra
  2021-04-03  1:30   ` Josh Don
  2021-04-04 23:39 ` [PATCH 0/9] sched: Core scheduling interfaces Tejun Heo
  9 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-01 13:10 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tj, tglx

All the nasty bits that manage prctl() and cgroup core-sched
interaction.

In order to manage this; a 'fat' cookie is introduced, this is
basically a nested cookie that links to two other cookies. Any unique
combination of task and cgroup cookies will get _one_ fat cookie.

Uniqueness of fat cookies is ensured by use of a global tree.

Due to the locking rules for cookies, the need for fat cookies is not
apparent up-front, nor can they be allocated in-situ, therefore
pre-allocate them agressively and mostly free them instantly when not
used.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h     |    1 
 kernel/sched/core.c       |   25 +++
 kernel/sched/core_sched.c |  329 ++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h      |   11 +
 4 files changed, 337 insertions(+), 29 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -704,6 +704,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
 	struct rb_node			core_node;
 	unsigned long			core_cookie;
+	void				*core_spare_fat;
 	unsigned int			core_occupation;
 #endif
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9487,7 +9487,7 @@ void sched_move_task(struct task_struct
 	task_rq_unlock(rq, tsk, &rf);
 
 	cookie = sched_core_cgroup_cookie(tsk->sched_task_group);
-	cookie = sched_core_update_cookie(tsk, cookie);
+	cookie = sched_core_update_cookie(tsk, cookie | GROUP_COOKIE);
 	sched_core_put_cookie(cookie);
 }
 
@@ -9592,6 +9592,10 @@ static int cpu_cgroup_can_attach(struct
 
 		if (ret)
 			break;
+
+		ret = sched_core_prealloc_fat(task);
+		if (ret)
+			break;
 	}
 	return ret;
 }
@@ -10122,13 +10126,28 @@ int cpu_sched_core_write_u64(struct cgro
 
 	old_cookie = tg->core_cookie;
 	if (val) {
-		cookie = sched_core_alloc_cookie();
+		cookie = sched_core_alloc_cookie(GROUP_COOKIE);
 		if (!cookie) {
 			ret = -ENOMEM;
 			goto unlock;
 		}
 		WARN_ON_ONCE(old_cookie);
 
+		css_for_each_descendant_pre(cssi, css) {
+			struct css_task_iter it;
+			struct task_struct *p;
+
+			css_task_iter_start(cssi, 0, &it);
+			while ((p = css_task_iter_next(&it))) {
+				ret = sched_core_prealloc_fat(p);
+				if (ret) {
+					css_task_iter_end(&it);
+					goto unlock;
+				}
+			}
+			css_task_iter_end(&it);
+		}
+
 	} else if (tg->parent) {
 		if (tg->parent->core_parent)
 			parent = tg->parent->core_parent;
@@ -10164,7 +10183,7 @@ int cpu_sched_core_write_u64(struct cgro
 			unsigned long p_cookie;
 
 			cookie = sched_core_get_cookie(cookie);
-			p_cookie = sched_core_update_cookie(p, cookie);
+			p_cookie = sched_core_update_cookie(p, cookie | GROUP_COOKIE);
 			sched_core_put_cookie(p_cookie);
 		}
 		css_task_iter_end(&it);
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
 #include <linux/prctl.h>
+#include <linux/rbtree.h>
+#include <linux/cgroup.h>
 #include "sched.h"
 
 /*
@@ -8,26 +10,243 @@
  * address is used to compute the cookie of the task.
  */
 struct sched_core_cookie {
-	refcount_t refcnt;
+	refcount_t	refcnt;
+	unsigned int	type;
 };
 
-unsigned long sched_core_alloc_cookie(void)
+static inline void *cookie_ptr(unsigned long cookie)
+{
+	return (void *)(cookie & ~3UL);
+}
+
+static inline int cookie_type(unsigned long cookie)
+{
+	return cookie & 3;
+}
+
+static inline void sched_core_init_cookie(struct sched_core_cookie *ck, unsigned int type)
+{
+	refcount_set(&ck->refcnt, 1);
+	ck->type = type;
+}
+
+#ifdef CONFIG_CGROUP_SCHED
+
+#define FAT_COOKIE	0x03
+
+struct sched_core_fat_cookie {
+	struct sched_core_cookie	cookie;
+	unsigned long			task_cookie;
+	unsigned long			group_cookie;
+	struct rb_node			node;
+};
+
+static DEFINE_RAW_SPINLOCK(fat_lock);
+static struct rb_root fat_root;
+
+static void fat_mutex_lock(void)
+{
+	/*
+	 * { ss->can_attach(), ss->attach() } vs prctl() for p->core_spare_fat
+	 */
+	mutex_lock(&cgroup_mutex);
+}
+
+static void fat_mutex_unlock(void)
+{
+	mutex_unlock(&cgroup_mutex);
+}
+
+static void sched_core_put_fat(struct sched_core_fat_cookie *fat)
+{
+	unsigned long flags;
+
+	if (fat->cookie.type != FAT_COOKIE)
+		return;
+
+	sched_core_put_cookie(fat->task_cookie);
+	sched_core_put_cookie(fat->group_cookie);
+
+	if (!RB_EMPTY_NODE(&fat->node)) {
+		raw_spin_lock_irqsave(&fat_lock, flags);
+		rb_erase(&fat->node, &fat_root);
+		raw_spin_unlock_irqrestore(&fat_lock, flags);
+	}
+}
+
+static void *node_2_fat(struct rb_node *n)
+{
+	return rb_entry(n, struct sched_core_fat_cookie, node);
+}
+
+static int fat_cmp(struct rb_node *a, struct rb_node *b)
+{
+	struct sched_core_fat_cookie *ca = node_2_fat(a);
+	struct sched_core_fat_cookie *cb = node_2_fat(b);
+
+	if (ca->group_cookie < cb->group_cookie)
+		return -1;
+	if (ca->group_cookie > cb->group_cookie)
+		return 1;
+
+	if (ca->task_cookie < cb->task_cookie)
+		return -1;
+	if (ca->task_cookie > cb->task_cookie)
+		return 1;
+
+	if (refcount_inc_not_zero(&cb->cookie.refcnt))
+		return 0;
+
+	return 1;
+}
+
+static unsigned long __sched_core_fat_cookie(struct task_struct *p,
+					     void **spare_fat,
+					     unsigned long cookie)
+{
+	unsigned long task_cookie, group_cookie;
+	unsigned int p_type = cookie_type(p->core_cookie);
+	unsigned int c_type = cookie_type(cookie);
+	struct sched_core_fat_cookie *fat;
+	unsigned long flags;
+	struct rb_node *n;
+
+	if (WARN_ON_ONCE(c_type == FAT_COOKIE))
+		return cookie;
+
+	if (!p_type || p_type == c_type)
+		return cookie;
+
+	if (p_type == FAT_COOKIE) {
+		fat = cookie_ptr(p->core_cookie);
+
+		/* loose fat */
+		if (!cookie_ptr(cookie)) {
+			if (c_type == TASK_COOKIE)
+				cookie = fat->group_cookie;
+			else
+				cookie = fat->task_cookie;
+
+			WARN_ON_ONCE(!cookie_ptr(cookie));
+			return sched_core_get_cookie(cookie);
+		}
+
+		/* other fat */
+		if (c_type == TASK_COOKIE)
+			group_cookie = fat->group_cookie;
+		else
+			task_cookie = fat->task_cookie;
+
+	} else {
+
+		/* new fat */
+		if (p_type == TASK_COOKIE)
+			task_cookie = p->core_cookie;
+		else
+			group_cookie = p->core_cookie;
+	}
+
+	if (c_type == TASK_COOKIE)
+		task_cookie = cookie;
+	else
+		group_cookie = cookie;
+
+	fat = *spare_fat;
+	if (WARN_ON_ONCE(!fat))
+		return cookie;
+
+	sched_core_init_cookie(&fat->cookie, FAT_COOKIE);
+	fat->task_cookie = sched_core_get_cookie(task_cookie);
+	fat->group_cookie = sched_core_get_cookie(group_cookie);
+	RB_CLEAR_NODE(&fat->node);
+
+	raw_spin_lock_irqsave(&fat_lock, flags);
+	n = rb_find_add(&fat->node, &fat_root, fat_cmp);
+	raw_spin_unlock_irqrestore(&fat_lock, flags);
+
+	if (n) {
+		sched_core_put_fat(fat);
+		fat = node_2_fat(n);
+	} else {
+		*spare_fat = NULL;
+	}
+
+	return (unsigned long)fat | FAT_COOKIE;
+}
+
+static int __sched_core_alloc_fat(void **spare_fat)
+{
+	if (*spare_fat)
+		return 0;
+
+	*spare_fat = kmalloc(sizeof(struct sched_core_fat_cookie), GFP_KERNEL);
+	if (!*spare_fat)
+		return -ENOMEM;
+
+	return 0;
+}
+
+int sched_core_prealloc_fat(struct task_struct *p)
+{
+	lockdep_assert_held(&cgroup_mutex);
+	return __sched_core_alloc_fat(&p->core_spare_fat);
+}
+
+static inline unsigned long __sched_core_task_cookie(struct task_struct *p)
+{
+	unsigned long cookie = p->core_cookie;
+	unsigned int c_type = cookie_type(cookie);
+
+	if (!(c_type & TASK_COOKIE))
+		return 0;
+
+	if (c_type == FAT_COOKIE)
+		cookie = ((struct sched_core_fat_cookie *)cookie_ptr(cookie))->task_cookie;
+
+	return cookie;
+}
+
+#else
+
+static inline void fat_mutex_lock(void) { }
+static inline void fat_mutex_unlock(void) { }
+
+static inline void sched_core_put_fat(void *ptr) { }
+static inline int __sched_core_alloc_fat(void **spare_fat) { return 0; }
+
+static inline unsigned long __sched_core_fat_cookie(struct task_struct *p,
+						    void **spare_fat,
+						    unsigned long cookie)
+{
+	return cookie;
+}
+
+static inline unsigned long __sched_core_task_cookie(struct task_struct *p)
+{
+	return p->core_cookie;
+}
+
+#endif /* CGROUP_SCHED */
+
+unsigned long sched_core_alloc_cookie(unsigned int type)
 {
 	struct sched_core_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL);
 	if (!ck)
 		return 0;
 
-	refcount_set(&ck->refcnt, 1);
+	WARN_ON_ONCE(type > GROUP_COOKIE);
+	sched_core_init_cookie(ck, type);
 	sched_core_get();
 
-	return (unsigned long)ck;
+	return (unsigned long)ck | type;
 }
 
 void sched_core_put_cookie(unsigned long cookie)
 {
-	struct sched_core_cookie *ptr = (void *)cookie;
+	struct sched_core_cookie *ptr = cookie_ptr(cookie);
 
 	if (ptr && refcount_dec_and_test(&ptr->refcnt)) {
+		sched_core_put_fat((void *)ptr);
 		kfree(ptr);
 		sched_core_put();
 	}
@@ -35,7 +254,7 @@ void sched_core_put_cookie(unsigned long
 
 unsigned long sched_core_get_cookie(unsigned long cookie)
 {
-	struct sched_core_cookie *ptr = (void *)cookie;
+	struct sched_core_cookie *ptr = cookie_ptr(cookie);
 
 	if (ptr)
 		refcount_inc(&ptr->refcnt);
@@ -50,14 +269,22 @@ unsigned long sched_core_get_cookie(unsi
  * @cookie: The new cookie.
  * @cookie_type: The cookie field to which the cookie corresponds.
  */
-unsigned long sched_core_update_cookie(struct task_struct *p, unsigned long cookie)
+static unsigned long __sched_core_update_cookie(struct task_struct *p,
+						void **spare_fat,
+						unsigned long cookie)
 {
 	unsigned long old_cookie;
 	struct rq_flags rf;
 	struct rq *rq;
 	bool enqueued;
 
-	rq = task_rq_lock(p, &rf);
+	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
+
+	cookie = __sched_core_fat_cookie(p, spare_fat, cookie);
+	if (!cookie_ptr(cookie))
+		cookie = 0UL;
+
+	rq = __task_rq_lock(p, &rf);
 
 	/*
 	 * Since creating a cookie implies sched_core_get(), and we cannot set
@@ -90,9 +317,19 @@ unsigned long sched_core_update_cookie(s
 	return old_cookie;
 }
 
+unsigned long sched_core_update_cookie(struct task_struct *p, unsigned long cookie)
+{
+	cookie =  __sched_core_update_cookie(p, &p->core_spare_fat, cookie);
+	if (p->core_spare_fat) {
+		kfree(p->core_spare_fat);
+		p->core_spare_fat = NULL;
+	}
+	return cookie;
+}
+
 static unsigned long sched_core_clone_cookie(struct task_struct *p)
 {
-	unsigned long cookie, flags;
+	unsigned long flags, cookie;
 
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	cookie = sched_core_get_cookie(p->core_cookie);
@@ -101,26 +338,47 @@ static unsigned long sched_core_clone_co
 	return cookie;
 }
 
+static unsigned long sched_core_clone_task_cookie(struct task_struct *p)
+{
+	unsigned long flags, cookie;
+
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	cookie = sched_core_get_cookie(__sched_core_task_cookie(p));
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+
+	return cookie;
+}
+
 void sched_core_fork(struct task_struct *p)
 {
 	RB_CLEAR_NODE(&p->core_node);
 	p->core_cookie = sched_core_clone_cookie(current);
+	p->core_spare_fat = NULL;
 }
 
 void sched_core_free(struct task_struct *p)
 {
 	sched_core_put_cookie(p->core_cookie);
+	kfree(p->core_spare_fat);
 }
 
 int sched_core_exec(void)
 {
 	/* absent a policy mech, if task had a cookie, give it a new one */
-	if (current->core_cookie) {
-		unsigned long cookie = sched_core_alloc_cookie();
+	if (current->core_cookie & TASK_COOKIE) {
+		void *spare_fat = NULL;
+		unsigned long cookie;
+
+		if (__sched_core_alloc_fat(&spare_fat))
+			return -ENOMEM;
+
+		cookie = sched_core_alloc_cookie(TASK_COOKIE);
 		if (!cookie)
 			return -ENOMEM;
-		cookie = sched_core_update_cookie(current, cookie);
+
+		cookie = __sched_core_update_cookie(current, &spare_fat, cookie);
 		sched_core_put_cookie(cookie);
+		kfree(spare_fat);
 	}
 
 	return 0;
@@ -129,7 +387,7 @@ int sched_core_exec(void)
 static void __sched_core_set(struct task_struct *p, unsigned long cookie)
 {
 	cookie = sched_core_get_cookie(cookie);
-	cookie = sched_core_update_cookie(p, cookie);
+	cookie = sched_core_update_cookie(p, cookie | TASK_COOKIE);
 	sched_core_put_cookie(cookie);
 }
 
@@ -171,55 +429,62 @@ int sched_core_share_pid(unsigned int cm
 		goto out;
 	}
 
+	fat_mutex_lock();
+
+	err = sched_core_prealloc_fat(task);
+	if (err)
+		goto out_unlock;
+
 	switch (cmd) {
 	case PR_SCHED_CORE_GET:
 		if (type != PIDTYPE_PID || uaddr & 7) {
 			err = -EINVAL;
-			goto out;
+			goto out_unlock;
 		}
-		cookie = sched_core_clone_cookie(task);
-		if (cookie) {
+		cookie = sched_core_clone_task_cookie(task);
+		if (cookie_ptr(cookie)) {
 			/* XXX improve ? */
 			ptr_to_hashval((void *)cookie, &id);
 		}
 		err = put_user(id, (u64 __user *)uaddr);
-		goto out;
+		goto out_unlock;
 
 	case PR_SCHED_CORE_CLEAR:
 		cookie = 0;
 		break;
 
 	case PR_SCHED_CORE_CREATE:
-		cookie = sched_core_alloc_cookie();
+		cookie = sched_core_alloc_cookie(TASK_COOKIE);
 		if (!cookie) {
 			err = -ENOMEM;
-			goto out;
+			goto out_unlock;
 		}
 		break;
 
 	case PR_SCHED_CORE_SHARE_TO:
-		cookie = sched_core_clone_cookie(current);
+		cookie = sched_core_clone_task_cookie(current);
 		break;
 
 	case PR_SCHED_CORE_SHARE_FROM:
 		if (type != PIDTYPE_PID) {
 			err = -EINVAL;
-			goto out;
+			goto out_unlock;
 		}
-		cookie = sched_core_clone_cookie(task);
+		cookie = sched_core_clone_task_cookie(task);
 		__sched_core_set(current, cookie);
-		goto out;
+		goto out_unlock;
 
 	default:
 		err = -EINVAL;
-		goto out;
+		goto out_unlock;
 	};
 
 	if (type == PIDTYPE_PID) {
 		__sched_core_set(task, cookie);
-		goto out;
+		goto out_unlock;
 	}
 
+again:
 	read_lock(&tasklist_lock);
 	grp = task_pid_type(task, type);
 
@@ -228,6 +493,18 @@ int sched_core_share_pid(unsigned int cm
 			err = -EPERM;
 			goto out_tasklist;
 		}
+
+		if (IS_ENABLED(CONFIG_CGROUP_SCHED) && !p->core_spare_fat) {
+			get_task_struct(p);
+			read_unlock(&tasklist_lock);
+
+			err = sched_core_prealloc_fat(p);
+			put_task_struct(p);
+			if (err)
+				goto out_unlock;
+
+			goto again;
+		}
 	} while_each_pid_thread(grp, type, p);
 
 	do_each_pid_thread(grp, type, p) {
@@ -236,6 +513,8 @@ int sched_core_share_pid(unsigned int cm
 out_tasklist:
 	read_unlock(&tasklist_lock);
 
+out_unlock:
+	fat_mutex_unlock();
 out:
 	sched_core_put_cookie(cookie);
 	put_task_struct(task);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1240,10 +1240,14 @@ extern void sched_core_dequeue(struct rq
 extern void sched_core_get(void);
 extern void sched_core_put(void);
 
-extern unsigned long sched_core_alloc_cookie(void);
+#define TASK_COOKIE	0x01
+#define GROUP_COOKIE	0x02
+
+extern unsigned long sched_core_alloc_cookie(unsigned int type);
 extern void sched_core_put_cookie(unsigned long cookie);
 extern unsigned long sched_core_get_cookie(unsigned long cookie);
 extern unsigned long sched_core_update_cookie(struct task_struct *p, unsigned long cookie);
+extern int sched_core_prealloc_fat(struct task_struct *p);
 
 #else /* !CONFIG_SCHED_CORE */
 
@@ -1257,6 +1261,11 @@ static inline bool sched_core_disabled(v
 	return true;
 }
 
+static inline int sched_core_prealloc_fat(struct task_struct *p)
+{
+	return 0;
+}
+
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
 	return &rq->__lock;



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/9] sched: Trivial core scheduling cookie management
  2021-04-01 13:10 ` [PATCH 3/9] sched: Trivial core scheduling cookie management Peter Zijlstra
@ 2021-04-01 20:04   ` Josh Don
  2021-04-02  7:13     ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Josh Don @ 2021-04-01 20:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Tejun Heo,
	Thomas Gleixner

> +/*
> + * sched_core_update_cookie - Common helper to update a task's core cookie. This
> + * updates the selected cookie field.
> + * @p: The task whose cookie should be updated.
> + * @cookie: The new cookie.
> + * @cookie_type: The cookie field to which the cookie corresponds.

No cookie_type in this patch (or cookie fields). Also might want to
call out that the caller is responsible for put'ing the return value.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 7/9] sched: Cgroup core-scheduling interface
  2021-04-01 13:10 ` [PATCH 7/9] sched: Cgroup core-scheduling interface Peter Zijlstra
@ 2021-04-02  0:34   ` Josh Don
  0 siblings, 0 replies; 36+ messages in thread
From: Josh Don @ 2021-04-02  0:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Tejun Heo,
	Thomas Gleixner

Thanks, allowing for multiple group cookies in a hierarchy is a nice
improvement.

> +               if (tgi != tg) {
> +                       if (tgi->core_cookie || (tgi->core_parent && tgi->core_parent != tg))
> +                               continue;
> +
> +                       tgi->core_parent = parent;
> +                       tgi->core_cookie = 0;

core_cookie must already be 0, given the check above.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/9] sched: Trivial core scheduling cookie management
  2021-04-01 20:04   ` Josh Don
@ 2021-04-02  7:13     ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-02  7:13 UTC (permalink / raw)
  To: Josh Don
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Tejun Heo,
	Thomas Gleixner

On Thu, Apr 01, 2021 at 01:04:58PM -0700, Josh Don wrote:
> > +/*
> > + * sched_core_update_cookie - Common helper to update a task's core cookie. This
> > + * updates the selected cookie field.
> > + * @p: The task whose cookie should be updated.
> > + * @cookie: The new cookie.
> > + * @cookie_type: The cookie field to which the cookie corresponds.
> 
> No cookie_type in this patch (or cookie fields). Also might want to
> call out that the caller is responsible for put'ing the return value.

Lol, I don't think I've even seen the comment. Yes that needs more than
a rewording.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 9/9] sched: prctl() and cgroup interaction
  2021-04-01 13:10 ` [PATCH 9/9] sched: prctl() and cgroup interaction Peter Zijlstra
@ 2021-04-03  1:30   ` Josh Don
  2021-04-06 15:12     ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Josh Don @ 2021-04-03  1:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Tejun Heo,
	Thomas Gleixner

Hi Peter,

I walked through the reference counting, and it seems good to me
(though it did take a few passes to fully digest the invariants for
the fat cookie stuff).

> +unsigned long sched_core_alloc_cookie(unsigned int type)
>  {
>         struct sched_core_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL);
>         if (!ck)
>                 return 0;
> -       refcount_set(&ck->refcnt, 1);
> +       WARN_ON_ONCE(type > GROUP_COOKIE);
> +       sched_core_init_cookie(ck, type);
>         sched_core_get();
>
> -       return (unsigned long)ck;
> +       return (unsigned long)ck | type;
 > }

This feels like it needs to be stronger than a WARN_ON_ONCE; could
create a corrupted address that we later try to kfree().

Also, for my own edification, why will the bottom two bits here always be 0?

> -unsigned long sched_core_alloc_cookie(void)
> +static inline void *cookie_ptr(unsigned long cookie)
> +{
> +       return (void *)(cookie & ~3UL);
> +}
> +
> +static inline int cookie_type(unsigned long cookie)
> +{
> +       return cookie & 3;
> +}

s/3/FAT_COOKIE

> +#define FAT_COOKIE     0x03

Move to sched.h to group with TASK/GROUP_COOKIE?

> +static unsigned long __sched_core_fat_cookie(struct task_struct *p,
> +                                            void **spare_fat,
> +                                            unsigned long cookie)
> +{

This function looks good to me, but could use some more comments about
the pre/post-condition assumptions. Ie. cookie already has a get()
associated with it, caller is expected to kfree the spare_fat.

> +       raw_spin_lock_irqsave(&fat_lock, flags);
> +       n = rb_find_add(&fat->node, &fat_root, fat_cmp);
> +       raw_spin_unlock_irqrestore(&fat_lock, flags);
> +
> +       if (n) {
> +               sched_core_put_fat(fat);
> +               fat = node_2_fat(n);

This put() doesn't seem strictly necessary; caller will be
unconditionally freeing the spare_fat. Keep anyway for completeness,
but add a comment?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-01 13:10 [PATCH 0/9] sched: Core scheduling interfaces Peter Zijlstra
                   ` (8 preceding siblings ...)
  2021-04-01 13:10 ` [PATCH 9/9] sched: prctl() and cgroup interaction Peter Zijlstra
@ 2021-04-04 23:39 ` Tejun Heo
  2021-04-05 18:46   ` Joel Fernandes
                     ` (2 more replies)
  9 siblings, 3 replies; 36+ messages in thread
From: Tejun Heo @ 2021-04-04 23:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx,
	Michal Koutný,
	Christian Brauner, Zefan Li

cc'ing Michal and Christian who've been spending some time on cgroup
interface issues recently and Li Zefan for cpuset.

On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote:
> The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It is
> however changed such that you can have nested tags. The for any given task, the
> first parent with a cookie is the effective one. The rationale is that this way
> you can delegate subtrees and still allow them some control over grouping.

I find it difficult to like the proposed interface from the name (the term
"core" is really confusing given how the word tends to be used internally)
to the semantics (it isn't like anything else) and even the functionality
(we're gonna have fixed processors at some point, right?).

Here are some preliminary thoughts:

* Are both prctl and cgroup based interfaces really necessary? I could be
  being naive but given that we're (hopefully) working around hardware
  deficiencies which will go away in time, I think there's a strong case for
  minimizing at least the interface to the bare minimum.

  Given how cgroups are set up (membership operations happening only for
  seeding, especially with the new clone interface), it isn't too difficult
  to synchronize process tree and cgroup hierarchy where it matters - ie.
  given the right per-process level interface, restricting configuration for
  a cgroup sub-hierarchy may not need any cgroup involvement at all. This
  also nicely gets rid of the interaction between prctl and cgroup bits.

* If we *have* to have cgroup interface, I wonder whether this would fit a
  lot better as a part of cpuset. If you squint just right, this can be
  viewed as some dynamic form of cpuset. Implementation-wise, it probably
  won't integrate with the rest but I think the feature will be less jarring
  as a part of cpuset, which already is a bit of kitchensink anyway.

> The cgroup thing also '(ab)uses' cgroup_mutex for serialization because it
> needs to ensure continuity between ss->can_attach() and ss->attach() for the
> memory allocation. If the prctl() were allowed to interleave it might steal the
> memory.
> 
> Using cgroup_mutex feels icky, but is not without precedent,
> kernel/bpf/cgroup.c does the same thing afaict.
> 
> TJ, can you please have a look at this?

Yeah, using cgroup_mutex for stabilizing cgroup hierarchy for consecutive
operations is fine. It might be worthwhile to break that out into a proper
interface but that's the least of concerns here.

Can someone point me to a realistic and concrete usage scenario for this
feature?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-04 23:39 ` [PATCH 0/9] sched: Core scheduling interfaces Tejun Heo
@ 2021-04-05 18:46   ` Joel Fernandes
  2021-04-06 14:16     ` Tejun Heo
  2021-04-06 15:32   ` Peter Zijlstra
  2021-04-07 16:50   ` Michal Koutný
  2 siblings, 1 reply; 36+ messages in thread
From: Joel Fernandes @ 2021-04-05 18:46 UTC (permalink / raw)
  To: Tejun Heo, Hao Luo
  Cc: Peter Zijlstra, Hyser,Chris, Josh Don, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, LKML,
	Thomas Glexiner, Michal Koutný,
	Christian Brauner, Zefan Li

Hi TJ, Peter,

On Sun, Apr 4, 2021 at 7:39 PM Tejun Heo <tj@kernel.org> wrote:
>
> cc'ing Michal and Christian who've been spending some time on cgroup
> interface issues recently and Li Zefan for cpuset.
>
> On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote:
> > The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It is
> > however changed such that you can have nested tags. The for any given task, the
> > first parent with a cookie is the effective one. The rationale is that this way
> > you can delegate subtrees and still allow them some control over grouping.
>
> I find it difficult to like the proposed interface from the name (the term
> "core" is really confusing given how the word tends to be used internally)
> to the semantics (it isn't like anything else) and even the functionality
> (we're gonna have fixed processors at some point, right?).
>
> Here are some preliminary thoughts:
>
> * Are both prctl and cgroup based interfaces really necessary? I could be
>   being naive but given that we're (hopefully) working around hardware
>   deficiencies which will go away in time, I think there's a strong case for
>   minimizing at least the interface to the bare minimum.

I don't think these issues are going away as there are constantly new
exploits related to SMT that are coming out. Further, core scheduling
is not only for SMT - there are other usecases as well (such as VM
performance by preventing vCPU threads from core-sharing).

>
>   Given how cgroups are set up (membership operations happening only for
>   seeding, especially with the new clone interface), it isn't too difficult
>   to synchronize process tree and cgroup hierarchy where it matters - ie.
>   given the right per-process level interface, restricting configuration for
>   a cgroup sub-hierarchy may not need any cgroup involvement at all. This
>   also nicely gets rid of the interaction between prctl and cgroup bits.
> * If we *have* to have cgroup interface, I wonder whether this would fit a
>   lot better as a part of cpuset. If you squint just right, this can be
>   viewed as some dynamic form of cpuset. Implementation-wise, it probably
>   won't integrate with the rest but I think the feature will be less jarring
>   as a part of cpuset, which already is a bit of kitchensink anyway.

I think both interfaces are important for different reasons. Could you
take a look at the initial thread I started few months ago? I tried to
elaborate about usecases in detail:
http://lore.kernel.org/r/20200822030155.GA414063@google.com

Also, in ChromeOS we can't use CGroups for this purpose. The CGroup
hierarchy does not fit well with the threads we are tagging. Also, we
use CGroup v1, and since CGroups cannot overlap, this is impossible
let alone cumbersome. And, the CGroup interface having core scheduling
is still useful for people using containers and wanting to
core-schedule each container separately ( +Hao Luo can elaborate more
on that, but I did describe it in the link above).

> > The cgroup thing also '(ab)uses' cgroup_mutex for serialization because it
> > needs to ensure continuity between ss->can_attach() and ss->attach() for the
> > memory allocation. If the prctl() were allowed to interleave it might steal the
> > memory.
> >
> > Using cgroup_mutex feels icky, but is not without precedent,
> > kernel/bpf/cgroup.c does the same thing afaict.
> >
> > TJ, can you please have a look at this?
>
> Yeah, using cgroup_mutex for stabilizing cgroup hierarchy for consecutive
> operations is fine. It might be worthwhile to break that out into a proper
> interface but that's the least of concerns here.
>
> Can someone point me to a realistic and concrete usage scenario for this
> feature?

Yeah, its at http://lore.kernel.org/r/20200822030155.GA414063@google.com
as mentioned above, let me know if you need any more details about
usecase.

About the file name, how about kernel/sched/smt.c ? That definitely
provides more information than 'core_sched.c'.

Thanks,
- Joel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-05 18:46   ` Joel Fernandes
@ 2021-04-06 14:16     ` Tejun Heo
  2021-04-18  1:35       ` Joel Fernandes
  0 siblings, 1 reply; 36+ messages in thread
From: Tejun Heo @ 2021-04-06 14:16 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Hao Luo, Peter Zijlstra, Hyser,Chris, Josh Don, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, LKML,
	Thomas Glexiner, Michal Koutný,
	Christian Brauner, Zefan Li

Hello,

On Mon, Apr 05, 2021 at 02:46:09PM -0400, Joel Fernandes wrote:
> Yeah, its at http://lore.kernel.org/r/20200822030155.GA414063@google.com
> as mentioned above, let me know if you need any more details about
> usecase.

Except for the unspecified reason in usecase 4, I don't see why cgroup is in
the picture at all. This doesn't really have much to do with hierarchical
resource distribution. Besides, yes, you can use cgroup for logical
structuring and identificaiton purposes but in those cases the interactions
and interface should be with the original subsystem while using cgroup IDs
or paths as parameters - see tracing and bpf for examples.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 9/9] sched: prctl() and cgroup interaction
  2021-04-03  1:30   ` Josh Don
@ 2021-04-06 15:12     ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-06 15:12 UTC (permalink / raw)
  To: Josh Don
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Tejun Heo,
	Thomas Gleixner

On Fri, Apr 02, 2021 at 06:30:48PM -0700, Josh Don wrote:
> Hi Peter,
> 
> I walked through the reference counting, and it seems good to me
> (though it did take a few passes to fully digest the invariants for
> the fat cookie stuff).
> 
> > +unsigned long sched_core_alloc_cookie(unsigned int type)
> >  {
> >         struct sched_core_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL);
> >         if (!ck)
> >                 return 0;
> > -       refcount_set(&ck->refcnt, 1);
> > +       WARN_ON_ONCE(type > GROUP_COOKIE);
> > +       sched_core_init_cookie(ck, type);
> >         sched_core_get();
> >
> > -       return (unsigned long)ck;
> > +       return (unsigned long)ck | type;
>  > }
> 
> This feels like it needs to be stronger than a WARN_ON_ONCE; could
> create a corrupted address that we later try to kfree().

The fattie is also released with kfree(), not a real problem that.

> Also, for my own edification, why will the bottom two bits here always be 0?

#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)

> > -unsigned long sched_core_alloc_cookie(void)
> > +static inline void *cookie_ptr(unsigned long cookie)
> > +{
> > +       return (void *)(cookie & ~3UL);
> > +}
> > +
> > +static inline int cookie_type(unsigned long cookie)
> > +{
> > +       return cookie & 3;
> > +}
> 
> s/3/FAT_COOKIE

COOKIE_MASK if anything..

> > +#define FAT_COOKIE     0x03
> 
> Move to sched.h to group with TASK/GROUP_COOKIE?

Didn't want to expose FAT outside the ifdef

> > +static unsigned long __sched_core_fat_cookie(struct task_struct *p,
> > +                                            void **spare_fat,
> > +                                            unsigned long cookie)
> > +{
> 
> This function looks good to me, but could use some more comments about
> the pre/post-condition assumptions. Ie. cookie already has a get()
> associated with it, caller is expected to kfree the spare_fat.

Fair enough. I'll go write up something. I had to fix a refcount issue
here, which should've been clue a comment it required.

There's actually a bug with the rb tree magic too, I'll go fix.

> > +       raw_spin_lock_irqsave(&fat_lock, flags);
> > +       n = rb_find_add(&fat->node, &fat_root, fat_cmp);
> > +       raw_spin_unlock_irqrestore(&fat_lock, flags);
> > +
> > +       if (n) {
> > +               sched_core_put_fat(fat);
> > +               fat = node_2_fat(n);
> 
> This put() doesn't seem strictly necessary; caller will be
> unconditionally freeing the spare_fat. Keep anyway for completeness,
> but add a comment?

It needs to put the references we got on constructing the new fat.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-04 23:39 ` [PATCH 0/9] sched: Core scheduling interfaces Tejun Heo
  2021-04-05 18:46   ` Joel Fernandes
@ 2021-04-06 15:32   ` Peter Zijlstra
  2021-04-06 16:08     ` Tejun Heo
  2021-04-07 16:50   ` Michal Koutný
  2 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-06 15:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx,
	Michal Koutný,
	Christian Brauner, Zefan Li

On Sun, Apr 04, 2021 at 07:39:03PM -0400, Tejun Heo wrote:
> cc'ing Michal and Christian who've been spending some time on cgroup
> interface issues recently and Li Zefan for cpuset.
> 
> On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote:
> > The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It is
> > however changed such that you can have nested tags. The for any given task, the
> > first parent with a cookie is the effective one. The rationale is that this way
> > you can delegate subtrees and still allow them some control over grouping.
> 
> I find it difficult to like the proposed interface from the name (the term
> "core" is really confusing given how the word tends to be used internally)
> to the semantics (it isn't like anything else) and even the functionality
> (we're gonna have fixed processors at some point, right?).

Core is the topological name for the thing that hosts the SMT threads.
Can't really help that.

> Here are some preliminary thoughts:
> 
> * Are both prctl and cgroup based interfaces really necessary? I could be
>   being naive but given that we're (hopefully) working around hardware
>   deficiencies which will go away in time, I think there's a strong case for
>   minimizing at least the interface to the bare minimum.

I'm not one for cgroups much, so I'll let others argue that case, except
that per systemd and all the other new fangled shit, people seem to use
cgroups a lot to group tasks. So it makes sense to also expose this
through cgroups in some form.

That said; I've had requests from lots of non security folks about this
feature to help mitigate the SMT interference.

Consider for example Real-Time. If you have an active SMT sibling, the
CPU performance is much less than it would be when the SMT sibling is
idle. Therefore, for the benefit of determinism, it would be very nice
if RT tasks could force-idle their SMT siblings, and voila, this
interface allows exactly that.

The same is true for other workloads that care about interference.

>   Given how cgroups are set up (membership operations happening only for
>   seeding, especially with the new clone interface), it isn't too difficult
>   to synchronize process tree and cgroup hierarchy where it matters - ie.
>   given the right per-process level interface, restricting configuration for
>   a cgroup sub-hierarchy may not need any cgroup involvement at all. This
>   also nicely gets rid of the interaction between prctl and cgroup bits.

I've no idea what you mean :/ The way I use cgroups (when I have to, for
testing) is to echo the pid into /cgroup/foo/tasks. No clone or anything
involved.

None of my test machines come up with cgroupfs mounted, and any and all
cgroup setup is under my control.

> * If we *have* to have cgroup interface, I wonder whether this would fit a
>   lot better as a part of cpuset. If you squint just right, this can be
>   viewed as some dynamic form of cpuset. Implementation-wise, it probably
>   won't integrate with the rest but I think the feature will be less jarring
>   as a part of cpuset, which already is a bit of kitchensink anyway.

Not sure I agree, we do not change the affinity of things, we only
control who's allowed to run concurrently on SMT siblings. There could
be a cpuset partition split between the siblings and it would still work
fine.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-06 15:32   ` Peter Zijlstra
@ 2021-04-06 16:08     ` Tejun Heo
  2021-04-07 18:39       ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Tejun Heo @ 2021-04-06 16:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx,
	Michal Koutný,
	Christian Brauner, Zefan Li

Hello,

On Tue, Apr 06, 2021 at 05:32:04PM +0200, Peter Zijlstra wrote:
> > I find it difficult to like the proposed interface from the name (the term
> > "core" is really confusing given how the word tends to be used internally)
> > to the semantics (it isn't like anything else) and even the functionality
> > (we're gonna have fixed processors at some point, right?).
> 
> Core is the topological name for the thing that hosts the SMT threads.
> Can't really help that.

I find the name pretty unfortunate given how overloaded the term is
generally and also in kernel but oh well...

> > Here are some preliminary thoughts:
> > 
> > * Are both prctl and cgroup based interfaces really necessary? I could be
> >   being naive but given that we're (hopefully) working around hardware
> >   deficiencies which will go away in time, I think there's a strong case for
> >   minimizing at least the interface to the bare minimum.
> 
> I'm not one for cgroups much, so I'll let others argue that case, except
> that per systemd and all the other new fangled shit, people seem to use
> cgroups a lot to group tasks. So it makes sense to also expose this
> through cgroups in some form.

All the new fangled things follow a certain usage pattern which makes
aligning parts of process tree with cgroup layout trivial, so when
restrictions can be applied along the process tree like this and there isn't
a particular need for dynamic hierarchical control, there isn't much need
for direct cgroup interface.

> That said; I've had requests from lots of non security folks about this
> feature to help mitigate the SMT interference.
> 
> Consider for example Real-Time. If you have an active SMT sibling, the
> CPU performance is much less than it would be when the SMT sibling is
> idle. Therefore, for the benefit of determinism, it would be very nice
> if RT tasks could force-idle their SMT siblings, and voila, this
> interface allows exactly that.
> 
> The same is true for other workloads that care about interference.

I see.

> >   Given how cgroups are set up (membership operations happening only for
> >   seeding, especially with the new clone interface), it isn't too difficult
> >   to synchronize process tree and cgroup hierarchy where it matters - ie.
> >   given the right per-process level interface, restricting configuration for
> >   a cgroup sub-hierarchy may not need any cgroup involvement at all. This
> >   also nicely gets rid of the interaction between prctl and cgroup bits.
> 
> I've no idea what you mean :/ The way I use cgroups (when I have to, for
> testing) is to echo the pid into /cgroup/foo/tasks. No clone or anything
> involved.

The usage pattern is creating a new cgroup, seeding it with a process
(either writing to tasks or using CLONE_INTO_CGROUP) and let it continue
only on that sub-hierarchy, so cgroup hierarchy usually partially overlays
process trees.

> None of my test machines come up with cgroupfs mounted, and any and all
> cgroup setup is under my control.
>
> > * If we *have* to have cgroup interface, I wonder whether this would fit a
> >   lot better as a part of cpuset. If you squint just right, this can be
> >   viewed as some dynamic form of cpuset. Implementation-wise, it probably
> >   won't integrate with the rest but I think the feature will be less jarring
> >   as a part of cpuset, which already is a bit of kitchensink anyway.
> 
> Not sure I agree, we do not change the affinity of things, we only
> control who's allowed to run concurrently on SMT siblings. There could
> be a cpuset partition split between the siblings and it would still work
> fine.

I see. Yeah, if we really need it, I'm not sure it fits in cgroup interface
proper. As I wrote elsewhere, these things are usually implemented on the
originating subsystem interface with cgroup ID as a parameter.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-04 23:39 ` [PATCH 0/9] sched: Core scheduling interfaces Tejun Heo
  2021-04-05 18:46   ` Joel Fernandes
  2021-04-06 15:32   ` Peter Zijlstra
@ 2021-04-07 16:50   ` Michal Koutný
  2021-04-07 18:34     ` Peter Zijlstra
  2 siblings, 1 reply; 36+ messages in thread
From: Michal Koutný @ 2021-04-07 16:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, joel, chris.hyser, joshdon, mingo,
	vincent.guittot, valentin.schneider, mgorman, linux-kernel, tglx,
	Christian Brauner, Zefan Li

[-- Attachment #1: Type: text/plain, Size: 1209 bytes --]

Hello.

IIUC, the premise is that the tasks that have different cookies imply
they would never share a core.

On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote:
> The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It is
> however changed such that you can have nested tags. The for any given task, the
> first parent with a cookie is the effective one. The rationale is that this way
> you can delegate subtrees and still allow them some control over grouping.

Given the existence of prctl and clone APIs, I don't see the reason to
have a separate cgroup-bound interface too (as argued by Tejun). The
potential speciality is the possibility to re-tag whole groups of
processes at runtime (but the listed use cases [1] don't require that
and it's not such a good idea given its asynchronicity).

Also, I would find useful some more explanation how the hierarchical
behavior is supposed to work. In my understanding the task is either
allowed to request its own isolation or not. The cgroups could be used
to restrict this privilege, however, that doesn't seem to be the case
here.

My two cents,
Michal

[1] https://lore.kernel.org/lkml/20200822030155.GA414063@google.com/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 5/9] sched: prctl() core-scheduling interface
  2021-04-01 13:10 ` [PATCH 5/9] sched: prctl() core-scheduling interface Peter Zijlstra
@ 2021-04-07 17:00   ` Peter Zijlstra
  2021-04-18  3:52     ` Joel Fernandes
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-07 17:00 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, tj, tglx

On Thu, Apr 01, 2021 at 03:10:17PM +0200, Peter Zijlstra wrote:

> Current hard-coded policies are:
> 
>  - a user can clear the cookie of any process they can set a cookie for.
>    Lack of a cookie *might* be a security issue if cookies are being used
>    for that.

ChromeOS people, what are you doing about this? syscall/prctl filtering?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-07 16:50   ` Michal Koutný
@ 2021-04-07 18:34     ` Peter Zijlstra
  2021-04-08 13:25       ` Michal Koutný
  2021-04-19 11:30       ` Tejun Heo
  0 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-07 18:34 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx,
	Christian Brauner, Zefan Li

On Wed, Apr 07, 2021 at 06:50:32PM +0200, Michal Koutný wrote:
> Hello.
> 
> IIUC, the premise is that the tasks that have different cookies imply
> they would never share a core.

Correct.

> On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote:
> > The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It is
> > however changed such that you can have nested tags. The for any given task, the
> > first parent with a cookie is the effective one. The rationale is that this way
> > you can delegate subtrees and still allow them some control over grouping.
> 
> Given the existence of prctl and clone APIs, I don't see the reason to
> have a separate cgroup-bound interface too (as argued by Tejun). 

IMO as long as cgroups have that tasks file, you get to support people
using it. That means that tasks joining your cgroup need to 'inherit'
cgroup properties.

That's not something covered by either prctl or clone.

> The potential speciality is the possibility to re-tag whole groups of
> processes at runtime (but the listed use cases [1] don't require that
> and it's not such a good idea given its asynchronicity).

That seems to be the implication of having that tasks file. Tasks can
join a cgroup, so you get to deal with that.

You can't just say, don't do that then.

> Also, I would find useful some more explanation how the hierarchical
> behavior is supposed to work. In my understanding the task is either
> allowed to request its own isolation or not. The cgroups could be used
> to restrict this privilege, however, that doesn't seem to be the case
> here.

Given something like:

	R
       / \
      A   B
         / \
	C   D

B group can set core_sched=1 and then all its (and its decendants) tasks
get to have the same (group) cookie and cannot share with others.

If however B is a delegate and has a subgroup D that is security
sensitive and must not share core resources with the rest of B, then it
can also set D.core_sched=1, such that D (and its decendants) will have
another (group) cookie.

On top of this, say C has a Real-Time tasks, that wants to limit SMT
interference, then it can set a (task/prctl) cookie on itself, such that
it will not share the core with the rest of the tasks of B.


In that scenario the D subtree is a restriction (doesn't share) with the
B subtree.

And all of B is a restriction on all its tasks, including the Real-Time
task that set a task cookie, in that none of them can share with tasks
outside of B (including system tasks which are in R), irrespective of
what they do with their task cookie.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-06 16:08     ` Tejun Heo
@ 2021-04-07 18:39       ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-07 18:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx,
	Michal Koutný,
	Christian Brauner, Zefan Li

On Tue, Apr 06, 2021 at 12:08:50PM -0400, Tejun Heo wrote:

> I see. Yeah, if we really need it, I'm not sure it fits in cgroup interface
> proper. As I wrote elsewhere, these things are usually implemented on the
> originating subsystem interface with cgroup ID as a parameter.

This would be something like:

  prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, cgroup-fd, PIDTYPE_CGROUP, NULL);

right? Where we assign to self the cookie from the cgroup.

The problem I see with this is that a task can trivially undo/circumvent
this by calling PR_SCHED_CORE_CLEAR on itself, at which point it can
share with system tasks again.

Also, it doesn't really transfer well to the group/tasks thing. When a
task joins a cgroup, it doesn't automagically gain the cgroup
properties. Whoever does the transition will then also have to prctl()
this, which nobody will do.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-07 18:34     ` Peter Zijlstra
@ 2021-04-08 13:25       ` Michal Koutný
  2021-04-08 15:02         ` Peter Zijlstra
  2021-04-19 11:30       ` Tejun Heo
  1 sibling, 1 reply; 36+ messages in thread
From: Michal Koutný @ 2021-04-08 13:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx,
	Christian Brauner, Zefan Li

[-- Attachment #1: Type: text/plain, Size: 2485 bytes --]

On Wed, Apr 07, 2021 at 08:34:24PM +0200, Peter Zijlstra <peterz@infradead.org> wrote:
> IMO as long as cgroups have that tasks file, you get to support people
> using it. That means that tasks joining your cgroup need to 'inherit'
> cgroup properties.
The tasks file is consequence of binding this to cgroups, I'm one step
back. Why to make "core isolation" a cgroup property?

(I understand this could help "visualize" what the common domains are if
cgroups were the only API but with prctl the structure can be
arbitrarily modified anyway.)


> Given something like:
> 
>         R
>        / \
>       A   B
>          / \
>         C   D
Thanks for the example. 

> B group can set core_sched=1 and then all its (and its decendants) tasks
> get to have the same (group) cookie and cannot share with others.
The same could be achieved with the first task of group B allocating its
new cookie which would be inherited in its descednants.

> If however B is a delegate and has a subgroup D that is security
> sensitive and must not share core resources with the rest of B, then it
> can also set D.core_sched=1, such that D (and its decendants) will have
> another (group) cookie.
If there is such a sensitive descendant task, it could allocate a new
cookie (same way as the first one in B did).

> On top of this, say C has a Real-Time tasks, that wants to limit SMT
> interference, then it can set a (task/prctl) cookie on itself, such that
> it will not share the core with the rest of the tasks of B.
(IIUC, in this particular example it'd be redundant if B had no inner
tasks since D isolated itself already.)
Yes, so this is again the same pattern as the tasks above have done.

> In that scenario the D subtree is a restriction (doesn't share) with the
> B subtree.
This implies D's isolation from everything else too, not just B's
members, no?

> And all of B is a restriction on all its tasks, including the Real-Time
> task that set a task cookie, in that none of them can share with tasks
> outside of B (including system tasks which are in R), irrespective of
> what they do with their task cookie.
IIUC, the equivalent restriction could be achieved with the PTRACE-like
check in the prctl API too (with respectively divided uids).

I'm curious whether the cgroup API actually simplifies things that are
possible with the clone/prctl API or allows anything that wouldn't be
otherwise possible.

Regards,
Michal


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-08 13:25       ` Michal Koutný
@ 2021-04-08 15:02         ` Peter Zijlstra
  2021-04-09  0:16           ` Josh Don
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-08 15:02 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx,
	Christian Brauner, Zefan Li

On Thu, Apr 08, 2021 at 03:25:52PM +0200, Michal Koutný wrote:
> On Wed, Apr 07, 2021 at 08:34:24PM +0200, Peter Zijlstra <peterz@infradead.org> wrote:
> > IMO as long as cgroups have that tasks file, you get to support people
> > using it. That means that tasks joining your cgroup need to 'inherit'
> > cgroup properties.
> The tasks file is consequence of binding this to cgroups, I'm one step
> back. Why to make "core isolation" a cgroup property?

Yeah, dunno, people asked for it. I'm just proposing an implementation
that, when given the need, seems to make sense and is internally
consistent.

> (I understand this could help "visualize" what the common domains are if
> cgroups were the only API but with prctl the structure can be
> arbitrarily modified anyway.)
> 
> 
> > Given something like:
> > 
> >         R
> >        / \
> >       A   B
> >          / \
> >         C   D
> Thanks for the example. 
> 
> > B group can set core_sched=1 and then all its (and its decendants) tasks
> > get to have the same (group) cookie and cannot share with others.
> The same could be achieved with the first task of group B allocating its
> new cookie which would be inherited in its descednants.

Except then the task can CLEAR its own cookie and escape the constraint.

> > In that scenario the D subtree is a restriction (doesn't share) with the
> > B subtree.
> This implies D's isolation from everything else too, not just B's
> members, no?

Correct. Look at it as a contraint on co-scheduling, you can never,
whatever you do, share an SMT sibling with someone outside your subtree.

> > And all of B is a restriction on all its tasks, including the Real-Time
> > task that set a task cookie, in that none of them can share with tasks
> > outside of B (including system tasks which are in R), irrespective of
> > what they do with their task cookie.
> IIUC, the equivalent restriction could be achieved with the PTRACE-like
> check in the prctl API too (with respectively divided uids).

I'm not sure I understand; if tasks in A and B are of the same user,
then ptrace will not help anything. And per the above, you always have
ptrace on yourself so you can escape your constraint per the above.

> I'm curious whether the cgroup API actually simplifies things that are
> possible with the clone/prctl API or allows anything that wouldn't be
> otherwise possible.

With the cgroup API it is impossible for a task to escape the cgroup
constraint. It can never share a core with anything not in the subtree.

This is not possible with just the task interface.

If this is actually needed I've no clue, IMO all of cgroups is not
needed :-) Clearly other people feel differently about that.


Much of this would go away if CLEAR were not possible I suppose. But
IIRC the idea was to let a task isolate itself temporarily, while doing
some sensitive thing (eg. encrypt an email) but otherwise not be
constrained. But I'm not sure I can remember all the various things
people wanted this crud for :/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-08 15:02         ` Peter Zijlstra
@ 2021-04-09  0:16           ` Josh Don
  0 siblings, 0 replies; 36+ messages in thread
From: Josh Don @ 2021-04-09  0:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Koutný,
	Tejun Heo, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, linux-kernel,
	Thomas Gleixner, Christian Brauner, Zefan Li

On Thu, Apr 8, 2021 at 9:47 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Apr 08, 2021 at 03:25:52PM +0200, Michal Koutný wrote:
>
> > I'm curious whether the cgroup API actually simplifies things that are
> > possible with the clone/prctl API or allows anything that wouldn't be
> > otherwise possible.
>
> With the cgroup API it is impossible for a task to escape the cgroup
> constraint. It can never share a core with anything not in the subtree.
>
> This is not possible with just the task interface.
>
> If this is actually needed I've no clue, IMO all of cgroups is not
> needed :-) Clearly other people feel differently about that.

The cgroup interface seems very nice from a management perspective;
moving arbitrary tasks around in the cgroup hierarchy will handle the
necessary cookie adjustments to ensure that nothing shares with an
unexpected task. It also makes auditing the core scheduling groups
very straightforward; anything in a cookie'd cgroup's tasks file will
be guaranteed isolated from the rest of the system, period.

Further, if a system management thread wants two arbitrary tasks A and
B to share a cookie, this seems more painful with the PRCTL interface.
The management thread would need to something like
- PR_SCHED_CORE_CREATE
- PR_SCHED_CORE_SHARE_TO A
- PR_SCHED_CORE_SHARE_TO B
- PR_SCHED_CORE_CLEAR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-06 14:16     ` Tejun Heo
@ 2021-04-18  1:35       ` Joel Fernandes
  2021-04-19  9:00         ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Joel Fernandes @ 2021-04-18  1:35 UTC (permalink / raw)
  To: Tejun Heo, Peter Zijlstra
  Cc: Hao Luo, Hyser,Chris, Josh Don, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, LKML, Thomas Glexiner,
	Michal Koutný,
	Christian Brauner, Zefan Li

On Tue, Apr 06, 2021 at 10:16:12AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Apr 05, 2021 at 02:46:09PM -0400, Joel Fernandes wrote:
> > Yeah, its at http://lore.kernel.org/r/20200822030155.GA414063@google.com
> > as mentioned above, let me know if you need any more details about
> > usecase.
> 
> Except for the unspecified reason in usecase 4, I don't see why cgroup is in
> the picture at all. This doesn't really have much to do with hierarchical
> resource distribution. Besides, yes, you can use cgroup for logical
> structuring and identificaiton purposes but in those cases the interactions
> and interface should be with the original subsystem while using cgroup IDs
> or paths as parameters - see tracing and bpf for examples.

Personally for ChromeOS, we need only the per-task interface. Considering
that the second argument of this prctl is a command, I don't see why we
cannot add a new command PR_SCHED_CORE_CGROUP_SHARE to do what Tejun is
saying (in the future).

In order to not block ChromeOS and other "per-task interface" usecases, I
suggest we keep the CGroup interface for a later time (whether that's
through prctl or the CGroups FS way which Tejun dislikes) and move forward
with per-task interface only initially.

Peter, any thoughts on this?

thanks,

- Joel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 5/9] sched: prctl() core-scheduling interface
  2021-04-07 17:00   ` Peter Zijlstra
@ 2021-04-18  3:52     ` Joel Fernandes
  0 siblings, 0 replies; 36+ messages in thread
From: Joel Fernandes @ 2021-04-18  3:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: chris.hyser, joshdon, mingo, vincent.guittot, valentin.schneider,
	mgorman, linux-kernel, tj, tglx

On Wed, Apr 07, 2021 at 07:00:33PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 01, 2021 at 03:10:17PM +0200, Peter Zijlstra wrote:
> 
> > Current hard-coded policies are:
> > 
> >  - a user can clear the cookie of any process they can set a cookie for.
> >    Lack of a cookie *might* be a security issue if cookies are being used
> >    for that.
> 
> ChromeOS people, what are you doing about this? syscall/prctl filtering?

Yes, in ChromeOS, we allow the prctl(2) syscall only before entering the
seccomp sandbox. Once we enter the sandbox, we do not allow the prctl(2).

This has the nice design that the security is enforced on entering the
sandbox, and prior to entering the sandbox, no permissions need be given.

Let me know if that makes sense and if you had any other questions. thanks,

-Joel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-18  1:35       ` Joel Fernandes
@ 2021-04-19  9:00         ` Peter Zijlstra
  2021-04-21 13:35           ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-19  9:00 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Tejun Heo, Hao Luo, Hyser,Chris, Josh Don, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, LKML,
	Thomas Glexiner, Michal Koutný,
	Christian Brauner, Zefan Li

On Sat, Apr 17, 2021 at 09:35:07PM -0400, Joel Fernandes wrote:
> On Tue, Apr 06, 2021 at 10:16:12AM -0400, Tejun Heo wrote:
> > Hello,
> > 
> > On Mon, Apr 05, 2021 at 02:46:09PM -0400, Joel Fernandes wrote:
> > > Yeah, its at http://lore.kernel.org/r/20200822030155.GA414063@google.com
> > > as mentioned above, let me know if you need any more details about
> > > usecase.
> > 
> > Except for the unspecified reason in usecase 4, I don't see why cgroup is in
> > the picture at all. This doesn't really have much to do with hierarchical
> > resource distribution. Besides, yes, you can use cgroup for logical
> > structuring and identificaiton purposes but in those cases the interactions
> > and interface should be with the original subsystem while using cgroup IDs
> > or paths as parameters - see tracing and bpf for examples.
> 
> Personally for ChromeOS, we need only the per-task interface. Considering
> that the second argument of this prctl is a command, I don't see why we
> cannot add a new command PR_SCHED_CORE_CGROUP_SHARE to do what Tejun is
> saying (in the future).
> 
> In order to not block ChromeOS and other "per-task interface" usecases, I
> suggest we keep the CGroup interface for a later time (whether that's
> through prctl or the CGroups FS way which Tejun dislikes) and move forward
> with per-task interface only initially.

Josh, you being on the other Google team, the one that actually uses the
cgroup interface AFAIU, can you fight the good fight with TJ on this?

> Peter, any thoughts on this?

Adding CGROUP_SHARE is not sufficient to close the hole against CLEAR.
So we either then have to 'tweak' the meaning of CLEAR or replace it
entirely, neither seem attractive.


I'd love to make some progress on all this.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-07 18:34     ` Peter Zijlstra
  2021-04-08 13:25       ` Michal Koutný
@ 2021-04-19 11:30       ` Tejun Heo
  2021-04-20  1:17         ` Josh Don
  1 sibling, 1 reply; 36+ messages in thread
From: Tejun Heo @ 2021-04-19 11:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Koutný,
	joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx,
	Christian Brauner, Zefan Li

Hello,

Sorry about late reply.

On Wed, Apr 07, 2021 at 08:34:24PM +0200, Peter Zijlstra wrote:
> > Given the existence of prctl and clone APIs, I don't see the reason to
> > have a separate cgroup-bound interface too (as argued by Tejun). 
> 
> IMO as long as cgroups have that tasks file, you get to support people
> using it. That means that tasks joining your cgroup need to 'inherit'

This argument doesn't really make sense to me. We don't just add things to
make interfaces orthogonal. It can be a consideration but not the only or
even one of the most important ones. There are many cases we say to not
shoot oneself in the foot and also many interfaces which are fading away or
in the process of being deprecated.

I'm not planning to deprecate the dynamic migration interfaces given the
history and usefulness in testing but the usage model of cgroup2 is clearly
defined and documented in this regard - whether the initial population of
the cgroup happens through CLONE_INTO_CGROUP or migration, for resource
tracking and control purposes, cgroup2 does not support dynamic migration
with the exception of migrations within threaded domains.

Anything is a possibility but given how this requirement is intertwined with
achieveing comprehensive resource control, a core goal of cgroup2, and
widely adopted by all the new fangled things as you put it, changing this
wouldn't be easy. Not just because some people including me are against it
but because there are inherent technical challenges and overheads to
supporting dynamic charge migration for stateful controllers and the
cost-benefit balance doesn't come out in favor.

So, the current "policy" is something like this:

* cgroupfs interface is for cgroup core features of organizing cgroups and
  processes and configuring resource configurations which preferably follow
  one of the control models defined in the doc.

* The hierarchical organization is semi-static in the sense that once a
  cgroup is populated tasks shouldn't be moved in or out of the cgroup with
  the exception of threaded domains.

* Non-resource control usages can hook into cgroup for identification /
  tagging purposes but should use the originating interface for
  interactions.

This has been consistently applied over the years now. There of course can
be exceptions but I haven't seen anything outstanding in this round of
discussion so am pretty skeptical. The actual use cases don't seem to need
it and the only argument for it is it'd be nice to have it and involves
violating the usage model.

My suggestion is going ahead with the per-process interface with cgroup
extension on mind in case actual use cases arise. Also, when planning cgroup
integration, putting dynamic migration front and center likely isn't a good
idea.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-19 11:30       ` Tejun Heo
@ 2021-04-20  1:17         ` Josh Don
  0 siblings, 0 replies; 36+ messages in thread
From: Josh Don @ 2021-04-20  1:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Michal Koutný,
	Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Thomas Gleixner,
	Christian Brauner, Zefan Li

On Mon, Apr 19, 2021 at 2:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Josh, you being on the other Google team, the one that actually uses the
> cgroup interface AFAIU, can you fight the good fight with TJ on this?

A bit of extra context is in
https://lore.kernel.org/lkml/CABk29NtTScu2HO7V9Di+Fh2gv8zu5xiC5iNRwPFCLhpD+DKP0A@mail.gmail.com.

On the management/auditing side, the cgroup interface gives a clear
indication of which tasks share a cookie. It is a bit less attractive
to add a prctl interface for enumerating this.

Also on the management side, I illustrated in the above message how a
thread would potentially group together other threads. One limitation
of the current prctl interface is that the share_{to, from} always
operates on the current thread. Granted we can work around this as
described, and also potentially extend the prctl interface to operate
on two tasks.

So I agree that the cgroup interface here isn't strictly necessary,
though it seems convenient. I will double-check with internal teams
that would be using the interface to see if there are any other
considerations I'm missing.

On Mon, Apr 19, 2021 at 4:30 AM Tejun Heo <tj@kernel.org> wrote:
>
> My suggestion is going ahead with the per-process interface with cgroup
> extension on mind in case actual use cases arise. Also, when planning cgroup
> integration, putting dynamic migration front and center likely isn't a good
> idea.

tasks would not be frequently moved around; I'd expect security
configuration to remain mostly static. Or maybe I'm misunderstanding
your emphasis here?


If you feel the above is not strong enough (ie. there should be a use
case not feasible with prctl), I'd support that we move forward with
the prctl stuff for now, since the cgroup interface is independant.

Thanks,
Josh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/9] sched: Default core-sched policy
  2021-04-01 13:10 ` [PATCH 4/9] sched: Default core-sched policy Peter Zijlstra
@ 2021-04-21 13:33   ` Peter Zijlstra
  2021-04-21 14:31     ` Chris Hyser
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-21 13:33 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, tj, tglx

On Thu, Apr 01, 2021 at 03:10:16PM +0200, Peter Zijlstra wrote:
> Implement default core scheduling policy.
> 
>  - fork() / clone(): inherit cookie from parent
>  - exec(): if cookie then new cookie
> 
> Did that exec() thing want to change cookie on suid instead, just like
> perf_event_exit_task() ?

Talk to me about that exec() thing; I still think it's weird. Did we
just want to drop cookie on suid instead?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-19  9:00         ` Peter Zijlstra
@ 2021-04-21 13:35           ` Peter Zijlstra
  2021-04-21 14:45             ` Chris Hyser
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2021-04-21 13:35 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Tejun Heo, Hao Luo, Hyser,Chris, Josh Don, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, LKML,
	Thomas Glexiner, Michal Koutný,
	Christian Brauner, Zefan Li

On Mon, Apr 19, 2021 at 11:00:57AM +0200, Peter Zijlstra wrote:
> On Sat, Apr 17, 2021 at 09:35:07PM -0400, Joel Fernandes wrote:

> > Peter, any thoughts on this?
> 
> Adding CGROUP_SHARE is not sufficient to close the hole against CLEAR.
> So we either then have to 'tweak' the meaning of CLEAR or replace it
> entirely, neither seem attractive.
> 
> 
> I'd love to make some progress on all this.

Can I comment out CLEAR so we can sort that out later? I suppose people
can still do temp cookies simply by using a temp task.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/9] sched: Default core-sched policy
  2021-04-21 13:33   ` Peter Zijlstra
@ 2021-04-21 14:31     ` Chris Hyser
  0 siblings, 0 replies; 36+ messages in thread
From: Chris Hyser @ 2021-04-21 14:31 UTC (permalink / raw)
  To: Peter Zijlstra, joel, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, tj, tglx

On 4/21/21 9:33 AM, Peter Zijlstra wrote:
> On Thu, Apr 01, 2021 at 03:10:16PM +0200, Peter Zijlstra wrote:
>> Implement default core scheduling policy.
>>
>>   - fork() / clone(): inherit cookie from parent
>>   - exec(): if cookie then new cookie
>>
>> Did that exec() thing want to change cookie on suid instead, just like
>> perf_event_exit_task() ?
> 
> Talk to me about that exec() thing; I still think it's weird. Did we
> just want to drop cookie on suid instead?
> 

The choices for fork and exec were for security. A forked process or thread gets a copy because the code is "trusted", 
but execed code is not by default (that was the policy choice) and so it gets a new cookie. Dropping the cookie on suid 
was not considered (an oversight on my part), but seems like a good idea, but I think, also orthogonal.

The biggest issue with a new cookie on exec is that makes it difficult to construct a large process tree of all the same 
cookies like the cgroup interface does. If we remove the cgroup interface, we either need to remove the "new cookie on 
exec" (inherit cookie from parent from fork) or provide a global/per-task policy for controlling that. Probably easiest 
to just drop it.

-chrish

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] sched: Core scheduling interfaces
  2021-04-21 13:35           ` Peter Zijlstra
@ 2021-04-21 14:45             ` Chris Hyser
  0 siblings, 0 replies; 36+ messages in thread
From: Chris Hyser @ 2021-04-21 14:45 UTC (permalink / raw)
  To: Peter Zijlstra, Joel Fernandes
  Cc: Tejun Heo, Hao Luo, Josh Don, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, LKML, Thomas Glexiner,
	Michal Koutný,
	Christian Brauner, Zefan Li

On 4/21/21 9:35 AM, Peter Zijlstra wrote:
> On Mon, Apr 19, 2021 at 11:00:57AM +0200, Peter Zijlstra wrote:
>> On Sat, Apr 17, 2021 at 09:35:07PM -0400, Joel Fernandes wrote:
> 
>>> Peter, any thoughts on this?
>>
>> Adding CGROUP_SHARE is not sufficient to close the hole against CLEAR.
>> So we either then have to 'tweak' the meaning of CLEAR or replace it
>> entirely, neither seem attractive.
>>
>>
>> I'd love to make some progress on all this.
> 
> Can I comment out CLEAR so we can sort that out later? I suppose people

I merely added CLEAR for completeness. Ultimately, I think having to kill a process because a cookie got set by mistake 
is bad, but it can absolutely be sorted out later.

-chrish

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2021-04-21 14:45 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-01 13:10 [PATCH 0/9] sched: Core scheduling interfaces Peter Zijlstra
2021-04-01 13:10 ` [PATCH 1/9] sched: Allow sched_core_put() from atomic context Peter Zijlstra
2021-04-01 13:10 ` [PATCH 2/9] sched: Implement core-sched assertions Peter Zijlstra
2021-04-01 13:10 ` [PATCH 3/9] sched: Trivial core scheduling cookie management Peter Zijlstra
2021-04-01 20:04   ` Josh Don
2021-04-02  7:13     ` Peter Zijlstra
2021-04-01 13:10 ` [PATCH 4/9] sched: Default core-sched policy Peter Zijlstra
2021-04-21 13:33   ` Peter Zijlstra
2021-04-21 14:31     ` Chris Hyser
2021-04-01 13:10 ` [PATCH 5/9] sched: prctl() core-scheduling interface Peter Zijlstra
2021-04-07 17:00   ` Peter Zijlstra
2021-04-18  3:52     ` Joel Fernandes
2021-04-01 13:10 ` [PATCH 6/9] kselftest: Add test for core sched prctl interface Peter Zijlstra
2021-04-01 13:10 ` [PATCH 7/9] sched: Cgroup core-scheduling interface Peter Zijlstra
2021-04-02  0:34   ` Josh Don
2021-04-01 13:10 ` [PATCH 8/9] rbtree: Remove const from the rb_find_add() comparator Peter Zijlstra
2021-04-01 13:10 ` [PATCH 9/9] sched: prctl() and cgroup interaction Peter Zijlstra
2021-04-03  1:30   ` Josh Don
2021-04-06 15:12     ` Peter Zijlstra
2021-04-04 23:39 ` [PATCH 0/9] sched: Core scheduling interfaces Tejun Heo
2021-04-05 18:46   ` Joel Fernandes
2021-04-06 14:16     ` Tejun Heo
2021-04-18  1:35       ` Joel Fernandes
2021-04-19  9:00         ` Peter Zijlstra
2021-04-21 13:35           ` Peter Zijlstra
2021-04-21 14:45             ` Chris Hyser
2021-04-06 15:32   ` Peter Zijlstra
2021-04-06 16:08     ` Tejun Heo
2021-04-07 18:39       ` Peter Zijlstra
2021-04-07 16:50   ` Michal Koutný
2021-04-07 18:34     ` Peter Zijlstra
2021-04-08 13:25       ` Michal Koutný
2021-04-08 15:02         ` Peter Zijlstra
2021-04-09  0:16           ` Josh Don
2021-04-19 11:30       ` Tejun Heo
2021-04-20  1:17         ` Josh Don

This is a public inbox, see mirroring instructions
on how to clone and mirror all data and code used for this inbox