* [PATCH resend 0/8] Core sched remaining patches rebased
@ 2021-03-24 21:40 Joel Fernandes (Google)
  2021-03-24 21:40 ` [PATCH resend 1/8] sched: migration changes for core scheduling Joel Fernandes (Google)
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2021-03-24 21:40 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, rostedt,
	benbjiang, Alexandre Chartre, James.Bottomley, OWeisse,
	Dhaval Giani, chris.hyser, Josh Don, Hao Luo, Tom Lendacky,
	dhiatt

Core-Scheduling (resend based on Peter's queue.git sched/core-sched branch).
===============
Enclosed are the interface-related core scheduling patches and one patch for
migration. The main core scheduling patches were already pulled in by Peter,
with these bits left.

Main changes are the simplification of the core cookie scheme,
new prctl code, and other misc fixes based on Peter's feedback.

These remaining patches were worked on mostly by Josh Don and Chris Hyser.

Introduction of feature
=======================
Core scheduling is a feature that allows only trusted tasks to run
concurrently on CPUs sharing compute resources (e.g. hyperthreads on a
core). The goal is to mitigate core-level side-channel attacks without
disabling SMT (which has a significant performance impact in some
situations). Core scheduling (as of v7) mitigates user-space to
user-space attacks, and user-to-kernel attacks when one of the siblings
enters the kernel via an interrupt or system call.

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the
same core (for now by having them in the same tagged cgroup). When a tag
is enabled in a cgroup and a task from that cgroup is running on a
hardware thread, the scheduler ensures that only idle or trusted tasks
run on the other sibling(s). Besides security concerns, this feature can
also be beneficial for RT and performance applications where we want to
control how tasks make use of SMT dynamically.

Both a cgroup interface and a per-task interface via prctl(2) are provided for
configuring core sharing. More details are provided in the documentation patch.
Kselftests are provided to verify the correctness/rules of the interface.

Testing
=======
ChromeOS testing shows a 300% improvement in keypress latency in a Google
Docs typing test run alongside a Google Hangouts call (the maximum latency
drops from 150ms to 50ms for keypresses).

Julien: TPC-C tests showed the improvements below with core scheduling relative
to disabling SMT. With kernel protection enabled, core scheduling does not show
any regression relative to nosmt. Possibly ASI will improve performance for
those who choose kernel protection (which can be controlled through the
ht_protect kernel command-line option).
				average		stdev		diff
baseline (SMT on)		1197.272	44.78312824	
core sched (   kernel protect)	412.9895	45.42734343	-65.51%
core sched (no kernel protect)	686.6515	71.77756931	-42.65%
nosmt				408.667		39.39042872	-65.87%
(Note these results are from v8).

Vineeth tested sysbench and does not see any regressions.
Hong and Aubrey tested v9 and see results similar to v8. There is a known
regression with uperf; it appears to be caused by ksoftirqd heavily contending
with other tasks on the core. The consensus is that this can be improved in
the future.

Other changes:
- Fixed breaking of coresched= option patch on !SCHED_CORE builds.
- Trivial commit message changes.

Changes in v10
==============
- Migration code changes from Aubrey.
- Dropped patches that were already merged.
- Interface changes from Josh and Chris.

Changes in v9
=============
- Note that the vruntime snapshot change is split into 2 patches to show the
  progression of the idea and prevent merge conflicts:
    sched/fair: Snapshot the min_vruntime of CPUs on force idle
    sched: Improve snapshotting of min_vruntime for CGroups
  Same with the RT priority inversion change:
    sched: Fix priority inversion of cookied task with sibling
    sched: Improve snapshotting of min_vruntime for CGroups
- Disable coresched on certain AMD HW.

Changes in v8
=============
- New interface/API implementation
  - Joel
- Revised kernel protection patch
  - Joel
- Revised Hotplug fixes
  - Joel
- Minor bug fixes and address review comments
  - Vineeth

Changes in v7
=============
- Kernel protection from untrusted usermode tasks
  - Joel, Vineeth
- Fix for hotplug crashes and hangs
  - Joel, Vineeth

Changes in v6
=============
- Documentation
  - Joel
- Pause siblings on entering nmi/irq/softirq
  - Joel, Vineeth
- Fix for RCU crash
  - Joel
- Fix for a crash in pick_next_task
  - Yu Chen, Vineeth
- Minor re-write of core-wide vruntime comparison
  - Aaron Lu
- Cleanup: Address Review comments
- Cleanup: Remove hotplug support (for now)
- Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
  - Joel, Vineeth

Changes in v5
=============
- Fixes for cgroup/process tagging during corner cases like cgroup
  destroy, task moving across cgroups etc
  - Tim Chen
- Coresched aware task migrations
  - Aubrey Li
- Other minor stability fixes.

Changes in v4
=============
- Implement a core wide min_vruntime for vruntime comparison of tasks
  across cpus in a core.
  - Aaron Lu
- Fixes a typo bug in setting the forced_idle cpu.
  - Aaron Lu

Changes in v3
=============
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
=============
- Fixes for couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for process in different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

Future work
===========
- Load balancing/migration fixes for core scheduling.
  As of v6, load balancing is partially coresched-aware, but has some
  issues w.r.t. process/taskgroup weights:
  https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...

Aubrey Li (1):
sched: migration changes for core scheduling

Joel Fernandes (Google) (3):
kselftest: Add tests for core-sched interface
Documentation: Add core scheduling documentation
sched: Debug bits...

Josh Don (2):
sched: core scheduling tagging infrastructure
sched: cgroup cookie API for core scheduling

chris hyser (2):
sched: prctl() cookie manipulation for core scheduling
kselftest: Add test for core sched prctl interface

.../admin-guide/hw-vuln/core-scheduling.rst   | 460 ++++++++++
Documentation/admin-guide/hw-vuln/index.rst   |   1 +
fs/exec.c                                     |   4 +-
include/linux/sched.h                         |  35 +-
include/linux/sched/task.h                    |   4 +-
include/uapi/linux/prctl.h                    |   7 +
kernel/fork.c                                 |   1 +
kernel/sched/Makefile                         |   1 +
kernel/sched/core.c                           | 212 ++++-
kernel/sched/coretag.c                        | 587 +++++++++++++
kernel/sched/debug.c                          |   4 +
kernel/sched/fair.c                           |  41 +-
kernel/sched/sched.h                          | 151 +++-
kernel/sys.c                                  |   7 +
tools/include/uapi/linux/prctl.h              |   7 +
tools/testing/selftests/sched/.gitignore      |   2 +
tools/testing/selftests/sched/Makefile        |  14 +
tools/testing/selftests/sched/config          |   1 +
tools/testing/selftests/sched/cs_prctl_test.c | 370 ++++++++
.../testing/selftests/sched/test_coresched.c  | 812 ++++++++++++++++++
20 files changed, 2659 insertions(+), 62 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
create mode 100644 kernel/sched/coretag.c
create mode 100644 tools/testing/selftests/sched/.gitignore
create mode 100644 tools/testing/selftests/sched/Makefile
create mode 100644 tools/testing/selftests/sched/config
create mode 100644 tools/testing/selftests/sched/cs_prctl_test.c
create mode 100644 tools/testing/selftests/sched/test_coresched.c

--
2.31.0.291.g576ba9dcdaf-goog



* [PATCH resend 1/8] sched: migration changes for core scheduling
  2021-03-24 21:40 [PATCH resend 0/8] Core sched remaining patches rebased Joel Fernandes (Google)
@ 2021-03-24 21:40 ` Joel Fernandes (Google)
  2021-03-24 21:40 ` [PATCH resend 2/8] sched: core scheduling tagging infrastructure Joel Fernandes (Google)
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2021-03-24 21:40 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, rostedt,
	benbjiang, Alexandre Chartre, James.Bottomley, OWeisse,
	Dhaval Giani, chris.hyser, Josh Don, Hao Luo, Tom Lendacky,
	dhiatt, Aubrey Li

From: Aubrey Li <aubrey.li@linux.intel.com>

 - Don't migrate if there is a cookie mismatch
     Load balancing tries to move a task from the busiest CPU to the
     destination CPU. When core scheduling is enabled, if the
     task's cookie does not match the destination CPU's
     core cookie, the task is skipped by this CPU. This
     mitigates the forced idle time on the destination CPU.

 - Select a cookie-matched idle CPU
     In the fast path of task wakeup, select the first cookie-matched
     idle CPU instead of the first idle CPU.

 - Find a cookie-matched idlest CPU
     In the slow path of task wakeup, find the idlest CPU whose core
     cookie matches the task's cookie.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c  | 29 ++++++++++++++----
 kernel/sched/sched.h | 73 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a03564398605..12030b73a032 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5877,11 +5877,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		struct rq *rq = cpu_rq(i);
+
+		if (!sched_core_cookie_match(rq, p))
+			continue;
+
 		if (sched_idle_cpu(i))
 			return i;
 
 		if (available_idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
@@ -5967,9 +5971,10 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 	return new_cpu;
 }
 
-static inline int __select_idle_cpu(int cpu)
+static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 {
-	if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+	    sched_cpu_cookie_match(cpu_rq(cpu), p))
 		return cpu;
 
 	return -1;
@@ -6041,7 +6046,7 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 	int cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
-		return __select_idle_cpu(core);
+		return __select_idle_cpu(core, p);
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
 		if (!available_idle_cpu(cpu)) {
@@ -6079,7 +6084,7 @@ static inline bool test_idle_cores(int cpu, bool def)
 
 static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
-	return __select_idle_cpu(core);
+	return __select_idle_cpu(core, p);
 }
 
 #endif /* CONFIG_SCHED_SMT */
@@ -6132,7 +6137,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		} else {
 			if (!--nr)
 				return -1;
-			idle_cpu = __select_idle_cpu(cpu);
+			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
 				break;
 		}
@@ -7473,6 +7478,14 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 
 	if (sysctl_sched_migration_cost == -1)
 		return 1;
+
+	/*
+	 * Don't migrate task if the task's cookie does not match
+	 * with the destination CPU's core cookie.
+	 */
+	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+		return 1;
+
 	if (sysctl_sched_migration_cost == 0)
 		return 0;
 
@@ -8834,6 +8847,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 					p->cpus_ptr))
 			continue;
 
+		/* Skip over this group if no cookie matched */
+		if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+			continue;
+
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 80abbc0af680..12edfb8f6994 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1128,8 +1128,10 @@ static inline bool is_migration_disabled(struct task_struct *p)
 #endif
 }
 
+struct sched_group;
 #ifdef CONFIG_SCHED_CORE
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+static inline struct cpumask *sched_group_span(struct sched_group *sg);
 
 static inline bool sched_core_enabled(struct rq *rq)
 {
@@ -1163,6 +1165,61 @@ static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
+/*
+ * Helpers to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	return rq->core->core_cookie == p->core_cookie;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	bool idle_core = true;
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+		if (!available_idle_cpu(cpu)) {
+			idle_core = false;
+			break;
+		}
+	}
+
+	/*
+	 * A CPU in an idle core is always the best choice for tasks with
+	 * cookies.
+	 */
+	return idle_core || rq->core->core_cookie == p->core_cookie;
+}
+
+static inline bool sched_group_cookie_match(struct rq *rq,
+					    struct task_struct *p,
+					    struct sched_group *group)
+{
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu_and(cpu, sched_group_span(group), p->cpus_ptr) {
+		if (sched_core_cookie_match(cpu_rq(cpu), p))
+			return true;
+	}
+	return false;
+}
+
 extern void queue_core_balance(struct rq *rq);
 
 #else /* !CONFIG_SCHED_CORE */
@@ -1191,6 +1248,22 @@ static inline void queue_core_balance(struct rq *rq)
 {
 }
 
+static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
+
+static inline bool sched_group_cookie_match(struct rq *rq,
+					    struct task_struct *p,
+					    struct sched_group *group)
+{
+	return true;
+}
 #endif /* CONFIG_SCHED_CORE */
 
 static inline void lockdep_assert_rq_held(struct rq *rq)
-- 
2.31.0.291.g576ba9dcdaf-goog



* [PATCH resend 2/8] sched: core scheduling tagging infrastructure
  2021-03-24 21:40 [PATCH resend 0/8] Core sched remaining patches rebased Joel Fernandes (Google)
  2021-03-24 21:40 ` [PATCH resend 1/8] sched: migration changes for core scheduling Joel Fernandes (Google)
@ 2021-03-24 21:40 ` Joel Fernandes (Google)
  2021-03-27  0:09   ` Peter Zijlstra
  2021-03-24 21:40 ` [PATCH resend 3/8] sched: prctl() cookie manipulation for core scheduling Joel Fernandes (Google)
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Joel Fernandes (Google) @ 2021-03-24 21:40 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, rostedt,
	benbjiang, Alexandre Chartre, James.Bottomley, OWeisse,
	Dhaval Giani, chris.hyser, Josh Don, Hao Luo, Tom Lendacky,
	dhiatt

From: Josh Don <joshdon@google.com>

A single unsigned long is insufficient as a cookie value for core
scheduling. We will minimally have cookie values for a per-task and a
per-group interface, which must be combined into an overall cookie.

This patch adds the infrastructure necessary for setting task and group
cookies. Namely, it reworks core_cookie into a struct, and provides
interfaces for setting the task and group cookies, as well as other
operations (e.g. comparison). Subsequent patches will use these hooks to
provide an API for setting these cookies.

One important property of this interface is that neither the per-task
nor the per-cgroup setting overrides the other. For example, if two
tasks are in different cgroups, and one or both of the cgroups is tagged
using the per-cgroup interface, then these tasks cannot share a core,
even if they use the per-task interface to attempt to share with one
another.

The core scheduler has extra overhead. Enable it only on machines with
more than one SMT hardware thread.

Co-developed-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Josh Don <joshdon@google.com>
---
 include/linux/sched.h  |  24 +++-
 kernel/fork.c          |   1 +
 kernel/sched/Makefile  |   1 +
 kernel/sched/core.c    | 100 ++++++++++-------
 kernel/sched/coretag.c | 245 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/debug.c   |   4 +
 kernel/sched/sched.h   |  57 ++++++++--
 7 files changed, 384 insertions(+), 48 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5d91ff1d3a30..833f8d682212 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -645,6 +645,22 @@ struct kmap_ctrl {
 #endif
 };
 
+#ifdef CONFIG_SCHED_CORE
+struct sched_core_cookie {
+	unsigned long task_cookie;
+#ifdef CONFIG_CGROUP_SCHED
+	unsigned long group_cookie;
+#endif
+
+	/*
+	 * A u64 representation of the cookie used only for display to
+	 * userspace. We avoid exposing the actual cookie contents, which
+	 * are kernel pointers.
+	 */
+	u64 userspace_id;
+};
+#endif
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -703,7 +719,7 @@ struct task_struct {
 
 #ifdef CONFIG_SCHED_CORE
 	struct rb_node			core_node;
-	unsigned long			core_cookie;
+	struct sched_core_cookie	core_cookie;
 	unsigned int			core_occupation;
 #endif
 
@@ -2166,4 +2182,10 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+void sched_tsk_free(struct task_struct *tsk);
+#else
+#define sched_tsk_free(tsk) do { } while (0)
+#endif
+
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 54cc905e5fe0..cbe461105b10 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -737,6 +737,7 @@ void __put_task_struct(struct task_struct *tsk)
 	exit_creds(tsk);
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
+	sched_tsk_free(tsk);
 
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 53d742ed6432..1b07687c53d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -123,11 +123,13 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b, bool
 
 static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
 {
-	if (a->core_cookie < b->core_cookie)
-		return true;
+	int cmp = sched_core_cookie_cmp(&a->core_cookie, &b->core_cookie);
 
-	if (a->core_cookie > b->core_cookie)
-		return false;
+	if (cmp < 0)
+		return true; /* a < b */
+
+	if (cmp > 0)
+		return false; /* a > b */
 
 	/* flip prio, so high prio is leftmost */
 	if (prio_less(b, a, task_rq(a)->core->core_forceidle))
@@ -146,41 +148,49 @@ static inline bool rb_sched_core_less(struct rb_node *a, const struct rb_node *b
 static inline int rb_sched_core_cmp(const void *key, const struct rb_node *node)
 {
 	const struct task_struct *p = __node_2_sc(node);
-	unsigned long cookie = (unsigned long)key;
+	const struct sched_core_cookie *cookie = key;
+	int cmp = sched_core_cookie_cmp(cookie, &p->core_cookie);
 
-	if (cookie < p->core_cookie)
+	if (cmp < 0)
 		return -1;
 
-	if (cookie > p->core_cookie)
+	if (cmp > 0)
 		return 1;
 
 	return 0;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+static bool sched_core_empty(struct rq *rq)
+{
+	return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
+	if (sched_core_is_zero_cookie(&p->core_cookie))
 		return;
 
 	rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
 }
 
-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
+	if (!sched_core_enqueued(p))
 		return;
 
 	rb_erase(&p->core_node, &rq->core_tree);
+	RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
  * Find left-most (aka, highest priority) task matching @cookie.
  */
-static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
+static struct task_struct *sched_core_find(struct rq *rq,
+					   struct sched_core_cookie *cookie)
 {
 	struct rb_node *node;
 
@@ -194,7 +204,8 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
 	return __node_2_sc(node);
 }
 
-static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
+static struct task_struct *sched_core_next(struct task_struct *p,
+					   struct sched_core_cookie *cookie)
 {
 	struct rb_node *node = &p->core_node;
 
@@ -203,7 +214,7 @@ static struct task_struct *sched_core_next(struct task_struct *p, unsigned long
 		return NULL;
 
 	p = container_of(node, struct task_struct, core_node);
-	if (p->core_cookie != cookie)
+	if (sched_core_cookie_not_equal(&p->core_cookie, cookie))
 		return NULL;
 
 	return p;
@@ -246,8 +257,10 @@ static void __sched_core_flip(bool enabled)
 			raw_spin_lock_nested(&cpu_rq(t)->__lock, i++);
 		}
 
-		for_each_cpu(t, smt_mask)
+		for_each_cpu(t, smt_mask) {
+			WARN_ON_ONCE(cpu_rq(t)->core_enabled == enabled);
 			cpu_rq(t)->core_enabled = enabled;
+		}
 
 		for_each_cpu(t, smt_mask)
 			raw_spin_unlock(&cpu_rq(t)->__lock);
@@ -270,7 +283,12 @@ static void __sched_core_flip(bool enabled)
 
 static void __sched_core_enable(void)
 {
-	// XXX verify there are no cookie tasks (yet)
+	int cpu;
+
+	/* verify there are no cookie tasks (yet) */
+	for_each_online_cpu(cpu) {
+		BUG_ON(!sched_core_empty(cpu_rq(cpu)));
+	}
 
 	static_branch_enable(&__sched_core_enabled);
 	__sched_core_flip(true);
@@ -278,8 +296,6 @@ static void __sched_core_enable(void)
 
 static void __sched_core_disable(void)
 {
-	// XXX verify there are no cookie tasks (left)
-
 	__sched_core_flip(false);
 	static_branch_disable(&__sched_core_enabled);
 }
@@ -299,12 +315,6 @@ void sched_core_put(void)
 		__sched_core_disable();
 	mutex_unlock(&sched_core_mutex);
 }
-
-#else /* !CONFIG_SCHED_CORE */
-
-static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
-static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
-
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -4016,6 +4026,7 @@ static inline void init_schedstats(void) {}
 int sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	unsigned long flags;
+	int __maybe_unused ret;
 
 	__sched_fork(clone_flags, p);
 	/*
@@ -4091,6 +4102,13 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&p->core_node);
+
+	ret = sched_core_fork(p, clone_flags);
+	if (ret)
+		return ret;
 #endif
 	return 0;
 }
@@ -5222,9 +5240,11 @@ static inline bool is_task_rq_idle(struct task_struct *t)
 	return (task_rq(t)->idle == t);
 }
 
-static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+static inline bool cookie_equals(struct task_struct *a,
+				 struct sched_core_cookie *cookie)
 {
-	return is_task_rq_idle(a) || (a->core_cookie == cookie);
+	return is_task_rq_idle(a) ||
+	       sched_core_cookie_equal(&a->core_cookie, cookie);
 }
 
 static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
@@ -5232,7 +5252,7 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	if (is_task_rq_idle(a) || is_task_rq_idle(b))
 		return true;
 
-	return a->core_cookie == b->core_cookie;
+	return sched_core_cookie_equal(&a->core_cookie, &b->core_cookie);
 }
 
 // XXX fairness/fwd progress conditions
@@ -5247,18 +5267,19 @@ static struct task_struct *
 pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
 {
 	struct task_struct *class_pick, *cookie_pick;
-	unsigned long cookie = rq->core->core_cookie;
+	struct sched_core_cookie *cookie = &rq->core->core_cookie;
 
 	class_pick = class->pick_task(rq);
 	if (!class_pick)
 		return NULL;
 
-	if (!cookie) {
+	if (sched_core_is_zero_cookie(cookie)) {
 		/*
 		 * If class_pick is tagged, return it only if it has
 		 * higher priority than max.
 		 */
-		if (max && class_pick->core_cookie &&
+		if (max &&
+		    !sched_core_is_zero_cookie(&class_pick->core_cookie) &&
 		    prio_less(class_pick, max, in_fi))
 			return idle_sched_class.pick_task(rq);
 
@@ -5340,10 +5361,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	put_prev_task_balance(rq, prev, rf);
 
 	smt_mask = cpu_smt_mask(cpu);
-	need_sync = !!rq->core->core_cookie;
+	need_sync = !sched_core_is_zero_cookie(&rq->core->core_cookie);
 
 	/* reset state */
-	rq->core->core_cookie = 0UL;
+	sched_core_cookie_reset(&rq->core->core_cookie);
 	if (rq->core->core_forceidle) {
 		need_sync = true;
 		fi_before = true;
@@ -5373,7 +5394,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				break;
 		}
 
-		if (!next->core_cookie) {
+		if (sched_core_is_zero_cookie(&next->core_cookie)) {
 			rq->core_pick = NULL;
 			/*
 			 * For robustness, update the min_vruntime_fi for
@@ -5524,14 +5545,14 @@ static bool try_steal_cookie(int this, int that)
 {
 	struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
 	struct task_struct *p;
-	unsigned long cookie;
+	struct sched_core_cookie *cookie;
 	bool success = false;
 
 	local_irq_disable();
 	double_rq_lock(dst, src);
 
-	cookie = dst->core->core_cookie;
-	if (!cookie)
+	cookie = &dst->core->core_cookie;
+	if (sched_core_is_zero_cookie(cookie))
 		goto unlock;
 
 	if (dst->curr != dst->idle)
@@ -5618,7 +5639,7 @@ void queue_core_balance(struct rq *rq)
 	if (!sched_core_enabled(rq))
 		return;
 
-	if (!rq->core->core_cookie)
+	if (sched_core_is_zero_cookie(&rq->core->core_cookie))
 		return;
 
 	if (!rq->nr_running) /* not forced idle */
@@ -8244,6 +8265,9 @@ void init_idle(struct task_struct *idle, int cpu)
 #ifdef CONFIG_SMP
 	sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
 #endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&idle->core_node);
+#endif
 }
 
 #ifdef CONFIG_SMP
@@ -8995,7 +9019,7 @@ void __init sched_init(void)
 		rq->core_tree = RB_ROOT;
 		rq->core_forceidle = false;
 
-		rq->core_cookie = 0UL;
+		sched_core_cookie_reset(&rq->core_cookie);
 #endif
 	}
 
diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
new file mode 100644
index 000000000000..ba73569237f0
--- /dev/null
+++ b/kernel/sched/coretag.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kernel/sched/coretag.c
+ *
+ * Core-scheduling tagging interface support.
+ */
+
+#include <linux/prctl.h>
+#include "sched.h"
+
+/*
+ * A simple wrapper around refcount. An allocated sched_core_task_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_task_cookie {
+	refcount_t refcnt;
+	u32 id; /* purely for display to userspace */
+	struct work_struct work; /* to free in WQ context. */
+};
+
+/* Protects creation and assignment of task cookies */
+static DEFINE_MUTEX(sched_core_tasks_mutex);
+
+/*
+ * Returns the following:
+ * a < b  => -1
+ * a == b => 0
+ * a > b  => 1
+ */
+int sched_core_cookie_cmp(const struct sched_core_cookie *a,
+			  const struct sched_core_cookie *b)
+{
+#define COOKIE_CMP_RETURN(field) do {		\
+	if (a->field < b->field)		\
+		return -1;			\
+	else if (a->field > b->field)		\
+		return 1;			\
+} while (0)					\
+
+	COOKIE_CMP_RETURN(task_cookie);
+#ifdef CONFIG_CGROUP_SCHED
+	COOKIE_CMP_RETURN(group_cookie);
+#endif
+
+	/* all cookie fields match */
+	return 0;
+
+#undef COOKIE_CMP_RETURN
+}
+
+inline bool sched_core_cookie_equal(const struct sched_core_cookie *a,
+				    const struct sched_core_cookie *b)
+{
+	return !sched_core_cookie_cmp(a, b);
+}
+
+inline bool sched_core_cookie_not_equal(const struct sched_core_cookie *a,
+					const struct sched_core_cookie *b)
+{
+	return !!sched_core_cookie_cmp(a, b);
+}
+
+bool sched_core_is_zero_cookie(const struct sched_core_cookie *cookie)
+{
+	static const struct sched_core_cookie zero_cookie;
+
+	return sched_core_cookie_equal(cookie, &zero_cookie);
+}
+
+inline void sched_core_cookie_reset(struct sched_core_cookie *cookie)
+{
+	memset(cookie, 0, sizeof(*cookie));
+}
+
+static void __sched_core_set_task_cookie(struct sched_core_cookie *cookie,
+					 unsigned long val)
+{
+	struct sched_core_task_cookie *task_cookie = (void *)val;
+	u64 task_cookie_id; /* only uses upper 32 bits */
+
+	cookie->task_cookie = val;
+
+	if (task_cookie) {
+		task_cookie_id = task_cookie->id;
+		task_cookie_id <<= 32;
+	} else {
+		task_cookie_id = 0;
+	}
+
+	/* task cookie userspace id is the upper 32 bits */
+	cookie->userspace_id &= 0xffffffff;
+	cookie->userspace_id |= task_cookie_id;
+}
+
+#ifdef CONFIG_CGROUP_SCHED
+static void __sched_core_set_group_cookie(struct sched_core_cookie *cookie,
+					  unsigned long val)
+{
+	cookie->group_cookie = val;
+
+	// XXX incorporate group_cookie into userspace id
+}
+#endif
+
+/*
+ * sched_core_update_cookie - Common helper to update a task's core cookie. This
+ * updates the selected cookie field.
+ * @p: The task whose cookie should be updated.
+ * @cookie: The new cookie.
+ * @cookie_type: The cookie field to which the cookie corresponds.
+ */
+static void sched_core_update_cookie(struct task_struct *p,
+				     unsigned long cookie,
+				     enum sched_core_cookie_type cookie_type)
+{
+	struct rq *rq;
+	struct rq_flags rf;
+
+	if (!p)
+		return;
+
+	rq = task_rq_lock(p, &rf);
+
+	/* Update cookie under task rq lock */
+	switch (cookie_type) {
+	case sched_core_task_cookie_type:
+		lockdep_assert_held(&sched_core_tasks_mutex);
+		__sched_core_set_task_cookie(&p->core_cookie, cookie);
+		break;
+#ifdef CONFIG_CGROUP_SCHED
+	case sched_core_group_cookie_type:
+		__sched_core_set_group_cookie(&p->core_cookie, cookie);
+		break;
+#endif
+	default:
+		WARN_ON_ONCE(1);
+	}
+
+	if (sched_core_enqueued(p))
+		sched_core_dequeue(rq, p);
+
+	if (sched_core_enabled(rq) &&
+	    !sched_core_is_zero_cookie(&p->core_cookie) &&
+	    task_on_rq_queued(p))
+		sched_core_enqueue(task_rq(p), p);
+
+	/*
+	 * If task is currently running, it may not be compatible anymore after
+	 * the cookie change, so enter the scheduler on its CPU to schedule it
+	 * away.
+	 */
+	if (task_running(rq, p))
+		resched_curr(rq);
+
+	task_rq_unlock(rq, p, &rf);
+}
+
+static void sched_core_free_task_cookie_work(struct work_struct *ws);
+
+static unsigned long sched_core_alloc_task_cookie(void)
+{
+	struct sched_core_task_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL);
+	static u32 next_id = 1;
+
+	lockdep_assert_held(&sched_core_tasks_mutex);
+
+	if (!ck)
+		return 0;
+
+	ck->id = next_id++;
+	WARN_ON_ONCE(next_id == 0); /* warn on wrap */
+
+	refcount_set(&ck->refcnt, 1);
+	INIT_WORK(&ck->work, sched_core_free_task_cookie_work);
+
+	/* Each live task_cookie is associated with a single sched_core_get() */
+	sched_core_get();
+
+	return (unsigned long)ck;
+}
+
+static void sched_core_get_task_cookie(unsigned long cookie)
+{
+	struct sched_core_task_cookie *ptr = (void *)cookie;
+
+	refcount_inc(&ptr->refcnt);
+}
+
+/* Called when the cookie's refcnt drops to 0. */
+static void __sched_core_free_task_cookie(struct sched_core_task_cookie *cookie)
+{
+	kfree(cookie);
+	sched_core_put();
+}
+
+static void sched_core_free_task_cookie_work(struct work_struct *ws)
+{
+	struct sched_core_task_cookie *ck =
+		container_of(ws, struct sched_core_task_cookie, work);
+
+	__sched_core_free_task_cookie(ck);
+}
+
+static void sched_core_put_task_cookie(unsigned long cookie)
+{
+	struct sched_core_task_cookie *ptr = (void *)cookie;
+
+	if (refcount_dec_and_test(&ptr->refcnt))
+		__sched_core_free_task_cookie(ptr);
+}
+
+static void sched_core_put_task_cookie_async(unsigned long cookie)
+{
+	struct sched_core_task_cookie *ptr = (void *)cookie;
+
+	if (refcount_dec_and_test(&ptr->refcnt))
+		queue_work(system_wq, &ptr->work);
+}
+
+static inline void sched_core_update_task_cookie(struct task_struct *t,
+						 unsigned long c)
+{
+	sched_core_update_cookie(t, c, sched_core_task_cookie_type);
+}
+
+/*
+ * Called from sched_fork().
+ */
+int sched_core_fork(struct task_struct *p, unsigned long clone_flags)
+{
+	/*
+	 * Task cookie is ref counted; avoid an uncounted reference.
+	 */
+	__sched_core_set_task_cookie(&p->core_cookie, 0);
+
+	return 0;
+}
+
+void sched_tsk_free(struct task_struct *tsk)
+{
+	unsigned long task_cookie = tsk->core_cookie.task_cookie;
+
+	if (task_cookie)
+		sched_core_put_task_cookie_async(task_cookie);
+}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 3ef9f2bca823..330d1dd8d5a6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1024,6 +1024,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		__PS("clock-delta", t1-t0);
 	}
 
+#ifdef CONFIG_SCHED_CORE
+	__PS("core_cookie", p->core_cookie.userspace_id);
+#endif
+
 	sched_show_numa(p, m);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 12edfb8f6994..5b49cfaa4a53 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1084,11 +1084,11 @@ struct rq {
 	struct rb_root		core_tree;
 
 	/* shared state */
-	unsigned int		core_task_seq;
-	unsigned int		core_pick_seq;
-	unsigned long		core_cookie;
-	unsigned char		core_forceidle;
-	unsigned int		core_forceidle_seq;
+	unsigned int			core_task_seq;
+	unsigned int			core_pick_seq;
+	struct sched_core_cookie	core_cookie;
+	unsigned char			core_forceidle;
+	unsigned int			core_forceidle_seq;
 #endif
 };
 
@@ -1133,6 +1133,13 @@ struct sched_group;
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
 static inline struct cpumask *sched_group_span(struct sched_group *sg);
 
+enum sched_core_cookie_type {
+	sched_core_task_cookie_type,
+#ifdef CONFIG_CGROUP_SCHED
+	sched_core_group_cookie_type,
+#endif
+};
+
 static inline bool sched_core_enabled(struct rq *rq)
 {
 	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
@@ -1163,8 +1170,32 @@ static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+int sched_core_fork(struct task_struct *p, unsigned long clone_flags);
+
+static inline bool sched_core_enqueued(struct task_struct *task)
+{
+	return !RB_EMPTY_NODE(&task->core_node);
+}
+
+void queue_core_balance(struct rq *rq);
+
+void sched_core_enqueue(struct rq *rq, struct task_struct *p);
+void sched_core_dequeue(struct rq *rq, struct task_struct *p);
+void sched_core_get(void);
+void sched_core_put(void);
+
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
+int sched_core_cookie_cmp(const struct sched_core_cookie *a,
+			  const struct sched_core_cookie *b);
+bool sched_core_is_zero_cookie(const struct sched_core_cookie *cookie);
+inline bool sched_core_cookie_equal(const struct sched_core_cookie *a,
+				    const struct sched_core_cookie *b);
+inline bool sched_core_cookie_not_equal(const struct sched_core_cookie *a,
+					const struct sched_core_cookie *b);
+inline void sched_core_cookie_reset(struct sched_core_cookie *cookie);
+
 /*
  * Helpers to check if the CPU's core cookie matches with the task's cookie
  * when core scheduling is enabled.
@@ -1177,7 +1208,7 @@ static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
 	if (!sched_core_enabled(rq))
 		return true;
 
-	return rq->core->core_cookie == p->core_cookie;
+	return sched_core_cookie_equal(&rq->core->core_cookie, &p->core_cookie);
 }
 
 static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
@@ -1200,7 +1231,8 @@ static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
 	 * A CPU in an idle core is always the best choice for tasks with
 	 * cookies.
 	 */
-	return idle_core || rq->core->core_cookie == p->core_cookie;
+	return idle_core ||
+	       sched_core_cookie_equal(&rq->core->core_cookie, &p->core_cookie);
 }
 
 static inline bool sched_group_cookie_match(struct rq *rq,
@@ -1220,8 +1252,6 @@ static inline bool sched_group_cookie_match(struct rq *rq,
 	return false;
 }
 
-extern void queue_core_balance(struct rq *rq);
-
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1264,6 +1294,15 @@ static inline bool sched_group_cookie_match(struct rq *rq,
 {
 	return true;
 }
+
+static inline bool sched_core_enqueued(struct task_struct *task)
+{
+	return false;
+}
+
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+
 #endif /* CONFIG_SCHED_CORE */
 
 static inline void lockdep_assert_rq_held(struct rq *rq)
-- 
2.31.0.291.g576ba9dcdaf-goog



* [PATCH resend 3/8] sched: prctl() cookie manipulation for core scheduling
From: Joel Fernandes (Google) @ 2021-03-24 21:40 UTC

From: chris hyser <chris.hyser@oracle.com>

This patch provides support for setting, clearing and copying core
scheduling 'task cookies' between threads (PID), processes (TGID), and
process groups (PGID).

The value of core scheduling isn't that tasks don't share a core; 'nosmt'
can do that. The value lies in exploiting all the sharing opportunities
that exist to recover possible lost performance, and that requires a
degree of flexibility in the API. From a security perspective (and there
are others), the thread, process and process group distinction is an
existing hierarchical categorization of tasks that reflects many of the
security concerns about 'data sharing'. For example, protecting against
cache snooping by a thread that can just read the memory directly isn't
all that useful. With this in mind, the CLEAR/CREATE/SHARE (TO/FROM)
subcommands provide a mechanism to create, clear and share cookies.
CLEAR/CREATE/SHARE_TO specify a target pid, with enum pid_type used to
specify the scope of the targeted tasks. For example, PIDTYPE_TGID will
share the cookie with the process and all of its threads, as typically
desired in a security scenario.

API:

prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, tgtpid, pidtype, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, srcpid, 0, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, 0)

where 'tgtpid/srcpid == 0' implies the current process and pidtype is
kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.
PIDTYPE_SID, sharing a cookie with an entire session, was considered less
useful given the choice to create a new cookie on task exec().

For return values, EINVAL and ENOMEM have their usual meanings. ESRCH
means the tgtpid/srcpid was not found. EPERM indicates lack of PTRACE
permission access to tgtpid/srcpid. EACCES indicates that a task in the
target pidtype group was not updated due to lack of permission.

In terms of interaction with the cgroup interface, task cookies are set
independently of cgroup core scheduling cookies and thus would allow use
for tasks within a container using cgroup cookies.

Current hard-coded policies are:
- a user can clear the cookie of any process they can set a cookie for.
Lack of a cookie *might* be a security issue if cookies are being used
for that.
- on fork of a parent with a cookie, both process and thread child tasks
get a copy.
- on exec a task with a cookie is given a new cookie

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Josh Don <joshdon@google.com>
---
 fs/exec.c                        |   4 +-
 include/linux/sched.h            |  11 ++
 include/linux/sched/task.h       |   4 +-
 include/uapi/linux/prctl.h       |   7 ++
 kernel/sched/core.c              |  11 +-
 kernel/sched/coretag.c           | 196 ++++++++++++++++++++++++++++++-
 kernel/sched/sched.h             |   2 +
 kernel/sys.c                     |   7 ++
 tools/include/uapi/linux/prctl.h |   7 ++
 9 files changed, 241 insertions(+), 8 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..ab0945508b50 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1807,7 +1807,9 @@ static int bprm_execve(struct linux_binprm *bprm,
 	if (IS_ERR(file))
 		goto out_unmark;
 
-	sched_exec();
+	retval = sched_exec();
+	if (retval)
+		goto out;
 
 	bprm->file = file;
 	/*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 833f8d682212..075b15392a4a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2184,8 +2184,19 @@ const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
 #ifdef CONFIG_SCHED_CORE
 void sched_tsk_free(struct task_struct *tsk);
+int sched_core_share_pid(unsigned long flags, pid_t pid, enum pid_type type);
+int sched_core_exec(void);
 #else
 #define sched_tsk_free(tsk) do { } while (0)
+static inline int sched_core_share_pid(unsigned long flags, pid_t pid, enum pid_type type)
+{
+	return 0;
+}
+
+static inline int sched_core_exec(void)
+{
+	return 0;
+}
 #endif
 
 #endif
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef02be869cf2..d0f5b233f092 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -94,9 +94,9 @@ extern void free_task(struct task_struct *tsk);
 
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
-extern void sched_exec(void);
+int sched_exec(void);
 #else
-#define sched_exec()   {}
+static inline int sched_exec(void) { return 0; }
 #endif
 
 static inline struct task_struct *get_task_struct(struct task_struct *t)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 667f1aed091c..e658dca88f4f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -255,4 +255,11 @@ struct prctl_mm_map {
 # define SYSCALL_DISPATCH_FILTER_ALLOW	0
 # define SYSCALL_DISPATCH_FILTER_BLOCK	1
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE		60
+# define PR_SCHED_CORE_CLEAR		0 /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_CREATE		1 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_FROM	2 /* get core_sched cookie from pid */
+# define PR_SCHED_CORE_SHARE_TO		3 /* push core_sched cookie to pid */
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1b07687c53d4..3093cb3414c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4752,11 +4752,17 @@ unsigned long nr_iowait(void)
  * sched_exec - execve() is a valuable balancing opportunity, because at
  * this point the task has the smallest effective memory and cache footprint.
  */
-void sched_exec(void)
+int sched_exec(void)
 {
 	struct task_struct *p = current;
 	unsigned long flags;
 	int dest_cpu;
+	int ret;
+
+	/* this may change what tasks current can share a core with */
+	ret = sched_core_exec();
+	if (ret)
+		return ret;
 
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
@@ -4768,10 +4774,11 @@ void sched_exec(void)
 
 		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 		stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
-		return;
+		return 0;
 	}
 unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	return 0;
 }
 
 #endif
diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index ba73569237f0..550f4975eea2 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -155,6 +155,7 @@ static void sched_core_update_cookie(struct task_struct *p,
 	task_rq_unlock(rq, p, &rf);
 }
 
+/* Per-task interface: task free. */
 static void sched_core_free_task_cookie_work(struct work_struct *ws);
 
 static unsigned long sched_core_alloc_task_cookie(void)
@@ -223,16 +224,205 @@ static inline void sched_core_update_task_cookie(struct task_struct *t,
 	sched_core_update_cookie(t, c, sched_core_task_cookie_type);
 }
 
-/*
- * Called from sched_fork().
- */
+static int sched_core_create_cookie(struct task_struct *p)
+{
+	unsigned long cookie;
+
+	lockdep_assert_held(&sched_core_tasks_mutex);
+
+	cookie = sched_core_alloc_task_cookie();
+	if (!cookie)
+		return -ENOMEM;
+
+	if (p->core_cookie.task_cookie)
+		sched_core_put_task_cookie(p->core_cookie.task_cookie);
+
+	sched_core_update_task_cookie(p, cookie);
+	return 0;
+}
+
+static void sched_core_clear_cookie(struct task_struct *p)
+{
+	lockdep_assert_held(&sched_core_tasks_mutex);
+	if (p->core_cookie.task_cookie) {
+		sched_core_put_task_cookie(p->core_cookie.task_cookie);
+		sched_core_update_task_cookie(p, 0);
+	}
+}
+
+static unsigned long sched_core_get_copy_cookie(struct task_struct *p)
+{
+	unsigned long cookie = p->core_cookie.task_cookie;
+
+	lockdep_assert_held(&sched_core_tasks_mutex);
+	sched_core_get_task_cookie(cookie);
+	return cookie;
+}
+
+static void sched_core_copy_cookie_frm_to(struct task_struct *ft, struct task_struct *tt)
+{
+	unsigned long cookie;
+
+	lockdep_assert_held(&sched_core_tasks_mutex);
+
+	/* sharing a 0 cookie is a clear */
+	if (!ft->core_cookie.task_cookie) {
+		sched_core_clear_cookie(tt);
+		return;
+	}
+
+	cookie = sched_core_get_copy_cookie(ft);
+	if (tt->core_cookie.task_cookie)
+		sched_core_put_task_cookie(tt->core_cookie.task_cookie);
+	sched_core_update_task_cookie(tt, cookie);
+}
+
+/* Called from prctl interface: PR_SCHED_CORE_SHARE */
+int sched_core_share_pid(unsigned long flags, pid_t pid, enum pid_type type)
+{
+	struct task_struct *task;
+	struct task_struct *p;
+	unsigned long cookie;
+	struct pid *grp;
+	int err = 0;
+
+	if (type > PIDTYPE_PGID || flags > PR_SCHED_CORE_SHARE_TO || pid < 0 ||
+	    (flags == PR_SCHED_CORE_SHARE_FROM && type != PIDTYPE_PID))
+		return -EINVAL;
+
+	rcu_read_lock();
+
+	if (pid == 0) {
+		task = current;
+	} else {
+		task = find_task_by_vpid(pid);
+		if (!task) {
+			rcu_read_unlock();
+			return -ESRCH;
+		}
+	}
+
+	get_task_struct(task);
+
+	/*
+	 * Check if this process has the right to modify the specified
+	 * process. Use the regular "ptrace_may_access()" checks.
+	 */
+	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+		rcu_read_unlock();
+		err = -EPERM;
+		goto out;
+	}
+	rcu_read_unlock();
+
+	mutex_lock(&sched_core_tasks_mutex);
+	if (type == PIDTYPE_PID) {
+		if (flags == PR_SCHED_CORE_CREATE) {
+			err = sched_core_create_cookie(task);
+
+		} else if (flags == PR_SCHED_CORE_CLEAR) {
+			sched_core_clear_cookie(task);
+
+		} else if (flags == PR_SCHED_CORE_SHARE_FROM) {
+			sched_core_copy_cookie_frm_to(task, current);
+
+		} else if (flags == PR_SCHED_CORE_SHARE_TO) {
+			sched_core_copy_cookie_frm_to(current, task);
+
+		} else {
+			err = -EINVAL;
+			goto out_unlock;
+		}
+	} else {
+		if (flags == PR_SCHED_CORE_CREATE) {
+			cookie = sched_core_alloc_task_cookie();
+			if (!cookie) {
+				err = -ENOMEM;
+				goto out_unlock;
+			}
+
+		} else if (flags == PR_SCHED_CORE_CLEAR) {
+			cookie = 0;
+		} else if (flags == PR_SCHED_CORE_SHARE_TO) {
+			cookie = sched_core_get_copy_cookie(current);
+		} else {
+			err = -EINVAL;
+			goto out_unlock;
+		}
+
+		rcu_read_lock();
+		if (type == PIDTYPE_TGID) {
+			grp = task_tgid(task);
+		} else if (type == PIDTYPE_PGID) {
+			grp = task_pgrp(task);
+		} else {
+			err = -EINVAL;
+			rcu_read_unlock();
+			goto out_unlock;
+		}
+
+		do_each_pid_thread(grp, type, p) {
+			/*
+			 * If not allowed, skip the update but indicate this
+			 * to the caller. task and current were already
+			 * permission-checked above.
+			 */
+			if (p == task || p == current ||
+			    ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS)) {
+				if (cookie)
+					sched_core_get_task_cookie(cookie);
+				if (p->core_cookie.task_cookie)
+					sched_core_put_task_cookie_async(p->core_cookie.task_cookie);
+				sched_core_update_task_cookie(p, cookie);
+			} else {
+				err = -EACCES;
+			}
+		} while_each_pid_thread(grp, type, p);
+
+		rcu_read_unlock();
+
+		/*
+		 * Remove the extra reference we took to the cookie
+		 * (ie. via alloc/copy).
+		 */
+		if (cookie)
+			sched_core_put_task_cookie(cookie);
+	}
+out_unlock:
+	mutex_unlock(&sched_core_tasks_mutex);
+
+out:
+	put_task_struct(task);
+	return err;
+}
+
+int sched_core_exec(void)
+{
+	int ret = 0;
+
+	/* Absent a policy mechanism, give a task that had a cookie a new one. */
+	if (READ_ONCE(current->core_cookie.task_cookie)) {
+		mutex_lock(&sched_core_tasks_mutex);
+		if (current->core_cookie.task_cookie)
+			ret = sched_core_create_cookie(current);
+		mutex_unlock(&sched_core_tasks_mutex);
+	}
+	return ret;
+}
+
+/* Called from sched_fork() */
 int sched_core_fork(struct task_struct *p, unsigned long clone_flags)
 {
 	/*
 	 * Task cookie is ref counted; avoid an uncounted reference.
+	 * If p should have a task cookie, it will be set below.
 	 */
 	__sched_core_set_task_cookie(&p->core_cookie, 0);
 
+	if (READ_ONCE(current->core_cookie.task_cookie)) {
+		mutex_lock(&sched_core_tasks_mutex);
+		if (current->core_cookie.task_cookie)
+			sched_core_copy_cookie_frm_to(current, p);
+		mutex_unlock(&sched_core_tasks_mutex);
+	}
 	return 0;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5b49cfaa4a53..1be86d9cc58f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1184,6 +1184,8 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p);
 void sched_core_get(void);
 void sched_core_put(void);
 
+int sched_core_share_pid(unsigned long flags, pid_t pid, enum pid_type type);
+
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
 int sched_core_cookie_cmp(const struct sched_core_cookie *a,
diff --git a/kernel/sys.c b/kernel/sys.c
index 2e2e3f378d97..b40243522146 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2534,6 +2534,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = set_syscall_user_dispatch(arg2, arg3, arg4,
 						  (char __user *) arg5);
 		break;
+#ifdef CONFIG_SCHED_CORE
+	case PR_SCHED_CORE_SHARE:
+		if (arg5)
+			return -EINVAL;
+		error = sched_core_share_pid(arg2, arg3, arg4);
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 667f1aed091c..14900c400e74 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -255,4 +255,11 @@ struct prctl_mm_map {
 # define SYSCALL_DISPATCH_FILTER_ALLOW	0
 # define SYSCALL_DISPATCH_FILTER_BLOCK	1
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE		60
+# define PR_SCHED_CORE_CLEAR		0  /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_CREATE		1  /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_FROM	2  /* get core_sched cookie from pid */
+# define PR_SCHED_CORE_SHARE_TO		3  /* push core_sched cookie to pid */
+
 #endif /* _LINUX_PRCTL_H */
-- 
2.31.0.291.g576ba9dcdaf-goog



* [PATCH resend 4/8] kselftest: Add test for core sched prctl interface
From: Joel Fernandes (Google) @ 2021-03-24 21:40 UTC

From: chris hyser <chris.hyser@oracle.com>

Provides a selftest and examples of using the interface.

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Josh Don <joshdon@google.com>
---
 tools/testing/selftests/sched/.gitignore      |   1 +
 tools/testing/selftests/sched/Makefile        |  14 +
 tools/testing/selftests/sched/config          |   1 +
 tools/testing/selftests/sched/cs_prctl_test.c | 370 ++++++++++++++++++
 4 files changed, 386 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/cs_prctl_test.c

diff --git a/tools/testing/selftests/sched/.gitignore b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index 000000000000..6996d4654d92
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+cs_prctl_test
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
new file mode 100644
index 000000000000..10c72f14fea9
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+	  $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := cs_prctl_test
+TEST_PROGS := cs_prctl_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config b/tools/testing/selftests/sched/config
new file mode 100644
index 000000000000..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/cs_prctl_test.c b/tools/testing/selftests/sched/cs_prctl_test.c
new file mode 100644
index 000000000000..03581e180e31
--- /dev/null
+++ b/tools/testing/selftests/sched/cs_prctl_test.c
@@ -0,0 +1,370 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Use the core scheduling prctl() to test core scheduling cookies control.
+ *
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ * Author: Chris Hyser <chris.hyser@oracle.com>
+ *
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public License
+ * along with this library; if not, see <http://www.gnu.org/licenses>.
+ */
+
+#define _GNU_SOURCE
+#include <sys/eventfd.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sched.h>
+#include <sys/prctl.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <time.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#if __GLIBC_PREREQ(2, 30) == 0
+#include <sys/syscall.h>
+static pid_t gettid(void)
+{
+	return syscall(SYS_gettid);
+}
+#endif
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE		60
+# define PR_SCHED_CORE_CLEAR		0
+# define PR_SCHED_CORE_CREATE		1
+# define PR_SCHED_CORE_SHARE_FROM	2
+# define PR_SCHED_CORE_SHARE_TO		3
+#endif
+
+#define MAX_PROCESSES 128
+#define MAX_THREADS   128
+
+static const char USAGE[] = "cs_prctl_test [options]\n"
+"    options:\n"
+"	-P  : number of processes to create.\n"
+"	-T  : number of threads per process to create.\n"
+"	-d  : delay time to keep tasks alive.\n"
+"	-k  : keep tasks alive until keypress.\n";
+
+enum pid_type {PIDTYPE_PID = 0, PIDTYPE_TGID, PIDTYPE_PGID};
+
+const int THREAD_CLONE_FLAGS = CLONE_THREAD | CLONE_SIGHAND | CLONE_FS | CLONE_VM | CLONE_FILES;
+
+static int _prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4,
+		  unsigned long arg5)
+{
+	int res;
+
+	res = prctl(option, arg2, arg3, arg4, arg5);
+	printf("%d = prctl(%d, %ld, %ld, %ld, %lx)\n", res, option, (long)arg2, (long)arg3,
+	       (long)arg4, arg5);
+	return res;
+}
+
+#define STACK_SIZE (1024 * 1024)
+
+#define handle_error(msg) __handle_error(__FILE__, __LINE__, msg)
+static void __handle_error(char *fn, int ln, char *msg)
+{
+	printf("(%s:%d) - ", fn, ln);
+	perror(msg);
+	exit(EXIT_FAILURE);
+}
+
+static void handle_usage(int rc, char *msg)
+{
+	puts(USAGE);
+	puts(msg);
+	putchar('\n');
+	exit(rc);
+}
+
+static unsigned long get_cs_cookie(int pid)
+{
+	char buf[4096];
+	char fn[512];
+	FILE *inf;
+	char *c;
+	int i;
+
+	if (pid == 0)
+		pid = getpid();
+	snprintf(fn, 512, "/proc/%d/sched", pid);
+
+	inf = fopen(fn, "r");
+	if (!inf)
+		return -2UL;
+
+	while (fgets(buf, 4096, inf)) {
+		if (!strncmp(buf, "core_cookie", 11)) {
+			for (i = 0, c = buf; *c != '\0' && *c != ':' && i < 4096; ++i, ++c)
+				;
+			if (*c == ':') {
+				fclose(inf);
+				return strtoul(c + 1, NULL, 10);
+			}
+		}
+	}
+	fclose(inf);
+	printf("Not a core sched system\n");
+	return -1UL;
+}
+
+struct child_args {
+	int num_threads;
+	int pfd[2];
+	int cpid;
+	int thr_tids[MAX_THREADS];
+};
+
+static int child_func_thread(void __attribute__((unused))*arg)
+{
+	while (1)
+		usleep(20000);
+	return 0;
+}
+
+static void create_threads(int num_threads, int thr_tids[])
+{
+	void *child_stack;
+	pid_t tid;
+	int i;
+
+	for (i = 0; i < num_threads; ++i) {
+		child_stack = malloc(STACK_SIZE);
+		if (!child_stack)
+			handle_error("child stack allocate");
+
+		tid = clone(child_func_thread, child_stack + STACK_SIZE, THREAD_CLONE_FLAGS, NULL);
+		if (tid == -1)
+			handle_error("clone thread");
+		thr_tids[i] = tid;
+	}
+}
+
+static int child_func_process(void *arg)
+{
+	struct child_args *ca = (struct child_args *)arg;
+
+	close(ca->pfd[0]);
+
+	create_threads(ca->num_threads, ca->thr_tids);
+
+	write(ca->pfd[1], &ca->thr_tids, sizeof(int) * ca->num_threads);
+	close(ca->pfd[1]);
+
+	while (1)
+		usleep(20000);
+	return 0;
+}
+
+static unsigned char child_func_process_stack[STACK_SIZE];
+
+void create_processes(int num_processes, int num_threads, struct child_args proc[])
+{
+	pid_t cpid;
+	int i;
+
+	for (i = 0; i < num_processes; ++i) {
+		proc[i].num_threads = num_threads;
+
+		if (pipe(proc[i].pfd) == -1)
+			handle_error("pipe() failed");
+
+		cpid = clone(child_func_process, child_func_process_stack + STACK_SIZE,
+			     SIGCHLD, &proc[i]);
+		proc[i].cpid = cpid;
+		close(proc[i].pfd[1]);
+	}
+
+	for (i = 0; i < num_processes; ++i) {
+		read(proc[i].pfd[0], &proc[i].thr_tids, sizeof(int) * proc[i].num_threads);
+		close(proc[i].pfd[0]);
+	}
+}
+
+void disp_processes(int num_processes, struct child_args proc[])
+{
+	int i, j;
+
+	printf("tid=%d, / tgid=%d / pgid=%d: %lx\n", gettid(), getpid(), getpgid(0),
+	       get_cs_cookie(getpid()));
+
+	for (i = 0; i < num_processes; ++i) {
+		printf("    tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].cpid, proc[i].cpid,
+		       getpgid(proc[i].cpid), get_cs_cookie(proc[i].cpid));
+		for (j = 0; j < proc[i].num_threads; ++j) {
+			printf("        tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].thr_tids[j],
+			       proc[i].cpid, getpgid(0), get_cs_cookie(proc[i].thr_tids[j]));
+		}
+	}
+	puts("\n");
+}
+
+static int errors;
+
+#define validate(v) _validate(__LINE__, v, #v)
+void _validate(int line, int val, char *msg)
+{
+	if (!val) {
+		++errors;
+		printf("(%d) FAILED: %s\n", line, msg);
+	} else {
+		printf("(%d) PASSED: %s\n", line, msg);
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	struct child_args procs[MAX_PROCESSES];
+
+	int keypress = 0;
+	int num_processes = 2;
+	int num_threads = 3;
+	int delay = 0;
+	int res = 0;
+	int pidx;
+	int pid;
+	int opt;
+
+	while ((opt = getopt(argc, argv, ":hkT:P:d:")) != -1) {
+		switch (opt) {
+		case 'P':
+			num_processes = (int)strtol(optarg, NULL, 10);
+			break;
+		case 'T':
+			num_threads = (int)strtol(optarg, NULL, 10);
+			break;
+		case 'd':
+			delay = (int)strtol(optarg, NULL, 10);
+			break;
+		case 'k':
+			keypress = 1;
+			break;
+		case 'h':
+			printf(USAGE);
+			exit(EXIT_SUCCESS);
+		default:
+			handle_usage(20, "unknown option");
+		}
+	}
+
+	if (num_processes < 1 || num_processes > MAX_PROCESSES)
+		handle_usage(1, "Bad processes value");
+
+	if (num_threads < 1 || num_threads > MAX_THREADS)
+		handle_usage(2, "Bad thread value");
+
+	if (keypress)
+		delay = -1;
+
+	srand(time(NULL));
+
+	/* put into separate process group */
+	if (setpgid(0, 0) != 0)
+		handle_error("process group");
+
+	printf("\n## Create a thread/process/process group hierarchy\n");
+	create_processes(num_processes, num_threads, procs);
+	disp_processes(num_processes, procs);
+	validate(get_cs_cookie(0) == 0);
+
+	printf("\n## Set a cookie on entire process group\n");
+	if (_prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_PGID, 0) < 0)
+		handle_error("core_sched create failed -- PGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) != 0);
+
+	/* get a random process pid */
+	pidx = rand() % num_processes;
+	pid = procs[pidx].cpid;
+
+	validate(get_cs_cookie(0) == get_cs_cookie(pid));
+	validate(get_cs_cookie(0) == get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Set a new cookie on entire process/TGID [%d]\n", pid);
+	if (_prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, pid, PIDTYPE_TGID, 0) < 0)
+		handle_error("core_sched create failed -- TGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) != get_cs_cookie(pid));
+	validate(get_cs_cookie(pid) != 0);
+	validate(get_cs_cookie(pid) == get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Copy the cookie of current/PGID[%d], to pid [%d] as PIDTYPE_PID\n",
+	       getpid(), pid);
+	if (_prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, pid, PIDTYPE_PID, 0) < 0)
+		handle_error("core_sched share to itself failed -- PID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == get_cs_cookie(pid));
+	validate(get_cs_cookie(pid) != 0);
+	validate(get_cs_cookie(pid) != get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Copy cookie from a thread [%d] to current/PGID [%d] as PIDTYPE_PID\n",
+	       procs[pidx].thr_tids[0], getpid());
+	if (_prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, procs[pidx].thr_tids[0],
+		   PIDTYPE_PID, 0) < 0)
+		handle_error("core_sched share from thread failed -- PID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == get_cs_cookie(procs[pidx].thr_tids[0]));
+	validate(get_cs_cookie(pid) != get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Clear a cookie on a single task [%d]\n", pid);
+	if (_prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, pid, PIDTYPE_PID, 0) < 0)
+		handle_error("core_sched clear failed -- PID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(pid) == 0);
+	validate(get_cs_cookie(procs[pidx].thr_tids[0]) != 0);
+
+	printf("\n## Copy cookie from current [%d] to current as pidtype PGID\n", getpid());
+	if (_prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, 0, PIDTYPE_PGID, 0) < 0)
+		handle_error("core_sched share to self failed -- PGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == get_cs_cookie(pid));
+	validate(get_cs_cookie(pid) != 0);
+	validate(get_cs_cookie(pid) == get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Clear cookies on the entire process group\n");
+	if (_prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, 0, PIDTYPE_PGID, 0) < 0)
+		handle_error("core_sched clear failed -- PGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == 0);
+	validate(get_cs_cookie(pid) == 0);
+	validate(get_cs_cookie(procs[pidx].thr_tids[0]) == 0);
+
+	if (errors) {
+		printf("TESTS FAILED. errors: %d\n", errors);
+		res = 10;
+	} else {
+		printf("SUCCESS !!!\n");
+	}
+
+	if (keypress)
+		getchar();
+	else
+		sleep(delay);
+
+	for (pidx = 0; pidx < num_processes; ++pidx)
+		kill(procs[pidx].cpid, SIGTERM);
+
+	return res;
+}
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH resend 5/8] sched: cgroup cookie API for core scheduling
  2021-03-24 21:40 [PATCH resend 0/8] Core sched remaining patches rebased Joel Fernandes (Google)
                   ` (3 preceding siblings ...)
  2021-03-24 21:40 ` [PATCH resend 4/8] kselftest: Add test for core sched prctl interface Joel Fernandes (Google)
@ 2021-03-24 21:40 ` Joel Fernandes (Google)
  2021-03-30  9:23   ` Peter Zijlstra
  2021-03-24 21:40 ` [PATCH resend 6/8] kselftest: Add tests for core-sched interface Joel Fernandes (Google)
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Joel Fernandes (Google) @ 2021-03-24 21:40 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, rostedt,
	benbjiang, Alexandre Chartre, James.Bottomley, OWeisse,
	Dhaval Giani, chris.hyser, Josh Don, Hao Luo, Tom Lendacky,
	dhiatt

From: Josh Don <joshdon@google.com>

This adds an API to set/get the core-scheduling cookie for a given
cgroup. The interface lives at <cgroup>/cpu.core_tag.

The cgroup interface can be used to toggle a unique cookie value for all
descendant tasks, preventing these tasks from sharing with any others.
See Documentation/admin-guide/hw-vuln/core-scheduling.rst for a full
rundown of both this and the per-task API.

Signed-off-by: Josh Don <joshdon@google.com>
---
 kernel/sched/core.c    |  61 ++++++++++++++--
 kernel/sched/coretag.c | 156 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h   |  25 +++++++
 3 files changed, 235 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3093cb3414c3..a733891dfe7d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9328,6 +9328,8 @@ struct task_group *sched_create_group(struct task_group *parent)
 
 	alloc_uclamp_sched_group(tg, parent);
 
+	alloc_sched_core_sched_group(tg);
+
 	return tg;
 
 err:
@@ -9391,6 +9393,11 @@ static void sched_change_group(struct task_struct *tsk, int type)
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+	sched_core_change_group(tsk, tg);
+#endif
+
 	tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -9443,11 +9450,6 @@ void sched_move_task(struct task_struct *tsk)
 	task_rq_unlock(rq, tsk, &rf);
 }
 
-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
-	return css ? container_of(css, struct task_group, css) : NULL;
-}
-
 static struct cgroup_subsys_state *
 cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -9483,6 +9485,18 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+#ifdef CONFIG_SCHED_CORE
+	struct task_group *tg = css_tg(css);
+
+	if (tg->core_tagged) {
+		sched_core_put();
+		tg->core_tagged = 0;
+	}
+#endif
+}
+
 static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
@@ -9517,6 +9531,25 @@ static void cpu_cgroup_fork(struct task_struct *task)
 	task_rq_unlock(rq, task, &rf);
 }
 
+static void cpu_cgroup_exit(struct task_struct *task)
+{
+#ifdef CONFIG_SCHED_CORE
+	/*
+	 * This is possible if task exit races with core sched being
+	 * disabled due to the task's cgroup no longer being tagged, since
+	 * cpu_core_tag_write_u64() will miss dying tasks.
+	 */
+	if (unlikely(sched_core_enqueued(task))) {
+		struct rq *rq;
+		struct rq_flags rf;
+
+		rq = task_rq_lock(task, &rf);
+		sched_core_dequeue(rq, task);
+		task_rq_unlock(rq, task, &rf);
+	}
+#endif
+}
+
 static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
@@ -10084,6 +10117,14 @@ static struct cftype cpu_legacy_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "core_tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
+#endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	{
 		.name = "uclamp.min",
@@ -10257,6 +10298,14 @@ static struct cftype cpu_files[] = {
 		.write_s64 = cpu_weight_nice_write_s64,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "core_tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
 		.name = "max",
@@ -10285,10 +10334,12 @@ static struct cftype cpu_files[] = {
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_online	= cpu_cgroup_css_online,
+	.css_offline	= cpu_cgroup_css_offline,
 	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
 	.css_extra_stat_show = cpu_extra_stat_show,
 	.fork		= cpu_cgroup_fork,
+	.exit		= cpu_cgroup_exit,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
 	.legacy_cftypes	= cpu_legacy_files,
diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 550f4975eea2..1498790bc76c 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -96,9 +96,19 @@ static void __sched_core_set_task_cookie(struct sched_core_cookie *cookie,
 static void __sched_core_set_group_cookie(struct sched_core_cookie *cookie,
 					  unsigned long val)
 {
+	struct task_group *tg = (struct task_group *)val;
+	u64 group_cookie_id; /* only uses lower 32 bits */
+
 	cookie->group_cookie = val;
 
-	// XXX incorporate group_cookie into userspace id
+	if (tg)
+		group_cookie_id = tg->sched_core_id;
+	else
+		group_cookie_id = 0;
+
+	/* group cookie userspace id is the lower 32 bits */
+	cookie->userspace_id &= 0xffffffff00000000;
+	cookie->userspace_id |= group_cookie_id;
 }
 #endif
 
@@ -394,6 +404,142 @@ int sched_core_share_pid(unsigned long flags, pid_t pid, enum pid_type type)
 	return err;
 }
 
+/* CGroup core-scheduling interface support. */
+#ifdef CONFIG_CGROUP_SCHED
+/*
+ * Helper to get the group cookie in a hierarchy. Any ancestor can have a
+ * cookie.
+ *
+ * Can race with an update to tg->core_tagged if sched_core_group_mutex is
+ * not held.
+ */
+static unsigned long cpu_core_get_group_cookie(struct task_group *tg)
+{
+	for (; tg; tg = tg->parent) {
+		if (READ_ONCE(tg->core_tagged))
+			return (unsigned long)tg;
+	}
+
+	return 0;
+}
+
+/* Determine if any group in @tg's descendants is tagged. */
+static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag)
+{
+	struct task_group *child;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		if ((check_tag && child->core_tagged) ||
+		    cpu_core_check_descendants(child, check_tag)) {
+			rcu_read_unlock();
+			return true;
+		}
+	}
+	rcu_read_unlock();
+
+	return false;
+}
+
+u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
+			  struct cftype *cft)
+{
+	return !!css_tg(css)->core_tagged;
+}
+
+int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
+			   u64 val)
+{
+	static DEFINE_MUTEX(sched_core_group_mutex);
+	struct task_group *tg = css_tg(css);
+	struct cgroup_subsys_state *css_tmp;
+	struct task_struct *p;
+	unsigned long group_cookie;
+	int ret = 0;
+
+	if (val > 1)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	mutex_lock(&sched_core_group_mutex);
+
+	if (!tg->core_tagged && val) {
+		/* Tag is being set. Check ancestors and descendants. */
+		if (cpu_core_get_group_cookie(tg) ||
+		    cpu_core_check_descendants(tg, true /* tag */)) {
+			ret = -EBUSY;
+			goto out_unlock;
+		}
+	} else if (tg->core_tagged && !val) {
+		/* Tag is being reset. Check descendants. */
+		if (cpu_core_check_descendants(tg, true /* tag */)) {
+			ret = -EBUSY;
+			goto out_unlock;
+		}
+	} else {
+		goto out_unlock;
+	}
+
+	if (val)
+		sched_core_get();
+
+	tg->core_tagged = val;
+	group_cookie = cpu_core_get_group_cookie(tg);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(css_tmp, css) {
+		struct css_task_iter it;
+
+		css_task_iter_start(css_tmp, 0, &it);
+		/*
+		 * Note: css_task_iter_next will skip dying tasks.
+		 * There could still be dying tasks left in the core queue
+		 * when we set cgroup tag to 0 when the loop is done below.
+		 * We will handle this in cpu_cgroup_exit().
+		 */
+		while ((p = css_task_iter_next(&it))) {
+			sched_core_update_cookie(p, group_cookie,
+						 sched_core_group_cookie_type);
+		}
+
+		css_task_iter_end(&it);
+	}
+	rcu_read_unlock();
+
+	if (!val)
+		sched_core_put();
+
+out_unlock:
+	mutex_unlock(&sched_core_group_mutex);
+	return ret;
+}
+
+void sched_core_change_group(struct task_struct *p, struct task_group *new_tg)
+{
+	lockdep_assert_held(rq_lockp(task_rq(p)));
+
+	/*
+	 * Reading the group cookie can race, but since the task is already
+	 * visible in the group, a concurrent group_cookie update will also
+	 * update this task.
+	 */
+	__sched_core_set_group_cookie(&p->core_cookie,
+				      cpu_core_get_group_cookie(new_tg));
+}
+
+void alloc_sched_core_sched_group(struct task_group *tg)
+{
+	static u32 next_id = 1;
+
+	tg->sched_core_id = next_id++;
+	WARN_ON_ONCE(next_id == 0);
+}
+#endif
+
 int sched_core_exec(void)
 {
 	int ret = 0;
@@ -408,7 +554,13 @@ int sched_core_exec(void)
 	return ret;
 }
 
-/* Called from sched_fork() */
+/*
+ * Called from sched_fork().
+ *
+ * NOTE: This might race with a concurrent cgroup cookie update. That's
+ * ok; sched_core_change_group() will handle this post-fork, once the
+ * task is visible.
+ */
 int sched_core_fork(struct task_struct *p, unsigned long clone_flags)
 {
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1be86d9cc58f..c3435469ea24 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -385,6 +385,11 @@ struct cfs_bandwidth {
 struct task_group {
 	struct cgroup_subsys_state css;
 
+#ifdef CONFIG_SCHED_CORE
+	int			core_tagged;
+	u32			sched_core_id;
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* schedulable entities of this group on each CPU */
 	struct sched_entity	**se;
@@ -433,6 +438,11 @@ struct task_group {
 
 };
 
+static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct task_group, css) : NULL;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD
 
@@ -1170,6 +1180,7 @@ static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+void sched_core_change_group(struct task_struct *p, struct task_group *new_tg);
 int sched_core_fork(struct task_struct *p, unsigned long clone_flags);
 
 static inline bool sched_core_enqueued(struct task_struct *task)
@@ -1186,6 +1197,16 @@ void sched_core_put(void);
 
 int sched_core_share_pid(unsigned long flags, pid_t pid, enum pid_type type);
 
+#ifdef CONFIG_CGROUP_SCHED
+u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
+			  struct cftype *cft);
+
+int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
+			   u64 val);
+
+void alloc_sched_core_sched_group(struct task_group *tg);
+#endif
+
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
 int sched_core_cookie_cmp(const struct sched_core_cookie *a,
@@ -1305,6 +1326,10 @@ static inline bool sched_core_enqueued(struct task_struct *task)
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
 
+#ifdef CONFIG_CGROUP_SCHED
+static inline void alloc_sched_core_sched_group(struct task_group *tg) { }
+#endif
+
 #endif /* CONFIG_SCHED_CORE */
 
 static inline void lockdep_assert_rq_held(struct rq *rq)
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH resend 6/8] kselftest: Add tests for core-sched interface
  2021-03-24 21:40 [PATCH resend 0/8] Core sched remaining patches rebased Joel Fernandes (Google)
                   ` (4 preceding siblings ...)
  2021-03-24 21:40 ` [PATCH resend 5/8] sched: cgroup cookie API for core scheduling Joel Fernandes (Google)
@ 2021-03-24 21:40 ` Joel Fernandes (Google)
  2021-03-24 21:40 ` [PATCH resend 7/8] Documentation: Add core scheduling documentation Joel Fernandes (Google)
  2021-03-24 21:40 ` [PATCH resend 8/8] sched: Debug bits Joel Fernandes (Google)
  7 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2021-03-24 21:40 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, rostedt,
	benbjiang, Alexandre Chartre, James.Bottomley, OWeisse,
	Dhaval Giani, chris.hyser, Josh Don, Hao Luo, Tom Lendacky,
	dhiatt

Add a kselftest to verify that the core-sched interface is working
correctly.

Co-developed-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Co-developed-by: Josh Don <joshdon@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: chris hyser <chris.hyser@oracle.com>
---
 tools/testing/selftests/sched/.gitignore      |   1 +
 tools/testing/selftests/sched/Makefile        |   4 +-
 .../testing/selftests/sched/test_coresched.c  | 812 ++++++++++++++++++
 3 files changed, 815 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore b/tools/testing/selftests/sched/.gitignore
index 6996d4654d92..a4b4a1cdcd93 100644
--- a/tools/testing/selftests/sched/.gitignore
+++ b/tools/testing/selftests/sched/.gitignore
@@ -1 +1,2 @@
 cs_prctl_test
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
index 10c72f14fea9..830766e12bed 100644
--- a/tools/testing/selftests/sched/Makefile
+++ b/tools/testing/selftests/sched/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
 	  $(CLANG_FLAGS)
 LDLIBS += -lpthread
 
-TEST_GEN_FILES := cs_prctl_test
-TEST_PROGS := cs_prctl_test
+TEST_GEN_FILES := test_coresched cs_prctl_test
+TEST_PROGS := test_coresched cs_prctl_test
 
 include ../lib.mk
diff --git a/tools/testing/selftests/sched/test_coresched.c b/tools/testing/selftests/sched/test_coresched.c
new file mode 100644
index 000000000000..9d47845e6f8a
--- /dev/null
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -0,0 +1,812 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Core-scheduling selftests.
+ *
+ * Copyright (C) 2020, Joel Fernandes.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/mount.h>
+#include <sys/prctl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE 59
+# define PR_SCHED_CORE_CLEAR            0
+# define PR_SCHED_CORE_CREATE           1
+# define PR_SCHED_CORE_SHARE_FROM       2
+# define PR_SCHED_CORE_SHARE_TO         3
+#endif
+
+#ifndef DEBUG_PRINT
+#define dprint(...)
+#else
+#define dprint(str, args...) printf("DEBUG: %s: " str "\n", __func__, ##args)
+#endif
+
+void print_banner(char *s)
+{
+	printf("coresched: %s:  ", s);
+}
+
+void print_pass(void)
+{
+	printf("PASS\n");
+}
+
+void assert_cond(int cond, char *str)
+{
+	if (!cond) {
+		printf("Error: %s\n", str);
+		abort();
+	}
+}
+
+char *make_group_root(void)
+{
+	char *mntpath, *mnt;
+	int ret;
+
+	mntpath = malloc(50);
+	if (!mntpath) {
+		perror("Failed to allocate mntpath\n");
+		abort();
+	}
+
+	sprintf(mntpath, "/tmp/coresched-test-XXXXXX");
+	mnt = mkdtemp(mntpath);
+	if (!mnt) {
+		perror("Failed to create mount: ");
+		exit(-1);
+	}
+
+	ret = mount("nodev", mnt, "cgroup", 0, "cpu");
+	if (ret == -1) {
+		perror("Failed to mount cgroup: ");
+		exit(-1);
+	}
+
+	return mnt;
+}
+
+void assert_group_tag(char *cgroup_path, char *tag)
+{
+	char tag_path[50] = {}, rdbuf[8] = {};
+	int tfd;
+
+	sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+	tfd = open(tag_path, O_RDONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
+
+	if (read(tfd, rdbuf, 1) != 1) {
+		perror("Failed to read cgroup tag: ");
+		abort();
+	}
+
+	if (strcmp(rdbuf, tag)) {
+		printf("Group tag does not match (exp: %s, act: %s)\n", tag,
+		       rdbuf);
+		abort();
+	}
+
+	if (close(tfd) == -1) {
+		perror("Failed to close tag fd: ");
+		abort();
+	}
+}
+
+void tag_group(char *cgroup_path)
+{
+	char tag_path[50];
+	int tfd;
+
+	sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+	tfd = open(tag_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
+
+	if (write(tfd, "1", 1) != 1) {
+		perror("Failed to enable coresched on cgroup: ");
+		abort();
+	}
+
+	if (close(tfd) == -1) {
+		perror("Failed to close tag fd: ");
+		abort();
+	}
+
+	assert_group_tag(cgroup_path, "1");
+}
+
+void untag_group(char *cgroup_path)
+{
+	char tag_path[50];
+	int tfd;
+
+	sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+	tfd = open(tag_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
+
+	if (write(tfd, "0", 1) != 1) {
+		perror("Failed to disable coresched on cgroup: ");
+		abort();
+	}
+
+	if (close(tfd) == -1) {
+		perror("Failed to close tag fd: ");
+		abort();
+	}
+
+	assert_group_tag(cgroup_path, "0");
+}
+
+char *make_group(char *parent, char *name)
+{
+	char *cgroup_path;
+	int ret;
+
+	if (!parent && !name)
+		return make_group_root();
+
+	cgroup_path = malloc(50);
+	if (!cgroup_path) {
+		perror("Failed to allocate cgroup_path\n");
+		abort();
+	}
+
+	/* Make the cgroup node for this group */
+	sprintf(cgroup_path, "%s/%s", parent, name);
+	ret = mkdir(cgroup_path, 0644);
+	if (ret == -1) {
+		perror("Failed to create group in cgroup: ");
+		abort();
+	}
+
+	return cgroup_path;
+}
+
+static void del_group(char *path)
+{
+	if (rmdir(path) != 0) {
+		printf("Removal of group failed\n");
+		abort();
+	}
+
+	free(path);
+}
+
+static void del_root_group(char *path)
+{
+	if (umount(path) != 0) {
+		perror("umount of cgroup failed\n");
+		abort();
+	}
+
+	if (rmdir(path) != 0) {
+		printf("Removal of group failed\n");
+		abort();
+	}
+
+	free(path);
+}
+
+struct task_state {
+	int pid_share;
+	char pid_str[50];
+	pthread_mutex_t m;
+	pthread_cond_t cond;
+	pthread_cond_t cond_par;
+};
+
+struct task_state *add_task(char *p)
+{
+	struct task_state *mem;
+	pthread_mutexattr_t am;
+	pthread_condattr_t a;
+	char tasks_path[50];
+	int tfd, pid, ret;
+
+	sprintf(tasks_path, "%s/tasks", p);
+	tfd = open(tasks_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tasks path failed: ");
+		abort();
+	}
+
+	mem = mmap(NULL, sizeof(*mem), PROT_READ | PROT_WRITE,
+		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+	assert_cond(mem != MAP_FAILED, "Failed to mmap shared task state");
+	memset(mem, 0, sizeof(*mem));
+
+	pthread_condattr_init(&a);
+	pthread_condattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
+	pthread_mutexattr_init(&am);
+	pthread_mutexattr_setpshared(&am, PTHREAD_PROCESS_SHARED);
+
+	pthread_cond_init(&mem->cond, &a);
+	pthread_cond_init(&mem->cond_par, &a);
+	pthread_mutex_init(&mem->m, &am);
+
+	pid = fork();
+	if (pid == 0) {
+		while (1) {
+			pthread_mutex_lock(&mem->m);
+			while (!mem->pid_share)
+				pthread_cond_wait(&mem->cond, &mem->m);
+
+			pid = mem->pid_share;
+			mem->pid_share = 0;
+
+			if (pid == -1) {
+				if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, getpid(), 0, 0))
+					perror("prctl() PR_SCHED_CORE_CLEAR failed");
+			} else {
+				if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, pid, 0, 0))
+					perror("prctl() PR_SCHED_CORE_CREATE failed");
+
+				if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, pid, 0, 0))
+					perror("prctl() PR_SCHED_CORE_SHARE_FROM failed");
+			}
+			pthread_mutex_unlock(&mem->m);
+			pthread_cond_signal(&mem->cond_par);
+		}
+	}
+
+	sprintf(mem->pid_str, "%d", pid);
+	dprint("add task %d to group %s", pid, p);
+
+	ret = write(tfd, mem->pid_str, strlen(mem->pid_str));
+	assert_cond(ret != -1, "Failed to write pid into tasks");
+
+	close(tfd);
+	return mem;
+}
+
+/* Move task to a new cgroup */
+void migrate_task(struct task_state *t, char *group)
+{
+	char tasks_path[50];
+	int tfd, ret;
+
+	sprintf(tasks_path, "%s/tasks", group);
+	tfd = open(tasks_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tasks path failed: ");
+		abort();
+	}
+
+	ret = write(tfd, t->pid_str, strlen(t->pid_str));
+	assert_cond(ret != -1, "Failed to write pid into tasks");
+
+	close(tfd);
+}
+
+/* Make t1 share with t2 */
+void make_tasks_share(struct task_state *t1, struct task_state *t2)
+{
+	int p2 = atoi(t2->pid_str);
+
+	dprint("task %s %s", t1->pid_str, t2->pid_str);
+
+	pthread_mutex_lock(&t1->m);
+	t1->pid_share = p2;
+	pthread_mutex_unlock(&t1->m);
+
+	pthread_cond_signal(&t1->cond);
+
+	pthread_mutex_lock(&t1->m);
+	while (t1->pid_share)
+		pthread_cond_wait(&t1->cond_par, &t1->m);
+	pthread_mutex_unlock(&t1->m);
+}
+
+/* Clear t1's cookie */
+void reset_task_cookie(struct task_state *t1)
+{
+	dprint("task %s", t1->pid_str);
+
+	pthread_mutex_lock(&t1->m);
+	t1->pid_share = -1;
+	pthread_mutex_unlock(&t1->m);
+
+	pthread_cond_signal(&t1->cond);
+
+	pthread_mutex_lock(&t1->m);
+	while (t1->pid_share)
+		pthread_cond_wait(&t1->cond_par, &t1->m);
+	pthread_mutex_unlock(&t1->m);
+}
+
+char *get_task_core_cookie(char *pid)
+{
+	char proc_path[50];
+	int found = 0;
+	char *line;
+	int i, j;
+	FILE *fp;
+
+	line = malloc(1024);
+	assert_cond(!!line, "Failed to alloc memory");
+
+	sprintf(proc_path, "/proc/%s/sched", pid);
+
+	fp = fopen(proc_path, "r");
+	assert_cond(!!fp, "Failed to open /proc/<pid>/sched");
+	while (fgets(line, 1024, fp) != NULL) {
+		if (!strstr(line, "core_cookie"))
+			continue;
+
+		for (j = 0, i = 0; i < 1024 && line[i] != '\0'; i++)
+			if (line[i] >= '0' && line[i] <= '9')
+				line[j++] = line[i];
+		line[j] = '\0';
+		found = 1;
+		break;
+	}
+
+	fclose(fp);
+
+	if (found)
+		return line;
+
+	free(line);
+	printf("core_cookie not found. Enable SCHED_DEBUG?\n");
+	abort();
+	return NULL;
+}
+
+#define assert_tasks_share(t1, t2) _assert_tasks_share(__LINE__, t1, t2)
+void _assert_tasks_share(int line, struct task_state *t1, struct task_state *t2)
+{
+	char *c1, *c2;
+
+	c1 = get_task_core_cookie(t1->pid_str);
+	c2 = get_task_core_cookie(t2->pid_str);
+	dprint("(%d) check task (%s) cookie (%s) == task (%s) cookie (%s)",
+	       line, t1->pid_str, c1, t2->pid_str, c2);
+	assert_cond(!strcmp(c1, c2), "Tasks don't share cookie");
+	free(c1);
+	free(c2);
+}
+
+#define assert_tasks_dont_share(t1, t2) _assert_tasks_dont_share(__LINE__, t1, t2)
+void _assert_tasks_dont_share(int line, struct task_state *t1, struct task_state *t2)
+{
+	char *c1, *c2;
+
+	c1 = get_task_core_cookie(t1->pid_str);
+	c2 = get_task_core_cookie(t2->pid_str);
+	dprint("(%d) check task (%s) cookie (%s) != task (%s) cookie (%s)",
+	       line, t1->pid_str, c1, t2->pid_str, c2);
+	assert_cond(strcmp(c1, c2), "Tasks share cookie");
+	free(c1);
+	free(c2);
+}
+
+void assert_task_has_cookie(char *pid)
+{
+	char *tk;
+
+	tk = get_task_core_cookie(pid);
+
+	assert_cond(strcmp(tk, "0"), "Task does not have cookie");
+
+	free(tk);
+}
+
+void assert_task_has_no_cookie(char *pid)
+{
+	char *tk;
+
+	tk = get_task_core_cookie(pid);
+
+	assert_cond(!strcmp(tk, "0"), "Task has cookie");
+
+	free(tk);
+}
+
+void kill_task(struct task_state *t)
+{
+	int pid = atoi(t->pid_str);
+
+	kill(pid, SIGKILL);
+	waitpid(pid, NULL, 0);
+}
+
+/*
+ * Test that a group's children have a cookie inherited
+ * from their parent group _after_ the parent was tagged.
+ *
+ *   p ----- c1 - c11
+ *     \ c2 - c22
+ */
+static void test_cgroup_parent_child_tag_inherit(char *root)
+{
+	char *p, *c1, *c11, *c2, *c22;
+	struct task_state *tsk_p, *tsk_c1, *tsk_c11, *tsk_c2, *tsk_c22;
+
+	print_banner("TEST-CGROUP-PARENT-CHILD-TAG");
+
+	p = make_group(root, "p");
+	tsk_p = add_task(p);
+	assert_task_has_no_cookie(tsk_p->pid_str);
+
+	c1 = make_group(p, "c1");
+	tsk_c1 = add_task(c1);
+	assert_group_tag(c1, "0");
+	assert_task_has_no_cookie(tsk_c1->pid_str);
+	assert_tasks_share(tsk_c1, tsk_p);
+
+	c11 = make_group(c1, "c11");
+	tsk_c11 = add_task(c11);
+	assert_group_tag(c11, "0");
+	assert_task_has_no_cookie(tsk_c11->pid_str);
+	assert_tasks_share(tsk_c11, tsk_p);
+
+	c2 = make_group(p, "c2");
+	tsk_c2 = add_task(c2);
+	assert_group_tag(c2, "0");
+	assert_task_has_no_cookie(tsk_c2->pid_str);
+	assert_tasks_share(tsk_c2, tsk_p);
+
+	tag_group(p);
+
+	/* Verify c1 got the cookie */
+	assert_group_tag(c1, "0");
+	assert_task_has_cookie(tsk_c1->pid_str);
+	assert_tasks_share(tsk_c1, tsk_p);
+
+	/* Verify c2 got the cookie */
+	assert_group_tag(c2, "0");
+	assert_task_has_cookie(tsk_c2->pid_str);
+	assert_tasks_share(tsk_c2, tsk_p);
+
+	/* Verify c11 got the cookie */
+	assert_group_tag(c11, "0");
+	assert_task_has_cookie(tsk_c11->pid_str);
+	assert_tasks_share(tsk_c11, tsk_p);
+
+	/*
+	 * Verify c22 which is a nested group created
+	 * _after_ tagging got the cookie.
+	 */
+	c22 = make_group(c2, "c22");
+	tsk_c22 = add_task(c22);
+
+	assert_group_tag(c22, "0");
+	assert_task_has_cookie(tsk_c22->pid_str);
+	assert_tasks_share(tsk_c22, tsk_c1);
+	assert_tasks_share(tsk_c22, tsk_c11);
+	assert_tasks_share(tsk_c22, tsk_c2);
+	assert_tasks_share(tsk_c22, tsk_p);
+
+	kill_task(tsk_p);
+	kill_task(tsk_c1);
+	kill_task(tsk_c11);
+	kill_task(tsk_c2);
+	kill_task(tsk_c22);
+	del_group(c22);
+	del_group(c11);
+	del_group(c1);
+	del_group(c2);
+	del_group(p);
+	print_pass();
+}
+
+/*
+ * Test that a tagged group's children have a cookie inherited
+ * from their parent group.
+ */
+static void test_cgroup_parent_tag_child_inherit(char *root)
+{
+	char *p, *c1, *c2, *c3;
+	struct task_state *tsk_p, *tsk_c1, *tsk_c2, *tsk_c3;
+
+	print_banner("TEST-CGROUP-PARENT-TAG-CHILD-INHERIT");
+
+	p = make_group(root, "p");
+	tsk_p = add_task(p);
+	assert_task_has_no_cookie(tsk_p->pid_str);
+	tag_group(p);
+	assert_task_has_cookie(tsk_p->pid_str);
+
+	c1 = make_group(p, "c1");
+	tsk_c1 = add_task(c1);
+	assert_task_has_cookie(tsk_c1->pid_str);
+	/* Child tag is "0" but it inherits cookie from parent. */
+	assert_group_tag(c1, "0");
+	assert_tasks_share(tsk_c1, tsk_p);
+
+	c2 = make_group(p, "c2");
+	tsk_c2 = add_task(c2);
+	assert_group_tag(c2, "0");
+	assert_tasks_share(tsk_c2, tsk_p);
+	assert_tasks_share(tsk_c1, tsk_c2);
+
+	c3 = make_group(c1, "c3");
+	tsk_c3 = add_task(c3);
+	assert_group_tag(c3, "0");
+	assert_tasks_share(tsk_c3, tsk_p);
+	assert_tasks_share(tsk_c1, tsk_c3);
+
+	kill_task(tsk_p);
+	kill_task(tsk_c1);
+	kill_task(tsk_c2);
+	kill_task(tsk_c3);
+	del_group(c3);
+	del_group(c1);
+	del_group(c2);
+	del_group(p);
+	print_pass();
+}
+
+/* Test that the cookie is updated on cgroup migration */
+static void test_cgroup_inherit_on_migrate(char *root)
+{
+	char *c1, *c2, *c3;
+	struct task_state *tsk_c1, *tsk_c2, *tsk_c3;
+
+	print_banner("TEST-CGROUP-INHERIT-ON-MIGRATE");
+
+	c1 = make_group(root, "c1");
+	tsk_c1 = add_task(c1);
+	assert_task_has_no_cookie(tsk_c1->pid_str);
+
+	c2 = make_group(root, "c2");
+	tsk_c2 = add_task(c2);
+	tsk_c3 = add_task(c2);
+	assert_task_has_no_cookie(tsk_c2->pid_str);
+	assert_task_has_no_cookie(tsk_c3->pid_str);
+	tag_group(c2);
+	assert_task_has_cookie(tsk_c2->pid_str);
+	assert_task_has_cookie(tsk_c3->pid_str);
+	assert_tasks_share(tsk_c2, tsk_c3);
+	assert_tasks_dont_share(tsk_c1, tsk_c2);
+
+	c3 = make_group(root, "c3");
+	tag_group(c3);
+	assert_group_tag(c3, "1");
+	migrate_task(tsk_c3, c3);
+	assert_task_has_cookie(tsk_c3->pid_str);
+	assert_tasks_dont_share(tsk_c2, tsk_c3);
+	assert_tasks_dont_share(tsk_c1, tsk_c3);
+
+	migrate_task(tsk_c3, c2);
+	assert_tasks_share(tsk_c3, tsk_c2);
+
+	migrate_task(tsk_c3, c1);
+	assert_tasks_share(tsk_c3, tsk_c1);
+
+	kill_task(tsk_c1);
+	kill_task(tsk_c2);
+	kill_task(tsk_c3);
+	del_group(c1);
+	del_group(c2);
+	del_group(c3);
+	print_pass();
+}
+
+/* Test that group cookie is cleared when group is untagged */
+static void test_untag_group(char *root)
+{
+	char *c;
+	struct task_state *t;
+
+	print_banner("TEST-UNTAG-CGROUP");
+
+	c = make_group(root, "c");
+	t = add_task(c);
+	assert_task_has_no_cookie(t->pid_str);
+	tag_group(c);
+	assert_task_has_cookie(t->pid_str);
+	untag_group(c);
+	assert_task_has_no_cookie(t->pid_str);
+
+	kill_task(t);
+	del_group(c);
+	print_pass();
+}
+
+/* Test case when both cgroup and task cookies are used at the same time */
+static void test_cgroup_and_task_cookie(char *root)
+{
+	char *c1, *c2;
+	struct task_state *tsk_1, *tsk_2, *tsk_3;
+
+	print_banner("TEST-CGROUP-AND-TASK-COOKIE");
+
+	c1 = make_group(root, "c1");
+	c2 = make_group(root, "c2");
+	tag_group(c1);
+	tag_group(c2);
+	tsk_1 = add_task(c1);
+	tsk_2 = add_task(c1);
+	tsk_3 = add_task(c2);
+	assert_task_has_cookie(tsk_1->pid_str);
+	assert_task_has_cookie(tsk_2->pid_str);
+	assert_task_has_cookie(tsk_3->pid_str);
+	assert_tasks_share(tsk_1, tsk_2);
+	assert_tasks_dont_share(tsk_1, tsk_3);
+
+	/*
+	 * Two tasks in different cgroup but with the same task cookie;
+	 * should not share.
+	 */
+	make_tasks_share(tsk_1, tsk_3);
+	assert_tasks_dont_share(tsk_1, tsk_3);
+	reset_task_cookie(tsk_1);
+	reset_task_cookie(tsk_3);
+
+	/*
+	 * Two tasks in the same cgroup and with the same task cookie;
+	 * should share.
+	 */
+	make_tasks_share(tsk_1, tsk_2);
+	assert_tasks_share(tsk_1, tsk_2);
+	reset_task_cookie(tsk_1);
+	reset_task_cookie(tsk_2);
+
+	/*
+	 * Two tasks in the same cgroup but with different task cookies;
+	 * should not share.
+	 */
+	make_tasks_share(tsk_1, tsk_3);
+	assert_tasks_dont_share(tsk_1, tsk_2);
+	reset_task_cookie(tsk_1);
+	reset_task_cookie(tsk_3);
+
+	kill_task(tsk_1);
+	kill_task(tsk_2);
+	kill_task(tsk_3);
+	del_group(c1);
+	del_group(c2);
+	print_pass();
+}
+
+static void test_prctl_in_group(char *root)
+{
+	char *p;
+	struct task_state *tsk1, *tsk2, *tsk3;
+
+	print_banner("TEST-PRCTL-IN-GROUP");
+
+	p = make_group(root, "p");
+	tsk1 = add_task(p);
+	assert_task_has_no_cookie(tsk1->pid_str);
+	tag_group(p);
+	assert_task_has_cookie(tsk1->pid_str);
+
+	tsk2 = add_task(p);
+	assert_task_has_cookie(tsk2->pid_str);
+
+	tsk3 = add_task(p);
+	assert_task_has_cookie(tsk3->pid_str);
+
+	/* tsk2 shares with tsk3 -- both get disconnected from the CGroup. */
+	make_tasks_share(tsk2, tsk3);
+	assert_task_has_cookie(tsk2->pid_str);
+	assert_task_has_cookie(tsk3->pid_str);
+	assert_tasks_share(tsk2, tsk3);
+	assert_tasks_dont_share(tsk1, tsk2);
+	assert_tasks_dont_share(tsk1, tsk3);
+
+	/* now reset tsk3 -- get connected back to CGroup. */
+	reset_task_cookie(tsk3);
+	assert_task_has_cookie(tsk3->pid_str);
+	assert_tasks_dont_share(tsk2, tsk3);
+	assert_tasks_share(tsk1, tsk3);		/* tsk3 is back. */
+	assert_tasks_dont_share(tsk1, tsk2);	/* but tsk2 is still disconnected. */
+
+	/* now reset tsk2 as well to get it connected back to CGroup. */
+	reset_task_cookie(tsk2);
+	assert_task_has_cookie(tsk2->pid_str);
+	assert_tasks_share(tsk2, tsk3);
+	assert_tasks_share(tsk1, tsk3);
+	assert_tasks_share(tsk1, tsk2);
+
+	/* Test the rest of the cases (2 to 4)
+	 *
+	 *          t1              joining         t2
+	 * CASE 1:
+	 * before   0                               0
+	 * after    new cookie                      new cookie
+	 *
+	 * CASE 2:
+	 * before   X (non-zero)                    0
+	 * after    0                               0
+	 *
+	 * CASE 3:
+	 * before   0                               X (non-zero)
+	 * after    X                               X
+	 *
+	 * CASE 4:
+	 * before   Y (non-zero)                    X (non-zero)
+	 * after    X                               X
+	 */
+
+	/* case 2: */
+	dprint("case 2");
+	make_tasks_share(tsk1, tsk1);
+	assert_tasks_dont_share(tsk1, tsk2);
+	assert_tasks_dont_share(tsk1, tsk3);
+	assert_task_has_cookie(tsk1->pid_str);
+	make_tasks_share(tsk1, tsk2);	/* Will reset the task cookie. */
+	assert_task_has_cookie(tsk1->pid_str);
+	assert_task_has_cookie(tsk2->pid_str);
+
+	/* case 3: */
+	dprint("case 3");
+	make_tasks_share(tsk2, tsk2);
+	assert_tasks_dont_share(tsk2, tsk1);
+	assert_tasks_dont_share(tsk2, tsk3);
+	assert_task_has_cookie(tsk2->pid_str);
+	make_tasks_share(tsk1, tsk2);
+	assert_task_has_cookie(tsk1->pid_str);
+	assert_task_has_cookie(tsk2->pid_str);
+	assert_tasks_share(tsk1, tsk2);
+	assert_tasks_dont_share(tsk1, tsk3);
+	reset_task_cookie(tsk1);
+	reset_task_cookie(tsk2);
+
+	/* case 4: */
+	dprint("case 4");
+	assert_tasks_share(tsk1, tsk2);
+	assert_task_has_cookie(tsk1->pid_str);
+	assert_task_has_cookie(tsk2->pid_str);
+	make_tasks_share(tsk1, tsk1);
+	assert_task_has_cookie(tsk1->pid_str);
+	make_tasks_share(tsk2, tsk2);
+	assert_task_has_cookie(tsk2->pid_str);
+	assert_tasks_dont_share(tsk1, tsk2);
+	make_tasks_share(tsk1, tsk2);
+	assert_task_has_cookie(tsk1->pid_str);
+	assert_task_has_cookie(tsk2->pid_str);
+	assert_tasks_share(tsk1, tsk2);
+	assert_tasks_dont_share(tsk1, tsk3);
+	reset_task_cookie(tsk1);
+	reset_task_cookie(tsk2);
+
+	kill_task(tsk1);
+	kill_task(tsk2);
+	kill_task(tsk3);
+	del_group(p);
+	print_pass();
+}
+
+int main(int argc, char *argv[])
+{
+	char *root;
+
+	if (argc > 1)
+		root = argv[1];
+	else
+		root = make_group(NULL, NULL);
+
+	test_cgroup_parent_tag_child_inherit(root);
+	test_cgroup_parent_child_tag_inherit(root);
+	test_cgroup_inherit_on_migrate(root);
+	test_untag_group(root);
+	test_cgroup_and_task_cookie(root);
+	test_prctl_in_group(root);
+
+	if (argc <= 1)
+		del_root_group(root);
+	return 0;
+}
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH resend 7/8] Documentation: Add core scheduling documentation
  2021-03-24 21:40 [PATCH resend 0/8] Core sched remaining patches rebased Joel Fernandes (Google)
                   ` (5 preceding siblings ...)
  2021-03-24 21:40 ` [PATCH resend 6/8] kselftest: Add tests for core-sched interface Joel Fernandes (Google)
@ 2021-03-24 21:40 ` Joel Fernandes (Google)
  2021-03-24 21:40 ` [PATCH resend 8/8] sched: Debug bits Joel Fernandes (Google)
  7 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2021-03-24 21:40 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, rostedt,
	benbjiang, Alexandre Chartre, James.Bottomley, OWeisse,
	Dhaval Giani, chris.hyser, Josh Don, Hao Luo, Tom Lendacky,
	dhiatt, Randy Dunlap

Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Chris Hyser <chris.hyser@oracle.com>
Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
Co-developed-by: Josh Don <joshdon@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 460 ++++++++++++++++++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 461 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index 000000000000..0ef00edd50e6
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,460 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks doesn't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't contend for
+the same hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that only trusted tasks share a core. This increase in core sharing
+can improve performance; however, performance is not guaranteed to always
+improve, though improvements have been observed with a number of real world
+workloads. In theory, core scheduling aims to perform at least as well as when
+Hyper Threading is disabled. In practice, this is mostly the case though not
+always: synchronizing scheduling decisions across 2 or more CPUs in a core
+involves additional overhead - especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.core_tag``
+
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+All of the tagged CGroup's tasks are then allowed to run concurrently on a
+core's hyperthreads (also called siblings).
+
+A value of ``0`` means the tag state of the CGroup is inherited from its
+parent hierarchy: if any ancestor of the CGroup is tagged, then the group is
+tagged.
+
+.. note:: Once a CGroup is tagged via cpu.core_tag, it is not possible to set this
+          for any descendant of the tagged group.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+          a core with kernel threads and untagged system threads. For this reason,
+          if a group has ``cpu.core_tag`` of 0, it is considered to be trusted.
+
+prctl(2) interface
+##################
+
+The ``prctl(2)`` command ``PR_SCHED_CORE_SHARE`` provides an interface for
+creating core scheduling groups and for admitting tasks to and removing tasks
+from them. Permission to change a task's ``cookie``, and hence the core
+scheduling group it represents, is based on ``ptrace`` access.
+
+::
+
+    #include <sys/prctl.h>
+
+    int prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5);
+
+    int prctl(PR_SCHED_CORE_SHARE, sub_command, pid, pid_type, 0);
+
+option:
+    ``PR_SCHED_CORE_SHARE``
+
+arg2:
+    sub-command:
+
+    - ``PR_SCHED_CORE_CLEAR            0  -- clear core_sched cookie of pid``
+    - ``PR_SCHED_CORE_CREATE           1  -- create a new cookie for pid``
+    - ``PR_SCHED_CORE_SHARE_FROM       2  -- copy core_sched cookie from pid``
+    - ``PR_SCHED_CORE_SHARE_TO         3  -- copy core_sched cookie to pid``
+
+arg3:
+    ``pid`` of the task to which the operation applies, where ``pid == 0``
+    implies the current process.
+
+arg4:
+    ``pid_type`` for PR_SCHED_CORE_CLEAR/CREATE/SHARE_TO is an enum
+    {PIDTYPE_PID=0, PIDTYPE_TGID, PIDTYPE_PGID} and determines how the target
+    ``pid`` should be interpreted. ``PIDTYPE_PID`` indicates that the target
+    ``pid`` should be treated as an individual task, ``PIDTYPE_TGID`` as a
+    process or thread group, and ``PIDTYPE_PGID`` as a process group.
+
+arg5:
+    MUST be equal to 0.
+
+Return Value:
+::
+
+    EINVAL - bad parameter
+    ENOMEM - unable to allocate memory for a cookie.
+    ESRCH  - unable to find specified pid.
+    EPERM  - caller lacks permission to change the cookie of target.
+    EACCES - caller lacks permission to change the cookies of all
+             targets of pidtype.
+
+Creation
+~~~~~~~~
+Creation is accomplished by creating a ``cookie`` for the target pid, which may
+be an individual task if tgt_pid_type is ``PIDTYPE_PID``, a process and its
+threads if ``PIDTYPE_TGID``, or all tasks of the specified process group if
+``PIDTYPE_PGID``.
+
+::
+
+    if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, tgt_pid, tgt_pid_type, 0) < 0)
+            handle_error("sched_core_create failed");
+
+Examples:
+::
+
+	Given a process group, process and thread hierarchy:
+
+	## Create a thread/process/process group hierarchy
+	tid=40003, / tgid=40003 / pgid=40003: 0
+	    tid=40004, / tgid=40004 / pgid=40003: 0
+		tid=40005, / tgid=40004 / pgid=40003: 0
+		tid=40007, / tgid=40004 / pgid=40003: 0
+		tid=40008, / tgid=40004 / pgid=40003: 0
+	    tid=40006, / tgid=40006 / pgid=40003: 0
+		tid=40009, / tgid=40006 / pgid=40003: 0
+		tid=40010, / tgid=40006 / pgid=40003: 0
+		tid=40011, / tgid=40006 / pgid=40003: 0
+
+	## Set a cookie on entire process group
+	0 = prctl(59, 1, 0, 2, 0)
+	tid=40003, / tgid=40003 / pgid=40003: 1f7121100000000
+	    tid=40004, / tgid=40004 / pgid=40003: 1f7121100000000
+		tid=40005, / tgid=40004 / pgid=40003: 1f7121100000000
+		tid=40007, / tgid=40004 / pgid=40003: 1f7121100000000
+		tid=40008, / tgid=40004 / pgid=40003: 1f7121100000000
+	    tid=40006, / tgid=40006 / pgid=40003: 1f7121100000000
+		tid=40009, / tgid=40006 / pgid=40003: 1f7121100000000
+		tid=40010, / tgid=40006 / pgid=40003: 1f7121100000000
+		tid=40011, / tgid=40006 / pgid=40003: 1f7121100000000
+
+	## Set a new cookie on entire process/TGID [40006]
+	0 = prctl(59, 1, 40006, 1, 0)
+	tid=40003, / tgid=40003 / pgid=40003: 1f7121100000000
+	    tid=40004, / tgid=40004 / pgid=40003: 1f7121100000000
+		tid=40005, / tgid=40004 / pgid=40003: 1f7121100000000
+		tid=40007, / tgid=40004 / pgid=40003: 1f7121100000000
+		tid=40008, / tgid=40004 / pgid=40003: 1f7121100000000
+	    tid=40006, / tgid=40006 / pgid=40003: 1f7121200000000
+		tid=40009, / tgid=40006 / pgid=40003: 1f7121200000000
+		tid=40010, / tgid=40006 / pgid=40003: 1f7121200000000
+		tid=40011, / tgid=40006 / pgid=40003: 1f7121200000000
+
+Removal
+~~~~~~~
+Removing one or more tasks from a core scheduling group is done by specifying a
+target pid and its type. Again, pidtype determines the interpretation of the
+target pid.
+
+::
+
+    if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, tgt_pid, tgt_pid_type, 0) < 0)
+             handle_error("clr_tid sched_core failed");
+
+Examples (continued from above):
+::
+
+	## Clear a cookie on a single task [40006]
+	0 = prctl(59, 0, 40006, 0, 0)
+	tid=40003, / tgid=40003 / pgid=40003: 1f7121200000000
+	    tid=40004, / tgid=40004 / pgid=40003: 1f7121100000000
+		tid=40005, / tgid=40004 / pgid=40003: 1f7121100000000
+		tid=40007, / tgid=40004 / pgid=40003: 1f7121100000000
+		tid=40008, / tgid=40004 / pgid=40003: 1f7121100000000
+	    tid=40006, / tgid=40006 / pgid=40003: 0
+		tid=40009, / tgid=40006 / pgid=40003: 1f7121200000000
+		tid=40010, / tgid=40006 / pgid=40003: 1f7121200000000
+		tid=40011, / tgid=40006 / pgid=40003: 1f7121200000000
+
+	## Clear cookies on the entire process group
+	0 = prctl(59, 0, 0, 2, 0)
+	tid=40003, / tgid=40003 / pgid=40003: 0
+	    tid=40004, / tgid=40004 / pgid=40003: 0
+		tid=40005, / tgid=40004 / pgid=40003: 0
+		tid=40007, / tgid=40004 / pgid=40003: 0
+		tid=40008, / tgid=40004 / pgid=40003: 0
+	    tid=40006, / tgid=40006 / pgid=40003: 0
+		tid=40009, / tgid=40006 / pgid=40003: 0
+		tid=40010, / tgid=40006 / pgid=40003: 0
+		tid=40011, / tgid=40006 / pgid=40003: 0
+
+Cookie Transferal
+~~~~~~~~~~~~~~~~~
+
+Transferring a cookie between the current task and other tasks is possible
+using ``PR_SCHED_CORE_SHARE_FROM`` and ``PR_SCHED_CORE_SHARE_TO``, depending on
+the direction. A helper utility can copy cookies between tasks by first copying
+the cookie from the source task with ``PR_SCHED_CORE_SHARE_FROM`` and then
+sharing it with the target ``pid`` using ``PR_SCHED_CORE_SHARE_TO``; again,
+pidtype determines the interpretation of the target pid.
+
+::
+
+    if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, from_pid, 0, 0) < 0)
+            handle_error("from_tid sched_core failed");
+
+    if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, to_pid, tgt_pid_type, 0) < 0)
+            handle_error("to_tid sched_core failed");
+
+Examples (continued from above):
+::
+
+	pidtype can even be used with SHARE_TO to "spread" the cookie among the
+	appropriate group by sharing with self.
+
+	## given setup
+	tid=40003, / tgid=40003 / pgid=40003: 1f7121200000000
+	    tid=40004, / tgid=40004 / pgid=40003: 1f7121100000000
+		tid=40005, / tgid=40004 / pgid=40003: 1f7121100000000
+		tid=40007, / tgid=40004 / pgid=40003: 1f7121100000000
+		tid=40008, / tgid=40004 / pgid=40003: 1f7121100000000
+	    tid=40006, / tgid=40006 / pgid=40003: 0
+		tid=40009, / tgid=40006 / pgid=40003: 1f7121200000000
+		tid=40010, / tgid=40006 / pgid=40003: 1f7121200000000
+		tid=40011, / tgid=40006 / pgid=40003: 1f7121200000000
+
+	## Copy cookie from current [40003] to current as pidtype PGID
+	0 = prctl(59, 3, 0, 2, 0)
+	tid=40003, / tgid=40003 / pgid=40003: 1f7121200000000
+	    tid=40004, / tgid=40004 / pgid=40003: 1f7121200000000
+		tid=40005, / tgid=40004 / pgid=40003: 1f7121200000000
+		tid=40007, / tgid=40004 / pgid=40003: 1f7121200000000
+		tid=40008, / tgid=40004 / pgid=40003: 1f7121200000000
+	    tid=40006, / tgid=40006 / pgid=40003: 1f7121200000000
+		tid=40009, / tgid=40006 / pgid=40003: 1f7121200000000
+		tid=40010, / tgid=40006 / pgid=40003: 1f7121200000000
+		tid=40011, / tgid=40006 / pgid=40003: 1f7121200000000
+
+Other prctl Notes
+~~~~~~~~~~~~~~~~~
+
+It is important to note that, on a ``clone(2)`` or ``fork(2)``, the child will
+be assigned a copy of the same ``cookie`` as its parent and thus is in the same
+core scheduling group. The default, therefore, is that fork-only or threaded
+applications can share the cores of a processor. A new cookie can be created
+post clone/fork if such sharing is undesirable. An ``execve(2)`` will, however,
+automatically assign a new cookie to a task, as no reasonable assumptions can
+be made about the security of the new code.
+
+.. note:: The core-sharing granted with ``prctl(2)`` will be subject to
+          core-sharing restrictions specified by the CGroup interface. For example,
+          if tasks T1 and T2 are a part of 2 different tagged CGroups, then they will
+          not share a core even if ``prctl(2)`` is used to set a shared cookie.
+
+Design/Implementation
+---------------------
+Each task that is tagged is assigned a cookie internally in the kernel. As
+mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
+each other and share a core.
+
+The basic idea is that every schedule event tries to select tasks for all the
+siblings of a core such that all the selected tasks running on a core are
+trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
+The idle task is considered special, as it trusts everything and everything
+trusts it.
+
+During a schedule() event on any sibling of a core, the highest priority task on
+the sibling's core is picked and assigned to the sibling calling schedule(), if
+the sibling has the task enqueued. For the rest of the siblings in the core,
+the highest priority task with the same cookie is selected if there is one
+runnable in their individual run queues. If a task with the same cookie is not
+available, the idle task is selected, as the idle task is globally trusted.
+
+Once a task has been selected for all the siblings in the core, an IPI is sent to
+the siblings for whom a new task was selected. On receiving the IPI, the
+siblings switch to the new task immediately. If an idle task is selected for a
+sibling, then the sibling is considered to be in a `forced idle` state. I.e.,
+it may have tasks on its own runqueue to run, yet it will still have to run
+idle. More on this in the next section.
+
+Forced-idling of tasks
+----------------------
+The scheduler tries its best to find tasks that trust each other such that all
+tasks selected to be scheduled are of the highest priority in a core.  However,
+it is possible that some runqueues have tasks that are incompatible with the
+highest priority ones in the core. Favoring security over fairness, one or more
+siblings could be forced to select a lower priority task if the highest
+priority task is not trusted with respect to the core wide highest priority
+task.  If a sibling does not have a trusted task to run, it will be forced idle
+by the scheduler (idle thread is scheduled to run).
+
+When the highest priority task is selected to run, a reschedule-IPI is sent to
+the sibling to force it into idle. This results in 4 cases which need to be
+considered depending on whether a VM or a regular usermode process was running
+on either HT::
+
+          HT1 (attack)            HT2 (victim)
+   A      idle -> user space      user space -> idle
+   B      idle -> user space      guest -> idle
+   C      idle -> guest           user space -> idle
+   D      idle -> guest           guest -> idle
+
+Note that for better performance, we do not wait for the destination CPU
+(victim) to enter idle mode. This is because the sending of the IPI would bring
+the destination CPU immediately into kernel mode from user space, or cause a
+VMEXIT in the case of guests. At best, this would only leak some scheduler metadata
+which may not be worth protecting. It is also possible that the IPI is received
+too late on some architectures, but this has not been observed in the case of
+x86.
+
+Kernel protection from untrusted tasks
+--------------------------------------
+Entry into the kernel (syscall, IRQ or VMEXIT) needs protection. The scheduler
+on its own cannot protect the kernel executing concurrently with an untrusted
+task in a core. This is because the scheduler is unaware of interrupts/syscalls
+at scheduling time. To mitigate this, an IPI is sent to siblings on kernel
+entry. This IPI forces the sibling to enter kernel mode and to wait before
+returning to user space until all siblings of the core have left kernel mode.
+This process is also known as stunning. For good performance, an IPI is sent to
+a sibling only if it is running a tagged task. If a sibling is running a
+kernel thread or is idle, no IPI is sent.
+
+The kernel protection feature is off by default. To protect the kernel, pass a
+comma-separated list of what to protect to the ``ht_protect=`` kernel command
+line option. Possible values are ``irq``, ``syscall`` and ``kvm``.
+
+Note that an arch has to define the ``TIF_UNSAFE_RET`` thread info flag to be
+able to use kernel protection. Also, if protecting the kernel from a VM is
+desired, an arch should call kvm_exit_to_guest_mode() during ``VMENTER`` and
+kvm_enter_from_guest_mode() during ``VMEXIT``. Currently, x86 supports both of
+these.
+
+Other alternative ideas discussed for kernel protection are listed below just
+for completeness. They all have limitations:
+
+1. Changing interrupt affinities to a trusted core which does not execute untrusted tasks
+#########################################################################################
+By changing the interrupt affinities to a designated safe-CPU which runs
+only trusted tasks, IRQ data can be protected. One issue is that this involves
+giving up a full CPU core of the system to run safe tasks. Another is that
+per-cpu interrupts such as the local timer interrupt cannot have their
+affinity changed. Also, sensitive timer callbacks such as the random entropy timer
+can run in softirq on return from these interrupts and expose sensitive
+data. In the future, that could be mitigated by forcing softirqs into threaded
+mode by utilizing a mechanism similar to ``CONFIG_PREEMPT_RT``.
+
+Yet another issue with this is that, for multiqueue devices with managed
+interrupts, the IRQ affinities cannot be changed; however, it could be
+possible to force a reduced number of queues, which would in turn make it
+possible to shield one or two CPUs from such interrupts and queue handling,
+for the price of indirection.
+
+2. Running IRQs as threaded-IRQs
+################################
+This would result in forcing IRQs into the scheduler which would then provide
+the process-context mitigation. However, not all interrupts can be threaded.
+Also this does nothing about syscall entries.
+
+3. Kernel Address Space Isolation
+#################################
+System calls could run in a much restricted address space which is
+guaranteed not to leak any sensitive data. There are practical limitations in
+implementing this - the main concern being how to decide on an address space
+that is guaranteed to not have any sensitive data.
+
+4. Limited cookie-based protection
+##################################
+On a system call, change the cookie to the system trusted cookie and initiate a
+schedule event. This would be better than pausing all the siblings for the
+entire duration of the system call, but would still be a huge performance hit.
+
+Trust model
+-----------
+Core scheduling maintains trust relationships amongst groups of tasks by
+assigning them the same cookie value.
+When a system with core scheduling boots, all tasks are considered to trust
+each other. This is because the core scheduler does not have information about
+trust relationships until userspace uses the above mentioned interfaces to
+communicate them. In other words, all tasks have a default cookie value of 0
+and are considered system-wide trusted. The stunning of siblings running
+cookie-0 tasks is also avoided.
+
+Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
+within such groups are considered to trust each other, but do not trust those
+outside. Tasks outside the group also don't trust tasks within.
+
+coresched command line option
+-----------------------------
+The coresched kernel command line option can be used to:
+  - Keep coresched on even if the system is not vulnerable (``coresched=on``).
+  - Keep coresched off even if the system is vulnerable (``coresched=off``).
+  - Keep coresched on only if the system is vulnerable (``coresched=secure``).
+
+The default is ``coresched=secure``. However a user who has a usecase that
+needs core-scheduling, such as improving performance of VMs by tagging vCPU
+threads, could pass ``coresched=on`` to force it on.
+
+Limitations in core-scheduling
+------------------------------
+Core scheduling tries to guarantee that only trusted tasks run concurrently on a
+core. But there could be a small window of time during which untrusted tasks
+run concurrently, or during which the kernel runs concurrently with a task it
+does not trust.
+
+1. IPI processing delays
+########################
+Core scheduling selects only trusted tasks to run together. An IPI is used to
+notify the siblings to switch to the new task. But there could be hardware
+delays in receiving the IPI on some architectures (on x86, this has not been
+observed). This may cause an attacker task to start running on a CPU before its
+siblings receive the IPI. Even though the cache is flushed on entry to user
+mode, victim tasks on siblings may populate data in the cache and
+microarchitectural buffers after the attacker starts to run, and this is a
+possible avenue for data leakage.
+
+Open cross-HT issues that core scheduling does not solve
+--------------------------------------------------------
+1. For MDS
+##########
+Core scheduling cannot protect against MDS attacks between an HT running in
+user mode and another running in kernel mode. Even though both HTs run tasks
+which trust each other, kernel memory is still considered untrusted. Such
+attacks are possible for any combination of sibling CPU modes (host or guest mode).
+
+2. For L1TF
+###########
+Core scheduling cannot protect against an L1TF guest attacker exploiting a
+guest or host victim. This is because the guest attacker can craft invalid
+PTEs which are not inverted due to a vulnerable guest kernel. The only
+solution is to disable EPT (Extended Page Tables).
+
+For both MDS and L1TF, if the guest vCPUs are configured to not trust each
+other (by tagging them separately), then guest-to-guest attacks would go away.
+Alternatively, a system admin policy could consider guest-to-guest attacks a
+guest problem.
+
+Another approach to resolve these would be to make every untrusted task on the
+system not trust every other untrusted task. While this could reduce the
+parallelism of the untrusted tasks, it would still solve the above issues while
+allowing system processes (trusted tasks) to share a core.
+
+Use cases
+---------
+The main use case for core scheduling is mitigating cross-HT vulnerabilities
+with SMT enabled. There are other use cases where this feature could be used:
+
+- Isolating tasks that need a whole core: examples include realtime tasks and
+  tasks that use SIMD instructions.
+- Gang scheduling: requirements for a group of tasks that need to be scheduled
+  together could also be realized using core scheduling. One example is the
+  vCPUs of a VM.
+
+Future work
+-----------
+Skipping per-HT mitigations if task is trusted
+##############################################
+If core scheduling is enabled, by default all tasks trust each other as
+mentioned above. In such a scenario, it may be desirable to skip the same-HT
+mitigations on return to trusted user mode to improve performance.
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index ca4dbdd9016d..f12cda55538b 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -15,3 +15,4 @@ are configurable at compile, boot or run time.
    tsx_async_abort
    multihit.rst
    special-register-buffer-data-sampling.rst
+   core-scheduling.rst
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH resend 8/8] sched: Debug bits...
  2021-03-24 21:40 [PATCH resend 0/8] Core sched remaining patches rebased Joel Fernandes (Google)
                   ` (6 preceding siblings ...)
  2021-03-24 21:40 ` [PATCH resend 7/8] Documentation: Add core scheduling documentation Joel Fernandes (Google)
@ 2021-03-24 21:40 ` Joel Fernandes (Google)
  7 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2021-03-24 21:40 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, rostedt,
	benbjiang, Alexandre Chartre, James.Bottomley, OWeisse,
	Dhaval Giani, chris.hyser, Josh Don, Hao Luo, Tom Lendacky,
	dhiatt

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Not-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 40 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/fair.c | 12 ++++++++++++
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a733891dfe7d..2649efeac19f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -106,6 +106,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b, bool
 
 	int pa = __task_prio(a), pb = __task_prio(b);
 
+	trace_printk("(%s/%d;%d,%llu,%llu) ?< (%s/%d;%d,%llu,%llu)\n",
+		     a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+		     b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
 	if (-pa < -pb)
 		return true;
 
@@ -292,12 +296,16 @@ static void __sched_core_enable(void)
 
 	static_branch_enable(&__sched_core_enabled);
 	__sched_core_flip(true);
+
+	printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
 	__sched_core_flip(false);
 	static_branch_disable(&__sched_core_enabled);
+
+	printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -5361,6 +5369,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			set_next_task(rq, next);
 		}
 
+		trace_printk("pick pre selected (%u %u %u): %s/%d %llu\n",
+			     rq->core->core_task_seq,
+			     rq->core->core_pick_seq,
+			     rq->core_sched_seq,
+			     next->comm, next->pid,
+			     next->core_cookie.userspace_id);
+
 		rq->core_pick = NULL;
 		return next;
 	}
@@ -5455,6 +5470,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 					rq->core->core_forceidle_seq++;
 			}
 
+			trace_printk("cpu(%d): selected: %s/%d %llu\n",
+				     i, p->comm, p->pid,
+				     p->core_cookie.userspace_id);
+
 			/*
 			 * If this new candidate is of higher priority than the
 			 * previous; and they're incompatible; we need to wipe
@@ -5471,6 +5490,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				rq->core->core_cookie = p->core_cookie;
 				max = p;
 
+				trace_printk("max: %s/%d %llu\n",
+					     max->comm, max->pid,
+					     max->core_cookie.userspace_id);
+
 				if (old_max) {
 					rq->core->core_forceidle = false;
 					for_each_cpu(j, smt_mask) {
@@ -5492,6 +5515,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	/* Something should have been selected for current CPU */
 	WARN_ON_ONCE(!next);
+	trace_printk("picked: %s/%d %llu\n", next->comm, next->pid,
+		     next->core_cookie.userspace_id);
 
 	/*
 	 * Reschedule siblings
@@ -5533,13 +5558,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		}
 
 		/* Did we break L1TF mitigation requirements? */
-		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+		if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+			trace_printk("[%d]: cookie mismatch. %s/%d/0x%llx/0x%llx\n",
+				     rq_i->cpu, rq_i->core_pick->comm,
+				     rq_i->core_pick->pid,
+				     rq_i->core_pick->core_cookie.userspace_id,
+				     rq_i->core->core_cookie.userspace_id);
+			WARN_ON_ONCE(1);
+		}
 
 		if (rq_i->curr == rq_i->core_pick) {
 			rq_i->core_pick = NULL;
 			continue;
 		}
 
+		trace_printk("IPI(%d)\n", i);
 		resched_curr(rq_i);
 	}
 
@@ -5579,6 +5612,11 @@ static bool try_steal_cookie(int this, int that)
 		if (p->core_occupation > dst->idle->core_occupation)
 			goto next;
 
+		trace_printk("core fill: %s/%d (%d->%d) %d %d %llu\n",
+			     p->comm, p->pid, that, this,
+			     p->core_occupation, dst->idle->core_occupation,
+			     cookie->userspace_id);
+
 		p->on_rq = TASK_ON_RQ_MIGRATING;
 		deactivate_task(src, p, 0);
 		set_task_cpu(p, this);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12030b73a032..2432420b4bef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10783,6 +10783,9 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
  */
 static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forceidle)
 {
+	bool root = true;
+	long old, new;
+
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
@@ -10792,6 +10795,15 @@ static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forc
 			cfs_rq->forceidle_seq = fi_seq;
 		}
 
+
+		if (root) {
+			old = cfs_rq->min_vruntime_fi;
+			new = cfs_rq->min_vruntime;
+			root = false;
+			trace_printk("cfs_rq(min_vruntime_fi) %lu->%lu\n",
+				     old, new);
+		}
+
 		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
 	}
 }
-- 
2.31.0.291.g576ba9dcdaf-goog



* Re: [PATCH resend 2/8] sched: core scheduling tagging infrastructure
  2021-03-24 21:40 ` [PATCH resend 2/8] sched: core scheduling tagging infrastructure Joel Fernandes (Google)
@ 2021-03-27  0:09   ` Peter Zijlstra
  2021-03-27  3:19     ` Josh Don
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2021-03-27  0:09 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	dfaggioli, rostedt, benbjiang, Alexandre Chartre,
	James.Bottomley, OWeisse, Dhaval Giani, chris.hyser, Josh Don,
	Hao Luo, Tom Lendacky, dhiatt

On Wed, Mar 24, 2021 at 05:40:14PM -0400, Joel Fernandes (Google) wrote:
> From: Josh Don <joshdon@google.com>
> 
> A single unsigned long is insufficient as a cookie value for core
> scheduling. We will minimally have cookie values for a per-task and a
> per-group interface, which must be combined into an overall cookie.
> 
> This patch adds the infrastructure necessary for setting task and group
> cookie. Namely, it reworks the core_cookie into a struct, and provides
> interfaces for setting task and group cookie, as well as other
> operations (i.e. compare()). Subsequent patches will use these hooks to
> provide an API for setting these cookies.
> 

*urgh*... so I specifically wanted the task interface first to avoid /
get-rid of all this madness. And then you keep it :-(

I've spent the past few hours rewriting patches #2 and #3, and adapting
#4. The thing was working before I added SHARE_FROM back and introduced
GET, but now I'm seeing a few FAILs from the selftest.

I'm too tired to make sense of anything much, or even focus my eyes
consistently, so I'll have to prod at it some more next week, but I've
pushed out the lot to my queue.git:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core-sched

Also, we really need a better name than coretag.c.


* Re: [PATCH resend 2/8] sched: core scheduling tagging infrastructure
  2021-03-27  0:09   ` Peter Zijlstra
@ 2021-03-27  3:19     ` Josh Don
  2021-03-29  9:55       ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Josh Don @ 2021-03-27  3:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, Ingo Molnar,
	torvalds, fweisbec, Kees Cook, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, Steven Rostedt, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Hyser,Chris, Hao Luo, Tom Lendacky, dhiatt

Hi Peter,

On Fri, Mar 26, 2021 at 5:10 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 24, 2021 at 05:40:14PM -0400, Joel Fernandes (Google) wrote:
> > From: Josh Don <joshdon@google.com>
> >
> > A single unsigned long is insufficient as a cookie value for core
> > scheduling. We will minimally have cookie values for a per-task and a
> > per-group interface, which must be combined into an overall cookie.
> >
> > This patch adds the infrastructure necessary for setting task and group
> > cookie. Namely, it reworks the core_cookie into a struct, and provides
> > interfaces for setting task and group cookie, as well as other
> > operations (i.e. compare()). Subsequent patches will use these hooks to
> > provide an API for setting these cookies.
> >
>
> *urgh*... so I specifically wanted the task interface first to avoid /
> get-rid of all this madness. And then you keep it :-(

Sorry, I misunderstood the ask here :/ I had separated out the cgroup
interface parts of the patch, leaving (mostly) the parts which
introduced a compound cookie structure. I see now that you just wanted
the plain task interface to start, with no notion of group cookie.

> I've spent the past few hours rewriting patches #2 and #3, and adapting
> #4. The thing was working before I added SHARE_FROM back and introduced
> GET, but now I'm seeing a few FAILs from the selftest.
>
> I'm too tired to make sense of anything much, or even focus my eyes
> consistently, so I'll have to prod at it some more next week, but I've
> pushed out the lot to my queue.git:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core-sched

Thanks, I'll take a look next week.

> Also, we really need a better name than coretag.c.

Yea, we don't really otherwise use the phrase "tagging". core_sched.c
is probably too confusing given we have sched/core.c.


* Re: [PATCH resend 2/8] sched: core scheduling tagging infrastructure
  2021-03-27  3:19     ` Josh Don
@ 2021-03-29  9:55       ` Peter Zijlstra
  2021-03-30 21:29         ` Josh Don
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2021-03-29  9:55 UTC (permalink / raw)
  To: Josh Don
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, Ingo Molnar,
	torvalds, fweisbec, Kees Cook, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, Steven Rostedt, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Hyser,Chris, Hao Luo, Tom Lendacky, dhiatt

On Fri, Mar 26, 2021 at 08:19:57PM -0700, Josh Don wrote:
> On Fri, Mar 26, 2021 at 5:10 PM Peter Zijlstra <peterz@infradead.org> wrote:

> > I've spent the past few hours rewriting patches #2 and #3, and adapting
> > #4. The thing was working before I added SHARE_FROM back and introduced
> > GET, but now I'm seeing a few FAILs from the selftest.
> >
> > I'm too tired to make sense of anything much, or even focus my eyes
> > consistently, so I'll have to prod at it some more next week, but I've
> > pushed out the lot to my queue.git:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core-sched
> 
> Thanks, I'll take a look next week.

OK, fixed the fails. My tired head made it unconditionally return the
cookie-id of 'current' instead of task. Pushed out an update.

> > Also, we really need a better name than coretag.c.
> 
> Yea, we don't really otherwise use the phrase "tagging". core_sched.c
> is probably too confusing given we have sched/core.c.

Right, so I tried core_sched and my fingers already hate it as much as
kernel/scftorture.c (which I'd assumed my fingers would get used to
eventually, but noooo).

Looking at kernel/sched/ C is very overrepresented, so we really don't
want another I think. B, E, G, H, J, K, N, seem to still be available in
the first half of the alphabet. Maybe, bonghits.c, gabbleduck.c ?


* Re: [PATCH resend 5/8] sched: cgroup cookie API for core scheduling
  2021-03-24 21:40 ` [PATCH resend 5/8] sched: cgroup cookie API for core scheduling Joel Fernandes (Google)
@ 2021-03-30  9:23   ` Peter Zijlstra
  2021-03-30  9:26     ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2021-03-30  9:23 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	dfaggioli, rostedt, benbjiang, Alexandre Chartre,
	James.Bottomley, OWeisse, Dhaval Giani, chris.hyser, Josh Don,
	Hao Luo, Tom Lendacky, dhiatt

On Wed, Mar 24, 2021 at 05:40:17PM -0400, Joel Fernandes (Google) wrote:
> From: Josh Don <joshdon@google.com>
> 
> This adds the API to set/get the cookie for a given cgroup. This
> interface lives at cgroup/cpu.core_tag.
> 
> The cgroup interface can be used to toggle a unique cookie value for all
> descendant tasks, preventing these tasks from sharing with any others.
> See Documentation/admin-guide/hw-vuln/core-scheduling.rst for a full
> rundown of both this and the per-task API.

I refuse to read RST. Life's too short for that.

> +u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
> +			  struct cftype *cft)
> +{
> +	return !!css_tg(css)->core_tagged;
> +}
> +
> +int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
> +			   u64 val)
> +{
> +	static DEFINE_MUTEX(sched_core_group_mutex);
> +	struct task_group *tg = css_tg(css);
> +	struct cgroup_subsys_state *css_tmp;
> +	struct task_struct *p;
> +	unsigned long group_cookie;
> +	int ret = 0;
> +
> +	if (val > 1)
> +		return -ERANGE;
> +
> +	if (!static_branch_likely(&sched_smt_present))
> +		return -EINVAL;
> +
> +	mutex_lock(&sched_core_group_mutex);
> +
> +	if (!tg->core_tagged && val) {
> +		/* Tag is being set. Check ancestors and descendants. */
> +		if (cpu_core_get_group_cookie(tg) ||
> +		    cpu_core_check_descendants(tg, true /* tag */)) {
> +			ret = -EBUSY;
> +			goto out_unlock;
> +		}

So the desired semantics is to only allow a single tag on any upwards
path? Isn't that in conflict with the cgroup requirements?

TJ?

> +	} else if (tg->core_tagged && !val) {
> +		/* Tag is being reset. Check descendants. */
> +		if (cpu_core_check_descendants(tg, true /* tag */)) {

I'm struggling to understand this. If, per the above, you cannot set
when either a parent is already set or a child is set, then how can a
child be set to refuse clearing?

> +			ret = -EBUSY;
> +			goto out_unlock;
> +		}
> +	} else {
> +		goto out_unlock;
> +	}




* Re: [PATCH resend 5/8] sched: cgroup cookie API for core scheduling
  2021-03-30  9:23   ` Peter Zijlstra
@ 2021-03-30  9:26     ` Peter Zijlstra
  2021-03-30 21:19       ` Josh Don
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2021-03-30  9:26 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	dfaggioli, rostedt, benbjiang, Alexandre Chartre,
	James.Bottomley, OWeisse, Dhaval Giani, chris.hyser, Josh Don,
	Hao Luo, Tom Lendacky, dhiatt, Tejun Heo


*sigh*, +tj

On Tue, Mar 30, 2021 at 11:23:10AM +0200, Peter Zijlstra wrote:
> On Wed, Mar 24, 2021 at 05:40:17PM -0400, Joel Fernandes (Google) wrote:
> > From: Josh Don <joshdon@google.com>
> > 
> > This adds the API to set/get the cookie for a given cgroup. This
> > interface lives at cgroup/cpu.core_tag.
> > 
> > The cgroup interface can be used to toggle a unique cookie value for all
> > descendant tasks, preventing these tasks from sharing with any others.
> > See Documentation/admin-guide/hw-vuln/core-scheduling.rst for a full
> > rundown of both this and the per-task API.
> 
> I refuse to read RST. Life's too short for that.
> 
> > +u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
> > +			  struct cftype *cft)
> > +{
> > +	return !!css_tg(css)->core_tagged;
> > +}
> > +
> > +int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
> > +			   u64 val)
> > +{
> > +	static DEFINE_MUTEX(sched_core_group_mutex);
> > +	struct task_group *tg = css_tg(css);
> > +	struct cgroup_subsys_state *css_tmp;
> > +	struct task_struct *p;
> > +	unsigned long group_cookie;
> > +	int ret = 0;
> > +
> > +	if (val > 1)
> > +		return -ERANGE;
> > +
> > +	if (!static_branch_likely(&sched_smt_present))
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&sched_core_group_mutex);
> > +
> > +	if (!tg->core_tagged && val) {
> > +		/* Tag is being set. Check ancestors and descendants. */
> > +		if (cpu_core_get_group_cookie(tg) ||
> > +		    cpu_core_check_descendants(tg, true /* tag */)) {
> > +			ret = -EBUSY;
> > +			goto out_unlock;
> > +		}
> 
> So the desired semantics is to only allow a single tag on any upwards
> path? Isn't that in conflict with the cgroup requirements?
> 
> TJ?
> 
> > +	} else if (tg->core_tagged && !val) {
> > +		/* Tag is being reset. Check descendants. */
> > +		if (cpu_core_check_descendants(tg, true /* tag */)) {
> 
> I'm struggling to understand this. If, per the above, you cannot set
> when either a parent is already set or a child is set, then how can a
> child be set to refuse clearing?
> 
> > +			ret = -EBUSY;
> > +			goto out_unlock;
> > +		}
> > +	} else {
> > +		goto out_unlock;
> > +	}
> 
> 


* Re: [PATCH resend 5/8] sched: cgroup cookie API for core scheduling
  2021-03-30  9:26     ` Peter Zijlstra
@ 2021-03-30 21:19       ` Josh Don
  0 siblings, 0 replies; 18+ messages in thread
From: Josh Don @ 2021-03-30 21:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, Ingo Molnar,
	torvalds, fweisbec, Kees Cook, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, Steven Rostedt, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Hyser,Chris, Hao Luo, Tom Lendacky, dhiatt, Tejun Heo

On Tue, Mar 30, 2021 at 2:29 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Mar 24, 2021 at 05:40:17PM -0400, Joel Fernandes (Google) wrote:
> > > +
> > > +   if (!tg->core_tagged && val) {
> > > +           /* Tag is being set. Check ancestors and descendants. */
> > > +           if (cpu_core_get_group_cookie(tg) ||
> > > +               cpu_core_check_descendants(tg, true /* tag */)) {
> > > +                   ret = -EBUSY;
> > > +                   goto out_unlock;
> > > +           }
> >
> > So the desired semantics is to only allow a single tag on any upwards
> > path? Isn't that in conflict with the cgroup requirements?
> >
> > TJ?

I carried this requirement over from the previous iteration, but I
don't see a reason why we can't just dump this and have each task use
the group cookie of its closest tagged ancestor. Joel, is there any
context here I'm missing?

FWIW I also just realized that cpu_core_check_descendants() is busted
as it recurses only on one child.

> > > +   } else if (tg->core_tagged && !val) {
> > > +           /* Tag is being reset. Check descendants. */
> > > +           if (cpu_core_check_descendants(tg, true /* tag */)) {
> >
> > I'm struggling to understand this. If, per the above, you cannot set
> > when either a parent is already set or a child is set, then how can a
> > child be set to refuse clearing?

Yes this is superfluous with the above semantics.


* Re: [PATCH resend 2/8] sched: core scheduling tagging infrastructure
  2021-03-29  9:55       ` Peter Zijlstra
@ 2021-03-30 21:29         ` Josh Don
  2021-03-31  7:11           ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Josh Don @ 2021-03-30 21:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, Ingo Molnar,
	torvalds, fweisbec, Kees Cook, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, Steven Rostedt, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Hyser,Chris, Hao Luo, Tom Lendacky, dhiatt

On Mon, Mar 29, 2021 at 2:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> OK, fixed the fails. My tired head made it unconditionally return the
> cookie-id of 'current' instead of task. Pushed out an update.

I see you have the per-task and prctl stuff pulled into your tree. I
can rebase the compound cookie and cgroup api patches on top if you'd
like; not sure if you've already re-ordered it locally. Any other
comments on the former?

> > > Also, we really need a better name than coretag.c.
> >
> > Yea, we don't really otherwise use the phrase "tagging". core_sched.c
> > is probably too confusing given we have sched/core.c.
>
> Right, so I tried core_sched and my fingers already hate it as much as
> kernel/scftorture.c (which I'd assumed my fingers would get used to
> eventually, but noooo).
>
> Looking at kernel/sched/ C is very overrepresented, so we really don't
> want another I think. B, E, G, H, J, K, N, seem to still be available in
> the first half of the alphabet. Maybe, bonghits.c, gabbleduck.c ?

hardware_vuln.c? Tricky to avoid a C with cpu, core, and cookie :)


* Re: [PATCH resend 2/8] sched: core scheduling tagging infrastructure
  2021-03-30 21:29         ` Josh Don
@ 2021-03-31  7:11           ` Peter Zijlstra
  2021-04-01 13:46             ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2021-03-31  7:11 UTC (permalink / raw)
  To: Josh Don
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, Ingo Molnar,
	torvalds, fweisbec, Kees Cook, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, Steven Rostedt, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Hyser,Chris, Hao Luo, Tom Lendacky, dhiatt

On Tue, Mar 30, 2021 at 02:29:06PM -0700, Josh Don wrote:
> On Mon, Mar 29, 2021 at 2:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > OK, fixed the fails. My tired head made it unconditionally return the
> > cookie-id of 'current' instead of task. Pushed out an update.
> 
> I see you have the per-task and prctl stuff pulled into your tree. I
> can rebase the compound cookie and cgroup api patches on top if you'd
> like; not sure if you've already re-ordered it locally. Any other
> comments on the former?

Hold off on that for a little while; I've been grubbing through the
cgroup code as well, just haven't had anything that actually works yet.
I'll hopefully have something soon (I really want to quickly forget all
the cgroup details again).


* Re: [PATCH resend 2/8] sched: core scheduling tagging infrastructure
  2021-03-31  7:11           ` Peter Zijlstra
@ 2021-04-01 13:46             ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2021-04-01 13:46 UTC (permalink / raw)
  To: Josh Don
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, Ingo Molnar,
	torvalds, fweisbec, Kees Cook, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, Steven Rostedt, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Hyser,Chris, Hao Luo, Tom Lendacky, dhiatt

On Wed, Mar 31, 2021 at 09:11:27AM +0200, Peter Zijlstra wrote:
> On Tue, Mar 30, 2021 at 02:29:06PM -0700, Josh Don wrote:
> > On Mon, Mar 29, 2021 at 2:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > OK, fixed the fails. My tired head made it unconditionally return the
> > > cookie-id of 'current' instead of task. Pushed out an update.
> > 
> > I see you have the per-task and prctl stuff pulled into your tree. I
> > can rebase the compound cookie and cgroup api patches on top if you'd
> > like; not sure if you've already re-ordered it locally. Any other
> > comments on the former?
> 
> Hold off on that for a little while; I've been grubbing through the
> cgroup code as well, just haven't had anything that actually works yet.
> I'll hopefully have something soon (I really want to quickly forget all
> the cgroup details again).

With a significantly trimmed Cc list:

https://lkml.kernel.org/r/20210401131012.395311786@infradead.org



end of thread, other threads:[~2021-04-01 18:43 UTC | newest]

Thread overview: 18+ messages
2021-03-24 21:40 [PATCH resend 0/8] Core sched remaining patches rebased Joel Fernandes (Google)
2021-03-24 21:40 ` [PATCH resend 1/8] sched: migration changes for core scheduling Joel Fernandes (Google)
2021-03-24 21:40 ` [PATCH resend 2/8] sched: core scheduling tagging infrastructure Joel Fernandes (Google)
2021-03-27  0:09   ` Peter Zijlstra
2021-03-27  3:19     ` Josh Don
2021-03-29  9:55       ` Peter Zijlstra
2021-03-30 21:29         ` Josh Don
2021-03-31  7:11           ` Peter Zijlstra
2021-04-01 13:46             ` Peter Zijlstra
2021-03-24 21:40 ` [PATCH resend 3/8] sched: prctl() cookie manipulation for core scheduling Joel Fernandes (Google)
2021-03-24 21:40 ` [PATCH resend 4/8] kselftest: Add test for core sched prctl interface Joel Fernandes (Google)
2021-03-24 21:40 ` [PATCH resend 5/8] sched: cgroup cookie API for core scheduling Joel Fernandes (Google)
2021-03-30  9:23   ` Peter Zijlstra
2021-03-30  9:26     ` Peter Zijlstra
2021-03-30 21:19       ` Josh Don
2021-03-24 21:40 ` [PATCH resend 6/8] kselftest: Add tests for core-sched interface Joel Fernandes (Google)
2021-03-24 21:40 ` [PATCH resend 7/8] Documentation: Add core scheduling documentation Joel Fernandes (Google)
2021-03-24 21:40 ` [PATCH resend 8/8] sched: Debug bits Joel Fernandes (Google)
