linux-kernel.vger.kernel.org archive mirror
* [PATCH v10 0/5] Core scheduling remaining patches
@ 2021-01-23  1:16 Joel Fernandes (Google)
  2021-01-23  1:17 ` [PATCH v10 1/5] sched: migration changes for core scheduling Joel Fernandes (Google)
                   ` (4 more replies)
  0 siblings, 5 replies; 34+ messages in thread
From: Joel Fernandes (Google) @ 2021-01-23  1:16 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky

Core-Scheduling
===============
Enclosed is series v10 of core scheduling.
Many of the core patches in v9 were picked up in Peter's queue tree. This
series contains the remaining patches (interface, docs).

v9 series is here: https://lwn.net/Articles/837595/

Introduction of feature
=======================
Core scheduling is a feature that allows only trusted tasks to run
concurrently on CPUs sharing compute resources (e.g., hyperthreads on a
core). The goal is to mitigate core-level side-channel attacks
without requiring SMT to be disabled (which has a significant impact on
performance in some situations). Core scheduling (as of v7) mitigates
user-space to user-space attacks and user-to-kernel attacks when one of
the siblings enters the kernel via an interrupt or system call.

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the
same core (for now by having them in the same tagged cgroup). When a tag
is enabled in a cgroup and a task from that cgroup is running on a
hardware thread, the scheduler ensures that only idle or trusted tasks
run on the other sibling(s). Besides security concerns, this feature can
also be beneficial for RT and performance applications where we want to
control how tasks make use of SMT dynamically.
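
For illustration only (not part of this series' patches; the cpu.core_tag file
comes from the tagging patch below, while the mount point and group name here
are hypothetical), tagging a cgroup from userspace is a single write:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Tag a cpu-controller cgroup so that all of its tasks share a
	 * core-scheduling cookie (path and group name are made up). */
	int fd = open("/sys/fs/cgroup/cpu/mygroup/cpu.core_tag", O_WRONLY);

	if (fd < 0 || write(fd, "1", 1) != 1) {
		perror("tagging cgroup failed");
		return 1;
	}
	close(fd);
	return 0;
}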

Both a CGroup interface and a per-task interface via prctl(2) are provided for
configuring core sharing. More details are provided in the documentation patch.
Kselftests are provided to verify the correctness/rules of the interface.
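
As a minimal sketch of the per-task prctl(2) interface proposed here (the
PR_SCHED_CORE_SHARE values mirror the prctl patch in this series and are not a
mainline ABI; the target pid is only an example argument):

#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SCHED_CORE_SHARE
#define PR_SCHED_CORE_SHARE		59
# define PR_SCHED_CORE_CLEAR		0
# define PR_SCHED_CORE_SHARE_FROM	1
# define PR_SCHED_CORE_SHARE_TO		2
#endif

int main(int argc, char **argv)
{
	pid_t pid = (argc > 1) ? atoi(argv[1]) : getpid();

	/* Push the calling task's cookie to 'pid'... */
	if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, pid, 0, 0))
		perror("PR_SCHED_CORE_SHARE_TO");

	/* ...and later clear 'pid's cookie again. */
	if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, pid, 0, 0))
		perror("PR_SCHED_CORE_CLEAR");

	return 0;
}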

Testing
=======
ChromeOS testing shows a 3x improvement in key-press latency in a Google Docs
typing test run concurrently with a Google Hangouts call (the maximum latency
drops from 150ms to 50ms for keypresses).

Vineeth tested sysbench on the v9 series and did not see any regressions.
Hong and Aubrey tested v9 and saw results similar to v8. There is a known
regression with uperf, which appears to be caused by ksoftirqd heavily
contending with other tasks on the core. The consensus is that this can be
improved in the future.

Changes in v10
==============
- Migration code changes from Aubrey.
- Interface changes from Josh and Chris.
- Dropped the kernel protection changes for now (they need to be reworked for
  mainline).

Changes in v9
=============
- Note that the vruntime snapshot change is split into two patches to show the
  progression of the idea and prevent merge conflicts:
    sched/fair: Snapshot the min_vruntime of CPUs on force idle
    sched: Improve snapshotting of min_vruntime for CGroups
  Same with the RT priority inversion change:
    sched: Fix priority inversion of cookied task with sibling
    sched: Improve snapshotting of min_vruntime for CGroups
- Disable coresched on certain AMD HW.

Changes in v8
=============
- New interface/API implementation
  - Joel
- Revised kernel protection patch
  - Joel
- Revised Hotplug fixes
  - Joel
- Minor bug fixes and address review comments
  - Vineeth

Changes in v7
=============
- Kernel protection from untrusted usermode tasks
  - Joel, Vineeth
- Fix for hotplug crashes and hangs
  - Joel, Vineeth

Changes in v6
=============
- Documentation
  - Joel
- Pause siblings on entering nmi/irq/softirq
  - Joel, Vineeth
- Fix for RCU crash
  - Joel
- Fix for a crash in pick_next_task
  - Yu Chen, Vineeth
- Minor re-write of core-wide vruntime comparison
  - Aaron Lu
- Cleanup: Address Review comments
- Cleanup: Remove hotplug support (for now)
- Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
  - Joel, Vineeth

Changes in v5
=============
- Fixes for cgroup/process tagging during corner cases like cgroup
  destroy, task moving across cgroups etc
  - Tim Chen
- Coresched aware task migrations
  - Aubrey Li
- Other minor stability fixes.

Changes in v4
=============
- Implement a core wide min_vruntime for vruntime comparison of tasks
  across cpus in a core.
  - Aaron Lu
- Fixes a typo bug in setting the forced_idle cpu.
  - Aaron Lu

Changes in v3
=============
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
=============
- Fixes for a couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for processes on different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

Future work
===========
- Load balancing/Migration fixes for core scheduling.
  As of v6, load balancing is partially coresched aware, but has some
  issues w.r.t. process/taskgroup weights:
  https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...

Aubrey Li (1):
sched: migration changes for core scheduling

Joel Fernandes (Google) (3):
kselftest: Add tests for core-sched interface
Documentation: Add core scheduling documentation
sched: Debug bits...

Peter Zijlstra (1):
sched: CGroup tagging interface for core scheduling

.../admin-guide/hw-vuln/core-scheduling.rst   | 263 +++++++
Documentation/admin-guide/hw-vuln/index.rst   |   1 +
include/linux/sched.h                         |  10 +
include/uapi/linux/prctl.h                    |   6 +
kernel/fork.c                                 |   1 +
kernel/sched/Makefile                         |   1 +
kernel/sched/core.c                           | 171 ++++-
kernel/sched/coretag.c                        | 669 ++++++++++++++++
kernel/sched/debug.c                          |   4 +
kernel/sched/fair.c                           |  42 +-
kernel/sched/sched.h                          | 130 +++-
kernel/sys.c                                  |   7 +
tools/include/uapi/linux/prctl.h              |   6 +
tools/testing/selftests/sched/.gitignore      |   1 +
tools/testing/selftests/sched/Makefile        |  14 +
tools/testing/selftests/sched/config          |   1 +
.../testing/selftests/sched/test_coresched.c  | 716 ++++++++++++++++++
17 files changed, 2018 insertions(+), 25 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
create mode 100644 kernel/sched/coretag.c
create mode 100644 tools/testing/selftests/sched/.gitignore
create mode 100644 tools/testing/selftests/sched/Makefile
create mode 100644 tools/testing/selftests/sched/config
create mode 100644 tools/testing/selftests/sched/test_coresched.c

--
2.30.0.280.ga3ce27912f-goog



* [PATCH v10 1/5] sched: migration changes for core scheduling
  2021-01-23  1:16 [PATCH v10 0/5] Core scheduling remaining patches Joel Fernandes (Google)
@ 2021-01-23  1:17 ` Joel Fernandes (Google)
  2021-01-23  1:17 ` [PATCH v10 2/5] sched: CGroup tagging interface " Joel Fernandes (Google)
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Joel Fernandes (Google) @ 2021-01-23  1:17 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Aubrey Li

From: Aubrey Li <aubrey.li@intel.com>

 - Don't migrate if there is a cookie mismatch
     Load balancing tries to move a task from the busiest CPU to the
     destination CPU. When core scheduling is enabled, if the
     task's cookie does not match the destination CPU's
     core cookie, the task is skipped by this CPU. This
     mitigates the forced idle time on the destination CPU.

 - Select a cookie-matched idle CPU
     In the fast path of task wakeup, select the first cookie-matched
     idle CPU instead of the first idle CPU.

 - Find the cookie-matched idlest CPU
     In the slow path of task wakeup, find the idlest CPU whose core
     cookie matches the task's cookie.

 - Don't migrate a task if its cookie does not match
     For NUMA load balancing, don't migrate a task to a CPU whose
     core cookie does not match the task's cookie.

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c  | 33 +++++++++++++++++---
 kernel/sched/sched.h | 72 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 101 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7f90765f7fd..fddd7c44bbf3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
 			continue;
 
+		/*
+		 * Skip this cpu if source task's cookie does not match
+		 * with CPU's core cookie.
+		 */
+		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+			continue;
+
 		env->dst_cpu = cpu;
 		if (task_numa_compare(env, taskimp, groupimp, maymove))
 			break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		struct rq *rq = cpu_rq(i);
+
+		if (!sched_core_cookie_match(rq, p))
+			continue;
+
 		if (sched_idle_cpu(i))
 			return i;
 
 		if (available_idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
@@ -6129,7 +6140,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
 			return -1;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+
+		if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+		    sched_cpu_cookie_match(cpu_rq(cpu), p))
 			break;
 	}
 
@@ -7530,8 +7543,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
 	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 3) task's cookie does not match with this CPU's core cookie
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
@@ -7566,6 +7580,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+	/*
+	 * Don't migrate task if the task's cookie does not match
+	 * with the destination CPU's core cookie.
+	 */
+	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+		return 0;
+
 	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8813,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 					p->cpus_ptr))
 			continue;
 
+		/* Skip over this group if no cookie matched */
+		if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+			continue;
+
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3efcbc779a75..d6efb1ffc08c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1122,6 +1122,7 @@ static inline bool is_migration_disabled(struct task_struct *p)
 
 #ifdef CONFIG_SCHED_CORE
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+static inline struct cpumask *sched_group_span(struct sched_group *sg);
 
 static inline bool sched_core_enabled(struct rq *rq)
 {
@@ -1138,6 +1139,61 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
+/*
+ * Helpers to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	return rq->core->core_cookie == p->core_cookie;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	bool idle_core = true;
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+		if (!available_idle_cpu(cpu)) {
+			idle_core = false;
+			break;
+		}
+	}
+
+	/*
+	 * A CPU in an idle core is always the best choice for tasks with
+	 * cookies.
+	 */
+	return idle_core || rq->core->core_cookie == p->core_cookie;
+}
+
+static inline bool sched_group_cookie_match(struct rq *rq,
+					    struct task_struct *p,
+					    struct sched_group *group)
+{
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu_and(cpu, sched_group_span(group), p->cpus_ptr) {
+		if (sched_core_cookie_match(rq, p))
+			return true;
+	}
+	return false;
+}
+
 extern void queue_core_balance(struct rq *rq);
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
@@ -1158,6 +1214,22 @@ static inline void queue_core_balance(struct rq *rq)
 {
 }
 
+static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
+
+static inline bool sched_group_cookie_match(struct rq *rq,
+					    struct task_struct *p,
+					    struct sched_group *group)
+{
+	return true;
+}
 #endif /* CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.30.0.280.ga3ce27912f-goog



* [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-01-23  1:16 [PATCH v10 0/5] Core scheduling remaining patches Joel Fernandes (Google)
  2021-01-23  1:17 ` [PATCH v10 1/5] sched: migration changes for core scheduling Joel Fernandes (Google)
@ 2021-01-23  1:17 ` Joel Fernandes (Google)
  2021-02-03 16:51   ` Peter Zijlstra
                     ` (9 more replies)
  2021-01-23  1:17 ` [PATCH v10 3/5] kselftest: Add tests for core-sched interface Joel Fernandes (Google)
                   ` (2 subsequent siblings)
  4 siblings, 10 replies; 34+ messages in thread
From: Joel Fernandes (Google) @ 2021-01-23  1:17 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky

From: Peter Zijlstra <peterz@infradead.org>

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly, the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also, after a task is forked, its presence in the core scheduler queue will
need to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent a
new task from sneaking into the cgroup and being missed by the update
while we iterate through all the tasks in the cgroup.  A more complicated
scheme could probably avoid the stop machine.  Such a scheme would also
need to resolve inconsistency between a task's cgroup core-scheduling
tag and its residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now to avoid
such complications.

Core scheduling has extra overhead.  Enable it only on cores with
more than one SMT hardware thread.

Co-developed-by: Josh Don <joshdon@google.com>
Co-developed-by: Chris Hyser <chris.hyser@oracle.com>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched.h            |  10 +
 include/uapi/linux/prctl.h       |   6 +
 kernel/fork.c                    |   1 +
 kernel/sched/Makefile            |   1 +
 kernel/sched/core.c              | 136 ++++++-
 kernel/sched/coretag.c           | 669 +++++++++++++++++++++++++++++++
 kernel/sched/debug.c             |   4 +
 kernel/sched/sched.h             |  58 ++-
 kernel/sys.c                     |   7 +
 tools/include/uapi/linux/prctl.h |   6 +
 10 files changed, 878 insertions(+), 20 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7efce9c9d9cf..7ca6f2f72cda 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,8 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
 	struct rb_node			core_node;
 	unsigned long			core_cookie;
+	unsigned long			core_task_cookie;
+	unsigned long			core_group_cookie;
 	unsigned int			core_occupation;
 #endif
 
@@ -2076,4 +2078,12 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+int sched_core_share_pid(unsigned long flags, pid_t pid);
+void sched_tsk_free(struct task_struct *tsk);
+#else
+#define sched_core_share_pid(flags, pid) do { } while (0)
+#define sched_tsk_free(tsk) do { } while (0)
+#endif
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..f8e4e9626121 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,10 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE		59
+# define PR_SCHED_CORE_CLEAR		0  /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_SHARE_FROM	1  /* get core_sched cookie from pid */
+# define PR_SCHED_CORE_SHARE_TO		2  /* push core_sched cookie to pid */
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7199d359690c..5468c93829c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
 	exit_creds(tsk);
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
+	sched_tsk_free(tsk);
 
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20125431af87..a3844e2e7379 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -136,7 +136,33 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
 	return false;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+static bool sched_core_empty(struct rq *rq)
+{
+	return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+	struct task_struct *task;
+
+	task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+	return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *task;
+
+	while (!sched_core_empty(rq)) {
+		task = sched_core_first(rq);
+		rb_erase(&task->core_node, &rq->core_tree);
+		RB_CLEAR_NODE(&task->core_node);
+	}
+	rq->core->core_task_seq++;
+}
+
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct rb_node *parent, **node;
 	struct task_struct *node_task;
@@ -163,14 +189,15 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 	rb_insert_color(&p->core_node, &rq->core_tree);
 }
 
-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
+	if (!sched_core_enqueued(p))
 		return;
 
 	rb_erase(&p->core_node, &rq->core_tree);
+	RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
@@ -234,8 +261,24 @@ static int __sched_core_stopper(void *data)
 	bool enabled = !!(unsigned long)data;
 	int cpu;
 
-	for_each_possible_cpu(cpu)
-		cpu_rq(cpu)->core_enabled = enabled;
+	for_each_possible_cpu(cpu) {
+		struct rq *rq = cpu_rq(cpu);
+
+		WARN_ON_ONCE(enabled == rq->core_enabled);
+
+		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+			/*
+			 * All active and migrating tasks will have already
+			 * been removed from core queue when we clear the
+			 * cgroup tags. However, dying tasks could still be
+			 * left in core queue. Flush them here.
+			 */
+			if (!enabled)
+				sched_core_flush(cpu);
+
+			rq->core_enabled = enabled;
+		}
+	}
 
 	return 0;
 }
@@ -245,7 +288,11 @@ static int sched_core_count;
 
 static void __sched_core_enable(void)
 {
-	// XXX verify there are no cookie tasks (yet)
+	int cpu;
+
+	/* verify there are no cookie tasks (yet) */
+	for_each_online_cpu(cpu)
+		BUG_ON(!sched_core_empty(cpu_rq(cpu)));
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -253,8 +300,6 @@ static void __sched_core_enable(void)
 
 static void __sched_core_disable(void)
 {
-	// XXX verify there are no cookie tasks (left)
-
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
 }
@@ -274,12 +319,6 @@ void sched_core_put(void)
 		__sched_core_disable();
 	mutex_unlock(&sched_core_mutex);
 }
-
-#else /* !CONFIG_SCHED_CORE */
-
-static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
-static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
-
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -3882,6 +3921,7 @@ static inline void init_schedstats(void) {}
 int sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	unsigned long flags;
+	int __maybe_unused ret;
 
 	__sched_fork(clone_flags, p);
 	/*
@@ -3957,6 +3997,13 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&p->core_node);
+
+	ret = sched_core_fork(p, clone_flags);
+	if (ret)
+		return ret;
 #endif
 	return 0;
 }
@@ -7738,6 +7785,9 @@ void init_idle(struct task_struct *idle, int cpu)
 #ifdef CONFIG_SMP
 	sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
 #endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&idle->core_node);
+#endif
 }
 
 #ifdef CONFIG_SMP
@@ -8742,6 +8792,9 @@ void sched_offline_group(struct task_group *tg)
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
+void cpu_core_get_group_cookie(struct task_group *tg,
+			       unsigned long *group_cookie_ptr);
+
 static void sched_change_group(struct task_struct *tsk, int type)
 {
 	struct task_group *tg;
@@ -8754,6 +8807,11 @@ static void sched_change_group(struct task_struct *tsk, int type)
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+	sched_core_change_group(tsk, tg);
+#endif
+
 	tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -8806,11 +8864,6 @@ void sched_move_task(struct task_struct *tsk)
 	task_rq_unlock(rq, tsk, &rf);
 }
 
-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
-	return css ? container_of(css, struct task_group, css) : NULL;
-}
-
 static struct cgroup_subsys_state *
 cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -8846,6 +8899,18 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+#ifdef CONFIG_SCHED_CORE
+	struct task_group *tg = css_tg(css);
+
+	if (tg->core_tagged) {
+		sched_core_put();
+		tg->core_tagged = 0;
+	}
+#endif
+}
+
 static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
@@ -9447,6 +9512,22 @@ static struct cftype cpu_legacy_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "core_tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
+#ifdef CONFIG_SCHED_DEBUG
+	/* Read the group cookie. */
+	{
+		.name = "core_group_cookie",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_group_cookie_read_u64,
+	},
+#endif
+#endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	{
 		.name = "uclamp.min",
@@ -9620,6 +9701,22 @@ static struct cftype cpu_files[] = {
 		.write_s64 = cpu_weight_nice_write_s64,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "core_tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
+#ifdef CONFIG_SCHED_DEBUG
+	/* Read the group cookie. */
+	{
+		.name = "core_group_cookie",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_group_cookie_read_u64,
+	},
+#endif
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
 		.name = "max",
@@ -9648,6 +9745,7 @@ static struct cftype cpu_files[] = {
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_online	= cpu_cgroup_css_online,
+	.css_offline	= cpu_cgroup_css_offline,
 	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
 	.css_extra_stat_show = cpu_extra_stat_show,
diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
new file mode 100644
index 000000000000..207fbaac5897
--- /dev/null
+++ b/kernel/sched/coretag.c
@@ -0,0 +1,669 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kernel/sched/coretag.c
+ *
+ * Core-scheduling tagging interface support.
+ *
+ * Copyright(C) 2020, Joel Fernandes.
+ * Initial interfacing code by Peter Zijlstra.
+ */
+
+#include <linux/prctl.h>
+#include "sched.h"
+
+/*
+ * Wrapper representing a complete cookie. The address of the cookie is used as
+ * a unique identifier. Each cookie has a unique permutation of the internal
+ * cookie fields.
+ */
+struct sched_core_cookie {
+	unsigned long task_cookie;
+	unsigned long group_cookie;
+
+	struct rb_node node;
+	refcount_t refcnt;
+};
+
+/*
+ * A simple wrapper around refcount. An allocated sched_core_task_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_task_cookie {
+	refcount_t refcnt;
+	struct work_struct work; /* to free in WQ context. */
+};
+
+/* All active sched_core_cookies */
+static struct rb_root sched_core_cookies = RB_ROOT;
+static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
+
+static void sched_core_cookie_init_from_task(struct sched_core_cookie *cookie,
+					     struct task_struct *p)
+{
+	cookie->task_cookie = p->core_task_cookie;
+	cookie->group_cookie = p->core_group_cookie;
+}
+
+/*
+ * Returns the following:
+ * a < b  => -1
+ * a == b => 0
+ * a > b  => 1
+ */
+static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
+				 const struct sched_core_cookie *b)
+{
+#define COOKIE_CMP_RETURN(field) do {		\
+	if (a->field < b->field)		\
+		return -1;			\
+	else if (a->field > b->field)		\
+		return 1;			\
+} while (0)					\
+
+	COOKIE_CMP_RETURN(task_cookie);
+	COOKIE_CMP_RETURN(group_cookie);
+
+	/* all cookie fields match */
+	return 0;
+
+#undef COOKIE_CMP_RETURN
+}
+
+static inline void __sched_core_erase_cookie(struct sched_core_cookie *cookie)
+{
+	lockdep_assert_held(&sched_core_cookies_lock);
+
+	/* Already removed */
+	if (RB_EMPTY_NODE(&cookie->node))
+		return;
+
+	rb_erase(&cookie->node, &sched_core_cookies);
+	RB_CLEAR_NODE(&cookie->node);
+}
+
+/* Called when a task no longer points to the cookie in question */
+static void sched_core_put_cookie(struct sched_core_cookie *cookie)
+{
+	unsigned long flags;
+
+	if (!cookie)
+		return;
+
+	if (refcount_dec_and_test(&cookie->refcnt)) {
+		raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
+		__sched_core_erase_cookie(cookie);
+		raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
+		kfree(cookie);
+	}
+}
+
+/*
+ * A task's core cookie is a compound structure composed of various cookie
+ * fields (task_cookie, group_cookie). The overall core_cookie is
+ * a pointer to a struct containing those values. This function either finds
+ * an existing core_cookie or creates a new one, and then updates the task's
+ * core_cookie to point to it. Additionally, it handles the necessary reference
+ * counting.
+ */
+static void __sched_core_update_cookie(struct task_struct *p)
+{
+	struct rb_node *parent, **node;
+	struct sched_core_cookie *node_core_cookie, *match;
+	static const struct sched_core_cookie zero_cookie;
+	struct sched_core_cookie requested_cookie;
+	bool is_zero_cookie;
+	struct sched_core_cookie *const curr_cookie =
+		(struct sched_core_cookie *)p->core_cookie;
+
+	/*
+	 * Ensures that we do not cause races/corruption by modifying/reading
+	 * task cookie fields. Also ensures task cannot get migrated.
+	 */
+	lockdep_assert_held(rq_lockp(task_rq(p)));
+
+	sched_core_cookie_init_from_task(&requested_cookie, p);
+
+	is_zero_cookie = !sched_core_cookie_cmp(&requested_cookie, &zero_cookie);
+
+	/*
+	 * Already have a cookie matching the requested settings? Nothing to
+	 * do.
+	 */
+	if ((curr_cookie && !sched_core_cookie_cmp(curr_cookie, &requested_cookie)) ||
+	    (!curr_cookie && is_zero_cookie))
+		return;
+
+	raw_spin_lock(&sched_core_cookies_lock);
+
+	if (is_zero_cookie) {
+		match = NULL;
+		goto finish;
+	}
+
+retry:
+	match = NULL;
+
+	node = &sched_core_cookies.rb_node;
+	parent = *node;
+	while (*node) {
+		int cmp;
+
+		node_core_cookie =
+			container_of(*node, struct sched_core_cookie, node);
+		parent = *node;
+
+		cmp = sched_core_cookie_cmp(&requested_cookie, node_core_cookie);
+		if (cmp < 0) {
+			node = &parent->rb_left;
+		} else if (cmp > 0) {
+			node = &parent->rb_right;
+		} else {
+			match = node_core_cookie;
+			break;
+		}
+	}
+
+	if (!match) {
+		/* No existing cookie; create and insert one */
+		match = kmalloc(sizeof(struct sched_core_cookie), GFP_ATOMIC);
+
+		/* Fall back to zero cookie */
+		if (WARN_ON_ONCE(!match))
+			goto finish;
+
+		*match = requested_cookie;
+		refcount_set(&match->refcnt, 1);
+
+		rb_link_node(&match->node, parent, node);
+		rb_insert_color(&match->node, &sched_core_cookies);
+	} else {
+		/*
+		 * Cookie exists, increment refcnt. If refcnt is currently 0,
+		 * we're racing with a put() (refcnt decremented but cookie not
+		 * yet removed from the tree). In this case, we can simply
+		 * perform the removal ourselves and retry.
+		 * sched_core_put_cookie() will still function correctly.
+		 */
+		if (unlikely(!refcount_inc_not_zero(&match->refcnt))) {
+			__sched_core_erase_cookie(match);
+			goto retry;
+		}
+	}
+
+finish:
+	p->core_cookie = (unsigned long)match;
+
+	raw_spin_unlock(&sched_core_cookies_lock);
+
+	sched_core_put_cookie(curr_cookie);
+}
+
+/*
+ * sched_core_update_cookie - Common helper to update a task's core cookie. This
+ * updates the selected cookie field and then updates the overall cookie.
+ * @p: The task whose cookie should be updated.
+ * @cookie: The new cookie.
+ * @cookie_type: The cookie field to which the cookie corresponds.
+ */
+static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie,
+				     enum sched_core_cookie_type cookie_type)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (!p)
+		return;
+
+	rq = task_rq_lock(p, &rf);
+
+	switch (cookie_type) {
+	case sched_core_task_cookie_type:
+		p->core_task_cookie = cookie;
+		break;
+	case sched_core_group_cookie_type:
+		p->core_group_cookie = cookie;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+	}
+
+	/* Set p->core_cookie, which is the overall cookie */
+	__sched_core_update_cookie(p);
+
+	if (sched_core_enqueued(p)) {
+		sched_core_dequeue(rq, p);
+		if (!p->core_cookie) {
+			task_rq_unlock(rq, p, &rf);
+			return;
+		}
+	}
+
+	if (sched_core_enabled(rq) &&
+	    p->core_cookie && task_on_rq_queued(p))
+		sched_core_enqueue(task_rq(p), p);
+
+	/*
+	 * If task is currently running or waking, it may not be compatible
+	 * anymore after the cookie change, so enter the scheduler on its CPU
+	 * to schedule it away.
+	 */
+	if (task_running(rq, p) || p->state == TASK_WAKING)
+		resched_curr(rq);
+
+	task_rq_unlock(rq, p, &rf);
+}
+
+#ifdef CONFIG_CGROUP_SCHED
+static unsigned long cpu_core_get_group_cookie(struct task_group *tg);
+
+void sched_core_change_group(struct task_struct *p, struct task_group *new_tg)
+{
+	lockdep_assert_held(rq_lockp(task_rq(p)));
+
+	/*
+	 * It is ok if this races with an update to new_tg->core_tagged. Any
+	 * update that occurs after we read the group_cookie here will have to
+	 * perform a cookie update on this task _after_ the update below, due
+	 * to the dependence on task_rq lock.
+	 */
+	p->core_group_cookie = cpu_core_get_group_cookie(new_tg);
+
+	__sched_core_update_cookie(p);
+}
+#endif
+
+/* Per-task interface: Used by fork(2) and prctl(2). */
+static void sched_core_put_cookie_work(struct work_struct *ws);
+
+/* Caller has to call sched_core_get() if non-zero value is returned. */
+static unsigned long sched_core_alloc_task_cookie(void)
+{
+	struct sched_core_task_cookie *ck =
+		kmalloc(sizeof(struct sched_core_task_cookie), GFP_KERNEL);
+
+	if (!ck)
+		return 0;
+	refcount_set(&ck->refcnt, 1);
+	INIT_WORK(&ck->work, sched_core_put_cookie_work);
+
+	return (unsigned long)ck;
+}
+
+static void sched_core_get_task_cookie(unsigned long cookie)
+{
+	struct sched_core_task_cookie *ptr =
+		(struct sched_core_task_cookie *)cookie;
+
+	refcount_inc(&ptr->refcnt);
+}
+
+static void sched_core_put_task_cookie(unsigned long cookie)
+{
+	struct sched_core_task_cookie *ptr =
+		(struct sched_core_task_cookie *)cookie;
+
+	if (refcount_dec_and_test(&ptr->refcnt))
+		kfree(ptr);
+}
+
+static void sched_core_put_cookie_work(struct work_struct *ws)
+{
+	struct sched_core_task_cookie *ck =
+		container_of(ws, struct sched_core_task_cookie, work);
+
+	sched_core_put_task_cookie((unsigned long)ck);
+	sched_core_put();
+}
+
+static inline void sched_core_update_task_cookie(struct task_struct *t,
+						 unsigned long c)
+{
+	sched_core_update_cookie(t, c, sched_core_task_cookie_type);
+}
+
+int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
+{
+	static DEFINE_MUTEX(sched_core_tasks_mutex);
+	unsigned long cookie;
+	int ret = -ENOMEM;
+
+	mutex_lock(&sched_core_tasks_mutex);
+
+	if (!t2) {
+		if (t1->core_task_cookie) {
+			sched_core_put_task_cookie(t1->core_task_cookie);
+			sched_core_update_task_cookie(t1, 0);
+			sched_core_put();
+		}
+	} else if (t1 == t2) {
+		/* Assign a unique per-task cookie solely for t1. */
+		cookie = sched_core_alloc_task_cookie();
+		if (!cookie)
+			goto out_unlock;
+		sched_core_get();
+
+		if (t1->core_task_cookie) {
+			sched_core_put_task_cookie(t1->core_task_cookie);
+			sched_core_put();
+		}
+		sched_core_update_task_cookie(t1, cookie);
+	} else if (!t1->core_task_cookie && !t2->core_task_cookie) {
+		/*
+		 * 		t1		joining		t2
+		 * CASE 1:
+		 * before	0				0
+		 * after	new cookie			new cookie
+		 *
+		 * CASE 2:
+		 * before	X (non-zero)			0
+		 * after	0				0
+		 *
+		 * CASE 3:
+		 * before	0				X (non-zero)
+		 * after	X				X
+		 *
+		 * CASE 4:
+		 * before	Y (non-zero)			X (non-zero)
+		 * after	X				X
+		 */
+
+		/* CASE 1. */
+		cookie = sched_core_alloc_task_cookie();
+		if (!cookie)
+			goto out_unlock;
+		sched_core_get(); /* For the alloc. */
+
+		/* Add another reference for the other task. */
+		sched_core_get_task_cookie(cookie);
+		sched_core_get(); /* For the other task. */
+
+		sched_core_update_task_cookie(t1, cookie);
+		sched_core_update_task_cookie(t2, cookie);
+	} else if (t1->core_task_cookie && !t2->core_task_cookie) {
+		/* CASE 2. */
+		sched_core_put_task_cookie(t1->core_task_cookie);
+		sched_core_update_task_cookie(t1, 0);
+		sched_core_put();
+	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
+		/* CASE 3. */
+		sched_core_get_task_cookie(t2->core_task_cookie);
+		sched_core_get();
+		sched_core_update_task_cookie(t1, t2->core_task_cookie);
+
+	} else {
+		/* CASE 4. */
+		sched_core_get_task_cookie(t2->core_task_cookie);
+		sched_core_get();
+
+		sched_core_put_task_cookie(t1->core_task_cookie);
+		sched_core_update_task_cookie(t1, t2->core_task_cookie);
+		sched_core_put();
+	}
+
+	ret = 0;
+out_unlock:
+	mutex_unlock(&sched_core_tasks_mutex);
+	return ret;
+}
+
+/* Called from prctl interface: PR_SCHED_CORE_SHARE */
+int sched_core_share_pid(unsigned long flags, pid_t pid)
+{
+	struct task_struct *to;
+	struct task_struct *from;
+	struct task_struct *task;
+	int err;
+
+	rcu_read_lock();
+	task = find_task_by_vpid(pid);
+	if (!task) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	get_task_struct(task);
+
+	/*
+	 * Check if this process has the right to modify the specified
+	 * process. Use the regular "ptrace_may_access()" checks.
+	 */
+	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+		rcu_read_unlock();
+		err = -EPERM;
+		goto out;
+	}
+	rcu_read_unlock();
+
+	if (flags == PR_SCHED_CORE_CLEAR) {
+		to = task;
+		from = NULL;
+	} else if (flags == PR_SCHED_CORE_SHARE_TO) {
+		to = task;
+		from = current;
+	} else if (flags == PR_SCHED_CORE_SHARE_FROM) {
+		to = current;
+		from = task;
+	} else {
+		err = -EINVAL;
+		goto out;
+	}
+
+	err = sched_core_share_tasks(to, from);
+out:
+	put_task_struct(task);
+	return err;
+}
+
+/* CGroup core-scheduling interface support. */
+#ifdef CONFIG_CGROUP_SCHED
+/*
+ * Helper to get the group cookie in a hierarchy. Any ancestor can have a
+ * cookie.
+ *
+ * Can race with an update to tg->core_tagged if sched_core_group_mutex is
+ * not held.
+ */
+static unsigned long cpu_core_get_group_cookie(struct task_group *tg)
+{
+	for (; tg; tg = tg->parent) {
+		if (READ_ONCE(tg->core_tagged))
+			return (unsigned long)tg;
+	}
+
+	return 0;
+}
+
+/* Determine if any group in @tg's children are tagged. */
+static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag)
+{
+	struct task_group *child;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		if ((child->core_tagged && check_tag)) {
+			rcu_read_unlock();
+			return true;
+		}
+
+		rcu_read_unlock();
+		return cpu_core_check_descendants(child, check_tag);
+	}
+
+	rcu_read_unlock();
+	return false;
+}
+
+u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
+			  struct cftype *cft)
+{
+	return !!css_tg(css)->core_tagged;
+}
+
+#ifdef CONFIG_SCHED_DEBUG
+u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css,
+				   struct cftype *cft)
+{
+	return cpu_core_get_group_cookie(css_tg(css));
+}
+#endif
+
+int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
+			   u64 val)
+{
+	static DEFINE_MUTEX(sched_core_group_mutex);
+	struct task_group *tg = css_tg(css);
+	struct cgroup_subsys_state *css_tmp;
+	struct task_struct *p;
+	unsigned long cookie;
+	int ret = 0;
+
+	if (val > 1)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	mutex_lock(&sched_core_group_mutex);
+
+	if (!tg->core_tagged && val) {
+		/* Tag is being set. Check ancestors and descendants. */
+		if (cpu_core_get_group_cookie(tg) ||
+		    cpu_core_check_descendants(tg, true /* tag */)) {
+			ret = -EBUSY;
+			goto out_unlock;
+		}
+	} else if (tg->core_tagged && !val) {
+		/* Tag is being reset. Check descendants. */
+		if (cpu_core_check_descendants(tg, true /* tag */)) {
+			ret = -EBUSY;
+			goto out_unlock;
+		}
+	} else {
+		goto out_unlock;
+	}
+
+	if (!!val)
+		sched_core_get();
+
+	tg->core_tagged = val;
+	cookie = cpu_core_get_group_cookie(tg);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(css_tmp, css) {
+		struct css_task_iter it;
+
+		css_task_iter_start(css_tmp, 0, &it);
+		/*
+		 * Note: css_task_iter_next will skip dying tasks.
+		 * There could still be dying tasks left in the core queue
+		 * when we set cgroup tag to 0 when the loop is done below.
+		 */
+		while ((p = css_task_iter_next(&it)))
+			sched_core_update_cookie(p, cookie, sched_core_group_cookie_type);
+
+		css_task_iter_end(&it);
+	}
+	rcu_read_unlock();
+
+	if (!val)
+		sched_core_put();
+
+out_unlock:
+	mutex_unlock(&sched_core_group_mutex);
+	return ret;
+}
+#endif
+
+/* Called from sched_fork() */
+int sched_core_fork(struct task_struct *p, unsigned long clone_flags)
+{
+	/*
+	 * These are ref counted; avoid an uncounted reference.
+	 * If p should have a cookie, it will be set below.
+	 */
+	p->core_task_cookie = 0;
+	p->core_cookie = 0;
+
+	/*
+	 * First, update the new task's per-task cookie.
+	 * If parent is tagged via per-task cookie, tag the child (either with
+	 * the parent's cookie, or a new one).
+	 */
+	if (READ_ONCE(current->core_task_cookie)) {
+		int ret;
+
+		if (clone_flags & CLONE_THREAD) {
+			/* For CLONE_THREAD, share parent's per-task tag. */
+			ret = sched_core_share_tasks(p, p);
+		} else {
+			/* Otherwise, assign a new per-task tag. */
+			ret = sched_core_share_tasks(p, current);
+		}
+
+		if (ret)
+			return ret;
+
+		/* sched_core_share_tasks() should always update p's core_cookie. */
+		WARN_ON_ONCE(!p->core_cookie);
+
+		return 0;
+	}
+
+	/*
+	 * NOTE: This might race with a concurrent cgroup cookie update. That's
+	 * ok; sched_core_change_group() will handle this post-fork, once the
+	 * task is visible.
+	 */
+	if (p->core_group_cookie) {
+		struct sched_core_cookie *parent_cookie;
+		struct sched_core_cookie child_requested_cookie;
+		bool needs_update = false;
+		struct rq_flags rf;
+		struct rq *rq;
+		unsigned long flags;
+
+		/* No locking needed; child is not yet visible */
+		sched_core_cookie_init_from_task(&child_requested_cookie, p);
+
+		/*
+		 * Optimization: try to grab the parent's cookie and increment
+		 * the refcount directly, rather than traverse the RB tree.
+		 *
+		 * Note: sched_core_cookies_lock is less contended than
+		 * rq_lock(current), and is sufficient to protect
+		 * current->core_cookie.
+		 */
+		raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
+		parent_cookie = (struct sched_core_cookie *)current->core_cookie;
+		if (likely(parent_cookie &&
+			   !sched_core_cookie_cmp(&child_requested_cookie, parent_cookie) &&
+			   refcount_inc_not_zero(&parent_cookie->refcnt))) {
+			p->core_cookie = (unsigned long)parent_cookie;
+		} else {
+			needs_update = true; /* raced */
+		}
+		raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
+
+		if (needs_update) {
+			rq = task_rq_lock(p, &rf);
+			__sched_core_update_cookie(p);
+			task_rq_unlock(rq, p, &rf);
+			WARN_ON_ONCE(!p->core_cookie);
+		}
+	}
+
+	return 0;
+}
+
+void sched_tsk_free(struct task_struct *tsk)
+{
+	struct sched_core_task_cookie *ck;
+
+	sched_core_put_cookie((struct sched_core_cookie *)tsk->core_cookie);
+
+	if (!tsk->core_task_cookie)
+		return;
+
+	ck = (struct sched_core_task_cookie *)tsk->core_task_cookie;
+	queue_work(system_wq, &ck->work);
+}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 60a922d3f46f..8c452b8010ad 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1024,6 +1024,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		__PS("clock-delta", t1-t0);
 	}
 
+#ifdef CONFIG_SCHED_CORE
+	__PS("core_cookie", p->core_cookie);
+#endif
+
 	sched_show_numa(p, m);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d6efb1ffc08c..133560db1030 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -377,6 +377,10 @@ struct cfs_bandwidth {
 struct task_group {
 	struct cgroup_subsys_state css;
 
+#ifdef CONFIG_SCHED_CORE
+	int			core_tagged;
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* schedulable entities of this group on each CPU */
 	struct sched_entity	**se;
@@ -425,6 +429,11 @@ struct task_group {
 
 };
 
+static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct task_group, css) : NULL;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD
 
@@ -1124,6 +1133,11 @@ static inline bool is_migration_disabled(struct task_struct *p)
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
 static inline struct cpumask *sched_group_span(struct sched_group *sg);
 
+enum sched_core_cookie_type {
+	sched_core_task_cookie_type,
+	sched_core_group_cookie_type,
+};
+
 static inline bool sched_core_enabled(struct rq *rq)
 {
 	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
@@ -1194,12 +1208,54 @@ static inline bool sched_group_cookie_match(struct rq *rq,
 	return false;
 }
 
-extern void queue_core_balance(struct rq *rq);
+void sched_core_change_group(struct task_struct *p, struct task_group *new_tg);
+int sched_core_fork(struct task_struct *p, unsigned long clone_flags);
+
+static inline bool sched_core_enqueued(struct task_struct *task)
+{
+	return !RB_EMPTY_NODE(&task->core_node);
+}
+
+void queue_core_balance(struct rq *rq);
+
+void sched_core_enqueue(struct rq *rq, struct task_struct *p);
+void sched_core_dequeue(struct rq *rq, struct task_struct *p);
+void sched_core_get(void);
+void sched_core_put(void);
+
+int sched_core_share_pid(unsigned long flags, pid_t pid);
+int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
+
+#ifdef CONFIG_CGROUP_SCHED
+u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
+			  struct cftype *cft);
+
+#ifdef CONFIG_SCHED_DEBUG
+u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css,
+				   struct cftype *cft);
+#endif
+
+int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
+			   u64 val);
+#endif
+
+#ifndef TIF_UNSAFE_RET
+#define TIF_UNSAFE_RET (0)
+#define _TIF_UNSAFE_RET (0)
+#endif
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
 #else /* !CONFIG_SCHED_CORE */
 
+static inline bool sched_core_enqueued(struct task_struct *task) { return false; }
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
+{
+	return -ENOTSUPP;
+}
+
 static inline bool sched_core_enabled(struct rq *rq)
 {
 	return false;
diff --git a/kernel/sys.c b/kernel/sys.c
index a730c03ee607..da52a0da24ef 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2530,6 +2530,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+#ifdef CONFIG_SCHED_CORE
+	case PR_SCHED_CORE_SHARE:
+		if (arg4 || arg5)
+			return -EINVAL;
+		error = sched_core_share_pid(arg2, arg3);
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 7f0827705c9a..cd37bbf3cd90 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -247,4 +247,10 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE		59
+# define PR_SCHED_CORE_CLEAR		0  /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_SHARE_FROM	1  /* get core_sched cookie from pid */
+# define PR_SCHED_CORE_SHARE_TO		2  /* push core_sched cookie to pid */
+
 #endif /* _LINUX_PRCTL_H */
-- 
2.30.0.280.ga3ce27912f-goog



* [PATCH v10 3/5] kselftest: Add tests for core-sched interface
  2021-01-23  1:16 [PATCH v10 0/5] Core scheduling remaining patches Joel Fernandes (Google)
  2021-01-23  1:17 ` [PATCH v10 1/5] sched: migration changes for core scheduling Joel Fernandes (Google)
  2021-01-23  1:17 ` [PATCH v10 2/5] sched: CGroup tagging interface " Joel Fernandes (Google)
@ 2021-01-23  1:17 ` Joel Fernandes (Google)
  2021-01-23  1:17 ` [PATCH v10 4/5] Documentation: Add core scheduling documentation Joel Fernandes (Google)
  2021-01-23  1:17 ` [PATCH v10 5/5] sched: Debug bits Joel Fernandes (Google)
  4 siblings, 0 replies; 34+ messages in thread
From: Joel Fernandes (Google) @ 2021-01-23  1:17 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky

Add a kselftest to verify that the core-sched interface is working
correctly.
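
Assuming the standard kselftest flow, the test can be run with e.g.
"make -C tools/testing/selftests TARGETS=sched run_tests" on a kernel built
with CONFIG_SCHED_CORE (plus the CONFIG_SCHED_DEBUG dependency noted in the
config fragment below).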

Co-developed-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Reviewed-by: Josh Don <joshdon@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/testing/selftests/sched/.gitignore      |   1 +
 tools/testing/selftests/sched/Makefile        |  14 +
 tools/testing/selftests/sched/config          |   1 +
 .../testing/selftests/sched/test_coresched.c  | 716 ++++++++++++++++++
 4 files changed, 732 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index 000000000000..4660929b0b9a
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
new file mode 100644
index 000000000000..e43b74fc5d7e
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+	  $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := test_coresched
+TEST_PROGS := test_coresched
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config b/tools/testing/selftests/sched/config
new file mode 100644
index 000000000000..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/test_coresched.c b/tools/testing/selftests/sched/test_coresched.c
new file mode 100644
index 000000000000..4d18a0a727c8
--- /dev/null
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -0,0 +1,716 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Core-scheduling selftests.
+ *
+ * Copyright (C) 2020, Joel Fernandes.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/mount.h>
+#include <sys/prctl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE 59
+# define PR_SCHED_CORE_CLEAR            0
+# define PR_SCHED_CORE_SHARE_FROM       1
+# define PR_SCHED_CORE_SHARE_TO         2
+#endif
+
+#ifndef DEBUG_PRINT
+#define dprint(...)
+#else
+#define dprint(str, args...) printf("DEBUG: %s: " str "\n", __func__, ##args)
+#endif
+
+void print_banner(char *s)
+{
+	printf("coresched: %s:  ", s);
+}
+
+void print_pass(void)
+{
+	printf("PASS\n");
+}
+
+void assert_cond(int cond, char *str)
+{
+	if (!cond) {
+		printf("Error: %s\n", str);
+		abort();
+	}
+}
+
+char *make_group_root(void)
+{
+	char *mntpath, *mnt;
+	int ret;
+
+	mntpath = malloc(50);
+	if (!mntpath) {
+		perror("Failed to allocate mntpath\n");
+		abort();
+	}
+
+	sprintf(mntpath, "/tmp/coresched-test-XXXXXX");
+	mnt = mkdtemp(mntpath);
+	if (!mnt) {
+		perror("Failed to create mount: ");
+		exit(-1);
+	}
+
+	ret = mount("nodev", mnt, "cgroup", 0, "cpu");
+	if (ret == -1) {
+		perror("Failed to mount cgroup: ");
+		exit(-1);
+	}
+
+	return mnt;
+}
+
+char *read_group_cookie(char *cgroup_path)
+{
+	char path[50] = {}, *val;
+	int fd;
+
+	sprintf(path, "%s/cpu.core_group_cookie", cgroup_path);
+	fd = open(path, O_RDONLY, 0666);
+	if (fd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
+
+	val = calloc(1, 50);
+	if (read(fd, val, 50) == -1) {
+		perror("Failed to read group cookie: ");
+		abort();
+	}
+
+	val[strcspn(val, "\r\n")] = 0;
+
+	close(fd);
+	return val;
+}
+
+void assert_group_tag(char *cgroup_path, char *tag)
+{
+	char tag_path[50] = {}, rdbuf[8] = {};
+	int tfd;
+
+	sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+	tfd = open(tag_path, O_RDONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
+
+	if (read(tfd, rdbuf, 1) != 1) {
+		perror("Failed to enable coresched on cgroup: ");
+		abort();
+	}
+
+	if (strcmp(rdbuf, tag)) {
+		printf("Group tag does not match (exp: %s, act: %s)\n", tag,
+		       rdbuf);
+		abort();
+	}
+
+	if (close(tfd) == -1) {
+		perror("Failed to close tag fd: ");
+		abort();
+	}
+}
+
+void tag_group(char *cgroup_path)
+{
+	char tag_path[50];
+	int tfd;
+
+	sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+	tfd = open(tag_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
+
+	if (write(tfd, "1", 1) != 1) {
+		perror("Failed to enable coresched on cgroup: ");
+		abort();
+	}
+
+	if (close(tfd) == -1) {
+		perror("Failed to close tag fd: ");
+		abort();
+	}
+
+	assert_group_tag(cgroup_path, "1");
+}
+
+void untag_group(char *cgroup_path)
+{
+	char tag_path[50];
+	int tfd;
+
+	sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+	tfd = open(tag_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
+
+	if (write(tfd, "0", 1) != 1) {
+		perror("Failed to disable coresched on cgroup: ");
+		abort();
+	}
+
+	if (close(tfd) == -1) {
+		perror("Failed to close tag fd: ");
+		abort();
+	}
+
+	assert_group_tag(cgroup_path, "0");
+}
+
+char *make_group(char *parent, char *name)
+{
+	char *cgroup_path;
+	int ret;
+
+	if (!parent && !name)
+		return make_group_root();
+
+	cgroup_path = malloc(50);
+	if (!cgroup_path) {
+		perror("Failed to allocate cgroup_path\n");
+		abort();
+	}
+
+	/* Make the cgroup node for this group */
+	sprintf(cgroup_path, "%s/%s", parent, name);
+	ret = mkdir(cgroup_path, 0644);
+	if (ret == -1) {
+		perror("Failed to create group in cgroup: ");
+		abort();
+	}
+
+	return cgroup_path;
+}
+
+static void del_group(char *path)
+{
+	if (rmdir(path) != 0) {
+		printf("Removal of group failed\n");
+		abort();
+	}
+
+	free(path);
+}
+
+static void del_root_group(char *path)
+{
+	if (umount(path) != 0) {
+		perror("umount of cgroup failed\n");
+		abort();
+	}
+
+	if (rmdir(path) != 0) {
+		printf("Removal of group failed\n");
+		abort();
+	}
+
+	free(path);
+}
+
+void assert_group_cookie_equal(char *c1, char *c2)
+{
+	char *v1, *v2;
+
+	v1 = read_group_cookie(c1);
+	v2 = read_group_cookie(c2);
+	if (strcmp(v1, v2)) {
+		printf("Group cookies not equal\n");
+		abort();
+	}
+
+	free(v1);
+	free(v2);
+}
+
+void assert_group_cookie_not_equal(char *c1, char *c2)
+{
+	char *v1, *v2;
+
+	v1 = read_group_cookie(c1);
+	v2 = read_group_cookie(c2);
+	if (!strcmp(v1, v2)) {
+		printf("Group cookies equal\n");
+		abort();
+	}
+
+	free(v1);
+	free(v2);
+}
+
+void assert_group_cookie_not_zero(char *c1)
+{
+	char *v1 = read_group_cookie(c1);
+
+	v1[1] = 0;
+	if (!strcmp(v1, "0")) {
+		printf("Group cookie zero\n");
+		abort();
+	}
+	free(v1);
+}
+
+void assert_group_cookie_zero(char *c1)
+{
+	char *v1 = read_group_cookie(c1);
+
+	v1[1] = 0;
+	if (strcmp(v1, "0")) {
+		printf("Group cookie not zero");
+		abort();
+	}
+	free(v1);
+}
+
+struct task_state {
+	int pid_share;
+	char pid_str[50];
+	pthread_mutex_t m;
+	pthread_cond_t cond;
+	pthread_cond_t cond_par;
+};
+
+struct task_state *add_task(char *p)
+{
+	struct task_state *mem;
+	pthread_mutexattr_t am;
+	pthread_condattr_t a;
+	char tasks_path[50];
+	int tfd, pid, ret;
+
+	sprintf(tasks_path, "%s/tasks", p);
+	tfd = open(tasks_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tasks path failed: ");
+		abort();
+	}
+
+	mem = mmap(NULL, sizeof(*mem), PROT_READ | PROT_WRITE,
+		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+	assert_cond(mem != MAP_FAILED, "Failed to mmap task state");
+	memset(mem, 0, sizeof(*mem));
+
+	pthread_condattr_init(&a);
+	pthread_condattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
+	pthread_mutexattr_init(&am);
+	pthread_mutexattr_setpshared(&am, PTHREAD_PROCESS_SHARED);
+
+	pthread_cond_init(&mem->cond, &a);
+	pthread_cond_init(&mem->cond_par, &a);
+	pthread_mutex_init(&mem->m, &am);
+
+	pid = fork();
+	if (pid == 0) {
+		while (1) {
+			pthread_mutex_lock(&mem->m);
+			while (!mem->pid_share)
+				pthread_cond_wait(&mem->cond, &mem->m);
+
+			pid = mem->pid_share;
+			mem->pid_share = 0;
+
+			if (pid == -1) {
+				if (prctl(PR_SCHED_CORE_SHARE,
+					  PR_SCHED_CORE_CLEAR,
+					  getpid(), 0, 0))
+					perror("prctl() PR_SCHED_CORE_CLEAR failed");
+			} else {
+				if (prctl(PR_SCHED_CORE_SHARE,
+					  PR_SCHED_CORE_SHARE_FROM,
+					  pid, 0, 0))
+					perror("prctl() PR_SCHED_CORE_SHARE_FROM failed");
+			}
+			pthread_mutex_unlock(&mem->m);
+			pthread_cond_signal(&mem->cond_par);
+		}
+	}
+
+	sprintf(mem->pid_str, "%d", pid);
+	dprint("add task %d to group %s", pid, p);
+
+	ret = write(tfd, mem->pid_str, strlen(mem->pid_str));
+	assert_cond(ret != -1, "Failed to write pid into tasks");
+
+	close(tfd);
+	return mem;
+}
+
+/* Make t1 share with t2 */
+void make_tasks_share(struct task_state *t1, struct task_state *t2)
+{
+	int p2 = atoi(t2->pid_str);
+
+	dprint("task %s %s", t1->pid_str, t2->pid_str);
+
+	pthread_mutex_lock(&t1->m);
+	t1->pid_share = p2;
+	pthread_mutex_unlock(&t1->m);
+
+	pthread_cond_signal(&t1->cond);
+
+	pthread_mutex_lock(&t1->m);
+	while (t1->pid_share)
+		pthread_cond_wait(&t1->cond_par, &t1->m);
+	pthread_mutex_unlock(&t1->m);
+}
+
+/* Clear t1's cookie so it is reconnected to its CGroup's cookie (if any). */
+void reset_task_cookie(struct task_state *t1)
+{
+	dprint("task %s", t1->pid_str);
+
+	pthread_mutex_lock(&t1->m);
+	t1->pid_share = -1;
+	pthread_mutex_unlock(&t1->m);
+
+	pthread_cond_signal(&t1->cond);
+
+	pthread_mutex_lock(&t1->m);
+	while (t1->pid_share)
+		pthread_cond_wait(&t1->cond_par, &t1->m);
+	pthread_mutex_unlock(&t1->m);
+}
+
+char *get_task_core_cookie(char *pid)
+{
+	char proc_path[50];
+	int found = 0;
+	char *line;
+	int i, j;
+	FILE *fp;
+
+	line = malloc(1024);
+	assert_cond(!!line, "Failed to alloc memory");
+
+	sprintf(proc_path, "/proc/%s/sched", pid);
+
+	fp = fopen(proc_path, "r");
+	assert_cond(!!fp, "Failed to open /proc/<pid>/sched");
+
+	while ((fgets(line, 1024, fp)) != NULL) {
+		if (!strstr(line, "core_cookie"))
+			continue;
+
+		for (j = 0, i = 0; i < 1024 && line[i] != '\0'; i++)
+			if (line[i] >= '0' && line[i] <= '9')
+				line[j++] = line[i];
+		line[j] = '\0';
+		found = 1;
+		break;
+	}
+
+	fclose(fp);
+
+	if (found)
+		return line;
+
+	free(line);
+	printf("core_cookie not found. Enable SCHED_DEBUG?\n");
+	abort();
+	return NULL;
+}
+
+void assert_tasks_share(struct task_state *t1, struct task_state *t2)
+{
+	char *c1, *c2;
+
+	c1 = get_task_core_cookie(t1->pid_str);
+	c2 = get_task_core_cookie(t2->pid_str);
+	dprint("check task (%s) cookie (%s) == task (%s) cookie (%s)",
+	       t1->pid_str, c1, t2->pid_str, c2);
+	assert_cond(!strcmp(c1, c2), "Tasks don't share cookie");
+	free(c1);
+	free(c2);
+}
+
+void assert_tasks_dont_share(struct task_state *t1, struct task_state *t2)
+{
+	char *c1, *c2;
+
+	c1 = get_task_core_cookie(t1->pid_str);
+	c2 = get_task_core_cookie(t2->pid_str);
+	dprint("check task (%s) cookie (%s) != task (%s) cookie (%s)",
+	       t1->pid_str, c1, t2->pid_str, c2);
+	assert_cond(strcmp(c1, c2), "Tasks unexpectedly share a cookie");
+	free(c1);
+	free(c2);
+}
+
+void assert_task_has_cookie(char *pid)
+{
+	char *tk;
+
+	tk = get_task_core_cookie(pid);
+
+	assert_cond(strcmp(tk, "0"), "Task does not have cookie");
+
+	free(tk);
+}
+
+void kill_task(struct task_state *t)
+{
+	int pid = atoi(t->pid_str);
+
+	kill(pid, SIGKILL);
+	waitpid(pid, NULL, 0);
+}
+
+/*
+ * Test that a group's children have a cookie inherited
+ * from their parent group _after_ the parent was tagged.
+ *
+ *   p ----- c1 - c11
+ *     \ c2 - c22
+ */
+static void test_cgroup_parent_child_tag_inherit(char *root)
+{
+	char *p, *c1, *c11, *c2, *c22;
+
+	print_banner("TEST-CGROUP-PARENT-CHILD-TAG");
+
+	p = make_group(root, "p");
+	assert_group_cookie_zero(p);
+
+	c1 = make_group(p, "c1");
+	assert_group_tag(c1, "0");	/* Child tag is "0" but inherits cookie from parent. */
+	assert_group_cookie_zero(c1);
+	assert_group_cookie_equal(c1, p);
+
+	c11 = make_group(c1, "c11");
+	assert_group_tag(c11, "0");
+	assert_group_cookie_zero(c11);
+	assert_group_cookie_equal(c11, p);
+
+	c2 = make_group(p, "c2");
+	assert_group_tag(c2, "0");
+	assert_group_cookie_zero(c2);
+	assert_group_cookie_equal(c2, p);
+
+	tag_group(p);
+
+	/* Verify c1 got the cookie */
+	assert_group_tag(c1, "0");
+	assert_group_cookie_not_zero(c1);
+	assert_group_cookie_equal(c1, p);
+
+	/* Verify c2 got the cookie */
+	assert_group_tag(c2, "0");
+	assert_group_cookie_not_zero(c2);
+	assert_group_cookie_equal(c2, p);
+
+	/* Verify c11 got the cookie */
+	assert_group_tag(c11, "0");
+	assert_group_cookie_not_zero(c11);
+	assert_group_cookie_equal(c11, p);
+
+	/*
+	 * Verify c22 which is a nested group created
+	 * _after_ tagging got the cookie.
+	 */
+	c22 = make_group(c2, "c22");
+
+	assert_group_tag(c22, "0");
+	assert_group_cookie_not_zero(c22);
+	assert_group_cookie_equal(c22, c1);
+	assert_group_cookie_equal(c22, c11);
+	assert_group_cookie_equal(c22, c2);
+	assert_group_cookie_equal(c22, p);
+
+	del_group(c22);
+	del_group(c11);
+	del_group(c1);
+	del_group(c2);
+	del_group(p);
+	print_pass();
+}
+
+/*
+ * Test that a tagged group's children have a cookie inherited
+ * from their parent group.
+ */
+static void test_cgroup_parent_tag_child_inherit(char *root)
+{
+	char *p, *c1, *c2, *c3;
+
+	print_banner("TEST-CGROUP-PARENT-TAG-CHILD-INHERIT");
+
+	p = make_group(root, "p");
+	assert_group_cookie_zero(p);
+	tag_group(p);
+	assert_group_cookie_not_zero(p);
+
+	c1 = make_group(p, "c1");
+	assert_group_cookie_not_zero(c1);
+	/* Child tag is "0" but it inherits cookie from parent. */
+	assert_group_tag(c1, "0");
+	assert_group_cookie_equal(c1, p);
+
+	c2 = make_group(p, "c2");
+	assert_group_tag(c2, "0");
+	assert_group_cookie_equal(c2, p);
+	assert_group_cookie_equal(c1, c2);
+
+	c3 = make_group(c1, "c3");
+	assert_group_tag(c3, "0");
+	assert_group_cookie_equal(c3, p);
+	assert_group_cookie_equal(c1, c3);
+
+	del_group(c3);
+	del_group(c1);
+	del_group(c2);
+	del_group(p);
+	print_pass();
+}
+
+static void test_prctl_in_group(char *root)
+{
+	char *p;
+	struct task_state *tsk1, *tsk2, *tsk3;
+
+	print_banner("TEST-PRCTL-IN-GROUP");
+
+	p = make_group(root, "p");
+	assert_group_cookie_zero(p);
+	tag_group(p);
+	assert_group_cookie_not_zero(p);
+
+	tsk1 = add_task(p);
+	assert_task_has_cookie(tsk1->pid_str);
+
+	tsk2 = add_task(p);
+	assert_task_has_cookie(tsk2->pid_str);
+
+	tsk3 = add_task(p);
+	assert_task_has_cookie(tsk3->pid_str);
+
+	/* tsk2 share with tsk3 -- both get disconnected from CGroup. */
+	make_tasks_share(tsk2, tsk3);
+	assert_task_has_cookie(tsk2->pid_str);
+	assert_task_has_cookie(tsk3->pid_str);
+	assert_tasks_share(tsk2, tsk3);
+	assert_tasks_dont_share(tsk1, tsk2);
+	assert_tasks_dont_share(tsk1, tsk3);
+
+	/* now reset tsk3 -- get connected back to CGroup. */
+	reset_task_cookie(tsk3);
+	assert_task_has_cookie(tsk3->pid_str);
+	assert_tasks_dont_share(tsk2, tsk3);
+	assert_tasks_share(tsk1, tsk3);	/* tsk3 is back in the CGroup. */
+	assert_tasks_dont_share(tsk1, tsk2);	/* but tsk2 still has its own cookie */
+
+	/* now reset tsk2 as well to get it connected back to CGroup. */
+	reset_task_cookie(tsk2);
+	assert_task_has_cookie(tsk2->pid_str);
+	assert_tasks_share(tsk2, tsk3);
+	assert_tasks_share(tsk1, tsk3);
+	assert_tasks_share(tsk1, tsk2);
+
+	/* Test the rest of the cases (2 to 4)
+	 *
+	 *          t1              joining         t2
+	 * CASE 1:
+	 * before   0                               0
+	 * after    new cookie                      new cookie
+	 *
+	 * CASE 2:
+	 * before   X (non-zero)                    0
+	 * after    0                               0
+	 *
+	 * CASE 3:
+	 * before   0                               X (non-zero)
+	 * after    X                               X
+	 *
+	 * CASE 4:
+	 * before   Y (non-zero)                    X (non-zero)
+	 * after    X                               X
+	 */
+
+	/* case 2: */
+	dprint("case 2");
+	make_tasks_share(tsk1, tsk1);
+	assert_tasks_dont_share(tsk1, tsk2);
+	assert_tasks_dont_share(tsk1, tsk3);
+	assert_task_has_cookie(tsk1->pid_str);
+	make_tasks_share(tsk1, tsk2);	/* Will reset the task cookie. */
+	assert_task_has_cookie(tsk1->pid_str);
+	assert_task_has_cookie(tsk2->pid_str);
+
+	/* case 3: */
+	dprint("case 3");
+	make_tasks_share(tsk2, tsk2);
+	assert_tasks_dont_share(tsk2, tsk1);
+	assert_tasks_dont_share(tsk2, tsk3);
+	assert_task_has_cookie(tsk2->pid_str);
+	make_tasks_share(tsk1, tsk2);
+	assert_task_has_cookie(tsk1->pid_str);
+	assert_task_has_cookie(tsk2->pid_str);
+	assert_tasks_share(tsk1, tsk2);
+	assert_tasks_dont_share(tsk1, tsk3);
+	reset_task_cookie(tsk1);
+	reset_task_cookie(tsk2);
+
+	/* case 4: */
+	dprint("case 4");
+	assert_tasks_share(tsk1, tsk2);
+	assert_task_has_cookie(tsk1->pid_str);
+	assert_task_has_cookie(tsk2->pid_str);
+	make_tasks_share(tsk1, tsk1);
+	assert_task_has_cookie(tsk1->pid_str);
+	make_tasks_share(tsk2, tsk2);
+	assert_task_has_cookie(tsk2->pid_str);
+	assert_tasks_dont_share(tsk1, tsk2);
+	make_tasks_share(tsk1, tsk2);
+	assert_task_has_cookie(tsk1->pid_str);
+	assert_task_has_cookie(tsk2->pid_str);
+	assert_tasks_share(tsk1, tsk2);
+	assert_tasks_dont_share(tsk1, tsk3);
+	reset_task_cookie(tsk1);
+	reset_task_cookie(tsk2);
+
+	kill_task(tsk1);
+	kill_task(tsk2);
+	kill_task(tsk3);
+	del_group(p);
+	print_pass();
+}
+
+int main(int argc, char *argv[])
+{
+	char *root;
+
+	if (argc > 1)
+		root = argv[1];
+	else
+		root = make_group(NULL, NULL);
+
+	test_cgroup_parent_tag_child_inherit(root);
+	test_cgroup_parent_child_tag_inherit(root);
+	test_prctl_in_group(root);
+
+	if (argc <= 1)
+		del_root_group(root);
+	return 0;
+}
-- 
2.30.0.280.ga3ce27912f-goog


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v10 4/5] Documentation: Add core scheduling documentation
  2021-01-23  1:16 [PATCH v10 0/5] Core scheduling remaining patches Joel Fernandes (Google)
                   ` (2 preceding siblings ...)
  2021-01-23  1:17 ` [PATCH v10 3/5] kselftest: Add tests for core-sched interface Joel Fernandes (Google)
@ 2021-01-23  1:17 ` Joel Fernandes (Google)
  2021-01-23  1:17 ` [PATCH v10 5/5] sched: Debug bits Joel Fernandes (Google)
  4 siblings, 0 replies; 34+ messages in thread
From: Joel Fernandes (Google) @ 2021-01-23  1:17 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Randy Dunlap

Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Chris Hyser <chris.hyser@oracle.com>
Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
Co-developed-by: Josh Don <joshdon@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 263 ++++++++++++++++++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 264 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index 000000000000..a795747c706a
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,263 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks doesn't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the same
+hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that only trusted tasks share a core. This increase in core sharing
+can improve performance; however, performance is not guaranteed to always
+improve, though that has been observed with a number of real-world workloads.
+In theory, core scheduling aims to perform at least as well as when Hyper
+Threading is disabled. In practice, this is mostly but not always the case,
+since synchronizing scheduling decisions across 2 or more CPUs in a core
+involves additional overhead - especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that can be co-scheduled
+on the same core.
+The core scheduler uses this information to make sure that tasks that are not
+in the same group never run simultaneously on a core, while doing its best to
+satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.core_tag``
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+All of the CGroup's tasks are then allowed to run concurrently on a core's
+hyperthreads (also called siblings).
+
+A value of ``0`` in this file means the tag state of the CGroup is inherited
+from its parent hierarchy: if any ancestor of the CGroup is tagged, then the
+group is tagged.
+
+.. note:: Once a CGroup is tagged via cpu.core_tag, it is not possible to set this
+          for any descendant of the tagged group.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+          a core with kernel threads and untagged system threads. For this reason,
+          if a group has ``cpu.core_tag`` of 0, it is considered to be trusted.
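+
+As a minimal illustration, a management process could tag a group by writing to
+``cpu.core_tag`` directly, much as the kselftest in this series does. The path
+below is only an example; it assumes the cpu controller is mounted at
+``/sys/fs/cgroup/cpu`` and that a child group named ``trusted`` already exists::
+
+    #include <fcntl.h>
+    #include <stdio.h>
+    #include <unistd.h>
+
+    static int tag_trusted_group(void)
+    {
+            /* Illustrative path; adjust to the actual cgroup mount point. */
+            int fd = open("/sys/fs/cgroup/cpu/trusted/cpu.core_tag", O_WRONLY);
+
+            if (fd < 0) {
+                    perror("open cpu.core_tag");
+                    return -1;
+            }
+
+            /* Writing "1" tags all tasks in the group with a common cookie. */
+            if (write(fd, "1", 1) != 1) {
+                    perror("write cpu.core_tag");
+                    close(fd);
+                    return -1;
+            }
+
+            return close(fd);
+    }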
+
+prctl interface
+###############
+A ``prctl(2)`` command, ``PR_SCHED_CORE_SHARE``, provides an interface for
+creating core scheduling groups and for admitting tasks to and removing them
+from such groups.
+
+::
+
+    #include <sys/prctl.h>
+
+    int prctl(int option, unsigned long arg2, unsigned long arg3,
+            unsigned long arg4, unsigned long arg5);
+
+option:
+    ``PR_SCHED_CORE_SHARE``
+
+arg2:
+    - ``PR_SCHED_CORE_CLEAR            0  -- clear core_sched cookie of pid``
+    - ``PR_SCHED_CORE_SHARE_FROM       1  -- get core_sched cookie from pid``
+    - ``PR_SCHED_CORE_SHARE_TO         2  -- push core_sched cookie to pid``
+
+arg3:
+    ``tid`` of the task for which the operation applies
+
+arg4 and arg5:
+    MUST be equal to 0.
+
+Creation
+~~~~~~~~
+Creation is accomplished by sharing a ``cookie`` from a process not currently in
+a core scheduling group.
+
+::
+
+    if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, src_tid, 0, 0) < 0)
+            handle_error("src_tid sched_core failed");
+
+Removal
+~~~~~~~
+Removing a task from a core scheduling group is done by:
+
+::
+
+    if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, clr_tid, 0, 0) < 0)
+             handle_error("clr_tid sched_core failed");
+
+Cookie Transferal
+~~~~~~~~~~~~~~~~~
+Transferring a cookie between the current and other tasks is possible using
+``PR_SCHED_CORE_SHARE_FROM`` and ``PR_SCHED_CORE_SHARE_TO`` to inherit a cookie
+from a specified task or to share a cookie with a task. In combination, this
+allows a simple helper program to pull a cookie from a task in an existing core
+scheduling group and share it with already running tasks.
+
+::
+
+    if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, from_tid, 0, 0) < 0)
+            handle_error("from_tid sched_core failed");
+
+    if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, to_tid, 0, 0) < 0)
+            handle_error("to_tid sched_core failed");
+
+
+.. note:: The core-sharing granted with ``prctl(2)`` will be subject to
+          core-sharing restrictions specified by the CGroup interface. For example,
+          if tasks T1 and T2 are a part of 2 different tagged CGroups, then they will
+          not share a core even if ``prctl(2)`` is called. This is analogous to how
+          affinities are set using the cpuset interface.
+
+It is important to note that, on a ``clone(2)`` syscall with ``CLONE_THREAD`` set,
+the child will be assigned the same ``cookie`` as its parent and will thus be in
+the
+same core scheduling group.  In the security usecase, a ``CLONE_THREAD`` child
+can access its parent's address space anyway (``CLONE_THREAD`` requires
+``CLONE_SIGHAND`` which requires ``CLONE_VM``), so there's no point in not
+allowing them to share a core. If a different behavior is desired, the child
+thread can call ``prctl(2)`` as needed.  This behavior is specific to the
+``prctl(2)`` interface. For the CGroup interface, the child of a fork always
+shares a core with its parent.  On the other hand, if a parent was previously
+tagged via ``prctl(2)`` and does a regular ``fork(2)`` syscall, the child will
+receive a unique tag.
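+
+A rough sketch of the behavior described above follows (error handling is
+omitted; the initial "share from self" step relies on the semantics in this
+series where two cookie-less tasks that share obtain a new unique cookie, as
+the accompanying kselftest also does)::
+
+    #include <pthread.h>
+    #include <sys/prctl.h>
+    #include <unistd.h>
+
+    #ifndef PR_SCHED_CORE_SHARE
+    #define PR_SCHED_CORE_SHARE       59
+    #define PR_SCHED_CORE_SHARE_FROM   1
+    #endif
+
+    static void *thread_fn(void *arg)
+    {
+            /* A CLONE_THREAD child: runs with its parent's cookie. */
+            return NULL;
+    }
+
+    int main(void)
+    {
+            pthread_t tid;
+
+            /* Obtain a fresh cookie by sharing from ourselves. */
+            prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, getpid(), 0, 0);
+
+            /* The new thread shares the cookie ... */
+            pthread_create(&tid, NULL, thread_fn, NULL);
+
+            /* ... while a fork(2) child receives its own unique tag. */
+            if (fork() == 0)
+                    _exit(0);
+
+            pthread_join(tid, NULL);
+            return 0;
+    }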
+
+Design/Implementation
+---------------------
+Each task that is tagged is assigned a cookie internally in the kernel. As
+mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
+each other and share a core.
+
+The basic idea is that every schedule event tries to select tasks for all the
+siblings of a core such that all the selected tasks running on a core are
+trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
+The idle task is considered special, as it trusts everything and everything
+trusts it.
+
+During a schedule() event on any sibling of a core, the highest priority task on
+the sibling's core is picked and assigned to the sibling calling schedule(), if
+the sibling has the task enqueued. For the rest of the siblings in the core,
+the highest priority task with the same cookie is selected if there is one
+runnable in their individual run queues. If a task with the same cookie is not
+available, the idle task is selected; the idle task is globally trusted.
+
+Once a task has been selected for all the siblings in the core, an IPI is sent to
+siblings for whom a new task was selected. On receiving the IPI, the siblings
+switch to the new task immediately. If an idle task is selected for a sibling,
+then the sibling is considered to be in a `forced idle` state, i.e., it may
+have tasks on its own runqueue to run, but it still has to run the idle task.
+More on this in the next section.
+
+Forced-idling of tasks
+----------------------
+The scheduler tries its best to find tasks that trust each other such that all
+tasks selected to be scheduled are of the highest priority in a core.  However,
+it is possible that some runqueues have tasks that are incompatible with the
+highest priority ones in the core. Favoring security over fairness, one or more
+siblings could be forced to select a lower priority task if the highest
+priority task is not trusted with respect to the core-wide highest priority
+task.  If a sibling does not have a trusted task to run, it will be forced idle
+by the scheduler (idle thread is scheduled to run).
+
+When the highest priority task is selected to run, a reschedule-IPI is sent to
+the sibling to force it into idle. This results in 4 cases which need to be
+considered depending on whether a VM or a regular usermode process was running
+on either HT::
+
+          HT1 (attack)            HT2 (victim)
+   A      idle -> user space      user space -> idle
+   B      idle -> user space      guest -> idle
+   C      idle -> guest           user space -> idle
+   D      idle -> guest           guest -> idle
+
+Note that for better performance, we do not wait for the destination CPU
+(victim) to enter idle mode. This is because the sending of the IPI would bring
+the destination CPU immediately into kernel mode from user space, or cause a VMEXIT
+in the case of guests. At best, this would only leak some scheduler metadata
+which may not be worth protecting. It is also possible that the IPI is received
+too late on some architectures, but this has not been observed in the case of
+x86.
+
+Trust model
+-----------
+Core scheduling maintains trust relationships amongst groups of tasks by
+assigning them the same cookie value.
+When a system with core scheduling boots, all tasks are considered to trust
+each other. This is because the core scheduler does not have information about
+trust relationships until userspace uses the above-mentioned interfaces to
+communicate them. In other words, all tasks have a default cookie value of 0
+and are considered system-wide trusted. Forced idling of siblings running
+cookie-0 tasks is also avoided.
+
+Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
+within such groups are considered to trust each other, but do not trust those
+outside. Tasks outside the group also don't trust tasks within.
+
+Limitations in core-scheduling
+------------------------------
+Core scheduling tries to guarantee that only trusted tasks run concurrently on a
+core. But there could be a small window of time during which untrusted tasks
+run concurrently, or during which the kernel runs concurrently with a task it
+does not trust.
+
+1. IPI processing delays
+########################
+Core scheduling selects only trusted tasks to run together. An IPI is used to
+notify the siblings to switch to the new task. But there could be hardware
+delays in receiving the IPI on some architectures (on x86, this has not been
+observed). This may cause an attacker task to start running on a CPU before its
+siblings receive the IPI. Even though the cache is flushed on entry to user
+mode, victim tasks on siblings may populate data in the cache and
+microarchitectural buffers after the attacker starts to run, which opens up a
+possibility of data leakage.
+
+Open cross-HT issues that core scheduling does not solve
+--------------------------------------------------------
+1. For MDS
+##########
+Core scheduling cannot protect against MDS attacks between an HT running in
+user mode and another running in kernel mode. Even though both HTs run tasks
+which trust each other, kernel memory is still considered untrusted. Such
+attacks are possible for any combination of sibling CPU modes (host or guest mode).
+
+2. For L1TF
+###########
+Core scheduling cannot protect against an L1TF guest attacker exploiting a
+guest or host victim. This is because the guest attacker can craft invalid
+PTEs which are not inverted due to a vulnerable guest kernel. The only
+solution is to disable EPT (Extended Page Tables).
+
+For both MDS and L1TF, if the guest vCPUs are configured to not trust each
+other (by tagging them separately), then the guest-to-guest attacks would go
+away. Alternatively, a system admin policy could consider guest-to-guest
+attacks to be a guest problem.
+
+Another approach to resolve these would be to make every untrusted task on the
+system not trust every other untrusted task. While this could reduce the
+parallelism of the untrusted tasks, it would still solve the above issues while
+allowing system processes (trusted tasks) to share a core.
+
+3. Protecting the kernel (IRQ, syscall, VMEXIT)
+###############################################
+This section is a work in progress. The main point here is that entry into the
+kernel is not protected from attackers on a sibling.
+
+Use cases
+---------
+The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
+with SMT enabled. There are other use cases where this feature could be used:
+
+- Isolating tasks that need a whole core: examples include realtime tasks and
+  tasks that use SIMD instructions, etc.
+- Gang scheduling: Requirements for a group of tasks that needs to be scheduled
+  together could also be realized using core scheduling. One example is vCPUs of
+  a VM.
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index 21710f8609fe..361ccbbd9e54 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -16,3 +16,4 @@ are configurable at compile, boot or run time.
    multihit.rst
    special-register-buffer-data-sampling.rst
    l1d_flush.rst
+   core-scheduling.rst
-- 
2.30.0.280.ga3ce27912f-goog


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v10 5/5] sched: Debug bits...
  2021-01-23  1:16 [PATCH v10 0/5] Core scheduling remaining patches Joel Fernandes (Google)
                   ` (3 preceding siblings ...)
  2021-01-23  1:17 ` [PATCH v10 4/5] Documentation: Add core scheduling documentation Joel Fernandes (Google)
@ 2021-01-23  1:17 ` Joel Fernandes (Google)
  4 siblings, 0 replies; 34+ messages in thread
From: Joel Fernandes (Google) @ 2021-01-23  1:17 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Not-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 35 ++++++++++++++++++++++++++++++++++-
 kernel/sched/fair.c |  9 +++++++++
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a3844e2e7379..56ba2ca4f922 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -106,6 +106,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b, bool
 
 	int pa = __task_prio(a), pb = __task_prio(b);
 
+	trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+		     a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+		     b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
 	if (-pa < -pb)
 		return true;
 
@@ -296,12 +300,16 @@ static void __sched_core_enable(void)
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+	printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
+
+	printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -5237,6 +5245,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			set_next_task(rq, next);
 		}
 
+		trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+			     rq->core->core_task_seq,
+			     rq->core->core_pick_seq,
+			     rq->core_sched_seq,
+			     next->comm, next->pid,
+			     next->core_cookie);
+
 		rq->core_pick = NULL;
 		return next;
 	}
@@ -5331,6 +5346,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 					rq->core->core_forceidle_seq++;
 			}
 
+			trace_printk("cpu(%d): selected: %s/%d %lx\n",
+				     i, p->comm, p->pid, p->core_cookie);
+
 			/*
 			 * If this new candidate is of higher priority than the
 			 * previous; and they're incompatible; we need to wipe
@@ -5347,6 +5365,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				rq->core->core_cookie = p->core_cookie;
 				max = p;
 
+				trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);
+
 				if (old_max) {
 					rq->core->core_forceidle = false;
 					for_each_cpu(j, smt_mask) {
@@ -5368,6 +5388,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	/* Something should have been selected for current CPU */
 	WARN_ON_ONCE(!next);
+	trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
 
 	/*
 	 * Reschedule siblings
@@ -5409,13 +5430,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		}
 
 		/* Did we break L1TF mitigation requirements? */
-		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+		if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+			trace_printk("[%d]: cookie mismatch. %s/%d/0x%lx/0x%lx\n",
+				     rq_i->cpu, rq_i->core_pick->comm,
+				     rq_i->core_pick->pid,
+				     rq_i->core_pick->core_cookie,
+				     rq_i->core->core_cookie);
+			WARN_ON_ONCE(1);
+		}
 
 		if (rq_i->curr == rq_i->core_pick) {
 			rq_i->core_pick = NULL;
 			continue;
 		}
 
+		trace_printk("IPI(%d)\n", i);
 		resched_curr(rq_i);
 	}
 
@@ -5455,6 +5484,10 @@ static bool try_steal_cookie(int this, int that)
 		if (p->core_occupation > dst->idle->core_occupation)
 			goto next;
 
+		trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+			     p->comm, p->pid, that, this,
+			     p->core_occupation, dst->idle->core_occupation, cookie);
+
 		p->on_rq = TASK_ON_RQ_MIGRATING;
 		deactivate_task(src, p, 0);
 		set_task_cpu(p, this);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fddd7c44bbf3..ebeeebc4223a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10769,6 +10769,15 @@ static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forc
 			cfs_rq->forceidle_seq = fi_seq;
 		}
 
+
+		if (root) {
+			old = cfs_rq->min_vruntime_fi;
+			new = cfs_rq->min_vruntime;
+			root = false;
+			trace_printk("cfs_rq(min_vruntime_fi) %lu->%lu\n",
+				     old, new);
+		}
+
 		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
 	}
 }
-- 
2.30.0.280.ga3ce27912f-goog


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-01-23  1:17 ` [PATCH v10 2/5] sched: CGroup tagging interface " Joel Fernandes (Google)
@ 2021-02-03 16:51   ` Peter Zijlstra
  2021-02-04 13:59     ` Peter Zijlstra
  2021-02-04 13:54   ` Peter Zijlstra
                     ` (8 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2021-02-03 16:51 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky


I'm slowly starting to go through this...

On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> +static bool sched_core_empty(struct rq *rq)
> +{
> +	return RB_EMPTY_ROOT(&rq->core_tree);
> +}
> +
> +static struct task_struct *sched_core_first(struct rq *rq)
> +{
> +	struct task_struct *task;
> +
> +	task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
> +	return task;
> +}

AFAICT you can do with:

static struct task_struct *sched_core_any(struct rq *rq)
{
	return rb_entry(rq->core_tree.rb_node, struct task_struct, code_node);
}

> +static void sched_core_flush(int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +	struct task_struct *task;
> +
> +	while (!sched_core_empty(rq)) {
> +		task = sched_core_first(rq);
> +		rb_erase(&task->core_node, &rq->core_tree);
> +		RB_CLEAR_NODE(&task->core_node);
> +	}
> +	rq->core->core_task_seq++;
> +}

However,

> +	for_each_possible_cpu(cpu) {
> +		struct rq *rq = cpu_rq(cpu);
> +
> +		WARN_ON_ONCE(enabled == rq->core_enabled);
> +
> +		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
> +			/*
> +			 * All active and migrating tasks will have already
> +			 * been removed from core queue when we clear the
> +			 * cgroup tags. However, dying tasks could still be
> +			 * left in core queue. Flush them here.
> +			 */
> +			if (!enabled)
> +				sched_core_flush(cpu);
> +
> +			rq->core_enabled = enabled;
> +		}
> +	}

I'm not sure I understand. Is the problem that we're still schedulable
during do_exit() after cgroup_exit() ? It could be argued that when we
leave the cgroup there, we should definitely leave the tag group too.




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-01-23  1:17 ` [PATCH v10 2/5] sched: CGroup tagging interface " Joel Fernandes (Google)
  2021-02-03 16:51   ` Peter Zijlstra
@ 2021-02-04 13:54   ` Peter Zijlstra
  2021-02-05  3:45     ` Josh Don
  2021-02-04 13:57   ` Peter Zijlstra
                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2021-02-04 13:54 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky

On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> From: Peter Zijlstra <peterz@infradead.org>

I'm thinking this is too much credit, I didn't write much of this.

> Marks all tasks in a cgroup as matching for core-scheduling.
> 
> A task will need to be moved into the core scheduler queue when the cgroup
> it belongs to is tagged to run with core scheduling.  Similarly the task
> will need to be moved out of the core scheduler queue when the cgroup
> is untagged.
> 
> Also after we forked a task, its core scheduler queue's presence will
> need to be updated according to its new cgroup's status.
> 
> Use stop machine mechanism to update all tasks in a cgroup to prevent a
> new task from sneaking into the cgroup, and missed out from the update
> while we iterates through all the tasks in the cgroup.  A more complicated
> scheme could probably avoid the stop machine.  Such scheme will also
> need to resovle inconsistency between a task's cgroup core scheduling
> tag and residency in core scheduler queue.
> 
> We are opting for the simple stop machine mechanism for now that avoids
> such complications.
> 
> Core scheduler has extra overhead.  Enable it only for core with
> more than one SMT hardware threads.

Very little actual words on the desired and implemented semantics of the
interface, while the patch contains much non-obvious complication.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-01-23  1:17 ` [PATCH v10 2/5] sched: CGroup tagging interface " Joel Fernandes (Google)
  2021-02-03 16:51   ` Peter Zijlstra
  2021-02-04 13:54   ` Peter Zijlstra
@ 2021-02-04 13:57   ` Peter Zijlstra
  2021-02-04 20:52     ` Chris Hyser
  2021-02-04 14:01   ` Peter Zijlstra
                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2021-02-04 13:57 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky

On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> +/* Request the scheduler to share a core */
> +#define PR_SCHED_CORE_SHARE		59
> +# define PR_SCHED_CORE_CLEAR		0  /* clear core_sched cookie of pid */
> +# define PR_SCHED_CORE_SHARE_FROM	1  /* get core_sched cookie from pid */
> +# define PR_SCHED_CORE_SHARE_TO		2  /* push core_sched cookie to pid */

Why ?

Also, how do I set a unique cookie on myself with this interface?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-03 16:51   ` Peter Zijlstra
@ 2021-02-04 13:59     ` Peter Zijlstra
  2021-02-05 16:42       ` Joel Fernandes
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2021-02-04 13:59 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky

On Wed, Feb 03, 2021 at 05:51:15PM +0100, Peter Zijlstra wrote:
> 
> I'm slowly starting to go through this...
> 
> On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> > +static bool sched_core_empty(struct rq *rq)
> > +{
> > +	return RB_EMPTY_ROOT(&rq->core_tree);
> > +}
> > +
> > +static struct task_struct *sched_core_first(struct rq *rq)
> > +{
> > +	struct task_struct *task;
> > +
> > +	task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
> > +	return task;
> > +}
> 
> AFAICT you can do with:
> 
> static struct task_struct *sched_core_any(struct rq *rq)
> {
> 	return rb_entry(rq->core_tree.rb_node, struct task_struct, code_node);
> }
> 
> > +static void sched_core_flush(int cpu)
> > +{
> > +	struct rq *rq = cpu_rq(cpu);
> > +	struct task_struct *task;
> > +
> > +	while (!sched_core_empty(rq)) {
> > +		task = sched_core_first(rq);
> > +		rb_erase(&task->core_node, &rq->core_tree);
> > +		RB_CLEAR_NODE(&task->core_node);
> > +	}
> > +	rq->core->core_task_seq++;
> > +}
> 
> However,
> 
> > +	for_each_possible_cpu(cpu) {
> > +		struct rq *rq = cpu_rq(cpu);
> > +
> > +		WARN_ON_ONCE(enabled == rq->core_enabled);
> > +
> > +		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
> > +			/*
> > +			 * All active and migrating tasks will have already
> > +			 * been removed from core queue when we clear the
> > +			 * cgroup tags. However, dying tasks could still be
> > +			 * left in core queue. Flush them here.
> > +			 */
> > +			if (!enabled)
> > +				sched_core_flush(cpu);
> > +
> > +			rq->core_enabled = enabled;
> > +		}
> > +	}
> 
> I'm not sure I understand. Is the problem that we're still schedulable
> during do_exit() after cgroup_exit() ? It could be argued that when we
> leave the cgroup there, we should definitely leave the tag group too.

That is, did you forget to implement cpu_cgroup_exit()?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-01-23  1:17 ` [PATCH v10 2/5] sched: CGroup tagging interface " Joel Fernandes (Google)
                     ` (2 preceding siblings ...)
  2021-02-04 13:57   ` Peter Zijlstra
@ 2021-02-04 14:01   ` Peter Zijlstra
  2021-02-05  3:55     ` Josh Don
  2021-02-04 14:35   ` Peter Zijlstra
                     ` (5 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2021-02-04 14:01 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky

On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:

> +#ifdef CONFIG_SCHED_DEBUG
> +	/* Read the group cookie. */
> +	{
> +		.name = "core_group_cookie",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.read_u64 = cpu_core_group_cookie_read_u64,
> +	},
> +#endif

> +#ifdef CONFIG_SCHED_DEBUG
> +	/* Read the group cookie. */
> +	{
> +		.name = "core_group_cookie",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.read_u64 = cpu_core_group_cookie_read_u64,
> +	},
> +#endif

AFAICT this leaks kernel pointers. IIRC that was a bad thing.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-01-23  1:17 ` [PATCH v10 2/5] sched: CGroup tagging interface " Joel Fernandes (Google)
                     ` (3 preceding siblings ...)
  2021-02-04 14:01   ` Peter Zijlstra
@ 2021-02-04 14:35   ` Peter Zijlstra
  2021-02-05  4:07     ` Josh Don
  2021-02-04 14:52   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2021-02-04 14:35 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky

On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> +static inline void __sched_core_erase_cookie(struct sched_core_cookie *cookie)
> +{
> +	lockdep_assert_held(&sched_core_cookies_lock);
> +
> +	/* Already removed */
> +	if (RB_EMPTY_NODE(&cookie->node))
> +		return;
> +
> +	rb_erase(&cookie->node, &sched_core_cookies);
> +	RB_CLEAR_NODE(&cookie->node);
> +}
> +
> +/* Called when a task no longer points to the cookie in question */
> +static void sched_core_put_cookie(struct sched_core_cookie *cookie)
> +{
> +	unsigned long flags;
> +
> +	if (!cookie)
> +		return;
> +
> +	if (refcount_dec_and_test(&cookie->refcnt)) {
> +		raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
> +		__sched_core_erase_cookie(cookie);
> +		raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
> +		kfree(cookie);
> +	}
> +}

> +static void __sched_core_update_cookie(struct task_struct *p)
> +{

> +	raw_spin_lock(&sched_core_cookies_lock);

> +		/*
> +		 * Cookie exists, increment refcnt. If refcnt is currently 0,
> +		 * we're racing with a put() (refcnt decremented but cookie not
> +		 * yet removed from the tree). In this case, we can simply
> +		 * perform the removal ourselves and retry.
> +		 * sched_core_put_cookie() will still function correctly.
> +		 */
> +		if (unlikely(!refcount_inc_not_zero(&match->refcnt))) {
> +			__sched_core_erase_cookie(match);
> +			goto retry;
> +		}

refcount_dec_and_lock() avoids that complication.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-01-23  1:17 ` [PATCH v10 2/5] sched: CGroup tagging interface " Joel Fernandes (Google)
                     ` (4 preceding siblings ...)
  2021-02-04 14:35   ` Peter Zijlstra
@ 2021-02-04 14:52   ` Peter Zijlstra
  2021-02-05 16:37     ` Joel Fernandes
  2021-02-05 11:41   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2021-02-04 14:52 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky

On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> +static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie,
> +				     enum sched_core_cookie_type cookie_type)
> +{
> +	struct rq_flags rf;
> +	struct rq *rq;
> +
> +	if (!p)
> +		return;
> +
> +	rq = task_rq_lock(p, &rf);
> +
> +	switch (cookie_type) {
> +	case sched_core_task_cookie_type:
> +		p->core_task_cookie = cookie;
> +		break;
> +	case sched_core_group_cookie_type:
> +		p->core_group_cookie = cookie;
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +	}
> +
> +	/* Set p->core_cookie, which is the overall cookie */
> +	__sched_core_update_cookie(p);
> +
> +	if (sched_core_enqueued(p)) {
> +		sched_core_dequeue(rq, p);
> +		if (!p->core_cookie) {
> +			task_rq_unlock(rq, p, &rf);
> +			return;
> +		}
> +	}
> +
> +	if (sched_core_enabled(rq) &&
> +	    p->core_cookie && task_on_rq_queued(p))
> +		sched_core_enqueue(task_rq(p), p);
> +
> +	/*
> +	 * If task is currently running or waking, it may not be compatible
> +	 * anymore after the cookie change, so enter the scheduler on its CPU
> +	 * to schedule it away.
> +	 */
> +	if (task_running(rq, p) || p->state == TASK_WAKING)
> +		resched_curr(rq);

I'm not immediately seeing the need for that WAKING test. Since we're
holding it's rq->lock, the only place that task can be WAKING is on the
wake_list. And if it's there, it needs to acquire rq->lock to get
enqueued, and rq->lock again to get scheduled.

What am I missing?

> +
> +	task_rq_unlock(rq, p, &rf);
> +}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-04 13:57   ` Peter Zijlstra
@ 2021-02-04 20:52     ` Chris Hyser
  2021-02-05 10:43       ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Chris Hyser @ 2021-02-04 20:52 UTC (permalink / raw)
  To: Peter Zijlstra, Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, Ben Segall, Josh Don, Hao Luo,
	Tom Lendacky

On 2/4/21 8:57 AM, Peter Zijlstra wrote:
> On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
>> +/* Request the scheduler to share a core */
>> +#define PR_SCHED_CORE_SHARE		59
>> +# define PR_SCHED_CORE_CLEAR		0  /* clear core_sched cookie of pid */
>> +# define PR_SCHED_CORE_SHARE_FROM	1  /* get core_sched cookie from pid */
>> +# define PR_SCHED_CORE_SHARE_TO		2  /* push core_sched cookie to pid */
> 
> Why ?

The simplest interface would be a single 'set' command that specifies and sets a cookie value, with 0 as a special
value to clear it. However, an early requirement that people seemed to agree with is that cookies should be opaque
and system-guaranteed unique except when explicitly shared. Thus, since two tasks cannot share a cookie by
explicitly setting the same cookie value, the prctl() must provide a means of cookie sharing between tasks. The v9
proposal had incorporated all of this into a single "from-only" command whose actions depended on the state of the
two tasks. If neither has a cookie and one shares from the other, they both get the same new cookie. If the calling
task had one and the other didn't, the calling task's cookie was cleared. And of course if the src task has a cookie,
the caller just gets it. It does a lot, is a tad overloaded, and is still insufficient.

A second complication was the decision that new processes (not threads) do not inherit their parent's cookie, so
forking is also not a means to share a cookie. Basically, with a "from-only" interface, the new task would need to be
modified to call prctl() itself. From-only also does not allow setting a cookie on an unmodified, already running
task. This can be fixed by providing both a "to" and a "from" sharing interface, which allows helper programs to
construct arbitrary configurations from unmodified programs.
> Also, how do I set a unique cookie on myself with this interface?

The v10 patch still uses the overloaded v9 mechanism (which, as mentioned above, is that if two tasks without cookies
share a cookie, they get a new shared unique cookie). Yes, that is clearly inconsistent and kludgy. The mechanism is
documented in the docs, but clearly not obvious from the interface above. I think we got a bit overzealous in patch
squashing and much of this verbiage should have been in the combined commit message.

So based on the above, how about we add a "create" to pair with "clear", and call it "create" rather than "set" since
we are creating a unique opaque cookie versus setting a particular value. And as mentioned, because one can't specify
a cookie directly but only through sharing relationships, we need both "to" and "from" to make it completely usable.

So we end up with something like this:
     PR_SCHED_CORE_CREATE                       -- give yourself a unique cookie
     PR_SCHED_CORE_CLEAR                        -- clear your core sched cookie
     PR_SCHED_CORE_SHARE_FROM <src_task>        -- get their cookie for you
     PR_SCHED_CORE_SHARE_TO   <dest_task>       -- push your cookie to them
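
A helper built on the "from"/"to" pair already in v10 could look roughly like this (prctl values as defined in the
series' header, error handling trimmed):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/prctl.h>
    #include <sys/types.h>

    #ifndef PR_SCHED_CORE_SHARE
    #define PR_SCHED_CORE_SHARE       59
    #define PR_SCHED_CORE_SHARE_FROM   1
    #define PR_SCHED_CORE_SHARE_TO     2
    #endif

    /*
     * Usage: coresched-join <src_tid> <dst_tid>
     * Pull src_tid's cookie onto ourselves, then push it to dst_tid, so that
     * neither task needs to be modified to call prctl() itself.
     */
    int main(int argc, char **argv)
    {
            pid_t src, dst;

            if (argc < 3)
                    return 1;

            src = atoi(argv[1]);
            dst = atoi(argv[2]);

            if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, src, 0, 0))
                    perror("PR_SCHED_CORE_SHARE_FROM");
            if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, dst, 0, 0))
                    perror("PR_SCHED_CORE_SHARE_TO");

            return 0;
    }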

An additional question is whether the inheritability of a process' cookie should be configurable. The current code
gives the child process its own unique cookie if the parent had a cookie. That is useful in some cases, but many
other configurations could be made much easier with inheritance.

If configurable cookie inheritance would be useful, it might look something like this:

PR_SCHED_CORE_CHILD_INHERIT <0/1>  -- 1: the child inherits the cookie from its parent. 0: if the parent has a
cookie, the child process gets a unique cookie.

-chrish

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-04 13:54   ` Peter Zijlstra
@ 2021-02-05  3:45     ` Josh Don
  0 siblings, 0 replies; 34+ messages in thread
From: Josh Don @ 2021-02-05  3:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Ben Segall, Hao Luo,
	Tom Lendacky

On Thu, Feb 4, 2021 at 5:54 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
>
> I'm thinking this is too much credit, I didn't write much of this.
>
> > Marks all tasks in a cgroup as matching for core-scheduling.
> >
> > A task will need to be moved into the core scheduler queue when the cgroup
> > it belongs to is tagged to run with core scheduling.  Similarly the task
> > will need to be moved out of the core scheduler queue when the cgroup
> > is untagged.
> >
> > Also after we forked a task, its core scheduler queue's presence will
> > need to be updated according to its new cgroup's status.
> >
> > Use stop machine mechanism to update all tasks in a cgroup to prevent a
> > new task from sneaking into the cgroup, and missed out from the update
> > while we iterates through all the tasks in the cgroup.  A more complicated
> > scheme could probably avoid the stop machine.  Such scheme will also
> > need to resovle inconsistency between a task's cgroup core scheduling
> > tag and residency in core scheduler queue.
> >
> > We are opting for the simple stop machine mechanism for now that avoids
> > such complications.
> >
> > Core scheduler has extra overhead.  Enable it only for core with
> > more than one SMT hardware threads.
>
> Very little actual words on the desired and implemented semantics of the
> interface, while the patch contains much non-obvious complication.

Ack to both, will fix in the next posting.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-04 14:01   ` Peter Zijlstra
@ 2021-02-05  3:55     ` Josh Don
  0 siblings, 0 replies; 34+ messages in thread
From: Josh Don @ 2021-02-05  3:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Ben Segall, Hao Luo,
	Tom Lendacky

On Thu, Feb 4, 2021 at 6:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
>
> > +#ifdef CONFIG_SCHED_DEBUG
> > +     /* Read the group cookie. */
> > +     {
> > +             .name = "core_group_cookie",
> > +             .flags = CFTYPE_NOT_ON_ROOT,
> > +             .read_u64 = cpu_core_group_cookie_read_u64,
> > +     },
> > +#endif
>
> > +#ifdef CONFIG_SCHED_DEBUG
> > +     /* Read the group cookie. */
> > +     {
> > +             .name = "core_group_cookie",
> > +             .flags = CFTYPE_NOT_ON_ROOT,
> > +             .read_u64 = cpu_core_group_cookie_read_u64,
> > +     },
> > +#endif
>
> AFAICT this leaks kernel pointers. IIRC that was a bad thing.

For that matter, we're also exposing the cookie pointer in
/proc/$pid/sched. Currently these are used by the selftests to
validate that two tasks are/aren't sharing.  If this poses a risk, we
can rework to avoid exposing the actual pointers.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-04 14:35   ` Peter Zijlstra
@ 2021-02-05  4:07     ` Josh Don
  0 siblings, 0 replies; 34+ messages in thread
From: Josh Don @ 2021-02-05  4:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Ben Segall, Hao Luo,
	Tom Lendacky

On Thu, Feb 4, 2021 at 6:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> refcount_dec_and_lock() avoids that complication.

There isn't currently such an interface for raw_spin_locks. We could
certainly add one in a separate patch as part of this series, but it
doesn't seem that bad to do the locking inline.
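
That is, roughly this inline pattern (a sketch using the names from the
quoted patch, not the patch itself):

static void sched_core_put_cookie(struct sched_core_cookie *cookie)
{
	if (!refcount_dec_and_test(&cookie->refcnt))
		return;

	/*
	 * Without a refcount_dec_and_lock() for raw_spinlock_t, a cookie
	 * with refcount zero can briefly remain visible in the tree; the
	 * lookup side copes with that via the refcount_inc_not_zero()
	 * retry discussed elsewhere in this thread.
	 */
	raw_spin_lock(&sched_core_cookies_lock);
	rb_erase(&cookie->node, &sched_core_cookies);
	raw_spin_unlock(&sched_core_cookies_lock);

	kfree(cookie);
}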

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-04 20:52     ` Chris Hyser
@ 2021-02-05 10:43       ` Peter Zijlstra
  2021-02-05 22:19         ` Chris Hyser
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2021-02-05 10:43 UTC (permalink / raw)
  To: Chris Hyser
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, Ben Segall, Josh Don, Hao Luo,
	Tom Lendacky

On Thu, Feb 04, 2021 at 03:52:55PM -0500, Chris Hyser wrote:

> A second complication was a decision that new processes (not threads) do not
> inherit their parents cookie. Thus forking is also not a means to share a
> cookie. Basically with a "from-only" interface, the new task would need to
> be modified to call prctl() itself. From-only also does not allow for
> setting a cookie on an unmodified already running task. This can be fixed by
> providing both a "to" and "from" sharing interface that allows helper
> programs to construct arbitrary configurations from unmodified programs.

Do we really want to inhibit on fork() or would exec() be a better
place? What about those applications that use fork() based workers?

> > Also, how do I set a unique cookie on myself with this interface?
> 
> The v10 patch still uses the overloaded v9 mechanism (which as mentioned
> above is if two tasks w/o cookies share a cookie they get a new shared
> unique cookie). Yes, that is clearly an inconsistency and kludgy. The
> mechanism is documented in the docs, but clearly not obvious from the

I've not seen a document so far (also, I'm not one to actually read
documentation, much preferring comments and Changelogs).

> So based on the above, how about we add a "create" to pair with "clear" and
> call it "create" vs "set" since we are creating a unique opaque cookie
> versus setting a particular value. And as mentioned, because one can't
> specify a cookie directly but only thru sharing relationships, we need both
> "to" and "from" to make it completely usable.
> 
> So we end up with something like this:
>     PR_SCHED_CORE_CREATE                       -- give yourself a unique cookie
>     PR_SCHED_CORE_CLEAR                        -- clear your core sched cookie
>     PR_SCHED_CORE_SHARE_FROM <src_task>        -- get their cookie for you
>     PR_SCHED_CORE_SHARE_TO   <dest_task>       -- push your cookie to them

I'm still wondering why we need _FROM/_TO. What exactly will we miss
with just _SHARE <pid>?

	current		arg_task
	<none>		<none>		-EDAFT
	<none>		<cookie>	current gets cookie
	<cookie>	<none>		arg_task gets cookie
	<cookie>	<cookie>	-EDAFTER

(I have a suspicion, but I want to see it spelled out).

Also, do we want this interface to be able to work on processes? Things
like fcntl(F_SETOWN_EX) allow you to specify a PID type.

> An additional question is should the inheritability of a process' cookie be
> configurable? The current code gives the child process their own unique
> cookie if the parent had a cookie. That is useful in some cases, but many
> other configurations could be made much easier with inheritance.

What was the argument for not following the traditional fork() semantics
and inheriting everything?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-01-23  1:17 ` [PATCH v10 2/5] sched: CGroup tagging interface " Joel Fernandes (Google)
                     ` (5 preceding siblings ...)
  2021-02-04 14:52   ` Peter Zijlstra
@ 2021-02-05 11:41   ` Peter Zijlstra
  2021-02-05 11:52   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2021-02-05 11:41 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky

On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
>  	exit_creds(tsk);
>  	delayacct_tsk_free(tsk);
>  	put_signal_struct(tsk->signal);
> +	sched_tsk_free(tsk);
>  
>  	if (!profile_handoff_task(tsk))
>  		free_task(tsk);

> +struct sched_core_task_cookie {
> +	refcount_t refcnt;
> +	struct work_struct work; /* to free in WQ context. */;
> +};

> +static void sched_core_put_cookie_work(struct work_struct *ws);
> +
> +/* Caller has to call sched_core_get() if non-zero value is returned. */
> +static unsigned long sched_core_alloc_task_cookie(void)
> +{
> +	struct sched_core_task_cookie *ck =
> +		kmalloc(sizeof(struct sched_core_task_cookie), GFP_KERNEL);
> +
> +	if (!ck)
> +		return 0;
> +	refcount_set(&ck->refcnt, 1);
> +	INIT_WORK(&ck->work, sched_core_put_cookie_work);
> +
> +	return (unsigned long)ck;
> +}

> +static void sched_core_put_task_cookie(unsigned long cookie)
> +{
> +	struct sched_core_task_cookie *ptr =
> +		(struct sched_core_task_cookie *)cookie;
> +
> +	if (refcount_dec_and_test(&ptr->refcnt))
> +		kfree(ptr);
> +}

> +static void sched_core_put_cookie_work(struct work_struct *ws)
> +{
> +	struct sched_core_task_cookie *ck =
> +		container_of(ws, struct sched_core_task_cookie, work);
> +
> +	sched_core_put_task_cookie((unsigned long)ck);
> +	sched_core_put();
> +}

> +void sched_tsk_free(struct task_struct *tsk)
> +{
> +	struct sched_core_task_cookie *ck;
> +
> +	sched_core_put_cookie((struct sched_core_cookie *)tsk->core_cookie);
> +
> +	if (!tsk->core_task_cookie)
> +		return;
> +
> +	ck = (struct sched_core_task_cookie *)tsk->core_task_cookie;
> +	queue_work(system_wq, &ck->work);
> +}

*groan*.. so the purpose of that work is to be able to disable
core-scheduling after the last such task dies.

Maybe add sched_core_put_async() instead?
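
Something like the below, perhaps -- a sketch only, assuming the intent is
to batch deferred puts behind one work item instead of embedding a
work_struct in every task cookie:

static atomic_t sched_core_deferred_puts = ATOMIC_INIT(0);

static void sched_core_put_work_fn(struct work_struct *work)
{
	int puts = atomic_xchg(&sched_core_deferred_puts, 0);

	while (puts--)
		sched_core_put();
}
static DECLARE_WORK(sched_core_put_work, sched_core_put_work_fn);

/* Callable from contexts that cannot take the final put synchronously. */
void sched_core_put_async(void)
{
	atomic_inc(&sched_core_deferred_puts);
	queue_work(system_wq, &sched_core_put_work);
}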

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-01-23  1:17 ` [PATCH v10 2/5] sched: CGroup tagging interface " Joel Fernandes (Google)
                     ` (6 preceding siblings ...)
  2021-02-05 11:41   ` Peter Zijlstra
@ 2021-02-05 11:52   ` Peter Zijlstra
  2021-02-06  1:15     ` Josh Don
  2021-02-05 12:00   ` Peter Zijlstra
  2021-02-23  4:00   ` Chris Hyser
  9 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2021-02-05 11:52 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky

On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:

> +/* All active sched_core_cookies */
> +static struct rb_root sched_core_cookies = RB_ROOT;
> +static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);

> +/*
> + * Returns the following:
> + * a < b  => -1
> + * a == b => 0
> + * a > b  => 1
> + */
> +static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
> +				 const struct sched_core_cookie *b)
> +{
> +#define COOKIE_CMP_RETURN(field) do {		\
> +	if (a->field < b->field)		\
> +		return -1;			\
> +	else if (a->field > b->field)		\
> +		return 1;			\
> +} while (0)					\
> +
> +	COOKIE_CMP_RETURN(task_cookie);
> +	COOKIE_CMP_RETURN(group_cookie);
> +
> +	/* all cookie fields match */
> +	return 0;
> +
> +#undef COOKIE_CMP_RETURN
> +}

AFAICT all this madness exists because cgroup + task interaction, yet
none of that code is actually dependent on cgroups being on.

So this seems to implement semantics that will make two tasks that share
a cookie, but are then placed in different cgroups not actually share.

Is that desired? Can we justify these semantics and the resulting code
complexity.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-01-23  1:17 ` [PATCH v10 2/5] sched: CGroup tagging interface " Joel Fernandes (Google)
                     ` (7 preceding siblings ...)
  2021-02-05 11:52   ` Peter Zijlstra
@ 2021-02-05 12:00   ` Peter Zijlstra
  2021-02-23  4:00   ` Chris Hyser
  9 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2021-02-05 12:00 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky

On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> +int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
> +{
> +	static DEFINE_MUTEX(sched_core_tasks_mutex);
> +	unsigned long cookie;
> +	int ret = -ENOMEM;
> +
> +	mutex_lock(&sched_core_tasks_mutex);
> +
> +	if (!t2) {
> +		if (t1->core_task_cookie) {
> +			sched_core_put_task_cookie(t1->core_task_cookie);
> +			sched_core_update_task_cookie(t1, 0);
> +			sched_core_put();
> +		}

So this seems to be the bit that implements _CLEAR. ISTR there were
security implications / considerations here.

When the machine is vulnerable to L1TF/MDS and the like, clearing the
cookie would gain privilege and should thus be subject to some checks,
but I can'd find anything.

At the very least that deserves a comment I'm thinking.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-04 14:52   ` Peter Zijlstra
@ 2021-02-05 16:37     ` Joel Fernandes
  0 siblings, 0 replies; 34+ messages in thread
From: Joel Fernandes @ 2021-02-05 16:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky

On Thu, Feb 04, 2021 at 03:52:53PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> > +static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie,
> > +				     enum sched_core_cookie_type cookie_type)
> > +{
> > +	struct rq_flags rf;
> > +	struct rq *rq;
> > +
> > +	if (!p)
> > +		return;
> > +
> > +	rq = task_rq_lock(p, &rf);
> > +
> > +	switch (cookie_type) {
> > +	case sched_core_task_cookie_type:
> > +		p->core_task_cookie = cookie;
> > +		break;
> > +	case sched_core_group_cookie_type:
> > +		p->core_group_cookie = cookie;
> > +		break;
> > +	default:
> > +		WARN_ON_ONCE(1);
> > +	}
> > +
> > +	/* Set p->core_cookie, which is the overall cookie */
> > +	__sched_core_update_cookie(p);
> > +
> > +	if (sched_core_enqueued(p)) {
> > +		sched_core_dequeue(rq, p);
> > +		if (!p->core_cookie) {
> > +			task_rq_unlock(rq, p, &rf);
> > +			return;
> > +		}
> > +	}
> > +
> > +	if (sched_core_enabled(rq) &&
> > +	    p->core_cookie && task_on_rq_queued(p))
> > +		sched_core_enqueue(task_rq(p), p);
> > +
> > +	/*
> > +	 * If task is currently running or waking, it may not be compatible
> > +	 * anymore after the cookie change, so enter the scheduler on its CPU
> > +	 * to schedule it away.
> > +	 */
> > +	if (task_running(rq, p) || p->state == TASK_WAKING)
> > +		resched_curr(rq);
> 
> I'm not immediately seeing the need for that WAKING test. Since we're
> holding it's rq->lock, the only place that task can be WAKING is on the
> wake_list. And if it's there, it needs to acquire rq->lock to get
> enqueued, and rq->lock again to get scheduled.
> 
> What am I missing?

Hi Peter,

I did it this way following a similar pattern in affine_move_task(). However,
I think you are right. Unlike in the case of affine_move_task(), we have
schedule() to do the right thing for us in case of any races with wakeup. So
the TASK_WAKING test is indeed not needed and we can drop that test. Apologies
for adding the extra test out of paranoia.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-04 13:59     ` Peter Zijlstra
@ 2021-02-05 16:42       ` Joel Fernandes
  0 siblings, 0 replies; 34+ messages in thread
From: Joel Fernandes @ 2021-02-05 16:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky

Hi Peter,

On Thu, Feb 04, 2021 at 02:59:58PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 03, 2021 at 05:51:15PM +0100, Peter Zijlstra wrote:
> > 
> > I'm slowly starting to go through this...
> > 
> > On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> > > +static bool sched_core_empty(struct rq *rq)
> > > +{
> > > +	return RB_EMPTY_ROOT(&rq->core_tree);
> > > +}
> > > +
> > > +static struct task_struct *sched_core_first(struct rq *rq)
> > > +{
> > > +	struct task_struct *task;
> > > +
> > > +	task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
> > > +	return task;
> > > +}
> > 
> > AFAICT you can do with:
> > 
> > static struct task_struct *sched_core_any(struct rq *rq)
> > {
> > 	return rb_entry(rq->core_tree.rb_node, struct task_struct, code_node);
> > }
> > 
> > > +static void sched_core_flush(int cpu)
> > > +{
> > > +	struct rq *rq = cpu_rq(cpu);
> > > +	struct task_struct *task;
> > > +
> > > +	while (!sched_core_empty(rq)) {
> > > +		task = sched_core_first(rq);
> > > +		rb_erase(&task->core_node, &rq->core_tree);
> > > +		RB_CLEAR_NODE(&task->core_node);
> > > +	}
> > > +	rq->core->core_task_seq++;
> > > +}
> > 
> > However,
> > 
> > > +	for_each_possible_cpu(cpu) {
> > > +		struct rq *rq = cpu_rq(cpu);
> > > +
> > > +		WARN_ON_ONCE(enabled == rq->core_enabled);
> > > +
> > > +		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
> > > +			/*
> > > +			 * All active and migrating tasks will have already
> > > +			 * been removed from core queue when we clear the
> > > +			 * cgroup tags. However, dying tasks could still be
> > > +			 * left in core queue. Flush them here.
> > > +			 */
> > > +			if (!enabled)
> > > +				sched_core_flush(cpu);
> > > +
> > > +			rq->core_enabled = enabled;
> > > +		}
> > > +	}
> > 
> > I'm not sure I understand. Is the problem that we're still schedulable
> > during do_exit() after cgroup_exit() ?

Yes, exactly. Tim had written this code in the original patches and it
carried over (I was not involved at that time). IIRC, the issue is that a
task's exit can race with core scheduling being disabled. Even after core
sched is disabled, the dying task can still be left in the core rb tree and
needs to be removed. Otherwise it causes crashes.

> > It could be argued that when we
> > leave the cgroup there, we should definitely leave the tag group too.
> 
> That is, did you forget to implement cpu_cgroup_exit()?

Yes, I think it is better to implement it in cpu_cgroup_exit().

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-05 10:43       ` Peter Zijlstra
@ 2021-02-05 22:19         ` Chris Hyser
  0 siblings, 0 replies; 34+ messages in thread
From: Chris Hyser @ 2021-02-05 22:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, Ben Segall, Josh Don, Hao Luo,
	Tom Lendacky

On 2/5/21 5:43 AM, Peter Zijlstra wrote:
> On Thu, Feb 04, 2021 at 03:52:55PM -0500, Chris Hyser wrote:
> 
>> A second complication was a decision that new processes (not threads) do not
>> inherit their parents cookie. Thus forking is also not a means to share a
>> cookie. Basically with a "from-only" interface, the new task would need to
>> be modified to call prctl() itself. From-only also does not allow for
>> setting a cookie on an unmodified already running task. This can be fixed by
>> providing both a "to" and "from" sharing interface that allows helper
>> programs to construct arbitrary configurations from unmodified programs.
> 
> Do we really want to inhibit on fork() or would exec() be a better
> place? What about those applications that use fork() based workers?

I like exec-time, as a fan of fork-based workers. I suppose the counter-argument would be security, but the new process 
is in a state where it can be trusted to lower its privileges, change permissions, core sched cookie, etc. before it 
makes itself differently vulnerable, and because it is the same code, it "knows" whether it did.

>>> Also, how do I set a unique cookie on myself with this interface?
>>
>> The v10 patch still uses the overloaded v9 mechanism (which as mentioned
>> above is if two tasks w/o cookies share a cookie they get a new shared
>> unique cookie). Yes, that is clearly an inconsistency and kludgy. The
>> mechanism is documented in the docs, but clearly not obvious from the
> 
> I've not seen a document so far (also, I'm not one to actually read
> documentation, much preferring comments and Changelogs).

Understood. I think we should split this patch into separate prctl and cgroup patches. The rationale decided here would 
then go into the prctl patch commit message. We can also use this split to address any dependencies we've created on 
cgroups that you mentioned in a different email.

>> So based on the above, how about we add a "create" to pair with "clear" and
>> call it "create" vs "set" since we are creating a unique opaque cookie
>> versus setting a particular value. And as mentioned, because one can't
>> specify a cookie directly but only thru sharing relationships, we need both
>> "to" and "from" to make it completely usable.
>>
>> So we end up with something like this:
>>      PR_SCHED_CORE_CREATE                       -- give yourself a unique cookie
>>      PR_SCHED_CORE_CLEAR                        -- clear your core sched cookie
>>      PR_SCHED_CORE_SHARE_FROM <src_task>        -- get their cookie for you
>>      PR_SCHED_CORE_SHARE_TO   <dest_task>       -- push your cookie to them
> 
> I'm still wondering why we need _FROM/_TO. What exactly will we miss
> with just _SHARE <pid>?
> 
> 	current		arg_task
> 	<none>		<none>		-EDAFT
> 	<none>		<cookie>	current gets cookie
> 	<cookie>	<none>		arg_task gets cookie
> 	<cookie>	<cookie>	-EDAFTER
> 
> (I have a suspicion, but I want to see it spelled out).

The min requirements of the interface I see are:

1. create my own cookie
2. clear my own cookie
3. create a cookie for a running unmodified task
4. clear a cookie for a running unmodified task
5. copy a cookie from one running unmodified task to another unmodified task

So from your example above:
 > 	<none>		<cookie>	current gets cookie

could also mean clear the cookie of arg_task and satisfy requirement 4 above.

"Share" is a fuzzy term. I should have used COPY as that is more the semantics I was thinking ... specified directional 
transfer. So we could have a single command with two arguments where argument order determines direction. In the v10 
interface proposal, as one argument, current, was always implied, direction had to be specified in the command.

So a single copy command could be something like:

PR_SCHED_CORE_COPY   <src_task> <dst_task>

to replace the two. The very first util you write to do anything useful w/ all of this is a "copy_cookie <spid> 
<dpid>". :-)

> Also, do we want this interface to be able to work on processes? Things
> like fcntl(F_SETOWN_EX) allow you to specify a PID type.

Yes, and creating/clearing a cookie on a PGID or SID seems like a useful shortcut to look into.

>> An additional question is should the inheritability of a process' cookie be
>> configurable? The current code gives the child process their own unique
>> cookie if the parent had a cookie. That is useful in some cases, but many
>> other configurations could be made much easier with inheritance.
> 
> What was the argument for not following the traditional fork() semantics
> and inheriting everything?

The code just showed up with no explanation :-), but I think I know what was intended, and it touches on the same 
security-policy type of problem you mentioned in a comment on the CLEAR code. In a secure context, you can't just allow a 
random user to clear their cookie, i.e. make themselves trusted. At the same time, in a non-secure context (and several 
such use cases have been put forward), I can't think of anything more annoying than being able to set a cookie on my task 
and then not having permission to clear it.

The fork scenario has a similar problem. A child inheriting the cookie means just that, but not inheriting likely means 
something different depending on whether the context is secure vs non-secure (and obviously we can't use those terms. Who 
wants to advocate for non-security :-). For non-secure, "don't inherit" means the child gets no cookie; secure means the 
child gets its own unique cookie and not the parent's. In the absence of any way to set a policy, the current code chose 
the secure default, which makes sense.

So I guess that raises the ugly question, do we need some kind of globally scoped, "secure/not-secure" core sched policy 
flag?

-chrish

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-05 11:52   ` Peter Zijlstra
@ 2021-02-06  1:15     ` Josh Don
  0 siblings, 0 replies; 34+ messages in thread
From: Josh Don @ 2021-02-06  1:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Ben Segall, Hao Luo,
	Tom Lendacky

On Fri, Feb 5, 2021 at 3:53 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
>
> > +/* All active sched_core_cookies */
> > +static struct rb_root sched_core_cookies = RB_ROOT;
> > +static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
>
> > +/*
> > + * Returns the following:
> > + * a < b  => -1
> > + * a == b => 0
> > + * a > b  => 1
> > + */
> > +static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
> > +                              const struct sched_core_cookie *b)
> > +{
> > +#define COOKIE_CMP_RETURN(field) do {                \
> > +     if (a->field < b->field)                \
> > +             return -1;                      \
> > +     else if (a->field > b->field)           \
> > +             return 1;                       \
> > +} while (0)                                  \
> > +
> > +     COOKIE_CMP_RETURN(task_cookie);
> > +     COOKIE_CMP_RETURN(group_cookie);
> > +
> > +     /* all cookie fields match */
> > +     return 0;
> > +
> > +#undef COOKIE_CMP_RETURN
> > +}
>
> AFAICT all this madness exists because cgroup + task interaction, yet
> none of that code is actually dependent on cgroups being on.
>
> So this seems to implement semantics that will make two tasks that share
> a cookie, but are then placed in different cgroups not actually share.
>
> Is that desired? Can we justify these semantics and the resulting code
> complexity.

Yes, that is the desired result. IMO it would be less optimal from an
interface perspective if we were to instead have the group or task cookie
override the other. Joel gave some additional justification here:
https://lkml.org/lkml/2020/12/6/389.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-01-23  1:17 ` [PATCH v10 2/5] sched: CGroup tagging interface " Joel Fernandes (Google)
                     ` (8 preceding siblings ...)
  2021-02-05 12:00   ` Peter Zijlstra
@ 2021-02-23  4:00   ` Chris Hyser
  2021-02-23  9:05     ` Peter Zijlstra
  9 siblings, 1 reply; 34+ messages in thread
From: Chris Hyser @ 2021-02-23  4:00 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, Ben Segall,
	Josh Don, Hao Luo, Tom Lendacky

On 1/22/21 8:17 PM, Joel Fernandes (Google) wrote:
> +static void __sched_core_update_cookie(struct task_struct *p)
> +{
> +	struct rb_node *parent, **node;
> +	struct sched_core_cookie *node_core_cookie, *match;
> +	static const struct sched_core_cookie zero_cookie;
> +	struct sched_core_cookie requested_cookie;
> +	bool is_zero_cookie;
> +	struct sched_core_cookie *const curr_cookie =
> +		(struct sched_core_cookie *)p->core_cookie;
> +
> +	/*
> +	 * Ensures that we do not cause races/corruption by modifying/reading
> +	 * task cookie fields. Also ensures task cannot get migrated.
> +	 */
> +	lockdep_assert_held(rq_lockp(task_rq(p)));
> +
> +	sched_core_cookie_init_from_task(&requested_cookie, p);
> +
> +	is_zero_cookie = !sched_core_cookie_cmp(&requested_cookie, &zero_cookie);
> +
> +	/*
> +	 * Already have a cookie matching the requested settings? Nothing to
> +	 * do.
> +	 */
> +	if ((curr_cookie && !sched_core_cookie_cmp(curr_cookie, &requested_cookie)) ||
> +	    (!curr_cookie && is_zero_cookie))
> +		return;
> +
> +	raw_spin_lock(&sched_core_cookies_lock);
> +
> +	if (is_zero_cookie) {
> +		match = NULL;
> +		goto finish;
> +	}
> +
> +retry:
> +	match = NULL;
> +
> +	node = &sched_core_cookies.rb_node;
> +	parent = *node;
> +	while (*node) {
> +		int cmp;
> +
> +		node_core_cookie =
> +			container_of(*node, struct sched_core_cookie, node);
> +		parent = *node;
> +
> +		cmp = sched_core_cookie_cmp(&requested_cookie, node_core_cookie);
> +		if (cmp < 0) {
> +			node = &parent->rb_left;
> +		} else if (cmp > 0) {
> +			node = &parent->rb_right;
> +		} else {
> +			match = node_core_cookie;
> +			break;
> +		}
> +	}
> +
> +	if (!match) {
> +		/* No existing cookie; create and insert one */
> +		match = kmalloc(sizeof(struct sched_core_cookie), GFP_ATOMIC);
> +
> +		/* Fall back to zero cookie */
> +		if (WARN_ON_ONCE(!match))
> +			goto finish;
> +
> +		*match = requested_cookie;
> +		refcount_set(&match->refcnt, 1);
> +
> +		rb_link_node(&match->node, parent, node);
> +		rb_insert_color(&match->node, &sched_core_cookies);
> +	} else {
> +		/*
> +		 * Cookie exists, increment refcnt. If refcnt is currently 0,
> +		 * we're racing with a put() (refcnt decremented but cookie not
> +		 * yet removed from the tree). In this case, we can simply
> +		 * perform the removal ourselves and retry.
> +		 * sched_core_put_cookie() will still function correctly.
> +		 */
> +		if (unlikely(!refcount_inc_not_zero(&match->refcnt))) {
> +			__sched_core_erase_cookie(match);
> +			goto retry;
> +		}
> +	}
> +
> +finish:
> +	p->core_cookie = (unsigned long)match;
> +
> +	raw_spin_unlock(&sched_core_cookies_lock);
> +
> +	sched_core_put_cookie(curr_cookie);
> +}
> +
> +/*
> + * sched_core_update_cookie - Common helper to update a task's core cookie. This
> + * updates the selected cookie field and then updates the overall cookie.
> + * @p: The task whose cookie should be updated.
> + * @cookie: The new cookie.
> + * @cookie_type: The cookie field to which the cookie corresponds.
> + */
> +static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie,
> +				     enum sched_core_cookie_type cookie_type)
> +{
> +	struct rq_flags rf;
> +	struct rq *rq;
> +
> +	if (!p)
> +		return;
> +
> +	rq = task_rq_lock(p, &rf);
> +
> +	switch (cookie_type) {
> +	case sched_core_task_cookie_type:
> +		p->core_task_cookie = cookie;
> +		break;
> +	case sched_core_group_cookie_type:
> +		p->core_group_cookie = cookie;
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +	}
> +
> +	/* Set p->core_cookie, which is the overall cookie */
> +	__sched_core_update_cookie(p);

While trying to test the new prctl() code I'm working on, I ran into a bug I chased back into this v10 code. Under a 
fair amount of stress, when the function __sched_core_update_cookie() is ultimately called from sched_core_fork(), the 
system deadlocks or otherwise non-visibly crashes. I've not had much success figuring out why/what. I'm running with 
LOCKDEP on and seeing no complaints. Duplicating it only requires setting a cookie on a task and forking a bunch of 
threads ... all of which then want to update their cookie.

-chrish

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-23  4:00   ` Chris Hyser
@ 2021-02-23  9:05     ` Peter Zijlstra
  2021-02-23 19:25       ` Chris Hyser
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2021-02-23  9:05 UTC (permalink / raw)
  To: Chris Hyser
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, Ben Segall, Josh Don, Hao Luo,
	Tom Lendacky

On Mon, Feb 22, 2021 at 11:00:37PM -0500, Chris Hyser wrote:
> On 1/22/21 8:17 PM, Joel Fernandes (Google) wrote:
> While trying to test the new prctl() code I'm working on, I ran into a bug I
> chased back into this v10 code. Under a fair amount of stress, when the
> function __sched_core_update_cookie() is ultimately called from
> sched_core_fork(), the system deadlocks or otherwise non-visibly crashes.
> I've not had much success figuring out why/what. I'm running with LOCKDEP on
> and seeing no complaints. Duplicating it only requires setting a cookie on a
> task and forking a bunch of threads ... all of which then want to update
> their cookie.

Can you share the code and reproducer?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-23  9:05     ` Peter Zijlstra
@ 2021-02-23 19:25       ` Chris Hyser
  2021-02-24  5:15         ` Josh Don
  0 siblings, 1 reply; 34+ messages in thread
From: Chris Hyser @ 2021-02-23 19:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, Ben Segall, Josh Don, Hao Luo,
	Tom Lendacky

[-- Attachment #1: Type: text/plain, Size: 1360 bytes --]

On 2/23/21 4:05 AM, Peter Zijlstra wrote:
> On Mon, Feb 22, 2021 at 11:00:37PM -0500, Chris Hyser wrote:
>> On 1/22/21 8:17 PM, Joel Fernandes (Google) wrote:
>> While trying to test the new prctl() code I'm working on, I ran into a bug I
>> chased back into this v10 code. Under a fair amount of stress, when the
>> function __sched_core_update_cookie() is ultimately called from
>> sched_core_fork(), the system deadlocks or otherwise non-visibly crashes.
>> I've not had much success figuring out why/what. I'm running with LOCKDEP on
>> and seeing no complaints. Duplicating it only requires setting a cookie on a
>> task and forking a bunch of threads ... all of which then want to update
>> their cookie.
> 
> Can you share the code and reproducer?

Attached is a tarball with c code (source) and scripts. Just run ./setup_bug which will compile the source and start a 
bash with a cs cookie. Then run ./show_bug which dumps the cookie and then fires off some processes and threads. Note 
the cs_clone command is not doing any core sched prctls for this test (not needed and currently coded for a diff prctl 
interface). It just creates processes and threads. I see this hang almost instantly.

Josh, I did verify that this occurs on Joel's coresched tree both with and w/o the kprot patch and that should exactly 
correspond to these patches.

-chrish


[-- Attachment #2: bug.tar.xz --]
[-- Type: application/x-xz, Size: 2784 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-23 19:25       ` Chris Hyser
@ 2021-02-24  5:15         ` Josh Don
  2021-02-24 13:02           ` Chris Hyser
  0 siblings, 1 reply; 34+ messages in thread
From: Josh Don @ 2021-02-24  5:15 UTC (permalink / raw)
  To: Chris Hyser
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Ben Segall, Hao Luo, Tom Lendacky

On Tue, Feb 23, 2021 at 11:26 AM Chris Hyser <chris.hyser@oracle.com> wrote:
>
> On 2/23/21 4:05 AM, Peter Zijlstra wrote:
> > On Mon, Feb 22, 2021 at 11:00:37PM -0500, Chris Hyser wrote:
> >> On 1/22/21 8:17 PM, Joel Fernandes (Google) wrote:
> >> While trying to test the new prctl() code I'm working on, I ran into a bug I
> >> chased back into this v10 code. Under a fair amount of stress, when the
> >> function __sched_core_update_cookie() is ultimately called from
> >> sched_core_fork(), the system deadlocks or otherwise non-visibly crashes.
> >> I've not had much success figuring out why/what. I'm running with LOCKDEP on
> >> and seeing no complaints. Duplicating it only requires setting a cookie on a
> >> task and forking a bunch of threads ... all of which then want to update
> >> their cookie.
> >
> > Can you share the code and reproducer?
>
> Attached is a tarball with c code (source) and scripts. Just run ./setup_bug which will compile the source and start a
> bash with a cs cookie. Then run ./show_bug which dumps the cookie and then fires off some processes and threads. Note
> the cs_clone command is not doing any core sched prctls for this test (not needed and currently coded for a diff prctl
> interface). It just creates processes and threads. I see this hang almost instantly.
>
> Josh, I did verify that this occurs on Joel's coresched tree both with and w/o the kprot patch and that should exactly
> correspond to these patches.
>
> -chrish
>

I think I've gotten to the root of this. In the fork code, our cases
for inheriting task_cookie are inverted for CLONE_THREAD vs
!CLONE_THREAD. As a result, we are creating a new cookie per-thread,
rather than inheriting from the parent. Now this is actually ok; I'm
not observing a scalability problem with creating this many cookies.
However, it means that overall throughput of your binary is cut in
~half, since none of the threads can share a core. Note that I never
saw an indefinite deadlock, just ~2x runtime for your binary vs the
control. I've verified that both a) manually hardcoding all threads to
be able to share regardless of cookie, and b) using a machine with 6
cores instead of 2, both allow your binary to complete in the same
amount of time as without the new API.
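
In other words, the intended fork-time behaviour is roughly the following
(a sketch of the semantics only; the helper names are made up and the real
sched_core_fork() hook may be shaped differently):

static void sched_core_fork_cookie(struct task_struct *child,
				   unsigned long clone_flags)
{
	if (!current->core_task_cookie)
		return;

	if (clone_flags & CLONE_THREAD) {
		/* New thread: share the parent's task cookie. */
		sched_core_share_with(child, current);
	} else {
		/* New process: give the child its own unique cookie. */
		sched_core_create_cookie(child);
	}
}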

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-24  5:15         ` Josh Don
@ 2021-02-24 13:02           ` Chris Hyser
  2021-02-24 13:52             ` chris hyser
  0 siblings, 1 reply; 34+ messages in thread
From: Chris Hyser @ 2021-02-24 13:02 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Ben Segall, Hao Luo, Tom Lendacky



On 2/24/21 12:15 AM, Josh Don wrote:
> On Tue, Feb 23, 2021 at 11:26 AM Chris Hyser <chris.hyser@oracle.com> wrote:
>>
>> On 2/23/21 4:05 AM, Peter Zijlstra wrote:
>>> On Mon, Feb 22, 2021 at 11:00:37PM -0500, Chris Hyser wrote:
>>>> On 1/22/21 8:17 PM, Joel Fernandes (Google) wrote:
>>>> While trying to test the new prctl() code I'm working on, I ran into a bug I
>>>> chased back into this v10 code. Under a fair amount of stress, when the
>>>> function __sched_core_update_cookie() is ultimately called from
>>>> sched_core_fork(), the system deadlocks or otherwise non-visibly crashes.
>>>> I've not had much success figuring out why/what. I'm running with LOCKDEP on
>>>> and seeing no complaints. Duplicating it only requires setting a cookie on a
>>>> task and forking a bunch of threads ... all of which then want to update
>>>> their cookie.
>>>
>>> Can you share the code and reproducer?
>>
>> Attached is a tarball with c code (source) and scripts. Just run ./setup_bug which will compile the source and start a
>> bash with a cs cookie. Then run ./show_bug which dumps the cookie and then fires off some processes and threads. Note
>> the cs_clone command is not doing any core sched prctls for this test (not needed and currently coded for a diff prctl
>> interface). It just creates processes and threads. I see this hang almost instantly.
>>
>> Josh, I did verify that this occurs on Joel's coresched tree both with and w/o the kprot patch and that should exactly
>> correspond to these patches.
>>
>> -chrish
>>
> 
> I think I've gotten to the root of this. In the fork code, our cases
> for inheriting task_cookie are inverted for CLONE_THREAD vs
> !CLONE_THREAD. As a result, we are creating a new cookie per-thread,
> rather than inheriting from the parent. Now this is actually ok; I'm
> not observing a scalability problem with creating this many cookies.

This isn't the issue. The test code generates cases for both CLONE_THREAD and not, and both paths call the cookie update 
code. The new code I was testing when I discovered this fixed the problem you noted.


> However, it means that overall throughput of your binary is cut in
> ~half, since none of the threads can share a core. Note that I never
> saw an indefinite deadlock, just ~2x runtime for your binary vs the control. I've verified that both a) manually hardcoding all threads to
> be able to share regardless of cookie, and b) using a machine with 6
> cores instead of 2, both allow your binary to complete in the same
> amount of time as without the new API.

This was on a 24 core box. When I run the test, it definitely hangs. I'll answer your other email as well.

-chrish

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-24 13:02           ` Chris Hyser
@ 2021-02-24 13:52             ` chris hyser
  2021-02-24 15:47               ` chris hyser
  0 siblings, 1 reply; 34+ messages in thread
From: chris hyser @ 2021-02-24 13:52 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Ben Segall, Hao Luo, Tom Lendacky

On 2/24/21 8:02 AM, Chris Hyser wrote:

>> However, it means that overall throughput of your binary is cut in
>> ~half, since none of the threads can share a core. Note that I never
>> saw an indefinite deadlock, just ~2x runtime for your binary vs the control. I've verified that both a) manually 
>> hardcoding all threads to
>> be able to share regardless of cookie, and b) using a machine with 6
>> cores instead of 2, both allow your binary to complete in the same
>> amount of time as without the new API.
> 
> This was on a 24 core box. When I run the test, it definitely hangs. I'll answer your other email as well.


I just want to clarify. The test completes in secs normally. When I run this on the 24 core box from the console, other 
ssh connections immediately freeze. The console is frozen. You can't ping the box and it has stayed that way for up to 
1/2 hour before I reset it. I'm trying to get some kind of stack trace to see what it is doing. To the extent that I've 
been able to trace it or print it, the "next code" always seems to be __sched_core_update_cookie(p);

-chrish





^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-24 13:52             ` chris hyser
@ 2021-02-24 15:47               ` chris hyser
  2021-02-26 20:07                 ` Chris Hyser
  0 siblings, 1 reply; 34+ messages in thread
From: chris hyser @ 2021-02-24 15:47 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Ben Segall, Hao Luo, Tom Lendacky

On 2/24/21 8:52 AM, chris hyser wrote:
> On 2/24/21 8:02 AM, Chris Hyser wrote:
> 
>>> However, it means that overall throughput of your binary is cut in
>>> ~half, since none of the threads can share a core. Note that I never
>>> saw an indefinite deadlock, just ~2x runtime for your binary vs the control. I've verified that both a) manually 
>>> hardcoding all threads to
>>> be able to share regardless of cookie, and b) using a machine with 6
>>> cores instead of 2, both allow your binary to complete in the same
>>> amount of time as without the new API.
>>
>> This was on a 24 core box. When I run the test, it definitely hangs. I'll answer your other email as well.
> 
> 
> I just want to clarify. The test completes in secs normally. When I run this on the 24 core box from the console, other 
> ssh connections immediately freeze. The console is frozen. You can't ping the box and it has stayed that way for up to 
> 1/2 hour before I reset it. I'm trying to get some kind of stack trace to see what it is doing. To the extent that I've 
> been able to trace it or print it, the "next code" always seems to be __sched_core_update_cookie(p);

I cannot duplicate this on a 4 core box even with 1000's of processes and threads. The 24 core box does not even create 
the full 400 processes/threads in that test before it hangs and that test reliably fails almost instantly. The working 
theory is that the 24 core box is doing way more of the clone syscalls in parallel.

-chrish

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-24 15:47               ` chris hyser
@ 2021-02-26 20:07                 ` Chris Hyser
  2021-03-01 21:01                   ` Josh Don
  0 siblings, 1 reply; 34+ messages in thread
From: Chris Hyser @ 2021-02-26 20:07 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Ben Segall, Hao Luo, Tom Lendacky

On 2/24/21 10:47 AM, chris hyser wrote:
> On 2/24/21 8:52 AM, chris hyser wrote:
>> On 2/24/21 8:02 AM, Chris Hyser wrote:
>>
>>>> However, it means that overall throughput of your binary is cut in
>>>> ~half, since none of the threads can share a core. Note that I never
>>>> saw an indefinite deadlock, just ~2x runtime for your binary vs the control. I've verified that both a) manually 
>>>> hardcoding all threads to
>>>> be able to share regardless of cookie, and b) using a machine with 6
>>>> cores instead of 2, both allow your binary to complete in the same
>>>> amount of time as without the new API.
>>>
>>> This was on a 24 core box. When I run the test, it definitely hangs. I'll answer your other email as well.
>>
>>
>> I just want to clarify. The test completes in secs normally. When I run this on the 24 core box from the console, 
>> other ssh connections immediately freeze. The console is frozen. You can't ping the box and it has stayed that way for 
>> up to 1/2 hour before I reset it. I'm trying to get some kind of stack trace to see what it is doing. To the extent 
>> that I've been able to trace it or print it, the "next code" always seems to be __sched_core_update_cookie(p);
> 
> I cannot duplicate this on a 4 core box even with 1000's of processes and threads. The 24 core box does not even create 
> the full 400 processes/threads in that test before it hangs and that test reliably fails almost instantly. The working 
> theory is that the 24 core box is doing way more of the clone syscalls in parallel.

Update:

The clone syscall stress test I have is causing a deadlock with this patchset when
compiled with CONFIG_PROVE_RAW_LOCK_NESTING=y. I am able to get stacktraces with
nmi_watchdog and am looking through those. Josh was not able to duplicate the
deadlock, instead getting an actual warning about kmalloc w/GFP_ATOMIC while
under a raw spinlock in the function __sched_core_update_cookie(). When I retry
the test with a patch Josh sent, removing the kmalloc() and thus the trigger of
the warning, no more deadlock. So some combination of CONFIGs, timing and
function calls seems to have found something in this LOCK proving code.
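
For reference, the pattern the warning points at, condensed from the quoted
__sched_core_update_cookie() above (illustration only, not new code):

static struct sched_core_cookie *alloc_cookie_under_raw_lock(void)
{
	struct sched_core_cookie *match;

	raw_spin_lock(&sched_core_cookies_lock);	/* raw_spinlock_t */

	/*
	 * kmalloc(), even with GFP_ATOMIC, can take regular spinlock_t
	 * locks inside the allocator. On PREEMPT_RT those become sleeping
	 * locks, so they must never nest inside a raw_spinlock_t section;
	 * CONFIG_PROVE_RAW_LOCK_NESTING warns about exactly this nesting,
	 * even on a non-RT build.
	 */
	match = kmalloc(sizeof(*match), GFP_ATOMIC);

	raw_spin_unlock(&sched_core_cookies_lock);
	return match;
}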

-chrish

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling
  2021-02-26 20:07                 ` Chris Hyser
@ 2021-03-01 21:01                   ` Josh Don
  0 siblings, 0 replies; 34+ messages in thread
From: Josh Don @ 2021-03-01 21:01 UTC (permalink / raw)
  To: Chris Hyser
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Ben Segall, Hao Luo, Tom Lendacky

On Fri, Feb 26, 2021 at 12:08 PM Chris Hyser <chris.hyser@oracle.com> wrote:
>
> Update:
>
> The clone syscall stress test I have is causing a deadlock with this patchset when
> compiled with CONFIG_PROVE_RAW_LOCK_NESTING=y. I am able to get stacktraces with
> nmi_watchdog and am looking through those. Josh was not able to duplicate the
> deadlock, instead getting an actual warning about kmalloc w/GFP_ATOMIC while
> under a raw spinlock in the function __sched_core_update_cookie(). When I retry
> the test with a patch Josh sent, removing the kmalloc() and thus the trigger of
> the warning, no more deadlock. So some combination of CONFIGs, timing and
> function calls seems to have found something in this LOCK proving code.
>
> -chrish

I have a patch to remove the dynamic allocation and overall simplify
the logic here. Will squash that along with the rest of the
modifications suggested by Peter in the next version.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-03-02  7:09 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-23  1:16 [PATCH v10 0/5] Core scheduling remaining patches Joel Fernandes (Google)
2021-01-23  1:17 ` [PATCH v10 1/5] sched: migration changes for core scheduling Joel Fernandes (Google)
2021-01-23  1:17 ` [PATCH v10 2/5] sched: CGroup tagging interface " Joel Fernandes (Google)
2021-02-03 16:51   ` Peter Zijlstra
2021-02-04 13:59     ` Peter Zijlstra
2021-02-05 16:42       ` Joel Fernandes
2021-02-04 13:54   ` Peter Zijlstra
2021-02-05  3:45     ` Josh Don
2021-02-04 13:57   ` Peter Zijlstra
2021-02-04 20:52     ` Chris Hyser
2021-02-05 10:43       ` Peter Zijlstra
2021-02-05 22:19         ` Chris Hyser
2021-02-04 14:01   ` Peter Zijlstra
2021-02-05  3:55     ` Josh Don
2021-02-04 14:35   ` Peter Zijlstra
2021-02-05  4:07     ` Josh Don
2021-02-04 14:52   ` Peter Zijlstra
2021-02-05 16:37     ` Joel Fernandes
2021-02-05 11:41   ` Peter Zijlstra
2021-02-05 11:52   ` Peter Zijlstra
2021-02-06  1:15     ` Josh Don
2021-02-05 12:00   ` Peter Zijlstra
2021-02-23  4:00   ` Chris Hyser
2021-02-23  9:05     ` Peter Zijlstra
2021-02-23 19:25       ` Chris Hyser
2021-02-24  5:15         ` Josh Don
2021-02-24 13:02           ` Chris Hyser
2021-02-24 13:52             ` chris hyser
2021-02-24 15:47               ` chris hyser
2021-02-26 20:07                 ` Chris Hyser
2021-03-01 21:01                   ` Josh Don
2021-01-23  1:17 ` [PATCH v10 3/5] kselftest: Add tests for core-sched interface Joel Fernandes (Google)
2021-01-23  1:17 ` [PATCH v10 4/5] Documentation: Add core scheduling documentation Joel Fernandes (Google)
2021-01-23  1:17 ` [PATCH v10 5/5] sched: Debug bits Joel Fernandes (Google)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).