linux-kernel.vger.kernel.org archive mirror
* [PATCH v8 -tip 00/26] Core scheduling
@ 2020-10-20  1:43 Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 01/26] sched: Wrap rq::lock access Joel Fernandes (Google)
                   ` (27 more replies)
  0 siblings, 28 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

Eighth iteration of the Core-Scheduling feature.

Core scheduling is a feature that allows only trusted tasks to run
concurrently on CPUs sharing compute resources (e.g., hyperthreads on a
core). The goal is to mitigate core-level side-channel attacks without
requiring SMT to be disabled (which has a significant performance impact
in some situations). Core scheduling (as of v7) mitigates user-space to
user-space attacks as well as user-to-kernel attacks when one of the
siblings enters the kernel via an interrupt or system call.

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the
same core (for now by having them in the same tagged cgroup). When a tag
is enabled in a cgroup and a task from that cgroup is running on a
hardware thread, the scheduler ensures that only idle or trusted tasks
run on the other sibling(s). Besides security concerns, this feature can
also be beneficial for RT and performance applications where we want to
control how tasks make use of SMT dynamically.
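
For illustration, below is a minimal sketch of tagging a cgroup from C so
that its member tasks may share a core. The knob name (cpu.core_tag) and
the cgroup mount point used here are assumptions made for the example;
the documentation patch in this series describes the actual interface.

  /*
   * Hedged sketch: enable the core-scheduling tag on an existing cgroup.
   * The file name "cpu.core_tag" and the mount point are assumptions;
   * see the documentation patch for the real knob.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  static int tag_cgroup(const char *cgroup_dir)
  {
          char path[256];
          int fd;

          snprintf(path, sizeof(path), "%s/cpu.core_tag", cgroup_dir);
          fd = open(path, O_WRONLY);
          if (fd < 0) {
                  perror("open");
                  return -1;
          }
          /* writing "1" turns the tag on for every task in the cgroup */
          if (write(fd, "1", 1) != 1) {
                  perror("write");
                  close(fd);
                  return -1;
          }
          close(fd);
          return 0;
  }

  int main(void)
  {
          return tag_cgroup("/sys/fs/cgroup/trusted") ? 1 : 0;
  }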

This iteration focuses on the following:
- Redesigned API.
- Rework of the Kernel Protection feature based on Thomas's entry work.
- Rework of hotplug fixes.
- Addressed review comments on v7.

Joel: Both a CGroup and a per-task interface via prctl(2) are provided for
configuring core sharing. More details are provided in the documentation
patch. Kselftests are provided to verify the correctness/rules of the
interface.
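
For the per-task path, here is a hedged sketch of how a task could opt in
via prctl(2). The request name PR_SCHED_CORE_SHARE and its argument
convention are assumptions based on this series' uapi changes; the exact
constant lives in the prctl patch, so the value below is only a
placeholder.

  #include <stdio.h>
  #include <sys/prctl.h>
  #include <sys/types.h>

  #ifndef PR_SCHED_CORE_SHARE
  #define PR_SCHED_CORE_SHARE 59  /* placeholder; see the series' prctl.h */
  #endif

  /*
   * Share a core-scheduling cookie between the caller and @pid
   * (assumed semantics for this sketch).
   */
  int share_core_with(pid_t pid)
  {
          if (prctl(PR_SCHED_CORE_SHARE, pid, 0, 0, 0) == -1) {
                  perror("prctl(PR_SCHED_CORE_SHARE)");
                  return -1;
          }
          return 0;
  }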

Julien: TPCC tests showed improvements with core scheduling over disabling
SMT (nosmt). With kernel protection enabled, core scheduling does not
regress relative to nosmt. Possibly ASI will improve performance for those
who choose kernel protection (which can be toggled through the
sched_core_protect_kernel sysctl). Results:
v8                              average      stdev        diff vs baseline
baseline (SMT on)               1197.272     44.78312824
core sched (kernel protect)      412.9895    45.42734343  -65.51%
core sched (no kernel protect)   686.6515    71.77756931  -42.65%
nosmt                            408.667     39.39042872  -65.87%

v8 is rebased on tip/master.

Future work
===========
- Load balancing/Migration fixes for core scheduling.
  As of v6, load balancing is partially coresched-aware, but has some
  issues w.r.t. process/taskgroup weights:
  https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...
- Core scheduling test framework: kselftests, torture tests etc

Changes in v8
=============
- New interface/API implementation
  - Joel
- Revised kernel protection patch
  - Joel
- Revised Hotplug fixes
  - Joel
- Minor bug fixes and address review comments
  - Vineeth

Changes in v7
=============
- Kernel protection from untrusted usermode tasks
  - Joel, Vineeth
- Fix for hotplug crashes and hangs
  - Joel, Vineeth

Changes in v6
=============
- Documentation
  - Joel
- Pause siblings on entering nmi/irq/softirq
  - Joel, Vineeth
- Fix for RCU crash
  - Joel
- Fix for a crash in pick_next_task
  - Yu Chen, Vineeth
- Minor re-write of core-wide vruntime comparison
  - Aaron Lu
- Cleanup: Address Review comments
- Cleanup: Remove hotplug support (for now)
- Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
  - Joel, Vineeth

Changes in v5
=============
- Fixes for cgroup/process tagging in corner cases such as cgroup
  destruction, tasks moving across cgroups, etc.
  - Tim Chen
- Coresched aware task migrations
  - Aubrey Li
- Other minor stability fixes.

Changes in v4
=============
- Implement a core wide min_vruntime for vruntime comparison of tasks
  across cpus in a core.
  - Aaron Lu
- Fixes a typo bug in setting the forced_idle cpu.
  - Aaron Lu

Changes in v3
=============
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
=============
- Fixes for a couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for processes on different CPUs
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

Aubrey Li (1):
sched: migration changes for core scheduling

Joel Fernandes (Google) (13):
sched/fair: Snapshot the min_vruntime of CPUs on force idle
arch/x86: Add a new TIF flag for untrusted tasks
kernel/entry: Add support for core-wide protection of kernel-mode
entry/idle: Enter and exit kernel protection during idle entry and
exit
sched: Split the cookie and setup per-task cookie on fork
sched: Add a per-thread core scheduling interface
sched: Add a second-level tag for nested CGroup usecase
sched: Release references to the per-task cookie on exit
sched: Handle task addition to CGroup
sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG
kselftest: Add tests for core-sched interface
sched: Move core-scheduler interfacing code to a new file
Documentation: Add core scheduling documentation

Peter Zijlstra (10):
sched: Wrap rq::lock access
sched: Introduce sched_class::pick_task()
sched: Core-wide rq->lock
sched/fair: Add a few assertions
sched: Basic tracking of matching tasks
sched: Add core wide task selection and scheduling.
sched: Trivial forced-newidle balancer
irq_work: Cleanup
sched: cgroup tagging interface for core scheduling
sched: Debug bits...

Vineeth Pillai (2):
sched/fair: Fix forced idle sibling starvation corner case
entry/kvm: Protect the kernel when entering from guest

.../admin-guide/hw-vuln/core-scheduling.rst   |  312 +++++
Documentation/admin-guide/hw-vuln/index.rst   |    1 +
.../admin-guide/kernel-parameters.txt         |    7 +
arch/x86/include/asm/thread_info.h            |    2 +
arch/x86/kvm/x86.c                            |    3 +
drivers/gpu/drm/i915/i915_request.c           |    4 +-
include/linux/entry-common.h                  |   20 +-
include/linux/entry-kvm.h                     |   12 +
include/linux/irq_work.h                      |   33 +-
include/linux/irqflags.h                      |    4 +-
include/linux/sched.h                         |   27 +-
include/uapi/linux/prctl.h                    |    3 +
kernel/Kconfig.preempt                        |    6 +
kernel/bpf/stackmap.c                         |    2 +-
kernel/entry/common.c                         |   25 +-
kernel/entry/kvm.c                            |   13 +
kernel/fork.c                                 |    1 +
kernel/irq_work.c                             |   18 +-
kernel/printk/printk.c                        |    6 +-
kernel/rcu/tree.c                             |    3 +-
kernel/sched/Makefile                         |    1 +
kernel/sched/core.c                           | 1135 ++++++++++++++++-
kernel/sched/coretag.c                        |  468 +++++++
kernel/sched/cpuacct.c                        |   12 +-
kernel/sched/deadline.c                       |   34 +-
kernel/sched/debug.c                          |    8 +-
kernel/sched/fair.c                           |  272 ++--
kernel/sched/idle.c                           |   24 +-
kernel/sched/pelt.h                           |    2 +-
kernel/sched/rt.c                             |   22 +-
kernel/sched/sched.h                          |  302 ++++-
kernel/sched/stop_task.c                      |   13 +-
kernel/sched/topology.c                       |    4 +-
kernel/sys.c                                  |    3 +
kernel/time/tick-sched.c                      |    6 +-
kernel/trace/bpf_trace.c                      |    2 +-
tools/include/uapi/linux/prctl.h              |    3 +
tools/testing/selftests/sched/.gitignore      |    1 +
tools/testing/selftests/sched/Makefile        |   14 +
tools/testing/selftests/sched/config          |    1 +
.../testing/selftests/sched/test_coresched.c  |  840 ++++++++++++
41 files changed, 3437 insertions(+), 232 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
create mode 100644 kernel/sched/coretag.c
create mode 100644 tools/testing/selftests/sched/.gitignore
create mode 100644 tools/testing/selftests/sched/Makefile
create mode 100644 tools/testing/selftests/sched/config
create mode 100644 tools/testing/selftests/sched/test_coresched.c

--
2.29.0.rc1.297.gfa9743e501-goog



* [PATCH v8 -tip 01/26] sched: Wrap rq::lock access
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task() Joel Fernandes (Google)
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Vineeth Remanan Pillai, Aubrey Li, Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

In preparation for playing games with rq->lock, abstract the lock behind
an accessor.
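
In short: call sites stop dereferencing rq->lock directly and go through a
helper, so a later patch can hand back a different (core-wide) lock without
touching every caller again. The shape, as introduced below:

  static inline raw_spinlock_t *rq_lockp(struct rq *rq)
  {
          return &rq->__lock;     /* this patch renames rq->lock to rq->__lock */
  }

  /* before:  raw_spin_lock(&rq->lock);
   * after:   raw_spin_lock(rq_lockp(rq));
   */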

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---
 kernel/sched/core.c     |  46 +++++++++---------
 kernel/sched/cpuacct.c  |  12 ++---
 kernel/sched/deadline.c |  18 +++----
 kernel/sched/debug.c    |   4 +-
 kernel/sched/fair.c     |  38 +++++++--------
 kernel/sched/idle.c     |   4 +-
 kernel/sched/pelt.h     |   2 +-
 kernel/sched/rt.c       |   8 +--
 kernel/sched/sched.h    | 105 +++++++++++++++++++++-------------------
 kernel/sched/topology.c |   4 +-
 10 files changed, 122 insertions(+), 119 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d2003a7d5ab5..97181b3d12eb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -186,12 +186,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 
 	for (;;) {
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 
 		while (unlikely(task_on_rq_migrating(p)))
 			cpu_relax();
@@ -210,7 +210,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	for (;;) {
 		raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		/*
 		 *	move_queued_task()		task_rq_lock()
 		 *
@@ -232,7 +232,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 		raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 
 		while (unlikely(task_on_rq_migrating(p)))
@@ -302,7 +302,7 @@ void update_rq_clock(struct rq *rq)
 {
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (rq->clock_update_flags & RQCF_ACT_SKIP)
 		return;
@@ -611,7 +611,7 @@ void resched_curr(struct rq *rq)
 	struct task_struct *curr = rq->curr;
 	int cpu;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (test_tsk_need_resched(curr))
 		return;
@@ -635,10 +635,10 @@ void resched_cpu(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	if (cpu_online(cpu) || cpu == smp_processor_id())
 		resched_curr(rq);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 #ifdef CONFIG_SMP
@@ -1137,7 +1137,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
 	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
 	struct uclamp_bucket *bucket;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	/* Update task effective clamp */
 	p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
@@ -1177,7 +1177,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 	unsigned int bkt_clamp;
 	unsigned int rq_clamp;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	/*
 	 * If sched_uclamp_used was enabled after task @p was enqueued,
@@ -1733,7 +1733,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 				   struct task_struct *p, int new_cpu)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, new_cpu);
@@ -1845,7 +1845,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 		 * Because __kthread_bind() calls this on blocked tasks without
 		 * holding rq->lock.
 		 */
-		lockdep_assert_held(&rq->lock);
+		lockdep_assert_held(rq_lockp(rq));
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	}
 	if (running)
@@ -1982,7 +1982,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * task_rq_lock().
 	 */
 	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
-				      lockdep_is_held(&task_rq(p)->lock)));
+				      lockdep_is_held(rq_lockp(task_rq(p)))));
 #endif
 	/*
 	 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -2493,7 +2493,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 {
 	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
@@ -3495,10 +3495,10 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
 	 * do an early lockdep release here:
 	 */
 	rq_unpin_lock(rq, rf);
-	spin_release(&rq->lock.dep_map, _THIS_IP_);
+	spin_release(&rq_lockp(rq)->dep_map, _THIS_IP_);
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
-	rq->lock.owner = next;
+	rq_lockp(rq)->owner = next;
 #endif
 }
 
@@ -3509,8 +3509,8 @@ static inline void finish_lock_switch(struct rq *rq)
 	 * fix up the runqueue lock - which gets 'carried over' from
 	 * prev into current:
 	 */
-	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
-	raw_spin_unlock_irq(&rq->lock);
+	spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
+	raw_spin_unlock_irq(rq_lockp(rq));
 }
 
 /*
@@ -3660,7 +3660,7 @@ static void __balance_callback(struct rq *rq)
 	void (*func)(struct rq *rq);
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	head = rq->balance_callback;
 	rq->balance_callback = NULL;
 	while (head) {
@@ -3671,7 +3671,7 @@ static void __balance_callback(struct rq *rq)
 
 		func(rq);
 	}
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 static inline void balance_callback(struct rq *rq)
@@ -6521,7 +6521,7 @@ void init_idle(struct task_struct *idle, int cpu)
 	__sched_fork(0, idle);
 
 	raw_spin_lock_irqsave(&idle->pi_lock, flags);
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 
 	idle->state = TASK_RUNNING;
 	idle->se.exec_start = sched_clock();
@@ -6559,7 +6559,7 @@ void init_idle(struct task_struct *idle, int cpu)
 #ifdef CONFIG_SMP
 	idle->on_cpu = 1;
 #endif
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
 
 	/* Set the preempt count _outside_ the spinlocks! */
@@ -7131,7 +7131,7 @@ void __init sched_init(void)
 		struct rq *rq;
 
 		rq = cpu_rq(i);
-		raw_spin_lock_init(&rq->lock);
+		raw_spin_lock_init(&rq->__lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
 		rq->calc_load_update = jiffies + LOAD_FREQ;
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 941c28cf9738..38c1a68e91f0 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -112,7 +112,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	/*
 	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	if (index == CPUACCT_STAT_NSTATS) {
@@ -126,7 +126,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	}
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	return data;
@@ -141,14 +141,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 	/*
 	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
 		cpuusage->usages[i] = val;
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 }
 
@@ -253,13 +253,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 			 * Take rq->lock to make 64-bit read safe on 32-bit
 			 * platforms.
 			 */
-			raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 			seq_printf(m, " %llu", cpuusage->usages[index]);
 
 #ifndef CONFIG_64BIT
-			raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 		}
 		seq_puts(m, "\n");
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6d93f4518734..814ec49502b1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -119,7 +119,7 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->running_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
 	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
@@ -132,7 +132,7 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->running_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
 	if (dl_rq->running_bw > old)
@@ -146,7 +146,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->this_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
 }
@@ -156,7 +156,7 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->this_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
 	if (dl_rq->this_bw > old)
@@ -966,7 +966,7 @@ static int start_dl_timer(struct task_struct *p)
 	ktime_t now, act;
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	/*
 	 * We want the timer to fire at the deadline, but considering
@@ -1076,9 +1076,9 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 		 * If the runqueue is no longer available, migrate the
 		 * task elsewhere. This necessarily changes rq.
 		 */
-		lockdep_unpin_lock(&rq->lock, rf.cookie);
+		lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
 		rq = dl_task_offline_migration(rq, p);
-		rf.cookie = lockdep_pin_lock(&rq->lock);
+		rf.cookie = lockdep_pin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		/*
@@ -1727,7 +1727,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
 	 * from try_to_wake_up(). Hence, p->pi_lock is locked, but
 	 * rq->lock is not... So, lock it
 	 */
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	if (p->dl.dl_non_contending) {
 		sub_running_bw(&p->dl, &rq->dl);
 		p->dl.dl_non_contending = 0;
@@ -1742,7 +1742,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
 			put_task_struct(p);
 	}
 	sub_rq_bw(&p->dl, &rq->dl);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 0655524700d2..c8fee8d9dfd4 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -551,7 +551,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "exec_clock",
 			SPLIT_NS(cfs_rq->exec_clock));
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	if (rb_first_cached(&cfs_rq->tasks_timeline))
 		MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
 	last = __pick_last_entity(cfs_rq);
@@ -559,7 +559,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 		max_vruntime = last->vruntime;
 	min_vruntime = cfs_rq->min_vruntime;
 	rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
 			SPLIT_NS(MIN_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa4c6227cd6d..dbd9368a959d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1101,7 +1101,7 @@ struct numa_group {
 static struct numa_group *deref_task_numa_group(struct task_struct *p)
 {
 	return rcu_dereference_check(p->numa_group, p == current ||
-		(lockdep_is_held(&task_rq(p)->lock) && !READ_ONCE(p->on_cpu)));
+		(lockdep_is_held(rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu)));
 }
 
 static struct numa_group *deref_curr_numa_group(struct task_struct *p)
@@ -5291,7 +5291,7 @@ static void __maybe_unused update_runtime_enabled(struct rq *rq)
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -5310,7 +5310,7 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -6772,7 +6772,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
 		 * rq->lock and can modify state directly.
 		 */
-		lockdep_assert_held(&task_rq(p)->lock);
+		lockdep_assert_held(rq_lockp(task_rq(p)));
 		detach_entity_cfs_rq(&p->se);
 
 	} else {
@@ -7400,7 +7400,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 {
 	s64 delta;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	if (p->sched_class != &fair_sched_class)
 		return 0;
@@ -7498,7 +7498,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
 	int tsk_cache_hot;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	/*
 	 * We do not migrate tasks that are:
@@ -7576,7 +7576,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
  */
 static void detach_task(struct task_struct *p, struct lb_env *env)
 {
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, env->dst_cpu);
@@ -7592,7 +7592,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
 {
 	struct task_struct *p;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	list_for_each_entry_reverse(p,
 			&env->src_rq->cfs_tasks, se.group_node) {
@@ -7628,7 +7628,7 @@ static int detach_tasks(struct lb_env *env)
 	struct task_struct *p;
 	int detached = 0;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	if (env->imbalance <= 0)
 		return 0;
@@ -7750,7 +7750,7 @@ static int detach_tasks(struct lb_env *env)
  */
 static void attach_task(struct rq *rq, struct task_struct *p)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	BUG_ON(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
@@ -9684,7 +9684,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		if (need_active_balance(&env)) {
 			unsigned long flags;
 
-			raw_spin_lock_irqsave(&busiest->lock, flags);
+			raw_spin_lock_irqsave(rq_lockp(busiest), flags);
 
 			/*
 			 * Don't kick the active_load_balance_cpu_stop,
@@ -9692,7 +9692,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			 * moved to this_cpu:
 			 */
 			if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
-				raw_spin_unlock_irqrestore(&busiest->lock,
+				raw_spin_unlock_irqrestore(rq_lockp(busiest),
 							    flags);
 				env.flags |= LBF_ALL_PINNED;
 				goto out_one_pinned;
@@ -9708,7 +9708,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 				busiest->push_cpu = this_cpu;
 				active_balance = 1;
 			}
-			raw_spin_unlock_irqrestore(&busiest->lock, flags);
+			raw_spin_unlock_irqrestore(rq_lockp(busiest), flags);
 
 			if (active_balance) {
 				stop_one_cpu_nowait(cpu_of(busiest),
@@ -10460,7 +10460,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
 		return;
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 	/*
 	 * This CPU is going to be idle and blocked load of idle CPUs
 	 * need to be updated. Run the ilb locally as it is a good
@@ -10469,7 +10469,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	 */
 	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
 		kick_ilb(NOHZ_STATS_KICK);
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock(rq_lockp(this_rq));
 }
 
 #else /* !CONFIG_NO_HZ_COMMON */
@@ -10535,7 +10535,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		goto out;
 	}
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 
 	update_blocked_averages(this_cpu);
 	rcu_read_lock();
@@ -10573,7 +10573,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	}
 	rcu_read_unlock();
 
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock(rq_lockp(this_rq));
 
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
@@ -11049,9 +11049,9 @@ void unregister_fair_sched_group(struct task_group *tg)
 
 		rq = cpu_rq(cpu);
 
-		raw_spin_lock_irqsave(&rq->lock, flags);
+		raw_spin_lock_irqsave(rq_lockp(rq), flags);
 		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
-		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 	}
 }
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f324dc36fc43..8ce6e80352cf 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -421,10 +421,10 @@ struct task_struct *pick_next_task_idle(struct rq *rq)
 static void
 dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
 {
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq(rq_lockp(rq));
 	printk(KERN_ERR "bad: scheduling from the idle thread!\n");
 	dump_stack();
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq(rq_lockp(rq));
 }
 
 /*
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 795e43e02afc..e850bd71a8ce 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -141,7 +141,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
 
 static inline u64 rq_clock_pelt(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock_pelt - rq->lost_idle_time;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f215eea6a966..e57fca05b660 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -887,7 +887,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 		if (skip)
 			continue;
 
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		if (rt_rq->rt_time) {
@@ -925,7 +925,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 
 		if (enqueue)
 			sched_rt_rq_enqueue(rt_rq);
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 	}
 
 	if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
@@ -2094,9 +2094,9 @@ void rto_push_irq_work_func(struct irq_work *work)
 	 * When it gets updated, a check is made if a push is possible.
 	 */
 	if (has_pushable_tasks(rq)) {
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		push_rt_tasks(rq);
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 	}
 
 	raw_spin_lock(&rd->rto_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index df80bfcea92e..587ebabebaff 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -894,7 +894,7 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
  */
 struct rq {
 	/* runqueue lock: */
-	raw_spinlock_t		lock;
+	raw_spinlock_t		__lock;
 
 	/*
 	 * nr_running and cpu_load should be in the same cacheline because
@@ -1075,6 +1075,10 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	return &rq->__lock;
+}
 
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
@@ -1142,7 +1146,7 @@ static inline void assert_clock_updated(struct rq *rq)
 
 static inline u64 rq_clock(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock;
@@ -1150,7 +1154,7 @@ static inline u64 rq_clock(struct rq *rq)
 
 static inline u64 rq_clock_task(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock_task;
@@ -1176,7 +1180,7 @@ static inline u64 rq_clock_thermal(struct rq *rq)
 
 static inline void rq_clock_skip_update(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	rq->clock_update_flags |= RQCF_REQ_SKIP;
 }
 
@@ -1186,7 +1190,7 @@ static inline void rq_clock_skip_update(struct rq *rq)
  */
 static inline void rq_clock_cancel_skipupdate(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	rq->clock_update_flags &= ~RQCF_REQ_SKIP;
 }
 
@@ -1215,7 +1219,7 @@ struct rq_flags {
  */
 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	rf->cookie = lockdep_pin_lock(&rq->lock);
+	rf->cookie = lockdep_pin_lock(rq_lockp(rq));
 
 #ifdef CONFIG_SCHED_DEBUG
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
@@ -1230,12 +1234,12 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
 		rf->clock_update_flags = RQCF_UPDATED;
 #endif
 
-	lockdep_unpin_lock(&rq->lock, rf->cookie);
+	lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
 }
 
 static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	lockdep_repin_lock(&rq->lock, rf->cookie);
+	lockdep_repin_lock(rq_lockp(rq), rf->cookie);
 
 #ifdef CONFIG_SCHED_DEBUG
 	/*
@@ -1256,7 +1260,7 @@ static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static inline void
@@ -1265,7 +1269,7 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 	__releases(p->pi_lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 }
 
@@ -1273,7 +1277,7 @@ static inline void
 rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irqsave(&rq->lock, rf->flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), rf->flags);
 	rq_pin_lock(rq, rf);
 }
 
@@ -1281,7 +1285,7 @@ static inline void
 rq_lock_irq(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq(rq_lockp(rq));
 	rq_pin_lock(rq, rf);
 }
 
@@ -1289,7 +1293,7 @@ static inline void
 rq_lock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	rq_pin_lock(rq, rf);
 }
 
@@ -1297,7 +1301,7 @@ static inline void
 rq_relock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	rq_repin_lock(rq, rf);
 }
 
@@ -1306,7 +1310,7 @@ rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), rf->flags);
 }
 
 static inline void
@@ -1314,7 +1318,7 @@ rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq(rq_lockp(rq));
 }
 
 static inline void
@@ -1322,7 +1326,7 @@ rq_unlock(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static inline struct rq *
@@ -1387,7 +1391,7 @@ queue_balance_callback(struct rq *rq,
 		       struct callback_head *head,
 		       void (*func)(struct rq *rq))
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (unlikely(head->next))
 		return;
@@ -2091,7 +2095,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 	double_rq_lock(this_rq, busiest);
 
 	return 1;
@@ -2110,20 +2114,22 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	int ret = 0;
-
-	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
-		if (busiest < this_rq) {
-			raw_spin_unlock(&this_rq->lock);
-			raw_spin_lock(&busiest->lock);
-			raw_spin_lock_nested(&this_rq->lock,
-					      SINGLE_DEPTH_NESTING);
-			ret = 1;
-		} else
-			raw_spin_lock_nested(&busiest->lock,
-					      SINGLE_DEPTH_NESTING);
+	if (rq_lockp(this_rq) == rq_lockp(busiest))
+		return 0;
+
+	if (likely(raw_spin_trylock(rq_lockp(busiest))))
+		return 0;
+
+	if (rq_lockp(busiest) >= rq_lockp(this_rq)) {
+		raw_spin_lock_nested(rq_lockp(busiest), SINGLE_DEPTH_NESTING);
+		return 0;
 	}
-	return ret;
+
+	raw_spin_unlock(rq_lockp(this_rq));
+	raw_spin_lock(rq_lockp(busiest));
+	raw_spin_lock_nested(rq_lockp(this_rq), SINGLE_DEPTH_NESTING);
+
+	return 1;
 }
 
 #endif /* CONFIG_PREEMPTION */
@@ -2133,11 +2139,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
  */
 static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 {
-	if (unlikely(!irqs_disabled())) {
-		/* printk() doesn't work well under rq->lock */
-		raw_spin_unlock(&this_rq->lock);
-		BUG_ON(1);
-	}
+	lockdep_assert_irqs_disabled();
 
 	return _double_lock_balance(this_rq, busiest);
 }
@@ -2145,8 +2147,9 @@ static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
 	__releases(busiest->lock)
 {
-	raw_spin_unlock(&busiest->lock);
-	lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
+	if (rq_lockp(this_rq) != rq_lockp(busiest))
+		raw_spin_unlock(rq_lockp(busiest));
+	lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
 }
 
 static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
@@ -2187,16 +2190,16 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 	__acquires(rq2->lock)
 {
 	BUG_ON(!irqs_disabled());
-	if (rq1 == rq2) {
-		raw_spin_lock(&rq1->lock);
+	if (rq_lockp(rq1) == rq_lockp(rq2)) {
+		raw_spin_lock(rq_lockp(rq1));
 		__acquire(rq2->lock);	/* Fake it out ;) */
 	} else {
-		if (rq1 < rq2) {
-			raw_spin_lock(&rq1->lock);
-			raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
+		if (rq_lockp(rq1) < rq_lockp(rq2)) {
+			raw_spin_lock(rq_lockp(rq1));
+			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
 		} else {
-			raw_spin_lock(&rq2->lock);
-			raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
+			raw_spin_lock(rq_lockp(rq2));
+			raw_spin_lock_nested(rq_lockp(rq1), SINGLE_DEPTH_NESTING);
 		}
 	}
 }
@@ -2211,9 +2214,9 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	raw_spin_unlock(&rq1->lock);
-	if (rq1 != rq2)
-		raw_spin_unlock(&rq2->lock);
+	raw_spin_unlock(rq_lockp(rq1));
+	if (rq_lockp(rq1) != rq_lockp(rq2))
+		raw_spin_unlock(rq_lockp(rq2));
 	else
 		__release(rq2->lock);
 }
@@ -2236,7 +2239,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 {
 	BUG_ON(!irqs_disabled());
 	BUG_ON(rq1 != rq2);
-	raw_spin_lock(&rq1->lock);
+	raw_spin_lock(rq_lockp(rq1));
 	__acquire(rq2->lock);	/* Fake it out ;) */
 }
 
@@ -2251,7 +2254,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq2->lock)
 {
 	BUG_ON(rq1 != rq2);
-	raw_spin_unlock(&rq1->lock);
+	raw_spin_unlock(rq_lockp(rq1));
 	__release(rq2->lock);
 }
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dd7770226086..eeb9aca1c853 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -454,7 +454,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	struct root_domain *old_rd = NULL;
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 
 	if (rq->rd) {
 		old_rd = rq->rd;
@@ -480,7 +480,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
 		set_rq_online(rq);
 
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 
 	if (old_rd)
 		call_rcu(&old_rd->rcu, free_rootdomain);
-- 
2.29.0.rc1.297.gfa9743e501-goog



* [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 01/26] sched: Wrap rq::lock access Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-22  7:59   ` Li, Aubrey
  2020-10-20  1:43 ` [PATCH v8 -tip 03/26] sched: Core-wide rq->lock Joel Fernandes (Google)
                   ` (25 subsequent siblings)
  27 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Vineeth Remanan Pillai, Aubrey Li, Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance()), it is not state invariant. This makes it unsuitable
for remote task selection.
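
Concretely, each class gains a side-effect free ->pick_task() and the
existing pick_next_task() becomes a thin wrapper around it. Paraphrasing
the rt conversion from this patch (dl and stop follow the same split):

  static struct task_struct *pick_task_rt(struct rq *rq)
  {
          if (!sched_rt_runnable(rq))
                  return NULL;

          return _pick_next_task_rt(rq);          /* selection only, no state change */
  }

  static struct task_struct *pick_next_task_rt(struct rq *rq)
  {
          struct task_struct *p = pick_task_rt(rq);

          if (p)
                  set_next_task_rt(rq, p, true);  /* state change only on the local pick */

          return p;
  }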

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/deadline.c  | 16 ++++++++++++++--
 kernel/sched/fair.c      | 32 +++++++++++++++++++++++++++++++-
 kernel/sched/idle.c      |  8 ++++++++
 kernel/sched/rt.c        | 14 ++++++++++++--
 kernel/sched/sched.h     |  3 +++
 kernel/sched/stop_task.c | 13 +++++++++++--
 6 files changed, 79 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 814ec49502b1..0271a7848ab3 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1848,7 +1848,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *pick_next_task_dl(struct rq *rq)
+static struct task_struct *pick_task_dl(struct rq *rq)
 {
 	struct sched_dl_entity *dl_se;
 	struct dl_rq *dl_rq = &rq->dl;
@@ -1860,7 +1860,18 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 	p = dl_task_of(dl_se);
-	set_next_task_dl(rq, p, true);
+
+	return p;
+}
+
+static struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+	struct task_struct *p;
+
+	p = pick_task_dl(rq);
+	if (p)
+		set_next_task_dl(rq, p, true);
+
 	return p;
 }
 
@@ -2517,6 +2528,7 @@ const struct sched_class dl_sched_class
 
 #ifdef CONFIG_SMP
 	.balance		= balance_dl,
+	.pick_task		= pick_task_dl,
 	.select_task_rq		= select_task_rq_dl,
 	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dbd9368a959d..bd6aed63f5e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4450,7 +4450,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	 * Avoid running the skip buddy, if running something else can
 	 * be done without getting too unfair.
 	 */
-	if (cfs_rq->skip == se) {
+	if (cfs_rq->skip && cfs_rq->skip == se) {
 		struct sched_entity *second;
 
 		if (se == curr) {
@@ -6976,6 +6976,35 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		set_last_buddy(se);
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_fair(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq = &rq->cfs;
+	struct sched_entity *se;
+
+	if (!cfs_rq->nr_running)
+		return NULL;
+
+	do {
+		struct sched_entity *curr = cfs_rq->curr;
+
+		se = pick_next_entity(cfs_rq, NULL);
+
+		if (curr) {
+			if (se && curr->on_rq)
+				update_curr(cfs_rq);
+
+			if (!se || entity_before(curr, se))
+				se = curr;
+		}
+
+		cfs_rq = group_cfs_rq(se);
+	} while (cfs_rq);
+
+	return task_of(se);
+}
+#endif
+
 struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -11173,6 +11202,7 @@ const struct sched_class fair_sched_class
 
 #ifdef CONFIG_SMP
 	.balance		= balance_fair,
+	.pick_task		= pick_task_fair,
 	.select_task_rq		= select_task_rq_fair,
 	.migrate_task_rq	= migrate_task_rq_fair,
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8ce6e80352cf..ce7552c6bc65 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -405,6 +405,13 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 	schedstat_inc(rq->sched_goidle);
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_idle(struct rq *rq)
+{
+	return rq->idle;
+}
+#endif
+
 struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	struct task_struct *next = rq->idle;
@@ -472,6 +479,7 @@ const struct sched_class idle_sched_class
 
 #ifdef CONFIG_SMP
 	.balance		= balance_idle,
+	.pick_task		= pick_task_idle,
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e57fca05b660..a5851c775270 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1624,7 +1624,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	return rt_task_of(rt_se);
 }
 
-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct *pick_task_rt(struct rq *rq)
 {
 	struct task_struct *p;
 
@@ -1632,7 +1632,16 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
 		return NULL;
 
 	p = _pick_next_task_rt(rq);
-	set_next_task_rt(rq, p, true);
+
+	return p;
+}
+
+static struct task_struct *pick_next_task_rt(struct rq *rq)
+{
+	struct task_struct *p = pick_task_rt(rq);
+	if (p)
+		set_next_task_rt(rq, p, true);
+
 	return p;
 }
 
@@ -2443,6 +2452,7 @@ const struct sched_class rt_sched_class
 
 #ifdef CONFIG_SMP
 	.balance		= balance_rt,
+	.pick_task		= pick_task_rt,
 	.select_task_rq		= select_task_rq_rt,
 	.set_cpus_allowed       = set_cpus_allowed_common,
 	.rq_online              = rq_online_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 587ebabebaff..54bfac702805 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1800,6 +1800,9 @@ struct sched_class {
 
 #ifdef CONFIG_SMP
 	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+
+	struct task_struct * (*pick_task)(struct rq *rq);
+
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 394bc8126a1e..8f92915dd95e 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -34,15 +34,23 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool fir
 	stop->se.exec_start = rq_clock_task(rq);
 }
 
-static struct task_struct *pick_next_task_stop(struct rq *rq)
+static struct task_struct *pick_task_stop(struct rq *rq)
 {
 	if (!sched_stop_runnable(rq))
 		return NULL;
 
-	set_next_task_stop(rq, rq->stop, true);
 	return rq->stop;
 }
 
+static struct task_struct *pick_next_task_stop(struct rq *rq)
+{
+	struct task_struct *p = pick_task_stop(rq);
+	if (p)
+		set_next_task_stop(rq, p, true);
+
+	return p;
+}
+
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
@@ -124,6 +132,7 @@ const struct sched_class stop_sched_class
 
 #ifdef CONFIG_SMP
 	.balance		= balance_stop,
+	.pick_task		= pick_task_stop,
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
-- 
2.29.0.rc1.297.gfa9743e501-goog



* [PATCH v8 -tip 03/26] sched: Core-wide rq->lock
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 01/26] sched: Wrap rq::lock access Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task() Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-26 11:59   ` Peter Zijlstra
  2020-10-20  1:43 ` [PATCH v8 -tip 04/26] sched/fair: Add a few assertions Joel Fernandes (Google)
                   ` (24 subsequent siblings)
  27 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Vineeth Remanan Pillai, Aubrey Li, Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Introduce the basic infrastructure for a core-wide rq->lock.
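
The gist: a static key plus a per-rq core pointer let rq_lockp() resolve to
one shared lock for all SMT siblings of a core once the feature is switched
on, and fall back to the per-rq lock otherwise (both helpers appear in the
sched.h hunk below):

  static inline bool sched_core_enabled(struct rq *rq)
  {
          return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
  }

  static inline raw_spinlock_t *rq_lockp(struct rq *rq)
  {
          if (sched_core_enabled(rq))
                  return &rq->core->__lock;       /* one lock for the whole core */

          return &rq->__lock;
  }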

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
---
 kernel/Kconfig.preempt |   6 +++
 kernel/sched/core.c    | 109 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h   |  31 ++++++++++++
 3 files changed, 146 insertions(+)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index bf82259cff96..4488fbf4d3a8 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -80,3 +80,9 @@ config PREEMPT_COUNT
 config PREEMPTION
        bool
        select PREEMPT_COUNT
+
+config SCHED_CORE
+	bool "Core Scheduling for SMT"
+	default y
+	depends on SCHED_SMT
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 97181b3d12eb..cecbf91cb477 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,70 @@ unsigned int sysctl_sched_rt_period = 1000000;
 
 __read_mostly int scheduler_running;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ *	spin_lock(rq_lockp(rq));
+ *	...
+ *	spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+	bool enabled = !!(unsigned long)data;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		cpu_rq(cpu)->core_enabled = enabled;
+
+	return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+	// XXX verify there are no cookie tasks (yet)
+
+	static_branch_enable(&__sched_core_enabled);
+	stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+	// XXX verify there are no cookie tasks (left)
+
+	stop_machine(__sched_core_stopper, (void *)false, NULL);
+	static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!sched_core_count++)
+		__sched_core_enable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!--sched_core_count)
+		__sched_core_disable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * part of the period that we allow rt tasks to run in us.
  * default: 0.95s
@@ -4363,6 +4427,43 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline void sched_core_cpu_starting(unsigned int cpu)
+{
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	struct rq *rq, *core_rq = NULL;
+	int i;
+
+	core_rq = cpu_rq(cpu)->core;
+
+	if (!core_rq) {
+		for_each_cpu(i, smt_mask) {
+			rq = cpu_rq(i);
+			if (rq->core && rq->core == rq)
+				core_rq = rq;
+			init_sched_core_irq_work(rq);
+		}
+
+		if (!core_rq)
+			core_rq = cpu_rq(cpu);
+
+		for_each_cpu(i, smt_mask) {
+			rq = cpu_rq(i);
+
+			WARN_ON_ONCE(rq->core && rq->core != core_rq);
+			rq->core = core_rq;
+		}
+	}
+
+	printk("core: %d -> %d\n", cpu, cpu_of(core_rq));
+}
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_cpu_starting(unsigned int cpu) {}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6963,6 +7064,9 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 
 int sched_cpu_starting(unsigned int cpu)
 {
+
+	sched_core_cpu_starting(cpu);
+
 	sched_rq_cpu_starting(cpu);
 	sched_tick_start(cpu);
 	return 0;
@@ -7193,6 +7297,11 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+		rq->core = NULL;
+		rq->core_enabled = 0;
+#endif
 	}
 
 	set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 54bfac702805..85c8472b5d00 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1048,6 +1048,12 @@ struct rq {
 	/* Must be inspected within a rcu lock section */
 	struct cpuidle_state	*idle_state;
 #endif
+
+#ifdef CONFIG_SCHED_CORE
+	/* per rq */
+	struct rq		*core;
+	unsigned int		core_enabled;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1075,11 +1081,36 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+#ifdef CONFIG_SCHED_CORE
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	if (sched_core_enabled(rq))
+		return &rq->core->__lock;
+
+	return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return false;
+}
+
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
 	return &rq->__lock;
 }
 
+#endif /* CONFIG_SCHED_CORE */
+
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
 
-- 
2.29.0.rc1.297.gfa9743e501-goog



* [PATCH v8 -tip 04/26] sched/fair: Add a few assertions
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (2 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 03/26] sched: Core-wide rq->lock Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 05/26] sched: Basic tracking of matching tasks Joel Fernandes (Google)
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd6aed63f5e3..b4bc82f46fe7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6228,6 +6228,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	}
 
 symmetric:
+	/*
+	 * per-cpu select_idle_mask usage
+	 */
+	lockdep_assert_irqs_disabled();
+
 	if (available_idle_cpu(target) || sched_idle_cpu(target))
 		return target;
 
@@ -6670,8 +6675,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
  * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
  *
  * Returns the target CPU number.
- *
- * preempt must be disabled.
  */
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
@@ -6682,6 +6685,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int want_affine = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	/*
+	 * required for stable ->cpus_allowed
+	 */
+	lockdep_assert_held(&p->pi_lock);
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
 
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 05/26] sched: Basic tracking of matching tasks
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (3 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 04/26] sched/fair: Add a few assertions Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling Joel Fernandes (Google)
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Vineeth Remanan Pillai, Aubrey Li, Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled, core scheduling will only allow matching
tasks to be on the core; the idle task matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled),
these tasks are indexed in a second RB-tree, first on cookie value and
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).
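
As a rough standalone illustration (plain C, not kernel code; the task
values are invented), the ordering described above sorts first on cookie
and then, within a cookie, puts the higher-priority task leftmost, so a
lookup for a cookie can simply take the leftmost node of that cookie's
range:

#include <stdio.h>
#include <stdlib.h>

struct demo_task {
	unsigned long	cookie;
	int		prio;	/* kernel convention: smaller value == higher priority */
};

static int core_order(const void *pa, const void *pb)
{
	const struct demo_task *a = pa, *b = pb;

	if (a->cookie != b->cookie)
		return a->cookie < b->cookie ? -1 : 1;

	/* same cookie: higher priority (smaller value) sorts leftmost */
	return a->prio - b->prio;
}

int main(void)
{
	struct demo_task tasks[] = {
		{ .cookie = 2, .prio = 120 },
		{ .cookie = 1, .prio = 100 },
		{ .cookie = 2, .prio = 110 },
		{ .cookie = 1, .prio = 130 },
	};
	int n = sizeof(tasks) / sizeof(tasks[0]);

	qsort(tasks, n, sizeof(tasks[0]), core_order);

	/*
	 * Prints cookie 1 before cookie 2, and within each cookie the
	 * higher-priority task first; "find cookie 2" is the leftmost
	 * entry of that range (prio 110 here).
	 */
	for (int i = 0; i < n; i++)
		printf("cookie=%lu prio=%d\n", tasks[i].cookie, tasks[i].prio);

	return 0;
}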

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---
 include/linux/sched.h |   8 ++-
 kernel/sched/core.c   | 146 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   |  46 -------------
 kernel/sched/sched.h  |  55 ++++++++++++++++
 4 files changed, 208 insertions(+), 47 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 393db0690101..c3563d7cab7f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -683,10 +683,16 @@ struct task_struct {
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
+	struct sched_dl_entity		dl;
+
+#ifdef CONFIG_SCHED_CORE
+	struct rb_node			core_node;
+	unsigned long			core_cookie;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
-	struct sched_dl_entity		dl;
 
 #ifdef CONFIG_UCLAMP_TASK
 	/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cecbf91cb477..a032f481c6e6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -78,6 +78,141 @@ __read_mostly int scheduler_running;
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+	if (p->sched_class == &stop_sched_class) /* trumps deadline */
+		return -2;
+
+	if (rt_prio(p->prio)) /* includes deadline */
+		return p->prio; /* [-1, 99] */
+
+	if (p->sched_class == &idle_sched_class)
+		return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+	return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b)  := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+{
+
+	int pa = __task_prio(a), pb = __task_prio(b);
+
+	if (-pa < -pb)
+		return true;
+
+	if (-pb < -pa)
+		return false;
+
+	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+		return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
+		u64 vruntime = b->se.vruntime;
+
+		/*
+		 * Normalize the vruntime if tasks are in different cpus.
+		 */
+		if (task_cpu(a) != task_cpu(b)) {
+			vruntime -= task_cfs_rq(b)->min_vruntime;
+			vruntime += task_cfs_rq(a)->min_vruntime;
+		}
+
+		return !((s64)(a->se.vruntime - vruntime) <= 0);
+	}
+
+	return false;
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
+{
+	if (a->core_cookie < b->core_cookie)
+		return true;
+
+	if (a->core_cookie > b->core_cookie)
+		return false;
+
+	/* flip prio, so high prio is leftmost */
+	if (prio_less(b, a))
+		return true;
+
+	return false;
+}
+
+static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+	struct rb_node *parent, **node;
+	struct task_struct *node_task;
+
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	node = &rq->core_tree.rb_node;
+	parent = *node;
+
+	while (*node) {
+		node_task = container_of(*node, struct task_struct, core_node);
+		parent = *node;
+
+		if (__sched_core_less(p, node_task))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&p->core_node, parent, node);
+	rb_insert_color(&p->core_node, &rq->core_tree);
+}
+
+static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	rb_erase(&p->core_node, &rq->core_tree);
+}
+
+/*
+ * Find left-most (aka, highest priority) task matching @cookie.
+ */
+static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
+{
+	struct rb_node *node = rq->core_tree.rb_node;
+	struct task_struct *node_task, *match;
+
+	/*
+	 * The idle task always matches any cookie!
+	 */
+	match = idle_sched_class.pick_task(rq);
+
+	while (node) {
+		node_task = container_of(node, struct task_struct, core_node);
+
+		if (cookie < node_task->core_cookie) {
+			node = node->rb_left;
+		} else if (cookie > node_task->core_cookie) {
+			node = node->rb_right;
+		} else {
+			match = node_task;
+			node = node->rb_left;
+		}
+	}
+
+	return match;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -136,6 +271,11 @@ void sched_core_put(void)
 	mutex_unlock(&sched_core_mutex);
 }
 
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -1624,6 +1764,9 @@ static inline void init_uclamp(void) { }
 
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_enqueue(rq, p);
+
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
@@ -1638,6 +1781,9 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 
 static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_dequeue(rq, p);
+
 	if (!(flags & DEQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b4bc82f46fe7..58f670e5704d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -258,33 +258,11 @@ const struct sched_class fair_sched_class;
  */
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	SCHED_WARN_ON(!entity_is_task(se));
-	return container_of(se, struct task_struct, se);
-}
 
 /* Walk up scheduling entities hierarchy */
 #define for_each_sched_entity(se) \
 		for (; se; se = se->parent)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return p->se.cfs_rq;
-}
-
-/* runqueue on which this entity is (to be) queued */
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	return se->cfs_rq;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return grp->my_q;
-}
-
 static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
 {
 	if (!path)
@@ -445,33 +423,9 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	return container_of(se, struct task_struct, se);
-}
-
 #define for_each_sched_entity(se) \
 		for (; se; se = NULL)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return &task_rq(p)->cfs;
-}
-
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	struct task_struct *p = task_of(se);
-	struct rq *rq = task_rq(p);
-
-	return &rq->cfs;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return NULL;
-}
-
 static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
 {
 	if (path)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 85c8472b5d00..4964453591c3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1053,6 +1053,10 @@ struct rq {
 	/* per rq */
 	struct rq		*core;
 	unsigned int		core_enabled;
+	struct rb_root		core_tree;
+
+	/* shared state */
+	unsigned int		core_task_seq;
 #endif
 };
 
@@ -1132,6 +1136,57 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	SCHED_WARN_ON(!entity_is_task(se));
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return p->se.cfs_rq;
+}
+
+/* runqueue on which this entity is (to be) queued */
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	return se->cfs_rq;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return grp->my_q;
+}
+
+#else
+
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return &task_rq(p)->cfs;
+}
+
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	struct task_struct *p = task_of(se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->cfs;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return NULL;
+}
+#endif
+
 extern void update_rq_clock(struct rq *rq);
 
 static inline u64 __rq_clock_broken(struct rq *rq)
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (4 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 05/26] sched: Basic tracking of matching tasks Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-23 13:51   ` Peter Zijlstra
  2020-10-23 15:05   ` Peter Zijlstra
  2020-10-20  1:43 ` [PATCH v8 -tip 07/26] sched/fair: Fix forced idle sibling starvation corner case Joel Fernandes (Google)
                   ` (21 subsequent siblings)
  27 siblings, 2 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Vineeth Remanan Pillai, Aaron Lu, Aubrey Li, Paul E. McKenney,
	Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective of which
logical CPU does the reschedule).

During a CPU hotplug event, schedule() would be called with the
hotplugged CPU not in the cpumask. So use for_each_cpu(_wrap)_or to
include the current CPU in the task-pick loop.

There are multiple loops in pick_next_task() that iterate over CPUs in
smt_mask. During a hotplug event, a sibling could be removed from
smt_mask while pick_next_task() is running, so we cannot trust the mask
across the different loops; this can confuse the logic. Add retry logic
for the case where smt_mask changes between the loops.
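
The core-wide rule being implemented can be summarized with a small
standalone sketch (plain C, not kernel code; the task names and cookie
values are invented, and the real selection in the diff below also
handles priorities, preemption and hotplug): the rescheduling sibling's
highest-priority pick sets the core-wide cookie, and every other sibling
either runs a cookie-matched task or is forced idle.

#include <stdio.h>
#include <stddef.h>

struct pick {
	const char	*comm;
	unsigned long	cookie;
};

/* what each sibling would run on its own (NULL == nothing runnable) */
static const struct pick *local_pick[2];

static void core_wide_pick(unsigned long core_cookie)
{
	for (size_t i = 0; i < 2; i++) {
		const struct pick *p = local_pick[i];

		if (p && p->cookie == core_cookie)
			printf("sibling %zu: run %s\n", i, p->comm);
		else
			printf("sibling %zu: forced idle\n", i);
	}
}

int main(void)
{
	static const struct pick vm1 = { "vcpu-vm1", 0xabc };
	static const struct pick vm2 = { "vcpu-vm2", 0xdef };

	local_pick[0] = &vm1;
	local_pick[1] = &vm2;

	/* sibling 0 rescheduled and won the priority comparison: its cookie rules */
	core_wide_pick(vm1.cookie);
	return 0;
}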

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/core.c  | 301 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   6 +-
 2 files changed, 305 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a032f481c6e6..12030b77bd6d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4533,7 +4533,7 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -4574,6 +4574,294 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 }
 
 #ifdef CONFIG_SCHED_CORE
+static inline bool is_task_rq_idle(struct task_struct *t)
+{
+	return (task_rq(t)->idle == t);
+}
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+	return is_task_rq_idle(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+	if (is_task_rq_idle(a) || is_task_rq_idle(b))
+		return true;
+
+	return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+	struct task_struct *class_pick, *cookie_pick;
+	unsigned long cookie = rq->core->core_cookie;
+
+	class_pick = class->pick_task(rq);
+	if (!class_pick)
+		return NULL;
+
+	if (!cookie) {
+		/*
+		 * If class_pick is tagged, return it only if it has
+		 * higher priority than max.
+		 */
+		if (max && class_pick->core_cookie &&
+		    prio_less(class_pick, max))
+			return idle_sched_class.pick_task(rq);
+
+		return class_pick;
+	}
+
+	/*
+	 * If class_pick is idle or matches cookie, return early.
+	 */
+	if (cookie_equals(class_pick, cookie))
+		return class_pick;
+
+	cookie_pick = sched_core_find(rq, cookie);
+
+	/*
+	 * If class > max && class > cookie, it is the highest priority task on
+	 * the core (so far) and it must be selected, otherwise we must go with
+	 * the cookie pick in order to satisfy the constraint.
+	 */
+	if (prio_less(cookie_pick, class_pick) &&
+	    (!max || prio_less(max, class_pick)))
+		return class_pick;
+
+	return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *next, *max = NULL;
+	const struct sched_class *class;
+	const struct cpumask *smt_mask;
+	bool need_sync;
+	int i, j, cpu;
+
+	if (!sched_core_enabled(rq))
+		return __pick_next_task(rq, prev, rf);
+
+	cpu = cpu_of(rq);
+
+	/* Stopper task is switching into idle, no need core-wide selection. */
+	if (cpu_is_offline(cpu)) {
+		/*
+		 * Reset core_pick so that we don't enter the fastpath when
+		 * coming online. core_pick would already be migrated to
+		 * another cpu during offline.
+		 */
+		rq->core_pick = NULL;
+		return __pick_next_task(rq, prev, rf);
+	}
+
+	/*
+	 * If there were no {en,de}queues since we picked (IOW, the task
+	 * pointers are all still valid), and we haven't scheduled the last
+	 * pick yet, do so now.
+	 *
+	 * rq->core_pick can be NULL if no selection was made for a CPU because
+	 * it was either offline or went offline during a sibling's core-wide
+	 * selection. In this case, do a core-wide selection.
+	 */
+	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+	    rq->core->core_pick_seq != rq->core_sched_seq &&
+	    rq->core_pick) {
+		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+		next = rq->core_pick;
+		if (next != prev) {
+			put_prev_task(rq, prev);
+			set_next_task(rq, next);
+		}
+
+		rq->core_pick = NULL;
+		return next;
+	}
+
+	put_prev_task_balance(rq, prev, rf);
+
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+	 *
+	 * @task_seq guards the task state ({en,de}queues)
+	 * @pick_seq is the @task_seq we did a selection on
+	 * @sched_seq is the @pick_seq we scheduled
+	 *
+	 * However, preemptions can cause multiple picks on the same task set.
+	 * 'Fix' this by also increasing @task_seq for every pick.
+	 */
+	rq->core->core_task_seq++;
+	need_sync = !!rq->core->core_cookie;
+
+	/* reset state */
+	rq->core->core_cookie = 0UL;
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		rq_i->core_pick = NULL;
+
+		if (rq_i->core_forceidle) {
+			need_sync = true;
+			rq_i->core_forceidle = false;
+		}
+
+		if (i != cpu)
+			update_rq_clock(rq_i);
+	}
+
+	/*
+	 * Try and select tasks for each sibling in descending sched_class
+	 * order.
+	 */
+	for_each_class(class) {
+again:
+		for_each_cpu_wrap(i, smt_mask, cpu) {
+			struct rq *rq_i = cpu_rq(i);
+			struct task_struct *p;
+
+			if (rq_i->core_pick)
+				continue;
+
+			/*
+			 * If this sibling doesn't yet have a suitable task to
+			 * run; ask for the most eligible task, given the
+			 * highest priority task already selected for this
+			 * core.
+			 */
+			p = pick_task(rq_i, class, max);
+			if (!p) {
+				/*
+				 * If there weren't any cookies, we don't need to
+				 * bother with the other siblings.
+				 * If the rest of the core is not running a tagged
+				 * task, i.e.  need_sync == 0, and the current CPU
+				 * which called into the schedule() loop does not
+				 * have any tasks for this class, skip selecting for
+				 * other siblings since there's no point. We don't skip
+				 * for RT/DL because that could make CFS force-idle RT.
+				 */
+				if (i == cpu && !need_sync && class == &fair_sched_class)
+					goto next_class;
+
+				continue;
+			}
+
+			/*
+			 * Optimize the 'normal' case where there aren't any
+			 * cookies and we don't need to sync up.
+			 */
+			if (i == cpu && !need_sync && !p->core_cookie) {
+				next = p;
+				goto done;
+			}
+
+			rq_i->core_pick = p;
+
+			/*
+			 * If this new candidate is of higher priority than the
+			 * previous; and they're incompatible; we need to wipe
+			 * the slate and start over. pick_task makes sure that
+			 * p's priority is more than max if it doesn't match
+			 * max's cookie.
+			 *
+			 * NOTE: this is a linear max-filter and is thus bounded
+			 * in execution time.
+			 */
+			if (!max || !cookie_match(max, p)) {
+				struct task_struct *old_max = max;
+
+				rq->core->core_cookie = p->core_cookie;
+				max = p;
+
+				if (old_max) {
+					for_each_cpu(j, smt_mask) {
+						if (j == i)
+							continue;
+
+						cpu_rq(j)->core_pick = NULL;
+					}
+					goto again;
+				} else {
+					/*
+					 * Once we select a task for a cpu, we
+					 * should not be doing an unconstrained
+					 * pick because it might starve a task
+					 * on a forced idle cpu.
+					 */
+					need_sync = true;
+				}
+
+			}
+		}
+next_class:;
+	}
+
+	rq->core->core_pick_seq = rq->core->core_task_seq;
+	next = rq->core_pick;
+	rq->core_sched_seq = rq->core->core_pick_seq;
+
+	/* Something should have been selected for current CPU */
+	WARN_ON_ONCE(!next);
+
+	/*
+	 * Reschedule siblings
+	 *
+	 * NOTE: L1TF -- at this point we're no longer running the old task and
+	 * sending an IPI (below) ensures the sibling will no longer be running
+	 * their task. This ensures there is no inter-sibling overlap between
+	 * non-matching user state.
+	 */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		/*
+		 * An online sibling might have gone offline before a task
+		 * could be picked for it, or it might be offline but later
+		 * happen to come online, but it's too late and nothing was
+		 * picked for it.  That's Ok - it will pick tasks for itself,
+		 * so ignore it.
+		 */
+		if (!rq_i->core_pick)
+			continue;
+
+		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running)
+			rq_i->core_forceidle = true;
+
+		if (i == cpu) {
+			rq_i->core_pick = NULL;
+			continue;
+		}
+
+		/* Did we break L1TF mitigation requirements? */
+		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+
+		if (rq_i->curr == rq_i->core_pick) {
+			rq_i->core_pick = NULL;
+			continue;
+		}
+
+		resched_curr(rq_i);
+	}
+
+done:
+	set_next_task(rq, next);
+	return next;
+}
 
 static inline void sched_core_cpu_starting(unsigned int cpu)
 {
@@ -4608,6 +4896,12 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
 
 static inline void sched_core_cpu_starting(unsigned int cpu) {}
 
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return __pick_next_task(rq, prev, rf);
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -7446,7 +7740,12 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = NULL;
+		rq->core_pick = NULL;
 		rq->core_enabled = 0;
+		rq->core_tree = RB_ROOT;
+		rq->core_forceidle = false;
+
+		rq->core_cookie = 0UL;
 #endif
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4964453591c3..2b6e0bf61720 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1052,11 +1052,16 @@ struct rq {
 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
 	struct rq		*core;
+	struct task_struct	*core_pick;
 	unsigned int		core_enabled;
+	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	unsigned char		core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
+	unsigned int		core_pick_seq;
+	unsigned long		core_cookie;
 #endif
 };
 
@@ -1936,7 +1941,6 @@ static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != next);
 	next->sched_class->set_next_task(rq, next, false);
 }
 
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 07/26] sched/fair: Fix forced idle sibling starvation corner case
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (5 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle Joel Fernandes (Google)
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Vineeth Remanan Pillai, Aubrey Li, Paul E. McKenney, Tim Chen

From: Vineeth Pillai <viremana@linux.microsoft.com>

If there is only one long-running local task and the sibling is
forced idle, it might not get a chance to run until a schedule
event happens on any CPU in the core.

So we check for this condition at each tick to see if a sibling
is starved and then give it a chance to schedule.
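
With MIN_NR_TASKS_DURING_FORCEIDLE == 2 (as in the hunk below), the
check amounts to "give up the CPU once more than half of the slice has
been consumed". A standalone sketch of just that arithmetic (plain C,
not kernel code; the slice and runtime numbers are invented):

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

static bool entity_slice_used(uint64_t rtime_ns, uint64_t slice_ns,
			      int min_nr_tasks)
{
	/* same condition as __entity_slice_used() below */
	return rtime_ns * min_nr_tasks > slice_ns;
}

int main(void)
{
	uint64_t slice = 4000000;	/* 4ms slice */

	/* 1.5ms used: 1.5ms * 2 = 3ms <= 4ms, keep running */
	printf("%d\n", entity_slice_used(1500000, slice, 2));

	/* 2.5ms used: 2.5ms * 2 = 5ms > 4ms, trigger a resched */
	printf("%d\n", entity_slice_used(2500000, slice, 2));
	return 0;
}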

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---
 kernel/sched/core.c  | 15 ++++++++-------
 kernel/sched/fair.c  | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 +-
 3 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 12030b77bd6d..469428979182 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4710,16 +4710,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	/* reset state */
 	rq->core->core_cookie = 0UL;
+	if (rq->core->core_forceidle) {
+		need_sync = true;
+		rq->core->core_forceidle = false;
+	}
 	for_each_cpu(i, smt_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
 		rq_i->core_pick = NULL;
 
-		if (rq_i->core_forceidle) {
-			need_sync = true;
-			rq_i->core_forceidle = false;
-		}
-
 		if (i != cpu)
 			update_rq_clock(rq_i);
 	}
@@ -4839,8 +4838,10 @@ next_class:;
 		if (!rq_i->core_pick)
 			continue;
 
-		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running)
-			rq_i->core_forceidle = true;
+		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
+		    !rq_i->core->core_forceidle) {
+			rq_i->core->core_forceidle = true;
+		}
 
 		if (i == cpu) {
 			rq_i->core_pick = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 58f670e5704d..56bea0decda1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10652,6 +10652,44 @@ static void rq_offline_fair(struct rq *rq)
 
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_SCHED_CORE
+static inline bool
+__entity_slice_used(struct sched_entity *se, int min_nr_tasks)
+{
+	u64 slice = sched_slice(cfs_rq_of(se), se);
+	u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+	return (rtime * min_nr_tasks > slice);
+}
+
+#define MIN_NR_TASKS_DURING_FORCEIDLE	2
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
+{
+	if (!sched_core_enabled(rq))
+		return;
+
+	/*
+	 * If runqueue has only one task which used up its slice and
+	 * if the sibling is forced idle, then trigger schedule to
+	 * give forced idle task a chance.
+	 *
+	 * sched_slice() considers only this active rq and it gets the
+	 * whole slice. But during force idle, we have siblings acting
+	 * like a single runqueue and hence we need to consider runnable
+	 * tasks on this cpu and the forced idle cpu. Ideally, we should
+	 * go through the forced idle rq, but that would be a perf hit.
+	 * We can assume that the forced idle cpu has at least
+	 * MIN_NR_TASKS_DURING_FORCEIDLE - 1 tasks and use that to check
+	 * if we need to give up the cpu.
+	 */
+	if (rq->core->core_forceidle && rq->cfs.nr_running == 1 &&
+	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
+		resched_curr(rq);
+}
+#else
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
+#endif
+
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -10675,6 +10713,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
+
+	task_tick_core(rq, curr);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2b6e0bf61720..884d23d5e55d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1056,12 +1056,12 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
-	unsigned char		core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
+	unsigned char		core_forceidle;
 #endif
 };
 
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (6 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 07/26] sched/fair: Fix forced idle sibling starvation corner case Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-26 12:47   ` Peter Zijlstra
  2020-10-20  1:43 ` [PATCH v8 -tip 09/26] sched: Trivial forced-newidle balancer Joel Fernandes (Google)
                   ` (19 subsequent siblings)
  27 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

During force-idle, we end up doing cross-CPU comparison of vruntimes
during pick_next_task. If we simply compare (vruntime-min_vruntime)
across CPUs, and if the CPUs only have one task each, we will always
end up comparing 0 with 0 and picking just one of the tasks all the
time. This starves the task that was not picked. To fix this, take a
snapshot of the min_vruntime when entering force idle and use it for
comparison. This min_vruntime snapshot will only be used for cross-CPU
vruntime comparison, and nothing else.

This resolves several performance issues that were seen in the ChromeOS
audio use case.
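
A standalone sketch of the problem and the fix (plain C, not kernel
code; the numbers are invented): with one task per CPU, the old
comparison collapses to 0 vs 0 forever, while comparing against a
min_vruntime snapshot taken when force idle starts lets the running
task's lead grow and eventually favour the starved task.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* CPU a keeps running its single task; CPU b is forced idle */
	uint64_t vruntime_a = 1000, min_vruntime_a = 1000;
	uint64_t vruntime_b = 5000, min_vruntime_b = 5000;

	/* snapshots taken when the core enters force idle */
	uint64_t fi_a = min_vruntime_a, fi_b = min_vruntime_b;

	for (int tick = 0; tick < 3; tick++) {
		/* old comparison: both sides collapse to 0 - 0 */
		int64_t old_delta = (int64_t)(vruntime_a - min_vruntime_a) -
				    (int64_t)(vruntime_b - min_vruntime_b);

		/* new comparison: normalize with the force-idle snapshots */
		int64_t new_delta = (int64_t)(vruntime_a - vruntime_b) +
				    (int64_t)(fi_b - fi_a);

		printf("tick %d: old_delta=%lld new_delta=%lld\n",
		       tick, (long long)old_delta, (long long)new_delta);

		/* only a runs; its min_vruntime follows its single task */
		vruntime_a += 300;
		min_vruntime_a = vruntime_a;
	}
	return 0;
}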

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c  | 33 ++++++++++++++++++++-------------
 kernel/sched/fair.c  | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  5 +++++
 3 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 469428979182..a5404ec9e89a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -115,19 +115,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -4648,6 +4637,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *next, *max = NULL;
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
+	bool fi_before = false;
 	bool need_sync;
 	int i, j, cpu;
 
@@ -4712,6 +4702,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	rq->core->core_cookie = 0UL;
 	if (rq->core->core_forceidle) {
 		need_sync = true;
+		fi_before = true;
 		rq->core->core_forceidle = false;
 	}
 	for_each_cpu(i, smt_mask) {
@@ -4723,6 +4714,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			update_rq_clock(rq_i);
 	}
 
+	/* Reset the snapshot if core is no longer in force-idle. */
+	if (!fi_before) {
+		for_each_cpu(i, smt_mask) {
+			struct rq *rq_i = cpu_rq(i);
+			rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
+		}
+	}
+
 	/*
 	 * Try and select tasks for each sibling in descending sched_class
 	 * order.
@@ -4859,6 +4858,14 @@ next_class:;
 		resched_curr(rq_i);
 	}
 
+	/* Snapshot if core is in force-idle. */
+	if (!fi_before && rq->core->core_forceidle) {
+		for_each_cpu(i, smt_mask) {
+			struct rq *rq_i = cpu_rq(i);
+			rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
+		}
+	}
+
 done:
 	set_next_task(rq, next);
 	return next;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56bea0decda1..9cae08c3fca1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10686,6 +10686,46 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
 	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
 		resched_curr(rq);
 }
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	bool samecpu = task_cpu(a) == task_cpu(b);
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	struct cfs_rq *cfs_rqa;
+	struct cfs_rq *cfs_rqb;
+	s64 delta;
+
+	if (samecpu) {
+		/* vruntime is per cfs_rq */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+		goto out;
+	}
+
+	/* cross-cpu: compare root-level se's vruntime to decide priority */
+	while (sea->parent)
+		sea = sea->parent;
+	while (seb->parent)
+		seb = seb->parent;
+
+	cfs_rqa = sea->cfs_rq;
+	cfs_rqb = seb->cfs_rq;
+
+	/* normalize vruntime WRT their rq's base */
+	delta = (s64)(sea->vruntime - seb->vruntime) +
+		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
+out:
+	return delta > 0;
+}
 #else
 static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 884d23d5e55d..dfdb0ebb07a8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -524,6 +524,9 @@ struct cfs_rq {
 
 	u64			exec_clock;
 	u64			min_vruntime;
+#ifdef CONFIG_SCHED_CORE
+	u64			min_vruntime_fi;
+#endif
 #ifndef CONFIG_64BIT
 	u64			min_vruntime_copy;
 #endif
@@ -1106,6 +1109,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 09/26] sched: Trivial forced-newidle balancer
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (7 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 10/26] sched: migration changes for core scheduling Joel Fernandes (Google)
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Paul E . McKenney, Aubrey Li, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

When a sibling is forced idle to match the core cookie, search for
matching tasks to fill the core.

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using the RCU-sched flavor instead.
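
The stealing itself boils down to "walk another CPU's cookie-ordered
tasks and pull the first one whose cookie matches this core". A
standalone sketch (plain C, not kernel code; task names and cookies are
invented, and the real try_steal_cookie() below also checks cpus_mask,
core_occupation and takes the runqueue locks):

#include <stdio.h>
#include <stddef.h>

struct demo_task {
	const char	*comm;
	unsigned long	cookie;
};

static const struct demo_task *steal_matching(const struct demo_task *src,
					       size_t nr, unsigned long cookie)
{
	for (size_t i = 0; i < nr; i++) {
		if (src[i].cookie == cookie)
			return &src[i];	/* migrate this one to the idle CPU */
	}
	return NULL;
}

int main(void)
{
	const struct demo_task runnable[] = {
		{ "other-vm", 0x2 },
		{ "same-vm",  0x1 },
	};
	const struct demo_task *p;

	p = steal_matching(runnable, 2, 0x1 /* this core's cookie */);
	printf("steal %s\n", p ? p->comm : "nothing");
	return 0;
}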

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 130 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/idle.c   |   1 +
 kernel/sched/sched.h  |   6 ++
 4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c3563d7cab7f..d38e904dd603 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
 	struct rb_node			core_node;
 	unsigned long			core_cookie;
+	unsigned int			core_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a5404ec9e89a..02db5b024768 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -202,6 +202,21 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
 	return match;
 }
 
+static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
+{
+	struct rb_node *node = &p->core_node;
+
+	node = rb_next(node);
+	if (!node)
+		return NULL;
+
+	p = container_of(node, struct task_struct, core_node);
+	if (p->core_cookie != cookie)
+		return NULL;
+
+	return p;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -4638,8 +4653,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
 	bool fi_before = false;
+	int i, j, cpu, occ = 0;
 	bool need_sync;
-	int i, j, cpu;
 
 	if (!sched_core_enabled(rq))
 		return __pick_next_task(rq, prev, rf);
@@ -4768,6 +4783,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				goto done;
 			}
 
+			if (!is_task_rq_idle(p))
+				occ++;
+
 			rq_i->core_pick = p;
 
 			/*
@@ -4793,6 +4811,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 						cpu_rq(j)->core_pick = NULL;
 					}
+					occ = 1;
 					goto again;
 				} else {
 					/*
@@ -4842,6 +4861,8 @@ next_class:;
 			rq_i->core->core_forceidle = true;
 		}
 
+		rq_i->core_pick->core_occupation = occ;
+
 		if (i == cpu) {
 			rq_i->core_pick = NULL;
 			continue;
@@ -4871,6 +4892,113 @@ next_class:;
 	return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+	struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+	struct task_struct *p;
+	unsigned long cookie;
+	bool success = false;
+
+	local_irq_disable();
+	double_rq_lock(dst, src);
+
+	cookie = dst->core->core_cookie;
+	if (!cookie)
+		goto unlock;
+
+	if (dst->curr != dst->idle)
+		goto unlock;
+
+	p = sched_core_find(src, cookie);
+	if (p == src->idle)
+		goto unlock;
+
+	do {
+		if (p == src->core_pick || p == src->curr)
+			goto next;
+
+		if (!cpumask_test_cpu(this, &p->cpus_mask))
+			goto next;
+
+		if (p->core_occupation > dst->idle->core_occupation)
+			goto next;
+
+		p->on_rq = TASK_ON_RQ_MIGRATING;
+		deactivate_task(src, p, 0);
+		set_task_cpu(p, this);
+		activate_task(dst, p, 0);
+		p->on_rq = TASK_ON_RQ_QUEUED;
+
+		resched_curr(dst);
+
+		success = true;
+		break;
+
+next:
+		p = sched_core_next(p, cookie);
+	} while (p);
+
+unlock:
+	double_rq_unlock(dst, src);
+	local_irq_enable();
+
+	return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+	int i;
+
+	for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+		if (i == cpu)
+			continue;
+
+		if (need_resched())
+			break;
+
+		if (try_steal_cookie(cpu, i))
+			return true;
+	}
+
+	return false;
+}
+
+static void sched_core_balance(struct rq *rq)
+{
+	struct sched_domain *sd;
+	int cpu = cpu_of(rq);
+
+	preempt_disable();
+	rcu_read_lock();
+	raw_spin_unlock_irq(rq_lockp(rq));
+	for_each_domain(cpu, sd) {
+		if (need_resched())
+			break;
+
+		if (steal_cookie_task(cpu, sd))
+			break;
+	}
+	raw_spin_lock_irq(rq_lockp(rq));
+	rcu_read_unlock();
+	preempt_enable();
+}
+
+static DEFINE_PER_CPU(struct callback_head, core_balance_head);
+
+void queue_core_balance(struct rq *rq)
+{
+	if (!sched_core_enabled(rq))
+		return;
+
+	if (!rq->core->core_cookie)
+		return;
+
+	if (!rq->nr_running) /* not forced idle */
+		return;
+
+	queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
+}
+
 static inline void sched_core_cpu_starting(unsigned int cpu)
 {
 	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index ce7552c6bc65..a74926be80ac 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -403,6 +403,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
+	queue_core_balance(rq);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dfdb0ebb07a8..58f741b52103 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1111,6 +1111,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
 
+extern void queue_core_balance(struct rq *rq);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1123,6 +1125,10 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+static inline void queue_core_balance(struct rq *rq)
+{
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 10/26] sched: migration changes for core scheduling
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (8 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 09/26] sched: Trivial forced-newidle balancer Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 11/26] irq_work: Cleanup Joel Fernandes (Google)
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Aubrey Li, Vineeth Remanan Pillai, Paul E. McKenney,
	Tim Chen

From: Aubrey Li <aubrey.li@intel.com>

 - Don't migrate if there is a cookie mismatch
     Load balancing tries to move a task from the busiest CPU to the
     destination CPU. When core scheduling is enabled, if the task's
     cookie does not match the destination CPU's core cookie, the task
     is skipped by this CPU. This mitigates the forced idle time on the
     destination CPU (the common check is sketched after this list).

 - Select a cookie-matched idle CPU
     In the fast path of task wakeup, select the first cookie-matched
     idle CPU instead of the first idle CPU.

 - Find a cookie-matched idlest CPU
     In the slow path of task wakeup, find the idlest CPU whose core
     cookie matches the task's cookie.

 - Don't migrate a task if the cookies don't match
     For NUMA load balancing, don't migrate a task to a CPU whose core
     cookie does not match the task's cookie.
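
All four changes gate on the same condition. A standalone sketch of
that check (plain C, not kernel code; cookie values are invented; the
kernel-side helper is sched_core_cookie_match() in the sched.h hunk
below):

#include <stdio.h>
#include <stdbool.h>

static bool cookie_match(unsigned long core_cookie, unsigned long task_cookie,
			 bool core_is_idle)
{
	/* an idle core is always an acceptable destination */
	return core_is_idle || core_cookie == task_cookie;
}

int main(void)
{
	printf("%d\n", cookie_match(0x1, 0x1, false));	/* 1: same cookie   */
	printf("%d\n", cookie_match(0x2, 0x1, false));	/* 0: skip this CPU */
	printf("%d\n", cookie_match(0x2, 0x1, true));	/* 1: idle core     */
	return 0;
}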

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
---
 kernel/sched/fair.c  | 64 ++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h | 29 ++++++++++++++++++++
 2 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9cae08c3fca1..93a3b874077d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1912,6 +1912,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
 			continue;
 
+#ifdef CONFIG_SCHED_CORE
+		/*
+		 * Skip this cpu if source task's cookie does not match
+		 * with CPU's core cookie.
+		 */
+		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+			continue;
+#endif
+
 		env->dst_cpu = cpu;
 		if (task_numa_compare(env, taskimp, groupimp, maymove))
 			break;
@@ -5846,11 +5855,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		struct rq *rq = cpu_rq(i);
+
+#ifdef CONFIG_SCHED_CORE
+		if (!sched_core_cookie_match(rq, p))
+			continue;
+#endif
+
 		if (sched_idle_cpu(i))
 			return i;
 
 		if (available_idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
@@ -6108,8 +6123,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
 			return -1;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-			break;
+
+		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+#ifdef CONFIG_SCHED_CORE
+			/*
+			 * If Core Scheduling is enabled, select this cpu
+			 * only if the process cookie matches core cookie.
+			 */
+			if (sched_core_enabled(cpu_rq(cpu)) &&
+			    p->core_cookie == cpu_rq(cpu)->core->core_cookie)
+#endif
+				break;
+		}
 	}
 
 	time = cpu_clock(this) - time;
@@ -7495,8 +7520,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
 	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 3) task's cookie does not match this CPU's core cookie, or
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
@@ -7531,6 +7557,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+#ifdef CONFIG_SCHED_CORE
+	/*
+	 * Don't migrate task if the task's cookie does not match
+	 * with the destination CPU's core cookie.
+	 */
+	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+		return 0;
+#endif
+
 	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
@@ -8757,6 +8792,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 					p->cpus_ptr))
 			continue;
 
+#ifdef CONFIG_SCHED_CORE
+		if (sched_core_enabled(cpu_rq(this_cpu))) {
+			int i = 0;
+			bool cookie_match = false;
+
+			for_each_cpu(i, sched_group_span(group)) {
+				struct rq *rq = cpu_rq(i);
+
+				if (sched_core_cookie_match(rq, p)) {
+					cookie_match = true;
+					break;
+				}
+			}
+			/* Skip over this group if no cookie matched */
+			if (!cookie_match)
+				continue;
+		}
+#endif
+
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 58f741b52103..d0c7a7f87d73 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1111,6 +1111,35 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
 
+/*
+ * Helper to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	bool idle_core = true;
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+		if (!available_idle_cpu(cpu)) {
+			idle_core = false;
+			break;
+		}
+	}
+
+	/*
+	 * A CPU in an idle core is always the best choice for tasks with
+	 * cookies.
+	 */
+	return idle_core || rq->core->core_cookie == p->core_cookie;
+}
+
 extern void queue_core_balance(struct rq *rq);
 
 #else /* !CONFIG_SCHED_CORE */
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 11/26] irq_work: Cleanup
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (9 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 10/26] sched: migration changes for core scheduling Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 12/26] arch/x86: Add a new TIF flag for untrusted tasks Joel Fernandes (Google)
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Get rid of the __call_single_node union and clean up the API a little
to avoid external code relying on the structure layout as much.

(Needed for irq_work_is_busy() API in core-scheduling series).
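
A short usage sketch of the reworked API (a kernel-style fragment, not
part of this series; only initializers and helpers added or kept by the
diff below are used): the initializers hide the node/flags layout, and
irq_work_is_busy()/irq_work_is_pending() replace open-coded
atomic_read()s of work->flags.

#include <linux/irq_work.h>

static void demo_fn(struct irq_work *work)
{
	/* runs from the irq_work interrupt, or from the tick for lazy work */
}

/* static initialization without touching the flags word directly */
static DEFINE_IRQ_WORK(demo_work, demo_fn);
static struct irq_work demo_lazy = IRQ_WORK_INIT_LAZY(demo_fn);

/* runtime initialization for dynamically set up work */
static struct irq_work demo_dynamic;

static void demo_queue(void)
{
	init_irq_work(&demo_dynamic, demo_fn);

	irq_work_queue(&demo_work);
	irq_work_queue(&demo_lazy);
	irq_work_queue(&demo_dynamic);

	if (irq_work_is_busy(&demo_work))	/* instead of peeking at ->flags */
		irq_work_sync(&demo_work);
}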

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 drivers/gpu/drm/i915/i915_request.c |  4 ++--
 include/linux/irq_work.h            | 33 ++++++++++++++++++-----------
 include/linux/irqflags.h            |  4 ++--
 kernel/bpf/stackmap.c               |  2 +-
 kernel/irq_work.c                   | 18 ++++++++--------
 kernel/printk/printk.c              |  6 ++----
 kernel/rcu/tree.c                   |  3 +--
 kernel/time/tick-sched.c            |  6 ++----
 kernel/trace/bpf_trace.c            |  2 +-
 9 files changed, 41 insertions(+), 37 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 0e813819b041..5385b081a376 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -197,7 +197,7 @@ __notify_execute_cb(struct i915_request *rq, bool (*fn)(struct irq_work *wrk))
 
 	llist_for_each_entry_safe(cb, cn,
 				  llist_del_all(&rq->execute_cb),
-				  work.llnode)
+				  work.node.llist)
 		fn(&cb->work);
 }
 
@@ -460,7 +460,7 @@ __await_execution(struct i915_request *rq,
 	 * callback first, then checking the ACTIVE bit, we serialise with
 	 * the completed/retired request.
 	 */
-	if (llist_add(&cb->work.llnode, &signal->execute_cb)) {
+	if (llist_add(&cb->work.node.llist, &signal->execute_cb)) {
 		if (i915_request_is_active(signal) ||
 		    __request_in_flight(signal))
 			__notify_execute_cb_imm(signal);
diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 30823780c192..ec2a47a81e42 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -14,28 +14,37 @@
  */
 
 struct irq_work {
-	union {
-		struct __call_single_node node;
-		struct {
-			struct llist_node llnode;
-			atomic_t flags;
-		};
-	};
+	struct __call_single_node node;
 	void (*func)(struct irq_work *);
 };
 
+#define __IRQ_WORK_INIT(_func, _flags) (struct irq_work){	\
+	.node = { .u_flags = (_flags), },			\
+	.func = (_func),					\
+}
+
+#define IRQ_WORK_INIT(_func) __IRQ_WORK_INIT(_func, 0)
+#define IRQ_WORK_INIT_LAZY(_func) __IRQ_WORK_INIT(_func, IRQ_WORK_LAZY)
+#define IRQ_WORK_INIT_HARD(_func) __IRQ_WORK_INIT(_func, IRQ_WORK_HARD_IRQ)
+
+#define DEFINE_IRQ_WORK(name, _f)				\
+	struct irq_work name = IRQ_WORK_INIT(_f)
+
 static inline
 void init_irq_work(struct irq_work *work, void (*func)(struct irq_work *))
 {
-	atomic_set(&work->flags, 0);
-	work->func = func;
+	*work = IRQ_WORK_INIT(func);
 }
 
-#define DEFINE_IRQ_WORK(name, _f) struct irq_work name = {	\
-		.flags = ATOMIC_INIT(0),			\
-		.func  = (_f)					\
+static inline bool irq_work_is_pending(struct irq_work *work)
+{
+	return atomic_read(&work->node.a_flags) & IRQ_WORK_PENDING;
 }
 
+static inline bool irq_work_is_busy(struct irq_work *work)
+{
+	return atomic_read(&work->node.a_flags) & IRQ_WORK_BUSY;
+}
 
 bool irq_work_queue(struct irq_work *work);
 bool irq_work_queue_on(struct irq_work *work, int cpu);
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 3ed4e8771b64..fef2d43a7a1d 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -109,12 +109,12 @@ do {						\
 
 # define lockdep_irq_work_enter(__work)					\
 	  do {								\
-		  if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\
+		  if (!(atomic_read(&__work->node.a_flags) & IRQ_WORK_HARD_IRQ))\
 			current->irq_config = 1;			\
 	  } while (0)
 # define lockdep_irq_work_exit(__work)					\
 	  do {								\
-		  if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\
+		  if (!(atomic_read(&__work->node.a_flags) & IRQ_WORK_HARD_IRQ))\
 			current->irq_config = 0;			\
 	  } while (0)
 
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 06065fa27124..599041cd0c8a 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -298,7 +298,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	if (irqs_disabled()) {
 		if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
 			work = this_cpu_ptr(&up_read_work);
-			if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY) {
+			if (irq_work_is_busy(&work->irq_work)) {
 				/* cannot queue more up_read, fallback */
 				irq_work_busy = true;
 			}
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index eca83965b631..fbff25adb574 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -31,7 +31,7 @@ static bool irq_work_claim(struct irq_work *work)
 {
 	int oflags;
 
-	oflags = atomic_fetch_or(IRQ_WORK_CLAIMED | CSD_TYPE_IRQ_WORK, &work->flags);
+	oflags = atomic_fetch_or(IRQ_WORK_CLAIMED | CSD_TYPE_IRQ_WORK, &work->node.a_flags);
 	/*
 	 * If the work is already pending, no need to raise the IPI.
 	 * The pairing atomic_fetch_andnot() in irq_work_run() makes sure
@@ -53,12 +53,12 @@ void __weak arch_irq_work_raise(void)
 static void __irq_work_queue_local(struct irq_work *work)
 {
 	/* If the work is "lazy", handle it from next tick if any */
-	if (atomic_read(&work->flags) & IRQ_WORK_LAZY) {
-		if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
+	if (atomic_read(&work->node.a_flags) & IRQ_WORK_LAZY) {
+		if (llist_add(&work->node.llist, this_cpu_ptr(&lazy_list)) &&
 		    tick_nohz_tick_stopped())
 			arch_irq_work_raise();
 	} else {
-		if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
+		if (llist_add(&work->node.llist, this_cpu_ptr(&raised_list)))
 			arch_irq_work_raise();
 	}
 }
@@ -102,7 +102,7 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (cpu != smp_processor_id()) {
 		/* Arch remote IPI send/receive backend aren't NMI safe */
 		WARN_ON_ONCE(in_nmi());
-		__smp_call_single_queue(cpu, &work->llnode);
+		__smp_call_single_queue(cpu, &work->node.llist);
 	} else {
 		__irq_work_queue_local(work);
 	}
@@ -142,7 +142,7 @@ void irq_work_single(void *arg)
 	 * to claim that work don't rely on us to handle their data
 	 * while we are in the middle of the func.
 	 */
-	flags = atomic_fetch_andnot(IRQ_WORK_PENDING, &work->flags);
+	flags = atomic_fetch_andnot(IRQ_WORK_PENDING, &work->node.a_flags);
 
 	lockdep_irq_work_enter(work);
 	work->func(work);
@@ -152,7 +152,7 @@ void irq_work_single(void *arg)
 	 * no-one else claimed it meanwhile.
 	 */
 	flags &= ~IRQ_WORK_PENDING;
-	(void)atomic_cmpxchg(&work->flags, flags, flags & ~IRQ_WORK_BUSY);
+	(void)atomic_cmpxchg(&work->node.a_flags, flags, flags & ~IRQ_WORK_BUSY);
 }
 
 static void irq_work_run_list(struct llist_head *list)
@@ -166,7 +166,7 @@ static void irq_work_run_list(struct llist_head *list)
 		return;
 
 	llnode = llist_del_all(list);
-	llist_for_each_entry_safe(work, tmp, llnode, llnode)
+	llist_for_each_entry_safe(work, tmp, llnode, node.llist)
 		irq_work_single(work);
 }
 
@@ -198,7 +198,7 @@ void irq_work_sync(struct irq_work *work)
 {
 	lockdep_assert_irqs_enabled();
 
-	while (atomic_read(&work->flags) & IRQ_WORK_BUSY)
+	while (irq_work_is_busy(work))
 		cpu_relax();
 }
 EXPORT_SYMBOL_GPL(irq_work_sync);
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index fe64a49344bf..9ef23d4b07c7 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3025,10 +3025,8 @@ static void wake_up_klogd_work_func(struct irq_work *irq_work)
 		wake_up_interruptible(&log_wait);
 }
 
-static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = {
-	.func = wake_up_klogd_work_func,
-	.flags = ATOMIC_INIT(IRQ_WORK_LAZY),
-};
+static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) =
+	IRQ_WORK_INIT_LAZY(wake_up_klogd_work_func);
 
 void wake_up_klogd(void)
 {
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 06895ef85d69..a41e84f1b55a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1311,8 +1311,6 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 		if (IS_ENABLED(CONFIG_IRQ_WORK) &&
 		    !rdp->rcu_iw_pending && rdp->rcu_iw_gp_seq != rnp->gp_seq &&
 		    (rnp->ffmask & rdp->grpmask)) {
-			init_irq_work(&rdp->rcu_iw, rcu_iw_handler);
-			atomic_set(&rdp->rcu_iw.flags, IRQ_WORK_HARD_IRQ);
 			rdp->rcu_iw_pending = true;
 			rdp->rcu_iw_gp_seq = rnp->gp_seq;
 			irq_work_queue_on(&rdp->rcu_iw, rdp->cpu);
@@ -3964,6 +3962,7 @@ int rcutree_prepare_cpu(unsigned int cpu)
 	rdp->cpu_no_qs.b.norm = true;
 	rdp->core_needs_qs = false;
 	rdp->rcu_iw_pending = false;
+	rdp->rcu_iw = IRQ_WORK_INIT_HARD(rcu_iw_handler);
 	rdp->rcu_iw_gp_seq = rdp->gp_seq - 1;
 	trace_rcu_grace_period(rcu_state.name, rdp->gp_seq, TPS("cpuonl"));
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 81632cd5e3b7..1b734070f028 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -243,10 +243,8 @@ static void nohz_full_kick_func(struct irq_work *work)
 	/* Empty, the tick restart happens on tick_nohz_irq_exit() */
 }
 
-static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) = {
-	.func = nohz_full_kick_func,
-	.flags = ATOMIC_INIT(IRQ_WORK_HARD_IRQ),
-};
+static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) =
+	IRQ_WORK_INIT_HARD(nohz_full_kick_func);
 
 /*
  * Kick this CPU if it's full dynticks in order to force it to
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 4517c8b66518..a6903912f7a0 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1086,7 +1086,7 @@ static int bpf_send_signal_common(u32 sig, enum pid_type type)
 			return -EINVAL;
 
 		work = this_cpu_ptr(&send_signal_work);
-		if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY)
+		if (irq_work_is_busy(&work->irq_work))
 			return -EBUSY;
 
 		/* Add the current task, which is the target of sending signal,
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 12/26] arch/x86: Add a new TIF flag for untrusted tasks
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (10 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 11/26] irq_work: Cleanup Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode Joel Fernandes (Google)
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

Add a new TIF flag to indicate whether the kernel needs to be careful
and take additional steps to mitigate micro-architectural issues during
entry into user or guest mode.

This new flag will be used by later patches in the series to determine
whether waiting is needed during exit to user or guest mode.
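
As a rough sketch of the intended use (illustrative only; the real wiring
lands in later patches of this series, and wait_for_core_to_be_safe() is a
made-up placeholder for the actual wait logic):

	/* Kernel context. On entry, mark a tagged task as needing a safety check. */
	if (current->core_cookie)
		set_tsk_thread_flag(current, TIF_UNSAFE_RET);

	/* On exit to user/guest mode, act on the flag before clearing it. */
	if (test_thread_flag(TIF_UNSAFE_RET)) {
		wait_for_core_to_be_safe();	/* placeholder */
		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
	}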

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/x86/include/asm/thread_info.h | 2 ++
 kernel/sched/sched.h               | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index c448fcfa1b82..45b6dbdf116e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -99,6 +99,7 @@ struct thread_info {
 #define TIF_SPEC_FORCE_UPDATE	23	/* Force speculation MSR update in context switch */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
+#define TIF_UNSAFE_RET   	26	/* On return to process/guest, perform safety checks. */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
 #define TIF_SYSCALL_TRACEPOINT	28	/* syscall tracepoint instrumentation */
 #define TIF_ADDR32		29	/* 32-bit address space on 64 bits */
@@ -129,6 +130,7 @@ struct thread_info {
 #define _TIF_SPEC_FORCE_UPDATE	(1 << TIF_SPEC_FORCE_UPDATE)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
 #define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
+#define _TIF_UNSAFE_RET 	(1 << TIF_UNSAFE_RET)
 #define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_ADDR32		(1 << TIF_ADDR32)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d0c7a7f87d73..f7e2d8a3be8e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2769,3 +2769,9 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 
 void swake_up_all_locked(struct swait_queue_head *q);
 void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
+
+#ifdef CONFIG_SCHED_CORE
+#ifndef TIF_UNSAFE_RET
+#define TIF_UNSAFE_RET (0)
+#endif
+#endif
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (11 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 12/26] arch/x86: Add a new TIF flag for untrusted tasks Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  3:41   ` Randy Dunlap
                     ` (2 more replies)
  2020-10-20  1:43 ` [PATCH v8 -tip 14/26] entry/idle: Enter and exit kernel protection during idle entry and exit Joel Fernandes (Google)
                   ` (14 subsequent siblings)
  27 siblings, 3 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Tim Chen, Paul E . McKenney

Core-scheduling prevents hyperthreads in usermode from attacking each
other, but it does not do anything about one of the hyperthreads
entering the kernel for any reason. This leaves the door open for MDS
and L1TF attacks with concurrent execution sequences between
hyperthreads.

This patch therefore adds support for protecting all syscall and IRQ
kernel mode entries. Care is taken to track the outermost usermode exit
and entry using per-cpu counters. In cases where one of the hyperthreads
enters the kernel, no additional IPIs are sent. Further, IPIs are avoided
when not needed - for example, idle and non-cookie HTs do not need to be
forced into kernel mode.

More information about attacks:
For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
data to either host or guest attackers. For L1TF, it is possible to leak
to guest attackers. There is no possible mitigation involving flushing
of buffers to avoid this, since the execution of attacker and victim
happens concurrently on 2 or more HTs.
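
A simplified model of the protection added below (locking, the per-CPU
nesting counter and the IPI-avoidance fast paths are omitted;
ipi_siblings_into_kernel() is a made-up placeholder for the irq_work kick):

static atomic_t core_unsafe_nest;	/* one instance per core in the real code */

void unsafe_enter(void)			/* kernel entry */
{
	if (atomic_inc_return(&core_unsafe_nest) == 1)
		ipi_siblings_into_kernel();
}

void unsafe_exit(void)			/* outermost exit to user/guest */
{
	atomic_dec(&core_unsafe_nest);
}

void wait_till_safe(void)		/* spin before returning to user/guest */
{
	while (atomic_read(&core_unsafe_nest))
		cpu_relax();
}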

Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/kernel-parameters.txt         |   7 +
 include/linux/entry-common.h                  |   2 +-
 include/linux/sched.h                         |  12 +
 kernel/entry/common.c                         |  25 +-
 kernel/sched/core.c                           | 229 ++++++++++++++++++
 kernel/sched/sched.h                          |   3 +
 6 files changed, 275 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3236427e2215..48567110f709 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4678,6 +4678,13 @@
 
 	sbni=		[NET] Granch SBNI12 leased line adapter
 
+	sched_core_protect_kernel=
+			[SCHED_CORE] Pause SMT siblings of a core running in
+			user mode, if at least one of the siblings of the core
+			is running in kernel mode. This is to guarantee that
+			kernel data is not leaked to tasks which are not trusted
+			by the kernel.
+
 	sched_debug	[KNL] Enables verbose scheduler debug messages.
 
 	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 474f29638d2c..260216de357b 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -69,7 +69,7 @@
 
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
-	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
+	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET |	\
 	 ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d38e904dd603..fe6f225bfbf9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+void sched_core_unsafe_enter(void);
+void sched_core_unsafe_exit(void);
+bool sched_core_wait_till_safe(unsigned long ti_check);
+bool sched_core_kernel_protected(void);
+#else
+#define sched_core_unsafe_enter(ignore) do { } while (0)
+#define sched_core_unsafe_exit(ignore) do { } while (0)
+#define sched_core_wait_till_safe(ignore) do { } while (0)
+#define sched_core_kernel_protected(ignore) do { } while (0)
+#endif
+
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 0a1e20f8d4e8..c8dc6b1b1f40 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -137,6 +137,26 @@ static __always_inline void exit_to_user_mode(void)
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal(struct pt_regs *regs) { }
 
+unsigned long exit_to_user_get_work(void)
+{
+	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+
+	if (IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
+		return ti_work;
+
+#ifdef CONFIG_SCHED_CORE
+	ti_work &= EXIT_TO_USER_MODE_WORK;
+	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
+		sched_core_unsafe_exit();
+		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
+			sched_core_unsafe_enter(); /* not exiting to user yet. */
+		}
+	}
+
+	return READ_ONCE(current_thread_info()->flags);
+#endif
+}
+
 static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 					    unsigned long ti_work)
 {
@@ -175,7 +195,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		 * enabled above.
 		 */
 		local_irq_disable_exit_to_user();
-		ti_work = READ_ONCE(current_thread_info()->flags);
+		ti_work = exit_to_user_get_work();
 	}
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
@@ -184,9 +204,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 static void exit_to_user_mode_prepare(struct pt_regs *regs)
 {
-	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+	unsigned long ti_work;
 
 	lockdep_assert_irqs_disabled();
+	ti_work = exit_to_user_get_work();
 
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 02db5b024768..5a7aeaa914e3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
 
 #ifdef CONFIG_SCHED_CORE
 
+DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
+static int __init set_sched_core_protect_kernel(char *str)
+{
+	unsigned long val = 0;
+
+	if (!str)
+		return 0;
+
+	if (!kstrtoul(str, 0, &val) && !val)
+		static_branch_disable(&sched_core_protect_kernel);
+
+	return 1;
+}
+__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
+
+/* Is the kernel protected by core scheduling? */
+bool sched_core_kernel_protected(void)
+{
+	return static_branch_likely(&sched_core_protect_kernel);
+}
+
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
 /* kernel prio, less is more */
@@ -4596,6 +4617,214 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
+/*
+ * IRQ work handler used to force the sibling into the kernel. It does nothing
+ * because the exit to usermode or guest mode will do the actual waiting if needed.
+ */
+static void sched_core_irq_work(struct irq_work *work)
+{
+	return;
+}
+
+static inline void init_sched_core_irq_work(struct rq *rq)
+{
+	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
+}
+
+/*
+ * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
+ * exits the core-wide unsafe state. Obviously the CPU calling this function
+ * should not be responsible for the core being in the core-wide unsafe state
+ * otherwise it will deadlock.
+ *
+ * @ti_check: TIF flags to check while we spin with IRQs enabled and preemption
+ *            disabled; if any are set, break out of the loop and notify the caller.
+ *
+ * IRQs should be disabled.
+ */
+bool sched_core_wait_till_safe(unsigned long ti_check)
+{
+	bool restart = false;
+	struct rq *rq;
+	int cpu;
+
+	/* We clear the thread flag only at the end, so need to check for it. */
+	ti_check &= ~_TIF_UNSAFE_RET;
+
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/* Downgrade to allow interrupts, to prevent stop_machine lockups. */
+	preempt_disable();
+	local_irq_enable();
+
+	/*
+	 * Wait till the core of this HT is not in an unsafe state.
+	 *
+	 * Pair with smp_store_release() in sched_core_unsafe_exit().
+	 */
+	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
+		cpu_relax();
+		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
+			restart = true;
+			break;
+		}
+	}
+
+	/* Upgrade it back to the expectations of entry code. */
+	local_irq_disable();
+	preempt_enable();
+
+ret:
+	if (!restart)
+		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+	return restart;
+}
+
+/*
+ * Enter the core-wide unsafe state. The sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt is
+ * made to avoid sending useless IPIs. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_unsafe_enter(void)
+{
+	const struct cpumask *smt_mask;
+	unsigned long flags;
+	struct rq *rq;
+	int i, cpu;
+
+	if (!static_branch_likely(&sched_core_protect_kernel))
+		return;
+
+	/* Ensure that on return to user/guest, we check whether to wait. */
+	if (current->core_cookie)
+		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
+	rq->core_this_unsafe_nest++;
+
+	/* Should not nest: enter() should only pair with exit(). */
+	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
+		goto ret;
+
+	raw_spin_lock(rq_lockp(rq));
+	smt_mask = cpu_smt_mask(cpu);
+
+	/* Contribute this CPU's unsafe_enter() to core-wide unsafe_enter() count. */
+	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
+
+	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
+		goto unlock;
+
+	if (irq_work_is_busy(&rq->core_irq_work)) {
+		/*
+		 * Do nothing more since we are in an IPI sent from another
+		 * sibling to enforce safety. That sibling would have sent IPIs
+		 * to all of the HTs.
+		 */
+		goto unlock;
+	}
+
+	/*
+	 * If we are not the first ones on the core to enter core-wide unsafe
+	 * state, do nothing.
+	 */
+	if (rq->core->core_unsafe_nest > 1)
+		goto unlock;
+
+	/* Do nothing more if the core is not tagged. */
+	if (!rq->core->core_cookie)
+		goto unlock;
+
+	for_each_cpu(i, smt_mask) {
+		struct rq *srq = cpu_rq(i);
+
+		if (i == cpu || cpu_is_offline(i))
+			continue;
+
+		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
+			continue;
+
+		/* Skip if HT is not running a tagged task. */
+		if (!srq->curr->core_cookie && !srq->core_pick)
+			continue;
+
+		/*
+		 * Force sibling into the kernel by IPI. If work was already
+		 * pending, no new IPIs are sent. This is Ok since the receiver
+		 * would already be in the kernel, or on its way to it.
+		 */
+		irq_work_queue_on(&srq->core_irq_work, i);
+	}
+unlock:
+	raw_spin_unlock(rq_lockp(rq));
+ret:
+	local_irq_restore(flags);
+}
+
+/*
+ * Process any work needed for exiting the core-wide unsafe state: decrement the
+ * per-CPU and core-wide nesting counters incremented by sched_core_unsafe_enter().
+ *
+ * Called on the outermost exit to user/guest mode and when entering idle.
+ */
+void sched_core_unsafe_exit(void)
+{
+	unsigned long flags;
+	unsigned int nest;
+	struct rq *rq;
+	int cpu;
+
+	if (!static_branch_likely(&sched_core_protect_kernel))
+		return;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+
+	/* Do nothing if core-sched disabled. */
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/*
+	 * Can happen when a process is forked and the first return to user
+	 * mode is a syscall exit. Either way, there's nothing to do.
+	 */
+	if (rq->core_this_unsafe_nest == 0)
+		goto ret;
+
+	rq->core_this_unsafe_nest--;
+
+	/* enter() should be paired with exit() only. */
+	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
+		goto ret;
+
+	raw_spin_lock(rq_lockp(rq));
+	/*
+	 * Core-wide nesting counter can never be 0 because we are
+	 * still in it on this CPU.
+	 */
+	nest = rq->core->core_unsafe_nest;
+	WARN_ON_ONCE(!nest);
+
+	/* Pair with smp_load_acquire() in sched_core_wait_till_safe(). */
+	smp_store_release(&rq->core->core_unsafe_nest, nest - 1);
+	raw_spin_unlock(rq_lockp(rq));
+ret:
+	local_irq_restore(flags);
+}
+
 // XXX fairness/fwd progress conditions
 /*
  * Returns
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f7e2d8a3be8e..4bcf3b1ddfb3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1059,12 +1059,15 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	struct irq_work		core_irq_work; /* To force HT into kernel */
+	unsigned int		core_this_unsafe_nest;
 
 	/* shared state */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
 	unsigned char		core_forceidle;
+	unsigned int		core_unsafe_nest;
 #endif
 };
 
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 14/26] entry/idle: Enter and exit kernel protection during idle entry and exit
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (12 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 15/26] entry/kvm: Protect the kernel when entering from guest Joel Fernandes (Google)
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

Add a generic_idle_{enter,exit} helper function to enter and exit kernel
protection when entering and exiting idle, respectively.

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/entry-common.h | 18 ++++++++++++++++++
 kernel/sched/idle.c          | 11 ++++++-----
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 260216de357b..879562d920f2 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -402,4 +402,22 @@ void irqentry_exit_cond_resched(void);
  */
 void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
 
+/**
+ * generic_idle_enter - Called during entry into idle for housekeeping.
+ */
+static inline void generic_idle_enter(void)
+{
+	/* Entering idle ends the protected kernel region. */
+	sched_core_unsafe_exit();
+}
+
+/**
+ * generic_idle_enter - Called when exiting idle for housekeeping.
+ */
+static inline void generic_idle_exit(void)
+{
+	/* Exiting idle (re)starts the protected kernel region. */
+	sched_core_unsafe_enter();
+}
+
 #endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index a74926be80ac..029ba61576f2 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -8,6 +8,7 @@
  */
 #include "sched.h"
 
+#include <linux/entry-common.h>
 #include <trace/events/power.h>
 
 /* Linker adds these: start and end of __cpuidle functions */
@@ -54,6 +55,7 @@ __setup("hlt", cpu_idle_nopoll_setup);
 
 static noinline int __cpuidle cpu_idle_poll(void)
 {
+	generic_idle_enter();
 	trace_cpu_idle(0, smp_processor_id());
 	stop_critical_timings();
 	rcu_idle_enter();
@@ -66,6 +68,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 	rcu_idle_exit();
 	start_critical_timings();
 	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+	generic_idle_exit();
 
 	return 1;
 }
@@ -156,11 +159,7 @@ static void cpuidle_idle_call(void)
 		return;
 	}
 
-	/*
-	 * The RCU framework needs to be told that we are entering an idle
-	 * section, so no more rcu read side critical sections and one more
-	 * step to the grace period
-	 */
+	generic_idle_enter();
 
 	if (cpuidle_not_available(drv, dev)) {
 		tick_nohz_idle_stop_tick();
@@ -225,6 +224,8 @@ static void cpuidle_idle_call(void)
 	 */
 	if (WARN_ON_ONCE(irqs_disabled()))
 		local_irq_enable();
+
+	generic_idle_exit();
 }
 
 /*
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 15/26] entry/kvm: Protect the kernel when entering from guest
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (13 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 14/26] entry/idle: Enter and exit kernel protection during idle entry and exit Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 16/26] sched: cgroup tagging interface for core scheduling Joel Fernandes (Google)
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

From: Vineeth Pillai <viremana@linux.microsoft.com>

Similar to how user to kernel mode transitions are protected in earlier
patches, protect the entry into kernel from guest mode as well.

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
---
 arch/x86/kvm/x86.c        |  3 +++
 include/linux/entry-kvm.h | 12 ++++++++++++
 kernel/entry/kvm.c        | 13 +++++++++++++
 3 files changed, 28 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ce856e0ece84..05a281f3ef28 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8540,6 +8540,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	 */
 	smp_mb__after_srcu_read_unlock();
 
+	kvm_exit_to_guest_mode(vcpu);
+
 	/*
 	 * This handles the case where a posted interrupt was
 	 * notified with kvm_vcpu_kick.
@@ -8633,6 +8635,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		}
 	}
 
+	kvm_enter_from_guest_mode(vcpu);
 	local_irq_enable();
 	preempt_enable();
 
diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 0cef17afb41a..32aabb7f3e6d 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -77,4 +77,16 @@ static inline bool xfer_to_guest_mode_work_pending(void)
 }
 #endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
 
+/**
+ * kvm_enter_from_guest_mode - Hook called just after entering kernel from guest.
+ * @vcpu:   Pointer to the current VCPU data
+ */
+void kvm_enter_from_guest_mode(struct kvm_vcpu *vcpu);
+
+/**
+ * kvm_exit_to_guest_mode - Hook called just before entering guest from kernel.
+ * @vcpu:   Pointer to the current VCPU data
+ */
+void kvm_exit_to_guest_mode(struct kvm_vcpu *vcpu);
+
 #endif
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index eb1a8a4c867c..b0b7facf4374 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -49,3 +49,16 @@ int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu)
 	return xfer_to_guest_mode_work(vcpu, ti_work);
 }
 EXPORT_SYMBOL_GPL(xfer_to_guest_mode_handle_work);
+
+void kvm_enter_from_guest_mode(struct kvm_vcpu *vcpu)
+{
+	sched_core_unsafe_enter();
+}
+EXPORT_SYMBOL_GPL(kvm_enter_from_guest_mode);
+
+void kvm_exit_to_guest_mode(struct kvm_vcpu *vcpu)
+{
+	sched_core_unsafe_exit();
+	sched_core_wait_till_safe(XFER_TO_GUEST_MODE_WORK);
+}
+EXPORT_SYMBOL_GPL(kvm_exit_to_guest_mode);
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 16/26] sched: cgroup tagging interface for core scheduling
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (14 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 15/26] entry/kvm: Protect the kernel when entering from guest Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 17/26] sched: Split the cookie and setup per-task cookie on fork Joel Fernandes (Google)
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Vineeth Remanan Pillai, Aubrey Li, Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also, after a task is forked, its presence in the core scheduler queue
will need to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent a
new task from sneaking into the cgroup and being missed by the update while
we iterate through all the tasks in the cgroup.  A more complicated scheme
could probably avoid the stop machine.  Such a scheme would also need to
resolve inconsistencies between a task's cgroup core scheduling tag and its
residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now that avoids
such complications.

The core scheduler has extra overhead.  Enable it only for cores with
more than one SMT hardware thread.
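
A minimal userspace sketch of how the tag is meant to be used (the cgroup
mount point and group name are assumptions; the knob added below shows up as
a "tag" file under the cpu controller, e.g. cpu.tag):

#include <fcntl.h>
#include <unistd.h>

/* Tag every task in an existing cpu cgroup for core scheduling. */
static int tag_cgroup(const char *path)	/* e.g. "/sys/fs/cgroup/cpu/mygrp/cpu.tag" */
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, "1", 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}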

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
---
 kernel/sched/core.c  | 183 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |   4 +
 2 files changed, 181 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5a7aeaa914e3..bab4ea2f5cd8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -157,6 +157,37 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
 	return false;
 }
 
+static bool sched_core_empty(struct rq *rq)
+{
+	return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+	return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+	struct task_struct *task;
+
+	task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+	return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *task;
+
+	while (!sched_core_empty(rq)) {
+		task = sched_core_first(rq);
+		rb_erase(&task->core_node, &rq->core_tree);
+		RB_CLEAR_NODE(&task->core_node);
+	}
+	rq->core->core_task_seq++;
+}
+
 static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct rb_node *parent, **node;
@@ -188,10 +219,11 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
+	if (!sched_core_enqueued(p))
 		return;
 
 	rb_erase(&p->core_node, &rq->core_tree);
+	RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
@@ -255,8 +287,24 @@ static int __sched_core_stopper(void *data)
 	bool enabled = !!(unsigned long)data;
 	int cpu;
 
-	for_each_possible_cpu(cpu)
-		cpu_rq(cpu)->core_enabled = enabled;
+	for_each_possible_cpu(cpu) {
+		struct rq *rq = cpu_rq(cpu);
+
+		WARN_ON_ONCE(enabled == rq->core_enabled);
+
+		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+			/*
+			 * All active and migrating tasks will have already
+			 * been removed from core queue when we clear the
+			 * cgroup tags. However, dying tasks could still be
+			 * left in core queue. Flush them here.
+			 */
+			if (!enabled)
+				sched_core_flush(cpu);
+
+			rq->core_enabled = enabled;
+		}
+	}
 
 	return 0;
 }
@@ -266,7 +314,11 @@ static int sched_core_count;
 
 static void __sched_core_enable(void)
 {
-	// XXX verify there are no cookie tasks (yet)
+	int cpu;
+
+	/* verify there are no cookie tasks (yet) */
+	for_each_online_cpu(cpu)
+		BUG_ON(!sched_core_empty(cpu_rq(cpu)));
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -274,8 +326,6 @@ static void __sched_core_enable(void)
 
 static void __sched_core_disable(void)
 {
-	// XXX verify there are no cookie tasks (left)
-
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
 }
@@ -300,6 +350,7 @@ void sched_core_put(void)
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static bool sched_core_enqueued(struct task_struct *task) { return false; }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -3529,6 +3580,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&p->core_node);
 #endif
 	return 0;
 }
@@ -7480,6 +7534,9 @@ void init_idle(struct task_struct *idle, int cpu)
 #ifdef CONFIG_SMP
 	sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
 #endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&idle->core_node);
+#endif
 }
 
 #ifdef CONFIG_SMP
@@ -8441,6 +8498,15 @@ static void sched_change_group(struct task_struct *tsk, int type)
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+	if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
+		tsk->core_cookie = 0UL;
+
+	if (tg->tagged /* && !tsk->core_cookie ? */)
+		tsk->core_cookie = (unsigned long)tg;
+#endif
+
 	tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -8533,6 +8599,18 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+#ifdef CONFIG_SCHED_CORE
+	struct task_group *tg = css_tg(css);
+
+	if (tg->tagged) {
+		sched_core_put();
+		tg->tagged = 0;
+	}
+#endif
+}
+
 static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
@@ -9098,6 +9176,82 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_CORE
+static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return !!tg->tagged;
+}
+
+struct write_core_tag {
+	struct cgroup_subsys_state *css;
+	int val;
+};
+
+static int __sched_write_tag(void *data)
+{
+	struct write_core_tag *tag = (struct write_core_tag *) data;
+	struct cgroup_subsys_state *css = tag->css;
+	int val = tag->val;
+	struct task_group *tg = css_tg(tag->css);
+	struct css_task_iter it;
+	struct task_struct *p;
+
+	tg->tagged = !!val;
+
+	css_task_iter_start(css, 0, &it);
+	/*
+	 * Note: css_task_iter_next will skip dying tasks.
+	 * There could still be dying tasks left in the core queue
+	 * when we set cgroup tag to 0 when the loop is done below.
+	 */
+	while ((p = css_task_iter_next(&it))) {
+		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+
+		if (sched_core_enqueued(p)) {
+			sched_core_dequeue(task_rq(p), p);
+			if (!p->core_cookie)
+				continue;
+		}
+
+		if (sched_core_enabled(task_rq(p)) &&
+		    p->core_cookie && task_on_rq_queued(p))
+			sched_core_enqueue(task_rq(p), p);
+
+	}
+	css_task_iter_end(&it);
+
+	return 0;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct write_core_tag wtag;
+
+	if (val > 1)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	if (tg->tagged == !!val)
+		return 0;
+
+	if (!!val)
+		sched_core_get();
+
+	wtag.css = css;
+	wtag.val = val;
+	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
+	if (!val)
+		sched_core_put();
+
+	return 0;
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -9134,6 +9288,14 @@ static struct cftype cpu_legacy_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
+#endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	{
 		.name = "uclamp.min",
@@ -9307,6 +9469,14 @@ static struct cftype cpu_files[] = {
 		.write_s64 = cpu_weight_nice_write_s64,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
 		.name = "max",
@@ -9335,6 +9505,7 @@ static struct cftype cpu_files[] = {
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_online	= cpu_cgroup_css_online,
+	.css_offline	= cpu_cgroup_css_offline,
 	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
 	.css_extra_stat_show = cpu_extra_stat_show,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4bcf3b1ddfb3..661569ee4650 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -384,6 +384,10 @@ struct cfs_bandwidth {
 struct task_group {
 	struct cgroup_subsys_state css;
 
+#ifdef CONFIG_SCHED_CORE
+	int			tagged;
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* schedulable entities of this group on each CPU */
 	struct sched_entity	**se;
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 17/26] sched: Split the cookie and setup per-task cookie on fork
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (15 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 16/26] sched: cgroup tagging interface for core scheduling Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-11-04 22:30   ` chris hyser
  2020-10-20  1:43 ` [PATCH v8 -tip 18/26] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
                   ` (10 subsequent siblings)
  27 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

In order to prevent interference and cleanly support both the per-task and
CGroup APIs, split the cookie into two parts and allow each to be set from its
respective API. The final cookie is the combined value of both and is computed
when the stop-machine executes during a change of cookie.

Also, for the per-task cookie, using pointers to ephemeral objects would be
fragile. For this reason, introduce a refcounted object whose sole purpose is
to assign a unique cookie value by way of the object's pointer.

While at it, refactor the CGroup code a bit. Future patches will introduce more
APIs and support.
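
The combination done at requeue time boils down to packing the two halves
into a single word, roughly as below (a standalone restatement of the helper
added in this patch; compute_core_cookie() itself is not a name the patch
uses):

/* Upper half of the word carries the per-task cookie, lower half the group cookie. */
static unsigned long compute_core_cookie(unsigned long task_cookie,
					 unsigned long group_cookie)
{
	return (task_cookie << (sizeof(unsigned long) * 4)) + group_cookie;
}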

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched.h |   2 +
 kernel/sched/core.c   | 241 ++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/debug.c  |   4 +
 3 files changed, 236 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fe6f225bfbf9..c6034c00846a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,8 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
 	struct rb_node			core_node;
 	unsigned long			core_cookie;
+	unsigned long			core_task_cookie;
+	unsigned long			core_group_cookie;
 	unsigned int			core_occupation;
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bab4ea2f5cd8..30a9e4cb5ce1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -346,11 +346,14 @@ void sched_core_put(void)
 	mutex_unlock(&sched_core_mutex);
 }
 
+static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
 static bool sched_core_enqueued(struct task_struct *task) { return false; }
+static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2) { }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -3583,6 +3586,20 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 #ifdef CONFIG_SCHED_CORE
 	RB_CLEAR_NODE(&p->core_node);
+
+	/*
+	 * Tag child via per-task cookie only if parent is tagged via per-task
+	 * cookie. This is independent of, but can be additive to the CGroup tagging.
+	 */
+	if (current->core_task_cookie) {
+
+		/* If it is not CLONE_THREAD fork, assign a unique per-task tag. */
+		if (!(clone_flags & CLONE_THREAD)) {
+			return sched_core_share_tasks(p, p);
+		}
+		/* Otherwise share the parent's per-task tag. */
+		return sched_core_share_tasks(p, current);
+	}
 #endif
 	return 0;
 }
@@ -9177,6 +9194,217 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_SCHED_CORE
+/*
+ * A simple wrapper around refcount. An allocated sched_core_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_cookie {
+	refcount_t refcnt;
+};
+
+/*
+ * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
+ * @p: The task to assign a cookie to.
+ * @cookie: The cookie to assign.
+ * @group: is it a group interface or a per-task interface.
+ *
+ * This function is typically called from a stop-machine handler.
+ */
+void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
+{
+	if (!p)
+		return;
+
+	if (group)
+		p->core_group_cookie = cookie;
+	else
+		p->core_task_cookie = cookie;
+
+	/* Use up half of the cookie's bits for task cookie and remaining for group cookie. */
+	p->core_cookie = (p->core_task_cookie <<
+				(sizeof(unsigned long) * 4)) + p->core_group_cookie;
+
+	if (sched_core_enqueued(p)) {
+		sched_core_dequeue(task_rq(p), p);
+		if (!p->core_task_cookie)
+			return;
+	}
+
+	if (sched_core_enabled(task_rq(p)) &&
+			p->core_cookie && task_on_rq_queued(p))
+		sched_core_enqueue(task_rq(p), p);
+}
+
+/* Per-task interface */
+static unsigned long sched_core_alloc_task_cookie(void)
+{
+	struct sched_core_cookie *ptr =
+		kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
+
+	if (!ptr)
+		return 0;
+	refcount_set(&ptr->refcnt, 1);
+
+	/*
+	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
+	 * is done after the stopper runs.
+	 */
+	sched_core_get();
+	return (unsigned long)ptr;
+}
+
+static bool sched_core_get_task_cookie(unsigned long cookie)
+{
+	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
+
+	/*
+	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
+	 * is done after the stopper runs.
+	 */
+	sched_core_get();
+	return refcount_inc_not_zero(&ptr->refcnt);
+}
+
+static void sched_core_put_task_cookie(unsigned long cookie)
+{
+	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
+
+	if (refcount_dec_and_test(&ptr->refcnt))
+		kfree(ptr);
+}
+
+struct sched_core_task_write_tag {
+	struct task_struct *tasks[2];
+	unsigned long cookies[2];
+};
+
+/*
+ * Ensure that the task has been requeued. The stopper ensures that the task cannot
+ * be migrated to a different CPU while its core scheduler queue state is being updated.
+ * It also makes sure to requeue a task if it was running actively on another CPU.
+ */
+static int sched_core_task_join_stopper(void *data)
+{
+	struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
+	int i;
+
+	for (i = 0; i < 2; i++)
+		sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], false /* !group */);
+
+	return 0;
+}
+
+static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
+{
+	struct sched_core_task_write_tag wr = {}; /* for stop machine. */
+	bool sched_core_put_after_stopper = false;
+	unsigned long cookie;
+	int ret = -ENOMEM;
+
+	mutex_lock(&sched_core_mutex);
+
+	/*
+	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
+	 *       sched_core_put_task_cookie(). However, sched_core_put() is done
+	 *       by this function *after* the stopper removes the tasks from the
+	 *       core queue, and not before. This is just to play it safe.
+	 */
+	if (t2 == NULL) {
+		if (t1->core_task_cookie) {
+			sched_core_put_task_cookie(t1->core_task_cookie);
+			sched_core_put_after_stopper = true;
+			wr.tasks[0] = t1; /* Keep wr.cookies[0] reset for t1. */
+		}
+	} else if (t1 == t2) {
+		/* Assign a unique per-task cookie solely for t1. */
+
+		cookie = sched_core_alloc_task_cookie();
+		if (!cookie)
+			goto out_unlock;
+
+		if (t1->core_task_cookie) {
+			sched_core_put_task_cookie(t1->core_task_cookie);
+			sched_core_put_after_stopper = true;
+		}
+		wr.tasks[0] = t1;
+		wr.cookies[0] = cookie;
+	} else
+	/*
+	 * 		t1		joining		t2
+	 * CASE 1:
+	 * before	0				0
+	 * after	new cookie			new cookie
+	 *
+	 * CASE 2:
+	 * before	X (non-zero)			0
+	 * after	0				0
+	 *
+	 * CASE 3:
+	 * before	0				X (non-zero)
+	 * after	X				X
+	 *
+	 * CASE 4:
+	 * before	Y (non-zero)			X (non-zero)
+	 * after	X				X
+	 */
+	if (!t1->core_task_cookie && !t2->core_task_cookie) {
+		/* CASE 1. */
+		cookie = sched_core_alloc_task_cookie();
+		if (!cookie)
+			goto out_unlock;
+
+		/* Add another reference for the other task. */
+		if (!sched_core_get_task_cookie(cookie)) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		wr.tasks[0] = t1;
+		wr.tasks[1] = t2;
+		wr.cookies[0] = wr.cookies[1] = cookie;
+
+	} else if (t1->core_task_cookie && !t2->core_task_cookie) {
+		/* CASE 2. */
+		sched_core_put_task_cookie(t1->core_task_cookie);
+		sched_core_put_after_stopper = true;
+
+		wr.tasks[0] = t1; /* Reset cookie for t1. */
+
+	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
+		/* CASE 3. */
+		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		wr.tasks[0] = t1;
+		wr.cookies[0] = t2->core_task_cookie;
+
+	} else {
+		/* CASE 4. */
+		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+		sched_core_put_task_cookie(t1->core_task_cookie);
+		sched_core_put_after_stopper = true;
+
+		wr.tasks[0] = t1;
+		wr.cookies[0] = t2->core_task_cookie;
+	}
+
+	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
+
+	if (sched_core_put_after_stopper)
+		sched_core_put();
+
+	ret = 0;
+out_unlock:
+	mutex_unlock(&sched_core_mutex);
+	return ret;
+}
+
+/* CGroup interface */
 static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 {
 	struct task_group *tg = css_tg(css);
@@ -9207,18 +9435,9 @@ static int __sched_write_tag(void *data)
 	 * when we set cgroup tag to 0 when the loop is done below.
 	 */
 	while ((p = css_task_iter_next(&it))) {
-		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
-
-		if (sched_core_enqueued(p)) {
-			sched_core_dequeue(task_rq(p), p);
-			if (!p->core_cookie)
-				continue;
-		}
-
-		if (sched_core_enabled(task_rq(p)) &&
-		    p->core_cookie && task_on_rq_queued(p))
-			sched_core_enqueue(task_rq(p), p);
+		unsigned long cookie = !!val ? (unsigned long)tg : 0UL;
 
+		sched_core_tag_requeue(p, cookie, true /* group */);
 	}
 	css_task_iter_end(&it);
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c8fee8d9dfd4..88bf45267672 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1024,6 +1024,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		__PS("clock-delta", t1-t0);
 	}
 
+#ifdef CONFIG_SCHED_CORE
+	__PS("core_cookie", p->core_cookie);
+#endif
+
 	sched_show_numa(p, m);
 }
 
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 18/26] sched: Add a per-thread core scheduling interface
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (16 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 17/26] sched: Split the cookie and setup per-task cookie on fork Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 19/26] sched: Add a second-level tag for nested CGroup usecase Joel Fernandes (Google)
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

Add a per-thread core scheduling interface which allows a thread to share a
core with another thread, or have a core exclusively for itself.

ChromeOS uses core-scheduling to securely enable hyperthreading.  This cuts
down the keypress latency in Google docs from 150ms to 50ms while improving
the camera streaming frame rate by ~3%.
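
A minimal userspace example of the prctl(2) interface added here (the partner
PID is only an example):

#include <stdio.h>
#include <sys/prctl.h>
#include <sys/types.h>

#ifndef PR_SCHED_CORE_SHARE
#define PR_SCHED_CORE_SHARE	59	/* from this patch */
#endif

int main(void)
{
	pid_t partner = 1234;	/* example PID to share a core with */

	/* The calling task starts sharing a core with 'partner'. */
	if (prctl(PR_SCHED_CORE_SHARE, partner) != 0)
		perror("PR_SCHED_CORE_SHARE");

	/*
	 * prctl(PR_SCHED_CORE_SHARE, 0) resets the caller's per-task cookie
	 * (privileged if a cookie was already set).
	 */
	return 0;
}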

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched.h            |  2 ++
 include/uapi/linux/prctl.h       |  3 ++
 kernel/sched/core.c              | 51 +++++++++++++++++++++++++++++---
 kernel/sys.c                     |  3 ++
 tools/include/uapi/linux/prctl.h |  3 ++
 5 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c6034c00846a..4cb76575afa8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2078,11 +2078,13 @@ void sched_core_unsafe_enter(void);
 void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
+int sched_core_share_pid(pid_t pid);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
+#define sched_core_share_pid(pid) do { } while (0)
 #endif
 
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..217b0482aea1 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,7 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE		59
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 30a9e4cb5ce1..a0678614a056 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -310,6 +310,7 @@ static int __sched_core_stopper(void *data)
 }
 
 static DEFINE_MUTEX(sched_core_mutex);
+static DEFINE_MUTEX(sched_core_tasks_mutex);
 static int sched_core_count;
 
 static void __sched_core_enable(void)
@@ -3588,8 +3589,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	RB_CLEAR_NODE(&p->core_node);
 
 	/*
-	 * Tag child via per-task cookie only if parent is tagged via per-task
-	 * cookie. This is independent of, but can be additive to the CGroup tagging.
+	 * If parent is tagged via per-task cookie, tag the child (either with
+	 * the parent's cookie, or a new one). The final cookie is calculated
+	 * by concatenating the per-task cookie with that of the CGroup's.
 	 */
 	if (current->core_task_cookie) {
 
@@ -9301,7 +9303,7 @@ static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2
 	unsigned long cookie;
 	int ret = -ENOMEM;
 
-	mutex_lock(&sched_core_mutex);
+	mutex_lock(&sched_core_tasks_mutex);
 
 	/*
 	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
@@ -9400,10 +9402,51 @@ static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2
 
 	ret = 0;
 out_unlock:
-	mutex_unlock(&sched_core_mutex);
+	mutex_unlock(&sched_core_tasks_mutex);
 	return ret;
 }
 
+/* Called from prctl interface: PR_SCHED_CORE_SHARE */
+int sched_core_share_pid(pid_t pid)
+{
+	struct task_struct *task;
+	int err;
+
+	if (pid == 0) { /* Reset current task's cookie. */
+		/* Resetting a cookie requires privileges. */
+		if (current->core_task_cookie)
+			if (!capable(CAP_SYS_ADMIN))
+				return -EPERM;
+		task = NULL;
+	} else {
+		rcu_read_lock();
+		task = pid ? find_task_by_vpid(pid) : current;
+		if (!task) {
+			rcu_read_unlock();
+			return -ESRCH;
+		}
+
+		get_task_struct(task);
+
+		/*
+		 * Check if this process has the right to modify the specified
+		 * process. Use the regular "ptrace_may_access()" checks.
+		 */
+		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+			rcu_read_unlock();
+			err = -EPERM;
+			goto out_put;
+		}
+		rcu_read_unlock();
+	}
+
+	err = sched_core_share_tasks(current, task);
+out_put:
+	if (task)
+		put_task_struct(task);
+	return err;
+}
+
 /* CGroup interface */
 static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 {
diff --git a/kernel/sys.c b/kernel/sys.c
index 6401880dff74..17911b8680b1 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2530,6 +2530,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_SCHED_CORE_SHARE:
+		error = sched_core_share_pid(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 07b4f8131e36..9318f643c4b3 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -238,4 +238,7 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE		59
+
 #endif /* _LINUX_PRCTL_H */
-- 
2.29.0.rc1.297.gfa9743e501-goog



* [PATCH v8 -tip 19/26] sched: Add a second-level tag for nested CGroup usecase
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (17 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 18/26] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-31  0:42   ` Josh Don
       [not found]   ` <6c07e70d-52f2-69ff-e1fa-690cd2c97f3d@linux.intel.com>
  2020-10-20  1:43 ` [PATCH v8 -tip 20/26] sched: Release references to the per-task cookie on exit Joel Fernandes (Google)
                   ` (8 subsequent siblings)
  27 siblings, 2 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

Google has a usecase where the first-level tag on a CGroup is not
sufficient. So, a patch has been carried for years in which a second-level
tag is added that is writable by unprivileged users.

Google uses DAC controls to make the 'tag' settable only by root, while the
second-level 'color' can be changed by anyone. The actual names Google uses
are different, but the concept is the same.

The hierarchy looks like:

Root group
   / \
  A   B    (These are created by the root daemon - borglet).
 / \   \
C   D   E  (These are created by AppEngine within the container).

The reason Google has two parts is that AppEngine wants to allow a subset
of subcgroups within a parent tagged cgroup to share execution. Think of
these subcgroups as belonging to the same customer or project. Because these
subcgroups are created by AppEngine, they are not tracked by borglet (the
root daemon), so borglet never gets a chance to set a color for them. That
is where the 'color' file comes in. The color can be set by AppEngine, and
once set, the normal tasks within the subcgroup cannot overwrite it. This is
enforced by promoting the permission of the color file in cgroupfs.

The 'color' is an 8-bit value allowing for up to 256 unique colors. IMHO,
having more CGroups than that sounds like a scalability issue, so this
suffices. We steal the lower 8 bits of the cookie to store the color, as
sketched below.
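
A rough sketch of the resulting group-cookie layout (not part of this patch;
'tg' stands in for the tagged task_group's address, mirroring the
cpu_core_get_group_cookie() helper added below):

unsigned long sketch_group_cookie(const void *tg, unsigned char color)
{
	unsigned long cookie = ((unsigned long)tg << 8) | color;

	/* Keep only the lower half of the word; the upper half is
	 * reserved for the per-task cookie. */
	cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
	return cookie;
}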

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c  | 181 +++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |   3 +-
 2 files changed, 158 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a0678614a056..42aa811eab14 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8522,7 +8522,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
 	if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
 		tsk->core_cookie = 0UL;
 
-	if (tg->tagged /* && !tsk->core_cookie ? */)
+	if (tg->core_tagged /* && !tsk->core_cookie ? */)
 		tsk->core_cookie = (unsigned long)tg;
 #endif
 
@@ -8623,9 +8623,9 @@ static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
 #ifdef CONFIG_SCHED_CORE
 	struct task_group *tg = css_tg(css);
 
-	if (tg->tagged) {
+	if (tg->core_tagged) {
 		sched_core_put();
-		tg->tagged = 0;
+		tg->core_tagged = 0;
 	}
 #endif
 }
@@ -9228,7 +9228,7 @@ void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool gr
 
 	if (sched_core_enqueued(p)) {
 		sched_core_dequeue(task_rq(p), p);
-		if (!p->core_task_cookie)
+		if (!p->core_cookie)
 			return;
 	}
 
@@ -9448,41 +9448,100 @@ int sched_core_share_pid(pid_t pid)
 }
 
 /* CGroup interface */
+
+/*
+ * Helper to get the cookie in a hierarchy.
+ * The cookie is a combination of a tag and color. Any ancestor
+ * can have a tag/color. tag is the first-level cookie setting
+ * with color being the second. Atmost one color and one tag is
+ * allowed.
+ */
+static unsigned long cpu_core_get_group_cookie(struct task_group *tg)
+{
+	unsigned long color = 0;
+
+	if (!tg)
+		return 0;
+
+	for (; tg; tg = tg->parent) {
+		if (tg->core_tag_color) {
+			WARN_ON_ONCE(color);
+			color = tg->core_tag_color;
+		}
+
+		if (tg->core_tagged) {
+			unsigned long cookie = ((unsigned long)tg << 8) | color;
+			cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
+			return cookie;
+		}
+	}
+
+	return 0;
+}
+
+/* Determine if any group in @tg's children are tagged or colored. */
+static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag,
+					bool check_color)
+{
+	struct task_group *child;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		if ((child->core_tagged && check_tag) ||
+		    (child->core_tag_color && check_color)) {
+			rcu_read_unlock();
+			return true;
+		}
+
+		rcu_read_unlock();
+		return cpu_core_check_descendants(child, check_tag, check_color);
+	}
+
+	rcu_read_unlock();
+	return false;
+}
+
 static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 {
 	struct task_group *tg = css_tg(css);
 
-	return !!tg->tagged;
+	return !!tg->core_tagged;
+}
+
+static u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return tg->core_tag_color;
 }
 
 struct write_core_tag {
 	struct cgroup_subsys_state *css;
-	int val;
+	unsigned long cookie;
 };
 
 static int __sched_write_tag(void *data)
 {
 	struct write_core_tag *tag = (struct write_core_tag *) data;
-	struct cgroup_subsys_state *css = tag->css;
-	int val = tag->val;
-	struct task_group *tg = css_tg(tag->css);
-	struct css_task_iter it;
 	struct task_struct *p;
+	struct cgroup_subsys_state *css;
 
-	tg->tagged = !!val;
+	rcu_read_lock();
+	css_for_each_descendant_pre(css, tag->css) {
+		struct css_task_iter it;
 
-	css_task_iter_start(css, 0, &it);
-	/*
-	 * Note: css_task_iter_next will skip dying tasks.
-	 * There could still be dying tasks left in the core queue
-	 * when we set cgroup tag to 0 when the loop is done below.
-	 */
-	while ((p = css_task_iter_next(&it))) {
-		unsigned long cookie = !!val ? (unsigned long)tg : 0UL;
+		css_task_iter_start(css, 0, &it);
+		/*
+		 * Note: css_task_iter_next will skip dying tasks.
+		 * There could still be dying tasks left in the core queue
+		 * when we set cgroup tag to 0 when the loop is done below.
+		 */
+		while ((p = css_task_iter_next(&it)))
+			sched_core_tag_requeue(p, tag->cookie, true /* group */);
 
-		sched_core_tag_requeue(p, cookie, true /* group */);
+		css_task_iter_end(&it);
 	}
-	css_task_iter_end(&it);
+	rcu_read_unlock();
 
 	return 0;
 }
@@ -9498,20 +9557,80 @@ static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype
 	if (!static_branch_likely(&sched_smt_present))
 		return -EINVAL;
 
-	if (tg->tagged == !!val)
+	if (!tg->core_tagged && val) {
+		/* Tag is being set. Check ancestors and descendants. */
+		if (cpu_core_get_group_cookie(tg) ||
+		    cpu_core_check_descendants(tg, true /* tag */, true /* color */))
+			return -EBUSY;
+	} else if (tg->core_tagged && !val) {
+		/* Tag is being reset. Check descendants. */
+		if (cpu_core_check_descendants(tg, true /* tag */, true /* color */))
+			return -EBUSY;
+	} else {
 		return 0;
+	}
 
 	if (!!val)
 		sched_core_get();
 
 	wtag.css = css;
-	wtag.val = val;
+	wtag.cookie = (unsigned long)tg << 8; /* Reserve lower 8 bits for color. */
+
+	/* Truncate the upper 32-bits - those are used by the per-task cookie. */
+	wtag.cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
+
+	tg->core_tagged = val;
+
 	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
 	if (!val)
 		sched_core_put();
 
 	return 0;
 }
+
+static int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
+					struct cftype *cft, u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct write_core_tag wtag;
+	u64 cookie;
+
+	if (val > 255)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	cookie = cpu_core_get_group_cookie(tg);
+	/* Can't set color if nothing in the ancestors were tagged. */
+	if (!cookie)
+		return -EINVAL;
+
+	/*
+	 * Something in the ancestors already colors us. Can't change the color
+	 * at this level.
+	 */
+	if (!tg->core_tag_color && (cookie & 255))
+		return -EINVAL;
+
+	/*
+	 * Check if any descendants are colored. If so, we can't recolor them.
+	 * Don't need to check if descendants are tagged, since we don't allow
+	 * tagging when already tagged.
+	 */
+	if (cpu_core_check_descendants(tg, false /* tag */, true /* color */))
+		return -EINVAL;
+
+	cookie &= ~255;
+	cookie |= val;
+	wtag.css = css;
+	wtag.cookie = cookie;
+	tg->core_tag_color = val;
+
+	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
+
+	return 0;
+}
 #endif
 
 static struct cftype cpu_legacy_files[] = {
@@ -9552,11 +9671,17 @@ static struct cftype cpu_legacy_files[] = {
 #endif
 #ifdef CONFIG_SCHED_CORE
 	{
-		.name = "tag",
+		.name = "core_tag",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = cpu_core_tag_read_u64,
 		.write_u64 = cpu_core_tag_write_u64,
 	},
+	{
+		.name = "core_tag_color",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_color_read_u64,
+		.write_u64 = cpu_core_tag_color_write_u64,
+	},
 #endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	{
@@ -9733,11 +9858,17 @@ static struct cftype cpu_files[] = {
 #endif
 #ifdef CONFIG_SCHED_CORE
 	{
-		.name = "tag",
+		.name = "core_tag",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = cpu_core_tag_read_u64,
 		.write_u64 = cpu_core_tag_write_u64,
 	},
+	{
+		.name = "core_tag_color",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_color_read_u64,
+		.write_u64 = cpu_core_tag_color_write_u64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 661569ee4650..aebeb91c4a0f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -385,7 +385,8 @@ struct task_group {
 	struct cgroup_subsys_state css;
 
 #ifdef CONFIG_SCHED_CORE
-	int			tagged;
+	int			core_tagged;
+	u8			core_tag_color;
 #endif
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.29.0.rc1.297.gfa9743e501-goog



* [PATCH v8 -tip 20/26] sched: Release references to the per-task cookie on exit
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (18 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 19/26] sched: Add a second-level tag for nested CGroup usecase Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-11-04 21:50   ` chris hyser
  2020-10-20  1:43 ` [PATCH v8 -tip 21/26] sched: Handle task addition to CGroup Joel Fernandes (Google)
                   ` (7 subsequent siblings)
  27 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

During exit, we have to free the references to a cookie that might be shared
by many tasks. This commit therefore ensures that when the task_struct is
released, any reference to a cookie it holds is also released.
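
As a simplified userspace model of the refcounting this relies on (only a
sketch; the kernel helpers are sched_core_alloc_task_cookie(),
sched_core_get_task_cookie() and sched_core_put_task_cookie()):

#include <stdlib.h>

struct cookie { unsigned int refcnt; };

static struct cookie *cookie_alloc(void)	/* ~ sched_core_alloc_task_cookie() */
{
	struct cookie *c = calloc(1, sizeof(*c));

	if (c)
		c->refcnt = 1;
	return c;
}

static void cookie_get(struct cookie *c)	/* ~ sched_core_get_task_cookie() */
{
	c->refcnt++;
}

/* The last task to drop its reference frees the cookie; sched_tsk_free()
 * below makes sure that drop happens at task exit. */
static void cookie_put(struct cookie *c)
{
	if (--c->refcnt == 0)
		free(c);
}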

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched.h | 2 ++
 kernel/fork.c         | 1 +
 kernel/sched/core.c   | 8 ++++++++
 3 files changed, 11 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4cb76575afa8..eabd96beff92 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2079,12 +2079,14 @@ void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
 int sched_core_share_pid(pid_t pid);
+void sched_tsk_free(struct task_struct *tsk);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
 #define sched_core_share_pid(pid_t pid) do { } while (0)
+#define sched_tsk_free(tsk) do { } while (0)
 #endif
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index b9c289d0f4ef..a39248a02fdd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
 	exit_creds(tsk);
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
+	sched_tsk_free(tsk);
 
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42aa811eab14..61e1dcf11000 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9631,6 +9631,14 @@ static int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
 
 	return 0;
 }
+
+void sched_tsk_free(struct task_struct *tsk)
+{
+	if (!tsk->core_task_cookie)
+		return;
+	sched_core_put_task_cookie(tsk->core_task_cookie);
+	sched_core_put();
+}
 #endif
 
 static struct cftype cpu_legacy_files[] = {
-- 
2.29.0.rc1.297.gfa9743e501-goog



* [PATCH v8 -tip 21/26] sched: Handle task addition to CGroup
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (19 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 20/26] sched: Release references to the per-task cookie on exit Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 22/26] sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG Joel Fernandes (Google)
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

Due to earlier patches, the old way of computing a task's cookie when it is
added to a CGroup is outdated. Update it by fetching the group's cookie using
the new helpers, as sketched below.
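
A small sketch of the intended cookie update on a cgroup move (it mirrors the
hunk below; the mask keeps the lower half of the word for the group cookie,
leaving the upper half for the per-task cookie):

#define SCHED_CORE_GROUP_COOKIE_MASK ((1UL << (sizeof(unsigned long) * 4)) - 1)

unsigned long sketch_move_to_group(unsigned long core_cookie,
				   unsigned long new_group_cookie)
{
	/* Drop the old group half, keep the per-task half untouched. */
	core_cookie &= ~SCHED_CORE_GROUP_COOKIE_MASK;
	core_cookie |= new_group_cookie & SCHED_CORE_GROUP_COOKIE_MASK;
	return core_cookie;
}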

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 61e1dcf11000..1321c26a8385 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8505,6 +8505,9 @@ void sched_offline_group(struct task_group *tg)
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
+#define SCHED_CORE_GROUP_COOKIE_MASK ((1UL << (sizeof(unsigned long) * 4)) - 1)
+static unsigned long cpu_core_get_group_cookie(struct task_group *tg);
+
 static void sched_change_group(struct task_struct *tsk, int type)
 {
 	struct task_group *tg;
@@ -8519,11 +8522,13 @@ static void sched_change_group(struct task_struct *tsk, int type)
 	tg = autogroup_task_group(tsk, tg);
 
 #ifdef CONFIG_SCHED_CORE
-	if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
-		tsk->core_cookie = 0UL;
+	if (tsk->core_group_cookie) {
+		tsk->core_group_cookie = 0UL;
+		tsk->core_cookie &= ~SCHED_CORE_GROUP_COOKIE_MASK;
+	}
 
-	if (tg->core_tagged /* && !tsk->core_cookie ? */)
-		tsk->core_cookie = (unsigned long)tg;
+	tsk->core_group_cookie = cpu_core_get_group_cookie(tg);
+	tsk->core_cookie |= tsk->core_group_cookie;
 #endif
 
 	tsk->sched_task_group = tg;
@@ -9471,7 +9476,7 @@ static unsigned long cpu_core_get_group_cookie(struct task_group *tg)
 
 		if (tg->core_tagged) {
 			unsigned long cookie = ((unsigned long)tg << 8) | color;
-			cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
+			cookie &= SCHED_CORE_GROUP_COOKIE_MASK;
 			return cookie;
 		}
 	}
-- 
2.29.0.rc1.297.gfa9743e501-goog



* [PATCH v8 -tip 22/26] sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (20 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 21/26] sched: Handle task addition to CGroup Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  1:43 ` [PATCH v8 -tip 23/26] kselftest: Add tests for core-sched interface Joel Fernandes (Google)
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

This will be used by kselftest to verify the CGroup cookie value that is
set by the CGroup interface.
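
For example, a userspace reader of the new node might look like the sketch
below (the mount point and group name are assumptions, and the file is only
present with CONFIG_SCHED_DEBUG):

#include <stdio.h>

int main(void)
{
	char buf[64] = {};
	FILE *fp = fopen("/sys/fs/cgroup/cpu/tagged/cpu.core_group_cookie", "r");

	if (!fp) {
		perror("cpu.core_group_cookie");
		return 1;
	}
	if (fgets(buf, sizeof(buf), fp))
		printf("group cookie: %s", buf);
	fclose(fp);
	return 0;
}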

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1321c26a8385..b3afbba5abe1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9520,6 +9520,13 @@ static u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css, struct c
 	return tg->core_tag_color;
 }
 
+#ifdef CONFIG_SCHED_DEBUG
+static u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	return cpu_core_get_group_cookie(css_tg(css));
+}
+#endif
+
 struct write_core_tag {
 	struct cgroup_subsys_state *css;
 	unsigned long cookie;
@@ -9695,6 +9702,14 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_core_tag_color_read_u64,
 		.write_u64 = cpu_core_tag_color_write_u64,
 	},
+#ifdef CONFIG_SCHED_DEBUG
+	/* Read the effective cookie (color+tag) of the group. */
+	{
+		.name = "core_group_cookie",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_group_cookie_read_u64,
+	},
+#endif
 #endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	{
@@ -9882,6 +9897,14 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_core_tag_color_read_u64,
 		.write_u64 = cpu_core_tag_color_write_u64,
 	},
+#ifdef CONFIG_SCHED_DEBUG
+	/* Read the effective cookie (color+tag) of the group. */
+	{
+		.name = "core_group_cookie",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_group_cookie_read_u64,
+	},
+#endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
-- 
2.29.0.rc1.297.gfa9743e501-goog



* [PATCH v8 -tip 23/26] kselftest: Add tests for core-sched interface
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (21 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 22/26] sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-30 22:20   ` [PATCH] sched: Change all 4 space tabs to actual tabs John B. Wyatt IV
  2020-10-20  1:43 ` [PATCH v8 -tip 24/26] sched: Move core-scheduler interfacing code to a new file Joel Fernandes (Google)
                   ` (4 subsequent siblings)
  27 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/testing/selftests/sched/.gitignore      |   1 +
 tools/testing/selftests/sched/Makefile        |  14 +
 tools/testing/selftests/sched/config          |   1 +
 .../testing/selftests/sched/test_coresched.c  | 840 ++++++++++++++++++
 4 files changed, 856 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index 000000000000..4660929b0b9a
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
new file mode 100644
index 000000000000..e43b74fc5d7e
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+	  $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := test_coresched
+TEST_PROGS := test_coresched
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config b/tools/testing/selftests/sched/config
new file mode 100644
index 000000000000..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/test_coresched.c b/tools/testing/selftests/sched/test_coresched.c
new file mode 100644
index 000000000000..2fdefb843115
--- /dev/null
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -0,0 +1,840 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Core-scheduling selftests.
+ *
+ * Copyright (C) 2020, Joel Fernandes.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/mount.h>
+#include <sys/prctl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE 59
+#endif
+
+#ifndef DEBUG_PRINT
+#define dprint(...)
+#else
+#define dprint(str, args...) printf("DEBUG: %s: " str "\n", __func__, ##args)
+#endif
+
+void print_banner(char *s)
+{
+    printf("coresched: %s:  ", s);
+}
+
+void print_pass(void)
+{
+    printf("PASS\n");
+}
+
+void assert_cond(int cond, char *str)
+{
+    if (!cond) {
+	printf("Error: %s\n", str);
+	abort();
+    }
+}
+
+char *make_group_root(void)
+{
+	char *mntpath, *mnt;
+	int ret;
+
+	mntpath = malloc(50);
+	if (!mntpath) {
+	    perror("Failed to allocate mntpath\n");
+	    abort();
+	}
+
+	sprintf(mntpath, "/tmp/coresched-test-XXXXXX");
+	mnt = mkdtemp(mntpath);
+	if (!mnt) {
+		perror("Failed to create mount: ");
+		exit(-1);
+	}
+
+	ret = mount("nodev", mnt, "cgroup", 0, "cpu");
+	if (ret == -1) {
+		perror("Failed to mount cgroup: ");
+		exit(-1);
+	}
+
+	return mnt;
+}
+
+char *read_group_cookie(char *cgroup_path)
+{
+    char path[50] = {}, *val;
+    int fd;
+
+    sprintf(path, "%s/cpu.core_group_cookie", cgroup_path);
+    fd = open(path, O_RDONLY, 0666);
+    if (fd == -1) {
+	perror("Open of cgroup tag path failed: ");
+	abort();
+    }
+
+    val = calloc(1, 50);
+    if (read(fd, val, 50) == -1) {
+	perror("Failed to read group cookie: ");
+	abort();
+    }
+
+    val[strcspn(val, "\r\n")] = 0;
+
+    close(fd);
+    return val;
+}
+
+void assert_group_tag(char *cgroup_path, char *tag)
+{
+    char tag_path[50] = {}, rdbuf[8] = {};
+    int tfd;
+
+    sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+    tfd = open(tag_path, O_RDONLY, 0666);
+    if (tfd == -1) {
+	perror("Open of cgroup tag path failed: ");
+	abort();
+    }
+
+    if (read(tfd, rdbuf, 1) != 1) {
+	perror("Failed to read cgroup tag: ");
+	abort();
+    }
+
+    if (strcmp(rdbuf, tag)) {
+	printf("Group tag does not match (exp: %s, act: %s)\n", tag, rdbuf);
+	abort();
+    }
+
+    if (close(tfd) == -1) {
+	perror("Failed to close tag fd: ");
+	abort();
+    }
+}
+
+void assert_group_color(char *cgroup_path, const char *color)
+{
+    char tag_path[50] = {}, rdbuf[8] = {};
+    int tfd;
+
+    sprintf(tag_path, "%s/cpu.core_tag_color", cgroup_path);
+    tfd = open(tag_path, O_RDONLY, 0666);
+    if (tfd == -1) {
+	perror("Open of cgroup tag path failed: ");
+	abort();
+    }
+
+    if (read(tfd, rdbuf, 8) == -1) {
+	perror("Failed to read group color\n");
+	abort();
+    }
+
+    if (strncmp(color, rdbuf, strlen(color))) {
+	printf("Group color does not match (exp: %s, act: %s)\n", color, rdbuf);
+	abort();
+    }
+
+    if (close(tfd) == -1) {
+	perror("Failed to close color fd: ");
+	abort();
+    }
+}
+
+void color_group(char *cgroup_path, const char *color_str)
+{
+	char tag_path[50];
+	int tfd, color, ret;
+
+	color = atoi(color_str);
+
+	sprintf(tag_path, "%s/cpu.core_tag_color", cgroup_path);
+	tfd = open(tag_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
+
+	ret = write(tfd, color_str, strlen(color_str));
+	assert_cond(color < 256 || ret == -1,
+		    "Writing invalid range color should have failed!");
+
+	if (color < 1 || color > 255) {
+	    close(tfd);
+	    return;
+	}
+
+	if (ret == -1) {
+		perror("Failed to set color on cgroup: ");
+		abort();
+	}
+
+	if (close(tfd) == -1) {
+		perror("Failed to close tag fd: ");
+		abort();
+	}
+
+	assert_group_color(cgroup_path, color_str);
+}
+
+void tag_group(char *cgroup_path)
+{
+	char tag_path[50];
+	int tfd;
+
+	sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+	tfd = open(tag_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
+
+	if (write(tfd, "1", 1) != 1) {
+		perror("Failed to enable coresched on cgroup: ");
+		abort();
+	}
+
+	if (close(tfd) == -1) {
+		perror("Failed to close tag fd: ");
+		abort();
+	}
+
+	assert_group_tag(cgroup_path, "1");
+}
+
+void untag_group(char *cgroup_path)
+{
+	char tag_path[50];
+	int tfd;
+
+	sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+	tfd = open(tag_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
+
+	if (write(tfd, "0", 1) != 1) {
+		perror("Failed to disable coresched on cgroup: ");
+		abort();
+	}
+
+	if (close(tfd) == -1) {
+		perror("Failed to close tag fd: ");
+		abort();
+	}
+
+	assert_group_tag(cgroup_path, "0");
+}
+
+char *make_group(char *parent, char *name)
+{
+	char *cgroup_path;
+	int ret;
+
+	if (!parent && !name)
+	    return make_group_root();
+
+	cgroup_path = malloc(50);
+	if (!cgroup_path) {
+	    perror("Failed to allocate cgroup_path\n");
+	    abort();
+	}
+
+	/* Make the cgroup node for this group */
+	sprintf(cgroup_path, "%s/%s", parent, name);
+	ret = mkdir(cgroup_path, 0644);
+	if (ret == -1) {
+		perror("Failed to create group in cgroup: ");
+		abort();
+	}
+
+	return cgroup_path;
+}
+
+static void del_group(char *path)
+{
+    if (rmdir(path) != 0) {
+	printf("Removal of group failed\n");
+	abort();
+    }
+
+    free(path);
+}
+
+static void del_root_group(char *path)
+{
+    if (umount(path) != 0) {
+	perror("umount of cgroup failed\n");
+	abort();
+    }
+
+    if (rmdir(path) != 0) {
+	printf("Removal of group failed\n");
+	abort();
+    }
+
+    free(path);
+}
+
+void assert_group_cookie_equal(char *c1, char *c2)
+{
+    char *v1, *v2;
+
+    v1 = read_group_cookie(c1);
+    v2 = read_group_cookie(c2);
+    if (strcmp(v1, v2)) {
+	printf("Group cookies not equal\n");
+	abort();
+    }
+
+    free(v1);
+    free(v2);
+}
+
+void assert_group_cookie_not_equal(char *c1, char *c2)
+{
+    char *v1, *v2;
+
+    v1 = read_group_cookie(c1);
+    v2 = read_group_cookie(c2);
+    if (!strcmp(v1, v2)) {
+	printf("Group cookies are unexpectedly equal\n");
+	abort();
+    }
+
+    free(v1);
+    free(v2);
+}
+
+void assert_group_cookie_not_zero(char *c1)
+{
+    char *v1 = read_group_cookie(c1);
+
+    v1[1] = 0;
+    if (!strcmp(v1, "0")) {
+	printf("Group cookie zero\n");
+	abort();
+    }
+    free(v1);
+}
+
+void assert_group_cookie_zero(char *c1)
+{
+    char *v1 = read_group_cookie(c1);
+
+    v1[1] = 0;
+    if (strcmp(v1, "0")) {
+	printf("Group cookie not zero");
+	abort();
+    }
+    free(v1);
+}
+
+struct task_state {
+    int pid_share;
+    char pid_str[50];
+    pthread_mutex_t m;
+    pthread_cond_t cond;
+    pthread_cond_t cond_par;
+};
+
+struct task_state *add_task(char *p)
+{
+    struct task_state *mem;
+    pthread_mutexattr_t am;
+    pthread_condattr_t a;
+    char tasks_path[50];
+    int tfd, pid, ret;
+
+    sprintf(tasks_path, "%s/tasks", p);
+    tfd = open(tasks_path, O_WRONLY, 0666);
+    if (tfd == -1) {
+	perror("Open of cgroup tasks path failed: ");
+	abort();
+    }
+
+    mem = mmap(NULL, sizeof *mem, PROT_READ | PROT_WRITE,
+	    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+    memset(mem, 0, sizeof(*mem));
+
+    pthread_condattr_init(&a);
+    pthread_condattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
+    pthread_mutexattr_init(&am);
+    pthread_mutexattr_setpshared(&am, PTHREAD_PROCESS_SHARED);
+
+    pthread_cond_init(&mem->cond, &a);
+    pthread_cond_init(&mem->cond_par, &a);
+    pthread_mutex_init(&mem->m, &am);
+
+    pid = fork();
+    if (pid == 0) {
+	while(1) {
+	    pthread_mutex_lock(&mem->m);
+	    while(!mem->pid_share)
+		pthread_cond_wait(&mem->cond, &mem->m);
+
+	    pid = mem->pid_share;
+	    mem->pid_share = 0;
+	    if (pid == -1)
+		pid = 0;
+	    prctl(PR_SCHED_CORE_SHARE, pid);
+	    pthread_mutex_unlock(&mem->m);
+	    pthread_cond_signal(&mem->cond_par);
+	}
+    }
+
+    sprintf(mem->pid_str, "%d", pid);
+    dprint("add task %d to group %s", pid, p);
+
+    ret = write(tfd, mem->pid_str, strlen(mem->pid_str));
+    assert_cond(ret != -1,
+	    "Failed to write pid into tasks");
+
+    close(tfd);
+    return mem;
+}
+
+/* Make t1 share with t2 */
+void make_tasks_share(struct task_state *t1, struct task_state *t2)
+{
+    int p2 = atoi(t2->pid_str);
+    dprint("task %s %s", t1->pid_str, t2->pid_str);
+
+    pthread_mutex_lock(&t1->m);
+    t1->pid_share = p2;
+    pthread_mutex_unlock(&t1->m);
+
+    pthread_cond_signal(&t1->cond);
+
+    pthread_mutex_lock(&t1->m);
+    while (t1->pid_share)
+	pthread_cond_wait(&t1->cond_par, &t1->m);
+    pthread_mutex_unlock(&t1->m);
+}
+
+/* Reset t1's cookie (pid_share == -1 makes the child call prctl with 0). */
+void reset_task_cookie(struct task_state *t1)
+{
+    dprint("task %s", t1->pid_str);
+
+    pthread_mutex_lock(&t1->m);
+    t1->pid_share = -1;
+    pthread_mutex_unlock(&t1->m);
+
+    pthread_cond_signal(&t1->cond);
+
+    pthread_mutex_lock(&t1->m);
+    while (t1->pid_share)
+	pthread_cond_wait(&t1->cond_par, &t1->m);
+    pthread_mutex_unlock(&t1->m);
+}
+
+char *get_task_core_cookie(char *pid)
+{
+    char proc_path[50];
+    int found = 0;
+    char *line;
+    int i, j;
+    FILE *fp;
+
+    line = malloc(1024);
+    assert_cond(!!line, "Failed to alloc memory");
+
+    sprintf(proc_path, "/proc/%s/sched", pid);
+
+    fp = fopen(proc_path, "r");
+    while ((fgets(line, 1024, fp)) != NULL)
+    {
+        if(!strstr(line, "core_cookie"))
+            continue;
+
+        for (j = 0, i = 0; i < 1024 && line[i] != '\0'; i++)
+            if (line[i] >= '0' && line[i] <= '9')
+                line[j++] = line[i];
+        line[j] = '\0';
+        found = 1;
+        break;
+    }
+
+    fclose(fp);
+
+    if (found) {
+        return line;
+    } else {
+        free(line);
+	printf("core_cookie not found. Enable SCHED_DEBUG?\n");
+	abort();
+        return NULL;
+    }
+}
+
+void assert_tasks_share(struct task_state *t1, struct task_state *t2)
+{
+    char *c1, *c2;
+
+    c1 = get_task_core_cookie(t1->pid_str);
+    c2 = get_task_core_cookie(t2->pid_str);
+    dprint("check task (%s) cookie (%s) == task (%s) cookie (%s)",
+	    t1->pid_str, c1, t2->pid_str, c2);
+    assert_cond(!strcmp(c1, c2), "Tasks don't share cookie");
+    free(c1); free(c2);
+}
+
+void assert_tasks_dont_share(struct task_state *t1,  struct task_state *t2)
+{
+    char *c1, *c2;
+    c1 = get_task_core_cookie(t1->pid_str);
+    c2 = get_task_core_cookie(t2->pid_str);
+    dprint("check task (%s) cookie (%s) != task (%s) cookie (%s)",
+	    t1->pid_str, c1, t2->pid_str, c2);
+    assert_cond(strcmp(c1, c2), "Tasks unexpectedly share a cookie");
+    free(c1); free(c2);
+}
+
+void assert_group_cookie_equals_task_cookie(char *g, char *pid)
+{
+    char *gk;
+    char *tk;
+
+    gk = read_group_cookie(g);
+    tk = get_task_core_cookie(pid);
+
+    assert_cond(!strcmp(gk, tk), "Group cookie not equal to tasks'");
+
+    free(gk);
+    free(tk);
+}
+
+void assert_group_cookie_not_equals_task_cookie(char *g, char *pid)
+{
+    char *gk;
+    char *tk;
+
+    gk = read_group_cookie(g);
+    tk = get_task_core_cookie(pid);
+
+    assert_cond(strcmp(gk, tk), "Group cookie unexpectedly equals task's");
+
+    free(gk);
+    free(tk);
+}
+
+void kill_task(struct task_state *t)
+{
+    int pid = atoi(t->pid_str);
+
+    kill(pid, SIGKILL);
+    waitpid(pid, NULL, 0);
+}
+
+/*
+ * Test coloring. r1, y2, b3 and r4 are children of a tagged group y1, but
+ * r1, b3 and r4 are colored differently from y1.  Only y2 (and thus y22)
+ * shares the same color as y1; due to this, only those have the same cookie
+ * as y1. Further, r4 and r1 have the same cookie, as both are colored the same.
+ *
+ *   y1 ----- r1 -- r11  (color, say red)
+ *   | \_y2 -- y22 (color, say yellow (default))
+ *   \_b3 (color, say blue)
+ *   \_r4 (color, say red)
+ */
+static void test_cgroup_coloring(char *root)
+{
+    char *y1, *y2, *y22, *r1, *r11, *b3, *r4;
+
+    print_banner("TEST-CGROUP-COLORING");
+
+    y1 = make_group(root, "y1");
+    tag_group(y1);
+
+    y2 = make_group(y1, "y2");
+    y22 = make_group(y2, "y22");
+
+    r1 = make_group(y1, "r1");
+    r11 = make_group(r1, "r11");
+
+    color_group(r1, "256"); /* Wouldn't succeed. */
+    color_group(r1, "0");   /* Wouldn't succeed. */
+    color_group(r1, "254");
+
+    b3 = make_group(y1, "b3");
+    color_group(b3, "8");
+
+    r4 = make_group(y1, "r4");
+    color_group(r4, "254");
+
+    /* Check that all yellows share the same cookie. */
+    assert_group_cookie_not_zero(y1);
+    assert_group_cookie_equal(y1, y2);
+    assert_group_cookie_equal(y1, y22);
+
+    /* Check that all reds share the same cookie. */
+    assert_group_cookie_not_zero(r1);
+    assert_group_cookie_equal(r1, r11);
+    assert_group_cookie_equal(r11, r4);
+
+    /* Check that blue, red and yellow are different cookie. */
+    assert_group_cookie_not_equal(r1, b3);
+    assert_group_cookie_not_equal(b3, y1);
+
+    del_group(r11);
+    del_group(r1);
+    del_group(y22);
+    del_group(y2);
+    del_group(b3);
+    del_group(r4);
+    del_group(y1);
+    print_pass();
+}
+
+/*
+ * Test that a group's children have a cookie inherited
+ * from their parent group _after_ the parent was tagged.
+ *
+ *   p ----- c1 - c11
+ *     \ c2 - c22
+ */
+static void test_cgroup_parent_child_tag_inherit(char *root)
+{
+    char *p, *c1, *c11, *c2, *c22;
+
+    print_banner("TEST-CGROUP-PARENT-CHILD-TAG");
+
+    p = make_group(root, "p");
+    assert_group_cookie_zero(p);
+
+    c1 = make_group(p, "c1");
+    assert_group_tag(c1, "0"); /* Child tag is "0" but inherits cookie from parent. */
+    assert_group_cookie_zero(c1);
+    assert_group_cookie_equal(c1, p);
+
+    c11 = make_group(c1, "c11");
+    assert_group_tag(c11, "0");
+    assert_group_cookie_zero(c11);
+    assert_group_cookie_equal(c11, p);
+
+    c2 = make_group(p, "c2");
+    assert_group_tag(c2, "0");
+    assert_group_cookie_zero(c2);
+    assert_group_cookie_equal(c2, p);
+
+    tag_group(p);
+
+    /* Verify c1 got the cookie */
+    assert_group_tag(c1, "0");
+    assert_group_cookie_not_zero(c1);
+    assert_group_cookie_equal(c1, p);
+
+    /* Verify c2 got the cookie */
+    assert_group_tag(c2, "0");
+    assert_group_cookie_not_zero(c2);
+    assert_group_cookie_equal(c2, p);
+
+    /* Verify c11 got the cookie */
+    assert_group_tag(c11, "0");
+    assert_group_cookie_not_zero(c11);
+    assert_group_cookie_equal(c11, p);
+
+    /*
+     * Verify c22 which is a nested group created
+     * _after_ tagging got the cookie.
+     */
+    c22 = make_group(c2, "c22");
+
+    assert_group_tag(c22, "0");
+    assert_group_cookie_not_zero(c22);
+    assert_group_cookie_equal(c22, c1);
+    assert_group_cookie_equal(c22, c11);
+    assert_group_cookie_equal(c22, c2);
+    assert_group_cookie_equal(c22, p);
+
+    del_group(c22);
+    del_group(c11);
+    del_group(c1);
+    del_group(c2);
+    del_group(p);
+    print_pass();
+}
+
+/*
+ * Test that a tagged group's children have a cookie inherited
+ * from their parent group.
+ */
+static void test_cgroup_parent_tag_child_inherit(char *root)
+{
+    char *p, *c1, *c2, *c3;
+
+    print_banner("TEST-CGROUP-PARENT-TAG-CHILD-INHERIT");
+
+    p = make_group(root, "p");
+    assert_group_cookie_zero(p);
+    tag_group(p);
+    assert_group_cookie_not_zero(p);
+
+    c1 = make_group(p, "c1");
+    assert_group_cookie_not_zero(c1);
+    /* Child tag is "0" but it inherits cookie from parent. */
+    assert_group_tag(c1, "0");
+    assert_group_cookie_equal(c1, p);
+
+    c2 = make_group(p, "c2");
+    assert_group_tag(c2, "0");
+    assert_group_cookie_equal(c2, p);
+    assert_group_cookie_equal(c1, c2);
+
+    c3 = make_group(c1, "c3");
+    assert_group_tag(c3, "0");
+    assert_group_cookie_equal(c3, p);
+    assert_group_cookie_equal(c1, c3);
+
+    del_group(c3);
+    del_group(c1);
+    del_group(c2);
+    del_group(p);
+    print_pass();
+}
+
+static void test_prctl_in_group(char *root)
+{
+    char *p;
+    struct task_state *tsk1, *tsk2, *tsk3;
+
+    print_banner("TEST-PRCTL-IN-GROUP");
+
+    p = make_group(root, "p");
+    assert_group_cookie_zero(p);
+    tag_group(p);
+    assert_group_cookie_not_zero(p);
+
+    tsk1 = add_task(p);
+    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+
+    tsk2 = add_task(p);
+    assert_group_cookie_equals_task_cookie(p, tsk2->pid_str);
+
+    tsk3 = add_task(p);
+    assert_group_cookie_equals_task_cookie(p, tsk3->pid_str);
+
+    /* tsk2 shares with tsk3 -- both get disconnected from the CGroup. */
+    make_tasks_share(tsk2, tsk3);
+    assert_tasks_share(tsk2, tsk3);
+    assert_tasks_dont_share(tsk1, tsk2);
+    assert_tasks_dont_share(tsk1, tsk3);
+    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+    assert_group_cookie_not_equals_task_cookie(p, tsk2->pid_str);
+    assert_group_cookie_not_equals_task_cookie(p, tsk3->pid_str);
+
+    /* now reset tsk3 -- get connected back to CGroup. */
+    reset_task_cookie(tsk3);
+    assert_tasks_dont_share(tsk2, tsk3);
+    assert_tasks_share(tsk1, tsk3);      // tsk3 is back.
+    assert_tasks_dont_share(tsk1, tsk2); // but tsk2 is still zombie
+    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+    assert_group_cookie_not_equals_task_cookie(p, tsk2->pid_str);
+    assert_group_cookie_equals_task_cookie(p, tsk3->pid_str); // tsk3 is back.
+
+    /* now reset tsk2 as well to get it connected back to CGroup. */
+    reset_task_cookie(tsk2);
+    assert_tasks_share(tsk2, tsk3);
+    assert_tasks_share(tsk1, tsk3);
+    assert_tasks_share(tsk1, tsk2);
+    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+    assert_group_cookie_equals_task_cookie(p, tsk2->pid_str);
+    assert_group_cookie_equals_task_cookie(p, tsk3->pid_str);
+
+    /* Test the rest of the cases (2 to 4)
+     *
+     *		t1		joining		t2
+     * CASE 1:
+     * before	0				0
+     * after	new cookie			new cookie
+     *
+     * CASE 2:
+     * before	X (non-zero)			0
+     * after	0				0
+     *
+     * CASE 3:
+     * before	0				X (non-zero)
+     * after	X				X
+     *
+     * CASE 4:
+     * before	Y (non-zero)			X (non-zero)
+     * after	X				X
+     */
+
+    /* case 2: */
+    dprint("case 2");
+    make_tasks_share(tsk1, tsk1);
+    assert_tasks_dont_share(tsk1, tsk2);
+    assert_tasks_dont_share(tsk1, tsk3);
+    assert_group_cookie_not_equals_task_cookie(p, tsk1->pid_str);
+    make_tasks_share(tsk1, tsk2); /* Will reset the task cookie. */
+    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+
+    /* case 3: */
+    dprint("case 3");
+    assert_group_cookie_equals_task_cookie(p, tsk2->pid_str);
+    make_tasks_share(tsk2, tsk2);
+    assert_group_cookie_not_equals_task_cookie(p, tsk2->pid_str);
+    assert_tasks_dont_share(tsk2, tsk1);
+    assert_tasks_dont_share(tsk2, tsk3);
+    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+    make_tasks_share(tsk1, tsk2);
+    assert_group_cookie_not_equals_task_cookie(p, tsk1->pid_str);
+    assert_tasks_share(tsk1, tsk2);
+    assert_tasks_dont_share(tsk1, tsk3);
+    reset_task_cookie(tsk1);
+    reset_task_cookie(tsk2);
+
+    /* case 4: */
+    dprint("case 4");
+    assert_tasks_share(tsk1, tsk2);
+    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+    make_tasks_share(tsk1, tsk1);
+    assert_group_cookie_not_equals_task_cookie(p, tsk1->pid_str);
+    assert_group_cookie_equals_task_cookie(p, tsk2->pid_str);
+    make_tasks_share(tsk2, tsk2);
+    assert_group_cookie_not_equals_task_cookie(p, tsk2->pid_str);
+    assert_tasks_dont_share(tsk1, tsk2);
+    make_tasks_share(tsk1, tsk2);
+    assert_group_cookie_not_equals_task_cookie(p, tsk1->pid_str);
+    assert_tasks_share(tsk1, tsk2);
+    assert_tasks_dont_share(tsk1, tsk3);
+    reset_task_cookie(tsk1);
+    reset_task_cookie(tsk2);
+
+    kill_task(tsk1);
+    kill_task(tsk2);
+    kill_task(tsk3);
+    del_group(p);
+    print_pass();
+}
+
+int main() {
+    char *root = make_group(NULL, NULL);
+
+    test_cgroup_parent_tag_child_inherit(root);
+    test_cgroup_parent_child_tag_inherit(root);
+    test_cgroup_coloring(root);
+    test_prctl_in_group(root);
+
+    del_root_group(root);
+    return 0;
+}
+
-- 
2.29.0.rc1.297.gfa9743e501-goog



* [PATCH v8 -tip 24/26] sched: Move core-scheduler interfacing code to a new file
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (22 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 23/26] kselftest: Add tests for core-sched interface Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-26  1:05   ` Li, Aubrey
  2020-10-20  1:43 ` [PATCH v8 -tip 25/26] Documentation: Add core scheduling documentation Joel Fernandes (Google)
                   ` (3 subsequent siblings)
  27 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

core.c is already huge. The core-tagging interface code is largely
independent of it. Move it to its own file to make both files easier to
maintain.

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/Makefile  |   1 +
 kernel/sched/core.c    | 481 +----------------------------------------
 kernel/sched/coretag.c | 468 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h   |  56 ++++-
 4 files changed, 523 insertions(+), 483 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b3afbba5abe1..211e0784675f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -162,11 +162,6 @@ static bool sched_core_empty(struct rq *rq)
 	return RB_EMPTY_ROOT(&rq->core_tree);
 }
 
-static bool sched_core_enqueued(struct task_struct *task)
-{
-	return !RB_EMPTY_NODE(&task->core_node);
-}
-
 static struct task_struct *sched_core_first(struct rq *rq)
 {
 	struct task_struct *task;
@@ -188,7 +183,7 @@ static void sched_core_flush(int cpu)
 	rq->core->core_task_seq++;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct rb_node *parent, **node;
 	struct task_struct *node_task;
@@ -215,7 +210,7 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 	rb_insert_color(&p->core_node, &rq->core_tree);
 }
 
-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
@@ -310,7 +305,6 @@ static int __sched_core_stopper(void *data)
 }
 
 static DEFINE_MUTEX(sched_core_mutex);
-static DEFINE_MUTEX(sched_core_tasks_mutex);
 static int sched_core_count;
 
 static void __sched_core_enable(void)
@@ -346,16 +340,6 @@ void sched_core_put(void)
 		__sched_core_disable();
 	mutex_unlock(&sched_core_mutex);
 }
-
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
-
-#else /* !CONFIG_SCHED_CORE */
-
-static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
-static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
-static bool sched_core_enqueued(struct task_struct *task) { return false; }
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2) { }
-
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -8505,9 +8489,6 @@ void sched_offline_group(struct task_group *tg)
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
-#define SCHED_CORE_GROUP_COOKIE_MASK ((1UL << (sizeof(unsigned long) * 4)) - 1)
-static unsigned long cpu_core_get_group_cookie(struct task_group *tg);
-
 static void sched_change_group(struct task_struct *tsk, int type)
 {
 	struct task_group *tg;
@@ -8583,11 +8564,6 @@ void sched_move_task(struct task_struct *tsk)
 	task_rq_unlock(rq, tsk, &rf);
 }
 
-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
-	return css ? container_of(css, struct task_group, css) : NULL;
-}
-
 static struct cgroup_subsys_state *
 cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -9200,459 +9176,6 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-#ifdef CONFIG_SCHED_CORE
-/*
- * A simple wrapper around refcount. An allocated sched_core_cookie's
- * address is used to compute the cookie of the task.
- */
-struct sched_core_cookie {
-	refcount_t refcnt;
-};
-
-/*
- * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
- * @p: The task to assign a cookie to.
- * @cookie: The cookie to assign.
- * @group: is it a group interface or a per-task interface.
- *
- * This function is typically called from a stop-machine handler.
- */
-void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
-{
-	if (!p)
-		return;
-
-	if (group)
-		p->core_group_cookie = cookie;
-	else
-		p->core_task_cookie = cookie;
-
-	/* Use up half of the cookie's bits for task cookie and remaining for group cookie. */
-	p->core_cookie = (p->core_task_cookie <<
-				(sizeof(unsigned long) * 4)) + p->core_group_cookie;
-
-	if (sched_core_enqueued(p)) {
-		sched_core_dequeue(task_rq(p), p);
-		if (!p->core_cookie)
-			return;
-	}
-
-	if (sched_core_enabled(task_rq(p)) &&
-			p->core_cookie && task_on_rq_queued(p))
-		sched_core_enqueue(task_rq(p), p);
-}
-
-/* Per-task interface */
-static unsigned long sched_core_alloc_task_cookie(void)
-{
-	struct sched_core_cookie *ptr =
-		kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
-
-	if (!ptr)
-		return 0;
-	refcount_set(&ptr->refcnt, 1);
-
-	/*
-	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
-	 * is done after the stopper runs.
-	 */
-	sched_core_get();
-	return (unsigned long)ptr;
-}
-
-static bool sched_core_get_task_cookie(unsigned long cookie)
-{
-	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
-
-	/*
-	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
-	 * is done after the stopper runs.
-	 */
-	sched_core_get();
-	return refcount_inc_not_zero(&ptr->refcnt);
-}
-
-static void sched_core_put_task_cookie(unsigned long cookie)
-{
-	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
-
-	if (refcount_dec_and_test(&ptr->refcnt))
-		kfree(ptr);
-}
-
-struct sched_core_task_write_tag {
-	struct task_struct *tasks[2];
-	unsigned long cookies[2];
-};
-
-/*
- * Ensure that the task has been requeued. The stopper ensures that the task cannot
- * be migrated to a different CPU while its core scheduler queue state is being updated.
- * It also makes sure to requeue a task if it was running actively on another CPU.
- */
-static int sched_core_task_join_stopper(void *data)
-{
-	struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
-	int i;
-
-	for (i = 0; i < 2; i++)
-		sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], false /* !group */);
-
-	return 0;
-}
-
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
-{
-	struct sched_core_task_write_tag wr = {}; /* for stop machine. */
-	bool sched_core_put_after_stopper = false;
-	unsigned long cookie;
-	int ret = -ENOMEM;
-
-	mutex_lock(&sched_core_tasks_mutex);
-
-	/*
-	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
-	 *       sched_core_put_task_cookie(). However, sched_core_put() is done
-	 *       by this function *after* the stopper removes the tasks from the
-	 *       core queue, and not before. This is just to play it safe.
-	 */
-	if (t2 == NULL) {
-		if (t1->core_task_cookie) {
-			sched_core_put_task_cookie(t1->core_task_cookie);
-			sched_core_put_after_stopper = true;
-			wr.tasks[0] = t1; /* Keep wr.cookies[0] reset for t1. */
-		}
-	} else if (t1 == t2) {
-		/* Assign a unique per-task cookie solely for t1. */
-
-		cookie = sched_core_alloc_task_cookie();
-		if (!cookie)
-			goto out_unlock;
-
-		if (t1->core_task_cookie) {
-			sched_core_put_task_cookie(t1->core_task_cookie);
-			sched_core_put_after_stopper = true;
-		}
-		wr.tasks[0] = t1;
-		wr.cookies[0] = cookie;
-	} else
-	/*
-	 * 		t1		joining		t2
-	 * CASE 1:
-	 * before	0				0
-	 * after	new cookie			new cookie
-	 *
-	 * CASE 2:
-	 * before	X (non-zero)			0
-	 * after	0				0
-	 *
-	 * CASE 3:
-	 * before	0				X (non-zero)
-	 * after	X				X
-	 *
-	 * CASE 4:
-	 * before	Y (non-zero)			X (non-zero)
-	 * after	X				X
-	 */
-	if (!t1->core_task_cookie && !t2->core_task_cookie) {
-		/* CASE 1. */
-		cookie = sched_core_alloc_task_cookie();
-		if (!cookie)
-			goto out_unlock;
-
-		/* Add another reference for the other task. */
-		if (!sched_core_get_task_cookie(cookie)) {
-			return -EINVAL;
-			goto out_unlock;
-		}
-
-		wr.tasks[0] = t1;
-		wr.tasks[1] = t2;
-		wr.cookies[0] = wr.cookies[1] = cookie;
-
-	} else if (t1->core_task_cookie && !t2->core_task_cookie) {
-		/* CASE 2. */
-		sched_core_put_task_cookie(t1->core_task_cookie);
-		sched_core_put_after_stopper = true;
-
-		wr.tasks[0] = t1; /* Reset cookie for t1. */
-
-	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
-		/* CASE 3. */
-		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
-			ret = -EINVAL;
-			goto out_unlock;
-		}
-
-		wr.tasks[0] = t1;
-		wr.cookies[0] = t2->core_task_cookie;
-
-	} else {
-		/* CASE 4. */
-		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
-			ret = -EINVAL;
-			goto out_unlock;
-		}
-		sched_core_put_task_cookie(t1->core_task_cookie);
-		sched_core_put_after_stopper = true;
-
-		wr.tasks[0] = t1;
-		wr.cookies[0] = t2->core_task_cookie;
-	}
-
-	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
-
-	if (sched_core_put_after_stopper)
-		sched_core_put();
-
-	ret = 0;
-out_unlock:
-	mutex_unlock(&sched_core_tasks_mutex);
-	return ret;
-}
-
-/* Called from prctl interface: PR_SCHED_CORE_SHARE */
-int sched_core_share_pid(pid_t pid)
-{
-	struct task_struct *task;
-	int err;
-
-	if (pid == 0) { /* Reset current task's cookie. */
-		/* Resetting a cookie requires privileges. */
-		if (current->core_task_cookie)
-			if (!capable(CAP_SYS_ADMIN))
-				return -EPERM;
-		task = NULL;
-	} else {
-		rcu_read_lock();
-		task = pid ? find_task_by_vpid(pid) : current;
-		if (!task) {
-			rcu_read_unlock();
-			return -ESRCH;
-		}
-
-		get_task_struct(task);
-
-		/*
-		 * Check if this process has the right to modify the specified
-		 * process. Use the regular "ptrace_may_access()" checks.
-		 */
-		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
-			rcu_read_unlock();
-			err = -EPERM;
-			goto out_put;
-		}
-		rcu_read_unlock();
-	}
-
-	err = sched_core_share_tasks(current, task);
-out_put:
-	if (task)
-		put_task_struct(task);
-	return err;
-}
-
-/* CGroup interface */
-
-/*
- * Helper to get the cookie in a hierarchy.
- * The cookie is a combination of a tag and color. Any ancestor
- * can have a tag/color. tag is the first-level cookie setting
- * with color being the second. At most one color and one tag are
- * allowed.
- */
-static unsigned long cpu_core_get_group_cookie(struct task_group *tg)
-{
-	unsigned long color = 0;
-
-	if (!tg)
-		return 0;
-
-	for (; tg; tg = tg->parent) {
-		if (tg->core_tag_color) {
-			WARN_ON_ONCE(color);
-			color = tg->core_tag_color;
-		}
-
-		if (tg->core_tagged) {
-			unsigned long cookie = ((unsigned long)tg << 8) | color;
-			cookie &= SCHED_CORE_GROUP_COOKIE_MASK;
-			return cookie;
-		}
-	}
-
-	return 0;
-}
-
-/* Determine if any group in @tg's children are tagged or colored. */
-static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag,
-					bool check_color)
-{
-	struct task_group *child;
-
-	rcu_read_lock();
-	list_for_each_entry_rcu(child, &tg->children, siblings) {
-		if ((child->core_tagged && check_tag) ||
-		    (child->core_tag_color && check_color)) {
-			rcu_read_unlock();
-			return true;
-		}
-
-		rcu_read_unlock();
-		return cpu_core_check_descendants(child, check_tag, check_color);
-	}
-
-	rcu_read_unlock();
-	return false;
-}
-
-static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
-{
-	struct task_group *tg = css_tg(css);
-
-	return !!tg->core_tagged;
-}
-
-static u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
-{
-	struct task_group *tg = css_tg(css);
-
-	return tg->core_tag_color;
-}
-
-#ifdef CONFIG_SCHED_DEBUG
-static u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
-{
-	return cpu_core_get_group_cookie(css_tg(css));
-}
-#endif
-
-struct write_core_tag {
-	struct cgroup_subsys_state *css;
-	unsigned long cookie;
-};
-
-static int __sched_write_tag(void *data)
-{
-	struct write_core_tag *tag = (struct write_core_tag *) data;
-	struct task_struct *p;
-	struct cgroup_subsys_state *css;
-
-	rcu_read_lock();
-	css_for_each_descendant_pre(css, tag->css) {
-		struct css_task_iter it;
-
-		css_task_iter_start(css, 0, &it);
-		/*
-		 * Note: css_task_iter_next will skip dying tasks.
-		 * There could still be dying tasks left in the core queue
-		 * when we set cgroup tag to 0 when the loop is done below.
-		 */
-		while ((p = css_task_iter_next(&it)))
-			sched_core_tag_requeue(p, tag->cookie, true /* group */);
-
-		css_task_iter_end(&it);
-	}
-	rcu_read_unlock();
-
-	return 0;
-}
-
-static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
-{
-	struct task_group *tg = css_tg(css);
-	struct write_core_tag wtag;
-
-	if (val > 1)
-		return -ERANGE;
-
-	if (!static_branch_likely(&sched_smt_present))
-		return -EINVAL;
-
-	if (!tg->core_tagged && val) {
-		/* Tag is being set. Check ancestors and descendants. */
-		if (cpu_core_get_group_cookie(tg) ||
-		    cpu_core_check_descendants(tg, true /* tag */, true /* color */))
-			return -EBUSY;
-	} else if (tg->core_tagged && !val) {
-		/* Tag is being reset. Check descendants. */
-		if (cpu_core_check_descendants(tg, true /* tag */, true /* color */))
-			return -EBUSY;
-	} else {
-		return 0;
-	}
-
-	if (!!val)
-		sched_core_get();
-
-	wtag.css = css;
-	wtag.cookie = (unsigned long)tg << 8; /* Reserve lower 8 bits for color. */
-
-	/* Truncate the upper 32-bits - those are used by the per-task cookie. */
-	wtag.cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
-
-	tg->core_tagged = val;
-
-	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
-	if (!val)
-		sched_core_put();
-
-	return 0;
-}
-
-static int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
-					struct cftype *cft, u64 val)
-{
-	struct task_group *tg = css_tg(css);
-	struct write_core_tag wtag;
-	u64 cookie;
-
-	if (val > 255)
-		return -ERANGE;
-
-	if (!static_branch_likely(&sched_smt_present))
-		return -EINVAL;
-
-	cookie = cpu_core_get_group_cookie(tg);
-	/* Can't set color if nothing in the ancestors were tagged. */
-	if (!cookie)
-		return -EINVAL;
-
-	/*
-	 * Something in the ancestors already colors us. Can't change the color
-	 * at this level.
-	 */
-	if (!tg->core_tag_color && (cookie & 255))
-		return -EINVAL;
-
-	/*
-	 * Check if any descendants are colored. If so, we can't recolor them.
-	 * Don't need to check if descendants are tagged, since we don't allow
-	 * tagging when already tagged.
-	 */
-	if (cpu_core_check_descendants(tg, false /* tag */, true /* color */))
-		return -EINVAL;
-
-	cookie &= ~255;
-	cookie |= val;
-	wtag.css = css;
-	wtag.cookie = cookie;
-	tg->core_tag_color = val;
-
-	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
-
-	return 0;
-}
-
-void sched_tsk_free(struct task_struct *tsk)
-{
-	if (!tsk->core_task_cookie)
-		return;
-	sched_core_put_task_cookie(tsk->core_task_cookie);
-	sched_core_put();
-}
-#endif
-
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
new file mode 100644
index 000000000000..3333c9b0afc5
--- /dev/null
+++ b/kernel/sched/coretag.c
@@ -0,0 +1,468 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kernel/sched/coretag.c
+ *
+ * Core-scheduling tagging interface support.
+ *
+ * Copyright(C) 2020, Joel Fernandes.
+ * Initial interfacing code by Peter Zijlstra.
+ */
+
+#include "sched.h"
+
+/*
+ * A simple wrapper around refcount. An allocated sched_core_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_cookie {
+	refcount_t refcnt;
+};
+
+static DEFINE_MUTEX(sched_core_tasks_mutex);
+
+/*
+ * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
+ * @p: The task to assign a cookie to.
+ * @cookie: The cookie to assign.
+ * @group: whether @cookie is a group cookie or a per-task cookie.
+ *
+ * This function is typically called from a stop-machine handler.
+ */
+void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
+{
+	if (!p)
+		return;
+
+	if (group)
+		p->core_group_cookie = cookie;
+	else
+		p->core_task_cookie = cookie;
+
+	/* Use the upper half of the cookie's bits for the task cookie and the lower half for the group cookie. */
+	p->core_cookie = (p->core_task_cookie <<
+				(sizeof(unsigned long) * 4)) + p->core_group_cookie;
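+	/*
+	 * Illustrative example (assuming a 64-bit unsigned long): a task
+	 * cookie of 0xAABB and a group cookie of 0xCCDD combine into a
+	 * core_cookie of 0x0000AABB0000CCDD.
+	 */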
+
+	if (sched_core_enqueued(p)) {
+		sched_core_dequeue(task_rq(p), p);
+		if (!p->core_cookie)
+			return;
+	}
+
+	if (sched_core_enabled(task_rq(p)) &&
+			p->core_cookie && task_on_rq_queued(p))
+		sched_core_enqueue(task_rq(p), p);
+}
+
+/* Per-task interface: Used by fork(2) and prctl(2). */
+static unsigned long sched_core_alloc_task_cookie(void)
+{
+	struct sched_core_cookie *ptr =
+		kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
+
+	if (!ptr)
+		return 0;
+	refcount_set(&ptr->refcnt, 1);
+
+	/*
+	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
+	 * is done after the stopper runs.
+	 */
+	sched_core_get();
+	return (unsigned long)ptr;
+}
+
+static bool sched_core_get_task_cookie(unsigned long cookie)
+{
+	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
+
+	/*
+	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
+	 * is done after the stopper runs.
+	 */
+	sched_core_get();
+	return refcount_inc_not_zero(&ptr->refcnt);
+}
+
+static void sched_core_put_task_cookie(unsigned long cookie)
+{
+	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
+
+	if (refcount_dec_and_test(&ptr->refcnt))
+		kfree(ptr);
+}
+
+struct sched_core_task_write_tag {
+	struct task_struct *tasks[2];
+	unsigned long cookies[2];
+};
+
+/*
+ * Ensure that the task has been requeued. The stopper ensures that the task cannot
+ * be migrated to a different CPU while its core scheduler queue state is being updated.
+ * It also makes sure to requeue a task if it was running actively on another CPU.
+ */
+static int sched_core_task_join_stopper(void *data)
+{
+	struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
+	int i;
+
+	for (i = 0; i < 2; i++)
+		sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], false /* !group */);
+
+	return 0;
+}
+
+int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
+{
+	struct sched_core_task_write_tag wr = {}; /* for stop machine. */
+	bool sched_core_put_after_stopper = false;
+	unsigned long cookie;
+	int ret = -ENOMEM;
+
+	mutex_lock(&sched_core_tasks_mutex);
+
+	/*
+	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
+	 *       sched_core_get_task_cookie(). However, sched_core_put() is done
+	 *       by this function *after* the stopper removes the tasks from the
+	 *       core queue, and not before. This is just to play it safe.
+	 */
+	if (t2 == NULL) {
+		if (t1->core_task_cookie) {
+			sched_core_put_task_cookie(t1->core_task_cookie);
+			sched_core_put_after_stopper = true;
+			wr.tasks[0] = t1; /* Keep wr.cookies[0] reset for t1. */
+		}
+	} else if (t1 == t2) {
+		/* Assign a unique per-task cookie solely for t1. */
+
+		cookie = sched_core_alloc_task_cookie();
+		if (!cookie)
+			goto out_unlock;
+
+		if (t1->core_task_cookie) {
+			sched_core_put_task_cookie(t1->core_task_cookie);
+			sched_core_put_after_stopper = true;
+		}
+		wr.tasks[0] = t1;
+		wr.cookies[0] = cookie;
+	} else
+	/*
+	 * 		t1		joining		t2
+	 * CASE 1:
+	 * before	0				0
+	 * after	new cookie			new cookie
+	 *
+	 * CASE 2:
+	 * before	X (non-zero)			0
+	 * after	0				0
+	 *
+	 * CASE 3:
+	 * before	0				X (non-zero)
+	 * after	X				X
+	 *
+	 * CASE 4:
+	 * before	Y (non-zero)			X (non-zero)
+	 * after	X				X
+	 */
+	if (!t1->core_task_cookie && !t2->core_task_cookie) {
+		/* CASE 1. */
+		cookie = sched_core_alloc_task_cookie();
+		if (!cookie)
+			goto out_unlock;
+
+		/* Add another reference for the other task. */
+		if (!sched_core_get_task_cookie(cookie)) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		wr.tasks[0] = t1;
+		wr.tasks[1] = t2;
+		wr.cookies[0] = wr.cookies[1] = cookie;
+
+	} else if (t1->core_task_cookie && !t2->core_task_cookie) {
+		/* CASE 2. */
+		sched_core_put_task_cookie(t1->core_task_cookie);
+		sched_core_put_after_stopper = true;
+
+		wr.tasks[0] = t1; /* Reset cookie for t1. */
+
+	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
+		/* CASE 3. */
+		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		wr.tasks[0] = t1;
+		wr.cookies[0] = t2->core_task_cookie;
+
+	} else {
+		/* CASE 4. */
+		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+		sched_core_put_task_cookie(t1->core_task_cookie);
+		sched_core_put_after_stopper = true;
+
+		wr.tasks[0] = t1;
+		wr.cookies[0] = t2->core_task_cookie;
+	}
+
+	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
+
+	if (sched_core_put_after_stopper)
+		sched_core_put();
+
+	ret = 0;
+out_unlock:
+	mutex_unlock(&sched_core_tasks_mutex);
+	return ret;
+}
+
+/* Called from prctl interface: PR_SCHED_CORE_SHARE */
+int sched_core_share_pid(pid_t pid)
+{
+	struct task_struct *task;
+	int err;
+
+	if (pid == 0) { /* Reset current task's cookie. */
+		/* Resetting a cookie requires privileges. */
+		if (current->core_task_cookie)
+			if (!capable(CAP_SYS_ADMIN))
+				return -EPERM;
+		task = NULL;
+	} else {
+		rcu_read_lock();
+		task = pid ? find_task_by_vpid(pid) : current;
+		if (!task) {
+			rcu_read_unlock();
+			return -ESRCH;
+		}
+
+		get_task_struct(task);
+
+		/*
+		 * Check if this process has the right to modify the specified
+		 * process. Use the regular "ptrace_may_access()" checks.
+		 */
+		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+			rcu_read_unlock();
+			err = -EPERM;
+			goto out_put;
+		}
+		rcu_read_unlock();
+	}
+
+	err = sched_core_share_tasks(current, task);
+out_put:
+	if (task)
+		put_task_struct(task);
+	return err;
+}
+
+/* CGroup core-scheduling interface support. */
+
+/*
+ * Helper to get the cookie in a hierarchy.
+ * The cookie is a combination of a tag and color. Any ancestor
+ * can have a tag/color. The tag is the first-level cookie setting,
+ * with the color being the second. At most one color and one tag
+ * are allowed.
+ */
+unsigned long cpu_core_get_group_cookie(struct task_group *tg)
+{
+	unsigned long color = 0;
+
+	if (!tg)
+		return 0;
+
+	for (; tg; tg = tg->parent) {
+		if (tg->core_tag_color) {
+			WARN_ON_ONCE(color);
+			color = tg->core_tag_color;
+		}
+
+		if (tg->core_tagged) {
+			unsigned long cookie = ((unsigned long)tg << 8) | color;
+			cookie &= SCHED_CORE_GROUP_COOKIE_MASK;
+			return cookie;
+		}
+	}
+
+	return 0;
+}
+
+/* Determine if any group among @tg's descendants is tagged or colored. */
+static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag,
+				       bool check_color)
+{
+	struct task_group *child;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		if ((child->core_tagged && check_tag) ||
+		    (child->core_tag_color && check_color)) {
+			rcu_read_unlock();
+			return true;
+		}
+
+		/* Recurse into each child so that no descendant is missed. */
+		if (cpu_core_check_descendants(child, check_tag, check_color)) {
+			rcu_read_unlock();
+			return true;
+		}
+	}
+
+	rcu_read_unlock();
+	return false;
+}
+
+u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
+			  struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return !!tg->core_tagged;
+}
+
+u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css,
+				struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return tg->core_tag_color;
+}
+
+#ifdef CONFIG_SCHED_DEBUG
+u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css,
+				   struct cftype *cft)
+{
+	return cpu_core_get_group_cookie(css_tg(css));
+}
+#endif
+
+struct write_core_tag {
+	struct cgroup_subsys_state *css;
+	unsigned long cookie;
+};
+
+static int __sched_write_tag(void *data)
+{
+	struct write_core_tag *tag = (struct write_core_tag *) data;
+	struct task_struct *p;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(css, tag->css) {
+		struct css_task_iter it;
+
+		css_task_iter_start(css, 0, &it);
+		/*
+		 * Note: css_task_iter_next will skip dying tasks.
+		 * There could still be dying tasks left in the core queue
+		 * when the cgroup tag is set to 0, even after the loop
+		 * below is done.
+		 */
+		while ((p = css_task_iter_next(&it)))
+			sched_core_tag_requeue(p, tag->cookie, true /* group */);
+
+		css_task_iter_end(&it);
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+
+int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
+			   u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct write_core_tag wtag;
+
+	if (val > 1)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	if (!tg->core_tagged && val) {
+		/* Tag is being set. Check ancestors and descendants. */
+		if (cpu_core_get_group_cookie(tg) ||
+		    cpu_core_check_descendants(tg, true /* tag */, true /* color */))
+			return -EBUSY;
+	} else if (tg->core_tagged && !val) {
+		/* Tag is being reset. Check descendants. */
+		if (cpu_core_check_descendants(tg, true /* tag */, true /* color */))
+			return -EBUSY;
+	} else {
+		return 0;
+	}
+
+	if (!!val)
+		sched_core_get();
+
+	wtag.css = css;
+	wtag.cookie = (unsigned long)tg << 8; /* Reserve lower 8 bits for color. */
+
+	/* Truncate the upper 32-bits - those are used by the per-task cookie. */
+	wtag.cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
+
+	tg->core_tagged = val;
+
+	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
+	if (!val)
+		sched_core_put();
+
+	return 0;
+}
+
+int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cft, u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct write_core_tag wtag;
+	u64 cookie;
+
+	if (val > 255)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	cookie = cpu_core_get_group_cookie(tg);
+	/* Can't set color if nothing in the ancestors was tagged. */
+	if (!cookie)
+		return -EINVAL;
+
+	/*
+	 * Something in the ancestors already colors us. Can't change the color
+	 * at this level.
+	 */
+	if (!tg->core_tag_color && (cookie & 255))
+		return -EINVAL;
+
+	/*
+	 * Check if any descendants are colored. If so, we can't recolor them.
+	 * Don't need to check if descendants are tagged, since we don't allow
+	 * tagging when already tagged.
+	 */
+	if (cpu_core_check_descendants(tg, false /* tag */, true /* color */))
+		return -EINVAL;
+
+	cookie &= ~255;
+	cookie |= val;
+	wtag.css = css;
+	wtag.cookie = cookie;
+	tg->core_tag_color = val;
+
+	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
+
+	return 0;
+}
+
+void sched_tsk_free(struct task_struct *tsk)
+{
+	if (!tsk->core_task_cookie)
+		return;
+	sched_core_put_task_cookie(tsk->core_task_cookie);
+	sched_core_put();
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aebeb91c4a0f..290a3b8be3d3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -437,6 +437,11 @@ struct task_group {
 
 };
 
+static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct task_group, css) : NULL;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD
 
@@ -1104,6 +1109,8 @@ static inline int cpu_of(struct rq *rq)
 #ifdef CONFIG_SCHED_CORE
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+#define SCHED_CORE_GROUP_COOKIE_MASK ((1UL << (sizeof(unsigned long) * 4)) - 1)
+
 static inline bool sched_core_enabled(struct rq *rq)
 {
 	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
@@ -1148,10 +1155,54 @@ static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
 	return idle_core || rq->core->core_cookie == p->core_cookie;
 }
 
-extern void queue_core_balance(struct rq *rq);
+static inline bool sched_core_enqueued(struct task_struct *task)
+{
+	return !RB_EMPTY_NODE(&task->core_node);
+}
+
+void queue_core_balance(struct rq *rq);
+
+void sched_core_enqueue(struct rq *rq, struct task_struct *p);
+void sched_core_dequeue(struct rq *rq, struct task_struct *p);
+void sched_core_get(void);
+void sched_core_put(void);
+
+void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie,
+			    bool group);
+
+int sched_core_share_pid(pid_t pid);
+int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
+
+unsigned long cpu_core_get_group_cookie(struct task_group *tg);
+
+u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
+			  struct cftype *cft);
+
+u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css,
+				struct cftype *cft);
+
+#ifdef CONFIG_SCHED_DEBUG
+u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css,
+				   struct cftype *cft);
+#endif
+
+int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
+			   u64 val);
+
+int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cft, u64 val);
+
+#ifndef TIF_UNSAFE_RET
+#define TIF_UNSAFE_RET (0)
+#endif
 
 #else /* !CONFIG_SCHED_CORE */
 
+static inline bool sched_core_enqueued(struct task_struct *task) { return false; }
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2) { return 0; }
+
 static inline bool sched_core_enabled(struct rq *rq)
 {
 	return false;
@@ -2779,7 +2830,4 @@ void swake_up_all_locked(struct swait_queue_head *q);
 void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
 
 #ifdef CONFIG_SCHED_CORE
-#ifndef TIF_UNSAFE_RET
-#define TIF_UNSAFE_RET (0)
-#endif
 #endif
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 25/26] Documentation: Add core scheduling documentation
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (23 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 24/26] sched: Move core-scheduler interfacing code to a new file Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-20  3:36   ` Randy Dunlap
  2020-10-20  1:43 ` [PATCH v8 -tip 26/26] sched: Debug bits Joel Fernandes (Google)
                   ` (2 subsequent siblings)
  27 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 312 ++++++++++++++++++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 313 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index 000000000000..eacafbb8fa3f
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,312 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks doesn't trust another), or for performance usecases (some
+workloads may benefit from running on the same core because they don't contend
+for the same hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that trusted tasks can share a core. This increase in core sharing
+can improve performance; however, it is not guaranteed that performance will
+always improve, though that has been seen to be the case with a number of real
+world workloads. In theory, core scheduling aims to perform at least as well as
+when Hyper Threading is disabled. In practice, this is mostly the case though
+not always, as synchronizing scheduling decisions across 2 or more CPUs in a
+core involves additional overhead - especially when the system is lightly
+loaded (``total_threads <= N/2``, where N is the number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.tag``
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+This allows all the CGroup's tasks to run concurrently on a core's
+hyperthreads (also called siblings).
+
+A value of ``0`` in this file means the tag state of the CGroup is inherited
+from its parent hierarchy. If any ancestor of the CGroup is tagged, then the
+group is tagged.
+
+.. note:: Once a CGroup is tagged via cpu.tag, it is not possible to set this
+          for any descendant of the tagged group. For finer grained control, the
+          ``cpu.tag_color`` file described next may be used.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+          a core with kernel threads and untagged system threads. For this reason,
+          if a group has ``cpu.tag`` of 0, it is considered to be trusted.
+
+* ``cpu.tag_color``
+For finer grained control over core sharing, a color can also be set in
+addition to the tag. This allows further control of core sharing between child
+CGroups within an already tagged CGroup. The color and the tag are both used to
+generate a `cookie` which is used by the scheduler to identify the group.
+
+Up to 256 different colors can be set (0-255) by writing into this file.
+
+A sample real-world usage of this file follows:
+
+Google uses DAC controls to make ``cpu.tag`` writable only by root, while
+``cpu.tag_color`` can be changed by anyone.
+
+The hierarchy looks like this:
+::
+  Root group
+     / \
+    A   B    (These are created by the root daemon - borglet).
+   / \   \
+  C   D   E  (These are created by AppEngine within the container).
+
+A and B are containers for 2 different jobs or apps that are created by a root
+daemon called borglet. borglet then tags each of these groups with the ``cpu.tag``
+file. The job itself can create additional child CGroups which are colored by
+the container's AppEngine with the ``cpu.tag_color`` file.
+
+The reason why Google uses this 2-level tagging system is that AppEngine wants to
+allow a subset of child CGroups within a tagged parent CGroup to be co-scheduled on a
+core while not being co-scheduled with other child CGroups. Think of these
+child CGroups as belonging to the same customer or project.  Because these
+child CGroups are created by AppEngine, they are not tracked by borglet (the
+root daemon), therefore borglet won't have a chance to set a color for them.
+That's where the ``cpu.tag_color`` file comes in. A color could be set by AppEngine,
+and once set, the normal tasks within the subcgroup would not be able to
+overwrite it. This is enforced by promoting the permission of the
+``cpu.tag_color`` file in cgroupfs.
+
+The color is an 8-bit value allowing for up to 256 unique colors.
+
+.. note:: Once a CGroup is colored, none of its descendants can be re-colored. Also
+          coloring of a CGroup is possible only if either the group or one of its
+          ancestors was tagged via the ``cpu.tag`` file.
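+
+As an illustration, a root daemon could tag a group and an application could
+then color one of its child groups roughly as follows (the cgroup mount point
+and group names below are hypothetical, and error handling is minimal)::
+
+  #include <stdio.h>
+
+  /* Write a string value into a cgroup control file. */
+  static int write_str(const char *path, const char *val)
+  {
+          FILE *f = fopen(path, "w");
+
+          if (!f)
+                  return -1;
+          fprintf(f, "%s\n", val);
+          return fclose(f);
+  }
+
+  int main(void)
+  {
+          /* Tag the "jobA" group, then color its "projX" child group. */
+          write_str("/sys/fs/cgroup/cpu/jobA/cpu.tag", "1");
+          write_str("/sys/fs/cgroup/cpu/jobA/projX/cpu.tag_color", "10");
+          return 0;
+  }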
+
+prctl interface
+###############
+A ``prctl(2)`` command ``PR_SCHED_CORE_SHARE`` is available to a process to request
+sharing a core with another process.  For example, consider 2 processes ``P1``
+and ``P2`` with PIDs 100 and 200. If process ``P1`` calls
+``prctl(PR_SCHED_CORE_SHARE, 200)``, the kernel makes ``P1`` share a core with ``P2``.
+The kernel performs ptrace access mode checks before granting the request.
+
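+A minimal userspace sketch of the request above (assuming the
+``PR_SCHED_CORE_SHARE`` definition from this patch series is available in the
+installed kernel headers) could look like::
+
+  #include <sys/prctl.h>
+  #include <stdio.h>
+
+  int main(void)
+  {
+          /* P1 (this process) asks to share a core with P2 (PID 200). */
+          if (prctl(PR_SCHED_CORE_SHARE, 200) == -1)
+                  perror("PR_SCHED_CORE_SHARE");
+          return 0;
+  }
+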
+.. note:: This operation is not commutative. P1 calling
+          ``prctl(PR_SCHED_CORE_SHARE, pidof(P2))`` is not the same as P2 calling the
+          same for P1. The former case is P1 joining P2's group of processes
+          (which P2 would have joined with ``prctl(2)`` prior to P1's ``prctl(2)``).
+
+.. note:: The core-sharing granted with prctl(2) will be subject to
+          core-sharing restrictions specified by the CGroup interface. For example
+          if P1 and P2 are a part of 2 different tagged CGroups, then they will
+          not share a core even if a prctl(2) call is made. This is analogous
+          to how affinities are set using the cpuset interface.
+
+It is important to note that, on a ``CLONE_THREAD`` ``clone(2)`` syscall, the child
+will be assigned the same tag as its parent and thus be allowed to share a core
+with it. This design choice is because, for the security usecase, a
+``CLONE_THREAD`` child can access its parent's address space anyway, so there's
+no point in not allowing them to share a core. If a different behavior is
+desired, the child thread can call ``prctl(2)`` as needed.  This behavior is
+specific to the ``prctl(2)`` interface. For the CGroup interface, the child of a
+fork always shares a core with its parent.  On the other hand, if a parent
+was previously tagged via ``prctl(2)`` and does a regular ``fork(2)`` syscall, the
+child will receive a unique tag.
+
+Design/Implementation
+---------------------
+Each task that is tagged is assigned a cookie internally in the kernel. As
+mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
+each other and share a core.
+
+The basic idea is that every schedule event tries to select tasks for all the
+siblings of a core such that all the selected tasks running on a core are
+trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
+The idle task is considered special, in that it trusts everything.
+
+During a ``schedule()`` event on any sibling of a core, the highest priority task
+for that core is picked and assigned to the sibling calling ``schedule()``, if
+that sibling has the task enqueued. For the rest of the siblings in the core,
+the highest priority task with the same cookie is selected if there is one
+runnable in their individual run queues. If a task with the same cookie is not
+available, the idle task is selected. The idle task is globally trusted.
+
+Once a task has been selected for all the siblings in the core, an IPI is sent to
+siblings for whom a new task was selected. Siblings, on receiving the IPI, will
+switch to the new task immediately. If an idle task is selected for a sibling,
+then the sibling is considered to be in a `forced idle` state. I.e., it may
+have tasks on its own runqueue to run; however, it will still have to run idle.
+More on this in the next section.
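+
+A highly simplified sketch of this per-core selection rule follows. It is
+illustrative only: ``for_each_sibling()``, ``highest_prio_matching()`` and
+``run_on()`` are hypothetical helpers, and ``max`` stands for the core-wide
+highest priority pick described above::
+
+  /* Each sibling runs a task matching the core-wide cookie, or goes idle. */
+  for_each_sibling(cpu, core) {
+          struct task_struct *p;
+
+          p = highest_prio_matching(cpu, max->core_cookie);
+          run_on(cpu, p ? p : idle_task(cpu));
+  }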
+
+Forced-idling of tasks
+----------------------
+The scheduler tries its best to find tasks that trust each other such that all
+tasks selected to be scheduled are of the highest priority in a core.  However,
+it is possible that some runqueues had tasks that were incompatible with the
+highest priority ones in the core. Favoring security over fairness, one or more
+siblings could be forced to select a lower priority task if the highest
+priority task is not trusted with respect to the core wide highest priority
+task.  If a sibling does not have a trusted task to run, it will be forced idle
+by the scheduler (the idle thread is scheduled to run).
+
+When the highest priority task is selected to run, a reschedule-IPI is sent to
+the sibling to force it into idle. This results in 4 cases which need to be
+considered depending on whether a VM or a regular usermode process was running
+on either HT::
+
+          HT1 (attack)            HT2 (victim)
+   A      idle -> user space      user space -> idle
+   B      idle -> user space      guest -> idle
+   C      idle -> guest           user space -> idle
+   D      idle -> guest           guest -> idle
+
+Note that for better performance, we do not wait for the destination CPU
+(victim) to enter idle mode. This is because the sending of the IPI would bring
+the destination CPU immediately into kernel mode from user space, or VMEXIT
+in the case of guests. At best, this would only leak some scheduler metadata
+which may not be worth protecting. It is also possible that the IPI is received
+too late on some architectures, but this has not been observed in the case of
+x86.
+
+Kernel protection from untrusted tasks
+--------------------------------------
+The scheduler on its own cannot protect the kernel executing concurrently with
+an untrusted task in a core. This is because the scheduler is unaware of
+interrupts/syscalls at scheduling time. To mitigate this, an IPI is sent to
+siblings on kernel entry (syscall and IRQ). This IPI forces the sibling to enter
+kernel mode and wait before returning to user until all siblings of the
+core have left kernel mode. This process is also known as stunning.  For good
+performance, an IPI is sent to a sibling only if it is running a tagged
+task. If a sibling is running a kernel thread or is idle, no IPI is sent.
+
+The kernel protection feature can be turned off on the kernel command line by
+passing ``sched_core_protect_kernel=0``.
+
+Other alternative ideas discussed for kernel protection are listed below just
+for completeness. They all have limitations:
+
+1. Changing interrupt affinities to a trusted core which does not execute untrusted tasks
+#########################################################################################
+By changing the interrupt affinities to a designated safe-CPU which runs
+only trusted tasks, IRQ data can be protected. One issue is this involves
+giving up a full CPU core of the system to run safe tasks. Another is that,
+per-cpu interrupts such as the local timer interrupt cannot have their
+affinity changed. Also, sensitive timer callbacks such as the random entropy timer
+can run in softirq on return from these interrupts and expose sensitive
+data. In the future, that could be mitigated by forcing softirqs into threaded
+mode by utilizing a mechanism similar to ``CONFIG_PREEMPT_RT``.
+
+Yet another issue with this is, for multiqueue devices with managed
+interrupts, the IRQ affinities cannot be changed; however, it could be
+possible to force a reduced number of queues, which would in turn allow
+shielding one or two CPUs from such interrupts and queue handling, for the price
+of indirection.
+
+2. Running IRQs as threaded-IRQs
+################################
+This would result in forcing IRQs into the scheduler which would then provide
+the process-context mitigation. However, not all interrupts can be threaded.
+Also this does nothing about syscall entries.
+
+3. Kernel Address Space Isolation
+#################################
+System calls could run in a much restricted address space which is
+guaranteed not to leak any sensitive data. There are practical limitations in
+implementing this - the main concern being how to decide on an address space
+that is guaranteed to not have any sensitive data.
+
+4. Limited cookie-based protection
+##################################
+On a system call, change the cookie to the system trusted cookie and initiate a
+schedule event. This would be better than pausing all the siblings during the
+entire duration for the system call, but still would be a huge hit to the
+performance.
+
+Trust model
+-----------
+Core scheduling maintains trust relationships amongst groups of tasks by
+assigning them the same cookie value.
+When a system with core scheduling boots, all tasks are considered to trust
+each other. This is because the core scheduler does not have information about
+trust relationships until userspace uses the above mentioned interfaces to
+communicate them. In other words, all tasks have a default cookie value of 0
+and are considered system-wide trusted. The stunning of siblings running
+cookie-0 tasks is also avoided.
+
+Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
+within such groups are considered to trust each other, but do not trust those
+outside. Tasks outside the group also don't trust tasks within.
+
+Limitations
+-----------
+Core scheduling tries to guarantee that only trusted tasks run concurrently on a
+core. But there could be a small window of time during which untrusted tasks run
+concurrently, or the kernel could be running concurrently with a task not trusted
+by the kernel.
+
+1. IPI processing delays
+########################
+Core scheduling selects only trusted tasks to run together. An IPI is used to
+notify the siblings to switch to the new task. But there could be hardware delays
+in receiving the IPI on some architectures (on x86, this has not been observed).
+This may cause an attacker task to start running on a CPU before its siblings
+receive the IPI. Even though the cache is flushed on entry to user mode, victim
+tasks on siblings may populate data in the cache and microarchitectural buffers
+after the attacker starts to run, and this is a possible source of data leaks.
+
+Open cross-HT issues that core scheduling does not solve
+--------------------------------------------------------
+1. For MDS
+##########
+Core scheduling cannot protect against MDS attacks between an HT running in
+user mode and another running in kernel mode. Even though both HTs run tasks
+which trust each other, kernel memory is still considered untrusted. Such
+attacks are possible for any combination of sibling CPU modes (host or guest mode).
+
+2. For L1TF
+###########
+Core scheduling cannot protect against an L1TF guest attacker exploiting a
+guest or host victim. This is because the guest attacker can craft invalid
+PTEs which are not inverted due to a vulnerable guest kernel. The only
+solution is to disable EPT.
+
+For both MDS and L1TF, if the guest vCPUs are configured to not trust each
+other (by tagging them separately), then guest-to-guest attacks would go away.
+Or it could be a system admin policy which considers guest-to-guest attacks as
+a guest problem.
+
+Another approach to resolve these would be to make every untrusted task on the
+system not trust every other untrusted task. While this could reduce
+parallelism of the untrusted tasks, it would still solve the above issues while
+allowing system processes (trusted tasks) to share a core.
+
+Use cases
+---------
+The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
+with SMT enabled. There are other use cases where this feature could be used:
+
+- Isolating tasks that need a whole core: Examples include realtime tasks, tasks
+  that use SIMD instructions, etc.
+- Gang scheduling: Requirements for a group of tasks that need to be scheduled
+  together could also be realized using core scheduling. One example is vCPUs of
+  a VM.
+
+Future work
+-----------
+Skipping per-HT mitigations if task is trusted
+##############################################
+If core scheduling is enabled, by default all tasks trust each other as
+mentioned above. In such a scenario, it may be desirable to skip the same-HT
+mitigations on return to trusted user mode to improve performance.
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index 21710f8609fe..361ccbbd9e54 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -16,3 +16,4 @@ are configurable at compile, boot or run time.
    multihit.rst
    special-register-buffer-data-sampling.rst
    l1d_flush.rst
+   core-scheduling.rst
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 -tip 26/26] sched: Debug bits...
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (24 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 25/26] Documentation: Add core scheduling documentation Joel Fernandes (Google)
@ 2020-10-20  1:43 ` Joel Fernandes (Google)
  2020-10-30 13:26 ` [PATCH v8 -tip 00/26] Core scheduling Ning, Hongyu
  2020-11-06 20:55 ` [RFT for v9] (Was Re: [PATCH v8 -tip 00/26] Core scheduling) Joel Fernandes
  27 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes (Google) @ 2020-10-20  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Not-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 211e0784675f..61758b5478d8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -127,6 +127,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 
 	int pa = __task_prio(a), pb = __task_prio(b);
 
+	trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+		     a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+		     b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
 	if (-pa < -pb)
 		return true;
 
@@ -317,12 +321,16 @@ static void __sched_core_enable(void)
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+	printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
+
+	printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -4978,6 +4986,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			set_next_task(rq, next);
 		}
 
+		trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+			     rq->core->core_task_seq,
+			     rq->core->core_pick_seq,
+			     rq->core_sched_seq,
+			     next->comm, next->pid,
+			     next->core_cookie);
+
 		rq->core_pick = NULL;
 		return next;
 	}
@@ -5066,6 +5081,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 */
 			if (i == cpu && !need_sync && !p->core_cookie) {
 				next = p;
+				trace_printk("unconstrained pick: %s/%d %lx\n",
+					     next->comm, next->pid, next->core_cookie);
 				goto done;
 			}
 
@@ -5074,6 +5091,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 			rq_i->core_pick = p;
 
+			trace_printk("cpu(%d): selected: %s/%d %lx\n",
+				     i, p->comm, p->pid, p->core_cookie);
+
 			/*
 			 * If this new candidate is of higher priority than the
 			 * previous; and they're incompatible; we need to wipe
@@ -5090,6 +5110,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				rq->core->core_cookie = p->core_cookie;
 				max = p;
 
+				trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);
+
 				if (old_max) {
 					for_each_cpu(j, smt_mask) {
 						if (j == i)
@@ -5120,6 +5142,7 @@ next_class:;
 
 	/* Something should have been selected for current CPU */
 	WARN_ON_ONCE(!next);
+	trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
 
 	/*
 	 * Reschedule siblings
@@ -5155,13 +5178,21 @@ next_class:;
 		}
 
 		/* Did we break L1TF mitigation requirements? */
-		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+		if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+			trace_printk("[%d]: cookie mismatch. %s/%d/0x%lx/0x%lx\n",
+				     rq_i->cpu, rq_i->core_pick->comm,
+				     rq_i->core_pick->pid,
+				     rq_i->core_pick->core_cookie,
+				     rq_i->core->core_cookie);
+			WARN_ON_ONCE(1);
+		}
 
 		if (rq_i->curr == rq_i->core_pick) {
 			rq_i->core_pick = NULL;
 			continue;
 		}
 
+		trace_printk("IPI(%d)\n", i);
 		resched_curr(rq_i);
 	}
 
@@ -5209,6 +5240,10 @@ static bool try_steal_cookie(int this, int that)
 		if (p->core_occupation > dst->idle->core_occupation)
 			goto next;
 
+		trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+			     p->comm, p->pid, that, this,
+			     p->core_occupation, dst->idle->core_occupation, cookie);
+
 		p->on_rq = TASK_ON_RQ_MIGRATING;
 		deactivate_task(src, p, 0);
 		set_task_cpu(p, this);
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 25/26] Documentation: Add core scheduling documentation
  2020-10-20  1:43 ` [PATCH v8 -tip 25/26] Documentation: Add core scheduling documentation Joel Fernandes (Google)
@ 2020-10-20  3:36   ` Randy Dunlap
  2020-11-12 16:11     ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Randy Dunlap @ 2020-10-20  3:36 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

Hi Joel,

On 10/19/20 6:43 PM, Joel Fernandes (Google) wrote:
> Document the usecases, design and interfaces for core scheduling.
> 
> Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>   .../admin-guide/hw-vuln/core-scheduling.rst   | 312 ++++++++++++++++++
>   Documentation/admin-guide/hw-vuln/index.rst   |   1 +
>   2 files changed, 313 insertions(+)
>   create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
> 
> diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> new file mode 100644
> index 000000000000..eacafbb8fa3f
> --- /dev/null
> +++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> @@ -0,0 +1,312 @@
> +Core Scheduling
> +***************
> +Core scheduling support allows userspace to define groups of tasks that can
> +share a core. These groups can be specified either for security usecases (one
> +group of tasks don't trust another), or for performance usecases (some
> +workloads may benefit from running on the same core as they don't need the same
> +hardware resources of the shared core).
> +
> +Security usecase
> +----------------
> +A cross-HT attack involves the attacker and victim running on different
> +Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
> +Without core scheduling, the only full mitigation of cross-HT attacks is to
> +disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
> +by ensuring that trusted tasks can share a core. This increase in core sharing
> +can improvement performance, however it is not guaranteed that performance will
> +always improve, though that is seen to be the case with a number of real world
> +workloads. In theory, core scheduling aims to perform at least as good as when
> +Hyper Threading is disabled. In practise, this is mostly the case though not
> +always: as synchronizing scheduling decisions across 2 or more CPUs in a core
> +involves additional overhead - especially when the system is lightly loaded
> +(``total_threads <= N/2``).

N is number of CPUs?

> +
> +Usage
> +-----
> +Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
> +Using this feature, userspace defines groups of tasks that trust each other.
> +The core scheduler uses this information to make sure that tasks that do not
> +trust each other will never run simultaneously on a core, while doing its best
> +to satisfy the system's scheduling requirements.
> +
> +There are 2 ways to use core-scheduling:
> +
> +CGroup
> +######
> +Core scheduling adds additional files to the CPU controller CGroup:
> +
> +* ``cpu.tag``
> +Writing ``1`` into this file results in all tasks in the group get tagged. This

                                                                   getting
or                                                                being

> +results in all the CGroup's tasks allowed to run concurrently on a core's
> +hyperthreads (also called siblings).
> +
> +The file being a value of ``0`` means the tag state of the CGroup is inheritted

                                                                         inherited

> +from its parent hierarchy. If any ancestor of the CGroup is tagged, then the
> +group is tagged.
> +
> +.. note:: Once a CGroup is tagged via cpu.tag, it is not possible to set this
> +          for any descendant of the tagged group. For finer grained control, the
> +          ``cpu.tag_color`` file described next may be used.
> +
> +.. note:: When a CGroup is not tagged, all the tasks within the group can share
> +          a core with kernel threads and untagged system threads. For this reason,
> +          if a group has ``cpu.tag`` of 0, it is considered to be trusted.
> +
> +* ``cpu.tag_color``
> +For finer grained control over core sharing, a color can also be set in
> +addition to the tag. This allows to further control core sharing between child
> +CGroups within an already tagged CGroup. The color and the tag are both used to
> +generate a `cookie` which is used by the scheduler to identify the group.
> +
> +Upto 256 different colors can be set (0-255) by writing into this file.

   Up to

> +
> +A sample real-world usage of this file follows:
> +
> +Google uses DAC controls to make ``cpu.tag`` writeable only by root and the

$search tells me "writable".

> +``cpu.tag_color`` can be changed by anyone.
> +
> +The hierarchy looks like this:
> +::
> +  Root group
> +     / \
> +    A   B    (These are created by the root daemon - borglet).
> +   / \   \
> +  C   D   E  (These are created by AppEngine within the container).
> +
> +A and B are containers for 2 different jobs or apps that are created by a root
> +daemon called borglet. borglet then tags each of these group with the ``cpu.tag``
> +file. The job itself can create additional child CGroups which are colored by
> +the container's AppEngine with the ``cpu.tag_color`` file.
> +
> +The reason why Google uses this 2-level tagging system is that AppEngine wants to
> +allow a subset of child CGroups within a tagged parent CGroup to be co-scheduled on a
> +core while not being co-scheduled with other child CGroups. Think of these
> +child CGroups as belonging to the same customer or project.  Because these
> +child CGroups are created by AppEngine, they are not tracked by borglet (the
> +root daemon), therefore borglet won't have a chance to set a color for them.
> +That's where cpu.tag_color file comes in. A color could be set by AppEngine,
> +and once set, the normal tasks within the subcgroup would not be able to
> +overwrite it. This is enforced by promoting the permission of the
> +``cpu.tag_color`` file in cgroupfs.
> +
> +The color is an 8-bit value allowing for upto 256 unique colors.

                                             up to

> +
> +.. note:: Once a CGroup is colored, none of its descendants can be re-colored. Also
> +          coloring of a CGroup is possible only if either the group or one of its
> +          ancestors were tagged via the ``cpu.tag`` file.

                        was

> +
> +prctl interface
> +###############
> +A ``prtcl(2)`` command ``PR_SCHED_CORE_SHARE`` is available to a process to request
> +sharing a core with another process.  For example, consider 2 processes ``P1``
> +and ``P2`` with PIDs 100 and 200. If process ``P1`` calls
> +``prctl(PR_SCHED_CORE_SHARE, 200)``, the kernel makes ``P1`` share a core with ``P2``.
> +The kernel performs ptrace access mode checks before granting the request.
> +
> +.. note:: This operation is not commutative. P1 calling
> +          ``prctl(PR_SCHED_CORE_SHARE, pidof(P2)`` is not the same as P2 calling the
> +          same for P1. The former case is P1 joining P2's group of processes
> +          (which P2 would have joined with ``prctl(2)`` prior to P1's ``prctl(2)``).
> +
> +.. note:: The core-sharing granted with prctl(2) will be subject to
> +          core-sharing restrictions specified by the CGroup interface. For example
> +          if P1 and P2 are a part of 2 different tagged CGroups, then they will
> +          not share a core even if a prctl(2) call is made. This is analogous
> +          to how affinities are set using the cpuset interface.
> +
> +It is important to note that, on a ``CLONE_THREAD`` ``clone(2)`` syscall, the child
> +will be assigned the same tag as its parent and thus be allowed to share a core
> +with them. is design choice is because, for the security usecase, a

               ^^^ missing subject ...

> +``CLONE_THREAD`` child can access its parent's address space anyway, so there's
> +no point in not allowing them to share a core. If a different behavior is
> +desired, the child thread can call ``prctl(2)`` as needed.  This behavior is
> +specific to the ``prctl(2)`` interface. For the CGroup interface, the child of a
> +fork always share's a core with its parent's.  On the other hand, if a parent

                shares         with its parent's what?  "parent's" is possessive.


> +was previously tagged via ``prctl(2)`` and does a regular ``fork(2)`` syscall, the
> +child will receive a unique tag.
> +
> +Design/Implementation
> +---------------------
> +Each task that is tagged is assigned a cookie internally in the kernel. As
> +mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
> +each other and share a core.
> +
> +The basic idea is that, every schedule event tries to select tasks for all the
> +siblings of a core such that all the selected tasks running on a core are
> +trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
> +The idle task is considered special, in that it trusts every thing.

                                         or                everything.

> +
> +During a ``schedule()`` event on any sibling of a core, the highest priority task for
> +that core is picked and assigned to the sibling calling ``schedule()`` if it has it

                                                            too many          it     it
They are not the same "it,", are they?

> +enqueued. For rest of the siblings in the core, highest priority task with the
> +same cookie is selected if there is one runnable in their individual run
> +queues. If a task with same cookie is not available, the idle task is selected.
> +Idle task is globally trusted.
> +
> +Once a task has been selected for all the siblings in the core, an IPI is sent to
> +siblings for whom a new task was selected. Siblings on receiving the IPI, will

                                                                     no comma^

> +switch to the new task immediately. If an idle task is selected for a sibling,
> +then the sibling is considered to be in a `forced idle` state. i.e., it may

                                                                   I.e.,

> +have tasks on its on runqueue to run, however it will still have to run idle.
> +More on this in the next section.
> +
> +Forced-idling of tasks
> +----------------------
> +The scheduler tries its best to find tasks that trust each other such that all
> +tasks selected to be scheduled are of the highest priority in a core.  However,
> +it is possible that some runqueues had tasks that were incompatibile with the

                                                           incompatible

> +highest priority ones in the core. Favoring security over fairness, one or more
> +siblings could be forced to select a lower priority task if the highest
> +priority task is not trusted with respect to the core wide highest priority
> +task.  If a sibling does not have a trusted task to run, it will be forced idle
> +by the scheduler(idle thread is scheduled to run).

           scheduler (idle

> +
> +When the highest priorty task is selected to run, a reschedule-IPI is sent to

                     priority

> +the sibling to force it into idle. This results in 4 cases which need to be
> +considered depending on whether a VM or a regular usermode process was running
> +on either HT::
> +
> +          HT1 (attack)            HT2 (victim)
> +   A      idle -> user space      user space -> idle
> +   B      idle -> user space      guest -> idle
> +   C      idle -> guest           user space -> idle
> +   D      idle -> guest           guest -> idle
> +
> +Note that for better performance, we do not wait for the destination CPU
> +(victim) to enter idle mode. This is because the sending of the IPI would bring
> +the destination CPU immediately into kernel mode from user space, or VMEXIT
> +in the case of  guests. At best, this would only leak some scheduler metadata

   drop one space ^^

> +which may not be worth protecting. It is also possible that the IPI is received
> +too late on some architectures, but this has not been observed in the case of
> +x86.
> +
> +Kernel protection from untrusted tasks
> +--------------------------------------
> +The scheduler on its own cannot protect the kernel executing concurrently with
> +an untrusted task in a core. This is because the scheduler is unaware of
> +interrupts/syscalls at scheduling time. To mitigate this, an IPI is sent to
> +siblings on kernel entry (syscall and IRQ). This IPI forces the sibling to enter
> +kernel mode and wait before returning to user until all siblings of the
> +core have left kernel mode. This process is also known as stunning.  For good
> +performance, an IPI is sent only to a sibling only if it is running a tagged
> +task. If a sibling is running a kernel thread or is idle, no IPI is sent.
> +
> +The kernel protection feature can be turned off on the kernel command line by
> +passing ``sched_core_protect_kernel=0``.
> +
> +Other alternative ideas discussed for kernel protection are listed below just
> +for completeness. They all have limitations:
> +
> +1. Changing interrupt affinities to a trusted core which does not execute untrusted tasks
> +#########################################################################################
> +By changing the interrupt affinities to a designated safe-CPU which runs
> +only trusted tasks, IRQ data can be protected. One issue is this involves
> +giving up a full CPU core of the system to run safe tasks. Another is that,
> +per-cpu interrupts such as the local timer interrupt cannot have their
> +affinity changed. also, sensitive timer callbacks such as the random entropy timer

                      Also,

> +can run in softirq on return from these interrupts and expose sensitive
> +data. In the future, that could be mitigated by forcing softirqs into threaded
> +mode by utilizing a mechanism similar to ``CONFIG_PREEMPT_RT``.
> +
> +Yet another issue with this is, for multiqueue devices with managed
> +interrupts, the IRQ affinities cannot be changed however it could be

                                             changed; however,

> +possible to force a reduced number of queues which would in turn allow to
> +shield one or two CPUs from such interrupts and queue handling for the price
> +of indirection.
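
(For illustration only: steering a normal, non-managed IRQ to a designated
safe CPU is done through the procfs affinity interface. As the text above
notes, per-CPU and managed interrupts will not honor this, and irqbalance
may need to be told to leave the IRQ alone:)

	# pin IRQ 42 (an arbitrary example) to CPU0
	echo 1 > /proc/irq/42/smp_affinity        # hex CPU bitmask
	echo 0 > /proc/irq/42/smp_affinity_list   # same thing, list syntax
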
> +
> +2. Running IRQs as threaded-IRQs
> +################################
> +This would result in forcing IRQs into the scheduler which would then provide
> +the process-context mitigation. However, not all interrupts can be threaded.
> +Also this does nothing about syscall entries.
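
(Mainline already has a coarse knob in this direction: booting with
``threadirqs`` force-threads interrupt handlers, except those marked
IRQF_NO_THREAD. As the text says, that still leaves non-threadable
interrupts and syscall entry uncovered, so it is not a substitute for the
IPI scheme in this patch:)

	# kernel command line
	... threadirqs ...
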
> +
> +3. Kernel Address Space Isolation
> +#################################
> +System calls could run in a much restricted address space which is
> +guarenteed not to leak any sensitive data. There are practical limitation in

    guaranteed                                                     limitations in


> +implementing this - the main concern being how to decide on an address space
> +that is guarenteed to not have any sensitive data.

            guaranteed

> +
> +4. Limited cookie-based protection
> +##################################
> +On a system call, change the cookie to the system trusted cookie and initiate a
> +schedule event. This would be better than pausing all the siblings during the
> +entire duration for the system call, but still would be a huge hit to the
> +performance.
> +
> +Trust model
> +-----------
> +Core scheduling maintains trust relationships amongst groups of tasks by
> +assigning the tag of them with the same cookie value.
> +When a system with core scheduling boots, all tasks are considered to trust
> +each other. This is because the core scheduler does not have information about
> +trust relationships until userspace uses the above mentioned interfaces, to
> +communicate them. In other words, all tasks have a default cookie value of 0.
> +and are considered system-wide trusted. The stunning of siblings running
> +cookie-0 tasks is also avoided.
> +
> +Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
> +within such groups are considered to trust each other, but do not trust those
> +outside. Tasks outside the group also don't trust tasks within.
> +
> +Limitations
> +-----------
> +Core scheduling tries to guarentee that only trusted tasks run concurrently on a

                             guarantee

> +core. But there could be small window of time during which untrusted tasks run
> +concurrently or kernel could be running concurrently with a task not trusted by
> +kernel.
> +
> +1. IPI processing delays
> +########################
> +Core scheduling selects only trusted tasks to run together. IPI is used to notify
> +the siblings to switch to the new task. But there could be hardware delays in
> +receiving of the IPI on some arch (on x86, this has not been observed). This may
> +cause an attacker task to start running on a cpu before its siblings receive the

                                                 CPU

> +IPI. Even though cache is flushed on entry to user mode, victim tasks on siblings
> +may populate data in the cache and micro acrhitectural buffers after the attacker

                                             architectural

> +starts to run and this is a possibility for data leak.
> +
> +Open cross-HT issues that core scheduling does not solve
> +--------------------------------------------------------
> +1. For MDS
> +##########
> +Core scheduling cannot protect against MDS attacks between an HT running in
> +user mode and another running in kernel mode. Even though both HTs run tasks
> +which trust each other, kernel memory is still considered untrusted. Such
> +attacks are possible for any combination of sibling CPU modes (host or guest mode).
> +
> +2. For L1TF
> +###########
> +Core scheduling cannot protect against a L1TF guest attackers exploiting a

                                           an           attacker


> +guest or host victim. This is because the guest attacker can craft invalid
> +PTEs which are not inverted due to a vulnerable guest kernel. The only
> +solution is to disable EPT.

huh?  what is EPT?  where is it documented/discussed?
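
(EPT is Intel's Extended Page Tables, i.e. hardware two-level guest paging;
disabling it, falling back to shadow paging, is the full L1TF mitigation
described in Documentation/admin-guide/hw-vuln/l1tf.rst, so a pointer from
here to that document would probably answer the question. On a KVM/Intel
host it is a module parameter, e.g.:)

	cat /sys/module/kvm_intel/parameters/ept    # Y or N
	modprobe kvm_intel ept=0                    # reload with EPT disabled
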

> +
> +For both MDS and L1TF, if the guest vCPU is configured to not trust each
> +other (by tagging separately), then the guest to guest attacks would go away.
> +Or it could be a system admin policy which considers guest to guest attacks as
> +a guest problem.
> +
> +Another approach to resolve these would be to make every untrusted task on the
> +system to not trust every other untrusted task. While this could reduce
> +parallelism of the untrusted tasks, it would still solve the above issues while
> +allowing system processes (trusted tasks) to share a core.
> +
> +Use cases
> +---------
> +The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
> +with SMT enabled. There are other use cases where this feature could be used:
> +
> +- Isolating tasks that needs a whole core: Examples include realtime tasks, tasks
> +  that uses SIMD instructions etc.
> +- Gang scheduling: Requirements for a group of tasks that needs to be scheduled
> +  together could also be realized using core scheduling. One example is vcpus of

                                                                            vCPUs

> +  a VM.
> +
> +Future work
> +-----------
> +Skipping per-HT mitigations if task is trusted
> +##############################################
> +If core scheduling is enabled, by default all tasks trust each other as
> +mentioned above. In such scenario, it may be desirable to skip the same-HT
> +mitigations on return to the trusted user-mode to improve performance.

> diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
> index 21710f8609fe..361ccbbd9e54 100644
> --- a/Documentation/admin-guide/hw-vuln/index.rst
> +++ b/Documentation/admin-guide/hw-vuln/index.rst
> @@ -16,3 +16,4 @@ are configurable at compile, boot or run time.
>      multihit.rst
>      special-register-buffer-data-sampling.rst
>      l1d_flush.rst
> +   core-scheduling.rst

Might be an indentation problem there?  Can't tell for sure.
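
(If the existing entries really are indented by four spaces, the new line
presumably just needs one more leading space to line up, i.e. something
like:)

	     l1d_flush.rst
	+    core-scheduling.rst
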


thanks.



* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-10-20  1:43 ` [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode Joel Fernandes (Google)
@ 2020-10-20  3:41   ` Randy Dunlap
  2020-11-03  0:20     ` Joel Fernandes
  2020-10-22  5:48   ` Li, Aubrey
  2020-10-30 10:29   ` Alexandre Chartre
  2 siblings, 1 reply; 98+ messages in thread
From: Randy Dunlap @ 2020-10-20  3:41 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Tim Chen, Paul E . McKenney

On 10/19/20 6:43 PM, Joel Fernandes (Google) wrote:
> 
> ---
>   .../admin-guide/kernel-parameters.txt         |   7 +
>   include/linux/entry-common.h                  |   2 +-
>   include/linux/sched.h                         |  12 +
>   kernel/entry/common.c                         |  25 +-
>   kernel/sched/core.c                           | 229 ++++++++++++++++++
>   kernel/sched/sched.h                          |   3 +
>   6 files changed, 275 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 3236427e2215..48567110f709 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4678,6 +4678,13 @@
>   
>   	sbni=		[NET] Granch SBNI12 leased line adapter
>   
> +	sched_core_protect_kernel=

Needs a list of possible values after '=', along with telling us
what the default value/setting is.
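
Going by the __setup() handler further down in the patch (kstrtoul(), and
the static key is only disabled when the parsed value is 0), something
along these lines would do; the wording is only a suggestion:

	sched_core_protect_kernel=
			[SCHED_CORE] Format: <integer>
			0: disable kernel protection.
			Any other value (default): keep kernel protection
			enabled, i.e. pause SMT siblings of a core running
			in user mode while a sibling runs in kernel mode.
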


> +			[SCHED_CORE] Pause SMT siblings of a core running in
> +			user mode, if at least one of the siblings of the core
> +			is running in kernel mode. This is to guarantee that
> +			kernel data is not leaked to tasks which are not trusted
> +			by the kernel.
> +


thanks.


* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-10-20  1:43 ` [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode Joel Fernandes (Google)
  2020-10-20  3:41   ` Randy Dunlap
@ 2020-10-22  5:48   ` Li, Aubrey
  2020-11-03  0:50     ` Joel Fernandes
  2020-10-30 10:29   ` Alexandre Chartre
  2 siblings, 1 reply; 98+ messages in thread
From: Li, Aubrey @ 2020-10-22  5:48 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Tim Chen, Paul E . McKenney

On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
> Core-scheduling prevents hyperthreads in usermode from attacking each
> other, but it does not do anything about one of the hyperthreads
> entering the kernel for any reason. This leaves the door open for MDS
> and L1TF attacks with concurrent execution sequences between
> hyperthreads.
> 
> This patch therefore adds support for protecting all syscall and IRQ
> kernel mode entries. Care is taken to track the outermost usermode exit
> and entry using per-cpu counters. In cases where one of the hyperthreads
> enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> when not needed - example: idle and non-cookie HTs do not need to be
> forced into kernel mode.
> 
> More information about attacks:
> For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> data to either host or guest attackers. For L1TF, it is possible to leak
> to guest attackers. There is no possible mitigation involving flushing
> of buffers to avoid this since the execution of attacker and victims
> happen concurrently on 2 or more HTs.
> 
> Cc: Julien Desfossez <jdesfossez@digitalocean.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Aaron Lu <aaron.lwe@gmail.com>
> Cc: Aubrey Li <aubrey.li@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  .../admin-guide/kernel-parameters.txt         |   7 +
>  include/linux/entry-common.h                  |   2 +-
>  include/linux/sched.h                         |  12 +
>  kernel/entry/common.c                         |  25 +-
>  kernel/sched/core.c                           | 229 ++++++++++++++++++
>  kernel/sched/sched.h                          |   3 +
>  6 files changed, 275 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 3236427e2215..48567110f709 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4678,6 +4678,13 @@
>  
>  	sbni=		[NET] Granch SBNI12 leased line adapter
>  
> +	sched_core_protect_kernel=
> +			[SCHED_CORE] Pause SMT siblings of a core running in
> +			user mode, if at least one of the siblings of the core
> +			is running in kernel mode. This is to guarantee that
> +			kernel data is not leaked to tasks which are not trusted
> +			by the kernel.
> +
>  	sched_debug	[KNL] Enables verbose scheduler debug messages.
>  
>  	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 474f29638d2c..260216de357b 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -69,7 +69,7 @@
>  
>  #define EXIT_TO_USER_MODE_WORK						\
>  	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
> -	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
> +	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET |	\
>  	 ARCH_EXIT_TO_USER_MODE_WORK)
>  
>  /**
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d38e904dd603..fe6f225bfbf9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
>  
>  const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
>  
> +#ifdef CONFIG_SCHED_CORE
> +void sched_core_unsafe_enter(void);
> +void sched_core_unsafe_exit(void);
> +bool sched_core_wait_till_safe(unsigned long ti_check);
> +bool sched_core_kernel_protected(void);
> +#else
> +#define sched_core_unsafe_enter(ignore) do { } while (0)
> +#define sched_core_unsafe_exit(ignore) do { } while (0)
> +#define sched_core_wait_till_safe(ignore) do { } while (0)
> +#define sched_core_kernel_protected(ignore) do { } while (0)
> +#endif
> +
>  #endif
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 0a1e20f8d4e8..c8dc6b1b1f40 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -137,6 +137,26 @@ static __always_inline void exit_to_user_mode(void)
>  /* Workaround to allow gradual conversion of architecture code */
>  void __weak arch_do_signal(struct pt_regs *regs) { }
>  
> +unsigned long exit_to_user_get_work(void)
> +{
> +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +
> +	if (IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
> +		return ti_work;
> +
> +#ifdef CONFIG_SCHED_CORE
> +	ti_work &= EXIT_TO_USER_MODE_WORK;
> +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {

Though _TIF_UNSAFE_RET is not x86-specific, I only saw its definition for x86.
I'm not sure whether other SMT-capable architectures are happy with this?
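
(One way to keep !x86 builds working would be the usual fallback-define
pattern used for other optional TIF flags in include/linux/entry-common.h.
Just a sketch, in case the series doesn't already do this elsewhere:)

	/*
	 * Architectures that do not provide _TIF_UNSAFE_RET never set the
	 * flag, so the checks against it compile down to nothing.
	 */
	#ifndef _TIF_UNSAFE_RET
	# define _TIF_UNSAFE_RET	(0)
	#endif
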


> +		sched_core_unsafe_exit();
> +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
> +			sched_core_unsafe_enter(); /* not exiting to user yet. */
> +		}
> +	}
> +
> +	return READ_ONCE(current_thread_info()->flags);
> +#endif
> +}
> +
>  static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>  					    unsigned long ti_work)
>  {
> @@ -175,7 +195,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>  		 * enabled above.
>  		 */
>  		local_irq_disable_exit_to_user();
> -		ti_work = READ_ONCE(current_thread_info()->flags);
> +		ti_work = exit_to_user_get_work();
>  	}
>  
>  	/* Return the latest work state for arch_exit_to_user_mode() */
> @@ -184,9 +204,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>  
>  static void exit_to_user_mode_prepare(struct pt_regs *regs)
>  {
> -	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +	unsigned long ti_work;
>  
>  	lockdep_assert_irqs_disabled();
> +	ti_work = exit_to_user_get_work();
>  
>  	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>  		ti_work = exit_to_user_mode_loop(regs, ti_work);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 02db5b024768..5a7aeaa914e3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
>  
>  #ifdef CONFIG_SCHED_CORE
>  
> +DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
> +static int __init set_sched_core_protect_kernel(char *str)
> +{
> +	unsigned long val = 0;
> +
> +	if (!str)
> +		return 0;
> +
> +	if (!kstrtoul(str, 0, &val) && !val)
> +		static_branch_disable(&sched_core_protect_kernel);
> +
> +	return 1;
> +}
> +__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
> +
> +/* Is the kernel protected by core scheduling? */
> +bool sched_core_kernel_protected(void)
> +{
> +	return static_branch_likely(&sched_core_protect_kernel);
> +}
> +
>  DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
>  
>  /* kernel prio, less is more */
> @@ -4596,6 +4617,214 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
>  	return a->core_cookie == b->core_cookie;
>  }
>  
> +/*
> + * Handler to attempt to enter kernel. It does nothing because the exit to
> + * usermode or guest mode will do the actual work (of waiting if needed).
> + */
> +static void sched_core_irq_work(struct irq_work *work)
> +{
> +	return;
> +}
> +
> +static inline void init_sched_core_irq_work(struct rq *rq)
> +{
> +	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
> +}
> +
> +/*
> + * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
> + * exits the core-wide unsafe state. Obviously the CPU calling this function
> + * should not be responsible for the core being in the core-wide unsafe state
> + * otherwise it will deadlock.
> + *
> + * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
> + *            the loop if TIF flags are set and notify caller about it.
> + *
> + * IRQs should be disabled.
> + */
> +bool sched_core_wait_till_safe(unsigned long ti_check)
> +{
> +	bool restart = false;
> +	struct rq *rq;
> +	int cpu;
> +
> +	/* We clear the thread flag only at the end, so need to check for it. */
> +	ti_check &= ~_TIF_UNSAFE_RET;
> +
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/* Down grade to allow interrupts to prevent stop_machine lockups.. */
> +	preempt_disable();
> +	local_irq_enable();
> +
> +	/*
> +	 * Wait till the core of this HT is not in an unsafe state.
> +	 *
> +	 * Pair with smp_store_release() in sched_core_unsafe_exit().
> +	 */
> +	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
> +		cpu_relax();
> +		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
> +			restart = true;
> +			break;
> +		}
> +	}
> +
> +	/* Upgrade it back to the expectations of entry code. */
> +	local_irq_disable();
> +	preempt_enable();
> +
> +ret:
> +	if (!restart)
> +		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	return restart;
> +}
> +
> +/*
> + * Enter the core-wide IRQ state. Sibling will be paused if it is running
> + * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
> + * avoid sending useless IPIs is made. Must be called only from hard IRQ
> + * context.
> + */
> +void sched_core_unsafe_enter(void)
> +{
> +	const struct cpumask *smt_mask;
> +	unsigned long flags;
> +	struct rq *rq;
> +	int i, cpu;
> +
> +	if (!static_branch_likely(&sched_core_protect_kernel))
> +		return;
> +
> +	/* Ensure that on return to user/guest, we check whether to wait. */
> +	if (current->core_cookie)
> +		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
> +	rq->core_this_unsafe_nest++;
> +
> +	/* Should not nest: enter() should only pair with exit(). */
> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
> +		goto ret;
> +
> +	raw_spin_lock(rq_lockp(rq));
> +	smt_mask = cpu_smt_mask(cpu);
> +
> +	/* Contribute this CPU's unsafe_enter() to core-wide unsafe_enter() count. */
> +	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
> +
> +	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
> +		goto unlock;
> +
> +	if (irq_work_is_busy(&rq->core_irq_work)) {
> +		/*
> +		 * Do nothing more since we are in an IPI sent from another
> +		 * sibling to enforce safety. That sibling would have sent IPIs
> +		 * to all of the HTs.
> +		 */
> +		goto unlock;
> +	}
> +
> +	/*
> +	 * If we are not the first ones on the core to enter core-wide unsafe
> +	 * state, do nothing.
> +	 */
> +	if (rq->core->core_unsafe_nest > 1)
> +		goto unlock;
> +
> +	/* Do nothing more if the core is not tagged. */
> +	if (!rq->core->core_cookie)
> +		goto unlock;
> +
> +	for_each_cpu(i, smt_mask) {
> +		struct rq *srq = cpu_rq(i);
> +
> +		if (i == cpu || cpu_is_offline(i))
> +			continue;
> +
> +		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
> +			continue;
> +
> +		/* Skip if HT is not running a tagged task. */
> +		if (!srq->curr->core_cookie && !srq->core_pick)
> +			continue;
> +
> +		/*
> +		 * Force sibling into the kernel by IPI. If work was already
> +		 * pending, no new IPIs are sent. This is Ok since the receiver
> +		 * would already be in the kernel, or on its way to it.
> +		 */
> +		irq_work_queue_on(&srq->core_irq_work, i);
> +	}
> +unlock:
> +	raw_spin_unlock(rq_lockp(rq));
> +ret:
> +	local_irq_restore(flags);
> +}
> +
> +/*
> + * Process any work need for either exiting the core-wide unsafe state, or for
> + * waiting on this hyperthread if the core is still in this state.
> + *
> + * @idle: Are we called from the idle loop?
> + */
> +void sched_core_unsafe_exit(void)
> +{
> +	unsigned long flags;
> +	unsigned int nest;
> +	struct rq *rq;
> +	int cpu;
> +
> +	if (!static_branch_likely(&sched_core_protect_kernel))
> +		return;
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +
> +	/* Do nothing if core-sched disabled. */
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/*
> +	 * Can happen when a process is forked and the first return to user
> +	 * mode is a syscall exit. Either way, there's nothing to do.
> +	 */
> +	if (rq->core_this_unsafe_nest == 0)
> +		goto ret;
> +
> +	rq->core_this_unsafe_nest--;
> +
> +	/* enter() should be paired with exit() only. */
> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
> +		goto ret;
> +
> +	raw_spin_lock(rq_lockp(rq));
> +	/*
> +	 * Core-wide nesting counter can never be 0 because we are
> +	 * still in it on this CPU.
> +	 */
> +	nest = rq->core->core_unsafe_nest;
> +	WARN_ON_ONCE(!nest);
> +
> +	/* Pair with smp_load_acquire() in sched_core_wait_till_safe(). */
> +	smp_store_release(&rq->core->core_unsafe_nest, nest - 1);
> +	raw_spin_unlock(rq_lockp(rq));
> +ret:
> +	local_irq_restore(flags);
> +}
> +
>  // XXX fairness/fwd progress conditions
>  /*
>   * Returns
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f7e2d8a3be8e..4bcf3b1ddfb3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1059,12 +1059,15 @@ struct rq {
>  	unsigned int		core_enabled;
>  	unsigned int		core_sched_seq;
>  	struct rb_root		core_tree;
> +	struct irq_work		core_irq_work; /* To force HT into kernel */
> +	unsigned int		core_this_unsafe_nest;
>  
>  	/* shared state */
>  	unsigned int		core_task_seq;
>  	unsigned int		core_pick_seq;
>  	unsigned long		core_cookie;
>  	unsigned char		core_forceidle;
> +	unsigned int		core_unsafe_nest;
>  #endif
>  };
>  
> 



* Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-20  1:43 ` [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task() Joel Fernandes (Google)
@ 2020-10-22  7:59   ` Li, Aubrey
  2020-10-22 15:25     ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Li, Aubrey @ 2020-10-22  7:59 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Vineeth Remanan Pillai, Paul E. McKenney, Tim Chen

On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Because sched_class::pick_next_task() also implies
> sched_class::set_next_task() (and possibly put_prev_task() and
> newidle_balance) it is not state invariant. This makes it unsuitable
> for remote task selection.
> 
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/deadline.c  | 16 ++++++++++++++--
>  kernel/sched/fair.c      | 32 +++++++++++++++++++++++++++++++-
>  kernel/sched/idle.c      |  8 ++++++++
>  kernel/sched/rt.c        | 14 ++++++++++++--
>  kernel/sched/sched.h     |  3 +++
>  kernel/sched/stop_task.c | 13 +++++++++++--
>  6 files changed, 79 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 814ec49502b1..0271a7848ab3 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1848,7 +1848,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
>  	return rb_entry(left, struct sched_dl_entity, rb_node);
>  }
>  
> -static struct task_struct *pick_next_task_dl(struct rq *rq)
> +static struct task_struct *pick_task_dl(struct rq *rq)
>  {
>  	struct sched_dl_entity *dl_se;
>  	struct dl_rq *dl_rq = &rq->dl;
> @@ -1860,7 +1860,18 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
>  	dl_se = pick_next_dl_entity(rq, dl_rq);
>  	BUG_ON(!dl_se);
>  	p = dl_task_of(dl_se);
> -	set_next_task_dl(rq, p, true);
> +
> +	return p;
> +}
> +
> +static struct task_struct *pick_next_task_dl(struct rq *rq)
> +{
> +	struct task_struct *p;
> +
> +	p = pick_task_dl(rq);
> +	if (p)
> +		set_next_task_dl(rq, p, true);
> +
>  	return p;
>  }
>  
> @@ -2517,6 +2528,7 @@ const struct sched_class dl_sched_class
>  
>  #ifdef CONFIG_SMP
>  	.balance		= balance_dl,
> +	.pick_task		= pick_task_dl,
>  	.select_task_rq		= select_task_rq_dl,
>  	.migrate_task_rq	= migrate_task_rq_dl,
>  	.set_cpus_allowed       = set_cpus_allowed_dl,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dbd9368a959d..bd6aed63f5e3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4450,7 +4450,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	 * Avoid running the skip buddy, if running something else can
>  	 * be done without getting too unfair.
>  	 */
> -	if (cfs_rq->skip == se) {
> +	if (cfs_rq->skip && cfs_rq->skip == se) {
>  		struct sched_entity *second;
>  
>  		if (se == curr) {
> @@ -6976,6 +6976,35 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>  		set_last_buddy(se);
>  }
>  
> +#ifdef CONFIG_SMP
> +static struct task_struct *pick_task_fair(struct rq *rq)
> +{
> +	struct cfs_rq *cfs_rq = &rq->cfs;
> +	struct sched_entity *se;
> +
> +	if (!cfs_rq->nr_running)
> +		return NULL;
> +
> +	do {
> +		struct sched_entity *curr = cfs_rq->curr;
> +
> +		se = pick_next_entity(cfs_rq, NULL);
> +
> +		if (curr) {
> +			if (se && curr->on_rq)
> +				update_curr(cfs_rq);
> +
> +			if (!se || entity_before(curr, se))
> +				se = curr;
> +		}
> +
> +		cfs_rq = group_cfs_rq(se);
> +	} while (cfs_rq);
> +
> +	return task_of(se);
> +}
> +#endif

One of my machines hangs when I run uperf with only one message:
[  719.034962] BUG: kernel NULL pointer dereference, address: 0000000000000050

Then I replicated the problem on another machine (no serial console);
here is the stack, copied over manually.

Call Trace:
 pick_next_entity+0xb0/0x160
 pick_task_fair+0x4b/0x90
 __schedule+0x59b/0x12f0
 schedule_idle+0x1e/0x40
 do_idle+0x193/0x2d0
 cpu_startup_entry+0x19/0x20
 start_secondary+0x110/0x150
 secondary_startup_64_no_verify+0xa6/0xab
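
(If a vmlinux with debug info for that kernel is handy, the faulting source
line can usually be recovered with something like:)

	./scripts/faddr2line vmlinux pick_next_entity+0xb0/0x160
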

> +
>  struct task_struct *
>  pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
> @@ -11173,6 +11202,7 @@ const struct sched_class fair_sched_class
>  
>  #ifdef CONFIG_SMP
>  	.balance		= balance_fair,
> +	.pick_task		= pick_task_fair,
>  	.select_task_rq		= select_task_rq_fair,
>  	.migrate_task_rq	= migrate_task_rq_fair,
>  
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 8ce6e80352cf..ce7552c6bc65 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -405,6 +405,13 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
>  	schedstat_inc(rq->sched_goidle);
>  }
>  
> +#ifdef CONFIG_SMP
> +static struct task_struct *pick_task_idle(struct rq *rq)
> +{
> +	return rq->idle;
> +}
> +#endif
> +
>  struct task_struct *pick_next_task_idle(struct rq *rq)
>  {
>  	struct task_struct *next = rq->idle;
> @@ -472,6 +479,7 @@ const struct sched_class idle_sched_class
>  
>  #ifdef CONFIG_SMP
>  	.balance		= balance_idle,
> +	.pick_task		= pick_task_idle,
>  	.select_task_rq		= select_task_rq_idle,
>  	.set_cpus_allowed	= set_cpus_allowed_common,
>  #endif
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index e57fca05b660..a5851c775270 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1624,7 +1624,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
>  	return rt_task_of(rt_se);
>  }
>  
> -static struct task_struct *pick_next_task_rt(struct rq *rq)
> +static struct task_struct *pick_task_rt(struct rq *rq)
>  {
>  	struct task_struct *p;
>  
> @@ -1632,7 +1632,16 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
>  		return NULL;
>  
>  	p = _pick_next_task_rt(rq);
> -	set_next_task_rt(rq, p, true);
> +
> +	return p;
> +}
> +
> +static struct task_struct *pick_next_task_rt(struct rq *rq)
> +{
> +	struct task_struct *p = pick_task_rt(rq);
> +	if (p)
> +		set_next_task_rt(rq, p, true);
> +
>  	return p;
>  }
>  
> @@ -2443,6 +2452,7 @@ const struct sched_class rt_sched_class
>  
>  #ifdef CONFIG_SMP
>  	.balance		= balance_rt,
> +	.pick_task		= pick_task_rt,
>  	.select_task_rq		= select_task_rq_rt,
>  	.set_cpus_allowed       = set_cpus_allowed_common,
>  	.rq_online              = rq_online_rt,
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 587ebabebaff..54bfac702805 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1800,6 +1800,9 @@ struct sched_class {
>  
>  #ifdef CONFIG_SMP
>  	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
> +
> +	struct task_struct * (*pick_task)(struct rq *rq);
> +
>  	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
>  	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
>  
> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index 394bc8126a1e..8f92915dd95e 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -34,15 +34,23 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool fir
>  	stop->se.exec_start = rq_clock_task(rq);
>  }
>  
> -static struct task_struct *pick_next_task_stop(struct rq *rq)
> +static struct task_struct *pick_task_stop(struct rq *rq)
>  {
>  	if (!sched_stop_runnable(rq))
>  		return NULL;
>  
> -	set_next_task_stop(rq, rq->stop, true);
>  	return rq->stop;
>  }
>  
> +static struct task_struct *pick_next_task_stop(struct rq *rq)
> +{
> +	struct task_struct *p = pick_task_stop(rq);
> +	if (p)
> +		set_next_task_stop(rq, p, true);
> +
> +	return p;
> +}
> +
>  static void
>  enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
>  {
> @@ -124,6 +132,7 @@ const struct sched_class stop_sched_class
>  
>  #ifdef CONFIG_SMP
>  	.balance		= balance_stop,
> +	.pick_task		= pick_task_stop,
>  	.select_task_rq		= select_task_rq_stop,
>  	.set_cpus_allowed	= set_cpus_allowed_common,
>  #endif
> 



* Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-22  7:59   ` Li, Aubrey
@ 2020-10-22 15:25     ` Joel Fernandes
  2020-10-23  5:25       ` Li, Aubrey
  0 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes @ 2020-10-22 15:25 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Glexiner, LKML,
	Ingo Molnar, Linus Torvalds, Frederic Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Vineeth Remanan Pillai,
	Paul E. McKenney, Tim Chen

On Thu, Oct 22, 2020 at 12:59 AM Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>
> On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> >
> > Because sched_class::pick_next_task() also implies
> > sched_class::set_next_task() (and possibly put_prev_task() and
> > newidle_balance) it is not state invariant. This makes it unsuitable
> > for remote task selection.
> >
> > Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
> > Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  kernel/sched/deadline.c  | 16 ++++++++++++++--
> >  kernel/sched/fair.c      | 32 +++++++++++++++++++++++++++++++-
> >  kernel/sched/idle.c      |  8 ++++++++
> >  kernel/sched/rt.c        | 14 ++++++++++++--
> >  kernel/sched/sched.h     |  3 +++
> >  kernel/sched/stop_task.c | 13 +++++++++++--
> >  6 files changed, 79 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 814ec49502b1..0271a7848ab3 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -1848,7 +1848,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
> >       return rb_entry(left, struct sched_dl_entity, rb_node);
> >  }
> >
> > -static struct task_struct *pick_next_task_dl(struct rq *rq)
> > +static struct task_struct *pick_task_dl(struct rq *rq)
> >  {
> >       struct sched_dl_entity *dl_se;
> >       struct dl_rq *dl_rq = &rq->dl;
> > @@ -1860,7 +1860,18 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
> >       dl_se = pick_next_dl_entity(rq, dl_rq);
> >       BUG_ON(!dl_se);
> >       p = dl_task_of(dl_se);
> > -     set_next_task_dl(rq, p, true);
> > +
> > +     return p;
> > +}
> > +
> > +static struct task_struct *pick_next_task_dl(struct rq *rq)
> > +{
> > +     struct task_struct *p;
> > +
> > +     p = pick_task_dl(rq);
> > +     if (p)
> > +             set_next_task_dl(rq, p, true);
> > +
> >       return p;
> >  }
> >
> > @@ -2517,6 +2528,7 @@ const struct sched_class dl_sched_class
> >
> >  #ifdef CONFIG_SMP
> >       .balance                = balance_dl,
> > +     .pick_task              = pick_task_dl,
> >       .select_task_rq         = select_task_rq_dl,
> >       .migrate_task_rq        = migrate_task_rq_dl,
> >       .set_cpus_allowed       = set_cpus_allowed_dl,
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index dbd9368a959d..bd6aed63f5e3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4450,7 +4450,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >        * Avoid running the skip buddy, if running something else can
> >        * be done without getting too unfair.
> >        */
> > -     if (cfs_rq->skip == se) {
> > +     if (cfs_rq->skip && cfs_rq->skip == se) {
> >               struct sched_entity *second;
> >
> >               if (se == curr) {
> > @@ -6976,6 +6976,35 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> >               set_last_buddy(se);
> >  }
> >
> > +#ifdef CONFIG_SMP
> > +static struct task_struct *pick_task_fair(struct rq *rq)
> > +{
> > +     struct cfs_rq *cfs_rq = &rq->cfs;
> > +     struct sched_entity *se;
> > +
> > +     if (!cfs_rq->nr_running)
> > +             return NULL;
> > +
> > +     do {
> > +             struct sched_entity *curr = cfs_rq->curr;
> > +
> > +             se = pick_next_entity(cfs_rq, NULL);
> > +
> > +             if (curr) {
> > +                     if (se && curr->on_rq)
> > +                             update_curr(cfs_rq);
> > +
> > +                     if (!se || entity_before(curr, se))
> > +                             se = curr;
> > +             }
> > +
> > +             cfs_rq = group_cfs_rq(se);
> > +     } while (cfs_rq);
> > +
> > +     return task_of(se);
> > +}
> > +#endif
>
> One of my machines hangs when I run uperf with only one message:
> [  719.034962] BUG: kernel NULL pointer dereference, address: 0000000000000050
>
> Then I replicated the problem on another machine (no serial console);
> here is the stack, copied over manually.
>
> Call Trace:
>  pick_next_entity+0xb0/0x160
>  pick_task_fair+0x4b/0x90
>  __schedule+0x59b/0x12f0
>  schedule_idle+0x1e/0x40
>  do_idle+0x193/0x2d0
>  cpu_startup_entry+0x19/0x20
>  start_secondary+0x110/0x150
>  secondary_startup_64_no_verify+0xa6/0xab

Interesting. Wondering if we screwed something up in the rebase.

Questions:
1. Does the issue happen if you apply only up until this patch,
or with the entire series?
2. Do you see the issue in v7? Not much if at all has changed in this
part of the code from v7 -> v8 but could be something in the newer
kernel.

We tested this series heavily after the rebase, so it is indeed strange to
see this so late.

 - Joel


>
> > +
> >  struct task_struct *
> >  pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  {
> > @@ -11173,6 +11202,7 @@ const struct sched_class fair_sched_class
> >
> >  #ifdef CONFIG_SMP
> >       .balance                = balance_fair,
> > +     .pick_task              = pick_task_fair,
> >       .select_task_rq         = select_task_rq_fair,
> >       .migrate_task_rq        = migrate_task_rq_fair,
> >
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index 8ce6e80352cf..ce7552c6bc65 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -405,6 +405,13 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
> >       schedstat_inc(rq->sched_goidle);
> >  }
> >
> > +#ifdef CONFIG_SMP
> > +static struct task_struct *pick_task_idle(struct rq *rq)
> > +{
> > +     return rq->idle;
> > +}
> > +#endif
> > +
> >  struct task_struct *pick_next_task_idle(struct rq *rq)
> >  {
> >       struct task_struct *next = rq->idle;
> > @@ -472,6 +479,7 @@ const struct sched_class idle_sched_class
> >
> >  #ifdef CONFIG_SMP
> >       .balance                = balance_idle,
> > +     .pick_task              = pick_task_idle,
> >       .select_task_rq         = select_task_rq_idle,
> >       .set_cpus_allowed       = set_cpus_allowed_common,
> >  #endif
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index e57fca05b660..a5851c775270 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1624,7 +1624,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
> >       return rt_task_of(rt_se);
> >  }
> >
> > -static struct task_struct *pick_next_task_rt(struct rq *rq)
> > +static struct task_struct *pick_task_rt(struct rq *rq)
> >  {
> >       struct task_struct *p;
> >
> > @@ -1632,7 +1632,16 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
> >               return NULL;
> >
> >       p = _pick_next_task_rt(rq);
> > -     set_next_task_rt(rq, p, true);
> > +
> > +     return p;
> > +}
> > +
> > +static struct task_struct *pick_next_task_rt(struct rq *rq)
> > +{
> > +     struct task_struct *p = pick_task_rt(rq);
> > +     if (p)
> > +             set_next_task_rt(rq, p, true);
> > +
> >       return p;
> >  }
> >
> > @@ -2443,6 +2452,7 @@ const struct sched_class rt_sched_class
> >
> >  #ifdef CONFIG_SMP
> >       .balance                = balance_rt,
> > +     .pick_task              = pick_task_rt,
> >       .select_task_rq         = select_task_rq_rt,
> >       .set_cpus_allowed       = set_cpus_allowed_common,
> >       .rq_online              = rq_online_rt,
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 587ebabebaff..54bfac702805 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1800,6 +1800,9 @@ struct sched_class {
> >
> >  #ifdef CONFIG_SMP
> >       int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
> > +
> > +     struct task_struct * (*pick_task)(struct rq *rq);
> > +
> >       int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
> >       void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
> >
> > diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> > index 394bc8126a1e..8f92915dd95e 100644
> > --- a/kernel/sched/stop_task.c
> > +++ b/kernel/sched/stop_task.c
> > @@ -34,15 +34,23 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool fir
> >       stop->se.exec_start = rq_clock_task(rq);
> >  }
> >
> > -static struct task_struct *pick_next_task_stop(struct rq *rq)
> > +static struct task_struct *pick_task_stop(struct rq *rq)
> >  {
> >       if (!sched_stop_runnable(rq))
> >               return NULL;
> >
> > -     set_next_task_stop(rq, rq->stop, true);
> >       return rq->stop;
> >  }
> >
> > +static struct task_struct *pick_next_task_stop(struct rq *rq)
> > +{
> > +     struct task_struct *p = pick_task_stop(rq);
> > +     if (p)
> > +             set_next_task_stop(rq, p, true);
> > +
> > +     return p;
> > +}
> > +
> >  static void
> >  enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
> >  {
> > @@ -124,6 +132,7 @@ const struct sched_class stop_sched_class
> >
> >  #ifdef CONFIG_SMP
> >       .balance                = balance_stop,
> > +     .pick_task              = pick_task_stop,
> >       .select_task_rq         = select_task_rq_stop,
> >       .set_cpus_allowed       = set_cpus_allowed_common,
> >  #endif
> >
>


* Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-22 15:25     ` Joel Fernandes
@ 2020-10-23  5:25       ` Li, Aubrey
  2020-10-23 21:47         ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Li, Aubrey @ 2020-10-23  5:25 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Glexiner, LKML,
	Ingo Molnar, Linus Torvalds, Frederic Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Vineeth Remanan Pillai,
	Paul E. McKenney, Tim Chen, Ning, Hongyu

On 2020/10/22 23:25, Joel Fernandes wrote:
> On Thu, Oct 22, 2020 at 12:59 AM Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>>
>> On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
>>> From: Peter Zijlstra <peterz@infradead.org>
>>>
>>> Because sched_class::pick_next_task() also implies
>>> sched_class::set_next_task() (and possibly put_prev_task() and
>>> newidle_balance) it is not state invariant. This makes it unsuitable
>>> for remote task selection.
>>>
>>> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
>>> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
>>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>>> ---
>>>  kernel/sched/deadline.c  | 16 ++++++++++++++--
>>>  kernel/sched/fair.c      | 32 +++++++++++++++++++++++++++++++-
>>>  kernel/sched/idle.c      |  8 ++++++++
>>>  kernel/sched/rt.c        | 14 ++++++++++++--
>>>  kernel/sched/sched.h     |  3 +++
>>>  kernel/sched/stop_task.c | 13 +++++++++++--
>>>  6 files changed, 79 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>>> index 814ec49502b1..0271a7848ab3 100644
>>> --- a/kernel/sched/deadline.c
>>> +++ b/kernel/sched/deadline.c
>>> @@ -1848,7 +1848,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
>>>       return rb_entry(left, struct sched_dl_entity, rb_node);
>>>  }
>>>
>>> -static struct task_struct *pick_next_task_dl(struct rq *rq)
>>> +static struct task_struct *pick_task_dl(struct rq *rq)
>>>  {
>>>       struct sched_dl_entity *dl_se;
>>>       struct dl_rq *dl_rq = &rq->dl;
>>> @@ -1860,7 +1860,18 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
>>>       dl_se = pick_next_dl_entity(rq, dl_rq);
>>>       BUG_ON(!dl_se);
>>>       p = dl_task_of(dl_se);
>>> -     set_next_task_dl(rq, p, true);
>>> +
>>> +     return p;
>>> +}
>>> +
>>> +static struct task_struct *pick_next_task_dl(struct rq *rq)
>>> +{
>>> +     struct task_struct *p;
>>> +
>>> +     p = pick_task_dl(rq);
>>> +     if (p)
>>> +             set_next_task_dl(rq, p, true);
>>> +
>>>       return p;
>>>  }
>>>
>>> @@ -2517,6 +2528,7 @@ const struct sched_class dl_sched_class
>>>
>>>  #ifdef CONFIG_SMP
>>>       .balance                = balance_dl,
>>> +     .pick_task              = pick_task_dl,
>>>       .select_task_rq         = select_task_rq_dl,
>>>       .migrate_task_rq        = migrate_task_rq_dl,
>>>       .set_cpus_allowed       = set_cpus_allowed_dl,
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index dbd9368a959d..bd6aed63f5e3 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4450,7 +4450,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>        * Avoid running the skip buddy, if running something else can
>>>        * be done without getting too unfair.
>>>        */
>>> -     if (cfs_rq->skip == se) {
>>> +     if (cfs_rq->skip && cfs_rq->skip == se) {
>>>               struct sched_entity *second;
>>>
>>>               if (se == curr) {
>>> @@ -6976,6 +6976,35 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>>>               set_last_buddy(se);
>>>  }
>>>
>>> +#ifdef CONFIG_SMP
>>> +static struct task_struct *pick_task_fair(struct rq *rq)
>>> +{
>>> +     struct cfs_rq *cfs_rq = &rq->cfs;
>>> +     struct sched_entity *se;
>>> +
>>> +     if (!cfs_rq->nr_running)
>>> +             return NULL;
>>> +
>>> +     do {
>>> +             struct sched_entity *curr = cfs_rq->curr;
>>> +
>>> +             se = pick_next_entity(cfs_rq, NULL);
>>> +
>>> +             if (curr) {
>>> +                     if (se && curr->on_rq)
>>> +                             update_curr(cfs_rq);
>>> +
>>> +                     if (!se || entity_before(curr, se))
>>> +                             se = curr;
>>> +             }
>>> +
>>> +             cfs_rq = group_cfs_rq(se);
>>> +     } while (cfs_rq);
>>> +
>>> +     return task_of(se);
>>> +}
>>> +#endif
>>
>> One of my machines hangs when I run uperf with only one message:
>> [  719.034962] BUG: kernel NULL pointer dereference, address: 0000000000000050
>>
>> Then I replicated the problem on another machine (no serial console);
>> here is the stack, copied over manually.
>>
>> Call Trace:
>>  pick_next_entity+0xb0/0x160
>>  pick_task_fair+0x4b/0x90
>>  __schedule+0x59b/0x12f0
>>  schedule_idle+0x1e/0x40
>>  do_idle+0x193/0x2d0
>>  cpu_startup_entry+0x19/0x20
>>  start_secondary+0x110/0x150
>>  secondary_startup_64_no_verify+0xa6/0xab
> 
> Interesting. Wondering if we screwed something up in the rebase.
> 
> Questions:
> 1. Does the issue happen if you apply only up until this patch,
> or with the entire series?

I applied the entire series and just picked the most closely related patch
to report the issue against.

> 2. Do you see the issue in v7? Not much if at all has changed in this
> part of the code from v7 -> v8 but could be something in the newer
> kernel.
> 

IIRC, I could run uperf successfully on v7.
I'm on tip/master 2d3e8c9424c9 (origin/master) "Merge branch 'linus'."
Please let me know if this is a problem, or if you have a repo I can pull
for testing.
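
(For reference, that base should be reproducible with something along the
lines of:)

	git fetch https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
	git checkout 2d3e8c9424c9
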

> We tested this series heavily after the rebase, so it is indeed strange to
> see this so late.
Cc Hongyu - Maybe we can run the test cases we have on hand before the next release.

Thanks,
-Aubrey


* Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.
  2020-10-20  1:43 ` [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling Joel Fernandes (Google)
@ 2020-10-23 13:51   ` Peter Zijlstra
  2020-10-23 13:54     ` Peter Zijlstra
  2020-10-23 15:05   ` Peter Zijlstra
  1 sibling, 1 reply; 98+ messages in thread
From: Peter Zijlstra @ 2020-10-23 13:51 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aaron Lu, Aubrey Li, Paul E. McKenney, Tim Chen

On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> +			/*
> +			 * If this sibling doesn't yet have a suitable task to
> +			 * run; ask for the most elegible task, given the
> +			 * highest priority task already selected for this
> +			 * core.
> +			 */
> +			p = pick_task(rq_i, class, max);
> +			if (!p) {
> +				/*
> +				 * If there weren't no cookies; we don't need to
> +				 * bother with the other siblings.
> +				 * If the rest of the core is not running a tagged
> +				 * task, i.e.  need_sync == 0, and the current CPU
> +				 * which called into the schedule() loop does not
> +				 * have any tasks for this class, skip selecting for
> +				 * other siblings since there's no point. We don't skip
> +				 * for RT/DL because that could make CFS force-idle RT.
> +				 */
> +				if (i == cpu && !need_sync && class == &fair_sched_class)
> +					goto next_class;
> +
> +				continue;
> +			}

I'm failing to understand the class == &fair_sched_class bit.

IIRC the condition is such that the core doesn't have a cookie (we don't
need to sync the threads) so we'll only do a pick for our local CPU.

That should be invariant of class.


* Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.
  2020-10-23 13:51   ` Peter Zijlstra
@ 2020-10-23 13:54     ` Peter Zijlstra
  2020-10-23 17:57       ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Peter Zijlstra @ 2020-10-23 13:54 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aaron Lu, Aubrey Li, Paul E. McKenney, Tim Chen

On Fri, Oct 23, 2020 at 03:51:29PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> > +			/*
> > +			 * If this sibling doesn't yet have a suitable task to
> > +			 * run; ask for the most elegible task, given the
> > +			 * highest priority task already selected for this
> > +			 * core.
> > +			 */
> > +			p = pick_task(rq_i, class, max);
> > +			if (!p) {
> > +				/*
> > +				 * If there weren't no cookies; we don't need to
> > +				 * bother with the other siblings.
> > +				 * If the rest of the core is not running a tagged
> > +				 * task, i.e.  need_sync == 0, and the current CPU
> > +				 * which called into the schedule() loop does not
> > +				 * have any tasks for this class, skip selecting for
> > +				 * other siblings since there's no point. We don't skip
> > +				 * for RT/DL because that could make CFS force-idle RT.
> > +				 */
> > +				if (i == cpu && !need_sync && class == &fair_sched_class)
> > +					goto next_class;
> > +
> > +				continue;
> > +			}
> 
> I'm failing to understand the class == &fair_sched_class bit.
> 
> IIRC the condition is such that the core doesn't have a cookie (we don't
> need to sync the threads) so we'll only do a pick for our local CPU.
> 
> That should be invariant of class.

That is; it should be the exact counterpart of this bit:

> +			/*
> +			 * Optimize the 'normal' case where there aren't any
> +			 * cookies and we don't need to sync up.
> +			 */
> +			if (i == cpu && !need_sync && !p->core_cookie) {
> +				next = p;
> +				goto done;
> +			}

If there is no task found in this class, try the next class; if there
is, we're done.


* Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.
  2020-10-20  1:43 ` [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling Joel Fernandes (Google)
  2020-10-23 13:51   ` Peter Zijlstra
@ 2020-10-23 15:05   ` Peter Zijlstra
  2020-10-23 17:59     ` Joel Fernandes
  1 sibling, 1 reply; 98+ messages in thread
From: Peter Zijlstra @ 2020-10-23 15:05 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aaron Lu, Aubrey Li, Paul E. McKenney, Tim Chen

On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective which logical
> CPU does the reschedule).

This:

> 
> During a CPU hotplug event, schedule would be called with the hotplugged
> CPU not in the cpumask. So use for_each_cpu(_wrap)_or to include the
> current cpu in the task pick loop.
> 
> There are multiple loops in pick_next_task that iterate over CPUs in
> smt_mask. During a hotplug event, sibling could be removed from the
> smt_mask while pick_next_task is running. So we cannot trust the mask
> across the different loops. This can confuse the logic. Add a retry logic
> if smt_mask changes between the loops.

isn't entirely accurate anymore, is it?


* Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.
  2020-10-23 13:54     ` Peter Zijlstra
@ 2020-10-23 17:57       ` Joel Fernandes
  2020-10-23 19:26         ` Peter Zijlstra
  0 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes @ 2020-10-23 17:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aaron Lu, Aubrey Li, Paul E. McKenney, Tim Chen

On Fri, Oct 23, 2020 at 03:54:00PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 03:51:29PM +0200, Peter Zijlstra wrote:
> > On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> > > +			/*
> > > +			 * If this sibling doesn't yet have a suitable task to
> > > +			 * run; ask for the most elegible task, given the
> > > +			 * highest priority task already selected for this
> > > +			 * core.
> > > +			 */
> > > +			p = pick_task(rq_i, class, max);
> > > +			if (!p) {
> > > +				/*
> > > +				 * If there weren't no cookies; we don't need to
> > > +				 * bother with the other siblings.
> > > +				 * If the rest of the core is not running a tagged
> > > +				 * task, i.e.  need_sync == 0, and the current CPU
> > > +				 * which called into the schedule() loop does not
> > > +				 * have any tasks for this class, skip selecting for
> > > +				 * other siblings since there's no point. We don't skip
> > > +				 * for RT/DL because that could make CFS force-idle RT.
> > > +				 */
> > > +				if (i == cpu && !need_sync && class == &fair_sched_class)
> > > +					goto next_class;
> > > +
> > > +				continue;
> > > +			}
> > 
> > I'm failing to understand the class == &fair_sched_class bit.

The last line in the comment explains it "We don't skip for RT/DL because
that could make CFS force-idle RT.".

Even if need_sync == false, we need to go look at other CPUs (non-local
CPUs) to see if they could be running RT.

Say the RQs in a particular core look like this:
Let CFS1 and CFS2 be two tagged CFS tasks. Let RT1 be an untagged RT task.

rq0	       rq1
CFS1 (tagged)  RT1 (untagged)
CFS2 (tagged)

Say schedule() runs on rq0. Now, it will enter the above loop and
pick_task(RT) will return NULL for 'p'. It will enter the above if() block
and see that need_sync == false and will skip RT entirely.

The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
rq0		rq1
CFS1		IDLE

When it should have selected:
rq0		rq1
IDLE		RT

I saw this issue in real-world use cases on ChromeOS where an RT task gets
constantly force-idled and breaks RT. The "class == &fair_sched_class" bit
cures it.
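
To make the scenario concrete, here is a small stand-alone sketch (plain
userspace C, not kernel code; the names pick_core_wide and sim_rq are made
up, and the final core-wide priority arbitration that would actually leave
rq0 idle is not modeled). It only shows the "is this class even considered
on the siblings" part of the loop:

#include <stdio.h>
#include <stdbool.h>

enum sched_class_id { RT_CLASS, FAIR_CLASS, NR_CLASSES };

/* One entry per sibling: does this runqueue have a runnable task of 'class'? */
struct sim_rq { bool has_task[NR_CLASSES]; };

static void pick_core_wide(struct sim_rq *rqs, int nr_cpus, int cpu,
			   bool need_sync, bool restrict_to_fair)
{
	for (int class = 0; class < NR_CLASSES; class++) {
		for (int i = 0; i < nr_cpus; i++) {
			if (!rqs[i].has_task[class]) {
				/*
				 * The early bail-out. Without the fair-class
				 * restriction, RT_CLASS is skipped as soon as
				 * the local CPU has no RT task, so rq1's RT1
				 * is never even considered.
				 */
				if (i == cpu && !need_sync &&
				    (!restrict_to_fair || class == FAIR_CLASS))
					goto next_class;
				continue;
			}
			printf("  %s: found a candidate on rq%d\n",
			       class == RT_CLASS ? "RT " : "CFS", i);
		}
next_class:;
	}
}

int main(void)
{
	/* rq0 holds the tagged CFS tasks, rq1 holds the untagged RT1. */
	struct sim_rq rqs[2] = {
		{ .has_task = { [RT_CLASS] = false, [FAIR_CLASS] = true  } },
		{ .has_task = { [RT_CLASS] = true,  [FAIR_CLASS] = false } },
	};

	printf("skip not restricted to CFS (RT1 never considered):\n");
	pick_core_wide(rqs, 2, 0, false, false);

	printf("skip restricted to CFS (RT1 is considered on rq1):\n");
	pick_core_wide(rqs, 2, 0, false, true);
	return 0;
}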

> > > +                          * for RT/DL because that could make CFS force-idle RT.
> > IIRC the condition is such that the core doesn't have a cookie (we don't
> > need to sync the threads) so we'll only do a pick for our local CPU.
> > 
> > That should be invariant of class.
> 
> That is; it should be the exact counterpart of this bit:
> 
> > +			/*
> > +			 * Optimize the 'normal' case where there aren't any
> > +			 * cookies and we don't need to sync up.
> > +			 */
> > +			if (i == cpu && !need_sync && !p->core_cookie) {
> > +				next = p;
> > +				goto done;
> > +			}
> 
> If there is no task found in this class, try the next class; if there
> is, we're done.

That's Ok. But we cannot skip RT class on other CPUs.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.
  2020-10-23 15:05   ` Peter Zijlstra
@ 2020-10-23 17:59     ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-10-23 17:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aaron Lu, Aubrey Li, Paul E. McKenney, Tim Chen

On Fri, Oct 23, 2020 at 05:05:44PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > Instead of only selecting a local task, select a task for all SMT
> > siblings for every reschedule on the core (irrespective which logical
> > CPU does the reschedule).
> 
> This:
> 
> > 
> > During a CPU hotplug event, schedule would be called with the hotplugged
> > CPU not in the cpumask. So use for_each_cpu(_wrap)_or to include the
> > current cpu in the task pick loop.
> > 
> > There are multiple loops in pick_next_task that iterate over CPUs in
> > smt_mask. During a hotplug event, sibling could be removed from the
> > smt_mask while pick_next_task is running. So we cannot trust the mask
> > across the different loops. This can confuse the logic. Add a retry logic
> > if smt_mask changes between the loops.
> 
> isn't entirely accurate anymore, is it?

Yes, you are right; we need to delete this bit from the changelog. :-(. I'll
go do that.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.
  2020-10-23 17:57       ` Joel Fernandes
@ 2020-10-23 19:26         ` Peter Zijlstra
  2020-10-23 21:31           ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Peter Zijlstra @ 2020-10-23 19:26 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aaron Lu, Aubrey Li, Paul E. McKenney, Tim Chen

On Fri, Oct 23, 2020 at 01:57:24PM -0400, Joel Fernandes wrote:
> On Fri, Oct 23, 2020 at 03:54:00PM +0200, Peter Zijlstra wrote:
> > On Fri, Oct 23, 2020 at 03:51:29PM +0200, Peter Zijlstra wrote:
> > > On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> > > > +			/*
> > > > +			 * If this sibling doesn't yet have a suitable task to
> > > > +			 * run; ask for the most elegible task, given the
> > > > +			 * highest priority task already selected for this
> > > > +			 * core.
> > > > +			 */
> > > > +			p = pick_task(rq_i, class, max);
> > > > +			if (!p) {
> > > > +				/*
> > > > +				 * If there weren't no cookies; we don't need to
> > > > +				 * bother with the other siblings.
> > > > +				 * If the rest of the core is not running a tagged
> > > > +				 * task, i.e.  need_sync == 0, and the current CPU
> > > > +				 * which called into the schedule() loop does not
> > > > +				 * have any tasks for this class, skip selecting for
> > > > +				 * other siblings since there's no point. We don't skip
> > > > +				 * for RT/DL because that could make CFS force-idle RT.
> > > > +				 */
> > > > +				if (i == cpu && !need_sync && class == &fair_sched_class)
> > > > +					goto next_class;
> > > > +
> > > > +				continue;
> > > > +			}
> > > 
> > > I'm failing to understand the class == &fair_sched_class bit.
> 
> The last line in the comment explains it "We don't skip for RT/DL because
> that could make CFS force-idle RT.".

Well, yes, but it does not explain how this can come about, now does it.

> Even if need_sync == false, we need to go look at other CPUs (non-local
> CPUs) to see if they could be running RT.
> 
> Say the RQs in a particular core look like this:
> Let CFS1 and CFS2 be two tagged CFS tasks. Let RT1 be an untagged RT task.
> 
> rq0	       rq1
> CFS1 (tagged)  RT1 (untagged)
> CFS2 (tagged)
> 
> Say schedule() runs on rq0. Now, it will enter the above loop and
> pick_task(RT) will return NULL for 'p'. It will enter the above if() block
> and see that need_sync == false and will skip RT entirely.
> 
> The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> rq0		rq1
> CFS1		IDLE
> 
> When it should have selected:
> rq0		rq1
> IDLE		RT
> 
> I saw this issue in real-world use cases on ChromeOS where an RT task gets
> constantly force-idled and breaks RT. The "class == &fair_sched_class" bit
> cures it.

Ah, I see. The thing is, this loses the optimization for a bunch of
valid (and arguably common) scenarios. The problem is that the moment we
end up selecting a task with a cookie we've invalidated the premise
under which we ended up with the selected task.

How about this then?

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4709,6 +4709,7 @@ pick_next_task(struct rq *rq, struct tas
 	need_sync = !!rq->core->core_cookie;

 	/* reset state */
+reset:
 	rq->core->core_cookie = 0UL;
 	for_each_cpu(i, smt_mask) {
 		struct rq *rq_i = cpu_rq(i);
@@ -4748,14 +4749,8 @@ pick_next_task(struct rq *rq, struct tas
 				/*
 				 * If there weren't no cookies; we don't need to
 				 * bother with the other siblings.
-				 * If the rest of the core is not running a tagged
-				 * task, i.e.  need_sync == 0, and the current CPU
-				 * which called into the schedule() loop does not
-				 * have any tasks for this class, skip selecting for
-				 * other siblings since there's no point. We don't skip
-				 * for RT/DL because that could make CFS force-idle RT.
 				 */
-				if (i == cpu && !need_sync && !p->core_cookie)
+				if (i == cpu && !need_sync)
 					goto next_class;

 				continue;
@@ -4765,7 +4760,17 @@ pick_next_task(struct rq *rq, struct tas
 			 * Optimize the 'normal' case where there aren't any
 			 * cookies and we don't need to sync up.
 			 */
-			if (i == cpu && !need_sync && !p->core_cookie) {
+			if (i == cpu && !need_sync) {
+				if (p->core_cookie) {
+					/*
+					 * This optimization is only valid as
+					 * long as there are no cookies
+					 * involved.
+					 */
+					need_sync = true;
+					goto reset;
+				}
+
 				next = p;
 				goto done;
 			}
@@ -4805,7 +4810,6 @@ pick_next_task(struct rq *rq, struct tas
 					 */
 					need_sync = true;
 				}
-
 			}
 		}
 next_class:;


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.
  2020-10-23 19:26         ` Peter Zijlstra
@ 2020-10-23 21:31           ` Joel Fernandes
  2020-10-26  8:28             ` Peter Zijlstra
  2020-10-26  9:31             ` Peter Zijlstra
  0 siblings, 2 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-10-23 21:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aaron Lu, Aubrey Li, Paul E. McKenney, Tim Chen

On Fri, Oct 23, 2020 at 09:26:54PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 01:57:24PM -0400, Joel Fernandes wrote:
> > On Fri, Oct 23, 2020 at 03:54:00PM +0200, Peter Zijlstra wrote:
> > > On Fri, Oct 23, 2020 at 03:51:29PM +0200, Peter Zijlstra wrote:
> > > > On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> > > > > +			/*
> > > > > +			 * If this sibling doesn't yet have a suitable task to
> > > > > +			 * run; ask for the most elegible task, given the
> > > > > +			 * highest priority task already selected for this
> > > > > +			 * core.
> > > > > +			 */
> > > > > +			p = pick_task(rq_i, class, max);
> > > > > +			if (!p) {
> > > > > +				/*
> > > > > +				 * If there weren't no cookies; we don't need to
> > > > > +				 * bother with the other siblings.
> > > > > +				 * If the rest of the core is not running a tagged
> > > > > +				 * task, i.e.  need_sync == 0, and the current CPU
> > > > > +				 * which called into the schedule() loop does not
> > > > > +				 * have any tasks for this class, skip selecting for
> > > > > +				 * other siblings since there's no point. We don't skip
> > > > > +				 * for RT/DL because that could make CFS force-idle RT.
> > > > > +				 */
> > > > > +				if (i == cpu && !need_sync && class == &fair_sched_class)
> > > > > +					goto next_class;
> > > > > +
> > > > > +				continue;
> > > > > +			}
> > > > 
> > > > I'm failing to understand the class == &fair_sched_class bit.
> > 
> > The last line in the comment explains it "We don't skip for RT/DL because
> > that could make CFS force-idle RT.".
> 
> Well, yes, but it does not explain how this can come about, now does it.

Sorry, I should have made it a separate commit with the below explanation. Oh
well, live and learn!

> > Even if need_sync == false, we need to go look at other CPUs (non-local
> > CPUs) to see if they could be running RT.
> > 
> > Say the RQs in a particular core look like this:
> > Let CFS1 and CFS2 be two tagged CFS tasks. Let RT1 be an untagged RT task.
> > 
> > rq0	       rq1
> > CFS1 (tagged)  RT1 (untagged)
> > CFS2 (tagged)
> > 
> > Say schedule() runs on rq0. Now, it will enter the above loop and
> > pick_task(RT) will return NULL for 'p'. It will enter the above if() block
> > and see that need_sync == false and will skip RT entirely.
> > 
> > The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> > rq0		rq1
> > CFS1		IDLE
> > 
> > When it should have selected:
> > rq0		rq1
> > IDLE		RT
> > 
> > I saw this issue in real-world use cases on ChromeOS where an RT task gets
> > constantly force-idled and breaks RT. The "class == &fair_sched_class" bit
> > cures it.
> 
> Ah, I see. The thing is, this loses the optimization for a bunch of
> valid (and arguably common) scenarios. The problem is that the moment we
> end up selecting a task with a cookie we've invalidated the premise
> under which we ended up with the selected task.
> 
> How about this then?

This does look better. It makes sense and I think it will work. I will look
more into it and also test it.

BTW, as a further optimization in the future, isn't it better for the
schedule() loop on one HT to select for all HTs *even if* need_sync == false
to begin with, i.e., when no cookied tasks are runnable?

That way the pick loop in schedule() running on other HTs can directly pick
what was pre-selected for it via:
        if (rq->core->core_pick_seq == rq->core->core_task_seq &&
            rq->core->core_pick_seq != rq->core_sched_seq &&
            rq->core_pick)
.. which I think is more efficient. It's just a thought and may not be worth doing.
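
For reference, a rough sketch of that sibling-side fast path (the field
names come from the condition quoted above; the helper name is made up,
this is illustrative only, and locking is ignored):

static struct task_struct *sibling_fast_pick(struct rq *rq)
{
	/*
	 * A prior core-wide selection is still valid (core_pick_seq matches
	 * core_task_seq) and this CPU has not yet consumed it
	 * (core_pick_seq != core_sched_seq), so take the pre-selected task
	 * instead of redoing the full core-wide pick.
	 */
	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
	    rq->core->core_pick_seq != rq->core_sched_seq &&
	    rq->core_pick)
		return rq->core_pick;

	return NULL;	/* fall back to the full core-wide selection */
}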

thanks,

 - Joel


> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4709,6 +4709,7 @@ pick_next_task(struct rq *rq, struct tas
>  	need_sync = !!rq->core->core_cookie;
> 
>  	/* reset state */
> +reset:
>  	rq->core->core_cookie = 0UL;
>  	for_each_cpu(i, smt_mask) {
>  		struct rq *rq_i = cpu_rq(i);
> @@ -4748,14 +4749,8 @@ pick_next_task(struct rq *rq, struct tas
>  				/*
>  				 * If there weren't no cookies; we don't need to
>  				 * bother with the other siblings.
> -				 * If the rest of the core is not running a tagged
> -				 * task, i.e.  need_sync == 0, and the current CPU
> -				 * which called into the schedule() loop does not
> -				 * have any tasks for this class, skip selecting for
> -				 * other siblings since there's no point. We don't skip
> -				 * for RT/DL because that could make CFS force-idle RT.
>  				 */
> -				if (i == cpu && !need_sync && !p->core_cookie)
> +				if (i == cpu && !need_sync)
>  					goto next_class;
> 
>  				continue;
> @@ -4765,7 +4760,17 @@ pick_next_task(struct rq *rq, struct tas
>  			 * Optimize the 'normal' case where there aren't any
>  			 * cookies and we don't need to sync up.
>  			 */
> -			if (i == cpu && !need_sync && !p->core_cookie) {
> +			if (i == cpu && !need_sync) {
> +				if (p->core_cookie) {
> +					/*
> +					 * This optimization is only valid as
> +					 * long as there are no cookies
> +					 * involved.
> +					 */
> +					need_sync = true;
> +					goto reset;
> +				}
> +
>  				next = p;
>  				goto done;
>  			}
> @@ -4805,7 +4810,6 @@ pick_next_task(struct rq *rq, struct tas
>  					 */
>  					need_sync = true;
>  				}
> -
>  			}
>  		}
>  next_class:;
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-23  5:25       ` Li, Aubrey
@ 2020-10-23 21:47         ` Joel Fernandes
  2020-10-24  2:48           ` Li, Aubrey
  0 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes @ 2020-10-23 21:47 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Glexiner, LKML,
	Ingo Molnar, Linus Torvalds, Frederic Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Vineeth Remanan Pillai,
	Paul E. McKenney, Tim Chen, Ning, Hongyu

On Fri, Oct 23, 2020 at 01:25:38PM +0800, Li, Aubrey wrote:
> >>> @@ -2517,6 +2528,7 @@ const struct sched_class dl_sched_class
> >>>
> >>>  #ifdef CONFIG_SMP
> >>>       .balance                = balance_dl,
> >>> +     .pick_task              = pick_task_dl,
> >>>       .select_task_rq         = select_task_rq_dl,
> >>>       .migrate_task_rq        = migrate_task_rq_dl,
> >>>       .set_cpus_allowed       = set_cpus_allowed_dl,
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index dbd9368a959d..bd6aed63f5e3 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -4450,7 +4450,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >>>        * Avoid running the skip buddy, if running something else can
> >>>        * be done without getting too unfair.
> >>>        */
> >>> -     if (cfs_rq->skip == se) {
> >>> +     if (cfs_rq->skip && cfs_rq->skip == se) {
> >>>               struct sched_entity *second;
> >>>
> >>>               if (se == curr) {
> >>> @@ -6976,6 +6976,35 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> >>>               set_last_buddy(se);
> >>>  }
> >>>
> >>> +#ifdef CONFIG_SMP
> >>> +static struct task_struct *pick_task_fair(struct rq *rq)
> >>> +{
> >>> +     struct cfs_rq *cfs_rq = &rq->cfs;
> >>> +     struct sched_entity *se;
> >>> +
> >>> +     if (!cfs_rq->nr_running)
> >>> +             return NULL;
> >>> +
> >>> +     do {
> >>> +             struct sched_entity *curr = cfs_rq->curr;
> >>> +
> >>> +             se = pick_next_entity(cfs_rq, NULL);
> >>> +
> >>> +             if (curr) {
> >>> +                     if (se && curr->on_rq)
> >>> +                             update_curr(cfs_rq);
> >>> +
> >>> +                     if (!se || entity_before(curr, se))
> >>> +                             se = curr;
> >>> +             }
> >>> +
> >>> +             cfs_rq = group_cfs_rq(se);
> >>> +     } while (cfs_rq);
> >>> ++
> >>> +     return task_of(se);
> >>> +}
> >>> +#endif
> >>
> >> One of my machines hangs when I run uperf with only one message:
> >> [  719.034962] BUG: kernel NULL pointer dereference, address: 0000000000000050
> >>
>> Then I replicated the problem on another machine (no serial console);
>> here is the stack, copied by hand.
> >>
> >> Call Trace:
> >>  pick_next_entity+0xb0/0x160
> >>  pick_task_fair+0x4b/0x90
> >>  __schedule+0x59b/0x12f0
> >>  schedule_idle+0x1e/0x40
> >>  do_idle+0x193/0x2d0
> >>  cpu_startup_entry+0x19/0x20
> >>  start_secondary+0x110/0x150
> >>  secondary_startup_64_no_verify+0xa6/0xab
> > 
> > Interesting. Wondering if we screwed something up in the rebase.
> > 
> > Questions:
> > 1. Does the issue happen if you just apply only up until this patch,
> > or the entire series?
> 
> I applied the entire series and just found a related patch to report the
> issue.

Ok.

> > 2. Do you see the issue in v7? Not much if at all has changed in this
> > part of the code from v7 -> v8 but could be something in the newer
> > kernel.
> > 
> 
> IIRC, I can run uperf successfully on v7.
> I'm on tip/master 2d3e8c9424c9 (origin/master) "Merge branch 'linus'."
> Please let me know if this is a problem, or you have a repo I can pull
> for testing.

Here is a repo with v8 series on top of v5.9 release:
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched-v5.9

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-23 21:47         ` Joel Fernandes
@ 2020-10-24  2:48           ` Li, Aubrey
  2020-10-24 11:10             ` Vineeth Pillai
  0 siblings, 1 reply; 98+ messages in thread
From: Li, Aubrey @ 2020-10-24  2:48 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Glexiner, LKML,
	Ingo Molnar, Linus Torvalds, Frederic Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Vineeth Remanan Pillai,
	Paul E. McKenney, Tim Chen, Ning, Hongyu

On 2020/10/24 5:47, Joel Fernandes wrote:
> On Fri, Oct 23, 2020 at 01:25:38PM +0800, Li, Aubrey wrote:
>>>>> @@ -2517,6 +2528,7 @@ const struct sched_class dl_sched_class
>>>>>
>>>>>  #ifdef CONFIG_SMP
>>>>>       .balance                = balance_dl,
>>>>> +     .pick_task              = pick_task_dl,
>>>>>       .select_task_rq         = select_task_rq_dl,
>>>>>       .migrate_task_rq        = migrate_task_rq_dl,
>>>>>       .set_cpus_allowed       = set_cpus_allowed_dl,
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index dbd9368a959d..bd6aed63f5e3 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -4450,7 +4450,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>>>        * Avoid running the skip buddy, if running something else can
>>>>>        * be done without getting too unfair.
>>>>>        */
>>>>> -     if (cfs_rq->skip == se) {
>>>>> +     if (cfs_rq->skip && cfs_rq->skip == se) {
>>>>>               struct sched_entity *second;
>>>>>
>>>>>               if (se == curr) {
>>>>> @@ -6976,6 +6976,35 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>>>>>               set_last_buddy(se);
>>>>>  }
>>>>>
>>>>> +#ifdef CONFIG_SMP
>>>>> +static struct task_struct *pick_task_fair(struct rq *rq)
>>>>> +{
>>>>> +     struct cfs_rq *cfs_rq = &rq->cfs;
>>>>> +     struct sched_entity *se;
>>>>> +
>>>>> +     if (!cfs_rq->nr_running)
>>>>> +             return NULL;
>>>>> +
>>>>> +     do {
>>>>> +             struct sched_entity *curr = cfs_rq->curr;
>>>>> +
>>>>> +             se = pick_next_entity(cfs_rq, NULL);
>>>>> +
>>>>> +             if (curr) {
>>>>> +                     if (se && curr->on_rq)
>>>>> +                             update_curr(cfs_rq);
>>>>> +
>>>>> +                     if (!se || entity_before(curr, se))
>>>>> +                             se = curr;
>>>>> +             }
>>>>> +
>>>>> +             cfs_rq = group_cfs_rq(se);
>>>>> +     } while (cfs_rq);
>>>>> ++
>>>>> +     return task_of(se);
>>>>> +}
>>>>> +#endif
>>>>
>>>> One of my machines hangs when I run uperf with only one message:
>>>> [  719.034962] BUG: kernel NULL pointer dereference, address: 0000000000000050
>>>>
>>>> Then I replicated the problem on another machine (no serial console);
>>>> here is the stack, copied by hand.
>>>>
>>>> Call Trace:
>>>>  pick_next_entity+0xb0/0x160
>>>>  pick_task_fair+0x4b/0x90
>>>>  __schedule+0x59b/0x12f0
>>>>  schedule_idle+0x1e/0x40
>>>>  do_idle+0x193/0x2d0
>>>>  cpu_startup_entry+0x19/0x20
>>>>  start_secondary+0x110/0x150
>>>>  secondary_startup_64_no_verify+0xa6/0xab
>>>
>>> Interesting. Wondering if we screwed something up in the rebase.
>>>
>>> Questions:
>>> 1. Does the issue happen if you just apply only up until this patch,
>>> or the entire series?
>>
>> I applied the entire series and just found a related patch to report the
>> issue.
> 
> Ok.
> 
>>> 2. Do you see the issue in v7? Not much if at all has changed in this
>>> part of the code from v7 -> v8 but could be something in the newer
>>> kernel.
>>>
>>
>> IIRC, I can run uperf successfully on v7.
>> I'm on tip/master 2d3e8c9424c9 (origin/master) "Merge branch 'linus'."
>> Please let me know if this is a problem, or you have a repo I can pull
>> for testing.
> 
> Here is a repo with v8 series on top of v5.9 release:
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched-v5.9

I didn't see the NULL pointer dereference BUG with this repo; I will post
performance data later.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-24  2:48           ` Li, Aubrey
@ 2020-10-24 11:10             ` Vineeth Pillai
  2020-10-24 12:27               ` Vineeth Pillai
  0 siblings, 1 reply; 98+ messages in thread
From: Vineeth Pillai @ 2020-10-24 11:10 UTC (permalink / raw)
  To: Li, Aubrey, Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Aaron Lu, Aubrey Li, Thomas Glexiner, LKML, Ingo Molnar,
	Linus Torvalds, Frederic Weisbecker, Kees Cook, Greg Kerr,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, Dario Faggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Paul E. McKenney,
	Tim Chen, Ning, Hongyu

Hi Aubrey,


On 10/23/20 10:48 PM, Li, Aubrey wrote:
>
>>>> 2. Do you see the issue in v7? Not much if at all has changed in this
>>>> part of the code from v7 -> v8 but could be something in the newer
>>>> kernel.
>>>>
>>> IIRC, I can run uperf successfully on v7.
>>> I'm on tip/master 2d3e8c9424c9 (origin/master) "Merge branch 'linus'."
>>> Please let me know if this is a problem, or you have a repo I can pull
>>> for testing.
>> Here is a repo with v8 series on top of v5.9 release:
>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched-v5.9
> I didn't see the NULL pointer dereference BUG with this repo; I will post
> performance data later.
There has been a change to pick_next_entity in tip which caused a
coresched-related change needed to avoid this crash to be dropped. Could
you please try this patch on top of the posted v8 and see if it
works?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 93a3b874077d..4cae5ac48b60 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4428,12 +4428,14 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
                         se = second;
         }

-       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+       if (left && cfs_rq->next &&
+                       wakeup_preempt_entity(cfs_rq->next, left) < 1) {
                 /*
                  * Someone really wants this to run. If it's not unfair, run it.
                  */
                 se = cfs_rq->next;
-       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
+       } else if (left && cfs_rq->last &&
+                       wakeup_preempt_entity(cfs_rq->last, left) < 1) {
                 /*
                  * Prefer last buddy, try to return the CPU to a preempted task.


The reason for left being NULL needs to be investigated. This has been
there since v1 and we have not yet gotten to it. I shall try to debug it
later this week.

Kindly test this patch and let us know if it fixes the crash.

Thanks,
Vineeth

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-24 11:10             ` Vineeth Pillai
@ 2020-10-24 12:27               ` Vineeth Pillai
  2020-10-24 23:48                 ` Li, Aubrey
                                   ` (2 more replies)
  0 siblings, 3 replies; 98+ messages in thread
From: Vineeth Pillai @ 2020-10-24 12:27 UTC (permalink / raw)
  To: Li, Aubrey, Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Aaron Lu, Aubrey Li, Thomas Glexiner, LKML, Ingo Molnar,
	Linus Torvalds, Frederic Weisbecker, Kees Cook, Greg Kerr,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, Dario Faggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Paul E. McKenney,
	Tim Chen, Ning, Hongyu



On 10/24/20 7:10 AM, Vineeth Pillai wrote:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 93a3b874077d..4cae5ac48b60 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4428,12 +4428,14 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
> sched_entity *curr)
>                         se = second;
>         }
>
> -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) 
> < 1) {
> +       if (left && cfs_rq->next &&
> +                       wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>                 /*
>                  * Someone really wants this to run. If it's not 
> unfair, run it.
>                  */
>                 se = cfs_rq->next;
> -       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, 
> left) < 1) {
> +       } else if (left && cfs_rq->last &&
> +                       wakeup_preempt_entity(cfs_rq->last, left) < 1) {
>                 /*
>                  * Prefer last buddy, try to return the CPU to a 
> preempted task.
>
>
> The reason for left being NULL needs to be investigated. This has been
> there since v1 and we have not yet gotten to it. I shall try to debug it
> later this week.

Thinking more about it and looking at the crash, I think that
'left == NULL' can happen in pick_next_entity for core scheduling.
If a cfs_rq has only one task that is running, then it will be
dequeued and 'left = __pick_first_entity()' will be NULL as the
cfs_rq will be empty. This would not happen outside of coresched
because we never call pick_task() before put_prev_task() which
will enqueue the task back.

With core scheduling, a cpu can call pick_task() for its sibling while
the sibling is still running the active task and put_prev_task() has
not yet been called. This can result in 'left == NULL'. So I think the
above fix is appropriate when core scheduling is active. It could be
cleaned up a bit though.
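
A possible cleanup along those lines (just a sketch on top of the diff
above; the helper name is made up and this is untested):

static inline bool
buddy_preempts(struct sched_entity *buddy, struct sched_entity *left)
{
	/* A NULL 'left' means there is nothing in the tree to compare against. */
	return left && buddy && wakeup_preempt_entity(buddy, left) < 1;
}

so that pick_next_entity() would read:

	if (buddy_preempts(cfs_rq->next, left)) {
		/* Someone really wants this to run. If it's not unfair, run it. */
		se = cfs_rq->next;
	} else if (buddy_preempts(cfs_rq->last, left)) {
		/* Prefer last buddy, try to return the CPU to a preempted task. */
		se = cfs_rq->last;
	}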

Thanks,
Vineeth


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-24 12:27               ` Vineeth Pillai
@ 2020-10-24 23:48                 ` Li, Aubrey
  2020-10-26  9:01                 ` Peter Zijlstra
  2020-10-27 14:14                 ` Joel Fernandes
  2 siblings, 0 replies; 98+ messages in thread
From: Li, Aubrey @ 2020-10-24 23:48 UTC (permalink / raw)
  To: Vineeth Pillai, Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Aaron Lu, Aubrey Li, Thomas Glexiner, LKML, Ingo Molnar,
	Linus Torvalds, Frederic Weisbecker, Kees Cook, Greg Kerr,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, Dario Faggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Paul E. McKenney,
	Tim Chen, Ning, Hongyu

On 2020/10/24 20:27, Vineeth Pillai wrote:
> 
> 
> On 10/24/20 7:10 AM, Vineeth Pillai wrote:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 93a3b874077d..4cae5ac48b60 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4428,12 +4428,14 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>                         se = second;
>>         }
>>
>> -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>> +       if (left && cfs_rq->next &&
>> +                       wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>                 /*
>>                  * Someone really wants this to run. If it's not unfair, run it.
>>                  */
>>                 se = cfs_rq->next;
>> -       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
>> +       } else if (left && cfs_rq->last &&
>> +                       wakeup_preempt_entity(cfs_rq->last, left) < 1) {
>>                 /*
>>                  * Prefer last buddy, try to return the CPU to a preempted task.
>>
>>
>> The reason for left being NULL needs to be investigated. This has been
>> there since v1 and we have not yet gotten to it. I shall try to debug it
>> later this week.
> 
> Thinking more about it and looking at the crash, I think that
> 'left == NULL' can happen in pick_next_entity for core scheduling.
> If a cfs_rq has only one task that is running, then it will be
> dequeued and 'left = __pick_first_entity()' will be NULL as the
> cfs_rq will be empty. This would not happen outside of coresched
> because we never call pick_task() before put_prev_task() which
> will enqueue the task back.
> 
> With core scheduling, a cpu can call pick_task() for its sibling while
> the sibling is still running the active task and put_prev_task() has
> not yet been called. This can result in 'left == NULL'. So I think the
> above fix is appropriate when core scheduling is active. It could be
> cleaned up a bit though.

This patch works, thanks Vineeth for the quick fix!

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 24/26] sched: Move core-scheduler interfacing code to a new file
  2020-10-20  1:43 ` [PATCH v8 -tip 24/26] sched: Move core-scheduler interfacing code to a new file Joel Fernandes (Google)
@ 2020-10-26  1:05   ` Li, Aubrey
  2020-11-03  2:58     ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Li, Aubrey @ 2020-10-26  1:05 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Paul E. McKenney, Tim Chen

On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
> core.c is already huge. The core-tagging interface code is largely
> independent of it. Move it to its own file to make both files easier to
> maintain.
> 
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/Makefile  |   1 +
>  kernel/sched/core.c    | 481 +----------------------------------------
>  kernel/sched/coretag.c | 468 +++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h   |  56 ++++-
>  4 files changed, 523 insertions(+), 483 deletions(-)
>  create mode 100644 kernel/sched/coretag.c
> 
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 5fc9c9b70862..c526c20adf9d 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
>  obj-$(CONFIG_MEMBARRIER) += membarrier.o
>  obj-$(CONFIG_CPU_ISOLATION) += isolation.o
>  obj-$(CONFIG_PSI) += psi.o
> +obj-$(CONFIG_SCHED_CORE) += coretag.o
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b3afbba5abe1..211e0784675f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -162,11 +162,6 @@ static bool sched_core_empty(struct rq *rq)
>  	return RB_EMPTY_ROOT(&rq->core_tree);
>  }
>  
> -static bool sched_core_enqueued(struct task_struct *task)
> -{
> -	return !RB_EMPTY_NODE(&task->core_node);
> -}
> -
>  static struct task_struct *sched_core_first(struct rq *rq)
>  {
>  	struct task_struct *task;
> @@ -188,7 +183,7 @@ static void sched_core_flush(int cpu)
>  	rq->core->core_task_seq++;
>  }
>  
> -static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> +void sched_core_enqueue(struct rq *rq, struct task_struct *p)
>  {
>  	struct rb_node *parent, **node;
>  	struct task_struct *node_task;
> @@ -215,7 +210,7 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
>  	rb_insert_color(&p->core_node, &rq->core_tree);
>  }
>  
> -static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
> +void sched_core_dequeue(struct rq *rq, struct task_struct *p)
>  {
>  	rq->core->core_task_seq++;
>  
> @@ -310,7 +305,6 @@ static int __sched_core_stopper(void *data)
>  }
>  
>  static DEFINE_MUTEX(sched_core_mutex);
> -static DEFINE_MUTEX(sched_core_tasks_mutex);
>  static int sched_core_count;
>  
>  static void __sched_core_enable(void)
> @@ -346,16 +340,6 @@ void sched_core_put(void)
>  		__sched_core_disable();
>  	mutex_unlock(&sched_core_mutex);
>  }
> -
> -static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
> -
> -#else /* !CONFIG_SCHED_CORE */
> -
> -static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
> -static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
> -static bool sched_core_enqueued(struct task_struct *task) { return false; }
> -static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2) { }
> -
>  #endif /* CONFIG_SCHED_CORE */
>  
>  /*
> @@ -8505,9 +8489,6 @@ void sched_offline_group(struct task_group *tg)
>  	spin_unlock_irqrestore(&task_group_lock, flags);
>  }
>  
> -#define SCHED_CORE_GROUP_COOKIE_MASK ((1UL << (sizeof(unsigned long) * 4)) - 1)
> -static unsigned long cpu_core_get_group_cookie(struct task_group *tg);
> -
>  static void sched_change_group(struct task_struct *tsk, int type)
>  {
>  	struct task_group *tg;
> @@ -8583,11 +8564,6 @@ void sched_move_task(struct task_struct *tsk)
>  	task_rq_unlock(rq, tsk, &rf);
>  }
>  
> -static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
> -{
> -	return css ? container_of(css, struct task_group, css) : NULL;
> -}
> -
>  static struct cgroup_subsys_state *
>  cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  {
> @@ -9200,459 +9176,6 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
>  }
>  #endif /* CONFIG_RT_GROUP_SCHED */
>  
> -#ifdef CONFIG_SCHED_CORE
> -/*
> - * A simple wrapper around refcount. An allocated sched_core_cookie's
> - * address is used to compute the cookie of the task.
> - */
> -struct sched_core_cookie {
> -	refcount_t refcnt;
> -};
> -
> -/*
> - * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
> - * @p: The task to assign a cookie to.
> - * @cookie: The cookie to assign.
> - * @group: is it a group interface or a per-task interface.
> - *
> - * This function is typically called from a stop-machine handler.
> - */
> -void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
> -{
> -	if (!p)
> -		return;
> -
> -	if (group)
> -		p->core_group_cookie = cookie;
> -	else
> -		p->core_task_cookie = cookie;
> -
> -	/* Use up half of the cookie's bits for task cookie and remaining for group cookie. */
> -	p->core_cookie = (p->core_task_cookie <<
> -				(sizeof(unsigned long) * 4)) + p->core_group_cookie;
> -
> -	if (sched_core_enqueued(p)) {
> -		sched_core_dequeue(task_rq(p), p);
> -		if (!p->core_cookie)
> -			return;
> -	}
> -
> -	if (sched_core_enabled(task_rq(p)) &&
> -			p->core_cookie && task_on_rq_queued(p))
> -		sched_core_enqueue(task_rq(p), p);
> -}
> -
> -/* Per-task interface */
> -static unsigned long sched_core_alloc_task_cookie(void)
> -{
> -	struct sched_core_cookie *ptr =
> -		kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
> -
> -	if (!ptr)
> -		return 0;
> -	refcount_set(&ptr->refcnt, 1);
> -
> -	/*
> -	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> -	 * is done after the stopper runs.
> -	 */
> -	sched_core_get();
> -	return (unsigned long)ptr;
> -}
> -
> -static bool sched_core_get_task_cookie(unsigned long cookie)
> -{
> -	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> -
> -	/*
> -	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> -	 * is done after the stopper runs.
> -	 */
> -	sched_core_get();
> -	return refcount_inc_not_zero(&ptr->refcnt);
> -}
> -
> -static void sched_core_put_task_cookie(unsigned long cookie)
> -{
> -	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> -
> -	if (refcount_dec_and_test(&ptr->refcnt))
> -		kfree(ptr);
> -}
> -
> -struct sched_core_task_write_tag {
> -	struct task_struct *tasks[2];
> -	unsigned long cookies[2];
> -};
> -
> -/*
> - * Ensure that the task has been requeued. The stopper ensures that the task cannot
> - * be migrated to a different CPU while its core scheduler queue state is being updated.
> - * It also makes sure to requeue a task if it was running actively on another CPU.
> - */
> -static int sched_core_task_join_stopper(void *data)
> -{
> -	struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
> -	int i;
> -
> -	for (i = 0; i < 2; i++)
> -		sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], false /* !group */);
> -
> -	return 0;
> -}
> -
> -static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
> -{
> -	struct sched_core_task_write_tag wr = {}; /* for stop machine. */
> -	bool sched_core_put_after_stopper = false;
> -	unsigned long cookie;
> -	int ret = -ENOMEM;
> -
> -	mutex_lock(&sched_core_tasks_mutex);
> -
> -	/*
> -	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
> -	 *       sched_core_put_task_cookie(). However, sched_core_put() is done
> -	 *       by this function *after* the stopper removes the tasks from the
> -	 *       core queue, and not before. This is just to play it safe.
> -	 */
> -	if (t2 == NULL) {
> -		if (t1->core_task_cookie) {
> -			sched_core_put_task_cookie(t1->core_task_cookie);
> -			sched_core_put_after_stopper = true;
> -			wr.tasks[0] = t1; /* Keep wr.cookies[0] reset for t1. */
> -		}
> -	} else if (t1 == t2) {
> -		/* Assign a unique per-task cookie solely for t1. */
> -
> -		cookie = sched_core_alloc_task_cookie();
> -		if (!cookie)
> -			goto out_unlock;
> -
> -		if (t1->core_task_cookie) {
> -			sched_core_put_task_cookie(t1->core_task_cookie);
> -			sched_core_put_after_stopper = true;
> -		}
> -		wr.tasks[0] = t1;
> -		wr.cookies[0] = cookie;
> -	} else
> -	/*
> -	 * 		t1		joining		t2
> -	 * CASE 1:
> -	 * before	0				0
> -	 * after	new cookie			new cookie
> -	 *
> -	 * CASE 2:
> -	 * before	X (non-zero)			0
> -	 * after	0				0
> -	 *
> -	 * CASE 3:
> -	 * before	0				X (non-zero)
> -	 * after	X				X
> -	 *
> -	 * CASE 4:
> -	 * before	Y (non-zero)			X (non-zero)
> -	 * after	X				X
> -	 */
> -	if (!t1->core_task_cookie && !t2->core_task_cookie) {
> -		/* CASE 1. */
> -		cookie = sched_core_alloc_task_cookie();
> -		if (!cookie)
> -			goto out_unlock;
> -
> -		/* Add another reference for the other task. */
> -		if (!sched_core_get_task_cookie(cookie)) {
> -			return -EINVAL;
> -			goto out_unlock;
> -		}
> -
> -		wr.tasks[0] = t1;
> -		wr.tasks[1] = t2;
> -		wr.cookies[0] = wr.cookies[1] = cookie;
> -
> -	} else if (t1->core_task_cookie && !t2->core_task_cookie) {
> -		/* CASE 2. */
> -		sched_core_put_task_cookie(t1->core_task_cookie);
> -		sched_core_put_after_stopper = true;
> -
> -		wr.tasks[0] = t1; /* Reset cookie for t1. */
> -
> -	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
> -		/* CASE 3. */
> -		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
> -			ret = -EINVAL;
> -			goto out_unlock;
> -		}
> -
> -		wr.tasks[0] = t1;
> -		wr.cookies[0] = t2->core_task_cookie;
> -
> -	} else {
> -		/* CASE 4. */
> -		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
> -			ret = -EINVAL;
> -			goto out_unlock;
> -		}
> -		sched_core_put_task_cookie(t1->core_task_cookie);
> -		sched_core_put_after_stopper = true;
> -
> -		wr.tasks[0] = t1;
> -		wr.cookies[0] = t2->core_task_cookie;
> -	}
> -
> -	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
> -
> -	if (sched_core_put_after_stopper)
> -		sched_core_put();
> -
> -	ret = 0;
> -out_unlock:
> -	mutex_unlock(&sched_core_tasks_mutex);
> -	return ret;
> -}
> -
> -/* Called from prctl interface: PR_SCHED_CORE_SHARE */
> -int sched_core_share_pid(pid_t pid)
> -{
> -	struct task_struct *task;
> -	int err;
> -
> -	if (pid == 0) { /* Recent current task's cookie. */
> -		/* Resetting a cookie requires privileges. */
> -		if (current->core_task_cookie)
> -			if (!capable(CAP_SYS_ADMIN))
> -				return -EPERM;
> -		task = NULL;
> -	} else {
> -		rcu_read_lock();
> -		task = pid ? find_task_by_vpid(pid) : current;
> -		if (!task) {
> -			rcu_read_unlock();
> -			return -ESRCH;
> -		}
> -
> -		get_task_struct(task);
> -
> -		/*
> -		 * Check if this process has the right to modify the specified
> -		 * process. Use the regular "ptrace_may_access()" checks.
> -		 */
> -		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
> -			rcu_read_unlock();
> -			err = -EPERM;
> -			goto out_put;
> -		}
> -		rcu_read_unlock();
> -	}
> -
> -	err = sched_core_share_tasks(current, task);
> -out_put:
> -	if (task)
> -		put_task_struct(task);
> -	return err;
> -}
> -
> -/* CGroup interface */
> -
> -/*
> - * Helper to get the cookie in a hierarchy.
> - * The cookie is a combination of a tag and color. Any ancestor
> - * can have a tag/color. tag is the first-level cookie setting
> - * with color being the second. Atmost one color and one tag is
> - * allowed.
> - */
> -static unsigned long cpu_core_get_group_cookie(struct task_group *tg)
> -{
> -	unsigned long color = 0;
> -
> -	if (!tg)
> -		return 0;
> -
> -	for (; tg; tg = tg->parent) {
> -		if (tg->core_tag_color) {
> -			WARN_ON_ONCE(color);
> -			color = tg->core_tag_color;
> -		}
> -
> -		if (tg->core_tagged) {
> -			unsigned long cookie = ((unsigned long)tg << 8) | color;
> -			cookie &= SCHED_CORE_GROUP_COOKIE_MASK;
> -			return cookie;
> -		}
> -	}
> -
> -	return 0;
> -}
> -
> -/* Determine if any group in @tg's children are tagged or colored. */
> -static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag,
> -					bool check_color)
> -{
> -	struct task_group *child;
> -
> -	rcu_read_lock();
> -	list_for_each_entry_rcu(child, &tg->children, siblings) {
> -		if ((child->core_tagged && check_tag) ||
> -		    (child->core_tag_color && check_color)) {
> -			rcu_read_unlock();
> -			return true;
> -		}
> -
> -		rcu_read_unlock();
> -		return cpu_core_check_descendants(child, check_tag, check_color);
> -	}
> -
> -	rcu_read_unlock();
> -	return false;
> -}
> -
> -static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
> -{
> -	struct task_group *tg = css_tg(css);
> -
> -	return !!tg->core_tagged;
> -}
> -
> -static u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
> -{
> -	struct task_group *tg = css_tg(css);
> -
> -	return tg->core_tag_color;
> -}
> -
> -#ifdef CONFIG_SCHED_DEBUG
> -static u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
> -{
> -	return cpu_core_get_group_cookie(css_tg(css));
> -}
> -#endif
> -
> -struct write_core_tag {
> -	struct cgroup_subsys_state *css;
> -	unsigned long cookie;
> -};
> -
> -static int __sched_write_tag(void *data)
> -{
> -	struct write_core_tag *tag = (struct write_core_tag *) data;
> -	struct task_struct *p;
> -	struct cgroup_subsys_state *css;
> -
> -	rcu_read_lock();
> -	css_for_each_descendant_pre(css, tag->css) {
> -		struct css_task_iter it;
> -
> -		css_task_iter_start(css, 0, &it);
> -		/*
> -		 * Note: css_task_iter_next will skip dying tasks.
> -		 * There could still be dying tasks left in the core queue
> -		 * when we set cgroup tag to 0 when the loop is done below.
> -		 */
> -		while ((p = css_task_iter_next(&it)))
> -			sched_core_tag_requeue(p, tag->cookie, true /* group */);
> -
> -		css_task_iter_end(&it);
> -	}
> -	rcu_read_unlock();
> -
> -	return 0;
> -}
> -
> -static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
> -{
> -	struct task_group *tg = css_tg(css);
> -	struct write_core_tag wtag;
> -
> -	if (val > 1)
> -		return -ERANGE;
> -
> -	if (!static_branch_likely(&sched_smt_present))
> -		return -EINVAL;
> -
> -	if (!tg->core_tagged && val) {
> -		/* Tag is being set. Check ancestors and descendants. */
> -		if (cpu_core_get_group_cookie(tg) ||
> -		    cpu_core_check_descendants(tg, true /* tag */, true /* color */))
> -			return -EBUSY;
> -	} else if (tg->core_tagged && !val) {
> -		/* Tag is being reset. Check descendants. */
> -		if (cpu_core_check_descendants(tg, true /* tag */, true /* color */))
> -			return -EBUSY;
> -	} else {
> -		return 0;
> -	}
> -
> -	if (!!val)
> -		sched_core_get();
> -
> -	wtag.css = css;
> -	wtag.cookie = (unsigned long)tg << 8; /* Reserve lower 8 bits for color. */
> -
> -	/* Truncate the upper 32-bits - those are used by the per-task cookie. */
> -	wtag.cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
> -
> -	tg->core_tagged = val;
> -
> -	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
> -	if (!val)
> -		sched_core_put();
> -
> -	return 0;
> -}
> -
> -static int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
> -					struct cftype *cft, u64 val)
> -{
> -	struct task_group *tg = css_tg(css);
> -	struct write_core_tag wtag;
> -	u64 cookie;
> -
> -	if (val > 255)
> -		return -ERANGE;
> -
> -	if (!static_branch_likely(&sched_smt_present))
> -		return -EINVAL;
> -
> -	cookie = cpu_core_get_group_cookie(tg);
> -	/* Can't set color if nothing in the ancestors were tagged. */
> -	if (!cookie)
> -		return -EINVAL;
> -
> -	/*
> -	 * Something in the ancestors already colors us. Can't change the color
> -	 * at this level.
> -	 */
> -	if (!tg->core_tag_color && (cookie & 255))
> -		return -EINVAL;
> -
> -	/*
> -	 * Check if any descendants are colored. If so, we can't recolor them.
> -	 * Don't need to check if descendants are tagged, since we don't allow
> -	 * tagging when already tagged.
> -	 */
> -	if (cpu_core_check_descendants(tg, false /* tag */, true /* color */))
> -		return -EINVAL;
> -
> -	cookie &= ~255;
> -	cookie |= val;
> -	wtag.css = css;
> -	wtag.cookie = cookie;
> -	tg->core_tag_color = val;
> -
> -	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
> -
> -	return 0;
> -}
> -
> -void sched_tsk_free(struct task_struct *tsk)
> -{
> -	if (!tsk->core_task_cookie)
> -		return;
> -	sched_core_put_task_cookie(tsk->core_task_cookie);
> -	sched_core_put();
> -}
> -#endif
> -
>  static struct cftype cpu_legacy_files[] = {
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	{
> diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
> new file mode 100644
> index 000000000000..3333c9b0afc5
> --- /dev/null
> +++ b/kernel/sched/coretag.c
> @@ -0,0 +1,468 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * kernel/sched/core-tag.c
> + *
> + * Core-scheduling tagging interface support.
> + *
> + * Copyright(C) 2020, Joel Fernandes.
> + * Initial interfacing code  by Peter Ziljstra.
> + */
> +
> +#include "sched.h"
> +
> +/*
> + * A simple wrapper around refcount. An allocated sched_core_cookie's
> + * address is used to compute the cookie of the task.
> + */
> +struct sched_core_cookie {
> +	refcount_t refcnt;
> +};
> +
> +static DEFINE_MUTEX(sched_core_tasks_mutex);
> +
> +/*
> + * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
> + * @p: The task to assign a cookie to.
> + * @cookie: The cookie to assign.
> + * @group: is it a group interface or a per-task interface.
> + *
> + * This function is typically called from a stop-machine handler.
> + */
> +void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
> +{
> +	if (!p)
> +		return;
> +
> +	if (group)
> +		p->core_group_cookie = cookie;
> +	else
> +		p->core_task_cookie = cookie;
> +
> +	/* Use up half of the cookie's bits for task cookie and remaining for group cookie. */
> +	p->core_cookie = (p->core_task_cookie <<
> +				(sizeof(unsigned long) * 4)) + p->core_group_cookie;
> +
> +	if (sched_core_enqueued(p)) {
> +		sched_core_dequeue(task_rq(p), p);
> +		if (!p->core_cookie)
> +			return;
> +	}
> +
> +	if (sched_core_enabled(task_rq(p)) &&
> +			p->core_cookie && task_on_rq_queued(p))
> +		sched_core_enqueue(task_rq(p), p);
> +}
> +
> +/* Per-task interface: Used by fork(2) and prctl(2). */
> +static unsigned long sched_core_alloc_task_cookie(void)
> +{
> +	struct sched_core_cookie *ptr =
> +		kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
> +
> +	if (!ptr)
> +		return 0;
> +	refcount_set(&ptr->refcnt, 1);
> +
> +	/*
> +	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> +	 * is done after the stopper runs.
> +	 */
> +	sched_core_get();
> +	return (unsigned long)ptr;
> +}
> +
> +static bool sched_core_get_task_cookie(unsigned long cookie)
> +{
> +	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> +
> +	/*
> +	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> +	 * is done after the stopper runs.
> +	 */
> +	sched_core_get();
> +	return refcount_inc_not_zero(&ptr->refcnt);
> +}
> +
> +static void sched_core_put_task_cookie(unsigned long cookie)
> +{
> +	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> +
> +	if (refcount_dec_and_test(&ptr->refcnt))
> +		kfree(ptr);
> +}
> +
> +struct sched_core_task_write_tag {
> +	struct task_struct *tasks[2];
> +	unsigned long cookies[2];
> +};
> +
> +/*
> + * Ensure that the task has been requeued. The stopper ensures that the task cannot
> + * be migrated to a different CPU while its core scheduler queue state is being updated.
> + * It also makes sure to requeue a task if it was running actively on another CPU.
> + */
> +static int sched_core_task_join_stopper(void *data)
> +{
> +	struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
> +	int i;
> +
> +	for (i = 0; i < 2; i++)
> +		sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], false /* !group */);
> +
> +	return 0;
> +}
> +
> +int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
> +{
> +	struct sched_core_task_write_tag wr = {}; /* for stop machine. */
> +	bool sched_core_put_after_stopper = false;
> +	unsigned long cookie;
> +	int ret = -ENOMEM;
> +
> +	mutex_lock(&sched_core_tasks_mutex);
> +
> +	/*
> +	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
> +	 *       sched_core_put_task_cookie(). However, sched_core_put() is done
> +	 *       by this function *after* the stopper removes the tasks from the
> +	 *       core queue, and not before. This is just to play it safe.
> +	 */
> +	if (t2 == NULL) {
> +		if (t1->core_task_cookie) {
> +			sched_core_put_task_cookie(t1->core_task_cookie);
> +			sched_core_put_after_stopper = true;
> +			wr.tasks[0] = t1; /* Keep wr.cookies[0] reset for t1. */
> +		}
> +	} else if (t1 == t2) {
> +		/* Assign a unique per-task cookie solely for t1. */
> +
> +		cookie = sched_core_alloc_task_cookie();
> +		if (!cookie)
> +			goto out_unlock;
> +
> +		if (t1->core_task_cookie) {
> +			sched_core_put_task_cookie(t1->core_task_cookie);
> +			sched_core_put_after_stopper = true;
> +		}
> +		wr.tasks[0] = t1;
> +		wr.cookies[0] = cookie;
> +	} else
> +	/*
> +	 * 		t1		joining		t2
> +	 * CASE 1:
> +	 * before	0				0
> +	 * after	new cookie			new cookie
> +	 *
> +	 * CASE 2:
> +	 * before	X (non-zero)			0
> +	 * after	0				0
> +	 *
> +	 * CASE 3:
> +	 * before	0				X (non-zero)
> +	 * after	X				X
> +	 *
> +	 * CASE 4:
> +	 * before	Y (non-zero)			X (non-zero)
> +	 * after	X				X
> +	 */
> +	if (!t1->core_task_cookie && !t2->core_task_cookie) {
> +		/* CASE 1. */
> +		cookie = sched_core_alloc_task_cookie();
> +		if (!cookie)
> +			goto out_unlock;
> +
> +		/* Add another reference for the other task. */
> +		if (!sched_core_get_task_cookie(cookie)) {
> +			return -EINVAL;

This should set "ret = -EINVAL;" and fall through to the goto below;
returning directly here means sched_core_tasks_mutex is never released.
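i.e. something like:

		/* Add another reference for the other task. */
		if (!sched_core_get_task_cookie(cookie)) {
			ret = -EINVAL;
			goto out_unlock;
		}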

> +			goto out_unlock;
> +		}
> +
> +		wr.tasks[0] = t1;
> +		wr.tasks[1] = t2;
> +		wr.cookies[0] = wr.cookies[1] = cookie;
> +
> +	} else if (t1->core_task_cookie && !t2->core_task_cookie) {
> +		/* CASE 2. */
> +		sched_core_put_task_cookie(t1->core_task_cookie);
> +		sched_core_put_after_stopper = true;
> +
> +		wr.tasks[0] = t1; /* Reset cookie for t1. */
> +
> +	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
> +		/* CASE 3. */
> +		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
> +			ret = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		wr.tasks[0] = t1;
> +		wr.cookies[0] = t2->core_task_cookie;
> +
> +	} else {
> +		/* CASE 4. */
> +		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
> +			ret = -EINVAL;
> +			goto out_unlock;
> +		}
> +		sched_core_put_task_cookie(t1->core_task_cookie);
> +		sched_core_put_after_stopper = true;
> +
> +		wr.tasks[0] = t1;
> +		wr.cookies[0] = t2->core_task_cookie;
> +	}
> +
> +	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
> +
> +	if (sched_core_put_after_stopper)
> +		sched_core_put();
> +
> +	ret = 0;
> +out_unlock:
> +	mutex_unlock(&sched_core_tasks_mutex);
> +	return ret;
> +}
> +
> +/* Called from prctl interface: PR_SCHED_CORE_SHARE */
> +int sched_core_share_pid(pid_t pid)
> +{
> +	struct task_struct *task;
> +	int err;
> +
> +	if (pid == 0) { /* Reset current task's cookie. */
> +		/* Resetting a cookie requires privileges. */
> +		if (current->core_task_cookie)
> +			if (!capable(CAP_SYS_ADMIN))
> +				return -EPERM;
> +		task = NULL;
> +	} else {
> +		rcu_read_lock();
> +		task = pid ? find_task_by_vpid(pid) : current;
> +		if (!task) {
> +			rcu_read_unlock();
> +			return -ESRCH;
> +		}
> +
> +		get_task_struct(task);
> +
> +		/*
> +		 * Check if this process has the right to modify the specified
> +		 * process. Use the regular "ptrace_may_access()" checks.
> +		 */
> +		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
> +			rcu_read_unlock();
> +			err = -EPERM;
> +			goto out_put;
> +		}
> +		rcu_read_unlock();
> +	}
> +
> +	err = sched_core_share_tasks(current, task);
> +out_put:
> +	if (task)
> +		put_task_struct(task);
> +	return err;
> +}
> +
> +/* CGroup core-scheduling interface support. */
> +
> +/*
> + * Helper to get the cookie in a hierarchy.
> + * The cookie is a combination of a tag and color. Any ancestor
> + * can have a tag/color. tag is the first-level cookie setting
> + * with color being the second. At most one color and one tag are
> + * allowed.
> + */
> +unsigned long cpu_core_get_group_cookie(struct task_group *tg)
> +{
> +	unsigned long color = 0;
> +
> +	if (!tg)
> +		return 0;
> +
> +	for (; tg; tg = tg->parent) {
> +		if (tg->core_tag_color) {
> +			WARN_ON_ONCE(color);
> +			color = tg->core_tag_color;
> +		}
> +
> +		if (tg->core_tagged) {
> +			unsigned long cookie = ((unsigned long)tg << 8) | color;
> +			cookie &= SCHED_CORE_GROUP_COOKIE_MASK;
> +			return cookie;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/* Determine if any group in @tg's children is tagged or colored. */
> +static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag,
> +				       bool check_color)
> +{
> +	struct task_group *child;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(child, &tg->children, siblings) {
> +		if ((child->core_tagged && check_tag) ||
> +		    (child->core_tag_color && check_color)) {
> +			rcu_read_unlock();
> +			return true;
> +		}
> +
> +		rcu_read_unlock();
> +		return cpu_core_check_descendants(child, check_tag, check_color);
> +	}
> +
> +	rcu_read_unlock();
> +	return false;
> +}
> +
> +u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
> +			  struct cftype *cft)
> +{
> +	struct task_group *tg = css_tg(css);
> +
> +	return !!tg->core_tagged;
> +}
> +
> +u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css,
> +				struct cftype *cft)
> +{
> +	struct task_group *tg = css_tg(css);
> +
> +	return tg->core_tag_color;
> +}
> +
> +#ifdef CONFIG_SCHED_DEBUG
> +u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css,
> +				   struct cftype *cft)
> +{
> +	return cpu_core_get_group_cookie(css_tg(css));
> +}
> +#endif
> +
> +struct write_core_tag {
> +	struct cgroup_subsys_state *css;
> +	unsigned long cookie;
> +};
> +
> +static int __sched_write_tag(void *data)
> +{
> +	struct write_core_tag *tag = (struct write_core_tag *) data;
> +	struct task_struct *p;
> +	struct cgroup_subsys_state *css;
> +
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(css, tag->css) {
> +		struct css_task_iter it;
> +
> +		css_task_iter_start(css, 0, &it);
> +		/*
> +		 * Note: css_task_iter_next will skip dying tasks.
> +		 * There could still be dying tasks left in the core queue
> +		 * when we set the cgroup tag to 0 after the loop below is done.
> +		 */
> +		while ((p = css_task_iter_next(&it)))
> +			sched_core_tag_requeue(p, tag->cookie, true /* group */);
> +
> +		css_task_iter_end(&it);
> +	}
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +
> +int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
> +			   u64 val)
> +{
> +	struct task_group *tg = css_tg(css);
> +	struct write_core_tag wtag;
> +
> +	if (val > 1)
> +		return -ERANGE;
> +
> +	if (!static_branch_likely(&sched_smt_present))
> +		return -EINVAL;
> +
> +	if (!tg->core_tagged && val) {
> +		/* Tag is being set. Check ancestors and descendants. */
> +		if (cpu_core_get_group_cookie(tg) ||
> +		    cpu_core_check_descendants(tg, true /* tag */, true /* color */))
> +			return -EBUSY;
> +	} else if (tg->core_tagged && !val) {
> +		/* Tag is being reset. Check descendants. */
> +		if (cpu_core_check_descendants(tg, true /* tag */, true /* color */))
> +			return -EBUSY;
> +	} else {
> +		return 0;
> +	}
> +
> +	if (!!val)
> +		sched_core_get();
> +
> +	wtag.css = css;
> +	wtag.cookie = (unsigned long)tg << 8; /* Reserve lower 8 bits for color. */
> +
> +	/* Truncate the upper 32-bits - those are used by the per-task cookie. */
> +	wtag.cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
> +
> +	tg->core_tagged = val;
> +
> +	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
> +	if (!val)
> +		sched_core_put();
> +
> +	return 0;
> +}
> +
> +int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
> +				 struct cftype *cft, u64 val)
> +{
> +	struct task_group *tg = css_tg(css);
> +	struct write_core_tag wtag;
> +	u64 cookie;
> +
> +	if (val > 255)
> +		return -ERANGE;
> +
> +	if (!static_branch_likely(&sched_smt_present))
> +		return -EINVAL;
> +
> +	cookie = cpu_core_get_group_cookie(tg);
> +	/* Can't set color if nothing in the ancestors was tagged. */
> +	if (!cookie)
> +		return -EINVAL;
> +
> +	/*
> +	 * Something in the ancestors already colors us. Can't change the color
> +	 * at this level.
> +	 */
> +	if (!tg->core_tag_color && (cookie & 255))
> +		return -EINVAL;
> +
> +	/*
> +	 * Check if any descendants are colored. If so, we can't recolor them.
> +	 * Don't need to check if descendants are tagged, since we don't allow
> +	 * tagging when already tagged.
> +	 */
> +	if (cpu_core_check_descendants(tg, false /* tag */, true /* color */))
> +		return -EINVAL;
> +
> +	cookie &= ~255;
> +	cookie |= val;
> +	wtag.css = css;
> +	wtag.cookie = cookie;
> +	tg->core_tag_color = val;
> +
> +	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
> +
> +	return 0;
> +}
> +
> +void sched_tsk_free(struct task_struct *tsk)
> +{
> +	if (!tsk->core_task_cookie)
> +		return;
> +	sched_core_put_task_cookie(tsk->core_task_cookie);
> +	sched_core_put();
> +}
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index aebeb91c4a0f..290a3b8be3d3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -437,6 +437,11 @@ struct task_group {
>  
>  };
>  
> +static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
> +{
> +	return css ? container_of(css, struct task_group, css) : NULL;
> +}
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD
>  
> @@ -1104,6 +1109,8 @@ static inline int cpu_of(struct rq *rq)
>  #ifdef CONFIG_SCHED_CORE
>  DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
>  
> +#define SCHED_CORE_GROUP_COOKIE_MASK ((1UL << (sizeof(unsigned long) * 4)) - 1)
> +
>  static inline bool sched_core_enabled(struct rq *rq)
>  {
>  	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
> @@ -1148,10 +1155,54 @@ static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
>  	return idle_core || rq->core->core_cookie == p->core_cookie;
>  }
>  
> -extern void queue_core_balance(struct rq *rq);
> +static inline bool sched_core_enqueued(struct task_struct *task)
> +{
> +	return !RB_EMPTY_NODE(&task->core_node);
> +}
> +
> +void queue_core_balance(struct rq *rq);
> +
> +void sched_core_enqueue(struct rq *rq, struct task_struct *p);
> +void sched_core_dequeue(struct rq *rq, struct task_struct *p);
> +void sched_core_get(void);
> +void sched_core_put(void);
> +
> +void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie,
> +			    bool group);
> +
> +int sched_core_share_pid(pid_t pid);
> +int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
> +
> +unsigned long cpu_core_get_group_cookie(struct task_group *tg);
> +
> +u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
> +			  struct cftype *cft);
> +
> +u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css,
> +				struct cftype *cft);
> +
> +#ifdef CONFIG_SCHED_DEBUG
> +u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css,
> +				   struct cftype *cft);
> +#endif
> +
> +int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
> +			   u64 val);
> +
> +int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
> +				 struct cftype *cft, u64 val);
> +
> +#ifndef TIF_UNSAFE_RET
> +#define TIF_UNSAFE_RET (0)
> +#endif
>  
>  #else /* !CONFIG_SCHED_CORE */
>  
> +static inline bool sched_core_enqueued(struct task_struct *task) { return false; }
> +static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
> +static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
> +static inline int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2) { return 0; }
> +
>  static inline bool sched_core_enabled(struct rq *rq)
>  {
>  	return false;
> @@ -2779,7 +2830,4 @@ void swake_up_all_locked(struct swait_queue_head *q);
>  void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
>  
>  #ifdef CONFIG_SCHED_CORE
> -#ifndef TIF_UNSAFE_RET
> -#define TIF_UNSAFE_RET (0)
> -#endif
>  #endif
> 
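
As an aside, for anyone wanting to poke at the per-task interface from
userspace, a minimal sketch could look like the below. The
PR_SCHED_CORE_SHARE value here is purely illustrative (the real one comes
from the patch's uapi header), and per sched_core_share_pid() above,
pid == 0 resets the calling task's cookie while passing the caller's own
pid hits the t1 == t2 case and assigns a fresh cookie.

#include <stdio.h>
#include <sys/prctl.h>
#include <sys/types.h>

#ifndef PR_SCHED_CORE_SHARE
#define PR_SCHED_CORE_SHARE	59	/* illustrative value only */
#endif

/* Share a core-scheduling cookie between the calling task and 'pid'. */
static int share_core_with(pid_t pid)
{
	if (prctl(PR_SCHED_CORE_SHARE, (unsigned long)pid, 0UL, 0UL, 0UL)) {
		perror("PR_SCHED_CORE_SHARE");
		return -1;
	}
	return 0;
}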


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.
  2020-10-23 21:31           ` Joel Fernandes
@ 2020-10-26  8:28             ` Peter Zijlstra
  2020-10-27 16:58               ` Joel Fernandes
  2020-10-26  9:31             ` Peter Zijlstra
  1 sibling, 1 reply; 98+ messages in thread
From: Peter Zijlstra @ 2020-10-26  8:28 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aaron Lu, Aubrey Li, Paul E. McKenney, Tim Chen

On Fri, Oct 23, 2020 at 05:31:18PM -0400, Joel Fernandes wrote:
> BTW, as further optimization in the future, isn't it better for the
> schedule() loop on 1 HT to select for all HT *even if* need_sync == false to
> begin with?  i.e. no cookied tasks are runnable.
> 
> That way the pick loop in schedule() running on other HTs can directly pick
> what was pre-selected for it via:
>         if (rq->core->core_pick_seq == rq->core->core_task_seq &&
>             rq->core->core_pick_seq != rq->core_sched_seq &&
>             rq->core_pick)
> .. which I think is more efficient. It's just a thought and may not be worth doing.

I'm not sure that works. Imagine a sibling doing a wakeup (or sleep)
just after you've done your core-wide pick. Then it will have to repick and
you end up having to do 2*nr_smt picks instead of 2 picks.
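(On SMT-2 that is 4 picks instead of 2, and on SMT-4 it is 8 instead of 2.)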



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-24 12:27               ` Vineeth Pillai
  2020-10-24 23:48                 ` Li, Aubrey
@ 2020-10-26  9:01                 ` Peter Zijlstra
  2020-10-27  3:17                   ` Li, Aubrey
  2020-10-27 14:19                   ` Joel Fernandes
  2020-10-27 14:14                 ` Joel Fernandes
  2 siblings, 2 replies; 98+ messages in thread
From: Peter Zijlstra @ 2020-10-26  9:01 UTC (permalink / raw)
  To: Vineeth Pillai
  Cc: Li, Aubrey, Joel Fernandes, Nishanth Aravamudan,
	Julien Desfossez, Tim Chen, Aaron Lu, Aubrey Li, Thomas Glexiner,
	LKML, Ingo Molnar, Linus Torvalds, Frederic Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Paul E. McKenney,
	Tim Chen, Ning, Hongyu

On Sat, Oct 24, 2020 at 08:27:16AM -0400, Vineeth Pillai wrote:
> 
> 
> On 10/24/20 7:10 AM, Vineeth Pillai wrote:
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 93a3b874077d..4cae5ac48b60 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4428,12 +4428,14 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct
> > sched_entity *curr)
> >                         se = second;
> >         }
> > 
> > -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) <
> > 1) {
> > +       if (left && cfs_rq->next &&
> > +                       wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> >                 /*
> >                  * Someone really wants this to run. If it's not unfair,
> > run it.
> >                  */
> >                 se = cfs_rq->next;
> > -       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last,
> > left) < 1) {
> > +       } else if (left && cfs_rq->last &&
> > +                       wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> >                 /*
> >                  * Prefer last buddy, try to return the CPU to a
> > preempted task.
> > 
> > 
> > The reason for left being NULL needs to be investigated. This was
> > there from v1 and we did not yet get to it. I shall try to debug later
> > this week.
> 
> Thinking more about it and looking at the crash, I think that
> 'left == NULL' can happen in pick_next_entity for core scheduling.
> If a cfs_rq has only one task that is running, then it will be
> dequeued and 'left = __pick_first_entity()' will be NULL as the
> cfs_rq will be empty. This would not happen outside of coresched
> because we never call pick_task() before put_prev_task() which
> will enqueue the task back.
> 
> With core scheduling, a cpu can call pick_task() for its sibling while
> the sibling is still running the active task and put_prev_task() has not
> yet been called. This can result in 'left == NULL'.

Quite correct. Hurmph.. the reason we do this is because... we do the
update_curr() the wrong way around. And I can't seem to remember why we
do that (it was in my original patches).

Something like so seems the obvious thing to do, but I can't seem to
remember why we're not doing it :-(

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6950,15 +6950,10 @@ static struct task_struct *pick_task_fai
 	do {
 		struct sched_entity *curr = cfs_rq->curr;
 
-		se = pick_next_entity(cfs_rq, NULL);
+		if (curr && curr->on_rq)
+			update_curr(cfs_rq);
 
-		if (curr) {
-			if (se && curr->on_rq)
-				update_curr(cfs_rq);
-
-			if (!se || entity_before(curr, se))
-				se = curr;
-		}
+		se = pick_next_entity(cfs_rq, curr);
 
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.
  2020-10-23 21:31           ` Joel Fernandes
  2020-10-26  8:28             ` Peter Zijlstra
@ 2020-10-26  9:31             ` Peter Zijlstra
  2020-11-05 18:50               ` Joel Fernandes
  1 sibling, 1 reply; 98+ messages in thread
From: Peter Zijlstra @ 2020-10-26  9:31 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aaron Lu, Aubrey Li, Paul E. McKenney, Tim Chen

On Fri, Oct 23, 2020 at 05:31:18PM -0400, Joel Fernandes wrote:
> On Fri, Oct 23, 2020 at 09:26:54PM +0200, Peter Zijlstra wrote:

> > How about this then?
> 
> This does look better. It makes sense and I think it will work. I will look
> more into it and also test it.

Hummm... Looking at it again I wonder if I can make something like the
below work.

(depends on the next patch that pulls core_forceidle into core-wide
state)

That would retain the CFS-cgroup optimization as well, for as long as
there are no cookies around.

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4691,8 +4691,6 @@ pick_next_task(struct rq *rq, struct tas
 		return next;
 	}
 
-	put_prev_task_balance(rq, prev, rf);
-
 	smt_mask = cpu_smt_mask(cpu);
 
 	/*
@@ -4707,14 +4705,25 @@ pick_next_task(struct rq *rq, struct tas
 	 */
 	rq->core->core_task_seq++;
 	need_sync = !!rq->core->core_cookie;
-
-	/* reset state */
-reset:
-	rq->core->core_cookie = 0UL;
 	if (rq->core->core_forceidle) {
 		need_sync = true;
 		rq->core->core_forceidle = false;
 	}
+
+	if (!need_sync) {
+		next = __pick_next_task(rq, prev, rf);
+		if (!next->core_cookie) {
+			rq->core_pick = NULL;
+			return next;
+		}
+		put_prev_task(rq, next);
+		need_sync = true;
+	} else {
+		put_prev_task_balance(rq, prev, rf);
+	}
+
+	/* reset state */
+	rq->core->core_cookie = 0UL;
 	for_each_cpu(i, smt_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
@@ -4744,35 +4752,8 @@ pick_next_task(struct rq *rq, struct tas
 			 * core.
 			 */
 			p = pick_task(rq_i, class, max);
-			if (!p) {
-				/*
-				 * If there weren't no cookies; we don't need to
-				 * bother with the other siblings.
-				 */
-				if (i == cpu && !need_sync)
-					goto next_class;
-
+			if (!p)
 				continue;
-			}
-
-			/*
-			 * Optimize the 'normal' case where there aren't any
-			 * cookies and we don't need to sync up.
-			 */
-			if (i == cpu && !need_sync) {
-				if (p->core_cookie) {
-					/*
-					 * This optimization is only valid as
-					 * long as there are no cookies
-					 * involved.
-					 */
-					need_sync = true;
-					goto reset;
-				}
-
-				next = p;
-				goto done;
-			}
 
 			rq_i->core_pick = p;
 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 03/26] sched: Core-wide rq->lock
  2020-10-20  1:43 ` [PATCH v8 -tip 03/26] sched: Core-wide rq->lock Joel Fernandes (Google)
@ 2020-10-26 11:59   ` Peter Zijlstra
  2020-10-27 16:27     ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Peter Zijlstra @ 2020-10-26 11:59 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aubrey Li, Paul E. McKenney, Tim Chen

On Mon, Oct 19, 2020 at 09:43:13PM -0400, Joel Fernandes (Google) wrote:
> +	if (!core_rq) {
> +		for_each_cpu(i, smt_mask) {
> +			rq = cpu_rq(i);
> +			if (rq->core && rq->core == rq)
> +				core_rq = rq;
> +			init_sched_core_irq_work(rq);

That function doesn't exist quite yet.

> +		}

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-10-20  1:43 ` [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle Joel Fernandes (Google)
@ 2020-10-26 12:47   ` Peter Zijlstra
  2020-10-28 15:29     ` Joel Fernandes
                       ` (3 more replies)
  0 siblings, 4 replies; 98+ messages in thread
From: Peter Zijlstra @ 2020-10-26 12:47 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Mon, Oct 19, 2020 at 09:43:18PM -0400, Joel Fernandes (Google) wrote:

> @@ -4723,6 +4714,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  			update_rq_clock(rq_i);
>  	}
>  
> +	/* Reset the snapshot if core is no longer in force-idle. */
> +	if (!fi_before) {
> +		for_each_cpu(i, smt_mask) {
> +			struct rq *rq_i = cpu_rq(i);
> +			rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
> +		}
> +	}

So this is the thing that drags vruntime_fi along when (both?) siblings
are active, right? But should we not do that after pick? Consider 2
tasks a weight 1 and a weight 10 task, one for each sibling. By syncing
the vruntime before picking, the cfs_prio_less() loop will not be able
to distinguish between these two, since they'll both have effectively
the same lag.

If however, you sync after pick, then the weight 1 task will have accrued
far more runtime than the weight 10 task, and consequently the weight 10
task will have preference when a decision will have to be made.
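Concretely: since vruntime advances at roughly delta_exec * NICE_0_LOAD /
weight, the weight-1 task's vruntime moves about 10x as fast over the same
interval, and it is exactly that accrued difference the early sync erases.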

(also, if this were the right place, the whole thing should've been part
of the for_each_cpu() loop right before this)

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 56bea0decda1..9cae08c3fca1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10686,6 +10686,46 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
>  	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
>  		resched_curr(rq);
>  }
> +
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +	bool samecpu = task_cpu(a) == task_cpu(b);
> +	struct sched_entity *sea = &a->se;
> +	struct sched_entity *seb = &b->se;
> +	struct cfs_rq *cfs_rqa;
> +	struct cfs_rq *cfs_rqb;
> +	s64 delta;
> +
> +	if (samecpu) {
> +		/* vruntime is per cfs_rq */
> +		while (!is_same_group(sea, seb)) {
> +			int sea_depth = sea->depth;
> +			int seb_depth = seb->depth;
> +			if (sea_depth >= seb_depth)
> +				sea = parent_entity(sea);
> +			if (sea_depth <= seb_depth)
> +				seb = parent_entity(seb);
> +		}
> +
> +		delta = (s64)(sea->vruntime - seb->vruntime);
> +		goto out;
> +	}
> +
> +	/* crosscpu: compare root level se's vruntime to decide priority */
> +	while (sea->parent)
> +		sea = sea->parent;
> +	while (seb->parent)
> +		seb = seb->parent;

This seems unfortunate, I think we can do better.

> +
> +	cfs_rqa = sea->cfs_rq;
> +	cfs_rqb = seb->cfs_rq;
> +
> +	/* normalize vruntime WRT their rq's base */
> +	delta = (s64)(sea->vruntime - seb->vruntime) +
> +		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
> +out:
> +	return delta > 0;
> +}


How's something like this?

 - after each pick, such that the pick itself sees the divergence (see
   above); either:

    - pull the vruntime_fi forward, when !fi
    - freeze the vruntime_fi, when newly fi    (A)

 - either way, update vruntime_fi for each cfs_rq in the active
   hierarchy.

 - when comparing, and fi, update the vruntime_fi hierarchy until we
   encounter a mark from (A), per doing it during the pick, but before
   runtime, this guarantees it hasn't moved since (A).

XXX, still buggered on SMT>2, imagine having {ta, tb, fi, i} on an SMT4,
then when comparing any two tasks that do not involve the fi, we should
(probably) have pulled them fwd -- but we can't actually pull them,
because then the fi thing would break, mooo.


--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -115,19 +115,8 @@ static inline bool prio_less(struct task
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -4642,12 +4631,15 @@ pick_task(struct rq *rq, const struct sc
 	return cookie_pick;
 }
 
+extern void task_vruntime_update(struct rq *rq, struct task_struct *p);
+
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *next, *max = NULL;
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
+	bool fi_before = false;
 	bool need_sync;
 	int i, j, cpu;
 
@@ -4707,6 +4699,7 @@ pick_next_task(struct rq *rq, struct tas
 	need_sync = !!rq->core->core_cookie;
 	if (rq->core->core_forceidle) {
 		need_sync = true;
+		fi_before = true;
 		rq->core->core_forceidle = false;
 	}
 
@@ -4757,6 +4750,11 @@ pick_next_task(struct rq *rq, struct tas
 				continue;
 
 			rq_i->core_pick = p;
+			if (rq_i->idle == p && rq_i->nr_running) {
+				rq->core->core_forceidle = true;
+				if (!fi_before)
+					rq->core->core_forceidle_seq++;
+			}
 
 			/*
 			 * If this new candidate is of higher priority than the
@@ -4775,6 +4773,7 @@ pick_next_task(struct rq *rq, struct tas
 				max = p;
 
 				if (old_max) {
+					rq->core->core_forceidle = false;
 					for_each_cpu(j, smt_mask) {
 						if (j == i)
 							continue;
@@ -4823,10 +4822,8 @@ pick_next_task(struct rq *rq, struct tas
 		if (!rq_i->core_pick)
 			continue;
 
-		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
-		    !rq_i->core->core_forceidle) {
-			rq_i->core->core_forceidle = true;
-		}
+		if (!(fi_before && rq->core->core_forceidle))
+			task_vruntime_update(rq_i, rq_i->core_pick);
 
 		if (i == cpu) {
 			rq_i->core_pick = NULL;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10686,6 +10686,67 @@ static inline void task_tick_core(struct
 	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
 		resched_curr(rq);
 }
+
+static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forceidle)
+{
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		if (forceidle) {
+			if (cfs_rq->forceidle_seq == fi_seq)
+				break;
+			cfs_rq->forceidle_seq = fi_seq;
+		}
+
+		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
+	}
+}
+
+void task_vruntime_update(struct rq *rq, struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+
+	if (p->sched_class != &fair_sched_class)
+		return;
+
+	se_fi_update(se, rq->core->core_forceidle_seq, rq->core->core_forceidle);
+}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct rq *rq = task_rq(a);
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	struct cfs_rq *cfs_rqa;
+	struct cfs_rq *cfs_rqb;
+	s64 delta;
+
+	SCHED_WARN_ON(task_rq(b)->core != rq->core);
+
+	while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
+		int sea_depth = sea->depth;
+		int seb_depth = seb->depth;
+
+		if (sea_depth >= seb_depth)
+			sea = parent_entity(sea);
+		if (sea_depth <= seb_depth)
+			seb = parent_entity(seb);
+	}
+
+	if (rq->core->core_forceidle) {
+		se_fi_update(sea, rq->core->core_forceidle_seq, true);
+		se_fi_update(seb, rq->core->core_forceidle_seq, true);
+	}
+
+	cfs_rqa = sea->cfs_rq;
+	cfs_rqb = seb->cfs_rq;
+
+	/* normalize vruntime WRT their rq's base */
+	delta = (s64)(sea->vruntime - seb->vruntime) +
+		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
+
+	return delta > 0;
+}
 #else
 static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
 #endif
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -522,6 +522,11 @@ struct cfs_rq {
 	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
 
+#ifdef CONFIG_SCHED_CORE
+	unsigned int		forceidle_seq;
+	u64			min_vruntime_fi;
+#endif
+
 	u64			exec_clock;
 	u64			min_vruntime;
 #ifndef CONFIG_64BIT
@@ -1061,7 +1066,8 @@ struct rq {
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
-	unsigned char		core_forceidle;
+	unsigned int		core_forceidle;
+	unsigned int		core_forceidle_seq;
 #endif
 };
 
@@ -1106,6 +1112,8 @@ static inline raw_spinlock_t *rq_lockp(s
 	return &rq->__lock;
 }
 
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-26  9:01                 ` Peter Zijlstra
@ 2020-10-27  3:17                   ` Li, Aubrey
  2020-10-27 14:19                   ` Joel Fernandes
  1 sibling, 0 replies; 98+ messages in thread
From: Li, Aubrey @ 2020-10-27  3:17 UTC (permalink / raw)
  To: Peter Zijlstra, Vineeth Pillai
  Cc: Joel Fernandes, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Aaron Lu, Aubrey Li, Thomas Glexiner, LKML, Ingo Molnar,
	Linus Torvalds, Frederic Weisbecker, Kees Cook, Greg Kerr,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, Dario Faggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Paul E. McKenney,
	Tim Chen, Ning, Hongyu

On 2020/10/26 17:01, Peter Zijlstra wrote:
> On Sat, Oct 24, 2020 at 08:27:16AM -0400, Vineeth Pillai wrote:
>>
>>
>> On 10/24/20 7:10 AM, Vineeth Pillai wrote:
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 93a3b874077d..4cae5ac48b60 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4428,12 +4428,14 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct
>>> sched_entity *curr)
>>>                         se = second;
>>>         }
>>>
>>> -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) <
>>> 1) {
>>> +       if (left && cfs_rq->next &&
>>> +                       wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>>                 /*
>>>                  * Someone really wants this to run. If it's not unfair,
>>> run it.
>>>                  */
>>>                 se = cfs_rq->next;
>>> -       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last,
>>> left) < 1) {
>>> +       } else if (left && cfs_rq->last &&
>>> +                       wakeup_preempt_entity(cfs_rq->last, left) < 1) {
>>>                 /*
>>>                  * Prefer last buddy, try to return the CPU to a
>>> preempted task.
>>>
>>>
>>> The reason for left being NULL needs to be investigated. This was
>>> there from v1 and we did not yet get to it. I shall try to debug later
>>> this week.
>>
>> Thinking more about it and looking at the crash, I think that
>> 'left == NULL' can happen in pick_next_entity for core scheduling.
>> If a cfs_rq has only one task that is running, then it will be
>> dequeued and 'left = __pick_first_entity()' will be NULL as the
>> cfs_rq will be empty. This would not happen outside of coresched
>> because we never call pick_task() before put_prev_task() which
>> will enqueue the task back.
>>
>> With core scheduling, a cpu can call pick_task() for its sibling while
>> the sibling is still running the active task and put_prev_task() has not
>> yet been called. This can result in 'left == NULL'.
> 
> Quite correct. Hurmph.. the reason we do this is because... we do the
> update_curr() the wrong way around. And I can't seem to remember why we
> do that (it was in my original patches).
> 
> Something like so seems the obvious thing to do, but I can't seem to
> remember why we're not doing it :-(
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6950,15 +6950,10 @@ static struct task_struct *pick_task_fai
>  	do {
>  		struct sched_entity *curr = cfs_rq->curr;
>  
> -		se = pick_next_entity(cfs_rq, NULL);
> +		if (curr && curr->on_rq)
> +			update_curr(cfs_rq);
>  
> -		if (curr) {
> -			if (se && curr->on_rq)
> -				update_curr(cfs_rq);
> -
> -			if (!se || entity_before(curr, se))
> -				se = curr;
> -		}
> +		se = pick_next_entity(cfs_rq, curr);
>  
>  		cfs_rq = group_cfs_rq(se);
>  	} while (cfs_rq);
> 

This patch works too for my benchmark, thanks Peter!

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-24 12:27               ` Vineeth Pillai
  2020-10-24 23:48                 ` Li, Aubrey
  2020-10-26  9:01                 ` Peter Zijlstra
@ 2020-10-27 14:14                 ` Joel Fernandes
  2 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-10-27 14:14 UTC (permalink / raw)
  To: Vineeth Pillai
  Cc: Li, Aubrey, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Aaron Lu, Aubrey Li, Thomas Glexiner,
	LKML, Ingo Molnar, Linus Torvalds, Frederic Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Paul E. McKenney,
	Tim Chen, Ning, Hongyu

On Sat, Oct 24, 2020 at 08:27:16AM -0400, Vineeth Pillai wrote:
> 
> 
> On 10/24/20 7:10 AM, Vineeth Pillai wrote:
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 93a3b874077d..4cae5ac48b60 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4428,12 +4428,14 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct
> > sched_entity *curr)
> >                         se = second;
> >         }
> > 
> > -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) <
> > 1) {
> > +       if (left && cfs_rq->next &&
> > +                       wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> >                 /*
> >                  * Someone really wants this to run. If it's not unfair,
> > run it.
> >                  */
> >                 se = cfs_rq->next;
> > -       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last,
> > left) < 1) {
> > +       } else if (left && cfs_rq->last &&
> > +                       wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> >                 /*
> >                  * Prefer last buddy, try to return the CPU to a
> > preempted task.
> > 
> > 
> > The reason for left being NULL needs to be investigated. This was
> > there from v1 and we did not yet get to it. I shall try to debug later
> > this week.
> 
> Thinking more about it and looking at the crash, I think that
> 'left == NULL' can happen in pick_next_entity for core scheduling.
> If a cfs_rq has only one task that is running, then it will be
> dequeued and 'left = __pick_first_entity()' will be NULL as the
> cfs_rq will be empty. This would not happen outside of coresched
> because we never call pick_task() before put_prev_task() which
> will enqueue the task back.
> 
> With core scheduling, a cpu can call pick_task() for its sibling while
> the sibling is still running the active task and put_prev_task() has not
> yet been called. This can result in 'left == NULL'. So I think the
> above fix is appropriate when core scheduling is active. It could be
> cleaned up a bit though.

Thanks a lot Vineeth!
Just to add, pick_next_entity() returning NULL still causes pick_task_fair()
to behave correctly, because pick_task_fair() will return 'curr' to the
core-wide picking code, not NULL. The problem is that CFS pick_next_entity()
is not able to handle it and crashes, as Vineeth found the below hunk got
lost in the rebase:

@@ -4464,13 +4464,13 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	/*
 	 * Prefer last buddy, try to return the CPU to a preempted task.
 	 */
-	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+	if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
 		se = cfs_rq->last;

 	/*
 	 * Someone really wants this to run. If it's not unfair, run it.
 	 */
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+	if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
 		se = cfs_rq->next;

 	clear_buddies(cfs_rq, se);

Peter's fix in the other email looks good and I will include that for testing
before the next posting.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-26  9:01                 ` Peter Zijlstra
  2020-10-27  3:17                   ` Li, Aubrey
@ 2020-10-27 14:19                   ` Joel Fernandes
  2020-10-27 15:23                     ` Joel Fernandes
  1 sibling, 1 reply; 98+ messages in thread
From: Joel Fernandes @ 2020-10-27 14:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vineeth Pillai, Li, Aubrey, Nishanth Aravamudan,
	Julien Desfossez, Tim Chen, Aaron Lu, Aubrey Li, Thomas Glexiner,
	LKML, Ingo Molnar, Linus Torvalds, Frederic Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Paul E. McKenney,
	Tim Chen, Ning, Hongyu

On Mon, Oct 26, 2020 at 10:01:31AM +0100, Peter Zijlstra wrote:
> On Sat, Oct 24, 2020 at 08:27:16AM -0400, Vineeth Pillai wrote:
> > 
> > 
> > On 10/24/20 7:10 AM, Vineeth Pillai wrote:
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 93a3b874077d..4cae5ac48b60 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4428,12 +4428,14 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct
> > > sched_entity *curr)
> > >                         se = second;
> > >         }
> > > 
> > > -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) <
> > > 1) {
> > > +       if (left && cfs_rq->next &&
> > > +                       wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > >                 /*
> > >                  * Someone really wants this to run. If it's not unfair,
> > > run it.
> > >                  */
> > >                 se = cfs_rq->next;
> > > -       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last,
> > > left) < 1) {
> > > +       } else if (left && cfs_rq->last &&
> > > +                       wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> > >                 /*
> > >                  * Prefer last buddy, try to return the CPU to a
> > > preempted task.
> > > 
> > > 
> > > The reason for left being NULL needs to be investigated. This was
> > > there from v1 and we did not yet get to it. I shall try to debug later
> > > this week.
> > 
> > Thinking more about it and looking at the crash, I think that
> > 'left == NULL' can happen in pick_next_entity for core scheduling.
> > If a cfs_rq has only one task that is running, then it will be
> > dequeued and 'left = __pick_first_entity()' will be NULL as the
> > cfs_rq will be empty. This would not happen outside of coresched
> > because we never call pick_task() before put_prev_task() which
> > will enqueue the task back.
> > 
> > With core scheduling, a cpu can call pick_task() for its sibling while
> > the sibling is still running the active task and put_prev_task() has not
> > yet been called. This can result in 'left == NULL'.
> 
> Quite correct. Hurmph.. the reason we do this is because... we do the
> update_curr() the wrong way around. And I can't seem to remember why we
> do that (it was in my original patches).
> 
> Something like so seems the obvious thing to do, but I can't seem to
> remember why we're not doing it :-(

The code below is just a refactor and not a functional change though, right?

i.e. pick_next_entity() is already returning se = curr, if se == NULL.

But the advantage of your refactor is it doesn't crash the kernel.

So your change appears safe to me unless I missed something.

thanks,

 - Joel


> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6950,15 +6950,10 @@ static struct task_struct *pick_task_fai
>  	do {
>  		struct sched_entity *curr = cfs_rq->curr;
>  
> -		se = pick_next_entity(cfs_rq, NULL);
> +		if (curr && curr->on_rq)
> +			update_curr(cfs_rq);
>  
> -		if (curr) {
> -			if (se && curr->on_rq)
> -				update_curr(cfs_rq);
> -
> -			if (!se || entity_before(curr, se))
> -				se = curr;
> -		}
> +		se = pick_next_entity(cfs_rq, curr);
>  
>  		cfs_rq = group_cfs_rq(se);
>  	} while (cfs_rq);

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
  2020-10-27 14:19                   ` Joel Fernandes
@ 2020-10-27 15:23                     ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-10-27 15:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vineeth Pillai, Li, Aubrey, Nishanth Aravamudan,
	Julien Desfossez, Tim Chen, Aaron Lu, Aubrey Li, Thomas Glexiner,
	LKML, Ingo Molnar, Linus Torvalds, Frederic Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Paul E. McKenney,
	Tim Chen, Ning, Hongyu

On Tue, Oct 27, 2020 at 10:19:11AM -0400, Joel Fernandes wrote:
> On Mon, Oct 26, 2020 at 10:01:31AM +0100, Peter Zijlstra wrote:
> > On Sat, Oct 24, 2020 at 08:27:16AM -0400, Vineeth Pillai wrote:
> > > 
> > > 
> > > On 10/24/20 7:10 AM, Vineeth Pillai wrote:
> > > > 
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 93a3b874077d..4cae5ac48b60 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -4428,12 +4428,14 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct
> > > > sched_entity *curr)
> > > >                         se = second;
> > > >         }
> > > > 
> > > > -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) <
> > > > 1) {
> > > > +       if (left && cfs_rq->next &&
> > > > +                       wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > > >                 /*
> > > >                  * Someone really wants this to run. If it's not unfair,
> > > > run it.
> > > >                  */
> > > >                 se = cfs_rq->next;
> > > > -       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last,
> > > > left) < 1) {
> > > > +       } else if (left && cfs_rq->last &&
> > > > +                       wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> > > >                 /*
> > > >                  * Prefer last buddy, try to return the CPU to a
> > > > preempted task.
> > > > 
> > > > 
> > > > The reason for left being NULL needs to be investigated. This was
> > > > there from v1 and we did not yet get to it. I shall try to debug later
> > > > this week.
> > > 
> > > Thinking more about it and looking at the crash, I think that
> > > 'left == NULL' can happen in pick_next_entity for core scheduling.
> > > If a cfs_rq has only one task that is running, then it will be
> > > dequeued and 'left = __pick_first_entity()' will be NULL as the
> > > cfs_rq will be empty. This would not happen outside of coresched
> > > because we never call pick_task() before put_prev_task() which
> > > will enqueue the task back.
> > > 
> > > With core scheduling, a cpu can call pick_task() for its sibling while
> > > the sibling is still running the active task and put_prev_task() has not
> > > yet been called. This can result in 'left == NULL'.
> > 
> > Quite correct. Hurmph.. the reason we do this is because... we do the
> > update_curr() the wrong way around. And I can't seem to remember why we
> > do that (it was in my original patches).
> > 
> > Something like so seems the obvious thing to do, but I can't seem to
> > remember why we're not doing it :-(
> 
> The code below is just a refactor and not a functional change though, right?
> 
> i.e. pick_next_entity() is already returning se = curr, if se == NULL.
> 
> But the advantage of your refactor is it doesn't crash the kernel.
> 
> So your change appears safe to me unless I missed something.

I included it as a patch appended below for testing; hopefully the commit
message is appropriate.

On a related note, this pattern is very similar to pick_next_task_fair()'s
!simple case. Over there it does check_cfs_rq_runtime() for throttling the
cfs_rq. Should we also be doing that in pick_task_fair()?
This bit:
                        /*
                         * This call to check_cfs_rq_runtime() will do the
                         * throttle and dequeue its entity in the parent(s).
                         * Therefore the nr_running test will indeed
                         * be correct.
                         */
                        if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
                                cfs_rq = &rq->cfs;

                                if (!cfs_rq->nr_running)
                                        goto idle;

                                goto simple;
                        }

---8<-----------------------

From: Peter Zijlstra <peterz@infradead.org>
Subject: [PATCH] sched/fair: Fix pick_task_fair crashes due to empty rbtree

pick_next_entity() is passed curr == NULL during core-scheduling. Due to
this, if the rbtree is empty, the 'left' variable is set to NULL within
the function. This can cause crashes within the function.

This is not an issue if put_prev_task() is invoked on the currently
running task before calling pick_next_entity(). However, in core
scheduling, it is possible that a sibling CPU picks for another RQ in
the core, via pick_task_fair(). This remote sibling would not get any
opportunities to do a put_prev_task().

Fix it by refactoring pick_task_fair() such that pick_next_entity() is
called with the cfs_rq->curr. This will prevent pick_next_entity() from
crashing if its rbtree is empty.

Suggested-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 93a3b874077d..591859016263 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6975,15 +6975,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 	do {
 		struct sched_entity *curr = cfs_rq->curr;
 
-		se = pick_next_entity(cfs_rq, NULL);
-
-		if (curr) {
-			if (se && curr->on_rq)
-				update_curr(cfs_rq);
+		if (curr && curr->on_rq)
+			update_curr(cfs_rq);
 
-			if (!se || entity_before(curr, se))
-				se = curr;
-		}
+		se = pick_next_entity(cfs_rq, curr);
 
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
-- 
2.29.0.rc2.309.g374f81d7ae-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 03/26] sched: Core-wide rq->lock
  2020-10-26 11:59   ` Peter Zijlstra
@ 2020-10-27 16:27     ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-10-27 16:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aubrey Li, Paul E. McKenney, Tim Chen

On Mon, Oct 26, 2020 at 12:59:27PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 09:43:13PM -0400, Joel Fernandes (Google) wrote:
> > +	if (!core_rq) {
> > +		for_each_cpu(i, smt_mask) {
> > +			rq = cpu_rq(i);
> > +			if (rq->core && rq->core == rq)
> > +				core_rq = rq;
> > +			init_sched_core_irq_work(rq);
> 
> That function doesn't exist quite yet.

Moved to the appropriate patch, sorry and thank you!

 - Joel



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.
  2020-10-26  8:28             ` Peter Zijlstra
@ 2020-10-27 16:58               ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-10-27 16:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aaron Lu, Aubrey Li, Paul E. McKenney, Tim Chen

On Mon, Oct 26, 2020 at 09:28:14AM +0100, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 05:31:18PM -0400, Joel Fernandes wrote:
> > BTW, as further optimization in the future, isn't it better for the
> > schedule() loop on 1 HT to select for all HT *even if* need_sync == false to
> > begin with?  i.e. no cookied tasks are runnable.
> > 
> > That way the pick loop in schedule() running on other HTs can directly pick
> > what was pre-selected for it via:
> >         if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> >             rq->core->core_pick_seq != rq->core_sched_seq &&
> >             rq->core_pick)
> > .. which I think is more efficient. It's just a thought and may not be worth doing.
> 
> I'm not sure that works. Imagine a sibling doing a wakeup (or sleep)
> just after you done your core wide pick. Then it will have to repick and
> you end up with having to do 2*nr_smt picks instead of 2 picks.

For a workload that has mostly runnable tasks (not doing a lot of wakeup
/ sleep), maybe it makes sense.

Also, if you have only cookied tasks and they are doing wake up / sleep, then
you have 2*nr_smt picks anyway as the core picks constantly get invalidated,
AFAICS.

I guess in the current code, the assumptions are:
1. Most tasks are not cookied tasks
2. They can wake up and sleep a lot

I guess those are OK assumptions though, but maybe we could document them.

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-10-26 12:47   ` Peter Zijlstra
@ 2020-10-28 15:29     ` Joel Fernandes
  2020-10-28 18:39     ` Joel Fernandes
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-10-28 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Paul E. McKenney, Tim Chen

Hi Peter,

I am still working on understanding your approach and will reply soon, but I
just wanted to clarify the question on my approach:

On Mon, Oct 26, 2020 at 01:47:24PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 09:43:18PM -0400, Joel Fernandes (Google) wrote:
> 
> > @@ -4723,6 +4714,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  			update_rq_clock(rq_i);
> >  	}
> >  
> > +	/* Reset the snapshot if core is no longer in force-idle. */
> > +	if (!fi_before) {
> > +		for_each_cpu(i, smt_mask) {
> > +			struct rq *rq_i = cpu_rq(i);
> > +			rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
> > +		}
> > +	}
> 
> So this is the thing that drags vruntime_fi along when (both?) siblings
> are active, right? But should we not do that after pick? Consider 2
> tasks a weight 1 and a weight 10 task, one for each sibling. By syncing
> the vruntime before picking, the cfs_prio_less() loop will not be able
> to distinguish between these two, since they'll both have effectively
> the same lag.

Actually the snapshot I take is local to the rq's own business; it is not
core-wide. So there's no core-wide sync in the patch.

So for each rq, I'm assigning the rq's ->min_vruntime_fi to its own
->min_vruntime.

And then later in cfs_prio_less(), I'm normalizing the task's ->vruntime with
respect to the ->min_vruntime_fi and using that delta for comparison. So each
rq is really dealing with its own lag; it's not synced.
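
In other words, the cross-cpu comparison boils down to (sketch only, same
math as in the patch just rearranged, shown at the root level):

	s64 lag_a = (s64)(sea->vruntime - cfs_rqa->min_vruntime_fi);
	s64 lag_b = (s64)(seb->vruntime - cfs_rqb->min_vruntime_fi);

	/* true => 'a' accumulated more lag since its rq's snapshot */
	return lag_a > lag_b;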

As for why it is done before: if we are not in force-idle any more, then
every time we pick, the rq's min_vruntime_fi will be assigned its
->min_vruntime, which is used later as the baseline for that particular rq
in cfs_prio_less(). So it has to be done before, at least in my
approach. This is no different from not having this patch.

However, if we are in force-idle, then the min_vruntime_fi stays put on all
siblings so that the delta continues to grow on all cpus *since* the force
idle started.

The reason I also do it after is that if the core just entered force idle but
wasn't previously in it, then we take a snapshot at that point and continue
to use that snapshot in future selections as long as the core is in force
idle.

This approach may have some drawbacks but it reduces the task latencies quite
a bit in my testing.

Thoughts?

thanks,

 - Joel


> If however, you sync after pick, then the weight 1 task will have accrued
> far more runtime than the weight 10 task, and consequently the weight 10
> task will have preference when a decision will have to be made.
> 
> (also, if this were the right place, the whole thing should've been part
> of the for_each_cpu() loop right before this)

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 56bea0decda1..9cae08c3fca1 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10686,6 +10686,46 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
> >  	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
> >  		resched_curr(rq);
> >  }
> > +
> > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> > +{
> > +	bool samecpu = task_cpu(a) == task_cpu(b);
> > +	struct sched_entity *sea = &a->se;
> > +	struct sched_entity *seb = &b->se;
> > +	struct cfs_rq *cfs_rqa;
> > +	struct cfs_rq *cfs_rqb;
> > +	s64 delta;
> > +
> > +	if (samecpu) {
> > +		/* vruntime is per cfs_rq */
> > +		while (!is_same_group(sea, seb)) {
> > +			int sea_depth = sea->depth;
> > +			int seb_depth = seb->depth;
> > +			if (sea_depth >= seb_depth)
> > +				sea = parent_entity(sea);
> > +			if (sea_depth <= seb_depth)
> > +				seb = parent_entity(seb);
> > +		}
> > +
> > +		delta = (s64)(sea->vruntime - seb->vruntime);
> > +		goto out;
> > +	}
> > +
> > +	/* crosscpu: compare root level se's vruntime to decide priority */
> > +	while (sea->parent)
> > +		sea = sea->parent;
> > +	while (seb->parent)
> > +		seb = seb->parent;
> 
> This seems unfortunate, I think we can do better.
> 
> > +
> > +	cfs_rqa = sea->cfs_rq;
> > +	cfs_rqb = seb->cfs_rq;
> > +
> > +	/* normalize vruntime WRT their rq's base */
> > +	delta = (s64)(sea->vruntime - seb->vruntime) +
> > +		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
> > +out:
> > +	return delta > 0;
> > +}
> 
> 
> How's something like this?
> 
>  - after each pick, such that the pick itself sees the divergence (see
>    above); either:
> 
>     - pull the vruntime_fi forward, when !fi
>     - freeze the vruntime_fi, when newly fi    (A)
> 
>  - either way, update vruntime_fi for each cfs_rq in the active
>    hierarchy.
> 
>  - when comparing, and fi, update the vruntime_fi hierarchy until we
>    encounter a mark from (A), per doing it during the pick, but before
>    runtime, this guarantees it hasn't moved since (A).
> 
> XXX, still buggered on SMT>2, imagine having {ta, tb, fi, i} on an SMT4,
> then when comparing any two tasks that do not involve the fi, we should
> (probably) have pulled them fwd -- but we can't actually pull them,
> because then the fi thing would break, mooo.
> 
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -115,19 +115,8 @@ static inline bool prio_less(struct task
>  	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
>  		return !dl_time_before(a->dl.deadline, b->dl.deadline);
>  
> -	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
> -		u64 vruntime = b->se.vruntime;
> -
> -		/*
> -		 * Normalize the vruntime if tasks are in different cpus.
> -		 */
> -		if (task_cpu(a) != task_cpu(b)) {
> -			vruntime -= task_cfs_rq(b)->min_vruntime;
> -			vruntime += task_cfs_rq(a)->min_vruntime;
> -		}
> -
> -		return !((s64)(a->se.vruntime - vruntime) <= 0);
> -	}
> +	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
> +		return cfs_prio_less(a, b);
>  
>  	return false;
>  }
> @@ -4642,12 +4631,15 @@ pick_task(struct rq *rq, const struct sc
>  	return cookie_pick;
>  }
>  
> +extern void task_vruntime_update(struct rq *rq, struct task_struct *p);
> +
>  static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	struct task_struct *next, *max = NULL;
>  	const struct sched_class *class;
>  	const struct cpumask *smt_mask;
> +	bool fi_before = false;
>  	bool need_sync;
>  	int i, j, cpu;
>  
> @@ -4707,6 +4699,7 @@ pick_next_task(struct rq *rq, struct tas
>  	need_sync = !!rq->core->core_cookie;
>  	if (rq->core->core_forceidle) {
>  		need_sync = true;
> +		fi_before = true;
>  		rq->core->core_forceidle = false;
>  	}
>  
> @@ -4757,6 +4750,11 @@ pick_next_task(struct rq *rq, struct tas
>  				continue;
>  
>  			rq_i->core_pick = p;
> +			if (rq_i->idle == p && rq_i->nr_running) {
> +				rq->core->core_forceidle = true;
> +				if (!fi_before)
> +					rq->core->core_forceidle_seq++;
> +			}
>  
>  			/*
>  			 * If this new candidate is of higher priority than the
> @@ -4775,6 +4773,7 @@ pick_next_task(struct rq *rq, struct tas
>  				max = p;
>  
>  				if (old_max) {
> +					rq->core->core_forceidle = false;
>  					for_each_cpu(j, smt_mask) {
>  						if (j == i)
>  							continue;
> @@ -4823,10 +4822,8 @@ pick_next_task(struct rq *rq, struct tas
>  		if (!rq_i->core_pick)
>  			continue;
>  
> -		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
> -		    !rq_i->core->core_forceidle) {
> -			rq_i->core->core_forceidle = true;
> -		}
> +		if (!(fi_before && rq->core->core_forceidle))
> +			task_vruntime_update(rq_i, rq_i->core_pick);
>  
>  		if (i == cpu) {
>  			rq_i->core_pick = NULL;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10686,6 +10686,67 @@ static inline void task_tick_core(struct
>  	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
>  		resched_curr(rq);
>  }
> +
> +static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forceidle)
> +{
> +	for_each_sched_entity(se) {
> +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +		if (forceidle) {
> +			if (cfs_rq->forceidle_seq == fi_seq)
> +				break;
> +			cfs_rq->forceidle_seq = fi_seq;
> +		}
> +
> +		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
> +	}
> +}
> +
> +void task_vruntime_update(struct rq *rq, struct task_struct *p)
> +{
> +	struct sched_entity *se = &p->se;
> +
> +	if (p->sched_class != &fair_sched_class)
> +		return;
> +
> +	se_fi_update(se, rq->core->core_forceidle_seq, rq->core->core_forceidle);
> +}
> +
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +	struct rq *rq = task_rq(a);
> +	struct sched_entity *sea = &a->se;
> +	struct sched_entity *seb = &b->se;
> +	struct cfs_rq *cfs_rqa;
> +	struct cfs_rq *cfs_rqb;
> +	s64 delta;
> +
> +	SCHED_WARN_ON(task_rq(b)->core != rq->core);
> +
> +	while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
> +		int sea_depth = sea->depth;
> +		int seb_depth = seb->depth;
> +
> +		if (sea_depth >= seb_depth)
> +			sea = parent_entity(sea);
> +		if (sea_depth <= seb_depth)
> +			seb = parent_entity(seb);
> +	}
> +
> +	if (rq->core->core_forceidle) {
> +		se_fi_update(sea, rq->core->core_forceidle_seq, true);
> +		se_fi_update(seb, rq->core->core_forceidle_seq, true);
> +	}
> +
> +	cfs_rqa = sea->cfs_rq;
> +	cfs_rqb = seb->cfs_rq;
> +
> +	/* normalize vruntime WRT their rq's base */
> +	delta = (s64)(sea->vruntime - seb->vruntime) +
> +		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
> +
> +	return delta > 0;
> +}
>  #else
>  static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
>  #endif
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -522,6 +522,11 @@ struct cfs_rq {
>  	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
>  	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
>  
> +#ifdef CONFIG_SCHED_CORE
> +	unsigned int		forceidle_seq;
> +	u64			min_vruntime_fi;
> +#endif
> +
>  	u64			exec_clock;
>  	u64			min_vruntime;
>  #ifndef CONFIG_64BIT
> @@ -1061,7 +1066,8 @@ struct rq {
>  	unsigned int		core_task_seq;
>  	unsigned int		core_pick_seq;
>  	unsigned long		core_cookie;
> -	unsigned char		core_forceidle;
> +	unsigned int		core_forceidle;
> +	unsigned int		core_forceidle_seq;
>  #endif
>  };
>  
> @@ -1106,6 +1112,8 @@ static inline raw_spinlock_t *rq_lockp(s
>  	return &rq->__lock;
>  }
>  
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
> +
>  #else /* !CONFIG_SCHED_CORE */
>  
>  static inline bool sched_core_enabled(struct rq *rq)

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-10-26 12:47   ` Peter Zijlstra
  2020-10-28 15:29     ` Joel Fernandes
@ 2020-10-28 18:39     ` Joel Fernandes
  2020-10-29 16:59     ` Joel Fernandes
  2020-10-29 18:24     ` Joel Fernandes
  3 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-10-28 18:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Mon, Oct 26, 2020 at 01:47:24PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 09:43:18PM -0400, Joel Fernandes (Google) wrote:
> 
> > @@ -4723,6 +4714,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  			update_rq_clock(rq_i);
> >  	}
> >  
> > +	/* Reset the snapshot if core is no longer in force-idle. */
> > +	if (!fi_before) {
> > +		for_each_cpu(i, smt_mask) {
> > +			struct rq *rq_i = cpu_rq(i);
> > +			rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
> > +		}
> > +	}
> 
> So this is the thing that drags vruntime_fi along when (both?) siblings
> are active, right? But should we not do that after pick? Consider 2
> tasks a weight 1 and a weight 10 task, one for each sibling. By syncing
> the vruntime before picking, the cfs_prio_less() loop will not be able
> to distinguish between these two, since they'll both have effectively
> the same lag.

Just to set some terms, by "lag" you mean the delta between task->vruntime and
cfs_rq->min_vruntime_fi, right?

Assuming this is what you mean:

Say there are 2 tasks, T1 and T2, on 2 different CPUs.
T1 has weight=10 vruntime=200
T2 has weight=1  vruntime=1000

Say T1's cfs_rq R1 has min_vruntime of 100
    T2's cfs_rq R2 has min_vruntime of 200

The first time we force idle, we do the "sync" in my patch (which is what I
think you can call the snapshotting of min_vruntime). Assuming we do the sync
_before_ picking:
This causes R1's ->min_vruntime_fi to be 100.
            R2's ->min_vruntime_fi to be 200.

So during picking, cfs_prio_less() will see R1's "lag" as (200-100) = 100.
And R2's "lag" as (1000-200) = 800.

So the lags are different and I didn't get what you mean by "have effectively
the same lag".
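
To make that concrete, plugging those numbers into the cross-cpu delta from
the patch hunk quoted below (taking T1 as 'a' and T2 as 'b'):

	delta = (sea->vruntime - seb->vruntime) +
		(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi)
	      = (200 - 1000) + (200 - 100) = -700

which is lag(T1) - lag(T2) = 100 - 800, so cfs_prio_less(T1, T2) is false and
T1, the task with the smaller lag, wins the comparison.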

Could you let me know what I am missing?

Also, your patch is great, and I feel it does the same thing as my patch except
that it does the min_vruntime snapshot up the hierarchy (which is great!). BTW,
do you really need force_idle_seq?

It seems to me that, since you have fi_before, you could just set another
variable on the stack if the fi status changed during picking and pass that
along to se_fi_update(). Or is there another case where force_idle_seq is
needed?

Thanks!

 - Joel


> If however, you sync after pick, then the weight 1 task will have accrued
> far more runtime than the weight 10 task, and consequently the weight 10
> task will have preference when a decision will have to be made.
> 
> (also, if this were the right place, the whole thing should've been part
> of the for_each_cpu() loop right before this)
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 56bea0decda1..9cae08c3fca1 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10686,6 +10686,46 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
> >  	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
> >  		resched_curr(rq);
> >  }
> > +
> > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> > +{
> > +	bool samecpu = task_cpu(a) == task_cpu(b);
> > +	struct sched_entity *sea = &a->se;
> > +	struct sched_entity *seb = &b->se;
> > +	struct cfs_rq *cfs_rqa;
> > +	struct cfs_rq *cfs_rqb;
> > +	s64 delta;
> > +
> > +	if (samecpu) {
> > +		/* vruntime is per cfs_rq */
> > +		while (!is_same_group(sea, seb)) {
> > +			int sea_depth = sea->depth;
> > +			int seb_depth = seb->depth;
> > +			if (sea_depth >= seb_depth)
> > +				sea = parent_entity(sea);
> > +			if (sea_depth <= seb_depth)
> > +				seb = parent_entity(seb);
> > +		}
> > +
> > +		delta = (s64)(sea->vruntime - seb->vruntime);
> > +		goto out;
> > +	}
> > +
> > +	/* crosscpu: compare root level se's vruntime to decide priority */
> > +	while (sea->parent)
> > +		sea = sea->parent;
> > +	while (seb->parent)
> > +		seb = seb->parent;
> 
> This seems unfortunate, I think we can do better.
> 
> > +
> > +	cfs_rqa = sea->cfs_rq;
> > +	cfs_rqb = seb->cfs_rq;
> > +
> > +	/* normalize vruntime WRT their rq's base */
> > +	delta = (s64)(sea->vruntime - seb->vruntime) +
> > +		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
> > +out:
> > +	return delta > 0;
> > +}
> 
> 
> How's something like this?
> 
>  - after each pick, such that the pick itself sees the divergence (see
>    above); either:
> 
>     - pull the vruntime_fi forward, when !fi
>     - freeze the vruntime_fi, when newly fi    (A)
> 
>  - either way, update vruntime_fi for each cfs_rq in the active
>    hierarchy.
> 
>  - when comparing, and fi, update the vruntime_fi hierarchy until we
>    encounter a mark from (A), per doing it during the pick, but before
>    runtime, this guarantees it hasn't moved since (A).
> 
> XXX, still buggered on SMT>2, imagine having {ta, tb, fi, i} on an SMT4,
> then when comparing any two tasks that do not involve the fi, we should
> (probably) have pulled them fwd -- but we can't actually pull them,
> because then the fi thing would break, mooo.
> 
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -115,19 +115,8 @@ static inline bool prio_less(struct task
>  	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
>  		return !dl_time_before(a->dl.deadline, b->dl.deadline);
>  
> -	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
> -		u64 vruntime = b->se.vruntime;
> -
> -		/*
> -		 * Normalize the vruntime if tasks are in different cpus.
> -		 */
> -		if (task_cpu(a) != task_cpu(b)) {
> -			vruntime -= task_cfs_rq(b)->min_vruntime;
> -			vruntime += task_cfs_rq(a)->min_vruntime;
> -		}
> -
> -		return !((s64)(a->se.vruntime - vruntime) <= 0);
> -	}
> +	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
> +		return cfs_prio_less(a, b);
>  
>  	return false;
>  }
> @@ -4642,12 +4631,15 @@ pick_task(struct rq *rq, const struct sc
>  	return cookie_pick;
>  }
>  
> +extern void task_vruntime_update(struct rq *rq, struct task_struct *p);
> +
>  static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	struct task_struct *next, *max = NULL;
>  	const struct sched_class *class;
>  	const struct cpumask *smt_mask;
> +	bool fi_before = false;
>  	bool need_sync;
>  	int i, j, cpu;
>  
> @@ -4707,6 +4699,7 @@ pick_next_task(struct rq *rq, struct tas
>  	need_sync = !!rq->core->core_cookie;
>  	if (rq->core->core_forceidle) {
>  		need_sync = true;
> +		fi_before = true;
>  		rq->core->core_forceidle = false;
>  	}
>  
> @@ -4757,6 +4750,11 @@ pick_next_task(struct rq *rq, struct tas
>  				continue;
>  
>  			rq_i->core_pick = p;
> +			if (rq_i->idle == p && rq_i->nr_running) {
> +				rq->core->core_forceidle = true;
> +				if (!fi_before)
> +					rq->core->core_forceidle_seq++;
> +			}
>  
>  			/*
>  			 * If this new candidate is of higher priority than the
> @@ -4775,6 +4773,7 @@ pick_next_task(struct rq *rq, struct tas
>  				max = p;
>  
>  				if (old_max) {
> +					rq->core->core_forceidle = false;
>  					for_each_cpu(j, smt_mask) {
>  						if (j == i)
>  							continue;
> @@ -4823,10 +4822,8 @@ pick_next_task(struct rq *rq, struct tas
>  		if (!rq_i->core_pick)
>  			continue;
>  
> -		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
> -		    !rq_i->core->core_forceidle) {
> -			rq_i->core->core_forceidle = true;
> -		}
> +		if (!(fi_before && rq->core->core_forceidle))
> +			task_vruntime_update(rq_i, rq_i->core_pick);
>  
>  		if (i == cpu) {
>  			rq_i->core_pick = NULL;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10686,6 +10686,67 @@ static inline void task_tick_core(struct
>  	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
>  		resched_curr(rq);
>  }
> +
> +static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forceidle)
> +{
> +	for_each_sched_entity(se) {
> +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +		if (forceidle) {
> +			if (cfs_rq->forceidle_seq == fi_seq)
> +				break;
> +			cfs_rq->forceidle_seq = fi_seq;
> +		}
> +
> +		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
> +	}
> +}
> +
> +void task_vruntime_update(struct rq *rq, struct task_struct *p)
> +{
> +	struct sched_entity *se = &p->se;
> +
> +	if (p->sched_class != &fair_sched_class)
> +		return;
> +
> +	se_fi_update(se, rq->core->core_forceidle_seq, rq->core->core_forceidle);
> +}
> +
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +	struct rq *rq = task_rq(a);
> +	struct sched_entity *sea = &a->se;
> +	struct sched_entity *seb = &b->se;
> +	struct cfs_rq *cfs_rqa;
> +	struct cfs_rq *cfs_rqb;
> +	s64 delta;
> +
> +	SCHED_WARN_ON(task_rq(b)->core != rq->core);
> +
> +	while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
> +		int sea_depth = sea->depth;
> +		int seb_depth = seb->depth;
> +
> +		if (sea_depth >= seb_depth)
> +			sea = parent_entity(sea);
> +		if (sea_depth <= seb_depth)
> +			seb = parent_entity(seb);
> +	}
> +
> +	if (rq->core->core_forceidle) {
> +		se_fi_update(sea, rq->core->core_forceidle_seq, true);
> +		se_fi_update(seb, rq->core->core_forceidle_seq, true);
> +	}
> +
> +	cfs_rqa = sea->cfs_rq;
> +	cfs_rqb = seb->cfs_rq;
> +
> +	/* normalize vruntime WRT their rq's base */
> +	delta = (s64)(sea->vruntime - seb->vruntime) +
> +		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
> +
> +	return delta > 0;
> +}
>  #else
>  static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
>  #endif
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -522,6 +522,11 @@ struct cfs_rq {
>  	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
>  	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
>  
> +#ifdef CONFIG_SCHED_CORE
> +	unsigned int		forceidle_seq;
> +	u64			min_vruntime_fi;
> +#endif
> +
>  	u64			exec_clock;
>  	u64			min_vruntime;
>  #ifndef CONFIG_64BIT
> @@ -1061,7 +1066,8 @@ struct rq {
>  	unsigned int		core_task_seq;
>  	unsigned int		core_pick_seq;
>  	unsigned long		core_cookie;
> -	unsigned char		core_forceidle;
> +	unsigned int		core_forceidle;
> +	unsigned int		core_forceidle_seq;
>  #endif
>  };
>  
> @@ -1106,6 +1112,8 @@ static inline raw_spinlock_t *rq_lockp(s
>  	return &rq->__lock;
>  }
>  
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
> +
>  #else /* !CONFIG_SCHED_CORE */
>  
>  static inline bool sched_core_enabled(struct rq *rq)

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-10-26 12:47   ` Peter Zijlstra
  2020-10-28 15:29     ` Joel Fernandes
  2020-10-28 18:39     ` Joel Fernandes
@ 2020-10-29 16:59     ` Joel Fernandes
  2020-10-29 18:24     ` Joel Fernandes
  3 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-10-29 16:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Mon, Oct 26, 2020 at 01:47:24PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 09:43:18PM -0400, Joel Fernandes (Google) wrote:
> 
> > @@ -4723,6 +4714,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  			update_rq_clock(rq_i);
> >  	}
> >  
> > +	/* Reset the snapshot if core is no longer in force-idle. */
> > +	if (!fi_before) {
> > +		for_each_cpu(i, smt_mask) {
> > +			struct rq *rq_i = cpu_rq(i);
> > +			rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
> > +		}
> > +	}
> 
> So this is the thing that drags vruntime_fi along when (both?) siblings
> are active, right? But should we not do that after pick? Consider 2
> tasks a weight 1 and a weight 10 task, one for each sibling. By syncing
> the vruntime before picking, the cfs_prio_less() loop will not be able
> to distinguish between these two, since they'll both have effectively
> the same lag.
> 
> If however, you sync after pick, then the weight 1 task will have accrued
> far more runtime than the weight 10 task, and consequently the weight 10
> task will have preference when a decision will have to be made.
> 
> (also, if this were the right place, the whole thing should've been part
> of the for_each_cpu() loop right before this)
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 56bea0decda1..9cae08c3fca1 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10686,6 +10686,46 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
> >  	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
> >  		resched_curr(rq);
> >  }
> > +
> > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> > +{
> > +	bool samecpu = task_cpu(a) == task_cpu(b);
> > +	struct sched_entity *sea = &a->se;
> > +	struct sched_entity *seb = &b->se;
> > +	struct cfs_rq *cfs_rqa;
> > +	struct cfs_rq *cfs_rqb;
> > +	s64 delta;
> > +
> > +	if (samecpu) {
> > +		/* vruntime is per cfs_rq */
> > +		while (!is_same_group(sea, seb)) {
> > +			int sea_depth = sea->depth;
> > +			int seb_depth = seb->depth;
> > +			if (sea_depth >= seb_depth)
> > +				sea = parent_entity(sea);
> > +			if (sea_depth <= seb_depth)
> > +				seb = parent_entity(seb);
> > +		}
> > +
> > +		delta = (s64)(sea->vruntime - seb->vruntime);
> > +		goto out;
> > +	}
> > +
> > +	/* crosscpu: compare root level se's vruntime to decide priority */
> > +	while (sea->parent)
> > +		sea = sea->parent;
> > +	while (seb->parent)
> > +		seb = seb->parent;
> 
> This seems unfortunate, I think we can do better.
> 
> > +
> > +	cfs_rqa = sea->cfs_rq;
> > +	cfs_rqb = seb->cfs_rq;
> > +
> > +	/* normalize vruntime WRT their rq's base */
> > +	delta = (s64)(sea->vruntime - seb->vruntime) +
> > +		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
> > +out:
> > +	return delta > 0;
> > +}
> 
> 
> How's something like this?
> 
>  - after each pick, such that the pick itself sees the divergence (see
>    above); either:
> 
>     - pull the vruntime_fi forward, when !fi
>     - freeze the vruntime_fi, when newly fi    (A)
> 
>  - either way, update vruntime_fi for each cfs_rq in the active
>    hierarchy.
> 
>  - when comparing, and fi, update the vruntime_fi hierarchy until we
>    encounter a mark from (A), per doing it during the pick, but before
>    runtime, this guarantees it hasn't moved since (A).
> 
> XXX, still buggered on SMT>2, imagine having {ta, tb, fi, i} on an SMT4,
> then when comparing any two tasks that do not involve the fi, we should
> (probably) have pulled them fwd -- but we can't actually pull them,
> because then the fi thing would break, mooo.

Hi Peter, Vineeth,

I tried Peter's diff (had to backport it to 4.19 as that's what our devices are
on). Let me know if I screwed up the backport:
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/6469fd5b6bcd0eb1e054307aa4b54bc9e937346d%5E%21/

With Peter's changes, I see really bad wakeup latencies when running a simple
synthetic test (1 task busy on a CPU, with the sibling doing a 70% run/sleep
cycle with a 1-second period). The 'perf sched latency' worst case is on the
order of 100s of ms, which I don't see with my patch.
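
For reference, a rough sketch of that kind of load (the 70% duty cycle and the
1-second period come from the description above; the program itself is
illustrative, not the exact test that was run). Pin it to one hyperthread, run
a plain busy loop on the sibling, then look at 'perf sched latency':

#include <time.h>
#include <unistd.h>

/* Busy-spin for roughly 'ms' milliseconds. */
static void busy_for_ms(long ms)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000L +
		 (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
}

int main(void)
{
	/* ~70% run, ~30% sleep, 1 second period. */
	for (;;) {
		busy_for_ms(700);
		usleep(300 * 1000);
	}
}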

I do feel Peter's patch is better though and I'll try to debug it.

thanks,

 - Joel


> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -115,19 +115,8 @@ static inline bool prio_less(struct task
>  	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
>  		return !dl_time_before(a->dl.deadline, b->dl.deadline);
>  
> -	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
> -		u64 vruntime = b->se.vruntime;
> -
> -		/*
> -		 * Normalize the vruntime if tasks are in different cpus.
> -		 */
> -		if (task_cpu(a) != task_cpu(b)) {
> -			vruntime -= task_cfs_rq(b)->min_vruntime;
> -			vruntime += task_cfs_rq(a)->min_vruntime;
> -		}
> -
> -		return !((s64)(a->se.vruntime - vruntime) <= 0);
> -	}
> +	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
> +		return cfs_prio_less(a, b);
>  
>  	return false;
>  }
> @@ -4642,12 +4631,15 @@ pick_task(struct rq *rq, const struct sc
>  	return cookie_pick;
>  }
>  
> +extern void task_vruntime_update(struct rq *rq, struct task_struct *p);
> +
>  static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	struct task_struct *next, *max = NULL;
>  	const struct sched_class *class;
>  	const struct cpumask *smt_mask;
> +	bool fi_before = false;
>  	bool need_sync;
>  	int i, j, cpu;
>  
> @@ -4707,6 +4699,7 @@ pick_next_task(struct rq *rq, struct tas
>  	need_sync = !!rq->core->core_cookie;
>  	if (rq->core->core_forceidle) {
>  		need_sync = true;
> +		fi_before = true;
>  		rq->core->core_forceidle = false;
>  	}
>  
> @@ -4757,6 +4750,11 @@ pick_next_task(struct rq *rq, struct tas
>  				continue;
>  
>  			rq_i->core_pick = p;
> +			if (rq_i->idle == p && rq_i->nr_running) {
> +				rq->core->core_forceidle = true;
> +				if (!fi_before)
> +					rq->core->core_forceidle_seq++;
> +			}
>  
>  			/*
>  			 * If this new candidate is of higher priority than the
> @@ -4775,6 +4773,7 @@ pick_next_task(struct rq *rq, struct tas
>  				max = p;
>  
>  				if (old_max) {
> +					rq->core->core_forceidle = false;
>  					for_each_cpu(j, smt_mask) {
>  						if (j == i)
>  							continue;
> @@ -4823,10 +4822,8 @@ pick_next_task(struct rq *rq, struct tas
>  		if (!rq_i->core_pick)
>  			continue;
>  
> -		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
> -		    !rq_i->core->core_forceidle) {
> -			rq_i->core->core_forceidle = true;
> -		}
> +		if (!(fi_before && rq->core->core_forceidle))
> +			task_vruntime_update(rq_i, rq_i->core_pick);
>  
>  		if (i == cpu) {
>  			rq_i->core_pick = NULL;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10686,6 +10686,67 @@ static inline void task_tick_core(struct
>  	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
>  		resched_curr(rq);
>  }
> +
> +static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forceidle)
> +{
> +	for_each_sched_entity(se) {
> +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +		if (forceidle) {
> +			if (cfs_rq->forceidle_seq == fi_seq)
> +				break;
> +			cfs_rq->forceidle_seq = fi_seq;
> +		}
> +
> +		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
> +	}
> +}
> +
> +void task_vruntime_update(struct rq *rq, struct task_struct *p)
> +{
> +	struct sched_entity *se = &p->se;
> +
> +	if (p->sched_class != &fair_sched_class)
> +		return;
> +
> +	se_fi_update(se, rq->core->core_forceidle_seq, rq->core->core_forceidle);
> +}
> +
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +	struct rq *rq = task_rq(a);
> +	struct sched_entity *sea = &a->se;
> +	struct sched_entity *seb = &b->se;
> +	struct cfs_rq *cfs_rqa;
> +	struct cfs_rq *cfs_rqb;
> +	s64 delta;
> +
> +	SCHED_WARN_ON(task_rq(b)->core != rq->core);
> +
> +	while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
> +		int sea_depth = sea->depth;
> +		int seb_depth = seb->depth;
> +
> +		if (sea_depth >= seb_depth)
> +			sea = parent_entity(sea);
> +		if (sea_depth <= seb_depth)
> +			seb = parent_entity(seb);
> +	}
> +
> +	if (rq->core->core_forceidle) {
> +		se_fi_update(sea, rq->core->core_forceidle_seq, true);
> +		se_fi_update(seb, rq->core->core_forceidle_seq, true);
> +	}
> +
> +	cfs_rqa = sea->cfs_rq;
> +	cfs_rqb = seb->cfs_rq;
> +
> +	/* normalize vruntime WRT their rq's base */
> +	delta = (s64)(sea->vruntime - seb->vruntime) +
> +		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
> +
> +	return delta > 0;
> +}
>  #else
>  static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
>  #endif
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -522,6 +522,11 @@ struct cfs_rq {
>  	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
>  	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
>  
> +#ifdef CONFIG_SCHED_CORE
> +	unsigned int		forceidle_seq;
> +	u64			min_vruntime_fi;
> +#endif
> +
>  	u64			exec_clock;
>  	u64			min_vruntime;
>  #ifndef CONFIG_64BIT
> @@ -1061,7 +1066,8 @@ struct rq {
>  	unsigned int		core_task_seq;
>  	unsigned int		core_pick_seq;
>  	unsigned long		core_cookie;
> -	unsigned char		core_forceidle;
> +	unsigned int		core_forceidle;
> +	unsigned int		core_forceidle_seq;
>  #endif
>  };
>  
> @@ -1106,6 +1112,8 @@ static inline raw_spinlock_t *rq_lockp(s
>  	return &rq->__lock;
>  }
>  
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
> +
>  #else /* !CONFIG_SCHED_CORE */
>  
>  static inline bool sched_core_enabled(struct rq *rq)

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-10-26 12:47   ` Peter Zijlstra
                       ` (2 preceding siblings ...)
  2020-10-29 16:59     ` Joel Fernandes
@ 2020-10-29 18:24     ` Joel Fernandes
  2020-10-29 18:59       ` Peter Zijlstra
  3 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes @ 2020-10-29 18:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, keescook,
	kerrnel, Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Tim Chen

On Mon, Oct 26, 2020 at 01:47:24PM +0100, Peter Zijlstra wrote:
[..] 
> How's something like this?
> 
>  - after each pick, such that the pick itself sees the divergence (see
>    above); either:
> 
>     - pull the vruntime_fi forward, when !fi
>     - freeze the vruntime_fi, when newly fi    (A)
> 
>  - either way, update vruntime_fi for each cfs_rq in the active
>    hierarchy.
> 
>  - when comparing, and fi, update the vruntime_fi hierarchy until we
>    encounter a mark from (A), per doing it during the pick, but before
>    runtime, this guarantees it hasn't moved since (A).
> 
> XXX, still buggered on SMT>2, imagine having {ta, tb, fi, i} on an SMT4,
> then when comparing any two tasks that do not involve the fi, we should
> (probably) have pulled them fwd -- but we can't actually pull them,
> because then the fi thing would break, mooo.
> 
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -115,19 +115,8 @@ static inline bool prio_less(struct task
>  	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
>  		return !dl_time_before(a->dl.deadline, b->dl.deadline);
>  
> -	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
> -		u64 vruntime = b->se.vruntime;
> -
> -		/*
> -		 * Normalize the vruntime if tasks are in different cpus.
> -		 */
> -		if (task_cpu(a) != task_cpu(b)) {
> -			vruntime -= task_cfs_rq(b)->min_vruntime;
> -			vruntime += task_cfs_rq(a)->min_vruntime;
> -		}
> -
> -		return !((s64)(a->se.vruntime - vruntime) <= 0);
> -	}
> +	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
> +		return cfs_prio_less(a, b);
>  
>  	return false;
>  }
> @@ -4642,12 +4631,15 @@ pick_task(struct rq *rq, const struct sc
>  	return cookie_pick;
>  }
>  
> +extern void task_vruntime_update(struct rq *rq, struct task_struct *p);
> +
>  static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	struct task_struct *next, *max = NULL;
>  	const struct sched_class *class;
>  	const struct cpumask *smt_mask;
> +	bool fi_before = false;
>  	bool need_sync;
>  	int i, j, cpu;
>  
> @@ -4707,6 +4699,7 @@ pick_next_task(struct rq *rq, struct tas
>  	need_sync = !!rq->core->core_cookie;
>  	if (rq->core->core_forceidle) {
>  		need_sync = true;
> +		fi_before = true;
>  		rq->core->core_forceidle = false;
>  	}
>  
> @@ -4757,6 +4750,11 @@ pick_next_task(struct rq *rq, struct tas
>  				continue;
>  
>  			rq_i->core_pick = p;
> +			if (rq_i->idle == p && rq_i->nr_running) {
> +				rq->core->core_forceidle = true;
> +				if (!fi_before)
> +					rq->core->core_forceidle_seq++;
> +			}
>  
>  			/*
>  			 * If this new candidate is of higher priority than the
> @@ -4775,6 +4773,7 @@ pick_next_task(struct rq *rq, struct tas
>  				max = p;
>  
>  				if (old_max) {
> +					rq->core->core_forceidle = false;
>  					for_each_cpu(j, smt_mask) {
>  						if (j == i)
>  							continue;
> @@ -4823,10 +4822,8 @@ pick_next_task(struct rq *rq, struct tas
>  		if (!rq_i->core_pick)
>  			continue;
>  
> -		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
> -		    !rq_i->core->core_forceidle) {
> -			rq_i->core->core_forceidle = true;
> -		}
> +		if (!(fi_before && rq->core->core_forceidle))
> +			task_vruntime_update(rq_i, rq_i->core_pick);

Shouldn't this be:

	if (!fi_before && rq->core->core_forceidle)
			task_vruntime_update(rq_i, rq_i->core_pick);

?

>  
>  		if (i == cpu) {
>  			rq_i->core_pick = NULL;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10686,6 +10686,67 @@ static inline void task_tick_core(struct
>  	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
>  		resched_curr(rq);
>  }
> +
> +static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forceidle)
> +{
> +	for_each_sched_entity(se) {
> +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +		if (forceidle) {
> +			if (cfs_rq->forceidle_seq == fi_seq)
> +				break;
> +			cfs_rq->forceidle_seq = fi_seq;
> +		}
> +
> +		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
> +	}
> +}
> +
> +void task_vruntime_update(struct rq *rq, struct task_struct *p)
> +{
> +	struct sched_entity *se = &p->se;
> +
> +	if (p->sched_class != &fair_sched_class)
> +		return;
> +
> +	se_fi_update(se, rq->core->core_forceidle_seq, rq->core->core_forceidle);
> +}
> +
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +	struct rq *rq = task_rq(a);
> +	struct sched_entity *sea = &a->se;
> +	struct sched_entity *seb = &b->se;
> +	struct cfs_rq *cfs_rqa;
> +	struct cfs_rq *cfs_rqb;
> +	s64 delta;
> +
> +	SCHED_WARN_ON(task_rq(b)->core != rq->core);
> +
> +	while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
> +		int sea_depth = sea->depth;
> +		int seb_depth = seb->depth;
> +
> +		if (sea_depth >= seb_depth)
> +			sea = parent_entity(sea);
> +		if (sea_depth <= seb_depth)
> +			seb = parent_entity(seb);
> +	}
> +
> +	if (rq->core->core_forceidle) {
> +		se_fi_update(sea, rq->core->core_forceidle_seq, true);
> +		se_fi_update(seb, rq->core->core_forceidle_seq, true);
> +	}

As we chatted on IRC you mentioned the reason for the sync here is:

 say we have 2 cgroups (a,b) under root, and we go force-idle in a, then we
 update a and root. Then we pick and end up in b, but b hasn't been updated
 yet.

One thing I was wondering about was: if the pick of 'b' happens much
later than 'a', then the snapshot might be happening too late, right?

Maybe the snapshot should happen on all cfs_rqs on all siblings in
pick_next_task() itself? That way everything gets updated at the instant the
force-idle started. Though that may be a bit slower.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-10-29 18:24     ` Joel Fernandes
@ 2020-10-29 18:59       ` Peter Zijlstra
  2020-10-30  2:36         ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Peter Zijlstra @ 2020-10-29 18:59 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, keescook,
	kerrnel, Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Tim Chen

On Thu, Oct 29, 2020 at 02:24:29PM -0400, Joel Fernandes wrote:

> > @@ -4823,10 +4822,8 @@ pick_next_task(struct rq *rq, struct tas
> >  		if (!rq_i->core_pick)
> >  			continue;
> >  
> > -		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
> > -		    !rq_i->core->core_forceidle) {
> > -			rq_i->core->core_forceidle = true;
> > -		}
> > +		if (!(fi_before && rq->core->core_forceidle))
> > +			task_vruntime_update(rq_i, rq_i->core_pick);
> 
> Shouldn't this be:
> 
> 	if (!fi_before && rq->core->core_forceidle)
> 			task_vruntime_update(rq_i, rq_i->core_pick);
> 
> ?

*groan*, I should've written a comment there :/

When we're not fi, we need to update.
When we're fi and we were not fi, we must update.
When we're fi and we were already fi, we must not update.

Which gives:

	fib	fi	X

	0	0	1
	0	1	0
	1	0	1
	1	1	1

which is: !(!fib && fi) or something.

> > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> > +{
> > +	struct rq *rq = task_rq(a);
> > +	struct sched_entity *sea = &a->se;
> > +	struct sched_entity *seb = &b->se;
> > +	struct cfs_rq *cfs_rqa;
> > +	struct cfs_rq *cfs_rqb;
> > +	s64 delta;
> > +
> > +	SCHED_WARN_ON(task_rq(b)->core != rq->core);
> > +
> > +	while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
> > +		int sea_depth = sea->depth;
> > +		int seb_depth = seb->depth;
> > +
> > +		if (sea_depth >= seb_depth)
> > +			sea = parent_entity(sea);
> > +		if (sea_depth <= seb_depth)
> > +			seb = parent_entity(seb);
> > +	}
> > +
> > +	if (rq->core->core_forceidle) {
> > +		se_fi_update(sea, rq->core->core_forceidle_seq, true);
> > +		se_fi_update(seb, rq->core->core_forceidle_seq, true);
> > +	}
> 
> As we chatted on IRC you mentioned the reason for the sync here is:
> 
>  say we have 2 cgroups (a,b) under root, and we go force-idle in a, then we
>  update a and root. Then we pick and end up in b, but b hasn't been updated
>  yet.
> 
> One thing I was wondering about that was, if the pick of 'b' happens much
> later than 'a', then the snapshot might be happening too late right?

No, since this is the first pick in b since fi, it cannot have advanced.
So by updating to fi_seq before picking, we guarantee it is unchanged
since we went fi.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-10-29 18:59       ` Peter Zijlstra
@ 2020-10-30  2:36         ` Joel Fernandes
  2020-10-30  2:42           ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes @ 2020-10-30  2:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Glexiner, LKML, Ingo Molnar,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Aubrey Li, Tim Chen

On Thu, Oct 29, 2020 at 2:59 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 29, 2020 at 02:24:29PM -0400, Joel Fernandes wrote:
>
> > > @@ -4823,10 +4822,8 @@ pick_next_task(struct rq *rq, struct tas
> > >             if (!rq_i->core_pick)
> > >                     continue;
> > >
> > > -           if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
> > > -               !rq_i->core->core_forceidle) {
> > > -                   rq_i->core->core_forceidle = true;
> > > -           }
> > > +           if (!(fi_before && rq->core->core_forceidle))
> > > +                   task_vruntime_update(rq_i, rq_i->core_pick);
> >
> > Shouldn't this be:
> >
> >       if (!fi_before && rq->core->core_forceidle)
> >                       task_vruntime_update(rq_i, rq_i->core_pick);
> >
> > ?
>
> *groan*, I should've written a comment there :/
>
> When we're not fi, we need to update.
> When we're fi and we were not fi, we must update.
> When we're fi and we were already fi, we must not update.
>
> Which gives:
>
>         fib     fi      X
>
>         0       0       1
>         0       1       0
>         1       0       1
>         1       1       1
>
> which is: !(!fib && fi) or something.
>

Got it! This is what my initial patch intended to do as well, but
yours is better.

> > > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> > > +{
> > > +   struct rq *rq = task_rq(a);
> > > +   struct sched_entity *sea = &a->se;
> > > +   struct sched_entity *seb = &b->se;
> > > +   struct cfs_rq *cfs_rqa;
> > > +   struct cfs_rq *cfs_rqb;
> > > +   s64 delta;
> > > +
> > > +   SCHED_WARN_ON(task_rq(b)->core != rq->core);
> > > +
> > > +   while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
> > > +           int sea_depth = sea->depth;
> > > +           int seb_depth = seb->depth;
> > > +
> > > +           if (sea_depth >= seb_depth)
> > > +                   sea = parent_entity(sea);
> > > +           if (sea_depth <= seb_depth)
> > > +                   seb = parent_entity(seb);
> > > +   }
> > > +
> > > +   if (rq->core->core_forceidle) {
> > > +           se_fi_update(sea, rq->core->core_forceidle_seq, true);
> > > +           se_fi_update(seb, rq->core->core_forceidle_seq, true);
> > > +   }
> >
> > As we chatted on IRC you mentioned the reason for the sync here is:
> >
> >  say we have 2 cgroups (a,b) under root, and we go force-idle in a, then we
> >  update a and root. Then we pick and end up in b, but b hasn't been updated
> >  yet.
> >
> > One thing I was wondering about that was, if the pick of 'b' happens much
> > later than 'a', then the snapshot might be happening too late right?
>
> No, since this is the first pick in b since fi, it cannot have advanced.
> So by updating to fi_seq before picking, we guarantee it is unchanged
> since we went fi.

Makes complete sense.

I got it to a point where the latencies are much lower, but still not
at a point where it's as good as the initial patch I posted.

There could be more bugs. At the moment, the only one I corrected in
your patch is making the truth table do !(!fib && fi). But there is
still something else going on.

Thanks!

- Joel

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-10-30  2:36         ` Joel Fernandes
@ 2020-10-30  2:42           ` Joel Fernandes
  2020-10-30  8:41             ` Peter Zijlstra
  0 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes @ 2020-10-30  2:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Glexiner, LKML, Ingo Molnar,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Aubrey Li, Tim Chen

On Thu, Oct 29, 2020 at 10:36 PM Joel Fernandes <joel@joelfernandes.org> wrote:

> > > > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> > > > +{
> > > > +   struct rq *rq = task_rq(a);
> > > > +   struct sched_entity *sea = &a->se;
> > > > +   struct sched_entity *seb = &b->se;
> > > > +   struct cfs_rq *cfs_rqa;
> > > > +   struct cfs_rq *cfs_rqb;
> > > > +   s64 delta;
> > > > +
> > > > +   SCHED_WARN_ON(task_rq(b)->core != rq->core);
> > > > +
> > > > +   while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
> > > > +           int sea_depth = sea->depth;
> > > > +           int seb_depth = seb->depth;
> > > > +
> > > > +           if (sea_depth >= seb_depth)
> > > > +                   sea = parent_entity(sea);
> > > > +           if (sea_depth <= seb_depth)
> > > > +                   seb = parent_entity(seb);
> > > > +   }
> > > > +
> > > > +   if (rq->core->core_forceidle) {
> > > > +           se_fi_update(sea, rq->core->core_forceidle_seq, true);
> > > > +           se_fi_update(seb, rq->core->core_forceidle_seq, true);
> > > > +   }
> > >
> > > As we chatted on IRC you mentioned the reason for the sync here is:
> > >
> > >  say we have 2 cgroups (a,b) under root, and we go force-idle in a, then we
> > >  update a and root. Then we pick and end up in b, but b hasn't been updated
> > >  yet.
> > >
> > > One thing I was wondering about that was, if the pick of 'b' happens much
> > > later than 'a', then the snapshot might be happening too late right?
> >
> > No, since this is the first pick in b since fi, it cannot have advanced.
> > So by updating to fi_seq before picking, we guarantee it is unchanged
> > since we went fi.
>
> Makes complete sense.
>
> I got it to a point where the latencies are much lower, but still not
> at a point where it's as good as the initial patch I posted.
>
> There could be more bugs. At the moment, the only one I corrected in
> your patch is making the truth table do !(!fib && fi). But there is
> still something else going on.

Forgot to ask, do you also need to do the task_vruntime_update() for
the unconstrained pick?

That's in line with what you mentioned: that you still need to do the
update if fi_before == false and fi_now == false.

So something like this?
@@ -4209,6 +4209,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
                                next = p;
                                trace_printk("unconstrained pick: %s/%d %lx\n",
                                             next->comm, next->pid,
                                             next->core_cookie);
+
+                               WARN_ON_ONCE(fi_before);
+                               task_vruntime_update(rq_i, p);
+
                                goto done;
                        }

Quoting the truth table:

> >         fib     fi      X
> >
> >         0       0       1
> >         0       1       0
> >         1       0       1
> >         1       1       1
> >

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-10-30  2:42           ` Joel Fernandes
@ 2020-10-30  8:41             ` Peter Zijlstra
  2020-10-31 21:41               ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Peter Zijlstra @ 2020-10-30  8:41 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Glexiner, LKML, Ingo Molnar,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Aubrey Li, Tim Chen

On Thu, Oct 29, 2020 at 10:42:29PM -0400, Joel Fernandes wrote:

> Forgot to ask, do you also need to do the task_vruntime_update() for
> the unconstrained pick?

Humm.. interesting case.

Yes, however... since in that case the picks aren't synchronized it's a
wee bit dodgy. I'll throw it on the pile together with SMT4.


Also, I'm still hoping you can make this form work:

  https://lkml.kernel.org/r/20201026093131.GF2628@hirez.programming.kicks-ass.net

(note that the put_prev_task() needs an additional rq argument)

That's both simpler code and faster.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-10-20  1:43 ` [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode Joel Fernandes (Google)
  2020-10-20  3:41   ` Randy Dunlap
  2020-10-22  5:48   ` Li, Aubrey
@ 2020-10-30 10:29   ` Alexandre Chartre
  2020-11-03  1:20     ` Joel Fernandes
  2 siblings, 1 reply; 98+ messages in thread
From: Alexandre Chartre @ 2020-10-30 10:29 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, James.Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Tim Chen, Paul E . McKenney


On 10/20/20 3:43 AM, Joel Fernandes (Google) wrote:
> Core-scheduling prevents hyperthreads in usermode from attacking each
> other, but it does not do anything about one of the hyperthreads
> entering the kernel for any reason. This leaves the door open for MDS
> and L1TF attacks with concurrent execution sequences between
> hyperthreads.
> 
> This patch therefore adds support for protecting all syscall and IRQ
> kernel mode entries. Care is taken to track the outermost usermode exit
> and entry using per-cpu counters. In cases where one of the hyperthreads
> enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> when not needed - example: idle and non-cookie HTs do not need to be
> forced into kernel mode.

Hi Joel,

In order to protect syscall/IRQ kernel mode entries, shouldn't we have a
call to sched_core_unsafe_enter() in the syscall/IRQ entry code? I don't
see such a call. Am I missing something?


> More information about attacks:
> For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> data to either host or guest attackers. For L1TF, it is possible to leak
> to guest attackers. There is no possible mitigation involving flushing
> of buffers to avoid this since the execution of attacker and victims
> happen concurrently on 2 or more HTs.
> 
> Cc: Julien Desfossez <jdesfossez@digitalocean.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Aaron Lu <aaron.lwe@gmail.com>
> Cc: Aubrey Li <aubrey.li@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>   .../admin-guide/kernel-parameters.txt         |   7 +
>   include/linux/entry-common.h                  |   2 +-
>   include/linux/sched.h                         |  12 +
>   kernel/entry/common.c                         |  25 +-
>   kernel/sched/core.c                           | 229 ++++++++++++++++++
>   kernel/sched/sched.h                          |   3 +
>   6 files changed, 275 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 3236427e2215..48567110f709 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4678,6 +4678,13 @@
>   
>   	sbni=		[NET] Granch SBNI12 leased line adapter
>   
> +	sched_core_protect_kernel=
> +			[SCHED_CORE] Pause SMT siblings of a core running in
> +			user mode, if at least one of the siblings of the core
> +			is running in kernel mode. This is to guarantee that
> +			kernel data is not leaked to tasks which are not trusted
> +			by the kernel.
> +
>   	sched_debug	[KNL] Enables verbose scheduler debug messages.
>   
>   	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 474f29638d2c..260216de357b 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -69,7 +69,7 @@
>   
>   #define EXIT_TO_USER_MODE_WORK						\
>   	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
> -	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
> +	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET |	\
>   	 ARCH_EXIT_TO_USER_MODE_WORK)
>   
>   /**
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d38e904dd603..fe6f225bfbf9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
>   
>   const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
>   
> +#ifdef CONFIG_SCHED_CORE
> +void sched_core_unsafe_enter(void);
> +void sched_core_unsafe_exit(void);
> +bool sched_core_wait_till_safe(unsigned long ti_check);
> +bool sched_core_kernel_protected(void);
> +#else
> +#define sched_core_unsafe_enter(ignore) do { } while (0)
> +#define sched_core_unsafe_exit(ignore) do { } while (0)
> +#define sched_core_wait_till_safe(ignore) do { } while (0)
> +#define sched_core_kernel_protected(ignore) do { } while (0)
> +#endif
> +
>   #endif
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 0a1e20f8d4e8..c8dc6b1b1f40 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -137,6 +137,26 @@ static __always_inline void exit_to_user_mode(void)
>   /* Workaround to allow gradual conversion of architecture code */
>   void __weak arch_do_signal(struct pt_regs *regs) { }
>   
> +unsigned long exit_to_user_get_work(void)
> +{
> +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +
> +	if (IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
> +		return ti_work;
> +
> +#ifdef CONFIG_SCHED_CORE
> +	ti_work &= EXIT_TO_USER_MODE_WORK;
> +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
> +		sched_core_unsafe_exit();
> +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {

If we call sched_core_unsafe_exit() before sched_core_wait_till_safe() then we
expose ourselves during the entire wait period in sched_core_wait_till_safe().
It would be better to call sched_core_unsafe_exit() once we know for sure we
are going to exit.

alex.


> +			sched_core_unsafe_enter(); /* not exiting to user yet. */
> +		}
> +	}
> +
> +	return READ_ONCE(current_thread_info()->flags);
> +#endif
> +}
> +
>   static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   					    unsigned long ti_work)
>   {
> @@ -175,7 +195,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   		 * enabled above.
>   		 */
>   		local_irq_disable_exit_to_user();
> -		ti_work = READ_ONCE(current_thread_info()->flags);
> +		ti_work = exit_to_user_get_work();
>   	}
>   
>   	/* Return the latest work state for arch_exit_to_user_mode() */
> @@ -184,9 +204,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   
>   static void exit_to_user_mode_prepare(struct pt_regs *regs)
>   {
> -	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +	unsigned long ti_work;
>   
>   	lockdep_assert_irqs_disabled();
> +	ti_work = exit_to_user_get_work();
>   
>   	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>   		ti_work = exit_to_user_mode_loop(regs, ti_work);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 02db5b024768..5a7aeaa914e3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
>   
>   #ifdef CONFIG_SCHED_CORE
>   
> +DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
> +static int __init set_sched_core_protect_kernel(char *str)
> +{
> +	unsigned long val = 0;
> +
> +	if (!str)
> +		return 0;
> +
> +	if (!kstrtoul(str, 0, &val) && !val)
> +		static_branch_disable(&sched_core_protect_kernel);
> +
> +	return 1;
> +}
> +__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
> +
> +/* Is the kernel protected by core scheduling? */
> +bool sched_core_kernel_protected(void)
> +{
> +	return static_branch_likely(&sched_core_protect_kernel);
> +}
> +
>   DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
>   
>   /* kernel prio, less is more */
> @@ -4596,6 +4617,214 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
>   	return a->core_cookie == b->core_cookie;
>   }
>   
> +/*
> + * Handler to attempt to enter kernel. It does nothing because the exit to
> + * usermode or guest mode will do the actual work (of waiting if needed).
> + */
> +static void sched_core_irq_work(struct irq_work *work)
> +{
> +	return;
> +}
> +
> +static inline void init_sched_core_irq_work(struct rq *rq)
> +{
> +	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
> +}
> +
> +/*
> + * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
> + * exits the core-wide unsafe state. Obviously the CPU calling this function
> + * should not be responsible for the core being in the core-wide unsafe state
> + * otherwise it will deadlock.
> + *
> + * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
> + *            the loop if TIF flags are set and notify caller about it.
> + *
> + * IRQs should be disabled.
> + */
> +bool sched_core_wait_till_safe(unsigned long ti_check)
> +{
> +	bool restart = false;
> +	struct rq *rq;
> +	int cpu;
> +
> +	/* We clear the thread flag only at the end, so need to check for it. */
> +	ti_check &= ~_TIF_UNSAFE_RET;
> +
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/* Down grade to allow interrupts to prevent stop_machine lockups.. */
> +	preempt_disable();
> +	local_irq_enable();
> +
> +	/*
> +	 * Wait till the core of this HT is not in an unsafe state.
> +	 *
> +	 * Pair with smp_store_release() in sched_core_unsafe_exit().
> +	 */
> +	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
> +		cpu_relax();
> +		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
> +			restart = true;
> +			break;
> +		}
> +	}
> +
> +	/* Upgrade it back to the expectations of entry code. */
> +	local_irq_disable();
> +	preempt_enable();
> +
> +ret:
> +	if (!restart)
> +		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	return restart;
> +}
> +
> +/*
> + * Enter the core-wide IRQ state. Sibling will be paused if it is running
> + * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
> + * avoid sending useless IPIs is made. Must be called only from hard IRQ
> + * context.
> + */
> +void sched_core_unsafe_enter(void)
> +{
> +	const struct cpumask *smt_mask;
> +	unsigned long flags;
> +	struct rq *rq;
> +	int i, cpu;
> +
> +	if (!static_branch_likely(&sched_core_protect_kernel))
> +		return;
> +
> +	/* Ensure that on return to user/guest, we check whether to wait. */
> +	if (current->core_cookie)
> +		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
> +	rq->core_this_unsafe_nest++;
> +
> +	/* Should not nest: enter() should only pair with exit(). */
> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
> +		goto ret;
> +
> +	raw_spin_lock(rq_lockp(rq));
> +	smt_mask = cpu_smt_mask(cpu);
> +
> +	/* Contribute this CPU's unsafe_enter() to core-wide unsafe_enter() count. */
> +	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
> +
> +	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
> +		goto unlock;
> +
> +	if (irq_work_is_busy(&rq->core_irq_work)) {
> +		/*
> +		 * Do nothing more since we are in an IPI sent from another
> +		 * sibling to enforce safety. That sibling would have sent IPIs
> +		 * to all of the HTs.
> +		 */
> +		goto unlock;
> +	}
> +
> +	/*
> +	 * If we are not the first ones on the core to enter core-wide unsafe
> +	 * state, do nothing.
> +	 */
> +	if (rq->core->core_unsafe_nest > 1)
> +		goto unlock;
> +
> +	/* Do nothing more if the core is not tagged. */
> +	if (!rq->core->core_cookie)
> +		goto unlock;
> +
> +	for_each_cpu(i, smt_mask) {
> +		struct rq *srq = cpu_rq(i);
> +
> +		if (i == cpu || cpu_is_offline(i))
> +			continue;
> +
> +		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
> +			continue;
> +
> +		/* Skip if HT is not running a tagged task. */
> +		if (!srq->curr->core_cookie && !srq->core_pick)
> +			continue;
> +
> +		/*
> +		 * Force sibling into the kernel by IPI. If work was already
> +		 * pending, no new IPIs are sent. This is Ok since the receiver
> +		 * would already be in the kernel, or on its way to it.
> +		 */
> +		irq_work_queue_on(&srq->core_irq_work, i);
> +	}
> +unlock:
> +	raw_spin_unlock(rq_lockp(rq));
> +ret:
> +	local_irq_restore(flags);
> +}
> +
> +/*
> + * Process any work need for either exiting the core-wide unsafe state, or for
> + * waiting on this hyperthread if the core is still in this state.
> + *
> + * @idle: Are we called from the idle loop?
> + */
> +void sched_core_unsafe_exit(void)
> +{
> +	unsigned long flags;
> +	unsigned int nest;
> +	struct rq *rq;
> +	int cpu;
> +
> +	if (!static_branch_likely(&sched_core_protect_kernel))
> +		return;
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +
> +	/* Do nothing if core-sched disabled. */
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/*
> +	 * Can happen when a process is forked and the first return to user
> +	 * mode is a syscall exit. Either way, there's nothing to do.
> +	 */
> +	if (rq->core_this_unsafe_nest == 0)
> +		goto ret;
> +
> +	rq->core_this_unsafe_nest--;
> +
> +	/* enter() should be paired with exit() only. */
> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
> +		goto ret;
> +
> +	raw_spin_lock(rq_lockp(rq));
> +	/*
> +	 * Core-wide nesting counter can never be 0 because we are
> +	 * still in it on this CPU.
> +	 */
> +	nest = rq->core->core_unsafe_nest;
> +	WARN_ON_ONCE(!nest);
> +
> +	/* Pair with smp_load_acquire() in sched_core_wait_till_safe(). */
> +	smp_store_release(&rq->core->core_unsafe_nest, nest - 1);
> +	raw_spin_unlock(rq_lockp(rq));
> +ret:
> +	local_irq_restore(flags);
> +}
> +
>   // XXX fairness/fwd progress conditions
>   /*
>    * Returns
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f7e2d8a3be8e..4bcf3b1ddfb3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1059,12 +1059,15 @@ struct rq {
>   	unsigned int		core_enabled;
>   	unsigned int		core_sched_seq;
>   	struct rb_root		core_tree;
> +	struct irq_work		core_irq_work; /* To force HT into kernel */
> +	unsigned int		core_this_unsafe_nest;
>   
>   	/* shared state */
>   	unsigned int		core_task_seq;
>   	unsigned int		core_pick_seq;
>   	unsigned long		core_cookie;
>   	unsigned char		core_forceidle;
> +	unsigned int		core_unsafe_nest;
>   #endif
>   };
>   
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 00/26] Core scheduling
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (25 preceding siblings ...)
  2020-10-20  1:43 ` [PATCH v8 -tip 26/26] sched: Debug bits Joel Fernandes (Google)
@ 2020-10-30 13:26 ` Ning, Hongyu
  2020-11-06  2:58   ` Li, Aubrey
  2020-11-06 20:55 ` [RFT for v9] (Was Re: [PATCH v8 -tip 00/26] Core scheduling) Joel Fernandes
  27 siblings, 1 reply; 98+ messages in thread
From: Ning, Hongyu @ 2020-10-30 13:26 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
> Eighth iteration of the Core-Scheduling feature.
> 
> Core scheduling is a feature that allows only trusted tasks to run
> concurrently on cpus sharing compute resources (eg: hyperthreads on a
> core). The goal is to mitigate the core-level side-channel attacks
> without requiring to disable SMT (which has a significant impact on
> performance in some situations). Core scheduling (as of v7) mitigates
> user-space to user-space attacks and user to kernel attack when one of
> the siblings enters the kernel via interrupts or system call.
> 
> By default, the feature doesn't change any of the current scheduler
> behavior. The user decides which tasks can run simultaneously on the
> same core (for now by having them in the same tagged cgroup). When a tag
> is enabled in a cgroup and a task from that cgroup is running on a
> hardware thread, the scheduler ensures that only idle or trusted tasks
> run on the other sibling(s). Besides security concerns, this feature can
> also be beneficial for RT and performance applications where we want to
> control how tasks make use of SMT dynamically.
> 
> This iteration focuses on the the following stuff:
> - Redesigned API.
> - Rework of Kernel Protection feature based on Thomas's entry work.
> - Rework of hotplug fixes.
> - Address review comments in v7
> 
> Joel: Both a CGroup and Per-task interface via prctl(2) are provided for
> configuring core sharing. More details are provided in documentation patch.
> Kselftests are provided to verify the correctness/rules of the interface.
> 
> Julien: TPCC tests showed improvements with core-scheduling. With kernel
> protection enabled, it does not show any regression. Possibly ASI will improve
> the performance for those who choose kernel protection (can be toggled through
> sched_core_protect_kernel sysctl). Results:
> v8				average		stdev		diff
> baseline (SMT on)		1197.272	44.78312824	
> core sched (   kernel protect)	412.9895	45.42734343	-65.51%
> core sched (no kernel protect)	686.6515	71.77756931	-42.65%
> nosmt				408.667		39.39042872	-65.87%
> 
> v8 is rebased on tip/master.
> 
> Future work
> ===========
> - Load balancing/Migration fixes for core scheduling.
>   With v6, Load balancing is partially coresched aware, but has some
>   issues w.r.t process/taskgroup weights:
>   https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...
> - Core scheduling test framework: kselftests, torture tests etc
> 
> Changes in v8
> =============
> - New interface/API implementation
>   - Joel
> - Revised kernel protection patch
>   - Joel
> - Revised Hotplug fixes
>   - Joel
> - Minor bug fixes and address review comments
>   - Vineeth
> 

> create mode 100644 tools/testing/selftests/sched/config
> create mode 100644 tools/testing/selftests/sched/test_coresched.c
> 

Adding test results for 4 workloads for Core Scheduling v8:

- kernel under test: coresched community v8 from https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched-v5.9
- workloads: 
	-- A. sysbench cpu (192 threads) + sysbench cpu (192 threads)
	-- B. sysbench cpu (192 threads) + sysbench mysql (192 threads, mysqld forced into the same cgroup)
	-- C. uperf netperf.xml (192 threads over TCP or UDP protocol separately)
	-- D. will-it-scale context_switch via pipe (192 threads)
- test machine setup: 
	CPU(s):              192
	On-line CPU(s) list: 0-191
	Thread(s) per core:  2
	Core(s) per socket:  48
	Socket(s):           2
	NUMA node(s):        4
- test results:
	-- workload A, no obvious performance drop in cs_on:
	+----------------------+------+----------------------+------------------------+
	|                      | **   | sysbench cpu * 192   | sysbench cpu * 192     |
	+======================+======+======================+========================+
	| cgroup               | **   | cg_sysbench_cpu_0    | cg_sysbench_cpu_1      |
	+----------------------+------+----------------------+------------------------+
	| record_item          | **   | Tput_avg (events/s)  | Tput_avg (events/s)    |
	+----------------------+------+----------------------+------------------------+
	| coresched_normalized | **   | 1.01                 | 0.98                   |
	+----------------------+------+----------------------+------------------------+
	| default_normalized   | **   | 1                    | 1                      |
	+----------------------+------+----------------------+------------------------+
	| smtoff_normalized    | **   | 0.6                  | 0.6                    |
	+----------------------+------+----------------------+------------------------+

	-- workload B, no obvious performance drop in cs_on:
	+----------------------+------+----------------------+------------------------+
	|                      | **   | sysbench cpu * 192   | sysbench mysql * 192   |
	+======================+======+======================+========================+
	| cgroup               | **   | cg_sysbench_cpu_0    | cg_sysbench_mysql_0    |
	+----------------------+------+----------------------+------------------------+
	| record_item          | **   | Tput_avg (events/s)  | Tput_avg (events/s)    |
	+----------------------+------+----------------------+------------------------+
	| coresched_normalized | **   | 1.01                 | 0.87                   |
	+----------------------+------+----------------------+------------------------+
	| default_normalized   | **   | 1                    | 1                      |
	+----------------------+------+----------------------+------------------------+
	| smtoff_normalized    | **   | 0.59                 | 0.82                   |
	+----------------------+------+----------------------+------------------------+

	-- workload C, known performance drop in cs_on since Core Scheduling v6:
	+----------------------+------+---------------------------+---------------------------+
	|                      | **   | uperf netperf TCP * 192   | uperf netperf UDP * 192   |
	+======================+======+===========================+===========================+
	| cgroup               | **   | cg_uperf                  | cg_uperf                  |
	+----------------------+------+---------------------------+---------------------------+
	| record_item          | **   | Tput_avg (Gb/s)           | Tput_avg (Gb/s)           |
	+----------------------+------+---------------------------+---------------------------+
	| coresched_normalized | **   | 0.46                      | 0.48                      |
	+----------------------+------+---------------------------+---------------------------+
	| default_normalized   | **   | 1                         | 1                         |
	+----------------------+------+---------------------------+---------------------------+
	| smtoff_normalized    | **   | 0.82                      | 0.79                      |
	+----------------------+------+---------------------------+---------------------------+

	-- workload D, newly added syscall workload, performance drop in cs_on:
	+----------------------+------+-------------------------------+
	|                      | **   | will-it-scale  * 192          |
	|                      |      | (pipe based context_switch)   |
	+======================+======+===============================+
	| cgroup               | **   | cg_will-it-scale              |
	+----------------------+------+-------------------------------+
	| record_item          | **   | threads_avg                   |
	+----------------------+------+-------------------------------+
	| coresched_normalized | **   | 0.2                           |
	+----------------------+------+-------------------------------+
	| default_normalized   | **   | 1                             |
	+----------------------+------+-------------------------------+
	| smtoff_normalized    | **   | 0.89                          |
	+----------------------+------+-------------------------------+

	comments: per internal analysis, a huge amount of spin_lock contention is suspected with cs_on, which may explain the significant performance drop

- notes on test results record_item:
	* coresched_normalized: smton, cs enabled, test result normalized by default value
	* default_normalized: smton, cs disabled, test result normalized by default value
	* smtoff_normalized: smtoff, test result normalized by default value
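
	(For reference: each *_normalized value is presumably Tput_avg of that
	configuration divided by Tput_avg of the default configuration (SMT on,
	cs disabled), so 1.0 means parity with default and lower means worse.)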

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH] sched: Change all 4 space tabs to actual tabs
  2020-10-20  1:43 ` [PATCH v8 -tip 23/26] kselftest: Add tests for core-sched interface Joel Fernandes (Google)
@ 2020-10-30 22:20   ` John B. Wyatt IV
  0 siblings, 0 replies; 98+ messages in thread
From: John B. Wyatt IV @ 2020-10-30 22:20 UTC (permalink / raw)
  To: ' Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen '
  Cc: John B. Wyatt IV

This patch fixes over a hundred style warnings caused by 4-space indentation
in patch 23 of the coresched v8 patch set posted to the list.

Please review. A script was written to convert all 4-space indented blocks to
tabs, and some manual editing was done to fix any remaining alignment issues
caused by the conversion to tabs.

Issues reported by checkpatch.

Signed-off-by: John B. Wyatt IV <jbwyatt4@gmail.com>
---
 .../testing/selftests/sched/test_coresched.c  | 998 +++++++++---------
 1 file changed, 499 insertions(+), 499 deletions(-)

diff --git a/tools/testing/selftests/sched/test_coresched.c b/tools/testing/selftests/sched/test_coresched.c
index 2fdefb843115..44a33ef88435 100644
--- a/tools/testing/selftests/sched/test_coresched.c
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -33,20 +33,20 @@
 
 void print_banner(char *s)
 {
-    printf("coresched: %s:  ", s);
+	printf("coresched: %s:  ", s);
 }
 
 void print_pass(void)
 {
-    printf("PASS\n");
+	printf("PASS\n");
 }
 
 void assert_cond(int cond, char *str)
 {
-    if (!cond) {
-	printf("Error: %s\n", str);
-	abort();
-    }
+	if (!cond) {
+		printf("Error: %s\n", str);
+		abort();
+	}
 }
 
 char *make_group_root(void)
@@ -56,8 +56,8 @@ char *make_group_root(void)
 
 	mntpath = malloc(50);
 	if (!mntpath) {
-	    perror("Failed to allocate mntpath\n");
-	    abort();
+		perror("Failed to allocate mntpath\n");
+		abort();
 	}
 
 	sprintf(mntpath, "/tmp/coresched-test-XXXXXX");
@@ -78,82 +78,82 @@ char *make_group_root(void)
 
 char *read_group_cookie(char *cgroup_path)
 {
-    char path[50] = {}, *val;
-    int fd;
+	char path[50] = {}, *val;
+	int fd;
 
-    sprintf(path, "%s/cpu.core_group_cookie", cgroup_path);
-    fd = open(path, O_RDONLY, 0666);
-    if (fd == -1) {
-	perror("Open of cgroup tag path failed: ");
-	abort();
-    }
+	sprintf(path, "%s/cpu.core_group_cookie", cgroup_path);
+	fd = open(path, O_RDONLY, 0666);
+	if (fd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
 
-    val = calloc(1, 50);
-    if (read(fd, val, 50) == -1) {
-	perror("Failed to read group cookie: ");
-	abort();
-    }
+	val = calloc(1, 50);
+	if (read(fd, val, 50) == -1) {
+		perror("Failed to read group cookie: ");
+		abort();
+	}
 
-    val[strcspn(val, "\r\n")] = 0;
+	val[strcspn(val, "\r\n")] = 0;
 
-    close(fd);
-    return val;
+	close(fd);
+	return val;
 }
 
 void assert_group_tag(char *cgroup_path, char *tag)
 {
-    char tag_path[50] = {}, rdbuf[8] = {};
-    int tfd;
+	char tag_path[50] = {}, rdbuf[8] = {};
+	int tfd;
 
-    sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
-    tfd = open(tag_path, O_RDONLY, 0666);
-    if (tfd == -1) {
-	perror("Open of cgroup tag path failed: ");
-	abort();
-    }
+	sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+	tfd = open(tag_path, O_RDONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
 
-    if (read(tfd, rdbuf, 1) != 1) {
-	perror("Failed to enable coresched on cgroup: ");
-	abort();
-    }
+	if (read(tfd, rdbuf, 1) != 1) {
+		perror("Failed to enable coresched on cgroup: ");
+		abort();
+	}
 
-    if (strcmp(rdbuf, tag)) {
-	printf("Group tag does not match (exp: %s, act: %s)\n", tag, rdbuf);
-	abort();
-    }
+	if (strcmp(rdbuf, tag)) {
+		printf("Group tag does not match (exp: %s, act: %s)\n", tag, rdbuf);
+		abort();
+	}
 
-    if (close(tfd) == -1) {
-	perror("Failed to close tag fd: ");
-	abort();
-    }
+	if (close(tfd) == -1) {
+		perror("Failed to close tag fd: ");
+		abort();
+	}
 }
 
 void assert_group_color(char *cgroup_path, const char *color)
 {
-    char tag_path[50] = {}, rdbuf[8] = {};
-    int tfd;
+	char tag_path[50] = {}, rdbuf[8] = {};
+	int tfd;
 
-    sprintf(tag_path, "%s/cpu.core_tag_color", cgroup_path);
-    tfd = open(tag_path, O_RDONLY, 0666);
-    if (tfd == -1) {
-	perror("Open of cgroup tag path failed: ");
-	abort();
-    }
+	sprintf(tag_path, "%s/cpu.core_tag_color", cgroup_path);
+	tfd = open(tag_path, O_RDONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
 
-    if (read(tfd, rdbuf, 8) == -1) {
-	perror("Failed to read group color\n");
-	abort();
-    }
+	if (read(tfd, rdbuf, 8) == -1) {
+		perror("Failed to read group color\n");
+		abort();
+	}
 
-    if (strncmp(color, rdbuf, strlen(color))) {
-	printf("Group color does not match (exp: %s, act: %s)\n", color, rdbuf);
-	abort();
-    }
+	if (strncmp(color, rdbuf, strlen(color))) {
+		printf("Group color does not match (exp: %s, act: %s)\n", color, rdbuf);
+		abort();
+	}
 
-    if (close(tfd) == -1) {
-	perror("Failed to close color fd: ");
-	abort();
-    }
+	if (close(tfd) == -1) {
+		perror("Failed to close color fd: ");
+		abort();
+	}
 }
 
 void color_group(char *cgroup_path, const char *color_str)
@@ -172,11 +172,11 @@ void color_group(char *cgroup_path, const char *color_str)
 
 	ret = write(tfd, color_str, strlen(color_str));
 	assert_cond(color < 256 || ret == -1,
-		    "Writing invalid range color should have failed!");
+			"Writing invalid range color should have failed!");
 
 	if (color < 1 || color > 255) {
-	    close(tfd);
-	    return;
+		close(tfd);
+		return;
 	}
 
 	if (ret == -1) {
@@ -248,12 +248,12 @@ char *make_group(char *parent, char *name)
 	int ret;
 
 	if (!parent && !name)
-	    return make_group_root();
+		return make_group_root();
 
 	cgroup_path = malloc(50);
 	if (!cgroup_path) {
-	    perror("Failed to allocate cgroup_path\n");
-	    abort();
+		perror("Failed to allocate cgroup_path\n");
+		abort();
 	}
 
 	/* Make the cgroup node for this group */
@@ -269,278 +269,278 @@ char *make_group(char *parent, char *name)
 
 static void del_group(char *path)
 {
-    if (rmdir(path) != 0) {
-	printf("Removal of group failed\n");
-	abort();
-    }
+	if (rmdir(path) != 0) {
+		printf("Removal of group failed\n");
+		abort();
+	}
 
-    free(path);
+	free(path);
 }
 
 static void del_root_group(char *path)
 {
-    if (umount(path) != 0) {
-	perror("umount of cgroup failed\n");
-	abort();
-    }
+	if (umount(path) != 0) {
+		perror("umount of cgroup failed\n");
+		abort();
+	}
 
-    if (rmdir(path) != 0) {
-	printf("Removal of group failed\n");
-	abort();
-    }
+	if (rmdir(path) != 0) {
+		printf("Removal of group failed\n");
+		abort();
+	}
 
-    free(path);
+	free(path);
 }
 
 void assert_group_cookie_equal(char *c1, char *c2)
 {
-    char *v1, *v2;
+	char *v1, *v2;
 
-    v1 = read_group_cookie(c1);
-    v2 = read_group_cookie(c2);
-    if (strcmp(v1, v2)) {
-	printf("Group cookies not equal\n");
-	abort();
-    }
+	v1 = read_group_cookie(c1);
+	v2 = read_group_cookie(c2);
+	if (strcmp(v1, v2)) {
+		printf("Group cookies not equal\n");
+		abort();
+	}
 
-    free(v1);
-    free(v2);
+	free(v1);
+	free(v2);
 }
 
 void assert_group_cookie_not_equal(char *c1, char *c2)
 {
-    char *v1, *v2;
+	char *v1, *v2;
 
-    v1 = read_group_cookie(c1);
-    v2 = read_group_cookie(c2);
-    if (!strcmp(v1, v2)) {
-	printf("Group cookies not equal\n");
-	abort();
-    }
+	v1 = read_group_cookie(c1);
+	v2 = read_group_cookie(c2);
+	if (!strcmp(v1, v2)) {
+		printf("Group cookies not equal\n");
+		abort();
+	}
 
-    free(v1);
-    free(v2);
+	free(v1);
+	free(v2);
 }
 
 void assert_group_cookie_not_zero(char *c1)
 {
-    char *v1 = read_group_cookie(c1);
+	char *v1 = read_group_cookie(c1);
 
-    v1[1] = 0;
-    if (!strcmp(v1, "0")) {
-	printf("Group cookie zero\n");
-	abort();
-    }
-    free(v1);
+	v1[1] = 0;
+	if (!strcmp(v1, "0")) {
+		printf("Group cookie zero\n");
+		abort();
+	}
+	free(v1);
 }
 
 void assert_group_cookie_zero(char *c1)
 {
-    char *v1 = read_group_cookie(c1);
+	char *v1 = read_group_cookie(c1);
 
-    v1[1] = 0;
-    if (strcmp(v1, "0")) {
-	printf("Group cookie not zero");
-	abort();
-    }
-    free(v1);
+	v1[1] = 0;
+	if (strcmp(v1, "0")) {
+		printf("Group cookie not zero");
+		abort();
+	}
+	free(v1);
 }
 
 struct task_state {
-    int pid_share;
-    char pid_str[50];
-    pthread_mutex_t m;
-    pthread_cond_t cond;
-    pthread_cond_t cond_par;
+	int pid_share;
+	char pid_str[50];
+	pthread_mutex_t m;
+	pthread_cond_t cond;
+	pthread_cond_t cond_par;
 };
 
 struct task_state *add_task(char *p)
 {
-    struct task_state *mem;
-    pthread_mutexattr_t am;
-    pthread_condattr_t a;
-    char tasks_path[50];
-    int tfd, pid, ret;
-
-    sprintf(tasks_path, "%s/tasks", p);
-    tfd = open(tasks_path, O_WRONLY, 0666);
-    if (tfd == -1) {
-	perror("Open of cgroup tasks path failed: ");
-	abort();
-    }
-
-    mem = mmap(NULL, sizeof *mem, PROT_READ | PROT_WRITE,
-	    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
-    memset(mem, 0, sizeof(*mem));
-
-    pthread_condattr_init(&a);
-    pthread_condattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
-    pthread_mutexattr_init(&am);
-    pthread_mutexattr_setpshared(&am, PTHREAD_PROCESS_SHARED);
-
-    pthread_cond_init(&mem->cond, &a);
-    pthread_cond_init(&mem->cond_par, &a);
-    pthread_mutex_init(&mem->m, &am);
-
-    pid = fork();
-    if (pid == 0) {
-	while(1) {
-	    pthread_mutex_lock(&mem->m);
-	    while(!mem->pid_share)
-		pthread_cond_wait(&mem->cond, &mem->m);
-
-	    pid = mem->pid_share;
-	    mem->pid_share = 0;
-	    if (pid == -1)
-		pid = 0;
-	    prctl(PR_SCHED_CORE_SHARE, pid);
-	    pthread_mutex_unlock(&mem->m);
-	    pthread_cond_signal(&mem->cond_par);
-	}
-    }
-
-    sprintf(mem->pid_str, "%d", pid);
-    dprint("add task %d to group %s", pid, p);
-
-    ret = write(tfd, mem->pid_str, strlen(mem->pid_str));
-    assert_cond(ret != -1,
-	    "Failed to write pid into tasks");
-
-    close(tfd);
-    return mem;
+	struct task_state *mem;
+	pthread_mutexattr_t am;
+	pthread_condattr_t a;
+	char tasks_path[50];
+	int tfd, pid, ret;
+
+	sprintf(tasks_path, "%s/tasks", p);
+	tfd = open(tasks_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tasks path failed: ");
+		abort();
+	}
+
+	mem = mmap(NULL, sizeof *mem, PROT_READ | PROT_WRITE,
+		MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+	memset(mem, 0, sizeof(*mem));
+
+	pthread_condattr_init(&a);
+	pthread_condattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
+	pthread_mutexattr_init(&am);
+	pthread_mutexattr_setpshared(&am, PTHREAD_PROCESS_SHARED);
+
+	pthread_cond_init(&mem->cond, &a);
+	pthread_cond_init(&mem->cond_par, &a);
+	pthread_mutex_init(&mem->m, &am);
+
+	pid = fork();
+	if (pid == 0) {
+		while(1) {
+			pthread_mutex_lock(&mem->m);
+			while(!mem->pid_share)
+				pthread_cond_wait(&mem->cond, &mem->m);
+
+			pid = mem->pid_share;
+			mem->pid_share = 0;
+			if (pid == -1)
+				pid = 0;
+			prctl(PR_SCHED_CORE_SHARE, pid);
+			pthread_mutex_unlock(&mem->m);
+			pthread_cond_signal(&mem->cond_par);
+		}
+	}
+
+	sprintf(mem->pid_str, "%d", pid);
+	dprint("add task %d to group %s", pid, p);
+
+	ret = write(tfd, mem->pid_str, strlen(mem->pid_str));
+	assert_cond(ret != -1,
+		"Failed to write pid into tasks");
+
+	close(tfd);
+	return mem;
 }
 
 /* Make t1 share with t2 */
 void make_tasks_share(struct task_state *t1, struct task_state *t2)
 {
-    int p2 = atoi(t2->pid_str);
-    dprint("task %s %s", t1->pid_str, t2->pid_str);
+	int p2 = atoi(t2->pid_str);
+	dprint("task %s %s", t1->pid_str, t2->pid_str);
 
-    pthread_mutex_lock(&t1->m);
-    t1->pid_share = p2;
-    pthread_mutex_unlock(&t1->m);
+	pthread_mutex_lock(&t1->m);
+	t1->pid_share = p2;
+	pthread_mutex_unlock(&t1->m);
 
-    pthread_cond_signal(&t1->cond);
+	pthread_cond_signal(&t1->cond);
 
-    pthread_mutex_lock(&t1->m);
-    while (t1->pid_share)
-	pthread_cond_wait(&t1->cond_par, &t1->m);
-    pthread_mutex_unlock(&t1->m);
+	pthread_mutex_lock(&t1->m);
+	while (t1->pid_share)
+		pthread_cond_wait(&t1->cond_par, &t1->m);
+	pthread_mutex_unlock(&t1->m);
 }
 
 /* Make t1 share with t2 */
 void reset_task_cookie(struct task_state *t1)
 {
-    dprint("task %s", t1->pid_str);
+	dprint("task %s", t1->pid_str);
 
-    pthread_mutex_lock(&t1->m);
-    t1->pid_share = -1;
-    pthread_mutex_unlock(&t1->m);
+	pthread_mutex_lock(&t1->m);
+	t1->pid_share = -1;
+	pthread_mutex_unlock(&t1->m);
 
-    pthread_cond_signal(&t1->cond);
+	pthread_cond_signal(&t1->cond);
 
-    pthread_mutex_lock(&t1->m);
-    while (t1->pid_share)
-	pthread_cond_wait(&t1->cond_par, &t1->m);
-    pthread_mutex_unlock(&t1->m);
+	pthread_mutex_lock(&t1->m);
+	while (t1->pid_share)
+		pthread_cond_wait(&t1->cond_par, &t1->m);
+	pthread_mutex_unlock(&t1->m);
 }
 
 char *get_task_core_cookie(char *pid)
 {
-    char proc_path[50];
-    int found = 0;
-    char *line;
-    int i, j;
-    FILE *fp;
-
-    line = malloc(1024);
-    assert_cond(!!line, "Failed to alloc memory");
-
-    sprintf(proc_path, "/proc/%s/sched", pid);
-
-    fp = fopen(proc_path, "r");
-    while ((fgets(line, 1024, fp)) != NULL)
-    {
-        if(!strstr(line, "core_cookie"))
-            continue;
-
-        for (j = 0, i = 0; i < 1024 && line[i] != '\0'; i++)
-            if (line[i] >= '0' && line[i] <= '9')
-                line[j++] = line[i];
-        line[j] = '\0';
-        found = 1;
-        break;
-    }
-
-    fclose(fp);
-
-    if (found) {
-        return line;
-    } else {
-        free(line);
+	char proc_path[50];
+	int found = 0;
+	char *line;
+	int i, j;
+	FILE *fp;
+
+	line = malloc(1024);
+	assert_cond(!!line, "Failed to alloc memory");
+
+	sprintf(proc_path, "/proc/%s/sched", pid);
+
+	fp = fopen(proc_path, "r");
+	while ((fgets(line, 1024, fp)) != NULL)
+	{
+		if(!strstr(line, "core_cookie"))
+			continue;
+
+		for (j = 0, i = 0; i < 1024 && line[i] != '\0'; i++)
+			if (line[i] >= '0' && line[i] <= '9')
+				line[j++] = line[i];
+		line[j] = '\0';
+		found = 1;
+		break;
+	}
+
+	fclose(fp);
+
+	if (found) {
+		return line;
+	} else {
+		free(line);
 	printf("core_cookie not found. Enable SCHED_DEBUG?\n");
 	abort();
-        return NULL;
-    }
+		return NULL;
+	}
 }
 
 void assert_tasks_share(struct task_state *t1, struct task_state *t2)
 {
-    char *c1, *c2;
-
-    c1 = get_task_core_cookie(t1->pid_str);
-    c2 = get_task_core_cookie(t2->pid_str);
-    dprint("check task (%s) cookie (%s) == task (%s) cookie (%s)",
-	    t1->pid_str, c1, t2->pid_str, c2);
-    assert_cond(!strcmp(c1, c2), "Tasks don't share cookie");
-    free(c1); free(c2);
+	char *c1, *c2;
+
+	c1 = get_task_core_cookie(t1->pid_str);
+	c2 = get_task_core_cookie(t2->pid_str);
+	dprint("check task (%s) cookie (%s) == task (%s) cookie (%s)",
+		t1->pid_str, c1, t2->pid_str, c2);
+	assert_cond(!strcmp(c1, c2), "Tasks don't share cookie");
+	free(c1); free(c2);
 }
 
 void assert_tasks_dont_share(struct task_state *t1,  struct task_state *t2)
 {
-    char *c1, *c2;
-    c1 = get_task_core_cookie(t1->pid_str);
-    c2 = get_task_core_cookie(t2->pid_str);
-    dprint("check task (%s) cookie (%s) != task (%s) cookie (%s)",
-	    t1->pid_str, c1, t2->pid_str, c2);
-    assert_cond(strcmp(c1, c2), "Tasks don't share cookie");
-    free(c1); free(c2);
+	char *c1, *c2;
+	c1 = get_task_core_cookie(t1->pid_str);
+	c2 = get_task_core_cookie(t2->pid_str);
+	dprint("check task (%s) cookie (%s) != task (%s) cookie (%s)",
+		t1->pid_str, c1, t2->pid_str, c2);
+	assert_cond(strcmp(c1, c2), "Tasks don't share cookie");
+	free(c1); free(c2);
 }
 
 void assert_group_cookie_equals_task_cookie(char *g, char *pid)
 {
-    char *gk;
-    char *tk;
+	char *gk;
+	char *tk;
 
-    gk = read_group_cookie(g);
-    tk = get_task_core_cookie(pid);
+	gk = read_group_cookie(g);
+	tk = get_task_core_cookie(pid);
 
-    assert_cond(!strcmp(gk, tk), "Group cookie not equal to tasks'");
+	assert_cond(!strcmp(gk, tk), "Group cookie not equal to tasks'");
 
-    free(gk);
-    free(tk);
+	free(gk);
+	free(tk);
 }
 
 void assert_group_cookie_not_equals_task_cookie(char *g, char *pid)
 {
-    char *gk;
-    char *tk;
+	char *gk;
+	char *tk;
 
-    gk = read_group_cookie(g);
-    tk = get_task_core_cookie(pid);
+	gk = read_group_cookie(g);
+	tk = get_task_core_cookie(pid);
 
-    assert_cond(strcmp(gk, tk), "Group cookie not equal to tasks'");
+	assert_cond(strcmp(gk, tk), "Group cookie not equal to tasks'");
 
-    free(gk);
-    free(tk);
+	free(gk);
+	free(tk);
 }
 
 void kill_task(struct task_state *t)
 {
-    int pid = atoi(t->pid_str);
+	int pid = atoi(t->pid_str);
 
-    kill(pid, SIGKILL);
-    waitpid(pid, NULL, 0);
+	kill(pid, SIGKILL);
+	waitpid(pid, NULL, 0);
 }
 
 /*
@@ -556,51 +556,51 @@ void kill_task(struct task_state *t)
  */
 static void test_cgroup_coloring(char *root)
 {
-    char *y1, *y2, *y22, *r1, *r11, *b3, *r4;
+	char *y1, *y2, *y22, *r1, *r11, *b3, *r4;
 
-    print_banner("TEST-CGROUP-COLORING");
+	print_banner("TEST-CGROUP-COLORING");
 
-    y1 = make_group(root, "y1");
-    tag_group(y1);
+	y1 = make_group(root, "y1");
+	tag_group(y1);
 
-    y2 = make_group(y1, "y2");
-    y22 = make_group(y2, "y22");
+	y2 = make_group(y1, "y2");
+	y22 = make_group(y2, "y22");
 
-    r1 = make_group(y1, "y1");
-    r11 = make_group(r1, "r11");
+	r1 = make_group(y1, "y1");
+	r11 = make_group(r1, "r11");
 
-    color_group(r1, "256"); /* Wouldn't succeed. */
-    color_group(r1, "0");   /* Wouldn't succeed. */
-    color_group(r1, "254");
+	color_group(r1, "256"); /* Wouldn't succeed. */
+	color_group(r1, "0");   /* Wouldn't succeed. */
+	color_group(r1, "254");
 
-    b3 = make_group(y1, "b3");
-    color_group(b3, "8");
+	b3 = make_group(y1, "b3");
+	color_group(b3, "8");
 
-    r4 = make_group(y1, "r4");
-    color_group(r4, "254");
+	r4 = make_group(y1, "r4");
+	color_group(r4, "254");
 
-    /* Check that all yellows share the same cookie. */
-    assert_group_cookie_not_zero(y1);
-    assert_group_cookie_equal(y1, y2);
-    assert_group_cookie_equal(y1, y22);
+	/* Check that all yellows share the same cookie. */
+	assert_group_cookie_not_zero(y1);
+	assert_group_cookie_equal(y1, y2);
+	assert_group_cookie_equal(y1, y22);
 
-    /* Check that all reds share the same cookie. */
-    assert_group_cookie_not_zero(r1);
-    assert_group_cookie_equal(r1, r11);
-    assert_group_cookie_equal(r11, r4);
+	/* Check that all reds share the same cookie. */
+	assert_group_cookie_not_zero(r1);
+	assert_group_cookie_equal(r1, r11);
+	assert_group_cookie_equal(r11, r4);
 
-    /* Check that blue, red and yellow are different cookie. */
-    assert_group_cookie_not_equal(r1, b3);
-    assert_group_cookie_not_equal(b3, y1);
+	/* Check that blue, red and yellow are different cookie. */
+	assert_group_cookie_not_equal(r1, b3);
+	assert_group_cookie_not_equal(b3, y1);
 
-    del_group(r11);
-    del_group(r1);
-    del_group(y22);
-    del_group(y2);
-    del_group(b3);
-    del_group(r4);
-    del_group(y1);
-    print_pass();
+	del_group(r11);
+	del_group(r1);
+	del_group(y22);
+	del_group(y2);
+	del_group(b3);
+	del_group(r4);
+	del_group(y1);
+	print_pass();
 }
 
 /*
@@ -608,68 +608,68 @@ static void test_cgroup_coloring(char *root)
  * from their parent group _after_ the parent was tagged.
  *
  *   p ----- c1 - c11
- *     \ c2 - c22
+ *	 \ c2 - c22
  */
 static void test_cgroup_parent_child_tag_inherit(char *root)
 {
-    char *p, *c1, *c11, *c2, *c22;
-
-    print_banner("TEST-CGROUP-PARENT-CHILD-TAG");
-
-    p = make_group(root, "p");
-    assert_group_cookie_zero(p);
-
-    c1 = make_group(p, "c1");
-    assert_group_tag(c1, "0"); /* Child tag is "0" but inherits cookie from parent. */
-    assert_group_cookie_zero(c1);
-    assert_group_cookie_equal(c1, p);
-
-    c11 = make_group(c1, "c11");
-    assert_group_tag(c11, "0");
-    assert_group_cookie_zero(c11);
-    assert_group_cookie_equal(c11, p);
-
-    c2 = make_group(p, "c2");
-    assert_group_tag(c2, "0");
-    assert_group_cookie_zero(c2);
-    assert_group_cookie_equal(c2, p);
-
-    tag_group(p);
-
-    /* Verify c1 got the cookie */
-    assert_group_tag(c1, "0");
-    assert_group_cookie_not_zero(c1);
-    assert_group_cookie_equal(c1, p);
-
-    /* Verify c2 got the cookie */
-    assert_group_tag(c2, "0");
-    assert_group_cookie_not_zero(c2);
-    assert_group_cookie_equal(c2, p);
-
-    /* Verify c11 got the cookie */
-    assert_group_tag(c11, "0");
-    assert_group_cookie_not_zero(c11);
-    assert_group_cookie_equal(c11, p);
-
-    /*
-     * Verify c22 which is a nested group created
-     * _after_ tagging got the cookie.
-     */
-    c22 = make_group(c2, "c22");
-
-    assert_group_tag(c22, "0");
-    assert_group_cookie_not_zero(c22);
-    assert_group_cookie_equal(c22, c1);
-    assert_group_cookie_equal(c22, c11);
-    assert_group_cookie_equal(c22, c2);
-    assert_group_cookie_equal(c22, p);
-
-    del_group(c22);
-    del_group(c11);
-    del_group(c1);
-    del_group(c2);
-    del_group(p);
-    print_pass();
+	char *p, *c1, *c11, *c2, *c22;
+
+	print_banner("TEST-CGROUP-PARENT-CHILD-TAG");
+
+	p = make_group(root, "p");
+	assert_group_cookie_zero(p);
+
+	c1 = make_group(p, "c1");
+	assert_group_tag(c1, "0"); /* Child tag is "0" but inherits cookie from parent. */
+	assert_group_cookie_zero(c1);
+	assert_group_cookie_equal(c1, p);
+
+	c11 = make_group(c1, "c11");
+	assert_group_tag(c11, "0");
+	assert_group_cookie_zero(c11);
+	assert_group_cookie_equal(c11, p);
+
+	c2 = make_group(p, "c2");
+	assert_group_tag(c2, "0");
+	assert_group_cookie_zero(c2);
+	assert_group_cookie_equal(c2, p);
+
+	tag_group(p);
+
+	/* Verify c1 got the cookie */
+	assert_group_tag(c1, "0");
+	assert_group_cookie_not_zero(c1);
+	assert_group_cookie_equal(c1, p);
+
+	/* Verify c2 got the cookie */
+	assert_group_tag(c2, "0");
+	assert_group_cookie_not_zero(c2);
+	assert_group_cookie_equal(c2, p);
+
+	/* Verify c11 got the cookie */
+	assert_group_tag(c11, "0");
+	assert_group_cookie_not_zero(c11);
+	assert_group_cookie_equal(c11, p);
+
+	/*
+	 * Verify c22 which is a nested group created
+	 * _after_ tagging got the cookie.
+	 */
+	c22 = make_group(c2, "c22");
+
+	assert_group_tag(c22, "0");
+	assert_group_cookie_not_zero(c22);
+	assert_group_cookie_equal(c22, c1);
+	assert_group_cookie_equal(c22, c11);
+	assert_group_cookie_equal(c22, c2);
+	assert_group_cookie_equal(c22, p);
+
+	del_group(c22);
+	del_group(c11);
+	del_group(c1);
+	del_group(c2);
+	del_group(p);
+	print_pass();
 }
 
 /*
@@ -678,163 +678,163 @@ static void test_cgroup_parent_child_tag_inherit(char *root)
  */
 static void test_cgroup_parent_tag_child_inherit(char *root)
 {
-    char *p, *c1, *c2, *c3;
-
-    print_banner("TEST-CGROUP-PARENT-TAG-CHILD-INHERIT");
-
-    p = make_group(root, "p");
-    assert_group_cookie_zero(p);
-    tag_group(p);
-    assert_group_cookie_not_zero(p);
-
-    c1 = make_group(p, "c1");
-    assert_group_cookie_not_zero(c1);
-    /* Child tag is "0" but it inherits cookie from parent. */
-    assert_group_tag(c1, "0");
-    assert_group_cookie_equal(c1, p);
-
-    c2 = make_group(p, "c2");
-    assert_group_tag(c2, "0");
-    assert_group_cookie_equal(c2, p);
-    assert_group_cookie_equal(c1, c2);
-
-    c3 = make_group(c1, "c3");
-    assert_group_tag(c3, "0");
-    assert_group_cookie_equal(c3, p);
-    assert_group_cookie_equal(c1, c3);
-
-    del_group(c3);
-    del_group(c1);
-    del_group(c2);
-    del_group(p);
-    print_pass();
+	char *p, *c1, *c2, *c3;
+
+	print_banner("TEST-CGROUP-PARENT-TAG-CHILD-INHERIT");
+
+	p = make_group(root, "p");
+	assert_group_cookie_zero(p);
+	tag_group(p);
+	assert_group_cookie_not_zero(p);
+
+	c1 = make_group(p, "c1");
+	assert_group_cookie_not_zero(c1);
+	/* Child tag is "0" but it inherits cookie from parent. */
+	assert_group_tag(c1, "0");
+	assert_group_cookie_equal(c1, p);
+
+	c2 = make_group(p, "c2");
+	assert_group_tag(c2, "0");
+	assert_group_cookie_equal(c2, p);
+	assert_group_cookie_equal(c1, c2);
+
+	c3 = make_group(c1, "c3");
+	assert_group_tag(c3, "0");
+	assert_group_cookie_equal(c3, p);
+	assert_group_cookie_equal(c1, c3);
+
+	del_group(c3);
+	del_group(c1);
+	del_group(c2);
+	del_group(p);
+	print_pass();
 }
 
 static void test_prctl_in_group(char *root)
 {
-    char *p;
-    struct task_state *tsk1, *tsk2, *tsk3;
-
-    print_banner("TEST-PRCTL-IN-GROUP");
-
-    p = make_group(root, "p");
-    assert_group_cookie_zero(p);
-    tag_group(p);
-    assert_group_cookie_not_zero(p);
-
-    tsk1 = add_task(p);
-    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
-
-    tsk2 = add_task(p);
-    assert_group_cookie_equals_task_cookie(p, tsk2->pid_str);
-
-    tsk3 = add_task(p);
-    assert_group_cookie_equals_task_cookie(p, tsk3->pid_str);
-
-    /* tsk2 share with tsk3 -- both get disconnected from CGroup. */
-    make_tasks_share(tsk2, tsk3);
-    assert_tasks_share(tsk2, tsk3);
-    assert_tasks_dont_share(tsk1, tsk2);
-    assert_tasks_dont_share(tsk1, tsk3);
-    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
-    assert_group_cookie_not_equals_task_cookie(p, tsk2->pid_str);
-    assert_group_cookie_not_equals_task_cookie(p, tsk3->pid_str);
-
-    /* now reset tsk3 -- get connected back to CGroup. */
-    reset_task_cookie(tsk3);
-    assert_tasks_dont_share(tsk2, tsk3);
-    assert_tasks_share(tsk1, tsk3);      // tsk3 is back.
-    assert_tasks_dont_share(tsk1, tsk2); // but tsk2 is still zombie
-    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
-    assert_group_cookie_not_equals_task_cookie(p, tsk2->pid_str);
-    assert_group_cookie_equals_task_cookie(p, tsk3->pid_str); // tsk3 is back.
-
-    /* now reset tsk2 as well to get it connected back to CGroup. */
-    reset_task_cookie(tsk2);
-    assert_tasks_share(tsk2, tsk3);
-    assert_tasks_share(tsk1, tsk3);
-    assert_tasks_share(tsk1, tsk2);
-    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
-    assert_group_cookie_equals_task_cookie(p, tsk2->pid_str);
-    assert_group_cookie_equals_task_cookie(p, tsk3->pid_str);
-
-    /* Test the rest of the cases (2 to 4)
-     *
-     *		t1		joining		t2
-     * CASE 1:
-     * before	0				0
-     * after	new cookie			new cookie
-     *
-     * CASE 2:
-     * before	X (non-zero)			0
-     * after	0				0
-     *
-     * CASE 3:
-     * before	0				X (non-zero)
-     * after	X				X
-     *
-     * CASE 4:
-     * before	Y (non-zero)			X (non-zero)
-     * after	X				X
-     */
-
-    /* case 2: */
-    dprint("case 2");
-    make_tasks_share(tsk1, tsk1);
-    assert_tasks_dont_share(tsk1, tsk2);
-    assert_tasks_dont_share(tsk1, tsk3);
-    assert_group_cookie_not_equals_task_cookie(p, tsk1->pid_str);
-    make_tasks_share(tsk1, tsk2); /* Will reset the task cookie. */
-    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
-
-    /* case 3: */
-    dprint("case 3");
-    assert_group_cookie_equals_task_cookie(p, tsk2->pid_str);
-    make_tasks_share(tsk2, tsk2);
-    assert_group_cookie_not_equals_task_cookie(p, tsk2->pid_str);
-    assert_tasks_dont_share(tsk2, tsk1);
-    assert_tasks_dont_share(tsk2, tsk3);
-    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
-    make_tasks_share(tsk1, tsk2);
-    assert_group_cookie_not_equals_task_cookie(p, tsk1->pid_str);
-    assert_tasks_share(tsk1, tsk2);
-    assert_tasks_dont_share(tsk1, tsk3);
-    reset_task_cookie(tsk1);
-    reset_task_cookie(tsk2);
-
-    /* case 4: */
-    dprint("case 4");
-    assert_tasks_share(tsk1, tsk2);
-    assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
-    make_tasks_share(tsk1, tsk1);
-    assert_group_cookie_not_equals_task_cookie(p, tsk1->pid_str);
-    assert_group_cookie_equals_task_cookie(p, tsk2->pid_str);
-    make_tasks_share(tsk2, tsk2);
-    assert_group_cookie_not_equals_task_cookie(p, tsk2->pid_str);
-    assert_tasks_dont_share(tsk1, tsk2);
-    make_tasks_share(tsk1, tsk2);
-    assert_group_cookie_not_equals_task_cookie(p, tsk1->pid_str);
-    assert_tasks_share(tsk1, tsk2);
-    assert_tasks_dont_share(tsk1, tsk3);
-    reset_task_cookie(tsk1);
-    reset_task_cookie(tsk2);
-
-    kill_task(tsk1);
-    kill_task(tsk2);
-    kill_task(tsk3);
-    del_group(p);
-    print_pass();
+	char *p;
+	struct task_state *tsk1, *tsk2, *tsk3;
+
+	print_banner("TEST-PRCTL-IN-GROUP");
+
+	p = make_group(root, "p");
+	assert_group_cookie_zero(p);
+	tag_group(p);
+	assert_group_cookie_not_zero(p);
+
+	tsk1 = add_task(p);
+	assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+
+	tsk2 = add_task(p);
+	assert_group_cookie_equals_task_cookie(p, tsk2->pid_str);
+
+	tsk3 = add_task(p);
+	assert_group_cookie_equals_task_cookie(p, tsk3->pid_str);
+
+	/* tsk2 share with tsk3 -- both get disconnected from CGroup. */
+	make_tasks_share(tsk2, tsk3);
+	assert_tasks_share(tsk2, tsk3);
+	assert_tasks_dont_share(tsk1, tsk2);
+	assert_tasks_dont_share(tsk1, tsk3);
+	assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+	assert_group_cookie_not_equals_task_cookie(p, tsk2->pid_str);
+	assert_group_cookie_not_equals_task_cookie(p, tsk3->pid_str);
+
+	/* now reset tsk3 -- get connected back to CGroup. */
+	reset_task_cookie(tsk3);
+	assert_tasks_dont_share(tsk2, tsk3);
+	assert_tasks_share(tsk1, tsk3);	  // tsk3 is back.
+	assert_tasks_dont_share(tsk1, tsk2); // but tsk2 is still zombie
+	assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+	assert_group_cookie_not_equals_task_cookie(p, tsk2->pid_str);
+	assert_group_cookie_equals_task_cookie(p, tsk3->pid_str); // tsk3 is back.
+
+	/* now reset tsk2 as well to get it connected back to CGroup. */
+	reset_task_cookie(tsk2);
+	assert_tasks_share(tsk2, tsk3);
+	assert_tasks_share(tsk1, tsk3);
+	assert_tasks_share(tsk1, tsk2);
+	assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+	assert_group_cookie_equals_task_cookie(p, tsk2->pid_str);
+	assert_group_cookie_equals_task_cookie(p, tsk3->pid_str);
+
+	/* Test the rest of the cases (2 to 4)
+	 *
+	 *		t1		joining		t2
+	 * CASE 1:
+	 * before	0				0
+	 * after	new cookie			new cookie
+	 *
+	 * CASE 2:
+	 * before	X (non-zero)			0
+	 * after	0				0
+	 *
+	 * CASE 3:
+	 * before	0				X (non-zero)
+	 * after	X				X
+	 *
+	 * CASE 4:
+	 * before	Y (non-zero)			X (non-zero)
+	 * after	X				X
+	 */
+
+	/* case 2: */
+	dprint("case 2");
+	make_tasks_share(tsk1, tsk1);
+	assert_tasks_dont_share(tsk1, tsk2);
+	assert_tasks_dont_share(tsk1, tsk3);
+	assert_group_cookie_not_equals_task_cookie(p, tsk1->pid_str);
+	make_tasks_share(tsk1, tsk2); /* Will reset the task cookie. */
+	assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+
+	/* case 3: */
+	dprint("case 3");
+	assert_group_cookie_equals_task_cookie(p, tsk2->pid_str);
+	make_tasks_share(tsk2, tsk2);
+	assert_group_cookie_not_equals_task_cookie(p, tsk2->pid_str);
+	assert_tasks_dont_share(tsk2, tsk1);
+	assert_tasks_dont_share(tsk2, tsk3);
+	assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+	make_tasks_share(tsk1, tsk2);
+	assert_group_cookie_not_equals_task_cookie(p, tsk1->pid_str);
+	assert_tasks_share(tsk1, tsk2);
+	assert_tasks_dont_share(tsk1, tsk3);
+	reset_task_cookie(tsk1);
+	reset_task_cookie(tsk2);
+
+	/* case 4: */
+	dprint("case 4");
+	assert_tasks_share(tsk1, tsk2);
+	assert_group_cookie_equals_task_cookie(p, tsk1->pid_str);
+	make_tasks_share(tsk1, tsk1);
+	assert_group_cookie_not_equals_task_cookie(p, tsk1->pid_str);
+	assert_group_cookie_equals_task_cookie(p, tsk2->pid_str);
+	make_tasks_share(tsk2, tsk2);
+	assert_group_cookie_not_equals_task_cookie(p, tsk2->pid_str);
+	assert_tasks_dont_share(tsk1, tsk2);
+	make_tasks_share(tsk1, tsk2);
+	assert_group_cookie_not_equals_task_cookie(p, tsk1->pid_str);
+	assert_tasks_share(tsk1, tsk2);
+	assert_tasks_dont_share(tsk1, tsk3);
+	reset_task_cookie(tsk1);
+	reset_task_cookie(tsk2);
+
+	kill_task(tsk1);
+	kill_task(tsk2);
+	kill_task(tsk3);
+	del_group(p);
+	print_pass();
 }
 
 int main() {
-    char *root = make_group(NULL, NULL);
+	char *root = make_group(NULL, NULL);
 
-    test_cgroup_parent_tag_child_inherit(root);
-    test_cgroup_parent_child_tag_inherit(root);
-    test_cgroup_coloring(root);
-    test_prctl_in_group(root);
+	test_cgroup_parent_tag_child_inherit(root);
+	test_cgroup_parent_child_tag_inherit(root);
+	test_cgroup_coloring(root);
+	test_prctl_in_group(root);
 
-    del_root_group(root);
-    return 0;
+	del_root_group(root);
+	return 0;
 }
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 19/26] sched: Add a second-level tag for nested CGroup usecase
  2020-10-20  1:43 ` [PATCH v8 -tip 19/26] sched: Add a second-level tag for nested CGroup usecase Joel Fernandes (Google)
@ 2020-10-31  0:42   ` Josh Don
  2020-11-03  2:54     ` Joel Fernandes
       [not found]   ` <6c07e70d-52f2-69ff-e1fa-690cd2c97f3d@linux.intel.com>
  1 sibling, 1 reply; 98+ messages in thread
From: Josh Don @ 2020-10-31  0:42 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Gleixner,
	linux-kernel, mingo, torvalds, fweisbec, keescook, kerrnel,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, chris.hyser, Aubrey Li,
	Paul E. McKenney, Tim Chen, Benjamin Segall, Hao Luo

On Mon, Oct 19, 2020 at 6:45 PM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> +static unsigned long cpu_core_get_group_cookie(struct task_group *tg)
> +{
> +       unsigned long color = 0;
> +
> +       if (!tg)
> +               return 0;
> +
> +       for (; tg; tg = tg->parent) {
> +               if (tg->core_tag_color) {
> +                       WARN_ON_ONCE(color);
> +                       color = tg->core_tag_color;
> +               }
> +
> +               if (tg->core_tagged) {
> +                       unsigned long cookie = ((unsigned long)tg << 8) | color;
> +                       cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
> +                       return cookie;
> +               }
> +       }
> +
> +       return 0;
> +}

I'm a bit wary of how core_task_cookie and core_group_cookie are
truncated to the lower half of their bits and combined into the
overall core_cookie.  Now that core_group_cookie is further losing 8
bits to color, that leaves (in the case of 32 bit unsigned long) only
8 bits to uniquely identify the group contribution to the cookie.

Also, I agree that 256 colors is likely adequate, but it would be nice
to avoid this restriction.

I'd like to propose the following alternative, which involves creating
a new struct to represent the core cookie:

struct core_cookie {
  unsigned long task_cookie;
  unsigned long group_cookie;
  unsigned long color;
  /* can be further extended with arbitrary fields */

  struct rb_node node;
  refcount_t refcount;
};

struct rb_root core_cookies; /* (sorted), all active core_cookies */
seqlock_t core_cookies_lock; /* protects against removal/addition to
core_cookies */

struct task_struct {
  ...
  unsigned long core_cookie; /* (struct core_cookie *) */
}

A given task stores the address of a core_cookie struct in its
core_cookie field.  When we reconfigure a task's
color/task_cookie/group_cookie, we can first look for an existing
core_cookie that matches those settings, or create a new one.
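
For concreteness, here is a rough sketch of the lookup-or-create path
(illustrative only -- the helper name, the plain spinlock instead of the
seqlock, and the GFP flag are simplifying assumptions on my part, not code
from the series):

struct core_cookie *core_cookie_get_or_create(unsigned long task_cookie,
					      unsigned long group_cookie,
					      unsigned long color)
{
	struct rb_node **link = &core_cookies.rb_node, *parent = NULL;
	struct core_cookie *ck;
	unsigned long flags;

	raw_spin_lock_irqsave(&core_cookies_lock, flags);
	while (*link) {
		parent = *link;
		ck = rb_entry(parent, struct core_cookie, node);

		if (task_cookie == ck->task_cookie &&
		    group_cookie == ck->group_cookie && color == ck->color) {
			/* Reuse an existing cookie with the same settings. */
			refcount_inc(&ck->refcount);
			goto out;
		}

		/* Keep the tree ordered by (task_cookie, group_cookie, color). */
		if (task_cookie < ck->task_cookie ||
		    (task_cookie == ck->task_cookie &&
		     (group_cookie < ck->group_cookie ||
		      (group_cookie == ck->group_cookie && color < ck->color))))
			link = &parent->rb_left;
		else
			link = &parent->rb_right;
	}

	/* No match found: allocate a new cookie and insert it into the tree. */
	ck = kzalloc(sizeof(*ck), GFP_ATOMIC);
	if (ck) {
		ck->task_cookie = task_cookie;
		ck->group_cookie = group_cookie;
		ck->color = color;
		refcount_set(&ck->refcount, 1);
		rb_link_node(&ck->node, parent, link);
		rb_insert_color(&ck->node, &core_cookies);
	}
out:
	raw_spin_unlock_irqrestore(&core_cookies_lock, flags);
	return ck;
}

A task's core_cookie field would then hold the address of the returned
struct, and a refcount_dec_and_test() on reconfigure/exit would drop the node
from the tree once the last user goes away.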

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-10-30  8:41             ` Peter Zijlstra
@ 2020-10-31 21:41               ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-10-31 21:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Glexiner, LKML, Ingo Molnar,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	benbjiang(蒋彪),
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, Hyser,Chris, Aubrey Li, Tim Chen

On Fri, Oct 30, 2020 at 09:41:12AM +0100, Peter Zijlstra wrote:
> On Thu, Oct 29, 2020 at 10:42:29PM -0400, Joel Fernandes wrote:
> 
> > Forgot to ask, do you also need to do the task_vruntime_update() for
> > the unconstrained pick?
> 
> Humm.. interesting case.
> 
> Yes, however... since in that case the picks aren't synchronized it's a
> wee bit dodgy. I'll throw it on the pile together with SMT4.

Ok. I threw it into the patch below anyway to not take any chances. I got the
version of your patch to a point where the perf is looking good (there's
still some room for improvement, but I am much much happier with the tests).

The changes are:
- Fix bug in the !(fib && !fi) thingie.
- Pass prio_less() the force-idle information ... prio_less() is called
  from various paths, so we need to be careful what we pass it. It's possible
  we select idle on something, but we are not done with selection yet - we
  continue doing prio_less(). So just pass it fi_before.
- During enqueue, prio_less() is used as well, so we may need to sync even then.
- Make cfs_prio_less() sync even when !fi.
- Sync the min_vruntime for unconstrained picks and during prio_less() even
  when the core is not in FI. Could you clarify why you feel it is dodgy to
  sync when unconstrained?

> Also, I'm still hoping you can make this form work:
> 
>   https://lkml.kernel.org/r/20201026093131.GF2628@hirez.programming.kicks-ass.net
> 
> (note that the put_prev_task() needs an additional rq argument)
> 
> That's both simpler code and faster.

Sure, I'll give that a go soon.

(This is based on our older 4.19 kernel, so the task_tick_fair() bit needed
changes to make it work - which aren't needed in v8).

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 711f8a78a947..c15fd5bbc707 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -97,7 +97,7 @@ static inline int __task_prio(struct task_struct *p)
  */
 
 /* real prio, less is less */
-static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+static inline bool prio_less(struct task_struct *a, struct task_struct *b, bool in_fi)
 {
 
 	int pa = __task_prio(a), pb = __task_prio(b);
@@ -112,7 +112,7 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
 	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
-		return cfs_prio_less(a, b);
+		return cfs_prio_less(a, b, in_fi);
 
 	return false;
 }
@@ -126,7 +126,7 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
 		return false;
 
 	/* flip prio, so high prio is leftmost */
-	if (prio_less(b, a))
+	if (prio_less(b, a, task_rq(a)->core->core_forceidle))
 		return true;
 
 	return false;
@@ -4025,7 +4025,7 @@ void sched_core_irq_exit(void)
  * - Else returns idle_task.
  */
 static struct task_struct *
-pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
 {
 	struct task_struct *class_pick, *cookie_pick;
 	unsigned long cookie = rq->core->core_cookie;
@@ -4040,7 +4040,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 		 * higher priority than max.
 		 */
 		if (max && class_pick->core_cookie &&
-		    prio_less(class_pick, max))
+		    prio_less(class_pick, max, in_fi))
 			return idle_sched_class.pick_task(rq);
 
 		return class_pick;
@@ -4059,13 +4059,15 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 	 * the core (so far) and it must be selected, otherwise we must go with
 	 * the cookie pick in order to satisfy the constraint.
 	 */
-	if (prio_less(cookie_pick, class_pick) &&
-	    (!max || prio_less(max, class_pick)))
+	if (prio_less(cookie_pick, class_pick, in_fi) &&
+	    (!max || prio_less(max, class_pick, in_fi)))
 		return class_pick;
 
 	return cookie_pick;
 }
 
+extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
+
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -4141,15 +4143,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			update_rq_clock(rq_i);
 	}
 
-	if (!fi_before) {
-		for_each_cpu(i, smt_mask) {
-			struct rq *rq_i = cpu_rq(i);
-
-			/* Reset the snapshot if core is no longer in force-idle. */
-			rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
-		}
-	}
-
 	/*
 	 * Try and select tasks for each sibling in decending sched_class
 	 * order.
@@ -4174,7 +4167,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 * highest priority task already selected for this
 			 * core.
 			 */
-			p = pick_task(rq_i, class, max);
+			p = pick_task(rq_i, class, max, fi_before);
 			if (!p) {
 				/*
 				 * If there weren't no cookies; we don't need
@@ -4199,6 +4192,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 */
 			if (i == cpu && !need_sync && !p->core_cookie) {
 				next = p;
+
+				WARN_ON_ONCE(fi_before);
+				task_vruntime_update(rq_i, p, false);
+
 				goto done;
 			}
 
@@ -4206,6 +4203,11 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				occ++;
 
 			rq_i->core_pick = p;
+			if (rq_i->idle == p && rq_i->nr_running) {
+				rq->core->core_forceidle = true;
+				if (!fi_before)
+					rq->core->core_forceidle_seq++;
+			}
 
 			/*
 			 * If this new candidate is of higher priority than the
@@ -4224,6 +4226,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				max = p;
 
 				if (old_max) {
+					rq->core->core_forceidle = false;
 					for_each_cpu(j, smt_mask) {
 						if (j == i)
 							continue;
@@ -4268,29 +4271,22 @@ next_class:;
 
 		WARN_ON_ONCE(!rq_i->core_pick);
 
-		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running) {
-			rq_i->core_forceidle = true;
-			need_sync = true;
-		}
+		if (!(!fi_before && rq->core->core_forceidle)) {
+			task_vruntime_update(rq_i, rq_i->core_pick, true);
 
-		rq_i->core_pick->core_occupation = occ;
+			rq_i->core_pick->core_occupation = occ;
 
-		if (i == cpu)
-			continue;
+			if (i == cpu)
+				continue;
 
-		if (rq_i->curr != rq_i->core_pick)
-			resched_curr(rq_i);
+			if (rq_i->curr != rq_i->core_pick) {
+				resched_curr(rq_i);
+			}
+		}
 
 		/* Did we break L1TF mitigation requirements? */
-		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
-	}
-
-	if (!fi_before && need_sync) {
-		for_each_cpu(i, smt_mask) {
-			struct rq *rq_i = cpu_rq(i);
-
-			/* Snapshot if core is in force-idle. */
-			rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
+		if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+			WARN_ON_ONCE(1);
 		}
 	}
 done:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e36608477aa..6d8e16bc3d79 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -483,36 +483,30 @@ static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forceidle);
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool in_fi)
 {
-	bool samecpu = task_cpu(a) == task_cpu(b);
+	struct rq *rq = task_rq(a);
 	struct sched_entity *sea = &a->se;
 	struct sched_entity *seb = &b->se;
 	struct cfs_rq *cfs_rqa;
 	struct cfs_rq *cfs_rqb;
 	s64 delta;
 
-	if (samecpu) {
-		/* vruntime is per cfs_rq */
-		while (!is_same_group(sea, seb)) {
-			int sea_depth = sea->depth;
-			int seb_depth = seb->depth;
+	SCHED_WARN_ON(task_rq(b)->core != rq->core);
 
-			if (sea_depth >= seb_depth)
-				sea = parent_entity(sea);
-			if (sea_depth <= seb_depth)
-				seb = parent_entity(seb);
-		}
+	while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
+		int sea_depth = sea->depth;
+		int seb_depth = seb->depth;
 
-		delta = (s64)(sea->vruntime - seb->vruntime);
-		goto out;
+		if (sea_depth >= seb_depth)
+			sea = parent_entity(sea);
+		if (sea_depth <= seb_depth)
+			seb = parent_entity(seb);
 	}
 
-	/* crosscpu: compare root level se's vruntime to decide priority */
-	while (sea->parent)
-		sea = sea->parent;
-	while (seb->parent)
-		seb = seb->parent;
+	se_fi_update(sea, rq->core->core_forceidle_seq, in_fi);
+	se_fi_update(seb, rq->core->core_forceidle_seq, in_fi);
 
 	cfs_rqa = sea->cfs_rq;
 	cfs_rqb = seb->cfs_rq;
@@ -520,7 +514,7 @@ bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
 	/* normalize vruntime WRT their rq's base */
 	delta = (s64)(sea->vruntime - seb->vruntime) +
 		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
-out:
+
 	return delta > 0;
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -10929,8 +10923,6 @@ static void core_sched_deactivate_fair(struct rq *rq)
  */
 static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
 {
-	int cpu = cpu_of(rq), sibling_cpu;
-
 	/*
 	 * If runqueue has only one task which used up its slice and if the
 	 * sibling is forced idle, then trigger schedule to give forced idle
@@ -10944,23 +10936,9 @@ static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
 	 * forced idle cpu has atleast MIN_NR_TASKS_DURING_FORCEIDLE - 1 tasks
 	 * and use that to check if we need to give up the cpu.
 	 */
-	if (rq->cfs.nr_running > 1 ||
-	    !__entity_slice_used(se, MIN_NR_TASKS_DURING_FORCEIDLE))
-		return;
-
-	for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
-		struct rq *sibling_rq;
-		if (sibling_cpu == cpu)
-			continue;
-		if (cpu_is_offline(sibling_cpu))
-			continue;
-
-		sibling_rq = cpu_rq(sibling_cpu);
-		if (sibling_rq->core_forceidle) {
-			resched_curr(rq);
-			break;
-		}
-	}
+	if (rq->core->core_forceidle && rq->cfs.nr_running == 1 &&
+	    __entity_slice_used(se, MIN_NR_TASKS_DURING_FORCEIDLE))
+		resched_curr(rq);
 }
 #endif
 
@@ -11102,6 +11080,41 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 	}
 }
+static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forceidle)
+{
+	bool root = true;
+	long old, new;
+
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (forceidle) {
+			if (cfs_rq->forceidle_seq == fi_seq)
+				break;
+			cfs_rq->forceidle_seq = fi_seq;
+		}
+
+		if (root) {
+			old = cfs_rq->min_vruntime_fi;
+			new = cfs_rq->min_vruntime;
+			root = false;
+			trace_printk("cfs_rq(min_vruntime_fi) %Lu->%Lu\n",
+				     old, new);
+		}
+
+		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
+	}
+}
+
+void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi)
+{
+	struct sched_entity *se = &p->se;
+
+	if (p->sched_class != &fair_sched_class)
+		return;
+
+	se_fi_update(se, rq->core->core_forceidle_seq, in_fi);
+}
+
 #else
 static void propagate_entity_cfs_rq(struct sched_entity *se) { }
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 45c8ce5c2333..74ad557d551e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -497,9 +497,13 @@ struct cfs_rq {
 	unsigned int		nr_running;
 	unsigned int		h_nr_running;
 
+#ifdef CONFIG_SCHED_CORE
+	unsigned int		forceidle_seq;
+	u64			min_vruntime_fi;
+#endif
+
 	u64			exec_clock;
 	u64			min_vruntime;
-	u64			min_vruntime_fi;
 #ifndef CONFIG_64BIT
 	u64			min_vruntime_copy;
 #endif
@@ -974,7 +978,6 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
-	bool			core_forceidle;
 	unsigned char		core_pause_pending;
 	unsigned int		core_this_irq_nest;
 
@@ -982,6 +985,8 @@ struct rq {
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
+	unsigned int		core_forceidle;
+	unsigned int		core_forceidle_seq;
 	unsigned int		core_irq_nest;
 #endif
 };
@@ -1061,6 +1066,8 @@ extern void queue_core_balance(struct rq *rq);
 void sched_core_add(struct rq *rq, struct task_struct *p);
 void sched_core_remove(struct rq *rq, struct task_struct *p);
 
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -2549,7 +2556,7 @@ unsigned long scale_irq_capacity(unsigned long util, unsigned long irq, unsigned
 #define perf_domain_span(pd) NULL
 #endif
 
-bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
 #ifdef CONFIG_SMP
 extern struct static_key_false sched_energy_present;

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-10-20  3:41   ` Randy Dunlap
@ 2020-11-03  0:20     ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-11-03  0:20 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Tim Chen, Paul E . McKenney

On Mon, Oct 19, 2020 at 08:41:04PM -0700, Randy Dunlap wrote:
> On 10/19/20 6:43 PM, Joel Fernandes (Google) wrote:
> > 
> > ---
> >   .../admin-guide/kernel-parameters.txt         |   7 +
> >   include/linux/entry-common.h                  |   2 +-
> >   include/linux/sched.h                         |  12 +
> >   kernel/entry/common.c                         |  25 +-
> >   kernel/sched/core.c                           | 229 ++++++++++++++++++
> >   kernel/sched/sched.h                          |   3 +
> >   6 files changed, 275 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 3236427e2215..48567110f709 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4678,6 +4678,13 @@
> >   	sbni=		[NET] Granch SBNI12 leased line adapter
> > +	sched_core_protect_kernel=
> 
> Needs a list of possible values after '=', along with telling us
> what the default value/setting is.

Ok, I made it the following:

        sched_core_protect_kernel=
                        [SCHED_CORE] Pause SMT siblings of a core running in
                        user mode, if at least one of the siblings of the core
                        is running in kernel mode. This is to guarantee that
                        kernel data is not leaked to tasks which are not trusted
                        by the kernel. A value of 0 disables protection, 1
                        enables protection. The default is 1.

thanks,

 - Joel


> > +			[SCHED_CORE] Pause SMT siblings of a core running in
> > +			user mode, if at least one of the siblings of the core
> > +			is running in kernel mode. This is to guarantee that
> > +			kernel data is not leaked to tasks which are not trusted
> > +			by the kernel.
> > +
> 
> 
> thanks.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-10-22  5:48   ` Li, Aubrey
@ 2020-11-03  0:50     ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-11-03  0:50 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Tim Chen, Paul E . McKenney

On Thu, Oct 22, 2020 at 01:48:12PM +0800, Li, Aubrey wrote:
> On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
> > Core-scheduling prevents hyperthreads in usermode from attacking each
> > other, but it does not do anything about one of the hyperthreads
> > entering the kernel for any reason. This leaves the door open for MDS
> > and L1TF attacks with concurrent execution sequences between
> > hyperthreads.
> > 
> > This patch therefore adds support for protecting all syscall and IRQ
> > kernel mode entries. Care is taken to track the outermost usermode exit
> > and entry using per-cpu counters. In cases where one of the hyperthreads
> > enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> > when not needed - example: idle and non-cookie HTs do not need to be
> > forced into kernel mode.
> > 
> > More information about attacks:
> > For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> > data to either host or guest attackers. For L1TF, it is possible to leak
> > to guest attackers. There is no possible mitigation involving flushing
> > of buffers to avoid this since the execution of attacker and victims
> > happen concurrently on 2 or more HTs.
> > 
> > Cc: Julien Desfossez <jdesfossez@digitalocean.com>
> > Cc: Tim Chen <tim.c.chen@linux.intel.com>
> > Cc: Aaron Lu <aaron.lwe@gmail.com>
> > Cc: Aubrey Li <aubrey.li@linux.intel.com>
> > Cc: Tim Chen <tim.c.chen@intel.com>
> > Cc: Paul E. McKenney <paulmck@kernel.org>
> > Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
> > Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> > Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  .../admin-guide/kernel-parameters.txt         |   7 +
> >  include/linux/entry-common.h                  |   2 +-
> >  include/linux/sched.h                         |  12 +
> >  kernel/entry/common.c                         |  25 +-
> >  kernel/sched/core.c                           | 229 ++++++++++++++++++
> >  kernel/sched/sched.h                          |   3 +
> >  6 files changed, 275 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 3236427e2215..48567110f709 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4678,6 +4678,13 @@
> >  
> >  	sbni=		[NET] Granch SBNI12 leased line adapter
> >  
> > +	sched_core_protect_kernel=
> > +			[SCHED_CORE] Pause SMT siblings of a core running in
> > +			user mode, if at least one of the siblings of the core
> > +			is running in kernel mode. This is to guarantee that
> > +			kernel data is not leaked to tasks which are not trusted
> > +			by the kernel.
> > +
> >  	sched_debug	[KNL] Enables verbose scheduler debug messages.
> >  
> >  	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
> > diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> > index 474f29638d2c..260216de357b 100644
> > --- a/include/linux/entry-common.h
> > +++ b/include/linux/entry-common.h
> > @@ -69,7 +69,7 @@
> >  
> >  #define EXIT_TO_USER_MODE_WORK						\
> >  	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
> > -	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
> > +	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET |	\
> >  	 ARCH_EXIT_TO_USER_MODE_WORK)
> >  
> >  /**
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d38e904dd603..fe6f225bfbf9 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
> >  
> >  const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
> >  
> > +#ifdef CONFIG_SCHED_CORE
> > +void sched_core_unsafe_enter(void);
> > +void sched_core_unsafe_exit(void);
> > +bool sched_core_wait_till_safe(unsigned long ti_check);
> > +bool sched_core_kernel_protected(void);
> > +#else
> > +#define sched_core_unsafe_enter(ignore) do { } while (0)
> > +#define sched_core_unsafe_exit(ignore) do { } while (0)
> > +#define sched_core_wait_till_safe(ignore) do { } while (0)
> > +#define sched_core_kernel_protected(ignore) do { } while (0)
> > +#endif
> > +
> >  #endif
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index 0a1e20f8d4e8..c8dc6b1b1f40 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -137,6 +137,26 @@ static __always_inline void exit_to_user_mode(void)
> >  /* Workaround to allow gradual conversion of architecture code */
> >  void __weak arch_do_signal(struct pt_regs *regs) { }
> >  
> > +unsigned long exit_to_user_get_work(void)
> > +{
> > +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> > +
> > +	if (IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
> > +		return ti_work;
> > +
> > +#ifdef CONFIG_SCHED_CORE
> > +	ti_work &= EXIT_TO_USER_MODE_WORK;
> > +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
> 
> Though _TIF_UNSAFE_RET is not x86 specific, I only saw the definition in x86.
> I'm not sure if other SMT capable architectures are happy with this?

Sorry for the late reply.

Yes, the arch has to define it if it wants kernel protection. There's no way
around it.

But I need to stub _TIF_UNSAFE_RET out to 0 if it is not defined, as is done
for the other TIF flags. I'll make the changes as below.

thanks!

 - Joel

---8<-----------------------

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 34dbbb764d6c..a338d5d64c3d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4684,7 +4684,8 @@
 			is running in kernel mode. This is to guarantee that
 			kernel data is not leaked to tasks which are not trusted
 			by the kernel. A value of 0 disables protection, 1
-			enables protection. The default is 1.
+			enables protection. The default is 1. Note that protection
+			depends on the arch defining the _TIF_UNSAFE_RET flag.
 
 	sched_debug	[KNL] Enables verbose scheduler debug messages.
 
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 879562d920f2..4475d545f9a0 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -33,6 +33,10 @@
 # define _TIF_PATCH_PENDING		(0)
 #endif
 
+#ifndef _TIF_UNSAFE_RET
+# define _TIF_UNSAFE_RET		(0)
+#endif
+
 #ifndef _TIF_UPROBE
 # define _TIF_UPROBE			(0)
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index a8e1974e5008..a18ed60cedea 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -28,7 +28,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
 
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
-	sched_core_unsafe_enter();
+	if (_TIF_UNSAFE_RET) /* Kernel protection depends on arch defining the flag. */
+		sched_core_unsafe_enter();
 	instrumentation_end();
 }
 
@@ -142,7 +143,8 @@ unsigned long exit_to_user_get_work(void)
 {
 	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
 
-	if (IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
+	if ((IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
+	    || !_TIF_UNSAFE_RET)
 		return ti_work;
 
 #ifdef CONFIG_SCHED_CORE
-- 
2.29.1.341.ge80a0c044ae-goog



^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-10-30 10:29   ` Alexandre Chartre
@ 2020-11-03  1:20     ` Joel Fernandes
  2020-11-06 16:57       ` Alexandre Chartre
  2020-11-10  9:35       ` Alexandre Chartre
  0 siblings, 2 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-11-03  1:20 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, James.Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Tim Chen, Paul E . McKenney

Hi Alexandre,

Sorry for the late reply; I was working on the snapshotting patch...

On Fri, Oct 30, 2020 at 11:29:26AM +0100, Alexandre Chartre wrote:
> 
> On 10/20/20 3:43 AM, Joel Fernandes (Google) wrote:
> > Core-scheduling prevents hyperthreads in usermode from attacking each
> > other, but it does not do anything about one of the hyperthreads
> > entering the kernel for any reason. This leaves the door open for MDS
> > and L1TF attacks with concurrent execution sequences between
> > hyperthreads.
> > 
> > This patch therefore adds support for protecting all syscall and IRQ
> > kernel mode entries. Care is taken to track the outermost usermode exit
> > and entry using per-cpu counters. In cases where one of the hyperthreads
> > enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> > when not needed - example: idle and non-cookie HTs do not need to be
> > forced into kernel mode.
> 
> Hi Joel,
> 
> In order to protect syscall/IRQ kernel mode entries, shouldn't we have a
> call to sched_core_unsafe_enter() in the syscall/IRQ entry code? I don't
> see such a call. Am I missing something?

Yes, this is a known bug and is fixed in v9, which I'll post soon. Meanwhile,
the updated patch is appended below:
 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 3236427e2215..48567110f709 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4678,6 +4678,13 @@
> >   	sbni=		[NET] Granch SBNI12 leased line adapter
> > +	sched_core_protect_kernel=
> > +			[SCHED_CORE] Pause SMT siblings of a core running in
> > +			user mode, if at least one of the siblings of the core
> > +			is running in kernel mode. This is to guarantee that
> > +			kernel data is not leaked to tasks which are not trusted
> > +			by the kernel.
> > +
> >   	sched_debug	[KNL] Enables verbose scheduler debug messages.
> >   	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
> > diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> > index 474f29638d2c..260216de357b 100644
> > --- a/include/linux/entry-common.h
> > +++ b/include/linux/entry-common.h
> > @@ -69,7 +69,7 @@
> >   #define EXIT_TO_USER_MODE_WORK						\
> >   	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
> > -	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
> > +	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET |	\
> >   	 ARCH_EXIT_TO_USER_MODE_WORK)
> >   /**
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d38e904dd603..fe6f225bfbf9 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
> >   const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
> > +#ifdef CONFIG_SCHED_CORE
> > +void sched_core_unsafe_enter(void);
> > +void sched_core_unsafe_exit(void);
> > +bool sched_core_wait_till_safe(unsigned long ti_check);
> > +bool sched_core_kernel_protected(void);
> > +#else
> > +#define sched_core_unsafe_enter(ignore) do { } while (0)
> > +#define sched_core_unsafe_exit(ignore) do { } while (0)
> > +#define sched_core_wait_till_safe(ignore) do { } while (0)
> > +#define sched_core_kernel_protected(ignore) do { } while (0)
> > +#endif
> > +
> >   #endif
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index 0a1e20f8d4e8..c8dc6b1b1f40 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -137,6 +137,26 @@ static __always_inline void exit_to_user_mode(void)
> >   /* Workaround to allow gradual conversion of architecture code */
> >   void __weak arch_do_signal(struct pt_regs *regs) { }
> > +unsigned long exit_to_user_get_work(void)
> > +{
> > +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> > +
> > +	if (IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
> > +		return ti_work;
> > +
> > +#ifdef CONFIG_SCHED_CORE
> > +	ti_work &= EXIT_TO_USER_MODE_WORK;
> > +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
> > +		sched_core_unsafe_exit();
> > +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
> 
> If we call sched_core_unsafe_exit() before sched_core_wait_till_safe() then we
> expose ourselves during the entire wait period in sched_core_wait_till_safe(). It
> would be better to call sched_core_unsafe_exit() once we know for sure we are
> going to exit.

The way the algorithm works right now, the current task has to get out of the
unsafe state while waiting, otherwise it will lock up: if the waiter kept its
own contribution to the core-wide nesting count while spinning, the count
could never drop to zero. Note that we wait with interrupts enabled, so new
interrupts can come in while waiting.

TBH this code is very tricky to get right and it took a long time to get it
working properly. For now I am content with the way it works; we can improve
it incrementally in the future.

If there are no other comments, let me know if I may add your Reviewed-by tag
for this patch; I'd appreciate it. The updated patch is appended below.

thanks,

 - Joel

---8<-----------------------

From b2835a587a28405ffdf8fc801e798129a014a8c8 Mon Sep 17 00:00:00 2001
From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Date: Mon, 27 Jul 2020 17:56:14 -0400
Subject: [PATCH] kernel/entry: Add support for core-wide protection of
 kernel-mode

Core-scheduling prevents hyperthreads in usermode from attacking each
other, but it does not do anything about one of the hyperthreads
entering the kernel for any reason. This leaves the door open for MDS
and L1TF attacks with concurrent execution sequences between
hyperthreads.

This patch therefore adds support for protecting all syscall and IRQ
kernel mode entries. Care is taken to track the outermost usermode exit
and entry using per-cpu counters. In cases where one of the hyperthreads
enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
when not needed - example: idle and non-cookie HTs do not need to be
forced into kernel mode.

More information about attacks:
For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
data to either host or guest attackers. For L1TF, it is possible to leak
to guest attackers. There is no possible mitigation involving flushing
of buffers to avoid this since the execution of attacker and victims
happen concurrently on 2 or more HTs.

Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/kernel-parameters.txt         |   9 +
 include/linux/entry-common.h                  |   6 +-
 include/linux/sched.h                         |  12 +
 kernel/entry/common.c                         |  28 ++-
 kernel/sched/core.c                           | 230 ++++++++++++++++++
 kernel/sched/sched.h                          |   3 +
 6 files changed, 285 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3236427e2215..a338d5d64c3d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4678,6 +4678,15 @@
 
 	sbni=		[NET] Granch SBNI12 leased line adapter
 
+	sched_core_protect_kernel=
+			[SCHED_CORE] Pause SMT siblings of a core running in
+			user mode, if at least one of the siblings of the core
+			is running in kernel mode. This is to guarantee that
+			kernel data is not leaked to tasks which are not trusted
+			by the kernel. A value of 0 disables protection, 1
+			enables protection. The default is 1. Note that protection
+			depends on the arch defining the _TIF_UNSAFE_RET flag.
+
 	sched_debug	[KNL] Enables verbose scheduler debug messages.
 
 	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 474f29638d2c..62278c5b3b5f 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -33,6 +33,10 @@
 # define _TIF_PATCH_PENDING		(0)
 #endif
 
+#ifndef _TIF_UNSAFE_RET
+# define _TIF_UNSAFE_RET		(0)
+#endif
+
 #ifndef _TIF_UPROBE
 # define _TIF_UPROBE			(0)
 #endif
@@ -69,7 +73,7 @@
 
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
-	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
+	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET |	\
 	 ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d38e904dd603..fe6f225bfbf9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+void sched_core_unsafe_enter(void);
+void sched_core_unsafe_exit(void);
+bool sched_core_wait_till_safe(unsigned long ti_check);
+bool sched_core_kernel_protected(void);
+#else
+#define sched_core_unsafe_enter(ignore) do { } while (0)
+#define sched_core_unsafe_exit(ignore) do { } while (0)
+#define sched_core_wait_till_safe(ignore) do { } while (0)
+#define sched_core_kernel_protected(ignore) do { } while (0)
+#endif
+
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 0a1e20f8d4e8..a18ed60cedea 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -28,6 +28,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
 
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
+	if (_TIF_UNSAFE_RET) /* Kernel protection depends on arch defining the flag. */
+		sched_core_unsafe_enter();
 	instrumentation_end();
 }
 
@@ -137,6 +139,27 @@ static __always_inline void exit_to_user_mode(void)
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal(struct pt_regs *regs) { }
 
+unsigned long exit_to_user_get_work(void)
+{
+	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+
+	if ((IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
+	    || !_TIF_UNSAFE_RET)
+		return ti_work;
+
+#ifdef CONFIG_SCHED_CORE
+	ti_work &= EXIT_TO_USER_MODE_WORK;
+	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
+		sched_core_unsafe_exit();
+		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
+			sched_core_unsafe_enter(); /* not exiting to user yet. */
+		}
+	}
+
+	return READ_ONCE(current_thread_info()->flags);
+#endif
+}
+
 static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 					    unsigned long ti_work)
 {
@@ -175,7 +198,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		 * enabled above.
 		 */
 		local_irq_disable_exit_to_user();
-		ti_work = READ_ONCE(current_thread_info()->flags);
+		ti_work = exit_to_user_get_work();
 	}
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
@@ -184,9 +207,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 static void exit_to_user_mode_prepare(struct pt_regs *regs)
 {
-	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+	unsigned long ti_work;
 
 	lockdep_assert_irqs_disabled();
+	ti_work = exit_to_user_get_work();
 
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e05728bdb18c..bd206708fac2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
 
 #ifdef CONFIG_SCHED_CORE
 
+DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
+static int __init set_sched_core_protect_kernel(char *str)
+{
+	unsigned long val = 0;
+
+	if (!str)
+		return 0;
+
+	if (!kstrtoul(str, 0, &val) && !val)
+		static_branch_disable(&sched_core_protect_kernel);
+
+	return 1;
+}
+__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
+
+/* Is the kernel protected by core scheduling? */
+bool sched_core_kernel_protected(void)
+{
+	return static_branch_likely(&sched_core_protect_kernel);
+}
+
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
 /* kernel prio, less is more */
@@ -4596,6 +4617,214 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
+/*
+ * Handler to attempt to enter kernel. It does nothing because the exit to
+ * usermode or guest mode will do the actual work (of waiting if needed).
+ */
+static void sched_core_irq_work(struct irq_work *work)
+{
+	return;
+}
+
+static inline void init_sched_core_irq_work(struct rq *rq)
+{
+	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
+}
+
+/*
+ * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
+ * exits the core-wide unsafe state. Obviously the CPU calling this function
+ * should not be responsible for the core being in the core-wide unsafe state
+ * otherwise it will deadlock.
+ *
+ * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
+ *            the loop if TIF flags are set and notify caller about it.
+ *
+ * IRQs should be disabled.
+ */
+bool sched_core_wait_till_safe(unsigned long ti_check)
+{
+	bool restart = false;
+	struct rq *rq;
+	int cpu;
+
+	/* We clear the thread flag only at the end, so need to check for it. */
+	ti_check &= ~_TIF_UNSAFE_RET;
+
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/* Down grade to allow interrupts to prevent stop_machine lockups.. */
+	preempt_disable();
+	local_irq_enable();
+
+	/*
+	 * Wait till the core of this HT is not in an unsafe state.
+	 *
+	 * Pair with smp_store_release() in sched_core_unsafe_exit().
+	 */
+	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
+		cpu_relax();
+		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
+			restart = true;
+			break;
+		}
+	}
+
+	/* Upgrade it back to the expectations of entry code. */
+	local_irq_disable();
+	preempt_enable();
+
+ret:
+	if (!restart)
+		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+	return restart;
+}
+
+/*
+ * Enter the core-wide IRQ state. Sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
+ * avoid sending useless IPIs is made. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_unsafe_enter(void)
+{
+	const struct cpumask *smt_mask;
+	unsigned long flags;
+	struct rq *rq;
+	int i, cpu;
+
+	if (!static_branch_likely(&sched_core_protect_kernel))
+		return;
+
+	/* Ensure that on return to user/guest, we check whether to wait. */
+	if (current->core_cookie)
+		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
+	rq->core_this_unsafe_nest++;
+
+	/* Should not nest: enter() should only pair with exit(). */
+	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
+		goto ret;
+
+	raw_spin_lock(rq_lockp(rq));
+	smt_mask = cpu_smt_mask(cpu);
+
+	/* Contribute this CPU's unsafe_enter() to core-wide unsafe_enter() count. */
+	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
+
+	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
+		goto unlock;
+
+	if (irq_work_is_busy(&rq->core_irq_work)) {
+		/*
+		 * Do nothing more since we are in an IPI sent from another
+		 * sibling to enforce safety. That sibling would have sent IPIs
+		 * to all of the HTs.
+		 */
+		goto unlock;
+	}
+
+	/*
+	 * If we are not the first ones on the core to enter core-wide unsafe
+	 * state, do nothing.
+	 */
+	if (rq->core->core_unsafe_nest > 1)
+		goto unlock;
+
+	/* Do nothing more if the core is not tagged. */
+	if (!rq->core->core_cookie)
+		goto unlock;
+
+	for_each_cpu(i, smt_mask) {
+		struct rq *srq = cpu_rq(i);
+
+		if (i == cpu || cpu_is_offline(i))
+			continue;
+
+		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
+			continue;
+
+		/* Skip if HT is not running a tagged task. */
+		if (!srq->curr->core_cookie && !srq->core_pick)
+			continue;
+
+		/*
+		 * Force sibling into the kernel by IPI. If work was already
+		 * pending, no new IPIs are sent. This is Ok since the receiver
+		 * would already be in the kernel, or on its way to it.
+		 */
+		irq_work_queue_on(&srq->core_irq_work, i);
+	}
+unlock:
+	raw_spin_unlock(rq_lockp(rq));
+ret:
+	local_irq_restore(flags);
+}
+
+/*
+ * Process any work needed for either exiting the core-wide unsafe state, or for
+ * waiting on this hyperthread if the core is still in this state.
+ *
+ * @idle: Are we called from the idle loop?
+ */
+void sched_core_unsafe_exit(void)
+{
+	unsigned long flags;
+	unsigned int nest;
+	struct rq *rq;
+	int cpu;
+
+	if (!static_branch_likely(&sched_core_protect_kernel))
+		return;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+
+	/* Do nothing if core-sched disabled. */
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/*
+	 * Can happen when a process is forked and the first return to user
+	 * mode is a syscall exit. Either way, there's nothing to do.
+	 */
+	if (rq->core_this_unsafe_nest == 0)
+		goto ret;
+
+	rq->core_this_unsafe_nest--;
+
+	/* enter() should be paired with exit() only. */
+	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
+		goto ret;
+
+	raw_spin_lock(rq_lockp(rq));
+	/*
+	 * Core-wide nesting counter can never be 0 because we are
+	 * still in it on this CPU.
+	 */
+	nest = rq->core->core_unsafe_nest;
+	WARN_ON_ONCE(!nest);
+
+	/* Pair with smp_load_acquire() in sched_core_wait_till_safe(). */
+	smp_store_release(&rq->core->core_unsafe_nest, nest - 1);
+	raw_spin_unlock(rq_lockp(rq));
+ret:
+	local_irq_restore(flags);
+}
+
 // XXX fairness/fwd progress conditions
 /*
  * Returns
@@ -5019,6 +5248,7 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
 			rq = cpu_rq(i);
 			if (rq->core && rq->core == rq)
 				core_rq = rq;
+			init_sched_core_irq_work(rq);
 		}
 
 		if (!core_rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f7e2d8a3be8e..4bcf3b1ddfb3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1059,12 +1059,15 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	struct irq_work		core_irq_work; /* To force HT into kernel */
+	unsigned int		core_this_unsafe_nest;
 
 	/* shared state */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
 	unsigned char		core_forceidle;
+	unsigned int		core_unsafe_nest;
 #endif
 };
 
-- 
2.29.1.341.ge80a0c044ae-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 19/26] sched: Add a second-level tag for nested CGroup usecase
  2020-10-31  0:42   ` Josh Don
@ 2020-11-03  2:54     ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-11-03  2:54 UTC (permalink / raw)
  To: Josh Don
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Gleixner,
	linux-kernel, mingo, torvalds, fweisbec, keescook, kerrnel,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, chris.hyser, Aubrey Li,
	Paul E. McKenney, Tim Chen, Benjamin Segall, Hao Luo

On Fri, Oct 30, 2020 at 05:42:12PM -0700, Josh Don wrote:
> On Mon, Oct 19, 2020 at 6:45 PM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > +static unsigned long cpu_core_get_group_cookie(struct task_group *tg)
> > +{
> > +       unsigned long color = 0;
> > +
> > +       if (!tg)
> > +               return 0;
> > +
> > +       for (; tg; tg = tg->parent) {
> > +               if (tg->core_tag_color) {
> > +                       WARN_ON_ONCE(color);
> > +                       color = tg->core_tag_color;
> > +               }
> > +
> > +               if (tg->core_tagged) {
> > +                       unsigned long cookie = ((unsigned long)tg << 8) | color;
> > +                       cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
> > +                       return cookie;
> > +               }
> > +       }
> > +
> > +       return 0;
> > +}
> 
> I'm a bit wary of how core_task_cookie and core_group_cookie are
> truncated to the lower half of their bits and combined into the
> overall core_cookie.  Now that core_group_cookie is further losing 8
> bits to color, that leaves (in the case of 32 bit unsigned long) only
> 8 bits to uniquely identify the group contribution to the cookie.
> 
> Also, I agree that 256 colors is likely adequate, but it would be nice
> to avoid this restriction.
> 
> I'd like to propose the following alternative, which involves creating
> a new struct to represent the core cookie:
> 
> struct core_cookie {
>   unsigned long task_cookie;
>   unsigned long group_cookie;
>   unsigned long color;
>   /* can be further extended with arbitrary fields */
> 
>   struct rb_node node;
>   refcount_t;
> };
> 
> struct rb_root core_cookies; /* (sorted), all active core_cookies */
> seqlock_t core_cookies_lock; /* protects against removal/addition to
> core_cookies */
> 
> struct task_struct {
>   ...
>   unsigned long core_cookie; /* (struct core_cookie *) */
> }
> 
> A given task stores the address of a core_cookie struct in its
> core_cookie field.  When we reconfigure a task's
> color/task_cookie/group_cookie, we can first look for an existing
> core_cookie that matches those settings, or create a new one.

Josh,

This sounds good to me.

Just to mention one thing, for stuff like the following, you'll have to write
functions that can do greater-than, less-than operations, etc.

static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
{
	if (a->core_cookie < b->core_cookie)
		return true;

	if (a->core_cookie > b->core_cookie)
		return false;

And the same goes for pretty much everywhere you do NULL checks on
core_cookie, or access core_cookie for any other reason.
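
Just to illustrate what I mean, a rough (untested) sketch of an ordering
helper against the struct core_cookie you describe could look like the below;
the helper name and the treat-NULL-as-lowest choice are just made up here:

static inline bool sched_core_cookie_less(unsigned long a_cookie,
					  unsigned long b_cookie)
{
	struct core_cookie *a = (struct core_cookie *)a_cookie;
	struct core_cookie *b = (struct core_cookie *)b_cookie;

	/* Treat an untagged task (zero cookie) as lowest. */
	if (!a || !b)
		return !a && b;

	/* Compare field by field to keep a stable total order. */
	if (a->task_cookie != b->task_cookie)
		return a->task_cookie < b->task_cookie;
	if (a->group_cookie != b->group_cookie)
		return a->group_cookie < b->group_cookie;
	return a->color < b->color;
}

Equality checks could presumably remain plain pointer/zero comparisons since
the cookie structs are deduplicated in your scheme.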

Also, there are kselftests that will need trivial modifications to pass with
the changes you propose.

Looking forward to a patch making this improvement. Thanks!

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 24/26] sched: Move core-scheduler interfacing code to a new file
  2020-10-26  1:05   ` Li, Aubrey
@ 2020-11-03  2:58     ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-11-03  2:58 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Paul E. McKenney, Tim Chen

On Mon, Oct 26, 2020 at 09:05:52AM +0800, Li, Aubrey wrote:
[..]
> > +int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
> > +{
> > +	struct sched_core_task_write_tag wr = {}; /* for stop machine. */
> > +	bool sched_core_put_after_stopper = false;
> > +	unsigned long cookie;
> > +	int ret = -ENOMEM;
> > +
> > +	mutex_lock(&sched_core_tasks_mutex);
> > +
> > +	/*
> > +	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
> > +	 *       sched_core_put_task_cookie(). However, sched_core_put() is done
> > +	 *       by this function *after* the stopper removes the tasks from the
> > +	 *       core queue, and not before. This is just to play it safe.
> > +	 */
> > +	if (t2 == NULL) {
> > +		if (t1->core_task_cookie) {
> > +			sched_core_put_task_cookie(t1->core_task_cookie);
> > +			sched_core_put_after_stopper = true;
> > +			wr.tasks[0] = t1; /* Keep wr.cookies[0] reset for t1. */
> > +		}
> > +	} else if (t1 == t2) {
> > +		/* Assign a unique per-task cookie solely for t1. */
> > +
> > +		cookie = sched_core_alloc_task_cookie();
> > +		if (!cookie)
> > +			goto out_unlock;
> > +
> > +		if (t1->core_task_cookie) {
> > +			sched_core_put_task_cookie(t1->core_task_cookie);
> > +			sched_core_put_after_stopper = true;
> > +		}
> > +		wr.tasks[0] = t1;
> > +		wr.cookies[0] = cookie;
> > +	} else
> > +	/*
> > +	 * 		t1		joining		t2
> > +	 * CASE 1:
> > +	 * before	0				0
> > +	 * after	new cookie			new cookie
> > +	 *
> > +	 * CASE 2:
> > +	 * before	X (non-zero)			0
> > +	 * after	0				0
> > +	 *
> > +	 * CASE 3:
> > +	 * before	0				X (non-zero)
> > +	 * after	X				X
> > +	 *
> > +	 * CASE 4:
> > +	 * before	Y (non-zero)			X (non-zero)
> > +	 * after	X				X
> > +	 */
> > +	if (!t1->core_task_cookie && !t2->core_task_cookie) {
> > +		/* CASE 1. */
> > +		cookie = sched_core_alloc_task_cookie();
> > +		if (!cookie)
> > +			goto out_unlock;
> > +
> > +		/* Add another reference for the other task. */
> > +		if (!sched_core_get_task_cookie(cookie)) {
> > +			return -EINVAL;
> 
> ret = -EINVAL; mutex is not released otherwise... 

Good find and will fix.
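
i.e. presumably something along these lines (just the obvious fix, untested):

		/* Add another reference for the other task. */
		if (!sched_core_get_task_cookie(cookie)) {
			ret = -EINVAL;
			goto out_unlock;
		}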

Minor suggestion: Could you truncate your emails when replying?

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 20/26] sched: Release references to the per-task cookie on exit
  2020-10-20  1:43 ` [PATCH v8 -tip 20/26] sched: Release references to the per-task cookie on exit Joel Fernandes (Google)
@ 2020-11-04 21:50   ` chris hyser
  2020-11-05 15:46     ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: chris hyser @ 2020-11-04 21:50 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, Aubrey Li,
	Paul E. McKenney, Tim Chen

On 10/19/20 9:43 PM, Joel Fernandes (Google) wrote:
> During exit, we have to free the references to a cookie that might be shared by
> many tasks. This commit therefore ensures when the task_struct is released, any
> references to cookies that it holds are also released.
> 
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>   include/linux/sched.h | 2 ++
>   kernel/fork.c         | 1 +
>   kernel/sched/core.c   | 8 ++++++++
>   3 files changed, 11 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4cb76575afa8..eabd96beff92 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2079,12 +2079,14 @@ void sched_core_unsafe_exit(void);
>   bool sched_core_wait_till_safe(unsigned long ti_check);
>   bool sched_core_kernel_protected(void);
>   int sched_core_share_pid(pid_t pid);
> +void sched_tsk_free(struct task_struct *tsk);
>   #else
>   #define sched_core_unsafe_enter(ignore) do { } while (0)
>   #define sched_core_unsafe_exit(ignore) do { } while (0)
>   #define sched_core_wait_till_safe(ignore) do { } while (0)
>   #define sched_core_kernel_protected(ignore) do { } while (0)
>   #define sched_core_share_pid(pid_t pid) do { } while (0)
> +#define sched_tsk_free(tsk) do { } while (0)
>   #endif
>   
>   #endif
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b9c289d0f4ef..a39248a02fdd 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
>   	exit_creds(tsk);
>   	delayacct_tsk_free(tsk);
>   	put_signal_struct(tsk->signal);
> +	sched_tsk_free(tsk);
>   
>   	if (!profile_handoff_task(tsk))
>   		free_task(tsk);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 42aa811eab14..61e1dcf11000 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9631,6 +9631,14 @@ static int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
>   
>   	return 0;
>   }
> +
> +void sched_tsk_free(struct task_struct *tsk)
> +{
> +	if (!tsk->core_task_cookie)
> +		return;
> +	sched_core_put_task_cookie(tsk->core_task_cookie);
> +	sched_core_put();


sched_tsk_free() can be called from softirq context (via delayed_put_task_struct() in an RCU callback, as the trace below shows), and sched_core_put() is riddled with things that may want to sleep.

[root@chyser-vm6 bin]# [  135.349453] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:281
[  135.356381] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 0, name: swapper/1
[  135.364331] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E     5.9.0_ch_pick_task_v8_fix_4+ #15
[  135.370178] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[  135.372052] Call Trace:
[  135.372733]  <IRQ>
[  135.373162]  dump_stack+0x74/0x92
[  135.375134]  ___might_sleep.cold+0x94/0xa5
[  135.379102]  __might_sleep+0x4b/0x80
[  135.382509]  mutex_lock+0x21/0x50
[  135.385616]  sched_core_put+0x15/0x70
[  135.388981]  sched_tsk_free+0x20/0x30
[  135.389753]  __put_task_struct+0xbe/0x190
[  135.390896]  delayed_put_task_struct+0x88/0xa0
[  135.391812]  rcu_core+0x341/0x520
[  135.392580]  rcu_core_si+0xe/0x10
[  135.393281]  __do_softirq+0xe8/0x2cd
[  135.394030]  asm_call_irq_on_stack+0x12/0x20
[  135.395060]  </IRQ>
[  135.395519]  do_softirq_own_stack+0x3d/0x50
[  135.396379]  irq_exit_rcu+0xb2/0xc0
[  135.397203]  sysvec_apic_timer_interrupt+0x3d/0x90
[  135.398188]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  135.399395] RIP: 0010:native_safe_halt+0xe/0x10
[  135.400314] Code: 7b ff ff ff eb bd cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 06 01 57 00 f4 c3 66 90 e9 07 00 00 00 
0f 00 2d f6 00 57 00 fb f4 <c3> cc 0f 1f 44 00 00 55 48 89 e5 53 e8 b1 f2 63 ff 65 8b 15 5a 5d
[  135.404484] RSP: 0018:ffffb41f4007be90 EFLAGS: 00000206
[  135.405627] RAX: ffffffff9ea9b670 RBX: 0000000000000001 RCX: ffff9d3130d2ca80
[  135.407201] RDX: 000000000001d576 RSI: ffffb41f4007be38 RDI: 0000001f839d2693
[  135.409411] RBP: ffffb41f4007be98 R08: 0000000000000001 R09: 0000000000000000
[  135.413731] R10: 0000000000000002 R11: ffff9d3130d2bd00 R12: ffff9d2e40853e80
[  135.415161] R13: ffff9d2e40853e80 R14: 0000000000000000 R15: 0000000000000000
[  135.416861]  ? __sched_text_end+0x7/0x7
[  135.417758]  ? default_idle+0xe/0x20
[  135.418505]  arch_cpu_idle+0x15/0x20
[  135.419225]  default_idle_call+0x38/0xd0
[  135.420145]  do_idle+0x1ee/0x260
[  135.421908]  cpu_startup_entry+0x20/0x30
[  135.425635]  start_secondary+0x118/0x150
[  135.429498]  secondary_startup_64_no_verify+0xb8/0xbb

I tried a number of things, like replacing the mutexes in put/get with spin_lock_bh(), and just kept finding things like:


[root@chyser-vm5 bin]# [ 1123.516209] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
[ 1123.517976] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 0, name: swapper/15
[ 1123.519557] CPU: 15 PID: 0 Comm: swapper/15 Tainted: G            E     5.9.0_ch_pick_task_v8_fix_5+ #16
[ 1123.521392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[ 1123.523039] Call Trace:
[ 1123.523546]  <IRQ>
[ 1123.523961]  dump_stack+0x74/0x92
[ 1123.524632]  ___might_sleep.cold+0x94/0xa5
[ 1123.525442]  __might_sleep+0x4b/0x80
[ 1123.526131]  ? cpumask_weight+0x20/0x20
[ 1123.526896]  cpus_read_lock+0x1c/0x60
[ 1123.527623]  stop_machine+0x1d/0x40
[ 1123.528321]  __sched_core_disable+0x19/0x40
[ 1123.529133]  sched_core_put+0x18/0x20
[ 1123.529852]  sched_tsk_free+0x20/0x30
[ 1123.530571]  __put_task_struct+0xbe/0x190
[ 1123.531361]  delayed_put_task_struct+0x88/0xa0
[ 1123.532234]  rcu_core+0x341/0x520
[ 1123.532882]  rcu_core_si+0xe/0x10
[ 1123.533554]  __do_softirq+0xe8/0x2cd
[ 1123.534281]  asm_call_irq_on_stack+0x12/0x20
[ 1123.535153]  </IRQ>
[ 1123.535584]  do_softirq_own_stack+0x3d/0x50
[ 1123.536403]  irq_exit_rcu+0xb2/0xc0
[ 1123.537080]  sysvec_apic_timer_interrupt+0x3d/0x90
[ 1123.538002]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 1123.539000] RIP: 0010:native_safe_halt+0xe/0x10
[ 1123.539886] Code: 7b ff ff ff eb bd cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d d6 00 57 00 f4 c3 66 90 e9 07 00 00 00 
0f 00 2d c6 00 57 00 fb f4 <c3> cc 0f 1f 44 00 00 55 48 89 e5 53 e8 91 f2 63 ff 65 8b 15 2a 5d
[ 1123.543776] RSP: 0018:ffffa830800ebe90 EFLAGS: 00000206
[ 1123.544915] RAX: ffffffffa7c9b6a0 RBX: 000000000000000f RCX: ffff99af07deca80
[ 1123.546556] RDX: 000000000048f6e6 RSI: ffffa830800ebe38 RDI: 0000010599561e7a
[ 1123.548091] RBP: ffffa830800ebe98 R08: 0000000000000001 R09: 000000000000000e
[ 1123.549453] R10: ffff9992422012c8 R11: ffff9992422012cc R12: ffff999240a59f40
[ 1123.550815] R13: ffff999240a59f40 R14: 0000000000000000 R15: 0000000000000000
[ 1123.552188]  ? __sched_text_end+0x7/0x7
[ 1123.552939]  ? default_idle+0xe/0x20
[ 1123.553635]  arch_cpu_idle+0x15/0x20
[ 1123.554341]  default_idle_call+0x38/0xd0
[ 1123.555111]  do_idle+0x1ee/0x26



The prctl() parts of the selftests pass, but all of the above is caused by simply calling prctl() to "share a cookie w/ a
task" that then dies a second later.

-chrish

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 17/26] sched: Split the cookie and setup per-task cookie on fork
  2020-10-20  1:43 ` [PATCH v8 -tip 17/26] sched: Split the cookie and setup per-task cookie on fork Joel Fernandes (Google)
@ 2020-11-04 22:30   ` chris hyser
  2020-11-05 14:49     ` Joel Fernandes
  2020-11-09 23:30     ` chris hyser
  0 siblings, 2 replies; 98+ messages in thread
From: chris hyser @ 2020-11-04 22:30 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, Aubrey Li,
	Paul E. McKenney, Tim Chen

On 10/19/20 9:43 PM, Joel Fernandes (Google) wrote:
> In order to prevent interference and clearly support both per-task and CGroup
> APIs, split the cookie into 2 and allow it to be set from either per-task, or
> CGroup API. The final cookie is the combined value of both and is computed when
> the stop-machine executes during a change of cookie.
> 
> Also, for the per-task cookie, it will get weird if we use pointers of any
> ephemeral objects. For this reason, introduce a refcounted object whose sole
> purpose is to assign a unique cookie value by way of the object's pointer.
> 
> While at it, refactor the CGroup code a bit. Future patches will introduce more
> APIs and support.
> 
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>   include/linux/sched.h |   2 +
>   kernel/sched/core.c   | 241 ++++++++++++++++++++++++++++++++++++++++--
>   kernel/sched/debug.c  |   4 +
>   3 files changed, 236 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index fe6f225bfbf9..c6034c00846a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -688,6 +688,8 @@ struct task_struct {
>   #ifdef CONFIG_SCHED_CORE
>   	struct rb_node			core_node;
>   	unsigned long			core_cookie;
> +	unsigned long			core_task_cookie;
> +	unsigned long			core_group_cookie;
>   	unsigned int			core_occupation;
>   #endif
>   
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bab4ea2f5cd8..30a9e4cb5ce1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -346,11 +346,14 @@ void sched_core_put(void)
>   	mutex_unlock(&sched_core_mutex);
>   }
>   
> +static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
> +
>   #else /* !CONFIG_SCHED_CORE */
>   
>   static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
>   static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
>   static bool sched_core_enqueued(struct task_struct *task) { return false; }
> +static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2) { }
>   
>   #endif /* CONFIG_SCHED_CORE */
>   
> @@ -3583,6 +3586,20 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>   #endif
>   #ifdef CONFIG_SCHED_CORE
>   	RB_CLEAR_NODE(&p->core_node);
> +
> +	/*
> +	 * Tag child via per-task cookie only if parent is tagged via per-task
> +	 * cookie. This is independent of, but can be additive to the CGroup tagging.
> +	 */
> +	if (current->core_task_cookie) {
> +
> +		/* If it is not CLONE_THREAD fork, assign a unique per-task tag. */
> +		if (!(clone_flags & CLONE_THREAD)) {
> +			return sched_core_share_tasks(p, p);
> +               }
> +		/* Otherwise share the parent's per-task tag. */
> +		return sched_core_share_tasks(p, current);
> +	}
>   #endif
>   	return 0;
>   }
> @@ -9177,6 +9194,217 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
>   #endif /* CONFIG_RT_GROUP_SCHED */
>   
>   #ifdef CONFIG_SCHED_CORE
> +/*
> + * A simple wrapper around refcount. An allocated sched_core_cookie's
> + * address is used to compute the cookie of the task.
> + */
> +struct sched_core_cookie {
> +	refcount_t refcnt;
> +};
> +
> +/*
> + * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
> + * @p: The task to assign a cookie to.
> + * @cookie: The cookie to assign.
> + * @group: is it a group interface or a per-task interface.
> + *
> + * This function is typically called from a stop-machine handler.
> + */
> +void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
> +{
> +	if (!p)
> +		return;
> +
> +	if (group)
> +		p->core_group_cookie = cookie;
> +	else
> +		p->core_task_cookie = cookie;
> +
> +	/* Use up half of the cookie's bits for task cookie and remaining for group cookie. */
> +	p->core_cookie = (p->core_task_cookie <<
> +				(sizeof(unsigned long) * 4)) + p->core_group_cookie;
> +
> +	if (sched_core_enqueued(p)) {
> +		sched_core_dequeue(task_rq(p), p);
> +		if (!p->core_task_cookie)
> +			return;
> +	}
> +
> +	if (sched_core_enabled(task_rq(p)) &&
> +			p->core_cookie && task_on_rq_queued(p))
> +		sched_core_enqueue(task_rq(p), p);
> +}
> +
> +/* Per-task interface */
> +static unsigned long sched_core_alloc_task_cookie(void)
> +{
> +	struct sched_core_cookie *ptr =
> +		kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
> +
> +	if (!ptr)
> +		return 0;
> +	refcount_set(&ptr->refcnt, 1);
> +
> +	/*
> +	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> +	 * is done after the stopper runs.
> +	 */
> +	sched_core_get();
> +	return (unsigned long)ptr;
> +}
> +
> +static bool sched_core_get_task_cookie(unsigned long cookie)
> +{
> +	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> +
> +	/*
> +	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> +	 * is done after the stopper runs.
> +	 */
> +	sched_core_get();
> +	return refcount_inc_not_zero(&ptr->refcnt);
> +}
> +
> +static void sched_core_put_task_cookie(unsigned long cookie)
> +{
> +	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> +
> +	if (refcount_dec_and_test(&ptr->refcnt))
> +		kfree(ptr);
> +}
> +
> +struct sched_core_task_write_tag {
> +	struct task_struct *tasks[2];
> +	unsigned long cookies[2];
> +};
> +
> +/*
> + * Ensure that the task has been requeued. The stopper ensures that the task cannot
> + * be migrated to a different CPU while its core scheduler queue state is being updated.
> + * It also makes sure to requeue a task if it was running actively on another CPU.
> + */
> +static int sched_core_task_join_stopper(void *data)
> +{
> +	struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
> +	int i;
> +
> +	for (i = 0; i < 2; i++)
> +		sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], false /* !group */);
> +
> +	return 0;
> +}
> +
> +static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
> +{
> +	struct sched_core_task_write_tag wr = {}; /* for stop machine. */
> +	bool sched_core_put_after_stopper = false;
> +	unsigned long cookie;
> +	int ret = -ENOMEM;
> +
> +	mutex_lock(&sched_core_mutex);
> +
> +	/*
> +	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
> +	 *       sched_core_put_task_cookie(). However, sched_core_put() is done
> +	 *       by this function *after* the stopper removes the tasks from the
> +	 *       core queue, and not before. This is just to play it safe.
> +	 */
> +	if (t2 == NULL) {
> +		if (t1->core_task_cookie) {
> +			sched_core_put_task_cookie(t1->core_task_cookie);
> +			sched_core_put_after_stopper = true;
> +			wr.tasks[0] = t1; /* Keep wr.cookies[0] reset for t1. */
> +		}
> +	} else if (t1 == t2) {
> +		/* Assign a unique per-task cookie solely for t1. */
> +
> +		cookie = sched_core_alloc_task_cookie();
> +		if (!cookie)
> +			goto out_unlock;
> +
> +		if (t1->core_task_cookie) {
> +			sched_core_put_task_cookie(t1->core_task_cookie);
> +			sched_core_put_after_stopper = true;
> +		}
> +		wr.tasks[0] = t1;
> +		wr.cookies[0] = cookie;
> +	} else
> +	/*
> +	 * 		t1		joining		t2
> +	 * CASE 1:
> +	 * before	0				0
> +	 * after	new cookie			new cookie
> +	 *
> +	 * CASE 2:
> +	 * before	X (non-zero)			0
> +	 * after	0				0
> +	 *
> +	 * CASE 3:
> +	 * before	0				X (non-zero)
> +	 * after	X				X
> +	 *
> +	 * CASE 4:
> +	 * before	Y (non-zero)			X (non-zero)
> +	 * after	X				X
> +	 */
> +	if (!t1->core_task_cookie && !t2->core_task_cookie) {
> +		/* CASE 1. */
> +		cookie = sched_core_alloc_task_cookie();
> +		if (!cookie)
> +			goto out_unlock;
> +
> +		/* Add another reference for the other task. */
> +		if (!sched_core_get_task_cookie(cookie)) {
> +			return -EINVAL;

should be:              ret = -EINVAL;

> +			goto out_unlock;
> +		}
> +
> +		wr.tasks[0] = t1;
> +		wr.tasks[1] = t2;
> +		wr.cookies[0] = wr.cookies[1] = cookie;
> +
> +	} else if (t1->core_task_cookie && !t2->core_task_cookie) {
> +		/* CASE 2. */
> +		sched_core_put_task_cookie(t1->core_task_cookie);
> +		sched_core_put_after_stopper = true;
> +
> +		wr.tasks[0] = t1; /* Reset cookie for t1. */
> +
> +	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
> +		/* CASE 3. */
> +		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
> +			ret = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		wr.tasks[0] = t1;
> +		wr.cookies[0] = t2->core_task_cookie;
> +
> +	} else {
> +		/* CASE 4. */
> +		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
> +			ret = -EINVAL;
> +			goto out_unlock;
> +		}
> +		sched_core_put_task_cookie(t1->core_task_cookie);
> +		sched_core_put_after_stopper = true;
> +
> +		wr.tasks[0] = t1;
> +		wr.cookies[0] = t2->core_task_cookie;
> +	}
> +
> +	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
> +
> +	if (sched_core_put_after_stopper)
> +		sched_core_put();
> +
> +	ret = 0;
> +out_unlock:
> +	mutex_unlock(&sched_core_mutex);
> +	return ret;
> +}
> +
> +/* CGroup interface */
>   static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
>   {
>   	struct task_group *tg = css_tg(css);
> @@ -9207,18 +9435,9 @@ static int __sched_write_tag(void *data)
>   	 * when we set cgroup tag to 0 when the loop is done below.
>   	 */
>   	while ((p = css_task_iter_next(&it))) {
> -		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
> -
> -		if (sched_core_enqueued(p)) {
> -			sched_core_dequeue(task_rq(p), p);
> -			if (!p->core_cookie)
> -				continue;
> -		}
> -
> -		if (sched_core_enabled(task_rq(p)) &&
> -		    p->core_cookie && task_on_rq_queued(p))
> -			sched_core_enqueue(task_rq(p), p);
> +		unsigned long cookie = !!val ? (unsigned long)tg : 0UL;
>   
> +		sched_core_tag_requeue(p, cookie, true /* group */);
>   	}
>   	css_task_iter_end(&it);
>   
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index c8fee8d9dfd4..88bf45267672 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -1024,6 +1024,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>   		__PS("clock-delta", t1-t0);
>   	}
>   
> +#ifdef CONFIG_SCHED_CORE
> +	__PS("core_cookie", p->core_cookie);
> +#endif
> +
>   	sched_show_numa(p, m);
>   }
>   
> 

-chrish

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 17/26] sched: Split the cookie and setup per-task cookie on fork
  2020-11-04 22:30   ` chris hyser
@ 2020-11-05 14:49     ` Joel Fernandes
  2020-11-09 23:30     ` chris hyser
  1 sibling, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-11-05 14:49 UTC (permalink / raw)
  To: chris hyser
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Wed, Nov 04, 2020 at 05:30:24PM -0500, chris hyser wrote:
[..]
> +		wr.cookies[0] = cookie;
> > +	} else
> > +	/*
> > +	 * 		t1		joining		t2
> > +	 * CASE 1:
> > +	 * before	0				0
> > +	 * after	new cookie			new cookie
> > +	 *
> > +	 * CASE 2:
> > +	 * before	X (non-zero)			0
> > +	 * after	0				0
> > +	 *
> > +	 * CASE 3:
> > +	 * before	0				X (non-zero)
> > +	 * after	X				X
> > +	 *
> > +	 * CASE 4:
> > +	 * before	Y (non-zero)			X (non-zero)
> > +	 * after	X				X
> > +	 */
> > +	if (!t1->core_task_cookie && !t2->core_task_cookie) {
> > +		/* CASE 1. */
> > +		cookie = sched_core_alloc_task_cookie();
> > +		if (!cookie)
> > +			goto out_unlock;
> > +
> > +		/* Add another reference for the other task. */
> > +		if (!sched_core_get_task_cookie(cookie)) {
> > +			return -EINVAL;
> 
> should be:              ret = -EINVAL;

Fixed now, thanks.

thanks,

 - Joel


> 
> > +			goto out_unlock;
> > +		}
> > +
> > +		wr.tasks[0] = t1;
> > +		wr.tasks[1] = t2;
> > +		wr.cookies[0] = wr.cookies[1] = cookie;
> > +
> > +	} else if (t1->core_task_cookie && !t2->core_task_cookie) {
> > +		/* CASE 2. */
> > +		sched_core_put_task_cookie(t1->core_task_cookie);
> > +		sched_core_put_after_stopper = true;
> > +
> > +		wr.tasks[0] = t1; /* Reset cookie for t1. */
> > +
> > +	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
> > +		/* CASE 3. */
> > +		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
> > +			ret = -EINVAL;
> > +			goto out_unlock;
> > +		}
> > +
> > +		wr.tasks[0] = t1;
> > +		wr.cookies[0] = t2->core_task_cookie;
> > +
> > +	} else {
> > +		/* CASE 4. */
> > +		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
> > +			ret = -EINVAL;
> > +			goto out_unlock;
> > +		}
> > +		sched_core_put_task_cookie(t1->core_task_cookie);
> > +		sched_core_put_after_stopper = true;
> > +
> > +		wr.tasks[0] = t1;
> > +		wr.cookies[0] = t2->core_task_cookie;
> > +	}
> > +
> > +	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
> > +
> > +	if (sched_core_put_after_stopper)
> > +		sched_core_put();
> > +
> > +	ret = 0;
> > +out_unlock:
> > +	mutex_unlock(&sched_core_mutex);
> > +	return ret;
> > +}
> > +
> > +/* CGroup interface */
> >   static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
> >   {
> >   	struct task_group *tg = css_tg(css);
> > @@ -9207,18 +9435,9 @@ static int __sched_write_tag(void *data)
> >   	 * when we set cgroup tag to 0 when the loop is done below.
> >   	 */
> >   	while ((p = css_task_iter_next(&it))) {
> > -		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
> > -
> > -		if (sched_core_enqueued(p)) {
> > -			sched_core_dequeue(task_rq(p), p);
> > -			if (!p->core_cookie)
> > -				continue;
> > -		}
> > -
> > -		if (sched_core_enabled(task_rq(p)) &&
> > -		    p->core_cookie && task_on_rq_queued(p))
> > -			sched_core_enqueue(task_rq(p), p);
> > +		unsigned long cookie = !!val ? (unsigned long)tg : 0UL;
> > +		sched_core_tag_requeue(p, cookie, true /* group */);
> >   	}
> >   	css_task_iter_end(&it);
> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > index c8fee8d9dfd4..88bf45267672 100644
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -1024,6 +1024,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
> >   		__PS("clock-delta", t1-t0);
> >   	}
> > +#ifdef CONFIG_SCHED_CORE
> > +	__PS("core_cookie", p->core_cookie);
> > +#endif
> > +
> >   	sched_show_numa(p, m);
> >   }
> > 
> 
> -chrish

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 20/26] sched: Release references to the per-task cookie on exit
  2020-11-04 21:50   ` chris hyser
@ 2020-11-05 15:46     ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-11-05 15:46 UTC (permalink / raw)
  To: chris hyser
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Wed, Nov 04, 2020 at 04:50:42PM -0500, chris hyser wrote:
> On 10/19/20 9:43 PM, Joel Fernandes (Google) wrote:
> > During exit, we have to free the references to a cookie that might be shared by
> > many tasks. This commit therefore ensures when the task_struct is released, any
> > references to cookies that it holds are also released.
> > 
> > Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >   include/linux/sched.h | 2 ++
> >   kernel/fork.c         | 1 +
> >   kernel/sched/core.c   | 8 ++++++++
> >   3 files changed, 11 insertions(+)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 4cb76575afa8..eabd96beff92 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2079,12 +2079,14 @@ void sched_core_unsafe_exit(void);
> >   bool sched_core_wait_till_safe(unsigned long ti_check);
> >   bool sched_core_kernel_protected(void);
> >   int sched_core_share_pid(pid_t pid);
> > +void sched_tsk_free(struct task_struct *tsk);
> >   #else
> >   #define sched_core_unsafe_enter(ignore) do { } while (0)
> >   #define sched_core_unsafe_exit(ignore) do { } while (0)
> >   #define sched_core_wait_till_safe(ignore) do { } while (0)
> >   #define sched_core_kernel_protected(ignore) do { } while (0)
> >   #define sched_core_share_pid(pid_t pid) do { } while (0)
> > +#define sched_tsk_free(tsk) do { } while (0)
> >   #endif
> >   #endif
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index b9c289d0f4ef..a39248a02fdd 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
> >   	exit_creds(tsk);
> >   	delayacct_tsk_free(tsk);
> >   	put_signal_struct(tsk->signal);
> > +	sched_tsk_free(tsk);
> >   	if (!profile_handoff_task(tsk))
> >   		free_task(tsk);
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 42aa811eab14..61e1dcf11000 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -9631,6 +9631,14 @@ static int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
> >   	return 0;
> >   }
> > +
> > +void sched_tsk_free(struct task_struct *tsk)
> > +{
> > +	if (!tsk->core_task_cookie)
> > +		return;
> > +	sched_core_put_task_cookie(tsk->core_task_cookie);
> > +	sched_core_put();
> 
> 
> sched_tsk_free() can be called under softirq. sched_core_put() is riddled with things that may want to sleep.

Right, that breaks. Can you try the diff I attached below and see if it does
not crash now?

> I tried a number of things like replacing the mutexes in put/get with spin_lock_bh() and just kept finding things like:
> 
> 
> [root@chyser-vm5 bin]# [ 1123.516209] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49

Indeed, stop_machine() cannot be called from an atomic context, which is what
spin_lock_bh() gives you.

---8<-----------------------

diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 6c008e5471d7..9967f37c5df0 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -16,6 +16,7 @@
  */
 struct sched_core_cookie {
 	refcount_t refcnt;
+	struct work_struct work; /* to free in WQ context. */
 };
 
 static DEFINE_MUTEX(sched_core_tasks_mutex);
@@ -54,21 +55,24 @@ void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool gr
 }
 
 /* Per-task interface: Used by fork(2) and prctl(2). */
+static void sched_core_put_cookie_work(struct work_struct *ws);
+
 static unsigned long sched_core_alloc_task_cookie(void)
 {
-	struct sched_core_cookie *ptr =
+	struct sched_core_cookie *ck =
 		kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
 
-	if (!ptr)
+	if (!ck)
 		return 0;
-	refcount_set(&ptr->refcnt, 1);
+	refcount_set(&ck->refcnt, 1);
+	INIT_WORK(&ck->work, sched_core_put_cookie_work);
 
 	/*
 	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
 	 * is done after the stopper runs.
 	 */
 	sched_core_get();
-	return (unsigned long)ptr;
+	return (unsigned long)ck;
 }
 
 static bool sched_core_get_task_cookie(unsigned long cookie)
@@ -91,6 +95,15 @@ static void sched_core_put_task_cookie(unsigned long cookie)
 		kfree(ptr);
 }
 
+static void sched_core_put_cookie_work(struct work_struct *ws)
+{
+	struct sched_core_cookie *ck =
+		container_of(ws, struct sched_core_cookie, work);
+
+	sched_core_put_task_cookie((unsigned long)ck);
+	sched_core_put();
+}
+
 struct sched_core_task_write_tag {
 	struct task_struct *tasks[2];
 	unsigned long cookies[2];
@@ -461,8 +474,11 @@ int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
 
 void sched_tsk_free(struct task_struct *tsk)
 {
+	struct sched_core_cookie *ck;
+
 	if (!tsk->core_task_cookie)
 		return;
-	sched_core_put_task_cookie(tsk->core_task_cookie);
-	sched_core_put();
+
+	ck = (struct sched_core_cookie *)tsk->core_task_cookie;
+	queue_work(system_wq, &ck->work);
 }

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 19/26] sched: Add a second-level tag for nested CGroup usecase
       [not found]   ` <6c07e70d-52f2-69ff-e1fa-690cd2c97f3d@linux.intel.com>
@ 2020-11-05 15:52     ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-11-05 15:52 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Paul E. McKenney, Tim Chen

On Wed, Oct 28, 2020 at 02:23:02PM +0800, Li, Aubrey wrote:
> On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
> > Google has a usecase where the first level tag to tag a CGroup is not
> > sufficient. So, a patch is carried for years where a second tag is added which
> > is writeable by unprivileged users.
> > 
> > Google uses DAC controls to make the 'tag' possible to set only by root while
> > the second-level 'color' can be changed by anyone. The actual names that
> > Google uses is different, but the concept is the same.
> > 
> > The hierarchy looks like:
> > 
> > Root group
> >    / \
> >   A   B    (These are created by the root daemon - borglet).
> >  / \   \
> > C   D   E  (These are created by AppEngine within the container).
> > 
> > The reason why Google has two parts is that AppEngine wants to allow a subset of
> > subcgroups within a parent tagged cgroup sharing execution. Think of these
> > subcgroups belong to the same customer or project. Because these subcgroups are
> > created by AppEngine, they are not tracked by borglet (the root daemon),
> > therefore borglet won't have a chance to set a color for them. That's where
> > 'color' file comes from. Color could be set by AppEngine, and once set, the
> > normal tasks within the subcgroup would not be able to overwrite it. This is
> > enforced by promoting the permission of the color file in cgroupfs.
> > 
> > The 'color' is an 8-bit value allowing for up to 256 unique colors. IMHO, having
> > more than that many CGroups sounds like a scalability issue, so this suffices.
> > We steal the lower 8-bits of the cookie to set the color.
> > 
> 
> So when color = 0, tasks in groups A, C and D can run together on the HTs of the same core.
> And if I set the color of taskC in group C to 1, then taskC has a different cookie
> from taskA and taskD. So in terms of taskA, what's the difference between taskC
> and [taskB or taskE]? The color breaks the relationship that C belongs to A.

C does belong to A in the sense that A cannot share with B, which implies C can
never share with B. Setting C's color does not change that fact. So coloring
is irrelevant to your question.

Sure, A cannot share with C either after coloring, but that's irrelevant and
not the point of doing the coloring.
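
To put some numbers on that, here is a rough sketch of the cookie arithmetic described in this series,
assuming a 64-bit kernel (task cookie in the upper 32 bits, group cookie in the lower half, color in the
low 8 bits of the group part). The pointer values, helper name and masks below are made up purely for
illustration:

/* Illustration only: models the cookie layout described above on a
 * 64-bit kernel. Pointer values, helper name and masks are assumptions. */
#include <stdio.h>

#define COLOR_BITS	8
#define COLOR_MASK	((1UL << COLOR_BITS) - 1)

static unsigned long combined_cookie(unsigned long task_cookie,
				     unsigned long group_cookie,
				     unsigned long color)
{
	group_cookie = (group_cookie & ~COLOR_MASK) | (color & COLOR_MASK);
	return (task_cookie << 32) + group_cookie;
}

int main(void)
{
	unsigned long tg_A = 0xffff888100100000UL;	/* made-up cgroup pointers */
	unsigned long tg_B = 0xffff888100200000UL;

	unsigned long a = combined_cookie(0, tg_A, 0);	/* A, color 0 */
	unsigned long c = combined_cookie(0, tg_A, 1);	/* C inherits A's tag, color 1 */
	unsigned long b = combined_cookie(0, tg_B, 0);	/* B, a different tag */

	/* a != c once C is colored, so A and C stop sharing; a != b and
	 * c != b either way, so neither could ever share with B. */
	printf("A=%#lx C=%#lx B=%#lx\n", a, c, b);
	return 0;
}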

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.
  2020-10-26  9:31             ` Peter Zijlstra
@ 2020-11-05 18:50               ` Joel Fernandes
  2020-11-05 22:07                 ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes @ 2020-11-05 18:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aaron Lu, Aubrey Li, Paul E. McKenney, Tim Chen

On Mon, Oct 26, 2020 at 10:31:31AM +0100, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 05:31:18PM -0400, Joel Fernandes wrote:
> > On Fri, Oct 23, 2020 at 09:26:54PM +0200, Peter Zijlstra wrote:
> 
> > > How about this then?
> > 
> > This does look better. It makes sense and I think it will work. I will look
> > more into it and also test it.
> 
> Hummm... Looking at it again I wonder if I can make something like the
> below work.
> 
> (depends on the next patch that pulls core_forceidle into core-wide
> state)
> 
> That would retain the CFS-cgroup optimization as well, for as long as
> there's no cookies around.
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4691,8 +4691,6 @@ pick_next_task(struct rq *rq, struct tas
>  		return next;
>  	}
>  
> -	put_prev_task_balance(rq, prev, rf);
> -
>  	smt_mask = cpu_smt_mask(cpu);
>  
>  	/*
> @@ -4707,14 +4705,25 @@ pick_next_task(struct rq *rq, struct tas
>  	 */
>  	rq->core->core_task_seq++;
>  	need_sync = !!rq->core->core_cookie;
> -
> -	/* reset state */
> -reset:
> -	rq->core->core_cookie = 0UL;
>  	if (rq->core->core_forceidle) {
>  		need_sync = true;
>  		rq->core->core_forceidle = false;
>  	}
> +
> +	if (!need_sync) {
> +		next = __pick_next_task(rq, prev, rf);

This could end up triggering pick_next_task_fair's newidle balancing;

> +		if (!next->core_cookie) {
> +			rq->core_pick = NULL;
> +			return next;
> +		}

.. only to realize here, after pick_next_task_fair(), that we have to put_prev
the task back as it has a cookie, but the effect of newidle balancing cannot
be reverted.

Would that be a problem, given that the newly pulled task might be incompatible
and it would have been better to leave it alone?

TBH, this is a drastic change and we've done a lot of testing with the
current code and it's looking good. I'm a little scared of changing it right
now and introducing a regression. Can we maybe do this after the existing
patches are upstream?

thanks,

 - Joel


> +		put_prev_task(next);
> +		need_sync = true;
> +	} else {
> +		put_prev_task_balance(rq, prev, rf);
> +	}
> +
> +	/* reset state */
> +	rq->core->core_cookie = 0UL;
>  	for_each_cpu(i, smt_mask) {
>  		struct rq *rq_i = cpu_rq(i);
>  
> @@ -4744,35 +4752,8 @@ pick_next_task(struct rq *rq, struct tas
>  			 * core.
>  			 */
>  			p = pick_task(rq_i, class, max);
> -			if (!p) {
> -				/*
> -				 * If there weren't no cookies; we don't need to
> -				 * bother with the other siblings.
> -				 */
> -				if (i == cpu && !need_sync)
> -					goto next_class;
> -
> +			if (!p)
>  				continue;
> -			}
> -
> -			/*
> -			 * Optimize the 'normal' case where there aren't any
> -			 * cookies and we don't need to sync up.
> -			 */
> -			if (i == cpu && !need_sync) {
> -				if (p->core_cookie) {
> -					/*
> -					 * This optimization is only valid as
> -					 * long as there are no cookies
> -					 * involved.
> -					 */
> -					need_sync = true;
> -					goto reset;
> -				}
> -
> -				next = p;
> -				goto done;
> -			}
>  
>  			rq_i->core_pick = p;
>  

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.
  2020-11-05 18:50               ` Joel Fernandes
@ 2020-11-05 22:07                 ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-11-05 22:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Vineeth Remanan Pillai,
	Aaron Lu, Aubrey Li, Tim Chen

On Thu, Nov 05, 2020 at 01:50:19PM -0500, Joel Fernandes wrote:
> On Mon, Oct 26, 2020 at 10:31:31AM +0100, Peter Zijlstra wrote:
> > On Fri, Oct 23, 2020 at 05:31:18PM -0400, Joel Fernandes wrote:
> > > On Fri, Oct 23, 2020 at 09:26:54PM +0200, Peter Zijlstra wrote:
> > 
> > > > How about this then?
> > > 
> > > This does look better. It makes sense and I think it will work. I will look
> > > more into it and also test it.
> > 
> > Hummm... Looking at it again I wonder if I can make something like the
> > below work.
> > 
> > (depends on the next patch that pulls core_forceidle into core-wide
> > state)
> > 
> > That would retain the CFS-cgroup optimization as well, for as long as
> > there's no cookies around.
> > 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4691,8 +4691,6 @@ pick_next_task(struct rq *rq, struct tas
> >  		return next;
> >  	}
> >  
> > -	put_prev_task_balance(rq, prev, rf);
> > -
> >  	smt_mask = cpu_smt_mask(cpu);
> >  
> >  	/*
> > @@ -4707,14 +4705,25 @@ pick_next_task(struct rq *rq, struct tas
> >  	 */
> >  	rq->core->core_task_seq++;
> >  	need_sync = !!rq->core->core_cookie;
> > -
> > -	/* reset state */
> > -reset:
> > -	rq->core->core_cookie = 0UL;
> >  	if (rq->core->core_forceidle) {
> >  		need_sync = true;
> >  		rq->core->core_forceidle = false;
> >  	}
> > +
> > +	if (!need_sync) {
> > +		next = __pick_next_task(rq, prev, rf);
> 
> This could end up triggering pick_next_task_fair's newidle balancing;
> 
> > +		if (!next->core_cookie) {
> > +			rq->core_pick = NULL;
> > +			return next;
> > +		}
> 
> .. only to realize here, after pick_next_task_fair(), that we have to put_prev
> the task back as it has a cookie, but the effect of newidle balancing cannot
> be reverted.
> 
> Would that be a problem, given that the newly pulled task might be incompatible
> and it would have been better to leave it alone?
> 
> TBH, this is a drastic change and we've done a lot of testing with the
> current code and it's looking good. I'm a little scared of changing it right
> now and introducing a regression. Can we maybe do this after the existing
> patches are upstream?

After sleeping on it, I am trying something like the following. Thoughts?

Basically, I call pick_task() in advance. That's mostly all that's different
from your patch:

---8<-----------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0ce17aa72694..366e5ed84a63 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5000,28 +5000,34 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	put_prev_task_balance(rq, prev, rf);
 
 	smt_mask = cpu_smt_mask(cpu);
-
-	/*
-	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
-	 *
-	 * @task_seq guards the task state ({en,de}queues)
-	 * @pick_seq is the @task_seq we did a selection on
-	 * @sched_seq is the @pick_seq we scheduled
-	 *
-	 * However, preemptions can cause multiple picks on the same task set.
-	 * 'Fix' this by also increasing @task_seq for every pick.
-	 */
-	rq->core->core_task_seq++;
 	need_sync = !!rq->core->core_cookie;
 
 	/* reset state */
-reset:
 	rq->core->core_cookie = 0UL;
 	if (rq->core->core_forceidle) {
 		need_sync = true;
 		fi_before = true;
 		rq->core->core_forceidle = false;
 	}
+
+	/*
+	 * Optimize for common case where this CPU has no cookies
+	 * and there are no cookied tasks running on siblings.
+	 */
+	if (!need_sync) {
+		for_each_class(class) {
+			next = class->pick_task(rq);
+			if (next)
+				break;
+		}
+
+		if (!next->core_cookie) {
+			rq->core_pick = NULL;
+			goto done;
+		}
+		need_sync = true;
+	}
+
 	for_each_cpu(i, smt_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
@@ -5039,6 +5045,18 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		}
 	}
 
+	/*
+	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+	 *
+	 * @task_seq guards the task state ({en,de}queues)
+	 * @pick_seq is the @task_seq we did a selection on
+	 * @sched_seq is the @pick_seq we scheduled
+	 *
+	 * However, preemptions can cause multiple picks on the same task set.
+	 * 'Fix' this by also increasing @task_seq for every pick.
+	 */
+	rq->core->core_task_seq++;
+
 	/*
 	 * Try and select tasks for each sibling in decending sched_class
 	 * order.
@@ -5059,40 +5077,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 * core.
 			 */
 			p = pick_task(rq_i, class, max);
-			if (!p) {
-				/*
-				 * If there weren't no cookies; we don't need to
-				 * bother with the other siblings.
-				 */
-				if (i == cpu && !need_sync)
-					goto next_class;
-
+			if (!p)
 				continue;
-			}
-
-			/*
-			 * Optimize the 'normal' case where there aren't any
-			 * cookies and we don't need to sync up.
-			 */
-			if (i == cpu && !need_sync) {
-				if (p->core_cookie) {
-					/*
-					 * This optimization is only valid as
-					 * long as there are no cookies
-					 * involved. We may have skipped
-					 * non-empty higher priority classes on
-					 * siblings, which are empty on this
-					 * CPU, so start over.
-					 */
-					need_sync = true;
-					goto reset;
-				}
-
-				next = p;
-				trace_printk("unconstrained pick: %s/%d %lx\n",
-					     next->comm, next->pid, next->core_cookie);
-				goto done;
-			}
 
 			if (!is_task_rq_idle(p))
 				occ++;

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 00/26] Core scheduling
  2020-10-30 13:26 ` [PATCH v8 -tip 00/26] Core scheduling Ning, Hongyu
@ 2020-11-06  2:58   ` Li, Aubrey
  2020-11-06 17:54     ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Li, Aubrey @ 2020-11-06  2:58 UTC (permalink / raw)
  To: Ning, Hongyu, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Paul E. McKenney, Tim Chen

On 2020/10/30 21:26, Ning, Hongyu wrote:
> On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
>> Eighth iteration of the Core-Scheduling feature.
>>
>> Core scheduling is a feature that allows only trusted tasks to run
>> concurrently on cpus sharing compute resources (eg: hyperthreads on a
>> core). The goal is to mitigate the core-level side-channel attacks
>> without requiring to disable SMT (which has a significant impact on
>> performance in some situations). Core scheduling (as of v7) mitigates
>> user-space to user-space attacks and user to kernel attack when one of
>> the siblings enters the kernel via interrupts or system call.
>>
>> By default, the feature doesn't change any of the current scheduler
>> behavior. The user decides which tasks can run simultaneously on the
>> same core (for now by having them in the same tagged cgroup). When a tag
>> is enabled in a cgroup and a task from that cgroup is running on a
>> hardware thread, the scheduler ensures that only idle or trusted tasks
>> run on the other sibling(s). Besides security concerns, this feature can
>> also be beneficial for RT and performance applications where we want to
>> control how tasks make use of SMT dynamically.
>>
>> This iteration focuses on the the following stuff:
>> - Redesigned API.
>> - Rework of Kernel Protection feature based on Thomas's entry work.
>> - Rework of hotplug fixes.
>> - Address review comments in v7
>>
>> Joel: Both a CGroup and Per-task interface via prctl(2) are provided for
>> configuring core sharing. More details are provided in documentation patch.
>> Kselftests are provided to verify the correctness/rules of the interface.
>>
>> Julien: TPCC tests showed improvements with core-scheduling. With kernel
>> protection enabled, it does not show any regression. Possibly ASI will improve
>> the performance for those who choose kernel protection (can be toggled through
>> sched_core_protect_kernel sysctl). Results:
>> v8				average		stdev		diff
>> baseline (SMT on)		1197.272	44.78312824	
>> core sched (   kernel protect)	412.9895	45.42734343	-65.51%
>> core sched (no kernel protect)	686.6515	71.77756931	-42.65%
>> nosmt				408.667		39.39042872	-65.87%
>>
>> v8 is rebased on tip/master.
>>
>> Future work
>> ===========
>> - Load balancing/Migration fixes for core scheduling.
>>   With v6, Load balancing is partially coresched aware, but has some
>>   issues w.r.t process/taskgroup weights:
>>   https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...
>> - Core scheduling test framework: kselftests, torture tests etc
>>
>> Changes in v8
>> =============
>> - New interface/API implementation
>>   - Joel
>> - Revised kernel protection patch
>>   - Joel
>> - Revised Hotplug fixes
>>   - Joel
>> - Minor bug fixes and address review comments
>>   - Vineeth
>>
> 
>> create mode 100644 tools/testing/selftests/sched/config
>> create mode 100644 tools/testing/selftests/sched/test_coresched.c
>>
> 
> Adding 4 workloads test results for Core Scheduling v8: 
> 
> - kernel under test: coresched community v8 from https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched-v5.9
> - workloads: 
> 	-- A. sysbench cpu (192 threads) + sysbench cpu (192 threads)
> 	-- B. sysbench cpu (192 threads) + sysbench mysql (192 threads, mysqld forced into the same cgroup)
> 	-- C. uperf netperf.xml (192 threads over TCP or UDP protocol separately)
> 	-- D. will-it-scale context_switch via pipe (192 threads)
> - test machine setup: 
> 	CPU(s):              192
> 	On-line CPU(s) list: 0-191
> 	Thread(s) per core:  2
> 	Core(s) per socket:  48
> 	Socket(s):           2
> 	NUMA node(s):        4
> - test results:
> 	-- workload A, no obvious performance drop in cs_on:
> 	+----------------------+------+----------------------+------------------------+
> 	|                      | **   | sysbench cpu * 192   | sysbench mysql * 192   |
> 	+======================+======+======================+========================+
> 	| cgroup               | **   | cg_sysbench_cpu_0    | cg_sysbench_mysql_0    |
> 	+----------------------+------+----------------------+------------------------+
> 	| record_item          | **   | Tput_avg (events/s)  | Tput_avg (events/s)    |
> 	+----------------------+------+----------------------+------------------------+
> 	| coresched_normalized | **   | 1.01                 | 0.87                   |
> 	+----------------------+------+----------------------+------------------------+
> 	| default_normalized   | **   | 1                    | 1                      |
> 	+----------------------+------+----------------------+------------------------+
> 	| smtoff_normalized    | **   | 0.59                 | 0.82                   |
> 	+----------------------+------+----------------------+------------------------+
> 
> 	-- workload B, no obvious performance drop in cs_on:
> 	+----------------------+------+----------------------+------------------------+
> 	|                      | **   | sysbench cpu * 192   | sysbench cpu * 192     |
> 	+======================+======+======================+========================+
> 	| cgroup               | **   | cg_sysbench_cpu_0    | cg_sysbench_cpu_1      |
> 	+----------------------+------+----------------------+------------------------+
> 	| record_item          | **   | Tput_avg (events/s)  | Tput_avg (events/s)    |
> 	+----------------------+------+----------------------+------------------------+
> 	| coresched_normalized | **   | 1.01                 | 0.98                   |
> 	+----------------------+------+----------------------+------------------------+
> 	| default_normalized   | **   | 1                    | 1                      |
> 	+----------------------+------+----------------------+------------------------+
> 	| smtoff_normalized    | **   | 0.6                  | 0.6                    |
> 	+----------------------+------+----------------------+------------------------+
> 
> 	-- workload C, known performance drop in cs_on since Core Scheduling v6:
> 	+----------------------+------+---------------------------+---------------------------+
> 	|                      | **   | uperf netperf TCP * 192   | uperf netperf UDP * 192   |
> 	+======================+======+===========================+===========================+
> 	| cgroup               | **   | cg_uperf                  | cg_uperf                  |
> 	+----------------------+------+---------------------------+---------------------------+
> 	| record_item          | **   | Tput_avg (Gb/s)           | Tput_avg (Gb/s)           |
> 	+----------------------+------+---------------------------+---------------------------+
> 	| coresched_normalized | **   | 0.46                      | 0.48                      |
> 	+----------------------+------+---------------------------+---------------------------+
> 	| default_normalized   | **   | 1                         | 1                         |
> 	+----------------------+------+---------------------------+---------------------------+
> 	| smtoff_normalized    | **   | 0.82                      | 0.79                      |
> 	+----------------------+------+---------------------------+---------------------------+

It is a known issue that when coresched is on, uperf offloads softirq service to
ksoftirqd, and the cookie of ksoftirqd is different from the cookie of uperf.
As a result, ksoftirqd could previously run concurrently with uperf, but not anymore.

> 
> 	-- workload D, new added syscall workload, performance drop in cs_on:
> 	+----------------------+------+-------------------------------+
> 	|                      | **   | will-it-scale  * 192          |
> 	|                      |      | (pipe based context_switch)   |
> 	+======================+======+===============================+
> 	| cgroup               | **   | cg_will-it-scale              |
> 	+----------------------+------+-------------------------------+
> 	| record_item          | **   | threads_avg                   |
> 	+----------------------+------+-------------------------------+
> 	| coresched_normalized | **   | 0.2                           |
> 	+----------------------+------+-------------------------------+
> 	| default_normalized   | **   | 1                             |
> 	+----------------------+------+-------------------------------+
> 	| smtoff_normalized    | **   | 0.89                          |
> 	+----------------------+------+-------------------------------+

will-it-scale may be a very extreme case. The story here is:
- On one sibling, a reader/writer gets blocked and tries to schedule another reader/writer in.
- The other sibling tries to wake up the reader/writer.

Both CPUs are acquiring rq->__lock.

So when coresched is off, they are two different locks; lock stat (1 second delta) below:

class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
&rq->__lock:          210            210           0.10           3.04         180.87           0.86            797       79165021           0.03          20.69    60650198.34           0.77

But when coresched is on, they are actually one and the same lock; lock stat (1 second delta) below:

class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
&rq->__lock:      6479459        6484857           0.05         216.46    60829776.85           9.38        8346319       15399739           0.03          95.56    81119515.38           5.27

This nature of core scheduling may degrade the performance of similar workloads with frequent context switching.
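
For context, the pipe-based context_switch case is essentially a ping-pong between two tasks blocking on
pipe reads. A simplified sketch (not the actual will-it-scale source, just an illustration of the pattern)
looks like this:

/* Simplified ping-pong sketch of a pipe-based context-switch workload.
 * This is not the will-it-scale source, only an illustration of why the
 * two tasks constantly block and wake each other up. */
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	int p1[2], p2[2], i;
	char c = 0;

	if (pipe(p1) || pipe(p2))
		exit(1);

	if (fork() == 0) {
		/* Child: wait for a byte, then echo it back. */
		close(p1[1]);
		close(p2[0]);
		while (read(p1[0], &c, 1) == 1) {
			if (write(p2[1], &c, 1) != 1)
				break;
		}
		exit(0);
	}

	close(p1[0]);
	close(p2[1]);

	/* Parent: each iteration is a sleep/wakeup pair on both sides, i.e.
	 * frequent context switches and rq->__lock activity on both CPUs. */
	for (i = 0; i < 1000000; i++) {
		if (write(p1[1], &c, 1) != 1 || read(p2[0], &c, 1) != 1)
			break;
	}
	return 0;
}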

Any thoughts?

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-03  1:20     ` Joel Fernandes
@ 2020-11-06 16:57       ` Alexandre Chartre
  2020-11-06 17:43         ` Joel Fernandes
  2020-11-10  9:35       ` Alexandre Chartre
  1 sibling, 1 reply; 98+ messages in thread
From: Alexandre Chartre @ 2020-11-06 16:57 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, James.Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Tim Chen, Paul E . McKenney



On 11/3/20 2:20 AM, Joel Fernandes wrote:
> Hi Alexandre,
> 
> Sorry for the late reply as I was working on the snapshotting patch...
> 
> On Fri, Oct 30, 2020 at 11:29:26AM +0100, Alexandre Chartre wrote:
>>
>> On 10/20/20 3:43 AM, Joel Fernandes (Google) wrote:
>>> Core-scheduling prevents hyperthreads in usermode from attacking each
>>> other, but it does not do anything about one of the hyperthreads
>>> entering the kernel for any reason. This leaves the door open for MDS
>>> and L1TF attacks with concurrent execution sequences between
>>> hyperthreads.
>>>
>>> This patch therefore adds support for protecting all syscall and IRQ
>>> kernel mode entries. Care is taken to track the outermost usermode exit
>>> and entry using per-cpu counters. In cases where one of the hyperthreads
>>> enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
>>> when not needed - example: idle and non-cookie HTs do not need to be
>>> forced into kernel mode.
>>
>> Hi Joel,
>>
>> In order to protect syscall/IRQ kernel mode entries, shouldn't we have a
>> call to sched_core_unsafe_enter() in the syscall/IRQ entry code? I don't
>> see such a call. Am I missing something?
> 
> Yes, this is a known bug and is fixed in v9, which I'll post soon. Meanwhile
> the updated patch is appended below:
>   

[..]

>>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>>> index 0a1e20f8d4e8..c8dc6b1b1f40 100644
>>> --- a/kernel/entry/common.c
>>> +++ b/kernel/entry/common.c
>>> @@ -137,6 +137,26 @@ static __always_inline void exit_to_user_mode(void)
>>>    /* Workaround to allow gradual conversion of architecture code */
>>>    void __weak arch_do_signal(struct pt_regs *regs) { }
>>> +unsigned long exit_to_user_get_work(void)
>>> +{
>>> +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
>>> +
>>> +	if (IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
>>> +		return ti_work;
>>> +
>>> +#ifdef CONFIG_SCHED_CORE
>>> +	ti_work &= EXIT_TO_USER_MODE_WORK;
>>> +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
>>> +		sched_core_unsafe_exit();
>>> +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
>>
>> If we call sched_core_unsafe_exit() before sched_core_wait_till_safe() then we
>> expose ourself during the entire wait period in sched_core_wait_till_safe(). It
>> would be better to call sched_core_unsafe_exit() once we know for sure we are
>> going to exit.
> 
> The way the algorithm works right now, it requires the current task to get
> out of the unsafe state while waiting, otherwise it will lock up. Note that we
> wait with interrupts enabled so new interrupts could come while waiting.
> 
> TBH this code is very tricky to get right and it took a long time to get it
> working properly. For now I am content with the way it works. We can improve
> further incrementally on it in the future.
> 

I am concerned this leaves a lot of windows open (even with the updated patch)
where the system remains exposed. There are 3 obvious windows:

- after switching to the kernel page-table and until enter_from_user_mode() is called
- while waiting for other cpus
- after leaving exit_to_user_mode_loop() and until switching back to the user page-table

Also on syscall/interrupt entry, sched_core_unsafe_enter() is called (in the
updated patch) and this sends an IPI to other CPUs but it doesn't wait for
other CPUs to effectively switch to the kernel page-table. It also seems like
the case where the CPU is interrupted by an NMI is not handled.

I know the code is tricky, and I am working on something similar for ASI (ASI
lockdown) where I am addressing all these cases (this seems to work).

> Let me know if I may add your Reviewed-by tag for this patch, if there are no
> other comments, and I appreciate it. Appended the updated patch below.
> 

I haven't really reviewed the code yet. Also I wonder if the work I am doing
with ASI for synchronizing sibling cpus (i.e. the ASI lockdown) and the integration
with PTI could provide what you need. Basically each process has an ASI and the
ASI lockdown ensures that sibling cpus are also running with a trusted ASI. If
the process/ASI is interrupted (e.g. on interrupt/exception/NMI) then it forces
sibling cpus to also interrupt ASI. The sibling cpus synchronization occurs when
switching the page-tables (between user and kernel) so there's no exposure window.

Let me have a closer look.

alex.


> ---8<-----------------------
> 
>  From b2835a587a28405ffdf8fc801e798129a014a8c8 Mon Sep 17 00:00:00 2001
> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> Date: Mon, 27 Jul 2020 17:56:14 -0400
> Subject: [PATCH] kernel/entry: Add support for core-wide protection of
>   kernel-mode
> 
> Core-scheduling prevents hyperthreads in usermode from attacking each
> other, but it does not do anything about one of the hyperthreads
> entering the kernel for any reason. This leaves the door open for MDS
> and L1TF attacks with concurrent execution sequences between
> hyperthreads.
> 
> This patch therefore adds support for protecting all syscall and IRQ
> kernel mode entries. Care is taken to track the outermost usermode exit
> and entry using per-cpu counters. In cases where one of the hyperthreads
> enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> when not needed - example: idle and non-cookie HTs do not need to be
> forced into kernel mode.
> 
> More information about attacks:
> For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> data to either host or guest attackers. For L1TF, it is possible to leak
> to guest attackers. There is no possible mitigation involving flushing
> of buffers to avoid this since the execution of attacker and victims
> happen concurrently on 2 or more HTs.
> 
> Cc: Julien Desfossez <jdesfossez@digitalocean.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Aaron Lu <aaron.lwe@gmail.com>
> Cc: Aubrey Li <aubrey.li@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>   .../admin-guide/kernel-parameters.txt         |   9 +
>   include/linux/entry-common.h                  |   6 +-
>   include/linux/sched.h                         |  12 +
>   kernel/entry/common.c                         |  28 ++-
>   kernel/sched/core.c                           | 230 ++++++++++++++++++
>   kernel/sched/sched.h                          |   3 +
>   6 files changed, 285 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 3236427e2215..a338d5d64c3d 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4678,6 +4678,15 @@
>   
>   	sbni=		[NET] Granch SBNI12 leased line adapter
>   
> +	sched_core_protect_kernel=
> +			[SCHED_CORE] Pause SMT siblings of a core running in
> +			user mode, if at least one of the siblings of the core
> +			is running in kernel mode. This is to guarantee that
> +			kernel data is not leaked to tasks which are not trusted
> +			by the kernel. A value of 0 disables protection, 1
> +			enables protection. The default is 1. Note that protection
> +			depends on the arch defining the _TIF_UNSAFE_RET flag.
> +
>   	sched_debug	[KNL] Enables verbose scheduler debug messages.
>   
>   	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 474f29638d2c..62278c5b3b5f 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -33,6 +33,10 @@
>   # define _TIF_PATCH_PENDING		(0)
>   #endif
>   
> +#ifndef _TIF_UNSAFE_RET
> +# define _TIF_UNSAFE_RET		(0)
> +#endif
> +
>   #ifndef _TIF_UPROBE
>   # define _TIF_UPROBE			(0)
>   #endif
> @@ -69,7 +73,7 @@
>   
>   #define EXIT_TO_USER_MODE_WORK						\
>   	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
> -	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
> +	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET |	\
>   	 ARCH_EXIT_TO_USER_MODE_WORK)
>   
>   /**
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d38e904dd603..fe6f225bfbf9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
>   
>   const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
>   
> +#ifdef CONFIG_SCHED_CORE
> +void sched_core_unsafe_enter(void);
> +void sched_core_unsafe_exit(void);
> +bool sched_core_wait_till_safe(unsigned long ti_check);
> +bool sched_core_kernel_protected(void);
> +#else
> +#define sched_core_unsafe_enter(ignore) do { } while (0)
> +#define sched_core_unsafe_exit(ignore) do { } while (0)
> +#define sched_core_wait_till_safe(ignore) do { } while (0)
> +#define sched_core_kernel_protected(ignore) do { } while (0)
> +#endif
> +
>   #endif
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 0a1e20f8d4e8..a18ed60cedea 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -28,6 +28,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
>   
>   	instrumentation_begin();
>   	trace_hardirqs_off_finish();
> +	if (_TIF_UNSAFE_RET) /* Kernel protection depends on arch defining the flag. */
> +		sched_core_unsafe_enter();
>   	instrumentation_end();
>   }
>   
> @@ -137,6 +139,27 @@ static __always_inline void exit_to_user_mode(void)
>   /* Workaround to allow gradual conversion of architecture code */
>   void __weak arch_do_signal(struct pt_regs *regs) { }
>   
> +unsigned long exit_to_user_get_work(void)
> +{
> +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +
> +	if ((IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
> +	    || !_TIF_UNSAFE_RET)
> +		return ti_work;
> +
> +#ifdef CONFIG_SCHED_CORE
> +	ti_work &= EXIT_TO_USER_MODE_WORK;
> +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
> +		sched_core_unsafe_exit();
> +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
> +			sched_core_unsafe_enter(); /* not exiting to user yet. */
> +		}
> +	}
> +
> +	return READ_ONCE(current_thread_info()->flags);
> +#endif
> +}
> +
>   static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   					    unsigned long ti_work)
>   {
> @@ -175,7 +198,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   		 * enabled above.
>   		 */
>   		local_irq_disable_exit_to_user();
> -		ti_work = READ_ONCE(current_thread_info()->flags);
> +		ti_work = exit_to_user_get_work();
>   	}
>   
>   	/* Return the latest work state for arch_exit_to_user_mode() */
> @@ -184,9 +207,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   
>   static void exit_to_user_mode_prepare(struct pt_regs *regs)
>   {
> -	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +	unsigned long ti_work;
>   
>   	lockdep_assert_irqs_disabled();
> +	ti_work = exit_to_user_get_work();
>   
>   	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>   		ti_work = exit_to_user_mode_loop(regs, ti_work);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e05728bdb18c..bd206708fac2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
>   
>   #ifdef CONFIG_SCHED_CORE
>   
> +DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
> +static int __init set_sched_core_protect_kernel(char *str)
> +{
> +	unsigned long val = 0;
> +
> +	if (!str)
> +		return 0;
> +
> +	if (!kstrtoul(str, 0, &val) && !val)
> +		static_branch_disable(&sched_core_protect_kernel);
> +
> +	return 1;
> +}
> +__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
> +
> +/* Is the kernel protected by core scheduling? */
> +bool sched_core_kernel_protected(void)
> +{
> +	return static_branch_likely(&sched_core_protect_kernel);
> +}
> +
>   DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
>   
>   /* kernel prio, less is more */
> @@ -4596,6 +4617,214 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
>   	return a->core_cookie == b->core_cookie;
>   }
>   
> +/*
> + * Handler to attempt to enter kernel. It does nothing because the exit to
> + * usermode or guest mode will do the actual work (of waiting if needed).
> + */
> +static void sched_core_irq_work(struct irq_work *work)
> +{
> +	return;
> +}
> +
> +static inline void init_sched_core_irq_work(struct rq *rq)
> +{
> +	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
> +}
> +
> +/*
> + * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
> + * exits the core-wide unsafe state. Obviously the CPU calling this function
> + * should not be responsible for the core being in the core-wide unsafe state
> + * otherwise it will deadlock.
> + *
> + * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
> + *            the loop if TIF flags are set and notify caller about it.
> + *
> + * IRQs should be disabled.
> + */
> +bool sched_core_wait_till_safe(unsigned long ti_check)
> +{
> +	bool restart = false;
> +	struct rq *rq;
> +	int cpu;
> +
> +	/* We clear the thread flag only at the end, so need to check for it. */
> +	ti_check &= ~_TIF_UNSAFE_RET;
> +
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/* Down grade to allow interrupts to prevent stop_machine lockups.. */
> +	preempt_disable();
> +	local_irq_enable();
> +
> +	/*
> +	 * Wait till the core of this HT is not in an unsafe state.
> +	 *
> +	 * Pair with smp_store_release() in sched_core_unsafe_exit().
> +	 */
> +	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
> +		cpu_relax();
> +		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
> +			restart = true;
> +			break;
> +		}
> +	}
> +
> +	/* Upgrade it back to the expectations of entry code. */
> +	local_irq_disable();
> +	preempt_enable();
> +
> +ret:
> +	if (!restart)
> +		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	return restart;
> +}
> +
> +/*
> + * Enter the core-wide IRQ state. Sibling will be paused if it is running
> + * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
> + * avoid sending useless IPIs is made. Must be called only from hard IRQ
> + * context.
> + */
> +void sched_core_unsafe_enter(void)
> +{
> +	const struct cpumask *smt_mask;
> +	unsigned long flags;
> +	struct rq *rq;
> +	int i, cpu;
> +
> +	if (!static_branch_likely(&sched_core_protect_kernel))
> +		return;
> +
> +	/* Ensure that on return to user/guest, we check whether to wait. */
> +	if (current->core_cookie)
> +		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
> +	rq->core_this_unsafe_nest++;
> +
> +	/* Should not nest: enter() should only pair with exit(). */
> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
> +		goto ret;
> +
> +	raw_spin_lock(rq_lockp(rq));
> +	smt_mask = cpu_smt_mask(cpu);
> +
> +	/* Contribute this CPU's unsafe_enter() to core-wide unsafe_enter() count. */
> +	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
> +
> +	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
> +		goto unlock;
> +
> +	if (irq_work_is_busy(&rq->core_irq_work)) {
> +		/*
> +		 * Do nothing more since we are in an IPI sent from another
> +		 * sibling to enforce safety. That sibling would have sent IPIs
> +		 * to all of the HTs.
> +		 */
> +		goto unlock;
> +	}
> +
> +	/*
> +	 * If we are not the first ones on the core to enter core-wide unsafe
> +	 * state, do nothing.
> +	 */
> +	if (rq->core->core_unsafe_nest > 1)
> +		goto unlock;
> +
> +	/* Do nothing more if the core is not tagged. */
> +	if (!rq->core->core_cookie)
> +		goto unlock;
> +
> +	for_each_cpu(i, smt_mask) {
> +		struct rq *srq = cpu_rq(i);
> +
> +		if (i == cpu || cpu_is_offline(i))
> +			continue;
> +
> +		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
> +			continue;
> +
> +		/* Skip if HT is not running a tagged task. */
> +		if (!srq->curr->core_cookie && !srq->core_pick)
> +			continue;
> +
> +		/*
> +		 * Force sibling into the kernel by IPI. If work was already
> +		 * pending, no new IPIs are sent. This is Ok since the receiver
> +		 * would already be in the kernel, or on its way to it.
> +		 */
> +		irq_work_queue_on(&srq->core_irq_work, i);
> +	}
> +unlock:
> +	raw_spin_unlock(rq_lockp(rq));
> +ret:
> +	local_irq_restore(flags);
> +}
> +
> +/*
> + * Process any work needed for either exiting the core-wide unsafe state, or for
> + * waiting on this hyperthread if the core is still in this state.
> + *
> + * @idle: Are we called from the idle loop?
> + */
> +void sched_core_unsafe_exit(void)
> +{
> +	unsigned long flags;
> +	unsigned int nest;
> +	struct rq *rq;
> +	int cpu;
> +
> +	if (!static_branch_likely(&sched_core_protect_kernel))
> +		return;
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +
> +	/* Do nothing if core-sched disabled. */
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/*
> +	 * Can happen when a process is forked and the first return to user
> +	 * mode is a syscall exit. Either way, there's nothing to do.
> +	 */
> +	if (rq->core_this_unsafe_nest == 0)
> +		goto ret;
> +
> +	rq->core_this_unsafe_nest--;
> +
> +	/* enter() should be paired with exit() only. */
> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
> +		goto ret;
> +
> +	raw_spin_lock(rq_lockp(rq));
> +	/*
> +	 * Core-wide nesting counter can never be 0 because we are
> +	 * still in it on this CPU.
> +	 */
> +	nest = rq->core->core_unsafe_nest;
> +	WARN_ON_ONCE(!nest);
> +
> +	/* Pair with smp_load_acquire() in sched_core_wait_till_safe(). */
> +	smp_store_release(&rq->core->core_unsafe_nest, nest - 1);
> +	raw_spin_unlock(rq_lockp(rq));
> +ret:
> +	local_irq_restore(flags);
> +}
> +
>   // XXX fairness/fwd progress conditions
>   /*
>    * Returns
> @@ -5019,6 +5248,7 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
>   			rq = cpu_rq(i);
>   			if (rq->core && rq->core == rq)
>   				core_rq = rq;
> +			init_sched_core_irq_work(rq);
>   		}
>   
>   		if (!core_rq)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f7e2d8a3be8e..4bcf3b1ddfb3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1059,12 +1059,15 @@ struct rq {
>   	unsigned int		core_enabled;
>   	unsigned int		core_sched_seq;
>   	struct rb_root		core_tree;
> +	struct irq_work		core_irq_work; /* To force HT into kernel */
> +	unsigned int		core_this_unsafe_nest;
>   
>   	/* shared state */
>   	unsigned int		core_task_seq;
>   	unsigned int		core_pick_seq;
>   	unsigned long		core_cookie;
>   	unsigned char		core_forceidle;
> +	unsigned int		core_unsafe_nest;
>   #endif
>   };
>   
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-06 16:57       ` Alexandre Chartre
@ 2020-11-06 17:43         ` Joel Fernandes
  2020-11-06 18:07           ` Alexandre Chartre
  0 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes @ 2020-11-06 17:43 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, James.Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Tim Chen, Paul E . McKenney

On Fri, Nov 06, 2020 at 05:57:21PM +0100, Alexandre Chartre wrote:
> 
> 
> On 11/3/20 2:20 AM, Joel Fernandes wrote:
> > Hi Alexandre,
> > 
> > Sorry for late reply as I was working on the snapshotting patch...
> > 
> > On Fri, Oct 30, 2020 at 11:29:26AM +0100, Alexandre Chartre wrote:
> > > 
> > > On 10/20/20 3:43 AM, Joel Fernandes (Google) wrote:
> > > > Core-scheduling prevents hyperthreads in usermode from attacking each
> > > > other, but it does not do anything about one of the hyperthreads
> > > > entering the kernel for any reason. This leaves the door open for MDS
> > > > and L1TF attacks with concurrent execution sequences between
> > > > hyperthreads.
> > > > 
> > > > This patch therefore adds support for protecting all syscall and IRQ
> > > > kernel mode entries. Care is taken to track the outermost usermode exit
> > > > and entry using per-cpu counters. In cases where one of the hyperthreads
> > > > enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> > > > when not needed - example: idle and non-cookie HTs do not need to be
> > > > forced into kernel mode.
> > > 
> > > Hi Joel,
> > > 
> > > In order to protect syscall/IRQ kernel mode entries, shouldn't we have a
> > > call to sched_core_unsafe_enter() in the syscall/IRQ entry code? I don't
> > > see such a call. Am I missing something?
> > 
> > Yes, this is a known bug and it is fixed in v9, which I'll post soon. Meanwhile
> > the updated patch is appended below:
> 
> [..]
> 
> > > > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > > > index 0a1e20f8d4e8..c8dc6b1b1f40 100644
> > > > --- a/kernel/entry/common.c
> > > > +++ b/kernel/entry/common.c
> > > > @@ -137,6 +137,26 @@ static __always_inline void exit_to_user_mode(void)
> > > >    /* Workaround to allow gradual conversion of architecture code */
> > > >    void __weak arch_do_signal(struct pt_regs *regs) { }
> > > > +unsigned long exit_to_user_get_work(void)
> > > > +{
> > > > +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> > > > +
> > > > +	if (IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
> > > > +		return ti_work;
> > > > +
> > > > +#ifdef CONFIG_SCHED_CORE
> > > > +	ti_work &= EXIT_TO_USER_MODE_WORK;
> > > > +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
> > > > +		sched_core_unsafe_exit();
> > > > +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
> > > 
> > > If we call sched_core_unsafe_exit() before sched_core_wait_till_safe() then we
> > > expose ourself during the entire wait period in sched_core_wait_till_safe(). It
> > > would be better to call sched_core_unsafe_exit() once we know for sure we are
> > > going to exit.
> > 
> > The way the algorithm works right now, it requires the current task to get
> > out of the unsafe state while waiting, otherwise it will lock up. Note that we
> > wait with interrupts enabled, so new interrupts could come in while waiting.
> > 
> > TBH this code is very tricky to get right and it took a long time to get it
> > working properly. For now I am content with the way it works. We can improve
> > it further, incrementally, in the future.
> > 
> 
> I am concerned this leaves a lot of windows open (even with the updated patch)
> where the system remains exposed. There are 3 obvious windows:
> 
> - after switching to the kernel page-table and until enter_from_user_mode() is called
> - while waiting for other cpus
> - after leaving exit_to_user_mode_loop() and until switching back to the user page-table
> 
> Also on syscall/interrupt entry, sched_core_unsafe_enter() is called (in the
> updated patch) and this sends an IPI to other CPUs but it doesn't wait for
> other CPUs to effectively switch to the kernel page-table. It also seems like
> the case where the CPU is interrupted by a NMI is not handled.

TBH, we discussed on the list before that there may not be much value in
closing the above mentioned windows. We already knew there are a few windows
open like that. Thomas Gleixner told us that the solution does not need to be
100% as long as it closes most windows and is performant -- the important
thing being to disrupt the attacker rather than making it 100% attack proof,
while keeping the code simple.

> I know the code is tricky, and I am working on something similar for ASI (ASI
> lockdown) where I am addressing all these cases (this seems to work).

Ok.

> > Let me know if I may add your Reviewed-by tag for this patch, if there are no
> > other comments, and I appreciate it. Appended the updated patch below.
> > 
> 
> I haven't effectively reviewed the code yet. Also I wonder if the work I am doing
> with ASI for synchronizing sibling cpus (i.e. the ASI lockdown) and the integration
> with PTI could provide what you need. Basically each process has an ASI and the
> ASI lockdown ensure that sibling cpus are also running with a trusted ASI. If
> the process/ASI is interrupted (e.g. on interrupt/exception/NMI) then it forces
> sibling cpus to also interrupt ASI. The sibling cpus synchronization occurs when
> switching the page-tables (between user and kernel) so there's no exposure window.

Maybe. But are you doing everything this patch does, when you enter ASI
lockdown? Basically, we want to only send IPIs if needed and we track the
"core wide" entry into the kernel to make sure we do the slightly higher
overhead things once per "core wide" entry and exit. For some pictures, check
these slides from slide 15:
https://docs.google.com/presentation/d/1VzeQo3AyGTN35DJ3LKoPWBfiZHZJiF8q0NrX9eVYG70/edit#slide=id.g91cff3980b_0_7
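
To make that bookkeeping concrete, here is a toy user-space model of the
per-CPU and core-wide nesting counters (names and structure are simplified
for illustration only; this is not the kernel code itself):

#include <stdbool.h>
#include <stdio.h>

struct cpu_state  { unsigned int this_unsafe_nest; }; /* per hyperthread */
struct core_state { unsigned int core_unsafe_nest; }; /* shared per core */

/* Returns true only for the first sibling to enter, i.e. when the
 * core-wide "unsafe" work (kicking the siblings with IPIs) would be done. */
static bool unsafe_enter(struct cpu_state *cpu, struct core_state *core)
{
	if (cpu->this_unsafe_nest++)	/* only the outermost entry counts */
		return false;
	return ++core->core_unsafe_nest == 1;
}

/* Returns true only for the last sibling to leave, i.e. when the core
 * goes back to the safe state and waiters may return to user mode. */
static bool unsafe_exit(struct cpu_state *cpu, struct core_state *core)
{
	if (--cpu->this_unsafe_nest)
		return false;
	return --core->core_unsafe_nest == 0;
}

int main(void)
{
	struct cpu_state ht0 = {0}, ht1 = {0};
	struct core_state core = {0};

	printf("ht0 enters kernel, kick siblings? %d\n", unsafe_enter(&ht0, &core));
	printf("ht1 enters kernel, kick siblings? %d\n", unsafe_enter(&ht1, &core));
	printf("ht0 returns to user, core safe?   %d\n", unsafe_exit(&ht0, &core));
	printf("ht1 returns to user, core safe?   %d\n", unsafe_exit(&ht1, &core));
	return 0;
}

The real patch additionally has to deal with rq locking, hotplug and the
TIF_UNSAFE_RET flag, but the "first in / last out" counting above is the
basic idea behind doing the expensive work only once per core-wide entry and
exit.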

As for switching page tables, I am not sure if that is all that's needed to
close the windows you mentioned, because MDS attacks are independent of page
table entries; they leak the uarch buffers. So you really have to isolate
things in the time domain, as this patch does.

We discussed with Thomas and Dario in previous list emails that maybe this
patch can be used as a subset of the ASI work, as a "utility" to implement the
stunning.
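
Just to sketch what I mean by a "utility" (hypothetical and untested; the
asi_* names are made up for illustration, only the sched_core_*() hooks and
EXIT_TO_USER_MODE_WORK come from this patch):

/*
 * Hypothetical sketch of ASI reusing the core-wide protection hooks as a
 * "stunning" utility.
 */
static void asi_enter_lockdown(void)
{
	/* Kick untrusted siblings into the kernel before touching secrets. */
	sched_core_unsafe_enter();
}

static void asi_exit_lockdown(void)
{
	/* Leave the core-wide unsafe state... */
	sched_core_unsafe_exit();

	/* ...and spin until no sibling is still unsafe; if other exit work
	 * became pending while waiting, go back to the unsafe state so the
	 * caller retries, mirroring what exit_to_user_get_work() does. */
	if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK))
		sched_core_unsafe_enter();
}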

> Let me have a closer look.

Sure, thanks.

 - Joel

> alex.
> 
> 
> > ---8<-----------------------
> > 
> >  From b2835a587a28405ffdf8fc801e798129a014a8c8 Mon Sep 17 00:00:00 2001
> > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > Date: Mon, 27 Jul 2020 17:56:14 -0400
> > Subject: [PATCH] kernel/entry: Add support for core-wide protection of
> >   kernel-mode
> > 
> > Core-scheduling prevents hyperthreads in usermode from attacking each
> > other, but it does not do anything about one of the hyperthreads
> > entering the kernel for any reason. This leaves the door open for MDS
> > and L1TF attacks with concurrent execution sequences between
> > hyperthreads.
> > 
> > This patch therefore adds support for protecting all syscall and IRQ
> > kernel mode entries. Care is taken to track the outermost usermode exit
> > and entry using per-cpu counters. In cases where one of the hyperthreads
> > enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> > when not needed - example: idle and non-cookie HTs do not need to be
> > forced into kernel mode.
> > 
> > More information about attacks:
> > For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> > data to either host or guest attackers. For L1TF, it is possible to leak
> > to guest attackers. There is no possible mitigation involving flushing
> > of buffers to avoid this since the execution of attacker and victims
> > happen concurrently on 2 or more HTs.
> > 
> > Cc: Julien Desfossez <jdesfossez@digitalocean.com>
> > Cc: Tim Chen <tim.c.chen@linux.intel.com>
> > Cc: Aaron Lu <aaron.lwe@gmail.com>
> > Cc: Aubrey Li <aubrey.li@linux.intel.com>
> > Cc: Tim Chen <tim.c.chen@intel.com>
> > Cc: Paul E. McKenney <paulmck@kernel.org>
> > Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
> > Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> > Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >   .../admin-guide/kernel-parameters.txt         |   9 +
> >   include/linux/entry-common.h                  |   6 +-
> >   include/linux/sched.h                         |  12 +
> >   kernel/entry/common.c                         |  28 ++-
> >   kernel/sched/core.c                           | 230 ++++++++++++++++++
> >   kernel/sched/sched.h                          |   3 +
> >   6 files changed, 285 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 3236427e2215..a338d5d64c3d 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4678,6 +4678,15 @@
> >   	sbni=		[NET] Granch SBNI12 leased line adapter
> > +	sched_core_protect_kernel=
> > +			[SCHED_CORE] Pause SMT siblings of a core running in
> > +			user mode, if at least one of the siblings of the core
> > +			is running in kernel mode. This is to guarantee that
> > +			kernel data is not leaked to tasks which are not trusted
> > +			by the kernel. A value of 0 disables protection, 1
> > +			enables protection. The default is 1. Note that protection
> > +			depends on the arch defining the _TIF_UNSAFE_RET flag.
> > +
> >   	sched_debug	[KNL] Enables verbose scheduler debug messages.
> >   	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
> > diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> > index 474f29638d2c..62278c5b3b5f 100644
> > --- a/include/linux/entry-common.h
> > +++ b/include/linux/entry-common.h
> > @@ -33,6 +33,10 @@
> >   # define _TIF_PATCH_PENDING		(0)
> >   #endif
> > +#ifndef _TIF_UNSAFE_RET
> > +# define _TIF_UNSAFE_RET		(0)
> > +#endif
> > +
> >   #ifndef _TIF_UPROBE
> >   # define _TIF_UPROBE			(0)
> >   #endif
> > @@ -69,7 +73,7 @@
> >   #define EXIT_TO_USER_MODE_WORK						\
> >   	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
> > -	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
> > +	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET |	\
> >   	 ARCH_EXIT_TO_USER_MODE_WORK)
> >   /**
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d38e904dd603..fe6f225bfbf9 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
> >   const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
> > +#ifdef CONFIG_SCHED_CORE
> > +void sched_core_unsafe_enter(void);
> > +void sched_core_unsafe_exit(void);
> > +bool sched_core_wait_till_safe(unsigned long ti_check);
> > +bool sched_core_kernel_protected(void);
> > +#else
> > +#define sched_core_unsafe_enter(ignore) do { } while (0)
> > +#define sched_core_unsafe_exit(ignore) do { } while (0)
> > +#define sched_core_wait_till_safe(ignore) do { } while (0)
> > +#define sched_core_kernel_protected(ignore) do { } while (0)
> > +#endif
> > +
> >   #endif
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index 0a1e20f8d4e8..a18ed60cedea 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -28,6 +28,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
> >   	instrumentation_begin();
> >   	trace_hardirqs_off_finish();
> > +	if (_TIF_UNSAFE_RET) /* Kernel protection depends on arch defining the flag. */
> > +		sched_core_unsafe_enter();
> >   	instrumentation_end();
> >   }
> > @@ -137,6 +139,27 @@ static __always_inline void exit_to_user_mode(void)
> >   /* Workaround to allow gradual conversion of architecture code */
> >   void __weak arch_do_signal(struct pt_regs *regs) { }
> > +unsigned long exit_to_user_get_work(void)
> > +{
> > +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> > +
> > +	if ((IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
> > +	    || !_TIF_UNSAFE_RET)
> > +		return ti_work;
> > +
> > +#ifdef CONFIG_SCHED_CORE
> > +	ti_work &= EXIT_TO_USER_MODE_WORK;
> > +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
> > +		sched_core_unsafe_exit();
> > +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
> > +			sched_core_unsafe_enter(); /* not exiting to user yet. */
> > +		}
> > +	}
> > +
> > +	return READ_ONCE(current_thread_info()->flags);
> > +#endif
> > +}
> > +
> >   static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> >   					    unsigned long ti_work)
> >   {
> > @@ -175,7 +198,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> >   		 * enabled above.
> >   		 */
> >   		local_irq_disable_exit_to_user();
> > -		ti_work = READ_ONCE(current_thread_info()->flags);
> > +		ti_work = exit_to_user_get_work();
> >   	}
> >   	/* Return the latest work state for arch_exit_to_user_mode() */
> > @@ -184,9 +207,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> >   static void exit_to_user_mode_prepare(struct pt_regs *regs)
> >   {
> > -	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> > +	unsigned long ti_work;
> >   	lockdep_assert_irqs_disabled();
> > +	ti_work = exit_to_user_get_work();
> >   	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> >   		ti_work = exit_to_user_mode_loop(regs, ti_work);
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index e05728bdb18c..bd206708fac2 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
> >   #ifdef CONFIG_SCHED_CORE
> > +DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
> > +static int __init set_sched_core_protect_kernel(char *str)
> > +{
> > +	unsigned long val = 0;
> > +
> > +	if (!str)
> > +		return 0;
> > +
> > +	if (!kstrtoul(str, 0, &val) && !val)
> > +		static_branch_disable(&sched_core_protect_kernel);
> > +
> > +	return 1;
> > +}
> > +__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
> > +
> > +/* Is the kernel protected by core scheduling? */
> > +bool sched_core_kernel_protected(void)
> > +{
> > +	return static_branch_likely(&sched_core_protect_kernel);
> > +}
> > +
> >   DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
> >   /* kernel prio, less is more */
> > @@ -4596,6 +4617,214 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
> >   	return a->core_cookie == b->core_cookie;
> >   }
> > +/*
> > + * Handler to attempt to enter kernel. It does nothing because the exit to
> > + * usermode or guest mode will do the actual work (of waiting if needed).
> > + */
> > +static void sched_core_irq_work(struct irq_work *work)
> > +{
> > +	return;
> > +}
> > +
> > +static inline void init_sched_core_irq_work(struct rq *rq)
> > +{
> > +	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
> > +}
> > +
> > +/*
> > + * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
> > + * exits the core-wide unsafe state. Obviously the CPU calling this function
> > + * should not be responsible for the core being in the core-wide unsafe state
> > + * otherwise it will deadlock.
> > + *
> > + * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
> > + *            the loop if TIF flags are set and notify caller about it.
> > + *
> > + * IRQs should be disabled.
> > + */
> > +bool sched_core_wait_till_safe(unsigned long ti_check)
> > +{
> > +	bool restart = false;
> > +	struct rq *rq;
> > +	int cpu;
> > +
> > +	/* We clear the thread flag only at the end, so need to check for it. */
> > +	ti_check &= ~_TIF_UNSAFE_RET;
> > +
> > +	cpu = smp_processor_id();
> > +	rq = cpu_rq(cpu);
> > +
> > +	if (!sched_core_enabled(rq))
> > +		goto ret;
> > +
> > +	/* Down grade to allow interrupts to prevent stop_machine lockups.. */
> > +	preempt_disable();
> > +	local_irq_enable();
> > +
> > +	/*
> > +	 * Wait till the core of this HT is not in an unsafe state.
> > +	 *
> > +	 * Pair with smp_store_release() in sched_core_unsafe_exit().
> > +	 */
> > +	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
> > +		cpu_relax();
> > +		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
> > +			restart = true;
> > +			break;
> > +		}
> > +	}
> > +
> > +	/* Upgrade it back to the expectations of entry code. */
> > +	local_irq_disable();
> > +	preempt_enable();
> > +
> > +ret:
> > +	if (!restart)
> > +		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
> > +
> > +	return restart;
> > +}
> > +
> > +/*
> > + * Enter the core-wide IRQ state. Sibling will be paused if it is running
> > + * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
> > + * avoid sending useless IPIs is made. Must be called only from hard IRQ
> > + * context.
> > + */
> > +void sched_core_unsafe_enter(void)
> > +{
> > +	const struct cpumask *smt_mask;
> > +	unsigned long flags;
> > +	struct rq *rq;
> > +	int i, cpu;
> > +
> > +	if (!static_branch_likely(&sched_core_protect_kernel))
> > +		return;
> > +
> > +	/* Ensure that on return to user/guest, we check whether to wait. */
> > +	if (current->core_cookie)
> > +		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
> > +
> > +	local_irq_save(flags);
> > +	cpu = smp_processor_id();
> > +	rq = cpu_rq(cpu);
> > +	if (!sched_core_enabled(rq))
> > +		goto ret;
> > +
> > +	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
> > +	rq->core_this_unsafe_nest++;
> > +
> > +	/* Should not nest: enter() should only pair with exit(). */
> > +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
> > +		goto ret;
> > +
> > +	raw_spin_lock(rq_lockp(rq));
> > +	smt_mask = cpu_smt_mask(cpu);
> > +
> > +	/* Contribute this CPU's unsafe_enter() to core-wide unsafe_enter() count. */
> > +	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
> > +
> > +	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
> > +		goto unlock;
> > +
> > +	if (irq_work_is_busy(&rq->core_irq_work)) {
> > +		/*
> > +		 * Do nothing more since we are in an IPI sent from another
> > +		 * sibling to enforce safety. That sibling would have sent IPIs
> > +		 * to all of the HTs.
> > +		 */
> > +		goto unlock;
> > +	}
> > +
> > +	/*
> > +	 * If we are not the first ones on the core to enter core-wide unsafe
> > +	 * state, do nothing.
> > +	 */
> > +	if (rq->core->core_unsafe_nest > 1)
> > +		goto unlock;
> > +
> > +	/* Do nothing more if the core is not tagged. */
> > +	if (!rq->core->core_cookie)
> > +		goto unlock;
> > +
> > +	for_each_cpu(i, smt_mask) {
> > +		struct rq *srq = cpu_rq(i);
> > +
> > +		if (i == cpu || cpu_is_offline(i))
> > +			continue;
> > +
> > +		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
> > +			continue;
> > +
> > +		/* Skip if HT is not running a tagged task. */
> > +		if (!srq->curr->core_cookie && !srq->core_pick)
> > +			continue;
> > +
> > +		/*
> > +		 * Force sibling into the kernel by IPI. If work was already
> > +		 * pending, no new IPIs are sent. This is Ok since the receiver
> > +		 * would already be in the kernel, or on its way to it.
> > +		 */
> > +		irq_work_queue_on(&srq->core_irq_work, i);
> > +	}
> > +unlock:
> > +	raw_spin_unlock(rq_lockp(rq));
> > +ret:
> > +	local_irq_restore(flags);
> > +}
> > +
> > +/*
> > + * Process any work needed for either exiting the core-wide unsafe state, or for
> > + * waiting on this hyperthread if the core is still in this state.
> > + *
> > + * @idle: Are we called from the idle loop?
> > + */
> > +void sched_core_unsafe_exit(void)
> > +{
> > +	unsigned long flags;
> > +	unsigned int nest;
> > +	struct rq *rq;
> > +	int cpu;
> > +
> > +	if (!static_branch_likely(&sched_core_protect_kernel))
> > +		return;
> > +
> > +	local_irq_save(flags);
> > +	cpu = smp_processor_id();
> > +	rq = cpu_rq(cpu);
> > +
> > +	/* Do nothing if core-sched disabled. */
> > +	if (!sched_core_enabled(rq))
> > +		goto ret;
> > +
> > +	/*
> > +	 * Can happen when a process is forked and the first return to user
> > +	 * mode is a syscall exit. Either way, there's nothing to do.
> > +	 */
> > +	if (rq->core_this_unsafe_nest == 0)
> > +		goto ret;
> > +
> > +	rq->core_this_unsafe_nest--;
> > +
> > +	/* enter() should be paired with exit() only. */
> > +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
> > +		goto ret;
> > +
> > +	raw_spin_lock(rq_lockp(rq));
> > +	/*
> > +	 * Core-wide nesting counter can never be 0 because we are
> > +	 * still in it on this CPU.
> > +	 */
> > +	nest = rq->core->core_unsafe_nest;
> > +	WARN_ON_ONCE(!nest);
> > +
> > +	/* Pair with smp_load_acquire() in sched_core_wait_till_safe(). */
> > +	smp_store_release(&rq->core->core_unsafe_nest, nest - 1);
> > +	raw_spin_unlock(rq_lockp(rq));
> > +ret:
> > +	local_irq_restore(flags);
> > +}
> > +
> >   // XXX fairness/fwd progress conditions
> >   /*
> >    * Returns
> > @@ -5019,6 +5248,7 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
> >   			rq = cpu_rq(i);
> >   			if (rq->core && rq->core == rq)
> >   				core_rq = rq;
> > +			init_sched_core_irq_work(rq);
> >   		}
> >   		if (!core_rq)
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index f7e2d8a3be8e..4bcf3b1ddfb3 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1059,12 +1059,15 @@ struct rq {
> >   	unsigned int		core_enabled;
> >   	unsigned int		core_sched_seq;
> >   	struct rb_root		core_tree;
> > +	struct irq_work		core_irq_work; /* To force HT into kernel */
> > +	unsigned int		core_this_unsafe_nest;
> >   	/* shared state */
> >   	unsigned int		core_task_seq;
> >   	unsigned int		core_pick_seq;
> >   	unsigned long		core_cookie;
> >   	unsigned char		core_forceidle;
> > +	unsigned int		core_unsafe_nest;
> >   #endif
> >   };
> > 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 00/26] Core scheduling
  2020-11-06  2:58   ` Li, Aubrey
@ 2020-11-06 17:54     ` Joel Fernandes
  2020-11-09  6:04       ` Li, Aubrey
  0 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes @ 2020-11-06 17:54 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Ning, Hongyu, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Vineeth Pillai, Aaron Lu, Aubrey Li,
	tglx, linux-kernel, mingo, torvalds, fweisbec, keescook, kerrnel,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Tim Chen

On Fri, Nov 06, 2020 at 10:58:58AM +0800, Li, Aubrey wrote:

> > 
> > 	-- workload D, new added syscall workload, performance drop in cs_on:
> > 	+----------------------+------+-------------------------------+
> > 	|                      | **   | will-it-scale  * 192          |
> > 	|                      |      | (pipe based context_switch)   |
> > 	+======================+======+===============================+
> > 	| cgroup               | **   | cg_will-it-scale              |
> > 	+----------------------+------+-------------------------------+
> > 	| record_item          | **   | threads_avg                   |
> > 	+----------------------+------+-------------------------------+
> > 	| coresched_normalized | **   | 0.2                           |
> > 	+----------------------+------+-------------------------------+
> > 	| default_normalized   | **   | 1                             |
> > 	+----------------------+------+-------------------------------+
> > 	| smtoff_normalized    | **   | 0.89                          |
> > 	+----------------------+------+-------------------------------+
> 
> will-it-scale may be a very extreme case. The story here is:
> - On one sibling, a reader/writer gets blocked and tries to schedule another reader/writer in.
> - The other sibling tries to wake up a reader/writer.
> 
> Both CPUs are acquiring rq->__lock,
> 
> So when coresched off, they are two different locks, lock stat(1 second delta) below:
> 
> class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
> &rq->__lock:          210            210           0.10           3.04         180.87           0.86            797       79165021           0.03          20.69    60650198.34           0.77
> 
> But when coresched on, they are actually one same lock, lock stat(1 second delta) below:
> 
> class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
> &rq->__lock:      6479459        6484857           0.05         216.46    60829776.85           9.38        8346319       15399739           0.03          95.56    81119515.38           5.27
> 
> This nature of core scheduling may degrade the performance of similar workloads with frequent context switching.

When core sched is off, is SMT off as well? From the above table, it seems to
be. So even with core sched off, there will be a single lock per physical CPU
core (assuming SMT is also off), right? Or did I miss something?
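
For anyone following along, here is a rough stand-alone model of the lock
selection Aubrey is describing (the real rq_lockp() in this series is more
involved; the names and fields below are simplified for illustration only):

#include <stdio.h>

struct rq {
	int __lock;          /* stand-in for the per-rq raw spinlock      */
	struct rq *core;     /* rq of the "leader" CPU of the SMT core    */
	int core_enabled;    /* stand-in for sched_core_enabled()         */
};

/* With core scheduling enabled, every sibling resolves to the core's lock. */
static int *rq_lockp(struct rq *rq)
{
	return rq->core_enabled ? &rq->core->__lock : &rq->__lock;
}

int main(void)
{
	static struct rq ht0 = { .core = &ht0, .core_enabled = 1 };
	static struct rq ht1 = { .core = &ht0, .core_enabled = 1 };

	printf("core sched on:  siblings share one lock? %d\n",
	       rq_lockp(&ht0) == rq_lockp(&ht1));

	ht0.core_enabled = ht1.core_enabled = 0;
	printf("core sched off: siblings share one lock? %d\n",
	       rq_lockp(&ht0) == rq_lockp(&ht1));
	return 0;
}

That sharing is why the lock_stat numbers above collapse onto a single
contended rq->__lock once core scheduling is enabled.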

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-06 17:43         ` Joel Fernandes
@ 2020-11-06 18:07           ` Alexandre Chartre
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandre Chartre @ 2020-11-06 18:07 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, James.Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Tim Chen, Paul E . McKenney


On 11/6/20 6:43 PM, Joel Fernandes wrote:
> On Fri, Nov 06, 2020 at 05:57:21PM +0100, Alexandre Chartre wrote:
>>
>>
>> On 11/3/20 2:20 AM, Joel Fernandes wrote:
>>> Hi Alexandre,
>>>
>>> Sorry for late reply as I was working on the snapshotting patch...
>>>
>>> On Fri, Oct 30, 2020 at 11:29:26AM +0100, Alexandre Chartre wrote:
>>>>
>>>> On 10/20/20 3:43 AM, Joel Fernandes (Google) wrote:
>>>>> Core-scheduling prevents hyperthreads in usermode from attacking each
>>>>> other, but it does not do anything about one of the hyperthreads
>>>>> entering the kernel for any reason. This leaves the door open for MDS
>>>>> and L1TF attacks with concurrent execution sequences between
>>>>> hyperthreads.
>>>>>
>>>>> This patch therefore adds support for protecting all syscall and IRQ
>>>>> kernel mode entries. Care is taken to track the outermost usermode exit
>>>>> and entry using per-cpu counters. In cases where one of the hyperthreads
>>>>> enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
>>>>> when not needed - example: idle and non-cookie HTs do not need to be
>>>>> forced into kernel mode.
>>>>
>>>> Hi Joel,
>>>>
>>>> In order to protect syscall/IRQ kernel mode entries, shouldn't we have a
>>>> call to sched_core_unsafe_enter() in the syscall/IRQ entry code? I don't
>>>> see such a call. Am I missing something?
>>>
>>> Yes, this is a known bug and it is fixed in v9, which I'll post soon. Meanwhile
>>> the updated patch is appended below:
>>
>> [..]
>>
>>>>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>>>>> index 0a1e20f8d4e8..c8dc6b1b1f40 100644
>>>>> --- a/kernel/entry/common.c
>>>>> +++ b/kernel/entry/common.c
>>>>> @@ -137,6 +137,26 @@ static __always_inline void exit_to_user_mode(void)
>>>>>     /* Workaround to allow gradual conversion of architecture code */
>>>>>     void __weak arch_do_signal(struct pt_regs *regs) { }
>>>>> +unsigned long exit_to_user_get_work(void)
>>>>> +{
>>>>> +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
>>>>> +
>>>>> +	if (IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
>>>>> +		return ti_work;
>>>>> +
>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>> +	ti_work &= EXIT_TO_USER_MODE_WORK;
>>>>> +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
>>>>> +		sched_core_unsafe_exit();
>>>>> +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
>>>>
>>>> If we call sched_core_unsafe_exit() before sched_core_wait_till_safe() then we
>>>> expose ourself during the entire wait period in sched_core_wait_till_safe(). It
>>>> would be better to call sched_core_unsafe_exit() once we know for sure we are
>>>> going to exit.
>>>
>>> The way the algorithm works right now, it requires the current task to get
>>> out of the unsafe state while waiting, otherwise it will lock up. Note that we
>>> wait with interrupts enabled, so new interrupts could come in while waiting.
>>>
>>> TBH this code is very tricky to get right and it took a long time to get it
>>> working properly. For now I am content with the way it works. We can improve
>>> it further, incrementally, in the future.
>>>
>>
>> I am concerned this leaves a lot of windows open (even with the updated patch)
>> where the system remains exposed. There are 3 obvious windows:
>>
>> - after switching to the kernel page-table and until enter_from_user_mode() is called
>> - while waiting for other cpus
>> - after leaving exit_to_user_mode_loop() and until switching back to the user page-table
>>
>> Also on syscall/interrupt entry, sched_core_unsafe_enter() is called (in the
>> updated patch) and this sends an IPI to other CPUs but it doesn't wait for
>> other CPUs to effectively switch to the kernel page-table. It also seems like
>> the case where the CPU is interrupted by a NMI is not handled.
> 
> TBH, we discussed on the list before that there may not be much value in
> closing the above mentioned windows. We already knew there are a few windows
> open like that. Thomas Gleixner told us that the solution does not need to be
> 100% as long as it closes most windows and is performant -- the important
> thing being to disrupt the attacker rather than making it 100% attack proof,
> while keeping the code simple.

Ok. I will need to check if the additional complexity I have is really
worth it, and that's certainly something we can improve another time if needed.


>> I know the code is tricky, and I am working on something similar for ASI (ASI
>> lockdown) where I am addressing all these cases (this seems to work).
> 
> Ok.
> 
>>> Let me know if I may add your Reviewed-by tag for this patch, if there are no
>>> other comments, and I appreciate it. Appended the updated patch below.
>>>
>>
>> I haven't effectively reviewed the code yet. Also I wonder if the work I am doing
>> with ASI for synchronizing sibling cpus (i.e. the ASI lockdown) and the integration
>> with PTI could provide what you need. Basically each process has an ASI and the
>> ASI lockdown ensures that sibling cpus are also running with a trusted ASI. If
>> the process/ASI is interrupted (e.g. on interrupt/exception/NMI) then it forces
>> sibling cpus to also interrupt ASI. The sibling cpus synchronization occurs when
>> switching the page-tables (between user and kernel) so there's no exposure window.
> 
> Maybe. But are you doing everything this patch does, when you enter ASI lockdown?
> Basically, we want to only send IPIs if needed and we track the "core wide"
> entry into the kernel to make sure we do the slightly higher overhead things
> once per "core wide" entry and exit. For some pictures, check these slides
> from slide 15:
> https://docs.google.com/presentation/d/1VzeQo3AyGTN35DJ3LKoPWBfiZHZJiF8q0NrX9eVYG70/edit#slide=id.g91cff3980b_0_7

Yes, I think this looks similar to what ASI lockdown is doing.


> As for switching page tables, I am not sure if that is all that's needed to
> close the windows you mentioned, because MDS attacks are independent of page
> table entries; they leak the uarch buffers. So you really have to isolate
> things in the time domain, as this patch does.
> 
> We discussed with Thomas and Dario in previous list emails that maybe this
> patch can be used as a subset of the ASI work, as a "utility" to implement the
> stunning.
> 

Ok, I will look at how this can fit together.

alex.

>> Let me have a closer look.
> 
> Sure, thanks.
> 
>   - Joel
> 
>> alex.
>>
>>
>>> ---8<-----------------------
>>>
>>>   From b2835a587a28405ffdf8fc801e798129a014a8c8 Mon Sep 17 00:00:00 2001
>>> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
>>> Date: Mon, 27 Jul 2020 17:56:14 -0400
>>> Subject: [PATCH] kernel/entry: Add support for core-wide protection of
>>>    kernel-mode
>>>
>>> Core-scheduling prevents hyperthreads in usermode from attacking each
>>> other, but it does not do anything about one of the hyperthreads
>>> entering the kernel for any reason. This leaves the door open for MDS
>>> and L1TF attacks with concurrent execution sequences between
>>> hyperthreads.
>>>
>>> This patch therefore adds support for protecting all syscall and IRQ
>>> kernel mode entries. Care is taken to track the outermost usermode exit
>>> and entry using per-cpu counters. In cases where one of the hyperthreads
>>> enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
>>> when not needed - example: idle and non-cookie HTs do not need to be
>>> forced into kernel mode.
>>>
>>> More information about attacks:
>>> For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
>>> data to either host or guest attackers. For L1TF, it is possible to leak
>>> to guest attackers. There is no possible mitigation involving flushing
>>> of buffers to avoid this since the execution of attacker and victims
>>> happen concurrently on 2 or more HTs.
>>>
>>> Cc: Julien Desfossez <jdesfossez@digitalocean.com>
>>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>>> Cc: Aaron Lu <aaron.lwe@gmail.com>
>>> Cc: Aubrey Li <aubrey.li@linux.intel.com>
>>> Cc: Tim Chen <tim.c.chen@intel.com>
>>> Cc: Paul E. McKenney <paulmck@kernel.org>
>>> Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
>>> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
>>> Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
>>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>>> ---
>>>    .../admin-guide/kernel-parameters.txt         |   9 +
>>>    include/linux/entry-common.h                  |   6 +-
>>>    include/linux/sched.h                         |  12 +
>>>    kernel/entry/common.c                         |  28 ++-
>>>    kernel/sched/core.c                           | 230 ++++++++++++++++++
>>>    kernel/sched/sched.h                          |   3 +
>>>    6 files changed, 285 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>> index 3236427e2215..a338d5d64c3d 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -4678,6 +4678,15 @@
>>>    	sbni=		[NET] Granch SBNI12 leased line adapter
>>> +	sched_core_protect_kernel=
>>> +			[SCHED_CORE] Pause SMT siblings of a core running in
>>> +			user mode, if at least one of the siblings of the core
>>> +			is running in kernel mode. This is to guarantee that
>>> +			kernel data is not leaked to tasks which are not trusted
>>> +			by the kernel. A value of 0 disables protection, 1
>>> +			enables protection. The default is 1. Note that protection
>>> +			depends on the arch defining the _TIF_UNSAFE_RET flag.
>>> +
>>>    	sched_debug	[KNL] Enables verbose scheduler debug messages.
>>>    	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
>>> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
>>> index 474f29638d2c..62278c5b3b5f 100644
>>> --- a/include/linux/entry-common.h
>>> +++ b/include/linux/entry-common.h
>>> @@ -33,6 +33,10 @@
>>>    # define _TIF_PATCH_PENDING		(0)
>>>    #endif
>>> +#ifndef _TIF_UNSAFE_RET
>>> +# define _TIF_UNSAFE_RET		(0)
>>> +#endif
>>> +
>>>    #ifndef _TIF_UPROBE
>>>    # define _TIF_UPROBE			(0)
>>>    #endif
>>> @@ -69,7 +73,7 @@
>>>    #define EXIT_TO_USER_MODE_WORK						\
>>>    	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
>>> -	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
>>> +	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET |	\
>>>    	 ARCH_EXIT_TO_USER_MODE_WORK)
>>>    /**
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index d38e904dd603..fe6f225bfbf9 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
>>>    const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
>>> +#ifdef CONFIG_SCHED_CORE
>>> +void sched_core_unsafe_enter(void);
>>> +void sched_core_unsafe_exit(void);
>>> +bool sched_core_wait_till_safe(unsigned long ti_check);
>>> +bool sched_core_kernel_protected(void);
>>> +#else
>>> +#define sched_core_unsafe_enter(ignore) do { } while (0)
>>> +#define sched_core_unsafe_exit(ignore) do { } while (0)
>>> +#define sched_core_wait_till_safe(ignore) do { } while (0)
>>> +#define sched_core_kernel_protected(ignore) do { } while (0)
>>> +#endif
>>> +
>>>    #endif
>>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>>> index 0a1e20f8d4e8..a18ed60cedea 100644
>>> --- a/kernel/entry/common.c
>>> +++ b/kernel/entry/common.c
>>> @@ -28,6 +28,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
>>>    	instrumentation_begin();
>>>    	trace_hardirqs_off_finish();
>>> +	if (_TIF_UNSAFE_RET) /* Kernel protection depends on arch defining the flag. */
>>> +		sched_core_unsafe_enter();
>>>    	instrumentation_end();
>>>    }
>>> @@ -137,6 +139,27 @@ static __always_inline void exit_to_user_mode(void)
>>>    /* Workaround to allow gradual conversion of architecture code */
>>>    void __weak arch_do_signal(struct pt_regs *regs) { }
>>> +unsigned long exit_to_user_get_work(void)
>>> +{
>>> +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
>>> +
>>> +	if ((IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
>>> +	    || !_TIF_UNSAFE_RET)
>>> +		return ti_work;
>>> +
>>> +#ifdef CONFIG_SCHED_CORE
>>> +	ti_work &= EXIT_TO_USER_MODE_WORK;
>>> +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
>>> +		sched_core_unsafe_exit();
>>> +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
>>> +			sched_core_unsafe_enter(); /* not exiting to user yet. */
>>> +		}
>>> +	}
>>> +
>>> +	return READ_ONCE(current_thread_info()->flags);
>>> +#endif
>>> +}
>>> +
>>>    static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>>>    					    unsigned long ti_work)
>>>    {
>>> @@ -175,7 +198,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>>>    		 * enabled above.
>>>    		 */
>>>    		local_irq_disable_exit_to_user();
>>> -		ti_work = READ_ONCE(current_thread_info()->flags);
>>> +		ti_work = exit_to_user_get_work();
>>>    	}
>>>    	/* Return the latest work state for arch_exit_to_user_mode() */
>>> @@ -184,9 +207,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>>>    static void exit_to_user_mode_prepare(struct pt_regs *regs)
>>>    {
>>> -	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
>>> +	unsigned long ti_work;
>>>    	lockdep_assert_irqs_disabled();
>>> +	ti_work = exit_to_user_get_work();
>>>    	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>>>    		ti_work = exit_to_user_mode_loop(regs, ti_work);
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index e05728bdb18c..bd206708fac2 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
>>>    #ifdef CONFIG_SCHED_CORE
>>> +DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
>>> +static int __init set_sched_core_protect_kernel(char *str)
>>> +{
>>> +	unsigned long val = 0;
>>> +
>>> +	if (!str)
>>> +		return 0;
>>> +
>>> +	if (!kstrtoul(str, 0, &val) && !val)
>>> +		static_branch_disable(&sched_core_protect_kernel);
>>> +
>>> +	return 1;
>>> +}
>>> +__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
>>> +
>>> +/* Is the kernel protected by core scheduling? */
>>> +bool sched_core_kernel_protected(void)
>>> +{
>>> +	return static_branch_likely(&sched_core_protect_kernel);
>>> +}
>>> +
>>>    DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
>>>    /* kernel prio, less is more */
>>> @@ -4596,6 +4617,214 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
>>>    	return a->core_cookie == b->core_cookie;
>>>    }
>>> +/*
>>> + * Handler to attempt to enter kernel. It does nothing because the exit to
>>> + * usermode or guest mode will do the actual work (of waiting if needed).
>>> + */
>>> +static void sched_core_irq_work(struct irq_work *work)
>>> +{
>>> +	return;
>>> +}
>>> +
>>> +static inline void init_sched_core_irq_work(struct rq *rq)
>>> +{
>>> +	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
>>> +}
>>> +
>>> +/*
>>> + * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
>>> + * exits the core-wide unsafe state. Obviously the CPU calling this function
>>> + * should not be responsible for the core being in the core-wide unsafe state
>>> + * otherwise it will deadlock.
>>> + *
>>> + * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
>>> + *            the loop if TIF flags are set and notify caller about it.
>>> + *
>>> + * IRQs should be disabled.
>>> + */
>>> +bool sched_core_wait_till_safe(unsigned long ti_check)
>>> +{
>>> +	bool restart = false;
>>> +	struct rq *rq;
>>> +	int cpu;
>>> +
>>> +	/* We clear the thread flag only at the end, so need to check for it. */
>>> +	ti_check &= ~_TIF_UNSAFE_RET;
>>> +
>>> +	cpu = smp_processor_id();
>>> +	rq = cpu_rq(cpu);
>>> +
>>> +	if (!sched_core_enabled(rq))
>>> +		goto ret;
>>> +
>>> +	/* Down grade to allow interrupts to prevent stop_machine lockups.. */
>>> +	preempt_disable();
>>> +	local_irq_enable();
>>> +
>>> +	/*
>>> +	 * Wait till the core of this HT is not in an unsafe state.
>>> +	 *
>>> +	 * Pair with smp_store_release() in sched_core_unsafe_exit().
>>> +	 */
>>> +	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
>>> +		cpu_relax();
>>> +		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
>>> +			restart = true;
>>> +			break;
>>> +		}
>>> +	}
>>> +
>>> +	/* Upgrade it back to the expectations of entry code. */
>>> +	local_irq_disable();
>>> +	preempt_enable();
>>> +
>>> +ret:
>>> +	if (!restart)
>>> +		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
>>> +
>>> +	return restart;
>>> +}
>>> +
>>> +/*
>>> + * Enter the core-wide IRQ state. Sibling will be paused if it is running
>>> + * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
>>> + * avoid sending useless IPIs is made. Must be called only from hard IRQ
>>> + * context.
>>> + */
>>> +void sched_core_unsafe_enter(void)
>>> +{
>>> +	const struct cpumask *smt_mask;
>>> +	unsigned long flags;
>>> +	struct rq *rq;
>>> +	int i, cpu;
>>> +
>>> +	if (!static_branch_likely(&sched_core_protect_kernel))
>>> +		return;
>>> +
>>> +	/* Ensure that on return to user/guest, we check whether to wait. */
>>> +	if (current->core_cookie)
>>> +		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
>>> +
>>> +	local_irq_save(flags);
>>> +	cpu = smp_processor_id();
>>> +	rq = cpu_rq(cpu);
>>> +	if (!sched_core_enabled(rq))
>>> +		goto ret;
>>> +
>>> +	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
>>> +	rq->core_this_unsafe_nest++;
>>> +
>>> +	/* Should not nest: enter() should only pair with exit(). */
>>> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
>>> +		goto ret;
>>> +
>>> +	raw_spin_lock(rq_lockp(rq));
>>> +	smt_mask = cpu_smt_mask(cpu);
>>> +
>>> +	/* Contribute this CPU's unsafe_enter() to core-wide unsafe_enter() count. */
>>> +	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
>>> +
>>> +	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
>>> +		goto unlock;
>>> +
>>> +	if (irq_work_is_busy(&rq->core_irq_work)) {
>>> +		/*
>>> +		 * Do nothing more since we are in an IPI sent from another
>>> +		 * sibling to enforce safety. That sibling would have sent IPIs
>>> +		 * to all of the HTs.
>>> +		 */
>>> +		goto unlock;
>>> +	}
>>> +
>>> +	/*
>>> +	 * If we are not the first ones on the core to enter core-wide unsafe
>>> +	 * state, do nothing.
>>> +	 */
>>> +	if (rq->core->core_unsafe_nest > 1)
>>> +		goto unlock;
>>> +
>>> +	/* Do nothing more if the core is not tagged. */
>>> +	if (!rq->core->core_cookie)
>>> +		goto unlock;
>>> +
>>> +	for_each_cpu(i, smt_mask) {
>>> +		struct rq *srq = cpu_rq(i);
>>> +
>>> +		if (i == cpu || cpu_is_offline(i))
>>> +			continue;
>>> +
>>> +		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
>>> +			continue;
>>> +
>>> +		/* Skip if HT is not running a tagged task. */
>>> +		if (!srq->curr->core_cookie && !srq->core_pick)
>>> +			continue;
>>> +
>>> +		/*
>>> +		 * Force sibling into the kernel by IPI. If work was already
>>> +		 * pending, no new IPIs are sent. This is Ok since the receiver
>>> +		 * would already be in the kernel, or on its way to it.
>>> +		 */
>>> +		irq_work_queue_on(&srq->core_irq_work, i);
>>> +	}
>>> +unlock:
>>> +	raw_spin_unlock(rq_lockp(rq));
>>> +ret:
>>> +	local_irq_restore(flags);
>>> +}
>>> +
>>> +/*
>>> + * Process any work needed for either exiting the core-wide unsafe state, or for
>>> + * waiting on this hyperthread if the core is still in this state.
>>> + *
>>> + * @idle: Are we called from the idle loop?
>>> + */
>>> +void sched_core_unsafe_exit(void)
>>> +{
>>> +	unsigned long flags;
>>> +	unsigned int nest;
>>> +	struct rq *rq;
>>> +	int cpu;
>>> +
>>> +	if (!static_branch_likely(&sched_core_protect_kernel))
>>> +		return;
>>> +
>>> +	local_irq_save(flags);
>>> +	cpu = smp_processor_id();
>>> +	rq = cpu_rq(cpu);
>>> +
>>> +	/* Do nothing if core-sched disabled. */
>>> +	if (!sched_core_enabled(rq))
>>> +		goto ret;
>>> +
>>> +	/*
>>> +	 * Can happen when a process is forked and the first return to user
>>> +	 * mode is a syscall exit. Either way, there's nothing to do.
>>> +	 */
>>> +	if (rq->core_this_unsafe_nest == 0)
>>> +		goto ret;
>>> +
>>> +	rq->core_this_unsafe_nest--;
>>> +
>>> +	/* enter() should be paired with exit() only. */
>>> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
>>> +		goto ret;
>>> +
>>> +	raw_spin_lock(rq_lockp(rq));
>>> +	/*
>>> +	 * Core-wide nesting counter can never be 0 because we are
>>> +	 * still in it on this CPU.
>>> +	 */
>>> +	nest = rq->core->core_unsafe_nest;
>>> +	WARN_ON_ONCE(!nest);
>>> +
>>> +	/* Pair with smp_load_acquire() in sched_core_wait_till_safe(). */
>>> +	smp_store_release(&rq->core->core_unsafe_nest, nest - 1);
>>> +	raw_spin_unlock(rq_lockp(rq));
>>> +ret:
>>> +	local_irq_restore(flags);
>>> +}
>>> +
>>>    // XXX fairness/fwd progress conditions
>>>    /*
>>>     * Returns
>>> @@ -5019,6 +5248,7 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
>>>    			rq = cpu_rq(i);
>>>    			if (rq->core && rq->core == rq)
>>>    				core_rq = rq;
>>> +			init_sched_core_irq_work(rq);
>>>    		}
>>>    		if (!core_rq)
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index f7e2d8a3be8e..4bcf3b1ddfb3 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -1059,12 +1059,15 @@ struct rq {
>>>    	unsigned int		core_enabled;
>>>    	unsigned int		core_sched_seq;
>>>    	struct rb_root		core_tree;
>>> +	struct irq_work		core_irq_work; /* To force HT into kernel */
>>> +	unsigned int		core_this_unsafe_nest;
>>>    	/* shared state */
>>>    	unsigned int		core_task_seq;
>>>    	unsigned int		core_pick_seq;
>>>    	unsigned long		core_cookie;
>>>    	unsigned char		core_forceidle;
>>> +	unsigned int		core_unsafe_nest;
>>>    #endif
>>>    };
>>>

* [RFT for v9] (Was Re: [PATCH v8 -tip 00/26] Core scheduling)
  2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
                   ` (26 preceding siblings ...)
  2020-10-30 13:26 ` [PATCH v8 -tip 00/26] Core scheduling Ning, Hongyu
@ 2020-11-06 20:55 ` Joel Fernandes
  2020-11-13  9:22   ` Ning, Hongyu
  27 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes @ 2020-11-06 20:55 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel,
	hongyu.ning
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

All,

I am getting ready to send the next v9 series based on tip/master
branch. Could you please give the below tree a try and report any results in
your testing?
git tree:
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch coresched)
git log:
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched

The major changes in this series are the improvements:
(1)
"sched: Make snapshotting of min_vruntime more CGroup-friendly"
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=9a20a6652b3c50fd51faa829f7947004239a04eb

(2)
"sched: Simplify the core pick loop for optimized case"
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=0370117b4fd418cdaaa6b1489bfc14f305691152

And a bug fix:
(1)
"sched: Enqueue task into core queue only after vruntime is updated"
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=401dad5536e7e05d1299d0864e6fc5072029f492

There are also 2 more bug fixes that I squashed-in related to kernel
protection and a crash seen on the tip/master branch.

Hoping to send the series next week out to the list.

Have a great weekend, and Thanks!

 - Joel


On Mon, Oct 19, 2020 at 09:43:10PM -0400, Joel Fernandes (Google) wrote:
> Eighth iteration of the Core-Scheduling feature.
> 
> Core scheduling is a feature that allows only trusted tasks to run
> concurrently on cpus sharing compute resources (eg: hyperthreads on a
> core). The goal is to mitigate the core-level side-channel attacks
> without requiring to disable SMT (which has a significant impact on
> performance in some situations). Core scheduling (as of v7) mitigates
> user-space to user-space attacks and user to kernel attack when one of
> the siblings enters the kernel via interrupts or system call.
> 
> By default, the feature doesn't change any of the current scheduler
> behavior. The user decides which tasks can run simultaneously on the
> same core (for now by having them in the same tagged cgroup). When a tag
> is enabled in a cgroup and a task from that cgroup is running on a
> hardware thread, the scheduler ensures that only idle or trusted tasks
> run on the other sibling(s). Besides security concerns, this feature can
> also be beneficial for RT and performance applications where we want to
> control how tasks make use of SMT dynamically.
> 
> This iteration focuses on the the following stuff:
> - Redesigned API.
> - Rework of Kernel Protection feature based on Thomas's entry work.
> - Rework of hotplug fixes.
> - Address review comments in v7
> 
> Joel: Both a CGroup and Per-task interface via prctl(2) are provided for
> configuring core sharing. More details are provided in documentation patch.
> Kselftests are provided to verify the correctness/rules of the interface.
> 
> Julien: TPCC tests showed improvements with core-scheduling. With kernel
> protection enabled, it does not show any regression. Possibly ASI will improve
> the performance for those who choose kernel protection (can be toggled through
> sched_core_protect_kernel sysctl). Results:
> v8				average		stdev		diff
> baseline (SMT on)		1197.272	44.78312824	
> core sched (   kernel protect)	412.9895	45.42734343	-65.51%
> core sched (no kernel protect)	686.6515	71.77756931	-42.65%
> nosmt				408.667		39.39042872	-65.87%
> 
> v8 is rebased on tip/master.
> 
> Future work
> ===========
> - Load balancing/Migration fixes for core scheduling.
>   With v6, Load balancing is partially coresched aware, but has some
>   issues w.r.t process/taskgroup weights:
>   https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...
> - Core scheduling test framework: kselftests, torture tests etc
> 
> Changes in v8
> =============
> - New interface/API implementation
>   - Joel
> - Revised kernel protection patch
>   - Joel
> - Revised Hotplug fixes
>   - Joel
> - Minor bug fixes and address review comments
>   - Vineeth
> 
> Changes in v7
> =============
> - Kernel protection from untrusted usermode tasks
>   - Joel, Vineeth
> - Fix for hotplug crashes and hangs
>   - Joel, Vineeth
> 
> Changes in v6
> =============
> - Documentation
>   - Joel
> - Pause siblings on entering nmi/irq/softirq
>   - Joel, Vineeth
> - Fix for RCU crash
>   - Joel
> - Fix for a crash in pick_next_task
>   - Yu Chen, Vineeth
> - Minor re-write of core-wide vruntime comparison
>   - Aaron Lu
> - Cleanup: Address Review comments
> - Cleanup: Remove hotplug support (for now)
> - Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
>   - Joel, Vineeth
> 
> Changes in v5
> =============
> - Fixes for cgroup/process tagging during corner cases like cgroup
>   destroy, task moving across cgroups etc
>   - Tim Chen
> - Coresched aware task migrations
>   - Aubrey Li
> - Other minor stability fixes.
> 
> Changes in v4
> =============
> - Implement a core wide min_vruntime for vruntime comparison of tasks
>   across cpus in a core.
>   - Aaron Lu
> - Fixes a typo bug in setting the forced_idle cpu.
>   - Aaron Lu
> 
> Changes in v3
> =============
> - Fixes the issue of sibling picking up an incompatible task
>   - Aaron Lu
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fixes the issue of starving threads due to forced idle
>   - Peter Zijlstra
> - Fixes the refcounting issue when deleting a cgroup with tag
>   - Julien Desfossez
> - Fixes a crash during cpu offline/online with coresched enabled
>   - Vineeth Pillai
> - Fixes a comparison logic issue in sched_core_find
>   - Aaron Lu
> 
> Changes in v2
> =============
> - Fixes for couple of NULL pointer dereference crashes
>   - Subhra Mazumdar
>   - Tim Chen
> - Improves priority comparison logic for process in different cpus
>   - Peter Zijlstra
>   - Aaron Lu
> - Fixes a hard lockup in rq locking
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fixes a performance issue seen on IO heavy workloads
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fix for 32bit build
>   - Aubrey Li
> 
> Aubrey Li (1):
> sched: migration changes for core scheduling
> 
> Joel Fernandes (Google) (13):
> sched/fair: Snapshot the min_vruntime of CPUs on force idle
> arch/x86: Add a new TIF flag for untrusted tasks
> kernel/entry: Add support for core-wide protection of kernel-mode
> entry/idle: Enter and exit kernel protection during idle entry and
> exit
> sched: Split the cookie and setup per-task cookie on fork
> sched: Add a per-thread core scheduling interface
> sched: Add a second-level tag for nested CGroup usecase
> sched: Release references to the per-task cookie on exit
> sched: Handle task addition to CGroup
> sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG
> kselftest: Add tests for core-sched interface
> sched: Move core-scheduler interfacing code to a new file
> Documentation: Add core scheduling documentation
> 
> Peter Zijlstra (10):
> sched: Wrap rq::lock access
> sched: Introduce sched_class::pick_task()
> sched: Core-wide rq->lock
> sched/fair: Add a few assertions
> sched: Basic tracking of matching tasks
> sched: Add core wide task selection and scheduling.
> sched: Trivial forced-newidle balancer
> irq_work: Cleanup
> sched: cgroup tagging interface for core scheduling
> sched: Debug bits...
> 
> Vineeth Pillai (2):
> sched/fair: Fix forced idle sibling starvation corner case
> entry/kvm: Protect the kernel when entering from guest
> 
> .../admin-guide/hw-vuln/core-scheduling.rst   |  312 +++++
> Documentation/admin-guide/hw-vuln/index.rst   |    1 +
> .../admin-guide/kernel-parameters.txt         |    7 +
> arch/x86/include/asm/thread_info.h            |    2 +
> arch/x86/kvm/x86.c                            |    3 +
> drivers/gpu/drm/i915/i915_request.c           |    4 +-
> include/linux/entry-common.h                  |   20 +-
> include/linux/entry-kvm.h                     |   12 +
> include/linux/irq_work.h                      |   33 +-
> include/linux/irqflags.h                      |    4 +-
> include/linux/sched.h                         |   27 +-
> include/uapi/linux/prctl.h                    |    3 +
> kernel/Kconfig.preempt                        |    6 +
> kernel/bpf/stackmap.c                         |    2 +-
> kernel/entry/common.c                         |   25 +-
> kernel/entry/kvm.c                            |   13 +
> kernel/fork.c                                 |    1 +
> kernel/irq_work.c                             |   18 +-
> kernel/printk/printk.c                        |    6 +-
> kernel/rcu/tree.c                             |    3 +-
> kernel/sched/Makefile                         |    1 +
> kernel/sched/core.c                           | 1135 ++++++++++++++++-
> kernel/sched/coretag.c                        |  468 +++++++
> kernel/sched/cpuacct.c                        |   12 +-
> kernel/sched/deadline.c                       |   34 +-
> kernel/sched/debug.c                          |    8 +-
> kernel/sched/fair.c                           |  272 ++--
> kernel/sched/idle.c                           |   24 +-
> kernel/sched/pelt.h                           |    2 +-
> kernel/sched/rt.c                             |   22 +-
> kernel/sched/sched.h                          |  302 ++++-
> kernel/sched/stop_task.c                      |   13 +-
> kernel/sched/topology.c                       |    4 +-
> kernel/sys.c                                  |    3 +
> kernel/time/tick-sched.c                      |    6 +-
> kernel/trace/bpf_trace.c                      |    2 +-
> tools/include/uapi/linux/prctl.h              |    3 +
> tools/testing/selftests/sched/.gitignore      |    1 +
> tools/testing/selftests/sched/Makefile        |   14 +
> tools/testing/selftests/sched/config          |    1 +
> .../testing/selftests/sched/test_coresched.c  |  840 ++++++++++++
> 41 files changed, 3437 insertions(+), 232 deletions(-)
> create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
> create mode 100644 kernel/sched/coretag.c
> create mode 100644 tools/testing/selftests/sched/.gitignore
> create mode 100644 tools/testing/selftests/sched/Makefile
> create mode 100644 tools/testing/selftests/sched/config
> create mode 100644 tools/testing/selftests/sched/test_coresched.c
> 
> --
> 2.29.0.rc1.297.gfa9743e501-goog
> 

* Re: [PATCH v8 -tip 00/26] Core scheduling
  2020-11-06 17:54     ` Joel Fernandes
@ 2020-11-09  6:04       ` Li, Aubrey
  0 siblings, 0 replies; 98+ messages in thread
From: Li, Aubrey @ 2020-11-09  6:04 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Ning, Hongyu, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Vineeth Pillai, Aaron Lu, Aubrey Li,
	tglx, linux-kernel, mingo, torvalds, fweisbec, keescook, kerrnel,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Tim Chen

On 2020/11/7 1:54, Joel Fernandes wrote:
> On Fri, Nov 06, 2020 at 10:58:58AM +0800, Li, Aubrey wrote:
> 
>>>
>>> 	-- workload D, new added syscall workload, performance drop in cs_on:
>>> 	+----------------------+------+-------------------------------+
>>> 	|                      | **   | will-it-scale  * 192          |
>>> 	|                      |      | (pipe based context_switch)   |
>>> 	+======================+======+===============================+
>>> 	| cgroup               | **   | cg_will-it-scale              |
>>> 	+----------------------+------+-------------------------------+
>>> 	| record_item          | **   | threads_avg                   |
>>> 	+----------------------+------+-------------------------------+
>>> 	| coresched_normalized | **   | 0.2                           |
>>> 	+----------------------+------+-------------------------------+
>>> 	| default_normalized   | **   | 1                             |
>>> 	+----------------------+------+-------------------------------+
>>> 	| smtoff_normalized    | **   | 0.89                          |
>>> 	+----------------------+------+-------------------------------+
>>
>> will-it-scale may be a very extreme case. The story here is,
>> - On one sibling reader/writer gets blocked and tries to schedule another reader/writer in.
>> - The other sibling tries to wake up reader/writer.
>>
>> Both CPUs are acquiring rq->__lock,
>>
>> So when coresched off, they are two different locks, lock stat(1 second delta) below:
>>
>> class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
>> &rq->__lock:          210            210           0.10           3.04         180.87           0.86            797       79165021           0.03          20.69    60650198.34           0.77
>>
>> But when coresched on, they are actually one same lock, lock stat(1 second delta) below:
>>
>> class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
>> &rq->__lock:      6479459        6484857           0.05         216.46    60829776.85           9.38        8346319       15399739           0.03          95.56    81119515.38           5.27
>>
>> This nature of core scheduling may degrade the performance of similar workloads with frequent context switching.
> 
> When core sched is off, is SMT off as well? From the above table, it seems to
> be. So even for core sched off, there will be a single lock per physical CPU
> core (assuming SMT is also off) right? Or did I miss something?
> 

The table includes 3 cases:
- default:	SMT on,  coresched off
- coresched:	SMT on,  coresched on
- smtoff:	SMT off, coresched off

I was comparing the default (coresched off & SMT on) case with the (coresched
on & SMT on) case.

If SMT is off, the reader and writer run on different cores and use different
rq->lock, so the lock contention is not that serious.

class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
&rq->__lock:           60             60           0.11           1.92          41.33           0.69            127       67184172           0.03          22.95    33160428.37           0.49
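
To illustrate why both siblings end up on one lock when coresched is on: a minimal
sketch of what the rq lock accessor effectively resolves to (my simplification of
the "sched: Wrap rq::lock access" idea, not the exact code from the series):

	static inline raw_spinlock_t *rq_lockp(struct rq *rq)
	{
		/*
		 * With core scheduling enabled, all SMT siblings of a core
		 * share the lock of the core rq, so the reader and the writer
		 * above contend on a single lock.
		 */
		if (sched_core_enabled(rq))
			return &rq->core->__lock;

		/* Otherwise each CPU keeps its own rq lock. */
		return &rq->__lock;
	}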

Does this address your concern?

Thanks,
-Aubrey


* Re: [PATCH v8 -tip 17/26] sched: Split the cookie and setup per-task cookie on fork
  2020-11-04 22:30   ` chris hyser
  2020-11-05 14:49     ` Joel Fernandes
@ 2020-11-09 23:30     ` chris hyser
  1 sibling, 0 replies; 98+ messages in thread
From: chris hyser @ 2020-11-09 23:30 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, Aubrey Li,
	Paul E. McKenney, Tim Chen

On 11/4/20 5:30 PM, chris hyser wrote:
> On 10/19/20 9:43 PM, Joel Fernandes (Google) wrote:
>> In order to prevent interference and clearly support both per-task and CGroup
>> APIs, split the cookie into 2 and allow it to be set from either per-task, or
>> CGroup API. The final cookie is the combined value of both and is computed when
>> the stop-machine executes during a change of cookie.
>>
>> Also, for the per-task cookie, it will get weird if we use pointers of any
>> emphemeral objects. For this reason, introduce a refcounted object who's sole
>> purpose is to assign unique cookie value by way of the object's pointer.
>>
>> While at it, refactor the CGroup code a bit. Future patches will introduce more
>> APIs and support.
>>
>> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>>   include/linux/sched.h |   2 +
>>   kernel/sched/core.c   | 241 ++++++++++++++++++++++++++++++++++++++++--
>>   kernel/sched/debug.c  |   4 +
>>   3 files changed, 236 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index fe6f225bfbf9..c6034c00846a 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -688,6 +688,8 @@ struct task_struct {
>>   #ifdef CONFIG_SCHED_CORE
>>       struct rb_node            core_node;
>>       unsigned long            core_cookie;
>> +    unsigned long            core_task_cookie;
>> +    unsigned long            core_group_cookie;
>>       unsigned int            core_occupation;
>>   #endif
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index bab4ea2f5cd8..30a9e4cb5ce1 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -346,11 +346,14 @@ void sched_core_put(void)
>>       mutex_unlock(&sched_core_mutex);
>>   }
>> +static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
>> +
>>   #else /* !CONFIG_SCHED_CORE */
>>   static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
>>   static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
>>   static bool sched_core_enqueued(struct task_struct *task) { return false; }
>> +static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2) { }
>>   #endif /* CONFIG_SCHED_CORE */
>> @@ -3583,6 +3586,20 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>>   #endif
>>   #ifdef CONFIG_SCHED_CORE
>>       RB_CLEAR_NODE(&p->core_node);
>> +
>> +    /*
>> +     * Tag child via per-task cookie only if parent is tagged via per-task
>> +     * cookie. This is independent of, but can be additive to the CGroup tagging.
>> +     */
>> +    if (current->core_task_cookie) {
>> +
>> +        /* If it is not CLONE_THREAD fork, assign a unique per-task tag. */
>> +        if (!(clone_flags & CLONE_THREAD)) {
>> +            return sched_core_share_tasks(p, p);
>> +               }
>> +        /* Otherwise share the parent's per-task tag. */
>> +        return sched_core_share_tasks(p, current);
>> +    }
>>   #endif
>>       return 0;
>>   }

sched_core_share_tasks() looks at the value of the new task's core_task_cookie, which is
non-zero on a process or thread clone, and this causes underflow of both the enable flag
itself and the cookie ref counts.

So just zero it in __sched_fork().
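
For context, the reason the new task's core_task_cookie is already non-zero is that the
child task_struct starts out as a structure copy of the parent, roughly (a simplified
sketch of the default helper, not a proposed change):

	/*
	 * Simplified: the child task_struct is a plain copy of the parent, so
	 * core_task_cookie comes along without its reference count being taken.
	 */
	int __weak arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
	{
		*dst = *src;
		return 0;
	}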

-chrish

------

sched: zero out the core scheduling cookie on clone

As the cookie is reference counted, zero it on clone even though it would otherwise be
inherited, and rely on explicit sharing instead.

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
---
  kernel/sched/core.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fd3cc03..2af0ea6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3378,6 +3378,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
      p->capture_control = NULL;
  #endif
      init_numa_balancing(clone_flags, p);
+#ifdef CONFIG_SCHED_CORE
+    p->core_task_cookie = 0;
+#endif
  #ifdef CONFIG_SMP
      p->wake_entry.u_flags = CSD_TYPE_TTWU;
  #endif

* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-03  1:20     ` Joel Fernandes
  2020-11-06 16:57       ` Alexandre Chartre
@ 2020-11-10  9:35       ` Alexandre Chartre
  2020-11-10 22:42         ` Joel Fernandes
  1 sibling, 1 reply; 98+ messages in thread
From: Alexandre Chartre @ 2020-11-10  9:35 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, James.Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Tim Chen, Paul E . McKenney


On 11/3/20 2:20 AM, Joel Fernandes wrote:
> Hi Alexandre,
> 
> Sorry for late reply as I was working on the snapshotting patch...
> 
> On Fri, Oct 30, 2020 at 11:29:26AM +0100, Alexandre Chartre wrote:
>>
>> On 10/20/20 3:43 AM, Joel Fernandes (Google) wrote:
>>> Core-scheduling prevents hyperthreads in usermode from attacking each
>>> other, but it does not do anything about one of the hyperthreads
>>> entering the kernel for any reason. This leaves the door open for MDS
>>> and L1TF attacks with concurrent execution sequences between
>>> hyperthreads.
>>>
>>> This patch therefore adds support for protecting all syscall and IRQ
>>> kernel mode entries. Care is taken to track the outermost usermode exit
>>> and entry using per-cpu counters. In cases where one of the hyperthreads
>>> enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
>>> when not needed - example: idle and non-cookie HTs do not need to be
>>> forced into kernel mode.
>>
>> Hi Joel,
>>
>> In order to protect syscall/IRQ kernel mode entries, shouldn't we have a
>> call to sched_core_unsafe_enter() in the syscall/IRQ entry code? I don't
>> see such a call. Am I missing something?
> 
> Yes, this is known bug and fixed in v9 which I'll post soon. Meanwhile
> updated patch is appended below:
>   

See comments below about the updated patch.

> ---8<-----------------------
> 
>  From b2835a587a28405ffdf8fc801e798129a014a8c8 Mon Sep 17 00:00:00 2001
> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> Date: Mon, 27 Jul 2020 17:56:14 -0400
> Subject: [PATCH] kernel/entry: Add support for core-wide protection of
>   kernel-mode
> 
> Core-scheduling prevents hyperthreads in usermode from attacking each
> other, but it does not do anything about one of the hyperthreads
> entering the kernel for any reason. This leaves the door open for MDS
> and L1TF attacks with concurrent execution sequences between
> hyperthreads.
> 
> This patch therefore adds support for protecting all syscall and IRQ
> kernel mode entries. Care is taken to track the outermost usermode exit
> and entry using per-cpu counters. In cases where one of the hyperthreads
> enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> when not needed - example: idle and non-cookie HTs do not need to be
> forced into kernel mode.
> 
> More information about attacks:
> For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> data to either host or guest attackers. For L1TF, it is possible to leak
> to guest attackers. There is no possible mitigation involving flushing
> of buffers to avoid this since the execution of attacker and victims
> happen concurrently on 2 or more HTs.
> 
> Cc: Julien Desfossez <jdesfossez@digitalocean.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Aaron Lu <aaron.lwe@gmail.com>
> Cc: Aubrey Li <aubrey.li@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>   .../admin-guide/kernel-parameters.txt         |   9 +
>   include/linux/entry-common.h                  |   6 +-
>   include/linux/sched.h                         |  12 +
>   kernel/entry/common.c                         |  28 ++-
>   kernel/sched/core.c                           | 230 ++++++++++++++++++
>   kernel/sched/sched.h                          |   3 +
>   6 files changed, 285 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 3236427e2215..a338d5d64c3d 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4678,6 +4678,15 @@
>   
>   	sbni=		[NET] Granch SBNI12 leased line adapter
>   
> +	sched_core_protect_kernel=
> +			[SCHED_CORE] Pause SMT siblings of a core running in
> +			user mode, if at least one of the siblings of the core
> +			is running in kernel mode. This is to guarantee that
> +			kernel data is not leaked to tasks which are not trusted
> +			by the kernel. A value of 0 disables protection, 1
> +			enables protection. The default is 1. Note that protection
> +			depends on the arch defining the _TIF_UNSAFE_RET flag.
> +
>   	sched_debug	[KNL] Enables verbose scheduler debug messages.
>   
>   	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 474f29638d2c..62278c5b3b5f 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -33,6 +33,10 @@
>   # define _TIF_PATCH_PENDING		(0)
>   #endif
>   
> +#ifndef _TIF_UNSAFE_RET
> +# define _TIF_UNSAFE_RET		(0)
> +#endif
> +
>   #ifndef _TIF_UPROBE
>   # define _TIF_UPROBE			(0)
>   #endif
> @@ -69,7 +73,7 @@
>   
>   #define EXIT_TO_USER_MODE_WORK						\
>   	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
> -	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
> +	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET |	\
>   	 ARCH_EXIT_TO_USER_MODE_WORK)
>   
>   /**
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d38e904dd603..fe6f225bfbf9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
>   
>   const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
>   
> +#ifdef CONFIG_SCHED_CORE
> +void sched_core_unsafe_enter(void);
> +void sched_core_unsafe_exit(void);
> +bool sched_core_wait_till_safe(unsigned long ti_check);
> +bool sched_core_kernel_protected(void);
> +#else
> +#define sched_core_unsafe_enter(ignore) do { } while (0)
> +#define sched_core_unsafe_exit(ignore) do { } while (0)
> +#define sched_core_wait_till_safe(ignore) do { } while (0)
> +#define sched_core_kernel_protected(ignore) do { } while (0)
> +#endif
> +
>   #endif
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 0a1e20f8d4e8..a18ed60cedea 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -28,6 +28,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
>   
>   	instrumentation_begin();
>   	trace_hardirqs_off_finish();
> +	if (_TIF_UNSAFE_RET) /* Kernel protection depends on arch defining the flag. */
> +		sched_core_unsafe_enter();
>   	instrumentation_end();
>   }
>   
> @@ -137,6 +139,27 @@ static __always_inline void exit_to_user_mode(void)
>   /* Workaround to allow gradual conversion of architecture code */
>   void __weak arch_do_signal(struct pt_regs *regs) { }
>   
> +unsigned long exit_to_user_get_work(void)

Function should be static.


> +{
> +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +
> +	if ((IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
> +	    || !_TIF_UNSAFE_RET)
> +		return ti_work;
> +
> +#ifdef CONFIG_SCHED_CORE
> +	ti_work &= EXIT_TO_USER_MODE_WORK;
> +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
> +		sched_core_unsafe_exit();
> +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
> +			sched_core_unsafe_enter(); /* not exiting to user yet. */
> +		}
> +	}
> +
> +	return READ_ONCE(current_thread_info()->flags);
> +#endif
> +}
> +
>   static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   					    unsigned long ti_work)
>   {
> @@ -175,7 +198,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   		 * enabled above.
>   		 */
>   		local_irq_disable_exit_to_user();
> -		ti_work = READ_ONCE(current_thread_info()->flags);
> +		ti_work = exit_to_user_get_work();
>   	}

What happens if the task is scheduled out in exit_to_user_mode_loop() (e.g. if it has
_TIF_NEED_RESCHED set)? It will have called sched_core_unsafe_enter() and forced siblings
to wait for it. So shouldn't sched_core_unsafe_exit() be called when the task is
scheduled out (because it won't run anymore), and sched_core_unsafe_enter() when
the task is scheduled back in?


>   	/* Return the latest work state for arch_exit_to_user_mode() */
> @@ -184,9 +207,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   
>   static void exit_to_user_mode_prepare(struct pt_regs *regs)
>   {
> -	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +	unsigned long ti_work;
>   
>   	lockdep_assert_irqs_disabled();
> +	ti_work = exit_to_user_get_work();
>   
>   	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>   		ti_work = exit_to_user_mode_loop(regs, ti_work);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e05728bdb18c..bd206708fac2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
>   
>   #ifdef CONFIG_SCHED_CORE
>   
> +DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
> +static int __init set_sched_core_protect_kernel(char *str)
> +{
> +	unsigned long val = 0;
> +
> +	if (!str)
> +		return 0;
> +
> +	if (!kstrtoul(str, 0, &val) && !val)
> +		static_branch_disable(&sched_core_protect_kernel);
> +
> +	return 1;
> +}
> +__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
> +
> +/* Is the kernel protected by core scheduling? */
> +bool sched_core_kernel_protected(void)
> +{
> +	return static_branch_likely(&sched_core_protect_kernel);
> +}
> +
>   DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
>   
>   /* kernel prio, less is more */
> @@ -4596,6 +4617,214 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
>   	return a->core_cookie == b->core_cookie;
>   }
>   
> +/*
> + * Handler to attempt to enter kernel. It does nothing because the exit to
> + * usermode or guest mode will do the actual work (of waiting if needed).
> + */
> +static void sched_core_irq_work(struct irq_work *work)
> +{
> +	return;
> +}
> +
> +static inline void init_sched_core_irq_work(struct rq *rq)
> +{
> +	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
> +}
> +
> +/*
> + * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
> + * exits the core-wide unsafe state. Obviously the CPU calling this function
> + * should not be responsible for the core being in the core-wide unsafe state
> + * otherwise it will deadlock.
> + *
> + * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
> + *            the loop if TIF flags are set and notify caller about it.
> + *
> + * IRQs should be disabled.
> + */
> +bool sched_core_wait_till_safe(unsigned long ti_check)
> +{
> +	bool restart = false;
> +	struct rq *rq;
> +	int cpu;
> +
> +	/* We clear the thread flag only at the end, so need to check for it. */

Do you mean "no need to check for it" ?


> +	ti_check &= ~_TIF_UNSAFE_RET;
> +
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/* Down grade to allow interrupts to prevent stop_machine lockups.. */
> +	preempt_disable();
> +	local_irq_enable();
> +
> +	/*
> +	 * Wait till the core of this HT is not in an unsafe state.
> +	 *
> +	 * Pair with smp_store_release() in sched_core_unsafe_exit().
> +	 */
> +	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
> +		cpu_relax();
> +		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
> +			restart = true;
> +			break;
> +		}
> +	}
> +
> +	/* Upgrade it back to the expectations of entry code. */
> +	local_irq_disable();
> +	preempt_enable();
> +
> +ret:
> +	if (!restart)
> +		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	return restart;
> +}
> +
> +/*
> + * Enter the core-wide IRQ state. Sibling will be paused if it is running
> + * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
> + * avoid sending useless IPIs is made. Must be called only from hard IRQ
> + * context.
> + */
> +void sched_core_unsafe_enter(void)
> +{
> +	const struct cpumask *smt_mask;
> +	unsigned long flags;
> +	struct rq *rq;
> +	int i, cpu;
> +
> +	if (!static_branch_likely(&sched_core_protect_kernel))
> +		return;
> +
> +	/* Ensure that on return to user/guest, we check whether to wait. */
> +	if (current->core_cookie)
> +		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +	if (!sched_core_enabled(rq))
> +		goto ret;

Should we clear TIF_UNSAFE_RET if (!sched_core_enabled(rq))? This would avoid calling
sched_core_wait_till_safe().


> +
> +	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
> +	rq->core_this_unsafe_nest++;
> +
> +	/* Should not nest: enter() should only pair with exit(). */
> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
> +		goto ret;

I would be more precise about the nesting comment: we don't nest, not only because each
enter() is paired with an exit(), but because each enter()/exit() is for a user context.
We can have nested interrupts, but they will be for a kernel context, so they won't enter()/exit().

So I would say something like:

         /*
          * Should not nest: each enter() is paired with an exit(), and enter()/exit()
          * are done when coming from userspace. We can have nested interrupts between
          * enter()/exit() but they will originate from the kernel so they won't enter()
          * nor exit().
          */


> +
> +	raw_spin_lock(rq_lockp(rq));
> +	smt_mask = cpu_smt_mask(cpu);
> +
> +	/* Contribute this CPU's unsafe_enter() to core-wide unsafe_enter() count. */
> +	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);

We are protected by the rq_lockp(rq) spinlock, but we still need to use WRITE_ONCE()
because sched_core_wait_till_safe() checks core_unsafe_nest without taking rq_lockp(rq),
right? Shouldn't we be using smp_store_release() like sched_core_unsafe_exit() does?

In any case, it is worth having a comment why WRITE_ONCE() or smp_store_release() is
used.


> +
> +	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
> +		goto unlock;

It might be better checking if (rq->core->core_unsafe_nest >= cpumask_weight(smt_mask))
because we shouldn't exceed the number of siblings.

alex.


> +
> +	if (irq_work_is_busy(&rq->core_irq_work)) {
> +		/*
> +		 * Do nothing more since we are in an IPI sent from another
> +		 * sibling to enforce safety. That sibling would have sent IPIs
> +		 * to all of the HTs.
> +		 */
> +		goto unlock;
> +	}
> +
> +	/*
> +	 * If we are not the first ones on the core to enter core-wide unsafe
> +	 * state, do nothing.
> +	 */
> +	if (rq->core->core_unsafe_nest > 1)
> +		goto unlock;
> +
> +	/* Do nothing more if the core is not tagged. */
> +	if (!rq->core->core_cookie)
> +		goto unlock;
> +
> +	for_each_cpu(i, smt_mask) {
> +		struct rq *srq = cpu_rq(i);
> +
> +		if (i == cpu || cpu_is_offline(i))
> +			continue;
> +
> +		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
> +			continue;
> +
> +		/* Skip if HT is not running a tagged task. */
> +		if (!srq->curr->core_cookie && !srq->core_pick)
> +			continue;
> +
> +		/*
> +		 * Force sibling into the kernel by IPI. If work was already
> +		 * pending, no new IPIs are sent. This is Ok since the receiver
> +		 * would already be in the kernel, or on its way to it.
> +		 */
> +		irq_work_queue_on(&srq->core_irq_work, i);
> +	}
> +unlock:
> +	raw_spin_unlock(rq_lockp(rq));
> +ret:
> +	local_irq_restore(flags);
> +}
> +
> +/*
> + * Process any work need for either exiting the core-wide unsafe state, or for
> + * waiting on this hyperthread if the core is still in this state.
> + *
> + * @idle: Are we called from the idle loop?
> + */
> +void sched_core_unsafe_exit(void)
> +{
> +	unsigned long flags;
> +	unsigned int nest;
> +	struct rq *rq;
> +	int cpu;
> +
> +	if (!static_branch_likely(&sched_core_protect_kernel))
> +		return;
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +
> +	/* Do nothing if core-sched disabled. */
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/*
> +	 * Can happen when a process is forked and the first return to user
> +	 * mode is a syscall exit. Either way, there's nothing to do.
> +	 */
> +	if (rq->core_this_unsafe_nest == 0)
> +		goto ret;
> +
> +	rq->core_this_unsafe_nest--;
> +
> +	/* enter() should be paired with exit() only. */
> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
> +		goto ret;
> +
> +	raw_spin_lock(rq_lockp(rq));
> +	/*
> +	 * Core-wide nesting counter can never be 0 because we are
> +	 * still in it on this CPU.
> +	 */
> +	nest = rq->core->core_unsafe_nest;
> +	WARN_ON_ONCE(!nest);
> +
> +	/* Pair with smp_load_acquire() in sched_core_wait_till_safe(). */
> +	smp_store_release(&rq->core->core_unsafe_nest, nest - 1);
> +	raw_spin_unlock(rq_lockp(rq));
> +ret:
> +	local_irq_restore(flags);
> +}
> +
>   // XXX fairness/fwd progress conditions
>   /*
>    * Returns
> @@ -5019,6 +5248,7 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
>   			rq = cpu_rq(i);
>   			if (rq->core && rq->core == rq)
>   				core_rq = rq;
> +			init_sched_core_irq_work(rq);
>   		}
>   
>   		if (!core_rq)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f7e2d8a3be8e..4bcf3b1ddfb3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1059,12 +1059,15 @@ struct rq {
>   	unsigned int		core_enabled;
>   	unsigned int		core_sched_seq;
>   	struct rb_root		core_tree;
> +	struct irq_work		core_irq_work; /* To force HT into kernel */
> +	unsigned int		core_this_unsafe_nest;
>   
>   	/* shared state */
>   	unsigned int		core_task_seq;
>   	unsigned int		core_pick_seq;
>   	unsigned long		core_cookie;
>   	unsigned char		core_forceidle;
> +	unsigned int		core_unsafe_nest;
>   #endif
>   };
>   
> 

* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-10  9:35       ` Alexandre Chartre
@ 2020-11-10 22:42         ` Joel Fernandes
  2020-11-16 10:08           ` Alexandre Chartre
  0 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes @ 2020-11-10 22:42 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, James.Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Tim Chen, Paul E . McKenney

On Tue, Nov 10, 2020 at 10:35:17AM +0100, Alexandre Chartre wrote:
[..] 
> > ---8<-----------------------
> > 
> >  From b2835a587a28405ffdf8fc801e798129a014a8c8 Mon Sep 17 00:00:00 2001
> > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > Date: Mon, 27 Jul 2020 17:56:14 -0400
> > Subject: [PATCH] kernel/entry: Add support for core-wide protection of
> >   kernel-mode
[..]
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d38e904dd603..fe6f225bfbf9 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
> >   const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
> > +#ifdef CONFIG_SCHED_CORE
> > +void sched_core_unsafe_enter(void);
> > +void sched_core_unsafe_exit(void);
> > +bool sched_core_wait_till_safe(unsigned long ti_check);
> > +bool sched_core_kernel_protected(void);
> > +#else
> > +#define sched_core_unsafe_enter(ignore) do { } while (0)
> > +#define sched_core_unsafe_exit(ignore) do { } while (0)
> > +#define sched_core_wait_till_safe(ignore) do { } while (0)
> > +#define sched_core_kernel_protected(ignore) do { } while (0)
> > +#endif
> > +
> >   #endif
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index 0a1e20f8d4e8..a18ed60cedea 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -28,6 +28,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
> >   	instrumentation_begin();
> >   	trace_hardirqs_off_finish();
> > +	if (_TIF_UNSAFE_RET) /* Kernel protection depends on arch defining the flag. */
> > +		sched_core_unsafe_enter();
> >   	instrumentation_end();
> >   }
> > @@ -137,6 +139,27 @@ static __always_inline void exit_to_user_mode(void)
> >   /* Workaround to allow gradual conversion of architecture code */
> >   void __weak arch_do_signal(struct pt_regs *regs) { }
> > +unsigned long exit_to_user_get_work(void)
> 
> Function should be static.

Fixed.

> > +{
> > +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> > +
> > +	if ((IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
> > +	    || !_TIF_UNSAFE_RET)
> > +		return ti_work;
> > +
> > +#ifdef CONFIG_SCHED_CORE
> > +	ti_work &= EXIT_TO_USER_MODE_WORK;
> > +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
> > +		sched_core_unsafe_exit();
> > +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
> > +			sched_core_unsafe_enter(); /* not exiting to user yet. */
> > +		}
> > +	}
> > +
> > +	return READ_ONCE(current_thread_info()->flags);
> > +#endif
> > +}
> > +
> >   static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> >   					    unsigned long ti_work)
> >   {
> > @@ -175,7 +198,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> >   		 * enabled above.
> >   		 */
> >   		local_irq_disable_exit_to_user();
> > -		ti_work = READ_ONCE(current_thread_info()->flags);
> > +		ti_work = exit_to_user_get_work();
> >   	}
> 
> What happen if the task is scheduled out in exit_to_user_mode_loop? (e.g. if it has
> _TIF_NEED_RESCHED set). It will have call sched_core_unsafe_enter() and force siblings
> to wait for it. So shouldn't sched_core_unsafe_exit() be called when the task is
> scheduled out? (because it won't run anymore) And sched_core_unsafe_enter() when
> the task is scheduled back in?

No. When the task is scheduled out, it is still in kernel mode, and so is the task being
scheduled in. That task (being scheduled in) would have already done a
sched_core_unsafe_enter(). When that task returns to user mode, it will do a
sched_core_unsafe_exit(). When all tasks go to sleep, the last task that
enters the idle loop will do a sched_core_unsafe_exit(). Just to note: the
"unsafe kernel context" is per-CPU and not per-task. Does that answer your
question?
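
To spell out the per-CPU accounting, here is a toy userspace model of one CPU's
core_this_unsafe_nest counter across a context switch (illustrative only, not kernel
code; the function names are made up):

	#include <assert.h>

	static unsigned int core_this_unsafe_nest;	/* models one CPU's counter */

	static void user_to_kernel(void)  { core_this_unsafe_nest++; }	/* unsafe_enter() */
	static void kernel_to_user(void)  { core_this_unsafe_nest--; }	/* unsafe_exit()  */
	static void context_switch(void)  { /* kernel to kernel: no change */ }

	int main(void)
	{
		user_to_kernel();	/* task A enters via syscall, nest = 1 */
		context_switch();	/* A blocks, B is scheduled in, nest stays 1 */
		kernel_to_user();	/* B returns to user mode, nest = 0 */
		assert(core_this_unsafe_nest == 0);
		return 0;
	}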

> > +static inline void init_sched_core_irq_work(struct rq *rq)
> > +{
> > +	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
> > +}
> > +
> > +/*
> > + * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
> > + * exits the core-wide unsafe state. Obviously the CPU calling this function
> > + * should not be responsible for the core being in the core-wide unsafe state
> > + * otherwise it will deadlock.
> > + *
> > + * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
> > + *            the loop if TIF flags are set and notify caller about it.
> > + *
> > + * IRQs should be disabled.
> > + */
> > +bool sched_core_wait_till_safe(unsigned long ti_check)
> > +{
> > +	bool restart = false;
> > +	struct rq *rq;
> > +	int cpu;
> > +
> > +	/* We clear the thread flag only at the end, so need to check for it. */
> 
> Do you mean "no need to check for it" ?

Fixed.

> > +/*
> > + * Enter the core-wide IRQ state. Sibling will be paused if it is running
> > + * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
> > + * avoid sending useless IPIs is made. Must be called only from hard IRQ
> > + * context.
> > + */
> > +void sched_core_unsafe_enter(void)
> > +{
> > +	const struct cpumask *smt_mask;
> > +	unsigned long flags;
> > +	struct rq *rq;
> > +	int i, cpu;
> > +
> > +	if (!static_branch_likely(&sched_core_protect_kernel))
> > +		return;
> > +
> > +	/* Ensure that on return to user/guest, we check whether to wait. */
> > +	if (current->core_cookie)
> > +		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
> > +
> > +	local_irq_save(flags);
> > +	cpu = smp_processor_id();
> > +	rq = cpu_rq(cpu);
> > +	if (!sched_core_enabled(rq))
> > +		goto ret;
> 
> Should we clear TIF_UNSAFE_RET if (!sched_core_enabled(rq))? This would avoid calling
> sched_core_wait_till_safe().

Ok, or what I'll do is move the set_tsk_thread_flag to after the check for
sched_core_enabled().

> > +
> > +	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
> > +	rq->core_this_unsafe_nest++;
> > +
> > +	/* Should not nest: enter() should only pair with exit(). */
> > +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
> > +		goto ret;
> 
> I would be more precise about the nesting comment: we don't nest not only because each
> enter() is paired with an exit() but because each enter()/exit() is for a user context.
> We can have nested interrupts but they will be for a kernel context so they won't enter/exit.
> 
> So I would say something like:
> 
>         /*
>          * Should not nest: each enter() is paired with an exit(), and enter()/exit()
>          * are done when coming from userspace. We can have nested interrupts between
>          * enter()/exit() but they will originate from the kernel so they won't enter()
>          * nor exit().
>          */

Changed it to the following, hope it's ok with you:
        /*
         * Should not nest: enter() should only pair with exit(). Both are done
         * during the first entry into kernel and the last exit from kernel.
         * Nested kernel entries (such as nested interrupts) will only trigger
         * enter() and exit() on the outer most kernel entry and exit.
         */

> > +
> > +	raw_spin_lock(rq_lockp(rq));
> > +	smt_mask = cpu_smt_mask(cpu);
> > +
> > +	/* Contribute this CPU's unsafe_enter() to core-wide unsafe_enter() count. */
> > +	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
> 
> We are protected by the rq_lockp(rq) spinlock, but we still need to use WRITE_ONCE()
> because sched_core_wait_till_safe() checks core_unsafe_next without taking rq_lockp(rq),
> right?

Yes.

> Shouldn't we be using smp_store_release() like sched_core_unsafe_exit() does?
> 
> In any case, it is worth having a comment why WRITE_ONCE() or smp_store_release() is
> used.

The smp_store_release() in exit() ensures that all reads and writes done by this CPU
prior to the write to the nesting counter are seen by the spinning CPU doing the
smp_load_acquire(), before that spinning CPU returns. I did put a comment there.

But, I think I don't need smp_store_release() at all here. The spin_unlock
that follows already has the required release semantics. I will demote it to
a WRITE_ONCE() in enter() as well, and add appropriate comments.
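
For clarity, the pairing being relied on looks roughly like this (fragments only,
simplified from the patch below; "new_nest" stands for the incremented or decremented
value):

	/* Writer side (unsafe_enter()/unsafe_exit()), with rq_lockp(rq) held: */
	raw_spin_lock(rq_lockp(rq));
	WRITE_ONCE(rq->core->core_unsafe_nest, new_nest);
	raw_spin_unlock(rq_lockp(rq));	/* the unlock provides the RELEASE ordering */

	/* Reader side (sched_core_wait_till_safe()), lockless: */
	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0)	/* ACQUIRE */
		cpu_relax();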

> > +
> > +	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
> > +		goto unlock;
> 
> It might be better checking if (rq->core->core_unsafe_nest >= cpumask_weight(smt_mask))
> because we shouldn't exceed the number of siblings.

I am a bit concerned about the time complexity of cpumask_weight(); it may be
better not to add that overhead. I am not fully sure how it works, but there is a
loop in bitmap_weight() that goes through the bits of the bitmap. What is your
opinion on that?
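
For reference, roughly what I believe cpumask_weight() boils down to (a simplified
sketch from my reading of the bitmap helpers, not the exact code):

	/*
	 * One hweight_long() (popcount) per unsigned long word of the cpumask,
	 * so the cost scales with nr_cpumask_bits / BITS_PER_LONG rather than
	 * with the number of set bits.
	 */
	static inline unsigned int cpumask_weight_sketch(const struct cpumask *srcp)
	{
		const unsigned long *bits = cpumask_bits(srcp);
		unsigned int k, w = 0;

		for (k = 0; k < BITS_TO_LONGS(nr_cpumask_bits); k++)
			w += hweight_long(bits[k]);

		return w;
	}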

Can I add your Reviewed-by tag to the updated patch below? Thanks for the review!

 - Joel

---8<---

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bd1a5b87a5e2..a36f08d74e09 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4678,6 +4678,15 @@
 
 	sbni=		[NET] Granch SBNI12 leased line adapter
 
+	sched_core_protect_kernel=
+			[SCHED_CORE] Pause SMT siblings of a core running in
+			user mode, if at least one of the siblings of the core
+			is running in kernel mode. This is to guarantee that
+			kernel data is not leaked to tasks which are not trusted
+			by the kernel. A value of 0 disables protection, 1
+			enables protection. The default is 1. Note that protection
+			depends on the arch defining the _TIF_UNSAFE_RET flag.
+
 	sched_debug	[KNL] Enables verbose scheduler debug messages.
 
 	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 474f29638d2c..62278c5b3b5f 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -33,6 +33,10 @@
 # define _TIF_PATCH_PENDING		(0)
 #endif
 
+#ifndef _TIF_UNSAFE_RET
+# define _TIF_UNSAFE_RET		(0)
+#endif
+
 #ifndef _TIF_UPROBE
 # define _TIF_UPROBE			(0)
 #endif
@@ -69,7 +73,7 @@
 
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
-	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
+	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET |	\
 	 ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d38e904dd603..fe6f225bfbf9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+void sched_core_unsafe_enter(void);
+void sched_core_unsafe_exit(void);
+bool sched_core_wait_till_safe(unsigned long ti_check);
+bool sched_core_kernel_protected(void);
+#else
+#define sched_core_unsafe_enter(ignore) do { } while (0)
+#define sched_core_unsafe_exit(ignore) do { } while (0)
+#define sched_core_wait_till_safe(ignore) do { } while (0)
+#define sched_core_kernel_protected(ignore) do { } while (0)
+#endif
+
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 2b8366693d5c..d5d88e735d55 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -28,6 +28,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
 
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
+	if (_TIF_UNSAFE_RET) /* Kernel protection depends on arch defining the flag. */
+		sched_core_unsafe_enter();
 	instrumentation_end();
 }
 
@@ -137,6 +139,27 @@ static __always_inline void exit_to_user_mode(void)
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal(struct pt_regs *regs) { }
 
+static unsigned long exit_to_user_get_work(void)
+{
+	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+
+	if ((IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
+	    || !_TIF_UNSAFE_RET)
+		return ti_work;
+
+#ifdef CONFIG_SCHED_CORE
+	ti_work &= EXIT_TO_USER_MODE_WORK;
+	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
+		sched_core_unsafe_exit();
+		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
+			sched_core_unsafe_enter(); /* not exiting to user yet. */
+		}
+	}
+
+	return READ_ONCE(current_thread_info()->flags);
+#endif
+}
+
 static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 					    unsigned long ti_work)
 {
@@ -174,7 +197,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		 * enabled above.
 		 */
 		local_irq_disable_exit_to_user();
-		ti_work = READ_ONCE(current_thread_info()->flags);
+		ti_work = exit_to_user_get_work();
 	}
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
@@ -183,9 +206,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 static void exit_to_user_mode_prepare(struct pt_regs *regs)
 {
-	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+	unsigned long ti_work;
 
 	lockdep_assert_irqs_disabled();
+	ti_work = exit_to_user_get_work();
 
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fa68941998e3..429f9b8ca38e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
 
 #ifdef CONFIG_SCHED_CORE
 
+DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
+static int __init set_sched_core_protect_kernel(char *str)
+{
+	unsigned long val = 0;
+
+	if (!str)
+		return 0;
+
+	if (!kstrtoul(str, 0, &val) && !val)
+		static_branch_disable(&sched_core_protect_kernel);
+
+	return 1;
+}
+__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
+
+/* Is the kernel protected by core scheduling? */
+bool sched_core_kernel_protected(void)
+{
+	return static_branch_likely(&sched_core_protect_kernel);
+}
+
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
 /* kernel prio, less is more */
@@ -4596,6 +4617,226 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
+/*
+ * Handler to attempt to enter kernel. It does nothing because the exit to
+ * usermode or guest mode will do the actual work (of waiting if needed).
+ */
+static void sched_core_irq_work(struct irq_work *work)
+{
+	return;
+}
+
+static inline void init_sched_core_irq_work(struct rq *rq)
+{
+	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
+}
+
+/*
+ * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
+ * exits the core-wide unsafe state. Obviously the CPU calling this function
+ * should not be responsible for the core being in the core-wide unsafe state
+ * otherwise it will deadlock.
+ *
+ * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
+ *            the loop if TIF flags are set and notify caller about it.
+ *
+ * IRQs should be disabled.
+ */
+bool sched_core_wait_till_safe(unsigned long ti_check)
+{
+	bool restart = false;
+	struct rq *rq;
+	int cpu;
+
+	/* We clear the thread flag only at the end, so no need to check for it. */
+	ti_check &= ~_TIF_UNSAFE_RET;
+
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/* Down grade to allow interrupts to prevent stop_machine lockups.. */
+	preempt_disable();
+	local_irq_enable();
+
+	/*
+	 * Wait till the core of this HT is not in an unsafe state.
+	 *
+	 * Pair with smp_store_release() in sched_core_unsafe_exit().
+	 */
+	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
+		cpu_relax();
+		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
+			restart = true;
+			break;
+		}
+	}
+
+	/* Upgrade it back to the expectations of entry code. */
+	local_irq_disable();
+	preempt_enable();
+
+ret:
+	if (!restart)
+		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+	return restart;
+}
+
+/*
+ * Enter the core-wide IRQ state. Sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
+ * avoid sending useless IPIs is made. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_unsafe_enter(void)
+{
+	const struct cpumask *smt_mask;
+	unsigned long flags;
+	struct rq *rq;
+	int i, cpu;
+
+	if (!static_branch_likely(&sched_core_protect_kernel))
+		return;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/* Ensure that on return to user/guest, we check whether to wait. */
+	if (current->core_cookie)
+		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
+	rq->core_this_unsafe_nest++;
+
+	/*
+	 * Should not nest: enter() should only pair with exit(). Both are done
+	 * during the first entry into kernel and the last exit from kernel.
+	 * Nested kernel entries (such as nested interrupts) will only trigger
+	 * enter() and exit() on the outer most kernel entry and exit.
+	 */
+	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
+		goto ret;
+
+	raw_spin_lock(rq_lockp(rq));
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * Contribute this CPU's unsafe_enter() to the core-wide unsafe_enter()
+	 * count.  The raw_spin_unlock() release semantics pairs with the nest
+	 * counter's smp_load_acquire() in sched_core_wait_till_safe().
+	 */
+	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
+
+	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
+		goto unlock;
+
+	if (irq_work_is_busy(&rq->core_irq_work)) {
+		/*
+		 * Do nothing more since we are in an IPI sent from another
+		 * sibling to enforce safety. That sibling would have sent IPIs
+		 * to all of the HTs.
+		 */
+		goto unlock;
+	}
+
+	/*
+	 * If we are not the first ones on the core to enter core-wide unsafe
+	 * state, do nothing.
+	 */
+	if (rq->core->core_unsafe_nest > 1)
+		goto unlock;
+
+	/* Do nothing more if the core is not tagged. */
+	if (!rq->core->core_cookie)
+		goto unlock;
+
+	for_each_cpu(i, smt_mask) {
+		struct rq *srq = cpu_rq(i);
+
+		if (i == cpu || cpu_is_offline(i))
+			continue;
+
+		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
+			continue;
+
+		/* Skip if HT is not running a tagged task. */
+		if (!srq->curr->core_cookie && !srq->core_pick)
+			continue;
+
+		/*
+		 * Force sibling into the kernel by IPI. If work was already
+		 * pending, no new IPIs are sent. This is Ok since the receiver
+		 * would already be in the kernel, or on its way to it.
+		 */
+		irq_work_queue_on(&srq->core_irq_work, i);
+	}
+unlock:
+	raw_spin_unlock(rq_lockp(rq));
+ret:
+	local_irq_restore(flags);
+}
+
+/*
+ * Process any work needed for either exiting the core-wide unsafe state, or for
+ * waiting on this hyperthread if the core is still in this state.
+ */
+void sched_core_unsafe_exit(void)
+{
+	unsigned long flags;
+	unsigned int nest;
+	struct rq *rq;
+	int cpu;
+
+	if (!static_branch_likely(&sched_core_protect_kernel))
+		return;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+
+	/* Do nothing if core-sched disabled. */
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/*
+	 * Can happen when a process is forked and the first return to user
+	 * mode is a syscall exit. Either way, there's nothing to do.
+	 */
+	if (rq->core_this_unsafe_nest == 0)
+		goto ret;
+
+	rq->core_this_unsafe_nest--;
+
+	/* enter() should be paired with exit() only. */
+	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
+		goto ret;
+
+	raw_spin_lock(rq_lockp(rq));
+	/*
+	 * Core-wide nesting counter can never be 0 because we are
+	 * still in it on this CPU.
+	 */
+	nest = rq->core->core_unsafe_nest;
+	WARN_ON_ONCE(!nest);
+
+	WRITE_ONCE(rq->core->core_unsafe_nest, nest - 1);
+	/*
+	 * The raw_spin_unlock release semantics pairs with the nest counter's
+	 * smp_load_acquire() in sched_core_wait_till_safe().
+	 */
+	raw_spin_unlock(rq_lockp(rq));
+ret:
+	local_irq_restore(flags);
+}
+
 // XXX fairness/fwd progress conditions
 /*
  * Returns
@@ -4991,6 +5232,7 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
 			rq = cpu_rq(i);
 			if (rq->core && rq->core == rq)
 				core_rq = rq;
+			init_sched_core_irq_work(rq);
 		}
 
 		if (!core_rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 001382bc67f9..20937a5b6272 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1061,6 +1061,8 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	struct irq_work		core_irq_work; /* To force HT into kernel */
+	unsigned int		core_this_unsafe_nest;
 
 	/* shared state */
 	unsigned int		core_task_seq;
@@ -1068,6 +1070,7 @@ struct rq {
 	unsigned long		core_cookie;
 	unsigned char		core_forceidle;
 	unsigned int		core_forceidle_seq;
+	unsigned int		core_unsafe_nest;
 #endif
 };
 

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 25/26] Documentation: Add core scheduling documentation
  2020-10-20  3:36   ` Randy Dunlap
@ 2020-11-12 16:11     ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-11-12 16:11 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

On Mon, Oct 19, 2020 at 08:36:25PM -0700, Randy Dunlap wrote:
> Hi Joel,
> 
> On 10/19/20 6:43 PM, Joel Fernandes (Google) wrote:
> > Document the usecases, design and interfaces for core scheduling.
> > 
> > Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
> > Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

All fixed as below, updated and thanks!

---8<-----------------------

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH] Documentation: Add core scheduling documentation

Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 313 ++++++++++++++++++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 314 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index 000000000000..c7399809c74d
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,313 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks doesn't trust another), or for performance usecases (some
+workloads may benefit from running on the same core because they do not
+contend for the same hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that only tasks that trust each other share a core. This increase
+in core sharing can improve performance; however, performance is not guaranteed
+to always improve, though improvements are seen with a number of real world
+workloads. In theory, core scheduling aims to perform at least as well as when
+Hyper Threading is disabled. In practice, this is mostly the case though not
+always: synchronizing scheduling decisions across 2 or more CPUs in a core
+involves additional overhead, especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.tag``
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+This allows all of the CGroup's tasks to run concurrently on a core's
+hyperthreads (also called siblings).
+
+A value of ``0`` in this file means the tag state of the CGroup is inherited
+from its parent hierarchy. If any ancestor of the CGroup is tagged, then the
+group is tagged.
+
+.. note:: Once a CGroup is tagged via cpu.tag, it is not possible to set this
+          for any descendant of the tagged group. For finer grained control, the
+          ``cpu.tag_color`` file described next may be used.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+          a core with kernel threads and untagged system threads. For this reason,
+          if a group has ``cpu.tag`` of 0, it is considered to be trusted.
+
+* ``cpu.tag_color``
+For finer grained control over core sharing, a color can also be set in
+addition to the tag. This allows further control of core sharing between child
+CGroups within an already tagged CGroup. The color and the tag are both used to
+generate a `cookie` which is used by the scheduler to identify the group.
+
+Up to 256 different colors can be set (0-255) by writing into this file.
+
+A sample real-world usage of this file follows:
+
+Google uses DAC controls to make ``cpu.tag`` writable only by root, while
+``cpu.tag_color`` can be changed by anyone.
+
+The hierarchy looks like this:
+::
+  Root group
+     / \
+    A   B    (These are created by the root daemon - borglet).
+   / \   \
+  C   D   E  (These are created by AppEngine within the container).
+
+A and B are containers for 2 different jobs or apps that are created by a root
+daemon called borglet. borglet then tags each of these groups with the ``cpu.tag``
+file. The job itself can create additional child CGroups which are colored by
+the container's AppEngine with the ``cpu.tag_color`` file.
+
+The reason why Google uses this 2-level tagging system is that AppEngine wants to
+allow a subset of child CGroups within a tagged parent CGroup to be co-scheduled on a
+core while not being co-scheduled with other child CGroups. Think of these
+child CGroups as belonging to the same customer or project.  Because these
+child CGroups are created by AppEngine, they are not tracked by borglet (the
+root daemon), therefore borglet won't have a chance to set a color for them.
+That's where the ``cpu.tag_color`` file comes in. A color can be set by AppEngine,
+and once set, the normal tasks within the subcgroup would not be able to
+overwrite it. This is enforced by promoting the permission of the
+``cpu.tag_color`` file in cgroupfs.
+
+The color is an 8-bit value allowing for up to 256 unique colors.
+
+.. note:: Once a CGroup is colored, none of its descendants can be re-colored. Also
+          coloring of a CGroup is possible only if either the group or one of its
+          ancestors was tagged via the ``cpu.tag`` file.
+
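+As a minimal sketch (the mount point ``/sys/fs/cgroup/cpu``, the group names
+``A`` and ``A/C``, and the color value are illustrative, matching the hierarchy
+above), the root daemon and the application could tag and color the groups like
+this::
+
+  #include <fcntl.h>
+  #include <string.h>
+  #include <unistd.h>
+
+  /* Write a small string value into a cgroup control file. */
+  static int cgroup_write(const char *path, const char *val)
+  {
+          int fd = open(path, O_WRONLY);
+          ssize_t ret;
+
+          if (fd < 0)
+                  return -1;
+          ret = write(fd, val, strlen(val));
+          close(fd);
+          return ret < 0 ? -1 : 0;
+  }
+
+  int main(void)
+  {
+          /* borglet (root) tags container A: A's tasks may share a core. */
+          cgroup_write("/sys/fs/cgroup/cpu/A/cpu.tag", "1");
+          /* AppEngine colors child C inside A for finer-grained sharing. */
+          cgroup_write("/sys/fs/cgroup/cpu/A/C/cpu.tag_color", "5");
+          return 0;
+  }
+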
+prctl interface
+###############
+A ``prctl(2)`` command ``PR_SCHED_CORE_SHARE`` is available to a process to request
+sharing a core with another process.  For example, consider 2 processes ``P1``
+and ``P2`` with PIDs 100 and 200. If process ``P1`` calls
+``prctl(PR_SCHED_CORE_SHARE, 200)``, the kernel makes ``P1`` share a core with ``P2``.
+The kernel performs ptrace access mode checks before granting the request.
+
+.. note:: This operation is not commutative. P1 calling
+          ``prctl(PR_SCHED_CORE_SHARE, pidof(P2))`` is not the same as P2 calling the
+          same for P1. The former case is P1 joining P2's group of processes
+          (which P2 would have joined with ``prctl(2)`` prior to P1's ``prctl(2)``).
+
+.. note:: The core-sharing granted with prctl(2) will be subject to
+          core-sharing restrictions specified by the CGroup interface. For example
+          if P1 and P2 are a part of 2 different tagged CGroups, then they will
+          not share a core even if a prctl(2) call is made. This is analogous
+          to how affinities are set using the cpuset interface.
+
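+As a minimal sketch, ``P1`` (PID 100) could issue the request to share a core
+with ``P2`` (PID 200) like this. The PID, the error handling, and the numeric
+fallback for ``PR_SCHED_CORE_SHARE`` below are illustrative; the real constant
+comes from this series' uapi ``<linux/prctl.h>``::
+
+  #include <stdio.h>
+  #include <sys/prctl.h>
+  #include <sys/types.h>
+
+  #ifndef PR_SCHED_CORE_SHARE
+  # define PR_SCHED_CORE_SHARE 59   /* placeholder if the uapi header is older */
+  #endif
+
+  int main(void)
+  {
+          pid_t p2 = 200;   /* PID of the process to share a core with. */
+
+          /* Join P2's core-sharing group; subject to ptrace access checks. */
+          if (prctl(PR_SCHED_CORE_SHARE, p2, 0, 0, 0))
+                  perror("prctl(PR_SCHED_CORE_SHARE)");
+          return 0;
+  }
+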
+It is important to note that, on a ``CLONE_THREAD`` ``clone(2)`` syscall, the child
+will be assigned the same tag as its parent and thus be allowed to share a core
+with it. This design choice is because, for the security usecase, a
+``CLONE_THREAD`` child can access its parent's address space anyway, so there's
+no point in not allowing them to share a core. If a different behavior is
+desired, the child thread can call ``prctl(2)`` as needed.  This behavior is
+specific to the ``prctl(2)`` interface. For the CGroup interface, the child of a
+fork always shares a core with its parent.  On the other hand, if a parent
+was previously tagged via ``prctl(2)`` and does a regular ``fork(2)`` syscall, the
+child will receive a unique tag.
+
+Design/Implementation
+---------------------
+Each task that is tagged is assigned a cookie internally in the kernel. As
+mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
+each other and share a core.
+
+The basic idea is that every schedule event tries to select tasks for all the
+siblings of a core such that all the selected tasks running on a core are
+trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
+The idle task is considered special, as it trusts everything and everything
+trusts it.
+
+During a schedule() event on any sibling of a core, the highest priority task on
+the sibling's core is picked and assigned to the sibling calling schedule(), if
+the sibling has the task enqueued. For the rest of the siblings in the core,
+the highest priority task with the same cookie is selected if there is one
+runnable in their individual run queues. If a task with the same cookie is not
+available, the idle task is selected. The idle task is globally trusted.
+
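+Conceptually, the trust check between two runnable tasks reduces to a cookie
+comparison, as in the simplified sketch below (the in-kernel helper in this
+series is ``cookie_match()``; ``trusted_together()`` is an illustrative name)::
+
+  /* Two tasks may run concurrently on a core iff they carry the same
+   * cookie; the idle task trusts, and is trusted by, everything. */
+  static bool trusted_together(struct task_struct *a, struct task_struct *b)
+  {
+          if (is_task_rq_idle(a) || is_task_rq_idle(b))
+                  return true;
+
+          return a->core_cookie == b->core_cookie;
+  }
+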
+Once a task has been selected for all the siblings in the core, an IPI is sent to
+siblings for whom a new task was selected. Siblings on receiving the IPI will
+switch to the new task immediately. If an idle task is selected for a sibling,
+then the sibling is considered to be in a `forced idle` state, i.e., it may
+have tasks on its own runqueue to run, however it will still have to run idle.
+More on this in the next section.
+
+Forced-idling of tasks
+----------------------
+The scheduler tries its best to find tasks that trust each other such that all
+tasks selected to be scheduled are of the highest priority in a core.  However,
+it is possible that some runqueues had tasks that were incompatible with the
+highest priority ones in the core. Favoring security over fairness, one or more
+siblings could be forced to select a lower priority task if the highest
+priority task is not trusted with respect to the core wide highest priority
+task.  If a sibling does not have a trusted task to run, it will be forced idle
+by the scheduler (idle thread is scheduled to run).
+
+When the highest priority task is selected to run, a reschedule-IPI is sent to
+the sibling to force it into idle. This results in 4 cases which need to be
+considered depending on whether a VM or a regular usermode process was running
+on either HT::
+
+          HT1 (attack)            HT2 (victim)
+   A      idle -> user space      user space -> idle
+   B      idle -> user space      guest -> idle
+   C      idle -> guest           user space -> idle
+   D      idle -> guest           guest -> idle
+
+Note that for better performance, we do not wait for the destination CPU
+(victim) to enter idle mode. This is because the sending of the IPI would bring
+the destination CPU immediately into kernel mode from user space, or cause a
+VMEXIT in the case of guests. At most, this would leak only some scheduler metadata
+which may not be worth protecting. It is also possible that the IPI is received
+too late on some architectures, but this has not been observed in the case of
+x86.
+
+Kernel protection from untrusted tasks
+--------------------------------------
+The scheduler on its own cannot protect the kernel executing concurrently with
+an untrusted task in a core. This is because the scheduler is unaware of
+interrupts/syscalls at scheduling time. To mitigate this, an IPI is sent to
+siblings on kernel entry (syscall and IRQ). This IPI forces the sibling to enter
+kernel mode and wait before returning to user until all siblings of the
+core have left kernel mode. This process is also known as stunning.  For good
+performance, an IPI is sent to a sibling only if it is running a tagged
+task. If a sibling is running a kernel thread or is idle, no IPI is sent.
+
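+Schematically, the final exit to user mode implements this wait as in the
+simplified sketch below (the real logic lives in the entry code and the core
+scheduler of this series; ``core_still_unsafe()`` is an illustrative stand-in
+for checking the core-wide nesting count)::
+
+  /* On the last kernel exit of this HT back to user/guest mode. */
+  static void kernel_protection_exit(void)
+  {
+          sched_core_unsafe_exit();       /* this HT is no longer in the kernel */
+          while (core_still_unsafe())     /* a sibling is still in kernel mode  */
+                  cpu_relax();            /* spin until the whole core is safe  */
+  }
+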
+The kernel protection feature can be turned off on the kernel command line by
+passing ``sched_core_protect_kernel=0``.
+
+Other alternative ideas discussed for kernel protection are listed below just
+for completeness. They all have limitations:
+
+1. Changing interrupt affinities to a trusted core which does not execute untrusted tasks
+#########################################################################################
+By changing the interrupt affinities to a designated safe-CPU which runs
+only trusted tasks, IRQ data can be protected. One issue is that this involves
+giving up a full CPU core of the system to run safe tasks. Another is that
+per-cpu interrupts such as the local timer interrupt cannot have their
+affinity changed. Also, sensitive timer callbacks such as the random entropy timer
+can run in softirq on return from these interrupts and expose sensitive
+data. In the future, that could be mitigated by forcing softirqs into threaded
+mode by utilizing a mechanism similar to ``CONFIG_PREEMPT_RT``.
+
+Yet another issue is that, for multiqueue devices with managed
+interrupts, the IRQ affinities cannot be changed. However, it may be
+possible to force a reduced number of queues, which would in turn allow
+shielding one or two CPUs from such interrupts and queue handling at the
+price of indirection.
+
+2. Running IRQs as threaded-IRQs
+################################
+This would result in forcing IRQs into the scheduler which would then provide
+the process-context mitigation. However, not all interrupts can be threaded.
+Also this does nothing about syscall entries.
+
+3. Kernel Address Space Isolation
+#################################
+System calls could run in a much more restricted address space which is
+guaranteed not to leak any sensitive data. There are practical limitations in
+implementing this; the main concern is how to decide on an address space
+that is guaranteed to not have any sensitive data.
+
+4. Limited cookie-based protection
+##################################
+On a system call, change the cookie to the system trusted cookie and initiate a
+schedule event. This would be better than pausing all the siblings during the
+entire duration of the system call, but would still be a huge hit to
+performance.
+
+Trust model
+-----------
+Core scheduling maintains trust relationships amongst groups of tasks by
+assigning them the same cookie value.
+When a system with core scheduling boots, all tasks are considered to trust
+each other. This is because the core scheduler does not have information about
+trust relationships until userspace uses the above mentioned interfaces to
+communicate them. In other words, all tasks have a default cookie value of 0
+and are considered system-wide trusted. The stunning of siblings running
+cookie-0 tasks is also avoided.
+
+Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
+within such groups are considered to trust each other, but do not trust those
+outside. Tasks outside the group also don't trust tasks within.
+
+Limitations
+-----------
+Core scheduling tries to guarantee that only trusted tasks run concurrently on a
+core. But there could be a small window of time during which untrusted tasks
+run concurrently, or the kernel could be running concurrently with a task not
+trusted by the kernel.
+
+1. IPI processing delays
+########################
+Core scheduling selects only trusted tasks to run together. An IPI is used to
+notify the siblings to switch to the new task. But there could be hardware
+delays in receiving the IPI on some architectures (on x86, this has not been
+observed). This may cause an attacker task to start running on a CPU before its
+siblings receive the IPI. Even though the cache is flushed on entry to user
+mode, victim tasks on siblings may populate data in the cache and
+microarchitectural buffers after the attacker starts to run, and this opens up
+the possibility of a data leak.
+
+Open cross-HT issues that core scheduling does not solve
+--------------------------------------------------------
+1. For MDS
+##########
+Core scheduling cannot protect against MDS attacks between an HT running in
+user mode and another running in kernel mode. Even though both HTs run tasks
+which trust each other, kernel memory is still considered untrusted. Such
+attacks are possible for any combination of sibling CPU modes (host or guest mode).
+
+2. For L1TF
+###########
+Core scheduling cannot protect against an L1TF guest attacker exploiting a
+guest or host victim. This is because the guest attacker can craft invalid
+PTEs which are not inverted due to a vulnerable guest kernel. The only
+solution is to disable EPT (Extended Page Tables).
+
+For both MDS and L1TF, if the guest vCPUs are configured to not trust each
+other (by tagging them separately), then the guest-to-guest attacks would go
+away. Alternatively, it could be a system admin policy which considers
+guest-to-guest attacks as a guest problem.
+
+Another approach to resolve these would be to make every untrusted task on the
+system not trust every other untrusted task. While this could reduce
+parallelism of the untrusted tasks, it would still solve the above issues while
+allowing system processes (trusted tasks) to share a core.
+
+Use cases
+---------
+The main use case for core scheduling is mitigating the cross-HT vulnerabilities
+with SMT enabled. There are other scenarios where this feature can be used:
+
+- Isolating tasks that need a whole core: Examples include realtime tasks, tasks
+  that use SIMD instructions, etc.
+- Gang scheduling: Requirements for a group of tasks that need to be scheduled
+  together could also be realized using core scheduling. One example is vCPUs of
+  a VM.
+
+Future work
+-----------
+Skipping per-HT mitigations if task is trusted
+##############################################
+If core scheduling is enabled, by default all tasks trust each other as
+mentioned above. In such a scenario, it may be desirable to skip the same-HT
+mitigations on return to trusted user mode to improve performance.
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index 21710f8609fe..361ccbbd9e54 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -16,3 +16,4 @@ are configurable at compile, boot or run time.
    multihit.rst
    special-register-buffer-data-sampling.rst
    l1d_flush.rst
+   core-scheduling.rst
-- 
2.29.2.222.g5d2a92d10f8-goog


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [RFT for v9] (Was Re: [PATCH v8 -tip 00/26] Core scheduling)
  2020-11-06 20:55 ` [RFT for v9] (Was Re: [PATCH v8 -tip 00/26] Core scheduling) Joel Fernandes
@ 2020-11-13  9:22   ` Ning, Hongyu
  2020-11-13 10:01     ` Ning, Hongyu
  0 siblings, 1 reply; 98+ messages in thread
From: Ning, Hongyu @ 2020-11-13  9:22 UTC (permalink / raw)
  To: Joel Fernandes, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Vineeth Pillai, Aaron Lu, Aubrey Li,
	tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen

On 2020/11/7 4:55, Joel Fernandes wrote:
> All,
> 
> I am getting ready to send the next v9 series based on tip/master
> branch. Could you please give the below tree a try and report any results in
> your testing?
> git tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch coresched)
> git log:
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched
> 
> The major changes in this series are the improvements:
> (1)
> "sched: Make snapshotting of min_vruntime more CGroup-friendly"
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=9a20a6652b3c50fd51faa829f7947004239a04eb
> 
> (2)
> "sched: Simplify the core pick loop for optimized case"
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=0370117b4fd418cdaaa6b1489bfc14f305691152
> 
> And a bug fix:
> (1)
> "sched: Enqueue task into core queue only after vruntime is updated"
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=401dad5536e7e05d1299d0864e6fc5072029f492
> 
> There are also 2 more bug fixes that I squashed-in related to kernel
> protection and a crash seen on the tip/master branch.
> 
> Hoping to send the series next week out to the list.
> 
> Have a great weekend, and Thanks!
> 
>  - Joel
> 
> 
> On Mon, Oct 19, 2020 at 09:43:10PM -0400, Joel Fernandes (Google) wrote:

Adding 4 workloads test results for core scheduling v9 candidate: 

- kernel under test: 
	-- coresched community v9 candidate from https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch coresched)
	-- latest commit: 2e8591a330ff (HEAD -> coresched, origin/coresched) NEW: sched: Add a coresched command line option
	-- coresched=on kernel parameter applied
- workloads: 
	-- A. sysbench cpu (192 threads) + sysbench cpu (192 threads)
	-- B. sysbench cpu (192 threads) + sysbench mysql (192 threads, mysqld forced into the same cgroup)
	-- C. uperf netperf.xml (192 threads over TCP or UDP protocol separately)
	-- D. will-it-scale context_switch via pipe (192 threads)
- test machine setup: 
	CPU(s):              192
	On-line CPU(s) list: 0-191
	Thread(s) per core:  2
	Core(s) per socket:  48
	Socket(s):           2
	NUMA node(s):        4
- test results, no obvious performance drop compared to community v8 build:
	-- workload A:
	+----------------------+------+----------------------+------------------------+
	|                      | **   | sysbench cpu * 192   | sysbench cpu * 192     |
	+======================+======+======================+========================+
	| cgroup               | **   | cg_sysbench_cpu_0    | cg_sysbench_cpu_1      |
	+----------------------+------+----------------------+------------------------+
	| record_item          | **   | Tput_avg (events/s)  | Tput_avg (events/s)    |
	+----------------------+------+----------------------+------------------------+
	| coresched_normalized | **   | 0.98                 | 1.01                   |
	+----------------------+------+----------------------+------------------------+
	| default_normalized   | **   | 1                    | 1                      |
	+----------------------+------+----------------------+------------------------+
	| smtoff_normalized    | **   | 0.59                 | 0.6                    |
	+----------------------+------+----------------------+------------------------+

	-- workload B:
	+----------------------+------+----------------------+------------------------+
	|                      | **   | sysbench cpu * 192   | sysbench mysql * 192   |
	+======================+======+======================+========================+
	| cgroup               | **   | cg_sysbench_cpu_0    | cg_sysbench_mysql_0    |
	+----------------------+------+----------------------+------------------------+
	| record_item          | **   | Tput_avg (events/s)  | Tput_avg (events/s)    |
	+----------------------+------+----------------------+------------------------+
	| coresched_normalized | **   | 1.02                 | 0.78                   |
	+----------------------+------+----------------------+------------------------+
	| default_normalized   | **   | 1                    | 1                      |
	+----------------------+------+----------------------+------------------------+
	| smtoff_normalized    | **   | 0.59                 | 0.75                   |
	+----------------------+------+----------------------+------------------------+

	-- workload C:
	+----------------------+------+---------------------------+---------------------------+
	|                      | **   | uperf netperf TCP * 192   | uperf netperf UDP * 192   |
	+======================+======+===========================+===========================+
	| cgroup               | **   | cg_uperf                  | cg_uperf                  |
	+----------------------+------+---------------------------+---------------------------+
	| record_item          | **   | Tput_avg (Gb/s)           | Tput_avg (Gb/s)           |
	+----------------------+------+---------------------------+---------------------------+
	| coresched_normalized | **   | 0.65                      | 0.67                      |
	+----------------------+------+---------------------------+---------------------------+
	| default_normalized   | **   | 1                         | 1                         |
	+----------------------+------+---------------------------+---------------------------+
	| smtoff_normalized    | **   | 0.83                      | 0.91                      |
	+----------------------+------+---------------------------+---------------------------+

	-- workload D:
	+----------------------+------+-------------------------------+
	|                      | **   | will-it-scale  * 192          |
	|                      |      | (pipe based context_switch)   |
	+======================+======+===============================+
	| cgroup               | **   | cg_will-it-scale              |
	+----------------------+------+-------------------------------+
	| record_item          | **   | threads_avg                   |
	+----------------------+------+-------------------------------+
	| coresched_normalized | **   | 0.29                          |
	+----------------------+------+-------------------------------+
	| default_normalized   | **   | 1.00                          |
	+----------------------+------+-------------------------------+
	| smtoff_normalized    | **   | 0.87                          |
	+----------------------+------+-------------------------------+

- notes on test results record_item:
	* coresched_normalized: smton, cs enabled, test result normalized by default value
	* default_normalized: smton, cs disabled, test result normalized by default value
	* smtoff_normalized: smtoff, test result normalized by default value
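	  (i.e., normalized = result_<config> / result_default; e.g., workload A's
	  coresched_normalized of 0.98 means 2% below the SMT-on, cs-disabled run)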


Hongyu

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT for v9] (Was Re: [PATCH v8 -tip 00/26] Core scheduling)
  2020-11-13  9:22   ` Ning, Hongyu
@ 2020-11-13 10:01     ` Ning, Hongyu
  0 siblings, 0 replies; 98+ messages in thread
From: Ning, Hongyu @ 2020-11-13 10:01 UTC (permalink / raw)
  To: Joel Fernandes, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Vineeth Pillai, Aaron Lu, Aubrey Li,
	tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Aubrey Li, Paul E. McKenney, Tim Chen


On 2020/11/13 17:22, Ning, Hongyu wrote:
> On 2020/11/7 4:55, Joel Fernandes wrote:
>> All,
>>
>> I am getting ready to send the next v9 series based on tip/master
>> branch. Could you please give the below tree a try and report any results in
>> your testing?
>> git tree:
>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch coresched)
>> git log:
>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched
>>
>> The major changes in this series are the improvements:
>> (1)
>> "sched: Make snapshotting of min_vruntime more CGroup-friendly"
>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=9a20a6652b3c50fd51faa829f7947004239a04eb
>>
>> (2)
>> "sched: Simplify the core pick loop for optimized case"
>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=0370117b4fd418cdaaa6b1489bfc14f305691152
>>
>> And a bug fix:
>> (1)
>> "sched: Enqueue task into core queue only after vruntime is updated"
>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=401dad5536e7e05d1299d0864e6fc5072029f492
>>
>> There are also 2 more bug fixes that I squashed-in related to kernel
>> protection and a crash seen on the tip/master branch.
>>
>> Hoping to send the series next week out to the list.
>>
>> Have a great weekend, and Thanks!
>>
>>  - Joel
>>
>>
>> On Mon, Oct 19, 2020 at 09:43:10PM -0400, Joel Fernandes (Google) wrote:
> 
> Adding 4 workloads test results for core scheduling v9 candidate: 
> 
> - kernel under test: 
> 	-- coresched community v9 candidate from https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch coresched)
> 	-- latest commit: 2e8591a330ff (HEAD -> coresched, origin/coresched) NEW: sched: Add a coresched command line option
> 	-- coresched=on kernel parameter applied
> - workloads: 
> 	-- A. sysbench cpu (192 threads) + sysbench cpu (192 threads)
> 	-- B. sysbench cpu (192 threads) + sysbench mysql (192 threads, mysqld forced into the same cgroup)
> 	-- C. uperf netperf.xml (192 threads over TCP or UDP protocol separately)
> 	-- D. will-it-scale context_switch via pipe (192 threads)
> - test machine setup: 
> 	CPU(s):              192
> 	On-line CPU(s) list: 0-191
> 	Thread(s) per core:  2
> 	Core(s) per socket:  48
> 	Socket(s):           2
> 	NUMA node(s):        4
> - test results, no obvious performance drop compared to community v8 build:
> 	-- workload A:
> 	+----------------------+------+----------------------+------------------------+
> 	|                      | **   | sysbench cpu * 192   | sysbench cpu * 192     |
> 	+======================+======+======================+========================+
> 	| cgroup               | **   | cg_sysbench_cpu_0    | cg_sysbench_cpu_1      |
> 	+----------------------+------+----------------------+------------------------+
> 	| record_item          | **   | Tput_avg (events/s)  | Tput_avg (events/s)    |
> 	+----------------------+------+----------------------+------------------------+
> 	| coresched_normalized | **   | 0.98                 | 1.01                   |
> 	+----------------------+------+----------------------+------------------------+
> 	| default_normalized   | **   | 1                    | 1                      |
> 	+----------------------+------+----------------------+------------------------+
> 	| smtoff_normalized    | **   | 0.59                 | 0.6                    |
> 	+----------------------+------+----------------------+------------------------+
> 
> 	-- workload B:
> 	+----------------------+------+----------------------+------------------------+
> 	|                      | **   | sysbench cpu * 192   | sysbench mysql * 192   |
> 	+======================+======+======================+========================+
> 	| cgroup               | **   | cg_sysbench_cpu_0    | cg_sysbench_mysql_0    |
> 	+----------------------+------+----------------------+------------------------+
> 	| record_item          | **   | Tput_avg (events/s)  | Tput_avg (events/s)    |
> 	+----------------------+------+----------------------+------------------------+
> 	| coresched_normalized | **   | 1.02                 | 0.78                   |
> 	+----------------------+------+----------------------+------------------------+
> 	| default_normalized   | **   | 1                    | 1                      |
> 	+----------------------+------+----------------------+------------------------+
> 	| smtoff_normalized    | **   | 0.59                 | 0.75                   |
> 	+----------------------+------+----------------------+------------------------+
> 
> 	-- workload C:
> 	+----------------------+------+---------------------------+---------------------------+
> 	|                      | **   | uperf netperf TCP * 192   | uperf netperf UDP * 192   |
> 	+======================+======+===========================+===========================+
> 	| cgroup               | **   | cg_uperf                  | cg_uperf                  |
> 	+----------------------+------+---------------------------+---------------------------+
> 	| record_item          | **   | Tput_avg (Gb/s)           | Tput_avg (Gb/s)           |
> 	+----------------------+------+---------------------------+---------------------------+
> 	| coresched_normalized | **   | 0.65                      | 0.67                      |
> 	+----------------------+------+---------------------------+---------------------------+
> 	| default_normalized   | **   | 1                         | 1                         |
> 	+----------------------+------+---------------------------+---------------------------+
> 	| smtoff_normalized    | **   | 0.83                      | 0.91                      |
> 	+----------------------+------+---------------------------+---------------------------+
> 
> 	-- workload D:
> 	+----------------------+------+-------------------------------+
> 	|                      | **   | will-it-scale  * 192          |
> 	|                      |      | (pipe based context_switch)   |
> 	+======================+======+===============================+
> 	| cgroup               | **   | cg_will-it-scale              |
> 	+----------------------+------+-------------------------------+
> 	| record_item          | **   | threads_avg                   |
> 	+----------------------+------+-------------------------------+
> 	| coresched_normalized | **   | 0.29                          |
> 	+----------------------+------+-------------------------------+
> 	| default_normalized   | **   | 1.00                          |
> 	+----------------------+------+-------------------------------+
> 	| smtoff_normalized    | **   | 0.87                          |
> 	+----------------------+------+-------------------------------+
> 
> - notes on test results record_item:
> 	* coresched_normalized: smton, cs enabled, test result normalized by default value
> 	* default_normalized: smton, cs disabled, test result normalized by default value
> 	* smtoff_normalized: smtoff, test result normalized by default value
> 
> 
> Hongyu
> 

Add 2 more negative test cases:

- continuously toggle cpu.core_tag while the workload is running with cs_on
- continuously toggle the smt setting via /sys/devices/system/cpu/smt/control while the workload is running with cs_on

No kernel panic or platform hang was observed.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-10 22:42         ` Joel Fernandes
@ 2020-11-16 10:08           ` Alexandre Chartre
  2020-11-16 14:50             ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandre Chartre @ 2020-11-16 10:08 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, James.Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Tim Chen, Paul E . McKenney


On 11/10/20 11:42 PM, Joel Fernandes wrote:
> On Tue, Nov 10, 2020 at 10:35:17AM +0100, Alexandre Chartre wrote:
> [..]
>>> ---8<-----------------------
>>>
>>>   From b2835a587a28405ffdf8fc801e798129a014a8c8 Mon Sep 17 00:00:00 2001
>>> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
>>> Date: Mon, 27 Jul 2020 17:56:14 -0400
>>> Subject: [PATCH] kernel/entry: Add support for core-wide protection of
>>>    kernel-mode
> [..]
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index d38e904dd603..fe6f225bfbf9 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
>>>    const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
>>> +#ifdef CONFIG_SCHED_CORE
>>> +void sched_core_unsafe_enter(void);
>>> +void sched_core_unsafe_exit(void);
>>> +bool sched_core_wait_till_safe(unsigned long ti_check);
>>> +bool sched_core_kernel_protected(void);
>>> +#else
>>> +#define sched_core_unsafe_enter(ignore) do { } while (0)
>>> +#define sched_core_unsafe_exit(ignore) do { } while (0)
>>> +#define sched_core_wait_till_safe(ignore) do { } while (0)
>>> +#define sched_core_kernel_protected(ignore) do { } while (0)
>>> +#endif
>>> +
>>>    #endif
>>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>>> index 0a1e20f8d4e8..a18ed60cedea 100644
>>> --- a/kernel/entry/common.c
>>> +++ b/kernel/entry/common.c
>>> @@ -28,6 +28,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
>>>    	instrumentation_begin();
>>>    	trace_hardirqs_off_finish();
>>> +	if (_TIF_UNSAFE_RET) /* Kernel protection depends on arch defining the flag. */
>>> +		sched_core_unsafe_enter();
>>>    	instrumentation_end();
>>>    }
>>> @@ -137,6 +139,27 @@ static __always_inline void exit_to_user_mode(void)
>>>    /* Workaround to allow gradual conversion of architecture code */
>>>    void __weak arch_do_signal(struct pt_regs *regs) { }
>>> +unsigned long exit_to_user_get_work(void)
>>
>> Function should be static.
> 
> Fixed.
> 
>>> +{
>>> +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
>>> +
>>> +	if ((IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
>>> +	    || !_TIF_UNSAFE_RET)
>>> +		return ti_work;
>>> +
>>> +#ifdef CONFIG_SCHED_CORE
>>> +	ti_work &= EXIT_TO_USER_MODE_WORK;
>>> +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
>>> +		sched_core_unsafe_exit();
>>> +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
>>> +			sched_core_unsafe_enter(); /* not exiting to user yet. */
>>> +		}
>>> +	}
>>> +
>>> +	return READ_ONCE(current_thread_info()->flags);
>>> +#endif
>>> +}
>>> +
>>>    static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>>>    					    unsigned long ti_work)
>>>    {
>>> @@ -175,7 +198,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>>>    		 * enabled above.
>>>    		 */
>>>    		local_irq_disable_exit_to_user();
>>> -		ti_work = READ_ONCE(current_thread_info()->flags);
>>> +		ti_work = exit_to_user_get_work();
>>>    	}
>>
>> What happen if the task is scheduled out in exit_to_user_mode_loop? (e.g. if it has
>> _TIF_NEED_RESCHED set). It will have call sched_core_unsafe_enter() and force siblings
>> to wait for it. So shouldn't sched_core_unsafe_exit() be called when the task is
>> scheduled out? (because it won't run anymore) And sched_core_unsafe_enter() when
>> the task is scheduled back in?
> 
> No, when the task is scheduled out, it will be in kernel mode on the task being
> scheduled in. That task (being scheduled-in) would have already done a
> sched_core_unsafe_enter(). When that task returns to user mode, it will do a
> sched_core_unsafe_exit(). When all tasks go to sleep, the last task that
> enters the idle loop will do a sched_core_unsafe_exit(). Just to note: the
> "unsafe kernel context" is per-CPU and not per-task. Does that answer your
> question?

Ok, I think I get it: it works because when a task is scheduled out then the
scheduler will schedule in a new tagged task (because we have core scheduling).
So that new task should be accounted for core-wide protection the same way as
the previous one.


>>> +static inline void init_sched_core_irq_work(struct rq *rq)
>>> +{
>>> +	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
>>> +}
>>> +
>>> +/*
>>> + * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
>>> + * exits the core-wide unsafe state. Obviously the CPU calling this function
>>> + * should not be responsible for the core being in the core-wide unsafe state
>>> + * otherwise it will deadlock.
>>> + *
>>> + * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
>>> + *            the loop if TIF flags are set and notify caller about it.
>>> + *
>>> + * IRQs should be disabled.
>>> + */
>>> +bool sched_core_wait_till_safe(unsigned long ti_check)
>>> +{
>>> +	bool restart = false;
>>> +	struct rq *rq;
>>> +	int cpu;
>>> +
>>> +	/* We clear the thread flag only at the end, so need to check for it. */
>>
>> Do you mean "no need to check for it" ?
> 
> Fixed.
> 
>>> +/*
>>> + * Enter the core-wide IRQ state. Sibling will be paused if it is running
>>> + * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
>>> + * avoid sending useless IPIs is made. Must be called only from hard IRQ
>>> + * context.
>>> + */
>>> +void sched_core_unsafe_enter(void)
>>> +{
>>> +	const struct cpumask *smt_mask;
>>> +	unsigned long flags;
>>> +	struct rq *rq;
>>> +	int i, cpu;
>>> +
>>> +	if (!static_branch_likely(&sched_core_protect_kernel))
>>> +		return;
>>> +
>>> +	/* Ensure that on return to user/guest, we check whether to wait. */
>>> +	if (current->core_cookie)
>>> +		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
>>> +
>>> +	local_irq_save(flags);
>>> +	cpu = smp_processor_id();
>>> +	rq = cpu_rq(cpu);
>>> +	if (!sched_core_enabled(rq))
>>> +		goto ret;
>>
>> Should we clear TIF_UNSAFE_RET if (!sched_core_enabled(rq))? This would avoid calling
>> sched_core_wait_till_safe().
> 
> Ok, or what I'll do is move the set_tsk_thread_flag to after the check for
> sched_core_enabled().
> 
>>> +
>>> +	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
>>> +	rq->core_this_unsafe_nest++;
>>> +
>>> +	/* Should not nest: enter() should only pair with exit(). */
>>> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
>>> +		goto ret;
>>
>> I would be more precise about the nesting comment: we don't nest not only because each
>> enter() is paired with an exit() but because each enter()/exit() is for a user context.
>> We can have nested interrupts but they will be for a kernel context so they won't enter/exit.
>>
>> So I would say something like:
>>
>>          /*
>>           * Should not nest: each enter() is paired with an exit(), and enter()/exit()
>>           * are done when coming from userspace. We can have nested interrupts between
>>           * enter()/exit() but they will originate from the kernel so they won't enter()
>>           * nor exit().
>>           */
> 
> Changed it to following, hope its ok with you:
>          /*
>           * Should not nest: enter() should only pair with exit(). Both are done
>           * during the first entry into kernel and the last exit from kernel.
>           * Nested kernel entries (such as nested interrupts) will only trigger
>           * enter() and exit() on the outer most kernel entry and exit.
>           */
> 
>>> +
>>> +	raw_spin_lock(rq_lockp(rq));
>>> +	smt_mask = cpu_smt_mask(cpu);
>>> +
>>> +	/* Contribute this CPU's unsafe_enter() to core-wide unsafe_enter() count. */
>>> +	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
>>
>> We are protected by the rq_lockp(rq) spinlock, but we still need to use WRITE_ONCE()
>> because sched_core_wait_till_safe() checks core_unsafe_next without taking rq_lockp(rq),
>> right?
> 
> Yes.
> 
>> Shouldn't we be using smp_store_release() like sched_core_unsafe_exit() does?
>>
>> In any case, it is worth having a comment why WRITE_ONCE() or smp_store_release() is
>> used.
> 
> The smp_store_release() in exit() ensures that the write to the nesting
> counter happens *after* all prior reads and write accesses done by this CPU
> are seen by the spinning CPU doing the smp_load_acquire() before that
> spinning CPU returns. I did put a comment there.
> 
> But, I think I don't need smp_store_release() at all here. The spin_unlock
> that follows already has the required release semantics. I will demote it to
> a WRITE_ONCE() in enter() as well, and add appropriate comments.
> 

I think a WRITE_ONCE() is not even useful here. The WRITE_ONCE() will only prevent
some possible compiler optimization in the function wrt rq->core->core_unsafe_nest, but
rq->core->core_unsafe_nest is just updated here, and concurrent changes are protected
by the rq_lockp(rq) spinlock, and the memory barrier is ensured by raw_spin_unlock().

So I think you can just do:  rq->core->core_unsafe_nest++;

And in sched_core_unsafe_exit(), you can just do:  rq->core->core_unsafe_nest = nest - 1

Also comment in sched_core_wait_till_safe() wrt smp_load_acquire() should be updated,
it should say:

	/*
	 * Wait till the core of this HT is not in an unsafe state.
	 *
	 * Pair with raw_spin_unlock(rq_lockp(rq) in sched_core_unsafe_enter/exit()
	 */
	

>>> +
>>> +	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
>>> +		goto unlock;
>>
>> It might be better checking if (rq->core->core_unsafe_nest >= cpumask_weight(smt_mask))
>> because we shouldn't exceed the number of siblings.
> 
> I am a bit concerned with the time complexity of cpumask_weight(). It may be
> better not to add overhead. I am not fully sure how it works but there is a
> loop in bitmask weight that goes through the bits of the bitmap, what is your
> opinion on that?

Yes, it's looping through the bitmap so probably not worth adding this overhead here.

> 
> Can I add your Reviewed-by tag to below updated patch? Thanks for review!

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>


alex.

> 
> ---8<---
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index bd1a5b87a5e2..a36f08d74e09 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4678,6 +4678,15 @@
>   
>   	sbni=		[NET] Granch SBNI12 leased line adapter
>   
> +	sched_core_protect_kernel=
> +			[SCHED_CORE] Pause SMT siblings of a core running in
> +			user mode, if at least one of the siblings of the core
> +			is running in kernel mode. This is to guarantee that
> +			kernel data is not leaked to tasks which are not trusted
> +			by the kernel. A value of 0 disables protection, 1
> +			enables protection. The default is 1. Note that protection
> +			depends on the arch defining the _TIF_UNSAFE_RET flag.
> +
>   	sched_debug	[KNL] Enables verbose scheduler debug messages.
>   
>   	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 474f29638d2c..62278c5b3b5f 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -33,6 +33,10 @@
>   # define _TIF_PATCH_PENDING		(0)
>   #endif
>   
> +#ifndef _TIF_UNSAFE_RET
> +# define _TIF_UNSAFE_RET		(0)
> +#endif
> +
>   #ifndef _TIF_UPROBE
>   # define _TIF_UPROBE			(0)
>   #endif
> @@ -69,7 +73,7 @@
>   
>   #define EXIT_TO_USER_MODE_WORK						\
>   	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
> -	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
> +	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET |	\
>   	 ARCH_EXIT_TO_USER_MODE_WORK)
>   
>   /**
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d38e904dd603..fe6f225bfbf9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
>   
>   const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
>   
> +#ifdef CONFIG_SCHED_CORE
> +void sched_core_unsafe_enter(void);
> +void sched_core_unsafe_exit(void);
> +bool sched_core_wait_till_safe(unsigned long ti_check);
> +bool sched_core_kernel_protected(void);
> +#else
> +#define sched_core_unsafe_enter(ignore) do { } while (0)
> +#define sched_core_unsafe_exit(ignore) do { } while (0)
> +#define sched_core_wait_till_safe(ignore) do { } while (0)
> +#define sched_core_kernel_protected(ignore) do { } while (0)
> +#endif
> +
>   #endif
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 2b8366693d5c..d5d88e735d55 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -28,6 +28,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
>   
>   	instrumentation_begin();
>   	trace_hardirqs_off_finish();
> +	if (_TIF_UNSAFE_RET) /* Kernel protection depends on arch defining the flag. */
> +		sched_core_unsafe_enter();
>   	instrumentation_end();
>   }
>   
> @@ -137,6 +139,27 @@ static __always_inline void exit_to_user_mode(void)
>   /* Workaround to allow gradual conversion of architecture code */
>   void __weak arch_do_signal(struct pt_regs *regs) { }
>   
> +static unsigned long exit_to_user_get_work(void)
> +{
> +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +
> +	if ((IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
> +	    || !_TIF_UNSAFE_RET)
> +		return ti_work;
> +
> +#ifdef CONFIG_SCHED_CORE
> +	ti_work &= EXIT_TO_USER_MODE_WORK;
> +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
> +		sched_core_unsafe_exit();
> +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
> +			sched_core_unsafe_enter(); /* not exiting to user yet. */
> +		}
> +	}
> +
> +	return READ_ONCE(current_thread_info()->flags);
> +#endif
> +}
> +
>   static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   					    unsigned long ti_work)
>   {
> @@ -174,7 +197,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   		 * enabled above.
>   		 */
>   		local_irq_disable_exit_to_user();
> -		ti_work = READ_ONCE(current_thread_info()->flags);
> +		ti_work = exit_to_user_get_work();
>   	}
>   
>   	/* Return the latest work state for arch_exit_to_user_mode() */
> @@ -183,9 +206,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   
>   static void exit_to_user_mode_prepare(struct pt_regs *regs)
>   {
> -	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +	unsigned long ti_work;
>   
>   	lockdep_assert_irqs_disabled();
> +	ti_work = exit_to_user_get_work();
>   
>   	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>   		ti_work = exit_to_user_mode_loop(regs, ti_work);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fa68941998e3..429f9b8ca38e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
>   
>   #ifdef CONFIG_SCHED_CORE
>   
> +DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
> +static int __init set_sched_core_protect_kernel(char *str)
> +{
> +	unsigned long val = 0;
> +
> +	if (!str)
> +		return 0;
> +
> +	if (!kstrtoul(str, 0, &val) && !val)
> +		static_branch_disable(&sched_core_protect_kernel);
> +
> +	return 1;
> +}
> +__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
> +
> +/* Is the kernel protected by core scheduling? */
> +bool sched_core_kernel_protected(void)
> +{
> +	return static_branch_likely(&sched_core_protect_kernel);
> +}
> +
>   DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
>   
>   /* kernel prio, less is more */
> @@ -4596,6 +4617,226 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
>   	return a->core_cookie == b->core_cookie;
>   }
>   
> +/*
> + * Handler to attempt to enter kernel. It does nothing because the exit to
> + * usermode or guest mode will do the actual work (of waiting if needed).
> + */
> +static void sched_core_irq_work(struct irq_work *work)
> +{
> +	return;
> +}
> +
> +static inline void init_sched_core_irq_work(struct rq *rq)
> +{
> +	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
> +}
> +
> +/*
> + * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
> + * exits the core-wide unsafe state. Obviously the CPU calling this function
> + * should not be responsible for the core being in the core-wide unsafe state
> + * otherwise it will deadlock.
> + *
> + * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
> + *            the loop if TIF flags are set and notify caller about it.
> + *
> + * IRQs should be disabled.
> + */
> +bool sched_core_wait_till_safe(unsigned long ti_check)
> +{
> +	bool restart = false;
> +	struct rq *rq;
> +	int cpu;
> +
> +	/* We clear the thread flag only at the end, so no need to check for it. */
> +	ti_check &= ~_TIF_UNSAFE_RET;
> +
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/* Down grade to allow interrupts to prevent stop_machine lockups.. */
> +	preempt_disable();
> +	local_irq_enable();
> +
> +	/*
> +	 * Wait till the core of this HT is not in an unsafe state.
> +	 *
> +	 * Pair with smp_store_release() in sched_core_unsafe_exit().
> +	 */
> +	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
> +		cpu_relax();
> +		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
> +			restart = true;
> +			break;
> +		}
> +	}
> +
> +	/* Upgrade it back to the expectations of entry code. */
> +	local_irq_disable();
> +	preempt_enable();
> +
> +ret:
> +	if (!restart)
> +		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	return restart;
> +}
> +
> +/*
> + * Enter the core-wide IRQ state. Sibling will be paused if it is running
> + * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
> + * avoid sending useless IPIs is made. Must be called only from hard IRQ
> + * context.
> + */
> +void sched_core_unsafe_enter(void)
> +{
> +	const struct cpumask *smt_mask;
> +	unsigned long flags;
> +	struct rq *rq;
> +	int i, cpu;
> +
> +	if (!static_branch_likely(&sched_core_protect_kernel))
> +		return;
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/* Ensure that on return to user/guest, we check whether to wait. */
> +	if (current->core_cookie)
> +		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
> +	rq->core_this_unsafe_nest++;
> +
> +	/*
> +	 * Should not nest: enter() should only pair with exit(). Both are done
> +	 * during the first entry into kernel and the last exit from kernel.
> +	 * Nested kernel entries (such as nested interrupts) will only trigger
> +	 * enter() and exit() on the outer most kernel entry and exit.
> +	 */
> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
> +		goto ret;
> +
> +	raw_spin_lock(rq_lockp(rq));
> +	smt_mask = cpu_smt_mask(cpu);
> +
> +	/*
> +	 * Contribute this CPU's unsafe_enter() to the core-wide unsafe_enter()
> +	 * count.  The raw_spin_unlock() release semantics pairs with the nest
> +	 * counter's smp_load_acquire() in sched_core_wait_till_safe().
> +	 */
> +	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
> +
> +	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
> +		goto unlock;
> +
> +	if (irq_work_is_busy(&rq->core_irq_work)) {
> +		/*
> +		 * Do nothing more since we are in an IPI sent from another
> +		 * sibling to enforce safety. That sibling would have sent IPIs
> +		 * to all of the HTs.
> +		 */
> +		goto unlock;
> +	}
> +
> +	/*
> +	 * If we are not the first ones on the core to enter core-wide unsafe
> +	 * state, do nothing.
> +	 */
> +	if (rq->core->core_unsafe_nest > 1)
> +		goto unlock;
> +
> +	/* Do nothing more if the core is not tagged. */
> +	if (!rq->core->core_cookie)
> +		goto unlock;
> +
> +	for_each_cpu(i, smt_mask) {
> +		struct rq *srq = cpu_rq(i);
> +
> +		if (i == cpu || cpu_is_offline(i))
> +			continue;
> +
> +		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
> +			continue;
> +
> +		/* Skip if HT is not running a tagged task. */
> +		if (!srq->curr->core_cookie && !srq->core_pick)
> +			continue;
> +
> +		/*
> +		 * Force sibling into the kernel by IPI. If work was already
> +		 * pending, no new IPIs are sent. This is Ok since the receiver
> +		 * would already be in the kernel, or on its way to it.
> +		 */
> +		irq_work_queue_on(&srq->core_irq_work, i);
> +	}
> +unlock:
> +	raw_spin_unlock(rq_lockp(rq));
> +ret:
> +	local_irq_restore(flags);
> +}
> +
> +/*
> + * Process any work need for either exiting the core-wide unsafe state, or for
> + * waiting on this hyperthread if the core is still in this state.
> + *
> + * @idle: Are we called from the idle loop?
> + */
> +void sched_core_unsafe_exit(void)
> +{
> +	unsigned long flags;
> +	unsigned int nest;
> +	struct rq *rq;
> +	int cpu;
> +
> +	if (!static_branch_likely(&sched_core_protect_kernel))
> +		return;
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +
> +	/* Do nothing if core-sched disabled. */
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/*
> +	 * Can happen when a process is forked and the first return to user
> +	 * mode is a syscall exit. Either way, there's nothing to do.
> +	 */
> +	if (rq->core_this_unsafe_nest == 0)
> +		goto ret;
> +
> +	rq->core_this_unsafe_nest--;
> +
> +	/* enter() should be paired with exit() only. */
> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
> +		goto ret;
> +
> +	raw_spin_lock(rq_lockp(rq));
> +	/*
> +	 * Core-wide nesting counter can never be 0 because we are
> +	 * still in it on this CPU.
> +	 */
> +	nest = rq->core->core_unsafe_nest;
> +	WARN_ON_ONCE(!nest);
> +
> +	WRITE_ONCE(rq->core->core_unsafe_nest, nest - 1);
> +	/*
> +	 * The raw_spin_unlock release semantics pairs with the nest counter's
> +	 * smp_load_acquire() in sched_core_wait_till_safe().
> +	 */
> +	raw_spin_unlock(rq_lockp(rq));
> +ret:
> +	local_irq_restore(flags);
> +}
> +
>   // XXX fairness/fwd progress conditions
>   /*
>    * Returns
> @@ -4991,6 +5232,7 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
>   			rq = cpu_rq(i);
>   			if (rq->core && rq->core == rq)
>   				core_rq = rq;
> +			init_sched_core_irq_work(rq);
>   		}
>   
>   		if (!core_rq)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 001382bc67f9..20937a5b6272 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1061,6 +1061,8 @@ struct rq {
>   	unsigned int		core_enabled;
>   	unsigned int		core_sched_seq;
>   	struct rb_root		core_tree;
> +	struct irq_work		core_irq_work; /* To force HT into kernel */
> +	unsigned int		core_this_unsafe_nest;
>   
>   	/* shared state */
>   	unsigned int		core_task_seq;
> @@ -1068,6 +1070,7 @@ struct rq {
>   	unsigned long		core_cookie;
>   	unsigned char		core_forceidle;
>   	unsigned int		core_forceidle_seq;
> +	unsigned int		core_unsafe_nest;
>   #endif
>   };
>   
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-16 10:08           ` Alexandre Chartre
@ 2020-11-16 14:50             ` Joel Fernandes
  2020-11-16 15:43               ` Joel Fernandes
  0 siblings, 1 reply; 98+ messages in thread
From: Joel Fernandes @ 2020-11-16 14:50 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, James.Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Tim Chen, Paul E . McKenney

On Mon, Nov 16, 2020 at 11:08:25AM +0100, Alexandre Chartre wrote:
[..]
> > > >    static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> > > >    					    unsigned long ti_work)
> > > >    {
> > > > @@ -175,7 +198,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> > > >    		 * enabled above.
> > > >    		 */
> > > >    		local_irq_disable_exit_to_user();
> > > > -		ti_work = READ_ONCE(current_thread_info()->flags);
> > > > +		ti_work = exit_to_user_get_work();
> > > >    	}
> > > 
> > > What happen if the task is scheduled out in exit_to_user_mode_loop? (e.g. if it has
> > > _TIF_NEED_RESCHED set). It will have call sched_core_unsafe_enter() and force siblings
> > > to wait for it. So shouldn't sched_core_unsafe_exit() be called when the task is
> > > scheduled out? (because it won't run anymore) And sched_core_unsafe_enter() when
> > > the task is scheduled back in?
> > 
> > No, when the task is scheduled out, the CPU stays in kernel mode, running the
> > task being scheduled in. That task (being scheduled in) would have already
> > done a sched_core_unsafe_enter(). When that task returns to user mode, it
> > will do a sched_core_unsafe_exit(). When all tasks go to sleep, the last task
> > that enters the idle loop will do a sched_core_unsafe_exit(). Just to note:
> > the "unsafe kernel context" is per-CPU and not per-task. Does that answer
> > your question?
> 
> Ok, I think I get it: it works because when a task is scheduled out, the
> scheduler will schedule in a new tagged task (because we have core scheduling).
> So that new task should be accounted for in the core-wide protection the same
> way as the previous one.

Exactly!
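
To make that concrete, here is a minimal user-space sketch (purely an
illustrative model with made-up names, not the kernel implementation) of why
the per-CPU nesting count stays balanced across a context switch:

/* build: cc -Wall -o nest_model nest_model.c && ./nest_model */
#include <assert.h>
#include <stdio.h>

/* Models rq->core_this_unsafe_nest: accounting is per CPU, not per task. */
static unsigned int this_unsafe_nest;

/* First kernel entry from user space on this CPU. */
static void model_unsafe_enter(void)
{
	this_unsafe_nest++;
}

/* Last exit to user space (or entry into the idle loop) on this CPU. */
static void model_unsafe_exit(void)
{
	this_unsafe_nest--;
}

int main(void)
{
	/* Task A enters the kernel from user space on this CPU. */
	model_unsafe_enter();
	assert(this_unsafe_nest == 1);

	/*
	 * Task A is scheduled out and task B is scheduled in. The context
	 * switch happens entirely in kernel mode on this CPU, so neither
	 * enter() nor exit() runs here and the count stays at 1.
	 */

	/* Task B later returns to user mode and does the matching exit. */
	model_unsafe_exit();
	assert(this_unsafe_nest == 0);

	printf("per-CPU nest stays balanced across the context switch\n");
	return 0;
}

In other words, the enter()/exit() pair brackets the CPU's stay in the kernel,
not any particular task's, which is why a schedule() in between needs no extra
accounting.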

> > > Shouldn't we be using smp_store_release() like sched_core_unsafe_exit() does?
> > > 
> > > In any case, it is worth having a comment explaining why WRITE_ONCE() or
> > > smp_store_release() is used.
> > 
> > The smp_store_release() in exit() ensures that all prior read and write
> > accesses done by this CPU are seen by the spinning CPU doing the
> > smp_load_acquire() before that spinning CPU returns. I did put a comment
> > there.
> > 
> > But I think I don't need smp_store_release() at all here. The spin_unlock
> > that follows already has the required release semantics. I will demote it to
> > a WRITE_ONCE() in enter() as well, and add appropriate comments.
> > 
> 
> I think a WRITE_ONCE() is not even useful here. The WRITE_ONCE() will only prevent
> possible compiler optimizations in the function wrt rq->core->core_unsafe_nest, but
> rq->core->core_unsafe_nest is only updated here, concurrent changes are protected
> by the rq_lockp(rq) spinlock, and the memory barrier is ensured by raw_spin_unlock().
> 
> So I think you can just do:  rq->core->core_unsafe_nest++;
> 
> And in sched_core_unsafe_exit(), you can just do:  rq->core->core_unsafe_nest = nest - 1

Hmm, I believe KCSAN will flag this as a data race though. Even though the
variable is modified under the lock, it is still read locklessly in
wait_till_safe(). Though I agree that in practice it may not be that useful,
because we are only checking whether the variable is > 0. If it's OK with you,
I will leave it as WRITE_ONCE() for now.
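
For reference, the pairing being discussed is the standard acquire/release
pattern. Below is a tiny user-space model (illustrative only, file and function
names made up: it uses C11 atomics and does the release store on the counter
itself, whereas the patch relies on the release semantics of
raw_spin_unlock(rq_lockp(rq))):

/* build: cc -Wall -pthread -o pairing_model pairing_model.c && ./pairing_model */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_uint core_unsafe_nest;	/* models rq->core->core_unsafe_nest */
static int kernel_state;		/* data the waiter must not see stale */

/* Models the exit side: update state, then release the nesting counter. */
static void *unsafe_exit_side(void *arg)
{
	(void)arg;
	kernel_state = 42;
	/* Release store: pairs with the acquire load in the waiter below. */
	atomic_store_explicit(&core_unsafe_nest, 0, memory_order_release);
	return NULL;
}

/* Models sched_core_wait_till_safe(): spin until the core is safe again. */
static void *waiter_side(void *arg)
{
	(void)arg;
	while (atomic_load_explicit(&core_unsafe_nest, memory_order_acquire) > 0)
		;	/* cpu_relax()/TIF checks omitted for brevity */
	/* The acquire guarantees kernel_state == 42 is visible here. */
	printf("safe, kernel_state=%d\n", kernel_state);
	return NULL;
}

int main(void)
{
	pthread_t exiter, waiter;

	atomic_init(&core_unsafe_nest, 1);	/* core starts out "unsafe" */
	pthread_create(&waiter, NULL, waiter_side, NULL);
	pthread_create(&exiter, NULL, unsafe_exit_side, NULL);
	pthread_join(exiter, NULL);
	pthread_join(waiter, NULL);
	return 0;
}

Either way, once the waiter observes the counter dropping to zero it is also
guaranteed to observe everything the exiting side wrote before the release,
which is all the protection needs.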

> Also comment in sched_core_wait_till_safe() wrt smp_load_acquire() should be updated,
> it should say:
> 
> 	/*
> 	 * Wait till the core of this HT is not in an unsafe state.
> 	 *
> 	 * Pair with raw_spin_unlock(rq_lockp(rq) in sched_core_unsafe_enter/exit()
> 	 */

Ah right, fixed. Thanks.

> > Can I add your Reviewed-by tag to below updated patch? Thanks for review!
> 
> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

Will add, thanks!

 - Joel

> 
> > 
> > ---8<---
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index bd1a5b87a5e2..a36f08d74e09 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4678,6 +4678,15 @@
> >   	sbni=		[NET] Granch SBNI12 leased line adapter
> > +	sched_core_protect_kernel=
> > +			[SCHED_CORE] Pause SMT siblings of a core running in
> > +			user mode, if at least one of the siblings of the core
> > +			is running in kernel mode. This is to guarantee that
> > +			kernel data is not leaked to tasks which are not trusted
> > +			by the kernel. A value of 0 disables protection, 1
> > +			enables protection. The default is 1. Note that protection
> > +			depends on the arch defining the _TIF_UNSAFE_RET flag.
> > +
> >   	sched_debug	[KNL] Enables verbose scheduler debug messages.
> >   	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
> > diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> > index 474f29638d2c..62278c5b3b5f 100644
> > --- a/include/linux/entry-common.h
> > +++ b/include/linux/entry-common.h
> > @@ -33,6 +33,10 @@
> >   # define _TIF_PATCH_PENDING		(0)
> >   #endif
> > +#ifndef _TIF_UNSAFE_RET
> > +# define _TIF_UNSAFE_RET		(0)
> > +#endif
> > +
> >   #ifndef _TIF_UPROBE
> >   # define _TIF_UPROBE			(0)
> >   #endif
> > @@ -69,7 +73,7 @@
> >   #define EXIT_TO_USER_MODE_WORK						\
> >   	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
> > -	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
> > +	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET |	\
> >   	 ARCH_EXIT_TO_USER_MODE_WORK)
> >   /**
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d38e904dd603..fe6f225bfbf9 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
> >   const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
> > +#ifdef CONFIG_SCHED_CORE
> > +void sched_core_unsafe_enter(void);
> > +void sched_core_unsafe_exit(void);
> > +bool sched_core_wait_till_safe(unsigned long ti_check);
> > +bool sched_core_kernel_protected(void);
> > +#else
> > +#define sched_core_unsafe_enter(ignore) do { } while (0)
> > +#define sched_core_unsafe_exit(ignore) do { } while (0)
> > +#define sched_core_wait_till_safe(ignore) do { } while (0)
> > +#define sched_core_kernel_protected(ignore) do { } while (0)
> > +#endif
> > +
> >   #endif
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index 2b8366693d5c..d5d88e735d55 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -28,6 +28,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
> >   	instrumentation_begin();
> >   	trace_hardirqs_off_finish();
> > +	if (_TIF_UNSAFE_RET) /* Kernel protection depends on arch defining the flag. */
> > +		sched_core_unsafe_enter();
> >   	instrumentation_end();
> >   }
> > @@ -137,6 +139,27 @@ static __always_inline void exit_to_user_mode(void)
> >   /* Workaround to allow gradual conversion of architecture code */
> >   void __weak arch_do_signal(struct pt_regs *regs) { }
> > +static unsigned long exit_to_user_get_work(void)
> > +{
> > +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> > +
> > +	if ((IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
> > +	    || !_TIF_UNSAFE_RET)
> > +		return ti_work;
> > +
> > +#ifdef CONFIG_SCHED_CORE
> > +	ti_work &= EXIT_TO_USER_MODE_WORK;
> > +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
> > +		sched_core_unsafe_exit();
> > +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
> > +			sched_core_unsafe_enter(); /* not exiting to user yet. */
> > +		}
> > +	}
> > +
> > +	return READ_ONCE(current_thread_info()->flags);
> > +#endif
> > +}
> > +
> >   static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> >   					    unsigned long ti_work)
> >   {
> > @@ -174,7 +197,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> >   		 * enabled above.
> >   		 */
> >   		local_irq_disable_exit_to_user();
> > -		ti_work = READ_ONCE(current_thread_info()->flags);
> > +		ti_work = exit_to_user_get_work();
> >   	}
> >   	/* Return the latest work state for arch_exit_to_user_mode() */
> > @@ -183,9 +206,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> >   static void exit_to_user_mode_prepare(struct pt_regs *regs)
> >   {
> > -	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> > +	unsigned long ti_work;
> >   	lockdep_assert_irqs_disabled();
> > +	ti_work = exit_to_user_get_work();
> >   	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> >   		ti_work = exit_to_user_mode_loop(regs, ti_work);
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index fa68941998e3..429f9b8ca38e 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
> >   #ifdef CONFIG_SCHED_CORE
> > +DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
> > +static int __init set_sched_core_protect_kernel(char *str)
> > +{
> > +	unsigned long val = 0;
> > +
> > +	if (!str)
> > +		return 0;
> > +
> > +	if (!kstrtoul(str, 0, &val) && !val)
> > +		static_branch_disable(&sched_core_protect_kernel);
> > +
> > +	return 1;
> > +}
> > +__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
> > +
> > +/* Is the kernel protected by core scheduling? */
> > +bool sched_core_kernel_protected(void)
> > +{
> > +	return static_branch_likely(&sched_core_protect_kernel);
> > +}
> > +
> >   DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
> >   /* kernel prio, less is more */
> > @@ -4596,6 +4617,226 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
> >   	return a->core_cookie == b->core_cookie;
> >   }
> > +/*
> > + * Handler to attempt to enter kernel. It does nothing because the exit to
> > + * usermode or guest mode will do the actual work (of waiting if needed).
> > + */
> > +static void sched_core_irq_work(struct irq_work *work)
> > +{
> > +	return;
> > +}
> > +
> > +static inline void init_sched_core_irq_work(struct rq *rq)
> > +{
> > +	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
> > +}
> > +
> > +/*
> > + * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
> > + * exits the core-wide unsafe state. Obviously the CPU calling this function
> > + * should not be responsible for the core being in the core-wide unsafe state
> > + * otherwise it will deadlock.
> > + *
> > + * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
> > + *            the loop if TIF flags are set and notify caller about it.
> > + *
> > + * IRQs should be disabled.
> > + */
> > +bool sched_core_wait_till_safe(unsigned long ti_check)
> > +{
> > +	bool restart = false;
> > +	struct rq *rq;
> > +	int cpu;
> > +
> > +	/* We clear the thread flag only at the end, so no need to check for it. */
> > +	ti_check &= ~_TIF_UNSAFE_RET;
> > +
> > +	cpu = smp_processor_id();
> > +	rq = cpu_rq(cpu);
> > +
> > +	if (!sched_core_enabled(rq))
> > +		goto ret;
> > +
> > +	/* Down grade to allow interrupts to prevent stop_machine lockups.. */
> > +	preempt_disable();
> > +	local_irq_enable();
> > +
> > +	/*
> > +	 * Wait till the core of this HT is not in an unsafe state.
> > +	 *
> > +	 * Pair with smp_store_release() in sched_core_unsafe_exit().
> > +	 */
> > +	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
> > +		cpu_relax();
> > +		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
> > +			restart = true;
> > +			break;
> > +		}
> > +	}
> > +
> > +	/* Upgrade it back to the expectations of entry code. */
> > +	local_irq_disable();
> > +	preempt_enable();
> > +
> > +ret:
> > +	if (!restart)
> > +		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
> > +
> > +	return restart;
> > +}
> > +
> > +/*
> > + * Enter the core-wide IRQ state. Sibling will be paused if it is running
> > + * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
> > + * avoid sending useless IPIs is made. Must be called only from hard IRQ
> > + * context.
> > + */
> > +void sched_core_unsafe_enter(void)
> > +{
> > +	const struct cpumask *smt_mask;
> > +	unsigned long flags;
> > +	struct rq *rq;
> > +	int i, cpu;
> > +
> > +	if (!static_branch_likely(&sched_core_protect_kernel))
> > +		return;
> > +
> > +	local_irq_save(flags);
> > +	cpu = smp_processor_id();
> > +	rq = cpu_rq(cpu);
> > +	if (!sched_core_enabled(rq))
> > +		goto ret;
> > +
> > +	/* Ensure that on return to user/guest, we check whether to wait. */
> > +	if (current->core_cookie)
> > +		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
> > +
> > +	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
> > +	rq->core_this_unsafe_nest++;
> > +
> > +	/*
> > +	 * Should not nest: enter() should only pair with exit(). Both are done
> > +	 * during the first entry into kernel and the last exit from kernel.
> > +	 * Nested kernel entries (such as nested interrupts) will only trigger
> > +	 * enter() and exit() on the outer most kernel entry and exit.
> > +	 */
> > +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
> > +		goto ret;
> > +
> > +	raw_spin_lock(rq_lockp(rq));
> > +	smt_mask = cpu_smt_mask(cpu);
> > +
> > +	/*
> > +	 * Contribute this CPU's unsafe_enter() to the core-wide unsafe_enter()
> > +	 * count.  The raw_spin_unlock() release semantics pairs with the nest
> > +	 * counter's smp_load_acquire() in sched_core_wait_till_safe().
> > +	 */
> > +	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
> > +
> > +	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
> > +		goto unlock;
> > +
> > +	if (irq_work_is_busy(&rq->core_irq_work)) {
> > +		/*
> > +		 * Do nothing more since we are in an IPI sent from another
> > +		 * sibling to enforce safety. That sibling would have sent IPIs
> > +		 * to all of the HTs.
> > +		 */
> > +		goto unlock;
> > +	}
> > +
> > +	/*
> > +	 * If we are not the first ones on the core to enter core-wide unsafe
> > +	 * state, do nothing.
> > +	 */
> > +	if (rq->core->core_unsafe_nest > 1)
> > +		goto unlock;
> > +
> > +	/* Do nothing more if the core is not tagged. */
> > +	if (!rq->core->core_cookie)
> > +		goto unlock;
> > +
> > +	for_each_cpu(i, smt_mask) {
> > +		struct rq *srq = cpu_rq(i);
> > +
> > +		if (i == cpu || cpu_is_offline(i))
> > +			continue;
> > +
> > +		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
> > +			continue;
> > +
> > +		/* Skip if HT is not running a tagged task. */
> > +		if (!srq->curr->core_cookie && !srq->core_pick)
> > +			continue;
> > +
> > +		/*
> > +		 * Force sibling into the kernel by IPI. If work was already
> > +		 * pending, no new IPIs are sent. This is Ok since the receiver
> > +		 * would already be in the kernel, or on its way to it.
> > +		 */
> > +		irq_work_queue_on(&srq->core_irq_work, i);
> > +	}
> > +unlock:
> > +	raw_spin_unlock(rq_lockp(rq));
> > +ret:
> > +	local_irq_restore(flags);
> > +}
> > +
> > +/*
> > + * Process any work need for either exiting the core-wide unsafe state, or for
> > + * waiting on this hyperthread if the core is still in this state.
> > + *
> > + * @idle: Are we called from the idle loop?
> > + */
> > +void sched_core_unsafe_exit(void)
> > +{
> > +	unsigned long flags;
> > +	unsigned int nest;
> > +	struct rq *rq;
> > +	int cpu;
> > +
> > +	if (!static_branch_likely(&sched_core_protect_kernel))
> > +		return;
> > +
> > +	local_irq_save(flags);
> > +	cpu = smp_processor_id();
> > +	rq = cpu_rq(cpu);
> > +
> > +	/* Do nothing if core-sched disabled. */
> > +	if (!sched_core_enabled(rq))
> > +		goto ret;
> > +
> > +	/*
> > +	 * Can happen when a process is forked and the first return to user
> > +	 * mode is a syscall exit. Either way, there's nothing to do.
> > +	 */
> > +	if (rq->core_this_unsafe_nest == 0)
> > +		goto ret;
> > +
> > +	rq->core_this_unsafe_nest--;
> > +
> > +	/* enter() should be paired with exit() only. */
> > +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
> > +		goto ret;
> > +
> > +	raw_spin_lock(rq_lockp(rq));
> > +	/*
> > +	 * Core-wide nesting counter can never be 0 because we are
> > +	 * still in it on this CPU.
> > +	 */
> > +	nest = rq->core->core_unsafe_nest;
> > +	WARN_ON_ONCE(!nest);
> > +
> > +	WRITE_ONCE(rq->core->core_unsafe_nest, nest - 1);
> > +	/*
> > +	 * The raw_spin_unlock release semantics pairs with the nest counter's
> > +	 * smp_load_acquire() in sched_core_wait_till_safe().
> > +	 */
> > +	raw_spin_unlock(rq_lockp(rq));
> > +ret:
> > +	local_irq_restore(flags);
> > +}
> > +
> >   // XXX fairness/fwd progress conditions
> >   /*
> >    * Returns
> > @@ -4991,6 +5232,7 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
> >   			rq = cpu_rq(i);
> >   			if (rq->core && rq->core == rq)
> >   				core_rq = rq;
> > +			init_sched_core_irq_work(rq);
> >   		}
> >   		if (!core_rq)
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 001382bc67f9..20937a5b6272 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1061,6 +1061,8 @@ struct rq {
> >   	unsigned int		core_enabled;
> >   	unsigned int		core_sched_seq;
> >   	struct rb_root		core_tree;
> > +	struct irq_work		core_irq_work; /* To force HT into kernel */
> > +	unsigned int		core_this_unsafe_nest;
> >   	/* shared state */
> >   	unsigned int		core_task_seq;
> > @@ -1068,6 +1070,7 @@ struct rq {
> >   	unsigned long		core_cookie;
> >   	unsigned char		core_forceidle;
> >   	unsigned int		core_forceidle_seq;
> > +	unsigned int		core_unsafe_nest;
> >   #endif
> >   };
> > 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-16 14:50             ` Joel Fernandes
@ 2020-11-16 15:43               ` Joel Fernandes
  0 siblings, 0 replies; 98+ messages in thread
From: Joel Fernandes @ 2020-11-16 15:43 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, James.Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser, Aubrey Li,
	Tim Chen, Paul E . McKenney

On Mon, Nov 16, 2020 at 09:50:37AM -0500, Joel Fernandes wrote:

> > > Can I add your Reviewed-by tag to below updated patch? Thanks for review!
> > 
> > Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
> 
> Will add, thanks!
> 
>  - Joel

Alexandre, there was one more trivial fixup I had to make. Just fyi:
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched&id=06a302df95f3a235e2680102ec4e5da10c9b87a0

Basically, I need to conditionally call sched_core_unsafe_exit() depending on
the value of the sched_core_protect_kernel= option and on CONFIG_SCHED_CORE.
Below is the updated patch:

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bd1a5b87a5e2..a36f08d74e09 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4678,6 +4678,15 @@
 
 	sbni=		[NET] Granch SBNI12 leased line adapter
 
+	sched_core_protect_kernel=
+			[SCHED_CORE] Pause SMT siblings of a core running in
+			user mode, if at least one of the siblings of the core
+			is running in kernel mode. This is to guarantee that
+			kernel data is not leaked to tasks which are not trusted
+			by the kernel. A value of 0 disables protection, 1
+			enables protection. The default is 1. Note that protection
+			depends on the arch defining the _TIF_UNSAFE_RET flag.
+
 	sched_debug	[KNL] Enables verbose scheduler debug messages.
 
 	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 1a128baf3628..022e1f114157 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -33,6 +33,10 @@
 # define _TIF_PATCH_PENDING		(0)
 #endif
 
+#ifndef _TIF_UNSAFE_RET
+# define _TIF_UNSAFE_RET		(0)
+#endif
+
 #ifndef _TIF_UPROBE
 # define _TIF_UPROBE			(0)
 #endif
@@ -74,7 +78,7 @@
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
 	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |	\
-	 ARCH_EXIT_TO_USER_MODE_WORK)
+	 _TIF_UNSAFE_RET | ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
  * arch_check_user_regs - Architecture specific sanity check for user mode regs
@@ -444,4 +448,10 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs);
  */
 void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state);
 
+/* entry_kernel_protected - Is kernel protection on entry/exit into kernel supported? */
+static inline bool entry_kernel_protected(void)
+{
+	return IS_ENABLED(CONFIG_SCHED_CORE) && sched_core_kernel_protected()
+		&& _TIF_UNSAFE_RET != 0;
+}
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d38e904dd603..fe6f225bfbf9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+void sched_core_unsafe_enter(void);
+void sched_core_unsafe_exit(void);
+bool sched_core_wait_till_safe(unsigned long ti_check);
+bool sched_core_kernel_protected(void);
+#else
+#define sched_core_unsafe_enter(ignore) do { } while (0)
+#define sched_core_unsafe_exit(ignore) do { } while (0)
+#define sched_core_wait_till_safe(ignore) do { } while (0)
+#define sched_core_kernel_protected(ignore) do { } while (0)
+#endif
+
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bc75c114c1b3..9d9d926f2a1c 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -28,6 +28,9 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
 
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
+
+	if (entry_kernel_protected())
+		sched_core_unsafe_enter();
 	instrumentation_end();
 }
 
@@ -145,6 +148,26 @@ static void handle_signal_work(struct pt_regs *regs, unsigned long ti_work)
 	arch_do_signal_or_restart(regs, ti_work & _TIF_SIGPENDING);
 }
 
+static unsigned long exit_to_user_get_work(void)
+{
+	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+
+	if (!entry_kernel_protected())
+		return ti_work;
+
+#ifdef CONFIG_SCHED_CORE
+	ti_work &= EXIT_TO_USER_MODE_WORK;
+	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
+		sched_core_unsafe_exit();
+		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
+			sched_core_unsafe_enter(); /* not exiting to user yet. */
+		}
+	}
+
+	return READ_ONCE(current_thread_info()->flags);
+#endif
+}
+
 static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 					    unsigned long ti_work)
 {
@@ -182,7 +205,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		 * enabled above.
 		 */
 		local_irq_disable_exit_to_user();
-		ti_work = READ_ONCE(current_thread_info()->flags);
+		ti_work = exit_to_user_get_work();
 	}
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
@@ -191,9 +214,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 static void exit_to_user_mode_prepare(struct pt_regs *regs)
 {
-	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+	unsigned long ti_work;
 
 	lockdep_assert_irqs_disabled();
+	ti_work = exit_to_user_get_work();
 
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a5e04078ba5d..56d6a382e3ff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
 
 #ifdef CONFIG_SCHED_CORE
 
+DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
+static int __init set_sched_core_protect_kernel(char *str)
+{
+	unsigned long val = 0;
+
+	if (!str)
+		return 0;
+
+	if (!kstrtoul(str, 0, &val) && !val)
+		static_branch_disable(&sched_core_protect_kernel);
+
+	return 1;
+}
+__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
+
+/* Is the kernel protected by core scheduling? */
+bool sched_core_kernel_protected(void)
+{
+	return static_branch_likely(&sched_core_protect_kernel);
+}
+
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
 /* kernel prio, less is more */
@@ -4596,6 +4617,226 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
+/*
+ * Handler to attempt to enter kernel. It does nothing because the exit to
+ * usermode or guest mode will do the actual work (of waiting if needed).
+ */
+static void sched_core_irq_work(struct irq_work *work)
+{
+	return;
+}
+
+static inline void init_sched_core_irq_work(struct rq *rq)
+{
+	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
+}
+
+/*
+ * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
+ * exits the core-wide unsafe state. Obviously the CPU calling this function
+ * should not be responsible for the core being in the core-wide unsafe state
+ * otherwise it will deadlock.
+ *
+ * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
+ *            the loop if TIF flags are set and notify caller about it.
+ *
+ * IRQs should be disabled.
+ */
+bool sched_core_wait_till_safe(unsigned long ti_check)
+{
+	bool restart = false;
+	struct rq *rq;
+	int cpu;
+
+	/* We clear the thread flag only at the end, so no need to check for it. */
+	ti_check &= ~_TIF_UNSAFE_RET;
+
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/* Downgrade to allow interrupts, to prevent stop_machine lockups. */
+	preempt_disable();
+	local_irq_enable();
+
+	/*
+	 * Wait till the core of this HT is not in an unsafe state.
+	 *
+	 * Pair with raw_spin_lock/unlock() in sched_core_unsafe_enter/exit().
+	 */
+	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
+		cpu_relax();
+		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
+			restart = true;
+			break;
+		}
+	}
+
+	/* Upgrade it back to the expectations of entry code. */
+	local_irq_disable();
+	preempt_enable();
+
+ret:
+	if (!restart)
+		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+	return restart;
+}
+
+/*
+ * Enter the core-wide unsafe state. A sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt is
+ * made to avoid sending useless IPIs. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_unsafe_enter(void)
+{
+	const struct cpumask *smt_mask;
+	unsigned long flags;
+	struct rq *rq;
+	int i, cpu;
+
+	if (!static_branch_likely(&sched_core_protect_kernel))
+		return;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/* Ensure that on return to user/guest, we check whether to wait. */
+	if (current->core_cookie)
+		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
+	rq->core_this_unsafe_nest++;
+
+	/*
+	 * Should not nest: enter() should only pair with exit(). Both are done
+	 * during the first entry into kernel and the last exit from kernel.
+	 * Nested kernel entries (such as nested interrupts) will only trigger
+	 * enter() and exit() on the outermost kernel entry and exit.
+	 */
+	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
+		goto ret;
+
+	raw_spin_lock(rq_lockp(rq));
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * Contribute this CPU's unsafe_enter() to the core-wide unsafe_enter()
+	 * count.  The raw_spin_unlock() release semantics pairs with the nest
+	 * counter's smp_load_acquire() in sched_core_wait_till_safe().
+	 */
+	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
+
+	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
+		goto unlock;
+
+	if (irq_work_is_busy(&rq->core_irq_work)) {
+		/*
+		 * Do nothing more since we are in an IPI sent from another
+		 * sibling to enforce safety. That sibling would have sent IPIs
+		 * to all of the HTs.
+		 */
+		goto unlock;
+	}
+
+	/*
+	 * If we are not the first ones on the core to enter core-wide unsafe
+	 * state, do nothing.
+	 */
+	if (rq->core->core_unsafe_nest > 1)
+		goto unlock;
+
+	/* Do nothing more if the core is not tagged. */
+	if (!rq->core->core_cookie)
+		goto unlock;
+
+	for_each_cpu(i, smt_mask) {
+		struct rq *srq = cpu_rq(i);
+
+		if (i == cpu || cpu_is_offline(i))
+			continue;
+
+		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
+			continue;
+
+		/* Skip if HT is not running a tagged task. */
+		if (!srq->curr->core_cookie && !srq->core_pick)
+			continue;
+
+		/*
+		 * Force sibling into the kernel by IPI. If work was already
+		 * pending, no new IPIs are sent. This is Ok since the receiver
+		 * would already be in the kernel, or on its way to it.
+		 */
+		irq_work_queue_on(&srq->core_irq_work, i);
+	}
+unlock:
+	raw_spin_unlock(rq_lockp(rq));
+ret:
+	local_irq_restore(flags);
+}
+
+/*
+ * Process any work needed for either exiting the core-wide unsafe state, or
+ * for waiting on this hyperthread if the core is still in this state.
+ *
+ * Called on the final exit to user mode or when entering the idle loop.
+ */
+void sched_core_unsafe_exit(void)
+{
+	unsigned long flags;
+	unsigned int nest;
+	struct rq *rq;
+	int cpu;
+
+	if (!static_branch_likely(&sched_core_protect_kernel))
+		return;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+
+	/* Do nothing if core-sched disabled. */
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/*
+	 * Can happen when a process is forked and the first return to user
+	 * mode is a syscall exit. Either way, there's nothing to do.
+	 */
+	if (rq->core_this_unsafe_nest == 0)
+		goto ret;
+
+	rq->core_this_unsafe_nest--;
+
+	/* enter() should be paired with exit() only. */
+	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
+		goto ret;
+
+	raw_spin_lock(rq_lockp(rq));
+	/*
+	 * Core-wide nesting counter can never be 0 because we are
+	 * still in it on this CPU.
+	 */
+	nest = rq->core->core_unsafe_nest;
+	WARN_ON_ONCE(!nest);
+
+	WRITE_ONCE(rq->core->core_unsafe_nest, nest - 1);
+	/*
+	 * The raw_spin_unlock release semantics pairs with the nest counter's
+	 * smp_load_acquire() in sched_core_wait_till_safe().
+	 */
+	raw_spin_unlock(rq_lockp(rq));
+ret:
+	local_irq_restore(flags);
+}
+
 // XXX fairness/fwd progress conditions
 /*
  * Returns
@@ -4991,6 +5232,7 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
 			rq = cpu_rq(i);
 			if (rq->core && rq->core == rq)
 				core_rq = rq;
+			init_sched_core_irq_work(rq);
 		}
 
 		if (!core_rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cd74cc41c8da..acf187c36fc4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1063,6 +1063,8 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	struct irq_work		core_irq_work; /* To force HT into kernel */
+	unsigned int		core_this_unsafe_nest;
 
 	/* shared state */
 	unsigned int		core_task_seq;
@@ -1070,6 +1072,7 @@ struct rq {
 	unsigned long		core_cookie;
 	unsigned char		core_forceidle;
 	unsigned int		core_forceidle_seq;
+	unsigned int		core_unsafe_nest;
 #endif
 };
 

^ permalink raw reply related	[flat|nested] 98+ messages in thread

end of thread, other threads:[~2020-11-16 15:43 UTC | newest]

Thread overview: 98+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-20  1:43 [PATCH v8 -tip 00/26] Core scheduling Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 01/26] sched: Wrap rq::lock access Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task() Joel Fernandes (Google)
2020-10-22  7:59   ` Li, Aubrey
2020-10-22 15:25     ` Joel Fernandes
2020-10-23  5:25       ` Li, Aubrey
2020-10-23 21:47         ` Joel Fernandes
2020-10-24  2:48           ` Li, Aubrey
2020-10-24 11:10             ` Vineeth Pillai
2020-10-24 12:27               ` Vineeth Pillai
2020-10-24 23:48                 ` Li, Aubrey
2020-10-26  9:01                 ` Peter Zijlstra
2020-10-27  3:17                   ` Li, Aubrey
2020-10-27 14:19                   ` Joel Fernandes
2020-10-27 15:23                     ` Joel Fernandes
2020-10-27 14:14                 ` Joel Fernandes
2020-10-20  1:43 ` [PATCH v8 -tip 03/26] sched: Core-wide rq->lock Joel Fernandes (Google)
2020-10-26 11:59   ` Peter Zijlstra
2020-10-27 16:27     ` Joel Fernandes
2020-10-20  1:43 ` [PATCH v8 -tip 04/26] sched/fair: Add a few assertions Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 05/26] sched: Basic tracking of matching tasks Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling Joel Fernandes (Google)
2020-10-23 13:51   ` Peter Zijlstra
2020-10-23 13:54     ` Peter Zijlstra
2020-10-23 17:57       ` Joel Fernandes
2020-10-23 19:26         ` Peter Zijlstra
2020-10-23 21:31           ` Joel Fernandes
2020-10-26  8:28             ` Peter Zijlstra
2020-10-27 16:58               ` Joel Fernandes
2020-10-26  9:31             ` Peter Zijlstra
2020-11-05 18:50               ` Joel Fernandes
2020-11-05 22:07                 ` Joel Fernandes
2020-10-23 15:05   ` Peter Zijlstra
2020-10-23 17:59     ` Joel Fernandes
2020-10-20  1:43 ` [PATCH v8 -tip 07/26] sched/fair: Fix forced idle sibling starvation corner case Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle Joel Fernandes (Google)
2020-10-26 12:47   ` Peter Zijlstra
2020-10-28 15:29     ` Joel Fernandes
2020-10-28 18:39     ` Joel Fernandes
2020-10-29 16:59     ` Joel Fernandes
2020-10-29 18:24     ` Joel Fernandes
2020-10-29 18:59       ` Peter Zijlstra
2020-10-30  2:36         ` Joel Fernandes
2020-10-30  2:42           ` Joel Fernandes
2020-10-30  8:41             ` Peter Zijlstra
2020-10-31 21:41               ` Joel Fernandes
2020-10-20  1:43 ` [PATCH v8 -tip 09/26] sched: Trivial forced-newidle balancer Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 10/26] sched: migration changes for core scheduling Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 11/26] irq_work: Cleanup Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 12/26] arch/x86: Add a new TIF flag for untrusted tasks Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode Joel Fernandes (Google)
2020-10-20  3:41   ` Randy Dunlap
2020-11-03  0:20     ` Joel Fernandes
2020-10-22  5:48   ` Li, Aubrey
2020-11-03  0:50     ` Joel Fernandes
2020-10-30 10:29   ` Alexandre Chartre
2020-11-03  1:20     ` Joel Fernandes
2020-11-06 16:57       ` Alexandre Chartre
2020-11-06 17:43         ` Joel Fernandes
2020-11-06 18:07           ` Alexandre Chartre
2020-11-10  9:35       ` Alexandre Chartre
2020-11-10 22:42         ` Joel Fernandes
2020-11-16 10:08           ` Alexandre Chartre
2020-11-16 14:50             ` Joel Fernandes
2020-11-16 15:43               ` Joel Fernandes
2020-10-20  1:43 ` [PATCH v8 -tip 14/26] entry/idle: Enter and exit kernel protection during idle entry and exit Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 15/26] entry/kvm: Protect the kernel when entering from guest Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 16/26] sched: cgroup tagging interface for core scheduling Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 17/26] sched: Split the cookie and setup per-task cookie on fork Joel Fernandes (Google)
2020-11-04 22:30   ` chris hyser
2020-11-05 14:49     ` Joel Fernandes
2020-11-09 23:30     ` chris hyser
2020-10-20  1:43 ` [PATCH v8 -tip 18/26] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 19/26] sched: Add a second-level tag for nested CGroup usecase Joel Fernandes (Google)
2020-10-31  0:42   ` Josh Don
2020-11-03  2:54     ` Joel Fernandes
     [not found]   ` <6c07e70d-52f2-69ff-e1fa-690cd2c97f3d@linux.intel.com>
2020-11-05 15:52     ` Joel Fernandes
2020-10-20  1:43 ` [PATCH v8 -tip 20/26] sched: Release references to the per-task cookie on exit Joel Fernandes (Google)
2020-11-04 21:50   ` chris hyser
2020-11-05 15:46     ` Joel Fernandes
2020-10-20  1:43 ` [PATCH v8 -tip 21/26] sched: Handle task addition to CGroup Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 22/26] sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG Joel Fernandes (Google)
2020-10-20  1:43 ` [PATCH v8 -tip 23/26] kselftest: Add tests for core-sched interface Joel Fernandes (Google)
2020-10-30 22:20   ` [PATCH] sched: Change all 4 space tabs to actual tabs John B. Wyatt IV
2020-10-20  1:43 ` [PATCH v8 -tip 24/26] sched: Move core-scheduler interfacing code to a new file Joel Fernandes (Google)
2020-10-26  1:05   ` Li, Aubrey
2020-11-03  2:58     ` Joel Fernandes
2020-10-20  1:43 ` [PATCH v8 -tip 25/26] Documentation: Add core scheduling documentation Joel Fernandes (Google)
2020-10-20  3:36   ` Randy Dunlap
2020-11-12 16:11     ` Joel Fernandes
2020-10-20  1:43 ` [PATCH v8 -tip 26/26] sched: Debug bits Joel Fernandes (Google)
2020-10-30 13:26 ` [PATCH v8 -tip 00/26] Core scheduling Ning, Hongyu
2020-11-06  2:58   ` Li, Aubrey
2020-11-06 17:54     ` Joel Fernandes
2020-11-09  6:04       ` Li, Aubrey
2020-11-06 20:55 ` [RFT for v9] (Was Re: [PATCH v8 -tip 00/26] Core scheduling) Joel Fernandes
2020-11-13  9:22   ` Ning, Hongyu
2020-11-13 10:01     ` Ning, Hongyu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).