linux-kernel.vger.kernel.org archive mirror
* [PATCH -tip 00/32] Core scheduling (v9)
@ 2020-11-17 23:19 Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 01/32] sched: Wrap rq::lock access Joel Fernandes (Google)
                   ` (32 more replies)
  0 siblings, 33 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

Core-Scheduling
===============
Enclosed is series v9 of core scheduling.
v9 is rebased on tip/master (fe4adf6f92c4 ("Merge branch 'irq/core'")).
I hope that this version is acceptable for merging (pending any new review
comments that arise), as the main issues from past revisions are all resolved:
 1. Vruntime comparison.
 2. Documentation updates.
 3. CGroup and per-task interface developed by Google and Oracle.
 4. Hotplug fixes.
Almost all patches also have a Reviewed-by or Acked-by tag. See below for the
full list of changes in v9.

Introduction of the feature
=======================
Core scheduling is a feature that allows only trusted tasks to run
concurrently on CPUs sharing compute resources (e.g. hyperthreads on a
core). The goal is to mitigate core-level side-channel attacks without
requiring SMT to be disabled (which has a significant performance impact
in some situations). Core scheduling (as of v7) mitigates user-space to
user-space attacks, as well as user-space to kernel attacks when one of
the siblings enters the kernel via an interrupt or system call.

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the
same core (for now by having them in the same tagged cgroup). When a tag
is enabled in a cgroup and a task from that cgroup is running on a
hardware thread, the scheduler ensures that only idle or trusted tasks
run on the other sibling(s). Besides addressing security concerns, this
feature can also benefit RT and performance-sensitive applications that
need dynamic control over how tasks make use of SMT.

Both a CGroup and a per-task interface via prctl(2) are provided for
configuring core sharing. More details are provided in the documentation
patch. Kselftests are provided to verify the correctness/rules of the
interface.

Testing
=======
ChromeOS testing shows a 300% improvement in key-press latency in a Google
Docs typing test run alongside a Google Hangouts call (the maximum latency
drops from 150ms to 50ms per key press).

Julien: TPC-C tests showed improvements with core scheduling as below. With
kernel protection enabled, it does not regress relative to nosmt. Possibly
ASI will improve performance for those who choose kernel protection (which
can be toggled through the sched_core_protect_kernel sysctl).
				average		stdev		diff
baseline (SMT on)		1197.272	44.78312824	
core sched (   kernel protect)	412.9895	45.42734343	-65.51%
core sched (no kernel protect)	686.6515	71.77756931	-42.65%
nosmt				408.667		39.39042872	-65.87%
(Note these results are from v8).

Vineeth tested sysbench and does not see any regressions.
Hong and Aubrey tested v9 and see results similar to v8. There is a known
issue with uperf, which does regress. This appears to be because ksoftirqd
heavily contends with other tasks on the core. The consensus is that this
can be improved in the future.

Changes in v9
=============
- Note that the vruntime snapshot change is split into two patches to show
  the progression of the idea and to avoid merge conflicts:
    sched/fair: Snapshot the min_vruntime of CPUs on force idle
    sched: Improve snapshotting of min_vruntime for CGroups
  Same with the RT priority inversion change:
    sched: Fix priority inversion of cookied task with sibling
    sched: Improve snapshotting of min_vruntime for CGroups
- Disable coresched on certain AMD HW.

Changes in v8
=============
- New interface/API implementation
  - Joel
- Revised kernel protection patch
  - Joel
- Revised Hotplug fixes
  - Joel
- Minor bug fixes and address review comments
  - Vineeth

Changes in v7
=============
- Kernel protection from untrusted usermode tasks
  - Joel, Vineeth
- Fix for hotplug crashes and hangs
  - Joel, Vineeth

Changes in v6
=============
- Documentation
  - Joel
- Pause siblings on entering nmi/irq/softirq
  - Joel, Vineeth
- Fix for RCU crash
  - Joel
- Fix for a crash in pick_next_task
  - Yu Chen, Vineeth
- Minor re-write of core-wide vruntime comparison
  - Aaron Lu
- Cleanup: Address Review comments
- Cleanup: Remove hotplug support (for now)
- Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
  - Joel, Vineeth

Changes in v5
=============
- Fixes for cgroup/process tagging during corner cases like cgroup
  destruction, tasks moving across cgroups, etc.
  - Tim Chen
- Coresched aware task migrations
  - Aubrey Li
- Other minor stability fixes.

Changes in v4
=============
- Implement a core-wide min_vruntime for vruntime comparison of tasks
  across CPUs in a core.
  - Aaron Lu
- Fixes a typo bug in setting the forced_idle cpu.
  - Aaron Lu

Changes in v3
=============
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
=============
- Fixes for a couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for processes on different CPUs
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

Future work
===========
- Load balancing/migration fixes for core scheduling.
  As of v6, load balancing is partially coresched-aware, but has some
  issues w.r.t. process/taskgroup weights:
  https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...

Aubrey Li (1):
sched: migration changes for core scheduling

Joel Fernandes (Google) (16):
sched/fair: Snapshot the min_vruntime of CPUs on force idle
sched: Enqueue task into core queue only after vruntime is updated
sched: Simplify the core pick loop for optimized case
sched: Improve snapshotting of min_vruntime for CGroups
arch/x86: Add a new TIF flag for untrusted tasks
kernel/entry: Add support for core-wide protection of kernel-mode
entry/idle: Enter and exit kernel protection during idle entry and
exit
sched: Split the cookie and setup per-task cookie on fork
sched: Add a per-thread core scheduling interface
sched: Release references to the per-task cookie on exit
sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG
kselftest: Add tests for core-sched interface
sched: Move core-scheduler interfacing code to a new file
Documentation: Add core scheduling documentation
sched: Add a coresched command line option
sched: Debug bits...

Josh Don (2):
sched: Refactor core cookie into struct
sched: Add a second-level tag for nested CGroup usecase

Peter Zijlstra (11):
sched: Wrap rq::lock access
sched: Introduce sched_class::pick_task()
sched/fair: Fix pick_task_fair crashes due to empty rbtree
sched: Core-wide rq->lock
sched/fair: Add a few assertions
sched: Basic tracking of matching tasks
sched: Add core wide task selection and scheduling.
sched: Fix priority inversion of cookied task with sibling
sched: Trivial forced-newidle balancer
irq_work: Cleanup
sched: CGroup tagging interface for core scheduling

Vineeth Pillai (2):
sched/fair: Fix forced idle sibling starvation corner case
entry/kvm: Protect the kernel when entering from guest

.../admin-guide/hw-vuln/core-scheduling.rst   |  330 +++++
Documentation/admin-guide/hw-vuln/index.rst   |    1 +
.../admin-guide/kernel-parameters.txt         |   25 +
arch/x86/include/asm/thread_info.h            |    2 +
arch/x86/kernel/cpu/bugs.c                    |   19 +
arch/x86/kvm/x86.c                            |    2 +
drivers/gpu/drm/i915/i915_request.c           |    4 +-
include/linux/cpu.h                           |    1 +
include/linux/entry-common.h                  |   30 +-
include/linux/entry-kvm.h                     |   12 +
include/linux/irq_work.h                      |   33 +-
include/linux/irqflags.h                      |    4 +-
include/linux/sched.h                         |   28 +-
include/linux/sched/smt.h                     |    4 +
include/uapi/linux/prctl.h                    |    3 +
kernel/Kconfig.preempt                        |    5 +
kernel/bpf/stackmap.c                         |    2 +-
kernel/cpu.c                                  |   43 +
kernel/entry/common.c                         |   28 +-
kernel/entry/kvm.c                            |   33 +
kernel/fork.c                                 |    1 +
kernel/irq_work.c                             |   18 +-
kernel/printk/printk.c                        |    6 +-
kernel/rcu/tree.c                             |    3 +-
kernel/sched/Makefile                         |    1 +
kernel/sched/core.c                           | 1278 +++++++++++++++--
kernel/sched/coretag.c                        |  819 +++++++++++
kernel/sched/cpuacct.c                        |   12 +-
kernel/sched/deadline.c                       |   38 +-
kernel/sched/debug.c                          |   12 +-
kernel/sched/fair.c                           |  313 +++-
kernel/sched/idle.c                           |   24 +-
kernel/sched/pelt.h                           |    2 +-
kernel/sched/rt.c                             |   31 +-
kernel/sched/sched.h                          |  315 +++-
kernel/sched/stop_task.c                      |   14 +-
kernel/sched/topology.c                       |    4 +-
kernel/sys.c                                  |    3 +
kernel/time/tick-sched.c                      |    6 +-
kernel/trace/bpf_trace.c                      |    2 +-
tools/include/uapi/linux/prctl.h              |    3 +
tools/testing/selftests/sched/.gitignore      |    1 +
tools/testing/selftests/sched/Makefile        |   14 +
tools/testing/selftests/sched/config          |    1 +
.../testing/selftests/sched/test_coresched.c  |  818 +++++++++++
45 files changed, 4033 insertions(+), 315 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
create mode 100644 kernel/sched/coretag.c
create mode 100644 tools/testing/selftests/sched/.gitignore
create mode 100644 tools/testing/selftests/sched/Makefile
create mode 100644 tools/testing/selftests/sched/config
create mode 100644 tools/testing/selftests/sched/test_coresched.c

--
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 01/32] sched: Wrap rq::lock access
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-19 23:31   ` Singh, Balbir
  2020-11-17 23:19 ` [PATCH -tip 02/32] sched: Introduce sched_class::pick_task() Joel Fernandes (Google)
                   ` (31 subsequent siblings)
  32 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

In preparation for playing games with rq->lock, abstract the thing
using an accessor.

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c     |  68 ++++++++++++-------------
 kernel/sched/cpuacct.c  |  12 ++---
 kernel/sched/deadline.c |  22 ++++----
 kernel/sched/debug.c    |   4 +-
 kernel/sched/fair.c     |  38 +++++++-------
 kernel/sched/idle.c     |   4 +-
 kernel/sched/pelt.h     |   2 +-
 kernel/sched/rt.c       |  16 +++---
 kernel/sched/sched.h    | 108 +++++++++++++++++++++-------------------
 kernel/sched/topology.c |   4 +-
 10 files changed, 141 insertions(+), 137 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a6aaf9fb3400..db5cc05a68bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -186,12 +186,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 
 	for (;;) {
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 
 		while (unlikely(task_on_rq_migrating(p)))
 			cpu_relax();
@@ -210,7 +210,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	for (;;) {
 		raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		/*
 		 *	move_queued_task()		task_rq_lock()
 		 *
@@ -232,7 +232,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 		raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 
 		while (unlikely(task_on_rq_migrating(p)))
@@ -302,7 +302,7 @@ void update_rq_clock(struct rq *rq)
 {
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (rq->clock_update_flags & RQCF_ACT_SKIP)
 		return;
@@ -611,7 +611,7 @@ void resched_curr(struct rq *rq)
 	struct task_struct *curr = rq->curr;
 	int cpu;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (test_tsk_need_resched(curr))
 		return;
@@ -635,10 +635,10 @@ void resched_cpu(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	if (cpu_online(cpu) || cpu == smp_processor_id())
 		resched_curr(rq);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 #ifdef CONFIG_SMP
@@ -1137,7 +1137,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
 	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
 	struct uclamp_bucket *bucket;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	/* Update task effective clamp */
 	p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
@@ -1177,7 +1177,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 	unsigned int bkt_clamp;
 	unsigned int rq_clamp;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	/*
 	 * If sched_uclamp_used was enabled after task @p was enqueued,
@@ -1807,7 +1807,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 				   struct task_struct *p, int new_cpu)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, new_cpu);
@@ -1973,7 +1973,7 @@ int push_cpu_stop(void *arg)
 	struct task_struct *p = arg;
 
 	raw_spin_lock_irq(&p->pi_lock);
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 
 	if (task_rq(p) != rq)
 		goto out_unlock;
@@ -2003,7 +2003,7 @@ int push_cpu_stop(void *arg)
 
 out_unlock:
 	rq->push_busy = false;
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	raw_spin_unlock_irq(&p->pi_lock);
 
 	put_task_struct(p);
@@ -2056,7 +2056,7 @@ __do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32
 		 * Because __kthread_bind() calls this on blocked tasks without
 		 * holding rq->lock.
 		 */
-		lockdep_assert_held(&rq->lock);
+		lockdep_assert_held(rq_lockp(rq));
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	}
 	if (running)
@@ -2395,7 +2395,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * task_rq_lock().
 	 */
 	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
-				      lockdep_is_held(&task_rq(p)->lock)));
+				      lockdep_is_held(rq_lockp(task_rq(p)))));
 #endif
 	/*
 	 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -2941,7 +2941,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 {
 	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
@@ -3941,7 +3941,7 @@ static void do_balance_callbacks(struct rq *rq, struct callback_head *head)
 	void (*func)(struct rq *rq);
 	struct callback_head *next;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	while (head) {
 		func = (void (*)(struct rq *))head->func;
@@ -3957,7 +3957,7 @@ static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
 {
 	struct callback_head *head = rq->balance_callback;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	if (head) {
 		rq->balance_callback = NULL;
 		rq->balance_flags &= ~BALANCE_WORK;
@@ -3976,9 +3976,9 @@ static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
 	unsigned long flags;
 
 	if (unlikely(head)) {
-		raw_spin_lock_irqsave(&rq->lock, flags);
+		raw_spin_lock_irqsave(rq_lockp(rq), flags);
 		do_balance_callbacks(rq, head);
-		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 	}
 }
 
@@ -4028,10 +4028,10 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
 	 * do an early lockdep release here:
 	 */
 	rq_unpin_lock(rq, rf);
-	spin_release(&rq->lock.dep_map, _THIS_IP_);
+	spin_release(&rq_lockp(rq)->dep_map, _THIS_IP_);
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
-	rq->lock.owner = next;
+	rq_lockp(rq)->owner = next;
 #endif
 }
 
@@ -4042,9 +4042,9 @@ static inline void finish_lock_switch(struct rq *rq)
 	 * fix up the runqueue lock - which gets 'carried over' from
 	 * prev into current:
 	 */
-	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
+	spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
 	balance_switch(rq);
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq(rq_lockp(rq));
 }
 
 /*
@@ -5023,7 +5023,7 @@ static void __sched notrace __schedule(bool preempt)
 
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq);
-		raw_spin_unlock_irq(&rq->lock);
+		raw_spin_unlock_irq(rq_lockp(rq));
 	}
 }
 
@@ -5438,7 +5438,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 
 	rq_unpin_lock(rq, &rf);
 	__balance_callbacks(rq);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 
 	preempt_enable();
 }
@@ -7020,7 +7020,7 @@ void init_idle(struct task_struct *idle, int cpu)
 	__sched_fork(0, idle);
 
 	raw_spin_lock_irqsave(&idle->pi_lock, flags);
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 
 	idle->state = TASK_RUNNING;
 	idle->se.exec_start = sched_clock();
@@ -7058,7 +7058,7 @@ void init_idle(struct task_struct *idle, int cpu)
 #ifdef CONFIG_SMP
 	idle->on_cpu = 1;
 #endif
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
 
 	/* Set the preempt count _outside_ the spinlocks! */
@@ -7221,7 +7221,7 @@ static void balance_push(struct rq *rq)
 {
 	struct task_struct *push_task = rq->curr;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	SCHED_WARN_ON(rq->cpu != smp_processor_id());
 
 	/*
@@ -7242,9 +7242,9 @@ static void balance_push(struct rq *rq)
 		 */
 		if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
 		    rcuwait_active(&rq->hotplug_wait)) {
-			raw_spin_unlock(&rq->lock);
+			raw_spin_unlock(rq_lockp(rq));
 			rcuwait_wake_up(&rq->hotplug_wait);
-			raw_spin_lock(&rq->lock);
+			raw_spin_lock(rq_lockp(rq));
 		}
 		return;
 	}
@@ -7254,7 +7254,7 @@ static void balance_push(struct rq *rq)
 	 * Temporarily drop rq->lock such that we can wake-up the stop task.
 	 * Both preemption and IRQs are still disabled.
 	 */
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
 			    this_cpu_ptr(&push_work));
 	/*
@@ -7262,7 +7262,7 @@ static void balance_push(struct rq *rq)
 	 * schedule(). The next pick is obviously going to be the stop task
 	 * which is_per_cpu_kthread() and will push this task away.
 	 */
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 }
 
 static void balance_push_set(int cpu, bool on)
@@ -7682,7 +7682,7 @@ void __init sched_init(void)
 		struct rq *rq;
 
 		rq = cpu_rq(i);
-		raw_spin_lock_init(&rq->lock);
+		raw_spin_lock_init(&rq->__lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
 		rq->calc_load_update = jiffies + LOAD_FREQ;
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 941c28cf9738..38c1a68e91f0 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -112,7 +112,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	/*
 	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	if (index == CPUACCT_STAT_NSTATS) {
@@ -126,7 +126,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	}
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	return data;
@@ -141,14 +141,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 	/*
 	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
 		cpuusage->usages[i] = val;
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 }
 
@@ -253,13 +253,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 			 * Take rq->lock to make 64-bit read safe on 32-bit
 			 * platforms.
 			 */
-			raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 			seq_printf(m, " %llu", cpuusage->usages[index]);
 
 #ifndef CONFIG_64BIT
-			raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 		}
 		seq_puts(m, "\n");
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2a5836f440e0..0f2ea0a3664c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -135,7 +135,7 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->running_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
 	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
@@ -148,7 +148,7 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->running_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
 	if (dl_rq->running_bw > old)
@@ -162,7 +162,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->this_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
 }
@@ -172,7 +172,7 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->this_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
 	if (dl_rq->this_bw > old)
@@ -982,7 +982,7 @@ static int start_dl_timer(struct task_struct *p)
 	ktime_t now, act;
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	/*
 	 * We want the timer to fire at the deadline, but considering
@@ -1092,9 +1092,9 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 		 * If the runqueue is no longer available, migrate the
 		 * task elsewhere. This necessarily changes rq.
 		 */
-		lockdep_unpin_lock(&rq->lock, rf.cookie);
+		lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
 		rq = dl_task_offline_migration(rq, p);
-		rf.cookie = lockdep_pin_lock(&rq->lock);
+		rf.cookie = lockdep_pin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		/*
@@ -1746,7 +1746,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
 	 * from try_to_wake_up(). Hence, p->pi_lock is locked, but
 	 * rq->lock is not... So, lock it
 	 */
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	if (p->dl.dl_non_contending) {
 		sub_running_bw(&p->dl, &rq->dl);
 		p->dl.dl_non_contending = 0;
@@ -1761,7 +1761,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
 			put_task_struct(p);
 	}
 	sub_rq_bw(&p->dl, &rq->dl);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
@@ -2306,10 +2306,10 @@ static void pull_dl_task(struct rq *this_rq)
 		double_unlock_balance(this_rq, src_rq);
 
 		if (push_task) {
-			raw_spin_unlock(&this_rq->lock);
+			raw_spin_unlock(rq_lockp(this_rq));
 			stop_one_cpu_nowait(src_rq->cpu, push_cpu_stop,
 					    push_task, &src_rq->push_work);
-			raw_spin_lock(&this_rq->lock);
+			raw_spin_lock(rq_lockp(this_rq));
 		}
 	}
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2357921580f9..60a922d3f46f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -551,7 +551,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "exec_clock",
 			SPLIT_NS(cfs_rq->exec_clock));
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	if (rb_first_cached(&cfs_rq->tasks_timeline))
 		MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
 	last = __pick_last_entity(cfs_rq);
@@ -559,7 +559,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 		max_vruntime = last->vruntime;
 	min_vruntime = cfs_rq->min_vruntime;
 	rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
 			SPLIT_NS(MIN_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 27536f37ba1a..52ddfec7cea6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1110,7 +1110,7 @@ struct numa_group {
 static struct numa_group *deref_task_numa_group(struct task_struct *p)
 {
 	return rcu_dereference_check(p->numa_group, p == current ||
-		(lockdep_is_held(&task_rq(p)->lock) && !READ_ONCE(p->on_cpu)));
+		(lockdep_is_held(rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu)));
 }
 
 static struct numa_group *deref_curr_numa_group(struct task_struct *p)
@@ -5309,7 +5309,7 @@ static void __maybe_unused update_runtime_enabled(struct rq *rq)
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -5328,7 +5328,7 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -6813,7 +6813,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
 		 * rq->lock and can modify state directly.
 		 */
-		lockdep_assert_held(&task_rq(p)->lock);
+		lockdep_assert_held(rq_lockp(task_rq(p)));
 		detach_entity_cfs_rq(&p->se);
 
 	} else {
@@ -7441,7 +7441,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 {
 	s64 delta;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	if (p->sched_class != &fair_sched_class)
 		return 0;
@@ -7539,7 +7539,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
 	int tsk_cache_hot;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	/*
 	 * We do not migrate tasks that are:
@@ -7617,7 +7617,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
  */
 static void detach_task(struct task_struct *p, struct lb_env *env)
 {
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, env->dst_cpu);
@@ -7633,7 +7633,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
 {
 	struct task_struct *p;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	list_for_each_entry_reverse(p,
 			&env->src_rq->cfs_tasks, se.group_node) {
@@ -7669,7 +7669,7 @@ static int detach_tasks(struct lb_env *env)
 	struct task_struct *p;
 	int detached = 0;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	if (env->imbalance <= 0)
 		return 0;
@@ -7791,7 +7791,7 @@ static int detach_tasks(struct lb_env *env)
  */
 static void attach_task(struct rq *rq, struct task_struct *p)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	BUG_ON(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
@@ -9726,7 +9726,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		if (need_active_balance(&env)) {
 			unsigned long flags;
 
-			raw_spin_lock_irqsave(&busiest->lock, flags);
+			raw_spin_lock_irqsave(rq_lockp(busiest), flags);
 
 			/*
 			 * Don't kick the active_load_balance_cpu_stop,
@@ -9734,7 +9734,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			 * moved to this_cpu:
 			 */
 			if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
-				raw_spin_unlock_irqrestore(&busiest->lock,
+				raw_spin_unlock_irqrestore(rq_lockp(busiest),
 							    flags);
 				env.flags |= LBF_ALL_PINNED;
 				goto out_one_pinned;
@@ -9750,7 +9750,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 				busiest->push_cpu = this_cpu;
 				active_balance = 1;
 			}
-			raw_spin_unlock_irqrestore(&busiest->lock, flags);
+			raw_spin_unlock_irqrestore(rq_lockp(busiest), flags);
 
 			if (active_balance) {
 				stop_one_cpu_nowait(cpu_of(busiest),
@@ -10506,7 +10506,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
 		return;
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 	/*
 	 * This CPU is going to be idle and blocked load of idle CPUs
 	 * need to be updated. Run the ilb locally as it is a good
@@ -10515,7 +10515,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	 */
 	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
 		kick_ilb(NOHZ_STATS_KICK);
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock(rq_lockp(this_rq));
 }
 
 #else /* !CONFIG_NO_HZ_COMMON */
@@ -10581,7 +10581,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		goto out;
 	}
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 
 	update_blocked_averages(this_cpu);
 	rcu_read_lock();
@@ -10619,7 +10619,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	}
 	rcu_read_unlock();
 
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock(rq_lockp(this_rq));
 
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
@@ -11095,9 +11095,9 @@ void unregister_fair_sched_group(struct task_group *tg)
 
 		rq = cpu_rq(cpu);
 
-		raw_spin_lock_irqsave(&rq->lock, flags);
+		raw_spin_lock_irqsave(rq_lockp(rq), flags);
 		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
-		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 	}
 }
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index df91b198a74c..50e128b899c4 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -422,10 +422,10 @@ struct task_struct *pick_next_task_idle(struct rq *rq)
 static void
 dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
 {
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq(rq_lockp(rq));
 	printk(KERN_ERR "bad: scheduling from the idle thread!\n");
 	dump_stack();
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq(rq_lockp(rq));
 }
 
 /*
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 795e43e02afc..e850bd71a8ce 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -141,7 +141,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
 
 static inline u64 rq_clock_pelt(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock_pelt - rq->lost_idle_time;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index dbe4629cf7ba..a6f9d132c24f 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -888,7 +888,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 		if (skip)
 			continue;
 
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		if (rt_rq->rt_time) {
@@ -926,7 +926,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 
 		if (enqueue)
 			sched_rt_rq_enqueue(rt_rq);
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 	}
 
 	if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
@@ -1894,10 +1894,10 @@ static int push_rt_task(struct rq *rq, bool pull)
 		 */
 		push_task = get_push_task(rq);
 		if (push_task) {
-			raw_spin_unlock(&rq->lock);
+			raw_spin_unlock(rq_lockp(rq));
 			stop_one_cpu_nowait(rq->cpu, push_cpu_stop,
 					    push_task, &rq->push_work);
-			raw_spin_lock(&rq->lock);
+			raw_spin_lock(rq_lockp(rq));
 		}
 
 		return 0;
@@ -2122,10 +2122,10 @@ void rto_push_irq_work_func(struct irq_work *work)
 	 * When it gets updated, a check is made if a push is possible.
 	 */
 	if (has_pushable_tasks(rq)) {
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		while (push_rt_task(rq, true))
 			;
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 	}
 
 	raw_spin_lock(&rd->rto_lock);
@@ -2243,10 +2243,10 @@ static void pull_rt_task(struct rq *this_rq)
 		double_unlock_balance(this_rq, src_rq);
 
 		if (push_task) {
-			raw_spin_unlock(&this_rq->lock);
+			raw_spin_unlock(rq_lockp(this_rq));
 			stop_one_cpu_nowait(src_rq->cpu, push_cpu_stop,
 					    push_task, &src_rq->push_work);
-			raw_spin_lock(&this_rq->lock);
+			raw_spin_lock(rq_lockp(this_rq));
 		}
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 590e6f27068c..f794c9337047 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -896,7 +896,7 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
  */
 struct rq {
 	/* runqueue lock: */
-	raw_spinlock_t		lock;
+	raw_spinlock_t		__lock;
 
 	/*
 	 * nr_running and cpu_load should be in the same cacheline because
@@ -1099,6 +1099,11 @@ static inline bool is_migration_disabled(struct task_struct *p)
 #endif
 }
 
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	return &rq->__lock;
+}
+
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
 
@@ -1165,7 +1170,7 @@ static inline void assert_clock_updated(struct rq *rq)
 
 static inline u64 rq_clock(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock;
@@ -1173,7 +1178,7 @@ static inline u64 rq_clock(struct rq *rq)
 
 static inline u64 rq_clock_task(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock_task;
@@ -1199,7 +1204,7 @@ static inline u64 rq_clock_thermal(struct rq *rq)
 
 static inline void rq_clock_skip_update(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	rq->clock_update_flags |= RQCF_REQ_SKIP;
 }
 
@@ -1209,7 +1214,7 @@ static inline void rq_clock_skip_update(struct rq *rq)
  */
 static inline void rq_clock_cancel_skipupdate(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	rq->clock_update_flags &= ~RQCF_REQ_SKIP;
 }
 
@@ -1238,7 +1243,7 @@ struct rq_flags {
  */
 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	rf->cookie = lockdep_pin_lock(&rq->lock);
+	rf->cookie = lockdep_pin_lock(rq_lockp(rq));
 
 #ifdef CONFIG_SCHED_DEBUG
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
@@ -1256,12 +1261,12 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
 		rf->clock_update_flags = RQCF_UPDATED;
 #endif
 
-	lockdep_unpin_lock(&rq->lock, rf->cookie);
+	lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
 }
 
 static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	lockdep_repin_lock(&rq->lock, rf->cookie);
+	lockdep_repin_lock(rq_lockp(rq), rf->cookie);
 
 #ifdef CONFIG_SCHED_DEBUG
 	/*
@@ -1282,7 +1287,7 @@ static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static inline void
@@ -1291,7 +1296,7 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 	__releases(p->pi_lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 }
 
@@ -1299,7 +1304,7 @@ static inline void
 rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irqsave(&rq->lock, rf->flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), rf->flags);
 	rq_pin_lock(rq, rf);
 }
 
@@ -1307,7 +1312,7 @@ static inline void
 rq_lock_irq(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq(rq_lockp(rq));
 	rq_pin_lock(rq, rf);
 }
 
@@ -1315,7 +1320,7 @@ static inline void
 rq_lock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	rq_pin_lock(rq, rf);
 }
 
@@ -1323,7 +1328,7 @@ static inline void
 rq_relock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	rq_repin_lock(rq, rf);
 }
 
@@ -1332,7 +1337,7 @@ rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), rf->flags);
 }
 
 static inline void
@@ -1340,7 +1345,7 @@ rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq(rq_lockp(rq));
 }
 
 static inline void
@@ -1348,7 +1353,7 @@ rq_unlock(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static inline struct rq *
@@ -1416,7 +1421,7 @@ queue_balance_callback(struct rq *rq,
 		       struct callback_head *head,
 		       void (*func)(struct rq *rq))
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (unlikely(head->next || (rq->balance_flags & BALANCE_PUSH)))
 		return;
@@ -1959,7 +1964,7 @@ static inline struct task_struct *get_push_task(struct rq *rq)
 {
 	struct task_struct *p = rq->curr;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (rq->push_busy)
 		return NULL;
@@ -2167,7 +2172,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 	double_rq_lock(this_rq, busiest);
 
 	return 1;
@@ -2186,20 +2191,22 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	int ret = 0;
-
-	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
-		if (busiest < this_rq) {
-			raw_spin_unlock(&this_rq->lock);
-			raw_spin_lock(&busiest->lock);
-			raw_spin_lock_nested(&this_rq->lock,
-					      SINGLE_DEPTH_NESTING);
-			ret = 1;
-		} else
-			raw_spin_lock_nested(&busiest->lock,
-					      SINGLE_DEPTH_NESTING);
+	if (rq_lockp(this_rq) == rq_lockp(busiest))
+		return 0;
+
+	if (likely(raw_spin_trylock(rq_lockp(busiest))))
+		return 0;
+
+	if (rq_lockp(busiest) >= rq_lockp(this_rq)) {
+		raw_spin_lock_nested(rq_lockp(busiest), SINGLE_DEPTH_NESTING);
+		return 0;
 	}
-	return ret;
+
+	raw_spin_unlock(rq_lockp(this_rq));
+	raw_spin_lock(rq_lockp(busiest));
+	raw_spin_lock_nested(rq_lockp(this_rq), SINGLE_DEPTH_NESTING);
+
+	return 1;
 }
 
 #endif /* CONFIG_PREEMPTION */
@@ -2209,11 +2216,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
  */
 static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 {
-	if (unlikely(!irqs_disabled())) {
-		/* printk() doesn't work well under rq->lock */
-		raw_spin_unlock(&this_rq->lock);
-		BUG_ON(1);
-	}
+	lockdep_assert_irqs_disabled();
 
 	return _double_lock_balance(this_rq, busiest);
 }
@@ -2221,8 +2224,9 @@ static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
 	__releases(busiest->lock)
 {
-	raw_spin_unlock(&busiest->lock);
-	lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
+	if (rq_lockp(this_rq) != rq_lockp(busiest))
+		raw_spin_unlock(rq_lockp(busiest));
+	lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
 }
 
 static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
@@ -2263,16 +2267,16 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 	__acquires(rq2->lock)
 {
 	BUG_ON(!irqs_disabled());
-	if (rq1 == rq2) {
-		raw_spin_lock(&rq1->lock);
+	if (rq_lockp(rq1) == rq_lockp(rq2)) {
+		raw_spin_lock(rq_lockp(rq1));
 		__acquire(rq2->lock);	/* Fake it out ;) */
 	} else {
-		if (rq1 < rq2) {
-			raw_spin_lock(&rq1->lock);
-			raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
+		if (rq_lockp(rq1) < rq_lockp(rq2)) {
+			raw_spin_lock(rq_lockp(rq1));
+			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
 		} else {
-			raw_spin_lock(&rq2->lock);
-			raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
+			raw_spin_lock(rq_lockp(rq2));
+			raw_spin_lock_nested(rq_lockp(rq1), SINGLE_DEPTH_NESTING);
 		}
 	}
 }
@@ -2287,9 +2291,9 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	raw_spin_unlock(&rq1->lock);
-	if (rq1 != rq2)
-		raw_spin_unlock(&rq2->lock);
+	raw_spin_unlock(rq_lockp(rq1));
+	if (rq_lockp(rq1) != rq_lockp(rq2))
+		raw_spin_unlock(rq_lockp(rq2));
 	else
 		__release(rq2->lock);
 }
@@ -2312,7 +2316,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 {
 	BUG_ON(!irqs_disabled());
 	BUG_ON(rq1 != rq2);
-	raw_spin_lock(&rq1->lock);
+	raw_spin_lock(rq_lockp(rq1));
 	__acquire(rq2->lock);	/* Fake it out ;) */
 }
 
@@ -2327,7 +2331,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq2->lock)
 {
 	BUG_ON(rq1 != rq2);
-	raw_spin_unlock(&rq1->lock);
+	raw_spin_unlock(rq_lockp(rq1));
 	__release(rq2->lock);
 }
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 90f3e5558fa2..82924db74ccb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -454,7 +454,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	struct root_domain *old_rd = NULL;
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 
 	if (rq->rd) {
 		old_rd = rq->rd;
@@ -480,7 +480,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
 		set_rq_online(rq);
 
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 
 	if (old_rd)
 		call_rcu(&old_rd->rcu, free_rootdomain);
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH -tip 02/32] sched: Introduce sched_class::pick_task()
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 01/32] sched: Wrap rq::lock access Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-19 23:56   ` Singh, Balbir
  2020-11-25 16:28   ` Vincent Guittot
  2020-11-17 23:19 ` [PATCH -tip 03/32] sched/fair: Fix pick_task_fair crashes due to empty rbtree Joel Fernandes (Google)
                   ` (30 subsequent siblings)
  32 siblings, 2 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance()), it is not state-invariant. This makes it
unsuitable for remote task selection.
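The resulting split can be sketched as follows. This is a minimal model, not the kernel code: `struct task`, `pick_task()` and the `picked_as_next` flag are illustrative stand-ins, but the shape mirrors how pick_next_task_{dl,rt,stop} are refactored below, with a side-effect-free selection step that a remote CPU may safely call, and a local commit step.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for a runqueue's first runnable task. */
struct task {
	int picked_as_next;
};

/* Pure selection: no rq or task state is modified, so a sibling CPU
 * can call this to see what this rq *would* run (state-invariant). */
static struct task *pick_task(struct task *first_runnable)
{
	return first_runnable;
}

/* The state-mutating commit, done only by the local pick. */
static void set_next_task(struct task *p)
{
	p->picked_as_next = 1;
}

/* pick_next_task() becomes pick + commit, as in the patch below. */
static struct task *pick_next_task(struct task *first_runnable)
{
	struct task *p = pick_task(first_runnable);

	if (p)
		set_next_task(p);
	return p;
}
```

The key property is that calling pick_task() repeatedly, from any CPU, leaves the runqueue unchanged; only the eventual local pick_next_task() commits state.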

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/deadline.c  | 16 ++++++++++++++--
 kernel/sched/fair.c      | 32 +++++++++++++++++++++++++++++++-
 kernel/sched/idle.c      |  8 ++++++++
 kernel/sched/rt.c        | 15 +++++++++++++--
 kernel/sched/sched.h     |  3 +++
 kernel/sched/stop_task.c | 14 ++++++++++++--
 6 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0f2ea0a3664c..abfc8b505d0d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1867,7 +1867,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *pick_next_task_dl(struct rq *rq)
+static struct task_struct *pick_task_dl(struct rq *rq)
 {
 	struct sched_dl_entity *dl_se;
 	struct dl_rq *dl_rq = &rq->dl;
@@ -1879,7 +1879,18 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 	p = dl_task_of(dl_se);
-	set_next_task_dl(rq, p, true);
+
+	return p;
+}
+
+static struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+	struct task_struct *p;
+
+	p = pick_task_dl(rq);
+	if (p)
+		set_next_task_dl(rq, p, true);
+
 	return p;
 }
 
@@ -2551,6 +2562,7 @@ DEFINE_SCHED_CLASS(dl) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_dl,
+	.pick_task		= pick_task_dl,
 	.select_task_rq		= select_task_rq_dl,
 	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52ddfec7cea6..12cf068eeec8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4459,7 +4459,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	 * Avoid running the skip buddy, if running something else can
 	 * be done without getting too unfair.
 	 */
-	if (cfs_rq->skip == se) {
+	if (cfs_rq->skip && cfs_rq->skip == se) {
 		struct sched_entity *second;
 
 		if (se == curr) {
@@ -7017,6 +7017,35 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		set_last_buddy(se);
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_fair(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq = &rq->cfs;
+	struct sched_entity *se;
+
+	if (!cfs_rq->nr_running)
+		return NULL;
+
+	do {
+		struct sched_entity *curr = cfs_rq->curr;
+
+		se = pick_next_entity(cfs_rq, NULL);
+
+		if (curr) {
+			if (se && curr->on_rq)
+				update_curr(cfs_rq);
+
+			if (!se || entity_before(curr, se))
+				se = curr;
+		}
+
+		cfs_rq = group_cfs_rq(se);
+	} while (cfs_rq);
+
+	return task_of(se);
+}
+#endif
+
 struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -11219,6 +11248,7 @@ DEFINE_SCHED_CLASS(fair) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_fair,
+	.pick_task		= pick_task_fair,
 	.select_task_rq		= select_task_rq_fair,
 	.migrate_task_rq	= migrate_task_rq_fair,
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 50e128b899c4..33864193a2f9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -406,6 +406,13 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 	schedstat_inc(rq->sched_goidle);
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_idle(struct rq *rq)
+{
+	return rq->idle;
+}
+#endif
+
 struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	struct task_struct *next = rq->idle;
@@ -473,6 +480,7 @@ DEFINE_SCHED_CLASS(idle) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_idle,
+	.pick_task		= pick_task_idle,
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a6f9d132c24f..a0e245b0c4bd 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1626,7 +1626,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	return rt_task_of(rt_se);
 }
 
-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct *pick_task_rt(struct rq *rq)
 {
 	struct task_struct *p;
 
@@ -1634,7 +1634,17 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
 		return NULL;
 
 	p = _pick_next_task_rt(rq);
-	set_next_task_rt(rq, p, true);
+
+	return p;
+}
+
+static struct task_struct *pick_next_task_rt(struct rq *rq)
+{
+	struct task_struct *p = pick_task_rt(rq);
+
+	if (p)
+		set_next_task_rt(rq, p, true);
+
 	return p;
 }
 
@@ -2483,6 +2493,7 @@ DEFINE_SCHED_CLASS(rt) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_rt,
+	.pick_task		= pick_task_rt,
 	.select_task_rq		= select_task_rq_rt,
 	.set_cpus_allowed       = set_cpus_allowed_common,
 	.rq_online              = rq_online_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f794c9337047..5a0dd2b312aa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1839,6 +1839,9 @@ struct sched_class {
 #ifdef CONFIG_SMP
 	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
+
+	struct task_struct * (*pick_task)(struct rq *rq);
+
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
 	void (*task_woken)(struct rq *this_rq, struct task_struct *task);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 55f39125c0e1..f988ebe3febb 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -34,15 +34,24 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool fir
 	stop->se.exec_start = rq_clock_task(rq);
 }
 
-static struct task_struct *pick_next_task_stop(struct rq *rq)
+static struct task_struct *pick_task_stop(struct rq *rq)
 {
 	if (!sched_stop_runnable(rq))
 		return NULL;
 
-	set_next_task_stop(rq, rq->stop, true);
 	return rq->stop;
 }
 
+static struct task_struct *pick_next_task_stop(struct rq *rq)
+{
+	struct task_struct *p = pick_task_stop(rq);
+
+	if (p)
+		set_next_task_stop(rq, p, true);
+
+	return p;
+}
+
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
@@ -123,6 +132,7 @@ DEFINE_SCHED_CLASS(stop) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_stop,
+	.pick_task		= pick_task_stop,
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH -tip 03/32] sched/fair: Fix pick_task_fair crashes due to empty rbtree
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 01/32] sched: Wrap rq::lock access Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 02/32] sched: Introduce sched_class::pick_task() Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-20 10:15   ` Singh, Balbir
  2020-11-17 23:19 ` [PATCH -tip 04/32] sched: Core-wide rq->lock Joel Fernandes (Google)
                   ` (29 subsequent siblings)
  32 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

During core scheduling, pick_next_entity() is passed curr == NULL. Due
to this, if the rbtree is empty, the 'left' variable is set to NULL
within the function, which can cause it to crash.

This is not an issue if put_prev_task() is invoked on the currently
running task before calling pick_next_entity(). However, in core
scheduling, it is possible that a sibling CPU picks for another RQ in
the core, via pick_task_fair(). This remote sibling would not get any
opportunities to do a put_prev_task().

Fix it by refactoring pick_task_fair() such that pick_next_entity() is
called with the cfs_rq->curr. This will prevent pick_next_entity() from
crashing if its rbtree is empty.

Also, this fixes another possible bug where update_curr() would not be
called on the cfs_rq hierarchy if the rbtree is empty. This could affect
cross-cpu comparison of vruntime.
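The hazard and the fix can be modelled in a few lines. This is a toy sketch, not the kernel's pick_next_entity(): `pick_entity` and `leftmost` are illustrative names (the real code walks an rbtree), but the fallback logic is the point, i.e. when curr is passed in, an empty tree yields curr instead of NULL.

```c
#include <assert.h>
#include <stddef.h>

struct entity {
	unsigned long vruntime;
};

/* Same signed-wraparound comparison style as the scheduler's
 * entity_before(). */
static int entity_before(const struct entity *a, const struct entity *b)
{
	return (long)(a->vruntime - b->vruntime) < 0;
}

/* 'leftmost' stands in for the rbtree's leftmost entity (NULL when the
 * tree is empty). Pre-fix, the remote pick path called this with
 * curr == NULL, so an empty tree returned NULL and the caller crashed
 * dereferencing it. With curr passed in (the fix), either a runnable
 * entity or curr is returned whenever one exists. */
static struct entity *pick_entity(struct entity *leftmost,
				  struct entity *curr)
{
	struct entity *se = leftmost;

	if (!se || (curr && entity_before(curr, se)))
		se = curr;

	return se;
}
```

Only when both the tree is empty and curr is NULL (i.e. truly nothing runnable) does NULL come back, which the caller's nr_running check already guards against.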

Suggested-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12cf068eeec8..51483a00a755 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7029,15 +7029,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 	do {
 		struct sched_entity *curr = cfs_rq->curr;
 
-		se = pick_next_entity(cfs_rq, NULL);
-
-		if (curr) {
-			if (se && curr->on_rq)
-				update_curr(cfs_rq);
+		if (curr && curr->on_rq)
+			update_curr(cfs_rq);
 
-			if (!se || entity_before(curr, se))
-				se = curr;
-		}
+		se = pick_next_entity(cfs_rq, curr);
 
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH -tip 04/32] sched: Core-wide rq->lock
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (2 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 03/32] sched/fair: Fix pick_task_fair crashes due to empty rbtree Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-22  9:11   ` Balbir Singh
  2020-11-17 23:19 ` [PATCH -tip 05/32] sched/fair: Add a few assertions Joel Fernandes (Google)
                   ` (28 subsequent siblings)
  32 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Introduce the basic infrastructure to have a core-wide rq->lock.
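The central idea is the rq_lockp() indirection introduced in patch 01 and extended here: each rq keeps its own __lock, but once core scheduling is enabled, every sibling rq's rq_lockp() resolves to the designated core rq's lock, so all locking sites that were converted to rq_lockp() automatically serialize the whole SMT core on one lock. A minimal model (assumption: `raw_spinlock_t` reduced to a placeholder type; the real switch-over also needs the static key plus stop_machine() dance in the patch so all CPUs flip atomically):

```c
#include <assert.h>
#include <stdbool.h>

typedef int raw_spinlock_t;	/* placeholder for the real lock type */

struct rq {
	raw_spinlock_t	__lock;
	struct rq	*core;		/* leader rq of this SMT core */
	bool		core_enabled;
};

/* Mirrors the patch's rq_lockp(): per-rq lock normally, the shared
 * core rq's lock when core scheduling is on. */
static raw_spinlock_t *rq_lockp(struct rq *rq)
{
	if (rq->core_enabled)
		return &rq->core->__lock;

	return &rq->__lock;
}
```

With this shape, double_rq_lock()/double_lock_balance() between two siblings degenerate to a single acquisition, which is why patch 01 changed their comparisons from rq pointers to rq_lockp() values.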

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/Kconfig.preempt |   5 ++
 kernel/sched/core.c    | 108 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h   |  31 ++++++++++++
 3 files changed, 144 insertions(+)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index bf82259cff96..6d8be4630bd6 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -80,3 +80,8 @@ config PREEMPT_COUNT
 config PREEMPTION
        bool
        select PREEMPT_COUNT
+
+config SCHED_CORE
+	bool "Core Scheduling for SMT"
+	default y
+	depends on SCHED_SMT
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db5cc05a68bc..6d88bc9a6818 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,70 @@ unsigned int sysctl_sched_rt_period = 1000000;
 
 __read_mostly int scheduler_running;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ *	spin_lock(rq_lockp(rq));
+ *	...
+ *	spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+	bool enabled = !!(unsigned long)data;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		cpu_rq(cpu)->core_enabled = enabled;
+
+	return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+	// XXX verify there are no cookie tasks (yet)
+
+	static_branch_enable(&__sched_core_enabled);
+	stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+	// XXX verify there are no cookie tasks (left)
+
+	stop_machine(__sched_core_stopper, (void *)false, NULL);
+	static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!sched_core_count++)
+		__sched_core_enable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!--sched_core_count)
+		__sched_core_disable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * part of the period that we allow rt tasks to run in us.
  * default: 0.95s
@@ -4859,6 +4923,42 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline void sched_core_cpu_starting(unsigned int cpu)
+{
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	struct rq *rq, *core_rq = NULL;
+	int i;
+
+	core_rq = cpu_rq(cpu)->core;
+
+	if (!core_rq) {
+		for_each_cpu(i, smt_mask) {
+			rq = cpu_rq(i);
+			if (rq->core && rq->core == rq)
+				core_rq = rq;
+		}
+
+		if (!core_rq)
+			core_rq = cpu_rq(cpu);
+
+		for_each_cpu(i, smt_mask) {
+			rq = cpu_rq(i);
+
+			WARN_ON_ONCE(rq->core && rq->core != core_rq);
+			rq->core = core_rq;
+		}
+	}
+
+	printk("core: %d -> %d\n", cpu, cpu_of(core_rq));
+}
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_cpu_starting(unsigned int cpu) {}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -7484,6 +7584,9 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 
 int sched_cpu_starting(unsigned int cpu)
 {
+
+	sched_core_cpu_starting(cpu);
+
 	sched_rq_cpu_starting(cpu);
 	sched_tick_start(cpu);
 	return 0;
@@ -7747,6 +7850,11 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+		rq->core = NULL;
+		rq->core_enabled = 0;
+#endif
 	}
 
 	set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5a0dd2b312aa..0dfccf988998 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1061,6 +1061,12 @@ struct rq {
 #endif
 	unsigned int		push_busy;
 	struct cpu_stop_work	push_work;
+
+#ifdef CONFIG_SCHED_CORE
+	/* per rq */
+	struct rq		*core;
+	unsigned int		core_enabled;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1099,11 +1105,36 @@ static inline bool is_migration_disabled(struct task_struct *p)
 #endif
 }
 
+#ifdef CONFIG_SCHED_CORE
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	if (sched_core_enabled(rq))
+		return &rq->core->__lock;
+
+	return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return false;
+}
+
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
 	return &rq->__lock;
 }
 
+#endif /* CONFIG_SCHED_CORE */
+
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
 
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH -tip 05/32] sched/fair: Add a few assertions
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (3 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 04/32] sched: Core-wide rq->lock Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 06/32] sched: Basic tracking of matching tasks Joel Fernandes (Google)
                   ` (27 subsequent siblings)
  32 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 51483a00a755..ca35bfc0a368 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6245,6 +6245,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		task_util = uclamp_task_util(p);
 	}
 
+	/*
+	 * per-cpu select_idle_mask usage
+	 */
+	lockdep_assert_irqs_disabled();
+
 	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
 	    asym_fits_capacity(task_util, target))
 		return target;
@@ -6710,8 +6715,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
  * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
  *
  * Returns the target CPU number.
- *
- * preempt must be disabled.
  */
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
@@ -6724,6 +6727,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
 
+	/*
+	 * required for stable ->cpus_allowed
+	 */
+	lockdep_assert_held(&p->pi_lock);
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 06/32] sched: Basic tracking of matching tasks
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (4 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 05/32] sched/fair: Add a few assertions Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 07/32] sched: Add core wide task selection and scheduling Joel Fernandes (Google)
                   ` (26 subsequent siblings)
  32 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled, core scheduling will only allow matching
tasks to be on the core, where the idle task matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled)
these tasks are indexed in a second RB-tree, first on cookie value
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched.h |   8 ++-
 kernel/sched/core.c   | 146 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   |  46 -------------
 kernel/sched/sched.h  |  55 ++++++++++++++++
 4 files changed, 208 insertions(+), 47 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7abbdd7f3884..344499ab29f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -683,10 +683,16 @@ struct task_struct {
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
+	struct sched_dl_entity		dl;
+
+#ifdef CONFIG_SCHED_CORE
+	struct rb_node			core_node;
+	unsigned long			core_cookie;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
-	struct sched_dl_entity		dl;
 
 #ifdef CONFIG_UCLAMP_TASK
 	/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6d88bc9a6818..9d521033777f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -78,6 +78,141 @@ __read_mostly int scheduler_running;
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+	if (p->sched_class == &stop_sched_class) /* trumps deadline */
+		return -2;
+
+	if (rt_prio(p->prio)) /* includes deadline */
+		return p->prio; /* [-1, 99] */
+
+	if (p->sched_class == &idle_sched_class)
+		return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+	return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b)  := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+{
+
+	int pa = __task_prio(a), pb = __task_prio(b);
+
+	if (-pa < -pb)
+		return true;
+
+	if (-pb < -pa)
+		return false;
+
+	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+		return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
+		u64 vruntime = b->se.vruntime;
+
+		/*
+		 * Normalize the vruntime if tasks are on different CPUs.
+		 */
+		if (task_cpu(a) != task_cpu(b)) {
+			vruntime -= task_cfs_rq(b)->min_vruntime;
+			vruntime += task_cfs_rq(a)->min_vruntime;
+		}
+
+		return !((s64)(a->se.vruntime - vruntime) <= 0);
+	}
+
+	return false;
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
+{
+	if (a->core_cookie < b->core_cookie)
+		return true;
+
+	if (a->core_cookie > b->core_cookie)
+		return false;
+
+	/* flip prio, so high prio is leftmost */
+	if (prio_less(b, a))
+		return true;
+
+	return false;
+}
+
+static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+	struct rb_node *parent, **node;
+	struct task_struct *node_task;
+
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	node = &rq->core_tree.rb_node;
+	parent = *node;
+
+	while (*node) {
+		node_task = container_of(*node, struct task_struct, core_node);
+		parent = *node;
+
+		if (__sched_core_less(p, node_task))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&p->core_node, parent, node);
+	rb_insert_color(&p->core_node, &rq->core_tree);
+}
+
+static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	rb_erase(&p->core_node, &rq->core_tree);
+}
+
+/*
+ * Find left-most (aka, highest priority) task matching @cookie.
+ */
+static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
+{
+	struct rb_node *node = rq->core_tree.rb_node;
+	struct task_struct *node_task, *match;
+
+	/*
+	 * The idle task always matches any cookie!
+	 */
+	match = idle_sched_class.pick_task(rq);
+
+	while (node) {
+		node_task = container_of(node, struct task_struct, core_node);
+
+		if (cookie < node_task->core_cookie) {
+			node = node->rb_left;
+		} else if (cookie > node_task->core_cookie) {
+			node = node->rb_right;
+		} else {
+			match = node_task;
+			node = node->rb_left;
+		}
+	}
+
+	return match;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -136,6 +271,11 @@ void sched_core_put(void)
 	mutex_unlock(&sched_core_mutex);
 }
 
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -1624,6 +1764,9 @@ static inline void init_uclamp(void) { }
 
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_enqueue(rq, p);
+
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
@@ -1638,6 +1781,9 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 
 static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_dequeue(rq, p);
+
 	if (!(flags & DEQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ca35bfc0a368..f53681cd263e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -258,33 +258,11 @@ const struct sched_class fair_sched_class;
  */
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	SCHED_WARN_ON(!entity_is_task(se));
-	return container_of(se, struct task_struct, se);
-}
 
 /* Walk up scheduling entities hierarchy */
 #define for_each_sched_entity(se) \
 		for (; se; se = se->parent)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return p->se.cfs_rq;
-}
-
-/* runqueue on which this entity is (to be) queued */
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	return se->cfs_rq;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return grp->my_q;
-}
-
 static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
 {
 	if (!path)
@@ -445,33 +423,9 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	return container_of(se, struct task_struct, se);
-}
-
 #define for_each_sched_entity(se) \
 		for (; se; se = NULL)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return &task_rq(p)->cfs;
-}
-
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	struct task_struct *p = task_of(se);
-	struct rq *rq = task_rq(p);
-
-	return &rq->cfs;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return NULL;
-}
-
 static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
 {
 	if (path)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0dfccf988998..8ee0ca8ee5c3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1066,6 +1066,10 @@ struct rq {
 	/* per rq */
 	struct rq		*core;
 	unsigned int		core_enabled;
+	struct rb_root		core_tree;
+
+	/* shared state */
+	unsigned int		core_task_seq;
 #endif
 };
 
@@ -1156,6 +1160,57 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	SCHED_WARN_ON(!entity_is_task(se));
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return p->se.cfs_rq;
+}
+
+/* runqueue on which this entity is (to be) queued */
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	return se->cfs_rq;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return grp->my_q;
+}
+
+#else
+
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return &task_rq(p)->cfs;
+}
+
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	struct task_struct *p = task_of(se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->cfs;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return NULL;
+}
+#endif
+
 extern void update_rq_clock(struct rq *rq);
 
 static inline u64 __rq_clock_broken(struct rq *rq)
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 07/32] sched: Add core wide task selection and scheduling.
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (5 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 06/32] sched: Basic tracking of matching tasks Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 08/32] sched/fair: Fix forced idle sibling starvation corner case Joel Fernandes (Google)
                   ` (25 subsequent siblings)
  32 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aaron Lu, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective of which logical
CPU does the reschedule).

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c  | 301 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   6 +-
 2 files changed, 305 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d521033777f..1bd0b0bbb040 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5029,7 +5029,7 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -5070,6 +5070,294 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 }
 
 #ifdef CONFIG_SCHED_CORE
+static inline bool is_task_rq_idle(struct task_struct *t)
+{
+	return (task_rq(t)->idle == t);
+}
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+	return is_task_rq_idle(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+	if (is_task_rq_idle(a) || is_task_rq_idle(b))
+		return true;
+
+	return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+	struct task_struct *class_pick, *cookie_pick;
+	unsigned long cookie = rq->core->core_cookie;
+
+	class_pick = class->pick_task(rq);
+	if (!class_pick)
+		return NULL;
+
+	if (!cookie) {
+		/*
+		 * If class_pick is tagged, return it only if it has
+		 * higher priority than max.
+		 */
+		if (max && class_pick->core_cookie &&
+		    prio_less(class_pick, max))
+			return idle_sched_class.pick_task(rq);
+
+		return class_pick;
+	}
+
+	/*
+	 * If class_pick is idle or matches cookie, return early.
+	 */
+	if (cookie_equals(class_pick, cookie))
+		return class_pick;
+
+	cookie_pick = sched_core_find(rq, cookie);
+
+	/*
+	 * If class > max && class > cookie, it is the highest priority task on
+	 * the core (so far) and it must be selected, otherwise we must go with
+	 * the cookie pick in order to satisfy the constraint.
+	 */
+	if (prio_less(cookie_pick, class_pick) &&
+	    (!max || prio_less(max, class_pick)))
+		return class_pick;
+
+	return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *next, *max = NULL;
+	const struct sched_class *class;
+	const struct cpumask *smt_mask;
+	bool need_sync;
+	int i, j, cpu;
+
+	if (!sched_core_enabled(rq))
+		return __pick_next_task(rq, prev, rf);
+
+	cpu = cpu_of(rq);
+
+	/* Stopper task is switching into idle, no need for core-wide selection. */
+	if (cpu_is_offline(cpu)) {
+		/*
+		 * Reset core_pick so that we don't enter the fastpath when
+		 * coming online. core_pick would already be migrated to
+		 * another cpu during offline.
+		 */
+		rq->core_pick = NULL;
+		return __pick_next_task(rq, prev, rf);
+	}
+
+	/*
+	 * If there were no {en,de}queues since we picked (IOW, the task
+	 * pointers are all still valid), and we haven't scheduled the last
+	 * pick yet, do so now.
+	 *
+	 * rq->core_pick can be NULL if no selection was made for a CPU because
+	 * it was either offline or went offline during a sibling's core-wide
+	 * selection. In this case, do a core-wide selection.
+	 */
+	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+	    rq->core->core_pick_seq != rq->core_sched_seq &&
+	    rq->core_pick) {
+		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+		next = rq->core_pick;
+		if (next != prev) {
+			put_prev_task(rq, prev);
+			set_next_task(rq, next);
+		}
+
+		rq->core_pick = NULL;
+		return next;
+	}
+
+	put_prev_task_balance(rq, prev, rf);
+
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+	 *
+	 * @task_seq guards the task state ({en,de}queues)
+	 * @pick_seq is the @task_seq we did a selection on
+	 * @sched_seq is the @pick_seq we scheduled
+	 *
+	 * However, preemptions can cause multiple picks on the same task set.
+	 * 'Fix' this by also increasing @task_seq for every pick.
+	 */
+	rq->core->core_task_seq++;
+	need_sync = !!rq->core->core_cookie;
+
+	/* reset state */
+	rq->core->core_cookie = 0UL;
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		rq_i->core_pick = NULL;
+
+		if (rq_i->core_forceidle) {
+			need_sync = true;
+			rq_i->core_forceidle = false;
+		}
+
+		if (i != cpu)
+			update_rq_clock(rq_i);
+	}
+
+	/*
+	 * Try and select tasks for each sibling in descending sched_class
+	 * order.
+	 */
+	for_each_class(class) {
+again:
+		for_each_cpu_wrap(i, smt_mask, cpu) {
+			struct rq *rq_i = cpu_rq(i);
+			struct task_struct *p;
+
+			if (rq_i->core_pick)
+				continue;
+
+			/*
+			 * If this sibling doesn't yet have a suitable task to
+			 * run; ask for the most eligible task, given the
+			 * highest priority task already selected for this
+			 * core.
+			 */
+			p = pick_task(rq_i, class, max);
+			if (!p) {
+				/*
+				 * If there were no cookies, we don't need to
+				 * bother with the other siblings.
+				 * If the rest of the core is not running a tagged
+				 * task, i.e.  need_sync == 0, and the current CPU
+				 * which called into the schedule() loop does not
+				 * have any tasks for this class, skip selecting for
+				 * other siblings since there's no point. We don't skip
+				 * for RT/DL because that could make CFS force-idle RT.
+				 */
+				if (i == cpu && !need_sync && class == &fair_sched_class)
+					goto next_class;
+
+				continue;
+			}
+
+			/*
+			 * Optimize the 'normal' case where there aren't any
+			 * cookies and we don't need to sync up.
+			 */
+			if (i == cpu && !need_sync && !p->core_cookie) {
+				next = p;
+				goto done;
+			}
+
+			rq_i->core_pick = p;
+
+			/*
+			 * If this new candidate is of higher priority than the
+			 * previous; and they're incompatible; we need to wipe
+			 * the slate and start over. pick_task makes sure that
+			 * p's priority is more than max if it doesn't match
+			 * max's cookie.
+			 *
+			 * NOTE: this is a linear max-filter and is thus bounded
+			 * in execution time.
+			 */
+			if (!max || !cookie_match(max, p)) {
+				struct task_struct *old_max = max;
+
+				rq->core->core_cookie = p->core_cookie;
+				max = p;
+
+				if (old_max) {
+					for_each_cpu(j, smt_mask) {
+						if (j == i)
+							continue;
+
+						cpu_rq(j)->core_pick = NULL;
+					}
+					goto again;
+				} else {
+					/*
+					 * Once we select a task for a cpu, we
+					 * should not be doing an unconstrained
+					 * pick because it might starve a task
+					 * on a forced idle cpu.
+					 */
+					need_sync = true;
+				}
+
+			}
+		}
+next_class:;
+	}
+
+	rq->core->core_pick_seq = rq->core->core_task_seq;
+	next = rq->core_pick;
+	rq->core_sched_seq = rq->core->core_pick_seq;
+
+	/* Something should have been selected for current CPU */
+	WARN_ON_ONCE(!next);
+
+	/*
+	 * Reschedule siblings
+	 *
+	 * NOTE: L1TF -- at this point we're no longer running the old task and
+	 * sending an IPI (below) ensures the sibling will no longer be running
+	 * their task. This ensures there is no inter-sibling overlap between
+	 * non-matching user state.
+	 */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		/*
+		 * An online sibling might have gone offline before a task
+		 * could be picked for it, or it might be offline but later
+		 * happen to come online, but it's too late and nothing was
+		 * picked for it.  That's Ok - it will pick tasks for itself,
+		 * so ignore it.
+		 */
+		if (!rq_i->core_pick)
+			continue;
+
+		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running)
+			rq_i->core_forceidle = true;
+
+		if (i == cpu) {
+			rq_i->core_pick = NULL;
+			continue;
+		}
+
+		/* Did we break L1TF mitigation requirements? */
+		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+
+		if (rq_i->curr == rq_i->core_pick) {
+			rq_i->core_pick = NULL;
+			continue;
+		}
+
+		resched_curr(rq_i);
+	}
+
+done:
+	set_next_task(rq, next);
+	return next;
+}
 
 static inline void sched_core_cpu_starting(unsigned int cpu)
 {
@@ -5103,6 +5391,12 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
 
 static inline void sched_core_cpu_starting(unsigned int cpu) {}
 
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return __pick_next_task(rq, prev, rf);
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -7999,7 +8293,12 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = NULL;
+		rq->core_pick = NULL;
 		rq->core_enabled = 0;
+		rq->core_tree = RB_ROOT;
+		rq->core_forceidle = false;
+
+		rq->core_cookie = 0UL;
 #endif
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8ee0ca8ee5c3..63b28e1843ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1065,11 +1065,16 @@ struct rq {
 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
 	struct rq		*core;
+	struct task_struct	*core_pick;
 	unsigned int		core_enabled;
+	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	unsigned char		core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
+	unsigned int		core_pick_seq;
+	unsigned long		core_cookie;
 #endif
 };
 
@@ -1977,7 +1982,6 @@ static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != next);
 	next->sched_class->set_next_task(rq, next, false);
 }
 
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 08/32] sched/fair: Fix forced idle sibling starvation corner case
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (6 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 07/32] sched: Add core wide task selection and scheduling Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-22 10:35   ` Balbir Singh
  2020-11-17 23:19 ` [PATCH -tip 09/32] sched/fair: Snapshot the min_vruntime of CPUs on force idle Joel Fernandes (Google)
                   ` (24 subsequent siblings)
  32 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Vineeth Pillai <viremana@linux.microsoft.com>

If there is only one long-running local task and the sibling is
forced idle, it might not get a chance to run until a schedule
event happens on any CPU in the core.

So we check during the tick whether a sibling is starved and, if
so, give it a chance to schedule.

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c  | 15 ++++++++-------
 kernel/sched/fair.c  | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 +-
 3 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1bd0b0bbb040..52d0e83072a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5206,16 +5206,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	/* reset state */
 	rq->core->core_cookie = 0UL;
+	if (rq->core->core_forceidle) {
+		need_sync = true;
+		rq->core->core_forceidle = false;
+	}
 	for_each_cpu(i, smt_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
 		rq_i->core_pick = NULL;
 
-		if (rq_i->core_forceidle) {
-			need_sync = true;
-			rq_i->core_forceidle = false;
-		}
-
 		if (i != cpu)
 			update_rq_clock(rq_i);
 	}
@@ -5335,8 +5334,10 @@ next_class:;
 		if (!rq_i->core_pick)
 			continue;
 
-		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running)
-			rq_i->core_forceidle = true;
+		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
+		    !rq_i->core->core_forceidle) {
+			rq_i->core->core_forceidle = true;
+		}
 
 		if (i == cpu) {
 			rq_i->core_pick = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f53681cd263e..42965c4fd71f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10692,6 +10692,44 @@ static void rq_offline_fair(struct rq *rq)
 
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_SCHED_CORE
+static inline bool
+__entity_slice_used(struct sched_entity *se, int min_nr_tasks)
+{
+	u64 slice = sched_slice(cfs_rq_of(se), se);
+	u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+	return (rtime * min_nr_tasks > slice);
+}
+
+#define MIN_NR_TASKS_DURING_FORCEIDLE	2
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
+{
+	if (!sched_core_enabled(rq))
+		return;
+
+	/*
+	 * If runqueue has only one task which used up its slice and
+	 * if the sibling is forced idle, then trigger schedule to
+	 * give forced idle task a chance.
+	 *
+	 * sched_slice() considers only this active rq and it gets the
+	 * whole slice. But during force idle, we have siblings acting
+	 * like a single runqueue and hence we need to consider runnable
+	 * tasks on this cpu and the forced idle cpu. Ideally, we should
+	 * go through the forced idle rq, but that would be a perf hit.
+	 * We can assume that the forced idle cpu has at least
+	 * MIN_NR_TASKS_DURING_FORCEIDLE - 1 tasks and use that to check
+	 * if we need to give up the cpu.
+	 */
+	if (rq->core->core_forceidle && rq->cfs.nr_running == 1 &&
+	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
+		resched_curr(rq);
+}
+#else
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
+#endif
+
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -10715,6 +10753,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
+
+	task_tick_core(rq, curr);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 63b28e1843ee..be656ca8693d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1069,12 +1069,12 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
-	unsigned char		core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
+	unsigned char		core_forceidle;
 #endif
 };
 
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 09/32] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (7 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 08/32] sched/fair: Fix forced idle sibling starvation corner case Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-22 11:44   ` Balbir Singh
  2020-11-17 23:19 ` [PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling Joel Fernandes (Google)
                   ` (23 subsequent siblings)
  32 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

During force-idle, we end up doing a cross-CPU comparison of vruntimes
during pick_next_task. If we simply compare (vruntime - min_vruntime)
across CPUs, and the CPUs only have 1 task each, we will always
end up comparing 0 with 0 and picking just one of the tasks all the time.
This starves the task that was not picked. To fix this, take a snapshot
of the min_vruntime when entering force idle and use it for comparison.
This min_vruntime snapshot will only be used for cross-CPU vruntime
comparison, and nothing else.

This resolves several performance issues seen in the ChromeOS
audio usecase.
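The effect is easy to see in a toy model. The sketch below is illustrative C, not kernel code; struct toy_rq and both helpers are invented names. It contrasts the naive (vruntime - min_vruntime) comparison with a comparison against a min_vruntime value frozen when force idle began:

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a runqueue's vruntime bookkeeping. */
struct toy_rq {
	uint64_t min_vruntime;    /* advances along with the lone running task */
	uint64_t min_vruntime_fi; /* frozen at the moment force idle began */
	uint64_t curr_vruntime;   /* vruntime of the single runnable task */
};

/* Naive comparison: with one task per rq, both deltas stay 0 forever. */
static int64_t naive_delta(const struct toy_rq *a, const struct toy_rq *b)
{
	return (int64_t)(a->curr_vruntime - a->min_vruntime) -
	       (int64_t)(b->curr_vruntime - b->min_vruntime);
}

/* Snapshot comparison: runtime accrued since force idle began is visible. */
static int64_t snapshot_delta(const struct toy_rq *a, const struct toy_rq *b)
{
	return (int64_t)(a->curr_vruntime - a->min_vruntime_fi) -
	       (int64_t)(b->curr_vruntime - b->min_vruntime_fi);
}
```

With one task per rq, min_vruntime tracks that task's vruntime, so the naive delta is 0 on both sides and the pick never alternates; the snapshot delta grows for the CPU that has been running, so the starved sibling eventually wins.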

NOTE: This patch will be improved in a later patch. It is kept here as
      the basis for that later patch and to make rebasing easier.
      Further, it may make reverting the improvement easier in case the
      improvement causes any regression.

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c  | 33 ++++++++++++++++++++-------------
 kernel/sched/fair.c  | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  5 +++++
 3 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 52d0e83072a4..4ee4902c2cf5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -115,19 +115,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -5144,6 +5133,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *next, *max = NULL;
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
+	bool fi_before = false;
 	bool need_sync;
 	int i, j, cpu;
 
@@ -5208,6 +5198,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	rq->core->core_cookie = 0UL;
 	if (rq->core->core_forceidle) {
 		need_sync = true;
+		fi_before = true;
 		rq->core->core_forceidle = false;
 	}
 	for_each_cpu(i, smt_mask) {
@@ -5219,6 +5210,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			update_rq_clock(rq_i);
 	}
 
+	/* Reset the snapshot if core is no longer in force-idle. */
+	if (!fi_before) {
+		for_each_cpu(i, smt_mask) {
+			struct rq *rq_i = cpu_rq(i);
+			rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
+		}
+	}
+
 	/*
 	 * Try and select tasks for each sibling in decending sched_class
 	 * order.
@@ -5355,6 +5354,14 @@ next_class:;
 		resched_curr(rq_i);
 	}
 
+	/* Snapshot if core is in force-idle. */
+	if (!fi_before && rq->core->core_forceidle) {
+		for_each_cpu(i, smt_mask) {
+			struct rq *rq_i = cpu_rq(i);
+			rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
+		}
+	}
+
 done:
 	set_next_task(rq, next);
 	return next;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 42965c4fd71f..de82f88ba98c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10726,6 +10726,46 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
 	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
 		resched_curr(rq);
 }
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	bool samecpu = task_cpu(a) == task_cpu(b);
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	struct cfs_rq *cfs_rqa;
+	struct cfs_rq *cfs_rqb;
+	s64 delta;
+
+	if (samecpu) {
+		/* vruntime is per cfs_rq */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+		goto out;
+	}
+
+	/* crosscpu: compare root level se's vruntime to decide priority */
+	while (sea->parent)
+		sea = sea->parent;
+	while (seb->parent)
+		seb = seb->parent;
+
+	cfs_rqa = sea->cfs_rq;
+	cfs_rqb = seb->cfs_rq;
+
+	/* normalize vruntime WRT their rq's base */
+	delta = (s64)(sea->vruntime - seb->vruntime) +
+		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
+out:
+	return delta > 0;
+}
 #else
 static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be656ca8693d..d934cc51acf1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -517,6 +517,9 @@ struct cfs_rq {
 
 	u64			exec_clock;
 	u64			min_vruntime;
+#ifdef CONFIG_SCHED_CORE
+	u64			min_vruntime_fi;
+#endif
 #ifndef CONFIG_64BIT
 	u64			min_vruntime_copy;
 #endif
@@ -1130,6 +1133,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (8 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 09/32] sched/fair: Snapshot the min_vruntime of CPUs on force idle Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-22 22:41   ` Balbir Singh
  2020-11-17 23:19 ` [PATCH -tip 11/32] sched: Enqueue task into core queue only after vruntime is updated Joel Fernandes (Google)
                   ` (22 subsequent siblings)
  32 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

The rationale is as follows. In the core-wide pick logic, even if
need_sync == false, we need to go look at other CPUs (non-local CPUs) to
see if they could be running RT.

Say the RQs in a particular core look like this:
Let CFS1 and CFS2 be two tagged CFS tasks. Let RT1 be an untagged RT task.

rq0            rq1
CFS1 (tagged)  RT1 (untagged)
CFS2 (tagged)

Say schedule() runs on rq0. It will enter the core-wide pick loop, and
pick_task(RT) will return NULL for 'p'. It will then enter the if() block,
see that need_sync == false, and skip the RT class entirely.

The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
rq0             rq1
CFS1            IDLE

When it should have selected:
rq0             rq1
IDLE            RT

Joel saw this issue in real-world usecases on ChromeOS where an RT task
gets constantly force-idled and breaks RT. Let's cure it.
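The shape of the bug can be sketched in a toy model (illustrative C, not the kernel's actual pick loop; all names are invented). Skipping sibling runqueues when need_sync is false means a higher-priority class that is only runnable on the sibling is never seen:

```c
#include <assert.h>

/* Toy classes in descending priority, mirroring rt > fair. */
enum toy_class { TOY_RT, TOY_FAIR, TOY_NR_CLASS };

/* tasks[cpu][class] != 0 means that CPU has a runnable task of that class. */
static enum toy_class core_pick(int tasks[2][TOY_NR_CLASS],
				int this_cpu, int skip_siblings)
{
	for (enum toy_class c = TOY_RT; c < TOY_NR_CLASS; c++) {
		for (int cpu = 0; cpu < 2; cpu++) {
			if (skip_siblings && cpu != this_cpu)
				continue; /* buggy: sibling's RT is invisible */
			if (tasks[cpu][c])
				return c; /* class that governs the core */
		}
	}
	return TOY_FAIR;
}
```

With only tagged CFS on the local CPU and untagged RT on the sibling, the skipping variant lets CFS govern the core and force-idles RT; considering all siblings lets RT win, which is the behaviour this patch restores.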

NOTE: This problem will be fixed differently in a later patch. This
      change is kept here for reference about the issue, and to make
      applying later patches easier.

Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ee4902c2cf5..53af817740c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5195,6 +5195,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	need_sync = !!rq->core->core_cookie;
 
 	/* reset state */
+reset:
 	rq->core->core_cookie = 0UL;
 	if (rq->core->core_forceidle) {
 		need_sync = true;
@@ -5242,14 +5243,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				/*
 				 * If there weren't no cookies; we don't need to
 				 * bother with the other siblings.
-				 * If the rest of the core is not running a tagged
-				 * task, i.e.  need_sync == 0, and the current CPU
-				 * which called into the schedule() loop does not
-				 * have any tasks for this class, skip selecting for
-				 * other siblings since there's no point. We don't skip
-				 * for RT/DL because that could make CFS force-idle RT.
 				 */
-				if (i == cpu && !need_sync && class == &fair_sched_class)
+				if (i == cpu && !need_sync)
 					goto next_class;
 
 				continue;
@@ -5259,7 +5254,20 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 * Optimize the 'normal' case where there aren't any
 			 * cookies and we don't need to sync up.
 			 */
-			if (i == cpu && !need_sync && !p->core_cookie) {
+			if (i == cpu && !need_sync) {
+				if (p->core_cookie) {
+					/*
+					 * This optimization is only valid as
+					 * long as there are no cookies
+					 * involved. We may have skipped
+					 * non-empty higher priority classes on
+					 * siblings, which are empty on this
+					 * CPU, so start over.
+					 */
+					need_sync = true;
+					goto reset;
+				}
+
 				next = p;
 				goto done;
 			}
@@ -5299,7 +5307,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 					 */
 					need_sync = true;
 				}
-
 			}
 		}
 next_class:;
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 11/32] sched: Enqueue task into core queue only after vruntime is updated
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (9 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 12/32] sched: Simplify the core pick loop for optimized case Joel Fernandes (Google)
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

A waking task may have its vruntime adjusted. However, the code right
now enqueues it into the core queue before that adjustment happens. This
means the core queue may hold a task with a stale vruntime, potentially
one far behind min_vruntime, which can artificially boost the task
during picking.

Fix it by enqueuing into the core queue only after the class-specific
enqueue callback has been called. This ensures that for CFS tasks, the
updated vruntime value is used when enqueuing the task into the core
rbtree.
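The ordering problem can be sketched abstractly (toy C, not kernel code; toy_enqueue_class stands in for the class-specific enqueue callback, e.g. CFS's vruntime renormalization in place_entity()):

```c
#include <assert.h>
#include <stdint.h>

struct toy_task { uint64_t vruntime; };

/* Stand-in for the class enqueue hook: renormalizes a waking task. */
static void toy_enqueue_class(struct toy_task *p, uint64_t min_vruntime)
{
	if (p->vruntime < min_vruntime)
		p->vruntime = min_vruntime; /* toy renormalization rule */
}

/*
 * The core-queue ordering key must be read only after the class hook
 * has run; otherwise the stale (too small) vruntime is used.
 */
static uint64_t core_queue_key(struct toy_task *p, uint64_t min_vruntime,
			       int class_enqueue_first)
{
	uint64_t key;

	if (class_enqueue_first)
		toy_enqueue_class(p, min_vruntime);
	key = p->vruntime;             /* key used to order the core rbtree */
	if (!class_enqueue_first)
		toy_enqueue_class(p, min_vruntime);
	return key;
}
```

Enqueuing into the core queue first captures the stale key (10 below) even though the task's vruntime is then fixed up; enqueuing after the class callback captures the renormalized key.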

Reviewed-by: Vineeth Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 53af817740c0..6aa76de55ef2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1753,9 +1753,6 @@ static inline void init_uclamp(void) { }
 
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (sched_core_enabled(rq))
-		sched_core_enqueue(rq, p);
-
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
@@ -1766,6 +1763,9 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 
 	uclamp_rq_inc(rq, p);
 	p->sched_class->enqueue_task(rq, p, flags);
+
+	if (sched_core_enabled(rq))
+		sched_core_enqueue(rq, p);
 }
 
 static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 12/32] sched: Simplify the core pick loop for optimized case
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (10 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 11/32] sched: Enqueue task into core queue only after vruntime is updated Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-24 12:04   ` Peter Zijlstra
  2020-11-17 23:19 ` [PATCH -tip 13/32] sched: Trivial forced-newidle balancer Joel Fernandes (Google)
                   ` (20 subsequent siblings)
  32 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

The core pick loop grew a lot of warts over time to support
optimizations. It turns out that doing a class pick directly, before
entering the core-wide pick, is better for readability. Make the changes.

Since this is a relatively new change, make it a separate patch so that
it is easier to revert in case anyone reports an issue with it. Testing
shows it working for me.
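The resulting fast-path shape can be sketched as follows (toy C, illustrative only; toy_pick and its parameters are invented, and the real code iterates sched classes on the local rq):

```c
#include <assert.h>
#include <stddef.h>

struct toy_task { unsigned long core_cookie; };

/*
 * After the rewrite: do one ordinary class pick first; fall back to the
 * expensive core-wide selection only when a cookie is involved anywhere.
 */
static const struct toy_task *toy_pick(const struct toy_task *local_hi,
				       int sibling_has_cookie,
				       int *used_slow_path)
{
	*used_slow_path = 0;
	if (!sibling_has_cookie && local_hi && !local_hi->core_cookie)
		return local_hi;   /* common case: no core-wide sync needed */
	*used_slow_path = 1;       /* core-wide pick would run here */
	return local_hi;
}
```

The common uncookied case returns immediately; any cookie on the local pick or a sibling sends the code into the full core-wide selection, replacing the old goto-based reset logic.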

Reviewed-by: Vineeth Pillai <viremana@linux.microsoft.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c | 73 ++++++++++++++++-----------------------------
 1 file changed, 26 insertions(+), 47 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6aa76de55ef2..12e8e6627ab3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5180,6 +5180,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	put_prev_task_balance(rq, prev, rf);
 
 	smt_mask = cpu_smt_mask(cpu);
+	need_sync = !!rq->core->core_cookie;
+
+	/* reset state */
+	rq->core->core_cookie = 0UL;
+	if (rq->core->core_forceidle) {
+		need_sync = true;
+		fi_before = true;
+		rq->core->core_forceidle = false;
+	}
 
 	/*
 	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
@@ -5192,16 +5201,25 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * 'Fix' this by also increasing @task_seq for every pick.
 	 */
 	rq->core->core_task_seq++;
-	need_sync = !!rq->core->core_cookie;
 
-	/* reset state */
-reset:
-	rq->core->core_cookie = 0UL;
-	if (rq->core->core_forceidle) {
+	/*
+	 * Optimize for common case where this CPU has no cookies
+	 * and there are no cookied tasks running on siblings.
+	 */
+	if (!need_sync) {
+		for_each_class(class) {
+			next = class->pick_task(rq);
+			if (next)
+				break;
+		}
+
+		if (!next->core_cookie) {
+			rq->core_pick = NULL;
+			goto done;
+		}
 		need_sync = true;
-		fi_before = true;
-		rq->core->core_forceidle = false;
 	}
+
 	for_each_cpu(i, smt_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
@@ -5239,38 +5257,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 * core.
 			 */
 			p = pick_task(rq_i, class, max);
-			if (!p) {
-				/*
-				 * If there weren't no cookies; we don't need to
-				 * bother with the other siblings.
-				 */
-				if (i == cpu && !need_sync)
-					goto next_class;
-
+			if (!p)
 				continue;
-			}
-
-			/*
-			 * Optimize the 'normal' case where there aren't any
-			 * cookies and we don't need to sync up.
-			 */
-			if (i == cpu && !need_sync) {
-				if (p->core_cookie) {
-					/*
-					 * This optimization is only valid as
-					 * long as there are no cookies
-					 * involved. We may have skipped
-					 * non-empty higher priority classes on
-					 * siblings, which are empty on this
-					 * CPU, so start over.
-					 */
-					need_sync = true;
-					goto reset;
-				}
-
-				next = p;
-				goto done;
-			}
 
 			rq_i->core_pick = p;
 
@@ -5298,18 +5286,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 						cpu_rq(j)->core_pick = NULL;
 					}
 					goto again;
-				} else {
-					/*
-					 * Once we select a task for a cpu, we
-					 * should not be doing an unconstrained
-					 * pick because it might starve a task
-					 * on a forced idle cpu.
-					 */
-					need_sync = true;
 				}
 			}
 		}
-next_class:;
 	}
 
 	rq->core->core_pick_seq = rq->core->core_task_seq;
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 13/32] sched: Trivial forced-newidle balancer
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (11 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 12/32] sched: Simplify the core pick loop for optimized case Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-23  4:38   ` Balbir Singh
  2020-11-17 23:19 ` [PATCH -tip 14/32] sched: migration changes for core scheduling Joel Fernandes (Google)
                   ` (19 subsequent siblings)
  32 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Paul E. McKenney,
	Aubrey Li, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

When a sibling is forced idle to match the core cookie, search for
matching tasks on other CPUs to fill the core.

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using the RCU-sched flavor instead.
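The cookie search the stealer relies on can be modelled with a plain sorted array standing in for the per-rq core rbtree (toy C; toy_find mirrors sched_core_find(), toy_next mirrors sched_core_next(), names invented):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the core tree: tasks kept sorted by cookie. */
struct toy_task { unsigned long cookie; };

/* First task with a matching cookie, like sched_core_find(). */
static const struct toy_task *toy_find(const struct toy_task *t, size_t n,
				       unsigned long cookie)
{
	for (size_t i = 0; i < n; i++)
		if (t[i].cookie == cookie)
			return &t[i];
	return NULL;
}

/* In-order successor while the cookie still matches, like sched_core_next(). */
static const struct toy_task *toy_next(const struct toy_task *t, size_t n,
				       const struct toy_task *p)
{
	const struct toy_task *nxt = p + 1;

	if (nxt == t + n || nxt->cookie != p->cookie)
		return NULL;
	return nxt;
}
```

Because matching cookies are adjacent in the sorted order, try_steal_cookie() can start at the first match and walk successors until the cookie changes, stopping as soon as a stealable task is found.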

Acked-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 130 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/idle.c   |   1 +
 kernel/sched/sched.h  |   6 ++
 4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 344499ab29f2..7efce9c9d9cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
 	struct rb_node			core_node;
 	unsigned long			core_cookie;
+	unsigned int			core_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 12e8e6627ab3..3b373b592680 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -202,6 +202,21 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
 	return match;
 }
 
+static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
+{
+	struct rb_node *node = &p->core_node;
+
+	node = rb_next(node);
+	if (!node)
+		return NULL;
+
+	p = container_of(node, struct task_struct, core_node);
+	if (p->core_cookie != cookie)
+		return NULL;
+
+	return p;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -5134,8 +5149,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
 	bool fi_before = false;
+	int i, j, cpu, occ = 0;
 	bool need_sync;
-	int i, j, cpu;
 
 	if (!sched_core_enabled(rq))
 		return __pick_next_task(rq, prev, rf);
@@ -5260,6 +5275,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			if (!p)
 				continue;
 
+			if (!is_task_rq_idle(p))
+				occ++;
+
 			rq_i->core_pick = p;
 
 			/*
@@ -5285,6 +5303,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 						cpu_rq(j)->core_pick = NULL;
 					}
+					occ = 1;
 					goto again;
 				}
 			}
@@ -5324,6 +5343,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			rq_i->core->core_forceidle = true;
 		}
 
+		rq_i->core_pick->core_occupation = occ;
+
 		if (i == cpu) {
 			rq_i->core_pick = NULL;
 			continue;
@@ -5353,6 +5374,113 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+	struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+	struct task_struct *p;
+	unsigned long cookie;
+	bool success = false;
+
+	local_irq_disable();
+	double_rq_lock(dst, src);
+
+	cookie = dst->core->core_cookie;
+	if (!cookie)
+		goto unlock;
+
+	if (dst->curr != dst->idle)
+		goto unlock;
+
+	p = sched_core_find(src, cookie);
+	if (p == src->idle)
+		goto unlock;
+
+	do {
+		if (p == src->core_pick || p == src->curr)
+			goto next;
+
+		if (!cpumask_test_cpu(this, &p->cpus_mask))
+			goto next;
+
+		if (p->core_occupation > dst->idle->core_occupation)
+			goto next;
+
+		p->on_rq = TASK_ON_RQ_MIGRATING;
+		deactivate_task(src, p, 0);
+		set_task_cpu(p, this);
+		activate_task(dst, p, 0);
+		p->on_rq = TASK_ON_RQ_QUEUED;
+
+		resched_curr(dst);
+
+		success = true;
+		break;
+
+next:
+		p = sched_core_next(p, cookie);
+	} while (p);
+
+unlock:
+	double_rq_unlock(dst, src);
+	local_irq_enable();
+
+	return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+	int i;
+
+	for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+		if (i == cpu)
+			continue;
+
+		if (need_resched())
+			break;
+
+		if (try_steal_cookie(cpu, i))
+			return true;
+	}
+
+	return false;
+}
+
+static void sched_core_balance(struct rq *rq)
+{
+	struct sched_domain *sd;
+	int cpu = cpu_of(rq);
+
+	preempt_disable();
+	rcu_read_lock();
+	raw_spin_unlock_irq(rq_lockp(rq));
+	for_each_domain(cpu, sd) {
+		if (need_resched())
+			break;
+
+		if (steal_cookie_task(cpu, sd))
+			break;
+	}
+	raw_spin_lock_irq(rq_lockp(rq));
+	rcu_read_unlock();
+	preempt_enable();
+}
+
+static DEFINE_PER_CPU(struct callback_head, core_balance_head);
+
+void queue_core_balance(struct rq *rq)
+{
+	if (!sched_core_enabled(rq))
+		return;
+
+	if (!rq->core->core_cookie)
+		return;
+
+	if (!rq->nr_running) /* not forced idle */
+		return;
+
+	queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
+}
+
 static inline void sched_core_cpu_starting(unsigned int cpu)
 {
 	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 33864193a2f9..8bdb214eb78f 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -404,6 +404,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
+	queue_core_balance(rq);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d934cc51acf1..e72942a9ee11 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1135,6 +1135,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
 
+extern void queue_core_balance(struct rq *rq);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1147,6 +1149,10 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+static inline void queue_core_balance(struct rq *rq)
+{
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (12 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 13/32] sched: Trivial forced-newidle balancer Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-22 23:54   ` Balbir Singh
  2020-11-30 10:35   ` Vincent Guittot
  2020-11-17 23:19 ` [PATCH -tip 15/32] sched: Improve snapshotting of min_vruntime for CGroups Joel Fernandes (Google)
                   ` (18 subsequent siblings)
  32 siblings, 2 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Aubrey Li, Paul E. McKenney, Tim Chen

From: Aubrey Li <aubrey.li@intel.com>

 - Don't migrate if there is a cookie mismatch
     Load balancing tries to move a task from the busiest CPU to the
     destination CPU. When core scheduling is enabled, if the
     task's cookie does not match the destination CPU's core
     cookie, the task is skipped by this CPU. This mitigates
     the forced idle time on the destination CPU.

 - Select a cookie-matched idle CPU
     In the fast path of task wakeup, select the first cookie-matched
     idle CPU instead of the first idle CPU.

 - Find the cookie-matched idlest CPU
     In the slow path of task wakeup, find the idlest CPU whose core
     cookie matches the task's cookie.

 - Don't migrate a task if its cookie does not match
     For NUMA load balancing, don't migrate a task to a CPU whose
     core cookie does not match the task's cookie.
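The matching rule added to sched.h can be sketched in isolation (toy C; the real sched_core_cookie_match() walks cpu_smt_mask() to decide whether the whole core is idle, which is collapsed here into one flag):

```c
#include <assert.h>

/*
 * Toy version of sched_core_cookie_match(): a destination CPU accepts a
 * task if its whole core is idle, or if the core's current cookie
 * equals the task's cookie.  An idle core is always a valid target.
 */
static int toy_cookie_match(int core_all_idle, unsigned long core_cookie,
			    unsigned long task_cookie)
{
	return core_all_idle || core_cookie == task_cookie;
}
```

All four migration sites above reduce to this one predicate: skip the candidate CPU (or group, for find_idlest_group()) whenever it returns false.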

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c  | 64 ++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h | 29 ++++++++++++++++++++
 2 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de82f88ba98c..ceb3906c9a8a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
 			continue;
 
+#ifdef CONFIG_SCHED_CORE
+		/*
+		 * Skip this cpu if source task's cookie does not match
+		 * with CPU's core cookie.
+		 */
+		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+			continue;
+#endif
+
 		env->dst_cpu = cpu;
 		if (task_numa_compare(env, taskimp, groupimp, maymove))
 			break;
@@ -5867,11 +5876,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		struct rq *rq = cpu_rq(i);
+
+#ifdef CONFIG_SCHED_CORE
+		if (!sched_core_cookie_match(rq, p))
+			continue;
+#endif
+
 		if (sched_idle_cpu(i))
 			return i;
 
 		if (available_idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
@@ -6129,8 +6144,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
 			return -1;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-			break;
+
+		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+#ifdef CONFIG_SCHED_CORE
+			/*
+			 * If Core Scheduling is enabled, select this cpu
+			 * only if the process cookie matches core cookie.
+			 */
+			if (sched_core_enabled(cpu_rq(cpu)) &&
+			    p->core_cookie == cpu_rq(cpu)->core->core_cookie)
+#endif
+				break;
+		}
 	}
 
 	time = cpu_clock(this) - time;
@@ -7530,8 +7555,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
 	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 3) task's cookie does not match with this CPU's core cookie
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
@@ -7566,6 +7592,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+#ifdef CONFIG_SCHED_CORE
+	/*
+	 * Don't migrate task if the task's cookie does not match
+	 * with the destination CPU's core cookie.
+	 */
+	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+		return 0;
+#endif
+
 	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8827,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 					p->cpus_ptr))
 			continue;
 
+#ifdef CONFIG_SCHED_CORE
+		if (sched_core_enabled(cpu_rq(this_cpu))) {
+			int i = 0;
+			bool cookie_match = false;
+
+			for_each_cpu(i, sched_group_span(group)) {
+				struct rq *rq = cpu_rq(i);
+
+				if (sched_core_cookie_match(rq, p)) {
+					cookie_match = true;
+					break;
+				}
+			}
+			/* Skip over this group if no cookie matched */
+			if (!cookie_match)
+				continue;
+		}
+#endif
+
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e72942a9ee11..de553d39aa40 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1135,6 +1135,35 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
 
+/*
+ * Helper to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	bool idle_core = true;
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+		if (!available_idle_cpu(cpu)) {
+			idle_core = false;
+			break;
+		}
+	}
+
+	/*
+	 * A CPU in an idle core is always the best choice for tasks with
+	 * cookies.
+	 */
+	return idle_core || rq->core->core_cookie == p->core_cookie;
+}
+
 extern void queue_core_balance(struct rq *rq);
 
 #else /* !CONFIG_SCHED_CORE */
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH -tip 15/32] sched: Improve snapshotting of min_vruntime for CGroups
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (13 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 14/32] sched: migration changes for core scheduling Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-24 10:27   ` Peter Zijlstra
  2020-11-24 10:41   ` Peter Zijlstra
  2020-11-17 23:19 ` [PATCH -tip 16/32] irq_work: Cleanup Joel Fernandes (Google)
                   ` (17 subsequent siblings)
  32 siblings, 2 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

A previous patch improved cross-CPU vruntime comparison operations in
pick_next_task(). Improve it further for tasks in CGroups.

In particular, for cross-CPU comparisons, we were previously walking up
to the root-level se(s) for both tasks being compared, which was
needlessly coarse. This patch instead finds the se(s) for both tasks
that share the same parent (which may be different from the root).

A note about the min_vruntime snapshot and force idling:
Abbreviations: fi: force-idled now? ; fib: force-idled before?
During selection:
When we're not fi, we need to update the snapshot.
When we're fi and were not fi before, we must update the snapshot.
When we're fi and were already fi, we must not update the snapshot.

Which gives:
        fib     fi      update?
        0       0       1
        0       1       1
        1       0       1
        1       1       0
So the min_vruntime snapshot needs to be updated when: !(fib && fi).

Also, the cfs_prio_less() function needs to be aware of whether the core
is in force idle or not, since it will use this information to decide
whether to advance a cfs_rq's min_vruntime_fi in the hierarchy. So pass
this information along via pick_task() -> prio_less().

Reviewed-by: Vineeth Pillai <viremana@linux.microsoft.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c  | 61 +++++++++++++++++----------------
 kernel/sched/fair.c  | 80 ++++++++++++++++++++++++++++++++------------
 kernel/sched/sched.h |  7 +++-
 3 files changed, 97 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b373b592680..20125431af87 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -101,7 +101,7 @@ static inline int __task_prio(struct task_struct *p)
  */
 
 /* real prio, less is less */
-static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+static inline bool prio_less(struct task_struct *a, struct task_struct *b, bool in_fi)
 {
 
 	int pa = __task_prio(a), pb = __task_prio(b);
@@ -116,7 +116,7 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
 	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
-		return cfs_prio_less(a, b);
+		return cfs_prio_less(a, b, in_fi);
 
 	return false;
 }
@@ -130,7 +130,7 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
 		return false;
 
 	/* flip prio, so high prio is leftmost */
-	if (prio_less(b, a))
+	if (prio_less(b, a, task_rq(a)->core->core_forceidle))
 		return true;
 
 	return false;
@@ -5101,7 +5101,7 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
  * - Else returns idle_task.
  */
 static struct task_struct *
-pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
 {
 	struct task_struct *class_pick, *cookie_pick;
 	unsigned long cookie = rq->core->core_cookie;
@@ -5116,7 +5116,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 		 * higher priority than max.
 		 */
 		if (max && class_pick->core_cookie &&
-		    prio_less(class_pick, max))
+		    prio_less(class_pick, max, in_fi))
 			return idle_sched_class.pick_task(rq);
 
 		return class_pick;
@@ -5135,13 +5135,15 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 	 * the core (so far) and it must be selected, otherwise we must go with
 	 * the cookie pick in order to satisfy the constraint.
 	 */
-	if (prio_less(cookie_pick, class_pick) &&
-	    (!max || prio_less(max, class_pick)))
+	if (prio_less(cookie_pick, class_pick, in_fi) &&
+	    (!max || prio_less(max, class_pick, in_fi)))
 		return class_pick;
 
 	return cookie_pick;
 }
 
+extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
+
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -5230,9 +5232,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 		if (!next->core_cookie) {
 			rq->core_pick = NULL;
+			/*
+			 * For robustness, update the min_vruntime_fi for
+			 * unconstrained picks as well.
+			 */
+			WARN_ON_ONCE(fi_before);
+			task_vruntime_update(rq, next, false);
 			goto done;
 		}
-		need_sync = true;
 	}
 
 	for_each_cpu(i, smt_mask) {
@@ -5244,14 +5251,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			update_rq_clock(rq_i);
 	}
 
-	/* Reset the snapshot if core is no longer in force-idle. */
-	if (!fi_before) {
-		for_each_cpu(i, smt_mask) {
-			struct rq *rq_i = cpu_rq(i);
-			rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
-		}
-	}
-
 	/*
 	 * Try and select tasks for each sibling in decending sched_class
 	 * order.
@@ -5271,7 +5270,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 * highest priority task already selected for this
 			 * core.
 			 */
-			p = pick_task(rq_i, class, max);
+			p = pick_task(rq_i, class, max, fi_before);
 			if (!p)
 				continue;
 
@@ -5279,6 +5278,11 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				occ++;
 
 			rq_i->core_pick = p;
+			if (rq_i->idle == p && rq_i->nr_running) {
+				rq->core->core_forceidle = true;
+				if (!fi_before)
+					rq->core->core_forceidle_seq++;
+			}
 
 			/*
 			 * If this new candidate is of higher priority than the
@@ -5297,6 +5301,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				max = p;
 
 				if (old_max) {
+					rq->core->core_forceidle = false;
 					for_each_cpu(j, smt_mask) {
 						if (j == i)
 							continue;
@@ -5338,10 +5343,16 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		if (!rq_i->core_pick)
 			continue;
 
-		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
-		    !rq_i->core->core_forceidle) {
-			rq_i->core->core_forceidle = true;
-		}
+		/*
+		 * Update for new !FI->FI transitions, or if continuing to be in !FI:
+		 * fi_before     fi      update?
+		 *  0            0       1
+		 *  0            1       1
+		 *  1            0       1
+		 *  1            1       0
+		 */
+		if (!(fi_before && rq->core->core_forceidle))
+			task_vruntime_update(rq_i, rq_i->core_pick, rq->core->core_forceidle);
 
 		rq_i->core_pick->core_occupation = occ;
 
@@ -5361,14 +5372,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		resched_curr(rq_i);
 	}
 
-	/* Snapshot if core is in force-idle. */
-	if (!fi_before && rq->core->core_forceidle) {
-		for_each_cpu(i, smt_mask) {
-			struct rq *rq_i = cpu_rq(i);
-			rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
-		}
-	}
-
 done:
 	set_next_task(rq, next);
 	return next;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ceb3906c9a8a..a89c7c917cc6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10781,43 +10781,81 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
 		resched_curr(rq);
 }
 
-bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+/*
+ * se_fi_update - Update the cfs_rq->min_vruntime_fi in a CFS hierarchy if needed.
+ */
+static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forceidle)
 {
-	bool samecpu = task_cpu(a) == task_cpu(b);
+	bool root = true;
+	long old, new;
+
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		if (forceidle) {
+			if (cfs_rq->forceidle_seq == fi_seq)
+				break;
+			cfs_rq->forceidle_seq = fi_seq;
+		}
+
+		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
+	}
+}
+
+void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi)
+{
+	struct sched_entity *se = &p->se;
+
+	if (p->sched_class != &fair_sched_class)
+		return;
+
+	se_fi_update(se, rq->core->core_forceidle_seq, in_fi);
+}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool in_fi)
+{
+	struct rq *rq = task_rq(a);
 	struct sched_entity *sea = &a->se;
 	struct sched_entity *seb = &b->se;
 	struct cfs_rq *cfs_rqa;
 	struct cfs_rq *cfs_rqb;
 	s64 delta;
 
-	if (samecpu) {
-		/* vruntime is per cfs_rq */
-		while (!is_same_group(sea, seb)) {
-			int sea_depth = sea->depth;
-			int seb_depth = seb->depth;
-			if (sea_depth >= seb_depth)
-				sea = parent_entity(sea);
-			if (sea_depth <= seb_depth)
-				seb = parent_entity(seb);
-		}
+	SCHED_WARN_ON(task_rq(b)->core != rq->core);
 
-		delta = (s64)(sea->vruntime - seb->vruntime);
-		goto out;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	/*
+	 * Find an se in the hierarchy for tasks a and b, such that the se's
+	 * are immediate siblings.
+	 */
+	while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
+		int sea_depth = sea->depth;
+		int seb_depth = seb->depth;
+
+		if (sea_depth >= seb_depth)
+			sea = parent_entity(sea);
+		if (sea_depth <= seb_depth)
+			seb = parent_entity(seb);
 	}
 
-	/* crosscpu: compare root level se's vruntime to decide priority */
-	while (sea->parent)
-		sea = sea->parent;
-	while (seb->parent)
-		seb = seb->parent;
+	se_fi_update(sea, rq->core->core_forceidle_seq, in_fi);
+	se_fi_update(seb, rq->core->core_forceidle_seq, in_fi);
 
 	cfs_rqa = sea->cfs_rq;
 	cfs_rqb = seb->cfs_rq;
+#else
+	cfs_rqa = &task_rq(a)->cfs;
+	cfs_rqb = &task_rq(b)->cfs;
+#endif
 
-	/* normalize vruntime WRT their rq's base */
+	/*
+	 * Find delta after normalizing se's vruntime with its cfs_rq's
+	 * min_vruntime_fi, which would have been updated in prior calls
+	 * to se_fi_update().
+	 */
 	delta = (s64)(sea->vruntime - seb->vruntime) +
 		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
-out:
+
 	return delta > 0;
 }
 #else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de553d39aa40..5c258ab64052 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -518,8 +518,10 @@ struct cfs_rq {
 	u64			exec_clock;
 	u64			min_vruntime;
 #ifdef CONFIG_SCHED_CORE
+	unsigned int		forceidle_seq;
 	u64			min_vruntime_fi;
 #endif
+
 #ifndef CONFIG_64BIT
 	u64			min_vruntime_copy;
 #endif
@@ -1078,6 +1080,7 @@ struct rq {
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
 	unsigned char		core_forceidle;
+	unsigned int		core_forceidle_seq;
 #endif
 };
 
@@ -1133,7 +1136,7 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
-bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
 /*
  * Helper to check if the CPU's core cookie matches with the task's cookie
@@ -1166,6 +1169,8 @@ static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
 
 extern void queue_core_balance(struct rq *rq);
 
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 16/32] irq_work: Cleanup
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (14 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 15/32] sched: Improve snapshotting of min_vruntime for CGroups Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 17/32] arch/x86: Add a new TIF flag for untrusted tasks Joel Fernandes (Google)
                   ` (16 subsequent siblings)
  32 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Get rid of the __call_single_node union and clean up the API a little
to avoid external code relying on the structure layout as much.

(Needed for irq_work_is_busy() API in core-scheduling series).

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 drivers/gpu/drm/i915/i915_request.c |  4 ++--
 include/linux/irq_work.h            | 33 ++++++++++++++++++-----------
 include/linux/irqflags.h            |  4 ++--
 kernel/bpf/stackmap.c               |  2 +-
 kernel/irq_work.c                   | 18 ++++++++--------
 kernel/printk/printk.c              |  6 ++----
 kernel/rcu/tree.c                   |  3 +--
 kernel/time/tick-sched.c            |  6 ++----
 kernel/trace/bpf_trace.c            |  2 +-
 9 files changed, 41 insertions(+), 37 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 0e813819b041..5385b081a376 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -197,7 +197,7 @@ __notify_execute_cb(struct i915_request *rq, bool (*fn)(struct irq_work *wrk))
 
 	llist_for_each_entry_safe(cb, cn,
 				  llist_del_all(&rq->execute_cb),
-				  work.llnode)
+				  work.node.llist)
 		fn(&cb->work);
 }
 
@@ -460,7 +460,7 @@ __await_execution(struct i915_request *rq,
 	 * callback first, then checking the ACTIVE bit, we serialise with
 	 * the completed/retired request.
 	 */
-	if (llist_add(&cb->work.llnode, &signal->execute_cb)) {
+	if (llist_add(&cb->work.node.llist, &signal->execute_cb)) {
 		if (i915_request_is_active(signal) ||
 		    __request_in_flight(signal))
 			__notify_execute_cb_imm(signal);
diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 30823780c192..ec2a47a81e42 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -14,28 +14,37 @@
  */
 
 struct irq_work {
-	union {
-		struct __call_single_node node;
-		struct {
-			struct llist_node llnode;
-			atomic_t flags;
-		};
-	};
+	struct __call_single_node node;
 	void (*func)(struct irq_work *);
 };
 
+#define __IRQ_WORK_INIT(_func, _flags) (struct irq_work){	\
+	.node = { .u_flags = (_flags), },			\
+	.func = (_func),					\
+}
+
+#define IRQ_WORK_INIT(_func) __IRQ_WORK_INIT(_func, 0)
+#define IRQ_WORK_INIT_LAZY(_func) __IRQ_WORK_INIT(_func, IRQ_WORK_LAZY)
+#define IRQ_WORK_INIT_HARD(_func) __IRQ_WORK_INIT(_func, IRQ_WORK_HARD_IRQ)
+
+#define DEFINE_IRQ_WORK(name, _f)				\
+	struct irq_work name = IRQ_WORK_INIT(_f)
+
 static inline
 void init_irq_work(struct irq_work *work, void (*func)(struct irq_work *))
 {
-	atomic_set(&work->flags, 0);
-	work->func = func;
+	*work = IRQ_WORK_INIT(func);
 }
 
-#define DEFINE_IRQ_WORK(name, _f) struct irq_work name = {	\
-		.flags = ATOMIC_INIT(0),			\
-		.func  = (_f)					\
+static inline bool irq_work_is_pending(struct irq_work *work)
+{
+	return atomic_read(&work->node.a_flags) & IRQ_WORK_PENDING;
 }
 
+static inline bool irq_work_is_busy(struct irq_work *work)
+{
+	return atomic_read(&work->node.a_flags) & IRQ_WORK_BUSY;
+}
 
 bool irq_work_queue(struct irq_work *work);
 bool irq_work_queue_on(struct irq_work *work, int cpu);
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 3ed4e8771b64..fef2d43a7a1d 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -109,12 +109,12 @@ do {						\
 
 # define lockdep_irq_work_enter(__work)					\
 	  do {								\
-		  if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\
+		  if (!(atomic_read(&__work->node.a_flags) & IRQ_WORK_HARD_IRQ))\
 			current->irq_config = 1;			\
 	  } while (0)
 # define lockdep_irq_work_exit(__work)					\
 	  do {								\
-		  if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\
+		  if (!(atomic_read(&__work->node.a_flags) & IRQ_WORK_HARD_IRQ))\
 			current->irq_config = 0;			\
 	  } while (0)
 
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 06065fa27124..599041cd0c8a 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -298,7 +298,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	if (irqs_disabled()) {
 		if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
 			work = this_cpu_ptr(&up_read_work);
-			if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY) {
+			if (irq_work_is_busy(&work->irq_work)) {
 				/* cannot queue more up_read, fallback */
 				irq_work_busy = true;
 			}
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index eca83965b631..fbff25adb574 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -31,7 +31,7 @@ static bool irq_work_claim(struct irq_work *work)
 {
 	int oflags;
 
-	oflags = atomic_fetch_or(IRQ_WORK_CLAIMED | CSD_TYPE_IRQ_WORK, &work->flags);
+	oflags = atomic_fetch_or(IRQ_WORK_CLAIMED | CSD_TYPE_IRQ_WORK, &work->node.a_flags);
 	/*
 	 * If the work is already pending, no need to raise the IPI.
 	 * The pairing atomic_fetch_andnot() in irq_work_run() makes sure
@@ -53,12 +53,12 @@ void __weak arch_irq_work_raise(void)
 static void __irq_work_queue_local(struct irq_work *work)
 {
 	/* If the work is "lazy", handle it from next tick if any */
-	if (atomic_read(&work->flags) & IRQ_WORK_LAZY) {
-		if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
+	if (atomic_read(&work->node.a_flags) & IRQ_WORK_LAZY) {
+		if (llist_add(&work->node.llist, this_cpu_ptr(&lazy_list)) &&
 		    tick_nohz_tick_stopped())
 			arch_irq_work_raise();
 	} else {
-		if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
+		if (llist_add(&work->node.llist, this_cpu_ptr(&raised_list)))
 			arch_irq_work_raise();
 	}
 }
@@ -102,7 +102,7 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (cpu != smp_processor_id()) {
 		/* Arch remote IPI send/receive backend aren't NMI safe */
 		WARN_ON_ONCE(in_nmi());
-		__smp_call_single_queue(cpu, &work->llnode);
+		__smp_call_single_queue(cpu, &work->node.llist);
 	} else {
 		__irq_work_queue_local(work);
 	}
@@ -142,7 +142,7 @@ void irq_work_single(void *arg)
 	 * to claim that work don't rely on us to handle their data
 	 * while we are in the middle of the func.
 	 */
-	flags = atomic_fetch_andnot(IRQ_WORK_PENDING, &work->flags);
+	flags = atomic_fetch_andnot(IRQ_WORK_PENDING, &work->node.a_flags);
 
 	lockdep_irq_work_enter(work);
 	work->func(work);
@@ -152,7 +152,7 @@ void irq_work_single(void *arg)
 	 * no-one else claimed it meanwhile.
 	 */
 	flags &= ~IRQ_WORK_PENDING;
-	(void)atomic_cmpxchg(&work->flags, flags, flags & ~IRQ_WORK_BUSY);
+	(void)atomic_cmpxchg(&work->node.a_flags, flags, flags & ~IRQ_WORK_BUSY);
 }
 
 static void irq_work_run_list(struct llist_head *list)
@@ -166,7 +166,7 @@ static void irq_work_run_list(struct llist_head *list)
 		return;
 
 	llnode = llist_del_all(list);
-	llist_for_each_entry_safe(work, tmp, llnode, llnode)
+	llist_for_each_entry_safe(work, tmp, llnode, node.llist)
 		irq_work_single(work);
 }
 
@@ -198,7 +198,7 @@ void irq_work_sync(struct irq_work *work)
 {
 	lockdep_assert_irqs_enabled();
 
-	while (atomic_read(&work->flags) & IRQ_WORK_BUSY)
+	while (irq_work_is_busy(work))
 		cpu_relax();
 }
 EXPORT_SYMBOL_GPL(irq_work_sync);
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index fe64a49344bf..9ef23d4b07c7 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3025,10 +3025,8 @@ static void wake_up_klogd_work_func(struct irq_work *irq_work)
 		wake_up_interruptible(&log_wait);
 }
 
-static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = {
-	.func = wake_up_klogd_work_func,
-	.flags = ATOMIC_INIT(IRQ_WORK_LAZY),
-};
+static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) =
+	IRQ_WORK_INIT_LAZY(wake_up_klogd_work_func);
 
 void wake_up_klogd(void)
 {
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index bd04b09b84b3..ed4941f0bd59 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1311,8 +1311,6 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 		if (IS_ENABLED(CONFIG_IRQ_WORK) &&
 		    !rdp->rcu_iw_pending && rdp->rcu_iw_gp_seq != rnp->gp_seq &&
 		    (rnp->ffmask & rdp->grpmask)) {
-			init_irq_work(&rdp->rcu_iw, rcu_iw_handler);
-			atomic_set(&rdp->rcu_iw.flags, IRQ_WORK_HARD_IRQ);
 			rdp->rcu_iw_pending = true;
 			rdp->rcu_iw_gp_seq = rnp->gp_seq;
 			irq_work_queue_on(&rdp->rcu_iw, rdp->cpu);
@@ -3964,6 +3962,7 @@ int rcutree_prepare_cpu(unsigned int cpu)
 	rdp->cpu_no_qs.b.norm = true;
 	rdp->core_needs_qs = false;
 	rdp->rcu_iw_pending = false;
+	rdp->rcu_iw = IRQ_WORK_INIT_HARD(rcu_iw_handler);
 	rdp->rcu_iw_gp_seq = rdp->gp_seq - 1;
 	trace_rcu_grace_period(rcu_state.name, rdp->gp_seq, TPS("cpuonl"));
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 81632cd5e3b7..1b734070f028 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -243,10 +243,8 @@ static void nohz_full_kick_func(struct irq_work *work)
 	/* Empty, the tick restart happens on tick_nohz_irq_exit() */
 }
 
-static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) = {
-	.func = nohz_full_kick_func,
-	.flags = ATOMIC_INIT(IRQ_WORK_HARD_IRQ),
-};
+static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) =
+	IRQ_WORK_INIT_HARD(nohz_full_kick_func);
 
 /*
  * Kick this CPU if it's full dynticks in order to force it to
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 4517c8b66518..a6903912f7a0 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1086,7 +1086,7 @@ static int bpf_send_signal_common(u32 sig, enum pid_type type)
 			return -EINVAL;
 
 		work = this_cpu_ptr(&send_signal_work);
-		if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY)
+		if (irq_work_is_busy(&work->irq_work))
 			return -EBUSY;
 
 		/* Add the current task, which is the target of sending signal,
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 17/32] arch/x86: Add a new TIF flag for untrusted tasks
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (15 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 16/32] irq_work: Cleanup Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-23  5:18   ` Balbir Singh
  2020-11-17 23:19 ` [PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode Joel Fernandes (Google)
                   ` (15 subsequent siblings)
  32 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

Add a new TIF flag to indicate whether the kernel needs to be careful
and take additional steps to mitigate micro-architectural issues during
entry into user or guest mode.

This new flag will be used by the series to determine whether waiting is
needed during exit to user or guest mode.

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Reviewed-by: Aubrey Li <aubrey.intel@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/x86/include/asm/thread_info.h | 2 ++
 kernel/sched/sched.h               | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 93277a8d2ef0..ae4f6196e38c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -99,6 +99,7 @@ struct thread_info {
 #define TIF_SPEC_FORCE_UPDATE	23	/* Force speculation MSR update in context switch */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
+#define TIF_UNSAFE_RET		26	/* On return to process/guest, perform safety checks. */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
 #define TIF_SYSCALL_TRACEPOINT	28	/* syscall tracepoint instrumentation */
 #define TIF_ADDR32		29	/* 32-bit address space on 64 bits */
@@ -127,6 +128,7 @@ struct thread_info {
 #define _TIF_SPEC_FORCE_UPDATE	(1 << TIF_SPEC_FORCE_UPDATE)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
 #define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
+#define _TIF_UNSAFE_RET 	(1 << TIF_UNSAFE_RET)
 #define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_ADDR32		(1 << TIF_ADDR32)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5c258ab64052..615092cb693c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2851,3 +2851,9 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 
 void swake_up_all_locked(struct swait_queue_head *q);
 void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
+
+#ifdef CONFIG_SCHED_CORE
+#ifndef TIF_UNSAFE_RET
+#define TIF_UNSAFE_RET (0)
+#endif
+#endif
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (16 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 17/32] arch/x86: Add a new TIF flag for untrusted tasks Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-24 16:09   ` Peter Zijlstra
                     ` (2 more replies)
  2020-11-17 23:19 ` [PATCH -tip 19/32] entry/idle: Enter and exit kernel protection during idle entry and exit Joel Fernandes (Google)
                   ` (14 subsequent siblings)
  32 siblings, 3 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li, Tim Chen,
	Paul E . McKenney

Core-scheduling prevents hyperthreads in usermode from attacking each
other, but it does not do anything about one of the hyperthreads
entering the kernel for any reason. This leaves the door open for MDS
and L1TF attacks with concurrent execution sequences between
hyperthreads.

This patch therefore adds support for protecting all syscall and IRQ
kernel mode entries. Care is taken to track the outermost usermode exit
and entry using per-cpu counters. In cases where one of the hyperthreads
enters the kernel, no additional IPIs are sent. Further, IPIs are
avoided when not needed; for example, idle and non-cookie HTs do not
need to be forced into kernel mode.

More information about attacks:
For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
data to either host or guest attackers. For L1TF, it is possible to leak
to guest attackers. There is no possible mitigation involving flushing
of buffers to avoid this, since the attacker and victim execute
concurrently on 2 or more HTs.

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/kernel-parameters.txt         |  11 +
 include/linux/entry-common.h                  |  12 +-
 include/linux/sched.h                         |  12 +
 kernel/entry/common.c                         |  28 +-
 kernel/sched/core.c                           | 241 ++++++++++++++++++
 kernel/sched/sched.h                          |   3 +
 6 files changed, 304 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bd1a5b87a5e2..b185c6ed4aba 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4678,6 +4678,17 @@
 
 	sbni=		[NET] Granch SBNI12 leased line adapter
 
+	sched_core_protect_kernel=
+			[SCHED_CORE] Pause SMT siblings of a core running in
+			user mode, if at least one of the siblings of the core
+			is running in kernel mode. This is to guarantee that
+			kernel data is not leaked to tasks which are not trusted
+			by the kernel. A value of 0 disables protection, 1
+			enables protection. The default is 1. Note that protection
+			depends on the arch defining the _TIF_UNSAFE_RET flag.
+			Further, for protecting VMEXIT, arch needs to call
+			KVM entry/exit hooks.
+
 	sched_debug	[KNL] Enables verbose scheduler debug messages.
 
 	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 1a128baf3628..022e1f114157 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -33,6 +33,10 @@
 # define _TIF_PATCH_PENDING		(0)
 #endif
 
+#ifndef _TIF_UNSAFE_RET
+# define _TIF_UNSAFE_RET		(0)
+#endif
+
 #ifndef _TIF_UPROBE
 # define _TIF_UPROBE			(0)
 #endif
@@ -74,7 +78,7 @@
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
 	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |	\
-	 ARCH_EXIT_TO_USER_MODE_WORK)
+	 _TIF_UNSAFE_RET | ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
  * arch_check_user_regs - Architecture specific sanity check for user mode regs
@@ -444,4 +448,10 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs);
  */
 void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state);
 
+/* entry_kernel_protected - Is kernel protection on entry/exit into kernel supported? */
+static inline bool entry_kernel_protected(void)
+{
+	return IS_ENABLED(CONFIG_SCHED_CORE) && sched_core_kernel_protected()
+		&& _TIF_UNSAFE_RET != 0;
+}
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7efce9c9d9cf..a60868165590 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2076,4 +2076,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+void sched_core_unsafe_enter(void);
+void sched_core_unsafe_exit(void);
+bool sched_core_wait_till_safe(unsigned long ti_check);
+bool sched_core_kernel_protected(void);
+#else
+#define sched_core_unsafe_enter()		do { } while (0)
+#define sched_core_unsafe_exit()		do { } while (0)
+#define sched_core_wait_till_safe(ignore)	false
+#define sched_core_kernel_protected()		false
+#endif
+
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bc75c114c1b3..9d9d926f2a1c 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -28,6 +28,9 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
 
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
+
+	if (entry_kernel_protected())
+		sched_core_unsafe_enter();
 	instrumentation_end();
 }
 
@@ -145,6 +148,26 @@ static void handle_signal_work(struct pt_regs *regs, unsigned long ti_work)
 	arch_do_signal_or_restart(regs, ti_work & _TIF_SIGPENDING);
 }
 
+static unsigned long exit_to_user_get_work(void)
+{
+	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+
+	if (!entry_kernel_protected())
+		return ti_work;
+
+#ifdef CONFIG_SCHED_CORE
+	ti_work &= EXIT_TO_USER_MODE_WORK;
+	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
+		sched_core_unsafe_exit();
+		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
+			sched_core_unsafe_enter(); /* not exiting to user yet. */
+		}
+	}
+
+#endif
+	return READ_ONCE(current_thread_info()->flags);
+}
+
 static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 					    unsigned long ti_work)
 {
@@ -182,7 +205,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		 * enabled above.
 		 */
 		local_irq_disable_exit_to_user();
-		ti_work = READ_ONCE(current_thread_info()->flags);
+		ti_work = exit_to_user_get_work();
 	}
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
@@ -191,9 +214,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 static void exit_to_user_mode_prepare(struct pt_regs *regs)
 {
-	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+	unsigned long ti_work;
 
 	lockdep_assert_irqs_disabled();
+	ti_work = exit_to_user_get_work();
 
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20125431af87..7f807a84cc30 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
 
 #ifdef CONFIG_SCHED_CORE
 
+DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
+static int __init set_sched_core_protect_kernel(char *str)
+{
+	unsigned long val = 0;
+
+	if (!str)
+		return 0;
+
+	if (!kstrtoul(str, 0, &val) && !val)
+		static_branch_disable(&sched_core_protect_kernel);
+
+	return 1;
+}
+__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
+
+/* Is the kernel protected by core scheduling? */
+bool sched_core_kernel_protected(void)
+{
+	return static_branch_likely(&sched_core_protect_kernel);
+}
+
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
 /* kernel prio, less is more */
@@ -5092,6 +5113,225 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
+/*
+ * IPI handler used to force a sibling into the kernel. It does nothing
+ * itself because the exit to user or guest mode does the actual waiting.
+ */
+static void sched_core_irq_work(struct irq_work *work)
+{
+}
+
+static inline void init_sched_core_irq_work(struct rq *rq)
+{
+	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
+}
+
+/*
+ * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
+ * exits the core-wide unsafe state. Obviously the CPU calling this function
+ * should not be responsible for the core being in the core-wide unsafe state
+ * otherwise it will deadlock.
+ *
+ * @ti_check: TIF flags to check while spinning with IRQs enabled and
+ *            preemption disabled; break out and notify the caller if set.
+ *
+ * IRQs should be disabled.
+ */
+bool sched_core_wait_till_safe(unsigned long ti_check)
+{
+	bool restart = false;
+	struct rq *rq;
+	int cpu;
+
+	/* We clear the thread flag only at the end, so no need to check for it. */
+	ti_check &= ~_TIF_UNSAFE_RET;
+
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/* Downgrade to allow interrupts, to prevent stop_machine lockups. */
+	preempt_disable();
+	local_irq_enable();
+
+	/*
+	 * Wait till the core of this HT is not in an unsafe state.
+	 *
+	 * Pair with raw_spin_lock/unlock() in sched_core_unsafe_enter/exit().
+	 */
+	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
+		cpu_relax();
+		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
+			restart = true;
+			break;
+		}
+	}
+
+	/* Upgrade it back to the expectations of entry code. */
+	local_irq_disable();
+	preempt_enable();
+
+ret:
+	if (!restart)
+		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+	return restart;
+}
+
+/*
+ * Enter the core-wide unsafe state. Sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
+ * avoid sending useless IPIs is made. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_unsafe_enter(void)
+{
+	const struct cpumask *smt_mask;
+	unsigned long flags;
+	struct rq *rq;
+	int i, cpu;
+
+	if (!static_branch_likely(&sched_core_protect_kernel))
+		return;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/* Ensure that on return to user/guest, we check whether to wait. */
+	if (current->core_cookie)
+		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
+	rq->core_this_unsafe_nest++;
+
+	/*
+	 * Should not nest: enter() should only pair with exit(). Both are done
+	 * during the first entry into kernel and the last exit from kernel.
+	 * Nested kernel entries (such as nested interrupts) will only trigger
+	 * enter() and exit() on the outer most kernel entry and exit.
+	 */
+	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
+		goto ret;
+
+	raw_spin_lock(rq_lockp(rq));
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * Contribute this CPU's unsafe_enter() to the core-wide unsafe_enter()
+	 * count.  The raw_spin_unlock() release semantics pairs with the nest
+	 * counter's smp_load_acquire() in sched_core_wait_till_safe().
+	 */
+	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
+
+	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
+		goto unlock;
+
+	if (irq_work_is_busy(&rq->core_irq_work)) {
+		/*
+		 * Do nothing more since we are in an IPI sent from another
+		 * sibling to enforce safety. That sibling would have sent IPIs
+		 * to all of the HTs.
+		 */
+		goto unlock;
+	}
+
+	/*
+	 * If we are not the first ones on the core to enter core-wide unsafe
+	 * state, do nothing.
+	 */
+	if (rq->core->core_unsafe_nest > 1)
+		goto unlock;
+
+	/* Do nothing more if the core is not tagged. */
+	if (!rq->core->core_cookie)
+		goto unlock;
+
+	for_each_cpu(i, smt_mask) {
+		struct rq *srq = cpu_rq(i);
+
+		if (i == cpu || cpu_is_offline(i))
+			continue;
+
+		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
+			continue;
+
+		/* Skip if HT is not running a tagged task. */
+		if (!srq->curr->core_cookie && !srq->core_pick)
+			continue;
+
+		/*
+		 * Force sibling into the kernel by IPI. If work was already
+		 * pending, no new IPIs are sent. This is Ok since the receiver
+		 * would already be in the kernel, or on its way to it.
+		 */
+		irq_work_queue_on(&srq->core_irq_work, i);
+	}
+unlock:
+	raw_spin_unlock(rq_lockp(rq));
+ret:
+	local_irq_restore(flags);
+}
+
+/*
+ * Process any work needed for either exiting the core-wide unsafe state,
+ * or for waiting on this hyperthread if the core is still in that state.
+ *
+ * Pairs with sched_core_unsafe_enter().
+ */
+void sched_core_unsafe_exit(void)
+{
+	unsigned long flags;
+	unsigned int nest;
+	struct rq *rq;
+	int cpu;
+
+	if (!static_branch_likely(&sched_core_protect_kernel))
+		return;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	rq = cpu_rq(cpu);
+
+	/* Do nothing if core-sched disabled. */
+	if (!sched_core_enabled(rq))
+		goto ret;
+
+	/*
+	 * Can happen when a process is forked and the first return to user
+	 * mode is a syscall exit. Either way, there's nothing to do.
+	 */
+	if (rq->core_this_unsafe_nest == 0)
+		goto ret;
+
+	rq->core_this_unsafe_nest--;
+
+	/* enter() should be paired with exit() only. */
+	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
+		goto ret;
+
+	raw_spin_lock(rq_lockp(rq));
+	/*
+	 * Core-wide nesting counter can never be 0 because we are
+	 * still in it on this CPU.
+	 */
+	nest = rq->core->core_unsafe_nest;
+	WARN_ON_ONCE(!nest);
+
+	WRITE_ONCE(rq->core->core_unsafe_nest, nest - 1);
+	/*
+	 * The raw_spin_unlock release semantics pairs with the nest counter's
+	 * smp_load_acquire() in sched_core_wait_till_safe().
+	 */
+	raw_spin_unlock(rq_lockp(rq));
+ret:
+	local_irq_restore(flags);
+}
+
 // XXX fairness/fwd progress conditions
 /*
  * Returns
@@ -5497,6 +5737,7 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
 			rq = cpu_rq(i);
 			if (rq->core && rq->core == rq)
 				core_rq = rq;
+			init_sched_core_irq_work(rq);
 		}
 
 		if (!core_rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 615092cb693c..be6691337bbb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1074,6 +1074,8 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	struct irq_work		core_irq_work; /* To force HT into kernel */
+	unsigned int		core_this_unsafe_nest;
 
 	/* shared state */
 	unsigned int		core_task_seq;
@@ -1081,6 +1083,7 @@ struct rq {
 	unsigned long		core_cookie;
 	unsigned char		core_forceidle;
 	unsigned int		core_forceidle_seq;
+	unsigned int		core_unsafe_nest;
 #endif
 };
 
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH -tip 19/32] entry/idle: Enter and exit kernel protection during idle entry and exit
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (17 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-24 16:13   ` Peter Zijlstra
  2020-11-17 23:19 ` [PATCH -tip 20/32] entry/kvm: Protect the kernel when entering from guest Joel Fernandes (Google)
                   ` (13 subsequent siblings)
  32 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

Add a generic_idle_{enter,exit} helper function to enter and exit kernel
protection when entering and exiting idle, respectively.

While at it, remove a stale RCU comment.

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/entry-common.h | 18 ++++++++++++++++++
 kernel/sched/idle.c          | 11 ++++++-----
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 022e1f114157..8f34ae625f83 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -454,4 +454,22 @@ static inline bool entry_kernel_protected(void)
 	return IS_ENABLED(CONFIG_SCHED_CORE) && sched_core_kernel_protected()
 		&& _TIF_UNSAFE_RET != 0;
 }
+
+/**
+ * generic_idle_enter - General tasks to perform during idle entry.
+ */
+static inline void generic_idle_enter(void)
+{
+	/* Entering idle ends the protected kernel region. */
+	sched_core_unsafe_exit();
+}
+
+/**
+ * generic_idle_exit  - General tasks to perform during idle exit.
+ */
+static inline void generic_idle_exit(void)
+{
+	/* Exiting idle (re)starts the protected kernel region. */
+	sched_core_unsafe_enter();
+}
 #endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8bdb214eb78f..ee4f91396c31 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -8,6 +8,7 @@
  */
 #include "sched.h"
 
+#include <linux/entry-common.h>
 #include <trace/events/power.h>
 
 /* Linker adds these: start and end of __cpuidle functions */
@@ -54,6 +55,7 @@ __setup("hlt", cpu_idle_nopoll_setup);
 
 static noinline int __cpuidle cpu_idle_poll(void)
 {
+	generic_idle_enter();
 	trace_cpu_idle(0, smp_processor_id());
 	stop_critical_timings();
 	rcu_idle_enter();
@@ -66,6 +68,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 	rcu_idle_exit();
 	start_critical_timings();
 	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+	generic_idle_exit();
 
 	return 1;
 }
@@ -156,11 +159,7 @@ static void cpuidle_idle_call(void)
 		return;
 	}
 
-	/*
-	 * The RCU framework needs to be told that we are entering an idle
-	 * section, so no more rcu read side critical sections and one more
-	 * step to the grace period
-	 */
+	generic_idle_enter();
 
 	if (cpuidle_not_available(drv, dev)) {
 		tick_nohz_idle_stop_tick();
@@ -225,6 +224,8 @@ static void cpuidle_idle_call(void)
 	 */
 	if (WARN_ON_ONCE(irqs_disabled()))
 		local_irq_enable();
+
+	generic_idle_exit();
 }
 
 /*
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 20/32] entry/kvm: Protect the kernel when entering from guest
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (18 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 19/32] entry/idle: Enter and exit kernel protection during idle entry and exit Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 21/32] sched: CGroup tagging interface for core scheduling Joel Fernandes (Google)
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Vineeth Pillai <viremana@linux.microsoft.com>

Similar to how user to kernel mode transitions are protected in earlier
patches, protect the entry into kernel from guest mode as well.

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/x86/kvm/x86.c        |  2 ++
 include/linux/entry-kvm.h | 12 ++++++++++++
 kernel/entry/kvm.c        | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 447edc0d1d5a..a50be74f70f1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8910,6 +8910,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	 */
 	smp_mb__after_srcu_read_unlock();
 
+	kvm_exit_to_guest_mode();
 	/*
 	 * This handles the case where a posted interrupt was
 	 * notified with kvm_vcpu_kick.
@@ -9003,6 +9004,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		}
 	}
 
+	kvm_enter_from_guest_mode();
 	local_irq_enable();
 	preempt_enable();
 
diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 9b93f8584ff7..67da6dcf442b 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -77,4 +77,16 @@ static inline bool xfer_to_guest_mode_work_pending(void)
 }
 #endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
 
+/**
+ * kvm_enter_from_guest_mode - Hook called just after entering kernel from guest.
+ * Caller should ensure interrupts are off.
+ */
+void kvm_enter_from_guest_mode(void);
+
+/**
+ * kvm_exit_to_guest_mode - Hook called just before entering guest from kernel.
+ * Caller should ensure interrupts are off.
+ */
+void kvm_exit_to_guest_mode(void);
+
 #endif
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 49972ee99aff..3b603e8bd5da 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -50,3 +50,36 @@ int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu)
 	return xfer_to_guest_mode_work(vcpu, ti_work);
 }
 EXPORT_SYMBOL_GPL(xfer_to_guest_mode_handle_work);
+
+/**
+ * kvm_enter_from_guest_mode - Hook called just after entering kernel from guest.
+ * Caller should ensure interrupts are off.
+ */
+void kvm_enter_from_guest_mode(void)
+{
+	if (!entry_kernel_protected())
+		return;
+	sched_core_unsafe_enter();
+}
+EXPORT_SYMBOL_GPL(kvm_enter_from_guest_mode);
+
+/**
+ * kvm_exit_to_guest_mode - Hook called just before entering guest from kernel.
+ * Caller should ensure interrupts are off.
+ */
+void kvm_exit_to_guest_mode(void)
+{
+	if (!entry_kernel_protected())
+		return;
+	sched_core_unsafe_exit();
+
+	/*
+	 * Wait here instead of in xfer_to_guest_mode_handle_work(). This is
+	 * because in vcpu_run(), xfer_to_guest_mode_handle_work() is called
+	 * when a vCPU was either runnable or blocked. However, we only care
+	 * about the runnable case (VM entry/exit) which is handled by
+	 * vcpu_enter_guest().
+	 */
+	sched_core_wait_till_safe(XFER_TO_GUEST_MODE_WORK);
+}
+EXPORT_SYMBOL_GPL(kvm_exit_to_guest_mode);
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 21/32] sched: CGroup tagging interface for core scheduling
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (19 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 20/32] entry/kvm: Protect the kernel when entering from guest Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork Joel Fernandes (Google)
                   ` (11 subsequent siblings)
  32 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly, the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also, after a task is forked, its presence in the core scheduler queue
needs to be updated according to its new cgroup's status.

Use the stop-machine mechanism to update all tasks in a cgroup, so that a
new task cannot sneak into the cgroup and be missed while we iterate
through all the tasks in the cgroup.  A more complicated scheme could
probably avoid the stop machine.  Such a scheme would also need to
resolve any inconsistency between a task's cgroup core-scheduling tag
and its residency in the core scheduler queue.

We are opting for the simple stop-machine mechanism for now, as it avoids
such complications.

The core scheduler has extra overhead.  Enable it only on cores with
more than one SMT hardware thread.
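For illustration, the `cpu.tag` file this patch adds could be exercised from user space roughly as follows. This is a hedged sketch: the mount point and the `$VM_PID` placeholder are assumptions, and `cpu.tag` exists only on non-root groups (CFTYPE_NOT_ON_ROOT):

```shell
# Assumes the cpu controller is mounted at /sys/fs/cgroup/cpu (cgroup v1)
# and that $VM_PID is a task mutually trusting with the rest of the group.
mkdir /sys/fs/cgroup/cpu/trusted
echo "$VM_PID" > /sys/fs/cgroup/cpu/trusted/tasks

# Tag the group: its tasks will share a core only with each other.
echo 1 > /sys/fs/cgroup/cpu/trusted/cpu.tag
cat /sys/fs/cgroup/cpu/trusted/cpu.tag

# Untag the group again. Values other than 0 and 1 are rejected with
# -ERANGE, and writes fail with -EINVAL if SMT is not present.
echo 0 > /sys/fs/cgroup/cpu/trusted/cpu.tag
```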

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c  | 183 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |   4 +
 2 files changed, 181 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f807a84cc30..b99a7493d590 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -157,6 +157,37 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
 	return false;
 }
 
+static bool sched_core_empty(struct rq *rq)
+{
+	return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+	return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+	struct task_struct *task;
+
+	task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+	return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *task;
+
+	while (!sched_core_empty(rq)) {
+		task = sched_core_first(rq);
+		rb_erase(&task->core_node, &rq->core_tree);
+		RB_CLEAR_NODE(&task->core_node);
+	}
+	rq->core->core_task_seq++;
+}
+
 static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct rb_node *parent, **node;
@@ -188,10 +219,11 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
+	if (!sched_core_enqueued(p))
 		return;
 
 	rb_erase(&p->core_node, &rq->core_tree);
+	RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
@@ -255,8 +287,24 @@ static int __sched_core_stopper(void *data)
 	bool enabled = !!(unsigned long)data;
 	int cpu;
 
-	for_each_possible_cpu(cpu)
-		cpu_rq(cpu)->core_enabled = enabled;
+	for_each_possible_cpu(cpu) {
+		struct rq *rq = cpu_rq(cpu);
+
+		WARN_ON_ONCE(enabled == rq->core_enabled);
+
+		if (!enabled || cpumask_weight(cpu_smt_mask(cpu)) >= 2) {
+			/*
+			 * All active and migrating tasks will have already
+			 * been removed from core queue when we clear the
+			 * cgroup tags. However, dying tasks could still be
+			 * left in core queue. Flush them here.
+			 */
+			if (!enabled)
+				sched_core_flush(cpu);
+
+			rq->core_enabled = enabled;
+		}
+	}
 
 	return 0;
 }
@@ -266,7 +314,11 @@ static int sched_core_count;
 
 static void __sched_core_enable(void)
 {
-	// XXX verify there are no cookie tasks (yet)
+	int cpu;
+
+	/* verify there are no cookie tasks (yet) */
+	for_each_online_cpu(cpu)
+		BUG_ON(!sched_core_empty(cpu_rq(cpu)));
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -274,8 +326,6 @@ static void __sched_core_enable(void)
 
 static void __sched_core_disable(void)
 {
-	// XXX verify there are no cookie tasks (left)
-
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
 }
@@ -300,6 +350,7 @@ void sched_core_put(void)
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline bool sched_core_enqueued(struct task_struct *task) { return false; }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -3978,6 +4029,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&p->core_node);
 #endif
 	return 0;
 }
@@ -7979,6 +8033,9 @@ void init_idle(struct task_struct *idle, int cpu)
 #ifdef CONFIG_SMP
 	sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
 #endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&idle->core_node);
+#endif
 }
 
 #ifdef CONFIG_SMP
@@ -8995,6 +9052,15 @@ static void sched_change_group(struct task_struct *tsk, int type)
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+	if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
+		tsk->core_cookie = 0UL;
+
+	if (tg->tagged /* && !tsk->core_cookie ? */)
+		tsk->core_cookie = (unsigned long)tg;
+#endif
+
 	tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -9087,6 +9153,18 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+#ifdef CONFIG_SCHED_CORE
+	struct task_group *tg = css_tg(css);
+
+	if (tg->tagged) {
+		sched_core_put();
+		tg->tagged = 0;
+	}
+#endif
+}
+
 static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
@@ -9652,6 +9730,82 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_CORE
+static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return !!tg->tagged;
+}
+
+struct write_core_tag {
+	struct cgroup_subsys_state *css;
+	int val;
+};
+
+static int __sched_write_tag(void *data)
+{
+	struct write_core_tag *tag = (struct write_core_tag *) data;
+	struct cgroup_subsys_state *css = tag->css;
+	int val = tag->val;
+	struct task_group *tg = css_tg(tag->css);
+	struct css_task_iter it;
+	struct task_struct *p;
+
+	tg->tagged = !!val;
+
+	css_task_iter_start(css, 0, &it);
+	/*
+	 * Note: css_task_iter_next will skip dying tasks.
+	 * There could still be dying tasks left in the core queue
+	 * when we set cgroup tag to 0 when the loop is done below.
+	 */
+	while ((p = css_task_iter_next(&it))) {
+		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+
+		if (sched_core_enqueued(p)) {
+			sched_core_dequeue(task_rq(p), p);
+			if (!p->core_cookie)
+				continue;
+		}
+
+		if (sched_core_enabled(task_rq(p)) &&
+		    p->core_cookie && task_on_rq_queued(p))
+			sched_core_enqueue(task_rq(p), p);
+
+	}
+	css_task_iter_end(&it);
+
+	return 0;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct write_core_tag wtag;
+
+	if (val > 1)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	if (tg->tagged == !!val)
+		return 0;
+
+	if (!!val)
+		sched_core_get();
+
+	wtag.css = css;
+	wtag.val = val;
+	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
+	if (!val)
+		sched_core_put();
+
+	return 0;
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -9688,6 +9842,14 @@ static struct cftype cpu_legacy_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
+#endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	{
 		.name = "uclamp.min",
@@ -9861,6 +10023,14 @@ static struct cftype cpu_files[] = {
 		.write_s64 = cpu_weight_nice_write_s64,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
 		.name = "max",
@@ -9889,6 +10059,7 @@ static struct cftype cpu_files[] = {
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_online	= cpu_cgroup_css_online,
+	.css_offline	= cpu_cgroup_css_offline,
 	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
 	.css_extra_stat_show = cpu_extra_stat_show,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be6691337bbb..3ba08973ed58 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -377,6 +377,10 @@ struct cfs_bandwidth {
 struct task_group {
 	struct cgroup_subsys_state css;
 
+#ifdef CONFIG_SCHED_CORE
+	int			tagged;
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* schedulable entities of this group on each CPU */
 	struct sched_entity	**se;
-- 
2.29.2.299.gdc1121823c-goog



* [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (20 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 21/32] sched: CGroup tagging interface for core scheduling Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-25 11:07   ` Peter Zijlstra
                     ` (6 more replies)
  2020-11-17 23:19 ` [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
                   ` (10 subsequent siblings)
  32 siblings, 7 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

In order to prevent interference and cleanly support both the per-task and
CGroup APIs, split the cookie in two and allow it to be set from either the
per-task or the CGroup API. The final cookie is the combined value of both and
is computed when stop-machine executes during a change of cookie.

Also, for the per-task cookie, using pointers to ephemeral objects would be
fragile. For this reason, introduce a refcounted object whose sole purpose is
to supply a unique cookie value by way of the object's address.

While at it, refactor the CGroup code a bit. Future patches will introduce more
APIs and support.

Reviewed-by: Josh Don <joshdon@google.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched.h |   2 +
 kernel/sched/core.c   | 241 ++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/debug.c  |   4 +
 3 files changed, 236 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a60868165590..c6a3b0fa952b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,8 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
 	struct rb_node			core_node;
 	unsigned long			core_cookie;
+	unsigned long			core_task_cookie;
+	unsigned long			core_group_cookie;
 	unsigned int			core_occupation;
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b99a7493d590..7ccca355623a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -346,11 +346,14 @@ void sched_core_put(void)
 	mutex_unlock(&sched_core_mutex);
 }
 
+static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
 static bool sched_core_enqueued(struct task_struct *task) { return false; }
+static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2) { return 0; }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -4032,6 +4035,20 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 #ifdef CONFIG_SCHED_CORE
 	RB_CLEAR_NODE(&p->core_node);
+
+	/*
+	 * Tag child via per-task cookie only if parent is tagged via per-task
+	 * cookie. This is independent of, but can be additive to the CGroup tagging.
+	 */
+	if (current->core_task_cookie) {
+
+		/* If it is not CLONE_THREAD fork, assign a unique per-task tag. */
+		if (!(clone_flags & CLONE_THREAD)) {
+			return sched_core_share_tasks(p, p);
+		}
+		/* Otherwise share the parent's per-task tag. */
+		return sched_core_share_tasks(p, current);
+	}
 #endif
 	return 0;
 }
@@ -9731,6 +9748,217 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_SCHED_CORE
+/*
+ * A simple wrapper around refcount. An allocated sched_core_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_cookie {
+	refcount_t refcnt;
+};
+
+/*
+ * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
+ * @p: The task to assign a cookie to.
+ * @cookie: The cookie to assign.
+ * @group: is it a group interface or a per-task interface.
+ *
+ * This function is typically called from a stop-machine handler.
+ */
+void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
+{
+	if (!p)
+		return;
+
+	if (group)
+		p->core_group_cookie = cookie;
+	else
+		p->core_task_cookie = cookie;
+
+	/* Use up half of the cookie's bits for task cookie and remaining for group cookie. */
+	p->core_cookie = (p->core_task_cookie <<
+				(sizeof(unsigned long) * 4)) + p->core_group_cookie;
+
+	if (sched_core_enqueued(p)) {
+		sched_core_dequeue(task_rq(p), p);
+		if (!p->core_task_cookie)
+			return;
+	}
+
+	if (sched_core_enabled(task_rq(p)) &&
+			p->core_cookie && task_on_rq_queued(p))
+		sched_core_enqueue(task_rq(p), p);
+}
+
+/* Per-task interface */
+static unsigned long sched_core_alloc_task_cookie(void)
+{
+	struct sched_core_cookie *ptr =
+		kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
+
+	if (!ptr)
+		return 0;
+	refcount_set(&ptr->refcnt, 1);
+
+	/*
+	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
+	 * is done after the stopper runs.
+	 */
+	sched_core_get();
+	return (unsigned long)ptr;
+}
+
+static bool sched_core_get_task_cookie(unsigned long cookie)
+{
+	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
+
+	/*
+	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
+	 * is done after the stopper runs.
+	 */
+	sched_core_get();
+	return refcount_inc_not_zero(&ptr->refcnt);
+}
+
+static void sched_core_put_task_cookie(unsigned long cookie)
+{
+	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
+
+	if (refcount_dec_and_test(&ptr->refcnt))
+		kfree(ptr);
+}
+
+struct sched_core_task_write_tag {
+	struct task_struct *tasks[2];
+	unsigned long cookies[2];
+};
+
+/*
+ * Requeue tasks with their updated cookies. Running this from the stopper
+ * ensures that a task cannot be migrated to a different CPU while its core
+ * scheduler queue state is being updated. It also ensures that a task
+ * actively running on another CPU is requeued.
+ */
+static int sched_core_task_join_stopper(void *data)
+{
+	struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
+	int i;
+
+	for (i = 0; i < 2; i++)
+		sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], false /* !group */);
+
+	return 0;
+}
+
+static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
+{
+	struct sched_core_task_write_tag wr = {}; /* for stop machine. */
+	bool sched_core_put_after_stopper = false;
+	unsigned long cookie;
+	int ret = -ENOMEM;
+
+	mutex_lock(&sched_core_mutex);
+
+	/*
+	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
+	 *       sched_core_get_task_cookie(). However, sched_core_put() is done
+	 *       by this function *after* the stopper removes the tasks from the
+	 *       core queue, and not before. This is just to play it safe.
+	 */
+	if (t2 == NULL) {
+		if (t1->core_task_cookie) {
+			sched_core_put_task_cookie(t1->core_task_cookie);
+			sched_core_put_after_stopper = true;
+			wr.tasks[0] = t1; /* Keep wr.cookies[0] reset for t1. */
+		}
+	} else if (t1 == t2) {
+		/* Assign a unique per-task cookie solely for t1. */
+
+		cookie = sched_core_alloc_task_cookie();
+		if (!cookie)
+			goto out_unlock;
+
+		if (t1->core_task_cookie) {
+			sched_core_put_task_cookie(t1->core_task_cookie);
+			sched_core_put_after_stopper = true;
+		}
+		wr.tasks[0] = t1;
+		wr.cookies[0] = cookie;
+	} else
+	/*
+	 * 		t1		joining		t2
+	 * CASE 1:
+	 * before	0				0
+	 * after	new cookie			new cookie
+	 *
+	 * CASE 2:
+	 * before	X (non-zero)			0
+	 * after	0				0
+	 *
+	 * CASE 3:
+	 * before	0				X (non-zero)
+	 * after	X				X
+	 *
+	 * CASE 4:
+	 * before	Y (non-zero)			X (non-zero)
+	 * after	X				X
+	 */
+	if (!t1->core_task_cookie && !t2->core_task_cookie) {
+		/* CASE 1. */
+		cookie = sched_core_alloc_task_cookie();
+		if (!cookie)
+			goto out_unlock;
+
+		/* Add another reference for the other task. */
+		if (!sched_core_get_task_cookie(cookie)) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		wr.tasks[0] = t1;
+		wr.tasks[1] = t2;
+		wr.cookies[0] = wr.cookies[1] = cookie;
+
+	} else if (t1->core_task_cookie && !t2->core_task_cookie) {
+		/* CASE 2. */
+		sched_core_put_task_cookie(t1->core_task_cookie);
+		sched_core_put_after_stopper = true;
+
+		wr.tasks[0] = t1; /* Reset cookie for t1. */
+
+	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
+		/* CASE 3. */
+		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		wr.tasks[0] = t1;
+		wr.cookies[0] = t2->core_task_cookie;
+
+	} else {
+		/* CASE 4. */
+		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+		sched_core_put_task_cookie(t1->core_task_cookie);
+		sched_core_put_after_stopper = true;
+
+		wr.tasks[0] = t1;
+		wr.cookies[0] = t2->core_task_cookie;
+	}
+
+	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
+
+	if (sched_core_put_after_stopper)
+		sched_core_put();
+
+	ret = 0;
+out_unlock:
+	mutex_unlock(&sched_core_mutex);
+	return ret;
+}
+
+/* CGroup interface */
 static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 {
 	struct task_group *tg = css_tg(css);
@@ -9761,18 +9989,9 @@ static int __sched_write_tag(void *data)
 	 * when we set cgroup tag to 0 when the loop is done below.
 	 */
 	while ((p = css_task_iter_next(&it))) {
-		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
-
-		if (sched_core_enqueued(p)) {
-			sched_core_dequeue(task_rq(p), p);
-			if (!p->core_cookie)
-				continue;
-		}
-
-		if (sched_core_enabled(task_rq(p)) &&
-		    p->core_cookie && task_on_rq_queued(p))
-			sched_core_enqueue(task_rq(p), p);
+		unsigned long cookie = !!val ? (unsigned long)tg : 0UL;
 
+		sched_core_tag_requeue(p, cookie, true /* group */);
 	}
 	css_task_iter_end(&it);
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 60a922d3f46f..8c452b8010ad 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1024,6 +1024,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		__PS("clock-delta", t1-t0);
 	}
 
+#ifdef CONFIG_SCHED_CORE
+	__PS("core_cookie", p->core_cookie);
+#endif
+
 	sched_show_numa(p, m);
 }
 
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (21 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-25 13:08   ` Peter Zijlstra
  2020-12-02 21:47   ` Chris Hyser
  2020-11-17 23:19 ` [PATCH -tip 24/32] sched: Release references to the per-task cookie on exit Joel Fernandes (Google)
                   ` (9 subsequent siblings)
  32 siblings, 2 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

Add a per-thread core scheduling interface which allows a thread to share a
core with another thread, or have a core exclusively for itself.

ChromeOS uses core-scheduling to securely enable hyperthreading. This cuts
down the keypress latency in Google Docs from 150ms to 50ms while improving
the camera streaming frame rate by ~3%.

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Reviewed-by: Aubrey Li <aubrey.intel@gmail.com>
Co-developed-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched.h            |  1 +
 include/uapi/linux/prctl.h       |  3 ++
 kernel/sched/core.c              | 51 +++++++++++++++++++++++++++++---
 kernel/sys.c                     |  3 ++
 tools/include/uapi/linux/prctl.h |  3 ++
 5 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c6a3b0fa952b..79d76c78cc8e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2083,6 +2083,7 @@ void sched_core_unsafe_enter(void);
 void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
+int sched_core_share_pid(pid_t pid);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..217b0482aea1 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,7 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE		59
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7ccca355623a..a95898c75bdf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -310,6 +310,7 @@ static int __sched_core_stopper(void *data)
 }
 
 static DEFINE_MUTEX(sched_core_mutex);
+static DEFINE_MUTEX(sched_core_tasks_mutex);
 static int sched_core_count;
 
 static void __sched_core_enable(void)
@@ -4037,8 +4038,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	RB_CLEAR_NODE(&p->core_node);
 
 	/*
-	 * Tag child via per-task cookie only if parent is tagged via per-task
-	 * cookie. This is independent of, but can be additive to the CGroup tagging.
+	 * If parent is tagged via per-task cookie, tag the child (either with
+	 * the parent's cookie, or a new one). The final cookie is calculated
+	 * by concatenating the per-task cookie with that of the CGroup.
 	 */
 	if (current->core_task_cookie) {
 
@@ -9855,7 +9857,7 @@ static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2
 	unsigned long cookie;
 	int ret = -ENOMEM;
 
-	mutex_lock(&sched_core_mutex);
+	mutex_lock(&sched_core_tasks_mutex);
 
 	/*
 	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
@@ -9954,10 +9956,51 @@ static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2
 
 	ret = 0;
 out_unlock:
-	mutex_unlock(&sched_core_mutex);
+	mutex_unlock(&sched_core_tasks_mutex);
 	return ret;
 }
 
+/* Called from prctl interface: PR_SCHED_CORE_SHARE */
+int sched_core_share_pid(pid_t pid)
+{
+	struct task_struct *task;
+	int err;
+
+	if (pid == 0) { /* Reset current task's cookie. */
+		/* Resetting a cookie requires privileges. */
+		if (current->core_task_cookie)
+			if (!capable(CAP_SYS_ADMIN))
+				return -EPERM;
+		task = NULL;
+	} else {
+		rcu_read_lock();
+		task = find_task_by_vpid(pid);
+		if (!task) {
+			rcu_read_unlock();
+			return -ESRCH;
+		}
+
+		get_task_struct(task);
+
+		/*
+		 * Check if this process has the right to modify the specified
+		 * process. Use the regular "ptrace_may_access()" checks.
+		 */
+		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+			rcu_read_unlock();
+			err = -EPERM;
+			goto out_put;
+		}
+		rcu_read_unlock();
+	}
+
+	err = sched_core_share_tasks(current, task);
+out_put:
+	if (task)
+		put_task_struct(task);
+	return err;
+}
+
 /* CGroup interface */
 static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 {
diff --git a/kernel/sys.c b/kernel/sys.c
index a730c03ee607..61a3c98e36de 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2530,6 +2530,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_SCHED_CORE_SHARE:
+		error = sched_core_share_pid(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 7f0827705c9a..4c45b7dcd92d 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -247,4 +247,7 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE		59
+
 #endif /* _LINUX_PRCTL_H */
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH -tip 24/32] sched: Release references to the per-task cookie on exit
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (22 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-25 13:03   ` Peter Zijlstra
  2020-11-17 23:19 ` [PATCH -tip 25/32] sched: Refactor core cookie into struct Joel Fernandes (Google)
                   ` (8 subsequent siblings)
  32 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

During exit, we have to drop the references to a cookie that might be shared
by many tasks. This commit therefore ensures that when the task_struct is
released, any references to cookies that it holds are also released.

Reviewed-by: Chris Hyser <chris.hyser@oracle.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched.h | 3 +++
 kernel/fork.c         | 1 +
 kernel/sched/core.c   | 8 ++++++++
 3 files changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 79d76c78cc8e..6fbdb1a204bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2084,11 +2084,14 @@ void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
 int sched_core_share_pid(pid_t pid);
+void sched_tsk_free(struct task_struct *tsk);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
+#define sched_core_share_pid(pid) (0)
+#define sched_tsk_free(tsk) do { } while (0)
 #endif
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 7199d359690c..5468c93829c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
 	exit_creds(tsk);
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
+	sched_tsk_free(tsk);
 
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a95898c75bdf..cc36c384364e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10066,6 +10066,14 @@ static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype
 
 	return 0;
 }
+
+void sched_tsk_free(struct task_struct *tsk)
+{
+	if (!tsk->core_task_cookie)
+		return;
+	sched_core_put_task_cookie(tsk->core_task_cookie);
+	sched_core_put();
+}
 #endif
 
 static struct cftype cpu_legacy_files[] = {
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH -tip 25/32] sched: Refactor core cookie into struct
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (23 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 24/32] sched: Release references to the per-task cookie on exit Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase Joel Fernandes (Google)
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Josh Don <joshdon@google.com>

The overall core cookie is currently a single unsigned long value. This
poses issues as we seek to add additional sub-fields to the cookie. This
patch refactors the core_cookie to be a pointer to a struct containing
an arbitrary set of cookie fields.

We maintain a sorted RB tree of existing core cookies so that multiple
tasks may share the same core_cookie.

This will be especially useful in the next patch, where the concept of
cookie color is introduced.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c  | 481 +++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  11 +-
 2 files changed, 429 insertions(+), 63 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cc36c384364e..bd75b3d62a97 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3958,6 +3958,7 @@ static inline void init_schedstats(void) {}
 int sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	unsigned long flags;
+	int __maybe_unused ret;
 
 	__sched_fork(clone_flags, p);
 	/*
@@ -4037,20 +4038,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SCHED_CORE
 	RB_CLEAR_NODE(&p->core_node);
 
-	/*
-	 * If parent is tagged via per-task cookie, tag the child (either with
-	 * the parent's cookie, or a new one). The final cookie is calculated
-	 * by concatenating the per-task cookie with that of the CGroup.
-	 */
-	if (current->core_task_cookie) {
-
-		/* If it is not CLONE_THREAD fork, assign a unique per-task tag. */
-		if (!(clone_flags & CLONE_THREAD)) {
-			return sched_core_share_tasks(p, p);
-		}
-		/* Otherwise share the parent's per-task tag. */
-		return sched_core_share_tasks(p, current);
-	}
+	ret = sched_core_fork(p, clone_flags);
+	if (ret)
+		return ret;
 #endif
 	return 0;
 }
@@ -9059,6 +9049,9 @@ void sched_offline_group(struct task_group *tg)
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
+void cpu_core_get_group_cookie(struct task_group *tg,
+			       unsigned long *group_cookie_ptr);
+
 static void sched_change_group(struct task_struct *tsk, int type)
 {
 	struct task_group *tg;
@@ -9073,11 +9066,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
 	tg = autogroup_task_group(tsk, tg);
 
 #ifdef CONFIG_SCHED_CORE
-	if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
-		tsk->core_cookie = 0UL;
-
-	if (tg->tagged /* && !tsk->core_cookie ? */)
-		tsk->core_cookie = (unsigned long)tg;
+	sched_core_change_group(tsk, tg);
 #endif
 
 	tsk->sched_task_group = tg;
@@ -9177,9 +9166,9 @@ static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
 #ifdef CONFIG_SCHED_CORE
 	struct task_group *tg = css_tg(css);
 
-	if (tg->tagged) {
+	if (tg->core_tagged) {
 		sched_core_put();
-		tg->tagged = 0;
+		tg->core_tagged = 0;
 	}
 #endif
 }
@@ -9751,38 +9740,225 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 
 #ifdef CONFIG_SCHED_CORE
 /*
- * A simple wrapper around refcount. An allocated sched_core_cookie's
- * address is used to compute the cookie of the task.
+ * Wrapper representing a complete cookie. The address of the cookie is used as
+ * a unique identifier. Each cookie has a unique permutation of the internal
+ * cookie fields.
  */
 struct sched_core_cookie {
+	unsigned long task_cookie;
+	unsigned long group_cookie;
+
+	struct rb_node node;
 	refcount_t refcnt;
 };
 
 /*
- * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
- * @p: The task to assign a cookie to.
- * @cookie: The cookie to assign.
- * @group: is it a group interface or a per-task interface.
+ * A simple wrapper around refcount. An allocated sched_core_task_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_task_cookie {
+	refcount_t refcnt;
+};
+
+/* All active sched_core_cookies */
+static struct rb_root sched_core_cookies = RB_ROOT;
+static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
+
+/*
+ * Returns the following:
+ * a < b  => -1
+ * a == b => 0
+ * a > b  => 1
+ */
+static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
+				 const struct sched_core_cookie *b)
+{
+#define COOKIE_CMP_RETURN(field) do {		\
+	if (a->field < b->field)		\
+		return -1;			\
+	else if (a->field > b->field)		\
+		return 1;			\
+} while (0)					\
+
+	COOKIE_CMP_RETURN(task_cookie);
+	COOKIE_CMP_RETURN(group_cookie);
+
+	/* all cookie fields match */
+	return 0;
+
+#undef COOKIE_CMP_RETURN
+}
+
+static inline void __sched_core_erase_cookie(struct sched_core_cookie *cookie)
+{
+	lockdep_assert_held(&sched_core_cookies_lock);
+
+	/* Already removed */
+	if (RB_EMPTY_NODE(&cookie->node))
+		return;
+
+	rb_erase(&cookie->node, &sched_core_cookies);
+	RB_CLEAR_NODE(&cookie->node);
+}
+
+/* Called when a task no longer points to the cookie in question */
+static void sched_core_put_cookie(struct sched_core_cookie *cookie)
+{
+	unsigned long flags;
+
+	if (!cookie)
+		return;
+
+	if (refcount_dec_and_test(&cookie->refcnt)) {
+		raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
+		__sched_core_erase_cookie(cookie);
+		raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
+		kfree(cookie);
+	}
+}
+
+/*
+ * A task's core cookie is a compound structure composed of various cookie
+ * fields (task_cookie, group_cookie). The overall core_cookie is
+ * a pointer to a struct containing those values. This function either finds
+ * an existing core_cookie or creates a new one, and then updates the task's
+ * core_cookie to point to it. Additionally, it handles the necessary reference
+ * counting.
  *
- * This function is typically called from a stop-machine handler.
+ * REQUIRES: task_rq(p) lock or called from cpu_stopper.
+ * Doing so ensures that we do not cause races/corruption by modifying/reading
+ * task cookie fields.
  */
-void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
+static void __sched_core_update_cookie(struct task_struct *p)
+{
+	struct rb_node *parent, **node;
+	struct sched_core_cookie *node_core_cookie, *match;
+	static const struct sched_core_cookie zero_cookie;
+	struct sched_core_cookie temp = {
+		.task_cookie	= p->core_task_cookie,
+		.group_cookie	= p->core_group_cookie,
+	};
+	const bool is_zero_cookie =
+		(sched_core_cookie_cmp(&temp, &zero_cookie) == 0);
+	struct sched_core_cookie *const curr_cookie =
+		(struct sched_core_cookie *)p->core_cookie;
+	unsigned long flags;
+
+	/*
+	 * Already have a cookie matching the requested settings? Nothing to
+	 * do.
+	 */
+	if ((curr_cookie && sched_core_cookie_cmp(curr_cookie, &temp) == 0) ||
+	    (!curr_cookie && is_zero_cookie))
+		return;
+
+	raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
+
+	if (is_zero_cookie) {
+		match = NULL;
+		goto finish;
+	}
+
+retry:
+	match = NULL;
+
+	node = &sched_core_cookies.rb_node;
+	parent = *node;
+	while (*node) {
+		int cmp;
+
+		node_core_cookie =
+			container_of(*node, struct sched_core_cookie, node);
+		parent = *node;
+
+		cmp = sched_core_cookie_cmp(&temp, node_core_cookie);
+		if (cmp < 0) {
+			node = &parent->rb_left;
+		} else if (cmp > 0) {
+			node = &parent->rb_right;
+		} else {
+			match = node_core_cookie;
+			break;
+		}
+	}
+
+	if (!match) {
+		/* No existing cookie; create and insert one */
+		match = kmalloc(sizeof(struct sched_core_cookie), GFP_ATOMIC);
+
+		/* Fall back to zero cookie */
+		if (WARN_ON_ONCE(!match))
+			goto finish;
+
+		match->task_cookie = temp.task_cookie;
+		match->group_cookie = temp.group_cookie;
+		refcount_set(&match->refcnt, 1);
+
+		rb_link_node(&match->node, parent, node);
+		rb_insert_color(&match->node, &sched_core_cookies);
+	} else {
+		/*
+		 * Cookie exists, increment refcnt. If refcnt is currently 0,
+		 * we're racing with a put() (refcnt decremented but cookie not
+		 * yet removed from the tree). In this case, we can simply
+		 * perform the removal ourselves and retry.
+		 * sched_core_put_cookie() will still function correctly.
+		 */
+		if (unlikely(!refcount_inc_not_zero(&match->refcnt))) {
+			__sched_core_erase_cookie(match);
+			goto retry;
+		}
+	}
+
+finish:
+	/*
+	 * Set the core_cookie under the cookies lock. This guarantees that
+	 * p->core_cookie cannot be freed while the cookies lock is held in
+	 * sched_core_fork().
+	 */
+	p->core_cookie = (unsigned long)match;
+
+	raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
+
+	sched_core_put_cookie(curr_cookie);
+}
+
+/*
+ * sched_core_update_cookie - Common helper to update a task's core cookie. This
+ * updates the selected cookie field and then updates the overall cookie.
+ * @p: The task whose cookie should be updated.
+ * @cookie: The new cookie.
+ * @cookie_type: The cookie field to which the cookie corresponds.
+ *
+ * REQUIRES: either task_rq(p)->lock held or called from a stop-machine handler.
+ * Doing so ensures that we do not cause races/corruption by modifying/reading
+ * task cookie fields.
+ */
+static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie,
+				     enum sched_core_cookie_type cookie_type)
 {
 	if (!p)
 		return;
 
-	if (group)
-		p->core_group_cookie = cookie;
-	else
+	switch (cookie_type) {
+	case sched_core_no_update:
+		break;
+	case sched_core_task_cookie_type:
 		p->core_task_cookie = cookie;
+		break;
+	case sched_core_group_cookie_type:
+		p->core_group_cookie = cookie;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+	}
 
-	/* Use up half of the cookie's bits for task cookie and remaining for group cookie. */
-	p->core_cookie = (p->core_task_cookie <<
-				(sizeof(unsigned long) * 4)) + p->core_group_cookie;
+	/* Set p->core_cookie, which is the overall cookie */
+	__sched_core_update_cookie(p);
 
 	if (sched_core_enqueued(p)) {
 		sched_core_dequeue(task_rq(p), p);
-		if (!p->core_task_cookie)
+		if (!p->core_cookie)
 			return;
 	}
 
@@ -9791,11 +9967,28 @@ void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool gr
 		sched_core_enqueue(task_rq(p), p);
 }
 
+void cpu_core_get_group_cookie(struct task_group *tg,
+			       unsigned long *group_cookie_ptr);
+
+void sched_core_change_group(struct task_struct *p, struct task_group *new_tg)
+{
+	unsigned long new_group_cookie;
+
+	cpu_core_get_group_cookie(new_tg, &new_group_cookie);
+
+	if (p->core_group_cookie == new_group_cookie)
+		return;
+
+	p->core_group_cookie = new_group_cookie;
+
+	__sched_core_update_cookie(p);
+}
+
 /* Per-task interface */
 static unsigned long sched_core_alloc_task_cookie(void)
 {
-	struct sched_core_cookie *ptr =
-		kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
+	struct sched_core_task_cookie *ptr =
+		kmalloc(sizeof(struct sched_core_task_cookie), GFP_KERNEL);
 
 	if (!ptr)
 		return 0;
@@ -9811,7 +10004,8 @@ static unsigned long sched_core_alloc_task_cookie(void)
 
 static bool sched_core_get_task_cookie(unsigned long cookie)
 {
-	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
+	struct sched_core_task_cookie *ptr =
+		(struct sched_core_task_cookie *)cookie;
 
 	/*
 	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
@@ -9823,7 +10017,8 @@ static bool sched_core_get_task_cookie(unsigned long cookie)
 
 static void sched_core_put_task_cookie(unsigned long cookie)
 {
-	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
+	struct sched_core_task_cookie *ptr =
+		(struct sched_core_task_cookie *)cookie;
 
 	if (refcount_dec_and_test(&ptr->refcnt))
 		kfree(ptr);
@@ -9845,7 +10040,8 @@ static int sched_core_task_join_stopper(void *data)
 	int i;
 
 	for (i = 0; i < 2; i++)
-		sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], false /* !group */);
+		sched_core_update_cookie(tag->tasks[i], tag->cookies[i],
+					 sched_core_task_cookie_type);
 
 	return 0;
 }
@@ -10002,41 +10198,89 @@ int sched_core_share_pid(pid_t pid)
 }
 
 /* CGroup interface */
+
+/*
+ * Helper to get the group cookie and color in a hierarchy.
+ * Any ancestor can have a tag/color. At most one color and one
+ * tag are allowed.
+ * Sets *group_cookie_ptr to the hierarchical group cookie.
+ */
+void cpu_core_get_group_cookie(struct task_group *tg,
+			       unsigned long *group_cookie_ptr)
+{
+	unsigned long group_cookie = 0UL;
+
+	if (!tg)
+		goto out;
+
+	for (; tg; tg = tg->parent) {
+
+		if (tg->core_tagged) {
+			group_cookie = (unsigned long)tg;
+			break;
+		}
+	}
+
+out:
+	*group_cookie_ptr = group_cookie;
+}
+
+/* Determine if any group in @tg's children are tagged. */
+static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag)
+{
+	struct task_group *child;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		if ((child->core_tagged && check_tag)) {
+			rcu_read_unlock();
+			return true;
+		}
+
+		rcu_read_unlock();
+		return cpu_core_check_descendants(child, check_tag);
+	}
+
+	rcu_read_unlock();
+	return false;
+}
+
 static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 {
 	struct task_group *tg = css_tg(css);
 
-	return !!tg->tagged;
+	return !!tg->core_tagged;
 }
 
 struct write_core_tag {
 	struct cgroup_subsys_state *css;
-	int val;
+	unsigned long cookie;
+	enum sched_core_cookie_type cookie_type;
 };
 
 static int __sched_write_tag(void *data)
 {
 	struct write_core_tag *tag = (struct write_core_tag *) data;
-	struct cgroup_subsys_state *css = tag->css;
-	int val = tag->val;
-	struct task_group *tg = css_tg(tag->css);
-	struct css_task_iter it;
 	struct task_struct *p;
+	struct cgroup_subsys_state *css;
 
-	tg->tagged = !!val;
+	rcu_read_lock();
+	css_for_each_descendant_pre(css, tag->css) {
+		struct css_task_iter it;
 
-	css_task_iter_start(css, 0, &it);
-	/*
-	 * Note: css_task_iter_next will skip dying tasks.
-	 * There could still be dying tasks left in the core queue
-	 * when we set cgroup tag to 0 when the loop is done below.
-	 */
-	while ((p = css_task_iter_next(&it))) {
-		unsigned long cookie = !!val ? (unsigned long)tg : 0UL;
+		css_task_iter_start(css, 0, &it);
+		/*
+		 * Note: css_task_iter_next will skip dying tasks.
+		 * There could still be dying tasks left in the core queue
+		 * when we set cgroup tag to 0 when the loop is done below.
+		 */
+		while ((p = css_task_iter_next(&it)))
+			sched_core_update_cookie(p, tag->cookie,
+						 tag->cookie_type);
 
-		sched_core_tag_requeue(p, cookie, true /* group */);
+		css_task_iter_end(&it);
 	}
-	css_task_iter_end(&it);
+	rcu_read_unlock();
 
 	return 0;
 }
@@ -10045,6 +10289,7 @@ static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype
 {
 	struct task_group *tg = css_tg(css);
 	struct write_core_tag wtag;
+	unsigned long group_cookie;
 
 	if (val > 1)
 		return -ERANGE;
@@ -10052,14 +10297,29 @@ static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype
 	if (!static_branch_likely(&sched_smt_present))
 		return -EINVAL;
 
-	if (tg->tagged == !!val)
+	if (!tg->core_tagged && val) {
+		/* Tag is being set. Check ancestors and descendants. */
+		cpu_core_get_group_cookie(tg, &group_cookie);
+		if (group_cookie ||
+		    cpu_core_check_descendants(tg, true /* tag */))
+			return -EBUSY;
+	} else if (tg->core_tagged && !val) {
+		/* Tag is being reset. Check descendants. */
+		if (cpu_core_check_descendants(tg, true /* tag */))
+			return -EBUSY;
+	} else {
 		return 0;
+	}
 
 	if (!!val)
 		sched_core_get();
 
 	wtag.css = css;
-	wtag.val = val;
+	wtag.cookie = (unsigned long)tg;
+	wtag.cookie_type = sched_core_group_cookie_type;
+
+	tg->core_tagged = val;
+
 	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
 	if (!val)
 		sched_core_put();
@@ -10067,8 +10327,105 @@ static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype
 	return 0;
 }
 
+static int sched_update_core_tag_stopper(void *data)
+{
+	struct task_struct *p = (struct task_struct *)data;
+
+	/* Recalculate core cookie */
+	sched_core_update_cookie(p, 0, sched_core_no_update);
+
+	return 0;
+}
+
+/* Called from sched_fork() */
+int sched_core_fork(struct task_struct *p, unsigned long clone_flags)
+{
+	struct sched_core_cookie *parent_cookie =
+		(struct sched_core_cookie *)current->core_cookie;
+
+	/*
+	 * core_cookie is ref counted; avoid an uncounted reference.
+	 * If p should have a cookie, it will be set below.
+	 */
+	p->core_cookie = 0UL;
+
+	/*
+	 * If parent is tagged via per-task cookie, tag the child (either with
+	 * the parent's cookie, or a new one).
+	 *
+	 * We can return directly in this case, because sched_core_share_tasks()
+	 * will set the core_cookie (so there is no need to try to inherit from
+	 * the parent). The cookie will have the proper sub-fields (ie. group
+	 * cookie, etc.), because these come from p's task_struct, which is
+	 * dup'd from the parent.
+	 */
+	if (current->core_task_cookie) {
+		int ret;
+
+		/* If it is not CLONE_THREAD fork, assign a unique per-task tag. */
+		if (!(clone_flags & CLONE_THREAD)) {
+			ret = sched_core_share_tasks(p, p);
+		} else {
+			/* Otherwise share the parent's per-task tag. */
+			ret = sched_core_share_tasks(p, current);
+		}
+
+		if (ret)
+			return ret;
+
+		/*
+		 * We expect sched_core_share_tasks() to always update p's
+		 * core_cookie.
+		 */
+		WARN_ON_ONCE(!p->core_cookie);
+
+		return 0;
+	}
+
+	/*
+	 * If parent is tagged, inherit the cookie and ensure that the reference
+	 * count is updated.
+	 *
+	 * Technically, we could instead zero-out the task's group_cookie and
+	 * allow sched_core_change_group() to handle this post-fork, but
+	 * inheriting here has a performance advantage, since we don't
+	 * need to traverse the core_cookies RB tree and can instead grab the
+	 * parent's cookie directly.
+	 */
+	if (parent_cookie) {
+		bool need_stopper = false;
+		unsigned long flags;
+
+		/*
+		 * cookies lock prevents task->core_cookie from changing or
+		 * being freed
+		 */
+		raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
+
+		if (likely(refcount_inc_not_zero(&parent_cookie->refcnt))) {
+			p->core_cookie = (unsigned long)parent_cookie;
+		} else {
+			/*
+			 * Raced with put(). We'll use stop_machine to get
+			 * a core_cookie
+			 */
+			need_stopper = true;
+		}
+
+		raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
+
+		if (need_stopper)
+			stop_machine(sched_update_core_tag_stopper,
+				     (void *)p, NULL);
+	}
+
+	return 0;
+}
+
 void sched_tsk_free(struct task_struct *tsk)
 {
+	sched_core_put_cookie((struct sched_core_cookie *)tsk->core_cookie);
+
 	if (!tsk->core_task_cookie)
 		return;
 	sched_core_put_task_cookie(tsk->core_task_cookie);
@@ -10114,7 +10471,7 @@ static struct cftype cpu_legacy_files[] = {
 #endif
 #ifdef CONFIG_SCHED_CORE
 	{
-		.name = "tag",
+		.name = "core_tag",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = cpu_core_tag_read_u64,
 		.write_u64 = cpu_core_tag_write_u64,
@@ -10295,7 +10652,7 @@ static struct cftype cpu_files[] = {
 #endif
 #ifdef CONFIG_SCHED_CORE
 	{
-		.name = "tag",
+		.name = "core_tag",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = cpu_core_tag_read_u64,
 		.write_u64 = cpu_core_tag_write_u64,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3ba08973ed58..042a9d6a3be9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -378,7 +378,7 @@ struct task_group {
 	struct cgroup_subsys_state css;
 
 #ifdef CONFIG_SCHED_CORE
-	int			tagged;
+	int			core_tagged;
 #endif
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1130,6 +1130,12 @@ static inline bool is_migration_disabled(struct task_struct *p)
 #ifdef CONFIG_SCHED_CORE
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+enum sched_core_cookie_type {
+	sched_core_no_update = 0,
+	sched_core_task_cookie_type,
+	sched_core_group_cookie_type,
+};
+
 static inline bool sched_core_enabled(struct rq *rq)
 {
 	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
@@ -1174,6 +1180,9 @@ static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
 	return idle_core || rq->core->core_cookie == p->core_cookie;
 }
 
+void sched_core_change_group(struct task_struct *p, struct task_group *new_tg);
+int sched_core_fork(struct task_struct *p, unsigned long clone_flags);
+
 extern void queue_core_balance(struct rq *rq);
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread
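The refactor above replaces the old bit-packing scheme (task cookie in the upper half of an unsigned long, group cookie in the lower half) with a refcounted struct that is compared field by field, as sched_core_cookie_cmp() does with COOKIE_CMP_RETURN. A minimal userspace sketch of that field-by-field ordering (struct and function names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the patch's compound cookie: each constituent
 * cookie keeps its full width instead of being squeezed into half of
 * an unsigned long. */
struct core_cookie {
	unsigned long task_cookie;
	unsigned long group_cookie;
};

/* Mirrors the COOKIE_CMP_RETURN pattern: order by each field in turn;
 * two cookies are equal only when every field matches. */
static int cookie_cmp(const struct core_cookie *a,
		      const struct core_cookie *b)
{
	if (a->task_cookie != b->task_cookie)
		return a->task_cookie < b->task_cookie ? -1 : 1;
	if (a->group_cookie != b->group_cookie)
		return a->group_cookie < b->group_cookie ? -1 : 1;
	return 0; /* all cookie fields match */
}
```

This total order is what lets the kernel keep the cookies in an RB tree and find an existing match before allocating a new one.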

* [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (24 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 25/32] sched: Refactor core cookie into struct Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-25 13:42   ` Peter Zijlstra
  2020-11-17 23:19 ` [PATCH -tip 27/32] sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG Joel Fernandes (Google)
                   ` (6 subsequent siblings)
  32 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

From: Josh Don <joshdon@google.com>

Google has a usecase where the first-level tag on a CGroup is not
sufficient. So, a patch has been carried for years that adds a second tag,
writable by unprivileged users.

Google uses DAC controls so that the 'tag' can be set only by root, while
the second-level 'color' can be changed by anyone. The actual names that
Google uses are different, but the concept is the same.

The hierarchy looks like:

Root group
   / \
  A   B    (These are created by the root daemon - borglet).
 / \   \
C   D   E  (These are created by AppEngine within the container).

The reason Google has two parts is that AppEngine wants to allow a subset of
subcgroups within a parent tagged cgroup to share execution. Think of these
subcgroups as belonging to the same customer or project. Because these
subcgroups are created by AppEngine, they are not tracked by borglet (the
root daemon), so borglet has no chance to set a color for them. That's where
the 'color' file comes in. The color can be set by AppEngine, and once set,
normal tasks within the subcgroup cannot overwrite it. This is enforced by
promoting the permissions of the color file in cgroupfs.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 120 +++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h  |   2 +
 3 files changed, 103 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6fbdb1a204bf..c9efdf8ccdf3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -690,6 +690,7 @@ struct task_struct {
 	unsigned long			core_cookie;
 	unsigned long			core_task_cookie;
 	unsigned long			core_group_cookie;
+	unsigned long			core_color;
 	unsigned int			core_occupation;
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bd75b3d62a97..8f17ec8e993e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9049,9 +9049,6 @@ void sched_offline_group(struct task_group *tg)
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
-void cpu_core_get_group_cookie(struct task_group *tg,
-			       unsigned long *group_cookie_ptr);
-
 static void sched_change_group(struct task_struct *tsk, int type)
 {
 	struct task_group *tg;
@@ -9747,6 +9744,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 struct sched_core_cookie {
 	unsigned long task_cookie;
 	unsigned long group_cookie;
+	unsigned long color;
 
 	struct rb_node node;
 	refcount_t refcnt;
@@ -9782,6 +9780,7 @@ static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
 
 	COOKIE_CMP_RETURN(task_cookie);
 	COOKIE_CMP_RETURN(group_cookie);
+	COOKIE_CMP_RETURN(color);
 
 	/* all cookie fields match */
 	return 0;
@@ -9819,7 +9818,7 @@ static void sched_core_put_cookie(struct sched_core_cookie *cookie)
 
 /*
  * A task's core cookie is a compound structure composed of various cookie
- * fields (task_cookie, group_cookie). The overall core_cookie is
+ * fields (task_cookie, group_cookie, color). The overall core_cookie is
  * a pointer to a struct containing those values. This function either finds
  * an existing core_cookie or creates a new one, and then updates the task's
  * core_cookie to point to it. Additionally, it handles the necessary reference
@@ -9837,6 +9836,7 @@ static void __sched_core_update_cookie(struct task_struct *p)
 	struct sched_core_cookie temp = {
 		.task_cookie	= p->core_task_cookie,
 		.group_cookie	= p->core_group_cookie,
+		.color		= p->core_color
 	};
 	const bool is_zero_cookie =
 		(sched_core_cookie_cmp(&temp, &zero_cookie) == 0);
@@ -9892,6 +9892,7 @@ static void __sched_core_update_cookie(struct task_struct *p)
 
 		match->task_cookie = temp.task_cookie;
 		match->group_cookie = temp.group_cookie;
+		match->color = temp.color;
 		refcount_set(&match->refcnt, 1);
 
 		rb_link_node(&match->node, parent, node);
@@ -9949,6 +9950,9 @@ static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie
 	case sched_core_group_cookie_type:
 		p->core_group_cookie = cookie;
 		break;
+	case sched_core_color_type:
+		p->core_color = cookie;
+		break;
 	default:
 		WARN_ON_ONCE(1);
 	}
@@ -9967,19 +9971,23 @@ static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie
 		sched_core_enqueue(task_rq(p), p);
 }
 
-void cpu_core_get_group_cookie(struct task_group *tg,
-			       unsigned long *group_cookie_ptr);
+void cpu_core_get_group_cookie_and_color(struct task_group *tg,
+					 unsigned long *group_cookie_ptr,
+					 unsigned long *color_ptr);
 
 void sched_core_change_group(struct task_struct *p, struct task_group *new_tg)
 {
-	unsigned long new_group_cookie;
+	unsigned long new_group_cookie, new_color;
 
-	cpu_core_get_group_cookie(new_tg, &new_group_cookie);
+	cpu_core_get_group_cookie_and_color(new_tg, &new_group_cookie,
+					    &new_color);
 
-	if (p->core_group_cookie == new_group_cookie)
+	if (p->core_group_cookie == new_group_cookie &&
+	    p->core_color == new_color)
 		return;
 
 	p->core_group_cookie = new_group_cookie;
+	p->core_color = new_color;
 
 	__sched_core_update_cookie(p);
 }
@@ -10203,17 +10211,24 @@ int sched_core_share_pid(pid_t pid)
  * Helper to get the group cookie and color in a hierarchy.
 * Any ancestor can have a tag/color. At most one color and one
 * tag are allowed.
- * Sets *group_cookie_ptr to the hierarchical group cookie.
+ * Sets *group_cookie_ptr and *color_ptr to the hierarchical group cookie
+ * and color.
  */
-void cpu_core_get_group_cookie(struct task_group *tg,
-			       unsigned long *group_cookie_ptr)
+void cpu_core_get_group_cookie_and_color(struct task_group *tg,
+					 unsigned long *group_cookie_ptr,
+					 unsigned long *color_ptr)
 {
 	unsigned long group_cookie = 0UL;
+	unsigned long color = 0UL;
 
 	if (!tg)
 		goto out;
 
 	for (; tg; tg = tg->parent) {
+		if (tg->core_tag_color) {
+			WARN_ON_ONCE(color);
+			color = tg->core_tag_color;
+		}
 
 		if (tg->core_tagged) {
 			group_cookie = (unsigned long)tg;
@@ -10223,22 +10238,25 @@ void cpu_core_get_group_cookie(struct task_group *tg,
 
 out:
 	*group_cookie_ptr = group_cookie;
+	*color_ptr = color;
 }
 
-/* Determine if any group in @tg's children are tagged. */
-static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag)
+/* Determine if any group in @tg's children are tagged or colored. */
+static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag,
+					bool check_color)
 {
 	struct task_group *child;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(child, &tg->children, siblings) {
-		if ((child->core_tagged && check_tag)) {
+		if ((child->core_tagged && check_tag) ||
+		    (child->core_tag_color && check_color)) {
 			rcu_read_unlock();
 			return true;
 		}
 
 		rcu_read_unlock();
-		return cpu_core_check_descendants(child, check_tag);
+		return cpu_core_check_descendants(child, check_tag, check_color);
 	}
 
 	rcu_read_unlock();
@@ -10252,6 +10270,13 @@ static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype
 	return !!tg->core_tagged;
 }
 
+static u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return tg->core_tag_color;
+}
+
 struct write_core_tag {
 	struct cgroup_subsys_state *css;
 	unsigned long cookie;
@@ -10289,7 +10314,7 @@ static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype
 {
 	struct task_group *tg = css_tg(css);
 	struct write_core_tag wtag;
-	unsigned long group_cookie;
+	unsigned long group_cookie, color;
 
 	if (val > 1)
 		return -ERANGE;
@@ -10299,13 +10324,13 @@ static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype
 
 	if (!tg->core_tagged && val) {
 		/* Tag is being set. Check ancestors and descendants. */
-		cpu_core_get_group_cookie(tg, &group_cookie);
+		cpu_core_get_group_cookie_and_color(tg, &group_cookie, &color);
 		if (group_cookie ||
-		    cpu_core_check_descendants(tg, true /* tag */))
+		    cpu_core_check_descendants(tg, true /* tag */, true /* color */))
 			return -EBUSY;
 	} else if (tg->core_tagged && !val) {
 		/* Tag is being reset. Check descendants. */
-		if (cpu_core_check_descendants(tg, true /* tag */))
+		if (cpu_core_check_descendants(tg, true /* tag */, true /* color */))
 			return -EBUSY;
 	} else {
 		return 0;
@@ -10327,6 +10352,49 @@ static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype
 	return 0;
 }
 
+static int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
+					struct cftype *cft, u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct write_core_tag wtag;
+	unsigned long group_cookie, color;
+
+	if (val > ULONG_MAX)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	cpu_core_get_group_cookie_and_color(tg, &group_cookie, &color);
+	/* Can't set color if nothing in the ancestors were tagged. */
+	if (!group_cookie)
+		return -EINVAL;
+
+	/*
+	 * Something in the ancestors already colors us. Can't change the color
+	 * at this level.
+	 */
+	if (!tg->core_tag_color && color)
+		return -EINVAL;
+
+	/*
+	 * Check if any descendants are colored. If so, we can't recolor them.
+	 * Don't need to check if descendants are tagged, since we don't allow
+	 * tagging when already tagged.
+	 */
+	if (cpu_core_check_descendants(tg, false /* tag */, true /* color */))
+		return -EINVAL;
+
+	wtag.css = css;
+	wtag.cookie = val;
+	wtag.cookie_type = sched_core_color_type;
+	tg->core_tag_color = val;
+
+	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
+
+	return 0;
+}
+
 static int sched_update_core_tag_stopper(void *data)
 {
 	struct task_struct *p = (struct task_struct *)data;
@@ -10476,6 +10544,12 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_core_tag_read_u64,
 		.write_u64 = cpu_core_tag_write_u64,
 	},
+	{
+		.name = "core_tag_color",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_color_read_u64,
+		.write_u64 = cpu_core_tag_color_write_u64,
+	},
 #endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	{
@@ -10657,6 +10731,12 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_core_tag_read_u64,
 		.write_u64 = cpu_core_tag_write_u64,
 	},
+	{
+		.name = "core_tag_color",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_color_read_u64,
+		.write_u64 = cpu_core_tag_color_write_u64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 042a9d6a3be9..0ca22918b69a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -379,6 +379,7 @@ struct task_group {
 
 #ifdef CONFIG_SCHED_CORE
 	int			core_tagged;
+	unsigned long		core_tag_color;
 #endif
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1134,6 +1135,7 @@ enum sched_core_cookie_type {
 	sched_core_no_update = 0,
 	sched_core_task_cookie_type,
 	sched_core_group_cookie_type,
+	sched_core_color_type
 };
 
 static inline bool sched_core_enabled(struct rq *rq)
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread
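The hierarchical walk in cpu_core_get_group_cookie_and_color() above can be modeled in userspace as follows. This is a simplified sketch, not the kernel code: RCU and the WARN_ON_ONCE on a doubly-set color are omitted, and the struct carries only the fields the walk needs. The nearest tagged ancestor supplies the group cookie (its own address, as in the patch), and any single ancestor may supply the color:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's task_group; only the fields used
 * by the walk are modeled here. */
struct task_group {
	struct task_group *parent;
	int core_tagged;
	unsigned long core_tag_color;
};

/* Walk from tg toward the root. A color found higher up overwrites one
 * found lower down, matching the patch; the write-side checks ensure at
 * most one level in the hierarchy actually sets a color. */
static void get_group_cookie_and_color(struct task_group *tg,
				       unsigned long *cookie,
				       unsigned long *color)
{
	*cookie = 0UL;
	*color = 0UL;
	for (; tg; tg = tg->parent) {
		if (tg->core_tag_color)
			*color = tg->core_tag_color;
		if (tg->core_tagged) {
			*cookie = (unsigned long)tg;
			break;	/* nearest tagged ancestor wins */
		}
	}
}
```

Note the walk stops at the first tagged ancestor, so a tag deeper in the hierarchy shadows any tag above it; the write path's -EBUSY checks prevent that situation from arising in the first place.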

* [PATCH -tip 27/32] sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (25 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 28/32] kselftest: Add tests for core-sched interface Joel Fernandes (Google)
                   ` (5 subsequent siblings)
  32 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

This will be used by kselftest to verify the CGroup cookie value that is
set by the CGroup interface.

Reviewed-by: Josh Don <joshdon@google.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f17ec8e993e..f1d9762b571a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10277,6 +10277,21 @@ static u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css, struct c
 	return tg->core_tag_color;
 }
 
+#ifdef CONFIG_SCHED_DEBUG
+static u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	unsigned long group_cookie, color;
+
+	cpu_core_get_group_cookie_and_color(css_tg(css), &group_cookie, &color);
+
+	/*
+	 * Combine group_cookie and color into a single 64 bit value, for
+	 * display purposes only.
+	 */
+	return (group_cookie << 32) | (color & 0xffffffff);
+}
+#endif
+
 struct write_core_tag {
 	struct cgroup_subsys_state *css;
 	unsigned long cookie;
@@ -10550,6 +10565,14 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_core_tag_color_read_u64,
 		.write_u64 = cpu_core_tag_color_write_u64,
 	},
+#ifdef CONFIG_SCHED_DEBUG
+	/* Read the effective cookie (color+tag) of the group. */
+	{
+		.name = "core_group_cookie",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_group_cookie_read_u64,
+	},
+#endif
 #endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	{
@@ -10737,6 +10760,14 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_core_tag_color_read_u64,
 		.write_u64 = cpu_core_tag_color_write_u64,
 	},
+#ifdef CONFIG_SCHED_DEBUG
+	/* Read the effective cookie (color+tag) of the group. */
+	{
+		.name = "core_group_cookie",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_group_cookie_read_u64,
+	},
+#endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread
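The debug node above combines the two values via (group_cookie << 32) | (color & 0xffffffff). A standalone sketch of that packing (the function name is mine, not the kernel's; note that on 64-bit the group cookie's upper 32 bits are truncated by the shift, which is acceptable for a display-only value):

```c
#include <assert.h>
#include <stdint.h>

/* Models cpu_core_group_cookie_read_u64(): group cookie in the high
 * 32 bits, color in the low 32 bits, for display purposes only. */
static uint64_t pack_cookie_for_display(unsigned long group_cookie,
					unsigned long color)
{
	return ((uint64_t)group_cookie << 32) | (color & 0xffffffffULL);
}
```

A selftest reading cpu.core_group_cookie can therefore recover the color as the low 32 bits and a truncated group cookie as the high 32 bits of the printed u64.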

* [PATCH -tip 28/32] kselftest: Add tests for core-sched interface
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (26 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 27/32] sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-17 23:19 ` [PATCH -tip 29/32] sched: Move core-scheduler interfacing code to a new file Joel Fernandes (Google)
                   ` (4 subsequent siblings)
  32 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

Add a kselftest test to ensure that the core-sched interface is working
correctly.

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Reviewed-by: Josh Don <joshdon@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/testing/selftests/sched/.gitignore      |   1 +
 tools/testing/selftests/sched/Makefile        |  14 +
 tools/testing/selftests/sched/config          |   1 +
 .../testing/selftests/sched/test_coresched.c  | 818 ++++++++++++++++++
 4 files changed, 834 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index 000000000000..4660929b0b9a
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
new file mode 100644
index 000000000000..e43b74fc5d7e
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+	  $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := test_coresched
+TEST_PROGS := test_coresched
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config b/tools/testing/selftests/sched/config
new file mode 100644
index 000000000000..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/test_coresched.c b/tools/testing/selftests/sched/test_coresched.c
new file mode 100644
index 000000000000..70ed2758fe23
--- /dev/null
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -0,0 +1,818 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Core-scheduling selftests.
+ *
+ * Copyright (C) 2020, Joel Fernandes.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/mount.h>
+#include <sys/prctl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE 59
+#endif
+
+#ifndef DEBUG_PRINT
+#define dprint(...)
+#else
+#define dprint(str, args...) printf("DEBUG: %s: " str "\n", __func__, ##args)
+#endif
+
+void print_banner(char *s)
+{
+    printf("coresched: %s:  ", s);
+}
+
+void print_pass(void)
+{
+    printf("PASS\n");
+}
+
+void assert_cond(int cond, char *str)
+{
+    if (!cond) {
+	printf("Error: %s\n", str);
+	abort();
+    }
+}
+
+char *make_group_root(void)
+{
+	char *mntpath, *mnt;
+	int ret;
+
+	mntpath = malloc(50);
+	if (!mntpath) {
+	    perror("Failed to allocate mntpath\n");
+	    abort();
+	}
+
+	sprintf(mntpath, "/tmp/coresched-test-XXXXXX");
+	mnt = mkdtemp(mntpath);
+	if (!mnt) {
+		perror("Failed to create mount: ");
+		exit(-1);
+	}
+
+	ret = mount("nodev", mnt, "cgroup", 0, "cpu");
+	if (ret == -1) {
+		perror("Failed to mount cgroup: ");
+		exit(-1);
+	}
+
+	return mnt;
+}
+
+char *read_group_cookie(char *cgroup_path)
+{
+    char path[50] = {}, *val;
+    int fd;
+
+    sprintf(path, "%s/cpu.core_group_cookie", cgroup_path);
+    fd = open(path, O_RDONLY, 0666);
+    if (fd == -1) {
+	perror("Open of cgroup tag path failed: ");
+	abort();
+    }
+
+    val = calloc(1, 50);
+    if (read(fd, val, 50) == -1) {
+	perror("Failed to read group cookie: ");
+	abort();
+    }
+
+    val[strcspn(val, "\r\n")] = 0;
+
+    close(fd);
+    return val;
+}
+
+void assert_group_tag(char *cgroup_path, char *tag)
+{
+    char tag_path[50] = {}, rdbuf[8] = {};
+    int tfd;
+
+    sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+    tfd = open(tag_path, O_RDONLY, 0666);
+    if (tfd == -1) {
+	perror("Open of cgroup tag path failed: ");
+	abort();
+    }
+
+    if (read(tfd, rdbuf, 1) != 1) {
+	perror("Failed to enable coresched on cgroup: ");
+	abort();
+    }
+
+    if (strcmp(rdbuf, tag)) {
+	printf("Group tag does not match (exp: %s, act: %s)\n", tag, rdbuf);
+	abort();
+    }
+
+    if (close(tfd) == -1) {
+	perror("Failed to close tag fd: ");
+	abort();
+    }
+}
+
+void assert_group_color(char *cgroup_path, const char *color)
+{
+    char tag_path[50] = {}, rdbuf[8] = {};
+    int tfd;
+
+    sprintf(tag_path, "%s/cpu.core_tag_color", cgroup_path);
+    tfd = open(tag_path, O_RDONLY, 0666);
+    if (tfd == -1) {
+	perror("Open of cgroup tag path failed: ");
+	abort();
+    }
+
+    if (read(tfd, rdbuf, 8) == -1) {
+	perror("Failed to read group color\n");
+	abort();
+    }
+
+    if (strncmp(color, rdbuf, strlen(color))) {
+	printf("Group color does not match (exp: %s, act: %s)\n", color, rdbuf);
+	abort();
+    }
+
+    if (close(tfd) == -1) {
+	perror("Failed to close color fd: ");
+	abort();
+    }
+}
+
+void color_group(char *cgroup_path, const char *color_str)
+{
+	char tag_path[50];
+	int tfd, color, ret;
+
+	color = atoi(color_str);
+
+	sprintf(tag_path, "%s/cpu.core_tag_color", cgroup_path);
+	tfd = open(tag_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
+
+	ret = write(tfd, color_str, strlen(color_str));
+
+	if (color < 1) {
+		/* Out-of-range colors must be rejected by the kernel. */
+		assert_cond(ret == -1,
+			    "Writing invalid range color should have failed!");
+		close(tfd);
+		return;
+	}
+
+	if (ret == -1) {
+		perror("Failed to set color on cgroup: ");
+		abort();
+	}
+
+	if (close(tfd) == -1) {
+		perror("Failed to close tag fd: ");
+		abort();
+	}
+
+	assert_group_color(cgroup_path, color_str);
+}
+
+void tag_group(char *cgroup_path)
+{
+	char tag_path[50];
+	int tfd;
+
+	sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+	tfd = open(tag_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
+
+	if (write(tfd, "1", 1) != 1) {
+		perror("Failed to enable coresched on cgroup: ");
+		abort();
+	}
+
+	if (close(tfd) == -1) {
+		perror("Failed to close tag fd: ");
+		abort();
+	}
+
+	assert_group_tag(cgroup_path, "1");
+}
+
+void untag_group(char *cgroup_path)
+{
+	char tag_path[50];
+	int tfd;
+
+	sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+	tfd = open(tag_path, O_WRONLY, 0666);
+	if (tfd == -1) {
+		perror("Open of cgroup tag path failed: ");
+		abort();
+	}
+
+	if (write(tfd, "0", 1) != 1) {
+		perror("Failed to disable coresched on cgroup: ");
+		abort();
+	}
+
+	if (close(tfd) == -1) {
+		perror("Failed to close tag fd: ");
+		abort();
+	}
+
+	assert_group_tag(cgroup_path, "0");
+}
+
+char *make_group(char *parent, char *name)
+{
+	char *cgroup_path;
+	int ret;
+
+	if (!parent && !name)
+	    return make_group_root();
+
+	cgroup_path = malloc(50);
+	if (!cgroup_path) {
+	    perror("Failed to allocate cgroup_path");
+	    abort();
+	}
+
+	/* Make the cgroup node for this group */
+	sprintf(cgroup_path, "%s/%s", parent, name);
+	ret = mkdir(cgroup_path, 0644);
+	if (ret == -1) {
+		perror("Failed to create group in cgroup: ");
+		abort();
+	}
+
+	return cgroup_path;
+}
+
+static void del_group(char *path)
+{
+    if (rmdir(path) != 0) {
+	printf("Removal of group failed\n");
+	abort();
+    }
+
+    free(path);
+}
+
+static void del_root_group(char *path)
+{
+    if (umount(path) != 0) {
+	perror("umount of cgroup failed");
+	abort();
+    }
+
+    if (rmdir(path) != 0) {
+	printf("Removal of group failed\n");
+	abort();
+    }
+
+    free(path);
+}
+
+void assert_group_cookie_equal(char *c1, char *c2)
+{
+    char *v1, *v2;
+
+    v1 = read_group_cookie(c1);
+    v2 = read_group_cookie(c2);
+    if (strcmp(v1, v2)) {
+	printf("Group cookies not equal\n");
+	abort();
+    }
+
+    free(v1);
+    free(v2);
+}
+
+void assert_group_cookie_not_equal(char *c1, char *c2)
+{
+    char *v1, *v2;
+
+    v1 = read_group_cookie(c1);
+    v2 = read_group_cookie(c2);
+    if (!strcmp(v1, v2)) {
+	printf("Group cookies equal\n");
+	abort();
+    }
+
+    free(v1);
+    free(v2);
+}
+
+void assert_group_cookie_not_zero(char *c1)
+{
+    char *v1 = read_group_cookie(c1);
+
+    v1[1] = 0;
+    if (!strcmp(v1, "0")) {
+	printf("Group cookie zero\n");
+	abort();
+    }
+    free(v1);
+}
+
+void assert_group_cookie_zero(char *c1)
+{
+    char *v1 = read_group_cookie(c1);
+
+    v1[1] = 0;
+    if (strcmp(v1, "0")) {
+	printf("Group cookie not zero");
+	abort();
+    }
+    free(v1);
+}
+
+struct task_state {
+    int pid_share;
+    char pid_str[50];
+    pthread_mutex_t m;
+    pthread_cond_t cond;
+    pthread_cond_t cond_par;
+};
+
+struct task_state *add_task(char *p)
+{
+    struct task_state *mem;
+    pthread_mutexattr_t am;
+    pthread_condattr_t a;
+    char tasks_path[50];
+    int tfd, pid, ret;
+
+    sprintf(tasks_path, "%s/tasks", p);
+    tfd = open(tasks_path, O_WRONLY, 0666);
+    if (tfd == -1) {
+	perror("Open of cgroup tasks path failed: ");
+	abort();
+    }
+
+    mem = mmap(NULL, sizeof *mem, PROT_READ | PROT_WRITE,
+	    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+    if (mem == MAP_FAILED) {
+	perror("Failed to mmap shared task state: ");
+	abort();
+    }
+    memset(mem, 0, sizeof(*mem));
+
+    pthread_condattr_init(&a);
+    pthread_condattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
+    pthread_mutexattr_init(&am);
+    pthread_mutexattr_setpshared(&am, PTHREAD_PROCESS_SHARED);
+
+    pthread_cond_init(&mem->cond, &a);
+    pthread_cond_init(&mem->cond_par, &a);
+    pthread_mutex_init(&mem->m, &am);
+
+    pid = fork();
+    if (pid == 0) {
+	while(1) {
+	    pthread_mutex_lock(&mem->m);
+	    while(!mem->pid_share)
+		pthread_cond_wait(&mem->cond, &mem->m);
+
+	    pid = mem->pid_share;
+	    mem->pid_share = 0;
+	    if (pid == -1)
+		pid = 0;
+	    prctl(PR_SCHED_CORE_SHARE, pid);
+	    pthread_mutex_unlock(&mem->m);
+	    pthread_cond_signal(&mem->cond_par);
+	}
+    }
+
+    sprintf(mem->pid_str, "%d", pid);
+    dprint("add task %d to group %s", pid, p);
+
+    ret = write(tfd, mem->pid_str, strlen(mem->pid_str));
+    assert_cond(ret != -1,
+	    "Failed to write pid into tasks");
+
+    close(tfd);
+    return mem;
+}
+
+/* Make t1 share with t2 */
+void make_tasks_share(struct task_state *t1, struct task_state *t2)
+{
+    int p2 = atoi(t2->pid_str);
+    dprint("task %s %s", t1->pid_str, t2->pid_str);
+
+    pthread_mutex_lock(&t1->m);
+    t1->pid_share = p2;
+    pthread_mutex_unlock(&t1->m);
+
+    pthread_cond_signal(&t1->cond);
+
+    pthread_mutex_lock(&t1->m);
+    while (t1->pid_share)
+	pthread_cond_wait(&t1->cond_par, &t1->m);
+    pthread_mutex_unlock(&t1->m);
+}
+
+/* Reset t1's cookie, re-attaching it to its CGroup's cookie. */
+void reset_task_cookie(struct task_state *t1)
+{
+    dprint("task %s", t1->pid_str);
+
+    pthread_mutex_lock(&t1->m);
+    t1->pid_share = -1;
+    pthread_mutex_unlock(&t1->m);
+
+    pthread_cond_signal(&t1->cond);
+
+    pthread_mutex_lock(&t1->m);
+    while (t1->pid_share)
+	pthread_cond_wait(&t1->cond_par, &t1->m);
+    pthread_mutex_unlock(&t1->m);
+}
+
+char *get_task_core_cookie(char *pid)
+{
+    char proc_path[50];
+    int found = 0;
+    char *line;
+    int i, j;
+    FILE *fp;
+
+    line = malloc(1024);
+    assert_cond(!!line, "Failed to alloc memory");
+
+    sprintf(proc_path, "/proc/%s/sched", pid);
+
+    fp = fopen(proc_path, "r");
+    if (!fp) {
+	perror("Failed to open /proc sched file: ");
+	abort();
+    }
+    while ((fgets(line, 1024, fp)) != NULL)
+    {
+        if(!strstr(line, "core_cookie"))
+            continue;
+
+        for (j = 0, i = 0; i < 1024 && line[i] != '\0'; i++)
+            if (line[i] >= '0' && line[i] <= '9')
+                line[j++] = line[i];
+        line[j] = '\0';
+        found = 1;
+        break;
+    }
+
+    fclose(fp);
+
+    if (found) {
+        return line;
+    } else {
+        free(line);
+	printf("core_cookie not found. Enable SCHED_DEBUG?\n");
+	abort();
+        return NULL;
+    }
+}
+
+void assert_tasks_share(struct task_state *t1, struct task_state *t2)
+{
+    char *c1, *c2;
+
+    c1 = get_task_core_cookie(t1->pid_str);
+    c2 = get_task_core_cookie(t2->pid_str);
+    dprint("check task (%s) cookie (%s) == task (%s) cookie (%s)",
+	    t1->pid_str, c1, t2->pid_str, c2);
+    assert_cond(!strcmp(c1, c2), "Tasks don't share cookie");
+    free(c1); free(c2);
+}
+
+void assert_tasks_dont_share(struct task_state *t1,  struct task_state *t2)
+{
+    char *c1, *c2;
+    c1 = get_task_core_cookie(t1->pid_str);
+    c2 = get_task_core_cookie(t2->pid_str);
+    dprint("check task (%s) cookie (%s) != task (%s) cookie (%s)",
+	    t1->pid_str, c1, t2->pid_str, c2);
+    assert_cond(strcmp(c1, c2), "Tasks share cookie");
+    free(c1); free(c2);
+}
+
+void assert_task_has_cookie(char *pid)
+{
+    char *tk;
+
+    tk = get_task_core_cookie(pid);
+
+    assert_cond(strcmp(tk, "0"), "Task does not have cookie");
+
+    free(tk);
+}
+
+void kill_task(struct task_state *t)
+{
+    int pid = atoi(t->pid_str);
+
+    kill(pid, SIGKILL);
+    waitpid(pid, NULL, 0);
+}
+
+/*
+ * Test coloring. r1, y2, b3 and r4 are children of a tagged group y1, but
+ * r1, b3 and r4 are colored differently from y1. Only y2 (and thus y22)
+ * shares the same color as y1, so only those have the same cookie as y1.
+ * Further, r4 and r1 have the same cookie as they are both colored the same.
+ *
+ *   y1 (tagged)
+ *   |-- r1 -- r11   (color, say red)
+ *   |-- y2 -- y22   (color, say yellow (default))
+ *   |-- b3          (color, say blue)
+ *   `-- r4          (color, say red)
+ */
+static void test_cgroup_coloring(char *root)
+{
+    char *y1, *y2, *y22, *r1, *r11, *b3, *r4;
+
+    print_banner("TEST-CGROUP-COLORING");
+
+    y1 = make_group(root, "y1");
+    tag_group(y1);
+
+    y2 = make_group(y1, "y2");
+    y22 = make_group(y2, "y22");
+
+    r1 = make_group(y1, "r1");
+    r11 = make_group(r1, "r11");
+
+    color_group(r1, "10000");
+    color_group(r1, "0");   /* Wouldn't succeed. */
+    color_group(r1, "254");
+
+    b3 = make_group(y1, "b3");
+    color_group(b3, "8");
+
+    r4 = make_group(y1, "r4");
+    color_group(r4, "254");
+
+    /* Check that all yellows share the same cookie. */
+    assert_group_cookie_not_zero(y1);
+    assert_group_cookie_equal(y1, y2);
+    assert_group_cookie_equal(y1, y22);
+
+    /* Check that all reds share the same cookie. */
+    assert_group_cookie_not_zero(r1);
+    assert_group_cookie_equal(r1, r11);
+    assert_group_cookie_equal(r11, r4);
+
+    /* Check that blue, red and yellow have different cookies. */
+    assert_group_cookie_not_equal(r1, b3);
+    assert_group_cookie_not_equal(b3, y1);
+
+    del_group(r11);
+    del_group(r1);
+    del_group(y22);
+    del_group(y2);
+    del_group(b3);
+    del_group(r4);
+    del_group(y1);
+    print_pass();
+}
+
+/*
+ * Test that a group's children have a cookie inherited
+ * from their parent group _after_ the parent was tagged.
+ *
+ *   p ----- c1 - c11
+ *     \ c2 - c22
+ */
+static void test_cgroup_parent_child_tag_inherit(char *root)
+{
+    char *p, *c1, *c11, *c2, *c22;
+
+    print_banner("TEST-CGROUP-PARENT-CHILD-TAG");
+
+    p = make_group(root, "p");
+    assert_group_cookie_zero(p);
+
+    c1 = make_group(p, "c1");
+    assert_group_tag(c1, "0"); /* Child tag is "0" but inherits cookie from parent. */
+    assert_group_cookie_zero(c1);
+    assert_group_cookie_equal(c1, p);
+
+    c11 = make_group(c1, "c11");
+    assert_group_tag(c11, "0");
+    assert_group_cookie_zero(c11);
+    assert_group_cookie_equal(c11, p);
+
+    c2 = make_group(p, "c2");
+    assert_group_tag(c2, "0");
+    assert_group_cookie_zero(c2);
+    assert_group_cookie_equal(c2, p);
+
+    tag_group(p);
+
+    /* Verify c1 got the cookie */
+    assert_group_tag(c1, "0");
+    assert_group_cookie_not_zero(c1);
+    assert_group_cookie_equal(c1, p);
+
+    /* Verify c2 got the cookie */
+    assert_group_tag(c2, "0");
+    assert_group_cookie_not_zero(c2);
+    assert_group_cookie_equal(c2, p);
+
+    /* Verify c11 got the cookie */
+    assert_group_tag(c11, "0");
+    assert_group_cookie_not_zero(c11);
+    assert_group_cookie_equal(c11, p);
+
+    /*
+     * Verify c22 which is a nested group created
+     * _after_ tagging got the cookie.
+     */
+    c22 = make_group(c2, "c22");
+
+    assert_group_tag(c22, "0");
+    assert_group_cookie_not_zero(c22);
+    assert_group_cookie_equal(c22, c1);
+    assert_group_cookie_equal(c22, c11);
+    assert_group_cookie_equal(c22, c2);
+    assert_group_cookie_equal(c22, p);
+
+    del_group(c22);
+    del_group(c11);
+    del_group(c1);
+    del_group(c2);
+    del_group(p);
+    print_pass();
+}
+
+/*
+ * Test that a tagged group's children have a cookie inherited
+ * from their parent group.
+ */
+static void test_cgroup_parent_tag_child_inherit(char *root)
+{
+    char *p, *c1, *c2, *c3;
+
+    print_banner("TEST-CGROUP-PARENT-TAG-CHILD-INHERIT");
+
+    p = make_group(root, "p");
+    assert_group_cookie_zero(p);
+    tag_group(p);
+    assert_group_cookie_not_zero(p);
+
+    c1 = make_group(p, "c1");
+    assert_group_cookie_not_zero(c1);
+    /* Child tag is "0" but it inherits cookie from parent. */
+    assert_group_tag(c1, "0");
+    assert_group_cookie_equal(c1, p);
+
+    c2 = make_group(p, "c2");
+    assert_group_tag(c2, "0");
+    assert_group_cookie_equal(c2, p);
+    assert_group_cookie_equal(c1, c2);
+
+    c3 = make_group(c1, "c3");
+    assert_group_tag(c3, "0");
+    assert_group_cookie_equal(c3, p);
+    assert_group_cookie_equal(c1, c3);
+
+    del_group(c3);
+    del_group(c1);
+    del_group(c2);
+    del_group(p);
+    print_pass();
+}
+
+static void test_prctl_in_group(char *root)
+{
+    char *p;
+    struct task_state *tsk1, *tsk2, *tsk3;
+
+    print_banner("TEST-PRCTL-IN-GROUP");
+
+    p = make_group(root, "p");
+    assert_group_cookie_zero(p);
+    tag_group(p);
+    assert_group_cookie_not_zero(p);
+
+    tsk1 = add_task(p);
+    assert_task_has_cookie(tsk1->pid_str);
+
+    tsk2 = add_task(p);
+    assert_task_has_cookie(tsk2->pid_str);
+
+    tsk3 = add_task(p);
+    assert_task_has_cookie(tsk3->pid_str);
+
+    /* tsk2 shares with tsk3 -- both get disconnected from the CGroup. */
+    make_tasks_share(tsk2, tsk3);
+    assert_task_has_cookie(tsk2->pid_str);
+    assert_task_has_cookie(tsk3->pid_str);
+    assert_tasks_share(tsk2, tsk3);
+    assert_tasks_dont_share(tsk1, tsk2);
+    assert_tasks_dont_share(tsk1, tsk3);
+
+    /* now reset tsk3 -- get connected back to CGroup. */
+    reset_task_cookie(tsk3);
+    assert_task_has_cookie(tsk3->pid_str);
+    assert_tasks_dont_share(tsk2, tsk3);
+    assert_tasks_share(tsk1, tsk3);      /* tsk3 is back. */
+    assert_tasks_dont_share(tsk1, tsk2); /* but tsk2 is still detached. */
+
+    /* now reset tsk2 as well to get it connected back to CGroup. */
+    reset_task_cookie(tsk2);
+    assert_task_has_cookie(tsk2->pid_str);
+    assert_tasks_share(tsk2, tsk3);
+    assert_tasks_share(tsk1, tsk3);
+    assert_tasks_share(tsk1, tsk2);
+
+    /* Test the rest of the cases (2 to 4)
+     *
+     *		t1		joining		t2
+     * CASE 1:
+     * before	0				0
+     * after	new cookie			new cookie
+     *
+     * CASE 2:
+     * before	X (non-zero)			0
+     * after	0				0
+     *
+     * CASE 3:
+     * before	0				X (non-zero)
+     * after	X				X
+     *
+     * CASE 4:
+     * before	Y (non-zero)			X (non-zero)
+     * after	X				X
+     */
+
+    /* case 2: */
+    dprint("case 2");
+    make_tasks_share(tsk1, tsk1);
+    assert_tasks_dont_share(tsk1, tsk2);
+    assert_tasks_dont_share(tsk1, tsk3);
+    assert_task_has_cookie(tsk1->pid_str);
+    make_tasks_share(tsk1, tsk2); /* Will reset the task cookie. */
+    assert_task_has_cookie(tsk1->pid_str);
+    assert_task_has_cookie(tsk2->pid_str);
+
+    /* case 3: */
+    dprint("case 3");
+    make_tasks_share(tsk2, tsk2);
+    assert_tasks_dont_share(tsk2, tsk1);
+    assert_tasks_dont_share(tsk2, tsk3);
+    assert_task_has_cookie(tsk2->pid_str);
+    make_tasks_share(tsk1, tsk2);
+    assert_task_has_cookie(tsk1->pid_str);
+    assert_task_has_cookie(tsk2->pid_str);
+    assert_tasks_share(tsk1, tsk2);
+    assert_tasks_dont_share(tsk1, tsk3);
+    reset_task_cookie(tsk1);
+    reset_task_cookie(tsk2);
+
+    /* case 4: */
+    dprint("case 4");
+    assert_tasks_share(tsk1, tsk2);
+    assert_task_has_cookie(tsk1->pid_str);
+    assert_task_has_cookie(tsk2->pid_str);
+    make_tasks_share(tsk1, tsk1);
+    assert_task_has_cookie(tsk1->pid_str);
+    make_tasks_share(tsk2, tsk2);
+    assert_task_has_cookie(tsk2->pid_str);
+    assert_tasks_dont_share(tsk1, tsk2);
+    make_tasks_share(tsk1, tsk2);
+    assert_task_has_cookie(tsk1->pid_str);
+    assert_task_has_cookie(tsk2->pid_str);
+    assert_tasks_share(tsk1, tsk2);
+    assert_tasks_dont_share(tsk1, tsk3);
+    reset_task_cookie(tsk1);
+    reset_task_cookie(tsk2);
+
+    kill_task(tsk1);
+    kill_task(tsk2);
+    kill_task(tsk3);
+    del_group(p);
+    print_pass();
+}
+
+int main() {
+    char *root = make_group(NULL, NULL);
+
+    test_cgroup_parent_tag_child_inherit(root);
+    test_cgroup_parent_child_tag_inherit(root);
+    test_cgroup_coloring(root);
+    test_prctl_in_group(root);
+
+    del_root_group(root);
+    return 0;
+}
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH -tip 29/32] sched: Move core-scheduler interfacing code to a new file
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (27 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 28/32] kselftest: Add tests for core-sched interface Joel Fernandes (Google)
@ 2020-11-17 23:19 ` Joel Fernandes (Google)
  2020-11-17 23:20 ` [PATCH -tip 30/32] Documentation: Add core scheduling documentation Joel Fernandes (Google)
                   ` (3 subsequent siblings)
  32 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:19 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

core.c is already huge. The core-tagging interface code is largely
independent of it. Move it to its own file to make both files easier to
maintain.

Also make the following changes:
- Fix SWA bugs found by Chris Hyser.
- Fix a refcount underflow caused by not zeroing a new task's cookie.

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Reviewed-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/Makefile  |   1 +
 kernel/sched/core.c    | 809 +---------------------------------------
 kernel/sched/coretag.c | 819 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h   |  51 ++-
 4 files changed, 872 insertions(+), 808 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1d9762b571a..5ef04bdc849f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -162,11 +162,6 @@ static bool sched_core_empty(struct rq *rq)
 	return RB_EMPTY_ROOT(&rq->core_tree);
 }
 
-static bool sched_core_enqueued(struct task_struct *task)
-{
-	return !RB_EMPTY_NODE(&task->core_node);
-}
-
 static struct task_struct *sched_core_first(struct rq *rq)
 {
 	struct task_struct *task;
@@ -188,7 +183,7 @@ static void sched_core_flush(int cpu)
 	rq->core->core_task_seq++;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct rb_node *parent, **node;
 	struct task_struct *node_task;
@@ -215,7 +210,7 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 	rb_insert_color(&p->core_node, &rq->core_tree);
 }
 
-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
@@ -310,7 +305,6 @@ static int __sched_core_stopper(void *data)
 }
 
 static DEFINE_MUTEX(sched_core_mutex);
-static DEFINE_MUTEX(sched_core_tasks_mutex);
 static int sched_core_count;
 
 static void __sched_core_enable(void)
@@ -346,16 +340,6 @@ void sched_core_put(void)
 		__sched_core_disable();
 	mutex_unlock(&sched_core_mutex);
 }
-
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
-
-#else /* !CONFIG_SCHED_CORE */
-
-static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
-static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
-static bool sched_core_enqueued(struct task_struct *task) { return false; }
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2) { }
-
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -3834,6 +3818,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->capture_control = NULL;
 #endif
 	init_numa_balancing(clone_flags, p);
+#ifdef CONFIG_SCHED_CORE
+	p->core_task_cookie = 0;
+#endif
 #ifdef CONFIG_SMP
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
@@ -9118,11 +9105,6 @@ void sched_move_task(struct task_struct *tsk)
 	task_rq_unlock(rq, tsk, &rf);
 }
 
-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
-	return css ? container_of(css, struct task_group, css) : NULL;
-}
-
 static struct cgroup_subsys_state *
 cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -9735,787 +9717,6 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-#ifdef CONFIG_SCHED_CORE
-/*
- * Wrapper representing a complete cookie. The address of the cookie is used as
- * a unique identifier. Each cookie has a unique permutation of the internal
- * cookie fields.
- */
-struct sched_core_cookie {
-	unsigned long task_cookie;
-	unsigned long group_cookie;
-	unsigned long color;
-
-	struct rb_node node;
-	refcount_t refcnt;
-};
-
-/*
- * A simple wrapper around refcount. An allocated sched_core_task_cookie's
- * address is used to compute the cookie of the task.
- */
-struct sched_core_task_cookie {
-	refcount_t refcnt;
-};
-
-/* All active sched_core_cookies */
-static struct rb_root sched_core_cookies = RB_ROOT;
-static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
-
-/*
- * Returns the following:
- * a < b  => -1
- * a == b => 0
- * a > b  => 1
- */
-static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
-				 const struct sched_core_cookie *b)
-{
-#define COOKIE_CMP_RETURN(field) do {		\
-	if (a->field < b->field)		\
-		return -1;			\
-	else if (a->field > b->field)		\
-		return 1;			\
-} while (0)					\
-
-	COOKIE_CMP_RETURN(task_cookie);
-	COOKIE_CMP_RETURN(group_cookie);
-	COOKIE_CMP_RETURN(color);
-
-	/* all cookie fields match */
-	return 0;
-
-#undef COOKIE_CMP_RETURN
-}
-
-static inline void __sched_core_erase_cookie(struct sched_core_cookie *cookie)
-{
-	lockdep_assert_held(&sched_core_cookies_lock);
-
-	/* Already removed */
-	if (RB_EMPTY_NODE(&cookie->node))
-		return;
-
-	rb_erase(&cookie->node, &sched_core_cookies);
-	RB_CLEAR_NODE(&cookie->node);
-}
-
-/* Called when a task no longer points to the cookie in question */
-static void sched_core_put_cookie(struct sched_core_cookie *cookie)
-{
-	unsigned long flags;
-
-	if (!cookie)
-		return;
-
-	if (refcount_dec_and_test(&cookie->refcnt)) {
-		raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
-		__sched_core_erase_cookie(cookie);
-		raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
-		kfree(cookie);
-	}
-}
-
-/*
- * A task's core cookie is a compound structure composed of various cookie
- * fields (task_cookie, group_cookie, color). The overall core_cookie is
- * a pointer to a struct containing those values. This function either finds
- * an existing core_cookie or creates a new one, and then updates the task's
- * core_cookie to point to it. Additionally, it handles the necessary reference
- * counting.
- *
- * REQUIRES: task_rq(p) lock or called from cpu_stopper.
- * Doing so ensures that we do not cause races/corruption by modifying/reading
- * task cookie fields.
- */
-static void __sched_core_update_cookie(struct task_struct *p)
-{
-	struct rb_node *parent, **node;
-	struct sched_core_cookie *node_core_cookie, *match;
-	static const struct sched_core_cookie zero_cookie;
-	struct sched_core_cookie temp = {
-		.task_cookie	= p->core_task_cookie,
-		.group_cookie	= p->core_group_cookie,
-		.color		= p->core_color
-	};
-	const bool is_zero_cookie =
-		(sched_core_cookie_cmp(&temp, &zero_cookie) == 0);
-	struct sched_core_cookie *const curr_cookie =
-		(struct sched_core_cookie *)p->core_cookie;
-	unsigned long flags;
-
-	/*
-	 * Already have a cookie matching the requested settings? Nothing to
-	 * do.
-	 */
-	if ((curr_cookie && sched_core_cookie_cmp(curr_cookie, &temp) == 0) ||
-	    (!curr_cookie && is_zero_cookie))
-		return;
-
-	raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
-
-	if (is_zero_cookie) {
-		match = NULL;
-		goto finish;
-	}
-
-retry:
-	match = NULL;
-
-	node = &sched_core_cookies.rb_node;
-	parent = *node;
-	while (*node) {
-		int cmp;
-
-		node_core_cookie =
-			container_of(*node, struct sched_core_cookie, node);
-		parent = *node;
-
-		cmp = sched_core_cookie_cmp(&temp, node_core_cookie);
-		if (cmp < 0) {
-			node = &parent->rb_left;
-		} else if (cmp > 0) {
-			node = &parent->rb_right;
-		} else {
-			match = node_core_cookie;
-			break;
-		}
-	}
-
-	if (!match) {
-		/* No existing cookie; create and insert one */
-		match = kmalloc(sizeof(struct sched_core_cookie), GFP_ATOMIC);
-
-		/* Fall back to zero cookie */
-		if (WARN_ON_ONCE(!match))
-			goto finish;
-
-		match->task_cookie = temp.task_cookie;
-		match->group_cookie = temp.group_cookie;
-		match->color = temp.color;
-		refcount_set(&match->refcnt, 1);
-
-		rb_link_node(&match->node, parent, node);
-		rb_insert_color(&match->node, &sched_core_cookies);
-	} else {
-		/*
-		 * Cookie exists, increment refcnt. If refcnt is currently 0,
-		 * we're racing with a put() (refcnt decremented but cookie not
-		 * yet removed from the tree). In this case, we can simply
-		 * perform the removal ourselves and retry.
-		 * sched_core_put_cookie() will still function correctly.
-		 */
-		if (unlikely(!refcount_inc_not_zero(&match->refcnt))) {
-			__sched_core_erase_cookie(match);
-			goto retry;
-		}
-	}
-
-finish:
-	/*
-	 * Set the core_cookie under the cookies lock. This guarantees that
-	 * p->core_cookie cannot be freed while the cookies lock is held in
-	 * sched_core_fork().
-	 */
-	p->core_cookie = (unsigned long)match;
-
-	raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
-
-	sched_core_put_cookie(curr_cookie);
-}
-
-/*
- * sched_core_update_cookie - Common helper to update a task's core cookie. This
- * updates the selected cookie field and then updates the overall cookie.
- * @p: The task whose cookie should be updated.
- * @cookie: The new cookie.
- * @cookie_type: The cookie field to which the cookie corresponds.
- *
- * REQUIRES: either task_rq(p)->lock held or called from a stop-machine handler.
- * Doing so ensures that we do not cause races/corruption by modifying/reading
- * task cookie fields.
- */
-static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie,
-				     enum sched_core_cookie_type cookie_type)
-{
-	if (!p)
-		return;
-
-	switch (cookie_type) {
-	case sched_core_no_update:
-		break;
-	case sched_core_task_cookie_type:
-		p->core_task_cookie = cookie;
-		break;
-	case sched_core_group_cookie_type:
-		p->core_group_cookie = cookie;
-		break;
-	case sched_core_color_type:
-		p->core_color = cookie;
-		break;
-	default:
-		WARN_ON_ONCE(1);
-	}
-
-	/* Set p->core_cookie, which is the overall cookie */
-	__sched_core_update_cookie(p);
-
-	if (sched_core_enqueued(p)) {
-		sched_core_dequeue(task_rq(p), p);
-		if (!p->core_cookie)
-			return;
-	}
-
-	if (sched_core_enabled(task_rq(p)) &&
-			p->core_cookie && task_on_rq_queued(p))
-		sched_core_enqueue(task_rq(p), p);
-}
-
-void cpu_core_get_group_cookie_and_color(struct task_group *tg,
-					 unsigned long *group_cookie_ptr,
-					 unsigned long *color_ptr);
-
-void sched_core_change_group(struct task_struct *p, struct task_group *new_tg)
-{
-	unsigned long new_group_cookie, new_color;
-
-	cpu_core_get_group_cookie_and_color(new_tg, &new_group_cookie,
-					    &new_color);
-
-	if (p->core_group_cookie == new_group_cookie &&
-	    p->core_color == new_color)
-		return;
-
-	p->core_group_cookie = new_group_cookie;
-	p->core_color = new_color;
-
-	__sched_core_update_cookie(p);
-}
-
-/* Per-task interface */
-static unsigned long sched_core_alloc_task_cookie(void)
-{
-	struct sched_core_task_cookie *ptr =
-		kmalloc(sizeof(struct sched_core_task_cookie), GFP_KERNEL);
-
-	if (!ptr)
-		return 0;
-	refcount_set(&ptr->refcnt, 1);
-
-	/*
-	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
-	 * is done after the stopper runs.
-	 */
-	sched_core_get();
-	return (unsigned long)ptr;
-}
-
-static bool sched_core_get_task_cookie(unsigned long cookie)
-{
-	struct sched_core_task_cookie *ptr =
-		(struct sched_core_task_cookie *)cookie;
-
-	/*
-	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
-	 * is done after the stopper runs.
-	 */
-	sched_core_get();
-	return refcount_inc_not_zero(&ptr->refcnt);
-}
-
-static void sched_core_put_task_cookie(unsigned long cookie)
-{
-	struct sched_core_task_cookie *ptr =
-		(struct sched_core_task_cookie *)cookie;
-
-	if (refcount_dec_and_test(&ptr->refcnt))
-		kfree(ptr);
-}
-
-struct sched_core_task_write_tag {
-	struct task_struct *tasks[2];
-	unsigned long cookies[2];
-};
-
-/*
- * Ensure that the task has been requeued. The stopper ensures that the task cannot
- * be migrated to a different CPU while its core scheduler queue state is being updated.
- * It also makes sure to requeue a task if it was running actively on another CPU.
- */
-static int sched_core_task_join_stopper(void *data)
-{
-	struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
-	int i;
-
-	for (i = 0; i < 2; i++)
-		sched_core_update_cookie(tag->tasks[i], tag->cookies[i],
-					 sched_core_task_cookie_type);
-
-	return 0;
-}
-
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
-{
-	struct sched_core_task_write_tag wr = {}; /* for stop machine. */
-	bool sched_core_put_after_stopper = false;
-	unsigned long cookie;
-	int ret = -ENOMEM;
-
-	mutex_lock(&sched_core_tasks_mutex);
-
-	/*
-	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
-	 *       sched_core_put_task_cookie(). However, sched_core_put() is done
-	 *       by this function *after* the stopper removes the tasks from the
-	 *       core queue, and not before. This is just to play it safe.
-	 */
-	if (t2 == NULL) {
-		if (t1->core_task_cookie) {
-			sched_core_put_task_cookie(t1->core_task_cookie);
-			sched_core_put_after_stopper = true;
-			wr.tasks[0] = t1; /* Keep wr.cookies[0] reset for t1. */
-		}
-	} else if (t1 == t2) {
-		/* Assign a unique per-task cookie solely for t1. */
-
-		cookie = sched_core_alloc_task_cookie();
-		if (!cookie)
-			goto out_unlock;
-
-		if (t1->core_task_cookie) {
-			sched_core_put_task_cookie(t1->core_task_cookie);
-			sched_core_put_after_stopper = true;
-		}
-		wr.tasks[0] = t1;
-		wr.cookies[0] = cookie;
-	} else
-	/*
-	 * 		t1		joining		t2
-	 * CASE 1:
-	 * before	0				0
-	 * after	new cookie			new cookie
-	 *
-	 * CASE 2:
-	 * before	X (non-zero)			0
-	 * after	0				0
-	 *
-	 * CASE 3:
-	 * before	0				X (non-zero)
-	 * after	X				X
-	 *
-	 * CASE 4:
-	 * before	Y (non-zero)			X (non-zero)
-	 * after	X				X
-	 */
-	if (!t1->core_task_cookie && !t2->core_task_cookie) {
-		/* CASE 1. */
-		cookie = sched_core_alloc_task_cookie();
-		if (!cookie)
-			goto out_unlock;
-
-		/* Add another reference for the other task. */
-		if (!sched_core_get_task_cookie(cookie)) {
-			ret = -EINVAL;
-			goto out_unlock;
-		}
-
-		wr.tasks[0] = t1;
-		wr.tasks[1] = t2;
-		wr.cookies[0] = wr.cookies[1] = cookie;
-
-	} else if (t1->core_task_cookie && !t2->core_task_cookie) {
-		/* CASE 2. */
-		sched_core_put_task_cookie(t1->core_task_cookie);
-		sched_core_put_after_stopper = true;
-
-		wr.tasks[0] = t1; /* Reset cookie for t1. */
-
-	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
-		/* CASE 3. */
-		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
-			ret = -EINVAL;
-			goto out_unlock;
-		}
-
-		wr.tasks[0] = t1;
-		wr.cookies[0] = t2->core_task_cookie;
-
-	} else {
-		/* CASE 4. */
-		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
-			ret = -EINVAL;
-			goto out_unlock;
-		}
-		sched_core_put_task_cookie(t1->core_task_cookie);
-		sched_core_put_after_stopper = true;
-
-		wr.tasks[0] = t1;
-		wr.cookies[0] = t2->core_task_cookie;
-	}
-
-	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
-
-	if (sched_core_put_after_stopper)
-		sched_core_put();
-
-	ret = 0;
-out_unlock:
-	mutex_unlock(&sched_core_tasks_mutex);
-	return ret;
-}
-
-/* Called from prctl interface: PR_SCHED_CORE_SHARE */
-int sched_core_share_pid(pid_t pid)
-{
-	struct task_struct *task;
-	int err;
-
-	if (pid == 0) { /* Reset current task's cookie. */
-		/* Resetting a cookie requires privileges. */
-		if (current->core_task_cookie)
-			if (!capable(CAP_SYS_ADMIN))
-				return -EPERM;
-		task = NULL;
-	} else {
-		rcu_read_lock();
-		task = pid ? find_task_by_vpid(pid) : current;
-		if (!task) {
-			rcu_read_unlock();
-			return -ESRCH;
-		}
-
-		get_task_struct(task);
-
-		/*
-		 * Check if this process has the right to modify the specified
-		 * process. Use the regular "ptrace_may_access()" checks.
-		 */
-		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
-			rcu_read_unlock();
-			err = -EPERM;
-			goto out_put;
-		}
-		rcu_read_unlock();
-	}
-
-	err = sched_core_share_tasks(current, task);
-out_put:
-	if (task)
-		put_task_struct(task);
-	return err;
-}
-
-/* CGroup interface */
-
-/*
- * Helper to get the group cookie and color in a hierarchy.
- * Any ancestor can have a tag/color. At most one color and one
- * tag are allowed.
- * Sets *group_cookie_ptr and *color_ptr to the hierarchical group cookie
- * and color.
- */
-void cpu_core_get_group_cookie_and_color(struct task_group *tg,
-					 unsigned long *group_cookie_ptr,
-					 unsigned long *color_ptr)
-{
-	unsigned long group_cookie = 0UL;
-	unsigned long color = 0UL;
-
-	if (!tg)
-		goto out;
-
-	for (; tg; tg = tg->parent) {
-		if (tg->core_tag_color) {
-			WARN_ON_ONCE(color);
-			color = tg->core_tag_color;
-		}
-
-		if (tg->core_tagged) {
-			group_cookie = (unsigned long)tg;
-			break;
-		}
-	}
-
-out:
-	*group_cookie_ptr = group_cookie;
-	*color_ptr = color;
-}
-
-/* Determine if any group in @tg's children are tagged or colored. */
-static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag,
-					bool check_color)
-{
-	struct task_group *child;
-
-	rcu_read_lock();
-	list_for_each_entry_rcu(child, &tg->children, siblings) {
-		if ((child->core_tagged && check_tag) ||
-		    (child->core_tag_color && check_color)) {
-			rcu_read_unlock();
-			return true;
-		}
-
-		rcu_read_unlock();
-		return cpu_core_check_descendants(child, check_tag, check_color);
-	}
-
-	rcu_read_unlock();
-	return false;
-}
-
-static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
-{
-	struct task_group *tg = css_tg(css);
-
-	return !!tg->core_tagged;
-}
-
-static u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
-{
-	struct task_group *tg = css_tg(css);
-
-	return tg->core_tag_color;
-}
-
-#ifdef CONFIG_SCHED_DEBUG
-static u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
-{
-	unsigned long group_cookie, color;
-
-	cpu_core_get_group_cookie_and_color(css_tg(css), &group_cookie, &color);
-
-	/*
-	 * Combine group_cookie and color into a single 64 bit value, for
-	 * display purposes only.
-	 */
-	return (group_cookie << 32) | (color & 0xffffffff);
-}
-#endif
-
-struct write_core_tag {
-	struct cgroup_subsys_state *css;
-	unsigned long cookie;
-	enum sched_core_cookie_type cookie_type;
-};
-
-static int __sched_write_tag(void *data)
-{
-	struct write_core_tag *tag = (struct write_core_tag *) data;
-	struct task_struct *p;
-	struct cgroup_subsys_state *css;
-
-	rcu_read_lock();
-	css_for_each_descendant_pre(css, tag->css) {
-		struct css_task_iter it;
-
-		css_task_iter_start(css, 0, &it);
-		/*
-		 * Note: css_task_iter_next will skip dying tasks.
-		 * There could still be dying tasks left in the core queue
-		 * when we set cgroup tag to 0 when the loop is done below.
-		 */
-		while ((p = css_task_iter_next(&it)))
-			sched_core_update_cookie(p, tag->cookie,
-						 tag->cookie_type);
-
-		css_task_iter_end(&it);
-	}
-	rcu_read_unlock();
-
-	return 0;
-}
-
-static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
-{
-	struct task_group *tg = css_tg(css);
-	struct write_core_tag wtag;
-	unsigned long group_cookie, color;
-
-	if (val > 1)
-		return -ERANGE;
-
-	if (!static_branch_likely(&sched_smt_present))
-		return -EINVAL;
-
-	if (!tg->core_tagged && val) {
-		/* Tag is being set. Check ancestors and descendants. */
-		cpu_core_get_group_cookie_and_color(tg, &group_cookie, &color);
-		if (group_cookie ||
-		    cpu_core_check_descendants(tg, true /* tag */, true /* color */))
-			return -EBUSY;
-	} else if (tg->core_tagged && !val) {
-		/* Tag is being reset. Check descendants. */
-		if (cpu_core_check_descendants(tg, true /* tag */, true /* color */))
-			return -EBUSY;
-	} else {
-		return 0;
-	}
-
-	if (!!val)
-		sched_core_get();
-
-	wtag.css = css;
-	wtag.cookie = (unsigned long)tg;
-	wtag.cookie_type = sched_core_group_cookie_type;
-
-	tg->core_tagged = val;
-
-	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
-	if (!val)
-		sched_core_put();
-
-	return 0;
-}
-
-static int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
-					struct cftype *cft, u64 val)
-{
-	struct task_group *tg = css_tg(css);
-	struct write_core_tag wtag;
-	unsigned long group_cookie, color;
-
-	if (val > ULONG_MAX)
-		return -ERANGE;
-
-	if (!static_branch_likely(&sched_smt_present))
-		return -EINVAL;
-
-	cpu_core_get_group_cookie_and_color(tg, &group_cookie, &color);
-	/* Can't set color if nothing in the ancestors were tagged. */
-	if (!group_cookie)
-		return -EINVAL;
-
-	/*
-	 * Something in the ancestors already colors us. Can't change the color
-	 * at this level.
-	 */
-	if (!tg->core_tag_color && color)
-		return -EINVAL;
-
-	/*
-	 * Check if any descendants are colored. If so, we can't recolor them.
-	 * Don't need to check if descendants are tagged, since we don't allow
-	 * tagging when already tagged.
-	 */
-	if (cpu_core_check_descendants(tg, false /* tag */, true /* color */))
-		return -EINVAL;
-
-	wtag.css = css;
-	wtag.cookie = val;
-	wtag.cookie_type = sched_core_color_type;
-	tg->core_tag_color = val;
-
-	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
-
-	return 0;
-}
-
-static int sched_update_core_tag_stopper(void *data)
-{
-	struct task_struct *p = (struct task_struct *)data;
-
-	/* Recalculate core cookie */
-	sched_core_update_cookie(p, 0, sched_core_no_update);
-
-	return 0;
-}
-
-/* Called from sched_fork() */
-int sched_core_fork(struct task_struct *p, unsigned long clone_flags)
-{
-	struct sched_core_cookie *parent_cookie =
-		(struct sched_core_cookie *)current->core_cookie;
-
-	/*
-	 * core_cookie is ref counted; avoid an uncounted reference.
-	 * If p should have a cookie, it will be set below.
-	 */
-	p->core_cookie = 0UL;
-
-	/*
-	 * If parent is tagged via per-task cookie, tag the child (either with
-	 * the parent's cookie, or a new one).
-	 *
-	 * We can return directly in this case, because sched_core_share_tasks()
-	 * will set the core_cookie (so there is no need to try to inherit from
-	 * the parent). The cookie will have the proper sub-fields (ie. group
-	 * cookie, etc.), because these come from p's task_struct, which is
-	 * dup'd from the parent.
-	 */
-	if (current->core_task_cookie) {
-		int ret;
-
-		/* If it is not CLONE_THREAD fork, assign a unique per-task tag. */
-		if (!(clone_flags & CLONE_THREAD)) {
-			ret = sched_core_share_tasks(p, p);
-		} else {
-			/* Otherwise share the parent's per-task tag. */
-			ret = sched_core_share_tasks(p, current);
-		}
-
-		if (ret)
-			return ret;
-
-		/*
-		 * We expect sched_core_share_tasks() to always update p's
-		 * core_cookie.
-		 */
-		WARN_ON_ONCE(!p->core_cookie);
-
-		return 0;
-	}
-
-	/*
-	 * If parent is tagged, inherit the cookie and ensure that the reference
-	 * count is updated.
-	 *
-	 * Technically, we could instead zero-out the task's group_cookie and
-	 * allow sched_core_change_group() to handle this post-fork, but
-	 * inheriting here has a performance advantage, since we don't
-	 * need to traverse the core_cookies RB tree and can instead grab the
-	 * parent's cookie directly.
-	 */
-	if (parent_cookie) {
-		bool need_stopper = false;
-		unsigned long flags;
-
-		/*
-		 * cookies lock prevents task->core_cookie from changing or
-		 * being freed
-		 */
-		raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
-
-		if (likely(refcount_inc_not_zero(&parent_cookie->refcnt))) {
-			p->core_cookie = (unsigned long)parent_cookie;
-		} else {
-			/*
-			 * Raced with put(). We'll use stop_machine to get
-			 * a core_cookie
-			 */
-			need_stopper = true;
-		}
-
-		raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
-
-		if (need_stopper)
-			stop_machine(sched_update_core_tag_stopper,
-				     (void *)p, NULL);
-	}
-
-	return 0;
-}
-
-void sched_tsk_free(struct task_struct *tsk)
-{
-	sched_core_put_cookie((struct sched_core_cookie *)tsk->core_cookie);
-
-	if (!tsk->core_task_cookie)
-		return;
-	sched_core_put_task_cookie(tsk->core_task_cookie);
-	sched_core_put();
-}
-#endif
-
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
new file mode 100644
index 000000000000..800c0f8bacfc
--- /dev/null
+++ b/kernel/sched/coretag.c
@@ -0,0 +1,819 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kernel/sched/coretag.c
+ *
+ * Core-scheduling tagging interface support.
+ *
+ * Copyright (C) 2020, Joel Fernandes.
+ * Initial interfacing code by Peter Zijlstra.
+ */
+
+#include "sched.h"
+
+/*
+ * Wrapper representing a complete cookie. The address of the cookie is used as
+ * a unique identifier. Each cookie has a unique permutation of the internal
+ * cookie fields.
+ */
+struct sched_core_cookie {
+	unsigned long task_cookie;
+	unsigned long group_cookie;
+	unsigned long color;
+
+	struct rb_node node;
+	refcount_t refcnt;
+};
+
+/*
+ * A simple wrapper around refcount. An allocated sched_core_task_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_task_cookie {
+	refcount_t refcnt;
+	struct work_struct work; /* to free in WQ context. */
+};
+
+static DEFINE_MUTEX(sched_core_tasks_mutex);
+
+/* All active sched_core_cookies */
+static struct rb_root sched_core_cookies = RB_ROOT;
+static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
+
+/*
+ * Returns the following:
+ * a < b  => -1
+ * a == b => 0
+ * a > b  => 1
+ */
+static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
+				 const struct sched_core_cookie *b)
+{
+#define COOKIE_CMP_RETURN(field) do {		\
+	if (a->field < b->field)		\
+		return -1;			\
+	else if (a->field > b->field)		\
+		return 1;			\
+} while (0)					\
+
+	COOKIE_CMP_RETURN(task_cookie);
+	COOKIE_CMP_RETURN(group_cookie);
+	COOKIE_CMP_RETURN(color);
+
+	/* all cookie fields match */
+	return 0;
+
+#undef COOKIE_CMP_RETURN
+}
+
+static inline void __sched_core_erase_cookie(struct sched_core_cookie *cookie)
+{
+	lockdep_assert_held(&sched_core_cookies_lock);
+
+	/* Already removed */
+	if (RB_EMPTY_NODE(&cookie->node))
+		return;
+
+	rb_erase(&cookie->node, &sched_core_cookies);
+	RB_CLEAR_NODE(&cookie->node);
+}
+
+/* Called when a task no longer points to the cookie in question */
+static void sched_core_put_cookie(struct sched_core_cookie *cookie)
+{
+	unsigned long flags;
+
+	if (!cookie)
+		return;
+
+	if (refcount_dec_and_test(&cookie->refcnt)) {
+		raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
+		__sched_core_erase_cookie(cookie);
+		raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
+		kfree(cookie);
+	}
+}
+
+/*
+ * A task's core cookie is a compound structure composed of various cookie
+ * fields (task_cookie, group_cookie, color). The overall core_cookie is
+ * a pointer to a struct containing those values. This function either finds
+ * an existing core_cookie or creates a new one, and then updates the task's
+ * core_cookie to point to it. Additionally, it handles the necessary reference
+ * counting.
+ *
+ * REQUIRES: task_rq(p) lock or called from cpu_stopper.
+ * Doing so ensures that we do not cause races/corruption by modifying/reading
+ * task cookie fields.
+ */
+static void __sched_core_update_cookie(struct task_struct *p)
+{
+	struct rb_node *parent, **node;
+	struct sched_core_cookie *node_core_cookie, *match;
+	static const struct sched_core_cookie zero_cookie;
+	struct sched_core_cookie temp = {
+		.task_cookie	= p->core_task_cookie,
+		.group_cookie	= p->core_group_cookie,
+		.color		= p->core_color
+	};
+	const bool is_zero_cookie =
+		(sched_core_cookie_cmp(&temp, &zero_cookie) == 0);
+	struct sched_core_cookie *const curr_cookie =
+		(struct sched_core_cookie *)p->core_cookie;
+	unsigned long flags;
+
+	/*
+	 * Already have a cookie matching the requested settings? Nothing to
+	 * do.
+	 */
+	if ((curr_cookie && sched_core_cookie_cmp(curr_cookie, &temp) == 0) ||
+	    (!curr_cookie && is_zero_cookie))
+		return;
+
+	raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
+
+	if (is_zero_cookie) {
+		match = NULL;
+		goto finish;
+	}
+
+retry:
+	match = NULL;
+
+	node = &sched_core_cookies.rb_node;
+	parent = *node;
+	while (*node) {
+		int cmp;
+
+		node_core_cookie =
+			container_of(*node, struct sched_core_cookie, node);
+		parent = *node;
+
+		cmp = sched_core_cookie_cmp(&temp, node_core_cookie);
+		if (cmp < 0) {
+			node = &parent->rb_left;
+		} else if (cmp > 0) {
+			node = &parent->rb_right;
+		} else {
+			match = node_core_cookie;
+			break;
+		}
+	}
+
+	if (!match) {
+		/* No existing cookie; create and insert one */
+		match = kmalloc(sizeof(struct sched_core_cookie), GFP_ATOMIC);
+
+		/* Fall back to zero cookie */
+		if (WARN_ON_ONCE(!match))
+			goto finish;
+
+		match->task_cookie = temp.task_cookie;
+		match->group_cookie = temp.group_cookie;
+		match->color = temp.color;
+		refcount_set(&match->refcnt, 1);
+
+		rb_link_node(&match->node, parent, node);
+		rb_insert_color(&match->node, &sched_core_cookies);
+	} else {
+		/*
+		 * Cookie exists, increment refcnt. If refcnt is currently 0,
+		 * we're racing with a put() (refcnt decremented but cookie not
+		 * yet removed from the tree). In this case, we can simply
+		 * perform the removal ourselves and retry.
+		 * sched_core_put_cookie() will still function correctly.
+		 */
+		if (unlikely(!refcount_inc_not_zero(&match->refcnt))) {
+			__sched_core_erase_cookie(match);
+			goto retry;
+		}
+	}
+
+finish:
+	/*
+	 * Set the core_cookie under the cookies lock. This guarantees that
+	 * p->core_cookie cannot be freed while the cookies lock is held in
+	 * sched_core_fork().
+	 */
+	p->core_cookie = (unsigned long)match;
+
+	raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
+
+	sched_core_put_cookie(curr_cookie);
+}
+
+/*
+ * sched_core_update_cookie - Common helper to update a task's core cookie. This
+ * updates the selected cookie field and then updates the overall cookie.
+ * @p: The task whose cookie should be updated.
+ * @cookie: The new cookie.
+ * @cookie_type: The cookie field to which the cookie corresponds.
+ *
+ * REQUIRES: either task_rq(p)->lock held or called from a stop-machine handler.
+ * Doing so ensures that we do not cause races/corruption by modifying/reading
+ * task cookie fields.
+ */
+static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie,
+				     enum sched_core_cookie_type cookie_type)
+{
+	if (!p)
+		return;
+
+	switch (cookie_type) {
+	case sched_core_no_update:
+		break;
+	case sched_core_task_cookie_type:
+		p->core_task_cookie = cookie;
+		break;
+	case sched_core_group_cookie_type:
+		p->core_group_cookie = cookie;
+		break;
+	case sched_core_color_type:
+		p->core_color = cookie;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+	}
+
+	/* Set p->core_cookie, which is the overall cookie */
+	__sched_core_update_cookie(p);
+
+	if (sched_core_enqueued(p)) {
+		sched_core_dequeue(task_rq(p), p);
+		if (!p->core_cookie)
+			return;
+	}
+
+	if (sched_core_enabled(task_rq(p)) &&
+	    p->core_cookie && task_on_rq_queued(p))
+		sched_core_enqueue(task_rq(p), p);
+}
+
+#ifdef CONFIG_CGROUP_SCHED
+void cpu_core_get_group_cookie_and_color(struct task_group *tg,
+					 unsigned long *group_cookie_ptr,
+					 unsigned long *color_ptr);
+
+void sched_core_change_group(struct task_struct *p, struct task_group *new_tg)
+{
+	unsigned long new_group_cookie, new_color;
+
+	cpu_core_get_group_cookie_and_color(new_tg, &new_group_cookie,
+					    &new_color);
+
+	if (p->core_group_cookie == new_group_cookie &&
+	    p->core_color == new_color)
+		return;
+
+	p->core_group_cookie = new_group_cookie;
+	p->core_color = new_color;
+
+	__sched_core_update_cookie(p);
+}
+#endif
+
+/* Per-task interface: Used by fork(2) and prctl(2). */
+static void sched_core_put_cookie_work(struct work_struct *ws);
+
+static unsigned long sched_core_alloc_task_cookie(void)
+{
+	struct sched_core_task_cookie *ck =
+		kmalloc(sizeof(struct sched_core_task_cookie), GFP_KERNEL);
+
+	if (!ck)
+		return 0;
+	refcount_set(&ck->refcnt, 1);
+	INIT_WORK(&ck->work, sched_core_put_cookie_work);
+
+	/*
+	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
+	 * is done after the stopper runs.
+	 */
+	sched_core_get();
+	return (unsigned long)ck;
+}
+
+static bool sched_core_get_task_cookie(unsigned long cookie)
+{
+	struct sched_core_task_cookie *ptr =
+		(struct sched_core_task_cookie *)cookie;
+
+	/*
+	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
+	 * is done after the stopper runs.
+	 */
+	sched_core_get();
+	return refcount_inc_not_zero(&ptr->refcnt);
+}
+
+static void sched_core_put_task_cookie(unsigned long cookie)
+{
+	struct sched_core_task_cookie *ptr =
+		(struct sched_core_task_cookie *)cookie;
+
+	if (refcount_dec_and_test(&ptr->refcnt))
+		kfree(ptr);
+}
+
+static void sched_core_put_cookie_work(struct work_struct *ws)
+{
+	struct sched_core_task_cookie *ck =
+		container_of(ws, struct sched_core_task_cookie, work);
+
+	sched_core_put_task_cookie((unsigned long)ck);
+	sched_core_put();
+}
+
+struct sched_core_task_write_tag {
+	struct task_struct *tasks[2];
+	unsigned long cookies[2];
+};
+
+/*
+ * Runs in stop-machine context. The stopper ensures that a task cannot be
+ * migrated to a different CPU while its core scheduler queue state is being
+ * updated; it also makes sure that a task actively running on another CPU
+ * is requeued with its new cookie.
+ */
+static int sched_core_task_join_stopper(void *data)
+{
+	struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
+	int i;
+
+	for (i = 0; i < 2; i++)
+		sched_core_update_cookie(tag->tasks[i], tag->cookies[i],
+					 sched_core_task_cookie_type);
+
+	return 0;
+}
+
+int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
+{
+	struct sched_core_task_write_tag wr = {}; /* for stop machine. */
+	bool sched_core_put_after_stopper = false;
+	unsigned long cookie;
+	int ret = -ENOMEM;
+
+	mutex_lock(&sched_core_tasks_mutex);
+
+	/*
+	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
+	 *       sched_core_get_task_cookie(). However, sched_core_put() is done
+	 *       by this function *after* the stopper removes the tasks from the
+	 *       core queue, and not before. This is just to play it safe.
+	 */
+	if (!t2) {
+		if (t1->core_task_cookie) {
+			sched_core_put_task_cookie(t1->core_task_cookie);
+			sched_core_put_after_stopper = true;
+			wr.tasks[0] = t1; /* Keep wr.cookies[0] reset for t1. */
+		}
+	} else if (t1 == t2) {
+		/* Assign a unique per-task cookie solely for t1. */
+
+		cookie = sched_core_alloc_task_cookie();
+		if (!cookie)
+			goto out_unlock;
+
+		if (t1->core_task_cookie) {
+			sched_core_put_task_cookie(t1->core_task_cookie);
+			sched_core_put_after_stopper = true;
+		}
+		wr.tasks[0] = t1;
+		wr.cookies[0] = cookie;
+	} else if (!t1->core_task_cookie && !t2->core_task_cookie) {
+		/*
+		 * 		t1		joining		t2
+		 * CASE 1:
+		 * before	0				0
+		 * after	new cookie			new cookie
+		 *
+		 * CASE 2:
+		 * before	X (non-zero)			0
+		 * after	0				0
+		 *
+		 * CASE 3:
+		 * before	0				X (non-zero)
+		 * after	X				X
+		 *
+		 * CASE 4:
+		 * before	Y (non-zero)			X (non-zero)
+		 * after	X				X
+		 */
+
+		/* CASE 1. */
+		cookie = sched_core_alloc_task_cookie();
+		if (!cookie)
+			goto out_unlock;
+
+		/* Add another reference for the other task. */
+		if (!sched_core_get_task_cookie(cookie)) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		wr.tasks[0] = t1;
+		wr.tasks[1] = t2;
+		wr.cookies[0] = wr.cookies[1] = cookie;
+
+	} else if (t1->core_task_cookie && !t2->core_task_cookie) {
+		/* CASE 2. */
+		sched_core_put_task_cookie(t1->core_task_cookie);
+		sched_core_put_after_stopper = true;
+
+		wr.tasks[0] = t1; /* Reset cookie for t1. */
+
+	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
+		/* CASE 3. */
+		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		wr.tasks[0] = t1;
+		wr.cookies[0] = t2->core_task_cookie;
+
+	} else {
+		/* CASE 4. */
+		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+		sched_core_put_task_cookie(t1->core_task_cookie);
+		sched_core_put_after_stopper = true;
+
+		wr.tasks[0] = t1;
+		wr.cookies[0] = t2->core_task_cookie;
+	}
+
+	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
+
+	if (sched_core_put_after_stopper)
+		sched_core_put();
+
+	ret = 0;
+out_unlock:
+	mutex_unlock(&sched_core_tasks_mutex);
+	return ret;
+}
+
+/* Called from prctl interface: PR_SCHED_CORE_SHARE */
+int sched_core_share_pid(pid_t pid)
+{
+	struct task_struct *task;
+	int err;
+
+	if (pid == 0) { /* Reset current task's cookie. */
+		/* Resetting a cookie requires privileges. */
+		if (current->core_task_cookie && !capable(CAP_SYS_ADMIN))
+			return -EPERM;
+		task = NULL;
+	} else {
+		rcu_read_lock();
+		task = find_task_by_vpid(pid); /* pid is non-zero here */
+		if (!task) {
+			rcu_read_unlock();
+			return -ESRCH;
+		}
+
+		get_task_struct(task);
+
+		/*
+		 * Check if this process has the right to modify the specified
+		 * process. Use the regular "ptrace_may_access()" checks.
+		 */
+		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+			rcu_read_unlock();
+			err = -EPERM;
+			goto out;
+		}
+		rcu_read_unlock();
+	}
+
+	err = sched_core_share_tasks(current, task);
+out:
+	if (task)
+		put_task_struct(task);
+	return err;
+}
+
+/* CGroup core-scheduling interface support. */
+#ifdef CONFIG_CGROUP_SCHED
+/*
+ * Helper to get the group cookie and color in a hierarchy.
+ * Any ancestor can have a tag/color. At most one color and one
+ * tag are allowed.
+ * Sets *group_cookie_ptr and *color_ptr to the hierarchical group cookie
+ * and color.
+ */
+void cpu_core_get_group_cookie_and_color(struct task_group *tg,
+					 unsigned long *group_cookie_ptr,
+					 unsigned long *color_ptr)
+{
+	unsigned long group_cookie = 0UL;
+	unsigned long color = 0UL;
+
+	if (!tg)
+		goto out;
+
+	for (; tg; tg = tg->parent) {
+		if (tg->core_tag_color) {
+			WARN_ON_ONCE(color);
+			color = tg->core_tag_color;
+		}
+
+		if (tg->core_tagged) {
+			group_cookie = (unsigned long)tg;
+			break;
+		}
+	}
+
+out:
+	*group_cookie_ptr = group_cookie;
+	*color_ptr = color;
+}
+
+/* Determine if any group in @tg's descendants is tagged or colored. */
+static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag,
+				       bool check_color)
+{
+	struct task_group *child;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		if ((child->core_tagged && check_tag) ||
+		    (child->core_tag_color && check_color) ||
+		    cpu_core_check_descendants(child, check_tag, check_color)) {
+			rcu_read_unlock();
+			return true;
+		}
+	}
+
+	rcu_read_unlock();
+	return false;
+}
+
+u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
+			  struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return !!tg->core_tagged;
+}
+
+u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css,
+				struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return tg->core_tag_color;
+}
+
+#ifdef CONFIG_SCHED_DEBUG
+u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css,
+				   struct cftype *cft)
+{
+	unsigned long group_cookie, color;
+
+	cpu_core_get_group_cookie_and_color(css_tg(css), &group_cookie, &color);
+
+	/*
+	 * Combine group_cookie and color into a single 64 bit value, for
+	 * display purposes only.
+	 */
+	return (group_cookie << 32) | (color & 0xffffffff);
+}
+#endif
+
+struct write_core_tag {
+	struct cgroup_subsys_state *css;
+	unsigned long cookie;
+	enum sched_core_cookie_type cookie_type;
+};
+
+static int __sched_write_tag(void *data)
+{
+	struct write_core_tag *tag = (struct write_core_tag *)data;
+	struct task_struct *p;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(css, tag->css) {
+		struct css_task_iter it;
+
+		css_task_iter_start(css, 0, &it);
+		/*
+		 * Note: css_task_iter_next will skip dying tasks.
+		 * There could still be dying tasks left in the core queue
+		 * when we set cgroup tag to 0 when the loop is done below.
+		 */
+		while ((p = css_task_iter_next(&it)))
+			sched_core_update_cookie(p, tag->cookie,
+						 tag->cookie_type);
+
+		css_task_iter_end(&it);
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+
+int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
+			   u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct write_core_tag wtag;
+	unsigned long group_cookie, color;
+
+	if (val > 1)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	if (!tg->core_tagged && val) {
+		/* Tag is being set. Check ancestors and descendants. */
+		cpu_core_get_group_cookie_and_color(tg, &group_cookie, &color);
+		if (group_cookie ||
+		    cpu_core_check_descendants(tg, true /* tag */, true /* color */))
+			return -EBUSY;
+	} else if (tg->core_tagged && !val) {
+		/* Tag is being reset. Check descendants. */
+		if (cpu_core_check_descendants(tg, true /* tag */, true /* color */))
+			return -EBUSY;
+	} else {
+		return 0;
+	}
+
+	if (!!val)
+		sched_core_get();
+
+	wtag.css = css;
+	wtag.cookie = (unsigned long)tg;
+	wtag.cookie_type = sched_core_group_cookie_type;
+
+	tg->core_tagged = val;
+
+	stop_machine(__sched_write_tag, (void *)&wtag, NULL);
+	if (!val)
+		sched_core_put();
+
+	return 0;
+}
+
+int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cft, u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct write_core_tag wtag;
+	unsigned long group_cookie, color;
+
+	if (val > ULONG_MAX)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	cpu_core_get_group_cookie_and_color(tg, &group_cookie, &color);
+	/* Can't set color if no ancestor is tagged. */
+	if (!group_cookie)
+		return -EINVAL;
+
+	/*
+	 * Something in the ancestors already colors us. Can't change the color
+	 * at this level.
+	 */
+	if (!tg->core_tag_color && color)
+		return -EINVAL;
+
+	/*
+	 * Check if any descendants are colored. If so, we can't recolor them.
+	 * No need to check whether descendants are tagged, since tagging below
+	 * an already-tagged ancestor is not allowed.
+	 */
+	if (cpu_core_check_descendants(tg, false /* tag */, true /* color */))
+		return -EINVAL;
+
+	wtag.css = css;
+	wtag.cookie = val;
+	wtag.cookie_type = sched_core_color_type;
+	tg->core_tag_color = val;
+
+	stop_machine(__sched_write_tag, (void *)&wtag, NULL);
+
+	return 0;
+}
+#endif
+
+/*
+ * Tagging support when fork(2) is called:
+ * If it is a CLONE_THREAD fork, share parent's tag. Otherwise assign a unique per-task tag.
+ */
+static int sched_update_core_tag_stopper(void *data)
+{
+	struct task_struct *p = (struct task_struct *)data;
+
+	/* Recalculate core cookie */
+	sched_core_update_cookie(p, 0, sched_core_no_update);
+
+	return 0;
+}
+
+/* Called from sched_fork() */
+int sched_core_fork(struct task_struct *p, unsigned long clone_flags)
+{
+	struct sched_core_cookie *parent_cookie =
+		(struct sched_core_cookie *)current->core_cookie;
+
+	/*
+	 * core_cookie is ref counted; avoid an uncounted reference.
+	 * If p should have a cookie, it will be set below.
+	 */
+	p->core_cookie = 0UL;
+
+	/*
+	 * If parent is tagged via per-task cookie, tag the child (either with
+	 * the parent's cookie, or a new one).
+	 *
+	 * We can return directly in this case, because sched_core_share_tasks()
+	 * will set the core_cookie (so there is no need to try to inherit from
+	 * the parent). The cookie will have the proper sub-fields (ie. group
+	 * cookie, etc.), because these come from p's task_struct, which is
+	 * dup'd from the parent.
+	 */
+	if (current->core_task_cookie) {
+		int ret;
+
+		/* If it is not CLONE_THREAD fork, assign a unique per-task tag. */
+		if (!(clone_flags & CLONE_THREAD)) {
+			ret = sched_core_share_tasks(p, p);
+		} else {
+			/* Otherwise share the parent's per-task tag. */
+			ret = sched_core_share_tasks(p, current);
+		}
+
+		if (ret)
+			return ret;
+
+		/*
+		 * We expect sched_core_share_tasks() to always update p's
+		 * core_cookie.
+		 */
+		WARN_ON_ONCE(!p->core_cookie);
+
+		return 0;
+	}
+
+	/*
+	 * If parent is tagged, inherit the cookie and ensure that the reference
+	 * count is updated.
+	 *
+	 * Technically, we could instead zero-out the task's group_cookie and
+	 * allow sched_core_change_group() to handle this post-fork, but
+	 * inheriting here has a performance advantage, since we don't
+	 * need to traverse the core_cookies RB tree and can instead grab the
+	 * parent's cookie directly.
+	 */
+	if (parent_cookie) {
+		bool need_stopper = false;
+		unsigned long flags;
+
+		/*
+		 * cookies lock prevents task->core_cookie from changing or
+		 * being freed
+		 */
+		raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
+
+		if (likely(refcount_inc_not_zero(&parent_cookie->refcnt))) {
+			p->core_cookie = (unsigned long)parent_cookie;
+		} else {
+			/*
+			 * Raced with put(). We'll use stop_machine to get
+			 * a core_cookie
+			 */
+			need_stopper = true;
+		}
+
+		raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
+
+		if (need_stopper)
+			stop_machine(sched_update_core_tag_stopper,
+				     (void *)p, NULL);
+	}
+
+	return 0;
+}
+
+void sched_tsk_free(struct task_struct *tsk)
+{
+	struct sched_core_task_cookie *ck;
+
+	sched_core_put_cookie((struct sched_core_cookie *)tsk->core_cookie);
+
+	if (!tsk->core_task_cookie)
+		return;
+
+	ck = (struct sched_core_task_cookie *)tsk->core_task_cookie;
+	queue_work(system_wq, &ck->work);
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0ca22918b69a..a99c24740e4b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -430,6 +430,11 @@ struct task_group {
 
 };
 
+static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct task_group, css) : NULL;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD
 
@@ -1185,12 +1190,53 @@ static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
 void sched_core_change_group(struct task_struct *p, struct task_group *new_tg);
 int sched_core_fork(struct task_struct *p, unsigned long clone_flags);
 
-extern void queue_core_balance(struct rq *rq);
+static inline bool sched_core_enqueued(struct task_struct *task)
+{
+	return !RB_EMPTY_NODE(&task->core_node);
+}
+
+void queue_core_balance(struct rq *rq);
+
+void sched_core_enqueue(struct rq *rq, struct task_struct *p);
+void sched_core_dequeue(struct rq *rq, struct task_struct *p);
+void sched_core_get(void);
+void sched_core_put(void);
+
+int sched_core_share_pid(pid_t pid);
+int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
+
+#ifdef CONFIG_CGROUP_SCHED
+u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
+			  struct cftype *cft);
+
+u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css,
+				struct cftype *cft);
+
+#ifdef CONFIG_SCHED_DEBUG
+u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css,
+				   struct cftype *cft);
+#endif
+
+int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
+			   u64 val);
+
+int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cft, u64 val);
+#endif
+
+#ifndef TIF_UNSAFE_RET
+#define TIF_UNSAFE_RET (0)
+#endif
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
 #else /* !CONFIG_SCHED_CORE */
 
+static inline bool sched_core_enqueued(struct task_struct *task) { return false; }
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2) { return 0; }
+
 static inline bool sched_core_enabled(struct rq *rq)
 {
 	return false;
@@ -2871,7 +2917,4 @@ void swake_up_all_locked(struct swait_queue_head *q);
 void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
 
 #ifdef CONFIG_SCHED_CORE
-#ifndef TIF_UNSAFE_RET
-#define TIF_UNSAFE_RET (0)
-#endif
 #endif
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH -tip 30/32] Documentation: Add core scheduling documentation
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (28 preceding siblings ...)
  2020-11-17 23:19 ` [PATCH -tip 29/32] sched: Move core-scheduler interfacing code to a new file Joel Fernandes (Google)
@ 2020-11-17 23:20 ` Joel Fernandes (Google)
  2020-11-17 23:20 ` [PATCH -tip 31/32] sched: Add a coresched command line option Joel Fernandes (Google)
                   ` (2 subsequent siblings)
  32 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:20 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Randy Dunlap,
	Aubrey Li, Paul E. McKenney, Tim Chen

Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 330 ++++++++++++++++++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 331 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index 000000000000..01be28d0687a
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,330 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks doesn't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't contend for
+the same hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that trusted tasks can share a core. This increase in core sharing
+can improve performance; however, a performance improvement is not guaranteed,
+though one has been observed with a number of real-world workloads. In theory,
+core scheduling aims to perform at least as well as when Hyper Threading is
+disabled. In practice, this is mostly the case, though not always: synchronizing
+scheduling decisions across 2 or more CPUs in a core involves additional
+overhead - especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.core_tag``
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+All of the CGroup's tasks are then allowed to run concurrently on a core's
+hyperthreads (also called siblings).
+
+A value of ``0`` means the tag state of the CGroup is inherited from its
+parent hierarchy: if any ancestor of the CGroup is tagged, then the group is
+tagged.
+
+.. note:: Once a CGroup is tagged via cpu.core_tag, it is not possible to set this
+          for any descendant of the tagged group. For finer grained control, the
+          ``cpu.core_tag_color`` file described next may be used.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+          a core with kernel threads and untagged system threads. For this reason,
+          if a group has ``cpu.core_tag`` of 0, it is considered to be trusted.
+
+* ``cpu.core_tag_color``
+For finer grained control over core sharing, a color can also be set in
+addition to the tag. This allows further control of core sharing between child
+CGroups within an already tagged CGroup. The color and the tag are both used to
+generate a `cookie`, which is used by the scheduler to identify the group.
+
+Up to 256 different colors can be set (0-255) by writing into this file.
+
+A sample real-world usage of this file follows:
+
+Google uses DAC controls to make ``cpu.core_tag`` writable only by root and the
+``cpu.core_tag_color`` can be changed by anyone.
+
+The hierarchy looks like this:
+::
+  Root group
+     / \
+    A   B    (These are created by the root daemon - borglet).
+   / \   \
+  C   D   E  (These are created by AppEngine within the container).
+
+A and B are containers for 2 different jobs or apps that are created by a root
+daemon called borglet. borglet then tags each of these groups with the
+``cpu.core_tag`` file. The job itself can create additional child CGroups which
+are colored by the container's AppEngine with the ``cpu.core_tag_color`` file.
+
+The reason why Google uses this 2-level tagging system is that AppEngine wants to
+allow a subset of child CGroups within a tagged parent CGroup to be co-scheduled on a
+core while not being co-scheduled with other child CGroups. Think of these
+child CGroups as belonging to the same customer or project.  Because these
+child CGroups are created by AppEngine, they are not tracked by borglet (the
+root daemon), so borglet won't have a chance to set a color for them.
+That's where the ``cpu.core_tag_color`` file comes in: a color can be set by
+AppEngine, and once set, the normal tasks within the sub-CGroup are not able to
+overwrite it. This is enforced by restricting the permissions of the
+``cpu.core_tag_color`` file in cgroupfs.
+
+The color is an 8-bit value allowing for up to 256 unique colors.
+
+.. note:: Once a CGroup is colored, none of its descendants can be re-colored. Also
+          coloring of a CGroup is possible only if either the group or one of its
+          ancestors was tagged via the ``cpu.core_tag`` file.
+
+prctl interface
+###############
+A ``prctl(2)`` command ``PR_SCHED_CORE_SHARE`` is available to a process to request
+sharing a core with another process.  For example, consider 2 processes ``P1``
+and ``P2`` with PIDs 100 and 200. If process ``P1`` calls
+``prctl(PR_SCHED_CORE_SHARE, 200)``, the kernel makes ``P1`` share a core with ``P2``.
+The kernel performs ptrace access mode checks before granting the request.
+
+.. note:: This operation is not commutative. P1 calling
+          ``prctl(PR_SCHED_CORE_SHARE, pidof(P2))`` is not the same as P2 calling the
+          same for P1. The former case is P1 joining P2's group of processes
+          (which P2 would have joined with ``prctl(2)`` prior to P1's ``prctl(2)``).
+
+.. note:: The core-sharing granted with prctl(2) will be subject to
+          core-sharing restrictions specified by the CGroup interface. For example
+          if P1 and P2 are a part of 2 different tagged CGroups, then they will
+          not share a core even if a prctl(2) call is made. This is analogous
+          to how affinities are set using the cpuset interface.
+
+It is important to note that, on a ``CLONE_THREAD`` ``clone(2)`` syscall, the child
+will be assigned the same tag as its parent and thus be allowed to share a core
+with it. This design choice was made because, for the security usecase, a
+``CLONE_THREAD`` child can access its parent's address space anyway, so there's
+no point in not allowing them to share a core. If a different behavior is
+desired, the child thread can call ``prctl(2)`` as needed.  This behavior is
+specific to the ``prctl(2)`` interface. For the CGroup interface, the child of a
+fork always shares a core with its parent.  On the other hand, if a parent
+was previously tagged via ``prctl(2)`` and does a regular ``fork(2)`` syscall, the
+child will receive a unique tag.
+
+Design/Implementation
+---------------------
+Each task that is tagged is assigned a cookie internally in the kernel. As
+mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
+each other and share a core.
+
+The basic idea is that, every schedule event tries to select tasks for all the
+siblings of a core such that all the selected tasks running on a core are
+trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
+The idle task is considered special, as it trusts everything and everything
+trusts it.
+
+During a schedule() event on any sibling of a core, the highest priority task on
+that core is picked and assigned to the sibling calling schedule(), if the
+sibling has that task enqueued. For the rest of the siblings in the core, the
+highest priority task with the same cookie is selected if one is runnable in
+their individual run queues. If a task with the same cookie is not available,
+the idle task is selected. The idle task is globally trusted.
+
+Once a task has been selected for all the siblings in the core, an IPI is sent to
+the siblings for whom a new task was selected. On receiving the IPI, the siblings
+switch to the new task immediately. If an idle task is selected for a sibling,
+then the sibling is considered to be in a `forced idle` state: it may
+have tasks on its own runqueue to run, yet it is forced to run idle.
+More on this in the next section.
+
+Forced-idling of tasks
+----------------------
+The scheduler tries its best to find tasks that trust each other such that all
+tasks selected to be scheduled are of the highest priority in a core.  However,
+it is possible that some runqueues had tasks that were incompatible with the
+highest priority ones in the core. Favoring security over fairness, one or more
+siblings could be forced to select a lower priority task if the highest
+priority task is not trusted with respect to the core wide highest priority
+task.  If a sibling does not have a trusted task to run, it will be forced idle
+by the scheduler (idle thread is scheduled to run).
+
+When the highest priority task is selected to run, a reschedule-IPI is sent to
+the sibling to force it into idle. This results in 4 cases which need to be
+considered depending on whether a VM or a regular usermode process was running
+on either HT::
+
+          HT1 (attack)            HT2 (victim)
+   A      idle -> user space      user space -> idle
+   B      idle -> user space      guest -> idle
+   C      idle -> guest           user space -> idle
+   D      idle -> guest           guest -> idle
+
+Note that for better performance, we do not wait for the destination CPU
+(victim) to enter idle mode. This is because the sending of the IPI would bring
+the destination CPU immediately into kernel mode from user space, or cause a
+VMEXIT in the case of guests. At worst, this would only leak some scheduler
+metadata, which may not be worth protecting. It is also possible that the IPI is
+received too late on some architectures, but this has not been observed in the
+case of x86.
+
+Kernel protection from untrusted tasks
+--------------------------------------
+Entry into the kernel (syscall, IRQ or VMEXIT) needs protection. The scheduler
+on its own cannot protect the kernel executing concurrently with an untrusted
+task in a core. This is because the scheduler is unaware of interrupts/syscalls
+at scheduling time. To mitigate this, an IPI is sent to siblings on kernel
+entry. This IPI forces the sibling to enter kernel mode and to wait there,
+before returning to user mode, until all siblings of the core have left kernel
+mode. This process is also known as stunning. For good performance, an IPI is
+sent to a sibling only if it is running a tagged task. If a sibling is running a
+kernel thread or is idle, no IPI is sent.
+
+The kernel protection feature is on by default and can be turned off on the
+kernel command line by passing ``sched_core_protect_kernel=0``.
+
+Note that an arch has to define the ``TIF_UNSAFE_RET`` thread info flag to be
+able to use kernel protection. Also, if protecting the kernel from a VM is
+desired, an arch should call kvm_exit_to_guest_mode() during ``VMENTER`` and
+kvm_enter_from_guest_mode() during ``VMEXIT``. Currently, x86 supports both of
+these.
+
+Other alternative ideas discussed for kernel protection are listed below just
+for completeness. They all have limitations:
+
+1. Changing interrupt affinities to a trusted core which does not execute untrusted tasks
+#########################################################################################
+By changing the interrupt affinities to a designated safe CPU which runs
+only trusted tasks, IRQ data can be protected. One issue is that this involves
+giving up a full CPU core of the system to run safe tasks. Another is that
+per-cpu interrupts such as the local timer interrupt cannot have their
+affinity changed. Also, sensitive timer callbacks such as the random entropy timer
+can run in softirq on return from these interrupts and expose sensitive
+data. In the future, that could be mitigated by forcing softirqs into threaded
+mode by utilizing a mechanism similar to ``CONFIG_PREEMPT_RT``.
+
+Yet another issue is that, for multiqueue devices with managed
+interrupts, the IRQ affinities cannot be changed; however, it may be
+possible to force a reduced number of queues, which would in turn allow
+shielding one or two CPUs from such interrupts and queue handling, for the
+price of indirection.
+
+2. Running IRQs as threaded-IRQs
+################################
+This would result in forcing IRQs into the scheduler which would then provide
+the process-context mitigation. However, not all interrupts can be threaded.
+Also this does nothing about syscall entries.
+
+3. Kernel Address Space Isolation
+#################################
+System calls could run in a much restricted address space which is
+guaranteed not to leak any sensitive data. There are practical limitations in
+implementing this - the main concern being how to decide on an address space
+that is guaranteed not to contain any sensitive data.
+
+4. Limited cookie-based protection
+##################################
+On a system call, change the cookie to the system trusted cookie and initiate a
+schedule event. This would be better than pausing all the siblings during the
+entire duration of the system call, but it would still be a huge hit to
+performance.
+
+Trust model
+-----------
+Core scheduling maintains trust relationships amongst groups of tasks by
+assigning them the same cookie value.
+When a system with core scheduling boots, all tasks are considered to trust
+each other. This is because the core scheduler does not have information about
+trust relationships until userspace uses the above mentioned interfaces to
+communicate them. In other words, all tasks have the default cookie value of 0
+and are considered system-wide trusted. The stunning of siblings running
+cookie-0 tasks is also avoided.
+
+Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
+within such groups are considered to trust each other, but do not trust those
+outside. Tasks outside the group also don't trust tasks within.
+
+coresched command line option
+-----------------------------
+The coresched kernel command line option can be used to:
+  - Keep coresched on even if the system is not vulnerable (``coresched=on``).
+  - Keep coresched off even if the system is vulnerable (``coresched=off``).
+  - Keep coresched on only if the system is vulnerable (``coresched=secure``).
+
+The default is ``coresched=secure``. However, a user who has a usecase that
+needs core scheduling, such as improving the performance of VMs by tagging vCPU
+threads, can pass ``coresched=on`` to force it on.
+
+Limitations in core-scheduling
+------------------------------
+Core scheduling tries to guarantee that only trusted tasks run concurrently on a
+core. But there could be a small window of time during which untrusted tasks run
+concurrently, or during which the kernel runs concurrently with a task not
+trusted by the kernel.
+
+1. IPI processing delays
+########################
+Core scheduling selects only trusted tasks to run together. An IPI is used to
+notify the siblings to switch to the new task. But there could be hardware delays
+in receiving the IPI on some architectures (on x86, this has not been observed).
+This may cause an attacker task to start running on a CPU before its siblings
+receive the IPI. Even though the cache is flushed on entry to user mode, victim
+tasks on siblings may populate data in the cache and micro-architectural buffers
+after the attacker starts to run, and this is a possible avenue for a data leak.
+
+Open cross-HT issues that core scheduling does not solve
+--------------------------------------------------------
+1. For MDS
+##########
+Core scheduling cannot protect against MDS attacks between an HT running in
+user mode and another running in kernel mode. Even though both HTs run tasks
+which trust each other, kernel memory is still considered untrusted. Such
+attacks are possible for any combination of sibling CPU modes (host or guest mode).
+
+2. For L1TF
+###########
+Core scheduling cannot protect against an L1TF guest attacker exploiting a
+guest or host victim. This is because the guest attacker can craft invalid
+PTEs which are not inverted due to a vulnerable guest kernel. The only
+solution is to disable EPT (Extended Page Tables).
+
+For both MDS and L1TF, if the guest vCPUs are configured to not trust each
+other (by tagging them separately), then the guest-to-guest attacks would go
+away. Alternatively, a system admin policy could consider guest-to-guest
+attacks a guest problem.
+
+Another approach to resolving these would be to make every untrusted task on the
+system distrust every other untrusted task. While this could reduce the
+parallelism of the untrusted tasks, it would still solve the above issues while
+allowing system processes (trusted tasks) to share a core.
+
+Use cases
+---------
+The main use case for core scheduling is mitigating the cross-HT vulnerabilities
+with SMT enabled. There are other use cases where this feature could be used:
+
+- Isolating tasks that need a whole core: examples include realtime tasks and
+  tasks that use SIMD instructions.
+- Gang scheduling: requirements for a group of tasks that need to be scheduled
+  together could also be realized using core scheduling. One example is the
+  vCPUs of a VM.
+
+Future work
+-----------
+Skipping per-HT mitigations if task is trusted
+##############################################
+If core scheduling is enabled, by default all tasks trust each other as
+mentioned above. In such a scenario, it may be desirable to skip the same-HT
+mitigations on return to trusted user mode to improve performance.
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index 21710f8609fe..361ccbbd9e54 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -16,3 +16,4 @@ are configurable at compile, boot or run time.
    multihit.rst
    special-register-buffer-data-sampling.rst
    l1d_flush.rst
+   core-scheduling.rst
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH -tip 31/32] sched: Add a coresched command line option
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (29 preceding siblings ...)
  2020-11-17 23:20 ` [PATCH -tip 30/32] Documentation: Add core scheduling documentation Joel Fernandes (Google)
@ 2020-11-17 23:20 ` Joel Fernandes (Google)
  2020-11-19 23:39   ` Randy Dunlap
  2020-11-25 13:45   ` Peter Zijlstra
  2020-11-17 23:20 ` [PATCH -tip 32/32] sched: Debug bits Joel Fernandes (Google)
  2020-11-24 11:48 ` [PATCH -tip 00/32] Core scheduling (v9) Vincent Guittot
  32 siblings, 2 replies; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:20 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

Some hardware, such as certain AMD variants, doesn't have cross-HT MDS/L1TF
issues. Detect this and don't enable core scheduling, as it can
needlessly slow those devices down.

However, some users may want core scheduling even if the hardware is
secure. To support them, add a coresched= option which defaults to
'secure' and can be overridden to 'on' if the user wants to enable
coresched even if the HW is not vulnerable. 'off' would disable
core scheduling in any case.

Also add a sched_debug entry to indicate if core scheduling is turned on
or not.

Reviewed-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/kernel-parameters.txt         | 14 ++++++
 arch/x86/kernel/cpu/bugs.c                    | 19 ++++++++
 include/linux/cpu.h                           |  1 +
 include/linux/sched/smt.h                     |  4 ++
 kernel/cpu.c                                  | 43 +++++++++++++++++++
 kernel/sched/core.c                           |  6 +++
 kernel/sched/debug.c                          |  4 ++
 7 files changed, 91 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b185c6ed4aba..9cd2cf7c18d4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -698,6 +698,20 @@
 			/proc/<pid>/coredump_filter.
 			See also Documentation/filesystems/proc.rst.
 
+	coresched=	[SCHED_CORE] This feature allows the Linux scheduler
+			to ensure that hyperthread siblings of a core only
+			concurrently execute tasks that belong to the same
+			core scheduling group.
+			Possible values are:
+			'on' - Enable scheduler capability to core schedule.
+			By default, no tasks will be core scheduled, but the coresched
+			interface can be used to form groups of tasks that are forced
+			to share a core.
+			'off' - Disable scheduler capability to core schedule.
+			'secure' - Like 'on' but only enable on systems affected by
+			MDS or L1TF vulnerabilities. 'off' otherwise.
+			Default: 'secure'.
+
 	coresight_cpu_debug.enable
 			[ARM,ARM64]
 			Format: <bool>
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index dece79e4d1e9..f3163f4a805c 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -43,6 +43,7 @@ static void __init mds_select_mitigation(void);
 static void __init mds_print_mitigation(void);
 static void __init taa_select_mitigation(void);
 static void __init srbds_select_mitigation(void);
+static void __init coresched_select(void);
 
 /* The base value of the SPEC_CTRL MSR that always has to be preserved. */
 u64 x86_spec_ctrl_base;
@@ -103,6 +104,9 @@ void __init check_bugs(void)
 	if (boot_cpu_has(X86_FEATURE_STIBP))
 		x86_spec_ctrl_mask |= SPEC_CTRL_STIBP;
 
+	/* Update whether core-scheduling is needed. */
+	coresched_select();
+
 	/* Select the proper CPU mitigations before patching alternatives: */
 	spectre_v1_select_mitigation();
 	spectre_v2_select_mitigation();
@@ -1808,4 +1812,19 @@ ssize_t cpu_show_srbds(struct device *dev, struct device_attribute *attr, char *
 {
 	return cpu_show_common(dev, attr, buf, X86_BUG_SRBDS);
 }
+
+/*
+ * When coresched=secure command line option is passed (default), disable core
+ * scheduling if CPU does not have MDS/L1TF vulnerability.
+ */
+static void __init coresched_select(void)
+{
+#ifdef CONFIG_SCHED_CORE
+	if (coresched_cmd_secure() &&
+	    !boot_cpu_has_bug(X86_BUG_MDS) &&
+	    !boot_cpu_has_bug(X86_BUG_L1TF))
+		static_branch_disable(&sched_coresched_supported);
+#endif
+}
+
 #endif
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index d6428aaf67e7..d1f1e64316d6 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -228,4 +228,5 @@ static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0;
 extern bool cpu_mitigations_off(void);
 extern bool cpu_mitigations_auto_nosmt(void);
 
+extern bool coresched_cmd_secure(void);
 #endif /* _LINUX_CPU_H_ */
diff --git a/include/linux/sched/smt.h b/include/linux/sched/smt.h
index 59d3736c454c..561064eb3268 100644
--- a/include/linux/sched/smt.h
+++ b/include/linux/sched/smt.h
@@ -17,4 +17,8 @@ static inline bool sched_smt_active(void) { return false; }
 
 void arch_smt_update(void);
 
+#ifdef CONFIG_SCHED_CORE
+extern struct static_key_true sched_coresched_supported;
+#endif
+
 #endif
diff --git a/kernel/cpu.c b/kernel/cpu.c
index fa535eaa4826..f22330c3ab4c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2559,3 +2559,46 @@ bool cpu_mitigations_auto_nosmt(void)
 	return cpu_mitigations == CPU_MITIGATIONS_AUTO_NOSMT;
 }
 EXPORT_SYMBOL_GPL(cpu_mitigations_auto_nosmt);
+
+/*
+ * These are used for a global "coresched=" cmdline option for controlling
+ * core scheduling. Note that core sched may be needed for usecases other
+ * than security as well.
+ */
+enum coresched_cmds {
+	CORE_SCHED_OFF,
+	CORE_SCHED_SECURE,
+	CORE_SCHED_ON,
+};
+
+static enum coresched_cmds coresched_cmd __ro_after_init = CORE_SCHED_SECURE;
+
+static int __init coresched_parse_cmdline(char *arg)
+{
+	if (!strcmp(arg, "off"))
+		coresched_cmd = CORE_SCHED_OFF;
+	else if (!strcmp(arg, "on"))
+		coresched_cmd = CORE_SCHED_ON;
+	else if (!strcmp(arg, "secure"))
+		/*
+		 * On x86, coresched=secure means coresched is enabled only if
+		 * system has MDS/L1TF vulnerability (see x86/bugs.c).
+		 */
+		coresched_cmd = CORE_SCHED_SECURE;
+	else
+		pr_crit("Unsupported coresched=%s, defaulting to secure.\n",
+			arg);
+
+	if (coresched_cmd == CORE_SCHED_OFF)
+		static_branch_disable(&sched_coresched_supported);
+
+	return 0;
+}
+early_param("coresched", coresched_parse_cmdline);
+
+/* coresched=secure */
+bool coresched_cmd_secure(void)
+{
+	return coresched_cmd == CORE_SCHED_SECURE;
+}
+EXPORT_SYMBOL_GPL(coresched_cmd_secure);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ef04bdc849f..01938a2154fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -325,8 +325,12 @@ static void __sched_core_disable(void)
 	static_branch_disable(&__sched_core_enabled);
 }
 
+DEFINE_STATIC_KEY_TRUE(sched_coresched_supported);
+
 void sched_core_get(void)
 {
+	if (!static_branch_likely(&sched_coresched_supported))
+		return;
 	mutex_lock(&sched_core_mutex);
 	if (!sched_core_count++)
 		__sched_core_enable();
@@ -335,6 +339,8 @@ void sched_core_get(void)
 
 void sched_core_put(void)
 {
+	if (!static_branch_likely(&sched_coresched_supported))
+		return;
 	mutex_lock(&sched_core_mutex);
 	if (!--sched_core_count)
 		__sched_core_disable();
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8c452b8010ad..cffdfab7478e 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -782,6 +782,10 @@ static void sched_debug_header(struct seq_file *m)
 		"sysctl_sched_tunable_scaling",
 		sysctl_sched_tunable_scaling,
 		sched_tunable_scaling_names[sysctl_sched_tunable_scaling]);
+#ifdef CONFIG_SCHED_CORE
+	SEQ_printf(m, "  .%-40s: %d\n", "core_sched_enabled",
+		   !!static_branch_likely(&__sched_core_enabled));
+#endif
 	SEQ_printf(m, "\n");
 }
 
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH -tip 32/32] sched: Debug bits...
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (30 preceding siblings ...)
  2020-11-17 23:20 ` [PATCH -tip 31/32] sched: Add a coresched command line option Joel Fernandes (Google)
@ 2020-11-17 23:20 ` Joel Fernandes (Google)
  2020-12-01  0:21   ` Balbir Singh
  2020-11-24 11:48 ` [PATCH -tip 00/32] Core scheduling (v9) Vincent Guittot
  32 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes (Google) @ 2020-11-17 23:20 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, joel,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Not-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 35 ++++++++++++++++++++++++++++++++++-
 kernel/sched/fair.c |  9 +++++++++
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01938a2154fd..bbeeb18d460e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -127,6 +127,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b, bool
 
 	int pa = __task_prio(a), pb = __task_prio(b);
 
+	trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+		     a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+		     b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
 	if (-pa < -pb)
 		return true;
 
@@ -317,12 +321,16 @@ static void __sched_core_enable(void)
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+	printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
+
+	printk("core sched disabled\n");
 }
 
 DEFINE_STATIC_KEY_TRUE(sched_coresched_supported);
@@ -5486,6 +5494,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			set_next_task(rq, next);
 		}
 
+		trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+			     rq->core->core_task_seq,
+			     rq->core->core_pick_seq,
+			     rq->core_sched_seq,
+			     next->comm, next->pid,
+			     next->core_cookie);
+
 		rq->core_pick = NULL;
 		return next;
 	}
@@ -5580,6 +5595,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 					rq->core->core_forceidle_seq++;
 			}
 
+			trace_printk("cpu(%d): selected: %s/%d %lx\n",
+				     i, p->comm, p->pid, p->core_cookie);
+
 			/*
 			 * If this new candidate is of higher priority than the
 			 * previous; and they're incompatible; we need to wipe
@@ -5596,6 +5614,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				rq->core->core_cookie = p->core_cookie;
 				max = p;
 
+				trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);
+
 				if (old_max) {
 					rq->core->core_forceidle = false;
 					for_each_cpu(j, smt_mask) {
@@ -5617,6 +5637,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	/* Something should have been selected for current CPU */
 	WARN_ON_ONCE(!next);
+	trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
 
 	/*
 	 * Reschedule siblings
@@ -5658,13 +5679,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		}
 
 		/* Did we break L1TF mitigation requirements? */
-		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+		if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+			trace_printk("[%d]: cookie mismatch. %s/%d/0x%lx/0x%lx\n",
+				     rq_i->cpu, rq_i->core_pick->comm,
+				     rq_i->core_pick->pid,
+				     rq_i->core_pick->core_cookie,
+				     rq_i->core->core_cookie);
+			WARN_ON_ONCE(1);
+		}
 
 		if (rq_i->curr == rq_i->core_pick) {
 			rq_i->core_pick = NULL;
 			continue;
 		}
 
+		trace_printk("IPI(%d)\n", i);
 		resched_curr(rq_i);
 	}
 
@@ -5704,6 +5733,10 @@ static bool try_steal_cookie(int this, int that)
 		if (p->core_occupation > dst->idle->core_occupation)
 			goto next;
 
+		trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+			     p->comm, p->pid, that, this,
+			     p->core_occupation, dst->idle->core_occupation, cookie);
+
 		p->on_rq = TASK_ON_RQ_MIGRATING;
 		deactivate_task(src, p, 0);
 		set_task_cpu(p, this);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a89c7c917cc6..81c8a50ab4c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10798,6 +10798,15 @@ static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forc
 			cfs_rq->forceidle_seq = fi_seq;
 		}
 
+
+		if (root) {
+			old = cfs_rq->min_vruntime_fi;
+			new = cfs_rq->min_vruntime;
+			root = false;
+			trace_printk("cfs_rq(min_vruntime_fi) %lu->%lu\n",
+				     old, new);
+		}
+
 		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
 	}
 }
-- 
2.29.2.299.gdc1121823c-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 01/32] sched: Wrap rq::lock access
  2020-11-17 23:19 ` [PATCH -tip 01/32] sched: Wrap rq::lock access Joel Fernandes (Google)
@ 2020-11-19 23:31   ` Singh, Balbir
  2020-11-20 16:55     ` Joel Fernandes
  0 siblings, 1 reply; 150+ messages in thread
From: Singh, Balbir @ 2020-11-19 23:31 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On 18/11/20 10:19 am, Joel Fernandes (Google) wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> In preparation of playing games with rq->lock, abstract the thing
> using an accessor.
> 

Could you clarify games? I presume the intention is to redefine the scope of the lock based on whether core sched is enabled or not? I presume patch 4/32 has the details.

Balbir Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 31/32] sched: Add a coresched command line option
  2020-11-17 23:20 ` [PATCH -tip 31/32] sched: Add a coresched command line option Joel Fernandes (Google)
@ 2020-11-19 23:39   ` Randy Dunlap
  2020-11-25 13:45   ` Peter Zijlstra
  1 sibling, 0 replies; 150+ messages in thread
From: Randy Dunlap @ 2020-11-19 23:39 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

Hi Joel,

On 11/17/20 3:20 PM, Joel Fernandes (Google) wrote:
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index b185c6ed4aba..9cd2cf7c18d4 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -698,6 +698,20 @@
>  			/proc/<pid>/coredump_filter.
>  			See also Documentation/filesystems/proc.rst.
>  
> +	coresched=	[SCHED_CORE] This feature allows the Linux scheduler

Unless I missed it somewhere else, this "SCHED_CORE" string should be
added to Documentation/admin-guide/kernel-parameters.rst, where there is
a list of "qualifiers" for kernel parameters.

(It looks like you are using it as the name of a Kconfig option, which
makes some sense, but that's not how it's [currently] done. :)


> +			to force hyperthread siblings of a CPU to only execute tasks
> +			concurrently on all hyperthreads that are running within the
> +			same core scheduling group.
> +			Possible values are:
> +			'on' - Enable scheduler capability to core schedule.
> +			By default, no tasks will be core scheduled, but the coresched
> +			interface can be used to form groups of tasks that are forced
> +			to share a core.
> +			'off' - Disable scheduler capability to core schedule.
> +			'secure' - Like 'on' but only enable on systems affected by
> +			MDS or L1TF vulnerabilities. 'off' otherwise.
> +			Default: 'secure'.
> +
>  	coresight_cpu_debug.enable
>  			[ARM,ARM64]
>  			Format: <bool>

thanks.
-- 
~Randy


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 02/32] sched: Introduce sched_class::pick_task()
  2020-11-17 23:19 ` [PATCH -tip 02/32] sched: Introduce sched_class::pick_task() Joel Fernandes (Google)
@ 2020-11-19 23:56   ` Singh, Balbir
  2020-11-20 16:58     ` Joel Fernandes
  2020-11-25 16:28   ` Vincent Guittot
  1 sibling, 1 reply; 150+ messages in thread
From: Singh, Balbir @ 2020-11-19 23:56 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On 18/11/20 10:19 am, Joel Fernandes (Google) wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Because sched_class::pick_next_task() also implies
> sched_class::set_next_task() (and possibly put_prev_task() and
> newidle_balance) it is not state invariant. This makes it unsuitable
> for remote task selection.
> 

The change makes sense, a small suggestion below

> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/deadline.c  | 16 ++++++++++++++--
>  kernel/sched/fair.c      | 32 +++++++++++++++++++++++++++++++-
>  kernel/sched/idle.c      |  8 ++++++++
>  kernel/sched/rt.c        | 15 +++++++++++++--
>  kernel/sched/sched.h     |  3 +++
>  kernel/sched/stop_task.c | 14 ++++++++++++--
>  6 files changed, 81 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 0f2ea0a3664c..abfc8b505d0d 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1867,7 +1867,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
>  	return rb_entry(left, struct sched_dl_entity, rb_node);
>  }
>  
> -static struct task_struct *pick_next_task_dl(struct rq *rq)
> +static struct task_struct *pick_task_dl(struct rq *rq)
>  {
>  	struct sched_dl_entity *dl_se;
>  	struct dl_rq *dl_rq = &rq->dl;
> @@ -1879,7 +1879,18 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
>  	dl_se = pick_next_dl_entity(rq, dl_rq);
>  	BUG_ON(!dl_se);
>  	p = dl_task_of(dl_se);
> -	set_next_task_dl(rq, p, true);
> +
> +	return p;
> +}
> +
> +static struct task_struct *pick_next_task_dl(struct rq *rq)
> +{
> +	struct task_struct *p;
> +
> +	p = pick_task_dl(rq);
> +	if (p)
> +		set_next_task_dl(rq, p, true);
> +
>  	return p;
>  }
>  
> @@ -2551,6 +2562,7 @@ DEFINE_SCHED_CLASS(dl) = {
>  
>  #ifdef CONFIG_SMP
>  	.balance		= balance_dl,
> +	.pick_task		= pick_task_dl,
>  	.select_task_rq		= select_task_rq_dl,
>  	.migrate_task_rq	= migrate_task_rq_dl,
>  	.set_cpus_allowed       = set_cpus_allowed_dl,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 52ddfec7cea6..12cf068eeec8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4459,7 +4459,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	 * Avoid running the skip buddy, if running something else can
>  	 * be done without getting too unfair.
>  	 */
> -	if (cfs_rq->skip == se) {
> +	if (cfs_rq->skip && cfs_rq->skip == se) {
>  		struct sched_entity *second;
>  
>  		if (se == curr) {
> @@ -7017,6 +7017,35 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>  		set_last_buddy(se);
>  }
>  
> +#ifdef CONFIG_SMP
> +static struct task_struct *pick_task_fair(struct rq *rq)
> +{
> +	struct cfs_rq *cfs_rq = &rq->cfs;
> +	struct sched_entity *se;
> +
> +	if (!cfs_rq->nr_running)
> +		return NULL;
> +
> +	do {
> +		struct sched_entity *curr = cfs_rq->curr;
> +
> +		se = pick_next_entity(cfs_rq, NULL);
> +
> +		if (curr) {
> +			if (se && curr->on_rq)
> +				update_curr(cfs_rq);
> +
> +			if (!se || entity_before(curr, se))
> +				se = curr;
> +		}

Do we want to optimize this a bit?

if (curr) {
	if (!se || entity_before(curr, se))
		se = curr;

	if ((se != curr) && curr->on_rq)
		update_curr(cfs_rq);

}

Balbir

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 03/32] sched/fair: Fix pick_task_fair crashes due to empty rbtree
  2020-11-17 23:19 ` [PATCH -tip 03/32] sched/fair: Fix pick_task_fair crashes due to empty rbtree Joel Fernandes (Google)
@ 2020-11-20 10:15   ` Singh, Balbir
  2020-11-20 18:11     ` Vineeth Pillai
  2020-11-24  8:31     ` Peter Zijlstra
  0 siblings, 2 replies; 150+ messages in thread
From: Singh, Balbir @ 2020-11-20 10:15 UTC (permalink / raw)
  To: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On 18/11/20 10:19 am, Joel Fernandes (Google) wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> pick_next_entity() is passed curr == NULL during core-scheduling. Due to
> this, if the rbtree is empty, the 'left' variable is set to NULL within
> the function. This can cause crashes within the function.
> 
> This is not an issue if put_prev_task() is invoked on the currently
> running task before calling pick_next_entity(). However, in core
> scheduling, it is possible that a sibling CPU picks for another RQ in
> the core, via pick_task_fair(). This remote sibling would not get any
> opportunities to do a put_prev_task().
> 
> Fix it by refactoring pick_task_fair() such that pick_next_entity() is
> called with the cfs_rq->curr. This will prevent pick_next_entity() from
> crashing if its rbtree is empty.
> 
> Also this fixes another possible bug where update_curr() would not be
> called on the cfs_rq hierarchy if the rbtree is empty. This could effect
> cross-cpu comparison of vruntime.
> 

It is not clear from the changelog what put_prev_task() does to prevent
the crash from occurring. Why did we pass NULL as curr to
pick_next_entity() in the first place?

The patch looks functionally correct as in, it passes curr as the reference
to pick_next_entity() for left and entity_before comparisons.

Balbir Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 01/32] sched: Wrap rq::lock access
  2020-11-19 23:31   ` Singh, Balbir
@ 2020-11-20 16:55     ` Joel Fernandes
  2020-11-22  8:52       ` Balbir Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes @ 2020-11-20 16:55 UTC (permalink / raw)
  To: Singh, Balbir
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Fri, Nov 20, 2020 at 10:31:39AM +1100, Singh, Balbir wrote:
> On 18/11/20 10:19 am, Joel Fernandes (Google) wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > In preparation of playing games with rq->lock, abstract the thing
> > using an accessor.
> > 
> 
> Could you clarify games? I presume the intention is to redefine the scope
> of the lock based on whether core sched is enabled or not? I presume patch
> 4/32 has the details.

Your line wrapping broke, I fixed it.

That is in fact the game. By wrapping it, the nature of the locking is
dynamic based on whether core sched is enabled or not (both statically and
dynamically).
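The accessor pattern Joel describes can be sketched in a few lines of
userspace C. This is a simplified stand-in for illustration only: the real
series uses raw_spinlock_t and a static key named __sched_core_enabled,
whereas the types and the flag below are invented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model: each runqueue normally uses its own lock, but when
 * core scheduling is enabled, all siblings of a core funnel through the
 * one lock belonging to the core's leader rq. */
struct rq {
	int lock;		/* stand-in for raw_spinlock_t */
	struct rq *core;	/* leader rq of this core */
};

static bool sched_core_enabled_flag;	/* stand-in for the static key */

/* Every rq->lock user goes through this accessor, so flipping the flag
 * transparently changes which lock the whole scheduler takes. */
static int *rq_lockp(struct rq *rq)
{
	if (sched_core_enabled_flag)
		return &rq->core->lock;
	return &rq->lock;
}
```

With the accessor in place, enabling core scheduling at runtime changes the
lock every caller takes without touching any call site.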

thanks,

 - Joel

 
> Balbir Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 02/32] sched: Introduce sched_class::pick_task()
  2020-11-19 23:56   ` Singh, Balbir
@ 2020-11-20 16:58     ` Joel Fernandes
  2020-11-25 23:19       ` Balbir Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes @ 2020-11-20 16:58 UTC (permalink / raw)
  To: Singh, Balbir
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Fri, Nov 20, 2020 at 10:56:09AM +1100, Singh, Balbir wrote:
[..] 
> > +#ifdef CONFIG_SMP
> > +static struct task_struct *pick_task_fair(struct rq *rq)
> > +{
> > +	struct cfs_rq *cfs_rq = &rq->cfs;
> > +	struct sched_entity *se;
> > +
> > +	if (!cfs_rq->nr_running)
> > +		return NULL;
> > +
> > +	do {
> > +		struct sched_entity *curr = cfs_rq->curr;
> > +
> > +		se = pick_next_entity(cfs_rq, NULL);
> > +
> > +		if (curr) {
> > +			if (se && curr->on_rq)
> > +				update_curr(cfs_rq);
> > +
> > +			if (!se || entity_before(curr, se))
> > +				se = curr;
> > +		}
> 
> Do we want to optimize this a bit 
> 
> if (curr) {
> 	if (!se || entity_before(curr, se))
> 		se = curr;
> 
> 	if ((se != curr) && curr->on_rq)
> 		update_curr(cfs_rq);
> 
> }

Do you see a difference in codegen? What's the optimization?

Also in later patches this code changes, so it should not matter:
See: 3e0838fa3c51 ("sched/fair: Fix pick_task_fair crashes due to empty rbtree")

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 03/32] sched/fair: Fix pick_task_fair crashes due to empty rbtree
  2020-11-20 10:15   ` Singh, Balbir
@ 2020-11-20 18:11     ` Vineeth Pillai
  2020-11-23 22:31       ` Balbir Singh
  2020-11-24  8:31     ` Peter Zijlstra
  1 sibling, 1 reply; 150+ messages in thread
From: Vineeth Pillai @ 2020-11-20 18:11 UTC (permalink / raw)
  To: Singh, Balbir, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Aaron Lu, Aubrey Li, tglx, linux-kernel
  Cc: mingo, torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

Hi Balbir,

On 11/20/20 5:15 AM, Singh, Balbir wrote:
> On 18/11/20 10:19 am, Joel Fernandes (Google) wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>>
>> pick_next_entity() is passed curr == NULL during core-scheduling. Due to
>> this, if the rbtree is empty, the 'left' variable is set to NULL within
>> the function. This can cause crashes within the function.
>>
>> This is not an issue if put_prev_task() is invoked on the currently
>> running task before calling pick_next_entity(). However, in core
>> scheduling, it is possible that a sibling CPU picks for another RQ in
>> the core, via pick_task_fair(). This remote sibling would not get any
>> opportunities to do a put_prev_task().
>>
>> Fix it by refactoring pick_task_fair() such that pick_next_entity() is
>> called with the cfs_rq->curr. This will prevent pick_next_entity() from
>> crashing if its rbtree is empty.
>>
>> Also this fixes another possible bug where update_curr() would not be
>> called on the cfs_rq hierarchy if the rbtree is empty. This could affect
>> cross-cpu comparison of vruntime.
>>
> It is not clear from the changelog what put_prev_task() does to prevent
> the crash from occurring. Why did we pass NULL as curr to
> pick_next_entity() in the first place?
A little more context on this crash in v8 is here:
https://lwn.net/ml/linux-kernel/8230ada7-839f-2335-9a55-b09f6a813e91@linux.microsoft.com/

The issue here arises from the fact that, we try to pick task for a
sibling while sibling is running a task. Running tasks are not in the
cfs_rq and pick_next_entity can return NULL if there is only one cfs
task in the cfs_rq. This would not happen normally because
put_prev_task is called before pick_task and put_prev_task adds the
task back to cfs_rq. But for coresched, pick_task is called on a
remote sibling's cfs_rq without calling put_prev_task and this can
lead to pick_next_entity returning NULL.

The initial logic of passing NULL would work fine as long as we call
put_prev_task before calling pick_task_fair. But for coresched, we
call pick_task_fair on siblings while the task is running and would
not be able to call put_prev_task. So this refactor of the code fixes
the crash by explicitly passing curr.
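The situation Vineeth describes can be illustrated with a small userspace
sketch. The types and helper names below are invented for illustration (the
real code picks from an rbtree of sched_entities); the point is only that
when the running task is held out of the tree in 'curr' and the tree is
otherwise empty, a pick that ignores curr comes back NULL.

```c
#include <assert.h>
#include <stddef.h>

struct entity { unsigned long vruntime; };

/* Toy cfs_rq: 'tree' holds queued entities; the running entity is kept
 * out of the tree in 'curr', as in CFS. */
struct toy_cfs_rq {
	struct entity *tree[4];
	int nr_queued;
	struct entity *curr;
};

/* Leftmost queued entity, or NULL if the tree is empty. */
static struct entity *pick_from_tree(struct toy_cfs_rq *cfs_rq)
{
	struct entity *left = NULL;

	for (int i = 0; i < cfs_rq->nr_queued; i++)
		if (!left || cfs_rq->tree[i]->vruntime < left->vruntime)
			left = cfs_rq->tree[i];
	return left;
}

/* A remote pick must consider 'curr' explicitly, because the sibling
 * never did put_prev_task(), so its running task is not in the tree. */
static struct entity *pick_entity(struct toy_cfs_rq *cfs_rq,
				  struct entity *curr)
{
	struct entity *se = pick_from_tree(cfs_rq);

	if (curr && (!se || curr->vruntime < se->vruntime))
		se = curr;
	return se;	/* NULL only if the tree is empty AND curr == NULL */
}
```

Passing NULL for curr reproduces the crash scenario (a NULL pick on a busy
rq); passing cfs_rq->curr, as the fix does, always yields a valid entity.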

Hope this clarifies.

Thanks,
Vineeth


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 01/32] sched: Wrap rq::lock access
  2020-11-20 16:55     ` Joel Fernandes
@ 2020-11-22  8:52       ` Balbir Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Balbir Singh @ 2020-11-22  8:52 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Fri, Nov 20, 2020 at 11:55:22AM -0500, Joel Fernandes wrote:
> On Fri, Nov 20, 2020 at 10:31:39AM +1100, Singh, Balbir wrote:
> > On 18/11/20 10:19 am, Joel Fernandes (Google) wrote:
> > > From: Peter Zijlstra <peterz@infradead.org>
> > > 
> > > In preparation of playing games with rq->lock, abstract the thing
> > > using an accessor.
> > > 
> > 
> > Could you clarify games? I presume the intention is to redefine the scope
> > of the lock based on whether core sched is enabled or not? I presume patch
> > 4/32 has the details.
> 
> Your line wrapping broke, I fixed it.
>

Sorry, I've been using thunderbird from time to time and even though
I set the options specified in the Documentation (email-clients), it's
not working as expected.

> That is in fact the game. By wrapping it, the nature of the locking is
> dynamic based on whether core sched is enabled or not (both statically and
> dynamically).
>

My point was that the word 'games' does not do justice to the change; some
details around how this abstraction helps, based on the (re)definition of rq
with coresched, might help.

Balbir Singh.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 04/32] sched: Core-wide rq->lock
  2020-11-17 23:19 ` [PATCH -tip 04/32] sched: Core-wide rq->lock Joel Fernandes (Google)
@ 2020-11-22  9:11   ` Balbir Singh
  2020-11-24  8:16     ` Peter Zijlstra
  0 siblings, 1 reply; 150+ messages in thread
From: Balbir Singh @ 2020-11-22  9:11 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:34PM -0500, Joel Fernandes (Google) wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Introduce the basic infrastructure to have a core wide rq->lock.
>

Reading through the patch, it seems like all the CPUs have to be
running with core sched either enabled or disabled. Is it possible to have
some cores with core sched disabled? I don't see a strong use case for it,
but I am wondering if the design will fall apart if that assumption is
broken.

Balbir Singh


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 08/32] sched/fair: Fix forced idle sibling starvation corner case
  2020-11-17 23:19 ` [PATCH -tip 08/32] sched/fair: Fix forced idle sibling starvation corner case Joel Fernandes (Google)
@ 2020-11-22 10:35   ` Balbir Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Balbir Singh @ 2020-11-22 10:35 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:38PM -0500, Joel Fernandes (Google) wrote:
> From: Vineeth Pillai <viremana@linux.microsoft.com>
> 
> If there is only one long running local task and the sibling is
> forced idle, it might not get a chance to run until a schedule
> event happens on any cpu in the core.
> 
> So we check for this condition during a tick to see if a sibling
> is starved and then give it a chance to schedule.
> 
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/core.c  | 15 ++++++++-------
>  kernel/sched/fair.c  | 40 ++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h |  2 +-
>  3 files changed, 49 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1bd0b0bbb040..52d0e83072a4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5206,16 +5206,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  
>  	/* reset state */
>  	rq->core->core_cookie = 0UL;
> +	if (rq->core->core_forceidle) {
> +		need_sync = true;
> +		rq->core->core_forceidle = false;
> +	}
>  	for_each_cpu(i, smt_mask) {
>  		struct rq *rq_i = cpu_rq(i);
>  
>  		rq_i->core_pick = NULL;
>  
> -		if (rq_i->core_forceidle) {
> -			need_sync = true;
> -			rq_i->core_forceidle = false;
> -		}
> -
>  		if (i != cpu)
>  			update_rq_clock(rq_i);
>  	}
> @@ -5335,8 +5334,10 @@ next_class:;
>  		if (!rq_i->core_pick)
>  			continue;
>  
> -		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running)
> -			rq_i->core_forceidle = true;
> +		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
> +		    !rq_i->core->core_forceidle) {
> +			rq_i->core->core_forceidle = true;
> +		}
>  
>  		if (i == cpu) {
>  			rq_i->core_pick = NULL;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f53681cd263e..42965c4fd71f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10692,6 +10692,44 @@ static void rq_offline_fair(struct rq *rq)
>  
>  #endif /* CONFIG_SMP */
>  
> +#ifdef CONFIG_SCHED_CORE
> +static inline bool
> +__entity_slice_used(struct sched_entity *se, int min_nr_tasks)
> +{
> +	u64 slice = sched_slice(cfs_rq_of(se), se);

I wonder if the definition of sched_slice() should be revisited for core
scheduling?

Should we use sched_slice = sched_slice / cpumask_weight(smt_mask)?
Would that resolve the issue you're seeing? Effectively, we need to answer
whether two sched core siblings should be treated as executing one large
slice.
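For illustration only, the split suggested above is simple integer
arithmetic; the helper name below is hypothetical and not part of the
series.

```c
#include <assert.h>

/* Split a slice (in ns) evenly across the SMT siblings of a core, per
 * the suggestion above; a zero weight leaves the slice unchanged. */
static unsigned long long per_sibling_slice(unsigned long long slice_ns,
					    unsigned int smt_weight)
{
	return smt_weight ? slice_ns / smt_weight : slice_ns;
}
```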

Balbir Singh.



^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 09/32] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-11-17 23:19 ` [PATCH -tip 09/32] sched/fair: Snapshot the min_vruntime of CPUs on force idle Joel Fernandes (Google)
@ 2020-11-22 11:44   ` Balbir Singh
  2020-11-23 12:31     ` Vineeth Pillai
  0 siblings, 1 reply; 150+ messages in thread
From: Balbir Singh @ 2020-11-22 11:44 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:39PM -0500, Joel Fernandes (Google) wrote:
> During force-idle, we end up doing cross-cpu comparison of vruntimes
> during pick_next_task. If we simply compare (vruntime-min_vruntime)
> across CPUs, and if the CPUs only have 1 task each, we will always
> end up comparing 0 with 0 and pick just one of the tasks all the time.
> This starves the task that was not picked. To fix this, take a snapshot
> of the min_vruntime when entering force idle and use it for comparison.
> This min_vruntime snapshot will only be used for cross-CPU vruntime
> comparison, and nothing else.
> 
> This resolves several performance issues that were seen in ChromeOS
> audio usecase.
> 
> NOTE: Note, this patch will be improved in a later patch. It is just
>       kept here as the basis for the later patch and to make rebasing
>       easier. Further, it may make reverting the improvement easier in
>       case the improvement causes any regression.
>

This seems cumbersome; is there no way to track the min_vruntime via
rq->core->min_vruntime?
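The comparison problem the quoted changelog describes can be sketched with
toy types. These structs are invented for illustration; in the series the
snapshot lives in cfs_rq->min_vruntime_fi.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-cpu state: one task's vruntime, the queue's min_vruntime, and
 * a snapshot (min_vruntime_fi) taken when force idle begins. */
struct cpu_state {
	unsigned long vruntime;
	unsigned long min_vruntime;
	unsigned long min_vruntime_fi;
};

/* Naive cross-cpu comparison: with one task per cpu, vruntime equals
 * min_vruntime, so both sides normalize to 0 and the tie always breaks
 * the same way, starving one task. */
static bool a_before_b_naive(struct cpu_state *a, struct cpu_state *b)
{
	return (a->vruntime - a->min_vruntime) <
	       (b->vruntime - b->min_vruntime);
}

/* Snapshot-based comparison: normalize against the value captured at
 * force-idle entry, so progress made since then remains visible. */
static bool a_before_b_snapshot(struct cpu_state *a, struct cpu_state *b)
{
	return (a->vruntime - a->min_vruntime_fi) <
	       (b->vruntime - b->min_vruntime_fi);
}
```

With the snapshot, the task that has run less since force idle began wins
the comparison instead of losing a permanent 0-versus-0 tie.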

Balbir Singh.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling
  2020-11-17 23:19 ` [PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling Joel Fernandes (Google)
@ 2020-11-22 22:41   ` Balbir Singh
  2020-11-24 18:30     ` Joel Fernandes
  0 siblings, 1 reply; 150+ messages in thread
From: Balbir Singh @ 2020-11-22 22:41 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:40PM -0500, Joel Fernandes (Google) wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> The rationale is as follows. In the core-wide pick logic, even if
> need_sync == false, we need to go look at other CPUs (non-local CPUs) to
> see if they could be running RT.
> 
> Say the RQs in a particular core look like this:
> Let CFS1 and CFS2 be two tagged CFS tasks. Let RT1 be an untagged RT task.
> 
> rq0            rq1
> CFS1 (tagged)  RT1 (not tag)
> CFS2 (tagged)
> 
> Say schedule() runs on rq0. Now, it will enter the above loop and
> pick_task(RT) will return NULL for 'p'. It will enter the above if() block
> and see that need_sync == false and will skip RT entirely.
> 
> The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> rq0             rq1
> CFS1            IDLE
> 
> When it should have selected:
> rq0             rq1
> IDLE            RT
> 
> Joel saw this issue in real-world use cases on ChromeOS, where an RT task
> gets constantly force-idled and breaks RT. Let's cure it.
> 
> NOTE: This problem will be fixed differently in a later patch. It is
>       just kept here for reference purposes about this issue, and to
>       make applying later patches easier.
>

The changelog is hard to read; it refers to an "above if()" block, but there
is no code snippet in the changelog. Also, from what I can see following
the series, p->core_cookie is not yet set anywhere (unless I missed it),
so fixing this here does not make sense when just reading the series.

Balbir

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-17 23:19 ` [PATCH -tip 14/32] sched: migration changes for core scheduling Joel Fernandes (Google)
@ 2020-11-22 23:54   ` Balbir Singh
  2020-11-23  4:36     ` Li, Aubrey
  2020-11-30 10:35   ` Vincent Guittot
  1 sibling, 1 reply; 150+ messages in thread
From: Balbir Singh @ 2020-11-22 23:54 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:44PM -0500, Joel Fernandes (Google) wrote:
> From: Aubrey Li <aubrey.li@intel.com>
> 
>  - Don't migrate if there is a cookie mismatch
>      Load balance tries to move task from busiest CPU to the
>      destination CPU. When core scheduling is enabled, if the
>      task's cookie does not match with the destination CPU's
>      core cookie, this task will be skipped by this CPU. This
>      mitigates the forced idle time on the destination CPU.
> 
>  - Select cookie matched idle CPU
>      In the fast path of task wakeup, select the first cookie matched
>      idle CPU instead of the first idle CPU.
> 
>  - Find cookie matched idlest CPU
>      In the slow path of task wakeup, find the idlest CPU whose core
>      cookie matches with task's cookie
> 
>  - Don't migrate task if cookie not match
>      For the NUMA load balance, don't migrate task to the CPU whose
>      core cookie does not match with task's cookie
> 
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/fair.c  | 64 ++++++++++++++++++++++++++++++++++++++++----
>  kernel/sched/sched.h | 29 ++++++++++++++++++++
>  2 files changed, 88 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index de82f88ba98c..ceb3906c9a8a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1921,6 +1921,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>  			continue;
>  
> +#ifdef CONFIG_SCHED_CORE
> +		/*
> +		 * Skip this cpu if source task's cookie does not match
> +		 * with CPU's core cookie.
> +		 */
> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
> +			continue;
> +#endif
> +

Any reason this is under an #ifdef? In sched_core_cookie_match() won't
the check for sched_core_enabled() do the right thing even when
CONFIG_SCHED_CORE is not enabled?

>  		env->dst_cpu = cpu;
>  		if (task_numa_compare(env, taskimp, groupimp, maymove))
>  			break;
> @@ -5867,11 +5876,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>  
>  	/* Traverse only the allowed CPUs */
>  	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
> +		struct rq *rq = cpu_rq(i);
> +
> +#ifdef CONFIG_SCHED_CORE
> +		if (!sched_core_cookie_match(rq, p))
> +			continue;
> +#endif
> +
>  		if (sched_idle_cpu(i))
>  			return i;
>  
>  		if (available_idle_cpu(i)) {
> -			struct rq *rq = cpu_rq(i);
>  			struct cpuidle_state *idle = idle_get_state(rq);
>  			if (idle && idle->exit_latency < min_exit_latency) {
>  				/*
> @@ -6129,8 +6144,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  	for_each_cpu_wrap(cpu, cpus, target) {
>  		if (!--nr)
>  			return -1;
> -		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> -			break;
> +
> +		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
> +#ifdef CONFIG_SCHED_CORE
> +			/*
> +			 * If Core Scheduling is enabled, select this cpu
> +			 * only if the process cookie matches core cookie.
> +			 */
> +			if (sched_core_enabled(cpu_rq(cpu)) &&
> +			    p->core_cookie == cpu_rq(cpu)->core->core_cookie)
> +#endif
> +				break;
> +		}
>  	}
>  
>  	time = cpu_clock(this) - time;
> @@ -7530,8 +7555,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	 * We do not migrate tasks that are:
>  	 * 1) throttled_lb_pair, or
>  	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
> -	 * 3) running (obviously), or
> -	 * 4) are cache-hot on their current CPU.
> +	 * 3) task's cookie does not match with this CPU's core cookie
> +	 * 4) running (obviously), or
> +	 * 5) are cache-hot on their current CPU.
>  	 */
>  	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>  		return 0;
> @@ -7566,6 +7592,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  		return 0;
>  	}
>  
> +#ifdef CONFIG_SCHED_CORE
> +	/*
> +	 * Don't migrate task if the task's cookie does not match
> +	 * with the destination CPU's core cookie.
> +	 */
> +	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
> +		return 0;
> +#endif
> +
>  	/* Record that we found atleast one task that could run on dst_cpu */
>  	env->flags &= ~LBF_ALL_PINNED;
>  
> @@ -8792,6 +8827,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  					p->cpus_ptr))
>  			continue;
>  
> +#ifdef CONFIG_SCHED_CORE
> +		if (sched_core_enabled(cpu_rq(this_cpu))) {
> +			int i = 0;
> +			bool cookie_match = false;
> +
> +			for_each_cpu(i, sched_group_span(group)) {
> +				struct rq *rq = cpu_rq(i);
> +
> +				if (sched_core_cookie_match(rq, p)) {
> +					cookie_match = true;
> +					break;
> +				}
> +			}
> +			/* Skip over this group if no cookie matched */
> +			if (!cookie_match)
> +				continue;
> +		}
> +#endif
> +
>  		local_group = cpumask_test_cpu(this_cpu,
>  					       sched_group_span(group));
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e72942a9ee11..de553d39aa40 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1135,6 +1135,35 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
>  
>  bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
>  
> +/*
> + * Helper to check if the CPU's core cookie matches with the task's cookie
> + * when core scheduling is enabled.
> + * A special case is that the task's cookie always matches with CPU's core
> + * cookie if the CPU is in an idle core.
> + */
> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
> +{
> +	bool idle_core = true;
> +	int cpu;
> +
> +	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
> +	if (!sched_core_enabled(rq))
> +		return true;
> +
> +	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
> +		if (!available_idle_cpu(cpu)) {

I was looking at this snippet and comparing this to is_core_idle(), the
major difference is the check for vcpu_is_preempted(). Do we want to
call the core as non idle if any vcpu was preempted on this CPU?

> +			idle_core = false;
> +			break;
> +		}
> +	}
> +
> +	/*
> +	 * A CPU in an idle core is always the best choice for tasks with
> +	 * cookies.
> +	 */
> +	return idle_core || rq->core->core_cookie == p->core_cookie;
> +}
> +

Balbir Singh.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-22 23:54   ` Balbir Singh
@ 2020-11-23  4:36     ` Li, Aubrey
  2020-11-24 15:42       ` Peter Zijlstra
  0 siblings, 1 reply; 150+ messages in thread
From: Li, Aubrey @ 2020-11-23  4:36 UTC (permalink / raw)
  To: Balbir Singh, Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On 2020/11/23 7:54, Balbir Singh wrote:
> On Tue, Nov 17, 2020 at 06:19:44PM -0500, Joel Fernandes (Google) wrote:
>> From: Aubrey Li <aubrey.li@intel.com>
>>
>>  - Don't migrate if there is a cookie mismatch
>>      Load balance tries to move task from busiest CPU to the
>>      destination CPU. When core scheduling is enabled, if the
>>      task's cookie does not match with the destination CPU's
>>      core cookie, this task will be skipped by this CPU. This
>>      mitigates the forced idle time on the destination CPU.
>>
>>  - Select cookie matched idle CPU
>>      In the fast path of task wakeup, select the first cookie matched
>>      idle CPU instead of the first idle CPU.
>>
>>  - Find cookie matched idlest CPU
>>      In the slow path of task wakeup, find the idlest CPU whose core
>>      cookie matches with task's cookie
>>
>>  - Don't migrate task if cookie not match
>>      For the NUMA load balance, don't migrate task to the CPU whose
>>      core cookie does not match with task's cookie
>>
>> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>>  kernel/sched/fair.c  | 64 ++++++++++++++++++++++++++++++++++++++++----
>>  kernel/sched/sched.h | 29 ++++++++++++++++++++
>>  2 files changed, 88 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index de82f88ba98c..ceb3906c9a8a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1921,6 +1921,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>  		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>  			continue;
>>  
>> +#ifdef CONFIG_SCHED_CORE
>> +		/*
>> +		 * Skip this cpu if source task's cookie does not match
>> +		 * with CPU's core cookie.
>> +		 */
>> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>> +			continue;
>> +#endif
>> +
> 
> Any reason this is under an #ifdef? In sched_core_cookie_match() won't
> the check for sched_core_enabled() do the right thing even when
> CONFIG_SCHED_CORE is not enabled?
Yes, sched_core_enabled() works properly when CONFIG_SCHED_CORE is not
enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
sense to leave a core-scheduler-specific function call here, even at
compile time. Also, in the hot paths, this saves CPU cycles by avoiding
the extra branch.


>>  		env->dst_cpu = cpu;
>>  		if (task_numa_compare(env, taskimp, groupimp, maymove))
>>  			break;
>> @@ -5867,11 +5876,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>  
>>  	/* Traverse only the allowed CPUs */
>>  	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>> +		struct rq *rq = cpu_rq(i);
>> +
>> +#ifdef CONFIG_SCHED_CORE
>> +		if (!sched_core_cookie_match(rq, p))
>> +			continue;
>> +#endif
>> +
>>  		if (sched_idle_cpu(i))
>>  			return i;
>>  
>>  		if (available_idle_cpu(i)) {
>> -			struct rq *rq = cpu_rq(i);
>>  			struct cpuidle_state *idle = idle_get_state(rq);
>>  			if (idle && idle->exit_latency < min_exit_latency) {
>>  				/*
>> @@ -6129,8 +6144,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>  	for_each_cpu_wrap(cpu, cpus, target) {
>>  		if (!--nr)
>>  			return -1;
>> -		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>> -			break;
>> +
>> +		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>> +#ifdef CONFIG_SCHED_CORE
>> +			/*
>> +			 * If Core Scheduling is enabled, select this cpu
>> +			 * only if the process cookie matches core cookie.
>> +			 */
>> +			if (sched_core_enabled(cpu_rq(cpu)) &&
>> +			    p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>> +#endif
>> +				break;
>> +		}
>>  	}
>>  
>>  	time = cpu_clock(this) - time;
>> @@ -7530,8 +7555,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>  	 * We do not migrate tasks that are:
>>  	 * 1) throttled_lb_pair, or
>>  	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
>> -	 * 3) running (obviously), or
>> -	 * 4) are cache-hot on their current CPU.
>> +	 * 3) task's cookie does not match with this CPU's core cookie
>> +	 * 4) running (obviously), or
>> +	 * 5) are cache-hot on their current CPU.
>>  	 */
>>  	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>>  		return 0;
>> @@ -7566,6 +7592,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>  		return 0;
>>  	}
>>  
>> +#ifdef CONFIG_SCHED_CORE
>> +	/*
>> +	 * Don't migrate task if the task's cookie does not match
>> +	 * with the destination CPU's core cookie.
>> +	 */
>> +	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
>> +		return 0;
>> +#endif
>> +
>>  	/* Record that we found atleast one task that could run on dst_cpu */
>>  	env->flags &= ~LBF_ALL_PINNED;
>>  
>> @@ -8792,6 +8827,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>  					p->cpus_ptr))
>>  			continue;
>>  
>> +#ifdef CONFIG_SCHED_CORE
>> +		if (sched_core_enabled(cpu_rq(this_cpu))) {
>> +			int i = 0;
>> +			bool cookie_match = false;
>> +
>> +			for_each_cpu(i, sched_group_span(group)) {
>> +				struct rq *rq = cpu_rq(i);
>> +
>> +				if (sched_core_cookie_match(rq, p)) {
>> +					cookie_match = true;
>> +					break;
>> +				}
>> +			}
>> +			/* Skip over this group if no cookie matched */
>> +			if (!cookie_match)
>> +				continue;
>> +		}
>> +#endif
>> +
>>  		local_group = cpumask_test_cpu(this_cpu,
>>  					       sched_group_span(group));
>>  
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index e72942a9ee11..de553d39aa40 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1135,6 +1135,35 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
>>  
>>  bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
>>  
>> +/*
>> + * Helper to check if the CPU's core cookie matches with the task's cookie
>> + * when core scheduling is enabled.
>> + * A special case is that the task's cookie always matches with CPU's core
>> + * cookie if the CPU is in an idle core.
>> + */
>> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
>> +{
>> +	bool idle_core = true;
>> +	int cpu;
>> +
>> +	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
>> +	if (!sched_core_enabled(rq))
>> +		return true;
>> +
>> +	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
>> +		if (!available_idle_cpu(cpu)) {
> 
> I was looking at this snippet and comparing this to is_core_idle(), the
> major difference is the check for vcpu_is_preempted(). Do we want to
> call the core as non idle if any vcpu was preempted on this CPU?

Yes. If a vCPU was preempted on this CPU, it is better not to place a
task on this core, as the vCPU may be holding a spinlock and wants to be
executed again as soon as possible.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 13/32] sched: Trivial forced-newidle balancer
  2020-11-17 23:19 ` [PATCH -tip 13/32] sched: Trivial forced-newidle balancer Joel Fernandes (Google)
@ 2020-11-23  4:38   ` Balbir Singh
  2020-11-23 15:07     ` Li, Aubrey
  0 siblings, 1 reply; 150+ messages in thread
From: Balbir Singh @ 2020-11-23  4:38 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Paul E . McKenney,
	Aubrey Li, Tim Chen

On Tue, Nov 17, 2020 at 06:19:43PM -0500, Joel Fernandes (Google) wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> When a sibling is forced-idle to match the core-cookie; search for
> matching tasks to fill the core.
> 
> rcu_read_unlock() can incur an infrequent deadlock in
> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
> 
...
> +
> +		if (p->core_occupation > dst->idle->core_occupation)
> +			goto next;
> +

I am unable to understand this check, a comment or clarification in the
changelog will help. I presume we are looking at either one or two cpus
to define the core_occupation and we expect to match it against the
destination CPU.

Balbir Singh.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 17/32] arch/x86: Add a new TIF flag for untrusted tasks
  2020-11-17 23:19 ` [PATCH -tip 17/32] arch/x86: Add a new TIF flag for untrusted tasks Joel Fernandes (Google)
@ 2020-11-23  5:18   ` Balbir Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Balbir Singh @ 2020-11-23  5:18 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:47PM -0500, Joel Fernandes (Google) wrote:
> Add a new TIF flag to indicate whether the kernel needs to be careful
> and take additional steps to mitigate micro-architectural issues during
> entry into user or guest mode.
> 
> This new flag will be used by the series to determine if waiting is
> needed or not, during exit to user or guest mode.
> 
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Reviewed-by: Aubrey Li <aubrey.intel@gmail.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---

Acked-by: Balbir Singh <bsingharora@gmail.com>

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 09/32] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-11-22 11:44   ` Balbir Singh
@ 2020-11-23 12:31     ` Vineeth Pillai
  2020-11-23 23:31       ` Balbir Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Vineeth Pillai @ 2020-11-23 12:31 UTC (permalink / raw)
  To: Balbir Singh, Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

Hi Balbir,

On 11/22/20 6:44 AM, Balbir Singh wrote:
>
> This seems cumbersome, is there no way to track the min_vruntime via
> rq->core->min_vruntime?
Do you mean to have a core wide min_vruntime? We had a
similar approach from v3 to v7 and it had major issues which
broke the assumptions of cfs. There were some lengthy
discussions and Peter explained in-depth about the issues:

https://lwn.net/ml/linux-kernel/20200506143506.GH5298@hirez.programming.kicks-ass.net/
https://lwn.net/ml/linux-kernel/20200515103844.GG2978@hirez.programming.kicks-ass.net/

Thanks,
Vineeth


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 13/32] sched: Trivial forced-newidle balancer
  2020-11-23  4:38   ` Balbir Singh
@ 2020-11-23 15:07     ` Li, Aubrey
  2020-11-23 23:35       ` Balbir Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Li, Aubrey @ 2020-11-23 15:07 UTC (permalink / raw)
  To: Balbir Singh, Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Paul E . McKenney,
	Tim Chen

On 2020/11/23 12:38, Balbir Singh wrote:
> On Tue, Nov 17, 2020 at 06:19:43PM -0500, Joel Fernandes (Google) wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>>
>> When a sibling is forced-idle to match the core-cookie; search for
>> matching tasks to fill the core.
>>
>> rcu_read_unlock() can incur an infrequent deadlock in
>> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
>>
> ...
>> +
>> +		if (p->core_occupation > dst->idle->core_occupation)
>> +			goto next;
>> +
> 
> I am unable to understand this check, a comment or clarification in the
> changelog will help. I presume we are looking at either one or two cpus
> to define the core_occupation and we expect to match it against the
> destination CPU.

IIUC, this check prevents a task from bouncing between cores forever.

For example, on a SMT2 platform:
- core0 runs taskA and taskB, core_occupation is 2
- core1 runs taskC, core_occupation is 1

Without this check, taskB could ping-pong between core0 and core1 via core
load balancing.

Thanks,
-Aubrey



^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 03/32] sched/fair: Fix pick_task_fair crashes due to empty rbtree
  2020-11-20 18:11     ` Vineeth Pillai
@ 2020-11-23 22:31       ` Balbir Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Balbir Singh @ 2020-11-23 22:31 UTC (permalink / raw)
  To: Vineeth Pillai
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Fri, Nov 20, 2020 at 01:11:06PM -0500, Vineeth Pillai wrote:
> Hi Balbir,
> 
> On 11/20/20 5:15 AM, Singh, Balbir wrote:
> > On 18/11/20 10:19 am, Joel Fernandes (Google) wrote:
> > > From: Peter Zijlstra <peterz@infradead.org>
> > > 
> > > pick_next_entity() is passed curr == NULL during core-scheduling. Due to
> > > this, if the rbtree is empty, the 'left' variable is set to NULL within
> > > the function. This can cause crashes within the function.
> > > 
> > > This is not an issue if put_prev_task() is invoked on the currently
> > > running task before calling pick_next_entity(). However, in core
> > > scheduling, it is possible that a sibling CPU picks for another RQ in
> > > the core, via pick_task_fair(). This remote sibling would not get any
> > > opportunities to do a put_prev_task().
> > > 
> > > Fix it by refactoring pick_task_fair() such that pick_next_entity() is
> > > called with the cfs_rq->curr. This will prevent pick_next_entity() from
> > > crashing if its rbtree is empty.
> > > 
> > > Also this fixes another possible bug where update_curr() would not be
> > > called on the cfs_rq hierarchy if the rbtree is empty. This could affect
> > > cross-cpu comparison of vruntime.
> > > 
> > It is not clear from the changelog what put_prev_task() does to prevent
> > the crash from occurring. Why did we pass NULL as curr to
> > pick_next_entity() in the first place?
> A little more context on this crash in v8 is here:
> https://lwn.net/ml/linux-kernel/8230ada7-839f-2335-9a55-b09f6a813e91@linux.microsoft.com/
> 
> The issue here arises from the fact that we try to pick a task for a
> sibling while the sibling is running a task. Running tasks are not in the
> cfs_rq and pick_next_entity can return NULL if there is only one cfs
> task in the cfs_rq. This would not happen normally because
> put_prev_task is called before pick_task and put_prev_task adds the
> task back to cfs_rq. But for coresched, pick_task is called on a
> remote sibling's cfs_rq without calling put_prev_task and this can
> lead to pick_next_entity returning NULL.
> 
> The initial logic of passing NULL would work fine as long as we call
> put_prev_task before calling pick_task_fair. But for coresched, we
> call pick_task_fair on siblings while the task is running and would
> not be able to call put_prev_task. So this refactor of the code fixes
> the crash by explicitly passing curr.
> 
> Hope this clarifies..
>

Yes, it does!

Thanks,
Balbir Singh.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 09/32] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-11-23 12:31     ` Vineeth Pillai
@ 2020-11-23 23:31       ` Balbir Singh
  2020-11-24  9:09         ` Peter Zijlstra
  0 siblings, 1 reply; 150+ messages in thread
From: Balbir Singh @ 2020-11-23 23:31 UTC (permalink / raw)
  To: Vineeth Pillai
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Mon, Nov 23, 2020 at 07:31:31AM -0500, Vineeth Pillai wrote:
> Hi Balbir,
> 
> On 11/22/20 6:44 AM, Balbir Singh wrote:
> > 
> > This seems cumbersome, is there no way to track the min_vruntime via
> > rq->core->min_vruntime?
> Do you mean to have a core wide min_vruntime? We had a
> similar approach from v3 to v7 and it had major issues which
> broke the assumptions of cfs. There were some lengthy
> discussions and Peter explained in-depth about the issues:
> 
> https://lwn.net/ml/linux-kernel/20200506143506.GH5298@hirez.programming.kicks-ass.net/
> https://lwn.net/ml/linux-kernel/20200515103844.GG2978@hirez.programming.kicks-ass.net/
>

One of the equations in the link is

"From which immediately follows that:

          T_k + T_l
  S_k+l = ---------                                       (13)
          W_k + W_l

On which we can define a combined lag:

  lag_k+l(i) := S_k+l - s_i                               (14)

And that gives us the tools to compare tasks across a combined runqueue.
"

S_k+l reads like rq->core->vruntime, but it sounds like this equivalent
of rq->core->vruntime is only updated when we enter forced idle, as
opposed to all the time.

Balbir

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 13/32] sched: Trivial forced-newidle balancer
  2020-11-23 15:07     ` Li, Aubrey
@ 2020-11-23 23:35       ` Balbir Singh
  2020-11-24  0:32         ` Li, Aubrey
  0 siblings, 1 reply; 150+ messages in thread
From: Balbir Singh @ 2020-11-23 23:35 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Paul E . McKenney,
	Tim Chen

On Mon, Nov 23, 2020 at 11:07:27PM +0800, Li, Aubrey wrote:
> On 2020/11/23 12:38, Balbir Singh wrote:
> > On Tue, Nov 17, 2020 at 06:19:43PM -0500, Joel Fernandes (Google) wrote:
> >> From: Peter Zijlstra <peterz@infradead.org>
> >>
> >> When a sibling is forced-idle to match the core-cookie; search for
> >> matching tasks to fill the core.
> >>
> >> rcu_read_unlock() can incur an infrequent deadlock in
> >> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
> >>
> > ...
> >> +
> >> +		if (p->core_occupation > dst->idle->core_occupation)
> >> +			goto next;
> >> +
> > 
> > I am unable to understand this check, a comment or clarification in the
> > changelog will help. I presume we are looking at either one or two cpus
> > to define the core_occupation and we expect to match it against the
> > destination CPU.
> 
> IIUC, this check prevents a task from jumping among the cores forever.
> 
> For example, on a SMT2 platform:
> - core0 runs taskA and taskB, core_occupation is 2
> - core1 runs taskC, core_occupation is 1
> 
> Without this check, taskB could ping-pong between core0 and core1 by core load
> balance.

But the comparison is p->core_occupation (as in the task's core occupation,
not sure what that means; can a task have a core_occupation of > 1?)

Balbir Singh.


* Re: [PATCH -tip 13/32] sched: Trivial forced-newidle balancer
  2020-11-23 23:35       ` Balbir Singh
@ 2020-11-24  0:32         ` Li, Aubrey
  2020-11-25 21:28           ` Balbir Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Li, Aubrey @ 2020-11-24  0:32 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Paul E . McKenney,
	Tim Chen

On 2020/11/24 7:35, Balbir Singh wrote:
> On Mon, Nov 23, 2020 at 11:07:27PM +0800, Li, Aubrey wrote:
>> On 2020/11/23 12:38, Balbir Singh wrote:
>>> On Tue, Nov 17, 2020 at 06:19:43PM -0500, Joel Fernandes (Google) wrote:
>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>
>>>> When a sibling is forced-idle to match the core-cookie; search for
>>>> matching tasks to fill the core.
>>>>
>>>> rcu_read_unlock() can incur an infrequent deadlock in
>>>> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
>>>>
>>> ...
>>>> +
>>>> +		if (p->core_occupation > dst->idle->core_occupation)
>>>> +			goto next;
>>>> +
>>>
>>> I am unable to understand this check, a comment or clarification in the
>>> changelog will help. I presume we are looking at either one or two cpus
>>> to define the core_occupation and we expect to match it against the
>>> destination CPU.
>>
>> IIUC, this check prevents a task from jumping among the cores forever.
>>
>> For example, on a SMT2 platform:
>> - core0 runs taskA and taskB, core_occupation is 2
>> - core1 runs taskC, core_occupation is 1
>>
>> Without this check, taskB could ping-pong between core0 and core1 by core load
>> balance.
> 
> But the comparison is p->core_occupation (as in the task's core occupation,
> not sure what that means; can a task have a core_occupation of > 1?)
>

p->core_occupation is assigned the core's occupation in the last pick_next_task.
(so yes, it can have a core_occupation > 1).

Thanks,
-Aubrey


* Re: [PATCH -tip 04/32] sched: Core-wide rq->lock
  2020-11-22  9:11   ` Balbir Singh
@ 2020-11-24  8:16     ` Peter Zijlstra
  2020-11-26  0:35       ` Balbir Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-24  8:16 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Sun, Nov 22, 2020 at 08:11:52PM +1100, Balbir Singh wrote:
> On Tue, Nov 17, 2020 at 06:19:34PM -0500, Joel Fernandes (Google) wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > Introduce the basic infrastructure to have a core wide rq->lock.
> >
> 
> Reading through the patch, it seems like all the CPUs have to be
> running with sched core enabled/disabled? Is it possible to have some
> cores with core sched disabled?

Yep, patch even says so:

 + * XXX entirely possible to selectively enable cores, don't bother for now.

> I don't see a strong use case for it,
> but I am wondering if the design will fall apart if that assumption is
> broken?

The use-case I have is not using stop-machine. That is, stopping a whole
core at a time, instead of the whole sodding machine. It's on the todo
list *somewhere*....




* Re: [PATCH -tip 03/32] sched/fair: Fix pick_task_fair crashes due to empty rbtree
  2020-11-20 10:15   ` Singh, Balbir
  2020-11-20 18:11     ` Vineeth Pillai
@ 2020-11-24  8:31     ` Peter Zijlstra
  1 sibling, 0 replies; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-24  8:31 UTC (permalink / raw)
  To: Singh, Balbir
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Fri, Nov 20, 2020 at 09:15:28PM +1100, Singh, Balbir wrote:
> On 18/11/20 10:19 am, Joel Fernandes (Google) wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > pick_next_entity() is passed curr == NULL during core-scheduling. Due to
> > this, if the rbtree is empty, the 'left' variable is set to NULL within
> > the function. This can cause crashes within the function.
> > 
> > This is not an issue if put_prev_task() is invoked on the currently
> > running task before calling pick_next_entity(). However, in core
> > scheduling, it is possible that a sibling CPU picks for another RQ in
> > the core, via pick_task_fair(). This remote sibling would not get any
> > opportunities to do a put_prev_task().
> > 
> > Fix it by refactoring pick_task_fair() such that pick_next_entity() is
> > called with the cfs_rq->curr. This will prevent pick_next_entity() from
> > crashing if its rbtree is empty.
> > 
> > Also this fixes another possible bug where update_curr() would not be
> > called on the cfs_rq hierarchy if the rbtree is empty. This could affect
> > cross-cpu comparison of vruntime.
> > 
> 
> It is not clear from the changelog what put_prev_task() does to prevent
> the crash from occurring. Why did we pass NULL as curr in the first place to
> pick_next_entity?
> 
> The patch looks functionally correct as in, it passes curr as the reference
> to pick_next_entity() for left and entity_before comparisons.

This patch should not exist; it should be smashed into the previous
patch. There is no point in preserving a crash.


* Re: [PATCH -tip 09/32] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-11-23 23:31       ` Balbir Singh
@ 2020-11-24  9:09         ` Peter Zijlstra
  2020-11-25 23:17           ` Balbir Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-24  9:09 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Vineeth Pillai, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Aaron Lu,
	Aubrey Li, tglx, linux-kernel, mingo, torvalds, fweisbec,
	keescook, kerrnel, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	dfaggioli, pjt, rostedt, derkling, benbjiang, Alexandre Chartre,
	James.Bottomley, OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes,
	chris.hyser, Ben Segall, Josh Don, Hao Luo, Tom Lendacky,
	Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 24, 2020 at 10:31:49AM +1100, Balbir Singh wrote:
> On Mon, Nov 23, 2020 at 07:31:31AM -0500, Vineeth Pillai wrote:
> > Hi Balbir,
> > 
> > On 11/22/20 6:44 AM, Balbir Singh wrote:
> > > 
> > > This seems cumbersome, is there no way to track the min_vruntime via
> > > rq->core->min_vruntime?
> > Do you mean to have a core wide min_vruntime? We had a
> > similar approach from v3 to v7 and it had major issues which
> > broke the assumptions of cfs. There were some lengthy
> > discussions and Peter explained in-depth about the issues:
> > 
> > https://lwn.net/ml/linux-kernel/20200506143506.GH5298@hirez.programming.kicks-ass.net/
> > https://lwn.net/ml/linux-kernel/20200515103844.GG2978@hirez.programming.kicks-ass.net/
> >
> 
> One of the equations in the link is
> 
> ">From which immediately follows that:
> 
>           T_k + T_l
>   S_k+l = ---------                                       (13)
>           W_k + W_l
> 
> On which we can define a combined lag:
> 
>   lag_k+l(i) := S_k+l - s_i                               (14)
> 
> And that gives us the tools to compare tasks across a combined runqueue.
> "
> 
> S_k+l reads like rq->core->vruntime, but it sounds like the equivalent
> of rq->core->vruntime is updated when we enter forced idle as opposed to
> all the time.

Yes, but actually computing and maintaining it is hella hard. Try it
with the very first example in that email (the infeasible weight
scenario) and tell me how it works for you ;-)

Also note that the text below (6) mentions dynamic, then look up the
EEVDF paper which describes some of the dynamics -- the paper is
incomplete and contains a bug, I forget if it ever got updated or if
there's another paper that completes it (the BFQ I/O scheduler started
from that and fixed it).

I'm not saying it cannot be done, I'm just saying it is really rather
involved and probably not worth it.

The basic observation the current approach relies on is that all that
faffery basically boils down to the fact that vruntime only means
something when there is contention. And that only the progression is
important not the actual value. That is, this is all fundamentally a
differential equation and our integration constant is meaningless (also
embodied in (7)).

Also, I think the code as proposed here relies on SMT2 and is buggered
for SMT3+. Now, that second link above describes means of making SMT3+
work, but we're not there yet.


* Re: [PATCH -tip 15/32] sched: Improve snapshotting of min_vruntime for CGroups
  2020-11-17 23:19 ` [PATCH -tip 15/32] sched: Improve snapshotting of min_vruntime for CGroups Joel Fernandes (Google)
@ 2020-11-24 10:27   ` Peter Zijlstra
  2020-11-24 17:07     ` Joel Fernandes
  2020-11-24 10:41   ` Peter Zijlstra
  1 sibling, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-24 10:27 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:45PM -0500, Joel Fernandes (Google) wrote:
> A previous patch improved cross-cpu vruntime comparison operations in
> pick_next_task(). Improve it further for tasks in CGroups.
> 
> In particular, for cross-CPU comparisons, we were previously going to
> the root-level se(s) for both of the tasks being compared. That was strange.
> This patch instead finds the se(s) for both tasks that have the same
> parent (which may be different from root).
> 
> A note about the min_vruntime snapshot and force idling:
> Abbreviations: fi: force-idled now? ; fib: force-idled before?
> During selection:
> When we're not fi, we need to update snapshot.
> When we're fi and we were not fi, we must update snapshot.
> When we're fi and we were already fi, we must not update snapshot.
> 
> Which gives:
>         fib     fi      update?
>         0       0       1
>         0       1       1
>         1       0       1
>         1       1       0
> So the min_vruntime snapshot needs to be updated when: !(fib && fi).
> 
> Also, the cfs_prio_less() function needs to be aware of whether the core
> is in force idle or not, since it will use this information to know
> whether to advance a cfs_rq's min_vruntime_fi in the hierarchy. So pass
> this information along via pick_task() -> prio_less().

Hurmph.. so I'm tempted to smash a bunch of patches together.

 2 <- 3 (already done - bisection crashes are daft)
 6 <- 11
 7 <- {10, 12}
 9 <- 15

I'm thinking that would result in an easier to read series, or do we
want to preserve this history?

(fwiw, I pulled 15 before 13,14, as I think that makes more sense
anyway).

Hmm?


* Re: [PATCH -tip 15/32] sched: Improve snapshotting of min_vruntime for CGroups
  2020-11-17 23:19 ` [PATCH -tip 15/32] sched: Improve snapshotting of min_vruntime for CGroups Joel Fernandes (Google)
  2020-11-24 10:27   ` Peter Zijlstra
@ 2020-11-24 10:41   ` Peter Zijlstra
  1 sibling, 0 replies; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-24 10:41 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:45PM -0500, Joel Fernandes (Google) wrote:
> +static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forceidle)
>  {
> -	bool samecpu = task_cpu(a) == task_cpu(b);
> +	bool root = true;
> +	long old, new;

My compiler was not impressed by all those variable definitions.

> +
> +	for_each_sched_entity(se) {
> +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +		if (forceidle) {
> +			if (cfs_rq->forceidle_seq == fi_seq)
> +				break;
> +			cfs_rq->forceidle_seq = fi_seq;
> +		}
> +
> +		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
> +	}
> +}


* Re: [PATCH -tip 00/32] Core scheduling (v9)
  2020-11-17 23:19 [PATCH -tip 00/32] Core scheduling (v9) Joel Fernandes (Google)
                   ` (31 preceding siblings ...)
  2020-11-17 23:20 ` [PATCH -tip 32/32] sched: Debug bits Joel Fernandes (Google)
@ 2020-11-24 11:48 ` Vincent Guittot
  2020-11-24 15:08   ` Joel Fernandes
  32 siblings, 1 reply; 150+ messages in thread
From: Vincent Guittot @ 2020-11-24 11:48 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Gleixner,
	linux-kernel, Ingo Molnar, Linus Torvalds, Frederic Weisbecker,
	Kees Cook, kerrnel, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	dfaggioli, Paul Turner, Steven Rostedt, derkling, Jiang Biao,
	Alexandre Chartre, James Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, Hyser,Chris, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

Hi Joel,

On Wed, 18 Nov 2020 at 00:20, Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> Core-Scheduling
> ===============
> Enclosed is series v9 of core scheduling.
> v9 is rebased on tip/master (fe4adf6f92c4 ("Merge branch 'irq/core'"))..
> I hope that this version is acceptable to be merged (pending any new review

./scripts/get_maintainer.pl is quite useful to make sure that all
maintainers are CCed, and it helps a lot in getting reviews.

> comments that arise) as the main issues in the past are all resolved:
>  1. Vruntime comparison.
>  2. Documentation updates.
>  3. CGroup and per-task interface developed by Google and Oracle.
>  4. Hotplug fixes.
> Almost all patches also have Reviewed-by or Acked-by tag. See below for full
> list of changes in v9.
>
> Introduction of feature
> =======================
> Core scheduling is a feature that allows only trusted tasks to run
> concurrently on cpus sharing compute resources (eg: hyperthreads on a
> core). The goal is to mitigate the core-level side-channel attacks
> without requiring to disable SMT (which has a significant impact on
> performance in some situations). Core scheduling (as of v7) mitigates
> user-space to user-space attacks and user to kernel attack when one of
> the siblings enters the kernel via interrupts or system call.
>
> By default, the feature doesn't change any of the current scheduler
> behavior. The user decides which tasks can run simultaneously on the
> same core (for now by having them in the same tagged cgroup). When a tag
> is enabled in a cgroup and a task from that cgroup is running on a
> hardware thread, the scheduler ensures that only idle or trusted tasks
> run on the other sibling(s). Besides security concerns, this feature can
> also be beneficial for RT and performance applications where we want to
> control how tasks make use of SMT dynamically.
>
> Both a CGroup and Per-task interface via prctl(2) are provided for configuring
> core sharing. More details are provided in documentation patch.  Kselftests are
> provided to verify the correctness/rules of the interface.
>
> Testing
> =======
> ChromeOS testing shows 300% improvement in keypress latency on a Google
> docs key press with Google hangout test (the maximum latency drops from 150ms
> to 50ms for keypresses).
>
> Julien: TPCC tests showed improvements with core-scheduling as below. With kernel
> protection enabled, it does not show any regression. Possibly ASI will improve
> the performance for those who choose kernel protection (can be toggled through
> sched_core_protect_kernel sysctl).
>                                 average         stdev           diff
> baseline (SMT on)               1197.272        44.78312824
> core sched (   kernel protect)  412.9895        45.42734343     -65.51%
> core sched (no kernel protect)  686.6515        71.77756931     -42.65%
> nosmt                           408.667         39.39042872     -65.87%
> (Note these results are from v8).
>
> Vineeth tested sysbench and does not see any regressions.
> Hong and Aubrey tested v9 and see results similar to v8. There is a known issue
> with uperf that does regress. This appears to be because of ksoftirq heavily
> contending with other tasks on the core. The consensus is this can be improved
> in the future.
>
> Changes in v9
> =============
> - Note that the vruntime snapshot change is written in 2 patches to show the
>   progression of the idea and prevent merge conflicts:
>     sched/fair: Snapshot the min_vruntime of CPUs on force idle
>     sched: Improve snapshotting of min_vruntime for CGroups
>   Same with the RT priority inversion change:
>     sched: Fix priority inversion of cookied task with sibling
>     sched: Improve snapshotting of min_vruntime for CGroups
> - Disable coresched on certain AMD HW.
>
> Changes in v8
> =============
> - New interface/API implementation
>   - Joel
> - Revised kernel protection patch
>   - Joel
> - Revised Hotplug fixes
>   - Joel
> - Minor bug fixes and address review comments
>   - Vineeth
>
> Changes in v7
> =============
> - Kernel protection from untrusted usermode tasks
>   - Joel, Vineeth
> - Fix for hotplug crashes and hangs
>   - Joel, Vineeth
>
> Changes in v6
> =============
> - Documentation
>   - Joel
> - Pause siblings on entering nmi/irq/softirq
>   - Joel, Vineeth
> - Fix for RCU crash
>   - Joel
> - Fix for a crash in pick_next_task
>   - Yu Chen, Vineeth
> - Minor re-write of core-wide vruntime comparison
>   - Aaron Lu
> - Cleanup: Address Review comments
> - Cleanup: Remove hotplug support (for now)
> - Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
>   - Joel, Vineeth
>
> Changes in v5
> =============
> - Fixes for cgroup/process tagging during corner cases like cgroup
>   destroy, task moving across cgroups etc
>   - Tim Chen
> - Coresched aware task migrations
>   - Aubrey Li
> - Other minor stability fixes.
>
> Changes in v4
> =============
> - Implement a core wide min_vruntime for vruntime comparison of tasks
>   across cpus in a core.
>   - Aaron Lu
> - Fixes a typo bug in setting the forced_idle cpu.
>   - Aaron Lu
>
> Changes in v3
> =============
> - Fixes the issue of sibling picking up an incompatible task
>   - Aaron Lu
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fixes the issue of starving threads due to forced idle
>   - Peter Zijlstra
> - Fixes the refcounting issue when deleting a cgroup with tag
>   - Julien Desfossez
> - Fixes a crash during cpu offline/online with coresched enabled
>   - Vineeth Pillai
> - Fixes a comparison logic issue in sched_core_find
>   - Aaron Lu
>
> Changes in v2
> =============
> - Fixes for couple of NULL pointer dereference crashes
>   - Subhra Mazumdar
>   - Tim Chen
> - Improves priority comparison logic for process in different cpus
>   - Peter Zijlstra
>   - Aaron Lu
> - Fixes a hard lockup in rq locking
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fixes a performance issue seen on IO heavy workloads
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fix for 32bit build
>   - Aubrey Li
>
> Future work
> ===========
> - Load balancing/Migration fixes for core scheduling.
>   With v6, Load balancing is partially coresched aware, but has some
>   issues w.r.t process/taskgroup weights:
>   https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...
>
> Aubrey Li (1):
> sched: migration changes for core scheduling
>
> Joel Fernandes (Google) (16):
> sched/fair: Snapshot the min_vruntime of CPUs on force idle
> sched: Enqueue task into core queue only after vruntime is updated
> sched: Simplify the core pick loop for optimized case
> sched: Improve snapshotting of min_vruntime for CGroups
> arch/x86: Add a new TIF flag for untrusted tasks
> kernel/entry: Add support for core-wide protection of kernel-mode
> entry/idle: Enter and exit kernel protection during idle entry and
> exit
> sched: Split the cookie and setup per-task cookie on fork
> sched: Add a per-thread core scheduling interface
> sched: Release references to the per-task cookie on exit
> sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG
> kselftest: Add tests for core-sched interface
> sched: Move core-scheduler interfacing code to a new file
> Documentation: Add core scheduling documentation
> sched: Add a coresched command line option
> sched: Debug bits...
>
> Josh Don (2):
> sched: Refactor core cookie into struct
> sched: Add a second-level tag for nested CGroup usecase
>
> Peter Zijlstra (11):
> sched: Wrap rq::lock access
> sched: Introduce sched_class::pick_task()
> sched/fair: Fix pick_task_fair crashes due to empty rbtree
> sched: Core-wide rq->lock
> sched/fair: Add a few assertions
> sched: Basic tracking of matching tasks
> sched: Add core wide task selection and scheduling.
> sched: Fix priority inversion of cookied task with sibling
> sched: Trivial forced-newidle balancer
> irq_work: Cleanup
> sched: CGroup tagging interface for core scheduling
>
> Vineeth Pillai (2):
> sched/fair: Fix forced idle sibling starvation corner case
> entry/kvm: Protect the kernel when entering from guest
>
> .../admin-guide/hw-vuln/core-scheduling.rst   |  330 +++++
> Documentation/admin-guide/hw-vuln/index.rst   |    1 +
> .../admin-guide/kernel-parameters.txt         |   25 +
> arch/x86/include/asm/thread_info.h            |    2 +
> arch/x86/kernel/cpu/bugs.c                    |   19 +
> arch/x86/kvm/x86.c                            |    2 +
> drivers/gpu/drm/i915/i915_request.c           |    4 +-
> include/linux/cpu.h                           |    1 +
> include/linux/entry-common.h                  |   30 +-
> include/linux/entry-kvm.h                     |   12 +
> include/linux/irq_work.h                      |   33 +-
> include/linux/irqflags.h                      |    4 +-
> include/linux/sched.h                         |   28 +-
> include/linux/sched/smt.h                     |    4 +
> include/uapi/linux/prctl.h                    |    3 +
> kernel/Kconfig.preempt                        |    5 +
> kernel/bpf/stackmap.c                         |    2 +-
> kernel/cpu.c                                  |   43 +
> kernel/entry/common.c                         |   28 +-
> kernel/entry/kvm.c                            |   33 +
> kernel/fork.c                                 |    1 +
> kernel/irq_work.c                             |   18 +-
> kernel/printk/printk.c                        |    6 +-
> kernel/rcu/tree.c                             |    3 +-
> kernel/sched/Makefile                         |    1 +
> kernel/sched/core.c                           | 1278 +++++++++++++++--
> kernel/sched/coretag.c                        |  819 +++++++++++
> kernel/sched/cpuacct.c                        |   12 +-
> kernel/sched/deadline.c                       |   38 +-
> kernel/sched/debug.c                          |   12 +-
> kernel/sched/fair.c                           |  313 +++-
> kernel/sched/idle.c                           |   24 +-
> kernel/sched/pelt.h                           |    2 +-
> kernel/sched/rt.c                             |   31 +-
> kernel/sched/sched.h                          |  315 +++-
> kernel/sched/stop_task.c                      |   14 +-
> kernel/sched/topology.c                       |    4 +-
> kernel/sys.c                                  |    3 +
> kernel/time/tick-sched.c                      |    6 +-
> kernel/trace/bpf_trace.c                      |    2 +-
> tools/include/uapi/linux/prctl.h              |    3 +
> tools/testing/selftests/sched/.gitignore      |    1 +
> tools/testing/selftests/sched/Makefile        |   14 +
> tools/testing/selftests/sched/config          |    1 +
> .../testing/selftests/sched/test_coresched.c  |  818 +++++++++++
> 45 files changed, 4033 insertions(+), 315 deletions(-)
> create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
> create mode 100644 kernel/sched/coretag.c
> create mode 100644 tools/testing/selftests/sched/.gitignore
> create mode 100644 tools/testing/selftests/sched/Makefile
> create mode 100644 tools/testing/selftests/sched/config
> create mode 100644 tools/testing/selftests/sched/test_coresched.c
>
> --
> 2.29.2.299.gdc1121823c-goog
>


* Re: [PATCH -tip 12/32] sched: Simplify the core pick loop for optimized case
  2020-11-17 23:19 ` [PATCH -tip 12/32] sched: Simplify the core pick loop for optimized case Joel Fernandes (Google)
@ 2020-11-24 12:04   ` Peter Zijlstra
  2020-11-24 17:04     ` Joel Fernandes
  0 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-24 12:04 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:42PM -0500, Joel Fernandes (Google) wrote:
> +	/*
> +	 * Optimize for common case where this CPU has no cookies
> +	 * and there are no cookied tasks running on siblings.
> +	 */
> +	if (!need_sync) {
> +		for_each_class(class) {
> +			next = class->pick_task(rq);
> +			if (next)
> +				break;
> +		}
> +
> +		if (!next->core_cookie) {
> +			rq->core_pick = NULL;
> +			goto done;
> +		}
>  		need_sync = true;
>  	}

This isn't what I sent you here:

  https://lkml.kernel.org/r/20201026093131.GF2628@hirez.programming.kicks-ass.net

Specifically, you've lost the whole cfs-cgroup optimization.

What was wrong/not working with the below?

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5225,8 +5227,6 @@ pick_next_task(struct rq *rq, struct tas
 		return next;
 	}
 
-	put_prev_task_balance(rq, prev, rf);
-
 	smt_mask = cpu_smt_mask(cpu);
 	need_sync = !!rq->core->core_cookie;
 
@@ -5255,17 +5255,14 @@ pick_next_task(struct rq *rq, struct tas
 	 * and there are no cookied tasks running on siblings.
 	 */
 	if (!need_sync) {
-		for_each_class(class) {
-			next = class->pick_task(rq);
-			if (next)
-				break;
-		}
-
+		next = __pick_next_task(rq, prev, rf);
 		if (!next->core_cookie) {
 			rq->core_pick = NULL;
-			goto done;
+			return next;
 		}
-		need_sync = true;
+		put_prev_task(next);
+	} else {
+		put_prev_task_balance(rq, prev, rf);
 	}
 
 	for_each_cpu(i, smt_mask) {


* Re: [PATCH -tip 00/32] Core scheduling (v9)
  2020-11-24 11:48 ` [PATCH -tip 00/32] Core scheduling (v9) Vincent Guittot
@ 2020-11-24 15:08   ` Joel Fernandes
  2020-12-03  6:16     ` Ning, Hongyu
  0 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes @ 2020-11-24 15:08 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Gleixner,
	linux-kernel, Ingo Molnar, Linus Torvalds, Frederic Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, Alexander Graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	Jiang Biao, Alexandre Chartre, James Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, Jesse Barnes, Hyser,Chris,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Tue, Nov 24, 2020 at 6:48 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> Hi Joel,
>
> On Wed, 18 Nov 2020 at 00:20, Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > Core-Scheduling
> > ===============
> > Enclosed is series v9 of core scheduling.
> > v9 is rebased on tip/master (fe4adf6f92c4 ("Merge branch 'irq/core'"))..
> > I hope that this version is acceptable to be merged (pending any new review
>
> ./scripts/get_maintainer.pl is quite useful to make sure that all
> maintainers are CCed, and it helps a lot in getting reviews.

Apologies. I was just going by the folks who were CC'd on previous series.
I am new to posting this series. Sorry if I missed you; I will run
get_maintainer.pl henceforth!

 - Joel


> > comments that arise) as the main issues in the past are all resolved:
> >  1. Vruntime comparison.
> >  2. Documentation updates.
> >  3. CGroup and per-task interface developed by Google and Oracle.
> >  4. Hotplug fixes.
> > Almost all patches also have Reviewed-by or Acked-by tag. See below for full
> > list of changes in v9.
> >
> > Introduction of feature
> > =======================
> > Core scheduling is a feature that allows only trusted tasks to run
> > concurrently on CPUs sharing compute resources (e.g. hyperthreads on a
> > core). The goal is to mitigate core-level side-channel attacks
> > without requiring SMT to be disabled (which has a significant impact on
> > performance in some situations). Core scheduling (as of v7) mitigates
> > user-space to user-space attacks, and user-to-kernel attacks when one of
> > the siblings enters the kernel via an interrupt or system call.
> >
> > By default, the feature doesn't change any of the current scheduler
> > behavior. The user decides which tasks can run simultaneously on the
> > same core (for now by having them in the same tagged cgroup). When a tag
> > is enabled in a cgroup and a task from that cgroup is running on a
> > hardware thread, the scheduler ensures that only idle or trusted tasks
> > run on the other sibling(s). Besides security concerns, this feature can
> > also be beneficial for RT and performance applications where we want to
> > control how tasks make use of SMT dynamically.
> >
> > Both a CGroup and Per-task interface via prctl(2) are provided for configuring
> > core sharing. More details are provided in documentation patch.  Kselftests are
> > provided to verify the correctness/rules of the interface.
> >
> > Testing
> > =======
> > ChromeOS testing shows a 3x improvement in key-press latency in a Google
> > Docs typing test run alongside a Google Hangouts call (the maximum latency
> > drops from 150ms to 50ms per keypress).
> >
> > Julien: TPCC tests showed improvements with core-scheduling over nosmt, as
> > below. With kernel protection enabled, throughput is on par with nosmt, so
> > it shows no additional regression. Possibly ASI will improve performance
> > for those who choose kernel protection (which can be toggled through the
> > sched_core_protect_kernel sysctl).
> >                                 average         stdev           diff
> > baseline (SMT on)               1197.272        44.78312824
> > core sched (   kernel protect)  412.9895        45.42734343     -65.51%
> > core sched (no kernel protect)  686.6515        71.77756931     -42.65%
> > nosmt                           408.667         39.39042872     -65.87%
> > (Note these results are from v8).
> >
> > Vineeth tested sysbench and does not see any regressions.
> > Hong and Aubrey tested v9 and see results similar to v8. There is a known issue
> > with uperf that does regress. This appears to be because of ksoftirq heavily
> > contending with other tasks on the core. The consensus is this can be improved
> > in the future.
> >
> > Changes in v9
> > =============
> > - Note that the vruntime snapshot change is written in 2 patches to show the
> >   progression of the idea and prevent merge conflicts:
> >     sched/fair: Snapshot the min_vruntime of CPUs on force idle
> >     sched: Improve snapshotting of min_vruntime for CGroups
> >   Same with the RT priority inversion change:
> >     sched: Fix priority inversion of cookied task with sibling
> >     sched: Improve snapshotting of min_vruntime for CGroups
> > - Disable coresched on certain AMD HW.
> >
> > Changes in v8
> > =============
> > - New interface/API implementation
> >   - Joel
> > - Revised kernel protection patch
> >   - Joel
> > - Revised Hotplug fixes
> >   - Joel
> > - Minor bug fixes and address review comments
> >   - Vineeth
> >
> > Changes in v7
> > =============
> > - Kernel protection from untrusted usermode tasks
> >   - Joel, Vineeth
> > - Fix for hotplug crashes and hangs
> >   - Joel, Vineeth
> >
> > Changes in v6
> > =============
> > - Documentation
> >   - Joel
> > - Pause siblings on entering nmi/irq/softirq
> >   - Joel, Vineeth
> > - Fix for RCU crash
> >   - Joel
> > - Fix for a crash in pick_next_task
> >   - Yu Chen, Vineeth
> > - Minor re-write of core-wide vruntime comparison
> >   - Aaron Lu
> > - Cleanup: Address Review comments
> > - Cleanup: Remove hotplug support (for now)
> > - Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
> >   - Joel, Vineeth
> >
> > Changes in v5
> > =============
> > - Fixes for cgroup/process tagging during corner cases like cgroup
> >   destroy, task moving across cgroups etc
> >   - Tim Chen
> > - Coresched aware task migrations
> >   - Aubrey Li
> > - Other minor stability fixes.
> >
> > Changes in v4
> > =============
> > - Implement a core wide min_vruntime for vruntime comparison of tasks
> >   across cpus in a core.
> >   - Aaron Lu
> > - Fixes a typo bug in setting the forced_idle cpu.
> >   - Aaron Lu
> >
> > Changes in v3
> > =============
> > - Fixes the issue of sibling picking up an incompatible task
> >   - Aaron Lu
> >   - Vineeth Pillai
> >   - Julien Desfossez
> > - Fixes the issue of starving threads due to forced idle
> >   - Peter Zijlstra
> > - Fixes the refcounting issue when deleting a cgroup with tag
> >   - Julien Desfossez
> > - Fixes a crash during cpu offline/online with coresched enabled
> >   - Vineeth Pillai
> > - Fixes a comparison logic issue in sched_core_find
> >   - Aaron Lu
> >
> > Changes in v2
> > =============
> > - Fixes for couple of NULL pointer dereference crashes
> >   - Subhra Mazumdar
> >   - Tim Chen
> > - Improves priority comparison logic for process in different cpus
> >   - Peter Zijlstra
> >   - Aaron Lu
> > - Fixes a hard lockup in rq locking
> >   - Vineeth Pillai
> >   - Julien Desfossez
> > - Fixes a performance issue seen on IO heavy workloads
> >   - Vineeth Pillai
> >   - Julien Desfossez
> > - Fix for 32bit build
> >   - Aubrey Li
> >
> > Future work
> > ===========
> > - Load balancing/Migration fixes for core scheduling.
> >   With v6, Load balancing is partially coresched aware, but has some
> >   issues w.r.t process/taskgroup weights:
> >   https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...
> >
> > Aubrey Li (1):
> > sched: migration changes for core scheduling
> >
> > Joel Fernandes (Google) (16):
> > sched/fair: Snapshot the min_vruntime of CPUs on force idle
> > sched: Enqueue task into core queue only after vruntime is updated
> > sched: Simplify the core pick loop for optimized case
> > sched: Improve snapshotting of min_vruntime for CGroups
> > arch/x86: Add a new TIF flag for untrusted tasks
> > kernel/entry: Add support for core-wide protection of kernel-mode
> > entry/idle: Enter and exit kernel protection during idle entry and
> > exit
> > sched: Split the cookie and setup per-task cookie on fork
> > sched: Add a per-thread core scheduling interface
> > sched: Release references to the per-task cookie on exit
> > sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG
> > kselftest: Add tests for core-sched interface
> > sched: Move core-scheduler interfacing code to a new file
> > Documentation: Add core scheduling documentation
> > sched: Add a coresched command line option
> > sched: Debug bits...
> >
> > Josh Don (2):
> > sched: Refactor core cookie into struct
> > sched: Add a second-level tag for nested CGroup usecase
> >
> > Peter Zijlstra (11):
> > sched: Wrap rq::lock access
> > sched: Introduce sched_class::pick_task()
> > sched/fair: Fix pick_task_fair crashes due to empty rbtree
> > sched: Core-wide rq->lock
> > sched/fair: Add a few assertions
> > sched: Basic tracking of matching tasks
> > sched: Add core wide task selection and scheduling.
> > sched: Fix priority inversion of cookied task with sibling
> > sched: Trivial forced-newidle balancer
> > irq_work: Cleanup
> > sched: CGroup tagging interface for core scheduling
> >
> > Vineeth Pillai (2):
> > sched/fair: Fix forced idle sibling starvation corner case
> > entry/kvm: Protect the kernel when entering from guest
> >
> > .../admin-guide/hw-vuln/core-scheduling.rst   |  330 +++++
> > Documentation/admin-guide/hw-vuln/index.rst   |    1 +
> > .../admin-guide/kernel-parameters.txt         |   25 +
> > arch/x86/include/asm/thread_info.h            |    2 +
> > arch/x86/kernel/cpu/bugs.c                    |   19 +
> > arch/x86/kvm/x86.c                            |    2 +
> > drivers/gpu/drm/i915/i915_request.c           |    4 +-
> > include/linux/cpu.h                           |    1 +
> > include/linux/entry-common.h                  |   30 +-
> > include/linux/entry-kvm.h                     |   12 +
> > include/linux/irq_work.h                      |   33 +-
> > include/linux/irqflags.h                      |    4 +-
> > include/linux/sched.h                         |   28 +-
> > include/linux/sched/smt.h                     |    4 +
> > include/uapi/linux/prctl.h                    |    3 +
> > kernel/Kconfig.preempt                        |    5 +
> > kernel/bpf/stackmap.c                         |    2 +-
> > kernel/cpu.c                                  |   43 +
> > kernel/entry/common.c                         |   28 +-
> > kernel/entry/kvm.c                            |   33 +
> > kernel/fork.c                                 |    1 +
> > kernel/irq_work.c                             |   18 +-
> > kernel/printk/printk.c                        |    6 +-
> > kernel/rcu/tree.c                             |    3 +-
> > kernel/sched/Makefile                         |    1 +
> > kernel/sched/core.c                           | 1278 +++++++++++++++--
> > kernel/sched/coretag.c                        |  819 +++++++++++
> > kernel/sched/cpuacct.c                        |   12 +-
> > kernel/sched/deadline.c                       |   38 +-
> > kernel/sched/debug.c                          |   12 +-
> > kernel/sched/fair.c                           |  313 +++-
> > kernel/sched/idle.c                           |   24 +-
> > kernel/sched/pelt.h                           |    2 +-
> > kernel/sched/rt.c                             |   31 +-
> > kernel/sched/sched.h                          |  315 +++-
> > kernel/sched/stop_task.c                      |   14 +-
> > kernel/sched/topology.c                       |    4 +-
> > kernel/sys.c                                  |    3 +
> > kernel/time/tick-sched.c                      |    6 +-
> > kernel/trace/bpf_trace.c                      |    2 +-
> > tools/include/uapi/linux/prctl.h              |    3 +
> > tools/testing/selftests/sched/.gitignore      |    1 +
> > tools/testing/selftests/sched/Makefile        |   14 +
> > tools/testing/selftests/sched/config          |    1 +
> > .../testing/selftests/sched/test_coresched.c  |  818 +++++++++++
> > 45 files changed, 4033 insertions(+), 315 deletions(-)
> > create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
> > create mode 100644 kernel/sched/coretag.c
> > create mode 100644 tools/testing/selftests/sched/.gitignore
> > create mode 100644 tools/testing/selftests/sched/Makefile
> > create mode 100644 tools/testing/selftests/sched/config
> > create mode 100644 tools/testing/selftests/sched/test_coresched.c
> >
> > --
> > 2.29.2.299.gdc1121823c-goog
> >

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-23  4:36     ` Li, Aubrey
@ 2020-11-24 15:42       ` Peter Zijlstra
  2020-11-25  3:12         ` Li, Aubrey
  0 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-24 15:42 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Balbir Singh, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote:
> >> +#ifdef CONFIG_SCHED_CORE
> >> +		/*
> >> +		 * Skip this cpu if source task's cookie does not match
> >> +		 * with CPU's core cookie.
> >> +		 */
> >> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
> >> +			continue;
> >> +#endif
> >> +
> > 
> > Any reason this is under an #ifdef? In sched_core_cookie_match() won't
> > the check for sched_core_enabled() do the right thing even when
> > CONFIG_SCHED_CORE is not enabed?> 
> Yes, sched_core_enabled() works properly when CONFIG_SCHED_CORE is not
> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
> sense to leave a core-scheduler-specific function here even at compile
> time. Also, since this is in a hot path, it saves CPU cycles by
> avoiding a branch.

No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is
more.

> >> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
> >> +{
> >> +	bool idle_core = true;
> >> +	int cpu;
> >> +
> >> +	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
> >> +	if (!sched_core_enabled(rq))
> >> +		return true;
> >> +
> >> +	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
> >> +		if (!available_idle_cpu(cpu)) {
> > 
> > I was looking at this snippet and comparing this to is_core_idle(), the
> > major difference is the check for vcpu_is_preempted(). Do we want to
> > call the core as non idle if any vcpu was preempted on this CPU?
> 
> Yes, if a VCPU was preempted on this CPU, it is better not to place a task
> on this core, as that VCPU may be holding a spinlock and want to be
> executed again ASAP.

If you're doing core scheduling on vcpus, you deserve all the pain
possible.




* Re: [PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-17 23:19 ` [PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode Joel Fernandes (Google)
@ 2020-11-24 16:09   ` Peter Zijlstra
  2020-11-24 17:52     ` Joel Fernandes
  2020-11-25  9:37   ` Peter Zijlstra
  2020-11-26  5:37   ` Balbir Singh
  2 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-24 16:09 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Tim Chen, Paul E . McKenney

On Tue, Nov 17, 2020 at 06:19:48PM -0500, Joel Fernandes (Google) wrote:
> Core-scheduling prevents hyperthreads in usermode from attacking each
> other, but it does not do anything about one of the hyperthreads
> entering the kernel for any reason. This leaves the door open for MDS
> and L1TF attacks with concurrent execution sequences between
> hyperthreads.
> 
> This patch therefore adds support for protecting all syscall and IRQ
> kernel mode entries. Care is taken to track the outermost usermode exit
> and entry using per-cpu counters. In cases where one of the hyperthreads
> enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> when not needed - example: idle and non-cookie HTs do not need to be
> forced into kernel mode.
> 
> More information about attacks:
> For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> data to either host or guest attackers. For L1TF, it is possible to leak
> to guest attackers. There is no possible mitigation involving flushing
> of buffers to avoid this since the execution of attacker and victims
> happen concurrently on 2 or more HTs.

Oh gawd; this is horrible...


> +bool sched_core_wait_till_safe(unsigned long ti_check)
> +{
> +	bool restart = false;
> +	struct rq *rq;
> +	int cpu;
> +
> +	/* We clear the thread flag only at the end, so no need to check for it. */
> +	ti_check &= ~_TIF_UNSAFE_RET;
> +
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/* Down grade to allow interrupts to prevent stop_machine lockups.. */
> +	preempt_disable();
> +	local_irq_enable();
> +
> +	/*
> +	 * Wait till the core of this HT is not in an unsafe state.
> +	 *
> +	 * Pair with raw_spin_lock/unlock() in sched_core_unsafe_enter/exit().
> +	 */
> +	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
> +		cpu_relax();
> +		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
> +			restart = true;
> +			break;
> +		}
> +	}

What's that ACQUIRE for?

> +
> +	/* Upgrade it back to the expectations of entry code. */
> +	local_irq_disable();
> +	preempt_enable();
> +
> +ret:
> +	if (!restart)
> +		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	return restart;
> +}

So if TIF_NEED_RESCHED gets set, we'll break out and reschedule, cute.

> +void sched_core_unsafe_enter(void)
> +{
> +	const struct cpumask *smt_mask;
> +	unsigned long flags;
> +	struct rq *rq;
> +	int i, cpu;
> +
> +	if (!static_branch_likely(&sched_core_protect_kernel))
> +		return;
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +
> +	/* Ensure that on return to user/guest, we check whether to wait. */
> +	if (current->core_cookie)
> +		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
> +	rq->core_this_unsafe_nest++;
> +
> +	/*
> +	 * Should not nest: enter() should only pair with exit(). Both are done
> +	 * during the first entry into kernel and the last exit from kernel.
> +	 * Nested kernel entries (such as nested interrupts) will only trigger
> +	 * enter() and exit() on the outer most kernel entry and exit.
> +	 */
> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
> +		goto ret;
> +
> +	raw_spin_lock(rq_lockp(rq));
> +	smt_mask = cpu_smt_mask(cpu);
> +
> +	/*
> +	 * Contribute this CPU's unsafe_enter() to the core-wide unsafe_enter()
> +	 * count.  The raw_spin_unlock() release semantics pairs with the nest
> +	 * counter's smp_load_acquire() in sched_core_wait_till_safe().
> +	 */
> +	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
> +
> +	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
> +		goto unlock;
> +
> +	if (irq_work_is_busy(&rq->core_irq_work)) {
> +		/*
> +		 * Do nothing more since we are in an IPI sent from another
> +		 * sibling to enforce safety. That sibling would have sent IPIs
> +		 * to all of the HTs.
> +		 */
> +		goto unlock;
> +	}
> +
> +	/*
> +	 * If we are not the first ones on the core to enter core-wide unsafe
> +	 * state, do nothing.
> +	 */
> +	if (rq->core->core_unsafe_nest > 1)
> +		goto unlock;
> +
> +	/* Do nothing more if the core is not tagged. */
> +	if (!rq->core->core_cookie)
> +		goto unlock;
> +
> +	for_each_cpu(i, smt_mask) {
> +		struct rq *srq = cpu_rq(i);
> +
> +		if (i == cpu || cpu_is_offline(i))
> +			continue;
> +
> +		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
> +			continue;
> +
> +		/* Skip if HT is not running a tagged task. */
> +		if (!srq->curr->core_cookie && !srq->core_pick)
> +			continue;
> +
> +		/*
> +		 * Force sibling into the kernel by IPI. If work was already
> +		 * pending, no new IPIs are sent. This is Ok since the receiver
> +		 * would already be in the kernel, or on its way to it.
> +		 */
> +		irq_work_queue_on(&srq->core_irq_work, i);

Why irq_work though? Why not smp_send_reschedule(i)?

> +	}
> +unlock:
> +	raw_spin_unlock(rq_lockp(rq));
> +ret:
> +	local_irq_restore(flags);
> +}


* Re: [PATCH -tip 19/32] entry/idle: Enter and exit kernel protection during idle entry and exit
  2020-11-17 23:19 ` [PATCH -tip 19/32] entry/idle: Enter and exit kernel protection during idle entry and exit Joel Fernandes (Google)
@ 2020-11-24 16:13   ` Peter Zijlstra
  2020-11-24 18:03     ` Joel Fernandes
  0 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-24 16:13 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:49PM -0500, Joel Fernandes (Google) wrote:
> Add a generic_idle_{enter,exit} helper function to enter and exit kernel
> protection when entering and exiting idle, respectively.
> 
> While at it, remove a stale RCU comment.
> 
> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  include/linux/entry-common.h | 18 ++++++++++++++++++
>  kernel/sched/idle.c          | 11 ++++++-----
>  2 files changed, 24 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 022e1f114157..8f34ae625f83 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -454,4 +454,22 @@ static inline bool entry_kernel_protected(void)
>  	return IS_ENABLED(CONFIG_SCHED_CORE) && sched_core_kernel_protected()
>  		&& _TIF_UNSAFE_RET != 0;
>  }
> +
> +/**
> + * generic_idle_enter - General tasks to perform during idle entry.
> + */
> +static inline void generic_idle_enter(void)
> +{
> +	/* Entering idle ends the protected kernel region. */
> +	sched_core_unsafe_exit();
> +}
> +
> +/**
> + * generic_idle_exit  - General tasks to perform during idle exit.
> + */
> +static inline void generic_idle_exit(void)
> +{
> +	/* Exiting idle (re)starts the protected kernel region. */
> +	sched_core_unsafe_enter();
> +}
>  #endif

That naming is terrible..

> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 8bdb214eb78f..ee4f91396c31 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -8,6 +8,7 @@
>   */
>  #include "sched.h"
>  
> +#include <linux/entry-common.h>
>  #include <trace/events/power.h>
>  
>  /* Linker adds these: start and end of __cpuidle functions */
> @@ -54,6 +55,7 @@ __setup("hlt", cpu_idle_nopoll_setup);
>  
>  static noinline int __cpuidle cpu_idle_poll(void)
>  {
> +	generic_idle_enter();
>  	trace_cpu_idle(0, smp_processor_id());
>  	stop_critical_timings();
>  	rcu_idle_enter();
> @@ -66,6 +68,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>  	rcu_idle_exit();
>  	start_critical_timings();
>  	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> +	generic_idle_exit();
>  
>  	return 1;
>  }
> @@ -156,11 +159,7 @@ static void cpuidle_idle_call(void)
>  		return;
>  	}
>  
> -	/*
> -	 * The RCU framework needs to be told that we are entering an idle
> -	 * section, so no more rcu read side critical sections and one more
> -	 * step to the grace period
> -	 */
> +	generic_idle_enter();
>  
>  	if (cpuidle_not_available(drv, dev)) {
>  		tick_nohz_idle_stop_tick();
> @@ -225,6 +224,8 @@ static void cpuidle_idle_call(void)
>  	 */
>  	if (WARN_ON_ONCE(irqs_disabled()))
>  		local_irq_enable();
> +
> +	generic_idle_exit();
>  }

I'm confused.. arch_cpu_idle_{enter,exit}() weren't conveniently placed
for you?


* Re: [PATCH -tip 12/32] sched: Simplify the core pick loop for optimized case
  2020-11-24 12:04   ` Peter Zijlstra
@ 2020-11-24 17:04     ` Joel Fernandes
  2020-11-25  8:37       ` Peter Zijlstra
  0 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes @ 2020-11-24 17:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

Hi Peter,

On Tue, Nov 24, 2020 at 01:04:38PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:42PM -0500, Joel Fernandes (Google) wrote:
> > +	/*
> > +	 * Optimize for common case where this CPU has no cookies
> > +	 * and there are no cookied tasks running on siblings.
> > +	 */
> > +	if (!need_sync) {
> > +		for_each_class(class) {
> > +			next = class->pick_task(rq);
> > +			if (next)
> > +				break;
> > +		}
> > +
> > +		if (!next->core_cookie) {
> > +			rq->core_pick = NULL;
> > +			goto done;
> > +		}
> >  		need_sync = true;
> >  	}
> 
> This isn't what I send you here:
> 
>   https://lkml.kernel.org/r/20201026093131.GF2628@hirez.programming.kicks-ass.net

I had replied to it here with concerns about the effects of newly-idle
balancing not being reversible; it was only a theoretical concern:
http://lore.kernel.org/r/20201105185019.GA2771003@google.com

Also I was trying to keep the logic the same as v8 for unconstrained pick
(calling pick_task), considering that has been tested quite a bit.

> Specifically, you've lost the whole cfs-cgroup optimization.

Are you referring to this optimization in pick_next_task_fair() ?

/*
 * Since we haven't yet done put_prev_entity and if the selected task
 * is a different task than we started out with, try and touch the
 * least amount of cfs_rqs.
 */

You are right, we wouldn't get that by just calling pick_task_fair(). We
did not have this in the v8 series either, though.

Also, if the task is a cookied task, then I think you are doing more work
with your patch due to the extra put_prev_task().

> What was wrong/not working with the below?

Other than the newly-idle balancing, IIRC it was also causing instability.
Maybe we can consider this optimization in the future, if that's OK with
you?

thanks,

 - Joel

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5225,8 +5227,6 @@ pick_next_task(struct rq *rq, struct tas
>  		return next;
>  	}
>  
> -	put_prev_task_balance(rq, prev, rf);
> -
>  	smt_mask = cpu_smt_mask(cpu);
>  	need_sync = !!rq->core->core_cookie;
>  
> @@ -5255,17 +5255,14 @@ pick_next_task(struct rq *rq, struct tas
>  	 * and there are no cookied tasks running on siblings.
>  	 */
>  	if (!need_sync) {
> -		for_each_class(class) {
> -			next = class->pick_task(rq);
> -			if (next)
> -				break;
> -		}
> -
> +		next = __pick_next_task(rq, prev, rf);
>  		if (!next->core_cookie) {
>  			rq->core_pick = NULL;
> -			goto done;
> +			return next;
>  		}
> -		need_sync = true;
> +		put_prev_task(next);
> +	} else {
> +		put_prev_task_balance(rq, prev, rf);
>  	}
>  
>  	for_each_cpu(i, smt_mask) {


* Re: [PATCH -tip 15/32] sched: Improve snapshotting of min_vruntime for CGroups
  2020-11-24 10:27   ` Peter Zijlstra
@ 2020-11-24 17:07     ` Joel Fernandes
  2020-11-25  8:41       ` Peter Zijlstra
  0 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes @ 2020-11-24 17:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

Hi Peter,

On Tue, Nov 24, 2020 at 11:27:41AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:45PM -0500, Joel Fernandes (Google) wrote:
> > A previous patch improved cross-CPU vruntime comparison operations in
> > pick_next_task(). Improve it further for tasks in CGroups.
> > 
> > In particular, for cross-CPU comparisons, we were previously going to
> > the root-level se(s) for both tasks being compared. That was strange.
> > This patch instead finds the se(s) for both tasks that have the same
> > parent (which may be different from the root).
> > 
> > A note about the min_vruntime snapshot and force idling:
> > Abbreviations: fi: force-idled now? ; fib: force-idled before?
> > During selection:
> > When we're not fi, we need to update snapshot.
> > when we're fi and we were not fi, we must update snapshot.
> > When we're fi and we were already fi, we must not update snapshot.
> > 
> > Which gives:
> >         fib     fi      update?
> >         0       0       1
> >         0       1       1
> >         1       0       1
> >         1       1       0
> > So the min_vruntime snapshot needs to be updated when: !(fib && fi).
> > 
> > Also, the cfs_prio_less() function needs to be aware of whether the core
> > is in force idle or not, since it will use this information to know
> > whether to advance a cfs_rq's min_vruntime_fi in the hierarchy. So pass
> > this information along via pick_task() -> prio_less().
> 
> Hurmph.. so I'm tempted to smash a bunch of patches together.
> 
>  2 <- 3 (already done - bisection crashes are daft)
>  6 <- 11
>  7 <- {10, 12}
>  9 <- 15
> 
> I'm thinking that would result in an easier to read series, or do we
> want to preserve this history?
> 
> (fwiw, I pulled 15 before 13,14, as I think that makes more sense
> anyway).

Either way would be OK with me. I would suggest retaining the history,
though, so that the changelog details of the issues we faced are preserved
and we can refer back to them in the future.

thanks,

 - Joel



* Re: [PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-24 16:09   ` Peter Zijlstra
@ 2020-11-24 17:52     ` Joel Fernandes
  0 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes @ 2020-11-24 17:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Tim Chen, Paul E . McKenney

Hi Peter,

On Tue, Nov 24, 2020 at 05:09:06PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:48PM -0500, Joel Fernandes (Google) wrote:
> > Core-scheduling prevents hyperthreads in usermode from attacking each
> > other, but it does not do anything about one of the hyperthreads
> > entering the kernel for any reason. This leaves the door open for MDS
> > and L1TF attacks with concurrent execution sequences between
> > hyperthreads.
> > 
> > This patch therefore adds support for protecting all syscall and IRQ
> > kernel mode entries. Care is taken to track the outermost usermode exit
> > and entry using per-cpu counters. In cases where one of the hyperthreads
> > enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> > when not needed - example: idle and non-cookie HTs do not need to be
> > forced into kernel mode.
> > 
> > More information about attacks:
> > For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> > data to either host or guest attackers. For L1TF, it is possible to leak
> > to guest attackers. There is no possible mitigation involving flushing
> > of buffers to avoid this since the execution of attacker and victims
> > happen concurrently on 2 or more HTs.
> 
> Oh gawd; this is horrible...

Yeah, it's the same issue we discussed earlier this year :)

> > +bool sched_core_wait_till_safe(unsigned long ti_check)
> > +{
> > +	bool restart = false;
> > +	struct rq *rq;
> > +	int cpu;
> > +
> > +	/* We clear the thread flag only at the end, so no need to check for it. */
> > +	ti_check &= ~_TIF_UNSAFE_RET;
> > +
> > +	cpu = smp_processor_id();
> > +	rq = cpu_rq(cpu);
> > +
> > +	if (!sched_core_enabled(rq))
> > +		goto ret;
> > +
> > +	/* Down grade to allow interrupts to prevent stop_machine lockups.. */
> > +	preempt_disable();
> > +	local_irq_enable();
> > +
> > +	/*
> > +	 * Wait till the core of this HT is not in an unsafe state.
> > +	 *
> > +	 * Pair with raw_spin_lock/unlock() in sched_core_unsafe_enter/exit().
> > +	 */
> > +	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
> > +		cpu_relax();
> > +		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
> > +			restart = true;
> > +			break;
> > +		}
> > +	}
> 
> What's that ACQUIRE for?

I was concerned about something like the below on a weakly-ordered arch:

Kernel				Attacker
------                          --------
write unsafe=1

kernel code does stores		while (unsafe == 1); (ACQUIRE)
   ^                             ^
   | needs to be ordered	 | needs to be ordered
   v                             v
write unsafe=0 (RELEASE)        Attacker code.


Here, I want the accesses made by kernel code to be ordered with respect
to the write to the unsafe nesting counter, so that the attacker code does
not see those accesses later.

It could be argued that it's a theoretical concern, but I wanted to play it
safe. In particular, the existing uarch buffer flushing before entering the
attacker code might already prevent the attacker from doing anything bad
even without the additional memory barriers.

> > +
> > +	/* Upgrade it back to the expectations of entry code. */
> > +	local_irq_disable();
> > +	preempt_enable();
> > +
> > +ret:
> > +	if (!restart)
> > +		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
> > +
> > +	return restart;
> > +}
> 
> So if TIF_NEED_RESCHED gets set, we'll break out and reschedule, cute.

Thanks.

> > +	/* Do nothing more if the core is not tagged. */
> > +	if (!rq->core->core_cookie)
> > +		goto unlock;
> > +
> > +	for_each_cpu(i, smt_mask) {
> > +		struct rq *srq = cpu_rq(i);
> > +
> > +		if (i == cpu || cpu_is_offline(i))
> > +			continue;
> > +
> > +		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
> > +			continue;
> > +
> > +		/* Skip if HT is not running a tagged task. */
> > +		if (!srq->curr->core_cookie && !srq->core_pick)
> > +			continue;
> > +
> > +		/*
> > +		 * Force sibling into the kernel by IPI. If work was already
> > +		 * pending, no new IPIs are sent. This is Ok since the receiver
> > +		 * would already be in the kernel, or on its way to it.
> > +		 */
> > +		irq_work_queue_on(&srq->core_irq_work, i);
> 
> Why irq_work though? Why not smp_send_reschedule(i)?

I tried this approach. The main issue is avoiding sending an IPI while a
previous one is still in flight. I needed some mechanism (shared variables)
to detect this, so that an IPI would not be sent if one was just sent.

With irq_work, I get irq_work_is_busy(), which helps with exactly that.
With the other approaches I tried, I ran into issues with CSD locking, and
irq_work implements the CSD locking the way I wanted. So, to avoid
reinventing the wheel, I stuck with irq_work and it worked well.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 19/32] entry/idle: Enter and exit kernel protection during idle entry and exit
  2020-11-24 16:13   ` Peter Zijlstra
@ 2020-11-24 18:03     ` Joel Fernandes
  2020-11-25  8:49       ` Peter Zijlstra
  0 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes @ 2020-11-24 18:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 24, 2020 at 05:13:35PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:49PM -0500, Joel Fernandes (Google) wrote:
> > Add a generic_idle_{enter,exit} helper function to enter and exit kernel
> > protection when entering and exiting idle, respectively.
> > 
> > While at it, remove a stale RCU comment.
> > 
> > Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
> > Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  include/linux/entry-common.h | 18 ++++++++++++++++++
> >  kernel/sched/idle.c          | 11 ++++++-----
> >  2 files changed, 24 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> > index 022e1f114157..8f34ae625f83 100644
> > --- a/include/linux/entry-common.h
> > +++ b/include/linux/entry-common.h
> > @@ -454,4 +454,22 @@ static inline bool entry_kernel_protected(void)
> >  	return IS_ENABLED(CONFIG_SCHED_CORE) && sched_core_kernel_protected()
> >  		&& _TIF_UNSAFE_RET != 0;
> >  }
> > +
> > +/**
> > + * generic_idle_enter - General tasks to perform during idle entry.
> > + */
> > +static inline void generic_idle_enter(void)
> > +{
> > +	/* Entering idle ends the protected kernel region. */
> > +	sched_core_unsafe_exit();
> > +}
> > +
> > +/**
> > + * generic_idle_exit  - General tasks to perform during idle exit.
> > + */
> > +static inline void generic_idle_exit(void)
> > +{
> > +	/* Exiting idle (re)starts the protected kernel region. */
> > +	sched_core_unsafe_enter();
> > +}
> >  #endif
> 
> That naming is terrible..

Yeah, sorry :-\. I chose the naming to align with the CONFIG_GENERIC_ENTRY
naming. I am open to ideas on that.

> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index 8bdb214eb78f..ee4f91396c31 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -8,6 +8,7 @@
> >   */
> >  #include "sched.h"
> >  
> > +#include <linux/entry-common.h>
> >  #include <trace/events/power.h>
> >  
> >  /* Linker adds these: start and end of __cpuidle functions */
> > @@ -54,6 +55,7 @@ __setup("hlt", cpu_idle_nopoll_setup);
> >  
> >  static noinline int __cpuidle cpu_idle_poll(void)
> >  {
> > +	generic_idle_enter();
> >  	trace_cpu_idle(0, smp_processor_id());
> >  	stop_critical_timings();
> >  	rcu_idle_enter();
> > @@ -66,6 +68,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
> >  	rcu_idle_exit();
> >  	start_critical_timings();
> >  	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> > +	generic_idle_exit();
> >  
> >  	return 1;
> >  }
> > @@ -156,11 +159,7 @@ static void cpuidle_idle_call(void)
> >  		return;
> >  	}
> >  
> > -	/*
> > -	 * The RCU framework needs to be told that we are entering an idle
> > -	 * section, so no more rcu read side critical sections and one more
> > -	 * step to the grace period
> > -	 */
> > +	generic_idle_enter();
> >  
> >  	if (cpuidle_not_available(drv, dev)) {
> >  		tick_nohz_idle_stop_tick();
> > @@ -225,6 +224,8 @@ static void cpuidle_idle_call(void)
> >  	 */
> >  	if (WARN_ON_ONCE(irqs_disabled()))
> >  		local_irq_enable();
> > +
> > +	generic_idle_exit();
> >  }
> 
> I'm confused.. arch_cpu_idle_{enter,exit}() weren't conveniently placed
> for you?

The way this patch series works, it avoids depending on arch code as much
as possible. Since other architectures, such as ARM, may need this patchset,
it may be better to keep it in the generic entry code.  Thoughts?

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling
  2020-11-22 22:41   ` Balbir Singh
@ 2020-11-24 18:30     ` Joel Fernandes
  2020-11-25 23:05       ` Balbir Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes @ 2020-11-24 18:30 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Mon, Nov 23, 2020 at 09:41:23AM +1100, Balbir Singh wrote:
> On Tue, Nov 17, 2020 at 06:19:40PM -0500, Joel Fernandes (Google) wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > The rationale is as follows. In the core-wide pick logic, even if
> > need_sync == false, we need to go look at other CPUs (non-local CPUs) to
> > see if they could be running RT.
> > 
> > Say the RQs in a particular core look like this:
> > Let CFS1 and CFS2 be 2 tagged CFS tags. Let RT1 be an untagged RT task.
> > 
> > rq0            rq1
> > CFS1 (tagged)  RT1 (not tag)
> > CFS2 (tagged)
> > 
> > Say schedule() runs on rq0. Now, it will enter the above loop and
> > pick_task(RT) will return NULL for 'p'. It will enter the above if() block
> > and see that need_sync == false and will skip RT entirely.
> > 
> > The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> > rq0             rq1
> > CFS1            IDLE
> > 
> > When it should have selected:
> > rq0             r1
> > IDLE            RT
> > 
> > Joel saw this issue on real-world usecases in ChromeOS where an RT task
> > gets constantly force-idled and breaks RT. Lets cure it.
> > 
> > NOTE: This problem will be fixed differently in a later patch. It just
> >       kept here for reference purposes about this issue, and to make
> >       applying later patches easier.
> >
> 
> The changelog is hard to read, it refers to above if(), whereas there
> is no code snippet in the changelog.

Yeah sorry, it comes from this email where I described the issue:
http://lore.kernel.org/r/20201023175724.GA3563800@google.com

I corrected the changelog and appended the patch below. Also pushed it to:
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched

> Also, from what I can see following
> the series, p->core_cookie is not yet set anywhere (unless I missed it),
> so fixing it in here did not make sense just reading the series.

The interface patches for core_cookie are added later; the infrastructure
comes first here. It would not make sense to add the interface first either,
so I think the current ordering is fine.

---8<-----------------------

From: Peter Zijlstra <peterz@infradead.org>
Subject: [PATCH] sched: Fix priority inversion of cookied task with sibling

The rationale is as follows. In the core-wide pick logic, even if
need_sync == false, we need to go look at other CPUs (non-local CPUs) to
see if they could be running RT.

Say the RQs in a particular core look like this:
Let CFS1 and CFS2 be two tagged CFS tasks. Let RT1 be an untagged RT task.

rq0            rq1
CFS1 (tagged)  RT1 (not tag)
CFS2 (tagged)

The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
rq0             rq1
CFS1            IDLE

When it should have selected:
rq0             rq1
IDLE            RT

Fix this issue by forcing need_sync and restarting the search if a
cookied task was discovered. This will avoid this optimization from
making incorrect picks.

Joel saw this issue on real-world use cases in ChromeOS where an RT task
gets constantly force-idled and breaks RT. Let's cure it.

NOTE: This problem will be fixed differently in a later patch. The fix is
      just kept here for reference about this issue, and to make applying
      later patches easier.

Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/core.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ee4902c2cf5..53af817740c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5195,6 +5195,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	need_sync = !!rq->core->core_cookie;
 
 	/* reset state */
+reset:
 	rq->core->core_cookie = 0UL;
 	if (rq->core->core_forceidle) {
 		need_sync = true;
@@ -5242,14 +5243,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				/*
 				 * If there weren't no cookies; we don't need to
 				 * bother with the other siblings.
-				 * If the rest of the core is not running a tagged
-				 * task, i.e.  need_sync == 0, and the current CPU
-				 * which called into the schedule() loop does not
-				 * have any tasks for this class, skip selecting for
-				 * other siblings since there's no point. We don't skip
-				 * for RT/DL because that could make CFS force-idle RT.
 				 */
-				if (i == cpu && !need_sync && class == &fair_sched_class)
+				if (i == cpu && !need_sync)
 					goto next_class;
 
 				continue;
@@ -5259,7 +5254,20 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 * Optimize the 'normal' case where there aren't any
 			 * cookies and we don't need to sync up.
 			 */
-			if (i == cpu && !need_sync && !p->core_cookie) {
+			if (i == cpu && !need_sync) {
+				if (p->core_cookie) {
+					/*
+					 * This optimization is only valid as
+					 * long as there are no cookies
+					 * involved. We may have skipped
+					 * non-empty higher priority classes on
+					 * siblings, which are empty on this
+					 * CPU, so start over.
+					 */
+					need_sync = true;
+					goto reset;
+				}
+
 				next = p;
 				goto done;
 			}
@@ -5299,7 +5307,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 					 */
 					need_sync = true;
 				}
-
 			}
 		}
 next_class:;
-- 
2.29.2.454.gaff20da3a2-goog


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-24 15:42       ` Peter Zijlstra
@ 2020-11-25  3:12         ` Li, Aubrey
  2020-11-25 22:57           ` Balbir Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Li, Aubrey @ 2020-11-25  3:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Balbir Singh, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On 2020/11/24 23:42, Peter Zijlstra wrote:
> On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote:
>>>> +#ifdef CONFIG_SCHED_CORE
>>>> +		/*
>>>> +		 * Skip this cpu if source task's cookie does not match
>>>> +		 * with CPU's core cookie.
>>>> +		 */
>>>> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>> +			continue;
>>>> +#endif
>>>> +
>>>
>>> Any reason this is under an #ifdef? In sched_core_cookie_match() won't
>>> the check for sched_core_enabled() do the right thing even when
>>> CONFIG_SCHED_CORE is not enabed?> 
>> Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not
>> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
>> sense to leave a core scheduler specific function here even at compile
>> time. Also, for the cases in hot path, this saves CPU cycles to avoid
>> a judgment.
> 
> No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is
> more.
> 

Okay, I have pasted the refined patch here.
@Joel, please let me know if you want me to send it in a separate thread.

Thanks,
-Aubrey
======================================================================
From 18e4f4592c2a159fcbae637f3a422e37ad24cb5a Mon Sep 17 00:00:00 2001
From: Aubrey Li <aubrey.li@linux.intel.com>
Date: Wed, 25 Nov 2020 02:43:46 +0000
Subject: [PATCH 14/33] sched: migration changes for core scheduling

 - Don't migrate if there is a cookie mismatch
     Load balancing tries to move a task from the busiest CPU to the
     destination CPU. When core scheduling is enabled, if the
     task's cookie does not match the destination CPU's
     core cookie, the task is skipped by this CPU. This
     mitigates the forced idle time on the destination CPU.

 - Select a cookie-matched idle CPU
     In the fast path of task wakeup, select the first cookie-matched
     idle CPU instead of the first idle CPU.

 - Find a cookie-matched idlest CPU
     In the slow path of task wakeup, find the idlest CPU whose core
     cookie matches the task's cookie.

 - Don't migrate a task if the cookie does not match
     For NUMA load balancing, don't migrate a task to a CPU whose
     core cookie does not match the task's cookie.

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c  | 58 ++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h | 33 +++++++++++++++++++++++++
 2 files changed, 86 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de82f88ba98c..7eea5da6685a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
 			continue;
 
+		/*
+		 * Skip this cpu if source task's cookie does not match
+		 * with CPU's core cookie.
+		 */
+		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+			continue;
+
 		env->dst_cpu = cpu;
 		if (task_numa_compare(env, taskimp, groupimp, maymove))
 			break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		struct rq *rq = cpu_rq(i);
+
+		if (!sched_core_cookie_match(rq, p))
+			continue;
+
 		if (sched_idle_cpu(i))
 			return i;
 
 		if (available_idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
@@ -6129,8 +6140,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
 			return -1;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-			break;
+
+		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+#ifdef CONFIG_SCHED_CORE
+			/*
+			 * If Core Scheduling is enabled, select this cpu
+			 * only if the process cookie matches core cookie.
+			 */
+			if (sched_core_enabled(cpu_rq(cpu)) &&
+			    p->core_cookie == cpu_rq(cpu)->core->core_cookie)
+#endif
+				break;
+		}
 	}
 
 	time = cpu_clock(this) - time;
@@ -7530,8 +7551,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
 	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 3) task's cookie does not match with this CPU's core cookie
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
@@ -7566,6 +7588,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+	/*
+	 * Don't migrate task if the task's cookie does not match
+	 * with the destination CPU's core cookie.
+	 */
+	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+		return 0;
+
 	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8821,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 					p->cpus_ptr))
 			continue;
 
+#ifdef CONFIG_SCHED_CORE
+		if (sched_core_enabled(cpu_rq(this_cpu))) {
+			int i = 0;
+			bool cookie_match = false;
+
+			for_each_cpu(i, sched_group_span(group)) {
+				struct rq *rq = cpu_rq(i);
+
+				if (sched_core_cookie_match(rq, p)) {
+					cookie_match = true;
+					break;
+				}
+			}
+			/* Skip over this group if no cookie matched */
+			if (!cookie_match)
+				continue;
+		}
+#endif
+
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e72942a9ee11..05b93787fe62 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1135,6 +1135,35 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
 
+/*
+ * Helper to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	bool idle_core = true;
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+		if (!available_idle_cpu(cpu)) {
+			idle_core = false;
+			break;
+		}
+	}
+
+	/*
+	 * A CPU in an idle core is always the best choice for tasks with
+	 * cookies.
+	 */
+	return idle_core || rq->core->core_cookie == p->core_cookie;
+}
+
 extern void queue_core_balance(struct rq *rq);
 
 #else /* !CONFIG_SCHED_CORE */
@@ -1153,6 +1182,10 @@ static inline void queue_core_balance(struct rq *rq)
 {
 }
 
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
 #endif /* CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.17.1


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 12/32] sched: Simplify the core pick loop for optimized case
  2020-11-24 17:04     ` Joel Fernandes
@ 2020-11-25  8:37       ` Peter Zijlstra
  0 siblings, 0 replies; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25  8:37 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 24, 2020 at 12:04:30PM -0500, Joel Fernandes wrote:
> Hi Peter,
> 
> On Tue, Nov 24, 2020 at 01:04:38PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 17, 2020 at 06:19:42PM -0500, Joel Fernandes (Google) wrote:
> > > +	/*
> > > +	 * Optimize for common case where this CPU has no cookies
> > > +	 * and there are no cookied tasks running on siblings.
> > > +	 */
> > > +	if (!need_sync) {
> > > +		for_each_class(class) {
> > > +			next = class->pick_task(rq);
> > > +			if (next)
> > > +				break;
> > > +		}
> > > +
> > > +		if (!next->core_cookie) {
> > > +			rq->core_pick = NULL;
> > > +			goto done;
> > > +		}
> > >  		need_sync = true;
> > >  	}
> > 
> > This isn't what I send you here:
> > 
> >   https://lkml.kernel.org/r/20201026093131.GF2628@hirez.programming.kicks-ass.net
> 
I had replied to it here with concerns about the effects of newly-idle
balancing not being reversible; it was only a theoretical concern:
> http://lore.kernel.org/r/20201105185019.GA2771003@google.com

Gah, missed that. I don't think that matters much; see
put_prev_task_balance() calling balance_fair().

> > Specifically, you've lost the whole cfs-cgroup optimization.
> 
> Are you referring to this optimization in pick_next_task_fair() ?
> 
> /*
>  * Since we haven't yet done put_prev_entity and if the selected task
>  * is a different task than we started out with, try and touch the
>  * least amount of cfs_rqs.
>  */

Yep, that. The giant FAIR_GROUP_SCHED hunk. The thing that makes
all of pick_next_task() more complicated than it really wants to be.

> You are right, we wouldn't get that with just calling pick_task_fair(). We
> did not have this in v8 series either though.
> 
> Also, if the task is a cookied task, then I think you are doing more work
> with your patch due to the extra put_prev_task().

Yes, but only if you mix cookie tasks with non-cookie tasks and schedule
two non-cookie tasks back-to-back. I don't think we care overly much
about that.

I think it makes more sense to ensure that if you have core-sched
enabled on your machine and have a (core-aligned) partition with
non-cookie tasks, scheduling works as close to 'normal' as possible.

> > What was wrong/not working with the below?
> 
Other than the new-idle balancing, IIRC it was also causing instability.
Maybe we can consider this optimization in the future, if that's OK with
you?

Hurmph.. you don't happen to remember what went splat?

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 15/32] sched: Improve snapshotting of min_vruntime for CGroups
  2020-11-24 17:07     ` Joel Fernandes
@ 2020-11-25  8:41       ` Peter Zijlstra
  0 siblings, 0 replies; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25  8:41 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 24, 2020 at 12:07:44PM -0500, Joel Fernandes wrote:

> Either way would be Ok with me, I would suggest retaining the history though
> so that the details in the changelog are preserved of the issues we faced,
> and in the future we can refer back to them.

Well, in part that's what we have mail archives for. My main concern is
that there's a lot of back and forth in the patches as presented and
that makes review really awkward.

I've done some of the patch munging proposed, I might do more if I find
the motivation.

But first I should probably go read all that API crud you did on top,
I've not read that before...

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 19/32] entry/idle: Enter and exit kernel protection during idle entry and exit
  2020-11-24 18:03     ` Joel Fernandes
@ 2020-11-25  8:49       ` Peter Zijlstra
  2020-12-01 18:24         ` Joel Fernandes
  0 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25  8:49 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 24, 2020 at 01:03:43PM -0500, Joel Fernandes wrote:
> On Tue, Nov 24, 2020 at 05:13:35PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 17, 2020 at 06:19:49PM -0500, Joel Fernandes (Google) wrote:

> > > +static inline void generic_idle_enter(void)
> > > +static inline void generic_idle_exit(void)

> > That naming is terrible..
> 
> Yeah sorry :-\. The naming I chose was to be aligned with the
> CONFIG_GENERIC_ENTRY naming. I am open to ideas on that.

entry_idle_{enter,exit}() ?

> > I'm confused.. arch_cpu_idle_{enter,exit}() weren't conveniently placed
> > for you?
> 
> The way this patch series works, it does not depend on arch code as much as
> possible. Since there are other arch that may need this patchset such as ARM,
> it may be better to keep it in the generic entry code.  Thoughts?

I didn't necessarily mean using those hooks; even placing your new hooks
right next to them would have covered the exact same code with fewer lines
modified.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-17 23:19 ` [PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode Joel Fernandes (Google)
  2020-11-24 16:09   ` Peter Zijlstra
@ 2020-11-25  9:37   ` Peter Zijlstra
  2020-12-01 17:55     ` Joel Fernandes
  2020-11-26  5:37   ` Balbir Singh
  2 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25  9:37 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Tim Chen, Paul E . McKenney

On Tue, Nov 17, 2020 at 06:19:48PM -0500, Joel Fernandes (Google) wrote:
> Core-scheduling prevents hyperthreads in usermode from attacking each
> other, but it does not do anything about one of the hyperthreads
> entering the kernel for any reason. This leaves the door open for MDS
> and L1TF attacks with concurrent execution sequences between
> hyperthreads.
> 
> This patch therefore adds support for protecting all syscall and IRQ
> kernel mode entries. Care is taken to track the outermost usermode exit
> and entry using per-cpu counters. In cases where one of the hyperthreads
> enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> when not needed - example: idle and non-cookie HTs do not need to be
> forced into kernel mode.
> 
> More information about attacks:
> For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> data to either host or guest attackers. For L1TF, it is possible to leak
> to guest attackers. There is no possible mitigation involving flushing
> of buffers to avoid this, since the execution of attacker and victims
> happens concurrently on 2 or more HTs.

>  .../admin-guide/kernel-parameters.txt         |  11 +
>  include/linux/entry-common.h                  |  12 +-
>  include/linux/sched.h                         |  12 +
>  kernel/entry/common.c                         |  28 +-
>  kernel/sched/core.c                           | 241 ++++++++++++++++++
>  kernel/sched/sched.h                          |   3 +
>  6 files changed, 304 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index bd1a5b87a5e2..b185c6ed4aba 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4678,6 +4678,17 @@
>  
>  	sbni=		[NET] Granch SBNI12 leased line adapter
>  
> +	sched_core_protect_kernel=
> +			[SCHED_CORE] Pause SMT siblings of a core running in
> +			user mode, if at least one of the siblings of the core
> +			is running in kernel mode. This is to guarantee that
> +			kernel data is not leaked to tasks which are not trusted
> +			by the kernel. A value of 0 disables protection, 1
> +			enables protection. The default is 1. Note that protection
> +			depends on the arch defining the _TIF_UNSAFE_RET flag.
> +			Further, for protecting VMEXIT, arch needs to call
> +			KVM entry/exit hooks.
> +
>  	sched_debug	[KNL] Enables verbose scheduler debug messages.
>  
>  	schedstats=	[KNL,X86] Enable or disable scheduled statistics.

So I don't like the parameter name, it's too long. Also I don't like it
because it's a boolean.

You're adding syscall,irq,kvm under a single knob where they're all due
to different flavours of broken. Different hardware might want/need
different combinations.

Hardware without MDS but with L1TF wouldn't need the syscall hook, but
you're not giving a choice here. And this is generic code; you can't
assume stuff like this.


* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-17 23:19 ` [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork Joel Fernandes (Google)
@ 2020-11-25 11:07   ` Peter Zijlstra
  2020-12-01 18:56     ` Joel Fernandes
  2020-11-25 11:10   ` Peter Zijlstra
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25 11:07 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> Also, for the per-task cookie, it will get weird if we use pointers of any
> ephemeral objects. For this reason, introduce a refcounted object whose sole
> purpose is to assign a unique cookie value by way of the object's pointer.

Might be useful to explain why exactly none of the many pid_t's are
good enough.


* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-17 23:19 ` [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork Joel Fernandes (Google)
  2020-11-25 11:07   ` Peter Zijlstra
@ 2020-11-25 11:10   ` Peter Zijlstra
  2020-12-01 19:20     ` Joel Fernandes
  2020-11-25 11:11   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25 11:10 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> +void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
> +{
> +	if (!p)
> +		return;
> +
> +	if (group)
> +		p->core_group_cookie = cookie;
> +	else
> +		p->core_task_cookie = cookie;
> +
> +	/* Use up half of the cookie's bits for task cookie and remaining for group cookie. */
> +	p->core_cookie = (p->core_task_cookie <<
> +				(sizeof(unsigned long) * 4)) + p->core_group_cookie;

This seems dangerous; afaict there is nothing that prevents cookie
collision.



* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-17 23:19 ` [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork Joel Fernandes (Google)
  2020-11-25 11:07   ` Peter Zijlstra
  2020-11-25 11:10   ` Peter Zijlstra
@ 2020-11-25 11:11   ` Peter Zijlstra
  2020-12-01 19:16     ` Joel Fernandes
  2020-11-25 11:15   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25 11:11 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:

> + * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.

sched_core_set_cookie() would be a saner name, given that description,
don't you think?


* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-17 23:19 ` [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork Joel Fernandes (Google)
                     ` (2 preceding siblings ...)
  2020-11-25 11:11   ` Peter Zijlstra
@ 2020-11-25 11:15   ` Peter Zijlstra
  2020-12-01 19:11     ` Joel Fernandes
  2020-11-25 12:54   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25 11:15 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:

> +/*
> + * Ensure that the task has been requeued. The stopper ensures that the task cannot
> + * be migrated to a different CPU while its core scheduler queue state is being updated.
> + * It also makes sure to requeue a task if it was running actively on another CPU.
> + */
> +static int sched_core_task_join_stopper(void *data)
> +{
> +	struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
> +	int i;
> +
> +	for (i = 0; i < 2; i++)
> +		sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], false /* !group */);
> +
> +	return 0;
> +}
> +
> +static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
> +{

> +	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);

> +}

This is *REALLY* terrible...


* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-17 23:19 ` [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork Joel Fernandes (Google)
                     ` (3 preceding siblings ...)
  2020-11-25 11:15   ` Peter Zijlstra
@ 2020-11-25 12:54   ` Peter Zijlstra
  2020-12-01 18:38     ` Joel Fernandes
  2020-11-25 13:03   ` Peter Zijlstra
  2020-11-30 23:05   ` Balbir Singh
  6 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25 12:54 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> +/* Per-task interface */
> +static unsigned long sched_core_alloc_task_cookie(void)
> +{
> +	struct sched_core_cookie *ptr =
> +		kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
> +
> +	if (!ptr)
> +		return 0;
> +	refcount_set(&ptr->refcnt, 1);
> +
> +	/*
> +	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> +	 * is done after the stopper runs.
> +	 */
> +	sched_core_get();
> +	return (unsigned long)ptr;
> +}
> +
> +static bool sched_core_get_task_cookie(unsigned long cookie)
> +{
> +	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> +
> +	/*
> +	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> +	 * is done after the stopper runs.
> +	 */
> +	sched_core_get();
> +	return refcount_inc_not_zero(&ptr->refcnt);
> +}
> +
> +static void sched_core_put_task_cookie(unsigned long cookie)
> +{
> +	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> +
> +	if (refcount_dec_and_test(&ptr->refcnt))
> +		kfree(ptr);
> +}

> +	/*
> +	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
> +	 *       sched_core_put_task_cookie(). However, sched_core_put() is done
> +	 *       by this function *after* the stopper removes the tasks from the
> +	 *       core queue, and not before. This is just to play it safe.
> +	 */

So for no reason whatsoever you've made the code more difficult?


* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-17 23:19 ` [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork Joel Fernandes (Google)
                     ` (4 preceding siblings ...)
  2020-11-25 12:54   ` Peter Zijlstra
@ 2020-11-25 13:03   ` Peter Zijlstra
  2020-12-01 18:52     ` Joel Fernandes
  2020-11-30 23:05   ` Balbir Singh
  6 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25 13:03 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> +static bool sched_core_get_task_cookie(unsigned long cookie)
> +{
> +	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> +
> +	/*
> +	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> +	 * is done after the stopper runs.
> +	 */
> +	sched_core_get();
> +	return refcount_inc_not_zero(&ptr->refcnt);

See below, but afaict this should be refcount_inc().

> +}

> +	/*
> +	 * 		t1		joining		t2
> +	 * CASE 1:
> +	 * before	0				0
> +	 * after	new cookie			new cookie
> +	 *
> +	 * CASE 2:
> +	 * before	X (non-zero)			0
> +	 * after	0				0
> +	 *
> +	 * CASE 3:
> +	 * before	0				X (non-zero)
> +	 * after	X				X
> +	 *
> +	 * CASE 4:
> +	 * before	Y (non-zero)			X (non-zero)
> +	 * after	X				X
> +	 */
> +	if (!t1->core_task_cookie && !t2->core_task_cookie) {
> +		/* CASE 1. */
> +		cookie = sched_core_alloc_task_cookie();
> +		if (!cookie)
> +			goto out_unlock;
> +
> +		/* Add another reference for the other task. */
> +		if (!sched_core_get_task_cookie(cookie)) {

afaict this should be refcount_inc(), as this can never fail and if it
does, it's an error.

> +			ret = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		wr.tasks[0] = t1;
> +		wr.tasks[1] = t2;
> +		wr.cookies[0] = wr.cookies[1] = cookie;
> +
> +	} else if (t1->core_task_cookie && !t2->core_task_cookie) {
> +		/* CASE 2. */
> +		sched_core_put_task_cookie(t1->core_task_cookie);
> +		sched_core_put_after_stopper = true;
> +
> +		wr.tasks[0] = t1; /* Reset cookie for t1. */
> +
> +	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
> +		/* CASE 3. */
> +		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {

afaict this can never fail either, because you're calling in here with a
reference on t2.

> +			ret = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		wr.tasks[0] = t1;
> +		wr.cookies[0] = t2->core_task_cookie;
> +
> +	} else {
> +		/* CASE 4. */
> +		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {

Same.

> +			ret = -EINVAL;
> +			goto out_unlock;
> +		}
> +		sched_core_put_task_cookie(t1->core_task_cookie);
> +		sched_core_put_after_stopper = true;
> +
> +		wr.tasks[0] = t1;
> +		wr.cookies[0] = t2->core_task_cookie;
> +	}


* Re: [PATCH -tip 24/32] sched: Release references to the per-task cookie on exit
  2020-11-17 23:19 ` [PATCH -tip 24/32] sched: Release references to the per-task cookie on exit Joel Fernandes (Google)
@ 2020-11-25 13:03   ` Peter Zijlstra
  0 siblings, 0 replies; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25 13:03 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:54PM -0500, Joel Fernandes (Google) wrote:
> During exit, we have to free the references to a cookie that might be shared by
> many tasks. This commit therefore ensures that, when the task_struct is released,
> any references to cookies that it holds are also released.

This is one of those patches that just shouldn't exist. Squash it into
whatever patch that introduces this nonsense.


* Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface
  2020-11-17 23:19 ` [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
@ 2020-11-25 13:08   ` Peter Zijlstra
  2020-12-01 19:36     ` Joel Fernandes
  2020-12-02 21:47   ` Chris Hyser
  1 sibling, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25 13:08 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:53PM -0500, Joel Fernandes (Google) wrote:
> +/* Called from prctl interface: PR_SCHED_CORE_SHARE */
> +int sched_core_share_pid(pid_t pid)
> +{
> +	struct task_struct *task;
> +	int err;
> +
> +	if (pid == 0) { /* Reset current task's cookie. */
> +		/* Resetting a cookie requires privileges. */
> +		if (current->core_task_cookie)
> +			if (!capable(CAP_SYS_ADMIN))
> +				return -EPERM;

Coding-Style fail.

Also, why?!? I realize it is true for your case, because hardware fail.
But in general this just isn't true. This wants to be some configurable
policy.


* Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
  2020-11-17 23:19 ` [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase Joel Fernandes (Google)
@ 2020-11-25 13:42   ` Peter Zijlstra
  2020-11-30 23:10     ` Balbir Singh
                       ` (2 more replies)
  0 siblings, 3 replies; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25 13:42 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:56PM -0500, Joel Fernandes (Google) wrote:
> From: Josh Don <joshdon@google.com>
> 
> Google has a usecase where the first-level tag on a CGroup is not
> sufficient. So, a patch has been carried for years in which a second tag is
> added that is writable by unprivileged users.
> 
> Google uses DAC controls to make the 'tag' settable only by root, while
> the second-level 'color' can be changed by anyone. The actual names that
> Google uses are different, but the concept is the same.
> 
> The hierarchy looks like:
> 
> Root group
>    / \
>   A   B    (These are created by the root daemon - borglet).
>  / \   \
> C   D   E  (These are created by AppEngine within the container).
> 
> The reason Google has two parts is that AppEngine wants to allow a subset of
> subcgroups within a parent tagged cgroup to share execution. Think of these
> subcgroups as belonging to the same customer or project. Because these
> subcgroups are created by AppEngine, they are not tracked by borglet (the
> root daemon), so borglet won't have a chance to set a color for them. That's
> where the 'color' file comes from. The color can be set by AppEngine, and once
> set, normal tasks within the subcgroup cannot overwrite it. This is
> enforced by promoting the permission of the color file in cgroupfs.

Why can't the above work by setting 'tag' (that's a terrible name, why
does that still live) in CDE? Have the most specific tag live. Same with
that thread stuff.

All this API stuff here is a complete and utter trainwreck. Please just
delete the patches and start over. Hint: if you use stop_machine(),
you're doing it wrong.

At best you now have the requirements sorted.


* Re: [PATCH -tip 31/32] sched: Add a coresched command line option
  2020-11-17 23:20 ` [PATCH -tip 31/32] sched: Add a coresched command line option Joel Fernandes (Google)
  2020-11-19 23:39   ` Randy Dunlap
@ 2020-11-25 13:45   ` Peter Zijlstra
  2020-11-26  0:11     ` Balbir Singh
  1 sibling, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-25 13:45 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:20:01PM -0500, Joel Fernandes (Google) wrote:
> Some hardware such as certain AMD variants don't have cross-HT MDS/L1TF
> issues. Detect this and don't enable core scheduling as it can
> needlessly slow those devices down.
> 
> However, some users may want core scheduling even if the hardware is
> secure. To support them, add a coresched= option which defaults to
> 'secure' and can be overridden to 'on' if the user wants to enable
> coresched even if the HW is not vulnerable. 'off' would disable
> core scheduling in any case.

This is all sorts of wrong, and the reason is that you hard-coded
that stupid policy.

Core scheduling should always be available on SMT (provided you did that
CONFIG_ thing). Even on AMD systems RT tasks might want to claim the
core exclusively.


* Re: [PATCH -tip 02/32] sched: Introduce sched_class::pick_task()
  2020-11-17 23:19 ` [PATCH -tip 02/32] sched: Introduce sched_class::pick_task() Joel Fernandes (Google)
  2020-11-19 23:56   ` Singh, Balbir
@ 2020-11-25 16:28   ` Vincent Guittot
  2020-11-26  9:07     ` Peter Zijlstra
  1 sibling, 1 reply; 150+ messages in thread
From: Vincent Guittot @ 2020-11-25 16:28 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Gleixner,
	linux-kernel, Ingo Molnar, Linus Torvalds, Frederic Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	dfaggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	Jiang Biao, Alexandre Chartre, James Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, jsbarnes, Hyser,Chris, Ben Segall,
	Josh Don, Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney,
	Tim Chen

On Wed, 18 Nov 2020 at 00:20, Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> From: Peter Zijlstra <peterz@infradead.org>
>
> Because sched_class::pick_next_task() also implies
> sched_class::set_next_task() (and possibly put_prev_task() and
> newidle_balance) it is not state invariant. This makes it unsuitable
> for remote task selection.
>
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/deadline.c  | 16 ++++++++++++++--
>  kernel/sched/fair.c      | 32 +++++++++++++++++++++++++++++++-
>  kernel/sched/idle.c      |  8 ++++++++
>  kernel/sched/rt.c        | 15 +++++++++++++--
>  kernel/sched/sched.h     |  3 +++
>  kernel/sched/stop_task.c | 14 ++++++++++++--
>  6 files changed, 81 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 0f2ea0a3664c..abfc8b505d0d 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1867,7 +1867,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
>         return rb_entry(left, struct sched_dl_entity, rb_node);
>  }
>
> -static struct task_struct *pick_next_task_dl(struct rq *rq)
> +static struct task_struct *pick_task_dl(struct rq *rq)
>  {
>         struct sched_dl_entity *dl_se;
>         struct dl_rq *dl_rq = &rq->dl;
> @@ -1879,7 +1879,18 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
>         dl_se = pick_next_dl_entity(rq, dl_rq);
>         BUG_ON(!dl_se);
>         p = dl_task_of(dl_se);
> -       set_next_task_dl(rq, p, true);
> +
> +       return p;
> +}
> +
> +static struct task_struct *pick_next_task_dl(struct rq *rq)
> +{
> +       struct task_struct *p;
> +
> +       p = pick_task_dl(rq);
> +       if (p)
> +               set_next_task_dl(rq, p, true);
> +
>         return p;
>  }
>
> @@ -2551,6 +2562,7 @@ DEFINE_SCHED_CLASS(dl) = {
>
>  #ifdef CONFIG_SMP
>         .balance                = balance_dl,
> +       .pick_task              = pick_task_dl,
>         .select_task_rq         = select_task_rq_dl,
>         .migrate_task_rq        = migrate_task_rq_dl,
>         .set_cpus_allowed       = set_cpus_allowed_dl,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 52ddfec7cea6..12cf068eeec8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4459,7 +4459,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>          * Avoid running the skip buddy, if running something else can
>          * be done without getting too unfair.
>          */
> -       if (cfs_rq->skip == se) {
> +       if (cfs_rq->skip && cfs_rq->skip == se) {
>                 struct sched_entity *second;
>
>                 if (se == curr) {
> @@ -7017,6 +7017,35 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>                 set_last_buddy(se);
>  }
>
> +#ifdef CONFIG_SMP
> +static struct task_struct *pick_task_fair(struct rq *rq)
> +{
> +       struct cfs_rq *cfs_rq = &rq->cfs;
> +       struct sched_entity *se;
> +
> +       if (!cfs_rq->nr_running)
> +               return NULL;
> +
> +       do {
> +               struct sched_entity *curr = cfs_rq->curr;
> +
> +               se = pick_next_entity(cfs_rq, NULL);

Calling pick_next_entity() clears the buddies. This is fine without
coresched because the se will be the next one to run. But calling
pick_task_fair() doesn't mean that the se will actually be used.

> +
> +               if (curr) {
> +                       if (se && curr->on_rq)
> +                               update_curr(cfs_rq);
> +

Shouldn't you check if the cfs_rq is throttled?

> +                       if (!se || entity_before(curr, se))
> +                               se = curr;
> +               }
> +
> +               cfs_rq = group_cfs_rq(se);
> +       } while (cfs_rq);
> +
> +       return task_of(se);
> +}
> +#endif
> +
>  struct task_struct *
>  pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
> @@ -11219,6 +11248,7 @@ DEFINE_SCHED_CLASS(fair) = {
>
>  #ifdef CONFIG_SMP
>         .balance                = balance_fair,
> +       .pick_task              = pick_task_fair,
>         .select_task_rq         = select_task_rq_fair,
>         .migrate_task_rq        = migrate_task_rq_fair,
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 50e128b899c4..33864193a2f9 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -406,6 +406,13 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
>         schedstat_inc(rq->sched_goidle);
>  }
>
> +#ifdef CONFIG_SMP
> +static struct task_struct *pick_task_idle(struct rq *rq)
> +{
> +       return rq->idle;
> +}
> +#endif
> +
>  struct task_struct *pick_next_task_idle(struct rq *rq)
>  {
>         struct task_struct *next = rq->idle;
> @@ -473,6 +480,7 @@ DEFINE_SCHED_CLASS(idle) = {
>
>  #ifdef CONFIG_SMP
>         .balance                = balance_idle,
> +       .pick_task              = pick_task_idle,
>         .select_task_rq         = select_task_rq_idle,
>         .set_cpus_allowed       = set_cpus_allowed_common,
>  #endif
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index a6f9d132c24f..a0e245b0c4bd 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1626,7 +1626,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
>         return rt_task_of(rt_se);
>  }
>
> -static struct task_struct *pick_next_task_rt(struct rq *rq)
> +static struct task_struct *pick_task_rt(struct rq *rq)
>  {
>         struct task_struct *p;
>
> @@ -1634,7 +1634,17 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
>                 return NULL;
>
>         p = _pick_next_task_rt(rq);
> -       set_next_task_rt(rq, p, true);
> +
> +       return p;
> +}
> +
> +static struct task_struct *pick_next_task_rt(struct rq *rq)
> +{
> +       struct task_struct *p = pick_task_rt(rq);
> +
> +       if (p)
> +               set_next_task_rt(rq, p, true);
> +
>         return p;
>  }
>
> @@ -2483,6 +2493,7 @@ DEFINE_SCHED_CLASS(rt) = {
>
>  #ifdef CONFIG_SMP
>         .balance                = balance_rt,
> +       .pick_task              = pick_task_rt,
>         .select_task_rq         = select_task_rq_rt,
>         .set_cpus_allowed       = set_cpus_allowed_common,
>         .rq_online              = rq_online_rt,
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f794c9337047..5a0dd2b312aa 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1839,6 +1839,9 @@ struct sched_class {
>  #ifdef CONFIG_SMP
>         int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
>         int  (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
> +
> +       struct task_struct * (*pick_task)(struct rq *rq);
> +
>         void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
>
>         void (*task_woken)(struct rq *this_rq, struct task_struct *task);
> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index 55f39125c0e1..f988ebe3febb 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -34,15 +34,24 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool fir
>         stop->se.exec_start = rq_clock_task(rq);
>  }
>
> -static struct task_struct *pick_next_task_stop(struct rq *rq)
> +static struct task_struct *pick_task_stop(struct rq *rq)
>  {
>         if (!sched_stop_runnable(rq))
>                 return NULL;
>
> -       set_next_task_stop(rq, rq->stop, true);
>         return rq->stop;
>  }
>
> +static struct task_struct *pick_next_task_stop(struct rq *rq)
> +{
> +       struct task_struct *p = pick_task_stop(rq);
> +
> +       if (p)
> +               set_next_task_stop(rq, p, true);
> +
> +       return p;
> +}
> +
>  static void
>  enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
>  {
> @@ -123,6 +132,7 @@ DEFINE_SCHED_CLASS(stop) = {
>
>  #ifdef CONFIG_SMP
>         .balance                = balance_stop,
> +       .pick_task              = pick_task_stop,
>         .select_task_rq         = select_task_rq_stop,
>         .set_cpus_allowed       = set_cpus_allowed_common,
>  #endif
> --
> 2.29.2.299.gdc1121823c-goog
>

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 13/32] sched: Trivial forced-newidle balancer
  2020-11-24  0:32         ` Li, Aubrey
@ 2020-11-25 21:28           ` Balbir Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Balbir Singh @ 2020-11-25 21:28 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Paul E . McKenney,
	Tim Chen

On Tue, Nov 24, 2020 at 08:32:01AM +0800, Li, Aubrey wrote:
> On 2020/11/24 7:35, Balbir Singh wrote:
> > On Mon, Nov 23, 2020 at 11:07:27PM +0800, Li, Aubrey wrote:
> >> On 2020/11/23 12:38, Balbir Singh wrote:
> >>> On Tue, Nov 17, 2020 at 06:19:43PM -0500, Joel Fernandes (Google) wrote:
> >>>> From: Peter Zijlstra <peterz@infradead.org>
> >>>>
> >>>> When a sibling is forced-idle to match the core-cookie; search for
> >>>> matching tasks to fill the core.
> >>>>
> >>>> rcu_read_unlock() can incur an infrequent deadlock in
> >>>> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
> >>>>
> >>> ...
> >>>> +
> >>>> +		if (p->core_occupation > dst->idle->core_occupation)
> >>>> +			goto next;
> >>>> +
> >>>
> >>> I am unable to understand this check, a comment or clarification in the
> >>> changelog will help. I presume we are looking at either one or two cpus
> >>> to define the core_occupation and we expect to match it against the
> >>> destination CPU.
> >>
> >> IIUC, this check prevents a task from jumping among the cores forever.
> >>
> >> For example, on a SMT2 platform:
> >> - core0 runs taskA and taskB, core_occupation is 2
> >> - core1 runs taskC, core_occupation is 1
> >>
> >> Without this check, taskB could ping-pong between core0 and core1 by core load
> >> balance.
> > 
> > But the comparison is p->core_occupation (as in the task's core occupation,
> > not sure what that means, can a task have a core_occupation of > 1?)
> >
> 
> p->core_occupation is assigned the core's occupation at the last
> pick_next_task (so yes, it can have a core_occupation > 1).
>

Hmm.. I find that hard to interpret. But I am happy to re-read the
code again.

Balbir Singh.


* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-25  3:12         ` Li, Aubrey
@ 2020-11-25 22:57           ` Balbir Singh
  2020-11-26  3:20             ` Li, Aubrey
  0 siblings, 1 reply; 150+ messages in thread
From: Balbir Singh @ 2020-11-25 22:57 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Wed, Nov 25, 2020 at 11:12:53AM +0800, Li, Aubrey wrote:
> On 2020/11/24 23:42, Peter Zijlstra wrote:
> > On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote:
> >>>> +#ifdef CONFIG_SCHED_CORE
> >>>> +		/*
> >>>> +		 * Skip this cpu if source task's cookie does not match
> >>>> +		 * with CPU's core cookie.
> >>>> +		 */
> >>>> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
> >>>> +			continue;
> >>>> +#endif
> >>>> +
> >>>
> >>> Any reason this is under an #ifdef? In sched_core_cookie_match() won't
> >>> the check for sched_core_enabled() do the right thing even when
> >>> CONFIG_SCHED_CORE is not enabled?
> >> Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not
> >> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
> >> sense to leave a core scheduler specific function here even at compile
> >> time. Also, for the cases in the hot path, this saves CPU cycles by
> >> avoiding a branch.
> > 
> > No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is
> > more.
> > 
> 
> Okay, I pasted the refined patch here.
> @Joel, please let me know if you want me to send it in a separated thread.
>

You still have a bunch of #ifdefs, can't we just do

#ifndef CONFIG_SCHED_CORE
static inline bool sched_core_enabled(struct rq *rq)
{
        return false;
}
#endif

and frankly I think even that is not needed because there is a jump
label __sched_core_enabled that tells us if sched_core is enabled or
not.

Balbir



* Re: [PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling
  2020-11-24 18:30     ` Joel Fernandes
@ 2020-11-25 23:05       ` Balbir Singh
  2020-11-26  8:29         ` Peter Zijlstra
  2020-12-01 17:49         ` Joel Fernandes
  0 siblings, 2 replies; 150+ messages in thread
From: Balbir Singh @ 2020-11-25 23:05 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Tue, Nov 24, 2020 at 01:30:38PM -0500, Joel Fernandes wrote:
> On Mon, Nov 23, 2020 at 09:41:23AM +1100, Balbir Singh wrote:
> > On Tue, Nov 17, 2020 at 06:19:40PM -0500, Joel Fernandes (Google) wrote:
> > > From: Peter Zijlstra <peterz@infradead.org>
> > > 
> > > The rationale is as follows. In the core-wide pick logic, even if
> > > need_sync == false, we need to go look at other CPUs (non-local CPUs) to
> > > see if they could be running RT.
> > > 
> > > Say the RQs in a particular core look like this:
> > > Let CFS1 and CFS2 be two tagged CFS tasks. Let RT1 be an untagged RT task.
> > > 
> > > rq0            rq1
> > > CFS1 (tagged)  RT1 (untagged)
> > > CFS2 (tagged)
> > > 
> > > Say schedule() runs on rq0. Now, it will enter the above loop and
> > > pick_task(RT) will return NULL for 'p'. It will enter the above if() block
> > > and see that need_sync == false and will skip RT entirely.
> > > 
> > > The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> > > rq0             rq1
> > > CFS1            IDLE
> > > 
> > > When it should have selected:
> > > rq0             rq1
> > > IDLE            RT
> > > 
> > > Joel saw this issue on real-world use cases in ChromeOS where an RT task
> > > gets constantly force-idled and breaks RT. Let's cure it.
> > > 
> > > NOTE: This problem will be fixed differently in a later patch. It is just
> > >       kept here for reference about this issue, and to make
> > >       applying later patches easier.
> > >
> > 
> > The changelog is hard to read; it refers to an if() block above, but
> > there is no code snippet in the changelog.
> 
> Yeah sorry, it comes from this email where I described the issue:
> http://lore.kernel.org/r/20201023175724.GA3563800@google.com
> 
> I corrected the changelog and appended the patch below. Also pushed it to:
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched
> 
> > Also, from what I can see following
> > the series, p->core_cookie is not yet set anywhere (unless I missed it),
> > so fixing it in here did not make sense just reading the series.
> 
> The interface patches for core_cookie are added later, that's how it is. The
> infrastructure comes first here. It would not make sense to add the
> interface first, so I think the current ordering is fine.
>

Some comments below to help make the code easier to understand

> ---8<-----------------------
> 
> From: Peter Zijlstra <peterz@infradead.org>
> Subject: [PATCH] sched: Fix priority inversion of cookied task with sibling
> 
> The rationale is as follows. In the core-wide pick logic, even if
> need_sync == false, we need to go look at other CPUs (non-local CPUs) to
> see if they could be running RT.
> 
> Say the RQs in a particular core look like this:
> Let CFS1 and CFS2 be two tagged CFS tasks. Let RT1 be an untagged RT task.
> 
> rq0            rq1
> CFS1 (tagged)  RT1 (untagged)
> CFS2 (tagged)
> 
> The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> rq0             rq1
> CFS1            IDLE
> 
> When it should have selected:
> rq0             rq1
> IDLE            RT
> 
> Fix this issue by forcing need_sync and restarting the search if a
> cookied task was discovered. This prevents the optimization from
> making incorrect picks.
> 
> Joel saw this issue on real-world use cases in ChromeOS where an RT task
> gets constantly force-idled and breaks RT. Let's cure it.
> 
> NOTE: This problem will be fixed differently in a later patch. It is just
>       kept here for reference about this issue, and to make
>       applying later patches easier.
> 
> Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/core.c | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4ee4902c2cf5..53af817740c0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5195,6 +5195,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	need_sync = !!rq->core->core_cookie;
>  
>  	/* reset state */
> +reset:
>  	rq->core->core_cookie = 0UL;
>  	if (rq->core->core_forceidle) {
>  		need_sync = true;
> @@ -5242,14 +5243,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  				/*
>  				 * If there weren't no cookies; we don't need to
>  				 * bother with the other siblings.
> -				 * If the rest of the core is not running a tagged
> -				 * task, i.e.  need_sync == 0, and the current CPU
> -				 * which called into the schedule() loop does not
> -				 * have any tasks for this class, skip selecting for
> -				 * other siblings since there's no point. We don't skip
> -				 * for RT/DL because that could make CFS force-idle RT.
>  				 */
> -				if (i == cpu && !need_sync && class == &fair_sched_class)
> +				if (i == cpu && !need_sync)
>  					goto next_class;
>  
>  				continue;
> @@ -5259,7 +5254,20 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  			 * Optimize the 'normal' case where there aren't any
>  			 * cookies and we don't need to sync up.
>  			 */
> -			if (i == cpu && !need_sync && !p->core_cookie) {
> +			if (i == cpu && !need_sync) {
> +				if (p->core_cookie) {
> +					/*
> +					 * This optimization is only valid as
> +					 * long as there are no cookies

This is not entirely true: need_sync is a function of core cookies, so I
think this needs more clarification. It sounds like we enter this when
the core has no cookies but the task has a core_cookie? The term cookie
is quite overloaded when used in the context of core vs task.

Effectively from what I understand this means that p wants to be
coscheduled, but the core itself is not coscheduling anything at the
moment, so we need to see if we should do a sync and that sync might
cause p to get kicked out and a higher priority class to come in?

Balbir Singh.


* Re: [PATCH -tip 09/32] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-11-24  9:09         ` Peter Zijlstra
@ 2020-11-25 23:17           ` Balbir Singh
  2020-11-26  8:23             ` Peter Zijlstra
  0 siblings, 1 reply; 150+ messages in thread
From: Balbir Singh @ 2020-11-25 23:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vineeth Pillai, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Aaron Lu,
	Aubrey Li, tglx, linux-kernel, mingo, torvalds, fweisbec,
	keescook, kerrnel, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	dfaggioli, pjt, rostedt, derkling, benbjiang, Alexandre Chartre,
	James.Bottomley, OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes,
	chris.hyser, Ben Segall, Josh Don, Hao Luo, Tom Lendacky,
	Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 24, 2020 at 10:09:55AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 24, 2020 at 10:31:49AM +1100, Balbir Singh wrote:
> > On Mon, Nov 23, 2020 at 07:31:31AM -0500, Vineeth Pillai wrote:
> > > Hi Balbir,
> > > 
> > > On 11/22/20 6:44 AM, Balbir Singh wrote:
> > > > 
> > > > This seems cumbersome, is there no way to track the min_vruntime via
> > > > rq->core->min_vruntime?
> > > Do you mean to have a core wide min_vruntime? We had a
> > > similar approach from v3 to v7 and it had major issues which
> > > broke the assumptions of cfs. There were some lengthy
> > > discussions and Peter explained in-depth about the issues:
> > > 
> > > https://lwn.net/ml/linux-kernel/20200506143506.GH5298@hirez.programming.kicks-ass.net/
> > > https://lwn.net/ml/linux-kernel/20200515103844.GG2978@hirez.programming.kicks-ass.net/
> > >
> > 
> > One of the equations in the link is
> > 
> > ">From which immediately follows that:
> > 
> >           T_k + T_l
> >   S_k+l = ---------                                       (13)
> >           W_k + W_l
> > 
> > On which we can define a combined lag:
> > 
> >   lag_k+l(i) := S_k+l - s_i                               (14)
> > 
> > And that gives us the tools to compare tasks across a combined runqueue.
> > "
> > 
> > S_k+l reads like rq->core->vruntime, but it sounds like the equivalent
> > of rq->core->vruntime is updated when we enter forced idle as opposed to
> > all the time.
> 
> Yes, but actually computing and maintaining it is hella hard. Try it
> with the very first example in that email (the infeasible weight
> scenario) and tell me how it works for you ;-)
>

OK, I was hoping it could be done in the new RBTree's enqueue/dequeue,
but yes I've not implemented it and I should go back and take a look at
the first example again.

> Also note that the text below (6) mentions dynamic, then look up the
> EEVDF paper which describes some of the dynamics -- the paper is
> incomplete and contains a bug, I forget if it ever got updated or if
> there's another paper that completes it (the BQF I/O scheduler started
> from that and fixed it).

I see, I am yet to read the EEVDF paper, but now I am out on a tangent
:)

> 
> I'm not saying it cannot be done, I'm just saying it is really rather
> involved and probably not worth it.
>

Fair enough

> The basic observation the current approach relies on is that all that
> faffery basically boils down to the fact that vruntime only means
> something when there is contention. And that only the progression is
> important not the actual value. That is, this is all fundamentally a
> differential equation and our integration constant is meaningless (also
> embodied in (7)).
>

I'll reread (6) and (7). I am trying to understand forced idle and
contention together; from what I understand of the patches, there is

1. Two tasks that are core scheduled: in that case vruntime works as
expected on each CPU, but we need to compare their combined vruntime
against other tasks on each CPU respectively for them to be
selected/chosen?
2. When one of the tasks selected is part of the core scheduling group
and the other CPU does not select a core-scheduled task, we need to ask
ourselves if that CPU should force idle, and that's where this logic
comes into play?

> Also, I think the code as proposed here relies on SMT2 and is buggered
> for SMT3+. Now, that second link above describes means of making SMT3+
> work, but we're not there yet.

Thanks,
Balbir Singh.


* Re: [PATCH -tip 02/32] sched: Introduce sched_class::pick_task()
  2020-11-20 16:58     ` Joel Fernandes
@ 2020-11-25 23:19       ` Balbir Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Balbir Singh @ 2020-11-25 23:19 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Fri, Nov 20, 2020 at 11:58:54AM -0500, Joel Fernandes wrote:
> On Fri, Nov 20, 2020 at 10:56:09AM +1100, Singh, Balbir wrote:
> [..] 
> > > +#ifdef CONFIG_SMP
> > > +static struct task_struct *pick_task_fair(struct rq *rq)
> > > +{
> > > +	struct cfs_rq *cfs_rq = &rq->cfs;
> > > +	struct sched_entity *se;
> > > +
> > > +	if (!cfs_rq->nr_running)
> > > +		return NULL;
> > > +
> > > +	do {
> > > +		struct sched_entity *curr = cfs_rq->curr;
> > > +
> > > +		se = pick_next_entity(cfs_rq, NULL);
> > > +
> > > +		if (curr) {
> > > +			if (se && curr->on_rq)
> > > +				update_curr(cfs_rq);
> > > +
> > > +			if (!se || entity_before(curr, se))
> > > +				se = curr;
> > > +		}
> > 
> > Do we want to optimize this a bit 
> > 
> > if (curr) {
> > 	if (!se || entity_before(curr, se))
> > 		se = curr;
> > 
> > 	if ((se != curr) && curr->on_rq)
> > 		update_curr(cfs_rq);
> > 
> > }
> 
> Do you see a difference in codegen? What's the optimization?
> 
> Also in later patches this code changes, so it should not matter:
> See: 3e0838fa3c51 ("sched/fair: Fix pick_task_fair crashes due to empty rbtree")
>

I did see the next patch, but the idea is that we don't need to
update_curr() if the picked sched entity is the same as current (se ==
curr). What are the side-effects of not updating curr when se == curr?

Balbir



* Re: [PATCH -tip 31/32] sched: Add a coresched command line option
  2020-11-25 13:45   ` Peter Zijlstra
@ 2020-11-26  0:11     ` Balbir Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Balbir Singh @ 2020-11-26  0:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Wed, Nov 25, 2020 at 02:45:37PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:20:01PM -0500, Joel Fernandes (Google) wrote:
> > Some hardware such as certain AMD variants don't have cross-HT MDS/L1TF
> > issues. Detect this and don't enable core scheduling as it can
> > needlessly slow those devices down.
> > 
> > However, some users may want core scheduling even if the hardware is
> > secure. To support them, add a coresched= option which defaults to
> > 'secure' and can be overridden to 'on' if the user wants to enable
> > coresched even if the HW is not vulnerable. 'off' would disable
> > core scheduling in any case.
> 
> This is all sorts of wrong, and the reason is because you hard-coded
> that stupid policy.
> 
> Core scheduling should always be available on SMT (provided you did that
> CONFIG_ thing). Even on AMD systems RT tasks might want to claim the
> core exclusively.

Agreed, specifically if we need to have special cgroup tag/association to
enable it.

Balbir Singh.


* Re: [PATCH -tip 04/32] sched: Core-wide rq->lock
  2020-11-24  8:16     ` Peter Zijlstra
@ 2020-11-26  0:35       ` Balbir Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Balbir Singh @ 2020-11-26  0:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Nov 24, 2020 at 09:16:17AM +0100, Peter Zijlstra wrote:
> On Sun, Nov 22, 2020 at 08:11:52PM +1100, Balbir Singh wrote:
> > On Tue, Nov 17, 2020 at 06:19:34PM -0500, Joel Fernandes (Google) wrote:
> > > From: Peter Zijlstra <peterz@infradead.org>
> > > 
> > > Introduce the basic infrastructure to have a core wide rq->lock.
> > >
> > 
> > Reading through the patch, it seems like all the CPUs have to be
> > running with sched core enabled/disabled? Is it possible to have some
> > cores with core sched disabled?
> 
> Yep, patch even says so:
> 
>  + * XXX entirely possible to selectively enable cores, don't bother for now.

Yes, it does in the comments, I looked at just the changelog :)

> 
> > I don't see a strong use case for it,
> > but I am wondering if the design will fall apart if that assumption is
> > broken?
> 
> The use-case I have is not using stop-machine. That is, stopping a whole
> core at a time, instead of the whole sodding machine. It's on the todo
> list *somewhere*....
> 
>

Good to know. I guess that would need a transition of the entire core to
idle, and maintenance of a mask that prevents tasks that need core
scheduling from moving off CPUs that enable core scheduling.

Balbir Singh.


* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-25 22:57           ` Balbir Singh
@ 2020-11-26  3:20             ` Li, Aubrey
  2020-11-26  8:32               ` Balbir Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Li, Aubrey @ 2020-11-26  3:20 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On 2020/11/26 6:57, Balbir Singh wrote:
> On Wed, Nov 25, 2020 at 11:12:53AM +0800, Li, Aubrey wrote:
>> On 2020/11/24 23:42, Peter Zijlstra wrote:
>>> On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote:
>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>> +		/*
>>>>>> +		 * Skip this cpu if source task's cookie does not match
>>>>>> +		 * with CPU's core cookie.
>>>>>> +		 */
>>>>>> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>> +			continue;
>>>>>> +#endif
>>>>>> +
>>>>>
>>>>> Any reason this is under an #ifdef? In sched_core_cookie_match() won't
>>>>> the check for sched_core_enabled() do the right thing even when
> >>>>> CONFIG_SCHED_CORE is not enabled?
>>>> Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not
>>>> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
>>>> sense to leave a core scheduler specific function here even at compile
> >>>> time. Also, for the cases in the hot path, this saves CPU cycles by
> >>>> avoiding a branch.
>>>
>>> No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is
>>> more.
>>>
>>
>> Okay, I pasted the refined patch here.
>> @Joel, please let me know if you want me to send it in a separated thread.
>>
> 
> You still have a bunch of #ifdefs, can't we just do
> 
> #ifndef CONFIG_SCHED_CORE
> static inline bool sched_core_enabled(struct rq *rq)
> {
>         return false;
> }
> #endif
> 
> and frankly I think even that is not needed because there is a jump
> label __sched_core_enabled that tells us if sched_core is enabled or
> not.

Hmm..., I need another wrapper for CONFIG_SCHED_CORE specific variables.
How about this one?

Thanks,
-Aubrey

From 61dac9067e66b5b9ea26c684c8c8235714bab38a Mon Sep 17 00:00:00 2001
From: Aubrey Li <aubrey.li@linux.intel.com>
Date: Thu, 26 Nov 2020 03:08:04 +0000
Subject: [PATCH] sched: migration changes for core scheduling

 - Don't migrate if there is a cookie mismatch
     Load balance tries to move task from busiest CPU to the
     destination CPU. When core scheduling is enabled, if the
     task's cookie does not match with the destination CPU's
     core cookie, this task will be skipped by this CPU. This
     mitigates the forced idle time on the destination CPU.

 - Select cookie matched idle CPU
     In the fast path of task wakeup, select the first cookie matched
     idle CPU instead of the first idle CPU.

 - Find cookie matched idlest CPU
     In the slow path of task wakeup, find the idlest CPU whose core
     cookie matches with task's cookie

 - Don't migrate task if cookie does not match
     For the NUMA load balance, don't migrate task to the CPU whose
     core cookie does not match with task's cookie

Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c  | 57 ++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h | 43 +++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de82f88ba98c..70dd013dff1d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
 			continue;
 
+		/*
+		 * Skip this cpu if source task's cookie does not match
+		 * with CPU's core cookie.
+		 */
+		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+			continue;
+
 		env->dst_cpu = cpu;
 		if (task_numa_compare(env, taskimp, groupimp, maymove))
 			break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		struct rq *rq = cpu_rq(i);
+
+		if (!sched_core_cookie_match(rq, p))
+			continue;
+
 		if (sched_idle_cpu(i))
 			return i;
 
 		if (available_idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
@@ -6129,8 +6140,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
 			return -1;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-			break;
+
+		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+			/*
+			 * If Core Scheduling is enabled, select this cpu
+			 * only if the process cookie matches core cookie.
+			 */
+			if (sched_core_enabled(cpu_rq(cpu))) {
+				if (__cookie_match(cpu_rq(cpu), p))
+					break;
+			} else {
+				break;
+			}
+		}
 	}
 
 	time = cpu_clock(this) - time;
@@ -7530,8 +7552,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
 	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 3) task's cookie does not match with this CPU's core cookie
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
@@ -7566,6 +7589,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+	/*
+	 * Don't migrate task if the task's cookie does not match
+	 * with the destination CPU's core cookie.
+	 */
+	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+		return 0;
+
 	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8822,23 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 					p->cpus_ptr))
 			continue;
 
+		if (sched_core_enabled(cpu_rq(this_cpu))) {
+			int i = 0;
+			bool cookie_match = false;
+
+			for_each_cpu(i, sched_group_span(group)) {
+				struct rq *rq = cpu_rq(i);
+
+				if (sched_core_cookie_match(rq, p)) {
+					cookie_match = true;
+					break;
+				}
+			}
+			/* Skip over this group if no cookie matched */
+			if (!cookie_match)
+				continue;
+		}
+
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e72942a9ee11..8bb3b72d593c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1135,6 +1135,40 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
 
+/*
+ * Helper to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool __cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return rq->core->core_cookie == p->core_cookie;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	bool idle_core = true;
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+		if (!available_idle_cpu(cpu)) {
+			idle_core = false;
+			break;
+		}
+	}
+
+	/*
+	 * A CPU in an idle core is always the best choice for tasks with
+	 * cookies.
+	 */
+	return idle_core || __cookie_match(rq, p);
+}
+
 extern void queue_core_balance(struct rq *rq);
 
 #else /* !CONFIG_SCHED_CORE */
@@ -1153,6 +1187,15 @@ static inline void queue_core_balance(struct rq *rq)
 {
 }
 
+static inline bool __cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
 #endif /* CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.17.1


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-17 23:19 ` [PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode Joel Fernandes (Google)
  2020-11-24 16:09   ` Peter Zijlstra
  2020-11-25  9:37   ` Peter Zijlstra
@ 2020-11-26  5:37   ` Balbir Singh
  2 siblings, 0 replies; 150+ messages in thread
From: Balbir Singh @ 2020-11-26  5:37 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li, Tim Chen,
	Paul E . McKenney

On Tue, Nov 17, 2020 at 06:19:48PM -0500, Joel Fernandes (Google) wrote:
> Core-scheduling prevents hyperthreads in usermode from attacking each
> other, but it does not do anything about one of the hyperthreads
> entering the kernel for any reason. This leaves the door open for MDS
> and L1TF attacks with concurrent execution sequences between
> hyperthreads.
> 
> This patch therefore adds support for protecting all syscall and IRQ
> kernel mode entries. Care is taken to track the outermost usermode exit
> and entry using per-cpu counters. In cases where one of the hyperthreads
> enters the kernel, no additional IPIs are sent. Further, IPIs are avoided
> when not needed - example: idle and non-cookie HTs do not need to be
> forced into kernel mode.
> 
> More information about attacks:
> For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> data to either host or guest attackers. For L1TF, it is possible to leak
> to guest attackers. There is no possible mitigation involving flushing
> of buffers to avoid this since the execution of attacker and victims
> happen concurrently on 2 or more HTs.
> 
> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Cc: Julien Desfossez <jdesfossez@digitalocean.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Aaron Lu <aaron.lwe@gmail.com>
> Cc: Aubrey Li <aubrey.li@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Co-developed-by: Vineeth Pillai <viremana@linux.microsoft.com>
> Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  .../admin-guide/kernel-parameters.txt         |  11 +
>  include/linux/entry-common.h                  |  12 +-
>  include/linux/sched.h                         |  12 +
>  kernel/entry/common.c                         |  28 +-
>  kernel/sched/core.c                           | 241 ++++++++++++++++++
>  kernel/sched/sched.h                          |   3 +
>  6 files changed, 304 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index bd1a5b87a5e2..b185c6ed4aba 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4678,6 +4678,17 @@
>  
>  	sbni=		[NET] Granch SBNI12 leased line adapter
>  
> +	sched_core_protect_kernel=
> +			[SCHED_CORE] Pause SMT siblings of a core running in
> +			user mode, if at least one of the siblings of the core
> +			is running in kernel mode. This is to guarantee that
> +			kernel data is not leaked to tasks which are not trusted
> +			by the kernel. A value of 0 disables protection, 1
> +			enables protection. The default is 1. Note that protection
> +			depends on the arch defining the _TIF_UNSAFE_RET flag.
> +			Further, for protecting VMEXIT, arch needs to call
> +			KVM entry/exit hooks.
> +
>  	sched_debug	[KNL] Enables verbose scheduler debug messages.
>  
>  	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 1a128baf3628..022e1f114157 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -33,6 +33,10 @@
>  # define _TIF_PATCH_PENDING		(0)
>  #endif
>  
> +#ifndef _TIF_UNSAFE_RET
> +# define _TIF_UNSAFE_RET		(0)
> +#endif
> +
>  #ifndef _TIF_UPROBE
>  # define _TIF_UPROBE			(0)
>  #endif
> @@ -74,7 +78,7 @@
>  #define EXIT_TO_USER_MODE_WORK						\
>  	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
>  	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |	\
> -	 ARCH_EXIT_TO_USER_MODE_WORK)
> +	 _TIF_UNSAFE_RET | ARCH_EXIT_TO_USER_MODE_WORK)
>  
>  /**
>   * arch_check_user_regs - Architecture specific sanity check for user mode regs
> @@ -444,4 +448,10 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs);
>   */
>  void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state);
>  
> +/* entry_kernel_protected - Is kernel protection on entry/exit into kernel supported? */
> +static inline bool entry_kernel_protected(void)
> +{
> +	return IS_ENABLED(CONFIG_SCHED_CORE) && sched_core_kernel_protected()
> +		&& _TIF_UNSAFE_RET != 0;
> +}
>  #endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 7efce9c9d9cf..a60868165590 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2076,4 +2076,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
>  
>  const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
>  
> +#ifdef CONFIG_SCHED_CORE
> +void sched_core_unsafe_enter(void);
> +void sched_core_unsafe_exit(void);
> +bool sched_core_wait_till_safe(unsigned long ti_check);
> +bool sched_core_kernel_protected(void);
> +#else
> +#define sched_core_unsafe_enter(ignore) do { } while (0)
> +#define sched_core_unsafe_exit(ignore) do { } while (0)
> +#define sched_core_wait_till_safe(ignore) do { } while (0)
> +#define sched_core_kernel_protected(ignore) do { } while (0)
> +#endif
> +
>  #endif
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bc75c114c1b3..9d9d926f2a1c 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -28,6 +28,9 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
>  
>  	instrumentation_begin();
>  	trace_hardirqs_off_finish();
> +
> +	if (entry_kernel_protected())
> +		sched_core_unsafe_enter();
>  	instrumentation_end();
>  }
>  
> @@ -145,6 +148,26 @@ static void handle_signal_work(struct pt_regs *regs, unsigned long ti_work)
>  	arch_do_signal_or_restart(regs, ti_work & _TIF_SIGPENDING);
>  }
>  
> +static unsigned long exit_to_user_get_work(void)
> +{
> +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +
> +	if (!entry_kernel_protected())
> +		return ti_work;
> +
> +#ifdef CONFIG_SCHED_CORE

This #ifdef is not necessary; entry_kernel_protected() already does this
check, no? The code should also compile anyway from what I can see so far.

> +	ti_work &= EXIT_TO_USER_MODE_WORK;
> +	if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
> +		sched_core_unsafe_exit();
> +		if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
> +			sched_core_unsafe_enter(); /* not exiting to user yet. */
> +		}
> +	}
> +
> +	return READ_ONCE(current_thread_info()->flags);
> +#endif
> +}
> +
>  static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>  					    unsigned long ti_work)
>  {
> @@ -182,7 +205,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>  		 * enabled above.
>  		 */
>  		local_irq_disable_exit_to_user();
> -		ti_work = READ_ONCE(current_thread_info()->flags);
> +		ti_work = exit_to_user_get_work();
>  	}
>  
>  	/* Return the latest work state for arch_exit_to_user_mode() */
> @@ -191,9 +214,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>  
>  static void exit_to_user_mode_prepare(struct pt_regs *regs)
>  {
> -	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +	unsigned long ti_work;
>  
>  	lockdep_assert_irqs_disabled();
> +	ti_work = exit_to_user_get_work();
>  
>  	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>  		ti_work = exit_to_user_mode_loop(regs, ti_work);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 20125431af87..7f807a84cc30 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -76,6 +76,27 @@ __read_mostly int scheduler_running;
>  
>  #ifdef CONFIG_SCHED_CORE
>  
> +DEFINE_STATIC_KEY_TRUE(sched_core_protect_kernel);
> +static int __init set_sched_core_protect_kernel(char *str)
> +{
> +	unsigned long val = 0;
> +
> +	if (!str)
> +		return 0;
> +
> +	if (!kstrtoul(str, 0, &val) && !val)
> +		static_branch_disable(&sched_core_protect_kernel);
> +
> +	return 1;
> +}
> +__setup("sched_core_protect_kernel=", set_sched_core_protect_kernel);
> +
> +/* Is the kernel protected by core scheduling? */
> +bool sched_core_kernel_protected(void)
> +{
> +	return static_branch_likely(&sched_core_protect_kernel);
> +}
> +
>  DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
>  
>  /* kernel prio, less is more */
> @@ -5092,6 +5113,225 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
>  	return a->core_cookie == b->core_cookie;
>  }
>  
> +/*
> + * Handler to attempt to enter kernel. It does nothing because the exit to
> + * usermode or guest mode will do the actual work (of waiting if needed).
> + */
> +static void sched_core_irq_work(struct irq_work *work)
> +{
> +}
> +
> +static inline void init_sched_core_irq_work(struct rq *rq)
> +{
> +	init_irq_work(&rq->core_irq_work, sched_core_irq_work);
> +}
> +
> +/*
> + * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
> + * exits the core-wide unsafe state. Obviously the CPU calling this function
> + * should not be responsible for the core being in the core-wide unsafe state
> + * otherwise it will deadlock.
> + *
> + * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
> + *            the loop if TIF flags are set and notify caller about it.
> + *
> + * IRQs should be disabled.
> + */
> +bool sched_core_wait_till_safe(unsigned long ti_check)
> +{
> +	bool restart = false;
> +	struct rq *rq;
> +	int cpu;
> +
> +	/* We clear the thread flag only at the end, so no need to check for it. */
> +	ti_check &= ~_TIF_UNSAFE_RET;
> +
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +
> +	if (!sched_core_enabled(rq))
> +		goto ret;

Why do we need to deal with ti_check if sched_core_enabled() is false (two
statements above)?

> +
> +	/* Down grade to allow interrupts to prevent stop_machine lockups.. */
> +	preempt_disable();
> +	local_irq_enable();
> +
> +	/*
> +	 * Wait till the core of this HT is not in an unsafe state.
> +	 *
> +	 * Pair with raw_spin_lock/unlock() in sched_core_unsafe_enter/exit().
> +	 */
> +	while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
> +		cpu_relax();
> +		if (READ_ONCE(current_thread_info()->flags) & ti_check) {
> +			restart = true;
> +			break;
> +		}
> +	}
> +
> +	/* Upgrade it back to the expectations of entry code. */
> +	local_irq_disable();
> +	preempt_enable();
> +
> +ret:
> +	if (!restart)
> +		clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	return restart;
> +}
> +
> +/*
> + * Enter the core-wide IRQ state. Sibling will be paused if it is running
> + * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
> + * avoid sending useless IPIs is made. Must be called only from hard IRQ
> + * context.
> + */
> +void sched_core_unsafe_enter(void)
> +{
> +	const struct cpumask *smt_mask;
> +	unsigned long flags;
> +	struct rq *rq;
> +	int i, cpu;
> +
> +	if (!static_branch_likely(&sched_core_protect_kernel))
> +		return;
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +	if (!sched_core_enabled(rq))
> +		goto ret;

I am not sure about the rules of this check, do we have to do this with
irq's disabled? Given that sched_core_enabled() can only change under
stop_machine, can't we optimize this check?

> +
> +	/* Ensure that on return to user/guest, we check whether to wait. */
> +	if (current->core_cookie)
> +		set_tsk_thread_flag(current, TIF_UNSAFE_RET);
> +
> +	/* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
> +	rq->core_this_unsafe_nest++;
> +
> +	/*
> +	 * Should not nest: enter() should only pair with exit(). Both are done
> +	 * during the first entry into kernel and the last exit from kernel.
> +	 * Nested kernel entries (such as nested interrupts) will only trigger
> +	 * enter() and exit() on the outer most kernel entry and exit.
> +	 */
> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
> +		goto ret;
> +
> +	raw_spin_lock(rq_lockp(rq));
> +	smt_mask = cpu_smt_mask(cpu);
> +
> +	/*
> +	 * Contribute this CPU's unsafe_enter() to the core-wide unsafe_enter()
> +	 * count.  The raw_spin_unlock() release semantics pairs with the nest
> +	 * counter's smp_load_acquire() in sched_core_wait_till_safe().
> +	 */
> +	WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
> +
> +	if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
> +		goto unlock;

I am sure this is quite unlikely unless you're concerned about overflows;
will this all eventually move to under SCHED_DEBUG?

> +
> +	if (irq_work_is_busy(&rq->core_irq_work)) {
> +		/*
> +		 * Do nothing more since we are in an IPI sent from another
> +		 * sibling to enforce safety. That sibling would have sent IPIs
> +		 * to all of the HTs.
> +		 */
> +		goto unlock;
> +	}
> +
> +	/*
> +	 * If we are not the first ones on the core to enter core-wide unsafe
> +	 * state, do nothing.
> +	 */
> +	if (rq->core->core_unsafe_nest > 1)
> +		goto unlock;
> +
> +	/* Do nothing more if the core is not tagged. */
> +	if (!rq->core->core_cookie)
> +		goto unlock;
> +
> +	for_each_cpu(i, smt_mask) {
> +		struct rq *srq = cpu_rq(i);
> +
> +		if (i == cpu || cpu_is_offline(i))
> +			continue;
> +
> +		if (!srq->curr->mm || is_task_rq_idle(srq->curr))
> +			continue;
> +
> +		/* Skip if HT is not running a tagged task. */
> +		if (!srq->curr->core_cookie && !srq->core_pick)
> +			continue;
> +
> +		/*
> +		 * Force sibling into the kernel by IPI. If work was already
> +		 * pending, no new IPIs are sent. This is Ok since the receiver
> +		 * would already be in the kernel, or on its way to it.
> +		 */
> +		irq_work_queue_on(&srq->core_irq_work, i);
> +	}
> +unlock:
> +	raw_spin_unlock(rq_lockp(rq));
> +ret:
> +	local_irq_restore(flags);
> +}
> +
> +/*
> + * Process any work need for either exiting the core-wide unsafe state, or for
> + * waiting on this hyperthread if the core is still in this state.
> + *
> + * @idle: Are we called from the idle loop?
> + */
> +void sched_core_unsafe_exit(void)
> +{
> +	unsigned long flags;
> +	unsigned int nest;
> +	struct rq *rq;
> +	int cpu;
> +
> +	if (!static_branch_likely(&sched_core_protect_kernel))
> +		return;
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	rq = cpu_rq(cpu);
> +
> +	/* Do nothing if core-sched disabled. */
> +	if (!sched_core_enabled(rq))
> +		goto ret;
> +

Same as above

> +	/*
> +	 * Can happen when a process is forked and the first return to user
> +	 * mode is a syscall exit. Either way, there's nothing to do.
> +	 */
> +	if (rq->core_this_unsafe_nest == 0)
> +		goto ret;
> +
> +	rq->core_this_unsafe_nest--;
> +
> +	/* enter() should be paired with exit() only. */
> +	if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
> +		goto ret;
> +
> +	raw_spin_lock(rq_lockp(rq));
> +	/*
> +	 * Core-wide nesting counter can never be 0 because we are
> +	 * still in it on this CPU.
> +	 */
> +	nest = rq->core->core_unsafe_nest;
> +	WARN_ON_ONCE(!nest);
> +
> +	WRITE_ONCE(rq->core->core_unsafe_nest, nest - 1);
> +	/*
> +	 * The raw_spin_unlock release semantics pairs with the nest counter's
> +	 * smp_load_acquire() in sched_core_wait_till_safe().
> +	 */
> +	raw_spin_unlock(rq_lockp(rq));
> +ret:
> +	local_irq_restore(flags);
> +}
> +
>  // XXX fairness/fwd progress conditions
>  /*
>   * Returns
> @@ -5497,6 +5737,7 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
>  			rq = cpu_rq(i);
>  			if (rq->core && rq->core == rq)
>  				core_rq = rq;
> +			init_sched_core_irq_work(rq);
>  		}
>  
>  		if (!core_rq)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 615092cb693c..be6691337bbb 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1074,6 +1074,8 @@ struct rq {
>  	unsigned int		core_enabled;
>  	unsigned int		core_sched_seq;
>  	struct rb_root		core_tree;
> +	struct irq_work		core_irq_work; /* To force HT into kernel */
> +	unsigned int		core_this_unsafe_nest;
>  
>  	/* shared state */
>  	unsigned int		core_task_seq;
> @@ -1081,6 +1083,7 @@ struct rq {
>  	unsigned long		core_cookie;
>  	unsigned char		core_forceidle;
>  	unsigned int		core_forceidle_seq;
> +	unsigned int		core_unsafe_nest;
>  #endif
>  };
>

Balbir Singh.
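The enter()/exit() pairing and the two nesting counters discussed in this
patch can be condensed into a toy userspace model. This is purely
illustrative: the names (toy_cpu, toy_core, toy_unsafe_enter/exit) are made
up here, there is no locking, and unlike the real patch it tolerates nested
calls rather than warning on them.

```c
#include <assert.h>

/* Toy model of the per-CPU and core-wide nesting counters
 * (illustrative only; single-threaded, no locking). */
struct toy_core {
	unsigned int core_unsafe_nest;   /* core-wide: how many siblings are in the kernel */
};

struct toy_cpu {
	unsigned int this_unsafe_nest;   /* per-CPU: tracks the outermost entry */
	struct toy_core *core;
};

/* Only the outermost kernel entry on this CPU contributes to the
 * core-wide count; nested entries are absorbed by the per-CPU counter. */
static void toy_unsafe_enter(struct toy_cpu *cpu)
{
	if (cpu->this_unsafe_nest++ != 0)
		return;                  /* nested entry: already counted core-wide */
	cpu->core->core_unsafe_nest++;
}

/* The outermost exit undoes the core-wide contribution; an unpaired
 * exit (e.g. first return to user after fork) is ignored. */
static void toy_unsafe_exit(struct toy_cpu *cpu)
{
	if (cpu->this_unsafe_nest == 0)
		return;                  /* nothing to undo */
	if (--cpu->this_unsafe_nest == 0)
		cpu->core->core_unsafe_nest--;
}
```

A sibling spinning in sched_core_wait_till_safe() would simply wait for
core_unsafe_nest to drop back to zero.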

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 09/32] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2020-11-25 23:17           ` Balbir Singh
@ 2020-11-26  8:23             ` Peter Zijlstra
  0 siblings, 0 replies; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-26  8:23 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Vineeth Pillai, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Aaron Lu,
	Aubrey Li, tglx, linux-kernel, mingo, torvalds, fweisbec,
	keescook, kerrnel, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	dfaggioli, pjt, rostedt, derkling, benbjiang, Alexandre Chartre,
	James.Bottomley, OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes,
	chris.hyser, Ben Segall, Josh Don, Hao Luo, Tom Lendacky,
	Aubrey Li, Paul E. McKenney, Tim Chen

On Thu, Nov 26, 2020 at 10:17:15AM +1100, Balbir Singh wrote:
> On Tue, Nov 24, 2020 at 10:09:55AM +0100, Peter Zijlstra wrote:

> > The basic observation the current approach relies on is that all that
> > faffery basically boils down to the fact that vruntime only means
> > something when there is contention. And that only the progression is
> > important not the actual value. That is, this is all fundamentally a
> > differential equation and our integration constant is meaningless (also
> > embodied in (7)).
> >
> 
> I'll reread (6) and (7), I am trying to understand forced idle and
> contention together. From what I understand of the patches, there are

When we force-idle there is contention by definition; there's a task
that wanted to run, but couldn't.

> 1. two tasks that are core scheduled, in that case vruntime works as
> expected on each CPU, but we need to compare their combined vruntime
> against other tasks on each CPU respectively for them to be
> selected/chosen?

We need to compare across CPUs when the cookies don't match. This is
required to avoid starving one or the other.

> 2. When one of the tasks selected is a part of the core scheduling group
> and the other CPU does not select a core scheduled tasks, we need to ask
> ourselves if that CPU should force idle and that's where this logic
> comes into play?

When one CPU selects a cookie task, and the other CPU cannot find a
matching task, it must go idle (as idle matches everyone). This is the
basic core-scheduling constraint.

So suppose you have two tasks, A and B, both with a cookie, but not
matching.

Normal scheduling would run A and B concurrent on the two siblings. Core
scheduling obviously cannot do this. When we pick A, the other CPU is
not allowed to run B and thus will have to be forced idle and
vice-versa.

The next problem is avoiding starvation. Assuming equal weight between
the tasks, we'd want to end up running A and B in alternating cycles.

This means having to compare runtimes between A and B, but when they're
on different runqueues the actual vruntime values can be wildly
divergent and cannot be readily compared (the integration constant is
meaningless but really annoying ;-).

We also cannot use min_vruntime (which is the same as the task vruntime
when there is only a single task), because then you cannot observe
progress. The difference between min_vruntime and the task runtime is
always 0, so you can't tell who just ran and who got starved.

This is where our snapshots come in play, we snapshot vruntime after
task selection (before running), such that at the next pick we can tell
who made progress and who got starved.

By marking the vruntime of both runqueues at the same point in time we
basically normalize away that integration constant. You effectively
reset the vruntime to 0 (through (7), but without iterating all the
tasks and adjusting it).

Does that make sense?

Once you get this, read that second email linked.
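The snapshot argument above can be condensed into a toy userspace model.
Everything here is illustrative: toy_rq, vruntime_snapshot and toy_progress
are names invented for this sketch, not the kernel's, and real vruntime is
weight-scaled in ways this ignores.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one runqueue's min_vruntime plus the snapshot taken at
 * the last core-wide task selection (illustrative names only). */
struct toy_rq {
	unsigned long long min_vruntime;       /* absolute; not comparable across rqs */
	unsigned long long vruntime_snapshot;  /* taken at the same instant on both siblings */
};

/* Progress since the snapshot. Because both siblings snapshot at the
 * same point in time, these deltas ARE comparable: the integration
 * constant (the wildly divergent absolute values) cancels out. */
static unsigned long long toy_progress(const struct toy_rq *rq)
{
	return rq->min_vruntime - rq->vruntime_snapshot;
}

/* Prefer the side that made less progress, i.e. the one being starved. */
static bool toy_pick_first(const struct toy_rq *a, const struct toy_rq *b)
{
	return toy_progress(a) <= toy_progress(b);
}
```

With absolute values alone (say 1000000 vs 5000500) no sane comparison is
possible; with the deltas, whichever side ran last shows progress and the
starved side wins the next pick, giving the alternating A/B cycles described
above.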

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling
  2020-11-25 23:05       ` Balbir Singh
@ 2020-11-26  8:29         ` Peter Zijlstra
  2020-11-26 22:27           ` Balbir Singh
  2020-12-01 17:49         ` Joel Fernandes
  1 sibling, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-26  8:29 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Joel Fernandes, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Thu, Nov 26, 2020 at 10:05:19AM +1100, Balbir Singh wrote:
> > @@ -5259,7 +5254,20 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  			 * Optimize the 'normal' case where there aren't any
> >  			 * cookies and we don't need to sync up.
> >  			 */
> > -			if (i == cpu && !need_sync && !p->core_cookie) {
> > +			if (i == cpu && !need_sync) {
> > +				if (p->core_cookie) {
> > +					/*
> > +					 * This optimization is only valid as
> > +					 * long as there are no cookies
> 
> This is not entirely true, need_sync is a function of core cookies, so I
> think this needs more clarification, it sounds like we enter this when
> the core has no cookies, but the task has a core_cookie? The term cookie
> is quite overloaded when used in the context of core vs task.

Nah, its the same. So each task gets a cookie to identify the 'group' of
tasks (possibly just itself) it is allowed to share a core with.

When we do core task selection, the core gets assigned the cookie of the
group it will run, same thing.

> Effectively from what I understand this means that p wants to be
> coscheduled, but the core itself is not coscheduling anything at the
> moment, so we need to see if we should do a sync and that sync might
> cause p to get kicked out and a higher priority class to come in?

This whole patch is about eliding core-wide task selection when it is
not required. IOW an optimization.

When there wasn't a core cookie (IOW, the previous task selection wasn't
core wide and limited) and the task we just selected for our own CPU
also didn't have a cookie (IOW it doesn't have to be core-wide) we can
skip the core wide task selection and schedule just this CPU and call it
a day.

The logic was subtly wrong, this patch fixes it. A next patch completely
rewrites it again to make it far simpler again. Don't spend time trying
to understand this patch (unless you're _that_ kind of person ;-) but
instead apply the whole thing and look at the resulting pick_next_task()
function.
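The elision condition being discussed reduces to a one-line predicate. The
sketch below is a standalone restatement (the helper name
can_skip_core_wide_pick is invented here; in the patch this logic is inline
in pick_next_task()):

```c
#include <assert.h>
#include <stdbool.h>

/* A per-CPU pick suffices only when neither constraint applies:
 *  - need_sync: the previous core-wide selection was cookie-constrained,
 *    so siblings may need to be re-evaluated;
 *  - task_cookie: the locally picked task carries a cookie, so running it
 *    imposes a constraint on the siblings.
 * A zero cookie means "trusts everyone" in this model. */
static bool can_skip_core_wide_pick(bool need_sync, unsigned long task_cookie)
{
	return !need_sync && task_cookie == 0;
}
```

The subtle bug this patch fixes is exactly a case where the old code skipped
the core-wide pick even though the freshly selected task had a cookie.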

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-26  3:20             ` Li, Aubrey
@ 2020-11-26  8:32               ` Balbir Singh
  2020-11-26  9:26                 ` Li, Aubrey
  0 siblings, 1 reply; 150+ messages in thread
From: Balbir Singh @ 2020-11-26  8:32 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Thu, Nov 26, 2020 at 11:20:41AM +0800, Li, Aubrey wrote:
> On 2020/11/26 6:57, Balbir Singh wrote:
> > On Wed, Nov 25, 2020 at 11:12:53AM +0800, Li, Aubrey wrote:
> >> On 2020/11/24 23:42, Peter Zijlstra wrote:
> >>> On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote:
> >>>>>> +#ifdef CONFIG_SCHED_CORE
> >>>>>> +		/*
> >>>>>> +		 * Skip this cpu if source task's cookie does not match
> >>>>>> +		 * with CPU's core cookie.
> >>>>>> +		 */
> >>>>>> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
> >>>>>> +			continue;
> >>>>>> +#endif
> >>>>>> +
> >>>>>
> >>>>> Any reason this is under an #ifdef? In sched_core_cookie_match() won't
> >>>>> the check for sched_core_enabled() do the right thing even when
> >>>>> CONFIG_SCHED_CORE is not enabed?> 
> >>>> Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not
> >>>> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
> >>>> sense to leave a core scheduler specific function here even at compile
> >>>> time. Also, for the cases in hot path, this saves CPU cycles to avoid
> >>>> a judgment.
> >>>
> >>> No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is
> >>> more.
> >>>
> >>
> >> Okay, I pasted the refined patch here.
> >> @Joel, please let me know if you want me to send it in a separated thread.
> >>
> > 
> > You still have a bunch of #ifdefs, can't we just do
> > 
> > #ifndef CONFIG_SCHED_CORE
> > static inline bool sched_core_enabled(struct rq *rq)
> > {
> >         return false;
> > }
> > #endif
> > 
> > and frankly I think even that is not needed because there is a jump
> > label __sched_core_enabled that tells us if sched_core is enabled or
> > not.
> 
> Hmm..., I need another wrapper for CONFIG_SCHED_CORE specific variables.
> How about this one?
>

Much better :)
 
> Thanks,
> -Aubrey
> 
> From 61dac9067e66b5b9ea26c684c8c8235714bab38a Mon Sep 17 00:00:00 2001
> From: Aubrey Li <aubrey.li@linux.intel.com>
> Date: Thu, 26 Nov 2020 03:08:04 +0000
> Subject: [PATCH] sched: migration changes for core scheduling
> 
>  - Don't migrate if there is a cookie mismatch
>      Load balance tries to move task from busiest CPU to the
>      destination CPU. When core scheduling is enabled, if the
>      task's cookie does not match with the destination CPU's
>      core cookie, this task will be skipped by this CPU. This
>      mitigates the forced idle time on the destination CPU.
> 
>  - Select cookie matched idle CPU
>      In the fast path of task wakeup, select the first cookie matched
>      idle CPU instead of the first idle CPU.
> 
>  - Find cookie matched idlest CPU
>      In the slow path of task wakeup, find the idlest CPU whose core
>      cookie matches with task's cookie
> 
>  - Don't migrate task if cookie not match
>      For the NUMA load balance, don't migrate task to the CPU whose
>      core cookie does not match with task's cookie
> 
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/fair.c  | 57 ++++++++++++++++++++++++++++++++++++++++----
>  kernel/sched/sched.h | 43 +++++++++++++++++++++++++++++++++
>  2 files changed, 95 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index de82f88ba98c..70dd013dff1d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>  			continue;
>  
> +		/*
> +		 * Skip this cpu if source task's cookie does not match
> +		 * with CPU's core cookie.
> +		 */
> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
> +			continue;
> +
>  		env->dst_cpu = cpu;
>  		if (task_numa_compare(env, taskimp, groupimp, maymove))
>  			break;
> @@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>  
>  	/* Traverse only the allowed CPUs */
>  	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
> +		struct rq *rq = cpu_rq(i);
> +
> +		if (!sched_core_cookie_match(rq, p))
> +			continue;
> +
>  		if (sched_idle_cpu(i))
>  			return i;
>  
>  		if (available_idle_cpu(i)) {
> -			struct rq *rq = cpu_rq(i);
>  			struct cpuidle_state *idle = idle_get_state(rq);
>  			if (idle && idle->exit_latency < min_exit_latency) {
>  				/*
> @@ -6129,8 +6140,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  	for_each_cpu_wrap(cpu, cpus, target) {
>  		if (!--nr)
>  			return -1;
> -		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> -			break;
> +
> +		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
> +			/*
> +			 * If Core Scheduling is enabled, select this cpu
> +			 * only if the process cookie matches core cookie.
> +			 */
> +			if (sched_core_enabled(cpu_rq(cpu))) {
> +				if (__cookie_match(cpu_rq(cpu), p))
> +					break;
> +			} else {
> +				break;
> +			}
> +		}

Isn't this better and equivalent?

	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
		sched_core_cookie_match(cpu_rq(cpu), p))
		break;

>  	}
>  
>  	time = cpu_clock(this) - time;
> @@ -7530,8 +7552,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	 * We do not migrate tasks that are:
>  	 * 1) throttled_lb_pair, or
>  	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
> -	 * 3) running (obviously), or
> -	 * 4) are cache-hot on their current CPU.
> +	 * 3) task's cookie does not match with this CPU's core cookie
> +	 * 4) running (obviously), or
> +	 * 5) are cache-hot on their current CPU.
>  	 */
>  	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>  		return 0;
> @@ -7566,6 +7589,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  		return 0;
>  	}
>  
> +	/*
> +	 * Don't migrate task if the task's cookie does not match
> +	 * with the destination CPU's core cookie.
> +	 */
> +	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
> +		return 0;
> +
>  	/* Record that we found atleast one task that could run on dst_cpu */
>  	env->flags &= ~LBF_ALL_PINNED;
>  
> @@ -8792,6 +8822,23 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  					p->cpus_ptr))
>  			continue;
>  
> +		if (sched_core_enabled(cpu_rq(this_cpu))) {
> +			int i = 0;
> +			bool cookie_match = false;
> +
> +			for_each_cpu(i, sched_group_span(group)) {
> +				struct rq *rq = cpu_rq(i);
> +
> +				if (sched_core_cookie_match(rq, p)) {
> +					cookie_match = true;
> +					break;
> +				}
> +			}
> +			/* Skip over this group if no cookie matched */
> +			if (!cookie_match)
> +				continue;
> +		}
> +

Again, I think this can be refactored because sched_core_cookie_match() checks
for sched_core_enabled().

	int i = 0;
	bool cookie_match = false;
	for_each_cpu(i, sched_group_span(group)) {
		if (sched_core_cookie_match(cpu_rq(i), p))
			break;
	}
	if (i >= nr_cpu_ids)
		continue;

>  		local_group = cpumask_test_cpu(this_cpu,
>  					       sched_group_span(group));
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e72942a9ee11..8bb3b72d593c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1135,6 +1135,40 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
>  
>  bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
>  
> +/*
> + * Helper to check if the CPU's core cookie matches with the task's cookie
> + * when core scheduling is enabled.
> + * A special case is that the task's cookie always matches with CPU's core
> + * cookie if the CPU is in an idle core.
> + */
> +static inline bool __cookie_match(struct rq *rq, struct task_struct *p)
> +{
> +	return rq->core->core_cookie == p->core_cookie;
> +}
> +
> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
> +{
> +	bool idle_core = true;
> +	int cpu;
> +
> +	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
> +	if (!sched_core_enabled(rq))
> +		return true;
> +
> +	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
> +		if (!available_idle_cpu(cpu)) {
> +			idle_core = false;
> +			break;
> +		}
> +	}
> +
> +	/*
> +	 * A CPU in an idle core is always the best choice for tasks with
> +	 * cookies.
> +	 */
> +	return idle_core || __cookie_match(rq, p);
> +}
> +
>  extern void queue_core_balance(struct rq *rq);
>  
>  #else /* !CONFIG_SCHED_CORE */
> @@ -1153,6 +1187,15 @@ static inline void queue_core_balance(struct rq *rq)
>  {
>  }
>  
> +static inline bool __cookie_match(struct rq *rq, struct task_struct *p)
> +{
> +	return true;
> +}
> +
> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
> +{
> +	return true;
> +}
>  #endif /* CONFIG_SCHED_CORE */
>  
>  #ifdef CONFIG_SCHED_SMT
> -- 
> 2.17.1
>

Balbir Singh. 

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 02/32] sched: Introduce sched_class::pick_task()
  2020-11-25 16:28   ` Vincent Guittot
@ 2020-11-26  9:07     ` Peter Zijlstra
  2020-11-26 10:17       ` Vincent Guittot
  0 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-26  9:07 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, Ingo Molnar,
	Linus Torvalds, Frederic Weisbecker, Kees Cook, Greg Kerr,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, Jiang Biao,
	Alexandre Chartre, James Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, Hyser,Chris, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Wed, Nov 25, 2020 at 05:28:36PM +0100, Vincent Guittot wrote:
> On Wed, 18 Nov 2020 at 00:20, Joel Fernandes (Google)

> > +#ifdef CONFIG_SMP
> > +static struct task_struct *pick_task_fair(struct rq *rq)
> > +{
> > +       struct cfs_rq *cfs_rq = &rq->cfs;
> > +       struct sched_entity *se;
> > +
> > +       if (!cfs_rq->nr_running)
> > +               return NULL;
> > +
> > +       do {
> > +               struct sched_entity *curr = cfs_rq->curr;
> > +
> > +               se = pick_next_entity(cfs_rq, NULL);
> 
> Calling pick_next_entity clears buddies. This is fine without
> coresched because the se will be the next one. But calling
> pick_task_fair doesn't mean that the se will be used

Urgh, nice one :/

> > +
> > +               if (curr) {
> > +                       if (se && curr->on_rq)
> > +                               update_curr(cfs_rq);
> > +
> 
> Shouldn't you check if cfs_rq is throttled ?

Hmm,... I suppose we do.

> > +                       if (!se || entity_before(curr, se))
> > +                               se = curr;
> > +               }
> > +
> > +               cfs_rq = group_cfs_rq(se);
> > +       } while (cfs_rq);
> > +
> > +       return task_of(se);
> > +}
> > +#endif

Something like so then?

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4354,6 +4354,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq
 static void
 set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	clear_buddies(cfs_rq, se);
+
 	/* 'current' is not kept within the tree. */
 	if (se->on_rq) {
 		/*
@@ -4440,8 +4442,6 @@ pick_next_entity(struct cfs_rq *cfs_rq,
 		se = cfs_rq->last;
 	}
 
-	clear_buddies(cfs_rq, se);
-
 	return se;
 }
 
@@ -6982,20 +6982,29 @@ static void check_preempt_wakeup(struct
 #ifdef CONFIG_SMP
 static struct task_struct *pick_task_fair(struct rq *rq)
 {
-	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se;
-
+	struct cfs_rq *cfs_rq;
+       
+again:
+	cfs_rq = &rq->cfs;
 	if (!cfs_rq->nr_running)
 		return NULL;
 
 	do {
 		struct sched_entity *curr = cfs_rq->curr;
 
-		if (curr && curr->on_rq)
-			update_curr(cfs_rq);
+		/* When we pick for a remote RQ, we'll not have done put_prev_entity() */
+		if (curr) {
+			if (curr->on_rq)
+				update_curr(cfs_rq);
+			else
+				curr = NULL;
 
-		se = pick_next_entity(cfs_rq, curr);
+			if (unlikely(check_cfs_rq_runtime(cfs_rq)))
+				goto again;
+		}
 
+		se = pick_next_entity(cfs_rq, curr);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-26  8:32               ` Balbir Singh
@ 2020-11-26  9:26                 ` Li, Aubrey
  2020-11-30  9:33                   ` Balbir Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Li, Aubrey @ 2020-11-26  9:26 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On 2020/11/26 16:32, Balbir Singh wrote:
> On Thu, Nov 26, 2020 at 11:20:41AM +0800, Li, Aubrey wrote:
>> On 2020/11/26 6:57, Balbir Singh wrote:
>>> On Wed, Nov 25, 2020 at 11:12:53AM +0800, Li, Aubrey wrote:
>>>> On 2020/11/24 23:42, Peter Zijlstra wrote:
>>>>> On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote:
>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>> +		/*
>>>>>>>> +		 * Skip this cpu if source task's cookie does not match
>>>>>>>> +		 * with CPU's core cookie.
>>>>>>>> +		 */
>>>>>>>> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>>>> +			continue;
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>
>>>>>>> Any reason this is under an #ifdef? In sched_core_cookie_match() won't
>>>>>>> the check for sched_core_enabled() do the right thing even when
> >>>>>>> CONFIG_SCHED_CORE is not enabled?
>>>>>> Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not
>>>>>> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
>>>>>> sense to leave a core scheduler specific function here even at compile
>>>>>> time. Also, for the cases in hot path, this saves CPU cycles to avoid
>>>>>> a judgment.
>>>>>
>>>>> No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is
>>>>> more.
>>>>>
>>>>
>>>> Okay, I pasted the refined patch here.
>>>> @Joel, please let me know if you want me to send it in a separated thread.
>>>>
>>>
>>> You still have a bunch of #ifdefs, can't we just do
>>>
>>> #ifndef CONFIG_SCHED_CORE
>>> static inline bool sched_core_enabled(struct rq *rq)
>>> {
>>>         return false;
>>> }
>>> #endif
>>>
>>> and frankly I think even that is not needed because there is a jump
>>> label __sched_core_enabled that tells us if sched_core is enabled or
>>> not.
>>
>> Hmm..., I need another wrapper for CONFIG_SCHED_CORE specific variables.
>> How about this one?
>>
> 
> Much better :)
>  
>> Thanks,
>> -Aubrey
>>
>> From 61dac9067e66b5b9ea26c684c8c8235714bab38a Mon Sep 17 00:00:00 2001
>> From: Aubrey Li <aubrey.li@linux.intel.com>
>> Date: Thu, 26 Nov 2020 03:08:04 +0000
>> Subject: [PATCH] sched: migration changes for core scheduling
>>
>>  - Don't migrate if there is a cookie mismatch
>>      Load balance tries to move task from busiest CPU to the
>>      destination CPU. When core scheduling is enabled, if the
>>      task's cookie does not match with the destination CPU's
>>      core cookie, this task will be skipped by this CPU. This
>>      mitigates the forced idle time on the destination CPU.
>>
>>  - Select cookie matched idle CPU
>>      In the fast path of task wakeup, select the first cookie matched
>>      idle CPU instead of the first idle CPU.
>>
>>  - Find cookie matched idlest CPU
>>      In the slow path of task wakeup, find the idlest CPU whose core
>>      cookie matches with task's cookie
>>
>>  - Don't migrate task if cookie not match
>>      For the NUMA load balance, don't migrate task to the CPU whose
>>      core cookie does not match with task's cookie
>>
>> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>>  kernel/sched/fair.c  | 57 ++++++++++++++++++++++++++++++++++++++++----
>>  kernel/sched/sched.h | 43 +++++++++++++++++++++++++++++++++
>>  2 files changed, 95 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index de82f88ba98c..70dd013dff1d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>  		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>  			continue;
>>  
>> +		/*
>> +		 * Skip this cpu if source task's cookie does not match
>> +		 * with CPU's core cookie.
>> +		 */
>> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>> +			continue;
>> +
>>  		env->dst_cpu = cpu;
>>  		if (task_numa_compare(env, taskimp, groupimp, maymove))
>>  			break;
>> @@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>  
>>  	/* Traverse only the allowed CPUs */
>>  	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>> +		struct rq *rq = cpu_rq(i);
>> +
>> +		if (!sched_core_cookie_match(rq, p))
>> +			continue;
>> +
>>  		if (sched_idle_cpu(i))
>>  			return i;
>>  
>>  		if (available_idle_cpu(i)) {
>> -			struct rq *rq = cpu_rq(i);
>>  			struct cpuidle_state *idle = idle_get_state(rq);
>>  			if (idle && idle->exit_latency < min_exit_latency) {
>>  				/*
>> @@ -6129,8 +6140,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>  	for_each_cpu_wrap(cpu, cpus, target) {
>>  		if (!--nr)
>>  			return -1;
>> -		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>> -			break;
>> +
>> +		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>> +			/*
>> +			 * If Core Scheduling is enabled, select this cpu
>> +			 * only if the process cookie matches core cookie.
>> +			 */
>> +			if (sched_core_enabled(cpu_rq(cpu))) {
>> +				if (__cookie_match(cpu_rq(cpu), p))
>> +					break;
>> +			} else {
>> +				break;
>> +			}
>> +		}
> 
> Isn't this better and equivalent?
> 
> 	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> 		sched_core_cookie_match(cpu_rq(cpu), p))
> 		break;
>

 
That's my previous implementation in an earlier version.
But since this is a hot code path, we want to avoid the idle-core
check done by sched_core_cookie_match(), hence the direct
__cookie_match() call here.

>>  	}
>>  
>>  	time = cpu_clock(this) - time;
>> @@ -7530,8 +7552,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>  	 * We do not migrate tasks that are:
>>  	 * 1) throttled_lb_pair, or
>>  	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
>> -	 * 3) running (obviously), or
>> -	 * 4) are cache-hot on their current CPU.
>> +	 * 3) task's cookie does not match with this CPU's core cookie
>> +	 * 4) running (obviously), or
>> +	 * 5) are cache-hot on their current CPU.
>>  	 */
>>  	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>>  		return 0;
>> @@ -7566,6 +7589,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>  		return 0;
>>  	}
>>  
>> +	/*
>> +	 * Don't migrate task if the task's cookie does not match
>> +	 * with the destination CPU's core cookie.
>> +	 */
>> +	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
>> +		return 0;
>> +
>>  	/* Record that we found atleast one task that could run on dst_cpu */
>>  	env->flags &= ~LBF_ALL_PINNED;
>>  
>> @@ -8792,6 +8822,23 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>  					p->cpus_ptr))
>>  			continue;
>>  
>> +		if (sched_core_enabled(cpu_rq(this_cpu))) {
>> +			int i = 0;
>> +			bool cookie_match = false;
>> +
>> +			for_each_cpu(i, sched_group_span(group)) {
>> +				struct rq *rq = cpu_rq(i);
>> +
>> +				if (sched_core_cookie_match(rq, p)) {
>> +					cookie_match = true;
>> +					break;
>> +				}
>> +			}
>> +			/* Skip over this group if no cookie matched */
>> +			if (!cookie_match)
>> +				continue;
>> +		}
>> +
> 
> Again, I think this can be refactored because sched_core_cookie_match checks
> for sched_core_enabled()
> 
> 	int i = 0;
> 	bool cookie_match = false;
> 	for_each_cpu(i, sched_group_span(group)) {
> 		if (sched_core_cookie_match(cpu_rq(i), p))
> 			break;
> 	}
> 	if (i >= nr_cpu_ids)
> 		continue;

There is still a loop here when CONFIG_SCHED_CORE=n, which is unwanted, I guess.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 02/32] sched: Introduce sched_class::pick_task()
  2020-11-26  9:07     ` Peter Zijlstra
@ 2020-11-26 10:17       ` Vincent Guittot
  2020-11-26 12:40         ` Peter Zijlstra
  0 siblings, 1 reply; 150+ messages in thread
From: Vincent Guittot @ 2020-11-26 10:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, Ingo Molnar,
	Linus Torvalds, Frederic Weisbecker, Kees Cook, Greg Kerr,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, Jiang Biao,
	Alexandre Chartre, James Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, Hyser,Chris, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Thu, 26 Nov 2020 at 10:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Nov 25, 2020 at 05:28:36PM +0100, Vincent Guittot wrote:
> > On Wed, 18 Nov 2020 at 00:20, Joel Fernandes (Google)
>
> > > +#ifdef CONFIG_SMP
> > > +static struct task_struct *pick_task_fair(struct rq *rq)
> > > +{
> > > +       struct cfs_rq *cfs_rq = &rq->cfs;
> > > +       struct sched_entity *se;
> > > +
> > > +       if (!cfs_rq->nr_running)
> > > +               return NULL;
> > > +
> > > +       do {
> > > +               struct sched_entity *curr = cfs_rq->curr;
> > > +
> > > +               se = pick_next_entity(cfs_rq, NULL);
> >
> > Calling pick_next_entity clears buddies. This is fine without
> > coresched because the se will be the next one. But calling
> > pick_task_fair doesn't mean that the se will be used
>
> Urgh, nice one :/
>
> > > +
> > > +               if (curr) {
> > > +                       if (se && curr->on_rq)
> > > +                               update_curr(cfs_rq);
> > > +
> >
> > Shouldn't you check if cfs_rq is throttled ?
>
> Hmm,... I suppose we do.
>
> > > +                       if (!se || entity_before(curr, se))
> > > +                               se = curr;
> > > +               }
> > > +
> > > +               cfs_rq = group_cfs_rq(se);
> > > +       } while (cfs_rq);
> > > +
> > > +       return task_of(se);
> > > +}
> > > +#endif
>
> Something like so then?

yes. it seems ok

>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4354,6 +4354,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq
>  static void
>  set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +       clear_buddies(cfs_rq, se);
> +
>         /* 'current' is not kept within the tree. */
>         if (se->on_rq) {
>                 /*
> @@ -4440,8 +4442,6 @@ pick_next_entity(struct cfs_rq *cfs_rq,
>                 se = cfs_rq->last;
>         }
>
> -       clear_buddies(cfs_rq, se);
> -
>         return se;
>  }
>
> @@ -6982,20 +6982,29 @@ static void check_preempt_wakeup(struct
>  #ifdef CONFIG_SMP
>  static struct task_struct *pick_task_fair(struct rq *rq)
>  {
> -       struct cfs_rq *cfs_rq = &rq->cfs;
>         struct sched_entity *se;
> -
> +       struct cfs_rq *cfs_rq;
> +
> +again:
> +       cfs_rq = &rq->cfs;
>         if (!cfs_rq->nr_running)
>                 return NULL;
>
>         do {
>                 struct sched_entity *curr = cfs_rq->curr;
>
> -               if (curr && curr->on_rq)
> -                       update_curr(cfs_rq);
> +               /* When we pick for a remote RQ, we'll not have done put_prev_entity() */
> +               if (curr) {
> +                       if (curr->on_rq)
> +                               update_curr(cfs_rq);
> +                       else
> +                               curr = NULL;
>
> -               se = pick_next_entity(cfs_rq, curr);
> +                       if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> +                               goto again;
> +               }
>
> +               se = pick_next_entity(cfs_rq, curr);
>                 cfs_rq = group_cfs_rq(se);
>         } while (cfs_rq);
>

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 02/32] sched: Introduce sched_class::pick_task()
  2020-11-26 10:17       ` Vincent Guittot
@ 2020-11-26 12:40         ` Peter Zijlstra
  0 siblings, 0 replies; 150+ messages in thread
From: Peter Zijlstra @ 2020-11-26 12:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, Ingo Molnar,
	Linus Torvalds, Frederic Weisbecker, Kees Cook, Greg Kerr,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, Jiang Biao,
	Alexandre Chartre, James Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, Hyser,Chris, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Thu, Nov 26, 2020 at 11:17:48AM +0100, Vincent Guittot wrote:

> > Something like so then?
> 
> yes. it seems ok
> 
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c

> > @@ -6982,20 +6982,29 @@ static void check_preempt_wakeup(struct
> >  #ifdef CONFIG_SMP
> >  static struct task_struct *pick_task_fair(struct rq *rq)
> >  {
> >         struct sched_entity *se;
> > +       struct cfs_rq *cfs_rq;
> > +
> > +again:
> > +       cfs_rq = &rq->cfs;
> >         if (!cfs_rq->nr_running)
> >                 return NULL;
> >
> >         do {
> >                 struct sched_entity *curr = cfs_rq->curr;
> >
> > +               /* When we pick for a remote RQ, we'll not have done put_prev_entity() */
> > +               if (curr) {
> > +                       if (curr->on_rq)
> > +                               update_curr(cfs_rq);
> > +                       else
> > +                               curr = NULL;
> >
> > +                       if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> > +                               goto again;

Head-ache though; pick_task() was supposed to be stateless, but now
we're modifying a remote runqueue... I suppose it still works, because
irrespective of which task we end up picking (even idle), we'll schedule
the remote CPU, which would've resulted in the same (and possibly
triggered a reschedule if we'd not done it here).

There's a wrinkle though: other than in schedule(), there are places where
we dequeue() and keep running with the current task while we release
rq->lock, and this has preemption enabled as well.

This means that if we do this, the remote CPU could preempt, but the
task is then no longer on the runqueue.

I _think_ it all still works, but yuck!

> > +               }
> >
> > +               se = pick_next_entity(cfs_rq, curr);
> >                 cfs_rq = group_cfs_rq(se);
> >         } while (cfs_rq);
> >

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling
  2020-11-26  8:29         ` Peter Zijlstra
@ 2020-11-26 22:27           ` Balbir Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Balbir Singh @ 2020-11-26 22:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Thu, Nov 26, 2020 at 09:29:14AM +0100, Peter Zijlstra wrote:
> On Thu, Nov 26, 2020 at 10:05:19AM +1100, Balbir Singh wrote:
> > > @@ -5259,7 +5254,20 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > >  			 * Optimize the 'normal' case where there aren't any
> > >  			 * cookies and we don't need to sync up.
> > >  			 */
> > > -			if (i == cpu && !need_sync && !p->core_cookie) {
> > > +			if (i == cpu && !need_sync) {
> > > +				if (p->core_cookie) {
> > > +					/*
> > > +					 * This optimization is only valid as
> > > +					 * long as there are no cookies
> > 
> > This is not entirely true, need_sync is a function of core cookies, so I
> > think this needs more clarification, it sounds like we enter this when
> > the core has no cookies, but the task has a core_cookie? The term cookie
> > is quite overloaded when used in the context of core vs task.
> 
> Nah, its the same. So each task gets a cookie to identify the 'group' of
> tasks (possibly just itself) it is allowed to share a core with.
> 
> When we do core task selection, the core gets assigned the cookie of the
> group it will run, same thing.
> 
> > Effectively from what I understand this means that p wants to be
> > coscheduled, but the core itself is not coscheduling anything at the
> > moment, so we need to see if we should do a sync and that sync might
> > cause p to get kicked out and a higher priority class to come in?
> 
> This whole patch is about eliding core-wide task selection when it is
> not required. IOW an optimization.
> 
> When there wasn't a core cookie (IOW, the previous task selection wasn't
> core wide and limited) and the task we just selected for our own CPU
> also didn't have a cookie (IOW it doesn't have to be core-wide) we can
> skip the core wide task selection and schedule just this CPU and call it
> a day.
> 
> The logic was subtly wrong, this patch fixes it. A next patch completely
> rewrites it again to make it far simpler again. Don't spend time trying
> to understand this patch (unless you're _that_ kind of person ;-) but
> instead apply the whole thing and look at the resulting pick_next_task()
> function.

Thanks, I'll look at the git tree and see what the final outcome looks like.

Balbir Singh.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-26  9:26                 ` Li, Aubrey
@ 2020-11-30  9:33                   ` Balbir Singh
  2020-11-30 12:29                     ` Li, Aubrey
  0 siblings, 1 reply; 150+ messages in thread
From: Balbir Singh @ 2020-11-30  9:33 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Thu, Nov 26, 2020 at 05:26:31PM +0800, Li, Aubrey wrote:
> On 2020/11/26 16:32, Balbir Singh wrote:
> > On Thu, Nov 26, 2020 at 11:20:41AM +0800, Li, Aubrey wrote:
> >> On 2020/11/26 6:57, Balbir Singh wrote:
> >>> On Wed, Nov 25, 2020 at 11:12:53AM +0800, Li, Aubrey wrote:
> >>>> On 2020/11/24 23:42, Peter Zijlstra wrote:
> >>>>> On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote:
> >>>>>>>> +#ifdef CONFIG_SCHED_CORE
> >>>>>>>> +		/*
> >>>>>>>> +		 * Skip this cpu if source task's cookie does not match
> >>>>>>>> +		 * with CPU's core cookie.
> >>>>>>>> +		 */
> >>>>>>>> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
> >>>>>>>> +			continue;
> >>>>>>>> +#endif
> >>>>>>>> +
> >>>>>>>
> >>>>>>> Any reason this is under an #ifdef? In sched_core_cookie_match() won't
> >>>>>>> the check for sched_core_enabled() do the right thing even when
> >>>>>>> CONFIG_SCHED_CORE is not enabled?
> >>>>>> Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not
> >>>>>> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
> >>>>>> sense to leave a core scheduler specific function here even at compile
> >>>>>> time. Also, for the cases in hot path, this saves CPU cycles to avoid
> >>>>>> a judgment.
> >>>>>
> >>>>> No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is
> >>>>> more.
> >>>>>
> >>>>
> >>>> Okay, I pasted the refined patch here.
> >>>> @Joel, please let me know if you want me to send it in a separated thread.
> >>>>
> >>>
> >>> You still have a bunch of #ifdefs, can't we just do
> >>>
> >>> #ifndef CONFIG_SCHED_CORE
> >>> static inline bool sched_core_enabled(struct rq *rq)
> >>> {
> >>>         return false;
> >>> }
> >>> #endif
> >>>
> >>> and frankly I think even that is not needed because there is a jump
> >>> label __sched_core_enabled that tells us if sched_core is enabled or
> >>> not.
> >>
> >> Hmm..., I need another wrapper for CONFIG_SCHED_CORE specific variables.
> >> How about this one?
> >>
> > 
> > Much better :)
> >  
> >> Thanks,
> >> -Aubrey
> >>
> >> From 61dac9067e66b5b9ea26c684c8c8235714bab38a Mon Sep 17 00:00:00 2001
> >> From: Aubrey Li <aubrey.li@linux.intel.com>
> >> Date: Thu, 26 Nov 2020 03:08:04 +0000
> >> Subject: [PATCH] sched: migration changes for core scheduling
> >>
> >>  - Don't migrate if there is a cookie mismatch
> >>      Load balance tries to move task from busiest CPU to the
> >>      destination CPU. When core scheduling is enabled, if the
> >>      task's cookie does not match with the destination CPU's
> >>      core cookie, this task will be skipped by this CPU. This
> >>      mitigates the forced idle time on the destination CPU.
> >>
> >>  - Select cookie matched idle CPU
> >>      In the fast path of task wakeup, select the first cookie matched
> >>      idle CPU instead of the first idle CPU.
> >>
> >>  - Find cookie matched idlest CPU
> >>      In the slow path of task wakeup, find the idlest CPU whose core
> >>      cookie matches with task's cookie
> >>
> >>  - Don't migrate task if cookie not match
> >>      For the NUMA load balance, don't migrate task to the CPU whose
> >>      core cookie does not match with task's cookie
> >>
> >> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> >> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> >> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> >> Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> ---
> >>  kernel/sched/fair.c  | 57 ++++++++++++++++++++++++++++++++++++++++----
> >>  kernel/sched/sched.h | 43 +++++++++++++++++++++++++++++++++
> >>  2 files changed, 95 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index de82f88ba98c..70dd013dff1d 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> >>  		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
> >>  			continue;
> >>  
> >> +		/*
> >> +		 * Skip this cpu if source task's cookie does not match
> >> +		 * with CPU's core cookie.
> >> +		 */
> >> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
> >> +			continue;
> >> +
> >>  		env->dst_cpu = cpu;
> >>  		if (task_numa_compare(env, taskimp, groupimp, maymove))
> >>  			break;
> >> @@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
> >>  
> >>  	/* Traverse only the allowed CPUs */
> >>  	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
> >> +		struct rq *rq = cpu_rq(i);
> >> +
> >> +		if (!sched_core_cookie_match(rq, p))
> >> +			continue;
> >> +
> >>  		if (sched_idle_cpu(i))
> >>  			return i;
> >>  
> >>  		if (available_idle_cpu(i)) {
> >> -			struct rq *rq = cpu_rq(i);
> >>  			struct cpuidle_state *idle = idle_get_state(rq);
> >>  			if (idle && idle->exit_latency < min_exit_latency) {
> >>  				/*
> >> @@ -6129,8 +6140,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >>  	for_each_cpu_wrap(cpu, cpus, target) {
> >>  		if (!--nr)
> >>  			return -1;
> >> -		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> >> -			break;
> >> +
> >> +		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
> >> +			/*
> >> +			 * If Core Scheduling is enabled, select this cpu
> >> +			 * only if the process cookie matches core cookie.
> >> +			 */
> >> +			if (sched_core_enabled(cpu_rq(cpu))) {
> >> +				if (__cookie_match(cpu_rq(cpu), p))
> >> +					break;
> >> +			} else {
> >> +				break;
> >> +			}
> >> +		}
> > 
> > Isn't this better and equivalent?
> > 
> > 	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > 		sched_core_cookie_match(cpu_rq(cpu), p))
> > 		break;
> >
> 
>  
> That's my previous implementation in the earlier version.
> But since here is the hot code path, we want to remove the idle
> core check in sched_core_cookie_match.

I see, so we basically need a jump label. If sched_core_cookie_match
can be inlined with a check for sched_core_enabled() upfront, it might
resolve much of the concern; the readability of this section of code is
not the best.
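For illustration, a minimal userspace sketch of the shape suggested above: the cheap sched_core_enabled() check is inlined up front so the disabled case returns immediately (in the kernel this would be backed by a static/jump label; the struct layouts here are simplified stand-ins, not the real rq/task_struct):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel types involved. */
struct core { unsigned long core_cookie; };
struct rq   { bool core_enabled; struct core *core; };
struct task { unsigned long core_cookie; };

static inline bool sched_core_enabled(struct rq *rq)
{
	return rq->core_enabled;
}

static inline bool sched_core_cookie_match(struct rq *rq, struct task *p)
{
	/* Fast path: core scheduling disabled, every CPU matches. */
	if (!sched_core_enabled(rq))
		return true;

	return rq->core->core_cookie == p->core_cookie;
}
```

With this shape the callers in fair.c need no #ifdef at all; when core scheduling is off the helper collapses to a single predictable branch.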

> 
> >>  	}
> >>  
> >>  	time = cpu_clock(this) - time;
> >> @@ -7530,8 +7552,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> >>  	 * We do not migrate tasks that are:
> >>  	 * 1) throttled_lb_pair, or
> >>  	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
> >> -	 * 3) running (obviously), or
> >> -	 * 4) are cache-hot on their current CPU.
> >> +	 * 3) task's cookie does not match with this CPU's core cookie
> >> +	 * 4) running (obviously), or
> >> +	 * 5) are cache-hot on their current CPU.
> >>  	 */
> >>  	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> >>  		return 0;
> >> @@ -7566,6 +7589,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> >>  		return 0;
> >>  	}
> >>  
> >> +	/*
> >> +	 * Don't migrate task if the task's cookie does not match
> >> +	 * with the destination CPU's core cookie.
> >> +	 */
> >> +	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
> >> +		return 0;
> >> +
> >>  	/* Record that we found atleast one task that could run on dst_cpu */
> >>  	env->flags &= ~LBF_ALL_PINNED;
> >>  
> >> @@ -8792,6 +8822,23 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>  					p->cpus_ptr))
> >>  			continue;
> >>  
> >> +		if (sched_core_enabled(cpu_rq(this_cpu))) {
> >> +			int i = 0;
> >> +			bool cookie_match = false;
> >> +
> >> +			for_each_cpu(i, sched_group_span(group)) {
> >> +				struct rq *rq = cpu_rq(i);
> >> +
> >> +				if (sched_core_cookie_match(rq, p)) {
> >> +					cookie_match = true;
> >> +					break;
> >> +				}
> >> +			}
> >> +			/* Skip over this group if no cookie matched */
> >> +			if (!cookie_match)
> >> +				continue;
> >> +		}
> >> +
> > 
> > Again, I think this can be refactored because sched_core_cookie_match checks
> > for sched_core_enabled()
> > 
> > 	int i = 0;
> > 	bool cookie_match = false;
> > 	for_each_cpu(i, sched_group_span(group)) {
> > 		if (sched_core_cookie_match(cpu_rq(i), p))
> > 			break;
> > 	}
> > 	if (i >= nr_cpu_ids)
> > 		continue;
> 
> There is a loop here when CONFIG_SCHED_CORE=n, which is unwanted I guess.
> 

Yes, potentially. Maybe abstract the for_each_cpu loop into a function and
then optimize out the case for SCHED_CORE=n; I feel all the extra checks in
the various places make the code hard to read.
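A sketch of that abstraction, again as simplified userspace C (the helper name sched_group_cookie_match and the array-based "group" are illustrative, not the kernel API): the whole group scan lives in one helper, and the CONFIG_SCHED_CORE=n build gets a stub with no loop at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simulate the Kconfig option with a plain macro for this sketch. */
#define CONFIG_SCHED_CORE 1

struct rq   { unsigned long core_cookie; bool core_enabled; };
struct task { unsigned long core_cookie; };

static bool sched_core_cookie_match(struct rq *rq, struct task *p)
{
	if (!rq->core_enabled)
		return true;
	return rq->core_cookie == p->core_cookie;
}

#ifdef CONFIG_SCHED_CORE
/* Scan every rq in the group; true if any CPU's core cookie matches. */
static bool sched_group_cookie_match(struct rq *rqs, size_t nr,
				     struct task *p)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (sched_core_cookie_match(&rqs[i], p))
			return true;
	return false;
}
#else
/* SCHED_CORE=n: no loop at all, the call compiles away to "true". */
static bool sched_group_cookie_match(struct rq *rqs, size_t nr,
				     struct task *p)
{
	return true;
}
#endif
```

find_idlest_group() would then just do `if (!sched_group_cookie_match(...)) continue;`, keeping the caller readable while still letting the SCHED_CORE=n build drop the scan.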

Balbir

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-17 23:19 ` [PATCH -tip 14/32] sched: migration changes for core scheduling Joel Fernandes (Google)
  2020-11-22 23:54   ` Balbir Singh
@ 2020-11-30 10:35   ` Vincent Guittot
  2020-11-30 12:32     ` Li, Aubrey
  1 sibling, 1 reply; 150+ messages in thread
From: Vincent Guittot @ 2020-11-30 10:35 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Gleixner,
	linux-kernel, Ingo Molnar, Linus Torvalds, Frederic Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	dfaggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	Jiang Biao, Alexandre Chartre, James Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, jsbarnes, Hyser,Chris, Ben Segall,
	Josh Don, Hao Luo, Tom Lendacky, Aubrey Li, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Wed, 18 Nov 2020 at 00:20, Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> From: Aubrey Li <aubrey.li@intel.com>
>
>  - Don't migrate if there is a cookie mismatch
>      Load balance tries to move task from busiest CPU to the
>      destination CPU. When core scheduling is enabled, if the
>      task's cookie does not match with the destination CPU's
>      core cookie, this task will be skipped by this CPU. This
>      mitigates the forced idle time on the destination CPU.
>
>  - Select cookie matched idle CPU
>      In the fast path of task wakeup, select the first cookie matched
>      idle CPU instead of the first idle CPU.
>
>  - Find cookie matched idlest CPU
>      In the slow path of task wakeup, find the idlest CPU whose core
>      cookie matches with task's cookie
>
>  - Don't migrate task if cookie not match
>      For the NUMA load balance, don't migrate task to the CPU whose
>      core cookie does not match with task's cookie
>
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/fair.c  | 64 ++++++++++++++++++++++++++++++++++++++++----
>  kernel/sched/sched.h | 29 ++++++++++++++++++++
>  2 files changed, 88 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index de82f88ba98c..ceb3906c9a8a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1921,6 +1921,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>                 if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>                         continue;
>
> +#ifdef CONFIG_SCHED_CORE
> +               /*
> +                * Skip this cpu if source task's cookie does not match
> +                * with CPU's core cookie.
> +                */
> +               if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
> +                       continue;
> +#endif
> +
>                 env->dst_cpu = cpu;
>                 if (task_numa_compare(env, taskimp, groupimp, maymove))
>                         break;
> @@ -5867,11 +5876,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>
>         /* Traverse only the allowed CPUs */
>         for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
> +               struct rq *rq = cpu_rq(i);
> +
> +#ifdef CONFIG_SCHED_CORE
> +               if (!sched_core_cookie_match(rq, p))
> +                       continue;
> +#endif
> +
>                 if (sched_idle_cpu(i))
>                         return i;
>
>                 if (available_idle_cpu(i)) {
> -                       struct rq *rq = cpu_rq(i);
>                         struct cpuidle_state *idle = idle_get_state(rq);
>                         if (idle && idle->exit_latency < min_exit_latency) {
>                                 /*
> @@ -6129,8 +6144,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>         for_each_cpu_wrap(cpu, cpus, target) {
>                 if (!--nr)
>                         return -1;
> -               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> -                       break;
> +
> +               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
> +#ifdef CONFIG_SCHED_CORE
> +                       /*
> +                        * If Core Scheduling is enabled, select this cpu
> +                        * only if the process cookie matches core cookie.
> +                        */
> +                       if (sched_core_enabled(cpu_rq(cpu)) &&
> +                           p->core_cookie == cpu_rq(cpu)->core->core_cookie)
> +#endif
> +                               break;
> +               }

This makes code unreadable.
Put this coresched specific stuff in an inline function; You can have
a look at what is done with asym_fits_capacity()

>         }
>
>         time = cpu_clock(this) - time;
> @@ -7530,8 +7555,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>          * We do not migrate tasks that are:
>          * 1) throttled_lb_pair, or
>          * 2) cannot be migrated to this CPU due to cpus_ptr, or
> -        * 3) running (obviously), or
> -        * 4) are cache-hot on their current CPU.
> +        * 3) task's cookie does not match with this CPU's core cookie
> +        * 4) running (obviously), or
> +        * 5) are cache-hot on their current CPU.
>          */
>         if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>                 return 0;
> @@ -7566,6 +7592,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>                 return 0;
>         }
>
> +#ifdef CONFIG_SCHED_CORE
> +       /*
> +        * Don't migrate task if the task's cookie does not match
> +        * with the destination CPU's core cookie.
> +        */
> +       if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
> +               return 0;
> +#endif
> +
>         /* Record that we found atleast one task that could run on dst_cpu */
>         env->flags &= ~LBF_ALL_PINNED;
>
> @@ -8792,6 +8827,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                                         p->cpus_ptr))
>                         continue;
>
> +#ifdef CONFIG_SCHED_CORE
> +               if (sched_core_enabled(cpu_rq(this_cpu))) {
> +                       int i = 0;
> +                       bool cookie_match = false;
> +
> +                       for_each_cpu(i, sched_group_span(group)) {
> +                               struct rq *rq = cpu_rq(i);
> +
> +                               if (sched_core_cookie_match(rq, p)) {
> +                                       cookie_match = true;
> +                                       break;
> +                               }
> +                       }
> +                       /* Skip over this group if no cookie matched */
> +                       if (!cookie_match)
> +                               continue;
> +               }
> +#endif

same here, encapsulate this to keep find_idlest_group readable

> +
>                 local_group = cpumask_test_cpu(this_cpu,
>                                                sched_group_span(group));
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e72942a9ee11..de553d39aa40 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1135,6 +1135,35 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
>
>  bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
>
> +/*
> + * Helper to check if the CPU's core cookie matches with the task's cookie
> + * when core scheduling is enabled.
> + * A special case is that the task's cookie always matches with CPU's core
> + * cookie if the CPU is in an idle core.
> + */
> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
> +{
> +       bool idle_core = true;
> +       int cpu;
> +
> +       /* Ignore cookie match if core scheduler is not enabled on the CPU. */
> +       if (!sched_core_enabled(rq))
> +               return true;
> +
> +       for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
> +               if (!available_idle_cpu(cpu)) {
> +                       idle_core = false;
> +                       break;
> +               }
> +       }
> +
> +       /*
> +        * A CPU in an idle core is always the best choice for tasks with
> +        * cookies.
> +        */
> +       return idle_core || rq->core->core_cookie == p->core_cookie;
> +}
> +
>  extern void queue_core_balance(struct rq *rq);
>
>  #else /* !CONFIG_SCHED_CORE */
> --
> 2.29.2.299.gdc1121823c-goog
>

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-30  9:33                   ` Balbir Singh
@ 2020-11-30 12:29                     ` Li, Aubrey
  2020-12-02 14:09                       ` Li, Aubrey
  0 siblings, 1 reply; 150+ messages in thread
From: Li, Aubrey @ 2020-11-30 12:29 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On 2020/11/30 17:33, Balbir Singh wrote:
> On Thu, Nov 26, 2020 at 05:26:31PM +0800, Li, Aubrey wrote:
>> On 2020/11/26 16:32, Balbir Singh wrote:
>>> On Thu, Nov 26, 2020 at 11:20:41AM +0800, Li, Aubrey wrote:
>>>> On 2020/11/26 6:57, Balbir Singh wrote:
>>>>> On Wed, Nov 25, 2020 at 11:12:53AM +0800, Li, Aubrey wrote:
>>>>>> On 2020/11/24 23:42, Peter Zijlstra wrote:
>>>>>>> On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote:
>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>> +		/*
>>>>>>>>>> +		 * Skip this cpu if source task's cookie does not match
>>>>>>>>>> +		 * with CPU's core cookie.
>>>>>>>>>> +		 */
>>>>>>>>>> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>>>>>> +			continue;
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> Any reason this is under an #ifdef? In sched_core_cookie_match() won't
>>>>>>>>> the check for sched_core_enabled() do the right thing even when
>>>>>>>>> CONFIG_SCHED_CORE is not enabled?
>>>>>>>> Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not
>>>>>>>> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
>>>>>>>> sense to leave a core scheduler specific function here even at compile
>>>>>>>> time. Also, for the cases in hot path, this saves CPU cycles to avoid
>>>>>>>> a judgment.
>>>>>>>
>>>>>>> No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is
>>>>>>> more.
>>>>>>>
>>>>>>
>>>>>> Okay, I pasted the refined patch here.
>>>>>> @Joel, please let me know if you want me to send it in a separated thread.
>>>>>>
>>>>>
>>>>> You still have a bunch of #ifdefs, can't we just do
>>>>>
>>>>> #ifndef CONFIG_SCHED_CORE
>>>>> static inline bool sched_core_enabled(struct rq *rq)
>>>>> {
>>>>>         return false;
>>>>> }
>>>>> #endif
>>>>>
>>>>> and frankly I think even that is not needed because there is a jump
>>>>> label __sched_core_enabled that tells us if sched_core is enabled or
>>>>> not.
>>>>
>>>> Hmm..., I need another wrapper for CONFIG_SCHED_CORE specific variables.
>>>> How about this one?
>>>>
>>>
>>> Much better :)
>>>  
>>>> Thanks,
>>>> -Aubrey
>>>>
>>>> From 61dac9067e66b5b9ea26c684c8c8235714bab38a Mon Sep 17 00:00:00 2001
>>>> From: Aubrey Li <aubrey.li@linux.intel.com>
>>>> Date: Thu, 26 Nov 2020 03:08:04 +0000
>>>> Subject: [PATCH] sched: migration changes for core scheduling
>>>>
>>>>  - Don't migrate if there is a cookie mismatch
>>>>      Load balance tries to move task from busiest CPU to the
>>>>      destination CPU. When core scheduling is enabled, if the
>>>>      task's cookie does not match with the destination CPU's
>>>>      core cookie, this task will be skipped by this CPU. This
>>>>      mitigates the forced idle time on the destination CPU.
>>>>
>>>>  - Select cookie matched idle CPU
>>>>      In the fast path of task wakeup, select the first cookie matched
>>>>      idle CPU instead of the first idle CPU.
>>>>
>>>>  - Find cookie matched idlest CPU
>>>>      In the slow path of task wakeup, find the idlest CPU whose core
>>>>      cookie matches with task's cookie
>>>>
>>>>  - Don't migrate task if cookie not match
>>>>      For the NUMA load balance, don't migrate task to the CPU whose
>>>>      core cookie does not match with task's cookie
>>>>
>>>> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
>>>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>>> Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
>>>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>>>> ---
>>>>  kernel/sched/fair.c  | 57 ++++++++++++++++++++++++++++++++++++++++----
>>>>  kernel/sched/sched.h | 43 +++++++++++++++++++++++++++++++++
>>>>  2 files changed, 95 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index de82f88ba98c..70dd013dff1d 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>>>  		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>>>  			continue;
>>>>  
>>>> +		/*
>>>> +		 * Skip this cpu if source task's cookie does not match
>>>> +		 * with CPU's core cookie.
>>>> +		 */
>>>> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>> +			continue;
>>>> +
>>>>  		env->dst_cpu = cpu;
>>>>  		if (task_numa_compare(env, taskimp, groupimp, maymove))
>>>>  			break;
>>>> @@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>>  
>>>>  	/* Traverse only the allowed CPUs */
>>>>  	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>>> +		struct rq *rq = cpu_rq(i);
>>>> +
>>>> +		if (!sched_core_cookie_match(rq, p))
>>>> +			continue;
>>>> +
>>>>  		if (sched_idle_cpu(i))
>>>>  			return i;
>>>>  
>>>>  		if (available_idle_cpu(i)) {
>>>> -			struct rq *rq = cpu_rq(i);
>>>>  			struct cpuidle_state *idle = idle_get_state(rq);
>>>>  			if (idle && idle->exit_latency < min_exit_latency) {
>>>>  				/*
>>>> @@ -6129,8 +6140,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>  	for_each_cpu_wrap(cpu, cpus, target) {
>>>>  		if (!--nr)
>>>>  			return -1;
>>>> -		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>>> -			break;
>>>> +
>>>> +		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>>> +			/*
>>>> +			 * If Core Scheduling is enabled, select this cpu
>>>> +			 * only if the process cookie matches core cookie.
>>>> +			 */
>>>> +			if (sched_core_enabled(cpu_rq(cpu))) {
>>>> +				if (__cookie_match(cpu_rq(cpu), p))
>>>> +					break;
>>>> +			} else {
>>>> +				break;
>>>> +			}
>>>> +		}
>>>
>>> Isn't this better and equivalent?
>>>
>>> 	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
>>> 		sched_core_cookie_match(cpu_rq(cpu), p))
>>> 		break;
>>>
>>
>>  
>> That's my previous implementation in the earlier version.
>> But since here is the hot code path, we want to remove the idle
>> core check in sched_core_cookie_match.
> 
> I see, so we basically need a jump label, if sched_core_cookie_match
> can be inlined with a check for sched_core_enabled() upfront, it might
> solve a lot of the concern, readability of this section of code is not
> the best.
> 
>>
>>>>  	}
>>>>  
>>>>  	time = cpu_clock(this) - time;
>>>> @@ -7530,8 +7552,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>>>  	 * We do not migrate tasks that are:
>>>>  	 * 1) throttled_lb_pair, or
>>>>  	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
>>>> -	 * 3) running (obviously), or
>>>> -	 * 4) are cache-hot on their current CPU.
>>>> +	 * 3) task's cookie does not match with this CPU's core cookie
>>>> +	 * 4) running (obviously), or
>>>> +	 * 5) are cache-hot on their current CPU.
>>>>  	 */
>>>>  	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>>>>  		return 0;
>>>> @@ -7566,6 +7589,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>>>  		return 0;
>>>>  	}
>>>>  
>>>> +	/*
>>>> +	 * Don't migrate task if the task's cookie does not match
>>>> +	 * with the destination CPU's core cookie.
>>>> +	 */
>>>> +	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
>>>> +		return 0;
>>>> +
>>>>  	/* Record that we found atleast one task that could run on dst_cpu */
>>>>  	env->flags &= ~LBF_ALL_PINNED;
>>>>  
>>>> @@ -8792,6 +8822,23 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>>>  					p->cpus_ptr))
>>>>  			continue;
>>>>  
>>>> +		if (sched_core_enabled(cpu_rq(this_cpu))) {
>>>> +			int i = 0;
>>>> +			bool cookie_match = false;
>>>> +
>>>> +			for_each_cpu(i, sched_group_span(group)) {
>>>> +				struct rq *rq = cpu_rq(i);
>>>> +
>>>> +				if (sched_core_cookie_match(rq, p)) {
>>>> +					cookie_match = true;
>>>> +					break;
>>>> +				}
>>>> +			}
>>>> +			/* Skip over this group if no cookie matched */
>>>> +			if (!cookie_match)
>>>> +				continue;
>>>> +		}
>>>> +
>>>
>>> Again, I think this can be refactored because sched_core_cookie_match checks
>>> for sched_core_enabled()
>>>
>>> 	int i = 0;
>>> 	bool cookie_match = false;
>>> 	for_each_cpu(i, sched_group_span(group)) {
>>> 		if (sched_core_cookie_match(cpu_rq(i), p))
>>> 			break;
>>> 	}
>>> 	if (i >= nr_cpu_ids)
>>> 		continue;
>>
>> There is a loop here when CONFIG_SCHED_CORE=n, which is unwanted I guess.
>>
> 
> Yes, potentially, may be abstract the for_each_cpu into a function and then
> optimize out the case for SCHED_CORE=n, I feel all the extra checks in the
> various places make the code hard to read.

Okay, I see your point. Let me see if I can make it better.
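One way to keep both the hot path cheap and the callers readable is to split the check in two, roughly as below (simplified userspace C; the SMT-sibling bookkeeping is a stand-in for cpu_smt_mask()/available_idle_cpu(), and the names mirror this thread rather than any final patch): a raw __cookie_match() for select_idle_cpu(), and a full sched_core_cookie_match() that additionally treats a fully idle core as matching any cookie.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rq {
	bool enabled;                 /* core scheduling enabled on this rq */
	unsigned long core_cookie;
	const bool *sibling_idle;     /* idle state of each SMT sibling */
	size_t nr_siblings;
};
struct task { unsigned long core_cookie; };

/* Raw cookie compare: the cheap check the wakeup fast path wants. */
static inline bool __cookie_match(struct rq *rq, struct task *p)
{
	return rq->core_cookie == p->core_cookie;
}

/* Full check: a completely idle core matches any cookie. */
static inline bool sched_core_cookie_match(struct rq *rq, struct task *p)
{
	bool idle_core = true;
	size_t i;

	if (!rq->enabled)
		return true;

	for (i = 0; i < rq->nr_siblings; i++) {
		if (!rq->sibling_idle[i]) {
			idle_core = false;
			break;
		}
	}
	return idle_core || __cookie_match(rq, p);
}
```

select_idle_cpu() can then call __cookie_match() directly and skip the per-sibling idle scan, while the slow paths keep the idle-core special case.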

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-30 10:35   ` Vincent Guittot
@ 2020-11-30 12:32     ` Li, Aubrey
  0 siblings, 0 replies; 150+ messages in thread
From: Li, Aubrey @ 2020-11-30 12:32 UTC (permalink / raw)
  To: Vincent Guittot, Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Gleixner,
	linux-kernel, Ingo Molnar, Linus Torvalds, Frederic Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, graf, konrad.wilk,
	dfaggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	Jiang Biao, Alexandre Chartre, James Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, jsbarnes, Hyser,Chris, Ben Segall,
	Josh Don, Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney,
	Tim Chen

On 2020/11/30 18:35, Vincent Guittot wrote:
> On Wed, 18 Nov 2020 at 00:20, Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
>>
>> From: Aubrey Li <aubrey.li@intel.com>
>>
>>  - Don't migrate if there is a cookie mismatch
>>      Load balance tries to move task from busiest CPU to the
>>      destination CPU. When core scheduling is enabled, if the
>>      task's cookie does not match with the destination CPU's
>>      core cookie, this task will be skipped by this CPU. This
>>      mitigates the forced idle time on the destination CPU.
>>
>>  - Select cookie matched idle CPU
>>      In the fast path of task wakeup, select the first cookie matched
>>      idle CPU instead of the first idle CPU.
>>
>>  - Find cookie matched idlest CPU
>>      In the slow path of task wakeup, find the idlest CPU whose core
>>      cookie matches with task's cookie
>>
>>  - Don't migrate task if cookie not match
>>      For the NUMA load balance, don't migrate task to the CPU whose
>>      core cookie does not match with task's cookie
>>
>> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>>  kernel/sched/fair.c  | 64 ++++++++++++++++++++++++++++++++++++++++----
>>  kernel/sched/sched.h | 29 ++++++++++++++++++++
>>  2 files changed, 88 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index de82f88ba98c..ceb3906c9a8a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1921,6 +1921,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>                 if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>                         continue;
>>
>> +#ifdef CONFIG_SCHED_CORE
>> +               /*
>> +                * Skip this cpu if source task's cookie does not match
>> +                * with CPU's core cookie.
>> +                */
>> +               if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>> +                       continue;
>> +#endif
>> +
>>                 env->dst_cpu = cpu;
>>                 if (task_numa_compare(env, taskimp, groupimp, maymove))
>>                         break;
>> @@ -5867,11 +5876,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>
>>         /* Traverse only the allowed CPUs */
>>         for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>> +               struct rq *rq = cpu_rq(i);
>> +
>> +#ifdef CONFIG_SCHED_CORE
>> +               if (!sched_core_cookie_match(rq, p))
>> +                       continue;
>> +#endif
>> +
>>                 if (sched_idle_cpu(i))
>>                         return i;
>>
>>                 if (available_idle_cpu(i)) {
>> -                       struct rq *rq = cpu_rq(i);
>>                         struct cpuidle_state *idle = idle_get_state(rq);
>>                         if (idle && idle->exit_latency < min_exit_latency) {
>>                                 /*
>> @@ -6129,8 +6144,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>         for_each_cpu_wrap(cpu, cpus, target) {
>>                 if (!--nr)
>>                         return -1;
>> -               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>> -                       break;
>> +
>> +               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>> +#ifdef CONFIG_SCHED_CORE
>> +                       /*
>> +                        * If Core Scheduling is enabled, select this cpu
>> +                        * only if the process cookie matches core cookie.
>> +                        */
>> +                       if (sched_core_enabled(cpu_rq(cpu)) &&
>> +                           p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>> +#endif
>> +                               break;
>> +               }
> 
> This makes code unreadable.
> Put this coresched specific stuff in an inline function; You can have
> a look at what is done with asym_fits_capacity()
> 
This is done in a refined version. Sorry, the version I pasted earlier in
this thread is not the latest one.

>>         }
>>
>>         time = cpu_clock(this) - time;
>> @@ -7530,8 +7555,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>          * We do not migrate tasks that are:
>>          * 1) throttled_lb_pair, or
>>          * 2) cannot be migrated to this CPU due to cpus_ptr, or
>> -        * 3) running (obviously), or
>> -        * 4) are cache-hot on their current CPU.
>> +        * 3) task's cookie does not match with this CPU's core cookie
>> +        * 4) running (obviously), or
>> +        * 5) are cache-hot on their current CPU.
>>          */
>>         if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>>                 return 0;
>> @@ -7566,6 +7592,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>                 return 0;
>>         }
>>
>> +#ifdef CONFIG_SCHED_CORE
>> +       /*
>> +        * Don't migrate task if the task's cookie does not match
>> +        * with the destination CPU's core cookie.
>> +        */
>> +       if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
>> +               return 0;
>> +#endif
>> +
>>         /* Record that we found atleast one task that could run on dst_cpu */
>>         env->flags &= ~LBF_ALL_PINNED;
>>
>> @@ -8792,6 +8827,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>                                         p->cpus_ptr))
>>                         continue;
>>
>> +#ifdef CONFIG_SCHED_CORE
>> +               if (sched_core_enabled(cpu_rq(this_cpu))) {
>> +                       int i = 0;
>> +                       bool cookie_match = false;
>> +
>> +                       for_each_cpu(i, sched_group_span(group)) {
>> +                               struct rq *rq = cpu_rq(i);
>> +
>> +                               if (sched_core_cookie_match(rq, p)) {
>> +                                       cookie_match = true;
>> +                                       break;
>> +                               }
>> +                       }
>> +                       /* Skip over this group if no cookie matched */
>> +                       if (!cookie_match)
>> +                               continue;
>> +               }
>> +#endif
> 
> same here, encapsulate this to keep find_idlest_group readable
> 
Okay, I'll try to refine this.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-17 23:19 ` [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork Joel Fernandes (Google)
                     ` (5 preceding siblings ...)
  2020-11-25 13:03   ` Peter Zijlstra
@ 2020-11-30 23:05   ` Balbir Singh
  6 siblings, 0 replies; 150+ messages in thread
From: Balbir Singh @ 2020-11-30 23:05 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> In order to prevent interference and clearly support both per-task and CGroup
> APIs, split the cookie into 2 and allow it to be set from either per-task, or
> CGroup API. The final cookie is the combined value of both and is computed when
> the stop-machine executes during a change of cookie.
> 
> Also, for the per-task cookie, it will get weird if we use pointers of any
> emphemeral objects. For this reason, introduce a refcounted object who's sole
> purpose is to assign unique cookie value by way of the object's pointer.
> 
> While at it, refactor the CGroup code a bit. Future patches will introduce more
> APIs and support.
> 
> Reviewed-by: Josh Don <joshdon@google.com>
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  include/linux/sched.h |   2 +
>  kernel/sched/core.c   | 241 ++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/debug.c  |   4 +
>  3 files changed, 236 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a60868165590..c6a3b0fa952b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -688,6 +688,8 @@ struct task_struct {
>  #ifdef CONFIG_SCHED_CORE
>  	struct rb_node			core_node;
>  	unsigned long			core_cookie;
> +	unsigned long			core_task_cookie;
> +	unsigned long			core_group_cookie;
>  	unsigned int			core_occupation;
>  #endif
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b99a7493d590..7ccca355623a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -346,11 +346,14 @@ void sched_core_put(void)
>  	mutex_unlock(&sched_core_mutex);
>  }
>  
> +static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
> +
>  #else /* !CONFIG_SCHED_CORE */
>  
>  static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
>  static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
>  static bool sched_core_enqueued(struct task_struct *task) { return false; }
> +static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2) { return 0; }
>  
>  #endif /* CONFIG_SCHED_CORE */
>  
> @@ -4032,6 +4035,20 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>  #endif
>  #ifdef CONFIG_SCHED_CORE
>  	RB_CLEAR_NODE(&p->core_node);
> +
> +	/*
> +	 * Tag child via per-task cookie only if parent is tagged via per-task
> +	 * cookie. This is independent of, but can be additive to, the CGroup tagging.
> +	 */
> +	if (current->core_task_cookie) {
> +
> +		/* If it is not CLONE_THREAD fork, assign a unique per-task tag. */
> +		if (!(clone_flags & CLONE_THREAD)) {
> +			return sched_core_share_tasks(p, p);
> +		}
> +		/* Otherwise share the parent's per-task tag. */
> +		return sched_core_share_tasks(p, current);
> +	}
>  #endif
>  	return 0;
>  }
> @@ -9731,6 +9748,217 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
>  #endif /* CONFIG_RT_GROUP_SCHED */
>  
>  #ifdef CONFIG_SCHED_CORE
> +/*
> + * A simple wrapper around refcount. An allocated sched_core_cookie's
> + * address is used to compute the cookie of the task.
> + */
> +struct sched_core_cookie {
> +	refcount_t refcnt;
> +};
> +
> +/*
> + * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
> + * @p: The task to assign a cookie to.
> + * @cookie: The cookie to assign.
> + * @group: whether to set the group cookie (true) or the per-task cookie (false).
> + *
> + * This function is typically called from a stop-machine handler.

Can you clarify whether this is "typically" or "always"? What are the
implications for locking?

> + */
> +void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
> +{
> +	if (!p)
> +		return;
> +
> +	if (group)
> +		p->core_group_cookie = cookie;
> +	else
> +		p->core_task_cookie = cookie;
> +
> +	/* Use up half of the cookie's bits for task cookie and remaining for group cookie. */
> +	p->core_cookie = (p->core_task_cookie <<
> +				(sizeof(unsigned long) * 4)) + p->core_group_cookie;
> +

Always use masks to ensure this fits in the space we have. Should we be
concerned about overflows, and the potential for collision of cookie values?
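Something along these lines would mask each half before combining, so a task
cookie that overflows its half-word cannot corrupt the group bits. This is a
userspace sketch of the suggestion only; the macro and function names are mine,
not the patch's:

```c
#include <assert.h>

/* Half the bits of an unsigned long hold each sub-cookie. */
#define COOKIE_HALF_BITS	(sizeof(unsigned long) * 4)
#define COOKIE_HALF_MASK	((1UL << COOKIE_HALF_BITS) - 1)

/* Combine task and group cookies, masking each into its own half. */
static unsigned long combine_cookies(unsigned long task_cookie,
				     unsigned long group_cookie)
{
	return ((task_cookie & COOKIE_HALF_MASK) << COOKIE_HALF_BITS) |
	       (group_cookie & COOKIE_HALF_MASK);
}
```

Masking does not eliminate collisions (two task cookies whose low halves match
would still collide after truncation); it only guarantees that one half can
never overwrite the other.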

> +	if (sched_core_enqueued(p)) {
> +		sched_core_dequeue(task_rq(p), p);
> +		if (!p->core_task_cookie)
> +			return;
> +	}
> +
> +	if (sched_core_enabled(task_rq(p)) &&
> +			p->core_cookie && task_on_rq_queued(p))
> +		sched_core_enqueue(task_rq(p), p);
> +}
> +
> +/* Per-task interface */
> +static unsigned long sched_core_alloc_task_cookie(void)
> +{
> +	struct sched_core_cookie *ptr =
> +		kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
> +
> +	if (!ptr)
> +		return 0;
> +	refcount_set(&ptr->refcnt, 1);
> +
> +	/*
> +	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> +	 * is done after the stopper runs.
> +	 */
> +	sched_core_get();
> +	return (unsigned long)ptr;
> +}
> +
> +static bool sched_core_get_task_cookie(unsigned long cookie)
> +{
> +	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> +
> +	/*
> +	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> +	 * is done after the stopper runs.
> +	 */
> +	sched_core_get();
> +	return refcount_inc_not_zero(&ptr->refcnt);
> +}
> +
> +static void sched_core_put_task_cookie(unsigned long cookie)
> +{
> +	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> +
> +	if (refcount_dec_and_test(&ptr->refcnt))
> +		kfree(ptr);
> +}
> +
> +struct sched_core_task_write_tag {
> +	struct task_struct *tasks[2];
> +	unsigned long cookies[2];
> +};

Use a named constant instead of the magic number 2?

> +
> +/*
> + * Ensure that the task has been requeued. The stopper ensures that the task cannot
> + * be migrated to a different CPU while its core scheduler queue state is being updated.
> + * It also makes sure to requeue a task if it was running actively on another CPU.
> + */
> +static int sched_core_task_join_stopper(void *data)
> +{
> +	struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
> +	int i;
> +
> +	for (i = 0; i < 2; i++)

Use ARRAY_SIZE(tag->tasks) (or ARRAY_SIZE(tag->cookies)) rather than hard-coding 2.
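Concretely, deriving the loop bound from the array itself means a future resize
of tasks[]/cookies[] cannot silently desynchronize the iteration. A userspace
sketch (struct name shortened, requeue call stubbed out):

```c
#include <assert.h>
#include <stddef.h>

/* Same definition as the kernel's ARRAY_SIZE (minus the type-check magic). */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Stand-in for struct sched_core_task_write_tag. */
struct write_tag {
	void *tasks[2];
	unsigned long cookies[2];
};

/* Visit every slot; the bound comes from the array, not a literal. */
static size_t for_each_slot(const struct write_tag *tag)
{
	size_t i, visited = 0;

	for (i = 0; i < ARRAY_SIZE(tag->tasks); i++)
		visited++;	/* sched_core_tag_requeue(...) would go here */
	return visited;
}
```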

> +		sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], false /* !group */);
> +
> +	return 0;
> +}
> +
> +static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
> +{

Can you please explain how t1 and t2 are related? There is a table below, but
I don't understand case #2, where the cookies get reset. Is t2 the core leader,
i.e. does t2 determine what t1 and t2 collectively get?

Maybe just call t2 'parent'?
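To check my reading of the table, here is a userspace model of the four
t1-joins-t2 cases (t2 != NULL, t1 != t2), with refcounting elided; NEW_COOKIE
stands in for a freshly allocated cookie and is my name, not the patch's:

```c
#include <assert.h>

#define NEW_COOKIE 0xdeadUL	/* placeholder for sched_core_alloc_task_cookie() */

/* Model the cookie outcome of sched_core_share_tasks(t1, t2). */
static void share_tasks(unsigned long *t1, unsigned long *t2)
{
	if (!*t1 && !*t2)		/* CASE 1: both untagged -> share a new cookie */
		*t1 = *t2 = NEW_COOKIE;
	else if (*t1 && !*t2)		/* CASE 2: t2 untagged -> t1 loses its tag */
		*t1 = 0;
	else				/* CASES 3 & 4: t1 adopts t2's cookie */
		*t1 = *t2;
}
```

If this model is right, t2's cookie always wins when t2 is tagged, and joining
an untagged t2 strips t1's tag; which is what makes t2 look like a "parent".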

> +	struct sched_core_task_write_tag wr = {}; /* for stop machine. */
> +	bool sched_core_put_after_stopper = false;
> +	unsigned long cookie;
> +	int ret = -ENOMEM;
> +
> +	mutex_lock(&sched_core_mutex);
> +
> +	/*
> +	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
> +	 *       sched_core_put_task_cookie(). However, sched_core_put() is done
> +	 *       by this function *after* the stopper removes the tasks from the
> +	 *       core queue, and not before. This is just to play it safe.
> +	 */
> +	if (t2 == NULL) {
> +		if (t1->core_task_cookie) {
> +			sched_core_put_task_cookie(t1->core_task_cookie);
> +			sched_core_put_after_stopper = true;
> +			wr.tasks[0] = t1; /* Keep wr.cookies[0] reset for t1. */
> +		}
> +	} else if (t1 == t2) {
> +		/* Assign a unique per-task cookie solely for t1. */
> +
> +		cookie = sched_core_alloc_task_cookie();
> +		if (!cookie)
> +			goto out_unlock;
> +
> +		if (t1->core_task_cookie) {
> +			sched_core_put_task_cookie(t1->core_task_cookie);
> +			sched_core_put_after_stopper = true;
> +		}
> +		wr.tasks[0] = t1;
> +		wr.cookies[0] = cookie;
> +	} else
> +	/*
> +	 * 		t1		joining		t2
> +	 * CASE 1:
> +	 * before	0				0
> +	 * after	new cookie			new cookie
> +	 *
> +	 * CASE 2:
> +	 * before	X (non-zero)			0
> +	 * after	0				0
> +	 *
> +	 * CASE 3:
> +	 * before	0				X (non-zero)
> +	 * after	X				X
> +	 *
> +	 * CASE 4:
> +	 * before	Y (non-zero)			X (non-zero)
> +	 * after	X				X
> +	 */
> +	if (!t1->core_task_cookie && !t2->core_task_cookie) {
> +		/* CASE 1. */
> +		cookie = sched_core_alloc_task_cookie();
> +		if (!cookie)
> +			goto out_unlock;
> +
> +		/* Add another reference for the other task. */
> +		if (!sched_core_get_task_cookie(cookie)) {
> +			ret = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		wr.tasks[0] = t1;
> +		wr.tasks[1] = t2;
> +		wr.cookies[0] = wr.cookies[1] = cookie;
> +
> +	} else if (t1->core_task_cookie && !t2->core_task_cookie) {
> +		/* CASE 2. */
> +		sched_core_put_task_cookie(t1->core_task_cookie);
> +		sched_core_put_after_stopper = true;
> +
> +		wr.tasks[0] = t1; /* Reset cookie for t1. */
> +
> +	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
> +		/* CASE 3. */
> +		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
> +			ret = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		wr.tasks[0] = t1;
> +		wr.cookies[0] = t2->core_task_cookie;
> +
> +	} else {
> +		/* CASE 4. */
> +		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
> +			ret = -EINVAL;
> +			goto out_unlock;
> +		}
> +		sched_core_put_task_cookie(t1->core_task_cookie);
> +		sched_core_put_after_stopper = true;
> +
> +		wr.tasks[0] = t1;
> +		wr.cookies[0] = t2->core_task_cookie;
> +	}
> +
> +	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
> +
> +	if (sched_core_put_after_stopper)
> +		sched_core_put();
> +
> +	ret = 0;
> +out_unlock:
> +	mutex_unlock(&sched_core_mutex);
> +	return ret;
> +}
> +
> +/* CGroup interface */
>  static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
>  {
>  	struct task_group *tg = css_tg(css);
> @@ -9761,18 +9989,9 @@ static int __sched_write_tag(void *data)
>  	 * when we set cgroup tag to 0 when the loop is done below.
>  	 */
>  	while ((p = css_task_iter_next(&it))) {
> -		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
> -
> -		if (sched_core_enqueued(p)) {
> -			sched_core_dequeue(task_rq(p), p);
> -			if (!p->core_cookie)
> -				continue;
> -		}
> -
> -		if (sched_core_enabled(task_rq(p)) &&
> -		    p->core_cookie && task_on_rq_queued(p))
> -			sched_core_enqueue(task_rq(p), p);
> +		unsigned long cookie = !!val ? (unsigned long)tg : 0UL;
>  
> +		sched_core_tag_requeue(p, cookie, true /* group */);
>  	}
>  	css_task_iter_end(&it);
>  
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 60a922d3f46f..8c452b8010ad 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -1024,6 +1024,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>  		__PS("clock-delta", t1-t0);
>  	}
>  
> +#ifdef CONFIG_SCHED_CORE
> +	__PS("core_cookie", p->core_cookie);
> +#endif
> +
>  	sched_show_numa(p, m);
>  }
>

Balbir Singh.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
  2020-11-25 13:42   ` Peter Zijlstra
@ 2020-11-30 23:10     ` Balbir Singh
  2020-12-01 20:08     ` Joel Fernandes
  2020-12-02  6:18     ` Josh Don
  2 siblings, 0 replies; 150+ messages in thread
From: Balbir Singh @ 2020-11-30 23:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Wed, Nov 25, 2020 at 02:42:37PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:56PM -0500, Joel Fernandes (Google) wrote:
> > From: Josh Don <joshdon@google.com>
> > 
> > Google has a usecase where the first level tag to tag a CGroup is not
> > sufficient. So, a patch is carried for years where a second tag is added which
> > is writeable by unprivileged users.
> > 
> > Google uses DAC controls to make the 'tag' possible to set only by root while
> > the second-level 'color' can be changed by anyone. The actual names that
> > Google uses are different, but the concept is the same.
> > 
> > The hierarchy looks like:
> > 
> > Root group
> >    / \
> >   A   B    (These are created by the root daemon - borglet).
> >  / \   \
> > C   D   E  (These are created by AppEngine within the container).
> > 
> > The reason why Google has two parts is that AppEngine wants to allow a subset of
> > subcgroups within a parent tagged cgroup to share execution. Think of these
> > subcgroups as belonging to the same customer or project. Because these subcgroups are
> > created by AppEngine, they are not tracked by borglet (the root daemon),
> > therefore borglet won't have a chance to set a color for them. That's where
> > 'color' file comes from. Color could be set by AppEngine, and once set, the
> > normal tasks within the subcgroup would not be able to overwrite it. This is
> > enforced by promoting the permission of the color file in cgroupfs.
> 
> Why can't the above work by setting 'tag' (that's a terrible name, why
> does that still live) in CDE? Have the most specific tag live. Same with
> that thread stuff.
> 
> All this API stuff here is a complete and utter trainwreck. Please just
> delete the patches and start over. Hint: if you use stop_machine(),
> you're doing it wrong.
> 
> At best you now have the requirements sorted.

+1, just remove this patch so as to unblock the rest of the series.

Balbir Singh.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 32/32] sched: Debug bits...
  2020-11-17 23:20 ` [PATCH -tip 32/32] sched: Debug bits Joel Fernandes (Google)
@ 2020-12-01  0:21   ` Balbir Singh
  2021-01-15 15:10     ` Joel Fernandes
  0 siblings, 1 reply; 150+ messages in thread
From: Balbir Singh @ 2020-12-01  0:21 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Tue, Nov 17, 2020 at 06:20:02PM -0500, Joel Fernandes (Google) wrote:
> Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Not-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---

Maybe put it under #ifdef CONFIG_SCHED_CORE_DEBUG; even then, please make it
driven by selection via tracepoints rather than bare trace_printk().

Balbir Singh.


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling
  2020-11-25 23:05       ` Balbir Singh
  2020-11-26  8:29         ` Peter Zijlstra
@ 2020-12-01 17:49         ` Joel Fernandes
  1 sibling, 0 replies; 150+ messages in thread
From: Joel Fernandes @ 2020-12-01 17:49 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, chris.hyser,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen

On Thu, Nov 26, 2020 at 10:05:19AM +1100, Balbir Singh wrote:
> On Tue, Nov 24, 2020 at 01:30:38PM -0500, Joel Fernandes wrote:
> > On Mon, Nov 23, 2020 at 09:41:23AM +1100, Balbir Singh wrote:
> > > On Tue, Nov 17, 2020 at 06:19:40PM -0500, Joel Fernandes (Google) wrote:
> > > > From: Peter Zijlstra <peterz@infradead.org>
> > > > 
> > > > The rationale is as follows. In the core-wide pick logic, even if
> > > > need_sync == false, we need to go look at other CPUs (non-local CPUs) to
> > > > see if they could be running RT.
> > > > 
> > > > Say the RQs in a particular core look like this:
> > > > Let CFS1 and CFS2 be two tagged CFS tasks. Let RT1 be an untagged RT task.
> > > > 
> > > > rq0            rq1
> > > > CFS1 (tagged)  RT1 (untagged)
> > > > CFS2 (tagged)
> > > > 
> > > > Say schedule() runs on rq0. Now, it will enter the above loop and
> > > > pick_task(RT) will return NULL for 'p'. It will enter the above if() block
> > > > and see that need_sync == false and will skip RT entirely.
> > > > 
> > > > The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> > > > rq0             rq1
> > > > CFS1            IDLE
> > > > 
> > > > When it should have selected:
> > > > rq0             rq1
> > > > IDLE            RT
> > > > 
> > > > Joel saw this issue on real-world usecases in ChromeOS where an RT task
> > > > gets constantly force-idled and breaks RT. Let's cure it.
> > > > 
> > > > NOTE: This problem will be fixed differently in a later patch. It is just
> > > >       kept here for reference purposes about this issue, and to make
> > > >       applying later patches easier.
> > > >
> > > 
> > > The changelog is hard to read, it refers to above if(), whereas there
> > > is no code snippet in the changelog.
> > 
> > Yeah sorry, it comes from this email where I described the issue:
> > http://lore.kernel.org/r/20201023175724.GA3563800@google.com
> > 
> > I corrected the changelog and appended the patch below. Also pushed it to:
> > https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched
> > 
> > > Also, from what I can see following
> > > the series, p->core_cookie is not yet set anywhere (unless I missed it),
> > > so fixing it in here did not make sense just reading the series.
> > 
> > The interface patches for core_cookie are added later, that's how it is. The
> > infrastructure comes first here. It would also not make sense to add
> > interface first as well so I think the current ordering is fine.
> >
> 
> Some comments below to help make the code easier to understand
> 
> > ---8<-----------------------
> > 
> > From: Peter Zijlstra <peterz@infradead.org>
> > Subject: [PATCH] sched: Fix priority inversion of cookied task with sibling
> > 
> > The rationale is as follows. In the core-wide pick logic, even if
> > need_sync == false, we need to go look at other CPUs (non-local CPUs) to
> > see if they could be running RT.
> > 
> > Say the RQs in a particular core look like this:
> > Let CFS1 and CFS2 be two tagged CFS tasks. Let RT1 be an untagged RT task.
> > 
> > rq0            rq1
> > CFS1 (tagged)  RT1 (untagged)
> > CFS2 (tagged)
> > 
> > The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> > rq0             rq1
> > CFS1            IDLE
> > 
> > When it should have selected:
> > rq0             rq1
> > IDLE            RT
> > 
> > Fix this issue by forcing need_sync and restarting the search if a
> > cookied task was discovered. This will avoid this optimization from
> > making incorrect picks.
> > 
> > Joel saw this issue on real-world usecases in ChromeOS where an RT task
> > gets constantly force-idled and breaks RT. Let's cure it.
> > 
> > NOTE: This problem will be fixed differently in a later patch. It is just
> >       kept here for reference purposes about this issue, and to make
> >       applying later patches easier.
> > 
> > Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  kernel/sched/core.c | 25 ++++++++++++++++---------
> >  1 file changed, 16 insertions(+), 9 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 4ee4902c2cf5..53af817740c0 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5195,6 +5195,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  	need_sync = !!rq->core->core_cookie;
> >  
> >  	/* reset state */
> > +reset:
> >  	rq->core->core_cookie = 0UL;
> >  	if (rq->core->core_forceidle) {
> >  		need_sync = true;
> > @@ -5242,14 +5243,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  				/*
> >  				 * If there weren't any cookies, we don't need to
> >  				 * bother with the other siblings.
> > -				 * If the rest of the core is not running a tagged
> > -				 * task, i.e.  need_sync == 0, and the current CPU
> > -				 * which called into the schedule() loop does not
> > -				 * have any tasks for this class, skip selecting for
> > -				 * other siblings since there's no point. We don't skip
> > -				 * for RT/DL because that could make CFS force-idle RT.
> >  				 */
> > -				if (i == cpu && !need_sync && class == &fair_sched_class)
> > +				if (i == cpu && !need_sync)
> >  					goto next_class;
> >  
> >  				continue;
> > @@ -5259,7 +5254,20 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  			 * Optimize the 'normal' case where there aren't any
> >  			 * cookies and we don't need to sync up.
> >  			 */
> > -			if (i == cpu && !need_sync && !p->core_cookie) {
> > +			if (i == cpu && !need_sync) {
> > +				if (p->core_cookie) {
> > +					/*
> > +					 * This optimization is only valid as
> > +					 * long as there are no cookies
> 
> This is not entirely true: need_sync is a function of core cookies, so I
> think this needs more clarification. It sounds like we enter this when the
> core has no cookies but the task has a core_cookie? The term 'cookie' is
> quite overloaded between the core and the task contexts.
> 
> Effectively from what I understand this means that p wants to be
> coscheduled, but the core itself is not coscheduling anything at the
> moment, so we need to see if we should do a sync and that sync might
> cause p to get kicked out and a higher priority class to come in?

Yeah so about need_sync, it is basically a flag that says if the HT running
the schedule() loop needs to bother with siblings.

need_sync is true only in following conditions:
- A cookied task is running on any HT on the core.
- Any HT in the core is force idled.

The above code comment you referred to is now reworked. That was for the case
where we discovered during local selection that we found a task with a
cookie so now we have to do a core-wide scan (need_sync = false before but
now it becomes true and we start over). This optimization is done slightly
differently now, we run ->pick_task() on every class of the local CPU until
we find something, if we find something with a cookie then we do core-wide
selection.

The latest version of this code is now in Peter's branch:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/tree/kernel/sched/core.c?id=6288c0a49631ce6b53eeab7021a43e49c4c4d436
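To spell the rule out (this sketches only the need_sync condition, not the
pick loop; the helper name is mine):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Must the HT running schedule() consider its siblings?  Only when a cookied
 * task is running somewhere on the core, or some sibling is force-idled.
 */
static bool need_sync(unsigned long core_cookie, bool core_forceidle)
{
	return core_cookie != 0 || core_forceidle;
}
```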

- Joel


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode
  2020-11-25  9:37   ` Peter Zijlstra
@ 2020-12-01 17:55     ` Joel Fernandes
  0 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes @ 2020-12-01 17:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Tim Chen, Paul E . McKenney

On Wed, Nov 25, 2020 at 10:37:00AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:48PM -0500, Joel Fernandes (Google) wrote:
> > Core-scheduling prevents hyperthreads in usermode from attacking each
> > other, but it does not do anything about one of the hyperthreads
> > entering the kernel for any reason. This leaves the door open for MDS
> > and L1TF attacks with concurrent execution sequences between
> > hyperthreads.
> > 
> > This patch therefore adds support for protecting all syscall and IRQ
> > kernel mode entries. Care is taken to track the outermost usermode exit
> > and entry using per-cpu counters. In cases where one of the hyperthreads
> > enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> > when not needed - example: idle and non-cookie HTs do not need to be
> > forced into kernel mode.
> > 
> > More information about attacks:
> > For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> > data to either host or guest attackers. For L1TF, it is possible to leak
> > to guest attackers. There is no possible mitigation involving flushing
> > of buffers to avoid this since the execution of attacker and victims
> > happen concurrently on 2 or more HTs.
> 
> >  .../admin-guide/kernel-parameters.txt         |  11 +
> >  include/linux/entry-common.h                  |  12 +-
> >  include/linux/sched.h                         |  12 +
> >  kernel/entry/common.c                         |  28 +-
> >  kernel/sched/core.c                           | 241 ++++++++++++++++++
> >  kernel/sched/sched.h                          |   3 +
> >  6 files changed, 304 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index bd1a5b87a5e2..b185c6ed4aba 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4678,6 +4678,17 @@
> >  
> >  	sbni=		[NET] Granch SBNI12 leased line adapter
> >  
> > +	sched_core_protect_kernel=
> > +			[SCHED_CORE] Pause SMT siblings of a core running in
> > +			user mode, if at least one of the siblings of the core
> > +			is running in kernel mode. This is to guarantee that
> > +			kernel data is not leaked to tasks which are not trusted
> > +			by the kernel. A value of 0 disables protection, 1
> > +			enables protection. The default is 1. Note that protection
> > +			depends on the arch defining the _TIF_UNSAFE_RET flag.
> > +			Further, for protecting VMEXIT, arch needs to call
> > +			KVM entry/exit hooks.
> > +
> >  	sched_debug	[KNL] Enables verbose scheduler debug messages.
> >  
> >  	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
> 
> So I don't like the parameter name, it's too long. Also I don't like it
> because it's a boolean.

Maybe ht_protect= then?

> You're adding syscall,irq,kvm under a single knob where they're all due
> to different flavours of broken. Different hardware might want/need
> different combinations.

Ok, I can try to make it ht_protect=irq,syscall,kvm etc., and conditionally
enable each protection. Does that work for you?
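The parsing side could look roughly like this. Userspace sketch using
strsep(); in the kernel this would hang off an __setup("ht_protect=", ...)
handler, and the ht_protect= name and option struct are just the proposal
above, not merged code:

```c
#define _DEFAULT_SOURCE		/* for strsep() in glibc */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct ht_protect_opts {
	bool syscall;
	bool irq;
	bool kvm;
};

/* Parse a value like "irq,syscall,kvm"; unknown tokens are ignored. */
static void parse_ht_protect(char *val, struct ht_protect_opts *o)
{
	char *tok;

	memset(o, 0, sizeof(*o));
	while ((tok = strsep(&val, ",")) != NULL) {
		if (!strcmp(tok, "syscall"))
			o->syscall = true;
		else if (!strcmp(tok, "irq"))
			o->irq = true;
		else if (!strcmp(tok, "kvm"))
			o->kvm = true;
	}
}
```

That would address the "different flavours of broken" point: hardware with
L1TF but no MDS could boot with ht_protect=kvm and skip the syscall hook.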
> 
> Hardware without MDS but with L1TF wouldn't need the syscall hook, but
> you're not giving a choice here. And this is generic code, you can't
> assume stuff like this.

Got it.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 19/32] entry/idle: Enter and exit kernel protection during idle entry and exit
  2020-11-25  8:49       ` Peter Zijlstra
@ 2020-12-01 18:24         ` Joel Fernandes
  0 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes @ 2020-12-01 18:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Wed, Nov 25, 2020 at 09:49:08AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 24, 2020 at 01:03:43PM -0500, Joel Fernandes wrote:
> > On Tue, Nov 24, 2020 at 05:13:35PM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 17, 2020 at 06:19:49PM -0500, Joel Fernandes (Google) wrote:
> 
> > > > +static inline void generic_idle_enter(void)
> > > > +static inline void generic_idle_exit(void)
> 
> > > That naming is terrible..
> > 
> > Yeah sorry :-\. The naming I chose was to be aligned with the
> > CONFIG_GENERIC_ENTRY naming. I am open to ideas on that.
> 
> entry_idle_{enter,exit}() ?

Sounds good to me.

> > > I'm confused.. arch_cpu_idle_{enter,exit}() weren't conveniently placed
> > > for you?
> > 
The way this patch series works, it avoids depending on arch code as much as
possible. Since other architectures such as ARM may need this patchset, it may
be better to keep it in the generic entry code. Thoughts?
> 
> I didn't necessarily mean using those hooks, even placing your new hooks
> right next to them would've covered the exact same code with less lines
> modified.

Ok sure. I will improve it this way for next posting.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-25 12:54   ` Peter Zijlstra
@ 2020-12-01 18:38     ` Joel Fernandes
  0 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes @ 2020-12-01 18:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Wed, Nov 25, 2020 at 01:54:47PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> > +/* Per-task interface */
> > +static unsigned long sched_core_alloc_task_cookie(void)
> > +{
> > +	struct sched_core_cookie *ptr =
> > +		kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
> > +
> > +	if (!ptr)
> > +		return 0;
> > +	refcount_set(&ptr->refcnt, 1);
> > +
> > +	/*
> > +	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> > +	 * is done after the stopper runs.
> > +	 */
> > +	sched_core_get();
> > +	return (unsigned long)ptr;
> > +}
> > +
> > +static bool sched_core_get_task_cookie(unsigned long cookie)
> > +{
> > +	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> > +
> > +	/*
> > +	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> > +	 * is done after the stopper runs.
> > +	 */
> > +	sched_core_get();
> > +	return refcount_inc_not_zero(&ptr->refcnt);
> > +}
> > +
> > +static void sched_core_put_task_cookie(unsigned long cookie)
> > +{
> > +	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> > +
> > +	if (refcount_dec_and_test(&ptr->refcnt))
> > +		kfree(ptr);
> > +}
> 
> > +	/*
> > +	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
> > +	 *       sched_core_put_task_cookie(). However, sched_core_put() is done
> > +	 *       by this function *after* the stopper removes the tasks from the
> > +	 *       core queue, and not before. This is just to play it safe.
> > +	 */
> 
> So for no reason whatsoever you've made the code more difficult?

You're right, I could just do sched_core_get() in the caller. I changed it as in
the diff below:

---8<-----------------------

diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 800c0f8bacfc..75e2edb53a48 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -274,6 +274,7 @@ void sched_core_change_group(struct task_struct *p, struct task_group *new_tg)
 /* Per-task interface: Used by fork(2) and prctl(2). */
 static void sched_core_put_cookie_work(struct work_struct *ws);
 
+/* Caller has to call sched_core_get() if non-zero value is returned. */
 static unsigned long sched_core_alloc_task_cookie(void)
 {
 	struct sched_core_task_cookie *ck =
@@ -284,11 +285,6 @@ static unsigned long sched_core_alloc_task_cookie(void)
 	refcount_set(&ck->refcnt, 1);
 	INIT_WORK(&ck->work, sched_core_put_cookie_work);
 
-	/*
-	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
-	 * is done after the stopper runs.
-	 */
-	sched_core_get();
 	return (unsigned long)ck;
 }
 
@@ -354,12 +350,6 @@ int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
 
 	mutex_lock(&sched_core_tasks_mutex);
 
-	/*
-	 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
-	 *       sched_core_put_task_cookie(). However, sched_core_put() is done
-	 *       by this function *after* the stopper removes the tasks from the
-	 *       core queue, and not before. This is just to play it safe.
-	 */
 	if (!t2) {
 		if (t1->core_task_cookie) {
 			sched_core_put_task_cookie(t1->core_task_cookie);
@@ -370,7 +360,9 @@ int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
 		/* Assign a unique per-task cookie solely for t1. */
 
 		cookie = sched_core_alloc_task_cookie();
-		if (!cookie)
+		if (cookie)
+			sched_core_get();
+		else
 			goto out_unlock;
 
 		if (t1->core_task_cookie) {
@@ -401,7 +393,9 @@ int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
 
 		/* CASE 1. */
 		cookie = sched_core_alloc_task_cookie();
-		if (!cookie)
+		if (cookie)
+			sched_core_get();
+		else
 			goto out_unlock;
 
 		/* Add another reference for the other task. */

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-25 13:03   ` Peter Zijlstra
@ 2020-12-01 18:52     ` Joel Fernandes
  0 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes @ 2020-12-01 18:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Wed, Nov 25, 2020 at 02:03:22PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> > +static bool sched_core_get_task_cookie(unsigned long cookie)
> > +{
> > +	struct sched_core_cookie *ptr = (struct sched_core_cookie *)cookie;
> > +
> > +	/*
> > +	 * NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
> > +	 * is done after the stopper runs.
> > +	 */
> > +	sched_core_get();
> > +	return refcount_inc_not_zero(&ptr->refcnt);
> 
> See below, but afaict this should be refcount_inc().

Fully agreed with all these. Updated with diff as below. Will test further
and post next version soon. Thanks!

---8<-----------------------

diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 2fb5544a4a18..8fce3f4b7cae 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -288,12 +288,12 @@ static unsigned long sched_core_alloc_task_cookie(void)
 	return (unsigned long)ck;
 }
 
-static bool sched_core_get_task_cookie(unsigned long cookie)
+static void sched_core_get_task_cookie(unsigned long cookie)
 {
 	struct sched_core_task_cookie *ptr =
 		(struct sched_core_task_cookie *)cookie;
 
-	return refcount_inc_not_zero(&ptr->refcnt);
+	refcount_inc(&ptr->refcnt);
 }
 
 static void sched_core_put_task_cookie(unsigned long cookie)
@@ -392,10 +392,7 @@ int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
 		sched_core_get(); /* For the alloc. */
 
 		/* Add another reference for the other task. */
-		if (!sched_core_get_task_cookie(cookie)) {
-			ret = -EINVAL;
-			goto out_unlock;
-		}
+		sched_core_get_task_cookie(cookie);
 		sched_core_get(); /* For the other task. */
 
 		wr.tasks[0] = t1;
@@ -411,10 +408,7 @@ int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
 
 	} else if (!t1->core_task_cookie && t2->core_task_cookie) {
 		/* CASE 3. */
-		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
-			ret = -EINVAL;
-			goto out_unlock;
-		}
+		sched_core_get_task_cookie(t2->core_task_cookie);
 		sched_core_get();
 
 		wr.tasks[0] = t1;
@@ -422,10 +416,7 @@ int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
 
 	} else {
 		/* CASE 4. */
-		if (!sched_core_get_task_cookie(t2->core_task_cookie)) {
-			ret = -EINVAL;
-			goto out_unlock;
-		}
+		sched_core_get_task_cookie(t2->core_task_cookie);
 		sched_core_get();
 
 		sched_core_put_task_cookie(t1->core_task_cookie);
-- 
2.29.2.454.gaff20da3a2-goog



* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-25 11:07   ` Peter Zijlstra
@ 2020-12-01 18:56     ` Joel Fernandes
  0 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes @ 2020-12-01 18:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Wed, Nov 25, 2020 at 12:07:09PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> > Also, for the per-task cookie, it will get weird if we use pointers of any
> > ephemeral objects. For this reason, introduce a refcounted object whose sole
> > purpose is to assign a unique cookie value by way of the object's pointer.
> 
> Might be useful to explain why exactly none of the many pid_t's are
> good enough.

I thought about this already and it does not seem a good fit. When two
processes share a cookie, it is possible that more processes are later added
to that logical group. The original sharing processes may then die, and if we
were still holding on to their pid_t or task_struct, that would be awkward.
Introducing a new refcounted struct seemed the right way to go. I can add
these details to the changelog.
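A minimal user-space model of that refcounted struct may make the lifetime
argument concrete (all names here are illustrative, not the kernel code): the
object's address doubles as the unique cookie value, and the object outlives
any individual member of the group.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: a refcounted object whose address is the cookie. */
struct task_cookie {
	unsigned int refcnt;
};

static unsigned long cookie_alloc(void)
{
	struct task_cookie *ck = calloc(1, sizeof(*ck));

	if (!ck)
		return 0;
	ck->refcnt = 1;
	return (unsigned long)ck;	/* the address is the cookie value */
}

static void cookie_get(unsigned long cookie)
{
	((struct task_cookie *)cookie)->refcnt++;
}

/* Returns 1 when the last reference was dropped and the object freed. */
static int cookie_put(unsigned long cookie)
{
	struct task_cookie *ck = (struct task_cookie *)cookie;

	if (--ck->refcnt == 0) {
		free(ck);
		return 1;
	}
	return 0;
}
```

The pid_t problem goes away because nothing above refers to a task
identifier: the first sharer can exit (one cookie_put()) while later joiners
keep the group, and thus the cookie, alive.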

thanks!

 - Joel



* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-25 11:15   ` Peter Zijlstra
@ 2020-12-01 19:11     ` Joel Fernandes
  2020-12-01 19:20       ` Peter Zijlstra
  0 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes @ 2020-12-01 19:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Wed, Nov 25, 2020 at 12:15:41PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> 
> > +/*
> > + * Ensure that the task has been requeued. The stopper ensures that the task cannot
> > + * be migrated to a different CPU while its core scheduler queue state is being updated.
> > + * It also makes sure to requeue a task if it was running actively on another CPU.
> > + */
> > +static int sched_core_task_join_stopper(void *data)
> > +{
> > +	struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
> > +	int i;
> > +
> > +	for (i = 0; i < 2; i++)
> > +		sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], false /* !group */);
> > +
> > +	return 0;
> > +}
> > +
> > +static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
> > +{
> 
> > +	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
> 
> > +}
> 
> This is *REALLY* terrible...

I pulled this bit from your original patch. Are you concerned about the
stop_machine()? Sharing a core is a slow path for our use cases (and, as far
as I know, for everyone else's). We can probably do something different if
that requirement changes.



* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-25 11:11   ` Peter Zijlstra
@ 2020-12-01 19:16     ` Joel Fernandes
  0 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes @ 2020-12-01 19:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Wed, Nov 25, 2020 at 12:11:28PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> 
> > + * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
> 
> sched_core_set_cookie() would be a saner name, given that description,
> don't you think?

Yeah, Josh is better than me at naming, so he changed it to
sched_core_update_cookie() already :-). Hopefully that's OK with you too.

thanks,

 - Joel



* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-11-25 11:10   ` Peter Zijlstra
@ 2020-12-01 19:20     ` Joel Fernandes
  2020-12-01 19:34       ` Peter Zijlstra
  0 siblings, 1 reply; 150+ messages in thread
From: Joel Fernandes @ 2020-12-01 19:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Wed, Nov 25, 2020 at 12:10:14PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> > +void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
> > +{
> > +	if (!p)
> > +		return;
> > +
> > +	if (group)
> > +		p->core_group_cookie = cookie;
> > +	else
> > +		p->core_task_cookie = cookie;
> > +
> > +	/* Use up half of the cookie's bits for task cookie and remaining for group cookie. */
> > +	p->core_cookie = (p->core_task_cookie <<
> > +				(sizeof(unsigned long) * 4)) + p->core_group_cookie;
> 
> This seems dangerous; afaict there is nothing that prevents cookie
> collision.

This is fixed in a later patch by Josh, "sched: Refactor core cookie into
struct", where we have independent fields for each type of cookie.
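For reference, the collision Peter points out is easy to reproduce in a
user-space model of the packing above (illustrative code, not the kernel's):
shifting the task cookie into the upper half discards its high bits, and a
group cookie with high bits set spills into the task half.

```c
#include <assert.h>

/* Pack as in the quoted hunk: task cookie in the upper half,
 * group cookie in the lower half of one unsigned long. */
static unsigned long combine(unsigned long task, unsigned long group)
{
	return (task << (sizeof(unsigned long) * 4)) + group;
}
```

Two different (task, group) pairs can thus end up with identical combined
cookies, which is exactly the collision the refactor avoids.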

I'll squash it next time I post to prevent confusion. Thanks,

 - Joel



* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-12-01 19:11     ` Joel Fernandes
@ 2020-12-01 19:20       ` Peter Zijlstra
  2020-12-06 18:15         ` Joel Fernandes
  0 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-12-01 19:20 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Dec 01, 2020 at 02:11:33PM -0500, Joel Fernandes wrote:
> On Wed, Nov 25, 2020 at 12:15:41PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> > 
> > > +/*
> > > + * Ensure that the task has been requeued. The stopper ensures that the task cannot
> > > + * be migrated to a different CPU while its core scheduler queue state is being updated.
> > > + * It also makes sure to requeue a task if it was running actively on another CPU.
> > > + */
> > > +static int sched_core_task_join_stopper(void *data)
> > > +{
> > > +	struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
> > > +	int i;
> > > +
> > > +	for (i = 0; i < 2; i++)
> > > +		sched_core_tag_requeue(tag->tasks[i], tag->cookies[i], false /* !group */);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
> > > +{
> > 
> > > +	stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
> > 
> > > +}
> > 
> > This is *REALLY* terrible...
> 
> I pulled this bit from your original patch. Are you concerned about the
> stop_machine? Sharing a core is a slow path for our usecases (and as far as I
> know, for everyone else's). We can probably do something different if that
> requirement changes.
> 

Yeah.. so I can (and was planning on) remove stop_machine() from
sched_core_{dis,en}able() before merging it.

(there's two options, one uses stop_cpus() with the SMT mask, the other
RCU)

This, though, is exposing stop_machine() to joe user. Everybody is allowed
to prctl() its own task and set a cookie on itself. This means you
just made a giant unprivileged DoS vector.

stop_machine is bad, really bad.


* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-12-01 19:20     ` Joel Fernandes
@ 2020-12-01 19:34       ` Peter Zijlstra
  2020-12-02  6:36         ` Josh Don
  2020-12-06 17:49         ` Joel Fernandes
  0 siblings, 2 replies; 150+ messages in thread
From: Peter Zijlstra @ 2020-12-01 19:34 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Tue, Dec 01, 2020 at 02:20:28PM -0500, Joel Fernandes wrote:
> On Wed, Nov 25, 2020 at 12:10:14PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> > > +void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
> > > +{
> > > +	if (!p)
> > > +		return;
> > > +
> > > +	if (group)
> > > +		p->core_group_cookie = cookie;
> > > +	else
> > > +		p->core_task_cookie = cookie;
> > > +
> > > +	/* Use up half of the cookie's bits for task cookie and remaining for group cookie. */
> > > +	p->core_cookie = (p->core_task_cookie <<
> > > +				(sizeof(unsigned long) * 4)) + p->core_group_cookie;
> > 
> > This seems dangerous; afaict there is nothing that prevents cookie
> > collision.
> 
> This is fixed in a later patch by Josh, "sched: Refactor core cookie into
> struct", where we have independent fields for each type of cookie.

So I don't think that later patch is right... That is, it works, but
afaict it's massive overkill.

	COOKIE_CMP_RETURN(task_cookie);
	COOKIE_CMP_RETURN(group_cookie);
	COOKIE_CMP_RETURN(color);

So if task_cookie matches, we consider group_cookie, if that matches we
consider color.

Now, afaict that's semantically exactly the same as just using the
narrowest cookie. That is, use the task cookie if there is, and then,
walking the cgroup hierarchy (up) pick the first cgroup cookie.

(I don't understand the color thing, but lets have that discussion in
that subthread)

Which means you only need a single active cookie field.

IOW, you're just making things complicated and expensive.
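For context, the COOKIE_CMP_RETURN() fragment quoted here amounts to a
field-by-field comparison in priority order; a self-contained sketch follows
(the struct layout and macro body are reconstructions from the fragment, not
the actual patch):

```c
#include <assert.h>

struct sched_core_cookie {
	unsigned long task_cookie;
	unsigned long group_cookie;
	unsigned long color;
};

/* Compare one field; fall through to the next only on equality. */
#define COOKIE_CMP_RETURN(field)		\
	do {					\
		if (a->field < b->field)	\
			return -1;		\
		if (a->field > b->field)	\
			return 1;		\
	} while (0)

static int cookie_cmp(const struct sched_core_cookie *a,
		      const struct sched_core_cookie *b)
{
	COOKIE_CMP_RETURN(task_cookie);
	COOKIE_CMP_RETURN(group_cookie);
	COOKIE_CMP_RETURN(color);
	return 0;	/* all fields equal: the tasks may share a core */
}
```

The task cookie dominates, then the group cookie, then color, which is the
ordering Peter argues collapses to "use the narrowest cookie" semantically.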




* Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface
  2020-11-25 13:08   ` Peter Zijlstra
@ 2020-12-01 19:36     ` Joel Fernandes
  0 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes @ 2020-12-01 19:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On Wed, Nov 25, 2020 at 02:08:08PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:53PM -0500, Joel Fernandes (Google) wrote:
> > +/* Called from prctl interface: PR_SCHED_CORE_SHARE */
> > +int sched_core_share_pid(pid_t pid)
> > +{
> > +	struct task_struct *task;
> > +	int err;
> > +
> > +	if (pid == 0) { /* Recent current task's cookie. */
> > +		/* Resetting a cookie requires privileges. */
> > +		if (current->core_task_cookie)
> > +			if (!capable(CAP_SYS_ADMIN))
> > +				return -EPERM;
> 
> Coding-Style fail.
> 
> Also, why?!? I realize it is true for your case, because hardware fail.
> But in general this just isn't true. This wants to be some configurable
> policy.

True. I think you and I discussed eons ago, though, that it needs to be
privileged. For our case, we actually use seccomp, so we don't let an
untrusted task set a cookie anyway, let alone reset it. We do it before we
enter the seccomp sandbox, so we don't really need this security check here.

Since you dislike this part of the patch, I am Ok with just dropping it as
below:

---8<-----------------------

diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 8fce3f4b7cae..9b587a1245f5 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -443,11 +443,7 @@ int sched_core_share_pid(pid_t pid)
 	struct task_struct *task;
 	int err;
 
-	if (pid == 0) { /* Recent current task's cookie. */
-		/* Resetting a cookie requires privileges. */
-		if (current->core_task_cookie)
-			if (!capable(CAP_SYS_ADMIN))
-				return -EPERM;
+	if (!pid) { /* Reset current task's cookie. */
 		task = NULL;
 	} else {
 		rcu_read_lock();


* Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
  2020-11-25 13:42   ` Peter Zijlstra
  2020-11-30 23:10     ` Balbir Singh
@ 2020-12-01 20:08     ` Joel Fernandes
  2020-12-02  6:18     ` Josh Don
  2 siblings, 0 replies; 150+ messages in thread
From: Joel Fernandes @ 2020-12-01 20:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

Hi Peter,

On Wed, Nov 25, 2020 at 02:42:37PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:56PM -0500, Joel Fernandes (Google) wrote:
> > From: Josh Don <joshdon@google.com>
> > 
> > Google has a use case where a single first-level tag on a CGroup is not
> > sufficient, so a patch has been carried for years that adds a second tag
> > which is writable by unprivileged users.
> > 
> > Google uses DAC controls to make the 'tag' possible to set only by root while
> > the second-level 'color' can be changed by anyone. The actual names that
> > Google uses is different, but the concept is the same.
> > 
> > The hierarchy looks like:
> > 
> > Root group
> >    / \
> >   A   B    (These are created by the root daemon - borglet).
> >  / \   \
> > C   D   E  (These are created by AppEngine within the container).
> > 
> > The reason Google has two parts is that AppEngine wants to allow a subset of
> > subcgroups within a tagged parent cgroup to share execution. Think of these
> > subcgroups as belonging to the same customer or project. Because these subcgroups are
> > created by AppEngine, they are not tracked by borglet (the root daemon),
> > therefore borglet won't have a chance to set a color for them. That's where
> > 'color' file comes from. Color could be set by AppEngine, and once set, the
> > normal tasks within the subcgroup would not be able to overwrite it. This is
> > enforced by promoting the permission of the color file in cgroupfs.
> 
> Why can't the above work by setting 'tag' (that's a terrible name, why
> does that still live) in CDE? Have the most specific tag live. Same with
> that thread stuff.

There's 2 parts that Google's usecase has. The first part is set by a
privileged process, and the second part (color) is set within the container.
Maybe we can just put the "color" feature behind a CONFIG option for Google
to enable?

> All this API stuff here is a complete and utter trainwreck. Please just
> delete the patches and start over. Hint: if you use stop_machine(),
> you're doing it wrong.

Ok, the idea was to use stop_machine() as in your initial patch. It works
quite well in testing. However, I agree it's horrible and we ought to do
better (or at least try).

Maybe we can do a synchronize_rcu() after changing cookie, to ensure we are
no longer using the old cookie value in the scheduler.

> At best you now have the requirements sorted.

Yes.

thanks,

 - Joel



* Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
  2020-11-25 13:42   ` Peter Zijlstra
  2020-11-30 23:10     ` Balbir Singh
  2020-12-01 20:08     ` Joel Fernandes
@ 2020-12-02  6:18     ` Josh Don
  2020-12-02  8:02       ` Peter Zijlstra
  2 siblings, 1 reply; 150+ messages in thread
From: Josh Don @ 2020-12-02  6:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, chris.hyser, Ben Segall, Hao Luo,
	Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen,
	Oleg Rombakh

Hey Peter,

On Wed, Nov 25, 2020 at 5:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Why can't the above work by setting 'tag' (that's a terrible name, why
> does that still live) in CDE? Have the most specific tag live. Same with
> that thread stuff.

The motivation is to allow an unprivileged user the ability to
configure the trust hierarchy in a way that otherwise wouldn't be
possible for a given cgroup hierarchy.  For example given a cookie'd
hierarchy such as:

       A
    /  |  |  \
   B   C  D   E

the user might only want subsets of {B, C, D, E} to share.  For
instance, the user might only want {B,C} and {D, E} to share.  One way
to solve this would be to allow the user to write the group cookie
directly.  However, this interface would need to be restricted to
privileged users, since otherwise the cookie could be configured to
share with any arbitrary cgroup.  The purpose of the 'color' field is
to expose a portion of the cookie that can be modified by a
non-privileged user in order to achieve this sharing goal.

If this doesn't seem like a useful case, I'm happy to drop this patch
from the series to unblock it.

> All this API stuff here is a complete and utter trainwreck. Please just
> delete the patches and start over. Hint: if you use stop_machine(),
> you're doing it wrong.

Yes, agree on stop_machine(); we'll pull that out of the underlying
interface patch.

Thanks,
Josh


* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-12-01 19:34       ` Peter Zijlstra
@ 2020-12-02  6:36         ` Josh Don
  2020-12-02  7:54           ` Peter Zijlstra
  2020-12-06 17:49         ` Joel Fernandes
  1 sibling, 1 reply; 150+ messages in thread
From: Josh Don @ 2020-12-02  6:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Gleixner,
	linux-kernel, mingo, torvalds, fweisbec, Kees Cook, Greg Kerr,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, chris.hyser, Ben Segall, Hao Luo,
	Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen,
	Oleg Rombakh

On Tue, Dec 1, 2020 at 11:35 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I don't think that later patch is right... That is, it works, but
> afaict it's massive overkill.
>
>         COOKIE_CMP_RETURN(task_cookie);
>         COOKIE_CMP_RETURN(group_cookie);
>         COOKIE_CMP_RETURN(color);
>
> So if task_cookie matches, we consider group_cookie, if that matches we
> consider color.
>
> Now, afaict that's semantically exactly the same as just using the
> narrowest cookie. That is, use the task cookie if there is, and then,
> walking the cgroup hierarchy (up) pick the first cgroup cookie.
>
> (I don't understand the color thing, but lets have that discussion in
> that subthread)
>
> Which means you only need a single active cookie field.
>
> IOW, you're just making things complicated and expensive.
>

For the per-task interface, I believe we still want to prevent two
tasks that share a task cookie from sharing an overall cookie if they
are in two separately tagged groups (Joel please correct me if I'm
mistaken there). That's why in Joel's older patch, the overall cookie
was a combination of the task and group cookies.  My concern about
that was the potential cookie collision.

I followed up on the 'color' portion in the other thread.

Thanks,
Josh


* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-12-02  6:36         ` Josh Don
@ 2020-12-02  7:54           ` Peter Zijlstra
  2020-12-04  0:20             ` Josh Don
  0 siblings, 1 reply; 150+ messages in thread
From: Peter Zijlstra @ 2020-12-02  7:54 UTC (permalink / raw)
  To: Josh Don
  Cc: Joel Fernandes, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Gleixner,
	linux-kernel, mingo, torvalds, fweisbec, Kees Cook, Greg Kerr,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, chris.hyser, Ben Segall, Hao Luo,
	Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen,
	Oleg Rombakh

On Tue, Dec 01, 2020 at 10:36:18PM -0800, Josh Don wrote:
> On Tue, Dec 1, 2020 at 11:35 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So I don't think that later patch is right... That is, it works, but
> > afaict it's massive overkill.
> >
> >         COOKIE_CMP_RETURN(task_cookie);
> >         COOKIE_CMP_RETURN(group_cookie);
> >         COOKIE_CMP_RETURN(color);
> >
> > So if task_cookie matches, we consider group_cookie, if that matches we
> > consider color.
> >
> > Now, afaict that's semantically exactly the same as just using the
> > narrowest cookie. That is, use the task cookie if there is, and then,
> > walking the cgroup hierarchy (up) pick the first cgroup cookie.
> >
> > (I don't understand the color thing, but lets have that discussion in
> > that subthread)
> >
> > Which means you only need a single active cookie field.
> >
> > IOW, you're just making things complicated and expensive.
> >
> 
> For the per-task interface, I believe we still want to prevent two
> tasks that share a task cookie from sharing an overall cookie if they
> are in two separately tagged groups (Joel please correct me if I'm
> mistaken there). That's why in Joel's older patch, the overall cookie
> was a combination of the task and group cookies.  My concern about
> that was the potential cookie collision.

Then disallow sharing a task cookie when the tasks are in different
cgroups or disallow cgroup movement when they share a cookie.
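The first option can be modelled in a few lines (illustrative user-space
code, not the kernel's): the share operation simply refuses when the two
tasks carry different group cookies, which is what lets a single active
cookie field suffice.

```c
#include <assert.h>

#define EINVAL 22	/* stand-in for the kernel errno */

/* Minimal model of a task's cookie state; names are illustrative. */
struct task {
	unsigned long task_cookie;
	unsigned long group_cookie;
};

static int share_task_cookie(struct task *t1, struct task *t2,
			     unsigned long cookie)
{
	/* Disallow sharing across differently tagged groups. */
	if (t1->group_cookie != t2->group_cookie)
		return -EINVAL;
	t1->task_cookie = cookie;
	t2->task_cookie = cookie;
	return 0;
}
```

The second option is the dual: the cgroup-migration path would reject moving
a task that still holds a shared task cookie.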


* Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
  2020-12-02  6:18     ` Josh Don
@ 2020-12-02  8:02       ` Peter Zijlstra
  2020-12-02 18:53         ` Tejun Heo
  2020-12-04  0:51         ` Josh Don
  0 siblings, 2 replies; 150+ messages in thread
From: Peter Zijlstra @ 2020-12-02  8:02 UTC (permalink / raw)
  To: Josh Don
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, chris.hyser, Ben Segall, Hao Luo,
	Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen,
	Oleg Rombakh, Tejun Heo

On Tue, Dec 01, 2020 at 10:18:00PM -0800, Josh Don wrote:
> Hey Peter,
> 
> On Wed, Nov 25, 2020 at 5:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Why can't the above work by setting 'tag' (that's a terrible name, why
> > does that still live) in CDE? Have the most specific tag live. Same with
> > that thread stuff.
> 
> The motivation is to allow an unprivileged user the ability to
> configure the trust hierarchy in a way that otherwise wouldn't be
> possible for a given cgroup hierarchy.  For example given a cookie'd
> hierarchy such as:
> 
>        A
>     /  |  |  \
>    B   C  D   E
> 
> the user might only want subsets of {B, C, D, E} to share.  For
> instance, the user might only want {B,C} and {D, E} to share.  One way
> to solve this would be to allow the user to write the group cookie
> directly.  However, this interface would need to be restricted to
> privileged users, since otherwise the cookie could be configured to
> share with any arbitrary cgroup.  The purpose of the 'color' field is
> to expose a portion of the cookie that can be modified by a
> non-privileged user in order to achieve this sharing goal.
> 
> If this doesn't seem like a useful case, I'm happy to drop this patch
> from the series to unblock it.

Well, the traditional cgroup way of doing that would be to:

          A
         / \
       T1   T2
      /  \
     B    C

And tag T1 if you want B,C to share.

So to me, the color thing reads like an end-run around the cgroup hierarchy.
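
The tagged-subtree arrangement sketched above could be set up with something
like the following (a hypothetical sketch, assuming cgroup v2 mounted at
/sys/fs/cgroup and the cpu.core_tag knob introduced by this series):

```sh
# Create intermediate groups T1 and T2 under A and tag each one,
# so that B,C share one cookie and D,E share another.
cd /sys/fs/cgroup/A
mkdir -p T1/B T1/C T2/D T2/E
echo 1 > T1/cpu.core_tag    # B and C now share T1's cookie
echo 1 > T2/cpu.core_tag    # D and E share a different cookie
```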

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-11-30 12:29                     ` Li, Aubrey
@ 2020-12-02 14:09                       ` Li, Aubrey
  2020-12-03  1:06                         ` Li, Aubrey
  0 siblings, 1 reply; 150+ messages in thread
From: Li, Aubrey @ 2020-12-02 14:09 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

Hi Balbir,

I've placed the patch inline in this thread again; any comments are welcome.

Thanks,
-Aubrey
======================================================================

From d64455dcaf47329673903a68a9df1151400cdd7a Mon Sep 17 00:00:00 2001
From: Aubrey Li <aubrey.li@linux.intel.com>
Date: Wed, 2 Dec 2020 13:53:30 +0000
Subject: [PATCH] sched: migration changes for core scheduling

 - Don't migrate if there is a cookie mismatch
     Load balancing tries to move a task from the busiest CPU to the
     destination CPU. When core scheduling is enabled, if the task's
     cookie does not match the destination CPU's core cookie, the task
     is skipped by this CPU. This mitigates forced idle time on the
     destination CPU.

 - Select a cookie-matched idle CPU
     In the fast path of task wakeup, select the first cookie-matched
     idle CPU instead of the first idle CPU.

 - Find the cookie-matched idlest CPU
     In the slow path of task wakeup, find the idlest CPU whose core
     cookie matches the task's cookie.

 - Don't migrate a task if the cookie does not match
     For NUMA load balancing, don't migrate a task to a CPU whose core
     cookie does not match the task's cookie.

Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c  | 33 +++++++++++++++++---
 kernel/sched/sched.h | 71 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de82f88ba98c..b8657766b660 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
 			continue;
 
+		/*
+		 * Skip this cpu if source task's cookie does not match
+		 * with CPU's core cookie.
+		 */
+		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+			continue;
+
 		env->dst_cpu = cpu;
 		if (task_numa_compare(env, taskimp, groupimp, maymove))
 			break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		struct rq *rq = cpu_rq(i);
+
+		if (!sched_core_cookie_match(rq, p))
+			continue;
+
 		if (sched_idle_cpu(i))
 			return i;
 
 		if (available_idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
@@ -6129,7 +6140,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
 			return -1;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+
+		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu) &&
+		    sched_cpu_cookie_match(cpu_rq(cpu), p))
 			break;
 	}
 
@@ -7530,8 +7543,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
 	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 3) task's cookie does not match with this CPU's core cookie
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
@@ -7566,6 +7580,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+	/*
+	 * Don't migrate task if the task's cookie does not match
+	 * with the destination CPU's core cookie.
+	 */
+	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+		return 0;
+
 	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8813,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 					p->cpus_ptr))
 			continue;
 
+		/* Skip over this group if no cookie matched */
+		if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+			continue;
+
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e72942a9ee11..e1adfffe6e39 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1135,6 +1135,61 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
 
+/*
+ * Helpers to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	return rq->core->core_cookie == p->core_cookie;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	bool idle_core = true;
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+		if (!available_idle_cpu(cpu)) {
+			idle_core = false;
+			break;
+		}
+	}
+
+	/*
+	 * A CPU in an idle core is always the best choice for tasks with
+	 * cookies.
+	 */
+	return idle_core || __cookie_match(rq, p);
+}
+
+static inline bool sched_group_cookie_match(struct rq *rq,
+					    struct task_struct *p,
+					    struct sched_group *group)
+{
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu_and(cpu, sched_group_span(group), p->cpus_ptr) {
+		if (sched_core_cookie_match(cpu_rq(cpu), p))
+			return true;
+	}
+	return false;
+}
+
 extern void queue_core_balance(struct rq *rq);
 
 #else /* !CONFIG_SCHED_CORE */
@@ -1153,6 +1208,22 @@ static inline void queue_core_balance(struct rq *rq)
 {
 }
 
+static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
+
+static inline bool sched_group_cookie_match(struct rq *rq,
+					    struct task_struct *p,
+					    struct sched_group *group)
+{
+	return true;
+}
 #endif /* CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.17.1


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
  2020-12-02  8:02       ` Peter Zijlstra
@ 2020-12-02 18:53         ` Tejun Heo
  2020-12-04  0:51         ` Josh Don
  1 sibling, 0 replies; 150+ messages in thread
From: Tejun Heo @ 2020-12-02 18:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Don, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, chris.hyser, Ben Segall, Hao Luo,
	Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen,
	Oleg Rombakh

Hello,

On Wed, Dec 02, 2020 at 09:02:11AM +0100, Peter Zijlstra wrote:
> > the user might only want subsets of {B, C, D, E} to share.  For
> > instance, the user might only want {B,C} and {D, E} to share.  One way
> > to solve this would be to allow the user to write the group cookie
> > directly.  However, this interface would need to be restricted to
> > privileged users, since otherwise the cookie could be configured to
> > share with any arbitrary cgroup.  The purpose of the 'color' field is
> > to expose a portion of the cookie that can be modified by a
> > non-privileged user in order to achieve this sharing goal.
> > 
> > If this doesn't seem like a useful case, I'm happy to drop this patch
> > from the series to unblock it.
> 
> Well, the traditional cgroup way of doing that would be to:
> 
>          A
>         / \
>       T1   T2
>      /  \
>     B    C
> 
> And tag T1 if you want B,C to share.
> 
> So to me, the color thing reads like an end-run around the cgroup hierarchy.

+1

and please cc me on cgroup related changes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface
  2020-11-17 23:19 ` [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
  2020-11-25 13:08   ` Peter Zijlstra
@ 2020-12-02 21:47   ` Chris Hyser
  2020-12-02 23:13     ` chris hyser
  2020-12-06 17:34     ` Joel Fernandes
  1 sibling, 2 replies; 150+ messages in thread
From: Chris Hyser @ 2020-12-02 21:47 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, Ben Segall,
	Josh Don, Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney,
	Tim Chen

On Tue, Nov 17, 2020 at 06:19:53PM -0500, Joel Fernandes (Google) wrote:
> Add a per-thread core scheduling interface which allows a thread to share a
> core with another thread, or have a core exclusively for itself.
> 
> ChromeOS uses core-scheduling to securely enable hyperthreading.  This cuts
> down the keypress latency in Google docs from 150ms to 50ms while improving
> the camera streaming frame rate by ~3%.
> 

Inline is a patch, for comment, that extends this interface to make it more
useful. This patch would still need doc and selftest updates as well.

-chrish

---8<-----------------------

From ec3d6506fee89022d93789e1ba44d49c1b1b04dd Mon Sep 17 00:00:00 2001
From: chris hyser <chris.hyser@oracle.com>
Date: Tue, 10 Nov 2020 15:35:59 -0500
Subject: [PATCH] sched: Provide a more extensive prctl interface for core
 scheduling.

The current prctl interface is a "from"-only interface, allowing a task to
join an existing core scheduling group by getting the "cookie" from a
specified task/pid.

Additionally, sharing from a task without an existing core scheduling group
(cookie == 0) creates a new cookie shared between the two.

"From" functionality works well for programs modified to use the prctl(),
but many applications will not be modified simply for core scheduling.
Using a wrapper program to share the core scheduling cookie from another
task followed by an "exec" can work, but there is no means to assign a
cookie for an unmodified running task.

Simply inverting the interface to a "to"-only interface, i.e. having the
calling task share its cookie with the specified task/pid, also has
limitations: there is then no means for tasks to join an existing core
scheduling group, for instance.

The solution is to allow both, or more specifically provide a flags
argument to allow various core scheduling commands, currently FROM, TO, and
CLEAR.

The combination of FROM and TO allows a helper program to share the core
scheduling cookie of one task/core-scheduling group with additional tasks.

if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, src_pid, 0, 0) < 0)
	handle_error("src_pid sched_core failed");
if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, dest_pid, 0, 0) < 0)
	handle_error("dest_pid sched_core failed");

Signed-off-by: chris hyser <chris.hyser@oracle.com>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c9efdf8..eed002e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2084,14 +2084,14 @@ void sched_core_unsafe_enter(void);
 void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
-int sched_core_share_pid(pid_t pid);
+int sched_core_share_pid(unsigned long flags, pid_t pid);
 void sched_tsk_free(struct task_struct *tsk);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
-#define sched_core_share_pid(pid) do { } while (0)
+#define sched_core_share_pid(flags, pid) do { } while (0)
 #define sched_tsk_free(tsk) do { } while (0)
 #endif
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 217b048..f8e4e96 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -250,5 +250,8 @@ struct prctl_mm_map {
 
 /* Request the scheduler to share a core */
 #define PR_SCHED_CORE_SHARE		59
+# define PR_SCHED_CORE_CLEAR		0  /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_SHARE_FROM	1  /* get core_sched cookie from pid */
+# define PR_SCHED_CORE_SHARE_TO		2  /* push core_sched cookie to pid */
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 800c0f8..14feac1 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -9,6 +9,7 @@
  */
 
 #include "sched.h"
+#include "linux/prctl.h"
 
 /*
  * Wrapper representing a complete cookie. The address of the cookie is used as
@@ -456,40 +457,45 @@ int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
 }
 
 /* Called from prctl interface: PR_SCHED_CORE_SHARE */
-int sched_core_share_pid(pid_t pid)
+int sched_core_share_pid(unsigned long flags, pid_t pid)
 {
+	struct task_struct *dest;
+	struct task_struct *src;
 	struct task_struct *task;
 	int err;
 
-	if (pid == 0) { /* Recent current task's cookie. */
-		/* Resetting a cookie requires privileges. */
-		if (current->core_task_cookie)
-			if (!capable(CAP_SYS_ADMIN))
-				return -EPERM;
-		task = NULL;
-	} else {
-		rcu_read_lock();
-		task = pid ? find_task_by_vpid(pid) : current;
-		if (!task) {
-			rcu_read_unlock();
-			return -ESRCH;
-		}
-
-		get_task_struct(task);
-
-		/*
-		 * Check if this process has the right to modify the specified
-		 * process. Use the regular "ptrace_may_access()" checks.
-		 */
-		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
-			rcu_read_unlock();
-			err = -EPERM;
-			goto out;
-		}
+	rcu_read_lock();
+	task = find_task_by_vpid(pid);
+	if (!task) {
 		rcu_read_unlock();
+		return -ESRCH;
 	}
 
-	err = sched_core_share_tasks(current, task);
+	get_task_struct(task);
+
+	/*
+	 * Check if this process has the right to modify the specified
+	 * process. Use the regular "ptrace_may_access()" checks.
+	 */
+	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+		rcu_read_unlock();
+		err = -EPERM;
+		goto out;
+	}
+	rcu_read_unlock();
+
+	if (flags == PR_SCHED_CORE_CLEAR) {
+		dest = task;
+		src = NULL;
+	} else if (flags == PR_SCHED_CORE_SHARE_TO) {
+		dest = task;
+		src = current;
+	} else if (flags == PR_SCHED_CORE_SHARE_FROM) {
+		dest = current;
+		src = task;
+	}
+
+	err = sched_core_share_tasks(dest, src);
 out:
 	if (task)
 		put_task_struct(task);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index cffdfab..50c31f3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1030,6 +1030,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 
 #ifdef CONFIG_SCHED_CORE
 	__PS("core_cookie", p->core_cookie);
+	__PS("core_task_cookie", p->core_task_cookie);
 #endif
 
 	sched_show_numa(p, m);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b3b89bd..eafb399 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1202,7 +1202,7 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p);
 void sched_core_get(void);
 void sched_core_put(void);
 
-int sched_core_share_pid(pid_t pid);
+int sched_core_share_pid(unsigned long flags, pid_t pid);
 int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sys.c b/kernel/sys.c
index 61a3c98..da52a0d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2530,9 +2530,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+#ifdef CONFIG_SCHED_CORE
 	case PR_SCHED_CORE_SHARE:
-		error = sched_core_share_pid(arg2);
+		if (arg4 || arg5)
+			return -EINVAL;
+		error = sched_core_share_pid(arg2, arg3);
 		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface
  2020-12-02 21:47   ` Chris Hyser
@ 2020-12-02 23:13     ` chris hyser
  2020-12-06 17:34     ` Joel Fernandes
  1 sibling, 0 replies; 150+ messages in thread
From: chris hyser @ 2020-12-02 23:13 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo,
	torvalds, fweisbec, keescook, kerrnel, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli, pjt,
	rostedt, derkling, benbjiang, Alexandre Chartre, James.Bottomley,
	OWeisse, Dhaval Giani, Junaid Shahid, jsbarnes, Ben Segall,
	Josh Don, Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney,
	Tim Chen

On 12/2/20 4:47 PM, Chris Hyser wrote:

> +	get_task_struct(task);
> +
> +	/*
> +	 * Check if this process has the right to modify the specified
> +	 * process. Use the regular "ptrace_may_access()" checks.
> +	 */
> +	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
> +		rcu_read_unlock();
> +		err = -EPERM;
> +		goto out;
> +	}
> +	rcu_read_unlock();
> +
> +	if (flags == PR_SCHED_CORE_CLEAR) {
> +		dest = task;
> +		src = NULL;
> +	} else if (flags == PR_SCHED_CORE_SHARE_TO) {
> +		dest = task;
> +		src = current;
> +	} else if (flags == PR_SCHED_CORE_SHARE_FROM) {
> +		dest = current;
> +		src = task;
> +	}

I should have put in an else clause to catch bad input.

> +
> +	err = sched_core_share_tasks(dest, src);
>   out:
>   	if (task)
>   		put_task_struct(task);
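
The flag dispatch above (including the missing else for bad input) can be
modeled in a few lines of Python; the names mirror the patch's constants,
but this is an illustration of the intended logic only, not kernel code:

```python
PR_SCHED_CORE_CLEAR = 0
PR_SCHED_CORE_SHARE_FROM = 1
PR_SCHED_CORE_SHARE_TO = 2

def resolve_share(flags, current, task):
    """Return the (dest, src) pair handed to sched_core_share_tasks().
    Rejecting unknown flags covers the missing-else case noted above."""
    if flags == PR_SCHED_CORE_CLEAR:
        return task, None            # clear the target task's cookie
    if flags == PR_SCHED_CORE_SHARE_TO:
        return task, current         # push current's cookie onto the target
    if flags == PR_SCHED_CORE_SHARE_FROM:
        return current, task         # pull the target's cookie into current
    raise ValueError("invalid flags")

# FROM pulls into the caller, TO pushes onto the target:
assert resolve_share(PR_SCHED_CORE_SHARE_FROM, "helper", "victim") == ("helper", "victim")
assert resolve_share(PR_SCHED_CORE_SHARE_TO, "helper", "victim") == ("victim", "helper")
```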

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
  2020-12-02 14:09                       ` Li, Aubrey
@ 2020-12-03  1:06                         ` Li, Aubrey
  0 siblings, 0 replies; 150+ messages in thread
From: Li, Aubrey @ 2020-12-03  1:06 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Peter Zijlstra, Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, tglx, linux-kernel, mingo, torvalds,
	fweisbec, keescook, kerrnel, Phil Auld, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu,
	Christian Brauner, Agata Gruza, Antonio Gomez Iglesias, graf,
	konrad.wilk, dfaggioli, pjt, rostedt, derkling, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, jsbarnes, chris.hyser, Ben Segall, Josh Don,
	Hao Luo, Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen

On 2020/12/2 22:09, Li, Aubrey wrote:
> Hi Balbir,
> 
> I still placed the patch embedded in this thread, welcome any comments.

Sorry, that version needs more work; a refined one is below. I also realized
I should put a version number on the patch, starting from v2 now.

Thanks,
-Aubrey
======================================================================
From aff2919889635aa9311d15bac3e949af0300ddc1 Mon Sep 17 00:00:00 2001
From: Aubrey Li <aubrey.li@linux.intel.com>
Date: Thu, 3 Dec 2020 00:51:18 +0000
Subject: [PATCH v2] sched: migration changes for core scheduling

 - Don't migrate if there is a cookie mismatch
     Load balancing tries to move a task from the busiest CPU to the
     destination CPU. When core scheduling is enabled, if the task's
     cookie does not match the destination CPU's core cookie, the task
     is skipped by this CPU. This mitigates forced idle time on the
     destination CPU.

 - Select a cookie-matched idle CPU
     In the fast path of task wakeup, select the first cookie-matched
     idle CPU instead of the first idle CPU.

 - Find the cookie-matched idlest CPU
     In the slow path of task wakeup, find the idlest CPU whose core
     cookie matches the task's cookie.

 - Don't migrate a task if the cookie does not match
     For NUMA load balancing, don't migrate a task to a CPU whose core
     cookie does not match the task's cookie.

Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c  | 33 +++++++++++++++++---
 kernel/sched/sched.h | 72 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 101 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de82f88ba98c..afdfea70c58c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
 			continue;
 
+		/*
+		 * Skip this cpu if source task's cookie does not match
+		 * with CPU's core cookie.
+		 */
+		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+			continue;
+
 		env->dst_cpu = cpu;
 		if (task_numa_compare(env, taskimp, groupimp, maymove))
 			break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		struct rq *rq = cpu_rq(i);
+
+		if (!sched_core_cookie_match(rq, p))
+			continue;
+
 		if (sched_idle_cpu(i))
 			return i;
 
 		if (available_idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
@@ -6129,7 +6140,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
 			return -1;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+
+		if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+		    sched_cpu_cookie_match(cpu_rq(cpu), p))
 			break;
 	}
 
@@ -7530,8 +7543,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
 	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 3) task's cookie does not match with this CPU's core cookie
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
@@ -7566,6 +7580,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+	/*
+	 * Don't migrate task if the task's cookie does not match
+	 * with the destination CPU's core cookie.
+	 */
+	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+		return 0;
+
 	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8813,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 					p->cpus_ptr))
 			continue;
 
+		/* Skip over this group if no cookie matched */
+		if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+			continue;
+
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e72942a9ee11..82917ce183b4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1119,6 +1119,7 @@ static inline bool is_migration_disabled(struct task_struct *p)
 
 #ifdef CONFIG_SCHED_CORE
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+static inline struct cpumask *sched_group_span(struct sched_group *sg);
 
 static inline bool sched_core_enabled(struct rq *rq)
 {
@@ -1135,6 +1136,61 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
 
+/*
+ * Helpers to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	return rq->core->core_cookie == p->core_cookie;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	bool idle_core = true;
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+		if (!available_idle_cpu(cpu)) {
+			idle_core = false;
+			break;
+		}
+	}
+
+	/*
+	 * A CPU in an idle core is always the best choice for tasks with
+	 * cookies.
+	 */
+	return idle_core || rq->core->core_cookie == p->core_cookie;
+}
+
+static inline bool sched_group_cookie_match(struct rq *rq,
+					    struct task_struct *p,
+					    struct sched_group *group)
+{
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu_and(cpu, sched_group_span(group), p->cpus_ptr) {
+		if (sched_core_cookie_match(rq, p))
+			return true;
+	}
+	return false;
+}
+
 extern void queue_core_balance(struct rq *rq);
 
 #else /* !CONFIG_SCHED_CORE */
@@ -1153,6 +1209,22 @@ static inline void queue_core_balance(struct rq *rq)
 {
 }
 
+static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
+
+static inline bool sched_group_cookie_match(struct rq *rq,
+					    struct task_struct *p,
+					    struct sched_group *group)
+{
+	return true;
+}
 #endif /* CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.17.1


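The matching rules implemented by the helpers in this patch can be
summarized in plain Python (a model of the logic only, with names mirroring
the kernel helpers; an idle core matches any cookie because placing a task
there forces no sibling idle):

```python
def sched_cpu_cookie_match(core_cookie, task_cookie, core_sched_enabled=True):
    """Model of sched_cpu_cookie_match(): trivially true when core
    scheduling is disabled, otherwise the cookies must be equal."""
    if not core_sched_enabled:
        return True
    return core_cookie == task_cookie

def sched_core_cookie_match(sibling_idle, core_cookie, task_cookie,
                            core_sched_enabled=True):
    """Model of sched_core_cookie_match(): a fully idle core is always
    an acceptable target, regardless of the task's cookie."""
    if not core_sched_enabled:
        return True
    return all(sibling_idle) or core_cookie == task_cookie

def sched_group_cookie_match(cores, task_cookie):
    """Model of sched_group_cookie_match(): the group matches if any of
    its cores matches.  `cores` is a list of (sibling_idle, cookie)."""
    return any(sched_core_cookie_match(idle, cookie, task_cookie)
               for idle, cookie in cores)

# A busy core only accepts a matching cookie; an idle core accepts any.
assert sched_core_cookie_match([False, True], 1, 2) is False
assert sched_core_cookie_match([True, True], 1, 2) is True
```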
^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH -tip 00/32] Core scheduling (v9)
  2020-11-24 15:08   ` Joel Fernandes
@ 2020-12-03  6:16     ` Ning, Hongyu
  0 siblings, 0 replies; 150+ messages in thread
From: Ning, Hongyu @ 2020-12-03  6:16 UTC (permalink / raw)
  To: Joel Fernandes, Vincent Guittot
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Gleixner,
	linux-kernel, Ingo Molnar, Linus Torvalds, Frederic Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, vineeth, Chen Yu, Christian Brauner,
	Agata Gruza, Antonio Gomez Iglesias, Alexander Graf, konrad.wilk,
	Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi,
	Jiang Biao, Alexandre Chartre, James Bottomley, OWeisse,
	Dhaval Giani, Junaid Shahid, Jesse Barnes, Hyser,Chris,
	Ben Segall, Josh Don, Hao Luo, Tom Lendacky, Aubrey Li,
	Paul E. McKenney, Tim Chen


On 2020/11/24 23:08, Joel Fernandes wrote:
>>>
>>> Core-Scheduling
>>> ===============
>>> Enclosed is series v9 of core scheduling.
>>> v9 is rebased on tip/master (fe4adf6f92c4 ("Merge branch 'irq/core'"))..
>>> I hope that this version is acceptable to be merged (pending any new review
>>> comments that arise) as the main issues in the past are all resolved:
>>>  1. Vruntime comparison.
>>>  2. Documentation updates.
>>>  3. CGroup and per-task interface developed by Google and Oracle.
>>>  4. Hotplug fixes.
>>> Almost all patches also have Reviewed-by or Acked-by tag. See below for full
>>> list of changes in v9.
>>>
>>> Introduction of feature
>>> =======================
>>> Core scheduling is a feature that allows only trusted tasks to run
>>> concurrently on cpus sharing compute resources (eg: hyperthreads on a
>>> core). The goal is to mitigate the core-level side-channel attacks
>>> without requiring to disable SMT (which has a significant impact on
>>> performance in some situations). Core scheduling (as of v7) mitigates
>>> user-space to user-space attacks and user to kernel attack when one of
>>> the siblings enters the kernel via interrupts or system call.
>>>
>>> By default, the feature doesn't change any of the current scheduler
>>> behavior. The user decides which tasks can run simultaneously on the
>>> same core (for now by having them in the same tagged cgroup). When a tag
>>> is enabled in a cgroup and a task from that cgroup is running on a
>>> hardware thread, the scheduler ensures that only idle or trusted tasks
>>> run on the other sibling(s). Besides security concerns, this feature can
>>> also be beneficial for RT and performance applications where we want to
>>> control how tasks make use of SMT dynamically.
>>>
>>> Both a CGroup and Per-task interface via prctl(2) are provided for configuring
>>> core sharing. More details are provided in documentation patch.  Kselftests are
>>> provided to verify the correctness/rules of the interface.
>>>
>>> Testing
>>> =======
>>> ChromeOS testing shows 300% improvement in keypress latency on a Google
>>> docs key press with Google hangout test (the maximum latency drops from 150ms
>>> to 50ms for keypresses).
>>>
>>> Julien: TPCC tests showed improvements with core-scheduling as below. With kernel
>>> protection enabled, it does not show any regression. Possibly ASI will improve
>>> the performance for those who choose kernel protection (can be toggled through
>>> sched_core_protect_kernel sysctl).
>>>                                 average         stdev           diff
>>> baseline (SMT on)               1197.272        44.78312824
>>> core sched (   kernel protect)  412.9895        45.42734343     -65.51%
>>> core sched (no kernel protect)  686.6515        71.77756931     -42.65%
>>> nosmt                           408.667         39.39042872     -65.87%
>>> (Note these results are from v8).
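As a quick arithmetic check, the diff column above is just the percentage change of each configuration's average relative to the SMT-on baseline average; a minimal sketch to reproduce it:

```python
# Reproduce the "diff" column: percentage change of each average
# relative to the SMT-on baseline average (numbers from the table above).
baseline = 1197.272

rows = {
    "core sched (kernel protect)":    412.9895,
    "core sched (no kernel protect)": 686.6515,
    "nosmt":                          408.667,
}

for name, avg in rows.items():
    diff = (avg - baseline) / baseline * 100
    print(f"{name}: {diff:.2f}%")  # -65.51%, -42.65%, -65.87%
```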
>>>
>>> Vineeth tested sysbench and does not see any regressions.
>>> Hong and Aubrey tested v9 and see results similar to v8. There is a known
>>> regression with uperf, which appears to be caused by ksoftirqd heavily
>>> contending with other tasks on the core. The consensus is that this can
>>> be improved in the future.
>>>
>>> Changes in v9
>>> =============
>>> - Note that the vruntime snapshot change is written in 2 patches to show the
>>>   progression of the idea and prevent merge conflicts:
>>>     sched/fair: Snapshot the min_vruntime of CPUs on force idle
>>>     sched: Improve snapshotting of min_vruntime for CGroups
>>>   Same with the RT priority inversion change:
>>>     sched: Fix priority inversion of cookied task with sibling
>>>     sched: Improve snapshotting of min_vruntime for CGroups
>>> - Disable coresched on certain AMD HW.
>>>
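For readers following the vruntime discussion above: the snapshot idea is, roughly, to compare tasks across sibling runqueues by vruntime normalized against a per-runqueue min_vruntime snapshot taken when the core enters force idle. A simplified illustration of the comparison only (not the actual kernel code; the names are invented for this sketch):

```python
from dataclasses import dataclass

@dataclass
class Task:
    vruntime: int            # task's vruntime, in ns
    rq_min_vruntime_fi: int  # its runqueue's min_vruntime snapshot at force idle

def core_prio_less(a: Task, b: Task) -> bool:
    """True if `a` should be preferred over `b` during core-wide selection.

    Raw vruntimes from different runqueues are not directly comparable, so
    each is first normalized against its own runqueue's snapshot.
    """
    return (a.vruntime - a.rq_min_vruntime_fi) < (b.vruntime - b.rq_min_vruntime_fi)

# `b` has the larger raw vruntime but ran less since its snapshot, so it wins:
a = Task(vruntime=1000, rq_min_vruntime_fi=900)   # 100ns since snapshot
b = Task(vruntime=5000, rq_min_vruntime_fi=4950)  # 50ns since snapshot
print(core_prio_less(b, a))  # True
```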

Adding workload and negative-case test results for the posted core scheduling v9:

- kernel under test: 
	-- coresched community v9 posted from https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=sched/coresched-v9-posted (tag: sched/coresched-v9-posted)
	-- latest commit: d48636e429de (HEAD -> coresched-v9-posted, tag: sched/coresched-v9-posted) sched: Debug bits...
	-- coresched=on kernel parameter applied
- workloads: 
	-- A. sysbench cpu (192 threads) + sysbench cpu (192 threads)
	-- B. sysbench cpu (192 threads) + sysbench mysql (192 threads, mysqld forced into the same cgroup)
	-- C. uperf netperf.xml (192 threads over TCP or UDP protocol separately)
	-- D. will-it-scale context_switch via pipe (192 threads)
- negative case:
	-- A. continuously toggle cpu.core_tag while a fully loaded uperf workload runs with cs_on
	-- B. continuously toggle the SMT setting via /sys/devices/system/cpu/smt/control while a fully loaded uperf workload runs with cs_on
	-- C. continuously switch tasks between the cs_on and cs_off cgroups via cgclassify while a fully loaded uperf workload runs
- test machine setup: 
	CPU(s):              192
	On-line CPU(s) list: 0-191
	Thread(s) per core:  2
	Core(s) per socket:  48
	Socket(s):           2
	NUMA node(s):        4
- test results of workloads, no obvious performance drop compared to community v8 build:
	-- workload A:
	+----------------------+------+----------------------+------------------------+
	| workloads            | **   | sysbench cpu * 192   | sysbench cpu * 192     |
	+======================+======+======================+========================+
	| cgroup               | **   | cg_sysbench_cpu_0    | cg_sysbench_cpu_1      |
	+----------------------+------+----------------------+------------------------+
	| record_item          | **   | Tput_avg (events/s)  | Tput_avg (events/s)    |
	+----------------------+------+----------------------+------------------------+
	| coresched_normalized | **   | 0.97                 | 1.02                   |
	+----------------------+------+----------------------+------------------------+
	| default_normalized   | **   | 1.00                 | 1.00                   |
	+----------------------+------+----------------------+------------------------+
	| smtoff_normalized    | **   | 0.60                 | 0.60                   |
	+----------------------+------+----------------------+------------------------+

	-- workload B:
	+----------------------+------+----------------------+------------------------+
	| workloads            | **   | sysbench cpu * 192   | sysbench mysql * 192   |
	+======================+======+======================+========================+
	| cgroup               | **   | cg_sysbench_cpu_0    | cg_sysbench_mysql_0    |
	+----------------------+------+----------------------+------------------------+
	| record_item          | **   | Tput_avg (events/s)  | Tput_avg (events/s)    |
	+----------------------+------+----------------------+------------------------+
	| coresched_normalized | **   | 0.94                 | 0.88                   |
	+----------------------+------+----------------------+------------------------+
	| default_normalized   | **   | 1.00                 | 1.00                   |
	+----------------------+------+----------------------+------------------------+
	| smtoff_normalized    | **   | 0.56                 | 0.84                   |
	+----------------------+------+----------------------+------------------------+

	-- workload C:
	+----------------------+------+---------------------------+---------------------------+
	| workloads            | **   | uperf netperf TCP * 192   | uperf netperf UDP * 192   |
	+======================+======+===========================+===========================+
	| cgroup               | **   | cg_uperf                  | cg_uperf                  |
	+----------------------+------+---------------------------+---------------------------+
	| record_item          | **   | Tput_avg (Gb/s)           | Tput_avg (Gb/s)           |
	+----------------------+------+---------------------------+---------------------------+
	| coresched_normalized | **   | 0.64                      | 0.68                      |
	+----------------------+------+---------------------------+---------------------------+
	| default_normalized   | **   | 1.00                      | 1.00                      |
	+----------------------+------+---------------------------+---------------------------+
	| smtoff_normalized    | **   | 0.92                      | 0.89                      |
	+----------------------+------+---------------------------+---------------------------+

	-- workload D:
	+----------------------+------+-------------------------------+
	| workloads            | **   | will-it-scale  * 192          |
	|                      |      | (pipe based context_switch)   |
	+======================+======+===============================+
	| cgroup               | **   | cg_will-it-scale              |
	+----------------------+------+-------------------------------+
	| record_item          | **   | threads_avg                   |
	+----------------------+------+-------------------------------+
	| coresched_normalized | **   | 0.30                          |
	+----------------------+------+-------------------------------+
	| default_normalized   | **   | 1.00                          |
	+----------------------+------+-------------------------------+
	| smtoff_normalized    | **   | 0.87                          |
	+----------------------+------+-------------------------------+

	-- notes on record_item:
	* coresched_normalized: SMT on, cs enabled; result normalized to the default value
	* default_normalized: SMT on, cs disabled; result normalized to the default value (1.00 by definition)
	* smtoff_normalized: SMT off; result normalized to the default value
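	In other words, each normalized entry is the raw result divided by the default (SMT on, cs disabled) result for the same workload; a one-line sketch, with made-up raw numbers for illustration:

```python
def normalize(raw, default_raw):
    """Normalize a raw throughput result to the default (smton, cs disabled) run."""
    return raw / default_raw

# Hypothetical raw events/s for one sysbench cpu instance (illustrative only):
default_raw = 10000.0
print(f"coresched_normalized: {normalize(9700.0, default_raw):.2f}")      # 0.97
print(f"default_normalized:   {normalize(default_raw, default_raw):.2f}") # 1.00
print(f"smtoff_normalized:    {normalize(6000.0, default_raw):.2f}")      # 0.60
```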

- test results of negative cases: all as expected; no kernel panic or system hang observed
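The toggling in negative cases A and B amounts to repeatedly rewriting two sysfs/cgroup files. A rough sketch of such a stress loop (the cgroup mount point and group name are assumptions that will vary per setup, and the writes require root plus a coresched-enabled kernel):

```python
import itertools
import time

# Assumed paths; adjust the cgroup mount point and group name for your setup.
CORE_TAG = "/sys/fs/cgroup/cpu/cg_uperf/cpu.core_tag"  # negative case A
SMT_CTRL = "/sys/devices/system/cpu/smt/control"       # negative case B

def toggle_sequence(values, n):
    """Return the first n values of an alternating toggle sequence."""
    return list(itertools.islice(itertools.cycle(values), n))

def stress(path, values, iters, delay=1.0):
    """Repeatedly rewrite `path`, alternating between `values`."""
    for v in toggle_sequence(values, iters):
        with open(path, "w") as f:
            f.write(v)
        time.sleep(delay)

# e.g. stress(CORE_TAG, ["1", "0"], 600) while uperf runs fully loaded in cg_uperf
# e.g. stress(SMT_CTRL, ["off", "on"], 600)
```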

Hongyu


* Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork
  2020-12-02  7:54           ` Peter Zijlstra
@ 2020-12-04  0:20             ` Josh Don
  0 siblings, 0 replies; 150+ messages in thread
From: Josh Don @ 2020-12-04  0:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Gleixner,
	linux-kernel, mingo, torvalds, fweisbec, Kees Cook, Greg Kerr,
	Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, chris.hyser, Ben Segall, Hao Luo,
	Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen,
	Oleg Rombakh

On Tue, Dec 1, 2020 at 11:55 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Then disallow sharing a task cookie when the tasks are in different
> cgroups or disallow cgroup movement when they share a cookie.

Yes, we could restrict task cookie sharing to tasks that are in the
same cgroup. Then the cookie simply becomes a single value: either
the task cookie or the group cookie.

The advantage of the approach with the cookie struct is that it is
easily extensible and allows for trust models that don't conform
exactly to the cgroup hierarchy (i.e., our discussion on cookie color).
The overhead of the approach seems tolerable, given that updates to a
task's cookie are not in fast paths (i.e., prctl, setting the cgroup
cookie, sched_move_task). Are you more concerned with the added
complexity of maintaining the RB tree, refcounts, etc.?
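The "single value" alternative mentioned above can be sketched in a few lines (illustrative toy logic only, not the patch's actual data structures):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SchedTask:
    task_cookie: Optional[int] = None   # set via the per-task interface, if any
    group_cookie: Optional[int] = None  # inherited from a tagged cgroup, if any

def effective_cookie(t: SchedTask) -> Optional[int]:
    """With task-cookie sharing restricted to one cgroup, the effective cookie
    collapses to a single value: the task cookie if set, else the group cookie."""
    return t.task_cookie if t.task_cookie is not None else t.group_cookie

def can_share_core(a: SchedTask, b: SchedTask) -> bool:
    # Untagged tasks (both cookies None) trivially share, matching the
    # default "no behavior change" semantics.
    return effective_cookie(a) == effective_cookie(b)
```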


* Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
  2020-12-02  8:02       ` Peter Zijlstra
  2020-12-02 18:53         ` Tejun Heo
@ 2020-12-04  0:51         ` Josh Don
  2020-12-04 15:45           ` Tejun Heo
  1 sibling, 1 reply; 150+ messages in thread
From: Josh Don @ 2020-12-04  0:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Tim Chen, Vineeth Pillai,
	Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel, mingo,
	torvalds, fweisbec, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	vineeth, Chen Yu, Christian Brauner, Agata Gruza,
	Antonio Gomez Iglesias, graf, konrad.wilk, dfaggioli,
	Paul Turner, Steven Rostedt, Patrick Bellasi, benbjiang,
	Alexandre Chartre, James.Bottomley, OWeisse, Dhaval Giani,
	Junaid Shahid, Jesse Barnes, chris.hyser, Ben Segall, Hao Luo,
	Tom Lendacky, Aubrey Li, Paul E. McKenney, Tim Chen,
	Oleg Rombakh, Tejun Heo

On Wed, Dec 2, 2020 at 12:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Dec 01, 2020 at 10:18:00PM -0800, Josh Don wrote:
> > Hey Peter,
> >
> > On Wed, Nov 25, 2020 at 5:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > Why can't the above work by setting 'tag' (that's a terrible name, why
> > > does that still live) in CDE? Have the most specific tag live. Same with
> > > that thread stuff.
> >
> > The motivation is to allow an unprivileged user the ability to
> > configure the trust hierarchy in a way that otherwise wouldn't be
> > possible for a given cgroup hierarchy.  For example, given a cookie'd
> > hierarchy such as:
> >
> >       A
> >    /  |  |   \
> > B  C  D  E
> >
> > the user might only want subsets of {B, C, D, E} to share.  For
> > instance, the user might only want {B,C} and {D, E} to share.  One way
> > to solve this would be to allow the user to write the group cookie
> > directly.  However, this interface would need to be restricted to
> > privileged users, since otherwise the cookie could be configured to
> > share with any arbitrary cgroup.  The purpose of the 'color' field is
> > to expose a portion of the cookie that can be modified by a
> > non-privileged user in order to achieve this sharing goal.
> >
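To make the quoted color idea concrete, a toy model: treat the effective cookie as the pair (group cookie, color), with siblings allowed to share a core only when the pairs match. The names here are invented for illustration and are not the patch's API:

```python
def cookie(group: int, color: int = 0) -> tuple:
    """Toy cookie: a (group, color) pair; color 0 means the whole group shares."""
    return (group, color)

def may_share(a: tuple, b: tuple) -> bool:
    # Siblings co-schedule only when both the group cookie and color match.
    return a == b

# Cgroups B, C, D, E all live under tagged parent A (group cookie 1).
# An unprivileged user sets color 1 on {B, C} and color 2 on {D, E}:
B, C = cookie(1, color=1), cookie(1, color=1)
D, E = cookie(1, color=2), cookie(1, color=2)

print(may_share(B, C))  # True  -> {B, C} still share a core
print(may_share(B, D))  # False -> B and D no longer co-schedule
```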
> > If this doesn't seem like a useful case, I'm happy to drop this patch
> > from the series to unbloc