linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 00/13] Core scheduling v5
@ 2020-03-04 16:59 vpillai
  2020-03-04 16:59 ` [RFC PATCH 01/13] sched: Wrap rq::lock access vpillai
                   ` (20 more replies)
  0 siblings, 21 replies; 110+ messages in thread
From: vpillai @ 2020-03-04 16:59 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: vpillai, linux-kernel, fweisbec, keescook, kerrnel, Phil Auld,
	Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel


Fifth iteration of the Core-Scheduling feature.

Core scheduling is a feature that allows only trusted tasks to run
concurrently on CPUs sharing compute resources (e.g. hyperthreads on a
core). The goal is to mitigate core-level side-channel attacks
without requiring SMT to be disabled (which has a significant impact on
performance in some situations). So far, the feature mitigates user-space
to user-space attacks, but not user-space to kernel attacks, which remain
possible when one of the hardware threads enters the kernel (syscall,
interrupt, etc).

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the
same core (for now, by having them in the same tagged cgroup). When
a tag is enabled in a cgroup and a task from that cgroup is running
on a hardware thread, the scheduler ensures that only idle or trusted
tasks run on the other sibling(s). Besides the security use case, this
feature can also benefit RT and performance-sensitive applications
that want to control how tasks make use of SMT dynamically.
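
As a rough illustration of the cgroup-based interface, tagging could look
like the sketch below. The cgroup mount point and the cpu.tag file name are
assumptions based on this series' cgroup tagging patch; treat the exact
paths as illustrative, not authoritative.

```shell
# Sketch only (assumed paths): tag a cgroup so that tasks in it are
# trusted to share a core with each other, but not with untagged tasks.
mkdir /sys/fs/cgroup/cpu/trusted_vm
echo 1 > /sys/fs/cgroup/cpu/trusted_vm/cpu.tag   # enable the core-sched tag
echo $$ > /sys/fs/cgroup/cpu/trusted_vm/tasks    # move the current shell in
```

Untagged tasks keep the default scheduler behavior; once tagged, a task's
siblings may only run tasks from the same tagged cgroup, or idle.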

This version focuses on performance and stability. A couple of
crashes related to task tagging and the cpu hotplug path were fixed.
This version also improves performance considerably by making
task migration and load balancing coresched-aware.

In terms of performance, the major difference since the last iteration
is that now even IO-heavy and mixed-resource workloads are less
impacted by core scheduling than by disabling SMT. Both host-level and
VM-level benchmarks were performed. Details in:
https://lkml.org/lkml/2020/2/12/1194
https://lkml.org/lkml/2019/11/1/269

v5 is rebased on top of 5.5.5 (449718782a46)
https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y

Changes in v5
-------------
- Fixes for cgroup/process tagging during corner cases such as cgroup
  destruction, tasks moving across cgroups, etc.
  - Tim Chen
- Coresched aware task migrations
  - Aubrey Li
- Other minor stability fixes.

Changes in v4
-------------
- Implement a core-wide min_vruntime for vruntime comparison of tasks
  across CPUs in a core.
  - Aaron Lu
- Fixes a typo bug in setting the forced_idle CPU.
  - Aaron Lu

Changes in v3
-------------
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
-------------
- Fixes for a couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves the priority comparison logic for processes on different CPUs
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO-heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for the 32-bit build
  - Aubrey Li

ISSUES
------
- Aaron (Intel) found an issue with load balancing when tasks have
  different weights (nice or cgroup shares). Task weight is not considered
  in coresched-aware load balancing, which causes higher-weight tasks
  to starve.
- Joel (ChromeOS) found an issue where an RT task may be preempted by a
  lower-class task.
- Joel (ChromeOS) found a deadlock and crash on PREEMPT kernels in the
  coresched idle balance logic.

TODO
----
- Merge the patches that are ready to be merged
- Decide on the API for exposing the feature to userland
- Experiment with adding synchronization points in VMEXIT to mitigate
  VM-to-host-kernel leaks
- Investigate the source of the overhead even when no tasks are tagged:
  https://lkml.org/lkml/2019/10/29/242

---

Aaron Lu (2):
  sched/fair: wrapper for cfs_rq->min_vruntime
  sched/fair: core wide vruntime comparison

Aubrey Li (1):
  sched: migration changes for core scheduling

Peter Zijlstra (9):
  sched: Wrap rq::lock access
  sched: Introduce sched_class::pick_task()
  sched: Core-wide rq->lock
  sched/fair: Add a few assertions
  sched: Basic tracking of matching tasks
  sched: Add core wide task selection and scheduling.
  sched: Trivial forced-newidle balancer
  sched: cgroup tagging interface for core scheduling
  sched: Debug bits...

Tim Chen (1):
  sched: Update core scheduler queue when taking cpu online/offline

 include/linux/sched.h    |    9 +-
 kernel/Kconfig.preempt   |    6 +
 kernel/sched/core.c      | 1037 +++++++++++++++++++++++++++++++++++++-
 kernel/sched/cpuacct.c   |   12 +-
 kernel/sched/deadline.c  |   69 ++-
 kernel/sched/debug.c     |    4 +-
 kernel/sched/fair.c      |  387 +++++++++++---
 kernel/sched/idle.c      |   11 +-
 kernel/sched/pelt.h      |    2 +-
 kernel/sched/rt.c        |   65 ++-
 kernel/sched/sched.h     |  248 +++++++--
 kernel/sched/stop_task.c |   13 +-
 kernel/sched/topology.c  |    4 +-
 13 files changed, 1672 insertions(+), 195 deletions(-)

-- 
2.17.1



* [RFC PATCH 01/13] sched: Wrap rq::lock access
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
@ 2020-03-04 16:59 ` vpillai
  2020-03-04 16:59 ` [RFC PATCH 02/13] sched: Introduce sched_class::pick_task() vpillai
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: vpillai @ 2020-03-04 16:59 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu,
	Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel,
	Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

In preparation of playing games with rq->lock, abstract the thing
using an accessor.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---
 kernel/sched/core.c     |  46 +++++++++---------
 kernel/sched/cpuacct.c  |  12 ++---
 kernel/sched/deadline.c |  18 +++----
 kernel/sched/debug.c    |   4 +-
 kernel/sched/fair.c     |  38 +++++++--------
 kernel/sched/idle.c     |   4 +-
 kernel/sched/pelt.h     |   2 +-
 kernel/sched/rt.c       |   8 +--
 kernel/sched/sched.h    | 105 +++++++++++++++++++++-------------------
 kernel/sched/topology.c |   4 +-
 10 files changed, 122 insertions(+), 119 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b2564d62a0f7..28ba9b56dd8a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -85,12 +85,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 
 	for (;;) {
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 
 		while (unlikely(task_on_rq_migrating(p)))
 			cpu_relax();
@@ -109,7 +109,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	for (;;) {
 		raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		/*
 		 *	move_queued_task()		task_rq_lock()
 		 *
@@ -131,7 +131,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 		raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 
 		while (unlikely(task_on_rq_migrating(p)))
@@ -201,7 +201,7 @@ void update_rq_clock(struct rq *rq)
 {
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (rq->clock_update_flags & RQCF_ACT_SKIP)
 		return;
@@ -510,7 +510,7 @@ void resched_curr(struct rq *rq)
 	struct task_struct *curr = rq->curr;
 	int cpu;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (test_tsk_need_resched(curr))
 		return;
@@ -534,10 +534,10 @@ void resched_cpu(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	if (cpu_online(cpu) || cpu == smp_processor_id())
 		resched_curr(rq);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 #ifdef CONFIG_SMP
@@ -949,7 +949,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
 	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
 	struct uclamp_bucket *bucket;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	/* Update task effective clamp */
 	p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
@@ -989,7 +989,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 	unsigned int bkt_clamp;
 	unsigned int rq_clamp;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	bucket = &uc_rq->bucket[uc_se->bucket_id];
 	SCHED_WARN_ON(!bucket->tasks);
@@ -1490,7 +1490,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 				   struct task_struct *p, int new_cpu)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
 	dequeue_task(rq, p, DEQUEUE_NOCLOCK);
@@ -1604,7 +1604,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 		 * Because __kthread_bind() calls this on blocked tasks without
 		 * holding rq->lock.
 		 */
-		lockdep_assert_held(&rq->lock);
+		lockdep_assert_held(rq_lockp(rq));
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	}
 	if (running)
@@ -1736,7 +1736,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * task_rq_lock().
 	 */
 	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
-				      lockdep_is_held(&task_rq(p)->lock)));
+				      lockdep_is_held(rq_lockp(task_rq(p)))));
 #endif
 	/*
 	 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -2253,7 +2253,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 {
 	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 #ifdef CONFIG_SMP
 	if (p->sched_contributes_to_load)
@@ -3107,10 +3107,10 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
 	 * do an early lockdep release here:
 	 */
 	rq_unpin_lock(rq, rf);
-	spin_release(&rq->lock.dep_map, _THIS_IP_);
+	spin_release(&rq_lockp(rq)->dep_map, _THIS_IP_);
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
-	rq->lock.owner = next;
+	rq_lockp(rq)->owner = next;
 #endif
 }
 
@@ -3121,8 +3121,8 @@ static inline void finish_lock_switch(struct rq *rq)
 	 * fix up the runqueue lock - which gets 'carried over' from
 	 * prev into current:
 	 */
-	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
-	raw_spin_unlock_irq(&rq->lock);
+	spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
+	raw_spin_unlock_irq(rq_lockp(rq));
 }
 
 /*
@@ -3272,7 +3272,7 @@ static void __balance_callback(struct rq *rq)
 	void (*func)(struct rq *rq);
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	head = rq->balance_callback;
 	rq->balance_callback = NULL;
 	while (head) {
@@ -3283,7 +3283,7 @@ static void __balance_callback(struct rq *rq)
 
 		func(rq);
 	}
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 static inline void balance_callback(struct rq *rq)
@@ -6033,7 +6033,7 @@ void init_idle(struct task_struct *idle, int cpu)
 	__sched_fork(0, idle);
 
 	raw_spin_lock_irqsave(&idle->pi_lock, flags);
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 
 	idle->state = TASK_RUNNING;
 	idle->se.exec_start = sched_clock();
@@ -6070,7 +6070,7 @@ void init_idle(struct task_struct *idle, int cpu)
 #ifdef CONFIG_SMP
 	idle->on_cpu = 1;
 #endif
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
 
 	/* Set the preempt count _outside_ the spinlocks! */
@@ -6632,7 +6632,7 @@ void __init sched_init(void)
 		struct rq *rq;
 
 		rq = cpu_rq(i);
-		raw_spin_lock_init(&rq->lock);
+		raw_spin_lock_init(&rq->__lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
 		rq->calc_load_update = jiffies + LOAD_FREQ;
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 9fbb10383434..78de28ebc45d 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -111,7 +111,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	/*
 	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	if (index == CPUACCT_STAT_NSTATS) {
@@ -125,7 +125,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	}
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	return data;
@@ -140,14 +140,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 	/*
 	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
 		cpuusage->usages[i] = val;
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 }
 
@@ -252,13 +252,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 			 * Take rq->lock to make 64-bit read safe on 32-bit
 			 * platforms.
 			 */
-			raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 			seq_printf(m, " %llu", cpuusage->usages[index]);
 
 #ifndef CONFIG_64BIT
-			raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 		}
 		seq_puts(m, "\n");
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 43323f875cb9..ded147f84382 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -80,7 +80,7 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->running_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
 	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
@@ -93,7 +93,7 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->running_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
 	if (dl_rq->running_bw > old)
@@ -107,7 +107,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->this_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
 }
@@ -117,7 +117,7 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->this_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
 	if (dl_rq->this_bw > old)
@@ -925,7 +925,7 @@ static int start_dl_timer(struct task_struct *p)
 	ktime_t now, act;
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	/*
 	 * We want the timer to fire at the deadline, but considering
@@ -1035,9 +1035,9 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 		 * If the runqueue is no longer available, migrate the
 		 * task elsewhere. This necessarily changes rq.
 		 */
-		lockdep_unpin_lock(&rq->lock, rf.cookie);
+		lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
 		rq = dl_task_offline_migration(rq, p);
-		rf.cookie = lockdep_pin_lock(&rq->lock);
+		rf.cookie = lockdep_pin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		/*
@@ -1652,7 +1652,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
 	 * from try_to_wake_up(). Hence, p->pi_lock is locked, but
 	 * rq->lock is not... So, lock it
 	 */
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	if (p->dl.dl_non_contending) {
 		sub_running_bw(&p->dl, &rq->dl);
 		p->dl.dl_non_contending = 0;
@@ -1667,7 +1667,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
 			put_task_struct(p);
 	}
 	sub_rq_bw(&p->dl, &rq->dl);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f7e4579e746c..7e5f2237c7e4 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -498,7 +498,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "exec_clock",
 			SPLIT_NS(cfs_rq->exec_clock));
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	if (rb_first_cached(&cfs_rq->tasks_timeline))
 		MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
 	last = __pick_last_entity(cfs_rq);
@@ -506,7 +506,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 		max_vruntime = last->vruntime;
 	min_vruntime = cfs_rq->min_vruntime;
 	rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
 			SPLIT_NS(MIN_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba749f579714..3b218753bf7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1091,7 +1091,7 @@ struct numa_group {
 static struct numa_group *deref_task_numa_group(struct task_struct *p)
 {
 	return rcu_dereference_check(p->numa_group, p == current ||
-		(lockdep_is_held(&task_rq(p)->lock) && !READ_ONCE(p->on_cpu)));
+		(lockdep_is_held(rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu)));
 }
 
 static struct numa_group *deref_curr_numa_group(struct task_struct *p)
@@ -5035,7 +5035,7 @@ static void __maybe_unused update_runtime_enabled(struct rq *rq)
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -5054,7 +5054,7 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -6438,7 +6438,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
 		 * rq->lock and can modify state directly.
 		 */
-		lockdep_assert_held(&task_rq(p)->lock);
+		lockdep_assert_held(rq_lockp(task_rq(p)));
 		detach_entity_cfs_rq(&p->se);
 
 	} else {
@@ -7066,7 +7066,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 {
 	s64 delta;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	if (p->sched_class != &fair_sched_class)
 		return 0;
@@ -7160,7 +7160,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
 	int tsk_cache_hot;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	/*
 	 * We do not migrate tasks that are:
@@ -7238,7 +7238,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
  */
 static void detach_task(struct task_struct *p, struct lb_env *env)
 {
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, env->dst_cpu);
@@ -7254,7 +7254,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
 {
 	struct task_struct *p;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	list_for_each_entry_reverse(p,
 			&env->src_rq->cfs_tasks, se.group_node) {
@@ -7290,7 +7290,7 @@ static int detach_tasks(struct lb_env *env)
 	struct task_struct *p;
 	int detached = 0;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	if (env->imbalance <= 0)
 		return 0;
@@ -7405,7 +7405,7 @@ static int detach_tasks(struct lb_env *env)
  */
 static void attach_task(struct rq *rq, struct task_struct *p)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	BUG_ON(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
@@ -9291,7 +9291,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		if (need_active_balance(&env)) {
 			unsigned long flags;
 
-			raw_spin_lock_irqsave(&busiest->lock, flags);
+			raw_spin_lock_irqsave(rq_lockp(busiest), flags);
 
 			/*
 			 * Don't kick the active_load_balance_cpu_stop,
@@ -9299,7 +9299,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			 * moved to this_cpu:
 			 */
 			if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
-				raw_spin_unlock_irqrestore(&busiest->lock,
+				raw_spin_unlock_irqrestore(rq_lockp(busiest),
 							    flags);
 				env.flags |= LBF_ALL_PINNED;
 				goto out_one_pinned;
@@ -9315,7 +9315,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 				busiest->push_cpu = this_cpu;
 				active_balance = 1;
 			}
-			raw_spin_unlock_irqrestore(&busiest->lock, flags);
+			raw_spin_unlock_irqrestore(rq_lockp(busiest), flags);
 
 			if (active_balance) {
 				stop_one_cpu_nowait(cpu_of(busiest),
@@ -10058,7 +10058,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
 		return;
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 	/*
 	 * This CPU is going to be idle and blocked load of idle CPUs
 	 * need to be updated. Run the ilb locally as it is a good
@@ -10067,7 +10067,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	 */
 	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
 		kick_ilb(NOHZ_STATS_KICK);
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock(rq_lockp(this_rq));
 }
 
 #else /* !CONFIG_NO_HZ_COMMON */
@@ -10133,7 +10133,7 @@ int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		goto out;
 	}
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 
 	update_blocked_averages(this_cpu);
 	rcu_read_lock();
@@ -10174,7 +10174,7 @@ int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	}
 	rcu_read_unlock();
 
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock(rq_lockp(this_rq));
 
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
@@ -10647,9 +10647,9 @@ void unregister_fair_sched_group(struct task_group *tg)
 
 		rq = cpu_rq(cpu);
 
-		raw_spin_lock_irqsave(&rq->lock, flags);
+		raw_spin_lock_irqsave(rq_lockp(rq), flags);
 		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
-		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 	}
 }
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index ffa959e91227..f8653290de95 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -413,10 +413,10 @@ struct task_struct *pick_next_task_idle(struct rq *rq)
 static void
 dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
 {
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq(rq_lockp(rq));
 	printk(KERN_ERR "bad: scheduling from the idle thread!\n");
 	dump_stack();
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq(rq_lockp(rq));
 }
 
 /*
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index afff644da065..6649cb63e32a 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -116,7 +116,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
 
 static inline u64 rq_clock_pelt(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock_pelt - rq->lost_idle_time;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e591d40fd645..fc7d6706b209 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -846,7 +846,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 		if (skip)
 			continue;
 
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		if (rt_rq->rt_time) {
@@ -884,7 +884,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 
 		if (enqueue)
 			sched_rt_rq_enqueue(rt_rq);
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 	}
 
 	if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
@@ -2021,9 +2021,9 @@ void rto_push_irq_work_func(struct irq_work *work)
 	 * When it gets updated, a check is made if a push is possible.
 	 */
 	if (has_pushable_tasks(rq)) {
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		push_rt_tasks(rq);
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 	}
 
 	raw_spin_lock(&rd->rto_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 280a3c735935..a306008a12f7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -846,7 +846,7 @@ struct uclamp_rq {
  */
 struct rq {
 	/* runqueue lock: */
-	raw_spinlock_t		lock;
+	raw_spinlock_t		__lock;
 
 	/*
 	 * nr_running and cpu_load should be in the same cacheline because
@@ -1026,6 +1026,10 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	return &rq->__lock;
+}
 
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
@@ -1093,7 +1097,7 @@ static inline void assert_clock_updated(struct rq *rq)
 
 static inline u64 rq_clock(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock;
@@ -1101,7 +1105,7 @@ static inline u64 rq_clock(struct rq *rq)
 
 static inline u64 rq_clock_task(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock_task;
@@ -1109,7 +1113,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 
 static inline void rq_clock_skip_update(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	rq->clock_update_flags |= RQCF_REQ_SKIP;
 }
 
@@ -1119,7 +1123,7 @@ static inline void rq_clock_skip_update(struct rq *rq)
  */
 static inline void rq_clock_cancel_skipupdate(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	rq->clock_update_flags &= ~RQCF_REQ_SKIP;
 }
 
@@ -1138,7 +1142,7 @@ struct rq_flags {
 
 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	rf->cookie = lockdep_pin_lock(&rq->lock);
+	rf->cookie = lockdep_pin_lock(rq_lockp(rq));
 
 #ifdef CONFIG_SCHED_DEBUG
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
@@ -1153,12 +1157,12 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
 		rf->clock_update_flags = RQCF_UPDATED;
 #endif
 
-	lockdep_unpin_lock(&rq->lock, rf->cookie);
+	lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
 }
 
 static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	lockdep_repin_lock(&rq->lock, rf->cookie);
+	lockdep_repin_lock(rq_lockp(rq), rf->cookie);
 
 #ifdef CONFIG_SCHED_DEBUG
 	/*
@@ -1179,7 +1183,7 @@ static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static inline void
@@ -1188,7 +1192,7 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 	__releases(p->pi_lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 }
 
@@ -1196,7 +1200,7 @@ static inline void
 rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irqsave(&rq->lock, rf->flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), rf->flags);
 	rq_pin_lock(rq, rf);
 }
 
@@ -1204,7 +1208,7 @@ static inline void
 rq_lock_irq(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq(rq_lockp(rq));
 	rq_pin_lock(rq, rf);
 }
 
@@ -1212,7 +1216,7 @@ static inline void
 rq_lock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	rq_pin_lock(rq, rf);
 }
 
@@ -1220,7 +1224,7 @@ static inline void
 rq_relock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	rq_repin_lock(rq, rf);
 }
 
@@ -1229,7 +1233,7 @@ rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), rf->flags);
 }
 
 static inline void
@@ -1237,7 +1241,7 @@ rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq(rq_lockp(rq));
 }
 
 static inline void
@@ -1245,7 +1249,7 @@ rq_unlock(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static inline struct rq *
@@ -1310,7 +1314,7 @@ queue_balance_callback(struct rq *rq,
 		       struct callback_head *head,
 		       void (*func)(struct rq *rq))
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (unlikely(head->next))
 		return;
@@ -1994,7 +1998,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 	double_rq_lock(this_rq, busiest);
 
 	return 1;
@@ -2013,20 +2017,22 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	int ret = 0;
-
-	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
-		if (busiest < this_rq) {
-			raw_spin_unlock(&this_rq->lock);
-			raw_spin_lock(&busiest->lock);
-			raw_spin_lock_nested(&this_rq->lock,
-					      SINGLE_DEPTH_NESTING);
-			ret = 1;
-		} else
-			raw_spin_lock_nested(&busiest->lock,
-					      SINGLE_DEPTH_NESTING);
+	if (rq_lockp(this_rq) == rq_lockp(busiest))
+		return 0;
+
+	if (likely(raw_spin_trylock(rq_lockp(busiest))))
+		return 0;
+
+	if (rq_lockp(busiest) >= rq_lockp(this_rq)) {
+		raw_spin_lock_nested(rq_lockp(busiest), SINGLE_DEPTH_NESTING);
+		return 0;
 	}
-	return ret;
+
+	raw_spin_unlock(rq_lockp(this_rq));
+	raw_spin_lock(rq_lockp(busiest));
+	raw_spin_lock_nested(rq_lockp(this_rq), SINGLE_DEPTH_NESTING);
+
+	return 1;
 }
 
 #endif /* CONFIG_PREEMPTION */
@@ -2036,11 +2042,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
  */
 static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 {
-	if (unlikely(!irqs_disabled())) {
-		/* printk() doesn't work well under rq->lock */
-		raw_spin_unlock(&this_rq->lock);
-		BUG_ON(1);
-	}
+	lockdep_assert_irqs_disabled();
 
 	return _double_lock_balance(this_rq, busiest);
 }
@@ -2048,8 +2050,9 @@ static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
 	__releases(busiest->lock)
 {
-	raw_spin_unlock(&busiest->lock);
-	lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
+	if (rq_lockp(this_rq) != rq_lockp(busiest))
+		raw_spin_unlock(rq_lockp(busiest));
+	lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
 }
 
 static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
@@ -2090,16 +2093,16 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 	__acquires(rq2->lock)
 {
 	BUG_ON(!irqs_disabled());
-	if (rq1 == rq2) {
-		raw_spin_lock(&rq1->lock);
+	if (rq_lockp(rq1) == rq_lockp(rq2)) {
+		raw_spin_lock(rq_lockp(rq1));
 		__acquire(rq2->lock);	/* Fake it out ;) */
 	} else {
-		if (rq1 < rq2) {
-			raw_spin_lock(&rq1->lock);
-			raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
+		if (rq_lockp(rq1) < rq_lockp(rq2)) {
+			raw_spin_lock(rq_lockp(rq1));
+			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
 		} else {
-			raw_spin_lock(&rq2->lock);
-			raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
+			raw_spin_lock(rq_lockp(rq2));
+			raw_spin_lock_nested(rq_lockp(rq1), SINGLE_DEPTH_NESTING);
 		}
 	}
 }
@@ -2114,9 +2117,9 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	raw_spin_unlock(&rq1->lock);
-	if (rq1 != rq2)
-		raw_spin_unlock(&rq2->lock);
+	raw_spin_unlock(rq_lockp(rq1));
+	if (rq_lockp(rq1) != rq_lockp(rq2))
+		raw_spin_unlock(rq_lockp(rq2));
 	else
 		__release(rq2->lock);
 }
@@ -2139,7 +2142,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 {
 	BUG_ON(!irqs_disabled());
 	BUG_ON(rq1 != rq2);
-	raw_spin_lock(&rq1->lock);
+	raw_spin_lock(rq_lockp(rq1));
 	__acquire(rq2->lock);	/* Fake it out ;) */
 }
 
@@ -2154,7 +2157,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq2->lock)
 {
 	BUG_ON(rq1 != rq2);
-	raw_spin_unlock(&rq1->lock);
+	raw_spin_unlock(rq_lockp(rq1));
 	__release(rq2->lock);
 }
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dfb64c08a407..991accc492d8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -442,7 +442,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	struct root_domain *old_rd = NULL;
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 
 	if (rq->rd) {
 		old_rd = rq->rd;
@@ -468,7 +468,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
 		set_rq_online(rq);
 
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 
 	if (old_rd)
 		call_rcu(&old_rd->rcu, free_rootdomain);
-- 
2.17.1



* [RFC PATCH 02/13] sched: Introduce sched_class::pick_task()
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
  2020-03-04 16:59 ` [RFC PATCH 01/13] sched: Wrap rq::lock access vpillai
@ 2020-03-04 16:59 ` vpillai
  2020-03-04 16:59 ` [RFC PATCH 03/13] sched: Core-wide rq->lock vpillai
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: vpillai @ 2020-03-04 16:59 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu,
	Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel,
	Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance()), it is not state invariant. This makes it unsuitable
for remote task selection.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---
 kernel/sched/deadline.c  | 16 ++++++++++++++--
 kernel/sched/fair.c      | 34 +++++++++++++++++++++++++++++++---
 kernel/sched/idle.c      |  6 ++++++
 kernel/sched/rt.c        | 14 ++++++++++++--
 kernel/sched/sched.h     |  3 +++
 kernel/sched/stop_task.c | 13 +++++++++++--
 6 files changed, 77 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ded147f84382..ee7fd8611ee4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1773,7 +1773,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *pick_next_task_dl(struct rq *rq)
+static struct task_struct *pick_task_dl(struct rq *rq)
 {
 	struct sched_dl_entity *dl_se;
 	struct dl_rq *dl_rq = &rq->dl;
@@ -1785,7 +1785,18 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 	p = dl_task_of(dl_se);
-	set_next_task_dl(rq, p, true);
+
+	return p;
+}
+
+static struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+	struct task_struct *p;
+
+	p = pick_task_dl(rq);
+	if (p)
+		set_next_task_dl(rq, p, true);
+
 	return p;
 }
 
@@ -2442,6 +2453,7 @@ const struct sched_class dl_sched_class = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_dl,
+	.pick_task		= pick_task_dl,
 	.select_task_rq		= select_task_rq_dl,
 	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3b218753bf7a..5eaaf0c4d9ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4228,7 +4228,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	 * Avoid running the skip buddy, if running something else can
 	 * be done without getting too unfair.
 	 */
-	if (cfs_rq->skip == se) {
+	if (cfs_rq->skip && cfs_rq->skip == se) {
 		struct sched_entity *second;
 
 		if (se == curr) {
@@ -4246,13 +4246,13 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	/*
 	 * Prefer last buddy, try to return the CPU to a preempted task.
 	 */
-	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+	if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
 		se = cfs_rq->last;
 
 	/*
 	 * Someone really wants this to run. If it's not unfair, run it.
 	 */
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+	if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
 		se = cfs_rq->next;
 
 	clear_buddies(cfs_rq, se);
@@ -6642,6 +6642,33 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		set_last_buddy(se);
 }
 
+static struct task_struct *pick_task_fair(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq = &rq->cfs;
+	struct sched_entity *se;
+
+	if (!cfs_rq->nr_running)
+		return NULL;
+
+	do {
+		struct sched_entity *curr = cfs_rq->curr;
+
+		se = pick_next_entity(cfs_rq, NULL);
+
+		if (curr) {
+			if (se && curr->on_rq)
+				update_curr(cfs_rq);
+
+			if (!se || entity_before(curr, se))
+				se = curr;
+		}
+
+		cfs_rq = group_cfs_rq(se);
+	} while (cfs_rq);
+
+	return task_of(se);
+}
+
 struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -10771,6 +10798,7 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_fair,
+	.pick_task		= pick_task_fair,
 	.select_task_rq		= select_task_rq_fair,
 	.migrate_task_rq	= migrate_task_rq_fair,
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f8653290de95..46c18e3dab13 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -397,6 +397,11 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 	schedstat_inc(rq->sched_goidle);
 }
 
+static struct task_struct *pick_task_idle(struct rq *rq)
+{
+	return rq->idle;
+}
+
 struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	struct task_struct *next = rq->idle;
@@ -469,6 +474,7 @@ const struct sched_class idle_sched_class = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_idle,
+	.pick_task		= pick_task_idle,
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index fc7d6706b209..d044baedc617 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1567,7 +1567,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	return rt_task_of(rt_se);
 }
 
-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct *pick_task_rt(struct rq *rq)
 {
 	struct task_struct *p;
 
@@ -1575,7 +1575,16 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
 		return NULL;
 
 	p = _pick_next_task_rt(rq);
-	set_next_task_rt(rq, p, true);
+
+	return p;
+}
+
+static struct task_struct *pick_next_task_rt(struct rq *rq)
+{
+	struct task_struct *p = pick_task_rt(rq);
+	if (p)
+		set_next_task_rt(rq, p, true);
+
 	return p;
 }
 
@@ -2368,6 +2377,7 @@ const struct sched_class rt_sched_class = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_rt,
+	.pick_task		= pick_task_rt,
 	.select_task_rq		= select_task_rq_rt,
 	.set_cpus_allowed       = set_cpus_allowed_common,
 	.rq_online              = rq_online_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a306008a12f7..a8335e3078ab 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1724,6 +1724,9 @@ struct sched_class {
 
 #ifdef CONFIG_SMP
 	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+
+	struct task_struct * (*pick_task)(struct rq *rq);
+
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 4c9e9975684f..0611348edb28 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -34,15 +34,23 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool fir
 	stop->se.exec_start = rq_clock_task(rq);
 }
 
-static struct task_struct *pick_next_task_stop(struct rq *rq)
+static struct task_struct *pick_task_stop(struct rq *rq)
 {
 	if (!sched_stop_runnable(rq))
 		return NULL;
 
-	set_next_task_stop(rq, rq->stop, true);
 	return rq->stop;
 }
 
+static struct task_struct *pick_next_task_stop(struct rq *rq)
+{
+	struct task_struct *p = pick_task_stop(rq);
+	if (p)
+		set_next_task_stop(rq, p, true);
+
+	return p;
+}
+
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
@@ -130,6 +138,7 @@ const struct sched_class stop_sched_class = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_stop,
+	.pick_task		= pick_task_stop,
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
-- 
2.17.1



* [RFC PATCH 03/13] sched: Core-wide rq->lock
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
  2020-03-04 16:59 ` [RFC PATCH 01/13] sched: Wrap rq::lock access vpillai
  2020-03-04 16:59 ` [RFC PATCH 02/13] sched: Introduce sched_class::pick_task() vpillai
@ 2020-03-04 16:59 ` vpillai
  2020-04-01 11:42   ` [PATCH] sched/arm64: store cpu topology before notify_cpu_starting Cheng Jian
                     ` (2 more replies)
  2020-03-04 16:59 ` [RFC PATCH 04/13] sched/fair: Add a few assertions vpillai
                   ` (17 subsequent siblings)
  20 siblings, 3 replies; 110+ messages in thread
From: vpillai @ 2020-03-04 16:59 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu,
	Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel,
	Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

Introduce the basic infrastructure to have a core-wide rq->lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
---
 kernel/Kconfig.preempt |   6 +++
 kernel/sched/core.c    | 113 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h   |  31 +++++++++++
 3 files changed, 148 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index bf82259cff96..577c288e81e5 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -80,3 +80,9 @@ config PREEMPT_COUNT
 config PREEMPTION
        bool
        select PREEMPT_COUNT
+
+config SCHED_CORE
+	bool
+	default y
+	depends on SCHED_SMT
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 28ba9b56dd8a..ba17ff8a8663 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -73,6 +73,70 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ *	spin_lock(rq_lockp(rq));
+ *	...
+ *	spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+	bool enabled = !!(unsigned long)data;
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		cpu_rq(cpu)->core_enabled = enabled;
+
+	return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+	// XXX verify there are no cookie tasks (yet)
+
+	static_branch_enable(&__sched_core_enabled);
+	stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+	// XXX verify there are no cookie tasks (left)
+
+	stop_machine(__sched_core_stopper, (void *)false, NULL);
+	static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!sched_core_count++)
+		__sched_core_enable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!--sched_core_count)
+		__sched_core_disable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
@@ -6400,8 +6464,15 @@ int sched_cpu_activate(unsigned int cpu)
 	/*
 	 * When going up, increment the number of cores with SMT present.
 	 */
-	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
 		static_branch_inc_cpuslocked(&sched_smt_present);
+#ifdef CONFIG_SCHED_CORE
+		if (static_branch_unlikely(&__sched_core_enabled)) {
+			rq->core_enabled = true;
+		}
+#endif
+	}
+
 #endif
 	set_cpu_active(cpu, true);
 
@@ -6447,8 +6518,16 @@ int sched_cpu_deactivate(unsigned int cpu)
 	/*
 	 * When going down, decrement the number of cores with SMT present.
 	 */
-	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
+#ifdef CONFIG_SCHED_CORE
+		struct rq *rq = cpu_rq(cpu);
+		if (static_branch_unlikely(&__sched_core_enabled)) {
+			rq->core_enabled = false;
+		}
+#endif
 		static_branch_dec_cpuslocked(&sched_smt_present);
+
+	}
 #endif
 
 	if (!sched_smp_initialized)
@@ -6473,6 +6552,28 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 
 int sched_cpu_starting(unsigned int cpu)
 {
+#ifdef CONFIG_SCHED_CORE
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	struct rq *rq, *core_rq = NULL;
+	int i;
+
+	for_each_cpu(i, smt_mask) {
+		rq = cpu_rq(i);
+		if (rq->core && rq->core == rq)
+			core_rq = rq;
+	}
+
+	if (!core_rq)
+		core_rq = cpu_rq(cpu);
+
+	for_each_cpu(i, smt_mask) {
+		rq = cpu_rq(i);
+
+		WARN_ON_ONCE(rq->core && rq->core != core_rq);
+		rq->core = core_rq;
+	}
+#endif /* CONFIG_SCHED_CORE */
+
 	sched_rq_cpu_starting(cpu);
 	sched_tick_start(cpu);
 	return 0;
@@ -6501,6 +6602,9 @@ int sched_cpu_dying(unsigned int cpu)
 	update_max_interval();
 	nohz_balance_exit_idle(rq);
 	hrtick_clear(rq);
+#ifdef CONFIG_SCHED_CORE
+	rq->core = NULL;
+#endif
 	return 0;
 }
 #endif
@@ -6695,6 +6799,11 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+		rq->core = NULL;
+		rq->core_enabled = 0;
+#endif
 	}
 
 	set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a8335e3078ab..a3941b2ee29e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -999,6 +999,12 @@ struct rq {
 	/* Must be inspected within a rcu lock section */
 	struct cpuidle_state	*idle_state;
 #endif
+
+#ifdef CONFIG_SCHED_CORE
+	/* per rq */
+	struct rq		*core;
+	unsigned int		core_enabled;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1026,11 +1032,36 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+#ifdef CONFIG_SCHED_CORE
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	if (sched_core_enabled(rq))
+		return &rq->core->__lock;
+
+	return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return false;
+}
+
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
 	return &rq->__lock;
 }
 
+#endif /* CONFIG_SCHED_CORE */
+
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
 
-- 
2.17.1



* [RFC PATCH 04/13] sched/fair: Add a few assertions
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (2 preceding siblings ...)
  2020-03-04 16:59 ` [RFC PATCH 03/13] sched: Core-wide rq->lock vpillai
@ 2020-03-04 16:59 ` vpillai
  2020-03-04 16:59 ` [RFC PATCH 05/13] sched: Basic tracking of matching tasks vpillai
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: vpillai @ 2020-03-04 16:59 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu,
	Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel

From: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eaaf0c4d9ad..cffc59a8b481 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5886,6 +5886,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
+	/*
+	 * per-cpu select_idle_mask usage
+	 */
+	lockdep_assert_irqs_disabled();
+
 	if (available_idle_cpu(target) || sched_idle_cpu(target))
 		return target;
 
@@ -6332,8 +6337,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
  * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
  *
  * Returns the target CPU number.
- *
- * preempt must be disabled.
  */
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
@@ -6344,6 +6347,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int want_affine = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	/*
+	 * required for stable ->cpus_allowed
+	 */
+	lockdep_assert_held(&p->pi_lock);
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
 
-- 
2.17.1



* [RFC PATCH 05/13] sched: Basic tracking of matching tasks
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (3 preceding siblings ...)
  2020-03-04 16:59 ` [RFC PATCH 04/13] sched/fair: Add a few assertions vpillai
@ 2020-03-04 16:59 ` vpillai
  2020-03-04 16:59 ` [RFC PATCH 06/13] sched: Update core scheduler queue when taking cpu online/offline vpillai
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: vpillai @ 2020-03-04 16:59 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu,
	Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel,
	Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled, core scheduling will only allow matching
tasks to be on the core, where idle matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled),
these tasks are indexed in a second RB-tree, first on cookie value and
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---
 include/linux/sched.h |   8 ++-
 kernel/sched/core.c   | 146 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   |  46 -------------
 kernel/sched/sched.h  |  55 ++++++++++++++++
 4 files changed, 208 insertions(+), 47 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 716ad1d8d95e..80ec54706282 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -680,10 +680,16 @@ struct task_struct {
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
+	struct sched_dl_entity		dl;
+
+#ifdef CONFIG_SCHED_CORE
+	struct rb_node			core_node;
+	unsigned long			core_cookie;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
-	struct sched_dl_entity		dl;
 
 #ifdef CONFIG_UCLAMP_TASK
 	/* Clamp values requested for a scheduling entity */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ba17ff8a8663..452ce5bb9321 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -77,6 +77,141 @@ int sysctl_sched_rt_runtime = 950000;
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+	if (p->sched_class == &stop_sched_class) /* trumps deadline */
+		return -2;
+
+	if (rt_prio(p->prio)) /* includes deadline */
+		return p->prio; /* [-1, 99] */
+
+	if (p->sched_class == &idle_sched_class)
+		return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+	return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b)  := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+{
+
+	int pa = __task_prio(a), pb = __task_prio(b);
+
+	if (-pa < -pb)
+		return true;
+
+	if (-pb < -pa)
+		return false;
+
+	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+		return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
+		u64 vruntime = b->se.vruntime;
+
+		/*
+		 * Normalize the vruntime if tasks are in different cpus.
+		 */
+		if (task_cpu(a) != task_cpu(b)) {
+			vruntime -= task_cfs_rq(b)->min_vruntime;
+			vruntime += task_cfs_rq(a)->min_vruntime;
+		}
+
+		return !((s64)(a->se.vruntime - vruntime) <= 0);
+	}
+
+	return false;
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
+{
+	if (a->core_cookie < b->core_cookie)
+		return true;
+
+	if (a->core_cookie > b->core_cookie)
+		return false;
+
+	/* flip prio, so high prio is leftmost */
+	if (prio_less(b, a))
+		return true;
+
+	return false;
+}
+
+static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+	struct rb_node *parent, **node;
+	struct task_struct *node_task;
+
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	node = &rq->core_tree.rb_node;
+	parent = *node;
+
+	while (*node) {
+		node_task = container_of(*node, struct task_struct, core_node);
+		parent = *node;
+
+		if (__sched_core_less(p, node_task))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&p->core_node, parent, node);
+	rb_insert_color(&p->core_node, &rq->core_tree);
+}
+
+static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	rb_erase(&p->core_node, &rq->core_tree);
+}
+
+/*
+ * Find left-most (aka, highest priority) task matching @cookie.
+ */
+static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
+{
+	struct rb_node *node = rq->core_tree.rb_node;
+	struct task_struct *node_task, *match;
+
+	/*
+	 * The idle task always matches any cookie!
+	 */
+	match = idle_sched_class.pick_task(rq);
+
+	while (node) {
+		node_task = container_of(node, struct task_struct, core_node);
+
+		if (cookie < node_task->core_cookie) {
+			node = node->rb_left;
+		} else if (cookie > node_task->core_cookie) {
+			node = node->rb_right;
+		} else {
+			match = node_task;
+			node = node->rb_left;
+		}
+	}
+
+	return match;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -135,6 +270,11 @@ void sched_core_put(void)
 	mutex_unlock(&sched_core_mutex);
 }
 
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -1354,6 +1494,9 @@ static inline void init_uclamp(void) { }
 
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_enqueue(rq, p);
+
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
@@ -1368,6 +1511,9 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 
 static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_dequeue(rq, p);
+
 	if (!(flags & DEQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cffc59a8b481..d6c932e8d554 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -247,33 +247,11 @@ const struct sched_class fair_sched_class;
  */
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	SCHED_WARN_ON(!entity_is_task(se));
-	return container_of(se, struct task_struct, se);
-}
 
 /* Walk up scheduling entities hierarchy */
 #define for_each_sched_entity(se) \
 		for (; se; se = se->parent)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return p->se.cfs_rq;
-}
-
-/* runqueue on which this entity is (to be) queued */
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	return se->cfs_rq;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return grp->my_q;
-}
-
 static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
 {
 	if (!path)
@@ -434,33 +412,9 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	return container_of(se, struct task_struct, se);
-}
-
 #define for_each_sched_entity(se) \
 		for (; se; se = NULL)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return &task_rq(p)->cfs;
-}
-
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	struct task_struct *p = task_of(se);
-	struct rq *rq = task_rq(p);
-
-	return &rq->cfs;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return NULL;
-}
-
 static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
 {
 	if (path)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a3941b2ee29e..a38ae770dfd6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1004,6 +1004,10 @@ struct rq {
 	/* per rq */
 	struct rq		*core;
 	unsigned int		core_enabled;
+	struct rb_root		core_tree;
+
+	/* shared state */
+	unsigned int		core_task_seq;
 #endif
 };
 
@@ -1083,6 +1087,57 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	SCHED_WARN_ON(!entity_is_task(se));
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return p->se.cfs_rq;
+}
+
+/* runqueue on which this entity is (to be) queued */
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	return se->cfs_rq;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return grp->my_q;
+}
+
+#else
+
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return &task_rq(p)->cfs;
+}
+
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	struct task_struct *p = task_of(se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->cfs;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return NULL;
+}
+#endif
+
 extern void update_rq_clock(struct rq *rq);
 
 static inline u64 __rq_clock_broken(struct rq *rq)
-- 
2.17.1



* [RFC PATCH 06/13] sched: Update core scheduler queue when taking cpu online/offline
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (4 preceding siblings ...)
  2020-03-04 16:59 ` [RFC PATCH 05/13] sched: Basic tracking of matching tasks vpillai
@ 2020-03-04 16:59 ` vpillai
  2020-03-04 16:59 ` [RFC PATCH 07/13] sched: Add core wide task selection and scheduling vpillai
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: vpillai @ 2020-03-04 16:59 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu,
	Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel

From: Tim Chen <tim.c.chen@linux.intel.com>

When we bring a CPU online and enable the core scheduler, tasks that
need core scheduling need to be placed in the core's core scheduling
queue. Likewise, when we take a CPU offline or disable core scheduling
on a core, tasks in the core's core scheduling queue need to be
removed. Without such mechanisms, the core scheduler can oops due to
inconsistent core scheduling state of a task.

Implement such enqueue and dequeue mechanisms according to a CPU's
change in core scheduling status. Switching the core scheduling mode
of a core, and enqueueing/dequeueing tasks on the core's queue due to
the mode change, have to run in a separate context, as they cannot be
done in the context taking the CPU online/offline.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c     | 156 ++++++++++++++++++++++++++++++++++++----
 kernel/sched/deadline.c |  35 +++++++++
 kernel/sched/fair.c     |  38 ++++++++++
 kernel/sched/rt.c       |  43 +++++++++++
 kernel/sched/sched.h    |   7 ++
 5 files changed, 264 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 452ce5bb9321..445f0d519336 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -75,6 +75,11 @@ int sysctl_sched_rt_runtime = 950000;
 
 #ifdef CONFIG_SCHED_CORE
 
+struct core_sched_cpu_work {
+	struct work_struct work;
+	cpumask_t smt_mask;
+};
+
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
 /* kernel prio, less is more */
@@ -183,6 +188,18 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 	rb_erase(&p->core_node, &rq->core_tree);
 }
 
+void sched_core_add(struct rq *rq, struct task_struct *p)
+{
+	if (p->core_cookie && task_on_rq_queued(p))
+		sched_core_enqueue(rq, p);
+}
+
+void sched_core_remove(struct rq *rq, struct task_struct *p)
+{
+	if (sched_core_enqueued(p))
+		sched_core_dequeue(rq, p);
+}
+
 /*
  * Find left-most (aka, highest priority) task matching @cookie.
  */
@@ -270,10 +287,132 @@ void sched_core_put(void)
 	mutex_unlock(&sched_core_mutex);
 }
 
+enum cpu_action {
+	CPU_ACTIVATE = 1,
+	CPU_DEACTIVATE = 2
+};
+
+static int __activate_cpu_core_sched(void *data);
+static int __deactivate_cpu_core_sched(void *data);
+static void core_sched_cpu_update(unsigned int cpu, enum cpu_action action);
+
+static int activate_cpu_core_sched(struct core_sched_cpu_work *work)
+{
+	if (static_branch_unlikely(&__sched_core_enabled))
+		stop_machine(__activate_cpu_core_sched, (void *) work, NULL);
+
+	return 0;
+}
+
+static int deactivate_cpu_core_sched(struct core_sched_cpu_work *work)
+{
+	if (static_branch_unlikely(&__sched_core_enabled))
+		stop_machine(__deactivate_cpu_core_sched, (void *) work, NULL);
+
+	return 0;
+}
+
+static void core_sched_cpu_activate_fn(struct work_struct *work)
+{
+	struct core_sched_cpu_work *cpu_work;
+
+	cpu_work = container_of(work, struct core_sched_cpu_work, work);
+	activate_cpu_core_sched(cpu_work);
+	kfree(cpu_work);
+}
+
+static void core_sched_cpu_deactivate_fn(struct work_struct *work)
+{
+	struct core_sched_cpu_work *cpu_work;
+
+	cpu_work = container_of(work, struct core_sched_cpu_work, work);
+	deactivate_cpu_core_sched(cpu_work);
+	kfree(cpu_work);
+}
+
+static void core_sched_cpu_update(unsigned int cpu, enum cpu_action action)
+{
+	struct core_sched_cpu_work *work;
+
+	work = kmalloc(sizeof(struct core_sched_cpu_work), GFP_ATOMIC);
+	if (!work)
+		return;
+
+	if (action == CPU_ACTIVATE)
+		INIT_WORK(&work->work, core_sched_cpu_activate_fn);
+	else
+		INIT_WORK(&work->work, core_sched_cpu_deactivate_fn);
+
+	cpumask_copy(&work->smt_mask, cpu_smt_mask(cpu));
+
+	queue_work(system_highpri_wq, &work->work);
+}
+
+static int __activate_cpu_core_sched(void *data)
+{
+	struct core_sched_cpu_work *work = (struct core_sched_cpu_work *) data;
+	struct rq *rq;
+	int i;
+
+	if (cpumask_weight(&work->smt_mask) < 2)
+		return 0;
+
+	for_each_cpu(i, &work->smt_mask) {
+		const struct sched_class *class;
+
+		rq = cpu_rq(i);
+
+		if (rq->core_enabled)
+			continue;
+
+		for_each_class(class) {
+			if (!class->core_sched_activate)
+				continue;
+
+			if (cpu_online(i))
+				class->core_sched_activate(rq);
+		}
+
+		rq->core_enabled = true;
+	}
+	return 0;
+}
+
+static int __deactivate_cpu_core_sched(void *data)
+{
+	struct core_sched_cpu_work *work = (struct core_sched_cpu_work *) data;
+	struct rq *rq;
+	int i;
+
+	if (cpumask_weight(&work->smt_mask) > 2)
+		return 0;
+
+	for_each_cpu(i, &work->smt_mask) {
+		const struct sched_class *class;
+
+		rq = cpu_rq(i);
+
+		if (!rq->core_enabled)
+			continue;
+
+		for_each_class(class) {
+			if (!class->core_sched_deactivate)
+				continue;
+
+			if (cpu_online(i))
+				class->core_sched_deactivate(cpu_rq(i));
+		}
+
+		rq->core_enabled = false;
+	}
+	return 0;
+}
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline void core_sched_cpu_update(unsigned int cpu, int action) { }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -6612,13 +6751,8 @@ int sched_cpu_activate(unsigned int cpu)
 	 */
 	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
 		static_branch_inc_cpuslocked(&sched_smt_present);
-#ifdef CONFIG_SCHED_CORE
-		if (static_branch_unlikely(&__sched_core_enabled)) {
-			rq->core_enabled = true;
-		}
-#endif
 	}
-
+	core_sched_cpu_update(cpu, CPU_ACTIVATE);
 #endif
 	set_cpu_active(cpu, true);
 
@@ -6665,15 +6799,10 @@ int sched_cpu_deactivate(unsigned int cpu)
 	 * When going down, decrement the number of cores with SMT present.
 	 */
 	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
-#ifdef CONFIG_SCHED_CORE
-		struct rq *rq = cpu_rq(cpu);
-		if (static_branch_unlikely(&__sched_core_enabled)) {
-			rq->core_enabled = false;
-		}
-#endif
 		static_branch_dec_cpuslocked(&sched_smt_present);
 
 	}
+	core_sched_cpu_update(cpu, CPU_DEACTIVATE);
 #endif
 
 	if (!sched_smp_initialized)
@@ -6748,9 +6877,6 @@ int sched_cpu_dying(unsigned int cpu)
 	update_max_interval();
 	nohz_balance_exit_idle(rq);
 	hrtick_clear(rq);
-#ifdef CONFIG_SCHED_CORE
-	rq->core = NULL;
-#endif
 	return 0;
 }
 #endif
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ee7fd8611ee4..e916bba0159c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1773,6 +1773,37 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
+static void for_each_dl_task(struct rq *rq,
+                             void (*fn)(struct rq *rq, struct task_struct *p))
+{
+	struct dl_rq *dl_rq = &rq->dl;
+	struct sched_dl_entity *dl_ent;
+	struct task_struct *task;
+	struct rb_node *rb_node;
+
+	rb_node = rb_first_cached(&dl_rq->root);
+	while (rb_node) {
+		dl_ent = rb_entry(rb_node, struct sched_dl_entity, rb_node);
+		task = dl_task_of(dl_ent);
+		fn(rq, task);
+		rb_node = rb_next(rb_node);
+	}
+}
+
+#ifdef CONFIG_SCHED_CORE
+
+static void core_sched_activate_dl(struct rq *rq)
+{
+	for_each_dl_task(rq, sched_core_add);
+}
+
+static void core_sched_deactivate_dl(struct rq *rq)
+{
+	for_each_dl_task(rq, sched_core_remove);
+}
+
+#endif
+
 static struct task_struct *pick_task_dl(struct rq *rq)
 {
 	struct sched_dl_entity *dl_se;
@@ -2460,6 +2491,10 @@ const struct sched_class dl_sched_class = {
 	.rq_online              = rq_online_dl,
 	.rq_offline             = rq_offline_dl,
 	.task_woken		= task_woken_dl,
+#ifdef CONFIG_SCHED_CORE
+	.core_sched_activate    = core_sched_activate_dl,
+	.core_sched_deactivate  = core_sched_deactivate_dl,
+#endif
 #endif
 
 	.task_tick		= task_tick_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d6c932e8d554..a9eeef896c78 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10249,6 +10249,40 @@ static void rq_offline_fair(struct rq *rq)
 	unthrottle_offline_cfs_rqs(rq);
 }
 
+static void for_each_fair_task(struct rq *rq,
+			       void (*fn)(struct rq *rq, struct task_struct *p))
+{
+	struct cfs_rq *cfs_rq, *pos;
+	struct sched_entity *se;
+	struct task_struct *task;
+
+	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+		for (se = __pick_first_entity(cfs_rq);
+		     se != NULL;
+		     se = __pick_next_entity(se)) {
+
+			if (!entity_is_task(se))
+				continue;
+
+			task = task_of(se);
+			fn(rq, task);
+		}
+	}
+}
+
+#ifdef CONFIG_SCHED_CORE
+
+static void core_sched_activate_fair(struct rq *rq)
+{
+	for_each_fair_task(rq, sched_core_add);
+}
+
+static void core_sched_deactivate_fair(struct rq *rq)
+{
+	for_each_fair_task(rq, sched_core_remove);
+}
+
+#endif
 #endif /* CONFIG_SMP */
 
 /*
@@ -10769,6 +10803,10 @@ const struct sched_class fair_sched_class = {
 
 	.task_dead		= task_dead_fair,
 	.set_cpus_allowed	= set_cpus_allowed_common,
+#ifdef CONFIG_SCHED_CORE
+	.core_sched_activate	= core_sched_activate_fair,
+	.core_sched_deactivate	= core_sched_deactivate_fair,
+#endif
 #endif
 
 	.task_tick		= task_tick_fair,
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d044baedc617..ccb585223fad 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1567,6 +1567,45 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	return rt_task_of(rt_se);
 }
 
+static void for_each_rt_task(struct rq *rq,
+			     void (*fn)(struct rq *rq, struct task_struct *p))
+{
+	rt_rq_iter_t iter;
+	struct rt_prio_array *array;
+	struct list_head *queue;
+	int i;
+	struct rt_rq *rt_rq = &rq->rt;
+	struct sched_rt_entity *rt_se = NULL;
+	struct task_struct *task;
+
+	for_each_rt_rq(rt_rq, iter, rq) {
+		array = &rt_rq->active;
+		for (i = 0; i < MAX_RT_PRIO; i++) {
+			queue = array->queue + i;
+			list_for_each_entry(rt_se, queue, run_list) {
+				if (rt_entity_is_task(rt_se)) {
+					task = rt_task_of(rt_se);
+					fn(rq, task);
+				}
+			}
+		}
+	}
+}
+
+#ifdef CONFIG_SCHED_CORE
+
+static void core_sched_activate_rt(struct rq *rq)
+{
+	for_each_rt_task(rq, sched_core_add);
+}
+
+static void core_sched_deactivate_rt(struct rq *rq)
+{
+	for_each_rt_task(rq, sched_core_remove);
+}
+
+#endif
+
 static struct task_struct *pick_task_rt(struct rq *rq)
 {
 	struct task_struct *p;
@@ -2384,6 +2423,10 @@ const struct sched_class rt_sched_class = {
 	.rq_offline             = rq_offline_rt,
 	.task_woken		= task_woken_rt,
 	.switched_from		= switched_from_rt,
+#ifdef CONFIG_SCHED_CORE
+	.core_sched_activate    = core_sched_activate_rt,
+	.core_sched_deactivate  = core_sched_deactivate_rt,
+#endif
 #endif
 
 	.task_tick		= task_tick_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a38ae770dfd6..03d502357599 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1052,6 +1052,9 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+void sched_core_add(struct rq *rq, struct task_struct *p);
+void sched_core_remove(struct rq *rq, struct task_struct *p);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1823,6 +1826,10 @@ struct sched_class {
 
 	void (*rq_online)(struct rq *rq);
 	void (*rq_offline)(struct rq *rq);
+#ifdef CONFIG_SCHED_CORE
+	void (*core_sched_activate)(struct rq *rq);
+	void (*core_sched_deactivate)(struct rq *rq);
+#endif
 #endif
 
 	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 110+ messages in thread

* [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (5 preceding siblings ...)
  2020-03-04 16:59 ` [RFC PATCH 06/13] sched: Update core scheduler queue when taking cpu online/offline vpillai
@ 2020-03-04 16:59 ` vpillai
  2020-04-14 13:35   ` Peter Zijlstra
                     ` (2 more replies)
  2020-03-04 16:59 ` [RFC PATCH 08/13] sched/fair: wrapper for cfs_rq->min_vruntime vpillai
                   ` (13 subsequent siblings)
  20 siblings, 3 replies; 110+ messages in thread
From: vpillai @ 2020-03-04 16:59 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu,
	Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel,
	Vineeth Remanan Pillai, Aaron Lu

From: Peter Zijlstra <peterz@infradead.org>

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective which logical
CPU does the reschedule).

There could be races in the core scheduler where a CPU is trying to
pick a task for its sibling in the core scheduler when that CPU has
just been offlined. We should not schedule any tasks on the CPU in
this case; return the idle task from pick_next_task() in this
situation.
NOTE: there is still potential for siblings rivalry.
NOTE: this is far too complicated; but thus far I've failed to
      simplify it further.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c  | 274 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/fair.c  |  40 +++++++
 kernel/sched/sched.h |   6 +-
 3 files changed, 318 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 445f0d519336..9a1bd236044e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4253,7 +4253,7 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -4309,6 +4309,273 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+	return is_idle_task(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+	if (is_idle_task(a) || is_idle_task(b))
+		return true;
+
+	return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+	struct task_struct *class_pick, *cookie_pick;
+	unsigned long cookie = rq->core->core_cookie;
+
+	class_pick = class->pick_task(rq);
+	if (!class_pick)
+		return NULL;
+
+	if (!cookie) {
+		/*
+		 * If class_pick is tagged, return it only if it has
+		 * higher priority than max.
+		 */
+		if (max && class_pick->core_cookie &&
+		    prio_less(class_pick, max))
+			return idle_sched_class.pick_task(rq);
+
+		return class_pick;
+	}
+
+	/*
+	 * If class_pick is idle or matches cookie, return early.
+	 */
+	if (cookie_equals(class_pick, cookie))
+		return class_pick;
+
+	cookie_pick = sched_core_find(rq, cookie);
+
+	/*
+	 * If class > max && class > cookie, it is the highest priority task on
+	 * the core (so far) and it must be selected, otherwise we must go with
+	 * the cookie pick in order to satisfy the constraint.
+	 */
+	if (prio_less(cookie_pick, class_pick) &&
+	    (!max || prio_less(max, class_pick)))
+		return class_pick;
+
+	return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *next, *max = NULL;
+	const struct sched_class *class;
+	const struct cpumask *smt_mask;
+	int i, j, cpu;
+	bool need_sync = false;
+
+	cpu = cpu_of(rq);
+	if (cpu_is_offline(cpu))
+		return idle_sched_class.pick_next_task(rq);
+
+	if (!sched_core_enabled(rq))
+		return __pick_next_task(rq, prev, rf);
+
+	/*
+	 * If there were no {en,de}queues since we picked (IOW, the task
+	 * pointers are all still valid), and we haven't scheduled the last
+	 * pick yet, do so now.
+	 */
+	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+	    rq->core->core_pick_seq != rq->core_sched_seq) {
+		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+		next = rq->core_pick;
+		if (next != prev) {
+			put_prev_task(rq, prev);
+			set_next_task(rq, next);
+		}
+		return next;
+	}
+
+	prev->sched_class->put_prev_task(rq, prev);
+	if (!rq->nr_running)
+		newidle_balance(rq, rf);
+
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+	 *
+	 * @task_seq guards the task state ({en,de}queues)
+	 * @pick_seq is the @task_seq we did a selection on
+	 * @sched_seq is the @pick_seq we scheduled
+	 *
+	 * However, preemptions can cause multiple picks on the same task set.
+	 * 'Fix' this by also increasing @task_seq for every pick.
+	 */
+	rq->core->core_task_seq++;
+	need_sync = !!rq->core->core_cookie;
+
+	/* reset state */
+	rq->core->core_cookie = 0UL;
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		rq_i->core_pick = NULL;
+
+		if (rq_i->core_forceidle) {
+			need_sync = true;
+			rq_i->core_forceidle = false;
+		}
+
+		if (i != cpu)
+			update_rq_clock(rq_i);
+	}
+
+	/*
+	 * Try and select tasks for each sibling in descending sched_class
+	 * order.
+	 */
+	for_each_class(class) {
+again:
+		for_each_cpu_wrap(i, smt_mask, cpu) {
+			struct rq *rq_i = cpu_rq(i);
+			struct task_struct *p;
+
+			if (cpu_is_offline(i)) {
+				rq_i->core_pick = rq_i->idle;
+				continue;
+			}
+
+			if (rq_i->core_pick)
+				continue;
+
+			/*
+			 * If this sibling doesn't yet have a suitable task to
+			 * run; ask for the most eligible task, given the
+			 * highest priority task already selected for this
+			 * core.
+			 */
+			p = pick_task(rq_i, class, max);
+			if (!p) {
+				/*
+				 * If there were no cookies, we don't need
+				 * to bother with the other siblings.
+				 */
+				if (i == cpu && !need_sync)
+					goto next_class;
+
+				continue;
+			}
+
+			/*
+			 * Optimize the 'normal' case where there aren't any
+			 * cookies and we don't need to sync up.
+			 */
+			if (i == cpu && !need_sync && !p->core_cookie) {
+				next = p;
+				goto done;
+			}
+
+			rq_i->core_pick = p;
+
+			/*
+			 * If this new candidate is of higher priority than the
+			 * previous; and they're incompatible; we need to wipe
+			 * the slate and start over. pick_task makes sure that
+			 * p's priority is more than max if it doesn't match
+			 * max's cookie.
+			 *
+			 * NOTE: this is a linear max-filter and is thus bounded
+			 * in execution time.
+			 */
+			if (!max || !cookie_match(max, p)) {
+				struct task_struct *old_max = max;
+
+				rq->core->core_cookie = p->core_cookie;
+				max = p;
+
+				if (old_max) {
+					for_each_cpu(j, smt_mask) {
+						if (j == i)
+							continue;
+
+						cpu_rq(j)->core_pick = NULL;
+					}
+					goto again;
+				} else {
+					/*
+					 * Once we select a task for a cpu, we
+					 * should not be doing an unconstrained
+					 * pick because it might starve a task
+					 * on a forced idle cpu.
+					 */
+					need_sync = true;
+				}
+
+			}
+		}
+next_class:;
+	}
+
+	rq->core->core_pick_seq = rq->core->core_task_seq;
+	next = rq->core_pick;
+	rq->core_sched_seq = rq->core->core_pick_seq;
+
+	/*
+	 * Reschedule siblings
+	 *
+	 * NOTE: L1TF -- at this point we're no longer running the old task and
+	 * sending an IPI (below) ensures the sibling will no longer be running
+	 * their task. This ensures there is no inter-sibling overlap between
+	 * non-matching user state.
+	 */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		if (cpu_is_offline(i))
+			continue;
+
+		WARN_ON_ONCE(!rq_i->core_pick);
+
+		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
+			rq_i->core_forceidle = true;
+
+		if (i == cpu)
+			continue;
+
+		if (rq_i->curr != rq_i->core_pick)
+			resched_curr(rq_i);
+
+		/* Did we break L1TF mitigation requirements? */
+		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+	}
+
+done:
+	set_next_task(rq, next);
+	return next;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return __pick_next_task(rq, prev, rf);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -7074,7 +7341,12 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = NULL;
+		rq->core_pick = NULL;
 		rq->core_enabled = 0;
+		rq->core_tree = RB_ROOT;
+		rq->core_forceidle = false;
+
+		rq->core_cookie = 0UL;
 #endif
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a9eeef896c78..8432de767730 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4080,6 +4080,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		update_min_vruntime(cfs_rq);
 }
 
+static inline bool
+__entity_slice_used(struct sched_entity *se)
+{
+	return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
+		sched_slice(cfs_rq_of(se), se);
+}
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -10285,6 +10292,34 @@ static void core_sched_deactivate_fair(struct rq *rq)
 #endif
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_SCHED_CORE
+/*
+ * If the runqueue has only one task, which has used up its slice, and
+ * the sibling is forced idle, then trigger a reschedule to give the
+ * forced idle task a chance to run.
+ */
+static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
+{
+	int cpu = cpu_of(rq), sibling_cpu;
+	if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
+		return;
+
+	for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
+		struct rq *sibling_rq;
+		if (sibling_cpu == cpu)
+			continue;
+		if (cpu_is_offline(sibling_cpu))
+			continue;
+
+		sibling_rq = cpu_rq(sibling_cpu);
+		if (sibling_rq->core_forceidle) {
+			resched_curr(sibling_rq);
+		}
+	}
+}
+#endif
+
+
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -10308,6 +10343,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
+
+#ifdef CONFIG_SCHED_CORE
+	if (sched_core_enabled(rq))
+		resched_forceidle_sibling(rq, &curr->se);
+#endif
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 03d502357599..a829e26fa43a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1003,11 +1003,16 @@ struct rq {
 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
 	struct rq		*core;
+	struct task_struct	*core_pick;
 	unsigned int		core_enabled;
+	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	bool			core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
+	unsigned int		core_pick_seq;
+	unsigned long		core_cookie;
 #endif
 };
 
@@ -1867,7 +1872,6 @@ static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != next);
 	next->sched_class->set_next_task(rq, next, false);
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 110+ messages in thread

* [RFC PATCH 08/13] sched/fair: wrapper for cfs_rq->min_vruntime
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (6 preceding siblings ...)
  2020-03-04 16:59 ` [RFC PATCH 07/13] sched: Add core wide task selection and scheduling vpillai
@ 2020-03-04 16:59 ` vpillai
  2020-03-04 16:59 ` [RFC PATCH 09/13] sched/fair: core wide vruntime comparison vpillai
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: vpillai @ 2020-03-04 16:59 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: Aaron Lu, linux-kernel, fweisbec, keescook, kerrnel, Phil Auld,
	Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel, Aaron Lu

From: Aaron Lu <aaron.lu@linux.alibaba.com>

Add a wrapper function cfs_rq_min_vruntime(cfs_rq) to
return cfs_rq->min_vruntime.

It will be used in the following patch; there is no functional
change.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/fair.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8432de767730..d99ea6ee7af2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -449,6 +449,11 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->min_vruntime;
+}
+
 static __always_inline
 void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
@@ -485,7 +490,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 	struct sched_entity *curr = cfs_rq->curr;
 	struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);
 
-	u64 vruntime = cfs_rq->min_vruntime;
+	u64 vruntime = cfs_rq_min_vruntime(cfs_rq);
 
 	if (curr) {
 		if (curr->on_rq)
@@ -505,7 +510,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 	}
 
 	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
@@ -3833,7 +3838,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
 static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHED_DEBUG
-	s64 d = se->vruntime - cfs_rq->min_vruntime;
+	s64 d = se->vruntime - cfs_rq_min_vruntime(cfs_rq);
 
 	if (d < 0)
 		d = -d;
@@ -3846,7 +3851,7 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
-	u64 vruntime = cfs_rq->min_vruntime;
+	u64 vruntime = cfs_rq_min_vruntime(cfs_rq);
 
 	/*
 	 * The 'current' period is already promised to the current tasks,
@@ -3939,7 +3944,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * update_curr().
 	 */
 	if (renorm && curr)
-		se->vruntime += cfs_rq->min_vruntime;
+		se->vruntime += cfs_rq_min_vruntime(cfs_rq);
 
 	update_curr(cfs_rq);
 
@@ -3950,7 +3955,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * fairness detriment of existing tasks.
 	 */
 	if (renorm && !curr)
-		se->vruntime += cfs_rq->min_vruntime;
+		se->vruntime += cfs_rq_min_vruntime(cfs_rq);
 
 	/*
 	 * When enqueuing a sched_entity, we must:
@@ -4063,7 +4068,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * can move min_vruntime forward still more.
 	 */
 	if (!(flags & DEQUEUE_SLEEP))
-		se->vruntime -= cfs_rq->min_vruntime;
+		se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
 
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
@@ -6396,7 +6401,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 			min_vruntime = cfs_rq->min_vruntime;
 		} while (min_vruntime != min_vruntime_copy);
 #else
-		min_vruntime = cfs_rq->min_vruntime;
+		min_vruntime = cfs_rq_min_vruntime(cfs_rq);
 #endif
 
 		se->vruntime -= min_vruntime;
@@ -10382,7 +10387,7 @@ static void task_fork_fair(struct task_struct *p)
 		resched_curr(rq);
 	}
 
-	se->vruntime -= cfs_rq->min_vruntime;
+	se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
 	rq_unlock(rq, &rf);
 }
 
@@ -10502,7 +10507,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
 		 * cause 'unlimited' sleep bonus.
 		 */
 		place_entity(cfs_rq, se, 0);
-		se->vruntime -= cfs_rq->min_vruntime;
+		se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
 	}
 
 	detach_entity_cfs_rq(se);
@@ -10516,7 +10521,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
 	attach_entity_cfs_rq(se);
 
 	if (!vruntime_normalized(p))
-		se->vruntime += cfs_rq->min_vruntime;
+		se->vruntime += cfs_rq_min_vruntime(cfs_rq);
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 110+ messages in thread

* [RFC PATCH 09/13] sched/fair: core wide vruntime comparison
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (7 preceding siblings ...)
  2020-03-04 16:59 ` [RFC PATCH 08/13] sched/fair: wrapper for cfs_rq->min_vruntime vpillai
@ 2020-03-04 16:59 ` vpillai
  2020-04-14 13:56   ` Peter Zijlstra
  2020-03-04 17:00 ` [RFC PATCH 10/13] sched: Trivial forced-newidle balancer vpillai
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 110+ messages in thread
From: vpillai @ 2020-03-04 16:59 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: Aaron Lu, linux-kernel, fweisbec, keescook, kerrnel, Phil Auld,
	Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel, Aaron Lu

From: Aaron Lu <aaron.lu@linux.alibaba.com>

This patch provides a vruntime based way to compare two cfs tasks'
priority, be it on the same cpu or on different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common
cfs_rq both sched_entities are on and then do the comparison.

When the two tasks are on different threads of the same core, the root
level sched_entities to which the two tasks belong will be used to do
the comparison.

An ugly illustration for the cross CPU case:

   cpu0         cpu1
 /   |  \     /   |  \
se1 se2 se3  se4 se5 se6
    /  \            /   \
  se21 se22       se61  se62

Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
task B's se is se61. To compare priority of task A and B, we compare
priority of se2 and se6. Whose vruntime is smaller, who wins.

To make this work, the root level se should have a common cfs_rq min
vruntime, which I call the core cfs_rq min vruntime.

When we adjust the min_vruntime of rq->core, we need to propagate
that down the tree so as to not cause starvation of existing tasks
based on previous vruntime.
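The illustration above can be modeled directly. The following is a
hypothetical Python sketch (not kernel code; the dict-based entities
are illustrative stand-ins for sched_entity): each task's entity is
walked up to its root-level entity, and with a shared core min
vruntime the two root vruntimes are directly comparable, smaller
vruntime winning.

```python
# Model of the cross-CPU comparison from the illustration: to compare
# task A (se21, under se2 on cpu0) with task B (se61, under se6 on
# cpu1), compare the root-level entities se2 and se6.
def root_se_vruntime(se):
    # Walk up to the root-level sched_entity (se2/se6 above).
    while se["parent"] is not None:
        se = se["parent"]
    return se["vruntime"]

def root_vruntime_less(task_se_a, task_se_b):
    # With a common core cfs_rq min vruntime, the root-level
    # vruntimes compare directly; smaller vruntime wins.
    return root_se_vruntime(task_se_a) < root_se_vruntime(task_se_b)

se2 = {"vruntime": 100, "parent": None}
se21 = {"vruntime": 40, "parent": se2}    # task A
se6 = {"vruntime": 120, "parent": None}
se61 = {"vruntime": 10, "parent": se6}    # task B
print(root_vruntime_less(se21, se61))     # True: se2 (100) < se6 (120)
```

Note that task B's own vruntime (10) is smaller than task A's (40),
yet A wins: only the root-level entities matter across siblings, which
is why the shared core min vruntime (and its downward propagation) is
required.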

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/core.c  | 15 +------
 kernel/sched/fair.c  | 99 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  2 +
 3 files changed, 102 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a1bd236044e..556bf054b896 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -119,19 +119,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d99ea6ee7af2..1c9a80d8dbb8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -449,9 +449,105 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->core->cfs;
+}
+
 static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->min_vruntime;
+	if (!sched_core_enabled(rq_of(cfs_rq)))
+		return cfs_rq->min_vruntime;
+
+	if (is_root_cfs_rq(cfs_rq))
+		return core_cfs_rq(cfs_rq)->min_vruntime;
+	else
+		return cfs_rq->min_vruntime;
+}
+
+static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
+{
+	struct sched_entity *se, *next;
+
+	if (!cfs_rq)
+		return;
+
+	cfs_rq->min_vruntime -= delta;
+	rbtree_postorder_for_each_entry_safe(se, next,
+			&cfs_rq->tasks_timeline.rb_root, run_node) {
+		if (se->vruntime > delta)
+			se->vruntime -= delta;
+		if (se->my_q)
+			coresched_adjust_vruntime(se->my_q, delta);
+	}
+}
+
+static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
+{
+	struct cfs_rq *cfs_rq_core;
+
+	if (!sched_core_enabled(rq_of(cfs_rq)))
+		return;
+
+	if (!is_root_cfs_rq(cfs_rq))
+		return;
+
+	cfs_rq_core = core_cfs_rq(cfs_rq);
+	if (cfs_rq_core != cfs_rq &&
+	    cfs_rq->min_vruntime < cfs_rq_core->min_vruntime) {
+		u64 delta = cfs_rq_core->min_vruntime - cfs_rq->min_vruntime;
+		coresched_adjust_vruntime(cfs_rq_core, delta);
+	}
+}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	bool samecpu = task_cpu(a) == task_cpu(b);
+	struct task_struct *p;
+	s64 delta;
+
+	if (samecpu) {
+		/* vruntime is per cfs_rq */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+		goto out;
+	}
+
+	/* crosscpu: compare root level se's vruntime to decide priority */
+	while (sea->parent)
+		sea = sea->parent;
+	while (seb->parent)
+		seb = seb->parent;
+	delta = (s64)(sea->vruntime - seb->vruntime);
+
+out:
+	p = delta > 0 ? b : a;
+	trace_printk("picked %s/%d %s: %Ld %Ld %Ld\n", p->comm, p->pid,
+			samecpu ? "samecpu" : "crosscpu",
+			sea->vruntime, seb->vruntime, delta);
+
+	return delta > 0;
 }
 
 static __always_inline
@@ -511,6 +607,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 
 	/* ensure we never gain time by being placed backwards. */
 	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
+	update_core_cfs_rq_min_vruntime(cfs_rq);
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a829e26fa43a..ef9e08e5da6a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2561,6 +2561,8 @@ static inline bool sched_energy_enabled(void) { return false; }
 
 #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
 
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+
 #ifdef CONFIG_MEMBARRIER
 /*
  * The scheduler provides memory barriers required by membarrier between:
-- 
2.17.1


^ permalink raw reply	[flat|nested] 110+ messages in thread

* [RFC PATCH 10/13] sched: Trivial forced-newidle balancer
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (8 preceding siblings ...)
  2020-03-04 16:59 ` [RFC PATCH 09/13] sched/fair: core wide vruntime comparison vpillai
@ 2020-03-04 17:00 ` vpillai
  2020-03-04 17:00 ` [RFC PATCH 11/13] sched: migration changes for core scheduling vpillai
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: vpillai @ 2020-03-04 17:00 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu,
	Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel

From: Peter Zijlstra <peterz@infradead.org>

When a sibling is forced idle to match the core cookie, search for
matching tasks to fill the core.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 131 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/idle.c   |   1 +
 kernel/sched/sched.h  |   6 ++
 4 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 80ec54706282..c9406a5b678f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -685,6 +685,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
 	struct rb_node			core_node;
 	unsigned long			core_cookie;
+	unsigned int			core_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 556bf054b896..18ee8e10a171 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -218,6 +218,21 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
 	return match;
 }
 
+static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
+{
+	struct rb_node *node = &p->core_node;
+
+	node = rb_next(node);
+	if (!node)
+		return NULL;
+
+	p = container_of(node, struct task_struct, core_node);
+	if (p->core_cookie != cookie)
+		return NULL;
+
+	return p;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -4369,7 +4384,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *next, *max = NULL;
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
-	int i, j, cpu;
+	int i, j, cpu, occ = 0;
 	bool need_sync = false;
 
 	cpu = cpu_of(rq);
@@ -4476,6 +4491,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				goto done;
 			}
 
+			if (!is_idle_task(p))
+				occ++;
+
 			rq_i->core_pick = p;
 
 			/*
@@ -4501,6 +4519,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 						cpu_rq(j)->core_pick = NULL;
 					}
+					occ = 1;
 					goto again;
 				} else {
 					/*
@@ -4540,6 +4559,8 @@ next_class:;
 		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
 			rq_i->core_forceidle = true;
 
+		rq_i->core_pick->core_occupation = occ;
+
 		if (i == cpu)
 			continue;
 
@@ -4555,6 +4576,114 @@ next_class:;
 	return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+	struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+	struct task_struct *p;
+	unsigned long cookie;
+	bool success = false;
+
+	local_irq_disable();
+	double_rq_lock(dst, src);
+
+	cookie = dst->core->core_cookie;
+	if (!cookie)
+		goto unlock;
+
+	if (dst->curr != dst->idle)
+		goto unlock;
+
+	p = sched_core_find(src, cookie);
+	if (p == src->idle)
+		goto unlock;
+
+	do {
+		if (p == src->core_pick || p == src->curr)
+			goto next;
+
+		if (!cpumask_test_cpu(this, &p->cpus_mask))
+			goto next;
+
+		if (p->core_occupation > dst->idle->core_occupation)
+			goto next;
+
+		p->on_rq = TASK_ON_RQ_MIGRATING;
+		deactivate_task(src, p, 0);
+		set_task_cpu(p, this);
+		activate_task(dst, p, 0);
+		p->on_rq = TASK_ON_RQ_QUEUED;
+
+		resched_curr(dst);
+
+		success = true;
+		break;
+
+next:
+		p = sched_core_next(p, cookie);
+	} while (p);
+
+unlock:
+	double_rq_unlock(dst, src);
+	local_irq_enable();
+
+	return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+	int i;
+
+	for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+		if (i == cpu)
+			continue;
+
+		if (need_resched())
+			break;
+
+		if (try_steal_cookie(cpu, i))
+			return true;
+	}
+
+	return false;
+}
+
+static void sched_core_balance(struct rq *rq)
+{
+	struct sched_domain *sd;
+	int cpu = cpu_of(rq);
+
+	rcu_read_lock();
+	raw_spin_unlock_irq(rq_lockp(rq));
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & SD_LOAD_BALANCE))
+			break;
+
+		if (need_resched())
+			break;
+
+		if (steal_cookie_task(cpu, sd))
+			break;
+	}
+	raw_spin_lock_irq(rq_lockp(rq));
+	rcu_read_unlock();
+}
+
+static DEFINE_PER_CPU(struct callback_head, core_balance_head);
+
+void queue_core_balance(struct rq *rq)
+{
+	if (!sched_core_enabled(rq))
+		return;
+
+	if (!rq->core->core_cookie)
+		return;
+
+	if (!rq->nr_running) /* not forced idle */
+		return;
+
+	queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
+}
+
 #else /* !CONFIG_SCHED_CORE */
 
 static struct task_struct *
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 46c18e3dab13..b2f08431f0f1 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -395,6 +395,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
+	queue_core_balance(rq);
 }
 
 static struct task_struct *pick_task_idle(struct rq *rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef9e08e5da6a..552c80b70757 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1057,6 +1057,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+extern void queue_core_balance(struct rq *rq);
+
 void sched_core_add(struct rq *rq, struct task_struct *p);
 void sched_core_remove(struct rq *rq, struct task_struct *p);
 
@@ -1072,6 +1074,10 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+static inline void queue_core_balance(struct rq *rq)
+{
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.17.1


^ permalink raw reply	[flat|nested] 110+ messages in thread

* [RFC PATCH 11/13] sched: migration changes for core scheduling
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (9 preceding siblings ...)
  2020-03-04 17:00 ` [RFC PATCH 10/13] sched: Trivial forced-newidle balancer vpillai
@ 2020-03-04 17:00 ` vpillai
  2020-06-12 13:21   ` Joel Fernandes
  2020-03-04 17:00 ` [RFC PATCH 12/13] sched: cgroup tagging interface " vpillai
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 110+ messages in thread
From: vpillai @ 2020-03-04 17:00 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: Aubrey Li, linux-kernel, fweisbec, keescook, kerrnel, Phil Auld,
	Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel,
	Vineeth Remanan Pillai

From: Aubrey Li <aubrey.li@intel.com>

 - Don't migrate if there is a cookie mismatch
     Load balancing tries to move a task from the busiest CPU to the
     destination CPU. When core scheduling is enabled, if the
     task's cookie does not match the destination CPU's
     core cookie, the task will be skipped on this CPU. This
     mitigates the forced idle time on the destination CPU.

 - Select cookie matched idle CPU
     In the fast path of task wakeup, select the first cookie matched
     idle CPU instead of the first idle CPU.

 - Find cookie matched idlest CPU
     In the slow path of task wakeup, find the idlest CPU whose core
     cookie matches the task's cookie.

 - Don't migrate task if cookie does not match
     For NUMA load balancing, don't migrate a task to a CPU whose
     core cookie does not match the task's cookie.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
---
 kernel/sched/fair.c  | 55 +++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h | 29 +++++++++++++++++++++++
 2 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c9a80d8dbb8..f42ceecb749f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1789,6 +1789,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
 			continue;
 
+#ifdef CONFIG_SCHED_CORE
+		/*
+		 * Skip this cpu if source task's cookie does not match
+		 * with CPU's core cookie.
+		 */
+		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+			continue;
+#endif
+
 		env->dst_cpu = cpu;
 		task_numa_compare(env, taskimp, groupimp, maymove);
 	}
@@ -5660,8 +5669,13 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		struct rq *rq = cpu_rq(i);
+
+#ifdef CONFIG_SCHED_CORE
+		if (!sched_core_cookie_match(rq, p))
+			continue;
+#endif
 		if (available_idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
@@ -5927,8 +5941,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			return si_cpu;
 		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 			continue;
+#ifdef CONFIG_SCHED_CORE
+		if (available_idle_cpu(cpu) &&
+		    sched_core_cookie_match(cpu_rq(cpu), p))
+			break;
+#else
 		if (available_idle_cpu(cpu))
 			break;
+#endif
 		if (si_cpu == -1 && sched_idle_cpu(cpu))
 			si_cpu = cpu;
 	}
@@ -7264,8 +7284,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
 	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 3) task's cookie does not match with this CPU's core cookie
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
@@ -7300,6 +7321,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+#ifdef CONFIG_SCHED_CORE
+	/*
+	 * Don't migrate task if the task's cookie does not match
+	 * with the destination CPU's core cookie.
+	 */
+	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+		return 0;
+#endif
+
 	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
@@ -8498,6 +8528,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 					p->cpus_ptr))
 			continue;
 
+#ifdef CONFIG_SCHED_CORE
+		if (sched_core_enabled(cpu_rq(this_cpu))) {
+			int i = 0;
+			bool cookie_match = false;
+
+			for_each_cpu(i, sched_group_span(group)) {
+				struct rq *rq = cpu_rq(i);
+
+				if (sched_core_cookie_match(rq, p)) {
+					cookie_match = true;
+					break;
+				}
+			}
+			/* Skip over this group if no cookie matched */
+			if (!cookie_match)
+				continue;
+		}
+#endif
+
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 552c80b70757..e4019a482f0e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1057,6 +1057,35 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+/*
+ * Helper to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	bool idle_core = true;
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+		if (!available_idle_cpu(cpu)) {
+			idle_core = false;
+			break;
+		}
+	}
+
+	/*
+	 * A CPU in an idle core is always the best choice for tasks with
+	 * cookies.
+	 */
+	return idle_core || rq->core->core_cookie == p->core_cookie;
+}
+
 extern void queue_core_balance(struct rq *rq);
 
 void sched_core_add(struct rq *rq, struct task_struct *p);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 110+ messages in thread

* [RFC PATCH 12/13] sched: cgroup tagging interface for core scheduling
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (10 preceding siblings ...)
  2020-03-04 17:00 ` [RFC PATCH 11/13] sched: migration changes for core scheduling vpillai
@ 2020-03-04 17:00 ` vpillai
  2020-06-26 15:06   ` Vineeth Remanan Pillai
  2020-03-04 17:00 ` [RFC PATCH 13/13] sched: Debug bits vpillai
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 110+ messages in thread
From: vpillai @ 2020-03-04 17:00 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu,
	Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel,
	Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also, after a task is forked, its presence in the core scheduler queue
will need to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent
a new task from sneaking into the cgroup and being missed by the update
while we iterate through all the tasks in the cgroup.  A more complicated
scheme could probably avoid the stop machine.  Such a scheme would also
need to resolve inconsistency between a task's cgroup core scheduling
tag and its residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now that avoids
such complications.

Core scheduling has extra overhead.  Enable it only for cores with
more than one SMT hardware thread.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
---
 kernel/sched/core.c  | 186 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |   4 +
 2 files changed, 184 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 18ee8e10a171..11e5a2a494ac 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -140,6 +140,37 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
 	return false;
 }
 
+static bool sched_core_empty(struct rq *rq)
+{
+	return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+	return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+	struct task_struct *task;
+
+	task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+	return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *task;
+
+	while (!sched_core_empty(rq)) {
+		task = sched_core_first(rq);
+		rb_erase(&task->core_node, &rq->core_tree);
+		RB_CLEAR_NODE(&task->core_node);
+	}
+	rq->core->core_task_seq++;
+}
+
 static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct rb_node *parent, **node;
@@ -171,10 +202,11 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
+	if (!sched_core_enqueued(p))
 		return;
 
 	rb_erase(&p->core_node, &rq->core_tree);
+	RB_CLEAR_NODE(&p->core_node);
 }
 
 void sched_core_add(struct rq *rq, struct task_struct *p)
@@ -250,8 +282,22 @@ static int __sched_core_stopper(void *data)
 	bool enabled = !!(unsigned long)data;
 	int cpu;
 
-	for_each_online_cpu(cpu)
-		cpu_rq(cpu)->core_enabled = enabled;
+	if (!enabled) {
+		for_each_online_cpu(cpu) {
+		/*
+		 * All active and migrating tasks will have already been removed
+		 * from core queue when we clear the cgroup tags.
+		 * However, dying tasks could still be left in core queue.
+		 * Flush them here.
+		 */
+			sched_core_flush(cpu);
+		}
+	}
+
+	for_each_online_cpu(cpu) {
+		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
+			cpu_rq(cpu)->core_enabled = enabled;
+	}
 
 	return 0;
 }
@@ -261,7 +307,11 @@ static int sched_core_count;
 
 static void __sched_core_enable(void)
 {
-	// XXX verify there are no cookie tasks (yet)
+	int cpu;
+
+	/* verify there are no cookie tasks (yet) */
+	for_each_online_cpu(cpu)
+		BUG_ON(!sched_core_empty(cpu_rq(cpu)));
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -269,8 +319,6 @@ static void __sched_core_enable(void)
 
 static void __sched_core_disable(void)
 {
-	// XXX verify there are no cookie tasks (left)
-
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
 }
@@ -416,6 +464,7 @@ static int __deactivate_cpu_core_sched(void *data)
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static bool sched_core_enqueued(struct task_struct *task) { return false; }
 static inline void core_sched_cpu_update(unsigned int cpu, int action) { }
 
 #endif /* CONFIG_SCHED_CORE */
@@ -3268,6 +3317,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&p->core_node);
 #endif
 	return 0;
 }
@@ -6819,6 +6871,9 @@ void init_idle(struct task_struct *idle, int cpu)
 #ifdef CONFIG_SMP
 	sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
 #endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&idle->core_node);
+#endif
 }
 
 #ifdef CONFIG_SMP
@@ -7796,6 +7851,15 @@ static void sched_change_group(struct task_struct *tsk, int type)
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+	if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
+		tsk->core_cookie = 0UL;
+
+	if (tg->tagged /* && !tsk->core_cookie ? */)
+		tsk->core_cookie = (unsigned long)tg;
+#endif
+
 	tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -7881,6 +7945,18 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+#ifdef CONFIG_SCHED_CORE
+	struct task_group *tg = css_tg(css);
+
+	if (tg->tagged) {
+		sched_core_put();
+		tg->tagged = 0;
+	}
+#endif
+}
+
 static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
@@ -7910,7 +7986,12 @@ static void cpu_cgroup_fork(struct task_struct *task)
 	rq = task_rq_lock(task, &rf);
 
 	update_rq_clock(rq);
+	if (sched_core_enqueued(task))
+		sched_core_dequeue(rq, task);
 	sched_change_group(task, TASK_SET_GROUP);
+	if (sched_core_enabled(rq) && task_on_rq_queued(task) &&
+	    task->core_cookie)
+		sched_core_enqueue(rq, task);
 
 	task_rq_unlock(rq, task, &rf);
 }
@@ -8436,6 +8517,82 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_CORE
+static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return !!tg->tagged;
+}
+
+struct write_core_tag {
+	struct cgroup_subsys_state *css;
+	int val;
+};
+
+static int __sched_write_tag(void *data)
+{
+	struct write_core_tag *tag = (struct write_core_tag *) data;
+	struct cgroup_subsys_state *css = tag->css;
+	int val = tag->val;
+	struct task_group *tg = css_tg(tag->css);
+	struct css_task_iter it;
+	struct task_struct *p;
+
+	tg->tagged = !!val;
+
+	css_task_iter_start(css, 0, &it);
+	/*
+	 * Note: css_task_iter_next will skip dying tasks.
+	 * There could still be dying tasks left in the core queue
+	 * when we set cgroup tag to 0 when the loop is done below.
+	 */
+	while ((p = css_task_iter_next(&it))) {
+		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+
+		if (sched_core_enqueued(p)) {
+			sched_core_dequeue(task_rq(p), p);
+			if (!p->core_cookie)
+				continue;
+		}
+
+		if (sched_core_enabled(task_rq(p)) &&
+		    p->core_cookie && task_on_rq_queued(p))
+			sched_core_enqueue(task_rq(p), p);
+
+	}
+	css_task_iter_end(&it);
+
+	return 0;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct write_core_tag wtag;
+
+	if (val > 1)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	if (tg->tagged == !!val)
+		return 0;
+
+	if (!!val)
+		sched_core_get();
+
+	wtag.css = css;
+	wtag.val = val;
+	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
+	if (!val)
+		sched_core_put();
+
+	return 0;
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -8472,6 +8629,14 @@ static struct cftype cpu_legacy_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
+#endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	{
 		.name = "uclamp.min",
@@ -8645,6 +8810,14 @@ static struct cftype cpu_files[] = {
 		.write_s64 = cpu_weight_nice_write_s64,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
 		.name = "max",
@@ -8673,6 +8846,7 @@ static struct cftype cpu_files[] = {
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_online	= cpu_cgroup_css_online,
+	.css_offline	= cpu_cgroup_css_offline,
 	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
 	.css_extra_stat_show = cpu_extra_stat_show,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e4019a482f0e..2079654b5c87 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -355,6 +355,10 @@ struct cfs_bandwidth {
 struct task_group {
 	struct cgroup_subsys_state css;
 
+#ifdef CONFIG_SCHED_CORE
+	int			tagged;
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* schedulable entities of this group on each CPU */
 	struct sched_entity	**se;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 110+ messages in thread

* [RFC PATCH 13/13] sched: Debug bits...
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (11 preceding siblings ...)
  2020-03-04 17:00 ` [RFC PATCH 12/13] sched: cgroup tagging interface " vpillai
@ 2020-03-04 17:00 ` vpillai
  2020-03-04 17:36 ` [RFC PATCH 00/13] Core scheduling v5 Tim Chen
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: vpillai @ 2020-03-04 17:00 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu,
	Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel

From: Peter Zijlstra <peterz@infradead.org>

Not-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 11e5a2a494ac..a01df3e0b11e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -110,6 +110,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 
 	int pa = __task_prio(a), pb = __task_prio(b);
 
+	trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+		     a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+		     b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
 	if (-pa < -pb)
 		return true;
 
@@ -315,12 +319,16 @@ static void __sched_core_enable(void)
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+	printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
+
+	printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -4460,6 +4468,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			put_prev_task(rq, prev);
 			set_next_task(rq, next);
 		}
+
+		trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+			     rq->core->core_task_seq,
+			     rq->core->core_pick_seq,
+			     rq->core_sched_seq,
+			     next->comm, next->pid,
+			     next->core_cookie);
+
 		return next;
 	}
 
@@ -4540,6 +4556,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 */
 			if (i == cpu && !need_sync && !p->core_cookie) {
 				next = p;
+				trace_printk("unconstrained pick: %s/%d %lx\n",
+					     next->comm, next->pid, next->core_cookie);
+
 				goto done;
 			}
 
@@ -4548,6 +4567,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 			rq_i->core_pick = p;
 
+			trace_printk("cpu(%d): selected: %s/%d %lx\n",
+				     i, p->comm, p->pid, p->core_cookie);
+
 			/*
 			 * If this new candidate is of higher priority than the
 			 * previous; and they're incompatible; we need to wipe
@@ -4564,6 +4586,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				rq->core->core_cookie = p->core_cookie;
 				max = p;
 
+				trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);
+
 				if (old_max) {
 					for_each_cpu(j, smt_mask) {
 						if (j == i)
@@ -4591,6 +4615,7 @@ next_class:;
 	rq->core->core_pick_seq = rq->core->core_task_seq;
 	next = rq->core_pick;
 	rq->core_sched_seq = rq->core->core_pick_seq;
+	trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
 
 	/*
 	 * Reschedule siblings
@@ -4616,11 +4641,20 @@ next_class:;
 		if (i == cpu)
 			continue;
 
-		if (rq_i->curr != rq_i->core_pick)
+		if (rq_i->curr != rq_i->core_pick) {
+			trace_printk("IPI(%d)\n", i);
 			resched_curr(rq_i);
+		}
 
 		/* Did we break L1TF mitigation requirements? */
-		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+		if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+			trace_printk("[%d]: cookie mismatch. %s/%d/0x%lx/0x%lx\n",
+				     rq_i->cpu, rq_i->core_pick->comm,
+				     rq_i->core_pick->pid,
+				     rq_i->core_pick->core_cookie,
+				     rq_i->core->core_cookie);
+			WARN_ON_ONCE(1);
+		}
 	}
 
 done:
@@ -4659,6 +4693,10 @@ static bool try_steal_cookie(int this, int that)
 		if (p->core_occupation > dst->idle->core_occupation)
 			goto next;
 
+		trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+			     p->comm, p->pid, that, this,
+			     p->core_occupation, dst->idle->core_occupation, cookie);
+
 		p->on_rq = TASK_ON_RQ_MIGRATING;
 		deactivate_task(src, p, 0);
 		set_task_cpu(p, this);
@@ -7287,6 +7325,8 @@ int sched_cpu_starting(unsigned int cpu)
 		WARN_ON_ONCE(rq->core && rq->core != core_rq);
 		rq->core = core_rq;
 	}
+
+	printk("core: %d -> %d\n", cpu, cpu_of(core_rq));
 #endif /* CONFIG_SCHED_CORE */
 
 	sched_rq_cpu_starting(cpu);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (12 preceding siblings ...)
  2020-03-04 17:00 ` [RFC PATCH 13/13] sched: Debug bits vpillai
@ 2020-03-04 17:36 ` Tim Chen
  2020-03-04 17:42   ` Vineeth Remanan Pillai
  2020-04-14 14:21 ` Peter Zijlstra
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 110+ messages in thread
From: Tim Chen @ 2020-03-04 17:36 UTC (permalink / raw)
  To: vpillai, Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu,
	Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel

On 3/4/20 8:59 AM, vpillai wrote:

> 
> ISSUES
> ------
> - Aaron(Intel) found an issue with load balancing when the tasks have

Just to set the record straight, Aaron works at Alibaba.

>   different weights(nice or cgroup shares). Task weight is not considered
>   in coresched aware load balancing and causes those higher weights task
>   to starve.

Thanks.

Tim

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-03-04 17:36 ` [RFC PATCH 00/13] Core scheduling v5 Tim Chen
@ 2020-03-04 17:42   ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-03-04 17:42 UTC (permalink / raw)
  To: Tim Chen
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li, Li, Aubrey,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Joel Fernandes, Joel Fernandes

>
> Just to set the record straight, Aaron works at Alibaba.
>
Sorry about this. Thanks for the correction.

~Vineeth

^ permalink raw reply	[flat|nested] 110+ messages in thread

* [PATCH] sched/arm64: store cpu topology before notify_cpu_starting
  2020-03-04 16:59 ` [RFC PATCH 03/13] sched: Core-wide rq->lock vpillai
@ 2020-04-01 11:42   ` Cheng Jian
  2020-04-01 13:23     ` Valentin Schneider
  2020-04-09 17:54     ` Joel Fernandes
  2020-04-14 11:36   ` [RFC PATCH 03/13] sched: Core-wide rq->lock Peter Zijlstra
  2020-04-14 14:32   ` Peter Zijlstra
  2 siblings, 2 replies; 110+ messages in thread
From: Cheng Jian @ 2020-04-01 11:42 UTC (permalink / raw)
  To: vpillai
  Cc: aaron.lwe, aubrey.intel, aubrey.li, fweisbec, jdesfossez, joel,
	joelaf, keescook, kerrnel, linux-kernel, mgorman, mingo,
	naravamudan, pauld, pawan.kumar.gupta, pbonzini, peterz, pjt,
	tglx, tim.c.chen, torvalds, valentin.schneider, cj.chengjian,
	xiexiuqi, huawei.libin, w.f

When SCHED_CORE is enabled, sched_cpu_starting() uses thread_sibling as
the SMT mask to initialize rq->core, but thread_sibling is only ready
for use after store_cpu_topology() has run:

	notify_cpu_starting()
	    -> sched_cpu_starting()	# uses thread_sibling

	store_cpu_topology(cpu)
	    -> update_siblings_masks	# sets thread_sibling

Fix this by calling notify_cpu_starting() later, just like x86 does.

Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 arch/arm64/kernel/smp.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 5407bf5d98ac..a427c14e82af 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -236,13 +236,18 @@ asmlinkage notrace void secondary_start_kernel(void)
 	cpuinfo_store_cpu();
 
 	/*
-	 * Enable GIC and timers.
+	 * Store cpu topology before notify_cpu_starting,
+	 * CPUHP_AP_SCHED_STARTING requires SMT topology
+	 * been initialized for SCHED_CORE.
 	 */
-	notify_cpu_starting(cpu);
-
 	store_cpu_topology(cpu);
 	numa_add_cpu(cpu);
 
+	/*
+	 * Enable GIC and timers.
+	 */
+	notify_cpu_starting(cpu);
+
 	/*
 	 * OK, now it's safe to let the boot CPU continue.  Wait for
 	 * the CPU migration code to notice that the CPU is online
-- 
2.17.1


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] sched/arm64: store cpu topology before notify_cpu_starting
  2020-04-01 11:42   ` [PATCH] sched/arm64: store cpu topology before notify_cpu_starting Cheng Jian
@ 2020-04-01 13:23     ` Valentin Schneider
  2020-04-06  8:00       ` chengjian (D)
  2020-04-09  9:59       ` Sudeep Holla
  2020-04-09 17:54     ` Joel Fernandes
  1 sibling, 2 replies; 110+ messages in thread
From: Valentin Schneider @ 2020-04-01 13:23 UTC (permalink / raw)
  To: Cheng Jian
  Cc: vpillai, aaron.lwe, aubrey.intel, aubrey.li, fweisbec,
	jdesfossez, joel, joelaf, keescook, kerrnel, linux-kernel,
	mgorman, mingo, naravamudan, pauld, pawan.kumar.gupta, pbonzini,
	peterz, pjt, tglx, tim.c.chen, torvalds, xiexiuqi, huawei.libin,
	w.f, linux-arm-kernel, Sudeep Holla


(+LAKML, +Sudeep)

On Wed, Apr 01 2020, Cheng Jian wrote:
> when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as
> SMT_MASK to initialize rq->core, but only after store_cpu_topology(),
> the thread_sibling is ready for use.
>
>       notify_cpu_starting()
>           -> sched_cpu_starting()	# use thread_sibling
>
>       store_cpu_topology(cpu)
>           -> update_siblings_masks	# set thread_sibling
>
> Fix this by doing notify_cpu_starting later, just like x86 do.
>

I haven't been following the sched core stuff closely; can't this
rq->core assignment be done in sched_cpu_activate() instead? We already
look at the cpu_smt_mask() in there, and it is valid (we go through the
entirety of secondary_start_kernel() before getting anywhere near
CPUHP_AP_ACTIVE).

I don't think this breaks anything, but without this dependency in
sched_cpu_starting() then there isn't really a reason for this move.

> Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
> ---
>  arch/arm64/kernel/smp.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 5407bf5d98ac..a427c14e82af 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -236,13 +236,18 @@ asmlinkage notrace void secondary_start_kernel(void)
>       cpuinfo_store_cpu();
>
>       /*
> -	 * Enable GIC and timers.
> +	 * Store cpu topology before notify_cpu_starting,
> +	 * CPUHP_AP_SCHED_STARTING requires SMT topology
> +	 * been initialized for SCHED_CORE.
>        */
> -	notify_cpu_starting(cpu);
> -
>       store_cpu_topology(cpu);
>       numa_add_cpu(cpu);
>
> +	/*
> +	 * Enable GIC and timers.
> +	 */
> +	notify_cpu_starting(cpu);
> +
>       /*
>        * OK, now it's safe to let the boot CPU continue.  Wait for
>        * the CPU migration code to notice that the CPU is online

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] sched/arm64: store cpu topology before notify_cpu_starting
  2020-04-01 13:23     ` Valentin Schneider
@ 2020-04-06  8:00       ` chengjian (D)
  2020-04-09  9:59       ` Sudeep Holla
  1 sibling, 0 replies; 110+ messages in thread
From: chengjian (D) @ 2020-04-06  8:00 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: vpillai, aaron.lwe, aubrey.intel, aubrey.li, fweisbec,
	jdesfossez, joel, joelaf, keescook, kerrnel, linux-kernel,
	mgorman, mingo, naravamudan, pauld, pawan.kumar.gupta, pbonzini,
	peterz, pjt, tglx, tim.c.chen, torvalds, xiexiuqi, huawei.libin,
	w.f, linux-arm-kernel, Sudeep Holla, chengjian (D)


On 2020/4/1 21:23, Valentin Schneider wrote:
> (+LAKML, +Sudeep)
>
> On Wed, Apr 01 2020, Cheng Jian wrote:
>> when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as
>> SMT_MASK to initialize rq->core, but only after store_cpu_topology(),
>> the thread_sibling is ready for use.
>>
>>        notify_cpu_starting()
>>            -> sched_cpu_starting()	# use thread_sibling
>>
>>        store_cpu_topology(cpu)
>>            -> update_siblings_masks	# set thread_sibling
>>
>> Fix this by doing notify_cpu_starting later, just like x86 do.
>>
> I haven't been following the sched core stuff closely; can't this
> rq->core assignment be done in sched_cpu_activate() instead? We already
> look at the cpu_smt_mask() in there, and it is valid (we go through the
> entirety of secondary_start_kernel() before getting anywhere near
> CPUHP_AP_ACTIVE).
>
> I don't think this breaks anything, but without this dependency in
> sched_cpu_starting() then there isn't really a reason for this move.

Yes, it is correct to put the rq->core assignment in sched_cpu_activate().

The cpu_smt_mask is already valid there.


I have made such an attempt on my own branch and it passed my tests.


Thank you.


     -- Cheng Jian



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] sched/arm64: store cpu topology before notify_cpu_starting
  2020-04-01 13:23     ` Valentin Schneider
  2020-04-06  8:00       ` chengjian (D)
@ 2020-04-09  9:59       ` Sudeep Holla
  2020-04-09 10:32         ` Valentin Schneider
  1 sibling, 1 reply; 110+ messages in thread
From: Sudeep Holla @ 2020-04-09  9:59 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Cheng Jian, vpillai, aaron.lwe, aubrey.intel, aubrey.li,
	fweisbec, jdesfossez, joel, joelaf, keescook, kerrnel,
	linux-kernel, mgorman, mingo, naravamudan, pauld,
	pawan.kumar.gupta, pbonzini, peterz, pjt, tglx, tim.c.chen,
	torvalds, xiexiuqi, huawei.libin, w.f, linux-arm-kernel,
	Sudeep Holla

On Wed, Apr 01, 2020 at 02:23:33PM +0100, Valentin Schneider wrote:
>
> (+LAKML, +Sudeep)
>

Thanks Valentin.

> On Wed, Apr 01 2020, Cheng Jian wrote:
> > when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as
> > SMT_MASK to initialize rq->core, but only after store_cpu_topology(),
> > the thread_sibling is ready for use.
> >
> >       notify_cpu_starting()
> >           -> sched_cpu_starting()	# use thread_sibling
> >
> >       store_cpu_topology(cpu)
> >           -> update_siblings_masks	# set thread_sibling
> >
> > Fix this by doing notify_cpu_starting later, just like x86 do.
> >
>
> I haven't been following the sched core stuff closely; can't this
> rq->core assignment be done in sched_cpu_activate() instead? We already
> look at the cpu_smt_mask() in there, and it is valid (we go through the
> entirety of secondary_start_kernel() before getting anywhere near
> CPUHP_AP_ACTIVE).
>

I too came to the same conclusion. Did you see any issues? Or is it
just code inspection for parity with x86?

> I don't think this breaks anything, but without this dependency in
> sched_cpu_starting() then there isn't really a reason for this move.
>

Based on the commit message, I had a quick look at the x86 code and agree
this shouldn't break anything. However, the commit message doesn't make
complete sense to me, especially the reference to sched_cpu_starting()
while the smt_masks are accessed in sched_cpu_activate(). Or am I
missing something here?

--
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] sched/arm64: store cpu topology before notify_cpu_starting
  2020-04-09  9:59       ` Sudeep Holla
@ 2020-04-09 10:32         ` Valentin Schneider
  2020-04-09 11:08           ` Sudeep Holla
  0 siblings, 1 reply; 110+ messages in thread
From: Valentin Schneider @ 2020-04-09 10:32 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Cheng Jian, vpillai, aaron.lwe, aubrey.intel, aubrey.li,
	fweisbec, jdesfossez, joel, joelaf, keescook, kerrnel,
	linux-kernel, mgorman, mingo, naravamudan, pauld,
	pawan.kumar.gupta, pbonzini, peterz, pjt, tglx, tim.c.chen,
	torvalds, xiexiuqi, huawei.libin, w.f, linux-arm-kernel


On 09/04/20 10:59, Sudeep Holla wrote:
> On Wed, Apr 01, 2020 at 02:23:33PM +0100, Valentin Schneider wrote:
>>
>> (+LAKML, +Sudeep)
>>
>
> Thanks Valentin.
>
>> On Wed, Apr 01 2020, Cheng Jian wrote:
>> > when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as
>> > SMT_MASK to initialize rq->core, but only after store_cpu_topology(),
>> > the thread_sibling is ready for use.
>> >
>> >       notify_cpu_starting()
>> >           -> sched_cpu_starting()	# use thread_sibling
>> >
>> >       store_cpu_topology(cpu)
>> >           -> update_siblings_masks	# set thread_sibling
>> >
>> > Fix this by doing notify_cpu_starting later, just like x86 do.
>> >
>>
>> I haven't been following the sched core stuff closely; can't this
>> rq->core assignment be done in sched_cpu_activate() instead? We already
>> look at the cpu_smt_mask() in there, and it is valid (we go through the
>> entirety of secondary_start_kernel() before getting anywhere near
>> CPUHP_AP_ACTIVE).
>>
>
> I too came to same conclusion. Did you see any issues ? Or is it
> just code inspection in parity with x86 ?
>

With mainline this isn't a problem; with the core scheduling stuff there is
an expectation that we can use the SMT masks in sched_cpu_starting().

>> I don't think this breaks anything, but without this dependency in
>> sched_cpu_starting() then there isn't really a reason for this move.
>>
>
> Based on the commit message, I had a quick look at the x86 code and agree
> this shouldn't break anything. However, the commit message doesn't make
> complete sense to me, especially the reference to sched_cpu_starting()
> while the smt_masks are accessed in sched_cpu_activate(). Or am I
> missing something here?

As stated above, it's not a problem for mainline, and AIUI we can change
the core scheduling bits to only use the SMT mask in sched_cpu_activate()
instead, therefore not requiring any change in the arch code.

I'm not aware of any written rule that the topology masks should be usable
from a given hotplug state upwards, only that right now we need them in
sched_cpu_(de)activate() for SMT scheduling - and that is already working
fine.

So really this should be considered a simple neutral cleanup; I don't
really have any opinion on picking it up or not.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] sched/arm64: store cpu topology before notify_cpu_starting
  2020-04-09 10:32         ` Valentin Schneider
@ 2020-04-09 11:08           ` Sudeep Holla
  0 siblings, 0 replies; 110+ messages in thread
From: Sudeep Holla @ 2020-04-09 11:08 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Cheng Jian, vpillai, aaron.lwe, aubrey.intel, aubrey.li,
	fweisbec, jdesfossez, joel, joelaf, keescook, kerrnel,
	linux-kernel, mgorman, mingo, naravamudan, pauld,
	pawan.kumar.gupta, pbonzini, peterz, pjt, tglx, tim.c.chen,
	torvalds, xiexiuqi, huawei.libin, w.f, Sudeep Holla,
	linux-arm-kernel

On Thu, Apr 09, 2020 at 11:32:12AM +0100, Valentin Schneider wrote:
>
> On 09/04/20 10:59, Sudeep Holla wrote:
> > On Wed, Apr 01, 2020 at 02:23:33PM +0100, Valentin Schneider wrote:
> >>
> >> (+LAKML, +Sudeep)
> >>
> >
> > Thanks Valentin.
> >
> >> On Wed, Apr 01 2020, Cheng Jian wrote:
> >> > when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as
> >> > SMT_MASK to initialize rq->core, but only after store_cpu_topology(),
> >> > the thread_sibling is ready for use.
> >> >
> >> >       notify_cpu_starting()
> >> >           -> sched_cpu_starting()	# use thread_sibling
> >> >
> >> >       store_cpu_topology(cpu)
> >> >           -> update_siblings_masks	# set thread_sibling
> >> >
> >> > Fix this by doing notify_cpu_starting later, just like x86 do.
> >> >
> >>
> >> I haven't been following the sched core stuff closely; can't this
> >> rq->core assignment be done in sched_cpu_activate() instead? We already
> >> look at the cpu_smt_mask() in there, and it is valid (we go through the
> >> entirety of secondary_start_kernel() before getting anywhere near
> >> CPUHP_AP_ACTIVE).
> >>
> >
> > I too came to same conclusion. Did you see any issues ? Or is it
> > just code inspection in parity with x86 ?
> >
>
> With mainline this isn't a problem; with the core scheduling stuff there is
> an expectation that we can use the SMT masks in sched_cpu_starting().
>

Ah, OK. I prefer this to be specified in the commit message as it is not
obvious.

> >> I don't think this breaks anything, but without this dependency in
> >> sched_cpu_starting() then there isn't really a reason for this move.
> >>
> >
> > Based on the commit message, I had a quick look at the x86 code and agree
> > this shouldn't break anything. However, the commit message doesn't make
> > complete sense to me, especially the reference to sched_cpu_starting()
> > while the smt_masks are accessed in sched_cpu_activate(). Or am I
> > missing something here?
>
> As stated above, it's not a problem for mainline, and AIUI we can change
> the core scheduling bits to only use the SMT mask in sched_cpu_activate()
> instead, therefore not requiring any change in the arch code.
>

Either way is fine. If the expectation is already that the SMT masks need
to be set before sched_cpu_starting(), then let us just stick with that.

> I'm not aware of any written rule that the topology masks should be usable
> from a given hotplug state upwards, only that right now we need them in
> sched_cpu_(de)activate() for SMT scheduling - and that is already working
> fine.
>

Sure, we can at least document this as part of the change, even if it is
just in arm64, so that someone need not wonder about it in future.

> So really this should be considering as a simple neutral cleanup; I don't
> really have any opinion on picking it up or not.

I am fine with the change too, just need some tweaking in the commit
message.

--
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] sched/arm64: store cpu topology before notify_cpu_starting
  2020-04-01 11:42   ` [PATCH] sched/arm64: store cpu topology before notify_cpu_starting Cheng Jian
  2020-04-01 13:23     ` Valentin Schneider
@ 2020-04-09 17:54     ` Joel Fernandes
  2020-04-10 13:49       ` chengjian (D)
  1 sibling, 1 reply; 110+ messages in thread
From: Joel Fernandes @ 2020-04-09 17:54 UTC (permalink / raw)
  To: Cheng Jian
  Cc: Vineeth Remanan Pillai, aaron.lwe, aubrey.intel, aubrey.li,
	Cc: Frederic Weisbecker, Julien Desfossez,
	Joel Fernandes (Google),
	Kees Cook, Greg Kerr, LKML, mgorman, Ingo Molnar, naravamudan,
	pauld, pawan.kumar.gupta, pbonzini, Peter Zijlstra, Paul Turner,
	Thomas Gleixner, tim.c.chen, Linus Torvalds, Valentin Schneider,
	xiexiuqi, huawei.libin, w.f

On Wed, Apr 1, 2020 at 7:27 AM Cheng Jian <cj.chengjian@huawei.com> wrote:
>
> when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as
> SMT_MASK to initialize rq->core, but only after store_cpu_topology(),
> the thread_sibling is ready for use.
>
>         notify_cpu_starting()
>             -> sched_cpu_starting()     # use thread_sibling
>
>         store_cpu_topology(cpu)
>             -> update_siblings_masks    # set thread_sibling
>
> Fix this by doing notify_cpu_starting later, just like x86 do.
>
> Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>

Just a high-level question, why does core-scheduling matter on ARM64?
Is it for HPC workloads?

Thanks,

 - Joel


> ---
>  arch/arm64/kernel/smp.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 5407bf5d98ac..a427c14e82af 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -236,13 +236,18 @@ asmlinkage notrace void secondary_start_kernel(void)
>         cpuinfo_store_cpu();
>
>         /*
> -        * Enable GIC and timers.
> +        * Store cpu topology before notify_cpu_starting,
> +        * CPUHP_AP_SCHED_STARTING requires SMT topology
> +        * been initialized for SCHED_CORE.
>          */
> -       notify_cpu_starting(cpu);
> -
>         store_cpu_topology(cpu);
>         numa_add_cpu(cpu);
>
> +       /*
> +        * Enable GIC and timers.
> +        */
> +       notify_cpu_starting(cpu);
> +
>         /*
>          * OK, now it's safe to let the boot CPU continue.  Wait for
>          * the CPU migration code to notice that the CPU is online
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] sched/arm64: store cpu topology before notify_cpu_starting
  2020-04-09 17:54     ` Joel Fernandes
@ 2020-04-10 13:49       ` chengjian (D)
  0 siblings, 0 replies; 110+ messages in thread
From: chengjian (D) @ 2020-04-10 13:49 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Vineeth Remanan Pillai, aaron.lwe, aubrey.intel, aubrey.li,
	Cc: Frederic Weisbecker, Julien Desfossez,
	Joel Fernandes (Google),
	Kees Cook, Greg Kerr, LKML, mgorman, Ingo Molnar, naravamudan,
	pauld, pawan.kumar.gupta, pbonzini, Peter Zijlstra, Paul Turner,
	Thomas Gleixner, tim.c.chen, Linus Torvalds, Valentin Schneider,
	xiexiuqi, huawei.libin, w.f, chengjian (D),
	wxf.wang@hisilicon.com >> Xuefeng Wang


On 2020/4/10 1:54, Joel Fernandes wrote:
> On Wed, Apr 1, 2020 at 7:27 AM Cheng Jian <cj.chengjian@huawei.com> wrote:
>> when SCHED_CORE enabled, sched_cpu_starting() uses thread_sibling as
>> SMT_MASK to initialize rq->core, but only after store_cpu_topology(),
>> the thread_sibling is ready for use.
>>
>>          notify_cpu_starting()
>>              -> sched_cpu_starting()     # use thread_sibling
>>
>>          store_cpu_topology(cpu)
>>              -> update_siblings_masks    # set thread_sibling
>>
>> Fix this by doing notify_cpu_starting later, just like x86 do.
>>
>> Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
> Just a high-level question, why does core-scheduling matter on ARM64?
> Is it for HPC workloads?
>
> Thanks,
>
>   - Joel

Hi, Joel

I was analyzing the core scheduling patches and found this problem.


ARM has some platforms that support SMT, and also provides emulators
that can be used.



Thanks.

--Cheng Jian



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 03/13] sched: Core-wide rq->lock
  2020-03-04 16:59 ` [RFC PATCH 03/13] sched: Core-wide rq->lock vpillai
  2020-04-01 11:42   ` [PATCH] sched/arm64: store cpu topology before notify_cpu_starting Cheng Jian
@ 2020-04-14 11:36   ` Peter Zijlstra
  2020-04-14 21:35     ` Vineeth Remanan Pillai
  2020-04-14 14:32   ` Peter Zijlstra
  2 siblings, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2020-04-14 11:36 UTC (permalink / raw)
  To: vpillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo, tglx,
	pjt, torvalds, linux-kernel, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel

On Wed, Mar 04, 2020 at 04:59:53PM +0000, vpillai wrote:
> @@ -6400,8 +6464,15 @@ int sched_cpu_activate(unsigned int cpu)
>  	/*
>  	 * When going up, increment the number of cores with SMT present.
>  	 */
> -	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> +	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
>  		static_branch_inc_cpuslocked(&sched_smt_present);
> +#ifdef CONFIG_SCHED_CORE
> +		if (static_branch_unlikely(&__sched_core_enabled)) {
> +			rq->core_enabled = true;
> +		}
> +#endif
> +	}
> +
>  #endif
>  	set_cpu_active(cpu, true);
>  
> @@ -6447,8 +6518,16 @@ int sched_cpu_deactivate(unsigned int cpu)
>  	/*
>  	 * When going down, decrement the number of cores with SMT present.
>  	 */
> -	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> +	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
> +#ifdef CONFIG_SCHED_CORE
> +		struct rq *rq = cpu_rq(cpu);
> +		if (static_branch_unlikely(&__sched_core_enabled)) {
> +			rq->core_enabled = false;
> +		}
> +#endif
>  		static_branch_dec_cpuslocked(&sched_smt_present);
> +
> +	}
>  #endif
>  
>  	if (!sched_smp_initialized)

Aside from the fact that it's probably much saner to write this as:

	rq->core_enabled = static_key_enabled(&__sched_core_enabled);

I'm fairly sure I didn't write this part. And while I do somewhat see
the point of disabling core scheduling for a core that has only a single
thread on, I wonder why we care.

The thing is, this directly leads to the utter horror-show that is patch
6.

It should be perfectly possible to core schedule a core with only a
single thread on. It might be a tad silly to do, but it beats the heck
out of the trainwreck created here.

So how did this happen?

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.
  2020-03-04 16:59 ` [RFC PATCH 07/13] sched: Add core wide task selection and scheduling vpillai
@ 2020-04-14 13:35   ` Peter Zijlstra
  2020-04-16 23:32     ` Tim Chen
  2020-04-16  3:39   ` Chen Yu
  2020-05-21 23:14   ` Joel Fernandes
  2 siblings, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2020-04-14 13:35 UTC (permalink / raw)
  To: vpillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo, tglx,
	pjt, torvalds, linux-kernel, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel,
	Aaron Lu

On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
> +static struct task_struct *
> +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
> +	struct task_struct *next, *max = NULL;
> +	const struct sched_class *class;
> +	const struct cpumask *smt_mask;
> +	int i, j, cpu;
> +	bool need_sync = false;

AFAICT that assignment is superfluous. Also, you violated the inverse
x-mas tree.

> +
> +	cpu = cpu_of(rq);
> +	if (cpu_is_offline(cpu))
> +		return idle_sched_class.pick_next_task(rq);

Are we actually hitting this one?

> +	if (!sched_core_enabled(rq))
> +		return __pick_next_task(rq, prev, rf);
> +
> +	/*
> +	 * If there were no {en,de}queues since we picked (IOW, the task
> +	 * pointers are all still valid), and we haven't scheduled the last
> +	 * pick yet, do so now.
> +	 */
> +	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> +	    rq->core->core_pick_seq != rq->core_sched_seq) {
> +		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
> +
> +		next = rq->core_pick;
> +		if (next != prev) {
> +			put_prev_task(rq, prev);
> +			set_next_task(rq, next);
> +		}
> +		return next;
> +	}
> +
> +	prev->sched_class->put_prev_task(rq, prev);
> +	if (!rq->nr_running)
> +		newidle_balance(rq, rf);

This is wrong per commit:

  6e2df0581f56 ("sched: Fix pick_next_task() vs 'change' pattern race")

> +	smt_mask = cpu_smt_mask(cpu);
> +
> +	/*
> +	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
> +	 *
> +	 * @task_seq guards the task state ({en,de}queues)
> +	 * @pick_seq is the @task_seq we did a selection on
> +	 * @sched_seq is the @pick_seq we scheduled
> +	 *
> +	 * However, preemptions can cause multiple picks on the same task set.
> +	 * 'Fix' this by also increasing @task_seq for every pick.
> +	 */
> +	rq->core->core_task_seq++;
> +	need_sync = !!rq->core->core_cookie;
> +
> +	/* reset state */
> +	rq->core->core_cookie = 0UL;
> +	for_each_cpu(i, smt_mask) {
> +		struct rq *rq_i = cpu_rq(i);
> +
> +		rq_i->core_pick = NULL;
> +
> +		if (rq_i->core_forceidle) {
> +			need_sync = true;
> +			rq_i->core_forceidle = false;
> +		}
> +
> +		if (i != cpu)
> +			update_rq_clock(rq_i);
> +	}
> +
> +	/*
> +	 * Try and select tasks for each sibling in decending sched_class
> +	 * order.
> +	 */
> +	for_each_class(class) {
> +again:
> +		for_each_cpu_wrap(i, smt_mask, cpu) {
> +			struct rq *rq_i = cpu_rq(i);
> +			struct task_struct *p;
> +
> +			if (cpu_is_offline(i)) {
> +				rq_i->core_pick = rq_i->idle;
> +				continue;
> +			}

Why are we polluting the 'fast' path with offline crud? Why isn't this
the natural result of running pick_task() on an empty runqueue?

> +
> +			if (rq_i->core_pick)
> +				continue;
> +
> +			/*
> +			 * If this sibling doesn't yet have a suitable task to
> +			 * run; ask for the most elegible task, given the
> +			 * highest priority task already selected for this
> +			 * core.
> +			 */
> +			p = pick_task(rq_i, class, max);
> +			if (!p) {
> +				/*
> +				 * If there weren't no cookies; we don't need
> +				 * to bother with the other siblings.
> +				 */
> +				if (i == cpu && !need_sync)
> +					goto next_class;
> +
> +				continue;
> +			}
> +
> +			/*
> +			 * Optimize the 'normal' case where there aren't any
> +			 * cookies and we don't need to sync up.
> +			 */
> +			if (i == cpu && !need_sync && !p->core_cookie) {
> +				next = p;
> +				goto done;
> +			}
> +
> +			rq_i->core_pick = p;
> +
> +			/*
> +			 * If this new candidate is of higher priority than the
> +			 * previous; and they're incompatible; we need to wipe
> +			 * the slate and start over. pick_task makes sure that
> +			 * p's priority is more than max if it doesn't match
> +			 * max's cookie.
> +			 *
> +			 * NOTE: this is a linear max-filter and is thus bounded
> +			 * in execution time.
> +			 */
> +			if (!max || !cookie_match(max, p)) {
> +				struct task_struct *old_max = max;
> +
> +				rq->core->core_cookie = p->core_cookie;
> +				max = p;
> +
> +				if (old_max) {
> +					for_each_cpu(j, smt_mask) {
> +						if (j == i)
> +							continue;
> +
> +						cpu_rq(j)->core_pick = NULL;
> +					}
> +					goto again;
> +				} else {
> +					/*
> +					 * Once we select a task for a cpu, we
> +					 * should not be doing an unconstrained
> +					 * pick because it might starve a task
> +					 * on a forced idle cpu.
> +					 */
> +					need_sync = true;
> +				}
> +
> +			}
> +		}
> +next_class:;
> +	}
> +
> +	rq->core->core_pick_seq = rq->core->core_task_seq;
> +	next = rq->core_pick;
> +	rq->core_sched_seq = rq->core->core_pick_seq;
> +
> +	/*
> +	 * Reschedule siblings
> +	 *
> +	 * NOTE: L1TF -- at this point we're no longer running the old task and
> +	 * sending an IPI (below) ensures the sibling will no longer be running
> +	 * their task. This ensures there is no inter-sibling overlap between
> +	 * non-matching user state.
> +	 */
> +	for_each_cpu(i, smt_mask) {
> +		struct rq *rq_i = cpu_rq(i);
> +
> +		if (cpu_is_offline(i))
> +			continue;

Another one; please explain how an offline cpu can be part of the
smt_mask. Last time I checked it got cleared in stop-machine.

> +
> +		WARN_ON_ONCE(!rq_i->core_pick);
> +
> +		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> +			rq_i->core_forceidle = true;
> +
> +		if (i == cpu)
> +			continue;
> +
> +		if (rq_i->curr != rq_i->core_pick)
> +			resched_curr(rq_i);
> +
> +		/* Did we break L1TF mitigation requirements? */
> +		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));

That comment is misleading...

> +	}
> +
> +done:
> +	set_next_task(rq, next);
> +	return next;
> +}

----8<----

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a9eeef896c78..8432de767730 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4080,6 +4080,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  		update_min_vruntime(cfs_rq);
>  }
>  
> +static inline bool
> +__entity_slice_used(struct sched_entity *se)
> +{
> +	return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
> +		sched_slice(cfs_rq_of(se), se);
> +}
> +
>  /*
>   * Preempt the current task with a newly woken task if needed:
>   */
> @@ -10285,6 +10292,34 @@ static void core_sched_deactivate_fair(struct rq *rq)
>  #endif
>  #endif /* CONFIG_SMP */
>  
> +#ifdef CONFIG_SCHED_CORE
> +/*
> + * If runqueue has only one task which used up its slice and
> + * if the sibling is forced idle, then trigger schedule
> + * to give forced idle task a chance.
> + */
> +static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
> +{
> +	int cpu = cpu_of(rq), sibling_cpu;
> +	if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
> +		return;
> +
> +	for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
> +		struct rq *sibling_rq;
> +		if (sibling_cpu == cpu)
> +			continue;
> +		if (cpu_is_offline(sibling_cpu))
> +			continue;
> +
> +		sibling_rq = cpu_rq(sibling_cpu);
> +		if (sibling_rq->core_forceidle) {
> +			resched_curr(sibling_rq);
> +		}
> +	}
> +}
> +#endif
> +
> +
>  /*
>   * scheduler tick hitting a task of our scheduling class.
>   *
> @@ -10308,6 +10343,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  
>  	update_misfit_status(curr, rq);
>  	update_overutilized_status(task_rq(curr));
> +
> +#ifdef CONFIG_SCHED_CORE
> +	if (sched_core_enabled(rq))
> +		resched_forceidle_sibling(rq, &curr->se);
> +#endif
>  }
>  
>  /*

This ^ seems like it should be in its own patch.

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 03d502357599..a829e26fa43a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1003,11 +1003,16 @@ struct rq {
>  #ifdef CONFIG_SCHED_CORE
>  	/* per rq */
>  	struct rq		*core;
> +	struct task_struct	*core_pick;
>  	unsigned int		core_enabled;
> +	unsigned int		core_sched_seq;
>  	struct rb_root		core_tree;
> +	bool			core_forceidle;

Someone forgot that _Bool shouldn't be part of composite types?

>  	/* shared state */
>  	unsigned int		core_task_seq;
> +	unsigned int		core_pick_seq;
> +	unsigned long		core_cookie;
>  #endif
>  };

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 09/13] sched/fair: core wide vruntime comparison
  2020-03-04 16:59 ` [RFC PATCH 09/13] sched/fair: core wide vruntime comparison vpillai
@ 2020-04-14 13:56   ` Peter Zijlstra
  2020-04-15  3:34     ` Aaron Lu
  0 siblings, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2020-04-14 13:56 UTC (permalink / raw)
  To: vpillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo, tglx,
	pjt, torvalds, Aaron Lu, linux-kernel, fweisbec, keescook,
	kerrnel, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Joel Fernandes, joel, Aaron Lu

On Wed, Mar 04, 2020 at 04:59:59PM +0000, vpillai wrote:
> From: Aaron Lu <aaron.lu@linux.alibaba.com>
> 
> This patch provides a vruntime based way to compare two cfs task's
> priority, be it on the same cpu or different threads of the same core.
> 
> When the two tasks are on the same CPU, we just need to find a common
> cfs_rq both sched_entities are on and then do the comparison.
> 
> When the two tasks are on different threads of the same core, the root
> level sched_entities to which the two tasks belong will be used to do
> the comparison.
> 
> An ugly illustration for the cross CPU case:
> 
>    cpu0         cpu1
>  /   |  \     /   |  \
> se1 se2 se3  se4 se5 se6
>     /  \            /   \
>   se21 se22       se61  se62
> 
> Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
> task B's se is se61. To compare priority of task A and B, we compare
> priority of se2 and se6. Whose vruntime is smaller, who wins.
> 
> To make this work, the root level se should have a common cfs_rq min
> vruntime, which I call the core cfs_rq min vruntime.
> 
> When we adjust the min_vruntime of rq->core, we need to propagate
> that down the tree so as to not cause starvation of existing tasks
> based on previous vruntime.

You forgot the time complexity analysis.


> +static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
> +{
> +	struct sched_entity *se, *next;
> +
> +	if (!cfs_rq)
> +		return;
> +
> +	cfs_rq->min_vruntime -= delta;
> +	rbtree_postorder_for_each_entry_safe(se, next,
> +			&cfs_rq->tasks_timeline.rb_root, run_node) {

Which per this ^

> +		if (se->vruntime > delta)
> +			se->vruntime -= delta;
> +		if (se->my_q)
> +			coresched_adjust_vruntime(se->my_q, delta);
> +	}
> +}

> @@ -511,6 +607,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
>  
>  	/* ensure we never gain time by being placed backwards. */
>  	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
> +	update_core_cfs_rq_min_vruntime(cfs_rq);
>  #ifndef CONFIG_64BIT
>  	smp_wmb();
>  	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;

as called from here, is exceedingly important.

Worse, I don't think our post-order iteration is even O(n).


All of this is exceedingly yuck.


* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (13 preceding siblings ...)
  2020-03-04 17:36 ` [RFC PATCH 00/13] Core scheduling v5 Tim Chen
@ 2020-04-14 14:21 ` Peter Zijlstra
  2020-04-15 16:32   ` Joel Fernandes
  2020-05-09 14:35   ` Dario Faggioli
       [not found] ` <38805656-2e2f-222a-c083-692f4b113313@linux.intel.com>
                   ` (5 subsequent siblings)
  20 siblings, 2 replies; 110+ messages in thread
From: Peter Zijlstra @ 2020-04-14 14:21 UTC (permalink / raw)
  To: vpillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo, tglx,
	pjt, torvalds, linux-kernel, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel

On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> TODO
> ----
> - Work on merging patches that are ready to be merged
> - Decide on the API for exposing the feature to userland
> - Experiment with adding synchronization points in VMEXIT to mitigate
>   the VM-to-host-kernel leaking

VMEXIT is too late, you need to hook irq_enter(), which is what makes
the whole thing so horrible.

> - Investigate the source of the overhead even when no tasks are tagged:
>   https://lkml.org/lkml/2019/10/29/242

 - explain why we're all still doing this ....

Seriously, what actual problems does it solve? The patch-set still isn't
L1TF complete and afaict it does exactly nothing for MDS.

Like I've written many times now, back when the world was simpler and
all we had to worry about was L1TF, core-scheduling made some sense, but
how does it make sense today?

It's cute that this series sucks less than it did before, but what are
we trading that performance for?


* Re: [RFC PATCH 03/13] sched: Core-wide rq->lock
  2020-03-04 16:59 ` [RFC PATCH 03/13] sched: Core-wide rq->lock vpillai
  2020-04-01 11:42   ` [PATCH] sched/arm64: store cpu topology before notify_cpu_starting Cheng Jian
  2020-04-14 11:36   ` [RFC PATCH 03/13] sched: Core-wide rq->lock Peter Zijlstra
@ 2020-04-14 14:32   ` Peter Zijlstra
  2 siblings, 0 replies; 110+ messages in thread
From: Peter Zijlstra @ 2020-04-14 14:32 UTC (permalink / raw)
  To: vpillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo, tglx,
	pjt, torvalds, linux-kernel, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel

On Wed, Mar 04, 2020 at 04:59:53PM +0000, vpillai wrote:
> +DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
> +
> +/*
> + * The static-key + stop-machine variable are needed such that:
> + *
> + *	spin_lock(rq_lockp(rq));
> + *	...
> + *	spin_unlock(rq_lockp(rq));
> + *
> + * ends up locking and unlocking the _same_ lock, and all CPUs
> + * always agree on what rq has what lock.
> + *
> + * XXX entirely possible to selectively enable cores, don't bother for now.
> + */
> +static int __sched_core_stopper(void *data)
> +{
> +	bool enabled = !!(unsigned long)data;
> +	int cpu;
> +
> +	for_each_online_cpu(cpu)
> +		cpu_rq(cpu)->core_enabled = enabled;
> +
> +	return 0;
> +}
> +
> +static DEFINE_MUTEX(sched_core_mutex);
> +static int sched_core_count;
> +
> +static void __sched_core_enable(void)
> +{
> +	// XXX verify there are no cookie tasks (yet)
> +
> +	static_branch_enable(&__sched_core_enabled);
> +	stop_machine(__sched_core_stopper, (void *)true, NULL);
> +}
> +
> +static void __sched_core_disable(void)
> +{
> +	// XXX verify there are no cookie tasks (left)
> +
> +	stop_machine(__sched_core_stopper, (void *)false, NULL);
> +	static_branch_disable(&__sched_core_enabled);
> +}

> +static inline raw_spinlock_t *rq_lockp(struct rq *rq)
> +{
> +	if (sched_core_enabled(rq))
> +		return &rq->core->__lock;
> +
> +	return &rq->__lock;
> +}

While reading all this again, I realized it's not too hard to get rid of
stop-machine here.

void __raw_rq_lock(struct rq *rq)
{
	raw_spinlock_t *lock;

	for (;;) {
		lock = rq_lockp(rq);

		raw_spin_lock(lock);
		if (lock == rq_lockp(rq))
			return;
		raw_spin_unlock(lock);
	}
}

void __sched_core_enable(int core, bool enable)
{
	const cpumask *smt_mask;
	int cpu, i = 0;

	smt_mask = cpu_smt_mask(core);

	for_each_cpu(cpu, smt_mask)
		raw_spin_lock_nested(&cpu_rq(cpu)->__lock, i++);

	for_each_cpu(cpu, smt_mask)
		cpu_rq(cpu)->core_enabled = enable;

	for_each_cpu(cpu, smt_mask)
		raw_spin_unlock(&cpu_rq(cpu)->__lock);
}




* Re: [RFC PATCH 03/13] sched: Core-wide rq->lock
  2020-04-14 11:36   ` [RFC PATCH 03/13] sched: Core-wide rq->lock Peter Zijlstra
@ 2020-04-14 21:35     ` Vineeth Remanan Pillai
  2020-04-15 10:55       ` Peter Zijlstra
  0 siblings, 1 reply; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-04-14 21:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li, Li, Aubrey,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Joel Fernandes, Joel Fernandes

> Aside from the fact that it's probably much saner to write this as:
>
>         rq->core_enabled = static_key_enabled(&__sched_core_enabled);
>
> I'm fairly sure I didn't write this part. And while I do somewhat see
> the point of disabling core scheduling for a core that has only a single
> thread on, I wonder why we care.
>
I think this change was to fix some crashes which happened due to
uninitialized rq->core if a sibling was offline during boot and is
onlined after coresched was enabled.

https://lwn.net/ml/linux-kernel/20190424111913.1386-1-vpillai@digitalocean.com/

I tried to fix it by initializing coresched members during a cpu online
and tearing them down on a cpu offline. This was back in v3 and I do not
remember the exact details. I shall revisit this and see if there is a
better way to fix the race condition above.

Thanks,
Vineeth


* Re: [RFC PATCH 09/13] sched/fair: core wide vruntime comparison
  2020-04-14 13:56   ` Peter Zijlstra
@ 2020-04-15  3:34     ` Aaron Lu
  2020-04-15  4:07       ` Aaron Lu
  0 siblings, 1 reply; 110+ messages in thread
From: Aaron Lu @ 2020-04-15  3:34 UTC (permalink / raw)
  To: Peter Zijlstra, Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo, tglx,
	pjt, torvalds, Aaron Lu, linux-kernel, fweisbec, keescook,
	kerrnel, Phil Auld, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel

On Tue, Apr 14, 2020 at 03:56:24PM +0200, Peter Zijlstra wrote:
> On Wed, Mar 04, 2020 at 04:59:59PM +0000, vpillai wrote:
> > From: Aaron Lu <aaron.lu@linux.alibaba.com>
> > 
> > This patch provides a vruntime based way to compare two cfs task's
> > priority, be it on the same cpu or different threads of the same core.
> > 
> > When the two tasks are on the same CPU, we just need to find a common
> > cfs_rq both sched_entities are on and then do the comparison.
> > 
> > When the two tasks are on different threads of the same core, the root
> > level sched_entities to which the two tasks belong will be used to do
> > the comparison.
> > 
> > An ugly illustration for the cross CPU case:
> > 
> >    cpu0         cpu1
> >  /   |  \     /   |  \
> > se1 se2 se3  se4 se5 se6
> >     /  \            /   \
> >   se21 se22       se61  se62
> > 
> > Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
> > task B's se is se61. To compare priority of task A and B, we compare
> > priority of se2 and se6. Whose vruntime is smaller, who wins.
> > 
> > To make this work, the root level se should have a common cfs_rq min
> > vruntime, which I call the core cfs_rq min vruntime.
> > 
> > When we adjust the min_vruntime of rq->core, we need to propagate
> > that down the tree so as to not cause starvation of existing tasks
> > based on previous vruntime.
> 
> You forgot the time complexity analysis.

This is a mistake and the adjust should be needed only once when core
scheduling is initially enabled. It is an initialization thing and there
is no reason to do it in every invocation of coresched_adjust_vruntime().

Vineeth,

I think we have talked about this before and you agreed that it is
needed only once:
https://lore.kernel.org/lkml/20191012035503.GA113034@aaronlu/
https://lore.kernel.org/lkml/CANaguZBevMsQ_Zy1ozKn2Z5Uj6WBviC6UU+zpTQOVdDDLK6r2w@mail.gmail.com/

I'll see how to fix it, but feel free to beat me to it.
 
> > +static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
> > +{
> > +	struct sched_entity *se, *next;
> > +
> > +	if (!cfs_rq)
> > +		return;
> > +
> > +	cfs_rq->min_vruntime -= delta;
> > +	rbtree_postorder_for_each_entry_safe(se, next,
> > +			&cfs_rq->tasks_timeline.rb_root, run_node) {
> 
> Which per this ^
> 
> > +		if (se->vruntime > delta)
> > +			se->vruntime -= delta;
> > +		if (se->my_q)
> > +			coresched_adjust_vruntime(se->my_q, delta);
> > +	}
> > +}
> 
> > @@ -511,6 +607,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
> >  
> >  	/* ensure we never gain time by being placed backwards. */
> >  	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
> > +	update_core_cfs_rq_min_vruntime(cfs_rq);
> >  #ifndef CONFIG_64BIT
> >  	smp_wmb();
> >  	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
> 
> as called from here, is exceedingly important.
> 
> Worse, I don't think our post-order iteration is even O(n).
> 
> 
> All of this is exceedingly yuck.


* Re: [RFC PATCH 09/13] sched/fair: core wide vruntime comparison
  2020-04-15  3:34     ` Aaron Lu
@ 2020-04-15  4:07       ` Aaron Lu
  2020-04-15 21:24         ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 110+ messages in thread
From: Aaron Lu @ 2020-04-15  4:07 UTC (permalink / raw)
  To: Peter Zijlstra, Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo, tglx,
	pjt, torvalds, Aaron Lu, linux-kernel, fweisbec, keescook,
	kerrnel, Phil Auld, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel

On Wed, Apr 15, 2020 at 11:34:08AM +0800, Aaron Lu wrote:
> On Tue, Apr 14, 2020 at 03:56:24PM +0200, Peter Zijlstra wrote:
> > On Wed, Mar 04, 2020 at 04:59:59PM +0000, vpillai wrote:
> > > From: Aaron Lu <aaron.lu@linux.alibaba.com>
> > > 
> > > This patch provides a vruntime based way to compare two cfs task's
> > > priority, be it on the same cpu or different threads of the same core.
> > > 
> > > When the two tasks are on the same CPU, we just need to find a common
> > > cfs_rq both sched_entities are on and then do the comparison.
> > > 
> > > When the two tasks are on different threads of the same core, the root
> > > level sched_entities to which the two tasks belong will be used to do
> > > the comparison.
> > > 
> > > An ugly illustration for the cross CPU case:
> > > 
> > >    cpu0         cpu1
> > >  /   |  \     /   |  \
> > > se1 se2 se3  se4 se5 se6
> > >     /  \            /   \
> > >   se21 se22       se61  se62
> > > 
> > > Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
> > > task B's se is se61. To compare priority of task A and B, we compare
> > > priority of se2 and se6. Whose vruntime is smaller, who wins.
> > > 
> > > To make this work, the root level se should have a common cfs_rq min
> > > vruntime, which I call the core cfs_rq min vruntime.
> > > 
> > > When we adjust the min_vruntime of rq->core, we need to propagate
> > > that down the tree so as to not cause starvation of existing tasks
> > > based on previous vruntime.
> > 
> > You forgot the time complexity analysis.
> 
> This is a mistake and the adjust should be needed only once when core
> scheduling is initially enabled. It is an initialization thing and there
> is no reason to do it in every invocation of coresched_adjust_vruntime().

Correction...
I meant there is no need to call coresched_adjust_vruntime() in every
invocation of update_core_cfs_rq_min_vruntime().


* Re: [RFC PATCH 03/13] sched: Core-wide rq->lock
  2020-04-14 21:35     ` Vineeth Remanan Pillai
@ 2020-04-15 10:55       ` Peter Zijlstra
  0 siblings, 0 replies; 110+ messages in thread
From: Peter Zijlstra @ 2020-04-15 10:55 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li, Li, Aubrey,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Joel Fernandes, Joel Fernandes

On Tue, Apr 14, 2020 at 05:35:07PM -0400, Vineeth Remanan Pillai wrote:
> > Aside from the fact that it's probably much saner to write this as:
> >
> >         rq->core_enabled = static_key_enabled(&__sched_core_enabled);
> >
> > I'm fairly sure I didn't write this part. And while I do somewhat see
> > the point of disabling core scheduling for a core that has only a single
> > thread on, I wonder why we care.
> >
> I think this change was to fix some crashes which happened due to
> uninitialized rq->core if a sibling was offline during boot and is
> onlined after coresched was enabled.
> 
> https://lwn.net/ml/linux-kernel/20190424111913.1386-1-vpillai@digitalocean.com/
> 
> I tried to fix it by initializing coresched members during a cpu online
> and tearing it down on a cpu offline. This was back in v3 and do not
> remember the exact details. I shall revisit this and see if there is a
> better way to fix the race condition above.

Argh, that problem again. So AFAIK booting with maxcpus= is broken in a
whole number of 'interesting' ways. I'm not sure what to do about that,
perhaps we should add a config around that option and make it depend on
CONFIG_BROKEN.

That said; I'm thinking it shouldn't be too hard to fix up the core
state before we add the CPU to the masks, but it will be arch specific.
See speculative_store_bypass_ht_init() for inspiration, but you'll need
to be even earlier, before set_cpu_sibling_map() in smp_callin() on x86
(no clue about other archs).

Even without maxcpus= this can happen when you do physical hotplug and
add a part (or replace one where the new part has more cores than the
old).

The moment core-scheduling is enabled and you're adding unknown
topology, we need to set up state before we publish the mask,... or I
suppose endlessly do: 'smt_mask & active_mask' all over the place :/ In
which case you can indeed do it purely in sched/core.

Hurmph...


* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-04-14 14:21 ` Peter Zijlstra
@ 2020-04-15 16:32   ` Joel Fernandes
  2020-04-17 11:12     ` Peter Zijlstra
  2020-05-09 14:35   ` Dario Faggioli
  1 sibling, 1 reply; 110+ messages in thread
From: Joel Fernandes @ 2020-04-15 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: vpillai, Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo,
	tglx, pjt, torvalds, linux-kernel, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

On Tue, Apr 14, 2020 at 04:21:52PM +0200, Peter Zijlstra wrote:
> On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> > TODO
> > ----
> > - Work on merging patches that are ready to be merged
> > - Decide on the API for exposing the feature to userland
> > - Experiment with adding synchronization points in VMEXIT to mitigate
> >   the VM-to-host-kernel leaking
> 
> VMEXIT is too late, you need to hook irq_enter(), which is what makes
> the whole thing so horrible.

We came up with a patch to do this as well. Currently testing it more and it
looks clean, will share it soon.

> > - Investigate the source of the overhead even when no tasks are tagged:
> >   https://lkml.org/lkml/2019/10/29/242
> 
>  - explain why we're all still doing this ....
> 
> Seriously, what actual problems does it solve? The patch-set still isn't
> L1TF complete and afaict it does exactly nothing for MDS.

The L1TF incompleteness is because of a cross-HT attack from a guest vCPU
attacker to an interrupt/softirq executing on the other sibling, correct?
Pausing the other sibling on IRQ entry should fix that (we will share the
patch in a future revision of the series after adequate testing).

> Like I've written many times now, back when the world was simpler and
> all we had to worry about was L1TF, core-scheduling made some sense, but
> how does it make sense today?

For ChromeOS we're planning to tag each and every task separately, except
for trusted processes, so we are isolating untrusted tasks even from each
other.

Sorry if this sounds like pushing my use case, but we do get a parallelism
advantage for the trusted tasks while still solving all security issues
(for ChromeOS). I agree that cross-HT user <-> kernel MDS is still an
issue if untrusted (tagged) tasks execute together on the same core, but
we are not planning to do that on our setup, at least.

> It's cute that this series sucks less than it did before, but what are
> we trading that performance for?

AIUI, recent postings of the series perform better than running with SMT
disabled (noht).

thanks,

 - Joel



* Re: [RFC PATCH 09/13] sched/fair: core wide vruntime comparison
  2020-04-15  4:07       ` Aaron Lu
@ 2020-04-15 21:24         ` Vineeth Remanan Pillai
  2020-04-17  9:40           ` Aaron Lu
  0 siblings, 1 reply; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-04-15 21:24 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Peter Zijlstra, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

> > > You forgot the time complexity analysis.
> >
> > This is a mistake and the adjust should be needed only once when core
> > scheduling is initially enabled. It is an initialization thing and there
> > is no reason to do it in every invocation of coresched_adjust_vruntime().
>
> Correction...
> I meant there is no need to call coresched_adjust_vruntime() in every
> invocation of update_core_cfs_rq_min_vruntime().

Due to the checks in place, update_core_cfs_rq_min_vruntime should
not be calling coresched_adjust_vruntime more than once between a
coresched enable/disable. Once the min_vruntime is adjusted, we depend
only on rq->core and the other sibling's min_vruntime will not grow
until coresched disable.

I did some micro benchmark tests today to verify this and observed
that coresched_adjust_vruntime called at most once between a coresched
enable/disable.

Thanks,
Vineeth


* Re: [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.
  2020-03-04 16:59 ` [RFC PATCH 07/13] sched: Add core wide task selection and scheduling vpillai
  2020-04-14 13:35   ` Peter Zijlstra
@ 2020-04-16  3:39   ` Chen Yu
  2020-04-16 19:59     ` Vineeth Remanan Pillai
  2020-04-17 11:18     ` Peter Zijlstra
  2020-05-21 23:14   ` Joel Fernandes
  2 siblings, 2 replies; 110+ messages in thread
From: Chen Yu @ 2020-04-16  3:39 UTC (permalink / raw)
  To: vpillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, linux-kernel, fweisbec, keescook,
	kerrnel, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Joel Fernandes, joel, Aaron Lu, Long Cui

On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective which logical
> CPU does the reschedule).
> 
> There could be races in core scheduler where a CPU is trying to pick
> a task for its sibling in core scheduler, when that CPU has just been
> offlined.  We should not schedule any tasks on the CPU in this case.
> Return an idle task in pick_next_task for this situation.
> 
> NOTE: there is still potential for siblings rivalry.
> NOTE: this is far too complicated; but thus far I've failed to
>       simplify it further.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
> Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
[cut]

Hi Vineeth,
A NULL pointer dereference was found when testing v5 on top of stable
v5.6.2. We tried the patch Peter suggested and the NULL pointer
dereference has not been seen so far. We don't know if this change would
help mitigate the symptom, but it should do no harm to test with this
fix applied.

Thanks,
Chenyu

From 6828eaf4611eeb3e1bad3b9a0d4ec53c6fa01fe3 Mon Sep 17 00:00:00 2001
From: Chen Yu <yu.c.chen@intel.com>
Date: Thu, 16 Apr 2020 10:51:07 +0800
Subject: [PATCH] sched: Fix pick_next_task() race condition in core scheduling

As Peter mentioned, commit 6e2df0581f56 ("sched: Fix pick_next_task()
vs 'change' pattern race") fixed a race condition caused by rq->lock
being improperly released after put_prev_task(); backport this fix to
core scheduling's pick_next_task() as well.

Without this fix, Aubrey, Long and I hit a NULL pointer dereference
within one hour when running RDT MBA (Intel Resource Director
Technology Memory Bandwidth Allocation) benchmarks on a 36-core (72 HT)
platform; the crash dereferences a NULL sched_entity:

[ 3618.429053] BUG: kernel NULL pointer dereference, address: 0000000000000160
[ 3618.429039] RIP: 0010:pick_task_fair+0x2e/0xa0
[ 3618.429042] RSP: 0018:ffffc90000317da8 EFLAGS: 00010046
[ 3618.429044] RAX: 0000000000000000 RBX: ffff88afdf4ad100 RCX: 0000000000000001
[ 3618.429045] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88afdf4ad100
[ 3618.429045] RBP: ffffc90000317dc0 R08: 0000000000000048 R09: 0100000000100000
[ 3618.429046] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[ 3618.429047] R13: 000000000002d080 R14: ffff88afdf4ad080 R15: 0000000000000014
[ 3618.429048]  ? pick_task_fair+0x48/0xa0
[ 3618.429048]  pick_next_task+0x34c/0x7e0
[ 3618.429049]  ? tick_program_event+0x44/0x70
[ 3618.429049]  __schedule+0xee/0x5d0
[ 3618.429050]  schedule_idle+0x2c/0x40
[ 3618.429051]  do_idle+0x175/0x280
[ 3618.429051]  cpu_startup_entry+0x1d/0x30
[ 3618.429052]  start_secondary+0x169/0x1c0
[ 3618.429052]  secondary_startup_64+0xa4/0xb0

With this patch applied, no NULL pointer dereference has been seen in
14 hours of testing so far. Although there is no direct evidence that
this fix solves the issue, it does fix a genuine race condition.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/core.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 02495d44870f..ef101a3ef583 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4477,9 +4477,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		return next;
 	}
 
-	prev->sched_class->put_prev_task(rq, prev);
-	if (!rq->nr_running)
-		newidle_balance(rq, rf);
+
+#ifdef CONFIG_SMP
+	for_class_range(class, prev->sched_class, &idle_sched_class) {
+		if (class->balance(rq, prev, rf))
+			break;
+	}
+#endif
+	put_prev_task(rq, prev);
 
 	smt_mask = cpu_smt_mask(cpu);
 
-- 
2.20.1



* Re: [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.
  2020-04-16  3:39   ` Chen Yu
@ 2020-04-16 19:59     ` Vineeth Remanan Pillai
  2020-04-17 11:18     ` Peter Zijlstra
  1 sibling, 0 replies; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-04-16 19:59 UTC (permalink / raw)
  To: Chen Yu
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li, Li, Aubrey,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Joel Fernandes, Joel Fernandes, Aaron Lu, Long Cui

> Hi Vineeth,
> A NULL pointer dereference was found when testing v5 on top of stable
> v5.6.2. We tried the patch Peter suggested and the NULL pointer
> dereference has not been seen so far. We don't know if this change would
> help mitigate the symptom, but it should do no harm to test with this
> fix applied.
>

Thanks for the patch Chenyu!

I was also looking into this as part of addressing Peter's comments. I
shall include this patch in v6 along with addressing all the issues
that Peter pointed out in this thread.

Thanks,
Vineeth


* Re: [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.
  2020-04-14 13:35   ` Peter Zijlstra
@ 2020-04-16 23:32     ` Tim Chen
  2020-04-17 10:57       ` Peter Zijlstra
  0 siblings, 1 reply; 110+ messages in thread
From: Tim Chen @ 2020-04-16 23:32 UTC (permalink / raw)
  To: Peter Zijlstra, vpillai
  Cc: Nishanth Aravamudan, Julien Desfossez, mingo, tglx, pjt,
	torvalds, linux-kernel, fweisbec, keescook, kerrnel, Phil Auld,
	Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel, Aaron Lu



On 4/14/20 6:35 AM, Peter Zijlstra wrote:
> On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
>> +static struct task_struct *
>> +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>> +{
>> +	struct task_struct *next, *max = NULL;
>> +	const struct sched_class *class;
>> +	const struct cpumask *smt_mask;
>> +	int i, j, cpu;
>> +	bool need_sync = false;
> 
> AFAICT that assignment is superfluous. Also, you violated the inverse
> x-mas tree.
> 
>> +
>> +	cpu = cpu_of(rq);
>> +	if (cpu_is_offline(cpu))
>> +		return idle_sched_class.pick_next_task(rq);
> 
> Are we actually hitting this one?
> 

I did hit this race when I was testing taking cpu offline and online,
which prompted the check of cpu being offline.

Tim



* Re: [RFC PATCH 09/13] sched/fair: core wide vruntime comparison
  2020-04-15 21:24         ` Vineeth Remanan Pillai
@ 2020-04-17  9:40           ` Aaron Lu
  2020-04-20  8:07             ` [PATCH updated] sched/fair: core wide cfs task priority comparison Aaron Lu
  0 siblings, 1 reply; 110+ messages in thread
From: Aaron Lu @ 2020-04-17  9:40 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Peter Zijlstra, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Wed, Apr 15, 2020 at 05:24:30PM -0400, Vineeth Remanan Pillai wrote:
> > > > You forgot the time complexity analysis.
> > >
> > > This is a mistake and the adjust should be needed only once when core
> > > scheduling is initially enabled. It is an initialization thing and there
> > > is no reason to do it in every invocation of coresched_adjust_vruntime().
> >
> > Correction...
> > I meant there is no need to call coresched_adjust_vruntime() in every
> > invocation of update_core_cfs_rq_min_vruntime().
> 
> Due to the checks in place, update_core_cfs_rq_min_vruntime should
> not be calling coresched_adjust_vruntime more than once between a
> coresched enable/disable. Once the min_vruntime is adjusted, we depend
> only on rq->core and the other sibling's min_vruntime will not grow
> until coresched disable.

OK, but I prefer to make it clear that this is initialization-only
stuff. Below is what I cooked; I also enhanced the changelog while at
it.

From e80121e61953da717da074ea2a097194f6d29ef4 Mon Sep 17 00:00:00 2001
From: Aaron Lu <ziqian.lzq@antfin.com>
Date: Thu, 25 Jul 2019 22:32:48 +0800
Subject: [PATCH] sched/fair: core vruntime comparison

This patch provides a vruntime based way to compare two cfs tasks'
priority, be it on the same cpu or on different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common
cfs_rq both sched_entities are on and then do the comparison.

When the two tasks are on different threads of the same core, each thread
will choose its next task to run the usual way, and then the root level
sched entities which the two tasks belong to will be used to decide
which task runs next core wide.

An illustration for the cross CPU case:

   cpu0         cpu1
 /   |  \     /   |  \
se1 se2 se3  se4 se5 se6
    /  \            /   \
  se21 se22       se61  se62
  (A)                    /
                       se621
                        (B)

Assume CPU0 and CPU1 are SMT siblings, cpu0 has decided to run task A
next and cpu1 has decided to run task B next. To compare the priority
of task A and task B, we compare the priority of se2 and se6; the one
with the smaller vruntime wins.

To make this work, the root level sched entities' vruntime of the two
threads must be directly comparable. So a new core wide cfs_rq
min_vruntime is introduced to serve the purpose of normalizing these
root level sched entities' vruntime.

Sub cfs_rqs and their sched entities do not take part in the cross cpu
priority comparison, as they only participate in the usual cpu local
scheduling decisions, so there is no need to normalize their vruntimes.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/core.c  |  24 ++++------
 kernel/sched/fair.c  | 101 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   3 ++
 3 files changed, 111 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5f322922f5ae..d6c8c76cb07a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -119,19 +119,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -291,8 +280,13 @@ static int __sched_core_stopper(void *data)
 	}
 
 	for_each_online_cpu(cpu) {
-		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
-			cpu_rq(cpu)->core_enabled = enabled;
+		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+			struct rq *rq = cpu_rq(cpu);
+
+			rq->core_enabled = enabled;
+			if (rq->core == rq)
+				sched_core_adjust_se_vruntime(cpu);
+		}
 	}
 
 	return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d99ea6ee7af2..7eecf590d6c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -449,9 +449,103 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->core->cfs;
+}
+
 static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->min_vruntime;
+	if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+		return cfs_rq->min_vruntime;
+
+	return core_cfs_rq(cfs_rq)->min_vruntime;
+}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	bool samecpu = task_cpu(a) == task_cpu(b);
+	s64 delta;
+
+	if (samecpu) {
+		/* vruntime is per cfs_rq */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+		goto out;
+	}
+
+	/* crosscpu: compare root level se's vruntime to decide priority */
+	while (sea->parent)
+		sea = sea->parent;
+	while (seb->parent)
+		seb = seb->parent;
+	delta = (s64)(sea->vruntime - seb->vruntime);
+
+out:
+	return delta > 0;
+}
+
+/*
+ * This is called in stop machine context so no need to take the rq lock.
+ *
+ * Core scheduling is going to be enabled and the root level sched entities
+ * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
+ * min_vruntime, so it's necessary to normalize vruntime of existing root
+ * level sched entities in sibling_cfs_rq.
+ *
+ * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
+ * only using cfs_rq->min_vruntime during the entire run of core scheduling.
+ */
+void sched_core_adjust_se_vruntime(int cpu)
+{
+	int i;
+
+	for_each_cpu(i, cpu_smt_mask(cpu)) {
+		struct cfs_rq *cfs_rq, *sibling_cfs_rq;
+		struct sched_entity *se, *next;
+		s64 delta;
+
+		if (i == cpu)
+			continue;
+
+		sibling_cfs_rq = &cpu_rq(i)->cfs;
+		if (!sibling_cfs_rq->nr_running)
+			continue;
+
+		cfs_rq = &cpu_rq(cpu)->cfs;
+		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
+		/*
+		 * XXX Malicious user can create a ton of runnable tasks in root
+		 * sibling_cfs_rq and cause the below vruntime normalization
+		 * potentially taking a long time.
+		 */
+		rbtree_postorder_for_each_entry_safe(se, next,
+				&sibling_cfs_rq->tasks_timeline.rb_root,
+				run_node) {
+			se->vruntime += delta;
+		}
+	}
 }
 
 static __always_inline
@@ -509,8 +603,11 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 			vruntime = min_vruntime(vruntime, se->vruntime);
 	}
 
+	if (sched_core_enabled(rq_of(cfs_rq)) && is_root_cfs_rq(cfs_rq))
+		cfs_rq = core_cfs_rq(cfs_rq);
+
 	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
+	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50a5675e941a..24bae760f764 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2594,3 +2594,6 @@ static inline void membarrier_switch_mm(struct rq *rq,
 {
 }
 #endif
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+void sched_core_adjust_se_vruntime(int cpu);
-- 
2.19.1.3.ge56e4f7



* Re: [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.
  2020-04-16 23:32     ` Tim Chen
@ 2020-04-17 10:57       ` Peter Zijlstra
  0 siblings, 0 replies; 110+ messages in thread
From: Peter Zijlstra @ 2020-04-17 10:57 UTC (permalink / raw)
  To: Tim Chen
  Cc: vpillai, Nishanth Aravamudan, Julien Desfossez, mingo, tglx, pjt,
	torvalds, linux-kernel, fweisbec, keescook, kerrnel, Phil Auld,
	Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel, Aaron Lu

On Thu, Apr 16, 2020 at 04:32:28PM -0700, Tim Chen wrote:
> 
> 
> On 4/14/20 6:35 AM, Peter Zijlstra wrote:
> > On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
> >> +static struct task_struct *
> >> +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >> +{
> >> +	struct task_struct *next, *max = NULL;
> >> +	const struct sched_class *class;
> >> +	const struct cpumask *smt_mask;
> >> +	int i, j, cpu;
> >> +	bool need_sync = false;
> > 
> > AFAICT that assignment is superfluous. Also, you violated the inverse
> > x-mas tree.
> > 
> >> +
> >> +	cpu = cpu_of(rq);
> >> +	if (cpu_is_offline(cpu))
> >> +		return idle_sched_class.pick_next_task(rq);
> > 
> > Are we actually hitting this one?
> > 
> 
> I did hit this race when I was testing taking cpu offline and online,
> which prompted the check of cpu being offline.

This is the schedule from the stop task to the idle task I presume,
there should really not be any other. And at that point the rq had
better be empty, so why didn't the normal task selection work?


* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-04-15 16:32   ` Joel Fernandes
@ 2020-04-17 11:12     ` Peter Zijlstra
  2020-04-17 12:35       ` Alexander Graf
  2020-04-18  2:25       ` Joel Fernandes
  0 siblings, 2 replies; 110+ messages in thread
From: Peter Zijlstra @ 2020-04-17 11:12 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: vpillai, Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo,
	tglx, pjt, torvalds, linux-kernel, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

On Wed, Apr 15, 2020 at 12:32:20PM -0400, Joel Fernandes wrote:
> On Tue, Apr 14, 2020 at 04:21:52PM +0200, Peter Zijlstra wrote:
> > On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> > > TODO
> > > ----
> > > - Work on merging patches that are ready to be merged
> > > - Decide on the API for exposing the feature to userland
> > > - Experiment with adding synchronization points in VMEXIT to mitigate
> > >   the VM-to-host-kernel leaking
> > 
> > VMEXIT is too late, you need to hook irq_enter(), which is what makes
> > the whole thing so horrible.
> 
> We came up with a patch to do this as well. Currently testing it more and it
> looks clean, will share it soon.

Thomas said we actually first do VMEXIT, and then enable interrupts. So
the VMEXIT thing should actually work, and that is indeed much saner
than sticking it in irq_enter().

It does however put yet more nails in the out-of-tree hypervisors.

> > > - Investigate the source of the overhead even when no tasks are tagged:
> > >   https://lkml.org/lkml/2019/10/29/242
> > 
> >  - explain why we're all still doing this ....
> > 
> > Seriously, what actual problems does it solve? The patch-set still isn't
> > L1TF complete and afaict it does exactly nothing for MDS.
> 
> The L1TF incompleteness is because of cross-HT attack from Guest vCPU
> attacker to an interrupt/softirq executing on the other sibling correct? The
> IRQ enter pausing the other sibling should fix that (which we will share in
> a future series revision after adequate testing).

Correct, the vCPU still running can glean host (kernel) state from the
sibling handling the interrupt in the host kernel.

> > Like I've written many times now, back when the world was simpler and
> > all we had to worry about was L1TF, core-scheduling made some sense, but
> > how does it make sense today?
> 
> For ChromeOS we're planning to tag each and every task separately except for
> trusted processes, so we are isolating untrusted tasks even from each other.
> 
> Sorry if this sounds like pushing my usecase, but we do get parallelism
> advantage for the trusted tasks while still solving all security issues (for
> ChromeOS). I agree that cross-HT user <-> kernel MDS is still an issue if
> untrusted (tagged) tasks execute together on same core, but we are not
> planning to do that on our setup at least.

That doesn't completely solve things I think. Even if you run all
untrusted tasks as core exclusive, you still have a problem of them vs
interrupts on the other sibling.

You need to somehow arrange all interrupts to the core happen on the
same sibling that runs your untrusted task, such that the VERW on
return-to-userspace works as intended.

I suppose you can try and play funny games with interrupt routing tied
to the force-idle state, but I'm dreading what that'll look like. Or
were you going to handle this from your irq_enter() thing too?


Can someone go write up a document that very clearly states all the
problems and clearly explains how to use this feature?




* Re: [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.
  2020-04-16  3:39   ` Chen Yu
  2020-04-16 19:59     ` Vineeth Remanan Pillai
@ 2020-04-17 11:18     ` Peter Zijlstra
  2020-04-19 15:31       ` Chen Yu
  1 sibling, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2020-04-17 11:18 UTC (permalink / raw)
  To: Chen Yu
  Cc: vpillai, Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo,
	tglx, pjt, torvalds, linux-kernel, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel,
	Aaron Lu, Long Cui

On Thu, Apr 16, 2020 at 11:39:05AM +0800, Chen Yu wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 02495d44870f..ef101a3ef583 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4477,9 +4477,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  		return next;
>  	}
>  
> -	prev->sched_class->put_prev_task(rq, prev);
> -	if (!rq->nr_running)
> -		newidle_balance(rq, rf);
> +
> +#ifdef CONFIG_SMP
> +	for_class_range(class, prev->sched_class, &idle_sched_class) {
> +		if (class->balance(rq, prev, rf))
> +			break;
> +	}
> +#endif
> +	put_prev_task(rq, prev);
>  
>  	smt_mask = cpu_smt_mask(cpu);

Instead of duplicating that, how about you put the existing copy in a
function to share? finish_prev_task() perhaps?

Also, can you please make newidle_balance() static again; I forgot doing
that in 6e2df0581f56, which would've made you notice this sooner I
suppose.


* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-04-17 11:12     ` Peter Zijlstra
@ 2020-04-17 12:35       ` Alexander Graf
  2020-04-17 13:08         ` Peter Zijlstra
  2020-04-18  2:25       ` Joel Fernandes
  1 sibling, 1 reply; 110+ messages in thread
From: Alexander Graf @ 2020-04-17 12:35 UTC (permalink / raw)
  To: Peter Zijlstra, Joel Fernandes
  Cc: vpillai, Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo,
	tglx, pjt, torvalds, linux-kernel, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

On 17.04.20 13:12, Peter Zijlstra wrote:
> On Wed, Apr 15, 2020 at 12:32:20PM -0400, Joel Fernandes wrote:
>> On Tue, Apr 14, 2020 at 04:21:52PM +0200, Peter Zijlstra wrote:
>>> On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
>>>> TODO
>>>> ----
>>>> - Work on merging patches that are ready to be merged
>>>> - Decide on the API for exposing the feature to userland
>>>> - Experiment with adding synchronization points in VMEXIT to mitigate
>>>>    the VM-to-host-kernel leaking
>>>
>>> VMEXIT is too late, you need to hook irq_enter(), which is what makes
>>> the whole thing so horrible.
>>
>> We came up with a patch to do this as well. Currently testing it more and it
>> looks clean, will share it soon.
> 
> Thomas said we actually first do VMEXIT, and then enable interrupts. So
> the VMEXIT thing should actually work, and that is indeed much saner
> than sticking it in irq_enter().

If we first kick out the sibling HT for every #VMEXIT, performance will 
be abysmal, no?

I know of a few options to make this work without the big hammer:


   1) Leave interrupts disabled on "fast-path" exits. This can become 
very hard to grasp very quickly.

   2) Patch the IRQ handlers (or build something more generic that 
installs a trampoline on all IRQ handler installations)

   3) Ignore IRQ data exposure (what could possibly go wrong, it's not 
like your IRQ handler reads secret data from the network, right)

   4) Create a "safe" page table which runs with HT enabled. Any access 
outside of the "safe" zone disables the sibling and switches to the 
"full" kernel page table. This should prevent any secret data to be 
fetched into caches/core buffers.

   5) Create a KVM specific "safe zone": Keep improving the ASI patches 
and make only the ASI environment safe for HT, everything else not.

Has there been any progress on 4? It sounded like the most generic 
option ...

> 
> It does however put yet more nails in the out-of-tree hypervisors.
> 
>>>> - Investigate the source of the overhead even when no tasks are tagged:
>>>>    https://lkml.org/lkml/2019/10/29/242
>>>
>>>   - explain why we're all still doing this ....
>>>
>>> Seriously, what actual problems does it solve? The patch-set still isn't
>>> L1TF complete and afaict it does exactly nothing for MDS.
>>
>> The L1TF incompleteness is because of cross-HT attack from Guest vCPU
>> attacker to an interrupt/softirq executing on the other sibling correct? The
>> IRQ enter pausing the other sibling should fix that (which we will share in
>> a future series revision after adequate testing).
> 
> Correct, the vCPU still running can glean host (kernel) state from the
> sibling handling the interrupt in the host kernel.
> 
>>> Like I've written many times now, back when the world was simpler and
>>> all we had to worry about was L1TF, core-scheduling made some sense, but
>>> how does it make sense today?
>>
>> For ChromeOS we're planning to tag each and every task separately except for
>> trusted processes, so we are isolating untrusted tasks even from each other.
>>
>> Sorry if this sounds like pushing my usecase, but we do get parallelism
>> advantage for the trusted tasks while still solving all security issues (for
>> ChromeOS). I agree that cross-HT user <-> kernel MDS is still an issue if
>> untrusted (tagged) tasks execute together on same core, but we are not
>> planning to do that on our setup at least.
> 
> That doesn't completely solve things I think. Even if you run all
> untrusted tasks as core exclusive, you still have a problem of them vs
> interrupts on the other sibling.
 >
> You need to somehow arrange all interrupts to the core happen on the
> same sibling that runs your untrusted task, such that the VERW on
> return-to-userspace works as intended.
> 
> I suppose you can try and play funny games with interrupt routing tied
> to the force-idle state, but I'm dreading what that'll look like. Or
> were you going to handle this from your irq_enter() thing too?

I'm not sure I follow. We have thread local interrupts (timers, IPIs) 
and device interrupts (network, block, etc).

Thread local ones shouldn't transfer too much knowledge, so I'd be 
inclined to say we can just ignore that attack vector.

Device interrupts we can easily route to HT0. If we now make "core 
exclusive" a synonym for "always run on HT0", we can guarantee that they 
always land on the same CPU, no?

Then you don't need to hook into any idle state tracking, because you 
always know which CPU the "safe" one to both schedule tasks and route 
interrupts to is.


Alex








* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-04-17 12:35       ` Alexander Graf
@ 2020-04-17 13:08         ` Peter Zijlstra
  0 siblings, 0 replies; 110+ messages in thread
From: Peter Zijlstra @ 2020-04-17 13:08 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Joel Fernandes, vpillai, Nishanth Aravamudan, Julien Desfossez,
	Tim Chen, mingo, tglx, pjt, torvalds, linux-kernel, fweisbec,
	keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Fri, Apr 17, 2020 at 02:35:38PM +0200, Alexander Graf wrote:
> On 17.04.20 13:12, Peter Zijlstra wrote:

> If we first kick out the sibling HT for every #VMEXIT, performance will be
> abysmal, no?

I've been given to understand that people serious about virt try really
hard to avoid VMEXIT.


> > That doesn't completely solve things I think. Even if you run all
> > untrusted tasks as core exclusive, you still have a problem of them vs
> > interrupts on the other sibling.
> >
> > You need to somehow arrange all interrupts to the core happen on the
> > same sibling that runs your untrusted task, such that the VERW on
> > return-to-userspace works as intended.
> > 
> > I suppose you can try and play funny games with interrupt routing tied
> > to the force-idle state, but I'm dreading what that'll look like. Or
> > were you going to handle this from your irq_enter() thing too?
> 
> I'm not sure I follow. We have thread local interrupts (timers, IPIs) and
> device interrupts (network, block, etc).
> 
> Thread local ones shouldn't transfer too much knowledge, so I'd be inclined
> to say we can just ignore that attack vector.
> 
> Device interrupts we can easily route to HT0. If we now make "core
> exclusive" a synonym for "always run on HT0", we can guarantee that they
> always land on the same CPU, no?
> 
> Then you don't need to hook into any idle state tracking, because you always
> know which CPU the "safe" one to both schedule tasks and route interrupts to
> is.

That would come apart most mighty when someone does an explicit
sched_setaffinity() for !HT0.

While that might work for some relatively contained systems like
chromeos, it will not work in general I think.


* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-04-17 11:12     ` Peter Zijlstra
  2020-04-17 12:35       ` Alexander Graf
@ 2020-04-18  2:25       ` Joel Fernandes
  1 sibling, 0 replies; 110+ messages in thread
From: Joel Fernandes @ 2020-04-18  2:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: vpillai, Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo,
	tglx, pjt, torvalds, linux-kernel, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

Hi Peter,

On Fri, Apr 17, 2020 at 01:12:55PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 15, 2020 at 12:32:20PM -0400, Joel Fernandes wrote:
> > On Tue, Apr 14, 2020 at 04:21:52PM +0200, Peter Zijlstra wrote:
> > > On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> > > > TODO
> > > > ----
> > > > - Work on merging patches that are ready to be merged
> > > > - Decide on the API for exposing the feature to userland
> > > > - Experiment with adding synchronization points in VMEXIT to mitigate
> > > >   the VM-to-host-kernel leaking
> > > 
> > > VMEXIT is too late, you need to hook irq_enter(), which is what makes
> > > the whole thing so horrible.
> > 
> > We came up with a patch to do this as well. Currently testing it more and it
> > looks clean, will share it soon.
> 
> Thomas said we actually first do VMEXIT, and then enable interrupts. So
> the VMEXIT thing should actually work, and that is indeed much saner
> than sticking it in irq_enter().
> 
> It does however put yet more nails in the out-of-tree hypervisors.

Just to clarify what we're talking about here. The condition we are trying to
protect against:

1. VM is malicious.
2. Sibling of VM is entering an interrupt on the host.
3. When we enter the interrupt, we send an IPI to force the VM into waiting.
4. The VM on the sibling enters VMEXIT.

In step 4, we have to synchronize. Is this the scenario we are discussing?

> > > > - Investigate the source of the overhead even when no tasks are tagged:
> > > >   https://lkml.org/lkml/2019/10/29/242
> > > 
> > >  - explain why we're all still doing this ....
> > > 
> > > Seriously, what actual problems does it solve? The patch-set still isn't
> > > L1TF complete and afaict it does exactly nothing for MDS.
> > 
> > The L1TF incompleteness is because of cross-HT attack from Guest vCPU
> > attacker to an interrupt/softirq executing on the other sibling correct? The
> > IRQ enter pausing the other sibling should fix that (which we will share in
> > a future series revision after adequate testing).
> 
> Correct, the vCPU still running can glean host (kernel) state from the
> sibling handling the interrupt in the host kernel.

Right. This is what we're handling.

> > > Like I've written many times now, back when the world was simpler and
> > > all we had to worry about was L1TF, core-scheduling made some sense, but
> > > how does it make sense today?
> > 
> > For ChromeOS we're planning to tag each and every task separately except for
> > trusted processes, so we are isolating untrusted tasks even from each other.
> > 
> > Sorry if this sounds like pushing my usecase, but we do get parallelism
> > advantage for the trusted tasks while still solving all security issues (for
> > ChromeOS). I agree that cross-HT user <-> kernel MDS is still an issue if
> > untrusted (tagged) tasks execute together on same core, but we are not
> > planning to do that on our setup at least.
> 
> That doesn't completely solve things I think. Even if you run all
> untrusted tasks as core exclusive, you still have a problem of them vs
> interrupts on the other sibling.
> 
> You need to somehow arrange all interrupts to the core happen on the
> same sibling that runs your untrusted task, such that the VERW on
> return-to-userspace works as intended.
> 
> I suppose you can try and play funny games with interrupt routing tied
> to the force-idle state, but I'm dreading what that'll look like. Or
> were you going to handle this from your irq_enter() thing too?

Yes, even when host interrupt is on one sibling and the untrusted host
process is on the other sibling, we would be handling it the same way we
handle it for host interrupts vs untrusted guests. Perhaps we could optimize
pausing of the guest. But Vineeth tested and found that the same code that
pauses hosts also works for guests.

> Can someone go write up a document that very clearly states all the
> problems and clearly explains how to use this feature?

A document on how to use the feature would make sense. Maybe we can add
it as a documentation patch to the series?

Basically, from my notes the following are the problems:

Core-scheduling will help with cross-HT MDS and L1TF attacks. The following
are the scenarios (borrowed from an email from Thomas -- thanks!):

        HT1 (attack)            HT2 (victim)

 A      idle -> user space      user space -> idle

 B      idle -> user space      guest -> idle

 C      idle -> guest           user space -> idle

 D      idle -> guest           guest -> idle
 
All of them suffer from MDS. #C and #D suffer from L1TF.

All of these scenarios result in the victim getting idled to prevent any
leakage. However, this does not address the case where the victim is an
interrupt handler or softirq. So we need to either route interrupts to run
on the attacker CPU, or have irq_enter() send IPIs to pause the sibling,
which is what we prototyped. Another approach to solve this is to force
interrupts into threaded mode, as Thomas suggested.

The problematic usecase left then (if we ignore IRQ troubles) is MDS
issues between user <-> kernel simultaneous execution on both siblings.
This does not become an issue on ChromeOS, where everything untrusted has
its own tag. From my discussions with Vineeth, this isn't a problem for
his VM workloads either: he mentioned that both hyperthreads run guests,
and that 2 vCPU threads belonging to different VMs will not execute on the
same core (though I'm not sure whether hypercalls from a vCPU while the
sibling is running another vCPU of the same VM are a concern here).

Any other case that needs to be considered?

thanks,

 - Joel



* Re: [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.
  2020-04-17 11:18     ` Peter Zijlstra
@ 2020-04-19 15:31       ` Chen Yu
  0 siblings, 0 replies; 110+ messages in thread
From: Chen Yu @ 2020-04-19 15:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: vpillai, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux Kernel Mailing List, fweisbec, Kees Cook, Greg Kerr,
	Phil Auld, Aaron Lu, Aubrey Li, Aubrey Li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel,
	Aaron Lu, Long Cui

On Fri, Apr 17, 2020 at 7:18 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Apr 16, 2020 at 11:39:05AM +0800, Chen Yu wrote:
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 02495d44870f..ef101a3ef583 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4477,9 +4477,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >               return next;
> >       }
> >
> > -     prev->sched_class->put_prev_task(rq, prev);
> > -     if (!rq->nr_running)
> > -             newidle_balance(rq, rf);
> > +
> > +#ifdef CONFIG_SMP
> > +     for_class_range(class, prev->sched_class, &idle_sched_class) {
> > +             if (class->balance(rq, prev, rf))
> > +                     break;
> > +     }
> > +#endif
> > +     put_prev_task(rq, prev);
> >
> >       smt_mask = cpu_smt_mask(cpu);
>
> Instead of duplicating that, how about you put the existing copy in a
> function to share? finish_prev_task() perhaps?
>
> Also, can you please make newidle_balance() static again; I forgot doing
> that in 6e2df0581f56, which would've made you notice this sooner I
> suppose.
Okay, I'll do that,

Thanks,
Chenyu


* [PATCH updated] sched/fair: core wide cfs task priority comparison
  2020-04-17  9:40           ` Aaron Lu
@ 2020-04-20  8:07             ` Aaron Lu
  2020-04-20 22:26               ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 110+ messages in thread
From: Aaron Lu @ 2020-04-20  8:07 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Peter Zijlstra, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Fri, Apr 17, 2020 at 05:40:45PM +0800, Aaron Lu wrote:
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -291,8 +280,13 @@ static int __sched_core_stopper(void *data)
>  	}
>  
>  	for_each_online_cpu(cpu) {
> -		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
> -			cpu_rq(cpu)->core_enabled = enabled;
> +		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
> +			struct rq *rq = cpu_rq(cpu);
> +
> +			rq->core_enabled = enabled;
> +			if (rq->core == rq)
> +				sched_core_adjust_se_vruntime(cpu);

The adjustment is only needed when core scheduling is enabled, but I
mistakenly called it on both enable and disable. I have also come to
think that normalize is a better name than adjust.

> +		}
>  	}
>  
>  	return 0;

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d99ea6ee7af2..7eecf590d6c0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> +void sched_core_adjust_se_vruntime(int cpu)
> +{
> +	int i;
> +
> +	for_each_cpu(i, cpu_smt_mask(cpu)) {
> +		struct cfs_rq *cfs_rq, *sibling_cfs_rq;
> +		struct sched_entity *se, *next;
> +		s64 delta;
> +
> +		if (i == cpu)
> +			continue;
> +
> +		sibling_cfs_rq = &cpu_rq(i)->cfs;
> +		if (!sibling_cfs_rq->nr_running)
> +			continue;
> +
> +		cfs_rq = &cpu_rq(cpu)->cfs;
> +		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
> +		/*
> +		 * XXX Malicious user can create a ton of runnable tasks in root
> +		 * sibling_cfs_rq and cause the below vruntime normalization
> +		 * potentially taking a long time.
> +		 */

Testing on a qemu/kvm VM shows that normalizing 32268 sched entities
takes about 6ms, so I think the risk is low; therefore, I'm going to
remove the XXX comment.

(I disabled CONFIG_SCHED_AUTOGROUP and started 32268 cpuhog tasks on one
cpu using taskset, adding trace_printk() before and after the below loop
gives me:
migration/0-11    [000] d..1   674.546882: sched_core_normalize_se_vruntime: cpu5: normalize nr_running=32268
migration/0-11    [000] d..1   674.552364: sched_core_normalize_se_vruntime: cpu5: normalize done
)

> +		rbtree_postorder_for_each_entry_safe(se, next,
> +				&sibling_cfs_rq->tasks_timeline.rb_root,
> +				run_node) {
> +			se->vruntime += delta;
> +		}
> +	}
>  }
>  
>  static __always_inline

I also think the point of the patch is not to make every sched entity's
vruntime core wide, but to make core wide priority comparison possible
for cfs tasks, so I changed the subject. Here is the updated patch:

From d045030074247faf3b515fab21ac06236ce4bd74 Mon Sep 17 00:00:00 2001
From: Aaron Lu <ziqian.lzq@antfin.com>
Date: Mon, 20 Apr 2020 10:27:17 +0800
Subject: [PATCH] sched/fair: core wide cfs task priority comparison

This patch provides a vruntime based way to compare two cfs tasks'
priority, be they on the same cpu or on different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common
cfs_rq both sched_entities are on and then do the comparison.

When the two tasks are on different threads of the same core, each thread
will choose the next task to run the usual way and then the root level
sched entities which the two tasks belong to will be used to decide
which task runs next core wide.

An illustration for the cross CPU case:

   cpu0         cpu1
 /   |  \     /   |  \
se1 se2 se3  se4 se5 se6
    /  \            /   \
  se21 se22       se61  se62
  (A)                    /
                       se621
                        (B)

Assume CPU0 and CPU1 are smt siblings and cpu0 has decided task A to
run next and cpu1 has decided task B to run next. To compare priority
of task A and B, we compare the priority of se2 and se6: whichever has
the smaller vruntime wins.

To make this work, the root level sched entities' vruntime of the two
threads must be directly comparable. So one hyperthread's root
cfs_rq min_vruntime is chosen as the core wide one, and all root level
sched entities' vruntimes are normalized against it.

Sub cfs_rqs and their sched entities are not involved in cross cpu
priority comparison as they only participate in the usual cpu local
scheduling decisions, so there is no need to normalize their vruntimes.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/core.c  | 24 +++++------
 kernel/sched/fair.c  | 96 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  3 ++
 3 files changed, 106 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5f322922f5ae..059add9a89ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -119,19 +119,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -291,8 +280,13 @@ static int __sched_core_stopper(void *data)
 	}
 
 	for_each_online_cpu(cpu) {
-		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
-			cpu_rq(cpu)->core_enabled = enabled;
+		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+			struct rq *rq = cpu_rq(cpu);
+
+			rq->core_enabled = enabled;
+			if (enabled && rq->core == rq)
+				sched_core_normalize_se_vruntime(cpu);
+		}
 	}
 
 	return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d99ea6ee7af2..1b87d0c8b9ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -449,9 +449,98 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->core->cfs;
+}
+
 static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->min_vruntime;
+	if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+		return cfs_rq->min_vruntime;
+
+	return core_cfs_rq(cfs_rq)->min_vruntime;
+}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	bool samecpu = task_cpu(a) == task_cpu(b);
+	s64 delta;
+
+	if (samecpu) {
+		/* vruntime is per cfs_rq */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+		goto out;
+	}
+
+	/* crosscpu: compare root level se's vruntime to decide priority */
+	while (sea->parent)
+		sea = sea->parent;
+	while (seb->parent)
+		seb = seb->parent;
+	delta = (s64)(sea->vruntime - seb->vruntime);
+
+out:
+	return delta > 0;
+}
+
+/*
+ * This is called in stop machine context so no need to take the rq lock.
+ *
+ * Core scheduling is going to be enabled and the root level sched entities
+ * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
+ * min_vruntime, so it's necessary to normalize vruntime of existing root
+ * level sched entities in sibling_cfs_rq.
+ *
+ * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
+ * only using cfs_rq->min_vruntime during the entire run of core scheduling.
+ */
+void sched_core_normalize_se_vruntime(int cpu)
+{
+	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+	int i;
+
+	for_each_cpu(i, cpu_smt_mask(cpu)) {
+		struct sched_entity *se, *next;
+		struct cfs_rq *sibling_cfs_rq;
+		s64 delta;
+
+		if (i == cpu)
+			continue;
+
+		sibling_cfs_rq = &cpu_rq(i)->cfs;
+		if (!sibling_cfs_rq->nr_running)
+			continue;
+
+		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
+		rbtree_postorder_for_each_entry_safe(se, next,
+				&sibling_cfs_rq->tasks_timeline.rb_root,
+				run_node) {
+			se->vruntime += delta;
+		}
+	}
 }
 
 static __always_inline
@@ -509,8 +598,11 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 			vruntime = min_vruntime(vruntime, se->vruntime);
 	}
 
+	if (sched_core_enabled(rq_of(cfs_rq)) && is_root_cfs_rq(cfs_rq))
+		cfs_rq = core_cfs_rq(cfs_rq);
+
 	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
+	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50a5675e941a..d8f0eb7f6e42 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2594,3 +2594,6 @@ static inline void membarrier_switch_mm(struct rq *rq,
 {
 }
 #endif
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+void sched_core_normalize_se_vruntime(int cpu);
-- 
2.19.1.3.ge56e4f7



* Re: [PATCH updated] sched/fair: core wide cfs task priority comparison
  2020-04-20  8:07             ` [PATCH updated] sched/fair: core wide cfs task priority comparison Aaron Lu
@ 2020-04-20 22:26               ` Vineeth Remanan Pillai
  2020-04-21  2:51                 ` Aaron Lu
  0 siblings, 1 reply; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-04-20 22:26 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Peter Zijlstra, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Mon, Apr 20, 2020 at 4:08 AM Aaron Lu <aaron.lwe@gmail.com> wrote:
>
> On Fri, Apr 17, 2020 at 05:40:45PM +0800, Aaron Lu wrote:

> The adjust is only needed when core scheduling is enabled while I
> mistakenly called it on both enable and disable. And I come to think
> normalize is a better name than adjust.
>
I guess we would also need to update the min_vruntime of the sibling
to match the rq->core->min_vruntime on coresched disable. Otherwise
a new enqueue on the sibling's root cfs would inherit the very old
min_vruntime from before coresched was enabled, and would thus starve all
the already queued tasks until the newly enqueued se's vruntime catches up.

Other than that, I think the patch looks good. We haven't tested it
yet. Will do a round of testing and let you know soon.

Thanks,
Vineeth


* Re: [PATCH updated] sched/fair: core wide cfs task priority comparison
  2020-04-20 22:26               ` Vineeth Remanan Pillai
@ 2020-04-21  2:51                 ` Aaron Lu
  2020-04-24 14:24                   ` [PATCH updated v2] " Aaron Lu
  0 siblings, 1 reply; 110+ messages in thread
From: Aaron Lu @ 2020-04-21  2:51 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Peter Zijlstra, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Mon, Apr 20, 2020 at 06:26:34PM -0400, Vineeth Remanan Pillai wrote:
> On Mon, Apr 20, 2020 at 4:08 AM Aaron Lu <aaron.lwe@gmail.com> wrote:
> >
> > On Fri, Apr 17, 2020 at 05:40:45PM +0800, Aaron Lu wrote:
> 
> > The adjust is only needed when core scheduling is enabled while I
> > mistakenly called it on both enable and disable. And I come to think
> > normalize is a better name than adjust.
> >
> I guess we would also need to update the min_vruntime of the sibling
> to match the rq->core->min_vruntime on coresched disable. Otherwise
> a new enqueue on root cfs of the sibling would inherit the very old
> min_vruntime before coresched enable and thus would starve all the
> already queued tasks until the newly enqueued se's vruntime catches up.

Yes this is a concern but AFAICS, there is no problem. Consider:
- when there is no queued task across the disable boundary, the stale
  min_vruntime doesn't matter, as you said;
- when there are queued tasks across the disable boundary, the newly
  queued task will normalize its vruntime against the sibling_cfs_rq's
  min_vruntime; if that min_vruntime is stale, a problem could occur.
  But my reading of the code suggests this min_vruntime should
  have already been updated by update_curr() in enqueue_entity() before
  being used by the newly enqueued task, and update_curr() would bring
  the stale min_vruntime up to the smallest vruntime of the queued ones,
  so again, no problem should occur.

I have done a simple test locally before sending the patch out and didn't
find any problem but maybe I failed to hit the race window. Let me know
if I misunderstood something.
 
> Other than that, I think the patch looks good. We haven't tested it
> yet. Will do a round of testing and let you know soon.

Thanks.


* [PATCH updated v2] sched/fair: core wide cfs task priority comparison
  2020-04-21  2:51                 ` Aaron Lu
@ 2020-04-24 14:24                   ` Aaron Lu
  2020-05-06 14:35                     ` Peter Zijlstra
  0 siblings, 1 reply; 110+ messages in thread
From: Aaron Lu @ 2020-04-24 14:24 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Peter Zijlstra, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Tue, Apr 21, 2020 at 10:51:31AM +0800, Aaron Lu wrote:
> On Mon, Apr 20, 2020 at 06:26:34PM -0400, Vineeth Remanan Pillai wrote:
> > On Mon, Apr 20, 2020 at 4:08 AM Aaron Lu <aaron.lwe@gmail.com> wrote:
> > >
> > > On Fri, Apr 17, 2020 at 05:40:45PM +0800, Aaron Lu wrote:
> > 
> > > The adjust is only needed when core scheduling is enabled while I
> > > mistakenly called it on both enable and disable. And I come to think
> > > normalize is a better name than adjust.
> > >
> > I guess we would also need to update the min_vruntime of the sibling
> > to match the rq->core->min_vruntime on coresched disable. Otherwise
> > a new enqueue on root cfs of the sibling would inherit the very old
> > min_vruntime before coresched enable and thus would starve all the
> > already queued tasks until the newly enqueued se's vruntime catches up.
> 
> Yes this is a concern but AFAICS, there is no problem. Consider:
> - when there is no queued task across the disable boundary, the stale
>   min_vruntime doesn't matter as you said;
> - when there are queued tasks across the disable boundary, the newly
>   queued task will normalize its vruntime against the sibling_cfs_rq's
>   min_vruntime, if the min_vruntime is stale and problem would occur.
>   But my reading of the code made me think this min_vruntime should
>   have already been updated by update_curr() in enqueue_entity() before
>   being used by this newly enqueued task and update_curr() would bring
>   the stale min_vruntime to the smallest vruntime of the queued ones so
>   again, no problem should occur.

After discussion with Vineeth, I now tend to add the syncing of
sibling_cfs_rq min_vruntime on core disable: analysing all the
code paths is time consuming, and though I didn't find any problems now,
I might have missed something, and future code changes may also break
the expectations, so adding it seems the safe thing to do. It also
doesn't bring any performance downgrade as it is a one-time operation
on disable.

Vineeth also pointed out a problem of misusing cfs_rq->min_vruntime for
!CONFIG_64BIT kernels in migrate_task_rq_fair(); this is also fixed.

(only compile tested for !CONFIG_64BIT kernel)

From cda051ed33e6b88f28b44147cc7c894994c9d991 Mon Sep 17 00:00:00 2001
From: Aaron Lu <ziqian.lzq@antfin.com>
Date: Mon, 20 Apr 2020 10:27:17 +0800
Subject: [PATCH] sched/fair: core wide cfs task priority comparison

This patch provides a vruntime based way to compare two cfs tasks'
priority, be they on the same cpu or on different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common
cfs_rq both sched_entities are on and then do the comparison.

When the two tasks are on different threads of the same core, each thread
will choose the next task to run the usual way and then the root level
sched entities which the two tasks belong to will be used to decide
which task runs next core wide.

An illustration for the cross CPU case:

   cpu0         cpu1
 /   |  \     /   |  \
se1 se2 se3  se4 se5 se6
    /  \            /   \
  se21 se22       se61  se62
  (A)                    /
                       se621
                        (B)

Assume CPU0 and CPU1 are smt siblings and cpu0 has decided task A to
run next and cpu1 has decided task B to run next. To compare priority
of task A and B, we compare the priority of se2 and se6: whichever has
the smaller vruntime wins.

To make this work, the root level sched entities' vruntime of the two
threads must be directly comparable. So one hyperthread's root
cfs_rq min_vruntime is chosen as the core wide one, and all root level
sched entities' vruntimes are normalized against it.

Sub cfs_rqs and their sched entities are not involved in cross cpu
priority comparison as they only participate in the usual cpu local
scheduling decisions, so there is no need to normalize their vruntimes.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/core.c  |  28 +++++----
 kernel/sched/fair.c  | 135 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |   4 ++
 3 files changed, 148 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5f322922f5ae..d8bedddef6fb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -119,19 +119,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -291,8 +280,17 @@ static int __sched_core_stopper(void *data)
 	}
 
 	for_each_online_cpu(cpu) {
-		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
-			cpu_rq(cpu)->core_enabled = enabled;
+		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+			struct rq *rq = cpu_rq(cpu);
+
+			rq->core_enabled = enabled;
+			if (rq->core == rq) {
+				if (enabled)
+					sched_core_normalize_se_vruntime(cpu);
+				else
+					sched_core_sync_cfs_vruntime(cpu);
+			}
+		}
 	}
 
 	return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d99ea6ee7af2..a5774f495d97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -449,9 +449,133 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->core->cfs;
+}
+
 static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->min_vruntime;
+	if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+		return cfs_rq->min_vruntime;
+
+	return core_cfs_rq(cfs_rq)->min_vruntime;
+}
+
+#ifndef CONFIG_64BIT
+static inline u64 cfs_rq_min_vruntime_copy(struct cfs_rq *cfs_rq)
+{
+	if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+		return cfs_rq->min_vruntime_copy;
+
+	return core_cfs_rq(cfs_rq)->min_vruntime_copy;
+}
+#endif
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	bool samecpu = task_cpu(a) == task_cpu(b);
+	s64 delta;
+
+	if (samecpu) {
+		/* vruntime is per cfs_rq */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+		goto out;
+	}
+
+	/* crosscpu: compare root level se's vruntime to decide priority */
+	while (sea->parent)
+		sea = sea->parent;
+	while (seb->parent)
+		seb = seb->parent;
+	delta = (s64)(sea->vruntime - seb->vruntime);
+
+out:
+	return delta > 0;
+}
+
+/*
+ * This is called in stop machine context so no need to take the rq lock.
+ *
+ * Core scheduling is going to be enabled and the root level sched entities
+ * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
+ * min_vruntime, so it's necessary to normalize vruntime of existing root
+ * level sched entities in sibling_cfs_rq.
+ *
+ * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
+ * only using cfs_rq->min_vruntime during the entire run of core scheduling.
+ */
+void sched_core_normalize_se_vruntime(int cpu)
+{
+	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+	int i;
+
+	for_each_cpu(i, cpu_smt_mask(cpu)) {
+		struct sched_entity *se, *next;
+		struct cfs_rq *sibling_cfs_rq;
+		s64 delta;
+
+		if (i == cpu)
+			continue;
+
+		sibling_cfs_rq = &cpu_rq(i)->cfs;
+		if (!sibling_cfs_rq->nr_running)
+			continue;
+
+		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
+		rbtree_postorder_for_each_entry_safe(se, next,
+				&sibling_cfs_rq->tasks_timeline.rb_root,
+				run_node) {
+			se->vruntime += delta;
+		}
+	}
+}
+
+/*
+ * During the entire run of core scheduling, sibling_cfs_rq's min_vruntime
+ * is left unused and could lag far behind its still queued sched entities.
+ * Sync it to the up-to-date core wide one to avoid problems.
+ */
+void sched_core_sync_cfs_vruntime(int cpu)
+{
+	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+	int i;
+
+	for_each_cpu(i, cpu_smt_mask(cpu)) {
+		struct cfs_rq *sibling_cfs_rq;
+
+		if (i == cpu)
+			continue;
+
+		sibling_cfs_rq = &cpu_rq(i)->cfs;
+		sibling_cfs_rq->min_vruntime = cfs_rq->min_vruntime;
+#ifndef CONFIG_64BIT
+		smp_wmb();
+		sibling_cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
+#endif
+	}
 }
 
 static __always_inline
@@ -509,8 +633,11 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 			vruntime = min_vruntime(vruntime, se->vruntime);
 	}
 
+	if (sched_core_enabled(rq_of(cfs_rq)) && is_root_cfs_rq(cfs_rq))
+		cfs_rq = core_cfs_rq(cfs_rq);
+
 	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
+	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
@@ -6396,9 +6523,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		u64 min_vruntime_copy;
 
 		do {
-			min_vruntime_copy = cfs_rq->min_vruntime_copy;
+			min_vruntime_copy = cfs_rq_min_vruntime_copy(cfs_rq);
 			smp_rmb();
-			min_vruntime = cfs_rq->min_vruntime;
+			min_vruntime = cfs_rq_min_vruntime(cfs_rq);
 		} while (min_vruntime != min_vruntime_copy);
 #else
 		min_vruntime = cfs_rq_min_vruntime(cfs_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50a5675e941a..5517ca92b5bd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2594,3 +2594,7 @@ static inline void membarrier_switch_mm(struct rq *rq,
 {
 }
 #endif
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+void sched_core_normalize_se_vruntime(int cpu);
+void sched_core_sync_cfs_vruntime(int cpu);
-- 
2.19.1.3.ge56e4f7



* Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison
  2020-04-24 14:24                   ` [PATCH updated v2] " Aaron Lu
@ 2020-05-06 14:35                     ` Peter Zijlstra
  2020-05-08  8:44                       ` Aaron Lu
  0 siblings, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2020-05-06 14:35 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes


Sorry for being verbose; I've been procrastinating replying, and in
doing so the things I wanted to say kept growing.

On Fri, Apr 24, 2020 at 10:24:43PM +0800, Aaron Lu wrote:

> To make this work, the root level sched entities' vruntime of the two
> threads must be directly comparable. So one of the hyperthread's root
> cfs_rq's min_vruntime is chosen as the core wide one and all root level
> sched entities' vruntime is normalized against it.

> +/*
> + * This is called in stop machine context so no need to take the rq lock.
> + *
> + * Core scheduling is going to be enabled and the root level sched entities
> + * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
> + * min_vruntime, so it's necessary to normalize vruntime of existing root
> + * level sched entities in sibling_cfs_rq.
> + *
> + * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
> + * only using cfs_rq->min_vruntime during the entire run of core scheduling.
> + */
> +void sched_core_normalize_se_vruntime(int cpu)
> +{
> +	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> +	int i;
> +
> +	for_each_cpu(i, cpu_smt_mask(cpu)) {
> +		struct sched_entity *se, *next;
> +		struct cfs_rq *sibling_cfs_rq;
> +		s64 delta;
> +
> +		if (i == cpu)
> +			continue;
> +
> +		sibling_cfs_rq = &cpu_rq(i)->cfs;
> +		if (!sibling_cfs_rq->nr_running)
> +			continue;
> +
> +		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
> +		rbtree_postorder_for_each_entry_safe(se, next,
> +				&sibling_cfs_rq->tasks_timeline.rb_root,
> +				run_node) {
> +			se->vruntime += delta;
> +		}
> +	}
> +}

Aside from this being way too complicated for what it does -- you
could've saved the min_vruntime for each rq and compared them with
subtraction -- it is also terminally broken afaict.

Consider any infeasible weight scenario. Take for instance two tasks,
each bound to their respective sibling, one with weight 1 and one with
weight 2. Then the lower weight task will run ahead of the higher weight
task without bound.

This utterly destroys the concept of a shared time base.

Remember: all this is about proportionally fair scheduling, where each
task receives:

             w_i
  dt_i = ---------- dt                                     (1)
	 \Sum_j w_j

which we do by tracking a virtual time, s_i:

         1
  s_i = --- d[t]_i                                         (2)
        w_i

Where d[t] is a delta of discrete time, while dt is an infinitesimal.
The immediate corollary is that the ideal schedule S, obtained by taking
(2) with an infinitesimal delta, is:

          1
  S = ---------- dt                                        (3)
      \Sum_i w_i

From which we can define the lag, or deviation from the ideal, as:

  lag(i) = S - s_i                                         (4)

And since the one and only purpose is to approximate S, we get that:

  \Sum_i w_i lag(i) := 0                                   (5)

If this were not so, we no longer converge to S, and we can no longer
claim our scheduler has any of the properties we derive from S. This is
exactly what you did above, you broke it!


Let's continue for a while though; to see if there is anything useful to
be learned. We can combine (1)-(3) or (4)-(5) and express S in s_i:

      \Sum_i w_i s_i
  S = --------------                                       (6)
        \Sum_i w_i

Which gives us a way to compute S, given our s_i. Now, if you've read
our code, you know that we do not in fact do this; the reason is
two-fold. Firstly, computing S that way requires a 64bit division
every time we'd use it (see (12)), and secondly, this only describes
the steady-state; it doesn't handle dynamics.

Anyway, in (6):  s_i -> x + (s_i - x), to get:

          \Sum_i w_i (s_i - x)
  S - x = --------------------                             (7)
               \Sum_i w_i

Which shows that S and s_i transform alike (which makes perfect sense
given that S is basically the (weighted) average of s_i).

Then:

  x -> s_min := min{s_i}                                   (8)

to obtain:

              \Sum_i w_i (s_i - s_min)
  S = s_min + ------------------------                     (9)
                    \Sum_i w_i

Which already looks familiar, and is the basis for our current
approximation:

  S ~= s_min                                              (10)

Now, obviously, (10) is absolute crap :-), but it sorta works.

So the thing to remember is that the above is strictly UP. It is
possible to generalize to multiple runqueues -- however it gets really
yuck when you have to add affinity support, as illustrated by our very
first counter-example.

XXX segue into the load-balance issues related to this:

  - how a negative lag task on a 'heavy' runqueue should not
    remain a negative lag task when migrated to a 'light' runqueue.

  - how we can compute and use the combined S in load-balancing to
    better handle infeasible weight scenarios.

Luckily I think we can avoid needing a full multi-queue variant for
core-scheduling (or load-balancing). The crucial observation is that we
only actually need this comparison in the presence of forced-idle; only
then do we need to tell if the stalled rq has higher priority over the
other.

[XXX assumes SMT2; better consider the more general case, I suspect
it'll work out because our comparison is always between 2 rqs and the
answer is only interesting if one of them is forced-idle]

And (under assumption of SMT2) when there is forced-idle, there is only
a single queue, so everything works like normal.

Let, for our runqueue 'k':

  T_k = \Sum_i w_i s_i
  W_k = \Sum_i w_i      ; for all i of k                  (11)

Then we can write (6) like:

        T_k
  S_k = ---                                               (12)
        W_k

From which immediately follows that:

          T_k + T_l
  S_k+l = ---------                                       (13)
          W_k + W_l

On which we can define a combined lag:

  lag_k+l(i) := S_k+l - s_i                               (14)

And that gives us the tools to compare tasks across a combined runqueue.


Combined this gives the following:

 a) when a runqueue enters force-idle, sync it against its sibling rq(s)
    using (7); this only requires storing single 'time'-stamps.

 b) when comparing tasks between 2 runqueues of which one is forced-idle,
    compare the combined lag, per (14).

Now, of course cgroups (I so hate them) make this more interesting in
that a) seems to suggest we need to iterate all cgroup on a CPU at such
boundaries, but I think we can avoid that. The force-idle is for the
whole CPU, all its rqs. So we can mark it in the root and lazily
propagate downward on demand.





* Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison
  2020-05-06 14:35                     ` Peter Zijlstra
@ 2020-05-08  8:44                       ` Aaron Lu
  2020-05-08  9:09                         ` Peter Zijlstra
  0 siblings, 1 reply; 110+ messages in thread
From: Aaron Lu @ 2020-05-08  8:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Wed, May 06, 2020 at 04:35:06PM +0200, Peter Zijlstra wrote:
> 
> Sorry for being verbose; I've been procrastinating replying, and in
> doing so the things I wanted to say kept growing.
> 
> On Fri, Apr 24, 2020 at 10:24:43PM +0800, Aaron Lu wrote:
> 
> > To make this work, the root level sched entities' vruntime of the two
> > threads must be directly comparable. So one of the hyperthread's root
> > cfs_rq's min_vruntime is chosen as the core wide one and all root level
> > sched entities' vruntime is normalized against it.
> 
> > +/*
> > + * This is called in stop machine context so no need to take the rq lock.
> > + *
> > + * Core scheduling is going to be enabled and the root level sched entities
> > + * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
> > + * min_vruntime, so it's necessary to normalize vruntime of existing root
> > + * level sched entities in sibling_cfs_rq.
> > + *
> > + * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
> > + * only using cfs_rq->min_vruntime during the entire run of core scheduling.
> > + */
> > +void sched_core_normalize_se_vruntime(int cpu)
> > +{
> > +	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> > +	int i;
> > +
> > +	for_each_cpu(i, cpu_smt_mask(cpu)) {
> > +		struct sched_entity *se, *next;
> > +		struct cfs_rq *sibling_cfs_rq;
> > +		s64 delta;
> > +
> > +		if (i == cpu)
> > +			continue;
> > +
> > +		sibling_cfs_rq = &cpu_rq(i)->cfs;
> > +		if (!sibling_cfs_rq->nr_running)
> > +			continue;
> > +
> > +		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
> > +		rbtree_postorder_for_each_entry_safe(se, next,
> > +				&sibling_cfs_rq->tasks_timeline.rb_root,
> > +				run_node) {
> > +			se->vruntime += delta;
> > +		}
> > +	}
> > +}
> 
> Aside from this being way too complicated for what it does -- you
> could've saved the min_vruntime for each rq and compared them with
> subtraction -- it is also terminally broken afaict.
>
> Consider any infeasible weight scenario. Take for instance two tasks,
> each bound to their respective sibling, one with weight 1 and one with
> weight 2. Then the lower weight task will run ahead of the higher weight
> task without bound.

I don't follow how this could happen. Even if the lower weight task runs
first, after some time, the higher weight task will get its turn and
from then on, the higher weight task will get more chances to run (due to
its higher weight and thus slower accumulation of vruntime).

We used to have the following patch as a standalone one in v4:
sched/fair : Wake up forced idle siblings if needed
https://lore.kernel.org/lkml/cover.1572437285.git.vpillai@digitalocean.com/T/#md22d25d0e2932d059013e9b56600d8a847b02a13
Which originates from:
https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/

And in this series, it seems to be merged in:
[RFC PATCH 07/13] sched: Add core wide task selection and scheduling
https://lore.kernel.org/lkml/e942da7fd881977923463f19648085c1bfaa37f8.1583332765.git.vpillai@digitalocean.com/

My local test shows that when two cgroup's share are both set to 1024
and each bound to one sibling of a core, start a cpu intensive task in
each cgroup, then the cpu intensive task will each consume 50% cpu. When
one cgroup's share set to 512, it will consume about 33% while the other
consumes 67%, as expected.

I think the current patch works fine when 2 differently tagged tasks are
competing for CPU, but when there are 3 tasks or more, things can get less
fair.


* Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison
  2020-05-08  8:44                       ` Aaron Lu
@ 2020-05-08  9:09                         ` Peter Zijlstra
  2020-05-08 12:34                           ` Aaron Lu
  0 siblings, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2020-05-08  9:09 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Fri, May 08, 2020 at 04:44:19PM +0800, Aaron Lu wrote:
> On Wed, May 06, 2020 at 04:35:06PM +0200, Peter Zijlstra wrote:

> > Aside from this being way too complicated for what it does -- you
> > could've saved the min_vruntime for each rq and compared them with
> > subtraction -- it is also terminally broken afaict.
> >
> > Consider any infeasible weight scenario. Take for instance two tasks,
> > each bound to their respective sibling, one with weight 1 and one with
> > weight 2. Then the lower weight task will run ahead of the higher weight
> > task without bound.
> 
> I don't follow how this could happen. Even if the lower weight task runs
> first, after some time, the higher weight task will get its turn and
> from then on, the higher weight task will get more chances to run (due to
> its higher weight and thus slower accumulation of vruntime).

That seems to assume they're mutually exclusive. In that case, as I
argued, we only have a single runqueue and then yes it works. But if
they're not exclusive, and can run concurrently, it comes apart.



* Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison
  2020-05-08  9:09                         ` Peter Zijlstra
@ 2020-05-08 12:34                           ` Aaron Lu
  2020-05-14 13:02                             ` Peter Zijlstra
  0 siblings, 1 reply; 110+ messages in thread
From: Aaron Lu @ 2020-05-08 12:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Fri, May 08, 2020 at 11:09:25AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 04:44:19PM +0800, Aaron Lu wrote:
> > On Wed, May 06, 2020 at 04:35:06PM +0200, Peter Zijlstra wrote:
> 
> > > Aside from this being way too complicated for what it does -- you
> > > could've saved the min_vruntime for each rq and compared them with
> > > subtraction -- it is also terminally broken afaict.
> > >
> > > Consider any infeasible weight scenario. Take for instance two tasks,
> > > each bound to their respective sibling, one with weight 1 and one with
> > > weight 2. Then the lower weight task will run ahead of the higher weight
> > > task without bound.
> > 
> > I don't follow how this could happen. Even if the lower weight task runs
> > first, after some time, the higher weight task will get its turn and
> > from then on, the higher weight task will get more chances to run (due to
> > its higher weight and thus slower accumulation of vruntime).
> 
> That seems to assume they're mutually exclusive. In that case, as I
> argued, we only have a single runqueue and then yes it works. But if
> they're not exclusive, and can run concurrently, it comes apart.

Ah right, now I see what you mean. Sorry for misunderstanding.

And yes, that 'utterly destroys the concept of a shared time base' and
then bad things can happen:
1) two same-tagged tasks (t1 and t2) run on two siblings, with t1's
   weight lower than t2's;
2) both tasks are cpu intensive;
3) over time, the lower weight task's (t1's) vruntime becomes bigger and
   bigger than t2's vruntime, and the core wide min_vruntime is the
   same as t1's vruntime per this patch;
4) a new task is enqueued on the same sibling as t1; if the new task has
   an incompatible tag, it will be starved by t2 because t2's vruntime
   is way smaller than the core wide min_vruntime.

With this said, I realized a workaround for the issue described above:
when the core goes from 'compatible mode' (steps 1-3) to 'incompatible
mode' (step 4), reset all root level sched entities' vruntime to be the
same as the core wide min_vruntime. After all, the core is transitioning
from two-runqueue mode to single-runqueue mode... I think this can solve
the issue to some extent, but I may be missing other scenarios.

I'll also re-read your last email about the 'lag' idea.


* Re: [RFC PATCH 00/13] Core scheduling v5
       [not found] ` <38805656-2e2f-222a-c083-692f4b113313@linux.intel.com>
@ 2020-05-09  3:39   ` Ning, Hongyu
  2020-05-14 20:51     ` FW: " Gruza, Agata
  0 siblings, 1 reply; 110+ messages in thread
From: Ning, Hongyu @ 2020-05-09  3:39 UTC (permalink / raw)
  To: vpillai, naravamudan, jdesfossez, peterz, Tim Chen, mingo, tglx,
	pjt, torvalds
  Cc: vpillai, fweisbec, keescook, kerrnel, pauld, aaron.lwe,
	aubrey.intel, Li, Aubrey, valentin.schneider, mgorman,
	pawan.kumar.gupta, pbonzini, joelaf, joel, linux-kernel


- Test environment:
Intel Xeon Server platform
CPU(s):              192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4

- Kernel under test: 
Core scheduling v5 base
https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y

- Test set based on sysbench 1.1.0-bd4b418:
A: sysbench cpu in cgroup cpu 1 + sysbench mysql in cgroup mysql 1 (192 workload tasks for each cgroup)
B: sysbench cpu in cgroup cpu 1 + sysbench cpu in cgroup cpu 2 + sysbench mysql in cgroup mysql 1 + sysbench mysql in cgroup mysql 2 (192 workload tasks for each cgroup)

- Test results briefing:
1 Good results:
1.1 For test set A, coresched achieves the same or better performance than smt_off, for both the cpu workload and the mysql workload
1.2 For test set B, cpu workload, coresched achieves better performance than smt_off

2 Bad results:
2.1 For test set B, mysql workload, coresched performance is lower than smt_off; potential fairness issue between cpu workloads and mysql workloads
2.2 For test set B, cpu workload, potential fairness issue between the 2 cgroups' cpu workloads

- Test results:
Note: test results in the following tables are throughput (Tput) normalized to the default baseline

-- Test set A Tput normalized results:
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
|                    | ****   | default   | coresched   | smt_off   | ***   | default     | coresched     | smt_off     |
+====================+========+===========+=============+===========+=======+=============+===============+=============+
| cgroups            | ****   | cg cpu 1  | cg cpu 1    | cg cpu 1  | ***   | cg mysql 1  | cg mysql 1    | cg mysql 1  |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
| sysbench workload  | ****   | cpu       | cpu         | cpu       | ***   | mysql       | mysql         | mysql       |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
| 192 tasks / cgroup | ****   | 1         | 0.95        | 0.54      | ***   | 1           | 0.92          | 0.97        |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+

-- Test set B Tput normalized results:
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
|                    | ****   | default   | coresched   | smt_off   | ***   | default     | coresched     | smt_off     | **   | default     | coresched     | smt_off     | *   | default     | coresched     | smt_off     |
+====================+========+===========+=============+===========+=======+=============+===============+=============+======+=============+===============+=============+=====+=============+===============+=============+
| cgroups            | ****   | cg cpu 1  | cg cpu 1    | cg cpu 1  | ***   | cg cpu 2    | cg cpu 2      | cg cpu 2    | **   | cg mysql 1  | cg mysql 1    | cg mysql 1  | *   | cg mysql 2  | cg mysql 2    | cg mysql 2  |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
| sysbench workload  | ****   | cpu       | cpu         | cpu       | ***   | cpu         | cpu           | cpu         | **   | mysql       | mysql         | mysql       | *   | mysql       | mysql         | mysql       |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
| 192 tasks / cgroup | ****   | 1         | 0.9         | 0.47      | ***   | 1           | 1.32          | 0.66        | **   | 1           | 0.42          | 0.89        | *   | 1           | 0.42          | 0.89        |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+


> On Date: Wed,  4 Mar 2020 16:59:50 +0000, vpillai <vpillai@digitalocean.com> wrote:
> To: Nishanth Aravamudan <naravamudan@digitalocean.com>, Julien Desfossez <jdesfossez@digitalocean.com>, Peter Zijlstra <peterz@infradead.org>, Tim Chen <tim.c.chen@linux.intel.com>, mingo@kernel.org, tglx@linutronix.de, pjt@google.com, torvalds@linux-foundation.org
> CC: vpillai <vpillai@digitalocean.com>, linux-kernel@vger.kernel.org, fweisbec@gmail.com, keescook@chromium.org, kerrnel@google.com, Phil Auld <pauld@redhat.com>, Aaron Lu <aaron.lwe@gmail.com>, Aubrey Li <aubrey.intel@gmail.com>, aubrey.li@linux.intel.com, Valentin Schneider <valentin.schneider@arm.com>, Mel Gorman <mgorman@techsingularity.net>, Pawan Gupta <pawan.kumar.gupta@linux.intel.com>, Paolo Bonzini <pbonzini@redhat.com>, Joel Fernandes <joelaf@google.com>, joel@joelfernandes.org
> 
> 
> Fifth iteration of the Core-Scheduling feature.
> 
> Core scheduling is a feature that only allows trusted tasks to run
> concurrently on cpus sharing compute resources(eg: hyperthreads on a
> core). The goal is to mitigate the core-level side-channel attacks
> without requiring to disable SMT (which has a significant impact on
> performance in some situations). So far, the feature mitigates user-space
> to user-space attacks but not user-space to kernel attack, when one of
> the hardware thread enters the kernel (syscall, interrupt etc).
> 
> By default, the feature doesn't change any of the current scheduler
> behavior. The user decides which tasks can run simultaneously on the
> same core (for now by having them in the same tagged cgroup). When
> a tag is enabled in a cgroup and a task from that cgroup is running
> on a hardware thread, the scheduler ensures that only idle or trusted
> tasks run on the other sibling(s). Besides security concerns, this
> feature can also be beneficial for RT and performance applications
> where we want to control how tasks make use of SMT dynamically.
> 
> This version was focusing on performance and stability. Couple of
> crashes related to task tagging and cpu hotplug path were fixed.
> This version also improves the performance considerably by making
> task migration and load balancing coresched aware.
> 
> In terms of performance, the major difference since the last iteration
> is that now even IO-heavy and mixed-resources workloads are less
> impacted by core-scheduling than by disabling SMT. Both host-level and
> VM-level benchmarks were performed. Details in:
> https://lkml.org/lkml/2020/2/12/1194
> https://lkml.org/lkml/2019/11/1/269
> 
> v5 is rebased on top of 5.5.5(449718782a46)
> https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y
> 



* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-04-14 14:21 ` Peter Zijlstra
  2020-04-15 16:32   ` Joel Fernandes
@ 2020-05-09 14:35   ` Dario Faggioli
  1 sibling, 0 replies; 110+ messages in thread
From: Dario Faggioli @ 2020-05-09 14:35 UTC (permalink / raw)
  To: Peter Zijlstra, vpillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo, tglx,
	pjt, torvalds, linux-kernel, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel,
	Alexander Graf


On Tue, 2020-04-14 at 16:21 +0200, Peter Zijlstra wrote:
> On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> > 
> > - Investigate the source of the overhead even when no tasks are
> > tagged:
> >   https://lkml.org/lkml/2019/10/29/242
> 
>  - explain why we're all still doing this ....
> 
> Seriously, what actual problems does it solve? The patch-set still
> isn't
> L1TF complete and afaict it does exactly nothing for MDS.
> 
Hey Peter! Late to the party, I know...

But I'm replying anyway. At least, you'll have the chance to yell at me
for this during OSPM. ;-P

> Like I've written many times now, back when the world was simpler and
> all we had to worry about was L1TF, core-scheduling made some sense,
> but
> how does it make sense today?
> 
Indeed core-scheduling alone doesn't even completely solve L1TF. There
are the interrupts and the VMEXITs issues. Both are being discussed in
this thread and, FWIW, my personal opinion is that the way to go is
what Alex says here:

<79529592-5d60-2a41-fbb6-4a5f8279f998@amazon.com>

(E.g., when he mentions solution 4 "Create a "safe" page table which
runs with HT enabled", etc).

But let's stick to your point: if it were only for L1TF, then fine, but
it's all pointless because of MDS. My answer to this is very much
focused on my usecase, which is virtualization. I know you hate us, and
you surely have your good reasons, but you know... :-)

Correct me if I'm wrong, but I think that the "nice" thing about L1TF is
that it allows a VM to spy on another VM or on the host, but it does
not allow a regular task to spy on another task or on the kernel (well,
it would, but it's easily mitigated).

The bad thing about MDS is that it instead allows *all* of that.

Now, one thing that we absolutely want to avoid in virt is that a VM is
able to spy on other VMs or on the host. Sure, we also care about tasks
running in our VMs to be safe, but, really, inter-VM and VM-to-host
isolation is the primary concern of a hypervisor.

And how can a VM (or stuff running inside a VM) spy on another VM or on
the host, via L1TF or MDS? Well, if the attacker VM and the victim VM
--or if the attacker VM and the host-- are running on the same core. If
they're not, it can't... which is basically an L1TF-only looking
scenario.

So, in virt, core-scheduling:
1) is the *only* way (aside from no-EPT) to prevent an attacker VM from
   spying on a victim VM, if they're running concurrently, both in guest mode,
   on the same core (and that's, of course, because with
   core-scheduling they just won't be doing that :-) )
2) interrupts and VMEXITs need to be taken care of --which was already
   the case when, as you said, "we had only L1TF". Once that is done,
   we will effectively prevent all VM-to-VM and VM-to-host attack
   scenarios.

Sure, it will still be possible, for instance, for task_A in VM1 to spy
on task_B, also in VM1. This seems to be, AFAIUI, Joel's usecase, so
I'm happy to leave it to him to defend that, as he's doing already (but
indeed I'm very happy to see that it is also getting attention).

Now, of course saying anything like "works for my own usecase so let's
go for it" does not fly. But since you were asking whether and how this
feature could make sense today, suppose that:
1) we get core-scheduling,
2) we find a solution for irqs and VMEXITs, as we would have to if 
   there was only L1TF,
3) we manage to make the overhead of core-scheduling close to zero 
   when it's there (I mean, enabled at compile time) but not used (I
   mean, no tagging of tasks, or whatever).

That would mean that virt people can enable core-scheduling, and
achieve good inter-VM and VM-to-host isolation, without imposing
overhead to other use cases, that would leave core-scheduling disabled.

And this is something that, I think, makes sense.

Of course, we're not there... because even when this series will give
us point 1, we will also need 2 and we need to make sure we also
satisfy 3 (and we weren't, last time I checked ;-P).

But I think it's worth keeping trying.

I'd also add a couple of more ideas, still about core-scheduling in
virt, but from a different standpoint than security:
- if I tag vcpu0 and vcpu1 together[*], then vcpu2 and vcpu3 together,
  then vcpu4 and vcpu5 together, then I'm sure that each pair will
  always be scheduled on the same core. At which point I can define
  an SMT virtual topology, for the VM, that will make sense, even
  without pinning the vcpus;
- if I run VMs from different customers, when vcpu2 of VM1 and vcpu1
  of VM2 run on the same core, they influence each others' performance.
  If, e.g., I bill basing on time spent on CPUs, it means customer
  A's workload, running in VM1, may influence the billing of customer
  B, who owns VM2. With core scheduling, if I tag all the vcpus of each
  VM together, I won't have this any longer.

[*] with "tag together" I mean let them have the same tag which, ATM
would be "put them in the same cgroup and enable cpu.tag".

Whether or not these make sense, e.g., performance wise, is a bit
hard to tell, with the feature not yet finalized... But I've started
doing some preliminary measurements already. Hopefully, they'll be
ready by Monday.

So that's it. I hope this gives you enough material to complain about
during OSPM. At least, given the event is virtual, I won't get any
microphone box (or, worse, frozen sharks!) thrown at me in anger! :-D

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)




* [PATCH RFC] Add support for core-wide protection of IRQ and softirq
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (15 preceding siblings ...)
       [not found] ` <38805656-2e2f-222a-c083-692f4b113313@linux.intel.com>
@ 2020-05-10 23:46 ` Joel Fernandes (Google)
  2020-05-11 13:49   ` Peter Zijlstra
  2020-05-20 22:26 ` [PATCH RFC] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 110+ messages in thread
From: Joel Fernandes (Google) @ 2020-05-10 23:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Paul E . McKenney, Vineeth Pillai, Allison Randal, Armijn Hemel,
	Ben Segall, Dietmar Eggemann, Ingo Molnar, Juri Lelli,
	Mel Gorman, Muchun Song, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Vincent Guittot

With the current core scheduling patchset, non-threaded IRQ and softirq
victims can leak data from their hyperthread to a sibling hyperthread
running an attacker.

For MDS, it is possible for the IRQ and softirq handlers to leak data to
either host or guest attackers. For L1TF, it is possible to leak to
guest attackers. There is no possible mitigation involving flushing of
buffers to avoid this, since the attacker and the victims execute
concurrently on 2 or more HTs.

The solution in this patch is to monitor the outermost core-wide
irq_enter() and irq_exit() executed by any sibling. In between these
two, we mark the core as being in a special core-wide IRQ state.

On IRQ entry, if we detect that the sibling is running untrusted
code, we send a reschedule IPI so that the sibling transitions through
its irq_exit() to do any waiting there, until the IRQ being
protected finishes.

We also monitor the per-CPU outermost irq_exit(). If during the per-CPU
outermost irq_exit(), the core is still in the special core-wide IRQ
state, we perform a busy-wait till the core exits this state. This
combination of per-CPU and core-wide IRQ states helps to handle any
combination of irq_enter()s and irq_exit()s happening on all of the
siblings of the core in any order.

Lastly, we also check in the schedule loop if we are about to schedule
an untrusted process while the core is in such a state. This is possible
if a trusted thread enters the scheduler by yielding the CPU. This
would involve no transition through the irq_exit() point to do any
waiting, so we have to explicitly do the waiting there.

Every attempt is made to avoid unnecessary busy-waiting, and in
testing on real-world ChromeOS usecases, it has not shown a performance
drop. In ChromeOS, with this and the rest of the core scheduling
patchset, we see around a 300% improvement in key press latencies into
Google docs when Camera streaming is running simultaneously (90th
percentile latency of ~150ms drops to ~50ms).

Cc: Paul E. McKenney <paulmck@kernel.org>
Co-developed-by: Vineeth Pillai <vpillai@digitalocean.com>
Signed-off-by: Vineeth Pillai <vpillai@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>


---
If you like some pictures of the cases handled by this patch, please
see the OSPM slide deck (the below link jumps straight to relevant
slides - about 6-7 of them in total): https://bit.ly/2zvzxWk

TODO:
1. Any optimizations for VM usecases (can we do something better than
   scheduler IPI?)
2. Waiting in schedule() can likely be optimized, example no need to
   wait if previous task was idle, as there would have been an IRQ
   involved with the wake up of the next task.

 include/linux/sched.h |   8 +++
 kernel/sched/core.c   | 159 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |   3 +
 kernel/softirq.c      |   2 +
 4 files changed, 172 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 710e9a8956007..fe6ae59fcadbe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2018,4 +2018,12 @@ int sched_trace_rq_cpu(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+void sched_core_irq_enter(void);
+void sched_core_irq_exit(void);
+#else
+static inline void sched_core_irq_enter(void) { }
+static inline void sched_core_irq_exit(void) { }
+#endif
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 21c640170323b..e06195dcca7a0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4391,6 +4391,153 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
+/*
+ * Helper function to pause the caller's hyperthread until the core exits the
+ * core-wide IRQ state. Obviously the CPU calling this function should not be
+ * responsible for the core being in the core-wide IRQ state otherwise it will
+ * deadlock. This function should be called from irq_exit() and from schedule().
+ * It is up to the callers to decide if calling here is necessary.
+ */
+static inline void sched_core_sibling_irq_pause(struct rq *rq)
+{
+	/*
+	 * Wait till the core of this HT is not in a core-wide IRQ state.
+	 *
+	 * Pair with smp_store_release() in sched_core_irq_exit().
+	 */
+	while (smp_load_acquire(&rq->core->core_irq_nest) > 0)
+		cpu_relax();
+}
+
+/*
+ * Enter the core-wide IRQ state. Sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_irq_exit() is called. Every attempt to
+ * avoid sending useless IPIs is made. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_irq_enter(void)
+{
+	int i, cpu = smp_processor_id();
+	struct rq *rq = cpu_rq(cpu);
+	const struct cpumask *smt_mask;
+
+	if (!sched_core_enabled(rq))
+		return;
+
+	/* Count irq_enter() calls received without irq_exit() on this CPU. */
+	rq->core_this_irq_nest++;
+
+	/* If not outermost irq_enter(), do nothing. */
+	if (rq->core_this_irq_nest != 1 ||
+	    WARN_ON_ONCE(rq->core_this_irq_nest == UINT_MAX))
+		return;
+
+	raw_spin_lock(rq_lockp(rq));
+	smt_mask = cpu_smt_mask(cpu);
+
+	/* Contribute this CPU's irq_enter() to core-wide irq_enter() count. */
+	WRITE_ONCE(rq->core->core_irq_nest, rq->core->core_irq_nest + 1);
+	if (WARN_ON_ONCE(rq->core->core_irq_nest == UINT_MAX))
+		goto unlock;
+
+	if (rq->core_pause_pending) {
+		/*
+		 * Do nothing more since we are in a 'reschedule IPI' sent from
+		 * another sibling. That sibling would have sent IPIs to all of
+		 * the HTs.
+		 */
+		goto unlock;
+	}
+
+	/*
+	 * If we are not the first ones on the core to enter core-wide IRQ
+	 * state, do nothing.
+	 */
+	if (rq->core->core_irq_nest > 1)
+		goto unlock;
+
+	/* Do nothing more if the core is not tagged. */
+	if (!rq->core->core_cookie)
+		goto unlock;
+
+	for_each_cpu(i, smt_mask) {
+		struct rq *srq = cpu_rq(i);
+
+		if (i == cpu || cpu_is_offline(i))
+			continue;
+
+		if (!srq->curr->mm || is_idle_task(srq->curr))
+			continue;
+
+		/* Skip if HT is not running a tagged task. */
+		if (!srq->curr->core_cookie && !srq->core_pick)
+			continue;
+
+		/* IPI only if previous IPI was not pending. */
+		if (!srq->core_pause_pending) {
+			srq->core_pause_pending = 1;
+			smp_send_reschedule(i);
+		}
+	}
+unlock:
+	raw_spin_unlock(rq_lockp(rq));
+}
+
+/*
+ * Process any work needed for either exiting the core-wide IRQ state, or for
+ * waiting on this hyperthread if the core is still in this state.
+ */
+void sched_core_irq_exit(void)
+{
+	int cpu = smp_processor_id();
+	struct rq *rq = cpu_rq(cpu);
+	bool wait_here = false;
+	unsigned int nest;
+
+	/* Do nothing if core-sched disabled. */
+	if (!sched_core_enabled(rq))
+		return;
+
+	rq->core_this_irq_nest--;
+
+	/* If not outermost on this CPU, do nothing. */
+	if (rq->core_this_irq_nest > 0 ||
+	    WARN_ON_ONCE(rq->core_this_irq_nest == UINT_MAX))
+		return;
+
+	raw_spin_lock(rq_lockp(rq));
+	/*
+	 * Core-wide nesting counter can never be 0 because we are
+	 * still in it on this CPU.
+	 */
+	nest = rq->core->core_irq_nest;
+	WARN_ON_ONCE(!nest);
+
+	/*
+	 * If we still have other CPUs in IRQs, we have to wait for them.
+	 * Either here, or in the scheduler.
+	 */
+	if (rq->core->core_cookie && nest > 1) {
+		/*
+		 * If we are entering the scheduler anyway, we can just wait
+		 * there for ->core_irq_nest to reach 0. If not, just wait here.
+		 */
+		if (!tif_need_resched()) {
+			wait_here = true;
+		}
+	}
+
+	if (rq->core_pause_pending)
+		rq->core_pause_pending = 0;
+
+	/* Pair with smp_load_acquire() in sched_core_sibling_irq_pause(). */
+	smp_store_release(&rq->core->core_irq_nest, nest - 1);
+	raw_spin_unlock(rq_lockp(rq));
+
+	if (wait_here)
+		sched_core_sibling_irq_pause(rq);
+}
+
 // XXX fairness/fwd progress conditions
 /*
  * Returns
@@ -4910,6 +5057,18 @@ static void __sched notrace __schedule(bool preempt)
 		rq_unlock_irq(rq, &rf);
 	}
 
+#ifdef CONFIG_SCHED_CORE
+	/*
+	 * If a CPU that was running a trusted task entered the scheduler, and
+	 * the next task is untrusted, then check if waiting for core-wide IRQ
+	 * state to cease is needed since we would not have been able to get
+	 * the services of irq_exit() to do that waiting.
+	 */
+	if (sched_core_enabled(rq) &&
+	    !is_idle_task(next) && next->mm && next->core_cookie)
+		sched_core_sibling_irq_pause(rq);
+#endif
+
 	balance_callback(rq);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7d9f156242e2..3a065d133ef51 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1018,11 +1018,14 @@ struct rq {
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
 	unsigned char		core_forceidle;
+	unsigned char		core_pause_pending;
+	unsigned int		core_this_irq_nest;
 
 	/* shared state */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
+	unsigned int		core_irq_nest;
 #endif
 };
 
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 0427a86743a46..b953386c8f62f 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -345,6 +345,7 @@ asmlinkage __visible void do_softirq(void)
 void irq_enter(void)
 {
 	rcu_irq_enter();
+	sched_core_irq_enter();
 	if (is_idle_task(current) && !in_interrupt()) {
 		/*
 		 * Prevent raise_softirq from needlessly waking up ksoftirqd
@@ -413,6 +414,7 @@ void irq_exit(void)
 		invoke_softirq();
 
 	tick_irq_exit();
+	sched_core_irq_exit();
 	rcu_irq_exit();
 	trace_hardirq_exit(); /* must be last! */
 }
-- 
2.26.2.645.ge9eca65c58-goog

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH RFC] Add support for core-wide protection of IRQ and softirq
  2020-05-10 23:46 ` [PATCH RFC] Add support for core-wide protection of IRQ and softirq Joel Fernandes (Google)
@ 2020-05-11 13:49   ` Peter Zijlstra
  2020-05-11 14:54     ` Joel Fernandes
  0 siblings, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2020-05-11 13:49 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Paul E . McKenney, Vineeth Pillai, Allison Randal,
	Armijn Hemel, Ben Segall, Dietmar Eggemann, Ingo Molnar,
	Juri Lelli, Mel Gorman, Muchun Song, Steven Rostedt,
	Thomas Gleixner, Vincent Guittot

On Sun, May 10, 2020 at 07:46:52PM -0400, Joel Fernandes (Google) wrote:
> With the current core scheduling patchset, non-threaded IRQ and softirq
> victims can leak data from their hyperthread to a sibling hyperthread
> running an attacker.
> 
> For MDS, it is possible for the IRQ and softirq handlers to leak data to
> either host or guest attackers. For L1TF, it is possible to leak to
> guest attackers. There is no possible mitigation involving flushing of
> buffers to avoid this since the execution of attacker and victims happen
> concurrently on 2 or more HTs.
> 
> The solution in this patch is to monitor the outer-most core-wide
> irq_enter() and irq_exit() executed by any sibling. In between these
> two, we mark the core to be in a special core-wide IRQ state.

Another possible option is force_irqthreads :-) That would cure it
nicely.

Anyway, I'll go read this.

* Re: [PATCH RFC] Add support for core-wide protection of IRQ and softirq
  2020-05-11 13:49   ` Peter Zijlstra
@ 2020-05-11 14:54     ` Joel Fernandes
  0 siblings, 0 replies; 110+ messages in thread
From: Joel Fernandes @ 2020-05-11 14:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Paul E . McKenney, Vineeth Pillai, Allison Randal,
	Armijn Hemel, Ben Segall, Dietmar Eggemann, Ingo Molnar,
	Juri Lelli, Mel Gorman, Muchun Song, Steven Rostedt,
	Thomas Gleixner, Vincent Guittot

On Mon, May 11, 2020 at 9:49 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, May 10, 2020 at 07:46:52PM -0400, Joel Fernandes (Google) wrote:
> > With the current core scheduling patchset, non-threaded IRQ and softirq
> > victims can leak data from their hyperthread to a sibling hyperthread
> > running an attacker.
> >
> > For MDS, it is possible for the IRQ and softirq handlers to leak data to
> > either host or guest attackers. For L1TF, it is possible to leak to
> > guest attackers. There is no possible mitigation involving flushing of
> > buffers to avoid this since the execution of attacker and victims happen
> > concurrently on 2 or more HTs.
> >
> > The solution in this patch is to monitor the outer-most core-wide
> > irq_enter() and irq_exit() executed by any sibling. In between these
> > two, we mark the core to be in a special core-wide IRQ state.
>
> Another possible option is force_irqthreads :-) That would cure it
> nicely.

Yes true, it was definitely my "plan B" at one point if this patch
showed any regression. Lastly, people not doing force_irqthreads would
still leave a hole open, and it'd be nice to solve it by "default"
rather than depending on user/sysadmin configuration (same argument as
against interrupt affinities: it is another knob for the
sysadmin/designer to configure correctly. Another argument is that not
all interrupts can be threaded / affinitized).

Thanks in advance for reviewing the patch,

 - Joel

* Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison
  2020-05-08 12:34                           ` Aaron Lu
@ 2020-05-14 13:02                             ` Peter Zijlstra
  2020-05-14 22:51                               ` Vineeth Remanan Pillai
                                                 ` (2 more replies)
  0 siblings, 3 replies; 110+ messages in thread
From: Peter Zijlstra @ 2020-05-14 13:02 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Fri, May 08, 2020 at 08:34:57PM +0800, Aaron Lu wrote:
> With this said, I realized a workaround for the issue described above:
> when the core went from 'compatible mode'(step 1-3) to 'incompatible
> mode'(step 4), reset all root level sched entities' vruntime to be the
> same as the core wide min_vruntime. After all, the core is transforming
> from two runqueue mode to single runqueue mode... I think this can solve
> the issue to some extent but I may miss other scenarios.

A little something like so, this syncs min_vruntime when we switch to
single queue mode. This is very much SMT2 only, I got my head in a twist
when thinking about more siblings, I'll have to try again later.

This very much retains the horrible approximation of S we always do.

Also, it is _completely_ untested...

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -102,7 +102,6 @@ static inline int __task_prio(struct tas
 /* real prio, less is less */
 static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 {
-
 	int pa = __task_prio(a), pb = __task_prio(b);
 
 	if (-pa < -pb)
@@ -114,19 +113,8 @@ static inline bool prio_less(struct task
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE)
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -4293,10 +4281,11 @@ static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *next, *max = NULL;
+	int old_active = 0, new_active = 0;
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
-	int i, j, cpu;
 	bool need_sync = false;
+	int i, j, cpu;
 
 	cpu = cpu_of(rq);
 	if (cpu_is_offline(cpu))
@@ -4349,10 +4338,14 @@ pick_next_task(struct rq *rq, struct tas
 		rq_i->core_pick = NULL;
 
 		if (rq_i->core_forceidle) {
+			// XXX is_idle_task(rq_i->curr) && rq_i->nr_running ??
 			need_sync = true;
 			rq_i->core_forceidle = false;
 		}
 
+		if (!is_idle_task(rq_i->curr))
+			old_active++;
+
 		if (i != cpu)
 			update_rq_clock(rq_i);
 	}
@@ -4463,8 +4456,12 @@ next_class:;
 
 		WARN_ON_ONCE(!rq_i->core_pick);
 
-		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
-			rq_i->core_forceidle = true;
+		if (is_idle_task(rq_i->core_pick)) {
+			if (rq_i->nr_running)
+				rq_i->core_forceidle = true;
+		} else {
+			new_active++;
+		}
 
 		if (i == cpu)
 			continue;
@@ -4476,6 +4473,16 @@ next_class:;
 		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
 	}
 
+	/* XXX SMT2 only */
+	if (new_active == 1 && old_active > 1) {
+		/*
+		 * We just dropped into single-rq mode, increment the sequence
+		 * count to trigger the vruntime sync.
+		 */
+		rq->core->core_sync_seq++;
+	}
+	rq->core->core_active = new_active;
+
 done:
 	set_next_task(rq, next);
 	return next;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -386,6 +386,12 @@ is_same_group(struct sched_entity *se, s
 	return NULL;
 }
 
+static inline bool
+is_same_tg(struct sched_entity *se, struct sched_entity *pse)
+{
+	return se->cfs_rq->tg == pse->cfs_rq->tg;
+}
+
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
 	return se->parent;
@@ -394,8 +400,6 @@ static inline struct sched_entity *paren
 static void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
-	int se_depth, pse_depth;
-
 	/*
 	 * preemption test can be made between sibling entities who are in the
 	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
@@ -403,23 +407,16 @@ find_matching_se(struct sched_entity **s
 	 * parent.
 	 */
 
-	/* First walk up until both entities are at same depth */
-	se_depth = (*se)->depth;
-	pse_depth = (*pse)->depth;
-
-	while (se_depth > pse_depth) {
-		se_depth--;
-		*se = parent_entity(*se);
-	}
-
-	while (pse_depth > se_depth) {
-		pse_depth--;
-		*pse = parent_entity(*pse);
-	}
+	/* XXX we now have 3 of these loops, C stinks */
 
 	while (!is_same_group(*se, *pse)) {
-		*se = parent_entity(*se);
-		*pse = parent_entity(*pse);
+		int se_depth = (*se)->depth;
+		int pse_depth = (*pse)->depth;
+
+		if (se_depth <= pse_depth)
+			*pse = parent_entity(*pse);
+		if (se_depth >= pse_depth)
+			*se = parent_entity(*se);
 	}
 }
 
@@ -455,6 +452,12 @@ static inline struct sched_entity *paren
 	return NULL;
 }
 
+static inline bool
+is_same_tg(struct sched_entity *se, struct sched_entity *pse)
+{
+	return true;
+}
+
 static inline void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
@@ -462,6 +465,31 @@ find_matching_se(struct sched_entity **s
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *se_a = &a->se, *se_b = &b->se;
+	struct cfs_rq *cfs_rq_a, *cfs_rq_b;
+	u64 vruntime_a, vruntime_b;
+
+	while (!is_same_tg(se_a, se_b)) {
+		int se_a_depth = se_a->depth;
+		int se_b_depth = se_b->depth;
+
+		if (se_a_depth <= se_b_depth)
+			se_b = parent_entity(se_b);
+		if (se_a_depth >= se_b_depth)
+			se_a = parent_entity(se_a);
+	}
+
+	cfs_rq_a = cfs_rq_of(se_a);
+	cfs_rq_b = cfs_rq_of(se_b);
+
+	vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
+	vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
+
+	return !((s64)(vruntime_a - vruntime_b) <= 0);
+}
+
 static __always_inline
 void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
@@ -6891,6 +6919,18 @@ static void check_preempt_wakeup(struct
 		set_last_buddy(se);
 }
 
+static void core_sync_entity(struct rq *rq, struct cfs_rq *cfs_rq)
+{
+	if (!sched_core_enabled())
+		return;
+
+	if (rq->core->core_sync_seq == cfs_rq->core_sync_seq)
+		return;
+
+	cfs_rq->core_sync_seq = rq->core->core_sync_seq;
+	cfs_rq->core_vruntime = cfs_rq->min_vruntime;
+}
+
 static struct task_struct *pick_task_fair(struct rq *rq)
 {
 	struct cfs_rq *cfs_rq = &rq->cfs;
@@ -6902,6 +6942,14 @@ static struct task_struct *pick_task_fai
 	do {
 		struct sched_entity *curr = cfs_rq->curr;
 
+		/*
+		 * Propagate the sync state down to whatever cfs_rq we need,
+		 * the active cfs_rq's will have been done by
+		 * set_next_task_fair(), the rest is inactive and will not have
+		 * changed due to the current running task.
+		 */
+		core_sync_entity(rq, cfs_rq);
+
 		se = pick_next_entity(cfs_rq, NULL);
 
 		if (curr) {
@@ -10825,7 +10873,8 @@ static void switched_to_fair(struct rq *
 	}
 }
 
-/* Account for a task changing its policy or group.
+/*
+ * Account for a task changing its policy or group.
  *
  * This routine is mostly called to set cfs_rq->curr field when a task
  * migrates between groups/classes.
@@ -10847,6 +10896,9 @@ static void set_next_task_fair(struct rq
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
+		/* snapshot vruntime before using it */
+		core_sync_entity(rq, cfs_rq);
+
 		set_next_entity(cfs_rq, se);
 		/* ensure bandwidth has been allocated on our new cfs_rq */
 		account_cfs_rq_runtime(cfs_rq, 0);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -503,6 +503,10 @@ struct cfs_rq {
 	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
 
+#ifdef CONFIG_SCHED_CORE
+	unsigned int		core_sync_seq;
+	u64			core_vruntime;
+#endif
 	u64			exec_clock;
 	u64			min_vruntime;
 #ifndef CONFIG_64BIT
@@ -1035,12 +1039,15 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
-	bool			core_forceidle;
+	unsigned int		core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
+	unsigned int		core_sync_seq;
+	unsigned int		core_active;
+
 #endif
 };
 
@@ -2592,6 +2599,8 @@ static inline bool sched_energy_enabled(
 
 #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
 
+extern bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+
 #ifdef CONFIG_MEMBARRIER
 /*
  * The scheduler provides memory barriers required by membarrier between:

* FW: [RFC PATCH 00/13] Core scheduling v5
  2020-05-09  3:39   ` Ning, Hongyu
@ 2020-05-14 20:51     ` Gruza, Agata
  0 siblings, 0 replies; 110+ messages in thread
From: Gruza, Agata @ 2020-05-14 20:51 UTC (permalink / raw)
  To: vpillai, naravamudan, jdesfossez, peterz, Tim Chen, mingo, tglx,
	pjt, torvalds
  Cc: vpillai, fweisbec, keescook, kerrnel, pauld, aaron.lwe,
	aubrey.intel, Li, Aubrey, valentin.schneider, mgorman,
	pawan.kumar.gupta, pbonzini, joelaf, joel, linux-kernel


-----Original Message-----
From: linux-kernel-owner@vger.kernel.org <linux-kernel-owner@vger.kernel.org> On Behalf Of Ning, Hongyu
Sent: Friday, May 8, 2020 8:40 PM
To: vpillai@digitalocean.com; naravamudan@digitalocean.com; jdesfossez@digitalocean.com; peterz@infradead.org; Tim Chen <tim.c.chen@linux.intel.com>; mingo@kernel.org; tglx@linutronix.de; pjt@google.com; torvalds@linux-foundation.org
Cc: vpillai@digitalocean.com; fweisbec@gmail.com; keescook@chromium.org; kerrnel@google.com; pauld@redhat.com; aaron.lwe@gmail.com; aubrey.intel@gmail.com; Li, Aubrey <aubrey.li@linux.intel.com>; valentin.schneider@arm.com; mgorman@techsingularity.net; pawan.kumar.gupta@linux.intel.com; pbonzini@redhat.com; joelaf@google.com; joel@joelfernandes.org; linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 00/13] Core scheduling v5


- Test environment:
Intel Xeon Server platform
CPU(s):              192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4

- Kernel under test: 
Core scheduling v5 base
https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y

- Test set based on sysbench 1.1.0-bd4b418:
A: sysbench cpu in cgroup cpu 1 + sysbench mysql in cgroup mysql 1 (192 workload tasks for each cgroup)
B: sysbench cpu in cgroup cpu 1 + sysbench cpu in cgroup cpu 2 + sysbench mysql in cgroup mysql 1 + sysbench mysql in cgroup mysql 2 (192 workload tasks for each cgroup)
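For reference, a sketch of how such tagged cgroups can be set up with this series (assuming the legacy cpu controller mounted at /sys/fs/cgroup/cpu and the `cpu.tag` interface from this patchset; the group names and thread count are illustrative, not the exact commands used in this test):

```shell
# create one cgroup per workload and tag it; tasks in the same tagged
# cgroup are allowed to run concurrently on siblings of a core
mkdir /sys/fs/cgroup/cpu/cg_cpu_1 /sys/fs/cgroup/cpu/cg_mysql_1
echo 1 > /sys/fs/cgroup/cpu/cg_cpu_1/cpu.tag
echo 1 > /sys/fs/cgroup/cpu/cg_mysql_1/cpu.tag

# move the current shell into a tagged group, then start the workload
echo $$ > /sys/fs/cgroup/cpu/cg_cpu_1/tasks
sysbench cpu --threads=192 run
```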

- Test results briefing:
1 Good results:
1.1 For test set A, coresched could achieve the same or better performance than smt_off, for both the cpu workload and the mysql workload
1.2 For test set B, cpu workload, coresched could achieve better performance than smt_off

2 Bad results:
2.1 For test set B, mysql workload, coresched performance is lower than smt_off; potential fairness issue between cpu workloads and mysql workloads
2.2 For test set B, cpu workload, potential fairness issue between the 2 cgroups' cpu workloads

- Test results:
Note: test results in following tables are Tput normalized to default baseline

-- Test set A Tput normalized results:
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
|                    | ****   | default   | coresched   | smt_off   | ***   | default     | coresched     | smt_off     |
+====================+========+===========+=============+===========+=======+=============+===============+=============+
| cgroups            | ****   | cg cpu 1  | cg cpu 1    | cg cpu 1  | ***   | cg mysql 1  | cg mysql 1    | cg mysql 1  |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
| sysbench workload  | ****   | cpu       | cpu         | cpu       | ***   | mysql       | mysql         | mysql       |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
| 192 tasks / cgroup | ****   | 1         | 0.95        | 0.54      | ***   | 1           | 0.92          | 0.97        |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+

-- Test set B Tput normalized results:
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
|                    | ****   | default   | coresched   | smt_off   | ***   | default     | coresched     | smt_off     | **   | default     | coresched     | smt_off     | *   | default     | coresched     | smt_off     |
+====================+========+===========+=============+===========+=======+=============+===============+=============+======+=============+===============+=============+=====+=============+===============+=============+
| cgroups            | ****   | cg cpu 1  | cg cpu 1    | cg cpu 1  | ***   | cg cpu 2    | cg cpu 2      | cg cpu 2    | **   | cg mysql 1  | cg mysql 1    | cg mysql 1  | *   | cg mysql 2  | cg mysql 2    | cg mysql 2  |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
| sysbench workload  | ****   | cpu       | cpu         | cpu       | ***   | cpu         | cpu           | cpu         | **   | mysql       | mysql         | mysql       | *   | mysql       | mysql         | mysql       |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
| 192 tasks / cgroup | ****   | 1         | 0.9         | 0.47      | ***   | 1           | 1.32          | 0.66        | **   | 1           | 0.42          | 0.89        | *   | 1           | 0.42          | 0.89        |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+


> On Date: Wed,  4 Mar 2020 16:59:50 +0000, vpillai <vpillai@digitalocean.com> wrote:
> To: Nishanth Aravamudan <naravamudan@digitalocean.com>, Julien 
> Desfossez <jdesfossez@digitalocean.com>, Peter Zijlstra 
> <peterz@infradead.org>, Tim Chen <tim.c.chen@linux.intel.com>, 
> mingo@kernel.org, tglx@linutronix.de, pjt@google.com, 
> torvalds@linux-foundation.org
> CC: vpillai <vpillai@digitalocean.com>, linux-kernel@vger.kernel.org, 
> fweisbec@gmail.com, keescook@chromium.org, kerrnel@google.com, Phil 
> Auld <pauld@redhat.com>, Aaron Lu <aaron.lwe@gmail.com>, Aubrey Li 
> <aubrey.intel@gmail.com>, aubrey.li@linux.intel.com, Valentin 
> Schneider <valentin.schneider@arm.com>, Mel Gorman 
> <mgorman@techsingularity.net>, Pawan Gupta 
> <pawan.kumar.gupta@linux.intel.com>, Paolo Bonzini 
> <pbonzini@redhat.com>, Joel Fernandes <joelaf@google.com>, 
> joel@joelfernandes.org
> 
> 
> Fifth iteration of the Core-Scheduling feature.
> 
> Core scheduling is a feature that only allows trusted tasks to run 
> concurrently on cpus sharing compute resources(eg: hyperthreads on a 
> core). The goal is to mitigate the core-level side-channel attacks 
> without requiring to disable SMT (which has a significant impact on 
> performance in some situations). So far, the feature mitigates 
> user-space to user-space attacks but not user-space to kernel attack, 
> when one of the hardware thread enters the kernel (syscall, interrupt etc).
> 
> By default, the feature doesn't change any of the current scheduler 
> behavior. The user decides which tasks can run simultaneously on the 
> same core (for now by having them in the same tagged cgroup). When a 
> tag is enabled in a cgroup and a task from that cgroup is running on a 
> hardware thread, the scheduler ensures that only idle or trusted tasks 
> run on the other sibling(s). Besides security concerns, this feature 
> can also be beneficial for RT and performance applications where we 
> want to control how tasks make use of SMT dynamically.
> 
> This version was focusing on performance and stability. Couple of 
> crashes related to task tagging and cpu hotplug path were fixed.
> This version also improves the performance considerably by making task 
> migration and load balancing coresched aware.
> 
> In terms of performance, the major difference since the last iteration 
> is that now even IO-heavy and mixed-resources workloads are less 
> impacted by core-scheduling than by disabling SMT. Both host-level and 
> VM-level benchmarks were performed. Details in:
> https://lkml.org/lkml/2020/2/12/1194
> https://lkml.org/lkml/2019/11/1/269
> 
> v5 is rebased on top of 5.5.5(449718782a46) 
> https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5
> .y
> 


----------------------------------------------------------------------
ABOUT:
----------------------------------------------------------------------
Hello,

Core scheduling is required to protect against leakage of sensitive
data allocated on a sibling thread. Our goal is to measure the
performance impact of core scheduling across different workloads and
show how it has evolved over time. Below you will find data based on
core-sched (v5). The attached PDF contains the system configuration
setup as well as further explanation of the findings.

----------------------------------------------------------------------
BENCHMARKS:
----------------------------------------------------------------------
- hammerdb      : database benchmarking application
- sysbench-cpu	: multi-threaded cpu benchmark
- sysbench-mysql: multi-threaded benchmark that tests open source DBMS
- build-kernel	: benchmark that is used to build Linux kernel
 

----------------------------------------------------------------------      
PERFORMANCE IMPACT:
----------------------------------------------------------------------

+--------------------+--------+--------------+-------------+-------------------+--------------------+----------------------+
| benchmark          | ****   | # of cgroups | overcommit  | baseline + smt_on | coresched + smt_on | baseline + smt_off   |
+====================+========+==============+=============+===================+====================+======================+
| hammerdb           | ****   | 2cgroups     | 2x          | 1                 | 0.96               | 0.87                 |
+--------------------+--------+--------------+-------------+-------------------+--------------------+----------------------+
| sysbench-cpu       | ****   | 2cgroups     | 2x          | 1                 | 0.95               | 0.54                 |
| sysbench-mysql     | ****   |              |             | 1                 | 0.90               | 0.47                 |
+--------------------+--------+--------------+-------------+-------------------+--------------------+----------------------+
| sysbench-cpu       | ****   | 4cgroups     | 4x          | 1                 | 0.90               | 0.47                 |
| sysbench-cpu       | ****   |              |             | 1                 | 1.32               | 0.66                 |
| sysbench-mysql     | ****   |              |             | 1                 | 0.42               | 0.89                 |
| sysbench-mysql     | ****   |              |             | 1                 | 0.42               | 0.89                 |
+--------------------+--------+--------------+-------------+-------------------+--------------------+----------------------+
| kernel-build       | ****   | 2cgroups     | 0.5x        | 1                 | 1                  | 0.93                 |
|                    | ****   |              | 1x          | 1                 | 0.99               | 0.92                 |
|                    | ****   |              | 2x          | 1                 | 0.98               | 0.91                 |
+--------------------+--------+--------------+-------------+-------------------+--------------------+----------------------+


----------------------------------------------------------------------
TAKE AWAYS:
----------------------------------------------------------------------
1. Core scheduling performs better than turning off HT.
2. The impact of core scheduling depends on the workload and thread
scheduling intensity.
3. Core scheduling requires cgroups. Tasks from the same cgroup are
scheduled on the same core.
4. In certain situations, core scheduling will introduce an uneven load
distribution between multiple workload types. In such a case, a bias
towards the cpu-intensive workload is expected.
5. Load balancing is not perfect. It needs more work.

Many thanks,

--Agata



[-- Attachment #2: LKML_core_sched_v5.5.y.pdf --]
[-- Type: application/pdf, Size: 360252 bytes --]

* Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison
  2020-05-14 13:02                             ` Peter Zijlstra
@ 2020-05-14 22:51                               ` Vineeth Remanan Pillai
  2020-05-15 10:38                                 ` Peter Zijlstra
  2020-05-16  3:42                               ` Aaron Lu
  2020-06-08  1:41                               ` Ning, Hongyu
  2 siblings, 1 reply; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-05-14 22:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Aaron Lu, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

Hi Peter,

On Thu, May 14, 2020 at 9:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> A little something like so, this syncs min_vruntime when we switch to
> single queue mode. This is very much SMT2 only, I got my head in twist
> when thikning about more siblings, I'll have to try again later.
>
Thanks for the quick patch! :-)

For SMT-n, would it work if we sync vruntime when at least one sibling is
forced idle? Since force_idle is for all the rqs, I think it would
be correct to sync the vruntime if at least one cpu is forced idle.

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> -               if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> -                       rq_i->core_forceidle = true;
> +               if (is_idle_task(rq_i->core_pick)) {
> +                       if (rq_i->nr_running)
> +                               rq_i->core_forceidle = true;
> +               } else {
> +                       new_active++;
I think we need to reset new_active on restarting the selection.

> +               }
>
>                 if (i == cpu)
>                         continue;
> @@ -4476,6 +4473,16 @@ next_class:;
>                 WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
>         }
>
> +       /* XXX SMT2 only */
> +       if (new_active == 1 && old_active > 1) {
As I mentioned above, would it be correct to check if at least one sibling
is forced_idle? Something like:
if (cpumask_weight(cpu_smt_mask(cpu)) == old_active && new_active < old_active)

> +               /*
> +                * We just dropped into single-rq mode, increment the sequence
> +                * count to trigger the vruntime sync.
> +                */
> +               rq->core->core_sync_seq++;
> +       }
> +       rq->core->core_active = new_active;
core_active seems to be unused.

> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +       struct sched_entity *se_a = &a->se, *se_b = &b->se;
> +       struct cfs_rq *cfs_rq_a, *cfa_rq_b;
> +       u64 vruntime_a, vruntime_b;
> +
> +       while (!is_same_tg(se_a, se_b)) {
> +               int se_a_depth = se_a->depth;
> +               int se_b_depth = se_b->depth;
> +
> +               if (se_a_depth <= se_b_depth)
> +                       se_b = parent_entity(se_b);
> +               if (se_a_depth >= se_b_depth)
> +                       se_a = parent_entity(se_a);
> +       }
> +
> +       cfs_rq_a = cfs_rq_of(se_a);
> +       cfs_rq_b = cfs_rq_of(se_b);
> +
> +       vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> +       vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
Should we be using core_vruntime conditionally? Should it be min_vruntime for
default comparisons and core_vruntime during force_idle?

Thanks,
Vineeth

* Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison
  2020-05-14 22:51                               ` Vineeth Remanan Pillai
@ 2020-05-15 10:38                                 ` Peter Zijlstra
  2020-05-15 10:43                                   ` Peter Zijlstra
  2020-05-15 14:24                                   ` Vineeth Remanan Pillai
  0 siblings, 2 replies; 110+ messages in thread
From: Peter Zijlstra @ 2020-05-15 10:38 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Aaron Lu, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Thu, May 14, 2020 at 06:51:27PM -0400, Vineeth Remanan Pillai wrote:
> On Thu, May 14, 2020 at 9:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > A little something like so, this syncs min_vruntime when we switch to
> > single queue mode. This is very much SMT2 only, I got my head in twist
> > when thikning about more siblings, I'll have to try again later.
> >
> Thanks for the quick patch! :-)
> 
> For SMT-n, would it work if we sync vruntime when at least one sibling is
> forced idle? Since force_idle is for all the rqs, I think it would
> be correct to sync the vruntime if at least one cpu is forced idle.

It's complicated ;-)

So this sync is basically a relative reset of S to 0.

So with 2 queues, when one goes idle, we drop them both to 0 and one
then increases due to not being idle, and the idle one builds up lag to
get re-elected. So far so simple, right?

When there's 3, we can have the situation where 2 run and one is idle,
we sync to 0 and let the idle one build up lag to get re-election. Now
suppose another one also drops idle. At this point dropping all to 0
again would destroy the built-up lag from the queue that was already
idle, not good.

So instead of syncing everything, we can:

  less := !((s64)(s_a - s_b) <= 0)

  (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
                            == v_a - (v_b + S_a - S_b)

IOW, we can recast the (lag) comparison as a one-sided difference.
So then, instead of syncing the whole queue, we can sync the idle queue
against the active queue using the S_a - S_b offset at the point where we sync.

(XXX consider the implication of living in a cyclic group: N / 2^n N)

This gives us a means of syncing single queues against the active queue,
and lets already idle queues preserve their built-up lag.

Of course, then we get the situation where there's 2 active and one
going idle, who do we pick to sync against? Theory would have us sync
against the combined S, but as we've already demonstrated, there is no
such thing in infeasible weight scenarios.

One thing I've considered; and this is where that core_active rudiment
came from, is having active queues sync up between themselves after
every tick. This limits the divergence observed due to work
conservation.

On top of that, we can improve upon things by moving away from our
horrible (10) hack and moving to (9) and employing (13) here.

Anyway, I got partway through that in the past days, but then my head
hurt. I'll consider it some more :-)

> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > -               if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> > -                       rq_i->core_forceidle = true;
> > +               if (is_idle_task(rq_i->core_pick)) {
> > +                       if (rq_i->nr_running)
> > +                               rq_i->core_forceidle = true;
> > +               } else {
> > +                       new_active++;
> I think we need to reset new_active on restarting the selection.

But this loop is after selection has been done; we don't modify
new_active during selection.

> > +               /*
> > +                * We just dropped into single-rq mode, increment the sequence
> > +                * count to trigger the vruntime sync.
> > +                */
> > +               rq->core->core_sync_seq++;
> > +       }
> > +       rq->core->core_active = new_active;
> core_active seems to be unused.

Correct; that's rudiments from an SMT-n attempt.

> > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> > +{
> > +       struct sched_entity *se_a = &a->se, *se_b = &b->se;
> > +       struct cfs_rq *cfs_rq_a, *cfs_rq_b;
> > +       u64 vruntime_a, vruntime_b;
> > +
> > +       while (!is_same_tg(se_a, se_b)) {
> > +               int se_a_depth = se_a->depth;
> > +               int se_b_depth = se_b->depth;
> > +
> > +               if (se_a_depth <= se_b_depth)
> > +                       se_b = parent_entity(se_b);
> > +               if (se_a_depth >= se_b_depth)
> > +                       se_a = parent_entity(se_a);
> > +       }
> > +
> > +       cfs_rq_a = cfs_rq_of(se_a);
> > +       cfs_rq_b = cfs_rq_of(se_b);
> > +
> > +       vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> > +       vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> Should we be using core_vruntime conditionally? should it be min_vruntime for
> default comparisons and core_vruntime during force_idle?

At the very least it should be min_vruntime when cfs_rq_a == cfs_rq_b,
ie. when we're on the same CPU.

For the other case I was considering that tick based active sync, but
never got that finished and admittedly it all looks a bit weird. But I
figured I'd send it out so we can at least advance the discussion.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -469,7 +469,7 @@ bool cfs_prio_less(struct task_struct *a
 {
 	struct sched_entity *se_a = &a->se, *se_b = &b->se;
 	struct cfs_rq *cfs_rq_a, *cfs_rq_b;
-	u64 vruntime_a, vruntime_b;
+	u64 s_a, s_b, S_a, S_b;
 
 	while (!is_same_tg(se_a, se_b)) {
 		int se_a_depth = se_a->depth;
@@ -484,10 +484,16 @@ bool cfs_prio_less(struct task_struct *a
 	cfs_rq_a = cfs_rq_of(se_a);
 	cfs_rq_b = cfs_rq_of(se_b);
 
-	vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
-	vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
+	S_a = cfs_rq_a->core_vruntime;
+	S_b = cfs_rq_b->core_vruntime;
 
-	return !((s64)(vruntime_a - vruntime_b) <= 0);
+	if (cfs_rq_a == cfs_rq_b)
+		S_a = S_b = cfs_rq_a->min_vruntime;
+
+	s_a = se_a->vruntime - S_a;
+	s_b = se_b->vruntime - S_b;
+
+	return !((s64)(s_a - s_b) <= 0);
 }
 
 static __always_inline


* Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison
  2020-05-15 10:38                                 ` Peter Zijlstra
@ 2020-05-15 10:43                                   ` Peter Zijlstra
  2020-05-15 14:24                                   ` Vineeth Remanan Pillai
  1 sibling, 0 replies; 110+ messages in thread
From: Peter Zijlstra @ 2020-05-15 10:43 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Aaron Lu, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Fri, May 15, 2020 at 12:38:44PM +0200, Peter Zijlstra wrote:
>   less := !((s64)(s_a - s_b) <= 0)
> 
>   (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
>                             == v_a - (v_b + S_a - S_b)
> 

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -469,7 +469,7 @@ bool cfs_prio_less(struct task_struct *a
>  {
>  	struct sched_entity *se_a = &a->se, *se_b = &b->se;
>  	struct cfs_rq *cfs_rq_a, *cfs_rq_b;
> -	u64 vruntime_a, vruntime_b;
> +	u64 s_a, s_b, S_a, S_b;
>  
>  	while (!is_same_tg(se_a, se_b)) {
>  		int se_a_depth = se_a->depth;
> @@ -484,10 +484,16 @@ bool cfs_prio_less(struct task_struct *a
>  	cfs_rq_a = cfs_rq_of(se_a);
>  	cfs_rq_b = cfs_rq_of(se_b);
>  
> -	vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> -	vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> +	S_a = cfs_rq_a->core_vruntime;
> +	S_b = cfs_rq_b->core_vruntime;
>  
> -	return !((s64)(vruntime_a - vruntime_b) <= 0);
> +	if (cfs_rq_a == cfs_rq_b)
> +		S_a = S_b = cfs_rq_a->min_vruntime;
> +
> +	s_a = se_a->vruntime - S_a;
> +	s_b = se_b->vruntime - S_b;
> +
> +	return !((s64)(s_a - s_b) <= 0);
>  }

Clearly I'm not awake yet; 's/s_/l_/g', 's/v_/s_/g', IOW:

  l_a = s_a - S_a




* Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison
  2020-05-15 10:38                                 ` Peter Zijlstra
  2020-05-15 10:43                                   ` Peter Zijlstra
@ 2020-05-15 14:24                                   ` Vineeth Remanan Pillai
  1 sibling, 0 replies; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-05-15 14:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Aaron Lu, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Fri, May 15, 2020 at 6:39 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> It's complicated ;-)
>
> So this sync is basically a relative reset of S to 0.
>
> So with 2 queues, when one goes idle, we drop them both to 0 and one
> then increases due to not being idle, and the idle one builds up lag to
> get re-elected. So far so simple, right?
>
> When there's 3, we can have the situation where 2 run and one is idle,
> we sync to 0 and let the idle one build up lag to get re-elected. Now
> suppose another one also drops idle. At this point dropping all to 0
> again would destroy the built-up lag from the queue that was already
> idle, not good.
>
Thanks for the clarification :-).

I was suggesting an idea of corewide force_idle. We sync the core_vruntime
on first force_idle of a sibling in the core and start using core_vruntime
for priority comparison from then on. That way, we don't reset the lag on
every force_idle and the lag builds up from the first sibling that was
forced_idle. I think this would work with infeasible weights as well,
but I need to think more to see if it would break. A sample check to enter
this core-wide force_idle state is:
(cpumask_weight(cpu_smt_mask(cpu)) == old_active && new_active < old_active)

And we exit the core-wide force_idle state when the last sibling goes out
of force_idle, and we can start using min_vruntime for priority comparison
from then on.

When there is a cookie match on all siblings, we don't do a priority
comparison now. But I think we need to do the priority comparison for
cookie matches as well, so that we update 'max' in the loop. And for this
comparison when nothing is forced idle, it should be fine to use
min_vruntime. Updating 'max' in the loop on cookie matches is not really
needed for SMT2, but would be needed for SMT-n.

This is just a wild idea on top of your patches. It might not be accurate
in all cases and I need to think more about the corner cases. I thought
I'd think out loud here :-)

> So instead of syncing everything, we can:
>
>   less := !((s64)(s_a - s_b) <= 0)
>
>   (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
>                             == v_a - (v_b + S_a - S_b)
>
> IOW, we can recast the (lag) comparison as a one-sided difference.
> So then, instead of syncing the whole queue, we can sync the idle queue
> against the active queue using the S_a - S_b offset at the point where we sync.
>
> (XXX consider the implication of living in a cyclic group: N / 2^n N)
>
> This gives us a means of syncing single queues against the active queue,
> and lets already idle queues preserve their built-up lag.
>
> Of course, then we get the situation where there's 2 active and one
> going idle, who do we pick to sync against? Theory would have us sync
> against the combined S, but as we've already demonstrated, there is no
> such thing in infeasible weight scenarios.
>
> One thing I've considered; and this is where that core_active rudiment
> came from, is having active queues sync up between themselves after
> every tick. This limits the divergence observed due to work
> conservation.
>
> On top of that, we can improve upon things by moving away from our
> horrible (10) hack and moving to (9) and employing (13) here.
>
> Anyway, I got partway through that in the past days, but then my head
> hurt. I'll consider it some more :-)
This sounds much better and a more accurate approach than the one I
mentioned above. Please share the code when you have it in some form :-)

>
> > > +                       new_active++;
> > I think we need to reset new_active on restarting the selection.
>
> But this loop is after selection has been done; we don't modify
> new_active during selection.
My bad, sorry about this false alarm!

> > > +
> > > +       vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> > > +       vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> > Should we be using core_vruntime conditionally? should it be min_vruntime for
> > default comparisons and core_vruntime during force_idle?
>
> At the very least it should be min_vruntime when cfs_rq_a == cfs_rq_b,
> ie. when we're on the same CPU.
>
yes, this makes sense.

The issue I was thinking about is: when there is no force_idle and
all siblings run compatible tasks for a while, min_vruntime progresses
but core_vruntime lags behind. When a new task gets enqueued, its
vruntime is based on min_vruntime, but it might then be treated unfairly
during the comparison.

Consider a small example of two rqs rq1 and rq2.
rq1->cfs->min_vruntime = 1000
rq2->cfs->min_vruntime = 2000

During a force_idle, core_vruntime gets synced and

rq1->cfs->core_vruntime = 1000
rq2->cfs->core_vruntime = 2000

Now, suppose the core is out of force_idle and runs two compatible tasks
for a while, where the task on rq1 has more weight. min_vruntime progresses
on both, but slowly on rq1. Say the progress looks like:
rq1->cfs->min_vruntime = 1200, se1->vruntime = 1200
rq2->cfs->min_vruntime = 2500, se2->vruntime = 2500

If a new incompatible task (se3) gets enqueued to rq2, its vruntime would
be based on rq2's min_vruntime, say:
se3->vruntime = 2500

During our priority comparison, lag would be:
l_se1 = 200
l_se3 = 500

So se1 will get selected and run with se2 until its lag catches up with
se3's lag (even if se3 has more weight than se1).

This is a hypothetical situation, but I think it can happen. And if we use
min_vruntime for the comparison when there is no force_idle, we could
avoid this. What do you think?

I didn't clearly understand the tick-based active sync, but I guess it
would probably be a better fix for this problem.

Thanks,
Vineeth


* Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison
  2020-05-14 13:02                             ` Peter Zijlstra
  2020-05-14 22:51                               ` Vineeth Remanan Pillai
@ 2020-05-16  3:42                               ` Aaron Lu
  2020-05-22  9:40                                 ` Aaron Lu
  2020-06-08  1:41                               ` Ning, Hongyu
  2 siblings, 1 reply; 110+ messages in thread
From: Aaron Lu @ 2020-05-16  3:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Thu, May 14, 2020 at 03:02:48PM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 08:34:57PM +0800, Aaron Lu wrote:
> > With this said, I realized a workaround for the issue described above:
> > when the core went from 'compatible mode'(step 1-3) to 'incompatible
> > mode'(step 4), reset all root level sched entities' vruntime to be the
> > same as the core wide min_vruntime. After all, the core is transforming
> > from two runqueue mode to single runqueue mode... I think this can solve
> > the issue to some extent but I may miss other scenarios.
> 
> A little something like so, this syncs min_vruntime when we switch to
> single queue mode. This is very much SMT2 only, I got my head in twist
> when thinking about more siblings, I'll have to try again later.

Thanks a lot for the patch, I now see that "there is no need to adjust
every se's vruntime". :-)

> This very much retains the horrible approximation of S we always do.
> 
> Also, it is _completely_ untested...

I've been testing it.

One problem below.

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4293,10 +4281,11 @@ static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	struct task_struct *next, *max = NULL;
> +	int old_active = 0, new_active = 0;
>  	const struct sched_class *class;
>  	const struct cpumask *smt_mask;
> -	int i, j, cpu;
>  	bool need_sync = false;
> +	int i, j, cpu;
>  
>  	cpu = cpu_of(rq);
>  	if (cpu_is_offline(cpu))
> @@ -4349,10 +4338,14 @@ pick_next_task(struct rq *rq, struct tas
>  		rq_i->core_pick = NULL;
>  
>  		if (rq_i->core_forceidle) {
> +			// XXX is_idle_task(rq_i->curr) && rq_i->nr_running ??
>  			need_sync = true;
>  			rq_i->core_forceidle = false;
>  		}
>  
> +		if (!is_idle_task(rq_i->curr))
> +			old_active++;
> +
>  		if (i != cpu)
>  			update_rq_clock(rq_i);
>  	}
> @@ -4463,8 +4456,12 @@ next_class:;
>  
>  		WARN_ON_ONCE(!rq_i->core_pick);
>  
> -		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> -			rq_i->core_forceidle = true;
> +		if (is_idle_task(rq_i->core_pick)) {
> +			if (rq_i->nr_running)
> +				rq_i->core_forceidle = true;
> +		} else {
> +			new_active++;
> +		}
>  
>  		if (i == cpu)
>  			continue;
> @@ -4476,6 +4473,16 @@ next_class:;
>  		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
>  	}
>  
> +	/* XXX SMT2 only */
> +	if (new_active == 1 && old_active > 1) {

There is a case where an incompatible task appears but we fail to 'drop
into single-rq mode' per the above condition check. The TL;DR is: when
there is a task sitting on the sibling rq with the same cookie as
'max', new_active will be 2 instead of 1, and that makes us miss
the chance to sync the core min_vruntime.

This is how it happens:
1) 2 tasks of the same cgroup with different weight running on 2 siblings,
   say cg0_A with weight 1024 bound at cpu0 and cg0_B with weight 2 bound
   at cpu1(assume cpu0 and cpu1 are siblings);
2) Since new_active == 2, we didn't trigger min_vruntime sync. For
   simplicity, let's assume both siblings' root cfs_rq's min_vruntime and
   core_vruntime are all at 0 now;
3) let the two tasks run a while;
4) a new task cg1_C of another cgroup gets queued on cpu1. Since cpu1's
   existing task has a very small weight, its cfs_rq's min_vruntime can
   be much larger than cpu0's cfs_rq min_vruntime. So cg1_C's vruntime is
   much larger than cg0_A's and the 'max' of the core wide task
   selection goes to cg0_A;
5) Now I suppose we should drop into single-rq mode and by doing a sync
   of core min_vruntime, cg1_C's turn shall come. But the problem is, our
   current selection logic prefers not to waste CPU time, so after deciding
   cg0_A as the 'max', the sibling will also do a cookie_pick() and
   get cg0_B to run. This is where the problem arises: new_active is 2
   instead of the expected 1.
6) Since we didn't sync the core min_vruntime, the newly queued
   cg1_C will wait a long time before cg0_A's vruntime catches up.

One naive way to precisely determine when to drop into single-rq mode is
to track how many tasks of a particular tag exist and use that to
decide whether the core is in compatible mode (all tasks belong to the same
cgroup, IOW, have the same core_cookie) or not and act accordingly,
except that: does this sound too complex and inefficient?...

> +		/*
> +		 * We just dropped into single-rq mode, increment the sequence
> +		 * count to trigger the vruntime sync.
> +		 */
> +		rq->core->core_sync_seq++;
> +	}
> +	rq->core->core_active = new_active;
> +
>  done:
>  	set_next_task(rq, next);
>  	return next;


* [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (16 preceding siblings ...)
  2020-05-10 23:46 ` [PATCH RFC] Add support for core-wide protection of IRQ and softirq Joel Fernandes (Google)
@ 2020-05-20 22:26 ` Joel Fernandes (Google)
  2020-05-21  4:09   ` [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail) benbjiang(蒋彪)
                     ` (2 more replies)
  2020-05-20 22:37 ` [PATCH RFC v2] Add support for core-wide protection of IRQ and softirq Joel Fernandes (Google)
                   ` (2 subsequent siblings)
  20 siblings, 3 replies; 110+ messages in thread
From: Joel Fernandes (Google) @ 2020-05-20 22:26 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: Joel Fernandes (Google),
	vpillai, linux-kernel, fweisbec, keescook, kerrnel, Phil Auld,
	Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes

Add a per-thread core scheduling interface which allows a thread to tag
itself and enable core scheduling. Based on discussion at OSPM with
maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
 1 - enable core scheduling for the task.
 0 - disable core scheduling for the task.

Special cases:
(1)
The core-scheduling patchset contains a CGroup interface as well. In
order for us to respect users of that interface, we avoid overriding the
tag of a CGroup-tagged task, because the task would become inconsistent
with its CGroup tag; instead we return -EBUSY.

(2)
If a task is prctl-tagged, allow the CGroup interface to override
the task's tag.

ChromeOS will use core-scheduling to securely enable hyperthreading.
This cuts down the keypress latency in Google docs from 150ms to 50ms
while improving the camera streaming frame rate by ~3%.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched.h      |  6 ++++
 include/uapi/linux/prctl.h |  3 ++
 kernel/sched/core.c        | 57 ++++++++++++++++++++++++++++++++++++++
 kernel/sys.c               |  3 ++
 4 files changed, 69 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fe6ae59fcadbe..8a40a093aa2ca 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1986,6 +1986,12 @@ static inline void rseq_execve(struct task_struct *t)
 
 #endif
 
+#ifdef CONFIG_SCHED_CORE
+int task_set_core_sched(int set, struct task_struct *tsk);
+#else
+static inline int task_set_core_sched(int set, struct task_struct *tsk) { return -ENOTSUPP; }
+#endif
+
 void __exit_umh(struct task_struct *tsk);
 
 static inline void exit_umh(struct task_struct *tsk)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e362..dba0c70f9cce6 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,7 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Core scheduling per-task interface */
+#define PR_SET_CORE_SCHED		59
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 684359ff357e7..780514d03da47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3320,6 +3320,13 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 #ifdef CONFIG_SCHED_CORE
 	RB_CLEAR_NODE(&p->core_node);
+
+	/*
+	 * If task is using prctl(2) for tagging, do the prctl(2)-style tagging
+	 * for the child as well.
+	 */
+	if (current->core_cookie && ((unsigned long)current == current->core_cookie))
+		task_set_core_sched(1, p);
 #endif
 	return 0;
 }
@@ -7857,6 +7864,56 @@ void __cant_sleep(const char *file, int line, int preempt_offset)
 EXPORT_SYMBOL_GPL(__cant_sleep);
 #endif
 
+#ifdef CONFIG_SCHED_CORE
+
+/* Ensure that all siblings have rescheduled once */
+static int task_set_core_sched_stopper(void *data)
+{
+	return 0;
+}
+
+int task_set_core_sched(int set, struct task_struct *tsk)
+{
+	if (!tsk)
+		tsk = current;
+
+	if (set > 1)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	/*
+	 * If cookie was set previously, return -EBUSY if either of the
+	 * following are true:
+	 * 1. Task was previously tagged by CGroup method.
+	 * 2. Task or its parent were tagged by prctl().
+	 *
+	 * Note that, if CGroup tagging is done after prctl(), then that would
+	 * override the cookie. However, if prctl() is done after task was
+	 * added to tagged CGroup, then the prctl() returns -EBUSY.
+	 */
+	if (!!tsk->core_cookie == set) {
+		if ((tsk->core_cookie == (unsigned long)tsk) ||
+		    (tsk->core_cookie == (unsigned long)tsk->sched_task_group)) {
+			return -EBUSY;
+		}
+	}
+
+	if (set)
+		sched_core_get();
+
+	tsk->core_cookie = set ? (unsigned long)tsk : 0;
+
+	stop_machine(task_set_core_sched_stopper, NULL, NULL);
+
+	if (!set)
+		sched_core_put();
+
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_MAGIC_SYSRQ
 void normalize_rt_tasks(void)
 {
diff --git a/kernel/sys.c b/kernel/sys.c
index d325f3ab624a9..5c3bcf40dcb34 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2514,6 +2514,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_SET_CORE_SCHED:
+		error = task_set_core_sched(arg2, NULL);
+		break;
 	default:
 		error = -EINVAL;
 		break;
-- 
2.26.2.761.g0e0b3e54be-goog



* [PATCH RFC v2] Add support for core-wide protection of IRQ and softirq
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (17 preceding siblings ...)
  2020-05-20 22:26 ` [PATCH RFC] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
@ 2020-05-20 22:37 ` Joel Fernandes (Google)
  2020-05-20 22:48 ` [PATCH RFC] sched: Use sched-RCU in core-scheduling balancing logic Joel Fernandes (Google)
  2020-06-25 20:12 ` [RFC PATCH 00/13] Core scheduling v5 Vineeth Remanan Pillai
  20 siblings, 0 replies; 110+ messages in thread
From: Joel Fernandes (Google) @ 2020-05-20 22:37 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: Joel Fernandes (Google),
	vpillai, linux-kernel, fweisbec, keescook, kerrnel, Phil Auld,
	Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Tim Chen,
	Paul E . McKenney

With the current core scheduling patchset, non-threaded IRQ and softirq
victims can leak data from their hyperthread to a sibling hyperthread
running an attacker.

For MDS, it is possible for the IRQ and softirq handlers to leak data to
either host or guest attackers. For L1TF, it is possible to leak to
guest attackers. There is no possible mitigation involving flushing of
buffers to avoid this, since the execution of the attacker and the victim
happens concurrently on 2 or more HTs.

The solution in this patch is to monitor the outer-most core-wide
irq_enter() and irq_exit() executed by any sibling. In between these
two, we mark the core to be in a special core-wide IRQ state.

In the IRQ entry, if we detect that the sibling is running untrusted
code, we send a reschedule IPI so that the sibling transitions through
the sibling's irq_exit() to do any waiting there, till the IRQ being
protected finishes.

We also monitor the per-CPU outer-most irq_exit(). If during the per-cpu
outer-most irq_exit(), the core is still in the special core-wide IRQ
state, we perform a busy-wait till the core exits this state. This
combination of per-cpu and core-wide IRQ states helps to handle any
combination of irq_entry()s and irq_exit()s happening on all of the
siblings of the core in any order.

Lastly, we also check in the schedule loop if we are about to schedule
an untrusted process while the core is in such a state. This is possible
if a trusted thread enters the scheduler by way of yielding CPU. This
would involve no transitions through the irq_exit() point to do any
waiting, so we have to explicitly do the waiting there.

Every attempt is made to avoid unnecessary busy-waiting, and in
testing on real-world ChromeOS usecases, it has not shown a performance
drop. In ChromeOS, with this and the rest of the core scheduling
patchset, we see around a 300% improvement in key press latencies into
Google docs when camera streaming is running simultaneously (90th
percentile latency of ~150ms drops to ~50ms).

Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Co-developed-by: Vineeth Pillai <vpillai@digitalocean.com>
Signed-off-by: Vineeth Pillai <vpillai@digitalocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

---
If you like some pictures of the cases handled by this patch, please
see the OSPM slide deck (the below link jumps straight to relevant
slides - about 6-7 of them in total): https://bit.ly/2zvzxWk

v1->v2:
  Fixed a bug where softirq was causing deadlock (thanks Vineeth/Julien)

The issue was because of the following flow:

On CPU0:
local_bh_enable()
  ->  Enter softirq
   -> Softirq takes a lock.
   -> <new Interrupt received during softirq>
   -> New interrupt's irq_exit() : Wait since it is not outermost
                                   core-wide irq_exit().

On CPU1:
 <interrupt received>
irq_enter()  -> Enter the core wide IRQ state.
    <ISR raises a softirq which will run from irq_exit().
irq_exit()   ->
    -> enters softirq
         -> softirq tries to take a lock and blocks.

So it is an A->B and B->A deadlock.
A  = Enter the core-wide IRQ state or wait for it to end.
B =  Acquire a lock during softirq or wait for it to be released.

The fix is to enter the core-wide IRQ state even when entering through
the local_bh_enable -> softirq path (when there is no hardirq
context), which basically becomes:

On CPU0:
local_bh_enable()
    (Fix: Call sched_core_irq_enter() --> similar to irq_enter()).
  ->  Enter softirq
   -> Softirq takes a lock.
     -> <new Interrupt received during softirq> -> irq_enter()
     -> New interrupt's irq_exit()   (Will not wait since we are inner
        irq_exit()).

 include/linux/sched.h |   8 +++
 kernel/sched/core.c   | 159 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |   3 +
 kernel/softirq.c      |  12 ++++
 4 files changed, 182 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 710e9a8956007..fe6ae59fcadbe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2018,4 +2018,12 @@ int sched_trace_rq_cpu(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+void sched_core_irq_enter(void);
+void sched_core_irq_exit(void);
+#else
+static inline void sched_core_irq_enter(void) { }
+static inline void sched_core_irq_exit(void) { }
+#endif
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 21c640170323b..684359ff357e7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4391,6 +4391,153 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
+/*
+ * Helper function to pause the caller's hyperthread until the core exits the
+ * core-wide IRQ state. Obviously the CPU calling this function should not be
+ * responsible for the core being in the core-wide IRQ state otherwise it will
+ * deadlock. This function should be called from irq_exit() and from schedule().
+ * It is up to the callers to decide if calling here is necessary.
+ */
+static inline void sched_core_sibling_irq_pause(struct rq *rq)
+{
+	/*
+	 * Wait till the core of this HT is not in a core-wide IRQ state.
+	 *
+	 * Pair with smp_store_release() in sched_core_irq_exit().
+	 */
+	while (smp_load_acquire(&rq->core->core_irq_nest) > 0)
+		cpu_relax();
+}
+
+/*
+ * Enter the core-wide IRQ state. Sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_irq_exit() is called. Every attempt to
+ * avoid sending useless IPIs is made. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_irq_enter(void)
+{
+	int i, cpu = smp_processor_id();
+	struct rq *rq = cpu_rq(cpu);
+	const struct cpumask *smt_mask;
+
+	if (!sched_core_enabled(rq))
+		return;
+
+	/* Count irq_enter() calls received without irq_exit() on this CPU. */
+	rq->core_this_irq_nest++;
+
+	/* If not outermost irq_enter(), do nothing. */
+	if (WARN_ON_ONCE(rq->core_this_irq_nest == UINT_MAX) ||
+	    rq->core_this_irq_nest != 1)
+		return;
+
+	raw_spin_lock(rq_lockp(rq));
+	smt_mask = cpu_smt_mask(cpu);
+
+	/* Contribute this CPU's irq_enter() to core-wide irq_enter() count. */
+	WRITE_ONCE(rq->core->core_irq_nest, rq->core->core_irq_nest + 1);
+	if (WARN_ON_ONCE(rq->core->core_irq_nest == UINT_MAX))
+		goto unlock;
+
+	if (rq->core_pause_pending) {
+		/*
+		 * Do nothing more since we are in a 'reschedule IPI' sent from
+		 * another sibling. That sibling would have sent IPIs to all of
+		 * the HTs.
+		 */
+		goto unlock;
+	}
+
+	/*
+	 * If we are not the first ones on the core to enter core-wide IRQ
+	 * state, do nothing.
+	 */
+	if (rq->core->core_irq_nest > 1)
+		goto unlock;
+
+	/* Do nothing more if the core is not tagged. */
+	if (!rq->core->core_cookie)
+		goto unlock;
+
+	for_each_cpu(i, smt_mask) {
+		struct rq *srq = cpu_rq(i);
+
+		if (i == cpu || cpu_is_offline(i))
+			continue;
+
+		if (!srq->curr->mm || is_idle_task(srq->curr))
+			continue;
+
+		/* Skip if HT is not running a tagged task. */
+		if (!srq->curr->core_cookie && !srq->core_pick)
+			continue;
+
+		/* IPI only if previous IPI was not pending. */
+		if (!srq->core_pause_pending) {
+			srq->core_pause_pending = 1;
+			smp_send_reschedule(i);
+		}
+	}
+unlock:
+	raw_spin_unlock(rq_lockp(rq));
+}
+
+/*
+ * Process any work needed either for exiting the core-wide IRQ state or for
+ * pausing this hyperthread if the core is still in that state.
+ */
+void sched_core_irq_exit(void)
+{
+	int cpu = smp_processor_id();
+	struct rq *rq = cpu_rq(cpu);
+	bool wait_here = false;
+	unsigned int nest;
+
+	/* Do nothing if core-sched disabled. */
+	if (!sched_core_enabled(rq))
+		return;
+
+	rq->core_this_irq_nest--;
+
+	/* If not outermost on this CPU, do nothing. */
+	if (WARN_ON_ONCE(rq->core_this_irq_nest == UINT_MAX) ||
+	    rq->core_this_irq_nest > 0)
+		return;
+
+	raw_spin_lock(rq_lockp(rq));
+	/*
+	 * Core-wide nesting counter can never be 0 because we are
+	 * still in it on this CPU.
+	 */
+	nest = rq->core->core_irq_nest;
+	WARN_ON_ONCE(!nest);
+
+	/*
+	 * If we still have other CPUs in IRQs, we have to wait for them.
+	 * Either here, or in the scheduler.
+	 */
+	if (rq->core->core_cookie && nest > 1) {
+		/*
+		 * If we are entering the scheduler anyway, we can just wait
+		 * there for ->core_irq_nest to reach 0. If not, just wait here.
+		 */
+		if (!tif_need_resched())
+			wait_here = true;
+	}
+
+	if (rq->core_pause_pending)
+		rq->core_pause_pending = 0;
+
+	/* Pair with smp_load_acquire() in sched_core_sibling_irq_pause(). */
+	smp_store_release(&rq->core->core_irq_nest, nest - 1);
+	raw_spin_unlock(rq_lockp(rq));
+
+	if (wait_here)
+		sched_core_sibling_irq_pause(rq);
+}
+
 // XXX fairness/fwd progress conditions
 /*
  * Returns
@@ -4910,6 +5057,18 @@ static void __sched notrace __schedule(bool preempt)
 		rq_unlock_irq(rq, &rf);
 	}
 
+#ifdef CONFIG_SCHED_CORE
+	/*
+	 * If a CPU that was running a trusted task enters the scheduler and the
+	 * next task is untrusted, wait for the core-wide IRQ state to cease.
+	 * This is needed because irq_exit() would not have been able to do
+	 * that waiting on our behalf.
+	 */
+	if (sched_core_enabled(rq) &&
+	    !is_idle_task(next) && next->mm && next->core_cookie)
+		sched_core_sibling_irq_pause(rq);
+#endif
+
 	balance_callback(rq);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7d9f156242e2..3a065d133ef51 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1018,11 +1018,14 @@ struct rq {
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
 	unsigned char		core_forceidle;
+	unsigned char		core_pause_pending;
+	unsigned int		core_this_irq_nest;
 
 	/* shared state */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
+	unsigned int		core_irq_nest;
 #endif
 };
 
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 0427a86743a46..147abd6d82599 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -273,6 +273,13 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	/* Reset the pending bitmask before enabling irqs */
 	set_softirq_pending(0);
 
+	/*
+	 * Core scheduling mitigations require softirq entry to send stall IPIs
+	 * to sibling hyperthreads when needed (e.g. when a sibling is running
+	 * an untrusted task). If we got here from irq_exit(), no IPIs are sent.
+	 */
+	sched_core_irq_enter();
+
 	local_irq_enable();
 
 	h = softirq_vec;
@@ -305,6 +312,9 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 		rcu_softirq_qs();
 	local_irq_disable();
 
+	/* Inform the scheduler about exit from softirq. */
+	sched_core_irq_exit();
+
 	pending = local_softirq_pending();
 	if (pending) {
 		if (time_before(jiffies, end) && !need_resched() &&
@@ -345,6 +355,7 @@ asmlinkage __visible void do_softirq(void)
 void irq_enter(void)
 {
 	rcu_irq_enter();
+	sched_core_irq_enter();
 	if (is_idle_task(current) && !in_interrupt()) {
 		/*
 		 * Prevent raise_softirq from needlessly waking up ksoftirqd
@@ -413,6 +424,7 @@ void irq_exit(void)
 		invoke_softirq();
 
 	tick_irq_exit();
+	sched_core_irq_exit();
 	rcu_irq_exit();
 	trace_hardirq_exit(); /* must be last! */
 }
-- 
2.26.2.761.g0e0b3e54be-goog


^ permalink raw reply	[flat|nested] 110+ messages in thread

* [PATCH RFC] sched: Use sched-RCU in core-scheduling balancing logic
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (18 preceding siblings ...)
  2020-05-20 22:37 ` [PATCH RFC v2] Add support for core-wide protection of IRQ and softirq Joel Fernandes (Google)
@ 2020-05-20 22:48 ` Joel Fernandes (Google)
  2020-05-21 22:52   ` Paul E. McKenney
  2020-06-25 20:12 ` [RFC PATCH 00/13] Core scheduling v5 Vineeth Remanan Pillai
  20 siblings, 1 reply; 110+ messages in thread
From: Joel Fernandes (Google) @ 2020-05-20 22:48 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: vpillai, linux-kernel, fweisbec, keescook, kerrnel, Phil Auld,
	Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel, paulmck

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using sched-RCU instead.

This fixes the following spinlock recursion observed when testing the
core scheduling patches on PREEMPT=y kernel on ChromeOS:

[    3.240891] BUG: spinlock recursion on CPU#2, swapper/2/0
[    3.240900]  lock: 0xffff9cd1eeb28e40, .magic: dead4ead, .owner: swapper/2/0, .owner_cpu: 2
[    3.240905] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.22htcore #4
[    3.240908] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.174.0 05/29/2018
[    3.240910] Call Trace:
[    3.240919]  dump_stack+0x97/0xdb
[    3.240924]  ? spin_bug+0xa4/0xb1
[    3.240927]  do_raw_spin_lock+0x79/0x98
[    3.240931]  try_to_wake_up+0x367/0x61b
[    3.240935]  rcu_read_unlock_special+0xde/0x169
[    3.240938]  ? sched_core_balance+0xd9/0x11e
[    3.240941]  __rcu_read_unlock+0x48/0x4a
[    3.240945]  __balance_callback+0x50/0xa1
[    3.240949]  __schedule+0x55a/0x61e
[    3.240952]  schedule_idle+0x21/0x2d
[    3.240956]  do_idle+0x1d5/0x1f8
[    3.240960]  cpu_startup_entry+0x1d/0x1f
[    3.240964]  start_secondary+0x159/0x174
[    3.240967]  secondary_startup_64+0xa4/0xb0
[   14.998590] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [kworker/0:10:965]

Cc: vpillai <vpillai@digitalocean.com>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Aubrey Li <aubrey.intel@gmail.com>
Cc: peterz@infradead.org
Cc: paulmck@kernel.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Change-Id: I1a4bf0cd1426b3c21ad5de44719813ad4ee5805e
---
 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 780514d03da47..b8ca6fcaaaf06 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4897,7 +4897,7 @@ static void sched_core_balance(struct rq *rq)
 	struct sched_domain *sd;
 	int cpu = cpu_of(rq);
 
-	rcu_read_lock();
+	rcu_read_lock_sched();
 	raw_spin_unlock_irq(rq_lockp(rq));
 	for_each_domain(cpu, sd) {
 		if (!(sd->flags & SD_LOAD_BALANCE))
@@ -4910,7 +4910,7 @@ static void sched_core_balance(struct rq *rq)
 			break;
 	}
 	raw_spin_lock_irq(rq_lockp(rq));
-	rcu_read_unlock();
+	rcu_read_unlock_sched();
 }
 
 static DEFINE_PER_CPU(struct callback_head, core_balance_head);
-- 
2.26.2.761.g0e0b3e54be-goog



* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail)
  2020-05-20 22:26 ` [PATCH RFC] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
@ 2020-05-21  4:09   ` benbjiang(蒋彪)
  2020-05-21 13:49     ` Joel Fernandes
  2020-05-21  8:51   ` [PATCH RFC] sched: Add a per-thread core scheduling interface Peter Zijlstra
  2020-05-21 18:31   ` Linus Torvalds
  2 siblings, 1 reply; 110+ messages in thread
From: benbjiang(蒋彪) @ 2020-05-21  4:09 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, vpillai, linux-kernel, fweisbec,
	keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Joel Fernandes



> On May 21, 2020, at 6:26 AM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> 
> Add a per-thread core scheduling interface which allows a thread to tag
> itself and enable core scheduling. Based on discussion at OSPM with
> maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> 1 - enable core scheduling for the task.
> 0 - disable core scheduling for the task.
> 
> Special cases:
> (1)
> The core-scheduling patchset contains a CGroup interface as well. In
> order for us to respect users of that interface, we avoid overriding the
> tag if a task was CGroup-tagged because the task becomes inconsistent
> with the CGroup tag. Instead return -EBUSY.
> 
> (2)
> If a task is prctl-tagged, allow the CGroup interface to override
> the task's tag.
> 
> ChromeOS will use core-scheduling to securely enable hyperthreading.
> This cuts down the keypress latency in Google docs from 150ms to 50ms
> while improving the camera streaming frame rate by ~3%.
Hi,
Are the performance improvements measured against the hyperthreading-disabled scenario?
Could you help explain how the keypress latency improvement comes from core scheduling?

Thanks a lot.

Regards,
Jiang



* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-20 22:26 ` [PATCH RFC] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
  2020-05-21  4:09   ` [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail) benbjiang(蒋彪)
@ 2020-05-21  8:51   ` Peter Zijlstra
  2020-05-21 13:47     ` Joel Fernandes
  2020-05-21 18:31   ` Linus Torvalds
  2 siblings, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2020-05-21  8:51 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo, tglx,
	pjt, torvalds, vpillai, linux-kernel, fweisbec, keescook,
	kerrnel, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Joel Fernandes

On Wed, May 20, 2020 at 06:26:42PM -0400, Joel Fernandes (Google) wrote:
> Add a per-thread core scheduling interface which allows a thread to tag
> itself and enable core scheduling. Based on discussion at OSPM with
> maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
>  1 - enable core scheduling for the task.
>  0 - disable core scheduling for the task.

Yeah, so this is a terrible interface :-)

It doesn't allow tasks to form their own groups (by, for example, setting
the key to that of another task).

It is also horribly ill-defined what it means to 'enable'; with whom is
it allowed to share a core?

> Special cases:

> (1)
> The core-scheduling patchset contains a CGroup interface as well. In
> order for us to respect users of that interface, we avoid overriding the
> tag if a task was CGroup-tagged because the task becomes inconsistent
> with the CGroup tag. Instead return -EBUSY.
> 
> (2)
> If a task is prctl-tagged, allow the CGroup interface to override
> the task's tag.

OK, so cgroup always wins; why is that a good thing?

> ChromeOS will use core-scheduling to securely enable hyperthreading.
> This cuts down the keypress latency in Google docs from 150ms to 50ms
> while improving the camera streaming frame rate by ~3%.

It doesn't consider permissions.

Basically, with the way you guys use it, enabling core-sched should be
CAP_SYS_ADMIN-only.

That also means we should very much default to disable.


* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-21  8:51   ` [PATCH RFC] sched: Add a per-thread core scheduling interface Peter Zijlstra
@ 2020-05-21 13:47     ` Joel Fernandes
  2020-05-21 20:20       ` Vineeth Remanan Pillai
  2020-05-22 12:59       ` Peter Zijlstra
  0 siblings, 2 replies; 110+ messages in thread
From: Joel Fernandes @ 2020-05-21 13:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo, tglx,
	pjt, torvalds, vpillai, linux-kernel, fweisbec, keescook,
	kerrnel, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

Hi Peter,
Thanks for the comments.

On Thu, May 21, 2020 at 10:51:22AM +0200, Peter Zijlstra wrote:
> On Wed, May 20, 2020 at 06:26:42PM -0400, Joel Fernandes (Google) wrote:
> > Add a per-thread core scheduling interface which allows a thread to tag
> > itself and enable core scheduling. Based on discussion at OSPM with
> > maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> >  1 - enable core scheduling for the task.
> >  0 - disable core scheduling for the task.
> 
> Yeah, so this is a terrible interface :-)

I tried to keep it simple. You are right, lets make it better.

> It doesn't allow tasks to form their own groups (by, for example, setting
> the key to that of another task).

So for this, I was thinking of making the prctl pass in an integer. And 0
would mean untagged. Does that sound good to you?

> It is also horribly ill defined what it means to 'enable', with whoem
> is it allows to share a core.

I couldn't parse this. Do you mean "enabling coresched does not make sense if
we don't specify whom to share the core with?"

> > Special cases:
> 
> > (1)
> > The core-scheduling patchset contains a CGroup interface as well. In
> > order for us to respect users of that interface, we avoid overriding the
> > tag if a task was CGroup-tagged because the task becomes inconsistent
> > with the CGroup tag. Instead return -EBUSY.
> > 
> > (2)
> > If a task is prctl-tagged, allow the CGroup interface to override
> > the task's tag.
> 
> OK, so cgroup always wins; why is that a good thing?

I was just trying to respect the functionality of the CGroup patch in the
coresched series, after all a gentleman named Peter Zijlstra wrote that
patch ;-) ;-).

More seriously, the reason I did it this way is the prctl-tagging is a bit
incompatible with CGroup tagging:

1. What happens if 2 tasks are in a tagged CGroup and one of them changes
their cookie through prctl? Do they still remain in the tagged CGroup but are
now going to not trust each other? Do they get removed from the CGroup? This
is why I made the prctl fail with -EBUSY in such cases.

2. What happens if 2 tagged tasks with different cookies are added to a
tagged CGroup? Do we fail the addition of the tasks to the group, or do we
override their cookie (like I'm doing)?

> > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > while improving the camera streaming frame rate by ~3%.
> 
> It doesn't consider permissions.
> 
> Basically, with the way you guys use it, it should be a CAP_SYS_ADMIN
> only to enable core-sched.

True, we were relying on the seccomp sandboxing in ChromeOS to protect the
prctl, but you're right, and I fixed it for the next revision.

> That also means we should very much default to disable.

This is how it is already.

thanks,

 - Joel



* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail)
  2020-05-21  4:09   ` [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail) benbjiang(蒋彪)
@ 2020-05-21 13:49     ` Joel Fernandes
  0 siblings, 0 replies; 110+ messages in thread
From: Joel Fernandes @ 2020-05-21 13:49 UTC (permalink / raw)
  To: benbjiang(蒋彪)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, vpillai, linux-kernel, fweisbec,
	keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, May 21, 2020 at 04:09:50AM +0000, benbjiang(蒋彪) wrote:
> 
> 
> > On May 21, 2020, at 6:26 AM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> > 
> > Add a per-thread core scheduling interface which allows a thread to tag
> > itself and enable core scheduling. Based on discussion at OSPM with
> > maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> > 1 - enable core scheduling for the task.
> > 0 - disable core scheduling for the task.
> > 
> > Special cases:
> > (1)
> > The core-scheduling patchset contains a CGroup interface as well. In
> > order for us to respect users of that interface, we avoid overriding the
> > tag if a task was CGroup-tagged because the task becomes inconsistent
> > with the CGroup tag. Instead return -EBUSY.
> > 
> > (2)
> > If a task is prctl-tagged, allow the CGroup interface to override
> > the task's tag.
> > 
> > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > while improving the camera streaming frame rate by ~3%.
> Hi,
> Are the performance improvements measured against the hyperthreading-disabled scenario?
> Could you help explain how the keypress latency improvement comes from core scheduling?

Hi Jiang,

The keypress end-to-end latency metric we have is calculated from when the
keypress is registered in hardware to when a character is drawn on the
screen. This involves several parties including the GPU and browser
processes which are all running in the same trust domain and benefit from
parallelism through hyperthreading.

thanks,

 - Joel



* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-20 22:26 ` [PATCH RFC] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
  2020-05-21  4:09   ` [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail) benbjiang(蒋彪)
  2020-05-21  8:51   ` [PATCH RFC] sched: Add a per-thread core scheduling interface Peter Zijlstra
@ 2020-05-21 18:31   ` Linus Torvalds
  2020-05-21 20:40     ` Joel Fernandes
  2 siblings, 1 reply; 110+ messages in thread
From: Linus Torvalds @ 2020-05-21 18:31 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, vpillai,
	Linux Kernel Mailing List, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Joel Fernandes

On Wed, May 20, 2020 at 3:26 PM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> ChromeOS will use core-scheduling to securely enable hyperthreading.
> This cuts down the keypress latency in Google docs from 150ms to 50ms
> while improving the camera streaming frame rate by ~3%.

I'm assuming this is "compared to SMT disabled"?

What is the cost compared to "SMT enabled but no core scheduling"?

But the real reason I'm piping up is that your latency benchmark
sounds very cool.

Generally throughput benchmarks are much easier to do, how do you do
this latency benchmark, and is it perhaps something that could be run
more widely (ie I'm thinking that if it's generic enough and stable
enough to be run by some of the performance regression checking
robots, it would be a much more interesting test-case than some of the
ones they run right now...)

I'm looking at that "threaded phoronix gzip performance regression"
thread due to a totally unrelated scheduling change ("sched/fair:
Rework load_balance()"), and then I see this thread and my reaction is
"the keypress latency thing sounds like a much more interesting
performance test than threaded gzip from clear linux".

But the threaded gzip test is presumably trivial to script, while your
latency test is perhaps very specific to one particular platform and
setup?

                   Linus


* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-21 13:47     ` Joel Fernandes
@ 2020-05-21 20:20       ` Vineeth Remanan Pillai
  2020-05-22 12:59       ` Peter Zijlstra
  1 sibling, 0 replies; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-05-21 20:20 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Peter Zijlstra, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li, Li, Aubrey,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, May 21, 2020 at 9:47 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> > It doens't allow tasks for form their own groups (by for example setting
> > the key to that of another task).
>
> So for this, I was thinking of making the prctl pass in an integer. And 0
> would mean untagged. Does that sound good to you?
>
On a similar note, Joel and I were discussing prctl, and it came up
that there is no mechanism to set a cookie from outside a process using
prctl(2). So, another option we could consider is to use sched_setattr(2)
and expand sched_attr to accommodate a u64 cookie. A user could pass in a
cookie to explicitly set it and also use the same cookie for grouping.

Haven't prototyped it yet. Will need to dig deeper and see how it would
really look.

Thanks,
Vineeth


* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-21 18:31   ` Linus Torvalds
@ 2020-05-21 20:40     ` Joel Fernandes
  2020-05-21 21:58       ` Jesse Barnes
  0 siblings, 1 reply; 110+ messages in thread
From: Joel Fernandes @ 2020-05-21 20:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, vpillai,
	Linux Kernel Mailing List, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

Hi Linus,

On Thu, May 21, 2020 at 11:31:38AM -0700, Linus Torvalds wrote:
> On Wed, May 20, 2020 at 3:26 PM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > while improving the camera streaming frame rate by ~3%.
> 
> I'm assuming this is "compared to SMT disabled"?

Yes this is compared to SMT disabled, I'll improve the commit message.

> What is the cost compared to "SMT enabled but no core scheduling"?

With SMT enabled and no core scheduling, it is around 40ms in the higher
percentiles. Also, one more thing I wanted to mention: these numbers are
90th-percentile values.

> But the real reason I'm piping up is that your latency benchmark
> sounds very cool.
> 
> Generally throughput benchmarks are much easier to do, how do you do
> this latency benchmark, and is it perhaps something that could be run
> more widely (ie I'm thinking that if it's generic enough and stable
> enough to be run by some of the performance regression checking
> robots, it would be a much more interesting test-case than some of the
> ones they run right now...)

Glad you like it! The metric is calculated with a timestamp of when the
driver says the key was pressed, up until when the GPU says we've drawn
pixels in response.

The test mostly only requires the Chrome browser. It opens some
pre-existing test URLs (a google doc, a window that opens a camera stream and
another window that decodes video). This metric is already calculated in
Chrome, we just scrape it from
chrome://histograms/Event.Latency.EndToEnd.KeyPress.  If you install Chrome,
you can go to this link and see the histogram.  We open a Google docs window
and synthetically input keys into it with a camera stream and video decoding
running in other windows which gives the CPUs a good beating. Then we collect
roughly the 90th percentile keypress latency from the above histogram and the
camera and decoded video's FPS, among other things. There is a test in the
works that my colleagues are writing to run the full Google hangout video
chatting stack to stress the system more (versus just the camera stream).  I
guess if the robots can somehow input keys into the Google docs and open the
right windows, then it is just a matter of scraping the histogram.

> I'm looking at that "threaded phoronix gzip performance regression"
> thread due to a totally unrelated scheduling change ("sched/fair:
> Rework load_balance()"), and then I see this thread and my reaction is
> "the keypress latency thing sounds like a much more interesting
> performance test than threaded gzip from clear linux".
> 
> But the threaded gzip test is presumably trivial to script, while your
> latency test is perhaps very specific to one particular platform and
> setup?

Yes it is specifically a ChromeOS running on a pixel book running a 7th Gen
Intel Core i7 with 4 hardware threads.
https://store.google.com/us/product/google_pixelbook

I could try to make it a synthetic test but it might be difficult for a robot
to run it if it does not have graphics support and a camera connected to it.
It would then need a fake/emulated camera connected to it. These robots run
Linux in a non-GUI environment in qemu instances, right?

thanks,

 - Joel



* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-21 20:40     ` Joel Fernandes
@ 2020-05-21 21:58       ` Jesse Barnes
  2020-05-22 16:33         ` Linus Torvalds
  0 siblings, 1 reply; 110+ messages in thread
From: Jesse Barnes @ 2020-05-21 21:58 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Linus Torvalds, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, vpillai, Linux Kernel Mailing List,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Thu, May 21, 2020 at 1:45 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Linus,
>
> On Thu, May 21, 2020 at 11:31:38AM -0700, Linus Torvalds wrote:
> > On Wed, May 20, 2020 at 3:26 PM Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > Generally throughput benchmarks are much easier to do, how do you do
> > this latency benchmark, and is it perhaps something that could be run
> > more widely (ie I'm thinking that if it's generic enough and stable
> > enough to be run by some of the performance regression checking
> > robots, it would be a much more interesting test-case than some of the
> > ones they run right now...)
>
> Glad you like it! The metric is calculated with a timestamp of when the
> driver says the key was pressed, up until when the GPU says we've drawn
> pixels in response.
>
> The test mostly only requires the Chrome browser. It opens some
> pre-existing test URLs (a google doc, a window that opens a camera stream and
> another window that decodes video). This metric is already calculated in
> Chrome, we just scrape it from
> chrome://histograms/Event.Latency.EndToEnd.KeyPress.  If you install Chrome,
> you can go to this link and see the histogram.  We open a Google docs window
> and synthetically input keys into it with a camera stream and video decoding
> running in other windows which gives the CPUs a good beating. Then we collect
> roughly the 90th percentile keypress latency from the above histogram and the
> camera and decoded video's FPS, among other things. There is a test in the
> works that my colleagues are writing to run the full Google hangout video
> chatting stack to stress the system more (versus just the camera stream).  I
> guess if the robots can somehow input keys into the Google docs and open the
> right windows, then it is just a matter of scraping the histogram.

Expanding on this a little, we're working on a couple of projects that
should provide results like these for upstream.  One is continuously
rebasing our upstream backlog onto new kernels for testing purposes
(the idea here is to make it easier for us to update kernels on
Chromebooks), and the second is to drive more stuff into the
kernelci.org infrastructure.  Given the test environments we have in
place now, we can probably get results from our continuous rebase
project first and provide those against -rc releases if that's
something you'd be interested in.  Going forward, I hope we can
extract several of our tests and put them into kernelci as well, so we
get more general coverage without the potential impact of our (still
somewhat large) upstream backlog of patches.

To Joel's point, there are a few changes we'll have to make to get
similar results outside of our environment, but I think that's doable
without a ton of work.  And if anyone is curious, I think most of this
stuff is already public in the tast and autotest repos of the
chromiumos tree.  Just let us know if you want to make changes or port
to another environment so we can try to stay in sync wrt new features,
etc.

Thanks,
Jesse


* Re: [PATCH RFC] sched: Use sched-RCU in core-scheduling balancing logic
  2020-05-20 22:48 ` [PATCH RFC] sched: Use sched-RCU in core-scheduling balancing logic Joel Fernandes (Google)
@ 2020-05-21 22:52   ` Paul E. McKenney
  2020-05-22  1:26     ` Joel Fernandes
  0 siblings, 1 reply; 110+ messages in thread
From: Paul E. McKenney @ 2020-05-21 22:52 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, vpillai, linux-kernel, fweisbec,
	keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Joel Fernandes

On Wed, May 20, 2020 at 06:48:18PM -0400, Joel Fernandes (Google) wrote:
> rcu_read_unlock() can incur an infrequent deadlock in
> sched_core_balance(). Fix this by using sched-RCU instead.
> 
> This fixes the following spinlock recursion observed when testing the
> core scheduling patches on PREEMPT=y kernel on ChromeOS:
> 
> [    3.240891] BUG: spinlock recursion on CPU#2, swapper/2/0
> [    3.240900]  lock: 0xffff9cd1eeb28e40, .magic: dead4ead, .owner: swapper/2/0, .owner_cpu: 2
> [    3.240905] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.22htcore #4
> [    3.240908] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.174.0 05/29/2018
> [    3.240910] Call Trace:
> [    3.240919]  dump_stack+0x97/0xdb
> [    3.240924]  ? spin_bug+0xa4/0xb1
> [    3.240927]  do_raw_spin_lock+0x79/0x98
> [    3.240931]  try_to_wake_up+0x367/0x61b
> [    3.240935]  rcu_read_unlock_special+0xde/0x169
> [    3.240938]  ? sched_core_balance+0xd9/0x11e
> [    3.240941]  __rcu_read_unlock+0x48/0x4a
> [    3.240945]  __balance_callback+0x50/0xa1
> [    3.240949]  __schedule+0x55a/0x61e
> [    3.240952]  schedule_idle+0x21/0x2d
> [    3.240956]  do_idle+0x1d5/0x1f8
> [    3.240960]  cpu_startup_entry+0x1d/0x1f
> [    3.240964]  start_secondary+0x159/0x174
> [    3.240967]  secondary_startup_64+0xa4/0xb0
> [   14.998590] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [kworker/0:10:965]
> 
> Cc: vpillai <vpillai@digitalocean.com>
> Cc: Aaron Lu <aaron.lwe@gmail.com>
> Cc: Aubrey Li <aubrey.intel@gmail.com>
> Cc: peterz@infradead.org
> Cc: paulmck@kernel.org
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Change-Id: I1a4bf0cd1426b3c21ad5de44719813ad4ee5805e

With some luck, the commit removing the need for this will hit
mainline during the next merge window.  Fingers firmly crossed...

						Thanx, Paul

> ---
>  kernel/sched/core.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 780514d03da47..b8ca6fcaaaf06 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4897,7 +4897,7 @@ static void sched_core_balance(struct rq *rq)
>  	struct sched_domain *sd;
>  	int cpu = cpu_of(rq);
>  
> -	rcu_read_lock();
> +	rcu_read_lock_sched();
>  	raw_spin_unlock_irq(rq_lockp(rq));
>  	for_each_domain(cpu, sd) {
>  		if (!(sd->flags & SD_LOAD_BALANCE))
> @@ -4910,7 +4910,7 @@ static void sched_core_balance(struct rq *rq)
>  			break;
>  	}
>  	raw_spin_lock_irq(rq_lockp(rq));
> -	rcu_read_unlock();
> +	rcu_read_unlock_sched();
>  }
>  
>  static DEFINE_PER_CPU(struct callback_head, core_balance_head);
> -- 
> 2.26.2.761.g0e0b3e54be-goog
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.
  2020-03-04 16:59 ` [RFC PATCH 07/13] sched: Add core wide task selection and scheduling vpillai
  2020-04-14 13:35   ` Peter Zijlstra
  2020-04-16  3:39   ` Chen Yu
@ 2020-05-21 23:14   ` Joel Fernandes
  2020-05-21 23:16     ` Joel Fernandes
  2020-05-22  2:35     ` Joel Fernandes
  2 siblings, 2 replies; 110+ messages in thread
From: Joel Fernandes @ 2020-05-21 23:14 UTC (permalink / raw)
  To: vpillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, linux-kernel, fweisbec, keescook,
	kerrnel, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Aaron Lu

On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective of which logical
> CPU does the reschedule).
> 
> There could be races in the core scheduler where a CPU is trying to
> pick a task for its sibling when that CPU has just been
> offlined.  We should not schedule any tasks on the CPU in this case.
> Return an idle task in pick_next_task for this situation.
> 
> NOTE: there is still potential for siblings rivalry.
> NOTE: this is far too complicated; but thus far I've failed to
>       simplify it further.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
> Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/core.c  | 274 ++++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/fair.c  |  40 +++++++
>  kernel/sched/sched.h |   6 +-
>  3 files changed, 318 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 445f0d519336..9a1bd236044e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4253,7 +4253,7 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
>   * Pick up the highest-prio task:
>   */
>  static inline struct task_struct *
> -pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	const struct sched_class *class;
>  	struct task_struct *p;
> @@ -4309,6 +4309,273 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	BUG();
>  }
>  
> +#ifdef CONFIG_SCHED_CORE
> +
> +static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
> +{
> +	return is_idle_task(a) || (a->core_cookie == cookie);
> +}
> +
> +static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
> +{
> +	if (is_idle_task(a) || is_idle_task(b))
> +		return true;
> +
> +	return a->core_cookie == b->core_cookie;
> +}
> +
> +// XXX fairness/fwd progress conditions
> +/*
> + * Returns
> + * - NULL if there is no runnable task for this class.
> + * - the highest priority task for this runqueue if it matches
> + *   rq->core->core_cookie or its priority is greater than max.
> + * - Else returns idle_task.
> + */
> +static struct task_struct *
> +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
> +{
> +	struct task_struct *class_pick, *cookie_pick;
> +	unsigned long cookie = rq->core->core_cookie;
> +
> +	class_pick = class->pick_task(rq);
> +	if (!class_pick)
> +		return NULL;
> +
> +	if (!cookie) {
> +		/*
> +		 * If class_pick is tagged, return it only if it has
> +		 * higher priority than max.
> +		 */
> +		if (max && class_pick->core_cookie &&
> +		    prio_less(class_pick, max))
> +			return idle_sched_class.pick_task(rq);
> +
> +		return class_pick;
> +	}
> +
> +	/*
> +	 * If class_pick is idle or matches cookie, return early.
> +	 */
> +	if (cookie_equals(class_pick, cookie))
> +		return class_pick;
> +
> +	cookie_pick = sched_core_find(rq, cookie);
> +
> +	/*
> +	 * If class > max && class > cookie, it is the highest priority task on
> +	 * the core (so far) and it must be selected, otherwise we must go with
> +	 * the cookie pick in order to satisfy the constraint.
> +	 */
> +	if (prio_less(cookie_pick, class_pick) &&
> +	    (!max || prio_less(max, class_pick)))
> +		return class_pick;
> +
> +	return cookie_pick;
> +}

I've been hating on this pick_task() routine for a while now :-). If we add
the task to the tag tree as Peter suggested at OSPM for that other issue
Vineeth found, it seems it could be simpler.

This has just been near a compiler so far but how about:

---8<-----------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 005d7f7323e2d..81e23252b6c99 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -182,9 +182,6 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
-		return;
-
 	node = &rq->core_tree.rb_node;
 	parent = *node;
 
@@ -215,7 +212,7 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 
 void sched_core_add(struct rq *rq, struct task_struct *p)
 {
-	if (p->core_cookie && task_on_rq_queued(p))
+	if (task_on_rq_queued(p))
 		sched_core_enqueue(rq, p);
 }
 
@@ -4563,36 +4560,32 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 	if (!class_pick)
 		return NULL;
 
-	if (!cookie) {
-		/*
-		 * If class_pick is tagged, return it only if it has
-		 * higher priority than max.
-		 */
-		if (max && class_pick->core_cookie &&
-		    prio_less(class_pick, max))
-			return idle_sched_class.pick_task(rq);
-
+	if (!max)
 		return class_pick;
-	}
 
-	/*
-	 * If class_pick is idle or matches cookie, return early.
-	 */
+	/* Make sure the current max's cookie is core->core_cookie */
+	WARN_ON_ONCE(max->core_cookie != cookie);
+
+	/* If class_pick is idle or matches cookie, play nice. */
 	if (cookie_equals(class_pick, cookie))
 		return class_pick;
 
-	cookie_pick = sched_core_find(rq, cookie);
+	/* If class_pick is highest prio, trump max. */
+	if (prio_less(max, class_pick)) {
+
+		/* .. but not before checking if cookie trumps class. */
+		cookie_pick = sched_core_find(rq, cookie);
+		if (prio_less(class_pick, cookie_pick))
+			return cookie_pick;
 
-	/*
-	 * If class > max && class > cookie, it is the highest priority task on
-	 * the core (so far) and it must be selected, otherwise we must go with
-	 * the cookie pick in order to satisfy the constraint.
-	 */
-	if (prio_less(cookie_pick, class_pick) &&
-	    (!max || prio_less(max, class_pick)))
 		return class_pick;
+	}
 
-	return cookie_pick;
+	/*
+	 * We get here if class_pick was incompatible with max
+	 * and lower prio than max. So we have nothing.
+	 */
+	return idle_sched_class.pick_task(rq);
 }
 
 static struct task_struct *


* Re: [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.
  2020-05-21 23:14   ` Joel Fernandes
@ 2020-05-21 23:16     ` Joel Fernandes
  2020-05-22  2:35     ` Joel Fernandes
  1 sibling, 0 replies; 110+ messages in thread
From: Joel Fernandes @ 2020-05-21 23:16 UTC (permalink / raw)
  To: vpillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, linux-kernel, fweisbec, keescook,
	kerrnel, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Aaron Lu

On Thu, May 21, 2020 at 07:14:26PM -0400, Joel Fernandes wrote:
> On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
[snip]
> > +	/*
> > +	 * If class_pick is idle or matches cookie, return early.
> > +	 */
> > +	if (cookie_equals(class_pick, cookie))
> > +		return class_pick;
> > +
> > +	cookie_pick = sched_core_find(rq, cookie);
> > +
> > +	/*
> > +	 * If class > max && class > cookie, it is the highest priority task on
> > +	 * the core (so far) and it must be selected, otherwise we must go with
> > +	 * the cookie pick in order to satisfy the constraint.
> > +	 */
> > +	if (prio_less(cookie_pick, class_pick) &&
> > +	    (!max || prio_less(max, class_pick)))
> > +		return class_pick;
> > +
> > +	return cookie_pick;
> > +}
> 
> I've been hating on this pick_task() routine for a while now :-). If we add
> the task to the tag tree as Peter suggested at OSPM for that other issue
> Vineeth found, it seems it could be simpler.

Sorry, I meant adding a 0-tagged (no-cookie) task to the tag tree.

thanks,

 - Joel



* Re: [PATCH RFC] sched: Use sched-RCU in core-scheduling balancing logic
  2020-05-21 22:52   ` Paul E. McKenney
@ 2020-05-22  1:26     ` Joel Fernandes
  0 siblings, 0 replies; 110+ messages in thread
From: Joel Fernandes @ 2020-05-22  1:26 UTC (permalink / raw)
  To: paulmck
  Cc: Joel Fernandes (Google),
	Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	vpillai, LKML, Frederic Weisbecker, Kees Cook, Greg Kerr,
	Phil Auld, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, May 21, 2020 at 6:52 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, May 20, 2020 at 06:48:18PM -0400, Joel Fernandes (Google) wrote:
> > rcu_read_unlock() can incur an infrequent deadlock in
> > sched_core_balance(). Fix this by using sched-RCU instead.
> >
> > This fixes the following spinlock recursion observed when testing the
> > core scheduling patches on PREEMPT=y kernel on ChromeOS:
> >
> > [    3.240891] BUG: spinlock recursion on CPU#2, swapper/2/0
> > [    3.240900]  lock: 0xffff9cd1eeb28e40, .magic: dead4ead, .owner: swapper/2/0, .owner_cpu: 2
> > [    3.240905] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.22htcore #4
> > [    3.240908] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.174.0 05/29/2018
> > [    3.240910] Call Trace:
> > [    3.240919]  dump_stack+0x97/0xdb
> > [    3.240924]  ? spin_bug+0xa4/0xb1
> > [    3.240927]  do_raw_spin_lock+0x79/0x98
> > [    3.240931]  try_to_wake_up+0x367/0x61b
> > [    3.240935]  rcu_read_unlock_special+0xde/0x169
> > [    3.240938]  ? sched_core_balance+0xd9/0x11e
> > [    3.240941]  __rcu_read_unlock+0x48/0x4a
> > [    3.240945]  __balance_callback+0x50/0xa1
> > [    3.240949]  __schedule+0x55a/0x61e
> > [    3.240952]  schedule_idle+0x21/0x2d
> > [    3.240956]  do_idle+0x1d5/0x1f8
> > [    3.240960]  cpu_startup_entry+0x1d/0x1f
> > [    3.240964]  start_secondary+0x159/0x174
> > [    3.240967]  secondary_startup_64+0xa4/0xb0
> > [   14.998590] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [kworker/0:10:965]
> >
> > Cc: vpillai <vpillai@digitalocean.com>
> > Cc: Aaron Lu <aaron.lwe@gmail.com>
> > Cc: Aubrey Li <aubrey.intel@gmail.com>
> > Cc: peterz@infradead.org
> > Cc: paulmck@kernel.org
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Change-Id: I1a4bf0cd1426b3c21ad5de44719813ad4ee5805e
>
> With some luck, the commit removing the need for this will hit
> mainline during the next merge window.  Fingers firmly crossed...

Sounds good, thank you Paul :-)

 - Joel


>
>                                                 Thanx, Paul
>
> > ---
> >  kernel/sched/core.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 780514d03da47..b8ca6fcaaaf06 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4897,7 +4897,7 @@ static void sched_core_balance(struct rq *rq)
> >       struct sched_domain *sd;
> >       int cpu = cpu_of(rq);
> >
> > -     rcu_read_lock();
> > +     rcu_read_lock_sched();
> >       raw_spin_unlock_irq(rq_lockp(rq));
> >       for_each_domain(cpu, sd) {
> >               if (!(sd->flags & SD_LOAD_BALANCE))
> > @@ -4910,7 +4910,7 @@ static void sched_core_balance(struct rq *rq)
> >                       break;
> >       }
> >       raw_spin_lock_irq(rq_lockp(rq));
> > -     rcu_read_unlock();
> > +     rcu_read_unlock_sched();
> >  }
> >
> >  static DEFINE_PER_CPU(struct callback_head, core_balance_head);
> > --
> > 2.26.2.761.g0e0b3e54be-goog
> >


* Re: [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.
  2020-05-21 23:14   ` Joel Fernandes
  2020-05-21 23:16     ` Joel Fernandes
@ 2020-05-22  2:35     ` Joel Fernandes
  2020-05-22  3:44       ` Aaron Lu
  1 sibling, 1 reply; 110+ messages in thread
From: Joel Fernandes @ 2020-05-22  2:35 UTC (permalink / raw)
  To: vpillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, linux-kernel, fweisbec, keescook,
	kerrnel, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Aaron Lu

On Thu, May 21, 2020 at 07:14:26PM -0400, Joel Fernandes wrote:
> On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > Instead of only selecting a local task, select a task for all SMT
> > siblings for every reschedule on the core (irrespective of which logical
> > CPU does the reschedule).
> > 
> > There could be races in the core scheduler where a CPU is trying to
> > pick a task for its sibling when that CPU has just been
> > offlined.  We should not schedule any tasks on the CPU in this case.
> > Return an idle task in pick_next_task for this situation.
> > 
> > NOTE: there is still potential for siblings rivalry.
> > NOTE: this is far too complicated; but thus far I've failed to
> >       simplify it further.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
> > Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
> > Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> >  kernel/sched/core.c  | 274 ++++++++++++++++++++++++++++++++++++++++++-
> >  kernel/sched/fair.c  |  40 +++++++
> >  kernel/sched/sched.h |   6 +-
> >  3 files changed, 318 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 445f0d519336..9a1bd236044e 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4253,7 +4253,7 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
> >   * Pick up the highest-prio task:
> >   */
> >  static inline struct task_struct *
> > -pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > +__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  {
> >  	const struct sched_class *class;
> >  	struct task_struct *p;
> > @@ -4309,6 +4309,273 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  	BUG();
> >  }
> >  
> > +#ifdef CONFIG_SCHED_CORE
> > +
> > +static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
> > +{
> > +	return is_idle_task(a) || (a->core_cookie == cookie);
> > +}
> > +
> > +static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
> > +{
> > +	if (is_idle_task(a) || is_idle_task(b))
> > +		return true;
> > +
> > +	return a->core_cookie == b->core_cookie;
> > +}
> > +
> > +// XXX fairness/fwd progress conditions
> > +/*
> > + * Returns
> > + * - NULL if there is no runnable task for this class.
> > + * - the highest priority task for this runqueue if it matches
> > + *   rq->core->core_cookie or its priority is greater than max.
> > + * - Else returns idle_task.
> > + */
> > +static struct task_struct *
> > +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
> > +{
> > +	struct task_struct *class_pick, *cookie_pick;
> > +	unsigned long cookie = rq->core->core_cookie;
> > +
> > +	class_pick = class->pick_task(rq);
> > +	if (!class_pick)
> > +		return NULL;
> > +
> > +	if (!cookie) {
> > +		/*
> > +		 * If class_pick is tagged, return it only if it has
> > +		 * higher priority than max.
> > +		 */
> > +		if (max && class_pick->core_cookie &&
> > +		    prio_less(class_pick, max))
> > +			return idle_sched_class.pick_task(rq);
> > +
> > +		return class_pick;
> > +	}
> > +
> > +	/*
> > +	 * If class_pick is idle or matches cookie, return early.
> > +	 */
> > +	if (cookie_equals(class_pick, cookie))
> > +		return class_pick;
> > +
> > +	cookie_pick = sched_core_find(rq, cookie);
> > +
> > +	/*
> > +	 * If class > max && class > cookie, it is the highest priority task on
> > +	 * the core (so far) and it must be selected, otherwise we must go with
> > +	 * the cookie pick in order to satisfy the constraint.
> > +	 */
> > +	if (prio_less(cookie_pick, class_pick) &&
> > +	    (!max || prio_less(max, class_pick)))
> > +		return class_pick;
> > +
> > +	return cookie_pick;
> > +}
> 
> I've been hating on this pick_task() routine for a while now :-). If we add
> the task to the tag tree as Peter suggested at OSPM for that other issue
> Vineeth found, it seems it could be simpler.
> 
> This has just been near a compiler so far but how about:

Discussed a lot with Vineeth. Below is an improved version of the pick_task()
simplification.

It also handles the following "bug" in the existing code that Vineeth
brought up at OSPM: Suppose 2 siblings of a core: rq 1 and rq 2.

In priority order (high to low), say we have the tasks:
A - untagged  (rq 1)
B - tagged    (rq 2)
C - untagged  (rq 2)

Say, B and C are in the same scheduling class.

When the pick_next_task() loop runs, it looks at rq 1 and max is A, A is
tentatively selected for rq 1. Then it looks at rq 2 and the class_pick is B.
But that's not compatible with A. So rq 2 gets forced idle.

In reality, rq 2 could have run C instead of idle. The fix is to add C to the
tag tree as Peter suggested at OSPM.

Updated diff below:

---8<-----------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 005d7f7323e2d..625377f393ed3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -182,9 +182,6 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
-		return;
-
 	node = &rq->core_tree.rb_node;
 	parent = *node;
 
@@ -215,7 +212,7 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 
 void sched_core_add(struct rq *rq, struct task_struct *p)
 {
-	if (p->core_cookie && task_on_rq_queued(p))
+	if (task_on_rq_queued(p))
 		sched_core_enqueue(rq, p);
 }
 
@@ -4556,43 +4553,57 @@ void sched_core_irq_exit(void)
 static struct task_struct *
 pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
 {
-	struct task_struct *class_pick, *cookie_pick;
+	struct task_struct *class_pick, *cookie_pick, *rq_pick;
 	unsigned long cookie = rq->core->core_cookie;
 
 	class_pick = class->pick_task(rq);
 	if (!class_pick)
 		return NULL;
 
-	if (!cookie) {
-		/*
-		 * If class_pick is tagged, return it only if it has
-		 * higher priority than max.
-		 */
-		if (max && class_pick->core_cookie &&
-		    prio_less(class_pick, max))
-			return idle_sched_class.pick_task(rq);
+	if (!max)
+		return class_pick;
+
+	/* Make sure the current max's cookie is core->core_cookie */
+	WARN_ON_ONCE(max->core_cookie != cookie);
 
+	/* Try to play really nice: see if the class's cookie works. */
+	if (cookie_equals(class_pick, cookie))
 		return class_pick;
-	}
 
 	/*
-	 * If class_pick is idle or matches cookie, return early.
+	 * From here on, we must return class_pick, cookie_pick or idle.
+	 * Following are the cases:
+	 * 1 - lowest prio.
+	 * 3 - highest prio.
+	 *
+	 * max	class	cookie	outcome
+	 * 1	2	3	cookie
+	 * 1	3	2	class
+	 * 2	1	3	cookie
+	 * 2	3	1	class
+	 * 3	1	2	cookie
+	 * 3	2	1	cookie
+	 * 3	2	-	return idle (when no cookie task).
 	 */
-	if (cookie_equals(class_pick, cookie))
-		return class_pick;
 
+	/* First try to find the highest prio of (cookie, class and max). */
 	cookie_pick = sched_core_find(rq, cookie);
+	if (cookie_pick && prio_less(class_pick, cookie_pick))
+		rq_pick = cookie_pick;
+	else
+		rq_pick = class_pick;
+	if (prio_less(max, rq_pick))
+		return rq_pick;
+
+	/* If max was greatest, then see if there was a cookie. */
+	if (cookie_pick)
+		return cookie_pick;
 
 	/*
-	 * If class > max && class > cookie, it is the highest priority task on
-	 * the core (so far) and it must be selected, otherwise we must go with
-	 * the cookie pick in order to satisfy the constraint.
+	 * We get here if class_pick was incompatible with max
+	 * and lower prio than max. So we have nothing.
 	 */
-	if (prio_less(cookie_pick, class_pick) &&
-	    (!max || prio_less(max, class_pick)))
-		return class_pick;
-
-	return cookie_pick;
+	return idle_sched_class.pick_task(rq);
 }
 
 static struct task_struct *


* Re: [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.
  2020-05-22  2:35     ` Joel Fernandes
@ 2020-05-22  3:44       ` Aaron Lu
  2020-05-22 20:13         ` Joel Fernandes
  0 siblings, 1 reply; 110+ messages in thread
From: Aaron Lu @ 2020-05-22  3:44 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: vpillai, Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra,
	Tim Chen, mingo, tglx, pjt, torvalds, linux-kernel, fweisbec,
	keescook, kerrnel, Phil Auld, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Aaron Lu

On Thu, May 21, 2020 at 10:35:56PM -0400, Joel Fernandes wrote:
> Discussed a lot with Vineeth. Below is an improved version of the pick_task()
> simplification.
> 
> It also handles the following "bug" in the existing code that Vineeth
> brought up at OSPM: Suppose 2 siblings of a core: rq 1 and rq 2.
> 
> In priority order (high to low), say we have the tasks:
> A - untagged  (rq 1)
> B - tagged    (rq 2)
> C - untagged  (rq 2)
> 
> Say, B and C are in the same scheduling class.
> 
> When the pick_next_task() loop runs, it looks at rq 1 and max is A, A is
> tentatively selected for rq 1. Then it looks at rq 2 and the class_pick is B.
> But that's not compatible with A. So rq 2 gets forced idle.
> 
> In reality, rq 2 could have run C instead of idle. The fix is to add C to the
> tag tree as Peter suggested at OSPM.

I like the idea of adding untagged tasks to the core tree.

> Updated diff below:
> 
> ---8<-----------------------
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 005d7f7323e2d..625377f393ed3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -182,9 +182,6 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
>  
>  	rq->core->core_task_seq++;
>  
> -	if (!p->core_cookie)
> -		return;
> -
>  	node = &rq->core_tree.rb_node;
>  	parent = *node;
>  
> @@ -215,7 +212,7 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
>  
>  void sched_core_add(struct rq *rq, struct task_struct *p)
>  {
> -	if (p->core_cookie && task_on_rq_queued(p))
> +	if (task_on_rq_queued(p))
>  		sched_core_enqueue(rq, p);
>  }

It appears there are other call sites of sched_core_enqueue() where
core_cookie is checked: cpu_cgroup_fork() and __sched_write_tag().


* Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison
  2020-05-16  3:42                               ` Aaron Lu
@ 2020-05-22  9:40                                 ` Aaron Lu
  0 siblings, 0 replies; 110+ messages in thread
From: Aaron Lu @ 2020-05-22  9:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On Sat, May 16, 2020 at 11:42:30AM +0800, Aaron Lu wrote:
> On Thu, May 14, 2020 at 03:02:48PM +0200, Peter Zijlstra wrote:
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4476,6 +4473,16 @@ next_class:;
> >  		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
> >  	}
> >  
> > +	/* XXX SMT2 only */
> > +	if (new_active == 1 && old_active > 1) {
> 
> There is a case where an incompatible task appears but we fail to 'drop
> into single-rq mode' per the above condition check. The TLDR is: when
> there is a task that sits on the sibling rq with the same cookie as
> 'max', new_active will be 2 instead of 1 and that would cause us missing
> the chance to do a sync of core min_vruntime.

FWIW: when I disable the feature of running cookie_pick task on sibling
and thus enforce a strict single-rq mode, Peter's patch works well for
the scenario described below.

> This is how it happens:
> 1) 2 tasks of the same cgroup with different weight running on 2 siblings,
>    say cg0_A with weight 1024 bound at cpu0 and cg0_B with weight 2 bound
>    at cpu1(assume cpu0 and cpu1 are siblings);
> 2) Since new_active == 2, we didn't trigger min_vruntime sync. For
>    simplicity, let's assume both siblings' root cfs_rq's min_vruntime and
>    core_vruntime are all at 0 now;
> 3) let the two tasks run a while;
> 4) a new task cg1_C of another cgroup gets queued on cpu1. Since cpu1's
>    existing task has a very small weight, its cfs_rq's min_vruntime can
>    be much larger than cpu0's cfs_rq min_vruntime. So cg1_C's vruntime is
>    much larger than cg0_A's and the 'max' of the core wide task
>    selection goes to cg0_A;
> 5) Now I suppose we should drop into single-rq mode and by doing a sync
>    of core min_vruntime, cg1_C's turn shall come. But the problem is, our
>    current selection logic prefers not to waste CPU time, so after deciding
>    cg0_A as the 'max', the sibling will also do a cookie_pick() and
>    get cg0_B to run. This is where the problem arises: new_active is 2
>    instead of the expected 1.
> 6) Due to we didn't do the sync of core min_vruntime, the newly queued
>    cg1_C shall wait a long time before cg0_A's vruntime catches up.

P.S. this is what I did to enforce a strict single-rq mode:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1fa5b48b742a..0f5580bc7e96 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4411,7 +4411,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 	    (!max || prio_less(max, class_pick)))
 		return class_pick;
 
-	return cookie_pick;
+	return NULL;
 }
 
 static struct task_struct *


* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-21 13:47     ` Joel Fernandes
  2020-05-21 20:20       ` Vineeth Remanan Pillai
@ 2020-05-22 12:59       ` Peter Zijlstra
  2020-05-22 21:35         ` Joel Fernandes
  1 sibling, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2020-05-22 12:59 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo, tglx,
	pjt, torvalds, vpillai, linux-kernel, fweisbec, keescook,
	kerrnel, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, May 21, 2020 at 09:47:05AM -0400, Joel Fernandes wrote:
> Hi Peter,
> Thanks for the comments.
> 
> On Thu, May 21, 2020 at 10:51:22AM +0200, Peter Zijlstra wrote:
> > On Wed, May 20, 2020 at 06:26:42PM -0400, Joel Fernandes (Google) wrote:
> > > Add a per-thread core scheduling interface which allows a thread to tag
> > > itself and enable core scheduling. Based on discussion at OSPM with
> > > maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> > >  1 - enable core scheduling for the task.
> > >  0 - disable core scheduling for the task.
> > 
> > Yeah, so this is a terrible interface :-)
> 
> I tried to keep it simple. You are right, lets make it better.
> 
> > It doesn't allow tasks to form their own groups (by for example setting
> > the key to that of another task).
> 
> So for this, I was thinking of making the prctl pass in an integer. And 0
> would mean untagged. Does that sound good to you?

A TID, I think. If you pass your own TID, you tag yourself as
not-sharing. If you tag yourself with another tasks's TID, you can do
ptrace tests to see if you're allowed to observe their junk.

> > It is also horribly ill defined what it means to 'enable', with whom
> > is it allows to share a core.
> 
> I couldn't parse this. Do you mean "enabling coresched does not make sense if
> we don't specify whom to share the core with?"

As a corollary, yes. I mostly meant that a blanket 'enable' doesn't
specify a 'who' you're sharing your bits with.

> > OK, so cgroup always wins; is why is that a good thing?
> 
> I was just trying to respect the functionality of the CGroup patch in the
> coresched series; after all, a gentleman named Peter Zijlstra wrote that
> patch ;-) ;-).

Yeah, but I think that same guy said that that was a shit interface and
only hacked up because it was easy :-)

> More seriously, the reason I did it this way is the prctl-tagging is a bit
> incompatible with CGroup tagging:
> 
> 1. What happens if 2 tasks are in a tagged CGroup and one of them changes
> their cookie through prctl? Do they still remain in the tagged CGroup but are
> now going to not trust each other? Do they get removed from the CGroup? This
> is why I made the prctl fail with -EBUSY in such cases.
> 
> 2. What happens if 2 tagged tasks with different cookies are added to a
> tagged CGroup? Do we fail the addition of the tasks to the group, or do we
> override their cookie (like I'm doing)?

For #2 I think I prefer failure.

But having the rationale spelled out in documentation (man-pages for
example) is important.

> > > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > > while improving the camera streaming frame rate by ~3%.
> > 
> > It doesn't consider permissions.
> > 
> > Basically, with the way you guys use it, it should be a CAP_SYS_ADMIN
> > only to enable core-sched.
> 
> True, we were relying on the seccomp sandboxing in ChromeOS to protect the
> prctl, but you're right and I fixed it for the next revision.

With the TID idea above you get the ptrace tests.


* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-21 21:58       ` Jesse Barnes
@ 2020-05-22 16:33         ` Linus Torvalds
  0 siblings, 0 replies; 110+ messages in thread
From: Linus Torvalds @ 2020-05-22 16:33 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Joel Fernandes, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, vpillai, Linux Kernel Mailing List,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Thu, May 21, 2020 at 2:58 PM Jesse Barnes <jsbarnes@google.com> wrote:
>
> Expanding on this a little, we're working on a couple of projects that
> should provide results like these for upstream.  One is continuously
> rebasing our upstream backlog onto new kernels for testing purposes
> (the idea here is to make it easier for us to update kernels on
> Chromebooks),

Lovely. Not just for any performance work that comes out of this, but
hopefully this means that we'll also have quick problem reports if
something happens that affects chrome.

There's certainly been issues on the server side of google where we
made changes (*cough*cgroup*cough*) which didn't make anybody really
blink until years after the fact.. Which ends up being very
inconvenient when other parts of the community have been using the new
features for years.

> and the second is to drive more stuff into the
> kernelci.org infrastructure.  Given the test environments we have in
> place now, we can probably get results from our continuous rebase
> project first and provide those against -rc releases if that's
> something you'd be interested in.

I think the more automated (or regular, or close-to-upstream)
real-world testing that we get, the better off we are.  We have a
number of regular distributions that track the upstream kernel fairly
closely, so we get a fair amount of coverage for the normal desktop
loads.

And the bots are doing great, but they tend to test very specific
things (in the case of "syzbot" the "specific" thing is obviously
pretty far-ranging, but it's still very small details). And latency
has always been harder to really test (outside of the truly trivial
microbenchmarks), so the fact that it sounds like you're going to test
not only a different environment than the usual distros but have a few
macro-level latency tests just sounds lovely in general.

Let's see how lovely I think it is once you start sending regression reports..

                Linus

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.
  2020-05-22  3:44       ` Aaron Lu
@ 2020-05-22 20:13         ` Joel Fernandes
  0 siblings, 0 replies; 110+ messages in thread
From: Joel Fernandes @ 2020-05-22 20:13 UTC (permalink / raw)
  To: Aaron Lu
  Cc: vpillai, Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra,
	Tim Chen, mingo, tglx, pjt, torvalds, linux-kernel, fweisbec,
	keescook, kerrnel, Phil Auld, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Aaron Lu

On Fri, May 22, 2020 at 11:44:06AM +0800, Aaron Lu wrote:
[...]
> > Updated diff below:
> > 
> > ---8<-----------------------
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 005d7f7323e2d..625377f393ed3 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -182,9 +182,6 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> >  
> >  	rq->core->core_task_seq++;
> >  
> > -	if (!p->core_cookie)
> > -		return;
> > -
> >  	node = &rq->core_tree.rb_node;
> >  	parent = *node;
> >  
> > @@ -215,7 +212,7 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
> >  
> >  void sched_core_add(struct rq *rq, struct task_struct *p)
> >  {
> > -	if (p->core_cookie && task_on_rq_queued(p))
> > +	if (task_on_rq_queued(p))
> >  		sched_core_enqueue(rq, p);
> >  }
> 
> It appears there are other call sites of sched_core_enqueue() where
> core_cookie is checked: cpu_cgroup_fork() and __sched_write_tag().

Thanks, but looks like pick_task()'s caller also makes various assumptions
about cookie == 0 so all that needs to be vetted again I think.

 - Joel


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-22 12:59       ` Peter Zijlstra
@ 2020-05-22 21:35         ` Joel Fernandes
  2020-05-24 14:00           ` Phil Auld
  0 siblings, 1 reply; 110+ messages in thread
From: Joel Fernandes @ 2020-05-22 21:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo, tglx,
	pjt, torvalds, vpillai, linux-kernel, fweisbec, keescook,
	Phil Auld, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
[..]
> > > It doesn't allow tasks to form their own groups (by for example setting
> > > the key to that of another task).
> > 
> > So for this, I was thinking of making the prctl pass in an integer. And 0
> > would mean untagged. Does that sound good to you?
> 
> A TID, I think. If you pass your own TID, you tag yourself as
> not-sharing. If you tag yourself with another task's TID, you can do
> ptrace tests to see if you're allowed to observe their junk.

But that would require a bunch of tasks agreeing on which TID to tag with.
For example, if 2 tasks tag with each other's TID, then they would have
different tags and not share.

What's wrong with passing in an integer instead? In any case, we would do the
CAP_SYS_ADMIN check to limit who can do it.

Also, one thing CGroup interface allows is an external process to set the
cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
addition to, the prctl(2). That way, we can drop the CGroup interface
completely. How do you feel about that?

> > > It is also horribly ill-defined what it means to 'enable', with whom
> > > it is allowed to share a core.
> > 
> > I couldn't parse this. Do you mean "enabling coresched does not make sense if
> > we don't specify whom to share the core with?"
> 
> As a corollary, yes. I mostly meant that a blanket 'enable' doesn't
> specify a 'who' you're sharing your bits with.

Yes, ok. I can reword the commit log a bit to make it more clear that we are
specifying who we can share a core with.

> > I was just trying to respect the functionality of the CGroup patch in the
> > coresched series, after all a gentleman named Peter Zijlstra wrote that
> > patch ;-) ;-).
> 
> Yeah, but I think that same guy said that that was a shit interface and
> only hacked up because it was easy :-)

Fair enough :-)

> > More seriously, the reason I did it this way is the prctl-tagging is a bit
> > incompatible with CGroup tagging:
> > 
> > 1. What happens if 2 tasks are in a tagged CGroup and one of them changes
> > their cookie through prctl? Do they still remain in the tagged CGroup but are
> > now going to not trust each other? Do they get removed from the CGroup? This
> > is why I made the prctl fail with -EBUSY in such cases.
> > 
> > 2. What happens if 2 tagged tasks with different cookies are added to a
> > tagged CGroup? Do we fail the addition of the tasks to the group, or do we
> > override their cookie (like I'm doing)?
> 
> For #2 I think I prefer failure.
> 
> But having the rationale spelled out in documentation (man-pages for
> example) is important.

If we drop the CGroup interface, this would avoid both #1 and #2.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-22 21:35         ` Joel Fernandes
@ 2020-05-24 14:00           ` Phil Auld
  2020-05-28 14:51             ` Joel Fernandes
  2020-05-28 17:01             ` Peter Zijlstra
  0 siblings, 2 replies; 110+ messages in thread
From: Phil Auld @ 2020-05-24 14:00 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Peter Zijlstra, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	mingo, tglx, pjt, torvalds, vpillai, linux-kernel, fweisbec,
	keescook, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> [..]
> > > > It doesn't allow tasks to form their own groups (by for example setting
> > > > the key to that of another task).
> > > 
> > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > would mean untagged. Does that sound good to you?
> > 
> > A TID, I think. If you pass your own TID, you tag yourself as
> > not-sharing. If you tag yourself with another task's TID, you can do
> > ptrace tests to see if you're allowed to observe their junk.
> 
> But that would require a bunch of tasks agreeing on which TID to tag with.
> For example, if 2 tasks tag with each other's TID, then they would have
> different tags and not share.
> 
> What's wrong with passing in an integer instead? In any case, we would do the
> CAP_SYS_ADMIN check to limit who can do it.
> 
> Also, one thing CGroup interface allows is an external process to set the
> cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> addition to, the prctl(2). That way, we can drop the CGroup interface
> completely. How do you feel about that?
>

I think it should be an arbitrary 64bit value, in both interfaces to avoid
any potential reuse security issues. 

I think the cgroup interface could be extended not to be a boolean but take
the value. With 0 being untagged as now.

And sched_setattr could be used to set it on a per task basis.


> > > > It is also horribly ill-defined what it means to 'enable', with whom
> > > > it is allowed to share a core.
> > > 
> > > I couldn't parse this. Do you mean "enabling coresched does not make sense if
> > > we don't specify whom to share the core with?"
> > 
> > As a corollary, yes. I mostly meant that a blanket 'enable' doesn't
> > specify a 'who' you're sharing your bits with.
> 
> Yes, ok. I can reword the commit log a bit to make it more clear that we are
> specifying who we can share a core with.
> 
> > > I was just trying to respect the functionality of the CGroup patch in the
> > > coresched series, after all a gentleman named Peter Zijlstra wrote that
> > > patch ;-) ;-).
> > 
> > Yeah, but I think that same guy said that that was a shit interface and
> > only hacked up because it was easy :-)
> 
> Fair enough :-)
> 
> > > More seriously, the reason I did it this way is the prctl-tagging is a bit
> > > incompatible with CGroup tagging:
> > > 
> > > 1. What happens if 2 tasks are in a tagged CGroup and one of them changes
> > > their cookie through prctl? Do they still remain in the tagged CGroup but are
> > > now going to not trust each other? Do they get removed from the CGroup? This
> > > is why I made the prctl fail with -EBUSY in such cases.
> > > 
> > > 2. What happens if 2 tagged tasks with different cookies are added to a
> > > tagged CGroup? Do we fail the addition of the tasks to the group, or do we
> > > override their cookie (like I'm doing)?
> > 
> > For #2 I think I prefer failure.
> > 
> > But having the rationale spelled out in documentation (man-pages for
> > example) is important.
> 
> If we drop the CGroup interface, this would avoid both #1 and #2.
>

I believe both are useful.  Personally, I think the per-task setting should
win over the cgroup tagging. In that case #1 just falls out. And #2 pretty
much as well. Nothing would happen to the tagged task as they were added
to the cgroup. They'd keep their explicitly assigned tags and everything
should "just work". There are other reasons to be in a cpu cgroup together
than just the core scheduling tag.

There are a few other edge cases, like if you are in a cgroup, but have
been tagged explicitly with sched_setattr and then get untagged (presumably
by setting 0) do you get the cgroup tag or just stay untagged? I think based
on per-task winning you'd stay untagged. I suppose you could move out and
back in the cgroup to get the tag reapplied (Or maybe the cgroup interface
could just be reused with the same value to re-tag everyone who's untagged).



Cheers,
Phil


> thanks,
> 
>  - Joel
> 

-- 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-24 14:00           ` Phil Auld
@ 2020-05-28 14:51             ` Joel Fernandes
  2020-05-28 17:01             ` Peter Zijlstra
  1 sibling, 0 replies; 110+ messages in thread
From: Joel Fernandes @ 2020-05-28 14:51 UTC (permalink / raw)
  To: Phil Auld
  Cc: Peter Zijlstra, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	mingo, tglx, pjt, torvalds, vpillai, linux-kernel, fweisbec,
	keescook, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, derkling

On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > [..]
> > > > > It doesn't allow tasks to form their own groups (by for example setting
> > > > > the key to that of another task).
> > > > 
> > > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > > would mean untagged. Does that sound good to you?
> > > 
> > > A TID, I think. If you pass your own TID, you tag yourself as
> > > not-sharing. If you tag yourself with another task's TID, you can do
> > > ptrace tests to see if you're allowed to observe their junk.
> > 
> > But that would require a bunch of tasks agreeing on which TID to tag with.
> > For example, if 2 tasks tag with each other's TID, then they would have
> > different tags and not share.
> > 
> > What's wrong with passing in an integer instead? In any case, we would do the
> > CAP_SYS_ADMIN check to limit who can do it.
> > 
> > Also, one thing CGroup interface allows is an external process to set the
> > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> > addition to, the prctl(2). That way, we can drop the CGroup interface
> > completely. How do you feel about that?
> >
> 
> I think it should be an arbitrary 64bit value, in both interfaces to avoid
> any potential reuse security issues. 
> 
> I think the cgroup interface could be extended not to be a boolean but take
> the value. With 0 being untagged as now.
> 
> And sched_setattr could be used to set it on a per task basis.

Yeah, something like this will be needed.

> > > > More seriously, the reason I did it this way is the prctl-tagging is a bit
> > > > incompatible with CGroup tagging:
> > > > 
> > > > 1. What happens if 2 tasks are in a tagged CGroup and one of them changes
> > > > their cookie through prctl? Do they still remain in the tagged CGroup but are
> > > > now going to not trust each other? Do they get removed from the CGroup? This
> > > > is why I made the prctl fail with -EBUSY in such cases.

In util-clamp's design (which has a task-specific attribute and a task-group
attribute), it seems that the priority is the task-specific value first, then
the group one, then the system-wide one.

Perhaps a similar design can be adopted for this interface. So probably we
should let the per-task interface not fail if the task was already in CGroup
and rather prioritize its value first before looking at the group one?

Uclamp's comments:

 * The effective clamp bucket index of a task depends on, by increasing
 * priority:
 * - the task specific clamp value, when explicitly requested from userspace
 * - the task group effective clamp value, for tasks not either in the root
 *   group or in an autogroup
 * - the system default clamp value, defined by the sysadmin

> > > > 
> > > > 2. What happens if 2 tagged tasks with different cookies are added to a
> > > > tagged CGroup? Do we fail the addition of the tasks to the group, or do we
> > > > override their cookie (like I'm doing)?
> > > 
> > > For #2 I think I prefer failure.
> > > 
> > > But having the rationale spelled out in documentation (man-pages for
> > > example) is important.
> > 
> > If we drop the CGroup interface, this would avoid both #1 and #2.
> >
> 
> I believe both are useful.  Personally, I think the per-task setting should
> win over the cgroup tagging. In that case #1 just falls out.

Cool, this is similar to what I mentioned above.

> And #2 pretty
> much as well. Nothing would happen to the tagged task as they were added
> to the cgroup. They'd keep their explicitly assigned tags and everything
> should "just work". There are other reasons to be in a cpu cgroup together
> than just the core scheduling tag.

Well, ok, so there's no reason then to fail the addition of a prctl-tagged
task to a CGroup; we can let it succeed but prioritize the task-specific
attribute over the group-specific one.

> There are a few other edge cases, like if you are in a cgroup, but have
> been tagged explicitly with sched_setattr and then get untagged (presumably
> by setting 0) do you get the cgroup tag or just stay untagged? I think based
> on per-task winning you'd stay untagged. I suppose you could move out and
> back in the cgroup to get the tag reapplied (Or maybe the cgroup interface
> could just be reused with the same value to re-tag everyone who's untagged).

If we maintain a task-specific tag and a group-specific tag, then I think
both tags can coexist and the final tag is decided on priority basis
mentioned above.

So before getting into CGroup, I think first we develop the task-specific
tagging mechanism like Peter was suggesting. So let us talk about that. I
will reply to the other thread Vineeth started while CC'ing you. In
particular, I like Peter's idea about user land passing a TID to share a core
with.

thanks,

 - Joel


> 
> 
> 
> Cheers,
> Phil
> 
> 
> > thanks,
> > 
> >  - Joel
> > 
> 
> -- 
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-24 14:00           ` Phil Auld
  2020-05-28 14:51             ` Joel Fernandes
@ 2020-05-28 17:01             ` Peter Zijlstra
  2020-05-28 18:17               ` Phil Auld
  2020-05-28 18:23               ` Joel Fernandes
  1 sibling, 2 replies; 110+ messages in thread
From: Peter Zijlstra @ 2020-05-28 17:01 UTC (permalink / raw)
  To: Phil Auld
  Cc: Joel Fernandes, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	mingo, tglx, pjt, torvalds, vpillai, linux-kernel, fweisbec,
	keescook, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > [..]
> > > > > It doesn't allow tasks to form their own groups (by for example setting
> > > > > the key to that of another task).
> > > > 
> > > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > > would mean untagged. Does that sound good to you?
> > > 
> > > A TID, I think. If you pass your own TID, you tag yourself as
> > > not-sharing. If you tag yourself with another task's TID, you can do
> > > ptrace tests to see if you're allowed to observe their junk.
> > 
> > But that would require a bunch of tasks agreeing on which TID to tag with.
> > For example, if 2 tasks tag with each other's TID, then they would have
> > different tags and not share.

Well, don't do that then ;-)

> > What's wrong with passing in an integer instead? In any case, we would do the
> > CAP_SYS_ADMIN check to limit who can do it.

So the actual permission model can be different depending on how broken
the hardware is.

> > Also, one thing CGroup interface allows is an external process to set the
> > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> > addition to, the prctl(2). That way, we can drop the CGroup interface
> > completely. How do you feel about that?
> >
> 
> I think it should be an arbitrary 64bit value, in both interfaces to avoid
> any potential reuse security issues.
> 
> I think the cgroup interface could be extended not to be a boolean but take
> the value. With 0 being untagged as now.

How do you avoid reuse in such a huge space? That just creates yet
another problem for the kernel to keep track of who is who.

With random u64 numbers, it even becomes hard to determine if you're
sharing at all or not.

Now, with the current SMT+MDS trainwreck, any sharing is bad because it
allows leaking kernel privates. But under a less severe threat scenario,
say where only user data would be at risk, the ptrace() tests make
sense, but those become really hard with random u64 numbers too.

What would the purpose of random u64 values be for cgroups? That only
replicates the problem of determining uniqueness there. Then you can get
two cgroups unintentionally sharing because you got lucky.

Also, fundamentally, we cannot have more threads than TID space, it's a
natural identifier.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-28 17:01             ` Peter Zijlstra
@ 2020-05-28 18:17               ` Phil Auld
  2020-05-28 18:34                 ` Phil Auld
  2020-05-28 18:23               ` Joel Fernandes
  1 sibling, 1 reply; 110+ messages in thread
From: Phil Auld @ 2020-05-28 18:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	mingo, tglx, pjt, torvalds, vpillai, linux-kernel, fweisbec,
	keescook, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, May 28, 2020 at 07:01:28PM +0200 Peter Zijlstra wrote:
> On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > > [..]
> > > > > > It doesn't allow tasks to form their own groups (by for example setting
> > > > > > the key to that of another task).
> > > > > 
> > > > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > > > would mean untagged. Does that sound good to you?
> > > > 
> > > > A TID, I think. If you pass your own TID, you tag yourself as
> > > > not-sharing. If you tag yourself with another task's TID, you can do
> > > > ptrace tests to see if you're allowed to observe their junk.
> > > 
> > > But that would require a bunch of tasks agreeing on which TID to tag with.
> > > For example, if 2 tasks tag with each other's TID, then they would have
> > > different tags and not share.
> 
> Well, don't do that then ;-)
>

That was a poorly worded example :)

The point I was trying to make was more that one TID of a group (not cgroup!)
of tasks is just an arbitrary value.

At a single process (or pair rather) level, sure, you can see it as an
identifier of whom you want to share with, but even then you have to tag
both processes with this. And it has less meaning when the whom you want to
share with is multiple tasks.

> > > What's wrong with passing in an integer instead? In any case, we would do the
> > > CAP_SYS_ADMIN check to limit who can do it.
> 
> So the actual permission model can be different depending on how broken
> the hardware is.
> 
> > > Also, one thing CGroup interface allows is an external process to set the
> > > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> > > addition to, the prctl(2). That way, we can drop the CGroup interface
> > > completely. How do you feel about that?
> > >
> > 
> > I think it should be an arbitrary 64bit value, in both interfaces to avoid
> > any potential reuse security issues.
> > 
> > I think the cgroup interface could be extended not to be a boolean but take
> > the value. With 0 being untagged as now.
> 
> How do you avoid reuse in such a huge space? That just creates yet
> another problem for the kernel to keep track of who is who.
>

The kernel doesn't care or have to track anything.  The admin does.
At the kernel level it's just matching cookies. 

Tasks A,B,C all can share core so you give them each A's TID as a cookie.
Task A then exits. Now B and C are using essentially a random value.
Task D comes along and wants to share with B and C. You have to tag it
with A's old TID, which has no meaning at this point.

And if A's TID ever gets reused. The new A` gets to share too. At some
level, aren't those still 32 bits?

> With random u64 numbers, it even becomes hard to determine if you're
> sharing at all or not.
> 
> Now, with the current SMT+MDS trainwreck, any sharing is bad because it
> allows leaking kernel privates. But under a less severe threat scenario,
> say where only user data would be at risk, the ptrace() tests make
> sense, but those become really hard with random u64 numbers too.
> 
> What would the purpose of random u64 values be for cgroups? That only
> replicates the problem of determining uniqueness there. Then you can get
> two cgroups unintentionally sharing because you got lucky.
>

Seems that would be more flexible for the admin. 

What if you had two cgroups you wanted to allow to run together?  Or a
cgroup and a few processes from a different one (say with different
quotas or something).

I don't have such use cases so I don't feel that strongly but it seemed
more flexible and followed the mechanism-in-kernel/policy-in-userspace
dictum rather than basing the functionality on the implementation details.


Cheers,
Phil


> Also, fundamentally, we cannot have more threads than TID space, it's a
> natural identifier.
> 

-- 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-28 17:01             ` Peter Zijlstra
  2020-05-28 18:17               ` Phil Auld
@ 2020-05-28 18:23               ` Joel Fernandes
  1 sibling, 0 replies; 110+ messages in thread
From: Joel Fernandes @ 2020-05-28 18:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Phil Auld, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	mingo, tglx, pjt, torvalds, vpillai, linux-kernel, fweisbec,
	keescook, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

Hi Peter,

On Thu, May 28, 2020 at 07:01:28PM +0200, Peter Zijlstra wrote:
> On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > > [..]
> > > > > > It doesn't allow tasks to form their own groups (by for example setting
> > > > > > the key to that of another task).
> > > > > 
> > > > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > > > would mean untagged. Does that sound good to you?
> > > > 
> > > > A TID, I think. If you pass your own TID, you tag yourself as
> > > > not-sharing. If you tag yourself with another task's TID, you can do
> > > > ptrace tests to see if you're allowed to observe their junk.
> > > 
> > > But that would require a bunch of tasks agreeing on which TID to tag with.
> > > For example, if 2 tasks tag with each other's TID, then they would have
> > > different tags and not share.
> 
> Well, don't do that then ;-)

We could also guard it with a mutex. First task to set the TID wins, the
other thread just reuses the cookie of the TID that won.

But I think we cannot just use the TID value as the cookie, due to TID
wrap-around and reuse. Otherwise we could accidentally group 2 tasks. Instead, I
suggest let us keep TID as the interface per your suggestion and do the
needed ptrace checks, but convert the TID to the task_struct pointer value
and use that as the cookie for the group of tasks sharing a core.

Thoughts?

thanks,

 - Joel

> > > What's wrong with passing in an integer instead? In any case, we would do the
> > > CAP_SYS_ADMIN check to limit who can do it.
> 
> So the actual permission model can be different depending on how broken
> the hardware is.
> 
> > > Also, one thing CGroup interface allows is an external process to set the
> > > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> > > addition to, the prctl(2). That way, we can drop the CGroup interface
> > > completely. How do you feel about that?
> > >
> > 
> > I think it should be an arbitrary 64bit value, in both interfaces to avoid
> > any potential reuse security issues.
> > 
> > I think the cgroup interface could be extended not to be a boolean but take
> > the value. With 0 being untagged as now.
> 
> How do you avoid reuse in such a huge space? That just creates yet
> another problem for the kernel to keep track of who is who.
> 
> With random u64 numbers, it even becomes hard to determine if you're
> sharing at all or not.
> 
> Now, with the current SMT+MDS trainwreck, any sharing is bad because it
> allows leaking kernel privates. But under a less severe threat scenario,
> say where only user data would be at risk, the ptrace() tests make
> sense, but those become really hard with random u64 numbers too.
> 
> What would the purpose of random u64 values be for cgroups? That only
> replicates the problem of determining uniqueness there. Then you can get
> two cgroups unintentionally sharing because you got lucky.
> 
> Also, fundamentally, we cannot have more threads than TID space, it's a
> natural identifier.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH RFC] sched: Add a per-thread core scheduling interface
  2020-05-28 18:17               ` Phil Auld
@ 2020-05-28 18:34                 ` Phil Auld
  0 siblings, 0 replies; 110+ messages in thread
From: Phil Auld @ 2020-05-28 18:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Nishanth Aravamudan, Julien Desfossez, Tim Chen,
	mingo, tglx, pjt, torvalds, vpillai, linux-kernel, fweisbec,
	keescook, Aaron Lu, Aubrey Li, aubrey.li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, May 28, 2020 at 02:17:19PM -0400 Phil Auld wrote:
> On Thu, May 28, 2020 at 07:01:28PM +0200 Peter Zijlstra wrote:
> > On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> > > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > > > [..]
> > > > > > > It doesn't allow tasks to form their own groups (by for example setting
> > > > > > > the key to that of another task).
> > > > > > 
> > > > > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > > > > would mean untagged. Does that sound good to you?
> > > > > 
> > > > > A TID, I think. If you pass your own TID, you tag yourself as
> > > > > not-sharing. If you tag yourself with another task's TID, you can do
> > > > > ptrace tests to see if you're allowed to observe their junk.
> > > > 
> > > > But that would require a bunch of tasks agreeing on which TID to tag with.
> > > > For example, if 2 tasks tag with each other's TID, then they would have
> > > > different tags and not share.
> > 
> > Well, don't do that then ;-)
> >
> 
> That was a poorly worded example :)
>

Heh, sorry, I thought that was my statement. I do not mean to belittle Joel's
example...  That's a fine example of a totally different problem than I
was thinking of :)


Cheers,
Phil

> The point I was trying to make was more that one TID of a group (not cgroup!)
> of tasks is just an arbitrary value.
> 
> At a single process (or pair rather) level, sure, you can see it as an
> identifier of whom you want to share with, but even then you have to tag
> > both processes with this. And it has less meaning when the "whom you want to
> > share with" is multiple tasks.
> 
> > > > What's wrong with passing in an integer instead? In any case, we would do the
> > > > CAP_SYS_ADMIN check to limit who can do it.
> > 
> > So the actual permission model can be different depending on how broken
> > the hardware is.
> > 
> > > > Also, one thing CGroup interface allows is an external process to set the
> > > > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> > > > addition to, the prctl(2). That way, we can drop the CGroup interface
> > > > completely. How do you feel about that?
> > > >
> > > 
> > > I think it should be an arbitrary 64bit value, in both interfaces to avoid
> > > any potential reuse security issues.
> > > 
> > > I think the cgroup interface could be extended not to be a boolean but take
> > > the value. With 0 being untagged as now.
> > 
> > How do you avoid reuse in such a huge space? That just creates yet
> > another problem for the kernel to keep track of who is who.
> >
> 
> The kernel doesn't care or have to track anything.  The admin does.
> At the kernel level it's just matching cookies. 
> 
> Tasks A, B and C can all share a core, so you give them each A's TID as a cookie.
> Task A then exits. Now B and C are using essentially a random value.
> Task D comes along and wants to share with B and C. You have to tag it
> with A's old TID, which has no meaning at this point.
> 
> And if A's TID ever gets reused, the new A' gets to share too. At some
> level, aren't those still 32 bits?
> 
> > With random u64 numbers, it even becomes hard to determine if you're
> > sharing at all or not.
> > 
> > Now, with the current SMT+MDS trainwreck, any sharing is bad because it
> > allows leaking kernel privates. But under a less severe threat scenario,
> > say where only user data would be at risk, the ptrace() tests make
> > sense, but those become really hard with random u64 numbers too.
> > 
> > What would the purpose of random u64 values be for cgroups? That only
> > replicates the problem of determining uniqueness there. Then you can get
> > two cgroups unintentionally sharing because you got lucky.
> >
> 
> Seems that would be more flexible for the admin. 
> 
> What if you had two cgroups you wanted to allow to run together?  Or a
> cgroup and a few processes from a different one (say with different
> quotas or something).
> 
> I don't have such use cases so I don't feel that strongly but it seemed
> more flexible and followed the mechanism-in-kernel/policy-in-userspace
> dictum rather than basing the functionality on the implementation details.
> 
> 
> Cheers,
> Phil
> 
> 
> > Also, fundamentally, we cannot have more threads than TID space, it's a
> > natural identifier.
> > 
> 
> -- 

-- 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison
  2020-05-14 13:02                             ` Peter Zijlstra
  2020-05-14 22:51                               ` Vineeth Remanan Pillai
  2020-05-16  3:42                               ` Aaron Lu
@ 2020-06-08  1:41                               ` Ning, Hongyu
  2 siblings, 0 replies; 110+ messages in thread
From: Ning, Hongyu @ 2020-06-08  1:41 UTC (permalink / raw)
  To: Peter Zijlstra, Aaron Lu
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Aaron Lu, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes

On 2020/5/14 21:02, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 08:34:57PM +0800, Aaron Lu wrote:
>> With this said, I realized a workaround for the issue described above:
>> when the core went from 'compatible mode'(step 1-3) to 'incompatible
>> mode'(step 4), reset all root level sched entities' vruntime to be the
>> same as the core wide min_vruntime. After all, the core is transforming
>> from two runqueue mode to single runqueue mode... I think this can solve
>> the issue to some extent but I may miss other scenarios.
> 
> A little something like so, this syncs min_vruntime when we switch to
> single queue mode. This is very much SMT2 only; I got my head in a twist
> when thinking about more siblings, I'll have to try again later.
> 
> This very much retains the horrible approximation of S we always do.
> 
> Also, it is _completely_ untested...
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -102,7 +102,6 @@ static inline int __task_prio(struct tas
>  /* real prio, less is less */
>  static inline bool prio_less(struct task_struct *a, struct task_struct *b)
>  {
> -
>  	int pa = __task_prio(a), pb = __task_prio(b);
>  
>  	if (-pa < -pb)
> @@ -114,19 +113,8 @@ static inline bool prio_less(struct task
>  	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
>  		return !dl_time_before(a->dl.deadline, b->dl.deadline);
>  
> -	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
> -		u64 vruntime = b->se.vruntime;
> -
> -		/*
> -		 * Normalize the vruntime if tasks are in different cpus.
> -		 */
> -		if (task_cpu(a) != task_cpu(b)) {
> -			vruntime -= task_cfs_rq(b)->min_vruntime;
> -			vruntime += task_cfs_rq(a)->min_vruntime;
> -		}
> -
> -		return !((s64)(a->se.vruntime - vruntime) <= 0);
> -	}
> +	if (pa == MAX_RT_PRIO + MAX_NICE)
> +		return cfs_prio_less(a, b);
>  
>  	return false;
>  }
> @@ -4293,10 +4281,11 @@ static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	struct task_struct *next, *max = NULL;
> +	int old_active = 0, new_active = 0;
>  	const struct sched_class *class;
>  	const struct cpumask *smt_mask;
> -	int i, j, cpu;
>  	bool need_sync = false;
> +	int i, j, cpu;
>  
>  	cpu = cpu_of(rq);
>  	if (cpu_is_offline(cpu))
> @@ -4349,10 +4338,14 @@ pick_next_task(struct rq *rq, struct tas
>  		rq_i->core_pick = NULL;
>  
>  		if (rq_i->core_forceidle) {
> +			// XXX is_idle_task(rq_i->curr) && rq_i->nr_running ??
>  			need_sync = true;
>  			rq_i->core_forceidle = false;
>  		}
>  
> +		if (!is_idle_task(rq_i->curr))
> +			old_active++;
> +
>  		if (i != cpu)
>  			update_rq_clock(rq_i);
>  	}
> @@ -4463,8 +4456,12 @@ next_class:;
>  
>  		WARN_ON_ONCE(!rq_i->core_pick);
>  
> -		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> -			rq_i->core_forceidle = true;
> +		if (is_idle_task(rq_i->core_pick)) {
> +			if (rq_i->nr_running)
> +				rq_i->core_forceidle = true;
> +		} else {
> +			new_active++;
> +		}
>  
>  		if (i == cpu)
>  			continue;
> @@ -4476,6 +4473,16 @@ next_class:;
>  		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
>  	}
>  
> +	/* XXX SMT2 only */
> +	if (new_active == 1 && old_active > 1) {
> +		/*
> +		 * We just dropped into single-rq mode, increment the sequence
> +		 * count to trigger the vruntime sync.
> +		 */
> +		rq->core->core_sync_seq++;
> +	}
> +	rq->core->core_active = new_active;
> +
>  done:
>  	set_next_task(rq, next);
>  	return next;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -386,6 +386,12 @@ is_same_group(struct sched_entity *se, s
>  	return NULL;
>  }
>  
> +static inline bool
> +is_same_tg(struct sched_entity *se, struct sched_entity *pse)
> +{
> +	return se->cfs_rq->tg == pse->cfs_rq->tg;
> +}
> +
>  static inline struct sched_entity *parent_entity(struct sched_entity *se)
>  {
>  	return se->parent;
> @@ -394,8 +400,6 @@ static inline struct sched_entity *paren
>  static void
>  find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>  {
> -	int se_depth, pse_depth;
> -
>  	/*
>  	 * preemption test can be made between sibling entities who are in the
>  	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
> @@ -403,23 +407,16 @@ find_matching_se(struct sched_entity **s
>  	 * parent.
>  	 */
>  
> -	/* First walk up until both entities are at same depth */
> -	se_depth = (*se)->depth;
> -	pse_depth = (*pse)->depth;
> -
> -	while (se_depth > pse_depth) {
> -		se_depth--;
> -		*se = parent_entity(*se);
> -	}
> -
> -	while (pse_depth > se_depth) {
> -		pse_depth--;
> -		*pse = parent_entity(*pse);
> -	}
> +	/* XXX we now have 3 of these loops, C stinks */
>  
>  	while (!is_same_group(*se, *pse)) {
> -		*se = parent_entity(*se);
> -		*pse = parent_entity(*pse);
> +		int se_depth = (*se)->depth;
> +		int pse_depth = (*pse)->depth;
> +
> +		if (se_depth <= pse_depth)
> +			*pse = parent_entity(*pse);
> +		if (se_depth >= pse_depth)
> +			*se = parent_entity(*se);
>  	}
>  }
>  
> @@ -455,6 +452,12 @@ static inline struct sched_entity *paren
>  	return NULL;
>  }
>  
> +static inline bool
> +is_same_tg(struct sched_entity *se, struct sched_entity *pse)
> +{
> +	return true;
> +}
> +
>  static inline void
>  find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>  {
> @@ -462,6 +465,31 @@ find_matching_se(struct sched_entity **s
>  
>  #endif	/* CONFIG_FAIR_GROUP_SCHED */
>  
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +	struct sched_entity *se_a = &a->se, *se_b = &b->se;
> +	struct cfs_rq *cfs_rq_a, *cfs_rq_b;
> +	u64 vruntime_a, vruntime_b;
> +
> +	while (!is_same_tg(se_a, se_b)) {
> +		int se_a_depth = se_a->depth;
> +		int se_b_depth = se_b->depth;
> +
> +		if (se_a_depth <= se_b_depth)
> +			se_b = parent_entity(se_b);
> +		if (se_a_depth >= se_b_depth)
> +			se_a = parent_entity(se_a);
> +	}
> +
> +	cfs_rq_a = cfs_rq_of(se_a);
> +	cfs_rq_b = cfs_rq_of(se_b);
> +
> +	vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> +	vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> +
> +	return !((s64)(vruntime_a - vruntime_b) <= 0);
> +}
> +
>  static __always_inline
>  void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
>  
> @@ -6891,6 +6919,18 @@ static void check_preempt_wakeup(struct
>  		set_last_buddy(se);
>  }
>  
> +static void core_sync_entity(struct rq *rq, struct cfs_rq *cfs_rq)
> +{
> +	if (!sched_core_enabled())
> +		return;
> +
> +	if (rq->core->core_sync_seq == cfs_rq->core_sync_seq)
> +		return;
> +
> +	cfs_rq->core_sync_seq = rq->core->core_sync_seq;
> +	cfs_rq->core_vruntime = cfs_rq->min_vruntime;
> +}
> +
>  static struct task_struct *pick_task_fair(struct rq *rq)
>  {
>  	struct cfs_rq *cfs_rq = &rq->cfs;
> @@ -6902,6 +6942,14 @@ static struct task_struct *pick_task_fai
>  	do {
>  		struct sched_entity *curr = cfs_rq->curr;
>  
> +		/*
> +		 * Propagate the sync state down to whatever cfs_rq we need,
> +		 * the active cfs_rq's will have been done by
> +		 * set_next_task_fair(), the rest is inactive and will not have
> +		 * changed due to the current running task.
> +		 */
> +		core_sync_entity(rq, cfs_rq);
> +
>  		se = pick_next_entity(cfs_rq, NULL);
>  
>  		if (curr) {
> @@ -10825,7 +10873,8 @@ static void switched_to_fair(struct rq *
>  	}
>  }
>  
> -/* Account for a task changing its policy or group.
> +/*
> + * Account for a task changing its policy or group.
>   *
>   * This routine is mostly called to set cfs_rq->curr field when a task
>   * migrates between groups/classes.
> @@ -10847,6 +10896,9 @@ static void set_next_task_fair(struct rq
>  	for_each_sched_entity(se) {
>  		struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  
> +		/* snapshot vruntime before using it */
> +		core_sync_entity(rq, cfs_rq);
> +
>  		set_next_entity(cfs_rq, se);
>  		/* ensure bandwidth has been allocated on our new cfs_rq */
>  		account_cfs_rq_runtime(cfs_rq, 0);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -503,6 +503,10 @@ struct cfs_rq {
>  	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
>  	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
>  
> +#ifdef CONFIG_SCHED_CORE
> +	unsigned int		core_sync_seq;
> +	u64			core_vruntime;
> +#endif
>  	u64			exec_clock;
>  	u64			min_vruntime;
>  #ifndef CONFIG_64BIT
> @@ -1035,12 +1039,15 @@ struct rq {
>  	unsigned int		core_enabled;
>  	unsigned int		core_sched_seq;
>  	struct rb_root		core_tree;
> -	bool			core_forceidle;
> +	unsigned int		core_forceidle;
>  
>  	/* shared state */
>  	unsigned int		core_task_seq;
>  	unsigned int		core_pick_seq;
>  	unsigned long		core_cookie;
> +	unsigned int		core_sync_seq;
> +	unsigned int		core_active;
> +
>  #endif
>  };
>  
> @@ -2592,6 +2599,8 @@ static inline bool sched_energy_enabled(
>  
>  #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
>  
> +extern bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
> +
>  #ifdef CONFIG_MEMBARRIER
>  /*
>   * The scheduler provides memory barriers required by membarrier between:
> 

here is a quick test update based on Peter's fairness patch above:

- Kernel under test: 
A: Core scheduling v5 community base + Peter's fairness patch (by reverting Aaron's fairness patch)
https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y + Peter's patch above
B: Core scheduling v5 community base (with Aaron's fairness patchset)
https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y (with Aaron's fairness patch)

- Test results briefing:
OVERALL PERFORMANCE IS THE SAME FOR THE FOLLOWING 3 TEST SETS, BETWEEN THE 2 KERNEL TEST BUILDS

- Test set based on sysbench 1.1.0-bd4b418:
1: sysbench cpu in cgroup cpu 0 + sysbench cpu in cgroup cpu 1 (192 workload tasks for each cgroup)
2: sysbench mysql in cgroup mysql 0 + sysbench mysql in cgroup mysql 1 (192 workload tasks for each cgroup)
3: sysbench cpu in cgroup cpu 0 + sysbench mysql in cgroup mysql 0 (192 workload tasks for each cgroup)

- Test environment:
Intel Xeon Server platform
CPU(s):              192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4

- Test results:

Note: 
1: test results in the following tables are Tput normalized to the default baseline
2: test settings in the following tables:
2.1: default -> core scheduling disabled
2.2: coresched -> core scheduling enabled
3: default test results are reused between the 2 kernel test builds


Test set 1:
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| setting                          | ***   | default   | default     | coresched   | coresched     | **   | default     | default     | coresched     | coresched     |
+==================================+=======+===========+=============+=============+===============+======+=============+=============+===============+===============+
| cgroups                          | ***   | cg cpu 0  | cg cpu 0    | cg cpu 0    | cg cpu 0      | **   | cg cpu 1    | cg cpu 1    | cg cpu 1      | cg cpu 1      |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| sysbench workload                | ***   | cpu       | cpu         | cpu         | cpu           | **   | cpu         | cpu         | cpu           | cpu           |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| record item                      | ***   | Tput_avg  | Tput_stdev% | Tput_avg    | Tput_stdev%   | **   | Tput_avg    | Tput_stdev% | Tput_avg      | Tput_stdev%   |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| Kernel_A(Peter's fairness patch) | ***   |           |             | 0.96        | 3.45%         | **   |             |             | 1.03          | 3.60%         |
+----------------------------------+-------+ 1         + 1.14%       +-------------+---------------+------+ 1           + 1.20%       +---------------+---------------+
| Kernel_B(Aaron's fairness patch) | ***   |           |             | 0.98        | 1.75%         | **   |             |             | 1.01          | 1.83%         |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+

Test set 2:
+----------------------------------+-------+------------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| setting                          | ***   | default    | default     | coresched   | coresched     | **   | default     | default     | coresched     | coresched     |
+==================================+=======+============+=============+=============+===============+======+=============+=============+===============+===============+
| cgroups                          | ***   | cg mysql 0 | cg mysql 0  | cg mysql 0  | cg mysql 0    | **   | cg mysql 1  | cg mysql 1  | cg mysql 1    | cg mysql 1    |
+----------------------------------+-------+------------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| sysbench workload                | ***   | mysql      | mysql       | mysql       | mysql         | **   | mysql       | mysql       | mysql         | mysql         |
+----------------------------------+-------+------------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| record item                      | ***   | Tput_avg   | Tput_stdev% | Tput_avg    | Tput_stdev%   | **   | Tput_avg    | Tput_stdev% | Tput_avg      | Tput_stdev%   |
+----------------------------------+-------+------------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| Kernel_A(Peter's fairness patch) | ***   |            |             | 0.98        | 2.00%         | **   |             |             | 0.98          | 1.98%         |
+----------------------------------+-------+ 1          + 1.85%       +-------------+---------------+------+ 1           + 1.84%       +---------------+---------------+
| Kernel_B(Aaron's fairness patch) | ***   |            |             | 1.01        | 1.61%         | **   |             |             | 1.01          | 1.59%         |
+----------------------------------+-------+------------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+

Test set 3:
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| setting                          | ***   | default   | default     | coresched   | coresched     | **   | default     | default     | coresched     | coresched     |
+==================================+=======+===========+=============+=============+===============+======+=============+=============+===============+===============+
| cgroups                          | ***   | cg cpu    | cg cpu      | cg cpu      | cg cpu        | **   | cg mysql    | cg mysql    | cg mysql      | cg mysql      |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| sysbench workload                | ***   | cpu       | cpu         | cpu         | cpu           | **   | mysql       | mysql       | mysql         | mysql         |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| record item                      | ***   | Tput_avg  | Tput_stdev% | Tput_avg    | Tput_stdev%   | **   | Tput_avg    | Tput_stdev% | Tput_avg      | Tput_stdev%   |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| Kernel_A(Peter's fairness patch) | ***   |           |             | 1.01        | 4.67%         | **   |             |             | 0.84          | 25.89%        |
+----------------------------------+-------+ 1         + 1.56%       +-------------+---------------+------+ 1           + 3.17%       +---------------+---------------+
| Kernel_B(Aaron's fairness patch) | ***   |           |             | 0.99        | 4.17%         | **   |             |             | 0.89          | 16.44%        |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+


* Re: [RFC PATCH 11/13] sched: migration changes for core scheduling
  2020-03-04 17:00 ` [RFC PATCH 11/13] sched: migration changes for core scheduling vpillai
@ 2020-06-12 13:21   ` Joel Fernandes
  2020-06-12 21:32     ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 110+ messages in thread
From: Joel Fernandes @ 2020-06-12 13:21 UTC (permalink / raw)
  To: vpillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, Aubrey Li, linux-kernel, fweisbec,
	keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li, aubrey.li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Wed, Mar 04, 2020 at 05:00:01PM +0000, vpillai wrote:
> From: Aubrey Li <aubrey.li@intel.com>
> 
>  - Don't migrate if there is a cookie mismatch
>      Load balance tries to move task from busiest CPU to the
>      destination CPU. When core scheduling is enabled, if the
>      task's cookie does not match with the destination CPU's
>      core cookie, this task will be skipped by this CPU. This
>      mitigates the forced idle time on the destination CPU.
> 
>  - Select cookie matched idle CPU
>      In the fast path of task wakeup, select the first cookie matched
>      idle CPU instead of the first idle CPU.
> 
>  - Find cookie matched idlest CPU
>      In the slow path of task wakeup, find the idlest CPU whose core
>      cookie matches with task's cookie
> 
>  - Don't migrate task if cookie not match
>      For the NUMA load balance, don't migrate task to the CPU whose
>      core cookie does not match with task's cookie
> 
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
> ---
>  kernel/sched/fair.c  | 55 +++++++++++++++++++++++++++++++++++++++++---
>  kernel/sched/sched.h | 29 +++++++++++++++++++++++
>  2 files changed, 81 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1c9a80d8dbb8..f42ceecb749f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1789,6 +1789,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>  			continue;
>  
> +#ifdef CONFIG_SCHED_CORE
> +		/*
> +		 * Skip this cpu if source task's cookie does not match
> +		 * with CPU's core cookie.
> +		 */
> +		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
> +			continue;
> +#endif
> +
>  		env->dst_cpu = cpu;
>  		task_numa_compare(env, taskimp, groupimp, maymove);
>  	}
> @@ -5660,8 +5669,13 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>  
>  	/* Traverse only the allowed CPUs */
>  	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
> +		struct rq *rq = cpu_rq(i);
> +
> +#ifdef CONFIG_SCHED_CORE
> +		if (!sched_core_cookie_match(rq, p))
> +			continue;
> +#endif
>  		if (available_idle_cpu(i)) {
> -			struct rq *rq = cpu_rq(i);
>  			struct cpuidle_state *idle = idle_get_state(rq);
>  			if (idle && idle->exit_latency < min_exit_latency) {
>  				/*
> @@ -5927,8 +5941,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  			return si_cpu;
>  		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>  			continue;
> +#ifdef CONFIG_SCHED_CORE
> +		if (available_idle_cpu(cpu) &&
> +		    sched_core_cookie_match(cpu_rq(cpu), p))
> +			break;
> +#else

select_idle_cpu() is called only if no idle core could be found in the LLC by
select_idle_core().

So, would it be better here to just do the cookie equality check directly
instead of calling the sched_core_cookie_match() helper?  More so, because
select_idle_sibling() is a fastpath.

AFAIR, that's what v4 did:

                if (available_idle_cpu(cpu))
#ifdef CONFIG_SCHED_CORE
                        if (sched_core_enabled(cpu_rq(cpu)) &&
                            (p->core_cookie == cpu_rq(cpu)->core->core_cookie))
                                break;
#else
                        break;
#endif


Thoughts? thanks,

 - Joel



* Re: [RFC PATCH 11/13] sched: migration changes for core scheduling
  2020-06-12 13:21   ` Joel Fernandes
@ 2020-06-12 21:32     ` Vineeth Remanan Pillai
  2020-06-13  2:25       ` Joel Fernandes
  0 siblings, 1 reply; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-06-12 21:32 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aubrey Li, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Fri, Jun 12, 2020 at 9:21 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> > +#ifdef CONFIG_SCHED_CORE
> > +             if (available_idle_cpu(cpu) &&
> > +                 sched_core_cookie_match(cpu_rq(cpu), p))
> > +                     break;
> > +#else
>
> select_idle_cpu() is called only if no idle core could be found in the LLC by
> select_idle_core().
>
> So, would it be better here to just do the cookie equality check directly
> instead of calling the sched_core_cookie_match() helper?  More so, because
> select_idle_sibling() is a fastpath.
>
Agree, this makes sense to me.

> AFAIR, that's what v4 did:
>
>                 if (available_idle_cpu(cpu))
> #ifdef CONFIG_SCHED_CORE
>                         if (sched_core_enabled(cpu_rq(cpu)) &&
>                             (p->core_cookie == cpu_rq(cpu)->core->core_cookie))
>                                 break;
> #else
>                         break;
> #endif
>
This patch was initially not in v4 and this is a merging of 4 patches
suggested post-v4. During the initial round, the code was like the above. But since
there appeared to be code duplication in the different migration paths,
it was consolidated into sched_core_cookie_match(), which added this
extra logic to this specific code path. As you mentioned, I also feel
we do not need to check for core idleness in this path.

Thanks,
Vineeth


* Re: [RFC PATCH 11/13] sched: migration changes for core scheduling
  2020-06-12 21:32     ` Vineeth Remanan Pillai
@ 2020-06-13  2:25       ` Joel Fernandes
  2020-06-13 18:59         ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 110+ messages in thread
From: Joel Fernandes @ 2020-06-13  2:25 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aubrey Li, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Fri, Jun 12, 2020 at 05:32:01PM -0400, Vineeth Remanan Pillai wrote:
> > AFAIR, that's what v4 did:
> >
> >                 if (available_idle_cpu(cpu))
> > #ifdef CONFIG_SCHED_CORE
> >                         if (sched_core_enabled(cpu_rq(cpu)) &&
> >                             (p->core_cookie == cpu_rq(cpu)->core->core_cookie))
> >                                 break;
> > #else
> >                         break;
> > #endif
> >
> This patch was initially not in v4 and this is a merging of 4 patches
> suggested post-v4. During the initial round, the code was like the above. But since
> there appeared to be code duplication in the different migration paths,
> it was consolidated into sched_core_cookie_match(), which added this
> extra logic to this specific code path. As you mentioned, I also feel
> we do not need to check for core idleness in this path.

Ok, so I take it that you will make it so in v6 then, unless of course
someone else objects.

thanks!

- Joel



* Re: [RFC PATCH 11/13] sched: migration changes for core scheduling
  2020-06-13  2:25       ` Joel Fernandes
@ 2020-06-13 18:59         ` Vineeth Remanan Pillai
  2020-06-15  2:05           ` Li, Aubrey
  0 siblings, 1 reply; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-06-13 18:59 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aubrey Li, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Fri, Jun 12, 2020 at 10:25 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Ok, so I take it that you will make it so in v6 then, unless of course
> someone else objects.
>
Yes, just wanted to hear from Aubrey, Tim and others as well, to make sure
we have not missed anything obvious. Will have this in v6 if
there are no objections.

Thanks for bringing this up!

~Vineeth


* Re: [RFC PATCH 11/13] sched: migration changes for core scheduling
  2020-06-13 18:59         ` Vineeth Remanan Pillai
@ 2020-06-15  2:05           ` Li, Aubrey
  0 siblings, 0 replies; 110+ messages in thread
From: Li, Aubrey @ 2020-06-15  2:05 UTC (permalink / raw)
  To: Vineeth Remanan Pillai, Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Aubrey Li, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On 2020/6/14 2:59, Vineeth Remanan Pillai wrote:
> On Fri, Jun 12, 2020 at 10:25 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>
>> Ok, so I take it that you will make it so in v6 then, unless of course
>> someone else objects.
>>
> Yes, just wanted to hear from Aubrey, Tim and others as well to see
> if we have not missed anything obvious. Will have this in v6 if
> there are no objections.
> 
> Thanks for bringing this up!
> 
> ~Vineeth
> 
Yes, this makes sense to me; there is no need to find an idle core in select_idle_cpu().
Thanks for catching this!

Thanks,
-Aubrey


* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
                   ` (19 preceding siblings ...)
  2020-05-20 22:48 ` [PATCH RFC] sched: Use sched-RCU in core-scheduling balancing logic Joel Fernandes (Google)
@ 2020-06-25 20:12 ` Vineeth Remanan Pillai
  2020-06-26  1:47   ` Joel Fernandes
  2020-06-29 12:33   ` Li, Aubrey
  20 siblings, 2 replies; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-06-25 20:12 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Linus Torvalds
  Cc: Linux List Kernel Mailing, Frédéric Weisbecker,
	Ingo Molnar, Kees Cook, Thomas Gleixner, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Joel Fernandes, Joel Fernandes,
	Paul Turner

On Wed, Mar 4, 2020 at 12:00 PM vpillai <vpillai@digitalocean.com> wrote:
>
>
> Fifth iteration of the Core-Scheduling feature.
>
It's probably time for another iteration, and we are planning to post v6 based
on this branch:
 https://github.com/digitalocean/linux-coresched/tree/coresched/pre-v6-v5.7.y

Just wanted to share the details about v6 here before posting the patch
series. If there are no objections to the following, we shall be posting
v6 early next week.

The main changes from v5 are the following:
1. Address Peter's comments on v5:
   - Code cleanup
   - Remove fixes related to hotplugging
   - Split out the patch for force-idle starvation
2. Fix for RCU deadlock
3. Core-wide priority comparison minor rework
4. IRQ pause patch
5. Documentation
   - https://github.com/digitalocean/linux-coresched/blob/coresched/pre-v6-v5.7.y/Documentation/admin-guide/hw-vuln/core-scheduling.rst

This version is much leaner compared to v5 due to the removal of hotplug
support. As a result, dynamically enabling/disabling coresched on cpus
when SMT is switched on/off on a core no longer functions. I tried to
reproduce the crashes during hotplug, but could not reproduce them
reliably. The plan is to try to reproduce the crashes with v6, and
document each crashing corner case as we fix it. Previously, we fixed
the issues ad hoc without clear documentation, and the fixes became
complex over time.

TODO lists:

 - Interface discussions could not come to a conclusion in v5, hence we
   would like to restart the discussion and reach a consensus on it.
   - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org

 - Core wide vruntime calculation needs rework:
   - https://lwn.net/ml/linux-kernel/20200506143506.GH5298@hirez.programming.kicks-ass.net

 - Load balancing/migration changes ignore group weights:
   - https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain


Please have a look and let me know your comments/suggestions, or anything I missed.

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-06-25 20:12 ` [RFC PATCH 00/13] Core scheduling v5 Vineeth Remanan Pillai
@ 2020-06-26  1:47   ` Joel Fernandes
  2020-06-26 14:36     ` Vineeth Remanan Pillai
  2020-06-29 12:33   ` Li, Aubrey
  1 sibling, 1 reply; 110+ messages in thread
From: Joel Fernandes @ 2020-06-26  1:47 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Ingo Molnar, Kees Cook,
	Thomas Gleixner, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li, Li,
	Aubrey, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Joel Fernandes, Paul Turner

On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
<vpillai@digitalocean.com> wrote:
[...]
> TODO lists:
>
>  - Interface discussions could not come to a conclusion in v5 and hence would
>    like to restart the discussion and reach a consensus on it.
>    - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org

Thanks Vineeth, just want to add: I have a revised implementation of
prctl(2) where you only pass the TID of a task you'd like to share a
core with (credit to Peter for the idea [1]), so we can make use of
ptrace_may_access() checks. I am currently finishing writing
kselftests for this and will post it all once it is ready.
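To make the semantics concrete, here is a toy model (plain Python, every
name made up; the real permission check would be ptrace_may_access() and
the real entry point a prctl(2) command, neither of which this sketch
pretends to implement):

```python
# Toy model of the TID-based sharing request described above.
# All names are illustrative, not the actual kernel API.
import itertools

_tids = itertools.count(100)
_cookies = itertools.count(1)

class Task:
    def __init__(self, uid):
        self.tid = next(_tids)
        self.uid = uid
        self.core_cookie = 0          # 0 == untagged: may share with anyone

def may_access(caller, target):
    # Stand-in for ptrace_may_access(); here simply: same uid.
    return caller.uid == target.uid

def prctl_share_core(caller, target):
    """Caller asks to share a core with target (prctl-style request)."""
    if not may_access(caller, target):
        return -1                     # would be -EPERM in the kernel
    if target.core_cookie == 0:
        target.core_cookie = next(_cookies)   # mint a cookie for the pair
    caller.core_cookie = target.core_cookie
    return 0

a, b = Task(uid=1000), Task(uid=1000)
other = Task(uid=2000)
assert prctl_share_core(a, b) == 0
assert a.core_cookie == b.core_cookie != 0    # a and b may now share a core
assert prctl_share_core(other, a) == -1       # fails the access check
```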

However, a question: if using prctl(2) on a CGroup-tagged task, we
discussed in previous threads [2] overriding the CGroup cookie such
that the task may not share a core with any of the tasks in its CGroup
anymore, and I think Peter and Phil are Ok with that. My question
though is: would that not be confusing for anyone looking at the
CGroup filesystem's "tag" and "tasks" files?

To resolve this, I am proposing to add a new CGroup file
'tasks.coresched' to the CGroup, and this will only contain tasks that
were assigned cookies due to their CGroup residency. As soon as one
prctl(2)'s the task, it will stop showing up in the CGroup's
"tasks.coresched" file (unless of course it was requesting to
prctl-share a core with someone in its CGroup itself). Are folks Ok
with this solution?

[1]  https://lore.kernel.org/lkml/20200528170128.GN2483@worktop.programming.kicks-ass.net/
[2] https://lore.kernel.org/lkml/20200524140046.GA5598@lorien.usersys.redhat.com/

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-06-26  1:47   ` Joel Fernandes
@ 2020-06-26 14:36     ` Vineeth Remanan Pillai
  2020-06-26 15:10       ` Joel Fernandes
  0 siblings, 1 reply; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-06-26 14:36 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Ingo Molnar, Kees Cook,
	Thomas Gleixner, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li, Li,
	Aubrey, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Joel Fernandes, Paul Turner

On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> <vpillai@digitalocean.com> wrote:
> [...]
> > TODO lists:
> >
> >  - Interface discussions could not come to a conclusion in v5 and hence would
> >    like to restart the discussion and reach a consensus on it.
> >    - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org
>
> Thanks Vineeth, just want to add: I have a revised implementation of
> prctl(2) where you only pass a TID of a task you'd like to share a
> core with (credit to Peter for the idea [1]) so we can make use of
> ptrace_may_access() checks. I am currently finishing writing of
> kselftests for this and post it all once it is ready.
>
Thinking more about it, using a TID/PID for prctl(2) and internally
using a task identifier to identify the coresched group may have
limitations. A coresched group can exist longer than the lifetime
of a task, and then there is a chance for that identifier to be
reused by a newer task which may or may not be a part of the same
coresched group.
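A toy illustration (plain Python, names and values made up) of the
reuse hazard:

```python
# If the coresched group is keyed by a task identifier, a recycled TID
# can drag an unrelated new task into the old group.
group_cookie_by_tid = {}

def tag_by_tid(tid):
    group_cookie_by_tid[tid] = object()   # unique cookie per group
    return group_cookie_by_tid[tid]

old_cookie = tag_by_tid(4242)   # a task tagged by its TID
# ... that task exits, but the group (and its mapping) outlives it;
# the kernel later hands TID 4242 to a completely unrelated task ...
recycled_tid = 4242
# A lookup by TID now wrongly places the new task in the old group:
assert group_cookie_by_tid[recycled_tid] is old_cookie
```

A group identifier of its own, with a task-to-group mapping, avoids
keying anything on a reusable TID, which is what leads to the cgroup
suggestion below.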

A way to overcome this is to have a coresched group with a separate
identifier implemented internally, and a mapping from task to group.
And the cgroup framework provides exactly that.

I feel we could use prctl for isolating individual tasks/processes
and use grouping frameworks like cgroup for core scheduling groups.
The cpu cgroup might not be a good idea as it has its own purpose.
Users might not always want a group of trusted tasks in the same cpu
cgroup. Or all the processes in an existing cpu cgroup might not be
mutually trusted either.

What do you think about having a separate cgroup for coresched?
Both coresched cgroup and prctl() could co-exist where prctl could
be used to isolate individual process or task and coresched cgroup
to group trusted processes.

> However a question: If using the prctl(2) on a CGroup tagged task, we
> discussed in previous threads [2] to override the CGroup cookie such
> that the task may not share a core with any of the tasks in its CGroup
> anymore and I think Peter and Phil are Ok with.  My question though is
> - would that not be confusing for anyone looking at the CGroup
> filesystem's "tag" and "tasks" files?
>
Having a dedicated cgroup for coresched could solve this problem
as well. "coresched.tasks" inside the cgroup hierarchy would list all
the tasks in the group, and prctl can override this and take a task
out of the group.
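A toy sketch (plain Python; the "coresched.tasks" semantics here are
just how I imagine them, not a real interface) of how the cgroup
listing and the prctl override could interact:

```python
# Tasks tagged via the cgroup share its cookie and appear in the
# listing; a prctl-style override moves a task onto its own cookie
# and out of the listing. All names are illustrative.
class Task:
    def __init__(self, name):
        self.name, self.cookie = name, None

class CoreschedCgroup:
    def __init__(self):
        self.cookie = object()        # one cookie per tagged cgroup
        self.members = set()

    def attach(self, task):
        self.members.add(task)
        task.cookie = self.cookie

    def coresched_tasks(self):
        # Only tasks whose cookie still comes from cgroup residency.
        return {t for t in self.members if t.cookie is self.cookie}

def prctl_isolate(task):
    # prctl-style override: the task gets a private cookie.
    task.cookie = object()

cg = CoreschedCgroup()
t1, t2 = Task("t1"), Task("t2")
cg.attach(t1); cg.attach(t2)
assert cg.coresched_tasks() == {t1, t2}

prctl_isolate(t1)                     # take t1 out of the group
assert cg.coresched_tasks() == {t2}
assert t1.cookie is not t2.cookie     # t1 no longer core-shares with t2
```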

> To resolve this, I am proposing to add a new CGroup file
> 'tasks.coresched' to the CGroup, and this will only contain tasks that
> were assigned cookies due to their CGroup residency. As soon as one
> prctl(2)'s the task, it will stop showing up in the CGroup's
> "tasks.coresched" file (unless of course it was requesting to
> prctl-share a core with someone in its CGroup itself). Are folks Ok
> with this solution?
>
As I mentioned above, IMHO cpu cgroups should not be used to account
for core scheduling either. Cpu cgroups serve a different purpose,
and overloading them with core scheduling would not be flexible or
scalable. But if there is a consensus to move forward with cpu
cgroups, adding this new file seems okay to me.

Thoughts/suggestions/concerns?

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 12/13] sched: cgroup tagging interface for core scheduling
  2020-03-04 17:00 ` [RFC PATCH 12/13] sched: cgroup tagging interface " vpillai
@ 2020-06-26 15:06   ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-06-26 15:06 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds
  Cc: Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li, Li, Aubrey,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Joel Fernandes, Joel Fernandes

On Wed, Mar 4, 2020 at 12:00 PM vpillai <vpillai@digitalocean.com> wrote:
>
>
> Marks all tasks in a cgroup as matching for core-scheduling.
>
> A task will need to be moved into the core scheduler queue when the cgroup
> it belongs to is tagged to run with core scheduling.  Similarly the task
> will need to be moved out of the core scheduler queue when the cgroup
> is untagged.
>
> Also after we forked a task, its core scheduler queue's presence will
> need to be updated according to its new cgroup's status.
>
This came up during a private discussion with Joel, and thanks to him
for bringing it up! Details below.

> @@ -7910,7 +7986,12 @@ static void cpu_cgroup_fork(struct task_struct *task)
>         rq = task_rq_lock(task, &rf);
>
>         update_rq_clock(rq);
> +       if (sched_core_enqueued(task))
> +               sched_core_dequeue(rq, task);
A newly created task will not be enqueued yet, so do we need this
here?

>         sched_change_group(task, TASK_SET_GROUP);
> +       if (sched_core_enabled(rq) && task_on_rq_queued(task) &&
> +           task->core_cookie)
> +               sched_core_enqueue(rq, task);
>
Do we need this here? Soon after this, wake_up_new_task() is called,
which will ultimately call enqueue_task() and add the task to the
coresched rbtree, so we would be trying to enqueue twice. Also, this
code will not really enqueue, because task_on_rq_queued() would
return false at this point (activate_task() has not yet been called
for this new task).

I am not sure if I missed any other code path reaching here that
does not proceed with wake_up_new_task(). Please let me know if I
missed anything here.
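A toy model (plain Python, names made up) of the sequence as I
understand it; it shows why the fork-time hunk would be a no-op and
the real enqueue happens from wake_up_new_task():

```python
# Simplified stand-ins for the kernel paths discussed above; the
# coresched rbtree is modeled as a set.
class Task:
    def __init__(self):
        self.on_rq = False            # activate_task() not yet called
        self.core_cookie = 1          # task is tagged

core_tree = set()

def fork_path_enqueue(task):
    # The hunk in cpu_cgroup_fork(), guarded by task_on_rq_queued():
    if task.on_rq and task.core_cookie:
        core_tree.add(task)

def wake_up_new_task(task):
    task.on_rq = True                 # activate_task()
    core_tree.add(task)               # enqueue_task() -> sched_core_enqueue()

t = Task()
fork_path_enqueue(t)
assert t not in core_tree             # guard makes the fork-time enqueue a no-op
wake_up_new_task(t)
assert t in core_tree                 # the real enqueue happens here
```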

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-06-26 14:36     ` Vineeth Remanan Pillai
@ 2020-06-26 15:10       ` Joel Fernandes
  2020-06-26 15:12         ` Joel Fernandes
                           ` (2 more replies)
  0 siblings, 3 replies; 110+ messages in thread
From: Joel Fernandes @ 2020-06-26 15:10 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Ingo Molnar, Kees Cook,
	Thomas Gleixner, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li, Li,
	Aubrey, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Paul Turner

On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> > <vpillai@digitalocean.com> wrote:
> > [...]
> > > TODO lists:
> > >
> > >  - Interface discussions could not come to a conclusion in v5 and hence would
> > >    like to restart the discussion and reach a consensus on it.
> > >    - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org
> >
> > Thanks Vineeth, just want to add: I have a revised implementation of
> > prctl(2) where you only pass a TID of a task you'd like to share a
> > core with (credit to Peter for the idea [1]) so we can make use of
> > ptrace_may_access() checks. I am currently finishing writing of
> > kselftests for this and post it all once it is ready.
> >
> Thinking more about it, using TID/PID for prctl(2) and internally
> using a task identifier to identify coresched group may have
> limitations. A coresched group can exist longer than the lifetime
> of a task and then there is a chance for that identifier to be
> reused by a newer task which may or maynot be a part of the same
> coresched group.

True, for the prctl(2) tagging (a task wanting to share a core with
another) we will need some way of internally identifying groups that
does not depend on any value that can be reused for another purpose.

[..]
> What do you think about having a separate cgroup for coresched?
> Both coresched cgroup and prctl() could co-exist where prctl could
> be used to isolate individual process or task and coresched cgroup
> to group trusted processes.

This sounds like a fine idea to me. I wonder how Tejun and Peter feel
about having a new attribute-less CGroup controller for core
scheduling and just using that for tagging. (No need to even have a
tag file; just adding a task to or removing it from the CGroup will
tag it.)

> > However a question: If using the prctl(2) on a CGroup tagged task, we
> > discussed in previous threads [2] to override the CGroup cookie such
> > that the task may not share a core with any of the tasks in its CGroup
> > anymore and I think Peter and Phil are Ok with.  My question though is
> > - would that not be confusing for anyone looking at the CGroup
> > filesystem's "tag" and "tasks" files?
> >
> Having a dedicated cgroup for coresched could solve this problem
> as well. "coresched.tasks" inside the cgroup hierarchy would list all
> the taskx in the group and prctl can override this and take it out
> of the group.

We don't even need coresched.tasks; the existing 'tasks' file of
CGroups can be used.

> > To resolve this, I am proposing to add a new CGroup file
> > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > were assigned cookies due to their CGroup residency. As soon as one
> > prctl(2)'s the task, it will stop showing up in the CGroup's
> > "tasks.coresched" file (unless of course it was requesting to
> > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > with this solution?
> >
> As I mentioned above, IMHO cpu cgroups should not be used to account
> for core scheduling as well. Cpu cgroups serve a different purpose
> and overloading it with core scheduling would not be flexible and
> scalable. But if there is a consensus to move forward with cpu cgroups,
> adding this new file seems to be okay with me.

Yes, this is the problem. Many people already use CPU controller
CGroups for other purposes. In that case, tagging a CGroup would make
all the entities in the group able to share a core, which may not
always make sense. Maybe a new CGroup controller is the answer?

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-06-26 15:10       ` Joel Fernandes
@ 2020-06-26 15:12         ` Joel Fernandes
  2020-06-27 16:21         ` Joel Fernandes
  2020-06-30 14:11         ` Phil Auld
  2 siblings, 0 replies; 110+ messages in thread
From: Joel Fernandes @ 2020-06-26 15:12 UTC (permalink / raw)
  To: Vineeth Remanan Pillai, tj
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Ingo Molnar, Kees Cook,
	Thomas Gleixner, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li, Li,
	Aubrey, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Paul Turner

On Fri, Jun 26, 2020 at 11:10:28AM -0400, Joel Fernandes wrote:
> On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> > On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> > > <vpillai@digitalocean.com> wrote:
> > > [...]
> > > > TODO lists:
> > > >
> > > >  - Interface discussions could not come to a conclusion in v5 and hence would
> > > >    like to restart the discussion and reach a consensus on it.
> > > >    - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org
> > >
> > > Thanks Vineeth, just want to add: I have a revised implementation of
> > > prctl(2) where you only pass a TID of a task you'd like to share a
> > > core with (credit to Peter for the idea [1]) so we can make use of
> > > ptrace_may_access() checks. I am currently finishing writing of
> > > kselftests for this and post it all once it is ready.
> > >
> > Thinking more about it, using TID/PID for prctl(2) and internally
> > using a task identifier to identify coresched group may have
> > limitations. A coresched group can exist longer than the lifetime
> > of a task and then there is a chance for that identifier to be
> > reused by a newer task which may or maynot be a part of the same
> > coresched group.
> 
> True, for the prctl(2) tagging (a task wanting to share core with
> another) we will need some way of internally identifying groups which does
> not depend on any value that can be reused for another purpose.
> 
> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist where prctl could
> > be used to isolate individual process or task and coresched cgroup
> > to group trusted processes.
> 
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> use that for tagging. (No need to even have a tag file, just adding/removing
> to/from CGroup will tag).

+Tejun

thanks,

 - Joel


> > > However a question: If using the prctl(2) on a CGroup tagged task, we
> > > discussed in previous threads [2] to override the CGroup cookie such
> > > that the task may not share a core with any of the tasks in its CGroup
> > > anymore and I think Peter and Phil are Ok with.  My question though is
> > > - would that not be confusing for anyone looking at the CGroup
> > > filesystem's "tag" and "tasks" files?
> > >
> > Having a dedicated cgroup for coresched could solve this problem
> > as well. "coresched.tasks" inside the cgroup hierarchy would list all
> > the taskx in the group and prctl can override this and take it out
> > of the group.
> 
> We don't even need coresched.tasks, just the existing 'tasks' of CGroups can
> be used.
> 
> > > To resolve this, I am proposing to add a new CGroup file
> > > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > > were assigned cookies due to their CGroup residency. As soon as one
> > > prctl(2)'s the task, it will stop showing up in the CGroup's
> > > "tasks.coresched" file (unless of course it was requesting to
> > > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > > with this solution?
> > >
> > As I mentioned above, IMHO cpu cgroups should not be used to account
> > for core scheduling as well. Cpu cgroups serve a different purpose
> > and overloading it with core scheduling would not be flexible and
> > scalable. But if there is a consensus to move forward with cpu cgroups,
> > adding this new file seems to be okay with me.
> 
> Yes, this is the problem. Many people use CPU controller CGroups already for
> other purposes. In that case, tagging a CGroup would make all the entities in
> the group be able to share a core, which may not always make sense. May be a
> new CGroup controller is the answer (?).
> 
> thanks,
> 
>  - Joel
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-06-26 15:10       ` Joel Fernandes
  2020-06-26 15:12         ` Joel Fernandes
@ 2020-06-27 16:21         ` Joel Fernandes
  2020-06-30 14:11         ` Phil Auld
  2 siblings, 0 replies; 110+ messages in thread
From: Joel Fernandes @ 2020-06-27 16:21 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Ingo Molnar, Kees Cook,
	Thomas Gleixner, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li, Li,
	Aubrey, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Paul Turner

On Fri, Jun 26, 2020 at 11:10 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist where prctl could
> > be used to isolate individual process or task and coresched cgroup
> > to group trusted processes.
>
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> use that for tagging. (No need to even have a tag file, just adding/removing
> to/from CGroup will tag).

Unless there are any major objections to this idea, or better ideas
for CGroup users, we will consider proposing a new CGroup controller
for this. The issue with CPU controller CGroups is that they may be
configured in a way that is incompatible with tagging.

And I was also thinking of a new clone flag, CLONE_CORE (which would
allow a child to share a parent's core). This is because the fork
semantics are not clear, and IMHO it may be better to leave the fork
behavior to userspace than to hard-code policy in the kernel.

Perhaps we can also discuss this at the scheduler MC at Plumbers.

Any other thoughts?

 - Joel

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-06-25 20:12 ` [RFC PATCH 00/13] Core scheduling v5 Vineeth Remanan Pillai
  2020-06-26  1:47   ` Joel Fernandes
@ 2020-06-29 12:33   ` Li, Aubrey
  2020-06-29 19:41     ` Vineeth Remanan Pillai
  1 sibling, 1 reply; 110+ messages in thread
From: Li, Aubrey @ 2020-06-29 12:33 UTC (permalink / raw)
  To: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Linus Torvalds
  Cc: Linux List Kernel Mailing, Frédéric Weisbecker,
	Ingo Molnar, Kees Cook, Thomas Gleixner, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Joel Fernandes, Joel Fernandes, Paul Turner

Hi Vineeth,

On 2020/6/26 4:12, Vineeth Remanan Pillai wrote:
> On Wed, Mar 4, 2020 at 12:00 PM vpillai <vpillai@digitalocean.com> wrote:
>>
>>
>> Fifth iteration of the Core-Scheduling feature.
>>
> Its probably time for an iteration and We are planning to post v6 based
> on this branch:
>  https://github.com/digitalocean/linux-coresched/tree/coresched/pre-v6-v5.7.y
> 
> Just wanted to share the details about v6 here before posting the patch
> series. If there is no objection to the following, we shall be posting
> the v6 early next week.
> 
> The main changes from v6 are the following:
> 1. Address Peter's comments in v5
>    - Code cleanup
>    - Remove fixes related to hotplugging.
>    - Split the patch out for force idling starvation
> 3. Fix for RCU deadlock
> 4. core wide priority comparison minor re-work.
> 5. IRQ Pause patch
> 6. Documentation
>    - https://github.com/digitalocean/linux-coresched/blob/coresched/pre-v6-v5.7.y/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> 
> This version is much leaner compared to v5 due to the removal of hotplug
> support. As a result, dynamic coresched enable/disable on cpus due to
> smt on/off on the core do not function anymore. I tried to reproduce the
> crashes during hotplug, but could not reproduce reliably. The plan is to
> try to reproduce the crashes with v6, and document each corner case for crashes
> as we fix those. Previously, we randomly fixed the issues without a clear
> documentation and the fixes became complex over time.
> 
> TODO lists:
> 
>  - Interface discussions could not come to a conclusion in v5 and hence would
>    like to restart the discussion and reach a consensus on it.
>    - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org
> 
>  - Core wide vruntime calculation needs rework:
>    - https://lwn.net/ml/linux-kernel/20200506143506.GH5298@hirez.programming.kicks-ass.net
> 
>  - Load balancing/migration changes ignores group weights:
>    - https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain

According to Aaron's response below:
https://lwn.net/ml/linux-kernel/20200305085231.GA12108@ziqianlu-desktop.localdomain/

The following logic seems to be helpful for Aaron's case.

+	/*
+	 * Ignore cookie match if there is a big imbalance between the src rq
+	 * and dst rq.
+	 */
+	if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
+		return true;

I didn't see any other comments on the patch here:
https://lwn.net/ml/linux-kernel/67e46f79-51c2-5b69-71c6-133ec10b68c4@linux.intel.com/

Do we have another way to address this issue?

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-06-29 12:33   ` Li, Aubrey
@ 2020-06-29 19:41     ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 110+ messages in thread
From: Vineeth Remanan Pillai @ 2020-06-29 19:41 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Ingo Molnar, Kees Cook,
	Thomas Gleixner, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
	Joel Fernandes, Joel Fernandes, Paul Turner

Hi Aubrey,

On Mon, Jun 29, 2020 at 8:34 AM Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>
> >  - Load balancing/migration changes ignores group weights:
> >    - https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain
>
> According to Aaron's response below:
> https://lwn.net/ml/linux-kernel/20200305085231.GA12108@ziqianlu-desktop.localdomain/
>
> The following logic seems to be helpful for Aaron's case.
>
> +       /*
> +        * Ignore cookie match if there is a big imbalance between the src rq
> +        * and dst rq.
> +        */
> +       if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
> +               return true;
>
> I didn't see any other comments on the patch at here:
> https://lwn.net/ml/linux-kernel/67e46f79-51c2-5b69-71c6-133ec10b68c4@linux.intel.com/
>
> Do we have another way to address this issue?
>
We do not have a clear fix for this yet, and have not had much time
to work on it.

I feel that the above change would not fix the real issue.
The issue is that we do not consider the weight of the group when we
try to load balance; the above change checks only nr_running, which
might not always work. I feel that we should fix the real issue in
v6 and probably hold off on adding the workaround fix in the interim.
I have added a TODO specifically for this bug in v6.
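To make this concrete, a toy counter-example (plain Python, numbers
made up): two runqueues with a small h_nr_running delta can still
differ a lot in weight, so the proposed check would declare them
balanced while the load says otherwise.

```python
# Each runqueue is modeled as a list of per-task load weights.
NICE_0_LOAD = 1024

def rq_load(tasks):
    # Sum of the group/task weights on the runqueue.
    return sum(tasks)

src = [NICE_0_LOAD, NICE_0_LOAD]        # two normal-weight tasks
dst = [NICE_0_LOAD // 10] * 2           # two low-weight tasks

# The nr_running-based imbalance check sees no big imbalance:
assert len(src) - len(dst) <= 1
# ... but the weighted load differs by an order of magnitude:
assert rq_load(src) > 5 * rq_load(dst)
```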

What do you think?

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 00/13] Core scheduling v5
  2020-06-26 15:10       ` Joel Fernandes
  2020-06-26 15:12         ` Joel Fernandes
  2020-06-27 16:21         ` Joel Fernandes
@ 2020-06-30 14:11         ` Phil Auld
  2 siblings, 0 replies; 110+ messages in thread
From: Phil Auld @ 2020-06-30 14:11 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Ingo Molnar, Kees Cook, Thomas Gleixner, Greg Kerr, Aaron Lu,
	Aubrey Li, Li, Aubrey, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Paul Turner

On Fri, Jun 26, 2020 at 11:10:28AM -0400 Joel Fernandes wrote:
> On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> > On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> > > <vpillai@digitalocean.com> wrote:
> > > [...]
> > > > TODO lists:
> > > >
> > > >  - Interface discussions could not come to a conclusion in v5 and hence would
> > > >    like to restart the discussion and reach a consensus on it.
> > > >    - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-joel@joelfernandes.org
> > >
> > > Thanks Vineeth, just want to add: I have a revised implementation of
> > > prctl(2) where you only pass a TID of a task you'd like to share a
> > > core with (credit to Peter for the idea [1]) so we can make use of
> > > ptrace_may_access() checks. I am currently finishing writing of
> > > kselftests for this and post it all once it is ready.
> > >
> > Thinking more about it, using TID/PID for prctl(2) and internally
> > using a task identifier to identify coresched group may have
> > limitations. A coresched group can exist longer than the lifetime
> > of a task and then there is a chance for that identifier to be
> > reused by a newer task which may or maynot be a part of the same
> > coresched group.
> 
> True, for the prctl(2) tagging (a task wanting to share core with
> another) we will need some way of internally identifying groups which does
> not depend on any value that can be reused for another purpose.
>

That was my concern as well. That's why I was thinking it should be
an arbitrary, user/admin/orchestrator-defined value and not be the
responsibility of the kernel at all.  However...


> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist where prctl could
> > be used to isolate individual process or task and coresched cgroup
> > to group trusted processes.
> 
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> use that for tagging. (No need to even have a tag file, just adding/removing
> to/from CGroup will tag).
>

... this could be an interesting approach. Then the cookie could
still be the cgroup address as-is, and there would be no need for the
prctl. At least so it seems.



Cheers,
Phil

> > > However a question: If using the prctl(2) on a CGroup tagged task, we
> > > discussed in previous threads [2] to override the CGroup cookie such
> > > that the task may not share a core with any of the tasks in its CGroup
> > > anymore and I think Peter and Phil are Ok with.  My question though is
> > > - would that not be confusing for anyone looking at the CGroup
> > > filesystem's "tag" and "tasks" files?
> > >
> > Having a dedicated cgroup for coresched could solve this problem
> > as well. "coresched.tasks" inside the cgroup hierarchy would list all
> > the taskx in the group and prctl can override this and take it out
> > of the group.
> 
> We don't even need coresched.tasks, just the existing 'tasks' of CGroups can
> be used.
> 
> > > To resolve this, I am proposing to add a new CGroup file
> > > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > > were assigned cookies due to their CGroup residency. As soon as one
> > > prctl(2)'s the task, it will stop showing up in the CGroup's
> > > "tasks.coresched" file (unless of course it was requesting to
> > > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > > with this solution?
> > >
> > As I mentioned above, IMHO cpu cgroups should not be used to account
> > for core scheduling as well. Cpu cgroups serve a different purpose
> > and overloading it with core scheduling would not be flexible and
> > scalable. But if there is a consensus to move forward with cpu cgroups,
> > adding this new file seems to be okay with me.
> 
> Yes, this is the problem. Many people already use CPU controller CGroups for
> other purposes. In that case, tagging a CGroup would allow all the entities
> in the group to share a core, which may not always make sense. Maybe a new
> CGroup controller is the answer(?).
> 
> thanks,
> 
>  - Joel
> 

-- 



end of thread, other threads:[~2020-06-30 14:12 UTC | newest]

Thread overview: 110+ messages
2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
2020-03-04 16:59 ` [RFC PATCH 01/13] sched: Wrap rq::lock access vpillai
2020-03-04 16:59 ` [RFC PATCH 02/13] sched: Introduce sched_class::pick_task() vpillai
2020-03-04 16:59 ` [RFC PATCH 03/13] sched: Core-wide rq->lock vpillai
2020-04-01 11:42   ` [PATCH] sched/arm64: store cpu topology before notify_cpu_starting Cheng Jian
2020-04-01 13:23     ` Valentin Schneider
2020-04-06  8:00       ` chengjian (D)
2020-04-09  9:59       ` Sudeep Holla
2020-04-09 10:32         ` Valentin Schneider
2020-04-09 11:08           ` Sudeep Holla
2020-04-09 17:54     ` Joel Fernandes
2020-04-10 13:49       ` chengjian (D)
2020-04-14 11:36   ` [RFC PATCH 03/13] sched: Core-wide rq->lock Peter Zijlstra
2020-04-14 21:35     ` Vineeth Remanan Pillai
2020-04-15 10:55       ` Peter Zijlstra
2020-04-14 14:32   ` Peter Zijlstra
2020-03-04 16:59 ` [RFC PATCH 04/13] sched/fair: Add a few assertions vpillai
2020-03-04 16:59 ` [RFC PATCH 05/13] sched: Basic tracking of matching tasks vpillai
2020-03-04 16:59 ` [RFC PATCH 06/13] sched: Update core scheduler queue when taking cpu online/offline vpillai
2020-03-04 16:59 ` [RFC PATCH 07/13] sched: Add core wide task selection and scheduling vpillai
2020-04-14 13:35   ` Peter Zijlstra
2020-04-16 23:32     ` Tim Chen
2020-04-17 10:57       ` Peter Zijlstra
2020-04-16  3:39   ` Chen Yu
2020-04-16 19:59     ` Vineeth Remanan Pillai
2020-04-17 11:18     ` Peter Zijlstra
2020-04-19 15:31       ` Chen Yu
2020-05-21 23:14   ` Joel Fernandes
2020-05-21 23:16     ` Joel Fernandes
2020-05-22  2:35     ` Joel Fernandes
2020-05-22  3:44       ` Aaron Lu
2020-05-22 20:13         ` Joel Fernandes
2020-03-04 16:59 ` [RFC PATCH 08/13] sched/fair: wrapper for cfs_rq->min_vruntime vpillai
2020-03-04 16:59 ` [RFC PATCH 09/13] sched/fair: core wide vruntime comparison vpillai
2020-04-14 13:56   ` Peter Zijlstra
2020-04-15  3:34     ` Aaron Lu
2020-04-15  4:07       ` Aaron Lu
2020-04-15 21:24         ` Vineeth Remanan Pillai
2020-04-17  9:40           ` Aaron Lu
2020-04-20  8:07             ` [PATCH updated] sched/fair: core wide cfs task priority comparison Aaron Lu
2020-04-20 22:26               ` Vineeth Remanan Pillai
2020-04-21  2:51                 ` Aaron Lu
2020-04-24 14:24                   ` [PATCH updated v2] " Aaron Lu
2020-05-06 14:35                     ` Peter Zijlstra
2020-05-08  8:44                       ` Aaron Lu
2020-05-08  9:09                         ` Peter Zijlstra
2020-05-08 12:34                           ` Aaron Lu
2020-05-14 13:02                             ` Peter Zijlstra
2020-05-14 22:51                               ` Vineeth Remanan Pillai
2020-05-15 10:38                                 ` Peter Zijlstra
2020-05-15 10:43                                   ` Peter Zijlstra
2020-05-15 14:24                                   ` Vineeth Remanan Pillai
2020-05-16  3:42                               ` Aaron Lu
2020-05-22  9:40                                 ` Aaron Lu
2020-06-08  1:41                               ` Ning, Hongyu
2020-03-04 17:00 ` [RFC PATCH 10/13] sched: Trivial forced-newidle balancer vpillai
2020-03-04 17:00 ` [RFC PATCH 11/13] sched: migration changes for core scheduling vpillai
2020-06-12 13:21   ` Joel Fernandes
2020-06-12 21:32     ` Vineeth Remanan Pillai
2020-06-13  2:25       ` Joel Fernandes
2020-06-13 18:59         ` Vineeth Remanan Pillai
2020-06-15  2:05           ` Li, Aubrey
2020-03-04 17:00 ` [RFC PATCH 12/13] sched: cgroup tagging interface " vpillai
2020-06-26 15:06   ` Vineeth Remanan Pillai
2020-03-04 17:00 ` [RFC PATCH 13/13] sched: Debug bits vpillai
2020-03-04 17:36 ` [RFC PATCH 00/13] Core scheduling v5 Tim Chen
2020-03-04 17:42   ` Vineeth Remanan Pillai
2020-04-14 14:21 ` Peter Zijlstra
2020-04-15 16:32   ` Joel Fernandes
2020-04-17 11:12     ` Peter Zijlstra
2020-04-17 12:35       ` Alexander Graf
2020-04-17 13:08         ` Peter Zijlstra
2020-04-18  2:25       ` Joel Fernandes
2020-05-09 14:35   ` Dario Faggioli
     [not found] ` <38805656-2e2f-222a-c083-692f4b113313@linux.intel.com>
2020-05-09  3:39   ` Ning, Hongyu
2020-05-14 20:51     ` FW: " Gruza, Agata
2020-05-10 23:46 ` [PATCH RFC] Add support for core-wide protection of IRQ and softirq Joel Fernandes (Google)
2020-05-11 13:49   ` Peter Zijlstra
2020-05-11 14:54     ` Joel Fernandes
2020-05-20 22:26 ` [PATCH RFC] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
2020-05-21  4:09   ` [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail) benbjiang(蒋彪)
2020-05-21 13:49     ` Joel Fernandes
2020-05-21  8:51   ` [PATCH RFC] sched: Add a per-thread core scheduling interface Peter Zijlstra
2020-05-21 13:47     ` Joel Fernandes
2020-05-21 20:20       ` Vineeth Remanan Pillai
2020-05-22 12:59       ` Peter Zijlstra
2020-05-22 21:35         ` Joel Fernandes
2020-05-24 14:00           ` Phil Auld
2020-05-28 14:51             ` Joel Fernandes
2020-05-28 17:01             ` Peter Zijlstra
2020-05-28 18:17               ` Phil Auld
2020-05-28 18:34                 ` Phil Auld
2020-05-28 18:23               ` Joel Fernandes
2020-05-21 18:31   ` Linus Torvalds
2020-05-21 20:40     ` Joel Fernandes
2020-05-21 21:58       ` Jesse Barnes
2020-05-22 16:33         ` Linus Torvalds
2020-05-20 22:37 ` [PATCH RFC v2] Add support for core-wide protection of IRQ and softirq Joel Fernandes (Google)
2020-05-20 22:48 ` [PATCH RFC] sched: Use sched-RCU in core-scheduling balancing logic Joel Fernandes (Google)
2020-05-21 22:52   ` Paul E. McKenney
2020-05-22  1:26     ` Joel Fernandes
2020-06-25 20:12 ` [RFC PATCH 00/13] Core scheduling v5 Vineeth Remanan Pillai
2020-06-26  1:47   ` Joel Fernandes
2020-06-26 14:36     ` Vineeth Remanan Pillai
2020-06-26 15:10       ` Joel Fernandes
2020-06-26 15:12         ` Joel Fernandes
2020-06-27 16:21         ` Joel Fernandes
2020-06-30 14:11         ` Phil Auld
2020-06-29 12:33   ` Li, Aubrey
2020-06-29 19:41     ` Vineeth Remanan Pillai
